Wednesday, December 14, 2011

Data-Driven Language Design

One recurring issue in the evolution of programming languages is the tension between feature creep and stagnation. Some programming languages manage to evolve very rapidly, incorporating all kinds of features in response to developer requests or the language architect's direction. Brian Goetz has an interesting article on DeveloperWorks which argues for using "corpus analysis", the study of how frequently language constructs are actually used in a corpus of source code, to guide the evolution of a language. In his article, he describes two examples, precise exception rethrow semantics and type inference, where corpus analysis played a role in determining how Java evolved. When I was working on SML/NJ, we informally used this technique to choose the best design for certain type checking/inference and module system elements.
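To make the idea concrete, here is a minimal sketch of what counting construct frequencies might look like. It is not the tooling Goetz describes, and it analyzes Python source rather than Java simply because Python's standard `ast` module makes the example self-contained. The function name `construct_frequencies` and the three-snippet corpus are made up for illustration; a real study would walk thousands of files.

```python
import ast
from collections import Counter


def construct_frequencies(sources):
    """Count how often each syntactic construct (AST node type) appears.

    `sources` is an iterable of source-code strings. This is a hypothetical
    helper for illustration, not part of any existing corpus-analysis tool.
    """
    counts = Counter()
    for source in sources:
        tree = ast.parse(source)
        # ast.walk visits every node in the tree, in no particular order.
        for node in ast.walk(tree):
            counts[type(node).__name__] += 1
    return counts


# A tiny stand-in corpus (assumption: real corpora are far larger).
corpus = [
    "try:\n    f()\nexcept ValueError:\n    raise\n",
    "for x in xs:\n    print(x)\n",
    "try:\n    g()\nexcept Exception as e:\n    raise\n",
]

freq = construct_frequencies(corpus)
for name in ("Try", "Raise", "For", "While"):
    print(name, freq[name])
```

A language designer could use counts like these to ask, for example, how often a bare re-raise appears inside an exception handler before deciding whether that pattern deserves special treatment.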

Coming from a background deeply rooted in the idea that rigorous semantics and type-theoretic formalisms are the key foundations of good (and sound) programming language design, this notion of "quantitative" programming language design has struck a chord with me. If programming language designers can apply corpus analysis consistently and rigorously, it may benefit the evolution of programming languages.

A big chasm in the programming language research community is the divide between two of its main research conferences, Programming Language Design and Implementation (PLDI) and the Symposium on Principles of Programming Languages (POPL). PLDI is the more empirical of the two, being the successor to the old ACM Compiler Construction and Optimization conferences. Papers which get into PLDI invariably contain benchmarks, performance comparisons, measurements of execution times, memory utilization, and the like. The quality of the metrics varies from paper to paper, but every paper must contain something. In contrast, POPL papers do not normally contain such measurements. POPL papers emphasize the theoretical underpinnings, formalizations, and proofs of correctness behind programming languages and related systems.

Corpus analysis is clearly empirical, but it makes me wonder whether there is some more scientific approach to programming language design, one that would be quantitative but would also connect those measurements to a formal model, perhaps even a predictive model. Just as "behavioral economics" has revolutionized our understanding of the world, corpus analysis could be one component of a "behavioral programming language design and semantics". Programming languages are primarily used by people, so naturally how people work with different languages matters.
