Wednesday, November 26, 2014

Who really drives high quality open source?

The Cambrian explosion of software we are witnessing is due in no small part to a number of great open source projects. Open source software has the lion's share of web browsers (WebKit, Firefox, and Chrome), web servers (Apache httpd and nginx), mobile and server OSes (Android and Linux), web frameworks (too many to summarize), distributed computing and deployment frameworks (Hadoop, Spark, Mesos), and databases/distributed data stores (PostgreSQL, MariaDB, Cassandra, Riak, Mongo). By "great" here, I mean a combination of widespread use and high quality, as reflected (imperfectly) by defect density. But what really drove those projects to the success and influence we see now? Just as there is the rags-to-riches mythos, there is the mythos of Joe Everybody contributing to open source. Consider an analogy with research: the research community holds the ideal that results can be validated by others replicating the experiments, yet the incentives do not align to promote replication, since an academic cannot stake tenure on it. I wonder whether there is a similar gap between ideal and practice in how open source is organized and contributed to. A related question I would like to answer is what high quality software engineering practices one can learn from top open source projects, and how.

I decided to start a small project to study how the organizational aspects of open source work in practice, on the ground. The mantra is that with open source, anyone and everyone can contribute, but in practice who really does? Of course, open source varies greatly from project to project: projects under the Apache Software Foundation look quite different from a random sampling of GitHub projects. The ecosystem in general has a lot of noise; many projects, indeed the majority of projects on GitHub, SourceForge, and Freecode, have little to no activity. Among the active ones, most projects have a small core team of committers and a slightly larger group of contributors. There are quite a few studies that consider a snapshot of commit activity, a commonly used metric that can be obtained from GitHub and Freecode. However, I conjecture that activity on the developers' discussion mailing list may be another interesting metric, both for predicting commit activity and for understanding the network organization of open source projects at large, especially for contributors who are not committers.
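As a first cut at the "small core team" observation, here is a minimal sketch that tallies commits per author with `git shortlog` and reports the share of commits attributable to the most active authors. It assumes a locally cloned repository with git on the PATH; the repository path and the top-10 cutoff are my own illustrative choices, not part of any published methodology.

```python
#!/usr/bin/env python
"""Rough sketch: how concentrated is commit activity in a repository?"""
import subprocess

def commit_counts(repo_path):
    """Return a list of (commits, author) pairs, most active first."""
    out = subprocess.check_output(
        ["git", "-C", repo_path, "shortlog", "-s", "-n", "HEAD"],
        universal_newlines=True)
    counts = []
    for line in out.splitlines():
        count, author = line.strip().split("\t", 1)
        counts.append((int(count), author))
    return counts

def concentration(counts, top_n=10):
    """Fraction of all commits made by the top_n most active authors."""
    total = sum(c for c, _ in counts)
    top = sum(c for c, _ in counts[:top_n])
    return top / float(total) if total else 0.0

if __name__ == "__main__":
    counts = commit_counts(".")  # run inside a cloned repository
    print("authors: %d, commits: %d"
          % (len(counts), sum(c for c, _ in counts)))
    print("share of commits by top 10 authors: %.1f%%"
          % (100 * concentration(counts, top_n=10)))
```

Running something like this across a sample of projects would give a simple, if crude, baseline against which a mailing-list-activity signal could later be compared.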

What amounts to high quality in open source must necessarily be somewhat subjective. One published metric is defect density (defects detected per 1,000 lines of code), which Coverity reports as part of its open source Coverity Scan report. Defect density is based on source lines of code, so it suffers from some of the same shortcomings as any lines-of-code metric. In particular, it generally is not comparable across programming languages (consider how Coverity reports densities of < 0.7 for C/C++ projects but > 1.0 for top Java projects), and even across projects in the same language, coding style can shift the defect density without saying much about quality or the defects themselves. In 2012, Coverity reported that the CPython project had a defect density of 0.005. For a point of reference, the average defect density for open source C/C++ projects in the Coverity Scan service that year was 0.69. For the Linux kernel and PostgreSQL it was 0.76 (2012) and 0.21 (2011), respectively. In contrast, the average defect density for Java projects (including Cassandra, CloudStack, Hadoop, and HBase) was an astounding 2.72.
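To make the arithmetic behind these figures concrete, here is a minimal sketch of the defect-density calculation itself. The defect and line counts below are made-up inputs for illustration only, not Coverity's data, and Coverity's actual counting of defects and eligible lines is more involved than this.

```python
def defect_density(defect_count, source_lines_of_code):
    """Defect density = defects per 1,000 lines of code (KLOC)."""
    kloc = source_lines_of_code / 1000.0
    return defect_count / kloc

# Hypothetical example: 35 defects found in a 500,000-line code base
# gives 35 / 500 = 0.07 defects per KLOC.
print(round(defect_density(35, 500000), 3))  # -> 0.07
```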

In terms of the geographical distribution of open source contributors, there are some surprising data points. David Fischer has a visualization of where the contributors to the top 200 GitHub projects are located. Curiously, for C++ projects, New York contributors outnumber SF Bay Area contributors by a factor of almost 4 to 1. For C projects, Mountain View and Portland are the top locales. As expected, the SF Bay Area is the top locale for JavaScript, Ruby, and Objective-C projects.

Anecdotally, I have a few observations about open source organization. Corporations can contribute quite a lot to a project, which should not be surprising since many of the top projects have strong corporate backers. For famous projects such as Mozilla and Python, there is also a cadre of graduate students staking out bugs or proposing extensions that form the basis of their thesis research. Some major corporate users of open source may contribute occasionally, but their participation is often muted and isolated. Individual contributors with no connection to the committer team and no academic interest who rise to significant, sustained participation are quite rare. As I move this study along, I would like to present concrete data to answer a few questions: who drives successful open source, what is the average tenure of a committer and how are committers supported, and what does the path from newcomer to accepted, active contributor look like? Committer tenure, at least, can be estimated directly from version control history, as in the sketch below.
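A minimal sketch of that tenure estimate, assuming a local clone and taking tenure simply as the span between an author's first and last commits. This is a crude proxy: it ignores e-mail aliases, gaps in activity, and commits made under different identities.

```python
#!/usr/bin/env python
"""Rough sketch: estimate committer "tenure" from git history."""
import subprocess
from collections import defaultdict

def author_tenure_days(repo_path):
    """Map each author to the days between their first and last commit."""
    out = subprocess.check_output(
        ["git", "-C", repo_path, "log", "--format=%an|%at"],
        universal_newlines=True)
    stamps = defaultdict(list)  # author name -> list of UNIX timestamps
    for line in out.splitlines():
        author, timestamp = line.rsplit("|", 1)
        stamps[author].append(int(timestamp))
    day = 86400.0
    return {a: (max(t) - min(t)) / day for a, t in stamps.items()}

if __name__ == "__main__":
    tenures = author_tenure_days(".")  # run inside a cloned repository
    average = sum(tenures.values()) / len(tenures)
    print("authors: %d, average tenure: %.0f days" % (len(tenures), average))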
