Programming languages, Cloud, and Financial Markets: November 2014

Sunday, November 30, 2014

Top 5 Container Cluster Managers: Containers and Cloud Management


photo by OneEighteen via PhotoRee

The rapid ascent of containers is nothing less than breathtaking. In the space of a year and a half, Docker has gone from initial release to being adopted by every major IaaS cloud host (Google, Amazon, Microsoft, Rackspace's On-Metal CoreOS), VM software vendor, and a few toolchain developers. As a point of reference, VMware ESX took over 5 years to get anywhere near such market penetration. It is interesting to consider at this point what the management tool options are for containers and for clusters/networks of containers. There are two main categories of use cases for Docker: devops (e.g., continuous integration) and virtualization infrastructure (e.g., improving utilization of server resources by using lightweight containers in place of hypervisor-based virtualization).

This is a series of evaluations of container cluster managers: Part 2 (Kubernetes and Decking) and Part 3 (Flocker).

Parallel Data Types

True parallelism can be achieved through a wide variety of avenues. Instruction-level parallelism arguably offered huge value at one point, when processor architectures bulked up to execute machine instructions in parallel whenever possible, including when instructions could be reordered or delay slots could be filled. All of this could be done without the application programmer's involvement or knowledge, so it promised the benefits of parallelism for most programs out of the box, without modification. The onus was instead on processor architects and compiler developers to exploit such parallelism, which sometimes turned out to be a tall order.

Task parallelism, in contrast, typically demands a lot more from application programmers, since exposing and exploiting parallelism becomes very domain-specific. How well task parallelism pays off depends on balancing the load across execution threads. Runtime support can help fill in the gaps by applying advanced scheduling techniques such as work stealing, but ultimately what constitutes a thread (i.e., how to partition the program) and how to mediate communication and sharing between threads are often up to the programmer.

Another way to achieve parallelism at a fine-grained level is to implement and expose a parallel data type in a language, runtime, or distributed computing framework. Apache Spark achieves this with its RDD (Resilient Distributed Dataset) abstraction, which can copy data from a Scala Seq, a Java Collection, or a Python iterable. Hadoop uses Java Iterable for MapReduce. In modern distributed computing environments, the pipeline itself might be any DAG (more general than map-reduce), but ultimately the parallelism stems from the data representation, hence data parallelism. Note that, of all the forms of parallelism, it is data parallelism that has demonstrated straightforward and efficient scaling to truly large problems.
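To make the idea of parallelism stemming from the data representation a bit more concrete, here is a minimal sketch in Python. It is not Spark's or Hadoop's API; the names partition, parallel_map_reduce, and sum_of_squares are hypothetical and exist only for illustration. The sketch splits a collection into chunks, processes each chunk independently, and then combines the partial results. It assumes the reduce function is associative and that the initial value is its identity. It also uses a thread pool for simplicity, which in CPython will not speed up CPU-bound work; a process pool or a cluster would be the realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def partition(data, num_parts):
    """Split a list into num_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_map_reduce(data, map_fn, reduce_fn, initial, num_parts=4):
    """Apply map_fn per element and fold with reduce_fn, one task per chunk.

    Assumption: reduce_fn is associative and `initial` is its identity,
    because `initial` is used once per chunk and once more when the
    partial results are combined.
    """
    chunks = partition(list(data), num_parts)

    def process(chunk):
        # Each chunk is handled independently; no state is shared.
        return reduce(reduce_fn, (map_fn(x) for x in chunk), initial)

    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partials = list(pool.map(process, chunks))

    # Combine the per-chunk results into the final answer.
    return reduce(reduce_fn, partials, initial)


def sum_of_squares(numbers):
    """Hypothetical usage example: sum of x*x over a collection."""
    return parallel_map_reduce(numbers, lambda x: x * x, add, 0)
```

The point of the sketch is that the caller never creates or coordinates threads: the work is divided purely by how the data is partitioned, which is what lets this style scale by adding partitions and workers.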

Friday, November 28, 2014

How to search for great programmers, Part 1


photo by super-structure via PhotoRee

A running theme these days is that GitHub is the new resume and a great source for learning. I am skeptical of these claims. If GitHub is the new resume, then it is one that is potentially even more difficult to evaluate than before, and it will greatly limit the candidate pool, not necessarily to a high-quality one. The bigger question is what role GitHub should play in the hiring process. After all, many if not most of the famous hero programmers out there (e.g., Brian Kernighan, Herb Sutter, Alex Stepanov) do not have GitHub profiles. Linus Torvalds does, but he seems to be an exception. A GitHub profile is also difficult to evaluate. Do we consider only superficial metrics such as commit/contribution frequency, or dive deep into a random project? Neither option seems particularly attractive. At most, they provide an imperfect measure of enthusiasm. It seems to me that if code samples are what we are after, then we should ask for code samples. A real resume provides a highly condensed summary of relevant experience and education that can be evaluated in seconds.

Who really drives high quality open source?

The Cambrian explosion of software we are witnessing today is due in no small part to a number of great open source projects. Open source software has the lion's share of web browsers (WebKit, Firefox, and Chrome), web servers (Apache httpd and nginx), mobile and server OSes (Android and Linux), web frameworks (too many to summarize), distributed computing and deployment frameworks (Hadoop, Spark, Mesos), and databases/distributed data stores (PostgreSQL, MariaDB, Cassandra, Riak, Mongo). By "great" here, I mean a combination of widespread use and high quality, as reflected (imperfectly) in defect density. But what really drove those projects to the success and influence we are witnessing now? Just as there is the rags-to-riches mythos, so there is the mythos of Joe Everybody contributing to open source. Consider an analogy with replication in research. The research community holds the ideal that results can be validated when others replicate the experiments. In practice, however, the incentives do not align to promote such replication, since an academic cannot stake tenure on it. I wonder if there is a similar gap between ideal and practice in open source organization and contribution. A related question I would like to answer is what high-quality software engineering practices one can learn from top open source projects, and how.

What is New in Android Lollipop

These are exciting times, with the new Android Lollipop (Android 5.0) release hitting a late-model Android handset near you. It is apparently already available on the Motorola Moto X and Moto G and slated for Nexus phones on November 12th. There are a lot of interesting features in this release, from the switching on of the Android Runtime (ART), which enables ahead-of-time compilation of Android apps (instead of the more traditional Dalvik just-in-time compilation), to a new garbage collector.

ART was introduced in KitKat (Android 4.4) but it is now the default in Lollipop.
