Programming languages, Cloud, and Financial Markets

Bug Lifecycle

2023-06-08T07:00:00.001-07:00

At the International Conference on Software Engineering (ICSE) 2010, Philip Guo et al. published a study on bug resolution in the paper "Characterizing and Predicting Which Bugs Get Fixed: An Empirical Study of Microsoft Windows". The paper investigates the factors that influence bug-fixing decisions in the development of Microsoft Windows, particularly Windows Vista and Windows 7. The authors aimed to understand why some bugs in the Windows operating system are addressed promptly while others remain unresolved for an extended period. They conducted an extensive empirical study by analyzing a large dataset of bug reports and their associated properties from the Windows bug tracking system.

The paper provides several key findings based on their analysis:

Bug Characteristics: The study found that certain bug characteristics influence the likelihood of them being fixed. Bugs that are easier to reproduce, have clear descriptions, and affect a wider user base are more likely to be addressed promptly.
Bug Severity: The severity of a bug plays a significant role in determining the likelihood of it being fixed quickly. High-severity bugs, which have a substantial impact on user experience or system stability, are given higher priority and are more likely to be fixed sooner.
Developer Expertise: The experience and familiarity of developers with specific parts of the codebase influence bug-fixing decisions. Developers tend to fix bugs related to areas they are more knowledgeable about, resulting in variations in bug-fixing rates across different components of the system.
Bug Reporting Quality: The quality of bug reports, including the level of detail, reproducibility, and clarity of description, affects the likelihood of a bug being fixed. Reports that provide more precise information are more likely to receive attention and prompt resolutions.
Bug Activity and Age: Bugs that receive more comments, indicating active discussions and user interest, tend to be fixed faster. Additionally, the age of a bug influences its chances of being fixed, with older bugs often being prioritized to reduce the backlog.

Based on these findings, the authors developed a prediction model to estimate the probability that a bug gets fixed. They used machine learning techniques to train the model, incorporating various bug characteristics and contextual factors. The model showed promise in predicting which bugs are more likely to be fixed promptly. Its inputs include a measure of bug reporter reputation, the frequency of bug edits, bug reassignments, geographical distance, and reopenings. The paper presents both predictive and descriptive statistical models.

How to do a code review, Part 2

2023-06-07T04:15:00.010-07:00

Several years ago, Microsoft published a conversational article on its code review case study and on its in-house code review tool, CodeFlow.

Code Review Challenges: The article highlights several challenges associated with code reviews, including time constraints, difficulty in coordinating reviewers, lack of context, and limitations of traditional review methods like in-person meetings or email-based reviews.
Asynchronous Code Reviews: CodeFlow addresses the challenges by providing a platform for asynchronous code reviews. This allows developers to review and provide feedback on code changes at their own convenience, regardless of time zones or physical location.
Distributed Collaboration: CodeFlow enables distributed collaboration by providing a centralized platform where developers can review, comment, and track the progress of code reviews. It eliminates the need for manual coordination and facilitates efficient communication between reviewers and authors.
Code Quality Improvement: Code reviews play a crucial role in improving code quality. By involving multiple reviewers, CodeFlow helps identify defects, bugs, and design flaws early in the development process. This leads to better overall code quality and reduces the likelihood of introducing errors into the codebase.
Knowledge Sharing: CodeFlow promotes knowledge sharing among developers. Reviewers gain insights into different codebases and programming practices, fostering a culture of learning and continuous improvement. It also helps distribute knowledge across teams and reduces reliance on individual developers.
Iterative Review Process: CodeFlow supports an iterative review process where developers can address feedback, make changes, and have subsequent rounds of review. This promotes collaboration, allows for further refinement of code changes, and encourages constructive discussions between authors and reviewers.
Metrics and Insights: The article presents statistical data on CodeFlow's usage, including the number of code reviews conducted, average review duration, and the number of code changes accepted or rejected. These metrics provide insights into the effectiveness and efficiency of the code review process.

Check out part 1 of this series on code review.

Corporate Governance Today

2017-07-10T11:00:00.000-07:00

Once upon a time, companies were managed by founder-owners, engineers, gritty tradesmen, and the like. The current norm of companies managed by a class of professional managers is a relatively recent development. Once upon a time, the dream of an upstart company, especially a tech company, was to go public. A couple of market crashes and a rather unique interest rate environment have turned this aspiration upside down, and many companies have chosen to delay initial public offerings. As of late, the pace of IPOs has picked up somewhat, but it still pales in comparison to the heyday of tech IPOs. One performance all publicly traded companies must put on is the shareholders' annual meeting. There is quite a contrast from company to company in how they carry this out. For Berkshire Hathaway, it is a celebratory confab of the giants of industry and finance coming before the Oracle of Omaha, Warren Buffett. For most companies, it is a formal requirement that they would rather do without. In some cases, annual meetings have become outright adversarial. With numerous SEC-required filings to submit and earnings calls to give for the benefit of Wall Street analysts, yet another venue where shareholders, the purported fractional owners of companies, and influencers can raise questions is probably few management teams' idea of fun. It seems to be all too easy for an annual meeting to become adversarial, whether due to an impending say-on-pay vote, some kind of environmental or social concern, or just plain input from the shareholders on how a company can tack to become more efficient, profitable, and fast growing. Few teams want to hear from the shareholders because, from the perspective of management, the best input from shareholders always seems to be no input. With a large part of the public shares of companies held by mutual funds and ETFs, which in turn are held in 401(k) accounts, this affects many of us.
Nearly all the time, funds either cast a perfunctory vote according to management's recommendation without any consideration or abstain from voting. This works very much in the favor of management since they can count on funds to support share value, their share value, while not having to worry about large shareholders attempting to get their say in. When did corporate governance become so adversarial and arguably dysfunctional?

This situation also affects tech companies. Though some of the leading tech companies are nominally run by engineers, most are backed by a senior leadership team of professional managers supported by legions of more professional managers. This is not to say that management's job is easy. It is very challenging, and sometimes you are put in a no-win situation. However, the fact that communication between management and their nominal bosses, the shareholders, has broken down or started to fray at the edges points to some kind of sea change. Even the entry of activist investors such as Paul Singer's Elliott Management, Carl Icahn, and Third Point is a relatively new phenomenon.

I have written previously about shareholder meetings and how some small investors such as Matthew Rafat are trying to get their say in.

How to do a code review, Part 1

2017-07-09T09:00:00.000-07:00

Code review is a practical and important skill that truly distinguishes the highly experienced engineer. Knowing what the common pitfalls are, where to look for them, and how to understand someone else's code quickly is invaluable. It is also a skill that is rarely taught in school. To really ramp up your programming skills, get out there and read other people's code. Reading high-quality code is a good learning experience. Occasionally reading poorly written code is also insightful. What are the most common errors and where do you find them? Through code reviews, software engineers can spread knowledge through their teams as well as improve consistency and code quality. As an interviewee, you have to review your own code, so it helps to know which things you need to double-check. One great way to evaluate a team's culture is to probe and understand their code review practices or lack thereof. Below, I will cover some of the best practices and provide a few pointers on where you can get started.

Platform-as-a-Service Providers

2017-07-02T20:18:00.001-07:00

An overwhelming amount of attention and resources has been focused on Infrastructure-as-a-Service (IaaS) clouds. IaaS clouds are in a sense more flexible but require considerably more setup and configuration than Platform-as-a-Service (PaaS) clouds. IaaS clouds can serve both for webapp hosting and for Hadoop-style cluster computations. PaaS clouds are clearly targeted at webapp hosting. PaaS cloud hosts usually provide a wide variety of supporting services (e.g., payment gateways, monitoring, CRM) to help developers focus on core business logic, but businesses have often found PaaS too restrictive, requiring one to give up too much control over the webapp architecture. In particular, PaaS often requires one to select from a limited set of data stores and load balancers. Sometimes, PaaS requires webapps to be written against a particular proprietary API, potentially with a restricted and even proprietary data store. This would make something rather simple, such as deploying a WordPress site, a non-starter.

Resources

2017-07-01T13:14:00.001-07:00

What's Opening in your neighborhood scours the web for terrific restaurants that are coming soon to your neighborhood.

Explaining Market Crashes

2016-06-23T06:33:00.004-07:00

The body of academic theories for market crashes explores a number of possible explanations. Why do markets crash? There may be some event that sets everything in motion, but why and when does a decline turn from an ordinary slide to a crash? Didier Sornette wrote a book and did a TED talk on the subject. Scholars as illustrious as Fischer Black and Myron Scholes weighed in on the subject. There are four main theories addressing this phenomenon:

Leverage effects: A drop in prices increases both operating and financial leverage, exacerbating volatility as businesses and investors must either raise capital to cover their leverage or reduce it
Volatility feedback: When bad news arrives, the resulting increase in risk premia magnifies the direct effect of the news
Stochastic bubbles: A crash occurs when a bubble pops, a low-probability event that produces large negative returns
Investor heterogeneity: Different investors face varying constraints when it comes to short sales. The more bearish group of investors subject to short-sales constraints may just sell all their shares, a suboptimal solution, and thus their information is not fully reflected in the market

The oldest theory focuses on leverage effects. The idea is that a drop in prices raises operating and financial leverage, thus exacerbating volatility. This theory is articulated by Fischer Black and Myron Scholes in their 1973 paper "The Pricing of Options and Corporate Liabilities" in the Journal of Political Economy. Andrew A. Christie further explores the idea in "The Stochastic Behavior of Common Stock Variances--Value, Leverage and Interest Rate Effects" in the Journal of Financial Economics, 1982.
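The financial-leverage half of this argument can be illustrated with simple balance-sheet arithmetic. The sketch below is only an illustration under stated assumptions (debt held fixed, made-up balance-sheet numbers, hypothetical function names); it is not the model from either cited paper.

```python
def debt_to_equity(debt: float, equity: float) -> float:
    """Financial leverage expressed as the debt-to-equity ratio."""
    return debt / equity


def equity_sensitivity(debt: float, equity: float) -> float:
    """Percent change in equity per 1% change in total asset value,
    assuming debt stays fixed: (debt + equity) / equity."""
    return (debt + equity) / equity


# Hypothetical firm: 50 of debt, 50 of equity (numbers are made up).
debt = 50.0
equity_before = 50.0
equity_after = 40.0  # equity value after a 20% price drop

print(debt_to_equity(debt, equity_before))      # 1.0
print(debt_to_equity(debt, equity_after))       # 1.25
print(equity_sensitivity(debt, equity_before))  # 2.0
print(equity_sensitivity(debt, equity_after))   # 2.25
```

With debt unchanged, the same 1% move in asset value now moves the equity by 2.25% instead of 2%, which is the sense in which a price drop feeds into higher equity volatility.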

Cloud Providers in China

2016-04-06T17:02:00.000-07:00

The cloud business is slowly but surely picking up in China as it has in the West. Out of the major multinational providers, only Microsoft Azure offers comprehensive service in China. In 2014, Amazon introduced a closed, limited program for its Beijing region, but this service excludes common services such as Elastic Beanstalk. China, however, has its own homegrown cloud services, principally SinaCloud and AliYun from Sina and Alibaba respectively. Baidu offers a PaaS service but not a full-fledged IaaS. Unlike US cloud services, most Chinese cloud services are pay by the month instead of pay by the hour. This arrangement is quite limiting since one cannot scale costs down below the monthly unit. This truly adds up when one is talking about thousands of instances.

Investigating dislocations in the oil complex

2016-02-02T09:21:00.003-08:00

The oil and basic material complex suffered quite a downturn in 2015 and again at the beginning of 2016. Equity prices have declined dramatically. Oil and basic material company bonds have also been under quite a lot of pressure from downgrades by the credit rating agencies. Even large-cap companies have seen the yields they have to pay skyrocket. As the truism goes, when market volatility goes up, correlations all head to 1, since everyone sells everything. Having noticed HAL, BHP, and COP bonds going for 4.5% to 6%+, which is a huge spread to Treasuries, one wonders how severe a downturn the market is pricing in.

The following table shows the correlation of USO returns with the changes in each company's bond yield (which ought to be inversely related to bond price) and with its equity price over the period January 4 to February 1.

Company	Bond Yield	Equity Price
HAL	-0.064	0.68
COP	-0.62	0.79
BHP	-0.78	0.67

Unsurprisingly, each of these companies' equity prices is highly correlated with oil prices at this juncture. Even BHP, which is somewhat removed from the oil complex since it works in many basic materials and mining beyond oil, is highly correlated. It turns out that the bond yields of these otherwise investment-grade bonds are also tracking oil prices, rising as oil falls (hence the negative correlations), except in the case of HAL, whose yield changes are essentially uncorrelated with USO.
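For readers who want to reproduce this kind of calculation, here is a minimal sketch of correlating day-over-day yield changes with USO returns. The series below are made-up placeholders, not the data behind the table, and the helper names are hypothetical.

```python
from math import sqrt


def changes(series):
    """First differences, e.g. day-over-day changes in a bond yield."""
    return [b - a for a, b in zip(series, series[1:])]


def simple_returns(prices):
    """Simple period-over-period returns of a price series."""
    return [b / a - 1.0 for a, b in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder data for illustration only.
bond_yields = [5.0, 5.2, 5.1, 5.5, 5.4]   # yield in percent
uso_prices = [10.0, 9.6, 9.8, 9.1, 9.3]   # USO closing prices

print(pearson(changes(bond_yields), simple_returns(uso_prices)))
```

A strongly negative result means yields rose on the days oil fell, which is the pattern the COP and BHP rows show.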

Understanding Cloud Host Pricing, Part 2

2014-12-29T06:00:00.000-08:00

In the past few years, there has been a movement to standardize cloud compute resource measurements in order to make way for public trading of compute resources. The idea is simple, but execution may be complicated: each company can run something like OpenStack and rent out underutilized compute resources, and these resources can be further traded on public exchanges to enable companies to hedge against price spikes. Along these lines, Amazon was quite early in introducing the Reserved Instance Marketplace. Public trading of standardized compute units will enable smaller organizations to monetize underutilized assets. This model is not without its challenges. Compute resources have many aspects that distinguish them. Performance may vary dramatically. In this post, I investigate some of the smaller cloud hosts and their prices.

Provider	Minimum Unit ($/hr)	Memory (GB)	Instance Storage (GB)	Persistent Block Storage ($/GB/mo)
HP Helion	0.03	1	10	0.10
IBM SoftLayer	0.04	1	25	0.10
Oracle Cloud		1.8
CloudSigma	0.0319	1	1	0.14 SSD
GoGrid	0.02-0.03	0.5	25	-
DreamHost DreamCompute	0.0264	2	25	-
Internap	0.04	1	20 SSD	0.30

Top 5 Gotchas When Running Docker on a Mac

2014-12-27T06:00:00.000-08:00

Running Docker on a Mac is meant to be a convenience, but the fact that Docker on Mac is a second-class citizen shows up every now and then. Since Docker is based on Linux cgroups, it cannot and does not run natively on Mac OS X. Instead, Docker runs on Macs by using boot2docker, a shim that boots up a whole VirtualBox VM on which one will actually run Docker. Running Docker inside of a VM on Macs complicates things quite a bit.

Understanding Cloud Host Pricing, Part 1

2014-12-26T06:00:00.000-08:00

The pricing schemes of the top cloud infrastructure-as-a-service (IaaS) providers are rather complicated. They cannot be compared directly since their performance characteristics vary. Moreover, the differences in costs of instances, storage, and bandwidth may offset each other. For example, provider A's storage costs may be greater than provider B's, but B's instance costs may be greater. Some providers bill for instances by the minute (Azure and Google Compute Engine) whereas others, such as AWS, bill by the hour (rounded up). Some providers charge for storage by the TB (Azure) whereas others charge by GB/month or sometimes GB/hour (Rackspace). Consequently, what constitutes the best deal in terms of cloud hosting depends on your specific workloads and storage needs. In this series, we will investigate the various aspects of cloud host pricing from the major providers: Amazon AWS, Microsoft Azure, Google Compute Engine, DigitalOcean, and Rackspace.
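To see why billing granularity matters, consider a rough sketch comparing per-minute billing with hourly billing rounded up. The rate and function names are made up for illustration, and the sketch ignores minimum charges and any other provider-specific rules.

```python
import math


def cost_billed_per_minute(minutes_used: int, hourly_rate: float) -> float:
    """Cost when usage is billed in one-minute increments."""
    return minutes_used * hourly_rate / 60


def cost_billed_per_hour(minutes_used: int, hourly_rate: float) -> float:
    """Cost when every started hour is billed in full (rounded up)."""
    return math.ceil(minutes_used / 60) * hourly_rate


# Hypothetical instance at $0.10/hour that runs for 61 minutes.
rate = 0.10
print(round(cost_billed_per_minute(61, rate), 4))  # 0.1017
print(round(cost_billed_per_hour(61, rate), 4))    # 0.2
```

For short-lived or bursty workloads the rounded-up hour can nearly double the instance bill, whereas for instances that run around the clock the two schemes converge.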

Retry Pattern

2014-12-17T07:41:00.000-08:00

A common design pattern in fault-tolerant distributed systems is the retry pattern. A given operation may experience a variety of failures:

rare transient failures
(e.g., corrupted packet) can be recovered from immediately, so the operation should retry right away
common transient failures
(e.g., network busy) call for a retry after waiting for a period of time (possibly with exponential backoff)
permanent failures
should not be retried; bail out and clean up

Of course, the final case is that the operation succeeds and the function must do some work to handle that. This is an interesting design pattern not only for distributed systems but also for fault-tolerant systems in general. For example, high-performance JavaScript engines have parsers and tokenizers that must be robust to various failures. In fact, this is one case where large systems have used multiple exit points, a more complicated form of control flow for which C-based programs may use gotos.
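The cases above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the exception classes, the `retry` helper, and its parameters are all hypothetical names, and a real system would need its own way of classifying failures.

```python
import time


class RareTransientError(Exception):
    """E.g., a corrupted packet: retry immediately."""


class CommonTransientError(Exception):
    """E.g., network busy: wait, then retry."""


class PermanentError(Exception):
    """Do not retry: clean up and bail out."""


def retry(operation, cleanup=lambda: None, max_attempts=5,
          base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying according to the kind of failure."""
    backoffs = 0
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()  # success: hand the result back
        except RareTransientError:
            pass  # retry immediately
        except CommonTransientError:
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** backoffs)
                backoffs += 1
        except PermanentError:
            cleanup()
            raise
    cleanup()
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to test, since a test can record the delays instead of actually waiting.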

Docker Image for SML/NJ

2014-12-10T05:04:00.003-08:00

The package repos of the various Linux distros carry very outdated versions of the SML/NJ compiler. This Docker image builds the latest official SML/NJ release.

Apache Mesos and Hadoop YARN Scheduling

2014-12-09T06:00:00.000-08:00

Mesos and YARN are two powerful cluster managers that can play host to a variety of distributed programming frameworks (Hadoop Map-Reduce, Dryad, Spark, and Storm) as well as multiple instances of the same framework (e.g., different versions of Hadoop). Both are concerned with optimizing utilization of cluster resources, especially in terms of locality of data distributed around the cluster. Google's paper on Omega, their own cluster scheduling system, dubs Mesos a two-level scheduler, which provides some flexibility by having a single resource manager offer resources to multiple parallel, independent schedulers. YARN is considered a monolithic scheduler since independent Application Masters are only responsible for job management and not scheduling. Scheduling is the essence of efficient Big Data processing. However, where do these two systems differ?

Alternatives to Docker: LXD and Rocket

2014-12-08T06:00:00.000-08:00

Two recently announced alternatives to the Docker Linux container runtime, LXD and Rocket, aim to offer some interesting value propositions. Before I get to the details, let's first identify which use cases and aspects of the Docker container runtime are of interest here. Docker identifies a few major categories of use cases: continuous integration, continuous delivery, scaling distributed applications, and Platform-as-a-Service. The former two are DevOps use cases. At this point, Docker pretty much has a lock on DevOps use cases. Moreover, neither of the would-be competitors truly targets DevOps. This becomes obvious when you consider that LXD was intended to run on OpenStack server environments and Rocket on CoreOS/fleet (though it is not necessarily tied to CoreOS). Docker runs on workstation environments and even on top of VirtualBox via boot2docker to support DevOps functionality on Mac OS X. The more competitive aspect is the cloud infrastructure one. Here, Docker is competing with a wider range of technologies to support scaling on the cloud and providing PaaS functionality. This is the market where LXD and Rocket would operate. This is also the area where hypervisors have reigned.

How to search for great programmers, Part 2

2014-12-07T06:00:00.000-08:00

Aline Lerner of Trialpay recently posted statistics on an experiment about resume review. The conclusion was that recruiters, engineers, and just about everyone score resumes all over the place and therefore resumes have weak signal value. The claim was that the strongest signal in the resume was the number of typos. Although the study seems extensive, I think there are a number of weaknesses in the experimental design. One weakness was acknowledged: the ground truth is the author's own subjective evaluation of the candidates. Another weakness was how the survey questions were somewhat misleading in the first place. The questionnaire asks "would you interview this candidate" and yet this was compared with the ground truth of "will the candidate perform well on the job or technical interview". As I alluded to in an earlier post in this series, the role of a resume is to help filter for red flags and to guide the formal interview, not to determine whether a candidate is a star performer by itself. The fact of the matter is, a resume is a self-reported synopsis of a candidate's track record. To evaluate a candidate, I would think track records are important, as is potential.

What is really interesting about Quantitative Behavioral Finance

2014-12-06T06:00:00.000-08:00


Quantitative behavioral finance has not been with us for very long. As a relatively recent development and area of discourse, it has only begun to gain a following. One very interesting aspect of this field is the use of experimental asset markets. These studies are based on experiments conducted on a small group of people (but with real money, hence a real market) to examine where rational expectations and classical game theory fail to explain human behavior. This is basically a small-scale version of prediction markets such as the Iowa Electronic Markets, Intrade, and Betfair. However, unlike prediction markets, where the ultimate objective is to predict an external event, experimental asset markets are more interested in the mechanics and patterns of the market itself. Caginalp, Vernon Smith (Nobel Memorial Economics Prize recipient of 2002), and David Porter have a couple of papers on experiments in this mode. Both experiments examine how financial bubbles can happen. Some of the take-aways are that excess cash and information asymmetry due to lack of an open book may exacerbate bubbles. I think the matter of information asymmetry is very salient. Despite all the effort and money invested in improving information infrastructure by banks and hedge funds, ultimately information is distributed non-uniformly to market participants. This is most obvious in the case of the retail investor, who has neither the time nor the resources to obtain and analyze all the market information.

Container Virtualization Options

2014-12-05T07:28:00.001-08:00


Looks like the container virtualization space is becoming a little more interesting this week. Previously, Docker was the only more or less complete standard container implementation (with definition of image, image creation, and container start/stop management). There was Canonical's LXD, but it didn't seem to be garnering nearly as much attention and support since it was only announced a month ago. However, with the Docker and CoreOS organizations starting to encroach on each other's territory, the CoreOS community has released an early version of their own container runtime, Rocket. On the other side, Docker has moved into the container cluster orchestration and management space with Docker Swarm and Docker Compose, the latter still being in the design stage.

Containers versus Virtual Machines

2014-12-05T06:00:00.000-08:00

Container-based systems (e.g., Docker, LXC, cgroups) and virtual machines (VMware, Xen) both seek to bring the benefits of virtualization to the data center and developer workflow. They have considerable overlap in benefits. Although both do some kind of virtualization to enable better utilization of physical hardware, there are also some key differences. Containers do virtualization at the OS kernel level. Hence isolation is limited to what the kernel can enforce. Containers do a lot of sharing of layers of file systems, courtesy of AuFS, which potentially makes better use of disk space and image space than virtual machines, which commit the entire contents of a VM's disk to the disk image.

Supervisor trees

2014-12-04T06:00:00.000-08:00

In my past few posts, I have focused on fault-tolerant distributed systems as implemented through cluster managers. Apache Mesos, Kubernetes, and many others all attempt to support fault tolerance by auto-restarting and other self-healing techniques at the cluster manager level. As such, they rightly claim that they are the new operating systems of the cloud. It turns out, however, that cluster managers certainly do not have a monopoly on fault tolerance features. Long before Mesos, Kubernetes, and possibly even University of Wisconsin's Condor, a distributed processing system with considerably more pedigree, Erlang had supervisor trees and supervisor behaviors (a kind of language interface) in the runtime, thus supporting large, highly fault-tolerant distributed systems decades ago.
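To give a flavor of the idea, here is a toy sketch of a supervisor in Python. It is only a loose analogy: Erlang supervisors manage concurrent processes inside the runtime, whereas this hypothetical `Supervisor` class runs its children one after another in a single thread and restarts only the child that failed, up to a restart limit.

```python
class Supervisor:
    """Toy supervisor: restarts a failing child, and fails itself
    (escalating to its own parent) once the restart limit is hit."""

    def __init__(self, name, children, max_restarts=3):
        self.name = name
        self.children = list(children)  # (child_name, callable) pairs
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self):
        for child_name, child in self.children:
            self._run_child(child_name, child)

    def _run_child(self, child_name, child):
        while True:
            try:
                child()
                return  # child finished normally
            except Exception as exc:
                self.restarts += 1
                if self.restarts > self.max_restarts:
                    raise RuntimeError(
                        f"{self.name}: restart limit exceeded by {child_name}"
                    ) from exc
                # Otherwise loop around and restart only this child.
```

Because `run` is itself a callable, a supervisor can be listed as a child of another supervisor, e.g. `Supervisor("root", [("workers", workers.run)])`, which is what makes the structure a tree: a failure that a lower supervisor cannot contain propagates upward.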

Security in Containerization Technology

2014-12-03T06:00:00.000-08:00

One lingering worry with containerization is security. Previously, with conventional type 0 and type 1 (native, bare-metal) hypervisor technology, we greatly limited our trusted computing base to small hypervisors (e.g., Xen is < 150kloc). Some were so small (seL4 core was 7.5kloc) that they were amenable to mechanized formal verification. OSes supporting containers, in contrast, are much larger. Even CoreOS, intended as a slimmed-down version of the Chrome OS Linux kernel that just supports modern bare-metal architectures for containers, is fundamentally more challenging to vet, not to mention verify, than a simple hypervisor. Etcd and fleet alone add up to 44k sloc of Go. So for all the great inroads we were making in verification, the move towards containerization in the data center brings new challenges and potentially resets some of the progress the community has made in mechanically verifying security and functional correctness of the lowest layers of software systems and infrastructure.

ClusterHQ Flocker

2014-12-02T06:00:00.000-08:00

Flocker does multi-host orchestration for Docker containers. It is intended mainly as a means for containerizing and orchestrating distributed data stores and databases, although in principle it can deploy any app. Unlike some of the other solutions out there, Flocker aims to support checkpointing of stateful and stateless containers to support migration of (running) containers across nodes. This seems like a great feature if one wanted to do work-stealing rescheduling of containers as the execution profile changes and other nodes become available. Flocker provides its own NAT layer for mediating communication between containers across nodes. It also supports ZFS persistent volumes to maintain state. Flocker itself does not aim to do any particularly sophisticated scheduling (cf. Kubernetes) but instead relies on the user to supply scheduling.

Kubernetes and Decking Container Cluster Managers

2014-12-01T06:00:00.000-08:00

Fig. 1: Kubernetes Architecture

Kubernetes manages user-defined collections of containers called pods. Note that "pod" refers to the running container and not a static image (in Docker terminology). Besides containers, a pod can also have persistent storage attached as a volume and also define custom container health checks. Pods themselves can be organized together into "groups", a kind of "API object", which in turn can be referenced by label. There are two other main forms of API objects: replication controllers and services. The former produces a fixed number of replicas of a pod template. The latter defines internal and external ports for establishing connectivity across pods.
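The interplay of labels and replication controllers can be sketched as a small reconciliation function. This is a conceptual model only: `Pod`, `matches`, `select`, and `reconcile` are hypothetical names for illustration and are not the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict


def matches(pod, selector):
    """True if the pod carries every label in the selector."""
    return all(pod.labels.get(k) == v for k, v in selector.items())


def select(pods, selector):
    """Reference a set of pods by label."""
    return [p for p in pods if matches(p, selector)]


def reconcile(pods, selector, replicas, template_labels):
    """Return a new pod list with exactly `replicas` pods matching
    the selector, creating pods from the template or dropping extras."""
    matching = select(pods, selector)[:replicas]  # drop any surplus
    others = [p for p in pods if not matches(p, selector)]
    next_id = len(pods)
    while len(matching) < replicas:
        matching.append(Pod(f"pod-{next_id}", dict(template_labels)))
        next_id += 1
    return others + matching


pods = [Pod("a", {"app": "web"}), Pod("b", {"app": "db"})]
pods = reconcile(pods, {"app": "web"}, 3, {"app": "web"})
print(len(select(pods, {"app": "web"})))  # 3
```

Running the same function again with a smaller `replicas` value scales the matching set back down, while pods that do not match the selector are left untouched.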