Programming languages, Cloud, and Financial Markets: 2011

Thursday, December 22, 2011

Shareholder Meetings

Matthew Rafat's blog willworkforjustice.blogspot.com has a first-hand record of shareholder meetings for a number of companies, though mostly in the Bay Area since that is where he is based. With organizations such as moxyvote starting up and various motions towards more responsive corporate governance, this record is an interesting data point. Large shareholders (pension funds, etc.) are still for the most part slumbering giants.

How long does it take to compile a web browser, Part 1

In today's brave new world, people can build and deploy a web app very quickly. However, on the other side of the equation, systems software such as web browsers are not getting any simpler. The cost of the rich media experience in modern web apps quite frequently borne by systems software including browsers, servers, and plugins. Just eyeballing, in Chrome (default multi-process mode), each tab take 30-100 MB of RAM. Luckily, I am not one to keep many tabs open concurrently, but many, many people do. A rich web experience can quickly gobble up whatever amount of memory you can throw at it. There are many people who have compared memory usage for Chrome, Firefox, Safari, and Internet Explorer (TechPP, CNet, Tom's Hardware, DotNetPerls, and many more I am sure), but, here, I would like to take a look at another facet of web browser complexity: source code size, time to build, and size of debug executable. This post follows and expands upon the type of study I did on open source finance software.

US Debt Service

The economic times have really changed. The scope of household deleveraging is quite astounding. The Federal Reserve releases a figure called the Financial Service Ratio which measures amount of consumer and mortgage debt servicing including automobile lease payments, rental payments on tenant-occupied property, homeowners' insurance, and property tax payments to disposable personal income. For renters, the ratio peaked around 2001 at 31.05. For homeowners, it peak a lot later around 2007 at 17.55. Both have been sliding ever since. This data refers to the country as a whole. The picture for the individual regions may look different.

Symmetric Type Sharing in ML

In the Standard ML language, there are two primary mechanisms for refining a module signature after the fact. One is a symmetric mechanism called type sharing and the other called where type, aka fibration. These are useful for facilitating modular programming by enabling reuse of module signatures (i.e., the syntactic "type" of a module).

Buybacks and Dividend Growth

There is a lively discussion on one of the front-page stories on SeekingAlpha about whether buybacks destroy value. As in my previous posts, I am on the fence regarding this one. It really varies on a case to case basis depending on where the money is coming from (some buybacks are highly leveraged and some special dividends come from debt issuance) and where it could have gone instead. It was surprising to see at least one investigation seemingly indicating that Buyback Achievers trump Dividend Achievers by a considerable margin. In terms of the intersection between PKW (PowerShare Buyback Achievers ) and PFM (PowerShares Dividend Achievers), it is only 12 companies: IBM, TGT, WAG, CB, FDO, WMT, GWW, CTAS, TJX, SFG, CASY, LOW. This group outperformed both PKW and PFM over a 1 year time frame. So, anecdotally, a combination of buybacks and dividends coincide with outperformance. Of course, there may be a number of confounding factors that may affect this.

Social Web IPOs

Looks like Zynga's IPO day didn't start a social web frenzy (down 7.8% as of now). The broad market certainly was no help here. Quite a pattern is forming with Pandora, Groupon, Linkedin, and Renren. None of these led to an unqualified success, and most of them are doing quite poorly especially if one bought at the open of the first day. Those who bought at the IPO price did considerably better for each of those 3 names. Groupon and Pandora debuted last month and in June respectively. Linkedin and Renren debuted in May.

Can free tiers be tear free?

Many of the cloud computing services offer some kind of free tier these days. Most of them are time-limited (e.g., Amazon's free tier lasts for 1 year, Microsoft's 90 days), but some are not including Google. How do they stack up?

Data-Driven Language Design

One recurring issue with the evolution of programming languages is the tension between feature creep and stagnation. Some programming languages manage to evolve very rapidly, incorporating all kinds of features in response to developer requests or the language architect's leading. Brian Goetz has an interesting article on DeveloperWorks which argues for using "corpus analysis", studying the actual frequency of use of language constructs in the corpus of source code, to guide the evolution of a language. In his article, he describes two examples, precise exceptions rethrow semantics and type inference, where corpus analysis played a role in determine how Java evolved. When I was working on SML/NJ, we informally used this technique to determine the best design in terms of certain type checking/inference and module system elements.

Share Buybacks, Part 4

Taking a sampling of the 41 PKW Buyback Achievers I could find XBRL annual reports on, there are a number of interesting points. The mean buyback rate for 2010 was 7.3% of common stock shares outstanding (DEI). Given this data point, IBM's rate of buyback seems only a middling one. I would caution generalization from a single year's worth of data, but it does provide a little perspective. A particular high or low buyback rate does not mean much by itself. One has to consider where the money for the buybacks is coming from and how is the rest of the business doing.

Vicious Cycle

The Fed didn't ride in with the cavalry today. Looking from their perspective, why would they? The domestic data from the past few weeks did not indicate catastrophe. Popular support for further easing is tenuous at best. The main risk is from the debt situation in Europe. The evanescent hope that fueled the past few weeks' action dissipated just as quickly. The Dollar and Treasuries skyrocketed from the lows of the day. In Europe, there is a vicious cycle going on. The large private holders of sovereigns (i.e., banks and funds) insist on being made whole, thus pressuring the very sovereigns they don't want to default. The European governments insist on more and more austerity, negating what little chance they had to have secular growth lift them out of the mess.

Server Infrastructure, Part 2

Following post on Oscigen's benchmark, I was curious how much does increased memory usage due to Ruby and Python-based web app stacks affect the bottom line. Plotting Amazon EC2 instance cost against instance memory shows that costs scale roughly linearly with the memory allowance. The least squares fit is y(x) = 0.028*x + 0.079. Linear scaling means that when Ruby/Rails take 10x the memory (roughly the difference between OCaml/Oscigen and Ruby/Rails) then it costs 10x as much in the long run (and after scaling up). For smaller instances, the difference is probably less. Of course, this doesn't take into account the productivity difference. However, although Ruby and Rails are clearly more productive than C/CGI, it is not clear that it is more productive than other high-level languages with better memory use and performance characteristics.

A Paper on "High Frequency" Treasuries

There is a paper from London Business School dated from a while back (2002) that studies simple pairs strategies applied to the whole universe of US treasuries secondary market. The data used are intraday quotes and transaction information from GovPX (thus "high frequency"), which has since been acquired. The paper has an interesting table comparing equity and treasury markets (Table 1). One particular metric is the "social acceptance" of shorting the securities. The claim is that shorting equities has low acceptance whereas shorting treasuries is "less affected". The proposed strategies reduced the standard deviation of returns considerably though the level of returns were also dragged down somewhat. The best performing strategies generally kept risk controls very tight. When transaction and financing costs are accounted for, the returns start to trail the risk free return, though it outperforms equity and the treasury bond index in terms of Sharpe ratio and Gain-Loss ratio. Curiously, these strategies are slightly positively correlated with equities and slightly negatively correlated with the bond index.

Saturday, December 10, 2011

Save bits, save power, save the world, Part 2

The GHC Haskell and OCaml compilers do not have builtin popcount support, so again I precomputed the bit pop count for each ASCII letter. Using the same ~9 Mb 4x concatenation of /usr/share/dict/words from last time I measured the performance of the resultant code. This time, the test machine was a Macbook Pro 2.16 GHz with 2 Gb of RAM. The microprocessor is a Core Duo T7400 which has no support for the popcnt instruction anyway.

Europe Again

The markets closed on down note. After hours, we have ES, ZN, and DX at 1231.75 (down ~2.5%), 131'055 (up ~0.39%), and 78.920 (up ~0.55%) respectively. Tomorrow is the end of another summit (an EU summit). Europe remains the mover of markets for now.

Risk Aversion

The Dollar is up, 78.780 (+0.310). Treasuries are holding quite firm at 130'240. The ECB lowered rates back to the record low to 1.00%, offered banks unlimited cash, but did not initiate new bond purchases. Gold has tripped up a bit, but nothing game-changing at this point. Taking a look at the weekly for large-cap equities (S&P 500 E-mini), one can see the slow but steady upward Europe hope rally (rising channel). Today's action appears to have rejected new highs, but there has been no significant confirmation of a break in this very short-term time frame.

In terms of the individual stocks front, I will be looking at some emerging market issues. Brazilian metals miner VALE looks to be hanging around the 52-week bottom at 22.42. Brazilian oil giant Petrobras PBR bounced up a little but lagged the broader market. Today it took a 3.84% plunge to 27.10. Chinese oil offshore driller CNOOC CEO is at 193.68, which is actually around the top of its recent range. It currently yields around 2.9%

Wednesday, December 7, 2011

Save bits, save power, save the world, Part 1

Ok, the title was a little tongue-in-cheek. I did a fairly whimsical experiment yesterday. Since text strings are represented as bits which are either 1 (on) or 0 (off), I was wondering which unused domain names used the minimum number of "on" bits. The answer may be non-intuitive to the non-computer scientist.

Substructural Type Systems and Performance, Part 1

Why is the name of this blog "substructural"? In turns out that in the realm of advanced type systems, there is a family of type systems derived from a family of logical systems called substructural logics. The chief distinction of substructural type systems is that one or more of the "structural properties" of type environments are eliminated. These rules are exchange, weakening, and contraction. Exchange allows arbitrarily reordering bindings in the type environment. Weakening enables adding new bindings. A type derivation is deemed "weakened" in the sense that if a program already typechecks using a certain type environment, adding new bindings to that environment won't negate the well-typing of the program.

The Steam Engine of the 2010s?

I read a couple of technology-/startup-oriented articles recently. One was yesterday's Forbes article on developeronomics which describes the growing prominence of talented software developers and investment in them in stark terms.

Share Buybacks, Part 3

Following my previous discussions of share buybacks here and here, let's look into the distribution of the share counts for the PKW (Share Buyback Achievers ETF) components. The components of the Share Buyback Achievers exhibit significant diversity in outstanding shares. For this study, I used the outstanding share information from Yahoo Finance which appears to be a rough undiluted number. The mean, median, and standard deviation of share count are 242M, 81M, and 430M respectively. The component with the minimum number of shares outstanding in this fund is homebuilder NVR with a mere 4.98M shares out. Mega-retailer WMT has the most with 3.44B shares.

Firestorm

The past few days have seen a veritable firestorm of flames between the Java and Scala communities. The proximate cause was a leaked private email outlining what the Yammer folks disliked about Scala. An official post from Yammer followed but not before many bloggers had a field day with the leaked email. What is interesting is that one such discussion on Stephen Colebourne's blog pointed that the the lack of a "module system" is a principal deficiency of Scala. This point, mind you, was not part of the Yammer email. In fact, it preceded the email by a few days. The post cited the Scala's community's response to an inquiry about Scala and "module systems" [Google Groups]. That inquiry initially resulted in some misunderstanding.

Multiarmed Bandits

I tweeted a little while ago about a post from the Untyped people on applying a multiarmed bandit algorithm to website optimization (in contrast to A/B testing). It turns out that many have considered the multiarmed bandit problem for optimizing market maker profitability, Bandit Market Maker by Penna and Reid [PDF]. The basic idea is to dynamically optimize (i.e., reinforcement/online learning) the use of k-levers each operating under initially unknown and eventually partly known distributions. Presumably, the more one uses the levers, the more one learns about the reward, but there is an exploration-exploitation dilemma -- since we are operating online one has to choose whether to take the empirically revealed best choice thus far or to continue to explore for more rewarding options. The exploration-exploitation dilemma also matters to genetic algorithms and programming, but the multiarmed bandit problem explicitly optimizes for it. The exciting thing is that after decades of research, they have found optimal algorithms for the multiarmed bandit problem. The Untyped post cites one of them, Finite-time Analysis of the Multiarmed Bandit Problem by Auer et al [PDF].

Thursday, December 1, 2011

Crude Oil and Oil Stocks

One would logically expect oil stocks to track with crude oil prices. For the most part, this is true. Below are the correlation coefficients between a select number of oil services, drillers, refiners, integrated conglomerates, an oil sector ETF (XLE), and an ETF for accessing crude prices (USO), imperfect as it may be due to futures roll. Though most companies track with USO, there are a couple of exceptions. Transocean (RIG) exhibits low correlation with other sector companies and even a negative correlation coefficient with the commodity itself. Halliburton (HAL) has also been exhibiting very low correlation coefficient (0.16).

Euphoria

The markets have been alternating quite quickly from euphoria to depression and back to euphoria. Today is one of those euphoric days. S&P e-mini December (ES) is up more than 3.2% as I write this. The 10-year (ZN) took a nose-dive after news from the central banks and ADP numbers (which were very good and came with a hefty upward revision of October numbers), but has recovered to a -0.23% loss. The dollar (DX) has tanked also at 78.450. The next big event is Friday Nonfarm Payrolls (NFP).

Tuesday, November 29, 2011

Share Buybacks, Part 2

In my last post on share buybacks, I started by breaking down the components of the Powershares Share Buyback Achievers ETF PKW. The fund PKW sports an average PE of 13, a yield of 0.78%, and a beta of 0.98. Over several periods, PKW outperformed SPY (with a bout of slight underperformance recently):

Scala Performance

I just found this experiment pitting several Python-based servers versus a Scala one. After "priming the pump" to get the VM to optimize the Scala one, Scala handled 3.6x more requests per second than the Python Twisted web server. It even handled 8.5% more requests than a largely C-infused Python server Fast Asyncronous Python WSGI Server. No memory use information was given, but these results do definitely tilt towards Scala.

Interestingly, at least according to the Computer Language Shootout (which admittedly has some issues), Scala significantly outperforms the listed JVM-hosted languages such as JRuby and Clojure (in execution time). JRuby has a steady lead in terms of code size.

Monday, November 28, 2011

Share Buybacks, Part 1

I recently read an interesting article on SeekingAlpha about IBM's share buyback record. The article says that over the past 16 years, IBM has been able to half its outstanding share count, which Warren Buffett claims as one reason why he bought $10B worth. That got me thinking, there is at least one ETF focused on share buybacks, PKW PowerShares Buyback Achievers. The fund has 139 components. What is IBM's performance relative to the rest of these? The fund has only been around since 12/2006, but presumably the index (Mergent DRB) has been around a while longer. Some of the components include TGT, DTV, TWX, TJX, COH, AZO, and GPS at 4.77%, 13.4%, 4.99%, 4.82%, 9.11%, 10.62%, and 10.35% decrease in common stock shares outstanding from FY 2010 to FY 2011 respectively. From the SeekingAlpha article, IBM has bought back anywhere from 0.01% to 8.73% of shares in a single year. Stay tuned for more information on the share buyback data point.

This is the first post in a series. The next post is here.

Friday, November 25, 2011

ETF Performance

Comparing ETF performance is actually somewhat tricky, though the issue isn't particular to ETFs. Though the CFA material has whole sections to deal with presentation of performance, there remains many variables. Returns can be annualized (though generally only performance terms 1 year or more should be annualized) or cumulative. The prices used could be bid-ask midpoints at closing (which is the preferred method by ETF issuers) or last trade price (which is the data available to us mere mortals). The performance can be total return (including any distributions such as dividends and capital distributions) or market price return. The cumulative return could end on today or the last business day of the preceding month. Then there is market and NAV performance. All these factors can make a significant difference in the calculated performance numbers, especially over longer periods. For illiquid issues, the bid-ask midpoints and last trade may differ considerably. Yesterday, I collected cumulative returns for IYY (iShares Dow US Broad Index ETF) from a few places:

Schwab Cumulative

Period	Mkt	NAV
1 mo	11.3	11.3
3 mo	-3.3	-3.2
6 mo	-8.0	-8.0
1 yr	-7.8	-7.8

Yahoo "Trailing Returns"

Period	Mkt	NAV
1 mo	11.30	11.33
3 mo	-3.26	-3.17
1 yr	7.76	7.84

iShares Cumulative

Period	Mkt	Total
1 mo	11.31	11.33
3 mo	-3.15	-3.16
6 mo	-7.96	-7.99
1 yr	7.83	7.87

Though the Schwab and Yahoo data are agreement, notice how the iShares reported performances diverges starting from the 3 mos point. In a low yield environment, the difference between -3.26% and -3.15% cumulative market returns is huge. I have not been able to reconcile this discrepancy exactly. It is not merely the $0.262 dividend distribution during the 3-mo period. I think the moral of the story is threefold:

take nothing for granted
always read the fine print on how performance is calculated
try to obtain performance data from multiple sources

Thursday, November 24, 2011

Yahoo Finance Gotcha

While writing scripts to reproduce the returns information on some ETFs, I found a silly gotcha when using data from Yahoo Finance. It turns out that the website and CSV historical quotes information interprets starting dates differently. If one asks for a starting day that is not a business day, the website will also give quote information for the last business day preceding your starting day. The CSV historical quotes has the opposite behavior. If the starting day is not a business day, the earliest quote that the CSV gives is the first business day after your starting day. For example, the earliest quote http://ichart.finance.yahoo.com/table.csv?s=SCHX&a=06&b=31&c=2011&d=09&e=31&f=2011&g=d&ignore=.csv will give is 2011-08-01 whereas http://finance.yahoo.com/q/hp?s=SCHX&a=06&b=31&c=2011&d=09&e=31&f=2011&g=d will give 2011-07-29, with July 31, 2011 (the month in the URL request is also 0-origin so 06 is July) being a Sunday. This is certainly something to watch out for when fine-tuning data collection from Yahoo.

Wednesday, November 23, 2011

Pre-Holiday Closing

The markets close for this short week due to the US holidays. The S&P 500 stands at 1161.79 and the 10-year treasury yield at 1.88. The Treasury market has been telegraphing the massive tail risk for quite a while now ever since February when yields hit 3.74 and have been tanking ever since. Now it looks like the core Euro-zone is feeling the heat with French and even German sovereign yields rising. The German 10-year bund still has a -22.9 basis point spread over the US 10-year, though the 2- and 5-year have 39.2 and 25.2 basis points spreads over their respective US counterparts. On the bright side, Durable Goods did not turn out too bad, but that is little consolation for the volatility wary.

Tuesday, November 22, 2011

Static Type System and Numerical Computing

The programming languages with which most people are familiar, such C/C++, Java, Python, JavaScript, and Ruby, presents a very stark contrast. On the one hand, you have statically typed languages (C/C++ and Java) which are amenable to a wide assortment of compile-time optimizations due to the presence of types. On the other, you have dynamically typed languages (Python, JavaScript, and Ruby) which are "fun" to program and amenable to rapid prototype due to more succinct syntax partly due to the absence of explicit type annotations. The sweet spot, in my humble opinion, is statically type but with type inference, a subject of my expertise and doctoral studies. I plan to write a few posts on this subject, near and dear to my heart. Heavy duty Numerical computing is one area where dynamically typed languages such as Matlab, R, and NumPy/Python have gained a considerable mindshare. Since NumPy offloads computations to efficient C libraries, this is quite reasonable. I would like to discuss where I think numerical computing and statistical analysis programs can benefit from a more sophisticated static type system with type inference. To whet your appetite, consider the case that C++'s Boost Library contains the functionality needed to statically check for units (i.e., dimensional analysis). The performance difference between code generated from statically typed languages and from dynamically typed languages is not just the absence of dynamic type checks. Types in static type systems enable a considerable array of optimizations to get rid of that last tiny ounce of latency.

Monday, November 21, 2011

Top 5 Open Source Quant Finance Projects

In the recent months, I have found the data on open source development projects to be quite fascinating. It turns out that most of the activity in open source quant finance code happens on Sourceforge. I pulled some of that data and charted them.

Datamining Sentiment from News Text

In an earlier post, I mentioned different companies using social media to gather information for stock market trades. The larger area of developing predictive analytics and datamining of news article text has several interesting players: Thomas Reuters [PDF], Dow Jones (RavenPack), Predictive Signal, and MarketPsych. Predictive Signal claims 10.4% positive returns (not annualized) over a nearly 3 month period (May to Aug 2011) versus an S&P 500 decline of 9.9%. MarketPsych cites a 25+% better returns than the S&P 500 with a third of the volatility from September 2008 to December 2010. In the latter, they actually traded over that 2 year period using their proprietary signals (MarketPsy Long-Short Fund LP).

Options Expiration Friday

Today is options expiration Friday and Leading Indicators at 10:00am. Futures currently read as follows:

Contract	Price
ES Dec	1223.00
ZN Dec	130'155
DX Dec	77.840

Thursday, November 17, 2011

Correlations

Looks like the stock market broke down a bit and is hanging outside the previous trading range. I put together a script to calculate correlation coefficients to see where things stand.

	spy	sdy	tlt	zroz	uup	gld	xle	xlf
spy	1.00	0.93	-0.68	-0.64	-0.34	-0.57	0.95	0.94
sdy	0.93	1.00	-0.41	-0.36	-0.15	-0.43	0.83	0.77
tlt	-0.68	-0.41	1.00	1.00	0.69	0.51	-0.77	-0.87
zroz	-0.64	-0.36	1.00	1.00	0.72	0.45	-0.75	-0.84
uup	-0.34	-0.15	0.69	0.72	1.00	-0.08	-0.53	-0.51
gld	-0.57	-0.43	0.51	0.45	-0.08	1.00	-0.42	-0.61
xle	0.95	0.83	-0.77	-0.75	-0.53	-0.42	1.00	0.95
xlf	0.94	0.77	-0.87	-0.84	-0.51	-0.61	0.95	1.00

The correlation coefficients are computed for the past 100-days of returns calculated from adjusted closing prices for each asset. There is a little something for everyone here. Equities (SPY) and dividend-paying equities (SDY) have strong correlations. Despite appreciating crude prices, energy stocks (XLE) are still strongly tied with equities as a whole. To get negative correlation with equities, one may look to treasuries (TLT and ZROZ [Pimco Long Dated Zeros]), the dollar, and gold. TLT and ZROZ similar but in terms of correlations with other assets, it happens to be TLT that exhibits the slightly more pronounced negative correlations despite ZROZ being more of a pure play. Surprisingly, the least correlated assets in this study are gold (GLD) and dollar (UUP).

Wednesday, November 16, 2011

Programming Language Gotchas

Looks like I spoke too soon. The October rally was looking a little long in tooth, and a little warning from the credit agencies was enough to get the market to go risk off at least for this afternoon. At least yields did not decline too much (2.01% for the 10-year according to MarketWatch). The Dollar Index (DX), however, has certainly been in an uptrend since the end of October trough.

Coming from the Standard ML world of languages makes me take many things for granted. It also gives me a probably pretty narrow perspective. Hacking on my trading analysis tools in other languages helps me broaden that somewhat, though I have certainly worked in the mainstream languages long before I encountered ML and its brethren. I would like to keep a record of these gotchas as I encounter them. I do not intend this as value judgments but rather a documentation of noteworthy differences from a higher-order typed programming language perspective.

Cash Secured Puts

Cash secured put writing is a venerable strategy where an investor sells puts while simultaneously keeping enough cash for the duration of the put contracts to cover possible exercise. Note that although puts tend to be exercised upon expiration, there is a significant probability of early exercise. This strategy is typically considered a conservative strategy, in the realm of options strategies at least, because an investor will not have to fork over any additional money to cover margin calls so max loss is capped at the exercise value of the put minus premium from the put writing. Typically, this strategy is utilized when an investor expects the underlying instrument to stay largely flat (i.e., appreciate or depreciate no more than the premium generated from writing the put contract). With an options qualified brokerage account, an investor can easily execute his own cash secured put writing strategy. Some ETF managers (e.g., WisdomTree), however, have also introduced the strategy as executed on broad equity indices in the form of an ETF.

Seeing that this morning was a benign CPI day with no earth-moving news from Europe, I thought that it would be an interesting exercise to look at a volatility-selling options strategy. The strategy is a fairly conservative one, cash secured put writing. My sample is the 341 component subset of the S&P 500 with options activity today (i.e., there are bids and asks). I will be only looking at the nearest to the money strikes to the last traded price of the underlying. The puts expire in less than 3 days (Friday, 11/19/11) so there really is not much time premium left. The returns are computed as put premium (midpoint of the bid and ask) over the strike times the 100 multiplier (i.e., the amount of cash needed in case of assignment). The VIX today (the last time I looked) remains elevated over 30.

The above is a histogram of the cash secured put returns. The horizontal axis marks the returns (put premium/ (100*strike)) and the vertical the number of securities that have that return. The median, mean, and standard deviation for the returns are 0.000675, 0.00101, and 0.00111 respectively. The max is 0.01 for TSS.

Tuesday, November 15, 2011

Server Infrastructure

Recently, I have been building a market data feed and order book simulator. Since I had already built an Ncurses front-end (console GUI) for Trader Workstation, this was the opportunity to try out something new. I decided to try out Lift/web, Scala's web framework. Lift/web is a perfectly fine framework. The ORM side of things (Mapper for RDBMS and Record geared more towards NoSQL) still seem to be a work-in-progress (though a very usable work-in-progress), but Comet (Ajax-Push) works as advertised. In my experience, however, memory overhead seems to be a real issue. During development, Lift managed to gobble up all the default heap space after just a few source updates. None of the requests were large by any means. Apparently, there are a few configuration tweaks I still need to try to get this under control, but this whole episode got me thinking: with all the emphasis on massively scalable webapps (it seems everyone aspires to serve millions and billions of requests), why do most people tolerate the memory use profiles of Rails/Ruby, Python/Django, and PHP?

There are probably plenty of issues with this study on Oscigen, an OCaml-based web app framework, but it does seem to reflect the reality that there is a lot of inefficiency in modern web framework stacks. The above benchmark claims that under fairly benign circumstances, Oscigen outperforms Rails by 10x in terms of both requests/sec and reduced memory use (4.5MB versus 49MB). Oscigen's memory profile apparently ties with a simple C-based web app (using FastCGI). Interestingly, on the opposite end of the spectrum, there is also a C++-based web stack, Wt. Regardless of the merits of the benchmark, the fact is that people put up with a considerable amount of memory use. Considering that memory on Amazon EC2 is not exactly free, it does concern me. On one web app hosting organization's site, Micro instances are only good for PHP apps whereas Small instances are needed for the lightest of Rails and Java apps. Large servers with 7.5GB RAM (which currently run from $244/month) are needed for "full-size" Rails and Java apps. Is the current state of things a case of memory and servers are too cheap or caching covers over all problems? There is always the red herring of network latency, but if concurrent requests served with constrained resources is the main metric then memory use matters.

Or maybe it is because good garbage collectors are hard to come by. At least at one point, garbage collection was an issue in Ruby. This chain of thought led me to look up whether there are any good solutions for web stacks with low memory footprints, especially any that come in neat statically-typed packages. There is certainly Oscigen/OCaml. The StackOverflow people have also suggested Adam Chlipala's Ur/Web system. In the latter stack, after some searching through mail list archives, I have learned that Ur/Web uses region memory management (where the boundaries of a request delimits regions). The main website kept mentioning that Ur/Web is fast at least in part due to not having a garbage collector. That, of course, got me curious as to what did it use for memory management. Ur/Web also seems to be built on Standard ML (the Makefile has settings for both SML/NJ and MLton), which is certainly a plus according to my bias. Though explicit regions (aka memory pools [but careful, this is an overloaded term] or arenas) are pretty efficient (being part of Apache, nginx, PostgreSQL), region inference has its own issues, principally that programs written in a region-unfriendly fashion can lead to poor performance. The consensus on the matter appears to be that regions and garbage collection can be combined in a single system to address region-unfriendliness while simultaneously minimizing garbage collection.

Monday, November 14, 2011

Genetic Programming Performance

There is a whole lot of literature on the subject of genetic and evolutionary algorithms for trading strategies. The quality of the work seems to vary considerably. I have yet to discover the canonical and seminal lines of work in the field. Unlike in programming languages research and perhaps other scientific fields, the bibliographies do not all follow the same chain of citations. Or perhaps they do, but I have yet to discover it. More on this after the jump.

Genetic Programming and Trading Strategies

From an anecdotal sampling of Linkedin and web discussions, people have applied various machine learning techniques to the creation of trading strategies. The most popular techniques for trading strategies mirror that of the general applied machine learning community, meaning SVM and neural networks are frequently mentioned. One relatively lesser known technique is genetic programming (GP), a variant of genetic algorithms for tree data structures, normally abstract syntax trees. Curiously, the academic community has studied genetic programming quite a lot. Indeed, there are many papers on genetic programming for trade strategies in GECCO (a conference on genetic and evolutionary computation) and other conferences/workshops. Some people from State Street have even contributed their study. The Quant Stack Exchange has a discussion on genetic algorithms where the problem of data snooping is frequently mentioned as a shortcoming, a problem addressed by regularization in conventional learning techniques. I informally survey some of the studies of the effectiveness of GP in a follow-up post.

My cursory interest in this area is the potential impact of programming language design and especially static type systems on the performance of genetic programming for trading strategies. The nice thing about genetic programming is that the results produced can be theoretically unrestricted unlike basic machine learning techniques that make assumptions such as linearity about the underlying model. Practically, however, the results produced by genetic programming are usually small and possibly uninteresting. This is where modern programming languages theory comes in. In the past decade or so, a number of researchers have studied the problem of the generation of random test cases (of which QuickCheck is a prime example). One of the main engineering challenges there is to ensure that the test cases generated do not all fall into a trivial category. The test cases here can range from instances of simple primitive types (e.g., ints) to nontrivial data structures (e.g., syntax trees or data packet queues). Usually it falls to the developer to write functions that generate a good distribution of test cases to ensure, for example, that the probability of getting a list of length greater than 3 is not negligible. In practice, QuickCheck supplies an assortment of combinators that help produce a good distribution. The developer's domain knowledge is still necessary, but the test generation problem reduces down to a choice of combinators.

Monday, November 7, 2011

Semantics of Nested Classes

Although nested classes are widely adopted in many popular programming languages (e.g., C++, C#, Java, Python, Ruby), I have yet to see a good treatment giving the motivation, use cases, and architectural issues associated with the feature. Moreover, another undocumented aspect is how the semantics of nested classes vary from language to language. They appear to be useful for namespace management (though C++ namespace and Java packages also cover this), simulating closures, and callbacks for GUIs and event handling (basically a variant of closure simulation).
First thing is first, what are nested classes?

class A {
  class B { }
}

In essence, they are scoped type declarations and definitions. Instead of defining a class B at the top-level of the program source, one declares a locally scoped one within the definition of another class, A in this case. One immediate consequence of this is a form of namespace management for user-defined aggregate types (i.e., classes).
There are a lot of gotchas due the language design here. The scoping discipline is incomplete. C/C++'s restriction on shadowing prohibits the following code:

 
class B; 

class A {
  B something;
  class B { };
}

In contrast, the following code compiles:

class A {
  private:
    class B {};
  public:
    B makeB();
}

Despite the fact that A::B is unnameable outside an instance of A, A exports the member function makeB which yields a value of type A::B. Though one cannot name the return value of makeB outside of A, the value can still be used as is in the case in the following:

class A {
  private:
    class B { 
      int x;
       public: 
      B add(B _x) { x += _x; }
    };
  public:
    B makeB();
};

int f() {
  A a0, a1;
  a0.makeB().add(a1.makeB());
}

The above illustrates an important point. C++ nested classes do not declare distinct types for each instance. Although one cannot annotate them as static, they are exactly that. In the example, a0.B and a1.B are of the same type. This is why a0.makeB().add(a1.makeB()) type checks. This means that writing a nested class is equivalent to lifting that class definition out to the top-level and selectively hiding the name of the nested class.
Between different languages, there is a design chasm. In C++, there is only the unitary nested class and its associated semantics. Java distinguishes between static nested classes (similar to the C++ nested class) and inner classes. Java actually has a number of variants of nested classes (https://blogs.oracle.com/darcy/entry/nested_inner_member_and_top). The above C++ example translated into Java fails type checking because of A.B is private.

class A {
      private class B {

        public B add(B rhs) { val += rhs.val; return this; }
        private int val;

      }

      public B makeB() { return new B(); };
}

public class Nested0 {

   public static void main() {
     A a0 = new A();
     A a1 = new A();
     a0.makeB().add(a1.makeB());
   }
}

The call to add is flagged as a problem because a0.makeB()'s type (i.e., A.B) is inaccessible. If A.B were public or protected, however, the type checker is satisfied despite a0.B and a1.B supposedly denoting two distinct inner classes. Thus, at least at this point, inner classes do not provide abstract types in the sense of generative types.

On the other hand, as this discussion on Java decompilation (javap) shows, the semantics of inner classes are such that have a kind of C++

friend

relationship with their outer enclosing class. The discussion on TheServerSide shows how this is implemented: the compiler generates a static method that "breaks the encapsulation wall" to permit accessing private members of the outer class.

References

Herb Sutter wrote an article on nested classes in C++.

Thursday, December 22, 2011

Tuesday, December 20, 2011

Monday, December 19, 2011

Friday, December 16, 2011

Thursday, December 15, 2011

Wednesday, December 14, 2011

Tuesday, December 13, 2011

Monday, December 12, 2011

Saturday, December 10, 2011

Thursday, December 8, 2011