Programming languages, Cloud, and Financial Markets: November 2011

Wednesday, November 30, 2011

Euphoria

The markets have been alternating quite quickly from euphoria to depression and back to euphoria. Today is one of those euphoric days. S&P e-mini December (ES) is up more than 3.2% as I write this. The 10-year (ZN) took a nose-dive after news from the central banks and ADP numbers (which were very good and came with a hefty upward revision of October numbers), but has recovered to a -0.23% loss. The dollar (DX) has tanked also at 78.450. The next big event is Friday Nonfarm Payrolls (NFP).

Tuesday, November 29, 2011

Share Buybacks, Part 2

In my last post on share buybacks, I started by breaking down the components of the Powershares Share Buyback Achievers ETF PKW. The fund PKW sports an average PE of 13, a yield of 0.78%, and a beta of 0.98. Over several periods, PKW outperformed SPY (with a bout of slight underperformance recently):

Scala Performance

I just found this experiment pitting several Python-based servers versus a Scala one. After "priming the pump" to get the VM to optimize the Scala one, Scala handled 3.6x more requests per second than the Python Twisted web server. It even handled 8.5% more requests than a largely C-infused Python server Fast Asyncronous Python WSGI Server. No memory use information was given, but these results do definitely tilt towards Scala.

Interestingly, at least according to the Computer Language Shootout (which admittedly has some issues), Scala significantly outperforms the listed JVM-hosted languages such as JRuby and Clojure (in execution time). JRuby has a steady lead in terms of code size.

Monday, November 28, 2011

Share Buybacks, Part 1

I recently read an interesting article on SeekingAlpha about IBM's share buyback record. The article says that over the past 16 years, IBM has been able to half its outstanding share count, which Warren Buffett claims as one reason why he bought $10B worth. That got me thinking, there is at least one ETF focused on share buybacks, PKW PowerShares Buyback Achievers. The fund has 139 components. What is IBM's performance relative to the rest of these? The fund has only been around since 12/2006, but presumably the index (Mergent DRB) has been around a while longer. Some of the components include TGT, DTV, TWX, TJX, COH, AZO, and GPS at 4.77%, 13.4%, 4.99%, 4.82%, 9.11%, 10.62%, and 10.35% decrease in common stock shares outstanding from FY 2010 to FY 2011 respectively. From the SeekingAlpha article, IBM has bought back anywhere from 0.01% to 8.73% of shares in a single year. Stay tuned for more information on the share buyback data point.

This is the first post in a series. The next post is here.

Friday, November 25, 2011

ETF Performance

Comparing ETF performance is actually somewhat tricky, though the issue isn't particular to ETFs. Though the CFA material has whole sections to deal with presentation of performance, there remains many variables. Returns can be annualized (though generally only performance terms 1 year or more should be annualized) or cumulative. The prices used could be bid-ask midpoints at closing (which is the preferred method by ETF issuers) or last trade price (which is the data available to us mere mortals). The performance can be total return (including any distributions such as dividends and capital distributions) or market price return. The cumulative return could end on today or the last business day of the preceding month. Then there is market and NAV performance. All these factors can make a significant difference in the calculated performance numbers, especially over longer periods. For illiquid issues, the bid-ask midpoints and last trade may differ considerably. Yesterday, I collected cumulative returns for IYY (iShares Dow US Broad Index ETF) from a few places:

Schwab Cumulative

Period	Mkt	NAV
1 mo	11.3	11.3
3 mo	-3.3	-3.2
6 mo	-8.0	-8.0
1 yr	-7.8	-7.8

Yahoo "Trailing Returns"

Period	Mkt	NAV
1 mo	11.30	11.33
3 mo	-3.26	-3.17
1 yr	7.76	7.84

iShares Cumulative

Period	Mkt	Total
1 mo	11.31	11.33
3 mo	-3.15	-3.16
6 mo	-7.96	-7.99
1 yr	7.83	7.87

Though the Schwab and Yahoo data are agreement, notice how the iShares reported performances diverges starting from the 3 mos point. In a low yield environment, the difference between -3.26% and -3.15% cumulative market returns is huge. I have not been able to reconcile this discrepancy exactly. It is not merely the $0.262 dividend distribution during the 3-mo period. I think the moral of the story is threefold:

take nothing for granted
always read the fine print on how performance is calculated
try to obtain performance data from multiple sources

Thursday, November 24, 2011

Yahoo Finance Gotcha

While writing scripts to reproduce the returns information on some ETFs, I found a silly gotcha when using data from Yahoo Finance. It turns out that the website and CSV historical quotes information interprets starting dates differently. If one asks for a starting day that is not a business day, the website will also give quote information for the last business day preceding your starting day. The CSV historical quotes has the opposite behavior. If the starting day is not a business day, the earliest quote that the CSV gives is the first business day after your starting day. For example, the earliest quote http://ichart.finance.yahoo.com/table.csv?s=SCHX&a=06&b=31&c=2011&d=09&e=31&f=2011&g=d&ignore=.csv will give is 2011-08-01 whereas http://finance.yahoo.com/q/hp?s=SCHX&a=06&b=31&c=2011&d=09&e=31&f=2011&g=d will give 2011-07-29, with July 31, 2011 (the month in the URL request is also 0-origin so 06 is July) being a Sunday. This is certainly something to watch out for when fine-tuning data collection from Yahoo.

Wednesday, November 23, 2011

Pre-Holiday Closing

The markets close for this short week due to the US holidays. The S&P 500 stands at 1161.79 and the 10-year treasury yield at 1.88. The Treasury market has been telegraphing the massive tail risk for quite a while now ever since February when yields hit 3.74 and have been tanking ever since. Now it looks like the core Euro-zone is feeling the heat with French and even German sovereign yields rising. The German 10-year bund still has a -22.9 basis point spread over the US 10-year, though the 2- and 5-year have 39.2 and 25.2 basis points spreads over their respective US counterparts. On the bright side, Durable Goods did not turn out too bad, but that is little consolation for the volatility wary.

Tuesday, November 22, 2011

Static Type System and Numerical Computing

The programming languages with which most people are familiar, such C/C++, Java, Python, JavaScript, and Ruby, presents a very stark contrast. On the one hand, you have statically typed languages (C/C++ and Java) which are amenable to a wide assortment of compile-time optimizations due to the presence of types. On the other, you have dynamically typed languages (Python, JavaScript, and Ruby) which are "fun" to program and amenable to rapid prototype due to more succinct syntax partly due to the absence of explicit type annotations. The sweet spot, in my humble opinion, is statically type but with type inference, a subject of my expertise and doctoral studies. I plan to write a few posts on this subject, near and dear to my heart. Heavy duty Numerical computing is one area where dynamically typed languages such as Matlab, R, and NumPy/Python have gained a considerable mindshare. Since NumPy offloads computations to efficient C libraries, this is quite reasonable. I would like to discuss where I think numerical computing and statistical analysis programs can benefit from a more sophisticated static type system with type inference. To whet your appetite, consider the case that C++'s Boost Library contains the functionality needed to statically check for units (i.e., dimensional analysis). The performance difference between code generated from statically typed languages and from dynamically typed languages is not just the absence of dynamic type checks. Types in static type systems enable a considerable array of optimizations to get rid of that last tiny ounce of latency.

Monday, November 21, 2011

Top 5 Open Source Quant Finance Projects

In the recent months, I have found the data on open source development projects to be quite fascinating. It turns out that most of the activity in open source quant finance code happens on Sourceforge. I pulled some of that data and charted them.

Datamining Sentiment from News Text

In an earlier post, I mentioned different companies using social media to gather information for stock market trades. The larger area of developing predictive analytics and datamining of news article text has several interesting players: Thomas Reuters [PDF], Dow Jones (RavenPack), Predictive Signal, and MarketPsych. Predictive Signal claims 10.4% positive returns (not annualized) over a nearly 3 month period (May to Aug 2011) versus an S&P 500 decline of 9.9%. MarketPsych cites a 25+% better returns than the S&P 500 with a third of the volatility from September 2008 to December 2010. In the latter, they actually traded over that 2 year period using their proprietary signals (MarketPsy Long-Short Fund LP).

Options Expiration Friday

Today is options expiration Friday and Leading Indicators at 10:00am. Futures currently read as follows:

Contract	Price
ES Dec	1223.00
ZN Dec	130'155
DX Dec	77.840

Thursday, November 17, 2011

Correlations

Looks like the stock market broke down a bit and is hanging outside the previous trading range. I put together a script to calculate correlation coefficients to see where things stand.

	spy	sdy	tlt	zroz	uup	gld	xle	xlf
spy	1.00	0.93	-0.68	-0.64	-0.34	-0.57	0.95	0.94
sdy	0.93	1.00	-0.41	-0.36	-0.15	-0.43	0.83	0.77
tlt	-0.68	-0.41	1.00	1.00	0.69	0.51	-0.77	-0.87
zroz	-0.64	-0.36	1.00	1.00	0.72	0.45	-0.75	-0.84
uup	-0.34	-0.15	0.69	0.72	1.00	-0.08	-0.53	-0.51
gld	-0.57	-0.43	0.51	0.45	-0.08	1.00	-0.42	-0.61
xle	0.95	0.83	-0.77	-0.75	-0.53	-0.42	1.00	0.95
xlf	0.94	0.77	-0.87	-0.84	-0.51	-0.61	0.95	1.00

The correlation coefficients are computed for the past 100-days of returns calculated from adjusted closing prices for each asset. There is a little something for everyone here. Equities (SPY) and dividend-paying equities (SDY) have strong correlations. Despite appreciating crude prices, energy stocks (XLE) are still strongly tied with equities as a whole. To get negative correlation with equities, one may look to treasuries (TLT and ZROZ [Pimco Long Dated Zeros]), the dollar, and gold. TLT and ZROZ similar but in terms of correlations with other assets, it happens to be TLT that exhibits the slightly more pronounced negative correlations despite ZROZ being more of a pure play. Surprisingly, the least correlated assets in this study are gold (GLD) and dollar (UUP).

Wednesday, November 16, 2011

Programming Language Gotchas

Looks like I spoke too soon. The October rally was looking a little long in tooth, and a little warning from the credit agencies was enough to get the market to go risk off at least for this afternoon. At least yields did not decline too much (2.01% for the 10-year according to MarketWatch). The Dollar Index (DX), however, has certainly been in an uptrend since the end of October trough.

Coming from the Standard ML world of languages makes me take many things for granted. It also gives me a probably pretty narrow perspective. Hacking on my trading analysis tools in other languages helps me broaden that somewhat, though I have certainly worked in the mainstream languages long before I encountered ML and its brethren. I would like to keep a record of these gotchas as I encounter them. I do not intend this as value judgments but rather a documentation of noteworthy differences from a higher-order typed programming language perspective.

Cash Secured Puts

Cash secured put writing is a venerable strategy where an investor sells puts while simultaneously keeping enough cash for the duration of the put contracts to cover possible exercise. Note that although puts tend to be exercised upon expiration, there is a significant probability of early exercise. This strategy is typically considered a conservative strategy, in the realm of options strategies at least, because an investor will not have to fork over any additional money to cover margin calls so max loss is capped at the exercise value of the put minus premium from the put writing. Typically, this strategy is utilized when an investor expects the underlying instrument to stay largely flat (i.e., appreciate or depreciate no more than the premium generated from writing the put contract). With an options qualified brokerage account, an investor can easily execute his own cash secured put writing strategy. Some ETF managers (e.g., WisdomTree), however, have also introduced the strategy as executed on broad equity indices in the form of an ETF.

Seeing that this morning was a benign CPI day with no earth-moving news from Europe, I thought that it would be an interesting exercise to look at a volatility-selling options strategy. The strategy is a fairly conservative one, cash secured put writing. My sample is the 341 component subset of the S&P 500 with options activity today (i.e., there are bids and asks). I will be only looking at the nearest to the money strikes to the last traded price of the underlying. The puts expire in less than 3 days (Friday, 11/19/11) so there really is not much time premium left. The returns are computed as put premium (midpoint of the bid and ask) over the strike times the 100 multiplier (i.e., the amount of cash needed in case of assignment). The VIX today (the last time I looked) remains elevated over 30.

The above is a histogram of the cash secured put returns. The horizontal axis marks the returns (put premium/ (100*strike)) and the vertical the number of securities that have that return. The median, mean, and standard deviation for the returns are 0.000675, 0.00101, and 0.00111 respectively. The max is 0.01 for TSS.

Tuesday, November 15, 2011

Server Infrastructure

Recently, I have been building a market data feed and order book simulator. Since I had already built an Ncurses front-end (console GUI) for Trader Workstation, this was the opportunity to try out something new. I decided to try out Lift/web, Scala's web framework. Lift/web is a perfectly fine framework. The ORM side of things (Mapper for RDBMS and Record geared more towards NoSQL) still seem to be a work-in-progress (though a very usable work-in-progress), but Comet (Ajax-Push) works as advertised. In my experience, however, memory overhead seems to be a real issue. During development, Lift managed to gobble up all the default heap space after just a few source updates. None of the requests were large by any means. Apparently, there are a few configuration tweaks I still need to try to get this under control, but this whole episode got me thinking: with all the emphasis on massively scalable webapps (it seems everyone aspires to serve millions and billions of requests), why do most people tolerate the memory use profiles of Rails/Ruby, Python/Django, and PHP?

There are probably plenty of issues with this study on Oscigen, an OCaml-based web app framework, but it does seem to reflect the reality that there is a lot of inefficiency in modern web framework stacks. The above benchmark claims that under fairly benign circumstances, Oscigen outperforms Rails by 10x in terms of both requests/sec and reduced memory use (4.5MB versus 49MB). Oscigen's memory profile apparently ties with a simple C-based web app (using FastCGI). Interestingly, on the opposite end of the spectrum, there is also a C++-based web stack, Wt. Regardless of the merits of the benchmark, the fact is that people put up with a considerable amount of memory use. Considering that memory on Amazon EC2 is not exactly free, it does concern me. On one web app hosting organization's site, Micro instances are only good for PHP apps whereas Small instances are needed for the lightest of Rails and Java apps. Large servers with 7.5GB RAM (which currently run from $244/month) are needed for "full-size" Rails and Java apps. Is the current state of things a case of memory and servers are too cheap or caching covers over all problems? There is always the red herring of network latency, but if concurrent requests served with constrained resources is the main metric then memory use matters.

Or maybe it is because good garbage collectors are hard to come by. At least at one point, garbage collection was an issue in Ruby. This chain of thought led me to look up whether there are any good solutions for web stacks with low memory footprints, especially any that come in neat statically-typed packages. There is certainly Oscigen/OCaml. The StackOverflow people have also suggested Adam Chlipala's Ur/Web system. In the latter stack, after some searching through mail list archives, I have learned that Ur/Web uses region memory management (where the boundaries of a request delimits regions). The main website kept mentioning that Ur/Web is fast at least in part due to not having a garbage collector. That, of course, got me curious as to what did it use for memory management. Ur/Web also seems to be built on Standard ML (the Makefile has settings for both SML/NJ and MLton), which is certainly a plus according to my bias. Though explicit regions (aka memory pools [but careful, this is an overloaded term] or arenas) are pretty efficient (being part of Apache, nginx, PostgreSQL), region inference has its own issues, principally that programs written in a region-unfriendly fashion can lead to poor performance. The consensus on the matter appears to be that regions and garbage collection can be combined in a single system to address region-unfriendliness while simultaneously minimizing garbage collection.

Monday, November 14, 2011

Genetic Programming Performance

There is a whole lot of literature on the subject of genetic and evolutionary algorithms for trading strategies. The quality of the work seems to vary considerably. I have yet to discover the canonical and seminal lines of work in the field. Unlike in programming languages research and perhaps other scientific fields, the bibliographies do not all follow the same chain of citations. Or perhaps they do, but I have yet to discover it. More on this after the jump.

Genetic Programming and Trading Strategies

From an anecdotal sampling of Linkedin and web discussions, people have applied various machine learning techniques to the creation of trading strategies. The most popular techniques for trading strategies mirror that of the general applied machine learning community, meaning SVM and neural networks are frequently mentioned. One relatively lesser known technique is genetic programming (GP), a variant of genetic algorithms for tree data structures, normally abstract syntax trees. Curiously, the academic community has studied genetic programming quite a lot. Indeed, there are many papers on genetic programming for trade strategies in GECCO (a conference on genetic and evolutionary computation) and other conferences/workshops. Some people from State Street have even contributed their study. The Quant Stack Exchange has a discussion on genetic algorithms where the problem of data snooping is frequently mentioned as a shortcoming, a problem addressed by regularization in conventional learning techniques. I informally survey some of the studies of the effectiveness of GP in a follow-up post.

My cursory interest in this area is the potential impact of programming language design and especially static type systems on the performance of genetic programming for trading strategies. The nice thing about genetic programming is that the results produced can be theoretically unrestricted unlike basic machine learning techniques that make assumptions such as linearity about the underlying model. Practically, however, the results produced by genetic programming are usually small and possibly uninteresting. This is where modern programming languages theory comes in. In the past decade or so, a number of researchers have studied the problem of the generation of random test cases (of which QuickCheck is a prime example). One of the main engineering challenges there is to ensure that the test cases generated do not all fall into a trivial category. The test cases here can range from instances of simple primitive types (e.g., ints) to nontrivial data structures (e.g., syntax trees or data packet queues). Usually it falls to the developer to write functions that generate a good distribution of test cases to ensure, for example, that the probability of getting a list of length greater than 3 is not negligible. In practice, QuickCheck supplies an assortment of combinators that help produce a good distribution. The developer's domain knowledge is still necessary, but the test generation problem reduces down to a choice of combinators.

Monday, November 7, 2011

Semantics of Nested Classes

Although nested classes are widely adopted in many popular programming languages (e.g., C++, C#, Java, Python, Ruby), I have yet to see a good treatment giving the motivation, use cases, and architectural issues associated with the feature. Moreover, another undocumented aspect is how the semantics of nested classes vary from language to language. They appear to be useful for namespace management (though C++ namespace and Java packages also cover this), simulating closures, and callbacks for GUIs and event handling (basically a variant of closure simulation).
First thing is first, what are nested classes?

class A {
  class B { }
}

In essence, they are scoped type declarations and definitions. Instead of defining a class B at the top-level of the program source, one declares a locally scoped one within the definition of another class, A in this case. One immediate consequence of this is a form of namespace management for user-defined aggregate types (i.e., classes).
There are a lot of gotchas due the language design here. The scoping discipline is incomplete. C/C++'s restriction on shadowing prohibits the following code:

 
class B; 

class A {
  B something;
  class B { };
}

In contrast, the following code compiles:

class A {
  private:
    class B {};
  public:
    B makeB();
}

Despite the fact that A::B is unnameable outside an instance of A, A exports the member function makeB which yields a value of type A::B. Though one cannot name the return value of makeB outside of A, the value can still be used as is in the case in the following:

class A {
  private:
    class B { 
      int x;
       public: 
      B add(B _x) { x += _x; }
    };
  public:
    B makeB();
};

int f() {
  A a0, a1;
  a0.makeB().add(a1.makeB());
}

The above illustrates an important point. C++ nested classes do not declare distinct types for each instance. Although one cannot annotate them as static, they are exactly that. In the example, a0.B and a1.B are of the same type. This is why a0.makeB().add(a1.makeB()) type checks. This means that writing a nested class is equivalent to lifting that class definition out to the top-level and selectively hiding the name of the nested class.
Between different languages, there is a design chasm. In C++, there is only the unitary nested class and its associated semantics. Java distinguishes between static nested classes (similar to the C++ nested class) and inner classes. Java actually has a number of variants of nested classes (https://blogs.oracle.com/darcy/entry/nested_inner_member_and_top). The above C++ example translated into Java fails type checking because of A.B is private.

class A {
      private class B {

        public B add(B rhs) { val += rhs.val; return this; }
        private int val;

      }

      public B makeB() { return new B(); };
}

public class Nested0 {

   public static void main() {
     A a0 = new A();
     A a1 = new A();
     a0.makeB().add(a1.makeB());
   }
}

The call to add is flagged as a problem because a0.makeB()'s type (i.e., A.B) is inaccessible. If A.B were public or protected, however, the type checker is satisfied despite a0.B and a1.B supposedly denoting two distinct inner classes. Thus, at least at this point, inner classes do not provide abstract types in the sense of generative types.

On the other hand, as this discussion on Java decompilation (javap) shows, the semantics of inner classes are such that have a kind of C++

friend

relationship with their outer enclosing class. The discussion on TheServerSide shows how this is implemented: the compiler generates a static method that "breaks the encapsulation wall" to permit accessing private members of the outer class.

References

Herb Sutter wrote an article on nested classes in C++.

Programming languages, Cloud, and Financial Markets