Thursday, April 5, 2012

Probabilistic Counter

The High Scalability blog has a recent post on probabilistic algorithms for approximate counting. The author unfortunately coined the term "linear probabilistic counter", which returns no results in a Google search. The more usual terminology is linear-time probabilistic counting (a paper from KAIST gives the original definition from 1990) and, closely related, approximate counting. The former paper describes linear counting as a hashing-based counting algorithm that approximates column cardinality. The name comes from the fact that the count estimator should be approximately proportional to (linear in) the true count. The study of probabilistic algorithms is indeed a growing field. It is also tied to that oft-forgotten technique of sampling in this modern age of big data. There are many related concepts here. The common theme to these kinds of problems is a massive number of entries, where each entry is constrained to come from some finite set.

The basic linear counting algorithm consists of

  1. initializing a new size-m bitmap to all 0s and, for each value, setting the bit at that value's hash address
  2. computing the estimator of the column count as n = -m ln V, where V is the fraction of the size-m bitmap that remains empty
The complete linear counting algorithm parametrizes this process by the desired standard error.
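The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation; the choice of MD5 as the hash function and the bitmap size m = 1024 are my own assumptions.

```python
import hashlib
import math

def linear_count(values, m=1024):
    """Estimate the number of distinct values via linear counting.

    m is the bitmap size; a larger m lowers the standard error.
    """
    # Step 1: size-m bitmap, all zeros; set the bit at each value's hash address.
    bitmap = [0] * m
    for v in values:
        # MD5 is an arbitrary choice here; any well-mixing hash works.
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % m
        bitmap[h] = 1
    # Step 2: V is the fraction of the bitmap that remains empty.
    V = bitmap.count(0) / m
    if V == 0:
        # Bitmap saturated: the estimator diverges, so m was too small.
        return float("inf")
    return -m * math.log(V)
```

For example, `linear_count(i % 100 for i in range(10000))` sees only 100 distinct values, and the estimate comes out close to 100 despite the 10,000 entries.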

Probably the most widely known related algorithms are the family of non-comparative integer sorts such as radix sort, bucket sort, and counting sort, all of which appear in Introduction to Algorithms. By exploiting structure beyond pairwise comparison, these sorts can run in O(n) time instead of the Ω(n lg n) lower bound for comparison-based sorting. This holds only when the structure we rely on is not more complex than the size of the data we're interested in sorting. For example, if the range of the integers (or the maximum difference between integers) is much larger than the number of integers we're sorting, then counting sort is inappropriate, since its running time is O(n + k), where k is the maximum "key size." When k dwarfs n, counting sort doesn't make sense. Bitmap sorting (an example in Programming Pearls (2nd Edition)) follows along similar lines.
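To make the O(n + k) point concrete, here is a minimal counting sort sketch, where k is the (exclusive) upper bound on the keys; the function name and signature are my own.

```python
def counting_sort(arr, k):
    """Sort non-negative integers below k in O(n + k) time, with no comparisons."""
    # Tally how many times each key occurs: O(n).
    counts = [0] * k
    for x in arr:
        counts[x] += 1
    # Emit each key as many times as it occurred: O(n + k).
    out = []
    for value, c in enumerate(counts):
        out.extend([value] * c)
    return out
```

The `[0] * k` allocation is exactly where the k term comes from: if the keys range up to a billion but the list has only a thousand elements, that table dwarfs the input and a comparison sort wins.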
