## Wednesday, December 7, 2011

### Save bits, save power, save the world, Part 1

Ok, the title was a little tongue-in-cheek. I did a fairly whimsical experiment yesterday. Since text strings are represented as bits which are either 1 (on) or 0 (off), I was wondering which unused domain names used the minimum number of "on" bits. The answer may be non-intuitive to the non-computer scientist. Here are the first few letters sorted according to the "bit population count" (number of on bits in their representation, aka Hamming weight which is useful in cryptography, distributed hash tables, and some optimal representation problems): A, a, B, b, D, d, H, h, P, p, C, c, E, e, F, f, I, i, J, j, L, l, Q, q, R. Notice that all uppercase letter come before their lowercase counterpart. More interestingly, d comes before c, h before c, e, f, so on and so forth. This is because d's (base 10 rep: 100) binary representation is 1100100 whereas c's (base 10 rep: 99) is 1100011. A naive calculation would conclude that using d instead of c saves 25% of the power (3 on bits instead of 4 on bits). Alas, that isn't how microprocessors and transistors work, but I will discuss that later.

(From Wikipedia)

The popcounts of the lowercase ASCII alphabet are the following:

 a b c d e f g h i j k l m n o p q r s t u v w x y z 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6 3 4 4 5 4 5 5 6 4 5 5

Bit population count is a well-studied problem with a number of fast algorithms. It is so important that Intel has included a machine instruction for it `POPCNT`. GCC 4.5 and LLVM 2.5 support the instruction through the builtin function `__builtin_popcount`. For LLVM, if the instruction isn't supported or enabled, it falls back on a variation of HAKMEM. I conducted a simple benchmark of the leading algorithms and the builtin instruction.

AlgorithmExecution Time (ms)
MIT HAKMEM 169523
Iterated617
Precomputed 8-bit460
Builtin popcnt378
Builtin popcnt g++ -O3172

From the results, you can see that precomputed lookup gets pretty close to the performance of the builtin. Given that the data I ran the program on used only ASCII characters, precomputing the 16-bit pop counts would not have improved the performance any further. Ultimately, the precomputed lookup still has to do a handful of instructions worth of work as opposed to the builtin single instruction.

Using the builtin popcnt instruction reduces multiple instructions to a single one for a particularly tight loop body:

### Loop body for Intel SSE4

```movzbl  (%rax), %eax
movsbl  %al, %eax
popcnt  %eax, %eax
addl    %eax, -24(%rbp)
leaq    -96(%rbp), %rax
movl    \$0, %esi
movq    %rax, %rdi
```

For the precomputed counts, the execution times other interpreters and compilers are as follows:

NameTime (ms)
Python3599
Ruby4982
Java370
Scala689
Java's `Integer.bitCount` appears to be extraordinarily fast at 370 ms. It is highly likely that the Java VM is using the builtin instruction. Given that performance, it shouldn't be surprising that Scala performs at 689 ms. For OCaml, there is a third-party extended library that includes a popcount function, but that function implements the naive algorithm.

See Part 2 for OCaml and Haskell performance measurements.

### Experiment Description

The experiment consisted of computing the bit population counts for each word in a 9.5Mb file generated by concatenating /usr/share/dict/words 4 times. The platform was a Mac Pro Xeon W3530 2.8GHz machine with 3Gb of RAM. Python was 2.6.1. Ruby was version 1.9.2p180. G++ was 4.6. Java was 1.6.0_26. Scala was 2.9.0.1.

### Python

```#!/usr/bin/env python

words = open('words4', 'r')

letters = {'A': 2, 'C': 3, 'B': 2, 'E': 3, 'D': 2, 'G': 4,
'F': 3, 'I': 3, 'H': 2, 'K': 4, 'J': 3, 'M': 4, 'L': 3,
'O': 5, 'N': 4, 'Q': 3, 'P': 2, 'S': 4, 'R': 3, 'U': 4,
'T': 3, 'W': 5, 'V': 4, 'Y': 4, 'X': 3, '[': 5, 'Z': 4,
']': 5, '\\': 4, '_': 6, '^': 5, 'a': 3, '`': 2, 'c': 4,
'b': 3, 'e': 4, 'd': 3, 'g': 5, 'f': 4, 'i': 4, 'h': 3,
'k': 5, 'j': 4, 'm': 5, 'l': 4, 'o': 6, 'n': 5, 'q': 4,
'p': 3, 's': 5, 'r': 4, 'u': 5, 't': 4, 'w': 6, 'v': 5,
'y': 5, 'x': 4, 'z': 5}

def bruteforcepopsum(x):
return sum(map(lambda y: letters[y], x))

for w in words:
bruteforcepopsum(w.strip())
```

### Ruby

```#!/usr/bin/env ruby
# Generated by Hash[('A'.ord..'z'.ord).to_a.map
#            {|x| [x.chr, x.to_s(2).count('1')] }]
\$letters = {"A"=>2, "B"=>2, "C"=>3, "D"=>2, "E"=>3,
"F"=>3, "G"=>4, "H"=>2, "I"=>3, "J"=>3, "K"=>4,
"L"=>3, "M"=>4, "N"=>4, "O"=>5, "P"=>2, "Q"=>3,
"R"=>3, "S"=>4, "T"=>3, "U"=>4, "V"=>4, "W"=>5,
"X"=>3, "Y"=>4, "Z"=>4, "["=>5, "\\"=>4, "]"=>5,
"^"=>5, "_"=>6, "`"=>2, "a"=>3, "b"=>3, "c"=>4,
"d"=>3, "e"=>4, "f"=>4, "g"=>5, "h"=>3, "i"=>4,
"j"=>4, "k"=>5, "l"=>4, "m"=>5, "n"=>5, "o"=>6,
"p"=>3, "q"=>4, "r"=>4, "s"=>5, "t"=>4, "u"=>5,
"v"=>5, "w"=>6, "x"=>4, "y"=>5, "z"=>5}

words = open('words4', 'r')

def bruteforcepopsum(x)
(x.chars.map {|y| \$letters[y]}).inject(:+)
end

i = 0

for w in words do
word = w.strip
s = bruteforcepopsum(word)
end
```

### C++

```#include
#include
#include

using namespace std;

#ifdef BUILTIN
#define bitcount __builtin_popcount
#else
#define bitcount countfunp
#endif

static char bits_in_char [256] = { 0, 1, 1, 2, 1,
2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3,
4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3,
4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3,
4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4,
5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3,
4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3,
4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4,
5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3,
4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4,
5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4,
5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5,
6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};

int iterated_bitcount(unsigned int n) {
int count = 0;
while (n) {
count += n & 0x1u;
n >>= 1;
}
return count;
}

int precompute8_bitcount (unsigned int n)  {
// works only for 32-bit ints

return bits_in_char [n  & 0xffu]
+  bits_in_char [(n >>  8 ) & 0xffu]
+  bits_in_char [(n >> 16) & 0xffu]
+  bits_in_char [(n >> 24) & 0xffu] ;
}

int hakmem_bitcount(unsigned int n) {
/* works for 32-bit numbers only    */
/* fix last line for 64-bit numbers */

register unsigned int tmp;

tmp = n - ((n >> 1) & 033333333333)
- ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}

int main(int argc, char* argv[])
{

ifstream f("words4");
int (*countfunp)(unsigned int) = 0;

#ifndef BUILTIN
switch (atoi(argv[2])) {
case 0:
countfunp = &hakmem_bitcount;
break;
case 1:
countfunp = &iterated_bitcount;
break;
default:
countfunp = &precompute8_bitcount;
break;
}
#endif
int i = 0;
while (f.good() && (i < 10 || !atoi(argv[1]))) {
string s;
int cnt = 0;
getline(f, s);
for (string::const_iterator I=s.begin(), E=s.end(); I != E; I++) {
cnt += bitcount(*I);
}
i++;
if (atoi(argv[1])) {
cout << s << " has bit pop count " << cnt << endl;
}
}
f.close();

return 0;
}
```

### Java

```import java.io.*;

class popcnt {

static public void main(String[] args) throws IOException
{

int sum = 0;
LineNumberReader rdr = new LineNumberReader(new FileReader("words4"));
while(rdr.ready()) {
String s = rdr.readLine();
char[] ca = s.toCharArray();
sum = 0;
for (char i : ca) {
sum += Integer.bitCount(i);
}
}
System.out.println("Last word has bit pop " + Integer.toString(sum));
rdr.close();
}
}
```

### Scala

```import scala.io.Source;

object popcnt {
def main(args: Array[String]) {
val f = Source.fromFile("words4")
f.getLines.map(line =>
line.trim.toCharArray.map(x =>
Integer.bitCount(x)).reduceLeft[Int](_+_))
f.close()
}
}
```

#### Post a Comment

Note: Only a member of this blog may post a comment.