Ok, the title was a little tongue-in-cheek. I did a fairly whimsical experiment yesterday. Since text strings are represented as bits which are either 1 (on) or 0 (off), I was wondering which unused domain names used the minimum number of "on" bits. The answer may be non-intuitive to the non-computer scientist. Here are the first few letters sorted according to the "bit population count" (number of on bits in their representation, aka Hamming weight which is useful in cryptography, distributed hash tables, and some optimal representation problems): A, a, B, b, D, d, H, h, P, p, C, c, E, e, F, f, I, i, J, j, L, l, Q, q, R. Notice that all uppercase letter come before their lowercase counterpart. More interestingly, d comes before c, h before c, e, f, so on and so forth. This is because d's (base 10 rep: 100) binary representation is 1100100 whereas c's (base 10 rep: 99) is 1100011. A naive calculation would conclude that using d instead of c saves 25% of the power (3 on bits instead of 4 on bits). Alas, that isn't how microprocessors and transistors work, but I will discuss that later.
The popcounts of the lowercase ASCII alphabet are the following:
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
3 | 3 | 4 | 3 | 4 | 4 | 5 | 3 | 4 | 4 | 5 | 4 | 5 | 5 | 6 | 3 | 4 | 4 | 5 | 4 | 5 | 5 | 6 | 4 | 5 | 5 |
Bit population count is a well-studied problem with a number of fast algorithms. It is so important that Intel has included a machine instruction for it POPCNT
. GCC 4.5 and LLVM 2.5 support the instruction through the builtin function __builtin_popcount
. For LLVM, if the instruction isn't supported or enabled, it falls back on a variation of HAKMEM. I conducted a simple benchmark of the leading algorithms and the builtin instruction.
Algorithm | Execution Time (ms) |
---|---|
MIT HAKMEM 169 | 523 |
Iterated | 617 |
Precomputed 8-bit | 460 |
Builtin popcnt | 378 |
Builtin popcnt g++ -O3 | 172 |
From the results, you can see that precomputed lookup gets pretty close to the performance of the builtin. Given that the data I ran the program on used only ASCII characters, precomputing the 16-bit pop counts would not have improved the performance any further. Ultimately, the precomputed lookup still has to do a handful of instructions worth of work as opposed to the builtin single instruction.
Using the builtin popcnt instruction reduces multiple instructions to a single one for a particularly tight loop body:Loop body for Intel SSE4
movzbl (%rax), %eax movsbl %al, %eax popcnt %eax, %eax addl %eax, -24(%rbp) leaq -96(%rbp), %rax movl $0, %esi movq %rax, %rdi
For the precomputed counts, the execution times other interpreters and compilers are as follows:
Name | Time (ms) |
---|---|
Python | 3599 |
Ruby | 4982 |
Java | 370 |
Scala | 689 |
Integer.bitCount
appears to be extraordinarily fast at 370 ms. It is highly likely that the Java VM is using the builtin instruction. Given that performance, it shouldn't be surprising that Scala performs at 689 ms. For OCaml, there is a third-party extended library that includes a popcount function, but that function implements the naive algorithm.
See Part 2 for OCaml and Haskell performance measurements.
Experiment Description
The experiment consisted of computing the bit population counts for each word in a 9.5Mb file generated by concatenating /usr/share/dict/words 4 times. The platform was a Mac Pro Xeon W3530 2.8GHz machine with 3Gb of RAM. Python was 2.6.1. Ruby was version 1.9.2p180. G++ was 4.6. Java was 1.6.0_26. Scala was 2.9.0.1.
Python
#!/usr/bin/env python words = open('words4', 'r') letters = {'A': 2, 'C': 3, 'B': 2, 'E': 3, 'D': 2, 'G': 4, 'F': 3, 'I': 3, 'H': 2, 'K': 4, 'J': 3, 'M': 4, 'L': 3, 'O': 5, 'N': 4, 'Q': 3, 'P': 2, 'S': 4, 'R': 3, 'U': 4, 'T': 3, 'W': 5, 'V': 4, 'Y': 4, 'X': 3, '[': 5, 'Z': 4, ']': 5, '\\': 4, '_': 6, '^': 5, 'a': 3, '`': 2, 'c': 4, 'b': 3, 'e': 4, 'd': 3, 'g': 5, 'f': 4, 'i': 4, 'h': 3, 'k': 5, 'j': 4, 'm': 5, 'l': 4, 'o': 6, 'n': 5, 'q': 4, 'p': 3, 's': 5, 'r': 4, 'u': 5, 't': 4, 'w': 6, 'v': 5, 'y': 5, 'x': 4, 'z': 5} def bruteforcepopsum(x): return sum(map(lambda y: letters[y], x)) for w in words: bruteforcepopsum(w.strip())
Ruby
#!/usr/bin/env ruby # Generated by Hash[('A'.ord..'z'.ord).to_a.map # {|x| [x.chr, x.to_s(2).count('1')] }] $letters = {"A"=>2, "B"=>2, "C"=>3, "D"=>2, "E"=>3, "F"=>3, "G"=>4, "H"=>2, "I"=>3, "J"=>3, "K"=>4, "L"=>3, "M"=>4, "N"=>4, "O"=>5, "P"=>2, "Q"=>3, "R"=>3, "S"=>4, "T"=>3, "U"=>4, "V"=>4, "W"=>5, "X"=>3, "Y"=>4, "Z"=>4, "["=>5, "\\"=>4, "]"=>5, "^"=>5, "_"=>6, "`"=>2, "a"=>3, "b"=>3, "c"=>4, "d"=>3, "e"=>4, "f"=>4, "g"=>5, "h"=>3, "i"=>4, "j"=>4, "k"=>5, "l"=>4, "m"=>5, "n"=>5, "o"=>6, "p"=>3, "q"=>4, "r"=>4, "s"=>5, "t"=>4, "u"=>5, "v"=>5, "w"=>6, "x"=>4, "y"=>5, "z"=>5} words = open('words4', 'r') def bruteforcepopsum(x) (x.chars.map {|y| $letters[y]}).inject(:+) end i = 0 for w in words do word = w.strip s = bruteforcepopsum(word) end
C++
#include#include #include using namespace std; #ifdef BUILTIN #define bitcount __builtin_popcount #else #define bitcount countfunp #endif static char bits_in_char [256] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8}; int iterated_bitcount(unsigned int n) { int count = 0; while (n) { count += n & 0x1u; n >>= 1; } return count; } int precompute8_bitcount (unsigned int n) { // works only for 32-bit ints return bits_in_char [n & 0xffu] + bits_in_char [(n >> 8 ) & 0xffu] + bits_in_char [(n >> 16) & 0xffu] + bits_in_char [(n >> 24) & 0xffu] ; } int hakmem_bitcount(unsigned int n) { /* works for 32-bit numbers only */ /* fix last line for 64-bit numbers */ register unsigned int tmp; tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111); return ((tmp + (tmp >> 3)) & 030707070707) % 63; } int main(int argc, char* argv[]) { ifstream f("words4"); int (*countfunp)(unsigned int) = 0; #ifndef BUILTIN switch (atoi(argv[2])) { case 0: countfunp = &hakmem_bitcount; break; case 1: countfunp = &iterated_bitcount; break; default: countfunp = &precompute8_bitcount; break; } #endif int i = 0; while (f.good() && (i < 10 || !atoi(argv[1]))) { string s; int cnt = 0; getline(f, s); for (string::const_iterator I=s.begin(), E=s.end(); I != E; I++) { cnt += bitcount(*I); } i++; if (atoi(argv[1])) { cout << s << " has bit pop count " << cnt << endl; } } f.close(); return 0; }
Java
import java.io.*; class popcnt { static public void main(String[] args) throws IOException { int sum = 0; LineNumberReader rdr = new LineNumberReader(new FileReader("words4")); while(rdr.ready()) { String s = rdr.readLine(); char[] ca = s.toCharArray(); sum = 0; for (char i : ca) { sum += Integer.bitCount(i); } } System.out.println("Last word has bit pop " + Integer.toString(sum)); rdr.close(); } }
Scala
import scala.io.Source; object popcnt { def main(args: Array[String]) { val f = Source.fromFile("words4") f.getLines.map(line => line.trim.toCharArray.map(x => Integer.bitCount(x)).reduceLeft[Int](_+_)) f.close() } }
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.