Wednesday, September 28, 2011

Better Bit Mixing - Improving on MurmurHash3's 64-bit Finalizer

Austin Appleby's superb MurmurHash3 is one of the best general purpose hash functions available today. Still, as good as it is, there are a couple of minor things about it that make me uneasy. I want to talk about one of those things today.

Here is the 64-bit finalizer from MurmurHash3:

UInt64 MurmurHash3Mixer( UInt64 key )
  key ^= (key >> 33);
  key *= 0xff51afd7ed558ccd;
  key ^= (key >> 33);
  key *= 0xc4ceb9fe1a85ec53;
  key ^= (key >> 33);

  return key;

The goal of a finalizer (sometimes called a "bit mixer") is to take an intermediate hash value that may not be thoroughly mixed and increase its entropy to obtain both better distribution and fewer collisions among hashes. Small differences in input values, as you might get when hashing nearly identical data, should result in large differences in output values after mixing.

Ideally, flipping a single bit in the input key results in all output bits changing with a probability of 0.5. This is called the Avalanche effect. Cryptographically-secure hashes are willing to spend an impressive number of computer cycles getting this probability close to 0.5. Hash functions that don't have to be cryptographically-secure trade away some avalanche accuracy in favor of performance. MurmurHash3 gets close to 0.5 with a remarkable economy of effort: Just two multiplies, three shifts and three exclusive-ors. Austin reports MurmurHash3 avalanches all bits to within 0.25% bias.

The overall structure of the finalizer (shifts and multiplies) was designed by Austin. The constants were chosen by "a simple simulated-annealing algorithm" (his description.) The part that troubles me is the simulation was driven with random numbers. That doesn't seem right. After all, if the inputs were truly random what would be the point of mixing? It seems to me that a good mixer should be able to take a series of, say, counting numbers and produce output that is virtually indistinguishable from random.

First question: Could I build a better mixer by training the simulation on low-entropy inputs such as counting numbers and high-likelihood bit patterns? Here are the results for MurmurHash3 and the top 14 mixers my simulation found:

4,853,184 Low-entropy keys

MixerMaximum errorMean error
MurmurHash3 mixer0.0059746343843550.000265202074121

The maximum error tends to be about seven times lower. Mean error is lower too but only slightly. The answer, then, is yes, it does get better results with low-entropy keys. But that leads to the second question: Does training on low-entropy keys result in better or worse performance with high-entropy keys? To find out, I tested it with 100 million cryptographic-quality random numbers:

100,000,000 Random keys

MixerMaximum errorMean error
MurmurHash3 mixer0.0002123800000000.000040445117187

Yes, we can do slightly better than MurmurHash3 even on random test keys.

On average this is a small but measurable improvement that comes at no performance cost. For the worst-case, low-entropy keys that concern us the most, it provides a significant improvement.

Here's a handy tip: A great use for a mixer like this is to use it to "purify" untrustworthy input hash keys. If a calling function provides poor-quality hashes (even counting numbers!) as input to your code then purifying it with one of these mixers will insure it does no harm.

Here are the parameters for the mixers I tested:

Mixer parameters

MixerMixer operations
MurmurHash3 mixer330xff51afd7ed558ccd330xc4ceb9fe1a85ec5333

Tuesday, September 27, 2011

About Me

I love building things.

I have co-founded four startups, assembled and managed teams of developers and delivered products that millions of people use every day.

Some bits of trivia:

Remember that pinball game that shipped with Windows from Windows95 through XP? That was made by me and my co-founders at my first startup, Cinematronics. We were later acquired by Maxis.

My second startup, Eclipse Entertainment, created the first browser-based 3D graphics engine. It was acquired by WildTangent.

I built a compressor called Quantum and for a few years in the 90s it was the best-performing lossless data compressor. It was the first compressor with a workable solution to the optimal parsing problem and typically outperformed PkZip by 20%-30%. Borland, Microsoft and Novell licensed it. Microsoft used it for their .CAB files (it knocked two diskettes off the size of Windows) and almost all of their products were compressed with Quantum.

I created World Class Chess which was the first commercial chess program which offered a user-configurable opening book library. I made it configurable so my chess-playing friends would populate the opening book for me. That plan back-fired in an amusing way: They tended to enter whatever obscure openings they were studying at the time. As a result the program liked to play openings it couldn't understand and when it exhausted its book it proceeded to waste time rearranging its pieces to match its own ideas about pawn structure, mobility and king safety. (If you're a programmer you might enjoy thinking about how to represent a chess opening book library in just 8 bits per move. It used to be important to fit it in as little space as possible.)

I created Launch, a Windows shell that replaced the dreaded Program Manager. It sold well for a few years and had a community of enthusiastic users.

I created Savant, the first Scrabble game that was fast enough to do a tree search in the end-game. Its middle-game was weaker than the best program at the time but it made up for it with near-perfect end-game play. Computers are faster now and the best Scrabble games today do Monte-Carlo simulations at all stages of the game.

I built and shipped (in Encarta) a natural-language search engine at Microsoft. Its accuracy outperformed our best keyword-based search engine at the time.

Optimizing for performance is one of my passions. You can read about a couple of my programs at the following links:

There Ain't No Such Thing As The Fastest Code

It's a Wonderful Life

I wrote much of Borland's Turbo C run-time library. I often think back to what a great experience it was to be a part of that team. The stars were aligned for Borland in those days. It was magical and, in many ways, you could say that I've spent much of my career attempting to recreate that magic. I've come close but have never been totally successful. Such things are rare. I was too young to appreciate it at the time.

My interests are focused around machine learning, natural language processing, information theory, high-performance computing and scalability.

I lived and worked in Japan for two years though I've pretty much forgotten what little Japanese I once knew.

I enjoy games and puzzles: Go (Weiqi/Weichi/Baduk), Chess and Scrabble.

In 2002 I took a year off and rode a motorcycle around the world and had some adventures. I was interrogated by the Russian secret police and told them everything I knew in great detail. They grew bored and let me go.

Some things I believe: Working with good people is soul-enriching. Mentoring younger people is a way of repaying those who mentored me. Shipping is hard but real artists ship. Nothing can replace perseverance.