Hash Tables (Part 2)
One of the more important but somewhat mysterious aspects of hashtables is writing hash functions. A hash function needs to take your data (possibly fixed-size, possibly variable-sized) and turn it into a fixed-size number that has “good distribution”. If you manage this, you won’t get a lot of collisions. This means each hash lookup will have to look at only a few buckets in the table and so will be fast.
But what kind of distribution is good? And how do you make such a function? Fortunately, a smart guy named Bob Jenkins has figured this out for you. He says a good hash function is one with no “funnels” – cases where some number m
of input bits affects a smaller number n < m
of bits of the state while mixing. Or, apparently, the same thing in reverse, whatever that means.
What does that buy you? Well, you get a function with the avalanche property – any one-bit change in the input string will on average flip about half the bits in the output. That’s good because it means that similar inputs will have very different hash codes — that’s useful because real-world input sets actually tend to have pretty low entropy. In other words, there’s going to be a lot of structured similarity to your input strings. It’s also good because it means every bit of the hash code is useful, so you can make a smaller good hash just by masking off the low couple of bits. In the next installment we’ll see why this is so good.
OK, so now we know what we want out of a hash function. We’d like it to be “funnel-free” so that it achieves “avalanche”. What exactly does that imply? Well, it means that you can’t use the popular hack of hashing strings by taking only the length and the last and first few characters. Doing so makes your hash function highly susceptible to pathological input sets that will cause horrible collision rates, because any set of inputs that is identical on the ends but differs in the middle will hash to the same thing. While it can be expensive to hash all of a long string, it’s usually a worthwhile tradeoff to keep the hash code around in the string and compute a full good hash once.
OK, so with that out of the way, how do you actually make a function that has these qualities, and is reasonably fast to compute? Basically you mess around with bitwise operations and magic constants until you get one that passes the appropriate tests. But you generally don’t have to, because others have done that for you. Bob Jenkins himself gives code for a few good hash functions, and favorably cites FNV.
But others have applied his hashing principles and come up with functions that are faster to compute, and which still give good results.
The fastest strong hash I’ve found for arbitrary-length data such as strings is Paul Hsieh’s aptly named SuperFastHash. The basic idea of this hash function is to walk the string 16 bits at a time, mixing with a 32-bit internal state; most other hash functions go only 8 bits at a time. This function does only a small amount of mixing at each step, and then applies some final scrambling at the end. Working in 16-bit chunks is actually a really great fit for UTF-16 unicode strings, which includes all the strings used by JavaScript and in the DOM. They’re made out of 16-bit chunks, so the function can actually be further simplified to not worry about unaligned access or odd bytes at the end. You can find an example of my version in WebCore/khtml/xml/dom_stringimpl.cpp
.
For fixed-size 32-bit or 64-bit quantities, such as integers or pointers, you can actually do even better. Thomas Wang presents 32-bit mix and 64-bit mix functions that can be computed in very few machine cycles, but still give good distribution. You can find these in WebCore/khtml/misc/hashfunctions.h
as the pointerHash functions. A little template magic is used to pick the right one for your platform pointer size.
So what’s the bottom line? It’s handy to know what makes a good hash function, but in most cases you are better off turning to existing tested and theoretically sound hash functions than trying to make up your own.
In our next installment, we’ll look at some of the choices you can make when implementing a hash table, and which ones go best with a good hash function.