Surfin’ Safari

Hashtables, Part 1

Posted by Maciej Stachowiak on Sunday, July 10th, 2005 at 12:31 am

A Hashtable Primer

Hashtables are one of the most important basic data structures in a web browser. But most people don’t understand much about them. Even though you will rarely be called upon to build your own hashtable from scratch, it’s useful to know a little about how they work, and what kind of problems they can solve.

A hashtable is an associative array that uses a hash function to map items to pseudo-random slots in an array. With a good hash function, a hashtable can give fast, O(1) access for key-value lookups.

OK, so that’s a bunch of computer-science gobbledygook, but what are hashtables actually good for? Lots of things. Say you want to create a fast mapping from names to particular values, for example the mapping from id attributes to elements that is used by document.getElementById. Well, a hash map is just the thing. You can build a table that quickly maps from a string to the corresponding DOM node.

You can also use hashtables to implement “sets”. Say you quickly want to check if a tag is one of the set of HTML inline elements, so you can parse correctly. You can use a hashtable to build a set that quickly lets you check if a tag name is one of these tags.

Finally, hashtables are good for caches. Sometimes when you do a complex computation to determine a value, you want to be sure you can do that same computation quickly a second time. So you use a hashtable to map from the inputs to the already computed output.

So that’s a little about what hashtables are good for.Stay tuned for our next exciting episode, which will cover different hashtable algorithms, and which ones let you make the fastest possible hashtable.

2 Responses to “Hashtables, Part 1”

  1. iSee Says:

    On Hash-tables.
    Let me clear up some common misconceptions about hash tables:
    While it is feasible to get actually very fast lookups using hash functions and the O(1) access is commonly stated, but strictly speaking, only true if your key-length and your address-space (pointer length) are considered constant.
    Though the later one might be ignored (except they can lead to slow-down when moving to a 64-bit architecture), the former one poses a real problem.
    The best-case lookup complexity can not be better then O(l) where l is the length of the key for key-value lookups (you can’t possibly do a lookup without looking at the key). Again this means the best-case (with the key-space is fully utilized) you get a lookup complexity of log(n) where n is the number of entries which happens to be the complexity of balanced search trees.
    Having said that, as with quicksort, hash tables can be much faster in practice than balanced search trees and there is hope they help in some places making Safari a neat, fast browser.
    However, as with quicksort, there are some pitfalls you should be aware of that can adversely affect complexity. For quicksort, these are sorted arrays and reversed arrays, for hash tables you need to care about clashes and deletes that make the use of hash tables for caches a questionable choice unless you choose to just overwrite the old value in case of a clash.
    Let me explain: A hash function maps the key space to an array position where the value is stored typically making it much shorter (if you would not make it shorter you could just use the key and use an array big enough to hold an entry for each possible key - and in fact if your key is suitable that is what you should do). As the physical key is much smaller, there will be clashes (i.e. different keys that are mapped onto the same slot). And there will be clashes. I do not know the exact finish. Even under ideal conditions (i.e. complete randomness) pick 23 people and it is more likely that two of them have a birthday on the same day of the year than not. Of course hash table algorithms can cope with clashes (otherwise they would basically be useless), but there is some hassle involved with it, especially if you choose to store the clashing value inside the same array as the other values. In that case it will be hard to ever really delete values after there was a clash, and you will over time end up doing a linear search through the hash table which is very slow.
    So if you (the reader) are ever using hash-tables for read-write data structures that exist for a long time, be sure to have a fallback data structure for clashes or some smart and efficient mechanism to delete entries without leaving delete-markers around or cause local contention.
    If you are looking for alternatives, there is at least one. In the need of a very dynamic cache data structure in an earlier project, I have created/(re-?)invented a data structure I call “algebraic search tree” that is fast and scalable and works basically by chopping the key into parts and applying each part to a little lookup table. If you want further information, feel free to contact me (i underscore see at macnews.de)

  2. maciej Says:

    As you will soon discover in future parts of this series, the new WebCore hashtable does extremely well in terms of collision rate, through the combination of good algorithms and good hash functions.