Wednesday, June 23, 2010

Hash Table Performance: the Not Always Completely Told Story

Here is a story everyone knows, but with an ending that maybe not everyone knows.

How do you find a piece of information in a data structure? If you want to do so efficiently, generally you use either an array or a hash table.

Your typical Von Neumann machine has only one really efficient data structure built into its architecture: namely, the array, which allows you to map an integer index into a memory location. If the contents of your array are instances of some data type with a fixed maximum size, you can premultiply the index by that size and thereby map the index to a value of the data type stored in the array. If your data type is implemented as pointer to some structure, you can effectively drop the maximum-size constraint (because all array elements are the size of a pointer), at the cost of a pointer dereference on each array element access.

Looking up something in an array generally requires constant (O(1)) time, whereas finding a particular value in a data structure without an array (such as a linked list) takes time proportional to the size of the data structure (O(n)).

Arrays are fine if each thing you want to look up has a number, and if the set of numbers assigned is compact enough that the array fits easily in memory. But not everything you want to find out about is easy to represent as a number. For example, you'll want to order an address book by name, not by assigning a number to everyone you know.

Hash functions are functions that take a value (the “key”) and return an integer (a “hash code”). For a given value, the hash code must always be the same (a hash function is a pure function of the value). The key value is not itself usually an integer (although it can be, which is useful when you look something up using a noncompact set of integers as lookup keys). A hash function argument is often a text string, but it makes sense to transform just about any type of value into a hash code (which is why Java puts hashCode at the top of the type hierarchy, in Object). For example, if you want to keep track of information by object type, you can use a java.lang.Class instance as a hash key.

Hash functions are typically used in a conjunction with a hash table, which is an array whose index is a hash code (or a truncation of a hash code, in the common case where the hash function produces a larger range of values than the size of the array). The hash table has a mechanism for dealing with hash collisions, to cover the case where more than one value has the same truncated hash code. You can view a hash table as an array indexed by something other than an integer, and because the key lookup operation is an array index operation, which is O(1), hash table performance can often be taken to be O(1).

A hash table is more complicated than an array, however, because a hash table usually needs to include its index values within its entries. Except in the case of perfect (collision-free) hashing, you must ensure that the key you think you're looking up is really present in the table—that you haven't instead reached an entry that happens to have the same truncated hash code as the one you want. This typically entails comparing the key you're given with the copy of the key stored in the hash table to entry, to make sure they're the same.

The real performance characteristics of a hash table depend on several factors:

  • The time to compute the hash code on an index value. This time is actually O(n), where n is the size of input to the hash function, which might be nontrivial (consider a hash code that examines every character of a long string. If you use the same value as an index more than once, it may be worthwhile to cache the hash code. java.lang.String, for example, caches the hash code for a Java string.)

  • The mechanism used by the hash table to deal with collisions. In hash tables that allocate larger arrays as they grow (the usual strategy), collisions are few, and the overall cost of the array reallocations is amortized O(1), so this cost can typically be ignored.

  • The time required to check whether the index value is valid by comparing it with the index value stored in the hash table. Like the computation of the hash code itself, in the general case, this is O(n) in the size of the data compared. (To ensure that a hash table behaves both efficiently and consistently, you typically want to examine exactly the same components of an index value in both the hash function and the comparison for equality.)

So in fact, hash table performance is not strictly O(1), because it depends on the speed of the hash and equality-check functions applied to the table's keys. For many data structures, however, keys are small (of fixed size, or with a reasonably small maximum size), so that calling the performance O(1) is no more than a little white lie.