Jump to content

Tabulation hashing

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by David Eppstein (talk | contribs) at 23:38, 8 January 2016 (Method: source). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It is simple and fast enough to be usable in practice, and has theoretical properties that (in contrast to some other universal hashing methods) make it usable with linear probing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

Method

Let p denote the number of bits in a key to be hashed, and q denote the number of bits desired in an output hash function. Choose another number r, less than or equal to p; this choice is arbitrary, and controls the tradeoff between time and memory usage of the hashing method: smaller values of r use less memory but cause the hash function to be slower. Compute t by rounding p/r up to the next larger integer; this gives the number of r-bit blocks needed to represent a key. For instance, if r = 8, then an r-bit number is a byte, and t is the number of bytes per key. The key idea of tabulation hashing is to view a key as a vector of t r-bit numbers, use a lookup table filled with random values to compute a hash value for each of the r-bit numbers representing a given key, and combine these values with the bitwise binary exclusive or operation. The choice of r should be made in such a way that this table is not too large; e.g., so that it fits into the computer's cache memory.[1]

The initialization phase of the algorithm creates a two-dimensional array T of dimensions 2r by t, and fills the array with random q-bit numbers. Once the array T is initialized, it can be used to compute the hash value h(x) of any given key x. To do so, partition x into r-bit values, where x0 consists of the low order r bits of x, x1 consists of the next r bits, etc. For example, with the choice r = 8, xi is just the ith byte of x. Then, use these values as indices into T and combine them with the exclusive or operation:Cite error: A <ref> tag is missing the closing </ref> (see the help page). Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.[2]

Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.[2]

Extensions

Although tabulation hashing as described above ("simple tabulation hashing") is only 3-independent, variations of this method can be used to obtain hash functions with much higher degrees of independence. Siegel (2004) uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive. Thorup (2013) provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way. He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter r, the number of bits per block in the partition of the keys into blocks.

Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys. Lemire (2012) studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key. However, the values from this table may be combined by a more complicated function than bitwise exclusive or. Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values T[xi] (where xi is, as before, the ith block of the input) as the coefficients of a polynomial over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial, gives a 2-independent hash function.

Notes

  1. ^ Morin, Pat, "Section 5.2.3: Tabulation hashing", Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 115–116, retrieved 2016-01-08.
  2. ^ a b Pătraşcu & Thorup (2012).

References