Hash table

Hash table
Hash table
Type	Unsorted associative array
Invented	1953
Operation
Time complexity in big O notation
Operation	Average
Search	O(1 + n/k) ; k: number of buckets
Insert	O(1)
Delete	O(1 + n/k)
Space complexity
Space	O(n)

In computing, a hash table (also hash map) is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.

Ideally, the hash function should assign each possible key to a unique bucket, but this ideal situation is rarely achievable in practice (unless the hash keys are fixed; i.e. new entries are never added to the table after it is created). Instead, most hash table designs assume that hash collisions—different keys that are assigned by the hash function to the same bucket—will occur and must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized^[2]) cost per operation.^[3]^[4]

In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

Dynamic resizing

To keep the load factor under a certain limit, e.g. under 3/4, many table implementations expand the table when items are inserted. For example, in Java's HashMap class the default load factor threshold for table expansion is 0.75.

Since buckets are usually implemented on top of a dynamic array and any constant proportion for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same space-time tradeoff as for dynamic arrays.

Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations.

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table—followed by a rehash—when items are deleted. From the point of space-time tradeoffs, this operation is similar to the deallocation in dynamic arrays.

Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold r_max. Then a new larger table is allocated, all the entries of the old table are removed and inserted into this new table, and the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold r_min, all entries are moved to a new smaller table.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Incremental resizing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:

During the resize, allocate the new hash table, but keep the old table unchanged.
In each lookup or delete operation, check both tables.
Perform insertion operations only in the new table.
At each insertion also move r elements from the old table to the new table.
When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Monotonic keys

If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function—and indeed hash table—both of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(lg(N)) ranges to check, and binary search time for the redirection would be O(lg(lg(N))). As with consistent hashing, this approach guarantees that any key's hash, once issued, will never change, even when the hash table is later grown.

Performance analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size k with open addressing has no collisions and holds up to k elements, with a single comparison for successful lookup, and a table of size k with chaining and n keys has the minimum max(0, n-k) collisions and O(1 + n/k) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(n) amortized comparisons per insertion and up to n comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only n/bⁱ keys are inserted i or more times, so that the total number of insertions is bounded above by bn/(b-1), which is O(n). By using rehashing to maintain n < k, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.

In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires Θ(1 + n/k) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 - n/k)).^[5] Both these bounds are constant, if we maintain n/k < c using table resizing, where c is a fixed constant less than 1.

Uses

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.

Sets

Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not.

Those structures can therefore be used to implement a set data structure, which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Unique data representation

Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.)

String interning

Implementations

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of arbitrary type.

In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate the hash values used in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition).

Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are highly optimized as they are used internally to implement namespaces.

In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.

Independent packages

SparseHash (formerly Google SparseHash) An extremely memory-efficient hash_map implementation, with only 2 bits/entry of overhead. The SparseHash library has several C++ hash map implementations with different performance characteristics, including one that optimizes for memory use and another that optimizes for speed.
SunriseDD An open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of runtime supplied hash functions via callback mechanism. Source code is well documented.
uthash This is an easy-to-use hash table for C structures.

References

^ Thomas H. Cormen ; et al. (2009). Introduction to Algorithms (3rd ed.). Massachusetts Institute of Technology. pp. 253–280. ISBN 978-0-262-03384-8. {{cite book}}: Explicit use of et al. in: |author= (help); Unknown parameter |note= ignored (help)CS1 maint: extra punctuation (link)
^ Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms—Fall 2005
^ Donald Knuth (1998). 'The Art of Computer Programming'. Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0. {{cite book}}: Unknown parameter |note= ignored (help)CS1 maint: extra punctuation (link)
^ Cite error: The named reference cormen was invoked but never defined (see the help page).
^ Doug Dunham. CS 4521 Lecture Notes. University of Minnesota Duluth. Theorems 11.2, 11.6. Last modified April 21, 2009.

External links

A Hash Function for Hash Table Lookup by Bob Jenkins.
Hash Tables by SparkNotes—explanation using C
Hash functions by Paul Hsieh
Design of Compact and Efficient Hash Tables for Java link not working
Libhashish hash library
NIST entry on hash tables
Open addressing hash table removal algorithm from ICI programming language, ici_set_unassign in set.c (and other occurrences, with permission).
A basic explanation of how the hash table works by Reliable Software
Lecture on Hash Tables
Hash-tables in C—two simple and clear examples of hash tables implementation in C with linear probing and chaining
Open Data Structures - Chapter 5 - Hash Tables
MIT's Introduction to Algorithms: Hashing 1 MIT OCW lecture Video
MIT's Introduction to Algorithms: Hashing 2 MIT OCW lecture Video
How to sort a HashMap (Java) and keep the duplicate entries

[Cormen_et_al-1] Thomas H. Cormen ; et al. (2009). Introduction to Algorithms (3rd ed.). Massachusetts Institute of Technology. pp. 253–280. ISBN 978-0-262-03384-8. {{cite book}}: Explicit use of et al. in: |author= (help); Unknown parameter |note= ignored (help)CS1 maint: extra punctuation (link)

[leiser-2] Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms—Fall 2005

[knuth-3] Donald Knuth (1998). 'The Art of Computer Programming'. Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0. {{cite book}}: Unknown parameter |note= ignored (help)CS1 maint: extra punctuation (link)

[cormen-4] Cite error: The named reference cormen was invoked but never defined (see the help page).

[5] Doug Dunham. CS 4521 Lecture Notes. University of Minnesota Duluth. Theorems 11.2, 11.6. Last modified April 21, 2009.

[1]

[2]

[3]

[4]

[5]

v t e Data structures
Types	Collection Container
Abstract	Associative array Multimap Retrieval Data Structure List Stack Queue Double-ended queue Priority queue Double-ended priority queue Set Multiset Disjoint-set
Arrays	Bit array Circular buffer Dynamic array Hash table Hashed array tree Sparse matrix
Linked	Association list Linked list Skip list Unrolled linked list XOR linked list
Trees	B-tree Binary search tree AA tree AVL tree Red–black tree Self-balancing tree Splay tree Heap Binary heap Binomial heap Fibonacci heap R-tree R* tree R+ tree Hilbert R-tree Trie Hash tree
Graphs	Binary decision diagram Directed acyclic graph Directed acyclic word graph
List of data structures