Feature hashing

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Qwertyus (talk | contribs) at 12:28, 13 June 2012 (Created page with 'In machine learning, the '''hashing trick''' is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary...'). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In machine learning, the hashing trick is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in a hash table.

Motivating example

In a typical document classification task, the input to the machine learning algorithm (both during learning and classification) is free text. From this, a bag of words (BOW) representation is constructed: the individual tokens are extracted and counted. To work efficiently with typical learning algorithms, these BOWs must be converted to (typically sparse) vectors such that each word type corresponds to a unique position in each vector.

The common approach is to construct, at learning time or prior to that, a vocabulary or dictionary of all word types to be expected and use that to map words to indices. E.g., the texts

John likes to watch movies. Mary likes too.
John also likes to watch football games.

can be converted, using the dictionary

{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

to the matrix

[[1 2 1 1 1 0 0 0 1 1]
 [1 1 1 1 0 1 1 1 0 0]]

(Punctuation was removed, as is usual in document classification.) Note that the above example is atypical in that the resulting vectors are very dense, because the dictionary was tailored to the input sentences.
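The dictionary-based mapping above can be sketched as follows; the function name is illustrative, and the vocabulary and texts are those from the example:

```python
def dict_vectorizer(tokens, vocabulary):
    # Vector positions follow the 1-based indices of the vocabulary.
    x = [0] * len(vocabulary)
    for t in tokens:
        x[vocabulary[t] - 1] += 1
    return x

vocabulary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

row1 = dict_vectorizer("John likes to watch movies Mary likes too".split(),
                       vocabulary)
row2 = dict_vectorizer("John also likes to watch football games".split(),
                       vocabulary)
# row1 == [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
# row2 == [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```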

The problem with this process is that either the dictionary cannot contain all words that might occur in unseen documents, or it must be extremely large, taking up a large amount of storage space.

Feature vectorization using the hashing trick

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick builds a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices:

function hashing_vectorizer(features : array of string, N : integer):
    x = new vector[N]
    for f in features:
        h = hash(f)
        x[h mod N] += 1
    return x
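The pseudocode above translates directly into Python. As a minimal sketch, MD5 stands in for the hash function h (an arbitrary choice, made here so that indices are stable across runs; any hash function with good distribution would do):

```python
import hashlib

def hashing_vectorizer(features, N):
    # Apply a hash function to each feature and use the hash value,
    # modulo the vector length N, directly as the feature's index.
    x = [0] * N
    for f in features:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        x[h % N] += 1
    return x

v = hashing_vectorizer("John likes to watch movies Mary likes too".split(), 16)
# v has length 16 and its entries sum to 8, one count per token
```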

It has been suggested that a second, single-bit output hash function ξ be used to determine the sign of the update value, to counter the effect of hash collisions.[1]
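A sketch of this signed variant, again with concrete hash functions chosen only for illustration (MD5 for the index hash, one bit of SHA-1 for the sign hash ξ): because colliding features may receive opposite signs, their contributions tend to cancel rather than accumulate.

```python
import hashlib

def signed_hashing_vectorizer(features, N):
    # Signed variant: a second, single-bit hash xi decides whether each
    # update adds or subtracts 1, countering the bias from collisions.
    x = [0] * N
    for f in features:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16) % N
        xi = 1 if int(hashlib.sha1(f.encode("utf-8")).hexdigest(), 16) & 1 else -1
        x[h] += xi
    return x
```

Since a given feature always hashes to the same index and sign, repeated occurrences of the same feature still accumulate; only distinct colliding features cancel.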

References

  1. ^ Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). "Feature Hashing for Large Scale Multitask Learning". Proc. ICML.