Suffix array

A suffix array (invented by Gene Myers and Udi Manber) is an indexing structure for locating substrings in a larger string. Originally developed to reduce memory consumption compared to a suffix tree, this structure began the trend towards compressed suffix arrays and BWT-based compressed full-text indices.

The kth suffix of a string S is S with the first k-1 characters removed, for some starting position k. A string of n characters has n suffixes, denoted by their starting positions 1..n.

For instance, the string "abracadabra" has the following set of suffixes:

1  abracadabra
2  bracadabra
3  racadabra
4  acadabra
5  cadabra
6  adabra
7  dabra
8  abra
9  bra
10 ra
11 a

The first step in building a suffix array is to sort the suffixes lexicographically, yielding:

11 a
8  abra
1  abracadabra
4  acadabra
6  adabra
9  bra
2  bracadabra
5  cadabra
7  dabra
10 ra

To find a given substring X, two binary searches are used to find the range of suffixes prefixed by X. To gain efficiency, the string comparisons in the binary searches are modified to take advantage of longest common prefixes (LCP's) as the search narrows. The output of the search is the list of starting positions for the suffixes prefixed by X.

The key insight of the suffix array is to denote each suffix by its starting position only (the left column above). The resultant array of numbers, combined with the original string, is a compact representation of the sorted suffix list, consuming one character and one integer for each character in the string.