Generalized suffix array

In computer science, a generalized suffix array (or GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings $S=S_{1},S_{2},S_{3},...,S_{k}$ of total length $n$ , it is a lexicographically sorted array of all suffixes of each string in $S$ .

Functionality

Let { $S=S_{1},S_{2},S_{3},...,S_{k}$ }.
Lexicographically sorted array of all suffixes of each string in $S$ .
Each suffix is represented by an integer pair $(i,j)$ denoting suffix starting from position j in $s_{i}$ .
If suffixes from different strings within $S$ are identical, they occupy consecutive positions in the generalized suffix array. However, for convenience, the exception can be made of only listing it once.

Algorithms and Implementations

Algorithms for constructing a generalized suffix array include Fei Shi's (1996) algorithm which runs in $\Theta (NlogN)$ time with a worst case of $\Theta (n)$ ^[1]
The external generalized enhanced suffix array (eGSA) construction algorithm which resembles a two-phase multiway merge sort^[2]

Solving the Pattern Matching Problem

Generalized suffix arrays can be used to solve the pattern matching problem.

Given the pattern $P$ and a text $T$ , find all occurrences of $P$ in $T$ .
If the generalized suffix array, <G, of the text T is available, then first, the suffixes that have P as a prefix need to be found.
Since the G, is a lexicographically sorted order of the suffixes of T, all such suffixes will appear in consecutive positions in it. Furthermore, the sorted order attribute of G allows for easy identification of the suffixes using binary search.
Using binary search, find the smallest index i in G such that G such that $suff_{G}[i]$ contains P as a prefix, or determine that no such suffix is present. If no suffix is found, P does not occur in T. Otherwise, find the largest index j(\geq i) contains P as a prefix. All the elements in the range G[i..j] give the starting positions of the occurrences of P in T.
Binary search in G takes O(log n) comparisons. In each comparison, P is compared with a suffix to determine their lexigraphic order. This requires comparing at most |P| = m characters. An lcp array is not required, but having one will benefit running time.
Thus, the runtime of the algorithm is O(m log n). By comparison, solving this problem using suffix trees takes O(m) time, however the space is smaller since the space required by the algorithm is only n words, in addition to the space required to store the string. By keeping track of lcp information and using slightly more space, running time can be improved to O(m+log N).

Other Applications

A generalized suffix array can be used to compete the longest common subsequence of the strings in the set in linear time, whereas a naive implementation would compute this information in quadratic time.

References

^ Shi, Fei (1996), "Suffix arrays for multiple strings: A method for on-line multiple string searches.", Lecture Notes in Computer Science, vol. 1179, Springer Berlin Heidelberg, doi:10.1007/BFb0027775, ISBN 978-3-540-62031-0
^ Louza, Felipe; Telles, Guilherme; Hoffman, Steve; Ciferri, Cristina (2017), "Generalized enhanced suffix array construction in external memory", Algorithms for molecular biology : AM, vol. 12, doi:10.1186/s13015-017-0117-9{{citation}}: CS1 maint: unflagged free DOI (link)

[1] Shi, Fei (1996), "Suffix arrays for multiple strings: A method for on-line multiple string searches.", Lecture Notes in Computer Science, vol. 1179, Springer Berlin Heidelberg, doi:10.1007/BFb0027775, ISBN 978-3-540-62031-0

[2] Louza, Felipe; Telles, Guilherme; Hoffman, Steve; Ciferri, Cristina (2017), "Generalized enhanced suffix array construction in external memory", Algorithms for molecular biology : AM, vol. 12, doi:10.1186/s13015-017-0117-9{{citation}}: CS1 maint: unflagged free DOI (link)

[1]

[2]

Functionality

Algorithms and Implementations

Solving the Pattern Matching Problem

Other Applications

See Also

References