Jump to content

Generalized suffix array

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by BryanRamos0199 (talk | contribs) at 17:23, 12 May 2021. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer science, a generalized suffix array (or GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings of total length , it is a lexicographically sorted array of all suffixes of each string in . It is primarily used in bioinformatics and string processing.

Functionality

  • Let {}.
  • Lexicographically sorted array of all suffixes of each string in .
  • Each suffix is represented by an integer pair denoting suffix starting from position j in .
  • If suffixes from different strings within are identical, they occupy consecutive positions in the generalized suffix array. However, for convenience, the exception can be made of only listing repeats once.

A generalized suffix array can be generated for a generalized suffix tree. Compared to a generalized suffix tree, a generalized suffix array can be constructed in that it will use less space than the tree, the only drawback is it will require more time to construct.

Algorithms and Implementations

Algorithms and tools for constructing a generalized suffix array include:

  • Fei Shi's (1996) algorithm which runs in time with a worst case of and includes sorting, searching and finding the longest common prefixes[1]
  • The external generalized enhanced suffix array (eGSA) construction algorithm which resembles a two-phase multiway merge sort specializes in external memory construction when the size of the input collection or data structure exceed the amount of internal memory[2]
  • gsufsort is fast, portable, lightweight tool for constructing generalized suffix arrays and related data structures ;ole Burrows Wheeler transform or LCP Array)[3]

Solving the Pattern Matching Problem

Generalized suffix arrays can be used to solve the pattern matching problem:[4]

  • Given the pattern and a text , find all occurrences of in .
  • If the generalized suffix array, , of is available, then first, the suffixes that have as a prefix need to be found.
  • Since is a lexicographically sorted order of the suffixes of , all such suffixes will appear in consecutive positions in it. Furthermore, the sorted order attribute of G allows for easy identification of the suffixes using binary search.
  • Using binary search, find the smallest index in such that such that contains as a prefix, or determine that no such suffix is present. If no suffix is found, does not occur in . Otherwise, find the largest index contains as a prefix. All the elements in the range give the starting positions of the occurrences of in .
  • Binary search in takes comparisons. In each comparison, is compared with a suffix to determine their lexicographic order. This requires comparing at most characters. An array is not required, but having one will benefit running time.
  • Thus, the runtime of the algorithm is . By comparison, solving this problem using suffix trees takes time, however the space is smaller utilizing suffix arrays since the space required by the algorithm is only words, in addition to the space required to store the string. By keeping track of information and using slightly more space, running time can be improved to .

Other Applications

  • A generalized suffix array can be used to compete the longest common subsequence of the strings in the set in linear time, whereas a naive implementation would compute this information in quadratic time.
  • Create the Longest Previous Factor array used in text compression and for detecting motifs, repeats[5]

See Also

References

  1. ^ Shi, Fei (1996), "Suffix arrays for multiple strings: A method for on-line multiple string searches.", Lecture Notes in Computer Science, vol. 1179, Springer Berlin Heidelberg, doi:10.1007/BFb0027775, ISBN 978-3-540-62031-0
  2. ^ Louza, Felipe; Telles, Guilherme; Hoffman, Steve; Ciferri, Cristina (2017), "Generalized enhanced suffix array construction in external memory", Algorithms for molecular biology : AM, vol. 12, doi:10.1186/s13015-017-0117-9{{citation}}: CS1 maint: unflagged free DOI (link)
  3. ^ Louza, Felipe, gsufsort
  4. ^ Aluru, Srinivas, Suffix Trees and Suffix Arrays (PDF)
  5. ^ Crochemore, Maxime; Grossi, Roberto; Kärkkäinen, Juha; Landau, Gad (2013), "A Constant-Space Comparison-Based Algorithm for Computing the Burrows–Wheeler Transform", Combinatorial Pattern Matching. CPM 2013. Lecture Notes in Computer Science, vol. 7922, Springer, Berlin, Heidelberg, doi:10.1007/978-3-642-38905-4_9