Jump to content

Generalized suffix tree

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 149.132.188.231 (talk) at 15:02, 5 September 2008 (Example). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
Suffix tree for the strings ABAB and BABA. Suffix links not shown.

A generalised suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia trie containing all suffixes of the strings. It is mostly used in bioinformatics[1].

Functionality

It can be built in time and space, and you can use it to find all occurrences of a string of length in time, which is asymptotically optimal (assuming the size of the alphabet is constant, see [2] page 119).

When constructing such a tree, each string should be padded with a unique out-of-alphabet marker symbol (or string) to ensure no suffix is a substring of another, guaranteeing each suffix is represented by a unique leaf node.

Example

A suffix tree for the strings ABAB and BABA is shown in a figure above. They are padded with the unique terminator strings $0 and $1. The numbers in the leaf nodes are string number and starting position. Notice how a left to right traversal of the leaf nodes corresponds to the sorted order of the suffixes. The terminators might be strings or unique single symbols. Edges on $ from the root are left out in this example. mm

Alternatives

An alternative to building a generalised suffix tree is to concatenate the strings, and build a regular suffix tree or suffix array for the resulting string. When you evaluate the hits after a search, you map the global positions into documents and local positions with some algorithm and/or data structure, such as a binary search in the starting/ending positions of the documents.

References

  • ^ Paul Bieganski, John Riedl, John Carlis, and Ernest F. Retzel (1994). "Generalized Suffix Trees for Biological Sequence Data". Biotechnology Computing, Proceedings of the Twenty-Seventh Hawaii International Conference on. pp. 35–44. {{cite conference}}: Unknown parameter |booktitle= ignored (|book-title= suggested) (help)CS1 maint: multiple names: authors list (link)
  • ^ Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.