Generalized suffix tree

A generalized suffix tree is a suffix tree for a set of strings. Given a set of strings $D=\{S^{1},S^{2},\dots ,S^{d}\}$ of total length $n$, it is a Patricia trie containing all $n$ suffixes of the strings. It serves as an index that supports substring search over every string in the set, and it is used mostly in bioinformatics.

It can be built in $\Theta (n)$ time and space, and it can be used to find all $occ$ occurrences of a string $P$ of length $m$ in $O(m+occ)$ time, which is asymptotically optimal. (This assumes that the size of the alphabet is treated as a constant; otherwise, the running times depend on the implementation.)
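To make the query concrete, the following is a minimal sketch of the lookup step only. It assumes an uncompressed suffix trie built by naive insertion, which takes quadratic time and space in the worst case, so it is not the $\Theta (n)$ construction mentioned above and it is not path-compressed like a Patricia trie. The names `Node`, `build_naive_trie` and `find_occurrences` are hypothetical and chosen for illustration.

```python
class Node:
    """One trie node; hypothetical helper type for this sketch."""
    __slots__ = ("children", "occurrences")

    def __init__(self):
        self.children = {}     # next character -> child Node
        self.occurrences = []  # (string index, start offset) of each suffix passing through


def build_naive_trie(strings):
    """Insert every suffix of every string, character by character.

    Assumption: this is the naive quadratic construction, used only to
    illustrate the query; it is not a linear-time suffix tree builder.
    """
    root = Node()
    for doc, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = Node()
                    node.children[ch] = child
                node = child
                node.occurrences.append((doc, start))
    return root


def find_occurrences(root, pattern):
    """Walk down m characters, then report the occ stored occurrences."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # the pattern occurs in none of the strings
    return list(node.occurrences)


trie = build_naive_trie(["banana", "bandana"])
hits = find_occurrences(trie, "ana")  # pairs of (string index, start offset)
```

The walk costs one dictionary lookup per pattern character and the report step copies one entry per occurrence, which mirrors the $O(m+occ)$ query bound; a real suffix tree reaches the same bound in linear space by compressing unary paths and collecting occurrences from the leaves below the matched node.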

An alternative to building a generalized suffix tree is to concatenate the strings and build a regular suffix tree for the resulting string. When evaluating the hits after a search, each global position is mapped to a document and a local position within that document by some algorithm or data structure, such as a binary search over the starting or ending positions of the documents.
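The position mapping can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the documents are joined with a separator character that is assumed to occur in no document and no pattern, and a plain `str.find` scan stands in for the suffix tree query. The function names are hypothetical.

```python
from bisect import bisect_right


def build_concatenation(docs, sep="\x00"):
    """Concatenate docs and record where each one starts.

    Assumption: `sep` appears in no document and no pattern, so a match
    can never span two documents.
    """
    starts = []
    pos = 0
    for d in docs:
        starts.append(pos)
        pos += len(d) + len(sep)
    text = "".join(d + sep for d in docs)
    return text, starts


def locate(starts, global_pos):
    """Binary search over the sorted start offsets of the documents."""
    doc = bisect_right(starts, global_pos) - 1
    return doc, global_pos - starts[doc]


def find_all(text, pattern):
    """Stand-in for the suffix tree query: all start positions of pattern."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def search(docs, pattern):
    text, starts = build_concatenation(docs)
    return [locate(starts, g) for g in find_all(text, pattern)]


results = search(["banana", "bandana", "ana"], "ana")
```

Each lookup in `locate` costs $O(\log d)$ for $d$ documents, so mapping all hits adds $O(occ\log d)$ to the query. A precomputed array that maps every global position to its document would make the lookup constant time at the cost of $O(n)$ extra space.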

References

Paul Bieganski, John Riedl, John Carlis, and Ernest F. Retzel. Generalized Suffix Trees for Biological Sequence Data. Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, Vol. V: Biotechnology Computing, 1994.