Substring index

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by David Eppstein (talk | contribs) at 08:14, 12 December 2024 (ce). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer science, a substring index is a data structure that supports substring search in a text or text collection in time sublinear in the size of the text. Once constructed from a document or set of documents, a substring index can be used to locate all occurrences of a pattern in time linear or near-linear in the pattern size, with no dependence or only logarithmic dependence on the document size.
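As a rough illustration of the query bound described above, the following sketch uses a suffix array, one well-known kind of substring index. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction shown is a naive sort of all suffixes chosen for brevity, not an efficient construction algorithm. Each binary-search step compares at most as many symbols as the pattern contains, so a query takes a number of steps logarithmic in the text length.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text`, in sorted suffix order.

    Naive construction by sorting the suffixes directly: fine for a small
    demonstration, far too slow for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of `pattern` in `text`.

    All suffixes that begin with `pattern` are contiguous in `sa`, so two
    binary searches find the boundaries of that block.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

For example, after `sa = build_suffix_array("banana")`, calling `find_occurrences("banana", sa, "ana")` reports the two overlapping occurrences at positions 1 and 3.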

These data structures typically treat their text and pattern as strings over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance, in Unicode), but in practical applications for text retrieval it may be preferable to treat the (stemmed) words of a document as the symbols of its alphabet, because doing so reduces the lengths of both the text and the pattern, as measured in symbols of the alphabet.
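The choice of alphabet can be illustrated with a small sketch in which the symbols are words rather than characters. Everything here is an assumption made for illustration: the helper names `tokenize`, `build_word_suffix_array`, and `find_phrase` are hypothetical, lowercasing stands in for real stemming, and the index is a naively built suffix array over the token sequence.

```python
import re
from typing import Callable, List


def tokenize(document: str) -> List[str]:
    """Split a document into lowercase word tokens.

    Lowercasing is only a crude stand-in for stemming (an assumption of this sketch).
    """
    return re.findall(r"[a-z0-9]+", document.lower())


def build_word_suffix_array(tokens: List[str]) -> List[int]:
    """Sort the start positions of all token-level suffixes (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens: List[str], sa: List[int], phrase: List[str]) -> List[int]:
    """Return the sorted token positions where `phrase` occurs as a contiguous run."""
    m = len(phrase)

    def first_index(accept: Callable[[List[str]], bool]) -> int:
        # Smallest index in `sa` whose length-m window satisfies `accept`.
        # `accept` must be monotone over `sa`: False ... False True ... True.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if accept(tokens[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    start = first_index(lambda window: window >= phrase)
    end = first_index(lambda window: window > phrase)
    return sorted(sa[start:end])
```

With words as symbols, the text and the pattern are both much shorter sequences than their character-level counterparts, at the cost that matches can only begin and end on word boundaries. A query can be tokenized the same way as the document, for instance `find_phrase(tokens, sa, tokenize("The quick"))`.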

The term full-text index is often used for substring indexes, but it is ambiguous because it is also used for ordinary word indexes, such as inverted files, and for document retrieval. See full-text search.

Specific data structures that can be used as substring indexes include suffix trees, suffix arrays, and compressed suffix arrays,[1] as well as the FM-index.[2]

References

  1. ^ a b c Grossi, Roberto; Vitter, Jeffrey Scott (2005), "Compressed suffix arrays and suffix trees with applications to text indexing and string matching" (PDF), SIAM Journal on Computing, 35 (2): 378–407, doi:10.1137/S0097539702402354, MR 2191449
  2. ^ Ferragina, Paolo; Manzini, Giovanni (2005), "Indexing compressed text", Journal of the ACM, 52 (4): 552–581, doi:10.1145/1082036.1082039, MR 2164632