Linguistic sequence complexity
Linguistic sequence complexity (LC) is a measure of the 'vocabulary richness' of a text.[1] When a nucleotide sequence is written as text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its N-grams (words), can be calculated and serves as a measure of sequence complexity. Thus, the more complex a DNA sequence, the richer its oligonucleotide vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described in (Trifonov 1990)[1] was later improved without changing the essence of the linguistic complexity approach.[1][2][3][4]
The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, and the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level i equals the actual vocabulary size of words of length i in the given sequence. In the most balanced tree, which corresponds to the most complex sequence of length N, the number of nodes at tree level i is either 4^i or N-i+1, whichever is smaller. Complexity (C) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (Ui):

$$C = U_1 U_2 \cdots U_i \cdots U_W$$
This formula differs from the earlier LC measure in two respects: in the way the vocabulary usage Ui is calculated, and in that i does not range from 2 to N-1 but only up to W. Limiting the range of Ui in this way makes the algorithm substantially more efficient without loss of power.[2]
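The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the published algorithm: it assumes that Ui is the ratio of the number of distinct words of length i actually present to the largest possible number, min(4^i, N-i+1), as suggested by the tree description, and it assumes the product starts at i = 1. The function names `vocabulary_usage` and `linguistic_complexity` and the parameter `max_word_len` (standing in for W) are hypothetical.

```python
def vocabulary_usage(seq: str, i: int, alphabet_size: int = 4) -> float:
    """Assumed U_i: distinct words of length i over the maximum possible count."""
    n = len(seq)
    # A sequence of length n contains n - i + 1 (overlapping) words of length i,
    # and at most alphabet_size ** i of them can be different.
    max_words = min(alphabet_size ** i, n - i + 1)
    observed = {seq[k:k + i] for k in range(n - i + 1)}
    return len(observed) / max_words


def linguistic_complexity(seq: str, max_word_len: int) -> float:
    """Product of the assumed U_i for i = 1 .. max_word_len (W in the text)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    complexity = 1.0
    # Word lengths longer than the sequence itself are skipped.
    for i in range(1, min(max_word_len, len(seq)) + 1):
        complexity *= vocabulary_usage(seq, i)
    return complexity


if __name__ == "__main__":
    print(linguistic_complexity("ACGT", 2))      # every possible word is used
    print(linguistic_complexity("AAAAAAAA", 3))  # a single repeated letter
```

Under these assumptions a sequence that uses every available word at each length scores 1, and repetitive sequences score closer to 0.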
In sequence analysis, this complexity calculation can be used to search for conserved regions between compared sequences and for the detection of low-complexity regions, including simple sequence repeats, imperfect direct or inverted repeats, polypurine and polypyrimidine triple-stranded DNA structures, and four-stranded structures (such as G-quadruplexes).[5]
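One way such a search for low-complexity regions could look is a sliding-window scan, sketched below. This is an assumption-laden illustration rather than the method of the cited work: the window length, maximum word length, and threshold are arbitrary example values, the names `window_complexity` and `low_complexity_windows` are hypothetical, and the vocabulary usage is assumed to be the ratio of distinct words of length i to min(4^i, N-i+1).

```python
def window_complexity(window: str, max_word_len: int) -> float:
    """Assumed complexity: product over i of (distinct i-words / max possible)."""
    n = len(window)
    complexity = 1.0
    for i in range(1, min(max_word_len, n) + 1):
        distinct = len({window[k:k + i] for k in range(n - i + 1)})
        complexity *= distinct / min(4 ** i, n - i + 1)
    return complexity


def low_complexity_windows(seq: str, window_len: int,
                           max_word_len: int, threshold: float):
    """Return (start, complexity) for every window scoring below the threshold."""
    hits = []
    for start in range(len(seq) - window_len + 1):
        score = window_complexity(seq[start:start + window_len], max_word_len)
        if score < threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Hypothetical example: a mixed stretch followed by a poly-A run.
    example = "ACGTCAGT" + "AAAAAAAA"
    for start, score in low_complexity_windows(example, 8, 2, 0.1):
        print(start, round(score, 4))
```

In practice the threshold and window length would need to be tuned to the kind of repeat being sought; the values here only serve to make the example run.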
References
1. Edward N. Trifonov (1990). "Making sense of the human genome". Structure and Methods. Human Genome Initiative and DNA Recombination. Vol. 1. Adenine Press, New York. pp. 69–77.
2. doi:10.1016/S0097-8485(99)00007-8
3. doi:10.1093/nar/gkh466
4. doi:10.1016/j.tcs.2004.06.023
5. doi:10.1016/j.ygeno.2011.04.009