Linguistic sequence complexity

The linguistic complexity (LC) measure^[1] quantifies the 'vocabulary richness' of a text. When a nucleotide sequence is studied as a text written in a four-letter alphabet, the repetitiveness of that text, that is, the repetition of its N-grams (words), can be calculated and serves as a measure of sequence complexity. Thus, the more complex a DNA sequence, the richer its oligonucleotide vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described by Trifonov (1990)^[1] was later improved without changing the essence of the linguistic complexity approach.^{[original research?]}^[2]^[3]^[4]

The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, and the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level $i$ equals the actual vocabulary size of words of length $i$ in the given sequence; in the most balanced tree, which corresponds to the most complex sequence of length N, the number of nodes at level $i$ is either $4^{i}$ or $N-i+1$, whichever is smaller. The complexity ( $C$ ) of a sequence fragment (of length RW) can be calculated directly as the product of the vocabulary-usage measures ( $U_{i}$ ):^{[citation needed]}

$C=U_{1}U_{2}\cdots U_{i}\cdots U_{W}$
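The tree-level counts described above can be tabulated with a few lines of Python. This is a minimal sketch, not the published implementation: the function name `level_sizes` and its `max_level` parameter are hypothetical, and the input is assumed to be an uppercase string over A, C, G, T. For each word length $i$ it reports the actual vocabulary size and the size in the most balanced tree, that is, the smaller of $4^{i}$ and $N-i+1$.

```python
def level_sizes(seq, max_level):
    """Return (i, actual, maximal) vocabulary sizes for word lengths 1..max_level.

    actual  = number of distinct words of length i found in seq
    maximal = min(4**i, N - i + 1), the node count in the most balanced tree
    """
    n = len(seq)
    rows = []
    for i in range(1, max_level + 1):
        # Collect every distinct word of length i (overlapping positions).
        words = {seq[p:p + i] for p in range(n - i + 1)}
        maximal = min(4 ** i, n - i + 1)
        rows.append((i, len(words), maximal))
    return rows


for i, actual, maximal in level_sizes("ACGGGAAGCTGATTCCA", 4):
    print(i, actual, maximal)
```

A level where the actual count falls below the maximal count marks the point at which the tree of the sequence becomes less balanced than that of the most complex sequence of the same length.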

Vocabulary usage for oligomers of a given size $i$ is defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, $U_{2}$ for the sequence ACGGGAAGCTGATTCCA is 14/16, as it contains 14 of the 16 possible different dinucleotides; $U_{3}$ for the same sequence is 15/15, and $U_{4}$ is 14/14. For the sequence ACACACACACACACACA, $U_{1}=1/2$; $U_{2}=2/16=0.125$, as it has a simple vocabulary of only two dinucleotides; $U_{3}$ for this sequence is 2/15. k-tuples with k from two to W are considered, where W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW<260, W=5; for RW<1029, W=6, and so on. The value of $C$ provides a measure of sequence complexity in the convenient range $0<C\le 1$ for various DNA sequence fragments of a given length.^{[citation needed]} This formula differs from the original LC measure in two respects: in the way vocabulary usage $U_{i}$ is calculated, and in that $i$ ranges only up to W rather than from 2 to N-1. This restriction on the range of $U_{i}$ makes the algorithm substantially more efficient without loss of power.^{[original research?]}
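The worked examples above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the product starts at $U_{1}$ as in the formula for $C$, and the rule for W beyond the listed values extrapolates the observation that the stated thresholds 18, 67, 260 and 1029 equal $4^{W-1}+(W-1)$.

```python
from math import prod


def max_word_length(rw):
    """Choose W from the fragment length RW.

    Reproduces the listed cases (RW < 18 -> 3, < 67 -> 4, < 260 -> 5,
    < 1029 -> 6). Continuing the 4**(W-1) + (W-1) pattern past W = 6
    is an assumption, not something stated in the text.
    """
    w = 3
    while rw >= 4 ** (w - 1) + (w - 1):
        w += 1
    return w


def vocabulary_usage(seq, i):
    """U_i: distinct words of length i divided by the maximal possible count."""
    positions = len(seq) - i + 1
    if positions < 1:
        raise ValueError("word length exceeds sequence length")
    observed = {seq[p:p + i] for p in range(positions)}
    return len(observed) / min(4 ** i, positions)


def complexity(seq):
    """C = U_1 * U_2 * ... * U_W for one sequence fragment."""
    w = max_word_length(len(seq))
    return prod(vocabulary_usage(seq, i) for i in range(1, w + 1))


print(complexity("ACGGGAAGCTGATTCCA"))  # 1 * 14/16 * 15/15
print(complexity("ACACACACACACACACA"))  # 1/2 * 2/16 * 2/15
```

Both example sequences have length 17, so W is 3 and $U_{4}$ does not enter the product. The mixed sequence scores 0.875, while the AC repeat scores 1/120, roughly 0.0083.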

This complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect direct or inverted repeats, polypurine and polypyrimidine triple-stranded DNA structures, and four-stranded structures (such as G-quadruplexes).^[5]
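One plausible way to apply the measure to low-complexity detection is a sliding-window scan, sketched below in Python. This is an assumption about how such a search could be organised, not a description of any published tool: the function names, the window length of 17, the fixed W of 3 and the cutoff of 0.2 are all hypothetical choices made for illustration.

```python
from math import prod


def complexity(fragment, w=3):
    """Product U_1..U_w for one fragment (w=3 suits fragments shorter than 18)."""
    n = len(fragment)
    usages = []
    for i in range(1, w + 1):
        positions = n - i + 1
        observed = {fragment[p:p + i] for p in range(positions)}
        usages.append(len(observed) / min(4 ** i, positions))
    return prod(usages)


def low_complexity_windows(seq, window=17, threshold=0.2):
    """Yield (start, C) for every window whose complexity is below threshold.

    window and threshold are illustrative values, not recommended settings.
    """
    for start in range(len(seq) - window + 1):
        c = complexity(seq[start:start + window])
        if c < threshold:
            yield start, c


# A mixed fragment followed by a simple AC repeat.
seq = "ACGGGAAGCTGATTCCA" + "ACACACACACACACACA"
for start, c in low_complexity_windows(seq):
    print(start, round(c, 4))
```

Windows lying entirely within the AC repeat fall far below the cutoff, while the window covering the mixed fragment does not. In practice the window length and threshold would need tuning for the repeat classes of interest.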

References

[1] Edward N. Trifonov (1990). "Making sense of the human genome". Structure and Methods. Human Genome Initiative and DNA Recombination. Vol. 1. Adenine Press, New York. pp. 69–77.
[2] Gabrielian (1999). doi:10.1016/S0097-8485(99)00007-8.
[3] Orlov (2004). doi:10.1093/nar/gkh466.
[4] Janson (2004). doi:10.1016/j.tcs.2004.06.023.
[5] Kalendar (2011). doi:10.1016/j.ygeno.2011.04.009.

