Linguistic sequence complexity

The linguistic complexity measure

The sequence analysis complexity calculation method can be used to search for conserved regions between compared sequences for the detection of low-complexity regions including simple sequence repeats, imperfect direct or inverted repeats, polypurine and polypyrimidine triple-stranded DNA structures, and four-stranded structures (such as G-quadruplexes).

Linguistic complexity (LC) measurements are performed using the alphabet-capacity [L-gram method] along the whole sequence length and calculated as the sum of the observed range (xi) from 1 to L size words in the sequence divided by the sum of the expected (E) value for this sequence length ^[1], ^[2], ^[3], ^[4].

^[5] was introduced as a measure of the ‘vocabulary richness’of a text. When a nucleotide sequence is studied as a text written in the four-letter alphabet, the repetitiveness of such a text, that is, the extensive repetition of some k-grams (words), can be calculated, and served as a measure of sequence complexity. Thus, the more complex a DNA sequence, the richer is its oligonucleotide vocabulary, whereas repetitious sequences have relatively lower complexities. We have recently improved the original algorithm described in (Trifonov 1990) without changing the essence of the linguistic complexity (LC) approach.

The meaning of LC may be better understood by regarding the presentation of a sequence as a tree of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a complexity measure. The number of nodes at the tree level i is equal to the actual vocabulary size of words with the length i in a given sequence; the number of nodes in the most balanced tree, which corresponds to the most complex sequence of length N, at the tree level i is either 4i or N?i+1, whichever is smaller. Complexity (C) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures View the MathML source:

Vocabulary usage View the MathML source for oligomers of a given size i can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U2 for the sequence ACGGGAAGCTGATTCCA = 14/16, as it contains 14 of 16 possible different dinucleotides; U3 for the same sequence = 15/15, and U4=14/14. For the sequence ACACACACACACACACA, U1=1/2;U2=2/16=0.125, as it has a simple vocabulary of only two dinucleotides; U3 for this sequence = 2/15. k-tuples with k from two to W considered, while W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW<260,W=5; for RW<1029,W=6, and so on. The value of C provides a measure of sequence complexity in the convenient range 0<C<1 for various DNA sequence fragments of a given length. This novel formula ((1)) is different from the previous LC measure in two respects: in the way vocabulary usage Ui is calculated, and because i is not in the range of 2 to N?1 but only up to W. This new limitation on the range of Ui makes the algorithm substantially more effective without loss of power.

References

^ Andrei Gabrielian, Alexander Bolshoy (1999). "Sequence complexity and DNA curvature". Computer & Chemistry. 23: 263–274. doi:10.1016/S0097-8485(99)00007-8.
^ Y.L. Orlov, V.N. Potapov (2004). "Complexity: an internet resource for analysis of DNA sequence complexity". Nucleic Acids Res. 32: W628 – W633. doi:10.1093/nar/gkh466.
^ Svante Janson, Stefano Lonardi, Wojciech Szpankowski (2004). "On average sequence complexity". Theoretical Computer Science. 326: 213–227. doi:10.1016/j.tcs.2004.06.023.{{cite journal}}: CS1 maint: multiple names: authors list (link)
^ Kalendar R, Lee D, Schulman AH (2011). "Java web tools for PCR, in silico PCR, and oligonucleotide assembly and analysis". Genomics. 98 (2): 137–144. doi:10.1016/j.ygeno.2011.04.009. PMID 21569836.{{cite journal}}: CS1 maint: multiple names: authors list (link)
^ E.N Trifonov (1990). "Making sense of the human genome". Structure and Methods. Human Genome Initiative and DNA Recombination. Vol. 1. Adenine Press, New York. pp. 69–77. {{cite book}}: Unknown parameter |book= ignored (help)

[1] Andrei Gabrielian, Alexander Bolshoy (1999). "Sequence complexity and DNA curvature". Computer & Chemistry. 23: 263–274. doi:10.1016/S0097-8485(99)00007-8.

[2] Y.L. Orlov, V.N. Potapov (2004). "Complexity: an internet resource for analysis of DNA sequence complexity". Nucleic Acids Res. 32: W628 – W633. doi:10.1093/nar/gkh466.

[3] Svante Janson, Stefano Lonardi, Wojciech Szpankowski (2004). "On average sequence complexity". Theoretical Computer Science. 326: 213–227. doi:10.1016/j.tcs.2004.06.023.{{cite journal}}: CS1 maint: multiple names: authors list (link)

[4] Kalendar R, Lee D, Schulman AH (2011). "Java web tools for PCR, in silico PCR, and oligonucleotide assembly and analysis". Genomics. 98 (2): 137–144. doi:10.1016/j.ygeno.2011.04.009. PMID 21569836.{{cite journal}}: CS1 maint: multiple names: authors list (link)

[5] E.N Trifonov (1990). "Making sense of the human genome". Structure and Methods. Human Genome Initiative and DNA Recombination. Vol. 1. Adenine Press, New York. pp. 69–77. {{cite book}}: Unknown parameter |book= ignored (help)

[1]

[2]

[3]

[4]

[5]