Dot plot (bioinformatics)

In bioinformatics, a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity between them. It is a type of recurrence plot.
>CY003854.1 Influenza A virus (A/mallard/Alberta/77/1977(H2N3)) segment 1, complete sequence AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCCC GCACCCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAATACACATCAGGAAG GCAAGAGAAGAACCCCGCACTCAGGATGAAGTGGATGATGGCAATGAAATATCCAATTACTGCAGATAAG AGAATAATGGAAATGATTCCTGAAAGGAATGAACAAGGACAAACCCTCTGGAGCAAAACAAACGATGCCG GCTCAGACCGAGTGATGGTATCACCTCTGGCCGTGACATGGTGGAATAGGAATGGACCAACAACAAGTAC AGTTCACTACCCAAAGGTATATAAAACTTATTTCGAAAAAGTCGAAAGGTTGAAACACGGGACCTTTGGC CCCGTCCACTTCAGAAATCAAGTTAAGATAAGACGGAGGGTTGACATAAACCCTGGCCACGCAGACCTCA GTGCCAAAGAGGCACAGGATGTAATCATGGAAGTTGTTTTCCCAAATGAAGTGGGAGCTAGAATACTAAC ATCGGAGTCACAACTGACAATAACAAAAGAGAAAAAGGAAGAACTCCAGGACTGTAAAATTGCCCCCTTG ATGGTAGCATACATGCTAGAAAGAGAGTTGGTCCGCAAAACGAGGTTCCTCCCAGTGGCTGGTGGAACAA GCAGTGTCTATATTGAGGTGTTGCATTTAACCCAGGGGACATGCTGGGAGCAGATGTACACTCCAGGAGG GGAAGTGAGAAATGATGATGTTGACCAAAGCTTGATTATCGCTGCCAGGAACATAGTAAGAAGAGCAACG GTATCAGCAGACCCACTAGCATCTCTATTGGAGATGTGCCACAGCACACAGATTGGGGGAATAAGGATGG TAGACATCCTTCGGCAAAATCCAACAGAGGAACAAGCCGTGGACATATGCAAGGCAGCAATGGGCTTGAG GATTAGCTCATCTTTCAGCTTTGGTGGATTCACTTTCAAAAGAACAAGCGGGTCGTCAGTTAAGAGAGAA GAAGAAGTGCTTACGGGCAACCTTCAAACATTGAAAATAAGAGTACATGAGGGGTATGAAGAGTTCACAA TGGTTGGGAGAAGAGCAACAGCTATTCTAAGAAAGGCAACCAGGAGATTGATCCAGCTAATAGTAAGTGG GAGAGACGAGCAGTCAATTGCTGAAGCAATAATTGTGGCCATGGTATTTTCACAAGAGGATTGCATGATC AAGGCAGTTCGGGGTGATCTGAACTTTGTCAATAGGGCAAATCAGCGACTGAACCCCATGCATCAACTCT TGAGACACTTCCAAAAGGATGCAAAAGTGCTTTTCCAAAACTGGGGAATTGAACCCATTGACAATGTGAT GGGAATGATCGGAATATTGCCCGACATGACCCCAAGTACTGAGATGTCGCTGAGGGGGATAAGAGTCAGC AAAATGGGAGTAGATGAATACTCCAGCACAGAAAGGGTGGTGGTGAGCATTGACCGATTTTTAAGGGTTC GGGATCAACGGGGAAACGTACTATTGTCACCCGAAGAAGTTAGCGAGACACAAGGAACGGAGAAACTGAC AATAACTTATTCGTCATCAATGATGTGGGAGATCAATGGTCCTGAGTCGGTGTTGGTCAATACTTATCAA TGGATCATCAGGAACTGGGAGACTGTGAAAATTCAATGGTCACAGGATCCCACAATGTTATATAATAAGA TGGAATTCGAGCCATTTCAGTCTCTGGTCCCTAAGGCAGCCAGAGGTCAATACAGCGGATTCGTGAGGAC 
ACTGTTCCAGCAGATGCGGGATGTGCTTGGAACATTTGACACTGTTCAGATAATAAAACTTCTTCCCTTT GCTGCTGCTCCACCAGAACAGAGTAGGATGCAGTTCTCCTCCCTGACTGTGAATGTGAGAGGATCAGGAA TGAGGATACTGGTAAGAGGCAATTCTCCAGTGTTCAATTACAACAAGGCCACCAAGAGGCTTACAGTCCT TGGAAAAGATGCAGGTGCATTGACCGAAGATCCAGATGAAGGCACAGCTGGAGTGGAGTCTGCTGTTCTA AGAGGATTCCTCATTTTGGGCAAAGAAGACAAGAGATATGGCCCAGCATTAAGCATCAATGAGCTGAGCA ATCTTGCAAAAGGAGAGAAGGCTAATGTGCTAATTGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAA ACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAGTGT CGAATTGTTTAAAAACGACCTTGTTTCTACT
>CY003886.1 Influenza A virus (A/mallard duck/ALB/376/1985(H2N3)) segment 1, complete sequence AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCCC GCACTCGCGAGATACTCACCAAAACCACTGTGGACCATATGGCCATAATCAAAAAATACACATCAGGAAG GCAAGAGAAGAATCCCGCACTCAGGATGAAATGGATGATGGCAATGAAATATCCAATTACAGCGGATAAG AGGATAATGGAGATGATTCCCGAGAGGAATGAACAAGGGCAAACCCTCTGGAGCAAAACAAATGATGCCG GCTCAGACCGAGTGATGGTATCACCTCTGGCTGTGACATGGTGGAATAGGAATGGACCAACAACAAGTAC AATTCACTACCCAAAGGTATATAAAACCTATTTCGAAAAGGTCGAAAGGTTAAAACATGGGACCTTTGGC CCCGTTCACTTCAGGAATCAAGTTAAGATAAGACGGAGAGTTGACATAAACCCTGGACATGCAGACCTCA GTGCCAAAGAGGCACAGGATGTAATCATGGAAGTTGTTTTCCCAAATGAAGTGGGGGCCAGGATATTAAC ATCGGAGTCACAGCTGACAATAACAAAAGAGAAAAAGGAAGAACTCCAAGATTGTAAAATTGCCCCCTTG ATGGTAGCATACATGCTAGAAAGAGAGTTAGTCCGCAAAACGAGGTTCCTCCCAGTGGCTGGTGGAACAA GCAGTGTTTATATTGAGGTGTTGCATTTGACCCAGGGAACATGCTGGGAACAAATGTACACTCCAGGAGG GGAAGTGAGAAATGATGATGTTGACCAAAGCTTAATTATCGCTGCCAGGAATATAGTAAGAAGAGCAACG GTATCAGCAGACCCACTAGCGTCTCTATTGGAGATGTGCCACAGCACACAGATTGGTGGAATAAGGATGG TAGACATCCTTAGGCAGAATCCAACAGAGGAACAAGCCGTGGATATATGCAAGGCGGCAATGGGCTTGAG GATTAGCTCATCTTTCAGCTTCGGTGGATTCACTTTTAAAAGAACAAGTGGGTCGTCAGTCAAAAGAGAA GAAGAAGTGCTTACGGGCAACCTTCAAACACTGAAAATAAGAGTGCATGAGGGGTATGAAGAATTCACAA TGGTTGGGAGAAGAGCAACAGCTATTCTCAGGAAGGCAACCAGGAGATTGATTCAGCTAATAGTCAGTGG GAGAGATGAACAGTCAATTGCTGAAGCAATAATTGTAGCTATGGTATTTTCACAAGAGGATTGCATGATC AAGGCAGTTCGGGGTGATCTGAACTTTGTCAATAGAGCAAACCAGCGACTGAACCCCATGCATCAACTCT TGAGACATTTCCAAAAGGATGCAAAAGTGCTTTTCCAAAATTGGGGAATTGAACCCATTGACAATGTGAT GGGAATGATCGGAATACTACCCGACATGACCCCAAGTACTGAGACGTCATTGAGAGGGATAAGAGTCAGC AAAATGGGAGTGGATGAATACTCCAGCACAGAGAGAGTGGTGGTGAGCATTGACCGTTTTTTAAGGGTTC GGGATCAACGGGGAAACGTACTATTGTCACCTGAAGAAGTCAGCGAGACGCAAGGGACGGAAAAGTTGAC AATAACTTACTCATCATCAATGATGTGGGAGATCAATGGTCCTGAATCAGTGTTGGTCAATACTTACCAG TGGATCATCAGAAACTGGGAGACTGTGAAAATTCAATGGTCACAGGATCCCACAATGTTGTACAATAAGA TGGAATTCGAGCCATTTCAGTCTCTGGTCCCTAAGGCAGCTAGAGGTCAATACAGCGGATTCGTGAGGAC 
GCTGTTCCAACAAATGCGGGATGTGCTTGGAACATTTGACACTGTTCAGATAATAAAACTTCTCCCCTTT GCTGCTGCCCCACCAGAACAGAGTAGGATGCAGTTCTCCTCCTTGACTGTGAATGTAAGAGGATCAGGAA TGAGGATACTGGTAAGAGGCAACTCTCCAGTGTTCAATTACAACAAGGCCACCAAGAGGCTTACAGTCCT CGGGAAGGATGCAGGTGCATTAACTGAAGACCCAGATGAAGGCACAGCTGGAGTGGAATCTGCTGTTCTG AGAGGATTCCTCATTTTGGGCAAAGAAGACAAGAGATATGGCCCAGCATTGAGCATCAATGAGCTGAGCA ATCTTGCAAAAGGAGAGAAGGCTAATGTGCTAATTGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAA ACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGGATTCGGATGGCCATCAATTAGTGT CGAATTGTTTAAAAACGACCTTGTTTCTACT
Interpretation
Some idea of the similarity of the two sequences can be gleaned from the number and length of matching segments shown in the matrix. Identical sequences produce an unbroken line along the main diagonal of the matrix. Insertions and deletions between sequences give rise to disruptions in this diagonal. Regions of local similarity or repetitive sequences give rise to further diagonal matches in addition to the central diagonal. Because the residue alphabet is small (four nucleotides or twenty amino acids), many isolated matches also arise by chance and appear as background noise. One way of reducing this noise is to shade only runs or 'tuples' of residues, e.g. a tuple of 3 corresponds to three residues in a row. This is effective because the probability of matching three residues in a row by chance is much lower than that of a single-residue match.
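The tuple filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names `tuple_dot_plot` and `render` are hypothetical, matching is exact (no scoring matrix), and every cell of a matching run is shaded. Under the simplifying assumption of independent, equally frequent nucleotides, a single position matches by chance with probability 1/4, while a run of three matches with probability (1/4)^3 = 1/64.

```python
def tuple_dot_plot(seq_x, seq_y, k=3):
    """Return the set of (i, j) cells that lie on an exact k-tuple match.

    i indexes seq_x (x-axis) and j indexes seq_y (y-axis). With k=1 this
    reduces to the plain residue-by-residue dot plot.
    """
    # Index every k-tuple of seq_y by the positions where it starts.
    starts = {}
    for j in range(len(seq_y) - k + 1):
        starts.setdefault(seq_y[j:j + k], []).append(j)

    dots = set()
    for i in range(len(seq_x) - k + 1):
        for j in starts.get(seq_x[i:i + k], []):
            # Shade all k cells of the matching run along the diagonal.
            for d in range(k):
                dots.add((i + d, j + d))
    return dots


def render(dots, n_x, n_y):
    """Draw the matrix as text: one row per seq_y position, '*' for a dot."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(n_x))
        for j in range(n_y)
    )


if __name__ == "__main__":
    seq = "GATTACA"
    for k in (1, 3):
        dots = tuple_dot_plot(seq, seq, k)
        print(f"k={k}: {len(dots)} shaded cells")
        print(render(dots, len(seq), len(seq)))
```

Comparing the short sequence with itself, the single-residue plot shows scattered off-diagonal dots wherever the same letter recurs, whereas the 3-tuple plot keeps only the main diagonal. Larger tuples suppress more noise but can also hide short genuine matches, so the tuple size is a trade-off.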
Dot plots compare two sequences by placing one sequence along the x-axis and the other along the y-axis of a plot. Wherever the residue at a position on the x-axis matches the residue at a position on the y-axis, a dot is drawn at the corresponding coordinates. The sequences can be written forwards or backwards, but both axes must use the same direction, and that direction determines the direction of the lines in the plot. Once plotted, adjacent dots combine to form lines. The more similar the two sequences are, the more closely the plot approaches a single unbroken diagonal, like the graph of a directly proportional relationship. This pattern is affected by sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts include insertions, deletions, and mutations. One or more of these features produce multiple lines in a variety of configurations, depending on which features are present in the sequences. Low-complexity regions produce a very different result. These are regions composed of only a few distinct amino acids, which makes the sequence redundant within that region. They are typically found around the diagonal, and may or may not appear as a square in the middle of the dot plot.
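As a rough illustration of the construction described above, the following Python sketch builds the dot matrix and then lists the diagonal lines it contains. The names `dot_matrix` and `diagonal_runs`, the `min_len` parameter, and the example sequences are all hypothetical and chosen for illustration only. The example shows how an insertion breaks the main diagonal and shifts the remainder of the line onto a parallel diagonal.

```python
def dot_matrix(seq_x, seq_y):
    """matrix[j][i] is True when seq_y[j] equals seq_x[i]."""
    return [[a == b for a in seq_x] for b in seq_y]


def diagonal_runs(matrix, min_len=4):
    """List diagonal lines as (offset, start_i, length) tuples.

    offset = j - i, so offset 0 is the main diagonal and a positive
    offset means the match sits further along seq_y than along seq_x.
    Runs shorter than min_len are ignored as likely chance matches.
    """
    n_y = len(matrix)
    n_x = len(matrix[0]) if matrix else 0
    runs = []
    for offset in range(-(n_x - 1), n_y):
        i = max(0, -offset)
        run_start, run_len = 0, 0
        while i < n_x and i + offset < n_y:
            if matrix[i + offset][i]:
                if run_len == 0:
                    run_start = i
                run_len += 1
            else:
                if run_len >= min_len:
                    runs.append((offset, run_start, run_len))
                run_len = 0
            i += 1
        # Close a run that reaches the edge of the matrix.
        if run_len >= min_len:
            runs.append((offset, run_start, run_len))
    return runs


if __name__ == "__main__":
    seq_x = "CATGCTAGGT"
    # Same sequence with "AAA" inserted after the fifth residue.
    seq_y = "CATGC" + "AAA" + "TAGGT"
    for offset, start, length in diagonal_runs(dot_matrix(seq_x, seq_y), min_len=5):
        print(f"offset {offset:+d}: starts at x={start}, length {length}")
```

Here the first five residues line up on the main diagonal (offset 0), and the five residues after the insertion line up on a diagonal displaced by three positions, the length of the inserted segment. A direct repeat would show up the same way, as additional runs on parallel diagonals.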
Software to create plots
- SynMap – An easy-to-use, web-based tool to generate dot plots for many species with access to an extensive genome database. Offered by the comparative genomics platform CoGe.
- Genomdiff – An open-source Java dot plot program for viruses.
- Gepard[1] – Dot plot tool suitable even for genome scale.
- ANACON – Contact analysis of dot plots.
- General introduction to dot plots with example algorithms and a software tool to create small and medium size dot plots.
- Dotlet – Provides a program allowing you to construct a dot plot with your own sequences.
- UGENE Dot Plot viewer – Open-source dot plot visualizer.
- seqinr – R package to generate dot plots.
- dotplot – R package to rapidly generate dot plots as either traditional or ggplot graphics.
- dotmatcher – Web tool to generate dot plots.
- Dotter[2] – Standalone program to generate dot plots.
- JDotter[3] – Java version of Dotter.
- Dotplot – An easy (educational) HTML5 tool to generate dot plots from RNA sequences.
- lastz and laj – Programs to prepare and visualize genomic alignments.
- Flexidot – Customizable and ambiguity-aware dot plot suite for visual sequence analyses, implemented in Python.
References
[1] Krumsiek, J.; Arnold, R.; Rattei, T. (2007-04-15). "Gepard: a rapid and sensitive tool for creating dotplots on genome scale". Bioinformatics. 23 (8): 1026–1028. doi:10.1093/bioinformatics/btm039. ISSN 1367-4803.
[2] Sonnhammer, E. L.; Durbin, R. (1995-12-29). "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis". Gene. 167 (1–2): GC1–10. ISSN 0378-1119. PMID 8566757.
[3] Brodie, R.; Roper, R. L.; Upton, C. (2004-01-22). "JDotter: a Java interface to multiple dotplots generated by dotter". Bioinformatics. 20 (2): 279–281. doi:10.1093/bioinformatics/btg406. ISSN 1367-4803.