Consensus sequence

In bioinformatics, pattern recognition is a major topic. The search for a sequence motif or a signal sequence usually yields a consensus on the sequence searched for; hence, this sequence is called a consensus sequence. It indicates which amino acids or nucleotides can be found at each position. Consider the following arbitrary example of a DNA pattern:

AA[CT]GT{A}CTG[CG]

In this notation, a single letter such as A or C means that this nucleotide is always found at that position. [CT] stands for either C or T, and {A} means any nucleotide except A.
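The notation described above maps closely onto regular expressions, which gives a simple way to test whether a sequence fits the pattern. The following is a minimal sketch in Python; the function name pattern_to_regex and the test sequences are made up for illustration, and it assumes the pattern uses only plain letters, [..] and {..}.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate the bracket/brace notation into regular-expression syntax.

    Hypothetical helper: [CT] is already valid regex syntax for "C or T",
    and {A} ("anything but A") becomes the negated character class [^A].
    """
    return pattern.replace("{", "[^").replace("}", "]")

regex = pattern_to_regex("AA[CT]GT{A}CTG[CG]")

# Made-up sequences: the first fits the pattern, the second has an A
# at the position where A is excluded.
fits = re.fullmatch(regex, "AACGTCCTGC") is not None
fails = re.fullmatch(regex, "AACGTACTGC") is not None
```

Note that the negated class [^A] accepts any character other than A, not only C, G and T, so input should be restricted to the four nucleotide letters if that matters.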

Amino acid sequences are usually given using the one-letter codes for amino acids. In this notation, X stands for any amino acid and H denotes any hydrophobic amino acid.

The notation [CT] does not give any indication of the relative probabilities of C or T occurring at that position.

Another way of representing a consensus sequence is a sequence logo. This is a graphical representation of the consensus sequence, in which the size of a symbol is related to the probability that a given nucleotide or amino acid occurs at a certain position. This representation is only practical for short sequences, because every position needs its own column in the graphic.
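The per-position probabilities that a bracket notation such as [CT] leaves out can be estimated from a set of aligned sequences. The sketch below, in Python, counts how often each symbol occurs in each column; the function name position_frequencies and the four example sequences are invented for illustration, and the sequences are assumed to be aligned and of equal length.

```python
from collections import Counter

def position_frequencies(sequences):
    """Return, for each position, a dict mapping symbol to relative frequency.

    Hypothetical helper: assumes all sequences are aligned and equally long.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    frequencies = []
    for column in zip(*sequences):  # one tuple of symbols per position
        counts = Counter(column)
        frequencies.append({sym: n / len(column) for sym, n in counts.items()})
    return frequencies

# Made-up aligned sequences, not real data.
aligned = ["AACGT", "AATGT", "AACGT", "AACGA"]
freqs = position_frequencies(aligned)

# The most frequent symbol at each position gives a simple consensus string.
consensus = "".join(max(col, key=col.get) for col in freqs)
```

With these example sequences, the third position contains C in three of four sequences and T in one, which is information that [CT] alone would not convey.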

DNA sequencing usually deals with rather long sequences. Especially when a whole genome is sequenced (e.g. the Human Genome Project) and DNA from several individuals is used, several ambiguous sites will be found. The result might be called a "consensus sequence"; however, this is rarely done.