Secondary structure prediction

Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure (amino acid or nucleotide sequence, respectively). For proteins, the success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern.
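
The comparison against DSSP described above amounts to counting the residues on which the prediction and the reference assignment agree. The following is a minimal Python sketch of that count. The reduction of DSSP's eight state letters to three classes is a commonly used convention that is assumed here rather than taken from the text, and the names DSSP_TO_THREE, reduce_dssp and per_residue_accuracy are illustrative.

```python
# Assumed reduction of DSSP's eight states to three classes. This is a common
# convention, not something DSSP itself prescribes.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helical states
    "E": "E", "B": "E",             # strand and bridge states
    "T": "C", "S": "C", "-": "C",   # everything else is treated as coil
}


def reduce_dssp(dssp_string):
    """Map a DSSP state string to the three classes H, E and C."""
    return "".join(DSSP_TO_THREE.get(state, "C") for state in dssp_string)


def per_residue_accuracy(predicted, dssp_string):
    """Fraction of residues whose predicted class matches the reduced DSSP class."""
    reference = reduce_dssp(dssp_string)
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    if not reference:
        raise ValueError("empty sequence")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy strings, invented for illustration only.
print(per_residue_accuracy("HHHHCCEEEE", "HHGG-TEEBS"))  # 0.9
```

A single number of this kind hides where the errors fall; for example, it does not distinguish a missed strand from a helix that is one residue too long.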

The first methods for secondary structure prediction were introduced in the late 1960s and were based on the helix-forming tendencies of amino acids, as measured experimentally from polymer studies. A marked improvement occurred in the mid-1970s with the development of statistical methods that used the then newly determined protein structures to estimate the probability that each amino acid adopts an alpha helix, beta sheet, turn or loop, or random coil conformation. Such statistical methods for predicting the secondary structure of a single sequence were developed further and have now reached ~60% accuracy. Another large gain in prediction accuracy occurred when it was realized that secondary structure is conserved over evolution. The secondary structure of all the sequences in a multiple sequence alignment can therefore be estimated at once; instead of measuring the propensity of a single amino acid to adopt a type of secondary structure, one may assess the propensity of a column of structurally analogous amino acids from an alignment of sequences, which may have diverged as long as 1 billion years ago. Using this multiple-sequence-alignment approach and modern machine learning methods (such as neural networks and support vector machines), some methods of secondary structure prediction have reached nearly 80% accuracy for globular proteins. Prediction at this level of accuracy is critical for fold recognition and de novo methods for protein structure prediction, and has also been applied to classifying structural motifs and refining sequence alignments. The accuracy of such prediction methods is assessed on a weekly basis, as new structures are added to the Protein Data Bank.
Further progress in accuracy may be difficult owing to idiosyncrasies of the DSSP standard (particularly near the ends of secondary structure units) and to the role of tertiary interactions (i.e., interactions between amino acids that are distant along the protein backbone) in determining secondary structure; for example, a segment with a strong helical tendency may nonetheless adopt a beta-strand conformation if its side chains pack well and the rest of the protein is composed of beta sheets. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils.^[1]

The problem of predicting RNA secondary structure is broadly related but depends mainly on base pairing and base stacking interactions. Many RNA molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of RNA molecules, such as transfer RNA or microRNA, is observed. Most RNA secondary structure prediction methods rely on variations of dynamic programming and are therefore unable to identify pseudoknots efficiently.

Protein structure

Chou-Fasman method

The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure.^[2] The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50-60% accurate in predicting secondary structures.^[1]
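
The idea of deriving parameters from relative frequencies can be sketched in a few lines of Python. This is a simplified illustration and not the published Chou-Fasman procedure: it omits the method's rules for starting and extending segments, and the training strings are invented toy data rather than real structures. The function names propensities and window_scores are hypothetical.

```python
from collections import Counter


def propensities(sequences, structures):
    """Return {state: {residue: propensity}} from paired sequence/state strings.

    The propensity is the fraction of a residue type found in a state divided
    by the overall fraction of residues in that state, so values above 1
    indicate that the residue favours the state.
    """
    residue_counts = Counter()
    state_counts = Counter()
    pair_counts = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must align")
        for aa, state in zip(seq, ss):
            residue_counts[aa] += 1
            state_counts[state] += 1
            pair_counts[aa, state] += 1
    total = sum(residue_counts.values())
    table = {}
    for state, n_state in state_counts.items():
        table[state] = {
            aa: (pair_counts[aa, state] / n_aa) / (n_state / total)
            for aa, n_aa in residue_counts.items()
        }
    return table


def window_scores(sequence, table, state, width=4):
    """Average propensity for `state` over each window of `width` residues."""
    # Residues absent from the training data get a neutral propensity of 1.0.
    values = [table[state].get(aa, 1.0) for aa in sequence]
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]


# Toy training data, invented for illustration (H = helix, C = coil).
table = propensities(["AAEELLGGPP"], ["HHHHHHCCCC"])
print(table["H"]["A"], table["C"]["P"])
print(window_scores("AAEEGGPP", table, "H"))
```

Windows whose average helix propensity is well above 1 would be candidate helical segments; with so little training data the numbers themselves carry no biological meaning.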

GOR method

The GOR method, named for the three scientists who developed it (Garnier, Osguthorpe, and Robson), is an information theory-based method developed not long after Chou-Fasman that uses the more powerful probabilistic techniques of Bayesian inference.^[3] The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the conditional probability of the amino acid assuming each structure given that its neighbors assume the same structure. This method is both more sensitive and more accurate because amino acid structural propensities are strong only for a small number of amino acids such as proline and glycine. The original GOR method is roughly 65% accurate and is dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicts as loops or disorganized regions.
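
A window-based, information-style score in the spirit of this approach can be sketched as follows. This is a deliberate simplification rather than the published GOR parameterization: for each state it sums log(P(state | residue at offset m) / P(state)) over the residues near a position and picks the state with the highest total. The window half-width, the pseudocount, the toy training strings and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(sequences, structures, half_window=8, pseudocount=1.0):
    """Estimate log(P(state | residue at offset m) / P(state)) from labelled data."""
    states = sorted(set("".join(structures)))
    residues = sorted(set("".join(sequences)))
    state_counts = Counter()
    joint = Counter()  # (state, offset, residue) -> count
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must align")
        for j, state in enumerate(ss):
            state_counts[state] += 1
            for m in range(-half_window, half_window + 1):
                if 0 <= j + m < len(seq):
                    joint[state, m, seq[j + m]] += 1
    total = sum(state_counts.values())
    info = {}
    for m in range(-half_window, half_window + 1):
        for r in residues:
            # Pseudocounts keep unseen combinations from producing log(0).
            denom = sum(joint[s, m, r] for s in states) + pseudocount * len(states)
            for s in states:
                p_s_given_r = (joint[s, m, r] + pseudocount) / denom
                info[s, m, r] = math.log(p_s_given_r / (state_counts[s] / total))
    return states, info


def window_score(sequence, j, state, info, half_window):
    """Sum the information contributed by every residue in the window around j."""
    return sum(
        info.get((state, m, sequence[j + m]), 0.0)
        for m in range(-half_window, half_window + 1)
        if 0 <= j + m < len(sequence)
    )


def predict(sequence, states, info, half_window=8):
    """Assign each position the state with the highest summed score."""
    return "".join(
        max(states, key=lambda s: window_score(sequence, j, s, info, half_window))
        for j in range(len(sequence))
    )


# Toy training data, invented for illustration (H = helix, C = coil).
states, info = train(["AAAAAAGGGGGG"], ["HHHHHHCCCCCC"], half_window=1)
print(predict("AAAGGG", states, info, half_window=1))  # HHHCCC
```

Because each position is scored from its neighbours as well as from its own residue, a residue with a weak individual preference can still be assigned confidently when its surroundings are informative.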

Neural networks

Neural network methods use training sets of solved structures to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted. The underprediction arises because the methods lack the three-dimensional structural information needed to assess the hydrogen bonding patterns that promote the extended conformation required for a complete beta sheet.
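
Before a network can be trained on solved structures, each residue and its sequence context have to be turned into numbers. The Python sketch below shows one possible input encoding, a one-hot vector over a sliding window of residues. The text above does not specify an encoding, so the window size, the extra padding slot and the function names are assumptions; the network itself is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SLOTS = len(AMINO_ACIDS) + 1  # 20 amino acids plus one padding/unknown slot


def encode_window(sequence, center, half_window=6):
    """One-hot encode the residues from center - half_window to center + half_window.

    Positions that fall off either end of the sequence, and letters outside
    the 20 standard amino acids, are marked in the extra slot.
    """
    vector = []
    for pos in range(center - half_window, center + half_window + 1):
        one_hot = [0.0] * SLOTS
        if 0 <= pos < len(sequence):
            one_hot[INDEX.get(sequence[pos], SLOTS - 1)] = 1.0
        else:
            one_hot[SLOTS - 1] = 1.0
        vector.extend(one_hot)
    return vector


def training_examples(sequence, structure, half_window=6):
    """Pair each residue's encoded window with its known secondary structure label."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure strings must align")
    return [(encode_window(sequence, i, half_window), structure[i])
            for i in range(len(sequence))]


# Toy input, invented for illustration.
examples = training_examples("ACDEFG", "CHHHHC", half_window=1)
print(len(examples), len(examples[0][0]))  # 6 63
```

Each example pairs a fixed-length vector with a label, which is the form a standard classifier expects; the window only covers residues close in sequence, which is consistent with the difficulty of capturing long-range strand pairing noted above.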

RNA structure

Dynamic programming algorithms are commonly used to detect base pairing patterns that are "well-nested", that is, form hydrogen bonds only to bases that do not overlap one another in sequence position. Secondary structures that fall into this category include double helices, stem-loops, and variants of the "cloverleaf" pattern found in transfer RNA molecules. These methods rely on precalculated parameters estimating the free energy associated with particular types of base-pairing interactions, including Watson-Crick and Hoogsteen base pairs. Depending on the complexity of the method, single base pairs may be considered, or short two- or three-base segments to incorporate the effects of base stacking. Such methods cannot identify pseudoknots, which are not well nested, without substantial algorithmic modifications that are extremely computationally expensive.^[4]
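
The dynamic programming idea can be illustrated with the base-pair maximization recursion usually attributed to Nussinov, which is used here in place of free-energy minimization because it needs no energy parameters. Every allowed pair simply scores one point, so stacking effects and Hoogsteen pairs are not modelled. The set of allowed pairs (including G-U), the minimum loop length of three unpaired bases and the function names are assumptions of this sketch.

```python
# Allowed pairs: Watson-Crick plus G-U (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill_table(seq, min_loop=3):
    """best[i][j] holds the maximum number of nested pairs within seq[i..j]."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best


def traceback(seq, best, min_loop=3):
    """Recover one optimal set of nested pairs from the filled table."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return sorted(pairs)


def dot_bracket(seq, pairs):
    """Render pairs in dot-bracket notation."""
    out = ["."] * len(seq)
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)


seq = "GGGAAACCC"
table = fill_table(seq)
print(dot_bracket(seq, traceback(seq, table)))  # (((...)))
```

The recursion only ever combines a pair with structures strictly inside it or strictly to its left, which is exactly the well-nested restriction: a pseudoknot, where two pairs cross, cannot be represented in this table.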

Sequence covariation methods rely on the existence of a data set composed of multiple homologous RNA molecules with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete.^[5]
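
One common way to quantify the covariation described above is the mutual information between two alignment columns, which is high when the base at one site predicts the base at the other. The following Python sketch assumes an ungapped alignment of equal-length sequences; the toy alignment, the minimum separation between columns and the function names are illustrative assumptions, and the text does not prescribe this particular statistic.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information in bits between columns i and j of an alignment."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def rank_column_pairs(alignment, min_separation=4):
    """Return (score, i, j) for all sufficiently separated column pairs, best first."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scored = [
        (mutual_information(alignment, i, j), i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    return sorted(scored, reverse=True)


# Toy alignment, invented for illustration: columns 0 and 4 change together
# so that they can always pair, while the middle columns never vary.
alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(alignment, 0, 4))  # 2.0
print(mutual_information(alignment, 0, 1))  # 0.0
```

A high score shows only that two columns vary together; checking that the co-occurring bases are actually complementary is a separate step, and perfectly conserved columns score zero even if they are paired.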

References

[1] Mount DM (2004). Bioinformatics: Sequence and Genome Analysis, 2nd ed. Cold Spring Harbor Laboratory Press. ISBN 0879697121.
[2] Chou PY, Fasman GD (1974). Prediction of protein conformation. Biochemistry 13(2): 222-45.
[3] Garnier J, Osguthorpe DJ, Robson B (1978). Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120: 97-120.
[4] Rivas E, Eddy S (1999). A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 285(5): 2053-2068.
[5] Lyngsø RB, Pedersen CN (2000). RNA pseudoknot prediction in energy-based models. J Comput Biol 7(3-4): 409-427.

External links

PredictProtein
Mfold RNA structure prediction
