Protein fragment library

Protein backbone fragment libraries have been used successfully in a variety of structural biology applications, including homology modeling, structure prediction, structure determination, decoy detection, and … By reducing the complexity of the search space, these libraries allow conformational space to be searched more rapidly, which makes modeling more efficient and can improve the accuracy of the resulting models.

Motivation

When a protein is modeled discretely, the number of states it can adopt grows exponentially with its length. Typically a protein’s conformations are represented as sets of dihedral angles, bond lengths, and bond angles between all connected atoms. The most common simplification is to assume ideal bond lengths and bond angles. However, this still leaves the two backbone dihedral angles (phi and psi) and up to four dihedral angles for each side chain, that is, up to six dihedral angles per residue. The worst-case complexity is therefore k^(6n) possible states of the protein, where n is the number of residues and k is the number of discrete states modeled for each dihedral angle. To reduce the conformational space, one can use protein fragment libraries rather than explicitly modeling every phi-psi angle.
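As a rough illustration of the counting argument above, the following Python sketch reads the worst-case count as k raised to the power 6n (six dihedral angles per residue, each with k states) and compares it with a fragment-based count. The fragment model is an assumption made only for this example: the chain is tiled by non-overlapping fragments, each drawn independently from a library of fixed size. The function names and the numbers used (100 residues, k = 3, fragment length 5, library size 100) are hypothetical.

```python
import math

def full_state_count(n_residues, k, dihedrals_per_residue=6):
    """Worst-case number of discrete conformations: k ** (dihedrals * n)."""
    return k ** (dihedrals_per_residue * n_residues)

def fragment_state_count(n_residues, frag_len, library_size):
    """Assumed model: the chain is tiled by non-overlapping fragments,
    each chosen independently from a library of library_size entries."""
    n_slots = math.ceil(n_residues / frag_len)
    return library_size ** n_slots

# Arbitrary illustrative parameters, not taken from any published library.
all_dihedrals = full_state_count(n_residues=100, k=3)
backbone_only = full_state_count(n_residues=100, k=3, dihedrals_per_residue=2)
with_fragments = fragment_state_count(n_residues=100, frag_len=5, library_size=100)

for label, count in [("all dihedrals", all_dihedrals),
                     ("backbone only", backbone_only),
                     ("fragment library", with_fragments)]:
    # log10 keeps the output readable for very large integers.
    print(f"{label}: about 10^{math.log10(count):.0f} states")
```

Because fragments describe only the backbone, the fairer comparison is between the backbone-only count and the fragment count; the six-dihedral figure also includes side-chain states.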

Fragments are short segments of the peptide backbone, typically 5 to 15 residues long, and do not include side chains. They may specify the positions of only the C-alpha atoms, in a reduced-atom representation, or of all backbone heavy atoms (N, C-alpha, carbonyl C, O). To model discrete side-chain states, one can instead use a rotamer library.

Construction

Libraries of these fragments are constructed from an analysis of the Protein Data Bank (PDB). First, a representative subset of the PDB is chosen that covers a diverse array of structures. Then, for each structure, every run of n consecutive residues is taken as a sample fragment. The samples are then clustered into k groups according to how similar their conformations are, using an algorithm such as k-means clustering. The fragment length n and the number of clusters k are chosen according to the application (see discussion on complexity below). The centroid of each cluster is then taken as the representative fragment for that cluster. Further optimization can be performed to ensure that each centroid possesses ideal bond geometry.
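The construction steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the procedure of any particular published library: each structure is given as a list of C-alpha (x, y, z) coordinates; fragments are compared through their internal C-alpha distances, which avoids superposition but cannot distinguish mirror images; and the cluster member nearest to each centroid stands in for the centroid itself, since a centroid in distance space is not a set of coordinates. All function names are hypothetical, and the two toy chains are synthetic rather than taken from the PDB.

```python
import math
import random

def sliding_fragments(ca_coords, frag_len):
    """Return every window of frag_len consecutive C-alpha coordinates."""
    return [ca_coords[i:i + frag_len]
            for i in range(len(ca_coords) - frag_len + 1)]

def distance_features(fragment):
    """Describe a fragment by all pairwise C-alpha distances (assumption)."""
    return [math.dist(a, b)
            for i, a in enumerate(fragment)
            for b in fragment[i + 1:]]

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)

def build_library(structures, frag_len, k):
    """Cluster all fragments; return the member nearest each centroid."""
    fragments = [f for s in structures for f in sliding_fragments(s, frag_len)]
    features = [distance_features(f) for f in fragments]
    centroids, labels = kmeans(features, k)
    library = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: math.dist(features[i], centroids[c]))
            library.append(fragments[best])
    return library

# Synthetic toy input: an extended chain and a helix-like chain.
straight = [(3.8 * i, 0.0, 0.0) for i in range(12)]
helix = [(2.3 * math.cos(1.745 * i), 2.3 * math.sin(1.745 * i), 1.5 * i)
         for i in range(12)]
library = build_library([straight, helix], frag_len=5, k=2)
print(len(library), "representative fragments of length", len(library[0]))
```

A production pipeline would more likely compare fragments by RMSD after superposition or by backbone dihedral angles, and would follow clustering with the geometry idealization step mentioned above; neither is attempted here.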