Consensus CDS Project

CCDS Project
Content
Description	Consensus of protein coding regions
Contact
Research center	National Center for Biotechnology Information; European Bioinformatics Institute; University of California, Santa Cruz; Wellcome Trust Sanger Institute
Authors	Pruitt KD
Primary citation	Pruitt et al.
Release date	2009
Access
Website	http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi

The Consensus Coding Sequence Project is a collaboration between the National Center for Biotechnology Information, the European Bioinformatics Institute, the University of California, Santa Cruz and the Wellcome Trust Sanger Institute to agree upon a consistent set of protein-coding genes for humans and mice for public use.^[1] The CCDS gene sets have been arrived at by consensus of the different partners ^[2] and consist of over 17,000 human and over 16,800 mouse genes.

The CCDS set is calculated following coordinated whole genome annotation updates carried out by the NCBI and Ensembl. Annotation updates represent genes that are defined by a mixture of manual curation and automated computational processing.

The general process flow for defining the CCDS gene set includes:
• compare genome annotation results
• identify annotated coding regions that have identical location coordinates on the genome
• quality evaluation
• remove lower quality CDSs from the core set pending additional review among the collaboration groups

The CCDS set includes coding regions that are annotated as full-length (with an initiating ATG and a valid stop codon), can be translated from the genome without frameshifts, and use consensus splice sites. The number and type of quality tests performed may be expanded in the future, but they currently include analyses to identify putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts, and protein homology.
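The full-length criteria above can be illustrated with a minimal Python sketch. It assumes the CDS has already been spliced into a single DNA string; the function name `is_full_length_cds` is hypothetical and is not part of any CCDS tool. The check is deliberately simplified: it does not evaluate splice sites and does not handle known biological exceptions such as non-ATG starts or selenocysteine codons.

```python
# Simplified illustration only; not the CCDS pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_full_length_cds(cds: str) -> bool:
    """Return True if a spliced CDS passes the simplified full-length checks."""
    seq = cds.upper()
    # A length that is not a multiple of 3 implies a frameshift or a partial CDS.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Must begin with an initiating ATG and end with a stop codon.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # No in-frame stop codon is allowed before the terminal one.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

For example, `is_full_length_cds("ATGAAATAA")` passes, whereas a sequence with an internal in-frame stop codon or a length that is not divisible by three is rejected.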


The Consensus Coding Sequence (CCDS) Project is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies. The CCDS project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented by the National Center for Biotechnology Information (NCBI), Ensembl, and the UCSC Genome Browser (1). The integrity of the CCDS dataset is maintained through stringent quality assurance testing and ongoing manual curation (2).

Motivation and background

Biological and biomedical research has come to rely on accurate and consistent annotation of genes and their products on genome assemblies. Reference annotations of genomes are available from various sources, each with their own independent goals and policies, which results in some annotation variation.

The CCDS project was established to identify a gold standard set of protein-coding gene annotations that are identically annotated on the human and mouse reference genome assemblies by the participating annotation groups. Since the first data release in 2005 (1), the CCDS database has continued to grow in size. The CCDS gene sets that have been arrived at by consensus of the different partners (2) now consist of over 18,000 human and over 20,000 mouse genes. The CCDS dataset also represents more alternative splicing events with each new release (3).

Contributing groups (3)

Participating annotation groups include:
• National Center for Biotechnology Information (NCBI)
• European Bioinformatics Institute (EBI)
• Wellcome Trust Sanger Institute (WTSI)
• University of California Santa Cruz (UCSC)

Manual annotation is provided by:
• Reference Sequence (RefSeq) at NCBI
• Human and Vertebrate Analysis and Annotation (HAVANA) at WTSI

Defining the CCDS gene set

Definition of “Consensus”

Consensus is defined as protein-coding regions that agree at the start codon, stop codon, and splice junctions, and for which the prediction meets quality assurance benchmarks (1). A combination of manual and automated genome annotations provided by NCBI and Ensembl (which incorporates manual HAVANA annotations) is compared to identify annotations with matching genomic coordinates.
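As a rough illustration of coordinate matching, the sketch below pairs annotations from two sources when their CDS segments are identical. The `CDS` record, the function names and the identifiers in the usage example are all hypothetical, and the real comparison pipeline is not described in this article. Because every CDS segment must match exactly, agreement at the start codon, stop codon and splice junctions follows from the key comparison.

```python
from typing import Dict, List, NamedTuple, Tuple


class CDS(NamedTuple):
    """Hypothetical record: the coding segments of one annotated transcript."""
    chrom: str
    strand: str
    segments: Tuple[Tuple[int, int], ...]  # (start, end) of each CDS segment


def structure_key(cds: CDS) -> tuple:
    """Key that is equal only when all CDS segment coordinates are identical."""
    return (cds.chrom, cds.strand, tuple(sorted(cds.segments)))


def matching_annotations(
    source_a: Dict[str, CDS], source_b: Dict[str, CDS]
) -> List[Tuple[str, str]]:
    """Return sorted (id_a, id_b) pairs whose CDS coordinates agree exactly."""
    by_key: Dict[tuple, List[str]] = {}
    for id_b, cds in source_b.items():
        by_key.setdefault(structure_key(cds), []).append(id_b)
    pairs = []
    for id_a, cds in source_a.items():
        for id_b in by_key.get(structure_key(cds), []):
            pairs.append((id_a, id_b))
    return sorted(pairs)
```

A usage example with made-up identifiers and coordinates:

```python
source_a = {"A1": CDS("chr1", "+", ((100, 200), (300, 400)))}
source_b = {"B1": CDS("chr1", "+", ((100, 200), (300, 400))),
            "B2": CDS("chr1", "+", ((100, 200), (300, 450)))}
print(matching_annotations(source_a, source_b))
```

Only the pair with identical segments is reported; in the project itself, such a match would still need to pass quality assurance testing before being treated as a consensus CDS.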

Quality assurance testing

In order to ensure that CDSs are of high quality, multiple quality assurance (QA) tests are performed, independent of those done by collaborating groups as part of their annotation pipelines.

QA tests include analyses to identify putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts, and protein homology (2). Annotations that fail QA tests undergo a round of manual checking, which may either improve the results or lead to a decision to reject the annotation match because of the QA failure.

Review process

The CCDS database is unique in that the review process must be carried out by multiple collaborators, and agreement must be reached before any changes can be made. This is made possible with a collaborator coordination system that includes a work process flow and forums for analysis and discussion. The CCDS database operates an internal website that serves multiple purposes including curator communication, collaborator voting, providing special reports and tracking the status of CCDS representations. When a collaborating CCDS group member identifies a CCDS ID that may need review, a voting process is employed to decide on the final outcome.

Manual curation

Coordinated manual curation is supported by a restricted-access website and a discussion e-mail list. CCDS curation guidelines were established to address specific conflicts that were observed at a higher frequency. These guidelines have made the CCDS curation process more efficient by reducing the number of conflicting votes and the time spent in discussion to reach a consensus agreement. The CCDS curation guidelines are available at http://www.ncbi.nlm.nih.gov/CCDS/docs/CCDS_curation_guidelines.pdf

Curation policies established for the CCDS data set have been integrated into the RefSeq and HAVANA annotation guidelines, and thus new annotations provided by both groups are more likely to be concordant and result in the addition of a CCDS ID. These standards address specific problem areas, are not a comprehensive set of annotation guidelines, and do not restrict the annotation policies of any collaborating group (2). Examples include standardized curation guidelines for selection of the initiation codon and for interpretation of upstream ORFs and of transcripts that are predicted to be candidates for nonsense-mediated decay (NMD). Curation occurs continuously, and any of the collaborating centers can flag a CCDS ID as a potential update or withdrawal.

Conflicting opinions are addressed by consulting with scientific experts or other annotation curation groups such as the HUGO Gene Nomenclature Committee (HGNC) and Mouse Genome Informatics (MGI). If a conflict cannot be resolved, then collaborators agree to withdraw the CCDS ID until more information becomes available.

Curation challenges and annotation guidelines

NMD

Nonsense-mediated decay (NMD) is the most powerful mRNA surveillance process. NMD eliminates defective mRNA before it can be translated into protein (4). If the defective mRNA is translated, the truncated protein may cause disease. Different mechanisms have been proposed to explain NMD, although none currently provides a satisfactory answer. One of the proposed mechanisms is the exon junction complex (EJC) model: if the stop codon is >50 nt upstream of the last exon-exon junction, then the transcript is assumed to be an NMD candidate (2). The CCDS collaborators use a conservative method, based on the EJC model, to check mRNA transcripts for potential NMD candidates that may be translated into nonsense protein molecules. Any transcripts determined to be NMD candidates are excluded from the CCDS data set except in the following situations (2):
(i) If all transcripts at a particular locus are assessed to be NMD candidates but the locus is already known to be protein-coding, one of these transcripts is recorded in the CCDS data set.
(ii) If there is experimental evidence suggesting that a functional protein is produced from an NMD candidate transcript.

Previously, NMD candidate transcripts were considered to be protein coding transcripts by both RefSeq and HAVANA, and thereby, these NMD candidate transcripts were represented in the CCDS data set. The RefSeq group and the HAVANA project have subsequently revised their annotation policies.
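The 50 nt rule of the EJC model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used by the CCDS collaborators: the function name `is_nmd_candidate` is hypothetical, positions are 1-based transcript coordinates, and the distance is measured from the last base of the stop codon to the last base before the final exon-exon junction. The exceptions listed above (known protein-coding loci and experimental evidence) are not modelled.

```python
from typing import Sequence


def is_nmd_candidate(
    exon_lengths: Sequence[int], stop_codon_end: int, threshold: int = 50
) -> bool:
    """Apply the simplified EJC rule to one spliced transcript.

    exon_lengths: exon lengths in 5' to 3' transcript order.
    stop_codon_end: 1-based transcript position of the stop codon's last base.
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no exon-exon junction
    # Transcript position of the last base upstream of the final junction.
    last_junction = sum(exon_lengths[:-1])
    # Candidate only if the stop codon lies more than `threshold` nt upstream.
    return last_junction - stop_codon_end > threshold
```

For a transcript with exons of 100, 200 and 150 nt, the last junction follows position 300, so a stop codon ending at position 240 (60 nt upstream) is flagged, while one ending at position 260 (40 nt upstream) or anywhere in the final exon is not.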

Upstream open reading frames

AUG initiation codons located within transcript leaders are known as upstream AUGs (uAUGs). Some uAUGs are associated with upstream open reading frames (uORFs) (5). uORFs are found in approximately 50% of human and mouse transcripts (6), and their existence is another challenge for the CCDS data set. The scanning mechanism for translation initiation suggests that the small ribosomal subunit (40S) binds at the 5’ end of a nascent mRNA transcript and scans for the first AUG start codon (7). It is possible that a uAUG is recognised first, and the corresponding uORF is then translated. The translated uORF could be an NMD candidate, although studies have shown that some uORFs can avoid NMD. The average size limit for uORFs that will escape NMD is approximately 35 amino acids (2; 8). It has also been suggested that uORFs inhibit translation of the downstream gene by trapping a ribosome initiation complex and causing the ribosome to dissociate from the mRNA transcript before it reaches the protein-coding regions (4; 6). Currently, no studies have reported the global impact of uORFs on translational regulation.

The current annotation guidelines include mRNA sequences containing uORFs in the CCDS data set only if the transcripts are shown to have the following two features (2):
i. The uORF has a strong Kozak signal.
ii. The uORF is either ≥ 35 amino acids long or overlaps with the primary open reading frame.

Multiple in-frame translation start sites

Multiple factors contribute to translation initiation, such as uORFs, secondary structure and the sequence context around the translation initiation site. In vertebrates, a common start site lies within the Kozak consensus sequence (GCC)GCCACCAUGG, where the biological impact of the bracketed (GCC) motif is unknown (7). Variations within the Kozak consensus sequence occur; for example, either G or A is observed three nucleotides upstream of the AUG (at position -3). Bases between positions -3 and +4 of the Kozak sequence have the most significant impact on translational efficiency. Hence, the sequence (A/G)NNAUGG is defined as a strong Kozak signal in the CCDS project.
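The strong Kozak pattern (A/G)NNAUGG translates directly into a pattern match on the window from position -3 to +4 around a candidate AUG. The sketch below is illustrative only; the function name `has_strong_kozak` and the 0-based `aug_index` argument are assumptions made for this example.

```python
import re

# (A/G)NNAUGG: purine at -3, any two bases, the AUG, then G at +4.
STRONG_KOZAK = re.compile(r"[AG][ACGU]{2}AUGG")


def has_strong_kozak(mrna: str, aug_index: int) -> bool:
    """Check the -3..+4 window around the AUG starting at 0-based aug_index."""
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA letters
    # The window needs three bases upstream and one base downstream of the AUG.
    if aug_index < 3 or aug_index + 4 > len(seq):
        return False
    window = seq[aug_index - 3:aug_index + 4]
    return STRONG_KOZAK.fullmatch(window) is not None
```

With the consensus sequence `GCCACCAUGG`, the AUG starts at index 6 and the window `ACCAUGG` matches; replacing the A at position -3 with U, or the G at position +4 with A, makes the check fail.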

According to the scanning mechanism, the small ribosomal subunit can initiate translation from the first start codon it reaches. There are exceptions to the scanning model: (i) if the initiation site is not surrounded by a strong Kozak signal, leaky scanning can occur, in which the ribosome skips this AUG and initiates translation from a downstream start site; (ii) a shorter ORF can allow the ribosome to re-initiate translation at a downstream ORF (7). According to the CCDS annotation guidelines, the longest ORF must be annotated unless there is experimental evidence that an internal start site is used to initiate translation. Additionally, other types of new data, such as ribosome profiling data (9), can be used to identify start codons. The CCDS data set records one translation initiation site per CCDS ID. Any alternative start sites that may be used for translation are stated in a CCDS public note.
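The default of annotating the longest ORF can be illustrated by walking upstream from a known stop codon, one codon at a time, and keeping the most 5' in-frame AUG that is not separated from the stop by another in-frame stop codon. This is a sketch under simplifying assumptions: the function name `longest_orf_start` is hypothetical, the stop codon position is taken as given, and the experimental-evidence exception described above is not modelled.

```python
from typing import Optional

STOP_CODONS = {"UAA", "UAG", "UGA"}


def longest_orf_start(mrna: str, stop_index: int) -> Optional[int]:
    """Return the 0-based index of the most upstream in-frame AUG.

    stop_index: 0-based index of the first base of the terminating stop codon.
    Returns None if no in-frame AUG precedes the stop codon.
    """
    seq = mrna.upper().replace("T", "U")
    start = None
    i = stop_index - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break  # an upstream in-frame stop ends the open reading frame
        if codon == "AUG":
            start = i  # keep moving 5' to find the longest ORF
        i -= 3
    return start
```

In `AUGAAAAUGCCCUAA` with the stop codon at index 12, both index 0 and index 6 are in-frame AUGs, and the function returns 0, the start of the longest ORF.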

Read-through transcripts

Read-through transcripts are also known as conjoined genes or co-transcribed genes. They are defined as transcripts combining at least part of one exon from each of two or more distinct known (partner) genes which lie on the same chromosome in the same orientation (10). The biological function of read-through transcripts and of the corresponding protein molecules remains unknown. In the CCDS data set, a read-through gene is defined more specifically: the individual partner genes must be distinct, and the read-through transcripts must share ≥ 1 exon (or ≥ 2 splice sites except in the case of a shared terminal exon) with each of the distinct shorter loci (2). Transcripts are not considered read-through transcripts in the following circumstances: (i) transcripts are produced from overlapping genes but do not share the same splice sites; (ii) transcripts are translated from genes that have nested structures relative to each other. The CCDS collaborators and the HGNC have agreed that a read-through transcript is represented as a separate locus.

Quality of reference genome sequence

As the CCDS data set is built to represent genomic annotations of human and mouse, quality problems with the human and mouse reference genome sequences are another challenge. Quality problems occur when the reference genome sequence is misassembled; the misassembled sequence may then contain premature stop codons, frame-shift indels, or likely polymorphic pseudogenes. Once these quality problems are identified, the CCDS collaborators report the issues to the Genome Reference Consortium, which investigates and makes corrections, if necessary.

Access to CCDS data

The CCDS data set is available from the NCBI CCDS data set page (http://www.ncbi.nlm.nih.gov/CCDS/), which provides FTP download links and a query interface for retrieving information about CCDS sequences and locations. CCDS reports can be obtained through the query interface at the top of the CCDS data set page. Users can search by various types of identifier, such as CCDS ID, gene ID, gene symbol, nucleotide ID and protein ID (1). The CCDS reports (Figure 1) are presented in table format and provide links to specific resources, such as a history report and Entrez Gene (11), or allow the user to re-query the CCDS data set. The sequence identifiers table presents transcript information in Vega, Ensembl and Blink. The chromosome location table includes the genomic coordinates of each individual exon of the coding sequence and provides links to several genome browsers for visualising the structure of the coding region (1). The exact nucleotide and protein sequences of the coding sequence are displayed in the CCDS sequence data section.

Current applications (3)

The CCDS dataset is an integral part of the GENCODE gene annotation project (12) and it is used as a standard for high-quality coding exon definition in various research fields, including clinical studies, large-scale epigenomic studies, exome projects and exon array design. Due to the consensus annotation of CCDS exons by the independent annotation groups, exome projects in particular have regarded CCDS coding exons as reliable targets for downstream studies (e.g., for single nucleotide variant detection), and these exons have been used as coding region targets in commercially available exome kits (13).

Future prospects

Long-term goals include adding attributes that indicate where transcript annotation is also identical (including the UTRs) and that indicate splice variants with different UTRs that have the same CCDS ID. It is also anticipated that as more complete and high-quality genome sequence data become available for other organisms, annotations from these organisms may be in scope for CCDS representation.

The CCDS set will become more complete as the independent curation groups agree on cases where they initially differ, as additional experimental validation of weakly supported genes occurs, and as automatic annotation methods continue to improve. Communication among the CCDS collaborating groups is ongoing and will resolve differences and identify refinements between CCDS update cycles. Human updates are expected to occur roughly every 6 months and mouse releases yearly (3).

External links

CCDS home page


References

[1] Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D (2009). "The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes". Genome Res. 19 (7): 1316–23. doi:10.1101/gr.080531.108. PMC 2704439. PMID 19498102.

[2] Harte RA, et al. (2012). "Tracking and coordinating an international curation effort for the CCDS Project". Database: the journal of biological databases and curation. 2012: bas008. doi:10.1093/database/bas008. PMC 3308164. PMID 22434842.
