Sequence database

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically sequences were published in paper form, but as the number of sequences grew this storage method became unsustainable.

Search issues

Sequence databases can be searched using a variety of methods. The most common usage is probably searching for sequences similar to a certain target protein or gene whose sequence is already known to the user. The BLAST program is a popular method of this type.

Current issues

Records in sequence databases are deposited from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to these sequences, may vary in quality. There is much redundancy, as multiple labs may submit numerous sequences that are identical, or nearly identical, to others in the databases.^[1]

Many annotations of the sequences are based not on laboratory experiments, but on the results of sequence similarity searches for previously-annotated sequences. Once a sequence has been annotated based on similarity to others, and itself deposited in the database, it can also become the basis for future annotations. This can lead to a transitive annotation problem because there may be several such annotation transfers by sequence similarity between a particular database record and actual wet lab experimental information.^[2] Therefore, care must be taken when interpreting the annotation data from sequence databases.

References

^ Sikic, K.; Carugo, O. (2010). "Protein sequence redundancy reduction: comparison of various method". Bioinformation. 5 (6): 234–9. PMID 21364823. {{cite journal}}: Cite has empty unknown parameter: |month= (help)
^ Iliopoulos, I.; Tsoka, S.; Andrade, MA.; Enright, AJ.; Carroll, M.; Poullet, P.; Promponas, V.; Liakopoulos, T.; Palaios, G. (2003). "Evaluation of annotation strategies using an entire genome sequence". Bioinformatics. 19 (6): 717–26. PMID 12691983. {{cite journal}}: Unknown parameter |month= ignored (help)

External links

Major bioinformatics databases

European Bioinformatics Institute databases
NCBI completely sequenced genomes
Stanford Saccharomyces Genome Database
Protein, the NIH protein database, a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB

[Sikic-2010-1] Sikic, K.; Carugo, O. (2010). "Protein sequence redundancy reduction: comparison of various method". Bioinformation. 5 (6): 234–9. PMID 21364823. {{cite journal}}: Cite has empty unknown parameter: |month= (help)

[Iliopoulos-2003-2] Iliopoulos, I.; Tsoka, S.; Andrade, MA.; Enright, AJ.; Carroll, M.; Poullet, P.; Promponas, V.; Liakopoulos, T.; Palaios, G. (2003). "Evaluation of annotation strategies using an entire genome sequence". Bioinformatics. 19 (6): 717–26. PMID 12691983. {{cite journal}}: Unknown parameter |month= ignored (help)

[1]

[2]

v t e Bioinformatics
Databases	Sequence databases: GenBank, European Nucleotide Archive, DNA Data Bank of Japan and China National GeneBank Secondary databases: UniProt, database of protein sequences grouping together Swiss-Prot, TrEMBL and Protein Information Resource Other databases: BioNumbers, Protein Data Bank, Ensembl, InterPro, KEGG, and Gene Ontology Specialised genomic databases: BOLD, Saccharomyces Genome Database, FlyBase, VectorBase, WormBase, Rat Genome Database, PHI-base, Arabidopsis Information Resource, GISAID and Zebrafish Information Network
Software	BLAST Bowtie Clustal EMBOSS HMMER MUSCLE PANGOLIN SAMtools SOAP suite TopHat
Other	Server: ExPASy Rosalind (education platform)
Institutions	Broad Institute Computational Biology Department (CBD) Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI) Database Center for Life Science (DBCLS) DNA Data Bank of Japan (DDBJ) European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory (EMBL) Flatiron Institute J. Craig Venter Institute (JCVI) Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG) US National Center for Biotechnology Information (NCBI) Japanese Institute of Genetics Netherlands Bioinformatics Centre (NBIC) Philippine Genome Center (PGC) Scripps Research Swiss Institute of Bioinformatics (SIB) Wellcome Sanger Institute Whitehead Institute
Organizations	African Society for Bioinformatics and Computational Biology (ASBCB) Australia Bioinformatics Resource (EMBL-AR) European Molecular Biology network (EMBnet) International Nucleotide Sequence Database Collaboration (INSDC) International Society for Biocuration (ISB) International Society for Computational Biology (ISCB) Student Council (ISCB-SC) Institute of Genomics and Integrative Biology (CSIR-IGIB) Japanese Society for Bioinformatics (JSBi)
Meetings	Basel Computational Biology Conference‎ ([BC²]) European Conference on Computational Biology (ECCB) Intelligent Systems for Molecular Biology (ISMB) International Conference on Bioinformatics (InCoB) International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB) ISCB Africa ASBCB Conference on Bioinformatics Pacific Symposium on Biocomputing (PSB) Research in Computational Molecular Biology (RECOMB)
File formats	CRAM format FASTA format FASTQ format NeXML format Nexus format Pileup format SAM format Stockholm format VCF format GFF format GTF format
Related topics	Computational biology List of biobanks List of biological databases Molecular phylogenetics Sequencing Sequence database Sequence alignment
Category Commons

Search issues

Current issues

References

See also

External links

Major bioinformatics databases