Jump to content

Sequence assembly

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Wk master editor (talk | contribs) at 11:40, 21 September 2008 (Available assemblers). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In bioinformatics, sequence assembly refers to aligning and merging many fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go but rather small pieces between 20 and 1000 bases. Typically the short fragments - which are called reads - result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

The problem of sequence assembly can be compared to a giant jigsaw puzzle, many times having an image with a lot of clear blue sky. However, these genomic puzzles tend to be a bit more complex than the average jigsaw puzzle: they have between 500 and several hundred million pieces, are printed on both sides, with many vital pieces possibly missing. Some of the pieces are dirty or unrecognisable, and several pieces from another puzzle might have been mixed in. Additionally, a few pieces themselves appear to have been cut and reassembled by a very impatient two-year-old with a pair of scissors and a bottle of glue.

Genome assemblers

The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses over plasmids to bacteria and finally eukaryotes), the assembly programs needed to increasingly employ more and more sophisticated strategies to handle:

  • terabytes of sequencing data which need processing on computing clusters;
  • identical and nearly sequences (know as repeats) which can - in the worst case - increase the time and space complexity of algorithms exponentially;
  • and errors in the fragments from the sequencing instruments, which can confound assembly.

Faced with the challenge of assembling the first larger eukaryotic genomes, the fruit fly Drosophila melanogaster, in 2000 and the human genome just a year later, scientists developed assemblers like Celera Assembler (first developed by a private company) and Arachne able to handle genomes of 100-300 million base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together all the innovations in genome assembly technology under the open source framework.

EST assemblers

EST assembly differs from genome assembly in several ways. The sequences for EST assembly are the transcribed mRNA of a cell and represent only a subset of the whole genome. At a first glance, underlying algorithmical problems are not too different between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, mainly in the intra-genic parts. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, cells tend to have a certain number of genes that are constantly expressed in very high amounts (housekeeping genes), which again leads to the problem of similar sequences present in high amounts in the data set to be assembled.

Furthermore, genes sometimes overlap in the genome (sense-antisense transcription), and should ideally still be assembled separately. EST assembly is also complicated by features like (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphism, recoding, and post-transcriptional modification.

De-novo vs. mapping assembly

In sequence assembly, two different types can be distinguished:

  1. de-novo: assembling reads together so that they form a new, previously unknown sequence
  2. mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm need to compare every read with every other read (an operation that is has a complexity of O(n^2) but can be reduced to O(log(n)). Referring to the comparison drawn to jigsaw puzzles in the introduction: while for mapping assemblies one knows what the image will look like in the end, the de-novo assemblies are more hardcore in the sense as the pieces are present, but no image depicting how the final image will look like.

Influence of technological changes

The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats.

In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.

In 1975, the Dideoxy termination method (also known as Sanger sequencing) was invented and until shortly after 2000, the technology was improved up to a point were fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects that

  • are about 800 - 900 bases long
  • contain sequencing artefacts like sequencing and cloning vectors
  • have varying error rates between 0.5 and 10%

With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger ones like the human genome with approximately 35 million reads needed already large computing farms and distributed computing.

By 2004 / 2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing methods generated reads much shorter than from Sanger sequencing: initially about 100 bases, now 250 bases and expected to grow to 450 bases by the end of 2008. However, due to the much higher throughput and lower cost than Sanger sequencing, the adoption of this technology by genome centers pushed development of sequence assemblers to deal with this new type of sequences. The sheer amount of data coupled with technology specific error patterns in the reads delayed development of assemblers, at the beginning in 2004 only the Newbler assembler from 454 was available. Presented in mid 2007, the MIRA assembler by Chevreux et al. was the first freely available assembler who could assemble 454 reads and mixtures of 454 reads and Sanger reads; using sequences from different sequencing technologies was subsequently coined hybrid assembly.

Ironically, technological development of sequencing continued to improve into the wrong way (from a sequence assembly point of view). Since 2006, the Solexa technology is available and heavily used to generate roundabout 100 million reads per day on a single sequencing. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. The downside is that these reads have a length of only 36 bases (expected to grow to 50 bases by the end of 2008). This makes sequence alignment an even more daunting task. Presented by the end of 2007, the SHARCGS assembler by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads, quickly followed by a number of others.

Greedy algorithm

Given a set of sequence fragments the object is to find the Shortest common supersequence.

  1. calculate pairwise alignments of all fragments
  2. choose two fragments with the largest overlap
  3. merge chosen fragments
  4. repeat step 2. and 3. until only one fragment is left

The result is a suboptimal solution to the problem.

Available assemblers

The following table lists assemblers that have a de-novo assembly capability on at least one of the supported technolgies. For a list of software including mapping assemblers, see this post in the SeqAnswers discussion forum.

Name Type Technologies Author Presented /

Last updated

Licence*
AMOS genomes Sanger, 454 Salzberg S. et al. 2002? / 2008? OS
Arachne WGA (large) genomes Sanger Batzoglou S. et al. 2002 / 2003 (C / NC-A?)
CAP3, PCAP genomes Sanger Huang, X. et al. 1999 / 2005 C / NC-A
Celera WGA Assembler (large) genomes Sanger, 454 in devel. Myers, G. et al. 2004 / 2008 OS
CLC Genomics Workbench genomes Sanger, 454, Solexa, SOLiD CLC bio ? / 2008 C
CodonCode Aligner BACs (,genomes?) Sanger CodonCode Corporation 2003 / 2007 C
DNA Baser Assembler contigs s s August 2008 (C $399)
Euler genomes Sanger, 454 (,Solexa ?) Pevzner, P. et al. 2001 / 2006? (C / NC-A?)
Euler-sr genomes 454, Solexa Chaisson, MJ. et al. 2008 NC-A
MIRA, miraEST genomes, ESTs Sanger, 454, Solexa Chevreux, B. 1998 / 2008 OS
Mosaik genomes Sanger, 454, Solexa Strömberg, M. 2008 C
NextGENe (small genomes?) Solexa, SOLiD Softgenetics 2008 C
Phrap genomes Sanger Green, P. 2002 / 2003 C / NC-A
TIGR Assembler genomic Sanger - 1995 / 2003 OS
Sequencher (small) genomes Sanger Gene Codes Corporation ? / 2007 C
SeqMan NGen (small) genomes Sanger, 454, Solexa DNASTAR ? / 2008 C
SHARCGS (small) genomes Solexa Dohm et al. 2007 / 2007 OS
SSAKE (small) genomes Solexa (SOLiD? Helicos?) Warren, R. et al. 2007 / 2007 OS
Staden gap4 package BACs (, small genomes?) Sanger Staden et al. 1991 / 2006 OS
VCAKE (small) genomes Solexa (SOLiD?, Helicos?) Jeck, W. et al. 2007 / 2007 OS
Velvet (small) genomes Sanger, Solexa (SOLiD?) Zerbino, D. et al. 2007 / 2008 OS
*Licences: OS = Open Source; C = Commercial; C / NC-A = Commercial but free for non-commercial and academics; Brackets = unclear, but most likely C / NC-A

See also

References

Next Generation Sequencing Information [1] Next Generation Sequencing Blog [2]