Introduction

Welcome to the phage assembly suite and tutorial (PHAST). This set of online modules is designed to teach the basics of genome assembly through an interactive learning process. If you would like to immediately start an assembly, go to the quick walkthrough page. Otherwise, you should

  1. Read the introduction section
  2. Read the home page section
  3. Begin a genome assembly
  4. Begin to read the tutorial - it will guide you through the genome assembly process.

The process of assembling a phage genome, initiated in the lower tab, can take anywhere from minutes to hours. If the genome you have selected will take a very long time to assemble, PHAST will offer to email you when the assembly is complete. While the server is working on your assembly, do not submit another assembly request. It is recommended that you begin an assembly with default settings before reading the tutorial. Once an assembly is finished, you should adjust the parameters according to what you have learned in the tutorial and re-assemble the same genome. When you have completed more than one assembly, you can compare the different assemblies using the comparison tool.

The Home Page

The lower tab, labeled "Assembly Parameters," contains genome assembly options. In this tab, clicking the submit button will generate a fasta file and consensus sequence webpage based on the input. These files and the parameters used to generate them will appear on the right hand "Assemblies" column when the assembly completes.

Sequencing Technology

Sanger sequencing was one of the first DNA sequencing methods, and is still used today (see Sanger sequencing video). The Sanger method is very accurate but expensive and requires a significant amount of human oversight. The cost and speed of Sanger sequencing make it unsuitable for sequencing entire genomes. Over the last few years, interdisciplinary teams of investigators have developed new sequencing technologies, dubbed "next generation sequencing", or NextGen. Individually, these NextGen technologies are less accurate than Sanger sequencing but are much faster and less expensive. The speed and cost of these technologies make it possible to sequence each nucleotide multiple times, which collectively improves accuracy through redundancy. The large volume and low cost make NextGen methods ideal for whole genome sequencing projects.

There are many different NextGen sequencing technologies, and each one determines the identity of a nucleotide differently. Each technology was developed and marketed by a different company (Table 1). Like Vaseline and Kleenex, the technologies are often referred to by their company names. Your phage genomes were sequenced using Roche's 454 FLX sequencer with Titanium reagents (called 454 sequencing because the original company was called 454). NextGen methods do not sequence all of the nucleotides within a genome in order. Instead, small segments of the genome are sequenced, 30 to 800 nucleotides at a time, and this process is repeated millions of times.


Generation  Company                Platform                       Approx. Read Length (nt)
First       ABI/Life Technologies  3730xl                         600 - 1000
Next        Roche/454              Genome Sequencer FLX Titanium  300 - 1000
Next        Illumina               HiSeq 2000                     36 - 100
Next        ABI/Life Technologies  5500xl SOLiD System            50 - 75

Table 1. Some common next generation (NextGen) sequencing technologies, compared to first generation (Sanger) sequencing.

The most common and cost effective method used to sequence entire genomes is whole genome shotgun (WGS) sequencing. In WGS, multiple copies of genomic DNA are broken into millions of random fragments. These fragments are sequenced individually and their DNA sequences are electronically "stitched" back together into the genome through a process called assembly. Assembly is extremely complicated because there are missing DNA segments (gaps), errors in nucleotide identification, and long segments of repeated nucleotides, making it difficult to assemble the final genome (see shotgun sequencing video).
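To make the fragmentation step concrete, the short Python sketch below draws random reads from a toy genome. It is only an illustration under simplifying assumptions: the function name `shotgun_reads`, the toy genome string, and the fixed read length are all hypothetical, and real reads vary in length and contain errors.

```python
import random

def shotgun_reads(genome, num_reads, read_length, seed=0):
    """Draw reads of a fixed length from random positions in the genome."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    reads = []
    for _ in range(num_reads):
        # Choose a start position so the whole read fits inside the genome.
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append(genome[start:start + read_length])
    return reads

# Hypothetical 20 nt "genome"; a real phage genome has tens of thousands of bp.
toy_genome = "ATGCGTACCTTAGGCATCGA"
for read in shotgun_reads(toy_genome, num_reads=10, read_length=6):
    print(read)
```

Because the start positions are random, some parts of the toy genome are sampled several times and others may not be sampled at all, which is where gaps come from.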

As an analogy for shotgun sequencing, imagine you have three copies of the same novel, printed in a language you cannot read. Most novels are at least 40,000 words long (Nebula Awards, 2009), which corresponds nicely to the number of base pairs in a small phage genome, so each word in the novel represents a nucleotide in the genome. Every page of each novel copy has been randomly cut into thousands of horizontal strips, and some of the strips are missing. Furthermore, suppose each of the three copies of the novel has random typos throughout, in different places in each copy. Your almost impossible assignment is to arrange the thousands of strips of paper from all three novels to assemble a single copy of the original book (Figure 1). If you had more copies of the same book, would this assembly task become easier or harder? Your answer is the source of power behind NextGen sequencing and is related to the concept of genome coverage.


Image of Analogy

Figure 1. The genome assembly novel analogy. 1A: Three copies of the same novel. 1B: An example of one page out of the novel. All pages will be randomly cut into strips of characters. Note that there are random typos and errors throughout each novel. 1C: A few strips of characters from one page. 1D: All of the strips of characters from the 3 novels. 1E: Every single strip from 1D must be assembled as shown here to create a single copy of the novel. Note that some of the strips are also missing, further complicating this process.

Critical to the genome assembly process is genome "coverage." Genome coverage refers to the ratio between the cumulative size, in nucleotides, of a set of reads and the size of the genome. For example, if you had 1 million reads of 100 nt each, and the genome is 4 million bp long, you would have 25X coverage (100 million nt sequenced/4 million bp in the genome = 25 fold coverage).
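The coverage arithmetic above can be written as a one-line calculation. The function name `coverage` is only illustrative; the numbers are the ones from the example in the text.

```python
def coverage(num_reads, read_length_nt, genome_size_bp):
    """Fold coverage = total nucleotides sequenced / genome size."""
    return num_reads * read_length_nt / genome_size_bp

# 1 million reads of 100 nt each over a 4 million bp genome.
print(coverage(1_000_000, 100, 4_000_000))  # prints 25.0
```

Real reads are not all the same length, so in practice you would sum the lengths of all reads instead of multiplying a read count by a single length.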

No sequencing technology is perfect and each sequencing technology has different error rates. Technologies can misread a nucleotide or skip a nucleotide altogether. However, with high coverage, a computer or person can more accurately deduce the correct genome sequence based on the consensus of the majority of the smaller reads (Figure 2). This consensus of DNA sequence is why the final genome is often referred to as the "consensus sequence." Computer programs such as consed allow you to view the consensus sequence and the individual reads used to make the consensus sequence (Gordon et al., 1998). You might use consed in the second semester of your phage genome course, depending on the learning goals of your course. In addition to a fasta assembly file, PHAST will produce a webpage (accessible in the Assemblies column) that shows the consensus sequence alignment calls made by MIRA (the assembly program) for each assembly submitted.


Consensus Image

Figure 2. The consensus sequence is determined by the majority of reads. The green C appears in the majority of reads, suggesting the red T is an error. Therefore, the consensus sequence contains a C instead of a T.
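The majority rule illustrated in Figure 2 can be sketched in a few lines of Python. This is a simplified illustration, not how MIRA or consed call a consensus: it assumes the reads are already aligned to the same coordinates, uses "-" for positions a read does not cover, and ignores base quality scores. The function name and the example reads are hypothetical.

```python
from collections import Counter

def consensus(aligned_reads):
    """Pick the most common base in each column of pre-aligned reads."""
    result = []
    for column in zip(*aligned_reads):
        # Count only real bases; '-' means the read does not cover this position.
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)

# Four hypothetical aligned reads; the first has a T where the others have a C.
reads = [
    "ACGTC-",
    "ACGCCA",
    "-CGCCA",
    "ACGCC-",
]
print(consensus(reads))  # prints ACGCCA
```

At the fourth position three reads say C and one says T, so the consensus keeps the C, just as in Figure 2.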

Greater genome coverage produces a better genome assembly (Figure 3), just as having more books torn into strips would make it easier for you to assemble the original novel. High coverage gives the assembly algorithm the ability to identify raw reads that overlap with each other. Using the book analogy again, 25X coverage would be like having 25 novels cut into strips instead of only 3 novels. With more novels you are more likely to find strips of paper that overlap and cover the entire text. If two raw sequence reads have a large area of overlap, it is very likely these two raw reads should be merged into one contiguous sequence (i.e. the union of read #1 and read #2) in the final genome. Thus the assembler can combine these two smaller reads into one larger read, called a contig, short for a contiguous piece of DNA.
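As a rough sketch of how two overlapping reads become one contig, the Python below looks for the longest stretch where the end of one read exactly matches the start of another and then merges them. The function names, the minimum overlap of 3 nt, and the example reads are assumptions made for illustration; real assemblers tolerate mismatches and weigh base quality.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left, right, min_overlap=3):
    """Return the union of two overlapping reads, or None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]

# The last 4 nt of the first read match the first 4 nt of the second read.
print(merge_reads("ATGCGTAC", "GTACCTTA"))  # prints ATGCGTACCTTA
```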


Genome Coverage Image

Figure 3. Genome Coverage and the Assembly Process. Multiple copies of a genome are randomly broken into small fragments. Chunks of these fragments are sequenced, generating reads. Reads are combined in areas they overlap. Portions of the genome in which many reads overlap are said to have high coverage (green bar). Portions in which a few reads overlap are said to have low coverage (red bar). The majority of the reads form the final consensus sequence (see Figure 2). The higher the coverage of a consensus sequence segment, the more confident you can be in the accuracy of that segment.

Hundreds of computer algorithms have been developed to assist scientists in the arduous task of assembly, and they all begin with contigs. Contigs are contiguous sequences of DNA based on overlapping sections of DNA (Figure 4A). In the book analogy, a contig would be a page stitched together based on areas where the random paper strips overlapped (Figure 4B). Notice that longer reads or fragments of words make it easier to assemble the final consensus contig. Growth of the contig continues as long as quality overlaps exist between raw reads. Your phage's genome is small and simple enough that the process of creating contigs produces an assembled genome. For larger, more complex genomes, an additional step called scaffolding is needed. Scaffolds are generated by assembling contigs, ordered (first to last) and oriented (facing left or right) with respect to one another and the physical genome. In many assembly projects, a finished genome is the result of filling in gaps between scaffolds. Again, your phage project had sufficient coverage and few enough repeated DNA sequences to be assembled into a small number of contigs (hopefully only one contig), so a scaffolding procedure is not needed.


Contig Image
Figure 4. Contig formation. 4A: Creating a contig (orange) from line fragments (black). 4B: Creating a contig (orange) from genome reads (black). Bars represent DNA sequences.

Assembly Methods

There are two basic assembly approaches: a reference-based approach and a de novo approach. In a reference-based assembly, the raw reads of the genome being assembled are compared to an established reference genome sequence, which serves as an assembly guide. Reference-based assembly is fast and uses less computational power than de novo assembly. However, reference-based assembly is only suitable if a good reference genome is available. A reference genome should be the genome of a closely related organism, such as a different strain of the same bacterial species (Pop, 2009). In many cases, like that of your phage, a reference genome is not known at the time of assembly. At the time of assembly, you cannot tell if the phage you isolated has a genome unrelated to any previously sequenced phage, or has a genome very similar to a phage in the Mycobacteriophage database.

When an appropriate reference genome is not available, the assembly process you must use is de novo, i.e. done from scratch. De novo assembly takes longer and uses more computational resources, but is the only option for genomes with no suitable reference genome. Within the category of de novo assembly, investigators use one of three possible methods:

1. The greedy method
The greedy algorithm joins a sequence read with another read that has the best overlap score until no more reads can be joined.
2. The Overlap Layout Consensus (OLC) method
The OLC method generates a graph using reads and overlaps. The nodes (circles) of the graph are the reads, and the edges (arrows) represent overlaps of reads. In this way, the assembly process becomes synonymous with finding a pathway through the graph that visits every node exactly once (Figure 5B; edges not labeled, overlap of at least 2 nt).
3. The de Bruijn graph method
The de Bruijn method constructs a graph similar to the OLC graph. In this graph, the edges are unique subsequences within reads while nodes are overlapping sequences of reads of uniform length (see de Bruijn tutorial). Thus, the assembly algorithm becomes finding a path in the graph that visits every edge at least once (it is ok to visit a node more than once; Figure 5C).
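The greedy method (method 1 above) is the easiest of the three to sketch in code. The Python below repeatedly merges the pair of sequences with the longest exact overlap until no overlaps remain. It is a teaching sketch only: the function names, the 3 nt minimum overlap, and the toy reads are assumptions, and it uses overlap length as the "overlap score," whereas real assemblers such as MIRA use more sophisticated scoring and allow for sequencing errors.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair again and again until nothing overlaps."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no more reads can be joined
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Four hypothetical reads that tile a 12 nt sequence.
print(greedy_assemble(["ATGCGT", "GCGTAC", "GTACCT", "ACCTTA"]))
# prints ['ATGCGTACCTTA']
```

Because the greedy method always takes the best overlap available at each step, repeated sequences can lead it to join reads that do not belong together, which is one reason graph-based methods were developed.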

In both the OLC and de Bruijn assembly methods, it might be possible to traverse the graph in more than one way, representing different genome arrangements. Although de novo assemblies use one of the three methods described above, different investigators have implemented their method of choice with slight variations. These variations can produce different outcomes, so that even though two assembly programs may both use an OLC approach, for example, their output could be quite different. To learn more about the details behind the graphs and the differences between them, see the OLC tutorial, the de Bruijn tutorial, and the OLC vs de Bruijn comparison chart.


Realistic Graphs
Figure 5. Reads and two possible assembly graphs. 5A: Hypothetical reads aligned to the consensus sequence. 5B: The OLC assembly graph created from these reads. Edges represent overlaps of 2 or more nt. The assembly process is to visit every node. 5C: The de Bruijn assembly graph created from these reads. Edges represent 4 nt segments with 2 nt overlap between the nodes. The assembly process is to visit every edge.
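A de Bruijn graph like the one in Figure 5C can be sketched as follows. Following the caption, each edge is a unique 4 nt segment, and the nodes it connects are its first 3 nt and its last 3 nt (so neighboring nodes overlap by 2 nt). The walk through the graph uses Hierholzer's algorithm, a standard textbook way to visit every edge exactly once; this is a simplification chosen for the example, not a description of what any particular assembler does. All function names and the toy reads are hypothetical.

```python
from collections import defaultdict

def unique_kmers(reads, k=4):
    """Collect every distinct k nt segment found in the reads (the graph's edges)."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return sorted(kmers)

def build_graph(kmers):
    """Each k-mer is an edge from its first k-1 letters to its last k-1 letters."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk_every_edge(graph):
    """Hierholzer's algorithm: return a node path that uses each edge once."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    nodes = set(graph) | set(in_degree)
    # Start at a node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Turn a node path back into a sequence: first node plus last letter of the rest."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTTA"]
graph = build_graph(unique_kmers(reads, k=4))
print(spell(walk_every_edge(graph)))  # prints ATGCGTACCTTA
```

In this toy example every 3 nt node is unique, so there is only one way through the graph. Repeated sequences in a real genome create branches, which is why more than one traversal, and therefore more than one genome arrangement, may be possible.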

The phage genomes you will annotate were assembled using a de novo assembler called Newbler (also known as gs-assembler). Newbler software is included when you buy a 454 sequencer, and is optimized to take into account the errors and various outputs associated with 454 data. At one point in time, Newbler used an OLC implementation to assemble genomes (Margulies et al., 2005). Since then, there have been updates to the Newbler algorithm, but Roche has not described its assembly algorithm (Miller et al., 2010). Currently, Newbler's exact de novo assembly algorithm is unclear. Regardless, the program is very good at assembling 454 data and its outputs are very likely to be correct. Newbler assembly of 454 data could be described as a "gold standard" for de novo genome assembly. However, because the Newbler software is proprietary, it is difficult to critique its algorithms.

Because Newbler is proprietary and its algorithm has not been described publicly, you cannot experiment with the software used to assemble your phage genome. Instead, you will use a similar open source assembler called MIRA (Chevreux, 2005). MIRA is capable of performing both de novo (using a greedy, OLC hybrid approach) and reference-based assemblies. MIRA assembles genomes using reads from many different sequencing systems and is not limited to 454 data. For this website, MIRA has been optimized to assemble small genomes sequenced with the 454 FLX system with Titanium reagents. Because phages have very small genomes, you can do quality operations within MIRA that require more computer memory and could not be performed for larger genomes.

During the sequencing process, 454 sequencers attach short DNA sequence adaptors to all the DNA fragments. These adaptor sequences are often included within the final read output and must be trimmed off (Chevreux, 2010). Additionally, labs sequencing multiple genomes use DNA tags, called Multiplex Identifier (MID) tags, to track each genome project by identifying a specific genome in the 454 workflow (Chevreux, 2010). Each project would use a unique MID tag and these sequence tags are part of the final read output. In both cases, these DNA tag sequences would complicate the assembly because they produce false overlapping segments of DNA that could lead to erroneous contig formation. Therefore, it is especially important to clean your reads before running MIRA, to filter out DNA sequences that are not part of the original genome.

Newbler automatically cleans the reads of any DNA tags and MIRA has some automatic cleaning capabilities to complement the Roche 454 preprocessing software. However, it is strongly recommended that you preprocess raw reads to assist MIRA in the cleaning process. The first cleaning step uses a computer program called sff_extract (Blanca, 2010), which identifies adaptor fragments at the beginning or end of a read and flags the fragments for MIRA. The second cleaning step uses a program called SMALT (Ponstingl, 2011) that accurately identifies custom tags, provided by the user, contained within raw reads. Custom tags go unnoticed by other software, because they are not standard 454 tags. Again, custom tags within a read are flagged for MIRA. MIRA will "clip" the flagged tags and use the un-flagged read segments in the assembly process.
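The idea of clipping tag sequences from the ends of a read can be illustrated with a few lines of Python. This sketch is not how sff_extract, SMALT, or MIRA work internally: it only handles exact matches at the very start or end of a read, whereas the real tools flag regions for MIRA and tolerate sequencing errors. The function name and the tag sequences below are made up for the example.

```python
def clip_tags(read, tags):
    """Remove any known tag found exactly at the start or end of a read."""
    start, end = 0, len(read)
    for tag in tags:
        if read.startswith(tag):
            start = max(start, len(tag))
        if read.endswith(tag):
            end = min(end, len(read) - len(tag))
    # If clipping leaves nothing, return an empty read.
    return read[start:end] if start < end else ""

# Hypothetical tags; real adaptor and MID sequences come from the sequencing kit.
example_tags = ["ACGT", "TCGA"]
raw_read = "ACGT" + "TTGACCAGGCAT" + "TCGA"
print(clip_tags(raw_read, example_tags))  # prints TTGACCAGGCAT
```

Without this step, every read would appear to share the same beginning or end, and the assembler could mistake those shared tags for true overlaps.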

Conclusion

By the time you have read this far, you should have generated one or two assemblies of your phage's genome. Again, your assemblies are accessible in the right hand "Assemblies" column. By clicking on your assembly you can see its N50 score, which relates to contig size. When comparing two assemblies of the same reads, a larger N50 score means a better assembly. Through PHAST's Comparison Tool, you can further compare your assemblies using a dotplot program called gepard (Krumsiek et al., 2007). Much of the classification of phages in the Mycobacteriophage Database is based on dotplot alignments, so this tool may be useful in trying to predict the classification of your phage. If you wish to compare dotplots in an interactive environment, you should download the desktop application of gepard as well as your assembly fasta files (also located in the Assemblies column).
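For reference, the N50 score mentioned above is commonly defined as the contig length at which, counting from the longest contig down, you have accounted for at least half of all assembled nucleotides. The Python sketch below computes it under that common definition; the function name and the contig lengths are made up, and PHAST's own calculation may differ in detail.

```python
def n50(contig_lengths):
    """Length of the contig that brings the running total to half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs at all

# Hypothetical assembly with six contigs totaling 290 nt.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
# A phage genome assembled into a single contig has an N50 equal to its length.
print(n50([52000]))  # prints 52000
```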

Finally, it is also possible to compare multiple genomes at once. Tools like mauve (Darling et al., 2004) will allow you to compare multiple genomes. Again, you will need to download mauve and each fasta file in order to run such comparisons.

When you do compare genome assemblies, be sure to notice the different arrangements of the genomes. How did different input parameters affect the different assemblies? If your phage genome sequence is available in the Mycobacteriophage Database, try comparing your experimental assemblies to the Newbler assembly. How similar or dissimilar are the different assemblies of your phage genome?

Works Cited

  1. Blanca J. Bioinformatics at COMAV: sff_extract [Internet]. COMAV institute: 2010 [cited 7 July 2011]. http://bioinf.comav.upv.es/sff_extract/
  2. Chevreux B. MIRA: an automated genome and EST assembler. Ruprecht-Karls University, Heidelberg, Germany. 2005.
  3. Chevreux B. Sequence assembly with MIRA3: The Definitive Guide. 2010.
  4. Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research. 2004 July;14(7):1394–1403.
  5. Gordon D, Abajian C, Green P. Consed: a graphical tool for sequence finishing. Genome Research. 1998;8(3):195–202.
  6. Krumsiek J, Arnold R, Rattei T. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics. 2007 April 30;23(8):1026–1028.
  7. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005 September 15;437(7057):376–380.
  8. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010 June 1;95(6):315–327.
  9. Nebula Awards [Internet]. Science Fiction & Fantasy Writers of America: 2009 [cited 7 July 2011]. http://www.sfwa.org/nebula-awards/rules/
  10. Ponstingl H. SMALT [Internet]. Sanger Institute: 2011 [cited 7 July 2011]. http://www.sanger.ac.uk/resources/software/smalt/
  11. Pop M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics. 2009 June 7;10(4):354–366.

Additional Reading