Genome Assembly Project: Leland Taylor '12
Contents
Links
http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here
SEQanswers - online community for sequencing and assemblers.
- SEQanswers Wiki - sweet software hub with over 400 bioinformatics programs (also see http://seqanswers.com/wiki/Special:BrowseData)
- http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs
http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS
http://bioinf.comav.upv.es/index.html - a few useful python scripts
http://www.mediawiki.org/wiki/Help:Formatting - wiki formatting
Cool Links
http://pathogenomics.bham.ac.uk/hts/ - world map of high throughput sequencers.
Link Sandbox
http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, All Paths 2)
http://seqanswers.com/forums/showthread.php?t=9465&highlight=virus - previous virus assembly experiances
http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly - .sff assembly
http://seqanswers.com/forums/showthread.php?t=10988&highlight=k-mer - how to estimate genome size
Vocab
- Comparison assembly (aka reference based) - basically align reads to reference genome
- Regions that differ slightly (like large insertions) still need to be assembled de novo (Genome assembly reborn: recent computational challenges)
- Protein reference gene can be used as comparison (ABBA)
- Useful if no close reference genome available.
- PROGRAMS: AMOScmp, Maq, ABBA (protein)
- Maq - characterizes SNPs between target genome and reference.
- Coverage - the ratio between the cumulative size of a set of reads and the size of the genome
- de novo assembly - NP-hard problem. Assembly from scratch
- Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
- hybrid de novo assembly - using data from different sources... harder, but can gain insight by playing the different data sets to their strengths.
- Eulerian method ideal for this.
- hybrid de novo assembly - using data from different sources... harder, but can gain insight by playing the different data sets to their strengths.
- Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
- DeNovoAssemblyMethod: Bruijn graphs (aka Eulerian pathway) (1.2 pg 359) (often abbreviated DBG)
- Make graph: nodes = each k-1 section, edges = exact overlap of k-2 (see 1.2 fig 3B).
- Assembler = find pathway though this graph that uses every edge (aka Eulerian path)
- Typically more than one Eulerian path can be found... representing the many different ways a genome ca be rearranged around repeats.
- Loose long range connectivity information by choping up reads in to k-mer sets... this info lost = useful in reducing ambiguity of graph structure
- "Eulerian superpath problem" - solves this issue via graph constructed from sub-paths corresponding to individual reads provided as input to assembler
- Repeats
- Better at resolving repeats (Assembly complexity of prokaryotic genomes using short reads)
- Repeats make graph more complex because introduce cycles (3.1 pg 320)
- Complications Bruijn Graphs must overcome with DNA assemblies (3.1 pg 320)
- DNA = double stranded. Forward sequence of read may overlap with the forward or reverse complement of other reads
- Methods To Overcome: graph contains nodes and edges for both strands, forward/reverse make up half nodes individually and paths must enter/exit from same half (Velvet), alternate strands in single node with 2 sides and paths enter/exit opposite sides (ABySS).
- Repeats
- Repeats longer than K lead to tangled K-mer graphs
- Perfect repeats >= K lead to frayed ends
- Palindromes
- Make paths fold back on themselves
- Method To Overcome: require k to be odd (Velvet). an odd size k-mer cannot match its reverse complement
- Sequence Error
- Very sensitive to sequencing errors
- Make graph more complex by adding edges
- Steps to overcome
- Preprocess reads to remove errors (kick out low qual...)
- Weight graph edges by number of reads that support them & erode lightly supported paths
- Convert paths to sequences and use sequence alignment algorithms to collapse nearly identical paths
- Very sensitive to sequencing errors
- DNA = double stranded. Forward sequence of read may overlap with the forward or reverse complement of other reads
- Ideal for short reads... high depth of coverage of reads in roughly equal length
- "when reads are so long it is better use an overlap layout method in order to avoid a great number of false positives" http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin
- See daniel zerbinos phd thesis
- PROGRAM: Velvet, Allpaths, Euler-SR, Euler-USR
- Ideal for short reads... high depth of coverage of reads in roughly equal length
- DeNovoAssemblyMethod: Greedy (1.2 pg 357)
- Step 1: Reads compared to each other to construct a list of pair-wise overlaps
- Step 2: Join contigs that overlap the best and and when no ore reads or contigs can be joined
- "Overlaps" = prefix of one read shares enough similarity with suffix of another
- "Greedy" - the algorithm optimizes locally... ie the quality of overlap between reads
- Starts by processing the best overlap first, so chooses that path and may misassemble many repeats
- PROGRAM: TIGR Assembler, CAP3
- DeNovoAssemblyMethod: Greedy for Short Reads (1.2 pg 357)
- Unassembled read chosen as start contig... Contig built on the 3' end until no more extensions possible. Then build onto 5' end using rev comp of original read.
- Extension process terminated when conflicting information found... 2+ reads could extend the contig.
- PROGRAM: SSAKE, VCAKE, SHARCGS
- DeNovoAssemblyMethod: Overlap layout consensus (OLC) (1.2 pg 357)
- Step 1: Reads compared to each other to construct a list of pair-wise overlaps
- Step 2: Create overlap map... each read = node, edge = if overlap identified between reads
- Core part of OLC
- Ultimate goal = find a single path that traverses each node of the graph exactly once (hamiltonian path).
- Step 3: Consensus computation - determine the DNA sequence implied by the arrangement of reads along the chosen path
- PROGRAMS: Celera Assembler, Arachne, newbler, Edena
- Most popular type... perhaps from its flexibility
- FileType: .fna (from 454) - all of the different reads with a > header that includes the following information (https://shell.cgrb.oregonstate.edu/sites/default/files/Files/Docs/Roche/software/454_Sequencing_Software_Manual_v2.5p1_Overview.pdf)
- rank = ranking of well by signal intensity
- x & y = coordinates of the well center pixel
- length = the length of the read
- FileType: .qual (from 454) - base quality score values for each nt in the corresponding .fna file. Each entry has the same header as the .fna.
- FileType: .sff (from 454) - Standard Flowgram Files
- .sff file format consists of three sections: a common header section occurring once in the file; then for each read stored in the file, a read header section and a read data section
indel - shorthand term for insertion or deletion
- k-mer
- the larger the kmer the longer the overlap between two reads has to be. that's also a reason why the kmer can never be larger then your minimum read length. SO an assembly at a higher kmer size is always more "accurate"(not talking about better N50) than the one at a lower kmer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
- Usually linked to Eulerian path assemblies (Brujin graphs)... in implementations, the higher k-mer, the less RAM because the graph will be smaller
- Length Unit: kb (= kbp) = kilo base pairs = 1,000 bp
- Length Unit: Mb = mega base pairs = 1,000,000 bp
- Length Unit: Gb = giga base pairs = 1,000,000,000 bp.
- Mate-paired reads
- Back in the day (with clone libraries) paired reads = sequencing opposite directions of an amplified clone,
- Now... amplify and sequence short paired-end tags in parallel
- Short genomic sequences circularized using known linker sequence, then cleaved outside of the ligation site... leaving an overhang tag (2.1 pg 4)
- Metagenomics - sequencing entire microbial communities instead of isolate genomes (Genome assembly reborn: recent computational challenges)
- N50
- the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. (http://seqanswers.com/forums/showthread.php?t=2332)
- Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value (http://seqanswers.com/forums/showthread.php?t=2332)
- N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)
- DNA Palindrome - DNA sequence that is its own reverse complement
- Repeats
- Tandem repeats
- Inverted repeats
- Imperfect repeats
- Perfect repeats
- Sanger-based sequencing - first generation sequencing (1000-2000bp reads)
- Scaffolds (1.2 pg 360)
- Groups of contigs whose relative placement is known though the DNA sequence of genomic regions connecting these contigs is unknown
- Usually relies on mate-pair information (other method = optical mapping like in SOMA)
- Two contigs = adjacent if one end of mate-pair in contig1 and other end of mate-pair in contig2
- Most assemblers have scaffolding module
- The longer contigs are, the easier it is to scaffold
- PROGRAMS: Euler, Arachne, Celera Assembler, Velvet, Bambus (stand alone scaffold assembler), SOMA (scaffolding with restriction maps... need more tests than the data we have).
- See 1.2 paper for overview of algorithms of these scaffolders
- Spur - single reads that disagree with the bulk of reads within a region of an assembly graph (OLC method) (1.2 pg 359)
- Unitig - uniquely assemblable contig (defined by Celera Assembler)... Contig constructed until a fork in a graph (OLC method) (1.2 pg 358)
- Preliminary, high confidence, conservative contigs. (3.1 pg 319)
- PROGRAMS: Minimus, newbler, internal datastructures of Celera Assembler and Arachne
Programs
De Novo Assemblers
- Celera Assembler
- Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat
- Euler+ (Euler 2.0)
- Developed in 2004
- Uses A-Bruijn graphs
- Deals with errors in reads by inducing vertices with ungapped alignments (2.1 pg 2)
- Allows mismatches rather than exact l-tuples in Bruijn graphs
- Does not scale well with next gen because graph grows with coverage - run out of memory (2.1 pg 2)
- However, focus for this project is phages. May not be an issue.
- Euler-SR (short read) (introduced in 2.1)
- Developed in 2007
- Modified version of Euler+
- "Substitutes the maximum spanning tree optimization of the A-Bruijn graph by the maximum branching optimization on de Bruijin graphs" (2.1 pg 2)
- General Walkthrough of all Euler implementations (3.1 pg 320)
- Filter reads with spectral alignment process
- Detect erroneous base calls by noting low frequency K-mers
- Most true K-mers = repeated in several reads
- Sequence errors are random, so for any K where 4^K exceeds twice the genome size, most erroneous K-mers should be unique.
- List K-mers and their observed frequencies
- Either exclude or correct low frequency k-mers
- Correction - reduces total number of k-mers (and the size of the graph), can also mask polymorphism
- Only corrects substitution errors (not indels)
- Step 1. Determine threshold to trust or not trust reads
- Threshold determined via distribution of k-mer frequencies present in the reads
- Usually binomial distribution... 1st peak = k-mers that occur once or twice (due to seq error or low coverage) 2nd peak = redundant k-mers (due to read coverage or repeats)
- Threshold = between the peaks
- Threshold determined via distribution of k-mer frequencies present in the reads
- Step 2. Determine if read is good or bad and proceed accordingly
- Bad reads - "Euler executes a greedy exploration for base call substitutions that reduce bad k-mer count" (3.1 pg 320)
- If fully corrected, then accept the read. Otherwise reject
- Detect erroneous base calls by noting low frequency K-mers
- Build K-mer graph from processed reads
- Processes k-mers, not reads, so discards long-range continuity info (within reads)
- Filter reads with spectral alignment process
- Output: both contigs and the repeat graph
- Connects the contigs by repeats and can help direct the finishing efforts
- Only uses read and qual files
- Bulk of the time this program takes is in error correction
- Error correction: can be done based on qual values or based on spectral alignment technique
- Steps: error correction, graph construction, graph correction, assembly by transforming paths into contigs
- MIRA 3 (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)
- Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
- Can assemble de novo or to a scaffold
- Newbler (aka gsAssembler)
- An Overlap Layout Consensus assembler.
- Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
- Assembler provided by 454 company.
- Specific to 454 data... so uses 454 dependent error model and has access to flowgrams (2.1 pg 3)
- Good blog: http://contig.wordpress.com/
- For pure 454 data, probably the best assembler
- Velvet
- Does extensive compression of de Bruijn graph
- PreProcessing to reduce graph size
- Spurr removal algorithm = similar to Euler erosion procedure.
- Note: No spectral alignment filter (because authors thought it naive)
- Graph processing to reduce graph size
- Bounded search for bubbles
- Breadth-first-search - starts at nodes with multiple out edges
- Bounded because can have bubbles within bubbles, so finding all of them would be impractical
- Limits bubbles to those with a sequence similarity requirement on alternate paths
- Removes bubble path with fewer reads
- Step here about realignment I do not understand
- This operation risks ignoring polymorphisms or the collapse of near identical repeats
- Whole process similar to Euler's bulge remover and the bubble smoothing in OLC assemblers
- Bounded search for bubbles
- Read Threading
- Removes path with fewer reads than a threshold
- Risks removing low coverage sequence... but supposed to remove spurious connections induced by convergent sequencing errors
- Mate Pairs reduction
- Uses pebble algorithm (see Pebble and Rock Band)
- PreProcessing to reduce graph size
- Variables
- Length of k-mers (note this is constrained to be an odd number)
- The min frequency of expected k-mers in reads - determines which k-mers are pruned a-priori
- The expected coverage of genome - controls spurious connection breaking
- Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat
- SUMMARY: does not have error-correction pre-processor, but has error avoidance filter. (not good for large genomes)
- Does extensive compression of de Bruijn graph
Veiwers
- Consed
- Aligned Reads View
- Low quality base = darker and lower case
- High quality base = ligher (whiter) and upper case
- Black = unaligned
- Asterisk = spacer... like dash in blast
- Red letters = disagree with consensus read. Alot of red in high quality region
- Middle click = open up trace window
- Navigate menue for low quality reads
- Can also export consensus sequences
- Assembly View
- Dark green line = read coverage
- Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
- Programs Compatible with Consed: Newbler, Phrap
- Aligned Reads View
Scripts
http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.
sff_extract - python script that extracts .sff data
clean_reads - python script that cleans reads
Big Questions
- De novo or Reference/Comparative based assembly?
- Comparative assembly = suited for sequencing multiple strains of a same bacterium (1.2 pg 361).
- The HHMI program uses de novo assemblers (http://phagesdb.org/workflow/Sequencing/).
- What kind of coverage is typical for phages?
Journal
May 23 2011
Looking at the raw assembly files, it looks like our reads are ~500nt on average. We do have small ones ~50nt.
The database includes three file types: .fna .qual .sff
Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).
Notes
- Use De Brujin graphs to estimate "completeness" of genomes assembled via de novo assembly
- Find Eulerian path(s) in these graphs
- Note the assumptions made in the paper
- PROGRAM: Jellyfish - counts k-mers http://www.cbcb.umd.edu/software/jellyfish/
- Lists compression techniques and the order to employ them
- Can use this method to compute N50
- N50 = the length of the largest contig (m) such that at least 50% of genome covered by contigs of size >= m.
- A higher N50 score usually correlates to a more "correct" genome
- Regardless of correctness of genome, for nearly all read sizes (1000nt > size > 25nt), 85%+ of genes accurately identified (85% is for 25nt reads).
Thoughts
- Look for assembler that uses De Brujin graph?
- PROGRAM: EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach. http://euler-assembler.ucsd.edu/portal/
- This paper showed how to get an upper limit of correctness of genome. Compare several existing de novo assemblers using the methods here as comparison.
- Is it possible to get the code used in this project?
Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).
Notes
- THESIS: Nice presentation of current technology and description of whole process (including de novo vs reference, de novo subtypes, scaffolding, assembly validation, metagenomics)
- THESIS: Table 1 = great summary of 2nd gen sequencing and the trade offs you make with it.
- Assembly Validation
- Stat analysis for areas with unusually deep coverage
- Poission stat to estimate likely hood region represents a repeat
- DNP??
- PROGRAM: DNPtrapper, Euler-AIR
- Mate pairs
- In correct assembly, tiling of reads should be consistent with mate pair constraints (relative orientation and paired ends)
- PROGRAM: Tampa, Zimin, and visual tools like BACCardI, Hawkeye
- Hawkeye also has built in AMOSvalidate
- Whole genome mapping (with restriction enzymes)
- PROGRAM: SOMA
Thoughts
- I am torn between working with de novo sequencers or comparative assemblers... Perhaps the best way is to start with de novo and then move to comparative, as de novo will likely be needed for comparative assemblers.
- The target phage genomes are small, so we do not really need to worry that much about efficiency.
- There are so many programs out there that seem like they would work for my purposes. Perhaps my project should be to take an existing program and determine parameters ideal for phages.
- It may be difficult to make a universal program. So much of the ideal assembly program depends on the characteristics of your reads... Are they short/long reads? A single data source? Do you have mate-pair information? Perhaps I am creating a tool to just peer into the assembly program.
- I can design it to be optimized for the 454 data I have now.
- Areas of future work identified in this paper.
- Designing programs to use multiple datasets to identify the best assembly. For example, Illumina and 454 data, but also non-sequence data like genome mapping (which has been shown to really help out assemblies)
To Do Tomorrow:
- Think more about thoughts segments above from papers
- Do research on virus/prokaryote genome assemblies.
- Read the new promising papers identified today.
- Begin to think about ways to characterize our data to answer what assembly method would be best in this scenario.
May 24 2011
Chaisson, M.J. & Pevzner, P.A. Short read fragment assembly of bacterial genomes. Genome Research 18, 324-330 (2008).
Notes
- In the beginning, high throughput sequencing with short reads required reference genome and was limited to resequencing projects, gene expression, and genomic profiling projects.
- "Long" contigs = contigs > 500bp
- These programs which rely heavily on mate pair information do not perform well in next gen sequence data
- PROGRAM: Phrap, PCAP, Arachne, Euler+
- I think Illumina and 454 now include mate pair information in the sequencing data. This paper written in 2008 said it was coming soon.
- Error correction: discard low quality reads
- If quality values are not available, they have another method
Thoughts
- Euler-SR seems to be a spin off of Euler+.
- What tradeoffs does the Euler-SR approach make?
- Since I am focused on viral genomes, the need for memory conservation is not as big of an issue.
- What tradeoffs does the Euler-SR approach make?
- Does the raw data I have include mate pair information?
- Such information can drastically improve the quality of a genome assembly.
To Do Tomorrow
- Install Linux to try and run newbler
- Read Miller review article
May 25 2011
Miller, J.R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315-327 (2010).
Notes
- THESIS: Nice overview of types of graphs used for assemblies. Also summary of several algorithms (mainly de Bruijn methods) including SSAKE, Newbler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo
- "Repeat graphs can be used to identify and catalog repeats." (see Identifying repeat domains in large genomes)
- Nice K-mer graph Figures
- Fig 1 = basic k-mer idea
- Fig 2 = pairwise overlap of k-mer graph
- Fig 3 = complexity in K-mers. Includes spurs, bubbles, frayed ropes, cycles.
- Links assembly to NP-hard reduction problem
Thoughts
- What if try to implement reduction approach that has never been applied to assemblies?
- The Velvet program compresses graphs. One step is to identify bubbles in graph, but the program does not search for all bubbles. It does a bounded search. What would the benefit be to finding all bubbles?... if it good, may want to make a version that does that because we have small genomes.
Data Characterization
- Phage genome size from ~40 kb (40,000) to ~164 kb (164,000) (http://phagesdb.org/sort/)
- Our reads are ~500nt on average. We do have small ones ~50nt.
- The database includes three file types: .fna .qual .sff
- Write program to calculate fold coverage? Avg sequence size?
- Don't forget to clean the data before using it.
Basic Timeline
- 1st – 2nd Week
- Learn how to manipulate and handle raw read files.
- Familiarize myself with key sources listed above.
- Write module to calculate fold coverage using genome size estimate and total size of all reads.
- Write a prioritized list of features and goals for my program.
- 3rd – 6th week
- Develop my program in modules according to the prioritized features.
- Compare my program’s genome to previously assembled genomes from this raw data.
- Quantify the accuracy of my genome by testing for the size of a predicted gap or feature in the genome to size of that actual segment of DNA in the blueberry genome.
- Edit the program based on any issues encountered with the full data set.
- 7th – 10th week (Ending: July 29, 2011)
- Finish wet-lab accuracy tests
- Fine–tune the program based on any issues encountered with the full data set.
- Attempt to assemble the “Meatball” phage genome.
Random Notes
Citation method for papers read = the number journal is (currently 6 topic, then the sub section)... so currently if I say 1.2, then I look at Journal (6).1.2 = Genome assembly reborn
Figure out weighting scale to identify missassemblies via machine based learning (A machine-learning approach to combined evidence validation of genome assemblies)
http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin
I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences. We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads. -ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The Contig N50 was a bit poor, whatever we tried. It took a fair while, we never got it to parallelize -SOAPdenovo: very fast because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The Contig N50 was not great, but better than ABYSS (around 600bp) -CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb) In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb. Finally we use the SOAPdenovo GapCloser to close GAPS in the scaffolds produced, which removed about 25% of the Ns we had in the assembly! All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had. Now we are going to throw more data at it, hoping for a much better assembly