Genome Assembly Project: Leland Taylor '12
Contents
Useful Links
http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here
http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS
http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs
http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, All Paths 2)
Vocab
- Comparison assembly (aka reference based) - basically align reads to reference genome
- Regions that differ slightly (like large insertions) still need to be assembled de novo (Genome assembly reborn: recent computational challenges)
- Protein reference gene can be used as comparison (ABBA)
- Useful if no close reference genome available.
- PROGRAMS: AMOScmp, Maq, ABBA (protein)
- Maq - characterizes SNPs between target genome and reference.
- Coverage - the ratio between the cumulative size of a set of reads and the size of the genome
- de novo assembly - NP-hard problem. Assembly from scratch
- Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
- hybrid de novo assembly
- Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
- DeNovoAssemblyMethod: Brujin graphs (aka Eulerian pathway) (1.2 pg 359)
- Make graph: nodes = each k-1 section, edges = exact overlap of k-2 (see 1.2 fig 3B).
- Assembler = find pathway though this graph that uses every edge (aka Eulerian path)
- Typically more than one Eulerian path can be found... representing the many different ways a genome ca be rearranged around repeats.
- Loose long range connectivity information by choping up reads in to k-mer sets... this info lost = useful in reducing ambiguity of graph structure
- "Eulerian superpath problem" - solves this issue via graph constructed from sub-paths corresponding to individual reads provided as input to assembler
- "when reads are so long it is better use an overlap layout method in order to avoid a great number of false positives" http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin
- Better at resolving repeats (Assembly complexity of prokaryotic genomes using short reads)
- Very sensitive to sequencing errors
- Ideal for short reads... high depth of coverage of reads in roughly equal length
- See daniel zerbinos phd thesis
- PROGRAM: Velvet, Allpaths
- DeNovoAssemblyMethod: Greedy (1.2 pg 357)
- Step 1: Reads compared to each other to construct a list of pair-wise overlaps
- Step 2: Join contigs that overlap the best and and when no ore reads or contigs can be joined
- "Overlaps" = prefix of one read shares enough similarity with suffix of another
- "Greedy" - the algorithm optimizes locally... ie the quality of overlap between reads
- Starts by processing the best overlap first, so chooses that path and may misassemble many repeats
- PROGRAM: TIGR Assembler, CAP3
- DeNovoAssemblyMethod: Greedy for Short Reads (1.2 pg 357)
- Unassembled read chosen as start contig... Contig built on the 3' end until no more extensions possible. Then build onto 5' end using rev comp of original read.
- Extension process terminated when conflicting information found... 2+ reads could extend the contig.
- PROGRAM: SSAKE, VCAKE, SHARCGS
- DeNovoAssemblyMethod: Overlap layout consensus (OLC) (1.2 pg 357)
- Step 1: Reads compared to each other to construct a list of pair-wise overlaps
- Step 2: Create overlap map... each read = node, edge = if overlap identified between reads
- Core part of OLC
- Ultimate goal = find a single path that traverses each node of the graph exactly once (hamiltonian path).
- Step 3: Consensus computation - determine the DNA sequence implied by the arrangement of reads along the chosen path
- PROGRAMS: Celera Assembler, Arachne, newbler, Edena
- Most popular type... perhaps from its flexibility
- FileType: .fna
- FileType: .qual
- FileType: .sff
- k-mer
- the larger the kmer the longer the overlap between two reads has to be. that's also a reason why the kmer can never be larger then your minimum read length. SO an assembly at a higher kmer size is always more "accurate"(not talking about better N50) than the one at a lower kmer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
- Usually linked to Eulerian path assemblies (Brujin graphs)... in implementations, the higher k-mer, the less RAM because the graph will be smaller
- Mate-pair -
- Metagenomics - sequencing entire microbial communities instead of isolate genomes (Genome assembly reborn: recent computational challenges)
- N50
- the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. (http://seqanswers.com/forums/showthread.php?t=2332)
- Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value (http://seqanswers.com/forums/showthread.php?t=2332)
- N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)
- Sanger-based sequencing - first generation sequencing (1000-2000bp reads)
- Scaffolds (1.2 pg 360)
- Groups of contigs whose relative placement is known though the DNA sequence of genomic regions connecting these contigs is unknown
- Usually relies on mate-pair information (other method = optical mapping like in SOMA)
- Two contigs = adjacent if one end of mate-pair in contig1 and other end of mate-pair in contig2
- Most assemblers have scaffolding module
- The longer contigs are, the easier it is to scaffold
- PROGRAMS: Euler, Arachne, Celera Assembler, Velvet, Bambus (stand alone scaffold assembler), SOMA (scaffolding with restriction maps... need more tests than the data we have).
- See 1.2 paper for overview of algorithms of these scaffolders
- Spur - single reads that disagree with the bulk of reads within a region of an assembly graph (OLC method) (1.2 pg 359)
- Unitig - uniquely assemblable contig (defined by Celera Assembler)... Contig constructed until a fork in a graph (OLC method) (1.2 pg 358)
- PROGRAMS: Minimus, newbler, internal datastructures of Celera Assembler and Arachne
Assembly Programs
- Euler-SR (short read)
- Newbler
- An Overlap Layout Consensus assembler.
- Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
- Assembler provided by 454 company.
- Good blog: http://contig.wordpress.com/
Scripts
http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.
Big Questions
- De novo or Reference/Comparative based assembly?
- Comparative assembly = suited for sequencing multiple strains of a same bacterium (1.2 pg 361).
- What kind of coverage is typical for phages?
Journal
November 4 2024
Looking at the raw assembly files, it looks like our reads are ~500nt on average. We do have small ones ~50nt.
The database includes three file types: .fna .qual .sff
Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).
Notes
- Use De Brujin graphs to estimate "completeness" of genomes assembled via de novo assembly
- Find Eulerian path(s) in these graphs
- Note the assumptions made in the paper
- PROGRAM: Jellyfish - counts k-mers http://www.cbcb.umd.edu/software/jellyfish/
- Lists compression techniques and the order to employ them
- Can use this method to compute N50
- N50 = the length of the largest contig (m) such that at least 50% of genome covered by contigs of size >= m.
- A higher N50 score usually correlates to a more "correct" genome
- Regardless of correctness of genome, for nearly all read sizes (1000nt > size > 25nt), 85%+ of genes accurately identified (85% is for 25nt reads).
Thoughts
- Look for assembler that uses De Brujin graph?
- PROGRAM: EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach. http://euler-assembler.ucsd.edu/portal/
- This paper showed how to get an upper limit of correctness of genome. Compare several existing de novo assemblers using the methods here as comparison.
- Is it possible to get the code used in this project?
Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).
Notes
- THESIS: Nice presentation of current technology and description of whole process (including de novo vs reference, de novo subtypes, scaffolding, assembly validation, metagenomics)
- THESIS: Table 1 = great summary of 2nd gen sequencing and the trade offs you make with it.
Thoughts
Basic Timeline
- 1st – 2nd Week
- Learn how to manipulate and handle raw read files.
- Familiarize myself with key sources listed above.
- Write module to calculate fold coverage using genome size estimate and total size of all reads.
- Write a prioritized list of features and goals for my program.
- 3rd – 6th week
- Develop my program in modules according to the prioritized features.
- Compare my program’s genome to previously assembled genomes from this raw data.
- Quantify the accuracy of my genome by testing for the size of a predicted gap or feature in the genome to size of that actual segment of DNA in the blueberry genome.
- Edit the program based on any issues encountered with the full data set.
- 7th – 10th week (Ending: July 29, 2011)
- Finish wet-lab accuracy tests
- Fine–tune the program based on any issues encountered with the full data set.
- Attempt to assemble the “Meatball” phage genome.
Random Notes
Citation method for papers read = the number journal is (currently 6 topic, then the sub section)... so currently if I say 1.2, then I look at Journal (6).1.2 = Genome assembly reborn