Genome Assembly Project: Leland Taylor '12

1 Useful Links
2 Vocab
3 Assembly Programs
4 Scripts
5 Big Questions
6 Journal
- 6.1 March 5 2026
  - 6.1.1 Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).
  - 6.1.2 Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).
7 Basic Timeline
8 Random Notes

Useful Links

http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here

http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS

http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs

http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, ALL PATHS 2)

Vocab

Comparison assembly (aka reference based) - basically align reads to reference genome
- Regions that differ slightly (like large insertions) still need to be assembled de novo (Genome assembly reborn: recent computational challenges)

Coverage - the ratio between the cumulative size of a set of reads and the size of the genome

de novo assembly - NP-hard problem. Assembly from scratch
- Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
  - hybrid de novo assembly

DeNovoAssemblyMethod: Brujin graphs (aka Eulerian pathway) (1.2 pg 359)
- Make graph: nodes = each k-1 section, edges = exact overlap of k-2 (see 1.2 fig 3B).
- Assembler = find pathway though this graph that uses every edge (aka Eulerian path)
  - Typically more than one Eulerian path can be found... representing the many different ways a genome ca be rearranged around repeats.
- Loose long range connectivity information by choping up reads in to k-mer sets... this info lost = useful in reducing ambiguity of graph structure
  - "Eulerian superpath problem" - solves this issue via graph constructed from sub-paths corresponding to individual reads provided as input to assembler
- "when reads are so long it is better use an overlap layout method in order to avoid a great number of false positives" http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin
- Better at resolving repeats (Assembly complexity of prokaryotic genomes using short reads)
- Very sensitive to sequencing errors
- Ideal for short reads... high depth of coverage of reads in roughly equal length
- See daniel zerbinos phd thesis
- PROGRAM: Velvet, Allpaths

DeNovoAssemblyMethod: Greedy (1.2 pg 357)
- Step 1: Reads compared to each other to construct a list of pair-wise overlaps
- Step 2: Join contigs that overlap the best and and when no ore reads or contigs can be joined
  - "Overlaps" = prefix of one read shares enough similarity with suffix of another
- "Greedy" - the algorithm optimizes locally... ie the quality of overlap between reads
  - Starts by processing the best overlap first, so chooses that path and may misassemble many repeats
- PROGRAM: TIGR Assembler, CAP3

DeNovoAssemblyMethod: Greedy for Short Reads (1.2 pg 357)
- Unassembled read chosen as start contig... Contig built on the 3' end until no more extensions possible. Then build onto 5' end using rev comp of original read.
- Extension process terminated when conflicting information found... 2+ reads could extend the contig.
- PROGRAM: SSAKE, VCAKE, SHARCGS

DeNovoAssemblyMethod: Overlap layout consensus (OLC) (1.2 pg 357)
- Step 1: Reads compared to each other to construct a list of pair-wise overlaps
- Step 2: Create overlap map... each read = node, edge = if overlap identified between reads
  - Core part of OLC
  - Ultimate goal = find a single path that traverses each node of the graph exactly once (hamiltonian path).
- Step 3: Consensus computation - determine the DNA sequence implied by the arrangement of reads along the chosen path
- PROGRAMS: Celera Assembler, Arachne, newbler, Edena
- Most popular type... perhaps from its flexibility

FileType: .fna

FileType: .qual

FileType: .sff

k-mer
- the larger the kmer the longer the overlap between two reads has to be. that's also a reason why the kmer can never be larger then your minimum read length. SO an assembly at a higher kmer size is always more "accurate"(not talking about better N50) than the one at a lower kmer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
- Usually linked to Eulerian path assemblies (Brujin graphs)... in implementations, the higher k-mer, the less RAM because the graph will be smaller

Mate-pair -

Metagenomics - sequencing entire microbial communities instead of isolate genomes (Genome assembly reborn: recent computational challenges)

N50
- the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. (http://seqanswers.com/forums/showthread.php?t=2332)
- Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value (http://seqanswers.com/forums/showthread.php?t=2332)
- N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)

Sanger-based sequencing - first generation sequencing (1000-2000bp reads)

Scaffolds (1.2 pg 360)
- Groups of contigs whose relative placement is known though the DNA sequence of genomic regions connecting these contigs is unknown
- Usually relies on mate-pair information (other method = optical mapping like in SOMA)
  - Two contigs = adjacent if one end of mate-pair in contig1 and other end of mate-pair in contig2
- Most assemblers have scaffolding module
- The longer contigs are, the easier it is to scaffold
- PROGRAMS: Euler, Arachne, Celera Assembler, Velvet, Bambus (stand alone scaffold assembler), SOMA (scaffolding with restriction maps... need more tests than the data we have).
  - See 1.2 paper for overview of algorithms of these scaffolders

Spur - single reads that disagree with the bulk of reads within a region of an assembly graph (OLC method) (1.2 pg 359)

Unitig - uniquely assemblable contig (defined by Celera Assembler)... Contig constructed until a fork in a graph (OLC method) (1.2 pg 358)
- PROGRAMS: Minimus, newbler, internal datastructures of Celera Assembler and Arachne

Assembly Programs

Euler-SR (short read)
Newbler
- An Overlap Layout Consensus assembler.
- Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
- Assembler provided by 454 company.
- Good blog: http://contig.wordpress.com/

Scripts

http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.

Big Questions

De novo or Reference based assembly?

What kind of coverage is typical for phages?

Journal

March 5 2026

Looking at the raw assembly files, it looks like our reads are ~500nt on average. We do have small ones ~50nt.

The database includes three file types: .fna .qual .sff

Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).

Notes

Use De Brujin graphs to estimate "completeness" of genomes assembled via de novo assembly
- Find Eulerian path(s) in these graphs
- Note the assumptions made in the paper
- PROGRAM: Jellyfish - counts k-mers http://www.cbcb.umd.edu/software/jellyfish/
Lists compression techniques and the order to employ them
Can use this method to compute N50
- N50 = the length of the largest contig (m) such that at least 50% of genome covered by contigs of size >= m.
- A higher N50 score usually correlates to a more "correct" genome
Regardless of correctness of genome, for nearly all read sizes (1000nt > size > 25nt), 85%+ of genes accurately identified (85% is for 25nt reads).

Thoughts

Look for assembler that uses De Brujin graph?
- PROGRAM: EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach. http://euler-assembler.ucsd.edu/portal/
This paper showed how to get an upper limit of correctness of genome. Compare several existing de novo assemblers using the methods here as comparison.
Is it possible to get the code used in this project?

Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).

Notes

THESIS: Nice presentation of current technology and description of whole process (including de novo vs reference, de novo subtypes, scaffolding, assembly validation, metagenomics)
THESIS: Table 1 = great summary of 2nd gen sequencing and the trade offs you make with it.

Thoughts

Basic Timeline

1st – 2nd Week
- Learn how to manipulate and handle raw read files.
- Familiarize myself with key sources listed above.
- Write module to calculate fold coverage using genome size estimate and total size of all reads.
- Write a prioritized list of features and goals for my program.
3rd – 6th week
- Develop my program in modules according to the prioritized features.
- Compare my program’s genome to previously assembled genomes from this raw data.
- Quantify the accuracy of my genome by testing for the size of a predicted gap or feature in the genome to size of that actual segment of DNA in the blueberry genome.
- Edit the program based on any issues encountered with the full data set.
7th – 10th week (Ending: July 29, 2011)
- Finish wet-lab accuracy tests
- Fine–tune the program based on any issues encountered with the full data set.
- Attempt to assemble the “Meatball” phage genome.

Random Notes

Citation method for papers read = the number journal is (currently 6 topic, then the sub section)... so currently if I say 1.2, then I look at Journal (6).1.2 = Genome assembly reborn

Genome Assembly Project: Leland Taylor '12

Contents

Useful Links

Vocab

Assembly Programs

Scripts

Big Questions

Journal

March 5 2026

Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).

Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).

Basic Timeline

Random Notes

Navigation menu

Views

Personal tools

Navigation

Search

Tools