Genome Assembly Project: Leland Taylor '12

Useful Links

http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here

http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS

Assembly Method: de Bruijn graphs
- "when reads are so long it is better use an overlap layout method in order to avoid a great number of false positives" http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin

k-mer
- The larger the k-mer, the longer the overlap between two reads has to be. That is also why the k-mer can never be larger than the minimum read length. So an assembly at a higher k-mer size tends to be more "accurate" (which is not the same as a better N50) than one at a lower k-mer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
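
A minimal sketch of the k-mer point above, assuming a plain de Bruijn graph whose nodes are (k-1)-mers and whose edges are the k-mers seen in the reads. The function name and the toy reads are made up for illustration and do not come from any assembler. It shows two things: a read shorter than k contributes no k-mers, and a larger k can remove a branch that a smaller k creates.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # A read shorter than k yields an empty range, so it adds nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# Toy reads (illustrative only); the last one is just 3 nt long.
reads = ["ACGTAC", "GTACGA", "ACG"]

small_k = de_bruijn_edges(reads, 3)  # node "CG" branches to "GT" and "GA"
large_k = de_bruijn_edges(reads, 5)  # the 3 nt read is ignored, no branches remain
```

With k = 3 the node "CG" has two different successors, so the path is ambiguous there; with k = 5 every node has exactly one successor, but the shortest read no longer takes part in the graph.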

Newbler
- An Overlap Layout Consensus assembler.
- Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
- Developed by 454 Life Sciences (Roche).
- Good blog: http://contig.wordpress.com/

De novo or reference-based assembly?

Judging from the raw files, our reads average ~500nt, although some are as short as ~50nt.

The database includes three file types: .fna .qual .sff
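
To check the read-length estimate above, the lengths can be tallied straight from a .fna file, which is FASTA-formatted text. This is a rough sketch that assumes plain FASTA input. The function name and the example records are hypothetical, and the .qual and binary .sff files are not handled here.

```python
def read_lengths(lines):
    """Return {read_id: length} for an iterable of FASTA-formatted lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read id is the first word after ">"; the rest is description.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so add up every line of the record.
            lengths[current] += len(line)
    return lengths

# Made-up records standing in for a real .fna file.
example = [">read1 length=8", "ACGTACGT", ">read2", "ACG", "TT"]
lengths = read_lengths(example)
mean_length = sum(lengths.values()) / len(lengths)
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.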

Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).

Notes

Use de Bruijn graphs to estimate "completeness" of genomes assembled via de novo assembly
- Find Eulerian path(s) in these graphs
- Note the assumptions made in the paper
- PROGRAM: Jellyfish - counts k-mers http://www.cbcb.umd.edu/software/jellyfish/
The paper lists graph compression techniques and the order in which to apply them
Can use this method to compute N50
- N50 = the length m of the largest contig such that at least 50% of the genome is covered by contigs of size >= m.
- A higher N50 score usually correlates with a more "correct" genome
Regardless of the correctness of the genome, for nearly all read sizes (1000nt > size > 25nt), 85%+ of genes are accurately identified (the 85% figure is for 25nt reads).
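
The N50 definition above can be sketched in a few lines. This is only an illustration: the function name and the contig lengths are invented, and when no genome size estimate is passed it falls back to the total contig length, which is an assumption on my part rather than something taken from the paper.

```python
def n50(contig_lengths, genome_size=None):
    """Largest m such that contigs of length >= m cover at least half the genome."""
    # Assumption: without a genome size estimate, use the summed contig length.
    total = genome_size if genome_size is not None else sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # the contigs cover less than half of the genome

contigs = [80, 70, 50, 40, 30, 20]            # made-up contig lengths
n50_by_assembly = n50(contigs)                 # relative to total contig length (290)
n50_by_genome = n50(contigs, genome_size=400)  # relative to an assumed genome size
```

Measured against the assumed 400 nt genome, more contigs are needed to reach the halfway mark, so the N50 comes out lower than when it is measured against the assembly itself.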

Thoughts

Look for an assembler that uses a de Bruijn graph?
- PROGRAM: EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach. http://euler-assembler.ucsd.edu/portal/
This paper showed how to get an upper limit on the correctness of an assembled genome. Compare several existing de novo assemblers, using the methods here as the benchmark.
Is it possible to get the code used in this project?

Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).

Notes

Thoughts

1st – 2nd week
- Learn how to manipulate and handle raw read files.
- Familiarize myself with key sources listed above.
- Write module to calculate fold coverage using genome size estimate and total size of all reads.
- Write a prioritized list of features and goals for my program.
3rd – 6th week
- Develop my program in modules according to the prioritized features.
- Compare my program’s genome to previously assembled genomes from this raw data.
- Quantify the accuracy of my genome by comparing the predicted size of a gap or feature in the genome with the size of that actual segment of DNA in the blueberry genome.
- Edit the program based on any issues encountered with the full data set.
7th – 10th week (Ending: July 29, 2011)
- Finish wet-lab accuracy tests
- Fine-tune the program based on any issues encountered with the full data set.
- Attempt to assemble the “Meatball” phage genome.
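
A possible starting point for the fold-coverage module planned for the first two weeks, assuming fold coverage means total bases in all reads divided by the estimated genome size. The function name and the numbers in the example are placeholders, not measurements from our data.

```python
def fold_coverage(lengths, genome_size):
    """Total bases in all reads divided by the estimated genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size

# Placeholder numbers: 10,000 reads of 500 nt against a 50,000 bp genome estimate.
coverage = fold_coverage([500] * 10000, 50000)
```

The read lengths can come from any source that yields one integer per read, so this module does not need to know about the raw file formats.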