Difference between revisions of "Genome Assembly Project: Leland Taylor '12"
From GcatWiki
Revision as of 15:16, 23 May 2011
Useful Links
http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here
http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS
http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs
Basic Timeline
- 1st – 2nd week
  - Learn how to manipulate and handle raw read files.
  - Familiarize myself with key sources listed above.
  - Write a module to calculate fold coverage using the genome size estimate and the total size of all reads.
  - Write a prioritized list of features and goals for my program.
- 3rd – 6th week
  - Develop my program in modules according to the prioritized features.
  - Compare my program’s genome to previously assembled genomes from this raw data.
  - Quantify the accuracy of my genome by comparing the size of a predicted gap or feature in the assembly to the size of the actual segment of DNA in the blueberry genome.
  - Edit the program based on any issues encountered with the full data set.
- 7th – 10th week (Ending: July 29, 2011)
  - Finish wet-lab accuracy tests.
  - Fine-tune the program based on any issues encountered with the full data set.
  - Attempt to assemble the “Meatball” phage genome.
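As a rough illustration of the fold-coverage module planned for the first two weeks, here is a minimal sketch. It assumes the raw reads are in FASTQ format (four lines per record), which these notes do not confirm, and the function names `total_read_bases` and `fold_coverage` are hypothetical. Fold coverage is taken as the total number of read bases divided by the estimated genome size.

```python
def total_read_bases(fastq_lines):
    """Sum the read lengths in an iterable of FASTQ lines.

    Assumption: strict 4-line FASTQ records, so the sequence is
    always the second line of each record (index 1, 5, 9, ...).
    """
    return sum(
        len(line.strip())
        for i, line in enumerate(fastq_lines)
        if i % 4 == 1
    )


def fold_coverage(total_bases, genome_size_estimate):
    """Fold coverage = total bases in all reads / estimated genome size."""
    if genome_size_estimate <= 0:
        raise ValueError("genome size estimate must be positive")
    return total_bases / genome_size_estimate


# Tiny in-memory example with two made-up reads (4 nt and 6 nt).
example_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGTAC", "+", "IIIIII",
]
example_total = total_read_bases(example_lines)
example_coverage = fold_coverage(example_total, genome_size_estimate=5)
```

For a real file, the same function could be fed an open file handle, since iterating over a file yields its lines.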
November 21 2024
Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).
Notes
- Use de Bruijn graphs to estimate "completeness" of genomes assembled via de novo assembly
- Find Eulerian path(s) in these graphs
- Note the assumptions made in the paper
- TOOL: Jellyfish - counts k-mers http://www.cbcb.umd.edu/software/jellyfish/
- Lists compression techniques and the order to employ them
- Can use this method to compute N50
- N50 = the largest contig length m such that contigs of size >= m cover at least 50% of the genome.
- A higher N50 score usually correlates with a more "correct" genome
- Regardless of the correctness of the genome, for nearly all read sizes (25nt < size < 1000nt), 85%+ of genes are accurately identified (85% is for 25nt reads).
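The N50 definition in the notes above can be sketched in a few lines. This is only an illustration, not the paper's code: the function name `n50` and the optional `genome_size` parameter are hypothetical. When no genome size is given, the sketch falls back to the total length of all contigs, which is a common convention but an assumption here.

```python
def n50(contig_lengths, genome_size=None):
    """Return the largest length m such that contigs of length >= m
    cover at least 50% of the genome.

    Assumption: if genome_size is None, the summed contig length is
    used in its place.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    covered = 0
    for m in lengths:
        covered += m
        # Compare 2 * covered against total to stay in integer arithmetic.
        if 2 * covered >= total:
            return m
    return 0  # the contigs cover less than half of the stated genome size


example_contigs = [80, 70, 50, 40, 30, 20]  # made-up contig lengths
example_n50 = n50(example_contigs)
example_n50_vs_genome = n50(example_contigs, genome_size=400)
```

Walking the contigs from longest to shortest and stopping at the first one that pushes the running total past the halfway mark gives the largest qualifying m directly.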
Thoughts
- Look for an assembler that uses a de Bruijn graph?
- This paper showed how to get an upper limit on the correctness of an assembled genome. Compare several existing de novo assemblers, using the methods here as the benchmark.
- Is it possible to get the code used in this project?
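To make the de Bruijn graph idea from these notes concrete, here is a toy sketch: count k-mers in the reads, turn each distinct k-mer into an edge between its (k-1)-mer prefix and suffix, and walk an Eulerian path to spell a sequence. It is not the paper's code and not how Jellyfish works internally. The k-mer counting is plain in-memory counting, the path search uses Hierholzer's algorithm, and all function names are hypothetical. It assumes error-free, single-strand reads and that an Eulerian path exists.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_graph(kmers):
    """Each distinct k-mer becomes one edge: (k-1)-prefix -> (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes a non-empty graph with an Eulerian path."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = Counter(v for vs in graph.values() for v in vs)
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in nodes if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Join overlapping (k-1)-mers back into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


example_reads = ["ATGCG", "TGCGA"]  # made-up overlapping reads
example_counts = count_kmers(example_reads, k=3)
example_sequence = spell(eulerian_path(de_bruijn_graph(example_counts)))
```

In this toy case the graph has no repeated (k-1)-mers, so the path is unique. When a genome contains repeats longer than k-1, the graph admits several Eulerian paths, and the reads alone cannot tell which one is the true genome.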
Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).
Notes
Thoughts