Difference between revisions of "Genome Assembly Project: Leland Taylor '12"

Revision as of 14:24, 23 May 2011

Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).

Use de Bruijn graphs to estimate "completeness" of genomes
- Find Eulerian path(s) in these graphs
- Note the assumptions made in the paper
Lists compression techniques and the order in which to apply them
Can use this method to compute N50
- N50 = the largest contig length m such that contigs of length >= m cover at least 50% of the genome.
- A higher N50 score usually correlates with a more "correct" genome
Regardless of the correctness of the genome, for nearly all read sizes (25 nt to 1000 nt), at least 85% of genes are accurately identified (the 85% figure is for 25 nt reads).
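
The N50 definition in the notes above can be sketched in a few lines of Python. This is only an illustration of the definition, not code from the paper: the function name `n50`, the optional `genome_size` parameter, and the contig lengths are all made up for the example. When no genome size is given, the sketch assumes the total contig length stands in for it.

```python
def n50(contig_lengths, genome_size=None):
    """Largest length m such that contigs of length >= m cover
    at least half of the genome.

    genome_size is a hypothetical parameter; if omitted, the sum of
    the contig lengths is used as a stand-in for the genome size.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    covered = 0
    # Walk contigs from longest to shortest, accumulating coverage.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # the contigs cover less than half of the genome


lengths = [80, 70, 50, 40, 30, 20]  # made-up contig lengths
print(n50(lengths))                   # relative to total contig length
print(n50(lengths, genome_size=400))  # relative to an assumed genome size
```

Sorting from longest to shortest means the first contig at which cumulative coverage reaches 50% is the largest m that satisfies the definition.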

Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).

@@ Line 1: / Line 1: @@
 == {{CURRENTMONTHNAME}} {{CURRENTDAY}} {{CURRENTYEAR}} ==
 #Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).
-**Use de Bruijn graphs to estimate "completeness" of genomes
+*Use de Bruijn graphs to estimate "completeness" of genomes
-***Find Eulerian path(s) in these graphs
+**Find Eulerian path(s) in these graphs
-***Note the assumptions made in the paper
+**Note the assumptions made in the paper
-**Lists compression techniques and the order in which to apply them
+*Lists compression techniques and the order in which to apply them
-**Can use this method to compute N50
+*Can use this method to compute N50
-***N50 = the largest contig length m such that contigs of length >= m cover at least 50% of the genome.
+**N50 = the largest contig length m such that contigs of length >= m cover at least 50% of the genome.
-***A higher N50 score usually correlates with a more "correct" genome
+**A higher N50 score usually correlates with a more "correct" genome
-**Regardless of the correctness of the genome, for nearly all read sizes (25 nt to 1000 nt), at least 85% of genes are accurately identified (the 85% figure is for 25 nt reads).
+*Regardless of the correctness of the genome, for nearly all read sizes (25 nt to 1000 nt), at least 85% of genes are accurately identified (the 85% figure is for 25 nt reads).
 #Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).