Genome Assembly Project: Leland Taylor '12

Revision as of 16:14, 23 May 2011

Useful Links

http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here

http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS

http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs

http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, ALLPATHS 2)

Vocab

  • AssemblyMethod: Overlap layout consensus
  • FileType: .fna
  • FileType: .qual
  • FileType: .sff
  • hybrid de novo assembly
  • k-mer
    • The larger the k-mer, the longer the overlap between two reads has to be. That is also why the k-mer can never be larger than your minimum read length. So an assembly at a higher k-mer size is generally more "accurate" (not talking about a better N50) than one at a lower k-mer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
  • N50
    • The length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. (http://seqanswers.com/forums/showthread.php?t=2332)
    • Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value. (http://seqanswers.com/forums/showthread.php?t=2332)
    • N50 is a statistical measure of the average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in sequences shorter than N. This can be found mathematically as follows: take a list L of positive integers. Create another list L', which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: if L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L', which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)


  • Sanger-based sequencing - first generation sequencing
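
The N50 definitions quoted above can be sketched in Python as follows. This is a minimal sketch, not code from any of the linked threads; the function names n50 and n50_weighted_median are hypothetical. The first function follows the "smallest contig in the smallest set of largest contigs" wording, and the second follows the "median of L'" construction. On the example list they give different values (8 and 6), which may be worth keeping in mind when comparing N50 numbers from different tools.

```python
from statistics import median


def n50(lengths):
    """Length of the smallest contig in the smallest set of largest
    contigs whose combined length is at least 50% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def n50_weighted_median(lengths):
    """Median of the list L' in which each length n appears n times."""
    expanded = [n for n in lengths for _ in range(n)]
    return median(expanded)


example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))                  # contig-based definition
print(n50_weighted_median(example))  # median-of-L' definition
```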

Assembly Programs

Scripts

http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.
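
For reference, the same kind of conversion can be sketched in Python. This is not a port of the Perl script above; it is a minimal sketch that assumes the simple four-line FASTQ layout (header, sequence, "+" line, quality) with no line-wrapped sequences, and the function name fastq_to_fasta is hypothetical.

```python
from itertools import islice


def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumes exactly four lines per record; quality scores are dropped.
    """
    it = iter(fastq_lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not any(record):
            return  # end of input (or only trailing blank lines)
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, _qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield ">" + header[1:]
        yield seq


sample = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
for line in fastq_to_fasta(sample):
    print(line)
```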

Big Questions

De novo or Reference based assembly?

Journal

May 23 2011

Looking at the raw assembly files, it looks like our reads are ~500nt on average, though some are as short as ~50nt.

The database includes three file types: .fna .qual .sff
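
A quick way to check read lengths like these is to tally sequence lengths from the .fna file. The sketch below assumes the .fna file is plain FASTA (a ">" header line followed by one or more sequence lines); the function name read_lengths and the sample records are made up for illustration.

```python
def read_lengths(fasta_lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


sample = [">read1", "ACGTACGT", "ACG", ">read2", "GG"]
lengths = read_lengths(sample)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

With a real file, the lines could be supplied by opening the .fna file and passing the file object directly.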


Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).

Notes

  • Use de Bruijn graphs to estimate "completeness" of genomes assembled via de novo assembly
  • Lists compression techniques and the order to employ them
  • Can use this method to compute N50
    • N50 = the length of the largest contig (m) such that at least 50% of the genome is covered by contigs of size >= m.
    • A higher N50 score usually correlates with a more "correct" genome
  • Regardless of the correctness of the genome, for nearly all read sizes (1000nt > size > 25nt), 85%+ of genes are accurately identified (85% is for 25nt reads).
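
To make the de Bruijn graph idea in these notes concrete, here is a minimal sketch of building one from reads. It is not the method from the paper (no compression steps, no error handling); it only shows the basic structure, in which each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix. The function names kmers and de_bruijn are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k; empty if the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGT", "CGTA"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Note that a read shorter than k contributes no k-mers at all in this sketch, which matches the remark in the Vocab section about k-mer size and minimum read length.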

Thoughts

  • Look for an assembler that uses a de Bruijn graph?
    • PROGRAM: EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach. http://euler-assembler.ucsd.edu/portal/
  • This paper showed how to get an upper limit on the correctness of an assembled genome. Compare several existing de novo assemblers using the methods here as a benchmark.
  • Is it possible to get the code used in this project?


Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).

Notes

Thoughts

Basic Timeline

  • 1st – 2nd Week
    • Learn how to manipulate and handle raw read files.
    • Familiarize myself with key sources listed above.
    • Write module to calculate fold coverage using genome size estimate and total size of all reads.
    • Write a prioritized list of features and goals for my program.
  • 3rd – 6th week
    • Develop my program in modules according to the prioritized features.
    • Compare my program’s genome to previously assembled genomes from this raw data.
    • Quantify the accuracy of my genome by comparing the size of a predicted gap or feature in the genome to the size of that actual segment of DNA in the blueberry genome.
    • Edit the program based on any issues encountered with the full data set.
  • 7th – 10th week (Ending: July 29, 2011)
    • Finish wet-lab accuracy tests
    • Fine-tune the program based on any issues encountered with the full data set.
    • Attempt to assemble the “Meatball” phage genome.
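
The fold coverage module planned for the first two weeks could start from something like the sketch below. It assumes the usual definition, total bases in all reads divided by the estimated genome size; the function name fold_coverage and the numbers in the usage line are illustrative only.

```python
def fold_coverage(lengths, genome_size):
    """Estimated fold coverage: total read bases / estimated genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size


# Hypothetical numbers: 1,000 reads of 500 nt against a 50,000 nt genome estimate.
print(fold_coverage([500] * 1000, 50_000))
```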