Difference between revisions of "Genome Assembly Project: Leland Taylor '12"

From GcatWiki
Jump to: navigation, search
(Vocab)
(Program Development)
 
(236 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Useful Links ==
+
== Links ==
 
http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here
 
http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here
 +
 +
[http://seqanswers.com/ SEQanswers] - online community for sequencing and assemblers.
 +
*[http://seqanswers.com/wiki/SEQanswers SEQanswers Wiki] - sweet software hub with over 400 bioinformatics programs (also see http://seqanswers.com/wiki/Special:BrowseData)
 +
*http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs
  
 
http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS
 
http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS
  
http://seqanswers.com/forums/showthread.php?t=43 - a good list of assembly programs
+
http://bioinf.comav.upv.es/index.html - a few useful python scripts
 +
 
 +
http://www.mediawiki.org/wiki/Help:Formatting - wiki formatting
 +
 
 +
http://adenine.biz.fh-weihenstephan.de/homePage/ - alot of useful software
 +
 
 +
http://adenine.biz.fh-weihenstephan.de/homePage/node4.html - good tools... k-mer counting and counting unique substrings
 +
 
 +
http://www.voidspace.org.uk/python/cgi.shtml#index - CGI programming with python
 +
 
 +
http://docs.python.org/howto/webservers.html - better CGI programming with python
 +
 
 +
http://gage.cbcb.umd.edu/index.html#what - GAGE = project to compare assemblers
 +
 
 +
http://www.yak.net/fqa/84.html - Send non STMP email on unix with python
 +
 
 +
http://codecanyon.net/item/explr-jquery-simple-tree-plugin-/136839
 +
 
 +
http://validator.w3.org/ - validate html
 +
 
 +
=== Cool Links ===
 +
 
 +
http://pathogenomics.bham.ac.uk/hts/ - world map of high throughput sequencers.
 +
 
 +
http://www.wired.com/wiredscience/2011/07/genome-structure/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Index+3+%28Top+Stories+2%29%29 - are individual traits linked to genome organization?
 +
 
 +
=== Link Sandbox ===
  
 
http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, All Paths 2)
 
http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, All Paths 2)
 +
 +
http://seqanswers.com/forums/showthread.php?t=9465&highlight=virus - previous virus assembly experiances
 +
 +
http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly - .sff assembly
 +
 +
http://seqanswers.com/forums/showthread.php?t=10988&highlight=k-mer - how to estimate genome size
  
 
== Vocab ==
 
== Vocab ==
Line 20: Line 56:
 
*<u>''de novo'' assembly</u> - NP-hard problem. Assembly from scratch
 
*<u>''de novo'' assembly</u> - NP-hard problem. Assembly from scratch
 
**Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
 
**Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
***hybrid ''de novo'' assembly
+
***hybrid ''de novo'' assembly - using data from different sources... harder, but can gain insight by playing the different data sets to their strengths.
 +
****Eulerian method ideal for this.
  
*<u>DeNovoAssemblyMethod: Brujin graphs (aka Eulerian pathway) </u> (1.2 pg 359)
+
*<u>DeNovoAssemblyMethod: Bruijn graphs (aka Eulerian pathway) </u> (1.2 pg 359) (often abbreviated DBG)
 
**Make graph: nodes = each k-1 section, edges = exact overlap of k-2 (see 1.2 fig 3B).
 
**Make graph: nodes = each k-1 section, edges = exact overlap of k-2 (see 1.2 fig 3B).
 
**Assembler = find pathway though this graph that uses every edge (aka Eulerian path)
 
**Assembler = find pathway though this graph that uses every edge (aka Eulerian path)
Line 28: Line 65:
 
**Loose long range connectivity information by choping up reads in to k-mer sets... this info lost = useful in reducing ambiguity of graph structure
 
**Loose long range connectivity information by choping up reads in to k-mer sets... this info lost = useful in reducing ambiguity of graph structure
 
***"Eulerian superpath problem" - solves this issue via graph constructed from sub-paths corresponding to individual reads provided as input to assembler
 
***"Eulerian superpath problem" - solves this issue via graph constructed from sub-paths corresponding to individual reads provided as input to assembler
**"when reads are so long it is better use an overlap layout method in order to avoid a great number of false positives" http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin
+
**Repeats
**Better at resolving repeats (Assembly complexity of prokaryotic genomes using short reads)
+
***Better at resolving repeats (Assembly complexity of prokaryotic genomes using short reads)
**Very sensitive to sequencing errors
+
***Repeats make graph more complex because introduce cycles (3.1 pg 320)
 +
**Complications Bruijn Graphs must overcome with DNA assemblies (3.1 pg 320)
 +
**#DNA = double stranded. Forward sequence of read may overlap with the forward or reverse complement of other reads
 +
**#*Methods To Overcome: graph contains nodes and edges for both strands, forward/reverse make up half nodes individually and paths must enter/exit from same half (Velvet), alternate strands in single node with 2 sides and paths enter/exit opposite sides (ABySS).
 +
**#Repeats
 +
**#*Repeats longer than K lead to tangled K-mer graphs
 +
**#*Perfect repeats >= K lead to frayed ends
 +
**#Palindromes
 +
**#*Make paths fold back on themselves
 +
**#*Method To Overcome: require k to be odd (Velvet). an odd size k-mer cannot match its reverse complement
 +
**#Sequence Error
 +
**#*Very sensitive to sequencing errors
 +
**#**Make graph more complex by adding edges
 +
**#*Steps to overcome
 +
**#*#Preprocess reads to remove errors (kick out low qual...)
 +
**#*#Weight graph edges by number of reads that support them & erode lightly supported paths
 +
**#*#Convert paths to sequences and use sequence alignment algorithms to collapse nearly identical paths
 
**Ideal for short reads... high depth of coverage of reads in roughly equal length
 
**Ideal for short reads... high depth of coverage of reads in roughly equal length
 +
***"when reads are so long it is better use an overlap layout method in order to avoid a great number of false positives" http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin
 
**See daniel zerbinos phd thesis
 
**See daniel zerbinos phd thesis
**PROGRAM: Velvet, Allpaths
+
**PROGRAM: Velvet, Allpaths, Euler-SR, Euler-USR
  
 
*<u>DeNovoAssemblyMethod: Greedy</u> (1.2 pg 357)
 
*<u>DeNovoAssemblyMethod: Greedy</u> (1.2 pg 357)
Line 57: Line 111:
 
**Most popular type... perhaps from its flexibility
 
**Most popular type... perhaps from its flexibility
  
*<u>FileType: .fna</u>
+
*<u>Directed Graph</u> - graph in which all edges are provided with an orientation, so that an edge connecting v to w is not the same as an edge connecting w to v. (7.1)
 +
 
 +
*<u>Eulerian Cycle Problem</u> (ECP) - is it possible to traverse every '''edge''' in graph G and end up at the starting point? (7.1)
 +
 
 +
*<u>Euler's Theorems</u>
 +
**Theorem 1 - a Eulerian cycle exists if the degree of each vertex in G is even. The degree = the total number of edges connecting vertex v to other vertices. (7.1)
 +
**Theorem 2 - a Eulerian cycle exists if the indegree and outdegree of every vertex of G are equal (they don't all have to be equal to each other). This is for directional graph, were w to v edge not the same as v to w edge. (7.1)
 +
 
 +
*<u>FileType: .fna</u> (from 454) - all of the different reads with a > header that includes the following information (https://shell.cgrb.oregonstate.edu/sites/default/files/Files/Docs/Roche/software/454_Sequencing_Software_Manual_v2.5p1_Overview.pdf)
 +
**rank = ranking of well by signal intensity
 +
**x & y = coordinates of the well center pixel
 +
**length = the length of the read
 +
 
 +
*<u>FileType: .qual</u> (from 454) - base quality score values for each nt in the corresponding .fna file. Each entry has the same header as the .fna.
  
*<u>FileType: .qual</u>
+
*<u>FileType: .sff</u> (from 454) - Standard Flowgram Files
 +
**.sff file format consists of three sections: a common header section occurring once in the file; then for each read stored in the file, a read header section and a read data section
 +
**I think these files contain mate pair information (http://seqanswers.com/forums/showthread.php?t=11387&highlight=mate+information)
  
*<u>FileType: .sff</u>
+
*<u>indel</u> - shorthand term for insertion or deletion
  
 
*<u>k-mer</u>
 
*<u>k-mer</u>
 +
**all of the k nucleotide words present in a genome
 
**the larger the kmer the longer the overlap between two reads has to be. that's also a reason why the kmer can never be larger then your minimum read length. SO an assembly at a higher kmer size is always more "accurate"(not talking about better N50) than the one at a lower kmer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
 
**the larger the kmer the longer the overlap between two reads has to be. that's also a reason why the kmer can never be larger then your minimum read length. SO an assembly at a higher kmer size is always more "accurate"(not talking about better N50) than the one at a lower kmer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
**Usually linked to Eulerian path assemblies (Brujin graphs)... in implementations, the higher k-mer, the less RAM because the graph will be smaller
+
**Usually linked to Eulerian path assemblies (Bruijn graphs)... in implementations, the higher k-mer, the less RAM because the graph will be smaller
 +
**k-mers matter because they are the edges of the Bruijn graph (7.1)
 +
 
 +
*<u>k-mer multiplicity</u> - in de Bruijn graph... how many times k-mer occurs (t). Then in graph, attach k-mer to suffix in graph via t edges. (7.1 pg 19)
  
*<u>Mate-pair</u> -  
+
*<u>Length Unit: kb </u>(= kbp) = kilo base pairs = 1,000 bp
 +
*<u>Length Unit: Mb </u>= mega base pairs = 1,000,000 bp
 +
*<u>Length Unit: Gb </u>= giga base pairs = 1,000,000,000 bp.
 +
 
 +
*<u>Mate-paired reads</u>
 +
**Back in the day (with clone libraries) paired reads = sequencing opposite directions of an amplified clone,
 +
**Now... amplify and sequence short paired-end tags in parallel
 +
***Short genomic sequences circularized using known linker sequence, then cleaved outside of the ligation site... leaving an overhang tag (2.1 pg 4)
 +
**Have a 1kb fragment and sequence 100bp on each size... then know what ever contig these reads are in must be ~1kb apart
  
 
*<u>Metagenomics</u> - sequencing entire microbial communities instead of isolate genomes (Genome assembly reborn: recent computational challenges)
 
*<u>Metagenomics</u> - sequencing entire microbial communities instead of isolate genomes (Genome assembly reborn: recent computational challenges)
Line 75: Line 156:
 
**Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value (http://seqanswers.com/forums/showthread.php?t=2332)
 
**Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value (http://seqanswers.com/forums/showthread.php?t=2332)
 
**N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)
 
**N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)
 +
**Sort all of the contigs by length. What is the length of the contig in the middle?
 +
**I sort the list of lengths from low to high. Then, starting with the longest sequences first, I subtract one sequence length at a time from the total length (total number of bases), until I reach one half of the total length. The sequence length I just subtracted (or the longest remaining length .. one could quibble) is the N50.
 +
 +
*<u>Paired end linker sequences</u>
 +
 +
*<u>Paired end adaptor sequences</u>
 +
 +
*<u>DNA Palindrome</u> - DNA sequence that is its own reverse complement
 +
 +
*<u>Hamiltonian Cycle Problem</u> (HPC) - is there a pathway in graph G in which one travels to every '''vertex''' once and returns to the starting point?
 +
 +
*<u>Repeats</u>
 +
**Tandem repeats
 +
**Inverted repeats
 +
**Imperfect repeats
 +
**Perfect repeats
  
 
*<u>Sanger-based sequencing</u> - first generation sequencing (1000-2000bp reads)
 
*<u>Sanger-based sequencing</u> - first generation sequencing (1000-2000bp reads)
Line 90: Line 187:
  
 
*<u>Unitig</u> - uniquely assemblable contig (defined by Celera Assembler)... Contig constructed until a fork in a graph (OLC method) (1.2 pg 358)
 
*<u>Unitig</u> - uniquely assemblable contig (defined by Celera Assembler)... Contig constructed until a fork in a graph (OLC method) (1.2 pg 358)
 +
**Preliminary, high confidence, conservative contigs. (3.1 pg 319)
 
**PROGRAMS: Minimus, newbler, internal datastructures of Celera Assembler and Arachne
 
**PROGRAMS: Minimus, newbler, internal datastructures of Celera Assembler and Arachne
  
== Assembly Programs ==
+
==[[Programs]]==
*Euler-SR (short read)
 
*Newbler
 
**An Overlap Layout Consensus assembler.
 
**Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
 
**Assembler provided by 454 company.
 
**Good blog: http://contig.wordpress.com/
 
  
 
== Scripts ==
 
== Scripts ==
 
http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.
 
http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.
 +
 +
[http://bioinf.comav.upv.es/sff_extract/ sff_extract] - python script that extracts .sff data
 +
 +
[http://bioinf.comav.upv.es/clean_reads/index.html clean_reads] - python script that cleans reads
 +
 +
[http://cgpy-shell.googlecode.com/svn/trunk/] - safe cgi shell interface
  
 
== Big Questions ==
 
== Big Questions ==
 
*''De novo'' or Reference/Comparative based assembly?
 
*''De novo'' or Reference/Comparative based assembly?
 
**Comparative assembly = suited for sequencing multiple strains of a same bacterium (1.2 pg 361).
 
**Comparative assembly = suited for sequencing multiple strains of a same bacterium (1.2 pg 361).
 +
**The HHMI program uses de novo assemblers (http://phagesdb.org/workflow/Sequencing/).
  
 
*What kind of coverage is typical for phages?
 
*What kind of coverage is typical for phages?
  
 
==Journal==
 
==Journal==
=== {{CURRENTMONTHNAME}} {{CURRENTDAY}} {{CURRENTYEAR}} ===
+
=== May 23 2011 ===
 
Looking at the raw assembly files, it looks like our reads are ~500nt on average. We do have small ones ~50nt.
 
Looking at the raw assembly files, it looks like our reads are ~500nt on average. We do have small ones ~50nt.
  
 
The database includes three file types: .fna  .qual  .sff
 
The database includes three file types: .fna  .qual  .sff
 
  
 
====Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).====
 
====Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).====
Line 140: Line 238:
 
*THESIS: Nice presentation of current technology and description of whole process (including de novo vs reference, de novo subtypes, scaffolding, assembly validation, metagenomics)
 
*THESIS: Nice presentation of current technology and description of whole process (including de novo vs reference, de novo subtypes, scaffolding, assembly validation, metagenomics)
 
*THESIS: Table 1 = great summary of 2nd gen sequencing and the trade offs you make with it.
 
*THESIS: Table 1 = great summary of 2nd gen sequencing and the trade offs you make with it.
 +
*Assembly Validation
 +
**Stat analysis for areas with unusually deep coverage
 +
**Poission stat to estimate likely hood region represents a repeat
 +
***DNP??
 +
***PROGRAM: DNPtrapper, Euler-AIR
 +
**Mate pairs
 +
***In correct assembly, tiling of reads should be consistent with mate pair constraints (relative orientation and paired ends)
 +
***PROGRAM: Tampa, Zimin, and visual tools like BACCardI, Hawkeye
 +
****Hawkeye also has built in AMOSvalidate
 +
**Whole genome mapping (with restriction enzymes)
 +
***PROGRAM: SOMA
  
 
''Thoughts''
 
''Thoughts''
 +
*I am torn between working with de novo sequencers or comparative assemblers... Perhaps the best way is to start with de novo and then move to comparative, as de novo will likely be needed for comparative assemblers.
 +
*The target phage genomes are small, so we do not really need to worry that much about efficiency.
 +
*There are so many programs out there that seem like they would work for my purposes. Perhaps my project should be to take an existing program and determine parameters ideal for phages.
 +
*It may be difficult to make a universal program. So much of the ideal assembly program depends on the characteristics of your reads... Are they short/long reads? A single data source? Do you have mate-pair information? Perhaps I am creating a tool to just peer into the assembly program.
 +
**I can design it to be optimized for the 454 data I have now.
 +
*Areas of future work identified in this paper.
 +
**Designing programs to use multiple datasets to identify the best assembly. For example, Illumina and 454 data, but also non-sequence data like genome mapping (which has been shown to really help out assemblies)
 +
 +
'''To Do Tomorrow:'''
 +
*Think more about thoughts segments above from papers
 +
*Do research on virus/prokaryote genome assemblies.
 +
*Read the new promising papers identified today.
 +
*Begin to think about ways to characterize our data to answer what assembly method would be best in this scenario.
 +
 +
=== May 24 2011 ===
  
== Basic Timeline ==
+
====Chaisson, M.J. & Pevzner, P.A. Short read fragment assembly of bacterial genomes. Genome Research 18, 324-330 (2008).====
*1st – 2nd Week
+
''Notes''
**Learn how to manipulate and handle raw read files.  
+
*In the beginning, high throughput sequencing with short reads required reference genome and was limited to resequencing projects, gene expression, and genomic profiling projects.
**Familiarize myself with key sources listed above.
+
*"Long" contigs = contigs > 500bp
**Write module to calculate fold coverage using genome size estimate and total size of all reads.
+
*These programs which rely heavily on mate pair information do not perform well in next gen sequence data
**Write a prioritized list of features and goals for my program.
+
**PROGRAM: Phrap, PCAP, Arachne, Euler+
*3rd – 6th week
+
*I think Illumina and 454 now include mate pair information in the sequencing data. This paper written in 2008 said it was coming soon.
**Develop my program in modules according to the prioritized features.
+
*Error correction: discard low quality reads
**Compare my program’s genome to previously assembled genomes from this raw data.  
+
**If quality values are not available, they have another method
**Quantify the accuracy of my genome by testing for the size of a predicted gap or feature in the genome to size of that actual segment of DNA in the blueberry genome.
+
 
**Edit the program based on any issues encountered with the full data set.  
+
''Thoughts''
*7th – 10th week (Ending: July 29, 2011)
+
*Euler-SR seems to be a spin off of Euler+.
**Finish wet-lab accuracy tests
+
**What tradeoffs does the Euler-SR approach make?
**Fine–tune the program based on any issues encountered with the full data set.  
+
***Since I am focused on viral genomes, the need for memory conservation is not as big of an issue.
**Attempt to assemble the “Meatball” phage genome.
+
*Does the raw data I have include mate pair information?
 +
**Such information can drastically improve the quality of a genome assembly.
 +
 
 +
'''To Do Tomorrow'''
 +
*Install Linux to try and run newbler
 +
*Read Miller review article
 +
 
 +
=== May 25 2011 ===
 +
 
 +
====Miller, J.R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315-327 (2010).====
 +
''Notes''
 +
*THESIS: Nice overview of types of graphs used for assemblies. Also summary of several algorithms (mainly de Bruijn methods) including SSAKE, Newbler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo
 +
*"Repeat graphs can be used to identify and catalog repeats." (see Identifying repeat domains in large genomes)
 +
*Nice K-mer graph Figures
 +
**Fig 1 = basic k-mer idea
 +
**Fig 2 = pairwise overlap of k-mer graph
 +
**Fig 3 = complexity in K-mers. Includes spurs, bubbles, frayed ropes, cycles.
 +
*Links assembly to NP-hard reduction problem
 +
 
 +
''Thoughts''
 +
*What if try to implement reduction approach that has never been applied to assemblies?
 +
*The Velvet program compresses graphs. One step is to identify bubbles in graph, but the program does not search for all bubbles. It does a bounded search. What would the benefit be to finding all bubbles?... if it good, may want to make a version that does that because we have small genomes.
 +
 
 +
'''To Do Tomorrow'''
 +
*Install Linux and run Newbler
 +
*Understand how specific implementations of algorithms work and how I could optimize them for a small genome assembly of a phage
 +
 
 +
=== May 26 2011 ===
 +
*Tried to install Newbler on virtual Fedora system... it did not work. I have an install Fedora 15 disk, I just need to figure out a computer for it
 +
*I installed Mira, and ran assembly on Timshell using the default settings
 +
'''To Do Tomorrow'''
 +
*Continue with assemblies
 +
*Review plans with Dr. Heyer and assembly methods
 +
 
 +
=== May 27 2011 ===
 +
*Worked with Euler and Velvet. Ran a few assemblies, but I think I need better parameters.
 +
*Thoughts about the project
 +
**This project is a teaching tool specifically for the phage genome HHMI project.  
 +
***Should I develop contact and discuss ideas with the people at HHMI?
 +
**It gudes user through choices and gives assemblies. It then could use another visual program to show differences in the assembly and how students could PCR verify which one is correct.
 +
**ISSUES:
 +
***Problems could be due simply to lack of fold coverage in areas. In which case it is more in the hands of the sequencing center.
 +
****Maybe we let the sequencing center assemble a whole genome. Then they give the students the raw files. That way if certain areas need more fold coverage, presumably, the assembly center would have taken care of it.
 +
***At this point I am unsure of if I can just figure out the optimal programs for other sequencer data. One program claims to be better and another. But then the other gets updated. Who is better now?
 +
*I found my self thinking about systems biology and visualizations - specifically online visuals kind of like what (http://www.cytoscape.org/) does. [http://cytoscapeweb.cytoscape.org/ This] does exactly what I was thinking of. Cool!
 +
**Maybe part of my project can be to develop the ability to visualize de Bruijn graphs using cytoscape?
 +
 
 +
=== May 31 2011 ===
 +
*I installed and ran an assembly with the Celera Assembler. This data is in contigs. The next step is scaffolding with something like Bambus, but I still do not know how to generate the .mates file.
 +
*I found some software that may be useful in k-mer counting and determining the minimum unique substrings http://genometools.org/index.html
 +
*I found Jellyfish - a good kmer counting tool. I have it installed.
 +
 
 +
'''To Do Tomorrow'''
 +
*Read up on Jellyfish
 +
*Develop program
 +
**Could you calculate overall GC content, by calculating the smaller fasta read GC content, adding all the GC content of reads up, and dividing by total genome size?
 +
*Identify essential phage genes for reference guided assembly.
 +
 
 +
'''Questions for DHMRI'''
 +
*Is the 454 phage data from a FLX Titanium machine? I think it is.
 +
*Does these data contain mate pair information. I am trying to run a few scaffolding programs, but need to generate a .mates file or something similar that contains mate pair information. (for example, the Bambus scaffolder).
 +
 
 +
=== June 1 2011 ===
 +
====Compeau, P.E.C. & Pevzner, P.A. Genome Reconstruction: A Puzzle with a Billion Pieces. 1-30 (2010).====
 +
*Excellent description of Eulerian Path, Hamiltonian Path, and linking it to de Bruijn graphs in genome assemblies.
 +
*Also, quick history on Euler, Hamiltonian, and DNA sequencing technology.
 +
 
 +
'''To Do'''
 +
*Identify essential phage genes for reference guided assembly.
 +
*What do clusters mean?
 +
*Figure out how to identify linker sequences in 454 data.
 +
*Continue program development
 +
 
 +
=== June 2 2011 ===
 +
*I have been working on my program. It is nearly done locally.
 +
 
 +
'''To Do'''
 +
*Identify essential phage genes for reference guided assembly.
 +
*What do clusters mean?
 +
*Figure out how to identify linker sequences in 454 data.
 +
*Continue program development
 +
 
 +
===June 6 2011===
 +
*Website for genome assembly tools = http://gcat.davidson.edu/genomeAssembly/index.html
 +
 
 +
'''To Do'''
 +
*Identify essential phage genes for reference guided assembly.
 +
*What do clusters mean?
 +
*Figure out how to identify linker sequences in 454 data.
 +
*Continue program development
 +
 
 +
===June 7 2011===
 +
*Phone call with Dr. Graham Hatfull, Dr. Welkin Pope, and Dr. Dan Russell
 +
*[http://phagesdb.org/media/workflow/protocols/pdfs/annotation_guide11.12.2010.pdf Annotation Steps]
 +
*#Find your phage and locally blast it to Phage DB
 +
*#Phamerator ([http://phagesdb.org/static/TW2011Software.html PhageDB software]) with auto annotated genome
 +
*#*Used to compare phage gene products and genomes to one another and display the results of these comparison
 +
*#*comparing all annotated mycobacteriophage gene products to one antoher, and then grouping genes with similar products into Phamilies, or "Phams."
 +
*#*[http://www.jmu.edu/biology/faculty_cresawns.shtml Dr. Steve Cresawn]
 +
*#*Assorts protein-coding genes into phamilies of related sequences using pair wise comparisons to generate a database of gene relationships.  This database is used to create genome maps of multiple phages that incorporate nucleotide and amino acid sequence relationships, as well as denoting the conserved domains within genes
 +
*#**Pham = a group of genes with >32.5% ClustalW alignment and a BLASTp comparison of 10^-50 threshold
 +
*#***Shows amino acid similarity via colors of boxes and nucleotide similarity via (ROYGBIV) boxes across genes
 +
*#DNA Master, Aragorn, tRNA SE, Glimmer, GeneMark, Frame-Shift Finder
 +
*#Do functional annotation/genome maps
 +
*"Bacteriophage genomes average approximately 100 genes and yet on average we know no more than 20 associated functions" (pg 23 of http://phagesdb.org/media/workflow/protocols/pdfs/annotation_guide11.12.2010.pdf)
 +
*From discussion with Dr. Campbell, clusters are based on genome size and orientation of genes.
 +
 
 +
'''To Do'''
 +
*Prepare for this weekend
 +
*Finish Tools
 +
 
 +
===June 8 2011===
 +
 
 +
====Hatfull, G.F. et al. Exploring the Mycobacteriophage Metaproteome: Phage Genomics as an Educational Platform. PLoS Genet 2, e92 (2006).====
 +
*Mycobacteriophage
 +
**Give insight into viral diversity and evolutionary mechanisms
 +
**Genes encoding viron structure and assembly functions = clustered in long operons in well defined gene order
 +
***tmp = gene for length of tail... often logest ORF in genome
 +
*Phamilies = groups of related protein encoding genes
 +
**Related = share amino acid similarity of Evalue > 0.001 or 25% aa identity across its length
 +
**Give insight into
 +
***Genes most prevalent in phages
 +
***Signature sequences characteristic of mycobacteriophages
 +
***How Phamily size corresponds to phage and bacterial homologs outside of the mycobacteriophage group.
 +
*Figure 5
 +
**Shows phams... groups of genes and how similar or dissimilar they are across different genomes
 +
**"'''The future development of automated circle drawing software should greatly facilitate this'''" (pg 0839)
 +
***This = extending this model to all phamilies that contain 2+ members
 +
 
 +
====Pope, W.H. et al. Expanding the Diversity of Mycobacteriophages: Insights into Genome Architecture and Evolution. PLoS ONE 6, e16329 (2011).====
 +
*Most bacteriophages target a single bacterial genus
 +
**Not much similarity between phages who target different hosts
 +
*Mosaic architectures = characteristic of phage genomes
 +
**Each individual genome can be composed of a series of individual modules that can be shared by genomes that otherwise may not be related
 +
***aka phams
 +
*Morons = segments of DNA present in one genome, but absent in a related genome
 +
*Clusters = assigned via dotplot... must be recognizable in dotplot and span more than half of the genome lengths
 +
 
 +
'''To Do'''
 +
*Finish Tools - add maximum estimation to fit gaussian curve
 +
 
 +
===June 13 2011===
 +
*See project priority list
 +
 
 +
===June 14 2011===
 +
'''To Do'''
 +
*Ask DHRMI 3 questions. 1. Verify 454 seq = FLX titanium 2. Did they use default linker sequence to generate mate pairs? 3. What is insert size... insert size? Mates are on average i +- d bp apart.
 +
 
 +
===June 15 2011===
 +
*The closer mean (alpha*beta) is to xmax the better the cutoffpoints I derive from gamma distribution and the more accurate my genome size estimation is
 +
 
 +
'''To Do'''
 +
*Figure out more consistant way to estimate accurate alpha and beta parameters
 +
*Choose assembly algorithm
 +
 
 +
===June 16 2011===
 +
*I looked at another method to estimate alpha and beta. It works better, that is predicts the genome size more accurately; however, at certain points it does not work. Two input points and probabilities must be so far apart. This is buggy.
 +
**The alternative is to generate fake raw data and to fit the gamma distribution to it using gamma.fit(data). This method works, but does not seem to be as accurate as the previous version. I feel like I should be able to get a more accurate estimation of genome size.
 +
**At this point, I am leaving this topic for another time. All of my comments are in the code, and it should be easy to pick up. Worse comes to worse, I can just use one of the methods I have completed even though it is not as accurate as I would like.
 +
*MIRA 3 = easiest out of box assembler. If I do not get the information I need on the reads this assembler gives me the most unified genome.
 +
**It was designed specifically for 454 data
 +
*CABOG = relatively simple. Pretty straight forward - only need to work out a few parameters and the program handles the rest.
 +
*Euler-SR = not simple to use with mate pair information. All sorts of processing has to be done. However, it does give alot of control over the assembly process.  
 +
 
 +
'''To Do'''
 +
*Decide on an algorithm to develop
 +
 
 +
===June 20 2011===
 +
*I have done multiple assemblies (timshel, arcadian). Every time mira 3 returns significantly fewer contigs/scaffolds than wgs-assembler. Additionally the mira 3 contigs/assemblies are more similar to (extremely similar) to the final published assembly.
 +
**Mira assemblies tend to be slightly larger than the published genome, but have an extremely similar order .
 +
**wgs assembler is often closer to the published genome size, however the order is much different.
 +
 
 +
===June 24 2011===
 +
*I have the website running locally. User can use 1 or 2 genomes, change the kmer size and clean the reads.
 +
*For more reads, I am running into a timeout error.
 +
**Way to overcome this: http://stackoverflow.com/questions/3946349/preventing-a-504-gateway-timeout-with-huge-php-script
 +
 
 +
'''To do'''
 +
*Begin tutorial
 +
*Add precent of raw reads to use parameters
 +
 
 +
===June 29 2011===
 +
*I have finished a rough draft of a tutorial.
 +
 
 +
'''To do'''
 +
*Figure out file naming system - name by ip, at the end of assembly, script goes in and places file with parameters
 +
 
 +
*Write script to keep a list of all the genomes assembled thus far
 +
 
 +
===July 6 2011===
 +
*It does not appear that SMALT info is even being parsed into mira.
 +
 
 +
'''To do'''
 +
*Finish tutorial Draft 2
 +
*Fix cleaning reads... sff_extract auto cleans some reads. Turn this off... maybe do this via no .xml file?
 +
*Use SMALT to screen for MID?
 +
*Fix 2 genomes
 +
*Fix precent of reads
 +
 
 +
===July 12 2011===
 +
'''To do'''
 +
*Finish tutorial Draft 2
 +
*Add cleaning script
 +
*Change mira citation to match the one used in the output. Also, use pop paper as reference for how similar reference assemblies should be.
 +
*Figure out what is up with the megahubs issue in the data
 +
 
 +
==Data Characterization==
 +
*I am pretty sure these reads are from the 454 FLX sequencer - actually maybe titanium (http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Roche_454_Platforms#454_FLX)
 +
*Phage genome size from ~40 kb (40,000) to ~164 kb (164,000) (http://phagesdb.org/sort/)
 +
*Our reads are ~500nt on average. We do have small ones ~50nt.
 +
*The database includes three file types: .fna .qual .sff
 +
*Write program to calculate fold coverage? Avg sequence size?
 +
*Don't forget to clean the data before using it.
 +
 
 +
==Program Development==
 +
===[[General Statistical Calculation Tool]]===
 +
===[[Genome Assembly (gass)]]===
 +
===[[Web based genome comparison tool]]===
 +
 
 +
==[[Tutorial]]==
 +
 
 +
== [[Project Priority List]] ==
 +
 
 +
==[[Explaining My Project]]==
  
 
== Random Notes ==
 
== Random Notes ==
 +
====[[JavaScript]]====
 +
====[[Securely Connecting to Server via Terminal]]====
 +
====[[Working with Images and USB]]====
 +
====[[Installing dependancy libraries for Newbler on Fedora 15]]====
 +
====Intall additional modules to Enthought Python====
 +
*/Library/Frameworks/EPD64.framework/Versions/7.0/lib/python2.7/
 +
*Make DESTDIR=/dir/to/install/to
 +
 
Citation method for papers read = the number journal is (currently 6 topic, then the sub section)... so currently if I say 1.2, then I look at Journal (6).1.2 = Genome assembly reborn
 
Citation method for papers read = the number journal is (currently 6 topic, then the sub section)... so currently if I say 1.2, then I look at Journal (6).1.2 = Genome assembly reborn
 +
 +
Figure out weighting scale to identify missassemblies via machine based learning (A machine-learning approach to combined evidence validation of genome assemblies)
 +
 +
 +
http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin
 +
<pre>
 +
I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.
 +
 +
We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.
 +
 +
-ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The Contig N50 was a bit poor, whatever we tried. It took a fair while, we never got it to parallelize
 +
 +
-SOAPdenovo: very fast because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The Contig N50 was not great, but better than ABYSS (around 600bp)
 +
 +
-CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)
 +
 +
In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.
 +
 +
Finally we use the SOAPdenovo GapCloser to close GAPS in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!
 +
 +
All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.
 +
 +
Now we are going to throw more data at it, hoping for a much better assembly</pre>

Latest revision as of 19:13, 13 September 2011

Contents

Links

http://phagesdb.org/ - phage database. Assembled versions of the raw files we have are located here

SEQanswers - online community for sequencing and assemblers.

http://www.cbcb.umd.edu/ - UMD bioinformatics center. Good open source programs. Also includes AMOS

http://bioinf.comav.upv.es/index.html - a few useful python scripts

http://www.mediawiki.org/wiki/Help:Formatting - wiki formatting

http://adenine.biz.fh-weihenstephan.de/homePage/ - alot of useful software

http://adenine.biz.fh-weihenstephan.de/homePage/node4.html - good tools... k-mer counting and counting unique substrings

http://www.voidspace.org.uk/python/cgi.shtml#index - CGI programming with python

http://docs.python.org/howto/webservers.html - better CGI programming with python

http://gage.cbcb.umd.edu/index.html#what - GAGE = project to compare assemblers

http://www.yak.net/fqa/84.html - Send non STMP email on unix with python

http://codecanyon.net/item/explr-jquery-simple-tree-plugin-/136839

http://validator.w3.org/ - validate html

Cool Links

http://pathogenomics.bham.ac.uk/hts/ - world map of high throughput sequencers.

http://www.wired.com/wiredscience/2011/07/genome-structure/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Index+3+%28Top+Stories+2%29%29 - are individual traits linked to genome organization?

Link Sandbox

http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin - user comparison of several assemblers (SOAPdenovo, ABySS, All Paths 2)

http://seqanswers.com/forums/showthread.php?t=9465&highlight=virus - previous virus assembly experiances

http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly - .sff assembly

http://seqanswers.com/forums/showthread.php?t=10988&highlight=k-mer - how to estimate genome size

Vocab

  • Comparison assembly (aka reference based) - basically align reads to reference genome
    • Regions that differ slightly (like large insertions) still need to be assembled de novo (Genome assembly reborn: recent computational challenges)
    • Protein reference gene can be used as comparison (ABBA)
      • Useful if no close reference genome available.
    • PROGRAMS: AMOScmp, Maq, ABBA (protein)
      • Maq - characterizes SNPs between target genome and reference.
  • Coverage - the ratio between the cumulative size of a set of reads and the size of the genome
  • de novo assembly - NP-hard problem. Assembly from scratch
    • Usually highly fragmented used with short reads (Genome assembly reborn: recent computational challenges)
      • hybrid de novo assembly - using data from different sources... harder, but can gain insight by playing the different data sets to their strengths.
        • Eulerian method ideal for this.
  • DeNovoAssemblyMethod: Bruijn graphs (aka Eulerian pathway) (1.2 pg 359) (often abbreviated DBG)
    • Make graph: nodes = each k-1 section, edges = exact overlap of k-2 (see 1.2 fig 3B).
    • Assembler = find pathway though this graph that uses every edge (aka Eulerian path)
      • Typically more than one Eulerian path can be found... representing the many different ways a genome ca be rearranged around repeats.
    • Loose long range connectivity information by choping up reads in to k-mer sets... this info lost = useful in reducing ambiguity of graph structure
      • "Eulerian superpath problem" - solves this issue via graph constructed from sub-paths corresponding to individual reads provided as input to assembler
    • Repeats
      • Better at resolving repeats (Assembly complexity of prokaryotic genomes using short reads)
      • Repeats make graph more complex because introduce cycles (3.1 pg 320)
    • Complications Bruijn Graphs must overcome with DNA assemblies (3.1 pg 320)
      1. DNA = double stranded. Forward sequence of read may overlap with the forward or reverse complement of other reads
        • Methods To Overcome: graph contains nodes and edges for both strands, forward/reverse make up half nodes individually and paths must enter/exit from same half (Velvet), alternate strands in single node with 2 sides and paths enter/exit opposite sides (ABySS).
      2. Repeats
        • Repeats longer than K lead to tangled K-mer graphs
        • Perfect repeats >= K lead to frayed ends
      3. Palindromes
        • Make paths fold back on themselves
        • Method To Overcome: require k to be odd (Velvet). an odd size k-mer cannot match its reverse complement
      4. Sequence Error
        • Very sensitive to sequencing errors
          • Make graph more complex by adding edges
        • Steps to overcome
          1. Preprocess reads to remove errors (kick out low qual...)
          2. Weight graph edges by number of reads that support them & erode lightly supported paths
          3. Convert paths to sequences and use sequence alignment algorithms to collapse nearly identical paths
    • Ideal for short reads... high depth of coverage of reads in roughly equal length
    • See daniel zerbinos phd thesis
    • PROGRAM: Velvet, Allpaths, Euler-SR, Euler-USR
  • DeNovoAssemblyMethod: Greedy (1.2 pg 357)
    • Step 1: Reads compared to each other to construct a list of pair-wise overlaps
    • Step 2: Join contigs that overlap the best and and when no ore reads or contigs can be joined
      • "Overlaps" = prefix of one read shares enough similarity with suffix of another
    • "Greedy" - the algorithm optimizes locally... ie the quality of overlap between reads
      • Starts by processing the best overlap first, so chooses that path and may misassemble many repeats
    • PROGRAM: TIGR Assembler, CAP3
  • DeNovoAssemblyMethod: Greedy for Short Reads (1.2 pg 357)
    • Unassembled read chosen as start contig... Contig built on the 3' end until no more extensions possible. Then build onto 5' end using rev comp of original read.
    • Extension process terminated when conflicting information found... 2+ reads could extend the contig.
    • PROGRAM: SSAKE, VCAKE, SHARCGS
  • DeNovoAssemblyMethod: Overlap layout consensus (OLC) (1.2 pg 357)
    • Step 1: Reads compared to each other to construct a list of pair-wise overlaps
    • Step 2: Create overlap map... each read = node, edge = if overlap identified between reads
      • Core part of OLC
      • Ultimate goal = find a single path that traverses each node of the graph exactly once (hamiltonian path).
    • Step 3: Consensus computation - determine the DNA sequence implied by the arrangement of reads along the chosen path
    • PROGRAMS: Celera Assembler, Arachne, newbler, Edena
    • Most popular type... perhaps from its flexibility
  • Directed Graph - graph in which all edges are provided with an orientation, so that an edge connecting v to w is not the same as an edge connecting w to v. (7.1)
  • Eulerian Cycle Problem (ECP) - is it possible to traverse every edge in graph G and end up at the starting point? (7.1)
  • Euler's Theorems
    • Theorem 1 - a Eulerian cycle exists if the degree of each vertex in G is even. The degree = the total number of edges connecting vertex v to other vertices. (7.1)
    • Theorem 2 - a Eulerian cycle exists if the indegree and outdegree of every vertex of G are equal (they don't all have to be equal to each other). This is for directional graph, were w to v edge not the same as v to w edge. (7.1)
  • FileType: .qual (from 454) - base quality score values for each nt in the corresponding .fna file. Each entry has the same header as the .fna.
  • indel - shorthand term for insertion or deletion
  • k-mer
    • all of the k nucleotide words present in a genome
    • the larger the kmer the longer the overlap between two reads has to be. that's also a reason why the kmer can never be larger then your minimum read length. SO an assembly at a higher kmer size is always more "accurate"(not talking about better N50) than the one at a lower kmer size. (http://seqanswers.com/forums/showthread.php?t=9396&highlight=Brujin)
    • Usually linked to Eulerian path assemblies (Bruijn graphs)... in implementations, the higher k-mer, the less RAM because the graph will be smaller
    • k-mers matter because they are the edges of the Bruijn graph (7.1)
  • k-mer multiplicity - in de Bruijn graph... how many times k-mer occurs (t). Then in graph, attach k-mer to suffix in graph via t edges. (7.1 pg 19)
  • Length Unit: kb (= kbp) = kilo base pairs = 1,000 bp
  • Length Unit: Mb = mega base pairs = 1,000,000 bp
  • Length Unit: Gb = giga base pairs = 1,000,000,000 bp.
  • Mate-paired reads
    • Back in the day (with clone libraries) paired reads = sequencing opposite directions of an amplified clone,
    • Now... amplify and sequence short paired-end tags in parallel
      • Short genomic sequences circularized using known linker sequence, then cleaved outside of the ligation site... leaving an overhang tag (2.1 pg 4)
    • Have a 1kb fragment and sequence 100bp on each size... then know what ever contig these reads are in must be ~1kb apart
  • Metagenomics - sequencing entire microbial communities instead of isolate genomes (Genome assembly reborn: recent computational challenges)
  • N50
    • the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. (http://seqanswers.com/forums/showthread.php?t=2332)
    • Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value (http://seqanswers.com/forums/showthread.php?t=2332)
    • N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. (http://seqanswers.com/forums/showthread.php?t=2332)
    • Sort all of the contigs by length. What is the length of the contig in the middle?
    • I sort the list of lengths from low to high. Then, starting with the longest sequences first, I subtract one sequence length at a time from the total length (total number of bases), until I reach one half of the total length. The sequence length I just subtracted (or the longest remaining length .. one could quibble) is the N50.
  • Paired end linker sequences
  • Paired end adaptor sequences
  • DNA Palindrome - DNA sequence that is its own reverse complement
  • Hamiltonian Cycle Problem (HPC) - is there a pathway in graph G in which one travels to every vertex once and returns to the starting point?
  • Repeats
    • Tandem repeats
    • Inverted repeats
    • Imperfect repeats
    • Perfect repeats
  • Sanger-based sequencing - first generation sequencing (1000-2000bp reads)
  • Scaffolds (1.2 pg 360)
    • Groups of contigs whose relative placement is known though the DNA sequence of genomic regions connecting these contigs is unknown
    • Usually relies on mate-pair information (other method = optical mapping like in SOMA)
      • Two contigs = adjacent if one end of mate-pair in contig1 and other end of mate-pair in contig2
    • Most assemblers have scaffolding module
    • The longer contigs are, the easier it is to scaffold
    • PROGRAMS: Euler, Arachne, Celera Assembler, Velvet, Bambus (stand alone scaffold assembler), SOMA (scaffolding with restriction maps... need more tests than the data we have).
      • See 1.2 paper for overview of algorithms of these scaffolders
  • Spur - single reads that disagree with the bulk of reads within a region of an assembly graph (OLC method) (1.2 pg 359)
  • Unitig - uniquely assemblable contig (defined by Celera Assembler)... Contig constructed until a fork in a graph (OLC method) (1.2 pg 358)
    • Preliminary, high confidence, conservative contigs. (3.1 pg 319)
    • PROGRAMS: Minimus, newbler, internal datastructures of Celera Assembler and Arachne

Programs

Scripts

http://brianknaus.com/software/srtoolbox/fastq2fasta.pl - convert fastq to fasta.

sff_extract - python script that extracts .sff data

clean_reads - python script that cleans reads

[1] - safe cgi shell interface

Big Questions

  • De novo or Reference/Comparative based assembly?
  • What kind of coverage is typical for phages?

Journal

May 23 2011

Looking at the raw assembly files, it looks like our reads are ~500nt on average. We do have small ones ~50nt.

The database includes three file types: .fna .qual .sff

Kingsford, C., Schatz, M.C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010).

Notes

  • Use De Brujin graphs to estimate "completeness" of genomes assembled via de novo assembly
  • Lists compression techniques and the order to employ them
  • Can use this method to compute N50
    • N50 = the length of the largest contig (m) such that at least 50% of genome covered by contigs of size >= m.
    • A higher N50 score usually correlates to a more "correct" genome
  • Regardless of correctness of genome, for nearly all read sizes (1000nt > size > 25nt), 85%+ of genes accurately identified (85% is for 25nt reads).

Thoughts

  • Look for assembler that uses De Brujin graph?
    • PROGRAM: EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach. http://euler-assembler.ucsd.edu/portal/
  • This paper showed how to get an upper limit of correctness of genome. Compare several existing de novo assemblers using the methods here as comparison.
  • Is it possible to get the code used in this project?

Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354-366 (2009).

Notes

  • THESIS: Nice presentation of current technology and description of whole process (including de novo vs reference, de novo subtypes, scaffolding, assembly validation, metagenomics)
  • THESIS: Table 1 = great summary of 2nd gen sequencing and the trade offs you make with it.
  • Assembly Validation
    • Stat analysis for areas with unusually deep coverage
    • Poission stat to estimate likely hood region represents a repeat
      • DNP??
      • PROGRAM: DNPtrapper, Euler-AIR
    • Mate pairs
      • In correct assembly, tiling of reads should be consistent with mate pair constraints (relative orientation and paired ends)
      • PROGRAM: Tampa, Zimin, and visual tools like BACCardI, Hawkeye
        • Hawkeye also has built in AMOSvalidate
    • Whole genome mapping (with restriction enzymes)
      • PROGRAM: SOMA

Thoughts

  • I am torn between working with de novo sequencers or comparative assemblers... Perhaps the best way is to start with de novo and then move to comparative, as de novo will likely be needed for comparative assemblers.
  • The target phage genomes are small, so we do not really need to worry that much about efficiency.
  • There are so many programs out there that seem like they would work for my purposes. Perhaps my project should be to take an existing program and determine parameters ideal for phages.
  • It may be difficult to make a universal program. So much of the ideal assembly program depends on the characteristics of your reads... Are they short/long reads? A single data source? Do you have mate-pair information? Perhaps I am creating a tool to just peer into the assembly program.
    • I can design it to be optimized for the 454 data I have now.
  • Areas of future work identified in this paper.
    • Designing programs to use multiple datasets to identify the best assembly. For example, Illumina and 454 data, but also non-sequence data like genome mapping (which has been shown to really help out assemblies)

To Do Tomorrow:

  • Think more about thoughts segments above from papers
  • Do research on virus/prokaryote genome assemblies.
  • Read the new promising papers identified today.
  • Begin to think about ways to characterize our data to answer what assembly method would be best in this scenario.

May 24 2011

Chaisson, M.J. & Pevzner, P.A. Short read fragment assembly of bacterial genomes. Genome Research 18, 324-330 (2008).

Notes

  • In the beginning, high throughput sequencing with short reads required reference genome and was limited to resequencing projects, gene expression, and genomic profiling projects.
  • "Long" contigs = contigs > 500bp
  • These programs which rely heavily on mate pair information do not perform well in next gen sequence data
    • PROGRAM: Phrap, PCAP, Arachne, Euler+
  • I think Illumina and 454 now include mate pair information in the sequencing data. This paper written in 2008 said it was coming soon.
  • Error correction: discard low quality reads
    • If quality values are not available, they have another method

Thoughts

  • Euler-SR seems to be a spin off of Euler+.
    • What tradeoffs does the Euler-SR approach make?
      • Since I am focused on viral genomes, the need for memory conservation is not as big of an issue.
  • Does the raw data I have include mate pair information?
    • Such information can drastically improve the quality of a genome assembly.

To Do Tomorrow

  • Install Linux to try and run newbler
  • Read Miller review article

May 25 2011

Miller, J.R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315-327 (2010).

Notes

  • THESIS: Nice overview of types of graphs used for assemblies. Also summary of several algorithms (mainly de Bruijn methods) including SSAKE, Newbler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo
  • "Repeat graphs can be used to identify and catalog repeats." (see Identifying repeat domains in large genomes)
  • Nice K-mer graph Figures
    • Fig 1 = basic k-mer idea
    • Fig 2 = pairwise overlap of k-mer graph
    • Fig 3 = complexity in K-mers. Includes spurs, bubbles, frayed ropes, cycles.
  • Links assembly to NP-hard reduction problem

Thoughts

  • What if try to implement reduction approach that has never been applied to assemblies?
  • The Velvet program compresses graphs. One step is to identify bubbles in graph, but the program does not search for all bubbles. It does a bounded search. What would the benefit be to finding all bubbles?... if it good, may want to make a version that does that because we have small genomes.

To Do Tomorrow

  • Install Linux and run Newbler
  • Understand how specific implementations of algorithms work and how I could optimize them for a small genome assembly of a phage

May 26 2011

  • Tried to install Newbler on virtual Fedora system... it did not work. I have an install Fedora 15 disk, I just need to figure out a computer for it
  • I installed Mira, and ran assembly on Timshell using the default settings

To Do Tomorrow

  • Continue with assemblies
  • Review plans with Dr. Heyer and assembly methods

May 27 2011

  • Worked with Euler and Velvet. Ran a few assemblies, but I think I need better parameters.
  • Thoughts about the project
    • This project is a teaching tool specifically for the phage genome HHMI project.
      • Should I develop contact and discuss ideas with the people at HHMI?
    • It gudes user through choices and gives assemblies. It then could use another visual program to show differences in the assembly and how students could PCR verify which one is correct.
    • ISSUES:
      • Problems could be due simply to lack of fold coverage in areas. In which case it is more in the hands of the sequencing center.
        • Maybe we let the sequencing center assemble a whole genome. Then they give the students the raw files. That way if certain areas need more fold coverage, presumably, the assembly center would have taken care of it.
      • At this point I am unsure of if I can just figure out the optimal programs for other sequencer data. One program claims to be better and another. But then the other gets updated. Who is better now?
  • I found my self thinking about systems biology and visualizations - specifically online visuals kind of like what (http://www.cytoscape.org/) does. This does exactly what I was thinking of. Cool!
    • Maybe part of my project can be to develop the ability to visualize de Bruijn graphs using cytoscape?

May 31 2011

  • I installed and ran an assembly with the Celera Assembler. This data is in contigs. The next step is scaffolding with something like Bambus, but I still do not know how to generate the .mates file.
  • I found some software that may be useful in k-mer counting and determining the minimum unique substrings http://genometools.org/index.html
  • I found Jellyfish - a good kmer counting tool. I have it installed.

To Do Tomorrow

  • Read up on Jellyfish
  • Develop program
    • Could you calculate overall GC content, by calculating the smaller fasta read GC content, adding all the GC content of reads up, and dividing by total genome size?
  • Identify essential phage genes for reference guided assembly.

Questions for DHMRI

  • Is the 454 phage data from a FLX Titanium machine? I think it is.
  • Does these data contain mate pair information. I am trying to run a few scaffolding programs, but need to generate a .mates file or something similar that contains mate pair information. (for example, the Bambus scaffolder).

June 1 2011

Compeau, P.E.C. & Pevzner, P.A. Genome Reconstruction: A Puzzle with a Billion Pieces. 1-30 (2010).

  • Excellent description of Eulerian Path, Hamiltonian Path, and linking it to de Bruijn graphs in genome assemblies.
  • Also, quick history on Euler, Hamiltonian, and DNA sequencing technology.

To Do

  • Identify essential phage genes for reference guided assembly.
  • What do clusters mean?
  • Figure out how to identify linker sequences in 454 data.
  • Continue program development

June 2 2011

  • I have been working on my program. It is nearly done locally.

To Do

  • Identify essential phage genes for reference guided assembly.
  • What do clusters mean?
  • Figure out how to identify linker sequences in 454 data.
  • Continue program development

June 6 2011

To Do

  • Identify essential phage genes for reference guided assembly.
  • What do clusters mean?
  • Figure out how to identify linker sequences in 454 data.
  • Continue program development

June 7 2011

  • Phone call with Dr. Graham Hatfull, Dr. Welkin Pope, and Dr. Dan Russell
  • Annotation Steps
    1. Find your phage and locally blast it to Phage DB
    2. Phamerator (PhageDB software) with auto annotated genome
      • Used to compare phage gene products and genomes to one another and display the results of these comparison
      • comparing all annotated mycobacteriophage gene products to one antoher, and then grouping genes with similar products into Phamilies, or "Phams."
      • Dr. Steve Cresawn
      • Assorts protein-coding genes into phamilies of related sequences using pair wise comparisons to generate a database of gene relationships. This database is used to create genome maps of multiple phages that incorporate nucleotide and amino acid sequence relationships, as well as denoting the conserved domains within genes
        • Pham = a group of genes with >32.5% ClustalW alignment and a BLASTp comparison of 10^-50 threshold
          • Shows amino acid similarity via colors of boxes and nucleotide similarity via (ROYGBIV) boxes across genes
    3. DNA Master, Aragorn, tRNA SE, Glimmer, GeneMark, Frame-Shift Finder
    4. Do functional annotation/genome maps
  • "Bacteriophage genomes average approximately 100 genes and yet on average we know no more than 20 associated functions" (pg 23 of http://phagesdb.org/media/workflow/protocols/pdfs/annotation_guide11.12.2010.pdf)
  • From discussion with Dr. Campbell, clusters are based on genome size and orientation of genes.

To Do

  • Prepare for this weekend
  • Finish Tools

June 8 2011

Hatfull, G.F. et al. Exploring the Mycobacteriophage Metaproteome: Phage Genomics as an Educational Platform. PLoS Genet 2, e92 (2006).

  • Mycobacteriophage
    • Give insight into viral diversity and evolutionary mechanisms
    • Genes encoding viron structure and assembly functions = clustered in long operons in well defined gene order
      • tmp = gene for length of tail... often logest ORF in genome
  • Phamilies = groups of related protein encoding genes
    • Related = share amino acid similarity of Evalue > 0.001 or 25% aa identity across its length
    • Give insight into
      • Genes most prevalent in phages
      • Signature sequences characteristic of mycobacteriophages
      • How Phamily size corresponds to phage and bacterial homologs outside of the mycobacteriophage group.
  • Figure 5
    • Shows phams... groups of genes and how similar or dissimilar they are across different genomes
    • "The future development of automated circle drawing software should greatly facilitate this" (pg 0839)
      • This = extending this model to all phamilies that contain 2+ members

Pope, W.H. et al. Expanding the Diversity of Mycobacteriophages: Insights into Genome Architecture and Evolution. PLoS ONE 6, e16329 (2011).

  • Most bacteriophages target a single bacterial genus
    • Not much similarity between phages who target different hosts
  • Mosaic architectures = characteristic of phage genomes
    • Each individual genome can be composed of a series of individual modules that can be shared by genomes that otherwise may not be related
      • aka phams
  • Morons = segments of DNA present in one genome, but absent in a related genome
  • Clusters = assigned via dotplot... must be recognizable in dotplot and span more than half of the genome lengths

To Do

  • Finish Tools - add maximum estimation to fit gaussian curve

June 13 2011

  • See project priority list

June 14 2011

To Do

  • Ask DHRMI 3 questions. 1. Verify 454 seq = FLX titanium 2. Did they use default linker sequence to generate mate pairs? 3. What is insert size... insert size? Mates are on average i +- d bp apart.

June 15 2011

  • The closer mean (alpha*beta) is to xmax the better the cutoffpoints I derive from gamma distribution and the more accurate my genome size estimation is

To Do

  • Figure out more consistant way to estimate accurate alpha and beta parameters
  • Choose assembly algorithm

June 16 2011

  • I looked at another method to estimate alpha and beta. It works better, that is predicts the genome size more accurately; however, at certain points it does not work. Two input points and probabilities must be so far apart. This is buggy.
    • The alternative is to generate fake raw data and to fit the gamma distribution to it using gamma.fit(data). This method works, but does not seem to be as accurate as the previous version. I feel like I should be able to get a more accurate estimation of genome size.
    • At this point, I am leaving this topic for another time. All of my comments are in the code, and it should be easy to pick up. Worse comes to worse, I can just use one of the methods I have completed even though it is not as accurate as I would like.
  • MIRA 3 = easiest out of box assembler. If I do not get the information I need on the reads this assembler gives me the most unified genome.
    • It was designed specifically for 454 data
  • CABOG = relatively simple. Pretty straight forward - only need to work out a few parameters and the program handles the rest.
  • Euler-SR = not simple to use with mate pair information. All sorts of processing has to be done. However, it does give alot of control over the assembly process.

To Do

  • Decide on an algorithm to develop

June 20 2011

  • I have done multiple assemblies (timshel, arcadian). Every time mira 3 returns significantly fewer contigs/scaffolds than wgs-assembler. Additionally the mira 3 contigs/assemblies are more similar to (extremely similar) to the final published assembly.
    • Mira assemblies tend to be slightly larger than the published genome, but have an extremely similar order .
    • wgs assembler is often closer to the published genome size, however the order is much different.

June 24 2011

To do

  • Begin tutorial
  • Add precent of raw reads to use parameters

June 29 2011

  • I have finished a rough draft of a tutorial.

To do

  • Figure out file naming system - name by ip, at the end of assembly, script goes in and places file with parameters
  • Write script to keep a list of all the genomes assembled thus far

July 6 2011

  • It does not appear that SMALT info is even being parsed into mira.

To do

  • Finish tutorial Draft 2
  • Fix cleaning reads... sff_extract auto cleans some reads. Turn this off... maybe do this via no .xml file?
  • Use SMALT to screen for MID?
  • Fix 2 genomes
  • Fix precent of reads

July 12 2011

To do

  • Finish tutorial Draft 2
  • Add cleaning script
  • Change mira citation to match the one used in the output. Also, use pop paper as reference for how similar reference assemblies should be.
  • Figure out what is up with the megahubs issue in the data

Data Characterization

Program Development

General Statistical Calculation Tool

Genome Assembly (gass)

Web based genome comparison tool

Tutorial

Project Priority List

Explaining My Project

Random Notes

JavaScript

Securely Connecting to Server via Terminal

Working with Images and USB

Installing dependancy libraries for Newbler on Fedora 15

Intall additional modules to Enthought Python

  • /Library/Frameworks/EPD64.framework/Versions/7.0/lib/python2.7/
  • Make DESTDIR=/dir/to/install/to

Citation method for papers read = the number journal is (currently 6 topic, then the sub section)... so currently if I say 1.2, then I look at Journal (6).1.2 = Genome assembly reborn

Figure out weighting scale to identify missassemblies via machine based learning (A machine-learning approach to combined evidence validation of genome assemblies)


http://seqanswers.com/forums/showthread.php?t=3913&highlight=Brujin

I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.

We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.

-ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The Contig N50 was a bit poor, whatever we tried. It took a fair while, we never got it to parallelize

-SOAPdenovo: very fast because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The Contig N50 was not great, but better than ABYSS (around 600bp)

-CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)

In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb. 

Finally we use the SOAPdenovo GapCloser to close GAPS in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!

All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.

Now we are going to throw more data at it, hoping for a much better assembly