Tutorial

From GcatWiki
===More stuff on workflow===
*Reads should already be separated by MID tag before we get them (http://seqanswers.com/forums/showthread.php?t=11999). To separate them yourself, you would have to use Roche's sffTools, which separates the reads and automatically deletes the MID files (http://seqanswers.com/forums/archive/index.php/t-5902.html).
====sff_extract====
*sff_extract already does some clipping. When you look at the raw reads (.fasta), the lowercase bases are the ones that will be clipped.
**This clipping information is stored in the trace XML file.
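The lowercase convention described for sff_extract can be illustrated with a short sketch. It assumes the kept region of a read is everything from the first uppercase base to the last one; the function names and the example read are made up for illustration and are not part of sff_extract.

```python
def clip_range(seq):
    """Return (start, end) of the region kept after clipping.

    Assumption: lowercase bases at either end are the clipped ones, so the
    kept region runs from the first to the last uppercase base.
    """
    upper = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper:
        # The whole read is lowercase, so nothing survives clipping.
        return (0, 0)
    return (upper[0], upper[-1] + 1)


def clipped(seq):
    """Return only the bases that survive clipping."""
    start, end = clip_range(seq)
    return seq[start:end]


# Hypothetical read: 4 clipped bases, 12 kept bases, 4 clipped bases.
example_read = "tcagACGTACGTTGCAnnnn"
print(clip_range(example_read))
print(clipped(example_read))
```

The start and end values are 0-based and end-exclusive, which may differ from the numbering used in the trace XML file.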
====SMALT====
*In my workflow, SMALT is a screening step to ensure that no linker sequences or MID tags end up in the final output. Normally, this step would only be used to clean out additional read tags.
*All of the MID tags listed at http://code.google.com/p/biopieces/wiki/remove_mids are scanned for, but they really should not be there in the first place, so rescanning for MIDs does not really make sense. Screening for the linker does, but only because I know I don't need the linker information.
*http://www.freelists.org/post/mira_talk/smalt-doc-additions
*http://www.freelists.org/post/mira_talk/454-cleaning,16
*http://seqanswers.com/forums/showthread.php?t=4819
*https://wikis.utexas.edu/display/GSAF/454+-+all+flavors
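To make the idea of k-mer screening concrete, here is a minimal sketch. It is not SMALT's algorithm: it only looks for exact matches of 13 bp k-mers (the k-mer length quoted in these notes) between a read and a set of tag sequences. The linker sequence and the function names are hypothetical.

```python
def kmers(seq, k=13):
    """Return the set of all k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_screen(tags, k=13):
    """Collect the k-mers of every tag sequence into one lookup set."""
    screen = set()
    for tag in tags:
        screen |= kmers(tag, k)
    return screen


def hit_positions(read, screen, k=13):
    """Return the start positions in read whose k-mer occurs in the screen."""
    read = read.upper()
    return [i for i in range(len(read) - k + 1) if read[i:i + k] in screen]


# Hypothetical linker sequence, used only for illustration.
linker = "GTTGGAACCGAAAGGGTTTGAATTC"
screen = build_screen([linker])

# A read that carries the first 15 bases of the linker, starting at position 10.
read = "ACGTACGTAC" + linker[:15] + "TTTTTTTTTT"
print(hit_positions(read, screen))
```

Positions that are reported as hits are the candidates for clipping. A real screening tool also has to tolerate mismatches, which exact set lookups do not.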

Latest revision as of 20:28, 6 July 2011

  • include walkthrough and explain scaffolding as well
    • Words/actual text
    • Graphics - maybe make interactive graphics with html 5 technology... adobe flex, processing.js
    • Animations

The assembly algorithm is optimized for 454 FLX Titanium data.

The workflow is as follows:

  1. Extract the information from the .sff file. The raw output of a 454 run is a .sff file. Newbler, the assembler that comes with the 454, reads that raw file directly. To use MIRA, we must convert the .sff file into separate files that together hold all of the information it contains: a .fasta file with the sequence reads, a .qual file with the certainty of each individual base in the corresponding read, and a tracefile.xml with the clipping information (expand here).
  2. Preprocess the raw reads with SMALT. Many labs design tags for sequencing when using the 454 workflow (DNA bar codes?) (MIRA tutorial, pg 76). These tags are often very similar to one another, and they must be filtered out before the assembly: they cause extra edges inside the assembly graph and complicate it, and a tag may bring together two sequences in a consensus sequence that do not belong together. Most sequencing facilities screen for these tags (and remove them from the raw files we get??? or no???). SMALT identifies k-mers (default set to 13 bp) and makes a screening file for MIRA to use. Then, during the assembly process, MIRA uses the screening file to clip?
  3. Assemble using MIRA, optimized for 454 Titanium non-linked sequence reads for small genomes. Because phages have small genomes, we can run quality operations that increase memory requirements and that could not be done for larger genomes. This includes operations.... blah.
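Step 1 produces a .fasta file and a .qual file that describe the same reads. The sketch below shows how the two can be matched up base by base. It assumes the usual layout (a ">" header line per read, bases in the .fasta, whitespace-separated integer scores in the .qual); the function names and the example read are made up for illustration.

```python
def parse_records(text):
    """Parse FASTA-style text into {read_name: [body lines]}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read name is the first word after ">".
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return records


def pair_reads(fasta_text, qual_text):
    """Return {read_name: [(base, quality), ...]} for every read in the FASTA."""
    seqs = {n: "".join(body) for n, body in parse_records(fasta_text).items()}
    quals = {
        n: [int(q) for q in " ".join(body).split()]
        for n, body in parse_records(qual_text).items()
    }
    paired = {}
    for name, seq in seqs.items():
        scores = quals[name]
        if len(scores) != len(seq):
            raise ValueError("length mismatch for read " + name)
        paired[name] = list(zip(seq, scores))
    return paired


# Hypothetical read with one quality score per base.
fasta_text = ">read1\nacGTAC\nGTtt\n"
qual_text = ">read1\n10 12 35 36 40\n38 37 30 11 9\n"
reads = pair_reads(fasta_text, qual_text)
print(reads["read1"][:3])
```

The length check is the useful part: a read whose sequence and quality lines disagree in length usually points to a broken conversion.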
