Tutorial

From GcatWiki
  • include walkthrough and explain scaffolding as well
    • Words/actual text
    • Graphics - maybe make interactive graphics with HTML5 technology (Adobe Flex, processing.js)
    • Animations

The assembly algorithm is optimized for 454 FLX Titanium data.

The workflow is as follows:

  1. Extract the information from the .sff file. The raw output of a 454 run is a .sff file. Newbler, the assembler that comes with the 454, reads that raw file directly. To use MIRA instead, we must convert the .sff file into files that together hold all of the information it contains: a .fasta file with the sequence reads, a .qual file with the certainty (quality score) of each individual base in the corresponding read, and a tracefile.xml with the clipping information (expand here).
  2. Preprocess the raw reads with SMALT. Many labs design their own tags for sequencing when using the 454 workflow (DNA bar codes?) (MIRA tutorial, pg 76), and these tags are often very similar to one another. The tags must be filtered out before assembly: they create extra edges inside the assembly graph and complicate it, and a tag may bring together two sequences that do not belong in the same consensus sequence. Most sequencing facilities screen for these tags (whether they also remove them from the raw files we receive still needs to be checked). SMALT identifies k-mers (default set to 13 bp) and makes a screening file for MIRA to use. During the assembly process, MIRA then appears to use the screening file for clipping (to be confirmed).
  3. Assemble using MIRA, optimized for 454 Titanium non-linked sequence reads from small genomes. Because phages have small genomes, we can run quality operations whose memory requirements would rule them out for larger genomes. These operations include ... (to be filled in).
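
The k-mer screening idea in step 2 can be illustrated with a toy sketch. This is not SMALT's actual algorithm or file format; it only shows what "finding 13 bp k-mers from a tag inside a read" means. The tag, the read, and the function names `kmers` and `find_tag_hits` are made up for illustration.

```python
def kmers(seq, k=13):
    """Return the set of all k-length substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_tag_hits(read, tag_kmers, k=13):
    """Return the start positions in read where a k-mer from the tag set occurs."""
    read = read.upper()
    return [i for i in range(len(read) - k + 1) if read[i:i + k] in tag_kmers]


# Hypothetical tag and read, invented for this example only.
tag = "ACGTACGTAGCTAGCTA"
read = "TTTT" + tag + "GGGGCCCCAAAATTTT"

tag_index = kmers(tag)                 # every 13-mer contained in the tag
hits = find_tag_hits(read, tag_index)  # positions in the read that match the tag

if hits:
    # A screening step would record this region so the assembler can clip it.
    print(f"tag-like region covers read positions {hits[0]}..{hits[-1] + 13 - 1}")
```

A real screening run works on all reads and all known tags at once and tolerates mismatches; the exact-match lookup here is only the simplest possible version of the idea.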

More stuff on workflow

sff_extract

  • sff_extract already does some clipping. When you look at the raw reads (.fasta), the lowercase bases are those that will be clipped.
    • This clipping information is stored in the trace XML file.
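
To make the lowercase convention concrete, here is a minimal sketch that recovers the part of a read that would survive clipping. It assumes the kept region is the longest run of uppercase bases in the read; the read itself and the function names `unclipped_span` and `apply_clip` are hypothetical, not part of sff_extract.

```python
import re


def unclipped_span(read):
    """Return (start, end) of the longest run of uppercase bases, or None."""
    runs = [m.span() for m in re.finditer(r"[A-Z]+", read)]
    if not runs:
        return None
    # Assumption: the retained region is the longest uppercase run.
    return max(runs, key=lambda span: span[1] - span[0])


def apply_clip(read):
    """Return only the bases that would remain after clipping."""
    span = unclipped_span(read)
    return "" if span is None else read[span[0]:span[1]]


# Made-up read: lowercase ends stand for the bases marked for clipping.
read = "tcagACGTTGCAAGGCTTAAccgtnn"
print(unclipped_span(read))
print(apply_clip(read))
```

The same start and end positions are what the trace XML file would carry as clip points, so a check like this can be used to compare the .fasta casing against the XML.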

SMALT