Difference between revisions of "Programs"

From GcatWiki
Jump to: navigation, search
(Euler-SR (short read) (introduced in 2.1))
(SMALT)
 
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
==OLC Assemblers==
 
==OLC Assemblers==
 
=====[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page CABOG] (now wgs assmebler)=====
 
=====[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page CABOG] (now wgs assmebler)=====
** sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
+
* sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
***These settings because I think they used a 454 FLX titanium reader
+
**These settings because I think they used a 454 FLX titanium reader
****sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker titanium -linker flx GWLSDZG07.sff
+
***sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker titanium GWLSDZG07.sff
***For 454 FLX reader
+
**For 454 FLX reader
****sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
+
***sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
****sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
+
***sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
**runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp  timshel.frg
+
*runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp  timshel.frg
***-s allows the user to specify a specification file that formats the assembly [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SpecFiles]
+
**-s allows the user to specify a specification file that formats the assembly [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SpecFiles]
**From here, use bambus
+
***-s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA
**'''Parameters and Thoughts'''
+
*runCA -p acadian -d /Users/letaylor/Desktop/CABOG/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA acadian.frg
***Automatic trimming of 454 reads
+
*runCA -p timshel -d /Users/letaylor/Desktop/Timshel/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA timshelTrim.frg
***Need to know insert size and linker sequence for scaffolding part to work
+
*From here, use bambus
****Include scaffolding software. (can choose not to use it)
+
*'''Parameters and Thoughts'''
****output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
+
**Automatic trimming of 454 reads
 +
**Need to know insert size and linker sequence for scaffolding part to work
 +
***Include scaffolding software. (can choose not to use it)
 +
***output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
 +
*Another tutoiral http://www.cbcb.umd.edu/research/CeleraAssembler.shtml
  
==De Novo Assemblers==
+
==De Bruijn Assemblers==
 
====ABySS====
 
====ABySS====
 
**Designed to address memory limitations for mammalian size genome assemblies
 
**Designed to address memory limitations for mammalian size genome assemblies
Line 74: Line 78:
 
*#***Need to have linker sequence in linker.fasta. According to wgs-Assembler documentation, the default linker sequence for 454 FLX titanium are...
 
*#***Need to have linker sequence in linker.fasta. According to wgs-Assembler documentation, the default linker sequence for 454 FLX titanium are...
 
*#****CGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ([http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SffToCA SffToCA])
 
*#****CGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ([http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SffToCA SffToCA])
 +
*#**I think fasta files given to me already have this info cut out. Don't know if I can get it or what.
 +
*#Run Assembly
 +
*#*Basic Assembly - ${EUSRC}/assembly/Assemble.pl reads.fa 25
 +
*#**Choose kmer - 25 <= k <= 28
 +
*#*Mate Pair Assembly
 +
*#**${EUSRC}/assembly/Assemble.pl reads.fa 25 -ruleFile readtitle.rules
 +
*#**You will have to generate a "rule file" that describes the mates you are using. This is a file with a regular expression in quotes, followed by two required keywords, CloneLength and CloneVar. The mate-pairs are used in the order that they are specified in rule file.  For optimal results, the clone rules should be specified in order of increasing clone size.
 +
*#**Rule File - pretty confusing
 +
*#***Example =  ([^/]*)/([12])" CloneLength=200 CloneVar=50
 +
*#***The Regex format http://opengroup.org/onlinepubs/007908775/xbd/re.html#tag_007_004
 +
*#Refining the Assembly
 +
*#*Assembly in transformed directory, however if mate pairs, in matetransformed directory
 +
*#*Show Dead Ends: ${EUSRC}/assembly/${MACHTYPE}/printGraphSummary transformed/reads.fasta -sources
 +
*#**If you have many, may be due to low coverage. Should want to regroup.
 +
*#*You can try to fix dead ends by joining them when the overlap is less than the vertex size, but greater than some minimal length, say 10.
 +
*#**${EUSRC}/assembly/${MACHTYPE}/joinsas transformed/reads.fasta 10 transformed/reads.j
 +
*#**${EUSRC}/assembly/${MACHTYPE}/printContigs transformed/reads.j
 +
*#**cp transformed/reads.j.contig
  
 
====[http://www.chevreux.org/projects_mira.html MIRA 3] (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)====
 
====[http://www.chevreux.org/projects_mira.html MIRA 3] (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)====
**Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
+
*Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
**Can assemble de novo or to a scaffold
+
*Can assemble de novo or to a scaffold
**Installation
+
*Installation
***Basic installation for single environment
+
**Basic installation for single environment
***#PATH=/Users/letaylor/Documents/mira/bin:$PATH
+
**#PATH=/Users/letaylor/Documents/mira/bin:$PATH
***#export PATH
+
**#export PATH
***#This works for this environment only... must redo it every time
+
**#This works for this environment only... must redo it every time
***Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
+
**Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
***#$ vim ~/.bashrc
+
**#$ vim ~/.bashrc
***#export PATH=$PATH:/path/to/mira/bin
+
**#export PATH=$PATH:/path/to/mira/bin
***#export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
+
**#export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
***#'esc' to stop editing in vim
+
**#'esc' to stop editing in vim
***#:wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
+
**#:wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
***#Now when want to run type $ source ~/.bashrc then $ mira
+
**#Now when want to run type $ source ~/.bashrc then $ mira
***For 454 data
+
**For 454 data
***#Download [http://bioinf.comav.upv.es/sff_extract/ sff_extract] and save as sff_extract (no file type)
+
**#Download [http://bioinf.comav.upv.es/sff_extract/ sff_extract] and save as sff_extract (no file type)
***#Make executable chmod +x /path/to/script
+
**#Make executable chmod +x /path/to/script
***#sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
+
**#sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
 +
*Preprocessing Data
 +
**Use sff_extract script
 +
**Use SSAHA2 to clean tags
 +
**use -CL:pec to clip right end of 454 data (only if have >10x coverage, which is a safe assumption given 454 data for phage reads)
 +
*Command
 +
**mira --project=acadia --job=denovo,genome,accurate,454 -fasta COMMON_SETTINGS -CL:pec >&log_assembly
 +
**mira --project=acadian --parameterfile=/Users/letaylor/Dropbox/College/CIS/Development/gass/mira/parameters.par >&log_assembly
  
 
====[http://sequence.otago.ac.nz/download/ManualPartC.pdf Newbler] (aka gsAssembler)====
 
====[http://sequence.otago.ac.nz/download/ManualPartC.pdf Newbler] (aka gsAssembler)====
Line 132: Line 161:
 
**velvetg ./ -cov_cutoff auto -exp_cov auto
 
**velvetg ./ -cov_cutoff auto -exp_cov auto
  
===Veiwers===
+
==Veiwers==
*[http://www.phrap.org/consed/consed.html Consed]
+
====[http://www.phrap.org/consed/consed.html Consed]====
**Aligned Reads View
+
*Aligned Reads View
***Low quality base = darker and lower case
+
**Low quality base = darker and lower case
***High quality base = ligher (whiter) and upper case
+
**High quality base = ligher (whiter) and upper case
***Black = unaligned
+
**Black = unaligned
***Asterisk = spacer... like dash in blast
+
**Asterisk = spacer... like dash in blast
***Red letters = disagree with consensus read. Alot of red in high quality region
+
**Red letters = disagree with consensus read. Alot of red in high quality region
***Middle click = open up trace window
+
**Middle click = open up trace window
***Navigate menue for low quality reads
+
**Navigate menue for low quality reads
***Can also export consensus sequences
+
**Can also export consensus sequences
**Assembly View
+
*Assembly View
***Dark green line = read coverage
+
**Dark green line = read coverage
***Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
+
**Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
**Programs Compatible with Consed: Newbler, Phrap
+
*Programs Compatible with Consed: Newbler, Phrap
  
===Other===
+
====[http://alggen.lsi.upc.es/recerca/align/mgcat/#about M-GCAT]====
*[http://www.cbcb.umd.edu/software/jellyfish/ Jellyfish]
+
*To run = python viewer.pyc
**Used to identify k-mers
 
**PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
 
**jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
 
***-c  For sequencing reads, use a value for -c large enough to counts up to twice the coverage
 
**jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
 
***Read stats output file from above
 
**jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
 
**jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0
 
  
*[http://adenine.biz.fh-weihenstephan.de/cgi-bin/shustring/shustring.cgi.pl Shustring]  
+
====[http://asap.ahabs.wisc.edu/mauve/index.php Mauve]====
**Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
+
 
**Global shustring
+
==Other==
**Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.
+
====[http://www.cbcb.umd.edu/software/jellyfish/ Jellyfish]====
 +
*Used to identify k-mers
 +
*PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
 +
*jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
 +
**-c  For sequencing reads, use a value for -c large enough to counts up to twice the coverage
 +
*jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
 +
**Read stats output file from above
 +
*jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
 +
*jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0
 +
 
 +
====[http://adenine.biz.fh-weihenstephan.de/cgi-bin/shustring/shustring.cgi.pl Shustring]====
 +
*Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
 +
*Global shustring
 +
*Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.
 +
 
 +
====[http://www.sanger.ac.uk/resources/software/ssaha2/ SSAHA2]====
 +
*Use to clean up tagged sequences
 +
 
 +
====[http://www.sanger.ac.uk/resources/software/smalt/ SMALT]====
 +
*Replaced ssaha2.
 +
*http://www.freelists.org/post/mira_talk/454-cleaning,15
 +
*To use to screen for mira
 +
*#smalt index -k 13 -s 1 index_out_name sequences.fasta
 +
*#*looks for kmers of size 13 with step size of 1
 +
*#smalt map -f ssaha -d 1 -m 7 -x -o seqName_smaltvectorscreen_in.txt index_out_name sequences.fasta
 +
*#*ouput in ssaha format for mira, score_difference, minimum score, x = more thorough search

Latest revision as of 14:00, 21 June 2011

OLC Assemblers

CABOG (now wgs assmebler)
  • sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
    • These settings because I think they used a 454 FLX titanium reader
      • sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshelTRIM.frg -insertsize 3200 900 -linker titanium GWLSDZG07.sff
    • For 454 FLX reader
      • sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
      • sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
  • runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp timshel.frg
    • -s allows the user to specify a specification file that formats the assembly [1]
      • -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA
  • runCA -p acadian -d /Users/letaylor/Desktop/CABOG/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA acadian.frg
  • runCA -p timshel -d /Users/letaylor/Desktop/Timshel/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA timshelTrim.frg
  • From here, use bambus
  • Parameters and Thoughts
    • Automatic trimming of 454 reads
    • Need to know insert size and linker sequence for scaffolding part to work
      • Include scaffolding software. (can choose not to use it)
      • output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
  • Another tutoiral http://www.cbcb.umd.edu/research/CeleraAssembler.shtml

De Bruijn Assemblers

ABySS

    • Designed to address memory limitations for mammalian size genome assemblies

AllPaths

    • Assembler intended for large genomes
  • Celera Assembler
    • Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat

Euler+ (Euler 2.0)

    • Developed in 2004
    • Uses A-Bruijn graphs
    • Deals with errors in reads by inducing vertices with ungapped alignments (2.1 pg 2)
      • Allows mismatches rather than exact l-tuples in Bruijn graphs
    • Does not scale well with next gen because graph grows with coverage - run out of memory (2.1 pg 2)
      • However, focus for this project is phages. May not be an issue.

Euler-SR (short read) (introduced in 2.1)

    • Developed in 2007
    • Modified version of Euler+
      • "Substitutes the maximum spanning tree optimization of the A-Bruijn graph by the maximum branching optimization on de Bruijin graphs" (2.1 pg 2)
    • General Walkthrough of all Euler implementations (3.1 pg 320)
      1. Filter reads with spectral alignment process
        • Detect erroneous base calls by noting low frequency K-mers
          • Most true K-mers = repeated in several reads
          • Sequence errors are random, so for any K where 4^K exceeds twice the genome size, most erroneous K-mers should be unique.
          • List K-mers and their observed frequencies
            • Either exclude or correct low frequency k-mers
            • Correction - reduces total number of k-mers (and the size of the graph), can also mask polymorphism
              • Only corrects substitution errors (not indels)
          • Step 1. Determine threshold to trust or not trust reads
            • Threshold determined via distribution of k-mer frequencies present in the reads
              • Usually binomial distribution... 1st peak = k-mers that occur once or twice (due to seq error or low coverage) 2nd peak = redundant k-mers (due to read coverage or repeats)
              • Threshold = between the peaks
          • Step 2. Determine if read is good or bad and proceed accordingly
            • Bad reads - "Euler executes a greedy exploration for base call substitutions that reduce bad k-mer count" (3.1 pg 320)
            • If fully corrected, then accept the read. Otherwise reject
      2. Build K-mer graph from processed reads
        • Processes k-mers, not reads, so discards long-range continuity info (within reads)
    • Output: both contigs and the repeat graph
      • Connects the contigs by repeats and can help direct the finishing efforts
    • Only uses read and qual files
    • Bulk of the time this program takes is in error correction
      • Error correction: can be done based on qual values or based on spectral alignment technique
    • Steps: error correction, graph construction, graph correction, assembly by transforming paths into contigs
  • Command Steps
    1. Clean Data {EUSRC}/assembly/${MACHTYPE}/sff2fasta reads.sff reads.fasta
      • /Users/letaylor/Documents/euler/assembly/x86_64/sff2fasta timshel.sff timshel.fasta
      • Newbler can take advantage of the flow values, but euler-sr does not (Euler tutorial)
      • EITHER OUTPUT IN FASTQ FORMAT OR NO LONGER CORRECTLY READING .SFF FILES
      • OPTIONAL: quality filter ${EUSRC}/assembly/${MACHTYPE}/qualityTrimmer -fasta reads.fasta -qual reads.fasta.qual -outFasta reads.trimmed
    2. Pair paired end reads
      • ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split
      • Can also do separate file for unpaired sequences
        • ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split -singletons reads.454.fasta.singletons
          • Need to have linker sequence in linker.fasta. According to wgs-Assembler documentation, the default linker sequence for 454 FLX titanium are...
            • CGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA (SffToCA)
        • I think fasta files given to me already have this info cut out. Don't know if I can get it or what.
    3. Run Assembly
      • Basic Assembly - ${EUSRC}/assembly/Assemble.pl reads.fa 25
        • Choose kmer - 25 <= k <= 28
      • Mate Pair Assembly
        • ${EUSRC}/assembly/Assemble.pl reads.fa 25 -ruleFile readtitle.rules
        • You will have to generate a "rule file" that describes the mates you are using. This is a file with a regular expression in quotes, followed by two required keywords, CloneLength and CloneVar. The mate-pairs are used in the order that they are specified in rule file. For optimal results, the clone rules should be specified in order of increasing clone size.
        • Rule File - pretty confusing
    4. Refining the Assembly
      • Assembly in transformed directory, however if mate pairs, in matetransformed directory
      • Show Dead Ends: ${EUSRC}/assembly/${MACHTYPE}/printGraphSummary transformed/reads.fasta -sources
        • If you have many, may be due to low coverage. Should want to regroup.
      • You can try to fix dead ends by joining them when the overlap is less than the vertex size, but greater than some minimal length, say 10.
        • ${EUSRC}/assembly/${MACHTYPE}/joinsas transformed/reads.fasta 10 transformed/reads.j
        • ${EUSRC}/assembly/${MACHTYPE}/printContigs transformed/reads.j
        • cp transformed/reads.j.contig

MIRA 3 (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)

  • Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
  • Can assemble de novo or to a scaffold
  • Installation
    • Basic installation for single environment
      1. PATH=/Users/letaylor/Documents/mira/bin:$PATH
      2. export PATH
      3. This works for this environment only... must redo it every time
    • Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
      1. $ vim ~/.bashrc
      2. export PATH=$PATH:/path/to/mira/bin
      3. export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
      4. 'esc' to stop editing in vim
        wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
      5. Now when want to run type $ source ~/.bashrc then $ mira
    • For 454 data
      1. Download sff_extract and save as sff_extract (no file type)
      2. Make executable chmod +x /path/to/script
      3. sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
  • Preprocessing Data
    • Use sff_extract script
    • Use SSAHA2 to clean tags
    • use -CL:pec to clip right end of 454 data (only if have >10x coverage, which is a safe assumption given 454 data for phage reads)
  • Command
    • mira --project=acadia --job=denovo,genome,accurate,454 -fasta COMMON_SETTINGS -CL:pec >&log_assembly
    • mira --project=acadian --parameterfile=/Users/letaylor/Dropbox/College/CIS/Development/gass/mira/parameters.par >&log_assembly

Newbler (aka gsAssembler)

Velvet

    • Does extensive compression of de Bruijn graph
      • PreProcessing to reduce graph size
        • Spurr removal algorithm = similar to Euler erosion procedure.
        • Note: No spectral alignment filter (because authors thought it naive)
      • Graph processing to reduce graph size
        • Bounded search for bubbles
          • Breadth-first-search - starts at nodes with multiple out edges
          • Bounded because can have bubbles within bubbles, so finding all of them would be impractical
          • Limits bubbles to those with a sequence similarity requirement on alternate paths
          • Removes bubble path with fewer reads
          • Step here about realignment I do not understand
          • This operation risks ignoring polymorphisms or the collapse of near identical repeats
          • Whole process similar to Euler's bulge remover and the bubble smoothing in OLC assemblers
      • Read Threading
        • Removes path with fewer reads than a threshold
        • Risks removing low coverage sequence... but supposed to remove spurious connections induced by convergent sequencing errors
      • Mate Pairs reduction
    • Variables
      • Length of k-mers (note this is constrained to be an odd number)
      • The min frequency of expected k-mers in reads - determines which k-mers are pruned a-priori
      • The expected coverage of genome - controls spurious connection breaking
    • Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat
    • SUMMARY: does not have error-correction pre-processor, but has error avoidance filter. (not good for large genomes)
    • velveth ./ 21,31,2 -long 7.GAC.454Reads.fna
      • m <= k <= M with a step of s
    • velvetg ./ -cov_cutoff auto -exp_cov auto

Veiwers

Consed

  • Aligned Reads View
    • Low quality base = darker and lower case
    • High quality base = ligher (whiter) and upper case
    • Black = unaligned
    • Asterisk = spacer... like dash in blast
    • Red letters = disagree with consensus read. Alot of red in high quality region
    • Middle click = open up trace window
    • Navigate menue for low quality reads
    • Can also export consensus sequences
  • Assembly View
    • Dark green line = read coverage
    • Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
  • Programs Compatible with Consed: Newbler, Phrap

M-GCAT

  • To run = python viewer.pyc

Mauve

Other

Jellyfish

  • Used to identify k-mers
  • PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
  • jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
    • -c For sequencing reads, use a value for -c large enough to counts up to twice the coverage
  • jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
    • Read stats output file from above
  • jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
  • jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0

Shustring

  • Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
  • Global shustring
  • Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.

SSAHA2

  • Use to clean up tagged sequences

SMALT

  • Replaced ssaha2.
  • http://www.freelists.org/post/mira_talk/454-cleaning,15
  • To use to screen for mira
    1. smalt index -k 13 -s 1 index_out_name sequences.fasta
      • looks for kmers of size 13 with step size of 1
    2. smalt map -f ssaha -d 1 -m 7 -x -o seqName_smaltvectorscreen_in.txt index_out_name sequences.fasta
      • ouput in ssaha format for mira, score_difference, minimum score, x = more thorough search