Difference between revisions of "Programs"

Latest revision as of 14:00, 21 June 2011

OLC Assemblers

CABOG (now wgs assmebler)

sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
- These settings because I think they used a 454 FLX titanium reader
  - sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshelTRIM.frg -insertsize 3200 900 -linker titanium GWLSDZG07.sff
- For 454 FLX reader
  - sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
  - sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp timshel.frg
- -s allows the user to specify a specification file that formats the assembly [1]
  - -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA
runCA -p acadian -d /Users/letaylor/Desktop/CABOG/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA acadian.frg
runCA -p timshel -d /Users/letaylor/Desktop/Timshel/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA timshelTrim.frg
From here, use bambus
Parameters and Thoughts
- Automatic trimming of 454 reads
- Need to know insert size and linker sequence for scaffolding part to work
  - Include scaffolding software. (can choose not to use it)
  - output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
Another tutoiral http://www.cbcb.umd.edu/research/CeleraAssembler.shtml

De Bruijn Assemblers

ABySS

- Designed to address memory limitations for mammalian size genome assemblies

AllPaths

- Assembler intended for large genomes

Celera Assembler
- Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat

Euler+ (Euler 2.0)

- Developed in 2004
- Uses A-Bruijn graphs
- Deals with errors in reads by inducing vertices with ungapped alignments (2.1 pg 2)
  - Allows mismatches rather than exact l-tuples in Bruijn graphs
- Does not scale well with next gen because graph grows with coverage - run out of memory (2.1 pg 2)
  - However, focus for this project is phages. May not be an issue.

Euler-SR (short read) (introduced in 2.1)

- Developed in 2007
- Modified version of Euler+
  - "Substitutes the maximum spanning tree optimization of the A-Bruijn graph by the maximum branching optimization on de Bruijin graphs" (2.1 pg 2)
- General Walkthrough of all Euler implementations (3.1 pg 320)
  1. Filter reads with spectral alignment process
    - Detect erroneous base calls by noting low frequency K-mers
      - Most true K-mers = repeated in several reads
      - Sequence errors are random, so for any K where 4^K exceeds twice the genome size, most erroneous K-mers should be unique.
      - List K-mers and their observed frequencies
        Either exclude or correct low frequency k-mers
        
        Correction - reduces total number of k-mers (and the size of the graph), can also mask polymorphism
        Only corrects substitution errors (not indels)
      - Step 1. Determine threshold to trust or not trust reads
        Threshold determined via distribution of k-mer frequencies present in the reads
        Usually binomial distribution... 1st peak = k-mers that occur once or twice (due to seq error or low coverage) 2nd peak = redundant k-mers (due to read coverage or repeats)
        
        Threshold = between the peaks
      - Step 2. Determine if read is good or bad and proceed accordingly
        Bad reads - "Euler executes a greedy exploration for base call substitutions that reduce bad k-mer count" (3.1 pg 320)
        
        If fully corrected, then accept the read. Otherwise reject
  2. Build K-mer graph from processed reads
    - Processes k-mers, not reads, so discards long-range continuity info (within reads)
- Output: both contigs and the repeat graph
  - Connects the contigs by repeats and can help direct the finishing efforts
- Only uses read and qual files
- Bulk of the time this program takes is in error correction
  - Error correction: can be done based on qual values or based on spectral alignment technique
- Steps: error correction, graph construction, graph correction, assembly by transforming paths into contigs
Command Steps
1. Clean Data {EUSRC}/assembly/${MACHTYPE}/sff2fasta reads.sff reads.fasta
  - /Users/letaylor/Documents/euler/assembly/x86_64/sff2fasta timshel.sff timshel.fasta
  - Newbler can take advantage of the flow values, but euler-sr does not (Euler tutorial)
  - EITHER OUTPUT IN FASTQ FORMAT OR NO LONGER CORRECTLY READING .SFF FILES
  - OPTIONAL: quality filter ${EUSRC}/assembly/${MACHTYPE}/qualityTrimmer -fasta reads.fasta -qual reads.fasta.qual -outFasta reads.trimmed
2. Pair paired end reads
  - ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split
  - Can also do separate file for unpaired sequences
    - ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split -singletons reads.454.fasta.singletons
      - Need to have linker sequence in linker.fasta. According to wgs-Assembler documentation, the default linker sequence for 454 FLX titanium are...
        CGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA (SffToCA)
    - I think fasta files given to me already have this info cut out. Don't know if I can get it or what.
3. Run Assembly
  - Basic Assembly - ${EUSRC}/assembly/Assemble.pl reads.fa 25
    - Choose kmer - 25 <= k <= 28
  - Mate Pair Assembly
    - ${EUSRC}/assembly/Assemble.pl reads.fa 25 -ruleFile readtitle.rules
    - You will have to generate a "rule file" that describes the mates you are using. This is a file with a regular expression in quotes, followed by two required keywords, CloneLength and CloneVar. The mate-pairs are used in the order that they are specified in rule file. For optimal results, the clone rules should be specified in order of increasing clone size.
    - Rule File - pretty confusing
      - Example = ([^/]*)/([12])" CloneLength=200 CloneVar=50
      - The Regex format http://opengroup.org/onlinepubs/007908775/xbd/re.html#tag_007_004
4. Refining the Assembly
  - Assembly in transformed directory, however if mate pairs, in matetransformed directory
  - Show Dead Ends: ${EUSRC}/assembly/${MACHTYPE}/printGraphSummary transformed/reads.fasta -sources
    - If you have many, may be due to low coverage. Should want to regroup.
  - You can try to fix dead ends by joining them when the overlap is less than the vertex size, but greater than some minimal length, say 10.
    - ${EUSRC}/assembly/${MACHTYPE}/joinsas transformed/reads.fasta 10 transformed/reads.j
    - ${EUSRC}/assembly/${MACHTYPE}/printContigs transformed/reads.j
    - cp transformed/reads.j.contig

MIRA 3 (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)

Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
Can assemble de novo or to a scaffold
Installation
- Basic installation for single environment
  1. PATH=/Users/letaylor/Documents/mira/bin:$PATH
  2. export PATH
  3. This works for this environment only... must redo it every time
- Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
  1. $ vim ~/.bashrc
  2. export PATH=$PATH:/path/to/mira/bin
  3. export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
  4. 'esc' to stop editing in vim
    wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
  5. Now when want to run type $ source ~/.bashrc then $ mira
- For 454 data
  1. Download sff_extract and save as sff_extract (no file type)
  2. Make executable chmod +x /path/to/script
  3. sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
Preprocessing Data
- Use sff_extract script
- Use SSAHA2 to clean tags
- use -CL:pec to clip right end of 454 data (only if have >10x coverage, which is a safe assumption given 454 data for phage reads)
Command
- mira --project=acadia --job=denovo,genome,accurate,454 -fasta COMMON_SETTINGS -CL:pec >&log_assembly
- mira --project=acadian --parameterfile=/Users/letaylor/Dropbox/College/CIS/Development/gass/mira/parameters.par >&log_assembly

Newbler (aka gsAssembler)

- An Overlap Layout Consensus assembler.
- Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
- Assembler provided by 454 company.
  - Specific to 454 data... so uses 454 dependent error model and has access to flowgrams (2.1 pg 3)
- Good blog: http://contig.wordpress.com/
- For pure 454 data, probably the best assembler

Velvet

- Does extensive compression of de Bruijn graph
  - PreProcessing to reduce graph size
    - Spurr removal algorithm = similar to Euler erosion procedure.
    - Note: No spectral alignment filter (because authors thought it naive)
  - Graph processing to reduce graph size
    - Bounded search for bubbles
      - Breadth-first-search - starts at nodes with multiple out edges
      - Bounded because can have bubbles within bubbles, so finding all of them would be impractical
      - Limits bubbles to those with a sequence similarity requirement on alternate paths
      - Removes bubble path with fewer reads
      - Step here about realignment I do not understand
      - This operation risks ignoring polymorphisms or the collapse of near identical repeats
      - Whole process similar to Euler's bulge remover and the bubble smoothing in OLC assemblers
  - Read Threading
    - Removes path with fewer reads than a threshold
    - Risks removing low coverage sequence... but supposed to remove spurious connections induced by convergent sequencing errors
  - Mate Pairs reduction
    - Uses pebble algorithm (see Pebble and Rock Band)
- Variables
  - Length of k-mers (note this is constrained to be an odd number)
  - The min frequency of expected k-mers in reads - determines which k-mers are pruned a-priori
  - The expected coverage of genome - controls spurious connection breaking
- Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat
- SUMMARY: does not have error-correction pre-processor, but has error avoidance filter. (not good for large genomes)
- velveth ./ 21,31,2 -long 7.GAC.454Reads.fna
  - m <= k <= M with a step of s
- velvetg ./ -cov_cutoff auto -exp_cov auto

Veiwers

Consed

Aligned Reads View
- Low quality base = darker and lower case
- High quality base = ligher (whiter) and upper case
- Black = unaligned
- Asterisk = spacer... like dash in blast
- Red letters = disagree with consensus read. Alot of red in high quality region
- Middle click = open up trace window
- Navigate menue for low quality reads
- Can also export consensus sequences
Assembly View
- Dark green line = read coverage
- Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
Programs Compatible with Consed: Newbler, Phrap

M-GCAT

To run = python viewer.pyc

Mauve

Other

Jellyfish

Used to identify k-mers
PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
- -c For sequencing reads, use a value for -c large enough to counts up to twice the coverage
jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
- Read stats output file from above
jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0

Shustring

Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
Global shustring
Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.

SSAHA2

Use to clean up tagged sequences

SMALT

Replaced ssaha2.
http://www.freelists.org/post/mira_talk/454-cleaning,15
To use to screen for mira
1. smalt index -k 13 -s 1 index_out_name sequences.fasta
  - looks for kmers of size 13 with step size of 1
2. smalt map -f ssaha -d 1 -m 7 -x -o seqName_smaltvectorscreen_in.txt index_out_name sequences.fasta
  - ouput in ssaha format for mira, score_difference, minimum score, x = more thorough search

@@ Line 1: / Line 1: @@
 ==OLC Assemblers==
 =====[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page CABOG] (now wgs assmebler)=====
-** sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
+* sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
-***These settings because I think they used a 454 FLX titanium reader
+**These settings because I think they used a 454 FLX titanium reader
-****sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker titanium -linker flx GWLSDZG07.sff
+***sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker titanium GWLSDZG07.sff
-***For 454 FLX reader
+**For 454 FLX reader
-****sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
+***sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
-****sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
+***sffToCA -libraryname GWLSDZG07 -clear all -trim chop  -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
-**runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp  timshel.frg
+*runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp  timshel.frg
-***-s allows the user to specify a specification file that formats the assembly [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SpecFiles]
+**-s allows the user to specify a specification file that formats the assembly [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SpecFiles]
-**From here, use bambus
+***-s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA
-**'''Parameters and Thoughts'''
+*runCA -p acadian -d /Users/letaylor/Desktop/CABOG/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA acadian.frg
-***Automatic trimming of 454 reads
+*runCA -p timshel -d /Users/letaylor/Desktop/Timshel/assemble1 -s /Users/letaylor/Dropbox/College/CIS/Development/wgs/spec.runCA timshelTrim.frg
-***Need to know insert size and linker sequence for scaffolding part to work
+*From here, use bambus
-****Include scaffolding software. (can choose not to use it)
+*'''Parameters and Thoughts'''
-****output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
+**Automatic trimming of 454 reads
+**Need to know insert size and linker sequence for scaffolding part to work
+***Include scaffolding software. (can choose not to use it)
+***output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
+*Another tutoiral http://www.cbcb.umd.edu/research/CeleraAssembler.shtml
-==De Novo Assemblers==
+==De Bruijn Assemblers==
 ====ABySS====
 **Designed to address memory limitations for mammalian size genome assemblies
@@ Line 94: / Line 98: @@
 ====[http://www.chevreux.org/projects_mira.html MIRA 3] (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)====
-**Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
+*Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
-**Can assemble de novo or to a scaffold
+*Can assemble de novo or to a scaffold
-**Installation
+*Installation
-***Basic installation for single environment
+**Basic installation for single environment
-***#PATH=/Users/letaylor/Documents/mira/bin:$PATH
+**#PATH=/Users/letaylor/Documents/mira/bin:$PATH
-***#export PATH
+**#export PATH
-***#This works for this environment only... must redo it every time
+**#This works for this environment only... must redo it every time
-***Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
+**Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
-***#$ vim ~/.bashrc
+**#$ vim ~/.bashrc
-***#export PATH=$PATH:/path/to/mira/bin
+**#export PATH=$PATH:/path/to/mira/bin
-***#export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
+**#export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
-***#'esc' to stop editing in vim
+**#'esc' to stop editing in vim
-***#:wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
+**#:wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
-***#Now when want to run type $ source ~/.bashrc then $ mira
+**#Now when want to run type $ source ~/.bashrc then $ mira
-***For 454 data
+**For 454 data
-***#Download [http://bioinf.comav.upv.es/sff_extract/ sff_extract] and save as sff_extract (no file type)
+**#Download [http://bioinf.comav.upv.es/sff_extract/ sff_extract] and save as sff_extract (no file type)
-***#Make executable chmod +x /path/to/script
+**#Make executable chmod +x /path/to/script
-***#sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
+**#sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
+*Preprocessing Data
+**Use sff_extract script
+**Use SSAHA2 to clean tags
+**use -CL:pec to clip right end of 454 data (only if have >10x coverage, which is a safe assumption given 454 data for phage reads)
+*Command
+**mira --project=acadia --job=denovo,genome,accurate,454 -fasta COMMON_SETTINGS -CL:pec >&log_assembly
+**mira --project=acadian --parameterfile=/Users/letaylor/Dropbox/College/CIS/Development/gass/mira/parameters.par >&log_assembly
 ====[http://sequence.otago.ac.nz/download/ManualPartC.pdf Newbler] (aka gsAssembler)====
@@ Line 150: / Line 161: @@
 **velvetg ./ -cov_cutoff auto -exp_cov auto
-===Veiwers===
+==Veiwers==
-*[http://www.phrap.org/consed/consed.html Consed]
+====[http://www.phrap.org/consed/consed.html Consed]====
-**Aligned Reads View
+*Aligned Reads View
-***Low quality base = darker and lower case
+**Low quality base = darker and lower case
-***High quality base = ligher (whiter) and upper case
+**High quality base = ligher (whiter) and upper case
-***Black = unaligned
+**Black = unaligned
-***Asterisk = spacer... like dash in blast
+**Asterisk = spacer... like dash in blast
-***Red letters = disagree with consensus read. Alot of red in high quality region
+**Red letters = disagree with consensus read. Alot of red in high quality region
-***Middle click = open up trace window
+**Middle click = open up trace window
-***Navigate menue for low quality reads
+**Navigate menue for low quality reads
-***Can also export consensus sequences
+**Can also export consensus sequences
-**Assembly View
+*Assembly View
-***Dark green line = read coverage
+**Dark green line = read coverage
-***Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
+**Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
-**Programs Compatible with Consed: Newbler, Phrap
+*Programs Compatible with Consed: Newbler, Phrap
-===Other===
+====[http://alggen.lsi.upc.es/recerca/align/mgcat/#about M-GCAT]====
-*[http://www.cbcb.umd.edu/software/jellyfish/ Jellyfish]
+*To run = python viewer.pyc
-**Used to identify k-mers
-**PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
-**jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
-***-c  For sequencing reads, use a value for -c large enough to counts up to twice the coverage
-**jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
-***Read stats output file from above
-**jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
-**jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0
-*[http://adenine.biz.fh-weihenstephan.de/cgi-bin/shustring/shustring.cgi.pl Shustring]
+====[http://asap.ahabs.wisc.edu/mauve/index.php Mauve]====
-**Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
-**Global shustring
+==Other==
-**Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.
+====[http://www.cbcb.umd.edu/software/jellyfish/ Jellyfish]====
+*Used to identify k-mers
+*PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
+*jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
+**-c  For sequencing reads, use a value for -c large enough to counts up to twice the coverage
+*jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
+**Read stats output file from above
+*jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
+*jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0
+====[http://adenine.biz.fh-weihenstephan.de/cgi-bin/shustring/shustring.cgi.pl Shustring]====
+*Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
+*Global shustring
+*Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.
+====[http://www.sanger.ac.uk/resources/software/ssaha2/ SSAHA2]====
+*Use to clean up tagged sequences
+====[http://www.sanger.ac.uk/resources/software/smalt/ SMALT]====
+*Replaced ssaha2.
+*http://www.freelists.org/post/mira_talk/454-cleaning,15
+*To use to screen for mira
+*#smalt index -k 13 -s 1 index_out_name sequences.fasta
+*#*looks for kmers of size 13 with step size of 1
+*#smalt map -f ssaha -d 1 -m 7 -x -o seqName_smaltvectorscreen_in.txt index_out_name sequences.fasta
+*#*ouput in ssaha format for mira, score_difference, minimum score, x = more thorough search

Difference between revisions of "Programs"

Latest revision as of 14:00, 21 June 2011

Contents

OLC Assemblers

CABOG (now wgs assmebler)

De Bruijn Assemblers

ABySS

AllPaths

Euler+ (Euler 2.0)

Euler-SR (short read) (introduced in 2.1)

MIRA 3 (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)

Newbler (aka gsAssembler)

Velvet

Veiwers

Consed

M-GCAT

Mauve

Other

Jellyfish

Shustring

SSAHA2

SMALT

Navigation menu

Views

Personal tools

Navigation

Search

Tools