Difference between revisions of "Programs"
From GcatWiki
(→[http://www.sanger.ac.uk/resources/software/ssaha2/ SSAHA2) |
|||
Line 183: | Line 183: | ||
*Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique. | *Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique. | ||
− | ====[http://www.sanger.ac.uk/resources/software/ssaha2/ SSAHA2==== | + | ====[http://www.sanger.ac.uk/resources/software/ssaha2/ SSAHA2]==== |
*USe to clean up tagged sequences | *USe to clean up tagged sequences |
Revision as of 14:36, 20 June 2011
OLC Assemblers
CABOG (now wgs assmebler)
- sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshel.frg GWLSDZG07.sff
- These settings because I think they used a 454 FLX titanium reader
- sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshelTRIM.frg -insertsize 3200 900 -linker titanium GWLSDZG07.sff
- For 454 FLX reader
- sffToCA -libraryname GWLSDZG07 -clear 454 -trim chop -output timshel.frg GWLSDZG07.sff
- sffToCA -libraryname GWLSDZG07 -clear all -trim chop -output timshelTRIM.frg -insertsize 3200 900 -linker flx GWLSDZG07.sff
- These settings because I think they used a 454 FLX titanium reader
- runCA -p timshel -d /Users/letaylor/Desktop/Timshel/cabog/timshelTemp timshel.frg
- -s allows the user to specify a specification file that formats the assembly [1]
- From here, use bambus
- Parameters and Thoughts
- Automatic trimming of 454 reads
- Need to know insert size and linker sequence for scaffolding part to work
- Include scaffolding software. (can choose not to use it)
- output placed in 9-terminator folder - xxx.scf.fasta = scaffolds (.ctg. is contig)
- Another tutoiral http://www.cbcb.umd.edu/research/CeleraAssembler.shtml
De Novo Assemblers
ABySS
- Designed to address memory limitations for mammalian size genome assemblies
AllPaths
- Assembler intended for large genomes
- Celera Assembler
- Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat
Euler+ (Euler 2.0)
- Developed in 2004
- Uses A-Bruijn graphs
- Deals with errors in reads by inducing vertices with ungapped alignments (2.1 pg 2)
- Allows mismatches rather than exact l-tuples in Bruijn graphs
- Does not scale well with next gen because graph grows with coverage - run out of memory (2.1 pg 2)
- However, focus for this project is phages. May not be an issue.
Euler-SR (short read) (introduced in 2.1)
- Developed in 2007
- Modified version of Euler+
- "Substitutes the maximum spanning tree optimization of the A-Bruijn graph by the maximum branching optimization on de Bruijin graphs" (2.1 pg 2)
- General Walkthrough of all Euler implementations (3.1 pg 320)
- Filter reads with spectral alignment process
- Detect erroneous base calls by noting low frequency K-mers
- Most true K-mers = repeated in several reads
- Sequence errors are random, so for any K where 4^K exceeds twice the genome size, most erroneous K-mers should be unique.
- List K-mers and their observed frequencies
- Either exclude or correct low frequency k-mers
- Correction - reduces total number of k-mers (and the size of the graph), can also mask polymorphism
- Only corrects substitution errors (not indels)
- Step 1. Determine threshold to trust or not trust reads
- Threshold determined via distribution of k-mer frequencies present in the reads
- Usually binomial distribution... 1st peak = k-mers that occur once or twice (due to seq error or low coverage) 2nd peak = redundant k-mers (due to read coverage or repeats)
- Threshold = between the peaks
- Threshold determined via distribution of k-mer frequencies present in the reads
- Step 2. Determine if read is good or bad and proceed accordingly
- Bad reads - "Euler executes a greedy exploration for base call substitutions that reduce bad k-mer count" (3.1 pg 320)
- If fully corrected, then accept the read. Otherwise reject
- Detect erroneous base calls by noting low frequency K-mers
- Build K-mer graph from processed reads
- Processes k-mers, not reads, so discards long-range continuity info (within reads)
- Filter reads with spectral alignment process
- Output: both contigs and the repeat graph
- Connects the contigs by repeats and can help direct the finishing efforts
- Only uses read and qual files
- Bulk of the time this program takes is in error correction
- Error correction: can be done based on qual values or based on spectral alignment technique
- Steps: error correction, graph construction, graph correction, assembly by transforming paths into contigs
- Command Steps
- Clean Data {EUSRC}/assembly/${MACHTYPE}/sff2fasta reads.sff reads.fasta
- /Users/letaylor/Documents/euler/assembly/x86_64/sff2fasta timshel.sff timshel.fasta
- Newbler can take advantage of the flow values, but euler-sr does not (Euler tutorial)
- EITHER OUTPUT IN FASTQ FORMAT OR NO LONGER CORRECTLY READING .SFF FILES
- OPTIONAL: quality filter ${EUSRC}/assembly/${MACHTYPE}/qualityTrimmer -fasta reads.fasta -qual reads.fasta.qual -outFasta reads.trimmed
- Pair paired end reads
- ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split
- Can also do separate file for unpaired sequences
- ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split -singletons reads.454.fasta.singletons
- Need to have linker sequence in linker.fasta. According to wgs-Assembler documentation, the default linker sequence for 454 FLX titanium are...
- CGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA (SffToCA)
- Need to have linker sequence in linker.fasta. According to wgs-Assembler documentation, the default linker sequence for 454 FLX titanium are...
- I think fasta files given to me already have this info cut out. Don't know if I can get it or what.
- ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split -singletons reads.454.fasta.singletons
- Run Assembly
- Basic Assembly - ${EUSRC}/assembly/Assemble.pl reads.fa 25
- Choose kmer - 25 <= k <= 28
- Mate Pair Assembly
- ${EUSRC}/assembly/Assemble.pl reads.fa 25 -ruleFile readtitle.rules
- You will have to generate a "rule file" that describes the mates you are using. This is a file with a regular expression in quotes, followed by two required keywords, CloneLength and CloneVar. The mate-pairs are used in the order that they are specified in rule file. For optimal results, the clone rules should be specified in order of increasing clone size.
- Rule File - pretty confusing
- Example = ([^/]*)/([12])" CloneLength=200 CloneVar=50
- The Regex format http://opengroup.org/onlinepubs/007908775/xbd/re.html#tag_007_004
- Basic Assembly - ${EUSRC}/assembly/Assemble.pl reads.fa 25
- Refining the Assembly
- Assembly in transformed directory, however if mate pairs, in matetransformed directory
- Show Dead Ends: ${EUSRC}/assembly/${MACHTYPE}/printGraphSummary transformed/reads.fasta -sources
- If you have many, may be due to low coverage. Should want to regroup.
- You can try to fix dead ends by joining them when the overlap is less than the vertex size, but greater than some minimal length, say 10.
- ${EUSRC}/assembly/${MACHTYPE}/joinsas transformed/reads.fasta 10 transformed/reads.j
- ${EUSRC}/assembly/${MACHTYPE}/printContigs transformed/reads.j
- cp transformed/reads.j.contig
- Clean Data {EUSRC}/assembly/${MACHTYPE}/sff2fasta reads.sff reads.fasta
MIRA 3 (also http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page)
- Good for < 1.2 mil bases (http://seqanswers.com/forums/showthread.php?t=4136&highlight=virus+assembly)
- Can assemble de novo or to a scaffold
- Installation
- Basic installation for single environment
- PATH=/Users/letaylor/Documents/mira/bin:$PATH
- export PATH
- This works for this environment only... must redo it every time
- Write bash script to do this... (http://prashanthmadi.blogspot.com/2010/10/installing-mira.html)
- $ vim ~/.bashrc
- export PATH=$PATH:/path/to/mira/bin
- export PATH=/Users/letaylor/Documents/mira/scripts:$PATH
- 'esc' to stop editing in vim
- wq to write and quit vim (http://www.tuxfiles.org/linuxhelp/vimcheat.html)
- Now when want to run type $ source ~/.bashrc then $ mira
- For 454 data
- Download sff_extract and save as sff_extract (no file type)
- Make executable chmod +x /path/to/script
- sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
- Basic installation for single environment
Newbler (aka gsAssembler)
- An Overlap Layout Consensus assembler.
- Good for reads > 250nt (http://seqanswers.com/forums/showthread.php?t=5092&highlight=Brujin).
- Assembler provided by 454 company.
- Specific to 454 data... so uses 454 dependent error model and has access to flowgrams (2.1 pg 3)
- Good blog: http://contig.wordpress.com/
- For pure 454 data, probably the best assembler
Velvet
- Does extensive compression of de Bruijn graph
- PreProcessing to reduce graph size
- Spurr removal algorithm = similar to Euler erosion procedure.
- Note: No spectral alignment filter (because authors thought it naive)
- Graph processing to reduce graph size
- Bounded search for bubbles
- Breadth-first-search - starts at nodes with multiple out edges
- Bounded because can have bubbles within bubbles, so finding all of them would be impractical
- Limits bubbles to those with a sequence similarity requirement on alternate paths
- Removes bubble path with fewer reads
- Step here about realignment I do not understand
- This operation risks ignoring polymorphisms or the collapse of near identical repeats
- Whole process similar to Euler's bulge remover and the bubble smoothing in OLC assemblers
- Bounded search for bubbles
- Read Threading
- Removes path with fewer reads than a threshold
- Risks removing low coverage sequence... but supposed to remove spurious connections induced by convergent sequencing errors
- Mate Pairs reduction
- Uses pebble algorithm (see Pebble and Rock Band)
- PreProcessing to reduce graph size
- Variables
- Length of k-mers (note this is constrained to be an odd number)
- The min frequency of expected k-mers in reads - determines which k-mers are pruned a-priori
- The expected coverage of genome - controls spurious connection breaking
- Uses Poisson statistics to estimate the likelihood a particular genomic region represents a collapsed repeat
- SUMMARY: does not have error-correction pre-processor, but has error avoidance filter. (not good for large genomes)
- velveth ./ 21,31,2 -long 7.GAC.454Reads.fna
- m <= k <= M with a step of s
- velvetg ./ -cov_cutoff auto -exp_cov auto
- Does extensive compression of de Bruijn graph
Veiwers
Consed
- Aligned Reads View
- Low quality base = darker and lower case
- High quality base = ligher (whiter) and upper case
- Black = unaligned
- Asterisk = spacer... like dash in blast
- Red letters = disagree with consensus read. Alot of red in high quality region
- Middle click = open up trace window
- Navigate menue for low quality reads
- Can also export consensus sequences
- Assembly View
- Dark green line = read coverage
- Can help identify contaminations... (if have area of high quality reads that drops suddenly to lower, or two contigs with different quality reads)
- Programs Compatible with Consed: Newbler, Phrap
Other
Jellyfish
- Used to identify k-mers
- PATH=/Users/letaylor/Documents/jellyfish1.1/bin:$PATH
- jellyfish count -m 22 -o outputfilename -c 3 -s 10000000 timshelRaw.fasta
- -c For sequencing reads, use a value for -c large enough to counts up to twice the coverage
- jellyfish stats -v /Users/letaylor/Desktop/Timshel/mer_counts_0
- Read stats output file from above
- jellyfish dump -o kmerlist -c /Users/letaylor/Desktop/Timshel/mer_counts_0
- jellyfish histo -o kmerhisto /Users/letaylor/Desktop/Timshel/mer_counts_0
Shustring
- Used to calculate the SHortest Unique SubSTRINGs (in C implementation)
- Global shustring
- Local shustring - local shustring is the shortest string that starts at a given sequence position and is unique.
SSAHA2
- USe to clean up tagged sequences