Difference between revisions of "Lab Notebook"
Line 52: | Line 52: | ||
''Tuesday, January 19, 2016'' | ''Tuesday, January 19, 2016'' | ||
+ | |||
Accessing GCAT Unix Cluster | Accessing GCAT Unix Cluster | ||
Revision as of 17:27, 15 March 2016
Tuesday, January 12, 2016
When pythons feed → organs double in size What genes are involved in the organ doubling? One researcher looked - the doubling is over by 6 hours 12 different samples (2 organs, 6 snakes) organs were flash frozen to stabilize RNA
RNAseq: isolate the mRNA used a poly-T oligonucleotide beads (magnetic) with T tails to bind A tails of the mRNA get 2% of the RNA RNA fragmentation; turn mRNA into cDNA fragmentation in sequencing, get very short reads fragment RNA so that it’s random (chemical vs. restriction enzyme) more edges to read from need fragments of roughly the same size translation reverse transcriptase primers (ACACTCTTTCCCTACACGACGCTCTTCCGATCTNxxxNNNNNN) every possible heximer is generated temperature has to be cool enough that the primer bps anywhere that it occurs run on a gel, excise and purify confirm via PCR
task - figure out if there is a gene that is transcribed differently in a fed snake’s organ vs. an unfed snake’s organ
Thursday, January 14, 2016
Secor and Diamond 1995 responses to feeding include intestinal hypertrophy and up regulation of intestinal nutrient transporters and hydrolytic enzymes Karasov & Diamond 1987; Toloza et al. 1991 alot of growth is inner layer → brush border need a catalog of amino acid and sugar import proteins genes involved in uptake? housekeeping genes? for uniquely liver? for uniquely intestine? so that we know we got the right tissue when brush border gets bigger → is it cell division? plump up? proline:sodium symporter in e. coli http://www.ncbi.nlm.nih.gov/gene/945602 PutP http://www.genome.jp/kegg-bin/show_pathway?pbi00520
Tuesday, January 19, 2016
Accessing GCAT Unix Cluster
Your students now have accounts on the Juniata College HHMI cluster. The usernames are davidson01 through davidson18. The password for all is 4gcatRocks! Please have them change it as soon as possible (type passwd at the command prompt once logged in).
This is the standard info I send to new account holders which you may want to share with them:
Access the cluster using ssh (Putty or other) at IP 192.112.102.21. This will open in your home directory. The OS is CentOS. The only place you will be able to install anything is in your home directory or subdirectories that you create under it. You must use qsub to submit your jobs to the cluster (guide attached); do not attempt to run them on the headnode as it does not have the processor power and it can ‘break’ the cluster. I must kill any jobs running on the headnode to protect the cluster.
I’ve attached a document with a table of most of the software installed on the cluster. It is continually changing. Most programs are under /share/apps/Installs/walls or /share/apps/Installs/sickler. You can also enter $ module avail as we have created modules for most programs to load the path to your working directory. A list of the modules will be generated.
Please let me know if I can assist you further.
Regards, Chris Walls
Mac is built on Unix
INSTRUCTIONS set up folder structure on desktop (see my example for Bio343)
- log on using davidson##
w13426:~ user$ ssh -l user 192.112.102.21 submit password 4gcatRocks!
passwd (return and follow prompts)
- must have one cap letter, one number and one symbol
- Run this for each user’s home folder:
cd .. chmod a+rx user #gives ALL the ability to READ and EXECUTE scripts from my folder ls -al cd user
- copy files into their folders, install Trimmomatic in bin folder, etc.
mkdir fastQCLiver # or fastQCintestine cp -r ../campbell/fastQCLiver/ fastQCLiver # or fastQCintestine x 2
- the above code just copied Dr. C’s folder named fastQCLiver into our copy of fastQCLiver - do cp -r ../campbell/fastQCLiver/*.* fastQCLiver to copy the contents into our folder
- then do rm -r fastQCLiver to remove the extra copy
mkdir bin #make directory called bin cd bin #change directory
- do each line one at a time
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip unzip Trimmomatic-0.32.zip rm Trimmomatic-0.32.zip
- now move from GCAT cluster to your local desktop
- STEP 3: View fastQC reports on local machine
- open new terminal window (local machine)
cd Desktop/Bio343/fastQC scp -r user@192.112.102.21:/home/user/fastQCorgan/ . #use davidson13
- for user; include “.”
- In finder, unzip all zip files
Go to each folder and open html file Explore quality of reads for trimmomatic options
- Copied files from Dr. C’s folder on liver sequences to our file on liver (fastQCLiver)
Discussion what are the take aways? what is useful? annotated bib? what is helpful, takeaways etc.
Secor et al. 2001 a lot of hormones/peptides are elevated post-feeding intestine begins to change before the meal reaches it our hypothesis -> the act of eating starts the change RIA - used the isotope iodine 125 scary stuff not a trivial assay if we have a ligand → think receptor aromatic hormone receptors are in the cytoplasm might be pmb bound receptor
Ott and Secor 2007
change in mass is from hypertrophy, not from cell replication (fig 6)
where do the extra uptake proteins come from? looking for mass increase mucosal layer - what’s mucus made of?
Thursday, January 21, 2016 FastQC Reports Per base sequnce quality above 20 is good quality reads 40 is the max Per tile sequence quality flagged as an ! dunno if that’s going to effect us Per sequence quality scores most sequences have high quality (peak → 38-39) a few reads below 20, but not many Per base sequence content First 3 → barcodes
i L 1 AGG CCG 2 CGG CAA 3 AAC CAG 4 AGC GGA 5 ACG CCA 6 AGA CAC
we don’t want the barcodes on our sequences → not a part of the RNA going to run a trimmatic to trim off the barcodes (1st 4 bases) quality score → below 15 → delete sequence less than 30 nt → toss Per sequence GC content liver is more jagged
Per base N content N → base don’t know Sequence Duplication Levels
Castoe et al. 2013 transcriptome categories snake → redesigned metabolic pathway
Assignment due before class Thursday 1/28 → what we’ve talked about so far, where we’re going, how we’re going to take the data we have so far and move forward?
Tuesday, January 26, 2016
word file should include
papers that we’ve read
both rounds of the fastQC
first round
trimmed round
these two rounds are in the same folders → download again
compare pre and 3post trimming
include papers
what we’ve done so far
research plan → conceptually, what do we need to do next?
reasoned and explained answers
wants to see we’ve thought about it some, reasonable suggestions
Today -
go over aper
download fastQC (post-trim)
Rsem is still not working
Andrew et al. 2015 give a lot of specific genes and pathways wnt pathway see cell cycle/cell growth things apoptosis around 6h NIH funds → the model system, selling a product the NIH bought - cancer applications, human health, engineering is this a good enough argument before, during and after feeding reads vs. transcripts
downloaded trimmed files:
- UN: davidson13
- PW pingDuck21!
- log on using davidson##
w13426:~ user$ ssh -l user 192.112.102.21 submit password 4gcatRocks!
passwd (return and follow prompts)
- must have one cap letter, one number and one symbol
cp -r ../campbell/fastQCLiver/*.* fastQCLiver #copies the contents into our folder
- open new terminal window (local machine)
cd Desktop/Bio343/fastQC scp -r davidson13@192.112.102.21:/home/davidson13/fastQCLiver/ .
- include the “ .”
- will ask for pw
- In finder, unzip all zip files
Go to each folder and open html file Explore quality of reads for trimmomatic option
Thursday, January 28, 2016 Tasks: how do people normalize gene expression given that you don’t get the same number of reads in every sample how do we know we have the right tissue what information do we have right now that we can use? blast the overrepresented sequences first pass at what tissue was sampled how to use SRP database how do we query it download gene file online -rename ones that say “protein of unknown fxn” 1 i and 1 l → SOP 4 L and 4 i → blast over represented sequences and compile 1 list 1 L pair and 1 i pair → how to use SRP051827 2 L and 2 i → gene name list (1) protein of unknown fxn proof (2) others that appear multiple times
Tuesday, February 2, 2016 we ran all 12 of our samples through DESeq which gave us the FPKM number normalized length of the gene per million reads
library(rnaseqWrapper)
- Load your data
- the folder chosen in setwd needs to be customized for user and project
- need subfolders geneResult and annotations, copied from server
setwd("/Users/canaso/Desktop/Bio343") countData <- mergeCountFiles("geneResult/")
- View your data
head(countData)
- Pull out counts, make them integers, and view them
myCountData <- round( countData[,grep(".expected_count",names(countData))],0) names(myCountData) <- gsub(".expected_count","",names(myCountData)) head(myCountData)
- compare organs
- Run DESeq - this saves files to folder given by outNamePrefix
deOut <- DESeqWrapper(myCountData, # Our count data to use
conditions=c("intestine","Liver"), # conditions to compare outNamePrefix="DEseq/") # Where to save the outputs
- Save the DESeq R object:
save(deOut,file="DEseq/deOut_Lvi.Rdata")
- Save qvals for easier use
myQvals <- deOut$deOutputs$intestinevsLiver$padj
- Limit data to interesting genes
toPlot <- as.matrix(myCountData[myQvals < 0.05 & !is.na(myQvals),])
- Make a heatmap
heatmap.mark(toPlot)
- Add color and a legend to show more of the options
heatmap.mark(toPlot,
cexCol=.9, # make column labels smaller ColSideColors = rep(c("red","blue"),each=6), # match column order scaleLabel="") # turn off label to leave room
legend(x="topleft",inset=c(-.01,.13), # where the label should go
bty="n", cex=.5, # no box around it, and set the size legend=c("intestine","Liver"), # What should be there fill=c("red","blue"), # Colors to use title="Conditions") #Label for the legend
- run this every time you open R for this kind of analysis
library(rnaseqWrapper)
- Load your data
- the folder chosen in setwd needs to be customized for user and project
- need subfolders geneResult and annotations, copied from server
setwd("/Users/canaso/Desktop/Bio343") countData <- mergeCountFiles("LiverResults/")
- View your data
head(countData)
- Pull out counts, make them integers, and view them
myCountData <- round( countData[,grep(".expected_count",names(countData))],0) names(myCountData) <- gsub(".expected_count","",names(myCountData)) head(myCountData)
- compare organs
- Run DESeq - this saves files to folder given by outNamePrefix
deOutLiver2 <- DESeqWrapper(myCountData, # Our count data to use
conditions=c("Liver_no","Liver_fed"), # conditions to compare outNamePrefix="DEseq/") # Where to save the outputs
- Save the DESeq R object:
save(deOutLiver,file="DEseq/deOut_Liver_no_vs_Liver_fed2.Rdata")
- Save qvals for easier use
myQvals <- deOutLiver2$deOutputs$Liver_novsLiver_fed$pval
- Limit data to interesting genes
toPlot <- as.matrix(myCountData[myQvals < 0.05 & !is.na(myQvals),])
- Make a heatmap
heatmap.mark(toPlot)
- Add color and a legend to show more of the options
heatmap.mark(toPlot,
cexCol=.9, # make column labels smaller ColSideColors = rep(c("red","blue"),each=3), # match column order scaleLabel="") # turn off label to leave room
legend(x="topleft",inset=c(-.01,.13), # where the label should go
bty="n", cex=.5, # no box around it, and set the size legend=c("intestine","Liver"), # What should be there fill=c("red","blue"), # Colors to use title="Conditions") #Label for the legend
Thursday, February 4, 2016
Questions to Answer:
What is it we want out of our research - what would be the perfect outcome?
What genes are differentially expressed immediately post feeding? What will trigger the cascade? Some sort of transcription activator? Or is the activator already there? Is the factor already there or is the receptor activating that source of proteins? transcripts of transcription factor → follow that first
What starts everything else???
Is the liver growing due to cell replication or hypertrophy?
How do we get there? 1. What constitutes a good sample? confirm sampling accuracy - housekeeping genes did I sample what I expected?
various papers have lists
verify that the liver is producing liver transcripts
2. Look at genes similarly expressed across all fed individuals
then compare to see what of those similarly expressed genes are differentially expressed in fed vs. unfed
THEN
3. Look at genes differentially expressed between fed vs. unfed then compare to see which of the differentially expressed genes are most similar within fed snakes
4. blast against andrew et. al database
what about pull out everything that is a transcription factor, then look at what’s differentially expressed? how to assign gene ontology terms to the genes that we have?
What are we going to do with each of our 12 data sets to evaluate it to know how to treat it downstream?
Correlation Exercise
R^2 → how well does the line explain the data? how good the fit is ⇒ correlation
Tuesday, February 9, 2016 Cluster Analysis, an Overview What does clustering mean? What does it mean for this class? last week, the heat maps were “grouping” genes together and presenting them in some order is there a better way to organize the genes? different approaches, different algorithms for clustering Why cluster? Data reduction analyze representative data points, not the whole dataset Hypothesis generation gain understanding of patterns in data, so they may be tested statistically Hypothesis testing e.g. “Big companies invest abroad” Gene expression data “expression level” meaning ratio each line is a particular gene one highlighted gene is induced 16 fold one highlighted gene is repressed 16 fold but the induction LOOKS much more dramatic what mathematical thing could you do to change this graph to see that the one at the bottom is more dramatic Log transformation calculate log2 of each ratio ratio of 16 becomes value of 4 ratio of .0833 (1/16) becomes value of -4 induction and repression look equal, but opposite sign negative correlation can be just as interesting as positive correlation Intensity plots color - heatmaps another way of presenting the plotted data from before Comparing Gene Expression Profiles or Guilt by Association Proximity measures correlation Euclidean distance normal distance - from one point to another ((x1-x2)2 + (y1-y2)2)½ still a distance formula for these data sets - each time point is a different dimension Inner product xTy Hamming distance L1 distance Dissimilarities may or may not be metrics Triangle inequality d(x,z) <= d(x,y) + d(y,z) loosely referred to as distance Correlation is very sensitive to outliers or biologically insignificant changes linkage methods how do you compare one thing to a group of things? find the centroid of the cluster - treat that as one gene even though it’s not average all the distances? single linkage, complete linkage, average linkage, meteoroid linkage Hierarchical clustering join two most similar genes these genes cannot be separated regardless of how the rest of the data set turns out join next two most similar “objects” (genes or clusters of genes) Repeat until all genes have been joined everybody is included start at +1 correlation, end at -1 correlation Cutting the tree cutting into groups K-means Clustering specify how many clusters to form randomly assign each gene to one of k different clusters average expression of all genes in each cluster to create k pseudogenes rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar repeat until convergence Supervised Clustering find genes in expression file whose patterns are highly similar (“close”) to desired gene or pattern add closest gene first then add gne that is closest to al genes already in cluster repeat, as long as added gnee is within specified distance of genes already in cluster distance from one gene to a set of genes defined to be maximum (or minimum, or average) of all distances to individual members of the set (complete, single, and average linkage, respectively) define a pattern e.g. for intestine - glucose transporters - pull out all the genes that are expressed like this try to get rid of the noise first and then look at a heat map Qualiy Clustering: QT Clust (Dr. Heyer helped invent) each gene builds a supervised cluster gene with “best” list, and genes in its list, becomes next cluster remove these genes from consideration, and repeat stop when all genes are clustered, or largest cluster is smaller than user specified threshold no single way to create clusters chasing everything straight away isn’t necessarily the best method Thursday, February 11, 2016 Today - break into 4 groups Cody, Nick, Julia, Melpi → Blast2Go gene ontology on all of our genes GO terms through gene ontology download and connect to our terms? Ashlyn, Elise, Catherine, Brian, Elizabeth → DAVID? Caroline, Morgan, Connor, Sarah, Katie → DESeq Dr. C will be looking through “Using R at the Bench”
Notes from DESeq Manual DESeq package expects count data each column stems from an independent biolgoical replicate
- read in text file
> pasillaCountTable = read.table( datafile, header=TRUE, row.names=1 ) > head( pasillaCountTable )
- read in as a CountDataSet (cds = )
- estimate the dispersions
> cds = estimateDispersions( cds )
- look for differentially expressed genes
>res = nbinomTest (cds, “untreated”, “treated”
- returns:
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
- plot log2 fold changes against mean normalized count
>plotMA(res)
- can also look at a histogram of p values
- filter for significant genes
→ can list the most significantly expressed genes (p7) → or most significantly down regulated
write.table and write.csv to save the output to a file
Tuesday, February 16th, 2016 presentations are going to be after spring break verified liver samples as liver https://docs.google.com/spreadsheets/d/1sCD9ARc4k-aGWEMDMdcHBLJ0E9gEpjVl5_iFKE7VX2M/edit#gid=0 house keeping genes acquired from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1888706/
Thursday, February 18th, 2016 going over Dr. Heyer/Dr. C’s results from last week value may be small if it’s on that list, it’s a good candidate 1 - correlation t → transforming x and y axis of the table start working with the code from Dr. C and Dr. H compare DE genes with diff p values to make names/labels show up change margins heatmap(_____, margins=c(5,5))
Monday, February 22, 2016 realized you shouldn’t use FPKM with DEseq because DEseq expects count values
- Code to load data, build and display heatmap of supervised cluster
- Author: Laurie Heyer, tweaked by A. Malcolm Campbell
- TODO change display to re-order rows in order of distance from GOI
- TODO make this a function with arguments as name of GOI and thresh
- run this every time you open R for this kind of analysis
library(rnaseqWrapper)
- Load your data
- the folder chosen in setwd needs to be customized for user and project
- need subfolders geneResult and annotations, copied from server
setwd("/Users/macampbell/Desktop/Bio343") countData <- mergeCountFiles("genes_Liver/")
- Pull out counts, make them integers, and view them
myCountData <- round( countData[,grep(".expected_count",names(countData))],0) names(myCountData) <- gsub(".expected_count","",names(myCountData))
- filtering to keep only those genes whose mean expression is > 10,
- play with this value to see how the size of toSearch changes.
- filter out genes with zero
myMeans <- apply(as.matrix(myCountData), 1, mean) toSearch <- myCountData[myMeans > 10 & !is.na(myMeans),]
- Choose gene of interest by name from toSearch with filtered means >10,
- could compare with myCountData; goi = gene of interest
goi = which(rownames(toSearch)=="Contig12414_Npffr1_Neuropeptide_FF_receptor_1_Rattus_norvegicus")
- Compute distance from goi to every other gene using (1-corr)/2
- If you don’t get enough hits, you could go back and do this for myCountData instead.
- goiData = myCountData[goi,]
goiData = toSearch[goi,] corrgoi = (1 - cor(t(goiData), t(toSearch)))/2
- Get count data for genes within threshold distance of goi looking at toSearch only
- could compare with myCountData, but result may be too big and gum up R
- threshold currently set 0.1 but you can adjust to experiment with cluster size
thresh = 0.1 supClustCounts = toSearch[corrgoi < thresh ,] # ordered by gene number supClustOrder = order(corrgoi)[1:nrow(supClustCounts)] supClustCounts = toSearch[supClustOrder ,] # ordered by distance from goi
- Save supervised cluster to CSV file in working directory
write.csv(supClustCounts,file="Npffr1.csv")
- could print/save just the gene names stored in variable genes
- genes = rownames(supClustCounts)
- Plot heatmap -- wanted rows in order of similarity to goi, but couldn't figure this out
heatsup = heatmap.mark(as.matrix(supClustCounts), scale="none", margins = c(15,10),
distfun=function(x) as.dist(( 1 - cor(t(x)) )/2))
- Image may be way to display data in order of similarity (unclustered)
- image(as.matrix(t(supClustCounts)),col=topo.colors(100))
began to id potentially interesting de genes identified by DEseqwith p = 0.05 https://docs.google.com/spreadsheets/d/1TuR0LYiUCAYvqTd3MvLjHGNawoxmHRzSaoUJG9RuTlE/edit#gid=0
Thursday, February 25, 2016 “liver people just keep doing liver things” Tuesday, March 6, 2016 what images will you need to tell the story next week → group presentations http://www.genome.jp/tools/kaas/help.html http://www.genome.jp/kaas-bin/kaas_main?mode=partial Thursday, March 10, 2016 over the last couple of days, we realized that the last list of genes was incorrectly generated generated a new list using the new code from Dr. H Made a google doc with the sequences in FASTA formatting to run through the kaas Sunday, March 13, 2016 Met with sarah to run genes through KAAS Used the genomes “hsa, mmu, gga, cfa, bta, ptr, xla, xtr, rno, ssc, tgu, ocu, cge, dme, mcf, pbi, acs, cmy, pss, amj, asn” which are all available genomes that we had matches for (Xenopus, homo sapiens, etc.) and all reptile genomes