JP Feb 18 16
Drs. C & H findings:
Used Euclidean distance for clustering. Changed the input from z-scaled values to absolute values.
Then looked at correlation distance (1 - correlation): the clustering was different, and so was the dendrogram.
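A minimal sketch of why the two distances can disagree. The gene names and profiles are made up for illustration, and this is not Drs. C & H's actual code: Euclidean distance groups profiles by magnitude, while 1 - correlation groups them by shape.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """Distance used for the second clustering: 1 - correlation."""
    return 1.0 - pearson(a, b)

# Hypothetical profiles (not real data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [10.0, 20.0, 30.0, 40.0],  # same shape as geneA, larger magnitude
    "geneC": [2.0, 1.0, 4.0, 3.0],      # similar magnitude to geneA, different shape
}

def nearest(seed, distance):
    """Name of the profile closest to `seed` under the given distance."""
    others = [g for g in profiles if g != seed]
    return min(others, key=lambda g: distance(profiles[seed], profiles[g]))

print(nearest("geneA", euclidean))             # geneC
print(nearest("geneA", correlation_distance))  # geneB
```

With absolute (unscaled) values, magnitude dominates the Euclidean distance, so it is not surprising that the two dendrograms differ.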
They are now working on supervised clustering: make a CSV to export gene names based on clustering. Given a seed gene, output the genes correlated with it.
- Transcription factors are not usually highly transcribed (small values); we look for things that are transcribed after feeding.
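The seed-gene export described above could look roughly like this. It is a sketch only: the function names, the 0.9 threshold, and the example profiles are assumptions, not what Drs. C & H are building.

```python
import csv
import io
import math

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0

def correlated_genes(expression, seed, threshold=0.9):
    """Return (gene, r) pairs with r >= threshold against the seed, best first."""
    seed_profile = expression[seed]
    hits = []
    for gene, profile in expression.items():
        if gene == seed:
            continue
        r = pearson(seed_profile, profile)
        if r >= threshold:
            hits.append((gene, r))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits

def write_hits_csv(hits, handle):
    """Write the hits as a two-column CSV: gene name, correlation."""
    writer = csv.writer(handle)
    writer.writerow(["gene", "correlation"])
    for gene, r in hits:
        writer.writerow([gene, f"{r:.3f}"])

# Hypothetical expression table (not real data).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [10.0, 20.0, 30.0, 40.0],
    "geneC": [2.0, 1.0, 4.0, 3.0],
}

buffer = io.StringIO()
write_hits_csv(correlated_genes(expression, "geneA"), buffer)
print(buffer.getvalue())
```

Because correlation ignores magnitude, a lowly transcribed gene such as a transcription factor can still be picked up as long as its profile tracks the seed.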
Our snake 1-6 data (Excel files) were mapped to Todd's python genome (text file) to associate sequences with gene names. If we can run Todd's "protein of unknown function" sequences through Blast2GO, we can get gene names and GO terms for these unknown proteins. Because they are labeled, we can match each label (e.g. "...unknown_function_20") to the row in our output with the same label and its frequencies, and find the GO terms for it.
We believe this method will yield new results, as Todd did his genome a while ago and new information may have become available since then.
For example, Todd's data contains:
>Contig263_Protein_of_unknown_function
ATGATGATAATAACGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATG
ATGATAACAACAATAATACACATTAAAAGTATCTCAAACCGACAGAGACATAAGGGGATG
GGAAATCTCCTTAGACTGCTCTTTGAGATGAACCTTGGTATAGACTTGGTGGACTTTGGA
CTTTTCTCCAGCTCTCTGGATTATCTCAAGTGGCTTACCTCAAGATTGCAGATCTTGTCA
TGA
We gave this a number and ran our data through Todd's data set. A row from our Excel file (not the same protein as above):
Contig1001_Protein_of_unknown_function_2387 Contig1001_Protein_of_unknown_function_2387 1077 1007.87 0 0 0
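A sketch of how the numbering step might be done. The notes do not say how the numbers were assigned, so the running counter in file order is an assumption, as are the function name and the extra example headers.

```python
def number_unknown_headers(fasta_lines, tag="Protein_of_unknown_function"):
    """Append a running number to every FASTA header that ends with `tag`.

    Assumption: numbers are assigned sequentially in file order.
    Sequence lines and all other headers are passed through unchanged.
    """
    counter = 0
    numbered = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">") and line.endswith(tag):
            counter += 1
            line = f"{line}_{counter}"
        numbered.append(line)
    return numbered

# Hypothetical three-record input; sequences are shortened placeholders.
example = [
    ">Contig263_Protein_of_unknown_function",
    "ATGATGATAATAACG",
    ">Contig9_Some_named_protein",
    "ATGTGA",
    ">Contig1001_Protein_of_unknown_function",
    "ATGTAA",
]

for line in number_unknown_headers(example):
    print(line)
```

The numbered header then serves as the unique label that ties a sequence to the matching row in our output.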
Now if we run Todd's sequence for "Protein...2387" through Blast2GO, we can get GO terms for this unknown (now known) protein, and attribute those terms to the proteins in our data.
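The label matching could be sketched as a simple lookup. The Blast2GO export format is not assumed here: `go_by_label` is a hypothetical dictionary we would build from whatever Blast2GO gives us, and the GO terms shown are placeholders, not real annotations.

```python
def attach_go_terms(rows, go_by_label):
    """Pair each row of our output with the GO terms found for its label.

    rows: lists of strings, with the label in the first column.
    go_by_label: hypothetical mapping from label to a list of GO terms.
    Rows whose label has no annotation get an empty list.
    """
    annotated = []
    for row in rows:
        label = row[0]
        annotated.append((label, row[1:], go_by_label.get(label, [])))
    return annotated

# The row from our Excel file, split on whitespace.
rows = [
    "Contig1001_Protein_of_unknown_function_2387 "
    "Contig1001_Protein_of_unknown_function_2387 1077 1007.87 0 0 0".split()
]

# Placeholder annotations (made up for illustration).
go_by_label = {
    "Contig1001_Protein_of_unknown_function_2387": ["GO:placeholder_1", "GO:placeholder_2"],
}

for label, values, terms in attach_go_terms(rows, go_by_label):
    print(label, values, terms)
```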
There are 3,521 proteins of unknown function. It takes about 10 minutes to run 10 sequences, so running all of them will take about 59 hours.
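The estimate as a quick calculation, assuming run time scales linearly with the number of sequences and the batches run one after another:

```python
def estimate_hours(n_sequences, batch_size=10, minutes_per_batch=10):
    """Estimated total run time in hours, assuming linear scaling."""
    total_minutes = n_sequences * minutes_per_batch / batch_size
    return total_minutes / 60

hours = estimate_hours(3521)
print(f"{hours:.1f} hours")  # 58.7 hours, i.e. about 59
```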