Difference between revisions of "NHE Feb 9 Notes"

From GcatWiki

Latest revision as of 19:35, 9 February 2016

Nick Elder

Group 2 intestines


Clustering:

  • Grouping genes, samples, etc.
    • May present the order in a way that does not reflect how one would expect or want it to be
  • Can be used to pull out patterns and make predictions
  • On a linear scale, induction can look much more dramatic than repression
  • Changing the y-axis to a logarithmic scale reveals much more drastic changes in repression
    • Similar relationships could be caused by direct interaction or co-regulation
    • Strong negative correlation could be as interesting as positive correlation
      • Computers are more likely to group positive correlations
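
A quick sketch of why the log scale matters, using made-up fold changes (the numbers and the helper name `log2_ratio` are illustrative assumptions, not from the lecture). On a linear scale an 8-fold induction moves 7 units away from "no change" while an 8-fold repression moves less than 1 unit; on a log2 scale the two are mirror images.

```python
import math

def log2_ratio(fold_change):
    """Return the log2 of an expression ratio (treated / control)."""
    return math.log2(fold_change)

# Hypothetical ratios: 1.0 means "no change".
induced = 8.0      # 8-fold induction
repressed = 0.125  # 8-fold repression

# Distance from "no change" on a linear axis: very lopsided.
linear_induction = induced - 1.0      # 7.0
linear_repression = 1.0 - repressed   # 0.875

# Distance from "no change" on a log2 axis: symmetric.
log_induction = log2_ratio(induced)
log_repression = log2_ratio(repressed)
```
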
  • Gene Expression Profiles (Guilt by Association)
    • Comparing genes or expression levels in samples over time
    • Similarity measures: correlation, Euclidean distance, Hamming distance, etc.
    • Correlation is very sensitive to outliers; also, very small changes, possibly due to noise, can be interpreted as correlated with more drastic changes
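
The three measures named above could be sketched roughly as follows. The gene profiles are invented for illustration, and applying Hamming distance to discretized up/down calls is an assumption about how it would be used here. The example shows the noise problem: `gene_small` barely moves, yet it correlates almost perfectly with `gene_big` because correlation only looks at the shape of the profile.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Raises ZeroDivisionError if either profile is completely flat."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions where two (discretized) profiles differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical time courses: one drastic change, one tiny change.
gene_big = [1.0, 2.0, 4.0, 8.0]
gene_small = [1.01, 1.02, 1.04, 1.08]

r = pearson(gene_big, gene_small)    # very close to +1
d = euclidean(gene_big, gene_small)  # large, because magnitudes differ
```
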
  • Linkage Methods
    • Can take the center (average) of a cluster and compare that to a desired single gene
    • Can average all the distances between all points
    • Can compare the gene of interest to all the rest of the genes individually
  • Hierarchical clustering
    • Join the two most similar genes
    • Join next two most similar "objects" (genes or clusters of genes)
    • Repeat until all genes have been joined
    • Once clustered, two genes can, under no circumstances, under penalty of death, in no way possible on heaven or earth, be split apart
    • No gene left behind. Everyone has to be in the cluster; the joins start near +1 correlation and end near -1 correlation
      • Genes that are most distant in the cluster (opposite correlation) could still be co-regulated
    • Cutting the tree
      • Draw a line to separate the clusters
        • Decision of where to draw the cut line is completely arbitrary
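
The join-and-repeat procedure above might look something like this minimal sketch. It assumes Euclidean distance and average linkage (the notes do not fix either choice), and it models "cutting the tree" by stopping once the next join would exceed a user-chosen `cut_distance`. The gene names and values are made up.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, profiles):
    """Average of all pairwise distances between two clusters."""
    total = sum(euclidean(profiles[g], profiles[h]) for g in c1 for h in c2)
    return total / (len(c1) * len(c2))

def hierarchical(profiles, cut_distance):
    """Repeatedly join the two closest objects (genes or clusters).
    Joins are never undone. Stopping at cut_distance plays the role
    of drawing the cut line across the tree."""
    clusters = [[name] for name in profiles]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cut_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles over three time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 3.1],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [5.2, 4.9, 5.1],
}
two_groups = hierarchical(profiles, cut_distance=1.0)
one_group = hierarchical(profiles, cut_distance=100.0)
```

Moving `cut_distance` changes how many clusters come out, which is the arbitrary part.
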
  • K-means Clustering
    • Specify how many clusters (K) to form; genes are simply split into K groups, with no tree to cut
    • Randomly assign each gene to one of K different clusters
    • Average expression of all genes in each cluster to create K pseudo genes
    • Rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar
    • Repeat until convergence
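
A rough sketch of the K-means steps above, following the random-assignment start described in the notes. Squared Euclidean distance, the `seed` and `max_iter` parameters, and the re-seeding of an empty cluster from a random gene are all assumptions added to make the example runnable.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_profile(rows):
    """Average several profiles position by position (a 'pseudo gene')."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    names = list(profiles)
    # Randomly assign each gene to one of K clusters.
    assign = {g: rng.randrange(k) for g in names}
    for _ in range(max_iter):
        # Average each cluster to create K pseudo genes.
        pseudo = []
        for c in range(k):
            members = [profiles[g] for g in names if assign[g] == c]
            if not members:
                # Assumption: an empty cluster borrows a random gene.
                members = [profiles[rng.choice(names)]]
            pseudo.append(mean_profile(members))
        # Move every gene to the cluster of its most similar pseudo gene.
        new_assign = {
            g: min(range(k), key=lambda c: sq_dist(profiles[g], pseudo[c]))
            for g in names
        }
        if new_assign == assign:  # convergence: nothing moved
            break
        assign = new_assign
    return assign

# Hypothetical expression profiles.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 3.1],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [5.2, 4.9, 5.1],
}
labels = k_means(profiles, k=2)
```

The cluster numbers themselves are arbitrary; only which genes share a number is meaningful.
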
  • Supervised Clustering
    • Find genes in expression file whose patterns are highly similar to a desired gene or pattern
    • Add closest gene first
    • Then add gene that is closest to all genes already in cluster
    • Repeat, as long as added gene is within specified distance to genes already in the cluster
    • Distance from one gene to a set of genes is defined as the maximum, minimum, or average of all distances to individual members of the set (complete, single, and average linkage, respectively)
    • Dr. C says that it might be really nice to cluster genes that move with a transcription factor, since a little induction in the transcription factor can have major effects on other gene expression
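
The supervised procedure could be sketched as below: start from a gene of interest, keep adding the closest remaining gene, and stop once the closest one falls outside the allowed distance. Euclidean distance, the function name `supervised_cluster`, and the toy one-point profiles are assumptions for illustration. The example shows how the linkage choice changes the result on a chain of genes.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Gene-to-set distance under each linkage rule.
LINKAGE = {
    "complete": max,                          # farthest member
    "single": min,                            # nearest member
    "average": lambda ds: sum(ds) / len(ds),  # mean over members
}

def supervised_cluster(profiles, seed_gene, max_distance, linkage="complete"):
    combine = LINKAGE[linkage]
    cluster = [seed_gene]
    candidates = [g for g in profiles if g != seed_gene]

    def dist_to_cluster(g):
        return combine([euclidean(profiles[g], profiles[m]) for m in cluster])

    while candidates:
        closest = min(candidates, key=dist_to_cluster)
        if dist_to_cluster(closest) > max_distance:
            break  # nothing left within the specified distance
        cluster.append(closest)
        candidates.remove(closest)
    return cluster

# Hypothetical chain of genes, each one unit from its neighbour.
chain = {"g1": [0.0], "g2": [1.0], "g3": [2.0]}
single = supervised_cluster(chain, "g1", max_distance=1.0, linkage="single")
complete = supervised_cluster(chain, "g1", max_distance=1.0, linkage="complete")
```

Single linkage walks along the chain and picks up g3, while complete linkage rejects it because g3 is two units from g1.
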
  • Quality Clustering: QT Clust
    • Each gene builds a supervised cluster
    • Gene with "best" list, and genes in its list, becomes the next cluster
    • Remove these genes from consideration, and repeat
    • Stop when all genes are clustered, or largest cluster is smaller than user specified threshold
    • Make 'cliques', find the largest, separate that group, remove all members from that group from all other groups, and make the next largest group from the existing members.
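
A minimal sketch of the QT Clust loop described above, assuming that the "best" list is simply the largest one, that candidate clusters are grown with complete linkage and Euclidean distance, and that `min_size` is the user-specified threshold. The helper names and profiles are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def grow(profiles, seed_gene, max_distance):
    """Supervised cluster around seed_gene (complete linkage)."""
    cluster = [seed_gene]
    candidates = [g for g in profiles if g != seed_gene]

    def dist(g):
        return max(euclidean(profiles[g], profiles[m]) for m in cluster)

    while candidates:
        closest = min(candidates, key=dist)
        if dist(closest) > max_distance:
            break
        cluster.append(closest)
        candidates.remove(closest)
    return cluster

def qt_clust(profiles, max_distance, min_size=1):
    remaining = dict(profiles)
    clusters = []
    while remaining:
        # Each remaining gene builds its own candidate cluster.
        candidates = [grow(remaining, g, max_distance) for g in remaining]
        best = max(candidates, key=len)  # assumption: "best" = largest
        if len(best) < min_size:
            break  # largest cluster is below the threshold
        clusters.append(best)
        # Remove these genes from consideration, then repeat.
        for g in best:
            del remaining[g]
    return clusters

# Hypothetical profiles: two tight pairs and one loner.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 3.1],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [5.2, 4.9, 5.1],
    "geneE": [9.0, 0.0, 9.0],
}
pairs_only = qt_clust(profiles, max_distance=1.0, min_size=2)
everything = qt_clust(profiles, max_distance=1.0, min_size=1)
```

With `min_size=2` the loner is left unclustered; with `min_size=1` it becomes its own cluster.
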