NHE Feb 9 Notes

From GcatWiki

Nick Elder

Group 2 intestines


Clustering:

  • Grouping genes, samples, etc.
    • May present the order in a way that does not reflect how one would expect or want it to be
  • Can be used to pull out patterns and make predictions
  • Induction can look much more dramatic than repression
  • Changing the y-axis to a logarithmic scale reveals much more drastic changes in repression
    • Similar relationships could be caused by direct interaction or co-regulation
    • Strong negative correlation could be as interesting as positive correlation
      • Computers are more likely to group positive correlations
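A small sketch of why the logarithmic scale matters. The ratio values below are made up for illustration and are not from the lecture. On a linear scale a 4-fold induction sits 3 units from "no change", while a 4-fold repression sits only 0.75 units away; after a log2 transform both are 2 units away, in opposite directions.

```python
import math

# Hypothetical expression ratios (treated / control); illustrative values only.
ratios = {"induced_4x": 4.0, "unchanged": 1.0, "repressed_4x": 0.25}

# Linear view: distance from a ratio of 1 (no change).
linear_change = {name: r - 1.0 for name, r in ratios.items()}

# Log view: log2 of the ratio, so induction and repression are symmetric around 0.
log_change = {name: math.log2(r) for name, r in ratios.items()}

for name in ratios:
    print(f"{name:>13}  linear={linear_change[name]:+.2f}  log2={log_change[name]:+.2f}")
```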
  • Gene Expression Profiles (Guilt by Association)
    • comparing genes or expression levels in samples over time
    • Similarity can be measured by correlation, Euclidean distance, Hamming distance, etc.
    • Correlation is very sensitive to outliers; also, very small changes, possibly due to noise, can appear correlated with much more drastic changes
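The measures named above can be sketched in a few lines of plain Python. The profiles are invented for illustration. The first pair shows a tiny change correlating perfectly with a drastic one; the second pair shows a single shared outlier flipping the correlation from strongly negative to strongly positive.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions where two discretized profiles disagree."""
    return sum(a != b for a, b in zip(x, y))

# Drastic change vs. very small change with the same shape (hypothetical data).
big = [1.0, 2.0, 4.0, 8.0]
small = [1.0, 1.1, 1.3, 1.7]          # equals 0.9 + 0.1 * big
r_shape = pearson(big, small)          # close to +1 despite the tiny amplitude
d_shape = euclidean(big, small)        # large, because the magnitudes differ

# Two nearly flat, opposite profiles that share one outlier time point.
x = [1.0, 1.1, 0.9, 1.0, 9.0]
y = [1.0, 0.9, 1.1, 1.0, 9.0]
r_without_outlier = pearson(x[:4], y[:4])
r_with_outlier = pearson(x, y)

# Hamming distance works on discretized calls such as up / down / flat.
calls_differ = hamming(["up", "up", "flat"], ["up", "down", "flat"])
```

Which measure is appropriate depends on whether the shape of the profile or its magnitude is the feature of interest.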
  • Linkage Methods
    • Can take the nucleus/average of a cluster and compare that to a desired single gene
    • Can average all the distances between all points
    • Can compare the gene of interest to all the rest of the genes individually
  • Hierarchical clustering
    • Joins two most similar genes
    • Join next two most similar "objects" (genes or clusters of genes)
    • Repeat until all genes have been joined
    • Once clustered, two genes can, under no circumstances, under penalty of death, in no way possible on heaven or earth, be split apart
    • No gene left behind. Every gene has to be in the cluster; starts at +1 correlation and ends at -1 correlation
      • Genes that are most distant in the cluster (opposite correlation) could be co-regulated together
    • Cutting the tree
      • Draw a line to separate the clusters
        • Decision of where to cut the tree is completely arbitrary
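The join-and-repeat procedure above can be sketched as follows. This is a minimal illustration rather than what any particular clustering program does: it assumes distance = 1 - Pearson correlation and average linkage, and the gene names, profiles, and the `cut` parameter are hypothetical. Passing `cut` stops merging once the closest pair of clusters is farther apart than the threshold, which gives the same groups as drawing a line across the finished tree at that height.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def avg_linkage(c1, c2, profiles):
    """Average of (1 - correlation) over all gene pairs between two clusters."""
    total = sum(1.0 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def hierarchical(profiles, cut=None):
    """Repeatedly join the two most similar objects; never split a joined pair."""
    clusters = [(name,) for name in profiles]   # every gene starts alone
    merges = []                                 # record of (left, right, distance)
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_linkage(clusters[p[0]], clusters[p[1]], profiles))
        d = avg_linkage(clusters[i], clusters[j], profiles)
        if cut is not None and d > cut:
            break                               # the "line" across the tree
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical profiles: A and B rise together, C and D fall together.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
full_tree, merge_log = hierarchical(profiles)        # everything ends in one cluster
two_groups, _ = hierarchical(profiles, cut=0.5)      # cut separates rising from falling
```

Note that in the full tree the last join links the rising and falling groups at a distance near 2, that is, a correlation near -1.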
  • K-means Clustering
    • Specify how many clusters to form - just break into groups, not breaking the tree
    • Randomly assign each gene to one of K different clusters
    • Average expression of all genes in each cluster to create K pseudo genes
    • Rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar
    • Repeat until convergence
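A minimal sketch of the K-means steps listed above, in plain Python. Several details are assumptions and not from the notes: squared Euclidean distance as the similarity measure, a fixed random seed, a cap on iterations, and re-seeding an empty cluster from a randomly chosen gene. The gene names and values are hypothetical.

```python
import random

def mean_profile(rows):
    """Average a list of profiles position by position (the 'pseudo gene')."""
    return [sum(col) / len(col) for col in zip(*rows)]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(profile, centroids):
    """Index of the pseudo gene most similar to this profile."""
    return min(range(len(centroids)), key=lambda c: sq_dist(profile, centroids[c]))

def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    names = list(profiles)
    # Step 1: randomly assign each gene to one of K clusters.
    labels = {name: rng.randrange(k) for name in names}
    centroids = []
    for _ in range(max_iter):
        # Step 2: average the genes in each cluster to make K pseudo genes.
        centroids = []
        for c in range(k):
            members = [profiles[n] for n in names if labels[n] == c]
            if members:
                centroids.append(mean_profile(members))
            else:
                # Assumption: an empty cluster is re-seeded from a random gene.
                centroids.append(list(profiles[rng.choice(names)]))
        # Step 3: reassign each gene to its most similar pseudo gene.
        new_labels = {n: nearest(profiles[n], centroids) for n in names}
        if new_labels == labels:
            break                      # Step 4: converged, nothing moved
        labels = new_labels
    return labels, centroids

# Hypothetical expression profiles over three time points.
profiles = {
    "g1": [0.1, 0.2, 0.1], "g2": [0.0, 0.3, 0.2], "g3": [0.2, 0.1, 0.0],
    "g4": [5.0, 5.2, 4.9], "g5": [5.1, 4.8, 5.3], "g6": [4.9, 5.0, 5.1],
}
labels, pseudo_genes = k_means(profiles, k=2)
```

Because the starting assignment is random, different seeds can give different final groupings; running several seeds and comparing the results is a common precaution.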
  • Supervised Clustering
    • Find genes in expression file whose patterns are highly similar to a desired gene or pattern
    • Add closest gene first
    • Then add gene that is closest to all genes already in cluster
    • Repeat, as long as added gene is within specified distance to genes already in the cluster
    • Distance from one gene to a set of genes is defined to be the maximum (or minimum or average) of all distances to individual members of the set (complete, single, and average linkage, respectively)
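The growth rule above can be sketched like this, assuming Euclidean distance and a user-chosen threshold. The function names, the `max_distance` parameter, and the toy profiles are all hypothetical. The seed here is a gene, but a desired pattern could be used the same way.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Gene-to-set distance: complete = maximum, single = minimum, average = mean.
LINKAGE = {
    "complete": max,
    "single": min,
    "average": lambda dists: sum(dists) / len(dists),
}

def set_distance(candidate, members, profiles, linkage):
    dists = [euclidean(profiles[candidate], profiles[m]) for m in members]
    return LINKAGE[linkage](dists)

def supervised_cluster(seed, profiles, max_distance, linkage="complete"):
    """Grow a cluster from a seed gene, closest gene first, until none qualifies."""
    cluster = [seed]
    remaining = sorted(set(profiles) - {seed})
    while remaining:
        best = min(remaining, key=lambda g: set_distance(g, cluster, profiles, linkage))
        if set_distance(best, cluster, profiles, linkage) > max_distance:
            break                      # closest remaining gene is already too far
        cluster.append(best)
        remaining.remove(best)
    return cluster

# Hypothetical profiles laid out as a chain, plus one distant gene.
profiles = {
    "tf": [0.0, 0.0], "g1": [1.0, 0.0], "g2": [2.0, 0.0],
    "g3": [3.0, 0.0], "far": [10.0, 0.0],
}
by_complete = supervised_cluster("tf", profiles, max_distance=1.6, linkage="complete")
by_average = supervised_cluster("tf", profiles, max_distance=1.6, linkage="average")
by_single = supervised_cluster("tf", profiles, max_distance=1.6, linkage="single")
```

On this chain-shaped data, single linkage keeps following the chain, complete linkage stops as soon as any member is too far from the candidate, and average linkage falls in between.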
    • Dr. C says that it might be really nice to cluster genes that move with a transcription factor, since a little induction in the transcription factor can have major effects on other gene expression
  • Quality Clustering: QT Clust
    • Each gene builds a supervised cluster
    • Gene with "best" list, and genes in its list, becomes the next cluster
    • Remove these genes from consideration, and repeat
    • Stop when all genes are clustered, or largest cluster is smaller than user specified threshold
    • Make 'cliques', find the largest, separate that group, remove all members from that group from all other groups, and make the next largest group from the existing members.
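A rough sketch of the QT Clust loop described above. It assumes the "best" list is simply the largest one, that each candidate list is grown with complete linkage (so `max_diameter` bounds the farthest pair in a cluster), and that distance is Euclidean. The names `qt_clust`, `max_diameter`, and `min_size`, and the toy data, are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def candidate_cluster(seed, pool, profiles, max_diameter):
    """Supervised cluster for one seed gene, restricted to genes still in the pool."""
    cluster = [seed]
    remaining = sorted(pool - {seed})

    def farthest_member(g):
        # Complete linkage: distance to the farthest gene already in the cluster.
        return max(euclidean(profiles[g], profiles[m]) for m in cluster)

    while remaining:
        best = min(remaining, key=farthest_member)
        if farthest_member(best) > max_diameter:
            break
        cluster.append(best)
        remaining.remove(best)
    return cluster

def qt_clust(profiles, max_diameter, min_size=1):
    pool = set(profiles)
    clusters = []
    while pool:
        # Every remaining gene builds its own candidate list.
        candidates = [candidate_cluster(g, pool, profiles, max_diameter)
                      for g in sorted(pool)]
        best = max(candidates, key=len)     # assumption: "best" means largest
        if len(best) < min_size:
            break                           # largest list is below the threshold
        clusters.append(sorted(best))
        pool -= set(best)                   # remove its genes and repeat
    return clusters, sorted(pool)           # leftover genes stay unclustered

# Hypothetical two-point profiles: one group of three, one pair, one loner.
profiles = {
    "a": [0.0, 0.0], "b": [0.5, 0.0], "c": [1.0, 0.0],
    "d": [5.0, 5.0], "e": [5.5, 5.0], "f": [20.0, 20.0],
}
clusters, unclustered = qt_clust(profiles, max_diameter=1.2, min_size=2)
```

Unlike hierarchical clustering, this approach can leave genes out: anything that cannot join a large enough list remains unclustered.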