NHE Feb 9 Notes
From GcatWiki
Clustering:
- Grouping genes, samples, etc.
- May present the order in a way that does not reflect how one would expect or want it to appear
- Can be used to pull out patterns and make predictions
- Induction can look much more dramatic than repression
- Changing the y-axis to a logarithmic scale reveals much more drastic changes in repression
- Similar relationships could be caused by direct interaction or co-regulation
- Strong negative correlation could be as interesting as positive correlation
- Computers are more likely to group positive correlations
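A small sketch of the induction vs. repression point, assuming expression is reported as a treated/control ratio. The ratio values and names below are made up for illustration. On a linear axis a 4-fold induction sits at 4 while a 4-fold repression sits at 0.25; taking log2 puts them at the same distance from "no change".

```python
import math

# Hypothetical expression ratios (treated / control), for illustration only.
ratios = {"induced_4x": 4.0, "unchanged": 1.0, "repressed_4x": 0.25}

# Linear scale: induction ranges over 1..infinity, repression is squeezed into 0..1.
# Log2 scale: a 4-fold change in either direction has the same magnitude.
log_ratios = {name: math.log2(r) for name, r in ratios.items()}

for name, value in log_ratios.items():
    print(f"{name}: ratio={ratios[name]}  log2={value:+.1f}")
```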
- Gene Expression Profiles (Guilt by Association)
- comparing genes or expression levels in samples over time
- Similarity measures: correlation, Euclidean distance, Hamming distance, etc.
- Correlation is very sensitive to outliers; it also ignores magnitude, so very small changes (possibly just noise) can correlate strongly with much more drastic changes
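A minimal sketch of the three measures named above, written in plain Python. The function names and the toy profiles are assumptions for illustration, not data from the lecture. The two examples at the bottom show the two cautions: a single outlier can flip a correlation, and a tiny change can correlate perfectly with a large one.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions where two discretized profiles disagree."""
    return sum(a != b for a, b in zip(x, y))

# Outlier sensitivity: the first four points are perfectly anti-correlated,
# but one shared extreme value makes the profiles look positively correlated.
x = [1, -1, 1, -1, 10]
y = [-1, 1, -1, 1, 10]
r_without_outlier = pearson(x[:4], y[:4])
r_with_outlier = pearson(x, y)

# Magnitude is ignored: "small" barely moves, yet it tracks "big" perfectly.
big = [1.0, 2.0, 4.0, 8.0]
small = [1.01, 1.02, 1.04, 1.08]
r_small_vs_big = pearson(big, small)
d_small_vs_big = euclidean(big, small)

# Hamming distance works on discretized calls (hypothetical up/down labels).
calls_differing = hamming(["up", "down", "up"], ["up", "up", "up"])
```

Euclidean distance keeps the two profiles in the second example far apart, which is one reason the choice of measure changes which genes end up clustered together.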
- Linkage Methods
- Can take the nucleus/average of a cluster and compare that to a desired single gene
- Can average all the distances between all points
- Can compare the gene of interest to all the rest of the genes individually
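The three options above can give different answers for the same gene and cluster. A rough sketch, assuming Euclidean distance and made-up two-point profiles (the function names are hypothetical):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Average profile of a cluster (the cluster's 'nucleus')."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def centroid_distance(gene, cluster):
    """Compare the gene to the cluster average."""
    return euclidean(gene, centroid(cluster))

def average_distance(gene, cluster):
    """Average of the distances from the gene to every member."""
    return sum(euclidean(gene, m) for m in cluster) / len(cluster)

def individual_distances(gene, cluster):
    """Distance from the gene to each member, kept separate."""
    return [euclidean(gene, m) for m in cluster]

# Toy data: the gene sits exactly between two cluster members.
cluster = [[0.0, 0.0], [4.0, 0.0]]
gene = [2.0, 0.0]

d_centroid = centroid_distance(gene, cluster)       # gene equals the average profile
d_average = average_distance(gene, cluster)         # but is 2 away from each member
d_individual = individual_distances(gene, cluster)
```

Here the gene matches the cluster average exactly while being equally far from every actual member, so the linkage choice decides whether it looks like it belongs.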
- Hierarchical clustering
- Join the two most similar genes
- Join the next two most similar "objects" (genes or clusters of genes)
- Repeat until all genes have been joined
- Once clustered, two genes can, under no circumstances, under penalty of death, in no way possible on heaven or earth, be split apart
- No gene left behind. Every gene has to be in the cluster; joining starts at +1 correlation and ends at -1 correlation
- Genes that are most distant in the cluster (opposite correlation) could still be co-regulated
- Cutting the tree
- Draw a line to separate the clusters
- Decision of where to cut the tree is completely arbitrary
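The join-and-repeat procedure and the tree cut can be sketched as below. This is a simplified illustration, not the software used in class: it assumes Euclidean distance with average linkage, the gene names and profiles are invented, and `cut_height` is a hypothetical parameter standing in for where the line is drawn.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, profiles):
    """Mean distance over all gene pairs taken across two clusters."""
    total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def hierarchical(profiles, cut_height=math.inf):
    """Merge the two closest objects until one cluster remains,
    or until the closest pair is farther apart than cut_height."""
    clusters = [(name,) for name in profiles]  # every gene starts alone
    merges = []                                # record of joins, in order
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), average_linkage(clusters[i], clusters[j], profiles))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > cut_height:
            break  # "cutting the tree": stop joining above this height
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]     # joined genes are never split again
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical expression profiles over three time points.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [9.0, 8.0, 7.0],
    "g4": [9.2, 8.1, 7.1],
}

full_tree, all_merges = hierarchical(profiles)             # no cut: one big cluster
cut_clusters, _ = hierarchical(profiles, cut_height=1.0)   # cut: two clusters
```

Moving `cut_height` changes the number of clusters without changing the tree itself, which is why the cut is a judgment call.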
- K-means Clustering
- Specify how many clusters to form - just break into groups, not breaking the tree
- Randomly assign each gene to one of K different clusters
- Average expression of all genes in each cluster to create K pseudo genes
- Rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar
- Repeat until convergence
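A minimal sketch of the K-means steps listed above, assuming Euclidean distance as the similarity measure. One deliberate deviation, labeled in the code: genes are shuffled and dealt out in turn so that no cluster starts empty, whereas the notes only say "randomly assign". Gene names and profiles are invented.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(rows):
    """Average a list of profiles into one pseudo gene."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    names = list(profiles)
    rng.shuffle(names)
    # Step 1: random starting assignment. Assumption: dealt round-robin after
    # shuffling so that every cluster starts with at least one gene.
    assignment = {name: i % k for i, name in enumerate(names)}
    for _ in range(max_iter):
        # Step 2: average the genes in each cluster to make the pseudo genes.
        pseudo = {}
        for c in range(k):
            members = [profiles[n] for n in names if assignment[n] == c]
            if members:  # a cluster can empty out; skip it if so
                pseudo[c] = mean_profile(members)
        # Step 3: move every gene to the cluster of its most similar pseudo gene.
        new_assignment = {
            n: min(pseudo, key=lambda c: euclidean(profiles[n], pseudo[c]))
            for n in names
        }
        # Step 4: stop once nothing moves (convergence).
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

# Hypothetical expression profiles over three time points.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [9.0, 8.0, 7.0],
    "g4": [9.2, 8.1, 7.1],
}
assignment = kmeans(profiles, k=2)
```

The cluster labels (0 or 1) are arbitrary and depend on the random start; only the grouping is meaningful. On harder data, different starts can converge to different groupings.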
- Supervised Clustering
- Find genes in expression file whose patterns are highly similar to a desired gene or pattern
- Add closest gene first
- Then add gene that is closest to all genes already in cluster
- Repeat, as long as added gene is within specified distance to genes already in the cluster
- Distance from one gene to a set of genes is defined as the maximum (or minimum or average) of all distances to individual members of the set (complete, single, and average linkage, respectively)
- Dr. C says that it might be really nice to cluster genes that move with a transcription factor, since a little induction in the transcription factor can have major effects on other gene expression
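The supervised procedure can be sketched as a loop that grows a cluster outward from a chosen gene. This assumes Euclidean distance; `supervised_cluster`, `max_distance`, and the toy profiles are hypothetical names and values. Passing `max`, `min`, or `average` as `linkage` gives complete, single, or average linkage.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average(distances):
    distances = list(distances)
    return sum(distances) / len(distances)

def supervised_cluster(seed, profiles, max_distance, linkage=max):
    """Grow a cluster from `seed`, adding the closest gene each round
    while it stays within `max_distance` of the genes already included."""
    cluster = [seed]
    candidates = set(profiles) - {seed}

    def dist_to_cluster(name):
        # Gene-to-set distance: max, min, or average over current members.
        return linkage(euclidean(profiles[name], profiles[m]) for m in cluster)

    while candidates:
        # sorted() only makes tie-breaking deterministic.
        best = min(sorted(candidates), key=dist_to_cluster)
        if dist_to_cluster(best) > max_distance:
            break  # nothing left that is close enough
        cluster.append(best)
        candidates.remove(best)
    return cluster

# Hypothetical expression profiles over three time points.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [9.0, 8.0, 7.0],
    "g4": [9.2, 8.1, 7.1],
}

tight = supervised_cluster("g1", profiles, max_distance=1.0)          # complete linkage
loose = supervised_cluster("g1", profiles, max_distance=100.0, linkage=average)
```

To follow the transcription factor idea, the seed would be the transcription factor's own profile, and a correlation-based distance may suit better than Euclidean since small inductions should still count.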
- Quality Clustering: QT Clust
- Each gene builds a supervised cluster
- Gene with "best" list, and genes in its list, becomes the next cluster
- Remove these genes from consideration, and repeat
- Stop when all genes are clustered, or when the largest cluster is smaller than a user-specified threshold
- Make 'cliques', find the largest, separate that group, remove all members from that group from all other groups, and make the next largest group from the existing members.
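A rough sketch of the QT Clust loop described above. It assumes that the "best" list is simply the largest one, that distance is Euclidean with complete linkage, and that `max_distance` and `min_size` are the user-set thresholds; all names and data are invented for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def grow_cluster(seed, profiles, max_distance):
    """Supervised cluster around `seed` using complete linkage."""
    cluster = [seed]
    candidates = set(profiles) - {seed}

    def dist_to_cluster(name):
        return max(euclidean(profiles[name], profiles[m]) for m in cluster)

    while candidates:
        best = min(sorted(candidates), key=dist_to_cluster)
        if dist_to_cluster(best) > max_distance:
            break
        cluster.append(best)
        candidates.remove(best)
    return cluster

def qt_clust(profiles, max_distance, min_size=2):
    remaining = dict(profiles)
    clusters = []
    while remaining:
        # Every remaining gene builds its own candidate cluster.
        candidates = [grow_cluster(seed, remaining, max_distance)
                      for seed in sorted(remaining)]
        best = max(candidates, key=len)  # assumption: "best" means largest
        if len(best) < min_size:
            break                        # largest cluster is below the threshold
        clusters.append(best)
        for name in best:                # remove its genes from consideration
            del remaining[name]
    return clusters, sorted(remaining)

# Hypothetical profiles; g5 resembles nothing else.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [9.0, 8.0, 7.0],
    "g4": [9.2, 8.1, 7.1],
    "g5": [5.0, 0.0, 5.0],
}
clusters, unclustered = qt_clust(profiles, max_distance=1.0, min_size=2)
```

Unlike hierarchical clustering, genes that fit nowhere (here `g5`) can be left out rather than forced into a cluster.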