DM Notes 2.09.16
Take-home points from the correlation exercise: subtle changes in a gene's expression values can lead to drastically different correlation coefficients. For example, in scenario three, a gene that is close to evenly expressed across samples can show a drastically different correlation if one point changes from 7 to 6. So noise can be deceptive and lead us to believe that there are patterns where there aren't any. Mathematical correlations don't always have biological significance.
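A small sketch of that effect, using made-up numbers rather than the actual values from the class exercise. The `pearson` helper and the three profiles are assumptions for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles (not the data from the exercise).
reference = [1, 2, 3, 4, 5]   # a gene with a clear upward trend
before = [6, 7, 7, 7, 7]      # nearly flat gene
after = [6, 7, 7, 7, 6]       # same gene, last point changed from 7 to 6

r_before = pearson(reference, before)
r_after = pearson(reference, after)
print(round(r_before, 2))  # about 0.71
print(round(r_after, 2))   # essentially zero: the apparent correlation is gone
```

Because the nearly flat gene has so little variance, a single one-unit change is enough to move the coefficient from about 0.71 to about 0.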
Dr. Heyer- Clustering: Presentation Notes
Grouping genes (and treatments) based on some characteristic (such as gene expression), and presenting them in some order to help draw out biological significance. There are different approaches/algorithms for clustering. Why cluster? Data reduction (analyze representative data points), hypothesis generation (gain understanding of patterns), hypothesis testing, prediction based on groups (cluster cancer patients, predict outcomes).
Gene Expression Data Example: One highlighted gene is induced 16 fold. One highlighted gene is repressed 16 fold. On a linear scale, induction looks much more dramatic than repression: a change from 1 to 16 is much more noticeable than a change from 1 to 1/16. (Figure: time vs expression. Each line is a gene.) Solution: log scale. With log base 2, induction and repression look equal in size but opposite in sign. Possibilities for the two highlighted genes: they are co-regulated, or repression of one induces the other and vice versa.
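A quick check of the log base 2 point, as a minimal sketch. The two time courses are invented for illustration; only the 16-fold figures come from the notes.

```python
import math

def log2_ratios(ratios):
    """Convert expression ratios (relative to the starting level) to log base 2."""
    return [math.log2(r) for r in ratios]

# Hypothetical time courses ending at 16-fold induction and 16-fold repression.
induced = [1, 2, 4, 8, 16]
repressed = [1, 1 / 2, 1 / 4, 1 / 8, 1 / 16]

log_induced = log2_ratios(induced)
log_repressed = log2_ratios(repressed)
print(log_induced)    # [0.0, 1.0, 2.0, 3.0, 4.0]
print(log_repressed)  # [0.0, -1.0, -2.0, -3.0, -4.0]
```

On the raw scale the repressed gene only moves between 1 and 0.0625, while the induced gene moves between 1 and 16. After the transform the two curves are mirror images of each other.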
Comparing Gene Expression Profiles, or, Guilt by Association:
Proximity Measures: correlation, Euclidean distance, inner product (x^T y), Hamming distance, L1 distance
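The proximity measures listed above can be written out in a few lines each. This is a plain-Python sketch: the function names and example profiles are assumptions, and Hamming distance is applied here to discretized calls (such as "up"/"down"), which is one common way to use it.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """Sum of absolute differences (also called Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def hamming_distance(x, y):
    """Number of positions at which two discretized profiles differ."""
    return sum(a != b for a, b in zip(x, y))

def pearson_correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles: y has the same shape as x at twice the scale.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

print(pearson_correlation(x, y))  # perfectly correlated (same shape)
print(euclidean_distance(x, y))   # but not at zero distance (different magnitude)
print(l1_distance(x, y))
print(inner_product(x, y))
print(hamming_distance(["up", "down", "up", "up"], ["up", "up", "up", "down"]))
```

The example shows why the choice matters: correlation treats x and y as identical in shape, while the distance measures see them as different.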
Linkage Methods: how do you compare one gene to a group of genes? Perhaps compare to an average value for the cluster. Or, average all the distances within the cluster. Or, say that it's close to the cluster provided that it's closest to one object in the cluster.
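The three options above for comparing one gene to a group could be sketched as follows. Euclidean distance is an assumption here (any proximity measure would do), and the function names and toy values are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid_distance(gene, cluster):
    """Compare the gene to the average profile of the cluster."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return euclidean(gene, centroid)

def average_distance(gene, cluster):
    """Average the gene's distances to every member of the cluster."""
    return sum(euclidean(gene, member) for member in cluster) / len(cluster)

def single_distance(gene, cluster):
    """Use only the distance to the closest member of the cluster."""
    return min(euclidean(gene, member) for member in cluster)

# Toy two-gene cluster and one gene to compare against it.
cluster = [[0.0, 0.0], [4.0, 0.0]]
gene = [4.0, 3.0]

print(centroid_distance(gene, cluster))  # sqrt(13), about 3.61
print(average_distance(gene, cluster))   # 4.0
print(single_distance(gene, cluster))    # 3.0
```

The same gene is at three different "distances" from the same cluster depending on the rule, so the linkage choice can change which genes get joined.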
- Hierarchical Clustering: join the two most similar genes. Join the next two most similar objects (genes or clusters of genes). Repeat until all genes have been joined. However, once joined, gene pairs can never be pulled apart (the method doesn't look at the data globally first; it starts with the most similar pairings). Linkage methods decide how similar a gene is to a group. Potential issue: no gene can be left behind. Genes that look far apart (negative correlation) might be directly negatively related, and we can't see that. Cutting the tree: draw a vertical line through the tree to break it into groups.
- K-means Clustering: specify how many clusters to form. Randomly assign each gene to one of k different clusters. Average expression of all genes in each cluster to create k pseudo genes. Rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar. Repeat until convergence.
- Supervised Clustering: Find genes in the expression file whose patterns are highly similar ("close") to a desired gene or pattern. Add the closest gene first. Then add the gene that is closest to all genes already in the cluster. Repeat, as long as the new gene is within a specified distance of the cluster.
- QT Clust (Quality Clustering): Each gene builds a supervised cluster. The gene with the 'best' list, together with the genes in its list, becomes the next cluster. Remove these genes from consideration and repeat. Stop when all genes are clustered, or when the largest cluster is smaller than a threshold. Because some candidate groups overlap, there needs to be a rule for which group to take (in this case, the one with the most genes in it). Pull that group out, and those genes can't be part of any other group.
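The hierarchical clustering steps in the list above can be sketched as a naive agglomerative loop. This assumes Euclidean distance and average linkage, neither of which the notes fix, and the toy profiles and function names are hypothetical. It records the order of the joins rather than drawing a tree.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(profiles[i], profiles[j])
             for i in cluster_a for j in cluster_b]
    return sum(dists) / len(dists)

def hierarchical(profiles):
    """Return the joins, in order, as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(profiles))]  # every gene starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two most similar objects (genes or clusters of genes).
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        # Joined objects are never pulled apart again.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy profiles: genes 0 and 1 are close, genes 2 and 3 are close.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = hierarchical(profiles)
for a, b, dist in merges:
    print(a, b, round(dist, 2))
```

Cutting the tree corresponds to keeping only the joins below some distance; here the last join is far larger than the first two, so cutting before it leaves the groups (0, 1) and (2, 3).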
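The k-means steps in the list above, written as a minimal sketch. Squared Euclidean distance as the similarity measure, the fixed random seed, and the rule for handling an empty cluster (reseed it with a randomly chosen gene) are all assumptions that the notes do not specify.

```python
import random

def mean_profile(profiles):
    """Average a group of genes position by position to form a pseudo gene."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Randomly assign each gene to one of k clusters.
    labels = [rng.randrange(k) for _ in profiles]
    for _ in range(max_iter):
        pseudo_genes = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            # Assumption: an empty cluster is reseeded with a random gene.
            pseudo_genes.append(mean_profile(members) if members
                                else list(rng.choice(profiles)))
        # Reassign each gene to the cluster of its most similar pseudo gene.
        new_labels = [min(range(k),
                          key=lambda c: squared_distance(p, pseudo_genes[c]))
                      for p in profiles]
        if new_labels == labels:  # convergence: no gene changed cluster
            break
        labels = new_labels
    return labels

# Toy profiles forming two well-separated groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = k_means(profiles, k=2)
print(labels)
```

Which group gets label 0 and which gets label 1 depends on the random start, so only the grouping itself is meaningful.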
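The supervised clustering and QT Clust items in the list above fit together: QT Clust repeatedly grows one supervised cluster per gene and keeps the largest. The sketch below is one reading of those steps, not a reference implementation. It assumes Euclidean distance, and it interprets "closest to all genes already in the cluster" as the smallest distance to the farthest current member. The names `grow_cluster`, `qt_clust`, `max_diameter`, and `min_size` are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def grow_cluster(seed, available, profiles, max_diameter):
    """Supervised cluster: start from `seed` and keep adding the closest gene."""
    cluster = [seed]
    candidates = [g for g in available if g != seed]
    while candidates:
        def farthest_member(g):
            # Assumption: distance to the cluster = distance to its farthest member.
            return max(euclidean(profiles[g], profiles[m]) for m in cluster)
        best = min(candidates, key=farthest_member)
        if farthest_member(best) > max_diameter:
            break  # the next closest gene is outside the specified distance
        cluster.append(best)
        candidates.remove(best)
    return cluster

def qt_clust(profiles, max_diameter, min_size=1):
    available = list(range(len(profiles)))
    clusters = []
    while available:
        # Each remaining gene builds its own supervised cluster.
        lists = [grow_cluster(g, available, profiles, max_diameter)
                 for g in available]
        best = max(lists, key=len)  # rule: take the list with the most genes
        if len(best) < min_size:
            break  # largest cluster is smaller than the threshold
        clusters.append(sorted(best))
        # Pull that group out; its genes cannot join any other group.
        available = [g for g in available if g not in best]
    return clusters, available  # `available` holds the unclustered genes

# Toy profiles: a group of three, a group of two, and one outlier.
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]]
clusters, leftover = qt_clust(profiles, max_diameter=2, min_size=2)
print(clusters)  # [[0, 1, 2], [3, 4]]
print(leftover)  # [5]
```

Unlike hierarchical clustering, a gene can be left behind here: the outlier never comes within the distance limit of any group, so it stays unclustered.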
Dylan Maghini