Feb 9

From GcatWiki

Clustering: grouping objects (here, genes) according to an algorithm with given parameters

Why cluster? To explore very large data sets, extract patterns, and make predictions from those patterns (hypothesis generation and testing)

Gene expression data:

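On a linear ratio scale, induction looks much more dramatic than repression (be sure to remember this): changes that are equivalent in fold change look very dissimilar. For example, a 4-fold induction is a ratio of 4, while a 4-fold repression is a ratio of 0.25

A log transformation "normalizes" the way fold changes look: equal fold changes up and down become equal in size and opposite in sign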
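A small sketch of why the log transformation helps. The labels and ratios below are made up for illustration, not class data: a 4-fold induction and a 4-fold repression are lopsided as raw ratios but symmetric as log2 ratios.

```python
import math

# Hypothetical expression ratios (experiment / control), for illustration only.
ratios = {"induced 4x": 4.0, "unchanged": 1.0, "repressed 4x": 0.25}

# log2 turns equal fold changes into values of equal size and opposite sign:
# 4.0 -> +2.0, 1.0 -> 0.0, 0.25 -> -2.0
log_ratios = {label: math.log2(ratio) for label, ratio in ratios.items()}

for label, ratio in ratios.items():
    print(f"{label:>13}: ratio = {ratio:5.2f}, log2 ratio = {log_ratios[label]:+.1f}")
```

On the raw scale the induction sits 3 units away from "unchanged" and the repression only 0.75 units away; on the log2 scale both sit 2 units away.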

Negative correlations are as informative as the positive correlations

Scatter/line plots are another way to represent the data shown in a heat map

Comparing Gene Expression Profiles or Guilt by expression:

Genes with similar profiles may be co-regulated or may directly regulate each other

Proximity Measures:

Want to understand relationships between genes and their expression levels over time or across samples

Correlation, Euclidean distance (the distance formula), inner product (x · y), Hamming distance, L1 distance. Dissimilarities may or may not be metrics

Correlation is very sensitive to outliers (percent change), so the other measures could be good alternatives
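The measures listed above can be sketched in plain Python. The function names and the example profiles are assumptions made for illustration. Hamming distance is applied here to discretized calls (up/down), which is one common way to use it. The last part shows the outlier sensitivity of correlation: two flat, unrelated profiles look almost perfectly correlated once they share a single extreme sample.

```python
import math

def pearson(x, y):
    """Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions where two (discretized) profiles differ."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical profiles: gene_b follows gene_a, gene_c mirrors it.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print("corr(a, b):", round(pearson(gene_a, gene_b), 3))
print("corr(a, c):", round(pearson(gene_a, gene_c), 3))
print("euclidean(a, b):", round(euclidean(gene_a, gene_b), 3))
print("L1(a, b):", l1_distance(gene_a, gene_b))
print("inner(a, b):", inner_product(gene_a, gene_b))
print("hamming:", hamming(["up", "up", "down"], ["up", "down", "down"]))

# Outlier sensitivity: flat profiles, then the same profiles plus one
# shared extreme sample.
flat_x = [0.0, 0.1, -0.1, 0.05]
flat_y = [0.1, -0.1, 0.0, -0.05]
r_without = pearson(flat_x, flat_y)
r_with = pearson(flat_x + [5.0], flat_y + [5.0])
print("corr without outlier:", round(r_without, 3))
print("corr with outlier:   ", round(r_with, 3))
```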

Linkage Methods:

Find a center point of a cluster (its centroid), treat it as a "gene", and measure its distance from the gene of interest

Could average all the distances between the gene of interest and every gene in the cluster (average linkage)

Could use the minimum (single linkage) or the maximum (complete linkage) distance from a gene in the cluster to the gene of interest

Single linkage, average linkage, etc. Each can produce different clusters
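A minimal sketch of the options above for measuring the distance from one gene to a cluster. Euclidean distance, the function names, and the toy profiles are assumptions chosen so the four answers differ.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Average profile of the cluster, treated as a single "gene"."""
    n = len(cluster)
    return [sum(values) / n for values in zip(*cluster)]

def gene_to_cluster(gene, cluster, method="average"):
    """Distance from one gene profile to a list of gene profiles."""
    if method == "centroid":
        return euclidean(gene, centroid(cluster))
    distances = [euclidean(gene, member) for member in cluster]
    if method == "single":      # closest member
        return min(distances)
    if method == "complete":    # farthest member
        return max(distances)
    if method == "average":     # mean over all members
        return sum(distances) / len(distances)
    raise ValueError(f"unknown method: {method}")

# Hypothetical two-sample profiles.
cluster = [[0.0, 0.0], [4.0, 0.0]]
gene = [0.0, 3.0]

for method in ("single", "complete", "average", "centroid"):
    print(method, round(gene_to_cluster(gene, cluster, method), 3))
```

Here the same gene is 3, 5, 4, or about 3.61 units from the same cluster depending on the rule, which is why the choice of linkage changes the clusters.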

Hierarchical Clustering

Join the two most similar genes

Join the next two most similar "objects" (genes or already-joined groups); repeat until all genes have been joined (once joined, genes can never be pulled apart)

Iterative and stringent

Everybody is included, nobody is left out (joining starts near a correlation of positive one and ends near a correlation of negative one)

Cutting the tree: group the things that are still joined at a certain point
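The joining and cutting steps above can be sketched as follows. The notes do not fix a distance or a linkage, so this sketch assumes distance = 1 - Pearson correlation and average linkage; the gene names and profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance(x, y):
    # 0 at correlation +1, 2 at correlation -1
    return 1.0 - pearson(x, y)

def agglomerate(profiles):
    """Repeatedly join the two closest objects; return the list of merges."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean distance over all cross pairs.
            d = sum(distance(profiles[g], profiles[h])
                    for g in a for h in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

def cut_tree(profiles, merges, max_distance):
    """Keep only the joins made at or below max_distance."""
    groups = {name: {name} for name in profiles}
    for a, b, d in merges:
        if d > max_distance:
            break  # average-linkage merge distances never decrease
        merged = groups[a[0]] | groups[b[0]]
        for gene in merged:
            groups[gene] = merged
    return sorted({tuple(sorted(group)) for group in groups.values()})

# Hypothetical profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}

merges = agglomerate(profiles)
for a, b, d in merges:
    print(f"join {a} + {b} at distance {d:.3f}")
print(cut_tree(profiles, merges, max_distance=0.5))
```

The last join is forced even though the two groups are almost perfectly anti-correlated, which is the "nobody is left out" property; cutting the tree at 0.5 recovers the two groups.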

K-means Clustering

Specify how many clusters to form

Randomly assign each gene to one of the k different groups

Average expression of all genes in each cluster to create k pseudo genes

Rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar

Repeat until convergence
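The steps above can be sketched in plain Python. Euclidean distance, the fixed random seed, the redraw when the initial random assignment leaves a group empty, and the example profiles are all assumptions of this sketch, not part of the notes.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(profiles):
    """Average expression across genes: one pseudo gene."""
    return [sum(values) / len(profiles) for values in zip(*profiles)]

def kmeans(genes, k, seed=0, max_iter=100):
    """genes: dict of name -> profile. Returns dict of name -> cluster index."""
    names = list(genes)
    if k > len(names):
        raise ValueError("k cannot exceed the number of genes")
    rng = random.Random(seed)
    # Randomly assign each gene to one of the k groups
    # (redraw if a group would start out empty).
    while True:
        assignment = {name: rng.randrange(k) for name in names}
        if len(set(assignment.values())) == k:
            break
    pseudo = [None] * k
    for _ in range(max_iter):
        # Average the genes in each cluster to create k pseudo genes.
        for c in range(k):
            members = [genes[n] for n in names if assignment[n] == c]
            if members:  # an emptied cluster keeps its previous pseudo gene
                pseudo[c] = mean_profile(members)
        # Reassign each gene to the most similar pseudo gene.
        new_assignment = {
            n: min(range(k), key=lambda c: euclidean(genes[n], pseudo[c]))
            for n in names
        }
        if new_assignment == assignment:  # convergence: nothing moved
            break
        assignment = new_assignment
    return assignment

# Hypothetical log2 ratios over three samples: two low genes, two high genes.
genes = {
    "a1": [0.0, 0.0, 0.0],
    "a2": [0.2, 0.2, 0.2],
    "b1": [2.0, 2.0, 2.0],
    "b2": [2.4, 2.4, 2.4],
}

result = kmeans(genes, k=2, seed=1)
print(result)
```

The cluster labels themselves (0 or 1) depend on the random start; only the grouping is meaningful.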

With our data, maybe form 2 clusters based on fed and non-fed (then check whether the data support that)

Really hard to pick the number of clusters

Supervised Clustering

Find genes in the expression file whose patterns are highly similar (close) to a desired gene or pattern

Add the closest gene first

Then add the gene that is closest to all genes already in the cluster

Repeat, as long as the added gene is within a specified distance of the genes already in the cluster

The distance from one gene to a set of genes is defined as the max, min, or avg of all distances to the individual members of the set
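A minimal sketch of the procedure above. Euclidean distance, the function names, and the toy profiles are assumptions; the `how` parameter selects the max, min, or avg rule for the distance from a gene to the set.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average(values):
    return sum(values) / len(values)

def gene_to_set(gene, members, profiles, how=max):
    """Distance from one gene to a set of genes: max, min, or average."""
    return how([euclidean(profiles[gene], profiles[m]) for m in members])

def supervised_cluster(seed, profiles, max_distance, how=max):
    """Grow a cluster around the seed gene, closest gene first."""
    cluster = [seed]
    candidates = set(profiles) - {seed}
    while candidates:
        # Closest remaining gene; ties are broken by name for determinism.
        best = min(candidates,
                   key=lambda g: (gene_to_set(g, cluster, profiles, how), g))
        if gene_to_set(best, cluster, profiles, how) > max_distance:
            break  # nothing left within the specified distance
        cluster.append(best)
        candidates.remove(best)
    return cluster

# Hypothetical two-sample profiles; "s" is the gene of interest.
profiles = {
    "s": [0.0, 0.0],
    "g1": [1.0, 0.0],
    "g2": [0.0, 2.0],
    "g3": [5.0, 5.0],
}

print(supervised_cluster("s", profiles, max_distance=2.5))
print(supervised_cluster("s", profiles, max_distance=2.1))
print(supervised_cluster("s", profiles, max_distance=2.1, how=min))
```

With the max rule and a cutoff of 2.1, g2 is rejected because it is about 2.24 from g1 even though it is only 2 from the seed; the min rule accepts it. The max rule is the stricter choice.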

QT Clustering

1. Each gene builds a supervised cluster

2. Gene with "best" list, and genes in its list, becomes next cluster (2 rules, how many groups are you in and which group do you choose)

3. Remove these genes from consideration and repeat

4. Stop when all genes are clustered, or largest cluster is smaller than user specified threshold
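The four steps can be sketched as below. This sketch assumes that the "best" list is the largest one, that the distance from a gene to a set is the max over its members, and that distances are Euclidean; the names and profiles are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def supervised_cluster(seed, profiles, max_distance):
    """Grow a cluster around seed using the max-distance rule."""
    cluster = [seed]
    candidates = set(profiles) - {seed}

    def to_set(g):
        return max(euclidean(profiles[g], profiles[m]) for m in cluster)

    while candidates:
        best = min(candidates, key=lambda g: (to_set(g), g))
        if to_set(best) > max_distance:
            break
        cluster.append(best)
        candidates.remove(best)
    return cluster

def qt_cluster(profiles, max_distance, min_size=2):
    """Returns (clusters, unclustered gene names)."""
    remaining = dict(profiles)
    clusters = []
    while remaining:
        # 1. Each remaining gene builds a supervised cluster.
        lists = [supervised_cluster(seed, remaining, max_distance)
                 for seed in sorted(remaining)]
        # 2. The largest list becomes the next cluster (assumed "best").
        best = max(lists, key=len)
        # 4. Stop when the largest cluster is below the threshold.
        if len(best) < min_size:
            break
        clusters.append(sorted(best))
        # 3. Remove these genes from consideration and repeat.
        for gene in best:
            del remaining[gene]
    return clusters, sorted(remaining)

# Hypothetical two-sample profiles: a group of 3, a group of 2, one loner.
profiles = {
    "a1": [0.0, 0.0], "a2": [0.5, 0.0], "a3": [0.0, 0.5],
    "b1": [5.0, 5.0], "b2": [5.5, 5.0],
    "lone": [10.0, 0.0],
}

clusters, unclustered = qt_cluster(profiles, max_distance=1.0, min_size=2)
print(clusters)
print(unclustered)
```

Unlike hierarchical clustering, genes that fit nowhere (here, "lone") are left unclustered.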

Will likely stop at supervised clustering (perhaps around gene ontology) in this class. Restraint is good: don't just go straight to a heat map


Completed clustering activities