2/9

From GcatWiki

Nick Balanda

What is Gene Clustering?

-grouping genes together based on similar proteins they code for

-allows similar genes to be presented together, representative data points to be analyzed within big data (draw patterns), and predictions to be made

---not having to sort through whole data set

-many algorithms

-gene expression data:

a) often comparative expression levels

b) consider that repression is as significant as induction (increase in expression)

b1) might use log scale to represent this!

b2) could mean co-regulation or correlation
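
A small sketch of why a log scale treats repression and induction evenly. The expression levels are made-up numbers, and `log2_ratio` is a hypothetical helper name, not something from the notes.

```python
import math

def log2_ratio(experimental, control):
    """Return log2(experimental / control) for two positive expression levels."""
    return math.log2(experimental / control)

# Hypothetical expression levels, chosen only for illustration.
induced = log2_ratio(400.0, 100.0)    # 4-fold induction
repressed = log2_ratio(25.0, 100.0)   # 4-fold repression
unchanged = log2_ratio(100.0, 100.0)  # no change

print(induced, repressed, unchanged)  # prints: 2.0 -2.0 0.0
```

On the raw ratio scale the same changes would be 4, 0.25 and 1, so repression gets squeezed between 0 and 1 while induction is unbounded. On the log scale both are the same distance from 0.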

linkage methods:

distance between the average point in the cluster and the point in question (centroid linkage), minimum distance between any point in the cluster and the point in question (single linkage), etc.
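
The two linkage rules above can be sketched as follows. This assumes Euclidean distance and made-up 2-D points. The function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid_distance(cluster, point):
    """Distance from `point` to the average point of `cluster`."""
    n = len(cluster)
    centroid = [sum(values) / n for values in zip(*cluster)]
    return euclidean(centroid, point)

def single_distance(cluster, point):
    """Smallest distance from `point` to any member of `cluster`."""
    return min(euclidean(member, point) for member in cluster)

# Hypothetical cluster and candidate point.
cluster = [(0.0, 0.0), (6.0, 0.0)]
point = (6.0, 4.0)

print(centroid_distance(cluster, point))  # average point is (3, 0)
print(single_distance(cluster, point))    # nearest member is (6, 0)
```

The same point can look near or far depending on the rule, so the choice of linkage changes which clusters get joined.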

hierarchical clustering:


-join the two most similar genes, repeat until all genes have been clustered (no gene left behind-- starts at +1 correlation, ends at -1)

---cutting the tree-- dividing gene clusters into groups by drawing a line through the hierarchical tree; the branches left below the line are the groups
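
A minimal pure-Python sketch of the join-and-cut idea above. Several things here are assumptions, not from the notes: Pearson correlation as the similarity measure, average pairwise correlation as the rule for comparing clusters, the gene names and profiles, and expressing the cut as a correlation cutoff below which joining stops.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_similarity(c1, c2, profiles):
    """Average pairwise correlation between the members of two clusters."""
    pairs = [(g, h) for g in c1 for h in c2]
    return sum(pearson(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)

def hierarchical(profiles, cutoff=None):
    """Join the two most similar clusters until one is left.

    If `cutoff` is given, stop once the best remaining similarity drops
    below it; this stands in for drawing the cut line through the tree.
    Returns (clusters, merges), where merges records each join in order.
    """
    clusters = [[gene] for gene in profiles]
    merges = []
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j], profiles))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if cutoff is not None and best < cutoff:
            break
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical expression profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}

clusters, merges = hierarchical(profiles, cutoff=0.5)
print(clusters)
```

With the cutoff at 0.5 the two rising genes and the two falling genes stay as separate groups. Without a cutoff the loop keeps joining until a single cluster holds every gene.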

k-means clustering:

specify how many clusters to form (k), then assign each gene to one of the k clusters so that similarity within each cluster is maximized
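
A rough sketch of the k-means loop, under stated assumptions: Euclidean distance between profiles, the first k genes used as starting centers (a naive choice made only to keep the example deterministic), and made-up gene names and log-ratio values.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of profiles."""
    return [sum(values) / len(points) for values in zip(*points)]

def k_means(profiles, k, max_iter=100):
    """Assign each gene to one of k clusters; returns {gene: cluster index}."""
    names = list(profiles)
    # Naive initialization (an assumption): the first k profiles are the centers.
    centers = [list(profiles[name]) for name in names[:k]]
    assignment = {}
    for _ in range(max_iter):
        # Step 1: put every gene in the cluster with the nearest center.
        new_assignment = {
            name: min(range(k), key=lambda c: euclidean(profiles[name], centers[c]))
            for name in names
        }
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has settled
        assignment = new_assignment
        # Step 2: move each center to the average of its members.
        for c in range(k):
            members = [profiles[n] for n in names if assignment[n] == c]
            if members:
                centers[c] = mean_point(members)
    return assignment

# Hypothetical log-ratio profiles: two induced genes, two repressed genes.
profiles = {
    "geneA": [2.0, 2.1, 1.9],
    "geneB": [-2.0, -1.9, -2.1],
    "geneC": [1.8, 2.0, 2.2],
    "geneD": [-2.2, -2.0, -1.8],
}

assignment = k_means(profiles, k=2)
print(assignment)
```

The result depends on the starting centers. A poor start can settle on a worse grouping, so in practice the run is often repeated from several starts.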

supervised clustering:

find all genes w/ expression patterns matching "fill in the blank" (e.g. all genes like this particular gene that we found upregulated)
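
One way to sketch the "find genes like this one" query. The assumptions: Pearson correlation as the matching rule, a 0.9 threshold, and made-up gene names and profiles. `find_similar` is a hypothetical helper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def find_similar(profiles, query_name, min_correlation=0.9):
    """Return (gene, correlation) pairs that match the query gene's pattern,
    best match first."""
    query = profiles[query_name]
    hits = [
        (name, pearson(query, profile))
        for name, profile in profiles.items()
        if name != query_name
    ]
    matches = [(name, r) for name, r in hits if r >= min_correlation]
    return sorted(matches, key=lambda item: item[1], reverse=True)

# Hypothetical profiles; geneQ is the upregulated gene of interest.
profiles = {
    "geneQ": [0.0, 1.0, 2.0, 3.0],
    "geneX": [0.0, 2.0, 4.0, 6.0],   # same shape as geneQ
    "geneY": [3.0, 2.0, 1.0, 0.0],   # opposite shape
    "geneZ": [1.0, 1.2, 0.9, 1.1],   # unrelated to geneQ
}

hits = find_similar(profiles, "geneQ")
print([name for name, r in hits])
```

Searching for strongly negative correlations instead would pull out genes with the opposite pattern, such as geneY here.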

quality clustering (QT clust):

each gene builds its own cluster based on genes that are most similar to it (repeat for every gene)

-come up with rule for "best cluster" (because each cluster likely overlaps many others)

--default in this case is to remove biggest cluster, then the next biggest, so on...
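
A simplified sketch of the steps above, not the exact QT clust procedure. It assumes distance = 1 - Pearson correlation, a hypothetical `max_diameter` limit on how dissimilar two genes in one cluster may be, and made-up gene names and profiles. The "best cluster" rule is the default from the notes: take the biggest, remove its genes, repeat.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly opposite ones."""
    return 1.0 - pearson(x, y)

def qt_clust(profiles, max_diameter):
    """Repeatedly keep the biggest candidate cluster and remove its genes."""
    remaining = set(profiles)
    clusters = []
    while remaining:
        best = []
        for seed in sorted(remaining):
            # Each gene builds its own candidate cluster.
            candidate = [seed]
            others = remaining - {seed}
            while others:
                def farthest_member(gene):
                    # Largest distance from `gene` to any current member.
                    return max(distance(profiles[gene], profiles[m])
                               for m in candidate)
                nxt = min(sorted(others), key=farthest_member)
                if farthest_member(nxt) > max_diameter:
                    break  # adding any more genes would exceed the limit
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters

# Hypothetical profiles: three rising genes, two falling genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, 2.0, 3.0, 5.0],
    "geneD": [4.0, 3.0, 2.0, 1.0],
    "geneE": [8.0, 6.0, 4.0, 1.0],
}

clusters = qt_clust(profiles, max_diameter=0.2)
print(clusters)
```

Because candidate clusters overlap, the removal rule matters: once the biggest cluster is taken out, the remaining genes rebuild their candidates from what is left.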

there is no perfect answer for clustering; you have to experiment, guided by biological meaning, in order to draw the most accurate conclusions

clustering info/practice:

[1]