2/09/16

From GcatWiki

Class Notes 2/09/16

Clustering: grouping genes and samples based on similarity. If we don't like the way the data are being grouped, there are a lot of different algorithms we can use. Clustering is an easy way to pull out patterns and make predictions from big data sets. Microarrays are a similar technology used for RNA analysis.

On a linear scale, induction looks much more dramatic than repression; log scales help us see patterns more clearly. We need to watch out for negative correlations: they may be just as interesting as positive correlations but are harder to detect. But we're looking at counts, not ratios, so we shouldn't ever be looking at negative values when using a log scale.

How do we compare one thing to a group of genes? Linkage methods: create a value for the group and then treat that cluster as an individual gene, or average all the distances. A more relaxed rule lets a gene in if it is close to one member of the cluster; a more stringent rule says it has to be that close to all the genes in the cluster.
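To make the points about log scales and negative correlations concrete, here is a minimal sketch in plain Python. The gene names and counts are made up for illustration, and the pseudocount of 1 is an assumption (it avoids log(0) and keeps every log value at 0 or above); the notes do not specify either.

```python
import math

def log2_counts(counts, pseudocount=1):
    """Log2-transform raw counts. The pseudocount is an assumption made here
    so that a count of 0 maps to log2(1) = 0 instead of being undefined."""
    return [math.log2(c + pseudocount) for c in counts]

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical counts for three genes across four samples.
gene_a = [1, 3, 7, 15]    # induced
gene_b = [3, 7, 15, 31]   # induced, same shape as gene_a on a log scale
gene_c = [15, 7, 3, 1]    # repressed, mirror image of gene_a

log_a = log2_counts(gene_a)
log_b = log2_counts(gene_b)
log_c = log2_counts(gene_c)

# A strongly negative correlation is as informative as a strongly positive one.
print(round(pearson(log_a, log_b), 3))
print(round(pearson(log_a, log_c), 3))
```

A correlation near +1 means two genes rise and fall together; a correlation near -1 means one rises while the other falls, which a method that only looks for "similar" profiles would miss.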

Hierarchical Clustering: join the two most similar genes, then the next two most similar objects, and repeat until all have been joined. No gene can be left out. The tree starts at a correlation of +1 and ends at -1. Cutting the tree: draw a line across the tree and group together all the things that are still joined where the line is drawn.
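The joining and cutting steps can be sketched roughly as follows. This is an illustration only: it assumes Pearson correlation as the similarity measure and average linkage for comparing groups, neither of which the notes fix for hierarchical clustering, and the function names and example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def hierarchical(profiles):
    """Repeatedly join the two most similar objects until one tree remains.
    Returns the merges in order as (members_a, members_b, similarity)."""
    clusters = [[name] for name in profiles]
    merges = []

    def similarity(c1, c2):
        # Average linkage (an assumption): mean correlation over all gene pairs.
        values = [pearson(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(values) / len(values)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        joined = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [joined]
    return merges

def cut_tree(profiles, merges, threshold):
    """Draw the line at `threshold`: keep only joins at least that similar."""
    groups = {name: {name} for name in profiles}
    for members_a, members_b, s in merges:
        if s >= threshold:
            union = groups[members_a[0]] | groups[members_b[0]]
            for name in union:
                groups[name] = union
    unique = {frozenset(g) for g in groups.values()}
    return sorted(sorted(g) for g in unique)

# Hypothetical log-scale profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [5, 3, 2, 1],
}
merges = hierarchical(profiles)
print(cut_tree(profiles, merges, 0.5))
```

Every gene ends up in the tree because joining continues until one cluster remains; only the cut decides how many groups are reported. Lowering the threshold toward -1 merges everything into a single group.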

K-means Clustering: specify how many clusters to form (k). Randomly assign each gene to one of the k clusters. Average the expression of all genes in each cluster to create k pseudo genes. Rearrange the genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar. Repeat until convergence.
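A minimal sketch of that loop is shown below. It assumes squared Euclidean distance as the measure of "most similar" and reseeds an empty cluster from a randomly chosen gene; both are choices made for this example, not something stated in the notes. The `initial` argument is a hypothetical convenience for starting from a known assignment instead of a random one.

```python
import random

def mean_profile(rows):
    """Average several profiles sample by sample to form a pseudo gene."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(x, y):
    """Squared Euclidean distance (an assumed similarity measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(profiles, k, seed=0, initial=None, max_iter=100):
    rng = random.Random(seed)
    names = list(profiles)
    # Step 1: randomly assign each gene to one of k clusters
    # (or start from a caller-supplied assignment).
    if initial is None:
        assign = {n: rng.randrange(k) for n in names}
    else:
        assign = dict(initial)
    for _ in range(max_iter):
        # Step 2: average the genes in each cluster into a pseudo gene.
        pseudo = []
        for c in range(k):
            members = [profiles[n] for n in names if assign[n] == c]
            if not members:
                # Empty cluster: reseed from a random gene (a choice made here).
                members = [profiles[rng.choice(names)]]
            pseudo.append(mean_profile(members))
        # Step 3: move each gene to the cluster of its closest pseudo gene.
        new_assign = {
            n: min(range(k), key=lambda c: sq_dist(profiles[n], pseudo[c]))
            for n in names
        }
        # Step 4: stop once no gene changes cluster (convergence).
        if new_assign == assign:
            break
        assign = new_assign
    return assign

# Hypothetical log-scale profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [5, 3, 2, 1],
}
start = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}
print(kmeans(profiles, 2, initial=start))
```

Because the starting assignment is random, different runs can converge to different groupings, so in practice it is worth repeating k-means with several seeds and comparing the results.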

Supervised Clustering: find genes in the expression file whose patterns are highly similar to a desired gene or pattern. Add the closest gene first, then add the gene that is closest to the genes already in the cluster; repeat as long as the added gene is within a specified distance of the genes already in the cluster. The distance from one gene to a set of genes is defined to be the maximum (or minimum, or average) of all distances to the individual members of the set (complete, single, and average linkage, respectively).
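The growth rule and the three linkage options can be sketched as below. The sketch assumes the distance between two genes is 1 minus their Pearson correlation and that the query is the name of a gene already in the data set; the function names, the cutoff values, and the example profiles are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def distance(x, y):
    """Assumed distance: 0 for identical shapes, 2 for mirror images."""
    return 1 - pearson(x, y)

# Gene-to-set distance: maximum, minimum, or average over the set's members.
LINKAGE = {
    "complete": max,
    "single": min,
    "average": lambda ds: sum(ds) / len(ds),
}

def supervised_cluster(profiles, query, cutoff, linkage="average"):
    """Grow a cluster outward from `query` until the next gene is too far."""
    combine = LINKAGE[linkage]
    cluster = [query]
    remaining = [name for name in profiles if name != query]

    def to_cluster(name):
        return combine([distance(profiles[name], profiles[m]) for m in cluster])

    while remaining:
        closest = min(remaining, key=to_cluster)
        if to_cluster(closest) > cutoff:
            break  # the closest remaining gene is outside the allowed distance
        cluster.append(closest)
        remaining.remove(closest)
    return cluster

# Hypothetical log-scale profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [5, 3, 2, 1],
}
print(supervised_cluster(profiles, "g1", cutoff=0.5))
print(supervised_cluster(profiles, "g3", cutoff=0.5, linkage="complete"))
```

Single linkage is the relaxed rule (close to any one member is enough), while complete linkage is the stringent rule (the gene must be within the cutoff of every member).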

QT Clust: available in R. Each gene builds a supervised cluster. The gene with the "best" list, together with the genes in its list, becomes the next cluster. Remove these genes from consideration and repeat; stop when all genes are clustered. We determine the rule that makes a cluster the "best" (it could just be the group with the most genes in it).
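As a rough illustration of the procedure, and not the R implementation itself, the sketch below lets every remaining gene grow a candidate cluster, keeps the largest one, removes its genes, and repeats. It assumes "best" means "most genes", uses 1 minus Pearson correlation as the distance, and applies complete linkage when growing each candidate; all names, the cutoff, and the example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def grow(profiles, seed_gene, cutoff):
    """Supervised cluster around one gene, using complete linkage (assumed)."""
    cluster = [seed_gene]
    remaining = [name for name in profiles if name != seed_gene]

    def to_cluster(name):
        # Distance to the farthest member already in the cluster.
        return max(1 - pearson(profiles[name], profiles[m]) for m in cluster)

    while remaining:
        closest = min(remaining, key=to_cluster)
        if to_cluster(closest) > cutoff:
            break
        cluster.append(closest)
        remaining.remove(closest)
    return cluster

def qt_cluster(profiles, cutoff):
    """Repeatedly keep the best candidate cluster until no genes are left."""
    remaining = dict(profiles)
    clusters = []
    while remaining:
        # Every remaining gene builds its own candidate cluster.
        candidates = [grow(remaining, gene, cutoff) for gene in remaining]
        # Rule for "best" (an assumption): the candidate with the most genes.
        best = max(candidates, key=len)
        clusters.append(best)
        for name in best:
            del remaining[name]
    return clusters

# Hypothetical log-scale profiles; g5 does not resemble any other gene.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [5, 3, 2, 1],
    "g5": [1, 3, 1, 3],
}
print(qt_cluster(profiles, cutoff=0.2))
```

Unlike k-means, the number of clusters is not chosen in advance; it falls out of the cutoff, and a gene that is not close to anything ends up in a cluster of its own.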