Difference between revisions of "NHE Feb 9 Notes"
From GcatWiki
(Created page with "Nick Elder Group 2 intestines") |
Latest revision as of 19:35, 9 February 2016
Clustering:
- Grouping genes, samples, etc.
  - The resulting order may not reflect what you would expect or want
- Can be used to pull out patterns and make predictions
- Induction can look much more dramatic than repression on a linear scale
- Changing the y-axis to a logarithmic scale reveals much more drastic changes in repression
  - Similar relationships could be caused by direct interaction or co-regulation
  - Strong negative correlation could be as interesting as positive correlation
    - Clustering software is more likely to group positively correlated genes
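To illustrate the point about the logarithmic scale, here is a small sketch. The function name `log2_ratio` and the example expression values are assumptions made for illustration; they are not from the lecture.

```python
import math

def log2_ratio(treated, control):
    """Log2 of the expression ratio (treated / control)."""
    return math.log2(treated / control)

# Hypothetical expression values: one gene induced 8-fold, one repressed 8-fold.
induced_ratio = 8.0 / 1.0      # linear ratio 8.0, i.e. 7.0 above "no change" (1.0)
repressed_ratio = 1.0 / 8.0    # linear ratio 0.125, i.e. only 0.875 below 1.0

# On a linear axis the induction dwarfs the repression.
print(induced_ratio - 1.0, 1.0 - repressed_ratio)

# On a log2 axis the two changes are the same size with opposite sign.
print(log2_ratio(8.0, 1.0), log2_ratio(1.0, 8.0))
```

On the linear scale repression is squeezed between 0 and 1, while induction has no upper limit; the log transform makes the two symmetric around 0.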
- Gene Expression Profiles (Guilt by Association)
  - Comparing genes or expression levels in samples over time
  - Similarity measures: correlation, Euclidean distance, Hamming distance, etc.
  - Correlation is very sensitive to outliers; also, very small changes, possibly due to noise, can be scored as correlated with much more drastic changes
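The three measures named above can be sketched in plain Python. The profiles below are invented for illustration, and Pearson correlation is assumed to be the kind of correlation meant in the notes.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions at which two discretized profiles differ."""
    return sum(a != b for a, b in zip(x, y))

# Tiny changes (possibly noise) correlate perfectly with drastic changes,
# even though the two profiles are far apart in Euclidean terms.
drastic = [0.0, 2.0, 4.0, 6.0]
tiny = [0.00, 0.01, 0.02, 0.03]
print(pearson(drastic, tiny), euclidean(drastic, tiny))

# One outlier can flip an anti-correlated pair into a strongly correlated one.
x = [1.0, 1.1, 0.9, 1.0]
y = [1.0, 0.9, 1.1, 1.0]
print(pearson(x, y), pearson(x + [10.0], y + [10.0]))

# Hamming distance works on discretized calls such as up/down.
print(hamming(["up", "up", "down"], ["up", "down", "down"]))
```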
- Linkage Methods
  - Can take the nucleus (average) of a cluster and compare it to a single gene of interest
  - Can average all the distances between all points
  - Can compare the gene of interest to all the rest of the genes individually
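The three ways of comparing a gene to a cluster give different answers, as this sketch shows. Euclidean distance and the function names are assumptions chosen for illustration; the notes do not fix a metric here.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Average profile (the 'nucleus') of a list of profiles."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]

def centroid_distance(gene, cluster):
    """Distance from the gene to the cluster's average profile."""
    return euclidean(gene, centroid(cluster))

def average_distance(gene, cluster):
    """Average of the distances from the gene to every cluster member."""
    return sum(euclidean(gene, member) for member in cluster) / len(cluster)

def individual_distances(gene, cluster):
    """Distance from the gene to each cluster member separately."""
    return [euclidean(gene, member) for member in cluster]

# Hypothetical two-gene cluster with the gene of interest exactly in between.
cluster = [[0.0, 0.0], [4.0, 0.0]]
gene = [2.0, 0.0]

print(centroid_distance(gene, cluster))     # 0.0: the gene sits on the centroid
print(average_distance(gene, cluster))      # 2.0
print(individual_distances(gene, cluster))  # [2.0, 2.0]
```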
- Hierarchical clustering
  - Join the two most similar genes
  - Join the next two most similar "objects" (genes or clusters of genes)
  - Repeat until all genes have been joined
  - Once clustered, two genes can, under no circumstances, under penalty of death, in no way possible on heaven or earth, be split apart
  - No gene left behind. Every gene has to end up in the tree; joining starts at +1 correlation and ends at -1 correlation
    - Genes that are most distant in the cluster (opposite correlation) could still be co-regulated
  - Cutting the tree
    - Draw a line to separate the clusters
      - The decision of where to cut is completely arbitrary
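The join-and-repeat procedure and the tree cut can be sketched as follows. This is a minimal illustration, assuming Euclidean distance with average linkage rather than the correlation measure used in class; the gene names and profiles are made up.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(a, b, profiles):
    """Mean distance over all pairs drawn from clusters a and b."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def hierarchical(profiles):
    """Join the two closest objects until one cluster is left.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        distance = average_linkage(a, b, profiles)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)   # once joined, genes are never split again
        merges.append((a, b, distance))
    return merges

def cut(profiles, merges, threshold):
    """Replay merges up to an (arbitrary) distance threshold."""
    clusters = {(name,) for name in profiles}
    for a, b, distance in merges:
        if distance > threshold:
            break
        clusters -= {a, b}
        clusters.add(a + b)
    return sorted(sorted(c) for c in clusters)

profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [3.0, 2.0, 1.0],
    "g4": [3.2, 1.8, 0.8],
}
merges = hierarchical(profiles)
print(cut(profiles, merges, threshold=1.0))
```

Moving `threshold` changes how many clusters come out of the same tree, which is why the choice of cut is arbitrary.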
- K-means Clustering
  - Specify how many clusters (K) to form; genes are just broken into groups, with no tree to cut
  - Randomly assign each gene to one of K different clusters
  - Average the expression of all genes in each cluster to create K pseudo genes
  - Rearrange genes by assigning each one to the cluster represented by the pseudo gene to which it is most similar
  - Repeat until convergence
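The K-means steps above can be sketched as follows. Euclidean distance, the shuffled round-robin initial assignment (which guarantees that no cluster starts empty), and the example genes are all assumptions made for this illustration.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(profiles):
    """Average a list of profiles position by position (a 'pseudo gene')."""
    return [sum(column) / len(column) for column in zip(*profiles)]

def k_means(genes, k, seed=0, max_iter=100):
    """Return a dict mapping gene name to cluster index (0..k-1)."""
    if k > len(genes):
        raise ValueError("k cannot exceed the number of genes")
    rng = random.Random(seed)
    names = list(genes)
    rng.shuffle(names)
    # Random initial assignment, dealt round-robin so every cluster has a member.
    assignment = {name: i % k for i, name in enumerate(names)}
    pseudo = [None] * k
    for _ in range(max_iter):
        # Average each cluster's members to get its pseudo gene.
        for c in range(k):
            members = [genes[n] for n in names if assignment[n] == c]
            if members:          # an emptied cluster keeps its old pseudo gene
                pseudo[c] = mean_profile(members)
        # Reassign every gene to the most similar pseudo gene.
        new_assignment = {
            n: min(range(k), key=lambda c: euclidean(genes[n], pseudo[c]))
            for n in names
        }
        if new_assignment == assignment:   # convergence: nothing moved
            break
        assignment = new_assignment
    return assignment

genes = {
    "up1": [0.0, 1.0, 2.0], "up2": [0.0, 1.1, 2.1], "up3": [0.1, 1.0, 1.9],
    "down1": [2.0, 1.0, 0.0], "down2": [2.1, 1.0, 0.1], "down3": [1.9, 0.9, 0.0],
}
print(k_means(genes, k=2))
```

Cluster labels (0 or 1) depend on the random start, so only the grouping itself is meaningful.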
- Supervised Clustering
  - Find genes in the expression file whose patterns are highly similar to a desired gene or pattern
  - Add the closest gene first
  - Then add the gene that is closest to all genes already in the cluster
  - Repeat, as long as the added gene is within a specified distance of the genes already in the cluster
  - The distance from one gene to a set of genes is defined as the maximum (or minimum or average) of all distances to individual members of the set (complete, single, and average linkage, respectively)
  - Dr. C says that it might be really nice to cluster genes that move with a transcription factor, since a little induction in the transcription factor can have major effects on the expression of other genes
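A minimal sketch of growing a supervised cluster around a seed gene follows. The function name `supervised_cluster`, the use of Euclidean distance, and the example profiles (with a hypothetical transcription factor `tf` as the seed) are assumptions for illustration only.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average(values):
    values = list(values)
    return sum(values) / len(values)

def supervised_cluster(seed, genes, max_distance, linkage=max):
    """Grow a cluster around `seed`, adding the closest gene each round.

    linkage=max is complete linkage, linkage=min is single linkage,
    and linkage=average is average linkage.
    """
    cluster = [seed]
    candidates = [name for name in genes if name != seed]
    while candidates:
        def distance_to_cluster(name):
            return linkage(euclidean(genes[name], genes[m]) for m in cluster)
        best = min(candidates, key=distance_to_cluster)
        if distance_to_cluster(best) > max_distance:
            break                      # the closest remaining gene is too far away
        cluster.append(best)
        candidates.remove(best)
    return cluster

genes = {
    "tf": [0.0, 1.0, 2.0, 3.0],
    "a": [0.0, 1.1, 2.0, 3.1],
    "b": [0.2, 1.0, 2.2, 3.0],
    "c": [3.0, 2.0, 1.0, 0.0],
}
print(supervised_cluster("tf", genes, max_distance=0.5))
```

Passing `linkage=min` or `linkage=average` switches to single or average linkage without changing the rest of the procedure.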
- Quality Clustering: QT Clust
  - Each gene builds a supervised cluster
  - The gene with the "best" list, together with the genes in its list, becomes the next cluster
  - Remove these genes from consideration, and repeat
  - Stop when all genes are clustered, or the largest cluster is smaller than a user-specified threshold
  - In other words: make 'cliques', find the largest, separate that group, remove all of its members from all other groups, and make the next largest group from the remaining genes.
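The QT Clust loop can be sketched as below. This is an illustration under stated assumptions: "best" is taken to mean the largest candidate cluster, distances are Euclidean with complete linkage, and the names `grow_cluster`, `qt_clust`, and `min_size` as well as the example genes are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def grow_cluster(seed, genes, max_distance):
    """Supervised cluster around `seed` using complete linkage."""
    cluster = [seed]
    candidates = [name for name in genes if name != seed]
    while candidates:
        def distance_to_cluster(name):
            return max(euclidean(genes[name], genes[m]) for m in cluster)
        best = min(candidates, key=distance_to_cluster)
        if distance_to_cluster(best) > max_distance:
            break
        cluster.append(best)
        candidates.remove(best)
    return cluster

def qt_clust(genes, max_distance, min_size=2):
    """Return (clusters, unclustered gene names)."""
    remaining = dict(genes)
    clusters = []
    while remaining:
        # Every remaining gene builds its own candidate cluster.
        candidates = [grow_cluster(seed, remaining, max_distance)
                      for seed in remaining]
        best = max(candidates, key=len)     # largest candidate wins
        if len(best) < min_size:
            break                           # nothing big enough is left
        clusters.append(best)
        for name in best:                   # remove its genes, then repeat
            del remaining[name]
    return clusters, sorted(remaining)

genes = {
    "up1": [0.0, 1.0, 2.0], "up2": [0.0, 1.1, 2.1], "up3": [0.1, 1.0, 1.9],
    "down1": [2.0, 1.0, 0.0], "down2": [2.1, 1.0, 0.1],
    "odd": [5.0, 5.0, 5.0],
}
clusters, unclustered = qt_clust(genes, max_distance=0.5)
print(clusters, unclustered)
```

Unlike hierarchical clustering, genes that fit no sufficiently large cluster are simply left out.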