Difference between revisions of "February 9, 2016"

 
[http://gcat.davidson.edu/mediawiki-1.19.1/index.php/Ashlyn Ashlyn's Main Page]
 

Latest revision as of 04:42, 9 March 2016

Classwork

Clustering Activity

Clustering

Clustering: grouping genes and samples together and presenting them in a specific order.

Clustering serves two purposes. As data reduction, it lets us analyze representative data points rather than the entire data set. As hypothesis generation, it helps us understand patterns in a data set so that they may be tested statistically. While analyzing patterns, it is important to remember the utility of log transformations and to consider co-regulation and direct and indirect relationships between genes. Both negative and positive correlations can be interesting and lead to important discoveries.
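As a small illustration of why log transformations are useful, the sketch below converts expression ratios to log2 values. The gene names and ratio values are made up for the example and do not come from the class data.

```python
import math

# Hypothetical fed / non-fed expression ratios (made-up values).
ratios = {"geneA": 4.0, "geneB": 1.0, "geneC": 0.25}

# log2 makes increases and decreases symmetric around zero:
# a 4-fold increase becomes +2 and a 4-fold decrease becomes -2.
log_ratios = {gene: math.log2(ratio) for gene, ratio in ratios.items()}

for gene, value in log_ratios.items():
    print(gene, value)
```

On the raw scale, geneA (4.0) looks much further from "no change" (1.0) than geneC (0.25) does; after the transformation they are equally far from zero.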


Hierarchical Clustering: joins the two most similar genes, then the next two most similar genes or clusters of genes, and repeats until all genes have been joined.

In hierarchical clustering, once two genes or clusters of genes are joined, they cannot be pulled apart, regardless of what the rest of the data reveal. This is the biggest problem with hierarchical clustering: each join is made locally, without considering all of the data together. Hierarchical clustering also leaves no gene behind; every gene ends up in the tree. When similarity is measured by correlation, the first joins are made near a correlation of 1, and the last joins can link profiles whose correlation approaches -1.
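The joining procedure can be sketched in plain Python. This is only an illustration under stated assumptions: it uses 1 - Pearson correlation as the dissimilarity and average linkage between clusters, neither of which the notes prescribe, and the four gene profiles are made up.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def hierarchical(profiles):
    """Join the two closest clusters until one remains; return the joins."""
    clusters = [(name,) for name in profiles]  # every gene starts alone

    def dist(c1, c2):
        # Average linkage on 1 - correlation (an assumption of this sketch).
        return mean(1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, dist(c1, c2)))
        # Once joined, c1 and c2 are never separated again.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Made-up profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 1],
}

for left, right, height in hierarchical(profiles):
    print(left, right, round(height, 3))
```

The first joins happen at a dissimilarity near 0 (correlation near 1), and the final join links the rising and falling groups at a dissimilarity near 2 (correlation near -1).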


K-means Clustering: the user specifies how many clusters (k) to form, and each gene is randomly assigned to one of the k clusters to start.

After the initial random assignment, the average expression of all genes in each cluster is used to create k pseudo genes. Each gene is then reassigned to the cluster represented by the pseudo gene to which it is most similar. These two steps are repeated until convergence, that is, until no gene changes cluster.
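A minimal sketch of these steps follows. It assumes squared Euclidean distance as the similarity measure, and it reseeds an empty cluster with a randomly chosen gene's profile; the notes do not cover either choice, so both are assumptions of this example, as are the gene profiles.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(profiles, k, seed=0, max_iter=100):
    """Return a dict mapping each gene name to a cluster index."""
    rng = random.Random(seed)
    genes = sorted(profiles)
    # Step 1: randomly assign each gene to one of k clusters.
    assignment = {g: rng.randrange(k) for g in genes}
    for _ in range(max_iter):
        # Step 2: average each cluster's members into a pseudo gene.
        pseudo = {}
        for c in range(k):
            members = [profiles[g] for g in genes if assignment[g] == c]
            if members:
                pseudo[c] = [sum(col) / len(members) for col in zip(*members)]
            else:
                # Assumption: reseed an empty cluster with a random gene.
                pseudo[c] = list(profiles[rng.choice(genes)])
        # Step 3: move each gene to its most similar pseudo gene.
        new_assignment = {
            g: min(pseudo, key=lambda c: squared_distance(profiles[g], pseudo[c]))
            for g in genes
        }
        if new_assignment == assignment:  # convergence: nothing moved
            break
        assignment = new_assignment
    return assignment

# Made-up profiles forming two well-separated groups.
profiles = {
    "g1": [1.0, 1.0],
    "g2": [1.2, 0.9],
    "g3": [8.0, 9.0],
    "g4": [9.0, 8.0],
}
print(kmeans(profiles, k=2))
```

Because the starting assignment is random, the cluster labels (0 or 1) can differ between seeds even when the grouping itself is the same.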


Supervised Clustering: finds genes in the expression file whose patterns are highly similar (close) to a desired gene or pattern.

Supervised clustering adds the gene closest to the desired gene or pattern first. Then, the gene closest to all of the genes already in the cluster is added. This process continues as long as the added gene is within a specified distance of the genes already in the cluster. The distance from one gene to a set of genes can be defined as the maximum, minimum, or average of all distances to individual members of the set (complete, single, and average linkage, respectively).
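The growth procedure and the three linkage definitions can be sketched as below. Euclidean distance, the function names, and the example profiles (including the gene called "tf") are assumptions made for illustration.

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Distance from one gene to a set of genes, given the individual distances.
LINKAGE = {
    "complete": max,                          # farthest member
    "single": min,                            # nearest member
    "average": lambda d: sum(d) / len(d),     # mean over members
}

def supervised_cluster(profiles, seed_gene, threshold, linkage="complete"):
    """Grow a cluster around seed_gene until no gene is within threshold."""
    combine = LINKAGE[linkage]
    cluster = [seed_gene]
    candidates = set(profiles) - {seed_gene}
    while candidates:
        def to_cluster(gene):
            return combine([euclidean(profiles[gene], profiles[m]) for m in cluster])
        best = min(sorted(candidates), key=to_cluster)
        if to_cluster(best) > threshold:
            break  # the closest remaining gene is too far away
        cluster.append(best)
        candidates.remove(best)
    return cluster

# Made-up profiles; "tf" stands in for the gene or pattern of interest.
profiles = {"tf": [0, 0], "a": [1, 0], "b": [2, 0], "c": [10, 0]}

print(supervised_cluster(profiles, "tf", threshold=1.5, linkage="complete"))
print(supervised_cluster(profiles, "tf", threshold=1.5, linkage="single"))
```

With the same threshold, complete linkage stops before adding "b" because "b" is too far from "tf", while single linkage adds it because "b" is close to "a".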


Cutting the Tree: the process of grouping genes by determining a threshold value in the dendrogram.

To cut the tree, draw a line across the dendrogram at a chosen height and see which genes or clusters of genes are still joined below it. Genes that are still together are part of a cluster. Different clusters arise depending on where the tree is cut, so there is no single correct cut point.
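As a sketch of how the cut height changes the answer, the function below applies only the joins that occur at or below a chosen height. The merge list is hypothetical, written by hand in the form (left cluster, right cluster, height), and the function assumes join heights never decrease as the tree is built.

```python
def cut_tree(genes, merges, height):
    """Return the clusters that remain joined below the cut height."""
    group = {g: {g} for g in genes}  # each gene starts in its own group
    for left, right, h in merges:
        if h <= height:
            merged = group[left[0]] | group[right[0]]
            for g in merged:
                group[g] = merged
    unique = {frozenset(members) for members in group.values()}
    return sorted(sorted(members) for members in unique)

genes = ["g1", "g2", "g3", "g4"]
# Hypothetical dendrogram: two tight pairs joined at a large height.
merges = [
    (("g1",), ("g2",), 0.0),
    (("g3",), ("g4",), 0.01),
    (("g1", "g2"), ("g3", "g4"), 2.0),
]

for height in (0.005, 0.5, 2.5):
    print(height, cut_tree(genes, merges, height))
```

Cutting low separates almost everything, cutting in the middle gives the two pairs, and cutting above the last join puts every gene in one cluster.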


Intensity Plots

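Intensity plots compare gene expression profiles ("guilt by association"). Proximity measures include correlation, Euclidean distance, the inner product of profiles X and Y, Hamming distance, and L1 distance. These dissimilarities may or may not be metrics (a metric must satisfy the triangle inequality), although all are loosely referred to as distances.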
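For reference, here is a minimal sketch of the listed proximity measures for two profiles of equal length. The function names and example vectors are chosen for this illustration only.

```python
def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def l1(x, y):
    # L1 distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # Number of positions that differ; most useful for discretized
    # profiles (for example 1 = "up", 0 = "not up").
    return sum(a != b for a, b in zip(x, y))

def correlation(x, y):
    # Pearson correlation: +1 for profiles that rise and fall together,
    # -1 for profiles that move in opposite directions.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    return inner_product(dx, dy) / (inner_product(dx, dx) * inner_product(dy, dy)) ** 0.5

x, y = [1, 2, 3], [3, 2, 1]
print(correlation(x, y), euclidean(x, y), inner_product(x, y), l1(x, y))
print(hamming([1, 0, 1], [1, 1, 0]))
```

Correlation is a similarity (larger means more alike), while the Euclidean, L1, and Hamming measures are dissimilarities (larger means less alike), so they cannot be swapped without adjusting the comparison.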

We want our intensity plots to compare the genes and expression patterns between fed and non-fed snakes.

To measure the similarity or dissimilarity of a gene to a cluster, one must determine which linkage method to use.

Linkage Methods:

  • Centroid (average profile): define the cluster by averaging its members' profiles, then treat that average like an individual gene when comparing it to other genes.
  • Average linkage: average the distances from the gene of interest to every member of the cluster.
  • Complete and single linkage: use the maximum or minimum distance, respectively. Under complete linkage, a gene joins a cluster only if it is close to all of its members; under single linkage, it joins if it is close to at least one member.


QT clust

One can also use QT clust instead of a heat map with the following steps:

  1. Each gene builds a supervised cluster around itself.
  2. The gene with the "best" (largest) list, together with the genes in its list, becomes the next cluster.
  3. Remove these genes from consideration, and repeat.
  4. Stop when all genes are clustered, or when the largest remaining cluster is smaller than a user-specified threshold.
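The four steps above can be sketched as follows. This is an illustration, not the QT clust software itself: it assumes Euclidean distance, uses complete linkage (maximum distance to any member) when each gene builds its list, and takes a hypothetical min_size parameter as the user-specified threshold. The profiles are made up.

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def grow(seed, pool, profiles, max_dist):
    """Step 1: build a supervised cluster around one seed gene."""
    cluster = [seed]
    rest = sorted(pool - {seed})
    while rest:
        def farthest(gene):
            return max(euclidean(profiles[gene], profiles[m]) for m in cluster)
        best = min(rest, key=farthest)
        if farthest(best) > max_dist:
            break
        cluster.append(best)
        rest.remove(best)
    return cluster

def qt_clust(profiles, max_dist, min_size=2):
    """Return (clusters, genes left unclustered)."""
    pool = set(profiles)
    clusters = []
    while pool:
        # Step 2: the seed with the largest list wins.
        best = max((grow(g, pool, profiles, max_dist) for g in sorted(pool)), key=len)
        # Step 4: stop if even the largest cluster is too small.
        if len(best) < min_size:
            break
        clusters.append(sorted(best))
        # Step 3: remove these genes from consideration and repeat.
        pool -= set(best)
    return clusters, sorted(pool)

# Made-up profiles: a group of three, a pair, and one outlier.
profiles = {
    "a": [0, 0], "b": [1, 0], "c": [0, 1],
    "d": [10, 10], "e": [11, 10],
    "f": [50, 50],
}
clusters, leftover = qt_clust(profiles, max_dist=2.0)
print(clusters, leftover)
```

Unlike hierarchical clustering, this approach can leave genes unclustered: the outlier "f" never joins a group because no other gene falls within max_dist of it.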



Questions to Consider:

  • How do you compare one thing to a group of things?
  • How can we track genes that match with a transcription factor?


Moving Forward:

  • Remember, there is no one perfect, correct answer. Chase the things that interest you, look for genes that behave similarly, and keep pulling them into your group; however, practice restraint.
  • It will be important to track genes that match a transcription factor. Although the change in a transcription factor itself might be small, genes with big changes may still correlate with it.
  • Gene ontology terms will help the clustering process.


Ashlyn's Main Page