JP Feb 09 16

From GcatWiki

Julia Preziosi

Correlation Coefficient:

Statistically, we see significant changes. Biologically, looking at the size of the changes in the genes, we may not think they're that significant: statistical significance is not the same as biological significance.

Genes are correlated when their expression follows a shared trend across multiple samples. For instance, if Gene 1 and Gene 2 are both upregulated in the same sample (magnitude doesn't matter), they agree for that sample. Over multiple samples, this agreement is summarized as a correlation coefficient, which gets stronger when the two genes also rise and fall together in the other samples. A negative correlation indicates that as one gene increases, the other gene decreases. The expression profiles don't have to be parallel lines. Small fluctuations can really change a correlation, especially if a gene's values are clustered very tightly around a flat line.
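
As a minimal sketch of the idea above, the Pearson correlation coefficient can be computed by hand. Pearson is an assumption here (the notes do not say which coefficient was used), and the gene names and expression values are made up for illustration. The last two comparisons show how tiny fluctuations in a nearly flat profile can flip the sign of the correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression values across four samples (made-up numbers).
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [10.0, 20.0, 30.0, 40.0]  # same trend, much larger magnitude
gene3 = [4.0, 3.0, 2.0, 1.0]      # opposite trend

r_same = pearson(gene1, gene2)      # close to +1: magnitude doesn't matter
r_opposite = pearson(gene1, gene3)  # close to -1: negative correlation

# A nearly flat gene: fluctuations of 0.01 decide the sign of the correlation.
flat_up = [5.00, 5.01, 5.00, 5.01]
flat_down = [5.01, 5.00, 5.01, 5.00]
r_flat_up = pearson(gene1, flat_up)      # positive
r_flat_down = pearson(gene1, flat_down)  # negative
```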

  • Do we include all three fastings and feedings if small differences change the clustering?


Clustering: Grouping the genes and samples together and presenting them in an order. Need to understand the algorithms for clustering.

Reasons to cluster: explore big data sets, pull out patterns, make predictions.

Gene expression can be made into a ratio between fed and nonfed expression levels. Generates "gene induction / repression" values.


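However, a ratio of 1/16 looks less significant than a ratio of 16, even though both are 16-fold changes. *Use a log scale (base 2) instead: a ratio of 16 becomes a value of 4, and a ratio of 1/16 becomes -4. Lets you visualize repression and induction between genes on the same footing - direct and indirect relationships can be assumed (coregulation).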
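
A small sketch of the log transform described above. The function name and the fed / nonfed direction of the ratio are assumptions made for illustration; the expression levels are made-up numbers.

```python
from math import log2

def log_ratio(fed, nonfed):
    """Base-2 log of the fed / nonfed expression ratio (direction assumed)."""
    return log2(fed / nonfed)

# Hypothetical expression levels.
induced = log_ratio(16.0, 1.0)    # ratio 16   -> 4.0
repressed = log_ratio(1.0, 16.0)  # ratio 1/16 -> -4.0
unchanged = log_ratio(3.0, 3.0)   # ratio 1    -> 0.0
```

On the log scale, induction and repression of the same fold change sit at equal distances from zero, which the raw ratios (16 versus 0.0625) do not show.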

  • Analysis techniques - how to pull out negative correlations.

Clustering by gene expression profiles - "Guilt by association". Compare expression levels of the genes over samples. *The pattern is less meaningful for us since we are not measuring 'over time'. R has several built-in ways of comparing profiles - Euclidean distance, etc. Clustering is very sensitive to outliers, regardless of magnitude.

Linkage methods - how to compare one thing to a group of things? Could define the average of the cluster as the cluster, and find the correlation between that average and the one thing. Could average the distances from the one thing to each point in the cluster (average linkage). Stringent requirement - only let it in if it is close to every member of the cluster (complete linkage).
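
The three options above can be sketched as follows. Euclidean distance is used as the comparison measure (an assumption; correlation would work the same way), and the function names and points are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_distance(point, cluster):
    """Treat the average of the cluster as the cluster."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return euclidean(point, centroid)

def average_linkage(point, cluster):
    """Average the distances from the point to every cluster member."""
    return sum(euclidean(point, m) for m in cluster) / len(cluster)

def complete_linkage(point, cluster):
    """Distance to the farthest member: small only if close to every member."""
    return max(euclidean(point, m) for m in cluster)

# Made-up two-member cluster and one candidate point.
cluster = [(0.0, 0.0), (0.0, 4.0)]
point = (3.0, 0.0)

d_centroid = centroid_distance(point, cluster)  # sqrt(13), about 3.61
d_average = average_linkage(point, cluster)     # (3 + 5) / 2 = 4.0
d_complete = complete_linkage(point, cluster)   # 5.0
```

The same point gets a different distance to the same cluster under each rule, which is why the choice of linkage changes the resulting tree.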

Hierarchical Clustering: join the two most similar genes; join the next two most similar genes / gene clusters; repeat. Once joined, these genes can never be separated. The method doesn't look globally at all the patterns.

  • Want to pull out relationships from our mess of visual joinings.

No gene left behind. Every gene has to be in a cluster at some point. Joining starts at +1 and ends at -1 correlation (exact opposites that might be related are separated). "Cutting the Tree" - group together all things that are already joined at an arbitrary clustering distance = new, distinct groups.
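
A minimal pure-Python sketch of hierarchical clustering with a tree cut. It assumes a distance of 1 - r (Pearson), average linkage, and made-up gene profiles; none of these choices come from the notes. Stopping the merging once the closest pair is farther apart than max_distance gives the same groups as building the full tree and cutting it at that height.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def cluster_distance(c1, c2, profiles):
    """Average linkage: mean of (1 - r) over all cross-cluster gene pairs."""
    dists = [1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)

def hierarchical_cut(profiles, max_distance):
    """Repeatedly join the two closest clusters; stop at the cut height."""
    clusters = [[name] for name in profiles]  # every gene starts alone
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # the tree is "cut" here
        merged = clusters[i] + clusters[j]  # joined genes are never separated
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical profiles: A and B rise together, C and D fall together.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.5, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
groups = hierarchical_cut(profiles, max_distance=0.5)
everything = hierarchical_cut(profiles, max_distance=2.5)
```

With the cut at 0.5 the rising and falling genes end up in two separate groups; with a cut above 2 (the largest possible 1 - r distance) every gene is eventually joined into a single cluster.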

K-means Clustering: specify how many clusters (k) to form. Creates k clusters of the most similar things. We could set k to 2 and see if the data supports fed / nonfed clusters. Clustering genes? Hard to pick k.
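
A minimal sketch of k-means with k = 2, to show what "see if the data supports fed / nonfed clusters" could look like. The sample names and expression values are made up, and the centroids are initialized from the first k samples purely to keep the example deterministic; real implementations usually use random or smarter initialization with several restarts.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, max_iter=100):
    """Return a cluster index for each point (simplified k-means)."""
    centroids = [list(p) for p in points[:k]]  # simplistic initialization
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: nothing changed
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment

# Hypothetical samples, each described by the expression of three genes.
samples = {
    "fed_1": [5.0, 1.0, 4.0],
    "nonfed_1": [1.0, 5.0, 0.5],
    "fed_2": [4.5, 1.5, 4.2],
    "nonfed_2": [0.8, 4.6, 1.0],
    "fed_3": [5.2, 0.9, 3.8],
    "nonfed_3": [1.2, 5.1, 0.7],
}
names = list(samples)
labels = kmeans([samples[n] for n in names], k=2)
by_cluster = {n: lab for n, lab in zip(names, labels)}
```

If the two clusters line up with the fed and nonfed labels, as they do for this made-up data, the data supports that grouping.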

Supervised Clustering: Can pull out clusters to suit your situation. Pick the gene(s) you care about and find all genes which are similar (by correlation coefficient, perchance) to it / to their average. "Similar" is controlled by whatever method you choose. Iterative - join the two most similar, then find the next most similar to that cluster (by whatever method).

  • Want to know every gene that tracks with a transcription factor - even though relative increased expression of that factor may be low.
  • Need to filter data and get rid of the noise first.
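
One simple way to sketch the supervised idea is to pick a seed gene and keep every gene whose Pearson correlation with it passes a threshold. This is the non-iterative version, and the threshold, gene names, and values are all hypothetical. Because correlation ignores magnitude, a transcription factor with a small change in expression can still pull out its strongly changing targets.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def supervised_cluster(seed, profiles, min_r):
    """Genes (other than the seed) correlated with the seed at r >= min_r."""
    return [
        name
        for name, profile in profiles.items()
        if name != seed and pearson(profiles[seed], profile) >= min_r
    ]

# Made-up profiles: the factor changes only slightly, its target a lot.
profiles = {
    "factor": [1.0, 1.1, 1.2, 1.3],
    "target_up": [2.0, 4.0, 6.0, 8.5],
    "target_down": [8.0, 6.0, 4.0, 1.0],
    "unrelated": [3.0, 1.0, 3.0, 1.0],
}
tracks_with_factor = supervised_cluster("factor", profiles, min_r=0.9)
```

Genes that move opposite to the seed are not returned by this sketch; comparing r against a negative threshold would be one way to pull those out.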

Quality Clustering - QT Clust: 1. Each gene builds a supervised cluster (some groups overlap). 2. The gene with the "best" list, together with the genes in its list, becomes a cluster (the rule for "best" is set by you - it could be the biggest). 3. These genes are removed from consideration (they are no longer a part of anyone's group), and the process repeats (the next best cluster gets cut, etc.).
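
The three steps above can be sketched as below. This follows the simplified description in these notes rather than any particular published implementation: each candidate list is built with a plain Pearson threshold, "best" is taken to mean biggest, and ties go to the first seed encountered. The gene names and values are made up.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def qt_clust_sketch(profiles, min_r):
    """Simplified QT-style clustering: biggest candidate list wins each round."""
    remaining = list(profiles)
    clusters = []
    while remaining:
        best = None
        # Step 1: every remaining gene builds its own candidate list.
        for seed in remaining:
            candidate = [
                g for g in remaining
                if g == seed or pearson(profiles[seed], profiles[g]) >= min_r
            ]
            # Step 2: keep the "best" list (here: the biggest).
            if best is None or len(candidate) > len(best):
                best = candidate
        clusters.append(best)
        # Step 3: remove the clustered genes and repeat.
        remaining = [g for g in remaining if g not in best]
    return clusters

# Hypothetical profiles: three rising genes, two falling genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.5, 8.0],
    "geneC": [1.0, 3.0, 4.0, 7.0],
    "geneD": [4.0, 3.0, 2.0, 1.0],
    "geneE": [8.0, 6.0, 4.5, 2.0],
}
qt_groups = qt_clust_sketch(profiles, min_r=0.9)
```

Unlike hierarchical clustering, the biggest group is pulled out first and its genes cannot show up in any later group.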

Clustering Resource