Lecture 9
Wet-Lab Biologists
Omic Data Analysis
& Visualisation Using R
So far we have shown that there are significantly different genes BUT NOT HOW
BELIEVABLE THEY ARE WHEN VIEWED ALL AT ONCE. Statistics can be wrong. The first rule of
research is to be RIGHT. The first rule of statistics is that once you have p-values you
MUST check by eye that they are believable, by looking at the original data. In this
example we need to look at the significant genes at the EXPRESSION LEVEL. If you
read an RNA-seq paper that shows p-values but NO EXPRESSION DATA, don’t believe
it.
Heatmap of Significantly Different Genes
Question: Are our significantly differential genes consistent between replicates but
different between groups?
A heatmap is simply a table where the numbers are replaced by a colour. The colour
intensity represents the number. One great thing about heatmaps is that they let you see
your entire expression matrix (both genes and samples) at once. NO OTHER plot lets you
do this.
The second great thing about heatmaps is that they can be clustered. Clustering
is an EXTREMELY powerful transformation for finding groups.
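To make the "table of colours" idea concrete, here is a minimal sketch in base R. The matrix em, its gene and sample names and its values are invented for illustration; they are not the lecture dataset.

```r
# Toy expression matrix: 3 genes x 4 samples (invented values)
em = matrix(c(1, 2, 9, 8,
              7, 8, 1, 2,
              2, 1, 8, 9),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("geneA", "geneB", "geneC"),
                            c("ctrl_1", "ctrl_2", "treat_1", "treat_2")))

# A palette of 9 colours running from blue (low) through white to red (high)
palette = colorRampPalette(c("blue", "white", "red"))(9)

# Replace every number by a colour: cut() puts each value into one of 9 bins
bins = cut(em, breaks = 9, labels = FALSE)
colour_table = matrix(palette[bins], nrow = nrow(em), dimnames = dimnames(em))
print(colour_table)

# Draw the same table as coloured tiles.
# image() puts the rows of its input on the x axis, so we transpose to get
# samples along x and genes along y (the first gene is drawn at the bottom).
image(t(em), col = palette, axes = FALSE)
```

Printing colour_table shows that the heatmap really is the original table with each number swapped for a colour code; equal numbers always receive the same colour.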
If we plot the same heatmap without clustering we get this. Remember the data it shows is
identical:
It looks very messy. That’s because the genes are simply in the order that they appear in
the differential expression file – which is random. What we need is some way to order
the genes so that it looks neat. One way is to sort them by fold change. A better way is to
cluster them.
Clustering
(1) We take our expression matrix of significant genes (in this case 3 genes).
(2) We decide which two genes have the most similar expression pattern, based on
Spearman correlations between all combinations of genes.
(3) We average the expression values for the two genes identified in (2).
(4) We repeat stages 2+3 until there is only one Spearman value.
Obviously with many hundreds of genes this will be computationally quite heavy.
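The four stages above can be sketched literally in base R. This is a teaching sketch under stated assumptions: the matrix em and its values are invented, and the loop performs one final join so that everything ends up in a single tree. R's own hclust() function works from a distance matrix rather than re-averaging expression values, so its results may differ in detail.

```r
# (1) Toy matrix of 3 significant genes x 4 samples (invented values)
em = matrix(c(1, 2, 8, 9,
              2, 3, 9, 10,
              9, 7, 2, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4)))

joins = character(0)   # records the order in which genes were joined

while (nrow(em) > 1) {
  # (2) Spearman correlation between all pairs of genes
  #     (cor() compares columns, so we transpose to compare genes)
  rho = cor(t(em), method = "spearman")
  diag(rho) = -Inf                                   # ignore self-correlation
  best = sort(which(rho == max(rho), arr.ind = TRUE)[1, ])
  pair = rownames(em)[best]
  joins = c(joins, paste(pair, collapse = "+"))

  # (3) average the expression values of the two most similar genes
  merged = colMeans(em[best, , drop = FALSE])
  em = rbind(em[-best, , drop = FALSE], merged)
  rownames(em)[nrow(em)] = paste0("(", paste(pair, collapse = "+"), ")")
  # (4) the loop repeats (2) and (3) until a single row is left
}

print(joins)
```

The joins vector is the "joining loop": it lists which genes (or earlier joins) were merged at each step, which is exactly the information a dendrogram draws.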
In a dendrogram the “arms” can rotate freely: flipping the two branches at any join
gives the same dendrogram, just drawn in a different order.
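A small base R sketch of this rotation, assuming an invented 4-gene matrix. It flips every join with rev() and then checks that the height at which each pair of genes is joined has not changed, i.e. it is the same tree.

```r
# Toy matrix: 4 genes x 4 samples (invented values, not the lecture data)
em = matrix(c(1, 2, 8, 9,
              2, 3, 9, 10,
              9, 7, 2, 1,
              8, 8, 1, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:4)))

hc = hclust(dist(em), method = "average")
dd = as.dendrogram(hc)

# Flip the arms at every join: the leaf order is mirrored,
# but which genes are joined to which is untouched
dd_flipped = rev(dd)

print(labels(dd))
print(labels(dd_flipped))

# The join height between every pair of genes (cophenetic distance) is unchanged
genes = rownames(em)
same_tree = isTRUE(all.equal(as.matrix(cophenetic(dd))[genes, genes],
                             as.matrix(cophenetic(dd_flipped))[genes, genes]))
print(same_tree)
```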
The second part of a clustering algorithm is the reordering based on how far up or
down the joining loop a join occurred. In the previous example the starting order of
the genes was close to the re-ordered order. Let’s go again with a more complex
example.
We can see that the dendrogram is very tangled up. If we start at the first join and spin
the dendrogram until the two joined genes are side by side, then take the next join and
repeat, and so on, we get the final order.
Ta daa! This is the order that we place the genes in the heatmap.
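In R the joining loop and the final order can be inspected directly on the object returned by hclust(). The sketch below uses base R only and an invented 5-gene matrix; Euclidean distance is used here purely to keep the numbers easy to follow.

```r
# Toy matrix: 5 genes x 2 samples, deliberately listed in a "tangled" order
em = matrix(c( 1,  1,
              10, 10,
               2,  2,
              12, 12,
              30, 30),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("gene", 1:5), c("s1", "s2")))

hc = hclust(dist(em), method = "average")

# The joining loop: one row per join, in the order the joins happened.
# Negative numbers are single genes, positive numbers refer to an earlier join.
print(hc$merge)

# How far up the dendrogram each join sits
print(hc$height)

# The leaf order once the arms have been untangled
print(hc$order)

# This is the order that we place the genes in the heatmap
em_ordered = em[hc$order, ]
print(rownames(em_ordered))
```

In this toy example gene1 and gene3 are the most similar pair, so they are joined first and end up next to each other in em_ordered even though they started apart.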
There are MANY different algorithms for clustering. This is the simplest but also
one of the most powerful and widely used. Clustering can be achieved in R.
The three components that can easily be modified in the algorithm are:
1) The method for getting the distance. Here we used Spearman correlations, but we
could use Pearson, Euclidean, Geometric, etc.
2) The method used to agglomerate. Here we used the mean, i.e. we averaged the
expression values when we joined. There are others, such as the median.
3) The method used to reorder the dendrogram. Here we used distance. There are others.
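As a rough guide to where these three choices live in R, here is a base R sketch. The random matrix em is invented, the "1 - correlation" distance is one common way to turn a correlation into a distance (an assumption, not necessarily what the lecture used), and the reorder weights are arbitrary example values.

```r
set.seed(1)
# Toy matrix: 6 genes x 5 samples of random values (illustration only)
em = matrix(rnorm(6 * 5), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("s", 1:5)))

# 1) The distance method
d_spearman = as.dist(1 - cor(t(em), method = "spearman"))
d_pearson  = as.dist(1 - cor(t(em), method = "pearson"))
d_euclid   = dist(em, method = "euclidean")

# 2) The agglomeration method (see ?hclust for the full list)
hc_average  = hclust(d_spearman, method = "average")
hc_complete = hclust(d_spearman, method = "complete")

# 3) The reordering method: reorder() rotates the arms according to weights.
#    Here each gene's mean expression is used as its weight.
dd = as.dendrogram(hc_average)
dd_reordered = reorder(dd, wts = rowMeans(em), agglo.FUN = mean)

# Extract the final gene order and apply it to the matrix
gene_order = order.dendrogram(dd_reordered)
em_clustered = em[gene_order, ]
print(rownames(em_clustered))
```

Changing any of the three choices can change the final gene order, so it is worth trying a couple of combinations and checking that the overall picture is stable.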
A final point regarding dendrograms is that you DO NOT need to show them on the plot.
Clustering is NOT scientific. The dendrogram is meaningless once the order is decided.
The power of clustering large datasets
Previously we have used simple datasets, i.e. two groups of samples with few
replicates. However, with more complex experiments (and primary tissue)
clustering is MORE powerful. It can be used to identify hidden patterns.
[Figure: two heatmaps of the same 2,000 genes x 87 patients. Left: un-clustered data. Right: the same data but clustered – look what was hiding!]
You can identify previously unknown groups of genes and samples with concordant
expression profiles this way.
Part 2: Making heatmaps in R
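library(amap)
# makes a matrix
hm.matrix = as.matrix(em_sig_scaled)
# calculates the distances between genes (rows) using Dist() from amap
# assumption: y.dist was not defined on the slide; Spearman matches the method described earlier
y.dist = Dist(hm.matrix, method="spearman")
# clusters the genes using average agglomeration
y.cluster = hclust(y.dist, method="average")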
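The clustering result still has to be applied to the matrix before plotting. Below is a self-contained sketch of the whole route from expression table to clustered matrix. Several things here are assumptions: the toy table em_sig and its values are invented, em_sig_scaled is assumed to hold each gene scaled to mean 0 and standard deviation 1 (as the name suggests), and the Spearman distance is computed in base R as 1 - correlation so that the example runs without amap.

```r
# Invented stand-in for the significant genes: 4 genes x 6 samples
em_sig = data.frame(
  ctrl_1  = c(10, 200, 55,  8),
  ctrl_2  = c(12, 210, 50,  9),
  ctrl_3  = c(11, 190, 52,  7),
  treat_1 = c(40,  90, 20, 30),
  treat_2 = c(42,  95, 22, 28),
  treat_3 = c(38, 100, 21, 31),
  row.names = c("geneA", "geneB", "geneC", "geneD"))

# Scale each gene (row); scale() works on columns, hence the two transposes
em_sig_scaled = data.frame(t(scale(t(em_sig))))

# makes a matrix
hm.matrix = as.matrix(em_sig_scaled)

# Spearman distance between genes, base R version (1 - correlation)
y.dist = as.dist(1 - cor(t(hm.matrix), method = "spearman"))

# clusters
y.cluster = hclust(y.dist, method = "average")

# applies the clustered gene order to the matrix
y.order = y.cluster$order
hm.matrix_clustered = hm.matrix[y.order, ]
print(rownames(hm.matrix_clustered))
```

In this toy data geneA and geneD go up with treatment while geneB and geneC go down, so each pair ends up side by side in hm.matrix_clustered.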
Rugs
Often the clusters, groups or clinical information on a heatmap have their own
colour bar. This is called a rug. It is actually a tiny heatmap that
accompanies the main one (e.g. one rug for group and one for age).
We simply take the values we want to be the rug (e.g. age, BMI, tissue, etc.),
melt them, don’t cluster, and plot a heatmap using these melted values.
library(ggplot2)
# colours
rug_colours = c("red", "cyan", "purple")
# plot
ggp = ggplot(rug_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradientn(colours = rug_colours)
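The slide does not show how rug_data is built, so here is one possible sketch. The sample names and the numeric group coding are invented, and the melting is done in base R with as.table() so the data preparation runs without extra packages. The column names Var1, Var2 and value are chosen to match the plotting code above. The plot itself is only built if ggplot2 is installed.

```r
# Invented sample names and a group code per sample (1 = control, 2 = treated)
samples = c("ctrl_1", "ctrl_2", "ctrl_3", "treat_1", "treat_2", "treat_3")
rug_matrix = matrix(c(1, 1, 1, 2, 2, 2), ncol = 1,
                    dimnames = list(samples, "group"))

# Melt to long format: one row per tile, with columns Var1, Var2 and value.
# We do NOT cluster: the samples stay in the same order as the main heatmap.
rug_data = as.data.frame(as.table(rug_matrix))
names(rug_data)[3] = "value"
print(rug_data)

# colours
rug_colours = c("red", "cyan", "purple")

# plot (skipped if ggplot2 is not available)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggp = ggplot(rug_data, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradientn(colours = rug_colours)
  # print(ggp) draws the rug
}
```

Because Var1 is a factor whose levels follow the row order of rug_matrix, reordering the rows of rug_matrix to match the clustered heatmap keeps the rug aligned with it.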
Amazing!