Bioinformatics for Wet-Lab Biologists
Omic Data Analysis & Visualisation Using R

Lecture 9. Heatmaps and Clustering


Part 1: What are heatmaps?
Heatmap of Significantly Different Genes

So far we have shown that there are significantly different genes, BUT NOT how
believable they are when viewed ALL AT ONCE. Statistics can be wrong. The first rule of
research is to be RIGHT. The first rule of statistics is that once you have p-values you
MUST check by eye whether they are believable, by looking at the original data. In this
example we need to look at the significant genes at the EXPRESSION LEVEL. If you
read an RNA-seq paper that shows p-values but NO EXPRESSION DATA, don’t believe
it.
Heatmap of Significantly Different Genes

Question: Are our significantly differential genes consistent between replicates but
different between groups?

A heatmap is simply a table where the numbers are replaced by a colour. The colour
intensity represents the number. One great thing about heatmaps is that they let you see
your entire expression matrix (both genes and samples) at once. NO OTHER plot lets you
do this.
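
To make the "table where the numbers are replaced by a colour" idea concrete, here is a minimal sketch in base R. The matrix values, gene names and sample names are invented for illustration, and the blue-white-red palette is only one possible choice.

```r
# Hypothetical 3-gene x 4-sample expression table (values invented for illustration)
em = matrix(c(-1.2, -0.9,  1.1,  1.0,
               0.8,  1.0, -0.7, -1.1,
               0.1, -0.1,  0.0,  0.2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("geneA", "geneB", "geneC"),
                            c("ctrl_1", "ctrl_2", "treat_1", "treat_2")))

# a 100-step colour palette running from blue (low) to red (high)
palette = colorRampPalette(c("blue", "white", "red"))(100)

# rescale every value to an index between 1 and 100, then look up its colour
idx = round((em - min(em)) / (max(em) - min(em)) * 99) + 1
colour_table = matrix(palette[idx], nrow = nrow(em), dimnames = dimnames(em))

# the same table as em, but each number is now a colour code
colour_table
```

Drawing each cell of colour_table as a filled tile is, in essence, what a heatmap function does.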

What does this plot show?


Clustering

The second great thing about heatmaps is that they can be clustered. Clustering
is an EXTREMELY powerful transformation for finding groups.

If we plot the same heatmap without clustering we get this. Remember, the data it shows
are identical:

It looks very messy. That’s because the genes are simply in the order in which they appear
in the differential expression file, which is arbitrary. What we need is some way to order
the genes so that it looks neat. One way is to sort them by fold change. A better way is to
cluster them.
Clustering

In clustering we order the genes by how similar their pattern of expression is
(across the samples). This is achieved iteratively:

(1) We take our expression matrix of significant genes (in this case 3 genes).

(2) We decide which two genes have the most similar expression pattern, based on
Spearman correlations between all combinations of genes.

(3) We average the expression values for the two genes identified in (2).

(4) We repeat stages 2+3 until there is only one Spearman value.

Obviously with many 100s of genes this will be computationally quite heavy.
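
The four stages above can be sketched for a single round in base R. This is only an illustration under stated assumptions: the gene names and expression values are invented, and the code is written to mirror the slide rather than any particular package.

```r
# stage 1: hypothetical expression matrix, 3 significant genes x 6 samples
em = rbind(geneA = c(1, 2, 3, 8, 9, 10),
           geneB = c(2, 1, 3, 9, 8, 10),
           geneC = c(10, 9, 8, 3, 1, 2))

# stage 2: Spearman correlation between all pairs of genes
# (cor() compares columns, so the matrix is transposed first)
gene_cor = cor(t(em), method = "spearman")

# ignore each gene's correlation with itself, then find the most similar pair
diag(gene_cor) = NA
best = which(gene_cor == max(gene_cor, na.rm = TRUE), arr.ind = TRUE)[1, ]
pair = sort(rownames(em)[best])

# stage 3: replace the two genes by their averaged expression profile
merged = colMeans(em[pair, ])
em_next = rbind(em[!(rownames(em) %in% pair), , drop = FALSE], merged)
rownames(em_next)[nrow(em_next)] = paste(pair, collapse = "+")

# stage 4: repeat stages 2 and 3 on em_next until only one row is left
em_next
```

Note that R's hclust() with method="average" averages the distances between clusters rather than the expression profiles themselves, so this sketch illustrates the idea on the slide, not what hclust() does internally.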
Clustering

As we perform the previous step, what we are actually doing is building a dendrogram.

[Figure: dendrogram built up one join at a time over the genes below]

TNF P53 CCL2 CSF1 ADRB2 NR1A NR1B ACTN TTN


Clustering

In a dendrogram the “arms” can rotate freely. E.g. this:

TNF P53 CCL2 CSF1 ADRB2 NR1A NR1B ACTN TTN

is the same as this:

TTN ACTN TNF P53 CCL2 CSF1 ADRB2 NR1A NR1B
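
A small base R sketch can show that rotating the arms changes the leaf order but not the joins. The expression values are invented, Euclidean distance is used only for brevity, and rev() is just one convenient way to rotate every arm at once.

```r
# Hypothetical data: 5 genes x 4 samples (values invented for illustration)
em = rbind(TNF  = c(1, 2, 9, 10),
           P53  = c(2, 1, 10, 9),
           CCL2 = c(9, 10, 1, 2),
           CSF1 = c(10, 9, 2, 1),
           TTN  = c(5, 5, 5, 6))

# build a dendrogram
dd = as.dendrogram(hclust(dist(em), method = "average"))

# rev() rotates every arm, which reverses the leaf order
dd_flipped = rev(dd)
labels(dd)          # leaf order of the original dendrogram
labels(dd_flipped)  # the same leaves, in reverse order

# the height at which any two genes are joined (cophenetic distance) is unchanged
genes = rownames(em)
coph = as.matrix(cophenetic(dd))[genes, genes]
coph_flipped = as.matrix(cophenetic(dd_flipped))[genes, genes]
all.equal(coph, coph_flipped)
```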


Clustering

The second part of a clustering algorithm is the reordering, based on how far up or
down the joining loop a join occurred. In the previous example the starting order of
the genes was close to the re-ordered order. Let's go again with a more complex
example.

TNF P53 CCL2 CSF1 ADRB2 NR1A NR1B ACTN TTN


Clustering

We can see that the dendrogram is very tangled up. If we start at the first join and spin
the dendrogram until the two joined branches are side by side, then take the next join and
repeat, and so on, we get the final order.

[Figure: tangled dendrogram with numbered joins (labels 5, 4, 2 and 1 are visible). The
gene order after each spin is:]

TNF P53 CCL2 CSF1 ADRB2 NR1A NR1B ACTN TTN
TNF P53 CCL2 CSF1 ADRB2 TTN NR1B ACTN NR1A
TNF P53 CCL2 ADRB2 TTN CSF1 NR1B ACTN NR1A
TNF P53 NR1A CCL2 ADRB2 TTN CSF1 NR1B ACTN

Clustering

Ta daa! This is the order that we place the genes in the heatmap.


TNF P53 NR1A CCL2 ADRB2 TTN CSF1 NR1B ACTN


Clustering

There are MANY different algorithms for clustering. This is the simplest but also
one of the most powerful and widely used. Clustering can be achieved in R.

The three components that can easily be modified in the algorithm are:

1) The method for getting the distance. Here we used Spearman correlations, but we
could use Pearson, Euclidean, Geometric, etc.

2) The method for agglomerating. Here we used the mean, i.e. we averaged the expression
values when we joined. There are others, such as the median.

3) The method for reordering the dendrogram. Here we used distance. There are others.

A final point regarding dendrograms is that you DO NOT need to show them on the plot.
Clustering is NOT scientific. The dendrogram is meaningless once the order is decided.
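
The three modifiable components listed on this slide might look like this in base R. This is a sketch under assumptions: the data are random numbers, 1 minus the correlation is used as a stand-in distance (the amap package's Dist() is not used here), and weighting the reorder by mean expression is an illustrative choice, not the method from the slides.

```r
# Hypothetical data: 6 genes x 5 samples (random values, for illustration only)
set.seed(1)
em = matrix(rnorm(30), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("sample", 1:5)))

# 1) the distance: three interchangeable choices
d_spearman  = as.dist(1 - cor(t(em), method = "spearman"))
d_pearson   = as.dist(1 - cor(t(em), method = "pearson"))
d_euclidean = dist(em, method = "euclidean")

# 2) the agglomeration: "average" (mean) or "median", among others
hc_mean   = hclust(d_spearman, method = "average")
hc_median = hclust(d_spearman, method = "median")

# 3) the reordering: here each gene is weighted by its mean expression
#    (an assumption made for this sketch)
dd = reorder(as.dendrogram(hc_mean), wts = rowMeans(em), agglo.FUN = mean)
gene_order = order.dendrogram(dd)

# the row order to use in the heatmap
rownames(em)[gene_order]
```

Swapping d_spearman for d_pearson or d_euclidean, or hc_mean for hc_median, changes the resulting gene order without changing the underlying data.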
The power of clustering large datasets

Previously we have used simple datasets, i.e. two groups of samples with few
replicates. However, with more complex experiments (and primary tissue)
clustering is MORE powerful. It can be used to identify hidden patterns.

[Figure: two heatmaps of the same data, 2,000 genes by 87 patients. Left: un-clustered
data. Right: the same data but clustered. Look what was hiding!]

You can identify previously unknown groups of genes and samples with concordant
expression profiles this way.
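
As a rough illustration of finding hidden sample groups, the sketch below simulates a small dataset with two patient groups, hides them by shuffling the columns, and recovers them by clustering. Everything here is an assumption made for the example: the data are simulated, the dataset is much smaller than the one in the figure, and cutree() with k = 2 is one simple way to turn a dendrogram into groups.

```r
set.seed(42)

# Simulated data with two hidden patient groups: 40 genes x 20 patients
true_group = rep(c(1, 2), each = 10)
em = matrix(rnorm(40 * 20), nrow = 40,
            dimnames = list(paste0("gene", 1:40), paste0("patient", 1:20)))
em[1:20,  true_group == 1] = em[1:20,  true_group == 1] + 3  # genes 1-20 high in group 1
em[21:40, true_group == 2] = em[21:40, true_group == 2] + 3  # genes 21-40 high in group 2

# shuffle the patients so that the groups are hidden in the column order
shuffle = sample(ncol(em))
em = em[, shuffle]
true_group = true_group[shuffle]

# cluster the patients (columns) rather than the genes (rows)
patient_dist = as.dist(1 - cor(em, method = "spearman"))
patient_cluster = hclust(patient_dist, method = "average")

# cut the dendrogram into two groups and compare with the simulated truth
found_group = cutree(patient_cluster, k = 2)
table(found_group, true_group)
```

With real data there is no true_group to compare against, so any groups found this way still need to be checked against the expression values and the clinical information.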
Part 2: Making heatmaps in R
library(amap)      # Dist()
library(reshape2)  # melt()
library(ggplot2)   # ggplot()

# makes a matrix
hm.matrix = as.matrix(em_sig_scaled)

# gets the distances between the genes (rows)
y.dist = Dist(hm.matrix, method="spearman")

# clusters
y.cluster = hclust(y.dist, method="average")

# this pulls out the dendrogram
y.dd = as.dendrogram(y.cluster)

# this untangles the dendrogram
y.dd.reorder = reorder(y.dd,0,FUN="average")

# this gets the untangled gene order from the dendrogram
y.order = order.dendrogram(y.dd.reorder)

# this reorders the original matrix in the new order
hm.matrix_clustered = hm.matrix[y.order,]

# makes the colour palette
colours = c("blue","pink","red")
palette = colorRampPalette(colours)(100)

# melt and plot
hm.matrix_clustered = melt(hm.matrix_clustered)
ggp = ggplot(hm.matrix_clustered, aes(x=Var2, y=Var1, fill=value)) + geom_tile() + scale_fill_gradientn(colours = palette)
ggp
Part 3: Rugs
Rugs

Often the clusters, groups or clinical information on a heatmap have their own
colour bar. This is called a rug. In fact it is actually a tiny heatmap that
accompanies the main one.

[Figure: heatmap with two rugs, labelled group and age]
Rugs

We simply take the values we want to be the rug (e.g. age, BMI, tissue, etc.). We
then melt them, don’t cluster, and plot a heatmap using these melted values.

library(reshape2)  # melt()
library(ggplot2)   # ggplot(), theme()

# load sample sheet
sample_sheet = read.table("/data_d2-d9/sample_sheet.csv", header=TRUE, row.names=1, sep='\t')

# colours
rug_colours = c("red", "cyan", "purple")

# rug for discrete variable (the groups are converted to the numbers 1, 2, 3, ...)
rug_data = as.matrix(as.numeric(factor(sample_sheet$SAMPLE_GROUP)))
rug_data = melt(rug_data)

# plot, then trim off the headers, legends, grid, etc.
# (the + must sit at the end of the line so that R knows the command continues)
ggp = ggplot(rug_data, aes(x = Var1, y = Var2, fill = value)) + geom_tile() + scale_fill_gradientn(colours = rug_colours) +
  theme(plot.margin=unit(c(0,1,1,1), "cm"),
        axis.line=element_blank(), axis.text.x=element_blank(), axis.title.x=element_blank(),
        axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.y=element_blank(),
        legend.position="none", panel.background=element_blank(), panel.border=element_blank(),
        panel.grid.major=element_blank(), panel.grid.minor=element_blank(), plot.background=element_blank())
Rugs

Amazing!
