A Short Tutorial For Decent Heat Maps in R
I have received many questions from people who want to visualize their data as heat
maps, ideally as quickly as possible. This is a major issue in exploratory data
analysis, since we often don’t have the time to digest whole books about particular
techniques in different software packages just to get the job done. But once we are
happy with our initial results, it might be worthwhile to dig deeper into the topic
in order to further customize our plots and maybe even polish them for publication.
In this post, my aim is to briefly introduce one of R’s several heat map libraries
for a simple data analysis. I chose R because it is one of the most popular free
statistical software packages around. Of course, there are many more tools out there
that produce similar results (and even in R there are many different packages for
heat maps), but I will leave this as an open topic for another time.
Sections
Script overview
Running a script in R
Script parameters in more detail
A) Installing and loading required packages
B) Reading in data and transforming it into matrix format
C) Customizing and plotting the heat map
Note: For those who prefer Python, I also have a short tutorial for “Heatmaps,
Hierarchical Clustering, and Dendrograms in Python”.
The files that I used can be downloaded from the GitHub repository at:
https://github.com/rasbt/R_snippets/tree/master/heatmaps
Following this paragraph you see the whole shebang so that you know what you
are dealing with: an R script that uses R’s gplots package to create heat maps via
the heatmap.2() function. It might look gargantuan considering that we “only”
want to create a simple heat map, but don’t worry, many of the parameters are not
required, and I will discuss the details in the following sections.
Script overview
#########################################################
### A) Installing and loading required packages
#########################################################
if (!require("gplots")) {
  install.packages("gplots", dependencies = TRUE)
  library(gplots)
}
if (!require("RColorBrewer")) {
  install.packages("RColorBrewer", dependencies = TRUE)
  library(RColorBrewer)
}
#########################################################
### B) Reading in data and transforming it into matrix format
#########################################################
#########################################################
### C) Customizing and plotting the heat map
#########################################################
# (optional) defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(-1, 0, length = 100),     # for red
               seq(0.01, 0.8, length = 100), # for yellow
               seq(0.81, 1, length = 100))   # for green
heatmap.2(mat_data,
cellnote = mat_data, # same data set for cell labels
Running a script in R
To run a script in R, start a new R session by either typing R into a shell terminal
or launching R from your Applications folder. Now, you can type the following
command in R to execute a script:
source("path/to/the/script/heatmaps_in_R.R")
Script parameters in more detail
A) Installing and loading required packages
if (!require("gplots")) {
  install.packages("gplots", dependencies = TRUE)
  library(gplots)
}
if (!require("RColorBrewer")) {
  install.packages("RColorBrewer", dependencies = TRUE)
  library(RColorBrewer)
}
B) Reading in data and transforming it into matrix format
When we open the CSV file in our favorite plain text editor instead of using a
spreadsheet program (Excel, Numbers, etc.), it looks like this:
measurement5,0.1587,0.2948,0.153,-0.2208
measurement6,-0.4558,0.2244,0.6619,0.0457
measurement7,-0.6241,-0.3119,0.3642,0.2003
measurement8,-0.227,0.499,0.3067,0.3289
measurement9,0.7365,-0.0872,-0.069,-0.4252
measurement10,0.9761,0.4355,0.8663,0.8107
When we read the data from our CSV file into R and assign it to the variable
data, note the two lines of comments preceding the main data in our CSV file,
indicated by an octothorpe (#) character. Since we don’t need those lines to plot
our heat map, we can ignore them via the comment.char argument of the
read.csv() function.
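As a minimal sketch of this step, the snippet below writes a small stand-in file and reads it back. The temporary file, the wording of the two comment lines, and the empty first header field are assumptions for illustration; only the column names and the two data rows come from the example above.

```r
# Hypothetical stand-in for the CSV file: two comment lines, a header row,
# and two of the data rows shown above.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "# first comment line (assumed wording)",
  "# second comment line (assumed wording)",
  ",var1,var2,var3,var4",
  "measurement5,0.1587,0.2948,0.153,-0.2208",
  "measurement6,-0.4558,0.2244,0.6619,0.0457"
), csv_path)

# read.csv() does not treat "#" as a comment marker by default,
# so we have to ask for it explicitly.
data <- read.csv(csv_path, comment.char = "#")
str(data)  # 2 rows, 5 columns; the first column holds the row labels
```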
One tricky part of the heatmap.2() function is that it requires the data in a
numerical matrix format in order to plot it. By default, data that we read from files
using R’s read.table() or read.csv() functions is stored in a data frame.
A matrix differs from a data frame in that a matrix can only hold one type of
data, e.g., numeric, character, or logical.
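The difference is easy to see on a toy example. The two-row data frame below is made up for illustration and is not part of the tutorial's data set.

```r
# A data frame may mix column types ...
df <- data.frame(name = c("measurement1", "measurement2"),
                 var1 = c(0.5, -0.2))
sapply(df, class)  # one class per column

# ... but a matrix cannot: mixed columns are all coerced to character.
m_all <- as.matrix(df)

# Restricting the conversion to numeric columns keeps the values numeric.
m_num <- as.matrix(df[, "var1", drop = FALSE])
```

This is why the label column has to be set aside before the conversion.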
Fortunately, we don’t have to worry about the row that contains our column names
(var1, var2, var3, var4), since the read.csv() function treats the first line of data
as the table header by default. But we would run into trouble if we tried to include
the row names (measurement1, measurement2, etc.) in our numerical matrix. For
our own convenience, we store the first column, which holds those row names, in the
variable rnames, which we can use later to assign row names to our matrix after the
conversion.
Now, we transform the numerical data from the variable data (columns 2 to 5) into
a matrix and assign it to a new variable mat_data.
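A sketch of these two steps, assuming base R's data.matrix() for the conversion. The small data frame stands in for the result of read.csv(), and the column name X for the label column is an assumption.

```r
# Stand-in for the data frame returned by read.csv()
data <- data.frame(
  X    = c("measurement5", "measurement6"),
  var1 = c(0.1587, -0.4558),
  var2 = c(0.2948, 0.2244),
  var3 = c(0.153, 0.6619),
  var4 = c(-0.2208, 0.0457)
)

rnames <- data[, 1]                            # first column: row labels
mat_data <- data.matrix(data[, 2:ncol(data)])  # columns 2..5 as a numeric matrix
dim(mat_data)
```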
Instead of using the rather fiddly expression ncol(data), which returns the total
number of columns in the data frame, we could also provide the integer 5 directly
in order to specify the last column that we want to include. However,
ncol(data) is more convenient for larger data sets, since we don’t need to
count all columns to get the index of the last column for the upper
boundary. Next, we assign the row names, which we saved as rnames
previously, to the matrix via the rownames() function.
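For illustration, here is the assignment on a small stand-in matrix (two rows and two columns taken from the example data):

```r
# Stand-in for the converted matrix and the saved row labels
mat_data <- matrix(c(0.1587, -0.4558, 0.2948, 0.2244), nrow = 2,
                   dimnames = list(NULL, c("var1", "var2")))
rnames <- c("measurement5", "measurement6")

rownames(mat_data) <- rnames  # attach the labels to the matrix rows

# Rows can now be addressed by name
mat_data["measurement6", "var1"]
```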
C) Customizing and plotting the heat map
There are many different ways to specify colors in R. I find it most convenient to
assign colors by their name. A nice overview of the different color names in R can
be found at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
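A minimal sketch of a palette built from color names, assuming base R's colorRampPalette() from grDevices. The name my_palette matches the heatmap.2() call later in this post, and the three base colors follow the comments in the script overview.

```r
# colorRampPalette() returns a function; calling it with n = 299
# interpolates 299 colors running from red through yellow to green.
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)

length(my_palette)  # number of individual colors
head(my_palette, 3) # hex color strings, starting at pure red
```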
The argument (n = 299) lets us define how many individual colors we want to
have in our palette. Obviously, the higher the number of individual colors, the
smoother the transition will be; the number 299 should be large enough
for a smooth transition. By default, the colors are spaced evenly, so
that each base color in our palette covers an interval of similar size.
However, sometimes we want to have a slightly skewed color range, depending on the
data we are analyzing. Let’s assume that our example data set consists of Pearson
correlation coefficients (i.e., R values) ranging from –1 to 1, and we are particularly
interested in samples that have a (relatively) high correlation: R values in the range
between 0.8 and 1.0. We want to highlight these samples in our heat map by only
showing values from 0.8 to 1 in green. In this case, we can define our color breaks
manually, as in the col_breaks vector from the script overview.
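The sketch below repeats the break definition from the script overview and adds two sanity checks, since heatmap.2() expects exactly one more break than there are colors. The variable n_colors is introduced here only for illustration.

```r
n_colors <- 299  # size of the palette used in this post

col_breaks <- c(seq(-1, 0, length.out = 100),      # for red
                seq(0.01, 0.8, length.out = 100),  # for yellow
                seq(0.81, 1, length.out = 100))    # for green

# One more break than colors, and strictly increasing values
stopifnot(length(col_breaks) == n_colors + 1)
stopifnot(!is.unsorted(col_breaks, strictly = TRUE))
```

If the palette size changes, the three segment lengths have to change with it so that the total stays at n_colors + 1.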
The default parameters of the png() function would yield a relatively small PNG
file at very low resolution, which is not really practical for heat maps. Thus, we
provide additional arguments for the image width, height, and resolution.
The units of width and height are pixels, not inches. So if we want to create a
5x5 inch image with 300 pixels per inch, we have to do a little math here: [1500
pixels] / [300 pixels/inch] = 5 inches. Also, we choose a slightly smaller font size of
8 pt.
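As a sketch, the arithmetic can be left to R. The temporary output path and the names dpi and side_in are assumptions for illustration, and plot.new() merely stands in for the heat map so that the snippet runs on its own.

```r
dpi <- 300    # pixels per inch
side_in <- 5  # desired width and height in inches

out_file <- tempfile(fileext = ".png")  # hypothetical output path
png(out_file,
    width = side_in * dpi,   # 5 in * 300 px/in = 1500 px
    height = side_in * dpi,
    res = dpi,               # resolution in pixels per inch
    pointsize = 8)           # slightly smaller font size

plot.new()                   # stand-in for the heatmap.2() call
invisible(dev.off())         # close the device so the file is written
```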
Be careful not to forget to close the png() plotting device at the end of your
script via the function dev.off(); otherwise you probably won’t be able to
open the PNG file to view it.
Now, let’s get down to business and take a look at the heatmap.2() function:
heatmap.2(mat_data,
  cellnote = mat_data,    # same data set for cell labels
  main = "Correlation",   # heat map title
  notecol = "black",      # change font color of cell labels to black
  density.info = "none",  # turns off density plot inside color legend
  trace = "none",         # turns off trace lines inside the heat map
  margins = c(12, 9),     # widens margins around plot
  col = my_palette,       # use the color palette defined earlier
  breaks = col_breaks,    # enable color transition at specified limits
  dendrogram = "row",     # only draw a row dendrogram
  Colv = FALSE)           # turn off column clustering
heatmap.2(mat_data,
  ...
  Rowv = as.dendrogram(cluster), # apply default clustering method
  Colv = as.dendrogram(cluster)  # apply default clustering method
)
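The cluster object used above has to be created beforehand. A minimal sketch, assuming base R's dist() and hclust() with their default settings (Euclidean distance, complete linkage), follows. The random matrix is a stand-in with the same shape as the example data, and the names row_dend and col_dend are hypothetical. Rows and columns are clustered separately here because a dendrogram built from the 10 rows has 10 leaves and cannot be reused for the 4 columns of a non-square matrix.

```r
# Stand-in data: 10 measurements x 4 variables
set.seed(1)
mat_data <- matrix(runif(40, -1, 1), nrow = 10,
                   dimnames = list(paste0("measurement", 1:10),
                                   paste0("var", 1:4)))

# Cluster rows and columns separately
row_cluster <- hclust(dist(mat_data))     # distances between rows
col_cluster <- hclust(dist(t(mat_data)))  # distances between columns

row_dend <- as.dendrogram(row_cluster)
col_dend <- as.dendrogram(col_cluster)

# These could then be passed on, e.g. Rowv = row_dend, Colv = col_dend
```

Other distance measures or linkage methods can be selected through the method arguments of dist() and hclust().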
heatmap.2(mat_data,
  ...
  RowSideColors = c(    # grouping row-variables into different categories
    rep("gray", 3),     # Measurement 1-3: gray
    rep("blue", 3),     # Measurement 4-6: blue
    rep("black", 4)),   # Measurement 7-10: black
  ...
)
Note that we could also provide similar labels to the column variables via the
ColSideColors argument. Another useful addition would be to add a color legend
for our new category labels, for example with R’s legend() function.
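A sketch of such a legend using base R's legend(). The category names are placeholders, the colors match the RowSideColors example above, and plot.new() on a throwaway device stands in for the heat map so that the snippet runs by itself.

```r
pdf(NULL)      # throwaway graphics device, no file is written
plot.new()     # stand-in for the heatmap.2() call

par(lend = 1)  # butt line ends, so thick lines render as color blocks
lg <- legend("topright",
             legend = c("category1", "category2", "category3"),  # placeholder names
             col = c("gray", "blue", "black"),  # same colors as RowSideColors
             lty = 1,    # draw each key as a line ...
             lwd = 10)   # ... thick enough to look like a swatch

invisible(dev.off())
```

In a real script, the legend() call would go right after heatmap.2() and before dev.off().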
The figure below shows what our modified heat map would look like after we applied
row categorization and provided a color legend:
© 2013-2021 Sebastian Raschka