Graphics Chapter
fixme: Review this when chapter is finished.
[Figure 3.1: An elementary law of visualisation. x-axis: time to make plot in minutes.]
There are two important types of data visualization. The first enables a scientist to effectively explore the data and make discoveries about the complex processes at work. The other type of visualization should be a clear and informative illustration of the study's results that she can show to others and eventually include in the final publication.
Both of these types of graphics can be made with R. In fact, R offers multiple systems for plotting data. This is because R is an extensible system, and because progress in R graphics has proceeded largely not by replacing the old functions, but by adding packages. Each of the different approaches has its own advantages and limitations. In this chapter we'll briefly get to know some of the base R plotting functions1. Subsequently we will switch to the ggplot2 graphics system.
1 They live in the graphics package, which ships with every basic R installation.
Base R graphics were historically first: simple, procedural, canvas-oriented. There are many specialized functions for different types of plots. Recurring plot modifications, like legends, grouping of the data by using different plot symbols, colors or subpanels, have to be reinvented over and over again. Complex plots can quickly get messy to program. A more high-level approach is the grammar of graphics: plots are built from modular pieces, so that we can easily try different visualization types for our data in an intuitive and easily deciphered way, just as we can switch parts of a sentence in and out in human language.
We'll explore faceting, for showing more than two variables at a time. Sometimes this is also called lattice2 graphics, and it allows us to visualise data in up to 4 or 5 dimensions.
2 The first major R package to implement this was lattice; nowadays much of that functionality is also provided through ggplot2.
At the end of the chapter, we cover some specialized forms of plotting such as maps and ideograms, still building on the base concept of the grammar of graphics. fixme: Update
See how to plot 1D, 2D, 3-5D data, and understand faceting
Create beautiful and intuitive plots for scientific presentations and publications
As an example dataset we will use DNase, which conveniently comes with base R. DNase is a nfnGroupedData, nfGroupedData, groupedData, data.frame whose columns are Run, the assay run; conc, the protein concentration that was used; and density, the measured optical density.

head(DNase)

##   Run       conc density
## 1   1 0.04882812   0.017
## 2   1 0.04882812   0.018
## 3   1 0.1953125    0.121
## 4   1 0.1953125    0.124
## 5   1 0.390625     0.206
## 6   1 0.390625     0.215
plot(DNase$conc, DNase$density)

[Figure 3.2: Plot of DNase$density versus DNase$conc.]

This basic plot can be customized, for example by changing the plotting symbol and axis labels as shown in Figure 3.3, by using the parameters xlab, ylab and pch (plot character). The information about the labels is stored in the object DNase, and we can access it with the attr function.

plot(DNase$conc, DNase$density,
  ylab = attr(DNase, "labels")$y,
  xlab = paste(attr(DNase, "labels")$x, attr(DNase, "units")$x),
  pch = 3)

[Figure 3.3: Same data as in Figure 3.2, but with better axis labels and a different plot symbol.]

sometimes still seen in biological papers. We will see more about plotting univariate
distributions in Section 3.6.
[Figure 3.4: Histogram of the density from the ELISA assay, and boxplots of these values stratified by the assay run. The boxes are ordered along the axis in lexicographical order because the runs were stored as text strings. We could use R's type conversion functions to achieve numerical ordering.]
These plotting functions are great for quick interactive exploration of data; but we run quickly into their limitations if we want to create more sophisticated displays. We are going to use a visualization framework called the grammar of graphics, implemented in the package ggplot2, that enables step-by-step construction of high quality graphics in a logical and elegant manner. But first let us load up an example dataset.
To properly test-drive the ggplot2 functionality, we are going to need a dataset that is
big enough and has some complexity so that it can be sliced and viewed from many
different angles.
We'll use a gene expression microarray data set that reports the transcriptomes of around 100 individual cells from mouse embryos at different time points in early development. The mammalian embryo starts out as a single cell, the fertilized egg. Through synchronized waves of cell divisions, the egg multiplies into a clump of cells that at first show no discernible differences between them. At some point, though, cells choose different lineages. Eventually, by further and further specification, the different cell types and tissues arise that are needed for a full organism. The aim of the experiment3 was to investigate the gene expression changes associated with the first symmetry breaking event in the embryo. We'll further explain the data as we go. More details can be found in the paper and in the documentation of the Bioconductor data package Hiiragi2013. We first load the package and the data:

[Figure 3.5: Single-section immunofluorescence image of the E3.5 mouse blastocyst stained for Serpinh1, a marker of primitive endoderm (blue), Gata6 (red) and Nanog (green). Scale bar: 10 μm.]

3 Y. Ohnishi, W. Huber, A. Tsumura, M. Kang, P. Xenopoulos, K. Kurimoto, A. K. Oles, M. J. Arauzo-Bravo, M. Saitou, A. K. Hadjantonakis, and T. Hiiragi. Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages. Nature Cell Biology, 16(1):27–37, 2014

library("Hiiragi2013")
data("x")
dim(exprs(x))

## [1] 45101   101
You can print out a more detailed summary of the ExpressionSet object x by just typ-
ing x at the R prompt. The 101 columns of the data matrix (accessed above through
the exprs function) correspond to the samples (and each of these to a single cell), the
45101 rows correspond to the genes probed by the array, an Affymetrix mouse4302
array. The data were normalized using the RMA method4. The raw data are also available in the package (in the data object a) and at EMBL-EBI's ArrayExpress database under the accession code E-MTAB-1681.
4 R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003
Let's have a look what information is available about the samples5.
5 The notation #CAB2D6 is a hexadecimal representation of the RGB coordinates of a colour; more on this in Section 3.9.2.

head(pData(x), n = 2)

##         File.name Embryonic.day Total.number.of.cells lineage
## 1 E3.25  1_C32_IN         E3.25                    32
## 2 E3.25  2_C32_IN         E3.25                    32
##         genotype   ScanDate sampleGroup sampleColour
## 1 E3.25       WT 2011-03-16       E3.25      #CAB2D6
## 2 E3.25       WT 2011-03-16       E3.25      #CAB2D6
The information provided is a mix of information about the cells (i.e., age, size and
genotype of the embryo from which they were obtained) and technical information
(scan date, raw data file name). By convention, time in the development of the mouse
embryo is measured in days, and reported as, for instance, E3.5. Moreover, in the
paper the authors divided the cells into 8 biological groups (sampleGroup), based
on age, genotype and lineage, and they defined a colour scheme to represent these
groups (sampleColour)6. Using the group_by and summarise functions from the package dplyr, we'll define a little data.frame groups that contains summary information for each group: the number of cells and the preferred colour.
6 In this chapter we'll use the spelling colour (rather than color). This is to stay consistent with the spelling adopted by the R package ggplot2. Other packages, like RColorBrewer, use the other spelling. In some cases, function or argument names are duplicated to allow users to use either choice, although you cannot always rely on that.

library("dplyr")
groups = group_by(pData(x), sampleGroup) %>%
  summarise(n = n(), colour = unique(sampleColour))
groups
3.4 ggplot2
The ggplot2 package was created by Hadley Wickham and implements the idea of a grammar of graphics, a concept created by Leland Wilkinson in his eponymous book7. Comprehensive documentation for the package8 can be found on its website. The online documentation includes example use cases for each of the graphic types that are introduced in this chapter (and many more) and is an invaluable resource when creating figures.
7 L. Wilkinson. The Grammar of Graphics. Springer, 2005
8 Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009. ISBN 978-0-387-98140-6. URL http://had.co.nz/ggplot2/book
Let's start by loading the package and redoing the simple plot of Figure 3.2.

library("ggplot2")
ggplot(DNase, aes(x = conc, y = density)) + geom_point()
We just wrote our first "sentence" using the grammar of graphics. Let us deconstruct it. First, we specified the data.frame that contains the data, DNase. Then we told ggplot via the aes9 argument which variables we want on the x- and y-axes, respectively. Finally, we stated that we want the plot to use points, by adding the result of calling the function geom_point.
Now let's turn to the mouse single cell data and plot the number of samples for each of the 8 groups using the ggplot function. The result is shown in Figure 3.7.
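The code that produced Figure 3.7 is not shown in this draft. Based on the description that follows (geom_bar with stat = "identity", the groups on the x-axis and the group sizes n on the y-axis), it was presumably along these lines:

ggplot(groups, aes(x = sampleGroup, y = n)) +
  geom_bar(stat = "identity")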
Each group is represented by a bar. Bars are one geometric object (geom) that ggplot knows about. We've already seen another geom in Figure 3.6: points. We'll encounter many other possible geometric objects later. We used the aes to indicate that we want the groups shown along the x-axis and the sizes along the y-axis. Finally, we provided the argument stat = "identity" (in other words, do nothing) to the geom_bar function, since otherwise it would try to compute a histogram of the data (the default value of stat is "count"). stat is short for statistic, which is what we call any function of data. The identity statistic just returns the data themselves, but there are other more interesting statistics, such as binning, smoothing, averaging, taking a histogram, or other operations that summarize the data in some way.
[Figure 3.7: Barplot of the number of samples (n) in each of the 8 groups (sampleGroup).]
Question 3.4.1
Flip the x- and y-aesthetics to produce a horizontal barplot.
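One possible answer, sketched here rather than taken from the original text, is to add coord_flip() to the same plot:

ggplot(groups, aes(x = sampleGroup, y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()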
These concepts (data, geometrical objects, statistics) are some of the ingredients of the grammar of graphics, just as nouns, verbs and adverbs are ingredients of an English sentence.
The plot in Figure 3.7 is not bad, but there are several potential improvements. We can use colour for the bars to help us quickly see which bar corresponds to which group. This is particularly useful if we use the same colour scheme in several plots. To this end, let's define a named vector groupColour that contains our desired colours for each possible value of sampleGroup10.
10 The information is completely equivalent to that in the sampleGroup and colour columns of the data.frame groups; we're just adapting to the fact that ggplot2 expects this information in the form of a named vector.

groupColour = setNames(groups$colour, groups$sampleGroup)

Another thing that we need to fix is the readability of the bar labels. Right now they are running into each other, a common problem when you have descriptive names.
ggplot(groups, aes(x = sampleGroup, y = n, fill = sampleGroup)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = groupColour, name = "Groups") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

[Figure 3.8: Similar to Figure 3.7, but with coloured bars and better bar labels.]

Let's dissect the above "sentence". We added an argument, fill, to the aes function that states that we want the bars to be coloured (filled) based on sampleGroup (which in this case coincidentally is also the value of the x argument, but that need not be so). Furthermore we added a call to the scale_fill_manual function, which takes as its input a colour map, i.e., the mapping from the possible values of a variable to the associated colours, as a named vector. We also gave this colour map a title (note that in more complex plots, there can be several different colour maps involved). Had we omitted the call to scale_fill_manual, ggplot2 would have used its choice of default colours. We also added a call to theme stating that we want the x-axis labels rotated by 90 degrees, and right-aligned (hjust; the default would be to center it).
A ggplot2 "sentence" combines several types of components, among them:
1. the dataset,
2. one or more geometric objects,
3. mappings of variables to aesthetics, each with an associated scale (e.g., linear, logarithmic, rank),
4. a statistical summarisation rule,
5. a coordinate system,
6. a facet specification, i.e. the use of several plots to look at the same data.
In the examples above, Figures 3.7 and 3.8, the dataset was groups; the variables were the numeric values (n) as well as the group names (sampleGroup), which we mapped to the aesthetics y-axis and x-axis respectively; the scale was linear on the y-axis and rank-based on the x-axis (the bars are ordered alphanumerically and each has the same width); the geometric object was the rectangular bar; and the statistical summary was the trivial one (i.e., none). We did not make use of a facet specification in the plots of this section.
In fact, ggplot2's implementation of the grammar of graphics allows you to use the same type of component multiple times, in what are called layers11. For example, the code below uses three types of geometric objects in the same plot, for the same data: points, a line and a confidence band.
[Figure 3.9: A scatterplot with three layers that show different statistics of the same data: points, a smooth regression line, and a confidence band.]
dftx = data.frame(t(exprs(x)), pData(x))
ggplot( dftx, aes( x = X1426642_at, y = X1418765_at)) +
  geom_point( shape = 1 ) +
  geom_smooth()
Here we had to assemble a copy of the expression data (exprs(x)) and the sample annotation data (pData(x)) all together into the data.frame dftx, since this is the data format that ggplot2 functions most easily take as input (more on this below).
We can further enhance the plot by using colours: since each of the points in Figure 3.9 corresponds to one sample, it makes sense to use the sampleColour information for colouring the points.
## 2 1418765_at Timd2
## GENENAME
## 1 fibronectin 1
## 2 T cell immunoglobulin and mucin domain containing 2
Often when using ggplot you will only need to specify the data, aesthetics and a geometric object. Most geometric objects implicitly call a suitable default statistical summary of the data. For example, if you are using geom_smooth, ggplot2 by default uses stat = "smooth" and then displays a line; if you use geom_histogram, the data are binned, and the result is displayed in barplot format. Here's an example:

dfx = as.data.frame(exprs(x))
ggplot(dfx, aes(x = `20 E3.25`)) +
  geom_histogram(binwidth = 0.2)

[Figure 3.11: Histogram of probe intensities for one particular sample, cell number 20, which was from day E3.25.]

Question 3.5.3 What is the difference between the objects dfx and dftx? Why did we need to create both of them?
Question 3.5.4 Check the ggplot2 documentation for examples of the usage of stats.
Let's come back to the barplot example from above.

pb = ggplot(groups, aes(x = sampleGroup, y = n))

This creates a plot object pb. If we try to display it, it creates an empty plot, because we haven't specified what geometric object we want to use. All that we have in our pb object so far are the data and the aesthetics (Fig. 3.12).

pb

[Figure 3.12: The plot object pb before a geometric object has been added: only the axes for sampleGroup and n are shown.]

Now we can literally add on the other components of our plot through using the + operator (Fig. 3.13):

pb = pb + geom_bar(stat = "identity")

[Figure 3.13: The graphics object pb in its full glory.]
This step-wise buildup, taking a graphics object already produced in some way and then further refining it, can be more convenient and easy to manage than, say, providing all the instructions upfront to the single function call that creates the graphic. We can quickly try out different visualisation ideas without having to rebuild our plots each time from scratch, but rather store the partially finished object and then modify it in different ways. For example we can switch our plot to polar coordinates to create an alternative visualization of the barplot.
pb.polar = pb + coord_polar() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
  xlab("") + ylab("")
pb.polar

Note above that we can override previously set theme parameters by simply setting them to a new value; there is no need to go back to recreating pb, where we originally set them.
3.6 1D Visualisations
A common task in biological data analysis is the comparison between several samples of univariate measurements. In this section we'll explore some possibilities for visualizing and comparing such samples. As an example, we'll use the intensities of a set of genes (Fgf4, Sox2, Gata6, Gata4 and Gapdh)13. On the array, they are represented by a set of probes, whose identifiers we collect in the named vector probes.
13 You can read more about these genes in the paper associated with the data.
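The definition of the vector probes does not appear in this draft. Judging from the probe identifiers in the output below and the gene symbols used in the later figures, it was presumably a named vector along these lines (the exact probe-to-gene assignment shown here is an assumption):

probes = c(Fgf4  = "1420085_at", Gata4 = "1418863_at",
           Gata6 = "1425463_at", Sox2  = "1416967_at")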
In the matrix exprs(x) (the wide format), the feature and sample of each measurement are implied by its position (row, column) in the matrix; in contrast, in the long table format, the feature and sample identifiers need to be stored explicitly with each measurement. Besides, x has additional components, including the data.frames fData(x) and pData(x), which provide metadata about the microarray features and phenotypic information about the samples.
To extract data from this representation and convert them into a data.frame, we use the function melt, which we'll explain in more detail below.
library("reshape2")
genes = melt(exprs(x)[probes, ], varnames = c("probe", "sample"))
head(genes)
## probe sample value
## 1 1420085_at 1 E3.25 3.027715
## 2 1418863_at 1 E3.25 4.843137
## 3 1425463_at 1 E3.25 5.50618
## 4 1416967_at 1 E3.25 1.731217
## 5 1420085_at 2 E3.25 9.293016
## 6 1418863_at 2 E3.25 5.53016
For good measure, we also add a column that provides the gene symbol along with
the probe identifiers.
genes$gene = names(probes)[ match(genes$probe, probes) ]
3.6.2 Barplots
A popular way to display data such as in our data.frame genes is through barplots. See Fig. 3.15.
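The code for Figure 3.15 is missing from this draft. A minimal sketch that matches the description below (each bar showing the mean value per gene), assuming a current version of ggplot2, is:

ggplot(genes, aes( x = gene, y = value)) +
  stat_summary(fun = mean, geom = "bar")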
[Figure 3.15: Barplot of the mean value for each gene.]
In Figure 3.15, each bar represents the mean of the values for that gene. Such plots are seen a lot in the biological sciences, as well as in the popular media. The data summarisation into only the mean loses a lot of information, and given the amount of space it takes, a barplot can be a poor way to visualise data17.
Sometimes we want to add error bars, and one way to achieve this in ggplot2 is as follows.
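The code for the error-bar version (Figure 3.16) is also missing here. One way to achieve it, sketched with the standard error of the mean computed explicitly via dplyr rather than whatever summary function the original used, is:

genes_summary = summarise(group_by(genes, gene),
                          mean = mean(value),
                          sem  = sd(value) / sqrt(length(value)))
ggplot(genes_summary, aes( x = gene, y = mean, fill = gene)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.25)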
The summary function used here computes the standard error (or confidence limits) of the mean; it's a simple function, and we could also compute it ourselves using base R expressions if we wished to do so. We also coloured the bars in lighter colours for better contrast.
[Figure 3.16: Barplots with error bars indicating standard error of the mean.]
3.6.3 Boxplots
It's easy to show the same data with boxplots.

p = ggplot(genes, aes( x = gene, y = value, fill = gene))
p + geom_boxplot()

[Figure 3.17: Boxplots of the value for each gene.]

Compared to the barplots, this takes a similar amount of space, but is much more informative. In Figure 3.17 we see that two of the genes (Gata4, Gata6) have relatively concentrated distributions, with only a few data points venturing out in the direction of higher values. For Fgf4, we see that the distribution is right-skewed: the median, indicated by the horizontal black bar within the box, is closer to the lower (or left) side of the box.
3.6.4 Violin plots
A variation of the boxplot idea, but with an even more direct representation of the shape of the data distribution, is the violin plot. Here, the shape of the violin gives a rough impression of the density of the data distribution.
[Figure 3.18: Violin plots.]
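The plotting call for Figure 3.18 is not shown in this draft; with the plot object p defined in the boxplot section above, it is presumably just a different geom:

p + geom_violin()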
If the number of data points is not too large, it is possible to show the data points directly, and it is good practice to do so, compared to using more abstract summaries. However, plotting the data directly will often lead to overlapping points, which can be visually unpleasant, or even obscure the data. We can try to lay out the points so that they are as near as possible to their proper locations without overlap18.
18 L. Wilkinson. Dot plots. The American Statistician, 53(3):276, 1999

p + geom_dotplot(binaxis = "y", binwidth = 1/6,
    stackdir = "center", stackratio = 0.75,
    aes(color = gene))
The plot is shown in the left panel of Figure 3.19. The y-coordinates of the points are discretized into bins (above we chose a bin size of 1/6), and then they are stacked next to each other.
A fun alternative is provided by the package beeswarm. It works with base R graphics and is not directly integrated into ggplot2's data flows, so we can either use the base R graphics output, or pass on the point coordinates to ggplot as follows.
library("beeswarm")
bee = beeswarm(value ~ gene, data = genes, spacing = 0.7)
ggplot(bee, aes( x = x, y = y, colour = x.orig)) +
geom_point(shape = 19) + xlab("gene") + ylab("value") +
scale_fill_manual(values = probes)
[Figure 3.19: Left panel: the dot plot made with geom_dotplot; right panel: the beeswarm plot.]
The plot is shown in the right panel of Figure 3.19. The default layout method
used by beeswarm is called swarm. It places points in increasing order. If a point
would overlap an existing point, it is shifted sideways (along the x-axis) by a minimal
amount sufficient to avoid overlap.
As you have seen in the above code examples, some twiddling with layout parame-
ters is usually needed to make a dot plot or a beeswarm plot look good for a particular
dataset.
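The code for the density plots of Figure 3.20 does not appear in this draft; a minimal sketch, analogous to the stat_ecdf call shown further below, is:

ggplot(genes, aes( x = value, colour = gene)) + geom_density()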
Density estimation has a number of complications, and you can see these in Figure 3.20. In particular, there is the need for choosing a smoothing window. A window size that is small enough to capture peaks in the dense regions of the data may lead to unstable (wiggly) estimates elsewhere; if the window is made bigger, pronounced features of the density may be smoothed out. Moreover, the density lines do not convey the information on how much data was used to estimate them, and plots like Figure 3.20 can become especially problematic if the sample sizes for the curves differ.
[Figure 3.20: Density plots of the values for each gene.]
A way to describe the distribution of a random variable X is its cumulative distribution function (CDF), i.e., the function

F(x) = P(X \le x),
where x takes all values along the real axis. The density of X is then the derivative of F, if it exists19. The definition of the CDF can also be applied to finite samples of X, i.e., samples x_1, ..., x_n. The empirical cumulative distribution function (ECDF) is simply

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x \ge x_i).   (3.2)

19 By its definition, F tends to 0 for small x (x → −∞) and to 1 for large x (x → +∞).

An important property is that even for limited sample sizes n, the ECDF F_n is not very far from the CDF, F. This is not the case for the empirical density! Without smoothing, the empirical density of a finite sample is a sum of Dirac delta functions, which is difficult to visualize and quite different from any underlying smooth, true density. With smoothing, the difference can be less pronounced, but is difficult to control, as discussed above.

ggplot(genes, aes( x = value, colour = gene)) + stat_ecdf()

[Figure 3.21: Empirical cumulative distribution functions of the values for each gene.]
3.6.8 Transformations
It is tempting to look at histograms or density plots and inspect them for evidence of multiple modes; however, as the following example shows, the number of apparent modes depends on the scale transformation applied to the data.
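The code that defines the simulated dataset sim is missing from this draft. Question 3.6.1 below describes it as a log-normal mixture model, so it was presumably something along these lines (the sample size, component means and mixture fraction used here are assumptions):

sim = data.frame(x = exp(rnorm(n = 1e5,
                               mean = sample(c(2, 5), size = 1e5, replace = TRUE))))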
gridExtra::grid.arrange(
  ggplot(sim, aes(x)) +
    geom_histogram(binwidth = 10, boundary = 0) + xlim(0, 400),
  ggplot(sim, aes(log(x))) + geom_histogram(bins = 30)
)

[Figure 3.22: Histograms of the same data, with and without logarithmic transformation. The number of modes is different.]
Question 3.6.1 Consider a log-normal mixture model as in the code above. What is the density function of X? What is the density function of log(X)? How many modes do these densities have, as a function of the parameters of the mixture model (mean and standard deviation of the component normals, and mixture fraction)?
Let us revisit the melt command from above. In the resulting data.frame genes, each row corresponds to exactly one measured value, stored in the column value. Then there are additional columns probe and sample, which store the associated covariates. Compare this to the following data.frame (for space reasons we print only the first five columns):
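The expression that printed this wide-format table is not shown; it was presumably a subset of the expression matrix along the lines of:

exprs(x)[probes, 1:5]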
## 1418863_at 4.843137 5.53016 4.418059 5.982314 4.92358
## 1425463_at 5.50618 6.1609 4.584961 4.753439 4.629728
## 1416967_at 1.731217 9.697038 4.16124 9.540123 8.70534

This data.frame has several columns of data, one for each sample (annotated by the column names). Its rows correspond to the four probes, annotated by the row names. This is an example for a data table in wide format.
Now suppose we want to store somewhere not only the probe identifiers but also
the associated gene symbols. We could stick them as an additional column into the
wide format data.frame, and perhaps also throw in the genes' ENSEMBL identifiers for
good measure. But now we immediately see the problem: the data.frame now has some
columns that represent different samples, and others that refer to information for
all samples (the gene symbol and identifier) and we somehow have to "know" this
when interpreting the data.frame. This is what Hadley Wickham calls untidy data20.
20 There are many different ways for data to be untidy.
In contrast, in the tidy data.frame genes, we can add these columns, yet still know that each row forms exactly one observation, and all information associated with that observation is in the same row.
21 Hadley Wickham. Tidy data. Journal of Statistical Software, 59(10), 2014
In tidy data21,
1. each variable forms a column,
2. each observation forms a row,
3. each type of observational unit forms a table.
A potential drawback is efficiency: even though there are only 4 probe-to-gene-symbol
relationships, we are now storing them 404 times in the rows of the data.frame genes.
Moreover, there is no standardisation: we chose to call this column symbol, but
the next person might call it Symbol or even something completely different, and
when we find a data.frame that was made by someone else and that contains a column
symbol, we can hope, but have no guarantee, that these are valid gene symbols.
Addressing such issues is behind the object-oriented design of the specialized data
structures in Bioconductor, such as the ExpressionSet class.
Scatter plots are useful for visualizing treatment-response comparisons (as in Fig-
ure 3.3), associations between variables (as in Figure 3.10), or paired data (e. g., a
disease biomarker in several patients before and after treatment). We use the two
dimensions of our plotting paper, or screen, to represent the two variables.
Let us take a look at differential expression between a wildtype and an FGF4-KO
sample.
scp = ggplot(dfx, aes( x = `59 E4.5 (PE)`,
                       y = `92 E4.5 (FGF4-KO)`))
scp + geom_point()

[Figure 3.23: Scatterplot of 45101 expression measurements for two of the samples.]

The labels `59 E4.5 (PE)` and `92 E4.5 (FGF4-KO)` refer to column names (sample names) in the data.frame dfx, which we created above. Since they contain special characters (spaces, parentheses, hyphen) and start with numerals, we need to enclose them with the downward sloping quotes to make them syntactically digestible for R. The plot is shown in Figure 3.23. We get a dense point cloud that we can try and interpret on the outskirts of the cloud, but we really have no idea visually how the data are distributed within the denser regions of the plot.
One easy way to ameliorate the overplotting is to adjust the transparency (alpha value) of the points by modifying the alpha parameter of geom_point (Figure 3.24).

scp + geom_point(alpha = 0.1)

[Figure 3.24: As Figure 3.23, but with semi-transparent points to resolve some of the overplotting.]

This is already better than Figure 3.23, but in the very dense regions even the semi-transparent points quickly overplot to a featureless black mass, while the more isolated, outlying points are getting faint. An alternative is a contour plot of the 2D density, which has the added benefit of not rendering all of the points on the plot, as in Figure 3.25.

scp + geom_density2d()
However, we see in Figure 3.25 that the point cloud at the bottom right (which
contains a relatively small number of points) is no longer represented. We can
somewhat overcome this by tweaking the bandwidth and binning parameters of
geom_density2d (Figure 3.26, left panel).
scp + geom_density2d(h = 0.5, bins = 60)
We can fill in each space between the contour lines with the relative density of
points by explicitly calling the function stat_density2d (for which geom_density2d
is a wrapper) and using the geometric object polygon, as in the right panel of Fig-
ure 3.26.
library("RColorBrewer")
colourscale = scale_fill_gradientn(
  colours = rev(brewer.pal(9, "YlGnBu")),
  values = c(0, exp(seq(-5, 0, length.out = 100))))

[Figure 3.25: As Figure 3.23, but rendered as a contour plot of the 2D density estimate.]

scp + stat_density2d(h = 0.5, bins = 60,
  aes( fill = ..level..), geom = "polygon") +
  colourscale + coord_fixed()
We used the function brewer.pal from the package RColorBrewer to define the colour scale, and we added a call to coord_fixed to fix the aspect ratio of the plot, to make sure that the mapping of data range to x- and y-coordinates is the same for the two variables. Both of these issues merit a deeper look, and we'll talk more about plot shapes in Section 3.7.1 and about colours in Section 3.9.
The density based plotting methods in Figure 3.26 are more visually appealing and interpretable than the overplotted point clouds of Figures 3.23 and 3.24, though we have to be careful in using them as we lose a lot of the information on the outlier points in the sparser regions of the plot. One possibility is using geom_point to add such points back in.
But arguably the best alternative, which avoids the limitations of smoothing, is hexagonal binning22.
22 Daniel B Carr, Richard J Littlefield, WL Nicholson, and JS Littlefield. Scatterplot matrix techniques for large N. Journal of the American Statistical Association, 82(398):424–436, 1987

library("hexbin")
scp + stat_binhex() + coord_fixed()
scp + stat_binhex(binwidth = c(0.2, 0.2)) + colourscale +
  coord_fixed()
current plotting device. The width and height of the device are specified when it is opened in R, either explicitly by you or through default parameters23. Moreover, the graph dimensions also depend on the presence or absence of additional decorations, like the colour scale bars in Figure 3.27.
23 E.g., see the manual pages of the pdf and png functions.
There are two simple rules that you can apply for scatterplots:
If the variables on the two axes are measured in the same units, then make
sure that the same mapping of data space to physical space is used i. e., use
coord_fixed. In the scatterplots above, both axes are the logarithm to base 2 of
expression level measurements, that is a change by one unit has the same mean-
ing on both axes (a doubling of the expression level). Another case is principal
component analysis (PCA), where the x -axis typically represents component 1,
and the -axis component 2. Since the axes arise from an orthonormal rotation
of input data space, we want to make sure their scales match. Since the variance
of the data is (by definition) smaller along the second component than along the
first component (or at most, equal), well-done PCA plots usually have a width that's larger than the height.
If the variables on the two axes are measured in different units, then we can still
relate them to each other by comparing their dimensions. The default in many
plotting routines in R, including ggplot2, is to look at the range of the data and map
it to the available plotting region. However, in particular when the data more or
less follow a line, looking at the typical slope of the line can be useful. This is called
banking24.
24 W. S. Cleveland, M. E. McGill, and R. McGill. The shape parameter of a two-variable graph. Journal of the American Statistical Association, 83:289–300, 1988
To illustrate banking, let's use the classic sunspot data from Cleveland's paper.

library("ggthemes")
sunsp = data.frame(year = time( sunspot.year ),
number = as.numeric( sunspot.year ))
sp = ggplot(sunsp, aes(x = year, y = number)) + geom_line()
sp
The resulting plot is shown in the upper panel of Figure 3.28. We can clearly see
long-term fluctuations in the amplitude of sunspot activity cycles, with particularly
low maximum activities in the early 1700s, early 1800s, and around the turn of the
20th century. But now let's try out banking.
ratio = with(sunsp, bank_slopes(year, number))
sp + coord_fixed(ratio = ratio)
What the algorithm does is to look at the slopes in the curve, and in particular, the
above call to bank_slopes computes the median absolute slope, and then with the
call to coord_fixed we shape the plot such that this quantity becomes 1. The result
is shown in the lower panel of Figure 3.28. Quite counter-intuitively, even though the
plot takes much smaller space, we see more on it! Namely, we can see the saw-tooth
shape of the sunspot cycles, with sharp rises and slower declines.
[Figure 3.28: The sunspot data. Upper panel: default plot shape; lower panel: the same data with the aspect ratio chosen by the banking algorithm.]
The geom_point geometric object offers the following aesthetics (beyond x and y):
fill
colour
shape
size
alpha
They are explored in the manual page of the geom_point function. fill and
colour refer to the fill and outline colour of an object; alpha to its transparency
level. Above, in Figures 3.24 and following, we have used colour or transparency to
reflect point density and avoid the obscuring effects of overplotting. Instead, we can
use them to show other dimensions of the data (but of course we can only do one or the
other). In principle, we could use all the 5 aesthetics listed above simultaneously to
show up to 7-dimensional data; however, such a plot would be hard to decipher, and
most often we are better off with one or two additional dimensions and mapping them
to a choice of the available aesthetics.
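As an illustration (a sketch, not taken from the original text), two further dimensions of the mouse single-cell data can be mapped onto the colour and shape aesthetics in the scatterplot of Figure 3.9:

ggplot(dftx, aes( x = X1426642_at, y = X1418765_at,
                  colour = sampleGroup, shape = Embryonic.day)) +
  geom_point(size = 2)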
3.8.1 Faceting
Another way to show additional dimensions of the data is to show multiple plots that
result from repeatedly subsetting (or slicing) our data based on one (or more) of
the variables, so that we can visualize each part separately. So we can, for instance,
investigate whether the observed patterns among the other variables are the same or
different across the range of the faceting variable. Let's look at an example25.
25 The first two lines of this code chunk are not strictly necessary; they're just reformatting the lineage column of the dftx data.frame, to make the plots look better.

library("magrittr")
dftx$lineage %<>% sub("^$", "no", .)
dftx$lineage %<>% factor(levels = c("no", "EPI", "PE", "FGF4-KO"))
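The plotting call that produced Figure 3.29 is missing here; based on the facet_grid( . ~ lineage ) formula described just below, it was presumably:

ggplot(dftx, aes( x = X1426642_at, y = X1418765_at)) + geom_point() +
  facet_grid( . ~ lineage )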
[Figure 3.29: The scatterplot of Figure 3.9, split into facets (columns) by lineage.]
The result is shown in Figure 3.29. We used the formula language to specify by
which variable we want to do the splitting, and that the separate panels should be in
different columns: facet_grid( . ~ lineage ). In fact, we can specify two
faceting variables, as follows; the result is shown in Figure 3.30.
ggplot( dftx,
aes( x = X1426642_at, y = X1418765_at)) + geom_point() +
facet_grid( Embryonic.day ~ lineage )
Another useful function is facet_wrap: if the faceting variable has too many
levels for all the plots to fit in one row or one column, then this function can be used
to wrap them into a specified number of columns or rows.
We can use a continuous variable by discretizing it into levels. The function cut is
useful for this purpose.
ggplot(mutate(dftx, Tdgf1 = cut(X1450989_at, breaks = 4)),
aes( x = X1426642_at, y = X1418765_at)) + geom_point() +
facet_wrap( ~ Tdgf1, ncol = 2 )
We see in Figure 3.31 that the number of points in the four panels is different; this is because cut splits into bins of equal length, not equal number of points. If we want the latter, then we can use quantile in conjunction with cut.
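As a sketch of that idea (not from the original text), the breaks can be taken from the quantiles of the variable, so that each bin holds roughly the same number of points:

ggplot(mutate(dftx,
              Tdgf1 = cut(X1450989_at,
                          breaks = quantile(X1450989_at, probs = seq(0, 1, 0.25)),
                          include.lowest = TRUE)),
       aes( x = X1426642_at, y = X1418765_at)) + geom_point() +
  facet_wrap( ~ Tdgf1, ncol = 2 )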
[Figure 3.30: The scatterplot of Figure 3.9, faceted by Embryonic.day (rows) and lineage (columns).]
Axes scales   In Figures 3.29–3.31, the axes scales are the same for all plots. Alternatively, we could let them vary by setting the scales argument of facet_grid and facet_wrap; this parameter allows you to control whether to leave the x-axis, the y-axis, or both to be freely variable. Such alternative scalings might allow us to see the full detail of each plot and thus make more minute observations about what is going on in each. The downside is that the plot dimensions are not comparable across the groupings.
[Figure 3.31: Faceting: the same data as in Figure 3.9, split by the continuous variable X1450989_at and arranged by facet_wrap.]
Implicit faceting   You can also facet your plots (without explicit calls to facet_grid and facet_wrap); a simple way of doing so is using a factor as your x-axis, such as in Figures 3.15–3.19.
3.8.2 Interactive graphics
fixme: Vlad wrote: The plots generated in R are static images. For complex data it may
be useful to create interactive visualizations, which could be explored by navigating
a computer mouse through different parts of the graphic to view pop-up annotations,
zooming in and out, pulling the graphic to rotate in the image space, etc.
plotly
A great web-based tool for interactive graphic generation is plotly. You can view some examples of interactive graphics online at https://plot.ly. To create your own
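The rest of this paragraph is missing from the draft. As a hedged sketch, an existing ggplot2 object (for example pb from above) can be turned into an interactive plotly figure with the ggplotly function:

library("plotly")
ggplotly(pb)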
ggvis
fixme: Describe
rgl, webgl
To generate 3D interactive graphics in R ... there is package rgl. fixme: More interesting example. We can visualize the classic iris flower data set in a 3D scatter plot (Fig. 3.32). The iris data set is based on measurements of petal and sepal lengths and widths for 3 species of iris flowers. The axes in Fig. 3.32 represent petal and sepal dimensions. The color indicates which species the flower belongs to. Run the following code chunk sequentially to view the 3D scatter plot in an interactive mode.
[Figure 3.32: Scatter plot of iris data.]
fixme: Make this code live again
library("rgl")
library("rglwidget")
with(iris, plot3d(Sepal.Length, Sepal.Width, Petal.Length,
  type = "s", col = as.numeric(Species)))
writeWebGL(dir = file.path(getwd(), "figure"), width = 700)
Function writeWebGL exports the 3D scene as an HTML file that can be viewed
interactively in a browser.
3.9 Colour
An important consideration when making plots is the colouring that we use in them. Most R users are likely familiar with the built-in R colour scheme, used by base R graphics, as shown in Figure 3.33.

pie(rep(1, 8), col = 1:8)

[Figure 3.33: Basic R colours.]

These colour choices date back to 1980s hardware, where graphics cards handled colours by letting each pixel either fully use or not use each of the three basic colour channels of the display: red, green and blue (RGB). This leads to 2^3 = 8 combinations, which lie at the 8 extreme corners of the RGB color cube26. The colours in Figure 3.33 are harsh on the eyes, and there is no good excuse any more for creating graphics that are based on this palette. Fortunately, the default colours used by some of the more modern visualisation oriented packages (including ggplot2) are much better already, but sometimes we want to make our own choices.
26 Thus the 8th colour should be white; in R, whose basic infrastructure was put together when more sophisticated graphics displays were already available, this was replaced by grey, as you can see in Figure 3.33.
In Section 3.7 we saw the function scale_fill_gradientn, which allowed us to create the colour gradient used in Figures 3.26 and 3.27 by interpolating the basic colour palette defined by the function brewer.pal in the RColorBrewer package. This package defines a great set of colour palettes; we can see all of them at a glance by using the function display.brewer.all (Figure 3.34).

display.brewer.all()

[Figure 3.34: The palettes provided by the RColorBrewer package.]
We can get information about the available colour palettes from the data.frame
brewer.pal.info.
head(brewer.pal.info)
## maxcolors category colorblind
## BrBG 11 div TRUE
## PiYG 11 div TRUE
## PRGn 11 div TRUE
## PuOr 11 div TRUE
## RdBu 11 div TRUE
## RdGy 11 div FALSE
table(brewer.pal.info$category)
##
## div qual seq
## 9 8 18
The palettes are divided into three categories:
qualitative: for categorical properties that have no intrinsic ordering. The Paired
palette supports up to 6 categories that each fall into two subcategories - like
before and after, with and without treatment, etc.
sequential: for quantitative properties that go from low to high
diverging: for quantitative properties for which there is a natural midpoint
or neutral value, and whose value can deviate both up and down; we'll see an
example in Figure 3.36.
To obtain the colours from a particular palette we use the function brewer.pal. Its
first argument is the number of colours we want (which can be less than the available
maximum number in brewer.pal.info).
brewer.pal(4, "RdYlGn")
## [1] "#D7191C" "#FDAE61" "#A6D96A" "#1A9641"
If we want more than the available number of preset colours (for example so we can plot a heatmap with continuous colours) we can use the colorRampPalette function to interpolate any of the RColorBrewer presets or any set of colours:
[Figure 3.35: A quasi-continuous colour palette derived by interpolating between the colours darkorange3, white and darkblue.]

mypalette = colorRampPalette(c("darkorange3", "white",
    "darkblue"))(100)
head(mypalette)

## [1] "#CD6600" "#CE6905" "#CF6C0A" "#D06F0F" "#D17214" "#D27519"

par(mai = rep(0.1, 4))
image(matrix(1:100, nrow = 100, ncol = 1), col = mypalette,
    xaxt = "n", yaxt = "n", useRaster = TRUE)
3.9.1 Heatmaps
Heatmaps are a powerful way of visualising large, matrix-like datasets and giving a quick overview over the patterns that might be in there. There are a number of heatmap drawing functions in R; one that is particularly versatile and produces good-looking output is the function pheatmap from the eponymous package. In the code below, we first select the top 500 most variable genes in the dataset x, and define a function rowCenter that centers each gene (row) by subtracting the mean across columns. By default, pheatmap uses the RdYlBu colour palette from RColorBrewer in conjunction with the colorRampPalette function to interpolate the 11 colours into a smooth-looking palette (Figure 3.36).
library("pheatmap")
topGenes = order(rowVars(exprs(x)), decreasing = TRUE)[ seq_len(500) ]
rowCenter = function(x) { x - rowMeans(x) }
pheatmap( rowCenter(exprs(x)[ topGenes, ] ),
  show_rownames = FALSE, show_colnames = FALSE,
  breaks = seq(-5, +5, length = 101),
  annotation_col = pData(x)[, c("sampleGroup", "Embryonic.day", "ScanDate") ],
  annotation_colors = list(
    sampleGroup = groupColour,
    genotype = c(`FGF4-KO` = "chocolate1", WT = "azure2"),
    Embryonic.day = setNames(brewer.pal(9, "Blues")[c(3, 6, 9)], c("E3.25", "E3.5", "E4.5")),
    ScanDate = setNames(brewer.pal(nlevels(x$ScanDate), "YlGn"), levels(x$ScanDate))
  ),
  cutree_rows = 4
)
[Figure 3.36: Heatmap of the centered expression values of the 500 most variable genes, with column annotations for sampleGroup, Embryonic.day and ScanDate.]
By default, pheatmap would also print the row and column names of the matrix; since they are not informative here, we suppress them. The annotation_col argument takes a data frame that carries
additional information about the samples. The information is shown in the coloured
bars on top of the heatmap. There is also a similar annotation_row argument,
which we haven't used here, for coloured bars at the side. annotation_colors
is a list of named vectors by which we can override the default choice of colours
for the annotation bars. Finally, with the cutree_rows argument we cut the row
dendrogram into four (an arbitrarily chosen number) clusters, and the heatmap shows
them by leaving a bit of white space in between. The pheatmap function has many
further options, and if you want to use it for your own data visualisations, it's worth
studying them.
difference between a pair of similar categories and a third different one. A more thorough discussion is provided in the references30.
30 J. Mollon. Seeing colour. In T. Lamb and J. Bourriau, editors, Colour: Art and Science. Cambridge University Press, 1995; and Ross Ihaka. Color for presentation graphics. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing. Vienna, Austria, 2003
Lines vs areas   For lines and points, we want them to show a strong contrast to the background, so on a white background, we want them to be relatively dark (low lightness L). For area fills, lighter, more pastel-type colours with low to moderate chromatic content are usually more pleasant.
Plots in which most points are huddled up in one area, with a lot of sparsely populated space, are difficult to read. If the histogram of the marginal distribution of a variable has a sharp peak and then long tails to one or both sides, transforming the data can be helpful. These considerations apply both to x and y aesthetics, and to colour scales. In the plots of this chapter that involved the microarray data, we used the logarithmic transformation31, not only in scatterplots like Figure 3.23 for the x- and y-coordinates, but also in Figure 3.36 for the colour scale that represents the expression fold changes. The logarithm transformation is attractive because it has a definitive meaning: a move up or down by the same amount on a log-transformed scale corresponds to the same multiplicative change on the original scale: log(ax) = log a + log x.
31 We used it implicitly, since the data in the ExpressionSet object x already come log-transformed.
Sometimes, however, the logarithm is not good enough, for instance when the
data include zero or negative values, or when even on the logarithmic scale the data
distribution is highly uneven. From the upper panel of Figure 3.38, it is easy to take
away the impression that the distribution of M depends on A, with higher variances
for low A. However, this is entirely a visual artefact, as the lower panel confirms: the
distribution of M is independent of A, and the apparent trend we saw in the upper
panel was caused by the higher point density at smaller A.
A = exprs(x)[,1]
M = rnorm(length(A))
qplot(A, M)
qplot(rank(A), M)
Question 3.10.1 Can the visual artefact be avoided by using a density- or binning-based
plotting method, as in Figure 3.27?
Question 3.10.2 Can the rank transformation also be applied when choosing colour scales
e. g. for heatmaps? What does histogram equalization in image processing do?
Or you can specify a particular plot that you want to save, say, the sunspot plot
from earlier.
ggsave("myplot2.pdf", plot = sp)
There are two major ways of storing plots: vector graphics and raster (pixel)
graphics. In vector graphics, the plot is stored as a series of geometrical primitives
such as points, lines, curves, shapes and typographic characters. The preferred format
in R for saving plots into a vector graphics format is PDF. In raster graphics, the plot
is stored in a dot matrix data structure. The main limitation of raster formats is their
limited resolution, which depends on the number of pixels available; in R, the most
commonly used device for raster graphics output is png. Generally, it's preferable to save your graphics in a vector graphics format, since it is always possible later to convert a vector graphics file into a raster format of any desired resolution, while the reverse is in principle limited by the resolution of the original file. And you don't want the figures in your talks or papers to look poor because of pixelisation artefacts!
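As a small illustration (the file names here are made up), the same plot can be written either to a vector PDF device or to a raster PNG device:

pdf("sunspots.pdf", width = 8, height = 3)    # vector format: resolution-independent
print(sp)
dev.off()

png("sunspots.png", width = 1600, height = 600, res = 200)    # raster format: fixed pixel resolution
print(sp)
dev.off()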
fixme: Vlad wrote In addition to cytoband information, one can use ggbio to visualize
gene model tracks on which coding regions (CDS), untranslated regions (UTR), introns, exons and non-genetic regions are indicated.
[Figure 3.39: Chromosome 1 of the human genome: ideogram plot.]
To create a gene model track for a subset of 500 hg19 RNA editing sites, use the darned_hg19_subset500 data set.
chr1
data(darned_hg19_subset5 , package = "biovizBase") chr2
chr3
dn = darned_hg19_subset5 chr4
chr5
library(GenomicRanges)
chr6
chr7
chr8
library(ggbio) chr9
chr10 exReg
chr11 3
fixme: Need to fix the par of the pdf being saved. Right now x-axis labels get squeezed together.
Note that the information on sequence lengths is stored in the ideoCyto data set.

seqlengths(dn) = seqlengths(ideoCyto$hg19)[names(seqlengths(dn))]
[Figure 3.40: Karyogram of 22 chromosomes and the X chromosome.]
Use function keepSeqlevels to subset the first 22 chromosomes and the X chromosome and plot the karyogram.

dn = keepSeqlevels(dn, paste0("chr", c(1:22, "X")))
autoplot(dn, axis.text.x = FALSE, layout = "karyogram", aes(color = exReg, fill = exReg))
The categorical variable exReg is included with the data and marks CDS (coding regions), 3'-UTR and 5'-UTR, which correspond to C, 3 and 5 in the legend of the figure, respectively.
3.14 Exercises
Exercise 3.1 (themes) Explore how to change the visual appearance of plots with themes.
For example:
qplot(1:10, 1:10)
qplot(1:10, 1:10) + theme_bw()
Familiarize ourselves with the machinery of hypothesis testing, its vocabulary, its
purpose, and its strengths and limitations.
Understand what multiple testing means.
See that multiple testing is not a problem but rather, an opportunity, as it fixes
many of the limitations of single testing.
Understand the false discovery rate.
Learn how to make diagnostic plots.
Use hypothesis weighting to increase the power of our analyses.
assay, compared to control: that's again tens of thousands, if not millions of tests.
Yet, in many ways, the task becomes simpler, not harder. Since we have so much
data, and so many tests, we can ask questions like: are the assumptions of the tests
actually met by the data? What are the prior probabilities that we should assign to the
possible outcomes of the tests? Answers to these questions can be incredibly helpful,
and we can address them because of the multiplicity. So we should think about it not
as a multiple testing problem, but as an opportunity!
[Figure 6.2: Modern biology often involves navigating a deluge of data.]
There is a powerful premise in data-driven sciences: we usually expect that most
tests will not be rejected. Out of the thousands or millions of tests (genes, positions in
the genome, RNAi reagents), we expect that only a small fraction will be interesting,
or significant. In fact, if that is not the case, if the hits are not rare, then arguably our analysis method (serially univariate screening of each variable for association with the outcome) is not suitable for the dataset. Either we need better data (a more specific assay), or a different analysis method, e.g., a multivariate model.
Since most nulls are true, we can use the behaviour of the many test statistics and
p-values to empirically understand their null distributions, their correlations, and so
on. Rather than having to rely on assumptions, we can check them empirically!
To understand multiple tests, let's first review the mechanics of single hypothesis testing. For example, suppose we are flipping a coin to see if it is a fair coin3. We flip the coin 100 times and each time record whether it came up heads or tails. So, we have a record that could look something like this:
H H T T H T H T T ...
3 The same kind of reasoning, just with more details, applies to any kind of gambling. Here we stick to coin tossing since everything can be worked out easily, and it shows all the important concepts.
We can simulate this in R. We set probHead different from 1/2, so we are sampling from a biased coin:
set.seed(0xdada)
numFlips <- 100
probHead <- 0.6
coinFlips <- sample(c("H", "T"), size = numFlips,
replace = TRUE, prob = c(probHead, 1 - probHead))
head(coinFlips)
## [1] "T" "T" "H" "T" "H" "H"
Now, if the coin were fair, we expect half of the time to get heads. Let's see.
table(coinFlips)
## coinFlips
## H T
## 59 41
So that is different from 50/50. Suppose we didn't know whether the coin is fair or not, but our prior assumption is that coins are, by and large, fair: would these observed data be strong enough to make us conclude that this coin isn't fair? We know that random sampling differences are to be expected. To decide, let's look at the sampling distribution of our test statistic (the total number of heads seen in 100 coin tosses) for a fair coin4. This is really easy to work out with elementary combinatorics:

P(K = k \mid n, p) = \binom{n}{k} \, p^k (1-p)^{n-k}   (6.1)

4 We haven't really defined what we mean by fair; a reasonable definition would be that head and tail are equally likely, and that the outcome of each coin toss is completely independent of the previous ones. For more complex applications, nailing down the exact null hypothesis can take a bit more thought.
Let's parse the notation: n is the number of coin tosses (100) and p is the probability
of head (0.5 if we assume a fair coin). k is the number of heads. Statisticians like to
make a difference between all the possible values of a statistic and the one that was
observed, and we use the lower case k for the possible values (so k can be anything
between 0 and 100), and the upper case K for the observed value. We pronounce the
left hand side of the above equation as the probability that the observed number
takes the value k , given that n is what it is and p is what it is.
Let's plot Equation (6.1); for good measure, we also mark the observed value
numHeads with a vertical blue line.
k <- 0:numFlips
numHeads <- sum(coinFlips == "H")
binomDensity <- data.frame(k = k,
  p = dbinom(k, size = numFlips, prob = 0.5))
library("ggplot2")
ggplot(binomDensity) +
  geom_bar(aes(x = k, y = p), stat = "identity") +
  geom_vline(xintercept = numHeads, col = "blue")
Suppose we didn't know about Equation (6.1). We could still manoeuvre our way out by simulating a reasonably good approximation of the distribution.

[Figure 6.3: The binomial distribution for the parameters n = 100 and p = 0.5, according to Equation (6.1).]

numSimulations <- 10000
outcome <- replicate(numSimulations, {
  coinFlips <- sample(c("H", "T"), size = numFlips,
    replace = TRUE, prob = c(0.5, 0.5))
  sum(coinFlips == "H")
})
ggplot(data.frame(outcome)) + xlim(0, 100) +
  geom_histogram(aes(x = outcome), binwidth = 1, center = 0) +
  geom_vline(xintercept = numHeads, col = "blue")
As expected, the most likely number of heads is 50, that is, half the number of coin flips. But we see that other numbers near 50 are also not unlikely. How do we quantify whether the observed value, 59, is among those values that we are likely to see from a fair coin, or whether its deviation from the expected value is already big enough for us to conclude with enough confidence that the coin is biased? We divide the set of all possible k's (0 to 100) into two complementary subsets, the acceptance region and the rejection region. A natural choice5 is to fill up the rejection region with as many k as possible while keeping the total probability below some threshold (say, 0.05). So the rejection set consists of the values of k with the smallest probabilities (6.1), so that their sum remains below α.
[Figure 6.4: An approximation of the binomial distribution from 10^4 simulations (same parameters as Figure 6.3).]
5 More on this below.
library("dplyr")
alpha <- . 5
binomDensity <- arrange(binomDensity, p) %>%
mutate(reject = (cumsum(p) <= alpha))
0.08
ggplot(binomDensity) +
geom_bar(aes(x = k, y = p, col = reject), stat = "identity") + 0.06
scale_colour_manual(
values = c(TRUE = "red", FALSE = "darkgrey")) + 0.04
p
geom_vline(xintercept = numHeads, col="blue") +
theme(legend.position = "none") 0.02
In the code above, we used the functions arrange and mutate from the dplyr
0.00
package to sort the the p-values from lowest to highest, compute the cumulative sum 0 25 50 75 100
Figure 6.5: As Figure 6.3, with a rejection region (red) that has been chosen such that it contains the maximum number of bins whose total area is at most α = 0.05.

The explicit summation over the probabilities is clumsy; we did it here for pedagogic value. For one-dimensional distributions, R provides not only functions for the densities (e.g., dbinom) but also for the cumulative distribution functions (pbinom), which are more precise and faster than cumsum over the probabilities. These should be used in practice.

We see in Figure 6.5 that the observed value, 59, lies in the grey shaded area, so we would not reject the null hypothesis of a fair coin from these data at a significance level of α = 0.05.
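As a minimal sketch of the pbinom route (this code is not in the original text), the two-sided p-value for the observed count can be computed without any explicit summation. By the symmetry of the binomial distribution with p = 0.5, it is twice the lower-tail probability at min(numHeads, numFlips - numHeads):

2 * pbinom(min(numHeads, numFlips - numHeads), size = numFlips, prob = 0.5)
# about 0.089, which matches the exact binomial test shown below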
Question 6.2.1 Does the fact that we don't reject the null hypothesis mean that the coin is fair?
Question 6.2.2 Would we have a better chance of detecting that the coin is not fair if we did more coin tosses? How many?
Question 6.2.3 If we repeated the whole procedure and again tossed the coin 100 times, might we then reject the null hypothesis?
Question 6.2.4 The rejection region in Figure 6.5 is asymmetric: its left part ends with k = 40, while its right part starts with k = 61. Why is that? Which other ways of defining the rejection region might be useful?
The binomial test is such a frequent activity that it has been wrapped into a single function, and we can compare its output to our results:

binom.test(x = numHeads, n = numFlips, p = 0.5)
##
## Exact binomial test
##
## data: numHeads and numFlips
## number of successes = 59, number of trials = 100, p-value = 0.08863
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4871442 0.6873800
## sample estimates:
## probability of success
## 0.59
The first class of alternatives we considered was that the individual coin tosses were still independent of each other, but that the probability of heads differed from 0.5. The second one was that the overall probability of heads may still be 0.5, but that subsequent coin tosses were correlated.
Question 6.3.3 Recall the concept of sufficient statistics from Chapter ??. Is the total number of heads a sufficient statistic for the binomial distribution? Why might it be a good test statistic for our first class of alternatives, but not for the second?
Question 6.3.4 Does a test statistic always have to be sufficient?
So let's remember that we typically have multiple possible choices of test statistic (in principle, it could be any numerical summary of the data). Making the right choice is important for getting a test with good power (see Section 6.4). What the right choice is will depend on what kind of alternatives we expect, and this is not always easy to know in advance.
Parametric theory versus simulation: once we have chosen the test statistic, we need to compute its null distribution. You can do this either with pencil and paper or by computer simulations. A pencil and paper solution that leads to a closed form mathematical expression (like Equation (6.1)) has the advantage that it holds for a range of model parameters of the null hypothesis (such as n, p), and it can be quickly computed for any specific set of parameters. But it is not always as easy as in the coin tossing example. Sometimes a pencil and paper solution is impossibly difficult to compute. At other times, it may require simplifying assumptions. An example is the null distribution of the t-statistic (which we will see later in this chapter): we can compute one if we assume that the data are independent and Normally distributed, and the result is called the t-distribution. Such modelling assumptions may be more or less realistic. Simulating the null distribution offers a potentially more accurate, more realistic and perhaps even more intuitive approach. The drawback of simulating is that it can take a rather long time, and we have to do extra work to get a systematic understanding of how varying parameters influence the result. Generally, it is more elegant to use the parametric theory when it applies (the assumptions don't need to be exactly true; it is sufficient if the theory's predictions are an acceptable approximation of the truth). When you are in doubt, simulate, or do both.

Rejection region: as for the rejection region, how small is small enough? That is your choice of the significance level α, which is the total probability of the test statistic falling into this region if the null hypothesis is true. (Some people, at one point in time and for a particular set of questions, colluded on α = 0.05 as being small. But there is nothing special about this number.) Even when α is given, the choice of the rejection region is not unique. A further condition that we require from a good rejection region is that the probability of the test statistic falling into it is as large as possible if the null hypothesis is indeed false. In other words, we want our test to have high power.
In Figure 6.5, the rejection region is split between the two tails of the distribution. This is because we anticipate that unfair coins could have a bias either towards head or towards tail; we don't know which. If we did know, we could instead concentrate our rejection region all on the appropriate side, e.g., the right tail if we think the bias would be towards head. Such choices are also referred to as two-sided and one-sided tests.
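As a hedged illustration (not from the original text), the one-sided version of the coin test is obtained through the alternative argument of binom.test. It is only appropriate if we had committed to the direction of the bias before seeing the data:

binom.test(x = numHeads, n = numFlips, p = 0.5, alternative = "greater")  # bias towards heads only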
Having set out the mechanics of testing, we can assess how well we are doing. Table 6.2 compares reality (whether or not the null hypothesis is in fact true) with the decision of the test.
The two types of error we can make are in the lower left and upper right cells of the table. It is always possible to reduce one of the two error types at the cost of increasing the other one. The real challenge is to find an acceptable trade-off between the two. This is exemplified in Figure 6.6. We can always decrease the false positive rate (FPR) by shifting the threshold to the right, that is, by becoming more conservative. But this happens at the price of a higher false negative rate (FNR).
Analogously, we can decrease the FNR by shifting the threshold to the left. But then again, this happens at the price of a higher FPR. A bit on terminology: the FPR is the same as the probability α that we mentioned above; 1 - α is also called the specificity of a test. The FNR is sometimes also called β, and 1 - β the power, sensitivity or true positive rate of a test.
Question 6.4.1 At the end of Section 6.3 we learned about one- and two-sided tests. Why does this distinction matter?
In the t-statistic (6.2), s^2 is the pooled variance of the two groups and c a scale factor that depends on the group sizes n_1 and n_2:

s^2 = \frac{1}{n_1 + n_2 - 2} \left( \sum_{i=1}^{n_1} (x_{1,i} - m_1)^2 + \sum_{j=1}^{n_2} (x_{2,j} - m_2)^2 \right), \qquad c = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}.    (6.3)

Figure 6.7: The PlantGrowth data (weight by group, for the groups ctrl, trt1 and trt2).
data("PlantGrowth")
ggplot(PlantGrowth, aes(y = weight, x = group, col = group)) +
geom_jitter(height = 0, width = 0.4) +
theme(legend.position = "none")
tt <- with(PlantGrowth,
t.test(weight[group =="ctrl"],
weight[group =="trt2"],
var.equal = TRUE))
tt
##
## Two Sample t-test
##
## data: weight[group == "ctrl"] and weight[group == "trt2"]
## t = -2.134, df = 18, p-value = 0.04685
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.980338117 -0.007661883
## sample estimates:
## mean of x mean of y
##     5.032     5.526
Question 6.5.1 What do you get from the comparison with trt1? What for trt1 versus
trt2?
Question 6.5.2 What is the significance of the var.equal = TRUE in the above call to
t.test?
Question 6.5.3 Rewrite the above call to t.test using the formula interface, i.e., by using the notation weight ~ group.
To compute the p-value, the t.test function uses the asymptotic theory for the t-statistic (6.2); this theory states that under the null hypothesis of equal means in both groups, this quantity follows a known, mathematical distribution, the so-called t-distribution with n_1 + n_2 - 2 degrees of freedom. The theory uses additional technical assumptions, namely that the data are independent and come from a Normal distribution with the same standard deviation. We could be worried about these assumptions. Clearly they do not hold: weights are always positive, while the Normal distribution extends over the whole real axis. The question is whether this deviation from the theoretical assumptions makes a real difference. We can use sample permutations to figure this out.
abs_t_null <- with(PlantGrowth,
  replicate(10000,
    abs(t.test(weight ~ sample(group))$statistic)))
ggplot(data_frame(`|t|` = abs_t_null), aes(x = `|t|`)) +
  geom_histogram(binwidth = 0.1, boundary = 0) +
  geom_vline(xintercept = abs(tt$statistic), col = "red")

Figure 6.8: The null distribution of the (absolute) t-statistic determined by simulations, namely by random permutations of the group labels.

mean(abs(tt$statistic) <= abs_t_null)
## [1] 0.0471
Question 6.5.4 Why did we use the absolute value function (abs) in the above code?
Question 6.5.5 Plot the (parametric) t-distribution with the appropriate degrees of
freedom?
Different flavors of t-test: the t-test comes in multiple flavors, all of which can be chosen through parameters of the t.test function. What we did above was a two-sided two-sample unpaired test with equal variance. Two-sided refers to the fact that we were open to reject the null hypothesis if the weight of the treated plants was either larger or smaller than that of the untreated ones. Two-sample indicates that we compared the means of two groups to each other; another option would be to compare the mean of one group against a given, fixed number. Unpaired means that there was no direct 1:1 mapping between the measurements in the two groups. If, on the other hand, the data had been measured on the same plants before and after treatment, then a paired test would be more appropriate, as it looks at the change of weight within each plant, rather than at their absolute weights. Equal variance refers to the way the statistic (6.2) is calculated. That expression is most appropriate if the variances within each group are about the same. If they are much different, an alternative form (Welch's t-test) and associated asymptotic theory exist.
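As a short illustrative sketch (not from the original text), these flavors correspond to different t.test arguments; the one-sample and paired calls below use arbitrary inputs purely to show the interface:

# Welch's test (unequal variances) is simply the default, var.equal = FALSE:
with(PlantGrowth,
     t.test(weight[group == "ctrl"], weight[group == "trt2"]))
# One-sample test of the ctrl weights against a fixed number (5 is arbitrary):
with(PlantGrowth, t.test(weight[group == "ctrl"], mu = 5))
# Paired test on before/after measurements of the same units,
# using R's built-in sleep data rather than the plant data:
with(sleep, t.test(extra[group == "1"], extra[group == "2"], paired = TRUE))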
The independence assumption: now let's try something peculiar and duplicate the data.
with(rbind(PlantGrowth, PlantGrowth),
t.test(weight[group =="ctrl"],
weight[group =="trt2"],
var.equal = TRUE))
##
## Two Sample t-test
##
## data: weight[group == "ctrl"] and weight[group == "trt2"]
## t = -3.1007, df = 38, p-value = 0.003629
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8165284 -0.1714716
## sample estimates:
## mean of x mean of y
##     5.032     5.526
Note how the estimates of the group means (and thus, of the difference) are
unchanged, but the p-value is now much smaller! We can conclude two things from
this:
The power of the t-test depends on the sample size. Even if the underlying biological differences are the same, a dataset with more samples tends to give more significant results. (You can already see this from Equation (6.3).)

The assumption of independence between the measurements is really important. Blatant duplication of the same data is an extreme form of dependence, but to some extent the same thing happens if you mix up different levels of replication. For instance, suppose you had data from 8 plants, but measured the same thing twice on each plant (technical replicates); then passing these measurements to a downstream analysis, such as the t-test, as if they were 16 independent measurements is wrong.
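A minimal sketch of the appropriate handling, with made-up data (not from the text): 8 plants with 2 technical replicates each, summarized to one value per plant before testing.

library("dplyr")
set.seed(1)
plants <- data.frame(
  plant  = factor(rep(1:8, each = 2)),                 # biological units
  group  = rep(c("ctrl", "trt"), each = 8),            # 4 plants per group
  weight = rep(rnorm(8, mean = 5, sd = 0.5), each = 2) # plant-level signal
           + rnorm(16, sd = 0.05))                     # technical noise
per_plant <- plants %>%
  group_by(plant, group) %>%
  summarize(weight = mean(weight), .groups = "drop")   # one row per plant
t.test(weight ~ group, data = per_plant, var.equal = TRUE)  # n = 8, not 16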
Let's go back to the coin tossing example. We could not reject the null hypothesis (that the coin is fair) at a level of 5%, even though we knew that it was unfair: after all, probHead was 0.6 on page 32. Let's suppose we now start looking at different test statistics. Perhaps the number of consecutive series of 3 or more heads. Or the number of heads in the first 50 coin flips. And so on. At some point we will find a test that happens to result in a small p-value, even if just by chance (after all, the probability for the p-value to be less than 5% under the null is 0.05, not an infinitesimally small number). We just did what is called p-value hacking (http://fivethirtyeight.com/features/science-isnt-broken; Head et al., 2015). You see what the problem is: in our zeal to prove our point, we tortured the data until some statistic did what we wanted. A related tactic is hypothesis switching or HARKing, hypothesizing after the results are known: we have a dataset, maybe we have invested a lot of time and money into assembling it, so we need results. We come up with lots of different null hypotheses, test them, and iterate, until we can report something interesting.

Avoid the fallacy: keep in mind that our statistical test is never attempting to prove that our null hypothesis is true; we are simply saying whether or not there is evidence for it to be false. If a high p-value were indicative of the truth of the null hypothesis, we could formulate a completely crazy null hypothesis, do an utterly irrelevant experiment, collect a small amount of inconclusive data, find a p-value that would just be a random number between 0 and 1 (and so with some high probability above our threshold α) and, whoosh, our hypothesis would be demonstrated!

All these tactics are not according to the rule book, as described in Section 6.3, with its linear, non-iterative sequence of first choosing the hypothesis and the test, and only then seeing the data. But, of course, they are often closer to reality. With biological data, we tend to have many different choices for normalising the data, transforming the data, adding corrections for apparent batch effects, removing outliers, and so on. The topic is complex and open-ended. Wasserstein and Lazar (2016) give a very readable short summary of the problems with how p-values are used in science, and of some of the misconceptions. They also highlight how p-values can be fruitfully used. The essential message is: be completely transparent about your data, what analyses were tried, and how they were done. Provide the analysis code. Only with such contextual information can a p-value be useful.

6.7 Multiple Testing
Question 6.7.1 Look up xkcd comic 882. Why didn't the newspaper report the results for the other colors?
The same quandary occurs with high-throughput data in biology. And with force!
You will be dealing not only with 20 colors of jellybeans, but, say, with 20,000 genes
that were tested for differential expression between two conditions, or with 3 billion
positions in the genome where a DNA mutation might have happened. So how do we
deal with this? Lets look again at our table relating statistical test results with reality
(Table 6.2), this time framing everything in terms of many null hypotheses.
m: total number of hypotheses
m0: number of null hypotheses
V: number of false positives (a measure of type I error)
T: number of false negatives (a measure of type II error)
S, U: number of true positives and true negatives
R: number of rejections
              Null hypothesis is true   Alternative is true   Total
Rejected                V                        S              R
Not rejected            U                        T            m - R
Total                  m0                     m - m0            m
If we perform m independent tests, each at significance level α, the probability of making at least one false rejection is

1 - (1 - \alpha)^m.    (6.4)

For any fixed α, this probability is appreciable as soon as m is of the order of 1/α, and it tends towards 1 as m becomes larger. This relationship can have big consequences for experiments like DNA matching, where a large database of potential matches is searched. For example, if there is a one in a million chance that the DNA profiles of two people match by random error, and your DNA is tested against a database of 800,000 profiles, then the probability of a random hit with the database (i.e., without you being in it) is:

1 - (1 - 1/1e6)^8e5
## [1] 0.5506712

That's pretty high. And once the database contains a few million profiles, a false hit is virtually unavoidable.
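As a sketch (not part of the original text), the closed-form number above can be checked by simulation: each simulated database search performs 8e5 independent comparisons, each with a one-in-a-million chance of a spurious match.

set.seed(42)
mean(rbinom(100000, size = 8e5, prob = 1e-6) > 0)  # should come out close to 0.55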
Question 6.8.1 Prove that the probability (6.4) does indeed become very close to 1 when m
is large.
ggplot(data_frame(
  alpha = seq(0, 7e-6, length.out = 100),
  p = 1 - (1 - alpha)^m),
  aes(x = alpha, y = p)) + geom_line() +
  xlab(expression(alpha)) +
  ylab("Prob( at least one false rejection )") +
  geom_hline(yintercept = 0.05, col = "red")
In Figure 6.9, the black line intersects the red line (which corresponds to a value of 0.05) at α = 5.13 × 10^-6, which is just a little bit more than the value of 0.05/m implied by the Bonferroni correction.
Question 6.8.2 Why are the two values not exactly the same?
A potential drawback of this method, however, is that when m is large, the rejection threshold for each individual test becomes very small, and the resulting procedure can be extremely conservative: insisting on never making even a single false rejection may cost us most of our power, or would not be an effective use of our time and money. We'll see that there are more nuanced methods of controlling our type I error.
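As a small hedged sketch (not in the original text), the Bonferroni-style analysis can be run on the RNA-seq p-values used in the rest of this section (the awde object); the family-wise error target of 0.05 is illustrative:

padj_bonf <- p.adjust(awde$pvalue, method = "bonferroni")
sum(padj_bonf <= 0.05)  # number of hypotheses rejected under FWER control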
6.9.1 The p-value histogram

Let's plot the histogram of p-values.

ggplot(awde, aes(x = pvalue)) +
  geom_histogram(binwidth = 0.025, boundary = 0)

The histogram (Figure 6.10) is an important sanity check for any analysis that involves many parallel tests. Ideally, it is a mixture of two components: a uniform background from the tests for which the null hypothesis is true, since under the null, the p-value is distributed uniformly in [0, 1]; and a peak at the left, from small p-values that were emitted by the alternatives.

Figure 6.10: p-value histogram for the airway data.
The relative size of these two components depends on the fraction of true nulls and true alternatives in the data. The shape of the peak towards the left depends on the power of the tests: if the experiment was underpowered, we can still expect that the p-values from the alternatives tend towards being small, but some of them will scatter up into the middle of the range.

Suppose we reject all tests with a p-value less than α. We can visually determine an estimate of the fraction of nulls among these rejections with a plot like Figure 6.11:
# binw (histogram bin width), alpha (the rejection cutoff) and pi0 (the
# estimated fraction of true nulls) are defined in code not shown here.
ggplot(awde,
  aes(x = pvalue)) + geom_histogram(binwidth = binw, boundary = 0) +
  geom_hline(yintercept = pi0 * binw * nrow(awde), col = "blue") +
  geom_vline(xintercept = alpha, col = "red")

Figure 6.11: Visual estimation of the FDR with the p-value histogram.
We see that there are 4783 p-values in the first bin ([0, α]), among which we expect around 439 to be nulls (as indicated by the blue line). Thus we can estimate the fraction of false rejections as:

pi0 * alpha / mean(awde$pvalue <= alpha)
## [1] 0.09168932
Coming back to our terminology of Table 6.4, the false discovery rate (FDR) is defined as

\mathrm{FDR} = E\left[ \frac{V}{\max(R, 1)} \right].    (6.5)

The expression in the denominator makes sure that the maths are well-defined even when R = 0 (and thus, by implication, V = 0). E[·] stands for the expectation value. That means that the FDR is not a quantity associated with a specific outcome of V and R for one particular experiment. Rather, given our choice of tests and associated rejection rules for them, it is the average proportion of type I errors out of the rejections made, where the average is taken (at least conceptually) over many replicate instances of the experiment. Since the FDR is an expectation value, it does not provide worst-case control: in any single experiment, the so-called false discovery proportion (FDP), that is, V/R without the E[·], could be much higher (or lower), just as knowing the mean of a population does not tell you the values of the extremes.

6.9.2 The Benjamini-Hochberg algorithm for controlling the FDR

There is a more elegant alternative to the visual FDR method of the last section: the procedure of Benjamini and Hochberg (1995). First, order the p-values in increasing order, p(1), ..., p(m). Then, for some choice of φ (our target FDR), find the largest value of k that satisfies

p_{(k)} \le \phi \, k / m.

Finally, reject the hypotheses 1, ..., k.
We can see how this procedure works when applied to our RNA-seq p-values through a simple graphical illustration:

phi <- 0.10
awde <- mutate(awde, rank = rank(pvalue))
m <- nrow(awde)
ggplot(filter(awde, rank <= 7000), aes(x = rank, y = pvalue)) +
  geom_line() + geom_abline(slope = phi / m, col = "red")

Figure 6.12: Visualisation of the Benjamini-Hochberg procedure. Shown is a zoom-in to the 7000 lowest p-values.

The method now simply finds the rightmost point where the black line (our p-values) and the red line (slope φ/m) intersect, and rejects all tests to the left of it.

kmax <- with(arrange(awde, rank),
  last(which(pvalue <= phi * rank / m)))
kmax
## [1] 4563
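As a cross-check (a sketch, not part of the original code), the same count of rejections can be obtained from R's built-in p-value adjustment: the number of BH-adjusted p-values at or below phi equals kmax.

sum(p.adjust(awde$pvalue, method = "BH") <= phi)  # same count as kmax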
Question 6.9.4 Compare the value of kmax with the number of 4783 from above (Fig-
ure 6.11). Why are they different?
Question 6.9.5 Look at the code associated with the option method="BH" of the
p.adjust function that comes with R. Compare it to what we did above.
In the two-groups model, the density of the p-values is written as the mixture

f(p) = \pi_0 + (1 - \pi_0) f_{\text{alt}}(p).    (6.6)

Here, f(p) is the density of the distribution (what the histogram would look like with infinitely much data and infinitely small bins), π0 is a number between 0 and 1 that represents the size of the uniform component, and f_alt is the alternative component. These functions are visualised in the upper panel of Figure 6.14: the blue areas together correspond to the graph of f_alt(p), the grey areas to that of f_null(p) = π0. If we now consider one particular cutoff p (say, p = 0.1 as in Figure 6.14), then we can decompose the value of f at the cutoff (red line) into the contribution from the nulls (light red, π0) and from the alternatives (darker red, (1 - π0) f_alt(p)). So we have the local false discovery rate

\mathrm{fdr}(p) = \pi_0 / f(p),    (6.7)

and this quantity, which by definition is between 0 and 1, tells us the probability that a hypothesis which we rejected at some cutoff p would be a false positive. Note how the fdr in Figure 6.14 is a monotonically decreasing function of p, and this goes with our intuition that the fdr should be lowest for the smallest p and then gradually get larger, until it reaches 1 at the very right end. We can make a similar decomposition not only for the red line, but also for the area under the curve. This is

F(p) = \int_0^p f(t) \, dt,    (6.8)

and the ratio of the dark grey area (that is, π0 times p) to that is the tail area false discovery rate (Fdr):

\mathrm{Fdr}(p) = \pi_0 \, p / F(p).    (6.9)

The convention is to use the lower case abbreviation fdr for the local, and the abbreviation Fdr for the tail-area false discovery rate in the context of the two-groups model (6.6). The abbreviation FDR is used for the original definition (6.5), which is a bit more general.

We'll use the data version of F for diagnostics in Figure 6.18.
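A minimal sketch (not from the original text) of a plug-in estimate of the tail-area Fdr, using the empirical CDF of the p-values as the data version of F; the value of pi0 must be supplied, for example the eta0 estimate from fdrtool below:

Fdr_hat <- function(p, pvalues, pi0) {
  # Equation (6.9) with F replaced by the empirical CDF of the p-values
  pi0 * p / ecdf(pvalues)(p)
}
Fdr_hat(0.01, awde$pvalue, pi0 = 0.76)  # 0.76 is roughly fdrtool's eta0 estimate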
The packages qvalue and fdrtool offer facilities to fit these models to data. is a bit more general.
library("fdrtool")
ft <- fdrtool(awde$pvalue, statistic = "pvalue")
In fdrtool, what we called π0 above is called eta0:
ft$param[, "eta0"]
##      eta0
## 0.7605948
Question 6.10.1 What do the plots show that are produced by the above call to fdrtool?
Question 6.10.2 Explore the other elements of the list ft.
Question 6.10.3 What does the empirical in empirical Bayes methods stand for?
Treating all tests as exchangeable is, however, not always optimal. Let's look at an example. Intuitively, the signal-to-noise ratio for genes with larger numbers of reads mapped to them should be better than for genes with few reads, and that should affect the power of our tests. We look at the mean of normalized counts across samples. In the DESeq2 software this quantity is called the baseMean.

awde$baseMean[1]
## [1] 708.6022
cts <- counts(awfit, normalized = TRUE)[1, ]
cts
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##   663.3142   499.9070   740.1528   608.9063   966.3137   748.3722
## SRR1039520 SRR1039521
##   836.2487   605.6024
mean(cts)
## [1] 708.6022

Figure 6.15: Histogram of baseMean. We see that it covers a large dynamic range, from close to 0 to around 3.3 × 10^5.
Next we produce its histogram across genes, and a scatterplot between it and the p-values.

ggplot(awde, aes(x = asinh(baseMean))) +
  geom_histogram(bins = 60)
ggplot(awde, aes(x = rank(baseMean), y = -log10(pvalue))) +
  geom_hex(bins = 60) +
  theme(legend.position = "none")

Figure 6.16: Scatterplot of the rank of baseMean versus the negative logarithm of the p-value. For small values of baseMean, no small p-values occur. Only for genes whose read counts across all samples are sufficiently large does the test for differential expression have the power to come out with a small p-value.

Question 6.11.1 Why did we use the asinh transformation for the histogram? What does it look like with no transformation, the logarithm, the shifted logarithm, i.e., log(x + const.)?
Question 6.11.2 In the scatterplot, why did we use -log10 for the p-values? Why the rank transformation for the baseMean?

For convenience, we discretize baseMean into a factor variable stratum, which corresponds to six equal-sized groups.

awde <- mutate(awde, stratum = cut(baseMean,
  breaks = quantile(baseMean, probs =
    seq(0, 1, length.out = 7)),
  include.lowest = TRUE))

In Figures 6.17 and 6.18 we see the histograms of p-values and the ECDFs stratified by stratum.

ggplot(awde, aes(x = pvalue)) +
  geom_histogram(binwidth = 0.025, boundary = 0) +
  facet_wrap( ~ stratum, nrow = 4)
ggplot(awde, aes(x = pvalue, col = stratum)) +
  stat_ecdf(geom = "step")

Figure 6.17: p-value histograms of the airway data, stratified into six equally sized groups defined by increasing value of baseMean.

Figure 6.18: Same data as in Figure 6.17, shown with ECDFs.

If we were to fit the two-groups model to these strata separately, we would get quite different parameters (i.e., π0, f_alt). For the most lowly expressed genes (those in the first baseMean bin), the power of the DESeq2 test is low, and the p-values essentially all come from the null component. As we go higher in average expression, the height of the peak of small p-values increases.
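The code that creates the IHW result object used below does not appear in the text above; as a hedged sketch based on the IHW Bioconductor package's documented interface, it might look like this (the covariate baseMean and the target FDR alpha = 0.1 are illustrative choices):

library("IHW")
ihw_res <- ihw(pvalue ~ baseMean, data = awde, alpha = 0.1)
rejections(ihw_res)         # number of hypotheses rejected with weighting
head(adj_pvalues(ihw_res))  # weighted adjusted p-values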
Let's compare this to what we get from the ordinary (unweighted) Benjamini-Hochberg method:

padj_BH <- p.adjust(awde$pvalue, method = "BH")

With hypothesis weighting (Ignatiadis et al., 2016), we get more rejections. For these data, the difference
is notable though not spectacular; this is because their signal-to-noise ratio is already quite high. In other situations (e.g., when there are fewer replicates, or they are more noisy, or when the effect of the treatment is less drastic), the difference from using IHW can be more pronounced.

We can have a look at the weights determined by the ihw function.

plot(ihw_res)

Intuitively, what happens here is that IHW chooses to put more weight on the strata with large baseMean, where the tests have more power, and less weight on the low-count strata, where small p-values hardly ever occur.
6.13 Exercises
Exercise 6.1 What is a data type or an analysis method from your scientific field of ex-
pertise that relies on multiple testing? Do you focus on FWER or FDR? Are the hypotheses all
exchangeable, or are there any informative covariates?
Exercise 6.2 Why do statisticians often focus so much on the null hypothesis of a test,
compared to the alternative hypothesis?
Exercise 6.3 How can we ever prove that the null hypothesis is true? Or that the alternative
is true?
Exercise 6.4 Make a less extreme example of correlated test statistics than the data du-
plication at the end of Section 6.5. Simulate data with true null hypotheses only, so that the
data morph from being completely independent to totally correlated as a function of some
continuous-valued control parameter. Check type-I error control (e. g., with the p-value his-
togram) as a function of this control parameter.
Exercise 6.5 Find an example in the published literature where it looks like p-value hacking, outcome switching or HARKing played a role.
Exercise 6.6 What other type-I and type-II error concepts are there for multiple testing?
Exercise 6.7 The FDR is an expectation value, i.e., it aims to control the average behavior of a procedure. Are there methods for worst-case control?
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289-300, 1995.
Richard Bourgon, Robert Gentleman, and Wolfgang Huber. Independent filtering increases detection power for high-throughput experiments. PNAS, 107(21):9546-9551, 2010. URL http://www.pnas.org/content/107/21/9546.long.
Daniel B Carr, Richard J Littlefield, WL Nicholson, and JS Littlefield. Scatterplot matrix techniques for large N. Journal of the American Statistical Association, 82(398):424-436, 1987.
W. S. Cleveland, M. E. McGill, and R. McGill. The shape parameter of a two-variable graph. Journal of the American Statistical Association, 83:289-300, 1988.
Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2010.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2008.
Megan L Head, Luke Holman, Rob Lanfear, Andrew T Kahn, and Michael D Jennions. The extent and consequences of p-hacking in science. PLoS Biology, 13(3):e1002106, 2015.
Nikolaos Ignatiadis, Bernd Klaus, Judith Zaugg, and Wolfgang Huber. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 2016.
Ross Ihaka. Color for presentation graphics. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing. Vienna, Austria, 2003.
R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249-264, 2003.
J. Mollon. Seeing colour. In T. Lamb and J. Bourriau, editors, Colour: Art and Science. Cambridge University Press, 1995.
Y. Ohnishi, W. Huber, A. Tsumura, M. Kang, P. Xenopoulos, K. Kurimoto, A. K. Oles, M. J. Arauzo-Bravo, M. Saitou, A. K. Hadjantonakis, and T. Hiiragi. Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages. Nature Cell Biology, 16(1):27-37, 2014.
H. von Helmholtz. Handbuch der Physiologischen Optik. Leopold Voss, Leipzig, 1867.
Ronald L Wasserstein and Nicole A Lazar. The ASA's statement on p-values: context, process, and purpose. The American Statistician, 2016.
Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009. ISBN 978-0-387-98140-6. URL http://had.co.nz/ggplot2/book.
Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1):3-28, 2010.