Introduction To Statistics 14 Weeks
A Tour in 14 Weeks
Reinhard Furrer
and the Applied Statistics Group
Contents

Prologue

2 Random Variables
  2.1 Basics of Probability Theory
  2.2 Discrete Distributions
  2.3 Continuous Distributions
  2.4 Expectation and Variance
  2.5 The Normal Distribution
  2.6 Bibliographic Remarks
  2.7 Exercises and Problems

4 Estimation of Parameters
  4.1 Linking Data with Parameters
  4.2 Construction of Estimators
  4.3 Comparison of Estimators
  4.4 Interval Estimators
  4.5 Bibliographic Remarks
  4.6 Exercises and Problems

5 Statistical Testing
  5.1 The General Concept of Significance Testing
  5.2 Hypothesis Testing
  5.3 Testing Means and Variances in Gaussian Samples
  5.4 Duality of Tests and Confidence Intervals
  5.5 Misuse of p-Values and Other Dangers
  5.6 Bibliographic Remarks
  5.7 Exercises and Problems

Epilogue

B Calculus
  B.1 Functions
  B.2 Functions in Higher Dimensions
  B.3 Approximating Functions

References

Glossary
This document accompanies the lecture STA120 Introduction to Statistics that has been given
each spring semester since 2013. The lecture is given in the framework of the minor in Applied
Probability and Statistics (www.math.uzh.ch/aws) and comprises 14 weeks of two hours of lecture
and one hour of exercises per week.
As the lecture's topics are structured on a week-by-week basis, the script contains thirteen
chapters, each covering "one" topic. Some of the chapters contain consolidations or in-depth studies
of previous chapters. The last week is dedicated to a recap/review of the material.
I have thought long and hard about an optimal structure for this script. Let me quickly
summarize my thoughts. It is very important that the document has a structure that is
tailored to the content I cover in class each week. This inherently leads to 14 "chapters." Instead
of covering Linear Models over four weeks, I framed the material as four seemingly different
chapters. This structure helps me to better frame the lectures: each week has a start, a set
of learning goals and a predetermined end.
So to speak, the script covers not 14 but essentially only three topics:
1. Background
2. Statistical Foundations
3. Linear Modeling
We will not cover these topics chronologically. For a smoother setting, we will discuss some part
of the background (multivariate Gaussian distribution) before linear modeling, when we need it.
This also allows a recap of several univariate concepts. Similarly, we cover the Bayesian approach
at the end. The book is structured according to the path illustrated in Figure 1.
In case you use this document outside the lecture, here are several alternative paths through
the chapters, with a minimal impact on concepts that have not been covered:
All the datasets that are not part of regular CRAN packages are available via the URL
www.math.uzh.ch/furrer/download/sta120/. The script is equipped with appropriate links that
facilitate the download.
Figure 1: Path through the script, from Start to End: Background (Exploratory Data Analysis, Random Variables, Functions of Random Variables, Multivariate Normal Distribution), Statistical Foundations (Estimation of Parameters, Statistical Testing; frequentist: Proportions, Rank-Based Methods; Bayesian: Bayesian Approach, Monte Carlo Methods) and Linear Modeling (Correlation and Simple Regression, Multiple Regression, Analysis of Variance, Design of Experiments).
The lecture STA120 Introduction to Statistics formally requires the prerequisites MAT183
Stochastic for the Natural Sciences and MAT141 Linear Algebra for the Natural Sciences or
equivalent modules. For the content of these lectures we refer to the corresponding course
web pages www.math.uzh.ch/fs20/mat183 and www.math.uzh.ch/hs20/mat141. It is possible to
successfully pass the lecture without having attended the aforementioned lectures, though some
self-study is necessary. This book and the accompanying exercises require differentiation, integration,
vector notation, basic matrix operations and the concept of solving a linear system of equations.
Appendices B and C give the bare minimum of relevant concepts in calculus and in linear algebra.
We review and summarize the relevant concepts of probability theory in Chapter 2 and in parts
of Chapter 3.
I have augmented this script with short video sequences giving additional – often more tech-
nical – insight. These videos are indicated in the margins with a ‘video’ symbol as here.
Some more details about notation and writing style.
• I do not differentiate between 'Theorem', 'Proposition' and 'Corollary'; they are all termed
'Property'.
• There are very few mathematical derivations in the main text. These are typical and
important ones. Quite often we only state results. For the interested reader, the proofs are
given in the 'theoretical derivation' problems at the end of the chapter.
• Some of the end-of-chapter problems and exercises are quite detailed, others are open-ended.
Of course, there are often several approaches to achieve the same result; the R-Code of the
solutions is one way to get the necessary output.
• Variable names are typically on the lower end of explicitness and would definitely be criticized
by a computer scientist scrutinizing the code.
• I keep the headers of the R-Codes rather short and succinct. Most of them are linked to
examples and figures with detailed explanations.
• The R-Codes contain short comments and should be understandable on their own. The degree
of difficulty and detail increases over the chapters. For example, in earlier chapters we use
explicit loops for simulations, whereas in later chapters we typically vectorize.
• The R-Code of each chapter allows one to reconstruct the figures, up to possible minor
differences in margin specifications. There are a few illustrations made in R for which the code
is presented only online and not in the book.
• At the end of each chapter there are standard references for the material. In the text I
only cite particularly important references.
• For clarity, we omit the prefix 'http://' or 'https://' from URLs unless necessary. The
links have been tested and worked at the time of writing.
Many have contributed to this document. A big thanks to all of them, especially (alphabeti-
cally) Zofia Baranczuk, Federico Blasi, Julia Braun, Matteo Delucci, Eva Furrer, Florian Gerber,
Michael Hediger, Lisa Hofer, Mattia Molinaro, Franziska Robmann, Leila Schuh and many more.
Kelly Reeve spent many hours improving my English. Without their help, you would not be
reading these lines. Please let me know of any necessary improvements; I highly appreciate
all forms of contributions, in the form of errata, examples or text blocks. Contributions can be
deposited directly in the following Google Doc sheet.
Major errors that are detected after the lecture of the corresponding semester has started are
listed at www.math.uzh.ch/furrer/download/sta120/errata.txt.
Reinhard Furrer
February 2023
Chapter 1
Exploratory Data Analysis and Visualization of Data
⋄ Understand the concept and the need of an exploratory data analysis (EDA)
within a statistical data analysis
⋄ Know different data types and operations we can perform with them
⋄ Perform an EDA in R
It is surprisingly difficult to start a statistical study from scratch (somewhat similar to starting
the first chapter of a book). Hence, to get started, we assume a rather pragmatic setup: suppose
we have “some” data. This chapter illustrates the first steps thereafter: exploring and visualizing
the data. The exploration consists of understanding the types and the structure of the data,
including its features and peculiarities. A valuable graphical visualization should quickly and
unambiguously transmit its message. Depending on the data and the intended message, different
types of graphics can be used — but all should be clear, concise and stripped of clutter. Of
course, much of the visualization aspects are not only used at the beginning of the study but are
also used after the statistical analysis. Figure 1.1 shows one representation of a statistical data
analysis flowchart; in this chapter we discuss the rightmost box of the top line, performing
an exploratory data analysis. Of course, many elements thereof will again be very helpful when
we summarize the results of the statistical analysis (part of the bottom rightmost box in the
workflow). Subsequent chapters come back to questions we should ask ourselves before we start
collecting data, i.e., before we start an experiment, and how to conduct the statistical data
analysis.
Figure 1.1: Data analysis workflow seen from a statistical perspective. There might be
situations where an EDA shows that no statistical analysis is necessary (dashed arrow
on the right). Similarly, model validation may indicate that the proposed model is not
adequate (dashed back-pointing arrow).
The workflow is of course not always as linear as indicated, even trained statisticians may
need to revisit statistical models used. Moreover, the workflow is just an extract of the scientific
cycle, or the scientific method: conclusions lead to new or refined hypotheses that require again
data and analyses.
At the beginning of any statistical data analysis, an exploratory data analysis (EDA) should
be performed (Tukey, 1977). An EDA summarizes the main characteristics of the data by
representing observations or measured values graphically and describing them qualitatively and
quantitatively. Each dataset tells us a ‘story’ that we should try to understand before we begin
with the analysis. To do so, we should ask ourselves questions like
• What is the data collection or data generating process? (discussed in Section 1.1.1)
• What are the key summary statistics of the data? (discussed in Sections 1.2 and 1.3.1)
At the end of a study, results are often summarized graphically because it is generally easier
to interpret and understand graphics than values in a table. As such, graphical representation
of data is an essential part of statistical analysis, from start to finish.
1.1 Structure and Labels of Data
Assuming that a data collection process is completed, the "analysis" of this data is one of the
next steps. This analysis is typically done in an appropriate software environment. There are
many such environments, but our prime choice is R (R Core Team, 2020); often-used alternatives are SPSS,
SAS, Minitab, Stata and Prism, besides other general-purpose programming languages and platforms.
Appendix A gives links to R and to some R resources. We assume from now on a running version of R.
The first step of the analysis is loading the data into the software environment. This task sounds
trivial, and for pre-processed and readily available datasets it often is. Cleaning one's own and others'
data is typically very painful and eats up much unanticipated time. We do load external data,
but in this book we will not cover the aspect of data cleaning — be aware of this step when
planning your analysis.
Example 1.1. There are many datasets available in R, the command data() lists all that
are currently available on your computing system. Additional datasets are provided by many
packages. Suppose the package spam is installed, e.g., by executing install.packages("spam"),
then the function call data(package="spam") lists the datasets included in the package spam. A
specific dataset is then loaded by calling data() with argument the name of the dataset (quoted
or unquoted work both). (The command data( package=.packages( all.available=TRUE))
would list all datasets from all R packages that are installed on your computing system.) ♣
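The commands mentioned in this example can be collected as follows (a minimal sketch; the package spam is assumed to be installed and the dataset name in the third line is a placeholder):

data()                                   # list datasets of currently loaded packages
data( package="spam")                    # list datasets provided by the package 'spam'
# data( "datasetname", package="spam")   # load a specific dataset (placeholder name)
# data( package=.packages( all.available=TRUE))  # datasets of all installed packages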
Example 1.2. Often, we will work with our own data and hence we have to "read in" (or load or
import) the data, e.g., with the R functions read.table() or read.csv(). In R-Code 1.1 we read
in observations of mercury content in lake Geneva sediments. The data is available at www.
math.uzh.ch/furrer/download/sta120/lemanHg.csv and is stored in a CSV file, with observation
number and mercury content (mg/kg) on individual lines (Furrer and Genton, 1999). After
importing the data, it is of utmost importance to check whether the variables have been properly
read and whether (possible) row and column names are correctly parsed. Possible R functions for this
task are str(), head(), tail(), or visualizing the entire dataset with View(). The number of
observations and variables should be checked (if known), for example with dim() for matrices (a
two-dimensional arrangement of numbers) and dataframes (a handy tabular format of R), with length()
for vectors, or from the output of str(). Note that in subsequent examples we will not always
display the output of all these verification calls, to keep the R display to a reasonable length.
We limit the output to pertinent calls such that the given R-Code can be understood by itself
without the need to run it in an R session (although this is recommended).
R-Code 1.1 also includes a commented example that illustrates how the format of the
imported dataset changes when arguments of the importing function (here read.csv()) are not
properly set (the commented second-to-last line). ♣
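R-Code 1.1 itself is not reproduced in this extract; the following minimal sketch performs the steps described above (the object name lemanHg and the column name Hg are assumptions):

lemanHg <- read.csv( "https://www.math.uzh.ch/furrer/download/sta120/lemanHg.csv")
str( lemanHg)       # check that classes and column names are correctly parsed
head( lemanHg)      # first few observations
dim( lemanHg)       # number of observations and variables
# str( read.csv( "https://www.math.uzh.ch/furrer/download/sta120/lemanHg.csv",
#                header=FALSE))   # improper arguments: the header ends up as data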
The tabular format we are using is typically arranged such that each observation (data for
a specific location or subject or time point) that may consist of several variables or individual
measurements (different chemical elements or medical conditions) is on one line. We call the
entirety of these data points a dataset.
When analyzing data it is always crucial to query the origins of the data. In the following,
we state some typical questions that we should ask ourselves, as well as resulting consequences
or possible implications and pitfalls. "What was the data collected for?": The data may be
of different quality if collected by scientists using it as their primary source compared to data
used to counter 'fake news' arguments. "Was the data observational or from a carefully
designed experiment?": The latter is often more representative and less prone to biases due to
the chosen observation span. "Has the data been collected from different sources?": This might
imply heterogeneous data because the scientists or labs have used different standards, protocols
or tools. In case of specific questions about the data, it may not be possible to contact all original
data owners. "Has the data been recorded electronically?": Electronic data recording has lower
chances of erroneous entries. No human is perfect (especially when reading off and entering numbers),
and manually entered datasets are more prone to contain errors compared to automatically recorded ones.
In the context of an educational data analysis, the following questions are often helpful. "Is
the data a classical textbook example?": In such a setting, we typically have cleaned data that
serves to illustrate one (or at most a couple of) pedagogical concepts. The lemanHg dataset used in
this chapter is such a case; no negative surprises are to be expected. "Has the data been analyzed
elsewhere before?": Such data has typically been cleaned and we already have one reference for
an analysis. "Is the data stemming from a simulation?": In such cases, there is a known
underlying generation process.
Of course, in all the cases mentioned before we can safely apply the proverb "the exception proves
the rule".
Presumably we all have a fairly good idea of what data is. However, this view is often quite
narrow and boils down to "data are numbers". But "data is not just data": data can be hard or
soft, quantitative or qualitative.
Hard data is associated with quantifiable statements like “The height of this female is 172
cm.” Soft data is often associated with subjective statements or fuzzy quantities requiring inter-
pretation, such as “This female is tall”. Probability statements can be considered hard (derived
from hard data) or soft (due to a lack of quantitative values). In this book, we are especially
concerned with hard data.
An important distinction is whether data is qualitative or quantitative in nature. Qualitative
data consists of categories and are either on nominal scale (e.g., male/female) or on ordinal
scale (e.g., weak<average<strong, nominal with an ordering). Quantitative data is numeric and
mathematical operations can be performed with it.
Quantitative data is either discrete, taking on only specific values (e.g., integers or a subset
thereof), or continuous, taking on any value on the real number line. Quantitative data is
measured on an interval scale or ratio scale. Unlike the ordinal scale, the interval scale is
uniformly spaced. The ratio scale is characterized by a meaningful absolute zero in addition to the
characteristics of the interval scale. Depending on the measurement scale, certain mathematical
operators and thus summary measures or statistical measures are appropriate. The measurement
scales are classified according to Stevens (1946) and summarized in Table 1.1. We will discuss
the statistical measures based on data next and their theoretical counterparts and properties in
later chapters.
Non-numerical data, often gathered from open-ended responses or in audio-visual form, is
considered qualitative. We will not discuss such types of data here.
Example 1.3. The classification of elements as either "C" or "H" results in a nominal variable. If
we associate "C" with cold and "H" with hot, we can use an ordinal scale (based on temperature).
In R, nominal scales are represented with factors. R-Code 1.2 illustrates the creation of
nominal and interval scales as well as some simple operations. It would be possible to create
ordinal scales as well, but we will not use them in this book.
When measuring temperature in Kelvin (absolute zero at −273.15 °C), a statement such as
"The temperature has increased by 20%" can be made. However, a comparison of "twice as hot"
in degrees Celsius does not make sense, as the origin is arbitrary. ♣
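R-Code 1.2 is not reproduced in this extract; a minimal sketch of nominal and interval scaled variables in R (the values are hypothetical):

elem <- factor( c("C", "H", "H", "C", "H"))   # nominal scale: a factor
table( elem)                                  # admissible operation: counting categories
# an ordered factor, factor( elem, levels=c("C","H"), ordered=TRUE), yields an ordinal scale
temp <- c(12.3, 25.1, 18.4, 21.0, 30.2)       # interval scale: temperature in degrees Celsius
temp - 10                                     # differences are meaningful
mean( temp)                                   # arithmetic mean is admissible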
Table 1.1: Types of scales according to Stevens (1946) and possible mathematical
operations. The statistical measures are for a description of location and spread.

nominal scale: operators =, ≠; location: mode.
ordinal scale: operators =, ≠, <, >; location: mode, median.
interval scale: operators =, ≠, <, >, −, +; location: mode, median, arithmetic mean; spread: range, standard deviation.
real scale: operators =, ≠, <, >, −, +, ×, /; location: mode, median, arithmetic mean, geometric mean; spread: range, studentized range, standard deviation, coefficient of variation.
As a side note, with a careful inspection of missing values in ozone readings, the Antarctic
"ozone hole" would have been discovered more than one decade earlier (see, e.g.,
en.wikipedia.org/wiki/Ozone_depletion#Research_history).
1.2 Descriptive Statistics

Informally, a statistic is a single measure of some attribute of the data; in the context of this
chapter, a statistic gives a good first impression of the distribution of the data. Typical statistics
for the location (i.e., the position of the data) include the sample mean, truncated/trimmed
mean, sample median, sample quantiles and quartiles. The trimmed mean omits a small fraction
of the smallest and the same small fraction of the largest values. A trimming of 50% is equivalent
to the sample median. Sample quantiles, or more specifically sample percentiles, link observations
or values with their position in the ordered data. For example, the sample median is the 50th
percentile: half the data is smaller than the median, the other half is larger. The 25th and
75th percentiles are also called the lower and upper quartiles, i.e., the quartiles divide the data into
four equally sized groups. Depending on the number of observations at hand, arbitrary quantiles
are not precisely defined. In such cases, a linearly interpolated value is used, for which the precise
interpolation weights depend on the software at hand. It is important to know about this potential
ambiguity; it is less important to know the exact values of the weights.
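The software dependence of the interpolation weights can be illustrated with the R function quantile(), which implements several interpolation types; a minimal sketch with hypothetical data:

x <- c(1, 2, 3, 4, 10)               # small hypothetical sample
quantile( x, probs=0.25)             # default interpolation (type 7)
quantile( x, probs=0.25, type=6)     # a different convention, used by some other software
quantile( x, probs=0.25, type=1)     # no interpolation: an order statistic is returned
median( x)                           # the 50th percentile is unambiguous for odd n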
Typical statistics for the spread (i.e., the dispersion of the data) include the sample variance,
sample standard deviation (square root of the sample variance), range (largest minus smallest
value), interquartile range (third quartile minus the first quartile), studentized range (range di-
vided by the standard deviation, representing the range of the sample measured in units of sample
standard deviations) and the coefficient of variation (sample standard deviation divided by the
sample mean). Note that the studentized range and the coefficient of variation are dimension-less
and should only be used with ratio scaled data.
We now introduce mathematical notation for several of these statistics. For a univariate
dataset the observations are written as x_1, . . . , x_n (or with some other Latin letter), with n being
the sample size. The ordered data (smallest to largest) is denoted by x_(1) ≤ · · · ≤ x_(n). Hence,
we use the following classical notation:

sample mean:  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,   (1.1)

sample median:  $\mathrm{med}(x_i) = \begin{cases} x_{(n/2+1/2)}, & \text{if } n \text{ odd}, \\ \frac{1}{2}\bigl(x_{(n/2)} + x_{(n/2+1)}\bigr), & \text{if } n \text{ even}, \end{cases}$   (1.2)

sample variance:  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$,   (1.3)

sample standard deviation:  $s = \sqrt{s^2}$.   (1.4)
If the context is clear, we may omit the quantifier "sample". Note that some books use the quantifier
"empirical" instead of "sample". The symbols for the sample mean, sample variance and sample
standard deviation are quite universal. However, this is not the case for the median. We use the
notation med(x_1, . . . , x_n), or med(x_i) if no ambiguities exist.
Example 1.4 (continued from Example 1.2). In R-Code 1.3 several summary statistics for 293
observations of mercury in lake Geneva sediments are calculated (see R-Code 1.1 to load the
data). Of course all standard descriptive statistics are available as predefined functions in R. ♣
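R-Code 1.3 is not reproduced in this extract; a minimal sketch of the statistics discussed so far, assuming the data frame lemanHg with column Hg from the sketch after Example 1.2:

mean( lemanHg$Hg)                   # sample mean, see (1.1)
mean( lemanHg$Hg, trim=0.1)         # trimmed mean, omitting 10% on each side
median( lemanHg$Hg)                 # sample median, see (1.2)
quantile( lemanHg$Hg)               # quartiles (and minimum/maximum)
var( lemanHg$Hg)                    # sample variance, see (1.3)
sd( lemanHg$Hg)                     # sample standard deviation, see (1.4)
diff( range( lemanHg$Hg))           # range: largest minus smallest value
IQR( lemanHg$Hg)                    # interquartile range
sd( lemanHg$Hg)/mean( lemanHg$Hg)   # coefficient of variation
summary( lemanHg$Hg)                # several of the above in one call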
For discrete data, the sample mode is the most frequent value of a sample frequency distribution;
in order to calculate the sample mode, only the operations {=, ≠} are necessary. Continuous
data are first divided into categories (by discretization or binning) and then the mode can be
determined.
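R has no dedicated function for the sample mode, but it is easily obtained from a frequency table; a minimal sketch with hypothetical discrete data:

x <- c(2, 3, 3, 5, 3, 2, 6)         # hypothetical discrete sample
tab <- table( x)                    # sample frequency distribution
names( tab)[ which.max( tab)]       # sample mode: the most frequent value (here "3")
# for continuous data, bin first, e.g., with cut(), and then proceed as above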
Another important aspect of EDA is the identification of outliers, which are defined (verbatim
from Olea, 1991): “In a sample, any of the few observations that are separated so far in value
from the remaining measurements that the questions arise whether they belong to a different
population, or that the sampling technique is faulty. The determination that an observation is
an outlier may be highly subjective, as there is no strict criteria for deciding what is and what is
not an outlier". Graphical representations of the data often help in the identification of outliers,
as we will see in the next section.
1.3 Visualizing Data

Graphical representations of data are referred to as graphics, or more specifically as diagrams,
charts, plots or graphs. The exact definition and meaning of the latter four terms is not universal,
but a rough characterization is as follows. A diagram is the most generic term and is essentially a
symbolic representation of information (Figure 1.1 being an example). A chart is a graphical
representation of data (a pie chart being an example). A plot is a graphical representation of data
in a coordinate system (we will discuss histograms, boxplots, scatterplots, and more below).
Finally, a graph is a "continuous" line representing data in a coordinate system (for example,
visualizing a function).
Remark 1.1. There are several fundamentally different approaches to creating plots in R: base
graphics (package graphics, which is automatically loaded upon starting R), trellis graphics
(packages lattice and latticeExtra), and the grammar of graphics approach (package
ggplot2). In this book we focus on base graphics. This approach is in sync with the R source
code style and gives us clear and direct handling of all elements. ggplot functionality may
produce seemingly fancy graphics at the price of certain black-box elements and a more complex
code structure. ♣
Example 1.5. R-Code 1.4 and Figure 1.2 illustrate bar plots with data giving aggregated CO2
emissions from different sources (transportation, electricity production, deforestation, . . . ) in the
year 2005, as presented by the SWISS Magazine 10/2011-01/2012, page 107 (SWISS Magazine,
2011) and also shown in Figure 1.11. Here and later we may use a numerical specification of the
colors which are, starting from one to eight, black, red, green, blue, cyan, magenta, yellow and
gray.
Note that the values of such emissions vary considerably according to different sources, mainly
due to the political and interest factors associated with these numbers. ♣
Histograms illustrate the frequency distribution of observations graphically and are easy to
construct and to interpret (for a specified partition of the x-axis, called bins, observed counts or
proportions are recorded). Histograms allow one to quickly assess whether the data is symmetric
or rather left-skewed (more smaller values on the left side of the bulk of the data) or right-skewed
(more larger values on the right side of the bulk of the data), whether the data is unimodal (has
one dominant peak) or has several peaks, and whether exceptional values are present. Several valid
rules of thumb exist for choosing the optimal number of bins. However, the number of bins is a
subjective choice that affects the look of the histogram, as illustrated in the next example.
R-Code 1.4 Representing emissions sources with barplots. (See Figure 1.2.)
dat <- c(2, 15, 16, 32, 25, 10) # see Figure 1.11
emissionsource <- c('Air', 'Transp', 'Manufac', 'Electr', 'Deforest', 'Other')
barplot( dat, names=emissionsource, ylab="Percent", las=2)
barplot( cbind('2005'=dat), col=c(2,3,4,5,6,7), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', xlim=c(0.2,4))
Figure 1.2: Bar plots: juxtaposed bars (left), stacked (right) of CO2 emissions ac-
cording to different sources taken from SWISS Magazine (2011). (See R-Code 1.4.)
Example 1.6 (continued from Examples 1.2 and 1.4). R-Code 1.5 and the associated Figure 1.3
illustrate the construction and resulting histograms of the mercury dataset. The different choices
illustrate that the number of bins needs to be carefully chosen. Often, the default values work
well, which is also the case here. In one of the histograms, a “smoothed density” has been
superimposed. Such curves will be helpful when comparing the data with different statistical
models, as we will see in later chapters. When adding smoothed densities, the histogram has
to represent the fraction of data in each bin (argument probability=TRUE) as opposed to the
frequency (the default).
The histograms in Figure 1.3 show that the data is unimodal and right-skewed, with no exceptional
values (no outliers). Important statistics like the mean and median can be added to the plot with
vertical lines. Due to the slight right-skewness, the mean is slightly larger than the median.
♣
When constructing histograms for discrete data (e.g., integer values), one has to be careful
with the binning. Often it is better to manually specify the bins. To represent the result of several
dice tosses, it would be advisable to use hist( x, breaks=seq( from=0.5, to=6.5, by=1)),
or possibly use a bar plot as explained above. A stem-and-leaf plot is similar to a histogram, in
which each individual observation is marked with a single digit (stem() in R). Although this plot
gives the most honest representation of the data, it is rarely used today, mainly because it does not
work for very large sample sizes. Figure 1.4 gives an example.
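A minimal sketch of the binning issue and the stem-and-leaf plot, based on hypothetical simulated dice tosses (simulation is discussed more formally in later chapters):

set.seed(1)                          # reproducible simulation
tosses <- sample( 1:6, size=100, replace=TRUE)      # 100 simulated dice tosses
hist( tosses)                        # default breaks merge some of the six values
hist( tosses, breaks=seq( from=0.5, to=6.5, by=1))  # one bin per face
stem( tosses)                        # stem-and-leaf display in the console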
R-Code 1.5 Different histograms (good and bad ones) for the ‘lemanHg’ dataset. (See
Figure 1.3.)
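The corresponding code is not reproduced in this extract; a minimal sketch consistent with the description and Figure 1.3 (the bin counts for the "bad" histograms are chosen for illustration; lemanHg$Hg as in the sketch after Example 1.2):

hist( lemanHg$Hg, probability=TRUE, main="", xlab="Hg")  # default bins, density scale
lines( density( lemanHg$Hg))                             # superimposed smoothed density
abline( v=mean( lemanHg$Hg), col=3, lty=2)               # mean: green, dashed
abline( v=median( lemanHg$Hg), col=2, lty=3)             # median: red, dotted
hist( lemanHg$Hg, breaks=100, main="", xlab="Hg")        # too many bins
hist( lemanHg$Hg, breaks=2, main="", xlab="Hg")          # too few bins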
Figure 1.3: Histograms of mercury data with various bin sizes. In the top left panel,
the smoothed density is in black, and the mean and median are shown as green dashed and red
dotted vertical lines, respectively. (See R-Code 1.5.)
A box plot summarizes the data by the median and the lower and upper quartiles (forming the
box); observations further away from the box than 1.5·IQR are marked individually. The closest
non-marked observations to these bounds are called the whiskers.
A violin plot combines the advantages of the box plot and a "smoothed" histogram by
essentially merging both. Compared to a box plot, a violin plot depicts possible multi-modality
and, for large datasets, de-emphasizes the marked observations outside the whiskers (often termed
"outliers", but see our discussion in Chapter 7).
Example 1.7 (continued from Examples 1.2, 1.4 and 1.6). R-Code 1.6 and Figure 1.5 illustrate
the box plot and violin plot for the mercury data. Due to the right-skewness of the data, there
are several data points beyond the upper whisker. Notice that the function boxplot() has several
arguments for tailoring the appearance of the box plots. These are discussed in the function's
help file. ♣
A quantile-quantile plot (QQ-plot) is used to visually compare the ordered sample, also called
sample quantiles, with the quantiles of a theoretical distribution. For the moment we think of
this theoretical distribution in form of a “smoothed density” similar to the superimposed line
in the top right panel of Figure 1.3. We will talk more about “theoretical distributions” in the
next two chapters. The theoretical quantiles can be thought of as the n values of a “perfect”
realization. If the points of the QQ-plot are aligned along a straight line, then there is a good
R-Code 1.6 Boxplot and violin plot for the ‘lemanHg’ dataset. (See Figure 1.5.)
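The corresponding code is not reproduced in this extract; a minimal sketch (the package vioplot is an assumption for the violin plot; lemanHg$Hg as before):

boxplot( lemanHg$Hg, ylab="Hg")      # box plot; points beyond the whiskers are marked
require( vioplot)                    # provides vioplot(); install.packages("vioplot") if needed
vioplot( lemanHg$Hg)                 # violin plot: box plot combined with a smoothed density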
Figure 1.5: Box plots and violin plot for the ‘lemanHg’ dataset. (See R-Code 1.6.)
match between the sample and theoretical quantiles. A deviation of the points indicates that the
sample has either too few or too many small or large values. Figure 1.6 illustrates a QQ-plot
based on nine data points (for ease of understanding) with quantiles from a symmetric (a)
and a skewed (b) distribution, respectively. The last panel shows the following cases with respect
to the reference quantiles: (c1) the sample has much smaller values than expected; (c2) the sample has
much larger values than expected; (c3) the sample does not have as many small values as expected;
(c4) the sample does not have as many large values as expected.
Notice that the QQ-plot is invariant with respect to changing the location or the scale of the
sample or of the theoretical quantiles. Hence, the omission of the scales in the figure.
Example 1.8 (continued from Examples 1.2 and 1.4 to 1.7). R-Code 1.7 and Figure 1.7 illustrate
a QQ-plot for the mercury dataset by comparing it to a "bell-shaped" and a right-skewed
distribution. The two distributions are called the normal or Gaussian distribution and the chi-squared
distribution and will be discussed in subsequent chapters. To "guide the eye", we have added a
line passing through the lower and upper quartiles. The right panel shows a surprisingly good
fit. ♣

Figure 1.6: (left) QQ-plots with a symmetric (a) and a right-skewed (b) theoretical
distribution. The histogram of the data and the theoretical density are shown in the
margins in green. (c): schematic illustration of the different types of deviations. See
text for an interpretation.
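R-Code 1.7 is not reproduced in this extract; a minimal sketch of the two QQ-plots (qqnorm(), qqline(), qqplot() and qchisq() are standard R functions; the guide line for the right panel is omitted here):

qqnorm( lemanHg$Hg, ylab="Hg")       # compare sample quantiles with normal quantiles
qqline( lemanHg$Hg, col=2)           # guide line through the lower and upper quartiles
n <- length( lemanHg$Hg)
qqplot( qchisq( ppoints( n), df=5), lemanHg$Hg,   # right-skewed reference: chi-squared
        xlab="Theoretical quantiles", ylab="Hg")  # with five degrees of freedom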
In a scatterplot, "guide-the-eye" lines are often included. In such situations, some care is
needed, as there is an asymmetry between y versus x and x versus y. We will
discuss this further in Chapter 9.
Figure 1.7: QQ-plots using the normal distribution (left) and a so-called chi-squared
distribution with five degrees of freedom (right). The red line passes through the lower
and upper quartiles of both the sample and the theoretical distribution. (See R-Code 1.7.)

In the case of several frequency distributions, bar plots, either stacked or grouped, may
also be used in an intuitive way. See R-Code 1.8 and Figure 1.8 for two slightly different
partitions of CO2 emission sources, based on SWISS Magazine (2011) and
www.c2es.org/facts-figures/international-emissions/sector (approximate values).
R-Code 1.8 Bar plots for emissions by sectors for the year 2005 from two sources. (See
Figure 1.8.)
dat2 <- c(2, 10, 12, 28, 26, 22) # source c2es.org
mat <- cbind( SWISS=dat, c2es.org=dat2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(0.2,5), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', las=2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(1,30), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', beside=TRUE, las=2)
Figure 1.8: CO2 emissions by sectors visualized with bar plots for the year 2005 from
two sources (left: stacked, right: grouped). (See R-Code 1.8.)
Example 1.9. In this example we look at a more complex dataset containing several measurements
from different penguin species collected over three years on three different islands. The
package palmerpenguins provides the CSV file penguins.csv in the package directory
extdata. The dataset provides bill length, bill depth, flipper length (all in millimeters) and
body mass (in grams) of the three penguin species Adelie, Chinstrap and Gentoo, next to their habitat
island, sex and the year in which the penguin was sampled (Horst et al., 2020).
R-Code 1.9 loads the data, performs a short EDA and provides the code for some representative
figures. We deliberately use more arguments for several plotting functions to highlight
different features of the data or of the R functions. The eight variables are on the nominal (e.g.,
species), interval (year) and real scale (e.g., bill_length_mm). Two of the measurements
have been rounded to integers (flipper_length_mm and body_mass_g). For two out of
the 344 penguins we do not have any measurements, and sex is missing for nine additional penguins
(summary(penguins$sex)). Due to their natural habitat, not all penguin species have been observed
on all three islands, as shown by the cross-tabulation with table().
Figure 1.9 shows some representative graphics. The first two panels show that, overall,
flipper length is bimodal with a more pronounced first mode. When we partition according to
species, we notice that the first mode corresponds to the species Adelie and Chinstrap, the latter
being slightly larger. Comparing the second mode with the violin plot of the species Gentoo (middle panel),
we notice that some details in the violin plot are "smoothed" out.
The third panel shows a so-called mosaic plot, where we represent a two-dimensional frequency
table, specifically a visualization of the aforementioned table(...) call. In such
a mosaic plot, the width of each vertical band corresponds to the proportion of penguins on
the corresponding island. Each band is then divided according to the species proportions. For
example, we quickly see that Adelie is at home on all three islands and that on Torgersen island
the smallest sample has been taken. Empty cells are indicated with dashed lines. A mosaic plot
gives a qualitative assessment only, and the areas of the rectangles are harder to compare than the
corresponding numbers of the equivalent table() output.
The final plot is a scatterplot matrix of the four length measures. Figure 1.9 shows that for some
variables specific species cluster well (e.g., Gentoo with flipper length), whereas other variables
are less separated (e.g., Adelie and Chinstrap with bill depth). To quickly increase the information
content, we use different colors or plotting symbols according to the nominal variables. With
the argument row1attop=FALSE we obtain a fully symmetric representation that allows quick
comparisons between both sides of the diagonal. Here, with a black-red coloring, we see that
male penguins are typically larger. In fact, there is even a pretty good separation between
the sexes. Some care is needed so that the annotations are not overdone: unless there is evident
clustering, more than three different symbols or colors are often too much. R-Code 1.9 shows how
to redefine the different panels of a scatterplot (see help(pairs) for a more professional panel
definition). By properly specifying the diag.panel argument, it is possible to add charts or plots
to the diagonal panels of pairs(), e.g., the histograms of the variables. ♣
R-Code 1.9: Constructing histograms, box plots, violin plots and a scatterplot matrix with the
penguin data. (See Figure 1.9.)
require(palmerpenguins)
penguins <- read.csv( # `path.package()` provides local location of pkg
paste0( path.package('palmerpenguins'),'/extdata/penguins.csv'))
str(penguins, strict.width='cut')
## 'data.frame': 344 obs. of 8 variables:
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgers"..
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42..
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2..
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 42..
## $ sex : chr "male" "female" "female" NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 ..
summary(penguins[, c(3:6)]) # other variables can be summarized by `table()`
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Min. :32.1 Min. :13.1 Min. :172 Min. :2700
## 1st Qu.:39.2 1st Qu.:15.6 1st Qu.:190 1st Qu.:3550
## Median :44.5 Median :17.3 Median :197 Median :4050
## Mean :43.9 Mean :17.2 Mean :201 Mean :4202
## 3rd Qu.:48.5 3rd Qu.:18.7 3rd Qu.:213 3rd Qu.:4750
## Max. :59.6 Max. :21.5 Max. :231 Max. :6300
## NA's :2 NA's :2 NA's :2 NA's :2
#penguins <- na.omit(penguins) # remove NA's
table(penguins[,c(2,1)]) # tabulating species on each island
## species
## island Adelie Chinstrap Gentoo
## Biscoe 44 0 124
## Dream 56 68 0
## Torgersen 52 0 0
hist( penguins$flipper_length_mm, main='', xlab="Flipper length [mm]", col=7)
box() # rectangle around for similar view with others
require(vioplot)  # provides vioplot(); assumed installed, install.packages("vioplot") otherwise
with(penguins, vioplot(flipper_length_mm[species=="Gentoo"],
flipper_length_mm[species=="Chinstrap"], flipper_length_mm[
species=="Adelie"], names=c("Gentoo", "Chinstrap", "Adelie"),
col=5:3, xlab="Flipper length [mm]", horizontal=TRUE, las=1))
upper.panel <- function( x,y, ...) # see `?pairs` for a better way
points( x, y, col=as.numeric(as.factor(penguins$species))+2,...)
lower.panel <- function( x,y, ...)
points( x, y, col=as.numeric(as.factor(penguins$sex)))
pairs( penguins[,c(3:6)], gap=0, row1attop=FALSE, # scatterplot
lower.panel=lower.panel, upper.panel=upper.panel)
# pch=as.numeric(as.factor(penguins$island)) # clutters & is not helpful
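The mosaic plot in the top right panel of Figure 1.9 is not part of the code excerpt above; a minimal sketch using the base R function mosaicplot() (the colors are chosen to mimic the figure caption):

mosaicplot( table( penguins[, c(2, 1)]), col=3:5, main="")  # bands: islands; subdivisions: species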
Figure 1.9: Visualization of the palmerpenguins data. Top, left to right: histogram of the
variable flipper length; violin plots of flipper length stratified according to species; and a
mosaic plot of species on each island (green: Adelie; blue: Chinstrap; cyan: Gentoo).
Bottom: scatterplot matrix with colors for species (above the diagonal, colors as above)
and sex (below the diagonal, black: female, red: male). (See R-Code 1.9.)
For a dozen or more variables, scatterplots are no longer helpful as there are too many
panels and nontrivial underlying structures are hardly visible. Parallel coordinate plots are a
popular way of representing observations in high dimensions. Each variable is recorded along
a vertical axis and the values of each observation are then connected with a line across the
various variables. That means that points in the usual (Euclidean) representation correspond to
lines in a parallel coordinate plot. In a classical version of the plot, all interval scaled variables
are normalized to [0, 1]. Additionally, the inclusion of nominal and ordinal variables and their
comparison with interval scaled variables is possible. The natural order of the variables in such
a plot may not be optimal for an intuitive interpretation.
Example 1.10. The dataset swiss (provided by the package datasets) contains 6 variables
(a standardized fertility measure and socio-economic indicators) for 47 French-speaking provinces
of Switzerland in about 1888.
R-Code 1.10 and Figure 1.10 give an example of a parallel coordinate plot. The lines are
colored according to the percentage of practicing Catholics (the alternative being Protestant).
Groups can be quickly detected and strong associations are spotted directly. For example,
provinces with more Catholics have a higher fertility rate or lower values on examination (indicated
by the color grouping). Provinces with higher fertility also have higher values in agriculture (the lines
are in general parallel), whereas higher agriculture is linked to lower examination (the lines typically
cross). We will revisit such associations in Chapter 9. ♣
R-Code 1.10 Parallel coordinate plot for the swiss dataset. (See Figure 1.10.)
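R-Code 1.10 is not reproduced in this extract; a minimal sketch using parcoord() from the package MASS (the 40% and 60% cut points follow the caption of Figure 1.10):

require( MASS)                       # provides parcoord()
colgrp <- cut( swiss$Catholic, breaks=c(0, 40, 60, 100), labels=FALSE)  # three groups
parcoord( swiss, col=colgrp)         # one line per province, variables rescaled to [0, 1]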
Figure 1.10: Parallel coordinate plot of the swiss dataset. The provinces (lines) are
colored according to the percentage of Catholics (black: less than 40%, red: between 40% and
60%, green: above 60%). (See R-Code 1.10.)
An alternative way to represent high-dimensional data is to project the data to two or three
dimensions, which can be represented with classical visualization tools. The basic idea of
projection pursuit is to find a projection which highlights an interesting structure of the dataset
(Friedman and Tukey, 1974). These projections are often varied and plotted continuously to find
as many structures as possible; see Problem 1.7 for a guided example.
1.4 Constructing Good Graphics

• Construct honest graphs without hiding facts. Show all data, not omitting some observations
or hiding information through aggregation. In R, it is possible to construct a boxplot
based on three values, but such a representation would suggest to the reader the availability
of a larger amount of data, thus hiding the fact that only three values are present.
• Construct graphs that are not suggestive. A classical deceptive example is to choose the
scale such that small variations are emphasized (zooming in) or the variations are obscured
by the large scale (zooming out). See the bottom panel of Figure 1.11, where starting the
y-axis at 3 suggests a stronger decrease.
• Use appropriate and unambiguous labels with units clearly indicated. For presentations
and publications the labels have to be legible. See top left panel of Figure 1.11 where we
do not know the units.
• To compare quantities, one should directly represent ratios, differences or relative differ-
ences. When different panels need to be compared, they should have the same plotting
range.
• Carefully choose colors, hues, line widths and types, and symbols. These need to be explained in
the caption or a legend. Certain colors are more dominant or are directly associated with
emotions. Be aware of color blindness.
• Never use three-dimensional renderings of lines, bars etc. The human eye can quickly
compare lengths but not angles, areas, or volumes.
• Never change scales mid axis. If absolutely necessary a second scale can be added on the
right or on the top.
Of course there are situations where it makes sense to deviate from individual bullets of the
list above. In this book, we carefully design our graphics, but in order to keep the R-Code to a
reasonable length, we bend the above rules quite a bit.
From a good-scientific-practice point of view, we recommend that figures be constructed
such that they are "reproducible". To do so:
• create figures based on a script. The figures in this book can be reconstructed by
sourcing the dedicated R scripts;
• do not post-process figures with graphics editors (e.g., with PhotoShop or ImageMagick);
• as some graphics routines are based on random numbers, initiate the random number
generator before the construction (use set.seed(), see also Chapter 2).
Do not use pie charts unless absolutely necessary. Pie charts are often difficult to read. When
slices are similar in size, it is nearly impossible to distinguish which is larger. Bar plots allow an
easier comparison; compare Figure 1.11 with either panel of Figure 1.2.
Figures 1.12 and 1.11 are examples of badly designed charts and plots. The top panel of
Figure 1.12 contains an unnecessary 3-D rendering. The lower panel of the figure is still suboptimal:
because of the shading, not all information is visible. Depending on the intended message, a plot
instead of a graph would be more adequate.
Bad graphics can be found everywhere, including in scientific journals. Figure 1.13 is a snapshot
of the webpage www.biostat.wisc.edu/~kbroman/topten_worstgraphs/, which includes
the issues of the figures and a discussion of possible improvements.
Figure 1.12: Bad example (above) and improved but still not ideal graphic (below).
Figures from university documents.
Figure 1.13: Examples of bad plots in scientific journals. The figure is taken
from www.biostat.wisc.edu/~kbroman/topten_worstgraphs/. The website discusses
the problems with each graph and possible improvements ('[Discussion]' links).

1.5 Bibliographic Remarks

Many consider John W. Tukey to be the founder and promoter of exploratory data analysis.
Thus his EDA book (Tukey, 1977) is often seen as the (first) authoritative text on the subject.
In a series of books, Tufte rigorously yet vividly explains all relevant elements of visualization
and displaying information (Tufte, 1983, 1990, 1997b,a). Many university programs offer lectures
on information visualization or similar topics. The lecture by Ross Ihaka is one example worth
mentioning: www.stat.auckland.ac.nz/~ihaka/120/lectures.html.
Friendly and Denis (2001) give an extensive historical overview of the evolution of cartography,
graphics and visualization. The document at euclid.psych.yorku.ca/SCS/Gallery/milestone/milestone.pdf
has active links for virtually endless browsing. See also the applet at www.datavis.ca/milestones/.
There are many interesting videos available illustrating good and not-so-good graphics, for
example www.youtube.com/watch?v=ajUcjR0ma4c. The websites
qz.com/580859/the-most-misleading-charts-of-2015-fixed/ and www.statisticshowto.com/misleading-graphs/
illustrate and discuss misleading graphics.
1.6 Exercises and Problems

a) R has many built-in datasets; one example is volcano. Based on the help of the dataset,
what is the name of the volcano? Describe the dataset in a few words.
b) Use the R help function to get information on how to use the image() function for plotting
matrices. Display the volcano data.
c) Install the package fields. Display the volcano data with the function image.plot().
What is the maximum height of the volcano?
d) Use the R help function to find out the purpose of the function demo() and have a look
at the list of available demos. The demo of the function persp() utilizes the volcano data
to illustrate basic three-dimensional plotting. Call the demo and have a look at the plots.
a) Construct a boxplot and a QQ-plot of the moose and wolf data. Give a short interpretation.
b) Jointly visualize the wolves and moose data, as well as their abundances over the years.
Give a short interpretation of what you see in the figures. (Of course you may compare
the result with what is given on the aforementioned web page).
Problem 1.3 (EDA of trivariate data) Perform an EDA of the dataset www.math.uzh.ch/
furrer/download/sta120/lemanHgCdZn.csv, containing mercury, cadmium and zinc content in
sediment samples taken in lake Geneva.
Problem 1.4 (EDA of multivariate data) In this problem we want to explore the classical
mtcars dataset (directly available through the package datasets). Perform an EDA thereof and
provide at least three meaningful plots (as part of the EDA) and a short description of what
they display. In what measurement scale are the variables stored and what would be the natural
or original measurement scale?
Problem 1.5 (Beyond Example 1.9) In this problem we extend R-Code 1.9 and create further
graphics summarizing the palmerpenguins dataset.
b) Add the marginal histograms to the diagonal panels of the scatterplot matrix in Figure 1.9.
c) In the panels of Figure 1.9, only one additional variable is used for annotation. With the
upper- and lower-diagonal plots, it is possible to add two variables. Convince yourself that
additionally adding year or island hinders interpretability rather than helping. How
can the information about island and year be summarized or visualized?
Problem 1.6 (Parallel coordinate plot) Construct a parallel coordinate plot using the built-in
dataset state.x77. In the left and right margins, annotate the states. Give a few interpretations
that can be derived from the plot.
Problem 1.7 (Feature detection with ggobi) The open source visualization program ggobi may
be used to explore high-dimensional data (Swayne et al., 2003). It provides highly dynamic and
interactive graphics such as tours, as well as familiar graphics such as scatterplots, bar charts and
parallel coordinates plots. All plots are interactive and linked with brushing and identification.
The package rggobi provides a link to R.
a) Install the required software ggobi and R package rggobi and explore the tool with the
dataset state.x77.
Problem 1.8 (BMJ Endgame) Discuss and justify the statements about ‘Skewed distributions’
given in doi.org/10.1136/bmj.c6276.
Chapter 2
Random Variables
⋄ Pass from a cdf to a quantile function, pdf or pmf and vice versa
⋄ Give the definition and intuition of an expected value (E) and variance (Var), and
know the basic properties of E and Var used for calculations
Probability theory is the prime tool of all statistical modeling. Hence we need a minimal
understanding of the theory of probability in order to understand statistical models, their
interpretation, and so forth. Probability theory could (or rather should?) be covered
in entire books. Hence, we boil the material down to the bare minimum used in subsequent
chapters, and thus, to a more experienced reader, there may seem to be many gaps in this chapter.
Example 2.1. Tossing two coins results in a sample space {HH, HT, TH, TT}. The event “at
least one head” {HH, HT, TH} consists of three elementary events. ♣
A probability measure is a function P : Ω → [0, 1] that assigns to an event A of the sample
space Ω a value in the interval [0, 1], that is, P(A) ∈ [0, 1], for all A ⊂ Ω. Such a function cannot
be chosen arbitrarily and has to obey certain rules that are required for consistency. For our
purpose, it is sufficient to link these requirements to Kolmogorov's axioms. More precisely, a
probability function must satisfy the following axioms:

1. P(A) ≥ 0, for all A ⊂ Ω,
2. P(Ω) = 1,
3. P( ⋃i Ai ) = Σi P(Ai), for Ai ∩ Aj = ∅, i ≠ j.

In the last bullet, we only specify the index without indicating start and end, which means that we
sum over all possible indices i, say from i = 1 to n, where n may be finite or countably infinite
(similarly for the union).
Summarizing informally, a probability is a function that assigns a value to each event of the
sample space, subject to the following constraints:
1. the probability of an event is never smaller than zero or greater than one,
2. the probability of the entire sample space is one,
3. the probability of several events is equal to the sum of the individual probabilities, if the
events are mutually exclusive.
Probabilities are often visualized with Euler diagrams (Figure 2.1), which clearly and intuitively
illustrate consequences of Kolmogorov's axioms, such as the addition rule
P(A ∪ B) = P(A) + P(B) − P(A ∩ B), the conditional probability P(A | B) = P(A ∩ B)/P(B),
and the law of total probability.
For the second-to-last statement, we require that P(B) > 0, and conditioning is essentially
equivalent to reducing the sample space from Ω to B. The last statement can be written for an
arbitrary number of events Bi with Bi ∩ Bj = ∅, i ≠ j, and ⋃i Bi = Ω, yielding
P(A) = Σi P(A | Bi) P(Bi).
Figure 2.1: Euler diagram where events are illustrated with ellipses. The magenta
area represents A ∩ B and the event C is in the event B, C ⊂ B.
The following definition introduces a random variable and gives a (unique) characterization
of random variables. In subsequent sections, we will see additional characterizations. These,
however, will depend on the type of values the random variable takes.

Definition 2.1. A random variable X is a variable whose value depends on the outcome of a
random experiment. It is characterized by its cumulative distribution function (cdf)

F_X(x) = P(X ≤ x),

for all x ∈ R. ♢
Random variables are denoted by uppercase letters (e.g., X, Y), while realizations (i.e.,
their outcomes, the observations of an experiment) are denoted by the corresponding lowercase
letters (x, y). This means that theoretical "concepts" are denoted by uppercase letters; actual
values or data, for example the columns in your dataset, are denoted by lowercase letters.
The first two of the following statements are a direct consequence of the monotonicity of a
probability measure and the probability of the empty set/sample space. The last one requires a bit
more work to show, but will become intuitive shortly.

Property 2.1. Let F_X(x) be the cdf of a random variable X. Then:
1. F_X(x) is monotonically non-decreasing;
2. F_X(x) tends to 0 as x → −∞ and to 1 as x → ∞;
3. F_X(x) is right-continuous.
Remark 2.1. In a more formal treatise, one typically introduces a probability space, being a
triplet (Ω, F, P), consisting of a sample space Ω, a σ-algebra F (i.e., a collection of subsets of Ω
including Ω itself, which is closed under complement and under countable unions) and a proba-
bility measure P. A random variable on a probability space is a measurable function from Ω to
the real numbers: X : Ω → R, ω 7→ X(ω). To indicate its dependence on elementary events, one
often writes the argument explicitly, e.g., P(X(ω) ≤ x). ♣
Example 2.2. Let X be the sum of the roll of two dice. The random variable X assumes
the values 2, 3, . . . , 12, with probabilities 1/36, 2/36, . . . , 5/36, 6/36, 5/36, . . . , 1/36. Hence, for
example, P(X ≤ 4) = 1/6, P(X < 2) = 0, P(X < 12.2) = P(X ≤ 12) = 1. The left panel
of Figure 2.2 illustrates the distribution function. This distribution function (as for all discrete
random variables) is piece-wise constant with jumps equal to the probability of that value. ♣
Example 2.3. A boy practices free throws, i.e., foul shots to the basket standing at a distance
of 15 ft to the board. Each shot is either a success or a failure, which can be coded as 1/0. He
counts the number of attempts it takes until he has a successful shot. We let the random variable
X be the number of throws that are necessary until the boy succeeds. Theoretically, there is no
upper bound on this number. Hence X can take the values 1, 2, . . . . ♣
Next to the cdf, another way of describing discrete random variables is the probability mass
function, defined as follows.
Definition 2.2. The probability mass function (pmf) of a discrete random variable X is defined
by fX (x) = P(X = x). ♢
In other words, the pmf gives the probability that the random variable takes a precise single value, whereas, as seen, the cdf gives the probability that the random variable takes that or any smaller value.
Figure 2.2 illustrates the cumulative distribution and the probability mass function of the
random variable X as given in Example 2.2. The jump locations and sizes (discontinuities) of
the cdf correspond to probabilities given in the right panel. Notice that we have emphasized the
right continuity of the cdf (see Proposition 2.1.3) with the additional dot.
It is important to note that we have a theoretical construct here. When tossing two dice
several times and reporting the frequencies of their sum, the corresponding plot (bar plot or
histogram with appropriate bins) does not exactly match the right panel of Figure 2.2. The
more tosses we take, the better the match becomes, as we will discuss further in the next chapter.
R-Code 2.1 Cdf and pmf of X as given in Example 2.2. (See Figure 2.2.)
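A minimal sketch in R of what such a listing could compute (not necessarily the original code; the probabilities are built directly from the 36 equally likely outcomes):

x <- 2:12
p <- (6 - abs(x - 7)) / 36        # 1/36, 2/36, ..., 6/36, ..., 1/36
Fx <- cumsum(p)                   # cdf values at the jump locations
par(mfrow = c(1, 2))
plot(x, Fx, type = "s", xlab = "x", ylab = "FX(x)")
points(x, Fx, pch = 19)           # emphasize the right continuity
plot(x, p, type = "h", xlab = "xi", ylab = "pi = fX(xi)")
points(x, p, pch = 19)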
Figure 2.2: Cumulative distribution function (left) and probability mass function (right) of X = "the sum of the roll of two dice". The two y-axes have a different scale. (See R-Code 2.1.)
Property 2.2. Let X be a discrete random variable with probability mass function fX (x) and
cumulative distribution function FX (x). Then:
Points 3 and 4 of the property show that there is a one-to-one relation (also called a bijection)
between the cumulative distribution function and probability mass function. Given one, we can
construct the other.
Example 2.4. A single free throw (see also Example 2.3) can be modeled as a Bernoulli experiment with success probability p. In practice, repeated throws might well affect the success probability. For simplicity, one often keeps the probability constant nevertheless. ♣
Example 2.5. In this example we visualize a particular binomial random variable including a
hypothetical experiment. Suppose we draw with replacement 12 times one card from a deck of
36. The probability of having a face card in a single draw is thus 3/9. If X denotes the total
number of face cards drawn, we model X ∼ Bin(12, 1/3). To calculate the probability mass or
cumulative distribution function, R provides the construct of prefix and variate. Here, the latter
is binom. The prefixes are d for the probability mass function, p for the cumulative distribution
function, both illustrated in Figure 2.3.
When I performed the experiment, I drew three face cards; my son only two. Instead of asking more people to repeat the experiment, we ask R to do so, using the prefix r with the variate binom. (This also implies that the deck is well shuffled and everything is added up
correctly). Figure 2.3 shows the counts (left) for 10 and frequencies (right) for 100 experiments,
i.e., realizations of the random variable. The larger the number of realizations, the closer the
match to the probability mass function in the top left corner (again, further discussed in the
next chapter). For example, P (X = 5) = 0.19, whereas 24 out of the 100 realizations “observed”
5 face cards and 24/100 = 0.24 is much closer to 0.19, compared to 1/10 = 0.1, when only ten
realizations are considered.
We have initialized the random number generator of R with a function call set.seed() to obtain "repeatable" or reproducible results. ♣
R-Code 2.2 Density and distribution function of a Binomial random variable. (See Fig-
ure 2.3.)
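A possible sketch of such a listing, with rbinom() used for the simulated "experiments" (the seed value is arbitrary):

k <- 0:12
dbinom(k, size = 12, prob = 1/3)                  # probability mass function
pbinom(k, size = 12, prob = 1/3)                  # cumulative distribution function
set.seed(1)
table(rbinom(10, size = 12, prob = 1/3))          # counts for 10 repetitions
table(rbinom(100, size = 12, prob = 1/3)) / 100   # frequencies for 100 repetitions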
A random variable X with probability mass function
P(X = k) = λ^k/k! · exp(−λ),   λ > 0,  k = 0, 1, . . . ,    (2.10)
is said to follow a Poisson distribution with parameter λ, denoted by X ∼ Pois(λ). ♢
The Poisson distribution is also a good approximation for the binomial distribution with large
n and small p; as a rule of thumb if n > 20 and np < 10 (see Problem 3.3.a).
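As a quick numerical illustration of this rule of thumb (the values n = 100 and p = 0.05 are chosen here for illustration only):

k <- 0:15
round(rbind(binom = dbinom(k, size = 100, prob = 0.05),
            pois  = dpois(k, lambda = 100 * 0.05)), 3)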
Figure 2.3: Top row: probability mass function and cumulative distribution function of the binomial random variable X ∼ Bin(12, 1/3). Bottom row: observed counts for 10 repetitions and observed frequencies for 100 repetitions of the experiment. (See R-Code 2.2.)

Example 2.6. Seismic activity is quite frequent in Switzerland; luckily, most events are not of high magnitude. The webpage ecos09.seismo.ethz.ch provides a portal to download information about earthquakes in the vicinity of Switzerland, along with several other variables. From this page we have manually aggregated the number of earthquakes with a magnitude exceeding 4 between 1980 and 2005 (Richter magnitude, ML scale). Figure 2.4 (based on R-Code 2.3) shows a histogram of the data with the superimposed probability mass function of a Poisson random variable with λ = 2. There is a surprisingly good fit. There are slightly too few years with a single event, which is offset by a few too many with zero or two events. The value of λ was chosen to match the data visually. In later chapters we see approaches to determine the best possible value (which would be λ = 1.92 here). ♣
R-Code 2.3 Number of earthquakes and Poisson random variable. (See Figure 2.4.)
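Assuming the aggregated yearly counts are stored in a vector mag4, a sketch in the spirit of such a listing could be:

hist(mag4, breaks = seq(-0.5, 8.5, by = 1), probability = TRUE,
     main = "", xlab = "mag4")               # mag4: yearly counts, assumed given
points(0:8, dpois(0:8, lambda = 2), pch = 19, col = 2)   # superimpose Pois(2)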
Figure 2.4: Number of earthquakes with a magnitude exceeding 4 between 1980 and 2005 as barplot with superimposed probabilities of the random variable X ∼ Pois(2). (See R-Code 2.3.)
Definition 2.4. The probability density function (density function, pdf) fX (x), or density for
short, of a continuous random variable X is defined by
P(a < X ≤ b) = ∫_a^b fX(x) dx,   a < b.    (2.11)
The density function does not directly give a probability and thus cannot be compared to the probability mass function. The following properties are nevertheless similar to Property 2.2.
Property 2.3. Let X be a continuous random variable with density function fX (x) and distri-
bution function FX (x). Then:
1. The density function satisfies fX (x) ≥ 0 for all x ∈ R and fX (x) is continuous almost
everywhere.
2. ∫_{−∞}^{∞} fX(x) dx = 1.
3. FX(x) = ∫_{−∞}^{x} fX(y) dy.
4. fX(x) = FX′(x) = d/dx FX(x).
5. The cumulative distribution function FX (x) is continuous everywhere.
6. P(X = x) = 0.
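These properties can be checked numerically in R, for example for the standard normal density:

integrate(dnorm, lower = -Inf, upper = Inf)         # point 2: total mass equals one
integrate(dnorm, lower = -Inf, upper = 1.5)$value   # point 3
pnorm(1.5)                                          # equals the integral above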
Example 2.7. The continuous uniform distribution U(a, b) is defined by a constant density
function over the interval [a, b], a < b, i.e., f (x) = 1/(b − a), if a ≤ x ≤ b, and f (x) = 0,
otherwise. Figure 2.5 shows the density and cumulative distribution function of the uniform
distribution U(0, 1) (see also Problem 1.6).
The distribution function is continuous everywhere and the density has only two discontinu-
ities at a and b. ♣
R-Code 2.4 Density and distribution function of a uniform distribution. (See Figure 2.5.)
Figure 2.5: Density and distribution function of the uniform distribution U(0, 1). (See R-Code 2.4.)
As given by Property 2.3.3 and 4, there is again a bijection between the density function and
the cumulative distribution function: if we know one we can construct the other. Actually, there
is a third characterization of random variables, called the quantile function, which is essentially
the inverse of the cdf. That means, we are interested in values x for which FX (x) = p.
Definition 2.5. The quantile function QX(p) of a random variable X with (strictly) monotone cumulative distribution function FX(x) is defined by QX(p) = FX^{−1}(p), for 0 < p < 1, i.e., the quantile function is equivalent to the inverse of the distribution function. ♢
In R, the quantile function is specified with the prefix q and the corresponding variate. For
example, qunif is the quantile function for the uniform distribution, which is QX (p) = a+p(b−a)
for 0 < p < 1.
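A quick check of this formula, here with the (arbitrarily chosen) interval [2, 5]:

p <- c(0.25, 0.5, 0.75)
qunif(p, min = 2, max = 5)   # equals a + p * (b - a)
2 + p * (5 - 2)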
The quantile function can be used to define the theoretical counterparts of the sample quartiles of Chapter 1, as illustrated next.
Definition 2.6. The median ν of a continuous random variable X with cumulative distribution
function FX (x) is defined by ν = QX (1/2). Accordingly, the lower and upper quartiles of X are
QX (1/4) and QX (3/4). ♢
The quantile function is essential for the theoretical quantiles of QQ-plots (Section 1.3.1). For
example, the two left-most panels of Figure 1.6 are based essentially on the i/(n + 1)-quantiles.
Remark 2.2. Instead of the simple i/(n + 1)-quantiles, R uses the form (i − a)/(n + 1 − 2a), for a specific a ∈ [0, 1]. The precise value can be specified using the argument type in quantile(), qtype in qqline(), or a in ppoints(). Of course, different definitions of quantiles only play a role for very small sample sizes and in general we should not worry about which approach has been taken. ♣
Remark 2.3. For discrete random variables the cdf is not continuous (see the plateaus in the left panel of Figure 2.2) and the inverse does not exist. The quantile function then returns the smallest value x among all those with p ≤ P(X ≤ x) = FX(x); more formally, QX(p) = inf{x ∈ R : FX(x) ≥ p}, where for our level of rigor here, the inf can be read as min. ♣
Remark 2.4. Mathematically, it is possible that the expectation is not finite: the random
variable X may take very, very large values too often. In such cases we would say that the
expectation does not exist. We see a single example in Chapter 8 where this is the case. In all
other situations we assume a finite expectation and for simplicity, we do not explicitly state this.
♣
Many other "summary" values reduce to calculating a particular expectation. The following property states how to calculate the expectation of a function of the random variable X, which is in turn used to summarize the spread of X.
We often take g(x) = x^k, k = 2, 3, . . . , and we call E(X^k) the higher moments of X.
Definition 2.8. The variance of X is the expectation of the squared deviation from its expected value:
Var(X) = E( (X − E(X))² ),    (2.16)
and is also denoted as the centered second moment, in contrast to the second moment E(X²). The standard deviation of X is SD(X) = √Var(X). ♢
The expectation is “linked” to the average (or sample mean, mean()) if we have a set of
realizations thought to be from the particular random variable. Similarly, the variance is “linked”
to the sample variance (var()). This link will be formalized in later chapters.
E(X) = 0 · (1 − p) + 1 · p = p, (2.17)
Var(X) = (0 − p)2 · (1 − p) + (1 − p)2 · p = p(1 − p). (2.18)
2. The expectation and variance of a Poisson random variable are E(X) = λ and Var(X) = λ (see Problem 3.3.a).
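A quick numerical check of the Poisson statement (λ = 4 is an arbitrary choice):

set.seed(3)
x <- rpois(100000, lambda = 4)
c(mean(x), var(x))   # both close to lambda = 4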
Property 2.5. For random variables X and Y , regardless of whether discrete or continuous,
and for a and b given constants, we have
1. Var(X) = E(X²) − E(X)²;
The second-to-last property may seem somewhat surprising. But starting from the definition of the variance, one quickly realizes that the variance is not a linear operator:
Var(a + bX) = E( (a + bX − E(a + bX))² ) = E( (a + bX − (a + b E(X)))² ),    (2.20)
followed by a factorization of b².
Example 2.9 (continuation of Example 2.2). For X the sum of the roll of two dice, straightforward calculation shows that
E(X) = Σ_{i=2}^{12} i · P(X = i) = 7,   by equation (2.14),    (2.21)
     = 2 · Σ_{i=1}^{6} i · 1/6 = 2 · 7/2,   by using Property 2.5.4 first. ♣    (2.22)
As a small note, recall that the first moment is a measure of location, the centered second
moment a measure of spread. The centered third moment is a measure of asymmetry and can
be used to quantify the skewness of a distribution. The centered fourth moment is a measure of the heaviness of the tails of the distribution.
Definition 2.9. The random variable X is said to be normally distributed if the cumulative distribution function is given by
FX(x) = ∫_{−∞}^{x} fX(y) dy    (2.23)
with density function
f(x) = fX(x) = 1/√(2πσ²) · exp( −(1/2) · (x − µ)²/σ² ),    (2.24)
with parameters µ and σ² > 0; we write X ∼ N(µ, σ²). ♢
While the exact form of the density (2.24) is not important, being able to recognize it will be very useful. In particular, for a standard normal random variable, the density is proportional to exp(−z²/2). Figure 2.7 gives the density, distribution and quantile function of a standard normally distributed random variable.
The following property is essentially a rewriting of the second part of the definition. We will
state it explicitly because of its importance.
Property 2.7. Let X ∼ N(µ, σ²); then (X − µ)/σ ∼ N(0, 1) and FX(x) = Φ((x − µ)/σ). Conversely, if Z ∼ N(0, 1), then σZ + µ ∼ N(µ, σ²), σ > 0.
The cumulative distribution function Φ has no closed form and the corresponding proba-
bilities must be determined numerically. In the past, so-called “standard tables” summarized
probabilities and were included in statistics books. Table 2.1 gives an excerpt of such a table.
Now even “simple” pocket calculators have the corresponding functions to calculate the proba-
bilities. It is probably worthwhile to remember 84% = Φ(1), 98% = Φ(2), 100% ≈ Φ(3), as well
as 95% = Φ(1.64) and 97.5% = Φ(1.96). Relevant quantiles have been illustrated in Figure 2.6 for a standard normal random variable; for an arbitrary normal distribution, the corresponding quantiles scale linearly with the standard deviation.
Figure 2.6: Probabilities of the standard normal distribution: P(Z < 0) = 50.0%, P(Z < 1) = 84.1%, P(Z < 2) = 97.7% (left); P(−1 < Z < 1) = 68.3%, P(−2 < Z < 2) = 95.4%, P(−3 < Z < 3) = 99.7% (right).
Table 2.1: Probabilities of the standard normal distribution. The table gives the value
of Φ(zp ) for selected values of zp . For example, Φ(0.2 + 0.04) = 0.595.
zp     0.0    0.1    0.2    0.3    0.4    ...  1.0    ...  1.6    1.7    1.8    1.9    2.0    ...  3.0
0.00   0.500  0.540  0.579  0.618  0.655       0.841       0.945  0.955  0.964  0.971  0.977       0.999
0.02   0.508  0.548  0.587  0.626  0.663       0.846       0.947  0.957  0.966  0.973  0.978       0.999
0.04   0.516  0.556  0.595  0.633  0.670       0.851       0.949  0.959  0.967  0.974  0.979       ...
0.06   0.524  0.564  0.603  0.641  0.677       0.855       0.952  0.961  0.969  0.975  0.980
0.08   0.532  0.571  0.610  0.648  0.684       0.860       0.954  0.962  0.970  0.976  0.981
R-Code 2.5 Calculation of the “z-table” (see Table 2.1) and density, distribution, and
quantile functions of the standard normal distribution. (See Figure 2.7.)
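A minimal sketch of what such a listing could compute (not necessarily the original code):

round(pnorm(seq(0, 0.4, by = 0.1) + 0.04), 3)   # e.g. the table row labelled 0.04
x <- seq(-3, 3, by = 0.01)
p <- seq(0.01, 0.99, by = 0.01)
par(mfrow = c(1, 3))
plot(x, dnorm(x), type = "l")   # density
plot(x, pnorm(x), type = "l")   # distribution function
plot(p, qnorm(p), type = "l")   # quantile function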
Figure 2.7: Density, distribution function, and quantile function of the standard normal distribution. (See R-Code 2.5.)
We finish this section with two probability calculations that are found similarly in typical
textbooks. We will, however, revisit the Gaussian distribution in virtually every following chap-
ter.
diagram.
Problem 2.3 (Properties of probability measures) For events A and B show that:
b) P(A ∪ B) = 1 − P(Ac ∩ B c );
c) P(A ∩ B) = 1 − P(Ac ∪ B c ).
Problem 2.4 (Counting events) Place yourself in a sidewalk café during busy times and, for several consecutive minutes, count the number of persons walking by. (If you do not enjoy the experimental setup, it is possible to count the number of vehicles in one of the live cams linked at https://www.autobahnen.ch/index.php?lg=001&page=017).
b) Try to generate realizations from a Poisson random variable whose histogram matches best the one seen in a), i.e., use rpois() for different values of the argument lambda.
Would there be any other distribution that would yield a better match?
Problem 2.5 (Discrete uniform distribution) Let m be a positive integer. The discrete uniform
distribution is defined by the pmf
P(X = k) = 1/m,   k = 1, . . . , m.
a) Visualize the pmf and cdf of a discrete uniform random variable with m = 12.
b) Draw several realizations from X, visualize the results and compare to the pmf of a). What
are sensible graphics types for the visualization?
Hint: the function sample() conveniently draws random samples; note that by default it draws without replacement (see its argument replace).
c) Show that E(X) = (m + 1)/2 and Var(X) = (m² − 1)/12.
Hint: Σ_{k=1}^{m} k² = m(m + 1)(2m + 1)/6.
Problem 2.6 (Uniform distribution) We assume X ∼ U(0, θ), for some value θ > 0.
a) For all a and b, with 0 < a < b < θ show that P(X ∈ [a, b]) = (b − a)/θ.
c) Choose a sensible value for θ. In R, simulate n = 10, 50, 10000 random numbers and
visualize the histogram as well as a QQ-plot thereof. Is it helpful to superimpose a smoothed
density to the histograms (with lines( density( ...)))?
Problem 2.7 (Calculating probabilities) In the following settings, approximate the probabilities
and quantiles q1 and q2 using a Gaussian “standard table”. Compare these values with the ones
obtained with R.
X ∼ N (2, 16) : P(X < 4), P(0 ≤ X ≤ 4), P(X > q1 ) = 0.95, P(X < −q2 ) = 0.05.
If you do not have a standard table, the following two R commands may be used instead: pnorm(a) and qnorm(b) for specific values a and b.
Problem 2.8 (Exponential distribution) In this problem we get to know another important distribution you will frequently come across: the exponential distribution. Consider the random variable X with density
f(x) = 0, if x < 0, and f(x) = c · exp(−λx), if x ≥ 0,
with λ > 0. The parameter λ is called the rate. Subsequently, we denote an exponential random variable with X ∼ Exp(λ).
d) Let λ = 2. Calculate:
Problem 2.9 (BMJ Endgame) Discuss and justify the statements about ‘The Normal distribu-
tion’ given in doi.org/10.1136/bmj.c6085.
Chapter 3
Functions of Random Variables
⋄ Know the central limit theorem (CLT) and be able to approximate binomial random variables
The last chapter introduced "individual" random variables, e.g., a Poisson or a normal random variable. In subsequent chapters, we need as many random variables as we have
observations because we will match one random variable with each observation. Of course, these
random variables may have the same distribution (may be identical).
If this is not the case, they are dependent (all three cases). ♢
The concept of independence can be translated to information flow. Recall the formula of conditional probability (2.5): if we have independence, P(A | B) = P(A), i.e., knowing anything about the event B does not change the probability (knowledge) of the event A. Knowing that I was born on a Sunday does not provide any information about whether I have a driving license. But having a driving license increases the probability that I own a car.
Formally, to write Equation (3.2), we would have to introduce bivariate random variables.
We will formalize these ideas in Chapter 8 but note here that the definition of independence also implies that the joint density and the joint cumulative distribution are simply the products of the individual ones, also called the marginal ones. For example, if X and Y are independent, their joint density is fX(x)fY(y).
Example 3.1. The sum of two dice (Example 2.2) is not independent of the value of the first
die: P(X ≤ 4 ∩ Y ≤ 2) = 5/36 ̸= P(X ≤ 4) P(Y ≤ 2) = 1/6 · 1/3. ♣
Example 3.2 (continuation of Example 2.3). If we assume that each foul shot is independent,
then the probability that 6 shots are necessary is P(X = 6) = (1 − p) · (1 − p) · (1 − p) · (1 − p) ·
(1 − p) · p = (1 − p)5 p, where p is the probability that a shot is successful.
This independence also implies that the success of the next shot is independent of my previous
one. In other words, there is no memory of how often I have missed in the past. See also
Problem 3.2. ♣
We will often use several independent random variables with a common distribution function.
The iid assumption is crucial; relaxing it to allow, for example, dependence between the random variables has severe implications on the statistical modeling. Independence also implies a simple formula for the variance of the sum of two or several random variables; a formal justification will follow in Chapter 8.
The latter two points of Property 3.1 are used when we investigate statistical properties of the sample mean. This concept is quite powerful and works as follows. Consider the sample mean x̄ = (1/n) Σ_{i=1}^n xi, which can be seen as a function of n arguments, f(x1, . . . , xn) = x̄. We then evaluate the function at the arguments corresponding to the random sample, f(X1, . . . , Xn), which is the random sample mean X̄ = (1/n) Σ_{i=1}^n Xi, a random variable itself!
2. (theoretical approach) We start with a random sample and study the theoretical properties. Often these will be illustrated with realizations from the random sample (generated using R).
Property 3.2. Let X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) be independent and a and b arbitrary,
then aX1 + bX2 ∼ N (aµ1 + bµ2 , a2 σ12 + b2 σ22 ).
Hence, the density of the sum of two random variables is again unimodal and does not
correspond to a simple “superposition” of the densities. The following property is essential and
a direct consequence of the previous one.
Property 3.3. Let X1, . . . , Xn iid ∼ N(µ, σ²); then X ∼ N(µ, σ²/n).
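A small simulation makes the variance reduction tangible (sketch; n = 5 and the number of replicates are arbitrary choices):

set.seed(4)
xbar <- replicate(10000, mean(rnorm(5)))
c(var(xbar), 1/5)   # empirical variance of the means vs. sigma^2/n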
The next paragraphs discuss random variables that are derived (obtained as functions) from
Gaussian random variables, going beyond the average of iid Gaussians.
More specifically, we will look at the distributions of (X − µ)/√(σ²/n), of S² = 1/(n − 1) Σ_{i=1}^n (Xi − X)², and of related quantities.

3.2.1 Chi-Square Distribution
Let Z1, . . . , Zn iid ∼ N(0, 1). The distribution of the random variable X = Σ_{i=1}^n Zi² is called the chi-square distribution (X² distribution) with n degrees of freedom and denoted X ∼ X²_n. The following applies:
We summarize below the exact link between the distribution of S² and a chi-square distribution. Further, the chi-square distribution is used in numerous statistical tests that we will see in Chapters 5 and 6.
R-Code 3.1 Chi-square distribution for various degrees of freedom. (See Figure 3.1.)
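A minimal sketch of what such a listing could do (not necessarily the original code):

x <- seq(0.01, 50, by = 0.1)
dfs <- c(1, 2, 4, 8, 16, 32, 64)
plot(x, dchisq(x, df = dfs[1]), type = "l", ylim = c(0, 0.6), ylab = "Density", xlab = "x")
for (i in 2:length(dfs)) lines(x, dchisq(x, df = dfs[i]), col = i)
legend("topright", legend = dfs, col = 1:length(dfs), lty = 1)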
For small n the density of the chi-square distribution is quite right-skewed (Figure 3.1). For
larger n the density is quite symmetric and has a bell-shaped form. In fact, for n > 50, we
can approximate the chi-square distribution with a normal distribution, i.e., Xn2 is distributed
approximately N (n, 2n) (a justification is given in Section 3.3).
Figure 3.1: Densities of the chi-square distribution for various degrees of freedom.
(See R-Code 3.1.)
3.2.2 t-Distribution
Let Z ∼ N(0, 1) and X ∼ X²_m be two independent random variables. The distribution of the random variable
V = Z / √(X/m)    (3.7)
is called the t-distribution (or Student's t-distribution) with m degrees of freedom and denoted V ∼ t_m. We have:
Remark 3.2. For m = 1, 2 the density is heavy-tailed and the variance of the distribution does
not exist. Realizations of this random variable occasionally manifest with extremely large values.
R-Code 3.2 t-distribution for various degrees of freedom. (See Figure 3.2.)
Figure 3.2: Densities of the t-distribution for various degrees of freedom. The normal distribution is in black. A density with 2^7 = 128 degrees of freedom would make the normal density function appear thicker. (See R-Code 3.2.)
Of course, the sample variance can still be calculated (see R-Code 3.3). We come back to this
issue in Chapter 6. ♣
R-Code 3.3 Sample variance of the t-distribution with one degree of freedom.
set.seed( 14)
tmp <- rt( 1000, df=1) # 1000 realizations
var( tmp) # variance is huge!!
## [1] 37391
sort( tmp)[1:7] # many "large" values, but 2 exceptionally large
## [1] -190.929 -168.920 -60.603 -53.736 -47.764 -43.377 -36.252
sort( tmp, decreasing=TRUE)[1:7]
## [1] 5726.53 2083.68 280.85 239.75 137.36 119.16 102.70
Property 3.4. Let X1, . . . , Xn iid ∼ N(µ, σ²). Define X = (1/n) Σ_{i=1}^n Xi and S² = 1/(n − 1) Σ_{i=1}^n (Xi − X)². Then:
1. (a) (X − µ)/(σ/√n) ∼ N(0, 1);   (b) ((n − 1)/σ²) S² ∼ X²_{n−1}.
2. X and S² are independent.
3. (X − µ)/(S/√n) ∼ t_{n−1}.
Statement 1(a) is not surprising and is a direct consequence of Properties 2.7 and 3.3. State-
ment 1(b) is surprising insofar that centering the random variables with X amounts to reducing
the degrees of freedom by one. Point 2 seems very surprising as the same random variables are
used in X and S 2 . A justification is that S 2 is essentially a sum of random variables which were
corrected for X and thus S 2 does not contain information about X anymore. In Chapter 8 we
give a more detailed explanation thereof. Point 3 is not surprising as we use the definition of the
t-distribution and the previous two points.
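A small simulation makes Statement 1(b) tangible (sketch; µ = 1, σ = 2 and n = 5 are arbitrary choices):

set.seed(5)
n <- 5
s2 <- replicate(10000, var(rnorm(n, mean = 1, sd = 2)))
c(mean((n - 1) * s2 / 4), n - 1)   # mean of a chi-square equals its degrees of freedom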
3.2.3 F -Distribution
The F -distribution is mainly used to compare two sample variances with each other, as we will
see in Chapters 5, 10 and ongoing.
Let X ∼ X²_m and Y ∼ X²_n be two independent random variables. The distribution of the random variable
W = (X/m) / (Y/n)    (3.10)
is called the F -distribution with m and n degrees of freedom and denoted W ∼ Fm,n . It holds
that:
E(W) = n/(n − 2),   for n > 2;    (3.11)
Var(W) = 2n²(m + n − 2) / ( m(n − 2)²(n − 4) ),   for n > 4.    (3.12)
That means that if n increases the expectation gets closer to one and the variance to 2/m, with
m fixed. Figure 3.3 (based on R-Code 3.4) shows the density for various degrees of freedom.
R-Code 3.4 F -distribution for various degrees of freedom. (See Figure 3.3.)
Figure 3.3: Density of the F -distribution for various degrees of freedom. (See R-
Code 3.4.)
Property 3.5. (Central Limit Theorem (CLT), classical version) Let X1, X2, X3, . . . be an infinite sequence of iid random variables with E(Xi) = µ and Var(Xi) = σ². Then
lim_{n→∞} P( (X_n − µ)/(σ/√n) ≤ z ) = Φ(z),    (3.13)
where we kept the subscript n for the random sample mean X_n to emphasize its dependence on n.
The proof of the CLT is a typical exercise in a probability theory lecture. Many extensions
of the CLT exist, for example, the independence assumptions can be relaxed.
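The CLT can be illustrated with a skewed starting distribution, e.g., Exp(1), for which µ = σ = 1 (sketch; sample sizes chosen arbitrarily):

set.seed(6)
n <- 50
z <- replicate(5000, (mean(rexp(n)) - 1) / (1 / sqrt(n)))
hist(z, probability = TRUE, breaks = 40, main = "")
curve(dnorm(x), add = TRUE, col = 2)   # standardized means are close to N(0, 1)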
Using a central limit theorem argument, we can show that the distribution of a binomial random variable X ∼ Bin(n, p) approaches that of a normal random variable as n → ∞. Thus, the normal distribution N(np, np(1 − p)) can be used as an approximation for the binomial distribution Bin(n, p). For the approximation, n should be larger than 30 for p ≈ 0.5. For p closer to 0 and 1, n needs to be much larger.
For a binomial random variable, P(X ≤ x) = P(X < x + 1), x = 1, 2, . . . , n, which motivates a so-called continuity correction: replacing x by x + 0.5 in the normal approximation.
Example 3.4. Let X ∼ Bin(30, 0.5). Then P(X ≤ 10) = 0.0494, "exactly". However,
P(X ≤ 10) ≈ P( (X − np)/√(np(1 − p)) ≤ (10 − np)/√(np(1 − p)) ) = Φ( (10 − 15)/√(30/4) ) = 0.0339,    (3.14)
P(X ≤ 10 + 0.5) ≈ P( (X − np)/√(np(1 − p)) ≤ (10 + 0.5 − np)/√(np(1 − p)) ) = Φ( (10.5 − 15)/√(30/4) ) = 0.05017.    (3.15)
The improvement of the continuity correction can be quantified by the reduction of the absolute
errors |0.0494 − 0.0339| = 0.0155 vs |0.0494 − 0.0502| = 0.0008 or by the relative errors |0.0494 −
0.0339|/0.0494 = 0.3138 vs |0.0494 − 0.0502|/0.0494 = 0.0162. ♣
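The three numbers of the example can be reproduced directly in R:

pbinom(10, size = 30, prob = 0.5)     # "exact"
pnorm((10   - 15) / sqrt(30 / 4))     # without continuity correction
pnorm((10.5 - 15) / sqrt(30 / 4))     # with continuity correction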
Another very important limit theorem is the law of large numbers (LLN), which essentially states that for X1, . . . , Xn iid with E(Xi) = µ, the average X_n converges to µ. We have deliberately used the somewhat ambiguous term "convergence"; a more rigorous statement is technically a bit more involved. We will use the LLN in the next chapter, when we try to infer parameter values from data, i.e., say something about µ when we observe x1, . . . , xn.
Remark 3.3. There are actually two forms of the LLN, the strong and the weak formulation. We do not need the precise formulations later and thus simply state them here for the sole reason of stating them:
P( |X_n − µ| > ε ) → 0 as n → ∞, for every ε > 0,    (3.16)
P( lim_{n→∞} X_n = µ ) = 1.    (3.17)
The differences between the two formulations are subtle. The weak version states that the average is close to the mean, but excursions (for specific n) beyond µ ± ε can happen arbitrarily often. The strong version states that (with probability one) there exists a large n beyond which the average always stays within µ ± ε.
The two forms represent fundamentally different notions of convergence of random variables:
(3.17) is almost sure convergence, (3.16) is convergence in probability. The CLT represents con-
vergence in distribution. ♣
We have observed several times that for increasing sample sizes, the discrepancy between
the theoretical value and the sample diminishes. Examples include the CLT, LLN, but also our
observation that the histogram of rnorm(n) looks like the normal density for large n.
It is possible to formalize this important concept. Let X1, . . . , Xn be iid with distribution function F(x). We define the sample empirical distribution function
Fn(x) = (1/n) Σ_{i=1}^n I_{xi ≤ x}(x),    (3.18)
that is, a step function with jump size 1/n at the values x1, . . . , xn (see also Equation (3.32) for the meaning of I). As n → ∞ the empirical distribution function Fn(x) converges to the underlying
distribution function F (x). Because of this fundamental result we are able to work with specific
distributions of random samples.
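The convergence can be visualized with the function ecdf() (sketch; the sample size is chosen arbitrarily):

set.seed(7)
plot(ecdf(rnorm(50)), main = "")       # step function with jumps of size 1/50
curve(pnorm(x), add = TRUE, col = 2)   # underlying distribution function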
For discrete random variables, the previous convergence result can be written in terms of
probabilities and observed proportions (see Problem 3.3.b). For continuous random variables,
we would have to invoke binning to compare the histogram with the density. The binning adds
another technical layer for the theoretical results. In practice, we simply compare the histogram
(or a smoothed version thereof, Figure 1.3) with the theoretical density.
Remark 3.4. To show the convergence of the sample empirical distribution function, we consider the random version thereof, i.e., the empirical distribution function (1/n) Σ_{i=1}^n I_{Xi ≤ x}(x), and invoke the LLN. This convergence is pointwise only, meaning it holds for each fixed x. Stronger results hold: the Glivenko–Cantelli theorem states that the convergence is even uniform, i.e., sup_x |Fn(x) − F(x)| converges to zero almost surely. ♣
To derive the probability mass function we apply Property 2.2.4. In the more interesting setting of continuous random variables, the density function is derived by Property 2.3.4 and is thus
fY(y) = | d g^{−1}(y)/dy | · fX(g^{−1}(y)).    (3.21)
Example 3.5. Let X be a random variable with cdf FX(x) and pdf fX(x). We consider Y = a + bX, for b > 0 and a arbitrary. Hence, g(·) is a linear function and its inverse g^{−1}(y) = (y − a)/b is monotonically increasing. The cdf of Y is thus FX((y − a)/b) and, for a continuous random variable X, the pdf is fX((y − a)/b) · 1/b. This fact has already been stated in Property 2.7 for the Gaussian distribution. ♣
This last example also motivates a formal definition of a location parameter and a scale
parameter.
Definition 3.3. For a random variable X with density fX (x), we call θ a location parameter if
the density has the form fX (x) = f (x − θ) and call θ a scale parameter if the density has the
form fX (x) = f (x/θ)/θ. ♢
In the previous definition θ stands for an arbitrary parameter. The parameters are often
chosen such that the expectation of the random variable is equal to the location parameter and
the variance equal to the squared scale parameter. Hence, the following examples are no surprise.
Example 3.6. For a Gaussian distribution N (µ, σ 2 ), µ is the location parameter, σ a scale
parameter, as seen from the definition of the density (2.24). The parameter λ of the Poisson
distribution Pois(λ) is neither a location nor a scale parameter. ♣
Example 3.7. Let X ∼ U(0, 1) and for 0 < x < 1, we set g(x) = − log(1 − x), where log() is the natural logarithm, i.e., the logarithm to the base e. Thus, g^{−1}(y) = 1 − exp(−y) and the distribution and density functions of Y = g(X) are
FY(y) = P(X ≤ 1 − exp(−y)) = 1 − exp(−y) and fY(y) = exp(−y), for y ≥ 0,
i.e., Y ∼ Exp(1). ♣
This last example gives rise to the so-called inverse transform sampling method to draw realizations from a random variable X with a closed form quantile function QX(p). In more detail, assume an arbitrary cumulative distribution function FX(x) with a closed form inverse FX^{−1}(p) = QX(p). For U ∼ U(0, 1), the random variable X = FX^{−1}(U) has cdf FX(x):
P(X ≤ x) = P(FX^{−1}(U) ≤ x) = P(U ≤ FX(x)) = FX(x).
Hence, based on Example 3.7, the R expression -log(1- runif(n))/lambda draws realizations from X1, . . . , Xn iid ∼ Exp(λ).
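A quick check of this sampling scheme (λ = 2 is an arbitrary choice):

set.seed(1)
lambda <- 2
x <- -log(1 - runif(1000)) / lambda
c(mean(x), 1 / lambda)   # sample mean close to the theoretical expectation 1/lambda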
Remark 3.5. For U ∼ U(0, 1), 1 − U ∼ U(0, 1), and thus it is possible to further simplify
the sampling procedure. Interestingly, R uses a seemingly more complex algorithm to sample.
The algorithm is however fast and does not require a lot of memory (Ahrens and Dieter, 1972);
properties that were historically very important. ♣
In the case where we cannot invert the function g, we can nevertheless use the same approach by starting with (3.19), followed by simplification and use of Property 2.3.4, as illustrated in the following example.
Example 3.8. The density of a chi-squared distributed random variable with one degree of freedom is calculated as follows. Let Z ∼ N(0, 1) and Y = Z². Then, for y > 0,
FY(y) = P(Y = Z² ≤ y) = P(|Z| ≤ √y) = Φ(√y) − Φ(−√y) = 2Φ(√y) − 1,    (3.25)
fY(y) = d/dy FY(y) = 2ϕ(√y) · d√y/dy = 2 · 1/√(2π) · exp(−y/2) · 1/(2√y) = 1/√(2πy) · exp(−y/2).    (3.26)
As we are often interested in summarizing a random variable by its mean and variance, we now introduce a very convenient approximation for transformed random variables Y = g(X) by the so-called delta method. The idea thereof consists of a Taylor expansion around the expectation E(X):
g(X) ≈ g(E(X)) + g′(E(X)) (X − E(X)),    (3.27)
leading to the approximations
E(Y) ≈ g(E(X)),    (3.28)
Var(Y) ≈ g′(E(X))² Var(X).    (3.29)
E(Y) ≈ p/(1 − p);   Var(Y) ≈ (1/(1 − p)²)² · p(1 − p) = p/(1 − p)³. ♣    (3.30)
Of course, in the case of a linear transformation (as, e.g., in Example 3.5), equation (3.27) is
an equality and thus relations (3.28) and (3.29) are exact, which is in sync with Property 2.7.
Consider g(X) = X²: by Property 2.5.1, E(X²) = E(X)² + Var(X) and thus E(X²) > E(X)² (if Var(X) > 0), refining the approximation (3.28). This result can be generalized: for every convex function g(·), E(g(X)) ≥ g(E(X)). For concave functions, the inequality is reversed. For strictly convex or strictly concave functions, we have strict inequalities. Finally, a linear function is both convex and concave (but neither strictly), and we have equality, as given in Property 2.5.4. These inequalities are known as Jensen's inequality.
Remark 3.6. The following results are for completeness only. They are nicely elaborated in
Rice (2006).
2. It is also possible to construct random variables based on an entire random sample, say Y = g(X1, . . . , Xn). Property 3.5 uses exactly such an approach, where g(·) is given by g(X1, . . . , Xn) = ( (1/n) Σ_i Xi − µ ) / (σ/√n).
3. Starting with two random continuous variables X1 and X2 , and two bijective functions
g1 (X1 , X2 ) and g2 (X1 , X2 ), there exists a closed form expression for the (joint) density of
Y1 = g1 (X1 , X2 ) and Y2 = g2 (X1 , X2 ).
♣
We finish the chapter by introducing a very simple but convenient function, the so-called indicator function, defined by
I_{x∈A}(x) = 1, if x ∈ A, and I_{x∈A}(x) = 0, if x ∉ A.    (3.32)
The function argument '(x)' is redundant but often helps to clarify. In the literature, one often finds the concise notation IA. Here, let X be a random variable; then we define a random variable Y = I_{X∈A}(X), i.e., Y 'specifies' whether the random variable X is in the set A.
The indicator function is often used to calculate expectations. For example, let Y be defined
as above. Then
E(Y) = E( I_{X>0}(X) ) = 0 · P(X ≤ 0) + 1 · P(X > 0) = P(X > 0),    (3.33)
where we used Property 2.7 with g(x) = Ix∈A (x). In other words, we see IX>0 (X) as a Bernoulli
random variable with success probability P (X > 0).
b) Starting from the pmf of a Binomial random variable, derive the pmf of the Poisson random
variable when n → ∞, p → 0 but λ = np constant.
Problem 3.2 (Geometric distribution) In the setting of Examples 2.3 and 3.2, denote p =
P(shot is successful). Assume that the individual shots are independent. Show that
a) P(X ≤ k) = 1 − (1 − p)k , k = 1, 2, . . . .
Problem 3.3 (Poisson distribution) In this problem we visualize and derive some properties of
the Poisson random variable with parameter λ > 0.
b) For λ = 0.2 and λ = 2, sample from X1 , . . . , Xn ∼ Pois(λ) with n = 200 and draw
histograms. Compare the histograms with a). What do you expect to happen when n
increases?
c) Let X ∼ Pois(λ), with λ = 3: calculate P(X ≤ 2), P(X < 2) and P(X ≥ 3).
d) Plot the pmf of X ∼ Pois(λ), λ = 5, and Y ∼ Bin(n, p) for n = 10, 100, 1000 with λ = np.
What can you conclude?
Problem 3.4 (Approximating probabilities) In the following settings, approximate the probabil-
ities and quantiles by reducing the problem to calculating probabilities from a standard normal,
i.e., using a “standard table”. Compare these values with the ones obtained with R.
If you do not have a standard table, the following two R commands may be used instead: pnorm(a) and qnorm(b) for specific values a and b.
Problem 3.5 (Random numbers from a t-distribution) We sample random numbers from a t-distribution by repeatedly drawing z1, . . . , zm from a standard normal and setting yj = z̄/√(s²/m), where z̄ and s² denote the sample mean and the sample variance of z1, . . . , zm.
a) For m = 6 and n = 500 construct Q-Q plots based on the normal and appropriate t-
distribution. Based on the plots, is it possible to discriminate which of the two distributions
is more appropriate? Is this question getting more difficult if m and n are chosen larger or
smaller?
b) Suppose we receive the sample y1 , . . . , yn constructed as above but the value m is not
disclosed. Is it possible to determine m based on Q-Q plots? What general implication
can be drawn, especially for small samples?
a) Sample n = 100 random numbers from Exp(λ). Visualize the data with a histogram and
superimpose the theoretical density.
c) Draw a histogram of min(X1 , . . . , Xn ) from 500 realizations and compare it to the theoret-
ical result from part b).
Problem 3.7 (Inverse transform sampling) The goal of this exercise is to implement your own R code to simulate from a continuous random variable X with the following probability density function (pdf):
fX(x) = c |x| exp(−x²),   x ∈ R.
We use inverse transform sampling, which is well suited for distributions whose cdf is easily invertible.
a) Find c such that fX(x) is an actual pdf. Show that the quantile function is
F_X^{−1}(p) = QX(p) = √(−log(2(1 − p))), if p ≥ 0.5, and QX(p) = −√(−log(2p)), if p < 0.5.
b) Write your own code to sample from X. Check the correctness of your sampler.
and is often used to summarize rates or to summarize different items that are rated on different
scales.
a) For x1, . . . , xn and y1, . . . , yn, show that mg(x1/y1, . . . , xn/yn) = mg(x1, . . . , xn) / mg(y1, . . . , yn). What is a profound consequence of this property?
b) Using Jensen's inequality, show that (x1 · · · xn)^{1/n} ≤ (1/n)(x1 + · · · + xn), i.e., mg ≤ x̄.
Chapter 4
Estimation of Parameters
A central goal of statistics is to draw information from observations (in the form of data from measurements of an experiment or from a subset of a population) and to use it to address hypotheses, in the form of scientific questions or general statements about the population. Inferential statistics is often explained as "inferring" conclusions from the sample toward the "population".
This step requires foremost data and an adequate statistical model. Such a statistical model describes the data through the use of random variables and their distributions, by which the data is considered a realization of the corresponding distribution. The model contains unknown quantities, so-called parameters. The goal of statistical estimation is to determine plausible values for the parameters of the model using the observed data.
In this section we first introduce the concept of statistical models and the associated parameters.
In the second step we introduce the idea of estimation.
Table 4.1: Inhibition diameters by E. coli and the antibiotics imipenem and
meropenem.
Diameter (mm) 28 29 30 31 32 33 34 35 36 37 38 39 40
Imipenem 0 3 7 14 32 20 18 4 1 1 0 0 0
Meropenem 0 0 0 0 2 9 33 20 17 9 6 4 0
Although the data of the previous example are rounded to the nearest millimeter, it is reasonable to assume that the diameters are real valued, with each measurement being of the same quantity as the others and thus fluctuating naturally around a common mean. The following is a very simple example of a statistical model adequate here:
Yi = µ + εi,   i = 1, . . . , n,    (4.1)
where Yi are the observations, µ is an unknown diameter and εi are random variables representing
measurement error. It is often reasonable to assume E(εi) = 0 with a symmetric density. Here, we further assume εi iid ∼ N(0, σ²). Thus, Y1, . . . , Yn are normally distributed with mean µ and variance σ². As is typical, both parameters µ and σ² are unknown and we need to determine plausible values for these parameters from the available data.
Such a statistical model describes the data y1, . . . , yn through random variables Y1, . . . , Yn and their distributions. In other words, the data is a realization of the random sample given by the statistical model.
For both antibiotics, a slightly more evolved example consists of
Yi = µimi + εi , i = 1, . . . , nimi , (4.2)
Yi = µmero + εi , i = nimi + 1, . . . , nimi + nmero , (4.3)
We assume εi iid ∼ N(0, σ²). Thus the model states that the two antibiotics have different means but the same variability. This assumption pools information from both samples to estimate the variance σ². The parameters of the model are µimi, µmero and, to a lesser extent, σ².
The statistical model represents the population, which has a seemingly infinite size. The data is the realization of a subset of our population. Thus, based on the data, we want to answer questions like: What are plausible values of the population levels? How much do individuals deviate around the population mean?
R-Code 4.1 states the means, which we might consider as reasonable values for µimi and µmero; similarly for their variances. Of course, we need to formalize that the mean of the sample can be taken as a representative value of the entire population. For the variance parameter σ², slightly more care is needed, as neither var(imipDat) nor var(meropDat) is fully satisfying.
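A minimal way to reconstruct the imipenem sample from Table 4.1 and to obtain such summary values (the object name imipDat follows the script; building it via rep() is an assumption):

diam <- 28:40
countsImi <- c(0, 3, 7, 14, 32, 20, 18, 4, 1, 1, 0, 0, 0)
imipDat <- rep(diam, times = countsImi)   # 100 diameters
c(mean(imipDat), var(imipDat))            # roughly 32.40 and 2.24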
The questions of whether the inhibition diameters of both antibiotics are comparable, or whether the inhibition diameter of meropenem is (statistically) larger than 33 mm, are of a completely different nature and will be discussed in the next chapter, where we formally introduce statistical tests.
Figure 4.1: Frequencies of inhibition diameters (in mm) by E. coli and imipenem and
meropenem (total 100 measurements). (See R-Code 4.1.)
We assume now that an appropriate statistical model parametrized by one or several parameters
has been proposed for a dataset. For estimation, the data is seen as a realization of a random
sample where the distribution of the latter is given by the statistical model. The goal of point
estimation is to provide a plausible value for the parameters of the distribution based on the
data at hand.
Hence, in order to estimate a parameter, we start from an estimator for that particular
parameter and evaluate the estimator at the available data. The estimator may depend on the
underlying statistical model and typically depends on the sample size.
Example 4.2. 1. The numerical values shown in R-Code 4.1 are estimates.
2. Y = (1/n) Σ_{i=1}^n Yi is an estimator;
y = (1/100) Σ_{i=1}^{100} yi = 32.40 is a point estimate.
3. S² = 1/(n − 1) Σ_{i=1}^n (Yi − Y)² is an estimator;
s² = 1/(100 − 1) Σ_{i=1}^{100} (yi − y)² = 2.242 or s = 1.498 are point estimates. ♣
Often, we denote parameters with Greek letters (µ, σ, λ, . . . ), with θ being the generic one. The estimator and estimate of a parameter θ are denoted by θ̂. Context makes clear which of the two cases is meant.
and thus after minimizing the sum of squares we get the estimator µ̂LS = Y and the estimate µ̂LS = y.
Often, the parameter θ is linked to the expectation E(Yi) through some function, say g. In such a setting, we have
θ̂ = θ̂LS = argmin_θ Σ_{i=1}^n ( Yi − g(θ) )²,    (4.5)
and θ̂LS solves g(θ̂) = Y.
In linear regression settings, the ordinary least squares method minimizes the sum of squares of the differences between observed responses and those predicted by a linear function of the explanatory variables. Due to the linearity, simple and closed form solutions exist (see Chapters 9ff).
Example 4.4. Let Y1, . . . , Yn iid ∼ F with expectation µ and variance σ². Since Var(Y) = E(Y²) − E(Y)² (Property 2.5.1), we can write σ² = µ₂ − (µ)², where µ₂ = E(Y²) denotes the second moment, and we have the estimator
σ̂²MM = µ̂₂ − (µ̂)² = (1/n) Σ_{i=1}^n Yi² − (Y)² = (1/n) Σ_{i=1}^n (Yi − Y)². ♣    (4.9)
For a given distribution, we call L(θ) the likelihood function, or simply the likelihood.
Definition 4.2. The maximum likelihood estimate θ̂ML of the parameter θ is based on maximizing the likelihood, i.e., θ̂ML = argmax_θ L(θ). ♢
By definition of a random sample, the random variables are independent and identically distributed and thus the likelihood is the product of the individual densities fY(yi; θ) (see Section 3.1). To simplify the notation, we have omitted the index of Y. Since θ̂ML = argmax_θ L(θ) = argmax_θ log(L(θ)), the log-likelihood ℓ(θ) := log(L(θ)) can be maximized instead. The log-likelihood is often preferred because the expressions simplify more and maximizing sums is much easier than maximizing products.
Example 4.5. Let Y1, . . . , Yn iid ∼ Exp(λ); thus
L(λ) = Π_{i=1}^n fY(yi) = Π_{i=1}^n λ exp(−λ yi) = λ^n exp(−λ Σ_{i=1}^n yi).    (4.13)
Then
dℓ(λ)/dλ = d log( λ^n exp(−λ Σ_{i=1}^n yi) )/dλ = d( n log(λ) − λ Σ_{i=1}^n yi )/dλ = n/λ − Σ_{i=1}^n yi,    (4.14)
which, set to zero, yields
λ̂ = λ̂ML = n / Σ_{i=1}^n yi = 1/y.    (4.15)
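The analytic solution can be verified numerically, e.g., on simulated data (sketch; the rate 1.5 is an arbitrary choice):

set.seed(2)
y <- rexp(50, rate = 1.5)
negll <- function(lambda) -sum(dexp(y, rate = lambda, log = TRUE))
optimize(negll, interval = c(0.01, 10))$minimum   # numerical ML estimate
1 / mean(y)                                       # closed form from (4.15)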
In the vast majority of cases, maximum likelihood estimators possess very nice properties. Intuitively, because we use information about the density and not only about the moments, they are "better" compared to method of moments estimators and to least squares based estimators.
Further, for many common random variables, the likelihood function has a single optimum, in
fact a maximum, for all permissible θ.
In our daily life we often have estimators available and thus we rarely need to rely on the approaches presented in this section.
E(θ̂) = θ,    (4.16)
Simply put, an unbiased estimator leads to estimates that are in the long run correct.
Example 4.6. Let Y1, . . . , Yn iid ∼ N(µ, σ²).
1. Y is unbiased for µ, since E(Y) = (1/n) Σ_{i=1}^n E(Yi) = µ.    (4.17)
2. S² = 1/(n − 1) Σ_{i=1}^n (Yi − Y)² is unbiased for σ². To show this, we expand the square terms, simplify the sum of the cross-terms, and use E(Yi) = E(Y) = µ, thus E((Yi − µ)²) = Var(Yi) = σ² and similarly E((Y − µ)²) = Var(Y) = σ²/n, to obtain
(n − 1) E(S²) = Σ_{i=1}^n Var(Yi) − n Var(Y) = nσ² − n · σ²/n = (n − 1)σ².    (4.21)
3. σ̂² = (1/n) Σ_i (Yi − Y)² is biased for σ², since
E(σ̂²) = (1/n) · (n − 1) · E( 1/(n − 1) Σ_i (Yi − Y)² ) = ((n − 1)/n) σ²,    (4.22)
where the expectation in the middle term equals E(S²) = σ².
The bias is
E(σ̂²) − σ² = ((n − 1)/n) σ² − σ² = −(1/n) σ².    (4.23)
Example 4.7. Let Y1, . . . , Yn iid ∼ N(µ, σ²). Using the result (4.17) and Property 3.1.3, we have
MSE(Y) = bias(Y)² + Var(Y) = 0 + σ²/n.    (4.25)
There is another "classical" example for the calculation of the mean squared error; however, it requires some properties of squared Gaussian variables.
Example 4.8. If Y1, . . . , Yn iid ∼ N(µ, σ²), then (n − 1)S²/σ² ∼ X²_{n−1} (see Property 3.4.1). Using the variance expression of a chi-squared random variable (see Equation (3.6)), we have
MSE(S²) = Var(S²) = (σ⁴/(n − 1)²) · Var( (n − 1)S²/σ² ) = (σ⁴/(n − 1)²) · 2(n − 1) = 2σ⁴/(n − 1).    (4.26)
Analogously, one can show that MSE(σ̂²MM) is smaller than Equation (4.26). Moreover, the estimator (n − 1)S²/(n + 1) possesses the smallest MSE (see Problem 4.1.b). ♣
Remark 4.1. In both examples above, the variance has order O(1/n). In practical settings, it is not possible to get a better rate. In fact, there is a lower bound for the variance that cannot be undercut (the bound is called the Cramér–Rao lower bound). Ideally, we aim to construct and use minimum variance unbiased estimators (MVUE). Such bounds and properties are studied in mathematical statistics lectures and treatises. ♣
Figure 4.2: 100 estimates of the parameter λ = 1/2 based on a sample of size n = 25. The red and blue vertical lines indicate the sample mean and median of the estimates, respectively. (See R-Code 4.2.)
In practice we have one set of observations and thus we cannot repeat sampling to get
a description of the variability in our estimate. Therefore we apply a different approach by
attaching an uncertainty to the estimate itself, which we now illustrate in the Gaussian setting.
1 − α = P( z_{α/2} ≤ (Y − µ)/(σ/√n) ≤ z_{1−α/2} ) = P( z_{α/2} · σ/√n ≤ Y − µ ≤ z_{1−α/2} · σ/√n )    (4.27)
      = P( −Y + z_{α/2} · σ/√n ≤ −µ ≤ −Y + z_{1−α/2} · σ/√n )    (4.28)
      = P( Y − z_{α/2} · σ/√n ≥ µ ≥ Y − z_{1−α/2} · σ/√n )    (4.29)
      = P( Y − z_{1−α/2} · σ/√n ≤ µ ≤ Y + z_{1−α/2} · σ/√n ),    (4.30)
where zp is the p-quantile of the standard normal distribution P(Z ≤ zp ) = p for Z ∼ N (0, 1)
(recall that zp = −z1−p ).
The manipulations on the inequalities in the previous derivation are standard, but the probabilities in (4.28) to (4.30) may seem a bit strange at first sight; they should be read as P( {Y − a ≤ µ} ∩ {µ ≤ Y + a} ).
Definition 4.4. Let Y1, . . . , Yn iid ∼ N(µ, σ²) with known σ². The interval
[ Y − z_{1−α/2} · σ/√n ,  Y + z_{1−α/2} · σ/√n ]    (4.31)
is an exact (1 − α) confidence interval for the parameter µ; 1 − α is called the level of the confidence interval. ♢
If the standard deviation σ is unknown, the approach above must be modified by using a point estimate for σ, typically S = √(S²) with S² = 1/(n − 1) Σ_i (Yi − Y)². Since (Y − µ)/√(S²/n) has a t-distribution with n − 1 degrees of freedom (see Property 3.4.3), the corresponding quantile must be modified:
1 − α = P( t_{n−1,α/2} ≤ (Y − µ)/(S/√n) ≤ t_{n−1,1−α/2} ).    (4.32)
Here, tn−1,p is the p-quantile of a t-distributed random variable with n − 1 degrees of freedom.
The next steps are similar as in (4.28) to (4.30) and the result is summarized below.
Definition 4.5. Let Y1, . . . , Yn iid ∼ N(µ, σ²). The interval
[ Y − t_{n−1,1−α/2} · S/√n ,  Y + t_{n−1,1−α/2} · S/√n ]    (4.33)
is an exact (1 − α) confidence interval for the parameter µ. ♢
Example 4.10 (continuation of Example 4.1). For the antibiotic imipenem, we do not know the underlying variability and hence we use (4.33). A sample 95% confidence interval for the mean inhibition diameter is [ y − t_{99,0.975} · s/√100 ,  y + t_{99,0.975} · s/√100 ] = [ 32.40 − 1.98 · 1.50/10 ,  32.40 + 1.98 · 1.50/10 ] = [ 32.1, 32.7 ], where we used the information from R-Code 4.1 and qt(.975, df=99). ♣
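The same interval can be computed directly in R (with imipDat constructed from Table 4.1, e.g., as sketched after R-Code 4.1):

mean(imipDat) + qt(c(0.025, 0.975), df = 99) * sd(imipDat) / sqrt(100)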
Confidence intervals are, as shown in the previous two definitions, constituted by random vari-
ables (functions of Y1 , . . . , Yn ). Similar to estimators and estimates, sample confidence intervals
are computed with the corresponding realization y1 , . . . , yn of the random sample. Subsequently,
relevant confidence intervals will be summarized in the blue-highlighted text boxes, as shown
here.
Notice that both the approximate and the exact sample confidence intervals of the mean are of the form
θ̂ ± q_{1−α/2} SE(θ̂),
that is, symmetric intervals around the estimate. Here, SE(·) denotes the standard error of the estimate, that is, an estimate of the standard deviation of the estimator.
Let Y1, . . . , Yn iid ∼ N(µ, σ²). The estimator S² for the parameter σ² is such that (n − 1)S²/σ² ∼ X²_{n−1}, i.e., a chi-square distribution with n − 1 degrees of freedom (see Section 3.2.1).
Hence
1 − α = P( χ²_{n−1,α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2} )    (4.37)
      = P( (n − 1)S²/χ²_{n−1,α/2} ≥ σ² ≥ (n − 1)S²/χ²_{n−1,1−α/2} ),    (4.38)
where χ²_{n−1,p} is the p-quantile of the chi-square distribution with n − 1 degrees of freedom. The corresponding exact (1 − α) confidence interval no longer has the form θ̂ ± q_{1−α/2} SE(θ̂), because the chi-square distribution is not symmetric.
For large n, the chi-square distribution can be approximated with a normal one (see also
Section 3.2.1) with mean n and variance 2n. Hence, a confidence interval based on a Gaussian
approximation is reasonable, see CI 2.
Example 4.11 (continuation of Example 4.1). For the antibiotic imipenem, we have a sam-
ple variance of 2.24 leading to a 95%-confidence interval [1.73, 3.03] for σ 2 , computed with
(100-1)*var(imipDat)/qchisq(c(.975, .025), df=100-1). The confidence interval is slightly
asymmetric: the center of the interval is (1.73 + 3.03)/2 = 2.38 compared to the estimate 2.24.
As the sample size is quite large, the Gaussian approximation yields a 95%-confidence inter-
val [1.62, 2.86] for σ 2 , computed with var(imipDat)+qnorm(c(.025, .975 ))*sqrt(2/100)*
var(imipDat). The interval is symmetric and slightly narrower: 2.86 − 1.62 = 1.24 compared to 3.03 − 1.73 = 1.30.
If we would like to construct a confidence interval for the standard deviation σ = √σ², we can use the approximation [√1.73, √3.03] = [1.32, 1.74]. That means we have applied the same transformation to the bounds as to the estimate, a quite common approach. ♣
When repeating the same experiment many times, on average the fraction 1 − α of all confidence intervals contains the true parameter.
The sample confidence interval [ bl , bu ] does not state that the parameter θ is in the interval
with fraction 1 − α. The parameter is not random and thus such a probability statement cannot
be made.
Example 4.12. Let Y1, . . . , Y4 iid ∼ N(0, 1). Figure 4.3 (based on R-Code 4.3) shows 100 sam-
ple confidence intervals based on Equation (4.31) (top), Equation (4.35) (middle) and Equa-
tion (4.33) (bottom). We color all intervals that do not contain zero, the true (unknown) pa-
rameter value µ = 0, in red (if ci[1]>mu | ci[2]<mu is true).
On average we should observe 5% of the intervals colored red (five here) in the top and
bottom panel because these are exact confidence intervals. Due to sampling variability we have
for the specific simulation three and five in the two panels. In the middle panel there are typically
more, as the normal quantiles are too small compared to the t-distribution ones (see Figure 3.2).
Because n is small, the difference between the normal and the t-distribution is quite pronounced;
here, there are eleven intervals that do not cover zero.
A few more points to note are as follows. As we do not estimate the variance, all intervals
in the top panel have the same lengths. Further, the variance estimate shows a lot of variabil-
ity (very different interval lengths in the middle and bottom panel). Instead of for()-loops, calls of the form segments(1:ex.n, ybar + sigmaybar*qnorm(alpha/2), 1:ex.n, ybar - sigmaybar*qnorm(alpha/2)) are possible. ♣
Confidence intervals can often be constructed starting from an estimator θ̂ and its distribution. In many cases it is possible to isolate the parameter to arrive at 1 − α = P(Bl ≤ θ ≤ Bu); often some approximations are necessary. For the variance parameter σ² in the framework of Gaussian
random variables, CI 2 states two different intervals. These two are not the only ones and we
can use further approximations (see derivation of Problem 4.7.a or Remark 3.1), all leading to
slightly different confidence intervals.
Similarly to estimators, it is possible to compare different confidence intervals with equal
level. Instead of bias and MSE, the criteria here are the width of a confidence interval and the
so-called coverage probability. The latter is the probability that the confidence interval actually
contains the true parameter. In case there is an exact confidence interval, the coverage probability
is equal to the level of the interval.
For a specific estimator the width of a confidence interval can be reduced by reducing the
level or increasing n. The former should be fixed at 90%, 95% or possibly 99% by convention.
Increasing n after the experiment has been performed is often impossible. Therefore, the sample
size should be chosen before the experiment, such that under the model assumptions the width
of a confidence interval is below a certain threshold (see Chapter 12).
The following example illustrates the concept of coverage probability. A more relevant case
will be presented in the next chapter.
R-Code 4.3 100 confidence intervals for the parameter µ, based on three different ap-
proaches (exact with known σ, approximate, and exact with unknown σ). (See Figure 4.3.)
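A minimal sketch in the spirit of such a listing, here only for the known-σ case (seed and plotting details are arbitrary):

set.seed(1)
ex.n <- 100; n <- 4; mu <- 0; sigma <- 1; alpha <- 0.05
plot(NA, xlim = c(0, ex.n), ylim = c(-3, 3), xlab = "", ylab = "", main = "sigma known")
for (i in 1:ex.n) {
  y <- rnorm(n, mean = mu, sd = sigma)
  ci <- mean(y) + c(-1, 1) * qnorm(1 - alpha/2) * sigma / sqrt(n)
  lines(c(i, i), ci, col = ifelse(ci[1] > mu | ci[2] < mu, 2, 1))   # red if mu not covered
}
abline(h = mu, lty = 2)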
Figure 4.3: Normal and t-distribution based confidence intervals for the parameter µ = 0 with σ = 1 (above) and unknown σ (middle and below). The sample size is n = 4 and the confidence level is (1 − α) = 95%. Confidence intervals which do not cover the true value zero are shown in red. (See R-Code 4.3.)

Example 4.13. Let Y1, . . . , Yn iid ∼ N(µ, σ²). If σ² is known, the following confidence intervals
that z_{1−α/2} = t_{n−1,1−α⋆/2}, i.e., the standard normal (1 − α/2)-quantile and the (1 − α⋆/2)-quantile of the t-distribution with n − 1 degrees of freedom coincide. Thus the interval has coverage probability 1 − α⋆ and, in the case of Example 4.12, 1 − α⋆ ≈ 86%, based on 1-2*uniroot(function(p) qt(p, 3)-qnorm(.025), c(0,1))$root. In Figure 4.3, we have 11 intervals marked, compared to the expected 14. ♣
c) We assume Y1, . . . , Yn iid ∼ N(µ, σ²). Let θ̂ρ = (1/ρ) Σ_{i=1}^n Yi and calculate the MSE as a function of ρ. Argue that Y is the minimum variance unbiased estimator.
Problem 4.2 (Hemoglobin levels) The following hemoglobin levels of blood samples from pa-
tients with Hb SS and Hb S/β sickle cell disease are given (Hüsler and Zimmermann, 2010):
HbSS <- c( 7.2, 7.7, 8, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7, 9.1,
9.1, 9.1, 9.8, 10.1, 10.3)
HbSb <- c( 8.1, 9.2, 10, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1)
b) Propose a statistical model for Hb SS and for Hb S/β sickle cell diseases. What are the
parameters? Indicate your random variables and parameters with subscripts SS and Sb .
d) Propose a single statistical model for both diseases. What are the parameters? Estimate
all parameters from your model based on intuitive estimators.
e) Based on boxplots and QQ-plots, is there coherence between your model and the data?
Problem 4.3 (Normal distribution with known σ²) Let X1, . . . , Xn iid∼ N(µ, σ²), with σ > 0
assumed to be known.
a) What is the distribution of $\sqrt{n}(\bar X - \mu)/\sigma$? (No formal proofs required.)
c) Determine the lower and upper bound of a confidence interval, Bl and Bu (both functions
of X̄), such that
$P\bigl(-q \le \sqrt{n}(\bar X - \mu)/\sigma \le q\bigr) = P(B_l \le \mu \le B_u).$
e) Determine an expression for the width of the confidence interval. What “elements” appearing
in a general (1 − α)-confidence interval for µ make the interval narrower?
f) Use the sickle-cell disease data from Problem 4.2 and construct 90%-confidence intervals for
the means of the HbSS and HbSβ variants (assume σ = 1).
g) Repeat problems a)–f) by replacing σ with S and making adequate and necessary changes.
Problem 4.4 (ASTs radii) We work with the inhibition diameters for imipenem and meropenem
as given in Table 4.1.
a) The histograms of the inhibition diameters in Figure 4.1 are not quite symmetric. Someone
proposes to work with square-root diameters.
Visualize the transformed data of Table 4.1 with histograms. Does such a transformation
render the data more symmetric? Does such a transformation make sense? Are there other
reasonable transformations?
b) What are estimates of the inhibition area for both antibiotics (in cm²)?
c) Construct a 90% confidence interval for the inhibition area of both antibiotics.
d) What is an estimate of the variability of the inhibition area for both antibiotics? What is
the uncertainty of this estimate?
Problem 4.5 (Geometric distribution) In the setting of Example 2.3, denote p = P(shot is
successful). Assume that it took the boy k1, k2, . . . , kn attempts for the 1st, 2nd, . . . , nth
successful shot.
a) Derive the method of moments estimator of p. Argue that this estimator is also the least
squares estimator.
c) The distribution of the estimator $\hat p = n\big/\sum_{i=1}^{n} k_i$ is non-trivial, and thus we use a simulation
approach to assess the uncertainty in the estimate. For n = 10 and p = 0.1, draw from
the geometric distribution (rgeom(...)+1) and report the estimate. Repeat R = 500
times and discuss the histogram of the estimates. What changes if p = 0.5 or p = 0.9?
d) Do you expect the estimator in c) to be biased? Can you support your claim with a
simulation?
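A possible starting point for the simulation in c) and d) (a sketch, not a full solution):
set.seed(14)
n <- 10; p <- 0.1; R <- 500
phat <- replicate(R, {
  k <- rgeom(n, prob=p) + 1       # number of attempts until each successful shot
  n / sum(k)                      # estimate of p from one simulated sample
})
hist(phat)                        # distribution of the estimates
mean(phat) - p                    # rough indication of the bias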
Problem 4.6 (Poisson Distribution) Consider X1, . . . , Xn iid∼ Pois(λ) with a fixed λ > 0.
a) Let $\hat\lambda = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ be an estimator of λ. Calculate $E(\hat\lambda)$, $Var(\hat\lambda)$ and the $MSE(\hat\lambda)$.
Problem 4.7 (Coverage probability) Let Y1, . . . , Yn iid∼ N(µ, σ²) and consider the usual estimator
S² for σ². Show that the coverage probability of (4.40) is given by
$1 - P\Bigl(W \ge \dfrac{n-1}{1 - z_{1-\alpha/2}\sqrt{2/n}}\Bigr) - P\Bigl(W \le \dfrac{n-1}{1 + z_{1-\alpha/2}\sqrt{2/n}}\Bigr)$
with $W \sim \chi^2_{n-1}$. Plot the coverage probability as a function of n.
Problem 4.8 (Germany cancer counts) The dataset Oral is available in the R package spam
and contains oral cavity cancer counts for 544 districts in Germany.
a) Load the data and take a look at its help page using ?Oral.
c) The Poisson distribution is common for modeling rare events such as deaths caused by cavity
cancer (column Y in the data). However, the districts differ greatly in their populations.
Define a subset of the data which only considers districts with expected fatal casualties
caused by cavity cancer between 35 and 45 (subset, column E). Perform a Q-Q plot for a
Poisson distribution.
Hint: use qqplot() from the stats package and define the theoretical quantiles with
qpois(ppoints( ...), lambda=...).
Simulate a Poisson distributed random variable with the same length and the same
lambda as your subset. Perform a QQ-plot of your simulated data. What can you say
about the distribution of your subset of the cancer data?
d) Assume that the standardized mortality ratio Zi = Yi/Ei is normally distributed, i.e.,
Z1, . . . , Z544 iid∼ N(µ, σ²). Estimate µ and give a 95% (exact) confidence interval (CI).
What is the precise meaning of the CI?
e) Simulate a 95% confidence interval based on the following bootstrap scheme (sampling with
replacement):
Repeat 10′000 times: draw a sample of the same size with replacement from the observed
values and store its mean.
Construct the confidence interval by taking the 2.5% and the 97.5% quantiles of the stored
means.
Compare it to the CI from d).
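A possible starter sketch for this bootstrap (not a full solution; the column names Y and E are taken from the problem statement):
library(spam)                          # the Oral dataset is provided by the spam package
data(Oral)
z <- Oral$Y / Oral$E                   # standardized mortality ratios Z_i = Y_i / E_i
set.seed(14)
boot.means <- replicate(10000, mean(sample(z, replace=TRUE)))  # resample and store the means
quantile(boot.means, c(0.025, 0.975))  # bootstrap 95% confidence interval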
Problem 4.9 (BMJ Endgame) Discuss and justify the statements about ‘Describing the spread
of data’ given in doi.org/10.1136/bmj.c1116.
Chapter 5
Statistical Testing
⋄ Be aware of the multiple testing problem and know how to deal with it
Recently I replaced an LED light bulb that was claimed to last 20 000 hours. However,
in less than 2 months (and only a fraction thereof in use) the bulb was already broken and I
immediately asked myself whether I was unlucky or whether the claim is simply exaggerated. A few moments
later rationality kicked back in and - being a statistician - I knew that one individual failure should
not be used to make too general statements.
In a similar spirit, suppose we observe 13 heads in 17 tosses of a coin. When tossing a
fair coin, I expect somewhere between seven and ten heads, and the 13 observed heads seemingly
represent an unusual case. We intuitively wonder if the coin is fair.
In this chapter we discuss a formal approach to answer whether the observed data provide enough
evidence against a hypothesis (against a claimed lifetime or against a claimed fairness). We
introduce two interlinked types of statistical testing procedures and provide a series of tests that
can be used off-the-shelf.
Definition 5.1. The p-value is the probability under the distribution of the null hypothesis of
obtaining a result equal to or more extreme than the observed result. ♢
Example 5.1. We assume a fair coin is tossed 17 times and we observe 13 heads. Under
the null hypothesis of a fair coin, each toss is a Bernoulli random variable and the 17 tosses
can be modeled with a binomial random variable Bin(n = 17, p = 1/2). Hence, the p-value
is the probability of observing 0, 1, . . . , 4, 13, 14, . . . , 17 heads (or by symmetry of observing
17, . . . , 13, 4, . . . , 0), which can be calculated with sum( dbinom(0:4, size=17, prob=1/2) +
dbinom(13:17, size=17, prob=1/2)) and is 0.049. The p-value indicates that we observe such
a seemingly unlikely event roughly every 20th time.
Note that because of the symmetry of the binomial distribution at Bin(n, 1/2), we can alterna-
tively calculate the p-value as 2*pbinom(4, size=17, prob=1/2) or equivalently as 2*pbinom(12,
size=17, prob=1/2, lower.tail=FALSE).
In this example, we have considered more extreme as very many or very few heads. There
might be situations, where very few heads is not relevant or does not even make sense and thus
“more extreme” corresponds only to observing 13, 14, . . . , 17 heads. ♣
Figure 5.1 illustrates graphically the p-value in two hypothetical situations. Suppose that
under the null hypothesis the hypothetical distribution of the observed result is Gaussian with
mean zero and variance one and suppose that we observe a value of 1.8. If more extreme is con-
sidered on both sides of the tails of the density then the p-value consists of two probabilities (here
because of the symmetry, twice the probability of either side). If more extreme is actually larger
(possibly smaller in other situations), the p-value is calculated based on a one-sided probability.
As the Gaussian distribution is symmetric around its mean, the two-sided p-value is twice the
one-sided p-value, here 1-pnorm(1.8), or, equivalently, pnorm(1.8, lower.tail=FALSE).
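For the hypothetical observation 1.8 of Figure 5.1, the two p-values can be computed as a direct transcription of the text:
obs <- 1.8                               # observed value under a N(0,1) null distribution
2 * pnorm(obs, lower.tail=FALSE)         # two-sided p-value, approximately 0.072
pnorm(obs, lower.tail=FALSE)             # one-sided p-value, approximately 0.036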
We illustrate the statistical testing approaches and the statistical tests with data introduced
in the following example.
Example 5.2. In rabbits, pododermatitis is a chronic multifactorial skin disease that manifests
mainly on the hind legs. This presumably progressive disease can cause pain leading to poor
welfare. To study the progression of this disease on the level of individual animals, scientists
Figure 5.1: Illustration of the p-value in the case of a standard normal distribution
with observed value 1.8. Two-sided setting with p-value 0.072 (left panel) and one-sided
setting with p-value 0.036 (right panel).
assessed many rabbits in three farms over the period of an entire year (Ruchti et al., 2019). We
use a subset of the dataset in this and later chapters, consisting of one farm (with two barns) and
four visits (between July 19/20, 2016 and June 29/30, 2017). The 6 stages from Drescher and
Schlender-Böbbis (1996) were used as a tagged visual-analogue-scale to score the occurrence and
severity of pododermatitis on 4 spots on the rabbits' hind legs (left and right, heel and middle
position), resulting in the variable PDHmean with range 0–10; for details on the scoring see Ruchti
et al. (2018). We consider the visits in June 2017. R-Code 5.1 loads the dataset and subsets it
correspondingly. ♣
In practice, we often start with a scientific hypothesis and subsequently collect data or per-
form an experiment to confirm the hypothesis. The data is then “modeled statistically”, in the
sense that we need to determine a theoretical distribution for which the data is a realization. In
our discussion here, the distribution typically involves parameters that are linked to the scientific
question (probability p in a binomial distribution for coin tosses, mean µ of a Gaussian distri-
bution for testing differences pododermatitis scores). We then formulate the null hypothesis H0
for which the data should provide evidence against. The calculation of a p-value can be summa-
rized as follows. When testing about a certain parameter, say θ, we use an estimator θb for that
parameter. We often need to transform the estimator such that the distribution thereof does
not depend on (the) parameter(s). We call this random variable test statistic which is typically
a function of the random sample. The test statistic evaluated at the observed data is then used
to calculate the p-value based on the distribution of the test statistic. Based on the p-value we
summarize the evidence against the null hypothesis. We cannot make any statement for the
hypothesis.
Example 5.3 (continuation of Example 5.2). For the visits in June 2017 we would like to assess
if the score of the rabbits is comparable to 10/3 ≈ 3.33, representing a low-grade scoring (low-grade
hyperkeratosis, hypotrichosis or alopecia). We have 17 observations and the sample mean
is 3.87 with a standard deviation of 0.64. Is there enough evidence in the data to claim that the
observed mean is different from low-grade scoring?
We postulate a Gaussian model for the scores. The observations are a realization of
X1, . . . , X17 iid∼ N(µ, 0.8²), i.e., n = 17 and the standard deviation is known (the latter will be relaxed
soon; here we suppose that this value has been determined by another study or additional
information). The scientific hypothesis results in the null hypothesis H0: “mean is low-grade
scoring”, being equivalent to H0: µ = 3.33. Under the null hypothesis, we have X̄ ∼
N(3.33, 0.8²/17). Hence, we test about the parameter µ and our estimator is θ̂ = X̄ with a
known distribution under the null. With this information, the p-value in a two-sided setting is
$\text{p-value} = P(\text{under the null hypothesis we observe 3.87 or a more extreme value})$   (5.1)
$= 2\,P_{H_0}(|\bar X| \ge |\bar x|) = 2\,\bigl(1 - P_{H_0}(\bar X < 3.87)\bigr)$   (5.2)
$= 2\,\Bigl(1 - P\Bigl(\dfrac{\bar X - 3.33}{0.8/\sqrt{17}} < \dfrac{3.87 - 3.33}{0.8/\sqrt{17}}\Bigr)\Bigr) = 2\,\bigl(1 - \Phi(2.76)\bigr) \approx 0.6\%,$   (5.3)
where we have used the subscript “H0” to emphasize that the calculation is under the null hypothesis,
i.e., µ = 3.33. An alternative is to write it in a conditional form, PH0( · ) = P( · | H0). Hence,
there is evidence in the data against the null hypothesis. ♣
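The calculation in (5.1)–(5.3) can be reproduced directly in R with the values given in the example:
xbar <- 3.87; mu0 <- 3.333; sigma <- 0.8; n <- 17
zobs <- (xbar - mu0) / (sigma / sqrt(n))     # observed test statistic, approximately 2.77
2 * (1 - pnorm(zobs))                        # two-sided p-value, approximately 0.006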
In many cases we use a “known” statistical test, instead of a “manually” constructed test
statistic. Specifically, we state the statistical model with a well-known and named test, as we
shall see later.
Some authors summarize p-values in [1, 0.1] as no evidence, in [0.1, 0.01] as weak evidence,
in [0.01, 0.001] as substantial evidence, and smaller ones as strong evidence (see, e.g., Held and
Sabanés Bové, 2014). For similar ranges, R output uses the symbols ‘ ’, ‘.’, ‘*’, ‘**’, and ‘***’
in certain representations.
The workflow of a statistical significance test can be summarized as follows. The starting
point is a scientific question or hypothesis and data that has been collected to support the
scientific claim.
(i) Formulation of the statistical model and statistical assumptions. Formulate the scientific
hypothesis in terms of a statistical one.
(ii) Selection of the appropriate test or test statistic and formulation of the null hypothesis H0
with the parameters of the test.
(iii) Calculation of the p-value.
(iv) Interpretation of the results of the statistical test and conclusion.
Although the workflow is presented in a linear fashion, there are several dependencies. For
example, the interpretation depends not only on the p-value but also on the null hypothesis in
terms of proper statistical formulation and, finally, on the scientific question to be answered; see
Figure 5.2. The statistical test hypothesis essentially depends on the statistical assumptions, but
needs to be cast to answer the scientific question, of course. The statistical assumptions may
also determine the selection of statistical tests. This dependency will be taken up in Chapter 7.
Figure 5.2: Workflow of a statistical significance test, starting from the scientific question
or hypothesis and the data from an experiment.
Example 5.4 (Revisiting Example 5.1 using the workflow of Figure 5.2). The (scientific) claim
is that the coin is biased and as data we have 17 coin tosses with 13 heads. (i) As the data is
the number of successes among a fixed number of trials, a binomial model is appropriate. The
test hypothesis that we want to reject is an equal chance of getting head or tail. (ii) We can work
directly with X ∼ Bin(17, p) with the null hypothesis H0: p = 1/2. (iii) The p-value is 0.049.
(iv) The p-value indicates that we observe such a seemingly unlikely event roughly every 20th
time, providing only weak evidence against the null hypothesis. ♣
Example 5.5 (Revisiting Example 5.2 using the workflow of Figure 5.2). The scientific claim
is that the observed score is different from low-grade. We have four scores from the hind legs of
17 animals. (i) We work with one average value per animal (PDHmean). Thus we assume that
X1, . . . , X17 iid∼ N(µ, 0.8²). We want to quantify how much the observed mean differs from the low-grade score.
(ii) We have X̄ ∼ N(µ, 0.8²/17), with the null hypothesis H0: µ = 3.33. (iii) The p-value is
p-value = 2 PH0(|X̄| ≥ |x̄|) ≈ 0.6%. (iv) There is substantial evidence that the pododermatitis
scores are different from low-grade scoring. ♣
Example 5.6. We revisit the light bulb situation elaborated at the beginning of the chapter.
I postulate a null hypothesis H0: “median lifetime is 20 000 h” versus the alternative hypothesis
H1: “median lifetime is 5 000 h”. I only have one observation and thus I need external information
about the distribution of light bulb lifetimes. Although there is no consensus, some published
literature claims that the cdf of certain types of light bulbs is given by F(x) = 1 − exp(−(x/λ)^k)
for x > 0 and k between 4 and 5. For simplicity we take k = 4 and thus the median lifetime is
λ log(2)^{1/4}. The hypotheses are thus equivalent to H0: λ = 20 000/log(2)^{1/4} h versus H1: λ =
5 000/log(2)^{1/4} h. ♣
In the example above, I could have taken any other value for the alternative. Of course this
is dangerous and very subjective (the null hypothesis is given by the company's claim). Therefore,
we state the alternative hypothesis as everything but the null hypothesis. In the example above
it would be H1: λ ≠ 20 000/log(2)^{1/4} h.
In a similar fashion, we could state that the median lifetime is at least 20 000 h. In such
a setting we would have a null hypothesis H0 : “median lifetime is 20 000 h or larger” versus
the alternative hypothesis H1: “median lifetime smaller than 20 000 h”, which is equivalent to
H0: λ ≥ 20 000/log(2)^{1/4} h versus H1: λ < 20 000/log(2)^{1/4} h.
Hypotheses are classified as simple if the parameter θ assumes only a single value (e.g., H0:
θ = 0), or composite if the parameter θ can take on a range of values (e.g., H0: θ ≤ 0 or H1: θ ≠ θ0).
The case of a simple null hypothesis and a composite alternative hypothesis is also called a
two-sided setting, whereas a composite null hypothesis with a composite alternative hypothesis is
called a one-sided or directional setting.
Note that for Example 5.2 a one-sided test is necessary for the hypothesis “there is a progression
of the pododermatitis scores between two visits”, but a two-sided test is needed for “the
pododermatitis scores between two visits are different”. We strongly recommend always using
two-sided tests (e.g., Bland and Bland, 1994; Moyé and Tita, 2002), not only in clinical studies
where it is the norm, but also because, as Bland and Bland (1994) state, “a one-sided test is appropriate when
a large difference in one direction would lead to the same action as no difference at all. Expectation
of a difference in a particular direction is not adequate justification.” However, to illustrate
certain concepts, a one-sided setting may be simpler and more accessible.
In the case of a hypothesis test, we compare the value of the test statistic with the quantiles
of the distribution under the null hypothesis. A predefined threshold determines whether we reject H0;
if not, we fail to reject H0.
Definition 5.2. The significance level α is a threshold determined before the testing, with
0 < α < 1; it is often set to 5% or 1%.
The rejection region of a test includes all values of the test statistic for which we reject the
null hypothesis. The boundary values of the rejection region are called critical values. ♢
Similarly to the significance test, we reject for values of the test statistic that are in the tail
of the density under the null hypothesis, i.e., values that would lead to a small p-value. In fact, we can
base our decision on whether or not the p-value is smaller than the significance level.
It is important to realize that the level α is set by the scientists, not by the experiment or
the data. Therefore there is some “arbitrariness” to the value and thus to whether we reject H0
or not. Different scientific domains may impose other conventional values for α.
Figure 5.3: Critical values (red) and rejection regions (orange) for a two-sided H0: µ =
µ0 = 0 (left) and a one-sided H0: µ ≤ µ0 = 0 (right) hypothesis test with significance
level α = 5%.
In significance testing two types of errors can occur. Type I error: we reject H0 when we should
not have; Type II error: we fail to reject H0 when we should have. The framework of hypothesis
testing allows us to express the probabilities of committing these two errors. The probability of
committing a Type I error is exactly α = P(reject H0 | H0). This probability is often called the
size of the test. To calculate the probability of committing a Type II error, we need to assume
a specific value for our parameter within the alternative hypothesis, e.g., a simple alternative.
The probability of Type II error is often denoted with β = P(not rejecting H0 |H1 ). Table 5.1
summarizes the errors in a classical 2 × 2 layout.
Table 5.1: Probabilities of Type I and Type II errors in the setting of a significance
test.
Example 5.7 (revisit Example 5.1). The null hypothesis remains as having a fair coin and
the alternative is simply not having a fair coin. Suppose we reject the null hypothesis if we
observe 0, . . . , 4 or 13, . . . , 17 heads out of 17 tosses. The Type I error is 2*pbinom(4, size=17,
prob=1/2), i.e., 0.049, and, if the coin has a probability of 0.7 for heads, the Type II error is
sum(dbinom(5:12, size=17, prob=0.7)), i.e., 0.611. However, if the coin has a probability
of 0.6 for heads, the Type II error increases to sum(dbinom(5:12, size=17, prob=0.6)), i.e.,
0.871. ♣
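The three probabilities of the example can be reproduced in a single small block:
alpha <- 2 * pbinom(4, size=17, prob=1/2)       # Type I error of this rejection rule, 0.049
beta1 <- sum(dbinom(5:12, size=17, prob=0.7))   # Type II error if true P(head) = 0.7, 0.611
beta2 <- sum(dbinom(5:12, size=17, prob=0.6))   # Type II error if true P(head) = 0.6, 0.871
c(alpha, beta1, beta2)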
Ideally we would like to use tests that simultaneously have small Type I and Type II errors.
This is conceptually not possible as reducing one increases the other, and one typically fixes the
Type I error to some small value, say 5%, 1% or suchlike (committing a Type I error typically has
more severe consequences than a Type II error). Type I and Type II errors are shown
in Figure 5.4 for two different alternative hypotheses. When reducing the significance level α,
the critical values move further from the center of the density under H0 and thus lead to an increase
of the Type II error β. Additionally, the clearer the separation of the densities under H0 and
H1, the smaller the Type II error β. This is intuitive: if the data stems from an H1 which is “far”
from H0, the chance that we reject is large.
As a summary, the Type I error probability is fixed by the chosen significance level, whereas the Type II error probability depends on the specific alternative.
The value 1 − β is called the power of a test. High power of a test is desirable in an experiment:
we want to detect small effects with a large probability. R-Code 5.2 computes the power for a
z-test (Gaussian random sample with known variance). More specifically, under the assumption
of σ/√n = 1 we test H0: µ = µ0 = 0 versus H1: µ ≠ µ0. Similarly to the probability of a Type II
error, the power can only be calculated for a specific assumption about the “actual” mean µ1, i.e., of
a simple alternative. Thus, as typically done, Figure 5.5 plots the power as a function of µ1 − µ0.

R-Code 5.2 A one-sided and two-sided power curve for a z-test. (See Figure 5.5.)
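A minimal sketch of such a power computation, under σ/√n = 1 and α = 5% (in the spirit of R-Code 5.2, not necessarily identical to it):
alpha <- 0.05
delta <- seq(-1, 4, by=0.01)                      # mu_1 - mu_0, in units of sigma/sqrt(n) = 1
pow2 <- pnorm(qnorm(alpha/2), mean=delta) +       # two-sided z-test power
        pnorm(qnorm(1-alpha/2), mean=delta, lower.tail=FALSE)
pow1 <- pnorm(qnorm(1-alpha), mean=delta, lower.tail=FALSE)   # one-sided power
plot(delta, pow1, type="l", col=4, ylab="Power", xlab=expression(mu[1]-mu[0]))
lines(delta, pow2, lty=2)                         # two-sided curve
abline(h=alpha, col="gray")                       # level of the test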
For µ1 = µ0, the power is equivalent to the size of the test (significance level α).

Figure 5.4: Type I error with significance level α (red) and Type II error with probability
β (blue) for two different alternative hypotheses (µ = 2 top row, µ = 4 bottom
row) with the two-sided hypothesis H0: µ = µ0 = 0 (left column) and the one-sided
hypothesis H0: µ ≤ µ0 = 0 (right column).
Figure 5.5: Power curves for a z-test: one-sided (blue solid line) and two-sided (black
dashed line). The gray line represents the level of the test, here α = 5%. The vertical
lines represent the alternative hypotheses µ = 2 and µ = 4 of Figure 5.4. (See R-Code 5.2.)
If µ1 − µ0 increases, the power increases in a sigmoid-shaped form and asymptotically reaches one. For a two-sided
test, the power is symmetric around µ1 = µ0 (no direction preferred) and smaller than the
power of a one-sided test. The latter decreases further to zero for negative differences µ1 − µ0,
although for negative values the power curve does not make sense in this one-sided setting. In a similar fashion as
for reducing the probability of a Type II error, it is possible to increase the power by increasing
the sample size. Note that in the illustrations above we work with σ/√n = 1; the dependence of the
power on n is somewhat hidden by using the default argument for sd in the functions pnorm()
and qnorm().
Remark 5.1. It is impossible to simultaneously reduce both the Type I and Type II error probabilities,
but it is sensible to consider tests that have the smallest possible Type II error (or
the largest possible power) for a fixed significance level. More theoretical treatises discuss the
existence and construction of uniformly most powerful tests. Note that the latter do not exist
for two-sided settings or for tests involving more than one parameter. Most tests that we discuss
in this book are “optimal” (the one-sided version being uniformly most powerful under the
statistical assumptions). ♣
The workflow of a hypothesis test is very similar to that of a statistical significance test
and only points (ii) and (iii) need to be slightly modified:
(i) Formulation of the statistical model and statistical assumptions. Formulate the scientific
hypothesis in terms of a statistical one.
(ii) Selection of the appropriate test or test statistic, significance level and formulation of the
null hypothesis H0 and the alternative hypothesis H1 with the parameters of the test.
(iii) Calculation of the test statistic value (or p-value), comparison with the critical value (or level),
and decision.
The choice of test is again constrained by the assumptions. The significance level must,
however, always be chosen before the computations.
The value of the test statistic, say tobs , calculated in step (iii) is compared with critical values
tcrit in order to reach a decision. When the decision is based on the calculation of the p-value,
it consists of a comparison with α. The p-value can be difficult to calculate, but is valuable
because of its direct interpretation as the strength (or weakness) of the evidence against the null
hypothesis.
We formulate our scientific question in general terms under Question. The statistical
or formal assumptions summarizing the statistical model are given under
Assumptions. Generally, two-sided tests are performed.
For most tests, there is a corresponding R function. The arguments x, y usually
represent vectors containing the data and alpha the significance level. From the
output, it is possible to get the p-value.
In this book we consider typical settings and discuss common and appropriate tests. The
choice of test is primarily dependent on the quantity being tested (location, scale, frequencies,
. . . ) and secondly on the statistical model and assumptions. The tests will be summarized in
yellowish boxes similar to the one given here. The following list of tests can be used as a decision tree.
Of course this list is not exhaustive and many additional possible tests exist and are frequently
used. Moreover, the approaches described in the first two sections allow us to construct
arbitrary tests.
We present several of these tests in more detail by motivating the test statistic, giving an
explicit example and by summarizing the test in a yellow box. Ultimately, we perform tests with
a single call in R. However, the underlying mechanism has to be understood; it would be too
dangerous to use statistical tests as black-box tools only.
We revisit the setting of comparing the sample mean with a hypothesized value (e.g., the observed
pododermatitis score with the value 3.33). As stated above, Y1, . . . , Yn iid∼ N(µ, σ²) with
parameter of interest µ but now unknown σ². Thus, from
$\bar Y \sim N(\mu, \sigma^2/n) \ \Longrightarrow\ \dfrac{\bar Y - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \ \overset{\sigma\ \text{unknown}}{\Longrightarrow}\ \dfrac{\bar Y - \mu}{S/\sqrt{n}} \sim t_{n-1}.$   (5.4)
The null hypothesis H0 : µ = µ0 specifies the hypothesized mean and the distribution in (5.4).
This test is typically called the “one-sample t-test”, for obvious reasons. To calculate p-values
the function pt(..., df=n-1) (for sample size n) is used.
The box Test 1 summarizes the test and Example 5.8 illustrates the test based on the podo-
dermatitis data.
Question: Does the sample mean deviate significantly from the postulated value of the
(unknown) mean?
Example 5.8 (continuation of Example 5.2). We test the hypothesis that the animals have
a different pododermatitis score than low-grade hyperkeratosis, corresponding to 3.333. The
sample mean is larger and we want to know if the difference is large enough for a statistical
claim.
The statistical null hypothesis is that the mean score is equal to 3.333 and we want to know
if the mean of the (single) sample deviates from a specified value, sufficiently for a statistical
claim. Although there might be a preferred direction of the test (higher score than 3.333), we
perform a two-sided hypothesis test. From R-Code 5.1 we have for the sample mean x̄ = 3.869,
sample standard deviation s = 0.638 and sample size n = 17. Thus,
H0: µ = 3.333 versus H1: µ ≠ 3.333;
$t_{obs} = \dfrac{|3.869 - 3.333|}{0.638/\sqrt{17}} = 3.467$;
$t_{crit} = t_{16,\,1-0.05/2} = 2.120$;   p-value: 0.003.
Formally, we can reject our H0 because tobs > tcrit . The p-value can be calculated with 2*(1-pt(
tobs, n-1)) with tobs defined as 3.467. This calculation is equivalent to 2*pt( -tobs, n-1).
The p-value is low and hence there is substantial evidence against the null hypothesis.
R-Code 5.3 illustrates the direct testing in R with the function t.test() and subsequent
extraction of the p-value. ♣
The returned object of the function t.test() (as well as of virtually all other test functions
we will see) is of class htest. Hence, the output looks always similar and is summarized in
Figure 5.6 for the particular case of the example above.
Comparing means of two different samples is probably the most often used statistical test. To
introduce this test, we assume that both random samples are normally distributed with equal
sample size and variance, i.e., X1, . . . , Xn iid∼ N(µx, σ²), Y1, . . . , Yn iid∼ N(µy, σ²). Further, we
R-Code 5.3 One sample t-test, pododermatitis (see Example 5.8 and Test 1)
print( out <- t.test( PDHmean, mu=3.333)) # print the result of the test
##
## One Sample t-test
##
## data: PDHmean
## t = 3.47, df = 16, p-value = 0.0032
## alternative hypothesis: true mean is not equal to 3.333
## 95 percent confidence interval:
## 3.5413 4.1969
## sample estimates:
## mean of x
## 3.8691
out$p.val # printing only the p-value
## [1] 0.0031759
Figure 5.6: Annotated output of the t.test() call of R-Code 5.3; among others, the output
states the alternative hypothesis (not equal, less, or greater).
assume that both random samples are independent. Under these assumptions we have
$\bar X \sim N\Bigl(\mu_x, \dfrac{\sigma^2}{n}\Bigr),\ \ \bar Y \sim N\Bigl(\mu_y, \dfrac{\sigma^2}{n}\Bigr) \ \Longrightarrow\ \bar X - \bar Y \sim N\Bigl(\mu_x - \mu_y, \dfrac{2\sigma^2}{n}\Bigr)$   (5.5)
$\Longrightarrow\ \dfrac{\bar X - \bar Y - (\mu_x - \mu_y)}{\sigma/\sqrt{n/2}} \sim N(0, 1) \ \overset{H_0:\,\mu_x = \mu_y}{\Longrightarrow}\ \dfrac{\bar X - \bar Y}{\sigma/\sqrt{n/2}} \sim N(0, 1).$   (5.6)
As often in practice, we do not know σ and we have to estimate it. One possible estimate is
the so-called pooled estimate s²p = (s²x + s²y)/2, where s²x and s²y are the variance estimates of the
two samples. When using the estimator S²p in the right-hand expression of (5.6), the distribution
of $(\bar X - \bar Y)\big/(S_p/\sqrt{n/2})$ is a t-distribution with 2n − 2 degrees of freedom. This result is not
surprising (up to the degrees of freedom) but somewhat difficult to show formally.
If the sample sizes are different, we need to adjust the pooled estimate and the form is slightly
more complicated (see Test 2). As the calculation of s²p requires the estimates of µx and µy, we adjust
the degrees of freedom to 2n − 2 or nx + ny − 2 in the case of different sample sizes.
The following example revisits the pododermatitis data again and compares the scores be-
tween the two different barns.
Assumptions: Both samples are normally distributed with the same unknown variance.
The samples are independent.
Calculation: $t_{obs} = \dfrac{|\bar x - \bar y|}{s_p\,\sqrt{1/n_x + 1/n_y}} = \dfrac{|\bar x - \bar y|}{s_p} \cdot \sqrt{\dfrac{n_x \cdot n_y}{n_x + n_y}}$,
where $s_p^2 = \dfrac{1}{n_x + n_y - 2} \cdot \bigl((n_x - 1)\,s_x^2 + (n_y - 1)\,s_y^2\bigr)$.
Decision: Reject $H_0: \mu_x = \mu_y$ if $t_{obs} > t_{crit} = t_{n_x+n_y-2,\,1-\alpha/2}$.
Example 5.9 (continuation of Example 5.2). We question whether the pododermatitis scores of the
two barns are significantly different (means 3.83 and 3.67; standard deviations 0.88 and 0.87;
sample sizes 20 and 14). Hence, using the formulas given in Test 2, we have
H0: µx = µy versus H1: µx ≠ µy;
$s_p^2 = \dfrac{1}{20 + 14 - 2}\bigl(19 \cdot 0.884^2 + 13 \cdot 0.868^2\bigr) = 0.770$, i.e., $s_p = 0.878$;
$t_{obs} = \dfrac{|3.826 - 3.675|}{0.878} \cdot \sqrt{\dfrac{20 \cdot 14}{20 + 14}} = 0.494$;
$t_{crit} = t_{32,\,1-0.05/2} = 2.037$;   p-value: 0.625.
Hence, 3.826 and 3.675 are not statistically different. See also R-Code 5.4, where we use again the
function t.test(), but with two data vectors and the argument var.equal=TRUE. ♣
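The numbers of Example 5.9 can be reproduced from the summary statistics alone (a small sketch; with the full data one would call t.test() with var.equal=TRUE as in R-Code 5.4):
nx <- 20; ny <- 14; mx <- 3.826; my <- 3.675; sx <- 0.884; sy <- 0.868
sp2 <- ((nx-1)*sx^2 + (ny-1)*sy^2) / (nx + ny - 2)      # pooled variance estimate
tobs <- abs(mx - my) / sqrt(sp2) * sqrt(nx*ny/(nx+ny))  # observed test statistic, approx. 0.49
2 * pt(tobs, df=nx+ny-2, lower.tail=FALSE)              # two-sided p-value (compare with 0.625)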
In practice, we often have to assume that the variances of the two samples are different, say σ²x
and σ²y. In such a setting, we have to normalize the mean difference by $\sqrt{s_x^2/n_x + s_y^2/n_y}$. While
this estimate seems simpler than the pooled estimate sp, the degrees of freedom of the resulting
t-distribution are difficult to derive, and we refrain from elaborating them here (Problem 5.1.d gives some
insight). In the literature, this test is called Welch's two sample t-test and it is actually the default
choice of t.test( x, y).
In many situations we have paired measurements: at two different time points, before and after
a treatment or intervention, from twins, etc. Analyzing the two time points should be done
on an individual level and not via the difference of the sample means of the paired samples.
The assumption of independence of both samples in the previous Test 2 may not be valid if
the two samples consist of two measurements of the same individual, e.g., observations at two
different instances of time. In such settings, where we have a “before” and an “after” measurement, it
R-Code 5.4 Two-sample t-test with independent samples, pododermatitis (see Exam-
ple 5.9 and Test 2).
would be better to take this pairing into account, by considering differences only instead of two
samples. Hence, instead of constructing a test statistic based on X − Y we consider
$X_1 - Y_1, \dots, X_n - Y_n \overset{\text{iid}}{\sim} N(\mu_x - \mu_y, \sigma_d^2) \ \Longrightarrow\ \bar X - \bar Y \sim N\Bigl(\mu_x - \mu_y, \dfrac{\sigma_d^2}{n}\Bigr)$   (5.7)
$\Longrightarrow\ \dfrac{\bar X - \bar Y}{\sigma_d/\sqrt{n}} \overset{H_0:\,\mu_x = \mu_y}{\sim} N(0, 1),$   (5.8)
where σd2 is essentially the sum of the variances minus the “dependence” between Xi and Yi . We
formalize this dependence, called covariance, starting in Chapter 8.
The paired two-sample t-test can thus be considered a one sample t-test of the differences
with mean µ0 = 0.
Example 5.10. We consider the pododermatitis measurements from July 2016 and June 2017
and test if there is a progression over time. We have the following summaries for the differences
(see R-Code 5.5 and Test 3). Mean d¯ = 0.21; standard deviation sd = 1.26; and sample size n =
17.
H0: δ = 0 versus H1: δ ≠ 0, or equivalently H0: µx = µy versus H1: µx ≠ µy;
$t_{obs} = \dfrac{|0.210|}{1.262/\sqrt{17}} = 0.687$;
$t_{crit} = t_{16,\,1-0.05/2} = 2.12$;   p-value: 0.502.
There is no evidence that there is a progression over time. ♣
Question: Are the means x̄ and ȳ of two paired samples significantly different?
Assumptions: The samples are paired. The differences are normally distributed
with unknown mean δ. The variance is unknown.
Calculation: $t_{obs} = \dfrac{|\bar d\,|}{s_d/\sqrt{n}}$, where
• $d_i = x_i - y_i$ is the i-th observed difference,
• $\bar d$ and $s_d$ are the mean and the standard deviation of the differences $d_i$.
R-Code 5.5 Two-sample t-test with paired samples, pododermatitis (see Example 5.10
and Test 3).
podoV1V13 <- podo[podo$Visit %in% c(1,13),] # select visits from 2016 and 2017
PDHmean2 <- matrix(podoV1V13$PDHmean[order(podoV1V13$ID)], ncol=2, byrow=TRUE)
t.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE)
##
## Paired t-test
##
## data: PDHmean2[, 2] and PDHmean2[, 1]
## t = 0.687, df = 16, p-value = 0.5
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.43870 0.85929
## sample estimates:
## mean difference
## 0.21029
# Same result as with
# t.test( PDHmean2[,2] - PDHmean2[,1])
To compare the variances of two Gaussian samples, we consider the ratio of the two variance
estimates, and the distribution thereof is an F-distribution (see also Section 3.2.3, with quantile, density,
and distribution functions implemented in R as [q,d,p]f). This “classic” F-test is given in Test 4.
In later chapters we will see more natural settings where we need to compare variances.
Question: Are the variances $s_x^2$ and $s_y^2$ of two samples significantly different?
Decision: Reject $H_0: \sigma_x^2 = \sigma_y^2$ if $F_{obs} > F_{crit}$, where $F_{crit}$ is the $1 - \alpha/2$ quantile of
an F-distribution with $n_x - 1$ and $n_y - 1$ degrees of freedom, or if $F_{obs} < F_{crit}$,
where $F_{crit}$ is the $\alpha/2$ quantile of an F-distribution with $n_x - 1$ and $n_y - 1$
degrees of freedom.
Example 5.11 (continuation of Example 5.2). As shown by R-Code 5.6, the data provide no
evidence against the null hypothesis that the pododermatitis mean scores of the two barns have
equal variances. ♣
R-Code 5.6 Comparison of two variances, PDH (see Example 5.11 and Test 4).
It is important to note that for the two-sample t-test we should not first test whether the variances
are equal and then decide whether to use the classical or Welch's two-sample t-test. With
such a sequential approach, we cannot maintain the nominal significance level (we use the same
data for several tests; see also Section 5.5.2). It should rather be the experimental setup that
argues whether the variances should conceptually be equivalent (see also Chapter 12).
Remark 5.2. When flipping the samples in Test 4, the value of the observed test statistic and
the associated confidence interval of σ²y/σ²x change. However, the p-value of the test remains
the same; see the output of var.test( PDHmeanB2, PDHmeanB1). This is because if W ∼ Fn,m
then 1/W ∼ Fm,n. ♣
Example 5.12 (reconsider the situation from Example 5.8). The 95%-confidence interval for
the mean µ is [ 3.54, 4.20 ]. Since the value of µ0 = 3.33 is not in this interval, the null hypothesis
H0 : µ = µ0 = 3.33 is rejected at level α = 5%. ♣
In R, most test functions return the corresponding confidence interval (named element
$conf.int of the returned list) together with the value of the statistic ($statistic), the p-value
($p.value) and other information. Some test functions may require explicitly setting the additional
argument conf.int=TRUE.
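As a small sketch of accessing these components (placeholder data here; with the actual PDHmean vector one would call t.test() exactly as in R-Code 5.3):
out <- t.test(rnorm(17, mean=3.87, sd=0.64), mu=3.333)  # placeholder data for illustration
out$statistic    # value of the test statistic
out$p.value      # p-value
out$conf.int     # confidence interval (carries the attribute "conf.level")
out$estimate     # point estimate, here the sample mean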
1. p-values can indicate how incompatible the data are with a specified statistical model.
2. p-values do not measure the probability that the studied hypothesis is true, or the proba-
bility that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether
a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance
of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or
hypothesis.
P(at least one false significant result) = 1 − P(no false significant result)   (5.13)
= 1 − (1 − α)^m.   (5.14)
Table 5.2 gives the probabilities of at least one false significant result for α = 0.05 and various
m. Even for just a few tests, the probability increases drastically.
Table 5.2: Probabilities of at least one false significant test result when performing m
tests at level α = 5% (top row) and at level αnew = α/m (bottom row).
m                  1      2      3      4      5      6      8      10     20     100
1 − (1 − α)^m      0.05   0.098  0.143  0.185  0.226  0.265  0.337  0.401  0.642  0.994
1 − (1 − αnew)^m   0.05   0.049  0.049  0.049  0.049  0.049  0.049  0.049  0.049  0.049
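The entries of Table 5.2 follow directly from (5.14):
alpha <- 0.05
m <- c(1, 2, 3, 4, 5, 6, 8, 10, 20, 100)
round(1 - (1 - alpha)^m, 3)        # probability of at least one false significant result
round(1 - (1 - alpha/m)^m, 3)      # same, after replacing alpha by alpha/m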
There are several different methods that allow multiple tests to be performed while maintain-
ing the selected significance level. The simplest and most well-known of them is the Bonferroni
correction. Instead of comparing the p-value of every test to α, they are compared to a new
significance level, αnew = α/m, see second row of Table 5.2.
There are several alternative methods which, according to the situation, may be more appropriate.
We recommend using at least method="holm" (the default) in p.adjust(). For more details
see, for example, Farcomeni (2008).
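As a small illustration of p.adjust() with hypothetical raw p-values:
praw <- c(0.001, 0.012, 0.031, 0.048, 0.20)   # hypothetical raw p-values from m = 5 tests
p.adjust(praw, method="holm")                  # Holm correction (the default method)
p.adjust(praw, method="bonferroni")            # Bonferroni correction, i.e., pmin(1, m * praw)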
Hypothesizing after the results are known (HARKing) is another inappropriate scientific
practice in which a post hoc hypothesis is presented as an a priori hypothesis. In a nutshell, we
collect the data of the experiment and adjust the hypothesis after we have analysed the data,
e.g., we select effects small enough such that significant results have been observed.
Along similar lines, analyzing a dataset with many different methods will likely lead to
several significant p-values. In fact, even in the case of a true underlying null hypothesis, on
average α · 100% of the tests are significant; due to various inherent decisions, often even more.
When searching for a good statistical analysis one often has to make many choices and thus
inherently selects the best one among many. This danger is often called the ‘garden of forking
paths’. Conceptually, adjusting the p-value for the many (not-performed) tests would mitigate
the problem.
If a result is not significant, the study is often not published and is left in a ‘file drawer’. A
seemingly significant result might well be due to a Type I error, but this is not evident as many
similar experiments lead to non-significant outcomes that are not published. Hence, the so-called
publication bias implies that there are more Type I error results than the nominal α level suggests.
For many scientific domains, it is possible to preregister the study, i.e., to declare the study
experiment, analysis methods, etc. before the actual data has been collected. In other words,
everything is determined except the actual data collection and actual numbers of the statistical
analysis. The idea is that the scientific question is worthwhile investigating and reporting
independently of the actual outcome. Such an approach reduces HARKing, the garden-of-forking-paths
issue, publication bias and more.
a) Derive the power of a one-sample t-test for H0: µ = µ0 against H1: µ = µ1 for sample size n.
b) Show that the value of the test statistic of the two sample t-test with equal variances and
equal sample sizes simplifies to $\sqrt{n}\,|\bar x - \bar y|\big/\sqrt{s_x^2 + s_y^2}$.
c) Starting from the results in (5.6), derive the test statistic of Test 2 for the general case of
sample sizes nx and ny.
d) We give some background to Welch's two sample t-test with test statistic $(\bar X - \bar Y)\big/S_W$.
Problem 5.2 (t-Test) Use again the sickle-cell disease data introduced in Problem 4.2. For the
cases listed below, specify the null and alternative hypothesis. Then use R to perform the tests
and give a careful interpretation.
Problem 5.3 (t-Test) Anorexia is an eating disorder that is characterized by low weight, food
restriction, fear of gaining weight and a strong desire to be thin. The dataset anorexia in the
package MASS gives the weight of 29 females before and after a cognitive behavioral treatment
(in pounds). Test whether the treatment was effective.
Problem 5.4 (Testing the variance of one sample) In this problem we develop a “one-sample
variance” test. Let Y1, . . . , Yn iid∼ N(µ, σ²). We consider the estimator S² for the parameter σ².
We test H0: σ² = σ0² against H1: σ² > σ0².
b) Find an expression for the p-value for the estimate s2 . For simplicity, we assume s2 > σ02 .
c) We now assume explicit values n = 17, s2 = 0.41 and σ02 = 0.25. What is the p-value?
d) Construct a one-sided sample confidence interval [0, bu ] for the parameter σ 2 in the general
setting and with the values from c).
Problem 5.5 (p-values under mis-specified assumptions) In this problem we investigate the effect
of deviations from the statistical assumptions on the p-value. For simplicity, we use the one sample
t-test.
a) For 10000 times, sample X1, . . . , Xn iid∼ N(µ, σ²) with µ = 0, σ = 1 and n = 10. For each
sample perform a t-test for H0: µ = 0. Plot the p-values in a histogram. What do you
observe? For α = 0.05, what is the observed Type I error?
b) We repeat the experiment with a different distribution. Same questions as in a), but for
c) Will the observed Type I error be closer to the nominal level α when we increase n? Justify.
b) Calculate the difference between the lower bound of the sample confidence interval for µy
and the upper bound of the sample confidence interval for µx. Show that the two intervals
“touch” each other when $(\bar x - \bar y)\big/\bigl((s_x + s_y)/\sqrt{n}\bigr) = t_{n-1,1-\alpha/2}$.
c) Which test would be adequate to consider here? What are tobs and tcrit in this specific
setting?
d) Comparing b) and c), why do we not have this apparent duality between the confidence
intervals and the hypothesis test?
Problem 5.7 (BMJ Endgame) Discuss and justify the statements about ‘Independent samples
t test’ given in doi.org/10.1136/bmj.c2673.
Chapter 6
Estimating and Testing Proportions
⋄ Explain and apply estimation, confidence interval and hypothesis testing for
proportions
One of the motivational examples in the last chapter was tossing a coin. In this chapter, we
generalize this specific idea and discuss the estimation and testing of a single proportion as well
as of two or several proportions. The following example serves as a motivation.
The end-of-semester exam of the lecture ‘Analysis for the natural sciences’ at my university
consisted of two slightly different versions. The exam type is assigned according to the student
listing. It is of utmost importance that the two versions are of equal difficulty. For one particular year,
among the 589 students that participated, 291 received exam version A, the others version B.
There were 80 students failing version A versus 85 failing version B. Is there enough evidence
in the data to claim that the exams were not of equal difficulty and hence some students were
disadvantaged?
In this chapter we have a closer look at statistical techniques that help us to correctly answer
the above and similar questions. More precisely, we will estimate and compare proportions. To
simplify the exposition, we discuss the estimation of one proportion followed by comparing two
proportions. The third section discusses statistical tests for different cases.
6.1 Estimation
We start with a simple setting where we observe occurrences of a certain event and are interested
in the proportion of the events over the total population. More specifically, we consider the
number of successes in a sequence of experiments, e.g., whether a certain treatment had an effect
or whether a certain test has been passed. We first discuss point estimation followed by the
construction of confidence intervals for a proportion.
In our example we have the following estimates: p̂A = 211/291 ≈ 72.51% for exam version A
and p̂B = 213/298 ≈ 71.48% for version B. Once we have an estimate of a proportion, we can
answer questions (for each version separately), such as:
(where we use results from Sections 2.2.1 and 3.3 only). We revisit questions like “Is the failure
rate significantly lower than 30%?” later in the chapter.
The estimator p̂ = X/n does not have a “classical” distribution (it is a “scaled binomial”).
Figure 6.1 illustrates the probability mass function based on the estimate of exam version A. The
figure visually suggests using a Gaussian approximation. Formally, we approximate the binomial
distribution of X by a Gaussian distribution (see Section 3.3), which is well justified here as
np(1 − p) ≫ 9. The approximation for X is then used to state that the estimator p̂ is also
approximately Gaussian with adjusted parameters: p̂ approx∼ N(p, p(1 − p)/n). In this chapter,
we will often use this approximation and thus we implicitly assume np(1 − p) > 9.
Figure 6.1 also indicates that, by shifting the Gaussian density slightly to the right, the approximation
would improve. This shift is linked to the continuity correction and is performed in
practice. In our derivations we often omit the correction for clarity.
Figure 6.1: Probability mass function of p̂ = 211/291. The blue curve in the right
panel is the normal approximation. The actual estimate is indicated with a green tick
mark.
When dealing with proportions we often speak of odds, or simply of chance, defined by
ω = p/(1 − p). The corresponding intuitive estimator is ω̂ = p̂/(1 − p̂) for an estimator p̂.
Similarly, θ̂ = log(ω̂) = log(p̂/(1 − p̂)) is an intuitive estimator of the log odds. If p̂ is an estimate,
then these quantities are the corresponding estimates. As a side note, these estimators also coincide
with the maximum likelihood estimators.
The Wald confidence interval rests upon this assumption (which can be shown more formally) and is
identical to (6.8). ♣
If the inequality in (6.4) is solved through a quadratic equation, we obtain the sample Wilson
confidence interval
$b_{l,u} = \dfrac{1}{1 + q^2/n} \cdot \Biggl( \hat p + \dfrac{q^2}{2n} \pm q \cdot \sqrt{\dfrac{\hat p(1-\hat p)}{n} + \dfrac{q^2}{4n^2}} \Biggr),$   (6.9)
The Wilson confidence interval is “more complicated” than the Wald confidence interval. Is
it also “better” because of one fewer approximation during the derivation?
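For exam version A (x = 211 passes out of n = 291) the two intervals can be computed directly (a sketch; the Wald interval is the one referred to as (6.8) above, with q = z1−α/2):
x <- 211; n <- 291; alpha <- 0.05
phat <- x / n
q <- qnorm(1 - alpha/2)
wald <- phat + c(-1, 1) * q * sqrt(phat * (1 - phat) / n)        # Wald interval
wilson <- (phat + q^2/(2*n) + c(-1, 1) * q *
           sqrt(phat*(1-phat)/n + q^2/(4*n^2))) / (1 + q^2/n)    # Wilson interval (6.9)
rbind(wald, wilson)
# prop.test(x, n)$conf.int returns a closely related score interval (with continuity correction)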
Ideally the coverage probability of a (1 − α) confidence interval should be 1 − α. Because of
the approximations this will not be the case, and the coverage probability can be used to assess
confidence intervals. For a discrete random variable, the coverage is
$P(p \in [\,B_l, B_u\,]) = \sum_{x=0}^{n} P(X = x)\, I_{\{p \in [\,b_l, b_u\,]\}}.$   (6.12)
(see Problem 6.1.b). R-Code 6.1 calculates the coverage of the 95% confidence intervals for
X ∼ Bin(n = 40, p = 0.4). For the particular setting, the Wilson confidence interval does not
seem to have a better coverage (96% compared to 94%).
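A compact sketch of such a coverage calculation for the Wald interval (in the spirit of R-Code 6.1; the Wilson case is analogous):
n <- 40; p <- 0.4; alpha <- 0.05
q <- qnorm(1 - alpha/2)
x <- 0:n
phat <- x / n
bl <- phat - q * sqrt(phat * (1 - phat) / n)            # Wald lower bounds for every possible x
bu <- phat + q * sqrt(phat * (1 - phat) / n)            # Wald upper bounds
sum(dbinom(x, size=n, prob=p) * (bl <= p & p <= bu))    # coverage according to (6.12)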
R-Code 6.1: Coverage of 95% confidence intervals for X ∼ Bin(n = 40, p = 0.4).
Figure 6.2 illustrates the coverage of the confidence intervals for different values of p. Overall
the Wilson confidence interval now dominates the Wald one: its coverage is closer to the nominal
level near the center. This observation also holds when n is varied, as seen
in Figure 6.3, which shows the coverage for different values of n and p. The Wilson confidence
interval has a slight tendency towards too large a coverage (more blueish areas) but is overall much better
than the Wald one. Note that the top “row” of the left and right parts of the panel corresponds
to the left and right parts of Figure 6.2.
Figure 6.2: Coverage of the 95% confidence intervals for X ∼ Bin(n = 40, p) for the
Wald CI (left) and the Wilson CI (right). The red dashed line is the nominal level 1 − α
and in green we have a smoothed curve to “guide the eye”. (See R-Code 6.2.)
Figure 6.3: Coverage of the 95% confidence intervals for X ∼ Bin(n, p) as functions
of p and n (Wald CI left, Wilson CI right). The probabilities are symmetric around
p = 1/2. All values smaller than 0.7 are represented with dark red.
The width of a sample confidence interval is bu − bl. For the Wald confidence interval we
obtain
$2q \cdot \sqrt{\dfrac{\hat p(1-\hat p)}{n}} = 2q\,\sqrt{\dfrac{x(n-x)}{n^3}}$   (6.13)
and for the Wilson confidence interval we have
$\dfrac{2q}{1 + q^2/n}\,\sqrt{\dfrac{\hat p(1-\hat p)}{n} + \dfrac{q^2}{4n^2}} = \dfrac{2q}{1 + q^2/n}\,\sqrt{\dfrac{x(n-x)}{n^3} + \dfrac{q^2}{4n^2}}.$   (6.14)
The widths vary with the observed value of X and are shown in Figure 6.4. For 5 < x < 36, the
Wilson confidence interval has a smaller width and a better nominal coverage (over small ranges
of p). For small and very large values of x, the Wald confidence interval has too small a coverage
and thus wider intervals are desired.
Figure 6.4: Widths of the sample 95% confidence intervals for X ∼ Bin(n = 40, p) as
a function of the observed value x (the Wald CI is in green, the Wilson CI in blue).
Table 6.1: Notation for a two-by-two contingency table.
                           Result
                    positive   negative   Total
      Group   A       h11        h12        n1
              B       h21        h22        n2
      Total            c1         c2         n
R-Code 6.3 Contingency table for the pre-eclampsia data of Example 6.1.
The risk difference RD describes the (absolute) difference in the probability of experiencing the
event in question. Using the notation introduced above, the difference $h_{11}/(h_{11}+h_{12}) - h_{21}/(h_{21}+h_{22}) =
h_{11}/n_1 - h_{21}/n_2$ can be seen as a realization of $X_1/n_1 - X_2/n_2$, which is approximately normally
distributed,
$N\Bigl(p_1 - p_2,\ \dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}\Bigr).$   (6.15)
The relative risk estimates the size of the effect of a risk factor compared with the size of the
effect when the risk factor is not present:
The groups with or without the risk factor can also be considered the treatment and control
groups.
The relative risk is a positive value. A value of RR = 1 means that the risk is the same in
both groups and there is no evidence of an association between the diagnosis/disease/event and
the risk factor. An RR greater than one is evidence of a possible positive association between a risk factor and
a diagnosis/disease. If the relative risk is less than one, the exposure has a protective effect, as
is the case, for example, for vaccinations.
$\widehat{RR} = \dfrac{\hat p_1}{\hat p_2} = \dfrac{h_{11}/(h_{11}+h_{12})}{h_{21}/(h_{21}+h_{22})} = \dfrac{h_{11}\, n_2}{h_{21}\, n_1}.$   (6.17)
A back-transformation
$\exp\bigl(\hat\theta \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat\theta)\bigr)$   (6.22)
implies positive confidence boundaries. Note that with the back-transformation we lose the
‘symmetry’ of estimate plus/minus standard error.
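As a small sketch of (6.17) and the back-transformation (6.22), with hypothetical 2×2 counts in the notation of Table 6.1 and assuming the usual large-sample standard error of the log relative risk (not shown explicitly in the text above):
h11 <- 20; h12 <- 80; h21 <- 35; h22 <- 65      # hypothetical counts
n1 <- h11 + h12; n2 <- h21 + h22
RR <- (h11/n1) / (h21/n2)                        # estimated relative risk as in (6.17)
se <- sqrt(1/h11 - 1/n1 + 1/h21 - 1/n2)          # usual SE of log(RR); an assumption here
exp(log(RR) + c(-1, 1) * qnorm(0.975) * se)      # back-transformed 95% CI as in (6.22)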
Example 6.2 (continuation of Example 6.1). The relative risk and corresponding confidence
interval for the pre-eclampsia data are given in R-Code 6.4. The relative risk is smaller than one
(diuretics reduce the risk). An approximate 95% confidence interval does not include the value
one. ♣
The relative risk cannot be applied to so-called case-control studies, where we match for each
subject in the risk group one or several subjects in the second, control group. This matching
implies that the risk in the control group is not representative but influenced by the first group.
R-Code 6.5 Odds ratio with confidence interval, approximate and exact.
Example 6.3 (continuation of Example 6.1). The odds ratio with confidence interval for the pre-eclampsia
data is given in R-Code 6.5. The 95% confidence interval is again similar to the one calculated
for the relative risk and also does not include one, strengthening the claim (i.e., a significant
result).
Notice that the function fisher.test() also calculates the odds ratio. As it is based on a
likelihood calculation, there are very minor differences between the two estimates. ♣
We start the discussion with the hypothesis test H0: p = p0 versus H1: p ≠ p0. This case is
straightforward when relying on the duality of tests and confidence intervals. That means we
reject the null hypothesis at level α if p0 is not in the (1 − α) sample confidence interval [ bl, bu ].
The confidence interval can be obtained based on a Wald, Wilson or some other approach. If
the confidence interval is not exact, the test may not have exact size α.
In R, there is the possibility to use binom.test( x, n, p) or prop.test( x, n, p), the latter with the
argument correct=TRUE (default) to include a continuity correction, or correct=FALSE.
Example 6.4. In one of his famous experiments Gregor Mendel crossed peas based on AA
and aa-type homozygotes. Under the Mendelian inheritance assumption, the third generation
should consist of AA and Aa genotypes with a ratio of 1:2. For one particular genotype, Mendel
reported the counts 8 and 22 (see, e.g., Table 1 of Ellis et al., 2019). We cannot reject the
hypothesis H0: p = 1/3 based on the p-value 0.56. As shown in R-Code 6.6, both binom.test(8,
8+22, p=1/3) and prop.test(8, 8+22, p=1/3) yield the same p-value (up to two digits). The
corresponding confidence intervals are also very similar. When using the argument
correct=FALSE, however, the outcome changes noticeably. ♣
In practice one often works with the squared difference of the proportions, for which the test statistic
takes quite a simple form, as given in Test 5. Under the null hypothesis, the distribution thereof
is a chi-squared distribution (the square of a normal random variable). The quantile, density, and
distribution functions are implemented in R as [q,d,p]chisq (see Section 3.2.1). This test is
also called Pearson's χ² test.
Example 6.5 (continuation of Example 6.1). R-Code 6.7 shows the results for the pre-eclampsia
data, once using a proportion test and once using a chi-squared test (comparing
expected and observed frequencies). ♣
Remark 6.2. We have presented the rows of Table 6.1 in terms of two binomials, i.e., with two
fixed marginals. In certain situations, such a table can be seen from a hypergeometric distri-
bution point of view (see help( dhyper)), where three margins are fixed. For this latter view,
fisher.test is the test of choice. ♣
$\chi^2_{obs} = \dfrac{(h_{11} h_{22} - h_{12} h_{21})^2\,(h_{11} + h_{12} + h_{21} + h_{22})}{(h_{11} + h_{12})(h_{21} + h_{22})(h_{12} + h_{22})(h_{11} + h_{21})} = \dfrac{(h_{11} h_{22} - h_{12} h_{21})^2\, n}{n_1\, n_2\, c_1\, c_2}.$
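The formula can be checked numerically against chisq.test() without continuity correction (the 2×2 counts below are hypothetical):
tab <- matrix(c(20, 80, 35, 65), nrow=2, byrow=TRUE)    # hypothetical 2x2 table
h <- tab; n <- sum(h)
chi2 <- (h[1,1]*h[2,2] - h[1,2]*h[2,1])^2 * n /
        (sum(h[1,])*sum(h[2,])*sum(h[,1])*sum(h[,2]))   # test statistic from the formula above
c(chi2, chisq.test(tab, correct=FALSE)$statistic)       # the two values agree
pchisq(chi2, df=1, lower.tail=FALSE)                    # corresponding p-value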
There is a natural extension of the test of proportions in which we compare “arbitrarily” many
proportions. The so-called chi-square test (χ² test) assesses whether the observed data follow a
particular distribution by comparing the frequencies of binned observations with the expected
frequencies.
Under the null hypothesis, the chi-square test statistic is χ² distributed (Test 6). The test is based on
approximations and thus the categories should be aggregated so that all bins contain a reasonable
number of counts, e.g., ei ≥ 5 (trivially, K − k > 1).
Decision: Reject H0 : “no deviation between the observed and expected” if χ2obs >
χ2crit = χ2K−1−k,1−α , where k is the number of parameters estimated from the
data to calculate the expected counts.
Example 6.6. With few observations (10 to 50) it is often pointless to test for normality of the data. Even for larger samples, a Q-Q plot is often more informative. For completeness, we illustrate a simple goodness-of-fit test by comparing the pododermatitis data with expected counts constructed from a Gaussian density with matching mean and variance (R-Code 6.8). We pool over both periods and barns (n = 34), as there is no significant difference in the means and the variances.
The binning of the data is done through a histogram-type binning (an alternative way would
be table( cut( podo$PDHmean))). As we have fewer than five observations in several bins, the function chisq.test() issues a warning. This could be mitigated by calculating the p-value via a bootstrap simulation, setting the argument simulate.p.value=TRUE. Pooling bins, say with breaks=c(1.5,2.5,3.5,4,4.5,5), would be an alternative as well.
The degrees of freedom are K − 1 − k = 7 − 1 − 2 = 4, as we estimate the mean and standard
deviation to determine the expected counts. ♣
R-Code 6.8 Testing normality, pododermatitis (see Example 6.6 and Test 6).
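As the listing itself is not reproduced here, the following is a minimal sketch of how such a goodness-of-fit calculation could look (it assumes the podo data frame with the column PDHmean is loaded; it is not the script's exact R-Code 6.8):

x <- podo$PDHmean                                 # pooled scores, n = 34
h <- hist(x, plot=FALSE)                          # histogram-type binning
obs <- h$counts                                   # observed counts per bin
p <- diff(pnorm(h$breaks, mean=mean(x), sd=sd(x)))
p <- p / sum(p)                                   # renormalize expected bin probabilities
chisq.test(obs, p=p)                              # warns because several expected counts are < 5
# chisq.test() reports K-1 degrees of freedom; the correct value is K-1-2,
# because the mean and the standard deviation were estimated from the data.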
General distribution tests (also called goodness-of-fit tests) differ from the other tests discussed here in the sense that they do not test or compare a single parameter or a vector of proportions. Such tests run under the names Kolmogorov–Smirnov test (ks.test()), Shapiro–Wilk normality test (shapiro.test()), Anderson–Darling test (goftest::ad.test()), etc.
Example 6.7. The dataset haireye gives the hair and eye colors of 592 persons collected by students (Snee, 1974). The cross-tabulated data are helpful to get information about individual combinations or to compare two combinations. The overall picture is best assessed with
Figure 6.5: Mosaic plot of hair and eye colors. (See R-Code 6.9.)
a mosaic plot, which indicates that there is a larger proportion of persons with blond hair having blue eyes compared to other hair colors, which is well known. When examining the individual terms $(o_i - e_i)^2/e_i$ of the $\chi^2$ statistic of Test 6, we observe three very large values (BLONDE-blue, BLONDE-brown, BLACK-brown), and thus the very low p-value is no surprise. These residual terms do not indicate whether there is an excess or a lack of observed pairs. Restricting to persons with brown and red hair only, eye color seems independent of hair color (p-value 0.15, chisq.test(HAIReye[c("BROWN","RED"),])$p.value).
R-Code 6.9 Pearson’s Chi-squared test for hair and eye colors data. (See Figure 6.5.)
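As a hedged illustration of what R-Code 6.9 might contain, the following sketch uses the built-in HairEyeColor table (the script's haireye/HAIReye object may be coded differently):

tab <- margin.table(HairEyeColor, c(1, 2))        # hair x eye counts, summed over sex
mosaicplot(tab, main="Hair and eye colors")       # mosaic plot as in Figure 6.5
res <- chisq.test(tab)                            # Pearson's chi-squared test
res$p.value                                       # very small p-value
(res$observed - res$expected)^2 / res$expected    # individual terms of the statistic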
b) Derive formula (6.12) to calculate the coverage probability for X ∼ Bin(n, p).
d) Show that the test statistic of Test 5 is a particular case of Test 6 (without continuity
correction).
Problem 6.2 (Binomial distribution) Suppose that among n = 95 Swiss males, eight are red-green colour blind. We are interested in estimating the proportion p of people suffering from this condition among the male population.
b) Calculate the maximum likelihood (ML) estimate $\hat p_{\text{ML}}$ and the ML estimate of the odds $\hat\omega$.
c) Using the central limit theorem (CLT), it can be shown that $\hat p$ follows approximately $\mathcal{N}\bigl(p, \frac{1}{n}p(1-p)\bigr)$. Compare the binomial distribution to the normal approximation for different n and p. To do so, plot the exact cumulative distribution function (CDF) and compare it with the CDF obtained from the CLT. For which values of n and p is the approximation reasonable? Is the approximation reasonable for the red-green colour blindness data?
f) Compute the Wilson 95%-confidence interval and compare it to the confidence intervals
from (d).
              Treatment A   Treatment B
Cleared                 9             5
Not cleared            18            22
Problem 6.3 (A simple clinical trial) A clinical trial is performed to compare two treatments,
A and B, that are intended to treat a skin disease named psoriasis. The outcome shown in the
following table is whether the patient’s skin cleared within 16 weeks of the start of treatment.
Use α = 0.05 throughout this problem.
a) Compute for each of the two treatments a Wald type and a Wilson confidence interval for
the proportion of patients whose skin cleared.
b) Test whether the risk difference is significantly different from zero (i.e., RD = 0). Use both
an exact and an approximated approach.
c) Compute confidence intervals for both the relative risk (RR) and the odds ratio (OR).
d) How would the point estimate of the odds ratio change if we considered the proportions of
patients whose skin did not clear?
Problem 6.4 (BMJ Endgame) Discuss and justify the statements about ‘Relative risks versus
odds ratios’ given in doi.org/10.1136/bmj.g1407.
Chapter 7
Rank-Based Methods
⋄ Describe, apply and interpret rank-based tests as counterparts to the classical t-test
⋄ Describe the general approach to compare the means of several samples, and apply and interpret the corresponding rank-based tests
As seen in previous chapters, the "classic" estimates of the expectation and the variance are
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad\text{and}\qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (7.1)
\]
If, hypothetically, we set one (arbitrary) value xi to an infinitely large value (i.e., we create an
extreme outlier), these estimates “explode”. A single value may exert enough influence on the
estimate such that the estimate is not representative of the bulk of the data anymore. In a
similar fashion, outliers may not only influence estimates drastically but also the value of test
statistics and thus render the result of the test questionable.
Until now we have often assumed that we have a realization of a Gaussian random sample.
We have argued that the t-test family is exact for Gaussian data but remains usable for moderate
deviations thereof since we use the central limit theorem in the test statistic. If we have very small
sample sizes or if the deviation is substantial, the result of the test may be again questionable.
In this chapter, we discuss basic approaches to estimation and testing for cases that include
the presence of outliers and deviations from Gaussianity.
where most software programs (including R) use c = 1.4826. The choice of c is such that for Gaussian random variables the MAD, seen as an estimator, is unbiased for σ, i.e., E(MAD) = σ. Similarly, since for Gaussian random variables IQR = $2\Phi^{-1}(3/4)\sigma \approx 1.349\sigma$, IQR/1.349, seen as an estimator, is an estimator of σ.
Example 7.1. Let the values 1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7 be given. Suppose that we have erro-
neously entered the final number as 27. R-Code 7.1 compares several statistics (for location and
scale) and illustrates the effect of this single outlier on the estimates. ♣
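A minimal sketch in the spirit of R-Code 7.1 (not the script's exact code):

x <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7)
y <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 27)      # final value erroneously entered as 27
rbind(clean   = c(mean=mean(x), median=median(x), sd=sd(x), IQR=IQR(x), MAD=mad(x)),
      corrupt = c(mean=mean(y), median=median(y), sd=sd(y), IQR=IQR(y), MAD=mad(y)))
# mean and sd change drastically; median, IQR and MAD are hardly affected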
Remark 7.1. An intuitive approach to quantify the "robustness" of an estimator is the breakdown point, which quantifies the proportion of the sample that can be set to arbitrarily large values before the estimate takes an arbitrary value, i.e., before it breaks down. The mean has a breakdown point of 0 (one out of n values is sufficient to break the estimate), an α-trimmed mean has a breakdown point of α (α ∈ [0, 1/2]), and the median one of 1/2, the maximum possible value.
The IQR has a breakdown point of 1/4 and the MAD of 1/2. See also Problem 7.1.a. ♣
Robust estimators have two main drawbacks compared to classical estimators. The first
disadvantage is that robust estimators do not possess simple distribution functions and for this
reason the corresponding exact confidence intervals are not easy to calculate. More specifically,
for a robust estimator θb of the parameter θ we rarely have the exact quantiles ql and qu (which
depend on θ), to construct a confidence interval starting from
\[
1 - \alpha = \operatorname{P}\bigl(q_l(\theta) \le \hat\theta \le q_u(\theta)\bigr). \qquad (7.3)
\]
If we could assume that the distribution of a robust estimator is approximately Gaussian (for large samples), we could calculate approximate confidence intervals based on
\[
\widehat{\text{robust estimator}} \;\pm\; z_{\alpha/2}\,\sqrt{\frac{\widehat{\operatorname{Var}}(\text{robust estimator})}{n}}. \qquad (7.4)
\]
The second disadvantage of robust estimators is their lower efficiency, i.e., these estimators
have larger variances compared to classical estimators. Formally, the efficiency is the ratio of
the variance of one estimator to the variance of the second estimator.
In some cases the exact variance of a robust estimator can be determined; more often, only approximations or asymptotic results exist. For example, for a continuous random variable with distribution function F(x) and density function f(x), the median is asymptotically normally distributed around the true median $\eta = Q(1/2) = F^{-1}(1/2)$ with variance $1/\bigl(4 n f(\eta)^2\bigr)$. The following example illustrates this result and R-Code 7.2 compares the finite sample efficiency of two estimators of location based on repeated sampling.
Example 7.2. Let $X_1, \ldots, X_{10} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. We simulate realizations of this random sample
and calculate the corresponding sample mean and sample median. We repeat R = 1000 times.
Figure 7.1 shows the histogram of these means and medians including a (smoothed) density of
the sample. The histogram and the density of the sample medians are wider and thus the mean
is more efficient.
For this particular example, the sample efficiency is roughly 72% for n = 10. As the density
is symmetric, η = µ = 0 and thus the asymptotic efficiency is
\[
\frac{\sigma^2/n}{1/\bigl(4 n f(0)^2\bigr)} = \frac{\sigma^2}{n}\cdot 4n\,\Bigl(\frac{1}{\sqrt{2\pi}\,\sigma}\Bigr)^{2} = \frac{2}{\pi} \approx 64\%. \qquad (7.5)
\]
Of course, if we change the distribution of X1 , . . . , X10 , the efficiency changes. For example let
us consider the case of a t-distribution with 4 degrees of freedom, a density with heavier tails
than the normal. Now the sample efficiency for sample size n = 10 is 1.26, which means that the
median is better compared to the mean. ♣
Robust estimation approaches have the advantage of not requiring the identification and elimination of outliers prior to estimation. The decision as to whether a realization of a
R-Code 7.2 Distribution of sample mean and median, see Example 7.2. (See Figure 7.1.)
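As the listing itself is not reproduced here, a minimal sketch of such a simulation might look as follows (assumed setup, not the script's exact R-Code 7.2):

set.seed(1)
R <- 1000; n <- 10
sam <- matrix(rnorm(R * n), nrow=R)      # R Gaussian samples of size n
means   <- rowMeans(sam)
medians <- apply(sam, 1, median)
hist(means, breaks=20, main="", xlab="estimates")
lines(density(medians), col=2)           # smoothed density of the sample medians
var(means) / var(medians)                # finite sample efficiency, roughly 0.7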
Figure 7.1: Comparing finite sample efficiency of the mean and median for a Gaussian
sample of size n = 10. Medians in yellow with red smoothed density of the sample,
means in black. (See R-Code 7.2.)
random sample contains outliers is not always easy and some care is needed. For example, it is not possible to declare a value an outlier merely because it lies outside the whiskers of a boxplot: for all distributions with support on the entire real line, observations will lie outside the whiskers when n is sufficiently large (see Problem 7.1.b). Obvious outliers are easy to identify and eliminate, but in less clear cases robust estimation methods are preferred.
Outliers can be very difficult to recognize in multivariate random samples, because they are
not readily apparent with respect to the marginal distributions. Robust methods for random vectors exist, but are often computationally intensive and not as intuitive as for scalar values. It has to be added that, independently of the estimation procedure, if an EDA finds outliers, these should be noted and scrutinized.
The sign test is based on very weak assumptions and therefore has little power. We introduce
now the concept of “ranks”, as an extension of “signs”, and resulting rank tests. The rank of a
value in a set of values is the position (order) of that value in the ordered sequence (from smallest
to largest). In particular, the smallest value has rank 1 and the largest rank n. In the case of
ties, the arithmetic mean of the ranks is used.
Example 7.3. The values 1.1, −0.6, 0.3, 0.1, 0.6, 2.1 have ranks 5, 1, 3, 2, 4 and 6. However,
the ranks of the absolute values are 5, (3+4)/2, 2, 1, (3+4)/2 and 6. ♣
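The ranks of Example 7.3 can be verified directly in R with the function rank():

x <- c(1.1, -0.6, 0.3, 0.1, 0.6, 2.1)
rank(x)         # 5 1 3 2 4 6
rank(abs(x))    # 5.0 3.5 2.0 1.0 3.5 6.0  (ties receive the mean of their ranks)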
Rank-based tests consider the ranks of the observations, or the ranks of the differences, rather than the observations or differences themselves, and thus mitigate the effect of extreme values on the test
Test 7: Comparing the locations of two paired samples with the sign test
Decision: Reject H0 : “the medians are the same”, if Sobs < Scrit , where Scrit is
the α/2-quantile of a Bin(n⋆ , 1/2) distribution.
Calculation in R:
binom.test( sum( d>0), sum( d!=0), conf.level=1-alpha)
statistic. For example, the largest value has always the same rank and therefore has the same
influence on the test statistic independent of its value.
Compared to the classical t-tests discussed in Chapter 5 and further tests that we will discuss in Chapter 11, rank tests should be used if the data contain outliers or deviate substantially from the distributional assumptions.
For smaller sample sizes it is difficult to check model assumptions, and it is recommended to use rank tests in such situations. Rank tests require fewer assumptions on the underlying distributions and have a fairly similar power (see Problem 7.6).
We now introduce two classical rank tests which are the Wilcoxon signed rank test and the
Wilcoxon–Mann–Whitney U test (also called the Mann–Whitney test), i.e., rank-based versions
of Test 2 and Test 3 respectively.
The Wilcoxon signed rank test is used to test for an effect in paired samples (i.e., two matched samples or two repeated measurements on the same subjects). As for the sign test or the paired two-sample t-test, we start by calculating the differences of the paired observations, followed by calculating the ranks of the negative differences and of the positive differences. Note that we omit pairs having zero differences and denote with $n^\star$ the possibly adjusted sample size. Under the null hypothesis, the paired samples are from the same distribution and thus the ranks of the positive and negative differences should be comparable, neither too small nor too large. If this
is not the case, the data indicates evidence against the hypothesis. This approach is formally
presented in Test 8.
Assumptions: Both samples are from continuous distributions of the same shape,
the samples are paired.
The quantile, density and distribution functions of the Wilcoxon signed rank test statistic
are implemented in R with [q,d,p]signrank(). For example, the critical value Wcrit (n; α/2)
mentioned in Test 8 is qsignrank( .025, n) for α = 5% and the corresponding p-value is
2*psignrank( Wobs, n) with n = n⋆ .
It is possible to approximate the distribution of the test statistic with a normal distribution. The observed test statistic $W_{\text{obs}}$ is z-transformed as follows:
\[
z_{\text{obs}} = \frac{W_{\text{obs}} - \dfrac{n^\star(n^\star+1)}{4}}{\sqrt{\dfrac{n^\star(n^\star+1)(2n^\star+1)}{24}}}, \qquad (7.6)
\]
and zobs is then compared with the corresponding quantile of the standard normal distribution
(see Problem 7.1.c). This approximation may be used when the samples are sufficiently large,
which is, as a rule of thumb, n⋆ ≥ 20.
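For illustration, the approximation (7.6) can be compared to the exact distribution; the values of $n^\star$ and $W_{\text{obs}}$ below are purely illustrative:

n <- 20; W <- 60                                  # illustrative n* and observed statistic
z <- (W - n*(n+1)/4) / sqrt(n*(n+1)*(2*n+1)/24)
2 * pnorm(z)                                      # approximate two-sided p-value
2 * psignrank(W, n)                               # exact two-sided p-value, similar for n* >= 20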
In case of ties, R may not be able to calculate exact p-values and will thus issue a warning. The warning can be avoided by not requiring exact p-values through setting the argument
exact=FALSE. When setting the argument conf.int=TRUE in wilcox.test(), a nonparametric
confidence interval is constructed. It is possible to specify the confidence level with conf.level,
default value is 95%. The numerical values of the confidence interval are accessed with the
element $conf.int, also illustrated in the next example.
Example 7.4 (continuation of Example 5.2). We consider again the podo data as introduced in Example 5.2. R-Code 7.3 performs a Wilcoxon signed rank test. As we do not have any outliers, the p-value is similar to the one obtained with a paired t-test in Example 5.10. There are no ties and thus the argument exact=FALSE is not necessary.
The advantage of robust methods becomes clear when the first value from the second visit, 3.75, is changed to 37.5, as shown towards the end of the same R-Code. While the p-value of the signed rank test hardly changes, the one from the paired two-sample t-test changes from 0.5 (see R-Code 5.5) to 0.31. More importantly, the confidence intervals are now drastically different, as the outlier inflated the estimated standard deviation used by the t-test. In other situations, it is quite likely that with or without a particular "outlier" the p-value falls below the magical threshold α (recall the discussion of Section 5.5.3).
Of course, a corrupt value, as introduced in this example, would be detected with a proper
EDA of the data (scales are within zero and ten). ♣
R-Code 7.3: Rank tests and comparison of a paired tests with a corrupted observation.
# Possibly reload the 'podo.csv' and construct the variables as in Example 4.1
wilcox.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE)
##
## Wilcoxon signed rank exact test
##
## data: PDHmean2[, 2] and PDHmean2[, 1]
## V = 88, p-value = 0.61
## alternative hypothesis: true location shift is not equal to 0
# wilcox.test( PDHmean2[,2]-PDHmean2[,1], exact=FALSE) # is equivalent
PDHmean2[1, 2] <- PDHmean2[1, 2]*10 # corrupted value, decimal point wrong
rbind(t.test=unlist( t.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE)[3:4]),
wilcox=unlist( wilcox.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE,
conf.int=TRUE)[c( "p.value", "conf.int")]))
## p.value conf.int1 conf.int2
## t.test 0.31465 -2.2879 6.6791
## wilcox 0.61122 -0.5500 1.1375
For a better understanding of the difference between the sign test and the Wilcoxon signed rank test, consider an arbitrary distribution F(x). The assumptions of the Wilcoxon signed rank test are $X_1, \ldots, X_n \overset{\text{iid}}{\sim} F(x)$ and $Y_1, \ldots, Y_n \overset{\text{iid}}{\sim} F(x - \delta)$, where δ represents the shift. Hence, under the null hypothesis, δ = 0, which further implies that for all i = 1, …, n, (1) $\operatorname{P}(X_i > Y_i) = \operatorname{P}(X_i < Y_i) = 1/2$ and (2) the distribution of the difference $X_i - Y_i$ is symmetric. The second point is not required by the sign test. Hence, the Wilcoxon signed rank test requires more assumptions and thus generally has higher power.
For a symmetric distribution, we can thus use wilcox.test() with argument mu=mu0 to test
H0 : µ = µ0 , where µ is the median (and by symmetry also the mean). This setting is the rank
test equivalent of Test 1.
Assumptions: Both samples are from continuous distributions of the same shape,
the samples are independent and the data are at least ordinally scaled.
Decision: Reject H0 : “medians are the same” if Uobs < Ucrit (nx , ny ; α/2), where
Ucrit is the critical value.
The quantile, density and distribution functions of the Wilcoxon–Mann–Whitney test statistic
are implemented in R with [q,d,p]wilcox(). For example, the critical value Ucrit (nx , ny ; α/2)
used in Test 9 is qwilcox( .025, nx, ny) for α = 5% and corresponding p-value 2*pwilcox(
Uobs, nx, ny).
It is possible to approximate the distribution of the test statistic by a Gaussian one, provided
we have sufficiently large samples, e.g., nx , ny ≥ 10. The value of the test statistic Uobs is
z-transformed to
\[
z_{\text{obs}} = \frac{U_{\text{obs}} - \dfrac{n_x n_y}{2}}{\sqrt{\dfrac{n_x n_y (n_x + n_y + 1)}{12}}}, \qquad (7.7)
\]
where $n_x n_y/2$ is the expected value of the test statistic under the null hypothesis and the denominator is its standard deviation (see Problem 7.1.d). This value is then compared with the respective quantile of the standard normal distribution. With additional continuity corrections, the approximation may be improved.
To construct confidence intervals, the argument conf.int=TRUE must be used in the function
wilcox.test() and unless α = 5% a specification of conf.level is required. The numerical
values of the confidence interval are accessed with the list element $conf.int.
Example 7.5 (continuation of Examples 5.2 and 7.4). We test whether there is a difference in the pododermatitis scores between the two barns. Histograms of the densities support the assumption that both samples come from one underlying distribution. There is no evidence of a difference (same conclusion as in Example 5.9).
Several scores appear multiple times and thus the ranks contain ties. To avoid a warning message we set exact=FALSE. ♣
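The corresponding call could look as follows (a minimal sketch assuming the podo data frame with the columns PDHmean and Barn, not the script's exact code):

wilcox.test(PDHmean ~ Barn, data=podo, exact=FALSE, conf.int=TRUE)   # Wilcoxon-Mann-Whitney test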
2. The interpretation of the confidence interval resulting from argument conf.int=TRUE re-
quires some care, see ?wilcox.test. With this approach, it is possible to set the confidence
level via the argument conf.level. ♣
Similarly to the setting of two groups, we also have to differentiate between matched or repeated measures and independent groups. We now discuss two classical rank tests. These two tests complete the layout shown in Figure 7.2, which summarizes the different non-parametric tests to compare locations (a median with a theoretical value, or the medians of two or several groups). Note that this layout is by no means complete.
Figure 7.2: Different non-parametric statistical tests of location. In the case of paired
data, the ∆ operation takes the difference of the two samples and reduces the problem
to a single sample setting.
Suppose we have I subjects that are tested on J different treatments. The resulting data matrix contains for each subject the J measurements. While the subjects are assumed independent, the measurements within each subject are not necessarily independent.
Such a setup is also called a one-way repeated measures analysis of variance; one-way because each subject receives at any point a single treatment and all other factors are kept the same. We compare the variability between the groups with the variability within the groups based on ranks.
The data is then often presented in a matrix form with I rows and J columns, resulting in a
total of n = IJ observations. In a first step we calculate the ranks within each subject. If all the
treatments are equivalent, then no group is preferred for small or large ranks; that is, the sums of the ranks within each group, $R_j$, are similar. As we have more than two groups, we look at the variability of the column rank sums. There are different ways to present the test statistic and the following form is typically used to calculate the observed value
\[
Fr_{\text{obs}} = \frac{12}{I J (J+1)} \sum_{j=1}^{J} R_j^2 \;-\; 3 I (J+1), \qquad (7.8)
\]
where Rj is the sum of the ranks of the jth column. Under the null hypothesis, the value Fr obs
is small.
At first sight it is not evident that the test statistic represents a variability measure. See
Problem 7.1.e for a different and more intuitive form of the statistic. If there are ties then the
test statistic needs to be adjusted because ties affect the variance in the groups. The form of
the adjustment is not intuitive and consists essentially of a larger denominator compared to
IJ(J + 1). For very small values of J and I, critical values are tabulated in selected books (e.g.,
Siegel and Castellan Jr, 1988). For J > 5 (or for I > 5, 8, 13 when J = 5, 4, 3, respectively) an
approximation based on a chi-square distribution with J − 1 degrees of freedom is used. Test 10
summarizes the Friedman test and Example 7.6 uses the podo dataset as an illustration.
Assumptions: All distributions are continuous of the same shape, possibly shifted,
the samples are independent.
Calculation: For each subject the ranks are calculated. For each sample, sum the
ranks to get Rj . Calculate Fr obs according to (7.8).
Decision: Reject H0 : “the medians are the same” if Fr obs > χ2crit = χ2J−1,1−α .
Calculation in R: friedman.test(y)
In case H0 is rejected, we have evidence that at least one of the groups has a different location
compared to at least one other. The test does not say which one differs or how many are different.
To check which of the group pairs differ, individual post-hoc tests can be performed. Typically,
for the pair (r, s) the following approximation is used:
\[
|R_r - R_s| \;\ge\; z_{\alpha/(J(J-1))}\,\sqrt{\frac{I J (J+1)}{6}}, \qquad (7.9)
\]
where $R_s$ is as above. Note that we divide the level α of the quantile by $J(J-1) = 2\binom{J}{2}$, which corresponds to a Bonferroni-type correction over all two-sided pairwise comparisons.
Example 7.6 (continuation of Example 5.2). We consider all four visits of the dataset. Note that one rabbit was not measured in the third visit (ID 15 in the visit called 5) and thus the remaining three measurements of this animal have to be eliminated. The data can be arranged in matrix form, here of dimension 16 × 4. R-Code 7.5 shows that there is no significant difference between the visits, even if we further stratify according to the barn. ♣
R-Code 7.5 Friedman test, pododermatitis (see Example 7.6 and Test 10)
table(podo$ID)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4
podoCompl <- podo[podo$ID !=15,] # different ways to eliminate
PDHmean4 <- matrix(podoCompl$PDHmean[order(podoCompl$ID)], ncol=4, byrow=TRUE)
colMeans(PDHmean4) # means of the four visits
## [1] 3.7500 3.4484 3.8172 3.8531
friedman.test( PDHmean4) # no post-hoc tests necessary.
##
## Friedman rank sum test
##
## data: PDHmean4
## Friedman chi-squared = 2.62, df = 3, p-value = 0.45
## further stratification does not change the result
# podoCompl <- podoCompl[podoCompl$Barn==2,] # Barn 2 only, 6 animals
Often the Friedman test is also called a Friedman two-way ANOVA by ranks. Two-way in
the sense that there is a symmetry with respect to groups and subjects.
The Kruskal–Wallis test is the corresponding rank-based test for independent groups: it averages the ranks in each group and compares them to the overall average. Similarly to a variance estimate, we square the difference and additionally weight the difference by the number of observations in the group. Formally, we have
\[
KW_{\text{obs}} = \frac{12}{n(n+1)} \sum_{j=1}^{J} n_j \bigl(\overline{R}_j - \overline{R}\bigr)^2, \qquad (7.10)
\]
where n is the total number of observations, $\overline{R}_j$ the average of the ranks in group j and $\overline{R}$ the overall mean of the ranks, i.e., (n + 1)/2. The additional denominator is such that for J > 3 and I > 5, the Kruskal–Wallis test statistic is approximately chi-square distributed with J − 1 degrees of freedom.
Similar comments as for the Friedman test hold: for very small datasets critical values are
tabulated; in case of ties the test statistic needs to be adjusted; if the null hypothesis is rejected
post-hoc tests can be performed.
Calculation: Calculate the rank of the observations among the entire dataset. For
Rj the average of the ranks in group j and R = (n + 1)/2, calculate KW obs
according (7.10).
Decision: Reject H0 : “the medians are the same” if KW obs > χ2crit = χ2J−1,1−α .
Calculation in R: kruskal.test(x, g)
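A small self-contained illustration of the call, with simulated data (not from the script):

set.seed(1)
x <- c(rnorm(10), rnorm(10, mean=1), rnorm(10, mean=2))   # three groups with shifted locations
g <- factor(rep(1:3, each=10))
kruskal.test(x, g)          # small p-value: at least one group differs in location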
In this short paragraph we give an alternative view on correcting for multiple testing. A Bonferroni correction as used in the post-hoc tests (see Section 5.5.2) guarantees the specified Type I error but may not be ideal, as discussed now. Suppose that we perform several tests, with the resulting counts denoted as shown in Table 7.1. The family-wise error rate (FWER) is the probability of making at least one Type I error in the family of tests, FWER = P(V ≥ 1). Hence, when performing the tests such that FWER ≤ α, we keep the probability of making one or more Type I errors in the family at or below level α.
The Bonferroni correction, where we replace the level α of the test with αnew = α/m, main-
tains the FWER. Another popular correction method is the Holm–Bonferroni procedure for
which we order the p-values (lowest to largest) and the associated hypothesis. We successively
test if the k th p-value is smaller than α/(m − k + 1). If so we reject the k th hypothesis and move
on to k + 1, if not we stop the procedure. The Holm–Bonferroni procedure also maintains the
FWER but has (uniformly) higher power than the simple Bonferroni correction.
Remark 7.3. In bioinformatics and related fields m may be thousands and much larger. In
such settings, a FWER procedure may be too stringent as many effects may be missed due to
the very low thresholds. An alternative correction is to control the false discovery rate (FDR) at
level q with FDR = E(V /R). FDR procedures have greater power at the cost of increased rates
of type I errors.
If all of the null hypotheses are true (m0 = m), then controlling the FDR at level q also controls the FWER. ♣
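These corrections are conveniently available in R via p.adjust(); the p-values below are purely illustrative:

p <- c(0.001, 0.012, 0.024, 0.04, 0.30)     # illustrative raw p-values
p.adjust(p, method="bonferroni")            # Bonferroni correction
p.adjust(p, method="holm")                  # Holm-Bonferroni, never larger than Bonferroni
p.adjust(p, method="BH")                    # Benjamini-Hochberg, controls the FDR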
Example 7.7 (continuation of Examples 5.2, 7.4 and 7.5). R-Code 7.6 illustrates an alternative
to the Wilcoxon signed rank test and the Wilcoxon–Mann–Whitney test and compares the dif-
ference in locations between the two visits and the two barns respectively. For the paired setting,
see also Problem 7.4. Without too much surprise, the resulting p-values are in the same range as
seen in Examples 7.4 and 7.5. Note that the function oneway_test() requires a formula input.
♣
Assumptions: The null hypothesis is formulated, such that under H0 the groups
are exchangeable.
Calculation: (1) Calculate the difference tobs in the means of the two groups to
be compared (m observations in group 1, n observations in group 2).
(2) Form a random permutation of the values of both groups by randomly
allocating the observed values to the two groups (m observations in
group 1, n observations in group 2).
(3) Calculate the difference in means of the two new groups.
(4) Repeat this procedure R times (R large).
require(exactRankTests)
perm.test( PDHmean2[,1], PDHmean2[,2], paired=TRUE)
##
## 1-sample Permutation Test (scores mapped into 1:m using rounded
## scores)
##
## data: PDHmean2[, 1] and PDHmean2[, 2]
## T = 28, p-value = 0.45
## alternative hypothesis: true mu is not equal to 0
require(coin)
oneway_test( PDHmean ~ as.factor(Barn), data=podo)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: PDHmean by as.factor(Barn) (1, 2)
## Z = 1.24, p-value = 0.21
## alternative hypothesis: true mu is not equal to 0
Note that the package exactRankTests will no longer be further developed. The package
coin is the successor of the latter and includes extensions to other rank-based tests.
Remark 7.4. The two tests presented in this section are often referred to as randomization tests. There is a subtle difference between permutation tests and randomization tests. For simplicity we use the term permutation test only and refer to Edgington and Onghena (2007) for an in-depth discussion. ♣
c) Show that under the null hypothesis, $\operatorname{E}(W_{\text{obs}}) = n^\star(n^\star+1)/4$ and $\operatorname{Var}(W_{\text{obs}}) = n^\star(n^\star+1)(2n^\star+1)/24$.
Hint: Write $W_{\text{obs}} = \sum_{k=1}^{n^\star} k I_k$, where $I_k$ is a binomial (Bernoulli) random variable with p = 1/2 under the null hypothesis.
d) Show that under the null hypothesis, $\operatorname{E}(U_{\text{obs}}) = n_x n_y/2$ and $\operatorname{Var}(U_{\text{obs}}) = n_x n_y(n_x + n_y + 1)/12$.
Hints: Use the result of Problem 2.5.c. If we draw without replacement $n_y$ elements from a set of $N = n_x + n_y$ iid random variables, each having mean µ and variance σ², then the mean and variance of the sample mean are µ and $\frac{\sigma^2}{n_y}\Bigl(1 - \frac{n_y - 1}{n_x + n_y - 1}\Bigr)$. That means that the variance needs to be adjusted because of the finite population size $N = n_x + n_y$; see also Problem 4.
e) Show that the Friedman test statistic Fr given in equation (7.8) can be written in the "variance form"
\[
Fr = \frac{12}{I J (J+1)} \sum_{j=1}^{J} \bigl(R_j - \overline{R}\bigr)^2,
\]
where $\overline{R}$ denotes the average of the column rank sums $R_j$.
Problem 7.2 (Robust estimates) For the mtcars dataset (available by default through the package datasets), summarize location and scale for the mileage, the number of cylinders and the weight with standard and robust estimates. Compare the results.
Problem 7.3 (Weight changes over the years) We consider palmerpenguins and assess if
there is a weight change of the penguins over the different years (due to exceptionally harsh
environmental conditions in the corresponding years).
a) Is there a significant weight change for Adelie between 2008 and 2009?
b) Is there evidence of change for any of the species? Why would a two-by-two comparison
be suboptimal?
Problem 7.4 (Permutation test for paired samples) Test if there is a significant change in
pododermatitis between the first and last visit using a manual implementation of a permutation
test. Compare the result to Example 7.4.
Problem 7.5 (Rank and permutation tests) Download the water_transfer.csv data from
the course web page and read it into R. The dataset describes tritiated water diffusion across
human chorioamnion and is taken from Hollander and Wolfe (1999, Table 4.1, page 110). The
pd values for age "At term" and "12-26 Weeks" are denoted with yA and yB , respectively. We
will statistically test if the water diffusion is different at the two time points. That means we
test whether there is a shift in the distribution of the second group compared to the first.
a) Use a Wilcoxon–Mann–Whitney test to test for a shift in the groups. Interpret the results.
b) Now, use a permutation test as implemented by the function wilcox_test() from R pack-
age coin to test for a potential shift. Compare to a).
c) Under the null hypothesis, we are allowed to permute the observations (all y-values) while
keeping the group assignments fixed. Keeping this in mind, we will now manually construct
a permutation test to detect a potential shift. Write an R function perm_test() that
implements a two-sample permutation test and returns the p-value. Your function should
execute the following steps.
• Compute the test statistic $t_{\text{obs}} = \tilde{y}_A - \tilde{y}_B$, where $\tilde{\cdot}$ denotes the empirical median.
• Then repeat many times (e.g., R = 1000)
– Randomly assign all the values of pd to two groups xA and xB of the same size
as yA and yB .
– Store the test statistic $t_{\text{sim}} = \tilde{x}_A - \tilde{x}_B$.
• Return the two-sided p-value, i.e., the number of permuted test statistics tsim which
are smaller or equal than −|tobs | or larger or equal than |tobs | divided by the total
number of permutations (in our case R = 1000).
Problem 7.6 (Comparison of power) In this problem we compare the power of the one sample
t-test to the Wilcoxon signed rank test with a simulation study.
a) We assume that $Y_1, \ldots, Y_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu_1, \sigma^2)$, where n = 15 and $\sigma^2 = 1$. For $\mu_1 = 0$ to 1.2 in
steps of 0.05, simulate R = 1000 times a sample and based on these replicates, estimate
the power for a one sample t-test and a Wilcoxon signed rank test (at level α = 0.1).
b) Redo for $Y_i = \sqrt{(m-2)/m}\; V_i + \mu_1$, where $V_1, \ldots, V_n \overset{\text{iid}}{\sim} t_m$, with m = 4. Interpret.
c) Redo for $Y_i = (X_i - m)/\sqrt{2m} + \mu_1$, where $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \chi^2_m$, with m = 10. Why is the
Problem 7.7 (Power of the sign test) In this problem we will analyze the power of the sign test
in two situations and compare it to the Wilcoxon signed rank test. To emphasize the effects, we
assume a large sample size n = 100.
a) We assume that $d_1, \ldots, d_{100}$ are the differences of two paired samples. Simulate the $d_i$ as normal random variables (with variance one) and different means $\mu_1$ (e.g., a fine sequence from 0 to 1, seq(0, to=1, by=.1)). For the sign test and the Wilcoxon signed rank test, calculate the proportion of rejected cases out of 500 replicates of the test H0: "median is zero" and visualize the resulting empirical power curves. What are the powers for $\mu_1 = 0.2$? Interpret the power for $\mu_1 = 0$.
b) We now look at the Type I error if we have a skewed distribution and we use the chi-
squared distribution with varying degrees of freedom. For very large values thereof, the
distribution is almost symmetric, for very small values we have a pronounced asymmetry.
Note that the median of a chi-square distribution with ν degrees of freedom is approximately
ν(1 − 2/(9ν))3 .
We assume that we have a sample $d_1, \ldots, d_{100}$ from a chi-square distribution with ν degrees of freedom, shifted to have median (approximately) zero, e.g., from rchisq(100, df=nu)-nu*(1-2/(9*nu))^3.
As a function of ν, ν = 2, 14, 26, . . . , 50, (seq(2, to=50, by=12)) calculate the proportion
of rejected cases out of 500 tests for the test H0 : “median is zero”. Calculate the resulting
empirical power curve and interpret.
Problem 7.9 (Water drinking preference) Pachel and Neilson (2010) conducted a pilot study to assess whether house cats prefer still or flowing water. Their experiment consisted of providing nine cats either form of water over 4 days. The consumption in mL over the 22-hour period is provided in the file cat.csv. Some measurements had to be excluded due to detectable water spillage. Assess with a suitable statistical approach whether the cats prefer flowing water over still water.
Problem 7.10 (BMJ Endgame) Discuss and justify the statements about ‘Parametric v non-
parametric statistical tests’ given in doi.org/10.1136/bmj.e1753.
Chapter 8
Multivariate Normal Distribution
⋄ Describe a random vector, cdf, pdf of a random vector and its properties
⋄ Give the definition and intuition of E, Var and Cov for a random vector
⋄ Explain the relationship between the eigenvalues and eigenvector of the co-
variance matrix and the shape of the density function.
In Chapter 2 we have introduced univariate random variables and in Chapter 3 random sam-
ples. We now extend the framework to random vectors (i.e., multivariate random variables) where
the individual random variables are not necessarily independent (see Definition 3.1). Within the
scope of this book, we can only cover a tiny part of a beautiful theory. We are pragmatic and
discuss what will be needed in the sequel. Hence, we discuss a discrete case as a motivating
example only and then focus on continuous random vectors, especially Gaussian random vectors.
In this chapter we cover the theoretical details, the next one focuses on estimation.
Definition 8.1. The multivariate (or multidimensional) distribution function of a random vector
X is defined as
\[
F_X(\boldsymbol{x}) = F_X(x_1, \ldots, x_p) = \operatorname{P}(X_1 \le x_1, \ldots, X_p \le x_p), \qquad (8.1)
\]
where the list in the right-hand side is to be understood as the intersection (∩) of the p events. ♢
The multivariate distribution function generally contains more information than the set of p marginal distribution functions $\operatorname{P}(X_i \le x_i)$, because (8.1) only simplifies to $F_X(\boldsymbol{x}) = \prod_{i=1}^{p} \operatorname{P}(X_i \le x_i)$ under independence of all random variables $X_i$ (compare to Equation (3.3)).
Example 8.1. Suppose we toss two fair four-sided dice with face values 0, 1, 2 and 3. The first player marks the absolute difference between both values and the second the maximum value. To cast the situation in a probabilistic framework, we introduce the random variables T1 and T2 for the results of the tetrahedra and assume that each value appears with probability 1/4. (As a side note, tetrahedral dice often have marked corners instead of faces. For the latter, "appearance" means the face that is turned down.) Table 8.1 gives the frequency table for X = |T1 − T2| and Y = max(T1, T2). The entries are to be interpreted as the joint pdf $f_{X,Y}(x,y) = \operatorname{P}(X = x, Y = y)$. From the joint pdf we can derive various probabilities, for example $\operatorname{P}(X \ge Y) = 7/16$ or $\operatorname{P}(X \ge 2 \mid Y \ne 3) = 2/9$. It is also possible to obtain the marginal pdf of, say, X: $f_X(x) = \sum_{y=0}^{3} f_{X,Y}(x,y)$, also summarized in Table 8.1. More specifically, from the joint pdf we can derive the marginal quantities, but not the other way around. ♣
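These probabilities can be checked quickly in R by enumerating all 16 outcomes (hypothetical helper code, not part of the script):

grid <- expand.grid(t1=0:3, t2=0:3)              # all 16 equally likely outcomes
X <- abs(grid$t1 - grid$t2); Y <- pmax(grid$t1, grid$t2)
table(X, Y) / 16                                 # joint pdf of Table 8.1
mean(X >= Y)                                     # P(X >= Y) = 7/16
sum(X >= 2 & Y != 3) / sum(Y != 3)               # P(X >= 2 | Y != 3) = 2/9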
Table 8.1: Probability table for X and Y of Example 8.1 (entries are multiples of 1/16). Last column and last row give the marginal probabilities of X and Y, respectively.

 X \ Y      0    1    2    3  |  f_X(x)
   0        1    1    1    1  |    4
   1        0    2    2    2  |    6
   2        0    0    2    2  |    4
   3        0    0    0    2  |    2
 -----------------------------+--------
 f_Y(y)     1    3    5    7  |   16
Definition 8.2. The probability density function (or density function, pdf) fX (x ) of a p-
dimensional continuous random vector X is defined by
\[
\operatorname{P}(X \in A) = \int_A f_X(\boldsymbol{x})\, d\boldsymbol{x}, \qquad \text{for all } A \subset \mathbb{R}^p. \qquad (8.2)
\]
♢
For convenience, we summarize here a few properties of random vectors with two continuous
components, i.e., for a bivariate random vector (X, Y )⊤ . The univariate counterparts are stated
in Properties 2.1 and 2.3. The properties are illustrated with a subsequent extensive example.
Property 8.1. Let (X, Y )⊤ be a bivariate continuous random vector with joint density function
fX,Y (x, y) and joint distribution function FX,Y (x, y).
1. The distribution function is monotonically increasing:
for x1 ≤ x2 and y1 ≤ y2 , FX,Y (x1 , y1 ) ≤ FX,Y (x2 , y2 ).
(We use the slight abuse of notation of writing ∞ in arguments instead of taking the corresponding limit.)
$F_{X,Y}(-\infty, -\infty) = F_{X,Y}(x, -\infty) = F_{X,Y}(-\infty, y) = 0$.
6. Marginal distributions:
FX (x) = P(X ≤ x, Y arbitrary) = FX,Y (x, ∞) and FY (y) = FX,Y (∞, y);
7. Marginal densities: $f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x,y)\, dy$ and $f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x,y)\, dx$.
The last two points of Property 8.1 refer to marginalization, i.e., reduce a higher-dimensional
random vector to a lower dimensional one. Intuitively, we “neglect” components of the random
vector in allowing them to take any value.
Example 8.2. The joint distribution function of $(X, Y)^\top$ is given by $F_{X,Y}(x,y) = y - \bigl(1 - \exp(-xy)\bigr)/x$, for x ≥ 0 and 0 ≤ y ≤ 1. The top left panel of Figure 8.1 illustrates the joint cdf and the monotonicity of the joint cdf is nicely visible. The top right panel illustrates that the joint cdf is normalized: for all values x = 0 or y = 0 the joint cdf is also zero and it reaches one for x → ∞ and y = 1. The joint cdf is defined for the entire plane (x, y); if any of the two variables is negative, the joint cdf is zero. For values y > 1 we have $F_{X,Y}(x,y) = F_{X,Y}(x,1)$. That means that the joint cdf outside the domain $[0, \infty[\;\times\;[0,1]$ is determined by its properties (here, for notational simplicity, we only state the function on that specific domain).
The joint density is given by $f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x,y) = \frac{\partial}{\partial x}\bigl(1 - \exp(-xy)\bigr) = y\exp(-xy)$ and is shown in the top right panel of Figure 8.1. The joint density has a particularly simple form, but it cannot be factored into a product of marginal terms. The marginal cdfs and marginal densities are:
\[
F_Y(y) = F_{X,Y}(\infty, y) = y, \qquad F_X(x) = F_{X,Y}(x, 1) = F_{X,Y}(x, \infty) = 1 - (1 - e^{-x})/x, \qquad (8.3)
\]
\[
f_Y(y) = \int_0^\infty e^{-xy}\, y\, dx = 1, \qquad f_X(x) = \int_0^1 e^{-xy}\, y\, dy = \frac{1 - e^{-x}(1+x)}{x^2}. \qquad (8.4)
\]
Figure 8.1: Top row: joint cdf and joint density function for case discussed in Exam-
ple 8.2. Bottom row: marginal density of X and joint density evaluated at a = 1, .7, .4, .1
(in black, blue, green and red).
(Of course, we could have differentiated the expressions in the first line to get the second line or
integrated the expressions in the second line to get the ones in the first line, due to Property 2.3.)
The marginal distribution of Y is U(0, 1)! Bottom left panel of Figure 8.1 gives the marginal
density of X. The product of the two marginal densities is not equal to the joint one. Hence, X
and Y are not independent.
For coordinate-aligned rectangular domains, we use the joint cdf to calculate probabilities, for example
\[
\begin{aligned}
\operatorname{P}(1 < X \le 2,\, Y \le 1/2) &= F_{X,Y}(2, 1/2) - F_{X,Y}(2, 0) - F_{X,Y}(1, 1/2) + F_{X,Y}(1, 0) \\
&= \bigl(1/2 - (1 - e^{-2\cdot 1/2})/2\bigr) - 0 - \bigl(1/2 - (1 - e^{-1\cdot 1/2})/1\bigr) + 0 = \frac{(\sqrt{e}-1)^2}{2e} \approx 0.077,
\end{aligned} \qquad (8.5)
\]
and via integration for other cases,
\[
\operatorname{P}(XY \le 1) = F_{X,Y}(1, 1) + \int_0^1 \int_1^{1/y} e^{-xy}\, y\, dx\, dy = e^{-1} + \int_0^1 \bigl(e^{-y} - e^{-1}\bigr)\, dy = \frac{e-1}{e} \approx 0.63. \qquad (8.6)
\]
♣
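These probabilities can be verified numerically, for instance by a simple Monte Carlo simulation from the joint density (a sketch, not part of the script):

set.seed(1)
y <- runif(1e6)                          # Y ~ U(0,1), the marginal of Y
x <- rexp(1e6, rate=y)                   # X | Y = y ~ Exp(y), since f(x,y) = y exp(-xy)
mean(x > 1 & x <= 2 & y <= 1/2)          # approximately 0.077, compare (8.5)
mean(x * y <= 1)                         # approximately 0.63, compare (8.6)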
Hence the expectation of a random vector is simply the vector of the individual expectations.
Of course, to calculate these, we only need the marginal univariate densities fXi (x) and thus the
expectation does not change whether (8.1) can be factored or not. The expectation of products
of random variables is defined as
\[
\operatorname{E}(X_1 X_2) = \int\!\!\int x_1 x_2\, f(x_1, x_2)\, dx_1\, dx_2 \qquad (8.8)
\]
(for continuous random variables). The variance of a random vector requires a bit more thought
and we first need the following.
Definition 8.4. The covariance between two arbitrary random variables X1 and X2 is defined
as
\[
\operatorname{Cov}(X_1, X_2) = \operatorname{E}\bigl((X_1 - \operatorname{E}(X_1))(X_2 - \operatorname{E}(X_2))\bigr) = \operatorname{E}(X_1 X_2) - \operatorname{E}(X_1)\operatorname{E}(X_2). \qquad (8.9)
\]
In the case of two independent random variables, their joint density can be factored and Equation (8.8) shows that $\operatorname{E}(X_1 X_2) = \operatorname{E}(X_1)\operatorname{E}(X_2)$; thus their covariance is zero. The converse, however, is in general not true.
Using the linearity property of the expectation operator, it is possible to show the following
handy properties.
1. Cov(X1 , X2 ) = Cov(X2 , X1 ),
2. Cov(X1 , X1 ) = Var(X1 ),
It is important to note that the covariance describes the linear relationship between the
random variables.
The covariance matrix is a symmetric matrix and – except for degenerate cases – a positive
definite matrix. We will not consider degenerate cases and thus we can assume that the inverse
of the matrix Var(X) exists and is called the precision.
Similar to Properties 2.5, we have the following properties for random vectors.
Property 8.3. For an arbitrary p-variate random vector X, a (fixed) vector $a \in \mathbb{R}^q$ and a matrix $B \in \mathbb{R}^{q \times p}$ it holds:
The covariance itself cannot be interpreted directly as it can take arbitrary values. The correlation between two random variables X1 and X2 is defined as
\[
\operatorname{Corr}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}} \qquad (8.12)
\]
and corresponds to the normalized covariance. It holds that −1 ≤ Corr(X1 , X2 ) ≤ 1, with
equality only in the degenerate case X2 = a + bX1 for some a and b ̸= 0.
Definition 8.6. Let $(X, Y)^\top$ be a bivariate continuous random vector with joint density function $f_{X,Y}(x,y)$. The conditional density of X given Y = y is
\[
f_{X\mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}, \qquad (8.14)
\]
whenever fY (y) > 0 and zero otherwise. ♢
Example 8.3. In the setting of Example 8.2 the marginal density of Y is constant and thus $f_{X\mid Y}(x \mid y) = f_{X,Y}(x, y)$. The curves in the bottom right panel of Figure 8.1 are thus actual densities: the conditional density $f_{X\mid Y}(x \mid y)$ for y = 1, 0.7, 0.4 and 0.1.
The conditional expectation of X given Y = y is $\operatorname{E}(X \mid Y = y) = \int_0^\infty x\, e^{-xy} y\, dx = 1/y$. ♣
Definition 8.7. The random variable pair (X, Y ) has a bivariate normal distribution if
\[
F_{X,Y}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du \qquad (8.17)
\]
with density
\[
f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\!\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right) \qquad (8.18)
\]
for all x and y and where $\mu_x \in \mathbb{R}$, $\mu_y \in \mathbb{R}$, $\sigma_x > 0$, $\sigma_y > 0$ and $-1 < \rho < 1$. ♢
The role of some of the parameters µx , µy , σx , σy and ρ might be guessed. We will discuss
their precise meaning after the following example. Note that the joint cdf does not have a closed
form and essentially all probability calculations involve some sort of numerical integration scheme
of the density.
Example 8.4. Figure 8.2 (based on R-Code 8.1) shows the density of a bivariate normal distribution with $\mu_x = \mu_y = 0$, $\sigma_x = 1$, $\sigma_y = \sqrt{5}$, and $\rho = 2/\sqrt{5} \approx 0.9$. Because of the quadratic form in (8.18), the contour lines (isolines) are ellipses.
The joint cdf is harder to interpret; it is almost impossible to infer the shape and orientation of an isoline of the density from it. Since (x, y) can take any values in the plane, the joint cdf is strictly positive and the value zero is only reached in the limit $\lim_{x,y \searrow -\infty} F(x,y) = 0$.
Several R packages implement the bivariate/multivariate normal distribution. We recommend
the package mvtnorm. ♣
150 CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
R-Code 8.1: Density of a bivariate normal random vector. (See Figure 8.2.)
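Since the listing is not reproduced here, the following is a minimal sketch of how such a density plot could be generated (assumed code, not the script's exact R-Code 8.1):

library(mvtnorm)
mu <- c(0, 0)
Sigma <- matrix(c(1, 2, 2, 5), 2, 2)            # sigma_x = 1, sigma_y = sqrt(5), rho = 2/sqrt(5)
x <- y <- seq(-3, 3, length.out = 100)
grid <- as.matrix(expand.grid(x = x, y = y))
dens <- matrix(dmvnorm(grid, mean = mu, sigma = Sigma), 100, 100)
contour(x, y, dens, xlab = "x", ylab = "y")     # elliptical isolines of the density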
The bivariate normal distribution has many nice properties. A first set is as follows.
Property 8.4. For the bivariate normal random vector as specified by (8.18), we have: (i) The
marginal distributions are X ∼ N (µx , σx2 ) and Y ∼ N (µy , σy2 ) and (ii)
\[
\operatorname{E}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \operatorname{Var}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}. \qquad (8.19)
\]
Thus, ρ is indeed the correlation between X and Y and, in the jointly normal case, X and Y are independent if and only if they are uncorrelated (ρ = 0).
Note, however, that the equivalence of independence and uncorrelatedness is specific to jointly
normal variables and cannot be assumed for random variables that are not jointly normal.
Example 8.5. Figure 8.3 (based on R-Code 8.2) shows realizations from a bivariate normal
distribution with zero mean and unit marginal variance for various values of correlation ρ. Even
for large samples as shown here (n = 200), correlations between −0.25 and 0.25 are barely
perceptible. ♣
Figure 8.2: Density (top row) and distribution function (bottom row) of a bivariate
normal random vector. (See R-Code 8.1.)
R-Code 8.2 Realizations from a bivariate normal distribution for various values of ρ,
termed binorm (See Figure 8.3.)
require( mvtnorm)   # provides rmvnorm()
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
sample <- rmvnorm( 200, sigma=Sigma)
plot( sample, pch=20, xlab='', ylab='', xaxs='i', yaxs='i',
xlim=c(-4, 4),ylim=c(-4, 4), cex=.4)
legend( "topleft", legend=bquote(rho==.(rho[i])), bty='n')
}
Figure 8.3: Realizations from a bivariate normal distribution with mean zero, unit
marginal variance and different correlations (n = 200). (See R-Code 8.2.)
with density
\[
f_X(x_1, \ldots, x_p) = f_X(\boldsymbol{x}) = \frac{1}{(2\pi)^{p/2}\det(\boldsymbol{\Sigma})^{1/2}} \exp\Bigl(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\Bigr) \qquad (8.22)
\]
for all x ∈ Rp (with µ ∈ Rp and symmetric, positive-definite Σ). We denote this distribution
with X ∼ Np (µ, Σ). ♢
The following two properties give insight into the meaning of the parameters and give the
distribution of linear combinations of Gaussian random variables.
8.3. MULTIVARIATE NORMAL DISTRIBUTION 153
\[
a + BX \sim \mathcal{N}_q\bigl(a + B\mu,\; B\Sigma B^\top\bigr). \qquad (8.24)
\]
This last property has profound consequences. First, it generalizes Property 2.7 and, second, it asserts that the one-dimensional marginal distributions are again Gaussian with $X_i \sim \mathcal{N}\bigl((\mu)_i, (\Sigma)_{ii}\bigr)$, i = 1, …, p (by (8.24) with a = 0 and B = (…, 0, 1, 0, …), a row vector with zeros except for a one at the ith position).
We now discuss how to draw realizations from an arbitrary Gaussian random vector, much in the spirit of Property 2.7. We suppose that we can efficiently draw from a standard normal distribution. Recall that if Z ∼ N(0, 1), then σZ + µ ∼ N(µ, σ²), σ > 0. We will rely on Equation (8.24) but need to decompose the target covariance matrix Σ. Let $L \in \mathbb{R}^{p\times p}$ be such that $LL^\top = \Sigma$. That means, L is like a "matrix square root" of Σ.
To draw a realization x from a p-variate random vector X ∼ N_p(µ, Σ), one starts with drawing p values from $Z_1, \ldots, Z_p \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and sets $\boldsymbol{z} = (z_1, \ldots, z_p)^\top$. The vector is then (linearly) transformed with $\boldsymbol{\mu} + L\boldsymbol{z}$. Since Z ∼ N_p(0, I), where $I \in \mathbb{R}^{p\times p}$ is the identity matrix (a square matrix with ones on the main diagonal and zeros elsewhere), Property 8.6 asserts that $X = \mu + LZ \sim \mathcal{N}_p(\mu, LL^\top)$.
In practice, the Cholesky decomposition of Σ is often used. This factorization decomposes a symmetric positive-definite matrix into the product of a lower triangular matrix L and its transpose. It holds that $\det(\Sigma) = \det(L)^2 = \prod_{i=1}^{p} (L)_{ii}^2$, i.e., we also get the normalizing constant of (8.22) essentially for free.
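A minimal sketch of this recipe (the values of µ and Σ below are illustrative, not from the script):

set.seed(2)
p <- 2
mu <- c(1, -1)
Sigma <- matrix(c(2, 0.8, 0.8, 1), p, p)
L <- t(chol(Sigma))      # chol() returns the upper triangular factor, hence the transpose
z <- rnorm(p)            # p independent standard normal draws
x <- mu + L %*% z        # one realization from N_p(mu, Sigma)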
Both (multivariate) marginal distributions X1 and X2 are again normally distributed with X1 ∼
Nq (µ1 , Σ11 ) and X2 ∼ Np−q (µ2 , Σ22 ) (this can be seen again by Property 8.6).
Property 8.7. If one conditions a multivariate normally distributed random vector (8.26) on a sub-vector, the result is itself multivariate normally distributed with
\[
X_2 \mid X_1 = \boldsymbol{x}_1 \;\sim\; \mathcal{N}_{p-q}\bigl(\boldsymbol{\mu}_2 + \Sigma_{21}\Sigma_{11}^{-1}(\boldsymbol{x}_1 - \boldsymbol{\mu}_1),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\bigr). \qquad (8.27)
\]
Equation (8.27) is probably one of the most important formulas one encounters in statistics, albeit not always in this explicit form. The result is again one of the many features of Gaussian random vectors: the distribution is closed (meaning it remains Gaussian) with respect to linear combinations and conditioning.
We now have a closer look at Equation (8.27) and give a detailed explanation thereof. The expected value of the conditional distribution (the conditional expectation) depends linearly on the value of $\boldsymbol{x}_1$, but the variance is independent of the value of $\boldsymbol{x}_1$. The conditional expectation represents an update of $X_2$ through $X_1 = \boldsymbol{x}_1$: the difference $\boldsymbol{x}_1 - \boldsymbol{\mu}_1$ is normalized by its variance $\Sigma_{11}$ and scaled by the covariance $\Sigma_{21}$.
The interpretation of the conditional distribution (8.27) is a bit easier to grasp if we use p = 2 with X and Y, in which case $\Sigma_{21}\Sigma_{11}^{-1} = \rho\sigma_y\sigma_x(\sigma_x^2)^{-1} = \rho\sigma_y/\sigma_x$ and $\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = \rho^2\sigma_y^2$, yielding
\[
Y \mid X = x \;\sim\; \mathcal{N}\bigl(\mu_y + \rho\sigma_y\sigma_x^{-1}(x - \mu_x),\; \sigma_y^2(1-\rho^2)\bigr).
\]
This result is illustrated in Figure 8.4. The bivariate density is shown with blue ellipses. The vertical green density is the marginal density of Y and represents the "uncertainty" without any further information. The inclined ellipse indicates dependence between both variables. Hence, if we know X = x (e.g., one of the ticks on the x-axis), we change our knowledge about the second variable: we can adjust the mean and reduce the variance. The red vertical densities are two examples of conditional densities. The conditional means lie on the line $\mu_y + \rho\sigma_y\sigma_x^{-1}(x - \mu_x)$. The unconditional mean $\mu_y$ is corrected based on the difference $x - \mu_x$; the further x is from the mean of X, the larger the correction. Of course, this difference needs to be normalized by the
standard deviation of X, leading then to σx−1 (x − µx ). The tighter the ellipses the stronger we
need to correct (further multiplication with ρ) and finally, we have to scale (back) according to
the variable Y (final multiplication with σy ).
The conditional variance remains the same for all possible X = x (the red vertical densities
have the same variance) and can be written as σy2 (1 − ρ2 ) and is smaller the larger the absolute
value of ρ is. As ρ is the correlation, a large correlation means a stronger linear relationship, the
contour lines of the ellipses are tighter and thus the more information we have for Y . A negative
ρ simply tilts the ellipses preserving the shape. Hence, similar information for the variance and
a mere sign change for the mean adjustment.
This interpretation indicates that the conditional distribution, and more specifically the conditional expectation, plays a large role in prediction, where one tries to predict (in the literal sense of the word) a random variable based on an observation (see Problem 8.4).
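A small numerical illustration of the conditional mean and variance (the parameter values are purely illustrative):

mu_x <- 0; mu_y <- 1; sigma_x <- 1; sigma_y <- 2; rho <- 0.8
x <- 1.5                                          # observed value of X
mu_y + rho * sigma_y / sigma_x * (x - mu_x)       # conditional mean of Y given X = x
sigma_y^2 * (1 - rho^2)                           # conditional variance, independent of x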
Remark 8.1. With the background of this chapter, we are able to close a gap from Chapter 3. A slightly more detailed explanation of Property 3.4.2 is as follows. We represent the vector $(X_1, \ldots, X_n)^\top$ by two components, $(X_1 - \overline{X}, \ldots, X_n - \overline{X})^\top$ and $\overline{X}\mathbf{1}$, with $\mathbf{1}$ the n-vector containing only ones. Both vectors are orthogonal, thus independent. If two random variables, say Y and Z, are independent, then so are g(Y) and h(Z) (for reasonable choices of g and h). Here, g and h are multivariate and have the particular forms $g(y_1, \ldots, y_n) = \sum_{i=1}^{n} y_i^2/(n-1)$ and $h(z_1, \ldots, z_n) = z_1$. ♣
a) Prove Property 8.3. Specifically, let X be a p-variate random vector and let $B \in \mathbb{R}^{q\times p}$ and $a \in \mathbb{R}^{q}$ be a non-stochastic matrix and vector, respectively. Show the following.
Problem 8.2 (Bivariate normal) In this exercise we derive the bivariate density in an alternative
fashion.
a) Let $X, Y \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$. Show that the joint density is $f_{X,Y}(x,y) = \frac{1}{2\pi}\exp\bigl(-(x^2+y^2)/2\bigr)$.
b) Define $Z = \rho X + \sqrt{1-\rho^2}\, Y$. Derive the distribution of Z and of the pair (X, Z). Give an
Problem 8.3 (Covariance and correlation) Let fX,Y (x, y) = c · exp(−x)(x + y) for 0 ≤ x ≤ ∞,
0 ≤ y ≤ 1 and zero otherwise.
b) Derive the marginal densities fX (x) and fY (y) and their first two moments.
d) Derive the conditional densities fY |X (y | x), fX|Y (x | y) and the conditional expectations
E(Y | X = x), E(X | Y = y). Give an interpretation thereof.
Problem 8.5 (Sum of dependent random variables) We saw that if X, Y are independent
Gaussian variables, then their (weighted) sum is again Gaussian. This problem illustrates that
the assumption of independence is necessary.
Let X be a standard normal random variable. We define Y as
\[
Y = \begin{cases} X & \text{if } |X| \ge c, \\ -X & \text{otherwise,} \end{cases}
\]
b) Argue heuristically and with simulations that (i) X and Y are dependent and (ii) the
random variable X + Y is not normally distributed.
Chapter 9
Estimation of Correlation and Simple Regression
We now start to introduce more complex and realistic statistical models (compared to (4.1)).
In most of this chapter (and most of the remaining ones) we consider “linear models.” We start by
estimating the correlation and then linking the simple regression model to the bivariate Gaussian
distribution, followed by extending the model to several so-called predictors in the next chapter.
Chapter 11 presents the least squares approach from a variance decomposition point of view.
A detailed discussion of linear models would fill entire books, hence we consider only the most
important elements.
The (simple) linear regression is commonly considered the archetypical task of statistics and
is often introduced as early as middle school, either in a black-box form or based on intuition.
We approach the problem more formally: in this chapter we will (i) quantify the (linear) relationship between variables and (ii) explain one variable through another variable with the help of a “model”.
where we used the same denominator for the covariance and the variance estimates (e.g., n − 1).
The estimate r is called the Pearson correlation coefficient. Just like the correlation, the Pearson
correlation coefficient also lies in the interval [−1, 1].
We will introduce a handy notation that is often used in the following:

sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ),    sxx = Σ_{i=1}^{n} (xi − x̄)²,    syy = Σ_{i=1}^{n} (yi − ȳ)².    (9.2)

Hence, we can express (9.1) as r = sxy / √(sxx syy). Similarly, unbiased variance estimates are
sxx /(n − 1) and syy /(n − 1), and an estimate for the covariance is sxy /(n − 1) (again unbiased).
Example 9.1. In R, we use cov() and cor() to obtain an estimate of the covariance and
of the Pearson correlation coefficient. For the penguin data (see Example 1.9) we have between
body mass and flipper length a covariance of 9824.42 and a correlation of 0.87. The lat-
ter was obtained with the command with(penguins, cor(body_mass_g, flipper_length_mm,
use="complete.obs")), where the argument use specifies the handling of missing values. ♣
example, r = 0.25 is not significant unless n > 62. Naturally, if we do not have Gaussian data,
Test 14 is not exact and the resulting p-value is only an approximation.
In order to construct confidence intervals for correlation estimates, we typically need the
so-called Fisher transformation

W (r) = (1/2) log((1 + r)/(1 − r)) = arctanh(r)    (9.4)

and the fact that, for bivariate normally distributed random variables, the distribution of W (R)
is approximately N(W (ρ), 1/(n − 3)). Then a straightforward confidence interval can be constructed:

(W (R) − W (ρ)) / √(1/(n − 3)) ∼ N (0, 1) approximately,    approx. CI for W (ρ): [ W (R) ± z_{1−α/2}/√(n − 3) ].    (9.5)

A confidence interval for ρ requires a back-transformation and is shown in CI 6.
Example 9.2. We estimate the correlation of the scatter plots from Figure 8.3 and calculate
the corresponding 95%-confidence intervals thereof in R-Code 9.1. For these specific samples, the
confidence intervals obtained from simulating with ρ = 0 and 0.1 cover the value zero. Hence, the
correlation is not statistically significantly different from zero.
The widths of the six intervals are slightly different and the correlation estimate is not precisely
in the center of the interval, both due to the back-transformation. ♣
R-Code 9.1 Pearson correlation coefficient and confidence intervals of the scatter plots
from Figure 8.3.
require( mvtnorm)
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
n <- 200
out <- matrix(0, 3,6, dimnames=list(c("rhohat","b_low","b_up"), paste(rho)))
for (i in 1:6) {
Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
sample <- rmvnorm( n, sigma=Sigma)
out[1,i] <- cor( sample)[2]
out[2:3,i] <- tanh( atanh( out[1,i]) + qnorm( c(0.025,0.975))/sqrt(n-3))
}
print( out, digits=2)
## -0.25 0 0.1 0.25 0.75 0.9
## rhohat -0.181 -0.1445 0.035 0.24 0.68 0.89
## b_low -0.312 -0.2777 -0.104 0.11 0.60 0.86
## b_up -0.044 -0.0059 0.173 0.37 0.75 0.92
There are alternatives to the Pearson correlation coefficient, that is, there are alternative estimators of the correlation. The two most common ones are Spearman’s ϱ and Kendall’s τ ; they are based on ranks and are thus also called rank correlation coefficients. In brief, Spearman’s ϱ is calculated similarly to (9.1), where the values are replaced by their ranks. Kendall’s τ compares the number of concordant (if xi < xj then yi < yj ) and discordant (if xi < xj then yi > yj ) pairs. As expected, Pearson’s correlation coefficient is not robust, while Spearman’s ϱ and Kendall’s τ are “robust”.
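As a brief illustration (a minimal sketch, not part of the original R-Codes), both rank-based estimates can be obtained with the method argument of cor(); here we simply reuse the penguins variables of Example 9.1.

with( penguins, cor( body_mass_g, flipper_length_mm,      # Spearman's rho, based on ranks
                     use="complete.obs", method="spearman"))
with( penguins, cor( body_mass_g, flipper_length_mm,      # Kendall's tau, based on concordant pairs
                     use="complete.obs", method="kendall"))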
R-Code 9.2 anscombe data: visualization and correlation estimates. (See Figure 9.1.)
Figure 9.1: anscombe data, the four cases all have the same Pearson correlation
coefficient of 0.82, yet the scatter plots show completely different relationships. (See
R-Code 9.2.)
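A minimal sketch of the kind of computation behind R-Code 9.2, using the built-in anscombe data (the exact code of R-Code 9.2 may differ):

data( anscombe)                      # four x/y pairs with (nearly) identical summary statistics
sapply( 1:4, function(i)             # Pearson correlation of each pair, all approximately 0.82
  cor( anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
par( mfrow=c(1, 4))                  # scatter plots as in Figure 9.1
for (i in 1:4)
  plot( anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
        xlab=paste0("x", i), ylab=paste0("y", i))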
The estimators and estimates are intuitive generalizations of the univariate forms (see Prob-
lem 8.3.a). In fact, it is possible to show that the two estimators given in (9.6) are unbiased.
A normally distributed random variable is determined by two parameters, namely the mean
µ and the variance σ². A multivariate normally distributed random vector is determined by p
and p(p + 1)/2 parameters for µ and Σ, respectively. The remaining p(p − 1)/2 entries of Σ
are determined by symmetry. However, the p(p + 1)/2 values cannot be chosen arbitrarily, since
Σ must be positive definite (in the univariate case the variance must be strictly positive as well).
As long as n > p, the estimator in (9.7) satisfies this condition. If n ≤ p, additional assumptions
about the structure of the matrix Σ are needed.
Example 9.4. Similarly to R-Code 8.2, we generate bivariate realizations with different sample sizes (n = 10, 50, 100, 500). We estimate the mean vector and covariance matrix according
to (9.7); from these we can calculate the corresponding isolines of the bivariate normal density
(with plug-in estimates for µ and Σ). Figure 9.2 (based on R-Code 9.3) shows the estimated 95% and 50% confidence regions (isolines). As n increases, the estimation improves, i.e.,
the estimated ellipses are closer to the ellipses based on the true (unknown) parameters.
Note that several packages provide a function called ellipse with varying arguments. We
use the call ellipse::ellipse() to ensure using the one provided by the package ellipse. ♣
R-Code 9.3: Bivariate normally distributed random numbers for various sample sizes with
contour lines of the density and estimated moments. (See Figure 9.2.)
set.seed( 14)
require( ellipse) # to draw ellipses
n <- c( 10, 50, 100, 500) # different sample sizes
mu <- c(2, 1) # theoretical mean
Sigma <- matrix( c(4, 2, 2, 2), 2) # and covariance matrix
for (i in 1:4) {
plot(ellipse::ellipse( Sigma, centre=mu, level=.95), col='gray',
xaxs='i', yaxs='i', xlim=c(-4, 8), ylim=c(-4, 6), type='l')
lines( ellipse::ellipse( Sigma, centre=mu, level=.5), col='gray')
sample <- rmvnorm( n[i], mean=mu, sigma=Sigma) # draw realization
points( sample, pch=20, cex=.4) # add realization
muhat <- colMeans( sample) # apply( sample, 2, mean) # is identical
Sigmahat <- cov( sample) # var( sample) # is identical
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.95), col=2, lwd=2)
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.5), col=4, lwd=2)
points( rbind( muhat), col=3, cex=2)
text( -2, 4, paste('n =',n[i]))
}
Figure 9.2: Bivariate normally distributed random numbers. The contour lines of the
true density are in gray. The isolines of the estimated density correspond to the 95%
(red) and 50% (blue) levels; the sample mean is shown in green. (See R-Code 9.3.)
In general, interval estimation for random vectors is much more complex than for random
variables. In fact, due to the multiple dimensions, a better term for ‘interval estimation’ would
be ‘area’ or ‘(hyper-)volume estimation’. In practice, one often marginalizes and constructs
intervals for individual elements of the parameter vector (which we will also practice in
Chapter 10, for example).
In the trivial case of X1 , . . . , Xn iid∼ Np (µ, Σ) with Σ known, we have µ̂ = X̄ ∼ Np (µ, Σ/n)
(can also be shown with Property 8.24) and the uncertainty in the estimate can be expressed with
ellipses or hyper-ellipsoids. In case of unknown Σ, plug-in estimates can be used, at the price of
accuracy. This setting is essentially the multivariate version of the middle panel of Figure 4.3.
The discussion of uncertainty in Σ̂ is beyond the scope of this book.
To estimate the correlation matrix we define the diagonal matrix D with elements dii = ((Σ̂)ii)^{−1/2}. The estimated correlation matrix is given by

R̂ = D Σ̂ D.    (9.8)
Example 9.5. We retake the setup of Example 9.4 for n = 50 and draw 100 realizations.
Figure 9.3 visualizes the densities based on the estimated parameters (as in Figure 9.2, top left).
The variability seems dramatic. However, if we estimated the parameters marginally, we
would obtain the same estimates and the same uncertainties. The only additional element here is
the estimate of the covariance, i.e., the off-diagonal elements of Σ̂. ♣
set.seed( 14)
n <- 50 # fixed sample size
R <- 100 # draw R realizations
plot( 0, type='n', xaxs='i', yaxs='i', xlim=c(-4, 8), ylim=c(-4, 6))
for (i in 1:R) {
sample <- rmvnorm( n, mean=mu, sigma=Sigma)
muhat <- colMeans( sample) # estimated mean
Sigmahat <- cov( sample) # estimated variance
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.95), col=2)
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.5), col=4)
points( rbind( muhat), col=3, pch=20) # add mean in green
}
2. assess pairs plots for elliptic scatter plots of all bivariate pairs;
3. assess QQ-plots of linear combinations, e.g. sums of two, three, . . . , all variables;
4. assess d1 , . . . , dn with di = (xi − x̄)⊤ Σ̂^{−1} (xi − x̄) with a QQ-plot against a χ²p distribution.
If any of the above fails, we have evidence against multivariate normality of the entire dataset;
subsets thereof may of course still be multivariate normal. Note that only for the fourth point
do we need estimates of the mean and covariance, because the marginal QQ-plots are scale
invariant. A minimal sketch of this fourth check is given below.
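The sketch uses the bivariate sample generated in R-Code 9.3 (so p = 2) and the base R function mahalanobis(); the object sample is assumed to be available from that code.

d <- mahalanobis( sample, center=colMeans( sample), cov=cov( sample))  # squared distances d_1, ..., d_n
qqplot( qchisq( ppoints( length( d)), df=ncol( sample)), d,
        xlab="Chi-square quantiles", ylab="Squared Mahalanobis distances")
abline( 0, 1, col='gray')            # under normality the points roughly follow this line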
Yi = µi + εi    (9.9)
   = β0 + β1 xi + εi ,    i = 1, . . . , n,    (9.10)

with

• β0 , β1 : parameters (unknown);

• εi : error, error term, noise (unknown), with symmetric distribution around zero.

It is often also assumed that Var(εi ) = σ² and that the errors are independent of each other.
We make another simplification and further assume εi iid∼ N (0, σ²) with unknown σ². Thus,
Yi ∼ N (β0 + β1 xi , σ²), i = 1, . . . , n, and Yi and Yj are independent for i ̸= j. Because of the
varying mean, Y1 , . . . , Yn are not identically distributed.
Fitting a linear model is based on pairs of data (x1 , y1 ), . . . , (xn , yn ) and the model (9.10),
with the goal of finding estimates for β0 and β1 ; it is essentially an estimation problem. In R, this task
is straightforward, as shown in the motivating example below, which will be dissected to
illustrate the theoretical concepts in the remaining part of the chapter.
Example 9.6 (hardness data). One of the steps of manufacturing metal springs is a quenching
bath that cools the metal back to room temperature, preventing a slow cooling process from
dramatically changing the metal’s microstructure. The temperature of the bath has an influence on the
hardness of the springs. The data is taken from Abraham and Ledolter (2006). The Rockwell scale
measures the hardness of technical materials and is denoted by HR (Hardness Rockwell). Figure 9.4 shows the Rockwell hardness of coil springs as a function of the temperature (in degrees
Celsius) of the quenching bath, as well as the line of best fit. R-Code 9.5 shows how a simple
linear regression is performed with the function lm() and the ‘formula’ statement Hard~Temp. ♣
R-Code 9.5 hardness data from Example 9.6. (See Figure 9.4.)
Temp <- c(30, 30, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60)
# Temp <- rep( 10*3:6, c(4, 3, 3, 4)) # alternative way based on `rep()`
Hard <- c(55.8, 59.1, 54.8, 54.6, 43.1, 42.2, 45.2,
31.6, 30.9, 30.8, 17.5, 20.5, 17.2, 16.9)
plot( Temp, Hard, xlab="Temperature [C]", ylab="Hardness [HR]")
lm1 <- lm( Hard~Temp) # fitting of the linear model
abline( lm1) # add the fit of the linear model to the plot
Figure 9.4: hardness data: hardness as a function of temperature. The black line is
the fitted regression line. (See R-Code 9.5.)
Classically, we obtain the estimates β̂0 and β̂1 with a least squares approach (see Section 4.2.1), which
minimizes the sum of squared residuals. That means that β̂0 and β̂1 are determined such that
Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )²    (9.11)

is minimized. This concept is also called the ordinary least squares (OLS) method to emphasize that
we have iid errors and no weighting is taken into account. The solution of the minimization is
given by

β̂1 = r √(syy /sxx) = sxy /sxx = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² ,    (9.12)

β̂0 = ȳ − β̂1 x̄    (9.13)
and are termed the estimated regression coefficients. The first equality in (9.12) emphasizes
the difference to a correlation estimate: here, we correct Pearson’s correlation estimate with
the factor √(syy /sxx). It can be shown that the associated estimators are unbiased (see Problem 9.1.b).
where we have used a slightly different denominator than in Example 4.6.2 to obtain an unbiased
estimate. More precisely, instead of estimating one mean parameter we have to estimate two
regression coefficients and thus instead of n − 1 we use n − 2. For simplicity, we do not introduce
another symbol for the estimate here. The variance estimate σ̂² is often termed the mean squared
error and its root, σ̂, the residual standard error.
Example 9.7 (continuation of Example 9.6). R-Code 9.6 illustrates how to access the estimates,
the fitted values and the residuals from the output of the linear model fit of Example 9.6.
More specifically, the function lm() generates an object of class "lm" and the functions coef(),
fitted() and residuals() extract the corresponding elements from the object.
To access the residual standard error we need to further “digest” the object via summary().
More specifically, an estimate of σ² can be obtained via residuals(lm1) and Equation (9.17)
(e.g., sum(residuals(lm1)^2)/(14-2)) or directly via summary(lm1)$sigma^2.
There are more methods defined for an object of class "lm", for example abline() as used
in R-Code 9.5; more will be discussed subsequently. ♣
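A minimal sketch of these extractor calls, applied to the object lm1 from R-Code 9.5 (R-Code 9.6 may differ in detail):

coef( lm1)                                  # estimated regression coefficients
head( fitted( lm1))                         # fitted values
head( residuals( lm1))                      # residuals
sum( residuals( lm1)^2)/(length( Hard)-2)   # variance estimate, cf. Equation (9.17)
summary( lm1)$sigma^2                       # the same value, extracted from the summary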
Example 9.8 (continuation of Examples 9.6 and 9.7). R-Code 9.7 outputs the summary of a
linear fit by summary(lm1). ♣
The output essentially consists of four blocks. The first restates the call of the model fit, the
second states information about the residuals, the third summarizes the regression coefficients
and the fourth digests the overall quality of the regression.
The summary of the residuals serves to check whether the residuals are symmetric (the median should
be close to zero, and the quartiles should be similar up to sign) or whether there are outliers (a very small minimum or
a very large maximum value). In the next chapter, we will revisit graphical assessments of the residuals.
The third block of the summary output contains, for each estimated coefficient, the following
four numbers: the estimate itself, its standard error, the ratio of the first two and the associated
p-value of Test 15. More specifically, for the slope parameter of a simple linear regression
these are β̂1 , SE(β̂1 ), β̂1 / SE(β̂1 ) and a p-value.
To obtain the aforementioned p-value, consider β̂1 as an estimator. It can be shown that the
properly scaled estimator β̂1 / SE(β̂1 ) has a t-distribution and thus we can apply a
statistical test for H0 : β1 = 0, detailed in Test 15. That means that for the slope parameter, the
last number is P(|V| ≥ |β̂1 |/ SE(β̂1 )) with V a t-distributed random variable with n − 2 degrees
of freedom. This latter test is conceptually very similar to Test 1 and essentially identical to
Test 14, see Problem 9.1.e.
The final block summarizes the overall fit of the model.
Figure 9.5: R summary output of an object of class lm (here for the hardness data
from Example 9.6).
In the case of a “good” model fit, the resulting absolute residuals are small. More precisely, the residual sums of squares are small, at
least compared to the total variability in the data, syy . In later chapters, we will formalize that
a linear regression decomposes the total variability in the data into two components, namely the
variability explained by the model and the variability of the residuals. We often write this as

SST = SSM + SSE ,    (9.18)

where the subscripts indicate ‘total’, ‘model’ and ‘error’. This statement is of a general nature
and holds not only for linear regression models but for virtually all statistical models. Here,
the somewhat surprising fact is that we have the simple expressions SST = Σi (yi − ȳ)², SSM =
Σi (ŷi − ȳ)² and SSE = Σi (yi − ŷi )² (see also Problem 9.1.d). The specific values of the final
block for the simple regression are calculated as follows:
Multiple R²:   R² = 1 − SSE /SST = 1 − Σi (yi − ŷi )² / Σi (yi − ȳ)² ,    (9.19)

Adjusted R²:   R²adj = R² − (1 − R²)/(n − 2),    (9.20)

Observed value of the F -test:   SSM / (SSE /(n − 2)).    (9.21)
The multiple R² is sometimes called the coefficient of determination or simply R-squared. The
adjusted R² is a weighted version of R² and we revisit it later in more detail. For a simple
linear regression the F -test is an alternative way to test H0 : β1 = 0 (see Problem 9.1.e) and a
deeper meaning will be revealed with more complex models. The denominator of (9.21) is σ̂².
Note that for the simple regression the degrees of freedom of the F -test are 1 and n − 2.
Hence, if the regression model explains a lot of the variability in the data, SSM is large compared
to SSE ; that means SSE is small compared to SST , which implies that R² is close to one and the
observed value of the F -test (the F -statistic) is large.
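These quantities can be verified by hand; a minimal sketch for the hardness fit lm1 of R-Code 9.5:

SST <- sum( (Hard - mean( Hard))^2)           # total sum of squares
SSM <- sum( (fitted( lm1) - mean( Hard))^2)   # model sum of squares
SSE <- sum( residuals( lm1)^2)                # error sum of squares
c( SST=SST, SSM.plus.SSE=SSM + SSE)           # verifies SST = SSM + SSE
c( R2=1 - SSE/SST, Fobs=SSM/(SSE/(length( Hard)-2)))      # cf. (9.19) and (9.21)
c( summary( lm1)$r.squared, summary( lm1)$fstatistic[1])  # for comparison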
In simple linear regression, the central task is often to determine whether there exists a linear
relationship between the dependent and independent variables. This can be tested with the
hypothesis H0 : β1 = 0 (Test 15). We do not formally derive the test statistic here. The idea
is to replace in Equation (9.12) the observations yi with random variables Yi with distribution
specified by (9.10) and derive the distribution of a test statistic.
For prediction of the dependent variable at x0 , a specific, given value of the independent
variable, we plug x0 into Equation (9.15), i.e., we determine the corresponding value of the
regression line. The function predict() can be used for prediction in R.
Prediction at a (potentially) new value x0 can also be written as
Thus, the last expression is equivalent to Equation (8.27) but with estimates instead of (unknown)
parameters. Or in other words, simple linear regression is equivalent to the conditional expectation
of a bivariate normal distribution with plug-in estimates.
The uncertainty of the prediction depends on the uncertainty of the estimated parameter.
Specifically:
Var(µ̂0 ) = Var(β̂0 + β̂1 x0 ),    (9.23)
is an expression in terms of the independent variables, the dependent variables and (implicitly)
the variance of the error term. Because βb0 and βb1 are not necessarily independent, it is quite
tedious to derive the explicit expression. However, with matrix notation, the above variance is
straightforward to calculate, as will be illustrated in Chapter 10.
To construct confidence intervals for a prediction, we must discern whether the prediction is
for the mean response µ̂0 or for an unobserved (e.g., future) observation ŷ0 at x0 . The former
interval depends on the variability of the estimates β̂0 and β̂1 . For the latter, the
prediction interval depends on the uncertainty of µ̂0 and additionally on the variability of the error
ε, i.e., on σ̂². Hence, the latter is always wider than the former. In R, these two types are
requested through the argument interval="confidence" or interval="prediction" of predict(); a minimal sketch follows below.
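A minimal sketch along the lines of R-Code 9.8 (the exact code may differ), based on the hardness fit lm1:

newTemp <- data.frame( Temp=seq( 30, 60, by=0.5))             # fine grid of temperatures
cip <- predict( lm1, newdata=newTemp, interval="confidence")  # CI for the mean response
pip <- predict( lm1, newdata=newTemp, interval="prediction")  # interval for a new observation
plot( Temp, Hard, xlab="Temperature [C]", ylab="Hardness [HR]")
abline( lm1)
matlines( newTemp$Temp, cip[, 2:3], col=4, lty=2)             # narrower band
matlines( newTemp$Temp, pip[, 2:3], col=2, lty=3)             # wider band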
Example 9.9 (continuation of Examples 9.6 to 9.8). We close this example by constructing
pointwise confidence intervals for the mean response and for a new, unobserved observation. Figure 9.6 (based on R-Code 9.8) illustrates the confidence intervals for a very fine grid of temperatures. For each temperature, we calculate the intervals (hence “pointwise”), but as classically
done they are visualized by lines. The width of both intervals increases as we move further from x̄. ♣
R-Code 9.8 hardness data: predictions and pointwise confidence intervals. (See Fig-
ure 9.6.)
In both intervals we use estimators, i.e., all the estimates based on yi are to be replaced with
Yi to obtain the corresponding estimators.
a) Show that the off-diagonal entries of the matrix R̂ of (9.8) are equivalent to (9.1) and that
the diagonal elements are one.
b) Show that the ordinary least squares regression coefficient estimators are unbiased.
c) Show that the rightmost expression of (9.12) can be written as Σ_{i=1}^{n} (xi − x̄) yi / sxx and
deduce that Var(β̂1 ) = σ²/sxx .
d) Show the decomposition (9.18) by showing that SSM = r² syy and SSE = (1 − r²) syy .
e) Show that the value of the F -test in the lm summary is equivalent to t²obs of Test 15 and
to t²obs of Test 14.
Problem 9.2 (Correlation) For the swiss dataset, calculate the correlation between the vari-
ables Catholic and Fertility as well as Catholic and Education. What do you conclude?
Hint: for the interpretation, you might use the parallel coordinate plot, as shown in Figure 1.10
in Chapter 1.
Problem 9.3 (Bivariate normal distribution) Consider the random sample X1 , . . . , Xn iid∼ N2 (µ, Σ)
with

µ = (1, 2)⊤ ,    Σ = ( 1  1 ; 1  2 ).
a) Explain in words that these estimators for µ and Σ “generalize” the univariate estimators
for µ and σ.
b) Simulate n = 500 iid realizations from N2 (µ, Σ) using the function rmvnorm() from package
mvtnorm. Draw a scatter plot of the results and interpret the figure.
d) Estimate µ, Σ and the correlation between X1 and X2 from the 500 simulated values using
mean(), cov() and cor(), respectively.
e) Redo the simulation with several different covariance matrices, i.e., choose different values
as entries for the covariance matrices. What is the influence of the diagonal elements and
the off-diagonal elements of the covariance matrix on the shape of the scatter plot?
Problem 9.4 (Variance under simple random sampling) In this problem we prove the hint of
Problem 7.1.d. Let Y1 , . . . , Ym be iid with E(Yi ) = µ and Var(Yi ) = σ². We take a random
sample Y1′ , . . . , Yn′ of size n without replacement and denote the associated mean with Ȳ′ .
a) Show that E(Ȳ′) = µ.

b) Show that Cov(Yi′ , Yj′ ) = −σ²/(m − 1) if i ̸= j, and Cov(Yi′ , Yj′ ) = σ² if i = j.

c) Show that Var(Ȳ′) = (σ²/n) (1 − (n − 1)/(m − 1)). Discuss this result.
Problem 9.5 (Linear regression) The dataset twins from the package faraway contains the
IQ of mono-zygotic twins where one of the siblings has been raised by the biological parents and
the second one by foster parents. The dataset has been collected by Cyril Burt. Explain the IQ
of the foster sibling as a function of the IQ of the sibling raised by the biological parents. Give an interpretation. Discuss the
shortcomings and issues of the model.
Problem 9.6 (Linear regression) In a simple linear regression, the data are assumed to follow
Yi = β0 + β1 xi + εi with εi iid∼ N (0, σ²), i = 1, . . . , n. We simulate n = 15 data points from that
model with β0 = 1, β1 = 2, σ = 2 and the following values for xi .
Hint: to start, copy–paste the following lines into your R-Script.
a) Plot the simulated data in a scatter plot. Calculate the Pearson correlation coefficient and
the Spearman’s rank correlation coefficient. Why do they agree well?
b) Estimate the linear regression coefficients βb0 and βb1 using the formulas from the script.
Add the estimated regression line to the plot from (a).
c) Calculate the fitted values Ybi for the data in x and add them to the plot from (a).
d) Calculate the residuals (yi − ŷi ) for all n points and the residual sum of squares SS =
Σi (yi − ŷi )². Visualize the residuals by adding lines to the plot with segments(). Are
the residuals normally distributed? Do the residuals increase or decrease with the fitted
values?
e) Calculate standard errors for β0 and β1 . For σ̂ε = √(SS/(n − 2)), they are given by

σ̂β0 = σ̂ε √( 1/n + x̄² / Σi (xi − x̄)² ),    σ̂β1 = σ̂ε √( 1 / Σi (xi − x̄)² ).
f) Give an empirical 95% confidence interval for β0 and β1 . (The degree of freedom is the
number of observations minus the number of parameters in the model.)
g) Calculate the values of the t statistic for βb0 and βb1 and the corresponding two-sided p-
values.
h) Verify your result with the R function lm() and the corresponding S3 methods summary(),
fitted(), residuals() and plot() (i.e., apply these functions to the returned object of
lm()).
i) Use predict() to add a “confidence” and a “prediction” interval to the plot from (a). What
is the difference?
Hint: The meanings of "confidence" and "predict" here are based on the R function. Use
the help of those functions to understand their behaviour.
j) Fit a linear model without intercept (i.e., force β0 to be zero). Add the corresponding
regression line to the plot from (a). Discuss whether the model fits the data “better”.
l) What is the difference between a model with formula y ∼ x and x ∼ y? Explain it from
a stochastic and fitting perspective.
Problem 9.7 (BMJ Endgame) Discuss and justify the statements about ‘Simple linear regres-
sion’ given in doi.org/10.1136/bmj.f2340.
Chapter 10
Multiple Regression
– Multicollinearity
– Influential points
– Interactions between variables
– Categorical variables (factors)
– Model validation and information criterion (basic theory and R)
In many situations a dependent variable is associated with more than one independent vari-
able. The simple linear regression model can be extended by the addition of further independent
variables. We first introduce the model and estimators. Subsequently, we become acquainted
with the most important steps in model validation. Two typical examples of multiple regression
are given. At the end, several typical examples of extensions of linear regression are illustrated.
Yi = µi + εi    (10.1)
   = β0 + β1 xi1 + · · · + βp xip + εi    (10.2)
   = xi⊤ β + εi ,    i = 1, . . . , n,  n > p,    (10.3)

with
• εi : (unknown) error term, noise, with symmetric distribution around zero, E(εi ) = 0.
It is often also assumed that Var(εi ) = σ 2 and/or that the errors are independent of each other.
In matrix notation, Equation (10.3) is written as
Y = Xβ + ε (10.4)
with X an n × (p + 1) matrix with rows xi⊤ . The model is linear in β and is often referred to as the
multiple linear regression model.
To derive estimators and obtain simple, closed form distributions for these, we assume that
εi iid∼ N (0, σ²) with unknown σ², also when simply referring to the model (10.3). With a Gaussian
error term we have (in matrix notation)
The mean of the response varies, implying that Y1 , . . . , Yn are only independent and not iid.
There are some constraints on the independent variables. We assume that the rank of X
equals p + 1 (rank(X) = p + 1, full column rank). This assumption guarantees that the inverse of
X⊤ X exists. In practical terms, this implies that we do not include the same predictor twice and
that each predictor adds information on top of the already included predictors.
The parameter vector β is estimated with the method of ordinary least squares (see Section 4.2.1). This means that the estimate β̂ is such that the sum of the squared errors (residuals)
is minimal; it is thus derived as follows:
β̂ = argmin_β Σ_{i=1}^{n} (yi − xi⊤ β)² = argmin_β (y − Xβ)⊤ (y − Xβ)    (10.7)

=⇒  d/dβ (y − Xβ)⊤ (y − Xβ)    (10.8)
    = d/dβ (y⊤y − 2β⊤X⊤y + β⊤X⊤Xβ) = −2X⊤y + 2X⊤Xβ    (10.9)

=⇒  X⊤Xβ = X⊤y    (10.10)

=⇒  β̂ = (X⊤X)^{−1} X⊤y    (10.11)
Equation (10.10) is also called the normal equation and Equation (10.11) indicates why we need
to assume full column rank of the matrix X.
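The closed form (10.11) can be verified numerically; a minimal sketch, assuming the data frame abrasion (with columns loss, hardness and strength) used in R-Code 10.1 below is available:

X <- model.matrix( ~ hardness + strength, data=abrasion)   # n x (p+1) design matrix
y <- abrasion$loss
betahat <- solve( crossprod( X), crossprod( X, y))         # solves the normal equation X'X beta = X'y
cbind( betahat, coef( lm( loss ~ hardness + strength, data=abrasion)))  # identical results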
We now derive the distributions of the estimator and of other related and important vectors.
The derivation of the results is based directly on Property 8.6. Starting from the distributional
assumption of the errors (10.5), jointly with Equations (10.6) and (10.11), it can be shown that

β̂ = (X⊤X)^{−1} X⊤ y ,          β̂ ∼ Np+1 (β, σ² (X⊤X)^{−1}),    (10.12)
ŷ = X(X⊤X)^{−1} X⊤ y = Hy ,    Ŷ ∼ Nn (Xβ, σ² H),              (10.13)
r = y − ŷ = (I − H)y ,          R ∼ Nn (0, σ² (I − H)),          (10.14)
where we term the matrix H = X(X⊤X)^{−1}X⊤ the hat matrix. In the left column we find the
estimate, the predicted values and the residuals; in the right column, the corresponding functions of the
random sample and their distributions. Notice the subtle difference in the covariance matrices of the
distributions of Y and Ŷ: the hat matrix H is not I, but hopefully quite close to it. The latter would
imply that the variances of R are close to zero.
The distribution of the regression coefficients will be used for inference and when interpret-
ing a fitted regression model (similarly as in the case of the simple regression). The marginal
distributions of the individual coefficients β̂i are determined by the distribution (10.12):
(again a direct consequence of Property 8.6). Hence (β̂i − βi ) / √(σ² vii) ∼ N (0, 1). As σ² is unknown,
it is replaced by an estimate σ̂², which is again termed the mean squared error. Its square root is termed the residual standard error,
with n − p − 1 degrees of freedom. Finally, we use the same approach as when deriving the t-test
in Equation (5.4) and obtain
(β̂i − βi ) / √(σ̂² vii) ∼ tn−p−1    (10.17)
as our statistic for testing H0 : βi = β0,i . For testing, we are often interested in H0 : βi = 0 for
which the test statistic reduces to the one shown in Test 15. Confidence intervals are constructed
along equation (4.33) and summarized subsequently in CI 8.
with r = y − ŷ and vii = ((X⊤X)^{−1})ii .
In R, a multiple regression is fitted with lm() again, by adding the additional predictors to
the right-hand side of the formula statement. The summary output is very similar to that of the
simple regression; further lines are simply added to the coefficient block. The degrees of freedom
of the numerator of the F -statistic change from 1 to p. The following example illustrates a
regression with p = 2 predictors.
Example 10.1 (abrasion data). The data comes from an experiment investigating how rubber’s
resistance to abrasion is affected by the hardness of the rubber and its tensile strength (Cleveland,
1993). Each of the 30 rubber samples was tested for hardness and tensile strength, and then
subjected to steady abrasion for a fixed time.
R-Code 10.1 performs the regression analysis based on the two predictors. The sample confidence
intervals confint(abres) do not contain zero. Accordingly, the p-values of the three t-tests are
small.
We naively assumed a linear relationship between the predictors hardness and strength and
the abrasion loss. The scatter plots in Figure 10.1 give some indication that hardness itself has a
linear relationship with abrasion loss. The scatter plots only depict the marginal relations, i.e., the
red curves would be linked to lm(loss ~ hardness, data=abrasion) and lm(loss ~ strength,
data=abrasion) (the latter indeed shows a poor linear relationship). Similarly, a quadratic term
for strength is not necessary (see the summary of lm(loss ~ hardness + strength + I(strength^2),
data=abrasion)). ♣
Figure 10.1: Pairs plot of abrasion data with red “guide-the-eye” curves. (See R-
Code 10.1.)
R-Code 10.1: abrasion data: fitting a linear model. (See Figure 10.1.)
summary( abres <- lm( loss ~ hardness + strength, data=abrasion))
##
## Call:
## lm(formula = loss ~ hardness + strength, data = abrasion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hardness -6.571 0.583 -11.27 1.0e-11 ***
## strength -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84,Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
confint( abres)
## 2.5 % 97.5 %
## (Intercept) 758.4573 1011.86490
## hardness -7.7674 -5.37423
## strength -1.7730 -0.97562
Recall the data analysis workflow shown in Figure 1.1. Suppose we have a fit of a linear
model (obtained with lm(), as illustrated in the last chapter or in the previous section). Model
validation essentially verifies whether (10.3) is an adequate model for the data to probe the
hypothesis under investigation. The question is not whether a model is correct, but rather whether the
model is useful (“Essentially, all models are wrong, but some are useful”, Box and Draper, 1987,
page 424).
Model validation verifies (i) the fixed component (or fixed part) µi and (ii) the stochastic
component (or stochastic part) εi and is typically an iterative process (arrow back to Propose
statistical model in Figure 1.1).
Validation is based on (a) graphical summaries, typically (standardized) residuals versus fitted
values, individual predictors, summary values of predictors or simply Q-Q plots and (b) summary
statistics. The latter are also part of a summary() call of a regression object, see, e.g., R-
Code 9.6. The residuals are summarized by the range and the quartiles. Below the summary of
the coefficients, the following statistics are given
where SS stands for sums of squares and SST , SSE for the total sums of squares and the sums of squares
of the error, respectively. The statistics are similar to the ones from a simple regression, up to
taking into account that we now have p + 1 parameters to estimate.
The last statistic summarizes how much variability in the data is explained by the model; it
is essentially equivalent to Test 4 and performs the omnibus test H0 : β1 = β2 = · · · = βp = 0.
Rejecting this test merely signifies that at least one of the coefficients is significantly different from zero,
which is often not very informative.
A slightly more general version of the F -Test (10.21) is used to compare nested models. Let
M0 be the simpler model with only q out of the p predictors of the more complex model M1
(0 ≤ q < p). The test H0 : “M0 is sufficient” is based on the statistic
and often runs under the name ANOVA (analysis of variance). We see an alternative derivation
thereof in Chapter 11.
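In R, such a comparison of nested models is carried out by passing both fitted models to anova(); a minimal sketch, again assuming the abrasion data of R-Code 10.1:

m0 <- lm( loss ~ hardness, data=abrasion)              # simpler model M0 with q = 1 predictor
m1 <- lm( loss ~ hardness + strength, data=abrasion)   # more complex model M1 with p = 2 predictors
anova( m0, m1)                                         # F-test of H0: "M0 is sufficient"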
In order to validate the fixed components of a model, it must be verified whether the necessary
predictors are in the model. We do not want too many, nor too few. Unnecessary predictors
are often identified through insignificant coefficients. When predictors are missing, the residuals
show (in the ideal case) structure, indicative of model improvement. In other cases, the quality
of the regression is low (F -Test, R2 (too) small). Example 10.2 below will illustrate the most
important elements.
Example 10.2. We construct synthetic data in order to better illustrate the difficulty of detecting a suitable model. Table 10.1 gives the actual model and the five fitted models. In all cases
we use a small dataset of size n = 50 and predictors x1 and x2 = x1², where x1 is constructed from
a uniform distribution. Further, we set εi iid∼ N (0, 0.25²). R-Code 10.2 and the corresponding
Figure 10.2 illustrate how model deficiencies manifest.
We illustrate how residual plots may or may not show missing or unnecessary predictors.
Because of the ‘textbook’ example, the adjusted R2 values are very high and the p-value of the
F -Test is – as often in practice – of little value.
Since the output of summary() is quite long, we show only selected elements from it here. This is
achieved with the functions print() and cat(). For Examples 2 to 5 the output has been
constructed by a call to a function subset_of_summary(), which constructs the output in the same way as for the first
example.
The plots should supplement a classical graphical analysis through plot( res). ♣
Table 10.1: Fitted models for the five examples of Example 10.2. The true model is always Yi =
β0 + β1 x1 + β2 x1² + β3 x2 + εi .
R-Code 10.2: Illustration of missing and unnecessary predictors for an artificial dataset.
(See Figure 10.2.)
set.seed( 18)
n <- 50
x1 <- runif( n); x2 <- x1^2; x3 <- runif( n)
eps <- rnorm( n, sd=0.16)
y <- -1 + 3*x1 + 2.5*x1^2 + 1.5*x2 + eps
# Example 1: Correct model
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 ))
print( sres$coef, digits=2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0 0.045 -23 7.9e-27
## x1 3.0 0.238 13 9.9e-17
## I(x1^2) 3.9 0.242 16 6.9e-21
cat("Adjusted R-squared: ", formatC( sres$adj.r.squared),
" F-Test: ", pf(sres$fstatistic[1], 1, n-2, lower.tail = FALSE))
## Adjusted R-squared: 0.9963 F-Test: 4.6376e-53
plotit( res$fitted, res$resid, "fitted values") # Essentially equivalent to:
# plot( res, which=1, caption=NA, sub.caption=NA, id.n=0) with ylim=c(-1,1)
# i.e, plot(), followed by a panel.smooth()
plotit( x1, res$resid, bquote(x[1]))
It is important to understand that the stochastic part εi does not only represent measurement
error. In general, the error is the remaining “variability” (also noise) that is not explained through
the predictors (“signal”).
With respect to the stochastic part εi , the following points should be verified:
1. constant variance: if the (absolute) residuals are displayed as a function of the predictors,
the estimated values, or the index, no structure should be discernible. The observations
can often be transformed in order to achieve constant variance. Constant variance is also
called homoscedasticity and the terms heteroscedasticity or variance heterogeneity are used
otherwise.
More precisely, for heteroscedasticity we relax Model (10.3) to εi indep.∼ N (0, σi²), i.e., ε ∼
Nn (0, σ² V). In case the diagonal matrix V is known, we can use so-called weighted least
squares (WLS) by considering the argument weights in R (weights=1/diag(V)); a short sketch
is given after this list.
2. independence: if data are taken over time, observations might be serially correlated or dependent. That
means Corr(εi−1 , εi ) ̸= 0 and Var(ε) = Σ, which is not a diagonal matrix. This is easy to
test and to visualize through the residuals, as illustrated in Example 10.3.
3. symmetric distribution: it is not easy to find evidence against this assumption. If the
distribution is strongly right- or left-skewed, the scatter plots of the residuals will have
structure. Transformations or generalized linear models may help. We have a quick look
at a generalized linear model in Section 10.4.
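A minimal sketch of weighted least squares with a known diagonal V, based on hypothetical simulated data; the essential piece is the weights argument of lm():

set.seed( 1)
x <- 1:50
sd_i <- 0.5 + 0.1*x                        # error standard deviations increase with x
y <- 1 + 2*x + rnorm( 50, sd=sd_i)
V <- diag( sd_i^2)                         # known (diagonal) covariance structure
wls <- lm( y ~ x, weights=1/diag( V))      # WLS: noisy observations are downweighted
ols <- lm( y ~ x)                          # OLS for comparison
rbind( WLS=coef( wls), OLS=coef( ols))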
Example 10.3 (continuation of Example 10.1). R-Code 10.3 constructs a few diagnostic plots
shown in Figure 10.3.
A quadratic term for strength is not necessary. However, the residuals appear to be slightly
correlated (see bottom right panel, cor( abres$resid[-1],abres$resid[-30]) is 0.53). We
do not have further information about the data here and cannot investigate this aspect further.
♣
# Fitted values
plot( loss~hardness, ylim=c(0,400), yaxs='i', data=abrasion)
points( abres$fitted~hardness, col=4, data=abrasion)
plot( loss~strength, ylim=c(0,400), yaxs='i', data=abrasion)
points( abres$fitted~strength, col=4, data=abrasion)
# Residuals vs ...
plot( abres$resid~abres$fitted)
lines( lowess( abres$fitted, abres$resid), col=2)
abline( h=0, col='gray')
plot( abres$resid~hardness, data=abrasion)
lines( lowess( abrasion$hardness, abres$resid), col=2)
abline( h=0, col='gray')
plot( abres$resid~strength, data=abrasion)
lines( lowess( abrasion$strength, abres$resid), col=2)
abline( h=0, col='gray')
plot( abres$resid[-1]~abres$resid[-30])
abline( h=0, col='gray')
Figure 10.3: abrasion data: model validation. Top row shows the loss (black) and
fitted values (blue) as a function of hardness (left) and strength (right). Middle and
bottom panels are different residual plots. (See R-Code 10.3.)
To introduce two well-known information criteria, we assume that the distribution of the
AIC = −2ℓ(θ̂) + 2p.    (10.23)
In regression models with normally distributed errors, the maximized log-likelihood is −(n/2) log(σ̂²)
up to an additive constant, and so the first term describes the goodness of fit.
It is important to note that additive constants are not relevant. Also, only the difference of
two AICs is meaningful, not, e.g., a relative or proportional reduction.
Remark 10.1. When deriving the AIC for the regression setting, we have to use the maximum
likelihood estimates β̂ML = β̂LS and σ̂²ML = r⊤r/n. The log-likelihood is then shown to be
ℓ(β̂ML , σ̂²ML) = −n/2 (log(2π) − log(n) + log(r⊤r) + 1). As additive constants are not relevant for the AIC,
we can express the AIC in terms of the residuals only. ♣
The disadvantage of AIC is that the penalty term is independent of the sample size. The
Bayesian information criterion (BIC)
BIC = −2ℓ(θ̂) + log(n) p    (10.24)
penalizes the model more heavily based on both the number of parameters p and sample size n,
and its use is recommended.
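In R, both criteria are readily available for fitted models; a minimal sketch using the abrasion fit abres of R-Code 10.1 (with n = 30):

AIC( abres)              # Akaike information criterion
BIC( abres)              # Bayesian information criterion
AIC( abres, k=log(30))   # equivalently, BIC via the penalty k = log(n)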
We illustrate the use of information criteria with an example based on a classical dataset.
Example 10.4 (LifeCycleSavings data). Under the life-cycle savings hypothesis developed by
Franco Modigliani, the savings ratio (aggregate personal savings divided by disposable income) is
explained by per-capita disposable income, the percentage rate of change in per-capita disposable
income, and two demographic variables: the percentage of population less than 15 years old and
the percentage of the population over 75 years old (see, e.g., Modigliani, 1966). The data
provided by LifeCycleSavings are averaged over the decade 1960–1970 to remove the business
cycle or other short-term fluctuations and contains information from 50 countries about these
five variables:
Scatter plots are shown in Figure 10.4. R-Code 10.4 fits a multiple linear model, selects models
through comparison of various goodness of fit criteria (AIC, BIC) and shows the model validation
plots for the model selected using AIC. The step() function is a convenient way for selecting
relevant predictors. Figure 10.5 gives four of the most relevant diagnostic plots, obtained by
passing a fitted object to plot() (compare with the manual construction in Figure 10.3).
Different models may result from different criteria: when using BIC for model selection,
pop75 drops out of the model. ♣
Figure 10.4: Scatter plots of LifeCycleSavings with red “guide-the-eye” curves. (See
R-Code 10.4.)
R-Code 10.4: LifeCycleSavings data: EDA, linear model and model selection. (See
Figures 10.4 and 10.5.)
data( LifeCycleSavings)
head( LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
pairs(LifeCycleSavings, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
lcs.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary( lcs.all)
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.242 -2.686 -0.249 2.428 9.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.566087 7.354516 3.88 0.00033 ***
## pop15 -0.461193 0.144642 -3.19 0.00260 **
## pop75 -1.691498 1.083599 -1.56 0.12553
## dpi -0.000337 0.000931 -0.36 0.71917
## ddpi 0.409695 0.196197 2.09 0.04247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared: 0.338,Adjusted R-squared: 0.28
## F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
lcs.aic <- step( lcs.all, trace=0)    # model selection based on AIC
summary( lcs.aic)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.12466 7.18379 3.9150 0.00029698
## pop15 -0.45178 0.14093 -3.2056 0.00245154
## pop75 -1.83541 0.99840 -1.8384 0.07247270
## ddpi 0.42783 0.18789 2.2771 0.02747818
plot( lcs.aic) # 4 plots to assess the models
summary( step( lcs.all, k=log(50), trace=0))$coefficients # now BIC
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.59958 2.334394 6.6825 2.4796e-08
## pop15 -0.21638 0.060335 -3.5863 7.9597e-04
## ddpi 0.44283 0.192401 2.3016 2.5837e-02
Example 10.5 (orings data). In January 1986 the space shuttle Challenger exploded shortly
after taking off, killing all seven crew members aboard. Part of the problem was with the
rubber seals, the so-called o-rings, of the booster rockets. Due to low ambient temperature, the
seals started to leak causing the catastrophe. The data set data( orings, package="faraway")
contains the number of defects in the six seals in 23 previous launches (Figure 10.6). The question
we ask here is whether the probability of a defect for an arbitrary seal can be predicted for an
air temperature of 31◦ F (as in January 1986). See Dalal et al. (1989) for a detailed statistical
account or simply https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.
The variable of interest is a probability (failure of a rubber seal), which we estimate based on
binomial data (failures of o-rings), but a linear model cannot guarantee p̂i ∈ [0, 1] (see the linear fit in
Figure 10.6). In this and similar cases, logistic regression is appropriate; it models the probability
of a defect as a function of the air temperature, as sketched below.
R-Code 10.5 orings data and estimated probability of defect dependent on air tempera-
ture. (See Figure 10.6.)
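A minimal sketch of the kind of fit behind R-Code 10.5 (the exact code may differ); the orings data contain, per launch, the temperature temp and the number of damaged seals damage out of six:

data( orings, package="faraway")
glm1 <- glm( cbind( damage, 6 - damage) ~ temp, family=binomial, data=orings)
summary( glm1)$coefficients
predict( glm1, newdata=data.frame( temp=31), type="response")  # estimated defect probability at 31 F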
For most functions f , we do not have closed forms for the resulting estimates and iterative
approaches are needed. That is, starting from an initial condition we improve the solution by
small steps. The correction factor typically depends on the gradient of f . Such algorithms are
variants of the so-called Gauss–Newton algorithm. The details of these are beyond the scope of
this document.
There are some prototype functions f (·, β) that are often used:
Problem 10.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Derive the distributions of the fitted values and the residuals, i.e., (10.13) and (10.14).
b) Identify the “simple” and “complex model” in the setting of a simple regression such that
(10.22) reduces to (9.21).
c) Show that the AIC for a standard multiple regression model with p+1 regression coefficients
is proportional to n log(r ⊤ r ) + 2p and derive the BIC for the same model.
Problem 10.2 (Box–Cox transformation) In this problem, we motivate the so-called Box–Cox
transformation. Suppose that we have a random variable Y with strictly positive mean µ > 0 and
standard deviation σ = cµα , where c > 0 and α arbitrary. That means the standard deviation
of Y is proportional to a power of the mean.
a) Define the random variable X = Y λ , λ ̸= 0, and show that the standard deviation of X
is approximately proportional to µλ−1+α . What value λ of the transformation leads to a
constant variance?
c) The dataset SanMartinoPPts from the package hydroTSM contains daily precipitation at
station San Martino di Castrozza (Trento Italy) over 70 years. Determine the optimal
transformation for monthly totals. Justify that a square root transformation is adequate.
How do you expect the transformation to change when working with annual or daily data?
Hint: monthly totals may be obtained via hydroTSM::daily2monthly(SanMartinoPPts,
FUN=sum).
Problem 10.3 (Multiple linear regression 1) The data stackloss.txt are available on the
course web page. The data represents the production of nitric acid in the process of oxidizing
ammonia. The response variable, stack loss, is the percentage of the ingoing ammonia that
escapes unabsorbed. Key process variables are the airflow, the cooling water temperature (in
degrees C), and the acid concentration (in percent).
Construct a regression model that relates the three predictors to the response, stack loss.
Check the adequacy of the model.
Exercise and data are from B. Abraham and J. Ledolter, Introduction to Regression Modeling,
2006, Thomson Brooks/Cole.
Hints:
• Look at the data. Outliers?
• Try to find an “optimal” model. Exclude predictors that do not improve the model fit.
• Use model diagnostics, t- and F -tests and (adjusted) R² values to compare different
models.
• Which data points have a (too) strong influence on the model fit? (influence.measures())
• Are the predictors correlated? In case of a high correlation, what are possible implications?
Problem 10.4 (Multiple linear regression 2) The file salary.txt contains information about
average teacher salaries for 325 school districts in Iowa. The variables are
District: name of the district
districtSize: size of the district:
    1 = small (less than 1000 students)
    2 = medium (between 1000 and 2000 students)
    3 = large (more than 2000 students)
salary: average teacher salary (in dollars)
experience: average teacher experience (in years)
b) For each of the three district sizes, fit a linear model using salary as the dependent variable
and experience as the covariate. Is there an effect of experience? How can we compare the
results?
c) We now use all data jointly and use districtSize as covariate as well. However, districtSize
is not numerical, rather categorical and thus we set mydata$districtSize <- as.factor(
mydata$districtSize) (with appropriate dataframe name). Fit a linear model using
salary as the dependent variable and the remaining data as the covariates. Is there an
effect of experience and/or district size? How can we interpret the parameter estimates?
model deficiencies.
Hint: You may consider R-Code 10.2 and Figure 10.2.
Example   Fitted model
1         Yi = β0 + β1 x1 + β2 x1² + β3 x2 + εi              correct model
2         Yi = β0 + β1 x1 + β2 x2 + εi                       missing predictor x1²
3         Yi = β0 + β1 x1² + β2 x2 + εi                      missing predictor x1
4         Yi = β0 + β1 x1 + β2 x1² + εi                      missing predictor x2
5         Yi = β0 + β1 x1 + β2 x1² + β3 x2 + β4 x3 + εi      unnecessary predictor x3
Problem 10.6 (BMJ Endgame) Discuss and justify the statements about ‘Multiple regression’
given in doi.org/10.1136/bmj.f4373.
Figure 10.2: Residual plots. Residuals versus fitted values (left column), predictor
x1 (middle) and x2 (right column). The rows correspond to the different fitted models.
The panels in the left column have different scaling of the x-axis. (See R-Code 10.2.)
Figure 10.5: Diagnostic plots for the model lcs.aic, obtained with plot(lcs.aic): residuals
versus fitted values, normal Q-Q plot, scale-location plot and residuals versus leverage with
Cook's distance; Zambia, the Philippines, Chile, Japan and Libya are labeled. (See R-Code 10.4.)
Figure 10.6: orings data (proportion of damaged orings, black crosses) and estimated
probability of defect (red dots) dependent on air temperature. Linear fit is given by
the gray solid line. Dotted vertical line is the ambient launch temperature at the time
of launch. (See R-Code 10.5.)
Chapter 11
Analysis of Variance
In this chapter we further elaborate on the sums of squares decomposition of linear models.
For ease of presentation, we focus on qualitative predictors, called factors. The simplest
setting is comparing the means of I independent samples with a common variance term. This
is much in the spirit of Test 2 discussed in Chapter 4, where we compared the means of two
independent samples with each other.
Instead of comparing the samples pairwise (which would amount to I(I − 1)/2 tests and would require adjustments due to multiple testing, as discussed in Section 5.5.2), we introduce a “better”
method. Although we are concerned with comparing means, we frame the problem as a comparison of variances, termed analysis of variance (ANOVA). We focus on a linear model approach to
ANOVA. Due to historical reasons, the notation is slightly different from what we have seen in
the last two chapters, but we try to link and unify as much as possible.
The model of this section is tailored to compare I different groups where the variability of the
observations around the mean is the same in all groups. That means that there is one common
variance parameter and we pool the information across all observations to estimate it.
More formally, the model consists of I groups, in the ANOVA context called factor levels,
i = 1, . . . , I, and every level contains a sample of size ni . Thus, in total we have N = n1 +· · ·+nI
observations. The model is given by

Yij = µi + εij    (11.1)
    = µ + βi + εij ,    j = 1, . . . , ni ,  i = 1, . . . , I,    (11.2)

where we use the indices to indicate the group and the within-group observation. Similarly as in
the regression models of the last chapters, we again assume εij iid∼ N (0, σ²). Formulation (11.1)
represents the individual group means directly, whereas formulation (11.2) models an overall
mean and deviations from it.
However, model (11.2) is overparameterized (I levels and I + 1 parameters) and an additional
constraint on the parameters is necessary. Often, the sum-to-zero contrast or the treatment contrast,
written as

Σ_{i=1}^{I} βi = 0    or    β1 = 0,    (11.3)

respectively, are used.
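In R, the constraint corresponds to the contrast coding of the factor; a minimal sketch with hypothetical data, illustrating the (default) treatment contrast and the sum-to-zero contrast:

set.seed( 2)
group <- factor( rep( 1:3, each=10))
y <- rnorm( 30, mean=rep( c(5, 6, 8), each=10))
coef( lm( y ~ group))                                      # treatment contrast: beta_1 = 0
coef( lm( y ~ group, contrasts=list( group="contr.sum")))  # sum-to-zero contrast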
We are inherently interested in whether there exists a difference between the groups and so
our null hypothesis is H0 : β1 = β2 = · · · = βI = 0. Note that the hypothesis is independent of
the constraint. To develop the associated test, we proceed in several steps. We first link the two
group case to the notation from the last chapter. In a second step we intuitively derive estimates
in the general setting. Finally, we state the test statistic.
with Yi∗ the components of the vector (Y11 , Y12 , . . . , Y1n1 , Y21 , . . . , Y2n2 )⊤ and xi = 0 if i =
1, . . . , n1 and xi = 1 otherwise. We simplify the notation and spare ourselves from writing the
index denoted by the asterisk with
(Y1⊤, Y2⊤)⊤ = Xβ + ε = X (β0 , β1 )⊤ + ε = ( 1  0 ; 1  1 ) (β0 , β1 )⊤ + ε    (11.5)

(blockwise notation, with 1 and 0 denoting vectors of ones and zeros of appropriate lengths)
and thus we have as least squares estimate

β̂ = (β̂0 , β̂1 )⊤ = (X⊤X)^{−1} X⊤ y = ( N  n2 ; n2  n2 )^{−1} ( 1⊤y1 + 1⊤y2 ; 1⊤y2 )    (11.6)

  = 1/(n1 n2) ( n2  −n2 ; −n2  N ) ( Σ_{ij} yij ; Σ_j y2j ) = ( (1/n1) Σ_j y1j ;  (1/n2) Σ_j y2j − (1/n1) Σ_j y1j ).    (11.7)
Thus the least squares estimates of µ and β2 in (11.2) for two groups are the mean of the first
group and the difference between the two group means.
The null hypothesis H0 : β1 = β2 = 0 in Model (11.2) is equivalent to the null hypothesis
H0 : β1∗ = 0 in Model (11.4) or to the null hypothesis H0 : β1 = 0 in Model (11.5). The latter is
of course based on a t-test for a linear association (Test 15) and coincides with the two-sample
t-test for two independent samples (Test 2).
Estimators can also be derived in a similar fashion under other constraints or for more factor
levels.
With the least squares method, µ̂ and β̂i are chosen such that

Σ_{i,j} (yij − µ̂ − β̂i )² = Σ_{i,j} (ȳ·· + ȳi· − ȳ·· + yij − ȳi· − µ̂ − β̂i )²    (11.10)
                          = Σ_{i,j} ( (ȳ·· − µ̂) + (ȳi· − ȳ·· − β̂i ) + (yij − ȳi· ) )²    (11.11)
i,j
is minimized. We evaluate the square of this last equation and note that the cross terms are zero
since
J
X I
X
(yij − y i· ) = 0 and (y i· − y ·· − βbi ) = 0 (11.12)
j=1 i=1
yij = µ
b + βbi + rij . (11.13)
The observations are orthogonally projected in the space spanned by µ and βi . This orthog-
onal projection allows for the division of the sums of squares of the observations (mean corrected
to be precise) into the sums of squares of the model and sum of squares of the error component.
These sums of squares are then weighted and compared. The representation of this process
in table form and the subsequent interpretation is often equated with the analysis of variance,
denoted ANOVA.
Remark 11.1. This orthogonal projection also holds in the case of a classical regression frame-
work, of course. Using (10.13) and (10.14), we have
$$\widehat{\mathbf{y}}^\top \mathbf{r} = \mathbf{y}^\top H^\top (I - H) \mathbf{y} = \mathbf{y}^\top (H - HH) \mathbf{y} = 0, \tag{11.14}$$
because the hat matrix H is symmetric (H⊤ = H) and idempotent (HH = H). ♣
The decomposition of the sums of squares can be derived with help from (11.9). No assumptions about constraints or the $n_i$ are made:
$$\sum_{i,j} (y_{ij} - \bar y_{\cdot\cdot})^2 = \sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot} + y_{ij} - \bar y_{i\cdot})^2 \tag{11.15}$$
$$= \sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + \sum_{i,j} (y_{ij} - \bar y_{i\cdot})^2 + \sum_{i,j} 2 (\bar y_{i\cdot} - \bar y_{\cdot\cdot})(y_{ij} - \bar y_{i\cdot}), \tag{11.16}$$
where the cross term is again zero because $\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot}) = 0$. Hence we have the decomposition
of the sums of squares
$$\underbrace{\sum_{i,j} (y_{ij} - \bar y_{\cdot\cdot})^2}_{\text{Total}} = \underbrace{\sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}_{\text{Model}} + \underbrace{\sum_{i,j} (y_{ij} - \bar y_{i\cdot})^2}_{\text{Error}} \tag{11.17}$$
or $SS_T = SS_A + SS_E$. We deliberately choose $SS_A$ instead of $SS_M$ as this will simplify subsequent
extensions. Using the least squares estimates $\widehat\mu = \bar y_{\cdot\cdot}$ and $\widehat\beta_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot}$, this equation can be
read as
$$\frac{1}{N} \sum_{i,j} (y_{ij} - \widehat\mu)^2 = \frac{1}{N} \sum_i n_i (\widehat{\mu + \beta_i} - \widehat\mu)^2 + \frac{1}{N} \sum_{i,j} (y_{ij} - \widehat{\mu + \beta_i})^2 \tag{11.18}$$
$$\widehat{\operatorname{Var}}(y_{ij}) = \frac{1}{N} \sum_i n_i \widehat\beta_i^{\,2} + \widehat{\sigma}^2, \tag{11.19}$$
(where we could have used some divisor other than N). The test statistic for the statistical
hypothesis H0 : β1 = β2 = · · · = βI = 0 is based on the idea of decomposing the variance into
variance between groups and variance within groups, just as illustrated in (11.19), and comparing
them. Formally, this must be made more precise. A good model has a small estimate $\widehat{\sigma}^2$ in
comparison to the second sum. We now develop a quantitative comparison of the two sums.
A raw comparison of both variance terms is not sufficient; the number of observations must
be taken into account: SSE increases with N even for a high-quality model. In order to
weight the individual sums of squares, we divide them by their degrees of freedom, e.g., instead
of SSE we use SSE /(N − I) and instead of SSA we use SSA /(I − 1), which we
term mean squares. Under the null hypothesis, the (suitably scaled) mean squares are chi-square distributed
and thus their quotient is F-distributed. Hence, an F-test as illustrated in Test 4 is needed
again. Historically, such a test has been “constructed” via a table and is still represented as such.
This so-called ANOVA table consists of columns for the sums of squares, degrees of freedom,
mean squares and F-test statistic due to variance between groups, within groups, and the total
variance. Table 11.1 illustrates such a generic ANOVA table; numerical examples are given in
Example 11.1 later in this section. Note that the third row of the table represents the sum of
the first two rows, and the last two columns are constructed from the first two.
The expected mean squares are
$$\operatorname{E}(MS_A) = \operatorname{E}\bigl(SS_A/(I-1)\bigr) = \sigma^2 + \frac{\sum_i n_i \beta_i^2}{I-1}, \qquad \operatorname{E}(MS_E) = \operatorname{E}\bigl(SS_E/(N-I)\bigr) = \sigma^2. \tag{11.20}$$
Calculation in R: summary(lm(...)) for the value of the test statistic or anova(lm(...)) for the explicit ANOVA table.
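To make the decomposition concrete, the following minimal sketch computes SSA, SSE, the mean squares and the F statistic "by hand" for simulated data (all values hypothetical) and compares the result with anova(lm(...)).

# Minimal sketch with simulated, hypothetical data: one-way ANOVA "by hand".
set.seed(1)
I <- 3                                    # number of groups
ni <- c(4, 4, 8)                          # group sizes
y <- rnorm(sum(ni), mean=rep(c(75, 150, 180), times=ni), sd=85)
group <- factor(rep(LETTERS[1:I], times=ni))
N <- sum(ni)
ybar.i <- tapply(y, group, mean)          # group means
ybar <- mean(y)                           # overall mean
SSA <- sum(ni * (ybar.i - ybar)^2)        # between-group (model) sum of squares
SSE <- sum((y - ybar.i[group])^2)         # within-group (error) sum of squares
MSA <- SSA / (I - 1)
MSE <- SSE / (N - I)
Fobs <- MSA / MSE
c(F=Fobs, p=pf(Fobs, I - 1, N - I, lower.tail=FALSE))
anova(lm(y ~ group))                      # same F statistic and p-value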
Example 11.1 (retardant data). Many substances related to human activities end up in
wastewater and accumulate in sewage sludge. The present study focuses on hexabromocyclodo-
decane (HBCD) detected in sewage sludge collected from a monitoring network in Switzerland.
HBCD’s main use is in expanded and extruded polystyrene for thermal insulation foams, in
building and construction. HBCD is also applied in the backcoating of textiles, mainly in furni-
ture upholstery. A very small application of HBCD is in high impact polystyrene, which is used
for electrical and electronic appliances, for example in audio visual equipment. Data and more
detailed background information are given in Kupper et al. (2008) where it is also argued that
loads from different types of monitoring sites showed that brominated flame retardants ending
up in sewage sludge originate mainly from surface runoff, industrial and domestic wastewater.
HBCD is harmful to one’s health, may affect reproductive capacity, and may harm children
in the mother’s womb.
In R-Code 11.1 the data are loaded and reduced to hexabromocyclododecane. First we use the
constraint β1 = 0, i.e., Model (11.5). The estimates naturally agree with those from (11.7). Then
we use the sum-to-zero constraint and compare the results. The estimates and the standard errors
change (and thus so do the p-values of the t-tests). The p-values of the F-test are, however, identical,
since the same test is used.
The R command aov is an alternative for performing ANOVA and its use is illustrated in
R-Code 11.2. We prefer, however, the more general lm approach. Nevertheless we need a function
which provides results on which, for example, Tukey's honest significant difference (HSD) test
can be performed with the function TukeyHSD. The differences can also be calculated from the
coefficients in R-Code 11.1. The adjusted p-values are larger because multiple tests are taken into account. ♣
R-Code 11.1: retardant data: ANOVA with lm command and illustration of various
contrasts.
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.7 42.4 1.78 0.098 .
## typeB 77.2 60.0 1.29 0.220
## typeC 107.8 51.9 2.08 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
options( "contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
# manually construct the estimates:
c( mean(HBCD[1:4]), mean(HBCD[5:8])-mean(HBCD[1:4]),
mean(HBCD[9:16])-mean(HBCD[1:4]))
## [1] 75.675 77.250 107.788
# change the contrasts to sum-to-zero
options(contrasts=c("contr.sum","contr.sum"))
lmout1 <- lm( HBCD ~ type )
# summary(lmout1) # as above, except the coefficients are different:
print( summary(lmout1)$coef, digits=3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.4 22.3 6.15 3.51e-05
## type1 -61.7 33.1 -1.86 8.55e-02
## type2 15.6 33.1 0.47 6.46e-01
beta <- as.numeric(coef(lmout1))
# Construct 'contr.treat' coefficients:
c( beta[1]+beta[2], beta[3]-beta[2], -2*beta[2]-beta[3])
## [1] 75.675 77.250 107.787
R-Code 11.2 retardant data: ANOVA with aov and multiple testing of the means.
The two-way ANOVA model without interaction is given by
$$Y_{ijk} = \mu + \beta_i + \gamma_j + \varepsilon_{ijk}, \tag{11.21}$$
with $i = 1, \dots, I$, $j = 1, \dots, J$, $k = 1, \dots, n_{ij}$ and $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. The indices again specify the
levels of the first and second factor as well as the count for that configuration. As stated, the
model is overparameterized and additional constraints are again necessary, in which case
$$\sum_{i=1}^{I} \beta_i = 0, \quad \sum_{j=1}^{J} \gamma_j = 0 \qquad\text{or}\qquad \beta_1 = 0, \quad \gamma_1 = 0 \tag{11.22}$$
Parameter estimation is again carried out with lm(...). When comparing sums of squares, ambiguities arise and the factors need
to be included in decreasing order of “natural” importance.
For the sake of illustration, we consider the balanced case $n_{ij} = K$, called complete two-way
ANOVA. More precisely, the model consists of I · J groups, every group contains K samples
and N = I · J · K. The calculation of the estimates is easier than in the unbalanced case and
is illustrated as follows.
As in the one-way case, we can derive least squares estimates
$$y_{ijk} = \underbrace{\bar y_{\cdot\cdot\cdot}}_{\widehat\mu} + \underbrace{\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}}_{\widehat\beta_i} + \underbrace{\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}}_{\widehat\gamma_j} + \underbrace{y_{ijk} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}}_{r_{ijk}} \tag{11.23}$$
and summarize the decomposition in the corresponding ANOVA table:

Factor A:  $SS_A = \sum_{i,j,k} (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2$,  $I-1$ degrees of freedom,  $MS_A = SS_A/(I-1)$,  $F_{\text{obs},A} = MS_A/MS_E$
Factor B:  $SS_B = \sum_{i,j,k} (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2$,  $J-1$ degrees of freedom,  $MS_B = SS_B/(J-1)$,  $F_{\text{obs},B} = MS_B/MS_E$
Error:  $SS_E = \sum_{i,j,k} (y_{ijk} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2$,  $DF_E = N-I-J+1$ degrees of freedom,  $MS_E = SS_E/DF_E$
Total:  $SS_T = \sum_{i,j,k} (y_{ijk} - \bar y_{\cdot\cdot\cdot})^2$,  $N-1$ degrees of freedom
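As a small illustration (with simulated, hypothetical data), the following sketch fits such an additive two-way model with aov and prints the table with the rows for factor A, factor B and the error.

# Minimal sketch with simulated data: balanced complete two-way ANOVA.
set.seed(2)
I <- 3; J <- 2; K <- 4                      # levels of A and B, replicates
dat <- expand.grid(A=factor(1:I), B=factor(1:J), rep=1:K)
dat$y <- 5 + c(0, 1, 2)[dat$A] + c(0, 0.5)[dat$B] + rnorm(nrow(dat))
summary(aov(y ~ A + B, data=dat))           # SS_A, SS_B, SS_E with I-1, J-1, N-I-J+1 df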
Model (11.21) is additive: “more of both leads to even more.” It might be that there is a
certain canceling or saturation effect. To model such a situation, we need to include an interaction
term $(\beta\gamma)_{ij}$ in the model to account for such non-additive effects:
$$Y_{ijk} = \mu + \beta_i + \gamma_j + (\beta\gamma)_{ij} + \varepsilon_{ijk},$$
with $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$ and corresponding ranges for the indices. In addition to constraints (11.22)
we require
$$\sum_{i=1}^{I} (\beta\gamma)_{ij} = 0 \quad\text{and}\quad \sum_{j=1}^{J} (\beta\gamma)_{ij} = 0 \qquad\text{for all } i \text{ and } j, \tag{11.26}$$
or analogous treatment constraints are used. As in the previous two-way case, we can
derive the least squares estimates
$$y_{ijk} = \underbrace{\bar y_{\cdot\cdot\cdot}}_{\widehat\mu} + \underbrace{\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}}_{\widehat\beta_i} + \underbrace{\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}}_{\widehat\gamma_j} + \underbrace{\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}}_{\widehat{(\beta\gamma)}_{ij}} + \underbrace{y_{ijk} - \bar y_{ij\cdot}}_{r_{ijk}} \tag{11.27}$$
The corresponding ANOVA table extends the one above; its first two rows are

Factor A:  $SS_A = \sum_{i,j,k} (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2$,  $I-1$ degrees of freedom,  $MS_A = SS_A/DF_A$,  $F_{\text{obs},A} = MS_A/MS_E$
Factor B:  $SS_B = \sum_{i,j,k} (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2$,  $J-1$ degrees of freedom,  $MS_B = SS_B/DF_B$,  $F_{\text{obs},B} = MS_B/MS_E$
with $j = 1, \dots, J$, $k = 1, \dots, n_j$ and $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. Additional constraints are again necessary.
Keeping track of indices and Greek letters quickly gets cumbersome and one often resorts to R
formula notation. For example, if the predictor $x_i$ is stored in the variable Population and $\gamma_j$ in the
variable Treatment, coded as a factor, then the model can be specified as, e.g., lm(y ~ Population + Treatment).
11.4 Example
Example 11.2 (UVfilter data). Octocrylene is an organic UV filter found in sunscreen and
cosmetics. The substance is classified as a contaminant and as dangerous for the environment by
the EU under the CLP Regulation. Sunscreens containing, among other substances, octocrylene have been forbidden
in some areas as a protective measure for coral reefs.
Because the substance is difficult to break down, the environmental burden of octocrylene
can be estimated through the measurement of its concentration in sludge from waste treatment
facilities.
The study of Plagellat et al. (2006) analyzed octocrylene (OC) concentrations in 24 sludge samples from
purification plants of three different types (Treatment), collected in two different months
(Month). Additionally, the catchment area (Population) and the amount of sludge (Production)
are known. Treatment type A refers to small plants, B to medium-sized plants without considerable
industry and C to medium-sized plants with industry.
R-Code 11.3 prepares the data and fits first a one-way ANOVA based on Treatment only,
followed by a two-way ANOVA based on Treatment and Month (with and without interactions).
Note that the setup is not balanced with respect to treatment type.
Adding the factor Month improves the model fit considerably (the adjusted R2 increases
from 40% to 50%) and the standard errors of the treatment effect estimates are further reduced.
The interaction is not significant, as the corresponding p-value is above 14%. Based on
Figure 11.1 this is not surprising: first, the seasonal effects of groups A and B are very similar
and, second, the variability in group C is too large. ♣
str(UV, strict.width='cut')
## 'data.frame': 24 obs. of 7 variables:
## $ Treatment : chr "A" "A" "A" "A" ...
## $ Site_code : chr "A11" "A12" "A15" "A16" ...
## $ Site : chr "Chevilly" "Cronay" "Thierrens" "Prahins" ...
## $ Month : chr "jan" "jan" "jan" "jan" ...
## $ Population: int 210 284 514 214 674 5700 8460 11300 6500 7860 ...
## $ Production: num 2.7 3.2 12 3.5 13 80 150 220 80 250 ...
## $ OT : int 1853 1274 1342 685 1003 3502 4781 3407 11073 3324 ...
with( UV, table(Treatment, Month))
## Month
## Treatment jan jul
## A 5 5
## B 3 3
## C 4 4
options( contrasts=c("contr.sum", "contr.sum"))
lmout <- lm( log(OT) ~ Treatment, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment, data = UV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.952 -0.347 -0.136 0.343 1.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.122 0.116 70.15 < 2e-16 ***
## Treatment1 -0.640 0.154 -4.16 0.00044 ***
## Treatment2 0.438 0.175 2.51 0.02049 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.555 on 21 degrees of freedom
## Multiple R-squared: 0.454,Adjusted R-squared: 0.402
## F-statistic: 8.73 on 2 and 21 DF, p-value: 0.00174
lmout <- lm( log(OT) ~ Treatment + Month, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment + Month, data = UV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7175 -0.3452 -0.0124 0.1691 1.2236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.122 0.106 76.78 < 2e-16 ***
## Treatment1 -0.640 0.141 -4.55 0.00019 ***
## Treatment2 0.438 0.160 2.74 0.01254 *
## Month1 -0.235 0.104 -2.27 0.03444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.507 on 20 degrees of freedom
## Multiple R-squared: 0.566,Adjusted R-squared: 0.501
## F-statistic: 8.69 on 3 and 20 DF, p-value: 0.000686
summary( aovout <- aov( log(OT) ~ Treatment * Month, data=UV))
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 5.38 2.688 11.67 0.00056 ***
## Month 1 1.32 1.325 5.75 0.02752 *
## Treatment:Month 2 1.00 0.499 2.17 0.14355
## Residuals 18 4.15 0.230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD( aovout, which=c('Treatment'))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log(OT) ~ Treatment * Month, data = UV)
##
## $Treatment
## diff lwr upr p adj
## B-A 1.07760 0.44516 1.71005 0.00107
## C-A 0.84174 0.26080 1.42268 0.00446
## C-B -0.23586 -0.89729 0.42556 0.64105
boxplot( log(OT)~Treatment, data=UV, col=7, boxwex=.5)
at <- c(0.7, 1.7, 2.7, 1.3, 2.3, 3.3)
boxplot( log(OT)~Treatment+Month, data=UV, add=T, at=at, xaxt='n', boxwex=.2)
Figure 11.1: UVfilter data: box plots sorted by treatment and interaction plot. (See
R-Code 11.3.)
Problem 11.2 (ANOVA) We consider the data chemosphere_OC.csv available on the course
web page. The data describe the octocrylene (OC) concentration sampled from 12 wastewater
treatment plants in Switzerland. Further variables in the dataset are: Behandlung (treatment of
the wastewater), Monat (month when the sample was collected), Einwohner (number of inhabitants
connected to the plant), Produktion (sludge production in metric tons of dry matter per year,
i.e., everything that does not enter the water system after treatment).
a) Describe the data. Do a visual inspection to check for differences between the treatment
types and between the months of data acquisition. Use an appropriate plot function to do
so and describe your results.
Hint: Also try the function table().
b) Fit a one-way ANOVA with log(OC) as response variable and Behandlung as explanatory
variable.
Hint: use lm and perform an anova on the output. Don’t forget to check model assump-
tions.
c) Extend the model to a two-way ANOVA by adding Monat as a predictor. Interpret the
summary table.
d) Test if there is a significant interaction between Behandlung and Monat. Compare the
result with the output of interaction.plot
e) Extend the model from (b) by adding Produktion as an explanatory variable. Perform an
anova on the model output and interpret the summary table. (Such a model is sometimes
called Analysis of Covariance, ANCOVA).
Switch the order of your explanatory variables and run an anova on both model outputs.
Discuss the results of Behandlung + Produktion and Produktion + Behandlung. What
causes the differences?
Problem 11.3 (ANOVA table) Calculate the missing values in the following table:
How many observations have been taken in total? Do we have a balanced or complete four-
way setting?
Problem 11.4 (ANCOVA) The dataset rats of the package faraway consists of a two-way
ANOVA design with the factors poison (three levels I, II and III) and treatment (four levels A, B,
C and D). To study the toxic agents, 4 rats were exposed to each combination of factor levels. The response
was survival time in tens of hours.
Problem 11.5 (ANCOVA) The perceived stress scale (PSS) is the most widely used psycho-
logical instrument for measuring the perception of stress. It is a measure of the degree to which
situations in one’s life are appraised as stressful.
The dataset PrisonStress from the package PairedData gives the PSS measurements for 26
people in prison at entry and at exit. Some of these people were physically trained during
their imprisonment.
a) Describe the data. Do a visual inspection to check for differences between the treatment
types and PSS.
Problem 11.6 (BMJ Endgame) Discuss and justify the statements about ‘One way analysis of
variance’ given in doi.org/10.1136/bmj.e2427.
Chapter 12
Design of Experiments
Design of Experiments (DoE) is a relatively old field of statistics. Pioneering work was done
almost 100 years ago by Sir Ronald Fisher and co-workers at Rothamsted Experimental
Station, England, where mainly agricultural questions were discussed. The topic was
taken up by industry after the Second World War to, e.g., optimize the production of chemical
compounds or to work with robust parameter designs. In recent decades, advances have still been made,
for example in using the abundance of data in machine-learning-type discovery or in preclinical
and clinical research, where the sample sizes are often extremely small.
In this chapter we will selectively cover different aspects of DoE, focusing on sample size
calculations, power and randomization. Additionally, we also cover a few domain specific concepts
and terms that are often used in the context of setting up experiments for clinical or preclinical
trials.
Here, the term ‘experiment’ describes a controlled procedure that is (hopefully) carefully
designed to test one (or very few) scientific hypotheses. In the context of this chapter, the
hypothesis is often about the effect of independent variables on one dependent variable, i.e., the
outcome measure. In terms of our linear model equation (10.2): what is the effect of one or several
of the xiℓ on the Yi? Designing the experiment implies the choice of the independent variables
(we need to account for possible confounders or effect modifiers), the values thereof (fixed at
particular “levels” or randomly chosen) and the sample size. Again in terms of our model, we need
to include and properly specify all necessary predictors xiℓ that have an effect on Yi. Finally, we
need to determine the sample size n such that the desired effects, if they exist, are statistically
significant.
The prime paradigm of DoE is
Maximize primary variance, minimize error variance and control for secondary variance,
which translates to: maximize the signal we are investigating, minimize the noise we are not
modeling and control for uncertainties with carefully chosen independent variables.
In the context of DoE we often want to compare the effect of a treatment (or procedure) on
an outcome. Examples that have been discussed in previous chapters are: “Is there a progression
of pododermatitis at the hind paws over time?”, “Is a diuretic medication during pregnancy
reducing the risk of pre-eclampsia?”, “How much can we increase hardness of metal springs with
lower temperatures of quenching baths?”, “Is residual octocrylene in waste water sludge linked
to particular waste water types?”.
When planning an experiment, we should always carefully evaluate the sample size n that is required
to be able to properly address our hypothesis. In many cases the sample size needs to be determined
before starting the experiment: for organizing the experiment time-wise, acquiring the necessary funds,
filing study protocols or submitting a license application to an ethics commission. As a general rule, we
choose as many samples as possible but as few as necessary, to balance statistical and economic
interests.
Suppose that we test the effect of a dietary treatment for female rabbits (say, with and without
a vitamin additive) on the weight of the litter within two housing boxes. Each doe (i.e., female
reproductive rabbit) in the box receives the same treatment, i.e., the treatment is not applied to
a single individual subject and could not be individually controlled for. All does form a single,
so-called experimental unit. The outcomes or responses are measured on the response units,
which are typically “smaller” than the experimental units. In our example, we would weigh the
litter of each doe in the housing box individually, but aggregate or average these to a single
number. As a side note, this average often justifies the use of a normal response model when an
experimental unit consists of several response units.
Formally, experimental units are entities which are independent of each other and to which
it is possible to assign a treatment or intervention independently of the other units. The experi-
mental unit is the unit which has to be replicated in an experiment. Below, when we talk about
a sample, we are talking about a sample of experimental units. We do not discuss the choice
of the response units here, as it is most often situation specific and a statistician has little to
contribute.
In this section, we link the sample size to the width of three different confidence intervals.
Specifically, we discuss the necessary number of observations required such that the width of the
empirical confidence interval has a predefined size.
To start, assume that we are in the setting of a simple z confidence interval at level (1 − α)
with known σ, as seen in Equation (4.31). Its width is $2 z_{1-\alpha/2}\,\sigma/\sqrt{n}$, so if we want to ensure an empirical interval width ω, we
need
$$n \approx 4 z_{1-\alpha/2}^2 \frac{\sigma^2}{\omega^2}. \tag{12.1}$$
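A minimal sketch of (12.1) in R; σ, ω and the level are illustrative choices only.

# Required n so that a z confidence interval has (approximately) width omega.
n.width <- function(omega, sigma, alpha=0.05)
  ceiling(4 * qnorm(1 - alpha/2)^2 * sigma^2 / omega^2)
n.width(omega=1, sigma=2)    # e.g., sigma = 2 and target width 1 at the 95% level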
The same approach is used when estimating a proportion. We use, for example, the Wald
confidence interval and obtain
$$n \approx 4 z_{1-\alpha/2}^2 \frac{\widehat{p}(1-\widehat{p})}{\omega^2}, \tag{12.2}$$
which corresponds to (12.1) with the plug-in estimate $\widehat{p}(1-\widehat{p})$ for $\widehat{\sigma}^2$ in the setting of
a Bernoulli random variable and a central limit approximation for X/n. Of course, $\widehat{p}$ is not
known a priori and we often take the conservative choice $\widehat{p} = 1/2$, as the function x(1 − x)
is maximized over (0, 1) at x = 1/2. Thus, without any prior knowledge of p we may choose
conservatively n ≈ (z1−α/2 /ω)2 . Alternatively, the sample size calculation can be done based on
a Wilson confidence interval (6.11), where a quadratic equation needs to be solved to obtain n
(see Problem 12.1.a).
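The analogous Wald-based calculation (12.2), including the conservative choice p̂ = 1/2; the values of ω and p̂ are again purely illustrative.

# Wald-based sample size for a proportion with target interval width omega.
n.prop <- function(omega, p=0.5, alpha=0.05)
  ceiling(4 * qnorm(1 - alpha/2)^2 * p * (1 - p) / omega^2)
n.prop(omega=0.1)            # conservative choice p = 1/2
n.prop(omega=0.1, p=0.2)     # smaller if a rough guess of p is available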
If we are estimating a Pearson’s correlation coefficient, we can use CI 6 to link interval width
ω with n. Here, we use an alternative approach and would like to determine sample size such
that the interval does not contain the value zero, i.e., the width is just smaller than 2r. The
derivation relies on the duality of tests and confidence intervals (see Section 5.4). Recall Test 14
for Pearson’s correlation coefficient. From Equation (9.3) we construct the critical value for the
test (boundary of the rejection region, see Figure 5.3) and based on that we can calculate the
minimum sample size necessary to detect a correlation |r| ≥ rcrit as significant:
$$t_{\text{crit}} = r_{\text{crit}} \frac{\sqrt{n-2}}{\sqrt{1-r_{\text{crit}}^2}} \qquad\Longrightarrow\qquad r_{\text{crit}} = \frac{t_{\text{crit}}}{\sqrt{n-2+t_{\text{crit}}^2}}. \tag{12.3}$$
Figure 12.1 illustrates the least significant correlation for specific sample sizes. Specifically, with
sample size n < 24 correlations between −0.4 and 0.4 are not significant and for a correlation of
±0.25 to be significant, we require n > 62 at level α = 5% (see R-Code 12.1).
R-Code 12.1 Significant correlation for specific sample sizes (See Figure 12.1.)
Figure 12.1: Significant correlation for specific sample sizes (at level α = 5%). For
a sample correlation of 0.25, n needs to be larger than 62 as indicated with the gray
lines. For a particular n, correlations above the line are significant, below are not. (See
R-Code 12.1.)
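Since the body of R-Code 12.1 is not reproduced here, the following sketch shows one way to compute and plot r_crit as a function of n based on (12.3); it reproduces the values quoted above.

# Least significant correlation as a function of the sample size n (alpha = 5%).
n <- 3:100
tcrit <- qt(0.975, df=n - 2)
rcrit <- tcrit / sqrt(n - 2 + tcrit^2)
plot(n, rcrit, type="l", ylim=c(0, 1), ylab=expression(r[crit]))
abline(h=0.25, v=min(n[rcrit < 0.25]), col="gray")   # n required for |r| = 0.25
min(n[rcrit < 0.25])                                  # roughly 63, i.e., n > 62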
Suppose we want to “detect” the alternative with probability 1 − β(µ1), i.e., reject the null
hypothesis with probability 1 − β when the true mean is µ1. Hence, plugging the values into (12.4)
and solving for n we have the approximate sample size
$$n \approx \left( \frac{z_{1-\alpha} + z_{1-\beta}}{\mu_0 - \mu_1} \right)^2 \sigma^2. \tag{12.5}$$
Hence, the sample size depends on the Type I and Type II error probabilities as well as on the standard
deviation and the difference of the means. The latter three quantities are often combined into the
standardized effect size
$$\delta = \frac{\mu_0 - \mu_1}{\sigma}. \tag{12.6}$$
If the standard deviation is not known, an estimate can be used. An estimate-based version of
δ is often termed Cohen's d.
For a two-sided test, a similar expression is found where z1−α is replaced by z1−α/2. For a
one-sample t-test (Test 1) the right-hand side of (12.5) is analogous, with the quantiles tn−1,1−α
and tn−1,1−β, respectively. Note that now the right-hand side depends on n through the
quantiles as well. To determine n, we start with a reasonable value for n to obtain the quantiles,
calculate the resulting n and repeat the two steps for at least one more iteration. In R, the
function power.t.test() uses a numerical approach.
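A sketch of the iteration described above for a one-sample, one-sided t-test (hypothetical values δ = 0.5, α = 5%, power 80%), with power.t.test() as a numerical reference.

# Iterative sample size for a one-sample t-test with standardized effect delta.
delta <- 0.5; alpha <- 0.05; beta <- 0.2
n <- ceiling(((qnorm(1 - alpha) + qnorm(1 - beta)) / delta)^2)   # normal start value
for (i in 1:5)                                # a few iterations usually suffice
  n <- ceiling(((qt(1 - alpha, n - 1) + qt(1 - beta, n - 1)) / delta)^2)
n
power.t.test(delta=delta, power=0.8, sig.level=alpha, type="one.sample",
             alternative="one.sided")$n       # numerical solution for comparison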
In the case of two independent samples, the degrees of freedom in the t-quantiles need to be
adjusted from n − 1 to n − 2. Cohen's d is then defined as $(\bar x_1 - \bar x_2)/s_p$, where $s_p^2$ is an estimate of the
pooled variance (e.g., as given in Test 2).
For t-tests in the behavioral sciences, Cohen (1988) defined small, medium and large (stan-
dardized) effect sizes as d = 0.2, 0.5 and 0.8, respectively. These are often termed the conventional
effect sizes but depend on the type of test, see also the function cohen.ES() of the R package
pwr. In preclinical studies, effect sizes are typically larger.
Example 12.1. In the setting of a two-sample t-test with equal group sizes, we need at level
α = 5% and power 1 − β = 80% in each group 26, 64 and 394 observations for a large, medium
and small effect size, respectively, see, e.g., power.t.test( d=0.2, power=0.8) or pwr.t.test(
d=0.2, power=0.8) from the pwr package.
For unequal sample sizes, the sum of both group sizes is a bit larger compared to equal sample
sizes (balanced setting). For a large effect size, we would, for example, require n1 = 20 and n2 =
35, leading to three more observations compared to the balanced setting, (pwr.t2n.test(n1=20,
d=0.8, power=.8) from the pwr package). ♣
Many R packages and other web interfaces allow one to calculate sample sizes for many different
settings of comparing means. Of course, care is needed when applying such toolboxes to check
whether they use the same parametrizations, definitions, etc.
12.3.1 Randomization
By randomly allocating the experimental units to the treatments, all other effects that are not accounted for are averaged out. Proper randomization
also protects against spurious correlations in the observations.
The randomization procedure should be a truly randomized procedure for all assignments,
ultimately performed by a genuine random number generator. There are several procedures,
simple randomization, balanced or constrained randomization, stratified randomization etc. and
of course, the corresponding sample sizes are determined a priori.
Simple randomization randomly assigns the type of treatment to an experimental unit. For
example, to assign 12 subjects to three groups we use sample(x=3, size=12, replace=TRUE).
This procedure has the disadvantage of leading to a possibly unbalanced design and thus should
not be used in practice.
In the case of discrete confounders it is possible to split the sample into subgroups according
to these pre-defined factors. These subgroups are often called blocks (when controllable) or
strata (when not). For the randomization, a randomized complete block design (RCBD), i.e., a stratified
randomization, is used. In an RCBD each block receives the same number of experimental units and
each block is run like a small completely randomized design (CRD) experiment.
Suppose we have six male and six female subjects. The randomization is done for both the
male and female groups according to cbind( matrix(sample(x=6, size=6, replace=FALSE),
nrow=3), matrix(sample(x=7:12, size=6, replace=FALSE), nrow=3)).
Example 12.2. Suppose we are studying the effect of fertilizer type on plant growth. We have
access to 24 plant pots arranged in a four by six alignment inside a glass house. We have three
different fertilizers and one control group.
To allocate the fertilizers to the plants we number the pots row by row (starting top right).
We randomly assign the 24 pots to four groups of six. The fertilizers are then
additionally randomly assigned to the different groups. This results in a CRD design (left panel
of Figure 12.2).
Suppose that the glass house has one open wall. This particular setup affects plant growth
unequally, because of temperature differences between the open and the opposite side. To account
for this difference we block the individual rows and randomly assign one fertilizer to each plant
in every row. This results in an RCBD design (right panel of Figure 12.2). ♣
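A possible implementation of the two allocations of Example 12.2, as a sketch: the treatment labels are made up and, for the RCBD part, the blocking is illustrated with six blocks of four pots each, so that every treatment occurs exactly once per block.

# Sketch of the allocations in Example 12.2 (labels and blocking layout illustrative).
set.seed(3)
trt <- c("control", "F1", "F2", "F3")           # one control and three fertilizers
# CRD: randomly split the 24 pots into four groups of six.
crd <- sample(rep(trt, each=6))                 # random permutation of the labels
matrix(crd, nrow=4, byrow=TRUE)                 # arranged as the 4 x 6 grid of pots
# RCBD: within each block every treatment is assigned exactly once, in random order.
rcbd <- replicate(6, sample(trt))               # one column per block
rcbd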
RCBD are used in practice and might be seen on fields of agricultural education and research
stations, as illustrated in Figure 12.3.
Remark 12.1. In certain situations there are two different experimental units in the same experiment. As an
illustration, suppose we are studying the effect of irrigation amount and fertilizer type on crop
yield. If irrigation is more difficult to vary on a small scale and the fields are large enough to be
split, a split-plot design becomes appropriate. Irrigation levels are assigned to whole plots by a
CRD and fertilizer is assigned to subplots using an RCBD (irrigation defines the blocks).
Split-plot experiments are not straightforward to model statistically and we refer to follow-up
lectures. ♣
12.3.2 Biases
Similar to the bias of an estimator, we denote any systematic error of an experiment as a bias.
Biases can be introduced at the design of the study, during the experiment itself and at the analysis
phase. Inherently, these biases are not part of the statistical model and thus induce biases in the
estimates and/or inflate the variance of the estimates.
There are many different types of biases and we explain them in their simplest form, assuming
an experimental setup of two groups comparing one treatment to a control.
• Selection bias occurs when the two groups differ systematically beyond the effect that is
analyzed.
When studies are not carefully implemented, it is possible to associate a smaller risk of
dementia with smokers compared to non-smokers. Here, selection bias occurs because smokers
have a lower life expectancy compared to non-smokers and thus exhibit fewer dementia
cases (Hernán et al., 2008).
• Confirmation bias occurs when the experimenter or analyst searches for a confirmation
in the experiment.
In a famous study, students were told that there exist two different types of rats, “maze
bright” and “maze dull” rats, where the former are genetically more apt to navigate a
maze. The result of an experiment by the students showed that the maze bright ones
performed systematically better than the maze dull ones (Rosenthal and Fode, 1963).
• Confounding is a bias that occurs due to another factor that distorts the relationship
between treatment and outcome.
There are many classical examples of confounding bias. One is that coffee drinkers have a
larger risk of lung cancers. Such studies typically neglect that there is a larger proportion of
smokers that are coffee drinkers, than non-smoking coffee drinkers (Galarraga and Boffetta,
2016).
• Immortal time bias occurs when, in certain cohort studies, subjects do not receive the treatment immediately after they have
entered the study. The time span between entering the study and receiving the treatment is called
the immortal time and needs to be taken into account when comparing to the control
group.
A famous example was a study that wrongly claimed that Oscar laureates live about four
years longer than comparable actors without an Oscar (Redelmeier and Singh, 2001). It is not
sufficient to group the actors into two groups (received an Oscar or not) and then to match the
actors in both groups.
• Performance bias is when a care giver or analyzer treats the subjects of the two groups
differently.
• Attrition bias occurs when participants do not have the same drop-out rate in the control
and treatment groups.
This section summarizes different terms that are used in the context of design of experiments.
An intervention is a process where a group of subjects (or experimental units) is subjected
to a surgical procedure, a drug injection, or some other form of treatment (intervention).
Control has several different uses in design. First, an experiment is controlled because scien-
tists assign treatments to experimental units. Otherwise, we would have an observational study.
Second, a control treatment is a “standard” treatment that is used as a baseline or basis of com-
parison for the other treatments. This control treatment might be the treatment in common
use, or it might be a null treatment (no treatment at all). For example, a study of new pain
killing drugs could use a standard pain killer as a control treatment, or a study on the efficacy
of fertilizer could give some fields no fertilizer at all. This would control for average soil fertility
or weather conditions.
Placebo is a null treatment that is used when the act of applying a treatment has an effect.
Placebos are often used with human subjects, because people often respond to any treatment:
for example, reduction in headache pain when given a sugar pill. Blinding is important when
placebos are used with human subjects. Placebos are also useful for nonhuman subjects. The
apparatus for spraying a field with a pesticide may compact the soil. Thus we drive the apparatus
over the field, without actually spraying, as a placebo treatment. In case of several factors, they
are combined to form treatments. For example, the baking treatment for a cake involves a given
time at a given temperature. The treatment is the combination of time and temperature, but we
can vary the time and temperature separately. Thus we speak of a time factor and a temperature
factor. Individual settings for each factor are called levels of the factor.
A randomized controlled trial (RCT) is a study in which people are allocated at random to
receive one of several clinical interventions. One of these interventions is the standard of comparison
or control. The control may be a standard practice, a placebo, a sham treatment or no
intervention at all. Someone who takes part in a RCT is called a participant or subject. RCTs
seek to measure and compare the outcomes after the participants received their intervention.
Because the outcomes are measured, RCTs are quantitative studies.
In sum, RCTs are quantitative, comparative, controlled experiments in which investigators
study two or more interventions in a series of individuals who receive them in random order.
The RCT is one of the simplest and most powerful tools in clinical research but often relatively
expensive.
Confounding occurs when the effect of one factor or treatment cannot be distinguished from
that of another factor or treatment. The two factors or treatments are said to be confounded.
Except in very special circumstances, confounding should be avoided. Consider planting corn
variety A in Minnesota and corn variety B in Iowa. In this experiment, we cannot distinguish
location effects from variety effects: the variety factor and the location factor are confounded.
Blinding occurs when the evaluators of a response do not know which treatment was given
to which unit. Blinding helps prevent bias in the evaluation, even unconscious bias from well-
intentioned evaluators. Double blinding occurs when both the evaluators of the response and
the subject (experimental units) do not know the assignment of treatments to units. Blinding
the subjects can also prevent bias, because subject responses can change when subjects have
expectations for certain treatments.
Before a new drug is admitted to the market, many steps are necessary: starting from a
discovery-based step toward highly standardized clinical trials (phases I, II and III). At the very
end, there are typically randomized controlled trials that, by design, (should) eliminate all possible
confounders.
At later steps, when searching for an appropriate drug, a decision may be made based on
“evidence”: what has been used in the past, what has been shown to work (for similar situations).
This is part of evidence-based medicine. Past information may be of varying quality, ranging
from ideas and opinions to case studies to RCTs or systematic reviews. Figure 12.4 represents a so-
called evidence-based medicine pyramid which reflects the quality of research designs (increasing)
and quantity (decreasing) of each study design in the body of published literature (from bottom
to top). For other scientific domains, similar pyramids exist, with bottom and top typically
remaining the same.
In a simple regression setting, the standard errors of $\widehat\beta_0$ and $\widehat\beta_1$ depend on $1/\sum_i (x_i - \bar x)^2$, see
the expressions for the estimates (9.12) and (9.13). Hence, to reduce the variability of the estimates,
we should increase $\sum_i (x_i - \bar x)^2$ as much as possible. Specifically, suppose the interval $[\,a, b\,]$
represents a natural range for the predictor; then we should choose half of the predictors as a
and the other half as b.
This last argument justifies a discretization of continuous (and controllable) predictor vari-
ables in levels. Of course this implies that we expect a linear relationship. If the relationship is
not linear, such a discretization may be devastating.
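A quick numerical check of the claim above, with illustrative values: Var(β̂1) is proportional to 1/Σ(xi − x̄)², so placing the xi at the endpoints of [a, b] beats, e.g., an equispaced design.

# Compare 1/sum((x - mean(x))^2) for two designs on [0, 1] with n = 10.
n <- 10; a <- 0; b <- 1
x.equi <- seq(a, b, length.out=n)               # equispaced design
x.ends <- rep(c(a, b), each=n/2)                # half at a, half at b
c(equispaced=1/sum((x.equi - mean(x.equi))^2),  # proportional to Var of the slope
  endpoints =1/sum((x.ends - mean(x.ends))^2))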
In the balanced setting, the decomposition of the sums of squares does not depend on the order
in which the factors enter the model (see also Equation (11.24)). In the unbalanced setting this is not the case and the decomposition
depends on the order in which we introduce the factors in the model. That means that the ANOVA
table of aov(y ~ f1 + f2) is not the same as that of aov(y ~ f2 + f1). We should consider the ANOVA table
as sequential: with each additional factor (row in the table) we reduce the remaining variability.
Hence, we should rather write
$$SS_T = SS_A + SS_{B|A} + SS_{AB|A,B} + SS_E, \tag{12.9}$$
where the term $SS_{B|A}$ indicates the sums of squares of factor B after correction for factor A and,
similarly, the term $SS_{AB|A,B}$ indicates the sums of squares of the interaction AB after correction for
factors A and B.
Equation (12.9) represents the sequential sums of squares decomposition, called Type I sequential
SS: $SS_A$, $SS_{B|A}$ and $SS_{AB|A,B}$. It is possible to show that $SS_{B|A} = SS_{A,B} - SS_A$,
where $SS_{A,B}$ is the classical sums of squares of a model with both factors but without interaction. An ANOVA
table such as given in Table 11.3 yields different p-values for H0 : β1 = · · · = βI = 0 and
H0 : γ1 = · · · = γJ = 0 if the order of the factors is exchanged. This is often a disadvantage and
for the F-test the so-called Type II partial SS, being $SS_{A|B}$ and $SS_{B|A}$, can be used. As no
interaction is involved, we should use Type II only if the interaction is not significant (in which
case it is to be preferred over Type I). Alternatively, Type III partial SS, $SS_{A|B,AB}$ and $SS_{B|A,AB}$,
may be used.
In R, the output of aov or anova is of Type I sequential SS. To obtain the other types, manual
calculations may be done or the function Anova(..., type=...) from the package car may be used.
Example 12.3. Consider Example 11.2 of Section 11.4, but eliminate the first observation
so that the design is unbalanced in both factors. R-Code 12.2 calculates the Type I sequential SS
for the same factor order as in R-Code 11.3. The Type II partial SS are consequently slightly different.
Note that the design is balanced for the factor Month and thus simply exchanging the order
does not alter the SS here. ♣
R-Code 12.2 Type I and II SS for UVfilter data without the first observation.
require( car)
lmout2 <- lm( log(OT) ~ Month + Treatment, data=UV, subset=-1) # omit 1st!
print( anova( lmout2), signif.stars=FALSE)
## Analysis of Variance Table
##
## Response: log(OT)
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 1 1.14 1.137 4.28 0.053
## Treatment 2 5.38 2.692 10.12 0.001
## Residuals 19 5.05 0.266
print( Anova( lmout2, type=2), signif.stars=FALSE) # type=2 is default
## Anova Table (Type II tests)
##
## Response: log(OT)
## Sum Sq Df F value Pr(>F)
## Month 1.41 1 5.31 0.033
## Treatment 5.38 2 10.12 0.001
## Residuals 5.05 19
a) Compare the sample sizes when using Wilson and Wald type confidence intervals for a
proportion.
b) In the context of a simple regression, the variances of $\widehat\beta_0$ and $\widehat\beta_1$ are given by
$$\operatorname{Var}(\widehat\beta_0) = \sigma^2 \Bigl( \frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2} \Bigr), \qquad \operatorname{Var}(\widehat\beta_1) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}.$$
Assume that x1 , . . . , xn ∈ [a, b], with n even. Show that the variances are minimized by
choosing half at a and the other half at b. What if n is odd?
Problem 12.2 (Study design) Consider a placebo-controlled trial for a treatment B (compared
to a placebo A). The clinician proposes to use ten patients, who first receive the placebo A and,
after a sufficiently long period, treatment B. Your task is to help the clinician find an optimal
design with at most 20 treatments and with at most 20 patients available.
a) Describe alternative designs and argue in which aspects they are better or worse than
the original proposal.
Problem 12.3 (Sample size calculation) Suppose we compare the mean of some treatment
in two equally sized groups. Let zγ denote the γ-quantile of the standard normal distribution.
Furthermore, the following properties are assumed to be known or fixed:
• Power 1 − β.
a) Write down the suitable test statistic and its distribution under the null hypothesis.
c) Prove analytically that the required sample size n in each group is at least
$$n = \frac{2\sigma^2 (z_{1-\beta} + z_{1-\alpha/2})^2}{\Delta^2}.$$
Problem 12.4 (Sample size and group allocation) A randomized clinical trial to compare
treatment A to treatment B is being conducted. To this end, 20 patients need to be allocated to
the two treatment arms.
a) Using R, randomize the 20 patients to the two treatments with equal probability. Repeat
the randomization a total of 1000 times, retaining the difference in group sizes, and visualize
the distribution of the differences with a histogram.
b) In order to obtain group sizes that are closer while keeping the randomization codes secure, a
random permuted block design with varying block sizes 2 and 4 and respective probabilities
0.25 and 0.75 is now to be used. Here, for a given block length, each possible block with equal
numbers of As and Bs is chosen with equal probability. Using R, randomize the 20 patients
to the two treatments using this design. Repeat the randomization a total of 1000 times,
retaining the difference in group size. What are the possible values this difference may take?
How often did these values occur?
Problem 12.5 (BMJ Endgame) Discuss and justify the statements about ‘Sample size: how
many participants are needed in a trial?’ given in doi.org/10.1136/bmj.f1041.
Chapter 13
Bayesian Approach
⋄ Explain the idea of the Bayes factor and link it to model selection
In statistics there exist two different philosophical approaches to inference: frequentist and
Bayesian inference. Past chapters dealt with the frequentist approach; now we introduce the
Bayesian approach. Here, we consider the parameter as a random variable with a suitable
distribution, which is chosen a priori, i.e., before the data is collected and analyzed. The goal is
to update this prior knowledge after observation of the data in order to draw conclusions (with
the help of the so-called posterior distribution).
Bayes' theorem states that for two events A and B with P(B) > 0,
$$\operatorname{P}(A \mid B) = \frac{\operatorname{P}(B \mid A)\, \operatorname{P}(A)}{\operatorname{P}(B)}, \tag{13.1}$$
and is shown by using Equation (2.5) twice. Bayes' theorem is often used in probability theory
to calculate probabilities along an event tree, as illustrated in the following archetypal example.
Example 13.1. A patient sees a doctor and gets a test for a (relatively) rare disease. The
prevalence of this disease is 0.5%. As is typical, the screening test is not perfect: it has a sensitivity
of 99% (true positive rate, i.e., the disease is properly identified in a sick patient) and a specificity
of 98% (true negative rate, i.e., a healthy person is correctly identified as disease free). What is the
probability that the patient has the disease, given that the test is positive?
Denoting the events D = ‘patient has the disease’ and + = ‘test is positive’, we can use Equation (13.1) to calculate
$$\operatorname{P}(D \mid +) = \frac{\operatorname{P}(+ \mid D)\operatorname{P}(D)}{\operatorname{P}(+ \mid D)\operatorname{P}(D) + \operatorname{P}(+ \mid D^c)\operatorname{P}(D^c)} = \frac{0.99 \cdot 0.005}{0.99 \cdot 0.005 + 0.02 \cdot 0.995} \approx 0.199. \tag{13.2}$$
Despite the positive test, the probability that the patient actually has the disease is thus only about 20%. ♣
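A one-line check of this calculation in R, with the values given above:

# P(D | +) via Bayes' theorem with prevalence 0.5%, sensitivity 99%, specificity 98%.
prev <- 0.005; sens <- 0.99; spec <- 0.98
sens * prev / (sens * prev + (1 - spec) * (1 - prev))   # approx. 0.199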
Extending Bayes’ theorem to the setting of two continuous random variables X and Y along
the definition of the conditional density (8.14), we have
$$f_{X|Y=y}(x \mid y) = \frac{f_{Y|X=x}(y \mid x)\, f_X(x)}{f_Y(y)}, \qquad \text{for all } y \text{ such that } f_Y(y) > 0. \tag{13.3}$$
In the context of Bayesian inference the random variable X will now be a parameter, typically
of the distribution of Y:
$$f_{\Theta|Y=y}(\theta \mid y) = \frac{f_{Y|\Theta=\theta}(y \mid \theta)\, f_\Theta(\theta)}{f_Y(y)}, \qquad \text{for all } y \text{ such that } f_Y(y) > 0. \tag{13.4}$$
Hence, current knowledge about the parameter is expressed by a probability distribution on the
parameter: the prior distribution. The model for our observations is called the likelihood. We
use our observed data to update the prior distribution and thus obtain the posterior distribution.
In the next section, we discuss examples where the parameter is the success probability of a
trial and the mean in a normal distribution.
Notice that P(B) in (13.1), P(+) in (13.2), or $f_Y(y)$ in (13.3) and (13.4) serves as a normalizing
constant, i.e., it is independent of A, D, x or the parameter θ, respectively. Thus, we often
write the posterior without this normalizing constant,
$$f_{\Theta|Y=y}(\theta \mid y) \propto f_{Y|\Theta=\theta}(y \mid \theta)\, f_\Theta(\theta)$$
(or in short form f(θ | y) ∝ f(y | θ)f(θ) if the context is clear). The symbol “∝” means
“proportional to”. For simplicity, we will omit the additional constraint that f(y) > 0.
Finally, we can summarize the most important result in Bayesian inference: the posterior
density is proportional to the likelihood multiplied by the prior density, i.e.,
$$f(\theta \mid y_1, \dots, y_n) \propto f(y_1, \dots, y_n \mid \theta) \times f(\theta).$$
Until recently, there were clear fronts between frequentists and Bayesians. Luckily, these differ-
ences have vanished.
Example 13.2. (beta-binomial model) A flexible prior distribution for a probability p is the beta distribution, with density
$$f(p) = c \cdot p^{\alpha-1} (1-p)^{\beta-1}, \qquad 0 \le p \le 1,$$
with normalization constant c. We write P ∼ Beta(α, β). Figure 13.6 shows densities for various
pairs (α, β).
If we investigate the probability of a lamb being male, then it is highly unlikely that p < 0.1
or p > 0.9. This additional knowledge about the parameter p would be reflected by using a prior
P ∼ Beta(5, 5), for example.
For y successes in n independent trials, i.e., Y | p ∼ Bin(n, p), the posterior density is then proportional to
$$f(p \mid y) \propto \binom{n}{y} p^{y} (1-p)^{n-y} \times c \cdot p^{\alpha-1} (1-p)^{\beta-1} \tag{13.8}$$
$$\propto p^{y} p^{\alpha-1} (1-p)^{n-y} (1-p)^{\beta-1} = p^{y+\alpha-1} (1-p)^{n-y+\beta-1}, \tag{13.9}$$
which we recognize as being proportional to a Beta(y + α, n − y + β) density. The posterior is thus
again a beta distribution, with posterior mean
$$\operatorname{E}(P \mid Y = y) = \frac{y+\alpha}{n+\alpha+\beta}. \tag{13.10}$$
Figure 13.1: Beta-binomial model with prior density (cyan), data/likelihood (green)
and posterior density (blue).
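The figure can be reproduced along the following lines; a minimal sketch assuming, as in Example 13.2, y = 10 successes in n = 13 trials and the prior Beta(5, 5) (the rescaling of the likelihood is for display only).

# Prior Beta(5,5), rescaled likelihood for y = 10 of n = 13, posterior Beta(15,8).
n <- 13; y <- 10; a.pri <- 5; b.pri <- 5
p <- seq(0, 1, length.out=501)
lik <- dbinom(y, size=n, prob=p)
plot(p, dbeta(p, y + a.pri, n - y + b.pri), type="l", col=4, ylab="Density")  # posterior
lines(p, dbeta(p, a.pri, b.pri), col=5)             # prior
lines(p, 3 * lik / max(lik), col=3)                 # likelihood, rescaled for display
abline(v=y/n, lty=3)                                # maximum likelihood estimate 10/13
legend("topleft", legend=c("Posterior", "Prior", "Data/likelihood (scaled)"),
       col=c(4, 5, 3), lty=1, bty="n")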
In the previous example, we use P ∼ Beta(α, β) and fix α and β during model specification,
which is why they are called hyper-parameters.
The beta distribution Beta(1, 1), i.e., α = 1, β = 1, is equivalent to a uniform distribution
U(0, 1). The uniform distribution for the probability p, however, does not mean “information-
free”. As a result of Equation (13.10), a uniform distribution as prior is “equivalent” to two exper-
iments, of which one is a success. That means, we can see the prior as two pseudo-observations.
In the next example, both the data and the parameter are continuous.
Example 13.3. (normal-normal model) Let $Y_1, \dots, Y_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$. We assume σ is known.
The mean µ is the only parameter of interest, for which we assume the prior N(η, τ²). Thus, we
have the Bayesian model:
$$Y_i \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \qquad i = 1, \dots, n, \tag{13.11}$$
$$\mu \sim \mathcal{N}(\eta, \tau^2), \tag{13.12}$$
where σ², η and τ² are considered as hyper-parameters. Notice that we have again slightly abused
the notation by using µ as the realization in (13.11) and as the random variable in (13.12). Since
the context determines the meaning, we use this simplification for the parameters in the Bayesian
context. The posterior density is then
$$f(\mu \mid y_1, \dots, y_n) \propto f(y_1, \dots, y_n \mid \mu) \times f(\mu) = \prod_{i=1}^{n} f(y_i \mid \mu) \times f(\mu) \tag{13.13}$$
$$\propto \prod_{i=1}^{n} \exp\Bigl( -\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2} \Bigr) \exp\Bigl( -\frac{1}{2} \frac{(\mu - \eta)^2}{\tau^2} \Bigr) \tag{13.14}$$
$$\propto \exp\Bigl( -\frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - \mu)^2}{\sigma^2} - \frac{1}{2} \frac{(\mu - \eta)^2}{\tau^2} \Bigr). \tag{13.15}$$
Completing the square in µ shows that the posterior is again normal,
$$\mu \mid y_1, \dots, y_n \sim \mathcal{N}\bigl(v^{-1} m,\; v^{-1}\bigr), \qquad v = \frac{n}{\sigma^2} + \frac{1}{\tau^2}, \quad m = \frac{n \bar y}{\sigma^2} + \frac{\eta}{\tau^2}. \tag{13.17}$$
As a summary statistic of the posterior distribution the posterior mode is often used. Nat-
urally, the posterior median and posterior mean (i.e., expectation of the posterior distribution)
are intuitive alternatives. In the case of the previous example, the posterior mode is the same as
the posterior mean.
Figure 13.2: Normal-normal model with prior (cyan), data/likelihood (green) and
posterior (blue). (See R-Code 13.1.)
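R-Code 13.1 is not reproduced above; the following minimal sketch assumes ȳ = 2.1, n = 4, σ = 1 and the prior N(0, 2), values that are consistent with the credible interval [0.94, 2.79] of Example 13.4.

# Normal-normal model: prior, likelihood of the data mean, and posterior.
ybar <- 2.1; n <- 4; sigma <- 1     # data summary (assumed values)
eta <- 0; tau2 <- 2                 # prior N(eta, tau2)
v <- n/sigma^2 + 1/tau2             # posterior precision
m <- n*ybar/sigma^2 + eta/tau2      # posterior mean is m/v, variance 1/v
mu <- seq(-2, 4, length.out=501)
plot(mu, dnorm(mu, m/v, sqrt(1/v)), type="l", col=4, ylab="Density")   # posterior
lines(mu, dnorm(mu, eta, sqrt(tau2)), col=5)                           # prior
lines(mu, dnorm(mu, ybar, sigma/sqrt(n)), col=3)                       # data/likelihood
m/v + c(-1, 1) * qnorm(0.975) / sqrt(v)   # 95% credible interval, cf. Example 13.4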
Interval estimation in the frequentist approach results in confidence intervals. But sample con-
fidence intervals need to be interpreted with care, in a context of repeated sampling. A sample
(1 − α)% confidence interval [bu , bo ] contains the true parameter with a frequency of (1 − α)% in
infinite repetitions of the experiment. With a Bayesian approach, we can now make statements
about the parameter with probabilities. In Example 13.3, based on Equation (13.17),
$$\operatorname{P}\bigl( v^{-1} m - z_{1-\alpha/2} v^{-1/2} \le \mu \le v^{-1} m + z_{1-\alpha/2} v^{-1/2} \bigr) = 1 - \alpha, \tag{13.19}$$
with $v = n/\sigma^2 + 1/\tau^2$ and $m = n\bar y/\sigma^2 + \eta/\tau^2$. That means that the bounds $v^{-1} m \pm z_{1-\alpha/2} v^{-1/2}$
can be used to construct a Bayesian counterpart to a confidence interval.
Definition 13.1. An interval $[\,b_u, b_o\,]$ such that
$$\int_{b_u}^{b_o} f(\theta \mid y_1, \dots, y_n)\, d\theta = 1 - \alpha$$
is called a (1 − α)% credible interval for θ with respect to the posterior density f(θ | y1 , . . . , yn)
and 1 − α is the credible level of the interval. ♢
The definition states that the parameter θ, now seen as a random variable whose posterior
density is given by f (θ | y1 , . . . , yn ), is contained in the (1−α)% credible interval with probability
(1 − α).
Example 13.4 (continuation of Example 13.3). The interval [ 0.94, 2.79 ] is a 95% credible
interval for the parameter µ. ♣
Since the credible interval for a fixed α is not unique, the “narrowest” is often used. This
is the so-called HPD interval (highest posterior density interval). HPD intervals and credible
intervals in general are often determined numerically.
Example 13.5 (continuation of Example 13.2). The 2.5% and 97.5% quantiles of the poste-
rior (13.9) are 0.45 and 0.83, respectively. A HPD is given by the bounds 0.46 and 0.84. The
differences are not pronounced as the posterior density is fairly symmetric. Hence, the widths of
both are almost identical: 0.377 and 0.375.
The frequentist sample 95% CI is [0.5, 0.92], with width 0.42, see Equation (6.9). ♣
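The quantile-based bounds can be verified with qbeta; the HPD interval can be obtained numerically, for example by minimizing the interval width over the lower tail probability (a sketch; the posterior Beta(15, 8) again assumes y = 10, n = 13 and the Beta(5, 5) prior).

# Equal-tailed credible interval and a numerically determined HPD interval.
a <- 15; b <- 8                                   # posterior Beta(15, 8)
qbeta(c(0.025, 0.975), a, b)                      # approx. 0.45 and 0.83
width <- function(lo) qbeta(lo + 0.95, a, b) - qbeta(lo, a, b)
lo <- optimize(width, c(0, 0.05))$minimum         # shift tail prob. to minimize width
qbeta(c(lo, lo + 0.95), a, b)                     # HPD bounds, approx. 0.46 and 0.84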
In the classical regression framework, the estimated regression line represented the mean of an
unobserved new observation at a given predictor location. To fully assess the uncertainty of the prediction, we had to take into
account the uncertainty of the estimates and argued that the prediction is given by a t-distribution
(see CI 7).
In the Bayesian setting, the likelihood $f(y_{\text{new}} \mid \theta)$ can be seen as the density of a predictive
distribution, that is, the distribution of an unobserved new observation $y_{\text{new}}$. As in the
classical regression framework, one could use $f(y_{\text{new}} \mid \widehat\theta)$, with $\widehat\theta$ some Bayesian estimate of the parameter
(e.g., posterior mean or posterior mode). The better approach is based on the posterior predictive
distribution, defined as follows.
Definition 13.2. The posterior predictive distribution of a Bayesian model with likelihood f(y |
θ) and prior f(θ) is
$$f(y_{\text{new}} \mid y_1, \dots, y_n) = \int f(y_{\text{new}} \mid \theta)\, f(\theta \mid y_1, \dots, y_n)\, d\theta. \tag{13.21}$$ ♢
In the previous equation, $f(y_{\text{new}} \mid \theta)$ represents the likelihood: given θ, a new observation does not
depend on the data, i.e., $f(y_{\text{new}} \mid \theta, y_1, \dots, y_n) = f(y_{\text{new}} \mid \theta)$.
Figure 13.3: Predictive posterior distribution for the beta-binomial model (red) and
likelihood with plugin parameter pb = 10/13 (black).
Example 13.6 (continuation of Example 13.2). In the context of the beta-binomial model, the
posterior predictive distribution is constructed based on the single observation y only:
$$f(y_{\text{new}} \mid y) = \int_0^1 f(y_{\text{new}} \mid p) \times f(p \mid y)\, dp = \binom{n}{y_{\text{new}}}\, c \int_0^1 p^{y_{\text{new}}} (1-p)^{n-y_{\text{new}}} \times p^{y+\alpha-1} (1-p)^{n-y+\beta-1}\, dp, \tag{13.22}$$
where c is the normalizing constant for the posterior. The integral itself gives us the normalizing
constant of a Beta(ynew + y + α, 2n − ynew − y + β) distribution. We do not recognize this
distribution per se. As illustration, Figure 13.3 shows the posterior predictive distribution based
on the observation y = 10 and prior Beta(5, 5). The prior implies that the posterior predictive
distribution is much more centered compared to the likelihood with plugin parameter pb = 10/13
(i.e., the binomial distribution Bin(13, 10/13)).
R-Code 13.2 Predictive distribution with the beta-binomial model. (See Figure 13.3.)
library(LearnBayes)
n <- 13
y <- 0:n
pred.probs <- pbetap(c( 10+5, 13-10+5), n, y) # prior Beta(5,5)
plot(y, pred.probs, type="h", ylim=c(0,.27), col=2, ylab='')
lines( y+0.07, dbinom(y, size=n, prob=10/13), type='h')
legend("topleft", legend=c("Predictive posterior", "Likelihood plugin"),
col=2:1, lty=1, bty='n')
Even in the simple case of the beta-binomial model, it is not straightforward to derive the
posterior predictive distribution. Quite often, more detailed integration knowledge is required.
In the case of the normal-normal model as introduced in Example 13.3, it is possible to show
that the posterior predictive distribution is again normal, $\mathcal{N}(\mu_{\text{post}}, \sigma^2 + \sigma^2_{\text{post}})$, where $\mu_{\text{post}}$ and $\sigma^2_{\text{post}}$
denote the posterior mean and the posterior variance of µ.
Example 13.7. We consider the setup of Example 13.2 and compare the models with p = 1/2
and p = 0.8 when observing 10 successes among the 13 trials. To calculate the Bayes factor, we
need to calculate P(Y = 10 | p) for p = 1/2 and p = 0.8. Hence, the Bayes factor is
$$BF_{01} = \frac{\binom{13}{10}\, 0.5^{10} (1-0.5)^{3}}{\binom{13}{10}\, 0.8^{10} (1-0.8)^{3}} = \frac{0.0349}{0.2457} = 0.1421, \tag{13.24}$$
which is somewhat substantial ($1/0.1421 \approx 7$) in favor of H1. This is not surprising, as the
observed proportion $\widehat{p} = 10/13 = 0.77$ is close to p = 0.8, corresponding to H1. ♣
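A direct check of (13.24) in R:

# Bayes factor for the two simple hypotheses p = 0.5 versus p = 0.8 with y = 10, n = 13.
m0 <- dbinom(10, size=13, prob=0.5)   # 0.0349
m1 <- dbinom(10, size=13, prob=0.8)   # 0.2457
c(BF01=m0/m1, BF10=m1/m0)             # approx. 0.142 and 7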
In the example above, the hypotheses H0 and H1 are understood in the sense of H0 : θ = θ0
and H1 : θ = θ1. The situation for an unspecified alternative H1 : θ ̸= θ0 is much more interesting
and relies on using the prior f(θ) and integrating out the parameter θ:
$$f(y_1, \dots, y_n \mid H_1: \theta \ne \theta_0) = \int f(y_1, \dots, y_n \mid \theta)\, f(\theta)\, d\theta, \tag{13.25}$$
illustrated as follows.
Example 13.8 (continuation of Example 13.7). For the situation H1 : p ̸= 0.5 using the prior
Beta(5, 5), we have
$$\operatorname{P}(Y = 10 \mid H_1) = \int_0^1 \operatorname{P}(Y = 10 \mid p)\, f(p)\, dp = \int_0^1 \binom{13}{10} p^{10} (1-p)^{3} \cdot c\, p^{4} (1-p)^{4}\, dp = 0.0704, \tag{13.26}$$
where we used integrate( function(p) dbinom(10, 13, prob=p)*dbeta(p, 5, 5), 0, 1). Thus,
BF01 = 0.0349/0.0704 = 0.4957. Hence, the Bayes factor is approximately 2 in favor of H1, barely
worth mentioning. Under a uniform prior, the support for H1 only marginally increases
(from 2.017 to 2.046). ♣
Example 13.9 (continuation of Example 13.3, data from Example 5.8). We now look at a
Bayesian extension of the frequentist t-test. For simplicity we assume the one-sample setting
without deriving the explicit formulas. The package BayesFactor provides functionality to
calculate Bayes factors for different settings.
We take the pododermatitis scores (see R-Code 5.1). The R output below shows that the Bayes factor
comparing the null model µ = 3.33 against the alternative µ ̸= 3.33 is approximately 14. Here,
we have used the standard parameter setting, which includes the specification of the prior and
the prior variance. The prior variance can be specified with the argument rscale, with default
0.707 = √2/2. Increasing this variance leads to a flatter prior and thus to a smaller Bayes factor.
Default priors are typically very reasonable and we come back to the choice of the priors in the
next section. ♣
library(BayesFactor)
ttestBF(PDHmean, mu=3.33) # data may be reloaded.
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 14.557 ±0%
##
## Against denominator:
## Null, mu = 3.33
## ---
## Bayes factor type: BFoneSample, JZS
Bayes factors are popular because they are linked to the BIC (Bayesian Information Crite-
rion) and thus automatically penalize model complexity. Further, they also work for non-nested
models.
The examples in the last section were such that the posterior and prior distributions belong
to the same class. Naturally, this is no coincidence and such prior distributions are called
conjugate prior distributions.
With other prior distributions we may obtain posterior distributions that we no longer “recognize”
and normalizing constants must be explicitly calculated. An alternative approach to
integrating the posterior density is discussed in Chapter 14.
Besides conjugate priors, there are many more classes that are typically discussed in a full
Bayesian lecture. We prefer to classify the effect of the prior instead. Although not universal and
For large n, the difference between a Bayesian and a likelihood estimate is not pronounced. As a matter of fact, it is possible to show that the posterior mode converges to the maximum likelihood estimate as the number of observations increases.
Example 13.10. We consider again the normal-normal model and compare the posterior density for various n with the likelihood. We keep ȳ = 2.1, independent of n. As shown in Figure 13.4, the maximum likelihood estimate does not depend on n (ȳ is kept constant by design). However, the uncertainty decreases (the standard error is σ/√n). For increasing n, the posterior approaches the likelihood density. In the limit, there is no difference between the posterior and the likelihood. The R-Code follows closely R-Code 13.1. ♣
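A minimal sketch of this comparison, assuming a data variance of σ² = 1 (not specified above) and using the standard normal-normal updating formulas:

ybar <- 2.1;  eta <- 0;  tau2 <- 2;  sigma2 <- 1     # sigma2 = 1 is an assumption
curve( dnorm( x, eta, sqrt( tau2)), -2, 4, ylim=c(0, 4.5), ylab="Density", col="cyan")  # prior
for (n in c(4, 36, 64, 100)) {
  postvar <- 1 / (n/sigma2 + 1/tau2)                 # normal-normal updating
  postmean <- postvar * (n*ybar/sigma2 + eta/tau2)
  curve( dnorm( x, ybar, sqrt( sigma2/n)), add=TRUE, col="green")    # likelihood
  curve( dnorm( x, postmean, sqrt( postvar)), add=TRUE, col="blue")  # posterior
}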
Figure 13.4: Normal-normal model with prior (cyan), data/likelihood (green) and posterior (blue) for increasing n (n = 4, 36, 64, 100). Prior is N(0, 2) and ȳ = 2.1 in all cases.
where σ 2 , η and T are hyper-parameters. The model is quite similar to (13.11) and (13.12) with
the exception that we have a multivariate prior distribution for β. Instead of the parameter τ 2
we use σ 2 T, where T is a (p + 1) × (p + 1) symmetric positive definite matrix. The special form
will allow us to factor σ 2 and simplify the posterior. With a few steps, it is possible to show that
The function bayesglm() from the R package arm provides an accessible approach for simple Bayesian linear regression and logistic regression. It is simple in the sense that it returns the posterior modes of the estimates in a framework that is similar to a frequentist approach. We need to specify the priors for the regression coefficients (separately for the intercept and the remaining coefficients).
Example 13.11 (Bayesian approach to orings data). We revisit Example 10.5 and fit in R-
Code 13.4 a Bayesian logistic model to the data.
In the first bayesglm() call, we set the prior variances to infinity, resulting in uninformative
priors. The posterior mode is identical to the result of a classical glm() model fit.
In the second call, we use Gaussian priors for both parameters with mean zero and variance 9. This choice is set by prior.df=Inf (i.e., a t-distribution with infinite degrees of freedom, that is, a normal distribution), the default prior.mean=0, and prior.scale=3, and similarly for the intercept parameter. The slope parameter is hardly affected by the prior. The intercept is, because with this rather informative choice of the prior variance, its posterior mode is shrunk towards zero.
Note that summary(baye) should not be used, as the printed p-values are not relevant in the Bayesian context. The function display() is the preferred way. ♣
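A sketch of the second bayesglm() call described above, continuing from R-Code 13.4 (the object name baye2 is illustrative):

baye2 <- bayesglm( cbind(damage, 6-damage) ~ temp, family=binomial, data=orings,
    prior.mean=0, prior.scale=3, prior.df=Inf,            # N(0, 3^2) prior for the slope
    prior.mean.for.intercept=0, prior.scale.for.intercept=3, prior.df.for.intercept=Inf)
display( baye2)                                           # preferred summary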
Remark 13.1. For one particular class, consider several grades for each of a number of students. A possible model might be

    Yij = µ + αi + εij,   with εij iid∼ N(0, σ²),    (13.30)

where αi represents the performance of student i relative to the overall mean. This performance is not "fixed"; it depends on the particular student and is thus variable. Hence, it would make more
R-Code 13.4 orings data and estimated probability of defect dependent on air tempera-
ture. (See Figure 13.5.)
require(arm)
data( orings, package="faraway")
bayes1 <- bayesglm( cbind(damage,6-damage)~temp, family=binomial, data=orings,
prior.scale=Inf, prior.scale.for.intercept=Inf) #
coef(bayes1)
## (Intercept) temp
## 11.66299 -0.21623
# result "similar" to
# coef( glm( cbind(damage,6-damage)~temp, family=binomial, data=orings))
sense to consider αi as random, or, more specifically, αi iid∼ N(0, σα²), with αi and εij independent. Such models are called mixed effects models, in contrast to the fixed effects models as discussed in this book.
From a Bayesian perspective, such a separation is not necessary, as all Bayesian linear models are mixed effects models. In the example above, we impose the prior αi iid∼ N(0, σα²). ♣
Figure 13.5: orings data (proportion of damaged orings, black crosses), estimated probability of defect dependent on air temperature by a logistic regression (blue line). Gray lines are based on draws from the posterior distribution. (See R-Code 13.4.)
Figure 13.6 shows densities of the beta distribution for various pairs of (α, β).
R-Code 13.5 Densities of beta distributed random variables for various pairs of (α, β).
(See Figure 13.6.)
Figure 13.6: Densities of beta distributed random variables for various pairs of (α, β): (1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (0.8,0.8), (0.4,0.4), (0.2,0.2), (1,4), (0.5,4) and (2,4). (See R-Code 13.5.)
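A minimal sketch producing a few of these densities with dbeta() (only a subset of the pairs shown in the figure):

curve( dbeta( x, 1, 1), 0, 1, ylim=c(0, 4), ylab="Density")    # alpha = beta = 1
curve( dbeta( x, 2, 2), add=TRUE, col=2)
curve( dbeta( x, 5, 5), add=TRUE, col=3)
curve( dbeta( x, 1, 4), add=TRUE, col=4)                       # skewed
curve( dbeta( x, 2, 4), add=TRUE, col=5)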
any positive value and the gamma distribution is a natural choice because of its conjugacy with the normal likelihood.
A random variable Y with density f(y) = c y^(α−1) exp(−βy), for y > 0 and parameters α > 0, β > 0, is called gamma distributed with parameters α and β. We write Y ∼ Gam(α, β). The normalization constant c cannot be written in closed form for all parameters α and β.
    E(Y) = α/β,    Var(Y) = α/β².    (13.34)
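The moments in (13.34) are easily checked by simulation (the values α = 3 and β = 2 are arbitrary):

set.seed( 1)
y <- rgamma( 100000, shape=3, rate=2)
c( mean( y), 3/2)        # E(Y) = alpha/beta
c( var( y), 3/2^2)       # Var(Y) = alpha/beta^2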
The parameters α and β are also called the shape and rate parameter, respectively. The parameterization of the density in terms of the scale parameter 1/β is also frequently used.
Many results involving the variance parameter of a normal distribution would be simpler if we worked with the precision, i.e., the inverse of the variance. In such cases we place a so-called inverse-gamma distribution on the variance σ², which is equivalent to a gamma distribution on the precision τ = 1/σ².
A random variable Y is said to be distributed according to an inverse-gamma distribution if 1/Y is distributed according to a gamma distribution. For parameters α and β, the density is given by f(y) = c y^(−α−1) exp(−β/y), for y > 0, and we write Y ∼ IGam(α, β). The normalization constant c cannot be written in closed form for all parameters α and β.
For arbitrary β > 0 and for α > 1 and α > 2, respectively, we have

    E(Y) = β/(α − 1),    Var(Y) = β²/((α − 1)²(α − 2)).    (13.36)
a) Show that the likelihood of Example 13.3 can be written as c · exp(−n(ȳ − µ)²/σ²), where
Problem 13.2 (Sunrise problem) The sunrise problem is formulated as "what is the probability that the sun rises tomorrow?" Laplace formulated this problem by casting it into his rule of succession, which calculates the probability of a success after having observed y successes out of n trials (the (n + 1)th trial is again independent of the previous ones).
a) Formulate the rule of succession in a Bayesian framework and calculate the expected probability of a success at the (n + 1)th trial.
b) Laplace assumed that the Earth was created about 6000 years ago. If we use the same information, what is the probability that the sun rises tomorrow?
Problem 13.3 (Normal-gamma model) Let Y1, Y2, . . . , Yn iid∼ N(µ, 1/κ). We assume that the value of the expectation µ is known (i.e., we treat it as a constant in our calculations), whereas the precision, i.e., the inverse of the variance, denoted here with κ, is the parameter of interest.
a) Write down the likelihood of this model.
b) We choose a gamma prior for the parameter κ, i.e., κ ∼ Gam(α, β). How does this distribution relate to the exponential distribution?
c) Plot four densities for (α, β) = (1,1), (1,2), (2,1) and (2,2). How can a certain choice of (α, β) be interpreted with respect to our "beliefs" on κ?
d) Compare the prior and posterior distributions. Why is the choice in b) sensible?
e) Simulate some data with n = 50, µ = 10 and κ = 0.25. Plot the prior and posterior distributions of κ for α = 2 and β = 1.
Problem 13.4 (Bayesian statistics) For the following Bayesian models, derive the posterior
distribution and give an interpretation thereof in terms of prior and data.
a) Let Y | µ ∼ N (µ, 1/κ), where κ is the precision (inverse of the variance) and is assumed
to be known (hyper-parameter). Further, we assume that µ ∼ N (η, 1/ν), for fixed hyper-
parameters η and ν > 0.
b) Let Y | λ ∼ Pois(λ) with a prior λ ∼ Gam(α, β) for fixed hyper-parameters α > 0, β > 0.
c) Let Y | θ ∼ U(0, θ) with a shifted Pareto prior with parameters γ > 0 and ξ > 0, whose density is

    f(θ; γ, ξ) ∝ θ^(−(γ+1)) I_{θ>ξ}(θ).    (13.38)
Chapter 14

Monte Carlo Methods
In Chapter 13, the posterior distributions of several examples were similar to the chosen prior distribution, albeit with different parameters. Specifically, for binomial data with a beta prior, the posterior is again beta. This was no coincidence; rather, we chose so-called conjugate priors based on our likelihood (the distribution of the data).
With other prior distributions, we may obtain "complicated", non-standard posterior distributions, for which we no longer know the normalizing constant, the expected value or, in general, any other moment. Theoretically, we could derive the normalizing constant and then in subsequent steps determine the expectation and the variance (via integration) of the posterior. The calculation of these types of integrals is often complex, and so we consider here classic simulation procedures as a solution to this problem. In general, so-called Monte Carlo simulation is used to numerically solve a complex problem through repeated random sampling.
In this chapter, we start by illustrating the power of Monte Carlo simulation, where we utilize, above all, the law of large numbers. We then discuss one method to draw a sample from an arbitrary density and, finally, illustrate a method to derive (virtually) arbitrary posterior densities by simulation. We conclude the chapter with a few realistic examples.
where x1 , . . . , xn is a realization of a random sample with density fX (x). The method relies on
the law of large numbers (see Section 3.3).
Example 14.1. To estimate the expectation of a χ²₁ random variable, we can use mean( rnorm( 100000)^2), yielding 1 up to a couple of digits of precision, close to what we expect according to Equation (3.6).
Of course, we can use the same approach to calculate arbitrary moments of a χ²ₙ or F_{n,m} distribution. ♣
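In code (the seed is arbitrary):

set.seed( 14)
mean( rnorm( 100000)^2)   # Monte Carlo estimate of E(X), X chi-squared with 1 dof
mean( rnorm( 100000)^4)   # similarly for E(X^2), whose exact value is 3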
We now discuss this justification in slightly more detail. We consider a continuous function g (over the interval [a, b]) and the integral I = ∫_a^b g(x) dx. There exists a value ξ such that I = (b − a) g(ξ) (often termed the mean value theorem for definite integrals). We do not know ξ nor g(ξ), but we hope that the "average" value of g is close to g(ξ). More formally, let X1, . . . , Xn iid∼ U(a, b), which we use to calculate the average (the density of Xi is fX(x) = 1/(b − a) over the interval [a, b] and zero elsewhere). We now show that on average, our approximation is correct:

    E(Î) = E( (b − a) (1/n) Σ_{i=1}^n g(Xi) ) = (b − a) (1/n) Σ_{i=1}^n E(g(Xi)) = (b − a) E(g(X))
         = (b − a) ∫_a^b g(x) fX(x) dx = (b − a) ∫_a^b g(x) 1/(b − a) dx = ∫_a^b g(x) dx = I.    (14.3)
We can generalize this to almost arbitrary densities fX(x) having a sufficiently large support:

    Î = (1/n) Σ_{i=1}^n g(xi)/fX(xi),    (14.4)

where the justification is as in (14.3). The density in the denominator takes the role of an additional weight for each term.
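As a small illustration of (14.4) with a hypothetical integrand, sampling from an exponential density instead of a uniform one:

set.seed( 14)
g <- function(x) exp(-x) * cos(x)^2                  # hypothetical integrand
x <- rexp( 100000, rate=1)                           # sample from f_X, here the Exp(1) density
mean( g(x) / dexp( x, rate=1))                       # Monte Carlo estimate based on (14.4)
integrate( g, 0, Inf)$value                          # numerical reference, approximately 0.6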
Similarly, to integrate over a rectangle R in two dimensions (or a cuboid in three dimensions, etc.), we use a uniform random variable for each dimension. More specifically, let R = [a, b] × [c, d]; then

    ∫_R g(x, y) dx dy = ∫_a^b ∫_c^d g(x, y) dx dy ≈ (b − a)(d − c) (1/n) Σ_{i=1}^n g(xi, yi),    (14.5)

where x1, . . . , xn and y1, . . . , yn are two samples of U(a, b) and of U(c, d), respectively.
To approximate ∫_A g(x, y) dx dy for some complex domain A ⊂ R², we choose a bivariate random vector having a density fX,Y(x, y) whose support contains A. For example, we define a rectangle R = [a, b] × [c, d] such that A ⊂ R and let fX,Y(x, y) = 1/((b − a)(d − c)) over R and zero otherwise. We define the indicator function IA(x, y) that is 1 if (x, y) ∈ A and zero otherwise. Then we have the general formula

    ∫_A g(x, y) dx dy = ∫_a^b ∫_c^d IA(x, y) g(x, y) dx dy ≈ (1/n) Σ_{i=1}^n IA(xi, yi) g(xi, yi)/fX,Y(xi, yi).    (14.6)
Example 14.2. Consider the bivariate normal density specified in Example 8.4 and suppose we are interested in evaluating the probability P(X > Y²). To approximate this probability we can draw a large sample from the bivariate normal density and calculate the proportion for which xi > yi², as illustrated in R-Code 14.1, yielding 10.47%.
In this case, the function g is the density from which we are drawing the data points and thus cancels with fX,Y. Hence, Equation (14.6) reduces to calculating the proportion of the data satisfying xi > yi². ♣
R-Code 14.1 Calculating probability with the aid of a Monte Carlo simulation
set.seed( 14)
require(mvtnorm) # to sample the bivariate normals
l.sample <- rmvnorm( 10000, mean=c(0,0), sigma=matrix( c(1,2,2,5), 2))
mean( l.sample[,1] > l.sample[,2]^2) # calculate the proportion
## [1] 0.1047
Example 14.3. The area of the unit circle is π, as is the volume of a cylinder of height one placed over it. To estimate π we estimate the volume of this cylinder and consider U(−1, 1) for both coordinates of a square that contains the unit circle. The function g(x, y) = 1 is the constant function one, IA(x, y) is the indicator function of the set A = {x² + y² ≤ 1} and fX,Y(xi, yi) = 1/4 for −1 ≤ x, y ≤ 1. We have the following approximation of the number π:

    π = ∫_{−1}^{1} ∫_{−1}^{1} IA(x, y) dx dy ≈ 4 (1/n) Σ_{i=1}^n IA(xi, yi),    (14.7)

where x1, . . . , xn and y1, . . . , yn are two independent samples of U(−1, 1). Equation (14.6) again reduces to calculating a proportion.
It is important to note that the convergence is very slow, see Figure 14.1. It can be shown that the rate of convergence is of the order 1/√n. ♣
In practice, more efficient “sampling” schemes are used. More specifically, we do not sample
uniformly but deliberately “stratified”. There are several reasons to sample randomly stratified
but the discussion is beyond the scope of the work here.
R-Code 14.2 Approximation of π with the aid of Monte Carlo integration. (See Fig-
ure 14.1.)
set.seed(14)
m <- 49 # calculate for 49 different n
n <- round( 10+1.4^(1:m)) # non-equal spacing
piapprox <- numeric(m) # to store the approximation
for (i in 1:m) {
st <- matrix( runif( 2*n[i]), ncol=2) # bivariate uniform
piapprox[i] <- 4*mean( rowSums( st^2)<= 1) # proportion
}
plot( n, abs( piapprox-pi)/pi, log='xy', type='l') # plotting on log-log scale
lines( n, 1/sqrt(n), col=2, lty=2) # order of convergence
sel <- 1:7*7 # subset for printing
cbind( n=n[sel], pi.approx=piapprox[sel], rel.error= # summaries
abs( piapprox[sel]-pi)/pi, abs.error=abs( piapprox[sel]-pi))
## n pi.approx rel.error abs.error
## [1,] 21 2.4762 0.21180409 0.66540218
## [2,] 121 3.0083 0.04243968 0.13332819
## [3,] 1181 3.1634 0.00694812 0.02182818
## [4,] 12358 3.1662 0.00783535 0.02461547
## [5,] 130171 3.1403 0.00040166 0.00126186
## [6,] 1372084 3.1424 0.00025959 0.00081554
## [7,] 14463522 3.1406 0.00032656 0.00102592
Figure 14.1: Convergence of the approximation for π: the relative error as a function of n (log-log scale). (See R-Code 14.2.)
In this method, values from a known density fZ(z) (proposal density) are drawn and, through rejection of "unsuitable" values, observations from the density fY(y) (target density) are generated. This method can also be used when the normalizing constant of fY(y) is unknown and we write fY(y) = c · f*(y).
The procedure is as follows. Step 0: find an m < ∞ such that f*(y) ≤ m · fZ(y) for all y. Step 1: draw a realization ỹ from fZ(y) and a realization u from a standard uniform distribution U(0, 1). Step 2: if u ≤ f*(ỹ)/(m · fZ(ỹ)), then ỹ is accepted as a simulated value from fY(y); otherwise ỹ is discarded and no longer considered. We cycle through Steps 1 and 2 until a sufficiently large sample has been obtained. The algorithm is illustrated in the following example.
Example 14.4. The goal is to draw a sample from a Beta(6, 3) distribution with the rejection sampling method. That means fY(y) = c · y^(6−1)(1 − y)^(3−1) and f*(y) = y^5(1 − y)^2. As proposal density we use a uniform distribution, hence fZ(y) = I_{0≤y≤1}(y). We select m = 0.02, which fulfills the condition f*(y) ≤ m · fZ(y), since the maximum of f*, found with optimize( function(x) x^5*(1-x)^2, c(0, 1), maximum=TRUE), is roughly 0.0152.
An implementation of the example is given in R-Code 14.3. Of course, f_Z is always one here. The R-Code could be optimized with respect to speed; it would then, however, be more difficult to read.
Figure 14.2 shows a histogram and the density of the simulated values. By construction the bars of the target density are smaller than those of the proposal density. In this particular example, the accepted sample has size 285. ♣
R-Code 14.3: Rejection sampling in the setting of a beta distribution. (See Figure 14.2.)
set.seed( 14)
n.sim <- 1000
m <- 0.02
fstar <- function(y) y^( 6-1) * (1-y)^(3-1) # unnormalized target
f_Z <- function(y) ifelse( y >= 0 & y <= 1, 1, 0) # proposal density
result <- sample <- rep( NA, n.sim) # to store the result
for (i in 1:n.sim){
sample[i] <- runif(1) # ytilde, proposal
u <- runif(1) # u, uniform
if( u < fstar( sample[i]) /( m * f_Z( sample[i])) ) # if accept ...
result[i] <- sample[i] # ... keep
}
mean( !is.na(result)) # proportion of accepted samples
## [1] 0.285
result <- result[ !is.na(result)] # eliminate NAs
hist( sample, xlab="y", main="", col="lightblue") # hist of all proposals
hist( result, add=TRUE, col=4) # of the kept ones
curve( dbeta(x, 6, 3), frame =FALSE, ylab="", xlab='y', yaxt="n")
Figure 14.2: Left panel: histogram of the simulated values of fZ(y) (light blue) and fY(y) (dark blue). Right panel: theoretical density (truth, black) and the simulated density (smoothed empirical, blue dashed). (See R-Code 14.3.)
For efficiency reasons, the constant m should be chosen as small as possible to reduce the number of rejections. Nevertheless, in practice rejection sampling is intuitive but often quite inefficient. The next section illustrates an approach well suited for complex Bayesian models.
In many cases one does not have to program a Gibbs sampler oneself but can use a pre-programmed sampler. We use the sampler JAGS (Just Another Gibbs Sampler) (Plummer, 2003) with the R interface package rjags (Plummer, 2016).
R-Codes 14.4, 14.5 and 14.6 give a short but practical overview of MCMC methods with JAGS in the case of a simple Gaussian likelihood. Luckily, more complex models can easily be constructed based on the approach shown here.
When using MCMC methods, you may encounter situations in which the sampler does not converge (or converges too slowly). In such a case the posterior distribution cannot be approximated with the simulated values. It is therefore important to examine the simulated values for eye-catching patterns. For example, the so-called trace-plot, the observations as a function of the index, as illustrated in the right panel of Figure 14.3, is often used.
Example 14.5. R-Code 14.4 implements the normal-normal model for a single observation, y = 1, n = 1, known variance σ² = 1.1, and a normal prior for the mean µ:

    Y | µ ∼ N(µ, 1.1),
    µ ∼ N(0, 0.8).
The basic approach to use JAGS is to first create a file containing the Bayesian model definition. This file is then transcribed into a model graph (function jags.model()), from which we can finally draw samples (coda.samples()).
Defining a model for JAGS is quite straightforward, as the notation is very close to the one from R. Some care is needed when specifying variance parameters. In our notation, we typically use the variance σ², as in N( · , σ²); in R we have to specify the standard deviation σ as parameter sd in the function dnorm(..., sd=sigma); and in JAGS we have to specify the precision 1/σ² as the second argument, as in dnorm( mu, 1/sigma2), see also LeBauer et al. (2013).
The resulting samples are typically plotted with smoothed densities, as seen in the left panel of Figure 14.3, together with the prior and the likelihood, if possible. The posterior is affected similarly by the likelihood (data) and the prior; the posterior mean is close to the average of the prior mean and the data. More precisely, the prior is slightly tighter as its variance is slightly smaller (0.8 vs. 1.1), thus the posterior mean is slightly closer to the prior mean than to y. The setting here is identical to Example 13.3 and thus the posterior has again a normal distribution, N( 0.8/(0.8 + 1.1), 0.8 · 1.1/(0.8 + 1.1) ), see Equation (13.17). ♣
Figure 14.3: Left: empirical densities: MCMC based posterior (black), exact (red), prior (blue), likelihood (green). Right: trace-plot of the posterior µ | y = 1. (See R-Code 14.4.)
require( rjags)
writeLines("model { # File with Bayesian model definition
y ~ dnorm( mu, 1/1.1) # here Precision = 1/Variance
mu ~ dnorm( 0, 1/0.8) # Precision again!
}", con="jags01.txt") # arbitrary file name
jagsModel <- jags.model( "jags01.txt", data=list( 'y'=1)) # transcription
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 8
##
## Initializing model
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000) # draw samples
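Since the exact posterior is known in this setting, the sampler output can be compared directly with it; a quick sketch based on the objects above:

post <- as.matrix( postSamples)          # the MCMC draws of mu
c( mean( post), 0.8/(0.8+1.1))           # sample mean vs. exact posterior mean
c( var( post), 0.8*1.1/(0.8+1.1))        # sample variance vs. exact posterior variance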
Example 14.6. R-Code 14.5 extends the normal-normal model to n = 10 observations with known variance:

    Y1, . . . , Yn | µ iid∼ N(µ, 1.1),    (14.10)
    µ ∼ N(0, 0.8).    (14.11)
We draw the data in R via rnorm(n, 1, sqrt(1.1)) and proceed similarly as in R-Code 14.4. Figure 14.4 gives the empirical and exact densities of the posterior, prior and likelihood and shows a trace-plot as a basic graphical diagnostic tool. The density of the likelihood is N(ȳ, 1.1/n), the prior density is based on (14.11) and the posterior density is based on (13.17). The latter simplifies considerably because we have η = 0 in (14.11).
As the number of observations increases, the data gets more "weight". From (13.18), the weight increases from 0.8/(0.8 + 1.1) ≈ 0.42 to 0.8n/(0.8n + 1.1) ≈ 0.88. Thus, the posterior is "closer" to the likelihood but slightly more peaked. As both the variance of the data and the variance of the prior are comparable, the prior has an impact on the posterior comparable to that of one additional observation with value zero. ♣
R-Code 14.5: JAGS sampler for the normal-normal model, with n = 10. (See Figure 14.4.)
set.seed( 4)
n <- 10
obs <- rnorm( n, 1, sqrt(1.1)) # generate artificial data
writeLines("model {
for (i in 1:n) { # define a likelihood for each
y[i] ~ dnorm( mu, 1/1.1) # individual observation
}
mu ~ dnorm( 0, 1/0.8)
}", con="jags02.txt")
jagsModel <- jags.model( "jags02.txt", data=list('y'=obs, 'n'=n), quiet=T)
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)
and we have a priori no knowledge of the posterior and cannot compare the empirical posterior
density with a true (bivariate) density (as we had the red densities in Figures 14.3 and 14.4).
R-Code 14.6 implements the following model in JAGS:
    Yi | µ, κ iid∼ N(µ, 1/κ),   i = 1, . . . , n, with n = 10,    (14.12)
    µ ∼ N(η, 1/λ),   with η = 0, λ = 1.25,    (14.13)
    κ ∼ Gam(α, β),   with α = 1, β = 0.2.    (14.14)
For more flexibility with the code, we also pass the hyper-parameters η, λ, α, β to the JAGS
MCMC engine.
Figure 14.5 gives the marginal empirical posterior densities of µ and κ, as well as the priors
(based on (14.13) and (14.14)) and likelihood (based on (14.12)). The posterior is quite data
driven and by the choice of the prior, slightly shrunk towards zero.
Note that the marginal likelihood for µ is N(ȳ, s²/n), i.e., we have replaced the parameters in the model with their unbiased estimates. The marginal likelihood for κ is a gamma distribution based on the parameters n/2 + 1 and ns²/2 = Σ_{i=1}^n (yi − ȳ)²/2, see Problem 13.3.a. ♣
R-Code 14.6 JAGS sampler for priors on mean and precision parameter, with n = 10.
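A minimal sketch of a JAGS implementation of (14.12) to (14.14), with the hyper-parameters passed as data (the file name jags03.txt is arbitrary, and obs and n are reused from R-Code 14.5):

writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, kappa)              # likelihood with precision kappa
  }
  mu ~ dnorm( eta, lambda)                # prior on the mean, precision lambda
  kappa ~ dgamma( alpha, beta)            # prior on the precision
}", con="jags03.txt")
jagsModel <- jags.model( "jags03.txt", quiet=TRUE, data=list( 'y'=obs, 'n'=n,
    'eta'=0, 'lambda'=1.25, 'alpha'=1, 'beta'=0.2))
postSamples <- coda.samples( jagsModel, c('mu', 'kappa'), n.iter=2000)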
This last example is another classical Bayesian setup, and with a very careful specification of the priors, we can construct a closed form posterior density. Problem 14.1 gives a hint towards this more advanced topic.
Note that the approach of writing a model file for jags.model() creates local files that may be cleaned up after the analysis.
where τ , αλ , βλ , ακ and βκ are hyperparameters. The three levels (14.15) to (14.17) are often
referred to as observation level, state or process level and prior level.
Example 14.8. Parasite infections can pose a large economic burden on livestock such as sheep, horses, etc. Infected herds or animals receive an anthelmintic treatment that reduces the infection with parasitic worms. To assess the efficacy of the treatment, the number of parasitic eggs per gram of feces is evaluated. We use the following Poisson model for the pre- and post-treatment counts, denoted as y_i^C and y_i^T:

    Y_i^C | µ_i^C indep∼ Pois(µ_i^C),   Y_i^T | µ_i^C indep∼ Pois(δµ_i^C),   i = 1, . . . , n,    (14.18)
    µ_i^C ∼ Gam(κ, κ/µ),    (14.19)
    δ ∼ U(0, 1),   µ ∼ Gam(1, 0.001),   κ ∼ Gam(1, 0.7),    (14.20)

where we have used directly the numerical values for the hyperparameters (Wang et al., 2017). The parameter δ represents the efficacy of the treatment. Notice that with this parameterization of the gamma distribution for µ_i^C, its mean is µ.
The package eggCounts provides the dataset epgs containing 14 eggs per gram (epg) values in sheep before and after anthelmintic treatment with benzimidazole. The correction factor of the diagnostic technique was 50 (thus, due to the measuring technique, all numbers are multiples of 50; see also Problem 14.4). R-Code 14.7 illustrates the implementation in JAGS. We remove all sheep that did not have any parasites, followed by the setup of the JAGS file with dpois(), dunif() and dgamma() for the distributions of the different layers of the hierarchy. Although the arguments of the gamma distribution are named differently in JAGS, we can pass the same parameters in the same order.
The sampler does not indicate any convergence issues (trace-plots and the empirical pos-
terior densities behave well). The posterior median reduction factor is about 0.077, with a
95% HPD interval [ 0.073, 0.0812 ], virtually identical to the quantile based credible interval
[ 0.0729, 0.0812 ], (TeachingDemos::emp.hpd(postSamples[,"delta"][[1]], conf=0.95) and
summary( postSamples)$quantiles["delta",c(1,5)]). The posterior median epg is reduced
from 1885.8 to 145.
As a sanity check, we can compare the posterior median (or mean values) with corresponding
frequentist estimates. The average epgs before and after the treatment are 2094.4 and 161.1
(colMeans(epgs2)).
To compare the variances we can use the following ad-hoc approach. The (prior) variance of µ_i^C is µ²/κ. Although the posterior distribution of µ_i^C is not gamma anymore, we use the same formula to estimate the variance (here we have very few observations, and for a simple Poisson likelihood a gamma prior is conjugate, see Problem 13.4.b). Based on the posterior medians, we obtain the values 5.326 × 10^6 and 3.15 × 10^4, which are somewhat smaller than the frequentist values 9.007 × 10^6 and 6.861 × 10^4. We should not be too surprised about the difference but rather be assured that we have properly specified and interpreted the parameters of the gamma distribution. ♣
R-Code 14.7: JAGS sampler for epgs data. (See Figure 14.6.)
require(eggCounts)
require(rjags)
data(epgs)
epgs2 <- epgs[rowSums(epgs[,c("before","after")])>0,c("before","after")]
n <- nrow(epgs2)
writeLines("model {
for (i in 1:n) { # define a likelihood for each
yC[i] ~ dpois( muC[i]) # pre-treatment
yT[i] ~ dpois( delta*muC[i]) # post-treatment
muC[i] ~ dgamma( kappa, kappa/mu) # pre-treatment mean
}
delta ~ dunif( 0, 1) # reduction
mu ~ dgamma(1, 0.001) # pre-treatment mean
kappa ~ dgamma(1, 0.7) #
}", con="jagsEggs.txt")
jagsModel <- jags.model( "jagsEggs.txt", # write the model
data=list('yC'=epgs2[,1],'yT'=epgs2[,2], 'n'=n), quiet=T)
postSamples <- coda.samples( jagsModel, # run sampler and monitor all param.
c('mu', 'kappa', 'delta'), n.iter=5000)
summary( postSamples)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## delta 0.077 2.13e-03 3.01e-05 4.10e-05
## kappa 0.684 2.60e-01 3.68e-03 5.21e-03
## mu 1982.866 7.06e+02 9.99e+00 1.52e+01
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## delta 0.0727 7.56e-02 0.077 7.85e-02 8.12e-02
## kappa 0.2892 4.96e-01 0.644 8.29e-01 1.30e+00
## mu 943.9016 1.49e+03 1864.182 2.35e+03 3.71e+03
par(mfcol=c(2,3), mai=c(.6,.6,.1,.1))
plot( postSamples[,"delta"], main="", auto.layout=FALSE, xlab=bquote(delta))
plot( postSamples[,"mu"], main="", auto.layout=FALSE, xlab=bquote(mu))
plot( postSamples[,"kappa"], main="", auto.layout=FALSE, xlab=bquote(kappa))
Figure 14.6: Top row: trace-plots of the parameters δ, µ and κ. Bottom row: empirical posterior densities of the parameters δ, µ and κ. (See R-Code 14.7.)
An alternative to JAGS is BUGS (Bayesian inference Using Gibbs Sampling), which is distributed in two main versions: WinBUGS and OpenBUGS, see also Lunn et al. (2012). Additionally, there is the R interface package R2OpenBUGS (Sturtz et al., 2005). Other possibilities are the Stan or INLA engines, with convenient user interfaces to R through rstan and INLA (Gelman et al., 2015; Rue et al., 2009; Lindgren and Rue, 2015).
a) Create an artificial dataset consisting of Y1, . . . , Yn iid∼ N(1, 1), with n = 20.
b) Write a function called dnormgamma() that calculates the density at (mu, kappa) based on the parameters eta, nu, alpha, beta. Visualize the bivariate density based on η = 1, ν = 1.5, α = 1, and β = 0.2.
c) Set up a Gibbs sampler for the following values: η = 0, ν = 1.5, α = 1, and β = 0.2. For a sample of length 2000, illustrate the (empirical) joint posterior density of µ, κ | y1, . . . , yn.
Problem 14.2 (Monte Carlo integration) Estimate the volume of the unit ball in d = 2, 3, . . . , 10
dimensions and compare it to the exact value π d/2 /Γ(d/2 + 1). What do you notice?
Problem 14.3 (Rejection sampling) A random variable X has a Laplace distribution with parameters µ ∈ R and λ > 0 if its density is of the form

    fX(x) = 1/(2λ) exp( −|x − µ|/λ ).

a) Draw 100 realizations of a Laplace distribution with parameters µ = 1 and λ = 1 with a rejection sampling approach.
Problem 14.4 (Anthelmintic model) In the process of determining the parasitic load, a fecal sample is taken and thoroughly mixed after dilution. We assume that the eggs are homogeneously distributed within each sample. A proportion p = 1/f of the diluted sample is then counted. Denote the raw number of eggs in the diluted sample of the ith control animal as Y_i^{*C}, with i = 1, 2, . . . , n. Given the true number of eggs per gram of feces Y_i^C, the raw count Y_i^{*C} follows a binomial distribution Bin(Y_i^C, p). This captures both the dilution and the counting variability. For the true epg counts Y_i^C we use the same model as in Example 14.8. A similar approach is used for the observations after the treatment.
a) Implement the model in JAGS and compare the results with those of the simple JAGS sampler of Example 14.8.
b) The package eggCounts provides samplers for the specific model discussed here as well as
further extensions. Interpret the output of
and compare the results with those of a). (See also the vignette of the package eggCounts.)
Epilogue
After all that, are we done yet? No, of course not; the educational iteration requires further refinement. Books of virtually arbitrary length could be written on all the topics above. However, I often feel that too many iterations do not advance my capability to help statistically in a proportional manner. Being a good statistician does not only require solid knowledge in statistics, but also domain specific knowledge, the skills to listen and talk to experts, and having fun stepping outside one's own comfort zone into a foreign backyard (in the sense of John Tukey's quote "The best thing about being a statistician is that you get to play in everyone's backyard.").
For this specific document, I am done here, up to some standard appendices and indices.
Appendix A
Software Environment R
R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It compiles and runs on a wide variety of operating systems (Windows, Mac, and Linux); its central entry point is https://www.r-project.org.
The R software can be downloaded from CRAN (Comprehensive R Archive Network), https://cran.r-project.org, a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R. Figure A.1 shows a screenshot of the web page.
R is console based, which means that individual commands have to be typed. It is very important to save these commands to construct a reproducible workflow – this is the big advantage over a "click-and-go" approach. We strongly recommend using a graphical, integrated development environment (IDE) for R. The prime choice these days is RStudio. RStudio includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management, see Figure A.2.
RStudio is available as a desktop open source version for many different operating systems (Windows, Mac, and Linux) or in a browser connected to an RStudio Server. There are several providers of such servers, including https://rstudio.math.uzh.ch for the students of the STA120 lecture.
Figure A.2: RStudio screenshot. The four panels shown are (clock-wise starting top
left): (i) console, (ii) plots, (iii) environment, (iv) script.
The installation of all software components is quite straightforward, but the look of the download page may change from time to time and the precise steps may vary a bit. Some examples are given by the attached videos.
The biggest advantage of using R is the support from and for a huge user community. A seemingly endless number of packages cover almost every statistical task, often implemented by several authors. The packages are documented and, through the upload to CRAN, held to a minimum level of documentation, coding standards, (unit) testing, etc. There are several forums (e.g., the R mailing lists, Stack Overflow with tag "r") to get additional help, see https://www.r-project.org/help.html.
In this document we tried to keep the level of R quite low and to rely on only a few packages; examples are MASS, vioplot, mvtnorm, ellipse and some more. Due to complex dependencies, more than the actually loaded packages are used. The following R-Code output shows all packages (and their version numbers) used to compile this document (not including any packages required for the problems).
Appendix B

Calculus
In this chapter we present some of the most important ideas and concepts of calculus. It is impossible to give a formal, mathematically precise exposition, and certain topics are omitted; for example, we will not discuss sequences and series. Further, we cannot present all rules, identities, guidelines or even tricks.
B.1 Functions
We start with one of the most basic concepts, a formal definition that describes a relation between
two sets.
Definition B.1. A function f from a set D to a set W is a rule that assigns a unique element f(x) ∈ W to each element x ∈ D. We write

    f : D → W    (B.1)
    x ↦ f(x).    (B.2)

The set D is called the domain, the set W is called the range (or target set or codomain). The graph of a function f is the set {(x, f(x)) : x ∈ D}.
♢
The function will not necessarily map to every element in W , and there may be several
elements in D with the same image in W . These functions are characterized as follows.
Definition B.2. 1. A function f is called injective, if the image of two different elements in
D is different.
2. A function f is called surjective, if for every element y in W there is at least one element
x in D such that y = f (x).
3. A function f is called bijective if it is surjective and injective. Such a function is also called
a one-to-one function. ♢
In general, there is virtually no restriction on the domain and codomain. However, we often work with real functions, i.e., D ⊂ R and W ⊂ R.
There are many different characterizations of functions. Some relevant ones are as follows. A function f is called
1. periodic if there exists an ω > 0 such that f(x + ω) = f(x) for all x ∈ D; the smallest such ω is called the period of f;
2. increasing if f(x) ≤ f(x + h) for all h ≥ 0. In case of strict inequalities, we call the function strictly increasing. Similar definitions hold when reversing the inequalities. ♢
For a bijective function f, the inverse function is

    f⁻¹ : W → D    (B.3)
    y ↦ f⁻¹(y),   such that y = f(f⁻¹(y)).
To capture the behavior of a function locally, say at a point x0 ∈ D, we use the concept of a
limit.
The latter definition does not assume that the function is defined at x0 .
It is possible to define "directional" limits, in the sense that x approaches x0 from above (from the right side) or from below (from the left side). These limits are denoted with

    lim_{x→x0⁺}, i.e., lim_{x↘x0}, for the former;   or   lim_{x→x0⁻}, i.e., lim_{x↗x0}, for the latter.    (B.4)
We are used to interpreting graphs, and when we sketch an arbitrary function we often use a single, continuous line. This concept of not lifting the pen while sketching is formalized as follows and linked directly to limits, introduced above.
There are many other approaches to define continuity, for example in terms of neighborhoods or in terms of limits of sequences.
Another very important (local) characterization of a function is the derivative, which quan-
tifies the (infinitesimal) rate of change.
Definition B.6. The derivative of a function f(x) with respect to the variable x at the point x0 is defined by

    f′(x0) = lim_{h→0} ( f(x0 + h) − f(x0) ) / h,    (B.6)

provided the limit exists. We also write df(x0)/dx = f′(x0).
If the derivative exists for all x0 ∈ D, the function f is differentiable. ♢
The integral of a (positive) function quantifies the area between the function and the x-axis. A mathematical definition is a bit more complicated.
Definition B.7. Let f(x) : D → R be a function and [a, b] ⊂ D a finite interval such that |f(x)| < ∞ for x ∈ [a, b]. For any n, let t0 = a < t1 < · · · < tn = b be a partition of [a, b]. The integral of f from a to b is defined as

    ∫_a^b f(x) dx = lim_{n→∞} Σ_{i=1}^n f(ti)(ti − ti−1).    (B.7)
For non-finite a and b, the definition of the integral can be extended via limits.
Property B.2. (Fundamental theorem of calculus (I)). Let f : [a, b] → R be continuous. For all x ∈ [a, b], let F(x) = ∫_a^x f(u) du. Then F is continuous on [a, b], differentiable on (a, b) and F′(x) = f(x), for all x ∈ (a, b).
The function F is often called the antiderivative of f . There exists a second form of the
previous theorem that does not assume continuity of f but only Riemann integrability, that
means that an integral exists.
Property B.3. (Fundamental theorem of calculus (II)). Let f : [a, b] → R and let F be such that F′(x) = f(x), for all x ∈ (a, b). If f is Riemann integrable, then ∫_a^b f(u) du = F(b) − F(a).
There are many ‘rules’ to calculate integrals. One of the most used ones is called integration
by substitution and is as follows.
Property B.4. Let I be an interval and φ : [a, b] → I be a differentiable function with integrable derivative. Let f : I → R be a continuous function. Then

    ∫_{φ(a)}^{φ(b)} f(u) du = ∫_a^b f(φ(x)) φ′(x) dx.    (B.8)
We denote with Rᵐ the vector space with elements x = (x1, . . . , xm)⊤, called vectors, equipped with the standard operations. We will discuss vectors and vector notation in more detail in the subsequent chapter.
A natural extension of a real function is as follows. The set D is a subset of Rᵐ and thus we write

    f : D ⊂ Rᵐ → W    (B.9)
    x ↦ f(x).
(provided it exists). ♢
Remark B.1. The existence of partial derivatives is not sufficient for the differentiability of the
function f . ♣
In a similar fashion, higher order derivatives can be calculated. For example, taking the derivative of each component of (B.11) with respect to all components yields a matrix with components

    f″(x) = ( ∂²f(x) / ∂xi∂xj ),    (B.12)
Property B.5. Let f : D → R have sufficiently many continuous derivatives. Then there exists ξ ∈ [a, x] such that

    f(x) = f(a) + f′(a)(x − a) + (1/2) f″(a)(x − a)² + . . . + (1/m!) f⁽ᵐ⁾(a)(x − a)ᵐ + (1/(m + 1)!) f⁽ᵐ⁺¹⁾(ξ)(x − a)ᵐ⁺¹.    (B.13)
We call (B.13) Taylor's formula and the last term, often denoted by Rm(x), the remainder of order m. Taylor's formula is an extension of the mean value theorem.
If the function has bounded derivatives, the remainder Rm(x) converges to zero as x → a. Hence, if the function is at least twice differentiable in a neighborhood of a, then

    f(a) + f′(a)(x − a) + (1/2) f″(a)(x − a)²    (B.14)

is the best quadratic approximation in this neighborhood.
Taylor's formula can be expressed for multivariate real functions. Without stating the precise assumptions, we consider here the following example:

    f(a + h) = Σ_{r=0}^∞ Σ_{i : i1+···+in=r} 1/(i1! i2! · · · in!) · ∂ʳf(a)/(∂x1^{i1} · · · ∂xn^{in}) · h1^{i1} h2^{i2} · · · hn^{in}.    (B.16)
Appendix C

Linear Algebra

In this chapter we cover the most important aspects of linear algebra, mainly of a notational nature.
The n × n identity matrix I is defined as the matrix with ones on the diagonal and zeros elsewhere. We denote the vector consisting solely of ones with 1; similarly, 0 is a vector with only zero elements. A matrix with entries d1, . . . , dn on the diagonal and zeros elsewhere is denoted with diag(d1, . . . , dn), or diag(di) for short, and is called a diagonal matrix. Hence, I = diag(1).
To indicate the ith-jth element of A, we use (A)ij. The transpose of a vector or a matrix flips its dimensions. When a matrix is transposed, i.e., when all rows of the matrix are turned into columns (and vice-versa), the elements aij and aji are exchanged. Thus (A⊤)ij = (A)ji. The vector x⊤ = (x1, . . . , xp) is termed a row vector. We work mainly with column vectors, as shown in (C.1).
In the classical setting of real numbers, there is only one type of multiplication. As soon
as we have several dimensions, several different types of multiplications exist, notably scalar
multiplication, matrix multiplication and inner product (and actually more such as the vector
product, outer product).
Let A and B be two n × p and p × m matrices. Matrix multiplication AB is defined as

    AB = C   with   (C)ij = Σ_{k=1}^p aik bkj.    (C.2)
This last equation shows that the matrix I is the neutral element (or identity element) of the
matrix multiplication.
Definition C.1. The inner product between two p-vectors x and y is defined as x⊤y = Σ_{i=1}^p xi yi. There are several different notations used: x⊤y = ⟨x, y⟩ = x · y.
If, for a square matrix A, there exists a matrix B such that

    AB = BA = I,    (C.3)

then the matrix B is uniquely determined by A and is called the inverse of A, denoted by A⁻¹.
Definition C.2. A vector space over R is a set V with the following two operations:
1. + : V × V → V (vector addition)
2. · : R × V → V (scalar multiplication). ♢
Typically, V is Rp , p ∈ N.
In the following we assume a fixed d and the usual operations on the vectors.
Definition C.3. 1. The vectors v1, . . . , vk are linearly dependent if there exist scalars a1, . . . , ak (not all equal to zero) such that a1v1 + · · · + akvk = 0.
In a set of linearly dependent vectors, at least one vector can be expressed as a linear combination of the others.
Definition C.4. The set of vectors {b1, . . . , bd} is a basis of a vector space V if the set is linearly independent and any other vector v ∈ V can be expressed as v = v1b1 + · · · + vdbd. ♢
2. All bases of V have the same cardinality, which is called the dimension of V, dim(V).
3. If there are two bases {b1, . . . , bd} and {e1, . . . , ed}, then there exists a d × d matrix A such that ei = Abi, for all i.
Definition C.6. Let A be a n × m matrix. The column rank of the matrix is the dimension
of the subspace that the m columns of A span and is denoted by rank(A). A matrix is said to
have full rank if rank(A) = m.
The row rank is the column rank of A⊤ . ♢
3. rank(A) ≤ dim(V ).
C.3 Projections
We consider classical Euclidean vector spaces with elements x = (x1, . . . , xp)⊤ ∈ Rᵖ and Euclidean norm ||x|| = (Σ_i xi²)^{1/2}.
To illustrate projections, consider the setup illustrated in Figure C.1, where y and a are two
vectors in R2 . The subspace spanned by a is
where the second expression is based on a normalized vector a/||a||. By the (geometric) definition
of the inner product (dot product), a⊤y = ||a|| · ||y|| · cos(θ), where θ is the angle between the vectors. Classical trigonometric properties state that the length of the projection is ||y|| cos(θ). Hence, the projected vector is

    (a/||a||) (a⊤/||a||) y = a(a⊤a)⁻¹a⊤y.    (C.6)
In statistics we often encounter expressions like this last term. For example, ordinary least squares ("classical" multiple regression) is a projection of the vector y onto the column space of X, i.e., the space spanned by the columns of the matrix X. The projection is X(X⊤X)⁻¹X⊤y. Usually, the column space has a much lower dimension than the space in which y lives.
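A small numerical sketch of this projection with artificial data (solve() and crossprod() are used for clarity rather than efficiency):

set.seed( 1)
X <- cbind( 1, rnorm( 10))                      # design matrix with intercept
y <- 2 + 3*X[,2] + rnorm( 10)                   # artificial response
H <- X %*% solve( crossprod( X)) %*% t( X)      # hat matrix X(X'X)^{-1}X'
range( H %*% y - fitted( lm( y ~ X[,2])))       # projection equals the OLS fitted values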
Remark C.1. Projection matrices (like H = X(X⊤ X)−1 X⊤ ) have many nice properties such
as being symmetric, being idempotent, i.e., H = HH, having eigenvalues within [0, 1], (see next
section), rank(H) = rank(X), etc. ♣
Ax = λx , (C.7)
We often denote the set of eigenvectors with γ 1 , . . . , γ n . Let Γ be the matrix with columns
γ i , i.e., Γ = (γ 1 , . . . , γ n ). Then
Γ⊤ AΓ = diag(λ1 , . . . , λn ), (C.8)
due to the orthogonality property of the eigenvectors Γ⊤ Γ = I. This last identity also implies
that A = Γ diag(λ1 , . . . , λn )Γ⊤ .
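For a symmetric matrix this can be verified numerically; eigen() returns orthonormal eigenvectors for symmetric input (the 2 × 2 matrix below is an arbitrary example):

A <- matrix( c(2, 1, 1, 3), 2)                      # a symmetric 2 x 2 matrix
e <- eigen( A)
Gamma <- e$vectors                                  # columns are the eigenvectors
range( t( Gamma) %*% A %*% Gamma - diag( e$values)) # (C.8), zero up to rounding error
range( Gamma %*% diag( e$values) %*% t( Gamma) - A) # reconstruction of A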
B = UDV⊤ (C.9)
Besides the SVD there are many other matrix factorizations. We often use the so-called Cholesky factorization as, to a certain degree, it generalizes the concept of a square root to matrices. Assume that all eigenvalues of A are strictly positive; then there exists a unique lower triangular matrix L with positive entries on the diagonal such that A = LL⊤. There exist very efficient algorithms to calculate L, and solving large linear systems is often based on a Cholesky factorization.
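A quick numerical sketch (chol() returns the upper triangular factor, so the lower triangular L is its transpose; the matrix A below is an arbitrary symmetric positive definite example):

set.seed( 1)
A <- crossprod( matrix( rnorm( 9), 3)) + diag( 3)   # symmetric positive definite matrix
L <- t( chol( A))                                   # lower triangular Cholesky factor
range( L %*% t( L) - A)                             # zero up to rounding error
b <- c( 1, 2, 3)
range( backsolve( chol( A), forwardsolve( L, b)) - solve( A, b))   # solving Ax = b via L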
The determinant of a square matrix essentially describes the change in "volume" that the associated linear transformation induces. The formal definition is quite complex, but for matrices with real eigenvalues it can be written as det(A) = Π_{i=1}^n λi.
7. A−1 is spd
For a non-singular matrix A, written as a 2 × 2 block matrix (with square matrices A11 and A22), we have

    A⁻¹ = [ A11  A12 ]⁻¹ = [ A11⁻¹ + A11⁻¹ A12 C A21 A11⁻¹    −A11⁻¹ A12 C ]
          [ A21  A22 ]      [ −C A21 A11⁻¹                      C          ]    (C.13)

with C = (A22 − A21 A11⁻¹ A12)⁻¹.

References
Agresti, A. (2007). An introduction to categorical data analysis. Wiley, New York, second edition.
Ahrens, J. H. and Dieter, U. (1972). Computer methods for sampling from the exponential and
normal distributions. Communications of the ACM, 15, 873–882.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In Petrov, B. and Csaki, F., editors, 2nd International Symposium on Information Theory,
267–281. Akadémiai Kiadó.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons Inc.,
Chichester.
Bland, J. M. and Bland, D. G. (1994). Statistics notes: One and two sided tests of significance.
BMJ, 309, 248.
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-building and Response Surfaces. Wiley.
Brown, L. D., Cai, T. T., and DasGupta, A. (2002). Confidence intervals for a binomial propor-
tion and asymptotic expansions. The Annals of Statistics, 30, 160–201.
Canal, L. (2005). A normal approximation for the chi-square distribution. Computational Statis-
tics & Data Analysis, 48, 803 – 808.
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge.
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Pre-
challenger prediction of failure. Journal of the American Statistical Association, 84, 945–957.
Devore, J. L. (2011). Probability and Statistics for Engineering and the Sciences. Brooks/Cole,
8th edition.
Edgington, E. and Onghena, P. (2007). Randomization Tests. CRC Press, 4th edition.
Ellis, T. H. N., Hofer, J. M. I., Swain, M. T., and van Dijk, P. J. (2019). Mendel’s pea crosses:
varieties, traits and statistics. Hereditas, 156, 33.
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression: Modelle, Methoden und Anwendungen.
Springer, 2 edition.
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods and
Applications. Springer.
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects
and Nonparametric Regression Models. CRC Press.
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention
to the false discovery proportion. Statistical Methods in Medical Research, 17, 347–388.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications: Volume I. Number
Bd. 1 in Wiley series in probability and mathematical statistics. John Wiley & Sons.
Fisher, R. A. (1938). Presidential address. Sankhyā: The Indian Journal of Statistics, 4, 14–17.
Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers, C-23, 881–890.
Furrer, R. and Genton, M. G. (1999). Robust spatial data analysis of lake Geneva sediments
with S+SpatialStats. Systems Research and Information Science, 8, 257–272.
Galarraga, V. and Boffetta, P. (2016). Coffee drinking and risk of lung cancer–a meta-analysis.
Cancer Epidemiology, Biomarkers & Prevention, 25, 951–957.
Gelman, A., Lee, D., and Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences.
Statistical Science, 7, 457–511.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and
Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to
misinterpretations. European Journal of Epidemiology, 31, 337–350.
Hampel, F. R., Ronchetti, E., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust statistics: the
approach based on influence functions. Wiley New York.
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Springer, Heidelberg.
Hernán, M. A., Alonso, A., and Logroscino, G. (2008). Cigarette smoking and dementia: poten-
tial selection bias in the elderly. Epidemiology, 19, 448–450.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. John Wiley & Sons.
Hombach, M., Ochoa, C., Maurer, F. P., Pfiffner, T., Böttger, E. C., and Furrer, R. (2016).
Relative contribution of biological variation and technical variables to zone diameter variations
of disc diffusion susceptibility testing. Journal of Antimicrobial Chemotherapy, 71, 141–151.
Horst, A. M., Hill, A. P., and Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago
(Antarctica) penguin data. R package version 0.1.0.
Hüsler, J. and Zimmermann, H. (2010). Statistische Prinzipien für medizinische Projekte. Huber,
5 edition.
Jeffreys, H. (1983). Theory of probability. The Clarendon Press Oxford University Press, third
edition.
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions. Wiley-
Interscience, 3rd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distributions,
Vol. 1. Wiley-Interscience, 2nd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions,
Vol. 2. Wiley-Interscience, 2nd edition.
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic
Press, first edition.
Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan.
Academic Press/Elsevier, second edition.
Kupper, T., De Alencastro, L., Gatsigazi, R., Furrer, R., Grandjean, D., and Tarradellas, J. (2008). Concentrations and specific loads of brominated flame retardants in sewage sludge. Chemosphere, 71, 1173–1180.
Landesman, R., Aguero, O., Wilson, K., LaRussa, R., Campbell, W., and Penaloza, O. (1965).
The prophylactic use of chlorthalidone, a sulfonamide diuretic, in pregnancy. J. Obstet. Gy-
naecol., 72, 1004–1010.
LeBauer, D. S., Dietze, M. C., and Bolker, B. M. (2013). Translating Probability Density
Functions: From R to BUGS and Back Again. The R Journal, 5, 207–209.
Lindgren, F. and Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical
Software, 63, i19.
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2012). The BUGS Book:
A Practical Introduction to Bayesian Analysis. Texts in Statistical Science. Chapman &
Hall/CRC.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. Academic Press.
McGill, R., Tukey, J. W., and Larsen, W. A. (1978). Variations of box plots. The American
Statistician, 32, 12–16.
Modigliani, F. (1966). The life cycle hypothesis of saving, the demand for wealth and the supply
of capital. Social Research, 33, 160–217.
Moyé, L. A. and Tita, A. T. (2002). Defending the rationale for the two-tailed test in clinical
research. Circulation, 105, 3062–3065.
Olea, R. A. (1991). Geostatistical Glossary and Multilingual Dictionary. Oxford University Press.
Pachel, C. and Neilson, J. (2010). Comparison of feline water consumption between still and
flowing water sources: A pilot study. Journal of Veterinary Behavior, 5, 130–133.
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 2008-11-14, http://matrixcookbook.com.
Plagellat, C., Kupper, T., Furrer, R., de Alencastro, L. F., Grandjean, D., and Tarradellas, J.
(2006). Concentrations and specific loads of UV filters in sewage sludge originating from a
monitoring network in Switzerland. Chemosphere, 62, 915–925.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria.
Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version 4-6.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Raftery, A. E. and Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies
for Markov chain Monte Carlo. Statistical Science, 7, 493–497.
Rice, J. A. (2006). Mathematical Statistics and Data Analysis. Belmont, CA: Duxbury Press.,
third edition.
Rosenthal, R. and Fode, K. L. (1963). The effect of experimenter bias on the performance of the
albino rat. Behavioral Science, 8, 183–189.
Ross, S. M. (2010). A First Course in Probability. Pearson Prentice Hall, 8th edition.
Ruchti, S., Kratzer, G., Furrer, R., Hartnack, S., Würbel, H., and Gebhardt-Henrich, S. G. (2019). Progression and risk factors of pododermatitis in part-time group housed rabbit does in Switzerland. Preventive Veterinary Medicine, 166, 56–64.
Ruchti, S., Meier, A. R., Würbel, H., Kratzer, G., Gebhardt-Henrich, S. G., and Hartnack, S. (2018). Pododermatitis in group housed rabbit does in Switzerland: prevalence, severity and risk factors. Preventive Veterinary Medicine, 158, 114–121.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian
models by using integrated nested Laplace approximations. Journal of the Royal Statistical
Society B, 71, 319–392.
Siegel, S. and Castellan Jr, N. J. (1988). Nonparametric Statistics for The Behavioral Sciences.
McGraw-Hill, 2nd edition.
Snee, R. D. (1974). Graphical display of two-way contingency tables. The American Statistician,
28, 9–12.
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from
R. Journal of Statistical Software, 12, 1–16.
Swayne, D. F., Temple Lang, D., Buja, A., and Cook, D. (2003). GGobi: evolving from XGobi
into an extensible framework for interactive data visualization. Computational Statistics &
Data Analysis, 43, 423–444.
Tufte, E. R. (1997a). Visual and Statistical Thinking: Displays of Evidence for Making Decisions.
Graphics Press.
Tufte, E. R. (1997b). Visual Explanations: Images and Quantities, Evidence and Narrative.
Graphics Press.
Wang, C., Torgerson, P. R., Höglund, J., and Furrer, R. (2017). Zero-inflated hierarchical models
for faecal egg counts to assess anthelmintic efficacy. Veterinary Parasitology, 235, 20–28.
Wasserstein, R. L. and Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.
Glossary
:= Defines the left-hand side by the expression on the right-hand side.
♣, ♢ End of example, end of definition, end of remark.
∫, Σ, Π Integration, summation and product symbols. If there is no ambiguity, we omit the corresponding limits.
The following table contains the abbreviations of the statistical distributions (dof denotes degrees
of freedom).
U(a, b) Uniform distribution over the support [a, b], −∞ < a < b < ∞.
X2ν , χ2ν,p Chi-squared distribution with ν dof, p-quantile thereof.
Tn , tn,p Student’s t-distribution with n dof, p-quantile thereof.
Fm,n , fm,n,p F -distribution with m and n dof, p-quantile thereof.
Ucrit(nx, ny; 1−α) 1−α-quantile of the distribution of the Wilcoxon rank sum statistic.
Wcrit(n⋆; 1−α) 1−α-quantile of the distribution of the Wilcoxon signed rank statistic.
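In R, these quantiles are available through the corresponding q-functions. The following lines are a minimal sketch: the degrees of freedom, sample sizes and level are illustrative only, and R parametrizes the Wilcoxon rank sum statistic as the Mann–Whitney count of pairs, which may be shifted relative to the rank sum convention used here.

alpha <- 0.05                          # illustrative level, not prescribed by the text
qunif(1 - alpha, min = 0, max = 1)     # quantile of U(a, b) with a = 0, b = 1
qchisq(1 - alpha, df = 9)              # chi-squared quantile with nu = 9 dof
qt(1 - alpha, df = 9)                  # Student's t quantile with n = 9 dof
qf(1 - alpha, df1 = 3, df2 = 9)        # F quantile with m = 3 and n = 9 dof
qwilcox(1 - alpha, m = 10, n = 12)     # Wilcoxon rank sum quantile (Mann-Whitney convention)
qsignrank(1 - alpha, n = 10)           # Wilcoxon signed rank quantile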
The following table contains the abbreviations of the statistical methods, properties and quality
measures.
Video Index
The following index gives a short description of the available videos, including a link to the
referenced page. The videos are uploaded to https://tube.switch.ch/.
Chapter 0
What are all these videos about?, vi
Chapter 8
Construction of general multivariate normal variables, 153
Important comment about an important equation, 154
Proof that the correlation is bounded, 148
Properties of expectation and variance in the setting of random vectors, 148
Chapter 9
Two classical estimators and estimates for random vectors, 161
Chapter A
Installing RStudio, 268
Index of Terms
Estimation, 64
Estimator, 64
Euler diagrams, 28
Event, 28
Evidence-based medicine, 225
Expectation, 37, 147
Experimental unit, 217
Explanatory variables, 178
Exploratory data analysis, 2
False discovery rate, 137
Family-wise error rate, 136
Fisher transformation, 159
Fixed effects model, 243
HARKing, 101
Hat matrix, 179
Higher moments, 38
Histogram, 9
Immortal time, 223
Improper prior, 241
Independence, 46
Independent variable, 165
Independent variables, 178
Indicator function, 57
Informative prior, 241
Interquartile range, 7
Interval scale, 5
Intervention, 224
Inverse transform sampling, 55
Jensen's inequality, 56
Joint pdf, 144
Law of large numbers, 53
Left-skewed, 9
Level, 70
Location, 7
Location parameter, 55
Lower quartile, 7
Mean squared error, 68, 167, 179
Median, 37
Memoryless, 44
Minimal variance unbiased estimators, 68
Mixed effects model, 243
Model
  Bayesian hierarchical, 259
  fixed effects, 243
  mixed effects, 243
Mosaic plot, 16
Multiple linear regression, 178
Observation, 4
Observational study, 224
Occam's razor, 186
One-sided test, 86
Ordinal scale, 5
Ordinary least squares, 167
Outlier, 8
p-hacking, 101
Parallel coordinate plots, 18
Pearson correlation coefficient, 158
Performance bias, 224
Placebo, 224
Plot, 9
Point estimate, 64
Poisson distribution, 33
Post-hoc tests, 135
Posterior odds, 239
Posterior predictive distribution, 237
Power, 88
Predicted values, 167
Prediction, 155
Predictor, 165, 178
Prior
  improper, 241
  informative, 241
  uninformative, 241
  weakly informative, 241
Prior odds, 239
Probability measure, 28
Projection pursuit, 20
QQ-plot, 12
Qualitative data, 5
Quantile function, 36
Quantitative data, 5
R-squared, 170
Randomized complete block design, 221
Range, 7
Rank, 127
Ratio scale, 5
Rejection region, 87
Relative errors, 53
Residual standard error, 167, 179
Residuals, 167
Response units, 217
Right-skewed, 9
Robust estimators, 124
Rule of succession, 247
Sample mode, 8
Sample size, 7
Sample space, 28
Sample standard deviation, 7
Sample variance, 7
Scale parameter, 55
Second moment, 38
Selection bias, 223
Significance level, 87
Simple hypothesis, 86
Simple linear regression, 165
Simple randomization, 221
Size, 87
Split plot design, 223
Spread, 7
Standard error, 71
Statistic, 7, 64
Statistical model, 61
Studentized range, 7
Taylor series, 56
Trace-plot, 255
Trimmed mean, 7
Two-sided test, 86
Unbiased, 67
Uniformly most powerful, 90
Unimodal, 9
Uninformative prior, 241
Upper quartile, 7
Variables, 4
Variance, 38
Variance–covariance, 147
Violin plot, 12
Weakly informative prior, 241
Welch's two sample t-test, 95
z-transformation, 39