0% found this document useful (0 votes)

261 views263 pages

Book IntroStatistics PDF

This document provides an introduction to statistics and probability concepts for students at the Technical University of Denmark (DTU). It covers descriptive statistics, probability, random variables, probability distributions, statistical inference for one and two samples, and power analysis. R is introduced as a tool for statistical analysis and visualization. The document contains examples, exercises, and serves as a textbook for a statistics course at DTU.

Uploaded by

Erick Saldaña Villa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

261 views263 pages

Book IntroStatistics PDF

Uploaded by

Erick Saldaña Villa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 263

Introduction to Statistics

at DTU
Per B. Brockhoff, Jan K. Møller, Elisabeth W. Andersen
Peder Bacher, Lasse E. Christiansen

2017 Spring

n=1 n=5 n = 30
X ∼ N (0, 1)
Density
Density

Density

X ∼ U (0, 1)
x̄ x̄ x̄
Density
Density

Density

X ∼ Exp(1) X̄ ∼ N µ, √σn
x̄ x̄ x̄
Density
Density

Density

x̄ x̄ x̄
x̄ x̄ x̄
Chapter 0 CONTENTS 1

Contents

1 Introduction, descriptive statistics, R and data visualization 2

1.1 What is Statistics - a primer . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Statistics at DTU Compute . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Statistics - why, what, how? . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Measures of centrality . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Measures of variability . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Measures of relation: correlation and covariance . . . . . . 17
1.5 Introduction to R and RStudio . . . . . . . . . . . . . . . . . . . . . 21
1.5.1 Console and scripts . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.2 Assignments and vectors . . . . . . . . . . . . . . . . . . . 22
1.5.3 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . 23
1.5.4 Use of R in the course and at the exam . . . . . . . . . . . . 25
1.6 Plotting, graphics - data visualisation . . . . . . . . . . . . . . . . 27
1.6.1 Frequency distributions and the histogram . . . . . . . . . 27
1.6.2 Cumulative distributions . . . . . . . . . . . . . . . . . . . 29
1.6.3 The box plot and the modified box plot . . . . . . . . . . . 30
1.6.4 The Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.6.5 Bar plots and Pie charts . . . . . . . . . . . . . . . . . . . . 37
1.6.6 More plots in R? . . . . . . . . . . . . . . . . . . . . . . . . 39
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2 Probability and simulation 43

2.1 Random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . 46
2.2.1 Introduction to simulation . . . . . . . . . . . . . . . . . . . 49
2.2.2 Mean and variance . . . . . . . . . . . . . . . . . . . . . . . 52
2.3 Discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.3.1 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . 59
2.3.2 Hypergeometric distribution . . . . . . . . . . . . . . . . . 62
2.3.3 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . 64
2.4 Continuous random variables . . . . . . . . . . . . . . . . . . . . . 68
2.4.1 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . 70
2.5 Continuous distributions . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5.1 Uniform distribution . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 0 CONTENTS 2

2.5.2 Normal distribution . . . . . . . . . . . . . . . . . . . . . . 72

2.5.3 Log-Normal distribution . . . . . . . . . . . . . . . . . . . . 79
2.5.4 Exponential distribution . . . . . . . . . . . . . . . . . . . . 79
2.6 Simulation of random variables . . . . . . . . . . . . . . . . . . . . 83
2.7 Identities for the mean and variance . . . . . . . . . . . . . . . . . 86
2.8 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . 89
2.9 Independence of random variables . . . . . . . . . . . . . . . . . . 92
2.10 Functions of normal random variables . . . . . . . . . . . . . . . . 97
2.10.1 The χ2 -distribution . . . . . . . . . . . . . . . . . . . . . . . 98
2.10.2 The t-distribution . . . . . . . . . . . . . . . . . . . . . . . . 103
2.10.3 The F-distribution . . . . . . . . . . . . . . . . . . . . . . . 109
2.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

3 Statistics for one and two samples 123

3.1 Learning from one-sample quantitative data . . . . . . . . . . . . 123
3.1.1 Distribution of the sample mean . . . . . . . . . . . . . . . 125
3.1.2 Quantifying the precision of the mean - the confidence in-
terval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.1.3 The language of statistics and the process of learning from
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.1.4 When we cannot assume a normal distribution: the Cen-
tral Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . 135
3.1.5 Repeated sampling interpretation of confidence intervals . 138
3.1.6 Confidence interval for the variance . . . . . . . . . . . . . 139
3.1.7 Hypothesis testing, evidence, significance and the p-value 143
3.1.8 Assumptions and how to check them . . . . . . . . . . . . 158
3.1.9 Transformation towards normality . . . . . . . . . . . . . . 164
3.2 Learning from two-sample quantitative data . . . . . . . . . . . . 168
3.2.1 Comparing two independent means - Confidence Interval 169
3.2.2 Comparing two independent means - hypothesis test . . . 170
3.2.3 The paired design and analysis . . . . . . . . . . . . . . . . 180
3.2.4 Validation of assumptions with normality investigations . 184
3.3 Planning a study: wanted precision and power . . . . . . . . . . . 185
3.3.1 Sample Size for wanted precision . . . . . . . . . . . . . . . 185
3.3.2 Sample size and statistical power . . . . . . . . . . . . . . . 186
3.3.3 Power/Sample size in two-sample setup . . . . . . . . . . 191
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

4 Simulation Based Statistics 203

4.1 Probability and Simulation . . . . . . . . . . . . . . . . . . . . . . . 203
4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.1.2 Simulation as a general computational tool . . . . . . . . . 205
4.1.3 Propagation of error . . . . . . . . . . . . . . . . . . . . . . 207
4.2 The parametric bootstrap . . . . . . . . . . . . . . . . . . . . . . . 211
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.2.2 One-sample confidence interval for µ . . . . . . . . . . . . 212
Chapter 0 CONTENTS 3

4.2.3 One-sample confidence interval for any feature assuming

any distribution . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.2.4 Two-sample confidence intervals assuming any distribu-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.3 The non-parametric bootstrap . . . . . . . . . . . . . . . . . . . . . 223
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.3.2 One-sample confidence interval for µ . . . . . . . . . . . . 223
4.3.3 One-sample confidence interval for any feature . . . . . . 225
4.3.4 Two-sample confidence intervals . . . . . . . . . . . . . . . 226
4.4 OPTIONAL: Bootstrapping – a further perspective . . . . . . . . . 230
4.4.1 Non-parametric bootstrapping with the boot-package . . 231
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

Glossaries 245

Acronyms 246

A Collection of formulas and R commands 247

A.1 Introduction, descriptive statistics, R and data visualization . . . 247
A.2 Probability and Simulation . . . . . . . . . . . . . . . . . . . . . . . 249
A.2.1 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 251
A.3 Statistics for one and two samples . . . . . . . . . . . . . . . . . . 255
A.4 Simulation based statistics . . . . . . . . . . . . . . . . . . . . . . . 257

The plot on the front page is an illustration of the Central Limit Theorem (CLT). To put
it shortly, it states that when sampling a population: as the sample size increases, then
the mean of the sample converges to a normal distribution – no matter the distribution
of the population. The thumb rule is that the normal distribution can be used for the
sample mean when the sample size n is above 30 observations (the sample size n is the
number observations in the sample). The plot is created by simulating 100000 sample
means X̄ = ∑in=1 Xi (where Xi is an observation from a distribution) and plotting their
histogram with the CLT distribution on top (the red linie). The upper is for the normal,
the mid is for the uniform and the lower is for the exponential distribution. We can thus
see that as n increase, then the distribution of the simulated sample means
x̄ approaches

the distribution stated by the CLT (it is the normal distribution X̄ ∼ N µ, √σn , where µ
is the mean and σ is the standard deviation of the population). It is presented in Section
3.1.4.
Chapter 1 4

Chapter 1

Introduction, descriptive statistics, R

and data visualization

This is the first chapter in the eight-chapter DTU Introduction to Statistics book.
It consists of eight chapters:

1. Introduction, descriptive statistics, R and data visualization

2. Probability and simulation
3. Statistical analysis of one and two sample data
4. Statistics by simulation
5. Simple linear regression
6. Multiple linear regression
7. Analysis of categorical data
8. Analysis of variance (analysis of multigroup data)

In this first chapter the idea of statistics is introduced together with some of the
basic summary statistics and data visualization methods. The software used
throughout the book for working with statistics, probability and data analysis is
the open source environment R. An introduction to R is included in this chapter.

1.1 What is Statistics - a primer

To catch your attention we will start out trying to give an impression of the
importance of statistics in modern science and engineering.
Chapter 1 1.1 WHAT IS STATISTICS - A PRIMER 5

In the well respected New England Journal of medicine a millenium editorial on

the development of medical research in a thousand years was written:

EDITORIAL: Looking Back on the Millennium in Medicine, N Engl J Med, 342:42-

49, January 6, 2000, NEJM200001063420108.

They came up with a list of 11 points summarizing the most important devel-
opments for the health of mankind in a millenium:

• Elucidation of human anatomy and physiology

• Discovery of cells and their substructures
• Elucidation of the chemistry of life
• Application of statistics to medicine
• Development of anesthesia
• Discovery of the eelation of microbes to disease
• Elucidation of inheritance and genetics
• Knowledge of the immune system
• Development of body imaging
• Discovery of antimicrobial agents
• Development of molecular pharmacotherapy

The reason for showing the list here is pretty obvious: one of the points is Ap-
plication of Statistics to Medicine! Considering the other points on the list, and
what the state of medical knowledge was around 1000 years ago, it is obviously
a very impressive list of developments. The reasons for statistics to be on this
list are several and we mention two very important historical landmarks here.
Quoting the paper:

"One of the earliest clinical trials took place in 1747, when James Lind treated 12
scorbutic ship passengers with cider, an elixir of vitriol, vinegar, sea water, oranges
and lemons, or an electuary recommended by the ship’s surgeon. The success of the
citrus-containing treatment eventually led the British Admiralty to mandate the provi-
sion of lime juice to all sailors, thereby eliminating scurvy from the navy." (See also
James_Lind).

Still today, clinical trials, including the statistical analysis of the outcomes, are
taking place in massive numbers. The medical industry needs to do this in
order to find out if their new developed drugs are working and to provide doc-
umentation to have them accepted for the World markets. The medical industry
is probably the sector recruiting the highest number of statisticians among all
sectors. Another quote from the paper:
Chapter 1 1.2 STATISTICS AT DTU COMPUTE 6

"The origin of modern epidemiology is often traced to 1854, when John Snow demon-
strated the transmission of cholera from contaminated water by analyzing disease rates
among citizens served by the Broad Street Pump in London’s Golden Square. He ar-
rested the further spread of the disease by removing the pump handle from the polluted
well." (See also John_Snow_(physician)).

Still today, epidemiology, both human and veterinarian, maintains to be an ex-

tremely important field of research (and still using a lot of statistics). An im-
portant topic, for instance, is the spread of diseases in populations, e.g. virus
spreads like Ebola and others.

Actually, today more numbers/data than ever are being collected and the amounts
are still increasing exponentially. One example is Internet data, that internet
companies like Google, Facebook, IBM and others are using extensively. A
quote from New York Times, 5. August 2009, from the article titled “For To-
day’s Graduate, Just One Word: Statistics” is:

“I keep saying that the sexy job in the next 10 years will be statisticians," said Hal
Varian, chief economist at Google. ‘and I’m not kidding.’ ”

The article ends with the following quote:

“The key is to let computers do what they are good at, which is trawling these massive
data sets for something that is mathematically odd,” said Daniel Gruhl, an I.B.M. re-
searcher whose recent work includes mining medical data to improve treatment. “And
that makes it easier for humans to do what they are good at - explain those anomalies.”

1.2 Statistics at DTU Compute

At DTU Compute at the Technical University of Denmark statistics is used,

taught and researched mainly within four research sections:

• Statistics and Data Analysis

• Dynamical Systems
• Image Analysis & Computer Graphics
• Cognitive Systems

Each of these sections have their own focus area within statistics, modelling
and data analysis. On the master level it is an important option within DTU
Compute studies to specialize in statistics of some kind on the joint master pro-
gramme in Mathematical Modelling and Computation (MMC). And a Statisti-
cian is a wellknown profession in industry, research and public sector institu-
tions.
Chapter 1 1.3 STATISTICS - WHY, WHAT, HOW? 7

The high relevance of the topic of statistics and data analysis today is also il-
lustrated by the extensive list of ongoing research projects involving many and
diverse industrial partners within these four sections. Neither society nor in-
dustry can cope with all the available data without using highly specialized per-
sons in statistical techniques, nor can they cope and be internationally compet-
itive without constinuosly further developing these methodologies in research
projects. Statistics is and will continue to be a relevant, viable and dynamic
field. And the amount of experts in the field continues to be small compared
to the demand for experts, hence obtaining skills in statistics is for sure a wise
career choice for an engineer. Still for any engineer not specialising in statistics,
a basic level of statistics understanding and data handling ability is crucial for
the ability to navigate in modern society and business, which will be heavily
influenced by data of many kinds in the future.

1.3 Statistics - why, what, how?

Often in society and media, the word statistics is used simply as the name for
a summary of some numbers, also called data, by means of a summary table
and/or plot. We also embrace this basic notion of statistics, but will call such
basic data summaries descriptive statistics or explorative statistics. The meaning
of statistics goes beyond this and will rather mean “how to learn from data in an
insightful way and how to use data for clever decision making”, in short we call this
inferential statistics. This could be on the national/societal level, and could be
related to any kind of topic, such as e.g. health, economy or environment, where
data is collected and used for learning and decision making. For example:

• Cancer registries
• Health registries in general
• Nutritional databases
• Climate data
• Macro economic data (Unemployment rates, GNP etc. )
• etc.

The latter is the type of data that historically gave name to the word statistics. It
originates from the Latin ‘statisticum collegium’ (state advisor) and the Italian
word ‘statista’ (statesman/politician). The word was brought to Denmark by
the Gottfried Achenwall from Germany in 1749 and originally described the
processing of data for the state, see also History_of_statistics.

Or it could be for industrial and business applications:

Chapter 1 1.3 STATISTICS - WHY, WHAT, HOW? 8

• Is machine A more effective than machine B?

• How many products are we selling on different markets?
• Predicting wind and solar power for optimizing energy systems
• Do we produce at the specified quality level?
• Experiments and surveys for innovative product development
• Drug development at all levels at e.g. Novo Nordisk A/S or other phar-
maceutical companies
• Learning from "Big Data"
• etc.

In general, it can be said say that we learn from data by analysing the data
with statistical methods. Therefore statistics will in practice involve mathematical
modelling, i.e. using some linear or nonlinear function to model the particular
phenomenon. Similarly, the use of probability theory as the concept to describe
randomness is extremely important and at the heart of being able to “be clever”
in our use of the data. Randomness express that the data just as well could have
come up differently due to the inherent random nature of the data collection
and the phenomenon we are investigating.

Probability theory is in its own right an important topic in engineering relevant

applied mathematics. Probability based modelling is used for e.g. queing sys-
tems (queing for e.g. servers, websites, call centers etc.), for reliability mod-
elling, and for risk analysis in general. Risk analysis encompasses a vast di-
versity of engineering fields: food safety risk (toxicological and/or allergenic),
environmental risk, civil engineering risks, e.g. risk analysis of large building
constructions, transport risk, etc. The present material focuses on the statistical
issues, and treats probability theory at a minimum level, focusing solely on the
purpose of being able to do proper statistical inference and leaving more elabo-
rate probability theory and modelling to other texts.

There is a conceptual frame for doing statistical inference: in Statistical inference

the observed data is a sample, that is (has been) taken from a population. Based
on the sample, we try to generalize to (infer about) the population. Formal
definitions of what the sample and the population is are given by:
Chapter 1 1.3 STATISTICS - WHY, WHAT, HOW? 9

Definition 1.1 Sample and population

• An observational unit is the single entity about which information is

sought (e.g. a person)
• An observational variable is a property which can be measured on the
observational unit (e.g. the height of a person)
• The statistical population consists of the value of the observational vari-
able for all observational units (e.g. the heights of all persons in Den-
mark)
• The sample is a subset of the statistical population, which has been cho-
sen to represent the population (e.g. the heights of 20 persons in Den-
mark).

(Infinite) Statistical population

Sample
{ x1 , x2 , . . . , x n }
Randomly
selected

Statistical
Mean Sample mean
Inference
µ x̄

Figure 1.1: Illustration of statistical population and sample, and statistical in-
ference. Note that the bar on each person indicates that the it is the height (the
observational variable) and not the person (the observational unit), which are
the elements in the statistical population and the sample. Notice, that in all
analysis methods presented in this text the statistical population is assumed to
be very large (or infinite) compared to the sample size.
Chapter 1 1.3 STATISTICS - WHY, WHAT, HOW? 10

This is all a bit abstract at this point. And likely adding to the potential confu-
sion about this is the fact that the words population and sample will have a “less
precise” meaning when used in everyday language. When they are used in a
statistical context the meaning is very specific, as given by the definition above.
Let us consider a simple example:

Example 1.2

The following study is carried out (actual data collection): the height of 20 persons
in Denmark is measured. This will give us 20 values x1 , . . . , x20 in cm. The sample
is then simply these 20 values. The statistical population is the height values of all
people in Denmark. The observational unit is a person.

The meaning of sample in statistics is clearly different from how a chemist or

medical doctor would use the word, where a sample would be the actual sub-
stance in e.g. the petri dish. Within this book, when using the word sample, then
it is always in the statistical meaning i.e. a set of values taken from a statistical
population.

With regards to the meaning of population within statistics the difference to the
everyday meaning is less obvious: but note that the statistical population in the
example is defined to be the height values of persons, not actually the persons.
Had we measured the weights instead the statistical population would be quite
different. Also later we will realize that statistical populations in engineering
contexts can refer to many other things than populations as in a group of or-
ganisms, hence stretching the use of the word beyond the everyday meaning.
From this point: population will be used instead of statistical population in order
to simplify the text.

The population in a given situation will be linked with the actual study and/or
experiment carried out - the data collection procedure sometimes also denoted
the data generating process. For the sample to represent relevant information
about the population it should be representative for that population. In the ex-
ample, had we only measured male heights, the population we can say any-
thing about would be the male height population only, not the entire height
population.

A way to achieve a representative sample is that each observation (i.e. each

value) selected from the population, is randomly and independently selected of
each other, and then the sample is called a random sample.
Chapter 1 1.4 SUMMARY STATISTICS 11

1.4 Summary statistics

The descriptive part of studying data maintains to be an important part of statis-

tics. This means that it is recommended to study the given data, the sample, by
means of descriptive statistics as a first step, even though the purpose of a full
statistical analysis is to eventually perform some of the new inferential tools
taught in this book, that will go beyond the pure descriptive part. The aims of
the initial descriptive part are several, and when moving to more complex data
settings later in the book, it will be even more clear how the initial descriptive
part serves as a way to prepare for and guide yourself in the subsequent more
formal inferential statistical analysis.

The initial part is also called an explorative analysis of the data. We use a number
of summary statistics to summarize and describe a sample consisting of one or
two variables:

• Measures of centrality:
– Mean
– Median
– Quantiles
• Measures of “spread”:
– Variance
– Standard deviation
– Coefficient of variation
– Inter Quartile Range (IQR)
• Measures of relation (between two variables):
– Covariance
– Correlation

One important point to notice is that these statistics can only be calculated for
the sample and not for the population - we simply don’t know all the values
in the population! But we want to learn about the population from the sample.
For example when we have a random sample from a population we say that the
sample mean (x̄) is an estimate of the mean of the population, often then denoted
µ, as illustrated in Figure 1.1.
Chapter 1 1.4 SUMMARY STATISTICS 12

Remark 1.3
Notice, that we put ’sample’ in front of the name of the statistic, when it is
calculated for the sample, but we don’t put ’population’ in front when we
refer to it for the population (e.g. we can think of the mean as the true mean).

HOWEVER we don’t put sample in front of the name every time it should
be there! This is to keep the text simpler and since traditionally this is not
strictly done, for example the median is rarely called the sample median,
even though it makes perfect sense to distinguish between the sample me-
dian and the median (i.e. the population median). Further, it should be
clear from the context if the statistic refers to the sample or the population,
when it is not clear then we distinguish in the text. Most of the way we do
distinguish strictly for the mean, standard deviation, variance, covariance and
correlation.

1.4.1 Measures of centrality

The sample mean is a key number that indicates the centre of gravity or center-
ing of the sample. Given a sample of n observations x1 , . . . , xn , it is defined as
follows:

Definition 1.4 Sample mean

The sample mean is the sum of observations divided by the number of ob-
servations
1 n
n i∑
x̄ = xi . (1-1)
=1

Sometimes this is refered to as the average.

The median is also a key number indicating the center of sample (note that to
be strict we should call it ’sample median’, see Remark 1.3 above). In some
cases, for example in the case of extreme values or skewed distributions, the
median can be preferable to the mean. The median is the observation in the
middle of the sample (in sorted order). One may express the ordered observa-
tions as x(1) , . . . , x(n) , where then x(1) is the smallest of all x1 , . . . , xn (also called
Chapter 1 1.4 SUMMARY STATISTICS 13

the minimum) and x(n) is the largest of all x1 , . . . , xn (also called the maximum).

Definition 1.5 Median

Order the n observations x1 , . . . , xn from the smallest to largest:
x(1) , . . . , x(n) . The median is defined as:
n +1
• If n is odd the median is the observation in position 2 :

Q 2 = x ( n +1 ) . (1-2)
2

• If n is even the median is the average of the two observations in posi-

2
tions n2 and n+2 :

x ( n ) + x ( n +2 )
2 2
Q2 = . (1-3)
2
The reason why it is denoted with Q2 is explained below in Definition 1.8.

Example 1.6 Student heights

A random sample of the heights (in cm) of 10 students in a statistics class was

168 161 167 179 184 166 198 187 191 179 .

The sample mean height is

1
x̄ = (168 + 161 + 167 + 179 + 184 + 166 + 198 + 187 + 191 + 179) = 178.
10
To find the sample median we first order the observations from smallest to largest

x (1) x (2) x (3) x (4) x (5) x (6) x (7) x (8) x (9) x(10)
.
161 166 167 168 179 179 184 187 191 198

Note that having duplicate observations (like e.g. two of 179) is not a problem - they
all just have to appear in the ordered list. Since n = 10 is an even number the median
becomes the average of the 5th and 6th observations
x( n2 ) + x( n+2 ) x (5) + x (6) 179 + 179
2
= = = 179.
2 2 2
Chapter 1 1.4 SUMMARY STATISTICS 14

As an illustration, let’s look at the results if the sample did not include the 198 cm
height, hence for n = 9

1
x̄ = (168 + 161 + 167 + 179 + 184 + 166 + 187 + 191 + 179) = 175.78.
9
then the median would have been

x( n+1 ) = x(5) = 179.

This illustrates the robustness of the median compared to the sample mean: the
sample mean changes a lot more by the inclusion/exclusion of a single “extreme”
measurement. Similarly, it is clear that the median does not depend at all on the
actual values of the most extreme ones.

The median is the point that divides the observations into two halves. It is of
course possible to find other points that divide into other proportions, they are
called quantiles or percentiles (note, that this is actually the sample quantile or
sample percentile, see Remark 1.3).

Definition 1.7 Quantiles and percentiles

The p quantile also called the 100p% quantile or 100p’th percentile, can be
defined by the following procedure: a

1. Order the n observations from smallest to largest: x(1) , . . . , x(n)

2. Compute pn

3. If pn is an integer: average the pn’th and ( pn + 1)’th ordered observa-

tions. Then the p quantile is

q p = x(np) + x(np+1) /2 (1-4)

4. If pn is a noninteger: take the “next one” in the ordered list. Then the
p’th quantile is

q p = x(dnpe) , (1-5)

where dnpe is the ceiling of np, that is, the smallest integer larger than
np

a There exist several other formal definitions. To obtain this definition of quan-
tiles/percentiles in R use quantile(. . . , type=2). Using the default in R is also a perfectly
valid approach - just a different one.
Chapter 1 1.4 SUMMARY STATISTICS 15

Often calculated percentiles are the so-called quartiles (splitting the sample in
quarters, i.e. 0%, 25%, 50%, 75% and 100%):

• q0 , q0.25 , q0.50 , q0.75 and q1

Note that the 0’th percentile is the minimum (smallest) observation and the
100’th percentile is the maximum (largest) observation. We have specific names
for the three other quartiles:

Definition 1.8 Quartiles

Q1 = q0.25 = “lower quartile” = “0.25 quantile” = “25’th percentile”
Q2 = q0.50 = “median” = “0.50 quantile” = “50’th percentile”
Q3 = q0.75 = “upper quartile” = “0.75 quartile” = “75’th percentile”

Example 1.9 Student heights

Using the n = 10 sample from Example 1.6 and the ordered data table from there,
let us find the lower and upper quartiles (i.e. Q1 and Q3 ), as we already found
Q2 = 179.

First, the Q1 : with p = 0.25, we get that np = 2.5 and we find that

Q1 = x(d2.5e) = x(3) = 167,

and since n · 0.75 = 7.5, the upper quartile becomes

Q3 = x(d7.5e) = x(8) = 187.

We could also find the 0’th percentile

q0 = min( x1 , . . . , xn ) = x(1) = 161,

and the 100’th percentile

q1 = max( x1 , . . . , xn ) = x(10) = 198.

Finally, 10’th percentile (i.e. 0.10 quantile) is

x (1) + x (2) 161 + 166
q0.10 = = = 163.5,
2 2
since np = 1 for p = 0.10.
Chapter 1 1.4 SUMMARY STATISTICS 16

1.4.2 Measures of variability

A crucial aspect to understand when dealing with statistics is the concept of

variability - the obvious fact that not everyone in a population, nor in a sample,
will be exactly the same. If that was the case they would all equal the mean
of the population or sample. But different phenomena will have different de-
grees of variation: An adult (non dwarf) height population will maybe spread
from around 150 cm up to around 210 cm with very few exceptions. A kitchen
scale measurement error population might span from −5 g to +5 g. We need a
way to quantify the degree of variability in a population and in a sample. The
most commonly used measure of sample variability is the sample variance or
its square root, called the sample standard deviation:

Definition 1.10 Sample variance

The sample variance of a sample x1 , . . . , xn is the sum of squared differences
from the sample mean divided by n − 1
n
1
n − 1 i∑
s2 = ( xi − x̄ )2 . (1-6)
=1

Definition 1.11 Sample standard deviation

The sample standard deviation is the square root of the sample variance
s
√ n
1
n − 1 i∑
s = s2 = ( xi − x̄ )2 . (1-7)
=1

The sample standard deviation and the sample variance are key numbers of
absolute variation. If it is of interest to compare variation between different
samples, it might be a good idea to use a relative measure - most obvious is the
coefficient of variation:
Chapter 1 1.4 SUMMARY STATISTICS 17

Definition 1.12 Coefficient of variation

The coefficient of variation is the sample standard deviation seen relative to
the sample mean
s
V= . (1-8)
x̄

We interpret the standard deviation as the average absolute deviation from the mean
or simply: the average level of differences, and this is by far the most used measure
of spread. Two (relevant) questions are often asked at this point (it is perfectly
fine if you didn’t wonder about them by now and you might skip the answers
and return to them later):

Remark 1.13

Question: Why not actually compute directly what the interpretation is

stating, which would be: n1 ∑in=1 | xi − x̄ |?

Answer: This is indeed an alternative, called the mean absolute deviation, that
one could use. The reason for most often measuring “mean deviation”
NOT by the Mean Absolute Deviation statistic, but rather by the sample
standard deviation s, is the so-called theoretical statistical properties of
the sample variance s2 . This is a bit early in the material for going into
details about this, but in short: inferential statistics is heavily based
on probability considerations, and it turns out that it is theoretically
much easier to put probabilities related to the sample variance s2 on
explicit mathematical formulas than probabilities related to most other
alternative measures of variability. Further, in many cases this choice
is in fact also the optimal choice in many ways.
Chapter 1 1.4 SUMMARY STATISTICS 18

Remark 1.14

Question: Why divide by n − 1 and not n in the formulas of s and s2 ?

(which also appears to fit better with the stated interpretation)

Answer: The sample variance s2 will most often be used as an estimate of

the (true but unknown) population variance σ2 , which is the average
of ( xi − µ)2 in the population. In doing that, one should ideally com-
pare each observation xi with the population mean, usually called µ.
However, we do not know µ and instead we use x̄ in the computation
of s2 . In doing so, the squared differences ( xi − x̄ )2 that we compute in
this way will tend to be slightly smaller than those we ideally should
have used: ( xi − µ)2 (as the observations themselves were used to find
x̄ so they will be closer to x̄ than to µ). It turns out, that the correct way
to correct for this is by dividing by n − 1 instead of n.

Spread in the sample can also be described and quantified by quartiles:

Definition 1.15 Range

The range of the sample is

Range = Maximum − Minimum = Q4 − Q0 = x(n) − x(1) . (1-9)

The Inter Quartile Range (IQR) is the middle 50% range of data defined as

IQR = q0.75 − q0.25 = Q3 − Q1 . (1-10)

Chapter 1 1.4 SUMMARY STATISTICS 19

Example 1.16 Student heights

Consider again the n = 10 data from Example 1.6. To find the variance let us com-
pute the n = 10 differences to the mean, that is ( xi − 178)

-10 -17 -11 1 6 -12 20 9 13 1.

So, if we square these and add them up we get

10
∑ (xi − x̄)2 = 102 + 172 + 112 + 12 + 62 + 122 + 202 + 92 + 132 + 12 = 1342.
i =1

Therefore the sample variance is

1
s2 = 1342 = 149.1,
9
and the sample standard deviation is

s = 12.21.

We can interpret this as: people are on average around 12 cm away from the mean
height of 178 cm. The Range and Inter Quartile Range (IQR) are easily found from
the ordered data table in Example 1.6 and the earlier found quartiles in Example 1.9

Range = maximum − minimum = 198 − 161 = 37,

IQR = Q3 − Q1 = 187 − 167 = 20.

Hence 50% of all people (in the sample) lie within 20 cm.

Note, that the standard deviation in the example has the physical unit cm,
whereas the variance has cm2 . This illustrates the fact that the standard de-
viation has a more direct interpretation than the variance in general.

1.4.3 Measures of relation: correlation and covariance

When two observational variables are available for each observational unit, it
may be of interest to quantify the relation between the two, that is to quantify
how the two variables co-vary with each other, their sample covariance and/or
sample correlation.
Chapter 1 1.4 SUMMARY STATISTICS 20

Example 1.17 Student heights and weights

In addition to the previously given student heights we also have their weights (in
kg) available

Heights ( xi ) 168 161 167 179 184 166 198 187 191 179
.
Weights (yi ) 65.5 58.3 68.1 85.7 80.5 63.4 102.6 91.4 86.7 78.9

The relation between weights and heights can be illustrated by the so-called scatter-
plot, cf. Section 1.6.4, where e.g. weights are plotted versus heights:

7
100

8
90

4 9
Weight

y = 78.1 5
80

10
70

3
1
6
x = 178
60

2
160 170 180 190
Height

Each point in the plot corresponds to one student - here illustrated by using the
observation number as plot symbol. The (expected) relation is pretty clear now -
different wordings could be used for what we see:

• Weights and heights are related to each other

• Higher students tend to weigh more than smaller students
• There is an increasing pattern from left to right in the "point cloud”
• If the point cloud is seen as an (approximate) ellipse, then the ellipse clearly is
horizontally upwards ”tilted”.
• Weights and heights are (positively) correlated to each other

The sample covariance and sample correlation coefficients are a summary statis-
tics that can be calculated for two (related) sets of observations. They quantify
the (linear) strength of the relation between the two. They are calculated by
combining the two sets of observations (and the means and standard deviations
Chapter 1 1.4 SUMMARY STATISTICS 21

from the two) in the following ways:

Definition 1.18 Sample covariance

The sample covariance is
n
1
s xy = ∑ ( x − x̄ ) (yi − ȳ) .
n − 1 i =1 i
(1-11)

Definition 1.19 Sample correlation

The sample correlation coefficient is
n
s xy
1 xi − x̄ yi − ȳ
r= ∑
n − 1 i =1 sx sy
=
s x · sy
, (1-12)

where s x and sy is the sample standard deviation for x and y respectively.

When xi − x̄ and yi − ȳ have the same sign, then the point ( xi , yi ) give a positive
contribution to the sample correlation coefficient and when they have opposite
signs the point give a negative contribution to the sample correlation coefficient,
as illustrated here:

Example 1.20 Student heights and weights

The sample means are found to be

x̄ = 178 and ȳ = 78.1.
Using these we can show how each student deviate from the average height and
weight (these deviations are exactly used for the sample correlation and covariance
computations)

Student 1 2 3 4 5 6 7 8 9 10
Height ( xi ) 168 161 167 179 184 166 198 187 191 179
Weight (yi ) 65.5 58.3 68.1 85.7 80.5 63.4 102.6 91.4 86.7 78.9
( xi − x̄ ) -10 -17 -11 1 6 -12 20 9 13 1
(yi − ȳ) -12.6 -19.8 -10 7.6 2.4 -14.7 24.5 13.3 8.6 0.8
( xi − x̄ )(yi − ȳ) 126.1 336.8 110.1 7.6 14.3 176.5 489.8 119.6 111.7 0.8

Student 1 is below average on both height and weight (−10 and − 12.6). Student
Chapter 1 1.4 SUMMARY STATISTICS 22

10 is above average on both height and weight (+1 and + 0.8).s

The sample covariance is then given by the sum of the 10 numbers in the last row of
the table
1
s xy = (126.1 + 336.8 + 110.1 + 7.6 + 14.3 + 176.5 + 489.8 + 119.6 + 111.7 + 0.8)
9
1
= · 1493.3
9
= 165.9

And the sample correlation is then found from this number and the standard devia-
tions

s x = 12.21 and sy = 14.07.

(the details of the sy computation is not shown). Thus we get the sample correlation
as
165.9
r= = 0.97.
12.21 · 14.07

Note how all 10 contributions to the sample covariance are positive in the ex-
ample case - in line with the fact that all observations are found in the first
and third quadrants of the scatter plot (where the quadrants are defined by the
sample means of x and y). Observations in second and fourth quadrant would
contribute with negative numbers to the sum, hence such observations would
be from students with below average on one feature while above average on the
other. Then it is clear that: had all students been like that, then the covariance
and the correlation would have been negative, in line with a negative (down-
wards) trend in the relation.

We can state (without proofs) a number of properties of the sample correlation

Remark 1.21 Properties of the sample correlation, r

• r is always between −1 and 1: −1 ≤ r ≤ 1

• r measures the degree of linear relation between x and y
• r = ±1 if and only if all points in the scatterplot are exactly on a line
• r > 0 if and only if the general trend in the scatterplot is positive
• r < 0 if and only if the general trend in the scatterplot is negative
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 23

The sample correlation coefficient measures the degree of linear relation be-
tween x and y, which imply that we might fail to detect nonlinear relationships,
illustrated in the following plot of four different point clouds and their sample
correlations:

r ≈ 0.95 r ≈ −0.5
1.2

1
0.8

0
y

y
0.4

-1
-2
0.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x

r≈0 r≈0
2
1

0.8
0
y

y
0.4
-1
-2

0.0
-3

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x

The sample correlation in both the bottom plots are close to zero, but as we see
from the plot this number itself doesn’t imply that there no relation between y
and x - which clearly is the case in the bottom right and highly nonlinear case.

Sample covariances and correlation are closely related to the topic of linear re-
gression, treated in Chapter 5 and 6, where we will treat in more detail how we
can find the line that could be added to such scatterplots to describe the relation
between x and y in a different (but related) way, as well as the statistical analysis
used for this.

1.5 Introduction to R and RStudio

The program R is an open source software for statistics that you can download
to your own laptop for free. Go to http://mirrors.dotsrc.org/cran/ and se-
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 24

lect your platform (Windows, Mac or Linux) and follow instructions to install.

RStudio is a free and open source integrated development environment (IDE)

for R. You can run it on your desktop (Windows, Mac or Linux) or even over
the web using RStudio Server. It works as (an extended) alternative to running R
in the basic way through a terminal. This will be used in the course. Download
it from http://www.rstudio.com/ and follow installation instructions. To use
the software, you only need to open RStudio (R will then be used by RStudio for
carrying out the calculations).

1.5.1 Console and scripts

Once you have opened RStudio, you will see a number of different windows.
One of them is the console. Here you can write commands and execute them by
hitting Enter. For instance:

> ## Add two numbers in the console

> 2+3

[1] 5

In the console you cannot go back and change previous com-

mands and neither can you save your work for later. To do this
you need to write a script. Go to File->New->R Script. In the
script you can write a line and execute it in the console by hitting
Ctrl+Enter (Windows) or Cmd+Enter (Mac). You can also mark
several lines and execute them all at the same time.

1.5.2 Assignments and vectors

If you want to assign a value to a variable, you can use = or <-. The latter is the
preferred by R-users, so for instance:

> ## Assign the value 3 to y

> y <- 3

It is often useful to assign a set of values to a variable like a vector. This is done
with the function c (short for concatenate):
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 25

## Concatenate numbers to a vector

x <- c(1, 4, 6, 2)
x

[1] 1 4 6 2

Use the colon :, if you need a sequence, e.g. 1 to 10:

> ## A sequence from 1 to 10

> x <- 1:10
> x

[1] 1 2 3 4 5 6 7 8 9 10

You can also make a sequence with a specific stepsize different from 1

> ## Sequence with specified steps

> x <- seq(0, 1, by=0.1)
> x

[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

If you are in doubt of how to use a certain function, the help page can be opened
by typing ? followed by the function, e.g. ?seq.

If you know Matlab then this document Hiebeler-matlabR.pdf can

be very helpful.

1.5.3 Descriptive statistics

All the summary statistics measures presented in Section 1.4 can be found as
functions or part of functions in R:

• mean(x) - mean value of the vector x

• var(x) - variance
• sd(x) - standard deviation
• median(x) - median
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 26

• quantile(x,p) - finds the pth quantile. p can consist of several different

values, e.g. quantile(x,c(0.25,0.75)) or quantile(x,c(0.25,0.75), type=2)
• cov(x, y) - the covariance of the vectors x and y
• cor(x, y) - the correlation

Please again note that the words quantiles and percentiles are used interchange-
ably - they are essentially synonyms meaning exactly the same, even though the
formal distinction has been clarified earlier.

Example 1.22 Summary statistics in R

Consider again the n = 10 data from Example 1.6. We can read these data into R
and compute the sample mean and sample median as follows:

## Sample Mean and Median

x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
mean(x)

[1] 178

median(x)

[1] 179

The sample variance and sample standard deviation are found as follows:

## Sample variance and standard deviation

var(x)

[1] 149.1

sqrt(var(x))

[1] 12.21

sd(x)

[1] 12.21

The sample quartiles can be found by using the quantile function as follows:
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 27

## Sample quartiles
quantile(x, type=2)

0% 25% 50% 75% 100%

161 167 179 187 198

The option “type=2” makes sure that the quantiles found by the function is found
using the definition given in Definition 1.7. By default, the quantile function would
use another definition (not detailed here). Generally, we consider this default choice
just as valid as the one explicitly given here, it is merely a different one. Also the
quantile function has an option called “probs” where any list of probability values
from 0 to 1 can be given. For instance:

## Sample quantiles 0%, 10%,..,90%, 100%:

quantile(x, probs=seq(0, 1, by=0.10), type=2)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
161.0 163.5 166.5 168.0 173.5 179.0 184.0 187.0 189.0 194.5 198.0

1.5.4 Use of R in the course and at the exam

You should bring your laptop with R intalled with you to the teaching activity
and to the exam. We will need access to the so-called probability distributions
to do statistical computations, and the values of these distributions are not oth-
erwise part of the written material: These probability distributions are part of
many different softwares, also Excel, but it is part of the syllabus to be able to
work with these within R.

Apart from access to these probability distributions, the R-software is used in

three ways in our course

1. As a pedagogical learning tool: The random variable simulation tools in-

built in R enables the use of R as a way to illustrate and learn the principles
of statistical reasoning that are the main purposes of this course.
2. As a pocket calculator substitute - that is making R calculate ”manually”
- by simple routines - plus, minus, squareroot etc. whatever needs to be
calculated, that you have identified by applying the right formulas from
the proper definitions and methods in the written material.
3. As a ”probability calculus and statistical analysis machine” where e.g.
with some data fed into it, it will, by inbuilt functions and procedures
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 28

do all relevant computations for you and present the final results in some
overview tables and plots.

We will see and present all three types of applications of R during the course.
For the first type, the aim is not to learn how to use the given R-code itself
but rather to learn from the insights that the code together with the results of
applying it is providing. It will be stated clearly whenever an R-example is of
this type. Types 2 and 3 are specific tools that should be learned as a part of the
course and represent tools that are explicitly relevant in your future engineering
activity. It is clear that at some point one would love to just do the last kind
of applications. However, it must be stressed that even though the program is
able to calculate things for the user, understanding the details of the calculations
must NOT be forgotten - understanding the methods and knowing the formulas
is an important part of the syllabus, and will be checked at the exam.

Remark 1.23 BRING and USE pen and paper PRIOR to R

For many of the exercises that you are asked to do it will not be possible to
just directly identify what R-command(s) should be used to find the results.
The exercises are often to be seen as what could be termed ”problem math-
ematics” exercises. So, it is recommended to also bring and use pen and
paper to work with the exercises to be able to subsequently know how to
finally finish them by some R-calculations.(If you adjusted yourself to some
digitial version of ”pen-and-paper”, then this is fine of course.)

Remark 1.24 R is not a substitute for your brain activity in this

course!
The software R should be seen as the most fantastic and easy computa-
tional companion that we can have for doing statistical computations that
we could have done ”manually”, if we wanted to spend the time doing
it. All definitions, formulas, methods, theorems etc. in the written mate-
rial should be known by the student, as should also certain R-routines and
functions.

A good question to ask yourself each time that you apply en inbuilt R-function
is: ”Would I know how to make this computation ”manually”?”. There are few
exceptions to this requirement in the course, but only a few. And for these the
question would be: ”Do I really understand what R is computing for me now?”
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 29

1.6 Plotting, graphics - data visualisation

A really important part of working with data analysis is the visualisation of the
raw data, as well as the results of the statistical analysis – the combination of
the two leads to reliable results. Let us focus on the first part now, which can
be seen as being part of the explorative descriptive analysis also mentioned in
Section 1.4. Depending on the data at hand different types of plots and graphics
could be relevant. One can distinguish between quantitative vs. categorical data.
We will touch on the following type of basic plots:

• Quantitative data:
– Frequency plots and histograms
– box plots
– cumulative distribution
– Scatter plot (xy plot)
• Categorical data:
– Bar charts
– Pie charts

1.6.1 Frequency distributions and the histogram

The frequency distribution is the count of occurrences of values in the sample

for different classes using some classification, for example in intervals or by
some other property. It is nicely depicted by the histogram, which is a bar-plot
of the occurrences in each classes.
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 30

Example 1.25 Histogram in R

Consider again the n = 10 sample from Example 1.6.

## A histogram of the heights

hist(x)
4
3
Frequency
2
1
0

160 170 180 190 200

The default histogram uses equidistant interval widths (the same width for all
intervals) and depicts the raw frequencies/counts in each interval. One may
change the scale into showing what we will learn to be densities by dividing the
raw counts by n and the interval width, i.e.

"Interval count"
.
n · ("Interval width")

By plotting the densities a density histogram also called the empirical density
the area of all the bars add up to 1:
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 31

Example 1.26 Empirical density in R

## A density histogram or empirical density of the heights

hist(x, freq=FALSE, col="red", nclass=8)
0.06
0.04
Density
0.02
0.00

160 170 180 190 200

The R-function hist makes some choice of the number of classess based on
the number of observations - it may be changed by the user option nclass as
illustrated here, although the original choice seems better in this case due to the
very small sample.

1.6.2 Cumulative distributions

The cumulative distribution can be visualized simply as the cumulated relative

frequencies either across classes, as also used in the histogram, or individual
data points, which is then called the empirical cumulative distribution function:
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 32

Example 1.27 Cumulative distribution plot in R

## Empirical cumulative distribution plot

plot(ecdf(x), verticals=TRUE)
1.0
0.8
0.6
Fb( x )
0.4
0.2
0.0

160 170 180 190 200

The empirical cumulative distribution function Fn is a step function with jumps

i/n at observation values, where i is the number of identical(tied) observations
at that value.

For observations ( x1 , x2 , . . . , xn ), Fn ( x ) is the fraction of observations less or

equal to x, that mathematically can be expressed as

1
Fn ( x ) = ∑ n
. (1-13)
j where x ≤ x
j

1.6.3 The box plot and the modified box plot

The so-called box plot in its basic form depicts the five quartiles (min, Q1 , me-
dian, Q3 , max) with a box from Q1 to Q3 emphasizing the Inter Quartile Range
(IQR):
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 33

Example 1.28 Box plot in R

## A basic box plot of the heights (range=0 makes it "basic")

boxplot(x, range=0, col="red", main="Basic box plot")
## Add the blue text
text(1.3, quantile(x), c("Minimum","Q1","Median","Q3","Maximum"),
col="blue")

Basic box plot

Maximum
190

Q3
180

Median
170

Q1
Minimum
160

In the modified box plot the whiskers only extend to the min. and max. obser-
vation if they are not too far away from the box: defined to be 1.5 × IQR. Obser-
vations further away are considered as extreme observations and will be plotted
individually - hence the whiskers extend from the smallest to the largest obser-
vation within a distance of 1.5 × IQR of the box (defined as either 1.5 × IQR
larger than Q3 or 1.5 × IQR smaller than Q1 ).
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 34

Example 1.29 Box plot in R

If we add an extreme observation, 235 cm, to the heights sample and make the mod-
ified box plot - the default in R- and the basic box plot, then we have:

## Add an extreme value and box plot

boxplot(c(x, 235), col="red", main="Modified box plot")
boxplot(c(x, 235), col="red", main="Basic box plot", range=0)

Modified box plot Basic box plot

Maximum Maximum
220

220
200

200
Q3 Q3
180

180

Median Median

Q1 Q1
Minimum Minimum
160

160

Note that since there was no extreme observations among the original 10 observa-
tions, the two ”different” plots would be the same if we didn’t add the extreme 235
cm observation.

The box plot hence is an alternative to the histogram in visualising the distribu-
tion of the sample. It is a convenient way of comparing distributions in different
groups, if such data is at hand.

Example 1.30 Box plot in R

This example shows some ways of working with R to illustrate data.

In another statistics course the following heights of 17 female and 23 male students
were found:

Males 152 171 173 173 178 179 180 180 182 182 182 185
185 185 185 185 186 187 190 190 192 192 197
Females 159 166 168 168 171 171 172 172 173 174 175 175
175 175 175 177 178
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 35

The two modified box plots of the distributions for each gender can be generated by
a single call to the boxplot function:

## Box plot with two groups

Males <- c(152, 171, 173, 173, 178, 179, 180, 180, 182, 182, 182, 185,
185 ,185, 185, 185 ,186 ,187 ,190 ,190, 192, 192, 197)
Females <-c(159, 166, 168 ,168 ,171 ,171 ,172, 172, 173, 174 ,175 ,175,
175, 175, 175, 177, 178)
boxplot(list(Males, Females), col=2:3, names=c("Males", "Females"))
190
180
170
160

Males Females

At this point, it should be noted that in real work with data using R, one would
generally not import data into R by explicit listings in an R-script as here. This
only works for very small data sets. Usually the data is imported from some-
where else, e.g. from a spread sheet exported in a .csv (comma separated values)
format as shown here:
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 36

Example 1.31 Read and explore data in R

The gender grouped student heights data used in Example 1.30 is avail-
able as a .csv-file via http://www2.compute.dtu.dk/courses/introstat/data/
studentheights.csv. The structure of the data file, as it would appear in a spread
sheet program (e.g. LibreOffice Calc or Excel) is two columns and 40+1 rows includ-
ing a header row:

1 Height Gender
2 152 male
3 171 male
4 173 male
. . .
. . .
24 197 male
25 159 female
26 166 female
27 168 female
. . .
. . .
39 175 female
40 177 female
41 178 female

The data can now be imported into R with the read.table function:

## Read the data (note that per default sep="," but here semicolon)
studentheights <- read.table("studentheights.csv", sep=";", dec=".",
header=TRUE)

The resulting object studentheights is now a so-called data.frame, which is the

class used for such tables in R. There are some ways of getting a quick look at what
kind of data is really in a data set:
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 37

## Have a look at the first 6 rows of the data

head(studentheights)

Height Gender
1 152 male
2 171 male
3 173 male
4 173 male
5 178 male
6 179 male

## Get an overview
str(studentheights)

'data.frame': 40 obs. of 2 variables:

$ Height: int 152 171 173 173 178 179 180 180 182 182 ...
$ Gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...

## Get a summary of each column/variable in the data

summary(studentheights)

Height Gender
Min. :152.0 female:17
1st Qu.:172.8 male :23
Median :177.5
Mean :177.9
3rd Qu.:185.0
Max. :197.0

For quantitative variables we get the quartiles and the mean from summary. For
categorical variables we see (some of) the category frequencies . A data structure like
this is commonly encountered (and often the only needed) for statistical analysis.
The gender grouped box plot can now be generated by:
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 38

## Box plot for each gender

boxplot(Height ~ Gender, data=studentheights, col=2:3)

190
180
170
160

female male

The R-syntax Height ~ Gender with the tilde symbol “~” is one that we will use a
lot in various contexts such as plotting and model fitting. In this context it can be
understood as “Height is plotted as a function of Gender”.

1.6.4 The Scatter plot

The scatter plot can be used for two quantitative variables. It is simply one
variable plotted versus the other using some plotting symbol.

Example 1.32 Explore data included in R

Now we will use a data set available as part of R itself. Both base R and many addon
R-packages include data sets, which can be used for testing and practicing. Here we
will use the mtcars data set. If you write:

## See information about the mtcars data

?mtcars

you will be able to read the following as part of the help info:

“The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel con-
sumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 39

models). A data frame with 32 observations on 11 variables. Source: Henderson and Velle-
man (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.”

Let us plot the gasoline use, (mpg=miles pr. gallon), versus the weight (wt):

## To make 2 plots
par(mfrow=c(1,2))
## First the default version
plot(mtcars$wt, mtcars$mpg, xlab="wt", ylab="mpg")
## Then a nicer version
plot(mpg ~ wt, xlab="Car Weight (1000lbs)", data=mtcars,
ylab="Miles pr. Gallon", col=factor(am),
main="Inverse fuel usage vs. size")
## Add a legend to the plot
legend("topright", c("Automatic transmission","Manual transmission"),
col=c("black","red"), pch=1, cex=0.7)

Inverse fuel usage vs. size

Automatic transmission
Manual transmission
30

30
Miles pr. Gallon
25

25
mpg
20

20
15

15
10

2 3 4 5 2 3 4 5
wt Car Weight (1000lbs)

In the second plot call we have used the so-called formula syntax of R, that was
introduced above for the grouped box plot. Again, it can be read: “mpg is plotted
as a function of wt”. Note also how a color option, col=factor(am), can be used to
group the cars with and without automatic transmission, stored in the data column
am in the data set.

1.6.5 Bar plots and Pie charts

All the plots described so far were for quantitative variables. For categorical
variables the natural basic plot would be a bar plot or pie chart visualizing the
relative frequencies in each category.
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 40

Example 1.33 Bar plots and Pie charts in R

For the gender grouped student heights data used in Example 1.30 we can plot the
gender distribution by:

## Barplot
barplot(table(studentheights$Gender), col=2:3)
20
15
10
5
0

female male

## Pie chart
pie(table(studentheights$Gender), cex=1, radius=1)

female

male
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 41

1.6.6 More plots in R?

A good place for getting more inspired on how to do easy and nice plots in R is:
http://www.statmethods.net/.
Chapter 1 1.7 EXERCISES 42

1.7 Exercises

Exercise 1.1 Infant birth weight

In a study of different occupational groups the infant birth weight was recorded
for randomly selected babies born by hairdressers, who had their first child.
The following table shows the weight in grams (observations specified in sorted
order) for 10 female births and 10 male births:

Females (x) 2474 2547 2830 3219 3429 3448 3677 3872 4001 4116
Males (y) 2844 2863 2963 3239 3379 3449 3582 3926 4151 4356

Solve at least the following questions a)-c) first “manually” and then by the
inbuilt functions in R. It is OK to use R as alternative to your pocket calculator
for the “manual” part, but avoid the inbuilt functions that will produce the
results without forcing you to think about how to compute it during the manual
part.

a) What is the sample mean, variance and standard deviation of the female
births? Express in your own words the story told by these numbers. The
idea is to force you to interpret what can be learned from these numbers.

b) Compute the same summary statistics of the male births. Compare and
explain differences with the results for the female births.

c) Find the five quartiles for each sample — and draw the two box plots with
pen and paper (i.e. not using R.)

d) Are there any “extreme” observations in the two samples (use the modified
box plot definition of extremness)?

e) What are the coefficient of variations in the two groups?

Chapter 1 1.7 EXERCISES 43

Exercise 1.2 Course grades

To compare the difficulty of 2 different courses at a university the following

grades distributions (given as number of pupils who achieved the grades) were
registered:

Course 1 Course 2 Total

Grade 12 20 14 34
Grade 10 14 14 28
Grade 7 16 27 43
Grade 4 20 22 42
Grade 2 12 27 39
Grade 0 16 17 33
Grade -3 10 22 32
Total 108 143 251

a) What is the median of the 251 achieved grades?

b) What are the quartiles and the IQR (Inter Quartile Range)?

Exercise 1.3 Cholesterol

In a clinical trial of a cholesterol-lowering agent, 15 patients’ cholesterol (in

mmol L−1 ) was measured before treatment and 3 weeks after starting treatment.
Data is listed in the following table:

Patient 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Before 9.1 8.0 7.7 10.0 9.6 7.9 9.0 7.1 8.3 9.6 8.2 9.2 7.3 8.5 9.5
After 8.2 6.4 6.6 8.5 8.0 5.8 7.8 7.2 6.7 9.8 7.1 7.7 6.0 6.6 8.4

a) What is the median of the cholesterol measurements for the patients before
treatment, and similarly after treatment?

b) Find the standard deviations of the cholesterol measurements of the pa-

tients before and after treatment.
Chapter 1 1.7 EXERCISES 44

c) Find the sample covariance between cholesterol measurements of the pa-

tients before and after treatment.

d) Find the sample correlation between cholesterol measurements of the pa-

tients before and after treatment.

e) Compute the 15 differences (Dif = Before − After) and do various sum-

mary statistics and plotting of these: sample mean, sample variance, sam-
ple standard deviation, boxplot etc.

f) Observing such data the big question is whether an average decrease in

cholesterol level can be “shown statistically”. How to formally answer
this question is presented in Chapter 3, but consider now which summary
statistics and/or plots would you look at to have some idea of what the
answer will be?

Exercise 1.4 Project start

a) Go to CampusNet and take a look at the first project and read the project
page on the website for more information (02323.compute.dtu.dk/Agendas
or 02402.compute.dtu.dk/Agendas). Follow the steps to import the data
into R and get started with the explorative data analysis.
Chapter 2 45

Chapter 2

Probability and simulation

In this chapter elements from probability theory are introduced. These are
needed to form the basic mathematical description of randomness. For example
for calculating the probabilities of outcomes in various types of experimental or
observational study setups. Small illustrative examples, such as e.g. dice rolls
and lottery draws, and natural phenomena such as the waiting time between
radioactive decays are used as throughout. But the scope of probability theory
and it’s use in society, science and business, not least engineering endavour,
goes way beyond these small examples. The theory is introduced together with
illustrative R code examples, which the reader is encouraged to try and interact
with in parallel to reading the text. Many of these are of the learning type, cf.
the discussion of the way R is used in the course in Section 1.5.

2.1 Random variable

The basic building blocks to describe random outcomes of an experiment are

introduced in this section. The definition of an experiment is quite broad. It can
be an experiment, which is carried out under controlled conditions e.g. in a
laboratory or flipping a coin, as well as an experiment in conditions which are
not controlled, where for example a process is observed e.g. observations of
the GNP or measurements taken with a space telescope. Hence, an experiment
can be thought of as any setting in which the outcome cannot be fully known.
This for example also includes measurement noise, which are random “errors”
related to the system used to observe with, maybe originating from noise in
electrical circuits or small turbulence around the sensor. Measurements will
always contain some noise.

First the sample space is defined:

Chapter 2 2.1 RANDOM VARIABLE 46

Definition 2.1
The sample space S is the set of all possible outcomes of an experiment.

Example 2.2

Consider an experiment in which a person will throw two paper balls with the pur-
pose of hitting a wastebasket. All the possible outcomes forms the sample space of
this experiment as

S = (miss,miss), (hit,miss), (miss,hit), (hit,hit) . (2-1)

Now a random variable can be defined:

Definition 2.3
A random variable is a function which assigns a numerical value to each out-
come in the sample space. In this book random variables are denoted with
capital letters, e.g.

X, Y, . . . . (2-2)

Example 2.4

Continuing the paper ball example above, a random variable can be defined as the
number of hits, thus

X (miss,miss) = 0, (2-3)

X (hit,miss) = 1, (2-4)

X (miss,hit) = 1, (2-5)

X (hit,hit) = 2. (2-6)

In this case the random variable is a function which maps the sample space S to
positive integers, i.e. X : S → N0 .
Chapter 2 2.1 RANDOM VARIABLE 47

Remark 2.5
The random variable represents a value of the outcome before the experiment
is carried out. Usually the experiment is carried out n times and there are
random variables for each of them

{ Xi : 1, 2, . . . , n}. (2-7)

After the experiment has been carried out n times a set of values of the ran-
dom variable is available as

{ xi : 1, 2, . . . , n}. (2-8)

Each value is called a realization or observation of the random variable and

is denoted with a small letter sub-scripted with an index i, as introduced in
Chapter 1.

Finally, in order to quantify probability, a random variable is associated with

a probability distribution. The distribution can either be discrete or continuous
depending on the nature of the outcomes:

• Discrete outcomes can for example be: the outcome of a dice roll, the num-
ber of children per family, or the number of failures of a machine per year.
Hence some countable phenomena which can be represented by an inte-
ger.
• Continuous outcomes can for example by: the weight of the yearly har-
vest, the time spend on homework each week, or the electricity generation
per hour. Hence a phenomena which can be represented by a continuous
value.

Furthermore, the outcome can either be unlimited or limited. This is most ob-
vious in the case discrete case, e.g. a dice roll is limited to the values between
1 and 6. However it is also often the case for continuous random variables, for
example many are non-negative (weights, distances, etc.) and proportions are
limited to a range between 0 and 1.

Conceptually there is no difference between the discrete and the continuous

case, however it is easier to distinguish since the formulas, which in the discrete
case are with sums, in the continuous case are with integrals. In the remaining
of this chapter, first the discrete case is presented and then the continuous.
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 48

2.2 Discrete random variables

In this section discrete distributions and their properties are introduced. A dis-
crete random variable has discrete outcomes and follows a discrete distribution.

To exemplify, consider the outcome of one roll of a fair six-sided dice as the
random variable X fair . It has six possible outcomes, each with equal probability.
This is specified with the probability density function.

Definition 2.6
For a discrete random variable X the probability density function (pdf) is

f ( x ) = P ( X = x ). (2-9)

It assigns a probability to every possible outcome value x.

A discrete pdf fulfills two properties: there are no negative probabilities for
any outcome value

f ( x ) ≥ 0 for all x, (2-10)

and the probabilities for all outcome values sum to one

∑ f ( x ) = 1. (2-11)
all x

Example 2.7

For the fair dice the pdf is

x 1 2 3 4 5 6
1 1 1 1 1 1
f Xfair ( x ) 6 6 6 6 6 6

If the dice is not fair, maybe it has been modified to increase the probability of rolling
a six, the pdf could for example be

x 1 2 3 4 5 6
1 1 1 1 1 2
f Xunfair ( x ) 7 7 7 7 7 7

where X unfair is a random variable representing the value of a roll with the unfair
dice.
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 49

The pdfs are plotted: the left plot shows the pdf of a fair dice and the right plot the
pdf of an unfair dice:
1.0

1.0
0.8

0.8
f Xunfair ( x )
0.6

0.6
f Xfair ( x )
0.4

0.4
0.2

0.2
0.0

0.0
1 2 3 4 5 6 1 2 3 4 5 6
x x

Remark 2.8
Note that the pdfs has subscript with the symbol of the random variable to
which they belong. This is done when there is a need to distinguish between
pdfs e.g. for several random variables. For example if two random variables
X and Y are used in same context, then: f X ( x ) is the pdf for X and f Y ( x ) for
Y, similarly the sample standard deviation s X is for X and sY is for Y, and so
forth.

The cumulated distribution function (cdf), or simply the distribution function, is of-
ten used.

Definition 2.9 The cdf

The cumulated distribution function (cdf) for the discrete case is the probability
of realizing an outcome below or equal to the value x

F ( x ) = P( X ≤ x ) = ∑ f (xj ) = ∑ P ( X = x j ). (2-12)
j where x j ≤ x j where x j ≤ x

The probability that the outcome of X is in a range is

P ( a < X ≤ b ) = F ( b ) − F ( a ). (2-13)
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 50

For the fair dice the probability of an outcome below or equal to 4 can be calcu-
lated
4
1 1 1 1 2
FXfair (4) = ∑ f Xfair (x j ) = 6 + 6 + 6 + 6 = 3 . (2-14)
j =1

Example 2.10

For the fair dice the cdf is

x 1 2 3 4 5 6
1 2 3 4 5
FXfair ( x ) 6 6 6 6 6 1

The cdf for a fair dice is plotted in the left plot and the cdf for an unfair dice is plotted
in the right plot:
1.0

1.0
0.8

0.8
FXunfair ( x )
0.6

0.6
FXfair ( x )
0.4

0.4
0.2

0.2
0.0

0.0

1 2 3 4 5 6 1 2 3 4 5 6
x x
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 51

2.2.1 Introduction to simulation

One nice thing about having computers available is that we try things in virtual
reality - this we can here use here to play around while learning how prob-
ability and statistics work. With the pdf defined an experiment can easily be
simulated, i.e. instead of carrying out the experiment in reality it is carried out
using a model on the computer. When the simulation includes generating ran-
dom numbers it is called a stochastic simulation. Such simulation tools are readily
available within R, and it can be used for as well learning purposes as a way to
do large scale complex probabilistic and statistical computations. For now it
will be used in the first way.

Example 2.11 Simulation of rolling a dice

Let’s simulate the experiment of rolling a dice using the following R code (open the
file chapter2-ProbabilitySimulation.R and try it)

## Make a random draw from (1,2,3,4,5,6) with equal probability

## for each outcome
sample(1:6, size=1)

The simulation becomes more interesting when the experiment is repeated many
times, then we have a sample and can calculate the empirical density function (or em-
pirical pdf or density histogram, see Section 1.6.1) as a discrete histogram and actually
“see” the shape of the pdf

## Simulate a fair dice

## Number of simulated realizations

n <- 30
## Draw independently from the set (1,2,3,4,5,6) with equal probability
xFair <- sample(1:6, size=n, replace=TRUE)
## Count the number of each outcome using the table function
table(xFair)
## Plot the pdf
par(mfrow=c(1,2))
plot(rep(1/6,6), type="h", col="red", ylim=c(0,1), lwd=10)
## Plot the empirical pdf
lines(table(xFair)/n, lwd=4)
## Plot the cdf
plot(cumsum(rep(1/6,6)), ylim=c(0,1), lwd=10, type="h", col="red")
## Add the empirical cdf
lines(cumsum(table(xFair)/n), lwd=4, type="h")
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 52

1.0

1.0
pdf
Empirical pdf

Cumulated density
0.8

0.8
0.6

0.6
Density
0.4

0.4
0.2

0.2
0.0

0.0
1 2 3 4 5 6 1 2 3 4 5 6
x x

Try simulating with different number of rolls n and describe how this affects
the accuracy of the empirical pdf compared to the pdf?

Now repeat this with the unfair dice

## Simulate an unfair dice

## Number of simulated realizations

n <- 30
## Draw independently from the set (1,2,3,4,5,6) with higher
## probability for a six
xUnfair <- sample(1:6, size=n, replace=TRUE, prob=c(rep(1/7,5),2/7))
## Plot the pdf
plot(c(rep(1/7,5),2/7), type="h", col="red", ylim=c(0,1), lwd=10)
## Plot the empirical density function
lines(table(xUnfair)/n, lwd=4)
## Plot the cdf
plot(cumsum(c(rep(1/7,5),2/7)), ylim=c(0,1), lwd=10, type="h", col="red")
## Add the empirical cdf
lines(cumsum(table(xUnfair)/n), lwd=4, type="h")
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 53

1.0

1.0
pdf
Empirical pdf

Cumulated density
0.8

0.8
0.6

0.6
Density
0.4

0.4
0.2

0.2
0.0

0.0
1 2 3 4 5 6 1 2 3 4 5 6
x x

Compare the fair and the unfair dice simulations:

How did the empirical pdf change?

By simply observing the empirical pdf can we be sure to distinguish be-

tween the fair and the unfair dice?

How does the number of rolls n affect how well we can distinguish the two
dices?

One reason to simulate becomes quite clear here: it would take considerably
more time to actually carry out these experiments. Furthermore, sometimes
calculating the theoretical properties of random variables (e.g. products of sev-
eral random variables etc.) are impossible and simulations can be a useful way
to obtain such results.

Random number sequences generated with software algorithms have the prop-
erties of real random numbers, e.g. they are independent, but are in fact de-
terministic sequences depending on a seed, which sets an initial value of the
sequence. Therefore they are named pseudo random numbers, since they behave
like and are used as random numbers in simulations, but are in fact determin-
istic sequences.
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 54

Remark 2.12 Random numbers and seed in R

In R the initial values can be set with a single number called the seed as
demonstrated with the following R code. As default the seed is created from
the time of start-up of a new instance of R. A way to generate truly (i.e. non-
pseudo) random numbers can be to sample some physical phenomena, for
example atmospheric noise as done at www.random.org.

## The random numbers generated depends on the seed

## Set the seed

set.seed(127)
## Generate a (pseudo) random sequence
sample(1:10)
[1] 3 1 2 8 7 4 10 9 6 5

## Generate again and see that new numbers are generated

sample(1:10)
[1] 9 3 7 2 10 8 6 1 5 4

## Set the seed and the same numbers as before just after the
## seed was set are generated
set.seed(127)
sample(1:10)
[1] 3 1 2 8 7 4 10 9 6 5

2.2.2 Mean and variance

In Chapter 1 the sample mean and the sample variance were introduced. They
indicate respectively the centering and the spread of data, i.e. of a sample. In
this section the mean and variance are introduced. They are properties of the
distribution of a random variable, they are called population parameters. The
mean indicates where the distribution is centered. The variance indicates the
spread of the distribution.
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 55

Mean and expected value

The mean (µ) of a random variable is the population parameter which most sta-
tistical analysis focus on. It is formally defined as a function E( X ): the expected
value of the random variable X.

Definition 2.13 Mean value

The mean of a discrete random variable X is
∞
µ = E( X ) = ∑ x i f ( x i ), (2-15)
i =1

where xi is the value and f ( xi ) is the probability that X takes the outcome
value xi .

The mean is simply the weighted average over all possible outcome values,
weighted with the corresponding probability. As indicated in the definition
there might be infinitely many possible outcome values, hence, even if the total
sum of probabilities is one, then the probabilities must go sufficiently fast to
zero for increasing values of X in order for the sum to be defined.

Example 2.14

For the fair dice the mean is calculated by

1 1 1 1 1 1
µ xfair = E( X fair ) = 1 + 2 + 3 + 4 + 5 + 6 = 3.5,
6 6 6 6 6 6
for the unfair dice the mean is
1 1 1 1 1 2
µ xunfair = E( X unfair ) = 1 + 2 + 3 + 4 + 5 + 6 ≈ 3.86.
7 7 7 7 7 7

The mean of a random variable express the limiting value of an average of many
outcomes. If a fair dice is rolled a really high number of times the sample mean
of these will be very close to 3.5. For the statistical reasoning related to the use of
a sample mean as an estimate for µ, the same property ensures that envisioning
many sample means (with the same n), a meta like thinking, then the mean of
such many repeated sample means will be close to µ.

After an experiment has been carried out n times then the sample mean or average
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 56

can be calculated as previously defined in Chapter 1

1 n
n∑
µ̂ = x̄ = xi . (2-16)
i

It is called a statistic, which means that it is calculated from a sample. Note the
use of a hat in the notation over µ: this indicates that it is an estimate of the real
underlying mean.

Our intuition tells us that the estimate (µ̂) will be close to true underlying ex-
pectation
h i (µ) when n is large. This is indeed the case, to be more specific
E n1 ∑ Xi = µ (when E[ Xi ] = µ), and we say that the average is a central
estimator for the expectation. The exact quantification of these qualitative state-
ments will be covered in Chapter 3.

Now play a little around with the mean and the sample mean with some simu-
lations.

Example 2.15 Simulate and estimate the mean

Carrying out the experiment more than one time an estimate of the mean, i.e. the
sample mean, can be calculated. Simulate rolling the fair dice

## Simulate a fair dice

## Number of realizations
n <- 30
## Simulate rolls with a fair dice
xFair <- sample(1:6, size=n, replace=TRUE)
## Calculate the sample mean
sum(xFair)/n

[1] 3.6

## or
mean(xFair)

[1] 3.6

Let us see what happens with the sample mean of the unfair dice by simulating the
same number of rolls
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 57

## Simulate an unfair dice

## n realizations
xUnfair <- sample(1:6, size=n, replace=TRUE, prob=c(rep(1/7,5),2/7))
## Calculate the sample mean
mean(xUnfair)

[1] 3.967

Consider the mean of the unfair dice and compare it to the mean of the fair
dice (see Example 2.14). Is this in accordance with your simulation results?

Let us again turn to how much we can “see” from the simulations and the impact
of the number of realizations n on the estimation. In statistics the term information is
used to refer to how much information is embedded in the data, and therefore how
accurate different properties (parameters) can be estimated from the data.

Repeat the simulations several times with n = 30. By simply comparing the
sample means from a single simulation can it then be determined if the two
means really are different?

Repeat the simulations several times and increase n. What happens with to
the ’accuracy’ of the sample mean compared to the real mean? and thereby
how well it can be inferred if the sample means are different?

Does the information embedded in the data increase or decrease when n is

increased?

Variance and standard deviation

The second most used population parameter is the variance (or standard devia-
tion). It is a measure describing the spread of the distribution, more specifically
the spread away from the mean.
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 58

Definition 2.16 Variance

The variance of a discrete random variable X is
∞
2 2
σ = V( X ) = E[( X − µ) ] = ∑ ( x i − µ )2 f ( x i ), (2-17)
i =1

where xi is the outcome value and f ( xi ) is the pdf of the ith outcome value.
The standard deviation σ is the square root of the variance.

The variance is the expected value (i.e. average (weighted by probabilities)) of

the squared distance between the outcome and the mean value.

Remark 2.17
Notice that the variance cannot be negative.

The standard deviation is measured on the same scale (same units) as the ran-
dom variable, which is not case for the variance. Therefore the standard de-
viation can much easier be interpreted, when communicating the spread of a
distribution.
Consider how the expected value is calculated in Equation (2-15).
One can think of the squared distance as a new random variable
that has an expected value which is the variance of X.

Example 2.18

The variance of rolls with the fair dice is

σx2fair = E[( X fair − µ Xfair )2 ]

1 1 1 1 1 1
= (1 − 3.5)2 + (2 − 3.5)2 + (3 − 3.5)2 + (4 − 3.5)2 + (5 − 3.5)2 + (6 − 3.5)2
6 6 6 6 6 6
70
=
24
≈ 2.92.

It was seen in Chapter 1, that after an experiment has been carried out n times
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 59

the sample variance can be calculated as defined previously by

n
1
n − 1 i∑
s2 = σ̂2 = ( xi − x̄ )2 , (2-18)
=1
and hence thereby also sample standard deviation s.

Again our intuition tells us that the statistic (e.g. sample variance), should in
some sense converge to the true variance - this is indeed the case and the we
call the sample variance a central estimator for the true underlying variance.
This convergence will be quantified for a special case in Chapter 3.
The sample variance is calculated by:
• Take the sample mean: x̄
• Take the distance for each sample: xi − x̄
• Finally, take the average of the squared distances (using n −
1 in the denominator, see Chapter 1)

Example 2.19 Simulate and estimate the variance

Return to the simulations. First calculate the sample variance from n rolls of a fair
dice

## Simulate a fair dice and calculate the sample variance

## Number of realizations
n <- 30
## Simulate
xFair <- sample(1:6, size=n, replace=TRUE)
## Calculate the distance for each sample to the sample mean
distances <- xFair - mean(xFair)
## Calculate the average of the squared distances
sum(distances^2)/(n-1)

[1] 2.764

## Or use the built in function

var(xFair)

[1] 2.764

Let us then try to play with variance in the dice example. Let us now consider a
four-sided dice. The pdf is
Chapter 2 2.2 DISCRETE RANDOM VARIABLES 60

x 1 2 3 4
1 1 1 1
FXfairFour ( x ) 4 4 4 4

Plot the pdf for both the six-sided dice and the four-sided dice

## Plot the pdf of the six-sided dice and the four-sided dice
plot(rep(1/6,6), type="h", col="red")
plot(rep(1/4,4), type="h", col="blue")
1.0

1.0
Four sided dice density
Six sided dice density
0.8

0.8
0.6

0.6
0.4

0.4
0.2

0.2
0.0

0.0

1 2 3 4 5 6 1 2 3 4 5 6
x x

## Calculate the means and variances of the dices

## The means
muXSixsided <- sum((1:6)*1/6) # Six-sided
muXFoursided <- sum((1:4)*1/4) # Four-sided
## The variances
sum((1:6-muXSixsided)^2*1/6)

[1] 2.917

sum((1:4-muXFoursided)^2*1/4)

[1] 1.25

Which dice outcome has the highest variance? is that as you had antici-
pated?
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 61

2.3 Discrete distributions

In this section the discrete distributions included in the material are presented.
See the overview of all distributions in the collection of formulas Section A.2.1.

In R, implementations of many different distributions are available. For each

distribution at least the following is available

• The pdf is available by preceding with 'd', e.g. for the binomial distribu-
tion dbinom
• The cdf is available by preceding with 'p', e.g. pbinom
• The quantiles by preceding with 'q', e.g. qbinom
• Random number generation by preceding with 'r' e.g. rbinom

See for example the help with ?dbinom in R and see the names of all the R func-
tions in the overview A.2.1. They are demonstrated below in this section for the
discrete and later for the continuous distributions, see them demonstrated for
the normal distribution in Example 2.45.

2.3.1 Binomial distribution

The binomial distribution is a very important discrete distribution and appears

in many applications, it is presented in this section. In statistics it is typically
used for proportions as explained in Chapter 7 .

If an experiment has two possible outcomes (e.g. failure or success, no or yes, 0

or 1) and is repeated more than one time, then the number of successes may be
binomial distributed. For example the number of heads obtained after a certain
number of flips with a coin. Each repetition must be independent. In relation to
random sampling this corresponds to successive draws with replacement (think
of drawing notes from a hat, where after each draw the note is put back again,
i.e. the drawn number is replaced again).
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 62

Definition 2.20 Binomial distribution

Let the random variable X be binomial distributed

X ∼ B(n, p), (2-19)

where n is number of independent draws and p is the probability of a suc-

cess in each draw.
The binomial pdf describes probability of obtaining x successes

n x
f ( x; n, p) = P( X = x ) = p (1 − p ) n − x , (2-20)
x

where

n n!
= , (2-21)
x x!(n − x )!

is the number of distinct sets of x elements which can be chosen from a set
of n elements. Remember that n! = n · (n − 1) · . . . · 2 · 1.

Theorem 2.21 Mean and variance

The mean of a binomial distributed random variable is

µ = np, (2-22)

and the variance is

σ2 = np(1 − p). (2-23)

Actually this can be proved by calculating the mean using Definition 2.13 and
the variance using Definition 2.16.
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 63

Example 2.22 Simulation with a binomial distribution

The binomial distribution for 10 flips with a coin describe probabilities of getting x
heads (or equivalently tails)

## Simulate a binomial distributed experiment

## Number of flips
nFlips <- 10
## The possible outcomes are (0,1,...,nFlips)
xSeq <- 0:nFlips
## Use the dbinom() function which returns the pdf, see ?dbinom
pdfSeq <- dbinom(xSeq, size=nFlips, prob=1/2)
## Plot the density
plot(xSeq, pdfSeq, type="h")
1.0
0.8
0.6
Density
0.4
0.2
0.0

0 2 4 6 8 10
x
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 64

Example 2.23 Simulate 30 successive dice rolls

In the previous examples successive rolls of a dice was simulated. If a random vari-
able which counts the number of sixes obtained X six is defined, it follows a binomial
distribution

## Simulate 30 successive dice rolls

Xfair <- sample(1:6, size=30, replace=TRUE)
## Count the number sixes obtained
sum(Xfair==6)

[1] 9

## This is equivalent to
rbinom(1, size=30, prob=1/6)

[1] 7

2.3.2 Hypergeometric distribution

The hypergeometric distribution describes number of successes from successive

draws without replacement.
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 65

Definition 2.24 Hypergeometric distribution

Let the random variable X be the number of successes in n draws without
replacement. Then X follows the hypergeometric distribution

X ∼ H (n, a, N ), (2-24)

where a is the number of successes in the N elements large population. The

probability of obtaining x successes is described by the hypergeometric pdf

( xa )( N −a
n− x )
f ( x; n, a, N ) = P( X = x ) = . (2-25)
( Nn )

The notation

a a!
= , (2-26)
b b!( a − b)!

represents the number of distinct sets of b elements which can be chosen

from a set of a elements.

Theorem 2.25 Mean and variance

The mean of a hypergeometric distributed random variable is
a
µ=n , (2-27)
N
and the variance is
a( N − a) N − n
σ2 = n . (2-28)
N2 N−1

Example 2.26 Lottery probabilities using the hypergeometric dis-

tribution

A lottery drawing is a good example where the hypergeometric distribution can be

applied. The numbers from 1 to 90 are put in a bowl and randomly drawn without
replacement (i.e. without putting back the number when it has been drawn). Say
that you have the sheet with 8 numbers and want to calculate the probability of
getting all 8 numbers in 25 draws.
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 66

## The probability of getting x numbers of the sheet in 25 drawings

## Number of successes in the population

a <- 8
## Size of the population
N <- 90
## Number of draws
n <- 25
## Plot the pdf, note: parameters names are different in the R function
plot(0:8, dhyper(x=0:8,m=a,n=N-a,k=n), type="h")
0.5

n = 25
a=8
0.4

N = 90
0.3
Density
0.2
0.1
0.0

0 2 4 6 8
x

2.3.3 Poisson distribution

The Poisson distribution describes the probability of a given number of events

occurring in a fixed interval if these events occur with a known average rate
and independently of the distance to the last event. Often it is events in a time
interval, but can as well be counts in other intervals, e.g. of distance, area or
volume. In statistics the Poisson distribution is usually applied for analyzing
for example counts of: arrivals, traffic, failures and breakdowns.
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 67

Definition 2.27 Poisson distribution

Let the random variable X be Poisson distributed

X ∼ Po(λ), (2-29)

where λ is the rate (or intensity): the average number of events per interval.
The Poisson pdf describes the probability of x events in an interval

λ x −λ
f ( x; λ) = e . (2-30)
x!

Theorem 2.28 Mean and variance

A Poisson distributed random variable X has exactly the rate λ as the mean

µ = λ, (2-31)

and variance

σ2 = λ. (2-32)

Example 2.29

The Poisson distribution is typically used to describe phenomena such as:

• the number radioactive particle decays per time interval, i.e. the number of
clicks per time interval of a Geiger counter
• calls to a call center per time interval (λ does vary over the day)
• number of mutations in a given stretch of DNA after a certain amount of radi-
ation
• goals scored in a soccer match

One important feature is that the rate can be scaled, such that probabilities of
occurrences in other interval lengths can be calculated. Usually the rate is de-
noted with the interval length, for example the hourly rate is denoted as λhour
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 68

and can be scaled to the minutely rate by

minute λhour
λ = , (2-33)
60
such the probabilities of x events per minute can be calculated with the Poisson
pdf with rate λminute .

Example 2.30 Rate scaling

You are enjoying a soccer match. Assuming that the scoring of goals per match in
the league is Poisson distributed and on average 3.4 goals are scored per match.
Calculate the probability that no goals will be scored while you leave the match for
10 minutes.

Let λ90minutes = 3.4 be goals per match and scale this to the 10 minute rate by

λ90minutes 3.4
λ10minutes = = . (2-34)
9 9
Let X be the number of goals in 10 minute intervals and use this to calculate the
probability of no events a 10 minute interval by

P( X = 0) = f (0, λ10minutes ) ≈ 0.685, (2-35)

which was found with the R code

## Probability of no goals in 10 minutes

## The Poisson pdf

dpois(x=0, lambda=3.4/9)

[1] 0.6854
Chapter 2 2.3 DISCRETE DISTRIBUTIONS 69

Example 2.31 Poisson distributed random variable

Simulate a Poisson distributed random variable to see the Poisson distribution

## Simulate a Poisson random variable

## The mean rate of events per interval

lambda <- 4
## Number of realizations
n <- 1000
## Simulate
x <- rpois(n, lambda)
## Plot the empirical pdf
plot(table(x)/n)
## Add the pdf to the plot
lines(0:20, dpois(0:20,lambda), type="h", col="red")
0.5

Empirical pdf
pdf
0.4
0.3
Density
0.2
0.1
0.0

0 1 2 3 4 5 6 7 8 9 10 11
x
Chapter 2 2.4 CONTINUOUS RANDOM VARIABLES 70

2.4 Continuous random variables

If an outcome of an experiment takes a continuous value, for example: a dis-

tance, a temperature, a weight, etc., then it is represented by a continuous ran-
dom variable.

Definition 2.32 Density and probabilities

The pdf of a continuous random variable X is a non-negative function for all
possible outcomes

f ( x ) ≥ 0 for all x, (2-36)

and has an area below the function of one

Z ∞
f ( x )dx = 1. (2-37)
−∞

It defines the probability of observing an outcome in the range from a to b

by
Z b
P( a < X ≤ b) = f ( x )dx. (2-38)
a

For the discrete case the probability of observing an outcome x is equal to the
pdf of x, but this is not the case for a continuous random variable, where
Z x
P( X = x ) = P( x < X ≤ x ) = f (u)du = 0, (2-39)
x

i.e. the probability for a continuous random variable to be realized at a single

number P( X = x ) is zero.

The plot in Figure 2.1 shows how the area below the pdf represents the proba-
bility of observing an outcome in a range. Note that the normal distribution is
used here for the examples, it is introduced in Section 2.5.2.
Chapter 2 2.4 CONTINUOUS RANDOM VARIABLES 71

0.4
P( a < X ≤ b)
0.3
f (x)
0.2
0.1
0.0

-4 -2 0 2 4
a b
x
Figure 2.1: The probability of observing the outcome of X in the range between
a and b is the area below the pdf spanning the range, as illustrated with the
colored area.

Definition 2.33 Distribution

The cdf of a continuous variable is defined by
Z x
F ( x ) = P( X ≤ x ) = f (u)du, (2-40)
−∞

and has the properties (in both the discrete and continuous case): the cdf is
non-decreasing and

lim F ( x ) = 0 and lim F ( x ) = 1. (2-41)

x →−∞ x →∞

The relation between the cdf and the pdf is

Z b
P( a < X ≤ b) = F (b) − F ( a) = f ( x )dx, (2-42)
a

as illustrated in Figures 2.1 and 2.2.

Also as the cdf is defined as the integral of the pdf, the pdf becomes the derivative
of the cdf

d
f (x) = F(x) (2-43)
dx
Chapter 2 2.4 CONTINUOUS RANDOM VARIABLES 72

1.0

P( a < X ≤ b) = F (b) − F ( a)
0.8
0.6
F(x)
0.4
0.2
0.0

-4 -2 0 2 4
a b
x
Figure 2.2: The probability of observing the outcome of X in the range between
a and b is the distance between F ( a) and F (b).

2.4.1 Mean and Variance

Definition 2.34 Mean and variance

For a continuous random variable the mean or expected value is
Z ∞
µ = E( X ) = x f ( x )dx, (2-44)
−∞

hence similar as for the discrete case the outcome is weighted with the pdf.
The variance is
Z ∞
2 2
σ = E[( X − µ) ] = ( x − µ)2 f ( x )dx, (2-45)
−∞

The differences between the discrete and the continuous case can be summed
up in two points:

• In the continuous case integrals are used, in the discrete case sums are
used.
• In the continuous case the probability of observing a single value is always
zero. In the discrete case it can be positive or zero.
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 73

2.5 Continuous distributions

2.5.1 Uniform distribution

A random variable following the uniform distribution has equal density at any
value within a defined range.

Definition 2.35 Uniform distribution

Let X be a uniform distributed random variable

X ∼ U (α, β), (2-46)

where α and β defines the range of possible outcomes. It has the pdf
(
1
for x ∈ [α, β]
f ( x ) = β−α . (2-47)
0 otherwise

The uniform cdf is



0
 for x < α
x −α
F(x) = for x ∈ [α, β) . (2-48)
 β−α

1 for x ≥ β

In Figure 2.3 the uniform pdf and cdf are plotted.

Theorem 2.36 Mean and variance of the uniform distribution

The mean of a uniform distributed random variable X is
1
µ= ( α + β ), (2-49)
2
and the variance is
1
σ2 = ( β − α )2 . (2-50)
12
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 74

P( a < X ≤ b) = F (b) − F ( a)
P( a < X ≤ b)
β−α

F (b)
1

F(x)
f (x)

F ( a)
0

0
α a b β α a b β
x x
Figure 2.3: The uniform distribution pdf and cdf.

2.5.2 Normal distribution

The most famous continuous distribution is the normal distribution for many
reasons. Often it is also called the Gaussian distribution. The normal distribu-
tion appears naturally for many phenomena and is therefore used in extremely
many applications, which will be apparent in later chapters of the book.

Definition 2.37 Normal distribution

Let X be a normal distributed random variable

X ∼ N (µ, σ2 ), (2-51)

where µ is the mean and σ2 is the variance (remember that the standard
deviation is σ). Note that the two parameters are actually the mean and
variance of X.
It follows the normal pdf

1 ( x − µ )2
−
f ( x ) = √ e 2σ2 , (2-52)
σ 2π
and the normal cdf
Z x ( u − µ )2
1 −
F(x) = √ e 2σ2 du. (2-53)
σ 2π −∞
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 75

Theorem 2.38 Mean and variance

The mean of a Normal distributed random variable is

µ, (2-54)

and the variance is

σ2 . (2-55)

Hence simply the two parameters defining the distribution.

Example 2.39 The normal pdf

Example: Let us play with the normal pdf

## Play with the normal distribution

## The mean and standard deviation

muX <- 0
sigmaX <- 1
## A sequence of x values
xSeq <- seq(-6, 6, by=0.1)
##
pdfX <- 1/(sigmaX*sqrt(2*pi)) * exp(-(xSeq-muX)^2/(2*sigmaX^2))
## Plot the pdf
plot(xSeq, pdfX, type="l", xlab="$x$", ylab="f(x)")
0.4
0.3
f(x)
0.2
0.1
0.0

-6 -4 -2 0 2 4 6
x
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 76

Try with different values of the mean and standard deviation. Describe how
this change the position and spread of the pdf?

Theorem 2.40 Linear combinations of normal random variables

Let X1 , . . . , Xn be independent normal random variables, then any linear
combination of X1 , . . . , Xn will follow a normal distribution, with mean and
variance given in Theorem 2.56.

Use the mean and variance identities introduced in Section 2.7 to find the mean
and variance of the linear combination as exemplified here:

Example 2.41

Consider two normal distributed random variables

X1 ∼ N (µ X1 , σX2 1 ) and X2 ∼ N (µ X2 , σX2 2 ). (2-56)

The difference

Y = X1 − X2 , (2-57)

is normal distributed

Y ∼ N (µY , σY2 ), (2-58)

where the mean is

µ Y = µ X1 − µ X2 , (2-59)

and

σY2 = σX2 1 + σX2 2 , (2-60)

where the mean and variance identities introduced in Section 2.7 have been used.

Standard normal distribution

Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 77

Definition 2.42 Standard normal distribution

The standard normal distribution is the normal distribution with zero mean
and unit variance

Z ∼ N (0, 1), (2-61)

where Z is the standardized normal random variable.

Historically before the widespread use of computers the standardized random

variables were used a lot, since it was not possible to easily evaluate the pdf and
cdf, instead they were looked up in tables for the standardized distributions.
This was smart since transformation into standardized distributions requires
only a few simple operations.

Theorem 2.43 Transformation to the standardized normal random

variable
A normal distributed random variable X can be transformed into a stan-
dardized normal random variable by

X−µ
Z= . (2-62)
σ

Example 2.44 Quantiles in the standard normal distribution

The most used quantiles (or percentiles) in the standard normal distribution are

Percentile 1% 2.5% 5% 25% 75% 95% 97.5% 99%

Quantile 0.01 0.025 0.05 0.25 0.75 0.95 0.975 0.99
Value -2.33 -1.96 -1.64 -0.67 0.67 1.64 1.96 2.33

Note that the values can be considered as standard deviations (i.e. for Z the stan-
dardized normal then σZ = 1), which holds for any normal distribution.

The most used quantiles are marked on the plot

Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 78

0.4
0.3
Density
0.2

q0.25 q0.75
0.1

q0.05 q0.95
q0.025
q0.01 q0.975
q0.99
0.0

-3 -2 -1 0 1 2 3
z (σ)

Note that the units on the x-axis is in standard deviations.

Normal pdf details

In order to get insight into how the normal distribution is formed consider the
following steps. In Figure 2.4 the result of each step is plotted:

1. Take the distance to the mean: x − µ

2. Square the distance: ( x − µ)2

−( x −µ)2
3. Make it negative and scale it: (2σ2 )

−( x −µ)2
4. Take the exponential: e (2σ2 )

−( x −µ)2
5. Finally, scale it to have an area of one: √1 e (2σ2 )
σ 2π
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 79

2
1
0

Step 1
Step 2
-1

Step 3
Step 4
Step 5
-2

-2 -1 0 1 2
x
Figure 2.4: The steps involved in calculating the normal distribution pdf.

Example 2.45 R functions for the normal distribution

In R functions to generate values from many distributions are implemented. For the
normal distribution the following functions are available:
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 80

## Do it for a sequence of x values

xSeq <- c(-3,-2,1,0,1,2,3)
## The pdf
dnorm(xSeq, mean=0, sd=1)

[1] 0.004432 0.053991 0.241971 0.398942 0.241971 0.053991 0.004432

## The cdf
pnorm(xSeq, mean=0, sd=1)

[1] 0.00135 0.02275 0.84134 0.50000 0.84134 0.97725 0.99865

## The quantiles
qnorm(c(0.01,0.025,0.05,0.5,0.95,0.975,0.99), mean=0, sd=1)

[1] -2.326 -1.960 -1.645 0.000 1.645 1.960 2.326

## Generate random normal distributed realizations

rnorm(n=10, mean=0, sd=1)

[1] 0.59716 -0.27049 1.28617 0.06501 -3.13349 -1.09420 -1.16043

[8] 1.04028 0.42958 0.60432

## Calculate the probability that that the outcome of X is between a and b

a <- 0.2
b <- 0.8
pnorm(b) - pnorm(a)

[1] 0.2089

## See more details by running "?dnorm"

Use the functions to make a plot of the normal pdf with marks of the
2.5%, 5%, 95%, 97.5% quantiles.

Make a plot of the normal pdf and a histogram (empirical pdf) of 100 simu-
lated realizations.
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 81

2.5.3 Log-Normal distribution

If a random variable is log-normal distributed then its logarithm is normally

distributed.

Definition 2.46 Log-Normal distribution

A log-normal distributed random variable

X ∼ LN (α, β2 ), (2-63)

where α is the mean and β2 is the variance of the normal distribution ob-
tained when taking the natural logarithm to X.
The log-normal pdf is
ln x −α) 2
1 −(
2β2
f (x) = √ e . (2-64)
x 2πβ

Theorem 2.47 Mean and variance of log-normal distribution

Mean of the log-normal distribution
2 /2
µ = eα+ β , (2-65)

and variance
2 2
σ2 = e2α+ β (e β − 1). (2-66)

The log-normal distribution occurs in many fields, in particular: biology, fi-

nance and many technical applications.

2.5.4 Exponential distribution

The usual application of the exponential distribution is for describing the length
(usually time) between events which, when counted, follows a Poisson distri-
bution, see Section 2.3.3. Hence the length between events which occur contin-
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 82

uously and independently at a constant average rate.

Definition 2.48 Exponential distribution

Let X be an exponential distributed random variable

X ∼ Exp(λ), (2-67)

where λ is the average rate of events.

It follows the exponential pdf

(
λe−λx for x ≥ 0
f (x) = . (2-68)
0 for x < 0

Theorem 2.49 Mean and variance of exponential distribution

Mean of an exponential distribution is

1
µ= , (2-69)
λ
and the variance is
1
σ2 = . (2-70)
λ2
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 83

Example 2.50 Exponential distributed time intervals

Simulate a so-called Poisson process, which has exponential distributed time inter-
val between events

## Simulate exponential waiting times

## The rate parameter: events per time

lambda <- 4
## Number of realizations
n <- 1000
## Simulate
x <- rexp(n, lambda)
## The empirical pdf
hist(x, probability=TRUE)
## Add the pdf to the plot
curve(dexp(xseq,lambda), xname="xseq", add=TRUE, col="red")
2.5
2.0
Density
1.5
1.0
0.5
0.0

0.0 0.5 1.0 1.5

Furthermore check that by counting the events in fixed length intervals that they
follow a Poisson distribution.
Chapter 2 2.5 CONTINUOUS DISTRIBUTIONS 84

## Check the relation to the Poisson distribution

## by counting the events in each interval

## Sum up to get the running time

xCum <- cumsum(x)
## Use the hist function to count in intervals between the breaks,
## here 0,1,2,...
tmp <- hist(xCum, breaks=0:ceiling(max(xCum)))
## Plot the discrete empirical pdf
plot(table(tmp$counts)/length(tmp$counts))
## Add the Poisson pdf to the plot
lines(0:20, dpois(0:20,lambda), type="h", col="red")
0.30

Empirical pdf
pdf
0.20
Density
0.10
0.00

0 1 2 3 4 5 6 7 8 9 10 11 12
x
Chapter 2 2.6 SIMULATION OF RANDOM VARIABLES 85

t2 t4 t6 t8 t10

t1 t3 t5 t7 t9

0.0 0.5 1.0 1.5 2.0 2.5

Time
Figure 2.5: Exponential distributed time intervals between events forms a so-
called Poisson process.

2.6 Simulation of random variables

The basic concept of simulation was introduced in Section 2.6 and we have al-
ready applied the in-built functions in R for generating random numbers from
any implemented distribution, see how in Section 2.3.1. In this section it is ex-
plained how realizations of a random variable can be generated from any prob-
ability distribution – it is the same technique for both discrete and continuous
distributions.

Basically, a computer obviously cannot create a result/number, which is ran-

dom. A computer can give an output as a function of an input. (Pseudo) ran-
dom numbers from a computer are generated from a specially designed algo-
rithm - called a random number generator, which once started can make the
number xi+1 from the number xi . The algorithm is designed in such a way that
when looking at a sequence of these values, in practice one cannot tell the dif-
ference between them and a sequence of real random numbers. The algorithm
needs a start input, called the “seed”, as explained above Remark 2.12. Usually,
you can manage just fine without having to worry about the seed issue since
the program itself finds out how to handle it appropriately. Only if you want
to be able to recreate exactly the same results you need to set seed value. For
details about this and the random number generators used in R, type ?Random.

Actually, a basic random number generator typically generates (pseudo) ran-

dom numbers between 0 and 1 in the sense that numbers in practice follow the
uniform distribution on the interval 0 to 1, see Section 2.35. Actually, there is a
simple way how to come from the uniform distribution to any kind of distribu-
Chapter 2 2.6 SIMULATION OF RANDOM VARIABLES 86

tion:

Theorem 2.51
If U ∼ Uniform(0, 1) and F is a distribution function for any probability
distribution, then F −1 (U ) follow the distribution given by F

Recall, that the distribution function F in R is given by the p versions of the

distributions, while F −1 is given by the q versions.

Example 2.52 Random numbers in R

We can generate 100 normally distributed N (2, 32 ) numbers similarly the following
two ways:

## Generate 100 normal distributed values

rnorm(100, mean=2, sd=3)
## Similarly, generate 100 uniform distributed values from 0 to 1 and
## put them through the inverse normal cdf
qnorm(runif(100), mean=2, sd=3)

Example 2.53 Simulating the exponential distribution

Consider the exponential distribution with λ = 1/β = 1/2, that is, with density
function

f ( x ) = λe−λx ,

for x > 0 and 0 otherwise. The distribution function is

Z x
F(x) = f (t)dt = 1 − e−0.5x .
0

The inverse of this distribution function can be found by solving

u = 1 − e−0.5x ⇔ x = −2 log(1 − u).

So if random numbers U ∼ Uniform(0, 1) then −2 log(1 − U ) follows the exponen-

tial distribution with λ = 1/2 (and β = 2). We confirm this in R here:
Chapter 2 2.6 SIMULATION OF RANDOM VARIABLES 87

## Three equivalent ways of simulating the exponential distribution

## with lambda=1/2
re1 <- -2*log(1-runif(10000))

re2 <- qexp(runif(10000), 1/2)

re3 <- rexp(10000, 1/2)

## Check the means and variances of each

c(mean(re1), mean(re2), mean(re3))

[1] 2.007 1.987 1.996

c(var(re1), var(re2), var(re3))

[1] 3.948 3.903 3.871

This can be illustrated by plotting the distribution function (cdf) for the exponential
distribution with λ = 1/2 and 5 random outcomes
1.0
0.8
Uniform outcomes
0.6
0.4
0.2
0.0

-5 0 5 10
Exponential outcomes

But since R has already done all this for us, we do not really need this as long as
we only use distributions that have already been implemented in R. One can use
the help function for each function, for example. ?rnorm, to check exactly how
to specify the parameters of the individual distributions. The syntax follows
exactly what is used in p, d and q versions of the distributions.
Chapter 2 2.7 IDENTITIES FOR THE MEAN AND VARIANCE 88

2.7 Identities for the mean and variance

Rules for calculation of the mean and variance of linear combinations of in-
dependent random variables are introduced here. They are valid for both the
discrete and continuous case.

Theorem 2.54 Mean and variance of linear functions

Let Y = aX + b then

E(Y ) = E( aX + b) = a E( X ) + b, (2-71)

and

V(Y ) = V( aX + b) = a2 V( X ). (2-72)

Random variables are often scaled (i.e. aX) for example when shifting units:

Example 2.55

The mean of a bike shops sale is 100 bikes per month and varies with a standard
deviation of 15. They earn 200 Euros per bike. What is the mean and standard
deviation of their earnings per month?

Let X be the number of bikes sold per month. On average they sell µ X = 100 bikes
per month and it varies with a variance of σX2 = 225. The shops monthly earnings

Y = 200X,

has then a mean and standard deviation of

µY = E(Y ) = E(200X ) = 200 E( X ) = 200 · 100 = 20000 Euro/month,

q q q √
σY = V(Y ) = V(200X ) = 2002 V( X ) = 40000 · 225 = 3000 Euro/month.
Chapter 2 2.7 IDENTITIES FOR THE MEAN AND VARIANCE 89

Theorem 2.56 Mean and variance of linear combinations

The mean of a linear combination of independent random variables is

E ( a 1 X1 + a 2 X2 + · · · + a n X n ) = a 1 E ( X1 ) + a 2 E ( X2 ) + · · · + a n E ( X n ) ,
(2-73)

and the variance

V( a1 X1 + a2 X2 + · · · + an Xn ) = a21 V( X1 ) + a22 V( X2 ) + · · · + a2n V( Xn ).

(2-74)

Example 2.57

Lets take a dice example to emphasize an important point. Let Xi represent the
outcome of a roll with a dice with mean µ X and standard deviation σX .

Now, consider a scaling of a single roll with a dice, say five times

Y scale = 5X1 ,

then the mean will scale linearly

E(Y scale ) = E(5X1 ) = 5 E( X1 ) = 5 µ X ,

and the standard deviation also scales linearly

σY2 scale = V(5X1 ) = 52 V( X1 ) = 52 σX2 ⇔ σYscale = 5 σX .

Whereas for a sum of five rolls

Y sum = X1 + X2 + X3 + X4 + X5 ,

the mean will similarly scale linearly

E(Y sum ) = E( X1 + X2 + X3 + X4 + X5 )
= E ( X1 ) + E ( X2 ) + E ( X3 ) + E ( X4 ) + E ( X5 )
= 5 µX ,

however the standard deviation will increase only with the square root

σY2 sum = V( X1 + X2 + X3 + X4 + X5 )
= V ( X1 ) + V ( X2 ) + V ( X3 ) + V ( X4 ) + V ( X5 )
= 5 σX2 ⇔
√
σYsum = 5 σX .
Chapter 2 2.7 IDENTITIES FOR THE MEAN AND VARIANCE 90

This is simply because when applying the sum to many random outcomes, then
the high and low outcomes will even out each other, such that the variance will be
smaller for a sum than for a scaling.
Chapter 2 2.8 COVARIANCE AND CORRELATION 91

2.8 Covariance and correlation

In this chapter we have discussed mean and variance (or standard deviation),
and the relation to the sample mean and sample variance, see Section 2.2.2. In
Chapter 1 Section 1.4.3 we discussed the sample covariance and sample correla-
tion, these two measures also have theoretical justification, namely covariance
and correlation, which we will discuss in this section. We start by the definition
of covariance.

Definition 2.58 Covariance

Let X and Y be two random variables, then the covariance between X and
Y, is

Cov( X, Y ) = E[( X − E[ X ])(Y − E[Y ])] . (2-75)

Remark 2.59
It follows immediately from the definition that Cov( X, X ) = V( X ) and
Cov( X, Y ) = Cov(Y, X ).

An important concept in statistics is independence (see Section 2.9 for a formal

definition). We often assume that realizations (random variables) are indepen-
dent. If two random variables are independent then their covariance will be
zero, the reverse is however not necessarily true (see also the discussion on
sample correlation in Section 1.4.3).

The following calculation rule apply to covariance between two random vari-
ables X and Y:

Theorem 2.60 Covariance between linear combinations

Let X and Y be two random variables, then
Cov( a0 + a1 X + a2 Y, b0 + b1 X + b2 Y ) = a1 b1 V( X ) + a2 b2 V(Y ) + ( a1 b2 + a2 b1 ) Cov( X, Y ).
(2-76)
Chapter 2 2.8 COVARIANCE AND CORRELATION 92

Proof

Let Z1 = a0 + a1 X + a2 Y and Z2 = b0 + b1 X + b2 Y then

Cov( Z1 , Z2 ) = E[( a1 ( X − E[ X ]) + a2 (Y − E[Y ]))(b1 ( X − E[ X ]) + b2 (Y − E[Y ]))]

= E[ a1 ( X − E[ X ])b1 ( X − E[ X ])] + E[ a1 ( X − E[ X ])b2 (Y − E[Y ])]+
E[ a2 (Y − E[Y ])b1 ( X − E[ X ])] + E[ a2 (Y − E[Y ])b2 (Y − E[Y ])]
= a1 b1 V( X ) + a2 b2 V(Y ) + ( a1 b2 + a2 b2 ) Cov( X, Y ). (2-77)

Example 2.61

Let X ∼ N (3, 22 ) and Y ∼ N (2, 1) and the covariance between X and Y given by
Cov( X, Y ) = 1. What is the variance of the random variable Z = 2X − Y?

V( Z ) = Cov[2X − Y, 2X − Y ] = 22 V( X ) + V(Y ) − 4 Cov( X, Y )

= 22 22 + 1 − 4 = 13.

We have already seen in Section 1.4.3 that the sample correlation measures the
observed degree of linear dependence between two random variables – calcu-
lated from samples observed on the same observational unit e.g. height and
weight of persons. The theoretical counterpart is the correlation between two
random variables – the true linear dependence between the two variables:

Definition 2.62 Correlation

Let X and Y be two random variables with V( X ) = σx2 , V(Y ) = σy2 , and
Cov( X, Y ) = σxy , then the correlation between X and Y is
σxy
ρ xy = . (2-78)
σx σy

Remark 2.63
The correlation is a number between -1 and 1.
Chapter 2 2.8 COVARIANCE AND CORRELATION 93

Example 2.64

Let X ∼ N (1, 22 ) and e ∼ N (0, 0.52 ) be independent random variables, find the
correlation between X and Z = X + e.

The variance of Z is

V( Z ) = V( X + e) = V( X ) + V(e) = 4 + 0.25 = 4.25.

The covariance between X and Z is

Cov( X, Z ) = Cov( X, X + e) = V( X ) = 4,

and hence
4
ρ xz = √ = 0.97.
4.25 · 4
Chapter 2 2.9 INDEPENDENCE OF RANDOM VARIABLES 94

2.9 Independence of random variables

In statistics the concept of independence is very important, and in order to

give a formal definition of independence we will need the definition of two-
dimensional random variables. The probability density function of a two-dimensional
discrete random variable, called the joint probability density function, is,

Definition 2.65 Joint pdf of two-dimensional discrete random vari-

ables
The pdf of a two-dimensional discrete random variable [ X, Y ] is

f ( x, y) = P( X = x, Y = y), (2-79)

with the properties

f ( x, y) ≥ 0 for all ( x, y), (2-80)

∑∑ f ( x, y) = 1. (2-81)
all x all y

Remark 2.66
P( X = x, Y = y) should be read: the probability of X = x and Y = y.

Example 2.67

Imagine two throws with an fair coin: the possible outcome of each throw is either
head or tail, which will be given the values 0 and 1 respectively. The complete set of
outcomes is (0,0), (0,1), (1,0), and (1,1) each with probability 1/4. And hence the pdf
is
1
f ( x, y) = ; x = {0, 1}, y = {0, 1},
4
further we see that
1 1 1
∑∑ f ( x, y) = ∑ ( f (x, 0) + f (x, 1)) = f (0, 0) + f (0, 1) + f (1, 0) + f (1, 1)
x =0 y =0 x =0

= 1.
Chapter 2 2.9 INDEPENDENCE OF RANDOM VARIABLES 95

The formal definition of independence for a two dimensional discrete random

variable is:

Definition 2.68 Independence of discrete random variables

Two discrete random variables X and Y are said to be independent if and
only if

P( X = x, Y = y) = P( X = x ) P(Y = y). (2-82)

Example 2.69

Example 2.67 is an example of two independent random variables, to see this write
the probabilities
1
1
P ( X = 0) = ∑ f (0, y) =
2
,
y =0
1
1
P ( X = 1) = ∑ f (1, y) =
2
.
y =0

similarly P(Y = 0) = 12 and P(Y = 1) = 21 , now we see that P( X = x ) P(Y = y) = 1

4
for all possible x and y, and hence

1
P( X = x ) P(Y = y) = P( X = x, Y = y) = .
4

Example 2.70

Now imagine that for the second throw we don’t see the outcome of Y, but only
observe the sum of X and Y, denote it by
Z = X + Y.
Lets find out if X and Z are independent. In this case the for all outcomes (0, 0),
(0, 1), (1, 1), (1, 2) the joint pdf is
1
P( X = 0, Z = 0) = P( X = 0, Z = 1) = P( X = 1, Z = 1) = P( X = 1, Z = 2) = .
4
The pdf for each variable is: for X
1
P ( X = 0) = P ( X = 1) = ,
2
Chapter 2 2.9 INDEPENDENCE OF RANDOM VARIABLES 96

and for Z
1 1
P ( Z = 0) = P ( Z = 2) = and P( Z = 1) = ,
4 2
thus for example for the particular outcome (0, 0)

1 1 1 1
P ( X = 0) P ( Z = 0) = · = 6= = P( X = 0, Z = 0),
2 4 8 4
the pdf s are not equal and hence we see that X and Z are not independent.

Remark 2.71
In the example above it is quite clear that X and Z cannot be independent.
In real applications we do not know exactly how the outcomes are realized
and therefore we will need to assume independence (or test it).

To be able to define independence of continuous random variables, we will need

the pdf of a two-dimensional random variable:

Definition 2.72 Pdf of two dimensional continous random vari-

ables
The pdf of a two-dimensional continous random variable [ X, Y ] is a function
f ( x, y) from R2 into R+ with the properties

f ( x, y) ≥ 0 for all ( x, y), (2-83)

Z Z
f ( x, y)dxdy = 1. (2-84)

Just as for one-dimensional random variables the probability interpretation is

in form of integrals
Z
P ( X, Y ) ∈ A = f ( x, y)dxdy, (2-85)
A

where A is an area.
Chapter 2 2.9 INDEPENDENCE OF RANDOM VARIABLES 97

Example 2.73 Bivariate normal distribution

The most important two-dimensional distribution is the bivariate normal distribu-

tion
1 1 T Σ −1 ( x − µ )
f ( x1 , x2 ) = p e− 2 ( x−µ)
2π |Σ|
2 ( x −µ )2 +σ ( x −µ )2 −2σ ( x −µ )( x −µ )
σ22 1 1 22 2 2 12 1 1 2 2
1 −
2(σ11 σ22 −σ2 )
= q e 12 ,
2π 2
σ11 σ22 − σ12

where x = ( x1 , x2 ), and µ = [E( X1 ), E( X2 )], and Σ is the so-called variance-

covariance matrix with elements (Σ)ij = σij = Cov( Xi , X j ), note that σ12 = σ21 ,
| · | is the determinant, and Σ−1 is the inverse of Σ.

Definition 2.74 Independence of continous random variables

Two continous random variables X and Y are said to be independent if

f ( x, y) = f ( x ) f (y). (2-86)

We list here some properties of independent random variables.

Theorem 2.75 Properties of independent random variables

If X and Y are independent then

E( XY ) = E( X ) E(Y ), (2-87)

and

Cov( X, Y ) = 0. (2-88)

Let X1 , . . . , Xn be independent and identically distributed random variables

then

Cov( X, Xi − X ) = 0. (2-89)
Chapter 2 2.9 INDEPENDENCE OF RANDOM VARIABLES 98

Proof

Z Z Z Z
E( XY ) = xy f ( x, y)dxdy = xy f ( x ) f (y)dxdy
Z Z (2-90)
= x f ( x )dx y f (y)dy = E( X ) E(Y )

Cov( X, Y ) = E[( X − E( X ))(Y − E(Y ))]

= E[ XY ] − E[E( X )Y ] − E[ X E(Y )] + E( X ) E(Y ) (2-91)
= 0.

Cov( X, Xi − X ) = Cov( X, Xi ) − Cov( X, X )

1 1
= σ2 − 2 Cov ∑ Xi , ∑ Xi (2-92)
n n
1 2 1
= σ − 2 nσ2 = 0.
n n

Remark 2.76
Note that Cov( X, Y ) = 0 does not imply that X and Y are independent.
However, if X and Y follow a bivariate normal distribution, then if X and Y
are uncorrelated then they are also independent.
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 99

2.10 Functions of normal random variables

This section will cover some important functions of a normal random variable.
In general the question of how an arbitrary function of a random variable is dis-
tributed cannot be answered on closed form (ı.e. directly and exactly calculated)
– for answering such questions we must use simulation as a tool, as covered de-
tails in Chapter 4. We have already discussed simulation as a learning tool,
which will also be used in this section.

The simplest function we can think of is a linear combination of normal random

variables, which we from Theorem 2.40 know will follow a normal distribution.
The mean and variance of this normal distribution can be calculated using the
identities given in Theorem 2.56.

Remark 2.77
Note that combining Theorems 2.40 and 2.75, and Remark 2.76 imply that X
and Xi − X are independent.

In addition to the result given above we will cover three additional distribu-
tions: χ2 -distribution, t-distribution and the F-distribution, which are all very
important for the statistical inference covered in the following chapters.
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 100

2.10.1 The χ2 -distribution

The χ2 -distribution (chi-square) is defined by:

Definition 2.78
Let X be χ2 distributed, then its pdf is

1 ν x
f (x) = ν x 2 −1 e − 2 ; x ≥ 0, (2-93)
2 Γ
2 ν
2

where Γ ν
2 is the Γ-function and ν is the degrees of freedom.

An alternative definition (here formulated as a theorem) of the χ2 -distribution

is:

Theorem 2.79
Let Z1 , . . . , Zν be independent random variables following the standard nor-
mal distribution, then
ν
∑ Zi2 ∼ χ2 (ν). (2-94)
i =1

We will omit the proof of the theorem as it requires more probabilty calculus
than covered here. Rather a small example that illustrates how the theorem can
be checked by simulation:
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 101

Example 2.80 simulation of χ2 -distribution

## Simulate 10 realizations from a standard normal distributed variable

n <- 10
rnorm(n)
## Now repeat this 200 times and calculate the sum of squares each time
## Note: the use of the function replicate: it repeats the
## expression in the 2nd argument k times, see ?replicate
k <- 200
x <- replicate(k, sum(rnorm(n)^2))
## Plot the epdf of the sums and compare with the theoretical chisquare pdf
par(mfrow=c(1,2))
hist(x, freq=FALSE)
curve(dchisq(xseq,df=n), xname="xseq", add=TRUE, col="red")
## and the ecdf compared to the cdf
plot(ecdf(x))
curve(pchisq(xseq,df=n), xname="xseq", add=TRUE, col="red")

ecdf(x)
0.10

1.0

epdf ecdf
pdf cdf
0.08

Cumulated density
0.8
0.06

0.6
Density
0.04

0.4
0.02

0.2
0.00

0.0

5 10 15 20 25 30 5 10 15 20 25 30
x x

In the left plot the empirical pdf is compared to the theoretical pdf and in the right
plot the empirical cdf is compared to the theoretical cdf.
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 102

Theorem 2.81
Given a sample of size n from the normal distributed random variables Xi
with variance σ2 , then the sample variance S2 (viewed as random variable)
can be transformed into
( n − 1) S2
χ2 = , (2-95)
σ2
which follows the χ2 -distribution with degrees of freedom ν = n − 1.

Proof

Start by rewriting the expression

n 2
( n − 1) S2 Xi − X
σ2
= ∑ σ
i =1
n 2
Xi − µ + µ − X
=∑ σ
i =1
n n 2
Xi − µ 2 X−µ n
( X − µ)( Xi − µ)
=∑ +∑ −2∑ (2-96)
i =1
σ i =1
σ i =1
σ2
n 2 2
Xi − µ 2 X−µ X−µ
=∑ +n − 2n
i =1
σ σ σ
n 2
Xi − µ 2 X−µ
=∑ − √ ,
i =1
σ σ/ n

Xi − µ X −µ
we know that σ ∼ N (0, 1) and σ/ √ ∼ N (0, 1), and hence the left hand side is a
n
χ (n) distributed random variable minus a χ2 (1) distributed random variable (also
2

X and S2 are independent, see Theorems 2.75, and 2.40, and Remark 2.76). Hence
the left hand side must be χ2 (n − 1).

If someone claims that a sample comes from a specific normal distribution (i.e.
Xi ∼ N (µ, σ2 ), then we can examine probabilities of specific outcomes of the
sample variance. Such calculation will be termed hypethesis test in later chap-
ters.
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 103

Example 2.82 Milk dose machines

A manufacture of machines for dosing milk claims that their machines can dose with
a precision defined by the normal distribution with a standard deviation less than
2% of the dose volume in the operation range. A sample of n = 20 observations was
taken to check if the precision was as claimed. The sample standard deviation was
calculated to s = 0.03.

Hence the claim is that σ ≤ 0.02, thus we want to answer the question: if σ = 0.02
(i.e. the upper limit of the claim), what is then the probability of getting the sampling
deviation s ≥ 0.03?

## Chi-square milk dosing precision

## The sample size

n <- 20
## The claimed deviation
sigma <- 0.02
## The observed sample standard deviation
s <- 0.03
## Calculate the chi-square statistic
chiSq <- (n-1)*s^2 / sigma^2
## Use the cdf to calculate the probability of getting the observed
## sample standard deviation or higher
1 - pchisq(chiSq, df=n-1)

[1] 0.001402

It seems very unlikely that the standard deviation is below 0.02 since the probability
of obtaining the observed sample standard deviation under this condition is very
small. The probability we just found will be termed a p-value in later chapters - the
p-value a very fundamental in testing of hypothesis.

The probability calculated in the above example will be called the p-value in
later chapters and it is a very fundamental concept in statistics.

Theorem 2.83 Mean and variance

Let X ∼ χ2 (ν) then the mean and variance of X is

E( X ) = ν; V( X ) = 2ν. (2-97)
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 104

We will omit the proof of this theorem, but it is easily checked by a symbolic
calculation software (like e.g. Maple).

Example 2.84

We want to calculate the expected value of the sample variance (S2 ) based on n
observations with Xi ∼ N (µ, σ2 ). We have already seen that nσ−21 S2 ∼ χ2 (n − 1) and
we can therefore write
σ2 n − 1
E( S2 ) = 2
E S2
n−1 σ

σ2 n−1 2
= E S
n−1 σ2
σ 2
= ( n − 1) = σ 2 ,
n−1

and we say that S2 is a central estimator for σ2 (the term estimator is introduced in
Section 3.1.3). We can also find the variance of the estimator
2
2 σ2 n−1 2
V( S ) = V S
n−1 σ2
σ4 σ4
= 2
2( n − 1) = 2 .
( n − 1) n−1

Example 2.85 Pooled variance

Suppose now that we have two different samples (not yet realized) X1 , . . . , Xn1 and
Y1 , . . . , Yn2 with Xi ∼ N (µ1 , σ2 ) and Yi ∼ N (µ2 , σ2 ) (both i.i.d.). Let S12 be the sample
variance based on the X’s and S22 be the sample variance based on the Y’s. Now both
S12 and S22 will be central estimators for σ2 , and so will any weighted average of the
type

S2 = aS12 + (1 − a)S22 ; a ∈ [0, 1].

Now we would like to choose a such that the variance of S2 is as small as possible,
and hence we calculate the variance of S2
σ4 σ4
V( S2 ) = a2 2 + (1 − a )2 2
n1 − 1 n2 − 1

4 2 1 2 1
= 2σ a + (1 − a ) .
n1 − 1 n2 − 1
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 105

In order to find the minimum we differentiate with respect to a

∂ V( S2 ) 4 1 1
= 2σ 2a − 2(1 − a )
∂a n1 − 1 n2 − 1

4 1 1 1
= 4σ a + −
n1 − 1 n2 − 1 n2 − 1

4 n 1 + n 2 − 2 1
= 4σ a − ,
(n1 − 1)(n2 − 1) n2 − 1
which is zero for
n1 − 1
a= .
n1 + n2 − 2

In later chapters we will refer to this choice of a as the pooled variance (S2p ), inserting
in (2-98) gives

(n1 − 1)S12 + (n2 − 1)S22

S2p = .
n1 + n2 − 2

Note that S2p is a weighted (proportional to the number of observations) average

of the sample variances. It can also be shown (you are invited to do this) that
n1 + n2 −2 2
σ2
S p ∼ χ2 (n1 + n2 − 2). Further, note that the assumption of equal variance in
the two samples is crucial in the calculations above.

2.10.2 The t-distribution

The t-distribution is the sampling distribution of the sample mean standardized

with the sample variation. It is valid for all sample sizes, however for larger
sample sizes (n > 30) the difference between the t-distribution and the normal
distribution is very small. Hence for larger sample sizes the normal distribution
is often applied.

Definition 2.86
The t-distribution pdf is

Γ ( ν+ 1 − ν +1
2 ) t2 2
f T (t) = √ 1+ , (2-98)
νπ Γ( ν2 ) ν

where ν is the degrees of freedom and Γ() is the Gamma function.

Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 106

The relation between normal random variables and χ2 -distributed random vari-
ables are given in the following theo:rem

Theorem 2.87
Let Z ∼ N (0, 1) and Y ∼ χ2 (ν), then

Z
X= √ ∼ t ( ν ). (2-99)
Y/ν

We will not prove this theorem, but show by an example how this can be illus-
trated by simulation:

Example 2.88 Relation between normal and χ2

## Set simulate parameters

nu <- 8; k <- 200
## Generate the simulated realizations
z <- rnorm(k)
y <- rchisq(k, df=nu)
x <- z/sqrt(y/nu)
## Plot
par(mfrow=c(1,2))
hist(x, freq = FALSE)
curve(dt(xseq, df = nu), xname="xseq", add=TRUE, col="red")
plot(ecdf(x))
curve(pt(xseq, df = nu), xname="xseq", add=TRUE, col="red")

ecdf(x)
0.4

1.0

epdf ecdf
pdf cdf
Cumulated density
0.8
0.3

0.6
Density
0.2

0.4
0.1

0.2
0.0

0.0

-4 -2 0 2 4 -4 -2 0 2 4
x x
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 107

In the left plot the empirical pdf is compared to the theoretical pdf and in the right
plot the empirical cdf is compared to the theoretical cdf.

The t-distribution arises when a sample is taken of a normal distributed random

variable, then the sample mean standardized with the sample variance follows
the t-distribution.

Theorem 2.89
Given a sample of normal distributed random variables X1 , . . . , Xn , then the
random variable
X−µ
T= √ ∼ t ( n − 1), (2-100)
S/ n

follows the t-distribution, where X is the sample mean, µ is the mean of X,

n is the sample size and S is the sample standard deviation.

Proof

X −µ ( n −1) S2
Note that √
σ/ n
∼ N (0, 1) and σ2
∼ χ2 (n − 1) which inserted in Equation (2.87)
gives

X −µ X −µ
√ √
σ/ n 1/ n
T= q = √
( n −1) S2 S2
σ 2 ( n −1) (2-101)
X−µ
= √ ∼ t ( n − 1).
S/ n

We could also verify this by simulation:

Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 108

Example 2.90 Simulation of t-distribution

## Simulate
n <- 8; k <- 200; mu <- 1; sigma <- 2
## Repeat k times the simulation of a normal dist. sample:
## return the values in a (n x k) matrix
x <- replicate(k, rnorm(n, mean=mu, sd=sigma))
xbar <- apply(x, 2, mean)
s <- apply(x, 2, sd)
tobs <- (xbar - mu)/(s/sqrt(n))
## Plot
par(mfrow=c(1,2))
hist(tobs, freq = FALSE)
curve(dt(xseq, df=n-1), xname="xseq", add=TRUE, col="red")
plot(ecdf(tobs))
curve(pt(xseq, df=n-1), xname="xseq", add=TRUE, col="red")
0.4

1.0

epdf ecdf
pdf cdf
Cumulated density
0.8
0.3

0.6
Density
0.2

0.4
0.1

0.2
0.0

0.0

-4 -2 0 2 4 -4 -2 0 2 4
tobs tobs

In the left plot the empirical pdf is compared to the theoretical pdf and in the right
plot the empirical cdf is compared to the theoretical cdf.

Note that X and S are random variables, since they are the sample mean and
standard deviation of a sample consisting of realizations of X, but the sample is
not taken yet.

Very often samples with only few observations are available. In this case by
assuming normality of the population (i.e. the Xi ’s are normal distributed) and
for a some mean µ, the t-distribution can be used to calculate the probability of
obtaining the sample mean in a given range.
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 109

Example 2.91 Electric car driving distance

An electric car manufacture claims that their cars can drive on average 400 km on
a full charge at a specified speed. From experience it is known that this full charge
distance, denote it by X, is normal distributed. A test of n = 10 cars was carried out,
which resulted in a sample mean of x̄ = 382 km and a sample deviation of s = 14.

Now we can use the t-distribution to calculate the probability of obtaining this value
of the sample mean or lower, if their claim about the mean is actually true:

## Calculate the probability of getting the sample mean under the

## conditions that the claim is actually the real mean

## A test of 10 cars was carried out

n <- 10
## The claim is that the real mean is 400 km
muX <- 400
## From the sample the sample mean was calculated to
xMean <- 393
## And the sample deviation was
xSD <- 14
## Use the cdf to calculate the probability of obtaining this
## sample mean or a lower value
pt( (xMean-muX) / (xSD/sqrt(n)), df=n-1)

[1] 0.07415

If we had the same sample mean and sample deviation, how do you think
changing the number of observations will affect the calculated probability?
Try it out.

The t-distribution converges to the normal distribution as the simple size in-
creases. For small sample sizes it has a higher spread than the normal distribu-
tion. For larger sample sizes with n > 30 observations the difference between
the normal and the t-distribution is very small.
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 110

Example 2.92 t-distribution

Generate plots to see how the t-distribution is shaped compared to the normal dis-
tribution.

## Plot the t-distribution for different sample sizes

## First plot the standard normal distribution

curve(dnorm(x), xlim=c(-5,5), xlab="x", ylab="Density")
## Add the t-distribution for 30 observations
curve(dt(x,df=30-1), add=TRUE, col=2)
## Add the t-distribution for 15, 5 and 2 observations
curve(dt(x,df=15-1), add=TRUE, col=3)
curve(dt(x,df=5-1), add=TRUE, col=4)
curve(dt(x,df=2-1), add=TRUE, col=5)
0.4

Norm.
n= 30
n= 15
0.3

n= 5
n= 2
Density
0.2
0.1
0.0

-4 -2 0 2 4
x

How does the number of observations affect the shape of the t-distribution
pdf compared to the normal pdf?
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 111

Theorem 2.93 Mean and variance

Let X ∼ t(ν) then the mean and variance of X is

E( X ) = 0; ν > 1, (2-102)
ν
V( X ) = ; ν > 2. (2-103)
ν−2

We will omit the proof of this theorem, but it is easily checked with a symbolic
calculation software (like e.g. Maple).

Remark 2.94
For ν ≤ 1 the expectation (and hence the variance) is not defined (the inte-
gral is not absolutely convergent), and for ν ∈ (1, 2] (1 < ν ≤ 2) the variance
is equal ∞. Note that this does not violate the general definition of probabil-
ity density functions.

2.10.3 The F-distribution

The F-distribution is defined by:

Definition 2.95
The F-distribution pdf is
ν1 − ν1 +ν2
1 ν1 2 ν1 ν 2
f F (x) = ν1 ν2 x 2 −1 1+ 1x , (2-104)
B 2, 2 ν2 ν2

where ν1 an ν2 are the degrees of freedom and B(·, ·) is the Beta function.

The F-distribution appears as the ratio between two independent χ2 -distributed

random variables:
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 112

Theorem 2.96
Let U ∼ χ2 (ν1 ) and V ∼ χ2 (ν2 ), be independent then

U/ν1
F= ∼ F (ν1 , ν2 ). (2-105)
V/ν2

Again we will omit the proof of the theorem and rather show how it can be
visualized by simulation:

Example 2.97 F-distribution

## Simulate
nu1 <- 8; nu2 <- 10; k <- 200
u <- rchisq(k, df=nu1)
v <- rchisq(k, df=nu2)
fobs <- (u/nu1) / (v/nu2)
## Plot
par(mfrow=c(1,2))
hist(fobs, freq = FALSE)
curve(df(x, df1=nu1, df2=nu2), add=TRUE, col="red")
plot(ecdf(fobs))
curve(pf(x, df1=nu1, df2=nu2), add=TRUE, col="red")
1.0

epdf
0.6

pdf
Cumulated density
0.8
0.6
Density
0.4

0.4
0.2

0.2

ecdf
cdf
0.0

0.0

0 1 2 3 4 5 0 1 2 3 4 5
f obs f obs

Theorem 2.98
Let X1 , . . . , Xn1 be independent and sampled from a normal distribution
with mean µ1 and variance σ12 , further let Y1 , . . . , Yn2 be independent and
sampled from a normal distribution with mean µ2 and variance σ22 . Then
the statistic
S12 /σ12
F= ∼ F (n1 − 1, n2 − 1), (2-106)
S22 /σ22

follows an F-distribution.

Proof

(n1 −1)S12 (n2 −1)S22

Note that σ12
∼ χ2 (n1 − 1) and σ22
∼ χ2 (n2 − 1) and hence

(n1 −1)S12 S12

σ12 (n1 −1) σ12
(n2 −1)S22
= S22
∼ F (n1 − 1, n2 − 1). (2-107)
σ22 (n2 −1) σ22

We can also illustrate this sample version by simulation:

Example 2.99 Relation between normal and F-distribution

## Simulate
n1 <- 8; n2 <- 10; k <- 200
mu1 <- 2; mu2 <- -1
sigma1 <- 2; sigma2 <- 4
s1 <- replicate(k, sd(rnorm(n1, mean=mu1, sd=sigma1)))
s2 <- replicate(k, sd(rnorm(n2, mean=mu2, sd=sigma2)))
fobs <- (s1^2 / sigma1^2) / (s2^2 / sigma2^2)
## Plot
par(mfrow=c(1,2))
hist(fobs, freq=FALSE)
curve(df(xseq, df1=n1-1, df2=n2-1), xname="xseq", add=TRUE, col="red")
plot(ecdf(fobs))
curve(pf(xseq, df1=n1-1, df2=n2-1), xname="xseq", add=TRUE, col="red")
Chapter 2 2.10 FUNCTIONS OF NORMAL RANDOM VARIABLES 114

1.0
epdf
0.6
pdf

Cumulated density
0.8
0.6
Density
0.4

0.4
0.2

0.2
ecdf
cdf
0.0

0.0
0 1 2 3 4 5 6 0 2 4 6
f obs f obs

In the left plot the empirical pdf is compared to the theoretical pdf and in the right
plot the empirical cdf is compared to the theoretical cdf.

Remark 2.100
Of particular importance in statistics is the case when σ1 = σ2 , in this case

S12
F= ∼ F (n1 − 1, n2 − 1). (2-108)
S22

Theorem 2.101 Mean and variance

Let F ∼ F (ν1 , ν2 ) then the mean and variance of F is
ν2
E( F ) = ; ν2 > 2, (2-109)
ν2 − 2
2ν22 (ν1 + ν2 − 2)
V( F ) = ; ν2 > 4. (2-110)
ν1 (ν2 − 2)2 (ν2 − 4)
Chapter 2 2.11 EXERCISES 115

2.11 Exercises

Exercise 2.1 Discrete random variable

a) Let X be a stochastic variable. When running the R-command dbinom(4,10,0.6)

R returns 0.1115, written as:

dbinom(4,10,0.6)
[1] 0.1115

What distribution is applied and what does 0.1115 represent?

b) Let X be the same stochastic variable as above. The following are results
from R:

pbinom(4,10,0.6)
[1] 0.1662

pbinom(5,10,0.6)
[1] 0.3669

Calculate the following probabilities: P( X ≤ 5), P( X < 5), P( X > 4) and

P ( X = 5).

c) Let X be a stochastic variable. From R we get:

dpois(4,3)
[1] 0.168

What distribution is applied and what does 0.168 represent?

d) Let X be the same stochastic variable as above. The following are results
from R:
Chapter 2 2.11 EXERCISES 116

ppois(4,3)
[1] 0.8153

ppois(5,3)
[1] 0.9161

Calculate the following probabilities: P( X ≤ 5), P( X < 5), P( X > 4) and

P ( X = 5).

Exercise 2.2 Course passing proportions

a) If a passing proportion for a course given repeatedly is assumed to be 0.80

on average, and there are 250 students who are taking the exam each time,
what is the expected value, µ and standard deviation, σ, for the number
of students who do not pass the exam for a randomly selected course?

Exercise 2.3 Notes in a box

A box contains 6 notes:

On 1 of the notes there is the number 1

On 2 of the notes there is the number 2
On 2 of the notes there is the number 3
On 1 of the notes there is the number 4

Two notes are drawn at random from the box, and the following random vari-
able is introduced: X, which describes the number of notes with the number 4
among the 2 drawn. The two notes are drawn without replacement.

a) The mean and variance for X, and P( X = 0) are?

b) The 2 notes are now drawn with replacement. What is the probability that
none of the 2 notes has the number 1 on it?
Chapter 2 2.11 EXERCISES 117

Exercise 2.4 Consumer survey

In a consumer survey performed by a newspaper, 20 different groceries (prod-

ucts) were purchased in a grocery store. Discrepancies between the price ap-
pearing on the sales slip and the shelf price were found in 6 of these purchased
products.

a) At the same time a customer buys 3 random (different) products within

the group consisting of the 20 goods in the store. The probability that no
discrepancies occurs for this customer is?

Exercise 2.5 Hay delivery quality

A horse owner receives 20 bales of hay in a sealed plastic packaging. To con-

trol the hay, 3 bales of hay are randomly selected, and each checked whether it
contains harmful fungal spores.

It is believed that among the 20 bales of hay 2 bales are infected with fungal
spores. A random variable X describes the number of infected bales of hay
among the three selected.

a) The mean of X, (µ X ), the variance of X, (σX2 ) and P( X ≥ 1) are?

b) Another supplier advertises that no more than 1% of his bales of hay are
infected. The horse owner buys 10 bales of hay from this supplier, and
decides to buy hay for the rest of the season from this supplier if the 10
bales are error-free.
What is the probability that the 10 purchased bales of hay are error-free, if
1% of the bales from a supplier are infected ( p1 ) and the probability that
the 10 purchased bales of hay are error-free, if 10% of the bales from a
supplier are infected ( p10 )?
Chapter 2 2.11 EXERCISES 118

Exercise 2.6 Newspaper consumer survey

In a consumer survey performed by a newspaper, 20 different groceries (prod-

ucts) were purchased in a grocery store. Discrepancies between the price ap-
pearing on the sales slip and the shelf price were found in 6 of these purchased
products.

a) Let X denote the number of discrepancies when purchasing 3 random (dif-

ferent) products within the group of the 20 products in the store. What is
the mean and variance of X?

Exercise 2.7 A fully automated production

On a large fully automated production plant items are pushed to a side band
at random time points, from which they are automatically fed to a control unit.
The production plant is set up in such a way that the number of items sent to
the control unit on average is 1.6 item pr. minute. Let the random variable X
denote the number of items pushed to the side band in 1 minute. It is assumed
that X follows a Poisson distribution.

a) What is the probability that there will arrive more than 5 items at the con-
trol unit in a given minute is?

b) What is the probability that no more than 8 items arrive to the control unit
within a 5-minute period?

Exercise 2.8 Call center staff

The staffing for answering calls in a company is based on that there will be 180
phone calls per hour randomly distributed. If there are 20 calls or more in a
period of 5 minutes the capacity is exceeded, and there will be an unwanted
waiting time, hence there is a capacity of 19 calls per 5 minutes.

a) What is the probability that the capacity is exceeded in a random period

of 5 minutes?
Chapter 2 2.11 EXERCISES 119

b) If the probability should be at least 99% that all calls will be handled with-
out waiting time for a randomly selected period of 5 minutes, how large
should the capacity per 5 minutes then at least be?

Exercise 2.9 Continuous random variable

a) The following R commands and results are given:

pnorm(2)
[1] 0.9772

pnorm(2,1,1)
[1] 0.8413

pnorm(2,1,2)
[1] 0.6915

Specify which distributionsare used and explain the resulting probabili-

ties (preferably by a sketch).

b) What is the result of the following command: qnorm(pnorm(2))?

c) The following R commands and results are given:

qnorm(0.975)
[1] 1.96

qnorm(0.975,1,1)
[1] 2.96

qnorm(0.975,1,2)
[1] 4.92

State what the numbers represent in the three cases (preferably by a sketch).
Chapter 2 2.11 EXERCISES 120

Exercise 2.10 The normal pdf

a) Which of the following statements regarding the probability density func-

tion of the normal distribution N (1, 22 ) is false?
1. The total area under the curve is equal to 1.0
2. The mean is equal to 12
3. The variance is equal to 2
4. The curve is symmetric about the mean
5. The two tails of the curve extend indefinitely
6. Don’t know

Let X be normally distributed with mean 24 and variance 16

b) Calculate the following probabilities:

– P( X ≤ 20)
– P( X > 29.5)
– P( X = 23.8)

Exercise 2.11 Computer chip control

A machine for checking computer chips uses on average 65 milliseconds per

check with a standard deviation of 4 milliseconds. A newer machine, poten-
tially to be bought, uses on average 54 milliseconds per check with a standard
deviation of 3 milliseconds. It can be used that check times can be assumed
normally distributed and independent.

a) What is the probability that the time savings per check using the new ma-
chine is less than 10 milliseconds is?

b) What is the mean (µ) and standard deviation (σ ) for the total time use for
checking 100 chips on the new machine is?
Chapter 2 2.11 EXERCISES 121

Exercise 2.12 Concrete items

A manufacturer of concrete items knows that the length ( L) of his items are rea-
sonably normally distributed with µ L = 3000 mm and σL = 3 mm. The require-
ment for these elements is that the length should be not more than 3007 mm
and the length must be at least 2993 mm.

a) The expected error rate in the manufacturing will be?

b) The concrete items are supported by beams, where the distance between
the beams is called Lbeam and can be assumed normal distributed. The
concrete items length is still called L. For the items to be supported cor-
rectly, the following requirements for these lengths must be fulfilled: 90 mm <
L − Lbeam < 110 mm. It is assumed that the mean of the distance between
the beams is µbeam = 2900 mm. How large may the standard deviation
σbeam of the distance between the beams be if you want the requirement
fulfilled in 99% of the cases?

Exercise 2.13 Online statistic video views

In 2013, there were 110,000 views of the DTU statistics videos that are avail-
able online. Assume first that the occurrence of views through 2014 follows a
Poisson process with a 2013 average: λ365days = 110000.

a) What is the probability that in a randomly chosen half an hour there is no

occurrence of views?

b) There has just been a view, what is the probability that you have to wait
more than fifteen minutes for the next view?
Chapter 2 2.11 EXERCISES 122

Exercise 2.14 Body mass index distribution

The so-called BMI (Body Mass Index) is a measure of the weight-height-relation,

and is defined as the weight (W) in kg divided by the squared height (H) in
meters:
W
BMI = .
H2
Assume that the population distribution of BMI is a log-normal distribution
with α = 3.1 and β = 0.15 (hence that log(mathitBMI ) is normal distributed
with mean 3.1 and standard deviation 0.15).

a) A definition of "being obese" is a BMI-value of at least 30. How large a

proportion of the population would then be obese?

Exercise 2.15 Bivariate normal

a) In the bivariate normal distribution (see Example 2.73), show that if Σ is a

diagonal matrix then ( X1 , X2 ) are also independent and follow univariate
normal distributions.

b) Assume that Z1 and Z2 are independent standard normal random vari-

ables. Now let X and Y be defined by

X = a11 Z1 + c1 ,
Y = a12 Z1 + a22 Z2 + c2 .

Show that an appropriate choice of a11 , a12 , a22 , c1 , c2 can give any bivariate
normal distribution for the random vector ( X, Y ), i.e. find a11 , a12 , a22 , c1 , c2
as a function of µ X , µY and the elements of Σ.

Note that Σij = Cov( Xi , X j ) (i.e. here Σ12 = Σ21 = Cov( X, Y )),
and that any linear combination of random normal variables will
result in a random normal variable.
Chapter 2 2.11 EXERCISES 123

c) Use the result to simulate 1000 realization of a bivariate normal random

variable with µ = (1, 2) and

1 1
Σ=
1 2

and make a scatter plot of the bivariate random variable.

Exercise 2.16 Sample distributions

a) Verify by simulation that n1 +σn22 −2 S2p ∼ χ2 (n1 + n2 − 2) (See Example 2.85).

You may use n1 = 5, n2 = 8, µ1 = 2, µ2 = 4, and σ2 = 2.

b) Show that if X ∼ N (µ1 , σ2 ) and Y ∼ N (µ2 , σ2 ), then

X̄ − Ȳ − (µ1 − µ2 )
q ∼ t ( n1 + n2 − 2).
S p n11 + n12

Verify the result by simulation. You may use n1 = 5, n2 = 8, µ1 = 2,

µ2 = 4, and σ2 = 2.

Exercise 2.17 Sample distributions 2

Let X1 , ..., Xn and Y1 , ..., Yn , with Xi ∼ N (µ1 , σ2 ) and Yi ∼ N (µ2 , σ2 ) be indepen-

dent random variables. Hence, two samples before they are taken. S12 and S22
are the sample variances based on the X’s and the Y’s respectively. Now define
a new random variable
S12
Q= (2-111)
S22

a) For n equal 2, 4, 8, 16 and 32 find:

1. P( Q < 1)
2. P( Q > 2)
Chapter 2 2.11 EXERCISES 124

3. P Q < 21

4. P 21 < Q < 2

b) For at least one value of n illustrate the results above by direct simulation
from independent normal distributions. You may use any values of µ1 , µ2
and σ2 .
Chapter 3 125

Chapter 3

Statistics for one and two samples

3.1 Learning from one-sample quantitative data

Statistics is the art and science of learning from data, i.e. statistical inference.
What we are usually interested in learning about is the population from which
our sample was taken, as described in Section 1.3. More specifically, most of
the time the aim is to learn about the mean of this population, as illustrated in
Figure 1.1.

Example 3.1 Student heights

In examples in Chapter 1 we did descriptive statistics on the following random sam-

ple of the heights of 10 students in a statistics class (in cm):

168 161 167 179 184 166 198 187 191 179

and we computed the sample mean and standard deviation to be

x̄ = 178,
s = 12.21.

The population distribution of heights will have some unknown mean µ and some
unknown standard deviation σ. We use the sample values as point estimates for
these population parameters

µ̂ = 178,
σ̂ = 12.21.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 126

Since we only have a sample of 10 persons, we know that the point estimate of 178
cannot with 100% certainty be exactly the true value µ (if we collected a new sample
with 10 different persons height and computed the sample mean we would defi-
nitely expect this to be different from 178). The way we will handle this uncertainty
is by computing an interval called the confidence interval for µ. The confidence in-
terval is a way to handle the uncertainty by the use of probability theory. The most
commonly used confidence interval would in this case be

12.21
178 ± 2.26 · √ ,
10
which is

178 ± 8.74.

The number 2.26 comes from a specific probability distribution called the t-
distribution, presented in Section 2.86. The t-distributions are similar to the stan-
dard normal distribution presented in Section 2.5.2: they are symmetric and cen-
tered around 0.

The confidence interval interval

178 ± 8.74 = [169.3, 186.7],

represents the plausible values of the unknown population mean µ in light of the
data.

So in this section we will explain how to estimate the mean of a distribution and
how to quantify the precision, or equivalently the uncertainty, of our estimate.

We will start by considering a population characterized by some distribution

todoDet kunne overvejes at indføre med Xi rigtigt første gang, og ikke X første
gang, det kommer først længere nede. from which we take a sample x1 , . . . , xn
of size n. In the example above Xi would be the height of a randomly selected
person and x1 , . . . , x10 our sample of student heights.

A crucial issue in the confidence interval is to use the correct probabilities, that
is, we must use probability distributions that are properly representing the real
life phenomena we are investigating. In the height example, the population dis-
tribution is the distribution of all heights in the entire population. This is what
you would see if you sampled from a huge amount of heights, say n = 1000000,
and then made a density histogram of these, see Example 1.25. Another way of
saying the same is: the random variables Xi have a probability density function
(pdf or f ( x )) which describe exactly the distribution of all the values. Well, in
our setting we have only a rather small sample, so in fact we may have to as-
sume some specific pdf for X I , since we don’t know it and really can’t see it well
from the small sample. The most common type of assumption, or one could say
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 127

model, for the population distribution is to assume it to be the normal distribu-

tion. This assumption makes the theoretical justification for the methods easier.
In many cases real life phenomena actually indeed are nicely modelled by a nor-
mal distribution. In many other cases they are not. After taking you through
the methodology based on a normal population distribution assumption, we
will show and discuss what to do with the non-normal cases.

Hence, we will assume that the random variables Xi follow a normal distribution
with mean µ and variance σ2 , X ∼ N (µ, σ2 ) . Our goal is to learn about the
mean of the population µ, in particular, we want to:

1. Estimate µ, that is calculate a best guess of µ based on the sample

2. Quantify the precision, or equivalently the uncertainty, of the estimate

Intuitively, the best guess of the population mean µ is the sample mean

1 n
µ̂ = x̄ = ∑ xi .
n i =1

Actually, there is a formal theoretical framework to support that this sort of ob-
vious choice also is the theoretically best choice, when we have assumed that
the underlying distribution is normal. The next sections will be concerned with
answering the second question: quantifying how precisely x̄ estimates µ, that
is, how close we can expect the sample mean x̄ to be to the true, but unknown,
population mean µ. To answer this, we first, in Section 3.1.1, discuss the dis-
tribution of the sample mean, and then, in Section 3.1.2, discuss the confidence
interval for µ, which is universally used to quantify precision or uncertainty.

3.1.1 Distribution of the sample mean

As indicated in Example 3.1 the challenge we have in using the sample mean x̄
as an estimate of µ is the unpleasant fact that the next sample we take would
give us a different result, so there is a clear element of randomness in our esti-
mate. More formally, if we take a new sample from the population, let us call
it x2,1 , . . . , x2,n , then the sample mean of this, x̄2 = n1 ∑in=1 x2,i will be different
from the sample mean of the first sample we took. In fact, we can repeat this
process as many times as we would like, and we would obtain:

1. Sample x1,1 , . . . , x1,n and calculate the average x̄1

2. Sample x2,1 , . . . , x2,n and calculate the average x̄2

Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 128

3. Sample x3,1 , . . . , x3,n and calculate the average x̄3

4. etc.

Since the sample means x̄ j will all be different, it is apparent that the sample
mean is also the realization of a random variable. In fact it can be shown that if X
is a random variable with a normal distribution with mean µ and variance σ2 ,
then the random sample mean X̄ from a sample of size n is also a normally
distributed random variable with mean µ and variance σ2 /n. This result is
formally expressed in the following theorem:

Theorem 3.2 The distribution of the mean of normal random vari-

ables
Assume that X1 , . . . , Xn are independent and identicallya normally dis-
tributed random variables, Xi ∼ N (µ, σ2 ), i = 1, . . . , n, then

1 n σ2
X̄ = ∑ Xi ∼ N µ, . (3-1)
n i =1 n

a The“independent and identically” part is in many ways a technical detail that you
don’t have to worry about at this stage in the text.

Note how the formula in the theorem regarding the mean and variance of X̄ is
a consequence of the mean and variance of linear combinations Theorem 2.56

1 n 1 n 1
E( X̄ ) = ∑ E( Xi ) = ∑ µ = nµ = µ, (3-2)
n i =1 n i =1 n

and
n n
1 1 1 σ2
V( X̄ ) = 2
n ∑ V ( Xi ) = n 2 ∑ σ = n2 nσ = n ,
2 2
(3-3)
i =1 i =1

and using Theorem 2.40 it is clear that the mean of normal distributions also is
a normal distribution.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 129

One important point to read from this theorem is that it tells us, at
least theoretically, what the variance of the sample mean is, and
hence also the standard deviation
σ
σX̄ = √ . (3-4)
n

Let us elaborate a little on the importance of this. Due to the basic

rules for mean and variance calculations, i.e. Theorem 2.56, we
know that the difference between X̄ and µ has the same standard
deviation
σ
σ(X̄ −µ) = √ . (3-5)
n
This is the mean absolute difference between the sample estimate
X̄ and the true µ, or in other words: this is the mean of the error
we will make using the sample mean to estimate the population
mean. This is exactly what we are interested in: to use a proba-
bility distribution to handle the possible error we make.

In our way of justifying and making explicit methods it is useful to consider

the so-called standardized sample mean, where the X̄ − µ is seen relative to its
standard deviation, and using the standardization of normal distributions in
Theorem 2.43, which states that the standardized sample mean has a standard
normal distribution:

Theorem 3.3 The distribution of the σ-standardized mean of nor-

mal random variables
Assume that X1 , . . . , Xn are independent
and identically normally dis-
2
tributed random variables, Xi ∼ N µ, σ where i = 1, . . . , n, then

X̄ − µ
2
Z= √ ∼ N 0, 1 . (3-6)
σ/ n

That is, the standardized sample mean Z follows a standard normal distri-
bution.

However, to somehow use the probabilities to say something clever about how
close the estimate x̄ is to µ, all these results have a flaw: the population standard
deviation σ (true, but unknown) is part of the formula. And in most practical
cases we don’t know the true standard deviation σ. The natural thing to do is
to use the sample standard deviation s as a substitute for (estimate of) σ. How-
ever, then the theory above breaks down: the sample mean standardized by the
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 130

sample standard deviation instead of the true standard deviation no longer has
a normal distribution! But luckily the distribution can be found (as a probability
theoretical result) and we call such a distribution a t-distribution with (n − 1)
degrees of freedom (for more details see Section 2.10.2):

Theorem 3.4 The distribution of the S-standardized mean of nor-

mal random variables
Assume that X1 , . . . , Xn are independent and identically normally dis-
tributed random variables, where Xi ∼ N µ, σ2 and i = 1, . . . , n, then

X̄ − µ
T= √ ∼ t ( n − 1), (3-7)
S/ n

where t(n − 1) is the t-distribution with n − 1 degrees of freedom.

A t-distribution, as any other distribution, has a probability density function,

presented in Definition 2.86. It is similar in shape to the standard normal dis-
tribution: it is symmetric and centered around 0, but it has thicker tails as il-
lustrated in the figure of Example 2.92. Also, the t-distributions are directly
available in R, via the similar four types of R functions as seen also for the other
probability distributions, see the overview of distributions in A.2.1. So we can
easily work with t-distributions in practice. As indicated, there is a different
t-distribution for each n: the larger the n, the closer the t-distribution is to the
standard normal distribution.

Example 3.5 Normal and t probabilities and quantiles

In this example we compare some probabilities from the standard normal distribu-
tion with the corresponding ones from the t-distribution with various numbers of
degrees of freedom.

Let us compare P( T > 1.96) for some different values of n with P( Z > 1.96):
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 131

## The P(T>1.96) probability for n=10

1 - pt(1.96, df = 9)

[1] 0.04082

## The P(Z>1.96) probability

1 - pnorm(1.96)

[1] 0.025

## The P(T>1.96) probability for n-values, 10, 20, ... ,50

1 - pt(1.96, df = seq(9, 49, by = 10))

[1] 0.04082 0.03241 0.02983 0.02858 0.02785

## The P(T>1.96) probability for n-values, 100, 200, ... ,500

1 - pt(1.96, df = seq(99, 499, by = 100))

[1] 0.02640 0.02570 0.02546 0.02535 0.02528

Note how the t-probabilities approach the standard normal probabilities as n in-
creases. Similarly for the quantiles:

## The standard normal 0.975% quantile

qnorm(0.975)

[1] 1.96

## The t-quantiles for n-values: 10, 20, ... ,50

## (rounded to 3 decimal points)
qt(0.975, df = seq(9, 49, by = 10))

[1] 2.262 2.093 2.045 2.023 2.010

## The t-quantiles for n-values: 100, 200, ... ,500

## (rounded to 3 decimal points)
qt(0.975, df = seq(99, 499, by = 100))

[1] 1.984 1.972 1.968 1.966 1.965

Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 132

√
The sample version of the standard deviation of the sample mean s/ n is called
the Standard Error of the Mean (and is often abbreviated SEM):

Definition 3.6 Standard Error of the mean

Given a sample x1 , . . . , xn , the Standard Error of the Mean is defined as
s
SEx̄ = √ . (3-8)
n

It can also be read as the Sampling Error of the mean, and can be called the
standard deviation of the sampling distribution of the mean.

Remark 3.7
Using the phrase sampling distribution as compared to just the distribution of
the mean bears no mathematical/formal distinction: formally a probability
distribution is a probability distribution and there exist only one definition
of that. It is merely used to emphasize the role played by the distribution of
the sample mean, namely to quantify how the sample mean changes from
(potential) sample to sample, so more generally, the sample mean has a dis-
tribution (from sample to sample), so most textbooks and e.g. Wikipedia
would call this distribution a sampling distribution.

3.1.2 Quantifying the precision of the mean - the confidence in-

terval

As already discussed above, estimating the mean from a sample is usually not
enough: we also want to know how close this estimate is to the true mean (i.e.
the population mean). Using knowledge about probability distributions, we are
able to quantify the uncertainty of our estimate even without knowing the true
mean. Statistical practice is to quantify precision (or, equivalently, uncertainty)
with a confidence interval (CI).

In this section we will provide the explicit formula for and discuss confidence
intervals for the population mean µ. The theoretical justification, and hence as-
sumptions of the method, is a normal distribution of the population. However,
it will be clear in a subsequent section that the applicability goes beyond this
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 133

if the sample size n is large enough. The standard so-called one-sample confi-
dence interval method is:

Method 3.8 The one sample confidence interval for µ

For a sample x1 , . . . , xn the 100(1 − α)% confidence interval is given by
s
x̄ ± t1−α/2 · √ , (3-9)
n

where t1−α/2 is the (1 − α) quantile from the t-distribution with n − 1 de-

grees of freedom.a
Most commonly used is the 95%-confidence interval:
s
x̄ ± t0.975 · √ . (3-10)
n

a Note how the dependence of n has been suppressed from the notation to leave room for

using the quantile as index instead - since using two indices would appear less readable:
tn−1,1−α/2

We will reserve the Method boxes for specific directly applicable statistical meth-
ods/formulas (as opposed to theorems and formulas used to explain, justify or
prove various points).

Example 3.9 Student heights

We can now use Method 3.8 to find the 95% confidence interval for the population
mean height from the height sample from Example 3.1. We need the 0.975-quantile
from the t-distribution with n − 1 = 9 degrees of freedom:

## The t-quantiles for n=10:

qt(0.975, 9)

[1] 2.262

And we can recognize the already stated result

12.21
178 ± 2.26 · √ ,
10
which is
178 ± 8.74 = [169.3, 186.7].
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 134

Therefore with high confidence we conclude that the true mean height of the popu-
lation of students to be between 169.3 and 186.7.

The confidence interval is widely used to summarize uncertainty, not only for
the sample mean, but also for many other types of estimates, as we shall see
in later sections of this chapter and in following chapters. It is quite common
to use 95% confidence intervals, but other levels, e.g. 99% are also used (it is
presented later in this chapter what the precise meaning of “other levels” is).

Example 3.10 Student heights

Let us try to find the 99% confidence interval for µ for the height sample from Exam-
ple 3.1. Now α = 0.01 and we get that 1 − α/2 = 0.995, so we need the 0.995-quantile
from the t-distribution with n − 1 = 9 degrees of freedom:

## The t-quantile for n=10

qt(p=0.995, df=9)

[1] 3.25

And we can find the result as

12.21
178 ± 3.25 · √ ,
10
which is:

178 ± 12.55 = [165.5, 190.5].

Or explicitly in R:

## The 99% confidence interval for the mean

x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
n <- length(x)
mean(x) - qt(0.995, df = 9) * sd(x) / sqrt(n)

[1] 165.5

mean(x) + qt(0.995, df = 9) * sd(x) / sqrt(n)

[1] 190.5

Or using the inbuilt function t.test in R:

Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 135

## The 99% confidence interval for the mean

t.test(x, conf.level = 0.99)

One Sample t-test

data: x
t = 46, df = 9, p-value = 5e-12
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
165.5 190.5
sample estimates:
mean of x
178

As can be seen this R function provides various additional information and talks
about “t-test” and “p-value”. In a subsequent section below we will explain
what this is all about.

In our motivation of the confidence interval we used the assumption that the
population is normal distributed. Thankfully, as already pointed out above, the
validity is not particularly sensitive to the normal distribution assumption. In
later sections, we will discuss how to assess if the sample is sufficiently close to
a normal distribution, and what we can do if the assumption is not satisfied.

3.1.3 The language of statistics and the process of learning from

data

In this section we review what it means to make statistical inference using a

confidence interval. We review the concepts, first presented in Section 1.3, of: a
population, distribution, a parameter, an estimate, an estimator, and a statistic.

The basic idea in statistics is that there exists a statistical population (or just
population) which we want to know about or learn about, but we only have
a sample from that population. The idea is to use the sample to say something
about the population. To generalize from the sample to the population, we
characterize the population by a distribution (see Definition 1.1 and Figure 1.1).

For example, if we are interested in the weight of eggs lain by a particular

species of hen, the population consists of the weights of all currently existing
eggs as well as weights of eggs that formerly existed and will (potentially) exist
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 136

in the future. We may charactarize these weights by a normal distribution with

mean µ and variance σ2 . If we let X denote the weight of a randomly chosen
egg, then we may write X ∼ N (µ, σ2 ). We say that µ and σ2 are the parameters
of this distribution - we call them population parameters.

Naturally, we do not know the values of these true parameters, and it is impos-
sible for us to ever know, since it would require that we weighed all possible
eggs that have existed or could have existed. In fact the true parameters of the
distribution N (µ, σ2 ) are unknown and will forever remain unknown.

If we take a random sample of eggs from the population of egg weights, say
we make 10 observations, then we have x1 , . . . , x10 . We call this the observed sam-
ple or just sample. From the sample, we can calculate the sample mean, x̄. We
say that x̄ is an estimate of the true population mean µ (or just mean, see Remark
1.3). In general we distinguish estimates of the parameters from the parameters
themselves, by adding a hat (circumflex). For instance, when we use the sample
mean as an estimate of the mean, we may write µ̂ = x̄ for the estimate and µ for
the parameter, see the illustration of this process in Figure 1.1.

We denote parameters such as µ and σ2 by greek letters. Therefore parameter

estimates are greek letters with hats on them. Random variables such as X are
denoted by capital roman letters. The observed values of the random variables
are denoted by lower case instead – we call them realizations of the random vari-
ables. For example, the sample x1 , . . . , x10 represents actually observed num-
bers (e.g. the weights of 10 eggs), so they are not random and therefore in lower
case. If we consider a hypothetical sample it is yet unobserved and therefore
random and denoted by, say, X1 , . . . , Xn and therefore in capital letters, see also
Section 2.1.

To emphasize the difference, we say that X1 , . . . , Xn is a random sample, while we

say that x1 , . . . , xn is a sample taken at random; the observed sample is not random
when it is observed, but it was produced as a result of n random experiments.

A statistic is a function of the data, and it can represent both a fixed value from
an observed sample or a random variable from a random (yet unobserved) sam-
ple. For example sample average x̄ = n1 ∑in=1 xi is a statistic computed from an
observed sample, while X̄ = n1 ∑in=1 Xi is also a statistic, but it is considered
a function of a random (yet unobserved) sample. Therefore X̄ is itself a ran-
dom variable with a distribution. Similarly the sample variance S2 is a random
variable, while s2 is its realized value and just a number.

An estimator (not to be confused with an estimate) is a function that produces an

estimate. For example, µ is a parameter, µ̂ is the estimate and we use X̄ as an
estimator of µ. Here X̄ is the function that produces the estimate of µ from a
sample.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 137

Learning from data is learning about parameters of distributions that describe

populations. For this process to be meaningful, the sample should in a mean-
ingful way be representative of the relevant population. One way to ensure that
this is the case is to make sure that the sample is taken completely at random
from the population, as formally defined here:

Definition 3.11 Random sample

A random sample from an (infinite) population: A set of observations
X1 , ..., Xn constitutes a random sample of size n from the infinite population
f ( x ) if:

1. Each Xi is a random variable whose distribution is given by f ( x )

2. The n random variables are independent

It is a bit difficult to fully comprehend what this definition really amounts to

in practice, but in brief one can say that the observations should come from
the same population distribution, and that they must each represent truly new
information (the independence).

Remark 3.12
Throughout previous sections and the rest of this chapter we assume infinite
populations. Finite populations of course exists, but only when the sam-
ple constitutes a large proportion of the entire population, is it necessary to
adjust the methods we discuss here. This occurs relatively infrequently in
practice and we will not discuss such conditions.

3.1.4 When we cannot assume a normal distribution: the Central

Limit Theorem

The Central Limit Theorem (CLT) states that the sample mean of independent
identically distributed (i.i.d.) random variables converges to a normal distribu-
tion:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 138

Theorem 3.13 Central Limit Theorem (CLT)

Let X̄ be the sample mean of a random sample of size n taken from a popu-
lation with mean µ and variance σ2 , then

X̄ − µ
Z= √ , (3-11)
σ/ n

is a random variable which distribution function approaches that of the

standard normal distribution, N (0, 12 ), as n → ∞. In other words, for large
enough n, it holds approximately that

X̄ − µ
√ ∼ N (0, 12 ). (3-12)
σ/ n

The powerful feature of the CLT is that, when the sample size n is large enough,
the distribution of the sample mean X̄ is (almost) independent of the distri-
bution of the population X. This means that the underlying distribution of a
sample can be disregarded when carrying out inference related to the mean.
The variance of the sample mean can be estimated from the sample and it can
be seen that as n increases the variance of the sample mean decreases, hence the
“accuracy” with which we can infer increases.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 139

Example 3.14 Central Limit Theorem in practice

## Number of simulated samples

k <- 1000

## Number of observations in each sample

n <- 1
## Simulate k samples with n observations
## Note, the use of replicate: it repeats the second argument (here k times)
Xbar <- replicate(k, runif(n))
hist(Xbar, col="blue", main="n=1", xlab="Sample means", xlim=xlim)
## Increase the number of observations in each sample
## Note, the use of apply here: it takes the mean on the 2nd dimension
## (i.e. column) of the matrix returned by replicate
n <- 2
Xbar <- apply(replicate(k, runif(n)), 2, mean)
hist(Xbar, col="blue", main="n=2", xlab="Sample means", xlim=xlim)
## Increase the number of observations in each sample
n <- 6
Xbar <- apply(replicate(k, runif(n)), 2, mean)
hist(Xbar, col="blue", main="n=6", xlab="Sample means", xlim=xlim)
## Increase the number of observations in each sample
n <- 30
Xbar <- apply(replicate(k, runif(n)), 2, mean)
hist(Xbar, col="blue", main="n=30", xlab="Sample means", xlim=xlim)

n=1 n=2
Frequency

Frequency
0 100
60
0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Sample means Sample means

n=6 n=30
Frequency

Frequency
200
100
0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Sample means Sample means

Recognize the plot on the front page of the book.

Due to the amazing result of the Central Limit Theorem 3.13 many expositions
of classical statistics provides a version of the confidence interval based on the
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 140

standard normal quantiles rather than the t-quantiles

s
x̄ ± z1−α/2 · √ . (3-13)
n

We present it here only as an interesting limit situation of the t-based interval in

Method 3.8.

For large samples, the standard normal distribution and t-distribution are al-
most the same, so in practical situations, it doesn’t matter whether the normal
based or the t-based confidence interval (CI) is used. Since the t-based interval
is also valid for small samples when a normal distribution is assumed, we rec-
ommend that the t-based interval in Method 3.8 is used in all situations. This
recommendation also has the advantage that the R-function t.test, which pro-
duces the t-based interval, can be used in all cases.

How large should the sample then be in a non-normal case to ensure the validity
of the interval? No general answer can be given, but as a rule of thumb we
recommend n ≥ 30.

When we have a small sample for which we cannot or will not make a nor-
mality assumption, we have not yet presented a valid CI method. The classical
solution is to use the so-called non-parametric methods. However, in the next
chapter we will present the more widely applicable simulation or re-sampling
based techniques.

3.1.5 Repeated sampling interpretation of confidence intervals

In this section we show that 95% of the 95% confidence intervals we make will
cover the true value in the long run. Or, in general 100(1 − α)% of the 100(1 −
α)% confidence intervals we make will cover the true value in the long run. For
example, if we make 100 95% CI we cannot guarantee that exactly 95 of these
will cover the true value, but if we repeatedly make 100 95% CIs then on average
95 of them will cover the true value.

Example 3.15 Simulating many confidence intervals

To illustrate this with a simulation example, then we can generate 50 random

N (1, 12 ) distributed numbers and calculate the t-based CI given in Method 3.8, and
then repeated this 1000 times to see how many times the true mean µ = 1 is covered.
The following code illustrates this:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 141

## Simulate 1000 samples of n=50 observations, and

## calculate a CI from each sample
k <- 1000
ThousandCIs <- replicate(k, t.test(rnorm(n=50, mean=1, sd=1))$conf.int)
## Count how often 1 is covered
sum(ThousandCIs[1,] < 1 & 1 < ThousandCIs[2,])

[1] 954

Hence in 954 of the 1000 repetitions (i.e. 95.4%) the CI covered the true value. If
we repeat the whole simulation over, we would obtain 1000 different samples and
therefore 1000 different CIs. Again we expect that approximately 95% of the CIs will
cover the true value µ = 1.

The result that we arrived at by simulation in the previous example can also be
derived mathematically. Since
X̄ − µ
T= √ ∼ t ( n − 1),
S/ n
where t is the t-distribution with n − 1 degrees of freedom, it holds that

X̄ − µ
1 − α = P −t1−α/2 < √ < t1−α/2 ,
S/ n
which we can rewrite as

S S
= P X̄ − t1−α/2 √ < µ < X̄ + t1−α/2 √ .
n n
Thus, the probability that the interval with limits
S
X̄ ± t1−α/2 √ , (3-14)
n
covers the true value µ is exactly 1 − α. One thing to note is that the only dif-
ference between the interval above and the interval in Method 3.8, is that the
interval above is written with capital letters (simply indicating that it calculated
with random variables rather than with observations).

This shows exactly that 100(1 − α)% of the 100(1 − α)% confidence interval we
make will contain the true value in the long run.

3.1.6 Confidence interval for the variance

In previous sections we discussed how to calculate a confidence interval for the

mean. In this section we discuss how to calculate a confidence interval for the
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 142

variance or the standard deviation.

We will assume that the observations come from a normal distribution through-
out this section, and we will not present any methods that are valid beyond this
assumption. While the methods for the sample mean in the previous sections
are not sensitive to (minor) deviations from the normal distribution, the meth-
ods discussed in this section for the sample variance rely much more heavily on
the correctness of the normal distribution assumption.

Example 3.16 Tablet production

In the production of tablets, an active matter is mixed with a powder and then the
mixture is formed to tablets. It is important that the mixture is homogenous, such
that each tablet has the same strength.

We consider a mixture (of the active matter and powder) from where a large amount
of tablets is to be produced.

We seek to produce the mixtures (and the final tablets) such that the mean content of
the active matter is 1 mg/g with the smallest variance possible. A random sample is
collected where the amount of active matter is measured. It is assumed that all the
measurements follow a normal distribution.

The variance estimator, that is, the formula for the variance seen as a random
variable, is
n
1
2
S = ∑
n − 1 i =1
( Xi − X̄ )2 , (3-15)

where n is the number of observations, Xi is observation number i where i =

1, . . . , n, and X̄ is the estimator of the mean of X.

The (sampling) distribution of the variance estimator is the χ2 -distribution dis-

tribution: let S2 be the variance of a sample of size n from a normal distribution
with variance σ2 , then
( n − 1) S2
χ2 = ,
σ2
is a stochastic variable following the χ2 -distribution with v = n − 1 degrees of
freedom.

The χ2 -distribution, as any other distribution, has a probability density func-

tion. It is a non-symmetric distribution on the positive axis. It is a distribution
of squared normal random variables, for more details see Section 2.10.1. An
example of a χ2 -distribution is given in the following:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 143

Example 3.17 The χ2 -distribution

The density of the χ2 -distribution with 9 degrees of freedom is:

## The chisquare-distribution with df=9 (the density)

x <- seq(0, 35, by = 0.1)
plot(x, dchisq(x, df = 9), type = "l", ylab="Density")
0.00 0.02 0.04 0.06 0.08 0.10
Density

0 5 10 15 20 25 30 35
x

So, the χ2 -distributions are directly available in R, via the similar four types
of R-functions as seen for the other probability distributions presented in the
distribution overview, see Appendix A.3.

Hence, we can easily work with χ2 -distributions in practice. As indicated there

is a different χ2 -distribution for each n.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 144

Method 3.18 Confidence interval for the variance/standard devia-

tion
A 100(1 − α)% confidence interval for the variance σ2 is
" #
( n − 1) s2 ( n − 1) s2
, , (3-16)
χ21−α/2 χ2α/2

where the quantiles come from a χ2 -distribution with ν = n − 1 degrees of

freedom.
A 100(1 − α)% confidence interval for the standard deviation σ is
"s s #
( n − 1) s2 ( n − 1) s2
, . (3-17)
χ21−α/2 χ2α/2

Note: The confidence intervals for the variance and standard deviations are
generally non-symmetric as opposed to the t-based interval for the mean µ.

Example 3.19 Tablet production

A random sample of n = 20 tablets is collected and from this the mean is estimated
to x̄ = 1.01 and the variance to s2 = 0.072 . Let us find the 95%-confidence interval
for the variance. To apply the method above we need the 0.025 and 0.975 quantiles
of the χ2 -distribution with ν = 20 − 1 = 19 degrees of freedom
χ20.025 = 8.907, χ20.975 = 32.85,
which we get from R:

## Quantiles of the chi-square distribution:

qchisq(p=c(0.025, 0.975), df=19)

[1] 8.907 32.852

Hence the confidence interval is

19 · 0.072 19 · 0.072
, ≈ [0.00283, 0.0105],
32.85 8.907

and for the standard deviation the confidence interval is

h√ √ i
0.002834, 0.01045 ≈ [0.053, 0.102] .
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 145

3.1.7 Hypothesis testing, evidence, significance and the p-value

Example 3.20 Sleeping medicine

In a study the aim is to compare two kinds of sleeping medicine A and B. 10 test
persons tried both kinds of medicine and the following 10 DIFFERENCES between
the two medicine types were measured (in hours):

Person x = Beffect - Aeffect

1 1.2
2 2.4
3 1.3
4 1.3
5 0.9
6 1.0
7 1.8
8 0.8
9 4.6
10 1.4

For Person 1, Medicine B provided 1.2 sleep hours more than Medicine A, etc.

Our aim is to use these data to investigate if the two treatments are different in their
effect on length of sleep. We therefore let µ represent the mean difference in sleep
length. In particular we will consider the so-called null hypothesis

H0 : µ = 0,

which states that there is no difference in sleep length between the A and B Medicines.

If the observed sample turns out to be not very likely under this null hypothesis, we
conclude that the null hypothesis is unlikely to be true.

First we compute the sample mean

µ̂ = x̄1 = 1.67.

As of now, we don’t know if this number is particularly small or large. If the true
mean difference is zero, would it be unlikely to observe a mean difference this large?
Could it be due to just random variation? To answer this question we compute the
probability of observing a sample mean that is 1.67 or further from 0 – in the case
that the true mean difference is in fact zero. This probability is called a p-value. If
the p-value is small (say less than 0.05), we conclude that the null hypothesis isn’t
true. If the p-value is not small (say larger than 0.05), we conclude that we haven’t
obtained sufficient evidence to falsify the null hypothesis.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 146

After some computations that you will learn to perform later in this section, we
obtain a p-value

p-value ≈ 0.00117,

which indicates quite strong evidence against the null hypothesis. As a matter of
fact, the probability of observing a mean difference as far from zero as 1.67 or further
is only ≈ 0.001 (one out of thousand) and therefore very small.

We conclude that the null hypothesis is unlikely to be true as it is highly incompat-

ible with the observed data. We say that the observed mean µ̂ = 1.67 is statistically
significantly different from zero (or simply significant implying that it is different from
zero). Or that there is a significant difference in treatment effects of B and A, and we may
conclude that Medicine B makes patients sleep significantly longer than Medicine
A.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 147

p < 0.001 Very strong evidence against H0

0.001 ≤ p < 0.01 Strong evidence against H0
0.01 ≤ p < 0.05 Some evidence against H0
0.05 ≤ p < 0.1 Weak evidence against H0
p ≥ 0.1 Little or no evidence against H0

Table 3.1: A way to interpret the evidence for a given p-value.

The p-value

Definition 3.21 The p-value

The p-value is the probability of obtaining a test statistic that is at least as
extreme as the test statistic that was actually observed. This probability is
calculated under the assumption that the null hypothesis is true.

Interpretations of a p-value:

1. The p-value measures evidence

2. The p-value measures extremeness/unusualness of the data under the

null hypothesis (“under the null hypothesis” means “assuming the null
hypothesis is true”)

The p-value is used as a general measure of evidence against a null hypothesis:

the smaller the p-value, the stronger the evidence against the null hypothesis
H0 . A typical strength of evidence scale is given in Table 3.1.

As indicated, the definition and interpretations above are generic in the sense
that they can be used for any kind of hypothesis testing in any kind of setup.
In later sections and chapters of this material, we will indeed encounter many
different such setups. For the specific setup in focus here, we can now give the
key method:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 148

Method 3.22 The one-sample t-test statistic and the p-value

For a (quantitative) one sample situation, the p-value is given by

p-value = 2 · P( T > |tobs |), (3-18)

where T follows a t-distribution with (n − 1) degrees of freedom.

The observed value of the test statistics to be computed is
x̄ − µ0
tobs = √ , (3-19)
s/ n

where µ0 is the value of µ under the null hypothesis

H0 : µ = µ0 . (3-20)

The t-test and the p-value will in some cases be used to formalise actual decision
making and the risks related to it:

Definition 3.23 The hypothesis test

We say that we carry out a hypothesis test when we decide against a null
hypothesis or not, using the data.
A null hypothesis is rejected if the p-value, calculated after the data has been
observed, is less than some α, that is if the p-value < α, where α is some pre-
specifed (so-called) significance level. And if not, then the null hypothesis is
said to be accepted.

Remark 3.24
Often chosen significance levels α are 0.05, 0.01 or 0.001 with the former
being the globally chosen default value.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 149

Remark 3.25
A note of caution in the use of the word accepted is in place: this should
NOT be interpreted as having proved anything: accepting a null hypothesis
in statistics simply means that we could not prove it wrong! And the reason
for this could just potentially be that we did not collect sufficient amount of
data, and acceptance hence proofs nothing at its own right.

Example 3.26 Sleeping medicine

Continuing from Example 3.20, we now illustrate how to compute the p-value using
Method 3.22.

## Enter sleep difference observations

x <- c(1.2, 2.4, 1.3, 1.3, 0.9, 1.0, 1.8, 0.8, 4.6, 1.4)
n <- length(x)
## Compute the tobs - the observed test statistic
tobs <- (mean(x) - 0) / (sd(x) / sqrt(n))
tobs

[1] 4.672

## Compute the p-value as a tail-probability in the t-distribution

pvalue <- 2 * (1-pt(abs(tobs), df=n-1))
pvalue

[1] 0.001166

Naturally, as we have seen already a function in R that can do this for us

t.test(x)

One Sample t-test

data: x
t = 4.7, df = 9, p-value = 0.001
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.8613 2.4787
sample estimates:
mean of x
1.67
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 150

The confidence interval and the p-value supplements each other, and often both
the confidence interval and the p-value are reported. The confidence interval
covers those values of the parameter that we accept given the data, while the
p-value measures the extremeness of the data if the null hypothesis is true.

Example 3.27 Sleeping medicine

In the sleep medicine example the 95% confidence interval is

[0.86, 2.48] ,

so based on the data these are the values for the mean sleep difference of Medicine
B versus Medicine A that we accept can be true. Only if the data is so extreme (i.e.
rarely occuring) that we would only observe it 5% of the time the confidence interval
does not cover the true mean difference in sleep.

The p-value for the null hypothesis µ = 0 was ≈ 0.001 providing strong evidence
against the correctness of the null hypothesis.

If the null hypothesis was true, we would only observe this large a difference in
sleep medicine effect levels in around one out of a thousand times. Consequently
we conclude that the null hypothesis is unlikely to be true and reject it.

Statistical significance

The word significance can mean importance or the extent to which something matters
in our everyday language. In statistics, however, it has a very particular mean-
ing: if we say that an effect is significant, it means that the p-value is so low that
the null hypothesis stating no effect has been rejected at some significance level α.

Definition 3.28 Significant effect

An effect is said to be (statistically) significant if the p-value is less than the
significance level α. a

a Often, α = 0.05 is adopted.

At this point an effect would amount to a µ-value different from µ0 . In other

contexts we will see later, effects can be various features of interest to us.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 151

Example 3.29 Statistical significance

Consider the follwing two situations:

1. A researcher decides on a significance level of α = 0.05 and obtains p-value =

0.023. She therefore concludes that the effect is statistically significant

2. Another researcher also adopts a significance level of α = 0.05, but obtains

p-value = 0.067. He concludes that the effect was not statistically significant

From a binary decision point of view the two researchers couldn’t disagree more.
However, from a scientific and more continuous evidence quantification point of
view there is not a dramatic difference between the findings of the two researchers.

In daily statistical and/or scientific jargon the word ”statistically” will often be
omitted, and when results then are communicated as significant further through
media or other places, it gives the risk that the distinction between the two
meanings gets lost. At first sight it may appear unimportant, but the big dif-
ference is the following: sometimes a statistically significant finding can be so
small in real size that it is of no real importance. If data collection involves very
big data sizes one may find statistically significant effects that for no practical
situations matter much or anything at all.

The null hypothesis

The null hypothesis most often expresses the status quo or that “nothing is hap-
pening”. This is what we have to believe before we perform any experiments
and observe any data. This is what we have to accept in the absense of any
evidence that the situation is otherwise. For example the null hypothesis in the
sleep medicine examples states that the difference in sleep medicine effect level
is unchanged by the treatment: this is what we have to accept until we obtain
evidence otherwise. In this particular example the observed data and the statis-
tical theory provided such evidence and we could conclude a significant effect.

The null hypothesis has to be falsifiable. This means that it should be possible to
collect evidence against it.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 152

Acceptance
Rejection Rejection

t0.025 0 t0.975

Figure 3.1: The 95% critical value. If tobs falls in the pink area we would reject,
otherwise we would accept

Confidence intervals, critical values and significance levels

A hypothesis test, that is, making the decision between rejection and acceptance of
the null hypothesis, can also be carried out without actually finding the p-value.
As an alternative one can use the so-called critical values, that is the values of the
test-statistic which matches exactly the Csignificance level, see Figure 3.1:

Definition 3.30 The critical values

The (1 − α)100% critical values for the one-sample t-test are the α/2- and
1 − α/2-quantiles of the t-distribution with n − 1 degrees of freedom

tα/2 and t1−α/2 . (3-21)

Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 153

Method 3.31 The one-sample hypothesis test by the critical value

A null hypothesis is rejected if the observed test-statistic is more extreme than

the critical values

If |tobs | > t1−α/2 then reject, (3-22)

otherwise accept.

The confidence interval covers the acceptable values of the parameter given the
data:

Theorem 3.32 Confidence interval for µ

We consider a (1 − α) · 100% confidence interval for µ
s
x̄ ± t1−α/2 · √ . (3-23)
n

The confidence interval corresponds to the acceptance region for H0 when

testing the hypothesis

H0 : µ = µ0 . (3-24)
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 154

Remark 3.33
The proof of this theorem is almost straightforward: a µ0 inside the confi-
dence interval will fullfill that
s
| x̄ − µ0 | < t1−α/2 · √ , (3-25)
n

which is equivalent to

| x̄ − µ0 |
< t1−α/2 , (3-26)
√s
n

and again to

|tobs | < t1−α/2 , (3-27)

which then exactly states that µ0 is accepted, since the tobs is within the
critical values.

The alternative hypothesis

Some times we may in addition to the null hypothesis, also explicitly state an
alternative hypothesis. This completes the framework that allows us to control the
rates at which we make correct and wrong conclusions in light of the alternative.

The alternative hypothesis is

H1 : µ 6= µ0 . (3-28)
This is sometimes called the two-sided (or non-directional) alternative hypoth-
esis, because also one-sided (or directional) alternative hypothesis occur. How-
ever, the one-sided setup is not included in the book apart from a small discus-
sion below.

Example 3.34 Sleeping medicine – Alternative hypothesis

Continuing from Example 3.20 we can now set up the null hypothesis and the alter-
native hypothesis together

H0 : µ = 0
H1 : µ 6= 0.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 155

Which means that we have exactly the same setup just formalized by adding the
alternative hypothesis. The conclusion is naturally exactly the same as in before.

A generic approach for tests of hypotheses is:

1. Formulate the hypotheses and choose the level of significance α (choose

the "risk-level")

2. Calculate, using the data, the value of the test statistic

3. Calculate the p-value using the test statistic and the relevant sampling
distribution, compare the p-value and the significance level α, and finally
make a conclusion
or
Compare the value of the test statistic with the relevant critical value(s)
and make a conclusion

Combining this generic hypothesis test approach with the specific method boxes
of the previous section, we can now below give a method box for the one-
sample t-test. This is hence a collection of what was presented in the previous
section:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 156

Method 3.35 The level α one-sample t-test

x̄ −µ0
1. Compute tobs using Equation (3-19): tobs = s/ n
√

2. Compute the evidence against the null hypothesis

H0 : µ = µ0 ,

vs. the alternative hypothesis

H1 : µ 6 = µ0 ,

by the

p-value = 2 · P( T > |tobs |),

where the t-distribution with n − 1 degrees of freedom is used

3. If p-value < α: We reject H0 , otherwise we accept H0 ,

or
The rejection/acceptance conclusion could alternatively, but equiva-
lently, be made based on the critical value(s) ±t1−α/2 :
If |tobs | > t1−α/2 we reject H0 , otherwise we accept H0

The so-called one-sided (or directional) hypothesis setup, where the alternative
hypothesis is either “less than” or “greater than”, is opposed to the previous
presented two-sided (or non-directional) setup, with a “different from” alter-
native hypothesis. In most situations the two-sided should be applied, since
when setting up a null hypothesis with no knowledge about in which direction
the outcome will be, then the notion of “extreme” is naturally in both directions.
However, in some situations the one-sided setup makes sense to use. As for ex-
ample in pharmacology where concentrations of drugs are studied and in some
situations it is known that the concentration can only decrease from one time
point of measurement to another (after the peak concentration). In such case a
“less than” is the only meaningful alternative hypothesis – one can say that na-
ture really has made the decision for us in that: either the concentration has not
changed (the null hypothesis) or it has dropped (the alternative hypothesis). In
other cases, e.g. more from the business and/or judicial perspective, one-sided
hypothesis testing come up when for example a claim about the performance of
some product is tested.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 157

The one-sided “less than” hypothesis setup is: compute the evidence against
the null hypothesis vs. the one-sided alternative hypothesis

H0 : µ ≥ µ0 (3-29)
H1 : µ < µ0 , (3-30)

by the

p-value = P( T < tobs ). (3-31)

and equivalently for the “greather than” setup

H0 : µ ≤ µ0 (3-32)
H1 : µ > µ0 , (3-33)

by the

p-value = P( T > tobs ). (3-34)

In both cases: if p-value < α: We reject H0 , otherwise we accept H0 .

Note that there are no one-sided hypothesis testing involved in the exercises.

Errors in hypothesis testing

When testing statistical hypotheses, two kind of errors can occur:

Type I: Rejection of H0 when H0 is true

Type II: Non-rejection (acceptance) of H0 when H1 is true

Example 3.36 Ambulance times

An ambulance company claims that on average it takes 20 minutes from a telephone

call to their switchboard until an ambulance reaches the location.

We might have some measurements (in minutes): 21.1, 22.3, 19.6, 24.2, ...

If our goal is to show that on average it takes longer than 20 minutes, the null- and
the alternative hypotheses are

H0 : µ = 20,
H1 : µ 6= 20.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 158

What kind of errors can occur?

Type I: Reject H0 when H0 is true, that is we mistakenly conclude that it takes longer
(or shorter) than 20 minutes for the ambulance to be on location

Type II: Not reject H0 when H1 is true, that is we mistakenly conclude that it takes
20 minutes for the ambulance to be on location

Example 3.37 Court of law analogy

A man is standing in a court of law accused of criminal activity.

The null- and the alternative hypotheses are

H0 : The man is not guilty,

H1 : The man is guilty.

We consider a man not guilty until evidence beyond any doubt proves him guilty.
This would correspond to an α of basically zero.

Clearly, we would prefer not to do any kinds of errors, however it is a fact of

life that we cannot avoid to do so: if we would want to never do a Type I error,
then we would never reject the null hypothesis, which means that we would
e.g. never conclude that one medical treatment is better than another, and thus,
that we would (more) often do a Type II error, since we would never detect
when there was a significance effect.

For the same investment (sample size n), we will increase the risk of a Type II
error by enforcing a lower risk of a Type I error. Only by increasing n we can
lower both of them, but to get both of them very low can be extremely expensive
and thus such decisions often involve economical considerations.

The statistical hypothesis testing framework is a way to formalize the handling

of the risk of the errors we may make and in this way make decisions in an
enlightened way knowing what the risks are. To that end we define the two
possible risks as

P("Type I error") = α,
(3-35)
P("Type II error") = β.

This notation is globally in statistical literature. The name choice for the Type I
error is in line with the use of α for the significance level, as:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 159

Theorem 3.38 Significance level and Type I error

The significance level α in hypothesis testing is the overall Type I risk

P("Type I error") = P("Rejection of H0 when H0 is true") = α. (3-36)

So controlling the Type I risk is what is most commonly apparent in the use of
statistics. Most published results are results that became significant, that is, the
p-value was smaller than α, and hence the relevant risk to consider is the Type I
risk.

Controlling/dealing with the Type II risk, that is: how to conclude on an exper-
iment/study in which the null hypothesis was not rejected (ı.e. no significant
effect was found) is not so easy, and may lead to heavy discussions if the non-
findings even get to the public. To which extent is a non-finding an evidence of
the null hypothesis being true? Well, in the outset the following very important
saying makes the point:

Remark 3.39
Absence of evidence is NOT evidence of absence!
Or differently put:
Accepting a null hypothesis is NOT a statistical proof of the null hypothesis
being true!

The main thing to consider here is that non-findings (non-significant results)

may be due to large variances and small sample sizes, so sometimes a non-
finding is indeed just that we know nothing. In other cases, if the sample sizes
were high, a non-finding may actually, if not proving an effect equal to zero,
which is not really possible, then at least indicate with some confidence that the
possible effect is small or even very small. The confidence interval is a more
clever method to use here, since the confidence interval will show the precision
of what we know, whether it includes the zero effect or not.

In Section 3.3 we will use a joint consideration of both error types to formalize
the planning of suitably sized studies/experiments.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 160

3.1.8 Assumptions and how to check them

The t-tests that have been presented above are based on some assumptions
about the sampling and the population. In Theorem 3.2 the formulations are
that the random variables X1 , . . . , Xn are independent and identically normally
distributed: Xi ∼ N (µ, σ2 ). In this statement there are two assumptions:

• Independent observations
• Normal distribution

The assumption about independent observations can be difficult to check. It

means that each observation must bring a unique new amount of information to
the study. Independence will be violated if some measurements are not on ran-
domly selected units and share some feature – returning to the student height
example: we do not want to include twins or families in general. Having a sam-
ple of n = 20 heights, where 15 of them stem from a meeting with a large family
group would not be 20 independent observations. The independence assump-
tion is mainly checked by having information about the sampling procedure.

The assumption about normality can be checked graphically using the actual
sample at hand.

Example 3.40 Student heights

We will return to the height of the ten students from example 3.1. If we want to
check whether the sample of heights could come from a normal distribution then
we could plot a histogram and look for a symmetric bell-shape:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 161

## The height sample

x <- c(168,161,167,179,184,166,198,187,191,179)

## Using histograms
par(mfrow=c(1,3), mar=c(4,3,1,1))
hist(x, xlab="Height", main="")
hist(x, xlab="Height", main="", breaks=8)
hist(x, xlab="Height", main="", breaks=2)

0.0 0.5 1.0 1.5 2.0 2.5 3.0

6
5
3
Frequency

Frequency

Frequency
4
2

3
2
1

1
0

160 180 200 160 180 200 0 160 180 200

Height Height Height

However, as we can see the histograms change shape depending on the number of
breaks.

Instead of using histograms, one can plot empirical cumulative distribution (see
1.6.2) and compare it with the best fitting normal distribution, in this case N (µ̂ =
178, σ̂2 = 12.212 ):
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 162

## Plot the empirical cdf

plot(ecdf(x), verticals = TRUE)
## Plot the best normal cdf
xseq <- seq(0.9*min(x), 1.1*max(x), length.out = 100)
lines(xp, pnorm(xp, mean(x), sd(x)))

ecdf(x)
1.0
0.8
0.6
Fn(x)
0.4
0.2
0.0

160 170 180 190 200

In the accumulated distribution plot it is easier to see how close the distributions are
– compared to in the density histogram plot. However, we will go one step further
and do the q-q plot: The observations (sorted from smallest to largest) are plotted
against the expected quantiles – from the same normal distribution as above. If the
observations are normally distributed then the observed are close to the expected
and this plot is close to a straight line. In R we can generate this plot by the following:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 163

## The expected quantiles in a 0 to 1 uniform distribution

n <- length(x)
## They have equal distance
pseq <- (1:n-0.5)/n
## Plot the expected normal distribution quantiles
plot(x=qnorm(p=pseq), y=sort(x), xlab="Normal quantiles",
ylab="Sample quantiles")
## Mark the 1st and 3rd quantiles with crosses
points(x=qnorm(p=c(0.25,0.75)), y=quantile(x,probs=c(0.25,0.75)),
pch=3, col="red")
## Add a straight line through the 1st and 3rd quantiles
qqline(x)
190
Sample quantiles
180
170
160

-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5

Normal quantiles

In the ideal normal case, the observations vs. the expected quantiles in the best
possible normal distribution will be on a straight line, here plotted with the inbuilt
function qqnorm:
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 164

set.seed(89473)
## Simulate 100 normal distributed observations
xr <- rnorm(100, mean(x), sd(x))
## Do the q-q normal plot with inbuilt functions
qqnorm(xr)
qqline(xr)

Normal Q-Q Plot

Sample Quantiles
190
170
150

-2 -1 0 1 2
Theoretical Quantiles

Note, that inbuilt functions do exactly the same as R code generating the first q-q
plot, except for a slight change for n < 10, as described in Method 3.41.

In this example the points are close to a straight line and we can assume that the
normal distribution holds. It can, however, be difficult to decide whether the plot is
close enough to a straight line, so there is a package in R (the MESS package) where
the observations are plotted together with eight simulated plots where the normal
distribution is known to hold. It is then possible to visually compare the plot based
on the observed data to the simulated data and see whether the distribution of the
observations is "worse" than they should be.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 165

2
● ●
●
Sample Quantiles

Sample Quantiles

Sample Quantiles
1

1
● ●
●
●
●
● ●

● ● ● ●
●
0

0
● ●
●
● ● ●
● ●
● ●
−1

−1

−1
●
● ●
● ●
−2

−2

−2
−1.5 −0.5 0.5 1.0 1.5 −1.5 −0.5 0.5 1.0 1.5 −1.5 −0.5 0.5 1.0 1.5

Theoretical Quantiles Theoretical Quantiles Theoretical Quantiles

2
●

●
● ●
Sample Quantiles

Sample Quantiles

Sample Quantiles
1

1
● ●
● ● ●
●
●
● ●
0

0
● ●
● ● ●
● ●
● ● ●

● ● ●
−1

−1

−1
●
●
●
●
−2

−2

−2
−1.5 −0.5 0.5 1.0 1.5 −1.5 −0.5 0.5 1.0 1.5 −1.5 −0.5 0.5 1.0 1.5

Theoretical Quantiles Theoretical Quantiles Theoretical Quantiles

● ●
2

2
●
●
● ●
Sample Quantiles

Sample Quantiles

Sample Quantiles
1

● 1
●
● ● ●
● ● ●
● ●
●
0

●
● ● ●
● ●
● ●
−1

−1

●
●
−2

−2

●
●

−1.5 −0.5 0.5 1.0 1.5 −1.5 −0.5 0.5 1.0 1.5 −1.5 −0.5 0.5 1.0 1.5

Theoretical Quantiles Theoretical Quantiles Theoretical Quantiles

When we look at the nine plots then the original data are plotted in the frame with
the red border. Comparing the observed data to the simulated data the straight
line for the observed data is no worse than some of the simulated data, where the
normality assumption is known to hold. So we conclude here that we apparently
have no problem in assuming the normal distribution for these data.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 166

Method 3.41 The Normal q-q plot

The ordered observations x(1) , . . . , x(n) , called the sample quantiles, are plot-
ted versus a set of expected normal quantiles z p1 , . . . , z pn . If the points are
not systematically deviating from a line, we accept the normal distribution
assumption. The evaluation of this can be based on some simulations of a
sample of the same size.
The usual definition of p1 , . . . , pn to be used for finding the expected normal
quantiles is

i − 0.5
pi = , i = 1, . . . , n. (3-37)
n
Hence, simply the equally distanced points between 0.5/n and 1 − 0.5/n.
This is the default method in the qqnorm function in R, when n > 10, if
n ≤ 10 instead
i − 3/8
pi = , i = 1, . . . , n, (3-38)
n + 1/4
is used.

Example 3.42 Student heights

An example of how the expected normal quantile is calculated by R can be seen

if we take the second smallest height 166. There are 2 observations ≤ 166, so
166 = x(2) can be said to be the observed 210.25
−3/8
= 0.1585 quantile (where we use
the default R-definition for n ≤ 10). The 0.1585 quantile in the normal distribution
is qnorm(0.1585) = −1.00 and the point (−1.00, 166) can be seen on the q-q plot
above.

3.1.9 Transformation towards normality

In the above we looked at methods to check for normality. When the data are
not normally distributed it is often possible to choose a transformation of the
sample, which improves the normality.

When the sample is positive with a long tail or a few large observations then the
most common choice is to apply a logarithmic transformation, log( x ). The log-
transformation will make the large values smaller and also spread the observa-
tions on both positive and negative values. Even though the log-transformation
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 167

√ 1
is the most common there are also other possibilities such as x or x for making
large values smaller, or x2 and x3 for making large values larger.

When we have transformed the sample we can use all the statistical analyse we
want. It is important to remember that we are now working on the transformed
scale (e.g. the mean and its confidence interval is calculated for log( x )) and
perhaps it will be necessary to back-transform to the original scale.

Example 3.43 Radon in houses

In an American study the radon level was measured in a number of houses. The
Environmental Protection Agency’s recommended action level is ≥ 4 pCi/L. Here
we have the results for 20 of the houses (in pCi/L):

House 1 2 3 4 5 6 7 8 9 10
Radon level 2.4 4.2 1.8 2.5 5.4 2.2 4.0 1.1 1.5 5.4
House 11 12 13 14 15 16 17 18 19 20
Radon level 6.3 1.9 1.7 1.1 6.6 3.1 2.3 1.4 2.9 2.9

The sample mean, median and std. deviance is: x̄ = 3.04, Q2 = 2.45 and s x = 1.72.

We would like to see whether these observed radon levels could be thought of as
coming from a normal distribution. To do this we will plot the data:

## Reading in the sample

radon <- c(2.4, 4.2, 1.8, 2.5, 5.4, 2.2, 4.0, 1.1, 1.5, 5.4, 6.3,
1.9, 1.7, 1.1, 6.6, 3.1, 2.3, 1.4, 2.9, 2.9)
par(mfrow = c(1,2))
hist(radon)
qqnorm(radon, ylab = "Sample quantiles", xlab = "Normal quantiles")
qqline(radon)

Histogram of radon Normal Q-Q Plot

6
6

Sample quantiles
5

5
Frequency
4

4
3

3
2

2
1
0

1 2 3 4 5 6 7 -2 -1 0 1 2
radon Normal quantiles
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 168

From both plots we see that the data are positive and right skewed with a few large
observations. Therefore a log-transformation is applied:

## Transform using the natural logarithm

logRadon <- log(radon)
hist(logRadon)
qqnorm(logRadon, ylab = "Sample quantiles", xlab = "Normal quantiles")
qqline(logRadon)

Histogram of logRadon Normal Q-Q Plot

7
6

1.5
Sample quantiles
5
Frequency
4

1.0
3
2

0.5
1
0

0.0 0.5 1.0 1.5 2.0 -2 -1 0 1 2

logRadon Normal quantiles

As we had expected the log-transformed data seem to be closer to a normal distri-

bution.

We can now calculate the mean and 95% confidence interval for the log-transformed
data. However, we are perhaps not interested in the mean of the log-radon levels,
then we have to back-transform the estimated mean and confidence interval using
exp( x ). When we take the exponential of the estimated mean, then this is no longer
a mean but a median on the original pCi/L scale. This gives a good interpretation,
as medians are useful when the distributions are not symmeric.
Chapter 3 3.1 LEARNING FROM ONE-SAMPLE QUANTITATIVE DATA 169

## A confidence interval and t-test

t.test(logRadon, conf.level=0.95)

One Sample t-test

data: logRadon
t = 7.8, df = 19, p-value = 2e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.7054 1.2234
sample estimates:
mean of x
0.9644

## Back transform to original scale, now we get the median!

exp(0.9644)

[1] 2.623

## And the confidence interval on the original scale

exp(c(0.7054, 1.2234))

[1] 2.025 3.399

From the R code we see that the mean log-radon level is 0.96 (95% CI: 0.71 to 1.22).
On the original scale the estimated median radon level is 2.6 pCi/L (95% CI: 2.0 to
3.4).

Theorem 3.44 Transformations and quantiles

In general, the data transformations discussed in this section will preserve
the quantiles of the data. Or more precisely, if f is a data transformation
function (an increasing function), then

The pth quantile of f (Y ) = f (The pth quantile of Y ). (3-39)

The consequence of this theorem is that confidence limits on one scale trans-
form easily to confidence limits on another scale even though the transforming
function is non-linear.
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA170

3.2 Learning from two-sample quantitative data

In this section the setup where we want to learn about the difference between
the means from two populations – this is very often a setup encountered in
most fields of science and engineering: compare the quality of two products,
compare the performance of two groups, compare a new drug to a placebo and
so on. One could say, that it should be called a two-population setup, since it
is really two populations (or groups) which are compared by taking a sample
from each, however it is called a two-sample setup (probably it sounds better to
say).

First, the two-sample setup is introduced with an example and then methods
for confidence intervals and tests are presented.

Example 3.45 Nutrition study

In a nutrition study the aim is to investigate if there is a difference in the energy

usage for two different types of (moderately physically demanding) work. In the
study, the energy usage of 9 nurses from hospital A and 9 (other) nurses from hos-
pital B have been measured. The measurements are given in the following table in
mega Joule (MJ):

Hospital A Hospital B
7.53 9.21
7.48 11.51
8.08 12.79
8.09 11.85
10.15 9.97
8.40 8.79
10.88 9.69
6.13 9.68
7.90 9.19

Our aim is to assess the difference in energy usage between the two groups of nurses.
If µ A and µ B are the mean energy expenditures for nurses from hospital A and B,
then the estimates are just the sample means

µ̂ A = x̄ A = 8.293,
µ̂ B = x̄ B = 10.298.

To assess the difference in means, δ = µ B − µ A , we could consider the confidence

interval for δ = µ B − µ A . Clearly, the estimate for the difference is the difference of
the sample means, δ̂ = µ̂ B − µ̂ A = 2.005.
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA171

The 95% confidence interval is

2.005 ± 1.412 = [0.59, 3.42],

which spans the mean differences in energy expenditure that we find acceptable
based on the data. Thus we do not accept that the mean difference could be zero.

The interval width, given by 1.41, as we will learn below, comes from a simple com-
putation using the two sample standard deviations, the two sample sizes and a t-
quantile.

We can also compute a p-value to measure the evidence against the null hypothesis
that the mean energy expenditures are the same. Thus we consider the following
null hypothesis

H0 : δ = 0.

Since the 95% confidence interval does not cover zero, we already know that the p-
value for this significance test will be less than 0.05. In fact it turns out that the
p-value for this significance test is 0.0083 indicating strong evidence against the
null hypothesis that the mean energy expenditures are the same for the two nurse
groups. We therefore have strong evidence that the mean energy expenditure of
nurses from hospital B is higher than that of nurses from hospital A.

This section describes how to compute the confidence intervals and p-values in such
two-sample setups.

3.2.1 Comparing two independent means - Confidence Interval

We assume now that we have a sample x1 , . . . , xn taken at random from one

population with mean µ1 and variance σ12 and another sample y1 , . . . , yn taken
at random from another population with mean µ2 and variance σ22 .
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA172

Method 3.46 The two-sample confidence interval for µ1 − µ2

For two samples x1 , . . . , xn and y1 , . . . , yn the 100(1 − α)% confidence inter-
val for µ1 − µ2 is given by
s
s21 s2
x̄ − ȳ ± t1−α/2 · + 2, (3-40)
n1 n2

where t1−α/2 is the (1 − α/2)-quantile from the t-distribution with ν degrees

of freedom given from Equation (3-45)
2
s21 s22
n1 + n2
ν= . (3-41)
(s21 /n1 )2 (s22 /n2 )2
n1 −1 + n2 −1

Note how the t-quantile used for the confidence interval is exactly what we

called the critical value above.

Example 3.47 Nutrition study

Let us find the 95% confidence interval for µ B − µ A . Since the relevant t-quantile is,
using ν = 15.99,

t0.975 = 2.120,

the confidence interval becomes

r
2.0394 1.954
10.298 − 8.293 ± 2.120 · + ,
9 9
which then gives the result as also seen above

[0.59, 3.42].

3.2.2 Comparing two independent means - hypothesis test

We describe the setup as having a random sample from each of two different
populations, each described by a mean and a variance:

• Population 1: has mean µ1 , and variance σ12

Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA173

• Population 2: has mean µ2 , and variance σ22

The interest lies in the comparisons of the means.

Method 3.48 The (Welch) two-sample t-test statistic

When considering the null hypothesis about the difference between the
means of two independent samples

δ = µ2 − µ1 ,
(3-42)
H0 : δ = δ0 ,

the (Welch) two-sample t-test statistic is

( x̄ − x̄2 ) − δ0
tobs = q 1 . (3-43)
s21 /n1 + s22 /n2

Theorem 3.49 The distribution of the (Welch) two-sample statistic

The (Welch) two-sample statistic seen as a random variable

( X̄ − X̄2 ) − δ0
T= q 1 , (3-44)
S12 /n1 + S22 /n2

approximately, under the null hypothesis, follows a t-distribution with ν

degrees of freedom, where
2
s21 s22
n1 + n2
ν= , (3-45)
(s21 /n1 )2 (s22 /n2 )2
n1 −1 + n2 −1

if the two population distributions are normal or if the two sample sizes are
large enough.

We can now, based on this, express the full hypothesis testing procedures for
the two-sample setting:
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA174

Method 3.50 The level α two-sample t-test

1. Compute tobs from Equation (3-43) and ν from Equation (3-45)

2
s21 s22
( x̄ − x̄2 ) − δ0 n1 + n2
tobs = q 1 and ν =
2 2 (s21 /n1 )2 (s22 /n2 )2
s1 /n1 + s2 /n2 n1 −1 + n2 −1

2. Compute the evidence against the null hypothesisa

H0 : µ1 − µ2 = δ0 ,

vs. the alternative hypothesis,

H1 : µ1 − µ2 6= δ0 ,

by the

p-value = 2 · P( T > |tobs |),

where the t-distribution with ν degrees of freedom is used

3. If p-value < α: We reject H0 , otherwise we accept H0 ,

or
The rejection/acceptance conclusion could alternatively, but equiva-
lently, be made based on the critical value(s) ±t1−α/2 :
If |tobs | > t1−α/2 we reject H0 , otherwise we accept H0

a We are often interested in the test where δ0 = 0

An assumption that often is applied in statistical analyses of various kinds is

that of the underlying variability being of the same size in different groups or
at different conditions. The assumption is rarely crucial for actually carrying
out some good statistics, but it may indeed make the theoretical justification for
what is done more straightforward, and the actual computational procedures
also may become more easily expressed. We will see in later chapters how this
comes in play. Actually, the methods presented above does not make this as-
sumption, which is nice. The fewer assumptions needed the better, obviously.
Assumptions are problematic in the sense, that they may be questioned for par-
ticular applications of the methods.
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA175

However, below we will present a version of the two-sample t-test statistic, that
actually is adapted to such an assumption, namely assuming that the two pop-
ulation variances are the same: σ12 = σ22 . We present it here not because we
really need it, we will use the above in all situations. But the version below
will appear and be used many places and it also bears some nice relations to
later multi-group analysis (Analysis of Variance (ANOVA)) that we will get to
in Chapter 8.

If we believe in the equal variance assumption it is natural to compute a single

joint – called the pooled – estimate of the variance based on the two individual
variances:

Method 3.51 The pooled two-sample estimate of variance

Under the assumption that σ12 = σ22 the pooled estimate of variance is the
weighted average of the two sample variances

(n1 − 1)s21 + (n2 − 1)s22

s2p = . (3-46)
n1 + n2 − 2

Note that when there is the same number of observations in the two groups,
n1 = n2 , the pooled variance estimate is simply the average of the two sample
variances. Based on this the so-called pooled two-sample t-test statistic can be
given:

Method 3.52 The pooled two-sample t-test statistic

When considering the null hypothesis about the difference between the
means of two independent samples

δ = µ1 − µ2 ,
(3-47)
H0 : δ = δ0 .

the pooled two-sample t-test statistic is

( x̄ − x̄2 ) − δ0
tobs = q 1 . (3-48)
s2p /n1 + s2p /n2
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA176

And the following theorem would form the basis for hypothesis test procedures
based on the pooled version:

Theorem 3.53 The distribution of the pooled two-sample t-test

statistic
The pooled two-sample statistic seen as a random variable:

( X̄ − X̄2 ) − δ0
T= q 1 . (3-49)
S2p /n1 + S2p /n2

follows, under the null hypothesis and under the assumption that σ12 = σ22 ,
a t-distribution with n1 + n2 − 2 degrees of freedom if the two population
distributions are normal.

A little consideration will show why choosing the Welch-version as the ap-
proach to always use makes good sense: First of all if s21 = s22 the Welch and the
Pooled test statistics are the same. Only when the two variances become really
different the two test-statistics may differ in any important way, and if this is
the case, we would not tend to favour the pooled version, since the assumption
of equal variances appears questionable then.

Only for cases with a small sample sizes in at least one of the two groups the
pooled approach may provide slightly higher power if you believe in the equal
variance assumption. And for these cases the Welch approach is then a some-
what cautious approach.

Example 3.54 Nutrition study

Let us consider the nurses example again, and test the null hypothesis expressing
that the two groups have equal means

H0 : δ = µ A − µ B = 0,

versus the alternative

H0 : δ = µ A − µ B 6= 0,

using the most commonly used significance level, α = 0.05. We follow the steps
of Method 3.50: we should first compute the test-statistic tobs and the degrees of
freedom ν. These both come from the basic computations on the data:
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA177

## Load the two samples

xA <- c(7.53, 7.48, 8.08, 8.09, 10.15, 8.4, 10.88, 6.13, 7.9)
xB <- c(9.21, 11.51, 12.79, 11.85, 9.97, 8.79, 9.69, 9.68, 9.19)
## Summary statistics
c(mean(xA), mean(xB))

[1] 8.293 10.298

c(var(xA), var(xB))

[1] 2.039 1.954

c(length(xA), length(xB))

[1] 9 9

So
10.298 − 8.293
tobs = √ = 3.01,
2.0394/9 + 1.954/9
and
2.0394
2
9 + 1.954
9
ν= (2.0394/9)2 )2
= 15.99.
8 + (1.954/9
8
Or the same done in R by ”manual” expression:

## Keep the summary statistics

ms <- c(mean(xA), mean(xB))
vs <- c(var(xA), var(xB))
ns <- c(length(xA), length(xB))
## The observed statistic
t_obs <- (ms[2]-ms[1])/sqrt(vs[1]/ns[1]+vs[2]/ns[2])
## The degrees of freedom
nu <- ((vs[1]/ns[1]+vs[2]/ns[2])^2)/
((vs[1]/ns[1])^2/(ns[1]-1)+(vs[2]/ns[2])^2/(ns[2]-1))
## Print the result
t_obs

[1] 3.009

[1] 15.99
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA178

Next step is then to find the p-value

p-value = 2 · P( T > |tobs |) = 2P( T > 3.01) = 2 · 0.00415 = 0.0083,

where we use R to find the probability P( T > 3.01) based on a t-distribution with
ν = 15.99 degrees of freedom:

1 - pt(t_obs, df = nu)

[1] 0.004161

To complete the hypothesis test, we compare the p-value with the given α-level, in
this case α = 0.05, and conclude:
Since the p-value < α we reject the null hypothesis, and we have sufficient ev-
idence for concluding: the two nurse groups have on average different energy
usage work levels. We have shown this effect to be statistically significant.

In spite of a pre-defined α-level (whoever gave us that), it is always valuable to

consider at what other α-levels the hypothesis would be rejected/accepted. Or in
different words, interpret the size of the p-value using Table 3.1 and we thus sharpen
the statement a little:
Since the p-value in this case is between 0.001 and 0.01 conclude: there is
a strong evidence against equality of the two population energy usage means
and it is found that the mean is significantly higher on Hospital B compared to
Hospital A.

The last part, that the mean is higher on Hospital B, can be concluded because it is
rejected that they are equal and x̄ B > x̄ A and we can thus add this to the conclusion.

Finally, the t-test computations are actually directly provided by the t.test function
in R, if it is called using two data input vectors:

t.test(xB, xA)

Welch Two Sample t-test

data: xB and xA
t = 3, df = 16, p-value = 0.008
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.5923 3.4166
sample estimates:
mean of x mean of y
10.298 8.293
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA179

Note, how the default choices of the R-function matches our exposition:

• Default test version: the Welch (not assuming equal variances)

• Default α-level: 0.05
• Default ”direction version”: the two-sided (or non-directional) alternative hy-
pothesis, see Section 3.1.7 about other alternative hypotheses)

Actually, the final rejection/acceptance conclusion based on the default (or chosen)
α-level is not given by R.

In the t.test results the α-level is used for the given confidence interval for the
mean difference of the two populations, to be interpreted as: we accept that the
true difference in mean energy levels between the two nurse groups is somewhere
between 0.6 and 3.4.

Remark 3.55
Often ”degrees of freedom” are integer values, but in fact t-distributions
with non-integer valued degrees of freedom are also well defined. The
ν = 15.99 t-distribution (think of the density function) is a distribution in
between the ν = 15 and the ν = 16 t-distributions. Clearly it will indeed be
very close to the ν = 16 one.

We did not in the example above use Step 4. of Method 3.50, which can be
called the critical value approach. In fact this approach is directly linked to
the confidence interval in the sense that one could make a rapid conclusion
regarding rejection or not by looking at the confidence interval and checking
whether the hypothesized value is in the interval or not. This would correspond
to using the critical value approach.

Example 3.56 Nutrition study

In the nutrition example above, we can see that 0 is not in the confidence interval so
we would reject the null hypothesis. Let us formally use Step 4 of Method 3.50 to
see how this is exactly the same: the idea is that one can even before the experiment
is carried out find the critical value(s), in this case:

The 5% critical values = ±t0.975 = ±2.120,

where the quantile is found from the t-distribution with ν = 15.99 degrees of free-
dom:
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA180

qt(0.975, df = 15.99)

[1] 2.12

Now we conclude that since the observed t-statistic tobs = 3.01 is beyond the crit-
ical values (either larger than 2.120 or smaller than −2.120) the null hypothesis is
rejected, and further since it was higher, that µ A − µ B > 0 hence µ B > µ A .

Example 3.57 Overlapping confidence intervals?

A commonly encountered way to visualize the results of a two-sample comparison

is to use a barplot of the means together with some measure of uncertainty, either
simply the standard errors of the means or the 95% confidence intervals within each
group:

## The confidence intervals and joining the lower and upper limits
CIA <- t.test(xA)$conf.int
CIB <- t.test(xB)$conf.int
lower <- c(CIA[1], CIB[1])
upper <- c(CIA[2], CIB[2])
## First install the package with: install.packages("gplots")
library(gplots)
barplot2(c(mean(xA),mean(xB)), plot.ci=TRUE, ci.l=lower, ci.u=upper,
col = 2:3)
10
8
6
4
2
0

Here care must taken in the interpretation of this plot: it is natural, if your main
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA181

aim is a comparison of the two means, to immediately visually check whether the
shown error bars, in this case the confidence intervals, overlap or not, to make a con-
clusion about group difference. Here they actually just overlap - could be checked
by looking at the actual CIs:

## The confidence intervals

CIA

[1] 7.196 9.391

attr(,"conf.level")
[1] 0.95

CIB

[1] 9.223 11.372

attr(,"conf.level")
[1] 0.95

And the conclusion would (incorrectly) be that the groups are not statistically dif-
ferent. However, remind that we found above that the p-value = 0.008323, so we
concluded that there was strong evidence of a mean difference between the two
nurse groups.

The problem of the ”overlapping CI interpretation” illustrated in the example

comes technically from the fact that standard deviations are not additive but
variances are
σ(X̄ A −X̄B ) 6= σX̄ A + σX̄B ,
(3-50)
V( X̄ A − X̄B ) = V( X̄ A ) + V( X̄B ).
The latter is what the confidence interval for the mean difference µ A − µ B is using
and what should be used for the proper statistical comparison of the means.
The former is what you implicitly use in the ”overlapping CI interpretation
approach”.

The proper standard deviation (sampling error) of the sample mean difference due
to Pythagoras, is smaller than the sum of the two standard errors: assume that
the two standard
√ errors are 3 and 4. The sum is 7, but the squareroot of the
squares is 3 + 42 = 5. Or more generally
2

σ(X̄ A −X̄B ) < σX̄ A + σX̄B . (3-51)

So we can say the following:
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA182

Remark 3.58
When interpreting two (and multi-) independent samples mean barplots
with added confidence intervals:

When two CIs do NOT overlap: The two groups are significantly different

When two CIs DO overlap: We do not know from this what the conclusion
is (but then we can use the presented two-sample test method)

One can consider other types of plots for visualizing (multi)group differences.
We will return to this in Chapter 7 on the multi-group data analysis, the so-
called Analysis of Variance (ANOVA).

3.2.3 The paired design and analysis

Example 3.59 Sleeping medicine

In a study the aim is to compare two kinds of sleeping medicine A and B. 10 test
persons tried both kinds of medicine and the following results are obtained, given
in prolonged sleep length (in hours) for each medicine type:

Person A B D = B−A
1 +0.7 +1.9 +1.2
2 -1.6 +0.8 +2.4
3 -0.2 +1.1 +1.3
4 -1.2 +0.1 +1.3
5 -1.0 -0.1 +0.9
6 +3.4 +4.4 +1.0
7 +3.7 +5.5 +1.8
8 +0.8 +1.6 +0.8
9 0.0 +4.6 +4.6
10 +2.0 +3.4 +1.4

Note that this is the same experiment as already treated in Example 3.20. We now in
addition see the original measurements for each sleeping medicine rather than just
individual differences given earlier. And we saw that we could obtain the relevant
analysis (p-value and confidence interval) by a simple call to the t.test function
using the 10 differences:
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA183

## Read the samples

x1 <- c(.7,-1.6,-.2,-1.2,-1,3.4,3.7,.8,0,2)
x2 <- c(1.9,.8,1.1,.1,-.1,4.4,5.5,1.6,4.6,3.4)
## Take the differences
dif <- x2 - x1
## t-test on the differences
t.test(dif)

One Sample t-test

data: dif
t = 4.7, df = 9, p-value = 0.001
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.8613 2.4787
sample estimates:
mean of x
1.67

The example shows that this section actually could be avoided, as the right way
to handle this so-called paired situation is to apply the one-sample theory and
methods from Section 3.1 on the differences

di = xi − yi for i = 1, 2, ..., n. (3-52)

Then we can do all relevant statistics based on the mean d¯ and the variance s2d
for these differences.

The reason for having an entire section devoted to the paired t-test is that it is
an important topic for experimental work and statistical analysis. The paired
design for experiments represent an important generic principle for doing ex-
periments as opposed to the un-paired/independent samples design, and these
important basic experimental principles will be important also for multi-group
experiments and data, that we will encounter later in the material.
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA184

Example 3.60 Sleeping medicine

And similarly in R, they have prepared way to do the paired analysis directly on the
two-sample data:

t.test(x2, x1, paired = TRUE)

Paired t-test

data: x2 and x1
t = 4.7, df = 9, p-value = 0.001
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8613 2.4787
sample estimates:
mean of the differences
1.67

Paired vs. completely randomized experiments

An experiment like the one exemplified here where two treatments are investi-
gated can essentially be performed in two different ways:

Completely Randomized (independent samples) 20 patients are used and com-

pletely at random allocated to one of the two treatments (but usually mak-
ing sure to have 10 patients in each group). So: different persons in the
different groups.

Paired (dependent samples) 10 patients are used, and each of them tests both
of the treatments. Usually this will involve some time in between treat-
ments to make sure that it becomes meaningful, and also one would typ-
ically make sure that some patients do A before B and others B before A.
(and doing this allocation at random). So: the same persons in the differ-
ent groups.

Generally, one would expect that whatever the experiment is about and which
observational units are involved (persons, patients, animals) the outcome will
be affected by the properties of each individual – the unit. In the example,
some persons will react positively to both treatments because they generally
are more prone to react to sleeping medicins. Others will not respond as much
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA185

to sleeping medicin. And these differences, the person-to-person variability,

will give a high variance for the Welch independent samples t-test used for
the independent samples case. So generally, one would often prefer to carry
out a paired experiment, where the generic individual variability will not blur
the signal – one can say that in a paired experiment, each individual serves as
his/her own control – the effect of the two treatments are estimated for each
individual. We illustrate this by analysing the example data wrongly, as if they
were the results of a completely randomized experiment on 20 patients:

Example 3.61 Sleeping medicine - WRONG analysis

What happens when applying the wrong analysis:

## WRONG analysis
t.test(x1, x2)

Welch Two Sample t-test

data: x1 and x2
t = -1.9, df = 18, p-value = 0.07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.4854 0.1454
sample estimates:
mean of x mean of y
0.66 2.33

Note how the p-value here is around 0.07 as opposed to the 0.001 from the proper
paired analysis. Also the confidence interval is much wider. Had we done the ex-
periment with 20 patients and gotten the results here, then we would not be able
to detect the difference between the two medicines. What happened is that the in-
dividual variabilities seen in each of the two groups now, incorrectly so, is being
used for the statistical analysis and these are much larger than the variability of the
differences:
Chapter 3 3.2 LEARNING FROM TWO-SAMPLE QUANTITATIVE DATA186

var(x1)

[1] 3.452

var(x2)

[1] 4.009

var(x1-x2)

[1] 1.278

3.2.4 Validation of assumptions with normality investigations

For normality investigations in two-sample settings we use the tools given for
one-sample data, presented in Section 3.1.8. For the paired setting, the investi-
gation would be carried out for the differences. For the independent case the
investigation is carried out within each of the two groups.
Chapter 3 187
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

3.3 Planning a study: wanted precision and power

Experiments and observational studies are always better when they are care-
fully planned. Good planning covers many features of the study. The obser-
vations must be sampled appropriately from the population, reliable measure-
ments must be made and the study must be "big enough" to be able to detect
an effect of interest. And if the study becomes too big, effects of little practical
interest may become statistically significant, and (some of) the money invested
in the study will be wasted. Sample size is important for economic reasons: an
oversized study uses more resourses than necessary, this could be both finan-
cial but also ethical if subjecting objects to potentially harmful treatments, an
undersized study can be wasted if it is not able to produce reliable results.

Sample size is very important to consider before a study is carried out.

3.3.1 Sample Size for wanted precision

One way of calculating the required sample size is to work back from the wanted
precision. From (3-9) we see that the confidence interval is symmetric around
x̄ and the half width of the confidence interval (also called the margin of error
(ME)) is given as
σ
ME = t1−α/2 √ . (3-53)
n

Here t1−α/2 is the (1 − α/2) quantile from the t-distribution with n − 1 degrees
of freedom. This quantile depends on both α and the sample size n, which is
what we want to find.

The sample size now affects both n and t1−α/2 , but if we have a large sample
(e.g. n ≥ 30) then we can use the normal approximation and replace t1−α/2 by
the quantile from the normal distribution z1−α/2 .

In the expression for ME in Equation (3-53) we also need σ, the standard devi-
ation. An estimate of the standard deviation would usually only be available
after the sample has been taken. Instead we use a guess for σ possibly based on
a pilot study or from the literature, or we could use a scenario based choice (i.e.
set σ to some value which we think is reasonable).

For a given choice of ME it is now possible to isolate n in Equation (3-53) (with

the normal quantile inserted instead of the t-quantile):
Chapter 3 188
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

Method 3.62 The one-sample CI sample size formula

When σ is known or guessed at some value, we can calculate the sample
size n needed to achieve a given margin of error, ME, with probability 1 − α
as
z 2
1−α/2 · σ
n= . (3-54)
ME

Example 3.63 Student heights

In Example 3.1 we inferred using a sample of heights of 10 students and found the
sample mean height to be x̄ = 178 and standard deviation s = 12.21. We can now
calculate how many students we should include in a new study, if we want a margin
of error of 3 cm with confidence 95%. Using the standard deviation from the pilot
study with 10 students as our guess we can plug into Method A.4
2
1.96 · 12.21
n= = 63.64.
3

These calculations show that we should include 64 students, the nearest integer to
63.64.

The formula and approach here has the weakness that it only gives an ”ex-
pected” behaviour in a coming experiment - at first reading this may seem good
enough, but if you think about it, it means that half of time the actual width will
be smaller and the other half, it will be larger. If this uncertainty variability is
not too large it is still not a big problem, but nothing in the approach helps us
to know whether it is good enough. A more advanced approach, that will help
us to be more in control of being more certain, that a future experiment/study
will meet our needs is presented now.

3.3.2 Sample size and statistical power

Another way of calculating the necessary sample size is to use the power of the
study. The statistical power of a study is the probability of correctly rejecting H0 if H0
is false. The relations between Type I error, Type II error and the power are seen
in the table below.
Chapter 3 189
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

Low power High power

0.5 Under H0 0.5 Under H0
Under H1 Under H1
α/2 α/2
0.4 0.4
β β
Probability density

Probability density
Power Power
0.3 0.3

0.2 0.2

0.1 0.1
β Power Power
α α
2 β 2
0.0 0.0

µ0 µ1 µ0 µ1

Figure 3.2: The mean µ0 is the mean under H0 and µ1 the mean under H1 . When
µ1 increases (i.e. moving away from µ0 ) so does the power (the yellow area on
the graph).

Reject H0 Fail to reject H0

H0 is true Type I error (α) Correct acceptance of H0
H0 is false Correct rejection of H0 (Power) Type II error (β)

The power has to do with the Type II error β, the probability of wrongly accept-
ing H0 , when H0 actually is false. We would like to have high power (low β), but
it is clear that this will be impossible for all possible situations: it will depend
on the scenario for the potential mean – small potential effects will be difficult
to detect (low power), whereas large potential effects will be easier to detect
(higher power), as illustrated in Figure 3.2. In the left plot we have the mean
under H0 (µ0 ) close to the mean under the alternative hypothesis (µ1 ) making
it difficult to distinguish between the two and the power becomes low. In the
right plot µ0 and µ1 are further apart and the statistical power is much higher.

The power approach to calculating the sample size first of all involves specify-
ing the null hypothesis H0 . Then the following four elements must be speci-
fied/chosen:

• The significance level α of the test (in R: sig.level)

• A change in the mean that you would want to detect, effect size (in R:
delta)
• The standard deviation σ (in R: sd)
• The wanted power (1 − β) (in R: power)

When these values have been decided, it is possible to calculate the necessary
sample size, n. In the one-sided,one-sample t-test there is an approximate closed
Chapter 3 190
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

form for n and this is also the case in some other simple situations. R offers
easy to use functions for this not based on the approximate normal distribution
assumption, but using the more proper t-distributions. In more complicated
settings even it is possible to do some simulations to find the required sample
size.

Method 3.64 The one-sample sample size formula

For the one-sample t-test for given α, β and σ
2
z1− β + z1−α/2
n= σ ,
( µ0 − µ1 )
where µ0 − µ1 is the change in means that we would want to detect and
z1− β , z1−α/2 are quantiles of the standard normal distribution.

Example 3.65 Sample size as function of power

The following figure shows how the sample size increases with increasing power
using the formula in 3.64. Here we have chosen σ = 1 and α = 0.05. Delta is
µ0 − µ1 .

50
Delta = 0.5
40 Delta = 0.75
Delta = 1
Sample size

0.70 0.75 0.80 0.85 0.90

Power
Chapter 3 191
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

Example 3.66 Student heights

If we return to the example with student heights 3.1, we might want to collect data
for a new study to test the hypothesis about the mean height

H0 : µ = 180

Against the alternative

H1 : µ 6= 180

This is the first step in the power approach. The following four elements then are:

• Set the significance level α equal to 5%

• Specify that we want to be able to detect a change of 4 cm
• We will use the standard deviation 12.21 from the study with 10 subjects as
our guess for σ
• We want a power of 80%

Using the formula in 3.64 we get

0.84 + 1.96 2
n = 12.21 · = 73.05.
4
So we would need to include 74 students.
We could also use the R function for power and sample size based on the t-
distributions:

power.t.test(power=0.8, delta=4, sd=12.21, sig.level=0.05,

type="one.sample")

One-sample t test power calculation

n = 75.08
delta = 4
sd = 12.21
sig.level = 0.05
power = 0.8
alternative = two.sided

From the calculations in R avoiding the normal approximation the required sample
size is 76 students, very close to the number calculated by hand using the approxi-
mation above.

In fact the R-function is really nice in the way that it could also be used to find the
power for a given sample size, e.g. n = 50 (given all the other aspects):
Chapter 3 192
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

power.t.test(n=50, delta=4, sd=12.21, sig.level=0.05,

type="one.sample")

One-sample t test power calculation

n = 50
delta = 4
sd = 12.21
sig.level = 0.05
power = 0.6221
alternative = two.sided

This would only give the power 0.62 usually considered too low for a relevant effect
size.

And finally the R-function can tell us what effect size that could be detected by, say,
n = 50, and a power of 0.80:

power.t.test(n=50, power=0.80, sd=12.21, sig.level=0.05,

type="one.sample")

One-sample t test power calculation

n = 50
delta = 4.935
sd = 12.21
sig.level = 0.05
power = 0.8
alternative = two.sided

So with n = 50 only an effect size as big as 4.9 would be detectable with probability
0.80.

To summarize: if we know/define 4 out the 5 values: α, 1 − β, effect size, σ and

n, we can find the 5’th. And to repeat, in the R-function these values are called
sig.level, power, delta, sd and n.

In the practical planning of a study, often a number of scenario-based values of

effect size and σ are used to find a reasonable size of the study.
Chapter 3 193
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

3.3.3 Power/Sample size in two-sample setup

For power and sample size one can generalize the tools presented for the one-
sample setup in the previous section. We illustrate it here by an example of how
to work with the inbuilt R-function:

Example 3.67 Two-sample power and sample size computations

in R

We consider the two-sample hypothesis test

H0 : µ1 = µ2 ,
H1 : µ1 6= µ2

## Finding the power of detecting a group difference of 2

## with sigma=1 for n=10
power.t.test(n=10, delta=2, sd=1, sig.level=0.05)

Two-sample t test power calculation

n = 10
delta = 2
sd = 1
sig.level = 0.05
power = 0.9882
alternative = two.sided

NOTE: n is number in each group

## Finding the sample size for detecting a group difference of 2

## with sigma=1 and power=0.9
power.t.test(power=0.90, delta=2, sd=1, sig.level=0.05)

Two-sample t test power calculation

n = 6.387
delta = 2
sd = 1
sig.level = 0.05
power = 0.9
alternative = two.sided

NOTE: n is number in each group

Chapter 3 194
3.3 PLANNING A STUDY: WANTED PRECISION AND POWER

## Finding the detectable effect size (delta)

## with sigma=1, n=10 and power=0.9
power.t.test(power=0.90, n=10, sd=1, sig.level=0.05)

Two-sample t test power calculation

n = 10
delta = 1.534
sd = 1
sig.level = 0.05
power = 0.9
alternative = two.sided

NOTE: n is number in each group

Note how the two-sample t-test is the default choice of the R-function. Previously,
when we used the same function for the one-sample tests, we used the options
type="one.sample".
Chapter 3 3.4 EXERCISES 195

3.4 Exercises

Exercise 3.1 Concrete items

A construction company receives concrete items for a construction. The length

of the items are assumed reasonably normally distributed. The following re-
quirements for the length of the elements are made

µ = 3000 mm.

The company samples 9 items from a delevery which are then measured for
control. The following measurements (in mm) are found:

3003 3005 2997 3006 2999 2998 3007 3005 3001

a) Compute the following three statistics: the sample mean, the sample stan-
dard deviation and the standard error of the mean, and what are the in-
terpretations of these statistics?

b) In a construction process, 5 concrete items are joined together to a single

construction with a length which is then the complete length of the 5 con-
crete items. It is very important that the length of this new construction
is within 15 m plus/minus 1 cm. How often will it happen that such a
construction will be more than 1 cm away from the 15 m target (assume
that the population mean concrete item length is µ = 3000 mm and that
the population standard deviation is σ = 3)?

c) Find the 95% confidence interval for the mean µ.

d) Find the 99% confidence interval for µ. Compare with the 95% one from
above and explain why it is smaller/larger!
Chapter 3 3.4 EXERCISES 196

e) Find the 95% confidence intervals for the variance σ2 and the standard
deviation σ.

f) Find the 99% confidence intervals for the variance σ2 and the standard
deviation σ.

Exercise 3.2 Aluminum profile

The length of an aluminum profile is checked by taking a sample of 16 items

whose length is measured. The measurement results from this sample are listed
below, all measurements are in mm:

180.02 180.00 180.01 179.97 179.92 180.05 179.94 180.10

180.24 180.12 180.13 180.22 179.96 180.10 179.96 180.06

From data is obtained: x̄ = 180.05 and s = 0.0959.

It can be assumed that the sample comes from a population which is normal
distributed.

a) A 90%-confidence interval for µ becomes?

b) A 99%-confidence interval for σ becomes?

Exercise 3.3 Concrete items (hypothesis testing)

This is a continuation of Exercise 1, so the same setting and data is used (read
the initial text of it).

a) To investigate whether the requirement to the mean is fulfilled (with α =

5%), the following hypothesis should be tested
H0 : µ = 3000
H1 : µ 6= 3000.
Or similarly asked: what is the evidence against the null hypothesis?
Chapter 3 3.4 EXERCISES 197

b) What would the level α = 0.01 critical values be for this test, and what are
the interpretation of these?

c) What would the level α = 0.05 critical values be for this test (compare also
with the values found in the previous question)?

d) Investigate, by som plots, whether the data here appears to be coming

from a normal distribution (as assumed until now)?

e) Assuming that you, maybe among different plots, also did the normal Q-
Q plot above, the question is now: What exactly is plotted in that plot? Or
more specifically: what are the x- and y-coordinates of e.g. the two points
to the lower left in this plot?

Exercise 3.4 Aluminium profile (hypothesis testing)

We use the same setting and data as in Exercise 2, so read the initial text of it.

a) Find the evidence against the following hypothesis:

H0 : µ = 180.

b) If the following hypothesis test is carried out

H0 : µ = 180,
H1 : µ 6= 180.

What are the level α = 1% critical values for this test?

Chapter 3 3.4 EXERCISES 198

c) What is the 99%-confidence interval for µ?

d) Carry out the following hypothesis test

H0 : µ = 180,
H1 : µ 6= 180,

using α = 5%.

Exercise 3.5 Transport times

A company, MM, selling items online wants to compare the transport times for
two transport firms for delivery of the goods. To compare the two companies
recordings of delivery times on a specific route were made, with a sample size
of n = 9 for each firm. The following data were found:

Firm A: ȳ A = 1.93 d and s A = 0.45 d,

Firm B: ȳ B = 1.49 d and s B = 0.58 d.

note that d is the SI unit for days. It is assumed that data can be regarded as
stemming from normal distributions.

a) We want to test the following hypothesis

H0 : µ A = µ B
H1 : µ A 6= µ B

What is the p-value, interpretation and conclusion for this test (at α = 5%
level)?

b) Find the 95% confidence interval for the mean difference µ A − µ B .

Chapter 3 3.4 EXERCISES 199

c) What is the power of a study with n = 9 observations in each of the two

samples of detecting a potential mean difference of 0.4 between the firms
(assume that σ = 0.5 and that we use α = 0.05)?

d) What effect size (mean difference) could be detected with n = 9 observa-

tions in each of the two samples with a power of 0.8 (assume that σ = 0.5
and that we use α = 0.05)?

e) How large a sample size (from each firm) would be needed in a new inves-
tigation, if we want to detect a potential mean difference of 0.4 between the
firms with probability 0.90, that is with power=0.90 (assume that σ = 0.5
and that we use α = 0.05)?

Exercise 3.6 Cholesterol

In a clinical trial of a cholesterol-lowering agent, 15 patients’ cholesterol (in

mmol/L) has been measured before treatment and 3 weeks after starting treat-
ment. Data are listed in the following table:
Patient 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Before 9.1 8.0 7.7 10.0 9.6 7.9 9.0 7.1 8.3 9.6 8.2 9.2 7.3 8.5 9.5
After 8.2 6.4 6.6 8.5 8.0 5.8 7.8 7.2 6.7 9.8 7.1 7.7 6.0 6.6 8.4

The following is run in R:

x1 <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3,
9.6, 8.2, 9.2, 7.3, 8.5, 9.5)
x2 <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7,
9.8, 7.1, 7.7, 6.0, 6.6, 8.4)
t.test(x1, x2)

Welch Two Sample t-test

data: x1 and x2
t = 3.3, df = 27, p-value = 0.003
alternative hypothesis: true difference in means is not equal to 0
Chapter 3 3.4 EXERCISES 200

95 percent confidence interval:

0.4637 1.9630
sample estimates:
mean of x mean of y
8.600 7.387

t.test(x1, x2, pair=TRUE)

Paired t-test

data: x1 and x2
t = 7.3, df = 14, p-value = 0.000004
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8588 1.5678
sample estimates:
mean of the differences
1.213

a) Can there, based on these data be demonstrated a significant decrease in

cholesterol levels with α = 0.001?

Exercise 3.7 Pulse

13 runners had their pulse measured at the end of a workout and 1 minute after
again and we got the following pulse measurements:
Runner 1 2 3 4 5 6 7 8 9 10 11 12 13
Pulse end 173 175 174 183 181 180 170 182 188 178 181 183 185
Pulse 1min 120 115 122 123 125 140 108 133 134 121 130 126 128

The following was run in R:

Pulse_end <- c(173,175,174,183,181,180,170,182,188,178,181,183,185)

Pulse_1min <- c(120,115,122,123,125,140,108,133,134,121,130,126,128)
mean(Pulse_end)

[1] 179.5

mean(Pulse_1min)
Chapter 3 3.4 EXERCISES 201

[1] 125

sd(Pulse_end)

[1] 5.19

sd(Pulse_1min)

[1] 8.406

sd(Pulse_end-Pulse_1min)

[1] 5.768

a) What is the 99% confidence interval for the mean pulse drop (meaning the
drop during 1 minute from end of workout)?

b) Consider now the 13 pulse end measurements (first row in the table).
What is the 95% confidence interval for the standard deviation of these?

Exercise 3.8 Foil production

In the production of a certain foil (film), the foil is controlled by measuring the
thickness of the foil in a number of points distributed over the width of the foil.
The production is considered stable if the mean of the difference between the
maximum and minimum measurements does not exceed 0.35 mm. At a given
day, the following random samples are observed for 10 foils:

Foil 1 2 3 4 5 6 7 8 9 10
Max. in mm (ymax ) 2.62 2.71 2.18 2.25 2.72 2.34 2.63 1.86 2.84 2.93
Min. in mm (ymin ) 2.14 2.39 1.86 1.92 2.33 2.00 2.25 1.50 2.27 2.37
Max-Min (D) 0.48 0.32 0.32 0.33 0.39 0.34 0.38 0.36 0.57 0.56

The following statistics may potentially be used

ȳmax = 2.508, ȳmin = 2.103, symax = 0.3373, symin = 0.2834, s D = 0.09664.

Chapter 3 3.4 EXERCISES 202

a) What is a 95% confidence interval for the mean difference?

b) How much evidence is there that the mean difference is different from
0.35? State the null hypothesis, t-statistic and p-value for this question.

Exercise 3.9 Course project

At a specific education it was decided to introduce a project, running through

the course period, as a part of the grade point evaluation. In order to assess
whether it has changed the percentage of students passing the course, the fol-
lowing data was collected:
Before introduction After introduction
of project of project
Number of students evaluated 50 24
Number of students failed 13 3
Average grade point x̄ 6.420 7.375
Sample standard deviation s 2.205 1.813

a) As it is assumed that the grades are approximately normally distributed

in each group, the following hypothesis is tested:

H0 : µBefore = µAfter ,
H1 : µBefore 6= µAfter .

The test statistic, the p-value and the conclusion for this test become?

b) A 99% confidence interval for the mean grade point difference is?

c) A 95% confidence interval for the grade point standard deviation after the
introduction of the project becomes?
Chapter 3 3.4 EXERCISES 203

Exercise 3.10 Concrete items (sample size)

This is a continuation of Exercise 1, so the same setting and data is used (read
the initial text of it).

a) A study is planned of a new supplier. It is expected that the standard

deviation will be approximately 3, that is, σ = 3 mm. We want a 90%
confidence interval for the mean value in this new study to have a width
of 2 mm. How many items should be sampled to achieve this?

b) Answer the sample size question above but requiring the 99% confidence
interval to have the (same) width of 2 mm.

c) (Warning: This is a difficult question about a challenging abstraction - do

not worry, if you do not make this one) For the two sample sizes found
in the two previous questions find the probability that the correspond-
ing confidence interval in the future study will actually be more than 10%
wider than planned for (still assuming and using that the population vari-
ance is σ2 = 9).

d) Now a new experiment is to be planned. In the first part above, given

some wanted margin of error (ME) a sample size of n = 25 was found.
What are each of the probabilities that an experiment with n = 25 will de-
tect effects corresponding to (”end up significant for”) µ1 = 3001, 3002, 3003
respectively? Assume that we use the typical α = 0.05 level and that
σ = 3?

e) One of the sample size computation above led to n = 60 (it is not so im-
portant how/why). Answer the same question as above using n = 60.
Chapter 3 3.4 EXERCISES 204

f) What sample size would be needed to achieve a power of 0.80 for an effect
of size 0.5?

g) Assume that you only have the finances to do an experiment with n = 50.
How large a difference would you be able to detect with probability 0.8
(i.e. Power= 0.80)?
Chapter 4 205

Chapter 4

Simulation Based Statistics

4.1 Probability and Simulation

4.1.1 Introduction

One of the really big gains for statistics and modeling of random phenomena,
provided by computer technology during the last decades, is the ability to sim-
ulate random systems on the computer, as we have already seen much in use
in Chapter 2. This provides possibilities to obtain results that otherwise from a
mathematical analytical point of view would be impossible to calculate. And,
even in cases where the highly educated mathematician/physicist might be able
to find solutions, simulation is a general and simple calculation tool allowing
solving complex problems without a need for deep theoretical insight.

An important reason for including this subject in an introductory statistics course,

apart from using it as a pedagogical tool to aide the understanding of random
phenomena, is the fact that the methods we are usually introducing in basic
statistics are characterized by relying on one of two conditions:

1. The original data population density is assumed to be a normal distribu-

tion

2. Or: The sample size n is large enough to make this assumption irrelevant
for what we do

And in real settings it may be challenging to know for sure whether any of these
two are really satisfied, so to what extend can we trust the statistical conclusions
that we make using our basic tools, as e.g. the one- and two-sample statistical
Chapter 4 4.1 PROBABILITY AND SIMULATION 206

methods presented in Chapter 3. And how should we do the basic statistical

analysis if we even become convinced that none of these two conditions are ful-
filled? Statistical data analysis based on simulation tools is a valuable tool to
complete the tool box of introductory statistics. It can be used to do statistical
computing for other features than just means, and for other population distri-
butions than the normal. It can also be used to investigate whether some of our
assumptions appear reasonable. We already saw an example of this in relation
to the qq-plots in Chapter 3.1.9.

In fact, it will become clear that the simulation tools presented here will make
us rapidly able to perform statistical analysis that goes way beyond what histor-
ically has been introduced in basic statistics classes or textbooks. Unfortunately,
the complexity of real life engineering applications and data analysis challenges
can easily go beyond the settings that we have time to cover within an intro-
ductory exposition. With the general simulation tool in our tool box, we have
a multitool that can be used for (and adapted to) basically almost any level of
complexity that we will meet in our future engineering activity.

The classical statistical practice would be to try to ensure that the data we’re
analyzing behaves like a normal distribution: symmetric and bell-shaped his-
togram. In Chapter 3 we also learned that we can make a normal q-q plot to
verify this assumption in practice, and possibly transform the data to get them
closer to being normal. The problem with small samples is that it even with
these diagnostic tools can be difficult to know whether the underlying distribu-
tion really is ”normal” or not.

And in some cases the assumption of normality after all simply may be obvi-
uosly wrong. For example, when the response scale we work with is far from
being quantitative and continuous - it could be a scale like ”small”, ”medium”
and ”large” - coded as 1, 2 and 3. We need tools that can do statistical analy-
sis for us WITHOUT the assumption that the normal distribution is the right
model for the data we observe and work with.

Traditionally, the missing link would be covered by the so-called non-parametric

tests. In short this is a collection of methods that make use of data at a more
coarse level, typically by focusing on the rank of the observations instead of the
actual values of the observations. So in a paired t-test setup, for example, one
would just count how many times the observations in one sample is bigger than
in the other – instead of calculating the differences. In that way you can make
statistical tests without using the assumption of an underlying normal distribu-
tion. There are a large number of such non-parametric tests for different setups.
Historically, before the computer age, it was the only way to really handle such
situations in practice. These tests are all characterized by the fact that they are
given by relatively simple computational formulas which in earlier times easily
Chapter 4 4.1 PROBABILITY AND SIMULATION 207

could be handled. For small sample statistics with questionable distributional

settings, these tools maintain to offer a robust set of basic statistical procedures.

The simulation based methods that we now present instead have a couple of
crucial advantages to the traditional non-parametric methods:

• Confidence intervals are much easier to achieve

• They are much easier to apply in more complex situations
• They scale better to modern time big data analysis

4.1.2 Simulation as a general computational tool

Basically, the strength of the simulation tool is that one can compute arbitrary
functions of random variables and their outcomes. In other words one can find
probabilities of complicated outcomes. As such, simulation is really not a statis-
tical tool, but rather a probability calculus tool. However, since statistics essen-
tially is about analysing and learning from real data in the light of certain proba-
bilities, the simulation tool indeed becomes of statistical importance, which we
will exemplify very specifically below. Before starting with exemplifying the
power of simulation as a general computational tool, we refer to the introduc-
tion to simulation in Chapter 2 – in particular read first Section 2.6, Example
2.15 and thereafter Section 2.6.

Example 4.1 Rectangular plates

A company produces rectangular plates. The length of plates (in meters), X is as-
sumed to follow a normal distribution N (2, 0.012 ) and the width of the plates (in
meters), Y are assumed to follow a normal distribution N (3, 0.022 ). We’re hence
dealing with plates of size 2 × 3 meters, but with errors in both length and width.
Assume that these errors are completely independent. We are interested in the area
of the plates which of course is given by A = XY. This is a nonlinear function of X
and Y, and actually it means that we, with the theoretical tools we presented so far
in the material, cannot figure out what the mean area really is, and not at all what
the standard deviation would be in the areas from plate to plate, and we would defi-
nitely not know how to calculate the probabilities of various possible outcomes. For
example, how often such plates have an area that differ by more than 0.1 m2 from
the targeted 6 m2 ? One statement summarizing all our lack of knowledge at this
point: we do not know the probability distribution of the random variable A and
we do not know how to find it! With simulation, it is straightforward: one can find
all relevant information about A by just simulating the X and Y a high number of
Chapter 4 4.1 PROBABILITY AND SIMULATION 208

times, and from this compute A just as many times, and then observe what happens
to the values of A. The first step is then given by:

## Number of simulations
k <- 10000
## Simulate X, Y and then A
X <- rnorm(k, 2, 0.01)
Y <- rnorm(k, 3, 0.02)
A <- X*Y

The R object A now contains 10.000 observations of A. The expected value and the
standard deviation for A are simply found by calculating the average and standard
deviation for the simulated A-values:

mean(A)

[1] 5.999511

sd(A)

[1] 0.04957494

and the desired probability, P(| A − 6| > 0.1) = 1 − P(5.9 ≤ A ≤ 6.1) is found by
counting how often the incident actually occurs among the k outcomes of A:

mean(abs(A-6)>0.1)

[1] 0.0439

The code abs(A-6)>0.1 creates a vector with values TRUE or FALSE depending on
whether the absolute value of A − 6 is greater than 0.1 or not. When you add (sum)
these the TRUE is automatically translated into 1 and FALSE automatically set to 0, by
which the desired count is available, and divided by the total number of simulations
k by mean().

Note, that if you do this yourself without using the same seed value you will not
get exactly the same result. It is clear that this simulation uncertainty is something
we must deal with in practice. The size of this will depend on the situation and
on the number of simulations k. We can always get a first idea of it in a specific
situation simply by repeating the calculation a few times and note how it varies.
Chapter 4 4.1 PROBABILITY AND SIMULATION 209

Indeed, one could then formalize such an investigation and repeat the simulation
many times, to get an evaluation of the simulation uncertainty. We will not pursue
this further here. When the target of the computation is in fact a probability, as in the
latter example here, you can alternatively use standard binomial statistics, which is
covered in Chapter 2 and Chapter 7. For example, withqk = 100000 the uncertainty
0.044(1−0.044)
for a calculated proportion of around 0.044 is given by: 100000 = 0.00065. Or
for example, with k = 10000000 the uncertainty is 0.000065. The result using such
a k was 0.0455 and because we’re a bit unlucky with the rounding position we can
in practice say that the exact result rounded to 3 decimal places are either 0.045 or
0.046. In this way, a calculation which is actually based on simulation is turned into
an exact one in the sense that rounded to 2 decimal places, the result is simply 0.05.

4.1.3 Propagation of error

Within chemistry and physics one may speak of measurement errors and how
measurement errors propagate/accumulate if we have more measurements and/or
use these measurements in subsequent formulas/calculations. First of all: The
basic way to ”measure an error”, that is, to quantify a measurement error is by
means of a standard deviation. As we know, the standard deviation expresses
the average deviation from the mean. It is clear it may happen that a measur-
ing instrument also on average measures wrongly (off the target). This is called
”bias”, but in the basic setting here, we assume that the instrument has no bias.

Hence, reformulated, an error propagation problem is a question about how

the standard deviation of some function of the measurements depends on the
standard deviations for the individual measurement: let X1 , . . . , Xn be n mea-
surements with standard deviations (average measurement errors) σ1 , . . . , σn .
As usual in this material, we assume that these measurement errors are inde-
pendent of each other. There are extensions of the formulas that can handle
dependencies, but we omit those here. We must then in a general formulation
be able to find

σ2f (X1 ,...,Xn ) = V( f ( X1 , . . . , Xn )). (4-1)

Chapter 4 4.1 PROBABILITY AND SIMULATION 210

Remark 4.2 For the thoughtful reader: Measurement errors, er-

rors and variances
Although we motivate this entire treatment by the measurement error termi-
nology, often used in chemistry and physics, actually everything is valid
for any kind of errors, be it ”time-to-time” production errors, or ”substance-
to-substance” or ”tube-to-tube” errors. What the relevant kind of er-
rors/variabilities are depends on the situation and may very well be mixed
together in applications. But, the point is that as long as we have a relevant
error variance, we can work with the concepts and tools here. It does not
have to have a ”pure measurement error” interpretation.

Actually, we have already in this course seen the linear error propagation rule,
in Theorem in 2.56, which then can be restated here as
n n
If f ( X1 , . . . , Xn ) = ∑ a i Xi , then σ2f (X1 ,...,Xn ) = ∑ a2i σi2.
i =1 i =1
There is a more general non-linear extension of this, albeit theoretically only an
approximate result, which involves the partial derivative of the function f with
respect to the n variables:

Method 4.3 The non-linear approximative error propagation rule

If X1 , . . . , Xn are independent random variables with variances σ12 , . . . , σn2
and f is a (potentially non-linear) function of n variables, then the variance
of the f -transformed variables can be approximated linearly by
n 2
∂f
σ2f (X1 ,...,Xn ) = ∑ ∂xi
σi2 , (4-2)
i =1

∂f
where ∂xi is the partial derivative of f with respect to the i’th variable

In practice one would have to insert the actual measurement values x1 , . . . , xn

of X1 , . . . , Xn in the partial derivatives to apply the formula in practice, see the
example below. This is a pretty powerful tool for the general finding of (ap-
proximate) uncertainties for complicated functions of many measurements or
for that matter: complex combinations of various statistical quantities. When
the formula is used for the latter, it is also in some contexts called the ”delta
rule” (which is mathematically speaking a so-called first-order (linear) Taylor
approximations to the nonlinear function f ). We bring it forward here, because
Chapter 4 4.1 PROBABILITY AND SIMULATION 211

as an alternative to this approximate formula one could use simulation in the

following way:

Method 4.4 Non-linear error propagation by simulation

Assume we have actual measurements x1 , . . . , xn with known/assumed er-
ror variances σ12 , . . . , σn2 :

1. Simulate k outcomes of all n measurements from assumed error distri-

( j)
butions, e.g. N ( xi , σi2 ): Xi , j = 1 . . . , k.

2. Calculate the standard deviation directly as the observed standard de-

viation of the k simulated values of f :
v
u k
u 1
t ( f j − f¯)2 ,
k − 1 i∑
sim
s f (X1 ,...,Xn ) = (4-3)
=1

where
( j) ( j)
f j = f ( X1 , . . . , X n ) . (4-4)

Example 4.5

Let us continue the example with A = XY and X and Y defined as in the example
above. First of all note, that we already above used the simulation based error prop-
agation method, when we found the standard deviation to be 0.04957 based on the
simulation. To exemplify the approximate error propagation rule, we must find the
derivatives of the function f ( x, y) = xy with respect to both x and y

∂f ∂f
=y = x.
∂x ∂y

Assume, that we now have two specific measurements of X and Y, for example
X = 2.00 m and y = 3.00 m the error propagation law would provide the following
approximate calculation of the ”uncertainty error variance of the area result” 2.00 m ·
3.00 m = 6.00 m2 , namely

σA2 = y2 · 0.012 + x2 · 0.022 = 3.002 · 0.012 + 2.002 · 0.022 = 0.0025.

So, with the error propagation law we are managing a part of the challenge without
simulating. Actually, we are pretty close to be able to find the correct theortical
variance of A = XY using tools provided in this course. By the definition and the
Chapter 4 4.1 PROBABILITY AND SIMULATION 212

following fundamental relationship

V( X ) = E( X − E( X ))2 = E( X 2 ) − E( X )2 . (4-5)

So, one can actually deduce the variance of A theoretically, it is only necessary
to know in addition that for independent random variables: E( XY ) = E( X ) E(Y )
(which by the way then also tells us that E( A) = E( X ) E(Y ) = 6)

V( XY ) = E ( XY )2 − E( XY )2
= E ( X 2 ) E (Y 2 ) − E ( X ) 2 E (Y ) 2

= V ( X ) + E ( X ) 2 V (Y ) + E (Y ) 2 − E ( X ) 2 E (Y ) 2
= V ( X ) V (Y ) + V ( X ) E (Y ) 2 + V (Y ) E ( X ) 2
= 0.012 · 0.022 + 0.012 · 32 + 0.022 · 22
= 0.00000004 + 0.0009 + 0.0016
= 0.00250004.

Note, how the approximate error propagation rule actually corresponds to the two
latter terms in the correct variance, while the first term – the product of the two
variances is ignored. Fortunately, this term is the smallest of the three in this case. It
does not always have to be like that. If you want to learn how to make a theoretical
derivation of the density function for A = XY then take a course in probability
calculation.

Note, how we in the example actually found the ”average error”, that is, the
error standard deviation by three different approaches:

1. The simulation based approach

2. The analytical, but approximate, error propagation method

3. A theoretical derivation

The simulation approach has a number of crucial advantages:

1. It offers a simple way to compute many other quantities than just the stan-
dard deviation (the theoretical derivations of such other quantities could
be much more complicated than what was shown for the variance here)

2. It offers a simple way to use any other distribution than the normal – if we
believe such better reflect reality

3. It does not rely on any linear approximations of the true non-linear rela-
tions
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 213

4.2 The parametric bootstrap

4.2.1 Introduction

Generally, a confidence interval for an unknown parameter µ is a way to ex-

press uncertainty using the sampling distribution of µ̂ = x̄. Hence, we use a
distribution that expresses how our calculated value would vary from sample
to sample. And the sampling distribution is a theoretical consequence of the
original population distribution. As indicated, we have so far no method to do
this if we only have a small sample size (n < 30), and the data cannot be as-
sumed to follow a normal distribution. In principle there are two approaches
for solving this problem:

1. Find/identify/assume a different and more suitable distribution for the

population (”the system”)

2. Do not assume any distribution whatsoever

The simulation method called bootstrapping, which in practice is to simulate

many samples, exists in two versions that can handle either of these two chal-
lenges:

1. Parametric bootstrap: simulate multiple samples from the assumed distri-

bution.

2. Non-parametric bootstrap: simulate multiple samples directly from the

data.

Actually, the parametric bootstrap handles in addition the situation where data
could perhaps be normally distributed, but where the calculation of interest is
quite different than the average, for example, the coefficient of variation (stan-
dard deviation divided by average) or the median. This would be an example
of a nonlinear function of data – thus not having a normal distribution nor a t-
distribution as a sampling distribution. So, the parametric bootstrap is basically
just an example of the use of simulation as a general calculation tool, as intro-
duced above. Both methods are hence very general and can be used in virtually
all contexts.

In this material we have met a few of such alternative continuous distributions,

e.g. the log-normal, uniform and exponential distributions. But if we think
about it, we have not (yet) been taught how to do any statistics (confidence
intervals and/or hypothesis testing) within an assumption of any of these. The
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 214

parametric bootstrap is a way to do this without relying on theoretical deriva-

tions of everything. As for the theoretical variance deduction above, there are
indeed methods for doing such general theoretical derivations, which would
make us able to do statistics based on any kind of assumed distribution. The
most welknown, and in many ways also optimal, overall approach for this is
called maximum likelihood theory. The general theory and approach of maxi-
mum likelihood is not covered in this course, however it is good to know that,
in fact, all the methods we present are indeed also maximum likelihood meth-
ods assuming normal distributions for the population(s).

4.2.2 One-sample confidence interval for µ

Example 4.6 Confidence interval for the exponential rate or mean

Assume that we observed the following 10 call waiting times (in seconds) in a call
center

32.6, 1.6, 42.1, 29.2, 53.4, 79.3, 2.3, 4.7, 13.6, 2.0.

If we model the waiting times using the exponential distribution, we can estimate
the mean as

µ̂ = x̄ = 26.08,

and hence the rate parameter λ = 1/β in the exponential distribution as (cf. 2.48)

λ̂ = 1/26.08 = 0.03834356.

However, what if we want a 95% confidence interval for either µ = β or λ? We

have not been tought the methods, that is, given any formulas for finding this. The
following few lines of R-code, a version of the simulation based error propagation
approach from above, will do the job for us:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 215

## Read the data

x <- c(32.6, 1.6, 42.1, 29.2, 53.4, 79.3, 2.3 , 4.7, 13.6, 2.0)
n <- length(x)
## Set the number of simulations
k <- 100000
## 1. Simulate 10 exponentials with the right mean k times
set.seed(9876)
simsamples <- replicate(k, rexp(10,1/26.08))
## 2. Compute the mean of the 10 simulated observations k times
simmeans <- apply(simsamples, 2, mean)
## 3. Find the two relevant quantiles of the k simulated means
quantile(simmeans, c(0.025, 0.975))

2.5% 97.5%
12.58739 44.62749

Explanation: replicate is a function that repeats the call to rexp(10,1/26.08), in

this case 100000 times and the results are collected in a 10 × 100.000 matrix. Then
in a single call the 100.000 averages are calculated and subsequently the relevant
quantiles found.

So the 95%-confidence interval for the mean µ is (in seconds)

[12.6, 44.6].

And for the rate λ = 1/µ it can be found by a direct transformation (remember that
the quantiles are ’invariant’ to monotonic transformations, c.f. Chapter 3)

[1/44.6, 1/12.6] ⇔ [0.022, 0.0794].

The simulated sampling distribution of means that we use for our statistical analysis
can be seen with the histogram:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 216

hist(simmeans, col="blue", nclass=30, cex.main=0.8)

Histogram of simmeans
10000
Frequency
6000
2000
0

20 40 60 80
simmeans

We see clearly that the sampling distribution in this case is not a normal nor a t-
distribution: it has a clear right skewed shape. So n = 10 is not quite large enough
for this exponential distribution to make the Central Limit Theorem take over.

The general method which we have used in the example above is given below
as Method 4.7.

4.2.3 One-sample confidence interval for any feature assuming

any distribution

We saw in the example above that we could easily find a confidence interval for
the rate λ = 1/µ assuming an exponential distribution. This was so, since the
rate was a simple (monotonic) transformation of the mean, and the quantiles
of simulated rates would then be the same simple transformation of the quan-
tiles of the simulated means. However, what if we are interested in something
not expressed as a simple function of the mean, for instance the median, the
coefficent of variation, the quartiles, Q1 or Q3 , the IQR=Q3 − Q1 or any other
quantile? Well, a very small adaptation of the method above would make that
possible for us. To express that we now cover any kind of statistic one could
think of, we use the general notation, the greek letter θ, for a general feature
of the distribution. For instance, θ could be the true median of the population
distribution, and then θ̂ is the sample median computed from the sample taken.
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 217

Method 4.7 Confidence interval for any feature θ by parametric

bootstrap
Assume we have actual observations x1 , . . . , xn and assume that they stem
from some probability distribution with density (pdf) f :

1. Simulate k samples of n observations from the assumed distribution f

where the mean is set to x̄ a

2. Calculate the statistic θ̂ in each of the k samples θ̂1∗ , . . . , θ̂k∗

3. Find the 100(α/2)% and 100(1 − α/2)% quantiles for these,

∗
q100 ∗
and q100 as the 100(1 − α)% confidence interval:
h (α/2)% (1−α/2i)%
∗ ∗
q100 (α/2)% , q100(1−α/2)%

a (Footnote: And otherwise chosen to match the data as good as possible: some distributions
have more than just a single mean related parameter, e.g. the normal or the log-normal. For these
one should use a distribution with a variance that matches the sample variance of the data. Even
more generally the approach would be to match the chosen distribution to the data by the so-called
maximum likelihood approach)

Please note again, that you can simply substitute the θ with whatever statistics
that you are working with. This then also shows that the method box includes
the often occuring situation, where a confidence interval for the mean µ is the
aim.

Example 4.8 Confidence interval for the median assuming an ex-

ponential distribution

Let us look at the exponential data from the previous section and find the confidence
interval for the median:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 218

## Load the data

x <- c(32.6, 1.6, 42.1, 29.2, 53.4, 79.3, 2.3 , 4.7, 13.6, 2.0)
n <- length(x)
## Set the number of simulations
k <- 100000
## 1. Simulate k samples of n=10 exponentials with the right mean
simsamples <- replicate(k, rexp(n,1/26.08))
## 2. Compute the median of the n=10 simulated observations k times:
simmedians <- apply(simsamples, 2, median)
## 3. Find the two relevant quantiles of the k simulated medians:
quantile(simmedians, c(0.025, 0.975))

2.5% 97.5%
7.038026 38.465226

The simulated sampling distribution of medians that we use for our statistical anal-
ysis can be studied by the histogram:

hist(simmedians, col="blue", nclass=30, cex.main=0.8)

Histogram of simmedians
10000
Frequency
6000
2000
0

0 20 40 60 80
simmedians

We see again clearly that the sampling distribution in this case is not a normal nor a
t-distribution: it has a clear right skewed shape.
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 219

Example 4.9 Confidence interval for Q3 assuming a normal distri-

bution

Let us look at the heights data from the previous chapters and find the 99% confi-
dence interval for the upper quartile: (Please note that you will find NO theory nor
analytically expressed method boxes in the material to solve this challenge). There
is one litte extra challenge for us here, since with as well the mean as the median
there were directly applicable R-functions to do these sample computations for us,
namely the R-functions mean and median. The upper quartile Q3 does not as such
have its own R-function but it comes as part of the result of e.g. the summary function
or the quantile function. However, in one little line of R-code, we could make such
a Q3-function ourselves, e.g. by:
Q3 <- function(x){ quantile(x, 0.75)}
And now it goes exactly as before:

## load in the data

x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
n <- length(x)
## Set the number of simulations:
k <- 100000
## 1. Simulate k samples of n=10 normals with the right mean and variance:
set.seed(9876)
simsamples <- replicate(k, rnorm(n, mean(x), sd(x)))
## 2. Compute the Q3 of the n=10 simulated observations k times:
simQ3 <- apply(simsamples, 2, Q3)
## 3. Find the two relevant quantiles of the k simulated medians:
quantile(simQ3, c(0.005, 0.995))

0.5% 99.5%
172.818 198.003

The simulated sampling distribution of upper quartiles that we use for our statistical
analysis can be studied by the histogram:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 220

hist(simQ3, col="blue", cex.main=0.8)

Histogram of simQ3
15000
10000
Frequency
5000
0

170 180 190 200

simQ3

In this case the Q3 of n = 10 samples of a normal distribution appear to be rather

symmetric and nicely distributed, so maybe one could in fact use the normal distri-
bution, also as an approximate sampling distribution in this case.

4.2.4 Two-sample confidence intervals assuming any distributions

In this section we extend what we learned in the two previous sections to the
case where the focus is a comparison between two (independent) samples. We
present a method box which is the natural extensions of the method box from
above, comparing any kind of feature (hence including the mean comparison):
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 221

Method 4.10 Two-sample confidence interval for any feature com-

parison θ1 − θ2 by parametric bootstrap
Assume we have actual observations x1 , . . . , xn and y1 , . . . , yn and assume
that they stem from some probability distributions with density f 1 and f 2 :

1. Simulate k sets of 2 samples of n1 and n2 observations from the as-

sumed distributions setting the means to µ̂1 = x̄ and µ̂2 = ȳ, respec-
tively a

2. Calculate the difference between the features in each of the k samples

∗ − θ̂ ∗ , . . . , θ̂ ∗ − θ̂ ∗
θ̂ x1 y1 xk yk

3. Find the 100(α/2)% and 100(1 − α/2)% quantiles for these,

∗
q100 ∗
and q100 as the 100(1 − α)% confidence interval
h (α/2)% (1−α/2i )%
∗ ∗
q100 (α/2)% , q100(1−α/2)%

Example 4.11 CI for the difference of two means from exponential

distributed data

Let us look at the exponential data from the previous section and compare that with
a second sample of n = 12 observations from another day at the call center

9.6, 22.2, 52.5, 12.6, 33.0, 15.2, 76.6, 36.3, 110.2, 18.0, 62.4, 10.3.

Let us quantify the difference between the two days and conclude whether the call
rates and/or means are any different on the two days:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 222

## Read the data

x <- c(32.6, 1.6, 42.1, 29.2, 53.4, 79.3, 2.3 , 4.7, 13.6, 2.0)
y <- c(9.6, 22.2, 52.5, 12.6, 33.0, 15.2, 76.6, 36.3, 110.2, 18.0,
62.4, 10.3)
n1 <- length(x)
n2 <- length(y)
## Set the number of simulations
k <- 100000

## 1. Simulate k samples of each n1=10 and n2=12 exponentials

## with the right means
simXsamples <- replicate(k, rexp(n1,1/mean(x)))
simYsamples <- replicate(k, rexp(n2,1/mean(y)))
## 2. Compute the difference between the simulated means k times
simDifmeans <- apply(simXsamples,2,mean) - apply(simYsamples,2,mean)
## 3. Find the two relevant quantiles of the k simulated differences
## in sample means
quantile(simDifmeans, c(0.025, 0.975), cex.main=0.8)

2.5% 97.5%
-40.73539 14.11699

Thus, although the mean waiting time was higher on the second day (ȳ = 38.24 s),
the range of acceptable values (the confidence interval) for the difference in means
is [−40.7, 14.1] – a pretty large range and including 0, so we have no evidence of the
claim that the two days had different mean waiting times (nor call rates then) based
on the current data.

Let us, as in previous examples take a look at the distribution of the simulated sam-
ples. In a way, we do not really need this for doing the analysis, but just out of
curiosity, and for the future it may give a idea of how far from normality the rele-
vant sampling distribution really is:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 223

hist(simDifmeans, col="blue", nclass=25, cex.main=0.8)

Histogram of simDifmeans
15000
10000
Frequency
5000
0

-50 0 50
simDifmeans

In this case the differences of means of exponential distributions appears to be rather

symmetric and nicely distributed, so maybe one could in fact use the normal distri-
bution, also as an approximate sampling distribution in this case.

Example 4.12 Nutrition study: comparing medians assuming nor-

mal distributions

Let us compare the median energy levels from the two-sample nutrition data from
Example 3.45. And let us do this still assuming the normal distribution as we also
assumed in the previous example. First we read in the data:

## Read the data

xA <- c(7.53, 7.48, 8.08, 8.09, 10.15, 8.4, 10.88, 6.13, 7.9)
xB <- c(9.21, 11.51, 12.79, 11.85, 9.97, 8.79, 9.69, 9.68, 9.19)
nA <- length(xA)
nB <- length(xB)

Then we do the two-sample median comparison by the parametric, normal based,

bootstrap:
Chapter 4 4.2 THE PARAMETRIC BOOTSTRAP 224

## Set the number of simulations

k <- 100000
## 1. Simulate k samples of each nA=9 and nB=9 exponentials with the
## right means and standard deviations
simAsamples <- replicate(k, rnorm(nA, mean(xA), sd(xA)))
simBsamples <- replicate(k, rnorm(nB, mean(xB), sd(xB)))

## 2. Compute the difference between the simulated medians k times

simDifmedians <- apply(simAsamples, 2, median) - apply(simBsamples, 2, median)
## 3. Find the two relevant quantiles of the k simulated differences of means
quantile(simDifmedians, c(0.025, 0.975))

2.5% 97.5%
-3.6013672 -0.3981232

Thus, we accept that the difference between the two medians is somewhere between
0.4 and 3.6, and confirming the group difference that we also found in the means, as
the 0 is not included in the interval.

Note, how the only differences in the R code compared to the previous bootstrap-
ping example: the calls to the rexp-function into calls to the rnorm-function and
subsituting mean with median.

Remark 4.13 Hypothesis testing by simulation based confidence

intervals
We have also seen that even though the simulation method boxes given are
providing confidence intervals: we can also use this for hypothesis testing,
by using the basic relation between hypothesis testing and confidence in-
tervals. A confidence interval includes the ’acceptable’ values, and values
outside the confidence interval are the ’rejectable’ values.
Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 225

4.3 The non-parametric bootstrap

4.3.1 Introduction

In the introduction to the parametric bootstrap section above it was discussed

that another approach instead of finding the ’right’ distribution to use is to not
assume any distribution at all. This can be done, and a way to do this simula-
tion based is called the non-parametric bootstrap and is presented in this section.
The section is structured as the parametric bootstrap section above – includ-
ing the similar subsections and similar method boxes. So there will be two
method boxes in this section: one for the one-sample analysis and one for the
two-sample analysis.

In fact, the non-parametric approach could be seen as the parametric approach

but substituting the density/distribution used for the simulation by the ob-
served distribution of the data, that is, the empirical cumulative distribution
function (ecdf), cf. Chapter 1. In practice this is carried out by (re)-sampling the
data we have again and again: To get the sampling distribution of the mean (or
any other feature) based on the n observations that we have in our given sam-
ple, we simply again and again take new samples with n observations from the
one we have. This is done ”with replacement” such that the ”new” samples,
from now on called the bootstrap samples would contain some of the original
observations in duplicates (or more) and others will not be there.

4.3.2 One-sample confidence interval for µ

We have the sample: x1 , . . . , xn .

The 100(1 − α)% confidence interval for µ determined by the non-parametric

bootstrap is first exemplified:
Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 226

Example 4.14 Women’s cigarette consumption

In a study women’s cigarette consumption before and after giving birth is explored.
The following observations of the number of smoked cigarettes per day were ob-
served:

before after before after

8 5 13 15
24 11 15 19
7 0 11 12
20 15 22 0
6 0 15 6
20 20

This is a typical paired t-test setup, as discussed in Section 3.2.3, which then was
handled by finding the 11 differences and thus transforming it into a one-sample
setup. First we read the observations into R and calculate the differences by:

## Read and calculate the differences for each woman before and after
x1 <- c(8, 24, 7, 20, 6, 20, 13, 15, 11, 22, 15)
x2 <- c(5, 11, 0, 15, 0, 20, 15, 19, 12, 0, 6)
dif <- x1-x2
dif

[1] 3 13 7 5 6 0 -2 -4 -1 22 9

There is a random-sampling function in R (which again is based on a uniform ran-

dom number generator): sample. Eg. you can get 5 repeated samples (with replace-
ment - replace=TRUE) by:

t(replicate(5, sample(dif, replace=TRUE)))

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 13 9 -2 -4 3 -2 -1 -1 22 22 -1
[2,] 3 22 6 3 22 -2 9 6 0 7 9
[3,] 6 6 6 13 22 7 0 -4 0 22 7
[4,] 0 9 9 3 6 9 7 9 13 -2 -1
[5,] -1 22 6 -2 9 13 -1 6 22 0 9

Explanation: replicate is a function that repeats the call to sample - in this case 5
times. The function t simply transposes the matrix of numbers, making it 5 × 11
Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 227

instead of 11 × 5 (only used for showing the figures row-wise in slightly fewer lines
than otherwise necessary)

One can then run the following to get a 95% confidence interval for µ based on
k = 100000:

## Number of simulated samples

k <- 100000
## Simulate
simsamples <- replicate(k, sample(dif, replace=TRUE))
## Calculate the mean of each simulated sample
simmeans <- apply(simsamples, 2, mean)
## Quantiles of the differences gives the CI
quantile(simmeans, c(0.025,0.975))

2.5% 97.5%
1.363636 9.818182

Explanation: The sample function is called 100.000 times and the results collected in
an 11 × 100.000 matrix. Then in a single call the 100.000 averages are calculated and
subsequently the relevant quantiles found.

Note, that we use the similar three steps as above for the parametric bootstrap, with
the only difference that the simulations are carried out by the resampling the given
data rather than from some probability distribution.

4.3.3 One-sample confidence interval for any feature

What we have just done can be more generally expressed as follows:

Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 228

Method 4.15 Confidence interval for any feature θ by non-

parametric bootstrap
Assume we have actual observations x1 , . . . , xn :

1. Simulate k samples of size n by randomly sampling among the avail-

able data (with replacement)

2. Calculate the statistic θ̂ in each of the k samples θ̂1∗ , . . . , θ̂k∗

3. Find the 100(α/2)% and 100(1 − α/2)% quantiles for these,

∗
q100 ∗
and q100 as the 100(1 − α)% confidence interval:
h (α/2)% (1−α/2i)%
∗ ∗
q100 (α/2)% , q100(1−α/2)%

Example 4.16

Let us find the 95% confidence interval for the median cigarette consumption change
in the example from above:

## The 95% CI for the median change

k <- 100000
simsamples <- replicate(k, sample(dif, replace = TRUE))
simmedians <- apply(simsamples, 2, median)
quantile(simmedians, c(0.025,0.975))

2.5% 97.5%
-1 9

4.3.4 Two-sample confidence intervals

We now have two random samples: x1 , . . . , xn1 and y1 , . . . , yn2 . The 100(1 − α)%
confidence interval for θ1 − θ2 determined by the non-parametric bootstrap is
defined as:
Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 229

Method 4.17 Two-sample confidence interval for θ1 − θ2 by non-

parametric bootstrap
Assume we have actual observations x1 , . . . , xn and y1 , . . . , yn :

1. Simulate k sets of 2 samples of n1 and n2 observations from the respec-

tive groups (with replacement)

2. Calculate the difference between the features in each of the k samples

∗ − θ̂ ∗ , . . . , θ̂ ∗ − θ̂ ∗
θ̂ x1 y1 xk yk

3. Find the 100(α/2)% and 100(1 − α/2)% quantiles for these,

∗
q100 ∗
and q100 as the 100(1 − α)% confidence interval:
h (α/2)% (1−α/2i)%
∗ ∗
q100 (α/2)% , q100(1−α/2)%

Example 4.18 Teeth and bottle

In a study it was explored whether children who received milk from bottle as a child
had worse or better teeth health conditions than those who had not received milk
from the bottle. For 19 randomly selected children it was recorded when they had
their first incident of caries:

bottle age bottle age bottle Age

no 9 no 10 yes 16
yes 14 no 8 yes 14
yes 15 no 6 yes 9
no 10 yes 12 no 12
no 12 yes 13 yes 12
no 6 no 20
yes 19 yes 13

One can then run the following to obtain a 95 % confidence interval for µ1 − µ2 based
on k = 100000:
Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 230

## Reading in "no bottle" group

x <- c(9, 10, 12, 6, 10, 8, 6, 20, 12)
## Reading in "yes bottle" group
y <- c(14,15,19,12,13,13,16,14,9,12)

## Number of simulations
k <- 100000
## Simulate each sample k times
simxsamples <- replicate(k, sample(x, replace=TRUE))
simysamples <- replicate(k, sample(y, replace=TRUE))
## Calculate the sample mean differences
simmeandifs <- apply(simxsamples,2,mean) - apply(simysamples,2,mean)
## Quantiles of the differences gives the CI
quantile(simmeandifs, c(0.025,0.975))

2.5% 97.5%
-6.2111111 -0.1222222

Example 4.19

Let us make a 99% confidence interval for the difference of medians between the
two groups in the tooth health example:

## CI for the median differences

simmediandifs <- apply(simxsamples,2,median)-apply(simysamples,2,median)
quantile(simmediandifs, c(0.005,0.995))

0.5% 99.5%
-8 0
Chapter 4 4.3 THE NON-PARAMETRIC BOOTSTRAP 231

Remark 4.20 Warning: Bootstrapping may not always work well

for small sample sizes!
The bootstrapping idea was presented here rather enthusiastically as an al-
most magic method that can do everything for us in all cases. This is not
the case. Some statistics are more easily bootstrapped than others and gen-
erally non-parametric bootstrap will not work well for small samples. The
inherent lack of information with small samples cannot be removed by any
magic trick. Also, there are more conceptually difficult aspects of bootstrap-
ping for various purposes to improve on some of these limitations, see the
next section. Some of the "naive bootstrap" CI interval examples introduced
in this chapter is likely to not have extremely good properties – the coverage
percentages might not in all cases be exactly at the aimed nominal levels.
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 232

4.4 OPTIONAL: Bootstrapping – a further perspective

This section is not part of the syllabus, but can be seen as providing some op-
tional perspectives of this very versatile method. First of all there are actually
other principles of constructing the bootstrap confidence intervals than the one
presented here. What we have used is the so-called percentile method. Other
methods to come from the non-parametric bootstrapped samples to a confi-
dence interval are called the bootstrap-t or studentized bootstrap and Bias Cor-
rected Accelerated confidence intervals. These are often considered superior to
the straightforward percentile method presented above.

All of these bootstrap based confidence interval methods are also available in
specific bootstrap functions and packages in R. There are two major bootstrap
packages in R: the boot and the bootstrap packages. The former is the one
recommended by the QuickR-website: (Adv. Stats, Bootstrapping) http://www.
statmethods.net/advstats/bootstrapping.html.

The other package called bootstrap includes a function (also) called bootstrap.
To e.g. use this one, first install this package, e.g. by:

## Install the bootstrap package

install.packages("bootstrap")

Example 4.21 Teeth and bottle

Now the calculation can be performed in a single call:

## Calculate the 95% CI for the Teech and bottle example above
library(bootstrap)
quantile(bootstrap(dif,k,mean)$thetastar, c(0.025,0.975))

2.5% 97.5%
1.363636 9.818182

These bootstrap packages are advantageous to use when looking for confidence
intervals for more complicated functions of data.
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 233

4.4.1 Non-parametric bootstrapping with the boot-package

Now, we will show how to use the boot-package for more general non-parametric
bootstrapping than described so far. Both the bootstrap and the boot packages
require that the statistics that are bootstrapped in more general situations are
on a specific functional form in R. The nice thing about the bootstrap pack-
age is that it may handle the simple cases without this little extra complica-
tion, whereas the boot package requires this for all applications, also the simple
ones. However, then the boot-package has a number of other good features that
makes it a good choice, and the point is, that for the applications coming now,
the additional complexity would be needed also for the bootstrap package. Let
us begin with a simple example that is already covered by our methods up to
now:

Example 4.22 Bootstrapping the mean µ by the boot-package

We will use again the women’s cigarette consumption data:

## Read and calculate the differences for each woman before and after
x1 <- c(8,24,7,20,6,20,13,15,11,22,15)
x2 <- c(5,11,0,15,0,20,15,19,12,0,6)
dif <- x1-x2

Our aim is to find a 95%-confidence interval for the mean difference. The additional
complexity mentioned is that we cannot just use the mean function as it comes, we
need to re-define it on a specific function form, where the indices enters the function:

## Define function for calculating the mean of the d indexes

samplemean <- function(x, d){ mean(x[d]) }

This is a version of the mean function, where we explicitly have to specify which of
the available observations should be used in the computation of the mean. Let us
check the result of that:
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 234

## Call the new function

mean(dif)

[1] 5.272727

samplemean(dif,1:11)

[1] 5.272727

samplemean(dif,c(1,3))

[1] 5

dif

[1] 3 13 7 5 6 0 -2 -4 -1 22 9

dif[c(1,3)]

[1] 3 7

mean(dif[c(1,3)])

[1] 5

We see that samplemean(dif,c(1,3)) means that we compute the mean of obser-

vation numbers 1 and 3. Now we can use the boot package (to do what we so far
already were able to using methods from previous sections) and firstly we look at
the bootstrap distribution:

## Load the boot package

library(boot)

## Non-parametric bootstrap of the mean difference:

k <- 10000
meandifboot <- boot(dif, samplemean, k)
plot(meandifboot)
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 235

Histogram of t
0.30

15
0.20

10
Density

t*
5
0.10

0
0.00

0 5 10 15 -4 -2 0 2 4
t* Quantiles of Standard Normal

The actual confidence interval corresponding then to Method 4.15 can then be ex-
tracted using the dedicated R-function boot.ci as:

## Percentile bootstrap CI:

boot.ci(meandifboot, type="perc")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = meandifboot, type = "perc")

Intervals :
Level Percentile
95% ( 1.364, 9.818 )
Calculations and Intervals on Original Scale

One of the nice features of the boot-package is that we can now easily get instead the
so-called Bias Corrected and Accelerated (bca) confidence interval simply by writing
type="bca":
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 236

## Bias Corrected Accelerated CI:

boot.ci(meandifboot, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = meandifboot, type = "bca")

Intervals :
Level BCa
95% ( 1.636, 10.364 )
Calculations and Intervals on Original Scale

And now we can apply this bca-method to any case, e.g. bootstrapping the x1 me-
dian:

## Define a function for taking the median in the needed format

samplemedian <- function(x, d) {
return(median(x[d]))
}
## Non-parametric bootstrap of the x1 median
b <- boot(x1,samplemedian,k)
boot.ci(b, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = b, type = "bca")

Intervals :
Level BCa
95% ( 7, 20 )
Calculations and Intervals on Original Scale

Or the coefficient of variation for the difference:

Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 237

## Non-parametric bootstrap of the Dif coef. of var

samplecoefvar <- function(x, d) {
return(sd(x[d])/mean(x[d]))
}
##
b <- boot(dif,samplecoefvar,k)
boot.ci(b, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 9996 bootstrap replicates

CALL :
boot.ci(boot.out = b, type = "bca")

Intervals :
Level BCa
95% ( 0.811, 5.452 )
Calculations and Intervals on Original Scale

Now we will show how we can work directly on data frames, which in real
applications always will be how we have data available. Also, we will show
how we can bootstrap statistics that depend on more than a single input vari-
able. The first example we will use is that of finding a confidence interval for a
sample correlation coefficient, cf. Chapter 1 and Chapter 5. If you read through
Section 5.6.1, you will see that no formula is given for this confidence interval.
Actually, this is not so easily found, and only approximate explicit solutions
can be found for this. We illustrate it on some data that we produce(simulate)
ourselves and then store in a data frame:
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 238

Example 4.23

## Example with data frame and two variables

## Making our own data for the example - into a data frame:
x <- runif(100)
y <- x + 2*runif(100)
D <- data.frame(x, y)
head(D)

x y
1 0.8917505 2.151290
2 0.1018970 1.429229
3 0.2695788 2.251663
4 0.6379553 2.045412
5 0.3358962 2.160790
6 0.6389259 2.582587

plot(D)
3.0
2.5
2.0
1.5
y
1.0
0.5

0.2 0.4 0.6 0.8 1.0

cor(D$x,D$y)

[1] 0.4500598

Then we make a version of the correlation function on the right form, where the
entire data frame is taken as the input to the function (together with the index):
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 239

## The correlation function on the right form:

mycor <- function(D, d) {
E <- D[d,]
return(cor(E$x, E$y))
}

## Check:
mycor(D, 1:100)

[1] 0.4500598

mycor(D, 1:15)

[1] 0.5629782

The E selects the chosen observations from the data frame D, and then in the compu-
tations we use the variables needed from the data frame (think about how this then
can easily be extended to using any number of variables in the data frame). And we
can now do the actual bootstrap:

## Doing the bootstrap on the data frame:

b <- boot(D, mycor, 10000)
boot.ci(b, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = b, type = "bca")

Intervals :
Level BCa
95% ( 0.2835, 0.5964 )
Calculations and Intervals on Original Scale

Our last example will show a way to bootstrap output from linear models based
on bootstrapping the entire rows of the data frame. (Other bootstrap principles
exist for linear models, e.g. bootstrapping only residuals).
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 240

Example 4.24

We will show how to find the confidence interval for the percentage of explained
variation in a multiple linear regression (MLR - covered in Chapter 6). We use the
data set mtcars from the boot package:

## Bootstrapping an R-Squared from an MLR using

## the data set mtcars from the boot package:
## (And showing how to send EXTRA stuff to your function)

# function to obtain R-Squared from the data

# AND working for ANY model fit you want!!

rsq <- function(formula, data, d) {

fit <- lm(formula, data=data[d,])
return(summary(fit)$r.square)
}

## Bootstrapping with 1000 replications

b <- boot(mtcars, rsq, 1000, formula=mpg~wt+disp)
plot(b)

Histogram of t
0.90
8

0.80
6
Density

t*
4

0.70
2

0.60
0

0.6 0.7 0.8 0.9 -3 -2 -1 0 1 2 3

t* Quantiles of Standard Normal
Chapter 4 4.4 OPTIONAL: BOOTSTRAPPING – A FURTHER
PERSPECTIVE 241

boot.ci(b, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = b, type = "bca")

Intervals :
Level BCa
95% ( 0.6308, 0.8549 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
Chapter 4 4.5 EXERCISES 242

4.5 Exercises

Exercise 4.1 Reliability: System lifetime (simulation as a computa-

tion tool)

A system consists of three components A, B and C serially connected, such that

A is positioned before B, which is again positioned before C. The system will
be functioning only so long as A, B and C are all functioning. The lifetime in
months of the three components are assumed to follow exponential distribu-
tions with means: 2 months, 3 months and 5 months, respectively (hence there
are three random variables, X A , XB and XC with exponential distributions with
λ A = 1/2, λ B = 1/3 and λC = 1/5 resp.) A little R-help: You will probably
need (or at least it would help) to put three variables together to make e.g. a
k × 3-matrix - this can be done by the cbind function:

x <- cbind(xA,xB,xC)

And just as an example, remember from the examples in the chapter that the
way to easily compute e.g. the mean of the three values for each of all the k
rows of this matrix is:

simmeans <- apply(x, 1, mean)

a) Generate, by simulation, a large number (at least 1000 – go for 10000 or

100000 if your computer is up for it) of system lifetimes (hint: consider
how the random variable Y = System lifetime is a function of the three
X-variables: is it the sum, the mean, the median, the minimum, the maxi-
mum, the range or something even different?).

b) Estimate the mean system lifetime.

c) Estimate the standard deviation of system lifetimes.

Chapter 4 4.5 EXERCISES 243

d) Estimate the probability that the system fails within 1 month.

e) Estimate the median system lifetime

f) Estimate the 10th percentile of system lifetimes

g) What seems to be the distribution of system lifetimes? (histogram etc)

Exercise 4.2 Basic bootstrap CI

(Can be handled without using R) The following measurements were given for
the cylindrical compressive strength (in MPa) for 11 prestressed concrete beams:

38.43, 38.43, 38.39, 38.83, 38.45, 38.35, 38.43, 38.31, 38.32, 38.48, 38.50.

1000 bootstrap samples (each sample hence consisting of 11 measurements)

were generated from these data, and the 1000 bootstrap means were arranged
on order. Refer to the smallest as x̄(∗1) , the second smallest as x̄(∗2) and so on,
with the largest being x̄(∗1000) . Assume that

x̄(∗25) = 38.3818,
x̄(∗26) = 38.3818,
x̄(∗50) = 38.3909,
x̄(∗51) = 38.3918,
x̄(∗950) = 38.5218,
x̄(∗951) = 38.5236,
x̄(∗975) = 38.5382,
x̄(∗976) = 38.5391.
Chapter 4 4.5 EXERCISES 244

a) Compute a 95% bootstrap confidence interval for the mean compressive

strength.

b) Compute a 90% bootstrap confidence interval for the mean compressive

strength.

Exercise 4.3 Various bootstrap CIs

Consider the data from the exercise above. These data are entered into R as:

x <- c(38.43, 38.43, 38.39, 38.83, 38.45, 38.35,

38.43, 38.31, 38.32, 38.48, 38.50)

Now generate k = 1000 bootstrap samples and compute the 1000 means (go
higher if your computer is fine with it)

a) What are the 2.5%, and 97.5% quantiles (so what is the 95% confidence
interval for µ without assuming any distribution)?

b) Find the 95% confidence interval for µ by the parametric bootstrap as-
suming the normal distribution for the observations. Compare with the
classical analytic approach based on the t-distribution from Chapter 2.

c) Find the 95% confidence interval for µ by the parametric bootstrap as-
suming the log-normal distribution for the observations. (Help: To use
the rlnorm function to simulate the log-normal distribution, we face the
challenge that we need to specify the mean and standard deviation on the
log-scale and not on the raw scale, so compute mean and standard devia-
tion for log-transformed data for this R-function)
Chapter 4 4.5 EXERCISES 245

d) Find the 95% confidence interval for the lower quartile Q1 by the paramet-
ric bootstrap assuming the normal distribution for the observations.

e) Find the 95% confidence interval for the lower quartile Q1 by the non-
parametric bootstrap (so without any distributional assumptions)

Exercise 4.4 Two-sample TV data

A TV producer had 20 consumers evaluate the quality of two different TV flat

screens - 10 consumers for each screen. A scale from 1 (worst) up to 5 (best)
were used and the following results were obtained:

TV screen 1 TV screen 2
1 3
2 4
1 2
3 4
2 2
1 3
2 2
3 4
1 3
1 2

a) Compare the two means without assuming any distribution for the two
samples (non-parametric bootstrap confidence interval and relevant hy-
pothesis test interpretation).

b) Compare the two means assuming normal distributions for the two sam-
ples - without using simulations (or rather: assuming/hoping that the
sample sizes are large enough to make the results approximately valid).
Chapter 4 4.5 EXERCISES 246

c) Compare the two means assuming normal distributions for the two sam-
ples - simulation based (parametric bootstrap confidence interval and rel-
evant hypothesis test interpretation – in spite of the obviously wrong as-
sumption).

Exercise 4.5 Non-linear error propagation

The pressure P, and the volume V of one mole of an ideal gas are related by
the equation PV = 8.31T, when P is measured in kilopascals, T is measured in
kelvins, and V is measured in liters.

a) Assume that P is measured to be 240.48 kPa and V to be 9.987 L with

known measurement errors (given as standard deviations): 0.03 kPa and
0.002 L. Estimate T and find the uncertainty in the estimate.

b) Assume that P is measured to be 240.48kPa and T to be 289.12K with

known measurement errors (given as standard deviations): 0.03kPa and
0.02K. Estimate V and find the uncertainty in the estimate.

c) Assume that V is measured to be 9.987 L and T to be 289.12 K with known

measurement errors (given as standard deviations): 0.002 L and 0.02 K.
Estimate P and find the uncertainty in the estimate.

d) Try to answer one or more of these questions by simulation (assume that

the errors are normally distributed).
Chapter 4 Glossaries 247

Glossaries

Alternative hypothesis [Alternativ hypotese] The alternative hypothesis (H1 )

is oftern the negation of the null hypothesis 152

Binomial distribution [Binomial fordeling] If an experiment has two possible

outcomes (e.g. failure or success, no or yes, 0 or 1) and is repeated more
than one time, then the number of successes is binomial distributed 59

Box plot [Box plot] The so-called boxplot in its basic form depicts the five quar-
tiles (min, Q1 , median, Q3 , max) with a box from Q1 to Q3 emphasizing
the Inter Quartile Range (IQR) 30

cumulated distribution function [Fordelingsfunktion]The cdf is the function

which determines the probability of observing an outcome of a random
variable below a given value 249,

χ2 -distribution [χ2 -fordeling (udtales: chi-i-anden fordeling)] 98, 140

confidence interval [Konfidensinterval] The confidence interval is a way to han-

dle the uncertainty by the use of probability theory. The confidence inter-
val represents those values of the unknown population mean µ that we
believe is based on the data. Thus we believe the true mean in the statis-
tics class is in this interval 131,

Central Limit Theorem [Centrale grænseværdisætning] The Central Limit The-

orem (CLT) states that the sample mean of independent identically dis-
tributed outcomes converges to a normal distribution 135

Continuous random variable [Kontinuert stokastisk variabel] If an outcome of

an experiment takes a continuous value, for example: a distance, a tem-
perature, a weight, etc., then it is represented by a continuous random
variable 45, 68, 249

Correlation [Korrelation] The sample correlation coefficient are a summary statis-

tic that can be calculated for two (related) sets of observations. It quantifies
Chapter 4 Glossaries 248

the (linear) strength of the relation between the two. See also: Covariance
17, 89, 248

Covariance [Kovarians] The sample covariance coefficient are a summary statis-

tic that can be calculated for two (related) sets of observations. It quantifies
the (linear) strength of the relation between the two. See also: Correlation
17, 89, 248, 250

Critical value Kritisk værdi As an alternative to the p-value one can use the so-
called critical values, that is the values of the test-statistic which matches
exactly the significance level 150

Degrees of freedom [Frihedsgrader] The number of "observations" in the data

that are free to vary when estimating statistical parameters often defined
as n − 1 98, 177

Descriptive statistics [Beskrivende statistik] Descriptive statistics, or explorative

statistics, is an important part of statistics, where the data is summarized
and described 9

Discrete random variable [Diskret stokastisk variabel] A discrete random vari-

able has discrete outcomes and follows a discrete distribution 46

Distribution [Fordeling] Defines how the data is distributed such as, normal
distribution, cumulated distribution function, probability density func-
tion exponential distribution, log-normal distribution, Poisson distribu-
tion, uniform distribution, hypergeometric distribution, binomial distri-
bution, t-distribution, F-distribution 45

Empirical cumulative distribution [Empirisk fordeling] The empirical cumu-

lative distribution function Fn is a step function with jumps i/n at obser-
vation values, where i is the number of identical observations at that value
29

Expectation [Forventningsværdi] A function for calculating the mean. The value

we expect for a random variable (or function of random variables), hence
of the population 54

Exponential distribution [Eksponential fordelingen] The usual application of

the exponential distribution is for describing the length (usually time) be-
tween events which, when counted, follows a Poisson distribution 79

F-distribution [F-fordelingen] The F-distribution appears as the ratio between

two independent χ2 -distributed random variables 109
Chapter 4 Glossaries 249

Frequency [Frekvens] How frequent data is observed. The frequency distribu-

tion of the data for a certain grouping is nicely depicted by the histogram,
which is a barplot of either raw frequencies or for some number of classes
27

Histogram [Histogram] The default histogram uses the same width for all classes
and depicts the raw frequencies/counts in each class. By dividing the raw
counts by n times the class width the density histogram is found where
the area of all bars sum to 1 27

Hypergeometric distribution [Hypergeometrisk fordeling] 62

Independence [Uafhængighed] 92

Independent samples [Uafhængige stikprøver] 181

(Statistical) Inference [Statistisk inferens (følgeslutninger baseret på data)] 123

Interval [Interval] Data in a specified range 64

Inter Quartile Range [Interkvartil bredde] The Inter Quartile Range (IQR) is
the middle 50% range of data 16, 248

Log-normal distribution [Lognormal fordeling] 79

Maximum likelihood [Estimator baseret på maximum likelihood metoden] 212

Median [Median, stikprøvemedian] The median of population or sample (note,

in text no distinguishment between population median and sample median)
11, 247

Non-parametric (test) [Ikke-parametriske (tests)] 223

Normal distribution [Normal fordeling] 72

Null hypothesis [Nulhypotese (H0 )] 149

One-sample t-test Missing description 154

P-value [p-værdi (for faktisk udfald af en teststørrelse)] 145

probability density function The pdf is the function which determines the prob-
ability of every possible outcome of a random variable 249,

Quantile [Fraktil, stikprøvefraktil] The quantiles of population or sample (note,

in text no distinguishment between population quantile and sample quantile)
12, 247
Chapter 4 Glossaries 250

Quartile [Fraktil, stikprøvefraktil] The quartiles of population or sample (note,

in text no distinguishment between population quartile and sample quartile)
13, 247

Sample variance [Empirisk varians, stikprøvevarians] 14, 248

Sample mean [Stikprøvegennemsnit] The average of a sample 10, 53, 125, 247

Standard deviation [Standard afvigelse] 248

t-distribution [t-fordeling] 103

Chapter 4 Acronyms 251

Acronyms

ANOVA Analysis of Variance Glossary: Analysis of Variance

cdf cumulated distribution function 47, Glossary: cumulated distribution func-

tion

CI confidence interval Glossary: confidence interval

CLT Central Limit Theorem Glossary: Central Limit Theorem

IQR Inter Quartile Range Glossary: Inter Quartile Range

LSD Least Significant Difference Glossary: Least Significant Difference

pdf probability density function Glossary: probability density function

Chapter A 252

Appendix A

Collection of formulas and R commands

This appendix chapter holds a collection of formulas. All the relevant equations from def-
initions, methods and theorems are included – along with associated R functions. All are
in included in the same order as in the book, except for the distributions which are listed
together.

A.1 Introduction, descriptive statistics, R and data

visualization

Description Formula R command

Sample mean 1 n
n i∑
1.4 x̄ = xi mean(x)
The mean of a sample. =1

Sample median 
The value that divides a sam- x 1 for odd n
( n+2 )
1.5 ple in two halves with equal Q2 = x ( n ) + x ( n +2 ) median(x)
 2 2
for even n
number of observations in 2
each.
Sample quantile
The value that divide a sam- (x
(np) + x(np+1)
ple such that p of the obser- 2 for pn integer quantile(x,p,type=2),
1.7 qp =
vations are less that the value. x(dnpe) for pn non-integer
The 0.5 quantile is the Me-
dian.
Sample quartiles Q0 = q0 = “minimum”
The quartiles are the five Q1 = q0.25 = “lower quartile” quantile(x,
quantiles dividing the sample probs,type=2)
1.8 Q2 = q0.5 = “median”
in four parts, such that each where
Q3 = q0.75 = “upper quartile”
part holds an equal number of probs=p
observations Q4 = q1 = “maximum”
Chapter A A.1 INTRODUCTION, DESCRIPTIVE STATISTICS, R AND
DATA VISUALIZATION 253

Description Formula R command

Sample variance
n
The sum of squared differ- 1
n − 1 i∑
1.10 s2 = ( xi − x̄ )2 var(x)
ences from the mean divided =1
by n − 1.
s
Sample standard deviation √ n
1
n − 1 i∑
1.11 The square root of the sample s= s2 = ( xi − x̄ )2 sd(x)
variance. =1

Sample coefficient of vari-

ance
s
1.12 The sample standard devia- V= sd(x)/mean(x)
x̄
tion seen relative to the sam-
ple mean.
Sample Inter Quartile Range
1.15 IQR: The middle 50% range of IQR = Q3 − Q1 IQR(x)
data
Sample covariance
1.18 Measure of linear strength of s xy = 1
n −1 ∑in=1 ( xi − x̄ ) (yi − ȳ) cov(x,y)
relation between two samples
Sample correlation
Measure of the linear strength
xi − x̄ yi −ȳ s xy
1.19 r= 1
n −1 ∑in=1 sx sy = s x ·sy cor(x,y)
of relation between two sam-
ples between -1 and 1.
Chapter A A.2 PROBABILITY AND SIMULATION 254

A.2 Probability and Simulation

Description Formula R command

Probability density function

(pdf) for a discrete variable
dnorm,dbinom,dhyper,
2.6 fulfills two conditions: f ( x ) ≥ f ( x ) = P( X = x )
dpois
0 and ∑all x f ( x ) = 1 and finds
the probality for one x value.
Cumulated distribution
function (cdf)
pnorm,pbinom,phyper,
2.9 gives the probability in a F ( x ) = P( X ≤ x )
ppois
range of x values where
P ( a < X ≤ b ) = F ( b ) − F ( a ).
Mean of a discrete random
2.13 variable µ = E( X ) = ∑i∞=1 xi f ( xi )

Variance of a discrete ran-

2.16 dom variable X σ2 = Var( X ) = E[( X − µ)2 ]

Pdf of a continuous random

variable
is a non-negative function for Rb
2.32 P( a < X ≤ b) = f ( x )dx
all possible outcomes and has a

an area below the function of

one
Cdf of a continuous random
variable Rx
2.33 is non-decreasing F ( x ) = P( X ≤ x ) = f (u)du
−∞
and limx→−∞ F ( x ) =
0 and limx→∞ F ( x ) = 1
Mean and variance for a con- R∞
µ = E( X ) = −∞ x f ( x )dx
2.34 tinuous random variable X R∞
σ2 = E[( X − µ)2 ] = −∞ ( x − µ)2 f ( x )dx

Mean and variance of a linear

function E( aX + b) = a E( X ) + b
2.54 The mean and variance of a
V( aX + b) = a2 V( X )
linear function of a random
variable X.
Mean and variance of a linear E ( a 1 X1 + a 2 X2 + · · · + a n X n ) =
combination a 1 E ( X1 ) + a 2 E ( X2 ) + · · · + a n E ( X n )
2.56 The mean and variance of a
V ( a 1 X1 + a 2 X2 + . . . + a n X n ) =
linear combination of random
variables. a21 V( X1 ) + a22 V( X2 ) + · · · + a2n V( Xn )
Chapter A A.2 PROBABILITY AND SIMULATION 255

Description Formula R command

Covariance
The covariance between be
2.58 Cov( X, Y ) = E [( X − E[ X ])(Y − E[Y ])]
two random variables X and
Y.
Chapter A A.2 PROBABILITY AND SIMULATION 256

A.2.1 Distributions

Here all the included distributions are listed including some important theorems and definitions
related specifically with a distribution.

Description Formula R command

Binominal distribution
f ( x; n, p) = P( X = x ) dbinom(x,size, prob)
n is the number of indepen-
pbinom(q,size, prob)
dent draws and p is the prob- n x
= p (1 − p ) n − x qbinom(p,size, prob)
2.20 ability of a success in each x
rbinom(n,size, prob)
draw. The Binominal pdf de- n n!
where = where
scribes the probability of x x x!(n − x )! size=n, prob=p
succeses.
Mean and variance of a bino- µ = np
2.21 mial distributed random vari-
σ2 = np(1 − p)
able.

Hypergeometric distribution f ( x; n, a, N ) = P( X = x ) dhyper(x,m,n,k)

n is the number of draws phyper(q,m,n,k)
( xa )( Nn−−xa)
without replacement, a is = qhyper(p,m,n,k)
2.24 (N)
number of succeses and N is n rhyper(nn,m,n,k)
the population size. a a! where
where =
b b!( a − b)! m=a, n=N − a, k=n

Mean and variance of a hyper- a

µ=n
geometric distributed random N
2.25
a ( N − a) N − n
variable. σ2 = n
N N−1
Poisson distribution dpois(x,lambda)
λ is the rate (or intensity) i.e. ppois(q,lambda)
the average number of events λ x −λ qpois(p,lambda)
2.27 f ( x; λ) = e
per interval. The Poisson pdf x! rpois(n,lambda)
describes the probability of x where
events in an interval. lambda=λ
Mean and variance of a Pois- µ=λ
2.28 son distributed random vari-
σ2 = λ
able.

Uniform distribution 

α and β defines the range of 0
 for x < α
1 dunif(x,min,max)
possible outcomes. random f ( x; α, β) = for x ∈ [α, β]
 β−α
 punif(q,min,max)
variable following the uni- 0 for x > β qunif(p,min,max)
2.35 form distribution has equal 
 runif(n,min,max)
density at any value within a 0
 for x < α
F ( x; α, β) = x −α where
defined range. for x ∈ [α, β]
 β−α
 min=α, max=β
0 for x > β
Chapter A A.2 PROBABILITY AND SIMULATION 257

Description Formula R command

Mean and variance of a uni- 1

µ= (α + β)
2.36 form distributed random vari- 2
able X. 1
σ2 = ( β − α )2
12
dnorm(x,mean,sd)
pnorm(q,mean,sd)
Normal distribution 1 ( x − µ )2
√ e− 2σ2 qnorm(p,mean,sd)
2.37 Often also called the Gaussian f ( x; µ, σ) =
σ 2π rnorm(n,mean,sd)
distribution.
where
mean=µ, sd=σ.
Mean and variance of a nor- µ
2.38 mal distributed random vari-
σ2
able.
Transformation of a normal
distributed random variable X−µ
2.43 Z=
X into a standardized normal σ
random variable.

dlnorm(x,meanlog,sdlog)
Log-normal distribution
plnorm(q,meanlog,sdlog)
α is the mean and β2 is the 1
2
− (ln x−α) qlnorm(p,meanlog,sdlog)
2.46 variance of the normal distri- f (x) = √ e 2β2
x 2πβ rlnorm(n,meanlog,sdlog)
bution obtained when taking
where
the natural logarithm to X.
meanlog=α, sdlog=β.
Mean and variance of a log- µ = eα+ β
2 /2

normal distributed random 2 2

2.47 variable. σ2 = e2α+ β (e β − 1)

dexp(x,rate)
( pexp(q,rate)
Exponential distribution λe−λx for x ≥ 0 qexp(p,rate)
2.48 f ( x; λ) =
λ is the mean rate of events. 0 for x < 0 rexp(n,rate)
where
rate=λ.
1
Mean and variance of a ex- µ=
2.49 λ
ponential distributed random 1
variable. σ2 = 2
λ

dchisq(x,df)
pchisq(q,df)
χ2 -distribution 1

f (x) =
ν x
x 2 −1 e − 2 ; x≥0 qchisq(p,df)
2.78 Γ ν2 is the Γ-function and ν is ν
2 Γ 2
2
ν
rchisq(n,df)
the degrees of freedom.
where
df=ν.
Chapter A A.2 PROBABILITY AND SIMULATION 258

Description Formula R command

Given a sample of size n from

the normal distributed ran-
dom variables Xi with vari-
ance σ2 , then the sample vari-
2.81
( n − 1) S2
ance S2 (viewed as random χ2 =
σ2
variable) can be transformed
to follow the χ2 distribution
with the degrees of freedom
ν = n − 1.
Mean and variance of a χ2 dis- E( X ) = ν
2.83
tributed random variable. V ( X ) = 2ν

t-distribution
− ν+2 1
ν is the degrees of freedom Γ ( ν+ 1
2 ) t2
2.86 f T (t) = √ 1+
and Γ() is the Gamma func- νπ Γ( ν2 ) ν

tion.
dt(x,df)
Relation between normal pt(q,df)
random variables and χ2 - Z qt(p,df)
2.87 X= √ ∼ t(ν)
distributed random variables. Y/ν rt(n,df)
Z ∼ N (0, 1) and Y ∼ χ2 (ν). where
df=ν.
For normal distributed ran-
dom variables X1 , . . . , Xn , the
random variable follows the
t-distribution, where X is the X−µ
2.89 T= √ ∼ t ( n − 1)
sample mean, µ is the mean of S/ n
X, n is the sample size and S
is the sample standard devia-
tion.

µ = 0; ν>1
Mean and variance of a t-
2.93 ν
distributed variable X. σ2 = ; ν>2
ν−2

ν21 df(x,df1,df2)
F-distribution 1 ν1 pf(q,df1,df2)
f F (x) = ν1 ν2

ν1 an ν2 are the degrees of B 2, 2 ν2 qf(p,df1,df2)
2.95 ν1 +ν2
freedom and B(·, ·) is the Beta ν1 ν1 − 2 rf(n,df1,df2)
− 1
function. ·x2 1+ x where
ν2
df1=ν1 ,df2=µ2 .
The F-distribution appears as
the ratio between two inde-
U/ν1
2.96 pendent χ2 -distributed ran- ∼ F (ν1 , ν2 )
V/ν2
dom variables with U ∼
χ2 (ν1 ) and V ∼ χ2 (ν2 ).
Chapter A A.2 PROBABILITY AND SIMULATION 259

Description Formula R command

X1 , . . . , Xn1 and Y1 , . . . , Yn2

with the mean µ1 and µ2 S12 /σ12
2.98 and the variance σ12 and σ22 ∼ F (n1 − 1, n2 − 1)
S22 /σ22
is independent and sampled
from a normal distribution.
ν2
µ= ; ν2 > 2
Mean and variance of a F- ν2 − 2
2.101
distributed variable X. 2ν22 (ν1 + ν2 − 2)
σ= ; ν2 > 4
ν1 (ν2 − 2)2 (ν2 − 4)
Chapter A A.3 STATISTICS FOR ONE AND TWO SAMPLES 260

A.3 Statistics for one and two samples

Description Formula R command

The distribution of the mean 1 n σ 2

n i∑
3.2 X̄ = Xi ∼ N µ,
of normal random variables. =1
n

The distribution of the σ- X̄ − µ

3.4 standardized mean of normal Z= √ ∼ N 0, 12
σ/ n
random variables
The distribution of the S- X̄ − µ
3.4 standardized mean of normal T= √ ∼ t ( n − 1)
S/ n
random variables
s
3.6 Standard Error of the mean SEx̄ = √
n
The one sample confidence in- s
3.8 x̄ ± t1−α/2 · √
terval for µ n

X̄ − µ
3.13 Central Limit Theorem (CLT) Z= √
σ/ n
" #
2 ( n − 1) s2 ( n − 1) s2
σ : ;
Confidence interval for the χ21−α/2 χ2α/2
3.18 " #
variance/standard deviation ( n − 1) s2 ( n − 1) s2
σ: ;
χ21−α/2 χ2α/2

The p-valueis the probability of obtain-

ing a test statistic that is at least as ex-
treme as the test statistic that was actu-
3.21 The p-value P(T>x)=2(1-pt(x,n-1))
ally observed. This probability is calcu-
lated under the assumption that the null
hypothesis is true.

p-value = 2 · P( T > |tobs |)

The one-sample t-test statistic x̄ − µ0
3.22 tobs = √
and p-value s/ n
H0 : µ = µ0

Rejected: p-value < α

3.23 The hypothesis test
Accepted: otherwise
3.28 Significant effect An effect is significant if the p-value< α
The critical values: α/2- and
1 − α/2-quantiles of the t-
3.30 tα/2 and t1−α/2
distribution with n − 1 de-
grees of freedom
The one-sample hypothesis Reject: |tobs | > t1−α/2
3.31
test by the critical value accept: otherwise
Chapter A A.3 STATISTICS FOR ONE AND TWO SAMPLES 261

Description Formula R command

x̄ ± t1−α/2 · √sn
3.32 Confidence interval for µ
acceptance region/CI: H0 : µ = µ0
Test: H0 : µ = µ0 and H1 : µ 6= µ0 by
p–value = 2 · P( T > |tobs |)
3.35 The level α one-sample t-test
Reject: p–value < α or |tobs | > t1−α/2
Accept: Otherwise
The one-sample confidence
z1−α/2 ·σ 2
interval (CI) sample size for- n= ME
mula
The one-sample sample size 2
z +z
3.64 n = σ 1−(µβ −µ1−)α/2
formula 0 1

naive approach: pi = ni , i = 1, . . . , n
The Normal q-q plot with
3.41 commonly aproach: pi = in−+0.5 1, i =
n>10
1, . . . , n

δ = µ2 − µ1
The (Welch) two-sample t-test H0 : δ = δ0
3.48
statistic ( x̄ − x̄ )−δ
tobs = √ 21 2 2 0
s1 /n1 +s2 /n2

( X̄ − X̄ )−δ
T = √ 21 2 2 0
S /n1 +S2 /n2
The distribution of the 1 2 2
3.49 s
1 s2
2
(Welch) two-sample statistic n +n1 2
ν= (s21 /n1 )2 (s22 /n2 )2
n1 −1 + n2 −1

Test: H0 : µ1 − µ2 = δ0 and H1 : µ1 −
µ2 6= δ0 by p–value = 2 · P( T > |tobs |)
3.50 The level α two-sample t-test
Reject: p–value < α or |tobs | > t1−α/2
Accept: Otherwise
The pooled two-sample esti- (n1 −1)s21 +(n2 −1)s22
3.51 s2p = n1 + n2 −2
mate of variance

δ = µ1 − µ2
The pooled two-sample t-test H0 : δ = δ0
3.52
statistic ( x̄ − x̄ )−δ
tobs = √ 21 2 2 0
s p /n1 +s p /n2

The distribution of the pooled ( X̄ − X̄ )−δ

T = √ 21 2 2 0
3.53
two-sample t-test statistic S p /n1 +S p /n2

q
s21 s22
x̄ − ȳ ± t1−α/2 · n1 + n2
The two-sample confidence
s2 s2
2
3.46 1 2
n1 + n2
interval for µ1 − µ2 ν= (s21 /n1 )2 (s22 /n2 )2
n1 −1 + n2 −1
Chapter A A.4 SIMULATION BASED STATISTICS 262

A.4 Simulation based statistics

Description Formula R command

The non-linear approximative 2

∂f
4.3 σ2f (X
1 ,...,Xn )
= ∑in=1 σi2
error propagation rule ∂xi

1. Simulate k outcomes
Non-linear error propagation 2. Calculate the
4.4
by simulation q standard deviation by
s f (X ,...,Xn ) = k−1 1 ∑ik=1 ( f j − f¯)2
sim
1

Confidence interval for any 1.Simulate k samples

4.7 feature θ by parametric boot- 2.Calculate the hstatistic θ̂ i
∗ , ∗
3.Calculate CI: q100(α/2)% 100(1−α/2)%
q
strap

Two-sample confidence inter- 1.Simulate k sets of 2 samples

val for any feature compar- ∗ − θ̂ ∗
2.Calculate the statistic θ̂ xk
4.10 yk
ison θ1 − θ2 by parametric h i
∗ ∗
3.Calculate CI: q100(α/2)% , q100(1−α/2)%
bootstrap

Computational Genomics With R (Chapman & Hall - CRC Computational Biology Series) - Nodrm
No ratings yet
Computational Genomics With R (Chapman & Hall - CRC Computational Biology Series) - Nodrm
463 pages
Cengage Limits & Derivatives
No ratings yet
Cengage Limits & Derivatives
88 pages
Kotze Writing An Academic Journal Article
No ratings yet
Kotze Writing An Academic Journal Article
158 pages
ISM Session 1-8+webinar1,2 Merged
No ratings yet
ISM Session 1-8+webinar1,2 Merged
718 pages
Test Bank For Interactive Statistics Informed Decisions Using Data 3rd Edition by Sullivan
No ratings yet
Test Bank For Interactive Statistics Informed Decisions Using Data 3rd Edition by Sullivan
54 pages
Agricultural Statistical Data Analysis Using Stata by George Boyhan
No ratings yet
Agricultural Statistical Data Analysis Using Stata by George Boyhan
253 pages
STA1007 Notes
No ratings yet
STA1007 Notes
251 pages
Maths
No ratings yet
Maths
292 pages
Uteach Cs Principles
100% (1)
Uteach Cs Principles
743 pages
Essentials of Statistics
No ratings yet
Essentials of Statistics
272 pages
Geology 2025
No ratings yet
Geology 2025
5 pages
The Winds of Python
No ratings yet
The Winds of Python
308 pages
Data 101 Complete PDF
No ratings yet
Data 101 Complete PDF
603 pages
T. Mailund - Beginning Data Science in R 4 - Data Analysis, Visualization, and Modelling For The Data Scientist (2022)
No ratings yet
T. Mailund - Beginning Data Science in R 4 - Data Analysis, Visualization, and Modelling For The Data Scientist (2022)
527 pages
Bayesian Reliability For Complex Systems
No ratings yet
Bayesian Reliability For Complex Systems
19 pages
hw4 Theory Handout
No ratings yet
hw4 Theory Handout
1 page
STA501 Study Guide 2024-02-27 01 - 00 - 08
No ratings yet
STA501 Study Guide 2024-02-27 01 - 00 - 08
270 pages
Wellness Book
No ratings yet
Wellness Book
44 pages
IAL Maths Statistics 3 SB
No ratings yet
IAL Maths Statistics 3 SB
35 pages
Bhaskar BRM File
No ratings yet
Bhaskar BRM File
91 pages
Data Analysis With SAS
100% (1)
Data Analysis With SAS
353 pages
Book IntroStatistics
No ratings yet
Book IntroStatistics
422 pages
IBM SPSS Missing Values
100% (1)
IBM SPSS Missing Values
34 pages
DM Mod4
No ratings yet
DM Mod4
108 pages
Practical Bayesian Inference
100% (2)
Practical Bayesian Inference
322 pages
Arena User S Guide En-107-150
No ratings yet
Arena User S Guide En-107-150
44 pages
Statistics I
100% (2)
Statistics I
686 pages
Ch-4 Load Forecasting Techniques
No ratings yet
Ch-4 Load Forecasting Techniques
12 pages
Exercices Introduction To Statistics Cou
No ratings yet
Exercices Introduction To Statistics Cou
11 pages
DABD (KMBNIT01) Model Paper With Solution
No ratings yet
DABD (KMBNIT01) Model Paper With Solution
19 pages
Stop Oversampling For Class Imbalance Learning - A Review (OJO) - AHMAD S. TARAWNEH, AHMAD B. HASSANAT, GHADA AWAD ALTARAWNEH, ABDULLAH ALMUHAIMEED
No ratings yet
Stop Oversampling For Class Imbalance Learning - A Review (OJO) - AHMAD S. TARAWNEH, AHMAD B. HASSANAT, GHADA AWAD ALTARAWNEH, ABDULLAH ALMUHAIMEED
18 pages
XII STD - Statistics English Medium
No ratings yet
XII STD - Statistics English Medium
280 pages
Textbook Practice Problems 1
No ratings yet
Textbook Practice Problems 1
39 pages
Statistics
100% (1)
Statistics
72 pages
BIOL40049 Biostatistics Syllabus
No ratings yet
BIOL40049 Biostatistics Syllabus
7 pages
Lecture Notes Ma12003 PDF
100% (1)
Lecture Notes Ma12003 PDF
105 pages
Diki Firmansyah 11119059
No ratings yet
Diki Firmansyah 11119059
14 pages
MATH 1281 - Unit 5 DF
No ratings yet
MATH 1281 - Unit 5 DF
3 pages
Biostats Notes
No ratings yet
Biostats Notes
3 pages
Extrinsic Attributes Responsible For Red Wine Quality Perception A Cross-Cultural Study Between France and Spain PDF
No ratings yet
Extrinsic Attributes Responsible For Red Wine Quality Perception A Cross-Cultural Study Between France and Spain PDF
16 pages
ACT Science Cheat Sheet
No ratings yet
ACT Science Cheat Sheet
3 pages
CSE-Machine Learning & Big Data - WSS Source Book
No ratings yet
CSE-Machine Learning & Big Data - WSS Source Book
181 pages
A Comparative Study of Classifying Legal Documents With Neural Networks
No ratings yet
A Comparative Study of Classifying Legal Documents With Neural Networks
9 pages
s41598 020 67089 0 PDF
No ratings yet
s41598 020 67089 0 PDF
11 pages
Lecture Notes
No ratings yet
Lecture Notes
80 pages
ACSL by Example
No ratings yet
ACSL by Example
217 pages
Oxidation in Wine - Does Expertise Influence The Perception PDF
No ratings yet
Oxidation in Wine - Does Expertise Influence The Perception PDF
9 pages
Digital Twin - Old Wine in A New Bottle
No ratings yet
Digital Twin - Old Wine in A New Bottle
20 pages
AP Computer Science A 2012 Free-Response Questions: About The College Board
No ratings yet
AP Computer Science A 2012 Free-Response Questions: About The College Board
21 pages
Purchase Intention and Buying Behavior Towards Laptops: A Study of Students in
No ratings yet
Purchase Intention and Buying Behavior Towards Laptops: A Study of Students in
9 pages
Connecting Flavors in Social Media A Cross Cultural Study With Beer Pairing PDF
No ratings yet
Connecting Flavors in Social Media A Cross Cultural Study With Beer Pairing PDF
8 pages
Bhavana Resume PDF
No ratings yet
Bhavana Resume PDF
2 pages
Core Banking Article Review
100% (1)
Core Banking Article Review
4 pages
Up Tps6 Lecture Powerpoint 11.1 2
No ratings yet
Up Tps6 Lecture Powerpoint 11.1 2
63 pages
Exploring Data: AP Statistics Unit 1: Chapters 1-4
No ratings yet
Exploring Data: AP Statistics Unit 1: Chapters 1-4
83 pages
AP Statistics
No ratings yet
AP Statistics
42 pages
Statistics Formula Sheet and Tables 2020
No ratings yet
Statistics Formula Sheet and Tables 2020
6 pages
STAT6101 Coursenotes 1516 PDF
No ratings yet
STAT6101 Coursenotes 1516 PDF
73 pages
Homework # 5 Solution: Instructor: John C.S. Lui
No ratings yet
Homework # 5 Solution: Instructor: John C.S. Lui
3 pages
As A 2 Physics Practical Handbook
50% (2)
As A 2 Physics Practical Handbook
48 pages
Stat 331 Course Notes
No ratings yet
Stat 331 Course Notes
79 pages
Scientific Thinking and Processes: Teacher Notes and Answers
100% (1)
Scientific Thinking and Processes: Teacher Notes and Answers
5 pages
MMC 1
No ratings yet
MMC 1
2 pages
A Course in Manufacturing Systems With Simulation
No ratings yet
A Course in Manufacturing Systems With Simulation
10 pages
1981 Trends in Identification
No ratings yet
1981 Trends in Identification
15 pages
Statistics R Charts and Graphs Assignment
No ratings yet
Statistics R Charts and Graphs Assignment
13 pages
MDM4U Unit 3a Formation Quiz Solutions
No ratings yet
MDM4U Unit 3a Formation Quiz Solutions
2 pages
MBR Assignment 3
No ratings yet
MBR Assignment 3
3 pages
Instructions To Authors (Effective 01 November 2015)
No ratings yet
Instructions To Authors (Effective 01 November 2015)
16 pages
Differentiation (Derivatives)
No ratings yet
Differentiation (Derivatives)
42 pages
Xtabond Postestimation - Postestimation Tools For Xtabond
No ratings yet
Xtabond Postestimation - Postestimation Tools For Xtabond
3 pages
Lesson Plan 8sci Fireproof Balloon
No ratings yet
Lesson Plan 8sci Fireproof Balloon
4 pages
L09 Measurement Uncertainty in Microbiological Examinations of Foods Technique For Determination of Pathogens - Hilde Skår Norli
No ratings yet
L09 Measurement Uncertainty in Microbiological Examinations of Foods Technique For Determination of Pathogens - Hilde Skår Norli
20 pages
Psat 2010
No ratings yet
Psat 2010
14 pages
SAT Number Properties
No ratings yet
SAT Number Properties
6 pages
Lesson 3: Testing The Difference Between Two Population Means: Paired Sample
No ratings yet
Lesson 3: Testing The Difference Between Two Population Means: Paired Sample
2 pages
Regression
No ratings yet
Regression
46 pages
Cfa 2 7
No ratings yet
Cfa 2 7
6 pages
Lecture Notes Statistics
100% (2)
Lecture Notes Statistics
117 pages
List of Important AP Statistics Concepts To Know
No ratings yet
List of Important AP Statistics Concepts To Know
9 pages
Statistics 1 AQA Revision Notes
No ratings yet
Statistics 1 AQA Revision Notes
7 pages
AP Statistics Problems #09
0% (1)
AP Statistics Problems #09
6 pages
Energy Transfer and Energy Loss of A Ball Rolling Down A Slope
No ratings yet
Energy Transfer and Energy Loss of A Ball Rolling Down A Slope
2 pages
A Short Course of Time-Series Analysis and Forecasting by D S G Pollock
No ratings yet
A Short Course of Time-Series Analysis and Forecasting by D S G Pollock
133 pages
An Introduction To T
No ratings yet
An Introduction To T
7 pages
Prob and Stats Formula Sheet
No ratings yet
Prob and Stats Formula Sheet
3 pages
AP Statistics Problems #17
No ratings yet
AP Statistics Problems #17
4 pages
R Markdown
No ratings yet
R Markdown
15 pages
AP Statistics Problems #16
No ratings yet
AP Statistics Problems #16
6 pages
AP Statistics Problems #19
No ratings yet
AP Statistics Problems #19
1 page
AP Statistics Problems #13
No ratings yet
AP Statistics Problems #13
2 pages
Rounding Decimals: © Dorling Kindersley Limited (2010)
No ratings yet
Rounding Decimals: © Dorling Kindersley Limited (2010)
2 pages
Statistics Assignment
No ratings yet
Statistics Assignment
4 pages
2010 AP Statistics Free Response Solutions
No ratings yet
2010 AP Statistics Free Response Solutions
3 pages
Advanced college algebra study guide
From Everand
Advanced college algebra study guide
Harrison Cook
No ratings yet

Book IntroStatistics PDF

Uploaded by

Book IntroStatistics PDF

Uploaded by

Introduction to Statistics

1 Introduction, descriptive statistics, R and data visualization 2

2 Probability and simulation 43

2.5.2 Normal distribution . . . . . . . . . . . . . . . . . . . . . . 72

3 Statistics for one and two samples 123

4 Simulation Based Statistics 203

4.2.3 One-sample confidence interval for any feature assuming

A Collection of formulas and R commands 247

Introduction, descriptive statistics, R

1. Introduction, descriptive statistics, R and data visualization

1.1 What is Statistics - a primer

In the well respected New England Journal of medicine a millenium editorial on

EDITORIAL: Looking Back on the Millennium in Medicine, N Engl J Med, 342:42-

• Elucidation of human anatomy and physiology

Still today, epidemiology, both human and veterinarian, maintains to be an ex-

The article ends with the following quote:

1.2 Statistics at DTU Compute

At DTU Compute at the Technical University of Denmark statistics is used,

• Statistics and Data Analysis

1.3 Statistics - why, what, how?

Or it could be for industrial and business applications:

• Is machine A more effective than machine B?

Probability theory is in its own right an important topic in engineering relevant

There is a conceptual frame for doing statistical inference: in Statistical inference

Definition 1.1 Sample and population

• An observational unit is the single entity about which information is

See also the illustration in Figure 1.1.

(Infinite) Statistical population

The meaning of sample in statistics is clearly different from how a chemist or

A way to achieve a representative sample is that each observation (i.e. each

1.4 Summary statistics

The descriptive part of studying data maintains to be an important part of statis-

1.4.1 Measures of centrality

Definition 1.4 Sample mean

Sometimes this is refered to as the average.

Definition 1.5 Median

• If n is even the median is the average of the two observations in posi-

Example 1.6 Student heights

The sample mean height is

x( n+1 ) = x(5) = 179.

Definition 1.7 Quantiles and percentiles

1. Order the n observations from smallest to largest: x(1) , . . . , x(n)

3. If pn is an integer: average the pn’th and ( pn + 1)’th ordered observa-

• q0 , q0.25 , q0.50 , q0.75 and q1

Definition 1.8 Quartiles

Example 1.9 Student heights

Q1 = x(d2.5e) = x(3) = 167,

and since n · 0.75 = 7.5, the upper quartile becomes

Q3 = x(d7.5e) = x(8) = 187.

We could also find the 0’th percentile

q0 = min( x1 , . . . , xn ) = x(1) = 161,

and the 100’th percentile

q1 = max( x1 , . . . , xn ) = x(10) = 198.

Finally, 10’th percentile (i.e. 0.10 quantile) is

1.4.2 Measures of variability

A crucial aspect to understand when dealing with statistics is the concept of

Definition 1.10 Sample variance

Definition 1.11 Sample standard deviation

Definition 1.12 Coefficient of variation

Question: Why not actually compute directly what the interpretation is

Question: Why divide by n − 1 and not n in the formulas of s and s2 ?

Answer: The sample variance s2 will most often be used as an estimate of

Spread in the sample can also be described and quantified by quartiles:

Definition 1.15 Range

Range = Maximum − Minimum = Q4 − Q0 = x(n) − x(1) . (1-9)

IQR = q0.75 − q0.25 = Q3 − Q1 . (1-10)

Example 1.16 Student heights

-10 -17 -11 1 6 -12 20 9 13 1.

So, if we square these and add them up we get

Therefore the sample variance is

Range = maximum − minimum = 198 − 161 = 37,

IQR = Q3 − Q1 = 187 − 167 = 20.

1.4.3 Measures of relation: correlation and covariance

Example 1.17 Student heights and weights

• Weights and heights are related to each other

from the two) in the following ways: