Statistical Inference For Data Science
A companion to the Coursera Statistical Inference Course
Brian Caffo
This book is for sale at http://leanpub.com/LittleInferenceBook
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
www.dbooks.org
To Kerri, Penelope and Scarlett
Contents
1. Introduction
  Before beginning
  Statistical inference defined
  Summary notes
  The goals of inference
  The tools of the trade
  Different thinking about probability leads to different styles of inference
  Exercises
2. Probability
  Where to get a more thorough treatment of probability
  Kolmogorov’s Three Rules
  Consequences of The Three Rules
  Random variables
  Probability mass functions
  Probability density functions
  CDF and survival function
  Quantiles
  Exercises
3. Conditional probability
  Conditional probability, motivation
  Conditional probability, definition
  Bayes’ rule
  Diagnostic Likelihood Ratios
  Independence
  IID random variables
  Exercises
4. Expected values
  The population mean for discrete random variables
  The sample mean
5. Variation
  The variance
  The sample variance
  Simulation experiments
  The standard error of the mean
  Data example
  Summary notes
  Exercises
7. Asymptopia
  Asymptotics
  Limits of random variables
  The Central Limit Theorem
  CLT simulation experiments
  Confidence intervals
  Simulation of confidence intervals
  Poisson interval
  Summary notes
  Exercises
8. t Confidence intervals
  Small sample confidence intervals
  Gosset’s t distribution
  The data
  Independent group t confidence intervals
  Confidence interval
  Mistakenly treating the sleep data as grouped
  Unequal variances
  Summary notes
  Exercises
9. Hypothesis testing
  Hypothesis testing
  Types of errors in hypothesis testing
  Discussion relative to court cases
  Building up a standard of evidence
  General rules
  Two sided tests
  T test in R
  Connections with confidence intervals
  Two group intervals
  Exact binomial test
  Exercises
10. P-values
  Introduction to P-values
  What is a P-value?
  The attained significance level
  Binomial P-value example
  Poisson example
  Exercises
of probability models as the connection between our data and a population represents the most
effective way to obtain inference.
Summary notes
These examples illustrate many of the difficulties of trying to use data to create general conclusions
about a population.
Paramount among our concerns are:
• Is the sample representative of the population that we’d like to draw inferences about?
• Are there known and observed, known and unobserved or unknown and unobserved variables
that contaminate our conclusions?
• Is there systematic bias created by missing data or the design or conduct of the study?
• What randomness exists in the data and how do we use or adjust for it? Here randomness
can either be explicit via randomization or random sampling, or implicit as the aggregation
of many complex unknown processes.
• Are we trying to estimate an underlying mechanistic model of phenomena under study?
Statistical inference requires navigating the set of assumptions and tools and subsequently thinking
about how to draw conclusions from data.
1. Estimate and quantify the uncertainty of an estimate of a population quantity (the proportion
of people who will vote for a candidate).
2. Determine whether a population quantity is a benchmark value (“is the treatment effective?”).
3. Infer a mechanistic relationship when quantities are measured with noise (“What is the slope
for Hooke’s law?”)
4. Determine the impact of a policy (“If we reduce pollution levels, will asthma rates decline?”)
5. Talk about the probability that something occurs.
1. Randomization: concerned with balancing unobserved variables that may confound infer-
ences of interest.
2. Random sampling: concerned with obtaining data that is representative of the population of
interest.
3. Sampling models: concerned with creating a model for the sampling process; the most common
is the so-called “iid” model.
4. Hypothesis testing: concerned with decision making in the presence of uncertainty.
5. Confidence intervals: concerned with quantifying uncertainty in estimation.
6. Probability models: a formal connection between the data and a population of interest. Often
probability models are assumed or are approximated.
7. Study design: the process of designing an experiment to minimize biases and variability.
8. Nonparametric bootstrapping: the process of using the data to, with minimal probability
model assumptions, create inferences.
9. Permutation, randomization and exchangeability testing: the process of using data permuta-
tions to perform inferences.
1. Frequency probability: is the long run proportion of times an event occurs in independent,
identically distributed repetitions.
2. Frequency style inference: uses frequency interpretations of probabilities to control error
rates. Answers questions like “What should I decide given my data, controlling the long run
proportion of mistakes I make at a tolerable level?”
3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain
rules.
4. Bayesian style inference: the use of Bayesian probability representation of beliefs to perform
inference. Answers questions like “Given my subjective beliefs and the objective information
from the data, what should I believe now?”
Data scientists tend to fall within shades of gray of these and various other schools of inference.
Furthermore, there are so many shades of gray between the styles of inferences that it is hard to pin
down most modern statisticians as either Bayesian or frequentist. In this class, we will primarily
focus on basic sampling models, basic probability models and frequency style analyses to create
standard inferences. This is the most popular style of inference by far.
Being data scientists, we will also consider some inferential strategies that
rely heavily on the observed data, such as permutation testing and bootstrapping. As probability
modeling will be our starting point, our first task is to build up basic probability.
Exercises
1. The goal of statistical inference is to?
• Infer facts about a population from a sample.
• Infer facts about the sample from a population.
• Calculate sample quantities to understand your data.
• To torture Data Science students.
2. The goal of randomization of a treatment in a randomized trial is to?
• It doesn’t really do anything.
• To obtain a representative sample of subjects from the population of interest.
• Balance unobserved covariates that may contaminate the comparison between the
treated and control groups.
• To add variation to our conclusions.
3. Probability is a?
• Population quantity that we can potentially estimate from data.
• A data quantity that does not require the idea of a population.
2. Probability
Watch this video before beginning.¹
Probability forms the foundation for almost all treatments of statistical inference. In our treatment,
probability is a law that assigns numbers to the long run occurrence of random phenomena after
repeated unrelated realizations.
Before we begin discussing probability, let’s dispense with some deep philosophical questions, such
as “What is randomness?” and “What is the fundamental interpretation of probability?”. One could
spend a lifetime studying these questions (and some have). For our purposes, randomness is any
process occurring without apparent deterministic patterns. Thus we will treat many things as if
they were random when, in fact, they are completely deterministic. In my field, biostatistics, we
often model disease outcomes as if they were random when they are the result of many mechanistic
components whose aggregate behavior appears random. Probability for us will be the long run
proportion of times something occurs in repeated unrelated realizations. So, think of the proportion of
times that you get a head when flipping a coin.
For the interested student, I would recommend the books and work by Ian Hacking to learn more
about these deep philosophical issues. For us data scientists, the above definitions will work fine.
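To make the long run proportion idea concrete, here is a minimal simulation sketch. The number of flips and the seed are arbitrary assumptions chosen for illustration, and the coin is assumed to be fair.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
n <- 10000                                    # number of simulated flips (an assumption)
flips <- sample(c(0, 1), n, replace = TRUE)   # 0 = tail, 1 = head, fair coin
running_prop <- cumsum(flips) / seq_len(n)    # proportion of heads after each flip
running_prop[c(10, 100, 1000, n)]             # early values wander, later ones settle near 0.5
```

The proportion after only a few flips can be far from one half, which is why we think of probability as the proportion over a long run of repetitions.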
From these simple rules all of the familiar rules of probability can be developed. This all might seem
a little odd at first and so we’ll build up our intuition with some simple examples based on coin
flipping and die rolling.
I would like to reiterate the important definition that we wrote out: mutually exclusive. Two
events are mutually exclusive if they cannot both simultaneously occur. For example, we cannot
simultaneously get a 1 and a 2 on a die. Rule 3 says that since the events of getting a 1 and getting a 2 on
a die are mutually exclusive, the probability of getting at least one (the union) is the sum of their
probabilities. So if we know that the probability of getting a 1 is 1/6 and the probability of getting a
2 is 1/6, then the probability of getting a 1 or a 2 is 2/6, the sum of the two probabilities since they
are mutually exclusive.
This last rule, which states that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, shows the issue with
adding the probabilities of events that are not mutually exclusive. If we do this, we’ve added the
probability that both occur twice! (Watch the video where I draw a Venn diagram to illustrate this).
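The addition rules can be checked by enumerating the outcomes of a fair die. This is only a sketch: the helper `prob` and the overlapping events `evens` and `low` are hypothetical names introduced for illustration.

```r
omega <- 1:6                                             # sample space of a fair die
prob <- function(event) length(event) / length(omega)    # equally likely outcomes

# Mutually exclusive events: rolling a 1, rolling a 2
A <- 1
B <- 2
prob(union(A, B))                                        # equals prob(A) + prob(B) = 2/6

# Overlapping events (hypothetical): an even roll, and a roll of at most 2
evens <- c(2, 4, 6)
low <- c(1, 2)
naive <- prob(evens) + prob(low)                                 # counts the outcome 2 twice
correct <- prob(evens) + prob(low) - prob(intersect(evens, low)) # general addition rule
c(naive = naive, correct = correct, direct = prob(union(evens, low)))
```

The naive sum overshoots by exactly the probability of the shared outcome, while the corrected value agrees with counting the union directly.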
Then
Given the scenario, it’s likely that some fraction of the population has both. This example serves as
a reminder: don’t add probabilities unless the events are mutually exclusive. We’ll have a similar rule
for multiplying probabilities and independence.
Random variables
Watch this video before reading this section¹¹
¹⁰http://www.sleepfoundation.org/
¹¹http://youtu.be/Shzt9uZ8BII?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Probability calculus is useful for understanding the rules that probabilities must follow. However, we
need ways to model and think about probabilities for numeric outcomes of experiments (broadly
defined). Densities and mass functions for random variables are the best starting point for this.
You’ve already heard of a density since you’ve heard of the famous “bell curve”, or Gaussian density.
In this section you’ll learn exactly what the bell curve is and how to work with it.
Remember, everything we’re talking about up to this point is a population quantity, not a
statement about what occurs in our data. Think about the fact that a 50% probability of a head is a
statement about the coin and how we’re flipping it, not a statement about the percentage of heads
we obtained in a particular set of flips. This is an important distinction that we will emphasize over
and over in this course. Statistical inference is about describing populations using data. Probability
density functions are a way to mathematically characterize the population. In this course, we’ll
assume that our sample is a random draw from the population.
So our definition is that a random variable is a numerical outcome of an experiment. The random
variables that we study will come in two varieties, discrete or continuous. Discrete random
variables are random variables that take on only a countable number of possibilities. Mass functions
will assign probabilities that they take specific values. Continuous random variables can conceptually
take any value on the real line or some subset of the real line and we talk about the probability that
they lie within some range. Densities will characterize these probabilities.
Let’s consider some examples of measurements that could be considered random variables. First,
familiar gambling experiments like the tossing of a coin and the rolling of a die produce random
variables. For the coin, we typically code a tail as a 0 and a head as a 1. (For the die, the number
facing up would be the random variable.) We will use these examples a lot to help us build intuition.
However, they can seem uninteresting because they are so contrived. Nonetheless, the coin
example is particularly useful since many of the experiments we consider will be modeled as if
tossing a biased coin. Modeling any binary characteristic from a random sample of a population
can be thought of as a coin toss, with the random sampling performing the role of the toss and the
population percentage of individuals with the characteristic being the probability of a head. Consider,
for example, logging whether or not subjects were hypertensive in a random sample. Each subject’s
outcome can be modeled as a coin toss. In a similar sense the die roll serves as our model for
phenomena with more than one level, such as hair color or rating scales.
Consider also the random variable of the number of web hits for a site each day. This variable is
a count, but is largely unbounded (or at least we couldn’t put a specific reasonable upper limit).
Random variables like this are often modeled with the so called Poisson distribution.
Finally, consider some continuous random variables. Think of things like lengths or weights. It is
mathematically convenient to model these as if they were continuous (even if measurements were
truncated liberally). In fact, even discrete random variables with lots of levels are often treated as
continuous for convenience.
For all of these kinds of random variables, we need convenient mathematical functions to model
the probabilities of collections of realizations. These functions, called mass functions and densities,
take possible values of the random variables, and assign the associated probabilities. These entities
describe the population of interest. So, consider the most famous density, the normal distribution.
Saying that body mass indices follow a normal distribution is a statement about the population of
interest. The goal is to use our data to figure out things about that normal distribution, where it’s
centered, how spread out it is and even whether our assumption of normality is warranted!
Example
Let X be the result of a coin flip where X = 0 represents tails and X = 1 represents heads. Then
$p(x) = (1/2)^x (1/2)^{1-x}$ for $x = 0, 1$. Suppose that we do not know whether or not the coin is fair;
let $\theta$ be the probability of a head expressed as a proportion (between 0 and 1). Then
$p(x) = \theta^x (1 - \theta)^{1-x}$ for $x = 0, 1$.
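The coin flip mass function is easy to evaluate directly. In this sketch `bern_pmf` is a hypothetical helper name and the value 0.3 for the probability of a head is an arbitrary assumption.

```r
# Mass function p(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}
bern_pmf <- function(x, theta) theta^x * (1 - theta)^(1 - x)

fair <- bern_pmf(c(0, 1), theta = 0.5)            # fair coin: 0.5 and 0.5
theta <- 0.3                                      # hypothetical biased coin
biased <- bern_pmf(c(0, 1), theta)                # 0.7 for a tail, 0.3 for a head
builtin <- dbinom(c(0, 1), size = 1, prob = theta) # R's binomial PMF with one trial
rbind(biased, builtin)                            # the two rows agree
sum(biased)                                       # a valid mass function sums to 1
```

R's built-in `dbinom` with `size = 1` gives the same values, so there is no need to write the formula by hand in practice.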
Example
Suppose that the proportion of help calls that get addressed in a random day by a help line is given
by f (x) = 2x for 0 < x < 1. The R code for plotting this density is
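A minimal sketch of such a plot, assuming base R graphics. The vertex coordinates are chosen here simply to trace the triangle, and the helper `f` is a hypothetical name for the density.

```r
# Density f(x) = 2x on (0, 1), and zero elsewhere
f <- function(x) ifelse(x > 0 & x < 1, 2 * x, 0)

# Vertices tracing the density: flat at zero, rising to 2 at x = 1, then back to zero
x <- c(-0.5, 0, 1, 1, 1.5)
y <- c(0, 0, 2, 0, 0)
plot(x, y, lwd = 3, frame = FALSE, type = "l", xlab = "x", ylab = "f(x)")
```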
Is this a mathematically valid density? To answer this we need to make sure it satisfies our two
conditions. First it’s clearly nonnegative (it’s at or above the horizontal axis everywhere). The area
is similarly easy. Because the only section of the density that is above zero is a right triangle, we can
calculate the area as 1/2 the base times the height. This is $\frac{1}{2} \times 1 \times 2 = 1$.
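If the geometry were less convenient, both conditions could be checked numerically. The following is a sketch using R's `integrate` function; the grid used for the nonnegativity check is an arbitrary choice.

```r
f <- function(x) 2 * x                           # the density on (0, 1)

grid <- seq(0, 1, by = 0.01)                     # arbitrary grid of points in the support
nonnegative <- all(f(grid) >= 0)                 # condition 1: never below zero
area <- integrate(f, lower = 0, upper = 1)$value # condition 2: total area
c(nonnegative = nonnegative, area = area)
```

The numerical area agrees with the triangle calculation up to the tolerance of the integration routine.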
Now consider answering the following question. What is the probability that 75% or fewer of calls
get addressed? Remember, for continuous random variables, probabilities are represented by areas
underneath the density function. So, we want the area from 0.75 and below, as illustrated by the
figure below.
This again is a right triangle, with length of the base as 0.75 and height 1.5. The R code below shows
the calculation.
> 1.5 * 0.75 / 2
[1] 0.5625
Thus, the probability of 75% or fewer calls getting addressed in a random day for this help line is
56%. We’ll do this a lot throughout this class and work with more useful densities. It should be noted
that this specific density is a special case of the so called beta density. Below I show how to use R’s
built in evaluation function for the beta density to get the probability.
> pbeta(0.75, 2, 1)
[1] 0.5625
Notice the syntax pbeta. In R, a prefix of p returns probabilities, d returns the density, q returns
the quantile and r returns generated random variables. (You’ll learn what each of these does in
subsequent sections.)
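As a quick preview, here is a sketch of all four prefixes applied to the beta density with parameters 2 and 1 used above. The seed and the number of random draws are arbitrary assumptions.

```r
height <- dbeta(0.75, 2, 1)    # d: height of the density at 0.75, which is 2 * 0.75
prob <- pbeta(0.75, 2, 1)      # p: probability of a value at or below 0.75
point <- qbeta(prob, 2, 1)     # q: the point with that much probability below it
set.seed(2)                    # arbitrary seed, for reproducibility only
draws <- rbeta(5, 2, 1)        # r: five random draws from the density
c(height = height, prob = prob, point = point)
```

Note that `qbeta` undoes `pbeta`, returning us to 0.75.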
$$F(x) = P(X \le x)$$
$$S(x) = P(X > x)$$
Notice that S(x) = 1 − F (x), since the survival function evaluated at a particular value of x is
calculating the probability of the opposite event (greater than as opposed to less than or equal to).
The survival function is often preferred in biostatistical applications while the distribution function
is more generally used (though both convey the same information.)
Example
What are the survival function and CDF from the density considered before?
$$F(x) = P(X \le x) = \frac{1}{2} \text{Base} \times \text{Height} = \frac{1}{2}(x) \times (2x) = x^2,$$
for $1 \ge x \ge 0$. Notice that calculating the survival function is now trivial given that we’ve already
calculated the distribution function.
$$S(x) = 1 - F(x) = 1 - x^2$$
Again, R has a function that calculates the distribution function for us in this case, pbeta. Let’s try
calculating F (.4), F (.5) and F (.6)
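A minimal sketch of this calculation, assuming the beta density with parameters 2 and 1 noted earlier:

```r
x <- c(0.4, 0.5, 0.6)
cdf_values <- pbeta(x, 2, 1)   # F evaluated at each of the three points
cdf_values
```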
Notice, of course, these are simply the numbers squared. By default the prefix p in front of a density
in R gives the distribution function (pbeta, pnorm, pgamma). If you want the survival function values,
you could always subtract from one, or pass the argument lower.tail = FALSE to the
function, which asks R to calculate the upper area instead of the lower.
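The two routes to the survival function can be compared directly. This is a sketch for the same beta density; the evaluation points are arbitrary.

```r
x <- c(0.4, 0.5, 0.6)
by_subtraction <- 1 - pbeta(x, 2, 1)                # S(x) = 1 - F(x)
by_upper_tail <- pbeta(x, 2, 1, lower.tail = FALSE) # upper area computed by R
closed_form <- 1 - x^2                              # from the triangle argument
rbind(by_subtraction, by_upper_tail, closed_form)   # all three rows agree
```

For probabilities very close to one, the `lower.tail = FALSE` route is generally the more numerically careful choice, though here the difference is immaterial.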
Quantiles
You’ve heard of sample quantiles. If you were at the 95th percentile on an exam, you know that 95%
of people scored worse than you and 5% scored better. These are sample quantities. But you might
have wondered, what are my sample quantiles estimating? In fact, they are estimating the population
quantiles. Here we define these population analogs.
The $\alpha$th quantile of a distribution with distribution function $F$ is the point $x_\alpha$ so that
$$F(x_\alpha) = \alpha$$
So the 0.95 quantile of a distribution is the point so that 95% of the mass of the density lies below it.
Or, in other words, the point so that the probability of getting a randomly sampled point below it is
0.95. This is analogous to the sample quantiles where the 0.95 sample quantile is the value so that
95% of the data lies below it.
A percentile is simply a quantile with α expressed as a percent rather than a proportion.
The (population) median is the 50th percentile. Remember that percentiles are not probabilities!
Remember that quantiles have units. So the population median height is the height (in inches say)
so that the probability that a randomly selected person from the population is shorter is 50%. The
sample, or empirical, median would be the height such that 50% of the people in the
sample were shorter.
Example
What is the median of the distribution that we were working with before? We want to solve
$0.5 = F(x) = x^2$, resulting in the solution
> sqrt(0.5)
[1] 0.7071
Therefore, the median proportion of calls answered on a random day is 0.7071. In other words, the
probability that about 71% or fewer of calls get answered is 50%.
R can approximate quantiles for you for common distributions with the prefix q in front of the
distribution name
> qbeta(0.5, 2, 1)
[1] 0.7071
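Quantiles other than the median work the same way. In this sketch the particular values of alpha are arbitrary; the point is that the quantile function inverts the distribution function.

```r
alpha <- c(0.25, 0.5, 0.95)
q <- qbeta(alpha, 2, 1)         # population quantiles of the help line density
closed_form <- sqrt(alpha)      # solving x^2 = alpha gives x = sqrt(alpha)
recovered <- pbeta(q, 2, 1)     # applying F to the quantiles returns alpha
rbind(q, closed_form, recovered)
```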
Exercises
1. Can you add the probabilities of any two events to get the probability of at least one occurring?
2. I define a PMF, p so that for x = 0 and x = 1 we have p(0) = −0.1 and p(1) = 1.1. Is this a
valid PMF?
3. What is the probability that 75% or fewer calls get answered in a randomly sampled day from
the population distribution from this chapter?
4. The 97.5th percentile of a distribution is?
5. Consider influenza epidemics for two parent heterosexual families. Suppose that the proba-
bility is 15% that at least one of the parents has contracted the disease. The probability that the
father has contracted influenza is 10% while that the mother contracted the disease is 9%. What
is the probability that both contracted influenza expressed as a whole number percentage?
Watch a video solution to this problem¹³ and see a written out solution.¹⁴
6. A random variable, X, is uniform, a box from 0 to 1 of height 1. (So that its density is f (x) = 1
for 0 ≤ x ≤ 1.) What is its median expressed to two decimal places? Watch a video solution
to this problem here¹⁵ and see written solutions here¹⁶.
7. If a continuous density that never touches the horizontal axis is symmetric about zero, can
we say that its associated median is zero? Watch a worked out solution to this problem here¹⁷
and see the question and a typed up answer here¹⁸
¹³http://youtu.be/CvnmoCuIN08?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁴http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#3
¹⁵http://youtu.be/UXcarD-1xAM?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁶http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#4
¹⁷http://youtu.be/sn48CGH_TXI?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁸http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#9
3. Conditional probability
Conditional probability, motivation
Watch this video before beginning.¹
Conditioning is a central subject in statistics. If we are given information about a random variable,
it changes the probabilities associated with it. For example, the probability of getting a one when
rolling a (standard) die is usually assumed to be one sixth. If you were given the extra information
that the die roll was an odd number (hence 1, 3 or 5) then conditional on this new information, the
probability of a one is now one third.
This is the idea of conditioning, taking away the randomness that we know to have occurred.
Consider another example, such as the result of a diagnostic imaging test for lung cancer. What’s the
probability that a person has cancer given a positive test? How does that probability change under
the knowledge that a patient has been a lifetime heavy smoker and both of their parents had lung
cancer? Conditional on this new information, the probability has increased dramatically.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
If A and B are unrelated in any way, or in other words independent, (discussed more later in the
lecture), then
$$P(A \mid B) = \frac{P(A)P(B)}{P(B)} = P(A)$$
That is, if the occurrence of B offers no information about the occurrence of A (the probability
conditional on the information is the same as the probability without the information), we say that
the two events are independent.
¹http://youtu.be/u6AH6qsSVA4?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Example
Consider our die roll example again. Here we have that B = {1, 3, 5} and A = {1}
$$P(\text{one given that roll is odd}) = P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}$$
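The same answer can be approximated by simulation, which mirrors the idea of conditioning as restricting attention to the rolls where B occurred. This is a sketch; the seed and number of rolls are arbitrary assumptions.

```r
set.seed(3)                                        # arbitrary seed
rolls <- sample(1:6, 100000, replace = TRUE)       # simulated fair die rolls
odd <- rolls %% 2 == 1                             # event B: the roll is odd

among_odd <- mean(rolls[odd] == 1)                 # proportion of ones among odd rolls
by_formula <- mean(rolls == 1 & odd) / mean(odd)   # estimate of P(A and B) / P(B)
c(among_odd = among_odd, by_formula = by_formula)  # both close to 1/3
```

The two estimates are identical by construction, since dividing the joint proportion by the proportion of odd rolls is the same as counting only within the odd rolls.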
Bayes’ rule
Watch this video before beginning²
Bayes’ rule is a famous result in statistics and probability. It forms the foundation for large branches
of statistical thinking. Bayes’ rule allows us to reverse the conditioning set provided that we know
some marginal probabilities.
Why is this useful? Consider our lung cancer example again. It would be relatively easy for
physicians to calculate the probability that the diagnostic method is positive for people with lung
cancer and negative for people without. They could take several people who are already known
to have the disease and apply the test and conversely take people known not to have the disease.
However, for the collection of people with a positive test result, the reverse probability is more of
interest, “given a positive test what is the probability of having the disease?”, and “given a
negative test what is the probability of not having the disease?”.
Bayes’ rule allows us to switch the conditioning event, provided a little bit of extra information.
Formally Bayes’ rule is:
$$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A \mid B)P(B) + P(A \mid B^c)P(B^c)}.$$
Diagnostic tests
Since diagnostic tests are a really good example of Bayes’ rule in practice, let’s go over them in greater
detail. (In addition, understanding Bayes’ rule will be helpful for your own ability to understand
medical tests that you see in your daily life). We require a few definitions first.
Let + and − be the events that the result of a diagnostic test is positive or negative, respectively. Let
$D$ and $D^c$ be the events that the subject of the test has or does not have the disease, respectively.
The sensitivity is the probability that the test is positive given that the subject actually has the
disease, $P(+ \mid D)$.
²http://youtu.be/TfeaZ_26iQk?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
The specificity is the probability that the test is negative given that the subject does not have the
disease, $P(- \mid D^c)$.
So, conceptually at least, the sensitivity and specificity are straightforward to estimate. Take people
known to have and not have the disease and apply the diagnostic test to them. However, the reality
of estimating these quantities is quite challenging. For example, are the people known to have the
disease in its later stages, while the diagnostic will be used on people in the early stages where it’s
harder to detect? Let’s put these subtleties to the side and assume that they are known well.
The quantities that we’d like to know are the predictive values.
The positive predictive value is the probability that the subject has the disease given that the test
is positive, $P(D \mid +)$.
The negative predictive value is the probability that the subject does not have the disease given
that the test is negative, $P(D^c \mid -)$.
Finally, we need one last thing, the prevalence of the disease, which is the marginal probability
of disease, $P(D)$. Let’s now try to figure out a PPV in a specific setting.
Example
A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV
antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a
population with a 0.1% prevalence of HIV, receives a positive test result. What is the positive predictive
value?
Mathematically, we want P (D | +) given the sensitivity, P (+ | D) = .997, the specificity,
P (− | Dc ) = .985 and the prevalence P (D) = .001.
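$$
\begin{aligned}
P(D \mid +) &= \frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D) + P(+ \mid D^c)P(D^c)} \\
&= \frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D) + \{1 - P(- \mid D^c)\}\{1 - P(D)\}} \\
&= \frac{.997 \times .001}{.997 \times .001 + .015 \times .999} \\
&= .062
\end{aligned}
$$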
In this population, a positive test result suggests only a 6% probability that the subject has the disease
(that is, the positive predictive value is 6% for this test). If you are wondering how it could be so low,
the low positive predictive value is due to the low prevalence of disease and the somewhat
modest specificity.
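The arithmetic can be checked in R. The following is a minimal sketch; the function name ppv and its argument names are illustrative assumptions rather than code from the course.

```r
# Positive predictive value via Bayes' rule.
# The function and argument names are hypothetical, chosen for illustration.
ppv <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence              # P(+ | D) P(D)
  false_pos <- (1 - specificity) * (1 - prevalence)  # P(+ | D^c) P(D^c)
  true_pos / (true_pos + false_pos)
}

# Values from the HIV example.
hiv_ppv <- ppv(sensitivity = 0.997, specificity = 0.985, prevalence = 0.001)
round(hiv_ppv, 3)  # 0.062
```

Calling ppv() again with a larger prevalence shows how strongly the answer depends on that one input.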
Suppose it were known that the subject was an intravenous drug user and routinely had intercourse
with an HIV-infected partner. Our prevalence would change dramatically, thus increasing the
PPV. You might wonder if there’s a way to summarize the evidence without appealing to an often
unknowable prevalence. Diagnostic likelihood ratios provide this for us.
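Using Bayes’ rule, we have
$$
P(D \mid +) = \frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D) + P(+ \mid D^c)P(D^c)}
$$
and
$$
P(D^c \mid +) = \frac{P(+ \mid D^c)P(D^c)}{P(+ \mid D)P(D) + P(+ \mid D^c)P(D^c)}.
$$
Dividing the first equation by the second gives
$$
\frac{P(D \mid +)}{P(D^c \mid +)} = \frac{P(+ \mid D)}{P(+ \mid D^c)} \times \frac{P(D)}{P(D^c)}.
$$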
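In other words, the post-test odds of disease is the pretest odds of disease times the DLR+, where
DLR+ = P(+ | D)/P(+ | Dᶜ) = sensitivity/(1 − specificity) is the diagnostic likelihood ratio of a
positive test. Similarly, DLR− = P(− | D)/P(− | Dᶜ) = (1 − sensitivity)/specificity relates the
decrease in the odds of the disease after a negative test result to the odds of disease prior to the test.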
So, the DLRs are the factors by which you multiply your pretest odds to get your post test odds.
Thus, if a test has a DLR+ of 6, regardless of the prevalence of disease, the post test odds is six
times that of the pretest odds.
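Returning to the HIV example, where the sensitivity is 99.7% and the specificity is 98.5%, suppose
that a subject has a positive test result. Then DLR+ = P(+ | D)/P(+ | Dᶜ) = .997/(1 − .985) ≈ 66.
The result of the positive test is that the odds of disease is now 66 times the pretest odds. Or,
equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis
of no disease.
Suppose instead that a subject has a negative test result. Then DLR− = P(− | D)/P(− | Dᶜ) =
(1 − .997)/.985 ≈ .003.
Therefore, the post-test odds of disease is now 0.3% of the pretest odds given the negative test. Or,
the hypothesis of disease is supported .003 times that of the hypothesis of absence of disease given
the negative test result.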
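As a rough sketch of the odds calculation, the following R code computes both diagnostic likelihood ratios for the HIV example and converts the post-test odds back to a probability. The variable names are assumptions made for illustration.

```r
# Diagnostic likelihood ratios from sensitivity and specificity.
# Variable names are hypothetical; the values come from the HIV example.
sensitivity <- 0.997
specificity <- 0.985
prevalence  <- 0.001

dlr_pos <- sensitivity / (1 - specificity)   # about 66
dlr_neg <- (1 - sensitivity) / specificity   # about 0.003

# Post-test odds = pretest odds x DLR; then convert odds back to a probability.
pre_odds      <- prevalence / (1 - prevalence)
post_odds_pos <- pre_odds * dlr_pos
post_prob_pos <- post_odds_pos / (1 + post_odds_pos)
round(c(dlr_pos, dlr_neg, post_prob_pos), 3)
```

The last number is the same positive predictive value obtained directly from Bayes’ rule, which is a useful consistency check on the odds form.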
Independence
Watch this video before beginning.³
Statistical independence of events is the idea that the events are unrelated. Consider successive coin
flips. Knowledge of the result of the first coin flip tells us nothing about the second. We can formalize
this into a definition.
Two events A and B are independent if
P (A ∩ B) = P (A)P (B)
Example
Let’s cover a very simple example: “What is the probability of getting two consecutive heads?” Then
we have that
• A is the event of getting a head on flip 1, with P(A) = 0.5,
• B is the event of getting a head on flip 2, with P(B) = 0.5, and
• A ∩ B is the event of getting heads on flips 1 and 2.
Then independence would tell us that:
P(A ∩ B) = P(A)P(B) = 0.5 × 0.5 = 0.25
³http://youtu.be/MY1EfrR1ZUs?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
This is exactly what we would have intuited of course. But, it’s nice that the mathematics mirrors
our intuition. In more complex settings, it’s easy to get tripped up. Consider the following famous
(among statisticians at least) case study.
Case Study
Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal
trial. Based on an estimated prevalence of sudden infant death syndrome (SIDS) of 1 out of 8,543,
the physician testified that the probability of a mother having two children with SIDS was
(1/8,543)². The mother on trial was convicted of murder.
Relevant to this discussion, the principal mistake was to assume that the events of having SIDS
within a family are independent. That is, P(A1 ∩ A2) is not necessarily equal to P(A1)P(A2). This
is because biological processes that are believed to have a genetic or familial environmental
component, of course, tend to be dependent within families. Thus, we can’t just multiply the
probabilities to obtain the result.
There are many other interesting aspects to the case. For example, the idea of a low probability of
an event representing evidence against a defendant. (Could we convict all lottery winners of fixing
the lottery since the chance that they would win is so small?)
Exercises
1. I pull a card from a deck and do not show you the result. I say that the resulting card is a
heart. What is the probability that it is the queen of hearts?
2. The odds associated with a probability, p, are defined as?
3. The probability of getting two sixes when rolling a pair of dice is?
4. The probability that a manuscript gets accepted to a journal is 12% (say). However, given
that a revision is asked for, the probability that it gets accepted is 90%. Is it possible that the
probability that a manuscript has a revision asked for is 20%? Watch a video of this problem
getting solved⁴ and see the worked out solutions here⁵.
5. Suppose 5% of housing projects have issues with asbestos. The sensitivity of a test for asbestos
is 93% and the specificity is 88%. What is the probability that a housing project has no asbestos
given a negative test expressed as a percentage to the nearest percentage point? Watch a video
solution here⁶ and see the worked out problem here⁷.
⁴http://youtu.be/E4kE4M1J15s?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁵http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#3
⁶https://www.youtube.com/watch?v=rbI97tSvGvQ&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=11
⁷http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#5
4. Expected values
Watch this video before beginning.¹
Expected values characterize a distribution. The most useful expected value, the mean, characterizes
the center of a density or mass function. Another expected value summary, the variance,
characterizes how spread out a density is. Yet another expected value calculation is the skewness, which
considers how much a density is pulled toward high or low values.
Remember, in this lecture we are discussing population quantities. Conveniently (and of course
by design), each sample analog shares its name with the population quantity that it estimates.
So, for example, the sample or empirical mean estimates the population mean; the sample variance
estimates the population variance and the sample skewness estimates the population skewness.
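For a discrete random variable X with PMF p(x), the expected value is defined as
$$
E[X] = \sum_x x p(x),
$$
where the sum is taken over the possible values of x. Where did they get this idea from? It’s taken
from the physical idea of the center of mass. Specifically, E[X] represents the center of mass
of a collection of locations and weights, {x, p(x)}. We can exploit this fact to quickly calculate
population means for distributions where the center of mass is obvious.
The sample mean has the same form. It is the center of mass of the observed data when each data
point is given equal weight:
$$
\bar X = \sum_{i=1}^n x_i p(x_i),
$$
where p(x_i) = 1/n.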
Galton’s Data
Using RStudio’s manipulate package, you can try moving the histogram around and see what value
balances it out. Be sure to watch the video to see this in action.
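library(manipulate)
library(ggplot2)
library(UsingR); data(galton)   # Galton's parent and child height data
# Draw the histogram of child heights with a vertical line at mu and
# report the mean squared error of mu as a summary of the data.
myHist <- function(mu){
    g <- ggplot(galton, aes(x = child))
    g <- g + geom_histogram(fill = "salmon",
        binwidth = 1, aes(y = ..density..), color = "black")
    g <- g + geom_density(size = 2)
    g <- g + geom_vline(xintercept = mu, size = 2)
    mse <- round(mean((galton$child - mu)^2), 3)
    g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse))
    g
}
# Slide mu between 62 and 74 to find the value that balances the histogram.
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))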
Going through this exercise, you find that the point that balances out the histogram is the empirical
mean. (Note there’s a small distinction here that comes about from rounding with the histogram bar
widths, but ignore that for the time being.) If the bars of the histogram are from the observed data,
the point that balances it out is the empirical mean; if the bars are the true population probabilities
(which we don’t know of course) then the point is the population mean. Let’s now go through some
examples of mathematically calculating the population mean.
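The balancing idea can also be checked numerically without the interactive slider. The sketch below assumes simulated heights in place of real data (the mean of 68 and standard deviation of 2.5 are arbitrary choices) and searches for the value of mu with the smallest mean squared error.

```r
# The value of mu that minimizes the mean squared error is the empirical mean.
# Simulated heights stand in for real data; all names here are hypothetical.
set.seed(1)
heights <- rnorm(500, mean = 68, sd = 2.5)

mse <- function(mu) mean((heights - mu)^2)

# One-dimensional numerical minimization over a plausible range of mu.
best <- optimize(mse, interval = c(60, 76))
c(minimizer = best$minimum, empirical_mean = mean(heights))
```

The two printed values agree up to the tolerance of the optimizer, which is the numerical counterpart of the histogram balancing at the empirical mean.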
Histogram illustration
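For example, suppose a fair coin is flipped and X is recorded as 0 or 1 according to the outcome,
each with probability .5. Then
E[X] = .5 × 0 + .5 × 1 = .5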
Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0
and 1, the center of mass will be 0.5.
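Now suppose the coin is possibly biased, so that X is 1 with probability p and 0 with probability
1 − p. Then
E[X] = 0 × (1 − p) + 1 × p = p
Notice that the expected value isn’t a value that the coin can take, in the same way that the sample
proportion of heads will likely be neither 0 nor 1.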
This coin example is not exactly trivial as it serves as the basis for a random sample of any population
for a binary trait. So, we might model the answer from an election polling question as if it were a
coin flip.
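Suppose that a fair die is rolled and X is the number face up. Then
$$
E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5
$$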
Again, the geometric argument makes this answer obvious without calculation.
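The coin and die calculations are all the same weighted sum, which can be sketched in a few lines of R. The helper name expected_value is an assumption for illustration, and the biased coin uses an arbitrary p = 0.3.

```r
# Expected value of a discrete random variable: sum of values times probabilities.
# expected_value is a hypothetical helper name.
expected_value <- function(x, p) {
  stopifnot(length(x) == length(p), abs(sum(p) - 1) < 1e-12)
  sum(x * p)
}

fair_coin   <- expected_value(c(0, 1), c(0.5, 0.5))  # 0.5
biased_coin <- expected_value(c(0, 1), c(0.7, 0.3))  # p = 0.3, an arbitrary choice
fair_die    <- expected_value(1:6, rep(1 / 6, 6))    # 3.5
c(fair_coin, biased_coin, fair_die)
```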
Example
Consider a density where f (x) = 1 for x between zero and one. Suppose that X follows this density;
what is its expected value?
³http://youtu.be/YS5EIKsamXI?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Uniform Density
The answer is clear: since the density looks like a box, it would balance out exactly in the middle,
at 0.5.
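For a continuous density the weighted sum becomes an integral of x times f(x). As a minimal sketch, base R’s integrate() can confirm the geometric answer for the uniform density on [0, 1]; the variable name is illustrative.

```r
# E[X] for a continuous density is the integral of x * f(x).
# Here f is the uniform density on [0, 1]; dunif() is its density in base R.
uniform_mean <- integrate(function(x) x * dunif(x, min = 0, max = 1),
                          lower = 0, upper = 1)$value
uniform_mean
```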
The distribution of heights gives rise to the distribution of averages of ten heights in the same way
that the distribution associated with a die roll gives rise to the distribution of the average of ten dice.
An important question to ask is: “What does the distribution of averages look like?”. This question
is important, since it tells us things about averages, the best way to estimate the population mean,
when we only get to observe one average.
Consider the die rolls again. If you wanted to know the distribution of averages of 100 die rolls, you
could (at least in principle) roll 100 dice, take the average and repeat that process. Imagine, though,
that you could only roll the 100 dice once. Then you would have direct information about the
distribution of die rolls (since you have 100 of them), but you wouldn’t have any direct information
about the distribution of the average of 100 die rolls, since you only observed one average.
Fortunately, the mathematics tells us about that distribution. Notably, it’s centered at the same spot
as the original distribution! Thus, the distribution of the estimator (the sample mean) is centered
at what it’s estimating (the population mean). When the expected value of an
estimator is what it’s trying to estimate, we say that the estimator is unbiased.
Let’s go through several simulation experiments to see this more fully.
Simulation experiments
Standard normals
Consider simulating a lot of standard normals and plotting a histogram (the blue density). Now
consider simulating lots of averages of 10 standard normals and plotting their histogram (the salmon
colored density). Notice that they’re centered in the same spot! The distribution of averages is also
more concentrated around that point. (We’ll discuss that more in the next lectures.)
Simulation of normals
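The comparison can be sketched numerically as well. The code below is an illustration under assumed settings (10,000 simulations, an arbitrary seed) and is not the code that produced the figure.

```r
# Compare individual standard normals with averages of 10 standard normals.
# nosim and n are illustrative choices.
set.seed(42)
nosim <- 10000
n <- 10

singles  <- rnorm(nosim)
averages <- rowMeans(matrix(rnorm(nosim * n), nrow = nosim))

# Both are centered near 0, but the averages are less spread out.
round(c(mean_single = mean(singles), mean_avg = mean(averages),
        sd_single = sd(singles), sd_avg = sd(averages)), 2)
```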
Summary notes
• Expected values are properties of distributions.
• The population mean is the center of mass of the population.
• The sample mean is the center of mass of the observed data.
• The sample mean is an estimate of the population mean.
• The sample mean is unbiased: the population mean of its distribution is the mean that it’s
trying to estimate.
• The more data that goes into the sample mean, the more concentrated its density / mass
function is around the population mean.
Exercises
1. A standard die takes the values 1, 2, 3, 4, 5, 6 with equal probability. What is the expected
value?
2. Consider a density that is uniform from -1 to 1. (I.e. has height equal to 1/2 and looks like a
box starting at -1 and ending at 1). What is the mean of this distribution?
3. If a population has mean µ, what is the mean of the distribution of averages of 20 observations
from this distribution?
4. You are playing a game with a friend where you flip a coin and if it comes up heads you give
her X dollars and if it comes up tails she gives you Y dollars. The odds that the coin is heads
is d. What are your expected earnings? Watch a video of the solution to this problem⁴ and look
at the problem and the solution here⁵.
5. If you roll ten standard dice, take their average, then repeat this process over and over and
construct a histogram what would it be centered at? Watch a video solution here⁶ and see the
original problem here⁷.
⁴http://youtu.be/5J88Zq0q81o?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁵http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#5
⁶https://www.youtube.com/watch?v=ia3n2URiJaw&index=16&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁷http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#11
5. Variation
The variance
Watch this video before beginning.¹
Recall that the mean of a distribution is a measure of its center. The variance, on the other hand, is
a measure of spread. To get a sense, the plot below shows a series of increasing variances.
We saw another example of how variances changed in the last chapter when we looked at the
distribution of averages; they were always centered at the same spot as the original distribution, but
¹http://youtu.be/oLQVU-VRiHo?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
are less spread out. Thus, it is less likely for sample means to be far away from the population mean
than it is for individual observations. (This is why the sample mean is a better estimate of the
population mean than any single observation.)
If X is a random variable with mean µ, the variance of X is defined as
$$
\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2.
$$
The rightmost equation is the shortcut formula that is almost always used for calculating variances
in practice.
Thus the variance is the expected (squared) distance from the mean. Densities with a higher variance
are more spread out than densities with a lower variance. The square root of the variance is called
the standard deviation. The main benefit of working with standard deviations is that they have the
same units as the data, whereas the variance has the units squared.
In this class, we’ll only cover a few basic examples for calculating a variance. Otherwise, we’re going
to use the ideas without the formalism. Also remember, what we’re talking about is the population
variance. It measures how spread out the population of interest is, unlike the sample variance which
measures how spread out the observed data are. Just like the sample mean estimates the population
mean, the sample variance will estimate the population variance.
Example
What’s the variance from the result of a toss of a die? First recall that E[X] = 3.5, as we discussed
in the previous lecture. Then let’s calculate the other bit of information that we need, E[X²].
$$
E[X^2] = 1^2 \times \frac{1}{6} + 2^2 \times \frac{1}{6} + 3^2 \times \frac{1}{6} + 4^2 \times \frac{1}{6} + 5^2 \times \frac{1}{6} + 6^2 \times \frac{1}{6} = 15.17
$$
Thus now we can calculate the variance as:
$$
\mathrm{Var}(X) = E[X^2] - E[X]^2 = 15.17 - 3.5^2 \approx 2.92
$$
Example
What’s the variance from the result of the toss of a (potentially biased) coin with probability of heads
(1) of p? First recall that E[X] = 0 × (1 − p) + 1 × p = p. Secondly, recall that since X is either 0
or 1, X² = X. So we know that:
E[X²] = E[X] = p.
Thus we can now calculate the variance of a coin flip as Var(X) = E[X²] − E[X]² = p − p² =
p(1 − p). This is a well known formula, so it’s worth committing to memory. It’s interesting to note
that this function is maximized at p = 0.5. The plot below shows this by plotting p(1 − p) by p.
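Both variance examples can be reproduced with the shortcut formula in R. This is a minimal sketch; pmf_variance is a hypothetical helper name and the grid spacing of 0.01 is an arbitrary choice.

```r
# Variance via the shortcut formula Var(X) = E[X^2] - E[X]^2.
# pmf_variance is a hypothetical helper name.
pmf_variance <- function(x, p) sum(x^2 * p) - sum(x * p)^2

die_var <- pmf_variance(1:6, rep(1 / 6, 6))
round(die_var, 2)  # 2.92

# Coin-flip variance p(1 - p) over a grid of p; it peaks at p = 0.5.
p_grid   <- seq(0, 1, by = 0.01)
coin_var <- p_grid * (1 - p_grid)
p_grid[which.max(coin_var)]
```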
Simulation experiments
Watch this video before beginning.²
Notice that these histograms are always centered in the same spot, 1. In other words, the sample
variance is an unbiased estimate of the population variance. Notice also that they get more
concentrated around 1 as more data goes into them. Thus, sample variances comprised of more
observations are less variable than sample variances comprised of fewer.
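A simulation along these lines can be sketched as follows. It assumes standard normal data (population variance 1) and sample sizes of 10, 20 and 30; these settings and all names are illustrative assumptions.

```r
# Simulate many sample variances from a population with variance 1.
# nosim and the sample sizes are illustrative choices.
set.seed(123)
nosim <- 10000
sample_vars <- function(n) {
  draws <- matrix(rnorm(nosim * n), nrow = nosim)
  apply(draws, 1, var)             # one sample variance per row
}

sizes   <- c(10, 20, 30)
sims    <- lapply(sizes, sample_vars)
centers <- sapply(sims, mean)      # each should be close to 1
spreads <- sapply(sims, sd)        # should shrink as n grows
round(rbind(n = sizes, center = centers, spread = spreads), 3)
```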
Recall that we calculated the variance of a die roll as 2.92 earlier on in this chapter. Notice that each
of the histograms is centered there. In addition, they get more concentrated around 2.92 as the
sample variances are comprised of more dice.
Summary notes
• The sample variance, S², estimates the population variance, σ².
• The distribution of the sample variance is centered around σ².
• The variance of the sample mean is σ²/n.
– Its logical estimate is S²/n.
– The logical estimate of the standard error is S/√n.
• S, the standard deviation, talks about how variable the population is.
• S/√n, the standard error, talks about how variable averages of random samples of size n
from the population are.
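The claim that averages of n observations have standard deviation σ/√n can be checked by simulation. The sketch below uses fair die rolls, for which σ² = 35/12 (about 2.92); the number of simulations, the seed and the variable names are assumptions.

```r
# Compare the simulated standard deviation of sample means with sigma / sqrt(n).
# Fair die rolls have variance 35/12; nosim and n are illustrative choices.
set.seed(2024)
nosim <- 10000
n <- 10

rolls <- matrix(sample(1:6, nosim * n, replace = TRUE), nrow = nosim)
means <- rowMeans(rolls)

simulated_se   <- sd(means)
theoretical_se <- sqrt(35 / 12) / sqrt(n)
round(c(simulated = simulated_se, theoretical = theoretical_se), 3)
```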
³http://youtu.be/uPjHB9JjGKI?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Data example
Watch this before beginning.⁴
Now let’s work through a data example to show how the standard error of the mean is used in
practice. We’ll use the father.son height data from Francis Galton.
library(UsingR); data(father.son)   # fathers' and sons' heights
x <- father.son$sheight             # sons' heights
n <- length(x)                      # number of sons
Here’s a histogram of the sons’ heights from the dataset. Let’s calculate different variances and
interpret them in this context.
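One way to compute the four quantities is sketched below. The helper name spread_summary is an assumption for illustration; it returns the variance, the estimated variance of the mean, the standard deviation and the standard error, in that order. The call on the sons’ heights only runs if the UsingR package is installed.

```r
# Four summaries of spread for a numeric vector: the variance and standard
# deviation of the data, and the estimated variance and standard error of its mean.
# spread_summary is a hypothetical helper name.
spread_summary <- function(x) {
  n <- length(x)
  c(variance = var(x), var_of_mean = var(x) / n,
    sd = sd(x), std_error = sd(x) / sqrt(n))
}

# Applied to the sons' heights when the UsingR package is available.
if (requireNamespace("UsingR", quietly = TRUE)) {
  data(father.son, package = "UsingR")
  print(round(spread_summary(father.son$sheight), 2))
}
```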
⁴http://youtu.be/Lm2DMVyZVxk?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
The first number, 7.92, and its square root, 2.81, are the estimated variance and standard deviation
of the sons’ heights. Therefore, 7.92 tells us exactly how variable sons’ heights were in the data and
estimates how variable sons’ heights are in the population. In contrast 0.01, and the square root 0.09,
estimate how variable averages of n sons’ heights are.
Therefore, the smaller numbers discuss the precision of our estimate of the mean of sons’ heights.
The larger numbers discuss how variable sons’ heights are in general.
Summary notes
• The sample variance estimates the population variance.
• The distribution of the sample variance is centered at what it’s estimating.
• It gets more concentrated around the population variance with larger sample sizes.
• The variance of the sample mean is the population variance divided by n.
– The square root is the standard error.
• It turns out that we can say a lot about the distribution of averages from random samples,
even though we only get one to look at in a given data set.
Exercises
1. If I have a random sample from a population, the sample variance is an estimate of?
• The population standard deviation.
• The population variance.
• The sample variance.
• The sample standard deviation.
2. The distribution of the sample variance of a random sample from a population is centered at
what?
• The population variance.
• The population mean.
3. I keep drawing samples of size n from a population with variance σ 2 and taking their average.
I do this thousands of times. If I were to take the variance of the collection of averages, about
what would it be?
4. You get a random sample of n observations from a population and take their average. You
would like to estimate the variability of averages of n observations from this population
to better understand how precise an estimate it is. Do you need to repeatedly collect averages
to do this?
• No, we can multiply our estimate of the population variance by 1/n to get a good
estimate of the variability of the average.
• Yes, you have to get repeat averages.
5. A random variable takes the value -4 with probability .2 and 1 with probability .8. What is
the variance of this random variable? Watch a video solution to this problem⁵ and look at a
version with a worked out solution.⁶
6. If X̄ and Ȳ are comprised of n iid random variables arising from distributions having means
µx and µy, respectively, and common variance σ², what is the variance of X̄ − Ȳ? Watch a video
solution to this problem here⁷ and see a typed up solution here⁸.
⁵http://youtu.be/Em-xJeQO1rc?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁶http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#6
⁷http://youtu.be/7zJhPzX6jns?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁸http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#7
7. Let X be a random variable having standard deviation σ. What can be said about the variance
of X/σ? Watch a video solution to this problem here⁹ and typed up solutions here¹⁰.
8. Consider the following pmf given in R by the code p <- c(.1, .2, .3, .4) and x <- 2 : 5.
What is the variance? Watch a video solution to this problem here¹¹ and here is the problem
worked out¹².
9. If you roll ten standard dice, take their average, then repeat this process over and over and
construct a histogram, what would be its variance expressed to 3 decimal places? Watch a
video solution here¹³ and see the text here¹⁴.
⁹http://youtu.be/0WUj18_BUPA?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁰http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#8
¹¹http://youtu.be/HSn8n4DsGSg?list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹²http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#10
¹³https://www.youtube.com/watch?v=MLfo9zz1zX4&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=17
¹⁴http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#12
6. Some common distributions
The Bernoulli distribution
The Bernoulli distribution arises as the result of a binary outcome, such as a coin flip. Thus,
Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) p and 1 − p,
respectively. Recall that the PMF for a Bernoulli random variable X is P(X = x) = p^x (1 − p)^(1−x).
The mean of a Bernoulli random variable is p and the variance is p(1 − p). If we let X be a Bernoulli
random variable, it is typical to call X = 1 a “success” and X = 0 a “failure”.
If a random variable follows a Bernoulli distribution with success probability p we write that X ∼
Bernoulli(p).
Bernoulli random variables are commonly used for modeling any binary trait for a random sample.
So, for example, in a random sample whether or not a participant has high blood pressure would be
reasonably modeled as Bernoulli.
Binomial trials
The binomial random variables are obtained as the sum of iid Bernoulli trials. So if a Bernoulli
trial is the result of a coin flip, a binomial random variable is the total number of heads.
To write it out as mathematics, let X_1, …, X_n be iid Bernoulli(p); then X = \sum_{i=1}^n X_i is a binomial
random variable. We write out that X ∼ Binomial(n, p). The binomial mass function is
$$
P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}
$$
for x = 0, …, n. The notation
$$
\binom{n}{x} = \frac{n!}{x!(n - x)!}
$$
(read “n choose x”) counts the number of ways of selecting x items out of n without replacement
disregarding the order of the items. It turns out that n choose 0 is 1 and n choose 1 and n choose
n − 1 are both n.
Example
Suppose a friend has 8 children, 7 of which are girls and none are twins. If each gender has an
independent 50% probability for each birth, what’s the probability of getting 7 or more girls out of
8 births?
$$
\binom{8}{7} .5^7 (1 - .5)^1 + \binom{8}{8} .5^8 (1 - .5)^0 \approx 0.04.
$$
Simulating means of coin flips
> choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
[1] 0.03516
> pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
[1] 0.03516
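A random variable is said to follow a normal or Gaussian distribution with mean µ and variance
σ² if the associated density is
$$
(2 \pi \sigma^2)^{-1/2} e^{-(x - \mu)^2 / 2\sigma^2}.
$$
If X is a RV with this density then E[X] = µ and Var(X) = σ². That is, the normal distribution is
characterized by the mean and variance. We write X ∼ N(µ, σ²) to denote a normal random
variable. When µ = 0 and σ = 1 the resulting distribution is called the standard normal
distribution. Standard normal RVs are often labeled Z.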
Consider an example. If we say that intelligence quotients are normally distributed with a mean of
100 and a standard deviation of 15, then we are saying that if we randomly sample a person from
this population, the probability that they have an IQ of, say, 120 or larger is governed by a normal
distribution with a mean of 100 and a variance of 15².
Taken another way, if we know that the population is normally distributed then to estimate
everything about the population, we need only estimate the population mean and variance.
(Estimated by the sample mean and the sample variance.)
¹http://youtu.be/dUTWvKa0Leo?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
1. Approximately 68%, 95% and 99.7% of the normal density lies within 1, 2 and 3 standard
deviations from the mean, respectively.
2. -1.28, -1.645, -1.96 and -2.33 are the 10th, 5th, 2.5th and 1st percentiles of the standard normal
distribution, respectively.
3. By symmetry, 1.28, 1.645, 1.96 and 2.33 are the 90th, 95th, 97.5th and 99th percentiles of the
standard normal distribution, respectively.
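These reference values do not have to be taken on faith. As a short sketch, base R’s pnorm() and qnorm() reproduce them; the variable names are illustrative.

```r
# Standard normal reference values, computed rather than memorized.
within_sd <- pnorm(1:3) - pnorm(-(1:3))        # mass within 1, 2 and 3 SDs
round(within_sd, 3)

lower_q <- qnorm(c(0.10, 0.05, 0.025, 0.01))   # 10th, 5th, 2.5th, 1st percentiles
round(lower_q, 3)

upper_q <- qnorm(c(0.90, 0.95, 0.975, 0.99))   # mirror images by symmetry
round(upper_q, 3)
```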
If X ∼ N(µ, σ²), then
$$
Z = \frac{X - \mu}{\sigma} \sim N(0, 1).
$$
Conversely, if Z is standard normal, then
$$
X = \mu + \sigma Z \sim N(\mu, \sigma^2).
$$
We can use these facts to answer questions about non-standard normals
by relating them back to the standard normal.
Example
What is the 95th percentile of a N(µ, σ²) distribution? The quick answer in R is qnorm(.95, mean = mu,
sd = sigma). Alternatively, because we have the standard normal quantiles memorized, and we
know that 1.645 is its 95th percentile, the answer has to be µ + 1.645σ.
In general, the answer is µ + σz0, where z0 is the appropriate standard normal quantile.
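To put some context on our previous setting, population mean BMI for men is reported as² 29
kg/m² with a standard deviation of 4.73. Assuming normality of BMI, what is the population
95th percentile? The answer is then:
$$
29 + 4.73 \times 1.645 = 36.78.
$$
Now consider the probability that a randomly drawn subject from this population has a BMI less
than 24.27. Notice that
$$
\frac{24.27 - 29}{4.73} = -1.
$$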
²http://www.ncbi.nlm.nih.gov/pubmed/23675464
Therefore, 24.27 is 1 standard deviation below the mean. We know that about 68% of the density lies
within 1 standard deviation of the mean, so about 16% lies below and about 16% lies above. Thus
about 16% lies below 24.27. Alternatively, pnorm(24.27, 29, 4.73) yields the result.
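Both BMI quantities can be computed directly in R. This is a minimal sketch using the mean and standard deviation quoted in the text; the variable names are assumptions.

```r
# BMI example: N(29, 4.73^2), using the mean and SD quoted in the text.
mu <- 29
sigma <- 4.73

bmi_q95 <- qnorm(0.95, mean = mu, sd = sigma)   # 95th percentile
bmi_p   <- pnorm(24.27, mean = mu, sd = sigma)  # P(BMI < 24.27)
round(c(q95 = bmi_q95, p_below = bmi_p), 3)
```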
Example
Assume that the number of daily ad clicks for a company is (approximately) normally distributed
with a mean of 1020 and a standard deviation of 50. What’s the probability of getting more than
1,160 clicks in a day? Notice that:
1160 − 1020
= 2.8
50
Therefore, 1,160 is 2.8 standard deviations above the mean. We know from our standard normal
quantiles that the probability of being more than 2 standard deviations above the mean is about
2.5% and that 3 standard deviations is far in the tail. Therefore, we know that the probability has to
be smaller than 2.5% and should be very small. We can obtain it exactly as
pnorm(1160, 1020, 50, lower.tail = FALSE), which is 0.3%. Note that we can also obtain the
probability as pnorm(2.8, lower.tail = FALSE).
Example
Consider the previous example again. What number of daily ad clicks would represent the one where
75% of days have fewer clicks (assuming days are independent and identically distributed)? We can
obtain this as:
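A minimal sketch in R uses qnorm() with the mean and standard deviation of the ad click example; the variable name clicks_q75 is illustrative.

```r
# 75th percentile of daily ad clicks under N(1020, 50^2).
clicks_q75 <- qnorm(0.75, mean = 1020, sd = 50)
round(clicks_q75)  # 1054
```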
bus stop. (While these are in principle bounded, it would be hard to actually put an upper limit on it.)
There is also a deep connection between the Poisson distribution and popular models for so-called
event-time data. In addition, the Poisson distribution is the default model for so-called contingency
table data, which is simply tabulations of discrete characteristics. Finally, when n is large and p is
small, the Poisson is an accurate approximation to the binomial distribution.
The Poisson mass function is:
$$
P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}
$$
for x = 0, 1, . . .. The mean of this distribution is λ. The variance of this distribution is also λ. Notice
that x ranges from 0 to ∞. Therefore, the Poisson distribution is especially useful for modeling
unbounded counts.
Example
The number of people that show up at a bus stop is Poisson with a mean of 2.5 per hour. If watching
the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole
time?
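One way to compute this in R is with ppois(), which gives the cumulative Poisson probability. The sketch below assumes only the rate and time window stated in the example; the variable name is illustrative.

```r
# P(X <= 3) when X is Poisson; the hourly rate of 2.5 is scaled to 4 hours.
bus_prob <- ppois(3, lambda = 2.5 * 4)
round(bus_prob, 4)  # 0.0103
```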
Therefore, there is about a 1% chance that 3 or fewer people show up. Notice the multiplication by
four in the function argument. Since lambda is specified as events per hour we have to multiply by
four to consider the number of events that occur in 4 hours.
We flip a coin with success probability 0.01 five hundred times. What’s the probability of 2 or fewer
successes?
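The exact binomial answer and its Poisson approximation can be compared side by side. This is a minimal sketch; the Poisson rate is taken to be n × p, and the variable names are assumptions.

```r
# Exact binomial probability versus its Poisson approximation (lambda = n * p).
n <- 500
p <- 0.01
exact       <- pbinom(2, size = n, prob = p)
pois_approx <- ppois(2, lambda = n * p)
round(c(binomial = exact, poisson = pois_approx), 4)
```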
So we can see that the probabilities agree quite well. This approximation is often done as the Poisson
model is a more convenient model in many respects.
Exercises
1. Your friend claims that changing the font to comic sans will result in more ad revenue on your
web sites. When presented in random order, 9 pages out of 10 had more revenue when the
font was set to comic sans. If it was really a coin flip for these 10 sites, what’s the probability
of getting 9 or 10 out of 10 with more revenue for the new font?
2. A software company is doing an analysis of documentation errors of their products. They
sampled their very large codebase in chunks and found that the number of errors per chunk
was approximately normally distributed with a mean of 11 errors and a standard deviation of
2. When randomly selecting a chunk from their codebase, what’s the probability of fewer than
5 documentation errors?
3. The number of search entries entered at a web site is Poisson at a rate of 9 searches per minute.
The site is monitored for 5 minutes. What is the probability of 40 or fewer searches in that
time frame?
4. Suppose that the number of web hits to a particular site are approximately normally
distributed with a mean of 100 hits per day and a standard deviation of 10 hits per day. What’s
the probability that a given day has fewer than 93 hits per day expressed as a percentage to
the nearest percentage point? Watch a video solution⁴ and see the problem⁵.
5. Suppose that the number of web hits to a particular site are approximately normally
distributed with a mean of 100 hits per day and a standard deviation of 10 hits per day. What
number of web hits per day represents the number so that only 5% of days have more hits?
Watch a video solution⁶ and see the problem and solution⁷.
6. Suppose that the number of web hits to a particular site are approximately normally
distributed with a mean of 100 hits per day and a standard deviation of 10 hits per day. Imagine
taking a random sample of 50 days. What number of web hits would be the point so that only
5% of averages of 50 days of web traffic have more hits? Watch a video solution⁸ and see the
problem and solution⁹.
⁴https://www.youtube.com/watch?v=E-ancc7iTho&index=10&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁵http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#4
⁶https://www.youtube.com/watch?v=rv48_5C8gx4&index=12&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁷http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#6
⁸https://www.youtube.com/watch?v=c_B2AuOhdzg&index=13&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁹http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#7
7. You don’t believe that your friend can discern good wine from cheap. Assuming that you’re
right, consider a blind test where you randomize 6 paired varieties (Merlot, Chianti, …) of cheap
and expensive wines. What is the chance that she gets 5 or 6 right? Watch a video solution¹⁰ and
see the original problem¹¹.
8. The number of web hits to a site is Poisson with mean 16.5 per day. What is the probability of
getting 20 or fewer in 2 days? Watch a video solution¹² and see a written solution¹³.
¹⁰https://www.youtube.com/watch?v=ILm2OUl6p_w&index=14&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹¹http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#8
¹²https://www.youtube.com/watch?v=PMPFbwtpp1k&index=18&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹³http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#12
7. Asymptopia
Asymptotics
Watch this video before beginning.¹
Asymptotics is the term for the behavior of statistics as the sample size tends to infinity. Asymptotics
are incredibly useful for simple statistical inference and approximations. Asymptotics often make
hard problems easy and difficult calculations simple. We will not cover the philosophical
considerations in this book, but it is true nonetheless that asymptotics often lead to a nice understanding
of procedures. In fact, the ideas of asymptotics are so important that they form the basis for the frequency
interpretation of probabilities by considering the long run proportion of times an event occurs.
There are some things to bear in mind about the seemingly magical nature of asymptotics. There’s no free
lunch and, unfortunately, asymptotics generally give no assurances about finite sample performance.
¹http://youtu.be/WRUgUEBIYZY?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
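# Running (cumulative) mean of n iid standard normal draws
n <- 10000
means <- cumsum(rnorm(n))/(1:n)
library(ggplot2)
# Plot the running mean against the number of observations;
# the horizontal line marks the population mean, 0
g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g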
Discussion
An estimator is called consistent if it converges to what you want to estimate. Thus, the LLN says
that the sample mean of an iid sample is consistent for the population mean. Typically, good estimators
are consistent; it’s not too much to ask that if we go to the trouble of collecting an infinite amount
of data that we get the right answer. The sample variance and the sample standard deviation of iid
random variables are consistent as well.
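To make the consistency of the sample variance and standard deviation concrete, here is a minimal sketch. It assumes iid normal data with standard deviation 2 (our choice, not from the text) and tracks the running estimates as the sample grows; the seed and sample size are arbitrary.

```r
# Sketch: running sample variance and sd of iid N(0, 2^2) draws
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 10000
x <- rnorm(n, mean = 0, sd = 2)  # population variance is 4, sd is 2
k <- 2:n                         # the sample variance needs at least 2 points
# Running variance via cumulative sums: (sum x^2 - k * xbar^2) / (k - 1)
running_mean <- cumsum(x)[k] / k
running_var <- (cumsum(x^2)[k] - k * running_mean^2) / (k - 1)
running_sd <- sqrt(running_var)
# Estimates at a few sample sizes
idx <- c(10, 100, 1000, 10000) - 1
round(cbind(n = k[idx], var = running_var[idx], sd = running_sd[idx]), 3)
```

The estimates should wander for small n and settle near 4 and 2 as n grows, which is what consistency promises.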
has a distribution like that of a standard normal for large n. Replacing the standard error by its
estimated value doesn’t change the CLT.
The useful way to think about the CLT is that $\bar X_n$ is approximately $N(\mu, \sigma^2/n)$.
Die rolling
• Simulate a standard normal random variable by rolling n (six sided) dice.
• Let Xi be the outcome for die i.
• Then note that µ = E[Xi ] = 3.5.
• Recall also that $Var(X_i) = 2.92$.
• SE $= \sqrt{2.92/n} = 1.71/\sqrt{n}$.
• Let’s roll $n$ dice, take their mean, subtract off 3.5, and divide by $1.71/\sqrt{n}$, and repeat this over
and over.
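The recipe in the list above can be sketched in R as follows. This is only an illustration: the number of repetitions, the seed and the helper name `standardized_means` are our own choices, and we check the result with a few summary numbers rather than a plot.

```r
# Sketch: standardized means of n fair dice
set.seed(42)          # arbitrary seed
nosim <- 10000        # number of repetitions (an assumed value)
mu <- 3.5             # E[X_i] for a fair six sided die
sigma <- sqrt(35/12)  # sd of one roll; 35/12 is about 2.92
# standardized_means is a hypothetical helper name
standardized_means <- function(n) {
  # Each row holds one experiment of n rolls
  rolls <- matrix(sample(1:6, nosim * n, replace = TRUE), nrow = nosim)
  (rowMeans(rolls) - mu) / (sigma / sqrt(n))
}
z10 <- standardized_means(10)
# Compare with the standard normal: mean near 0, sd near 1,
# and about 95% of the values inside +/- 1.96
c(mean = mean(z10), sd = sd(z10), inside = mean(abs(z10) < qnorm(0.975)))
```

Calling `standardized_means` with other values of n (say 1, 2 or 30) gives a feel for how quickly the normal shape appears.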
²http://youtu.be/FAIyVHmniK0?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
It’s pretty remarkable that the approximation works so well with so few rolls of the die. So, if you’re
stranded on an island, and need to simulate a standard normal without a computer, but you do have
a die, you can get a pretty good approximation with 10 rolls even.
Coin CLT
In fact the oldest application of the CLT is to the idea of flipping coins (by de Moivre)³. Let Xi be
the 0 or 1 result of the ith flip of a possibly unfair coin. The sample proportion, say p̂, is the average
of the coin flips. We know that:
• E[Xi ] = p,
• $Var(X_i) = p(1 - p)$,
• $\sqrt{Var(\hat p)} = \sqrt{p(1 - p)/n}$.
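By the CLT, the standardized sample proportion
$$\frac{\hat p - p}{\sqrt{p(1 - p)/n}}$$
is approximately standard normal for large $n$.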
This convergence doesn’t look quite as good as the die, since the coin has fewer possible outcomes.
In fact, among coins of various degrees of bias, the convergence to normality is governed by how
far from 0.5 p is. Let’s redo the simulation, now using p = 0.9 instead of p = 0.5 like we did before.
Notice that the convergence to normality is quite poor. Thus, be careful when using CLT approxi-
mations for sample proportions when your proportion is very close to 0 or 1.
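One rough way to see the difference between a fair and a heavily biased coin without a plot is to compare the skewness of the standardized sample proportions, since a normal distribution has skewness 0. The sketch below assumes n = 10 flips and 10,000 repetitions; these settings and the helper names are ours.

```r
set.seed(7)     # arbitrary seed
nosim <- 10000  # repetitions (an assumed value)
# standardized_phat is a hypothetical helper name
standardized_phat <- function(n, p) {
  phat <- rbinom(nosim, size = n, prob = p) / n
  (phat - p) / sqrt(p * (1 - p) / n)
}
# Skewness is 0 for a normal; larger absolute values signal asymmetry
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3
n <- 10
sk <- c(fair = skewness(standardized_phat(n, 0.5)),
        biased = skewness(standardized_phat(n, 0.9)))
round(sk, 2)
```

The fair coin should give a value close to 0, while p = 0.9 should give a clearly negative value, matching the poor normal approximation noted above.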
Confidence intervals
Watch this video before beginning.⁴
Confidence intervals are methods for quantifying uncertainty in our estimates. The fact that the
interval has width reflects the randomness that prevents us from getting a perfect
estimate. Let’s go through how a confidence interval using the CLT is constructed.
According to the CLT, the sample mean, $\bar X$, is approximately normal with mean $\mu$ and standard
deviation $\sigma/\sqrt{n}$. Furthermore,
$$\mu + 2\sigma/\sqrt{n}$$
is pretty far out in the tail (only 2.5% of a normal being larger than 2 sds in the tail). Similarly,
⁴http://youtu.be/u85aQ0mtiZ8?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
$$\mu - 2\sigma/\sqrt{n}$$
is pretty far in the left tail (only a 2.5% chance of a normal being smaller than 2 standard deviations
in the tail). So the probability that $\bar X$ is bigger than $\mu + 2\sigma/\sqrt{n}$ or smaller than $\mu - 2\sigma/\sqrt{n}$ is 5%. Or
equivalently, the probability that the limits $\bar X \pm 2\sigma/\sqrt{n}$ contain $\mu$ is 95%. The quantity:
$$\bar X \pm 2\sigma/\sqrt{n}$$
is called a 95% interval for $\mu$. The 95% refers to the fact that if one were to repeatedly get samples of
size $n$, about 95% of the intervals obtained would contain $\mu$. The 97.5th percentile of the standard normal is 1.96 (so I rounded
to 2 above). If instead of a 95% interval, you wanted a 90% interval, then you want (100 - 90) / 2 =
5% in each tail. Thus you replace the 2 with the 95th percentile, which is 1.645.
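The quantiles quoted above, and the interval itself, are easy to check in R. The helper `clt_interval` below is a hypothetical name of ours, and the data are simulated with made-up parameters purely so the sketch runs on its own; the sample standard deviation stands in for σ.

```r
# Standard normal quantiles used above
qnorm(0.975)  # about 1.96, for a 95% interval
qnorm(0.95)   # about 1.645, for a 90% interval

# clt_interval is a hypothetical helper, not a function from any package
clt_interval <- function(x, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  mean(x) + c(-1, 1) * z * sd(x) / sqrt(length(x))
}
set.seed(3)                         # arbitrary seed
x <- rnorm(200, mean = 68, sd = 3)  # simulated heights in inches (made-up numbers)
ci95 <- clt_interval(x)
ci90 <- clt_interval(x, level = 0.90)
round(rbind(ci95, ci90), 2)
```

The 90% interval is narrower than the 95% interval because it uses the smaller quantile.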
Example CI
Give a confidence interval for the average height of sons in Galton’s data.
Here we divided by 12 to get our interval in feet instead of inches. So we estimate the average height
of the sons as 5.71 to 5.74 with 95% confidence.
$$\hat p \pm \frac{1}{\sqrt{n}}.$$
This is useful for doing quick confidence intervals for binomial proportions in your head.
Example
Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for
you. Can you relax? Do you have this race in the bag? Without access to a computer or calculator,
how precise is this estimate?
> 1/sqrt(100)
[1] 0.1
so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66).
Thus, since the interval contains 0.5 and numbers below it, there aren’t enough votes for you to relax;
better go do more campaigning!
The basic rule of thumb is then, $1/\sqrt{n}$ gives you a good estimate for the margin of error of a
proportion. Thus, $n = 100$ for about 1 decimal place, 10,000 for 2, 1,000,000 for 3.
> round(1/sqrt(10^(1:6)), 3)
[1] 0.316 0.100 0.032 0.010 0.003 0.001
We could very easily do the full Wald interval, which is less conservative (may provide a narrower
interval). Remember the Wald interval for a binomial proportion is:
$$\hat p \pm Z_{1-\alpha/2}\sqrt{\frac{\hat p(1 - \hat p)}{n}}.$$
Here’s the R code for our election setting, both coding it directly and using binom.test.
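A minimal sketch of both calculations for the election numbers (56 out of 100) might look like this. The variable names are ours; `binom.test` reports an exact (Clopper/Pearson) interval rather than the Wald interval, so the two rows differ slightly.

```r
phat <- 56 / 100
n <- 100
# Wald interval coded directly
wald <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)
# Exact (Clopper/Pearson) interval reported by binom.test
exact <- binom.test(56, 100)$conf.int
round(rbind(wald, exact), 4)
```

Both intervals contain 0.5, so the conclusion from the back of the envelope calculation does not change.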
For our purposes, we’re using confidence intervals and so will investigate their frequency perfor-
mance over repeated realizations of the experiment. We can do this via simulation. Let’s consider
different values of p and look at the Wald interval’s coverage when we repeatedly create confidence
intervals.
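n <- 20                                # sample size per experiment
pvals <- seq(0.1, 0.9, by = 0.05)      # true success probabilities to try
nosim <- 1000                          # simulated experiments per value of p
coverage <- sapply(pvals, function(p) {
    # simulated sample proportions
    phats <- rbinom(nosim, prob = p, size = n)/n
    # Wald interval limits
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    # proportion of intervals that cover the true p
    mean(ll < p & ul > p)
})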
The figure shows that if we were to repeatedly try experiments for any fixed value of p, it’s rarely
the case that our intervals will cover the value that they’re trying to estimate in 95% of them. This
is bad, since covering the parameter that it’s estimating 95% of the time is the confidence interval’s
only job!
So what’s happening? Recall that the CLT is an approximation. In this case n isn’t large enough for
the CLT to be applicable for many of the values of p. Let’s see if the coverage improves for larger n.
n <- 100
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage2 <- sapply(pvals, function(p) {
phats <- rbinom(nosim, prob = p, size = n)/n
ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
mean(ll < p & ul > p)
})
Now it looks much better. Of course, increasing our sample size is rarely an option. There are exact
fixes to make this interval work better for small sample sizes.
However, a quick fix is to take your data and add two successes and two failures. So, for example,
in our election example, we would form our interval with 58 votes out of 104 sampled (disregarding
that the actual numbers were 56 and 100). This interval is called the Agresti/Coull interval. This
interval has much better coverage. Let’s show it via a simulation.
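Before the simulation, here is a small sketch of the add two successes and two failures calculation for the election numbers. The helper `plus_four_interval` is a hypothetical name of ours. We assume the common textbook form in which the adjusted sample size n + 4 is also used in the standard error; the simulation code in this section divides by n there instead, which gives slightly wider intervals.

```r
# plus_four_interval is a hypothetical helper name used for illustration
plus_four_interval <- function(x, n, level = 0.95) {
  n_adj <- n + 4             # two extra successes and two extra failures
  p_adj <- (x + 2) / n_adj   # adjusted proportion, 58/104 in the election example
  z <- qnorm(1 - (1 - level) / 2)
  p_adj + c(-1, 1) * z * sqrt(p_adj * (1 - p_adj) / n_adj)
}
ac <- plus_four_interval(56, 100)
round(ac, 3)
```

For a sample this large the adjustment barely moves the interval; it matters much more for small n.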
n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
phats <- (rbinom(nosim, prob = p, size = n) + 2)/(n + 4)
ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
mean(ll < p & ul > p)
})
The coverage is better, if maybe a little conservative in the sense of being over the 95% line most of
the time. If the interval is too conservative, it’s likely a little too wide. To see this clearly, imagine
if we made our interval −∞ to ∞. Then we would always have 100% coverage in any setting, but
the interval wouldn’t be useful. Nonetheless, the Agresti/Coull interval gives a much better trade-off
between coverage and width than the Wald interval.
In general, one should use the add two successes and failures method for binomial confidence
intervals with smaller n. For very small n consider using an exact interval (not covered in this
class).
Poisson interval
Since the Poisson distribution is so central for data science, let’s do a Poisson confidence interval.
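Remember that if $X \sim \mathrm{Poisson}(\lambda t)$ then our estimate of $\lambda$ is $\hat\lambda = X/t$. Furthermore, we know that
$Var(\hat\lambda) = \lambda/t$ and so the natural estimate is $\hat\lambda/t$. While it’s not immediate how the CLT applies in
this case, the interval is of the familiar form
$$\hat\lambda \pm Z_{1-\alpha/2}\sqrt{\hat\lambda/t}.$$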
Example
A nuclear pump failed 5 times out of 94.32 days. Give a 95% confidence interval for the failure rate
per day.
> x <- 5
> t <- 94.32
> lambda <- x/t
> round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
[1] 0.007 0.099
A non-asymptotic interval, one that guarantees coverage, is also available. However, it has to be evaluated
numerically.
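In R, one way to get such an interval is `poisson.test`, whose `T` argument is the monitoring time. The sketch below reuses the pump numbers and prints the asymptotic interval next to it for comparison.

```r
x <- 5
t <- 94.32
# Exact interval for the rate per day; T is the monitoring time
exact <- poisson.test(x, T = t)$conf.int
# Asymptotic interval from above, for comparison
lambda <- x / t
asymptotic <- lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda / t)
round(rbind(exact, asymptotic), 3)
```

The exact interval is shifted to the right and is wider than the asymptotic one, which is typical when the count is small.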
Code for evaluating the coverage of the asymptotic Poisson confidence interval
The coverage can be low for low values of lambda. In this case the asymptotics works as we increase
the monitoring time, t. Here’s the coverage if we increase t to 1,000.
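A coverage simulation along these lines can be sketched as follows. The grid of rates, the number of simulations, the seed, the monitoring time of 100 for the first run and the helper name `poisson_coverage` are all assumptions of ours, chosen to mirror the binomial simulations earlier in the chapter.

```r
set.seed(11)  # arbitrary seed
# poisson_coverage is a hypothetical helper; nosim is an assumed setting
poisson_coverage <- function(lambda, t, nosim = 1000) {
  lhats <- rpois(nosim, lambda = lambda * t) / t
  ll <- lhats - qnorm(0.975) * sqrt(lhats / t)
  ul <- lhats + qnorm(0.975) * sqrt(lhats / t)
  mean(ll < lambda & ul > lambda)   # share of intervals covering lambda
}
lambdavals <- seq(0.005, 0.1, by = 0.01)
coverage_t100 <- sapply(lambdavals, poisson_coverage, t = 100)
coverage_t1000 <- sapply(lambdavals, poisson_coverage, t = 1000)
round(rbind(lambda = lambdavals, t100 = coverage_t100, t1000 = coverage_t1000), 3)
```

With the shorter monitoring time the smallest rates are covered far less often than 95% of the time; the longer monitoring time pulls coverage back toward the nominal level.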
Summary notes
• The LLN states that averages of iid samples converge to the population means that they are
estimating.
• The CLT states that averages are approximately normal, with distributions
– centered at the population mean.
– with standard deviation equal to the standard error of the mean.
– CLT gives no guarantee that $n$ is large enough.
• Taking the mean and adding and subtracting the relevant normal quantile times the SE yields
a confidence interval for the mean.
Exercises
1. I simulate 1,000,000 standard normals. The LLN says that their sample average must be close
to?
2. About what is the probability of getting 45 or fewer heads out of 100 flips of a fair coin? (Use the
CLT, not the exact binomial calculation).
3. Consider the father.son data. Using the CLT and assuming that the fathers are a random
sample from a population of interest, what is a 95% confidence interval for the mean height in inches?
4. The goal of a confidence interval having coverage 95% is to imply that:
• If one were to repeatedly collect samples and reconstruct the intervals, around 95%
of them would contain the true mean being estimated.
• The probability that the sample mean is in the interval is 95%.
5. The rate of search entries into a web site was 10 per minute when monitoring for an hour. Use
R to calculate the exact Poisson interval for the rate of events per minute.
6. Consider a uniform distribution. Suppose we sample 100 draws from a uniform distribution
(which has mean 0.5, and variance 1/12) and take their mean, X̄. What is the approximate
probability of getting a mean as large as 0.51 or larger? Watch this video solution⁵ and see the problem
and solution here⁶.
⁵https://www.youtube.com/watch?v=JsiLK0g3IZ4&index=15&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁶http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#9
8. t Confidence intervals
Small sample confidence intervals
Watch this video before beginning.¹
In the previous lecture, we discussed creating a confidence interval using the CLT. Our intervals
took the form:
$$\mathrm{Est} \pm Z \times SE_{\mathrm{Est}}.$$
In this lecture, we discuss some methods for small samples, notably Gosset’s t distribution and t
confidence intervals.
These intervals are of the form:
$$\mathrm{Est} \pm t \times SE_{\mathrm{Est}}.$$
So the only change is that we’ve replaced the Z quantile now with a t quantile. These are some of
the handiest of intervals in all of statistics. If you want a rule for choosing between a t interval and a
normal interval, just always use the t interval.
Gosset’s t distribution
The t distribution was invented by William Gosset (under the pseudonym “Student”) in 1908. Fisher
provided further mathematical details about the distribution later. This distribution has thicker tails
than the normal. It’s indexed by a degrees of freedom parameter and it gets more like a standard normal as the
degrees of freedom get larger. It assumes that the underlying data are iid Gaussian with the result
that
$$\frac{\bar X - \mu}{S/\sqrt{n}}$$
follows Gosset’s t distribution with $n - 1$ degrees of freedom. (If we replaced $S$ by $\sigma$ the statistic
would be exactly standard normal.) The interval is
$$\bar X \pm t_{n-1} S/\sqrt{n},$$
where $t_{n-1}$ is the relevant quantile from the t distribution.
¹http://youtu.be/pHXrDMjzyYg?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
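library(ggplot2)
library(manipulate)  # provides manipulate() and slider(); runs inside RStudio
k <- 1000
xvals <- seq(-5, 5, length = k)
# Overlay the standard normal density and the t density with df degrees of freedom
myplot <- function(df){
  d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)),
                  x = xvals,
                  dist = factor(rep(c("Normal", "T"), c(k,k))))
  g <- ggplot(d, aes(x = x, y = y))
  g <- g + geom_line(size = 2, aes(color = dist))
  g
}
# The slider controls the degrees of freedom passed to myplot
manipulate(myplot(mu), mu = slider(1, 20, step = 1))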
The difference is perhaps easier to see in the tails. Therefore, the following code plots the upper
quantiles of the Z distribution by those of the t distribution.
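A simplified sketch of that comparison, using base R graphics instead of ggplot2 and manipulate, could look like the following. The helper name `quantile_table`, the grid of probabilities and the two degrees of freedom shown are our own choices.

```r
# quantile_table is a hypothetical helper name
pvals <- seq(0.5, 0.99, by = 0.01)
quantile_table <- function(df) {
  data.frame(p = pvals, z = qnorm(pvals), t = qt(pvals, df))
}
q2 <- quantile_table(2)
q20 <- quantile_table(20)
# Upper 97.5% quantiles: normal versus t
c(normal = qnorm(0.975), t_df2 = qt(0.975, 2), t_df20 = qt(0.975, 20))
# Plot t quantiles against normal quantiles when run interactively;
# the diagonal is the line where the two would agree
if (interactive()) {
  plot(q2$z, q2$t, type = "l", xlab = "Normal quantile", ylab = "t quantile")
  lines(q20$z, q20$t, lty = 2)
  abline(0, 1, col = "lightblue")
}
```

The t quantiles sit above the diagonal, far above it for 2 degrees of freedom and only slightly above it for 20, which is why t intervals are wider than normal intervals.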
Summary notes
In this section, we give an overview of important facts about the t distribution.
• The t interval technically assumes that the data are iid normal, though it is robust to this
assumption.
• It works well whenever the distribution of the data is roughly symmetric and mound shaped.
• Paired observations are often analyzed using the t interval by taking differences.
• For large degrees of freedom, t quantiles become the same as standard normal quantiles;
therefore this interval converges to the same interval as the CLT yielded.
• For skewed distributions, the spirit of the t interval assumptions is violated.
– Also, for skewed distributions, it doesn’t make a lot of sense to center the interval at the
mean.
– In this case, consider taking logs or using a different summary like the median.
• For highly discrete data, like binary, other intervals are available.
The data
> data(sleep)
> head(sleep)
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
Here’s a plot of the data. In this plot paired observations are connected with a line.
²http://youtu.be/2L41xqPvPso?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Now let’s calculate the t interval for the differences from baseline to follow up. Below we give four
different ways for calculating the interval.
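Here is a sketch of three such calculations using the built in sleep data. A fourth route, the formula interface of `t.test` with `paired = TRUE`, is left out because some recent versions of R no longer accept it. The variable names are ours.

```r
data(sleep)
g1 <- sleep$extra[1:10]    # first group, in ID order
g2 <- sleep$extra[11:20]   # second group, same subjects
difference <- g2 - g1
n <- length(difference)
# 1. Directly from the formula
by_hand <- mean(difference) + c(-1, 1) * qt(0.975, n - 1) * sd(difference) / sqrt(n)
# 2. One sample t.test on the differences
from_diff <- t.test(difference)$conf.int
# 3. Paired t.test on the two vectors
from_paired <- t.test(g2, g1, paired = TRUE)$conf.int
round(rbind(by_hand, from_diff, from_paired), 4)
```

All three rows should agree, since each is the same one sample t interval applied to the paired differences.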
Our 95% confidence interval estimate for the mean change (follow up - baseline) is 0.70 to 2.46.
Since the interval doesn’t include 0, it suggests a real change from baseline.
Confidence interval
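A $(1 - \alpha) \times 100\%$ confidence interval for the mean difference between the groups, $\mu_y - \mu_x$, is:
$$\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2} S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}.$$
The notation $t_{n_x + n_y - 2, 1 - \alpha/2}$ means a t quantile with $n_x + n_y - 2$ degrees of freedom. The pooled
variance estimator is:
$$S_p^2 = \frac{(n_x - 1) S_x^2 + (n_y - 1) S_y^2}{n_x + n_y - 2}.$$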
This variance estimate is used if one is willing to assume a constant variance across the groups. It is
a weighted average of the group-specific variances, with greater weight given to whichever group
has the larger sample size.
If there is some doubt about the constant variance assumption, assume a different variance per
group, which we will discuss later.
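To see the pooled interval in action, the sketch below mistakenly treats the two sleep groups as if they were independent, purely for illustration, and then compares the result with the correct paired interval. The variable names are ours.

```r
data(sleep)
g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
n1 <- length(g1)
n2 <- length(g2)
# Pooled standard deviation and standard error of the mean difference
sp <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1 / n1 + 1 / n2)
intervals <- rbind(
  by_hand = md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd,
  pooled = t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf.int,
  paired = t.test(g2, g1, paired = TRUE)$conf.int
)
round(intervals, 4)
```

The first two rows are the same pooled interval computed two ways; the last row is the paired interval.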
[,1] [,2]
[1,] -0.2039 3.364
[2,] -0.2039 3.364
[3,] 0.7001 2.460
Notice that the paired interval (the last row) is entirely above zero. The grouped interval (first two
rows) contains zero. Thus, acknowledging the pairing explains variation that would otherwise be
absorbed into the variation for the group means. As a result, treating the groups as independent
results in wider intervals. Even if it didn’t result in a shorter interval, the paired interval would be
correct as the groups are not statistically independent!
ChickWeight data in R
Now let’s try an example with actual independent groups. Load in the ChickWeight data in R. We
are also going to manipulate the dataset to have a total weight gain variable using dplyr.
Violin plots of chickweight data by weight gain (final minus baseline) by diet.
Now let’s do a t interval comparing groups 1 and 4. We’ll show the two intervals, one assuming that
the variances are equal and one assuming otherwise.
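A base R sketch of this comparison is given below. It is not the dplyr based code referred to above: we assume that weight gain means the day 21 weight minus the day 0 weight, and that chicks without a day 21 measurement are dropped. The intermediate names (`cw`, `gain1`, `gain4` and so on) are ours.

```r
data(ChickWeight)
# Plain data frame copy so that base R tools behave predictably
cw <- data.frame(Chick = as.character(ChickWeight$Chick),
                 Diet = as.character(ChickWeight$Diet),
                 Time = ChickWeight$Time,
                 weight = ChickWeight$weight)
start <- cw[cw$Time == 0, c("Chick", "Diet", "weight")]
end <- cw[cw$Time == 21, c("Chick", "weight")]
# Chicks without a day 21 measurement drop out of the merge
gain <- merge(start, end, by = "Chick", suffixes = c("_start", "_end"))
gain$gain <- gain$weight_end - gain$weight_start
gain1 <- gain$gain[gain$Diet == "1"]
gain4 <- gain$gain[gain$Diet == "4"]
# Diet 1 minus diet 4, with and without the equal variance assumption
intervals <- rbind(
  equal_var = t.test(gain1, gain4, var.equal = TRUE)$conf.int,
  unequal_var = t.test(gain1, gain4, var.equal = FALSE)$conf.int
)
signif(intervals, 4)
```

Passing the two vectors directly to `t.test` avoids the formula interface, so the sketch does not depend on how a particular R version handles extra arguments there.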
[,1] [,2]
[1,] -108.1 -14.81
[2,] -104.7 -18.30
For the time being, let’s interpret the equal variance interval. Since the interval is entirely below
zero, it suggests that group 1 had less weight gain than group 4 (at 95% confidence).
Unequal variances
Watch this video before beginning.⁴
Under unequal variances our t interval becomes:
$$\bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}$$
which will be approximately a 95% interval. This works really well. So when in doubt, just assume
unequal variances. Also, we present the formula for completeness. In practice, it’s easy to mess up,
so make sure to do t.test.
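For completeness, here is a sketch of the calculation by hand, checked against `t.test`. We assume the degrees of freedom are given by the Welch (Satterthwaite) approximation, which is what `t.test` uses when `var.equal = FALSE`. The helper name `welch_interval` and the simulated data are made up for illustration.

```r
# welch_interval is a hypothetical helper written for illustration
welch_interval <- function(x, y, level = 0.95) {
  vx <- var(x) / length(x)
  vy <- var(y) / length(y)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  tq <- qt(1 - (1 - level) / 2, df)
  mean(y) - mean(x) + c(-1, 1) * tq * sqrt(vx + vy)
}
set.seed(5)                        # arbitrary seed
x <- rnorm(12, mean = 10, sd = 1)  # made-up data with unequal spreads
y <- rnorm(20, mean = 11, sd = 3)
by_hand <- welch_interval(x, y)
from_t_test <- t.test(y, x, var.equal = FALSE)$conf.int
round(rbind(by_hand, from_t_test), 4)
```

The two rows should match, which is a useful check before trusting a hand coded version of the formula.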
Referring back to the previous ChickWeight example, the violin plots suggest that considering
unequal variances would be wise. Recall the code is
⁴https://www.youtube.com/watch?v=CVDdbR4VuOE&list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ&index=24
This interval remains entirely below zero. Judging by the two intervals shown earlier (-104.7 to -18.30 versus -108.1 to -14.81), it is somewhat narrower than the equal variance interval.
Summary notes
• The t distribution is useful for small sample size comparisons.
• It technically assumes normality, but is robust to this assumption within limits.
• The t distribution gives rise to t confidence intervals (and tests, which we will see later).
For other kinds of data, there are preferable small and large sample intervals and tests.
Exercises
1. For iid Gaussian data, the statistic $\frac{\bar X - \mu}{s/\sqrt{n}}$ must follow a:
• Z distribution
• t distribution
2. Paired differences T confidence intervals are useful when:
• Pairs of observations are linked, such as when there is subject level matching or in a
study with baseline and follow up measurements on all participants.
• When there was randomization of a treatment between two independent groups.
3. The assumption that the variances are equal for the independent group T interval means that:
• The sample variances have to be nearly identical.
• The population variances are identical, but the sample variances may be different.
4. Load the data set mtcars in the datasets R package. Calculate a 95% confidence interval to
the nearest MPG for the variable mpg. Watch a video solution⁵ and see written solutions⁶.
⁵https://www.youtube.com/watch?v=5BPY6JqRLbE&index=19&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
⁶http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#3
5. Suppose that the standard deviation of 9 paired differences is 1. What value would the average
difference have to be so that the lower endpoint of a 95% Student’s t confidence interval touches
zero? Watch a video solution here⁷ and see the text here⁸.
6. An independent group Student’s T interval is used instead of a paired T interval when:
• The observations are paired between the groups.
• The observations between the groups are naturally assumed to be statistically indepen-
dent.
• As long as you do it correctly, either is fine.
• More details are needed to answer this question. Watch a discussion of this problem⁹ and
see the text.¹⁰
7. Consider the mtcars dataset. Construct a 95% T interval for MPG comparing 4 to 6 cylinder
cars (subtracting in the order of 4 - 6), assuming a constant variance. Watch a video solution¹¹
and see the text¹².
8. If someone put a gun to your head and said “Your confidence interval must contain what it’s
estimating or I’ll pull the trigger”, what would be the smart thing to do?
• Make your interval as wide as possible.
• Make your interval as small as possible.
• Call the authorities. Watch the video solution¹³ and see the text.¹⁴
9. Refer back to comparing MPG for 4 versus 6 cylinders (question 7). What do you conclude?
• The interval is above zero, suggesting 6 is better than 4 in the terms of MPG.
• The interval is above zero, suggesting 4 is better than 6 in the terms of MPG.
• The interval does not tell you anything about the hypothesis test; you have to do the
test.
• The interval contains 0 suggesting no difference. Watch a video solution¹⁵ and see the
text.¹⁶
10. Suppose that 18 obese subjects were randomized, 9 each, to a new diet pill and a placebo.
Subjects’ body mass indices (BMIs) were measured at a baseline and again after having
received the treatment or placebo for four weeks. The average difference from follow-up to
the baseline (followup - baseline) was 3 kg/m² for the treated group and 1 kg/m² for the
placebo group. The corresponding standard deviations of the differences were 1.5 kg/m² for
the treatment group and 1.8 kg/m² for the placebo group. The study aims to answer whether
the change in BMI over the four week period appears to differ between the treated and placebo
⁷https://www.youtube.com/watch?v=ioDwUPCy508&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=20
⁸http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#4
⁹https://www.youtube.com/watch?v=zJWJljxJ7Zk&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=21
¹⁰http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#5
¹¹https://www.youtube.com/watch?v=QfuMgsUlu_w&index=23&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹²http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#6
¹³https://www.youtube.com/watch?v=8zM1RV4Rb7A&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=24
¹⁴http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#7
¹⁵https://www.youtube.com/watch?v=zUVoueHLPdo&index=25&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁶http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#8
groups. What is the pooled variance estimate? Watch a video solution here¹⁷ and see the text
here.¹⁸
¹⁷https://www.youtube.com/watch?v=kzRzrrDWTRQ&index=26&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁸http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#9
9. Hypothesis testing
Hypothesis testing is concerned with making decisions using data.
Hypothesis testing
Watch this video before beginning.¹
To make decisions using data, we need to characterize the kinds of conclusions we can make.
Classical hypothesis testing is concerned with deciding between two hypotheses (things get much
harder if there are more than two). First, a null hypothesis is specified that represents the status
quo. This hypothesis is usually labeled H0. This is what we assume by default. The alternative or
research hypothesis is what we require evidence to conclude. This hypothesis is usually labeled Ha
or sometimes H1 (or some number other than 0).
So to reiterate, the null hypothesis is assumed true and statistical evidence is required to reject it in
favor of a research or alternative hypothesis.
Example
A respiratory disturbance index (RDI) of more than 30 events / hour, say, is considered evidence of
severe sleep disordered breathing (SDB). Suppose that in a sample of 100 overweight subjects with
other risk factors for sleep disordered breathing at a sleep clinic, the mean RDI was 32 events / hour
with a standard deviation of 10 events / hour.
We might want to test the hypothesis that
H0 : µ = 30
Ha : µ > 30
where µ is the population mean RDI. Clearly, somehow we must figure out a way to decide between
these hypotheses using the observed data, particularly the sample mean.
Before we go through the specifics, let’s set up the central ideas.
¹http://youtu.be/Wqvx6_12ZMs?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
We will perform hypothesis testing by forcing the probability of a Type I error to be small. This
approach has consequences, which we can discuss with an analogy to court cases.
of a Type I error, labeled α, is 0.05 (or some other relevant constant). To reiterate, α = Type I error
rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct.
Let’s see if we can figure out what C has to be. The standard error of the mean is $10/\sqrt{100} = 1$.
Furthermore, under H0 we know that $\bar X \sim N(30, 1)$ (at least approximately) via the CLT. We want
to choose C so that:
$$P(\bar X > C; H_0) = 0.05.$$
The 95th percentile of a normal distribution is 1.645 standard deviations from the mean. So, if C
is set 1.645 standard deviations from the mean, we should be set since the probability of getting a
sample mean that large is only 5%. The 95th percentile from a N (30, 1) is:
C = 30 + 1 × 1.645 = 31.645.
So the rule “Reject H0 when X̄ ≥ 31.645” has the property that the probability of rejection is 5%
when H0 is true.
In general, however, we don’t convert C back to the original scale. Instead, we calculate how many
standard errors the observed mean is from the hypothesized mean
$$Z = \frac{\bar X - \mu_0}{s/\sqrt{n}}.$$
This is called a Z-score. We can compare this statistic to standard normal quantiles.
To reiterate, the Z-score is how many standard errors the sample mean is above the hypothesized
mean. In our example:
$$\frac{32 - 30}{10/\sqrt{100}} = 2$$
Since 2 is greater than 1.645 we would reject. Setting the rule “We reject if our Z-score is larger than
1.645” controls the Type I error rate at 5%. We could write out a general rule for this alternative
hypothesis as reject whenever $\sqrt{n}(\bar X - \mu_0)/s > Z_{1-\alpha}$ where $\alpha$ is the desired Type I error rate.
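The numbers in the RDI example can be reproduced with a few lines of R. This is just a sketch of the arithmetic above; the variable names are ours.

```r
xbar <- 32; mu0 <- 30; s <- 10; n <- 100
se <- s / sqrt(n)   # standard error of the mean, 1 here
# Cutoff on the original scale: 95th percentile of N(30, 1)
cutoff_c <- qnorm(0.95, mean = mu0, sd = se)
# Z-score of the observed mean
z <- (xbar - mu0) / se
c(C = cutoff_c, z = z, z_cutoff = qnorm(0.95))
z > qnorm(0.95)    # TRUE, so H0 is rejected at the 5% level
xbar >= cutoff_c   # the same decision on the original scale
```

Working on the Z scale or on the original scale gives the same decision; the Z scale is simply more convenient because the cutoff does not depend on the units of the data.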
Because the Type I error rate was controlled to be small, if we reject we know that one of the
following occurred:
The third option can be difficult to check and at some level all bets are off depending on how
wrong we are about our basic assumptions. So for this course, we speak of our conclusions under
the assumption that our modeling choices (such as the iid sampling model) are correct, but do so
wide eyed acknowledging the limitations of our approach.
General rules
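We developed our test for one possible alternative. Here are some general rules for the three most
important alternatives.
Consider the Z test for $H_0 : \mu = \mu_0$ versus: $H_1 : \mu < \mu_0$, $H_2 : \mu \neq \mu_0$, $H_3 : \mu > \mu_0$. Our test
statistic is
$$TS = \frac{\bar X - \mu_0}{S/\sqrt{n}}.$$
We reject the null hypothesis when:
$H_1 : TS \leq Z_\alpha = -Z_{1-\alpha}$
$H_2 : |TS| \geq Z_{1-\alpha/2}$
$H_3 : TS \geq Z_{1-\alpha}$,
respectively.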
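These rejection rules translate directly into code. The function below is a hypothetical helper of ours, with the three alternatives named after the usual R convention ("less", "two.sided", "greater"); it takes an already computed test statistic.

```r
# z_test_rejects is a hypothetical helper, not part of base R
z_test_rejects <- function(ts, alpha = 0.05,
                           alternative = c("less", "two.sided", "greater")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         less = ts <= qnorm(alpha),                    # mu < mu0
         two.sided = abs(ts) >= qnorm(1 - alpha / 2),  # mu != mu0
         greater = ts >= qnorm(1 - alpha))             # mu > mu0
}
ts <- 2  # test statistic from the RDI example
decisions <- sapply(c("less", "two.sided", "greater"),
                    function(a) z_test_rejects(ts, alternative = a))
decisions
```

For a test statistic of 2, the "greater" and "two.sided" alternatives reject at the 5% level while "less" does not.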
Summary notes
• We have fixed α to be low, so if we reject H0, either our model is wrong or there is a low
probability that we have made an error.
• We have not fixed the probability of a type II error, β; therefore we tend to say “Fail to reject
H0 ” rather than accepting H0 .
• Statistical significance is not the same as scientific significance.
• The region of TS values for which you reject H0 is called the rejection region.
• The Z test requires the assumptions of the CLT and for n to be large enough for it to apply.
• If n is small, then Gosset’s t test is performed exactly in the same way, with the normal
quantiles replaced by the appropriate Student’s t quantiles and n − 1 df.
• The probability of rejecting the null hypothesis when it is false is called power.
• Power is used a lot to calculate sample sizes for experiments.
Example reconsidered
Watch this video before beginning.³
Consider our example again. Suppose that n = 16 (rather than 100). The statistic
$$\frac{\bar X - 30}{s/\sqrt{16}}$$
follows a t distribution with 15 degrees of freedom under H0.
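Assuming the same sample mean (32) and standard deviation (10) as before, which is our reading of the example, the small sample version of the test can be sketched as:

```r
xbar <- 32; mu0 <- 30; s <- 10; n <- 16
ts <- (xbar - mu0) / (s / sqrt(n))   # 0.8
cutoff <- qt(0.95, df = n - 1)       # 95th percentile of t with 15 df, about 1.753
c(statistic = ts, cutoff = cutoff)
ts > cutoff   # FALSE: we fail to reject at the 5% level
```

With only 16 subjects the standard error is larger and the t cutoff is further out than 1.645, so the same observed mean no longer leads to rejection.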
T test in R
Let’s try the t test on the pairs of fathers and sons in Galton’s data.
³http://youtu.be/5iMMBTlOFTI?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Thus if we reject under 7 or 8 girls, we will have a less than 5% chance of rejecting under the null
hypothesis.
It’s impossible to get an exact 5% level test for this case due to the discreteness of the binomial.
The closest is the rejection region [7 : 8]. Further note that an alpha level lower than 0.0039 is not
attainable. For larger sample sizes, we could do a normal approximation.
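The attainable error rates can be tabulated directly. The sketch assumes the setting implied by the numbers above: 8 children, a null hypothesis of p = 0.5 and a one sided alternative of p > 0.5 (note that 0.5^8 is about 0.0039).

```r
n <- 8; p0 <- 0.5
# Type I error rate of the rejection region [k : 8], i.e. P(X >= k) under H0
k <- 0:8
type1 <- pbinom(k - 1, size = n, prob = p0, lower.tail = FALSE)
round(data.frame(rejection_region_start = k, type1_error = type1), 4)
```

The region [7 : 8] is the largest one whose error rate stays below 5%; adding 6 to the region pushes it well above.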
Extending this test to a two sided test isn’t obvious. Given a way to do two sided tests, we could take
the set of values of p0 for which we fail to reject to get an exact binomial confidence interval (called
the Clopper/Pearson interval, by the way). We’ll cover two sided versions of this test when we cover
P-values.
Exercises
1. Which hypothesis is typically assumed to be true in hypothesis testing?
• The null.
• The alternative.
2. The type I error rate controls what?
3. Load the data set mtcars in the datasets R package. Assume that the data set mtcars is a
random sample. Compute the mean MPG, x̄, of this sample. You want to test whether the true
MPG is µ0 or smaller using a one sided 5% level test (H0 : µ = µ0 versus Ha : µ < µ0 ).
Based on the mean MPG of the sample, x̄, and using a Z test, what is the smallest value of µ0
that you would reject for (to two decimal places)? Watch
a video solution here⁴ and see the text here⁵.
4. Consider again the mtcars dataset. Use a two group t-test to test the hypothesis that the 4 and
6 cyl cars have the same mpg. Use a two sided test with unequal variances. Do you reject?
Watch the video here⁶ and see the text here⁷
⁴https://www.youtube.com/watch?v=gReR0uxLnIA&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=27
⁵http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#3
⁶https://www.youtube.com/watch?v=Zo5TirzS9rU&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=28
⁷http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#4
5. A sample of 100 men yielded an average PSA level of 3.0 with a sd of 1.1. What is the complete
set of values of µ0 for which a 5% two sided Z test of H0 : µ = µ0 would fail to reject the null
hypothesis? Watch the video solution⁸ and see the text⁹.
6. You believe the coin that you’re flipping is biased towards heads. You get 55 heads out of 100
flips. Do you reject at the 5% level that the coin is fair? Watch a video solution¹⁰ and see the
text¹¹.
7. Suppose that in an AB test, one advertising scheme led to an average of 10 purchases per day
for a sample of 100 days, while the other led to 11 purchases per day, also for a sample of
100 days. Assume a common standard deviation of 4 purchases per day. Assuming that the
groups are independent and that the days are iid, perform a Z test of equivalence. Do you
reject at the 5% level? Watch a video solution¹² and see the text.¹³
8. A confidence interval for the mean contains:
• All of the values of the hypothesized mean for which we would fail to reject with α =
1 − Conf.Level.
• All of the values of the hypothesized mean for which we would fail to reject with 2α =
1 − Conf.Level.
• All of the values of the hypothesized mean for which we would reject with α = 1 −
Conf.Level.
• All of the values of the hypothesized mean for which we would reject with 2α = 1 −
Conf.Level.
Watch a video solution¹⁴ and see the text¹⁵.
9. In a court of law, all things being equal, if via policy you require a lower standard of evidence
to convict people then
• Fewer guilty people will be convicted.
• More innocent people will be convicted.
• More innocent people will not be convicted.
Watch a video solution¹⁶ and see the text¹⁷.
⁸https://www.youtube.com/watch?v=TooyEaVgLZc&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=29
⁹http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#5
¹⁰https://www.youtube.com/watch?v=0sqOErsfhqo&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=30
¹¹http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#6
¹²https://www.youtube.com/watch?v=Or4ly4rOiaA&index=32&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹³http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#8
¹⁴https://www.youtube.com/watch?v=UiNV1mXQGLs&index=33&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁵http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#9
¹⁶https://www.youtube.com/watch?v=GlKPG24bZMI&index=36&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L
¹⁷http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#12
10. P-values
Introduction to P-values
Watch this video before beginning.¹
P-values are the most common measure of statistical significance. Their ubiquity, along with concern
over their interpretation and use, makes them controversial among statisticians. The following
manuscripts are interesting reads about P-values.
• http://warnercnr.colostate.edu/~anderson/thompson1.html²
• Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall³
• Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman⁴
• The hilariously titled: The Earth is Round (p < .05) by Cohen.⁵
• Some positive comments
– simply statistics⁶
– normal deviate⁷
– Error statistics⁸
What is a P-value?
The central idea of a P-value is to assume that the null hypothesis is true and calculate how unusual
it would be to see data (in the form of a test statistic) as extreme as was seen in favor of the alternative
hypothesis. The formal definition is:
A P-value is the probability of observing a test statistic as or more extreme in favor of the alternative
than was actually obtained, where the probability is calculated assuming that the null hypothesis is
true.
A P-value then requires a few steps.
1. Decide on a statistic that evaluates support of the null or alternative hypothesis.
2. Decide on a distribution of that statistic under the null hypothesis (null distribution).
3. Calculate the probability of obtaining a statistic as or more extreme than was observed
using the distribution in 2.
¹http://youtu.be/Ky68x_7iK6c?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
²http://warnercnr.colostate.edu/~anderson/thompson1.html
³http://www.crcpress.com/product/isbn/9780412044113
⁴https://scholar.google.com/scholar?q=towards+evidence+based+medical+statistics+the+p-value+fallacy&hl=en&as_sdt=0&as_vis=1&oi=scholart&sa=X&ei=uOTjVNHdG4anggSMlYOwBQ&ved=0CBsQgQMwAA
⁵http://www.scopus.com/record/display.url?eid=2-s2.0-0039802908&origin=inward&txGid=BBE363C58BE8785BFF9E71AB60004733.ZmAySxCHIBxxTXbnsoe5w%3a2
⁶http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/
⁷http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/
⁸http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/
The way to interpret P-values is as follows. If the P-value is small, then either H0 is true and we
have observed a rare event or H0 is false (or possibly the null model is incorrect).
Let’s do a quick example. Suppose that you get a t statistic of 2.5 for 15 degrees of freedom testing
H0 : µ = µ0 versus Ha : µ > µ0 . What’s the probability of getting a t statistic as large as 2.5?
P-value calculation in R.
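As a minimal sketch, the upper tail probability of the t distribution can be obtained with `pt`; the variable names here are only illustrative.

```r
# Upper tail probability of a t statistic of 2.5 with 15 degrees of freedom
tStat <- 2.5
degFree <- 15
pValue <- pt(tStat, degFree, lower.tail = FALSE)
pValue  # approximately 0.0123
```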
Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained
under H0 is 0.0123. So, (assuming our model is correct) either we observed data that was pretty
unlikely under the null, or the null hypothesis is false.
This calculation is a P-value where the statistic is the number of girls and the null distribution is a
fair coin flip for each gender. We want to test H0 : p = 0.5 versus Ha : p > 0.5, where p is the
probability of having a girl for each birth.
Recall here’s the calculation:
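The counts are not restated in this passage, so the sketch below assumes 7 girls out of 8 births purely for illustration; only the hypotheses H0 : p = 0.5 versus Ha : p > 0.5 come from the text.

```r
# ASSUMPTION: 7 girls out of 8 births (illustrative counts, not given in this passage)
girls <- 7
births <- 8
# P(X >= girls) under H0: p = 0.5; pbinom returns P(X <= q), so use q = girls - 1
pOneSided <- pbinom(girls - 1, size = births, prob = 0.5, lower.tail = FALSE)
# Doubling the one sided P-value gives the two sided version discussed in the text
pTwoSided <- 2 * pOneSided
c(oneSided = pOneSided, twoSided = pTwoSided)
```

Under these assumed counts the one sided value falls below 0.05 while the doubled value does not, which is the pattern the text describes.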
Since our P-value is less than 0.05 we would reject at a 5% error rate. Note, however, if we were
doing a two sided test, we would have to double the P-value and thus would then fail to reject.
Poisson example
Watch this video before beginning.⁹
Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1)
during the last monitoring period. Assume that an infection rate of 0.05 is an important benchmark.
Given a Poisson model, could the observed rate being larger than 0.05 be attributed to chance? We
want to test H0 : λ = 0.05 where λ is the rate of infections per person day so that 5 would be the
rate per 100 days. Thus we want to know if seeing 10 or more events (that is, more than 9) per 100
person/days is unusual with respect to a Poisson distribution with a rate of 5 events per 100.
Consider Ha : λ > 0.05.
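A minimal sketch of this calculation follows, using the 10 observed infections and the upper tail P(X ≥ 10) = P(X > 9) of a Poisson distribution with mean 5; the variable names are illustrative.

```r
lambda0 <- 0.05     # benchmark rate per person day under H0
personDays <- 100   # person days at risk in the monitoring period
observed <- 10      # infections seen in the monitoring period
# P(X >= observed) for X ~ Poisson(5); ppois returns P(X <= q), so use q = observed - 1
pValue <- ppois(observed - 1, lambda0 * personDays, lower.tail = FALSE)
pValue  # approximately 0.032
```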
Again, since this P-value is less than 0.05 we reject the null hypothesis. The P-value would be 0.06
for a two sided hypothesis (double the one sided value) and so we would fail to reject in that case.
Exercises
1. P-values are probabilities that are calculated assuming which hypothesis is true?
• the alternative
• the null
2. You get a P-value of 0.06. Would you reject for a type I error rate of 0.05?
• Yes you would reject the null
⁹http://youtu.be/Tcw2OVyEX3s?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
11. Power
Power
Watch this video before beginning¹ and then watch this video as well.²
Power is the probability of rejecting the null hypothesis when it is false. Ergo, power (as its name
would suggest) is a good thing; you want more power. A type II error (a bad thing, as its name would
suggest) is failing to reject the null hypothesis when it’s false; the probability of a type II error is
usually called β. Note Power = 1 − β.
Let’s go through an example of calculating power. Consider our previous example involving RDI.
H0 : µ = 30 versus Ha : µ > 30. Then power is:
$$P\left(\frac{\bar X - 30}{s/\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a\right).$$
Note that this is a function that depends on the specific value of µa ! Further notice that as µa
approaches 30 the power approaches α.
Pushing this example further, we reject if
$$Z = \frac{\bar X - 30}{\sigma/\sqrt{n}} > z_{1-\alpha}$$
Or, equivalently, if
$$\bar X > 30 + z_{1-\alpha}\frac{\sigma}{\sqrt{n}}$$
But, note that, under H0 : X̄ ∼ N(µ0, σ²/n). However, under Ha : X̄ ∼ N(µa, σ²/n).
So for this test we could calculate power with this R code:
¹http://youtu.be/-TsBOLiW4rQ?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
²http://youtu.be/GRS2b1aedmk?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
alpha = 0.05
z = qnorm(1 - alpha)
pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALSE)
Let’s plug in the specific numbers for our example where: µa = 32, µ0 = 30, n = 16, σ = 4.
> mu0 = 30
> mua = 32
> sigma = 4
> n = 16
> z = qnorm(1 - alpha)
> pnorm(mu0 + z * sigma/sqrt(n), mean = mu0, sd = sigma/sqrt(n), lower.tail = FALSE)
[1] 0.05
> pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALSE)
[1] 0.6388
When we plug in µ0 , the value under the null hypothesis, we get that the probability of rejection is
5%, as the test was designed. However, when we plug in a value of 32, we get 64%. Therefore, the
probability of rejection is 64% when the true value of µ is 32. We could create a curve of the power
as a function of µa , as seen below. We also varied the sample size to see how the curve depends on
that.
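A curve of this kind can be sketched with base R graphics. The grid of µa values and the sample sizes below are illustrative assumptions, not values taken from the text.

```r
mu0 <- 30
sigma <- 4
alpha <- 0.05
mua <- seq(30, 35, by = 0.1)        # illustrative grid of alternative means
nValues <- c(8, 16, 32, 64, 128)    # illustrative sample sizes

# Power of the one sided Z test at every value in mua for a given n
powerCurve <- function(n) {
  pnorm(mu0 + qnorm(1 - alpha) * sigma / sqrt(n),
        mean = mua, sd = sigma / sqrt(n), lower.tail = FALSE)
}
power <- sapply(nValues, powerCurve)  # rows index mua, columns index n

# One line per sample size
matplot(mua, power, type = "l", lty = 1, col = seq_along(nValues),
        xlab = "mua", ylab = "Power")
legend("bottomright", legend = paste("n =", nValues),
       col = seq_along(nValues), lty = 1)
```

Every curve starts at α when µa = µ0 and rises toward 1, more steeply for larger n.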
The code below shows how to use manipulate to investigate power as the various inputs change.
library(manipulate)
library(ggplot2)
mu0 = 30
# Plot the sampling distribution of the mean under H0 (red) and Ha (blue),
# with a vertical line at the rejection threshold
myplot <- function(sigma, mua, n, alpha) {
    g = ggplot(data.frame(mu = c(27, 36)), aes(x = mu))
    g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mu0,
        sd = sigma/sqrt(n)), size = 2, col = "red")
    g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mua,
        sd = sigma/sqrt(n)), size = 2, col = "blue")
    xitc = mu0 + qnorm(1 - alpha) * sigma/sqrt(n)
    g = g + geom_vline(xintercept = xitc, size = 3)
    g
}
manipulate(myplot(sigma, mua, n, alpha),
    sigma = slider(1, 10, step = 1, initial = 4),
    mua = slider(30, 35, step = 1, initial = 32),
    n = slider(1, 50, step = 1, initial = 16),
    alpha = slider(0.01, 0.1, step = 0.01, initial = 0.05))
Question
Watch this video before beginning.³
When testing Ha : µ > µ0 , notice if power is 1 − β, then
$$1 - \beta = P\left(\bar X > \mu_0 + z_{1-\alpha}\frac{\sigma}{\sqrt{n}} ~;~ \mu = \mu_a\right)$$
where X̄ ∼ N(µa, σ²/n). The unknowns in the equation are: µa, σ, n, β and the knowns are: µ0,
α. Specify any 3 of the unknowns and you can solve for the remainder.
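As one illustration of solving for the remaining unknown, the sketch below fixes µa, σ and β and solves the power equation for n. The target power of 0.8 is an assumed value chosen for the example; the other numbers are those of the running example.

```r
mu0 <- 30
mua <- 32
sigma <- 4
alpha <- 0.05
targetPower <- 0.8   # ASSUMPTION: illustrative target, i.e. beta = 0.2

# Standardizing Xbar under mu = mua turns the power equation into
# sqrt(n) * (mua - mu0) / sigma = z_{1-alpha} + z_{1-beta}, which gives n below
nExact <- ((qnorm(1 - alpha) + qnorm(targetPower)) * sigma / (mua - mu0))^2
n <- ceiling(nExact)  # round up to a whole number of observations

# Check: power actually achieved with the rounded sample size
achieved <- pnorm(mu0 + qnorm(1 - alpha) * sigma / sqrt(n),
                  mean = mua, sd = sigma / sqrt(n), lower.tail = FALSE)
c(n = n, power = achieved)
```

Rounding n up means the achieved power is slightly above the target rather than exactly equal to it.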
Notes
• The calculation for Ha : µ < µ0 is similar
• For Ha : µ ≠ µ0 calculate the one sided power using α/2 (this is only approximately right; it
excludes the probability of getting a large test statistic in the opposite direction of the truth)
• Power goes up as α gets larger
• Power of a one sided test is greater than the power of the associated two sided test
• Power goes up as µa gets further away from µ0
• Power goes up as n goes up
• Power doesn’t need µa, σ and n, instead only $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$
– The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard
deviation units.
– Being unit free, it has some hope of interpretability across settings.
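The last point can be checked numerically. In this sketch `zPower` is a hypothetical helper (not from the text) wrapping the one sided Z test power calculation; three different settings that share the same value of √n(µa − µ0)/σ give the same power.

```r
# Hypothetical helper: power of the one sided Z test of H0: mu = mu0 vs Ha: mu > mu0
zPower <- function(mu0, mua, sigma, n, alpha = 0.05) {
  pnorm(mu0 + qnorm(1 - alpha) * sigma / sqrt(n),
        mean = mua, sd = sigma / sqrt(n), lower.tail = FALSE)
}

p1 <- zPower(mu0 = 30, mua = 32, sigma = 4, n = 16)  # effect size 0.5, n = 16
p2 <- zPower(mu0 = 0, mua = 1, sigma = 2, n = 16)    # different units, same effect size
p3 <- zPower(mu0 = 30, mua = 31, sigma = 4, n = 64)  # effect size 0.25, four times the n
c(p1, p2, p3)  # all three agree
```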
T-test power
Watch this before beginning.⁴
Consider calculating power for Gosset’s t test for our example where we now assume that n = 16.
The power is
$$P\left(\frac{\bar X - \mu_0}{S/\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a\right).$$
Calculating this requires the so-called non-central t distribution. However, fortunately for us, the R
function power.t.test does this very well. Omit (exactly) any one of the arguments and it solves
for it. Our t-test power again only relies on the effect size.
Let’s do our example trying different options.
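A minimal sketch of such calls follows, using the example values µa − µ0 = 2 and σ = 4 with n = 16. The target power of 0.8 in the last two calls is an assumed value for illustration.

```r
# Power for n = 16, specified on the original scale (delta = mua - mu0, sd = sigma)
p1 <- power.t.test(n = 16, delta = 2, sd = 4,
                   type = "one.sample", alternative = "one.sided")$power
# Same effect size (0.5) in standard deviation units gives the same power
p2 <- power.t.test(n = 16, delta = 0.5, sd = 1,
                   type = "one.sample", alternative = "one.sided")$power

# Omit n and supply power to solve for the sample size instead
# ASSUMPTION: target power of 0.8 chosen for illustration
n1 <- power.t.test(power = 0.8, delta = 2, sd = 4,
                   type = "one.sample", alternative = "one.sided")$n
n2 <- power.t.test(power = 0.8, delta = 0.5, sd = 1,
                   type = "one.sample", alternative = "one.sided")$n
c(power1 = p1, power2 = p2, n1 = n1, n2 = n2)
```

The returned n is generally not an integer, so in practice it would be rounded up.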
³http://youtu.be/3bWhP5MyuqI?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
⁴http://youtu.be/1DiwutNpt5Y?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Exercises
1. Power is a probability calculation assuming which is true:
• The null hypothesis
• The alternative hypothesis
• Both the null and alternative
2. As your sample size gets bigger, all else equal, what do you think would happen to power?
• It would get larger
• It would get smaller
• It would stay the same
• It cannot be determined from the information given
3. What happens to power as µa gets further from µ0 ?
• Power decreases
• Power increases
12. The bootstrap and resampling
The bootstrap
Watch this video before beginning.¹
The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating
standard errors for difficult statistics. For a classic example, how would one derive a confidence
interval for the median? The bootstrap procedure follows from the so-called bootstrap principle.
To illustrate the bootstrap principle, imagine a die roll. The image below shows the mass function
of a die roll on the left. On the right we show the empirical distribution obtained by repeatedly
averaging 50 independent die rolls. By this simulation, without any mathematics, we have a good
idea of what the distribution of averages of 50 die rolls looks like.
Image of true die roll distribution (left) and simulation of averages of 50 die rolls
Now imagine a case where we didn’t know whether or not the die was fair. We have a sample of
size 50 and we’d like to investigate the distribution of the average of 50 die rolls where we’re not
allowed to roll the die anymore. This is more like a real data analysis: we only get one sample from
the population.
¹http://youtu.be/0hNQx9nagq4?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ
Image of empirical die roll distribution (left) and simulations of averages of 50 die rolls from this distribution
The bootstrap principle is to use the empirical mass function of the data to perform the simulation,
rather than the true distribution. That is, we simulate averages of 50 samples from the histogram
that we observe. With enough data, the empirical distribution should be a good estimate of the true
distribution and this should result in a good approximation of the sampling distribution.
That’s the bootstrap principle: investigate the sampling distribution of a statistic by simulating
repeated realizations from the observed distribution.
If we could simulate from the true distribution, then we would know the exact sampling distribution
of our statistic (if we ran our computer long enough). However, since we only get to sample from
that distribution once, we have to be content with using the empirical distribution. This is the clever
idea of the bootstrap.
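The die example can be sketched in a few lines of R. The seed and the number of resamples are arbitrary choices for illustration, and the observed rolls are themselves simulated here to stand in for the single sample we would have in practice.

```r
set.seed(1)      # arbitrary seed so the sketch is reproducible
nRolls <- 50
B <- 1000        # number of bootstrap resamples (illustrative)

# The one sample we actually get to observe
rolls <- sample(1:6, nRolls, replace = TRUE)

# Bootstrap: resample from the observed rolls (the empirical distribution)
# and record the average of each resample
bootMeans <- replicate(B, mean(sample(rolls, nRolls, replace = TRUE)))

# Spread of the bootstrap averages versus the usual standard error formula
c(bootstrapSE = sd(bootMeans), formulaSE = sd(rolls) / sqrt(nRolls))
```

The two standard error estimates should be close, which is a useful sanity check on the resampling code for a statistic where the answer is known.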
Bootstrapping example
library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
# Draw B resamples of size n with replacement, one resample per row
resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)
# Median of each resample
resampledMedians <- apply(resamples, 1, median)
1. Sample n observations with replacement from the observed data resulting in one simulated
complete data set.
2. Take the median of the simulated data set
3. Repeat these two steps B times, resulting in B simulated medians
4. These medians are approximately drawn from the sampling distribution of the median of n
observations; therefore we can:
• Draw a histogram of them
• Calculate their standard deviation to estimate the standard error of the median
• Take the 2.5th and 97.5th percentiles as a confidence interval for the median
For the general bootstrap, just replace the median with whatever statistic you’re investigating.
Example code
Consider our father/son data from before. Here is the relevant code for doing the resampling.
B <- 10000
resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)
medians <- apply(resamples, 1, median)
> sd(medians)
[1] 0.08424
Thus, 0.084 estimates the standard error of the median for this data set. It did this by repeatedly
sampling medians from the observed distribution and taking the standard deviation of the resulting
collection of medians. Taking the 2.5 and 97.5 percentiles gives us a bootstrap 95% confidence interval
for the median.
We also always want to plot a histogram or density estimate of our simulated statistic.
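The percentile interval and the histogram can be sketched as follows. To keep the example runnable without extra packages it uses `mtcars$mpg` as stand-in data, which is an assumption for illustration; substitute `father.son$sheight` to follow the text's example.

```r
set.seed(1)                 # arbitrary seed for reproducibility
x <- mtcars$mpg             # ASSUMPTION: stand-in data, not the father/son heights
n <- length(x)
B <- 10000

# B resamples of size n with replacement, one per row, and their medians
resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)
medians <- apply(resamples, 1, median)

se <- sd(medians)                          # bootstrap standard error of the median
ci <- quantile(medians, c(0.025, 0.975))   # percentile 95% confidence interval
ci

# Histogram of the simulated medians with the interval endpoints marked
hist(medians, breaks = 30, main = "Bootstrap medians", xlab = "Median")
abline(v = ci, lty = 2)
```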
library(ggplot2)
data(InsectSprays)
# Boxplot of insect counts for each spray
g = ggplot(InsectSprays, aes(spray, count, fill = spray))
g = g + geom_boxplot()
g
Permutation tests
Consider comparing means between the groups. However, let’s calculate the distribution of
our statistic under a null hypothesis that the labels are irrelevant (exchangeable). This is a handy
way to create a null distribution for our test statistic by simply permuting the labels over and over
and seeing how extreme our data are with respect to this permuted distribution.
The procedure would be as follows:
Also, so-called randomization tests are exactly permutation tests, with a different motivation. In
that case, think of the permutation test as replicating the random assignment over and over.
For matched or paired data, it wouldn’t make sense to randomize the group labels, since that would
break the association between the pairs. Instead, one can randomize the signs of the pairs. For data
that has been replaced by ranks, you might have heard of this test before as the signed rank test.
Again we won’t cover more complex examples, but it should be said that permutation strategies
work for regression as well by permuting a regressor of interest (though this needs to be done with
care). These tests work very well in massively multivariate settings.
Permutation test B v C
Let’s create some code for our example. Our statistic will be the difference in the means in each
group.
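A minimal sketch of how this could be coded is below. The seed and the 10,000 permutations are illustrative choices, and `testStat` is a helper name introduced here for the difference in group means.

```r
set.seed(1)   # arbitrary seed for reproducibility
data(InsectSprays)

# Keep only sprays B and C
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ]
y <- subdata$count
group <- as.character(subdata$spray)

# Test statistic: mean count in group B minus mean count in group C
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)

# Null distribution: recompute the statistic after permuting the group labels
permutations <- sapply(1:10000, function(i) testStat(y, sample(group)))

# Proportion of permuted statistics larger than the observed one
pValue <- mean(permutations > observedStat)
c(observedStat = observedStat, pValue = pValue)
```

Calling `sample(group)` with no other arguments returns a random permutation of the labels, which is what makes this a permutation test rather than a bootstrap.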
Let’s look at some of the results. First let’s look at the observed statistic.
> observedStat
[1] 13.25
Now let’s see what proportion of times we got a simulated statistic larger than our observed statistic.
Since this is 0, our estimate of the P-value is 0 (i.e. we strongly reject the null). It’s useful to look
at a histogram of permuted statistics with a vertical line drawn at the observed test statistic for
reference.
Exercises
1. The bootstrap uses what to estimate the sampling distribution of a statistic?
• The true population distribution
• The empirical distribution that puts probability 1/n for each observed data point
2. When performing the bootstrap via Monte Carlo resampling for a data set of size n, which is
true? Assume that you’re going to do 10,000 bootstrap resamples.
• You sample n complete data sets of size 10,000 with replacement
• You sample 10,000 complete data sets of size n without replacement