09 Inference - Slides Web
Edwin Leuven
Introduction
2/39
Introduction
Until now we’ve mostly dealt with descriptive statistics and with
probability.
In descriptive statistics one investigates the characteristics of the
data at hand.
3/39
Introduction
4/39
Introduction
Parameter                      Statistic
E[X] = µ    — Probability →    x̄ = (1/n) ∑_{i=1}^n x_i
            ← Inference —
5/39
Point estimation
We want to estimate a population parameter using the observed
data.
7/39
Hypothesis Testing
8/39
Statistical hypothesis testing
Each of the two hypotheses, the old and the new, predicts a different
distribution for the empirical measurements.
In order to decide which of the distributions is more in tune with the
data, a statistic is computed.
This statistic t is called the test statistic.
A threshold c is set and the old theory is rejected if t > c
Hypothesis testing consists in asking a binary question about the
sampling distribution of t
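As an illustrative sketch (not from the slides), suppose the old theory says a proportion p equals 0.5; the standardized sample proportion can serve as the test statistic t, with the threshold c taken from the normal distribution:

```r
# Sketch: old theory says p = 0.5; reject it when t exceeds the threshold c
n <- 1000
x <- rbinom(n, 1, 0.5)                       # data generated under the old theory
t <- (mean(x) - 0.5) / sqrt(0.5 * 0.5 / n)   # test statistic
c <- qnorm(0.95)                             # threshold for a one-sided 5% test
t > c                                        # TRUE would mean: reject the old theory
```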
9/39
Statistical hypothesis testing
This decision rule is not error proof, since the test statistic may fall
by chance on the wrong side of the threshold.
Suppose we know the sampling distribution of the test statistic t
We can then set the probability of making an error to a given level
by setting c
The probability of erroneously rejecting the currently accepted
theory (the old one) is called the significance level of the test.
The threshold is selected in order to assure a small enough
significance level.
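This can be checked by simulation: choosing c as the 95% quantile of the test statistic's null distribution makes the old theory falsely rejected in about 5% of samples (a sketch, assuming p = 0.5 under the old theory):

```r
set.seed(1)                             # for reproducibility
n <- 1000
t.null <- replicate(10^4, {
  x <- rbinom(n, 1, 0.5)                # sample drawn under the old theory
  (mean(x) - 0.5) / sqrt(0.25 / n)      # test statistic t
})
mean(t.null > qnorm(0.95))              # rejection rate, close to 0.05
```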
10/39
Multiple measurements
11/39
Statistics
12/39
Statistics
we therefore denote the statistic with capitals, e.g. the sample mean:
I X̄ = (1/n) ∑_{i=1}^n X_i
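The point of the capital-letter notation is that X̄ is itself random: each sample yields a different realization x̄. A small illustrative sketch (not from the slides):

```r
# Two samples from the same population give two different sample means
a <- mean(rnorm(100, mean = 5))   # one realization of X.bar
b <- mean(rnorm(100, mean = 5))   # another sample, another value
c(a, b)                           # both near 5, but (almost surely) unequal
```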
13/39
Example: Polling
14/39
Example: Polling
Imagine we want to predict whether the left block or the right block
will get a majority in parliament
Key quantities:
I N = 4,166,612 - Population
I p = (# people who support the right) / N
I 1 − p = (# people who support the left) / N
1. What is p?
2. Is p > 0.5?
3. We estimate p but are we sure?
15/39
Example: Polling
We poll a random sample of n = 1,000 people from the population
without replacement:
Let
Xi = 1 if person i supports the right
Xi = 0 if person i supports the left
and denote our data by x1, . . . , xn
Then we can estimate p by
p̂ = (x1 + . . . + xn )/n
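One such poll can be simulated in R (a sketch assuming the true support is p = 0.54, which in reality is of course unknown):

```r
set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.54)   # x_1, ..., x_n: 1 = supports the right
p.hat <- mean(x)          # (x_1 + ... + x_n)/n
p.hat                     # close to, but not exactly, 0.54
```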
16/39
Example: Polling
E[Xi] = 1 · p + 0 · (1 − p) = p
and therefore p̂ estimates p. But if we find p̂ = 0.54:
I is p = 0.54?
I is p > 0.5?
In general there is an estimation error = p̂ − p ≠ 0
which comes from the difference between our sample and the
population
18/39
Example: Polling
20/39
The Sampling Distribution
I population
I eligible voters in Norway today
I model (theoretical population)
I Pr(vote right block) = p
21/39
Sampling Distribution of Statistics
22/39
Sampling Distribution of Statistics
can be complicated!
We can sometimes learn about the sampling distribution of a
statistic by:
I deriving its exact finite sample distribution
I a large-sample approximation (the Central Limit Theorem)
I numerical simulation
23/39
Finite sample distributions
Sometimes we can derive the finite sample distribution of a statistic
Let the fraction of people voting right in the population be p
Because we know the distribution of the data (up to the unknown
parameter p) we can derive the sampling distribution
In a random sample of size n the probability of observing k people
voting for the right can be derived and follows a binomial distribution
Pr(X = k) = (n choose k) · p^k · (1 − p)^(n−k)
Distribution   E[X]        Var(X)         R
Binomial       np          np(1 − p)      d,p,q,rbinom
Poisson        λ           λ              d,p,q,rpois
Uniform        (a + b)/2   (b − a)²/12    d,p,q,runif
Exponential    λ⁻¹         λ⁻²            d,p,q,rexp
Normal         µ           σ²             d,p,q,rnorm
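The binomial formula can be checked against R's built-in d-function (a small sketch with illustrative numbers n = 10, p = 0.54, k = 6):

```r
n <- 10; p <- 0.54; k <- 6
choose(n, k) * p^k * (1 - p)^(n - k)   # the formula written out
dbinom(k, size = n, prob = p)          # same value via R's dbinom
```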
25/39
Example: Polling
hist(
  replicate(
    10000, mean(rbinom(1000, 1, .54))),
  main="", xlab="p_hat", prob=TRUE, breaks=50)

[Figure: histogram of the 10,000 simulated values of p_hat (density scale)]
27/39
Example: Polling
√n (p̂ − p) ∼ N(0, p(1 − p))
or equivalently
p̂ ∼ N(p, p(1 − p)/n)
by the Central Limit Theorem
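The CLT approximation can be compared with simulation (a sketch, again assuming p = 0.54 and n = 1000):

```r
set.seed(1)
n <- 1000; p <- 0.54
p.hat <- replicate(10^4, mean(rbinom(n, 1, p)))
sd(p.hat)               # simulated standard deviation of p.hat
sqrt(p * (1 - p) / n)   # CLT prediction: about 0.0158
```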
28/39
Example: Polling
[Figure: sampling distribution of p_hat (density scale); x-axis: p_hat]
29/39
Approximation through numerical simulation
30/39
Approximation through numerical simulation
1/λ ± 1.96 · 1/(λ√n)
31/39
Approximation through numerical simulation
X.bar = replicate(10^5, mean(rexp(201, 1/12000)))  # 10^5 sample means; n = 201, E[X] = 12000
mean(abs(X.bar-12000) <= 1.96*0.0705*12000)        # 0.0705 is approximately 1/sqrt(201)
## [1] 0.95173
32/39
Approximation through numerical simulation
33/39
Approximation through numerical simulation
Let us carry out the simulation that produces an approximation of
the central region that contains 95% of the sampling distribution of
the mid-range statistic for the Uniform distribution:
quantile(mid.range, c(0.025, 0.975))
## 2.5% 97.5%
## 4.9409107 5.0591218
mean(mid.range)
## [1] 4.9998949
sd(mid.range)
## [1] 0.027876151
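The slide shows the output but not the code that produced it; a sketch that reproduces numbers like these, assuming samples of size n = 100 from a Uniform(3, 7) population (so the mid-range centers on 5):

```r
set.seed(1)
mid.range <- replicate(10^5, {
  x <- runif(100, 3, 7)        # assumed population: Uniform(3, 7)
  (max(x) + min(x)) / 2        # the mid-range statistic
})
quantile(mid.range, c(0.025, 0.975))   # central region with 95% of the values
mean(mid.range)                        # close to 5
sd(mid.range)                          # close to 0.028
```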
35/39
Approximation through numerical simulation
36/39
Approximation through numerical simulation
n = 1000
data = rbinom(n, 1, .54) # true distr, usually unknown
estimates = rep(0, 999)
for(i in 1:999) {
  id = sample(1:n, n, replace=T)   # resample the data with replacement
  estimates[i] = mean(data[id])    # bootstrap estimate of p
}
sd(estimates)                      # bootstrap standard error
## [1] 0.015946413
sqrt(.54*(1-.54)/n)                # theoretical sd of p_hat, for comparison
## [1] 0.015760711
37/39
Summary
I Estimation:
I Determining the distribution, or some characteristic of it.
(What is our best guess for p?)
I Confidence intervals:
I Quantifying the uncertainty of our estimate. (What is a range
of values to which we’re reasonably sure p belongs?)
I Hypothesis testing:
I Asking a binary question about the distribution. (Is p > 0.5?)
38/39
Summary
In the coming weeks we will take a closer look at how this randomness
affects what we can learn about the population from the data
39/39