Point Estimation (Module 2)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
Semester 2, 2018
Contents
1 Estimation & sampling distributions
2 Estimators
3 Method of moments
4 Maximum likelihood
On a particular street, we measure the time interval (in minutes) between each car that passes:
2.55 2.13 3.18 5.94 2.29 2.41 8.72 3.71
We believe these follow an exponential distribution:
Xi ∼ Exp(λ)
What can we say about λ?
Can we approximate it from the data?
Yes! We can do it using a statistic. This is called estimation.
Distributions of statistics
For a sample of size n = 100 from Exp(λ):
X_{(1)} ∼ Exp(100λ),    Σ_{i=1}^{100} Xi ∼ Gamma(100, λ)
How to estimate?
Suppose we parameterise the exponential by its mean, θ = E(Xi) = 1/λ, so that sd(Xi) = θ as well. Can we use the sample mean, X̄, as an estimate of θ? Yes!
Can we use the sample standard deviation, S, as an estimate of θ? Yes!
Will these statistics be good estimates? Which one is better? Let’s see. . .
We need to know properties of their sampling distributions, such as their mean and variance.
Note: we are referring to the distribution of the statistic, T , rather than the population distribution from which we
draw samples, X.
For example, it is natural to expect that:
• E(X̄) ≈ µ (sample mean ≈ population mean)
• E(S²) ≈ σ² (sample variance ≈ population variance)
Let’s see for our example:
[Figure: Left: sampling distribution of X̄. Right: sampling distribution of S². Vertical dashed lines mark the true values, E(X) = 5 and var(X) = 5² = 25.]
• Should we use X̄ or S to estimate θ? Which one is the better estimator?
• We would like the sampling distribution of the estimator to be concentrated as closely as possible around the true value θ = 5.
• In practice, for any given dataset, we don’t know which estimate is the closest, since we don’t know the true
value.
• We should use the one that is more likely to be the closest.
• Simulation: consider 250 samples of size n = 100 and compute:
x̄1 , . . . , x̄250 ,
s1 , . . . , s250
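A minimal R sketch of this simulation (the exponential rate 1/5, chosen so that θ = E(Xi) = 5, and the seed are assumptions):

set.seed(1)                                              # assumed seed, for reproducibility
theta <- 5                                               # true mean of the exponential
samples <- replicate(250, rexp(100, rate = 1 / theta))   # 250 samples of size n = 100
x.bar <- colMeans(samples)                               # the 250 sample means
s <- apply(samples, 2, sd)                               # the 250 sample standard deviations
summary(x.bar); sd(x.bar)
summary(s); sd(s)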
> summary(x.bar)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.789 4.663 4.972 5.015 5.365 6.424
> sd(x.bar)
[1] 0.4888185
> summary(s)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.502 4.473 4.916 5.002 5.512 7.456
> sd(s)
[1] 0.7046119
From our simulation, sd(X̄) ≈ 0.49 and sd(S) ≈ 0.70. So, in this case it looks like X̄ is superior to S.
2 Estimators
Definitions
• A parameter is a quantity that describes the population distribution, e.g. µ and σ² for N(µ, σ²).
• The parameter space is the set of all possible values that a parameter might take, e.g. −∞ < µ < ∞ and
0 ≤ σ < ∞.
• An estimator (or point estimator) is a statistic that is used to estimate a parameter. It refers specifically to the
random variable version of the statistic, e.g. T = u(X1 , . . . , Xn ).
• An estimate (or point estimate) is the observed value of the estimator for a given dataset. In other words, it is
a realisation of the estimator, e.g. t = u(x1 , . . . , xn ), where x1 , . . . , xn is the observed sample (data).
• ‘Hat’ notation: If T is an estimator for θ, then we usually refer to it by θ̂ for convenience.
Examples
Sample mean
X̄ = (1/n)(X1 + X2 + ⋯ + Xn) = (1/n) Σ_{i=1}^{n} Xi
Properties:
• E(X̄) = µ
• var(X̄) = σ²/n
Sample variance
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²
Properties:
• E(S²) = σ²
• var(S²) = (a messy formula)
Often used to estimate the population variance: σ̂² = S².
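As a quick check in R, the built-in var() uses the n − 1 denominator (here applied to the car-interval data from the start of the module):

x <- c(2.55, 2.13, 3.18, 5.94, 2.29, 2.41, 8.72, 3.71)   # observed time intervals
n <- length(x)
var(x)                                  # sample variance, S^2 (divides by n - 1)
sum((x - mean(x))^2) / (n - 1)          # the same value, computed directly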
Sample proportion
For a discrete random variable, we might be interested in how often a particular value appears. Counting this gives
the sample frequency:
freq(a) = Σ_{i=1}^{n} I(Xi = a)
freq(a) ∼ Bi(n, p), where p = Pr(Xi = a)
Divide by the sample size to get the sample proportion. This is often used as an estimator for the population proportion:
p̂ = freq(a)/n = (1/n) Σ_{i=1}^{n} I(Xi = a)
Note:
• The sample pmf and the sample proportion are the same thing: both estimate the probability of a given event or set of events.
• The pmf is usually used when the interest is in many different events/values, and is written as a function, e.g.
p̂(a).
• The proportion is usually used when only a single event is of interest (getting heads for a coin flip, a certain
candidate winning an election, etc.).
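In R, the sample proportion is simply the mean of an indicator; a small sketch (the 0/1 data vector below is hypothetical):

x <- c(1, 0, 0, 1, 1, 0, 1, 1)          # hypothetical sample (e.g. coin flips, 1 = heads)
a <- 1
freq.a <- sum(x == a)                   # sample frequency of the value a
p.hat <- freq.a / length(x)             # sample proportion; equivalently mean(x == a)
p.hat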
If the sample is drawn from a normal distribution, Xi ∼ N(µ, σ 2 ), we can derive exact distributions for these statistics.
Sample mean:
X̄ ∼ N(µ, σ²/n)
Sample variance:
S² ∼ (σ²/(n − 1)) χ²_{n−1}
E(S²) = σ²,    var(S²) = 2σ⁴/(n − 1)
χ²_k is the chi-squared distribution with k degrees of freedom (more details in Module 3).
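A small simulation sketch to check these results (the values µ = 0, σ = 2, n = 10 and the number of replicates are assumptions):

mu <- 0; sigma <- 2; n <- 10
sims <- replicate(10000, {x <- rnorm(n, mu, sigma); c(mean(x), var(x))})
mean(sims[1, ]); var(sims[1, ])   # compare with mu and sigma^2 / n
mean(sims[2, ]); var(sims[2, ])   # compare with sigma^2 and 2 * sigma^4 / (n - 1)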
Bias
Consider an estimator θ̂ of θ.
• If E(θ̂) = θ, the estimator is said to be unbiased
• The bias of the estimator is, E(θ̂) − θ
Examples:
• The sample variance is unbiased for the population variance, E(S 2 ) = σ 2 . (problem 5 in week 3 tutorial)
• What if we divide by n instead of n − 1 in the denominator?
E(((n − 1)/n) S²) = ((n − 1)/n) σ² < σ²  ⇒  biased!
In general, if θ̂ is unbiased for θ, then it will usually be the case that g(θ̂) is biased for g(θ).
Unbiasedness is not preserved under transformations.
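A quick simulation sketch of this bias (the N(0, σ² = 4) population and n = 10 are assumptions):

sigma2 <- 4; n <- 10
v1 <- replicate(10000, var(rnorm(n, 0, sqrt(sigma2))))   # S^2, divides by n - 1
v2 <- (n - 1) / n * v1                                   # divides by n instead
mean(v1)   # approximately sigma2 = 4 (unbiased)
mean(v2)   # approximately (n - 1) / n * sigma2 = 3.6 (biased downwards)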
Challenge problem
Is the sample standard deviation, S = √(S²), biased for the population standard deviation, σ?
Choosing between estimators
• Evaluate and compare the sampling distributions of the estimators.
• Generally, prefer estimators that have smaller bias and smaller variance (how these are traded off can depend on the aim of your problem).
• Sometimes, we only know asymptotic properties of estimators (will see examples later).
Note: this approach to estimation is referred to as frequentist or classical inference. The same is true for most of the
techniques we will cover. We will also learn about an alternative approach, called Bayesian inference, later in the
semester.
Take a random sample of size n from the uniform distribution with pdf:
f(x) = 1   (θ − 1/2 < x < θ + 1/2)
Can you think of some estimators for θ? What is their bias and variance?
Take a random sample of size n from the shifted exponential distribution, with pdf:
f(x | θ) = e^{−(x−θ)}   (x > θ)
Equivalently:
Xi ∼ θ + Exp(1)
Can you think of some estimators for θ? What is their bias and variance?
3 Method of moments
Method of moments (MM)
• Idea:
– Make the population distribution resemble the empirical (data) distribution. . .
– . . . by equating theoretical moments with sample moments
– Do this until you have enough equations, and then solve them
• Example: if E(X̄) = θ, then the method of moments estimator of θ is X̄.
• General procedure (for r parameters):
1. X1, . . . , Xn i.i.d. f(x | θ1, . . . , θr).
2. The kth theoretical moment is E(X^k).
3. The kth sample moment is Mk = (1/n) Σ_{i=1}^{n} Xi^k.
4. Equate E(X^k) = Mk for k = 1, . . . , r and solve for θ1, . . . , θr.
Remarks
• An intuitive approach to estimation
• Can work in situations where other approaches are too difficult
• Usually biased
• Usually not optimal (but may suffice)
• Note: some authors use a ‘bar’ (θ̄) or a ‘tilde’ (θ̃) to denote MM estimators rather than a ‘hat’ (θ̂). This helps to distinguish different estimators when comparing them to each other.
For instance, equating the first two moments to estimate a mean and variance gives µ̃ = M1 = X̄ and σ̃² = M2 − M1² = ((n − 1)/n) S².
Note:
• This is not the usual sample variance!
• σ̃² = ((n − 1)/n) S²
• This one is biased: E(σ̃²) = ((n − 1)/n) σ² ≠ σ².
Example: for a sample from a Gamma distribution with shape α and scale θ, we have E(X) = αθ and var(X) = αθ². Equating these with their sample counterparts:
X̄ = αθ   and   S² = αθ²
Solving these gives:
θ̃ = S²/X̄   and   α̃ = X̄²/S²
Note:
• This is an example of using S² instead of M2.
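A brief R sketch of these method-of-moments estimates (the simulated Gamma sample, with shape α = 2 and scale θ = 3, is an assumption):

x <- rgamma(200, shape = 2, scale = 3)   # simulated data; true alpha = 2, theta = 3
theta.mm <- var(x) / mean(x)             # theta-tilde = S^2 / x-bar
alpha.mm <- mean(x)^2 / var(x)           # alpha-tilde = x-bar^2 / S^2
c(alpha.mm, theta.mm)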
4 Maximum likelihood
Example: Bernoulli distribution. Suppose X1, . . . , Xn are i.i.d. Bernoulli(p) and we observe the values x1, . . . , xn (each 0 or 1).
• Regard the sample x1, . . . , xn as known (since we have observed it) and regard the probability of the data as a function of p.
• When written this way, this is called the likelihood of p:
L(p) = L(p | x1, . . . , xn)
     = Pr(X1 = x1, . . . , Xn = xn | p)
     = p^{Σxi} (1 − p)^{n − Σxi}
• It is usually easier to maximise the logarithm of the likelihood, the log-likelihood. The final answer (the maximising value of p) is the same, since the logarithm is a strictly increasing function, so any value that maximises the log-likelihood also maximises the likelihood.
• Putting x = Σ_{i=1}^{n} xi, so that x is the number of 1’s in the sample,
ln L(p) = x ln p + (n − x) ln(1 − p)
• Find the maximum of this log-likelihood with respect to p by differentiating and equating to zero:
∂ ln L(p)/∂p = x/p − (n − x)/(1 − p) = 0
Solving gives p̂ = x/n, the sample proportion of 1’s.
[Figure: log-likelihood curves ln L(p) for n = 100, with x = 50, x = 40 and x = 80.]
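A minimal R sketch that draws curves like those in the figure (how the original figure was produced is an assumption):

n <- 100
p <- seq(0.01, 0.99, by = 0.01)
loglik <- function(x, p) x * log(p) + (n - x) * log(1 - p)   # Bernoulli log-likelihood
plot(p, loglik(50, p), type = "l", ylim = c(-300, -50), ylab = "log likelihood(p)")
lines(p, loglik(40, p), lty = 2)
lines(p, loglik(80, p), lty = 3)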
Example: Exponential distribution with mean λ, i.e. f(x | λ) = (1/λ) e^{−x/λ}. Differentiating the log-likelihood and equating to zero:
∂ ln L(λ)/∂λ = −n/λ + (Σ_{i=1}^{n} xi)/λ² = 0
This gives: λ̂ = X̄.
[Figures: ‘Log-likelihood curve’ — the curve log(L) for one observed sample, with its maximising value (the MLE) marked; ‘Log-likelihood curves’ — the corresponding curves for many repeated samples, each with its maximising value marked.]
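A short R sketch of such a curve for the exponential (mean λ) example; the simulated sample is an assumption:

x <- rexp(30, rate = 1 / 5)                      # assumed sample of size 30, true mean 5
loglik <- function(lam) -length(x) * log(lam) - sum(x) / lam
lam <- seq(2, 12, by = 0.1)
plot(lam, sapply(lam, loglik), type = "l", xlab = "lambda", ylab = "log(L)")
abline(v = mean(x), lty = 2)                     # the MLE, lambda-hat = x-bar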
Example: Normal distribution, N(θ1, θ2), with both the mean θ1 and the variance θ2 unknown. The log-likelihood is:
ln L(θ1, θ2) = −(n/2) ln(2πθ2) − (1/(2θ2)) Σ_{i=1}^{n} (xi − θ1)²
Take partial derivatives with respect to θ1 and θ2 .
∂ ln L(θ1, θ2)/∂θ1 = (1/θ2) Σ_{i=1}^{n} (xi − θ1)
∂ ln L(θ1, θ2)/∂θ2 = −n/(2θ2) + (1/(2θ2²)) Σ_{i=1}^{n} (xi − θ1)²
Set both of these to zero and solve. This gives θ̂1 = x̄ and θ̂2 = (1/n) Σ_{i=1}^{n} (xi − x̄)². The maximum likelihood estimators are therefore:
θ̂1 = X̄,    θ̂2 = (1/n) Σ_{i=1}^{n} (Xi − X̄)² = ((n − 1)/n) S²
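A short R sketch checking these formulas numerically with optim() (the simulated sample and the log-variance parameterisation are assumptions):

x <- rnorm(50, mean = 10, sd = 2)                  # assumed sample
negloglik <- function(par) {                       # par = c(theta1, log(theta2)), keeps theta2 > 0
  -sum(dnorm(x, mean = par[1], sd = sqrt(exp(par[2])), log = TRUE))
}
fit <- optim(c(0, 0), negloglik)                   # numerical minimisation
c(fit$par[1], exp(fit$par[2]))                     # numerical MLEs of theta1 and theta2
c(mean(x), (length(x) - 1) / length(x) * var(x))   # closed-form MLEs for comparison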
[Figure: Left: normal Q-Q plot of a sample (sample quantiles against theoretical quantiles). Right: contour plot over the parameters (µ, σ).]
Take a random sample of size n from the shifted exponential distribution, with pdf:
f(x | θ) = e^{−(x−θ)}   (x > θ)
Equivalently:
Xi ∼ θ + Exp(1)
Derive the MLE for θ. Is it biased? Can you create an unbiased estimator from it?
Invariance property
Suppose we know θ̂ but are actually interested in φ = g(θ) rather than θ itself. Can we estimate φ?
Yes! It is simply φ̂ = g(θ̂). For example, since the MLE of the exponential mean λ is λ̂ = X̄, the MLE of the rate 1/λ is 1/X̄.
This is known as the invariance property of the MLE. In other words, the MLE of a transformed parameter is the transform of the MLE.
Consequence: MLEs are usually biased since expectations are not invariant under transformations.