Ch3 PDF

This document discusses methods for statistical estimation, including maximum likelihood estimation, the method of moments, and M-estimators. It introduces concepts like total variation distance and Kullback-Leibler divergence that can be used to measure how close an estimated distribution is to the true distribution. The maximum likelihood principle is presented as a method for statistical estimation that chooses parameter values to maximize the likelihood function. Examples of likelihood functions are given for Bernoulli and Poisson models.

18.650 – Fundamentals of Statistics

3. Methods for estimation

1/54
Goals

In the kiss example, the estimator was intuitively the right thing
to do: p̂ = X̄_n.
In view of the LLN, since p = E[X], we have X̄_n → E[X] = p,
so p̂ ≈ p for n large enough.
What if the parameter is θ ≠ E[X]? How do we estimate it?
1. Maximum likelihood estimation: a generic approach with very
   good properties
2. Method of moments: a (fairly) generic and easy approach
3. M-estimators: a flexible approach, close to machine learning

2/54
Total variation distance

Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample
of i.i.d. r.v. X₁, …, X_n. Assume that there exists θ* ∈ Θ such
that X₁ ∼ P_{θ*}: θ* is the true parameter.

Statistician's goal: given X₁, …, X_n, find an estimator
θ̂ = θ̂(X₁, …, X_n) such that P_θ̂ is close to P_{θ*} for the true
parameter θ*.

This means: |P_θ̂(A) − P_{θ*}(A)| is small for all A ⊂ E.

Definition
The total variation distance between two probability measures P_θ
and P_θ′ is defined by

    TV(P_θ, P_θ′) = max_{A⊂E} |P_θ(A) − P_θ′(A)|.

3/54
Total variation distance between discrete measures

Assume that E is discrete (i.e., finite or countable). This includes
Bernoulli, Binomial, Poisson, …

Therefore X has a PMF (probability mass function):

    P_θ(X = x) = p_θ(x) for all x ∈ E,
    p_θ(x) ≥ 0,   Σ_{x∈E} p_θ(x) = 1.

The total variation distance between P_θ and P_θ′ is a simple
function of the PMFs p_θ and p_θ′:

    TV(P_θ, P_θ′) = (1/2) Σ_{x∈E} |p_θ(x) − p_θ′(x)|.
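The discrete formula is easy to check numerically. Below is a minimal sketch, not part of the course material: the helper name `tv_discrete` and the dict-based PMF encoding are our own illustrative choices.

```python
# Sketch: total variation between two PMFs represented as dicts x -> mass.
# tv_discrete is a hypothetical helper, not from the slides.
def tv_discrete(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def ber(p):
    # PMF of Ber(p)
    return {0: 1 - p, 1: p}

# TV(Ber(0.5), Ber(0.1)) = (1/2)(|0.5 - 0.9| + |0.5 - 0.1|) = 0.4
print(tv_discrete(ber(0.5), ber(0.1)))  # approximately 0.4
```

The same helper works for any pair of finitely supported distributions, e.g. two Binomial or truncated Poisson PMFs.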

4/54
Total variation distance between continuous measures

Assume that E is continuous. This includes Gaussian, Exponential, …

Assume that X has a density: P_θ(X ∈ A) = ∫_A f_θ(x) dx for all
A ⊂ E, with f_θ(x) ≥ 0 and ∫_E f_θ(x) dx = 1.

The total variation distance between P_θ and P_θ′ is a simple
function of the densities f_θ and f_θ′:

    TV(P_θ, P_θ′) = (1/2) ∫_E |f_θ(x) − f_θ′(x)| dx.

5/54
Properties of total variation

▶ TV(P_θ, P_θ′) = TV(P_θ′, P_θ)                            (symmetric)
▶ TV(P_θ, P_θ′) ≥ 0                                        (positive)
▶ If TV(P_θ, P_θ′) = 0 then P_θ = P_θ′                     (definite)
▶ TV(P_θ, P_θ′) ≤ TV(P_θ, P_θ″) + TV(P_θ″, P_θ′)           (triangle inequality)

These imply that the total variation is a distance between
probability distributions.

6/54
Exercises
Compute:
a) TV(Ber(0.5), Ber(0.1)) =

b) TV(Ber(0.5), Ber(0.9)) =

c) TV(Exp(1), Unif[0, 1]) =

d) TV(X, X + a) =
   for any a ∈ (0, 1), where X ∼ Ber(0.5)

e) TV(2√n(X̄_n − 1/2), Z) =
   where X_i ∼ Ber(0.5) i.i.d. and Z ∼ N(0, 1)
7/54
An estimation strategy

Build an estimator TV̂(P_θ, P_{θ*}) for all θ ∈ Θ. Then find θ̂ that
minimizes the function θ ↦ TV̂(P_θ, P_{θ*}).

Problem: unclear how to build TV̂(P_θ, P_{θ*})!
8/54
Kullback-Leibler (KL) divergence

There are many distances between probability measures that could
replace total variation. Let us choose one that is more convenient.

Definition
The Kullback-Leibler¹ (KL) divergence between two probability
measures P_θ and P_θ′ is defined by

    KL(P_θ, P_θ′) = Σ_{x∈E} p_θ(x) log( p_θ(x) / p_θ′(x) )       if E is discrete,

    KL(P_θ, P_θ′) = ∫_E f_θ(x) log( f_θ(x) / f_θ′(x) ) dx        if E is continuous.

¹ The KL divergence is also known as "relative entropy".        9/54
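As an illustration (our own sketch, not from the slides), the discrete formula can be evaluated directly; `kl_discrete` is a hypothetical helper, and we assume q(x) > 0 wherever p(x) > 0.

```python
import math

# Sketch: KL(P, Q) = sum_x p(x) * log(p(x) / q(x)) over the support of p.
# Assumes q puts mass wherever p does; kl_discrete is illustrative only.
def kl_discrete(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {0: 0.5, 1: 0.5}  # Ber(0.5)
q = {0: 0.9, 1: 0.1}  # Ber(0.1)
print(kl_discrete(p, q))  # differs from kl_discrete(q, p): KL is asymmetric
print(kl_discrete(p, p))  # 0.0: KL is definite
```

Running both orders makes the asymmetry concrete: KL(P, Q) and KL(Q, P) are both positive but unequal.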
Properties of KL divergence

▶ KL(P_θ, P_θ′) ≠ KL(P_θ′, P_θ) in general                      (not symmetric)
▶ KL(P_θ, P_θ′) ≥ 0
▶ If KL(P_θ, P_θ′) = 0 then P_θ = P_θ′                          (definite)
▶ KL(P_θ, P_θ′) ≰ KL(P_θ, P_θ″) + KL(P_θ″, P_θ′) in general     (no triangle inequality)

Not a distance. This is why it is called a divergence.

Asymmetry is the key to our ability to estimate it!

10/54
Maximum likelihood estimation

11/54
Estimating the KL

    KL(P_{θ*}, P_θ) = E_{θ*}[ log( p_{θ*}(X) / p_θ(X) ) ]
                    = E_{θ*}[ log p_{θ*}(X) ] − E_{θ*}[ log p_θ(X) ]

So the function θ ↦ KL(P_{θ*}, P_θ) is of the form:

    "constant" − E_{θ*}[ log p_θ(X) ]

Expectations can be estimated: E_{θ*}[h(X)] ≈ (1/n) Σ_{i=1}^n h(X_i)   (by the LLN)

    KL̂(P_{θ*}, P_θ) = "constant" − (1/n) Σ_{i=1}^n log p_θ(X_i)
12/54
Maximum likelihood

    KL̂(P_{θ*}, P_θ) = "constant" − (1/n) Σ_{i=1}^n log p_θ(X_i)

    min_{θ∈Θ} KL̂(P_{θ*}, P_θ)  ⇔  min_{θ∈Θ} − (1/n) Σ_{i=1}^n log p_θ(X_i)

                                ⇔  max_{θ∈Θ} (1/n) Σ_{i=1}^n log p_θ(X_i)

                                ⇔  max_{θ∈Θ} log ∏_{i=1}^n p_θ(X_i)

                                ⇔  max_{θ∈Θ} ∏_{i=1}^n p_θ(X_i)

This is the maximum likelihood principle.
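A tiny numerical sketch of the principle (the sample and the grid are made up, not from the slides): maximizing the Bernoulli log-likelihood over a grid of candidate p values recovers the sample mean.

```python
import math

# Made-up Bernoulli sample: n = 10 observations, 7 ones, so X̄_n = 0.7.
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_lik(p):
    # log-likelihood of the Bernoulli model at parameter p
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

# Grid search over (0, 1), avoiding the endpoints where log blows up.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.7, matching the sample mean
```

The grid search is only for illustration; the next slides derive the maximizer in closed form.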


13/54
Likelihood, discrete case (1)

Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample
of i.i.d. r.v. X₁, …, X_n. Assume that E is discrete (i.e., finite or
countable).

Definition

The likelihood of the model is the map L_n (or just L) defined as:

    L_n : Eⁿ × Θ → ℝ
    (x₁, …, x_n, θ) ↦ P_θ[X₁ = x₁, …, X_n = x_n].

14/54
Likelihood for the Bernoulli model

Example 1 (Bernoulli trials): If X₁, …, X_n ∼ Ber(p) i.i.d. for some
p ∈ (0, 1):

▶ E = {0, 1};
▶ Θ = (0, 1);
▶ ∀(x₁, …, x_n) ∈ {0, 1}ⁿ, ∀p ∈ (0, 1),

    L(x₁, …, x_n, p) = ∏_{i=1}^n P_p[X_i = x_i]
                     = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i}
                     = p^{Σ_i x_i} (1 − p)^{n − Σ_i x_i}.

15/54
Likelihood for the Poisson model

Example 2 (Poisson model):

If X₁, …, X_n ∼ Poiss(λ) i.i.d. for some λ > 0:

▶ E = ℕ;
▶ Θ = (0, ∞);
▶ ∀(x₁, …, x_n) ∈ ℕⁿ, ∀λ > 0,

    L(x₁, …, x_n, λ) = e^{−nλ} λ^{Σ_{i=1}^n x_i} / (x₁! ⋯ x_n!).

16/54
Likelihood, continuous case

Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample
of i.i.d. r.v. X₁, …, X_n. Assume that all the P_θ have density f_θ.

Definition

The likelihood of the model is the map L defined as:

    L : Eⁿ × Θ → ℝ
    (x₁, …, x_n, θ) ↦ ∏_{i=1}^n f_θ(x_i).

17/54
Likelihood for the Gaussian model

Example 1 (Gaussian model): If X₁, …, X_n ∼ N(μ, σ²) i.i.d. for
some μ ∈ ℝ, σ² > 0:

▶ E = ℝ;
▶ Θ = ℝ × (0, ∞);
▶ ∀(x₁, …, x_n) ∈ ℝⁿ, ∀(μ, σ²) ∈ ℝ × (0, ∞),

    L(x₁, …, x_n, μ, σ²) = 1/(σ√(2π))ⁿ · exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² ).

18/54
Exercises

Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with
X₁, …, X_n ∼ Exp(λ) i.i.d.

a) What is E?

b) What is Θ?

c) Find the likelihood of the model.

19/54
Exercise

Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with
X₁, …, X_n ∼ Unif[0, b] i.i.d. for some b > 0.

a) What is E?

b) What is Θ?

c) Find the likelihood of the model.

20/54
Maximum likelihood estimator

Let X₁, …, X_n be an i.i.d. sample associated with a statistical
model (E, (P_θ)_{θ∈Θ}) and let L be the corresponding likelihood.

Definition
The maximum likelihood estimator of θ is defined as:

    θ̂_n^MLE = argmax_{θ∈Θ} L(X₁, …, X_n, θ),

provided it exists.

Remark (log-likelihood): In practice, we use the fact that

    θ̂_n^MLE = argmax_{θ∈Θ} log L(X₁, …, X_n, θ).

21/54
Interlude: maximizing/minimizing functions

Note that

    min_{θ∈Θ} h(θ)  ⇔  max_{θ∈Θ} −h(θ)

In this class, we focus on maximization.

Maximization of arbitrary functions can be difficult:

Example: θ ↦ ∏_{i=1}^n (θ − X_i)

22/54
Concave and convex functions
Definition
A twice differentiable function h : Θ ⊂ ℝ → ℝ is said to
be concave if its second derivative satisfies

    h″(θ) ≤ 0,  ∀θ ∈ Θ.

It is said to be strictly concave if the inequality is strict:
h″(θ) < 0.

Moreover, h is said to be (strictly) convex if −h is (strictly)
concave, i.e. h″(θ) ≥ 0 (h″(θ) > 0).

Examples:
▶ Θ = ℝ, h(θ) = −θ²,
▶ Θ = (0, ∞), h(θ) = √θ,
▶ Θ = (0, ∞), h(θ) = log θ,
▶ Θ = [0, π], h(θ) = sin(θ),
▶ Θ = ℝ, h(θ) = 2θ − 3.
23/54
Multivariate concave functions
More generally for a multivariate function h : Θ ⊂ ℝ^d → ℝ,
d ≥ 2, define the

▶ gradient vector: ∇h(θ) = ( ∂h/∂θ₁(θ), …, ∂h/∂θ_d(θ) )ᵀ ∈ ℝ^d

▶ Hessian matrix:

    Hh(θ) = ( ∂²h/∂θ_i ∂θ_j (θ) )_{1≤i,j≤d} ∈ ℝ^{d×d}

h is concave ⇔ xᵀ Hh(θ) x ≤ 0 for all x ∈ ℝ^d, θ ∈ Θ.

h is strictly concave ⇔ xᵀ Hh(θ) x < 0 for all x ∈ ℝ^d ∖ {0}, θ ∈ Θ.

Examples:
▶ Θ = ℝ², h(θ) = −θ₁² − 2θ₂² or h(θ) = −(θ₁ − θ₂)²
▶ Θ = (0, ∞)², h(θ) = log(θ₁ + θ₂).
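The quadratic-form criterion can be checked numerically: xᵀ H x < 0 for all x ≠ 0 exactly when every eigenvalue of the (symmetric) Hessian is negative. A sketch for the first example, whose Hessian is constant; the eigenvalue test is our own illustration, not from the slides.

```python
import numpy as np

# Hessian of h(θ) = -θ₁² - 2θ₂², which is constant in θ.
H = np.array([[-2.0, 0.0],
              [0.0, -4.0]])

# eigvalsh handles symmetric matrices; all eigenvalues < 0 means the
# quadratic form xᵀHx is negative for every x ≠ 0, i.e. strict concavity.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)  # both negative => h is strictly concave
```

For a non-quadratic h the Hessian depends on θ, so this test would have to hold at every point of Θ, not just one.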
24/54
Optimality conditions

Strictly concave functions are easy to maximize: if they have a
maximum, then it is unique. It is the unique solution to

    h′(θ) = 0,

or, in the multivariate case,

    ∇h(θ) = 0 ∈ ℝ^d.

There are many algorithms to find it numerically: this is the theory
of "convex optimization". In this class, we will often have a
closed-form formula for the maximum.

25/54
Exercises

a) Which of the following functions are concave on Θ = ℝ²?
1. h(θ) = −(θ₁ − θ₂)² − θ₁ θ₂
2. h(θ) = −(θ₁ − θ₂)² + θ₁ θ₂
3. h(θ) = (θ₁ − θ₂)² − θ₁ θ₂
4. Both 1. and 2.
5. All of the above
6. None of the above

b) Let h : Θ ⊂ ℝ^d → ℝ be a function whose Hessian matrix Hh(θ)
has a positive diagonal entry for some θ ∈ Θ. Can h be concave?
Why or why not?

26/54
Examples of maximum likelihood estimators

▶ Bernoulli trials: p̂_n^MLE = X̄_n.

▶ Poisson model: λ̂_n^MLE = X̄_n.

▶ Gaussian model: (μ̂_n, σ̂_n²) = (X̄_n, Ŝ_n), where Ŝ_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)².
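A quick sanity check for the Gaussian case (made-up data, our own sketch): the closed-form MLE (X̄_n, Ŝ_n) should achieve a higher log-likelihood than nearby parameter values.

```python
import math

# Made-up sample; the perturbation sizes below are arbitrary.
xs = [2.1, 1.9, 2.5, 1.7, 2.3]
n = len(xs)
mu_hat = sum(xs) / n                                  # X̄_n
s_hat = sum((x - mu_hat) ** 2 for x in xs) / n        # Ŝ_n (MLE variance)

def log_lik(mu, var):
    # Gaussian log-likelihood of the sample at (μ, σ²)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

best = log_lik(mu_hat, s_hat)
perturbed = [log_lik(mu_hat + d, s_hat + e)
             for d in (-0.1, 0.0, 0.1) for e in (-0.02, 0.0, 0.02)
             if (d, e) != (0.0, 0.0)]
print(best > max(perturbed))  # True: the closed form is a local maximum
```

Since the Gaussian log-likelihood has a unique maximizer, the closed form in fact beats every other parameter value, not just these perturbations.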

27/54
Consistency of maximum likelihood estimator

Under mild regularity conditions, we have

    θ̂_n^MLE → θ*   in probability (w.r.t. P_{θ*}) as n → ∞.

This is because for all θ ∈ Θ,

    −(1/n) log L(X₁, …, X_n, θ) → "constant" + KL(P_{θ*}, P_θ)   in probability as n → ∞.

Moreover, the minimizer of the right-hand side is θ* if the
parameter is identifiable.
Technical conditions allow us to transfer this convergence to the
minimizers.

28/54
Covariance

How about asymptotic normality?

In general, when θ ∈ ℝ^d, d ≥ 2, its coordinates are not necessarily
independent.
The covariance between two random variables X and Y is

    Cov(X, Y) = E[ (X − E[X]) (Y − E[Y]) ]
              = E[X · Y] − E[X] E[Y]
              = E[ X · (Y − E[Y]) ]

29/54
Properties
▶ Cov(X, X) = Var(X)
▶ Cov(X, Y) = Cov(Y, X)
▶ If X and Y are independent, then Cov(X, Y) = 0

⚠ In general, the converse is not true, except if (X, Y)ᵀ
is a Gaussian vector², i.e., αX + βY is Gaussian for all
(α, β) ∈ ℝ² ∖ {(0, 0)}.

Take X ∼ N(0, 1), B ∼ Ber(1/2), R = 2B − 1 ∼ Rad(1/2). Then

    Y = R · X ∼ N(0, 1).

But taking α = β = 1, we get

    X + Y = 2X with prob. 1/2,  0 with prob. 1/2,

which is not Gaussian.
Actually Cov(X, Y) = 0 but X and Y are not independent: |X| = |Y|.
30/54
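The counterexample is easy to simulate (our own sketch; the seed and sample size are arbitrary): the empirical covariance is near zero even though |X| = |Y| holds exactly.

```python
import numpy as np

# Simulate X ~ N(0,1) and Y = R·X with independent Rademacher signs R.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
r = rng.choice([-1.0, 1.0], size=x.size)   # R = 2B - 1 with B ~ Ber(1/2)
y = r * x                                  # Y is also N(0, 1)

print(np.allclose(np.abs(x), np.abs(y)))   # True: |X| = |Y|, so dependent
print(abs(np.mean(x * y)) < 0.05)          # True: empirical Cov(X, Y) ≈ 0
```

Both means are zero, so the empirical covariance is just the average of the products x·y, which concentrates around E[R X²] = 0.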
Covariance matrix
The covariance matrix of a random vector
X = (X⁽¹⁾, …, X⁽ᵈ⁾)ᵀ ∈ ℝ^d is given by

    Σ = Cov(X) = E[ (X − E[X]) (X − E[X])ᵀ ]

This is a matrix of size d × d.

The term on the ith row and jth column is

    Σ_ij = E[ (X⁽ⁱ⁾ − E[X⁽ⁱ⁾]) (X⁽ʲ⁾ − E[X⁽ʲ⁾]) ] = Cov(X⁽ⁱ⁾, X⁽ʲ⁾)

In particular, on the diagonal, we have

    Σ_ii = Var(X⁽ⁱ⁾)

Recall that for X ∈ ℝ, Var(aX + b) = a² Var(X). Analogously, if
X ∈ ℝ^d and A, B are matrices (of compatible sizes):

    Cov(AX + B) = A Cov(X) Aᵀ
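The affine rule also holds exactly for empirical covariance matrices, which gives a quick check (the data, A, and B below are made up for illustration):

```python
import numpy as np

# Four made-up samples in R², one per row.
X = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 5.0], [4.0, 3.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([5.0, -1.0])

Y = X @ A.T + B                        # each row is A x + B
cov_X = np.cov(X, rowvar=False)        # rowvar=False: rows are observations
cov_Y = np.cov(Y, rowvar=False)
print(np.allclose(cov_Y, A @ cov_X @ A.T))  # True: the shift B drops out
```

The shift B cancels because covariance only depends on deviations from the mean, exactly as Var(aX + b) ignores b in the scalar case.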
31/54
The multivariate Gaussian distribution
If (X, Y)ᵀ is a Gaussian vector then its pdf depends on 5
parameters:

    E[X], E[Y], Var(X), Var(Y) and Cov(X, Y)

More generally, a Gaussian vector³ X ∈ ℝ^d is completely
determined by its expected value E[X] = μ ∈ ℝ^d and covariance
matrix Σ. We write

    X ∼ N_d(μ, Σ).

If Σ is invertible, it has pdf over ℝ^d given by:

    1 / ( (2π)^{d/2} det(Σ)^{1/2} ) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

³ As before, this means that αᵀX is Gaussian for any α ∈ ℝ^d, α ≠ 0.   32/54
The multivariate CLT

The CLT may be generalized to averages of random vectors (also
vectors of averages).
Let X₁, …, X_n ∈ ℝ^d be independent copies of a random vector X
such that E[X] = μ and Cov(X) = Σ. Then

    √n (X̄_n − μ) → N_d(0, Σ)   in distribution as n → ∞.

Equivalently,

    √n Σ^{−1/2} (X̄_n − μ) → N_d(0, I_d)   in distribution as n → ∞.

33/54
Multivariate Delta method
Let (T_n)_{n≥1} be a sequence of random vectors in ℝ^d such that

    √n (T_n − θ) → N_d(0, Σ)   in distribution as n → ∞,

for some θ ∈ ℝ^d and some covariance matrix Σ ∈ ℝ^{d×d}.

Let g : ℝ^d → ℝ^k (k ≥ 1) be continuously differentiable at θ.
Then,

    √n ( g(T_n) − g(θ) ) → N_k( 0, ∇g(θ)ᵀ Σ ∇g(θ) )   in distribution as n → ∞,

where ∇g(θ) = (∂g/∂θ)(θ) = ( ∂g_j/∂θ_i (θ) )_{1≤i≤d, 1≤j≤k} ∈ ℝ^{d×k}.

34/54
Fisher Information

Definition: Fisher information

Define the log-likelihood for one observation as:

    ℓ(θ) = log L₁(X, θ),  θ ∈ Θ ⊂ ℝ^d

Assume that ℓ is a.s. twice differentiable. Under some regularity
conditions, the Fisher information of the statistical model is
defined as:

    I(θ) = E[ ∇ℓ(θ) ∇ℓ(θ)ᵀ ] − E[ ∇ℓ(θ) ] E[ ∇ℓ(θ) ]ᵀ = −E[ Hℓ(θ) ].

If Θ ⊂ ℝ, we get:

    I(θ) = Var[ ℓ′(θ) ] = −E[ ℓ″(θ) ]

35/54
Fisher information of the Bernoulli experiment

Let X ∼ Ber(p). Then

    ℓ(p) = X log p + (1 − X) log(1 − p)

    ℓ′(p) = X/p − (1 − X)/(1 − p),         Var[ℓ′(p)] = 1/(p(1 − p))

    ℓ″(p) = −X/p² − (1 − X)/(1 − p)²,      −E[ℓ″(p)] = 1/(p(1 − p))

So I(p) = 1/(p(1 − p)).
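The last line can be verified mechanically by taking the expectation of −ℓ″(p) using E[X] = p (our own sketch; the value p = 0.3 is arbitrary):

```python
p = 0.3  # arbitrary parameter in (0, 1)

def neg_ell_pp(x, p):
    # -ℓ''(p) for a single observation x
    return x / p**2 + (1 - x) / (1 - p) ** 2

# E[-ℓ''(p)] = P(X=1)·(-ℓ'' at x=1) + P(X=0)·(-ℓ'' at x=0)
fisher = p * neg_ell_pp(1, p) + (1 - p) * neg_ell_pp(0, p)
print(abs(fisher - 1 / (p * (1 - p))) < 1e-9)  # True
```

Algebraically, the expectation collapses to 1/p + 1/(1 − p) = 1/(p(1 − p)), matching the slide.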

36/54
Asymptotic normality of the MLE

Theorem

Let θ* ∈ Θ (the true parameter). Assume the following:

1. The parameter is identifiable;
2. For all θ ∈ Θ, the support of P_θ does not depend on θ;
3. θ* is not on the boundary of Θ;
4. I(θ) is invertible in a neighborhood of θ*;
5. A few more technical conditions.

Then, θ̂_n^MLE satisfies:

▶ θ̂_n^MLE → θ* in probability, w.r.t. P_{θ*};

▶ √n ( θ̂_n^MLE − θ* ) → N_d( 0, I(θ*)⁻¹ )   in distribution as n → ∞, w.r.t. P_{θ*}.

37/54
The method of moments

38/54
Moments
Let X₁, …, X_n be an i.i.d. sample associated with a statistical
model (E, (P_θ)_{θ∈Θ}).

▶ Assume that E ⊆ ℝ and Θ ⊆ ℝ^d, for some d ≥ 1.

▶ Population moments: m_k(θ) = E_θ[X₁^k], 1 ≤ k ≤ d.

▶ Empirical moments: m̂_k = X̄_n^{(k)} = (1/n) Σ_{i=1}^n X_i^k, 1 ≤ k ≤ d.

▶ From the LLN,

    m̂_k → m_k(θ)   (in probability / a.s. as n → ∞)

More compactly, the whole vector converges:

    (m̂₁, …, m̂_d) → (m₁(θ), …, m_d(θ))   (in probability / a.s. as n → ∞)
39/54
Moments estimator
Let

    M : Θ → ℝ^d
    θ ↦ M(θ) = (m₁(θ), …, m_d(θ)).

Assume M is one-to-one:

    θ = M⁻¹(m₁(θ), …, m_d(θ)).

Definition

Moments estimator of θ:

    θ̂_n^MM = M⁻¹(m̂₁, …, m̂_d),

provided it exists.
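A sketch for the Gaussian model with θ = (μ, σ²) (made-up data, our own illustration): here M(μ, σ²) = (μ, μ² + σ²), so M⁻¹(m₁, m₂) = (m₁, m₂ − m₁²).

```python
# Made-up sample; the inversion below is the Gaussian special case of M⁻¹.
xs = [1.0, 3.0, 2.0, 4.0, 2.0]
n = len(xs)
m1_hat = sum(xs) / n                   # first empirical moment
m2_hat = sum(x * x for x in xs) / n    # second empirical moment

# Apply M⁻¹ to the empirical moments:
mu_hat = m1_hat
sigma2_hat = m2_hat - m1_hat ** 2      # equals (1/n) Σ (x_i - X̄_n)²
print(mu_hat, sigma2_hat)
```

For this model the moments estimator coincides with the Gaussian MLE (X̄_n, Ŝ_n); in general the two estimators differ.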
40/54
Statistical analysis

▶ Recall M(θ) = (m₁(θ), …, m_d(θ));

▶ Let M̂ = (m̂₁, …, m̂_d);

▶ Let Σ(θ) = Cov_θ(X₁, X₁², …, X₁^d) be the covariance matrix
  of the random vector (X₁, X₁², …, X₁^d), which we assume to
  exist;

▶ Assume M⁻¹ is continuously differentiable at M(θ).

41/54
Method of moments (5)
Remark: The method of moments can be extended to more
general moments, even when E ⊄ ℝ.

▶ Let g₁, …, g_d : E → ℝ be given functions, chosen by the
  practitioner.

▶ Previously, g_k(x) = x^k, x ∈ E = ℝ, for all k = 1, …, d.

▶ Define m_k(θ) = E_θ[g_k(X)], for all k = 1, …, d.

▶ Let Σ(θ) = Cov_θ(g₁(X₁), g₂(X₁), …, g_d(X₁)) be the
  covariance matrix of the random vector
  (g₁(X₁), g₂(X₁), …, g_d(X₁)), which we assume to exist.

▶ Assume M is one-to-one and M⁻¹ is continuously
  differentiable at M(θ).
42/54
Generalized method of moments

Applying the multivariate CLT and Delta method yields:

Theorem

    √n ( θ̂_n^MM − θ ) → N_d(0, Γ(θ))   in distribution as n → ∞ (w.r.t. P_θ),

where Γ(θ) = [ (∂M/∂θ)(θ) ]⁻¹ᵀ Σ(θ) [ (∂M/∂θ)(θ) ]⁻¹.

43/54
MLE vs. moments estimator

▶ Comparison of the quadratic risks: in general, the MLE is
  more accurate.

▶ The MLE still gives good results if the model is misspecified.

▶ Computational issues: sometimes the MLE is intractable, but
  MM is easier (polynomial equations).

44/54
M-estimation

45/54
M-estimators

Idea:
▶ Let X₁, …, X_n be i.i.d. with some unknown distribution P in
  some sample space E (E ⊆ ℝ^d for some d ≥ 1).
▶ No statistical model needs to be assumed (similar to ML).

▶ Goal: estimate some parameter μ* associated with P, e.g. its
  mean, variance, median, other quantiles, the true parameter in
  some statistical model…
▶ Find a function ρ : E × M → ℝ, where M is the set of all
  possible values for the unknown μ*, such that:

    Q(μ) := E[ρ(X₁, μ)]

  achieves its minimum at μ = μ*.

46/54
Examples (1)

▶ If E = M = ℝ and ρ(x, μ) = (x − μ)², for all x ∈ ℝ, μ ∈ ℝ:

  μ* = E[X₁], the mean of P.

▶ If E = M = ℝ^d and ρ(x, μ) = ‖x − μ‖₂², for all
  x ∈ ℝ^d, μ ∈ ℝ^d: μ* = E[X₁].

▶ If E = M = ℝ and ρ(x, μ) = |x − μ|, for all x ∈ ℝ, μ ∈ ℝ:

  μ* is a median of P.

47/54
Examples (2)
If E = M = ℝ, α ∈ (0, 1) is fixed and ρ(x, μ) = C_α(x − μ), for
all x ∈ ℝ, μ ∈ ℝ: μ* is an α-quantile of P.

Check function:

    C_α(x) = −(1 − α) x   if x < 0
    C_α(x) = α x          if x ≥ 0.
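A sketch of the claim (made-up data, not from the slides): minimizing the empirical check loss recovers an empirical α-quantile, and since the loss is piecewise linear, a minimizer can always be found among the data points.

```python
# Made-up sample; α = 0.5 makes the α-quantile the median.
xs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
alpha = 0.5

def check_loss(x):
    # C_α(x) = -(1 - α) x for x < 0,  α x for x >= 0
    return alpha * x if x >= 0 else -(1 - alpha) * x

def Q_n(mu):
    # empirical M-estimation objective (the 1/n factor does not change argmin)
    return sum(check_loss(x - mu) for x in xs)

mu_hat = min(xs, key=Q_n)  # search over data points, where a minimizer lies
print(mu_hat)  # 3.0, the sample median for α = 0.5
```

Changing `alpha` to, say, 0.25 moves the minimizer to an empirical lower quartile instead of the median.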

48/54
MLE is an M-estimator

Assume that (E, {P_θ}_{θ∈Θ}) is a statistical model associated with
the data.

Theorem

Let M = Θ and ρ(x, θ) = −log L₁(x, θ), provided the likelihood is
positive everywhere. Then,

    μ* = θ*,

where P = P_{θ*} (i.e., θ* is the true value of the parameter).

49/54
Definition

▶ Define the M-estimator μ̂_n as a minimizer of:

    Q_n(μ) := (1/n) Σ_{i=1}^n ρ(X_i, μ).

▶ Examples: empirical mean, empirical median, empirical
  quantiles, MLE, etc.

50/54
Statistical analysis

▶ Let J(μ) = ∂²Q/∂μ∂μᵀ (μ)   ( = E[ ∂²ρ/∂μ∂μᵀ (X₁, μ) ] under
  some regularity conditions).

▶ Let K(μ) = Cov[ ∂ρ/∂μ (X₁, μ) ].

▶ Remark: In the log-likelihood case (write μ = θ),

    J(θ*) = K(θ*) = I(θ*), the Fisher information.

51/54
Asymptotic normality

Let μ* ∈ M (the true parameter). Assume the following:

1. μ* is the only minimizer of the function Q;
2. J(μ) is invertible for all μ ∈ M;
3. A few more technical conditions.

Then, μ̂_n satisfies:

▶ μ̂_n → μ* in probability as n → ∞;

▶ √n ( μ̂_n − μ* ) → N( 0, J(μ*)⁻¹ K(μ*) J(μ*)⁻¹ )   in distribution as n → ∞.

52/54
M-estimators in robust statistics

Example: location parameter

If X₁, …, X_n are i.i.d. with density f(· − m), where:

▶ f is an unknown, positive, even function (e.g., the Cauchy
  density);
▶ m is a real number of interest, a location parameter.

How to estimate m?

▶ M-estimators: empirical mean, empirical median, …
▶ Compare their risks or asymptotic variances;
▶ The empirical median is more robust.

53/54
Recap

▶ Three principled methods for estimation: maximum likelihood,
  method of moments, M-estimators
▶ Maximum likelihood is an example of M-estimation
▶ Method of moments inverts the function that maps
  parameters to moments
▶ All methods yield asymptotic normality under regularity
  conditions
▶ The asymptotic covariance matrix can be computed using the
  multivariate Δ-method
▶ For the MLE, the asymptotic covariance matrix is the inverse
  Fisher information matrix

54/54
