Theoretical Statistics. Lecture 15: M-Estimators. Consistency of M-Estimators. Nonparametric Maximum Likelihood

The document discusses M-estimators and Z-estimators. M-estimators are defined by maximizing a criterion function based on a function $m_\theta$, while Z-estimators are defined as solutions to estimating equations based on a function $\psi_\theta$. Consistency of M-estimators and Z-estimators can be shown under conditions including a uniform law of large numbers and identifiability. Nonparametric maximum likelihood estimation involves estimating an unknown density $p_0$ by maximizing the log-likelihood over a family of densities $\mathcal{P}$. Hellinger consistency of the maximum likelihood estimator can be shown under appropriate conditions on $\mathcal{P}$ and the empirical process.

Theoretical Statistics. Lecture 15.

Peter Bartlett

M-Estimators.
Consistency of M-Estimators.
Nonparametric maximum likelihood.

1
M-estimators

Goal: estimate a parameter $\theta$ of the distribution $P$ of observations
$X_1, \ldots, X_n$.

Define a criterion $\theta \mapsto M_n(\theta)$ in terms of functions $m_\theta : \mathcal{X} \to \mathbb{R}$,

$$M_n(\theta) = P_n m_\theta.$$

The estimator $\hat\theta = \arg\max_\theta M_n(\theta)$ is called an M-estimator (M for
maximum).

Example: maximum likelihood uses

$$m_\theta(x) = \log p_\theta(x).$$
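As an illustration (not part of the original slides), here is a minimal numerical sketch of maximum likelihood as an M-estimator. The exponential model, the sample size, and the grid search are all assumptions made for the example.

```python
# Minimal sketch: maximum likelihood as an M-estimator.
# Model, sample size, and grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=500)   # X_1,...,X_n from p_theta0 with theta0 = 2

def M_n(theta):
    """Empirical criterion M_n(theta) = P_n m_theta, with m_theta(x) = log p_theta(x)
    for the exponential model p_theta(x) = theta * exp(-theta * x)."""
    return np.mean(np.log(theta) - theta * x)

grid = np.linspace(0.1, 10, 2000)
theta_hat = grid[np.argmax([M_n(t) for t in grid])]

print(theta_hat)        # grid maximizer of P_n m_theta
print(1 / x.mean())     # closed-form MLE; should be close to the grid maximizer
```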

2
Z-estimators

Can maximize by setting derivatives to zero:

$$\Psi_n(\theta) = P_n \psi_\theta = 0.$$

These are estimating equations. van der Vaart calls this a Z-estimator (Z
for zero), but it's often called an M-estimator (even if there's no
maximization).

Example: maximum likelihood:

$$\psi_\theta(x) = \nabla_\theta \log p_\theta(x).$$

3
M-estimators and Z-estimators

Of course, sometimes we cannot transform an M-estimator into a
Z-estimator. Example: $p_\theta = $ uniform on $[0, \theta]$ is not differentiable in $\theta$, and
there is no natural Z-estimator. The M-estimator chooses

$$\hat\theta = \arg\max_\theta P_n m_\theta
 = \arg\max_\theta P_n \log \frac{1[X \in [0, \theta]]}{\theta}
 = \max_i X_i.$$
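A quick numerical sanity check of this example (illustrative; the sample and the grid are arbitrary choices): the empirical criterion is maximized at $\max_i X_i$.

```python
# Sanity check: for the uniform[0, theta] model, the grid maximizer of
# P_n log( theta^{-1} 1[x in [0, theta]] ) sits at max_i X_i.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3.0, size=200)

def M_n(theta):
    # log-likelihood is -infinity unless every observation lies in [0, theta]
    return -np.inf if x.max() > theta else -np.log(theta)

grid = np.linspace(0.01, 6, 5000)
theta_hat = grid[np.argmax([M_n(t) for t in grid])]
print(theta_hat, x.max())   # grid maximizer is the first grid point above max_i X_i
```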

4
M-estimators and Z-estimators: Examples

Mean:

$$m_\theta(x) = -(x - \theta)^2, \qquad \psi_\theta(x) = x - \theta.$$

Median:

$$m_\theta(x) = -|x - \theta|, \qquad \psi_\theta(x) = \mathrm{sign}(x - \theta).$$
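A small sketch of how these criteria can be used in practice (illustrative only; the heavy-tailed sample and the use of scipy's bounded scalar minimizer are assumptions of the example): minimizing $P_n (x - \theta)^2$ recovers the sample mean and minimizing $P_n |x - \theta|$ recovers the sample median.

```python
# The mean and median as M-estimators: maximize P_n m_theta,
# i.e. minimize the corresponding empirical loss.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=1000) + 5.0       # heavy-tailed sample, location 5

mean_hat = minimize_scalar(lambda t: np.mean((x - t) ** 2),
                           bounds=(x.min(), x.max()), method="bounded").x
median_hat = minimize_scalar(lambda t: np.mean(np.abs(x - t)),
                             bounds=(x.min(), x.max()), method="bounded").x

print(mean_hat, x.mean())        # should agree
print(median_hat, np.median(x))  # should agree, up to optimizer tolerance
```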

5
M-estimators and Z-estimators: Examples

Huber: [picture: the Huber criterion $r_k$ and the clipped function $[\,\cdot\,]_{-k}^{k}$]

$$m_\theta(x) = -r_k(x - \theta),$$

$$r_k(x) =
\begin{cases}
\tfrac{1}{2}k^2 - k(x + k) & \text{if } x < -k,\\[2pt]
\tfrac{1}{2}x^2 & \text{if } |x| \le k,\\[2pt]
\tfrac{1}{2}k^2 + k(x - k) & \text{if } x > k.
\end{cases}$$

$$\psi_\theta(x) = [x - \theta]_{-k}^{k},$$

$$[x]_{-k}^{k} =
\begin{cases}
-k & \text{if } x < -k,\\
x & \text{if } |x| \le k,\\
k & \text{if } x > k.
\end{cases}$$

These are all location estimators: $m_\theta(x) = m(x - \theta)$, $\psi_\theta(x) = \psi(x - \theta)$.
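An illustrative sketch of the Huber location estimate computed as a Z-estimator: solve $P_n \psi_\theta = 0$ by root finding. The contaminated sample, the tuning constant $k = 1.345$, and the use of scipy's brentq root finder are assumptions of the example, not part of the lecture.

```python
# Huber location estimator as a Z-estimator: find the zero of
# Psi_n(theta) = P_n psi_theta, with psi_theta(x) = clip(x - theta, -k, k).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(50.0, 1.0, 50)])  # 5% outliers

k = 1.345                                        # common choice of tuning constant
psi = lambda t: np.mean(np.clip(x - t, -k, k))   # Psi_n(theta)

# Psi_n is positive at min(x) and negative at max(x), so a root exists in between.
theta_hat = brentq(psi, x.min(), x.max())
print(theta_hat, np.mean(x), np.median(x))  # Huber estimate stays near the bulk, unlike the mean
```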

6
Consistency of M-estimators and Z-estimators

We want to show that $\hat\theta \stackrel{P}{\to} \theta_0$, where $\hat\theta$ approximately maximizes
$M_n(\theta) = P_n m_\theta$ and $\theta_0$ maximizes $M(\theta) = P m_\theta$. We use a ULLN.

Theorem: Suppose that

1. $\sup_\theta |M_n(\theta) - M(\theta)| \stackrel{P}{\to} 0$,
2. For all $\epsilon > 0$, $\sup\{M(\theta) : d(\theta, \theta_0) \ge \epsilon\} < M(\theta_0)$, and
3. $M_n(\hat\theta_n) \ge M_n(\theta_0) - o_P(1)$.

Then $\hat\theta_n \stackrel{P}{\to} \theta_0$.

(2) is an identifiability condition: approximately maximizing $M(\theta)$
unambiguously specifies $\theta_0$. It suffices if there is a unique maximizer, $\Theta$ is
compact, and $M$ is continuous.
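To make condition 1 concrete, here is a small simulation (not from the slides) for the median criterion $m_\theta(x) = -|x - \theta|$ with $P = N(0,1)$, where $M(\theta) = -E|X - \theta|$ has a closed form; the grid and sample sizes are arbitrary choices.

```python
# Empirical illustration of the ULLN: sup over a grid of |M_n(theta) - M(theta)|
# shrinks as n grows, for m_theta(x) = -|x - theta| and P = N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
grid = np.linspace(-3, 3, 601)
# E|X - theta| = 2*phi(theta) + theta*(2*Phi(theta) - 1) for X ~ N(0,1)
M = -(2 * norm.pdf(grid) + grid * (2 * norm.cdf(grid) - 1))

for n in [100, 1000, 10000]:
    x = rng.standard_normal(n)
    M_n = np.array([-np.mean(np.abs(x - t)) for t in grid])
    print(n, np.max(np.abs(M_n - M)))     # sup over the grid, decreasing in n
```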

7
Proof

From (2), for all $\epsilon > 0$ there is a $\delta > 0$ such that

$$\begin{aligned}
\Pr(d(\hat\theta_n, \theta_0) \ge \epsilon)
&\le \Pr(M(\theta_0) - M(\hat\theta_n) \ge \delta)\\
&= \Pr(M(\theta_0) - M_n(\theta_0) + M_n(\theta_0) - M_n(\hat\theta_n) + M_n(\hat\theta_n) - M(\hat\theta_n) \ge \delta)\\
&\le \Pr(M(\theta_0) - M_n(\theta_0) \ge \delta/3) + \Pr(M_n(\theta_0) - M_n(\hat\theta_n) \ge \delta/3)\\
&\qquad + \Pr(M_n(\hat\theta_n) - M(\hat\theta_n) \ge \delta/3).
\end{aligned}$$

Then (1) implies the first and third probabilities go to zero, and (3) implies
the second probability goes to zero.

8
Consistency of M-estimators and Z-estimators

Same thing for Z-estimators: Finding $\hat\theta$ that is an approximate zero of
$\Psi_n(\theta) = P_n \psi_\theta$ leads to $\hat\theta \stackrel{P}{\to} \theta_0$, which is the unique zero of $\Psi(\theta) = P \psi_\theta$.

Theorem: Suppose that

1. $\sup_\theta \|\Psi_n(\theta) - \Psi(\theta)\| \stackrel{P}{\to} 0$,
2. For all $\epsilon > 0$, $\inf\{\|\Psi(\theta)\| : d(\theta, \theta_0) \ge \epsilon\} > 0 = \|\Psi(\theta_0)\|$, and
3. $\Psi_n(\hat\theta_n) = o_P(1)$.

Then $\hat\theta_n \stackrel{P}{\to} \theta_0$.

Proof: Choosing $M_n(\theta) = -\|\Psi_n(\theta)\|$ and $M(\theta) = -\|\Psi(\theta)\|$ in the
previous theorem implies the result.

9
Example: Sample median

Sample median $\hat\theta_n$ is the zero of

$$P_n \psi_\theta(X) = P_n\, \mathrm{sign}(X - \theta).$$

Suppose that $P$ is continuous and positive around the median, and check the
conditions:

1. The class $\{x \mapsto \mathrm{sign}(x - \theta) : \theta \in \mathbb{R}\}$ is Glivenko-Cantelli.

2. The population median is unique, so for all $\epsilon > 0$,
$$P(X < \theta_0 - \epsilon) < \frac{1}{2} < P(X < \theta_0 + \epsilon).$$

3. The sample median always has $|P_n\, \mathrm{sign}(X - \hat\theta_n)| = 0$.
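A short simulation (illustrative; the Cauchy location model and the sample sizes are arbitrary choices) confirming the conclusion: the sample median converges to the population median and approximately zeroes the estimating equation.

```python
# Consistency of the sample median: theta_n -> theta_0 and P_n sign(X - theta_n) ~ 0.
import numpy as np

rng = np.random.default_rng(5)
theta0 = 1.0                               # population median of a Cauchy centred at 1
for n in [100, 1000, 10000, 100000]:
    x = theta0 + rng.standard_cauchy(n)
    theta_n = np.median(x)
    print(n, theta_n, np.mean(np.sign(x - theta_n)))
```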

10
ULLN and M-estimators

Notice the ULLN condition:

$$\sup_\theta |M_n(\theta) - M(\theta)| \stackrel{P}{\to} 0.$$

Typically, this requires the empirical process $\theta \mapsto P_n m_\theta$ to be totally
bounded. This can be problematic if $m_\theta$ is unbounded. For instance:

Mean: $m_\theta(x) = -(x - \theta)^2$,
Median: $m_\theta(x) = -|x - \theta|$.

We can get around the problem by restricting to a compact set where most
of the mass of $P$ lies, and showing that this does not affect the asymptotics.
In that case, we can also restrict $\theta$ to an appropriate compact subset.

11
Non-parametric maximum likelihood

Estimate $P$ on $\mathcal{X}$. Suppose it has a density

$$p_0 = \frac{dP}{d\mu} \in \mathcal{P},$$

where $\mathcal{P}$ is a family of densities. Define the maximum likelihood estimate

$$\hat p_n = \arg\max_{p \in \mathcal{P}} P_n \log p.$$

We'll show conditions for which $\hat p_n$ is Hellinger consistent, that is,
$h(\hat p_n, p_0) \stackrel{as}{\to} 0$, where $h$ is the Hellinger distance:

$$h(p, q) = \left( \frac{1}{2} \int \left( p^{1/2} - q^{1/2} \right)^2 d\mu \right)^{1/2}.$$

[The 1/2 ensures $0 \le h(p, q) \le 1$.]

12
Hellinger distance

We have

$$\begin{aligned}
h(p, q)^2 &= \frac{1}{2} \int \left( p^{1/2} - q^{1/2} \right)^2 d\mu\\
&= \frac{1}{2} \int \left( p + q - 2 p^{1/2} q^{1/2} \right) d\mu\\
&= 1 - \int p^{1/2} q^{1/2}\, d\mu.
\end{aligned}$$

This latter integral is called the Hellinger affinity. Expressing $h$ in this form
can simplify its calculation for product densities. Notice that, by
Cauchy-Schwarz,

$$\int p^{1/2} q^{1/2}\, d\mu \le \left( \int p\, d\mu \int q\, d\mu \right)^{1/2} = 1,$$

so $h(p, q) \in [0, 1]$.
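A numerical check (illustrative; the two normal densities are arbitrary choices) that the defining integral and the affinity form of $h(p, q)^2$ agree, and that $h(p, q)$ lies in $[0, 1]$.

```python
# The two expressions for the squared Hellinger distance agree,
# here computed by quadrature for a pair of normal densities.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
q = lambda x: norm.pdf(x, loc=1.0, scale=2.0)

h2_direct, _ = quad(lambda x: 0.5 * (np.sqrt(p(x)) - np.sqrt(q(x))) ** 2, -np.inf, np.inf)
affinity, _ = quad(lambda x: np.sqrt(p(x) * q(x)), -np.inf, np.inf)

print(h2_direct, 1 - affinity)     # the two forms of h(p, q)^2 agree
print(np.sqrt(h2_direct))          # h(p, q) is in [0, 1]
```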

13
Non-parametric maximum likelihood

The Kullback-Leibler divergence between $p$ and $q$ is

$$d_{KL}(p, q) = \int \log\frac{q}{p}\, q\, d\mu.$$

Clearly, $d_{KL}(p, p) = 0$. Also, since $-\log(\cdot)$ is convex, Jensen's inequality gives

$$d_{KL}(p, q) = \int -\log\frac{p}{q}\, q\, d\mu \ge -\log \int \frac{p}{q}\, q\, d\mu = -\log \int p\, d\mu = 0.$$
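A numerical check (illustrative; the two normal densities are arbitrary) that $d_{KL}(p, q) \ge 0$ and $d_{KL}(p, p) = 0$, with the quadrature value compared against the known closed form for normal densities.

```python
# d_KL(p, q) = int q log(q/p) d(mu), computed by quadrature for two normals.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
q = lambda x: norm.pdf(x, loc=0.5, scale=1.5)

d_kl, _ = quad(lambda x: q(x) * np.log(q(x) / p(x)), -np.inf, np.inf)
# closed form for KL(N(0.5, 1.5^2) || N(0, 1))
closed_form = np.log(1.0 / 1.5) + (1.5 ** 2 + 0.5 ** 2) / 2.0 - 0.5
print(d_kl, closed_form)                                       # positive, and they match
print(quad(lambda x: p(x) * np.log(p(x) / p(x)), -20, 20)[0])  # d_KL(p, p) = 0
```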

14
Non-parametric maximum likelihood

Relating KL-divergence to a ULLN:

$$\begin{aligned}
d_{KL}(\hat p_n, p_0) &= \int \log\frac{p_0}{\hat p_n}\, p_0\, d\mu\\
&\le \int \log\frac{p_0}{\hat p_n}\, p_0\, d\mu - P_n \log\frac{p_0}{\hat p_n}\\
&= P \log\frac{p_0}{\hat p_n} - P_n \log\frac{p_0}{\hat p_n}\\
&\le \|P - P_n\|_{\mathcal{G}},
\end{aligned}$$

where the first inequality follows from the fact that $\hat p_n$ maximizes $P_n \log p$

15
over $p \in \mathcal{P}$, and the class $\mathcal{G}$ is defined as

$$\mathcal{G} = \left\{ 1[p_0 > 0] \log\frac{p_0}{p} : p \in \mathcal{P} \right\}.$$

16
Non-parametric maximum likelihood

One problem here is that $\log(p_0/p)$ is unbounded, since $p$ can be zero.
We'll take a different approach: For any $p \in \mathcal{P}$, consider the mixture

$$\bar p = \frac{p + p_0}{2}.$$

If the class $\mathcal{P}$ is convex and $\hat p_n, p_0 \in \mathcal{P}$, this mixture has
$P_n \log \bar p \le P_n \log \hat p_n$. This is behind the following lemma.

Lemma: Define

$$\bar p_n = \frac{\hat p_n + p_0}{2}.$$

If $\mathcal{P}$ is convex,

$$h(\bar p_n, p_0)^2 \le \int \frac{\hat p_n}{\bar p_n}\, d(P_n - P).$$

17
Non-parametric maximum likelihood

Theorem: For a convex class $\mathcal{P}$ of densities, if $P$ has density $p_0 \in \mathcal{P}$ and
$\hat p_n$ maximizes likelihood over $\mathcal{P}$, we have

$$h(\bar p_n, p_0)^2 \le \|P - P_n\|_{\mathcal{G}},$$

where

$$\mathcal{G} = \left\{ \frac{2p}{p + p_0} : p \in \mathcal{P} \right\}.$$

Notice that functions in $\mathcal{G}$ are bounded between 0 and 2.

18
Non-parametric maximum likelihood: Example

Lemma: Suppose $\mathcal{P}$ is a set of densities on a compact subset $\mathcal{X}$ of $\mathbb{R}^d$.
Fix a norm $\|\cdot\|$ on $\mathbb{R}^d$. Suppose that, for all $p \in \mathcal{P}$,

$$\left| \frac{p(x)}{p(y)} - 1 \right| \le L \|x - y\|.$$

Then:

1. For all $p \in \mathrm{conv}\,\mathcal{P}$, $\left| \frac{p(x)}{p(y)} - 1 \right| \le L \|x - y\|$.

2. For all $p, p_0 \in \mathrm{conv}\,\mathcal{P}$, $\frac{2p}{p + p_0}$ is $O(L^2)$-Lipschitz wrt $\|\cdot\|$.

3. $\|P - P_n\|_{\mathcal{G}} \stackrel{as}{\to} 0$, where

$$\mathcal{G} = \left\{ \frac{2p}{p + p_0} : p \in \mathrm{conv}\,\mathcal{P} \right\}.$$

19
Non-parametric maximum likelihood: Example

But notice that the dependence on the dimension d is terrible: the rate is
exponentially slow in d. The Lipschitz property is a very weak restriction.

20
