SP2009F - Lecture03 - Maximum Likelihood Estimation (Parametric Methods)

1. Maximum likelihood estimation (MLE) attempts to find the parameters that make the observed data most probable. It does this by maximizing the likelihood function.
2. For a Bernoulli distribution, the MLE of the parameter r is the sample proportion of successes. For a Gaussian distribution, the MLEs of the mean μ and variance σ² are the sample mean and sample variance, respectively.
3. The bias and variance of an estimator decompose the mean squared error into components measuring how far the estimator is from the true value on average (bias) and how far it scatters from the average (variance).

Maximum Likelihood Estimation

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. Ethem Alpaydin, Introduction to Machine Learning, Chapter 4, MIT Press, 2004
Sample Statistics and Population Parameters

(Figure: a schematic depiction — statistics are computed from a sample drawn from the population (sample space), and inference about the population parameters is made from those statistics)

SP - Berlin Chen 2
Introduction

- Statistic
  - Any value (or function) that is calculated from a given sample
  - Statistical inference: make a decision using the information provided by a sample (or a set of examples/instances)
- Parametric methods
  - Assume that examples are drawn from some distribution that obeys a known model $p(x|\theta)$
  - Advantage: the model is well defined up to a small number of parameters
    - E.g., mean and variance are sufficient statistics for the Gaussian distribution
  - Model parameters are typically estimated by either maximum likelihood estimation or Bayesian (MAP) estimation
Maximum Likelihood Estimation (MLE) (1/2)

- Assume the instances $\mathcal{X} = \{x^1, x^2, \ldots, x^t, \ldots, x^N\}$ are independent and identically distributed (iid), and drawn from some known probability distribution $X \sim p(x|\theta)$
  - $\theta$: model parameters (assumed to be fixed but unknown here)
- MLE attempts to find $\theta$ that makes $\mathcal{X}$ the most likely to be drawn
  - Namely, maximize the likelihood of the instances ($x^1, \ldots, x^N$ are iid)

$$l(\theta|\mathcal{X}) = p(\mathcal{X}|\theta) = p(x^1, \ldots, x^N|\theta) = \prod_{t=1}^{N} p(x^t|\theta)$$
MLE (2/2)

- Because taking the logarithm does not change where $l(\theta|\mathcal{X})$ attains its maximum (the logarithm is monotonically increasing), finding $\theta$ that maximizes the likelihood of the instances is equivalent to finding $\theta$ that maximizes the log likelihood of the samples

$$L(\theta|\mathcal{X}) = \log l(\theta|\mathcal{X}) = \sum_{t=1}^{N} \log p(x^t|\theta)$$

  - Using $\log ab = \log a + \log b$
- As we shall see, the logarithmic operation can further simplify the computation when estimating the parameters of those distributions that have exponents
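Beyond simplifying the algebra, working in log space also avoids numerical underflow: a product of many small probabilities underflows to zero in floating point, while the sum of logs stays finite. A minimal Python sketch (the probability values are illustrative, not from the slides):

```python
import numpy as np

# 1000 iid instances, each with probability 0.1 under the model
probs = np.full(1000, 0.1)

likelihood = np.prod(probs)             # 0.1**1000 underflows to 0.0 in float64
log_likelihood = np.sum(np.log(probs))  # stays finite: 1000 * log(0.1)

print(likelihood)       # 0.0
print(log_likelihood)   # about -2302.585
```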
MLE: Bernoulli Distribution (1/3)

- Bernoulli Distribution
  - A random variable $X$ takes either the value $x = 1$ (with probability $r$) or the value $x = 0$ (with probability $1-r$)
  - Can be thought of as being generated from two distinct states
  - The associated probability distribution

$$P(x) = r^x (1-r)^{1-x}, \quad x \in \{0, 1\}$$

- The log likelihood for a set of iid instances $\mathcal{X} = \{x^1, x^2, \ldots, x^t, \ldots, x^N\}$ drawn from a Bernoulli distribution

$$L(r|\mathcal{X}) = \log \prod_{t=1}^{N} r^{x^t}(1-r)^{1-x^t} = \left(\sum_{t=1}^{N} x^t\right)\log r + \left(N - \sum_{t=1}^{N} x^t\right)\log(1-r)$$
MLE: Bernoulli Distribution (2/3)

- MLE of the distribution parameter $r$

$$\hat{r} = \frac{\sum_{t=1}^{N} x^t}{N}$$

  - The estimate for $r$ is the ratio of the number of occurrences of the event ($x^t = 1$) to the number of experiments $N$
- The expected value for $X$

$$E[X] = \sum_{x \in \{0,1\}} x\, P(x) = 0 \cdot (1-r) + 1 \cdot r = r$$

- The variance for $X$

$$\mathrm{Var}(X) = E[X^2] - \left(E[X]\right)^2 = r - r^2 = r(1-r)$$
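A quick numerical check of these results, sketched in Python (the value of $r$ and the sample size are arbitrary choices for illustration): the MLE is the sample proportion of ones, and for 0/1 data the empirical variance equals $\hat{r}(1-\hat{r})$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
r_true = 0.3
x = rng.binomial(1, r_true, size=10_000)  # iid Bernoulli(r) draws

r_hat = x.sum() / len(x)   # MLE: number of successes / number of experiments
print(r_hat)               # close to 0.3

# Empirical mean and variance match the analytic r and r(1-r)
print(x.mean(), x.var())
```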
MLE: Bernoulli Distribution (3/3)

- Appendix A

$$\frac{dL(r|\mathcal{X})}{dr} = \frac{d}{dr}\left[\left(\sum_{t=1}^{N} x^t\right)\log r + \left(N - \sum_{t=1}^{N} x^t\right)\log(1-r)\right] = 0$$

$$\Rightarrow \frac{\sum_{t=1}^{N} x^t}{r} - \frac{N - \sum_{t=1}^{N} x^t}{1-r} = 0$$

$$\Rightarrow \hat{r} = \frac{\sum_{t=1}^{N} x^t}{N}$$

  - Using $\frac{d \log y}{dy} = \frac{1}{y}$
- The maximum likelihood estimate of the mean is the sample average
MLE: Multinomial Distribution (1/4)

- Multinomial Distribution
  - A generalization of the Bernoulli distribution
  - The value of a random variable $X$ can be one of $K$ mutually exclusive and exhaustive states $\{s_1, s_2, \ldots, s_K\}$ with probabilities $r_1, r_2, \ldots, r_K$, respectively
  - The associated probability distribution

$$p(x) = \prod_{i=1}^{K} r_i^{s_i}, \quad \sum_{i=1}^{K} r_i = 1, \quad s_i = \begin{cases} 1 & \text{if } X \text{ chooses state } s_i \\ 0 & \text{otherwise} \end{cases}$$

- The log likelihood for a set of iid instances $\mathcal{X} = \{x^1, x^2, \ldots, x^t, \ldots, x^N\}$ drawn from a multinomial distribution

$$L(\mathbf{r}|\mathcal{X}) = \log \prod_{t=1}^{N} \prod_{i=1}^{K} r_i^{s_i^t} = \sum_{t=1}^{N} \sum_{i=1}^{K} s_i^t \log r_i$$
MLE: Multinomial Distribution (2/4)

- MLE of the distribution parameter $r_i$

$$\hat{r}_i = \frac{\sum_{t=1}^{N} s_i^t}{N}$$

  - The estimate for $r_i$ is the ratio of the number of experiments with outcome of state $s_i$ ($s_i^t = 1$) to the number of experiments $N$
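The multinomial MLE is again just relative frequencies. A small sketch in Python (the state probabilities and sample size are made-up values for illustration), with each draw one-hot encoded as the indicator vector $s^t$ from the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
r_true = np.array([0.3, 0.4, 0.3])   # K = 3 state probabilities
N = 10_000

# Draw N iid states, then one-hot encode: s[t, i] = 1 iff trial t chose state i
states = rng.choice(3, size=N, p=r_true)
s = np.eye(3)[states]

r_hat = s.sum(axis=0) / N            # MLE: counts of each state / N
print(r_hat)                         # close to [0.3, 0.4, 0.3]
```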
MLE: Multinomial Distribution (3/4)

- Appendix B: maximize $L(\mathbf{r}|\mathcal{X}) = \sum_{t=1}^{N} \sum_{i=1}^{K} s_i^t \log r_i$ with the constraint $\sum_{i=1}^{K} r_i = 1$
- Introduce a Lagrange multiplier $\lambda$:

$$J(\mathbf{r}, \lambda) = \sum_{t=1}^{N} \sum_{i=1}^{K} s_i^t \log r_i + \lambda\left(1 - \sum_{i=1}^{K} r_i\right)$$

$$\frac{\partial J}{\partial r_i} = \frac{\sum_{t=1}^{N} s_i^t}{r_i} - \lambda = 0 \;\Rightarrow\; r_i = \frac{\sum_{t=1}^{N} s_i^t}{\lambda}$$

- Summing over $i$, and using $\sum_{i=1}^{K} r_i = 1$ and $\sum_{i=1}^{K} s_i^t = 1$:

$$\sum_{i=1}^{K} r_i = \frac{\sum_{i=1}^{K} \sum_{t=1}^{N} s_i^t}{\lambda} = \frac{N}{\lambda} = 1 \;\Rightarrow\; \lambda = N$$

$$\Rightarrow \hat{r}_i = \frac{\sum_{t=1}^{N} s_i^t}{N}$$

Lagrange Multiplier: https://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
MLE: Multinomial Distribution (4/4)

(Figure: example of estimating multinomial state probabilities by relative frequencies of drawn balls, e.g. P(B) = 3/10, P(W) = 4/10, P(R) = 3/10)
MLE: Gaussian Distribution (1/3)

- Also called the Normal Distribution
- Characterized with mean $\mu$ and variance $\sigma^2$

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty$$

- Recall that mean and variance are sufficient statistics for the Gaussian
- The log likelihood for a set of iid instances $\mathcal{X} = \{x^1, x^2, \ldots, x^t, \ldots, x^N\}$ drawn from a Gaussian distribution

$$L(\mu, \sigma|\mathcal{X}) = \log \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x^t-\mu)^2}{2\sigma^2}\right] = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_{t=1}^{N}(x^t-\mu)^2}{2\sigma^2}$$
MLE: Gaussian Distribution (2/3)

- MLE of the distribution parameters $\mu$ and $\sigma^2$

$$m = \frac{\sum_{t=1}^{N} x^t}{N} \quad \text{(sample average)}$$

$$s^2 = \frac{\sum_{t=1}^{N}(x^t - m)^2}{N} \quad \text{(sample variance)}$$

- Recall that $\mu$ and $\sigma^2$ are still fixed but unknown
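The two Gaussian MLEs can be checked directly in Python (the true $\mu$, $\sigma$, and the sample size below are arbitrary illustration values). Note the MLE of the variance divides by $N$, which is what `np.var` does by default:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0
x = rng.normal(mu, sigma, size=10_000)    # iid Gaussian draws

m = x.sum() / len(x)                      # MLE of the mean: sample average
s2 = ((x - m) ** 2).sum() / len(x)        # MLE of the variance: divide by N, not N-1

print(m, s2)                              # close to 5.0 and 4.0
```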
MLE: Gaussian Distribution (3/3)

- Appendix C

$$L(\mu, \sigma|\mathcal{X}) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_{t=1}^{N}(x^t-\mu)^2}{2\sigma^2}$$

$$\frac{\partial L(\mu, \sigma|\mathcal{X})}{\partial \mu} = \frac{\sum_{t=1}^{N}(x^t-\mu)}{\sigma^2} = 0 \;\Rightarrow\; \hat{\mu} = m = \frac{\sum_{t=1}^{N} x^t}{N}$$

$$\frac{\partial L(\mu, \sigma|\mathcal{X})}{\partial \sigma} = -\frac{N}{\sigma} + \frac{\sum_{t=1}^{N}(x^t-\mu)^2}{\sigma^3} = 0 \;\Rightarrow\; \hat{\sigma}^2 = s^2 = \frac{\sum_{t=1}^{N}(x^t-m)^2}{N}$$
Evaluating an Estimator: Bias and Variance (1/6)

- The mean square error of the estimator $d$ of $\theta$ can be decomposed into two parts, composed respectively of bias and variance

$$
\begin{aligned}
r(d, \theta) &= E\left[(d-\theta)^2\right] \\
&= E\left[\left(d - E[d] + E[d] - \theta\right)^2\right] \\
&= E\left[(d - E[d])^2\right] + 2\,E\left[(d - E[d])\left(E[d] - \theta\right)\right] + E\left[\left(E[d] - \theta\right)^2\right] \\
&= E\left[(d - E[d])^2\right] + 2\left(E[d] - \theta\right)\underbrace{E\left[d - E[d]\right]}_{=\,0} + \left(E[d] - \theta\right)^2 \quad (E[d] - \theta \text{ is constant}) \\
&= \underbrace{E\left[(d - E[d])^2\right]}_{\text{variance}} + \underbrace{\left(E[d] - \theta\right)^2}_{\text{bias}^2}
\end{aligned}
$$
Evaluating an Estimator: Bias and Variance (2/6)

(Figure: schematic illustration of the bias and variance of an estimator)
Evaluating an Estimator: Bias and Variance (3/6)

- Example 1: sample average and sample variance
- Assume samples $\mathcal{X} = \{x^1, x^2, \ldots, x^t, \ldots, x^N\}$ are independent and identically distributed (iid), and drawn from some known probability distribution $X$ with mean $\mu$ and variance $\sigma^2$
  - Mean: $E[X] = \sum_x x\, p(x) = \mu$
  - Variance: $\mathrm{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - \left(E[X]\right)^2 = \sigma^2$
- Sample average (mean) for the observed samples

$$m = \frac{1}{N}\sum_{t=1}^{N} x^t$$

- Sample variance for the observed samples

$$s^2 = \frac{1}{N}\sum_{t=1}^{N}(x^t - m)^2 \quad \left(\text{or } s^2 = \frac{1}{N-1}\sum_{t=1}^{N}(x^t - m)^2\ ?\right)$$
Evaluating an Estimator: Bias and Variance (4/6)

- Example 1 (cont.)
- Sample average $m$ is an unbiased estimator of the mean $\mu$

$$E[m] = E\left[\frac{\sum_{t=1}^{N} x^t}{N}\right] = \frac{1}{N}\sum_{t=1}^{N} E[X] = \frac{N\mu}{N} = \mu \;\Rightarrow\; \mathrm{bias}(m) = E[m] - \mu = 0$$

- $m$ is also a consistent estimator: $\mathrm{Var}(m) \to 0$ as $N \to \infty$

$$\mathrm{Var}(m) = \mathrm{Var}\left(\frac{\sum_{t=1}^{N} x^t}{N}\right) = \frac{1}{N^2}\sum_{t=1}^{N} \mathrm{Var}(X) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N} \to 0 \text{ as } N \to \infty$$

  - Using $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$ and $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ for independent $X$, $Y$
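Consistency of the sample average is easy to see empirically: repeating the experiment many times and measuring the spread of $m$ shows $\mathrm{Var}(m) \approx \sigma^2/N$ shrinking as $N$ grows. A sketch in Python (the distribution, trial count, and sample sizes are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 1.0

def var_of_sample_mean(N, trials=20_000):
    """Empirical variance of the sample average m over repeated samples of size N."""
    x = rng.normal(mu, sigma, size=(trials, N))
    m = x.mean(axis=1)          # one sample average per trial
    return m.var()

# Var(m) = sigma^2 / N: it shrinks as N grows (consistency)
print(var_of_sample_mean(10))    # about 0.1
print(var_of_sample_mean(100))   # about 0.01
```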
Evaluating an Estimator: Bias and Variance (5/6)

- Example 1 (cont.)
- Sample variance $s^2$ is an asymptotically unbiased estimator of the variance $\sigma^2$

$$s^2 = \frac{1}{N}\sum_{t=1}^{N}(x^t - m)^2$$

$$
\begin{aligned}
E[s^2] &= \frac{1}{N}\,E\left[\sum_{t=1}^{N}(x^t - m)^2\right] \\
&= \frac{1}{N}\,E\left[\sum_{t=1}^{N}\left((x^t)^2 - 2x^t m + m^2\right)\right] \quad (x^t\text{'s are i.i.d.}) \\
&= \frac{1}{N}\,E\left[\sum_{t=1}^{N}(x^t)^2 - 2Nm^2 + Nm^2\right] \quad \left(\text{using } \sum_{t=1}^{N} x^t = Nm\right) \\
&= \frac{N\,E[X^2] - N\,E[m^2]}{N} = E[X^2] - E[m^2]
\end{aligned}
$$
Evaluating an Estimator: Bias and Variance (6/6)

- Example 1 (cont.)
- Sample variance $s^2$ is an asymptotically unbiased estimator of the variance $\sigma^2$
  - From $\mathrm{Var}(m) = \frac{\sigma^2}{N} = E[m^2] - \left(E[m]\right)^2$, we get $E[m^2] = \frac{\sigma^2}{N} + \mu^2$
  - From $\mathrm{Var}(X) = \sigma^2 = E[X^2] - \left(E[X]\right)^2$, we get $E[X^2] = \sigma^2 + \mu^2$

$$E[s^2] = E[X^2] - E[m^2] = \left(\sigma^2 + \mu^2\right) - \left(\frac{\sigma^2}{N} + \mu^2\right) = \frac{N-1}{N}\sigma^2 \ne \sigma^2$$

  - As $N$ (the size of the observed sample set) grows, $\frac{N-1}{N}\sigma^2 \to \sigma^2$; dividing by $N-1$ instead of $N$ gives an unbiased estimator
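The bias factor $\frac{N-1}{N}$ shows up clearly in simulation for small $N$. A sketch in Python (the true variance, sample size, and trial count are illustration values): with $N = 5$ and $\sigma^2 = 4$, the average of the MLE $s^2$ should sit near $\frac{4}{5}\cdot 4 = 3.2$, while the $N-1$ version averages near $4$.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 4.0
N, trials = 5, 200_000

# Repeatedly draw small samples and compute both variance estimators
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
s2_mle = x.var(axis=1, ddof=0)        # divide by N (the MLE)
s2_unbiased = x.var(axis=1, ddof=1)   # divide by N-1

print(s2_mle.mean())        # about (N-1)/N * sigma^2 = 3.2 (biased low)
print(s2_unbiased.mean())   # about sigma^2 = 4.0 (unbiased)
```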
Bias and Variance: Example 2

(Figure: different samples $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ drawn from an unknown population $X = (x, y)$; a function $y = F(x) + \epsilon$ is fit to each, where $\epsilon$ is the error of measurement)
Simple is Elegant?
