Maximum Likelihood Learning of Gaussians For Data Mining

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
Why we should care
• Maximum Likelihood Estimation is a very very very very fundamental part of data analysis.
• "MLE for Gaussians" is training wheels for our future techniques.
• Learning Gaussians is more useful than you might guess…
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ (you do know σ²)

Despite this, we'll spend 95% of our time on MLE. Why? Wait and see…
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ (you do know σ²)
• MLE: For which µ is x1, x2, … xR most likely?

Algebra Euphoria

$$\mu^{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$
$$= \;?\quad \text{(by i.i.d.)}$$
$$= \;?\quad \text{(monotonicity of log)}$$
$$= \;?\quad \text{(plug in formula for Gaussian)}$$
$$= \;?\quad \text{(after simplification)}$$
Algebra Euphoria

$$\mu^{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$
$$= \arg\max_{\mu} \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) \quad \text{(by i.i.d.)}$$
$$= \arg\max_{\mu} \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) \quad \text{(monotonicity of log)}$$
$$= \arg\max_{\mu} \sum_{i=1}^{R} \left( \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \quad \text{(plug in formula for Gaussian)}$$
$$= \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2 \quad \text{(after simplification)}$$
The MLE µ

$$\mu^{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2$$
$$= \text{the } \mu \text{ such that } 0 = \frac{\partial LL}{\partial \mu} = \;?\quad \text{(what?)}$$
The MLE µ

$$\mu^{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2$$
$$= \text{the } \mu \text{ such that } 0 = \frac{\partial LL}{\partial \mu} = \frac{\partial}{\partial \mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -\sum_{i=1}^{R} 2(x_i - \mu)$$

Thus
$$\mu = \frac{1}{R} \sum_{i=1}^{R} x_i$$
Lawks-a-lawdy!

$$\mu^{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i$$
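As a quick numerical check of the result above (not part of the original slides): a minimal sketch, assuming numpy is available and using synthetic data, that compares the closed-form sample mean with a brute-force grid search over the log-likelihood.

```python
# Minimal check: the sample mean should maximize the Gaussian log-likelihood
# over mu when sigma^2 is known (toy data; numpy assumed available).
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2, R = 3.0, 2.0, 1000
x = rng.normal(mu_true, np.sqrt(sigma2), size=R)

def log_likelihood(mu):
    """Gaussian log-likelihood of the sample x at candidate mean mu (sigma^2 known)."""
    return -0.5 * x.size * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu_mle = x.mean()                                      # closed-form MLE: the sample mean
grid = np.linspace(mu_true - 1.0, mu_true + 1.0, 2001)
mu_grid = grid[np.argmax([log_likelihood(m) for m in grid])]

print(mu_mle, mu_grid)                                 # agree to within the grid spacing
```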
A General MLE strategy

Suppose θ = (θ1, θ2, …, θn)^T is a vector of parameters.
Task: Find the MLE θ assuming a known form for p(Data | θ, stuff).

1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations
$$\frac{\partial LL}{\partial \theta_1} = 0, \quad \frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial LL}{\partial \theta_n} = 0$$
4. Check that you're at a maximum
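A minimal sketch (not from the slides) of the recipe followed mechanically, assuming sympy is available and using a tiny symbolic sample: the computer does the "high-school calculus" for the Gaussian mean with σ² known.

```python
# Follow the recipe symbolically: write LL, differentiate, solve dLL/dmu = 0.
import sympy as sp

mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)
xs = sp.symbols('x1:5')        # a tiny symbolic sample x1, x2, x3, x4 (R = 4)

# Step 1: LL = log P(Data | mu, sigma^2) for i.i.d. Gaussian observations
LL = sum(sp.log(1 / (sp.sqrt(2 * sp.pi) * sigma)) - (x - mu)**2 / (2 * sigma**2) for x in xs)

# Steps 2-3: differentiate and solve the single equation dLL/dmu = 0
mu_mle = sp.solve(sp.diff(LL, mu), mu)[0]
print(sp.simplify(mu_mle))     # (x1 + x2 + x3 + x4)/4 -- the sample mean
```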
A General MLE strategy

If you can't solve the simultaneous equations ∂LL/∂θi = 0 in closed form, what should you do?
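The slide leaves that question hanging; one common answer (not given on this slide) is to maximize LL numerically. A minimal sketch, assuming numpy and scipy are available, using toy data and optimizing the negative log-likelihood:

```python
# If the stationarity equations have no closed form, maximize LL numerically.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.random.default_rng(4).normal(5.0, 2.0, size=500)   # toy data

def neg_log_likelihood(theta):
    mu, log_sigma = theta                     # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                      # close to the sample mean and MLE std dev
```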
For the univariate Gaussian with θ = (µ, σ²), step 2 of the recipe gives:

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)$$
$$\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2$$
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -\frac{R}{2}\log(2\pi) - \frac{R}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

Setting both partial derivatives to zero:

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \;\Rightarrow\; \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$
$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2 \;\Rightarrow\; \text{what?}$$
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\mu^{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i$$
$$\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2$$
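A minimal sketch (not from the slides) of the two closed-form MLEs above, assuming numpy and a small hypothetical sample; note that numpy's default variance (ddof=0) is exactly the MLE form.

```python
# Compute the univariate Gaussian MLEs from data.
import numpy as np

x = np.array([2.1, 3.4, 1.9, 4.0, 2.7])    # hypothetical sample x1..xR

mu_mle = x.mean()                           # (1/R) * sum(x_i)
sigma2_mle = np.mean((x - mu_mle) ** 2)     # (1/R) * sum((x_i - mu_mle)^2)

print(mu_mle, sigma2_mle, np.var(x))        # np.var(x) (ddof=0) matches sigma2_mle
```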
Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
$$E[\mu^{\text{mle}}] = E\!\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$
so µ^mle is unbiased.
Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate differs from the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
$$E[\sigma^2_{\text{mle}}] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\Bigl(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\Bigr)^{\!2}\right] = \Bigl(1 - \frac{1}{R}\Bigr)\sigma^2 \neq \sigma^2$$
so σ²_mle is biased.
Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
$$E[\sigma^2_{\text{mle}}] = \Bigl(1 - \frac{1}{R}\Bigr)\sigma^2 \neq \sigma^2$$
So define
$$\sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2$$
so that
$$E[\sigma^2_{\text{unbiased}}] = \sigma^2$$
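A minimal simulation sketch (not from the slides), assuming numpy and using synthetic samples, that illustrates the bias factor (1 − 1/R) empirically by averaging both estimators over many trials.

```python
# Average the MLE and unbiased variance estimators over many synthetic samples.
import numpy as np

rng = np.random.default_rng(1)
R, sigma2_true, trials = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, R))
sigma2_mle = samples.var(axis=1, ddof=0)     # biased MLE: divide by R
sigma2_unb = samples.var(axis=1, ddof=1)     # unbiased estimator: divide by R - 1

print(sigma2_mle.mean())    # ~ (1 - 1/R) * sigma2_true = 3.2
print(sigma2_unb.mean())    # ~ sigma2_true = 4.0
```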
Unbiaseditude discussion
• Which is best?

$$\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2$$
$$\sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2$$

Answer:
• It depends on the task
• And it doesn't make much difference once R gets large

$$\mu^{\text{suboptimal}} = \frac{1}{R+7}\sum_{i=1}^{R} x_i$$
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\mu^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k$$
$$\Sigma^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \mu^{\text{mle}})(\mathbf{x}_k - \mu^{\text{mle}})^T$$

Componentwise, for 1 ≤ i ≤ m:
$$\mu_i^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} x_{ki}$$
where x_ki is the value of the ith component of x_k (the ith attribute of the kth record) and µ_i^mle is the ith component of µ^mle.
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

Elementwise, for 1 ≤ i ≤ m and 1 ≤ j ≤ m (x_ki is the value of the ith attribute of the kth record):
$$\sigma_{ij}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (x_{ki} - \mu_i^{\text{mle}})(x_{kj} - \mu_j^{\text{mle}})$$

And, as in the univariate case, there is an unbiased version:
$$\Sigma^{\text{unbiased}} = \frac{\Sigma^{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R} (\mathbf{x}_k - \mu^{\text{mle}})(\mathbf{x}_k - \mu^{\text{mle}})^T$$
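A minimal sketch (not from the slides) of the multivariate MLEs above, assuming numpy and toy data; numpy's cov with bias=True divides by R (the MLE), bias=False by R−1 (the unbiased version).

```python
# Compute the m-dimensional Gaussian MLEs from a data matrix of R records.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(392, 3))               # toy data: R=392 records, m=3 attributes

mu_mle = X.mean(axis=0)                     # (1/R) * sum_k x_k
centered = X - mu_mle
Sigma_mle = centered.T @ centered / len(X)  # (1/R) * sum_k (x_k - mu)(x_k - mu)^T

print(np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True)))   # True
```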
Confidence intervals
We need to talk.

Structural error
Actually, we need to talk about something else too.
What if we do all this analysis when the true distribution is in fact not Gaussian?
• How can we tell? *
• How can we survive? *
* Will be discussed in future Andrew lectures… just before we need this technology.
Gaussian MLE in action
Using R = 392 cars from the "MPG" UCI dataset supplied by Ross Quinlan.

Bivariate MLE in action

Multivariate MLE
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)
Step 1a: Put a prior on Σ:
$$(\nu_0 - m - 1)\,\Sigma \sim IW(\nu_0,\; (\nu_0 - m - 1)\,\Sigma_0)$$
This thing is called the Inverse-Wishart distribution: a PDF over SPD (symmetric positive-definite) matrices!

Σ0: (roughly) my best guess of Σ, so that E[Σ] = Σ0
ν0 small: "I am not sure about my guess of Σ0"
ν0 large: "I'm pretty sure about my guess of Σ0"
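A minimal sketch (not from the slides) of an Inverse-Wishart prior over covariance matrices, assuming scipy is available. Scipy's parameterization differs from the slide's; here the scale is chosen so that the prior mean of Σ equals Σ0, mirroring E[Σ] = Σ0 above.

```python
# Draw covariance matrices from an Inverse-Wishart prior and check its mean.
import numpy as np
from scipy.stats import invwishart

m = 2
nu0 = 10.0                              # larger nu0 -> more confidence in the guess Sigma0
Sigma0 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

prior = invwishart(df=nu0, scale=(nu0 - m - 1) * Sigma0)
draws = prior.rvs(size=50_000)          # each sample is an SPD matrix
print(draws.mean(axis=0))               # approximately Sigma0
```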
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1b: Put a prior on µ given Σ:  µ | Σ ~ N(µ0, Σ / κ0)
κ0 small: "I am not sure about my guess of µ0"
κ0 large: "I'm pretty sure about my guess of µ0"
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Why do we use this form of prior? Actually, we don't have to.

Step 1: Put a prior on (µ, Σ) (Steps 1a and 1b above).
Step 2:
$$(\nu_R + m - 1)\,\Sigma_R = (\nu_0 + m - 1)\,\Sigma_0 + \sum_{k=1}^{R} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T + \frac{(\bar{\mathbf{x}} - \mu_0)(\bar{\mathbf{x}} - \mu_0)^T}{1/\kappa_0 + 1/R}$$
Step 3: Posterior:
$$(\nu_R + m - 1)\,\Sigma \sim IW(\nu_R,\; (\nu_R + m - 1)\,\Sigma_R), \qquad \mu \mid \Sigma \sim N(\mu_R,\; \Sigma / \kappa_R)$$
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors: prior form and posterior form are the same, and are characterized by "sufficient statistics" of the data.
• The marginal distribution on µ is a student-t.
• One point of view: it's pretty academic if R > 30.

Step 1: Prior:
$$(\nu_0 - m - 1)\,\Sigma \sim IW(\nu_0,\; (\nu_0 - m - 1)\,\Sigma_0), \qquad \mu \mid \Sigma \sim N(\mu_0,\; \Sigma / \kappa_0)$$

Step 2:
$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k, \qquad \mu_R = \frac{\kappa_0 \mu_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R}, \qquad \nu_R = \nu_0 + R, \qquad \kappa_R = \kappa_0 + R$$
$$(\nu_R + m - 1)\,\Sigma_R = (\nu_0 + m - 1)\,\Sigma_0 + \sum_{k=1}^{R} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T + \frac{(\bar{\mathbf{x}} - \mu_0)(\bar{\mathbf{x}} - \mu_0)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior:
$$(\nu_R + m - 1)\,\Sigma \sim IW(\nu_R,\; (\nu_R + m - 1)\,\Sigma_R), \qquad \mu \mid \Sigma \sim N(\mu_R,\; \Sigma / \kappa_R)$$
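A minimal sketch (not from the slides) that turns the Step 2 update formulas above into code, assuming numpy; the helper name `niw_posterior` and the toy hyperparameters are hypothetical choices for illustration.

```python
# Apply the Step 2 posterior-update formulas from the slide to a data matrix X.
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, Sigma0):
    """Update the prior (mu0, kappa0, nu0, Sigma0) with data X of shape (R, m)."""
    R, m = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                     # scatter about the sample mean
    mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
    kappa_R = kappa0 + R
    nu_R = nu0 + R
    d = (xbar - mu0).reshape(-1, 1)
    Sigma_R = ((nu0 + m - 1) * Sigma0 + S + (d @ d.T) / (1/kappa0 + 1/R)) / (nu_R + m - 1)
    return mu_R, kappa_R, nu_R, Sigma_R

# Toy usage with a weak prior centered at zero mean and identity covariance
X = np.random.default_rng(3).normal(size=(50, 2))
print(niw_posterior(X, mu0=np.zeros(2), kappa0=1.0, nu0=4.0, Sigma0=np.eye(2)))
```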
Where we're at
• Classifier (Inputs → predict category): Joint BC, Naïve BC, Dec Tree
• Density Estimator (Inputs → probability): Joint DE, Naïve DE, Gauss DE
• Regressor (Inputs → predict real number)
What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian parameters
• Understand "biased estimator" versus "unbiased estimator"
• Appreciate the outline behind Bayesian estimation of Gaussian parameters

Useful exercise
• We've already done some MLE in this class without even telling you!
• Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial M(p1, p2, … pn), where P(xk = j | p) = pj
• What is the MLE p = (p1, p2, … pn)?