CS775 Lec 2
Maximum-Likelihood &
Bayesian Parameter
Estimation
Introduction
Maximum-Likelihood Estimation
– Example of a Specific Case
– The Gaussian Case: unknown μ and σ
– Bias
Introduction
– Data availability in a Bayesian framework
We could design an optimal classifier if we knew:
– P(ωi) (priors)
– P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete
information!
– A priori information about the problem: e.g. the class-conditional densities are Gaussian, P(x | ωi) ~ N(μi, Σi), characterized by 2 parameters
– Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation
Parameters in ML estimation are fixed but
unknown!
Maximum-Likelihood Estimation
– General principle
For the Gaussian case: \theta = (\mu_j, \Sigma_j) = \left(\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \mathrm{cov}(x_j^m, x_j^n), \ldots\right)
Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with the i-th category.
The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ):
“It is the value of θ that best agrees with the actually observed training samples.”
Optimal estimation
– Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator
\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t
where l(θ) = ln P(D | θ) is the log-likelihood. The set of necessary conditions for an optimum is:
\nabla_\theta\, l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0
Example of a specific case: unknown μ
– P(xk | μ) ~ N(μ, Σ)
(Samples are drawn from a multivariate normal population)
\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\!\left[(2\pi)^d \lvert\Sigma\rvert\right] - \frac{1}{2}(x_k - \mu)^t\,\Sigma^{-1}(x_k - \mu)
and
\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)
θ = μ, therefore:
• The ML estimate for μ must satisfy:
\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat\mu) = 0
• Multiplying by Σ and rearranging, we obtain:
\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k
(just the arithmetic average of the training samples)
Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
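As an illustration, a minimal numerical sketch of this result (assuming NumPy; the synthetic data and its "true" parameters are made up for the example): the sample mean satisfies the gradient condition above.

```python
import numpy as np

# Synthetic 2-D Gaussian data (true parameters are arbitrary choices for illustration)
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=500)   # n x d matrix of samples

# ML estimate of the mean: the sample average
mu_hat = X.mean(axis=0)

# Check the necessary condition: sum_k Sigma^{-1} (x_k - mu_hat) = 0
Sigma_inv = np.linalg.inv(true_cov)
gradient = Sigma_inv @ (X - mu_hat).sum(axis=0)
print(mu_hat)     # close to true_mu
print(gradient)   # numerically ~ [0, 0]
```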
ML Estimation:
– Gaussian Case: unknown μ and σ
θ = (θ1, θ2) = (μ, σ²)
l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2
\nabla_\theta\, l = \begin{bmatrix}\dfrac{\partial}{\partial\theta_1}\ln P(x_k \mid \theta)\\[4pt] \dfrac{\partial}{\partial\theta_2}\ln P(x_k \mid \theta)\end{bmatrix} = 0
\Rightarrow\quad \frac{1}{\theta_2}(x_k - \theta_1) = 0 \quad\text{and}\quad -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} = 0
Summation over the n training samples:
\sum_{k=1}^{n} \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0 \qquad (1)
-\sum_{k=1}^{n} \frac{1}{\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{\hat\theta_2^2} = 0 \qquad (2)
Combining (1) and (2), one obtains:
\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k \,; \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2
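A short sketch of these two estimates in code (assuming NumPy and synthetic 1-D data): note that NumPy's np.var uses ddof=0 by default, which is exactly the ML (divide-by-n) estimate above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic 1-D Gaussian samples

mu_hat = x.mean()                        # ML estimate of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # ML estimate of the variance (divides by n)

# np.var with its default ddof=0 computes the same ML estimate
assert np.isclose(sigma2_hat, np.var(x))
print(mu_hat, sigma2_hat)
```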
Bias
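A worked statement of the standard result: the ML estimate of the variance is biased, because it divides by n rather than n − 1:
E\!\left[\hat\sigma^2\right] = E\!\left[\frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2\right] = \frac{n-1}{n}\,\sigma^2 \;\neq\; \sigma^2
so an unbiased estimate is obtained by dividing by n − 1 instead of n.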
ML Problem Statement
Our goal is to determine θ̂ (the value of θ that makes this sample the most representative!)
θ = (θ1, θ2, …, θc)
Chapter 3:
Maximum-Likelihood and
Bayesian Parameter Estimation
Bayesian Estimation (BE)
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General
Estimation
Problems of Dimensionality
Computational Complexity
Component Analysis and Discriminants
Hidden Markov Models
Bayesian Estimation (Bayesian learning applied to pattern classification problems)
– In MLE, θ was assumed to be fixed
– In BE, θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written
P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}
To demonstrate the preceding equation, use:
Bayesian Parameter Estimation:
Gaussian Case
P(\mu \mid D) = \frac{P(D \mid \mu)\, P(\mu)}{\int P(D \mid \mu)\, P(\mu)\, d\mu} \;\propto\; \prod_{k=1}^{n} P(x_k \mid \mu)\, P(\mu) \qquad (1)
– Reproducing density
P(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \qquad (2)
Identifying (1) and (2) yields:
\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0
\quad\text{and}\quad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}
where μ̂n is the sample mean of the n training samples, μ0 and σ0² are the prior mean and variance, and σ² is the (known) variance of P(x | μ).
It provides:
P(x \mid D) \sim N(\mu_n,\; \sigma^2 + \sigma_n^2)
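A small sketch of these update equations in code (assuming NumPy; the prior parameters μ0, σ0 and the known likelihood standard deviation σ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5                                      # known std. dev. of P(x | mu)
mu0, sigma0 = 0.0, 3.0                           # prior P(mu) ~ N(mu0, sigma0^2), illustrative
x = rng.normal(loc=2.0, scale=sigma, size=30)    # synthetic training sample D

n = len(x)
mu_hat_n = x.mean()                              # sample mean of the n observations

# Posterior P(mu | D) ~ N(mu_n, sigma_n^2)
mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * mu_hat_n \
     + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)

# Predictive density P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n2, sigma**2 + sigma_n2)
```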
More cases: Binary Variables
Coin flipping: heads=1, tails=0
Bernoulli Distribution
Binary variables (2)
N coin flips:
Binomial Distribution
Binomial distribution
Parameter Estimation
ML for Bernoulli
Given:
Parameter Estimation (2)
Example:
Prediction: all future tosses will land heads up
Overfitting to D
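As a sketch of the standard results these slides summarize: for a Bernoulli variable with parameter μ (probability of heads), and N flips of which m come up heads,
P(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \qquad \mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{m}{N}
With a small sample, e.g. three flips that all land heads, μ_ML = 1, so the ML model predicts that all future tosses will land heads up; this is the overfitting to D noted above.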
Beta Distribution
Distribution over μ.
Bayesian Bernoulli
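A sketch of the standard Bayesian treatment behind these headings: a Beta prior over μ is conjugate to the Bernoulli, so the posterior after observing m heads and l tails is again a Beta distribution,
\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1}
P(\mu \mid D) = \mathrm{Beta}(\mu \mid m+a,\; l+b), \qquad P(x = 1 \mid D) = \frac{m+a}{m+a+l+b}
The prior counts a and b smooth the estimate and avoid the overfitting seen with μ_ML.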
\lim_{r \to \infty} P(\text{error}) = 0
(where r is the Mahalanobis distance between the two class means)
If the features are independent, then:
\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)
and
r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2
Most useful features are the ones for which the difference between the means is large relative to the standard deviation.
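A tiny sketch of this quantity in code (assuming NumPy; the per-feature class means and standard deviations below are made-up numbers):

```python
import numpy as np

# Hypothetical per-feature class means and shared standard deviations (d = 4 features)
mu1 = np.array([0.0, 1.0, 2.0, 0.5])
mu2 = np.array([0.2, 3.0, 2.1, 0.4])
sigma = np.array([1.0, 0.5, 2.0, 0.1])

# r^2 for independent features: each term shows how much that feature separates the classes
r2_terms = ((mu1 - mu2) / sigma) ** 2
print(r2_terms)        # per-feature contributions
print(r2_terms.sum())  # overall squared Mahalanobis distance r^2
```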
Computational Complexity
f(x) = 2 + 3x + 4x²
g(x) = x²
f(x) = O(x²)
– “big oh” is not unique!
f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)
– Complexity of the ML Estimation
Total = O(d²·n)
Total for c classes = O(c·d²·n) ≈ O(d²·n)
Expectation Maximization
Learning parameters governing a distribution
from training points with missing values
– A generalization of MLE
– Missing features: consider the data set
D = \{x_1, \ldots, x_n\}, \qquad x_i = [x_{ig}, x_{ib}]
where xig are the observed (“good”) features of sample i and xib are the missing (“bad”) features.
The improved estimate is
Q(\theta;\, \theta^i) = E_{D_b}\!\left[\ln p(D_g, D_b;\, \theta) \mid D_g;\, \theta^i\right]
i.e. the expected log-likelihood, marginalized with respect to the missing features Db, using the best estimate θ^i obtained so far.
EM Algorithm
begin initialize θ^0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^{i+1} ← arg max_θ Q(θ; θ^i)
  until Q(θ^{i+1}; θ^i) − Q(θ^i; θ^{i−1}) ≤ T
  return θ̂ ← θ^{i+1}
end
EM example: Mixtures of Gaussians
f(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \sigma_k)
Pick initial estimates of θ = {π1, …, πK, μ1, …, μK, σ1, …, σK} (this is θ^0).
E-step: plug the current πk, μk, σk into the responsibilities
P(k \mid x) = \frac{\pi_k\, N(x \mid \mu_k, \sigma_k)}{f(x)}
M-step: use the result to re-estimate the parameters,
\pi_k = \frac{1}{n}\sum_{i=1}^{n} P(k \mid x(i))
\mu_k = \frac{1}{n_k}\sum_{i=1}^{n} P(k \mid x(i))\, x(i)
\sigma_k^2 = \frac{1}{n_k}\sum_{i=1}^{n} P(k \mid x(i))\,\big(x(i) - \mu_k\big)^2
where n_k = \sum_{i=1}^{n} P(k \mid x(i)) = n\pi_k, then use the new estimates to go back to the E-step.
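A compact sketch of these E/M updates for a 1-D mixture (assuming NumPy; the number of components K, the synthetic data, and the initial parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 1-D data drawn from two Gaussians (for illustration)
x = np.concatenate([rng.normal(-2.0, 0.8, 300), rng.normal(3.0, 1.2, 200)])
n, K = len(x), 2

# Initial guesses for pi_k, mu_k, sigma_k^2
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
var = np.ones(K)

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities P(k | x(i)) for every sample and component
    dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)])  # K x n
    resp = dens / dens.sum(axis=0)

    # M-step: re-estimate pi_k, mu_k, sigma_k^2 from the responsibilities
    n_k = resp.sum(axis=1)                                   # n_k = n * pi_k
    pi = n_k / n
    mu = (resp * x).sum(axis=1) / n_k
    var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / n_k

print(pi, mu, var)   # should approach the generating mixture
```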
Gaussian Mixture Example: plots showing the fit at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th and 20th iterations.
Hidden Markov Models:
– Markov Chains
θ = (aij, T), where aij are the state transition probabilities
P(\omega^T \mid \theta) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14}\, P(\omega(1) = \omega_i)
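A small sketch of this computation (assuming NumPy; the transition matrix values and the initial-state probability below are made up, and the state sequence is the one implied by the product above):

```python
import numpy as np

# Hypothetical transition matrix a[i][j] = P(state j at t+1 | state i at t), states 1..4
a = np.array([
    [0.2, 0.3, 0.1, 0.4],
    [0.3, 0.4, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.4, 0.1, 0.1, 0.4],
])
p_initial = 0.25   # assumed P(omega(1) = omega_i)

# Sequence (1, 4, 2, 2, 1, 4) gives the product a14 * a42 * a22 * a21 * a14 * P(omega(1))
sequence = [1, 4, 2, 2, 1, 4]
prob = p_initial
for s, t in zip(sequence[:-1], sequence[1:]):
    prob *= a[s - 1, t - 1]    # transition probability a_{st}
print(prob)
```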
HMMs