Lecture 4
Spring 24
Chapter 3:
Maximum-Likelihood & Bayesian
Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
– Example of a Specific Case
– The Gaussian Case: unknown μ and σ
– Bias

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
• Introduction
– Data availability in a Bayesian framework
• We could design an optimal classifier if we knew:
– P(ωi) (priors)
– P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information!
– Normality assumption: P(x | ωi) ~ N(μi, Σi), characterized by 2 parameters (the mean μi and the covariance Σi)
– Estimation techniques: Maximum-Likelihood (ML) estimation and Bayesian estimation
• Maximum-Likelihood Estimation
– Has good convergence properties as the sample size increases
– Simpler than alternative techniques
– Parameters in ML estimation are fixed but unknown!
– The best parameters are obtained by maximizing the probability of obtaining the samples observed
– General principle
• Assume we have c classes and

$$P(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$$

$$P(x \mid \omega_j) \equiv P(x \mid \omega_j, \theta_j), \quad \text{where } \theta_j = (\mu_j, \Sigma_j)$$
• Use the information provided by the training samples to estimate
θ = (θ1, θ2, …, θc); each θi (i = 1, 2, …, c) is associated with its own category
• For a single category, write the parameter vector as θ = (θ1, θ2, …, θp)^t and let l(θ) = ln P(D | θ) be the log-likelihood of the n training samples; then

$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) \qquad \text{(eq. 6)}$$

and the ML estimate must satisfy the necessary condition

$$\nabla_\theta l = 0 \qquad \text{(eq. 7)}$$
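To make the principle concrete, here is a minimal sketch (my own addition, not from the slides) that maximizes a univariate Gaussian log-likelihood over a grid of candidate means and confirms that the maximizer coincides with the sample mean; the data and grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # illustrative 1-D samples
sigma = 1.0                                    # assume the variance is known

def log_likelihood(mu, x, sigma):
    """l(mu) = sum_k ln p(x_k | mu) for a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Evaluate l(mu) on a grid of candidate parameter values.
grid = np.linspace(0.0, 4.0, 2001)
ll = np.array([log_likelihood(m, x, sigma) for m in grid])

mu_ml = grid[np.argmax(ll)]        # numerical maximizer of the likelihood
print(mu_ml, x.mean())             # both are (approximately) the sample mean
```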
• Example, the Gaussian case: unknown μ
– We assume we know the covariance Σ
– p(x_k | μ) ~ N(μ, Σ)
(Samples are drawn from a multivariate normal population)

$$\ln p(x_k \mid \mu) = -\frac{1}{2}\ln\big[(2\pi)^d |\Sigma|\big] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1}(x_k - \mu)$$

and

$$\nabla_\mu \ln p(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu) \qquad \text{(eq. 9)}$$

Here θ = μ, therefore the ML estimate for μ must satisfy:

$$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$$
• Multiplying by Σ and rearranging, we obtain:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$

Conclusion:
If p(x_k | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional
feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
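As a quick check (my own sketch, not part of the slides), the sample mean does drive the gradient condition to zero; the synthetic data below are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])          # assumed known covariance
x = rng.multivariate_normal([1.0, -2.0], Sigma, size=500)

mu_hat = x.mean(axis=0)                              # ML estimate of the mean

# Gradient of the log-likelihood at mu_hat: sum_k Sigma^{-1} (x_k - mu_hat)
grad = np.linalg.inv(Sigma) @ (x - mu_hat).sum(axis=0)
print(grad)    # numerically ~ [0, 0], as required by eq. 7
```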
• Example, Gaussian Case: unknown μ and σ
– First consider the univariate case: unknown μ and σ²
θ = (θ1, θ2) = (μ, σ²)

$$l = \ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

$$\nabla_\theta l = \begin{pmatrix} \dfrac{\partial}{\partial\theta_1}\ln p(x_k \mid \theta) \\[6pt] \dfrac{\partial}{\partial\theta_2}\ln p(x_k \mid \theta) \end{pmatrix} = 0$$

which gives the two conditions

$$\frac{1}{\theta_2}(x_k - \theta_1) = 0$$

$$-\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} = 0$$

Summation over the training set:

$$\sum_{k=1}^{n} \frac{1}{\hat{\sigma}^2}(x_k - \hat{\mu}) = 0 \qquad (1)$$

$$-\sum_{k=1}^{n} \frac{1}{\hat{\sigma}^2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\mu})^2}{\hat{\sigma}^4} = 0 \qquad (2)$$

Combining (1) and (2), one obtains:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \; ; \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$
– For the multivariate Gaussian case (unknown μ and Σ), the ML estimates are:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$

$$\hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$$
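A minimal numpy sketch (my addition) of these ML estimates; the synthetic data set is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([0.0, 3.0])
true_Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
x = rng.multivariate_normal(true_mu, true_Sigma, size=1000)   # n x d samples

n = x.shape[0]
mu_hat = x.mean(axis=0)                                       # (1/n) sum_k x_k

# (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t  -- note the 1/n, not 1/(n-1)
diff = x - mu_hat
Sigma_hat = (diff.T @ diff) / n

print(mu_hat)
print(Sigma_hat)
```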
• Bias
– The ML estimate for σ² is biased:

$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

– Extreme case: n = 1 gives E[·] = 0 ≠ σ²
– An elementary unbiased estimator for Σ is the sample covariance matrix:

$$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$$

– The ML estimate is related to it by:

$$\hat{\Sigma} = \frac{n-1}{n}\, C$$
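A short sketch (my own, with made-up data) contrasting the biased ML variance estimate with the unbiased sample variance:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2_true = 4.0
n = 10

# Average the two estimators over many repeated samples of size n.
reps = 20000
ml_est, unbiased_est = [], []
for _ in range(reps):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=n)
    ml_est.append(x.var(ddof=0))        # (1/n)     sum (x_k - x_bar)^2  -> biased
    unbiased_est.append(x.var(ddof=1))  # (1/(n-1)) sum (x_k - x_bar)^2  -> unbiased

print(np.mean(ml_est))        # ~ ((n-1)/n) * 4.0 = 3.6
print(np.mean(unbiased_est))  # ~ 4.0
```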
Chapter 3:
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
⚫ Bayesian Estimation (BE)
⚫ Bayesian Parameter Estimation: Gaussian Case
⚫ Bayesian Parameter Estimation: General Estimation
⚫ Problems of Dimensionality
⚫ Computational Complexity
⚫ Component Analysis and Discriminants
⚫ Hidden Markov Models
• Bayesian Estimation (Bayesian learning applied to pattern
classification problems)
– In MLE, θ was supposed to be fixed
– In BE, θ is a random variable
– The computation of the posterior probabilities P(ωi | x)
lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written:

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}$$
• To demonstrate the preceding equation, use the fact that the training samples give no information about the priors, so P(ωj | D) = P(ωj) and the formula reduces to:

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}$$
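A small illustrative sketch (mine; the class-conditional densities and priors are assumed) of this classification rule:

```python
import numpy as np
from scipy.stats import norm

# Assumed (illustrative) class-conditional densities p(x | w_j, D) and priors.
class_means = [0.0, 2.5]         # one Gaussian per class, sigma = 1 for both
priors = np.array([0.6, 0.4])

def posteriors(x):
    """P(w_i | x, D) = p(x | w_i, D) P(w_i) / sum_j p(x | w_j, D) P(w_j)."""
    likes = np.array([norm.pdf(x, loc=m, scale=1.0) for m in class_means])
    unnorm = likes * priors
    return unnorm / unnorm.sum()

print(posteriors(1.0))   # posterior probabilities of the two classes at x = 1
```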
• Bayesian Parameter Estimation: Gaussian Case
– The univariate case: μ is the only unknown parameter, and we know that

$$p(x_k \mid \mu) \sim N(\mu, \sigma^2) \qquad \text{and} \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$

– Collecting the terms in μ, the posterior takes the form

$$p(\mu \mid D) = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

so p(μ | D) is again a normal density (a reproducing density):

$$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$
• Identifying the coefficients yields expressions for μn and σn²:

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$

• Solving for μn and σn²:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
From these equations we see that as n increases:
– the variance σn² decreases monotonically
– the estimate p(μ | D) becomes more and more peaked
• The desired class-conditional density:

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian:} \quad p(x \mid D) \sim N(\mu_n,\; \sigma^2 + \sigma_n^2)$$

• We know σ² and how to compute μn and σn², so this provides the desired class-conditional density p(x | Dj, ωj)
• Therefore, using p(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{\omega_j} P(\omega_j \mid x, D) \;\equiv\; \max_{\omega_j} \big[\, p(x \mid \omega_j, D_j)\, P(\omega_j) \,\big]$$
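A compact sketch (my own; the prior and data values are assumptions) of the closed-form update for μn and σn² and the resulting predictive density:

```python
import numpy as np

# Assumed known quantities (illustrative values).
sigma2 = 1.0                  # known data variance sigma^2
mu0, sigma0_2 = 0.0, 10.0     # prior p(mu) ~ N(mu0, sigma0^2)

rng = np.random.default_rng(4)
x = rng.normal(1.5, np.sqrt(sigma2), size=30)   # observed samples D
n, mu_hat_n = len(x), x.mean()

# Posterior p(mu | D) ~ N(mu_n, sigma_n^2)
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat_n \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n_2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

# Predictive density p(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n_2, sigma2 + sigma_n_2)
```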
• 3.5 Bayesian Parameter Estimation: General Theory
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}$$

and, by the independence assumption:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
• Problems of Dimensionality
– Problems involving 50 or 100 features are
common (usually binary valued)
– Note: microarray data might entail ~20,000
real-valued features
– Classification accuracy depends on:
• dimensionality
• amount of training data
• discrete vs. continuous features
• Case of two-class multivariate normal densities
with the same covariance
– p(x | ωj) ~ N(μj, Σ), j = 1, 2
– Statistically independent features
– If the priors are equal, then the Bayes error is

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du$$

where

$$r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1}\, (\mu_1 - \mu_2)$$

is the squared Mahalanobis distance, and

$$\lim_{r \to \infty} P(\text{error}) = 0$$

• If the features are conditionally independent, then:

$$\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) \qquad \text{and} \qquad r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$
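A brief sketch (mine; the means and covariance are illustrative) computing r² and the corresponding Bayes error:

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])       # shared covariance

diff = mu1 - mu2
r2 = diff @ np.linalg.inv(Sigma) @ diff          # squared Mahalanobis distance
r = np.sqrt(r2)

# P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
#          = 1 - Phi(r/2), the upper tail of a standard normal.
p_error = norm.sf(r / 2)
print(r2, p_error)
```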
• Most useful features are the ones for which the difference between
the means is large relative to the standard deviation
– Doesn’t require independence
Overfitting
• Dimensionality of model vs. size of training
data
– Issue: not enough data to support the model
– Possible solutions:
• Reduce the model dimensionality
• Make (possibly incorrect) assumptions to better
estimate Σ

Overfitting
• Ways to estimate Σ better (a sketch of the last two follows below):
– use data pooled from all classes
• normalization issues
– use the pseudo-Bayesian form λΣ0 + (1 − λ)Σn
– "doctor" Σ by thresholding entries
• reduces chance correlations
– assume statistical independence
• zero all off-diagonal elements
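A minimal sketch (my own; λ and the data are assumptions) of two of these fixes, the pseudo-Bayesian combination and zeroing the off-diagonal elements:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.multivariate_normal(np.zeros(3), np.eye(3), size=20)   # few samples, d = 3

Sigma_n = np.cov(x, rowvar=False)      # sample estimate from the (small) data set
Sigma_0 = np.eye(3)                    # assumed prior guess for the covariance
lam = 0.3                              # assumed mixing weight lambda

# Pseudo-Bayesian form: lambda * Sigma_0 + (1 - lambda) * Sigma_n
Sigma_pb = lam * Sigma_0 + (1 - lam) * Sigma_n

# Statistical-independence assumption: keep only the diagonal of Sigma_n
Sigma_diag = np.diag(np.diag(Sigma_n))

print(Sigma_pb)
print(Sigma_diag)
```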
Shrinkage
• Shrink the common covariance estimate toward the identity matrix:

$$\Sigma(\beta) = (1 - \beta)\,\Sigma + \beta I \qquad \text{for } 0 < \beta < 1$$
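A tiny sketch (mine; β is an assumed value) of this shrinkage estimator:

```python
import numpy as np

def shrink_covariance(Sigma, beta):
    """Sigma(beta) = (1 - beta) * Sigma + beta * I, for 0 < beta < 1."""
    d = Sigma.shape[0]
    return (1 - beta) * Sigma + beta * np.eye(d)

Sigma = np.array([[2.0, 0.8], [0.8, 0.5]])
print(shrink_covariance(Sigma, beta=0.1))
```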
• Component Analysis and Discriminants
• Hidden Markov Models: first-order model with parameters θ = (a_ij, T), where the a_ij are the transition probabilities between hidden states and T is the sequence length
– Example: the probability that the model generates the particular hidden-state sequence ω1 → ω4 → ω2 → ω2 → ω1 → ω4 is

$$P(\omega^T \mid \theta) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14} \cdot P(\omega(1) = \omega_1)$$
Chapter 3 (Part 3):
Maximum-Likelihood and Bayesian Parameter
Estimation (Section 3.10)
• The evaluation problem
It consists of computing the probability that the model produces a sequence V^T
of visible states:

$$P(V^T) = \sum_{r=1}^{r_{\max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T)$$

where each r indexes a particular sequence of T hidden
states:

$$\omega_r^T = \{\omega(1), \omega(2), \ldots, \omega(T)\}$$

$$\text{(1)} \quad P(V^T \mid \omega_r^T) = \prod_{t=1}^{T} P(v(t) \mid \omega(t))$$

$$\text{(2)} \quad P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))$$
Using equations (1) and (2), we can write:

$$P(V^T) = \sum_{r=1}^{r_{\max}} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))$$

Example: Let ω1, ω2, ω3 be the hidden states; v1, v2, v3 be the visible states;
and V^3 = {v1, v2, v3} be the observed sequence of visible states.

First possibility: hidden sequence ω1 (t = 1) → ω2 (t = 2) → ω3 (t = 3), emitting v1, v2, v3
Second possibility: hidden sequence ω2 (t = 1) → ω3 (t = 2) → ω1 (t = 3), emitting v1, v2, v3

Therefore, summing over every possible sequence of hidden states:

$$P(\{v_1, v_2, v_3\}) = \sum_{\text{possible sequences of hidden states}} \;\; \prod_{t=1}^{3} P(v(t) \mid \omega(t)) \cdot P(\omega(t) \mid \omega(t-1))$$
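A brute-force sketch of the evaluation formula (my own; the transition matrix A, emission matrix B, and initial distribution pi are assumed, and in practice the Forward algorithm replaces this exhaustive sum):

```python
import itertools
import numpy as np

# Assumed (illustrative) 3-state model: A[i, j] = P(w_j at t | w_i at t-1),
# B[i, k] = P(v_k | w_i), pi[i] = P(w(1) = w_i).
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
pi = np.array([0.5, 0.3, 0.2])

V = [0, 1, 2]   # observed visible sequence v1, v2, v3 (as indices)

# P(V^T) = sum over all hidden sequences of  prod_t P(v(t)|w(t)) P(w(t)|w(t-1))
total = 0.0
for states in itertools.product(range(3), repeat=len(V)):
    p = pi[states[0]] * B[states[0], V[0]]
    for t in range(1, len(V)):
        p *= A[states[t - 1], states[t]] * B[states[t], V[t]]
    total += p

print(total)   # probability that the model produces the visible sequence V
```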
• The decoding problem (optimal state sequence)
Given a sequence of visible states V^T, find the most probable sequence of hidden states.
• The learning problem (parameter estimation)
This third problem consists of determining a method to adjust the
model parameters θ = [π, A, B] to satisfy a certain optimization
criterion. We need to find the best model

$$\hat{\theta} = [\hat{\pi}, \hat{A}, \hat{B}]$$

so as to maximize the probability of the observation sequence:

$$\max_{\theta} P(V^T \mid \theta)$$
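As a naive illustration (mine; real training uses the Forward-Backward / Baum-Welch procedure), one could score a handful of candidate parameter settings with the evaluation formula above and keep the best one:

```python
import itertools
import numpy as np

def evaluate(pi, A, B, V):
    """P(V^T | theta) by summing over all hidden-state sequences (small T only)."""
    n_states = len(pi)
    total = 0.0
    for states in itertools.product(range(n_states), repeat=len(V)):
        p = pi[states[0]] * B[states[0], V[0]]
        for t in range(1, len(V)):
            p *= A[states[t - 1], states[t]] * B[states[t], V[t]]
        total += p
    return total

V = [0, 1, 2]                       # observed visible sequence (indices)
pi = np.array([0.5, 0.3, 0.2])      # assumed fixed initial distribution
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])     # assumed fixed emission probabilities

# Two candidate transition matrices; keep whichever maximizes P(V^T | theta).
candidates = [np.full((3, 3), 1.0 / 3.0),
              np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.3, 0.3, 0.4]])]
best_A = max(candidates, key=lambda A: evaluate(pi, A, B, V))
print(evaluate(pi, best_A, B, V))
```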