
CS 677 Pattern Recognition

Lecture 4: Introduction to Pattern Recognition and Bayesian Decision Theory

Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering
Spring 24
Chapter 3:
Maximum-Likelihood & Bayesian Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
  – Example of a Specific Case
  – The Gaussian Case: unknown μ and Σ
  – Bias

All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
• Introduction
  – Data availability in a Bayesian framework
    • We could design an optimal classifier if we knew:
      – P(ωi) (priors)
      – P(x | ωi) (class-conditional densities)
      Unfortunately, we rarely have this complete information!
  – Design a classifier from a training sample
    • No problem with prior estimation
    • Samples are often too small for class-conditional estimation (large dimension of feature space!)
  – A priori information about the problem
    • Do we know something about the distribution?
    • → find parameters to characterize the distribution
  – Example: Normality of P(x | ωi)

        P(x | ωi) ~ N(μi, Σi)

    • Characterized by 2 parameters
  – Estimation techniques
    • Maximum-Likelihood (ML) and Bayesian estimation
    • Results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but unknown!
  – Best parameters are obtained by maximizing the probability of obtaining the samples observed

• Bayesian methods view the parameters as random variables having some known distribution

• In either approach, we use P(ωi | x) for our classification rule!
• Maximum-Likelihood Estimation
  • Has good convergence properties as the sample size increases
  • Simpler than any other alternative techniques
  – General principle
    • Assume we have c classes and

        P(x | ωj) ~ N(μj, Σj)
        P(x | ωj) ≡ P(x | ωj, θj)   where:
        θj = (μj, Σj) = (μj1, μj2, …, σj11, σj22, cov(xjm, xjn), …)
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category

• Suppose that D contains n samples, x1, x2, …, xn

        P(D | θ) = ∏_{k=1..n} P(xk | θ) = F(θ)

  P(D | θ) is called the likelihood of θ w.r.t. the set of samples

• The ML estimate of θ is, by definition, the value that maximizes P(D | θ)
  "It is the value of θ that best agrees with the actually observed training sample"
• Optimal estimation
  – Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator

        ∇θ = [∂/∂θ1, ∂/∂θ2, …, ∂/∂θp]^t

  – We define l(θ) as the log-likelihood function

        l(θ) = ln P(D | θ)
    (recall D is the training data)

  – New problem statement: determine the θ that maximizes the log-likelihood

        θ̂ = arg max_θ l(θ)

The definition of l(θ) is:

        l(θ) = ∑_{k=1..n} ln p(xk | θ)

and

        ∇θ l = ∑_{k=1..n} ∇θ ln p(xk | θ)        (eq. 6)

The set of necessary conditions for an optimum is:

        ∇θ l = 0        (eq. 7)
• Example, the Gaussian case: unknown μ
  – We assume we know the covariance Σ
  – p(xk | μ) ~ N(μ, Σ)
    (samples are drawn from a multivariate normal population)

        ln p(xk | μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)^t Σ⁻¹ (xk − μ)

    and

        ∇μ ln p(xk | μ) = Σ⁻¹ (xk − μ)        (eq. 9)

  – θ = μ, therefore the ML estimate for μ must satisfy:

        ∑_{k=1..n} Σ⁻¹ (xk − μ̂) = 0        (from eqs. 6, 7 & 9)
• Multiplying by Σ and rearranging, we obtain:

        μ̂ = (1/n) ∑_{k=1..n} xk

  This is just the arithmetic average of the training samples!

  Conclusion:
  If p(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
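A minimal NumPy sketch (not from the slides; the toy data and the perturbation of the mean are made up for illustration) of the result above: the sample mean maximizes the Gaussian log-likelihood, so any other candidate μ scores lower.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=3.0, scale=2.0, size=(500, 1))   # toy 1-D samples

    def gaussian_log_likelihood(X, mu, sigma2):
        # l(theta) = sum_k ln p(x_k | mu, sigma2) for a Gaussian with known variance
        n, d = X.shape
        diff = X - mu
        return -0.5 * n * d * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(diff**2) / sigma2

    mu_hat = X.mean(axis=0)                               # ML estimate: arithmetic average
    print(gaussian_log_likelihood(X, mu_hat, 4.0))        # highest value
    print(gaussian_log_likelihood(X, mu_hat + 0.5, 4.0))  # any shifted mu scores lower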
• Example, the Gaussian case: unknown μ and σ
  – First consider the univariate case: unknown μ and σ

        θ = (θ1, θ2) = (μ, σ²)

        l = ln p(xk | θ) = −(1/2) ln(2πθ2) − (1/(2θ2))(xk − θ1)²

        ∇θ l = [ ∂(ln p(xk | θ))/∂θ1 , ∂(ln p(xk | θ))/∂θ2 ]^t = 0

    which gives the two conditions:

        (1/θ2)(xk − θ1) = 0
        −1/(2θ2) + (xk − θ1)²/(2θ2²) = 0

  Summation (over the training set):

        ∑_{k=1..n} (1/σ̂²)(xk − μ̂) = 0                                (1)

        −∑_{k=1..n} 1/σ̂² + ∑_{k=1..n} (xk − μ̂)²/σ̂⁴ = 0               (2)

  Combining (1) and (2), one obtains:

        μ̂ = (1/n) ∑_{k=1..n} xk ;   σ̂² = (1/n) ∑_{k=1..n} (xk − μ̂)²

• The ML estimates for the multivariate case are similar
  – The scalars μ and σ are replaced with vectors
  – The variance σ² is replaced by the covariance matrix Σ

        μ̂ = (1/n) ∑_{k=1..n} xk

        Σ̂ = (1/n) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)^t
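A short sketch (NumPy, toy 2-D data invented for illustration) of the multivariate ML estimates above; note the 1/n normalization of Σ̂, which is revisited under "Bias" next.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[2.0, 0.3], [0.3, 1.0]], size=1000)  # toy data

    n = X.shape[0]
    mu_hat = X.mean(axis=0)                 # mu_hat = (1/n) sum_k x_k
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / n         # Sigma_hat = (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t
    # Equivalent: np.cov(X, rowvar=False, bias=True) uses the same 1/n normalization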
Bias
  – The ML estimate for σ² is biased:

        E[ (1/n) ∑_{i=1..n} (xi − x̄)² ] = ((n − 1)/n) σ²  ≠  σ²

  – Extreme case: n = 1, E[·] = 0 ≠ σ²
  – As n increases the bias is reduced
    → this type of estimator is called asymptotically unbiased
  – An elementary unbiased estimator for Σ is:

        C = (1/(n − 1)) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)^t        (the sample covariance matrix)

    This estimator is unbiased for all distributions
    → Such estimators are called absolutely unbiased

  – Our earlier estimator for Σ is biased:

        Σ̂ = (1/n) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)^t

    In fact it is asymptotically unbiased: observe that

        Σ̂ = ((n − 1)/n) C
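A quick empirical check (illustrative NumPy sketch with made-up parameters) of the bias relation above: averaged over many small samples, the 1/n estimator underestimates the true variance by the factor (n − 1)/n, while the 1/(n − 1) estimator does not.

    import numpy as np

    rng = np.random.default_rng(2)
    true_var, n, trials = 4.0, 5, 20000          # small n makes the bias visible

    biased, unbiased = [], []
    for _ in range(trials):
        x = rng.normal(0.0, np.sqrt(true_var), size=n)
        biased.append(np.var(x))                 # 1/n normalization (ML estimate)
        unbiased.append(np.var(x, ddof=1))       # 1/(n-1) normalization (sample covariance C)

    print(np.mean(biased))    # ~ (n-1)/n * 4.0 = 3.2  -> biased
    print(np.mean(unbiased))  # ~ 4.0                  -> unbiased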
Chapter 3:
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
⚫ Bayesian Estimation (BE)
⚫ Bayesian Parameter Estimation: Gaussian Case
⚫ Bayesian Parameter Estimation: General Estimation
⚫ Problems of Dimensionality
⚫ Computational Complexity
⚫ Component Analysis and Discriminants
⚫ Hidden Markov Models
• Bayesian Estimation (Bayesian learning applied to pattern classification problems)
  – In MLE θ was assumed fixed
  – In BE θ is a random variable
  – The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
  – Goal: compute P(ωi | x, D)
    Given the sample D, Bayes formula can be written:

        P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / ∑_{j=1..c} p(x | ωj, D) P(ωj | D)
• To demonstrate the preceding equation, use:

        P(x, D | ωi) = P(x | D, ωi) P(D | ωi)      (from def. of cond. prob.)
        P(x | D) = ∑_j P(x, ωj | D)                (from law of total prob.)
        P(ωi) = P(ωi | D)                          (the training sample provides this!)

  We assume that samples in different classes are independent. Thus:

        P(ωi | x, D) = p(x | ωi, Di) P(ωi) / ∑_{j=1..c} p(x | ωj, Dj) P(ωj)
• Bayesian Parameter Estimation: Gaussian Case

  Goal: estimate μ using the a-posteriori density P(μ | D)

  – The univariate case: P(μ | D)
    μ is the only unknown parameter

        P(x | μ) ~ N(μ, σ²)
        P(μ) ~ N(μ0, σ0²)

    (μ0 and σ0 are known!)
        P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ        (Bayes formula)
                 = α ∏_{k=1..n} P(xk | μ) P(μ)

  – But we know

        p(xk | μ) ~ N(μ, σ²)  and  p(μ) ~ N(μ0, σ0²)

  Plugging in their Gaussian expressions and extracting out factors not depending on μ yields:

        p(μ | D) = α′ exp[ −(1/2) ( (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) ∑_{k=1..n} xk + μ0/σ0² ) μ ) ]

  (from eq. 29, page 93)

Observation: p(μ | D) is an exponential of a quadratic function of μ

        p(μ | D) = α′ exp[ −(1/2) ( (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) ∑_{k=1..n} xk + μ0/σ0² ) μ ) ]

It is again normal! It is called a reproducing density.

        P(μ | D) ~ N(μn, σn²)

 1  n 1   1 n
 0   

p (  | D ) =  exp −  2 + 2   − 2 2
 2  0 
2
 xk + 2   
 0   
   k =1

• Identifying coefficients in the top equation with that of


the generic Gaussian
2
1  1   − n  
p(  | D) = exp  −   
2  n  2  n 

1
Yields n 1 for nand
expressions  2
n 0
= + and 2 = 2 ˆ n + 2
n n

 2
n  2
 02 n  0
26

Solving for μn and σn² yields:

        μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0

        σn² = σ0² σ² / (n σ0² + σ²)

From these equations we see that as n increases:
  – the variance σn² decreases monotonically
  – the estimate of p(μ | D) becomes more and more peaked
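A small sketch (NumPy only; the prior and noise values are made up) of the update above: the posterior mean μn is a precision-weighted blend of the prior mean μ0 and the sample mean, and σn² shrinks as n grows.

    import numpy as np

    def posterior_mean_var(x, mu0, sigma0_sq, sigma_sq):
        # Posterior N(mu_n, sigma_n^2) for the mean of N(mu, sigma_sq), prior N(mu0, sigma0_sq)
        n = len(x)
        mu_hat = np.mean(x)
        mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * mu_hat \
             + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
        sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
        return mu_n, sigma_n_sq

    rng = np.random.default_rng(3)
    x = rng.normal(5.0, 2.0, size=50)            # toy observations, sigma^2 = 4 assumed known
    for n in (1, 5, 50):
        print(n, posterior_mean_var(x[:n], mu0=0.0, sigma0_sq=10.0, sigma_sq=4.0))
    # sigma_n^2 decreases monotonically; mu_n moves from the prior toward the sample mean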

  – The univariate case: P(x | D)
    • P(μ | D) has been computed (in the preceding discussion)
    • P(x | D) remains to be computed!

        P(x | D) = ∫ P(x | μ) P(μ | D) dμ   is Gaussian:
        P(x | D) ~ N(μn, σ² + σn²)

    We know σ² and how to compute μn and σn².
    This provides the desired class-conditional density P(x | Dj, ωj).
    Therefore, using P(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

        Max_{ωj} [ P(ωj | x, D) ]  ≡  Max_{ωj} [ P(x | ωj, Dj) P(ωj) ]
• 3.5 Bayesian Parameter Estimation: General Theory

  – The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

    • The form of P(x | θ) is assumed known, but the value of θ is not known exactly
    • Our knowledge about θ is assumed to be contained in a known prior density P(θ)
    • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to P(x)

The basic problem is:
"Compute the posterior density P(θ | D)", then "Derive P(x | D)", where

        p(x | D) = ∫ p(x | θ) p(θ | D) dθ

Using Bayes formula, we have:

        P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

And by the independence assumption:

        p(D | θ) = ∏_{k=1..n} p(xk | θ)
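For a one-dimensional parameter, the two integrals above can be approximated on a grid. This sketch (NumPy and SciPy; the prior, data, and query point are invented) computes P(θ | D) and then the predictive p(x | D) for a Gaussian with unknown mean and known σ.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    data = rng.normal(2.0, 1.0, size=20)          # toy samples, sigma = 1 assumed known

    theta = np.linspace(-5, 10, 2001)             # grid over the unknown mean
    dtheta = theta[1] - theta[0]
    prior = norm.pdf(theta, loc=0.0, scale=3.0)   # P(theta)

    # p(D | theta) = prod_k p(x_k | theta), computed in log space for stability
    log_lik = norm.logpdf(data[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
    post = prior * np.exp(log_lik - log_lik.max())
    post /= (post * dtheta).sum()                 # normalize: P(theta | D)

    # p(x | D) = integral of p(x | theta) p(theta | D) dtheta, evaluated at one query point x
    x = 2.5
    p_x_given_D = (norm.pdf(x, loc=theta, scale=1.0) * post * dtheta).sum()
    print(p_x_given_D)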
• Problems of Dimensionality
  – Problems involving 50 or 100 features are common (usually binary valued)
  – Note: microarray data might entail ~20000 real-valued features
  – Classification accuracy depends on:
    • dimensionality
    • amount of training data
    • discrete vs. continuous features
• Case of two-class multivariate normal with the same covariance
  – P(x | ωj) ~ N(μj, Σ), j = 1, 2
  – Statistically independent features
  – If the priors are equal, then:

        P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^(−u²/2) du        (Bayes error)

    where r² = (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2) is the squared Mahalanobis distance

        lim_{r→∞} P(error) = 0
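A short illustrative computation (NumPy/SciPy; the class means and covariance are made up) of the Bayes-error formula above: r is the Mahalanobis distance between the class means, and P(error) is a standard-normal tail probability at r/2.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical two-class problem with equal priors and a shared covariance
    mu1 = np.array([0.0, 0.0])
    mu2 = np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.2],
                      [0.2, 1.5]])

    diff = mu1 - mu2
    r2 = diff @ np.linalg.inv(Sigma) @ diff      # squared Mahalanobis distance
    r = np.sqrt(r2)

    p_error = norm.sf(r / 2.0)                   # (1/sqrt(2*pi)) * integral_{r/2}^inf e^{-u^2/2} du
    print(r, p_error)                            # error -> 0 as r grows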
• If features are conditionally independent, then:

        Σ = diag(σ1², σ2², …, σd²)

        r² = ∑_{i=1..d} ( (μi1 − μi2) / σi )²

• Do we remember what conditional independence is?

• Example for binary features:
  Let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product of the pi
• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
  – This doesn't require independence

• Adding independent features helps increase r → reduces error

• Caution: adding features increases the cost and complexity of the feature extractor and the classifier

• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
  – we have the wrong model!
  – we don't have enough training data to support the additional dimensions

Overfitting
• Dimensionality of model vs. size of training data
  – Issue: not enough data to support the model
  – Possible solutions:
    • Reduce model dimensionality
    • Make (possibly incorrect) assumptions to better estimate Σ
Overfitting
• Estimate a better Σ
  – use data pooled from all classes
    • normalization issues
  – use the pseudo-Bayesian form λΣ0 + (1 − λ)Σ̂n
  – "doctor" Σ by thresholding entries
    • reduces chance correlations
  – assume statistical independence
    • zero all off-diagonal elements

Shrinkage

• Shrinkage: a weighted combination of the common and individual covariances

        Σi(α) = [ (1 − α) ni Σi + α n Σ ] / [ (1 − α) ni + α n ]        for 0 ≤ α ≤ 1

• We can also shrink the estimated common covariance toward the identity matrix

        Σ(β) = (1 − β) Σ + β I        for 0 ≤ β ≤ 1
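A brief sketch (NumPy; the class counts and matrices are invented) of both shrinkage formulas: a per-class covariance pulled toward the pooled covariance, and the pooled covariance pulled toward the identity.

    import numpy as np

    def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
        # Sigma_i(alpha) = [(1-alpha) n_i Sigma_i + alpha n Sigma] / [(1-alpha) n_i + alpha n]
        return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / ((1 - alpha) * n_i + alpha * n)

    def shrink_to_identity(Sigma, beta):
        # Sigma(beta) = (1-beta) Sigma + beta I
        return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

    # Hypothetical example: one class with few samples, pooled estimate from all classes
    Sigma_i = np.array([[3.0, 1.2], [1.2, 0.5]])   # noisy per-class estimate (n_i small)
    Sigma   = np.array([[2.0, 0.1], [0.1, 1.0]])   # pooled (common) covariance
    print(shrink_to_common(Sigma_i, n_i=10, Sigma=Sigma, n=500, alpha=0.3))
    print(shrink_to_identity(Sigma, beta=0.1))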
• Component Analysis and Discriminants
  – Combine features in order to reduce the dimension of the feature space
  – Linear combinations are simple to compute and tractable
  – Project high-dimensional data onto a lower-dimensional space
  – Two classical approaches for finding an "optimal" linear transformation (a PCA sketch follows below):
    • PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
    • MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
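A minimal PCA sketch (NumPy; random data used only for illustration): project the centered data onto the top eigenvectors of the sample covariance, i.e., the subspace that best represents the data in the least-squares sense.

    import numpy as np

    def pca_project(X, k):
        # Project X (n samples x d features) onto its top-k principal components
        X_centered = X - X.mean(axis=0)
        cov = np.cov(X_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
        top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # k directions of largest variance
        return X_centered @ top                             # reduced-dimension representation

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 10))                          # toy 10-D data
    Y = pca_project(X, k=2)                                 # 200 x 2 projection
    print(Y.shape)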
• Hidden Markov Models:
  – Markov Chains
  – Goal: make a sequence of decisions
    • Processes that unfold in time: the state at time t is influenced by the state at time t − 1
    • Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …
    • Any temporal process without memory

        ω^T = {ω(1), ω(2), ω(3), …, ω(T)}   is a sequence of states
        We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}

    • The system can revisit a state at different steps, and not every state needs to be visited
  – First-order Markov models
    • Our production of any sequence is described by the transition probabilities

        P(ωj(t + 1) | ωi(t)) = aij
        θ = (aij, ω^T)
        P(ω^T | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)

  Example: speech recognition ("production of spoken words")

  Production of the word "pattern", represented by phonemes:
        /p/ /a/ /tt/ /er/ /n/  plus a silent state
  Transitions: from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
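A tiny sketch (NumPy; the transition matrix and initial probabilities are invented) of the product above: the probability of a particular state sequence under a first-order Markov model is the initial-state probability times the chain of transition probabilities aij.

    import numpy as np

    # Hypothetical 4-state transition matrix A, with A[i, j] = P(omega_j at t+1 | omega_i at t)
    A = np.array([[0.10, 0.20, 0.30, 0.40],
                  [0.30, 0.30, 0.20, 0.20],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.40, 0.10, 0.10, 0.40]])
    pi = np.array([0.25, 0.25, 0.25, 0.25])      # initial state probabilities

    def sequence_probability(states, A, pi):
        # P(omega^T | theta) = P(omega(1)) * prod_t a_{omega(t) omega(t+1)}
        p = pi[states[0]]
        for s, s_next in zip(states[:-1], states[1:]):
            p *= A[s, s_next]
        return p

    # e.g. the sequence omega1, omega4, omega2, omega2, omega1, omega4 (0-indexed here)
    print(sequence_probability([0, 3, 1, 1, 0, 3], A, pi))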
Chapter 3 (Part 3):
Maximum-Likelihood and Bayesian Parameter Estimation (Section 3.10)

Hidden Markov Model: Extension of Markov Chains

• Hidden Markov Model (HMM)
  – Interaction of the visible states with the hidden states:
        ∑_k bjk = 1 for all j, where bjk = P(vk(t) | ωj(t))
  – 3 problems are associated with this model:
    • The evaluation problem
    • The decoding problem
    • The learning problem
• The evaluation problem
  Compute the probability that the model produces a sequence V^T of visible states:

        P(V^T) = ∑_{r=1..rmax} P(V^T | ωr^T) P(ωr^T)

  where each r indexes a particular sequence of T hidden states:

        ωr^T = { ω(1), ω(2), …, ω(T) }

        (1)  P(V^T | ωr^T) = ∏_{t=1..T} P(v(t) | ω(t))
        (2)  P(ωr^T) = ∏_{t=1..T} P(ω(t) | ω(t − 1))
Using equations (1) and (2), we can write:

        P(V^T) = ∑_{r=1..rmax} ∏_{t=1..T} P(v(t) | ω(t)) P(ω(t) | ω(t − 1))

Interpretation: the probability that we observe the particular sequence of T visible states V^T is equal to the sum, over all rmax possible sequences of hidden states, of the conditional probability that the system has made a particular transition multiplied by the probability that it then emitted the visible symbol in our target sequence.

Example: Let ω1, ω2, ω3 be the hidden states; v1, v2, v3 the visible states; and V^3 = {v1, v2, v3} the sequence of visible states.

        P({v1, v2, v3}) = P(ω1)·P(v1 | ω1)·P(ω2 | ω1)·P(v2 | ω2)·P(ω3 | ω2)·P(v3 | ω3) + …
        (possible terms in the sum = all 3³ = 27 cases!)
First possibility (hidden states ω1, ω2, ω3 emitting v1, v2, v3):

        v1        v2        v3
        ω1   →    ω2   →    ω3
      (t = 1)   (t = 2)   (t = 3)

Second possibility (hidden states ω2, ω3, ω1):

        v1        v2        v3
        ω2   →    ω3   →    ω1
      (t = 1)   (t = 2)   (t = 3)

        P({v1, v2, v3}) = P(ω2)·P(v1 | ω2)·P(ω3 | ω2)·P(v2 | ω3)·P(ω1 | ω3)·P(v3 | ω1) + …

Therefore (see the sketch below):

        P({v1, v2, v3}) = ∑_{possible sequences of hidden states} ∏_{t=1..3} P(v(t) | ω(t))·P(ω(t) | ω(t − 1))
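A brute-force evaluation sketch (NumPy; the A and B matrices and initial probabilities are made up): it sums the product of transition and emission probabilities over every possible hidden-state sequence, exactly as in the formula above. (Practical systems use the forward algorithm instead, which avoids this exponential enumeration.)

    import itertools
    import numpy as np

    # Hypothetical 3-state HMM with 3 visible symbols
    A  = np.array([[0.5, 0.3, 0.2],      # A[i, j] = P(omega_j at t | omega_i at t-1)
                   [0.2, 0.5, 0.3],
                   [0.3, 0.2, 0.5]])
    B  = np.array([[0.6, 0.3, 0.1],      # B[j, k] = P(v_k | omega_j)
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
    pi = np.array([1/3, 1/3, 1/3])       # P(omega(1))

    def evaluate(visible, A, B, pi):
        # P(V^T) = sum over all hidden sequences of prod_t P(v(t)|omega(t)) P(omega(t)|omega(t-1))
        n_states, total = A.shape[0], 0.0
        for hidden in itertools.product(range(n_states), repeat=len(visible)):
            p = pi[hidden[0]] * B[hidden[0], visible[0]]
            for t in range(1, len(visible)):
                p *= A[hidden[t - 1], hidden[t]] * B[hidden[t], visible[t]]
            total += p                   # one of the r_max = n_states^T terms
        return total

    print(evaluate([0, 1, 2], A, B, pi))  # observed V^3 = {v1, v2, v3}: 3^3 = 27 terms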
• The decoding problem (optimal state sequence)

  Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states.
  This problem can be expressed mathematically as: find the single "best" state sequence (hidden states)

        ω̂(1), ω̂(2), …, ω̂(T)   such that:

        ω̂(1), ω̂(2), …, ω̂(T) = arg max_{ω(1), ω(2), …, ω(T)} P[ ω(1), ω(2), …, ω(T), v(1), v(2), …, v(T) | θ ]

  Note that the summation disappeared, since we want to find only the one unique best case!
Where θ = [π, A, B]:
        π = P(ω(1) = ωi)                           (initial state probability)
        A = {aij} = P(ω(t + 1) = ωj | ω(t) = ωi)
        B = {bjk} = P(v(t) = vk | ω(t) = ωj)

In the preceding example, this computation corresponds to the selection of the best path amongst (see the Viterbi sketch below):

        {ω1(t = 1), ω2(t = 2), ω3(t = 3)}, {ω2(t = 1), ω3(t = 2), ω1(t = 3)},
        {ω3(t = 1), ω1(t = 2), ω2(t = 3)}, {ω3(t = 1), ω2(t = 2), ω1(t = 3)},
        {ω2(t = 1), ω1(t = 2), ω3(t = 3)}
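A compact Viterbi-style sketch (NumPy; it reuses the same hypothetical A, B, and π values as in the evaluation sketch) for the decoding problem: keep, for each state and time step, the best score and a back-pointer, then trace back the single best hidden sequence.

    import numpy as np

    def viterbi(visible, A, B, pi):
        # Return the most probable hidden-state sequence for the observed visible symbols
        T, n = len(visible), A.shape[0]
        score = np.zeros((T, n))          # best probability of any path ending in state j at time t
        back = np.zeros((T, n), dtype=int)
        score[0] = pi * B[:, visible[0]]
        for t in range(1, T):
            cand = score[t - 1][:, None] * A * B[:, visible[t]][None, :]
            back[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0)
        path = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):     # trace the back-pointers
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    A  = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]])
    B  = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]])
    pi = np.array([1/3, 1/3, 1/3])
    print(viterbi([0, 1, 2], A, B, pi))   # -> [0, 1, 2] for these matrices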
• The learning problem (parameter estimation)
  This third problem consists of determining a method to adjust the model parameters θ = [π, A, B] to satisfy a certain optimization criterion. We need to find the best model

        θ̂ = [π̂, Â, B̂]

  such that the probability of the observation sequence is maximized:

        max_θ P(V^T | θ)

  We use an iterative procedure such as Baum-Welch or a gradient-based method to find this local optimum.
