
CS 677 Pattern Recognition

Lecture 4: Introduction to Pattern Recognition and Bayesian Decision Theory

Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering
Spring 24
Chapter 3:
Maximum-Likelihood & Bayesian
Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
– Example of a Specific Case
– The Gaussian Case: unknown μ and Σ
– Bias

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
• Introduction
– Data availability in a Bayesian framework
• We could design an optimal classifier if we knew:
– P(ωi) (priors)
– P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information!

– Design a classifier from a training sample
• No problem with prior estimation
• Samples are often too small for class-conditional estimation (large dimension of feature space!)
– A priori information about the problem
• Do we know something about the distribution?
• → find parameters to characterize the distribution

– Example: normality of P(x | ωi)

  P(x | ωi) ~ N(μi, Σi)

• Characterized by 2 parameters: the mean μi and the covariance Σi

– Estimation techniques
• Maximum-Likelihood (ML) estimation and Bayesian estimation
• The results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but unknown!
– Best parameters are obtained by maximizing the probability of obtaining the samples observed

• Bayesian methods view the parameters as random variables having some known distribution

• In either approach, we use P(ωi | x) for our classification rule!
• Maximum-Likelihood Estimation
• Has good convergence properties as the sample size increases
• Simpler than alternative techniques
– General principle
• Assume we have c classes and

  P(x | ωj) ~ N(μj, Σj)
  P(x | ωj) ≡ P(x | ωj, θj), where

  θj = (μj, Σj) = (μj1, μj2, …, σj11, σj22, cov(xjm, xjn), …)

  i.e. θj collects the components of the mean vector and of the covariance matrix of class ωj
• Use the information provided by the training samples to estimate
θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category

• Suppose that D contains n samples, x1, x2, …, xn, drawn independently; then

  P(D | θ) = ∏_{k=1}^{n} P(xk | θ) = F(θ)

P(D | θ) is called the likelihood of θ w.r.t. the set of samples

• The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ):
"It is the value of θ that best agrees with the actually observed training sample"
• Optimal estimation
– Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator

  ∇θ = [∂/∂θ1, ∂/∂θ2, …, ∂/∂θp]^t

– We define l(θ) as the log-likelihood function

  l(θ) = ln P(D | θ)
  (recall D is the training data)

– New problem statement: determine the θ that maximizes the log-likelihood

  θ̂ = arg max_θ l(θ)

The definition of l(θ) gives:

  l(θ) = Σ_{k=1}^{n} ln p(xk | θ)

and

  ∇θ l = Σ_{k=1}^{n} ∇θ ln p(xk | θ)    (eq. 6)

The set of necessary conditions for an optimum is:

  ∇θ l = 0    (eq. 7)
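A minimal numerical sketch of what eq. 7 means in practice (the synthetic data, the optimizer choice and all variable names are my own, not from the text): maximize l(θ) = Σk ln p(xk | θ) for a univariate Gaussian with θ = (μ, σ) by minimizing the negative log-likelihood. The result should agree with the closed-form estimates derived on the following slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # observed samples x_1 ... x_n

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                       # close to x.mean() and the 1/n standard deviation
```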
• Example, the Gaussian case: unknown μ
– We assume we know the covariance Σ
– p(xk | μ) ~ N(μ, Σ)
(samples are drawn from a multivariate normal population)

  ln p(xk | μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)^t Σ^{-1} (xk − μ)

and

  ∇μ ln p(xk | μ) = Σ^{-1}(xk − μ)    (eq. 9)

θ = μ, therefore the ML estimate for μ must satisfy:

  Σ_{k=1}^{n} Σ^{-1}(xk − μ̂) = 0    (from eqs. 6, 7 & 9)
• Multiplying by Σ and rearranging, we obtain:

  μ̂ = (1/n) Σ_{k=1}^{n} xk

i.e. just the arithmetic average of the training samples!

Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
• Example, the Gaussian case: unknown μ and σ
– First consider the univariate case: unknown μ and σ²

  θ = (θ1, θ2) = (μ, σ²)

  l = ln p(xk | θ) = −(1/2) ln(2π θ2) − (1/(2θ2))(xk − θ1)²

  ∇θ l = [ ∂(ln p(xk | θ))/∂θ1 , ∂(ln p(xk | θ))/∂θ2 ]^t = 0

which gives the two conditions:

  (1/θ2)(xk − θ1) = 0
  −1/(2θ2) + (xk − θ1)²/(2θ2²) = 0

Summation over the training set:

  Σ_{k=1}^{n} (1/θ̂2)(xk − θ̂1) = 0                              (1)

  −Σ_{k=1}^{n} (1/θ̂2) + Σ_{k=1}^{n} (xk − θ̂1)²/θ̂2² = 0        (2)

Combining (1) and (2), one obtains:

  μ̂ = (1/n) Σ_{k=1}^{n} xk ;    σ̂² = (1/n) Σ_{k=1}^{n} (xk − μ̂)²
• The ML estimates for the multivariate case are similar
– The scalars (the samples xk and the mean μ) are replaced with vectors
– The variance σ² is replaced by the covariance matrix Σ

  μ̂ = (1/n) Σ_{k=1}^{n} xk

  Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t
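A minimal sketch of the multivariate ML estimates above (illustrative only; the synthetic data and names are my own): the sample mean and the 1/n-normalized scatter matrix. Note that np.cov divides by n − 1 by default, which is the unbiased estimator discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=1000)   # n x d sample matrix

n = X.shape[0]
mu_hat = X.mean(axis=0)             # (1/n) * sum_k x_k
diff = X - mu_hat
cov_hat = (diff.T @ diff) / n       # (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(cov_hat)
```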
Bias
– The ML estimate for σ² is biased:

  E[ (1/n) Σ_{i=1}^{n} (xi − x̄)² ] = ((n−1)/n) σ²  ≠  σ²

– Extreme case: n = 1 gives E[·] = 0 ≠ σ²

– As n increases the bias is reduced
→ this type of estimator is called asymptotically unbiased
– An elementary unbiased estimator for Σ is:

  C = (1/(n−1)) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t    (the sample covariance matrix)

This estimator is unbiased for all distributions
→ such estimators are called absolutely unbiased
– Our earlier estimator for Σ is biased:

  Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t

In fact it is asymptotically unbiased; observe that

  Σ̂ = ((n−1)/n) C
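A small illustrative experiment (synthetic data; the numbers and names are my own) showing the bias empirically: averaged over many repetitions, the 1/n variance estimate approaches ((n−1)/n)σ², while the 1/(n−1) sample variance approaches σ².

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # 1/n normalization (ML, biased)
unbiased_var = samples.var(axis=1, ddof=1)  # 1/(n-1) normalization (sample variance)

print(ml_var.mean(), (n - 1) / n * sigma2)  # both near 3.2
print(unbiased_var.mean(), sigma2)          # both near 4.0
```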
Chapter 3:
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
⚫ Bayesian Estimation (BE)
⚫ Bayesian Parameter Estimation: Gaussian Case
⚫ Bayesian Parameter Estimation: General Estimation
⚫ Problems of Dimensionality
⚫ Computational Complexity
⚫ Component Analysis and Discriminants
⚫ Hidden Markov Models
• Bayesian Estimation (Bayesian learning applied to pattern classification problems)
– In MLE, θ was assumed to be fixed
– In BE, θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written:

  P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / Σ_{j=1}^{c} p(x | ωj, D) P(ωj | D)
• To demonstrate the preceding equation, use:

  P(x, D | ωi) = P(x | D, ωi) P(D | ωi)    (from the definition of conditional probability)

  P(x | D) = Σ_j P(x, ωj | D)    (from the law of total probability)

  P(ωi) = P(ωi | D)    (the training sample provides this!)

We assume that the samples in different classes are independent.
Thus:

  P(ωi | x, D) = p(x | ωi, Di) P(ωi) / Σ_{j=1}^{c} p(x | ωj, Dj) P(ωj)
• Bayesian Parameter Estimation: Gaussian Case

Goal: estimate μ using the a posteriori density P(μ | D)

– The univariate case: P(μ | D)
  μ is the only unknown parameter

  P(x | μ) ~ N(μ, σ²)
  P(μ) ~ N(μ0, σ0²)

  (μ0 and σ0 are known!)
  P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ      (Bayes formula)

           = α ∏_{k=1}^{n} P(xk | μ) P(μ)

– But we know that
  p(xk | μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²)

Plugging in their Gaussian expressions and extracting out the factors that do not depend on μ yields:

  p(μ | D) = α' exp{ −(1/2)[ (n/σ² + 1/σ0²) μ² − 2( (1/σ²) Σ_{k=1}^{n} xk + μ0/σ0² ) μ ] }

(from eq. 29, page 93)

Observation: p(μ | D) is an exponential of a quadratic function of μ:

  p(μ | D) = α' exp{ −(1/2)[ (n/σ² + 1/σ0²) μ² − 2( (1/σ²) Σ_{k=1}^{n} xk + μ0/σ0² ) μ ] }

It is therefore again normal: it is called a reproducing density.

  p(μ | D) ~ N(μn, σn²)

• Identifying coefficients in the equation above with those of the generic Gaussian

  p(μ | D) = (1/(√(2π) σn)) exp[ −(1/2)((μ − μn)/σn)² ]

yields expressions for μn and σn²:

  1/σn² = n/σ² + 1/σ0²   and   μn/σn² = (n/σ²) μ̂n + μ0/σ0²

where μ̂n = (1/n) Σ_{k=1}^{n} xk is the sample mean.

Solving for μn and σn² yields:

  μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0

  σn² = σ0² σ² / (n σ0² + σ²)

From these equations we see that, as n increases:
– the variance σn² decreases monotonically
– the estimate of p(μ | D) becomes more and more peaked

– The univariate case: P(x | D)
• P(μ | D) has been computed (in the preceding discussion)
• P(x | D) remains to be computed!

  P(x | D) = ∫ P(x | μ) P(μ | D) dμ   is Gaussian:

  P(x | D) ~ N(μn, σ² + σn²)

We know σ² and how to compute μn and σn², so this provides the desired class-conditional density P(x | Dj, ωj).
Therefore, using P(x | Dj, ωj) together with P(ωj) and the Bayes formula, we obtain the Bayesian classification rule:

  Max_j [ P(ωj | x, D) ]  ≡  Max_j [ P(x | ωj, Dj) P(ωj) ]

• 3.5 Bayesian Parameter Estimation: General Theory

– The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

• The form of P(x | θ) is assumed known, but the value of θ is not known exactly
• Our knowledge about θ is assumed to be contained in a known prior density P(θ)
• The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown density P(x)
The basic problem is:
"Compute the posterior density P(θ | D)",
then "Derive P(x | D)", where

  p(x | D) = ∫ p(x | θ) p(θ | D) dθ

Using Bayes formula, we have:

  P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

and, by the independence assumption:

  p(D | θ) = ∏_{k=1}^{n} p(xk | θ)
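A rough numerical sketch of this general recipe (my own toy setup, not from the text: a 1-D parameter θ, a Gaussian likelihood with known variance, and a discretized grid standing in for the integrals): form p(θ | D) ∝ p(D | θ) p(θ) and then p(x | D) = ∫ p(x | θ) p(θ | D) dθ as a sum over the grid.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(1.0, 1.0, size=30)                # training samples from p(x | theta_true)

theta = np.linspace(-4.0, 4.0, 801)              # grid of candidate parameter values (the mean)
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=2.0)      # known prior p(theta)

# log p(D | theta) = sum_k log p(x_k | theta), evaluated for every grid point
log_lik = norm.logpdf(D[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
post = np.exp(log_lik - log_lik.max()) * prior   # unnormalized p(D | theta) p(theta)
post /= post.sum() * dtheta                      # normalize so it integrates to 1

x_new = 0.5
p_x_given_D = (norm.pdf(x_new, loc=theta, scale=1.0) * post).sum() * dtheta
print(p_x_given_D)                               # approximate p(x | D) at x_new
```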
• Problems of Dimensionality
– Problems involving 50 or 100 features are common (usually binary valued)
– Note: microarray data might entail ~20000 real-valued features
– Classification accuracy depends on:
• dimensionality
• amount of training data
• discrete vs. continuous features
• Case of two multivariate normal classes with the same covariance
– P(x | ωj) ~ N(μj, Σ), j = 1, 2
– Statistically independent features
– If the priors are equal, then the Bayes error is:

  P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du

  where r² = (μ1 − μ2)^t Σ^{-1} (μ1 − μ2)
  is the squared Mahalanobis distance between the class means

  lim_{r→∞} P(error) = 0
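A short sketch (the means and covariance below are made-up numbers, and the helper names are my own): compute the squared Mahalanobis distance r² between the two class means and the corresponding Bayes error for equal priors; the integral above is just the upper tail of a standard normal evaluated at r/2.

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])                # shared class covariance

diff = mu1 - mu2
r2 = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
bayes_error = norm.sf(np.sqrt(r2) / 2.0)    # P(error) for equal priors
print(np.sqrt(r2), bayes_error)
```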
• If the features are conditionally independent, then:

  Σ = diag(σ1², σ2², …, σd²)

  r² = Σ_{i=1}^{d} ( (μi1 − μi2) / σi )²

• Do we remember what conditional independence is?

• Example for binary features:
Let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product over i of the per-feature terms pi^xi (1 − pi)^(1−xi)
• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
– This doesn't require independence

• Adding independent features helps increase r → reduces the error

• Caution: adding features increases the cost & complexity of the feature extractor and of the classifier

• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
– we have the wrong model!
– we don't have enough training data to support the additional dimensions

Overfitting
• Dimensionality of model vs. size of training data
– Issue: not enough data to support the model
– Possible solutions:
• Reduce model dimensionality
• Make (possibly incorrect) assumptions to better estimate Σ
Overfitting
• Estimate a better Σ:
– use data pooled from all classes
• normalization issues
– use a pseudo-Bayesian form λΣ0 + (1 − λ)Σn
– "doctor" Σ̂ by thresholding entries
• reduces chance correlations
– assume statistical independence
• zero all off-diagonal elements
Shrinkage

• Shrinkage: a weighted combination of the common and individual covariances

  Σi(α) = [ (1 − α) ni Σi + α n Σ ] / [ (1 − α) ni + α n ]    for 0 < α < 1

• We can also shrink the estimated common covariance toward the identity matrix:

  Σ(β) = (1 − β) Σ + β I    for 0 < β < 1
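A minimal sketch of the two shrinkage estimators above (the covariance values, sample counts, and the weights α and β are illustrative choices, not values from the text).

```python
import numpy as np

def shrink_class_cov(cov_i, n_i, cov_common, n, alpha):
    """Blend a class covariance toward the common (pooled) covariance."""
    return ((1 - alpha) * n_i * cov_i + alpha * n * cov_common) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(cov, beta):
    """Blend a covariance estimate toward the identity matrix."""
    return (1 - beta) * cov + beta * np.eye(cov.shape[0])

cov_i = np.array([[2.0, 0.8], [0.8, 1.0]])    # covariance of one class (n_i samples)
cov_all = np.array([[1.5, 0.2], [0.2, 1.2]])  # pooled covariance (n samples)
print(shrink_class_cov(cov_i, n_i=20, cov_common=cov_all, n=200, alpha=0.3))
print(shrink_to_identity(cov_all, beta=0.1))
```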
• Component Analysis and Discriminants
– Combine features in order to reduce the dimension of the feature space
– Linear combinations are simple to compute and tractable
– Project high-dimensional data onto a lower-dimensional space
– Two classical approaches for finding an "optimal" linear transformation (a PCA sketch follows below):
• PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
• MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense
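A compact PCA sketch of the first approach (illustrative only; the data and variable names are my own): project centered data onto the k leading eigenvectors of the sample covariance, i.e. the directions that best represent the data in a least-squares sense.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.0, 0.5],
                             [1.0, 2.0, 0.3],
                             [0.5, 0.3, 0.5]], size=500)   # n x d data

k = 2
Xc = X - X.mean(axis=0)                 # center the data
cov = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :k]             # top-k principal directions
Z = Xc @ W                              # projected (n x k) representation
print(Z.shape, eigvals[::-1])
```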
• Hidden Markov Models:
– Markov Chains
– Goal: make a sequence of decisions
• Processes that unfold in time: states at time t are influenced by the state at time t − 1
• Applications: speech recognition, gesture recognition, part-of-speech tagging and DNA sequencing
• Any temporal process without memory:
  ω^T = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states
  We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
• The system can revisit a state at different steps, and not every state needs to be visited
– First-order Markov models
• Our production of any sequence is described by the transition probabilities

  P(ωj(t + 1) | ωi(t)) = aij

• With the model θ = (aij, ω^T), the example sequence above has probability

  P(ω^T | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)

Example: speech recognition
("production of spoken words")

Production of the word "pattern", represented by phonemes:
/p/ /a/ /tt/ /er/ /n/ //   ( // = silent state )
Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
Chapter 3 (Part 3):
Maximum-Likelihood and Bayesian Parameter
Estimation (Section 3.10)

Hidden Markov Model: Extension of Markov Chains

• Hidden Markov Model (HMM)
– Interaction of the visible states with the hidden states:
  Σk bjk = 1 for all j, where bjk = P(vk(t) | ωj(t))

– 3 problems are associated with this model:
• The evaluation problem
• The decoding problem
• The learning problem
• The evaluation problem
It is the probability that the model produces a sequence V^T of visible states:

  P(V^T) = Σ_{r=1}^{rmax} P(V^T | ω_r^T) P(ω_r^T)

where each r indexes a particular sequence of T hidden states:

  ω_r^T = { ω(1), ω(2), …, ω(T) }

  (1)  P(V^T | ω_r^T) = ∏_{t=1}^{T} P(v(t) | ω(t))

  (2)  P(ω_r^T) = ∏_{t=1}^{T} P(ω(t) | ω(t − 1))

Using equations (1) and (2), we can write:

  P(V^T) = Σ_{r=1}^{rmax} ∏_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t − 1))

Interpretation: the probability that we observe the particular sequence of T visible states V^T is equal to the sum, over all rmax possible sequences of hidden states, of the conditional probability that the system has made a particular transition, multiplied by the probability that it then emitted the visible symbol in our target sequence.

Example: Let ω1, ω2, ω3 be the hidden states; let v1, v2, v3 be the visible states; and let V^3 = {v1, v2, v3} be the sequence of visible states. Then

  P({v1, v2, v3}) = P(ω1)·P(v1 | ω1)·P(ω2 | ω1)·P(v2 | ω2)·P(ω3 | ω2)·P(v3 | ω3) + …

  (the sum runs over all possible (3³ = 27) hidden-state sequences!)
First possibility: the hidden sequence ω1 (t = 1), ω2 (t = 2), ω3 (t = 3) emitting v1, v2, v3

Second possibility: the hidden sequence ω2 (t = 1), ω3 (t = 2), ω1 (t = 3) emitting v1, v2, v3

  P({v1, v2, v3}) = P(ω2)·P(v1 | ω2)·P(ω3 | ω2)·P(v2 | ω3)·P(ω1 | ω3)·P(v3 | ω1) + …

Therefore:

  P({v1, v2, v3}) = Σ_{possible sequences of hidden states} ∏_{t=1}^{3} P(v(t) | ω(t)) · P(ω(t) | ω(t − 1))
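A brute-force sketch of the evaluation problem on a toy model (π, A, B and the observed sequence below are made-up values, not from the text): enumerate all 27 hidden-state sequences and sum the transition-times-emission products, exactly as in the formula above. In practice the forward algorithm computes the same quantity far more efficiently.

```python
import itertools
import numpy as np

pi = np.array([0.5, 0.3, 0.2])            # P(omega(1) = omega_i)
A = np.array([[0.6, 0.3, 0.1],            # a_ij = P(omega_j | omega_i)
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
B = np.array([[0.7, 0.2, 0.1],            # b_jk = P(v_k | omega_j)
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
V = [0, 1, 2]                             # observed visible sequence v1, v2, v3

total = 0.0
for hidden in itertools.product(range(3), repeat=len(V)):   # all 27 hidden paths
    p = pi[hidden[0]] * B[hidden[0], V[0]]
    for t in range(1, len(V)):
        p *= A[hidden[t - 1], hidden[t]] * B[hidden[t], V[t]]
    total += p
print(total)                              # P(V^T) under the model
```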
• The decoding problem (optimal state sequence)

Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states.
This problem can be expressed mathematically as: find the single "best" state sequence (of hidden states)

  ω̂(1), ω̂(2), …, ω̂(T)   such that:

  ω̂(1), ω̂(2), …, ω̂(T) = arg max_{ω(1), ω(2), …, ω(T)} P[ ω(1), ω(2), …, ω(T), v(1), v(2), …, v(T) | θ ]

Note that the summation has disappeared, since we want to find only the one unique best case!

where θ = [π, A, B]:
  π = {πi} = P(ω(1) = ωi)   (initial state probabilities)
  A = {aij} = P(ω(t + 1) = ωj | ω(t) = ωi)
  B = {bjk} = P(v(t) = vk | ω(t) = ωj)

In the preceding example, this computation corresponds to the selection of the best path amongst:

  {ω1(t = 1), ω2(t = 2), ω3(t = 3)}, {ω2(t = 1), ω3(t = 2), ω1(t = 3)},
  {ω3(t = 1), ω1(t = 2), ω2(t = 3)}, {ω3(t = 1), ω2(t = 2), ω1(t = 3)},
  {ω2(t = 1), ω1(t = 2), ω3(t = 3)}, …
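A compact sketch of the standard dynamic-programming (Viterbi) decoder for this problem (the toy model π, A, B and the observation sequence are my own values): keep, for each time step and state, the probability of the best partial path ending there plus a back-pointer, then trace the pointers back.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])            # initial state probabilities
A = np.array([[0.6, 0.3, 0.1],            # transition probabilities a_ij
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
B = np.array([[0.7, 0.2, 0.1],            # emission probabilities b_jk
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
V = [0, 1, 2]                             # observed visible sequence

T, S = len(V), len(pi)
delta = np.zeros((T, S))                  # best partial-path probability ending in each state
psi = np.zeros((T, S), dtype=int)         # back-pointers

delta[0] = pi * B[:, V[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] * A    # scores[i, j]: best path through i, then move to j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) * B[:, V[t]]

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):             # trace the back-pointers
    path.append(int(psi[t, path[-1]]))
path.reverse()
print(path, delta[-1].max())              # most probable hidden sequence and its probability
```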
• The learning problem (parameter estimation)
This third problem consists of determining a method to adjust the model parameters θ = [π, A, B] to satisfy a certain optimization criterion. We need to find the best model

  θ̂ = [π̂, Â, B̂]

such that it maximizes the probability of the observation sequence:

  max_θ P(V^T | θ)

We use an iterative procedure such as Baum-Welch (forward-backward) or a gradient method to find this local optimum.
