
CS 677 Pattern Recognition

Lecture 4: Introduction to Pattern Recognition and Bayesian Decision Theory

Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering
Spring 24
Chapter 3:
Maximum-Likelihood & Bayesian Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
  – Example of a Specific Case
  – The Gaussian Case: unknown μ and Σ
  – Bias

All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
• Introduction
  – Data availability in a Bayesian framework
    • We could design an optimal classifier if we knew:
      – P(ωi) (priors)
      – P(x | ωi) (class-conditional densities)
      Unfortunately, we rarely have this complete information!
  – Design a classifier from a training sample
    • No problem with prior estimation
    • Samples are often too small for class-conditional estimation (large dimension of feature space!)
  – A priori information about the problem
    • Do we know something about the distribution?
    • → find parameters to characterize the distribution
  – Example: Normality of P(x | ωi)

        P(x | ωi) ~ N(μi, Σi)

    • Characterized by 2 parameters
  – Estimation techniques
    • Maximum-Likelihood (ML) and Bayesian estimation
    • Results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but unknown!
  – Best parameters are obtained by maximizing the probability of obtaining the samples observed

• Bayesian methods view the parameters as random variables having some known distribution

• In either approach, we use P(ωi | x) for our classification rule!
• Maximum-Likelihood Estimation
  • Has good convergence properties as the sample size increases
  • Simpler than any other alternative techniques
  – General principle
    • Assume we have c classes and

        P(x | ωj) ~ N(μj, Σj)
        P(x | ωj) ≡ P(x | ωj, θj)   where:
        θj = (μj, Σj) = (μj1, μj2, …, σj11, σj22, cov(xjm, xjn), …)
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category

• Suppose that D contains n samples, x1, x2, …, xn

        P(D | θ) = ∏_{k=1..n} P(xk | θ) = F(θ)

  P(D | θ) is called the likelihood of θ w.r.t. the set of samples

• The ML estimate of θ is, by definition, the value that maximizes P(D | θ)
  "It is the value of θ that best agrees with the actually observed training sample"
• Optimal estimation
  – Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator

        ∇θ = [∂/∂θ1, ∂/∂θ2, …, ∂/∂θp]^t

  – We define l(θ) as the log-likelihood function

        l(θ) = ln P(D | θ)
    (recall D is the training data)

  – New problem statement: determine the θ that maximizes the log-likelihood

        θ̂ = arg max_θ l(θ)

The definition of l(θ) is:

        l(θ) = ∑_{k=1..n} ln p(xk | θ)

and

        ∇θ l = ∑_{k=1..n} ∇θ ln p(xk | θ)        (eq. 6)

The set of necessary conditions for an optimum is:

        ∇θ l = 0        (eq. 7)
• Example, the Gaussian case: unknown μ
  – We assume we know the covariance Σ
  – p(xk | μ) ~ N(μ, Σ)
    (samples are drawn from a multivariate normal population)

        ln p(xk | μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)^t Σ⁻¹ (xk − μ)

    and

        ∇μ ln p(xk | μ) = Σ⁻¹ (xk − μ)        (eq. 9)

  – θ = μ, therefore the ML estimate for μ must satisfy:

        ∑_{k=1..n} Σ⁻¹ (xk − μ̂) = 0        (from eqs. 6, 7 & 9)
• Multiplying by Σ and rearranging, we obtain:

        μ̂ = (1/n) ∑_{k=1..n} xk

  This is just the arithmetic average of the training samples!

  Conclusion:
  If p(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
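A minimal NumPy sketch (not from the slides; the toy data and the perturbation of the mean are made up for illustration) of the result above: the sample mean maximizes the Gaussian log-likelihood, so any other candidate μ scores lower.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=3.0, scale=2.0, size=(500, 1))   # toy 1-D samples

    def gaussian_log_likelihood(X, mu, sigma2):
        # l(theta) = sum_k ln p(x_k | mu, sigma2) for a Gaussian with known variance
        n, d = X.shape
        diff = X - mu
        return -0.5 * n * d * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(diff**2) / sigma2

    mu_hat = X.mean(axis=0)                               # ML estimate: arithmetic average
    print(gaussian_log_likelihood(X, mu_hat, 4.0))        # highest value
    print(gaussian_log_likelihood(X, mu_hat + 0.5, 4.0))  # any shifted mu scores lower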
• Example, the Gaussian case: unknown μ and σ
  – First consider the univariate case: unknown μ and σ

        θ = (θ1, θ2) = (μ, σ²)

        l = ln p(xk | θ) = −(1/2) ln(2πθ2) − (1/(2θ2))(xk − θ1)²

        ∇θ l = [ ∂(ln p(xk | θ))/∂θ1 , ∂(ln p(xk | θ))/∂θ2 ]^t = 0

    which gives the two conditions:

        (1/θ2)(xk − θ1) = 0
        −1/(2θ2) + (xk − θ1)²/(2θ2²) = 0

  Summation (over the training set):

        ∑_{k=1..n} (1/σ̂²)(xk − μ̂) = 0                                (1)

        −∑_{k=1..n} 1/σ̂² + ∑_{k=1..n} (xk − μ̂)²/σ̂⁴ = 0               (2)

  Combining (1) and (2), one obtains:

        μ̂ = (1/n) ∑_{k=1..n} xk ;   σ̂² = (1/n) ∑_{k=1..n} (xk − μ̂)²

• The ML estimates for the multivariate case are similar
  – The scalars μ and σ are replaced with vectors
  – The variance σ² is replaced by the covariance matrix Σ

        μ̂ = (1/n) ∑_{k=1..n} xk

        Σ̂ = (1/n) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)^t
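A short sketch (NumPy, toy 2-D data invented for illustration) of the multivariate ML estimates above; note the 1/n normalization of Σ̂, which is revisited under "Bias" next.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[2.0, 0.3], [0.3, 1.0]], size=1000)  # toy data

    n = X.shape[0]
    mu_hat = X.mean(axis=0)                 # mu_hat = (1/n) sum_k x_k
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / n         # Sigma_hat = (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t
    # Equivalent: np.cov(X, rowvar=False, bias=True) uses the same 1/n normalization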
Bias
  – The ML estimate for σ² is biased:

        E[ (1/n) ∑_{i=1..n} (xi − x̄)² ] = ((n − 1)/n) σ²  ≠  σ²

  – Extreme case: n = 1, E[·] = 0 ≠ σ²
  – As n increases the bias is reduced
    → this type of estimator is called asymptotically unbiased
  – An elementary unbiased estimator for Σ is:

        C = (1/(n − 1)) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)^t        (the sample covariance matrix)

    This estimator is unbiased for all distributions
    → Such estimators are called absolutely unbiased

  – Our earlier estimator for Σ is biased:

        Σ̂ = (1/n) ∑_{k=1..n} (xk − μ̂)(xk − μ̂)^t

    In fact it is asymptotically unbiased: observe that

        Σ̂ = ((n − 1)/n) C
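A quick empirical check (illustrative NumPy sketch with made-up parameters) of the bias relation above: averaged over many small samples, the 1/n estimator underestimates the true variance by the factor (n − 1)/n, while the 1/(n − 1) estimator does not.

    import numpy as np

    rng = np.random.default_rng(2)
    true_var, n, trials = 4.0, 5, 20000          # small n makes the bias visible

    biased, unbiased = [], []
    for _ in range(trials):
        x = rng.normal(0.0, np.sqrt(true_var), size=n)
        biased.append(np.var(x))                 # 1/n normalization (ML estimate)
        unbiased.append(np.var(x, ddof=1))       # 1/(n-1) normalization (sample covariance C)

    print(np.mean(biased))    # ~ (n-1)/n * 4.0 = 3.2  -> biased
    print(np.mean(unbiased))  # ~ 4.0                  -> unbiased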
Chapter 3:
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
⚫ Bayesian Estimation (BE)
⚫ Bayesian Parameter Estimation: Gaussian Case
⚫ Bayesian Parameter Estimation: General Estimation
⚫ Problems of Dimensionality
⚫ Computational Complexity
⚫ Component Analysis and Discriminants
⚫ Hidden Markov Models
• Bayesian Estimation (Bayesian learning applied to pattern classification problems)
  – In MLE θ was assumed fixed
  – In BE θ is a random variable
  – The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
  – Goal: compute P(ωi | x, D)
    Given the sample D, Bayes formula can be written:

        P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / ∑_{j=1..c} p(x | ωj, D) P(ωj | D)
• To demonstrate the preceding equation, use:

        P(x, D | ωi) = P(x | D, ωi) P(D | ωi)      (from def. of cond. prob.)
        P(x | D) = ∑_j P(x, ωj | D)                (from law of total prob.)
        P(ωi) = P(ωi | D)                          (the training sample provides this!)

  We assume that samples in different classes are independent. Thus:

        P(ωi | x, D) = p(x | ωi, Di) P(ωi) / ∑_{j=1..c} p(x | ωj, Dj) P(ωj)
• Bayesian Parameter Estimation: Gaussian Case

  Goal: estimate μ using the a-posteriori density P(μ | D)

  – The univariate case: P(μ | D)
    μ is the only unknown parameter

        P(x | μ) ~ N(μ, σ²)
        P(μ) ~ N(μ0, σ0²)

    (μ0 and σ0 are known!)
        P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ        (Bayes formula)
                 = α ∏_{k=1..n} P(xk | μ) P(μ)

  – But we know

        p(xk | μ) ~ N(μ, σ²)  and  p(μ) ~ N(μ0, σ0²)

  Plugging in their Gaussian expressions and extracting out factors not depending on μ yields:

        p(μ | D) = α′ exp[ −(1/2) ( (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) ∑_{k=1..n} xk + μ0/σ0² ) μ ) ]

  (from eq. 29, page 93)

Observation: p(μ | D) is an exponential of a quadratic function of μ

        p(μ | D) = α′ exp[ −(1/2) ( (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) ∑_{k=1..n} xk + μ0/σ0² ) μ ) ]

It is again normal! It is called a reproducing density.

        P(μ | D) ~ N(μn, σn²)

 1  n 1   1 n
 0   

p (  | D ) =  exp −  2 + 2   − 2 2
 2  0 
2
 xk + 2   
 0   
   k =1

• Identifying coefficients in the top equation with that of


the generic Gaussian
2
1  1   − n  
p(  | D) = exp  −   
2  n  2  n 

1
Yields n 1 for nand
expressions  2
n 0
= + and 2 = 2 ˆ n + 2
n n

 2
n  2
 02 n  0
26

Solving for μn and σn² yields:

        μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0

        σn² = σ0² σ² / (n σ0² + σ²)

From these equations we see that as n increases:
  – the variance σn² decreases monotonically
  – the estimate of p(μ | D) becomes more and more peaked
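A small sketch (NumPy only; the prior and noise values are made up) of the update above: the posterior mean μn is a precision-weighted blend of the prior mean μ0 and the sample mean, and σn² shrinks as n grows.

    import numpy as np

    def posterior_mean_var(x, mu0, sigma0_sq, sigma_sq):
        # Posterior N(mu_n, sigma_n^2) for the mean of N(mu, sigma_sq), prior N(mu0, sigma0_sq)
        n = len(x)
        mu_hat = np.mean(x)
        mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * mu_hat \
             + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
        sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
        return mu_n, sigma_n_sq

    rng = np.random.default_rng(3)
    x = rng.normal(5.0, 2.0, size=50)            # toy observations, sigma^2 = 4 assumed known
    for n in (1, 5, 50):
        print(n, posterior_mean_var(x[:n], mu0=0.0, sigma0_sq=10.0, sigma_sq=4.0))
    # sigma_n^2 decreases monotonically; mu_n moves from the prior toward the sample mean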

  – The univariate case: P(x | D)
    • P(μ | D) has been computed (in the preceding discussion)
    • P(x | D) remains to be computed!

        P(x | D) = ∫ P(x | μ) P(μ | D) dμ   is Gaussian:
        P(x | D) ~ N(μn, σ² + σn²)

    We know σ² and how to compute μn and σn².
    This provides the desired class-conditional density P(x | Dj, ωj).
    Therefore, using P(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

        Max_{ωj} [ P(ωj | x, D) ]  ≡  Max_{ωj} [ P(x | ωj, Dj) P(ωj) ]
• 3.5 Bayesian Parameter Estimation: General Theory

  – The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

    • The form of P(x | θ) is assumed known, but the value of θ is not known exactly
    • Our knowledge about θ is assumed to be contained in a known prior density P(θ)
    • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to P(x)

The basic problem is:
"Compute the posterior density P(θ | D)", then "Derive P(x | D)", where

        p(x | D) = ∫ p(x | θ) p(θ | D) dθ

Using Bayes formula, we have:

        P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

And by the independence assumption:

        p(D | θ) = ∏_{k=1..n} p(xk | θ)
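For a one-dimensional parameter, the two integrals above can be approximated on a grid. This sketch (NumPy and SciPy; the prior, data, and query point are invented) computes P(θ | D) and then the predictive p(x | D) for a Gaussian with unknown mean and known σ.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    data = rng.normal(2.0, 1.0, size=20)          # toy samples, sigma = 1 assumed known

    theta = np.linspace(-5, 10, 2001)             # grid over the unknown mean
    dtheta = theta[1] - theta[0]
    prior = norm.pdf(theta, loc=0.0, scale=3.0)   # P(theta)

    # p(D | theta) = prod_k p(x_k | theta), computed in log space for stability
    log_lik = norm.logpdf(data[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
    post = prior * np.exp(log_lik - log_lik.max())
    post /= (post * dtheta).sum()                 # normalize: P(theta | D)

    # p(x | D) = integral of p(x | theta) p(theta | D) dtheta, evaluated at one query point x
    x = 2.5
    p_x_given_D = (norm.pdf(x, loc=theta, scale=1.0) * post * dtheta).sum()
    print(p_x_given_D)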
• Problems of Dimensionality
  – Problems involving 50 or 100 features are common (usually binary valued)
  – Note: microarray data might entail ~20000 real-valued features
  – Classification accuracy depends on:
    • dimensionality
    • amount of training data
    • discrete vs. continuous features
• Case of two-class multivariate normal with the same covariance
  – P(x | ωj) ~ N(μj, Σ), j = 1, 2
  – Statistically independent features
  – If the priors are equal, then:

        P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^(−u²/2) du        (Bayes error)

    where r² = (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2) is the squared Mahalanobis distance

        lim_{r→∞} P(error) = 0
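A short illustrative computation (NumPy/SciPy; the class means and covariance are made up) of the Bayes-error formula above: r is the Mahalanobis distance between the class means, and P(error) is a standard-normal tail probability at r/2.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical two-class problem with equal priors and a shared covariance
    mu1 = np.array([0.0, 0.0])
    mu2 = np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.2],
                      [0.2, 1.5]])

    diff = mu1 - mu2
    r2 = diff @ np.linalg.inv(Sigma) @ diff      # squared Mahalanobis distance
    r = np.sqrt(r2)

    p_error = norm.sf(r / 2.0)                   # (1/sqrt(2*pi)) * integral_{r/2}^inf e^{-u^2/2} du
    print(r, p_error)                            # error -> 0 as r grows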
• If features are conditionally independent, then:

        Σ = diag(σ1², σ2², …, σd²)

        r² = ∑_{i=1..d} ( (μi1 − μi2) / σi )²

• Do we remember what conditional independence is?

• Example for binary features:
  Let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product of the pi
• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
  – This doesn't require independence

• Adding independent features helps increase r → reduces error

• Caution: adding features increases the cost and complexity of the feature extractor and the classifier

• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
  – we have the wrong model!
  – we don't have enough training data to support the additional dimensions

Overfitting
• Dimensionality of model vs. size of training data
  – Issue: not enough data to support the model
  – Possible solutions:
    • Reduce model dimensionality
    • Make (possibly incorrect) assumptions to better estimate Σ
Overfitting
• Estimate a better Σ
  – use data pooled from all classes
    • normalization issues
  – use the pseudo-Bayesian form λΣ0 + (1 − λ)Σ̂n
  – "doctor" Σ by thresholding entries
    • reduces chance correlations
  – assume statistical independence
    • zero all off-diagonal elements

Shrinkage

• Shrinkage: a weighted combination of the common and individual covariances

        Σi(α) = [ (1 − α) ni Σi + α n Σ ] / [ (1 − α) ni + α n ]        for 0 ≤ α ≤ 1

• We can also shrink the estimated common covariance toward the identity matrix

        Σ(β) = (1 − β) Σ + β I        for 0 ≤ β ≤ 1
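A brief sketch (NumPy; the class counts and matrices are invented) of both shrinkage formulas: a per-class covariance pulled toward the pooled covariance, and the pooled covariance pulled toward the identity.

    import numpy as np

    def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
        # Sigma_i(alpha) = [(1-alpha) n_i Sigma_i + alpha n Sigma] / [(1-alpha) n_i + alpha n]
        return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / ((1 - alpha) * n_i + alpha * n)

    def shrink_to_identity(Sigma, beta):
        # Sigma(beta) = (1-beta) Sigma + beta I
        return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

    # Hypothetical example: one class with few samples, pooled estimate from all classes
    Sigma_i = np.array([[3.0, 1.2], [1.2, 0.5]])   # noisy per-class estimate (n_i small)
    Sigma   = np.array([[2.0, 0.1], [0.1, 1.0]])   # pooled (common) covariance
    print(shrink_to_common(Sigma_i, n_i=10, Sigma=Sigma, n=500, alpha=0.3))
    print(shrink_to_identity(Sigma, beta=0.1))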
• Component Analysis and Discriminants
  – Combine features in order to reduce the dimension of the feature space
  – Linear combinations are simple to compute and tractable
  – Project high-dimensional data onto a lower-dimensional space
  – Two classical approaches for finding an "optimal" linear transformation (a PCA sketch follows below):
    • PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
    • MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
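A minimal PCA sketch (NumPy; random data used only for illustration): project the centered data onto the top eigenvectors of the sample covariance, i.e., the subspace that best represents the data in the least-squares sense.

    import numpy as np

    def pca_project(X, k):
        # Project X (n samples x d features) onto its top-k principal components
        X_centered = X - X.mean(axis=0)
        cov = np.cov(X_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
        top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # k directions of largest variance
        return X_centered @ top                             # reduced-dimension representation

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 10))                          # toy 10-D data
    Y = pca_project(X, k=2)                                 # 200 x 2 projection
    print(Y.shape)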
• Hidden Markov Models:
  – Markov Chains
  – Goal: make a sequence of decisions
    • Processes that unfold in time: the state at time t is influenced by the state at time t − 1
    • Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …
    • Any temporal process without memory

        ω^T = {ω(1), ω(2), ω(3), …, ω(T)}   is a sequence of states
        We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}

    • The system can revisit a state at different steps, and not every state needs to be visited
  – First-order Markov models
    • Our production of any sequence is described by the transition probabilities

        P(ωj(t + 1) | ωi(t)) = aij
        θ = (aij, ω^T)
        P(ω^T | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)

  Example: speech recognition ("production of spoken words")

  Production of the word "pattern", represented by phonemes:
        /p/ /a/ /tt/ /er/ /n/  plus a silent state
  Transitions: from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
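A tiny sketch (NumPy; the transition matrix and initial probabilities are invented) of the product above: the probability of a particular state sequence under a first-order Markov model is the initial-state probability times the chain of transition probabilities aij.

    import numpy as np

    # Hypothetical 4-state transition matrix A, with A[i, j] = P(omega_j at t+1 | omega_i at t)
    A = np.array([[0.10, 0.20, 0.30, 0.40],
                  [0.30, 0.30, 0.20, 0.20],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.40, 0.10, 0.10, 0.40]])
    pi = np.array([0.25, 0.25, 0.25, 0.25])      # initial state probabilities

    def sequence_probability(states, A, pi):
        # P(omega^T | theta) = P(omega(1)) * prod_t a_{omega(t) omega(t+1)}
        p = pi[states[0]]
        for s, s_next in zip(states[:-1], states[1:]):
            p *= A[s, s_next]
        return p

    # e.g. the sequence omega1, omega4, omega2, omega2, omega1, omega4 (0-indexed here)
    print(sequence_probability([0, 3, 1, 1, 0, 3], A, pi))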
Chapter 3 (Part 3):
Maximum-Likelihood and Bayesian Parameter Estimation (Section 3.10)

Hidden Markov Model: Extension of Markov Chains

• Hidden Markov Model (HMM)
  – Interaction of the visible states with the hidden states:
        ∑_k bjk = 1 for all j, where bjk = P(vk(t) | ωj(t))
  – 3 problems are associated with this model:
    • The evaluation problem
    • The decoding problem
    • The learning problem
• The evaluation problem
  Compute the probability that the model produces a sequence V^T of visible states:

        P(V^T) = ∑_{r=1..rmax} P(V^T | ωr^T) P(ωr^T)

  where each r indexes a particular sequence of T hidden states:

        ωr^T = { ω(1), ω(2), …, ω(T) }

        (1)  P(V^T | ωr^T) = ∏_{t=1..T} P(v(t) | ω(t))
        (2)  P(ωr^T) = ∏_{t=1..T} P(ω(t) | ω(t − 1))
Using equations (1) and (2), we can write:

        P(V^T) = ∑_{r=1..rmax} ∏_{t=1..T} P(v(t) | ω(t)) P(ω(t) | ω(t − 1))

Interpretation: the probability that we observe the particular sequence of T visible states V^T is equal to the sum, over all rmax possible sequences of hidden states, of the conditional probability that the system has made a particular transition multiplied by the probability that it then emitted the visible symbol in our target sequence.

Example: Let ω1, ω2, ω3 be the hidden states; v1, v2, v3 the visible states; and V^3 = {v1, v2, v3} the sequence of visible states.

        P({v1, v2, v3}) = P(ω1)·P(v1 | ω1)·P(ω2 | ω1)·P(v2 | ω2)·P(ω3 | ω2)·P(v3 | ω3) + …
        (possible terms in the sum = all 3³ = 27 cases!)
First possibility (hidden states ω1, ω2, ω3 emitting v1, v2, v3):

        v1        v2        v3
        ω1   →    ω2   →    ω3
      (t = 1)   (t = 2)   (t = 3)

Second possibility (hidden states ω2, ω3, ω1):

        v1        v2        v3
        ω2   →    ω3   →    ω1
      (t = 1)   (t = 2)   (t = 3)

        P({v1, v2, v3}) = P(ω2)·P(v1 | ω2)·P(ω3 | ω2)·P(v2 | ω3)·P(ω1 | ω3)·P(v3 | ω1) + …

Therefore (see the sketch below):

        P({v1, v2, v3}) = ∑_{possible sequences of hidden states} ∏_{t=1..3} P(v(t) | ω(t))·P(ω(t) | ω(t − 1))
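A brute-force evaluation sketch (NumPy; the A and B matrices and initial probabilities are made up): it sums the product of transition and emission probabilities over every possible hidden-state sequence, exactly as in the formula above. (Practical systems use the forward algorithm instead, which avoids this exponential enumeration.)

    import itertools
    import numpy as np

    # Hypothetical 3-state HMM with 3 visible symbols
    A  = np.array([[0.5, 0.3, 0.2],      # A[i, j] = P(omega_j at t | omega_i at t-1)
                   [0.2, 0.5, 0.3],
                   [0.3, 0.2, 0.5]])
    B  = np.array([[0.6, 0.3, 0.1],      # B[j, k] = P(v_k | omega_j)
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
    pi = np.array([1/3, 1/3, 1/3])       # P(omega(1))

    def evaluate(visible, A, B, pi):
        # P(V^T) = sum over all hidden sequences of prod_t P(v(t)|omega(t)) P(omega(t)|omega(t-1))
        n_states, total = A.shape[0], 0.0
        for hidden in itertools.product(range(n_states), repeat=len(visible)):
            p = pi[hidden[0]] * B[hidden[0], visible[0]]
            for t in range(1, len(visible)):
                p *= A[hidden[t - 1], hidden[t]] * B[hidden[t], visible[t]]
            total += p                   # one of the r_max = n_states^T terms
        return total

    print(evaluate([0, 1, 2], A, B, pi))  # observed V^3 = {v1, v2, v3}: 3^3 = 27 terms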
• The decoding problem (optimal state sequence)

  Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states.
  This problem can be expressed mathematically as: find the single "best" state sequence (hidden states)

        ω̂(1), ω̂(2), …, ω̂(T)   such that:

        ω̂(1), ω̂(2), …, ω̂(T) = arg max_{ω(1), ω(2), …, ω(T)} P[ ω(1), ω(2), …, ω(T), v(1), v(2), …, v(T) | θ ]

  Note that the summation disappeared, since we want to find only the one unique best case!
Where θ = [π, A, B]:
        π = P(ω(1) = ωi)                           (initial state probability)
        A = {aij} = P(ω(t + 1) = ωj | ω(t) = ωi)
        B = {bjk} = P(v(t) = vk | ω(t) = ωj)

In the preceding example, this computation corresponds to the selection of the best path amongst (see the Viterbi sketch below):

        {ω1(t = 1), ω2(t = 2), ω3(t = 3)}, {ω2(t = 1), ω3(t = 2), ω1(t = 3)},
        {ω3(t = 1), ω1(t = 2), ω2(t = 3)}, {ω3(t = 1), ω2(t = 2), ω1(t = 3)},
        {ω2(t = 1), ω1(t = 2), ω3(t = 3)}
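A compact Viterbi-style sketch (NumPy; it reuses the same hypothetical A, B, and π values as in the evaluation sketch) for the decoding problem: keep, for each state and time step, the best score and a back-pointer, then trace back the single best hidden sequence.

    import numpy as np

    def viterbi(visible, A, B, pi):
        # Return the most probable hidden-state sequence for the observed visible symbols
        T, n = len(visible), A.shape[0]
        score = np.zeros((T, n))          # best probability of any path ending in state j at time t
        back = np.zeros((T, n), dtype=int)
        score[0] = pi * B[:, visible[0]]
        for t in range(1, T):
            cand = score[t - 1][:, None] * A * B[:, visible[t]][None, :]
            back[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0)
        path = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):     # trace the back-pointers
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    A  = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]])
    B  = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]])
    pi = np.array([1/3, 1/3, 1/3])
    print(viterbi([0, 1, 2], A, B, pi))   # -> [0, 1, 2] for these matrices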
• The learning problem (parameter estimation)
  This third problem consists of determining a method to adjust the model parameters θ = [π, A, B] to satisfy a certain optimization criterion. We need to find the best model

        θ̂ = [π̂, Â, B̂]

  such that the probability of the observation sequence is maximized:

        max_θ P(V^T | θ)

  We use an iterative procedure such as Baum-Welch or a gradient-based method to find this local optimum.
