
CS 677 Pattern Recognition

Lecture 4: Introduction to Pattern Recognition and Bayesian Decision Theory

Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering
Spring 24
Chapter 3:
Maximum-Likelihood & Bayesian
Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
– Example of a Specific Case
– The Gaussian Case: unknown μ and Σ
– Bias

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
• Introduction
– Data availability in a Bayesian framework
• We could design an optimal classifier if we knew:
– P(ωi) (priors)
– P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information!

– Design a classifier from a training sample
• No problem with prior estimation
• Samples are often too small for class-conditional estimation (large dimension of feature space!)
– A priori information about the problem
• Do we know something about the distribution?
• → find parameters to characterize the distribution

– Example: normality of P(x | ωi)

  P(x | ωi) ~ N(μi, Σi)

• Characterized by 2 parameters: the mean μi and the covariance Σi

– Estimation techniques
• Maximum-Likelihood (ML) estimation and Bayesian estimation
• The results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but unknown!
– Best parameters are obtained by maximizing the probability of obtaining the samples observed

• Bayesian methods view the parameters as random variables having some known distribution

• In either approach, we use P(ωi | x) for our classification rule!
• Maximum-Likelihood Estimation
• Has good convergence properties as the sample size increases
• Simpler than alternative techniques
– General principle
• Assume we have c classes and

  P(x | ωj) ~ N(μj, Σj)
  P(x | ωj) ≡ P(x | ωj, θj), where

  θj = (μj, Σj) = (μj1, μj2, …, σj11, σj22, cov(xjm, xjn), …)

  i.e. θj collects the components of the mean vector and of the covariance matrix of class ωj
• Use the information provided by the training samples to estimate
θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category

• Suppose that D contains n samples, x1, x2, …, xn, drawn independently; then

  P(D | θ) = ∏_{k=1}^{n} P(xk | θ) = F(θ)

P(D | θ) is called the likelihood of θ w.r.t. the set of samples

• The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ):
"It is the value of θ that best agrees with the actually observed training sample"
• Optimal estimation
– Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator

  ∇θ = [∂/∂θ1, ∂/∂θ2, …, ∂/∂θp]^t

– We define l(θ) as the log-likelihood function

  l(θ) = ln P(D | θ)
  (recall D is the training data)

– New problem statement: determine the θ that maximizes the log-likelihood

  θ̂ = arg max_θ l(θ)

The definition of l(θ) gives:

  l(θ) = Σ_{k=1}^{n} ln p(xk | θ)

and

  ∇θ l = Σ_{k=1}^{n} ∇θ ln p(xk | θ)    (eq. 6)

The set of necessary conditions for an optimum is:

  ∇θ l = 0    (eq. 7)
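A minimal numerical sketch of what eq. 7 means in practice (the synthetic data, the optimizer choice and all variable names are my own, not from the text): maximize l(θ) = Σk ln p(xk | θ) for a univariate Gaussian with θ = (μ, σ) by minimizing the negative log-likelihood. The result should agree with the closed-form estimates derived on the following slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # observed samples x_1 ... x_n

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                       # close to x.mean() and the 1/n standard deviation
```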
• Example, the Gaussian case: unknown μ
– We assume we know the covariance Σ
– p(xk | μ) ~ N(μ, Σ)
(samples are drawn from a multivariate normal population)

  ln p(xk | μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)^t Σ^{-1} (xk − μ)

and

  ∇μ ln p(xk | μ) = Σ^{-1}(xk − μ)    (eq. 9)

θ = μ, therefore the ML estimate for μ must satisfy:

  Σ_{k=1}^{n} Σ^{-1}(xk − μ̂) = 0    (from eqs. 6, 7 & 9)
• Multiplying by Σ and rearranging, we obtain:

  μ̂ = (1/n) Σ_{k=1}^{n} xk

i.e. just the arithmetic average of the training samples!

Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
• Example, the Gaussian case: unknown μ and σ
– First consider the univariate case: unknown μ and σ²

  θ = (θ1, θ2) = (μ, σ²)

  l = ln p(xk | θ) = −(1/2) ln(2π θ2) − (1/(2θ2))(xk − θ1)²

  ∇θ l = [ ∂(ln p(xk | θ))/∂θ1 , ∂(ln p(xk | θ))/∂θ2 ]^t = 0

which gives the two conditions:

  (1/θ2)(xk − θ1) = 0
  −1/(2θ2) + (xk − θ1)²/(2θ2²) = 0

Summation over the training set:

  Σ_{k=1}^{n} (1/θ̂2)(xk − θ̂1) = 0                              (1)

  −Σ_{k=1}^{n} (1/θ̂2) + Σ_{k=1}^{n} (xk − θ̂1)²/θ̂2² = 0        (2)

Combining (1) and (2), one obtains:

  μ̂ = (1/n) Σ_{k=1}^{n} xk ;    σ̂² = (1/n) Σ_{k=1}^{n} (xk − μ̂)²
• The ML estimates for the multivariate case are similar
– The scalars (the samples xk and the mean μ) are replaced with vectors
– The variance σ² is replaced by the covariance matrix Σ

  μ̂ = (1/n) Σ_{k=1}^{n} xk

  Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t
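A minimal sketch of the multivariate ML estimates above (illustrative only; the synthetic data and names are my own): the sample mean and the 1/n-normalized scatter matrix. Note that np.cov divides by n − 1 by default, which is the unbiased estimator discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=1000)   # n x d sample matrix

n = X.shape[0]
mu_hat = X.mean(axis=0)             # (1/n) * sum_k x_k
diff = X - mu_hat
cov_hat = (diff.T @ diff) / n       # (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(cov_hat)
```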
Bias
– The ML estimate for σ² is biased:

  E[ (1/n) Σ_{i=1}^{n} (xi − x̄)² ] = ((n−1)/n) σ²  ≠  σ²

– Extreme case: n = 1 gives E[·] = 0 ≠ σ²

– As n increases the bias is reduced
→ this type of estimator is called asymptotically unbiased
– An elementary unbiased estimator for Σ is:

  C = (1/(n−1)) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t    (the sample covariance matrix)

This estimator is unbiased for all distributions
→ such estimators are called absolutely unbiased
– Our earlier estimator for Σ is biased:

  Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^t

In fact it is asymptotically unbiased; observe that

  Σ̂ = ((n−1)/n) C
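A small illustrative experiment (synthetic data; the numbers and names are my own) showing the bias empirically: averaged over many repetitions, the 1/n variance estimate approaches ((n−1)/n)σ², while the 1/(n−1) sample variance approaches σ².

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # 1/n normalization (ML, biased)
unbiased_var = samples.var(axis=1, ddof=1)  # 1/(n-1) normalization (sample variance)

print(ml_var.mean(), (n - 1) / n * sigma2)  # both near 3.2
print(unbiased_var.mean(), sigma2)          # both near 4.0
```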
Chapter 3:
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
⚫ Bayesian Estimation (BE)
⚫ Bayesian Parameter Estimation: Gaussian Case
⚫ Bayesian Parameter Estimation: General Estimation
⚫ Problems of Dimensionality
⚫ Computational Complexity
⚫ Component Analysis and Discriminants
⚫ Hidden Markov Models
• Bayesian Estimation (Bayesian learning applied to pattern classification problems)
– In MLE, θ was assumed to be fixed
– In BE, θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written:

  P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / Σ_{j=1}^{c} p(x | ωj, D) P(ωj | D)
• To demonstrate the preceding equation, use:

  P(x, D | ωi) = P(x | D, ωi) P(D | ωi)    (from the definition of conditional probability)

  P(x | D) = Σ_j P(x, ωj | D)    (from the law of total probability)

  P(ωi) = P(ωi | D)    (the training sample provides this!)

We assume that the samples in different classes are independent.
Thus:

  P(ωi | x, D) = p(x | ωi, Di) P(ωi) / Σ_{j=1}^{c} p(x | ωj, Dj) P(ωj)
• Bayesian Parameter Estimation: Gaussian Case

Goal: estimate μ using the a posteriori density P(μ | D)

– The univariate case: P(μ | D)
  μ is the only unknown parameter

  P(x | μ) ~ N(μ, σ²)
  P(μ) ~ N(μ0, σ0²)

  (μ0 and σ0 are known!)
  P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ      (Bayes formula)

           = α ∏_{k=1}^{n} P(xk | μ) P(μ)

– But we know that
  p(xk | μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²)

Plugging in their Gaussian expressions and extracting out the factors that do not depend on μ yields:

  p(μ | D) = α' exp{ −(1/2)[ (n/σ² + 1/σ0²) μ² − 2( (1/σ²) Σ_{k=1}^{n} xk + μ0/σ0² ) μ ] }

(from eq. 29, page 93)

Observation: p(μ | D) is an exponential of a quadratic function of μ:

  p(μ | D) = α' exp{ −(1/2)[ (n/σ² + 1/σ0²) μ² − 2( (1/σ²) Σ_{k=1}^{n} xk + μ0/σ0² ) μ ] }

It is therefore again normal: it is called a reproducing density.

  p(μ | D) ~ N(μn, σn²)

• Identifying coefficients in the equation above with those of the generic Gaussian

  p(μ | D) = (1/(√(2π) σn)) exp[ −(1/2)((μ − μn)/σn)² ]

yields expressions for μn and σn²:

  1/σn² = n/σ² + 1/σ0²   and   μn/σn² = (n/σ²) μ̂n + μ0/σ0²

where μ̂n = (1/n) Σ_{k=1}^{n} xk is the sample mean.

Solving for μn and σn² yields:

  μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0

  σn² = σ0² σ² / (n σ0² + σ²)

From these equations we see that, as n increases:
– the variance σn² decreases monotonically
– the estimate of p(μ | D) becomes more and more peaked

– The univariate case: P(x | D)
• P(μ | D) has been computed (in the preceding discussion)
• P(x | D) remains to be computed!

  P(x | D) = ∫ P(x | μ) P(μ | D) dμ   is Gaussian:

  P(x | D) ~ N(μn, σ² + σn²)

We know σ² and how to compute μn and σn², so this provides the desired class-conditional density P(x | Dj, ωj).
Therefore, using P(x | Dj, ωj) together with P(ωj) and the Bayes formula, we obtain the Bayesian classification rule:

  Max_j [ P(ωj | x, D) ]  ≡  Max_j [ P(x | ωj, Dj) P(ωj) ]

• 3.5 Bayesian Parameter Estimation: General Theory

– The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

• The form of P(x | θ) is assumed known, but the value of θ is not known exactly
• Our knowledge about θ is assumed to be contained in a known prior density P(θ)
• The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown density P(x)
The basic problem is:
"Compute the posterior density P(θ | D)",
then "Derive P(x | D)", where

  p(x | D) = ∫ p(x | θ) p(θ | D) dθ

Using Bayes formula, we have:

  P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

and, by the independence assumption:

  p(D | θ) = ∏_{k=1}^{n} p(xk | θ)
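A rough numerical sketch of this general recipe (my own toy setup, not from the text: a 1-D parameter θ, a Gaussian likelihood with known variance, and a discretized grid standing in for the integrals): form p(θ | D) ∝ p(D | θ) p(θ) and then p(x | D) = ∫ p(x | θ) p(θ | D) dθ as a sum over the grid.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(1.0, 1.0, size=30)                # training samples from p(x | theta_true)

theta = np.linspace(-4.0, 4.0, 801)              # grid of candidate parameter values (the mean)
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=2.0)      # known prior p(theta)

# log p(D | theta) = sum_k log p(x_k | theta), evaluated for every grid point
log_lik = norm.logpdf(D[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
post = np.exp(log_lik - log_lik.max()) * prior   # unnormalized p(D | theta) p(theta)
post /= post.sum() * dtheta                      # normalize so it integrates to 1

x_new = 0.5
p_x_given_D = (norm.pdf(x_new, loc=theta, scale=1.0) * post).sum() * dtheta
print(p_x_given_D)                               # approximate p(x | D) at x_new
```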
• Problems of Dimensionality
– Problems involving 50 or 100 features are common (usually binary valued)
– Note: microarray data might entail ~20000 real-valued features
– Classification accuracy depends on:
• dimensionality
• amount of training data
• discrete vs. continuous features
• Case of two multivariate normal classes with the same covariance
– P(x | ωj) ~ N(μj, Σ), j = 1, 2
– Statistically independent features
– If the priors are equal, then the Bayes error is:

  P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du

  where r² = (μ1 − μ2)^t Σ^{-1} (μ1 − μ2)
  is the squared Mahalanobis distance between the class means

  lim_{r→∞} P(error) = 0
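A short sketch (the means and covariance below are made-up numbers, and the helper names are my own): compute the squared Mahalanobis distance r² between the two class means and the corresponding Bayes error for equal priors; the integral above is just the upper tail of a standard normal evaluated at r/2.

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])                # shared class covariance

diff = mu1 - mu2
r2 = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
bayes_error = norm.sf(np.sqrt(r2) / 2.0)    # P(error) for equal priors
print(np.sqrt(r2), bayes_error)
```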
• If the features are conditionally independent, then:

  Σ = diag(σ1², σ2², …, σd²)

  r² = Σ_{i=1}^{d} ( (μi1 − μi2) / σi )²

• Do we remember what conditional independence is?

• Example for binary features:
Let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product over i of the per-feature terms pi^xi (1 − pi)^(1−xi)
• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
– This doesn't require independence

• Adding independent features helps increase r → reduces the error

• Caution: adding features increases the cost & complexity of the feature extractor and of the classifier

• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
– we have the wrong model!
– we don't have enough training data to support the additional dimensions

Overfitting
• Dimensionality of model vs. size of training data
– Issue: not enough data to support the model
– Possible solutions:
• Reduce model dimensionality
• Make (possibly incorrect) assumptions to better estimate Σ
Overfitting
• Estimate a better Σ:
– use data pooled from all classes
• normalization issues
– use a pseudo-Bayesian form λΣ0 + (1 − λ)Σn
– "doctor" Σ̂ by thresholding entries
• reduces chance correlations
– assume statistical independence
• zero all off-diagonal elements
Shrinkage

• Shrinkage: a weighted combination of the common and individual covariances

  Σi(α) = [ (1 − α) ni Σi + α n Σ ] / [ (1 − α) ni + α n ]    for 0 < α < 1

• We can also shrink the estimated common covariance toward the identity matrix:

  Σ(β) = (1 − β) Σ + β I    for 0 < β < 1
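A minimal sketch of the two shrinkage estimators above (the covariance values, sample counts, and the weights α and β are illustrative choices, not values from the text).

```python
import numpy as np

def shrink_class_cov(cov_i, n_i, cov_common, n, alpha):
    """Blend a class covariance toward the common (pooled) covariance."""
    return ((1 - alpha) * n_i * cov_i + alpha * n * cov_common) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(cov, beta):
    """Blend a covariance estimate toward the identity matrix."""
    return (1 - beta) * cov + beta * np.eye(cov.shape[0])

cov_i = np.array([[2.0, 0.8], [0.8, 1.0]])    # covariance of one class (n_i samples)
cov_all = np.array([[1.5, 0.2], [0.2, 1.2]])  # pooled covariance (n samples)
print(shrink_class_cov(cov_i, n_i=20, cov_common=cov_all, n=200, alpha=0.3))
print(shrink_to_identity(cov_all, beta=0.1))
```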
• Component Analysis and Discriminants
– Combine features in order to reduce the dimension of the feature space
– Linear combinations are simple to compute and tractable
– Project high-dimensional data onto a lower-dimensional space
– Two classical approaches for finding an "optimal" linear transformation (a PCA sketch follows below):
• PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
• MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense
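A compact PCA sketch of the first approach (illustrative only; the data and variable names are my own): project centered data onto the k leading eigenvectors of the sample covariance, i.e. the directions that best represent the data in a least-squares sense.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.0, 0.5],
                             [1.0, 2.0, 0.3],
                             [0.5, 0.3, 0.5]], size=500)   # n x d data

k = 2
Xc = X - X.mean(axis=0)                 # center the data
cov = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :k]             # top-k principal directions
Z = Xc @ W                              # projected (n x k) representation
print(Z.shape, eigvals[::-1])
```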
• Hidden Markov Models:
– Markov Chains
– Goal: make a sequence of decisions
• Processes that unfold in time: states at time t are influenced by the state at time t − 1
• Applications: speech recognition, gesture recognition, part-of-speech tagging and DNA sequencing
• Any temporal process without memory:
  ω^T = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states
  We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
• The system can revisit a state at different steps, and not every state needs to be visited
– First-order Markov models
• Our production of any sequence is described by the transition probabilities

  P(ωj(t + 1) | ωi(t)) = aij

• With the model θ = (aij, ω^T), the example sequence above has probability

  P(ω^T | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)

Example: speech recognition
("production of spoken words")

Production of the word "pattern", represented by phonemes:
/p/ /a/ /tt/ /er/ /n/ //   ( // = silent state )
Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
Chapter 3 (Part 3):
Maximum-Likelihood and Bayesian Parameter
Estimation (Section 3.10)

Hidden Markov Model: Extension of Markov Chains

• Hidden Markov Model (HMM)
– Interaction of the visible states with the hidden states:
  Σk bjk = 1 for all j, where bjk = P(vk(t) | ωj(t))

– 3 problems are associated with this model:
• The evaluation problem
• The decoding problem
• The learning problem
• The evaluation problem
It is the probability that the model produces a sequence V^T of visible states:

  P(V^T) = Σ_{r=1}^{rmax} P(V^T | ω_r^T) P(ω_r^T)

where each r indexes a particular sequence of T hidden states:

  ω_r^T = { ω(1), ω(2), …, ω(T) }

  (1)  P(V^T | ω_r^T) = ∏_{t=1}^{T} P(v(t) | ω(t))

  (2)  P(ω_r^T) = ∏_{t=1}^{T} P(ω(t) | ω(t − 1))

Using equations (1) and (2), we can write:

  P(V^T) = Σ_{r=1}^{rmax} ∏_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t − 1))

Interpretation: the probability that we observe the particular sequence of T visible states V^T is equal to the sum, over all rmax possible sequences of hidden states, of the conditional probability that the system has made a particular transition, multiplied by the probability that it then emitted the visible symbol in our target sequence.

Example: Let ω1, ω2, ω3 be the hidden states; let v1, v2, v3 be the visible states; and let V^3 = {v1, v2, v3} be the sequence of visible states. Then

  P({v1, v2, v3}) = P(ω1)·P(v1 | ω1)·P(ω2 | ω1)·P(v2 | ω2)·P(ω3 | ω2)·P(v3 | ω3) + …

  (the sum runs over all possible (3³ = 27) hidden-state sequences!)
First possibility: the hidden sequence ω1 (t = 1), ω2 (t = 2), ω3 (t = 3) emitting v1, v2, v3

Second possibility: the hidden sequence ω2 (t = 1), ω3 (t = 2), ω1 (t = 3) emitting v1, v2, v3

  P({v1, v2, v3}) = P(ω2)·P(v1 | ω2)·P(ω3 | ω2)·P(v2 | ω3)·P(ω1 | ω3)·P(v3 | ω1) + …

Therefore:

  P({v1, v2, v3}) = Σ_{possible sequences of hidden states} ∏_{t=1}^{3} P(v(t) | ω(t)) · P(ω(t) | ω(t − 1))
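A brute-force sketch of the evaluation problem on a toy model (π, A, B and the observed sequence below are made-up values, not from the text): enumerate all 27 hidden-state sequences and sum the transition-times-emission products, exactly as in the formula above. In practice the forward algorithm computes the same quantity far more efficiently.

```python
import itertools
import numpy as np

pi = np.array([0.5, 0.3, 0.2])            # P(omega(1) = omega_i)
A = np.array([[0.6, 0.3, 0.1],            # a_ij = P(omega_j | omega_i)
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
B = np.array([[0.7, 0.2, 0.1],            # b_jk = P(v_k | omega_j)
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
V = [0, 1, 2]                             # observed visible sequence v1, v2, v3

total = 0.0
for hidden in itertools.product(range(3), repeat=len(V)):   # all 27 hidden paths
    p = pi[hidden[0]] * B[hidden[0], V[0]]
    for t in range(1, len(V)):
        p *= A[hidden[t - 1], hidden[t]] * B[hidden[t], V[t]]
    total += p
print(total)                              # P(V^T) under the model
```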
• The decoding problem (optimal state sequence)

Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states.
This problem can be expressed mathematically as: find the single "best" state sequence (of hidden states)

  ω̂(1), ω̂(2), …, ω̂(T)   such that:

  ω̂(1), ω̂(2), …, ω̂(T) = arg max_{ω(1), ω(2), …, ω(T)} P[ ω(1), ω(2), …, ω(T), v(1), v(2), …, v(T) | θ ]

Note that the summation has disappeared, since we want to find only the one unique best case!

where θ = [π, A, B]:
  π = {πi} = P(ω(1) = ωi)   (initial state probabilities)
  A = {aij} = P(ω(t + 1) = ωj | ω(t) = ωi)
  B = {bjk} = P(v(t) = vk | ω(t) = ωj)

In the preceding example, this computation corresponds to the selection of the best path amongst:

  {ω1(t = 1), ω2(t = 2), ω3(t = 3)}, {ω2(t = 1), ω3(t = 2), ω1(t = 3)},
  {ω3(t = 1), ω1(t = 2), ω2(t = 3)}, {ω3(t = 1), ω2(t = 2), ω1(t = 3)},
  {ω2(t = 1), ω1(t = 2), ω3(t = 3)}, …
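A compact sketch of the standard dynamic-programming (Viterbi) decoder for this problem (the toy model π, A, B and the observation sequence are my own values): keep, for each time step and state, the probability of the best partial path ending there plus a back-pointer, then trace the pointers back.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])            # initial state probabilities
A = np.array([[0.6, 0.3, 0.1],            # transition probabilities a_ij
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
B = np.array([[0.7, 0.2, 0.1],            # emission probabilities b_jk
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
V = [0, 1, 2]                             # observed visible sequence

T, S = len(V), len(pi)
delta = np.zeros((T, S))                  # best partial-path probability ending in each state
psi = np.zeros((T, S), dtype=int)         # back-pointers

delta[0] = pi * B[:, V[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] * A    # scores[i, j]: best path through i, then move to j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) * B[:, V[t]]

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):             # trace the back-pointers
    path.append(int(psi[t, path[-1]]))
path.reverse()
print(path, delta[-1].max())              # most probable hidden sequence and its probability
```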
• The learning problem (parameter estimation)
This third problem consists of determining a method to adjust the model parameters θ = [π, A, B] to satisfy a certain optimization criterion. We need to find the best model

  θ̂ = [π̂, Â, B̂]

such that it maximizes the probability of the observation sequence:

  max_θ P(V^T | θ)

We use an iterative procedure such as Baum-Welch (forward-backward) or a gradient method to find this local optimum.
