
CS 677 Pattern Recognition

Week 2-3: Introduction to Pattern Recognition and Bayesian Decision Theory

Dr. Amr El-Wakeel


Lane Department of Computer Science and Electrical Engineering

Spring 24
Machine Perception

• Build a machine that can recognize patterns:


– Speech recognition

– Fingerprint identification

– OCR (Optical Character Recognition)

– Medical diagnosis and event detection

– DNA sequence identification

2
Example

• “Sorting incoming fish on a conveyor according to species using optical sensing”

(Figure: the species to be sorted are sea bass and salmon)

3
Example

• Sorting fish: incoming fish are sorted according to species using optical sensing (sea bass or salmon?)

• Problem Analysis:
▪ Set up a camera and take some sample images to extract features
▪ Consider features such as length, lightness, width, number and shape of fins, position of mouth, etc.
▪ This is the set of some suggested features to explore for use in our classifier!

(Figure: processing stages: sensing, pre-processing, segmentation, feature extraction)

– Use a segmentation operation to isolate fish from one another and from the background

• Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features

• The features are passed to a classifier

5
Classification
– Select the length of the fish as a possible feature for
discrimination

6
• The length is a poor feature alone!
• Select the lightness as a possible feature.

7
• Threshold decision boundary and cost
relationship

– Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)

Task of decision theory

8
Multiple Features

• Adopt the lightness and add the width of the fish

Fish: xT = [x1, x2], where x1 = lightness and x2 = width

9
Multiple Features

10
• We might add other features that are not correlated with
the ones we already have. A precaution should be taken
not to reduce the performance by adding such “noisy
features”

• Ideally, the best decision boundary should be the one which provides optimal performance, such as the one shown in the following figure

11
12
• However, our satisfaction is premature
because the central aim of designing a
classifier is to correctly classify novel input

Issue of generalization!

13
14
Pattern Recognition Systems

• Sensing

– Use of a transducer (camera or microphone)


– The PR system depends on the bandwidth, resolution, sensitivity, and distortion of the transducer

• Segmentation and grouping

– Patterns should be well separated and should not overlap

15
Pattern Recognition Systems

16
Pattern Recognition Systems
• Feature extraction
– Discriminative features
– Invariant features with respect to translation, rotation and
scale.

• Classification
– Use a feature vector provided by a feature extractor to assign
the object to a category

• Post Processing
– Exploit context (input-dependent information other than the target pattern itself) to improve performance
17
The Design Cycle

• Data collection
• Feature Choice
• Model Choice
• Training
• Evaluation
• Computational Complexity

18
• Computational Complexity

– What is the trade-off between computational


ease and performance?

– (How does an algorithm scale as a function of the number of features, patterns, or categories?)
– Example: labelling a 20 x 20 binary-pixel image gives 2^(20x20) = 2^400, on the order of 10^120 possible patterns.

19
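As a quick sanity check on that count, here is a minimal sketch in Python (the exact value is 2^400; 10^120 is just its order of magnitude):

    # Number of distinct 20 x 20 binary images: 2 ** (20 * 20)
    n_patterns = 2 ** (20 * 20)
    print(len(str(n_patterns)) - 1)   # 120 -> n_patterns is on the order of 10**120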
Introduction

Bayesian Decision Theory – Continuous Features

Introduction

• The sea bass/salmon example

– State of nature, prior

• State of nature is a random variable

• The catch of salmon and sea bass is equiprobable

– P(ω1) = P(ω2) (uniform priors)

– P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

21
• Decision rule with only the prior information
– Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

• Use of the class-conditional information

• P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon (a particular density function for the random variable X)
22
23
• Posterior, likelihood, evidence

– P(ωj | x) = P(x | ωj) P(ωj) / P(x) (Bayes formula)

– Where, in the case of two categories:

P(x) = Σ (j = 1 to 2) P(x | ωj) P(ωj)

– Posterior = (Likelihood * Prior) / Evidence

24
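A minimal numeric sketch in Python of the two-category Bayes formula above (the likelihood and prior values are made up for illustration):

    # Hypothetical class-conditional densities and priors for a single observation x
    likelihoods = {"salmon": 0.30, "sea_bass": 0.10}   # P(x | w_j), assumed values
    priors      = {"salmon": 0.40, "sea_bass": 0.60}   # P(w_j), assumed values

    # Evidence: P(x) = sum_j P(x | w_j) P(w_j)
    evidence = sum(likelihoods[c] * priors[c] for c in likelihoods)

    # Posterior = (likelihood * prior) / evidence, for each category
    posteriors = {c: likelihoods[c] * priors[c] / evidence for c in likelihoods}
    decision = max(posteriors, key=posteriors.get)     # decide the class with the larger posterior
    p_error = 1.0 - posteriors[decision]               # P(error | x) is the smaller posterior
    print(posteriors, decision, p_error)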
25
• Decision given the posterior probabilities

X is an observation for which:

if P(ω1 | x) > P(ω2 | x), decide that the true state of nature = ω1

if P(ω1 | x) < P(ω2 | x), decide that the true state of nature = ω2

Therefore:
whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

26
• Minimizing the probability of error

• Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)

27
Bayesian Decision Theory – Continuous Features

• Generalization of the preceding ideas

– Use of more than one feature


– Use more than two states of nature
– Allowing actions, and not only deciding on the state of nature
– Introducing a loss function which is more general than the probability of error

28
• Allowing actions other than classification primarily
allows the possibility of rejection
– Rejection in the sense of abstention
– Don’t make a decision if the alternatives are too close
– This must be tempered by the cost of indecision

• The loss function states how costly each action taken is

29
Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”)

Let {α1, α2, …, αa} be the set of possible actions

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

30
Overall risk

R = the expected loss of a decision rule α(x): the sum (integral) over all x of the conditional risk R(α(x) | x)

Conditional risk

Minimizing R ⇔ minimizing R(αi | x) at each x, for i = 1,…, a

R(αi | x) = Σ (j = 1 to c) λ(αi | ωj) P(ωj | x)

for each action αi (i = 1,…, a)

Note: this is the risk specifically for observation x
31
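A small sketch in Python of the conditional-risk computation above, using an assumed 2-action, 2-state loss matrix and assumed posteriors:

    # Assumed loss matrix: loss[i][j] = lambda(alpha_i | omega_j)
    loss = [[0.0, 2.0],      # action 1 (decide omega_1): no loss if omega_1, loss 2 if omega_2
            [1.0, 0.0]]      # action 2 (decide omega_2): loss 1 if omega_1, no loss if omega_2
    posteriors = [0.7, 0.3]  # assumed P(omega_1 | x), P(omega_2 | x)

    # Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) * P(omega_j | x)
    risks = [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]
    best_action = risks.index(min(risks))    # action with minimum conditional risk
    print(risks, best_action)                # risks [0.6, 0.7] -> take action 1 (decide omega_1)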
Select the action αi for which R(αi | x) is minimum

R is then minimum, and R in this case is called the Bayes risk = the best performance that can be achieved!

32
• Two-category classification
α1: deciding ω1
α2: deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

33
Our rule is the following:
if R(α1 | x) < R(α2 | x)
action α1: “decide ω1” is taken

Substituting the definition of R(αi | x), we have:

decide ω1 if:
λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x)

and decide ω2 otherwise


34
We can rewrite
λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x)

as
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)

35
Finally, we can rewrite
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)

using Bayes formula and the posterior probabilities to get:

decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise


36
If λ21 > λ11, then we can express our rule as a likelihood ratio:

The preceding rule is equivalent to the following rule:

if P(x | ω1) / P(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

then take action α1 (decide ω1),

otherwise take action α2 (decide ω2)
37
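A brief sketch in Python of this likelihood-ratio test (the loss values, likelihoods, and priors are assumed for illustration):

    # Assumed quantities for a single observation x
    px_w1, px_w2 = 0.35, 0.10                  # class-conditional densities P(x | w1), P(x | w2)
    p_w1, p_w2 = 0.40, 0.60                    # priors P(w1), P(w2)
    l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0    # losses lambda_ij = lambda(alpha_i | w_j)

    # Decision threshold on the likelihood ratio (requires l21 > l11)
    threshold = (l12 - l22) / (l21 - l11) * (p_w2 / p_w1)
    likelihood_ratio = px_w1 / px_w2

    action = "decide w1" if likelihood_ratio > threshold else "decide w2"
    print(likelihood_ratio, threshold, action)   # 3.5 > 3.0 -> decide w1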
Optimal decision property

“If the likelihood ratio exceeds a threshold


value independent of the input pattern x, we
can take optimal actions”

38
Minimum-Error-Rate Classification

• Actions are decisions on classes


If action αi is taken and the true state of nature is ωj, then:
the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error, which is the error rate

39
• Introduction of the zero-one loss function:

λ(αi, ωj) = 0 if i = j, 1 if i ≠ j,  for i, j = 1, …, c

Therefore, the conditional risk is:

R(αi | x) = Σ (j = 1 to c) λ(αi | ωj) P(ωj | x)

= Σ (j ≠ i) P(ωj | x) = 1 − P(ωi | x)

“The risk corresponding to this loss function is the average probability of error”

40
• The Bayes decision rule depends on minimizing risk

• Minimizing the risk requires selecting the ωi that maximizes P(ωi | x)
(since R(αi | x) = 1 − P(ωi | x))

• For minimum error rate:

– Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i


41
• Regions of decision and the zero-one loss function, therefore (using the likelihood-ratio formula):

Let [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)] = θλ, then decide ω1 if: P(x | ω1) / P(x | ω2) > θλ

• If λ is the zero-one loss function, which means:

λ = [[0, 1], [1, 0]]

then θλ = P(ω2) / P(ω1) = θa

if λ = [[0, 2], [1, 0]], then θλ = 2 P(ω2) / P(ω1) = θb

42
43
Classifiers, Discriminant Functions
and Decision Surfaces

• The multi-category case

– Set of discriminant functions gi(x), i = 1,…, c

– The classifier assigns a feature vector x to class ωi if:
gi(x) > gj(x) for all j ≠ i

44
45
• Let gi(x) = −R(αi | x)
(max. discriminant corresponds to min. risk!)

• For the minimum error rate, we take gi(x) = P(ωi | x)

(max. discriminant corresponds to max. posterior!)

Equivalently, one may use:

gi(x) = P(x | ωi) P(ωi)

gi(x) = ln P(x | ωi) + ln P(ωi)


46
• Feature space divided into c decision regions:
if gi(x) > gj(x) for all j ≠ i, then x is in Ri
(Ri means: assign x to ωi)

• The two-category case

– A classifier is a “dichotomizer” that has two discriminant functions g1 and g2

Let g(x) ≡ g1(x) − g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2


47
– The computation of g(x)

g(x) = P(ω1 | x) − P(ω2 | x)

or, equivalently,

g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]

48
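A compact Python sketch of the second (log-ratio) form of the dichotomizer, with assumed density values and priors:

    import math

    # Assumed class-conditional density values at x and priors
    px_w1, px_w2 = 0.35, 0.10
    p_w1, p_w2 = 0.40, 0.60

    # g(x) = ln[P(x | w1) / P(x | w2)] + ln[P(w1) / P(w2)]
    g = math.log(px_w1 / px_w2) + math.log(p_w1 / p_w2)
    print("decide w1" if g > 0 else "decide w2")   # g is about 0.85 > 0 -> decide w1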
49
The Normal Density
• Univariate density
– Density which is analytically tractable
– Continuous density
– A lot of processes are asymptotically Gaussian
– Handwritten characters and speech sounds can be seen as ideal or prototype patterns corrupted by random processes (central limit theorem)

P(x) = (1 / (√(2π) σ)) exp[ −½ ((x − μ) / σ)² ],

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
50
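A small sketch evaluating this univariate normal density in plain Python (the values of μ and σ are arbitrary):

    import math

    def normal_pdf(x, mu=0.0, sigma=1.0):
        # (1 / (sqrt(2*pi) * sigma)) * exp(-0.5 * ((x - mu) / sigma)**2)
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

    print(normal_pdf(0.0))                       # about 0.3989, the peak of the standard normal
    print(normal_pdf(2.0, mu=1.0, sigma=0.5))    # density of N(1, 0.25) evaluated at x = 2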
51
• Multivariate density

– Multivariate normal density in d dimensions is:

P(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −½ (x − μ)^t Σ^(−1) (x − μ) ]

where:
x = (x1, x2, …, xd)^t (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)^t is the mean vector
Σ = d×d covariance matrix
|Σ| and Σ^(−1) are its determinant and inverse, respectively

52
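A minimal numpy sketch of the d-dimensional normal density (the mean vector and covariance matrix are assumed for illustration):

    import numpy as np

    def multivariate_normal_pdf(x, mu, sigma):
        # Evaluate the d-dimensional normal density at x
        d = len(mu)
        diff = x - mu
        norm_const = 1.0 / (((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(sigma)))
        mahalanobis = diff @ np.linalg.inv(sigma) @ diff   # (x - mu)^t Sigma^-1 (x - mu)
        return norm_const * np.exp(-0.5 * mahalanobis)

    mu = np.array([1.0, 2.0])                    # assumed mean vector
    sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # assumed covariance matrix
    print(multivariate_normal_pdf(np.array([1.5, 1.0]), mu, sigma))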
Discriminant Functions for the Normal Density

• We saw that minimum error-rate classification can be achieved by the discriminant function

gi(x) = ln P(x | ωi) + ln P(ωi)

• Case of the multivariate normal:

gi(x) = −½ (x − μi)^t Σi^(−1) (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)

53
Case Σi = σ²I (I stands for the identity matrix)
• What does “Σi = σ²I” say about the dimensions?
• What about the variance of each dimension?
Note: both |Σi| and (d/2) ln 2π are independent of i in

gi(x) = −½ (x − μi)^t Σi^(−1) (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)

Thus we can simplify to:

gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)

where ‖·‖ denotes the Euclidean norm
54
• We can further simplify by recognizing that the quadratic term x^t x implicit in the Euclidean norm is the same for all i.

gi(x) = wi^t x + wi0 (linear discriminant function)

where:
wi = μi / σ²;  wi0 = −(1 / (2σ²)) μi^t μi + ln P(ωi)

(wi0 is called the threshold for the ith category!)

55
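A short Python sketch of this linear discriminant for the Σi = σ²I case, with assumed means, variance, and priors:

    import numpy as np

    def linear_discriminant(x, mu_i, sigma2, prior_i):
        # g_i(x) = w_i^t x + w_i0, with w_i = mu_i / sigma^2 and
        # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
        w_i = mu_i / sigma2
        w_i0 = -mu_i @ mu_i / (2.0 * sigma2) + np.log(prior_i)
        return w_i @ x + w_i0

    # Assumed parameters for two categories
    mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    priors = [0.5, 0.5]
    sigma2 = 1.0

    x = np.array([1.0, 1.2])
    scores = [linear_discriminant(x, mu[i], sigma2, priors[i]) for i in range(2)]
    print(int(np.argmax(scores)))   # 0 -> x is assigned to the first category (closer mean)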
• A classifier that uses linear discriminant
functions is called “a linear machine”

• The decision surfaces for a linear machine are pieces of hyperplanes defined by:

gi(x) = gj(x)

The equation can be written as:

w^t (x − x0) = 0
56
– The hyperplane separating Ri and Rj:

x0 = ½ (μi + μj) − [σ² / ‖μi − μj‖²] ln [P(ωi) / P(ωj)] (μi − μj)

is always orthogonal to the line linking the means!

if P(ωi) = P(ωj) then x0 = ½ (μi + μj)

57
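A quick sketch computing x0 for the Σi = σ²I case and checking that the two discriminants agree there (parameters assumed as in the previous sketch, but with unequal priors):

    import numpy as np

    mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 3.0])
    sigma2 = 1.0
    p_i, p_j = 0.7, 0.3   # assumed unequal priors

    diff = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (diff @ diff)) * np.log(p_i / p_j) * diff

    def g(x, mu, prior):
        # Linear discriminant for the sigma^2 I case
        return (mu / sigma2) @ x - mu @ mu / (2 * sigma2) + np.log(prior)

    print(x0, np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j)))   # x0 lies on the decision boundary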
58
59
60
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)

The hyperplane separating Ri and Rj has the equation

w^t (x − x0) = 0

where

w = Σ^(−1) (μi − μj)

and

x0 = ½ (μi + μj) − [ln(P(ωi) / P(ωj)) / ((μi − μj)^t Σ^(−1) (μi − μj))] (μi − μj)

(the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
61
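A sketch of w and x0 for this common-covariance case, using assumed means, covariance, and priors:

    import numpy as np

    mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
    sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed shared covariance
    p_i, p_j = 0.6, 0.4

    sigma_inv = np.linalg.inv(sigma)
    diff = mu_i - mu_j

    w = sigma_inv @ diff
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / (diff @ sigma_inv @ diff)) * diff

    # Decision rule: decide w_i if w^t (x - x0) > 0, and w_j otherwise
    x = np.array([1.0, 1.0])
    print("decide w_i" if w @ (x - x0) > 0 else "decide w_j")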
62
63
• Case Σi = arbitrary
– The covariance matrices are different for each category

gi(x) = x^t Wi x + wi^t x + wi0

where:
Wi = −½ Σi^(−1)
wi = Σi^(−1) μi
wi0 = −½ μi^t Σi^(−1) μi − ½ ln |Σi| + ln P(ωi)

The decision surfaces are hyperquadrics
(Hyperquadrics are: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids)

64
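A sketch in Python of this quadratic discriminant (the general-covariance case), with assumed per-class parameters:

    import numpy as np

    def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
        # g_i(x) = x^t W_i x + w_i^t x + w_i0 for an arbitrary covariance Sigma_i
        sigma_inv = np.linalg.inv(sigma_i)
        W_i = -0.5 * sigma_inv
        w_i = sigma_inv @ mu_i
        w_i0 = (-0.5 * mu_i @ sigma_inv @ mu_i
                - 0.5 * np.log(np.linalg.det(sigma_i))
                + np.log(prior_i))
        return x @ W_i @ x + w_i @ x + w_i0

    # Assumed parameters for two categories with different covariances
    params = [(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
              (np.array([2.0, 2.0]), np.array([[3.0, 0.5], [0.5, 2.0]]), 0.5)]

    x = np.array([1.0, 1.5])
    scores = [quadratic_discriminant(x, m, s, p) for m, s, p in params]
    print(int(np.argmax(scores)))   # index of the winning category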
65
66
67

Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm
➔ concerned with probabilities rather than probability densities in Bayes formula:

P(ωj | x) = P(x | ωj) P(ωj) / P(x)

where

P(x) = Σ (j = 1 to c) P(x | ωj) P(ωj)
68

Bayes Decision Theory – Discrete Features

• Conditional risk is defined as before: R(α | x)

• Approach is still to minimize risk:

α* = arg min (over i) R(αi | x)
Bayes Decision Theory – Discrete Features

• Case of independent binary features in a 2-category problem

Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)

69
70

Bayes Decision Theory – Discrete Features

• Assuming conditional independence, P(x | ωi) can be written as a product of component probabilities:

P(x | ω1) = Π (i = 1 to d) pi^xi (1 − pi)^(1 − xi)

and

P(x | ω2) = Π (i = 1 to d) qi^xi (1 − qi)^(1 − xi)

yielding a likelihood ratio given by:

P(x | ω1) / P(x | ω2) = Π (i = 1 to d) (pi / qi)^xi ((1 − pi) / (1 − qi))^(1 − xi)
71

Bayes Decision Theory – Discrete Features

• Taking our likelihood ratio

P(x | ω1) / P(x | ω2) = Π (i = 1 to d) (pi / qi)^xi ((1 − pi) / (1 − qi))^(1 − xi)

and plugging it into Eq. 31

g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [p(ω1) / p(ω2)]

yields:

g(x) = Σ (i = 1 to d) [ xi ln (pi / qi) + (1 − xi) ln ((1 − pi) / (1 − qi)) ] + ln [p(ω1) / p(ω2)]

• The discriminant function in this case is:

g(x) = Σ (i = 1 to d) wi xi + w0

where:

wi = ln [pi (1 − qi) / (qi (1 − pi))],  i = 1, …, d

and:

w0 = Σ (i = 1 to d) ln [(1 − pi) / (1 − qi)] + ln [P(ω1) / P(ω2)]

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
72
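A short Python sketch of this linear discriminant for independent binary features (the pi, qi, and priors are assumed for illustration):

    import math

    # Assumed feature probabilities: p_i = P(x_i = 1 | w1), q_i = P(x_i = 1 | w2)
    p = [0.8, 0.7, 0.6]
    q = [0.3, 0.4, 0.5]
    prior1, prior2 = 0.5, 0.5

    # Weights w_i = ln[p_i (1 - q_i) / (q_i (1 - p_i))] and bias w_0
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + math.log(prior1 / prior2)

    x = [1, 0, 1]   # an observed binary feature vector
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    print("decide w1" if g > 0 else "decide w2")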
