Machine Perception
Lecture 2 & 3 – Spring 2024

– Fingerprint identification (example application)
Example
[Figure: the two species to classify – sea bass and salmon]
Example
• Problem Analysis:
▪ Set up a camera and take some sample images to extract features.
▪ Consider features such as length, lightness, width, number and shape of fins, position of mouth, etc.
▪ This is a set of suggested features to explore for use in our classifier!
[Figure: processing pipeline – Sensing → Pre-processing / Segmentation → Feature Extraction → Classification]
– Select the length of the fish as a possible feature for discrimination.
• The length is a poor feature alone!
• Select the lightness as a possible feature.
• Threshold decision boundary and its relationship to cost: where we place the decision threshold depends on the relative cost of each type of misclassification.
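As a concrete illustration (not from the original slides), here is a minimal sketch of a single-feature threshold rule on lightness; the threshold, the cost values, and the sample data are all invented for illustration.

```python
import numpy as np

# Hypothetical rule: call a fish "sea bass" when its lightness exceeds a threshold.
def classify_by_lightness(lightness, threshold=5.0):
    return "sea bass" if lightness > threshold else "salmon"

# The threshold can be tuned to reflect asymmetric costs: if labelling a sea bass
# as salmon is the costlier error, shift the threshold so fewer sea bass fall on
# the "salmon" side of the boundary.
def choose_threshold(bass_lightness, salmon_lightness,
                     cost_bass_as_salmon=2.0, cost_salmon_as_bass=1.0):
    candidates = np.linspace(0, 10, 101)                 # candidate thresholds
    def expected_cost(t):
        bass_errors   = np.mean(bass_lightness <= t)     # bass called salmon
        salmon_errors = np.mean(salmon_lightness > t)    # salmon called bass
        return cost_bass_as_salmon * bass_errors + cost_salmon_as_bass * salmon_errors
    return min(candidates, key=expected_cost)

# Invented sample data just to exercise the functions:
rng = np.random.default_rng(0)
bass   = rng.normal(6.0, 1.0, 200)    # assumption: sea bass tend to be lighter here
salmon = rng.normal(4.0, 1.0, 200)
t_star = choose_threshold(bass, salmon)
print(t_star, classify_by_lightness(5.2, t_star))
```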
Multiple Features
[Figure: feature space of lightness vs. width]
• We might add other features that are not correlated with the ones we already have. Care should be taken not to reduce performance by adding such "noisy features".
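A small synthetic check of this caution (my own illustration, with made-up Gaussian classes): adding a pure-noise feature to a nearest-mean classifier dilutes the informative dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two made-up classes, informative in 2 dimensions (think lightness, width).
X1 = rng.normal([4.0, 2.0], 1.0, size=(n, 2))
X2 = rng.normal([6.0, 4.0], 1.0, size=(n, 2))

def nearest_mean_accuracy(A, B):
    mA, mB = A.mean(axis=0), B.mean(axis=0)
    correct  = np.sum(np.linalg.norm(A - mA, axis=1) < np.linalg.norm(A - mB, axis=1))
    correct += np.sum(np.linalg.norm(B - mB, axis=1) < np.linalg.norm(B - mA, axis=1))
    return correct / (len(A) + len(B))

print("2 informative features:", nearest_mean_accuracy(X1, X2))

# Append a "noisy feature" that carries no class information at all.
noise_scale = 10.0
X1n = np.hstack([X1, rng.normal(0, noise_scale, size=(n, 1))])
X2n = np.hstack([X2, rng.normal(0, noise_scale, size=(n, 1))])
print("plus one noisy feature:", nearest_mean_accuracy(X1n, X2n))
```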
• However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input.
• This is the issue of generalization!
Pattern Recognition Systems
• Sensing
• Feature extraction
– Discriminative features
– Features invariant with respect to translation, rotation and scale
• Classification
– Use the feature vector provided by the feature extractor to assign the object to a category
• Post-processing
– Exploit context (input-dependent information other than from the target pattern itself) to improve performance
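A minimal, hypothetical skeleton of how these stages compose; every function below is a placeholder of my own, not an API from the lecture.

```python
import numpy as np

# Placeholder stages of a pattern recognition system; each would be replaced
# by a real implementation (camera driver, segmentation algorithm, etc.).
def sense():                     # acquire a raw signal (image, sound, ...)
    return np.zeros((64, 64))    # dummy image

def segment(signal):             # isolate the object of interest
    return signal

def extract_features(region):    # produce a feature vector (e.g. length, lightness)
    return np.array([region.mean(), region.std()])

def classify(features):          # assign a category from the feature vector
    return "salmon" if features[0] < 0.5 else "sea bass"

def post_process(label, context=None):   # exploit context to revise the decision
    return label

label = post_process(classify(extract_features(segment(sense()))))
```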
The Design Cycle
• Data collection
• Feature choice
• Model choice
• Training
• Evaluation
• Computational complexity
Introduction
• Decision rule with only the prior information
– Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2
• The evidence (unconditional probability of observing x) is
$$P(x) = \sum_{j=1}^{2} P(x \mid \omega_j)\, P(\omega_j)$$
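A tiny numeric sketch of the prior-only rule and of the evidence P(x); the priors and the class-conditional values at the observed x are invented for illustration.

```python
# Assumed priors and class-conditional likelihoods at one observed x (made up).
priors      = {"w1": 0.7, "w2": 0.3}
likelihoods = {"w1": 0.2, "w2": 0.6}      # P(x | w_j) at this particular x

# Decision rule using only the priors: always pick the more probable class.
decision_prior_only = max(priors, key=priors.get)           # -> "w1"

# Evidence: P(x) = sum_j P(x | w_j) P(w_j)
evidence = sum(likelihoods[w] * priors[w] for w in priors)  # 0.2*0.7 + 0.6*0.3 = 0.32
```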
• Decision given the posterior probabilities
Therefore, whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1
• Minimizing the probability of error
Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)
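Continuing the same made-up numbers: posteriors via Bayes' formula, then the minimum-error decision and its probability of error.

```python
priors      = {"w1": 0.7, "w2": 0.3}
likelihoods = {"w1": 0.2, "w2": 0.6}                 # P(x | w_j) at the observed x
evidence    = sum(likelihoods[w] * priors[w] for w in priors)

# Posteriors: P(w_j | x) = P(x | w_j) P(w_j) / P(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}

decision = max(posteriors, key=posteriors.get)       # Bayes (minimum-error) decision
p_error  = min(posteriors.values())                  # P(error | x) = min posterior
# here: P(w1|x) = 0.4375, P(w2|x) = 0.5625 -> decide w2, P(error|x) = 0.4375
```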
Bayesian Decision Theory – Continuous Features
• Allowing actions other than classification primarily allows the possibility of rejection
– Rejection in the sense of abstention
– Don't make a decision if the alternatives are too close
– This must be tempered by the cost of indecision
• Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories")
Overall risk
The overall risk is the expected loss associated with a given decision rule; it is minimized by minimizing the conditional risk of each action αi:
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$$
where λ(αi | ωj) is the loss incurred for taking action αi when the true state of nature is ωj.
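A generic sketch of choosing the action with minimum conditional risk, given an assumed loss matrix λ[i][j] = λ(αi | ωj) and assumed posteriors; all numbers are illustrative.

```python
import numpy as np

# Assumed example: 2 actions x 3 states; loss[i, j] = lambda(alpha_i | w_j).
loss       = np.array([[0.0, 1.0, 2.0],
                       [3.0, 0.0, 1.0]])
posteriors = np.array([0.5, 0.3, 0.2])      # P(w_j | x), must sum to 1

# Conditional risks R(alpha_i | x) = sum_j loss[i, j] * P(w_j | x)
risks = loss @ posteriors                   # -> [0.7, 1.7]
best_action = int(np.argmin(risks))         # choose the minimum-risk action
```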
• Two-category classification
α1: deciding ω1
α2: deciding ω2
λij = λ(αi | ωj): the loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
$$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$$
$$R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$$
Our rule is the following:
if R(α1 | x) < R(α2 | x), take action α1: "decide ω1"
This is equivalent to
$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid x)$$
Finally, using Bayes' formula we can rewrite this: decide ω1 if
$$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
i.e. a likelihood-ratio test against a threshold that does not depend on x.
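A quick numerical check (invented losses, priors, and likelihoods) that the risk comparison and the likelihood-ratio form of the rule agree.

```python
# Two-category example with made-up numbers.
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0    # losses lambda_ij
P1, P2 = 0.6, 0.4                          # priors P(w1), P(w2)
px_w1, px_w2 = 0.3, 0.5                    # P(x | w1), P(x | w2) at the observed x

post1 = px_w1 * P1 / (px_w1 * P1 + px_w2 * P2)
post2 = 1.0 - post1

# Risk form: decide w1 if R(alpha1 | x) < R(alpha2 | x)
R1 = l11 * post1 + l12 * post2
R2 = l21 * post1 + l22 * post2
decide_w1_risk = R1 < R2

# Likelihood-ratio form: decide w1 if P(x|w1)/P(x|w2) > theta
theta = (l12 - l22) / (l21 - l11) * (P2 / P1)
decide_w1_ratio = (px_w1 / px_w2) > theta

assert decide_w1_risk == decide_w1_ratio    # the two rules agree
```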
Minimum-Error-Rate Classification
• Introduction of the zero-one loss function:
$$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \dots, c$$
Therefore, the conditional risk becomes:
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$
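Under the zero-one loss the minimum-risk action is simply the maximum-posterior class; a short check with assumed posteriors:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])              # assumed P(w_j | x)
zero_one   = 1.0 - np.eye(3)                        # lambda(a_i, w_j) = 0 if i==j else 1

risks = zero_one @ posteriors                       # R(a_i | x) = 1 - P(w_i | x)
assert np.argmin(risks) == np.argmax(posteriors)    # min risk == max posterior
```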
• The Bayes decision rule depends on minimizing risk
Let
$$\theta_\lambda = \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
then decide ω1 if:
$$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \theta_\lambda$$
• Let gi(x) = −R(αi | x)
(maximum discriminant corresponds to minimum risk!)
• For the two-category case we can use a single discriminant, e.g.
$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$
or, equivalently (same sign, hence the same decision),
$$g(x) = \ln\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
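A small check with assumed values that the two expressions for g(x), while not numerically equal, have the same sign and therefore induce the same decision.

```python
import math

P1, P2 = 0.4, 0.6                  # assumed priors
px_w1, px_w2 = 0.7, 0.2            # assumed class-conditional densities at x

post1 = px_w1 * P1 / (px_w1 * P1 + px_w2 * P2)
g_posterior = post1 - (1.0 - post1)                          # P(w1|x) - P(w2|x)
g_log_ratio = math.log(px_w1 / px_w2) + math.log(P1 / P2)    # ln LR + ln prior ratio

assert (g_posterior > 0) == (g_log_ratio > 0)   # same sign -> same decision
```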
The Normal Density
• Univariate density
– Density which is analytically tractable
– Continuous density
– A lot of processes are asymptotically Gaussian
– Handwritten characters and speech sounds can be seen as ideal prototypes corrupted by a random process (central limit theorem)
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]$$
where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
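A direct transcription of the univariate density into code (nothing assumed beyond the formula above):

```python
import math

def univariate_normal(x, mu, sigma):
    """P(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)**2)"""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

p_at_mean = univariate_normal(0.0, mu=0.0, sigma=1.0)   # density at the mean ≈ 0.3989
```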
• Multivariate density
$$P(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left[-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right]$$
where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
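The multivariate density transcribed into numpy; x, mu, and Sigma below are arbitrary example values.

```python
import numpy as np

def multivariate_normal(x, mu, Sigma):
    """P(x) = exp(-0.5 (x-mu)^t Sigma^{-1} (x-mu)) / ((2*pi)^(d/2) |Sigma|^(1/2))"""
    d = len(x)
    diff = x - mu
    mahal = diff @ np.linalg.inv(Sigma) @ diff
    norm  = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahal) / norm

x     = np.array([1.0, 2.0])
mu    = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(multivariate_normal(x, mu, Sigma))
```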
Discriminant Functions for the Normal Density
• The minimum-error-rate discriminant gi(x) = ln P(x | ωi) + ln P(ωi) applied to the multivariate normal density gives:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Case Σi = σ²I (I stands for the identity matrix)
• What does "Σi = σ²I" say about the dimensions? (The features are statistically independent.)
• What about the variance of each dimension? (Each feature has the same variance σ².)
Note: both |Σi| = σ²ᵈ and the (d/2) ln 2π term are independent of i in
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Thus we can simplify to:
$$g_i(x) = -\frac{\lVert x-\mu_i \rVert^{2}}{2\sigma^{2}} + \ln P(\omega_i)$$
where ‖·‖ denotes the Euclidean norm.
• We can further simplify by recognizing that the quadratic term xᵗx implicit in the Euclidean norm is the same for all i, so it can be dropped. This yields the linear discriminant functions
$$g_i(x) = w_i^{t} x + w_{i0}, \qquad w_i = \frac{\mu_i}{\sigma^{2}}, \qquad w_{i0} = -\frac{1}{2\sigma^{2}}\,\mu_i^{t}\mu_i + \ln P(\omega_i)$$
• A classifier that uses linear discriminant functions is called a "linear machine"
• The decision boundary between classes i and j is where gi(x) = gj(x); here it is a hyperplane through the point
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^{2}}{\lVert \mu_i - \mu_j \rVert^{2}} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$
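A sketch of the linear machine for the Σi = σ²I case: the weights wi, offsets wi0, and the boundary point x0 between two classes, with invented means, σ², and priors.

```python
import numpy as np

sigma2 = 1.5                                   # common variance sigma^2 (assumed)
means  = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.6, 0.4]

# Linear discriminants g_i(x) = w_i^t x + w_i0
w  = [mu / sigma2 for mu in means]
w0 = [-(mu @ mu) / (2 * sigma2) + np.log(P) for mu, P in zip(means, priors)]

def g(i, x):
    return w[i] @ x + w0[i]

# Point x0 on the decision boundary between classes i=0 and j=1:
mu_i, mu_j = means
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum((mu_i - mu_j) ** 2) \
     * np.log(priors[0] / priors[1]) * (mu_i - mu_j)

# Sanity check: the two discriminants are equal at x0.
assert np.isclose(g(0, x0), g(1, x0))
```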
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
The hyperplane separating Ri and Rj has the equation
$$w^{t}(x - x_0) = 0$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
and
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^{t}\,\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$
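The same kind of sketch for the shared-covariance case: w and x0 computed with invented Σ, means, and priors, plus a check that x0 lies on the separating hyperplane.

```python
import numpy as np

Sigma  = np.array([[1.0, 0.3],
                   [0.3, 2.0]])                  # shared covariance (assumed)
mu_i   = np.array([0.0, 0.0])
mu_j   = np.array([2.0, 1.0])
P_i, P_j = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w  = Sigma_inv @ (mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) \
     / ((mu_i - mu_j) @ Sigma_inv @ (mu_i - mu_j)) * (mu_i - mu_j)

def g(x, mu, P):
    # discriminant for identical covariances; the x^t Sigma^{-1} x term is common
    # to all classes, so keeping it does not change the comparison
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff + np.log(P)

assert np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j))       # x0 is on the boundary

# For any test point, the side of the hyperplane matches the discriminant comparison.
x_test = np.array([0.5, 0.2])
assert (w @ (x_test - x0) > 0) == (g(x_test, mu_i, P_i) > g(x_test, mu_j, P_j))
```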
$$P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\, P(\omega_j)}{P(x)}$$
where
$$P(x) = \sum_{j=1}^{c} P(x \mid \omega_j)\, P(\omega_j)$$
• Independent binary features: with pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2),
$$P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} (1-p_i)^{1-x_i}$$
and
$$P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i} (1-q_i)^{1-x_i}$$
This leads to a linear discriminant
$$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$$
where:
$$w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}, \qquad i = 1, \dots, d$$
and:
$$w_0 = \sum_{i=1}^{d} \ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0.
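A sketch of the independent-binary-features discriminant: the weights computed from assumed pi and qi, then g(x) evaluated for one binary feature vector.

```python
import numpy as np

# Assumed probabilities of each binary feature being 1 under the two classes.
p = np.array([0.8, 0.6, 0.3])        # p_i = P(x_i = 1 | w1)
q = np.array([0.4, 0.5, 0.7])        # q_i = P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5                    # priors

w  = np.log(p * (1 - q) / (q * (1 - p)))                   # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # bias w_0

x = np.array([1, 0, 1])              # an observed binary feature vector
g = w @ x + w0
decision = "w1" if g > 0 else "w2"
```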