PATTERN RECOGNITION
Sergios Theodoridis, Konstantinos Koutroumbas
Version 2
Typical application areas
Machine vision
Character recognition (OCR)
Computer aided diagnosis
Speech recognition
Face recognition
Biometrics
Image Data Base retrieval
Data mining
Bioinformatics
Features: These are measurable quantities obtained from
the patterns, and the classification task is based on their
respective values.
An example:
The classifier consists of a set of functions whose values, computed at $x$, determine the class to which the corresponding pattern belongs.

The basic stages of a pattern recognition system:
patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised – unsupervised pattern recognition:
The two major directions
Supervised: Patterns whose class is known a-priori
are used for training.
Unsupervised: The number of classes is (in general)
unknown and no training patterns are available.
CLASSIFIERS BASED ON BAYES DECISION THEORY

Assign the pattern represented by the feature vector $x$ to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$. That is, $x \to \omega_i$ for which $P(\omega_i \mid x)$ is maximum.
Computation of a-posteriori probabilities
Assume known:
• the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
• the class-conditional pdfs $p(x \mid \omega_i),\ i = 1, 2, \ldots, M$
The Bayes rule (M = 2)
$$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i) \;\Rightarrow\; P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$
where
$$p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$
The Bayes classification rule (for two classes, M = 2)
Given $x$, classify it according to the rule
If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \to \omega_1$
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \to \omega_2$
Equivalently: $p(x \mid \omega_1)\,P(\omega_1) \gtrless p(x \mid \omega_2)\,P(\omega_2)$
and, for equiprobable classes, $p(x \mid \omega_1) \gtrless p(x \mid \omega_2)$
(Figure: the two regions $R_1$ (class $\omega_1$) and $R_2$ (class $\omega_2$) defined by the threshold $x_0$.)
Equivalently, in words: divide the space into two regions
If $x \in R_1$, decide $x \in \omega_1$
If $x \in R_2$, decide $x \in \omega_2$

Probability of error (the total shaded area), for equiprobable classes:
$$P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$

For the general M-class case, assign $x$ to $\omega_i$ if
$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$$
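As a quick illustration of the rule above, here is a minimal sketch in Python, assuming 1-D Gaussian class-conditional densities with priors chosen purely for illustration (they are not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumed values).
priors = {"w1": 0.5, "w2": 0.5}
pdfs = {"w1": norm(loc=0.0, scale=1.0), "w2": norm(loc=2.0, scale=1.0)}

def bayes_classify(x):
    # Assign x to the class with the larger a-posteriori probability,
    # i.e. the larger p(x|w_i) * P(w_i), since p(x) is common to both classes.
    scores = {w: pdfs[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(bayes_classify(0.3))   # closer to the w1 density -> 'w1'
print(bayes_classify(1.8))   # closer to the w2 density -> 'w2'
```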
Minimizing the average risk. For M = 2:
• Define the loss matrix
$$L = \begin{pmatrix}\lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\end{pmatrix}$$
• $\lambda_{12}$ is the loss (penalty) for deciding $\omega_2$ when the pattern belongs to $\omega_1$, and similarly for the other terms.
Risk with respect to $\omega_2$:
$$r_2 = \lambda_{21}\int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22}\int_{R_2} p(x \mid \omega_2)\,dx$$
(similarly, $r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\,dx$)
Average risk:
$$r = r_1 P(\omega_1) + r_2 P(\omega_2)$$
Choose $R_1$ and $R_2$ so that $r$ is minimized. Then assign $x$ to $\omega_1$ if
$$l_1 \equiv \lambda_{11}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{21}\,p(x \mid \omega_2)P(\omega_2) \;<\; l_2 \equiv \lambda_{12}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{22}\,p(x \mid \omega_2)P(\omega_2)$$
Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if
$$l_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\;(<)\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$$
$l_{12}$: likelihood ratio
If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$$x \to \omega_1 \;\text{ if }\; p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)$$
$$x \to \omega_2 \;\text{ if }\; p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)$$
If $\lambda_{21} = \lambda_{12}$, this is the minimum classification error probability rule.
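A sketch of the minimum-average-risk rule with a loss matrix, using illustrative Gaussian likelihoods and an assumed loss matrix (not the slides' values):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup: Gaussian likelihoods, equal priors, and a loss
# matrix L[i, j] = lambda_ij = loss of deciding w_(j+1) when w_(i+1) is true.
pdf1, pdf2 = norm(0.0, 1.0), norm(2.0, 1.0)
P1, P2 = 0.5, 0.5
L = np.array([[0.0, 1.0],
              [2.0, 0.0]])   # misclassifying an omega_2 pattern costs twice as much

def min_risk_classify(x):
    # Expected losses l_1, l_2 of deciding w1 / w2 at x (up to the common p(x)).
    post = np.array([pdf1.pdf(x) * P1, pdf2.pdf(x) * P2])
    l1 = L[0, 0] * post[0] + L[1, 0] * post[1]   # decide w1
    l2 = L[0, 1] * post[0] + L[1, 1] * post[1]   # decide w2
    return "w1" if l1 < l2 else "w2"

# The larger lambda_21 pulls the decision boundary toward the w1 mean.
print([min_risk_classify(x) for x in (0.5, 0.9, 1.5)])
```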
An example:
$$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}}\exp(-x^2), \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}}\exp\!\big(-(x-1)^2\big)$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}, \qquad L = \begin{pmatrix}0 & 0.5\\ 1.0 & 0\end{pmatrix}$$
Then the threshold value $x_0$ for minimum $P_e$:
$$x_0:\ \exp(-x^2) = \exp\!\big(-(x-1)^2\big) \;\Rightarrow\; x_0 = \frac{1}{2}$$
Threshold $\hat{x}_0$ for minimum $r$:
$$\hat{x}_0:\ \exp(-x^2) = 2\exp\!\big(-(x-1)^2\big) \;\Rightarrow\; \hat{x}_0 = \frac{1-\ln 2}{2}$$
Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$. (Why? Because $\lambda_{21} > \lambda_{12}$: misclassifying an $\omega_2$ pattern is penalized more heavily, so the region assigned to $\omega_2$ is enlarged.)
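The two thresholds of this example can be checked numerically; the following sketch simply locates them on a fine grid:

```python
import numpy as np

# Numerical check of the thresholds in the example above:
# p(x|w1) ~ exp(-x^2), p(x|w2) ~ exp(-(x-1)^2), equal priors,
# loss matrix with lambda_12 = 0.5, lambda_21 = 1.0.
x = np.linspace(-2.0, 3.0, 500_001)
p1 = np.exp(-x**2)
p2 = np.exp(-(x - 1)**2)

# Minimum-error threshold: p(x|w1) = p(x|w2)
x0 = x[np.argmin(np.abs(p1 - p2))]
# Minimum-risk threshold: p(x|w1) = 2 p(x|w2)  (since lambda_21 / lambda_12 = 2)
x0_hat = x[np.argmin(np.abs(p1 - 2 * p2))]

print(x0, 0.5)                      # ~0.5
print(x0_hat, (1 - np.log(2)) / 2)  # ~0.1534
```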
DISCRIMINANT FUNCTIONS
DECISION SURFACES

If $R_i$, $R_j$ are contiguous: $g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$
$R_i:\ P(\omega_i \mid x) > P(\omega_j \mid x)$ (the "+" side of the decision surface $g(x) = 0$)
$R_j:\ P(\omega_j \mid x) > P(\omega_i \mid x)$ (the "−" side)

If $f(\cdot)$ is monotonically increasing, the rule remains the same if we use the discriminant functions $g_i(x) \equiv f\big(P(\omega_i \mid x)\big)$.
BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian class-conditional pdfs:
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma_i|^{1/2}}\exp\!\Big(-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)\Big)$$
where
$\mu_i = E[x]$ is the mean vector of class $\omega_i$
$\Sigma_i = E\big[(x-\mu_i)(x-\mu_i)^T\big]$ is the $\ell \times \ell$ covariance matrix of class $\omega_i$
$\ln(\cdot)$ is monotonic. Define:
$$g_i(x) = \ln\big(p(x \mid \omega_i)P(\omega_i)\big) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i) + C_i$$
$$C_i = -\frac{\ell}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|$$
Example: $\Sigma_i = \begin{pmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{pmatrix}$
$$g_i(x) = -\frac{1}{2\sigma^2}(x_1^2 + x_2^2) + \frac{1}{\sigma^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$$
Decision Hyperplanes

Quadratic terms $x^T\Sigma^{-1}x$: if $\Sigma_i = \Sigma$ for all classes, these terms are the same in every $g_i(x)$ and can be dropped. Then:
$$g_i(x) = w_i^T x + w_{i0}$$
$$w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i$$
The discriminant functions are LINEAR.
Let, in addition, $\Sigma = \sigma^2 I$. Then
$$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}$$
and the decision surface between $\omega_i$ and $\omega_j$ is
$$g_{ij}(x) \equiv g_i(x) - g_j(x) = w^T(x - x_0) = 0$$
$$w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\!\Big(\frac{P(\omega_i)}{P(\omega_j)}\Big)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$$
Non-diagonal $\Sigma$:
$$g_{ij}(x) = w^T(x - x_0) = 0$$
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\!\Big(\frac{P(\omega_i)}{P(\omega_j)}\Big)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^{2}}$$
where $\|x\|_{\Sigma^{-1}} \equiv \big(x^T\Sigma^{-1}x\big)^{1/2}$.
The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, not normal to $\mu_i - \mu_j$.
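A small sketch of how $w$ and $x_0$ could be computed for the shared-covariance case; the means, covariance and priors below are illustrative values, not taken from the slides:

```python
import numpy as np

# Sketch of the decision hyperplane g_ij(x) = w^T (x - x0) = 0 between two
# Gaussian classes sharing the covariance matrix Sigma (illustrative values).
mu_i = np.array([0.0, 0.0])
mu_j = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
P_i, P_j = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w = Sigma_inv @ diff                    # normal to the hyperplane
norm_sq = diff @ Sigma_inv @ diff       # ||mu_i - mu_j||^2 in the Sigma^-1 norm
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) * diff / norm_sq

def g_ij(x):
    # g_ij(x) > 0  <=>  g_i(x) > g_j(x)  <=>  decide omega_i
    return w @ (np.asarray(x) - x0)

x_test = np.array([0.5, 0.3])
print("omega_i" if g_ij(x_test) > 0 else "omega_j")
```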
Minimum Distance Classifiers

For equiprobable classes, $P(\omega_i) = \frac{1}{M}$, and a common covariance matrix:
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)$$
(ignoring class-independent constants)
• $\Sigma = \sigma^2 I$: assign $x \to \omega_i$ with the smaller Euclidean distance $d_E = \|x - \mu_i\|$
• $\Sigma \neq \sigma^2 I$: assign $x \to \omega_i$ with the smaller Mahalanobis distance $d_m = \big((x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\big)^{1/2}$
Example:
Given $\omega_1$, $\omega_2$ with $P(\omega_1) = P(\omega_2)$, $p(x \mid \omega_1) = N(\mu_1, \Sigma)$, $p(x \mid \omega_2) = N(\mu_2, \Sigma)$,
$$\mu_1 = \begin{pmatrix}0\\0\end{pmatrix}, \quad \mu_2 = \begin{pmatrix}3\\3\end{pmatrix}, \quad \Sigma = \begin{pmatrix}1.1 & 0.3\\ 0.3 & 1.9\end{pmatrix}$$
classify the vector $x = \begin{pmatrix}1.0\\2.2\end{pmatrix}$ using Bayesian classification.
$$\Sigma^{-1} = \begin{pmatrix}0.95 & -0.15\\ -0.15 & 0.55\end{pmatrix}$$
Compute the Mahalanobis distances $d_m$ from $\mu_1$, $\mu_2$:
$$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix}1.0\\2.2\end{pmatrix} = 2.952, \qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix}-2.0\\-0.8\end{pmatrix} = 3.672$$
Hence classify $x$ to $\omega_1$ (the smaller Mahalanobis distance).
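The example's numbers can be reproduced with a few lines of numpy:

```python
import numpy as np

# Reproducing the numbers of the example above.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
x = np.array([1.0, 2.2])

Sigma_inv = np.linalg.inv(Sigma)
d2_1 = (x - mu1) @ Sigma_inv @ (x - mu1)
d2_2 = (x - mu2) @ Sigma_inv @ (x - mu2)

print(Sigma_inv)        # [[0.95, -0.15], [-0.15, 0.55]]
print(d2_1, d2_2)       # 2.952, 3.672  ->  classify x to omega_1
```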
Maximum Likelihood Parameter Estimation

Let $x_1, x_2, \ldots, x_N$ be independent samples and $\theta$ the unknown parameter vector. Then
$$p(X;\theta) \equiv p(x_1, x_2, \ldots, x_N;\theta) = \prod_{k=1}^{N} p(x_k;\theta)$$
$$L(\theta) \equiv \ln \prod_{k=1}^{N} p(x_k;\theta)$$
$$\hat{\theta}_{ML}:\ \frac{\partial L(\theta)}{\partial\theta} = \sum_{k=1}^{N}\frac{1}{p(x_k;\theta)}\frac{\partial p(x_k;\theta)}{\partial\theta} = 0$$
If, indeed, there is a $\theta_0$ such that $p(x) = p(x;\theta_0)$, then
$$\lim_{N\to\infty} E[\hat{\theta}_{ML}] = \theta_0 \qquad \text{(asymptotically unbiased)}$$
$$\lim_{N\to\infty} E\big[\|\hat{\theta}_{ML} - \theta_0\|^2\big] = 0 \qquad \text{(asymptotically consistent in the mean-square sense)}$$
Example: $p(x): N(\mu, \Sigma)$ with $\mu$ unknown; given samples $x_1, x_2, \ldots, x_N$, $p(x_k) \equiv p(x_k;\mu)$:
$$p(x_k;\mu) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\frac{1}{2}(x_k-\mu)^T\Sigma^{-1}(x_k-\mu)\Big)$$
$$L(\mu) = \ln\prod_{k=1}^{N} p(x_k;\mu) = C - \frac{1}{2}\sum_{k=1}^{N}(x_k-\mu)^T\Sigma^{-1}(x_k-\mu)$$
$$\frac{\partial L(\mu)}{\partial\mu} = \sum_{k=1}^{N}\Sigma^{-1}(x_k - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N}x_k$$
Remember: if $A = A^T$, then $\dfrac{\partial(x^T A x)}{\partial x} = 2Ax$.
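A minimal numerical sketch of this result (the ML estimate of a Gaussian mean is the sample mean), on synthetic data with an assumed true mean:

```python
import numpy as np

# The ML estimate of the mean of a Gaussian is the sample mean, as derived above.
# Synthetic data with an assumed "true" mean, purely for illustration.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
X = rng.multivariate_normal(true_mu, Sigma, size=1000)   # rows are samples x_k

mu_ml = X.mean(axis=0)      # (1/N) * sum_k x_k
print(mu_ml)                # close to [1.0, -2.0]
```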
Maximum A Posteriori Probability Estimation

In the ML method, $\theta$ was considered a fixed, unknown parameter.
Here we look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known.
Given $X = \{x_1, x_2, \ldots, x_N\}$, compute the maximum of $p(\theta \mid X)$.
From the Bayes theorem
$$p(\theta)\,p(X \mid \theta) = p(X)\,p(\theta \mid X), \quad\text{or}\quad p(\theta \mid X) = \frac{p(\theta)\,p(X \mid \theta)}{p(X)}$$
The method:
$$\hat{\theta}_{MAP}:\ \frac{\partial}{\partial\theta}\big(p(\theta)\,p(X \mid \theta)\big) = 0$$
Example: let $p(x_k \mid \mu) \sim N(\mu, \sigma^2 I)$ with $\mu$ unknown, and assume the Gaussian prior $p(\mu) \sim N(\mu_0, \sigma_\mu^2 I)$. Then
$$\hat{\mu}_{MAP}:\ \frac{\partial}{\partial\mu}\ln\Big(\prod_{k=1}^{N}p(x_k \mid \mu)\,p(\mu)\Big) = 0 \quad\text{or}\quad \sum_{k=1}^{N}\frac{1}{\sigma^2}(x_k - \hat{\mu}) - \frac{1}{\sigma_\mu^2}(\hat{\mu} - \mu_0) = 0$$
$$\hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_\mu^2}{\sigma^2}\sum_{k=1}^{N}x_k}{1 + \frac{\sigma_\mu^2}{\sigma^2}N}$$
For $\frac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$:
$$\hat{\mu}_{MAP} \simeq \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N}x_k$$
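A sketch of the MAP estimate just derived, under the stated Gaussian assumptions; all numerical values below are illustrative:

```python
import numpy as np

# MAP estimate of a Gaussian mean, assuming p(x_k|mu) = N(mu, sigma^2 I) and a
# Gaussian prior p(mu) = N(mu_0, sigma_mu^2 I). All values are illustrative.
rng = np.random.default_rng(1)
sigma2, sigma_mu2 = 1.0, 0.5
mu_0 = np.array([0.0, 0.0])
true_mu = np.array([2.0, -1.0])
X = rng.normal(loc=true_mu, scale=np.sqrt(sigma2), size=(20, 2))

ratio = sigma_mu2 / sigma2
mu_map = (mu_0 + ratio * X.sum(axis=0)) / (1 + ratio * len(X))
mu_ml = X.mean(axis=0)

# With few samples the MAP estimate is pulled toward the prior mean mu_0;
# as N grows (or for sigma_mu^2 / sigma^2 >> 1) it approaches the ML estimate.
print(mu_map, mu_ml)
```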
Bayesian Inference

$$p(x \mid X) = \int p(x \mid \theta)\,p(\theta \mid X)\,d\theta$$
$$p(\theta \mid X) = \frac{p(X \mid \theta)\,p(\theta)}{p(X)}, \qquad p(X) = \int p(X \mid \theta)\,p(\theta)\,d\theta$$
$$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$$
In the Gaussian case (unknown mean with a Gaussian prior), the posterior $p(\mu \mid X)$ is a sequence of Gaussians whose variance shrinks to zero as $N \to \infty$.
Maximum Entropy

Entropy: $H = -\int p(x)\ln p(x)\,dx$
Example: $p(x)$ is nonzero only in the interval $x_1 \le x \le x_2$.
• The constraint: $\int_{x_1}^{x_2} p(x)\,dx = 1$
• Lagrange multipliers: $H_L = -\int_{x_1}^{x_2} p(x)\ln p(x)\,dx + \lambda\Big(\int_{x_1}^{x_2}p(x)\,dx - 1\Big)$
• Setting the derivative with respect to $p(x)$ to zero gives $\hat{p}(x) = \exp(\lambda - 1)$, a constant.
• Imposing the constraint:
$$\hat{p}(x) = \begin{cases}\dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2\\[4pt] 0, & \text{otherwise}\end{cases}$$
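A small numerical illustration of this result: among densities supported on $[x_1, x_2]$, the uniform one attains the largest entropy. The comparison density below is an arbitrary choice:

```python
import numpy as np

# Compare the entropy of the uniform density on [x1, x2] with an arbitrary
# alternative density on the same interval (a symmetric triangular one).
x1, x2 = 0.0, 2.0
x = np.linspace(x1, x2, 100_001)
dx = x[1] - x[0]

def entropy(p):
    # H = -integral p(x) ln p(x) dx, approximated by a Riemann sum
    return -np.sum(p * np.log(p)) * dx

uniform = np.full_like(x, 1.0 / (x2 - x1))
triangular = (1.0 - np.abs(x - 1.0)) + 1e-12   # peaked at the middle, integrates to 1

print(entropy(uniform))      # ln(2) ~ 0.693
print(entropy(triangular))   # smaller, ~0.5
```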
Mixture Models

$$p(x) = \sum_{j=1}^{J} p(x \mid j)\,P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x \mid j)\,dx = 1$$
The parameters are estimated via the EM algorithm.
• General formulation:
  – $y$: the complete data set, $y \in Y \subseteq R^m$, with pdf $p_y(y;\theta)$; the $y$'s are not observed directly.
  – We observe $x = g(y) \in X_{ob} \subseteq R^l$, $l < m$, with pdf $p_x(x;\theta)$.
• The ML estimate based on the complete data would be
$$\hat{\theta}_{ML}:\ \sum_k \frac{\partial\ln p_y(y_k;\theta)}{\partial\theta} = 0$$
but the $y_k$'s are not available.
The (EM) algorithm:
• E-step: $Q(\theta;\theta(t)) = E\Big[\sum_k \ln p_y(y_k;\theta)\,\Big|\,X;\theta(t)\Big]$
• M-step: $\theta(t+1):\ \dfrac{\partial Q(\theta;\theta(t))}{\partial\theta} = 0$

Application to the mixture modeling problem:
• Complete data: $(x_k, j_k)$, $k = 1, 2, \ldots, N$
• Observed data: $x_k$, $k = 1, 2, \ldots, N$
• $p(x_k, j_k;\theta) = p(x_k \mid j_k;\theta)\,P_{j_k}$
• Assuming mutual independence:
$$L(\theta) = \sum_{k=1}^{N}\ln\big(p(x_k \mid j_k;\theta)\,P_{j_k}\big)$$
• Unknown parameters: $\Theta^T = [\theta^T, P^T]$, $P = [P_1, P_2, \ldots, P_J]^T$
• E-step:
$$Q(\Theta;\Theta(t)) = E\Big[\sum_{k=1}^{N}\ln\big(p(x_k \mid j_k;\theta)P_{j_k}\big)\Big] = \sum_{k=1}^{N}\sum_{j_k=1}^{J}P\big(j_k \mid x_k;\Theta(t)\big)\ln\big(p(x_k \mid j_k;\theta)P_{j_k}\big)$$
• M-step:
$$\frac{\partial Q}{\partial\theta} = 0, \qquad \frac{\partial Q}{\partial P_{j_k}} = 0, \quad j_k = 1, 2, \ldots, J$$
where
$$P\big(j \mid x_k;\Theta(t)\big) = \frac{p(x_k \mid j;\Theta(t))\,P_j}{p(x_k;\Theta(t))}, \qquad p(x_k;\Theta(t)) = \sum_{j=1}^{J}p(x_k \mid j;\Theta(t))\,P_j$$
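A compact sketch of these E- and M-steps for a two-component 1-D Gaussian mixture; the data, initialization and number of iterations are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

# EM sketch for a two-component 1-D Gaussian mixture, following the
# E-step / M-step above (illustrative data and initialization).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Unknown parameters Theta = (means, standard deviations, mixing probabilities P_j)
mu = np.array([-1.0, 1.0])
sd = np.array([1.0, 1.0])
P = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: posterior P(j | x_k; Theta(t)) for each sample and component
    lik = np.stack([P[j] * norm.pdf(x, mu[j], sd[j]) for j in range(2)])  # (2, N)
    gamma = lik / lik.sum(axis=0, keepdims=True)

    # M-step: re-estimate Theta by maximizing Q(Theta; Theta(t))
    Nj = gamma.sum(axis=1)
    mu = (gamma * x).sum(axis=1) / Nj
    sd = np.sqrt((gamma * (x - mu[:, None]) ** 2).sum(axis=1) / Nj)
    P = Nj / len(x)

print(mu, sd, P)   # close to the generating values (-2, 3), (1, 1.5), (0.3, 0.7)
```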
Nonparametric Estimation

$$P \approx \frac{k_N}{N}$$
where $k_N$ is the number of the $N$ total points that fall inside the bin of length $h$ centered at $\hat{x}$.
$$\hat{p}(x) \approx \hat{p}(\hat{x}) = \frac{1}{h}\,\frac{k_N}{N}, \qquad |x - \hat{x}| \le \frac{h}{2}$$
Parzen Windows
Define
$$\phi(x_i) = \begin{cases}1, & |x_{ij}| \le \frac{1}{2} \ \text{ for every component } j\\ 0, & \text{otherwise}\end{cases}$$
• That is, it is 1 inside a unit-side hypercube centered at 0.
$$\hat{p}(x) = \frac{1}{h^l}\,\frac{1}{N}\sum_{i=1}^{N}\phi\Big(\frac{x_i - x}{h}\Big)$$
• $= \dfrac{1}{\text{volume}}\cdot\dfrac{1}{N}\cdot$ (number of points inside an $h$-side hypercube centered at $x$)
Behaviour as $h \to 0$:
• the width of $\phi\big(\frac{x'-x}{h}\big)$ goes to 0 while $\frac{1}{h^l} \to \infty$
• yet $\frac{1}{h^l}\int\phi\big(\frac{x'-x}{h}\big)\,dx' = 1$
• hence $\frac{1}{h^l}\phi\big(\frac{x'-x}{h}\big) \to \delta(x'-x)$, and
$$E[\hat{p}(x)] \to \int_{x'}\delta(x'-x)\,p(x')\,dx' = p(x)$$
(Figure: Parzen estimates for h = 0.1, N = 10000.)
If
• $h \to 0$
• $N \to \infty$
• $hN \to \infty$
the estimate is asymptotically unbiased and consistent.
The method (Parzen windows for classification):
• Remember:
$$l_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\gtrless\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$$
• Estimate each class-conditional pdf with a Parzen window over that class's own training points and plug the estimates into the likelihood ratio:
$$\frac{\dfrac{1}{N_1 h^l}\displaystyle\sum_{i=1}^{N_1}\phi\Big(\dfrac{x_i - x}{h}\Big)}{\dfrac{1}{N_2 h^l}\displaystyle\sum_{i=1}^{N_2}\phi\Big(\dfrac{x_i - x}{h}\Big)} \;\gtrless\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$$
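A minimal 1-D Parzen window sketch with the hypercube kernel defined above; the data and bandwidth are illustrative:

```python
import numpy as np

# Parzen window sketch in one dimension, using the hypercube (here: interval)
# kernel phi defined above. Data and bandwidth h are illustrative.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 5000)     # x_i, i = 1..N

def phi(u):
    # 1 inside the unit-length interval centered at 0, else 0
    return (np.abs(u) <= 0.5).astype(float)

def parzen_estimate(x, data, h):
    # p_hat(x) = (1 / (h * N)) * sum_i phi((x_i - x) / h)
    return phi((data - x) / h).sum() / (h * len(data))

for x in (0.0, 1.0, 2.0):
    true_pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(x, parzen_estimate(x, samples, h=0.2), true_pdf)
```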
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate. In practice, however, the number of points needed for a good estimate grows very fast with the dimensionality of the feature space.
NAIVE – BAYES CLASSIFIER
Assume the features are (treated as) mutually independent within each class:
$$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$$
so only $\ell$ one-dimensional pdfs need to be estimated per class.
k Nearest Neighbor Density Estimation
In Parzen:
• The volume is constant
• The number of points in the volume is varying
Now:
• Keep the number of points $k_N = k$ constant and let the volume $V(x)$ vary, so that
$$\hat{p}(x) = \frac{k}{N\,V(x)}$$
• The likelihood-ratio test then becomes
$$\frac{k/(N_1 V_1)}{k/(N_2 V_2)} = \frac{N_2 V_2}{N_1 V_1} \;\gtrless\; (\cdot)$$
The Nearest Neighbor Rule
• Choose $k$ out of the $N$ training vectors: identify the $k$ nearest ones to $x$.
• Out of these $k$, identify the number $k_i$ that belong to class $\omega_i$.
• Assign $x \to \omega_i:\ k_i > k_j \ \ \forall j \neq i$.
Asymptotic error bounds ($P_B$: the Bayesian error):
$$k \to \infty:\ P_{kNN} \to P_B, \qquad P_{NN} \le 2P_B, \qquad P_{3NN} \le P_B + 3(P_B)^2$$
Voronoi tessellation:
$$R_i = \{x:\ d(x, x_i) < d(x, x_j),\ i \neq j\}$$
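A minimal k-NN classifier sketch following this rule, with Euclidean distances and illustrative training data:

```python
import numpy as np

# Minimal k-nearest-neighbor classifier following the rule above
# (Euclidean distances; the training data below are illustrative).
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class omega_1
X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # class omega_2
X_train = np.vstack([X1, X2])
y_train = np.array([1] * 50 + [2] * 50)

def knn_classify(x, k=3):
    # Identify the k training vectors nearest to x and take a majority vote.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

print(knn_classify(np.array([0.5, 0.5])))   # -> 1
print(knn_classify(np.array([2.5, 3.0])))   # -> 2
```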
BAYESIAN NETWORKS
Bayes Probability Chain Rule:
$$p(x_1, x_2, \ldots, x_\ell) = p(x_\ell \mid x_{\ell-1},\ldots,x_1)\,p(x_{\ell-1} \mid x_{\ell-2},\ldots,x_1)\cdots p(x_2 \mid x_1)\,p(x_1)$$
Assume now that the conditional dependence of each $x_i$ is limited to a subset $A_i$ of the variables appearing in its conditioning. For example, if ℓ = 6, then we could assume:
$$p(x_6 \mid x_5,\ldots,x_1) = p(x_6 \mid x_5, x_4)$$
Then:
$$A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$$
A graphical way to portray conditional dependencies
is given below
According to this figure we
have that:
• x6 is conditionally dependent on
x4, x5.
• x5 on x4
• x4 on x1, x2
• x3 on x2
• x1, x2 are conditionally independent of the other variables.
For this example, the joint pdf factorizes as
$$p(x_1,\ldots,x_6) = p(x_6 \mid x_5, x_4)\,p(x_5 \mid x_4)\,p(x_4 \mid x_1, x_2)\,p(x_3 \mid x_2)\,p(x_1)\,p(x_2)$$
The figure below is an example of a Bayesian
Network corresponding to a paradigm from the
medical applications field.
This Bayesian network
models conditional
dependencies for an
example concerning
smokers (S),
tendencies to develop
cancer (C) and heart
disease (H), together
with variables
corresponding to heart
(H1, H2) and cancer
(C1, C2) medical tests.
Once a DAG has been constructed, the joint
probability can be obtained by multiplying the
marginal (root nodes) and the conditional (non-root
nodes) probabilities.
Example: Consider the Bayesian network of the
figure:
For example, to evaluate P(w0|x1), a set of calculations is required that propagates from node x to node w. It turns out that P(w0|x1) = 0.63.
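Since the figure's probability tables are not reproduced here, the sketch below performs the same kind of computation on a hypothetical two-node network x → w with made-up tables; it is not the network of the figure:

```python
# Hypothetical two-node Bayesian network x -> w with made-up tables
# P(x) and P(w | x), used only to illustrate the computation.
P_x = {1: 0.4, 0: 0.6}                      # marginal of the root node x
P_w_given_x = {1: {1: 0.9, 0: 0.1},         # P(w | x): outer key is x, inner is w
               0: {1: 0.2, 0: 0.8}}

# Joint probability = product of marginal (root) and conditional (non-root) terms.
def joint(x, w):
    return P_x[x] * P_w_given_x[x][w]

# Conditional query P(w = 0 | x = 1), obtained from the joint by normalization.
num = joint(1, 0)
den = joint(1, 0) + joint(1, 1)
print(num / den)    # 0.1 for these made-up tables
```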
Complexity:
For singly connected graphs, message passing
algorithms amount to a complexity linear in the
number of nodes.