Introduction To Pattern Recognition
Sergios Theodoridis, Konstantinos Koutroumbas
PATTERN RECOGNITION
❖ Typical application areas
➢ Machine vision
➢ Character recognition (OCR)
➢ Computer aided diagnosis
➢ Speech recognition
➢ Face recognition
➢ Biometrics
➢ Image database retrieval
➢ Data mining
➢ Bioinformatics
❖ The task: Assign unknown objects (patterns) to the correct class. This is known as classification.
An example:
➢ [Figure: the basic stages in the design of a classification system: feature selection, classifier design, system evaluation.]
❖ Assign the pattern, represented by the feature vector $x$, to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$. That is,
$$x \rightarrow \omega_i : \quad P(\omega_i|x) \text{ is maximum.}$$
➢ The quantities needed:
• the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
• $p(x|\omega_i),\ i = 1, 2, \ldots, M$; this is also known as the likelihood of $x$ with respect to $\omega_i$.
[Figure: the class-conditional pdfs $p(x|\omega_1)$ and $p(x|\omega_2)$ for a one-dimensional, two-class case.]
➢ If $x \in R_1$, $x$ is assigned to $\omega_1$; if $x \in R_2$, $x$ is assigned to $\omega_2$.
❖ Probability of error
➢ Total shaded area:
$$P_e = \int_{-\infty}^{x_0} p(x|\omega_2)\,dx + \int_{x_0}^{+\infty} p(x|\omega_1)\,dx$$
➢ Assign $x \rightarrow \omega_i$ if $P(\omega_i|x) > P(\omega_j|x)\ \ \forall j \neq i$.
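To illustrate this rule, the following minimal Python sketch computes the posteriors for a two-class, one-dimensional problem; the Gaussian class-conditional pdfs and the priors used here are assumed for the illustration and are not taken from the slides.

```python
import numpy as np
from scipy.stats import norm

# Assumed example setup: two 1-D Gaussian class-conditional pdfs and priors.
priors = np.array([0.5, 0.5])                      # P(w1), P(w2)
likelihoods = [norm(loc=0.0, scale=1.0),           # p(x|w1)
               norm(loc=2.0, scale=1.0)]           # p(x|w2)

def classify(x):
    """Assign x to the class with the maximum posterior P(wi|x)."""
    # Unnormalized posteriors p(x|wi) * P(wi); the evidence p(x) is a
    # common factor and does not affect the argmax.
    post = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
    return np.argmax(post) + 1, post / post.sum()

label, posteriors = classify(0.8)
print(f"x = 0.8 -> class w{label}, posteriors = {posteriors}")
```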
➢ For M = 2:
• Define the loss matrix
$$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$$
➢ Risk with respect to $\omega_1$:
$$r_1 = \lambda_{11}\int_{R_1} p(x|\omega_1)\,dx + \lambda_{12}\int_{R_2} p(x|\omega_1)\,dx$$
➢ Risk with respect to $\omega_2$:
$$r_2 = \lambda_{21}\int_{R_1} p(x|\omega_2)\,dx + \lambda_{22}\int_{R_2} p(x|\omega_2)\,dx$$
➢ Average risk:
$$r = r_1 P(\omega_1) + r_2 P(\omega_2)$$
➢ Minimizing the average risk leads to the likelihood ratio test: assign $x \rightarrow \omega_1$ ($\omega_2$) if
$$\ell_{12} \equiv \frac{p(x|\omega_1)}{p(x|\omega_2)} \;>\; (<)\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$$
$\ell_{12}$: likelihood ratio.
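A minimal sketch of this minimum-risk test, again with assumed one-dimensional Gaussian likelihoods and an illustrative loss matrix (not values from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed example setup.
P1, P2 = 0.5, 0.5                       # priors P(w1), P(w2)
p1 = norm(loc=0.0, scale=1.0)           # p(x|w1)
p2 = norm(loc=2.0, scale=1.0)           # p(x|w2)
lam = np.array([[0.0, 1.0],             # loss matrix L = [[l11, l12],
                [0.5, 0.0]])            #                  [l21, l22]]

def min_risk_decision(x):
    """Likelihood-ratio test minimizing the average risk."""
    l12 = p1.pdf(x) / p2.pdf(x)                                   # likelihood ratio
    threshold = (P2 / P1) * (lam[1, 0] - lam[1, 1]) / (lam[0, 1] - lam[0, 0])
    return 1 if l12 > threshold else 2

for x in (0.5, 1.0, 1.5):
    print(x, "-> w", min_risk_decision(x))
```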
➢ Thus $\hat{x}_0$ moves to the left of $x_0$ (WHY?)

DISCRIMINANT FUNCTIONS AND DECISION SURFACES
❖ If $R_i, R_j$ are contiguous: $g(x) \equiv P(\omega_i|x) - P(\omega_j|x) = 0$
$$R_i: P(\omega_i|x) > P(\omega_j|x) \qquad\qquad R_j: P(\omega_j|x) > P(\omega_i|x)$$
➢ The surface $g(x) = 0$ is the decision surface separating the two regions; $g(x) > 0$ (+) on the $R_i$ side and $g(x) < 0$ (−) on the $R_j$ side.
❖ $\ln(\cdot)$ is monotonic. Define:
$$g_i(x) = \ln\big(p(x|\omega_i)P(\omega_i)\big)$$
➢ Example: $\Sigma_i = \sigma^2 I = \begin{pmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{pmatrix}$
$$g_i(x) = -\frac{1}{2\sigma^2}(x_1^2 + x_2^2) + \frac{1}{\sigma^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$$
➢ For normal classes with equal covariance matrices, the discriminant functions are LINEAR:
$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1}\mu_i$$
➢ The decision hyperplane between two contiguous regions:
$$g_{ij}(x) = w^T(x - x_0) = 0$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\Big(\frac{P(\omega_i)}{P(\omega_j)}\Big)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^{2}}, \qquad \|x\|_{\Sigma^{-1}} \equiv (x^T\Sigma^{-1}x)^{\frac{1}{2}}$$
➢ The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, i.e., not normal to $\mu_i - \mu_j$.
➢ If $P(\omega_i) = \frac{1}{M}$ (equiprobable classes), the prior terms can be ignored and
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)$$
➢ $\Sigma = \sigma^2 I$: Assign $x \rightarrow \omega_i$ for which the Euclidean distance $d_E = \|x - \mu_i\|$ is smaller.
➢ $\Sigma \neq \sigma^2 I$: Assign $x \rightarrow \omega_i$ for which the Mahalanobis distance $d_m = \big((x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\big)^{\frac{1}{2}}$ is smaller.
❖ Example:
Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and $p(x|\omega_1) = N(\mu_1, \Sigma)$, $p(x|\omega_2) = N(\mu_2, \Sigma)$, where
$$\mu_1 = \begin{pmatrix}0\\0\end{pmatrix},\quad \mu_2 = \begin{pmatrix}3\\3\end{pmatrix},\quad \Sigma = \begin{pmatrix}1.1 & 0.3\\ 0.3 & 1.9\end{pmatrix},$$
classify the vector $x = \begin{pmatrix}1.0\\2.2\end{pmatrix}$ using Bayesian classification:
• $\Sigma^{-1} = \begin{pmatrix}0.95 & -0.15\\ -0.15 & 0.55\end{pmatrix}$
• Compute the Mahalanobis distances $d_m$ from $\mu_1, \mu_2$:
$$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix}1.0\\2.2\end{pmatrix} = 2.952, \qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix}-2.0\\-0.8\end{pmatrix} = 3.672$$
• Since $d_{m,1}^2 < d_{m,2}^2$, classify $x \rightarrow \omega_1$.
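The numbers in this example can be verified with a few lines of NumPy, using the values given above:

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
x = np.array([1.0, 2.2])

Sigma_inv = np.linalg.inv(Sigma)        # -> [[0.95, -0.15], [-0.15, 0.55]]

def mahalanobis_sq(x, mu, S_inv):
    d = x - mu
    return d @ S_inv @ d

d1 = mahalanobis_sq(x, mu1, Sigma_inv)  # 2.952
d2 = mahalanobis_sq(x, mu2, Sigma_inv)  # 3.672
print(Sigma_inv)
print(d1, d2)                           # smaller distance -> assign x to w1
```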
❖ Example (ML estimation of the mean of a Gaussian):
$p(x): N(\mu, \Sigma)$ with $\mu$ unknown; given $x_1, x_2, \ldots, x_N$, with $p(x_k) \equiv p(x_k; \mu)$ and
$$p(x_k; \mu) = \frac{1}{(2\pi)^{\ell/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x_k - \mu)^T\Sigma^{-1}(x_k - \mu)\Big)$$
$$L(\mu) = \ln\prod_{k=1}^{N} p(x_k; \mu) = C - \frac{1}{2}\sum_{k=1}^{N}(x_k - \mu)^T\Sigma^{-1}(x_k - \mu)$$
Compute the maximum of $L(\mu)$:
$$\frac{\partial L(\mu)}{\partial \mu} \equiv \Big[\frac{\partial L}{\partial \mu_1}, \ldots, \frac{\partial L}{\partial \mu_\ell}\Big]^T = \sum_{k=1}^{N}\Sigma^{-1}(x_k - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k$$
Remember: if $A = A^T$, then $\dfrac{\partial(x^T A x)}{\partial x} = 2Ax$.

❖ Maximum Aposteriori Probability (MAP) Estimation
➢ In the ML method, $\theta$ was considered as a (deterministic) parameter.
➢ Here we shall look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known.
➢ Given $X = \{x_1, x_2, \ldots, x_N\}$, compute the maximum of $p(\theta|X)$.
➢ From the Bayes theorem:
$$p(\theta)\,p(X|\theta) = p(X)\,p(\theta|X) \quad\text{or}\quad p(\theta|X) = \frac{p(\theta)\,p(X|\theta)}{p(X)}$$
➢ The method:
$$\hat{\theta}_{MAP}: \quad \frac{\partial}{\partial\theta}\big(p(\theta)\,p(X|\theta)\big) = 0$$
➢ If $p(\theta)$ is uniform or broad enough, $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$.
❖ Example:
$p(x|\mu) = N(\mu, \sigma^2 I)$ with $\mu$ unknown, $X = \{x_1, \ldots, x_N\}$, and a Gaussian prior
$$p(\mu) = \frac{1}{(2\pi)^{\ell/2}\sigma_\mu^{\ell}}\exp\Big(-\frac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\Big)$$
$$\text{MAP:}\quad \frac{\partial}{\partial\mu}\ln\Big(\prod_{k=1}^{N} p(x_k|\mu)\,p(\mu)\Big) = 0 \quad\text{or}\quad \sum_{k=1}^{N}\frac{1}{\sigma^2}(x_k - \hat{\mu}) - \frac{1}{\sigma_\mu^2}(\hat{\mu} - \mu_0) = 0$$
$$\hat{\mu}_{MAP} = \frac{\mu_0 + \dfrac{\sigma_\mu^2}{\sigma^2}\displaystyle\sum_{k=1}^{N} x_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2}N}$$
For $\sigma_\mu^2 \gg \sigma^2$, or for $N \rightarrow \infty$:
$$\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k$$

❖ Bayesian Inference
➢ ML and MAP compute a single estimate for $\theta$. Here a different route is followed.
➢ Given: $X = \{x_1, \ldots, x_N\}$, $p(x|\theta)$ and $p(\theta)$.
➢ The goal: estimate $p(x|X)$.
➢ How??
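Returning to the MAP example above, a small numerical sketch compares $\hat{\mu}_{MAP}$ and $\hat{\mu}_{ML}$; the data are synthetic and the values of $\sigma$, $\sigma_\mu$ and $\mu_0$ are assumed for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true, sigma = 2.0, 1.0        # data model N(mu, sigma^2), mu unknown
mu0, sigma_mu = 0.0, 3.0         # assumed Gaussian prior on mu: N(mu0, sigma_mu^2)

for N in (5, 50, 5000):
    x = rng.normal(mu_true, sigma, size=N)
    mu_ml = x.mean()                                     # ML estimate
    ratio = sigma_mu**2 / sigma**2
    mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * N)   # MAP formula from above
    print(f"N={N:5d}  ML={mu_ml:.4f}  MAP={mu_map:.4f}")
# As N grows (or for a broad prior, sigma_mu >> sigma), MAP approaches ML.
```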
❖ Maximum Entropy example: $p(x)$ is nonzero only in the interval $x_1 \le x \le x_2$, subject to the constraint
$$\int_{x_1}^{x_2} p(x)\,dx = 1$$
• Lagrange multipliers:
$$H_L = H + \lambda\Big(\int_{x_1}^{x_2} p(x)\,dx - 1\Big)$$
• Maximizing $H_L$ gives $\hat{p}(x) = \exp(\lambda - 1)$, hence
$$\hat{p}(x) = \begin{cases}\dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2\\[4pt] 0, & \text{otherwise}\end{cases}$$

❖ Mixture Models
$$p(x) = \sum_{j=1}^{J} p(x|j)\,P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x|j)\,dx = 1$$
➢ Assume parametric modeling, i.e., $p(x|j; \theta)$.
➢ The goal is to estimate $\theta$ and $P_1, P_2, \ldots, P_J$, given a set of samples $X = \{x_1, x_2, \ldots, x_N\}$.
➢ Why not ML, as before?
$$\max_{\theta, P_1, \ldots, P_J}\ \prod_{k=1}^{N} p(x_k; \theta, P_1, \ldots, P_J)$$
➢ This is a nonlinear problem, due to the missing label information. This is a typical problem with an incomplete data set.
• General formulation: let $y$ be the complete data set, $y \in Y \subseteq R^m$, with pdf $p_y(y; \theta)$.
• Let $Y(x) \subseteq Y$ be the set of all $y$'s that map to a specific $x$:
$$p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\,dy$$
• $\hat{\theta}_{ML}: \displaystyle\sum_k \frac{\partial \ln p_y(y_k; \theta)}{\partial \theta} = 0$ (this would be the ML estimate if the complete data $y_k$ were available).
• E-step:
$$Q(\Theta; \Theta(t)) = E\big[\ln p_y(y; \Theta) \mid X; \Theta(t)\big]$$
• M-step:
$$\Theta(t+1): \quad \frac{\partial Q(\Theta; \Theta(t))}{\partial \Theta} = 0$$
➢ Application to the mixture modeling problem: the complete data are $(x_k, j_k)$, $k = 1, \ldots, N$, where $j_k$ is the (missing) mixture label of $x_k$:
$$Q(\Theta; \Theta(t)) = E\Big[\sum_{k=1}^{N}\ln\big(p(x_k|j_k; \theta)\,P_{j_k}\big)\Big] = \sum_{k=1}^{N}\sum_{j=1}^{J} P(j|x_k; \Theta(t))\,\ln\big(p(x_k|j; \theta)\,P_j\big)$$
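A compact sketch of the E- and M-steps for a one-dimensional Gaussian mixture; the synthetic data, the number of components $J$ and the initialization are assumed for the illustration, and the closed-form M-step updates are the standard ones for Gaussian components (not code from the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data from two Gaussians (labels are then discarded -> incomplete data).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

J = 2
P = np.full(J, 1.0 / J)                     # mixing probabilities P_j
mu = rng.choice(x, J)                       # component means (random init)
var = np.full(J, x.var())                   # component variances

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for t in range(100):
    # E-step: posterior P(j | x_k; Theta(t)) for every sample and component.
    resp = np.array([P[j] * gauss(x, mu[j], var[j]) for j in range(J)])  # (J, N)
    resp /= resp.sum(axis=0, keepdims=True)
    # M-step: maximize Q -> closed-form updates for P_j, mu_j, var_j.
    Nj = resp.sum(axis=1)
    P = Nj / len(x)
    mu = (resp @ x) / Nj
    var = np.array([(resp[j] * (x - mu[j]) ** 2).sum() / Nj[j] for j in range(J)])

print("P  :", P)
print("mu :", mu)
print("var:", var)
```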
❖ Nonparametric Estimation
➢ $k_N$: the number of points inside an interval of length $h$ centred at $\hat{x}$; $N$: the total number of points. Then
$$\hat{p}(x) \approx \hat{p}(\hat{x}) = \frac{1}{h}\,\frac{k_N}{N}, \qquad |x - \hat{x}| \le \frac{h}{2}$$
➢ If $p(x)$ is continuous, $\hat{p}(x) \rightarrow p(x)$ as $N \rightarrow \infty$, provided that
$$h_N \rightarrow 0, \qquad k_N \rightarrow \infty, \qquad \frac{k_N}{N} \rightarrow 0$$
❖ Parzen Windows
➢ Divide the multidimensional space into hypercubes of side $h$.
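A short sketch of the one-dimensional estimate above, using a box (hypercube with $\ell = 1$) window; the data and the value of $h$ are assumed for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, 1000)      # assumed data ~ N(0, 1)

def parzen_estimate(x, data, h):
    """p_hat(x) = (1/(N*h)) * sum_i phi((x_i - x)/h) with a unit box kernel,
    i.e. (1/h) * (k_N / N): the fraction of points falling in the length-h
    interval centred at x."""
    u = (data - x) / h
    phi = (np.abs(u) <= 0.5).astype(float)   # hypercube (box) window, l = 1
    return phi.sum() / (len(data) * h)

for x in (-1.0, 0.0, 1.0):
    print(x, parzen_estimate(x, samples, h=0.3))
# True N(0, 1) values for comparison: 0.242, 0.399, 0.242.
```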
➢ If $h \rightarrow 0$, $N \rightarrow \infty$, $hN \rightarrow \infty$, the estimate is asymptotically unbiased.
➢ The method (classification with Parzen estimates):
• Remember:
$$\ell_{12} \equiv \frac{p(x|\omega_1)}{p(x|\omega_2)} \;\gtrless\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$$
• Estimate each likelihood from its own training samples (a code sketch follows at the end of this block):
$$p(x|\omega_1) \approx \frac{1}{N_1 h^{\ell}}\sum_{i=1}^{N_1}\varphi\Big(\frac{x_i - x}{h}\Big), \qquad p(x|\omega_2) \approx \frac{1}{N_2 h^{\ell}}\sum_{i=1}^{N_2}\varphi\Big(\frac{x_i - x}{h}\Big)$$

❖ CURSE OF DIMENSIONALITY
➢ In all the methods, so far, we saw that the higher the number of points, $N$, the better the resulting estimate.
➢ If in the one-dimensional space an interval filled with $N$ points is adequately filled (for good estimation), then in the two-dimensional space the corresponding square will require $N^2$ points, and in the $\ell$-dimensional space the $\ell$-dimensional cube will require $N^{\ell}$ points.
➢ The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.
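Here is the sketch announced above: a Parzen-based two-class classifier that estimates each likelihood from its own training set, forms the ratio $\ell_{12}$ and compares it with the threshold. The training data, priors, loss values and the Gaussian kernel are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Assumed 1-D training data: N1 points from class w1, N2 points from class w2.
X1 = rng.normal(0.0, 1.0, 400)
X2 = rng.normal(2.5, 1.0, 600)
P1, P2 = 0.4, 0.6                                   # assumed priors
lam11, lam12, lam21, lam22 = 0.0, 1.0, 1.0, 0.0     # 0/1 loss -> minimum-error case

def parzen_pdf(x, data, h):
    """1/(N*h) * sum_i phi((x_i - x)/h) with a Gaussian kernel phi."""
    u = (data - x) / h
    return np.exp(-0.5 * u**2).sum() / (len(data) * h * np.sqrt(2 * np.pi))

def classify(x, h=0.4):
    l12 = parzen_pdf(x, X1, h) / parzen_pdf(x, X2, h)      # likelihood ratio
    threshold = (P2 * (lam21 - lam22)) / (P1 * (lam12 - lam11))
    return 1 if l12 > threshold else 2

for x in (0.0, 1.2, 2.5):
    print(x, "-> w", classify(x))
```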
❖ The k-Nearest Neighbor (kNN) rule
➢ Assign $x \rightarrow \omega_i$ if $k_i > k_j\ \ \forall j \neq i$, where $k_i$ is the number of points of class $\omega_i$ among the $k$ nearest neighbors of $x$.
➢ Error bounds ($P_B$: the Bayesian error):
$$P_B \le P_{kNN} \le P_B + \sqrt{\frac{2\,P_{NN}}{k}}, \qquad P_{NN} \le 2P_B, \qquad P_{3NN} \approx P_B + 3(P_B)^2$$
➢ As $k \rightarrow \infty$, $P_{kNN} \rightarrow P_B$.
❖ Voronoi tessellation
$$R_i = \{x : d(x, x_i) < d(x, x_j),\ \forall j \neq i\}$$
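A minimal sketch of the kNN rule above, on assumed two-dimensional synthetic training data:

```python
import numpy as np

rng = np.random.default_rng(4)
# Assumed 2-D training set: two classes around different means.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([1] * 100 + [2] * 100)

def knn_classify(x, k=5):
    """Assign x to w_i if k_i > k_j for all j != i, where k_i is the number
    of class-w_i points among the k nearest training points to x."""
    dist = np.linalg.norm(X - x, axis=1)          # Euclidean distances
    nearest = y[np.argsort(dist)[:k]]             # labels of the k nearest points
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

print(knn_classify(np.array([0.5, 0.8])))   # expected: 1
print(knn_classify(np.array([2.8, 2.5])))   # expected: 2
```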
BAYESIAN NETWORKS
❖ Bayes Probability Chain Rule
$$p(x_1, x_2, \ldots, x_\ell) = p(x_\ell | x_{\ell-1}, \ldots, x_1)\, p(x_{\ell-1} | x_{\ell-2}, \ldots, x_1) \cdots p(x_2|x_1)\, p(x_1)$$
➢ Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:
$$p(x_1, x_2, \ldots, x_\ell) = p(x_1)\prod_{i=2}^{\ell} p(x_i | A_i), \qquad \text{where } A_i \subseteq \{x_{i-1}, x_{i-2}, \ldots, x_1\}$$
➢ For example, if $\ell = 6$, then we could assume:
$$p(x_6 | x_5, \ldots, x_1) = p(x_6 | x_5, x_4)$$
Then: $A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$
➢ The above is a generalization of the Naïve Bayes. For the Naïve Bayes the assumption is: $A_i = \varnothing$, for $i = 1, 2, \ldots, \ell$.
❖ Bayesian Networks
➢ A graphical way to portray conditional dependencies is given below.
➢ [Figure: a directed acyclic graph on the variables $x_1, \ldots, x_6$.] According to this figure we have that:
• $x_6$ is conditionally dependent on $x_4, x_5$
• $x_5$ on $x_4$
• $x_4$ on $x_1, x_2$
• $x_3$ on $x_2$
• $x_1, x_2$ are conditionally independent of the other variables
➢ For this case:
$$p(x_1, x_2, \ldots, x_6) = p(x_6|x_5, x_4)\,p(x_5|x_4)\,p(x_4|x_2, x_1)\,p(x_3|x_2)\,p(x_2)\,p(x_1)$$
➢ Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), $p(x_i|A_i)$, where $x_i$ is the variable associated with the node and $A_i$ is the set of its parents in the graph.
➢ A Bayesian Network is specified by:
• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.
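As an illustration of how such a specification determines the joint distribution, the following sketch evaluates the factorization given above for binary variables; the probability tables are hypothetical values invented for the example, not those of the figure.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables x1..x6
# (values invented for illustration; only P(xi = 1 | parents) is stored).
p_x1 = 0.6                                     # P(x1 = 1)
p_x2 = 0.3                                     # P(x2 = 1)
p_x3 = {0: 0.2, 1: 0.7}                        # P(x3 = 1 | x2)
p_x4 = {(0, 0): 0.1, (0, 1): 0.5,              # P(x4 = 1 | x1, x2)
        (1, 0): 0.4, (1, 1): 0.9}
p_x5 = {0: 0.25, 1: 0.8}                       # P(x5 = 1 | x4)
p_x6 = {(0, 0): 0.05, (0, 1): 0.3,             # P(x6 = 1 | x4, x5)
        (1, 0): 0.6, (1, 1): 0.95}

def bern(p, v):
    """P(variable = v) for a binary variable with P(= 1) = p."""
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1,...,x6) = p(x6|x5,x4) p(x5|x4) p(x4|x2,x1) p(x3|x2) p(x2) p(x1)."""
    return (bern(p_x6[(x4, x5)], x6) * bern(p_x5[x4], x5) *
            bern(p_x4[(x1, x2)], x4) * bern(p_x3[x2], x3) *
            bern(p_x2, x2) * bern(p_x1, x1))

print(joint(1, 0, 0, 1, 1, 1))
# Sanity check: the joint sums to 1 over all 2^6 configurations.
print(sum(joint(*cfg) for cfg in product((0, 1), repeat=6)))
```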
❖ Example: Consider the Bayesian network of the figure.
➢ For a), a set of calculations is required that propagates from node $x$ to node $w$. It turns out that $P(w_0|x_1) = 0.63$.