Introduction To Pattern Recognition

1) Pattern recognition is used in applications like machine vision, character recognition, medical diagnosis, and biometrics. The task involves assigning patterns to the correct class. 2) Patterns are represented by feature vectors which are treated as random vectors. A classifier assigns patterns to classes based on feature vector values. 3) Bayes classifiers assign patterns to the class with the maximum a posteriori probability based on feature likelihoods and class priors. This Bayesian approach is optimal for minimizing classification error.


PATTERN RECOGNITION
Sergios Theodoridis, Konstantinos Koutroumbas
❖ Typical application areas
➢ Machine vision
➢ Character recognition (OCR)
➢ Computer aided diagnosis
➢ Speech recognition
➢ Face recognition
➢ Biometrics
➢ Image Data Base retrieval
➢ Data mining
➢ Bioinformatics
❖ The task: Assign unknown objects (patterns) to the correct class. This is known as classification.


❖ An example: (figure from the slides omitted)

❖ Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

❖ Feature vectors: A number of features $x_1, \dots, x_l$ constitute the feature vector
$x = [x_1, \dots, x_l]^T \in \mathbb{R}^l$
Feature vectors are treated as random vectors.


❖ The classifier consists of a set of functions whose values, computed at x, determine the class to which the corresponding pattern belongs.

❖ Classification system overview (stages of the block diagram):
patterns → sensor → feature generation → feature selection → classifier design → system evaluation

❖ Supervised – unsupervised pattern recognition: the two major directions
➢ Supervised: Patterns whose class is known a priori are used for training.
➢ Unsupervised: The number of classes is (in general) unknown and no training patterns are available.

CLASSIFIERS BASED ON BAYES DECISION THEORY

❖ Statistical nature of feature vectors:
$x = [x_1, x_2, \dots, x_l]^T$

❖ Assign the pattern represented by feature vector x to the most probable of the available classes ω_1, ω_2, ..., ω_M.
That is, $x \rightarrow \omega_i : P(\omega_i \mid x)$ maximum.

❖ Computation of a-posteriori probabilities
➢ Assume known
• the a-priori probabilities $P(\omega_1), P(\omega_2), \dots, P(\omega_M)$
• $p(x \mid \omega_i),\ i = 1, 2, \dots, M$, also known as the likelihood of x with respect to ω_i.


❖ The Bayes classification rule (for two classes, M = 2)
➢ Given x, classify it according to the rule
If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \rightarrow \omega_1$
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \rightarrow \omega_2$

➢ The Bayes rule (M = 2):
$p(x)\, P(\omega_i \mid x) = p(x \mid \omega_i)\, P(\omega_i) \;\Rightarrow\; P(\omega_i \mid x) = \dfrac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}$
where $p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\, P(\omega_i)$

➢ Equivalently: classify x according to the rule
$p(x \mid \omega_1)\, P(\omega_1) \;(\gtrless)\; p(x \mid \omega_2)\, P(\omega_2)$

➢ For equiprobable classes the test becomes
$p(x \mid \omega_1) \;(\gtrless)\; p(x \mid \omega_2)$
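As a concrete illustration of the rule above, here is a minimal Python/NumPy sketch (the 1-D Gaussian likelihoods, priors, and test point are illustrative assumptions, not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed two-class setup: equal priors, 1-D Gaussian likelihoods.
priors = np.array([0.5, 0.5])                    # P(w1), P(w2)
likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]   # p(x|w1), p(x|w2)

def posteriors(x):
    """P(wi|x) = p(x|wi) P(wi) / p(x), with p(x) = sum_i p(x|wi) P(wi)."""
    joint = np.array([lik.pdf(x) * p for lik, p in zip(likelihoods, priors)])
    return joint / joint.sum()

def bayes_classify(x):
    """Assign x to the class with the maximum a-posteriori probability."""
    return int(np.argmax(posteriors(x))) + 1     # returns 1 or 2

print(posteriors(0.8))        # posteriors at x = 0.8
print(bayes_classify(0.8))    # -> 1 (x lies closer to the first class mean)
```

With equal priors the decision reduces to comparing the likelihoods, exactly as in the last form of the test above.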

❖ Equivalently, in words: divide the space into two regions R_1 and R_2:
If $x \in R_1 \Rightarrow$ x in ω_1
If $x \in R_2 \Rightarrow$ x in ω_2
(In the figure, $R_1 (\rightarrow \omega_1)$ lies to the left of the threshold $x_0$ and $R_2 (\rightarrow \omega_2)$ to the right.)

❖ Probability of error
➢ Total shaded area in the figure:
$P_e = \int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$

❖ The Bayesian classifier is OPTIMAL with respect to minimizing the classification error probability!

❖ The Bayes classification rule for many (M > 2) classes:
➢ Given x, classify it to ω_i if:
$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$
➢ Such a choice also minimizes the classification error probability.
➢ Indeed: moving the threshold away from $x_0$ INCREASES the total shaded (error) area by the extra "grey" area shown in the figure.

❖ Minimizing the average risk
➢ For each wrong decision, a penalty term is assigned, since some decisions are more sensitive (costly) than others.

➢ For M = 2:
• Define the loss matrix
$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$
• $\lambda_{12}$ is the penalty term for deciding class ω_2 although the pattern belongs to ω_1, etc.

➢ Risk with respect to ω_1:
$r_1 = \lambda_{11} \int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12} \int_{R_2} p(x \mid \omega_1)\,dx$

➢ Risk with respect to ω_2:
$r_2 = \lambda_{21} \int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22} \int_{R_2} p(x \mid \omega_2)\,dx$

➢ These are probabilities of wrong decisions, weighted by the penalty terms.

➢ Average risk:
$r = r_1 P(\omega_1) + r_2 P(\omega_2)$

❖ Choose R_1 and R_2 so that r is minimized.
❖ Then assign x to ω_1 if $\ell_1 < \ell_2$, where
$\ell_1 \equiv \lambda_{11}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{21}\, p(x \mid \omega_2) P(\omega_2)$
$\ell_2 \equiv \lambda_{12}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(x \mid \omega_2) P(\omega_2)$

❖ Equivalently: assign x to ω_1 (ω_2) if
$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>(<)\; \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$
$\ell_{12}$ is the likelihood ratio.

❖ If $P(\omega_1) = P(\omega_2) = \tfrac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$x \rightarrow \omega_1$ if $p(x \mid \omega_1) > \dfrac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)$
$x \rightarrow \omega_2$ if $p(x \mid \omega_2) > \dfrac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)$
If, in addition, $\lambda_{21} = \lambda_{12}$ ⇒ minimum classification error probability.

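A hedged sketch of the minimum-risk decision in code (the likelihood values, priors, and loss matrix below are illustrative assumptions; the loss matrix happens to match the example that follows):

```python
import numpy as np

def min_risk_decide(p_x_w1, p_x_w2, P1, P2, L):
    """Assign to w1 if l1 < l2, with
    l1 = L11 p(x|w1)P(w1) + L21 p(x|w2)P(w2),
    l2 = L12 p(x|w1)P(w1) + L22 p(x|w2)P(w2)."""
    l1 = L[0, 0] * p_x_w1 * P1 + L[1, 0] * p_x_w2 * P2
    l2 = L[0, 1] * p_x_w1 * P1 + L[1, 1] * p_x_w2 * P2
    return 1 if l1 < l2 else 2

def lr_decide(p_x_w1, p_x_w2, P1, P2, L):
    """Equivalent likelihood-ratio form:
    w1 if p(x|w1)/p(x|w2) > P(w2)(L21 - L22) / (P(w1)(L12 - L11))."""
    threshold = P2 * (L[1, 0] - L[1, 1]) / (P1 * (L[0, 1] - L[0, 0]))
    return 1 if p_x_w1 / p_x_w2 > threshold else 2

L = np.array([[0.0, 0.5], [1.0, 0.0]])   # loss matrix (same as the next example)
print(min_risk_decide(0.3, 0.2, 0.5, 0.5, L))   # -> 2
print(lr_decide(0.3, 0.2, 0.5, 0.5, L))         # -> 2 (same decision)
```

Both functions implement the same test; the second is just the likelihood-ratio form of the first.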


❖ An example:
$p(x \mid \omega_1) = \dfrac{1}{\sqrt{\pi}} \exp(-x^2)$
$p(x \mid \omega_2) = \dfrac{1}{\sqrt{\pi}} \exp(-(x-1)^2)$
$P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$
$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$

➢ Threshold $x_0$ for minimum $P_e$:
$x_0 : \exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \tfrac{1}{2}$

➢ Threshold $\hat{x}_0$ for minimum r:
$\hat{x}_0 : \exp(-x^2) = 2\exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \dfrac{1 - \ln 2}{2} < \tfrac{1}{2}$
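The two thresholds can be checked numerically; a small sketch using SciPy's root finder (the bracketing interval is an assumption chosen for illustration):

```python
import numpy as np
from scipy.optimize import brentq

# Minimum-error threshold: exp(-x^2) = exp(-(x-1)^2)
x0 = brentq(lambda x: np.exp(-x**2) - np.exp(-(x - 1)**2), -1.0, 2.0)

# Minimum-risk threshold: exp(-x^2) = 2 exp(-(x-1)^2)
x0_hat = brentq(lambda x: np.exp(-x**2) - 2 * np.exp(-(x - 1)**2), -1.0, 2.0)

print(x0, 0.5)                        # 0.5, as derived above
print(x0_hat, (1 - np.log(2)) / 2)    # ~0.1534, the closed-form value
```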

➢ Thus $\hat{x}_0$ moves to the left of $x_0 = \tfrac{1}{2}$. (Why? Because $\lambda_{21} > \lambda_{12}$, i.e., misclassifying an ω_2 pattern is costlier, so the region assigned to ω_2 is enlarged.)

DISCRIMINANT FUNCTIONS – DECISION SURFACES

❖ If $R_i, R_j$ are contiguous:
$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$
$R_i : P(\omega_i \mid x) > P(\omega_j \mid x)$
$R_j : P(\omega_j \mid x) > P(\omega_i \mid x)$
$g(x) = 0$ is the surface separating the two regions. On one side it is positive (+), on the other negative (−). It is known as the Decision Surface.


❖ If f(·) is monotonic, the classification rule remains the same if we use:
$x \rightarrow \omega_i$ if: $f(P(\omega_i \mid x)) > f(P(\omega_j \mid x)) \quad \forall j \neq i$

❖ $g_i(x) \equiv f(P(\omega_i \mid x))$ is a discriminant function.

❖ In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

❖ Multivariate Gaussian pdf:
$p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{l/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$
$\mu_i = E[x]$ : the mean vector in ω_i
$\Sigma_i = E\!\left[(x - \mu_i)(x - \mu_i)^T\right]$ : the covariance matrix in ω_i
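A short sketch of evaluating the multivariate Gaussian likelihood directly from this formula (the mean, covariance, and test point are made-up values for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """p(x|wi) for an l-dimensional Gaussian with mean mu and covariance Sigma."""
    l = len(mu)
    d = x - mu
    quad = d @ np.linalg.inv(Sigma) @ d
    norm_const = (2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Illustrative (assumed) parameters
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([0.5, -0.2])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # same value, library check
```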


❖ ln(·) is monotonic. Define:
➢ $g_i(x) = \ln\big(p(x \mid \omega_i)\, P(\omega_i)\big) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$
➢ $g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$
where $C_i = -\tfrac{l}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i|$

➢ Example: $\Sigma_i = \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}$
$g_i(x) = -\dfrac{1}{2\sigma_i^2}(x_1^2 + x_2^2) + \dfrac{1}{\sigma_i^2}(\mu_{i1} x_1 + \mu_{i2} x_2) - \dfrac{1}{2\sigma_i^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$
That is, $g_i(x)$ is quadratic and the surfaces $g_i(x) - g_j(x) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.


❖ Decision Hyperplanes
➢ The quadratic terms are $x^T \Sigma_i^{-1} x$.
If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently, we can write:
$g_i(x) = w_i^T x + w_{i0}$
$w_i = \Sigma^{-1} \mu_i$
$w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i$
The discriminant functions are LINEAR.

➢ Let, in addition, $\Sigma = \sigma^2 I$. Then:
$g_i(x) = \dfrac{1}{\sigma^2}\mu_i^T x + w_{i0}$
$g_{ij}(x) = g_i(x) - g_j(x) = 0 = w^T (x - x_0)$
$w = \mu_i - \mu_j$
$x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\!\dfrac{P(\omega_i)}{P(\omega_j)} \cdot \dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$
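A minimal sketch of the equal-covariance linear discriminants $g_i(x) = w_i^T x + w_{i0}$ (the means, covariance, priors, and test point are assumptions, reused from the worked example a couple of slides below):

```python
import numpy as np

def linear_discriminant_params(mu_i, Sigma, P_i):
    """wi = Sigma^{-1} mu_i, wi0 = ln P(wi) - 0.5 mu_i^T Sigma^{-1} mu_i
    (valid when all classes share the same covariance Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w_i = Sigma_inv @ mu_i
    w_i0 = np.log(P_i) - 0.5 * mu_i @ Sigma_inv @ mu_i
    return w_i, w_i0

# Illustrative (assumed) parameters
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
w1, w10 = linear_discriminant_params(mu1, Sigma, 0.5)
w2, w20 = linear_discriminant_params(mu2, Sigma, 0.5)

x = np.array([1.0, 2.2])
g1, g2 = w1 @ x + w10, w2 @ x + w20
print(g1, g2)
print("assign to w1" if g1 > g2 else "assign to w2")   # -> assign to w1
```

With equal priors, comparing $g_1$ and $g_2$ here is equivalent to comparing the Mahalanobis distances that appear on the next slides.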


➢ Nondiagonal case: $\Sigma \neq \sigma^2 I$
$g_{ij}(x) = w^T (x - x_0) = 0$
$w = \Sigma^{-1}(\mu_i - \mu_j)$
$x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \ln\!\dfrac{P(\omega_i)}{P(\omega_j)} \cdot \dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$
where $\|x\|_{\Sigma^{-1}} \equiv (x^T \Sigma^{-1} x)^{1/2}$
➢ The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, not normal to $\mu_i - \mu_j$.

❖ Minimum Distance Classifiers
➢ $P(\omega_i) = \dfrac{1}{M}$ (equiprobable classes)
➢ $g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$
➢ $\Sigma = \sigma^2 I$: assign $x \rightarrow \omega_i$ with the smaller Euclidean distance $d_E = \|x - \mu_i\|$
➢ $\Sigma \neq \sigma^2 I$: assign $x \rightarrow \omega_i$ with the smaller Mahalanobis distance $d_m = \big((x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\big)^{1/2}$


❖ Example:
Given ω_1, ω_2 with $P(\omega_1) = P(\omega_2)$ and
$p(x \mid \omega_1) = N(\mu_1, \Sigma)$, $p(x \mid \omega_2) = N(\mu_2, \Sigma)$,
$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \; \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \; \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$
classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification:
• $\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$
• Compute the Mahalanobis distances from $\mu_1, \mu_2$:
$d_{m,1}^2 = [1.0,\; 2.2]\, \Sigma^{-1} \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = [-2.0,\; -0.8]\, \Sigma^{-1} \begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$
• Classify $x \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$: the Euclidean distance alone would have favored ω_2.
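A short NumPy sketch that reproduces the numbers of this example:

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
x = np.array([1.0, 2.2])

Sigma_inv = np.linalg.inv(Sigma)          # [[0.95, -0.15], [-0.15, 0.55]]

def mahalanobis_sq(x, mu, Sigma_inv):
    d = x - mu
    return d @ Sigma_inv @ d

d1 = mahalanobis_sq(x, mu1, Sigma_inv)    # 2.952
d2 = mahalanobis_sq(x, mu2, Sigma_inv)    # 3.672
print(d1, d2, "-> w1" if d1 < d2 else "-> w2")

# Euclidean distances, for comparison: the second one is smaller
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))   # ~2.42 vs ~2.15
```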



ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

❖ Maximum Likelihood
➢ Let $x_1, x_2, \dots, x_N$ be known and independent.
➢ Let p(x) be known within an unknown vector parameter θ: $p(x) \equiv p(x; \theta)$
➢ $X = \{x_1, x_2, \dots, x_N\}$
➢ $p(X; \theta) \equiv p(x_1, x_2, \dots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)$
which is known as the likelihood of θ with respect to X.

❖ The method:
➢ $\hat{\theta}_{ML} : \arg\max_{\theta} \prod_{k=1}^{N} p(x_k; \theta)$
➢ $L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N} \ln p(x_k; \theta)$
➢ $\hat{\theta}_{ML} : \dfrac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \dfrac{1}{p(x_k; \theta)} \dfrac{\partial p(x_k; \theta)}{\partial \theta} = 0$

➢ If, indeed, there is a $\theta_0$ such that $p(x) = p(x; \theta_0)$, then
$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0$
$\lim_{N \to \infty} E\big[\|\hat{\theta}_{ML} - \theta_0\|^2\big] = 0$
That is, the ML estimator is asymptotically unbiased and consistent.

❖ Example:
$p(x) : N(\mu, \Sigma)$ with μ unknown; $x_1, x_2, \dots, x_N$; $p(x_k) \equiv p(x_k; \mu)$
$p(x_k; \mu) = \dfrac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x_k - \mu)^T \Sigma^{-1} (x_k - \mu)\right)$
$L(\mu) = \ln \prod_{k=1}^{N} p(x_k; \mu) = C - \tfrac{1}{2} \sum_{k=1}^{N} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu)$
$\dfrac{\partial L(\mu)}{\partial \mu} = \left[\dfrac{\partial L}{\partial \mu_1}, \dots, \dfrac{\partial L}{\partial \mu_l}\right]^T = \sum_{k=1}^{N} \Sigma^{-1}(x_k - \mu) = 0 \;\Rightarrow\; \mu_{ML} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$
Remember: if $A = A^T$, then $\dfrac{\partial (x^T A x)}{\partial x} = 2Ax$.

❖ Maximum A Posteriori (MAP) Probability Estimation
➢ In the ML method, θ was considered as a (deterministic) parameter.
➢ Here we shall look at θ as a random vector described by a pdf p(θ), assumed to be known.
➢ Given $X = \{x_1, x_2, \dots, x_N\}$, compute the maximum of $p(\theta \mid X)$.
➢ From the Bayes theorem:
$p(\theta)\, p(X \mid \theta) = p(X)\, p(\theta \mid X)$, or $p(\theta \mid X) = \dfrac{p(\theta)\, p(X \mid \theta)}{p(X)}$


➢ The method:
$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid X)$, or
$\hat{\theta}_{MAP} : \dfrac{\partial}{\partial \theta}\big(p(\theta)\, p(X \mid \theta)\big) = 0$

➢ If p(θ) is uniform or broad enough, then $\hat{\theta}_{MAP} \simeq \hat{\theta}_{ML}$.


❖ Example:
$p(x) : N(\mu, \Sigma)$ with μ unknown, $X = \{x_1, \dots, x_N\}$
$p(\mu) = \dfrac{1}{(2\pi)^{l/2} \sigma_{\mu}^{l}} \exp\!\left(-\dfrac{\|\mu - \mu_0\|^2}{2\sigma_{\mu}^2}\right)$
$\hat{\mu}_{MAP} : \dfrac{\partial}{\partial \mu} \ln\!\Big(\prod_{k=1}^{N} p(x_k \mid \mu)\, p(\mu)\Big) = 0$, or $\sum_{k=1}^{N} \dfrac{1}{\sigma^2}(x_k - \hat{\mu}) - \dfrac{1}{\sigma_{\mu}^2}(\hat{\mu} - \mu_0) = 0 \;\Rightarrow$
$\hat{\mu}_{MAP} = \dfrac{\mu_0 + \dfrac{\sigma_{\mu}^2}{\sigma^2} \sum_{k=1}^{N} x_k}{1 + \dfrac{\sigma_{\mu}^2}{\sigma^2} N}$
For $\dfrac{\sigma_{\mu}^2}{\sigma^2} \gg 1$, or for $N \to \infty$:
$\hat{\mu}_{MAP} \simeq \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$

❖ Bayesian Inference
➢ ML and MAP yield a single estimate for θ. Here a different road is followed.
Given: $X = \{x_1, \dots, x_N\}$, $p(x \mid \theta)$ and $p(\theta)$.
The goal: estimate $p(x \mid X)$.
How?
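A small sketch of the MAP estimate above for a 1-D Gaussian mean (the variances, prior mean, and synthetic data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, sigma_mu2 = 1.0, 10.0          # likelihood and prior variances (assumed)
mu_true, mu0 = 2.0, 0.0                # true mean and prior mean (assumed)
X = rng.normal(mu_true, np.sqrt(sigma2), size=50)

ratio = sigma_mu2 / sigma2
mu_map = (mu0 + ratio * X.sum()) / (1 + ratio * len(X))   # formula above
mu_ml = X.mean()
print(mu_map, mu_ml)    # nearly equal, since the prior is broad (ratio >> 1)
```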


$p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta$
$p(\theta \mid X) = \dfrac{p(X \mid \theta)\, p(\theta)}{p(X)} = \dfrac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}$
$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$

➢ A bit more insight via an example:
• Let $p(x \mid \mu) \rightarrow N(\mu, \sigma^2)$
• $p(\mu) \rightarrow N(\mu_0, \sigma_0^2)$
• It turns out that: $p(\mu \mid X) \rightarrow N(\mu_N, \sigma_N^2)$, with
$\mu_N = \dfrac{N \sigma_0^2\, \bar{x} + \sigma^2 \mu_0}{N \sigma_0^2 + \sigma^2}, \qquad \sigma_N^2 = \dfrac{\sigma^2 \sigma_0^2}{N \sigma_0^2 + \sigma^2}, \qquad \bar{x} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$
➢ The above is a sequence of Gaussians as $N \to \infty$.

❖ Maximum Entropy
➢ Entropy: $H = -\int p(x) \ln p(x)\, dx$
➢ $\hat{p}(x)$: the pdf that maximizes H subject to the available constraints.
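A minimal sketch computing $\mu_N$ and $\sigma_N^2$ from the formulas above for synthetic 1-D data (all numerical values are assumptions):

```python
import numpy as np

def posterior_params(X, sigma2, mu0, sigma0_2):
    """p(mu|X) = N(mu_N, sigma_N^2) for a Gaussian likelihood N(mu, sigma2)
    and a Gaussian prior N(mu0, sigma0_2)."""
    N, xbar = len(X), np.mean(X)
    mu_N = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)
    sigma_N2 = sigma2 * sigma0_2 / (N * sigma0_2 + sigma2)
    # For this Gaussian case, the predictive density is p(x|X) = N(mu_N, sigma2 + sigma_N2).
    return mu_N, sigma_N2

rng = np.random.default_rng(2)
X = rng.normal(1.5, 1.0, size=20)                 # assumed data
print(posterior_params(X, sigma2=1.0, mu0=0.0, sigma0_2=4.0))
```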


➢ Example: x is nonzero in the interval $x_1 \le x \le x_2$ and zero otherwise. Compute the maximum entropy pdf.
• The constraint: $\int_{x_1}^{x_2} p(x)\, dx = 1$
• Lagrange multipliers: $H_L = H + \lambda\!\left(\int_{x_1}^{x_2} p(x)\, dx - 1\right)$
• $\hat{p}(x) = \exp(\lambda - 1)$
• $\hat{p}(x) = \begin{cases} \dfrac{1}{x_2 - x_1} & x_1 \le x \le x_2 \\ 0 & \text{otherwise} \end{cases}$

❖ Mixture Models
➢ $p(x) = \sum_{j=1}^{J} p(x \mid j)\, P_j$, with $\sum_{j=1}^{J} P_j = 1$ and $\int p(x \mid j)\, dx = 1$
➢ Assume parametric modeling, i.e., $p(x \mid j; \theta)$
➢ The goal is to estimate θ and $P_1, P_2, \dots, P_J$, given a set $X = \{x_1, x_2, \dots, x_N\}$.
➢ Why not ML, as before?
$\max_{\theta, P_1, \dots, P_J} \prod_{k=1}^{N} p(x_k; \theta, P_1, \dots, P_J)$


➢ This is a nonlinear problem, due to the missing label information. It is a typical problem with an incomplete data set.

➢ The Expectation-Maximisation (EM) algorithm.
• General formulation: let $y \in Y \subseteq R^m$ be the complete data set, with pdf $p_y(y; \theta)$; the y's are not observed directly.
• We observe $x = g(y) \in X_{ob} \subseteq R^l$, $l < m$, with pdf $p_x(x; \theta)$: a many-to-one transformation.
• Let $Y(x) \subseteq Y$ be the set of all y's that map to a specific x:
$p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\, dy$
• What we need is to compute
$\hat{\theta}_{ML} : \sum_{k} \dfrac{\partial \ln p_y(y_k; \theta)}{\partial \theta} = 0$
• But the $y_k$'s are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of θ.

➢ The algorithm:
• E-step: $Q(\theta; \theta(t)) = E\!\left[\sum_{k} \ln p_y(y_k; \theta) \,\middle|\, X; \theta(t)\right]$
• M-step: $\theta(t+1) : \dfrac{\partial Q(\theta; \theta(t))}{\partial \theta} = 0$

➢ Application to the mixture modeling problem
• Complete data: $(x_k, j_k), \; k = 1, 2, \dots, N$
• Observed data: $x_k, \; k = 1, 2, \dots, N$
• $p(x_k, j_k; \theta) = p(x_k \mid j_k; \theta)\, P_{j_k}$
• Assuming mutual independence: $L(\theta) = \sum_{k=1}^{N} \ln\big(p(x_k \mid j_k; \theta)\, P_{j_k}\big)$
• Unknown parameters: $\Theta = [\theta^T, P^T]^T$, $P = [P_1, P_2, \dots, P_J]^T$
• E-step:
$Q(\Theta; \Theta(t)) = E\!\left[\sum_{k=1}^{N} \ln\big(p(x_k \mid j_k; \theta)\, P_{j_k}\big)\right] = \sum_{k=1}^{N} \sum_{j_k=1}^{J} P(j_k \mid x_k; \Theta(t)) \ln\big(p(x_k \mid j_k; \theta)\, P_{j_k}\big)$
• M-step:
$\dfrac{\partial Q}{\partial \theta} = 0, \qquad \dfrac{\partial Q}{\partial P_{j_k}} = 0, \quad j_k = 1, 2, \dots, J$
where
$P(j \mid x_k; \Theta(t)) = \dfrac{p(x_k \mid j; \theta(t))\, P_j}{p(x_k; \Theta(t))}, \qquad p(x_k; \Theta(t)) = \sum_{j=1}^{J} p(x_k \mid j; \theta(t))\, P_j$
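For the Gaussian mixture case the E- and M-steps take a simple closed form; a minimal 1-D sketch (the initialization, component count, and synthetic data are assumptions, and no safeguards against degenerate components are included):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, J=2, iters=100, seed=0):
    """EM for a 1-D Gaussian mixture p(x) = sum_j P_j N(x; mu_j, var_j)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=J)                  # initial means
    var = np.full(J, x.var())                   # initial variances
    P = np.full(J, 1.0 / J)                     # initial mixing probabilities
    for _ in range(iters):
        # E-step: responsibilities P(j | x_k; Theta(t))
        lik = np.array([P[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(J)])
        gamma = lik / lik.sum(axis=0)           # shape (J, N)
        # M-step: re-estimate means, variances, and mixing probabilities
        Nj = gamma.sum(axis=1)
        mu = (gamma @ x) / Nj
        var = np.array([(gamma[j] * (x - mu[j]) ** 2).sum() / Nj[j] for j in range(J)])
        P = Nj / len(x)
    return mu, var, P

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))   # means near -2 and 3, mixing probabilities near 0.6 / 0.4
```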


❖ Nonparametric Estimation
➢ In a region of width h centered at $\hat{x}$ (i.e., for $|x - \hat{x}| \le \tfrac{h}{2}$), with $k_N$ of the N total points falling inside it:
$P \approx \dfrac{k_N}{N}, \qquad \hat{p}(x) \approx \hat{p}(\hat{x}) = \dfrac{1}{h}\,\dfrac{k_N}{N}$
➢ If p(x) is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, provided that
$h_N \to 0, \quad k_N \to \infty, \quad \dfrac{k_N}{N} \to 0$

❖ Parzen Windows
➢ Divide the multidimensional space into hypercubes.


➢ Define
$\varphi(x_i) = \begin{cases} 1 & \text{for } |x_{ij}| \le \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases}$
• That is, it equals 1 inside a unit-side hypercube centered at 0.
• $\hat{p}(x) = \dfrac{1}{h^l}\,\dfrac{1}{N} \sum_{i=1}^{N} \varphi\!\left(\dfrac{x_i - x}{h}\right)$
i.e., $\dfrac{1}{\text{volume}} \times \dfrac{\text{number of points inside an } h\text{-side hypercube centered at } x}{N}$
• The problem: p(x) is continuous, while φ(·) is discontinuous.
• Parzen windows – kernels – potential functions: choose φ(x) smooth, with
$\varphi(x) \ge 0, \qquad \int_{x} \varphi(x)\, dx = 1$

➢ Mean value:
$E[\hat{p}(x)] = \dfrac{1}{h^l}\,\dfrac{1}{N} \sum_{i=1}^{N} E\!\left[\varphi\!\left(\dfrac{x_i - x}{h}\right)\right] = \int_{x'} \dfrac{1}{h^l}\, \varphi\!\left(\dfrac{x' - x}{h}\right) p(x')\, dx'$
• As $h \to 0$, the width of $\dfrac{1}{h^l}\varphi\!\left(\dfrac{x' - x}{h}\right)$ goes to 0, while
$\int \dfrac{1}{h^l}\, \varphi\!\left(\dfrac{x' - x}{h}\right) dx' = 1$
• Hence, as $h \to 0$: $\dfrac{1}{h^l}\varphi\!\left(\dfrac{x}{h}\right) \to \delta(x)$ and
$E[\hat{p}(x)] = \int_{x'} \delta(x' - x)\, p(x')\, dx' = p(x)$
Hence the estimate is unbiased in the limit.
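A 1-D sketch of the Parzen estimate, with both the hypercube window of the definition and a smooth Gaussian kernel (the sample data and evaluation point are assumptions):

```python
import numpy as np

def parzen_estimate(x, samples, h, kernel="gaussian"):
    """p_hat(x) = (1/(N h^l)) sum_i phi((x_i - x)/h), here for l = 1."""
    u = (samples - x) / h
    if kernel == "hypercube":
        phi = (np.abs(u) <= 0.5).astype(float)          # unit hypercube window
    else:
        phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # smooth Gaussian kernel
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=1000)
for h in (0.1, 0.8):
    print(h, parzen_estimate(0.0, samples, h))   # true N(0,1) density at 0 is ~0.399
```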


➢ Variance
• The smaller the h, the higher the variance.
(Figure: Parzen estimates with h=0.1, N=1000 and h=0.8, N=1000.)
➢ The higher the N, the better the accuracy.
(Figure: Parzen estimate with h=0.1, N=10000.)


➢ If
• $h \to 0$
• $N \to \infty$
• $hN \to \infty$
then the estimate is asymptotically unbiased.

➢ The method applied to classification:
• Remember the likelihood ratio test:
$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>(<)\; \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$
• Approximate each class-conditional pdf by its Parzen estimate:
$\dfrac{\dfrac{1}{N_1 h^l} \sum_{i=1}^{N_1} \varphi\!\left(\dfrac{x_i - x}{h}\right)}{\dfrac{1}{N_2 h^l} \sum_{i=1}^{N_2} \varphi\!\left(\dfrac{x_i - x}{h}\right)} \;>(<)\; \text{the threshold above}$

❖ CURSE OF DIMENSIONALITY
➢ In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate.
➢ If in the one-dimensional space an interval filled with N points is adequate (for good estimation), in the two-dimensional space the corresponding square will require $N^2$ points, and in the ℓ-dimensional space the ℓ-dimensional cube will require $N^{\ell}$ points.
➢ The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

❖ NAIVE – BAYES CLASSIFIER
➢ Let $x \in R^{\ell}$; the goal is to estimate $p(x \mid \omega_i)$, i = 1, 2, …, M. For a "good" estimate of the pdf one would need, say, $N^{\ell}$ points.
➢ Assume $x_1, x_2, \dots, x_{\ell}$ to be mutually independent. Then:
$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$
➢ In this case, one would require, roughly, N points for each one-dimensional pdf. Thus, a number of points of the order N·ℓ would suffice.
➢ It turns out that the Naïve–Bayes classifier works reasonably well even in cases that violate the independence assumption.

❖ k Nearest Neighbor Density Estimation
➢ In Parzen:
• The volume is constant.
• The number of points falling inside the volume varies.
➢ Now:
• Keep the number of points $k_N = k$ constant.
• Let the volume vary:
$\hat{p}(x) = \dfrac{k}{N\, V(x)}$
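Returning to the Naïve–Bayes classifier above: a minimal sketch that models each $p(x_j \mid \omega_i)$ as a 1-D Gaussian (this per-feature Gaussian model and the synthetic data are assumptions, not part of the slides):

```python
import numpy as np

class GaussianNaiveBayes:
    """Assumes p(x|wi) = prod_j p(x_j|wi), each factor a 1-D Gaussian."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) for c in self.classes])
        return self

    def predict(self, X):
        # log P(wi) + sum_j log N(x_j; mu_ij, var_ij), maximized over i
        log_post = []
        for pr, mu, var in zip(self.priors, self.mu, self.var):
            ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_post.append(np.log(pr) + ll)
        return self.classes[np.argmax(np.array(log_post), axis=0)]

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
print((GaussianNaiveBayes().fit(X, y).predict(X) == y).mean())   # training accuracy
```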


➢ Using the k-NN density estimates of the two class-conditional pdfs in the likelihood ratio test:
$\dfrac{k / (N_1 V_1)}{k / (N_2 V_2)} = \dfrac{N_2 V_2}{N_1 V_1} \;(\gtrless)\; \text{threshold}$

❖ The Nearest Neighbor Rule
➢ Choose k out of the N training vectors: identify the k nearest ones to x.
➢ Out of these k, identify $k_i$ that belong to class ω_i.
➢ Assign $x \rightarrow \omega_i : k_i > k_j \;\; \forall i \neq j$
➢ The simplest version: k = 1 !!!
➢ For large N this is not bad. It can be shown that, if $P_B$ is the optimal Bayesian error probability, then:
$P_B \le P_{NN} \le P_B\!\left(2 - \dfrac{M}{M-1} P_B\right) \le 2 P_B$
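A minimal k-NN classifier sketch (Euclidean distances and majority vote; the synthetic training data are assumptions):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Find the k training vectors nearest to x and take a majority vote."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=3))   # likely class 1
print(knn_classify(np.array([0.2, -0.1]), X_train, y_train, k=1))  # NN rule, k = 1
```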

❖ Voronoi tessellation (the decision regions of the NN rule, k = 1):
$R_i = \{x : d(x, x_i) < d(x, x_j) \;\; \forall i \neq j\}$

➢ For the k-NN rule:
$P_B \le P_{kNN} \le P_B + \sqrt{\dfrac{2 P_{NN}}{k}}$
➢ As $k \to \infty$, $P_{kNN} \to P_B$.
➢ For small $P_B$:
$P_{NN} \simeq 2 P_B$
$P_{3NN} \simeq P_B + 3 (P_B)^2$


BAYESIAN NETWORKS

❖ Bayes probability chain rule:
$p(x_1, x_2, \dots, x_{\ell}) = p(x_{\ell} \mid x_{\ell-1}, \dots, x_1) \cdot p(x_{\ell-1} \mid x_{\ell-2}, \dots, x_1) \cdots p(x_2 \mid x_1) \cdot p(x_1)$

➢ Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:
$p(x_1, x_2, \dots, x_{\ell}) = p(x_1) \prod_{i=2}^{\ell} p(x_i \mid A_i)$
where $A_i \subseteq \{x_{i-1}, x_{i-2}, \dots, x_1\}$

➢ For example, if ℓ = 6, then we could assume:
$p(x_6 \mid x_5, \dots, x_1) = p(x_6 \mid x_5, x_4)$
Then: $A_6 = \{x_5, x_4\} \subset \{x_5, \dots, x_1\}$

➢ The above is a generalization of the Naïve–Bayes assumption. For Naïve–Bayes: $A_i = \varnothing$, for i = 1, 2, …, ℓ.


❖ Bayesian Networks
➢ A graphical way to portray conditional dependencies is given in the figure (a DAG over x_1, …, x_6).
➢ According to this figure we have that:
• x_6 is conditionally dependent on x_4, x_5,
• x_5 on x_4,
• x_4 on x_1, x_2,
• x_3 on x_2,
• x_1, x_2 are conditionally independent of the other variables.
➢ For this case:
$p(x_1, x_2, \dots, x_6) = p(x_6 \mid x_5, x_4) \cdot p(x_5 \mid x_4) \cdot p(x_4 \mid x_1, x_2) \cdot p(x_3 \mid x_2) \cdot p(x_2) \cdot p(x_1)$

➢ Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), $p(x_i \mid A_i)$, where $x_i$ is the variable associated with the node and $A_i$ is the set of its parents in the graph.

➢ A Bayesian Network is specified by:
• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations of parent values.


➢ The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field.
➢ This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

➢ Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root node) and the conditional (non-root node) probabilities.

➢ Training: Once a topology is given, the probabilities are estimated via the training data set. There are also methods that learn the topology itself.

➢ Probability inference: This is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities of some of the other variables, given the evidence.


❖ Example: Consider the Bayesian network of the figure.
a) If x is measured to be x = 1 (x1), compute P(w = 0 | x = 1) [P(w0 | x1)].
b) If w is measured to be w = 1 (w1), compute P(x = 0 | w = 1) [P(x0 | w1)].

➢ For a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0 | x1) = 0.63.

➢ For b), the propagation is reversed in direction. It turns out that P(x0 | w1) = 0.4.

➢ In general, the required inference information is computed via a combined process of "message passing" among the nodes of the DAG.

❖ Complexity:
➢ For singly connected graphs, message passing algorithms have a complexity linear in the number of nodes.
