04 Bayes Classification Rule
Spring 2023
Acknowledgment
• These slides have been created based on the lecture notes of Prof. Dr. Amir Atiya
Recap: Minimum Distance Classifier
[Figure: the $(X_1, X_2)$ feature space partitioned into regions for classes C1, C2, C3 around class mean vectors such as $V_1$ and $V_2$]
Recap: Nearest Neighbor Classifier
[Figure: query point $X$ in the $(X_1, X_2)$ plane among labeled training points from classes C1 and C2; $X$ takes the class of its nearest neighbor]
Recap: K-Nearest Neighbor Classifier
• Take k = 5
[Figure: query point $X$ in the $(X_1, X_2)$ plane with its $k = 5$ nearest neighbors, drawn from classes C1 and C2]
• One can see that C2 is the majority → classify $X$ as C2
Recap: Bayes Classification Rule
$$P(C_i \mid X) = \frac{P(C_i, X)}{P(X)} = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
Bayes rule:
$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
Recap: Bayes Classification Rule
• To compute $P(C_i \mid X)$, we use Bayes rule:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
Recap: Bayes Classification Rule
• The a priori probabilities represent the frequencies of the classes irrespective of the observed features
Bayes Classification Rule
• Find the class $C_i$ giving the max $P(C_i \mid X)$:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
– $P(C_i \mid X)$ ≡ posterior prob.
– $P(C_i)$ ≡ a priori prob.
– $P(X \mid C_i)$ ≡ class-conditional density
• $P(X) = \sum_{i=1}^{K} P(X, C_i) = \sum_{i=1}^{K} P(X \mid C_i)\,P(C_i)$
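To make the rule concrete, here is a minimal sketch in Python. The class-conditional densities and priors are assumed given (the Gaussian parameters below are hypothetical); the classifier simply returns the class maximizing $P(X \mid C_i)\,P(C_i)$, since $P(X)$ is the same for every class and can be dropped.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical class parameters: (mu, sigma) per class, plus priors P(C_i).
params = {"C1": (3.0, 1.0), "C2": (7.0, 1.5)}
priors = {"C1": 0.4, "C2": 0.6}

def bayes_classify(x):
    # argmax over classes of P(x | C_i) * P(C_i); P(x) cancels out.
    scores = {c: gaussian_pdf(x, mu, s) * priors[c] for c, (mu, s) in params.items()}
    return max(scores, key=scores.get)

print(bayes_classify(5.0))  # class whose weighted density is higher at x = 5
```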
Recap: Marginalization
• Discrete case:
$$P(A) = \sum_{i} P(A, B = B_i)$$
• Continuous case:
$$P(x) = \int_{-\infty}^{\infty} P(x, y)\,dy$$
• So:
$$P(X) = \sum_{i=1}^{K} P(X, C_i) = \sum_{i=1}^{K} P(X \mid C_i)\,P(C_i)$$
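As a quick numeric check (toy numbers assumed for illustration), summing $P(X \mid C_i)\,P(C_i)$ over the classes recovers $P(X)$, and the resulting posteriors sum to 1:

```python
# Toy example: two classes, likelihoods of one observed x under each class.
priors = [0.4, 0.6]          # P(C1), P(C2)
likelihoods = [0.05, 0.20]   # P(x | C1), P(x | C2) at some fixed x

# Marginalization: P(x) = sum_i P(x | C_i) P(C_i)
p_x = sum(l * p for l, p in zip(likelihoods, priors))
print(p_x)  # 0.14

# Posteriors via Bayes rule then sum to 1, as they must.
posteriors = [l * p / p_x for l, p in zip(likelihoods, priors)]
print(posteriors, sum(posteriors))
```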
Bayes Classification Rule
• Classify $X$ to the class corresponding to $\max_i P(X \mid C_i)\,P(C_i)$
[Figure: 1-D example showing the curves $P(x \mid C_1)P(C_1)$ and $P(x \mid C_2)P(C_2)$ over $x$]
Bayes Classification Rule
• Classify $X$ to the class corresponding to $\max_i P(X \mid C_i)\,P(C_i)$
[Figure: the same 1-D example, evaluated at $x = 5$]
• For $x = 5$, $P(x \mid C_1)P(C_1)$ has a higher value than $P(x \mid C_2)P(C_2)$ → classify as C1
Classification Accuracy
$$P(\text{correct classification} \mid X) = \max_{1 \le i \le K} P(C_i \mid X)$$
Classification Accuracy
• Overall P(correct) is:
$$P(\text{correct}) = \int P(\text{correct}, X)\,dX \qquad \text{(marginal prob.)}$$
$$= \int \max_i \frac{P(X \mid C_i)\,P(C_i)}{P(X)}\;P(X)\,dX$$
Classification Accuracy
• Overall P(correct) is:
$$P(\text{correct}) = \int \max_i P(X \mid C_i)\,P(C_i)\,dX \qquad \text{(after cancelling } P(X)\text{)}$$
[Figure: 1-D example; $P(\text{correct})$ equals the total area under the upper envelope $\max_i P(x \mid C_i)\,P(C_i)$ of the two curves]
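A small numeric sketch (reusing the hypothetical Gaussian parameters from above): integrating the pointwise maximum of the two weighted densities over a fine grid approximates $P(\text{correct})$.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical 2-class setup: means, sigmas, priors.
mus, sigmas, priors = [3.0, 7.0], [1.0, 1.5], [0.4, 0.6]

x = np.linspace(-10, 20, 100_001)           # grid covering both densities
weighted = [p * gaussian_pdf(x, m, s) for m, s, p in zip(mus, sigmas, priors)]
upper_envelope = np.maximum(*weighted)      # max_i P(x|C_i) P(C_i) at each x

p_correct = np.trapz(upper_envelope, x)     # area under the envelope
print(p_correct, 1 - p_correct)             # P(correct), P(error)
```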
Classification Accuracy
$$P(\text{error}) = 1 - P(\text{correct})$$
We can compute $P(\text{error})$ directly only for the 2-class case.
[Figure: the shaded overlap area between the two weighted density curves equals $P(\text{error})$]
1-D Example
• Assume Gaussian class-conditional densities:
$$P(X \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x - \mu_i)^2}{2\sigma^2}}$$
$$\mu_i = E[X] = \int_{-\infty}^{\infty} x\,\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x - \mu_i)^2}{2\sigma^2}}\,dx$$
$$\text{Variance} = E\!\left[(X - \mu)^2\right] = \sigma^2, \qquad \sigma = \sqrt{\text{var}} \;\;(\text{std. dev.})$$
[Figure: Gaussian bell curve with mean $\mu$ and standard deviation $\sigma$]
1-D Example
• To get decision boundary:
$$P(X \mid C_1)\,P(C_1) = P(X \mid C_2)\,P(C_2)$$
Exercise! (one possible solution is sketched below)
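One possible solution sketch (not worked out on the slide), assuming both classes share the same variance $\sigma^2$: take logs of both sides and solve for $x$.
$$\log P(C_1) - \frac{(x - \mu_1)^2}{2\sigma^2} = \log P(C_2) - \frac{(x - \mu_2)^2}{2\sigma^2}$$
$$\Rightarrow\quad x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_1 - \mu_2}\,\log\frac{P(C_1)}{P(C_2)}$$
With equal priors, the boundary is simply the midpoint between the two means.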
1-D Example
• Compute $P(\text{error})$:
[Figure: curves $P(X \mid C_1)P(C_1)$ and $P(X \mid C_2)P(C_2)$ crossing at $x^*$, with error regions $I_1$ and $I_2$]
$$P(\text{error}) = \underbrace{\int_{-\infty}^{x^*} \frac{P(C_2)\,e^{-\frac{(X - \mu_2)^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma}\,dX}_{I_1} \;+\; \underbrace{\int_{x^*}^{\infty} \frac{P(C_1)\,e^{-\frac{(X - \mu_1)^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma}\,dX}_{I_2}$$
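Both terms are Gaussian tail masses, so they can be evaluated with the normal CDF instead of numerical integration. A sketch with hypothetical equal-variance parameters, using the closed-form boundary derived above:

```python
from math import log
from statistics import NormalDist

# Hypothetical 2-class setup with equal variance.
mu1, mu2, sigma = 3.0, 7.0, 1.0
p1, p2 = 0.4, 0.6

# Decision boundary for equal variances (see the solution sketch above).
x_star = (mu1 + mu2) / 2 + sigma**2 / (mu1 - mu2) * log(p1 / p2)

# I1: mass of class 2 falling left of x*; I2: mass of class 1 right of x*.
I1 = p2 * NormalDist(mu2, sigma).cdf(x_star)
I2 = p1 * (1 - NormalDist(mu1, sigma).cdf(x_star))
print(x_star, I1 + I2)  # boundary location and P(error)
```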
Bayes Classification Rule
• However, the Bayes classifier assumes that the probability densities are known, which is usually not the case
Bayes Classification Rule
• Density estimation:
[Figure: a real density overlaid with an estimate of it obtained from data]
Gaussian Densities
• Assume a multidimensional Gaussian density for each $P(X \mid C_i)$
Gaussian Densities (independent case)
• Get
$$P(X \mid C_i) = \frac{e^{-\frac{1}{2}\sum_{n=1}^{N}\frac{(x_n - \mu_n)^2}{\sigma_n^2}}}{(2\pi)^{N/2}\,\sigma_1\,\sigma_2 \cdots \sigma_N}$$
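A small sketch of the independent case (parameters hypothetical): with a diagonal covariance, the joint density is just the expression above, equivalently a product of per-dimension 1-D Gaussians.

```python
import numpy as np

def gaussian_indep_pdf(x, mu, sigma):
    """Joint Gaussian density with independent components (diagonal covariance)."""
    x, mu, sigma = map(np.asarray, (x, mu, sigma))
    expo = -0.5 * np.sum((x - mu) ** 2 / sigma ** 2)
    norm = (2 * np.pi) ** (len(x) / 2) * np.prod(sigma)
    return np.exp(expo) / norm

# Hypothetical 3-D example.
print(gaussian_indep_pdf([1.0, 2.0, 0.5], mu=[0.0, 2.0, 1.0], sigma=[1.0, 0.5, 2.0]))
```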
Gaussian Densities (dependent case)
$$P(X \mid C_i) = \frac{e^{-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)}}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma)}$$
where:
– $\Sigma$ ≡ covariance matrix ($N \times N$)
– $\det$ ≡ determinant
– $\mu$ ≡ mean vector $= \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_N \end{pmatrix}$, with $\mu_n = E(X_n)$
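A sketch of evaluating the general density (parameters hypothetical); scipy.stats.multivariate_normal computes the same quantity and is used here as a cross-check.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf_full(x, mu, cov):
    """Multivariate Gaussian density with a full covariance matrix."""
    x, mu, cov = np.asarray(x), np.asarray(mu), np.asarray(cov)
    n = len(x)
    z = x - mu
    expo = -0.5 * z @ np.linalg.inv(cov) @ z
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(expo) / norm

mu = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
x = np.array([0.5, 0.5])

print(gaussian_pdf_full(x, mu, cov))
print(multivariate_normal(mu, cov).pdf(x))  # should match
```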
Gaussian Densities (dependent case)
• Let $Z = X - \mu$
• $A$ ≡ matrix
Exercise for yourself!
Covariance
• The covariance between two variables is a measure of how they change together
Covariance
[Figure: three scatter plots — grades vs. IQ (positive covariance), grades vs. exam difficulty (negative covariance), and an unrelated pair (zero covariance)]
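A quick illustration with synthetic data (variable names hypothetical): the sign of the sample covariance matches the three cases in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(100, 15, size=500)
difficulty = rng.normal(5, 2, size=500)

grades_iq = 0.5 * iq + rng.normal(0, 3, size=500)            # rise with IQ
grades_diff = -4.0 * difficulty + rng.normal(0, 3, size=500)  # fall with difficulty
unrelated = rng.normal(0, 1, size=500)                        # independent of grades

print(np.cov(iq, grades_iq)[0, 1])            # > 0: positive covariance
print(np.cov(difficulty, grades_diff)[0, 1])  # < 0: negative covariance
print(np.cov(unrelated, grades_iq)[0, 1])     # ~ 0: zero covariance
```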
Covariance Matrix
• $\Sigma_{(n,m)} = E\!\left[(X_n - \mu_n)(X_m - \mu_m)\right]$ ≡ covariance$(X_n, X_m)$
$$\Sigma = \begin{pmatrix} E\!\left[(X_1 - \mu_1)^2\right] & E\!\left[(X_1 - \mu_1)(X_2 - \mu_2)\right] & \cdots \\ E\!\left[(X_1 - \mu_1)(X_2 - \mu_2)\right] & E\!\left[(X_2 - \mu_2)^2\right] & \vdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
Some Properties
• For the independent case, $\Sigma$ is a diagonal matrix:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_N^2 \end{pmatrix}$$
Some Properties
• For the independent case:
$$\Sigma^{-1} = \begin{pmatrix} \frac{1}{\sigma_1^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_N^2} \end{pmatrix}$$
• At the decision boundary between classes $C_i$ and $C_j$:
$$P(C_i)\,P(X \mid C_i) = P(C_j)\,P(X \mid C_j)$$
$$P(C_i)\,\frac{e^{-\frac{1}{2}(X - \mu_i)^T \Sigma_i^{-1}(X - \mu_i)}}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma_i)} = P(C_j)\,\frac{e^{-\frac{1}{2}(X - \mu_j)^T \Sigma_j^{-1}(X - \mu_j)}}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma_j)}$$
Decision boundaries
• For simplicity assume $\Sigma_i = \Sigma_j = \Sigma$:
$$P(C_i)\,\frac{e^{-\frac{1}{2}(X - \mu_i)^T \Sigma^{-1}(X - \mu_i)}}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma)} = P(C_j)\,\frac{e^{-\frac{1}{2}(X - \mu_j)^T \Sigma^{-1}(X - \mu_j)}}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma)}$$
Decision boundaries
• With $\Sigma_i = \Sigma_j = \Sigma$, take logs of both sides (the common normalization cancels):
$$\log P(C_i) - \frac{1}{2}(X - \mu_i)^T \Sigma^{-1}(X - \mu_i) = \log P(C_j) - \frac{1}{2}(X - \mu_j)^T \Sigma^{-1}(X - \mu_j)$$
• Expanding the quadratic forms (the $X^T \Sigma^{-1} X$ terms cancel):
$$2\log\frac{P(C_i)}{P(C_j)} = 2\mu_j^T \Sigma^{-1} X - \mu_j^T \Sigma^{-1} \mu_j - 2\mu_i^T \Sigma^{-1} X + \mu_i^T \Sigma^{-1} \mu_i$$
$$2\log\frac{P(C_i)}{P(C_j)} = 2\left(\mu_j^T \Sigma^{-1} - \mu_i^T \Sigma^{-1}\right) X - \left(\mu_j^T \Sigma^{-1} \mu_j - \mu_i^T \Sigma^{-1} \mu_i\right)$$
Linear in $X$ → a linear classifier!
Decision boundaries
• From
$$2\log\frac{P(C_i)}{P(C_j)} = 2\left(\mu_j^T \Sigma^{-1} - \mu_i^T \Sigma^{-1}\right) X - \left(\mu_j^T \Sigma^{-1} \mu_j - \mu_i^T \Sigma^{-1} \mu_i\right)$$
• Let:
$$W^T = 2\left(\mu_j^T \Sigma^{-1} - \mu_i^T \Sigma^{-1}\right) = \left(2\,\Sigma^{-1}(\mu_j - \mu_i)\right)^T$$
$$W_0 = -\left(\mu_j^T \Sigma^{-1} \mu_j - \mu_i^T \Sigma^{-1} \mu_i\right) - 2\log\frac{P(C_i)}{P(C_j)}$$
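A sketch of the resulting linear classifier (parameters hypothetical): compute $W$ and $W_0$ from the class means, the shared covariance, and the priors, then decide by the sign of $W^T X + W_0$. With the definitions above, $W^T X + W_0 < 0$ on $C_i$'s side of the boundary.

```python
import numpy as np

# Hypothetical shared-covariance setup.
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
p_i, p_j = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
W = 2 * Sigma_inv @ (mu_j - mu_i)
W0 = -(mu_j @ Sigma_inv @ mu_j - mu_i @ Sigma_inv @ mu_i) - 2 * np.log(p_i / p_j)

def classify(x):
    # W^T x + W0 < 0 on C_i's side of the boundary, > 0 on C_j's side.
    return "C_i" if W @ x + W0 < 0 else "C_j"

print(classify(np.array([0.5, 0.5])))  # near mu_i -> C_i
print(classify(np.array([2.5, 2.0])))  # near mu_j -> C_j
```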
Decision boundaries
• Decision boundary:
$$W^T X + W_0 = 0 \;\;\rightarrow\;\; \text{Linear classifier!}$$
[Figure: scatter diagram in the $(X_1, X_2)$ plane showing the data variation around the class means $\mu_i$ and $\mu_j$, with the linear decision boundary between them]
Applying Bayes Rule
• One way to apply Bayes rule in practical situations:
– To obtain the form of each density we need $\mu_i$ and $\Sigma_i$ for each class $i$ → estimate them from the training set
Estimate 𝝁 and Σ
• Estimate 𝝁 and Σ for a particular class:
– We know that: $\mu = E[X] = \begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_N) \end{pmatrix}$
– An estimate of $\mu_n \equiv E(X_n)$ is the average:
$$\hat{\mu}_n = \frac{1}{M}\sum_{m=1}^{M} X_n(m)$$
$$\hat{\mu} = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \\ \vdots \\ \hat{\mu}_N \end{pmatrix} = \frac{1}{M}\sum_{m=1}^{M} X(m)$$
where $M$ is the number of training patterns belonging to the considered class.
Estimate 𝝁 and Σ
• Estimate 𝝁 and Σ for a particular class:
– We know that:
$$\text{estimate of variance} = \frac{1}{M}\sum_{m=1}^{M}\left(X_n(m) - \hat{\mu}_n\right)^2$$
– Estimate of $\Sigma$:
$$\hat{\Sigma} = \frac{1}{M}\sum_{m=1}^{M}\left(X(m) - \hat{\mu}\right)\left(X(m) - \hat{\mu}\right)^T$$
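A sketch of these estimators on synthetic data (the true parameters below are hypothetical). Note this is the maximum-likelihood $1/M$ covariance estimate from the slide, whereas np.cov defaults to the unbiased $1/(M-1)$ version.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([2.0, -1.0])
true_Sigma = np.array([[1.5, 0.4],
                       [0.4, 0.8]])

# M training patterns of one class, shape (M, N).
X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)
M = X.shape[0]

mu_hat = X.mean(axis=0)           # (1/M) sum_m X(m)
Z = X - mu_hat
Sigma_hat = (Z.T @ Z) / M         # (1/M) sum_m (X(m) - mu_hat)(X(m) - mu_hat)^T

print(mu_hat)      # close to true_mu
print(Sigma_hat)   # close to true_Sigma
```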