Decision Theory - Part I (5mar24)
• The minimum-risk theory is based (like the MAP rule) on the optimization of a probabilistic criterion.
• It can be applied to classification by taking as actions the assignments of samples to classes. However, it can be regarded as a more general theory, since
– in general there is no one-to-one correspondence between classes and actions;
– it exploits additional information about the costs of the actions.
• Notation:
– Set of classes: Ω = {ω1, ω2, ..., ωM};
– Set of possible actions: A = {α1, α2, ..., αR}.
Cost matrix
• Conditional risk
– Since action costs depend on the classes, which are unknown, we compute, for each pattern x, the conditional risk R(αi|x) of performing the action αi given the pattern x:

    R(\alpha_i \mid x) = \sum_{j=1}^{M} c(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = E\{ c(\alpha_i \mid \omega) \mid x \}

– The conditional risk can be seen as the average (mean) cost incurred if we decide for the action αi, where the average is computed w.r.t. the class posterior probabilities P(ωj|x).
• Decision criterion according to the minimum-risk theory
– Given the pattern x, we choose the action αj that corresponds to the minimum conditional risk:

    x \rightarrow \alpha_j \iff R(\alpha_j \mid x) \le R(\alpha_i \mid x), \quad i = 1, 2, \ldots, R

– The corresponding value of the risk, R*, is called the Bayes risk.
Special case
• If M = R = 2, with P1 = P(ω1) and P2 = P(ω2), then:
– C is a 2 × 2 square matrix;
– if c21 = c12 (and c11 = c22), i.e. no unbalancing between the costs of the "wrong" actions, the MAP classifier is obtained:

    L(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P_2}{P_1} \qquad \text{(MAP)}
• Multiclass and multiaction (M = R) case
– As in the general case, the action that minimizes the conditional risk is chosen:

    R(\alpha_i \mid x) = \sum_{j=1}^{M} c(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)

– The MAP decision rule is obtained again if c_{ij} = 1 − δ_{ij}, where δ_{ij} is the Kronecker delta: this is the so-called "0-1 cost matrix" situation.
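A minimal numerical sketch of this rule (not from the slides; the cost matrix and posteriors below are illustrative): compute R(αi|x) for every action as a matrix-vector product and pick the action with the smallest conditional risk. With the 0-1 cost matrix this reduces to the MAP choice.

```python
import numpy as np

def min_risk_action(cost, posteriors):
    """Pick the action with minimum conditional risk.

    cost[i, j]    = c(alpha_i | omega_j)   (R x M cost matrix)
    posteriors[j] = P(omega_j | x)         (class posteriors for the pattern x)
    """
    cond_risk = cost @ posteriors           # R(alpha_i | x) for every action
    return int(np.argmin(cond_risk)), cond_risk

# Illustrative numbers only (M = R = 3).
posteriors = np.array([0.2, 0.5, 0.3])
zero_one = 1.0 - np.eye(3)                  # "0-1" cost matrix -> MAP decision
print(min_risk_action(zero_one, posteriors))    # chooses the most probable class (index 1)
```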
ML Classifier
• A maximum likelihood (ML) classifier assigns each sample to the class corresponding to the maximum value of the class-conditional pdfs:

    x \rightarrow \omega_i \iff p(x \mid \omega_i) = \max_{j} \, p(x \mid \omega_j)

• The minimum-risk decision rule requires the following input data:
– class-conditional pdfs p(x|ωi), i = 1, 2, ..., M;
– cost matrix C;
– prior probabilities Pi, i = 1, 2, ..., M.
• Given these data, the minimum-risk discriminant function is deduced from the decision rule. For example, in the case M = R = 2, if the conditional pdfs p(x|ω1) and p(x|ω2) are continuous functions, the boundary between the decision regions R1 and R2 is obtained by imposing the equality:

    \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} = \frac{P_2 (c_{12} - c_{22})}{P_1 (c_{21} - c_{11})}
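For example (a hedged sketch with made-up priors, costs, and Gaussian class-conditional pdfs, not the slides' data), the two-class minimum-risk rule can be applied pointwise by comparing the likelihood ratio with the threshold P2(c12 − c22)/(P1(c21 − c11)):

```python
import math

def decide(x, P1, P2, pdf1, pdf2, c11, c12, c21, c22):
    """Two-class minimum-risk decision: returns the chosen class (1 or 2)."""
    threshold = (P2 * (c12 - c22)) / (P1 * (c21 - c11))
    L = pdf1(x) / pdf2(x)                     # likelihood ratio p(x|w1)/p(x|w2)
    return 1 if L > threshold else 2

# Hypothetical setup: unit-variance Gaussians centred at 0 (class 1) and 2 (class 2).
gauss = lambda mean: (lambda x: math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi))
print(decide(0.7, P1=0.5, P2=0.5, pdf1=gauss(0.0), pdf2=gauss(2.0),
             c11=0.0, c12=1.0, c21=2.0, c22=0.0))
```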
Global risk and Hypothesis testing
[Figure: hypothesis-testing setup — a source emits H0 or H1; the observation x, with conditional pdf f(y|H1) under H1, is compared with the decision regions: Z0 ⇒ decide H0, Z1 ⇒ decide H1.]
– cij is the cost associated with the decision Di, given that the hypothesis Hj is true. It corresponds to the cost cij defined in the minimum-risk theory in the more general case. In practice we will always have c01 > c11 and c10 > c00.
– The Bayesian decision rule minimizes the global risk, i.e. the average cost with respect to the probabilities of H0 and H1 over the whole observation space Z.
– We want to determine the decision rule corresponding to the optimum decision regions Z0 and Z1 in the sense of the minimum overall (global) risk:

    \mathcal{R} = E\{\text{cost}\} = c_{00} P(D_0, H_0) + c_{01} P(D_0, H_1) + c_{10} P(D_1, H_0) + c_{11} P(D_1, H_1)
Risk computation
    P(D_i, H_j) = P(D_i \mid H_j)\, P(H_j)

    P(D_0 \mid H_0) = \int_{Z_0} p(x \mid H_0)\, dx = 1 - P_F, \qquad P(D_1 \mid H_0) = \int_{Z_1} p(x \mid H_0)\, dx = P_F

    P(D_0 \mid H_1) = \int_{Z_0} p(x \mid H_1)\, dx = P_M = 1 - P_D, \qquad P(D_1 \mid H_1) = \int_{Z_1} p(x \mid H_1)\, dx = P_D

    P\{\text{correct decision}\} = P_c = P(D_0, H_0) + P(D_1, H_1) = (1 - P_F) P_0 + P_D P_1

    P\{\text{error}\} = P_e = P(D_0, H_1) + P(D_1, H_0) = P_M P_1 + P_F P_0

    \mathcal{R} = c_{00}(1 - P_F) P_0 + c_{01}(1 - P_D) P_1 + c_{10} P_F P_0 + c_{11} P_D P_1
                = c_{00}(1 - P_F) P_0 + c_{01} P_M P_1 + c_{10} P_F P_0 + c_{11}(1 - P_M) P_1
• If we optimize the overall risk over the regions Z0 and Z1, the decision rule we obtain is identical to the one derived by operating locally on the conditional risk of each sample x.
• So we have verified that the local minimum-risk decision rule also optimizes the global risk.
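As a sketch (with arbitrary illustrative numbers), the global risk can be assembled directly from PF, PD, the priors, and the cost entries, following the expansion above:

```python
def global_risk(PF, PD, P0, P1, c00, c01, c10, c11):
    """Overall (global) risk R = E{cost} for given PF, PD, priors and costs."""
    PM = 1.0 - PD                        # missed-detection probability
    return (c00 * (1.0 - PF) * P0        # decide D0, H0 true (correct)
            + c01 * PM * P1              # decide D0, H1 true (miss)
            + c10 * PF * P0              # decide D1, H0 true (false alarm)
            + c11 * PD * P1)             # decide D1, H1 true (correct)

# Arbitrary illustrative operating point and costs.
print(global_risk(PF=0.10, PD=0.85, P0=0.5, P1=0.5, c00=0.0, c01=1.0, c10=1.0, c11=0.0))
```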
MAP as a special case of Minimum risk
• Also with the global approach, we can verify that the MAP
classification rule is a special case of the minimum risk
decision rule.
– In the case of the "0-1" cost matrix we have

    C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \eta = \frac{P_0}{P_1}

– Therefore, the minimum-risk decision rule based on such a cost matrix is:

    L(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{P_0}{P_1} \qquad \text{(MAP)}

    P_D = \int_{Z_1(\eta)} p(x \mid H_1)\, dx = P_D(\eta)
Neyman-Pearson criterion: Introduction
• When the prior probabilities and the costs (entries of the cost matrix) are unknown, we can use the Neyman-Pearson approach.
– In this context the desired false-alarm probability is assumed to be known, PF = α, or at least PF is required not to exceed a given value α.
– The Neyman-Pearson criterion maximizes PD (or, equivalently, minimizes PM)* under the constraint PF = α.
– For this purpose we introduce a Lagrange multiplier λ ≥ 0 and minimize the following functional:

    J = P_M + \lambda (P_F - \alpha) = \int_{Z_0} p(x \mid H_1)\, dx + \lambda \left[ \int_{Z_1} p(x \mid H_0)\, dx - \alpha \right]
      = \int_{Z_0} p(x \mid H_1)\, dx + \lambda \left[ 1 - \int_{Z_0} p(x \mid H_0)\, dx - \alpha \right]
      = \lambda (1 - \alpha) + \int_{Z_0} \left[ p(x \mid H_1) - \lambda\, p(x \mid H_0) \right] dx

    P_F = \int_{Z_1} p(x \mid H_0)\, dx = \alpha \qquad \text{(constraint)}

    \int_{\lambda}^{+\infty} p_L(L \mid H_0)\, dL = \alpha
* Note: PF does not always vary continuously with λ (if the pdfs contained impulses, continuity would fail); therefore, in general, the Neyman-Pearson test is formulated with the condition PF ≤ α.
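A possible numerical sketch of the Neyman-Pearson recipe, assuming the Gaussian setting used later in Example 1 (so that PF = Q(γ/σ); the value α = 0.05 is illustrative): bisect on the threshold γ until the false-alarm probability matches the prescribed α.

```python
from math import erfc, sqrt

def Q(x):
    """Gaussian tail probability Q(x) = P{N(0,1) > x}."""
    return 0.5 * erfc(x / sqrt(2.0))

def np_threshold(alpha, sigma=1.0, lo=-10.0, hi=10.0):
    """Bisection for gamma such that PF = Q(gamma / sigma) = alpha."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Q(mid / sigma) > alpha:       # false-alarm rate still too high: raise threshold
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = np_threshold(alpha=0.05, sigma=1.0)
print(gamma, Q(gamma))                   # gamma ~ 1.645, PF ~ 0.05
```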
• Remark: other rules can be defined based on Bayesian decision theory, such as Minimax (see Appendix), which take the same form as the rules introduced above, i.e., a likelihood ratio test.
Receiver Operating Characteristic (ROC)
– ROC curves in this case are parameterized with respect to h = m/σ = 0.5, 1, 2, …

[Figure: ROC curves (PD versus PF, both in [0, 1]) for h = 0.5, 1, 2; each curve ends at (PF, PD) = (1, 1), reached as the threshold goes to 0.]
ROC curves: properties
• The slope of the tangent to the ROC curve coincides with the threshold value η to which the probabilities PF and PD correspond:

    \frac{dP_D}{dP_F} = \eta

– Demonstration:

    P_F = \int_{\eta}^{+\infty} p_L(L \mid H_0)\, dL \;\Rightarrow\; \frac{dP_F}{d\eta} = -\,p_L(\eta \mid H_0)

    P_D = \int_{\eta}^{+\infty} p_L(L \mid H_1)\, dL \;\Rightarrow\; \frac{dP_D}{d\eta} = -\,p_L(\eta \mid H_1)

    \frac{dP_D}{dP_F} = \frac{p_L(\eta \mid H_1)}{p_L(\eta \mid H_0)} = \eta

  (the last equality holds because L is the likelihood ratio, so p_L(η|H1) = η p_L(η|H0).)
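A quick numerical check of this property (illustrative Gaussian setting with m = σ = 1, as in Example 1 below): build the ROC by sweeping the threshold γ on x and compare the local slope dPD/dPF with the likelihood-ratio value η = L(γ) at the same operating point.

```python
import numpy as np
from math import erfc, sqrt

def Q(z):
    """Gaussian tail probability Q(z) = P{N(0,1) > z}."""
    return 0.5 * erfc(z / sqrt(2.0))

m, sigma = 1.0, 1.0                          # illustrative parameters (h = m/sigma = 1)
gammas = np.linspace(-3.0, 3.0, 1201)        # threshold on the observation x

pf = np.array([Q(g / sigma) for g in gammas])         # P_F(gamma)
pd = np.array([Q((g - m) / sigma) for g in gammas])   # P_D(gamma)

slope = np.gradient(pd, pf)                           # numerical dP_D / dP_F
eta = np.exp((2 * m * gammas - m ** 2) / (2 * sigma ** 2))   # L(gamma) = eta
print(np.max(np.abs(slope / eta - 1.0)[1:-1]))        # ~0: the slope equals eta
```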
• Example 1
– One-dimensional case (n = 1);
– Two Gaussian classes: p(x|H0) = N(0, σ²), p(x|H1) = N(m, σ²):

    p(x \mid H_0) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{x^2}{2\sigma^2} \right), \qquad
    p(x \mid H_1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - m)^2}{2\sigma^2} \right)

[Figure: the two Gaussian pdfs, centred at 0 and m.]
Example 1: Likelihood test
    L(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} = \exp\!\left( -\frac{m^2 - 2mx}{2\sigma^2} \right)
    \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta = \frac{P_0 (c_{10} - c_{00})}{P_1 (c_{01} - c_{11})}

    \ln L(x) = -\frac{m^2 - 2mx}{2\sigma^2} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln \eta
    \;\;\Longrightarrow\;\;
    x \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{\sigma^2}{m} \ln \eta + \frac{m}{2} = \gamma

    Z_0 = (-\infty, \gamma) \quad \text{and} \quad Z_1 = (\gamma, +\infty)

  (x = γ can be arbitrarily assigned to either H0 or H1.)

[Figure: the threshold γ on the x axis, between 0 and m; the tail of p(x|H0) beyond γ is PF.]
Example 1: PF and PD
    P_F = P(D_1 \mid H_0) = P\{ x > \gamma \mid H_0 \}
        = \int_{\gamma}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{x^2}{2\sigma^2} \right) dx
        = Q\!\left( \frac{\gamma}{\sigma} \right)

    P_D = P(D_1 \mid H_1) = P\{ x > \gamma \mid H_1 \}
        = \int_{\gamma}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - m)^2}{2\sigma^2} \right) dx
        = Q\!\left( \frac{\gamma - m}{\sigma} \right)

    P_M = 1 - P_D = 1 - Q\!\left( \frac{\gamma - m}{\sigma} \right) = Q\!\left( \frac{m - \gamma}{\sigma} \right)

    P_e = P_1 P_M + P_0 P_F; \quad \text{in the case of equiprobable classes:}\quad P_e = \frac{P_F + P_M}{2}

    \text{where } Q(x) = \int_{x}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{y^2}{2} \right) dy
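A small numerical sketch of these formulas (the values m = 2, σ = 1 and the MAP threshold η = 1 are illustrative, not from the slides):

```python
from math import erfc, log, sqrt

def Q(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * erfc(x / sqrt(2.0))

m, sigma = 2.0, 1.0          # illustrative parameters
eta = 1.0                    # e.g. MAP threshold with equal priors and 0-1 costs

gamma = (sigma ** 2 / m) * log(eta) + m / 2.0   # threshold on x
PF = Q(gamma / sigma)                           # false-alarm probability
PD = Q((gamma - m) / sigma)                     # detection probability
PM = 1.0 - PD                                   # miss probability
Pe = 0.5 * (PF + PM)                            # error probability (equiprobable classes)
print(gamma, PF, PD, Pe)     # gamma = 1.0, PF ~ 0.159, PD ~ 0.841, Pe ~ 0.159
```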
• Example 2:
– One-dimensional case (n = 1);
– Non-Gaussian pdfs:

    p(x \mid H_0) = \frac{1}{2(1 - e^{-1})} \exp(-|x|), \quad |x| \le 1 \qquad (0 \text{ otherwise})

    p(x \mid H_1) = \frac{1}{2}, \quad |x| \le 1 \qquad (0 \text{ otherwise})

[Figure: p(x|H0) and p(x|H1) on [-1, 1]; p(x|H0) peaks at 1/(2(1 - e^{-1})) in x = 0, while p(x|H1) is uniform at 1/2.]
Example 2: Maximum likelihood
    p(x \mid H_1) = p(x \mid H_0) \;\Longleftrightarrow\; \frac{1}{2} = \frac{1}{2(1 - e^{-1})} \exp(-|x|) \quad \text{for } x \in [-1, 1]

  two solutions: x = ±0.46 (note that two thresholds are obtained for the feature x)

    \text{decide } H_0 \text{ if } |x| < 0.46, \qquad \text{decide } H_1 \text{ if } 0.46 < |x| \le 1

    Z_0 = [-0.46,\, 0.46], \qquad Z_1 = [-1, -0.46] \cup [0.46, 1]
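The two thresholds can be recovered numerically, e.g. by bisection on exp(−x) = 1 − e⁻¹ (a minimal sketch):

```python
from math import exp

target = 1.0 - exp(-1.0)        # p(x|H1) = p(x|H0)  <=>  exp(-|x|) = 1 - e^(-1)

lo, hi = 0.0, 1.0               # exp(-x) is decreasing on [0, 1]
for _ in range(60):             # simple bisection
    mid = 0.5 * (lo + hi)
    if exp(-mid) > target:
        lo = mid
    else:
        hi = mid
print(0.5 * (lo + hi))          # ~0.459: the two thresholds are x = +/- 0.46
```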
Example 2: Maximum likelihood
[Figure: p(x|H0) and p(x|H1) on [-1, 1] with the thresholds ±0.46; Z1 = [-1, -0.46] ∪ [0.46, 1] surrounds Z0 = [-0.46, 0.46]; the shaded areas indicate PF (tails of p(x|H0) in Z1) and PM (portion of p(x|H1) in Z0).]
    P_F = P(D_1 \mid H_0) = \frac{1}{2(1 - e^{-1})} \left[ \int_{-1}^{-0.46} \exp(x)\, dx + \int_{0.46}^{1} \exp(-x)\, dx \right] = 0.42

    P_M = P(D_0 \mid H_1) = \frac{1}{2} \cdot 2 \cdot 0.46 = 0.46

    P_e = P_1 P_M + P_0 P_F = \frac{P_M + P_F}{2} = 0.44
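These values can be checked numerically; a sketch (scipy's quad is used only for the integral of p(x|H0) over Z1):

```python
from math import exp
from scipy.integrate import quad

norm_const = 2.0 * (1.0 - exp(-1.0))
p0 = lambda x: exp(-abs(x)) / norm_const       # p(x|H0) on [-1, 1]
t = 0.46                                       # ML threshold from the previous slide

PF = 2.0 * quad(p0, t, 1.0)[0]                 # Z1 is symmetric: twice the right-hand part
PM = 0.5 * (2.0 * t)                           # integral of p(x|H1) = 1/2 over Z0
Pe = 0.5 * (PM + PF)                           # equiprobable hypotheses
print(PF, PM, Pe)                              # ~0.42, 0.46, 0.44
```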
Example 2: Neyman-Pearson
    P_F = P(D_1 \mid H_0) = \frac{1}{2(1 - e^{-1})} \left[ \int_{-1}^{-\gamma^*} \exp(x)\, dx + \int_{\gamma^*}^{1} \exp(-x)\, dx \right] = 0.5
    \;\;\Longrightarrow\;\; \gamma^* = 0.38

  According to the Neyman-Pearson criterion, the value of γ* is determined by imposing the desired value of the false-alarm probability (here PF = 0.5). The resulting probability of detection is:

    P_D = 2 \cdot \frac{1 - 0.38}{2} = 0.62

[Figure: the regions [-1, -γ*] and [γ*, 1] on the x axis, with the corresponding areas of p(x|H0) (PF) and p(x|H1) (PD).]
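A numerical sketch of this step: solve PF(γ*) = 0.5 for γ* by bisection and then evaluate PD on the resulting Z1.

```python
from math import exp

def PF(g):
    """False-alarm probability for Z1 = [-1, -g] U [g, 1]."""
    return (exp(-g) - exp(-1.0)) / (1.0 - exp(-1.0))

alpha = 0.5                      # desired false-alarm probability
lo, hi = 0.0, 1.0                # PF(g) is decreasing in g
for _ in range(60):              # bisection for PF(g*) = alpha
    mid = 0.5 * (lo + hi)
    if PF(mid) > alpha:
        lo = mid
    else:
        hi = mid
g_star = 0.5 * (lo + hi)
PD = 2.0 * 0.5 * (1.0 - g_star)  # p(x|H1) = 1/2 over the two symmetric intervals of Z1
print(g_star, PF(g_star), PD)    # ~0.38, 0.50, 0.62
```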
Example 3
• Example 3:
– One-dimensional case (n = 1);
– Exponential pdfs:

    p(x \mid H_0) = \begin{cases} \exp(-x) & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
    \qquad
    p(x \mid H_1) = \begin{cases} \lambda \exp(-\lambda x) & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
    \qquad (\lambda > 1)

[Figure: the two exponential pdfs for λ = 2; p(x|H1) starts at 2 and decays faster than p(x|H0), which starts at 1.]
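Before looking at the ROC, the curve can be sketched numerically with the pdfs above (taking λ = 2 as in the figure): since L(x) = λ exp(−(λ−1)x) is decreasing in x for λ > 1, thresholding the likelihood ratio is equivalent to choosing Z1 = {x < t} for some t, so it suffices to sweep t.

```python
import numpy as np

lam = 2.0                                  # assumed parameter value (figure: lambda = 2)
ts = np.linspace(0.0, 10.0, 400)           # threshold on x; Z1 = {x < t}

PF = 1.0 - np.exp(-ts)                     # P{x < t | H0} for p(x|H0) = exp(-x)
PD = 1.0 - np.exp(-lam * ts)               # P{x < t | H1} for p(x|H1) = lam*exp(-lam*x)

# Each (PF, PD) pair is a point of the ROC; analytically PD = 1 - (1 - PF)**lam.
print(np.max(np.abs(PD - (1.0 - (1.0 - PF) ** lam))))   # ~0, confirms the relation
```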
Example 3: ROC curves