Lecture 4

Recap

• We studied the Bayes classifier in the last class.

• We saw the Bayes classifier for a general loss function and proved its optimality.

• We also saw many special cases and how one can analytically derive the Bayes classifier for some simple cases of class conditional densities.

An Example

Consider another example of deriving the Bayes classifier.

• Suppose we have K classes. The classifier is allowed the option to 'reject' a pattern; this is done by the classifier assigning class K + 1 to the pattern.

Define the loss function by

  L(i, j) = 0    if i = j, i, j = 1, · · · , K
          = ρm   if i, j = 1, · · · , K and i ≠ j
          = ρr   if i = K + 1

Now we want to derive the Bayes classifier in terms of the posterior probabilities.

Example Contd.

• Recall that the Bayes classifier is

  hB(X) = αi  if  R(αi | X) ≤ R(αj | X), ∀j,

  where

  R(αi | X) = Σ_{j=1}^{K} L(αi, Cj) qj(X)

• So, we now need to calculate R(αi | X) for the different actions αi available to the classifier.

• For αi = 1, · · · , K, we have L(αi, Cj) = ρm if αi ≠ Cj and it is zero otherwise.

• Hence, R(i | X) = Σ_{j≠i} ρm qj(X) = ρm (1 − qi(X)).

• Also, R(K + 1 | X) = Σ_j ρr qj(X) = ρr.

• Hence, hB(X) = i, 1 ≤ i ≤ K, if

  ρm (1 − qi(X)) ≤ ρm (1 − qj(X)), ∀j,   and
  ρm (1 − qi(X)) ≤ ρr


• Thus, hB(X) = i, 1 ≤ i ≤ K, if

  (i).  qi(X) ≥ qj(X), ∀j, and
  (ii). qi(X) ≥ 1 − ρr/ρm;

  else hB(X) = K + 1.

• If ρr ≥ ρm – never reject a pattern!

• If ρr = 0 – always reject the pattern (unless you are absolutely sure, i.e., some qi(X) = 1).
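
The resulting rule is easy to implement once the posteriors qi(X) are available. Below is a minimal sketch in Python (assuming NumPy; the function name, the posteriors array and the loss values are only illustrative):

import numpy as np

def bayes_with_reject(posteriors, rho_m, rho_r):
    # posteriors: array of shape (n_patterns, K); returns labels in {1,...,K}
    # or K+1 for 'reject', implementing conditions (i) and (ii) above.
    posteriors = np.asarray(posteriors, dtype=float)
    K = posteriors.shape[1]
    best = np.argmax(posteriors, axis=1)                    # condition (i)
    q_max = posteriors[np.arange(len(posteriors)), best]
    reject = q_max < 1.0 - rho_r / rho_m                    # condition (ii) fails
    labels = best + 1                                       # classes are 1,...,K
    labels[reject] = K + 1
    return labels

# Example: K = 3 classes, rho_r/rho_m = 0.3, so reject whenever max posterior < 0.7
q = np.array([[0.9, 0.05, 0.05], [0.5, 0.3, 0.2], [0.4, 0.35, 0.25]])
print(bayes_with_reject(q, rho_m=1.0, rho_r=0.3))           # -> [1 4 4]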


Finding Bayes Error

• Given class conditional densities, the Bayes classifier is easily computed.

• We may also want to compute the Bayes error.

• Gives us the expected performance. Also lets us decide whether we need better features.

• For the case of the 0-1 loss function, we need to evaluate

  ∫_{ℜ^n} min( p0 f0(X), p1 f1(X) ) dX

• In general, this is a difficult integral to evaluate.


• Let us consider the simplest case: a 2-class problem, X ∈ ℜ, normal class conditional densities and the 0-1 loss function.

• Assume equal priors. Let σ0 = σ1 = σ and µ0 < µ1.

• Then hB(X) = 0 if X < (µ0 + µ1)/2.

• Then, the Bayes error is

  P(error) = 0.5 ∫_{−∞}^{(µ0+µ1)/2} f1(X) dX + 0.5 ∫_{(µ0+µ1)/2}^{∞} f0(X) dX


• Put Z = (X − µ1)/σ in the first integral and Z = (X − µ0)/σ in the second.

• Now both f1 and f0 become the standard normal density.

• The upper limit in the first integral becomes (µ0 − µ1)/(2σ) and the lower limit in the second integral becomes (µ1 − µ0)/(2σ).

Now we get

  P(error) = 0.5 Φ( (µ0 − µ1)/(2σ) ) + 0.5 [ 1 − Φ( (µ1 − µ0)/(2σ) ) ]

           = Φ( (µ0 − µ1)/(2σ) )

Here, Φ is the distribution function of the standard normal random variable.

The quantity |µ0 − µ1|/σ is called the discriminability.
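
This closed form is easy to check numerically. A small sketch (assuming SciPy; the parameter values are arbitrary illustrations), comparing the formula with a Monte Carlo estimate of the error of hB:

import numpy as np
from scipy.stats import norm

mu0, mu1, sigma = 0.0, 2.0, 1.0                  # illustrative values
p_err_formula = norm.cdf((mu0 - mu1) / (2 * sigma))

rng = np.random.default_rng(0)
n = 200000
x0 = rng.normal(mu0, sigma, n)                   # samples from class 0
x1 = rng.normal(mu1, sigma, n)                   # samples from class 1
tau = (mu0 + mu1) / 2                            # Bayes threshold for equal priors
p_err_mc = 0.5 * np.mean(x0 >= tau) + 0.5 * np.mean(x1 < tau)

print(p_err_formula, p_err_mc)                   # both close to Phi(-1), about 0.1587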


• In the general case, we need to evaluate

  P(error) = ∫_{ℜ^n} min( p0 f0(X), p1 f1(X) ) dX

• A useful inequality here is

  min(a, b) ≤ a^β b^(1−β),   ∀ a, b ≥ 0, 0 ≤ β ≤ 1.

• It is easy to prove. Suppose a < b. Then

  a^β b^(1−β) = a · a^(β−1) b^(1−β) = a (b/a)^(1−β) ≥ a = min(a, b)

• Hence we have (for the 0-1 loss function)

  P(error) ≤ p0^β p1^(1−β) ∫_{ℜ^n} f0^β(X) f1^(1−β)(X) dX

Suppose f0 ∼ N(µ0, Σ0) and f1 ∼ N(µ1, Σ1). Then we can show

  ∫ f0^β(X) f1^(1−β)(X) dX = exp( −K(β) )

where

  K(β) = [ β(1−β)/2 ] (µ1 − µ0)^t [ βΣ0 + (1−β)Σ1 ]^(-1) (µ1 − µ0)
         + (1/2) ln ( |βΣ0 + (1−β)Σ1| / ( |Σ0|^β |Σ1|^(1−β) ) )

• We thus have:  P(error) ≤ p0^β p1^(1−β) exp( −K(β) )

• We can choose a β and calculate a bound from this expression.

• To get a tighter bound we can choose β to minimize exp( −K(β) ). This gives the so-called Chernoff bound.

• Often this minimization can be difficult.

• In such cases, a useful choice is β = 0.5. This is known as the Bhattacharyya bound.

• The bound min(a, b) ≤ a^β b^(1−β) can always be used; the resulting integral may be complex for other densities. We can then use some numerical approximation.
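
For Gaussian class conditional densities the bound is easy to compute and to optimize over β numerically. A minimal sketch (assuming NumPy and SciPy; all parameter values are illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def k_beta(beta, mu0, mu1, S0, S1):
    # K(beta) for two Gaussian class conditional densities
    dm = mu1 - mu0
    S = beta * S0 + (1 - beta) * S1
    quad = 0.5 * beta * (1 - beta) * dm @ np.linalg.solve(S, dm)
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          (np.linalg.det(S0) ** beta * np.linalg.det(S1) ** (1 - beta)))
    return quad + logdet

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])        # illustrative values
S0, S1 = np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])
p0 = p1 = 0.5

def bound(beta):
    return p0 ** beta * p1 ** (1 - beta) * np.exp(-k_beta(beta, mu0, mu1, S0, S1))

res = minimize_scalar(bound, bounds=(1e-6, 1 - 1e-6), method='bounded')
print("Chernoff bound      :", res.fun, "at beta =", res.x)
print("Bhattacharyya bound :", bound(0.5))
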
Other Criteria

• The Bayes classifier is optimal for the criterion of risk minimization.

• There can be other criteria.

• The Bayes classifier depends on both pi, the prior probabilities, and fi, the class conditional densities.

• Suppose we do not want to rely on prior probabilities.

• We may want a classifier that does best against any (or the worst) prior probabilities.


• Consider a 2-class case.

• Let Ri(h) denote the subset of the feature space where h classifies into Class-i.

• Then the risk integral is

  R(h) = L(1,0) p0 ∫_{R1(h)} f0(X) dX + L(0,1) p1 ∫_{R0(h)} f1(X) dX

• We can simplify this to get rid of the dependence on priors. Using p1 = 1 − p0, we get

  R = L(1,0) p0 ∫_{R1} f0(X) dX + L(0,1) (1 − p0) ∫_{R0} f1(X) dX

    = L(0,1) ∫_{R0} f1(X) dX + p0 [ L(1,0) ∫_{R1} f0(X) dX − L(0,1) ∫_{R0} f1(X) dX ]


Minmax Classifier

• Consider a classifier such that

  L(1,0) ∫_{R1} f0(X) dX = L(0,1) ∫_{R0} f1(X) dX

• For this classifier the risk would be independent of the priors.

• Called the minmax classifier.

• We are minimizing the maximum possible (over all priors) risk.

• In general, finding the minmax classifier can be analytically complicated.
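
Even so, for a simple threshold classifier it can be found numerically: in the 1-D, equal-variance Gaussian case with h(X) = 1 iff X > τ, the condition above reduces to a one-dimensional root-finding problem in τ. A sketch of this (assuming SciPy; the means, σ and loss values are illustrative):

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

mu0, mu1, sigma = 0.0, 2.0, 1.0          # illustrative values
L10, L01 = 1.0, 2.0                      # L(1,0) and L(0,1)

def gap(tau):
    # With R1 = {X > tau} and R0 = {X <= tau}:
    err0 = 1 - norm.cdf((tau - mu0) / sigma)     # integral of f0 over R1
    err1 = norm.cdf((tau - mu1) / sigma)         # integral of f1 over R0
    return L10 * err0 - L01 * err1

tau_minmax = brentq(gap, mu0 - 10 * sigma, mu1 + 10 * sigma)
print("minmax threshold:", tau_minmax)
# At this tau the risk L10*p0*err0 + L01*(1-p0)*err1 no longer depends on p0.
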
Neyman-Pearson Criterion

• The Bayes classifier minimizes risk.

• It minimizes some weighted sum of all errors.

• We may not explicitly want to trade one type of error for another.

• One criterion: minimize the Type-II error under the constraint that the Type-I error is below some threshold.

• This is the Neyman-Pearson criterion.

• This could be useful in, e.g., biometric applications.

• Type-I error: wrongly classifying a Class-0 pattern.

• Suppose the upper bound on the Type-I error is α.

• The Neyman-Pearson classifier can also be expressed as a threshold on the likelihood ratio.


Neyman-Pearson Classifier

• The Neyman-Pearson classifier, hNP, is characterized by: given any α ∈ (0, 1),

  1. P[ hNP(X) = 1 | X ∈ C-0 ] ≤ α

  2. P[ hNP(X) = 0 | X ∈ C-1 ] ≤ P[ h(X) = 0 | X ∈ C-1 ]
     for all h such that P[ h(X) = 1 | X ∈ C-0 ] ≤ α


Neyman-Pearson Classifier

• Let the bound on the Type-I error be α. Then

  hNP(X) = 1   if f1(X)/f0(X) > K
         = 0   otherwise

  where K is such that

  P[ f1(X)/f0(X) ≤ K | X ∈ C-0 ] = 1 − α

  (We assume P{ X : f1(X) = K f0(X) } = 0, for simplicity.)
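
When this quantile has no convenient closed form, K can be approximated as the (1 − α) quantile of the likelihood ratio computed over samples drawn from class 0. A minimal sketch (assuming NumPy and SciPy; the two densities and the sample size are illustrative assumptions, not part of the construction above):

import numpy as np
from scipy.stats import norm

alpha = 0.1
mu0, s0 = 0.0, 1.0                       # illustrative class conditional densities
mu1, s1 = 2.0, 1.5

def ratio(x):                            # likelihood ratio f1(x)/f0(x)
    return norm.pdf(x, mu1, s1) / norm.pdf(x, mu0, s0)

rng = np.random.default_rng(0)
x0 = rng.normal(mu0, s0, 100000)         # samples from class 0
K = np.quantile(ratio(x0), 1 - alpha)    # P[ratio <= K | C-0] close to 1 - alpha

h_np = lambda x: (ratio(x) > K).astype(int)
print("Type-I error:", h_np(rng.normal(mu0, s0, 100000)).mean())   # close to alpha
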
• We now prove that this satisfies the NP criterion. By construction, we have

  P[ hNP(X) = 1 | X ∈ C-0 ] = P[ f1(X)/f0(X) > K | X ∈ C-0 ] = α

• So, we need to show that its Type-II error is less than that of any other classifier satisfying the constraint on the Type-I error.


• Let h be any classifier such that

  P[ h(X) = 1 | X ∈ C-0 ] ≤ α

• To complete the proof we have to show that

  P[ hNP(X) = 0 | X ∈ C-1 ] ≤ P[ h(X) = 0 | X ∈ C-1 ]

  or, equivalently,

  P[ hNP(X) = 1 | X ∈ C-1 ] ≥ P[ h(X) = 1 | X ∈ C-1 ]


• Consider the integral

  I = ∫_{ℜ^n} ( hNP(x) − h(x) ) ( f1(x) − K f0(x) ) dx

    = ∫_{f1 > K f0} ( hNP(x) − h(x) ) ( f1(x) − K f0(x) ) dx
      + ∫_{f1 ≤ K f0} ( hNP(x) − h(x) ) ( f1(x) − K f0(x) ) dx

• We first show that this integral is always non-negative.


• When f1(x) > K f0(x), we have hNP(x) − h(x) = 1 − h(x) ≥ 0, which implies

  ( hNP(x) − h(x) ) ( f1(x) − K f0(x) ) ≥ 0

• Similarly, when f1(x) < K f0(x), we have hNP(x) − h(x) = 0 − h(x) ≤ 0, which implies

  ( hNP(x) − h(x) ) ( f1(x) − K f0(x) ) ≥ 0

• This shows that I ≥ 0.


• Thus, we have

  ∫_{ℜ^n} ( hNP(x) − h(x) ) ( f1(x) − K f0(x) ) dx ≥ 0

• This implies

  ∫ hNP(x) f1(x) dx − ∫ h(x) f1(x) dx ≥ K [ ∫ hNP(x) f0(x) dx − ∫ h(x) f0(x) dx ]


Since hNP and h take values in {0, 1},

  ∫_{ℜ^n} hNP(x) f1(x) dx = P[ hNP(X) = 1 | X ∈ C-1 ]

and

  ∫_{ℜ^n} h(x) f1(x) dx = P[ h(X) = 1 | X ∈ C-1 ]

Similarly for the integrals involving f0.


• Hence we have

  P[ hNP(X) = 1 | X ∈ C-1 ] − P[ h(X) = 1 | X ∈ C-1 ]
      ≥ K [ P[ hNP(X) = 1 | X ∈ C-0 ] − P[ h(X) = 1 | X ∈ C-0 ] ]

• But for all h under consideration, the RHS above is non-negative. Hence

  P[ hNP(X) = 1 | X ∈ C-1 ] − P[ h(X) = 1 | X ∈ C-1 ] ≥ 0

• This completes the proof.


• The Neyman-Pearson classifier also needs knowledge of the class conditional densities.

• Like the Bayes classifier, it is also based on the ratio f1(X)/f0(X).

• In the Bayes classifier we say C-1 if f1(X)/f0(X) > p0 L(1,0) / ( p1 L(0,1) ).

• In NP, this threshold, K, is set based on the allowed Type-I error.


Example of NP classifier

• Take X ∈ ℜ and class conditional densities normal with equal variance. Let µ0 < µ1.

• Now the NP classifier is: if X > τ then C-1, where τ is simply determined by the Type-I error bound.

• This is intuitively clear. We will now derive it formally.


Example

Now (assuming µ1 > µ0),

  f1(X)/f0(X) = exp( − (X − µ1)^2/(2σ^2) + (X − µ0)^2/(2σ^2) )

              = exp( − (1/(2σ^2)) [ µ1^2 − µ0^2 − 2X(µ1 − µ0) ] )

              = exp( ( (µ1 − µ0)/(2σ^2) ) [ 2X − (µ1 + µ0) ] )


• We need to find K such that

  P[ ln( f1(X)/f0(X) ) ≤ ln K | X ∈ C-0 ] = 1 − α

• From the earlier expression, ln( f1(X)/f0(X) ) ≤ ln K is the same as

  ( (µ1 − µ0)/(2σ^2) ) [ 2X − (µ1 + µ0) ] ≤ ln K

  i.e.,   X ≤ σ^2 ln K / (µ1 − µ0) + (µ1 + µ0)/2

Hence we have (writing P[A | X ∈ C-0] as P0[A])

  P0[ ln( f1(X)/f0(X) ) ≤ ln K ] = P0[ X ≤ σ^2 ln K / (µ1 − µ0) + (µ1 + µ0)/2 ]

                                 = P0[ (X − µ0)/σ ≤ σ ln K / (µ1 − µ0) + (µ1 − µ0)/(2σ) ]

                                 = Φ( σ ln K / (µ1 − µ0) + (µ1 − µ0)/(2σ) )

We need this quantity to be equal to (1 − α).


Thus we want

  Φ( σ ln K / (µ1 − µ0) + (µ1 − µ0)/(2σ) ) = 1 − α

This gives us an expression for ln K:

  σ ln K / (µ1 − µ0) = Φ^(-1)(1 − α) − (µ1 − µ0)/(2σ)

or

  ln K = ( (µ1 − µ0)/σ ) Φ^(-1)(1 − α) − (µ1 − µ0)^2/(2σ^2)

We say X ∈ C-1 if ln( f1(X)/f0(X) ) > ln K. That is,

  ( (µ1 − µ0)/(2σ^2) ) [ 2X − (µ1 + µ0) ] > ( (µ1 − µ0)/σ ) Φ^(-1)(1 − α) − (µ1 − µ0)^2/(2σ^2)

  i.e.,   2X − (µ1 + µ0) > 2σ Φ^(-1)(1 − α) − (µ1 − µ0)

  i.e.,   X > σ Φ^(-1)(1 − α) + µ0


Thus the NP classifier puts X in C-1 if

  X > σ Φ^(-1)(1 − α) + µ0

  i.e.,   Φ( (X − µ0)/σ ) > 1 − α

This means the NP classifier puts X in C-1 if X > τ, where ∫_τ^∞ f0(X) dX = α.
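
This threshold is straightforward to verify numerically. A short sketch (assuming SciPy; the values of µ0, µ1, σ and α are illustrative):

import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, alpha = 0.0, 2.0, 1.0, 0.05     # illustrative values

tau = mu0 + sigma * norm.ppf(1 - alpha)          # NP threshold: X > tau => C-1
type1 = 1 - norm.cdf((tau - mu0) / sigma)        # equals alpha by construction
type2 = norm.cdf((tau - mu1) / sigma)            # the resulting Type-II error

print("tau =", tau, "Type-I =", type1, "Type-II =", type2)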


• Like the Bayes classifier, the NP classifier also needs knowledge of the class conditional densities.

• The NP classifier is only for the 2-class case.

• It is actually more important in hypothesis testing problems (the likelihood ratio test).


Receiver Operating Characteristic (ROC)

• Consider a one-dimensional feature space and a 2-class problem with a classifier h(X) = 0 if X < τ.

• Consider equal priors, Gaussian class conditional densities with equal variance, and the 0-1 loss. Now let us write the probability of error as a function of τ:

  P[error] = 0.5 ∫_{−∞}^{τ} f1(X) dX + 0.5 ∫_{τ}^{∞} f0(X) dX

           = 0.5 Φ( (τ − µ1)/σ ) + 0.5 [ 1 − Φ( (τ − µ0)/σ ) ]

• As we vary τ we trade one kind of error for another. In the Bayes classifier, the loss function determines the 'exchange rate'.


ROC curve

• The receiver operating characteristic (ROC) curve is one way to conveniently visualize and exploit this trade-off.

• For a two-class classifier there are four possible outcomes of a classification decision – two are correct decisions and two are errors.

• Let ei denote the probability of wrongly assigning class i, i = 0, 1. Then we have

  e0 = P[ X ≤ τ | X ∈ C-1 ]        (a miss)
  e1 = P[ X > τ | X ∈ C-0 ]        (false alarm)
  1 − e0 = P[ X > τ | X ∈ C-1 ]    (correct detection)
  1 − e1 = P[ X ≤ τ | X ∈ C-0 ]    (correct rejection)

• For fixed class conditional densities, if we vary τ the point (e1, 1 − e0) moves on a smooth curve in ℜ^2.

• This is traditionally called the ROC curve. (The choice of coordinates is arbitrary.)
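
For the 1-D Gaussian example the ROC points are easy to trace out. A small sketch (assuming SciPy; parameter values are illustrative, and with training data one would replace the Φ terms by empirical frequencies):

import numpy as np
from scipy.stats import norm

mu0, mu1, sigma = 0.0, 2.0, 1.0                  # illustrative values
taus = np.linspace(mu0 - 4 * sigma, mu1 + 4 * sigma, 200)

e1 = 1 - norm.cdf((taus - mu0) / sigma)          # false alarm: P[X > tau | C-0]
det = 1 - norm.cdf((taus - mu1) / sigma)         # detection: 1 - e0

# Each (e1[i], det[i]) is one operating point on the ROC curve; e.g. pick the
# point closest to the ideal corner (0, 1):
best = np.argmin(e1 ** 2 + (1 - det) ** 2)
print("tau =", taus[best], "false alarm =", e1[best], "detection =", det[best])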


• For any fixed τ we can estimate e0 and e1 from training data.

• Hence, by varying τ we can find the ROC and decide which may be the best operating point.

• This can be done for any threshold-based classifier, irrespective of the class conditional densities.

• When the class conditional densities are Gaussian with equal variance, we can use this procedure to estimate the Bayes error also.


• From our earlier error integral we get

  (τ − µ0)/σ = Φ^(-1)(1 − e1) = a, say
  (τ − µ1)/σ = Φ^(-1)(1 − (1 − e0)) = Φ^(-1)(e0) = b, say

• Then, |a − b| = |µ1 − µ0|/σ = d, the discriminability.

• Knowing e1 and (1 − e0), we can get d and hence the Bayes error. For our given τ we can also get the actual error probability. We can tweak τ to match the Bayes error.
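
A minimal sketch of this estimate from a single measured operating point (assuming SciPy; the operating-point values below are illustrative):

from scipy.stats import norm

e1, det = 0.10, 0.85            # measured (false alarm, detection) = (e1, 1 - e0)

a = norm.ppf(1 - e1)            # (tau - mu0)/sigma
b = norm.ppf(1 - det)           # (tau - mu1)/sigma, since e0 = 1 - det
d = abs(a - b)                  # discriminability |mu1 - mu0|/sigma

bayes_error = norm.cdf(-d / 2)  # Phi((mu0 - mu1)/(2 sigma)) for equal priors
print("d =", d, "Bayes error =", bayes_error)
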
• We can in general use the ROC curve in multidimensional cases also. Consider, for example, h(X) = sgn(W^t X + w0). We can use the ROC to fix w0 after learning W.
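
As an illustration, w0 can be picked from the scores W^t X on validation data so that the false-alarm rate stays below a target value. A sketch (assuming NumPy; the helper function, the random data standing in for a learned W and a validation set are all illustrative):

import numpy as np

def choose_w0(W, X0, target_false_alarm):
    # Pick w0 so that sgn(W.x + w0) = 1 on at most the given fraction of the
    # class-0 validation patterns X0 (one pattern per row).
    scores0 = X0 @ W                                 # W.x for class-0 patterns
    thr = np.quantile(scores0, 1 - target_false_alarm)
    return -thr                                      # h(x) = 1 iff W.x > thr

rng = np.random.default_rng(0)
W = np.array([1.0, -0.5])                            # stands in for a learned W
X0 = rng.normal(size=(1000, 2))                      # stands in for class-0 data
w0 = choose_w0(W, X0, target_false_alarm=0.05)
print("w0 =", w0, "false alarm:", np.mean(X0 @ W + w0 > 0))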


Summary

• The Bayes classifier is optimal for minimizing risk.

• We can derive the Bayes classifier if we know the class conditional densities.

• There are criteria other than minimizing risk.

• The minmax classifier and the Neyman-Pearson classifier are some such examples.

• ROC curves allow us to visualize trade-offs between different types of errors as we vary a threshold.

