6 - Classification
S. Lall, Stanford, 2011.01.13.01
Examples: radar system, binary transmission, OCR, spam filtering
The classification problem
Transition matrices and Bayes rule
The importance of prior probabilities
The MAP classifier and example
Decision regions
Example: gold coins
Error analysis and the MAP classifier
Cost functions and example
Trade-offs and the Neyman-Pearson cost function
Example: weighted-sum objective
The operating characteristic
Conditional errors and maximum likelihood
Example: radar system

A radar system counts the number of reflections received, and we would like to decide whether an aircraft is present.

[Figure: distributions of the number of reflections, with and without an aircraft present]
If there are fewer than 6 reflections, an aircraft is not present. If there are more than 11
reflections, an aircraft is present.
Other Examples
Binary transmission channel: A single bit is sent to us across a communication channel.

    If a 1 is sent, then with probability 0.8 a 1 is received, and with probability 0.2 a 0 is received.
    If a 0 is sent, then with probability 0.1 a 1 is received, and with probability 0.9 a 0 is received.

We measure the received bit, and would like to determine which bit was sent.
Optical character recognition: We measure various features of a character in an
optical system, such as
the width of the character
the ratio of black pixels to white pixels
Which of the characters A, B, . . . , Z is it?
Spam filtering: We measure which words are contained in the email, and would like to determine whether the email is spam.
The Classification Problem

[Figure: diagram of underlying events X1, X2 and observed events Y1, Y2, Y3]
Transition Matrices

We have a transition matrix A ∈ R^{m×n}, where

    A_{ij} = Prob(Y_i | X_j)

The matrix A is also called the likelihood matrix. For the transition diagram below,

    A = \begin{bmatrix} 0.7 & 0.2 \\ 0.3 & 0.3 \\ 0 & 0.5 \end{bmatrix}

[Figure: transition diagram with edges X1 → Y1 (0.7), X1 → Y2 (0.3), X2 → Y1 (0.2), X2 → Y2 (0.3), X2 → Y3 (0.5)]

Each column of A is a conditional probability distribution, and so 1^T A = 1^T.
Conditional Probability

[Figure: plot of the conditional probabilities for the radar example]
Bayes Rule

For any events A, B with Prob(B) ≠ 0, Bayes rule is

    Prob(A | B) = \frac{Prob(B | A) Prob(A)}{Prob(B)}

and so

    Prob(A | B) Prob(B) = Prob(B | A) Prob(A)
The Law of Total Probability says that since X_1, . . . , X_n partition the sample space, we have for any event A

    Prob(A) = \sum_{j=1}^{n} Prob(A ∩ X_j)

Combining this with Bayes rule gives

    Prob(X_j | Y_i) = \frac{Prob(Y_i | X_j) Prob(X_j)}{Prob(Y_i)}
                    = \frac{Prob(Y_i | X_j) Prob(X_j)}{\sum_{k=1}^{n} Prob(Y_i ∩ X_k)}

and therefore the a-posteriori probability is

    Prob(X_j | Y_i) = \frac{Prob(Y_i | X_j) Prob(X_j)}{\sum_{k=1}^{n} Prob(Y_i | X_k) Prob(X_k)}
Problem Data

We start with

    the prior distribution x_j = Prob(X_j) for j = 1, . . . , n
    the transition probabilities A_{ij} = Prob(Y_i | X_j) for i = 1, . . . , m and j = 1, . . . , n

From these we can compute

    the marginal probabilities y = Ax, so that y_i = Prob(Y_i)
    the joint probabilities J_{ij} = A_{ij} x_j = Prob(Y_i ∩ X_j)
    the posterior probabilities B_{ij} = J_{ij} / y_i = Prob(X_j | Y_i)
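As a sanity check on these definitions, here is a minimal numpy sketch. The 3×2 transition matrix is the one from the transition-matrix example above; the prior x is an assumed value chosen only for illustration.

    import numpy as np

    # transition (likelihood) matrix: A[i, j] = Prob(Y_i | X_j); columns sum to 1
    A = np.array([[0.7, 0.2],
                  [0.3, 0.3],
                  [0.0, 0.5]])

    # prior on X; this value is assumed, purely for illustration
    x = np.array([0.4, 0.6])

    y = A @ x                      # y[i]    = Prob(Y_i)
    J = A * x                      # J[i, j] = Prob(Y_i and X_j): column j of A scaled by x[j]
    B = J / y[:, None]             # B[i, j] = Prob(X_j | Y_i): Bayes rule, row by row

    assert np.allclose(A.sum(axis=0), 1.0)    # 1^T A = 1^T
    assert np.isclose(J.sum(), 1.0)           # the joint probabilities sum to 1
    assert np.allclose(B.sum(axis=1), 1.0)    # each row of B is a distribution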
Example

Suppose we have a test for cancer, with the following accuracy:

    if the patient does not have cancer, then the probability of a negative result is 0.97, and of a positive result is 0.03
    if the patient has cancer, then the probability of a negative result is 0.02, and of a positive result is 0.98

These are the transition probabilities.

Suppose a patient takes this test. Before seeing the result, the probability that the patient does not have cancer is 0.992, and hence the probability of having cancer is 0.008. These are the prior probabilities.
The transition matrix is

    A = \begin{bmatrix} 0.97 & 0.02 \\ 0.03 & 0.98 \end{bmatrix}

[Figure: transition diagram; X1 = does not have cancer, X2 = has cancer, Y1 = test is negative, Y2 = test is positive, with edge probabilities 0.97, 0.03, 0.02, 0.98]

The joint probabilities Prob(Y_i ∩ X_j) are

                          no cancer    cancer
    test is negative      0.96224      0.00016
    test is positive      0.02976      0.00784

and the posterior probabilities Prob(X_j | Y_i) are

                          no cancer    cancer
    test is negative      0.999834     0.000166
    test is positive      0.791489     0.208511

So given that the patient tests positive, the chance of having cancer is still only about 21%. Without a prior, one cannot draw any conclusion.
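These tables can be reproduced with a few lines of numpy, a sketch using only the transition and prior probabilities given above:

    import numpy as np

    A = np.array([[0.97, 0.02],      # rows: test negative, test positive
                  [0.03, 0.98]])     # columns: no cancer, cancer
    x = np.array([0.992, 0.008])     # prior: Prob(no cancer), Prob(cancer)

    J = A * x                        # joint probabilities Prob(Y_i and X_j)
    y = J.sum(axis=1)                # marginals Prob(Y_i)
    B = J / y[:, None]               # posteriors Prob(X_j | Y_i)

    print(J)    # [[0.96224  0.00016]  [0.02976  0.00784]]
    print(B)    # [[0.999834 0.000166] [0.791489 0.208511]]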
Classifiers
We would like to find a classifier, that is, a map f_est : {1, . . . , m} → {1, . . . , n}: if we observe event Y_i, then we estimate that event X_j occurred, where j = f_est(i).

Notice that classification deliberately throws away information, since we already have the conditional probabilities Prob(X_j | Y_i). That is, the summary "the patient does not have cancer" is less informative than "the patient has a 20.8% chance of having cancer".
We will specify the estimator via a matrix K ∈ R^{m×n}, where

    K_{ij} = \begin{cases} 1 & \text{if } j = f_{est}(i) \\ 0 & \text{otherwise} \end{cases}
The MAP Classifier

If we measure that event Y_{i_meas} occurred, then we estimate which of the events X_1, . . . , X_n occurred by picking the one with the highest conditional probability, i.e.,

    f_est(i_meas) = argmax_j Prob(X_j | Y_{i_meas})

This is called the maximum a-posteriori (MAP) classifier.
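In matrix terms, this means picking, for each observation i, the largest entry in row i of B (equivalently of J, since each row of B is the corresponding row of J scaled by the positive factor 1/y_i). A minimal sketch of such a classifier:

    import numpy as np

    def map_classifier(A, x):
        """Selection matrix K (m x n) of the MAP classifier.

        A[i, j] = Prob(Y_i | X_j), x[j] = Prob(X_j)."""
        J = A * x                                # joint probabilities
        jest = J.argmax(axis=1)                  # MAP estimate for each observation i
        K = np.zeros_like(J)
        K[np.arange(J.shape[0]), jest] = 1.0     # K[i, j] = 1  iff  j = f_est(i)
        return K

    # for the cancer-test example, the MAP classifier estimates "no cancer"
    # for both test outcomes, since the prior probability of cancer is so small
    A = np.array([[0.97, 0.02], [0.03, 0.98]])
    x = np.array([0.992, 0.008])
    print(map_classifier(A, x))                  # [[1. 0.], [1. 0.]]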
Example

Here n = 2 and m = 8.

[Figure: bar plots of prob(y | x1) and prob(y | x2)]

    A = \begin{bmatrix} 0.1 & 0 \\ 0.2 & 0 \\ 0.4 & 0 \\ 0.2 & 0.1 \\ 0.1 & 0.2 \\ 0 & 0.4 \\ 0 & 0.2 \\ 0 & 0.1 \end{bmatrix}
    \qquad
    x = \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}
    \qquad
    J = \begin{bmatrix} 0.02 & 0 \\ 0.04 & 0 \\ 0.08 & 0 \\ 0.04 & 0.08 \\ 0.02 & 0.16 \\ 0 & 0.32 \\ 0 & 0.16 \\ 0 & 0.08 \end{bmatrix}
    \qquad
    B = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1/3 & 2/3 \\ 1/9 & 8/9 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}
[Figure: bar plots of the scaled densities Prob(Y_i | X_j) Prob(X_j)]

In words: scale the transition pdf Prob(Y_i | X_j) by the prior Prob(X_j), and pick the j for which the result, evaluated at Y_{i_meas}, is largest.
Decision Regions

The classifier splits the set of observations into decision regions.

[Figure: plots of prob(y | x1) and prob(y | x2), with the observations divided into the regions where each X_j is estimated]
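For the example above with n = 2 and m = 8, the decision regions can be computed directly from J; a minimal sketch:

    import numpy as np

    A = np.array([[0.1, 0.0], [0.2, 0.0], [0.4, 0.0], [0.2, 0.1],
                  [0.1, 0.2], [0.0, 0.4], [0.0, 0.2], [0.0, 0.1]])
    x = np.array([0.2, 0.8])

    J = A * x                       # joint probabilities, one row per observation Y_i
    jest = J.argmax(axis=1)         # MAP estimate for each i (0-based: 0 -> X1, 1 -> X2)

    for j in range(2):
        region = (np.where(jest == j)[0] + 1).tolist()     # 1-based observation indices
        print(f"estimate X{j + 1} on observations {region}")
    # estimate X1 on observations [1, 2, 3]
    # estimate X2 on observations [4, 5, 6, 7, 8]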
Reducible Error

[Figure: illustration of the reducible error]
Example: gold coins

Suppose there are four coins in a bag, some gold and some silver. Let

    X_j = the event that j - 1 of the coins in the bag are gold,    j = 1, . . . , 5

with prior

    x = \begin{bmatrix} 0.05 & 0.15 & 0.15 & 0.6 & 0.05 \end{bmatrix}^T

We draw two coins at random from the bag. Let

    Y_i = the event that i - 1 of the coins drawn are gold,    i = 1, . . . , 3
The transition matrix is

    A = \begin{bmatrix} 1 & 1/2 & 1/6 & 0 & 0 \\ 0 & 1/2 & 2/3 & 1/2 & 0 \\ 0 & 0 & 1/6 & 1/2 & 1 \end{bmatrix}
The joint probability matrix is

    J = \begin{bmatrix} 0.05 & 0.075 & 0.025 & 0 & 0 \\ 0 & 0.075 & 0.1 & 0.3 & 0 \\ 0 & 0 & 0.025 & 0.3 & 0.05 \end{bmatrix}

and the MAP estimator is

    K = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}

[Figure: bar plots of the posterior probabilities prob(X_i | Y_1), prob(X_i | Y_2), and prob(X_i | Y_3)]
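The transition matrix here is hypergeometric: we draw 2 of the 4 coins when the bag contains j - 1 gold coins. A sketch that rebuilds A, J, and the MAP selection matrix K:

    import numpy as np
    from math import comb

    # A[i, j] = Prob(i gold coins are drawn | the bag contains j gold coins), i = 0..2, j = 0..4
    A = np.array([[comb(j, i) * comb(4 - j, 2 - i) / comb(4, 2) for j in range(5)]
                  for i in range(3)])

    x = np.array([0.05, 0.15, 0.15, 0.6, 0.05])    # prior on the number of gold coins

    J = A * x                                      # joint probability matrix
    K = np.zeros_like(J)
    K[np.arange(3), J.argmax(axis=1)] = 1.0        # MAP classifier

    print(A)    # columns (1, 0, 0), (1/2, 1/2, 0), (1/6, 2/3, 1/6), (0, 1/2, 1/2), (0, 0, 1)
    print(K)    # rows select X2, X4, X4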
Error Analysis
The unconditional error matrix E ∈ R^{n×n} is

    E_{jk} = probability that X_j is estimated and X_k occurs
           = Prob(j_est = j and X_k)
           = \sum_{i=1}^{m} Prob\Big( \Big( \bigcup_{p : f_{est}(p) = j} Y_p \Big) ∩ Y_i ∩ X_k \Big)

since the Y_i partition the sample space. Here

    \Big( \bigcup_{p : f_{est}(p) = j} Y_p \Big) ∩ Y_i = \begin{cases} Y_i & \text{if } f_{est}(i) = j \\ ∅ & \text{otherwise} \end{cases} = \begin{cases} Y_i & \text{if } K_{ij} = 1 \\ ∅ & \text{otherwise} \end{cases}
Therefore we have

    E_{jk} = probability that X_j is estimated and X_k occurs
           = \sum_{i=1}^{m} K_{ij} Prob(Y_i ∩ X_k)
           = \sum_{i=1}^{m} K_{ij} J_{ik}

That is, E = K^T J. Notice that 1^T E 1 = 1.
For the coins example, we have

    E = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.05 & 0.075 & 0.025 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0.075 & 0.125 & 0.6 & 0.05 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}

Some rows are zero since, for example, we never estimate that there are no gold coins in the bag. Ideally, E would be zero off the diagonal. Notice that each column j sums to the prior probability Prob(X_j).
The probability that the estimate is correct is

    \sum_{j=1}^{n} E_{jj} = trace E = \sum_{j=1}^{n} \sum_{i=1}^{m} K_{ij} J_{ij}
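Continuing the coins example, a short sketch that forms E = K^T J and checks the two facts above (each column of E sums to the prior, and trace E is the probability of a correct estimate):

    import numpy as np

    A = np.array([[1, 1/2, 1/6, 0,   0],
                  [0, 1/2, 2/3, 1/2, 0],
                  [0, 0,   1/6, 1/2, 1]])
    x = np.array([0.05, 0.15, 0.15, 0.6, 0.05])

    J = A * x
    K = np.zeros_like(J)
    K[np.arange(3), J.argmax(axis=1)] = 1.0     # MAP classifier for the coins example

    E = K.T @ J                                 # unconditional error matrix
    assert np.allclose(E.sum(axis=0), x)        # each column sums to the prior Prob(X_j)
    print(np.trace(E))                          # probability of a correct estimate: 0.675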
Cost Functions

Suppose we now assign costs to errors:

    C_{jk} = cost when X_j is estimated and X_k occurs

The expected cost is then

    \sum_{j=1}^{n} \sum_{k=1}^{n} C_{jk} E_{jk} = trace(E C^T) = trace(K^T J C^T)

This is called the Bayes risk.
Suppose we assign cost

    C_{jk} = \begin{cases} 1 & \text{if } j ≠ k \\ 0 & \text{otherwise} \end{cases}

That is,

    C = \begin{bmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{bmatrix} = 1 1^T - I

The Bayes risk is then

    trace(E C^T) = trace\big( E (1 1^T - I) \big) = 1 - trace E

Hence minimizing this cost function maximizes the probability of a correct estimate, so the MAP classifier minimizes this cost function.
For example, with n = 2 and

    C = \begin{bmatrix} 0 & 100 \\ 1 & 0 \end{bmatrix}

C_{21} is the cost for estimating X_2 when X_1 occurs, i.e., the cost of a false positive, and C_{12} is the cost for estimating X_1 when X_2 occurs, i.e., the cost of a false negative.
Example

[Figure: bar plots of the conditional probabilities prob(y | x1) and prob(y | x2)]

    A = \begin{bmatrix} 0.1 & 0 \\ 0.2 & 0 \\ 0.4 & 0 \\ 0.2 & 0.1 \\ 0.1 & 0.2 \\ 0 & 0.4 \\ 0 & 0.2 \\ 0 & 0.1 \end{bmatrix}
    \qquad
    x = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}
    \qquad
    C = \begin{bmatrix} 0 & 100 \\ 1 & 0 \end{bmatrix}
    \qquad
    J C^T = \begin{bmatrix} 0 & 0.05 \\ 0 & 0.1 \\ 0 & 0.2 \\ 5 & 0.1 \\ 10 & 0.05 \\ 20 & 0 \\ 10 & 0 \\ 5 & 0 \end{bmatrix}
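Since the Bayes risk is trace(K^T J C^T) = \sum_i \sum_j K_{ij} (J C^T)_{ij}, and each row of K contains a single 1, the minimum-risk classifier picks, for each observation i, the column j that minimizes (J C^T)_{ij}. A sketch for the example above:

    import numpy as np

    A = np.array([[0.1, 0.0], [0.2, 0.0], [0.4, 0.0], [0.2, 0.1],
                  [0.1, 0.2], [0.0, 0.4], [0.0, 0.2], [0.0, 0.1]])
    x = np.array([0.5, 0.5])
    C = np.array([[0.0, 100.0],
                  [1.0, 0.0]])

    J = A * x
    W = J @ C.T                                    # W[i, j] = risk contribution of estimating X_{j+1} on Y_{i+1}
    K = np.zeros_like(J)
    K[np.arange(len(W)), W.argmin(axis=1)] = 1.0   # minimum-risk classifier

    print(np.trace(K.T @ J @ C.T))                 # Bayes risk of this classifier: 0.15

With this cost the classifier estimates X_2 on Y_4, whereas the MAP classifier, which weights the two kinds of error equally, estimates X_1 there; the large false-negative cost enlarges the region where X_2 is estimated.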
Trade-offs
Often we would like to examine the trade-off between

    J1 = the probability of making a false positive error
    J2 = the probability of making a false negative error
Trade-off Curve

[Figure: the (J2, J1) plane, showing the achievable region, the optimal trade-off curve, and three example designs x(1), x(2), x(3)]

    the shaded area shows (J2, J1) achieved by some x ∈ R^n
    the clear area shows (J2, J1) not achieved by any x ∈ R^n

Three example choices of x: x(1), x(2), x(3)

    x(3) is worse than x(2) on both counts (J2 and J1)
    x(1) is better than x(2) in J2, but worse in J1
Weighted-Sum Objective

To find Pareto optimal points, i.e., x's on the optimal trade-off curve, we minimize the weighted-sum objective

    J1 + λ J2

[Figure: trade-off curve with a line of constant J1 + λ J2 touching it at x(2)]

    points where the weighted sum is constant, J1 + λ J2 = α, correspond to a line with slope -λ
    x(2) minimizes the weighted-sum objective for the λ shown
    by varying λ from 0 to +∞, we can sweep out the entire optimal trade-off curve

In some cases, the trade-off curve may not be convex; then there are Pareto points that are not found by minimizing a weighted sum.
We have

    J1 = Prob(j_est = 2 and X_1)
    J2 = Prob(j_est = 1 and X_2)

and we would like to minimize J1 + λ J2. This corresponds to the cost matrix

    C = \begin{bmatrix} 0 & λ \\ 1 & 0 \end{bmatrix}

This is called the Neyman-Pearson cost function.
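One way to sweep out the operating characteristic numerically is to vary λ and, for each value, use the minimum-risk classifier for the Neyman-Pearson cost. A sketch using the two-hypothesis example from earlier (the grid of λ values is an arbitrary choice):

    import numpy as np

    A = np.array([[0.1, 0.0], [0.2, 0.0], [0.4, 0.0], [0.2, 0.1],
                  [0.1, 0.2], [0.0, 0.4], [0.0, 0.2], [0.0, 0.1]])
    x = np.array([0.5, 0.5])
    J = A * x

    for lam in [0.1, 0.5, 1.0, 2.0, 10.0]:      # arbitrary grid of weights
        C = np.array([[0.0, lam],
                      [1.0, 0.0]])              # Neyman-Pearson cost
        W = J @ C.T
        jest = W.argmin(axis=1)                 # minimum-risk decision for each Y_i
        J1 = J[jest == 1, 0].sum()              # Prob(estimate X2 and X1 occurs)
        J2 = J[jest == 0, 1].sum()              # Prob(estimate X1 and X2 occurs)
        print(f"lambda = {lam:5.1f}:  J1 = {J1:.3f}  J2 = {J2:.3f}")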
Example

[Figure: conditional probabilities prob(Y_i | X_1) and prob(Y_i | X_2) for a two-hypothesis example, with Prob(X_2) = 0.4]
[Figure: the resulting trade-off curve; horizontal axis: probability of estimating X_2 and X_1 occurs]
Operating Characteristic

This trade-off curve is also called the receiver operating characteristic or ROC. It is often plotted the other way up.
[Figure: with λ = 1, the conditional densities prob(Y_i | X_1) and prob(Y_i | X_2), and the corresponding point on the operating characteristic]

[Figure: the same, with λ = 10]
Classification
S. Lall, Stanford
2011.01.13.01
0.4
0.06
prob(Y X )
i
0.3
prob(Yi X2)
0.05
0.04
0.2
0.03
0.02
0.1
0.01
0
0.1
0.2
0.3
0.4
0.5
prob estimating X and X occurs
10
20
30
40
50
60
0.4
0.06
prob(Yi X1)
prob(Y X )
0.05
0.3
0.04
0.6
0.2
0.03
0.02
0.1
0.01
0
0.1
0.2
0.3
0.4
0.5
prob estimating X and X occurs
2
0.6
10
20
30
40
50
Conditional Errors

The conditional error matrix E^{cond} ∈ R^{n×n} is

    E^{cond}_{jk} = probability that X_j is estimated given that X_k occurred
                  = Prob(j_est = j | X_k)
                  = \sum_{i=1}^{m} Prob(j_est = j and Y_i | X_k)    since the Y_i partition the sample space
                  = \sum_{i=1}^{m} Prob\Big( \Big( \bigcup_{p : f_{est}(p) = j} Y_p \Big) ∩ Y_i \;\Big|\; X_k \Big)

where, as before,

    \Big( \bigcup_{p : f_{est}(p) = j} Y_p \Big) ∩ Y_i = \begin{cases} Y_i & \text{if } f_{est}(i) = j \\ ∅ & \text{otherwise} \end{cases} = \begin{cases} Y_i & \text{if } K_{ij} = 1 \\ ∅ & \text{otherwise} \end{cases}
Therefore we have

    E^{cond}_{jk} = \sum_{i=1}^{m} K_{ij} Prob(Y_i | X_k) = \sum_{i=1}^{m} K_{ij} A_{ik}

That is, E^{cond} = K^T A.
For the coins example, we have

    E^{cond} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 1/2 & 1/6 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1/2 & 5/6 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}

Here E^{cond}_{jk} is the probability that X_j is estimated given that X_k occurred.
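A quick check of E^{cond} = K^T A for the coins example, with A and K as above:

    import numpy as np

    A = np.array([[1, 1/2, 1/6, 0,   0],
                  [0, 1/2, 2/3, 1/2, 0],
                  [0, 0,   1/6, 1/2, 1]])
    K = np.array([[0, 1, 0, 0, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 1, 0]])               # MAP classifier for the coins example

    E_cond = K.T @ A                              # E_cond[j, k] = Prob(X_{j+1} estimated | X_{k+1} occurred)
    print(E_cond)
    assert np.allclose(E_cond.sum(axis=0), 1.0)   # each column is a conditional distribution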
Maximum-Likelihood
When we do not have any prior probabilities, a commonly used heuristic is the method of
maximum likelihood.
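A minimal sketch: the maximum-likelihood classifier picks, for each observation Y_i, the hypothesis X_j that maximizes the likelihood A_{ij} = Prob(Y_i | X_j), which is the same as the MAP classifier with a uniform prior.

    import numpy as np

    def ml_classifier(A):
        """Selection matrix K of the maximum-likelihood classifier.

        A[i, j] = Prob(Y_i | X_j); no prior on X is used."""
        jest = A.argmax(axis=1)                  # most likely hypothesis for each observation
        K = np.zeros_like(A)
        K[np.arange(A.shape[0]), jest] = 1.0
        return K

    # for the cancer-test likelihoods, maximum likelihood reports "cancer" on a positive
    # test, even though the MAP classifier with the prior (0.992, 0.008) would not
    print(ml_classifier(np.array([[0.97, 0.02], [0.03, 0.98]])))    # [[1. 0.], [0. 1.]]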