
6-1

Classification

6 - Classification
Examples: radar system, binary transmission, OCR, spam filtering
The classification problem
Transition matrices and Bayes rule
The importance of prior probabilities
The MAP classifier and example
Decision regions
Example: gold coins
Error analysis and the MAP classifier
Cost functions and example
Trade-offs and the Neyman-Pearson cost function
Example: weighted-sum objective
The operating characteristic
Conditional errors and maximum likelihood

S. Lall, Stanford

2011.01.13.01


Example: Radar System


A radar system sends out n pulses, and receives y reflections, where 0 ≤ y ≤ n.
Ideally, y = n if an aircraft is present, and y = 0 otherwise.
In practice, reflections may be lost, or noise may be mistaken for reflections.
So we have two probability mass functions:
p1(y) = the probability of receiving y reflections when there is no aircraft present
p2(y) = the probability of receiving y reflections when there is an aircraft present

[Figure: the two pmfs prob(y | x1) and prob(y | x2) plotted against y.]
If we measure ymeas reflections, how do we decide if an aircraft is present?


Example: Radar System


[Figure: the two pmfs prob(y | x1) and prob(y | x2) again.]

If there are fewer than 6 reflections, an aircraft is not present. If there are more than 11 reflections, an aircraft is present.

We would like to choose a threshold value, based on
the probabilities of errors: false positives and false negatives
the costs assigned to these events


Other Examples
Binary transmission channel: A binary bit is sent to us across a communication channel.
If a 1 is sent, then with probability 0.8 a 1 is received, and with probability 0.2 a 0 is received.
If a 0 is sent, then with probability 0.1 a 1 is received, and with probability 0.9 a 0 is received.
We measure the received bit, and would like to determine which bit was sent.

Optical character recognition: We measure various features of a character in an optical system, such as
the width of the character
the ratio of black pixels to white pixels
Which of the characters A, B, . . . , Z is it?

Spam filtering: We measure which words are contained in the email. We would like to determine if the email is spam or not.


The Classification Problem


X1, . . . , Xn are events that partition Ω, called hypotheses
Y1, . . . , Ym are events that partition Ω, called observations

[Figure: a Venn diagram of Ω showing the overlapping partitions X1, X2 and Y1, Y2, Y3.]

The outcome ω of the experiment lies in exactly one of the events Xj and exactly one of the events Yi.
In other words, exactly one hypothesis is true and exactly one observation occurs.

The decision or classification problem is as follows:
We measure which of the Yi the outcome ω lies in, say Yimeas.
We would like to pick jest to estimate which Xj contains ω.


Transition Matrices
We have a transition matrix A ∈ R^{m×n}, where
Aij = Prob(Yi | Xj)
The matrix A is also called the likelihood matrix.

We can represent it as a bipartite graph, e.g.,

A = [ 0.7  0.2
      0.3  0.3
      0    0.5 ]

[Figure: bipartite graph with X1, X2 on the left and Y1, Y2, Y3 on the right; X1 connects to Y1, Y2 with weights 0.7, 0.3, and X2 connects to Y1, Y2, Y3 with weights 0.2, 0.3, 0.5.]

A is elementwise nonnegative and the sum of each column is one, i.e.,
A ≥ 0 and 1^T A = 1^T
A matrix with these properties is called column stochastic.


Conditional Probability

We would like to know

Bimeas,j = Prob(Xj | Yimeas)

Prob(Xj | Yimeas) is called the a-posteriori probability.

[Figure: the a-posteriori pmf over j, plotted for each value of imeas.]

We will have a different pmf for each value of imeas.
Once we have computed the a-posteriori pmf, we can pick an estimate, i.e., a value for jest.
The estimate is usually chosen to minimize a cost function.


Bayes Rule
For any events A, B with Prob(B) ≠ 0, Bayes rule is

Prob(A | B) = Prob(B | A) Prob(A) / Prob(B)

Because if Prob(B) ≠ 0, then

Prob(A | B) = Prob(A ∩ B) / Prob(B)

and so

Prob(A | B) Prob(B) = Prob(B | A) Prob(A)


Bayes Rule
The Law of Total Probability says that since X1, . . . , Xn partition Ω, we have for any event A

Prob(A) = sum_{j=1}^{n} Prob(A ∩ Xj)

Now by Bayes rule, we have

Prob(Xj | Yi) = Prob(Yi | Xj) Prob(Xj) / Prob(Yi)
              = Prob(Yi | Xj) Prob(Xj) / sum_{k=1}^{n} Prob(Yi ∩ Xk)

and therefore the a-posteriori probability is

Prob(Xj | Yi) = Prob(Yi | Xj) Prob(Xj) / sum_{k=1}^{n} Prob(Yi | Xk) Prob(Xk)


Problem Data
We start with
the prior distribution xj = Prob(Xj) for j = 1, . . . , n
the transition probabilities Aij = Prob(Yi | Xj) for i = 1, . . . , m and j = 1, . . . , n

From these we can find
the a-posteriori probabilities Bij = Prob(Xj | Yi)
the marginal pmf yi = Prob(Yi)
and the joint distribution Jij = Prob(Yi ∩ Xj)

We have

y = Ax,   Bij = Jij / yi,   Jij = Aij xj
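These relations can be checked numerically. A minimal sketch (not part of the original slides), using the 3 × 2 transition matrix from the bipartite-graph example; the prior x = [0.5, 0.5] is an assumed value for illustration.

```python
# Numerical check of y = A x, J_ij = A_ij x_j, B_ij = J_ij / y_i,
# using the 3 x 2 transition matrix from the bipartite-graph example.
A = [[0.7, 0.2],
     [0.3, 0.3],
     [0.0, 0.5]]   # A[i][j] = Prob(Yi | Xj); columns sum to one
x = [0.5, 0.5]     # prior x[j] = Prob(Xj)  (assumed value)

m, n = len(A), len(x)
y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]   # marginal Prob(Yi)
J = [[A[i][j] * x[j] for j in range(n)] for i in range(m)]      # joint Prob(Yi and Xj)
B = [[J[i][j] / y[i] for j in range(n)] for i in range(m)]      # posterior Prob(Xj | Yi)

print([round(v, 6) for v in y])  # -> [0.45, 0.3, 0.25]
```

Each row of B sums to one, since the Xj partition Ω.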


Example: Prior Probabilities


Why do we need prior probabilities? The following is the standard example.

Suppose we have a test for cancer, which has the following accuracy:
if the patient does not have cancer, then the probability of a negative result is 0.97, and of a positive result is 0.03.
if the patient has cancer, then the probability of a negative result is 0.02, and of a positive result is 0.98.
These are the transition probabilities.

Suppose a patient takes this test. The probability of not having cancer is 0.992, and hence
the probability of having cancer is 0.008.
These are the prior probabilities.


Example: Prior Probabilities


Imagine 10,000 patients take this test.
On average, 80 of these people will have cancer (0.008 probability), and since 98% of them will test positive, we will have 78 positive tests.
Of the 9,920 cancer-free patients, 3% will test positive, giving a further 297 positive tests.
Hence of the total 375 positive tests, most (297) are false positives.
The conditional probability of having cancer given that one tests positive is 78/375 = 0.208.


Example: Prior Probabilities


The transition matrix is

A = [ 0.97  0.02
      0.03  0.98 ]

[Figure: bipartite graph; X1 (does not have cancer) connects to Y1 (test is negative) with weight 0.97 and to Y2 (test is positive) with weight 0.03; X2 (has cancer) connects to Y1 with weight 0.02 and to Y2 with weight 0.98.]

The joint probabilities are

                       no cancer   cancer
J = test is negative   0.96224     0.00016
    test is positive   0.02976     0.00784

But the conditional probabilities are

                       no cancer   cancer
B = test is negative   0.999834    0.000166251
    test is positive   0.791489    0.208511

So given that the patient tests positive, the chances of having cancer are only 20%.
Without a prior, one cannot draw any conclusion.
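The table entries above can be reproduced in a few lines; a small sketch (not part of the original slides):

```python
# Reproducing the cancer-test numbers: J = A diag(x), then B by normalizing
# each row of J by the marginal y_i.
A = [[0.97, 0.02],   # row: test negative; columns: no cancer, cancer
     [0.03, 0.98]]   # row: test positive
x = [0.992, 0.008]   # prior: Prob(no cancer), Prob(cancer)

J = [[A[i][j] * x[j] for j in range(2)] for i in range(2)]
y = [sum(row) for row in J]
B = [[J[i][j] / y[i] for j in range(2)] for i in range(2)]

print(round(B[1][1], 6))  # Prob(cancer | positive) -> 0.208511
```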


Classifiers
We would like to find a classifier, that is, a map fest : {1, . . . , m} → {1, . . . , n} such that if we observe event Yi, then we estimate that event Xj occurred, where j = fest(i).

Notice that classification deliberately throws away information, since we already have the conditional probabilities Prob(Xj | Yi).
That is, the summary "the patient does not have cancer" is less informative than "the patient has a 20.8% chance of having cancer".


Classifiers
We will specify the estimator via a matrix K ∈ R^{m×n}, where

Kij = 1 if j = fest(i), and Kij = 0 otherwise

There is exactly one 1 in every row of K, so
K1 = 1, i.e., K is row stochastic



The MAP Classifier


The maximum a-posteriori probability (MAP) classifier is

fmap(imeas) = argmax_j Prob(Xj | Yimeas)

If we measure that event Yimeas occurred, then we estimate which of the events X1, . . . , Xn occurred by picking the one which has the highest conditional probability.

We pick j to maximize the conditional probability

Prob(Xj | Yi) = Prob(Yi | Xj) Prob(Xj) / Prob(Yi)

This is the same as picking j to maximize the joint probability

Prob(Yi | Xj) Prob(Xj)


Example
Here n = 2 and m = 8.

[Figure: the two transition pmfs prob(y | x1) and prob(y | x2).]

We have transition, prior, joint and conditional probabilities

A = [ 0.1  0
      0.2  0
      0.4  0
      0.2  0.1
      0.1  0.2
      0    0.4
      0    0.2
      0    0.1 ]

x = [ 0.2
      0.8 ]

J = [ 0.02  0
      0.04  0
      0.08  0
      0.04  0.08
      0.02  0.16
      0     0.32
      0     0.16
      0     0.08 ]

B = [ 1    0
      1    0
      1    0
      1/3  2/3
      1/9  8/9
      0    1
      0    1
      0    1 ]


The MAP Classifier


In terms of B and J, the MAP estimator is:
pick j corresponding to the largest element in row imeas of B.

Equivalently, we can use J instead of B; the columns of J are plotted below.

[Figure: the two columns of J, prob(y ∩ x1) and prob(y ∩ x2).]

In words: scale the transition pmf Prob(Yi | Xj) by the prior Prob(Xj), and pick the largest evaluated at Yimeas.
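The row-wise argmax can be sketched directly; a small example (not part of the original slides) using the n = 2, m = 8 matrices above, with 0-based indices for the estimate:

```python
# MAP classification as a row-wise argmax over the joint matrix J
# (equivalently over B, since each row of B is the same row of J
# divided by the constant y_i). A and x are from the n = 2, m = 8 example.
A = [[0.1, 0.0], [0.2, 0.0], [0.4, 0.0], [0.2, 0.1],
     [0.1, 0.2], [0.0, 0.4], [0.0, 0.2], [0.0, 0.1]]
x = [0.2, 0.8]
J = [[A[i][j] * x[j] for j in range(2)] for i in range(8)]

f_map = [max(range(2), key=lambda j: J[i][j]) for i in range(8)]  # 0-based j_est per row
print(f_map)  # -> [0, 0, 0, 1, 1, 1, 1, 1]
```

The result matches the decision regions on the next slide: the first three observations are assigned to X1, the rest to X2.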


Decision Regions
The classifier splits the set of observations into decision regions.

[Figure: the joint pmfs prob(y ∩ x1) and prob(y ∩ x2), with the decision boundary between them.]

The decision regions are
R1 = { Yi | i ≤ 3 }
R2 = { Yi | i > 3 }
If Yimeas ∈ Rj, then we estimate that Xj occurred.
We will see that this idea is useful when estimating in continuous probability spaces.


Reducible Error

[Figure: two overlapping distributions with a decision boundary; the shaded regions show the error probability, one of them labeled "reducible error".]

The area (probability mass) under both curves sums to 1.
If we choose the decision boundary shown at i = 43, then the error probability is the area of the three shaded regions.
By moving the decision boundary to 40, we can remove the reducible error.


Example
Suppose there are four coins in a bag, some gold and some silver. Let
Xj = the event that j - 1 of the coins in the bag are gold, for j = 1, . . . , 5

We have the prior pmf xj = Prob(Xj):

x = [ 0.05  0.15  0.15  0.6  0.05 ]^T

We draw two coins at random from the bag. Let
Yi = the event that i - 1 of the coins drawn are gold, for i = 1, 2, 3


Example
The transition matrix is

A = [ 1  1/2  1/6  0    0
      0  1/2  2/3  1/2  0
      0  0    1/6  1/2  1 ]

As usual, Aij = Prob(Yi | Xj).

Because, if there are q gold coins in the bag, then
the probability of drawing 0 gold coins is (4 - q)(3 - q)/12
the probability of drawing 1 gold coin is q(4 - q)/6
the probability of drawing 2 gold coins is q(q - 1)/12
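The columns of A can be generated from these counting formulas; a small sketch (not part of the original slides) using exact fractions:

```python
from fractions import Fraction as F

# Building the coins transition matrix column by column: with q gold coins
# among the 4 in the bag, drawing two coins without replacement gives
#   Prob(0 gold drawn) = (4 - q)(3 - q)/12
#   Prob(1 gold drawn) = q(4 - q)/6
#   Prob(2 gold drawn) = q(q - 1)/12
def column(q):
    return [F((4 - q) * (3 - q), 12), F(q * (4 - q), 6), F(q * (q - 1), 12)]

A = [[column(q)[i] for q in range(5)] for i in range(3)]  # A[i][q]
print([str(p) for p in A[1]])  # Prob(1 gold drawn | q gold in bag) -> ['0', '1/2', '2/3', '1/2', '0']
```

Each column sums to one, as it must for a column-stochastic matrix.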


Example
The joint probability matrix is

J = [ 0.05  0.075  0.025  0    0
      0     0.075  0.1    0.3  0
      0     0      0.025  0.3  0.05 ]

The MAP estimator is

K = [ 0  1  0  0  0
      0  0  0  1  0
      0  0  0  1  0 ]

So, using the MAP estimator, we conclude:
if we draw no gold coins, we estimate there was 1 gold coin in the bag
if we draw 1 or 2 gold coins, we estimate there were 3 gold coins in the bag

[Figure: the a-posteriori probabilities prob(Xi | Y1), prob(Xi | Y2), prob(Xi | Y3), one bar chart for each of the three possible measurements.]


Error Analysis
The unconditional error matrix E ∈ R^{n×n} is

Ejk = probability that Xj is estimated and Xk occurs
    = Prob(jest = j and Xk)
    = sum_{i=1}^{m} Prob(jest = j and Yi and Xk)    (since the Yi partition Ω)
    = sum_{i=1}^{m} Prob( (∪{ Yp | fest(p) = j }) ∩ Yi ∩ Xk )

Now notice that

(∪{ Yp | fest(p) = j }) ∩ Yi = Yi if fest(i) = j, and ∅ otherwise
                             = Yi if Kij = 1, and ∅ otherwise


Error Analysis
Therefore we have

Ejk = probability that Xj is estimated and Xk occurs
    = sum_{i=1}^{m} Kij Prob(Yi ∩ Xk)
    = sum_{i=1}^{m} Kij Jik

That is, E = K^T J.
Notice that 1^T E 1 = 1.
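The product E = K^T J can be checked for the coins example; a small sketch (not part of the original slides), with the MAP classifier K from the previous slides:

```python
# Error matrix E = K^T J for the coins example.
J = [[0.05, 0.075, 0.025, 0.0, 0.0],
     [0.0,  0.075, 0.1,   0.3, 0.0],
     [0.0,  0.0,   0.025, 0.3, 0.05]]
K = [[0, 1, 0, 0, 0],   # draw 0 gold -> estimate X2 (1 gold coin in the bag)
     [0, 0, 0, 1, 0],   # draw 1 gold -> estimate X4 (3 gold coins)
     [0, 0, 0, 1, 0]]   # draw 2 gold -> estimate X4

E = [[sum(K[i][j] * J[i][k] for i in range(3)) for k in range(5)] for j in range(5)]
total = sum(sum(row) for row in E)        # 1^T E 1 = 1
correct = sum(E[j][j] for j in range(5))  # trace E = Prob(correct estimate)
print(round(correct, 3))  # -> 0.675
```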


Example: Error Analysis


For the coins example, we have

E = [ 0     0      0      0    0
      0.05  0.075  0.025  0    0
      0     0      0      0    0
      0     0.075  0.125  0.6  0.05
      0     0      0      0    0 ]

Some rows are zero, since, e.g., we never estimate that there are no gold coins in the bag.
Ideally, E would be zero on the off-diagonal elements.
Notice that each column j sums to the prior probability Prob(Xj).


Error Analysis
The probability that the estimate is correct is

sum_{j=1}^{n} Ejj = trace E = sum_{j=1}^{n} sum_{i=1}^{m} Kij Jij

Hence to maximize the probability of a correct estimate, we pick K so that

Kij = 1 if Jij is the largest element of row i of J, and 0 otherwise

This is exactly the MAP classifier; i.e.,

The MAP classifier maximizes the probability of a correct estimate.


Cost Functions
Suppose we now assign costs to errors
Cjk = cost when Xj is estimated and Xk occurs

The expected cost is

E[C] = sum_{j=1}^{n} sum_{k=1}^{n} Cjk Prob(jest = j and Xk)
     = sum_{j=1}^{n} sum_{k=1}^{n} Cjk Ejk
     = trace(E C^T)
     = trace(K^T J C^T)

This is called the Bayes risk.


Cost Functions
Suppose we assign cost

Cjk = 1 if j ≠ k, and 0 otherwise

that is, cost one whenever the estimate is wrong. So

C = [ 0 1 ... 1
      1 0 ... 1
      . . ... .
      1 1 ... 0 ]  =  1 1^T - I

Then the Bayes risk is

E[C] = trace( E (1 1^T - I) ) = 1 - trace E

Hence minimizing this cost function maximizes the probability of a correct estimate.
So the MAP classifier minimizes this cost function.

Choosing a Cost Function


Suppose we consider the radar example, where
X1 = the event that there are no aircraft present
X2 = the event that there is an aircraft present
Then we may significantly prefer false positives to false negatives.
In that case we could choose, for example

C = [ 0  100
      1  0 ]

C21 is the cost for estimating X2 when X1 occurs, i.e., the cost for false positives.
C12 is the cost for estimating X1 when X2 occurs, i.e., the cost for false negatives.


Example: Choosing a Cost Function


[Figure: the two transition pmfs prob(y | x1) and prob(y | x2).]

We would like to minimize E[C] = trace(K^T J C^T), so we pick the smallest element in each row of J C^T.

A = [ 0.1  0
      0.2  0
      0.4  0
      0.2  0.1
      0.1  0.2
      0    0.4
      0    0.2
      0    0.1 ]

x = [ 0.5
      0.5 ]

C = [ 0  100
      1  0 ]

J C^T = [ 0   0.05
          0   0.1
          0   0.2
          5   0.1
          10  0.05
          20  0
          10  0
          5   0 ]
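The smallest-element-per-row rule can be sketched in code; a small example (not part of the original slides), using the matrices of this slide with 0-based indices:

```python
# Minimizing the Bayes risk trace(K^T J C^T): for each observation row i
# we pick the column of (J C^T) with the smallest entry.
# J is built from A above with the prior x = [0.5, 0.5].
J = [[0.05, 0.0], [0.1, 0.0], [0.2, 0.0], [0.1, 0.05],
     [0.05, 0.1], [0.0, 0.2], [0.0, 0.1], [0.0, 0.05]]
C = [[0, 100],   # C[0][1] = 100: cost of estimating X1 when X2 occurs
     [1, 0]]     # C[1][0] = 1:   cost of estimating X2 when X1 occurs

# (J C^T)[i][j] = sum_k J[i][k] * C[j][k]
JCt = [[sum(J[i][k] * C[j][k] for k in range(2)) for j in range(2)] for i in range(8)]
f_est = [min(range(2), key=lambda j: JCt[i][j]) for i in range(8)]  # 0-based estimates
risk = sum(JCt[i][f_est[i]] for i in range(8))                      # attained Bayes risk
print(f_est)  # -> [0, 0, 0, 1, 1, 1, 1, 1]
```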

Trade-offs
Often we would like to examine the trade-off between
J1 = the probability of making a false positive error
J2 = the probability of making a false negative error

Usually the objectives are competing:
we can make one smaller at the expense of making the other larger.


Trade-off Curve
[Figure: the achievable region in the (J2, J1) plane, with three example points x(1), x(2), x(3).]

The shaded area shows pairs (J2, J1) achieved by some x ∈ R^n; the clear area shows pairs (J2, J1) not achieved by any x ∈ R^n.
The boundary of the region is called the optimal trade-off curve; the corresponding x are called Pareto optimal.
Three example choices of x: x(1), x(2), x(3).
x(3) is worse than x(2) on both counts (J2 and J1).
x(1) is better than x(2) in J2, but worse in J1.


Weighted-Sum Objective
To find Pareto optimal points, i.e., x's on the optimal trade-off curve, we minimize the weighted-sum objective

J1 + λJ2

The parameter λ ≥ 0 gives the relative weight between J1 and J2.

[Figure: level lines of the weighted-sum objective in the (J2, J1) plane, touching the trade-off curve at x(2).]

Points where the weighted sum is constant, J1 + λJ2 = α, correspond to a line with slope -λ.
x(2) minimizes the weighted-sum objective for the λ shown.
By varying λ from 0 to +∞, we can sweep out the entire optimal trade-off curve.
In some cases, the trade-off curve may not be convex; then there are Pareto points that are not found by minimizing a weighted sum.

Weighted-Sum Objective
We have
J1 = Prob(jest = 2 and X1)
J2 = Prob(jest = 1 and X2)
and we would like to minimize J1 + λJ2.

This is the same as picking cost matrix

C = [ 0  λ
      1  0 ]

This is called the Neyman-Pearson cost function.
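Sweeping λ over a range and recording the resulting error pair traces out the operating characteristic of the following slides. A sketch (not part of the original slides), reusing the joint matrix of the earlier two-hypothesis example as an assumed illustration:

```python
# Sweeping the weight lam in the Neyman-Pearson cost C = [[0, lam], [1, 0]].
# For each row of J, estimating X2 costs J[i][0] (a false positive if X1
# occurs) and estimating X1 costs lam * J[i][1] (a false negative if X2
# occurs); we pick the cheaper option and accumulate the errors.
J = [[0.05, 0.0], [0.1, 0.0], [0.2, 0.0], [0.1, 0.05],
     [0.05, 0.1], [0.0, 0.2], [0.0, 0.1], [0.0, 0.05]]   # assumed example

def errors(lam):
    fp = fn = 0.0
    for Ji1, Ji2 in J:
        if Ji1 < lam * Ji2:   # estimating X2 has lower expected cost
            fp += Ji1         # contributes a false positive when X1 occurs
        else:
            fn += Ji2         # contributes a false negative when X2 occurs
    return (round(fp, 3), round(fn, 3))

points = sorted({errors(lam) for lam in (0.1, 0.5, 1, 2, 5, 10, 100)})
print(points)  # distinct (J1, J2) operating points -> [(0.0, 0.15), (0.05, 0.05), (0.15, 0.0)]
```

Only a finite set of points appears, since there are only a few distinct thresholds, matching the remark about the operating characteristic below.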

Example: Weighted-Sum Objective


Consider the joint probabilities

[Figure: the joint pmfs prob(Yi ∩ X1) and prob(Yi ∩ X2), plotted over the observations.]

with prior probabilities
Prob(X1) = 0.6,  Prob(X2) = 0.4


Example: Weighted-Sum Objective

The trade-off curve is below.

[Figure: the trade-off curve; horizontal axis is the probability of estimating X2 when X1 occurs (up to 0.6), vertical axis is the probability of estimating X1 when X2 occurs (up to 0.4).]

This curve is called the operating characteristic.
Note the intersections with the axes at the prior probabilities.
The Pareto-optimal points are a finite set, not a continuous curve, since there are only a few choices for the threshold value.

Operating characteristic
Also called the receiver operating characteristic or ROC.
Often plotted the other way up.



Example: Trading off Errors


With λ = 1:
[Figure: the operating characteristic with the chosen point marked, alongside the joint pmfs prob(Yi ∩ X1) and prob(Yi ∩ X2) with the corresponding decision boundary.]

With λ = 10:
[Figure: the same pair of plots for λ = 10.]


Example: Trading off Errors

The operating characteristic becomes gentler when it is hard to distinguish X1 from X2.

[Figure: two cases with increasing overlap between prob(Yi ∩ X1) and prob(Yi ∩ X2), together with the corresponding operating characteristics.]

Conditional Errors
The conditional error matrix E^cond ∈ R^{n×n} is

E^cond_jk = probability that Xj is estimated given that Xk occurred
          = Prob(jest = j | Xk)
          = sum_{i=1}^{m} Prob(jest = j and Yi | Xk)    (since the Yi partition Ω)
          = sum_{i=1}^{m} Prob( (∪{ Yp | fest(p) = j }) ∩ Yi | Xk )

Now notice that

(∪{ Yp | fest(p) = j }) ∩ Yi = Yi if fest(i) = j, and ∅ otherwise
                             = Yi if Kij = 1, and ∅ otherwise

Conditional Errors
Therefore we have

E^cond_jk = sum_{i=1}^{m} Kij Prob(Yi | Xk)
          = sum_{i=1}^{m} Kij Aik

That is,
E^cond = K^T A

Conditional Errors
For the coins example, we have

E^cond = [ 0  0    0    0  0
           1  1/2  1/6  0  0
           0  0    0    0  0
           0  1/2  5/6  1  1
           0  0    0    0  0 ]

E^cond_jk is the probability that Xj is estimated given that Xk occurred.
1^T E^cond = 1^T, i.e., the column sums are one, because when Xk occurs, some Xj is always estimated.
Ideally we would like E^cond = I.

Maximum-Likelihood
When we do not have any prior probabilities, a commonly used heuristic is the method of maximum likelihood.

MAP estimate: pick j to maximize the joint probability
Prob(Yi | Xj) Prob(Xj)

Maximum likelihood: pick j to maximize the likelihood
Prob(Yi | Xj)

We can also minimize costs associated with errors. In this case we minimize trace(E^cond C^T) instead of trace(E C^T).
Similarly, we can construct a trade-off curve using these costs.
The estimates are identical to those obtained when all prior probabilities are equal.