CS229 Supplemental Lecture Notes
John Duchi
1 Binary classification
In binary classification problems, the target y can take on only two
values. In this set of notes, we show how to model this problem by letting
y ∈ {−1, +1}, where we say that y = +1 if the example is a member of the
positive class and y = −1 if the example is a member of the negative class.
We assume, as usual, that we have input features x ∈ R^n.
As in our standard approach to supervised learning problems, we first
pick a representation for our hypothesis class (what we are trying to learn),
and after that we pick a loss function that we will minimize. In binary
classification problems, it is often convenient to use a hypothesis class of the
form h_θ(x) = θᵀx, and, when presented with a new example x, we classify it
as positive or negative depending on the sign of θᵀx; that is, our predicted
label is

  sign(h_θ(x)) = sign(θᵀx),  where  sign(t) = 1 if t > 0, 0 if t = 0, −1 if t < 0.    (1)
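As a quick sketch of this decision rule in Python (the helper name `predict` and the example vectors are hypothetical, not from the notes):

```python
def predict(theta, x):
    """Predict a label in {-1, 0, +1} as sign(theta^T x), matching the
    sign convention above (sign(0) = 0)."""
    score = sum(t_j * x_j for t_j, x_j in zip(theta, x))  # theta^T x
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

theta = [2.0, -1.0]
print(predict(theta, [1.0, 0.5]))  # -> 1  (theta^T x = 1.5 > 0)
print(predict(theta, [0.0, 3.0]))  # -> -1 (theta^T x = -3.0 < 0)
```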
We interpret θᵀx as a measure of the confidence that the parameter
vector θ assigns to labels for the point x: if θᵀx is very negative (or very
positive), then we more strongly believe the label y is negative (or positive).
Now that we have chosen a representation for our data, we must choose a
loss function. Intuitively, we would like to choose some loss function so that
for our training data {(x^(i), y^(i))}_{i=1}^m, the chosen θ makes the margin y^(i) θᵀx^(i)
very large for each training example. Let us fix a hypothetical example (x, y),
let z = yθᵀx denote the margin, and let φ : R → R be the loss function; that
is, the loss for the example (x, y) with margin z = yθᵀx is φ(z) = φ(yθᵀx).
For any particular loss function φ, the empirical risk that we minimize is then

  J(θ) = (1/m) ∑_{i=1}^m φ(y^(i) θᵀx^(i)).    (2)
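Eq. (2) translates directly into code; a minimal sketch, where `phi` is any margin-based loss and the tiny data set is made up for illustration:

```python
def empirical_risk(phi, theta, X, y):
    """J(theta) = (1/m) * sum_i phi(y_i * theta^T x_i) for a margin-based loss phi."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))  # margin of example i
        total += phi(z)
    return total / m

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
theta = [2.0, -2.0]
# Illustrative loss phi(z) = max(0, -z), which penalizes only negative margins.
print(empirical_risk(lambda z: max(0.0, -z), theta, X, y))  # -> 0.0 (both margins are +2)
```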
Consider our desired behavior: we wish to have y^(i) θᵀx^(i) positive for each
training example i = 1, . . . , m, and we should penalize those θ for which
y^(i) θᵀx^(i) < 0 frequently in the training data. Thus, an intuitive choice for
our loss would be one with φ(z) small if z > 0 (the margin is positive), while
φ(z) is large if z < 0 (the margin is negative). Perhaps the most natural
such loss is the zero-one loss, given by

  φ_zo(z) = 1 if z ≤ 0,  0 if z > 0.
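Under the zero-one loss, the risk is just the fraction of training mistakes, which we can check on a made-up data set (note that z = 0 counts as a mistake under the convention above):

```python
def zero_one_loss(z):
    """phi_zo(z): 1 if the margin z <= 0, else 0."""
    return 1.0 if z <= 0 else 0.0

def misclassification_rate(theta, X, y):
    """J(theta) under the zero-one loss: the average number of mistakes."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))
        total += zero_one_loss(z)
    return total / len(X)

X = [[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0]]
y = [1, -1, 1]
print(misclassification_rate([1.0, 1.0], X, y))  # -> 0.666... (2 of 3 examples wrong)
```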
In this case, the risk J(θ) is simply the average number of mistakes (misclassifications)
the parameter θ makes on the training data. Unfortunately, the loss φ_zo is
discontinuous, non-convex (why this matters is a bit beyond the scope of
the course), and, perhaps even more vexingly, NP-hard to minimize. So we
prefer to choose losses that have the shape given in Figure 1. That is, we
will essentially always use losses φ that satisfy φ(z) → 0 as z → ∞, while
φ(z) → ∞ as z → −∞.
As a few different examples, here are three loss functions that we will see
either now or later in the class, all of which are commonly used in machine
learning.
Figure 1: The rough shape of loss we desire: the loss is convex and continuous,
and tends to zero as the margin z = yθᵀx → ∞.
  the logistic loss      φ_logistic(z) = log(1 + e^(−z)),
  the hinge loss         φ_hinge(z) = [1 − z]₊ = max{1 − z, 0},
  the exponential loss   φ_exp(z) = e^(−z).
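A sketch of these three losses in Python, assuming the standard definitions above; the two-branch form of `logistic_loss` is just a numerically stable way to evaluate log(1 + e^(−z)) without overflow:

```python
import math

def logistic_loss(z):
    # log(1 + e^{-z}), evaluated stably for both large positive and negative z
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def hinge_loss(z):
    # [1 - z]_+ = max{1 - z, 0}
    return max(1.0 - z, 0.0)

def exp_loss(z):
    # e^{-z}
    return math.exp(-z)

# All three are small for large positive margins and large for negative ones.
for z in (-2.0, 0.0, 2.0):
    print(z, logistic_loss(z), hinge_loss(z), exp_loss(z))
```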
2 Logistic regression
With this general background in place, we now give a complementary
view of logistic regression to that in Andrew Ng's lecture notes. When we
Figure 2: The three margin-based loss functions: logistic loss, hinge loss, and
exponential loss.
use binary labels y ∈ {−1, 1}, it is possible to write logistic regression more
compactly. In particular, we use the logistic loss, so that the risk is

  J(θ) = (1/m) ∑_{i=1}^m φ_logistic(y^(i) θᵀx^(i)) = (1/m) ∑_{i=1}^m log(1 + e^(−y^(i) θᵀx^(i))).    (3)

Roughly, we hope that choosing θ to minimize the average logistic loss will
yield a θ for which y^(i) θᵀx^(i) > 0 for most (or even all!) of the training
examples.
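To see this concretely, here is a small batch gradient-descent sketch on the risk in Eq. (3); the toy data set, step size `alpha`, and iteration count are arbitrary choices for illustration:

```python
import math

def logistic_risk_grad(theta, X, y):
    """Gradient of J(theta) = (1/m) sum_i log(1 + e^{-y_i theta^T x_i}).

    Uses d/dz log(1 + e^{-z}) = -1 / (1 + e^{z}) with z = y * theta^T x.
    """
    m = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))
        coeff = -y_i / (1.0 + math.exp(z))
        for j, x_j in enumerate(x_i):
            grad[j] += coeff * x_j / m
    return grad

# Toy linearly separable data; alpha is an assumed step size.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
theta, alpha = [0.0, 0.0], 0.5
for _ in range(200):
    g = logistic_risk_grad(theta, X, y)
    theta = [t_j - alpha * g_j for t_j, g_j in zip(theta, g)]

margins = [y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i)) for x_i, y_i in zip(X, y)]
print(all(z > 0 for z in margins))  # -> True: every training margin ends up positive
```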
If we define the sigmoid function h_θ(x) = 1/(1 + e^(−θᵀx)) and choose the
probabilistic model

  p(Y = y | x; θ) = h_θ(yx),    (4)
then we see that the likelihood of the training data is

  L(θ) = ∏_{i=1}^m p(Y = y^(i) | x^(i); θ) = ∏_{i=1}^m h_θ(y^(i) x^(i)),

and the log-likelihood is

  ℓ(θ) = ∑_{i=1}^m log h_θ(y^(i) x^(i)) = −∑_{i=1}^m log(1 + e^(−y^(i) θᵀx^(i))) = −m J(θ),

where J(θ) is exactly the logistic regression risk from Eq. (3). That is,
maximum likelihood in the logistic model (4) is the same as minimizing the
average logistic loss, and we arrive at logistic regression again.
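We can sanity-check the identity log L(θ) = −m J(θ) numerically; the particular θ and data below are arbitrary:

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + e^{-theta^T x})."""
    s = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-s))

X = [[1.0, 0.5], [-1.0, 2.0], [0.5, -1.5]]
y = [1, -1, 1]
theta = [0.3, -0.7]
m = len(X)

# log L(theta) = sum_i log h_theta(y_i * x_i)
log_lik = sum(math.log(h(theta, [y_i * x_j for x_j in x_i])) for x_i, y_i in zip(X, y))

# J(theta) = (1/m) sum_i log(1 + e^{-y_i theta^T x_i})
J = sum(math.log1p(math.exp(-y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))))
        for x_i, y_i in zip(X, y)) / m

print(abs(log_lik + m * J) < 1e-9)  # -> True: log-likelihood equals -m * J(theta)
```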
This update is intuitive: if our current hypothesis h_{θ^(t)} assigns probability
close to 1 to the incorrect label −y^(i), then we try to reduce the loss by
moving θ in the direction of y^(i) x^(i). Conversely, if our current hypothesis
assigns probability close to 0 to the incorrect label −y^(i), the update
essentially does nothing.
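The update being described can be sketched as follows; the update formula in the docstring is a reconstruction consistent with the intuition above (step size `alpha` is an assumed value), not a verbatim quote of the notes:

```python
import math

def sgd_step(theta, x, y_i, alpha=0.1):
    """One stochastic-gradient step on the logistic loss for example (x, y_i):

        theta <- theta + alpha * (1 / (1 + e^{y_i * theta^T x})) * y_i * x

    The scalar 1/(1 + e^{z}) is the modeled probability of the incorrect
    label -y_i, so confident correct predictions yield a near-zero step.
    """
    z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x))
    coeff = alpha * y_i / (1.0 + math.exp(z))
    return [t_j + coeff * x_j for t_j, x_j in zip(theta, x)]

theta = [5.0, 5.0]
t_correct = sgd_step(theta, [1.0, 1.0], 1)   # large positive margin: barely moves
t_wrong = sgd_step(theta, [1.0, 1.0], -1)    # large negative margin: near-full step
print(max(abs(a - b) for a, b in zip(t_correct, theta)))  # tiny (about 4.5e-6)
print(max(abs(a - b) for a, b in zip(t_wrong, theta)))    # about alpha = 0.1
```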