Chapter 4: Classification

4.1 Classification
Classification is a machine learning problem seeking to map from inputs R^d to outputs in an unordered set (in contrast to a continuous real-valued output, as we saw for linear regression). Examples of classification output sets could be {apples, oranges, pears} if we're trying to figure out what type of fruit we have, or {heart attack, no heart attack} if we're working in an emergency room and trying to give the best medical care to a new patient. We focus on an essential simple case, binary classification, where we aim to find a mapping from R^d to two outputs. While we should think of the outputs as not having an order, it's often convenient to encode them as {−1, +1}. As before, let the letter h (for hypothesis) represent a classifier, so the classification process looks like:

$$x \rightarrow h \rightarrow y .$$
We will assume that each x(i) is a d × 1 column vector. The intended meaning of this data is
that, when given an input x(i) , the learned hypothesis should generate output y(i) .
What makes a classifier useful? As in regression, we want it to work well on new data,
making good predictions on examples it hasn’t seen. But we don’t know exactly what
data this classifier might be tested on when we use it in the real world. So, we have to
assume a connection between the training data and testing data; typically, they are drawn
independently from the same probability distribution.
In classification, we will often use 0-1 loss for evaluation (as discussed in Section 1.3).
For that choice, we can write the training error and the testing error. In particular, given a
training set Dn and a classifier h, we define the training error of h to be
$$E_n(h) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases} .$$
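As a concrete illustration, here is a minimal numpy sketch of this computation; the names (`training_error`, `h`, `X`, `Y`) are our own, and we assume the data is stored with one d × 1 column per example:

```python
import numpy as np

def training_error(h, X, Y):
    """Average 0-1 loss of classifier h on the training set.

    X is d x n (one column per example); Y is 1 x n with entries in
    {-1, +1}; h maps a d x 1 column vector to -1 or +1.
    """
    n = X.shape[1]
    mistakes = sum(h(X[:, i:i+1]) != Y[0, i] for i in range(n))
    return mistakes / n
```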
For now, we will try to find a classifier with small training error (later, with some added
criteria) and hope it generalizes well to new data, and has a small test error
$$E(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} \begin{cases} 1 & \text{if } h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases}$$
on n ′ new examples that were not used in the process of finding the classifier.
We begin by introducing the hypothesis class of linear classifiers (Section 4.2) and then
define an optimization framework to learn linear logistic classifiers (Section 4.3).
4.2 Linear classifiers

A linear classifier maps an input x ∈ R^d to a label via h(x; θ, θ0) = sign(θ^T x + θ0), where θ is a d × 1 parameter vector, θ0 is a scalar offset, and sign(z) = +1 if z > 0 and −1 otherwise. The equation θ^T x + θ0 = 0 defines the separator, a hyperplane with normal vector θ.

Example: Let h be the linear classifier defined by $\theta = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $\theta_0 = 1$.
The diagram below shows the θ vector (in green) and the separator it defines:
[Figure: the separator θ^T x + θ0 = 0 in the (x1, x2) plane, with the normal vector θ (in green) and its components θ1, θ2.]
What is θ0? We can solve for it by plugging a point on the line into the equation for the line. It is often convenient to choose a point on one of the axes, e.g., in this case, x = [0, 1]^T, for which $\theta^T \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \theta_0 = 0$, giving θ0 = 1.
In this example, the separator divides Rd , the space our x(i) points live in, into two half-
spaces. The one that is on the same side as the normal vector is the positive half-space, and
we classify all points in that space as positive. The half-space on the other side is negative
and all points in it are classified as negative.
Note that we will call a separator a linear separator of a dataset if all of the data with one label falls on one side of the separator and all of the data with the other label falls on the other side. For instance, the separator in the next example is a linear separator for the illustrated data. If there exists a linear separator for a dataset, we call the dataset linearly separable.
Example: Let h be the linear classifier defined by $\theta = \begin{bmatrix} -1 \\ 1.5 \end{bmatrix}$, $\theta_0 = 3$.

The diagram below shows several points classified by h. In particular, let $x^{(1)} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$ and $x^{(2)} = \begin{bmatrix} 4 \\ -1 \end{bmatrix}$.

$$h(x^{(1)}; \theta, \theta_0) = \text{sign}\left( \begin{bmatrix} -1 & 1.5 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix} + 3 \right) = \text{sign}(3) = +1$$

$$h(x^{(2)}; \theta, \theta_0) = \text{sign}\left( \begin{bmatrix} -1 & 1.5 \end{bmatrix} \begin{bmatrix} 4 \\ -1 \end{bmatrix} + 3 \right) = \text{sign}(-2.5) = -1$$

Thus, x^(1) and x^(2) are given positive and negative classifications, respectively.
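To make the example concrete, here is a small numpy sketch (names are our own) that reproduces these two computations:

```python
import numpy as np

def linear_classify(x, theta, theta_0):
    """Linear classifier h(x; theta, theta_0) = sign(theta^T x + theta_0).

    x and theta are d x 1 column vectors; theta_0 is a scalar.
    Following the convention above, the boundary case z = 0 maps to -1.
    """
    return 1 if (theta.T @ x + theta_0).item() > 0 else -1

theta, theta_0 = np.array([[-1.0], [1.5]]), 3.0
print(linear_classify(np.array([[3.0], [2.0]]), theta, theta_0))   # +1
print(linear_classify(np.array([[4.0], [-1.0]]), theta, theta_0))  # -1
```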
[Figure: the separator θ^T x + θ0 = 0, with x^(1) on the positive side and x^(2) on the negative side.]
Study Question: What is the green vector normal to the separator? Specify it as a
column vector.
Study Question: What change would you have to make to θ, θ0 if you wanted to
have the separating hyperplane in the same place, but to classify all the points la-
beled ’+’ in the diagram as negative and all the points labeled ’-’ in the diagram as
positive?
However, even for simple linear classifiers, it is very difficult to find values for θ, θ0 that minimize the simple 0-1 training error

$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} L_{01}(\text{sign}(\theta^T x^{(i)} + \theta_0), y^{(i)}) .$$
This problem is NP-hard, which probably implies that solving the most difficult instances of this problem would require computation time exponential in the number of training examples, n. (The "probably" here is not because we're too lazy to look it up, but actually because of a fundamental unsolved problem in computer-science theory, known as "P vs. NP.")

What makes this a difficult optimization problem is its lack of "smoothness":

• There can be two hypotheses, (θ, θ0) and (θ′, θ0′), where one is closer in parameter space to the optimal parameter values (θ*, θ0*), but they make the same number of misclassifications and so have the same J value.
• All predictions are categorical: the classifier can’t express a degree of certainty about
whether a particular input x should have an associated value y.
For these reasons, if we are considering a hypothesis θ, θ0 that makes five incorrect predictions, it is difficult to see how we might change θ, θ0 so that it will perform better, which makes it difficult to design an algorithm that searches in a sensible way through the space of hypotheses for a good one. Instead, we investigate another hypothesis class: linear logistic classifiers, providing their definition, then an approach for learning such classifiers using optimization.
4.3 Linear logistic classifiers

A linear logistic classifier (LLC) produces its output by applying the sigmoid function, σ(z) = 1/(1 + e^(−z)), to a linear function of the input, giving h(x; θ, θ0) = σ(θ^T x + θ0). The sigmoid is plotted below:

[Figure: the sigmoid σ(z), increasing smoothly from 0 toward 1 with σ(0) = 0.5, plotted for z from −4 to 4.]
Study Question: Convince yourself the output of σ is always in the interval (0, 1).
Why can’t it equal 0 or equal 1? For what value of z does σ(z) = 0.5?
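For reference, a one-line numpy sketch of the sigmoid (our own code, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995, close to but never exactly 1
```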
What does an LLC look like? Let’s consider the simple case where d = 1, so our input
points simply lie along the x axis. Classifiers in this case have dimension 0, meaning that
they are points. The plot below shows LLCs for three different parameter settings: σ(10x +
1), σ(−2x + 1), and σ(2x − 3).
[Figure: the three LLC curves σ(θx + θ0) plotted against x for the three parameter settings above; each is a sigmoid with its own steepness, direction, and crossing point.]
Study Question: Which plot is which? What governs the steepness of the curve?
What governs the x value where the output is equal to 0.5?
But wait! Remember that the definition of a classifier is that it’s a mapping from Rd →
{−1, +1} or to some other discrete set. So, then, it seems like an LLC is actually not a
classifier!
Given an LLC, with an output value in (0, 1), what should we do if we are forced to
make a prediction in {+1, −1}? A default answer is to predict +1 if σ(θT x + θ0 ) > 0.5 and
−1 otherwise. The value 0.5 is sometimes called a prediction threshold.
In fact, for different problem settings, we might prefer to pick a different prediction
threshold. The field of decision theory considers how to make this choice. For example, if
the consequences of predicting +1 when the answer should be −1 are much worse than
the consequences of predicting −1 when the answer should be +1, then we might set the
prediction threshold to be greater than 0.5.
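As a sketch of how such a threshold might be exposed in code (our own illustration; `llc_predict` and its argument layout are assumptions, not from the notes):

```python
import numpy as np

def llc_predict(x, theta, theta_0, threshold=0.5):
    """Turn the LLC's output probability into a hard label in {-1, +1}.

    Raising threshold above 0.5 makes the classifier more reluctant
    to predict +1, as discussed above.
    """
    z = (theta.T @ x + theta_0).item()   # scalar linear score
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid
    return +1 if p > threshold else -1
```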
Study Question: Using a prediction threshold of 0.5, for what values of x do each of
the LLCs shown in the figure above predict +1?
When d = 2, our inputs x lie in a two-dimensional space with axes x1 and x2, and the output of the LLC is a surface, as shown below for θ = (1, 1)^T, θ0 = 2.

[Figure: the surface σ(θ^T x + θ0) plotted over the (x1, x2) plane.]
Study Question: Convince yourself that the set of points for which σ(θT x + θ0 ) = 0.5,
that is, the “boundary” between positive and negative predictions with prediction
threshold 0.5, is a line in (x1 , x2 ) space. What particular line is it for the case in the
figure above? How would the plot change for θ = (1, 1), but now with θ0 = −2? For
θ = (−1, −1), θ0 = 2?
To find a good LLC, we want parameters that assign high probability to the training labels. Using labels y^(i) ∈ {0, 1} and writing g^(i) = σ(θ^T x^(i) + θ0) for the classifier's "guess" on example i, the probability the hypothesis assigns to the whole training set is

$$\prod_{i=1}^{n} \begin{cases} g^{(i)} & \text{if } y^{(i)} = 1 \\ 1 - g^{(i)} & \text{otherwise} \end{cases} \;=\; \prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}} .$$

Study Question: Be sure you can see why these two expressions are the same.
The big product above is kind of hard to deal with in practice, though. So what can we
do? Because the log function is monotonic, the θ, θ0 that maximize the quantity above will
be the same as the θ, θ0 that maximize its log, which is the following:
$$\sum_{i=1}^{n} \left( y^{(i)} \log g^{(i)} + (1 - y^{(i)}) \log(1 - g^{(i)}) \right) .$$
Finally, we can turn the maximization problem above into a minimization problem by taking the negative of the above expression, and write it in terms of minimizing a loss:

$$\sum_{i=1}^{n} L_{\text{nll}}(g^{(i)}, y^{(i)}) ,$$

where $L_{\text{nll}}(g, y) = -\left( y \log g + (1 - y) \log(1 - g) \right)$ is the negative log-likelihood loss.
This loss function is also sometimes referred to as the log loss or cross entropy. (You can use any base for the logarithm and it won't make any real difference. If we ask you for numbers, use log base e.)

What is the objective function for linear logistic classification? We can finally put all these pieces together and develop an objective function for optimizing regularized negative log-likelihood for a linear logistic classifier. (That's a lot of fancy words!) In fact, this process is usually called "logistic regression," so we'll call our objective J_lr and define it as

$$J_{\text{lr}}(\theta, \theta_0; \mathcal{D}) = \left( \frac{1}{n} \sum_{i=1}^{n} L_{\text{nll}}(\sigma(\theta^T x^{(i)} + \theta_0), y^{(i)}) \right) + \lambda \, \|\theta\|^2 . \tag{4.1}$$
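As a sanity check on Eq. 4.1, here is a direct numpy transcription (a sketch under our own conventions: X is d × n with one column per example, Y is 1 × n with labels in {0, 1}, and `lam` stands for λ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    """L_nll(g, y) = -(y log g + (1 - y) log(1 - g)), elementwise."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def J_lr(theta, theta_0, X, Y, lam):
    """Eq. 4.1: average NLL loss plus lam * ||theta||^2."""
    g = sigmoid(theta.T @ X + theta_0)   # 1 x n vector of guesses
    return float(np.mean(nll_loss(g, Y)) + lam * np.sum(theta ** 2))
```

Note that the NLL loss pairs naturally with {0, 1} labels; if your data uses {−1, +1}, convert first.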
Study Question: Consider the case of linearly separable data. What will the θ values
that optimize this objective be like if λ = 0? What will they be like if λ is very big?
Try to work out an example in one dimension with two data points.
What role does regularization play for classifiers? This objective function has the same
structure as the one we used for regression, Eq. 2.2, where the first term (in parentheses)
is the average loss, and the second term is for regularization. Regularization is needed
for building classifiers that can generalize well (just as was the case for regression). The
parameter λ governs the trade-off between the two terms as illustrated in the following
example.
Suppose we wish to obtain a linear logistic classifier for this one-dimensional dataset:
[Figure: a one-dimensional dataset with labels y ∈ {0, 1} plotted against x ∈ [−8, 8]; the negative examples sit well to the left of x = 0 and the positive examples well to the right, leaving a wide gap around x = 0.]
Clearly, this can be fit very nicely by a hypothesis h(x) = σ(θx), but what is the best value
for θ? Evidently, when there is no regularization (λ = 0), the objective function Jlr (θ) will
approach zero for large values of θ, as shown in the plot on the left, below. However, would
the best hypothesis really have an infinite (or very large) value for θ? Such a hypothesis
would suggest that the data indicate strong certainty that a sharp transition between y = 0
and y = 1 occurs exactly at x = 0, despite the actual data having a wide gap around x = 0.
[Figure: two plots of J_lr(θ) as a function of θ ∈ [−4, 4]. Left (λ = 0): the objective keeps decreasing toward zero as θ grows. Right (λ = 0.2): the objective has a minimum at a moderate value of θ.]
Study Question: Be sure this makes sense. When the θ values are very large, what
does the sigmoid curve look like? Why do we say that it has a strong certainty in
that case?
In the absence of other beliefs about the solution, we might prefer that our linear logistic classifier not be overly certain about its predictions, and so we might prefer a smaller θ over a large θ. By not being overconfident, we might expect a somewhat smaller θ to perform better on future examples drawn from this same distribution. This preference can be realized using a nonzero value of the regularization trade-off parameter, as illustrated in the plot on the right, above, with λ = 0.2. (To refresh some vocabulary: we say that, in this example, a very large θ would be overfit to the training data.)

Another nice way of thinking about regularization is that we would like to prevent our hypothesis from being too dependent on the particular training data that we were given: we would like it to be the case that if the training data were changed slightly, the hypothesis would not change by much.
Study Question: Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for logistic regression be-
comes:
8: until $|J_{\text{lr}}(\theta^{(t)}, \theta_0^{(t)}) - J_{\text{lr}}(\theta^{(t-1)}, \theta_0^{(t-1)})| < \epsilon$
9: return $\theta^{(t)}, \theta_0^{(t)}$
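Here is a Python sketch of this procedure (our own, not the notes' pseudocode). We assume the standard gradients $\nabla_\theta J_{\text{lr}} = \frac{1}{n}\sum_i (g^{(i)} - y^{(i)}) x^{(i)} + 2\lambda\theta$ and $\partial J_{\text{lr}}/\partial \theta_0 = \frac{1}{n}\sum_i (g^{(i)} - y^{(i)})$, zero initialization, and a fixed step size `eta`; the loop stops with the criterion from step 8:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient_descent(X, Y, lam, eta, eps, max_iters=100_000):
    """Gradient descent on J_lr. X is d x n; Y is 1 x n with labels in {0, 1}."""
    d, n = X.shape
    theta, theta_0 = np.zeros((d, 1)), 0.0

    def J(th, th0):
        g = sigmoid(th.T @ X + th0)
        return float(np.mean(-(Y * np.log(g) + (1 - Y) * np.log(1 - g)))
                     + lam * np.sum(th ** 2))

    prev = J(theta, theta_0)
    for _ in range(max_iters):
        g = sigmoid(theta.T @ X + theta_0)                       # 1 x n guesses
        theta = theta - eta * (X @ (g - Y).T / n + 2 * lam * theta)
        theta_0 = theta_0 - eta * float(np.mean(g - Y))
        cur = J(theta, theta_0)
        if abs(prev - cur) < eps:                                # step 8's criterion
            break
        prev = cur
    return theta, theta_0
```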
To see that L_nll is convex in z = θ^T x + θ0, write L_nll(σ(z), y) = y f1(z) + (1 − y) f2(z), with f1(z) = −log(σ(z)) and f2(z) = −log(1 − σ(z)). First, since f1(z) = log(1 + exp(−z)) and d/dz f1(z) = σ(z) − 1, the derivative of the function f1(z) is a monotonically increasing function, and therefore f1 is a convex function.
Second, we can see that, since

$$\frac{d}{dz} f_2(z) = \frac{d}{dz}\left[-\log\frac{\exp(-z)}{1+\exp(-z)}\right] = \frac{d}{dz}\left[\log(1+\exp(-z)) + z\right] = \sigma(z) ,$$

the derivative of the function f2(z) is also monotonically increasing, and therefore f2 is a convex function.
For multiclass classification with K classes, the hypothesis produces a vector g = softmax(z), where softmax maps a vector z ∈ R^K to a vector with entries

$$g_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} ,$$

which can be interpreted as a probability distribution over K items. To make the final prediction of the class label, we can then look at g, find the most likely probability over these K entries (i.e., find the largest entry in g), and return the corresponding index as the position of the 1 in our "one-hot" prediction.
Study Question: Convince yourself that the vector of g values will be non-negative
and sum to 1.
h(x; θ, θ0 ) = softmax(θT x + θ0 ) .
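A minimal numpy sketch of this multiclass hypothesis (our own illustration; we assume θ is d × K, θ0 is K × 1, and x is d × 1):

```python
import numpy as np

def softmax(z):
    """Map a K x 1 score vector to a probability distribution over K classes."""
    e = np.exp(z - np.max(z))    # subtracting the max avoids overflow
    return e / np.sum(e)

def multiclass_predict(x, theta, theta_0):
    """Return the index of the largest entry of g = softmax(theta^T x + theta_0)."""
    g = softmax(theta.T @ x + theta_0)
    return int(np.argmax(g))
```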
Now, we retain the goal of maximizing the probability that our hypothesis assigns to the correct output for each input x. Letting g stand for our "guess" h(x), we can write this probability, for a single example (x, y) with one-hot label y, as $\prod_{k=1}^{K} g_k^{y_k}$.
Study Question: How many elements that are not equal to 1 will there be in this
product?
The negative log of the probability that we are making a correct guess is, then, for one-hot vector y and probability distribution vector g,

$$L_{\text{nllm}}(g, y) = -\sum_{k=1}^{K} y_k \cdot \log(g_k) .$$
We'll call this loss NLLM, for negative log-likelihood multiclass. It is worth noting that the NLLM loss function is also convex; however, we will omit the proof.
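For completeness, a direct transcription of L_nllm (a sketch; `g` and `y` are K × 1 arrays as above):

```python
import numpy as np

def nllm_loss(g, y):
    """-sum_k y_k log(g_k) for one-hot y and probability vector g."""
    return float(-np.sum(y * np.log(g)))

# With one-hot y, this just picks out -log of the probability
# assigned to the true class:
g = np.array([[0.7], [0.2], [0.1]])
y = np.array([[0.0], [1.0], [0.0]])
print(nllm_loss(g, y))   # -log(0.2), about 1.609
```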
Study Question: Be sure you see that L_nllm is minimized when the guess assigns high probability to the true class.
where g(i) is the final guess for one class or the other that we make from h(x(i) ), e.g., after
thresholding. It’s noteworthy here that we use a different loss function for optimization
than for evaluation. This is a compromise we make for computational ease and efficiency.