Notes Chapter Logistic Regression
Logistic regression
The loss tells us how unhappy we are about the prediction h(x^{(i)}; Θ) that Θ makes for
(x^{(i)}, y^{(i)}). A common example is the 0-1 loss, introduced in chapter 1:

L_{01}(h(x; \Theta), y) = \begin{cases} 0 & \text{if } y = h(x; \Theta) \\ 1 & \text{otherwise,} \end{cases}
which gives a value of 0 for a correct prediction, and a 1 for an incorrect prediction. In the
case of linear separators, this becomes:
L_{01}(h(x; \theta, \theta_0), y) = \begin{cases} 0 & \text{if } y(\theta^T x + \theta_0) > 0 \\ 1 & \text{otherwise.} \end{cases}
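As a quick sketch of the linear-separator case (the function name is ours, not from the notes), the 0-1 loss can be written directly from the definition:

```python
import numpy as np

def zero_one_loss(x, y, theta, theta_0):
    """0-1 loss for a linear separator: 0 if the sign of
    theta^T x + theta_0 agrees with the label y in {+1, -1}, else 1."""
    return 0 if y * (np.dot(theta, x) + theta_0) > 0 else 1

# A point on the positive side of the separator, correctly labeled:
print(zero_one_loss(np.array([1.0, 2.0]), +1, np.array([1.0, 0.0]), 0.0))  # 0
# The same point with the wrong label:
print(zero_one_loss(np.array([1.0, 2.0]), -1, np.array([1.0, 0.0]), 0.0))  # 1
```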
MIT 6.036 Fall 2019 28
2 Regularization
If all we cared about was finding a hypothesis with small loss on the training data, we
would have no need for regularization, and could simply omit the second term in the
objective. But remember that our ultimate goal is to perform well on input values that we
haven’t trained on! It may seem that this is an impossible task, but humans and machine-
learning methods do this successfully all the time. What allows generalization to new input
values is a belief that there is an underlying regularity that governs both the training and
testing data. We have already discussed one way to describe an assumption about such
a regularity, which is by choosing a limited class of possible hypotheses. Another way to
do this is to provide smoother guidance, saying that, within a hypothesis class, we prefer
some hypotheses to others. The regularizer articulates this preference and the constant λ
says how much we are willing to trade off loss on the training data versus preference over
hypotheses.
This trade-off is illustrated in the figure below. Hypothesis h1 has 0 training loss, but is
very complicated. Hypothesis h2 mis-classifies two points, but is very simple. In the absence
of other beliefs about the solution, it is often better to prefer that the solution be "simpler,"
and so we might prefer h2 over h1, expecting it to perform better on future examples drawn
from this same distribution. (To establish some vocabulary, we say that h1 is overfit to the
training data.) Another nice way of thinking about regularization is that we would like to
prevent our hypothesis from being too dependent on the particular training data that we
were given: we would like for it to be the case that if the training data were changed
slightly, the hypothesis would not change by much.
[Figure: two separators on the same training data — a complicated h1 with zero training loss, and a simple h2 that misclassifies two points.]
when we have some idea in advance that θ ought to be near some value Θ_prior. In the
absence of such knowledge, a default is to regularize toward zero:

R(\Theta) = \|\Theta\|^2.

(Learn about Bayesian methods in machine learning to see the theory behind this and cool
results!)
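As a minimal sketch (the helper name is ours), the squared-norm regularizer is just the dot product of the parameter vector with itself:

```python
import numpy as np

def squared_norm_regularizer(theta):
    """R(theta) = ||theta||^2, penalizing hypotheses with large parameters."""
    return float(np.dot(theta, theta))

print(squared_norm_regularizer(np.array([3.0, 4.0])))  # 25.0
```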
This problem is NP-hard, which probably implies that solving the most difficult instances
of this problem would require computation time exponential in the number of training
examples, n. (The "probably" here is not because we're too lazy to look it up, but actually
because of a fundamental unsolved problem in computer-science theory, known as "P vs. NP.")
What makes this a difficult optimization problem is its lack of "smoothness":
• There can be two hypotheses, (θ, θ_0) and (θ′, θ′_0), where one is closer in parameter
space to the optimal parameter values (θ*, θ*_0), but they make the same number of
misclassifications and so they have the same J value.
• All predictions are categorical: the classifier can’t express a degree of certainty about
whether a particular input x should have an associated value y.
For these reasons, if we are considering a hypothesis θ, θ0 that makes five incorrect predic-
tions, it is difficult to see how we might change θ, θ0 so that it will perform better, which
makes it difficult to design an algorithm that searches through the space of hypotheses for
a good one.
For these reasons, we are going to investigate a new hypothesis class: linear logistic
classifiers. These hypotheses are still parameterized by a d-dimensional vector θ and a
scalar θ0 , but instead of making predictions in {+1, −1}, they generate real-valued outputs
in the interval (0, 1). A linear logistic classifier has the form
h(x; \theta, \theta_0) = \sigma(\theta^T x + \theta_0).
This looks familiar! What’s new?
The logistic function, also known as the sigmoid function, is defined as
\sigma(z) = \frac{1}{1 + e^{-z}},
and plotted below, as a function of its input z. Its output can be interpreted as a probability,
because for any value of z the output is in (0, 1).
[Plot of σ(z) as a function of z, for z from −4 to 4.]
Study Question: Convince yourself the output of σ is always in the interval (0, 1).
Why can’t it equal 0 or equal 1? For what value of z does σ(z) = 0.5?
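A direct sketch of the sigmoid in Python (the function name is ours, not from the notes; numpy is used for convenience):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z}); the output is always strictly in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
print(sigmoid(4.0))  # approximately 0.982, close to but never equal to 1
```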
What does a linear logistic classifier (LLC) look like? Let’s consider the simple case
where d = 1, so our input points simply lie along the x axis. The plot below shows LLCs
for three different parameter settings: σ(10x + 1), σ(−2x + 1), and σ(2x − 3).
[Plot of the three LLCs σ(10x + 1), σ(−2x + 1), and σ(2x − 3), as functions of x from −4 to 4.]
Study Question: Which plot is which? What governs the steepness of the curve?
What governs the x value where the output is equal to 0.5?
But wait! Remember that the definition of a classifier from chapter 2 is that it’s a map-
ping from Rd → {−1, +1} or to some other discrete set. So, then, it seems like an LLC is
actually not a classifier!
Given an LLC, with an output value in (0, 1), what should we do if we are forced to
make a prediction in {+1, −1}? A default answer is to predict +1 if σ(θT x + θ0 ) > 0.5 and
−1 otherwise. The value 0.5 is sometimes called a prediction threshold.
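This default rule is easy to sketch in code (the function name is ours; the example uses σ(2x − 3), one of the parameter settings from the figure above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def llc_predict(x, theta, theta_0, threshold=0.5):
    """Turn the LLC's real-valued output in (0, 1) into a hard
    prediction in {+1, -1} using a prediction threshold."""
    return +1 if sigmoid(np.dot(theta, x) + theta_0) > threshold else -1

# sigma(2*2 - 3) = sigma(1) is about 0.73 > 0.5, so predict +1:
print(llc_predict(np.array([2.0]), np.array([2.0]), -3.0))  # 1
# sigma(2*0 - 3) = sigma(-3) is about 0.05 < 0.5, so predict -1:
print(llc_predict(np.array([0.0]), np.array([2.0]), -3.0))  # -1
```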
In fact, for different problem settings, we might prefer to pick a different prediction
threshold. The field of decision theory considers how to make this choice from the perspec-
tive of Bayesian reasoning. For example, if the consequences of predicting +1 when the
answer should be −1 are much worse than the consequences of predicting −1 when the
answer should be +1, then we might set the prediction threshold to be greater than 0.5.
Study Question: Using a prediction threshold of 0.5, for what values of x do each of
the LLCs shown in the figure above predict +1?
When d = 2, then our inputs x lie in a two-dimensional space with axes x1 and x2 , and
the output of the LLC is a surface, as shown below, for θ = (1, 1), θ0 = 2.
Study Question: Convince yourself that the set of points for which σ(θT x + θ0 ) =
0.5, that is, the separator between positive and negative predictions with prediction
threshold 0.5 is a line in (x1 , x2 ) space. What particular line is it for the case in the
figure above? How would the plot change for θ = (1, 1), but now with θ0 = −2? For
θ = (−1, −1), θ0 = 2?
under the assumption that our predictions are independent. This can be cleverly rewritten,
when y^{(i)} ∈ {0, 1}, as

\prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}}.
Study Question: Be sure you can see why these two expressions are the same.
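One way to check the equivalence numerically (the label and guess values below are made-up example data):

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # example labels y^{(i)} in {0, 1}
g = np.array([0.9, 0.2, 0.7, 0.6])  # example guesses g^{(i)} in (0, 1)

# Product form of the likelihood:
likelihood = np.prod(g**y * (1 - g)**(1 - y))
# Sum-of-logs form:
log_likelihood = np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

# The log of the product matches the sum of the logs:
print(np.isclose(np.log(likelihood), log_likelihood))  # True
```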
Now, because products are kind of hard to deal with, and because the log function is
monotonic, the θ, θ0 that maximize the log of this quantity will be the same as the θ, θ0 that
maximize the original, so we can try to maximize
\sum_{i=1}^{n} \left( y^{(i)} \log g^{(i)} + (1 - y^{(i)}) \log\left(1 - g^{(i)}\right) \right).
We can turn the maximization problem above into a minimization problem by taking the
negative of the above expression, and write it in terms of minimizing a loss:

\sum_{i=1}^{n} L_{\mathrm{nll}}(g^{(i)}, y^{(i)}),

where L_{\mathrm{nll}}(\text{guess}, \text{actual}) = -\left(\text{actual} \cdot \log(\text{guess}) + (1 - \text{actual}) \cdot \log(1 - \text{guess})\right).
This loss function is also sometimes referred to as the log loss or cross entropy. (You can
use any base for the logarithm and it won't make any real difference. If we ask you for
numbers, use log base e.)
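A per-example sketch of this loss (the function name is ours; natural log, as the notes suggest):

```python
import numpy as np

def nll_loss(g, y):
    """Negative log-likelihood (log loss / cross entropy) for a single
    guess g in (0, 1) and label y in {0, 1}, using log base e."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

print(nll_loss(0.9, 1))  # about 0.105: confident and correct, small loss
print(nll_loss(0.9, 0))  # about 2.303: confident and wrong, large loss
```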
Last Updated: 12/18/19 11:56:05
J_{\mathrm{lr}}(\theta, \theta_0; \mathcal{D}) = \left( \frac{1}{n} \sum_{i=1}^{n} L_{\mathrm{nll}}\left(\sigma(\theta^T x^{(i)} + \theta_0), y^{(i)}\right) \right) + \lambda \|\theta\|^2.
Study Question: Consider the case of linearly separable data. What will the θ values
that optimize this objective be like if λ = 0? What will they be like if λ is very big?
Try to work out an example in one dimension with two data points.
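To explore the study question, here is a sketch of the objective on a two-point, one-dimensional, linearly separable dataset (the function names and the data are ours, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def j_lr(theta, theta_0, X, Y, lam):
    """J_lr: average NLL loss over the data plus lam * ||theta||^2.
    X has one example per row; Y holds labels in {0, 1}."""
    g = sigmoid(X @ theta + theta_0)
    return float(np.mean(nll_loss(g, Y)) + lam * np.dot(theta, theta))

# Two linearly separable points in one dimension:
X = np.array([[-1.0], [1.0]])
Y = np.array([0.0, 1.0])

# With lam = 0, making theta larger keeps driving the objective down,
# so the optimizer pushes ||theta|| toward infinity:
print(j_lr(np.array([1.0]), 0.0, X, Y, 0.0)
      > j_lr(np.array([10.0]), 0.0, X, Y, 0.0))  # True
```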