Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
All
rights reserved. Draft of January 12, 2025.
Chapter 5: Logistic Regression
“And how do you know that these fine begonias are not of equal importance?”
Hercule Poirot, in Agatha Christie’s The Mysterious Affair at Styles
Detective stories are as littered with clues as texts are with words. Yet for the
poor reader it can be challenging to know how to weigh the author’s clues in order
to make the crucial classification task: deciding whodunnit.
In this chapter we introduce an algorithm that is admirably suited for discovering
the link between features or clues and some particular outcome: logistic regression.
Indeed, logistic regression is one of the most important analytic tools in the social
and natural sciences. In natural language processing, logistic regression is the
baseline supervised machine learning algorithm for classification, and also has a
very close relationship with neural networks. As we will see in Chapter 7, a neural
network can be viewed as a series of logistic regression classifiers stacked on top of
each other. Thus the classification and machine learning techniques introduced here
will play an important role throughout the book.
Logistic regression can be used to classify an observation into one of two classes
(like ‘positive sentiment’ and ‘negative sentiment’), or into one of many classes.
Because the mathematics for the two-class case is simpler, we’ll describe this special
case of logistic regression first in the next few sections, and then briefly summarize
the use of multinomial logistic regression for more than two classes in Section 5.3.
We’ll introduce the mathematics of logistic regression in the next few sections.
But let’s begin with some high-level issues.
Generative and Discriminative Classifiers: The most important difference between
naive Bayes and logistic regression is that logistic regression is a discriminative
classifier while naive Bayes is a generative classifier.
These are two very different frameworks for how
to build a machine learning model. Consider a visual
metaphor: imagine we’re trying to distinguish dog
images from cat images. A generative model would
have the goal of understanding what dogs look like
and what cats look like. You might literally ask such
a model to ‘generate’, i.e., draw, a dog. Given a test
image, the system then asks whether it’s the cat model or the dog model that better
fits (is less surprised by) the image, and chooses that as its label.
A discriminative model, by contrast, is only trying to learn to distinguish the
classes (perhaps without learning much about them). So maybe all the dogs in the
training data are wearing collars and the cats aren't. If that one feature neatly
separates the classes, the model is satisfied. If you ask such a model what it knows
about cats all it can say is that they don't wear collars.
More formally, recall that naive Bayes assigns a class c to a document d not
by directly computing P(c|d) but by computing a likelihood and a prior:

$$\hat{c} = \operatorname*{argmax}_{c \in C}\ \overbrace{P(d|c)}^{\text{likelihood}}\ \overbrace{P(c)}^{\text{prior}} \tag{5.1}$$
A generative model like naive Bayes makes use of this likelihood term, which
expresses how to generate the features of a document if we knew it was of class c.
By contrast a discriminative model in this text categorization scenario attempts
to directly compute P(c|d). Perhaps it will learn to assign a high weight to document
features that directly improve its ability to discriminate between possible classes,
even if it couldn't generate an example of one of the classes.
Components of a probabilistic machine learning classifier: Like naive Bayes,
logistic regression is a probabilistic classifier that makes use of supervised machine
learning. Machine learning classifiers require a training corpus of m input/output
pairs (x(i) , y(i) ). (We’ll use superscripts in parentheses to refer to individual instances
in the training set—for sentiment classification each instance might be an individual
document to be classified.) A machine learning system for classification then has
four components:
1. A feature representation of the input. For each input observation x^(i), this
will be a vector of features [x_1, x_2, ..., x_n]. We will generally refer to feature
i for input x^(j) as x_i^(j), sometimes simplified as x_i, but we will also see the
notation f_i, f_i(x), or, for multiclass classification, f_i(c, x).
2. A classification function that computes ŷ, the estimated class, via p(y|x). In
the next section we will introduce the sigmoid and softmax tools for classifi-
cation.
3. An objective function that we want to optimize for learning, usually involving
minimizing a loss function corresponding to error on training examples. We
will introduce the cross-entropy loss function.
4. An algorithm for optimizing the objective function. We introduce the stochas-
tic gradient descent algorithm.
Logistic regression has two phases:
training: We train the system (specifically the weights w and b, introduced below)
using stochastic gradient descent and the cross-entropy loss.
test: Given a test example x we compute p(y|x) and return the higher probability
label y = 1 or y = 0.
5.1 The Sigmoid Function

Consider a single input observation x, represented by a vector of features
[x_1, x_2, ..., x_n]. The classifier output y can be 1 (meaning the observation is a
member of the class) or 0 (the observation is not a member of the class). We want
to know the probability P(y = 1|x) that this observation is a member of the class. So
perhaps the decision is "positive sentiment" versus "negative sentiment", the features
represent counts of words in a document, P(y = 1|x) is the probability that the
document has positive sentiment, and P(y = 0|x) is the probability that the document
has negative sentiment.
Logistic regression solves this task by learning, from a training set, a vector of
weights and a bias term. Each weight w_i is a real number, and is associated with one
of the input features x_i. The weight w_i represents how important that input feature
is to the classification decision, and can be positive (providing evidence that the
instance being classified belongs in the positive class) or negative (providing evidence
that the instance being classified belongs in the negative class). Thus we might
expect in a sentiment task the word awesome to have a high positive weight, and
abysmal to have a very negative weight. The bias term, also called the intercept, is
another real number that's added to the weighted inputs.
To make a decision on a test instance—after we’ve learned the weights in training—
the classifier first multiplies each xi by its weight wi , sums up the weighted features,
and adds the bias term b. The resulting single number z expresses the weighted sum
of the evidence for the class.
$$z = \left(\sum_{i=1}^{n} w_i x_i\right) + b \tag{5.2}$$
In the rest of the book we'll represent such sums using the dot product notation
from linear algebra. The dot product of two vectors a and b, written as a · b, is the
sum of the products of the corresponding elements of each vector. (Notice that we
represent vectors using the boldface notation b). Thus the following is an equivalent
formulation of Eq. 5.2:
z = w·x+b (5.3)
But note that nothing in Eq. 5.3 forces z to be a legal probability, that is, to lie
between 0 and 1. In fact, since weights are real-valued, the output might even be
negative; z ranges from −∞ to ∞.
Figure 5.1 The sigmoid function σ(z) = 1/(1 + e^(−z)) takes a real value and maps it to the range
(0, 1). It is nearly linear around 0 but outlier values get squashed toward 0 or 1.
To create a probability, we'll pass z through the sigmoid function, σ(z). The
sigmoid function (named because it looks like an s) is also called the logistic
function, and gives logistic regression its name. The sigmoid has the following
equation, shown graphically in Fig. 5.1:

$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \exp(-z)} \tag{5.4}$$
(For the rest of the book, we’ll use the notation exp(x) to mean ex .) The sigmoid
has a number of advantages; it takes a real-valued number and maps it into the range
(0, 1), which is just what we want for a probability. Because it is nearly linear around
0 but flattens toward the ends, it tends to squash outlier values toward 0 or 1. And
it’s differentiable, which as we’ll see in Section 5.10 will be handy for learning.
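To make this concrete, here is the sigmoid as a few lines of Python (a sketch of ours, using numpy; not code from the text):

```python
import numpy as np

def sigmoid(z):
    """Map a real value (or numpy array) z into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: the sigmoid is centered at z = 0
print(sigmoid(5.0))   # ~0.993: large positive z squashed toward 1
print(sigmoid(-5.0))  # ~0.007: large negative z squashed toward 0
```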
We’re almost there. If we apply the sigmoid to the sum of the weighted features,
we get a number between 0 and 1. To make it a probability, we just need to make
sure that the two cases, p(y = 1) and p(y = 0), sum to 1. We can do this as follows:
$$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1 + \exp(-(w \cdot x + b))}$$

$$P(y=0) = 1 - \sigma(w \cdot x + b) = 1 - \frac{1}{1 + \exp(-(w \cdot x + b))} = \frac{\exp(-(w \cdot x + b))}{1 + \exp(-(w \cdot x + b))} \tag{5.5}$$
The sigmoid has an inverse, called the logit function:

$$\text{logit}(p) = \sigma^{-1}(p) = \ln \frac{p}{1-p} \tag{5.7}$$
Using the term logit for z is a way of reminding us that by using the sigmoid to turn
z (which ranges from −∞ to ∞) into a probability, we are implicitly interpreting z as
not just any real-valued number, but as specifically a log odds.
5.2 Classification with Logistic Regression

Let's have some examples of applying logistic regression as a classifier for language
tasks.
[Figure 5.2 shows the review text "It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you." annotated with the extracted feature values x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.19.]
Figure 5.2 A sample mini test document showing the extracted features in the vector x.
Let’s assume for the moment that we’ve already learned a real-valued weight
for each of these features, and that the 6 weights corresponding to the 6 features
are [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], while b = 0.1. (We’ll discuss in the next section
how the weights are learned.) The weight w1, for example, indicates how important
a feature the number of positive lexicon words (great, nice, enjoyable, etc.) is to
a positive sentiment decision, while w2 tells us the importance of negative lexicon
words. Note that w1 = 2.5 is positive, while w2 = −5.0, meaning that negative words
are negatively associated with a positive sentiment decision, and are about twice as
important as positive words.
Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed
using Eq. 5.5:

$$p(+|x) = P(y=1|x) = \sigma(w \cdot x + b) = \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1) = \sigma(0.833) = 0.70 \tag{5.8}$$

$$p(-|x) = P(y=0|x) = 1 - \sigma(w \cdot x + b) = 0.30 \tag{5.9}$$
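As a quick check, here is that computation as a Python sketch (our own code; the weights, features, and bias are the values given above):

```python
import numpy as np

w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])  # learned weights
x = np.array([3, 2, 1, 3, 0, 4.19])             # features from Fig. 5.2
b = 0.1                                          # bias term

z = w.dot(x) + b               # weighted sum of the evidence: 0.833
p_pos = 1 / (1 + np.exp(-z))   # P(y = 1|x) = sigma(z), about .70
p_neg = 1 - p_pos              # P(y = 0|x), about .30
print(z, p_pos, p_neg)
```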
Alternatively, we can normalize the input feature values to lie between 0 and 1:

$$x_i' = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} \tag{5.10}$$
Having input data with comparable range is useful when comparing values across
features. Data scaling is especially important in large neural networks, since it helps
speed up gradient descent.
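Here is a sketch of Eq. 5.10 in numpy (our own helper, assuming every feature's max exceeds its min), where each column of X holds one feature across the dataset:

```python
import numpy as np

def minmax_scale(X):
    """Rescale each feature (column) of X into [0, 1], as in Eq. 5.10."""
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    return (X - mins) / (maxs - mins)  # assumes max > min per feature

X = np.array([[3.0, 2.0],
              [1.0, 0.0],
              [5.0, 4.0]])
print(minmax_scale(X))  # each column now spans 0 to 1
```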
For the first 3 test examples, then, we would be separately computing the predicted
ŷ^(i) as follows:

$$\hat{y}^{(1)} = \sigma(w \cdot x^{(1)} + b);\quad \hat{y}^{(2)} = \sigma(w \cdot x^{(2)} + b);\quad \hat{y}^{(3)} = \sigma(w \cdot x^{(3)} + b) \tag{5.11}$$
But it turns out that we can slightly modify our original equation Eq. 5.5 to do
this much more efficiently. We’ll use matrix arithmetic to assign a class to all the
examples with one matrix operation!
First, we'll pack all the input feature vectors for each input x into a single input
matrix X, where each row i is a row vector consisting of the feature vector for
input example x^(i) (i.e., the vector x^(i)). Assuming each example has f features and
weights, X will therefore be a matrix of shape [m × f], as follows:

$$X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_f^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_f^{(2)} \\ x_1^{(3)} & x_2^{(3)} & \cdots & x_f^{(3)} \\ & & \vdots & \end{bmatrix} \tag{5.12}$$

Multiplying X by the weight vector w and adding the bias then computes all the
outputs with one operation:
y = Xw + b (5.13)
You should convince yourself that Eq. 5.13 computes the same thing as our for-loop
in Eq. 5.11. For example ŷ(1) , the first entry of the output vector y, will correctly be:
$$\hat{y}^{(1)} = [x_1^{(1)}, x_2^{(1)}, \ldots, x_f^{(1)}] \cdot [w_1, w_2, \ldots, w_f] + b \tag{5.14}$$
Note that we had to reorder X and w from the order they appeared in Eq. 5.5 to
make the multiplications come out properly. Here is Eq. 5.13 again with the shapes
shown:
$$\underbrace{y}_{(m \times 1)} = \underbrace{X}_{(m \times f)}\ \underbrace{w}_{(f \times 1)} + \underbrace{b}_{(m \times 1)} \tag{5.15}$$
Modern compilers and compute hardware can compute this matrix operation very
efficiently, making the computation much faster, which becomes important when
training or testing on very large datasets.
Note by the way that we could have kept X and w in the original order (y =
wX + b) if we had chosen to define X differently as a matrix of column vectors, one
vector for each input example, instead of row vectors, and then it would have shape
[f × m]. But we conventionally represent inputs as rows.
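Here is the difference as a numpy sketch (names ours): the for-loop of Eq. 5.11 versus the single matrix operation of Eq. 5.13, which produce identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
m, f = 4, 6             # 4 examples, 6 features each
X = rng.random((m, f))  # input matrix, shape [m x f], one row per example
w = rng.random(f)       # weight vector, shape [f]
b = 0.1                 # bias term

# For-loop version (Eq. 5.11): one dot product per example
z_loop = np.array([X[i].dot(w) + b for i in range(m)])

# Vectorized version (Eq. 5.13): one matrix-vector product
z_vec = X @ w + b

assert np.allclose(z_loop, z_vec)
y_hat = 1 / (1 + np.exp(-z_vec))  # sigmoid applied elementwise
```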
5.3 Multinomial Logistic Regression

In multinomial logistic regression we classify each observation into one of K
classes, where only one of the K classes is the correct one (sometimes called hard
classification; an observation can not be in multiple classes). Let's use the following
representation: the output y for each input x will be a vector of length K. If class c
is the correct class, we'll set y_c = 1, and set all the other elements of y to be 0, i.e.,
y_c = 1 and y_j = 0 ∀ j ≠ c. A vector like
this y, with one value=1 and the rest 0, is called a one-hot vector. The job of the
classifier is to produce an estimate vector ŷ. For each class k, the value ŷk will be
the classifier’s estimate of the probability p(yk = 1|x).
5.3.1 Softmax
The multinomial logistic classifier uses a generalization of the sigmoid, called the
softmax function, to compute p(y_k = 1|x). The softmax function takes a vector
z = [z1 , z2 , ..., zK ] of K arbitrary values and maps them to a probability distribution,
with each value in the range [0,1], and all the values summing to 1. Like the sigmoid,
it is an exponential function.
For a vector z of dimensionality K, the softmax is defined as:
$$\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)} \quad 1 \leq i \leq K \tag{5.16}$$
The denominator $\sum_{j=1}^{K} \exp(z_j)$ is used to normalize all the values into probabilities.
Thus for example given a vector:

z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]

the resulting (rounded) softmax(z) is:

[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
Like the sigmoid, the softmax has the property of squashing values toward 0 or 1.
Thus if one of the inputs is larger than the others, it will tend to push its probability
toward 1, and suppress the probabilities of the smaller inputs.
Finally, note that, just as for the sigmoid, we refer to z, the vector of scores that
is the input to the softmax, as logits (see Eq. 5.7).
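A minimal softmax sketch in Python; subtracting max(z) before exponentiating is a standard numerical-stability trick (not part of Eq. 5.16, but it leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    """Map a vector of K scores (logits) to a probability distribution."""
    e = np.exp(z - np.max(z))  # shift by max(z) to avoid overflow
    return e / e.sum()         # normalize so the values sum to 1

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))  # ~[0.055 0.090 0.007 0.100 0.738 0.010], sums to 1
```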
In multinomial logistic regression, the probability of class k is computed from the
dot product of the input x with a separate weight vector w_k (plus a bias b_k) for each
class, normalized by the softmax:

$$p(y_k = 1|x) = \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)} \tag{5.18}$$
The form of Eq. 5.18 makes it seem that we would compute each output separately.
Instead, it's more common to set up the equation for more efficient computation
by modern vector processing hardware. We'll do this by representing the
set of K weight vectors as a weight matrix W and a bias vector b. Each row k of
W corresponds to the vector of weights wk . W thus has shape [K × f ], for K the
number of output classes and f the number of input features. The bias vector b has
one value for each of the K output classes. If we represent the weights in this way,
we can compute ŷ, the vector of output probabilities for each of the K classes, by a
single elegant equation:
ŷ = softmax(Wx + b) (5.19)
If you work out the matrix arithmetic, you can see that the estimated score of
the first output class ŷ1 (before we take the softmax) will correctly turn out to be
w1 · x + b1 .
One helpful interpretation of the weight matrix W is to see each row w_k as a
prototype of class k. The weight vector w_k that is learned represents the class as
a kind of template. Since two vectors that are more similar to each other have a
higher dot product with each other, the dot product acts as a similarity function.
Logistic regression is thus learning an exemplar representation for each class, such
that incoming vectors are assigned the class k they are most similar to from the K
classes.
Fig. 5.3 shows the difference between binary and multinomial logistic regression
by illustrating the weight vector versus weight matrix in the computation of the
output class probabilities.
Feature   Definition                      w5,+   w5,−   w5,0
f5(x)     1 if "!" ∈ doc; 0 otherwise     3.5    3.1    −5.3
Because these feature weights are dependent both on the input text and the output
class, we sometimes make this dependence explicit and represent the features
themselves as f(x, y): a function of both the input and the class. Using such a notation
[Figure 5.3 diagram: in the binary case, an input feature vector x of shape [f × 1] (e.g., word count = 3, positive lexicon words = 1, count of "no" = 0) is multiplied by a weight vector w of shape [1 × f] and passed through a sigmoid to give a scalar output ŷ.]
Figure 5.3 Binary versus multinomial logistic regression. Binary logistic regression uses a
single weight vector w, and has a scalar output ŷ. In multinomial logistic regression we have
K separate weight vectors corresponding to the K classes, all packed into a single weight
matrix W, and a vector output ŷ. We omit the biases from both figures for clarity.
f5 (x) above could be represented as three features f5 (x, +), f5 (x, −), and f5 (x, 0),
each of which has a single weight. We’ll use this kind of notation in our description
of the CRF in Chapter 17.
5.4 Learning in Logistic Regression

How are the weights w and the bias b of a logistic regression model learned? Logistic
regression is an instance of supervised classification: for each observation x in the
training set we know the correct label y, and the system produces an estimate ŷ. We
want to learn parameters that make ŷ as close as possible to the true y, so we need a
metric for how close the system output is to the gold label y. Rather than measure
similarity, we usually talk about the opposite of this:
the distance between the system output and the gold output, and we call this distance
the loss function or the cost function. In the next section we'll introduce the loss
function that is commonly used for logistic regression and also for neural networks,
the cross-entropy loss.
The second thing we need is an optimization algorithm for iteratively updating
the weights so as to minimize this loss function. The standard algorithm for this is
gradient descent; we’ll introduce the stochastic gradient descent algorithm in the
following section.
We'll describe these algorithms for the simpler case of binary logistic regression
in the next two sections, and then turn to multinomial logistic regression in
Section 5.8.
5.5 The Cross-Entropy Loss Function

We do this via a loss function that prefers the correct class labels of the training
examples to be more likely. This is called conditional maximum likelihood
estimation: we choose the parameters w, b that maximize the log probability of
the true y labels in the training data given the observations x. The resulting loss
function is the negative log likelihood loss, generally called the cross-entropy loss.
Let's derive this loss function, applied to a single observation x. We'd like to
learn weights that maximize the probability of the correct label p(y|x). Since there
are only two discrete outcomes (1 or 0), this is a Bernoulli distribution, and we can
express the probability p(y|x) that our classifier produces for one observation as the
following (keeping in mind that if y = 1, Eq. 5.21 simplifies to ŷ; if y = 0, Eq. 5.21
simplifies to 1 − ŷ):

$$p(y|x) = \hat{y}^{\,y}\,(1 - \hat{y})^{1-y} \tag{5.21}$$
Now we take the log of both sides. This will turn out to be handy mathematically,
and doesn't hurt us; whatever values maximize a probability will also maximize the
log of the probability:

$$\log p(y|x) = \log\left[\hat{y}^{\,y}\,(1 - \hat{y})^{1-y}\right] = y \log \hat{y} + (1-y) \log(1 - \hat{y}) \tag{5.22}$$
Eq. 5.22 describes a log likelihood that should be maximized. In order to turn this
into a loss function (something that we need to minimize), we'll just flip the sign on
Eq. 5.22. The result is the cross-entropy loss L_CE:

$$L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[y \log \hat{y} + (1-y) \log(1 - \hat{y})\right] \tag{5.23}$$

Finally, we can plug in the definition of ŷ = σ(w · x + b):

$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1-y) \log(1 - \sigma(w \cdot x + b))\right] \tag{5.24}$$
Let’s see if this loss function does the right thing for our example from Fig. 5.2. We
want the loss to be smaller if the model’s estimate is close to correct, and bigger if
the model is confused. So first let’s suppose the correct gold label for the sentiment
example in Fig. 5.2 is positive, i.e., y = 1. In this case our model is doing well, since
from Eq. 5.8 it indeed gave the example a higher probability of being positive (.70)
than negative (.30). If we plug σ (w · x + b) = .70 and y = 1 into Eq. 5.24, the right
side of the equation drops out, leading to the following loss (we’ll use log to mean
natural log when the base is not specified):

$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1-y) \log(1 - \sigma(w \cdot x + b))\right] = -\log \sigma(w \cdot x + b) = -\log(0.70) = 0.36$$
By contrast, let’s pretend instead that the example in Fig. 5.2 was actually negative,
i.e., y = 0 (perhaps the reviewer went on to say “But bottom line, the movie is
terrible! I beg you not to see it!”). In this case our model is confused and we’d want
the loss to be higher. Now if we plug y = 0 and 1 − σ (w · x + b) = .30 from Eq. 5.8
into Eq. 5.24, the left side of the equation drops out:

$$L_{CE}(\hat{y}, y) = -\left[(1-y) \log(1 - \sigma(w \cdot x + b))\right] = -\log(0.30) = 1.2$$
Sure enough, the loss for the first classifier (.36) is less than the loss for the second
classifier (1.2).
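Both cases can be verified with a few lines of Python (a sketch; log is the natural log, as in the text):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy loss (Eq. 5.23) for one binary observation."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.70                    # sigma(w . x + b) from Eq. 5.8
print(cross_entropy(y_hat, 1))  # gold label positive: ~0.36
print(cross_entropy(y_hat, 0))  # gold label negative: ~1.20
```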
Why does minimizing this negative log probability do what we want? A perfect
classifier would assign probability 1 to the correct outcome (y = 1 or y = 0) and
probability 0 to the incorrect outcome. That means if y equals 1, the higher ŷ is (the
closer it is to 1), the better the classifier; the lower ŷ is (the closer it is to 0), the
worse the classifier. If y equals 0, instead, the higher 1 − ŷ is (closer to 1), the better
the classifier. The negative log of ŷ (if the true y equals 1) or 1 − ŷ (if the true y
equals 0) is a convenient loss metric since it goes from 0 (negative log of 1, no loss)
to infinity (negative log of 0, infinite loss). This loss function also ensures that as
the probability of the correct answer is maximized, the probability of the incorrect
answer is minimized; since the two sum to one, any increase in the probability of the
correct answer is coming at the expense of the incorrect answer. It’s called the cross-
entropy loss, because Eq. 5.22 is also the formula for the cross-entropy between the
true probability distribution y and our estimated distribution ŷ.
Now we know what we want to minimize; in the next section, we’ll see how to
find the minimum.
5.6 Gradient Descent

Our goal with gradient descent is to find the optimal weights: to minimize the loss
function we've defined for the model. Representing all the parameters (for logistic
regression, w and b) as a single vector θ, the goal is to find the parameters that
minimize the average loss over the m training examples:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{m} \sum_{i=1}^{m} L_{CE}\left(f(x^{(i)}; \theta), y^{(i)}\right) \tag{5.25}$$

How shall we find the minimum of this (or any) loss function? Gradient descent is
a method that finds a minimum of a function by figuring out in which direction (in
the space of the parameters θ ) the function’s slope is rising the most steeply, and
moving in the opposite direction. The intuition is that if you are hiking in a canyon
and trying to descend most quickly down to the river at the bottom, you might look
around yourself in all directions, find the direction where the ground is sloping the
steepest, and walk downhill in that direction.
For logistic regression, this loss function is conveniently convex. A convex function
has at most one minimum; there are no local minima to get stuck in, so gradient
descent starting from any point is guaranteed to find the minimum. (By contrast,
the loss for multi-layer neural networks is non-convex, and gradient descent may
get stuck in local minima for neural network training and never find the global opti-
mum.)
Although the algorithm (and the concept of gradient) are designed for direction
vectors, let’s first consider a visualization of the case where the parameter of our
system is just a single scalar w, shown in Fig. 5.4.
Given a random initialization of w at some value w1 , and assuming the loss
function L happened to have the shape in Fig. 5.4, we need the algorithm to tell us
whether at the next iteration we should move left (making w2 smaller than w1 ) or
right (making w2 bigger than w1 ) to reach the minimum.
[Figure 5.4 plot: the loss as a function of a single weight w; the slope of the loss at the initial value w^1 is negative, so one step of gradient descent moves w to the right, toward the minimum at w_min.]
Figure 5.4 The first step in iteratively finding the minimum of this loss function, by moving
w in the reverse direction from the slope of the function. Since the slope is negative, we need
to move w in a positive direction, to the right. Here superscripts are used for learning steps,
so w1 means the initial value of w (which is 0), w2 the value at the second step, and so on.
The gradient descent algorithm answers this question by finding the gradient
of the loss function at the current point and moving in the opposite direction. The
gradient of a function of many variables is a vector pointing in the direction of the
greatest increase in a function. The gradient is a multi-variable generalization of the
slope, so for a function of one variable like the one in Fig. 5.4, we can informally
think of the gradient as the slope. The dotted line in Fig. 5.4 shows the slope of this
hypothetical loss function at point w = w1 . You can see that the slope of this dotted
line is negative. Thus to find the minimum, gradient descent tells us to go in the
opposite direction: moving w in a positive direction.
The magnitude of the amount to move in gradient descent is the value of the
slope $\frac{d}{dw} L(f(x; w), y)$ weighted by a learning rate η. A higher (faster) learning
rate means that we should move w more on each step. The change we make in our
parameter is the learning rate times the gradient (or the slope, in our single-variable
example):

$$w^{t+1} = w^{t} - \eta \frac{d}{dw} L(f(x; w), y) \tag{5.26}$$
Now let’s extend the intuition from a function of one scalar variable w to many
variables, because we don’t just want to move left or right, we want to know where
in the N-dimensional space (of the N parameters that make up θ ) we should move.
The gradient is just such a vector; it expresses the directional components of the
sharpest slope along each of those N dimensions. If we’re just imagining two weight
dimensions (say for one weight w and one bias b), the gradient might be a vector with
two orthogonal components, each of which tells us how much the ground slopes in
the w dimension and in the b dimension. Fig. 5.5 shows a visualization of the value
of a 2-dimensional gradient vector taken at the red point.
In an actual logistic regression, the parameter vector w is much longer than 1 or
2, since the input feature vector x can be quite long, and we need a weight wi for
each xi . For each dimension/variable wi in w (plus the bias b), the gradient will have
a component that tells us the slope with respect to that variable. In each dimension
w_i, we express the slope as a partial derivative ∂/∂w_i of the loss function. Essentially
we’re asking: “How much would a small change in that variable wi influence the
total loss function L?”
Formally, then, the gradient of a multi-variable function f is a vector in which
each component expresses the partial derivative of f with respect to one of the
variables. We'll use the inverted Greek delta symbol ∇ to refer to the gradient, and
for a loss with n weights plus a bias the gradient is the vector of all these partial
derivatives:

$$\nabla L = \left[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n}, \frac{\partial L}{\partial b}\right]$$
Figure 5.5 Visualization of the gradient vector at the red point in two dimensions w and
b, showing a red arrow in the x-y plane pointing in the direction we will go to look for the
minimum: the opposite direction of the gradient (recall that the gradient points in the direction
of increase not decrease).
It turns out that the derivative of this function for one observation vector x is Eq. 5.30
(the interested reader can see Section 5.10 for the derivation of this equation):

$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[\sigma(w \cdot x + b) - y\right] x_j = (\hat{y} - y)\, x_j \tag{5.30}$$

$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = -(y - \hat{y})\, x_j \tag{5.31}$$
Note in these equations that the gradient with respect to a single weight w j rep-
resents a very intuitive value: the difference between the true y and our estimated
ŷ = σ (w · x + b) for that observation, multiplied by the corresponding input value
x j.
Figure 5.6 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used
mainly to report how well we are doing on the current tuple; we don't need to compute the
loss in order to compute the gradient. The algorithm can terminate when it converges (when
the gradient norm < ε), or when progress halts (for example when the loss starts going up on
a held-out set). Weights are initialized to 0 for logistic regression, but to small random values
for neural networks, as we'll see in Chapter 7.
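Here is a minimal Python sketch of the algorithm in Fig. 5.6 for binary logistic regression (the function names and the fixed epoch count standing in for a convergence test are our own choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_train(X, y, eta=0.1, epochs=100):
    """Stochastic gradient descent: update after each single example."""
    m, f = X.shape
    w = np.zeros(f)   # weights initialized to 0 for logistic regression
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(m):    # visit examples in random order
            y_hat = sigmoid(X[i].dot(w) + b)  # forward pass: compute y-hat
            grad_w = (y_hat - y[i]) * X[i]    # gradient for w (Eq. 5.30)
            grad_b = y_hat - y[i]             # gradient for the bias
            w -= eta * grad_w                 # move against the gradient
            b -= eta * grad_b
    return w, b
```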
We’ll discuss hyperparameters in more detail in Chapter 7, but in short, they are
a special kind of parameter for any machine learning model. Unlike regular param-
eters of a model (weights like w and b), which are learned by the algorithm from
the training set, hyperparameters are special parameters chosen by the algorithm
designer that affect how the algorithm works.
Let's walk through a single step of gradient descent using a simplified two-feature
version of the example in Fig. 5.2: a single observation x with correct label y = 1 (a
positive review), where x_1 = 3 (the count of positive lexicon words) and x_2 = 2 (the
count of negative lexicon words). Let's assume the initial weights and bias in θ^0 are
all set to 0, and the initial learning rate η is 0.1:

w_1 = w_2 = b = 0;  η = 0.1
The single update step requires that we compute the gradient, multiplied by the
learning rate:

$$\theta^{t+1} = \theta^{t} - \eta \nabla L(f(x; \theta), y)$$

In our mini example there are three parameters, so the gradient vector has 3 dimensions,
for w_1, w_2, and b. We can compute the first gradient as follows:

$$\nabla_{w,b} L = \begin{bmatrix} \frac{\partial L_{CE}(\hat{y},y)}{\partial w_1} \\ \frac{\partial L_{CE}(\hat{y},y)}{\partial w_2} \\ \frac{\partial L_{CE}(\hat{y},y)}{\partial b} \end{bmatrix} = \begin{bmatrix} (\sigma(w \cdot x + b) - y)x_1 \\ (\sigma(w \cdot x + b) - y)x_2 \\ \sigma(w \cdot x + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1)x_1 \\ (\sigma(0) - 1)x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5x_1 \\ -0.5x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$
Now that we have a gradient, we compute the new parameter vector θ^1 by moving
θ^0 in the opposite direction from the gradient:

$$\theta^{1} = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} - \eta \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} .15 \\ .1 \\ .05 \end{bmatrix}$$

So after one step of gradient descent, the weights have shifted to be: w_1 = .15,
w_2 = .1, and b = .05.
Note that this observation x happened to be a positive example. We would expect
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
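The same single update step as a Python sketch (our code; the numbers match the worked example above):

```python
import numpy as np

x = np.array([3.0, 2.0])  # x1 = 3, x2 = 2
y = 1                     # a positive example
w = np.zeros(2)           # w1 = w2 = 0
b, eta = 0.0, 0.1

y_hat = 1 / (1 + np.exp(-(w.dot(x) + b)))     # sigma(0) = 0.5
grad = np.append((y_hat - y) * x, y_hat - y)  # [-1.5, -1.0, -0.5]
theta1 = np.append(w, b) - eta * grad         # [0.15, 0.10, 0.05]
print(grad, theta1)
```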
Now the cost function for the mini-batch of m examples is the average loss for each
example:

$$\text{Cost}(\hat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \right] \tag{5.33}$$
The mini-batch gradient is the average of the individual gradients from Eq. 5.30:

$$\frac{\partial \text{Cost}(\hat{y}, y)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left[ \sigma(w \cdot x^{(i)} + b) - y^{(i)} \right] x_j^{(i)} \tag{5.34}$$
Instead of using the sum notation, we can more efficiently compute the gradient
in its matrix form, following the vectorization we saw on page 7, where we have a
matrix X of size [m × f] representing the m inputs in the batch, and a vector y of size
[m × 1] representing the correct outputs:

$$\frac{\partial \text{Cost}(\hat{y}, y)}{\partial w} = \frac{1}{m} (\hat{y} - y)^{\top} X = \frac{1}{m} \left( \sigma(Xw + b) - y \right)^{\top} X \tag{5.35}$$
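In numpy, Eq. 5.34/5.35 and the corresponding bias gradient come out to a line each; this is a sketch with names of our own choosing:

```python
import numpy as np

def batch_gradients(X, y, w, b):
    """Average gradients over a mini-batch, following Eq. 5.35."""
    m = X.shape[0]
    y_hat = 1 / (1 + np.exp(-(X @ w + b)))  # predictions, shape [m]
    err = y_hat - y                         # (y-hat - y), shape [m]
    grad_w = (err @ X) / m                  # (1/m)(y-hat - y)^T X, shape [f]
    grad_b = err.mean()                     # average gradient for the bias
    return grad_w, grad_b
```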
5.7 Regularization
There is a problem with learning weights that make the model perfectly match the
training data. If a feature is perfectly predictive of the outcome because it happens
to only occur in one class, it will be assigned a very high weight. The weights for
features will attempt to perfectly fit details of the training set, in fact too perfectly,
modeling noisy factors that just accidentally correlate with the class. This problem is
called overfitting. A good model should be able to generalize well from the training
data to the unseen test set, but a model that overfits will have poor generalization.
To avoid overfitting, a new regularization term R(θ) is added to the loss function
in Eq. 5.25, resulting in the following loss for a batch of m examples (slightly
rewritten from Eq. 5.25 to be maximizing log probability rather than minimizing
loss, and removing the 1/m term which doesn't affect the argmax):
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)}) - \alpha R(\theta) \tag{5.36}$$
The new regularization term R(θ ) is used to penalize large weights. Thus a setting
of the weights that matches the training data perfectly— but uses many weights with
high values to do so—will be penalized more than a setting that matches the data a
little less well, but does so using smaller weights. There are two common ways to
compute this regularization term R(θ). L2 regularization is a quadratic function of
the weight values, named because it uses the (square of the) L2 norm of the weight
values. The L2 norm, ||θ ||2 , is the same as the Euclidean distance of the vector θ
from the origin. If θ consists of n weights, then:
$$R(\theta) = ||\theta||_2^2 = \sum_{j=1}^{n} \theta_j^2 \tag{5.37}$$
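In a minimization setting the penalty αR(θ) is simply added to the loss, contributing 2αθ_j to each weight's gradient. A sketch (α is the hyperparameter weighting the penalty; names ours):

```python
import numpy as np

def l2_penalty(w, alpha):
    """alpha * R(theta) with R(theta) = ||theta||_2^2 (Eq. 5.37)."""
    return alpha * np.sum(w ** 2)

def l2_penalty_gradient(w, alpha):
    """Gradient of the penalty: d/dw_j of alpha * w_j^2 is 2 * alpha * w_j."""
    return 2 * alpha * w

w = np.array([0.15, 0.10])
print(l2_penalty(w, alpha=0.01))           # small weights, small penalty
print(l2_penalty_gradient(w, alpha=0.01))  # pulls each weight toward 0
```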
If we multiply each weight by a Gaussian prior on the weight, we are thus maximizing
the following constraint:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} P(y^{(i)}|x^{(i)}) \times \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(\theta_j - \mu_j)^2}{2\sigma_j^2}\right) \tag{5.42}$$
5.8 Learning in Multinomial Logistic Regression

The loss function for multinomial logistic regression generalizes the two terms in
Eq. 5.44 (one that is non-zero when y = 1 and one that is non-zero when y = 0) to
K terms. As we mentioned above, for multinomial regression we’ll represent both y
and ŷ as vectors. The true label y is a vector with K elements, each corresponding
to a class, with yc = 1 if the correct class is c, with all other elements of y being 0.
And our classifier will produce an estimate vector with K elements ŷ, each element
ŷk of which represents the estimated probability p(yk = 1|x).
The loss function for a single example x, generalizing from binary logistic re-
gression, is the sum of the logs of the K output classes, each weighted by the indi-
cator function yk (Eq. 5.45). This turns out to be just the negative log probability of
the correct class c (Eq. 5.46):
$$L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k \tag{5.45}$$

$$= -\log \hat{y}_c \quad \text{(where } c \text{ is the correct class)} \tag{5.46}$$
How did we get from Eq. 5.45 to Eq. 5.46? Because only one class (let's call it c) is
the correct one, the vector y takes the value 1 only for this class, i.e., has y_c = 1
and y_j = 0 ∀ j ≠ c. That means the terms in the sum in Eq. 5.45 will all be 0 except
for the term corresponding to the true class c. Hence the cross-entropy loss is simply
the log of the output probability corresponding to the correct class, and we therefore
also call Eq. 5.46 the negative log likelihood loss.
Of course for gradient descent we don’t need the loss, we need its gradient. The
gradient for a single example turns out to be very similar to the gradient for binary
logistic regression, (ŷ − y)x, that we saw in Eq. 5.30. Let’s consider one piece of the
gradient, the derivative for a single weight. For each class k, the weight of the ith
element of input x is wk,i . What is the partial derivative of the loss with respect to
wk,i ? This derivative turns out to be just the difference between the true value for the
class k (which is either 1 or 0) and the probability the classifier outputs for class k,
weighted by the value of the input xi corresponding to the ith element of the weight
vector for class k:
$$\frac{\partial L_{CE}}{\partial w_{k,i}} = -(y_k - \hat{y}_k)\, x_i = -\left(y_k - p(y_k = 1|x)\right) x_i = -\left(y_k - \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)}\right) x_i \tag{5.48}$$
We'll return to this case of the gradient for softmax regression when we introduce
neural networks in Chapter 7, where we'll also discuss the derivation of this
gradient.
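For reference, here is that gradient in numpy form (a sketch, names ours): stacked over all K classes, the per-example gradient for the weight matrix W is the outer product of the error vector (ŷ − y) with the input x:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multinomial_gradients(x, y_onehot, W, b):
    """Per-example gradients for softmax regression.

    Entry [k, i] of the returned matrix is (y_hat_k - y_k) * x_i,
    i.e., -(y_k - y_hat_k) * x_i as in Eq. 5.48."""
    y_hat = softmax(W @ x + b)    # class probabilities, shape [K]
    err = y_hat - y_onehot        # error vector, shape [K]
    return np.outer(err, x), err  # dL/dW shape [K x f], dL/db shape [K]
```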
5.10 Advanced: Deriving the Gradient Equation

In this section we give the derivation of the gradient of the cross-entropy loss
function L_CE for logistic regression. We'll need two facts from calculus. First, the
derivative of ln(x):

$$\frac{d}{dx} \ln(x) = \frac{1}{x} \tag{5.49}$$

Second, the (very elegant) derivative of the sigmoid:

$$\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) \tag{5.50}$$
Finally, the chain rule of derivatives. Suppose we are computing the derivative
of a composite function f(x) = u(v(x)). The derivative of f(x) is the derivative of
u(x) with respect to v(x) times the derivative of v(x) with respect to x:

$$\frac{df}{dx} = \frac{du}{dv} \cdot \frac{dv}{dx} \tag{5.51}$$
First, we want to know the derivative of the loss function with respect to a single
weight w_j (we'll need to compute it for each weight, and for the bias):

$$\frac{\partial L_{CE}}{\partial w_j} = -\frac{\partial}{\partial w_j} \left[ y \log \sigma(w \cdot x + b) + (1-y) \log(1 - \sigma(w \cdot x + b)) \right]$$
$$= -\left[ y \frac{\partial}{\partial w_j} \log \sigma(w \cdot x + b) + (1-y) \frac{\partial}{\partial w_j} \log\left[1 - \sigma(w \cdot x + b)\right] \right] \tag{5.52}$$
Next, using the chain rule, and relying on the derivative of log:

$$\frac{\partial L_{CE}}{\partial w_j} = -\frac{y}{\sigma(w \cdot x + b)} \frac{\partial}{\partial w_j} \sigma(w \cdot x + b) - \frac{1-y}{1 - \sigma(w \cdot x + b)} \frac{\partial}{\partial w_j} \left[1 - \sigma(w \cdot x + b)\right] \tag{5.53}$$
Rearranging terms:

$$\frac{\partial L_{CE}}{\partial w_j} = -\left[ \frac{y}{\sigma(w \cdot x + b)} - \frac{1-y}{1 - \sigma(w \cdot x + b)} \right] \frac{\partial}{\partial w_j} \sigma(w \cdot x + b) \tag{5.54}$$
And now plugging in the derivative of the sigmoid, and using the chain rule one
more time, we end up with Eq. 5.55:

$$\frac{\partial L_{CE}}{\partial w_j} = -\left[ \frac{y - \sigma(w \cdot x + b)}{\sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)]} \right] \sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)] \frac{\partial(w \cdot x + b)}{\partial w_j}$$
$$= -\left[ \frac{y - \sigma(w \cdot x + b)}{\sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)]} \right] \sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)]\, x_j$$
$$= -\left[y - \sigma(w \cdot x + b)\right] x_j$$
$$= \left[\sigma(w \cdot x + b) - y\right] x_j \tag{5.55}$$
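A quick numerical sanity check of Eq. 5.55 (a sketch of ours): the analytic gradient should agree with a finite-difference approximation of the loss.

```python
import numpy as np

def loss(w, b, x, y):
    y_hat = 1 / (1 + np.exp(-(w.dot(x) + b)))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w, b = np.array([0.5, -0.3]), 0.1
x, y = np.array([2.0, 1.0]), 1
eps = 1e-6

y_hat = 1 / (1 + np.exp(-(w.dot(x) + b)))
analytic = (y_hat - y) * x[0]  # Eq. 5.55 applied to w_1

w_plus = w.copy()
w_plus[0] += eps               # nudge w_1 by a tiny epsilon
numeric = (loss(w_plus, b, x, y) - loss(w, b, x, y)) / eps
print(analytic, numeric)       # the two should agree closely
```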
5.11 Summary
This chapter introduced the logistic regression model of classification.
• Logistic regression is a supervised machine learning classifier that extracts
real-valued features from the input, multiplies each by a weight, sums them,
and passes the sum through a sigmoid function to generate a probability. A
threshold is used to make a decision.
• Logistic regression can be used with two classes (e.g., positive and negative
sentiment) or with multiple classes (multinomial logistic regression, for
example for n-ary text classification, part-of-speech labeling, etc.).
• Multinomial logistic regression uses the softmax function to compute proba-
bilities.
• The weights (vector w and bias b) are learned from a labeled training set via a
loss function, such as the cross-entropy loss, that must be minimized.
• Minimizing this loss function is a convex optimization problem, and iterative
algorithms like gradient descent are used to find the optimal weights.
• Regularization is used to avoid overfitting.
• Logistic regression is also one of the most useful analytic tools, because of its
ability to transparently study the importance of individual features.
Exercises