6.867 Machine learning, lecture 4 (Jaakkola)
Figure 1: a) The hinge loss $(1 - z)^+$ as a function of $z$. b) The logistic loss $\log[1 + \exp(-z)]$ as a function of $z$.
To turn the relaxed optimization problem into a regularization problem we define a loss function that corresponds to individually optimized $\xi_t$ values and specifies the cost of violating each of the margin constraints. We are effectively solving the optimization problem with respect to the $\xi$ values for a fixed $\theta$ and $\theta_0$. This will lead to an expression of $C \sum_t \xi_t$ as a function of $\theta$ and $\theta_0$.
The loss function we need for this purpose is based on the hinge loss $\text{Loss}_h(z)$, defined as the positive part of $1 - z$ and written $(1 - z)^+$ (see Figure 1a). The relaxed optimization problem can be written using the hinge loss as
$$\text{minimize} \quad \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \underbrace{\left(1 - y_t(\theta^T x_t + \theta_0)\right)^+}_{=\,\hat{\xi}_t} \tag{3}$$
Here $\|\theta\|^2/2$, the inverse squared geometric margin, is viewed as a regularization penalty that helps stabilize the objective

$$C \sum_{t=1}^{n} \left(1 - y_t(\theta^T x_t + \theta_0)\right)^+ \tag{4}$$
In other words, when no margin constraints are violated (zero loss), the regularization
penalty helps us select the solution with the largest geometric margin.
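To make the objective in Eq. (3) concrete, here is a minimal numpy sketch (ours, not part of the original notes; the names `hinge_loss` and `svm_objective` are illustrative) that evaluates the relaxed objective for given parameters:

```python
import numpy as np

def hinge_loss(z):
    # (1 - z)^+ : zero when the margin z = y_t (theta^T x_t + theta0)
    # is at least 1, and linear in the violation otherwise
    return np.maximum(0.0, 1.0 - z)

def svm_objective(theta, theta0, X, y, C):
    # Eq. (3): ||theta||^2 / 2 + C * sum_t (1 - y_t (theta^T x_t + theta0))^+
    margins = y * (X @ theta + theta0)
    return 0.5 * np.dot(theta, theta) + C * hinge_loss(margins).sum()
```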
Figure 2: The logistic function $g(z) = (1 + \exp(-z))^{-1}$.
Another way of dealing with noisy labels in linear classification is to model how the noisy labels are generated. For example, human-assigned labels tend to be very good for "typical examples" but exhibit some variation in more difficult cases. One simple model of noisy labels in linear classification is the logistic regression model. In this model we assign a
probability distribution over the two labels in such a way that the labels for examples
further away from the decision boundary are more likely to be correct. More precisely, we
say that
$$P(y = 1 \mid x, \theta, \theta_0) = g(\theta^T x + \theta_0) \tag{5}$$
where $g(z) = (1 + \exp(-z))^{-1}$ is known as the logistic function (Figure 2). One way to derive the form of the logistic function is to say that the log-odds of the predicted class probabilities should be a linear function of the inputs:
$$\log \frac{P(y = 1 \mid x, \theta, \theta_0)}{P(y = -1 \mid x, \theta, \theta_0)} = \theta^T x + \theta_0 \tag{6}$$
So, for example, when we predict the same probability ($1/2$) for both classes, the log-odds term is zero and we recover the decision boundary $\theta^T x + \theta_0 = 0$. The precise functional form of the logistic function, or, equivalently, the fact that we chose to model the log-odds with a linear prediction, may seem a little arbitrary (but perhaps not more so than the hinge loss used with the SVM classifier). We will derive the form of the logistic function later on in the course based on certain assumptions about the class-conditional distributions $P(x \mid y = 1)$ and $P(x \mid y = -1)$.
In order to better compare the logistic regression model with the SVM we will write the conditional probability $P(y \mid x, \theta, \theta_0)$ a bit more succinctly. Specifically, since $1 - g(z) = g(-z)$, we get

$$P(y = -1 \mid x, \theta, \theta_0) = 1 - g(\theta^T x + \theta_0) = g\big(-(\theta^T x + \theta_0)\big) \tag{7}$$

and therefore

$$P(y \mid x, \theta, \theta_0) = g\big(y(\theta^T x + \theta_0)\big) \tag{8}$$
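As a quick illustration of Eqs. (5)-(8), a small sketch (ours; `label_probability` is an illustrative name) of the resulting predictive distribution:

```python
import numpy as np

def g(z):
    # logistic function g(z) = 1 / (1 + exp(-z)); note 1 - g(z) = g(-z)
    return 1.0 / (1.0 + np.exp(-z))

def label_probability(y, x, theta, theta0):
    # Eq. (8): P(y | x, theta, theta0) = g(y (theta^T x + theta0)), y in {+1, -1}
    return g(y * (np.dot(theta, x) + theta0))
```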
So now we have a linear classifier that makes probabilistic predictions about the labels.
How should we train such models? A sensible criterion would seem to be to maximize the probability that we predict the correct label in response to each example. Assuming each example is labeled independently of the others, this probability of assigning correct labels to the examples is given by the product
$$L(\theta, \theta_0) = \prod_{t=1}^{n} P(y_t \mid x_t, \theta, \theta_0) \tag{9}$$
Maximizing this likelihood is equivalent to minimizing the negative log-likelihood, which decomposes into a sum of per-example log-losses:

$$-l(\theta, \theta_0) = \sum_{t=1}^{n} \underbrace{-\log P(y_t \mid x_t, \theta, \theta_0)}_{\text{log-loss}} \tag{11}$$

$$= \sum_{t=1}^{n} -\log g\big(y_t(\theta^T x_t + \theta_0)\big) \tag{12}$$

$$= \sum_{t=1}^{n} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] \tag{13}$$
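Eq. (13) translates directly into code; a sketch (ours) using `np.logaddexp`, which computes $\log(e^a + e^b)$ and avoids overflow when the margins are large and negative:

```python
import numpy as np

def neg_log_likelihood(theta, theta0, X, y):
    # Eq. (13): sum_t log(1 + exp(-y_t (theta^T x_t + theta0)))
    margins = y * (X @ theta + theta0)
    return np.logaddexp(0.0, -margins).sum()
```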
We can interpret this similarly to the sum of the hinge losses in the SVM approach. As before, we have a base loss function, here $\log[1 + \exp(-z)]$ (Figure 1b), similar to the hinge loss (Figure 1a), and the loss depends only on the value of the "margin" $y_t(\theta^T x_t + \theta_0)$ for each example. The difference here is that we have a clear probabilistic interpretation of the "strength" of the prediction, i.e., how high $P(y_t \mid x_t, \theta, \theta_0)$ is for any particular example.
Having a probabilistic interpretation does not, however, mean that the probability values are in any way sensible or calibrated. Predicted probabilities are calibrated when they correspond to observed frequencies. So, for example, if we group together all the examples for which we predict a positive label with probability 0.5, then roughly half of them should be labeled $+1$. Probability estimates are rarely well-calibrated but can nevertheless be useful.

¹An estimator is a function that maps data to parameter values. An estimate is the value obtained in response to specific data.
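The binning argument above suggests a simple empirical check; here is a hypothetical sketch (ours) that groups predicted probabilities into bins and compares each bin's average prediction with the observed fraction of $+1$ labels:

```python
import numpy as np

def calibration_table(p_pos, labels, n_bins=10):
    # p_pos[t] is the predicted P(y = +1 | x_t); labels are in {+1, -1}.
    # For calibrated probabilities the two entries of each row roughly agree.
    bins = np.clip((p_pos * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((p_pos[mask].mean(), (labels[mask] == 1).mean()))
    return rows
```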
The minimization problem we have defined above is convex, and a number of optimization methods are available for finding the minimizing $\hat{\theta}$ and $\hat{\theta}_0$, including simple gradient descent. In a simple (stochastic) gradient descent, we would modify the parameters in response to each term in the sum (based on each training example). To specify the updates we need the following derivatives
$$\frac{d}{d\theta_0} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] = -y_t\, \frac{\exp\big(-y_t(\theta^T x_t + \theta_0)\big)}{1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)} \tag{14}$$

$$= -y_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{15}$$
and

$$\frac{d}{d\theta} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] = -y_t x_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{16}$$
The parameters are then updated by selecting training examples at random and moving the parameters in the opposite direction of the derivatives:

$$\theta_0 \leftarrow \theta_0 + \eta\, y_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{17}$$

$$\theta \leftarrow \theta + \eta\, y_t x_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{18}$$

where $\eta$ is a small (positive) learning rate. Note that $P(y_t \mid x_t, \theta, \theta_0)$ is the probability that
we predict the training label correctly and $[1 - P(y_t \mid x_t, \theta, \theta_0)]$ is the probability of making a mistake. The stochastic gradient descent updates in the logistic regression context therefore strongly resemble the perceptron's mistake-driven updates. The key difference here is that the updates are graded, made in proportion to the probability of making a mistake.
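A minimal implementation of these updates (ours; the hyper-parameters `eta`, `epochs`, and `seed` are illustrative defaults, not values from the lecture):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    # Stochastic gradient descent implementing the updates above: each step
    # moves the parameters in proportion to the probability of a mistake.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for t in rng.permutation(n):
            p_correct = 1.0 / (1.0 + np.exp(-y[t] * (X[t] @ theta + theta0)))
            step = eta * y[t] * (1.0 - p_correct)   # graded, mistake-driven
            theta += step * X[t]
            theta0 += step
    return theta, theta0
```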
The stochastic gradient descent algorithm leads to no significant change on average when
the gradient of the full objective equals zero. Setting the gradient to zero is also a necessary
condition of optimality:
$$\frac{d}{d\theta_0}\big(-l(\theta, \theta_0)\big) = -\sum_{t=1}^{n} y_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] = 0 \tag{19}$$

$$\frac{d}{d\theta}\big(-l(\theta, \theta_0)\big) = -\sum_{t=1}^{n} y_t x_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] = 0 \tag{20}$$
The sum in Eq. (19) is the difference between the mistake probabilities associated with positively and negatively labeled examples. The optimality of $\theta_0$ therefore ensures that the mistakes are balanced in this (soft) sense. Another way of understanding this is that the vector of mistake probabilities is orthogonal to the vector of labels. Similarly, the optimal setting of $\theta$ is characterized by mistake probabilities that are orthogonal to all rows of the label-example matrix $\tilde{X} = [y_1 x_1, \ldots, y_n x_n]$. In other words, for each dimension $j$ of the example vectors, $[y_1 x_{1j}, \ldots, y_n x_{nj}]$ is orthogonal to the mistake probabilities. Taken together, these orthogonality conditions ensure that there is no further linearly available information in the examples to improve the predicted probabilities (or mistake probabilities). This is perhaps a bit easier to see if we first map the $\pm 1$ labels into $0/1$ labels, $\tilde{y}_t = (1 + y_t)/2$, so that $\tilde{y}_t \in \{0, 1\}$. Then the above optimality conditions can be rewritten in terms of the prediction errors $[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)]$ rather than mistake probabilities as
$$\sum_{t=1}^{n} \big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] = 0 \tag{21}$$

$$\sum_{t=1}^{n} x_t\,\big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] = 0 \tag{22}$$
and, consequently, for any parameters $\theta'$ and $\theta_0'$,

$$\theta_0' \sum_{t=1}^{n} \big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] + \theta'^T \sum_{t=1}^{n} x_t\,\big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] \tag{23}$$

$$= \sum_{t=1}^{n} \big(\theta'^T x_t + \theta_0'\big)\big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] = 0 \tag{24}$$
meaning that the prediction errors are orthogonal to any linear function of the inputs.
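These optimality conditions are easy to verify numerically; a sketch (ours) that returns the residuals of Eqs. (21) and (22), both of which should vanish at the unregularized optimum:

```python
import numpy as np

def optimality_residuals(theta, theta0, X, y):
    # Prediction errors tilde{y}_t - P(y = 1 | x_t, theta, theta0);
    # at the optimum they sum to zero (Eq. 21) and are orthogonal to
    # every input dimension (Eq. 22).
    y01 = (1 + y) / 2                       # map +/-1 labels to 0/1
    p_pos = 1.0 / (1.0 + np.exp(-(X @ theta + theta0)))
    errors = y01 - p_pos
    return errors.sum(), X.T @ errors
```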
Let's briefly consider the type of predictions we could obtain via maximum likelihood estimation of the logistic regression model. Suppose the training examples are linearly separable. In this case we can find parameter values such that $y_t(\theta^T x_t + \theta_0)$ is positive for all training examples. By scaling up the parameters, we make these values larger and larger. This is beneficial as far as the likelihood is concerned, since the log of the logistic function is strictly increasing as a function of $y_t(\theta^T x_t + \theta_0)$ (the loss $\log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big]$ is strictly decreasing). As a result, the maximum likelihood parameter values would become unbounded: infinitely scaling up any parameters corresponding to a perfect linear classifier would attain the highest likelihood (likelihood exactly one, or loss exactly zero). The resulting probability values, predicting each training label correctly with probability one, are hardly accurate in the sense of reflecting
our uncertainty about what the labels might be. So, when the number of training examples is small, we would need to add the regularizer $\|\theta\|^2/2$, just as in the SVM model. The regularizer helps select reasonable parameters when the available training data fails to sufficiently constrain the linear classifier.
To estimate the parameters of the logistic regression model with regularization we would
minimize instead
$$\frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] \tag{25}$$
where the constant $C$ again specifies the trade-off between correct classification (the objective) and the regularization penalty. The regularization problem is typically written (equivalently) as
$$\frac{\lambda}{2}\|\theta\|^2 + \sum_{t=1}^{n} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] \tag{26}$$
since it seems more natural to vary the strength of regularization with $\lambda$ while keeping the objective the same (dividing (25) through by $C$ shows the two forms have the same minimizers with $\lambda = 1/C$).
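For completeness, a sketch (ours) of the regularized objective in the form of Eq. (26); as in the SVM case, the bias term $\theta_0$ is left out of the penalty:

```python
import numpy as np

def regularized_objective(theta, theta0, X, y, lam):
    # Eq. (26): (lambda / 2) ||theta||^2
    #           + sum_t log(1 + exp(-y_t (theta^T x_t + theta0)))
    margins = y * (X @ theta + theta0)
    return 0.5 * lam * (theta @ theta) + np.logaddexp(0.0, -margins).sum()
```

This objective is convex in $(\theta, \theta_0)$, so any standard gradient-based optimizer can be used to minimize it.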
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].