CS229 Lecture Notes
Andrew Ng
updated by Tengyu Ma on October 5, 2019
Part V
Kernel Methods
1.1 Feature maps
Recall that in our discussion about linear regression, we considered the prob-
lem of predicting the price of a house (denoted by y) from the living area of
the house (denoted by x), and we fit a linear function of x to the training
data. What if the price y can be more accurately represented as a non-linear
function of x? In this case, we need a more expressive family of models than
linear models.
We start by considering fitting cubic functions y = θ_3 x^3 + θ_2 x^2 + θ_1 x + θ_0. It turns out that we can view the cubic function as a linear function over a different set of feature variables (defined below). Concretely, let the function φ : R → R^4 be defined as

    φ(x) = [1, x, x^2, x^3]^T ∈ R^4.    (1)

Let θ ∈ R^4 be the vector containing θ_0, θ_1, θ_2, θ_3 as entries. Then we can rewrite the cubic function in x as a linear function over φ(x):

    θ_3 x^3 + θ_2 x^2 + θ_1 x + θ_0 = θ^T φ(x).
In the context of kernel methods, we will call the “original” input value the input attributes of a problem (in this case, x, the living area). When the original input is mapped to some new set of quantities φ(x), we will call those new quantities the feature variables. (Unfortunately, different authors use different terms to describe these two things in different contexts.) We will call φ a feature map, which maps the attributes to the features.
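To make the feature map concrete, here is a minimal sketch (in Python with numpy; the data and function names are our own, not from the notes) of fitting a cubic by ordinary least squares on the mapped features:

    import numpy as np

    def phi(x):
        # Feature map from equation (1): scalar x -> (1, x, x^2, x^3).
        return np.array([1.0, x, x**2, x**3])

    # Hypothetical 1-D training data (e.g., living areas and prices).
    X = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
    y = np.array([1.2, 2.9, 6.1, 11.0, 18.2])

    # Stack feature vectors into a design matrix and solve least squares.
    Phi = np.stack([phi(x) for x in X])              # shape (n, 4)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    # A prediction is linear in phi(x), i.e., cubic in x.
    y_hat = theta @ phi(2.2)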
The batch gradient descent update for fitting θ over the features φ(x) then becomes

    θ := θ + α ∑_{i=1}^n (y^(i) − θ^T φ(x^(i))) φ(x^(i)).    (3)

If θ is initialized to zero, then at every step of gradient descent θ remains a linear combination of the vectors φ(x^(1)), ..., φ(x^(n)); that is, we can always write

    θ = ∑_{i=1}^n β_i φ(x^(i))    (6)

for some coefficients β_1, ..., β_n.
¹Here, for simplicity, we include all the monomials with repetitions (so that, e.g., x_1 x_2 x_3 and x_2 x_3 x_1 both appear in φ(x)). Therefore, there are 1 + d + d^2 + d^3 entries in φ(x).
You may realize that our general strategy is to implicitly represent the p-dimensional vector θ by a set of coefficients β_1, ..., β_n. Towards doing this, we derive the update rule for the coefficients β_1, ..., β_n. Using the equation above, we see that the new β_i depends on the old one via

    β_i := β_i + α (y^(i) − θ^T φ(x^(i))).    (8)

Here we still have the old θ on the RHS of the equation. Replacing θ by θ = ∑_{j=1}^n β_j φ(x^(j)) gives

    ∀i ∈ {1, ..., n},  β_i := β_i + α (y^(i) − ∑_{j=1}^n β_j φ(x^(j))^T φ(x^(i))).
We often rewrite φ(x^(j))^T φ(x^(i)) as ⟨φ(x^(j)), φ(x^(i))⟩ to emphasize that it's the inner product of the two feature vectors. Viewing the β_i's as the new representation of θ, we have successfully translated the batch gradient descent algorithm into an algorithm that updates the values of the β_i's iteratively. It may appear that at every iteration, we still need to compute the values of ⟨φ(x^(j)), φ(x^(i))⟩ for all pairs of i, j, each of which may take roughly O(p) operations. However, two important properties come to the rescue:

1. We can pre-compute the pairwise inner products ⟨φ(x^(j)), φ(x^(i))⟩ for all pairs of i, j before the loop starts.

2. For the feature map φ defined in (5) (or many other interesting feature maps), computing ⟨φ(x^(j)), φ(x^(i))⟩ can be efficient and does not necessarily require computing φ(x^(i)) explicitly.
As you will see, the inner products between the features, ⟨φ(x), φ(z)⟩, are essential here. We define the kernel corresponding to the feature map φ as the function K : X × X → R satisfying

    K(x, z) ≜ ⟨φ(x), φ(z)⟩.

With this notation, the final algorithm is as follows:

1. Compute all the values K(x^(i), x^(j)) ≜ ⟨φ(x^(i)), φ(x^(j))⟩ using equation (9) for all i, j ∈ {1, ..., n}. Set β := 0.

2. Loop:

        ∀i ∈ {1, ..., n},  β_i := β_i + α (y^(i) − ∑_{j=1}^n β_j K(x^(i), x^(j)))    (11)

   Or in vector notation, letting K denote the n × n matrix with K_ij = K(x^(i), x^(j)), we have

        β := β + α(~y − Kβ).
You may realize that fundamentally all we need to know about the feature
map φ(·) is encapsulated in the corresponding kernel function K(·, ·). We
will expand on this in the next section.
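To make the loop concrete, here is a small sketch (our own numpy rendering, not from the notes) of the vectorized update β := β + α(~y − Kβ):

    import numpy as np

    def kernel_gradient_descent(K, y, alpha=0.01, num_iters=1000):
        # K: precomputed n x n kernel matrix, K[i, j] = K(x^(i), x^(j)).
        # y: length-n vector of targets.
        beta = np.zeros(len(y))
        for _ in range(num_iters):
            # Vectorized form of update (11).
            beta = beta + alpha * (y - K @ beta)
        return beta

A prediction at a new point x then only needs the kernel values K(x^(i), x), since θ^T φ(x) = ∑_i β_i K(x^(i), x).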
For example, consider K(x, z) = (x^T z)^2. Expanding it out, we have

    (x^T z)^2 = (∑_{i=1}^d x_i z_i)(∑_{j=1}^d x_j z_j) = ∑_{i,j=1}^d (x_i x_j)(z_i z_j).

Thus, we see that K(x, z) = ⟨φ(x), φ(z)⟩ is the kernel function that corresponds to the feature mapping φ given (shown here for the case of d = 3) by

    φ(x) = [x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3]^T.
Revisiting the computational efficiency perspective of kernels, note that whereas calculating the high-dimensional φ(x) requires O(d^2) time, finding K(x, z) takes only O(d) time, linear in the dimension of the input attributes.
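A quick numerical check of this claim (a sketch of ours, using numpy) compares the explicit inner product in the d²-dimensional feature space against the O(d) kernel evaluation:

    import numpy as np

    def phi_quadratic(x):
        # Explicit degree-2 feature map: all d^2 products x_i * x_j.
        return np.outer(x, x).ravel()               # O(d^2) time and space

    rng = np.random.default_rng(0)
    x, z = rng.standard_normal(500), rng.standard_normal(500)

    slow = phi_quadratic(x) @ phi_quadratic(z)      # O(d^2)
    fast = (x @ z) ** 2                             # O(d), same value
    assert np.isclose(slow, fast)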
For another related example, also consider K(·, ·) defined by

    K(x, z) = (x^T z + c)^2
            = ∑_{i,j=1}^d (x_i x_j)(z_i z_j) + ∑_{i=1}^d (√(2c) x_i)(√(2c) z_i) + c^2.

This corresponds to a feature map whose entries are the products x_i x_j, the terms √(2c) x_i, and the constant c, and the parameter c controls the relative weighting between the x_i (first order) and the x_i x_j (second order) terms.
More broadly, the kernel K(x, z) = (x^T z + c)^k corresponds to a feature mapping to a (d+k choose k)-dimensional feature space, consisting of all monomials of the form x_{i1} x_{i2} · · · x_{ik} that are up to order k. However, despite working in this O(d^k)-dimensional space, computing K(x, z) still takes only O(d) time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
Now consider the function K(x, z) = exp(−||x − z||^2/(2σ^2)). Is there a feature map φ such that the kernel K defined above satisfies K(x, z) = φ(x)^T φ(z)? In this particular example, the answer is yes. This kernel is called the Gaussian kernel, and corresponds to an infinite dimensional feature mapping φ. We will give a precise characterization of what properties a function K needs to satisfy so that it can be a valid kernel function that corresponds to some feature map φ.
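For instance, here is a small sketch (ours; the bandwidth σ is a free parameter) that builds the Gaussian kernel matrix over a training set without ever forming the infinite-dimensional φ:

    import numpy as np

    def gaussian_kernel_matrix(X, sigma=1.0):
        # X: (n, d) inputs; returns the n x n matrix with
        # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
        return np.exp(-np.maximum(d2, 0) / (2 * sigma**2))

    K = gaussian_kernel_matrix(np.random.default_rng(1).standard_normal((5, 3)))
    # A valid kernel matrix is symmetric positive semi-definite.
    assert np.all(np.linalg.eigvalsh(K) >= -1e-9)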
Part VI
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning al-
gorithm. SVMs are among the best (and many believe are indeed the best)
“off-the-shelf” supervised learning algorithms. To tell the SVM story, we’ll
need to first talk about margins and the idea of separating data with a large
“gap.” Next, we’ll talk about the optimal margin classifier, which will lead
us into a digression on Lagrange duality. We’ll also see kernels, which give
a way to apply SVMs efficiently in very high dimensional (such as infinite-
dimensional) feature spaces, and finally, we’ll close off the story with the
SMO algorithm, which gives an efficient implementation of SVMs.
2 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will
give the intuitions about margins and about the “confidence” of our predic-
tions; these ideas will be made formal in Section 4.
Consider logistic regression, where the probability p(y = 1|x; θ) is modeled by h_θ(x) = g(θ^T x). We then predict “1” on an input x if and only if h_θ(x) ≥ 0.5, or equivalently, if and only if θ^T x ≥ 0. Consider a positive training example (y = 1). The larger θ^T x is, the larger also is h_θ(x) = p(y = 1|x; θ), and thus also the higher our degree of “confidence” that the label is 1.
Thus, informally we can think of our prediction as being very confident that y = 1 if θ^T x ≫ 0. Similarly, we think of logistic regression as confidently predicting y = 0 if θ^T x ≪ 0. Consider the figure below:
[Figure: a linearly separable dataset with a decision boundary; three points are labeled A, B, and C, where A lies far from the boundary, C lies very close to it, and B lies in between.]
Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0. Hence, we're much more confident about our prediction at A than at C. The point B lies in-between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it would be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples. We'll formalize this later using the notion of geometric margins.
3 Notation
To make our discussion of SVMs easier, we’ll first need to introduce a new
notation for talking about classification. We will be considering a linear
classifier for a binary classification problem with labels y and features x.
From now, we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.
Also, rather than parameterizing our linear classifier with the vector θ, we
will use parameters w, b, and write our classifier as

    h_{w,b}(x) = g(w^T x + b).

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. Given a training example (x^(i), y^(i)), we define the functional margin of (w, b) with respect to the training example as

    γ̂^(i) = y^(i) (w^T x^(i) + b).
Note that if y^(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x^(i) + b to be a large positive number. Conversely, if y^(i) = −1, then for the functional margin to be large, we need w^T x^(i) + b to be a large negative number. Moreover, if y^(i) (w^T x^(i) + b) > 0, then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and correct prediction.
For a linear classifier with the choice of g given above (taking values in {−1, 1}), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of g, we note that if we replace w with 2w and b with 2b, then since g(w^T x + b) = g(2w^T x + 2b), this would not change h_{w,b}(x) at all. I.e., g, and hence also h_{w,b}(x), depends only on the sign, but not on the magnitude, of w^T x + b. However, replacing (w, b) with (2w, 2b) also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale w and b, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as ||w||_2 = 1; i.e., we might replace (w, b) with (w/||w||_2, b/||w||_2), and instead consider the functional margin of (w/||w||_2, b/||w||_2). We'll come back to this later.
Given a training set S = {(x^(i), y^(i)); i = 1, ..., n}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

    γ̂ = min_{i=1,...,n} γ̂^(i).
Next, let’s talk about geometric margins. Consider the picture below:
A w
γ (i)
label y (i) = 1. Its distance to the decision boundary, γ (i) , is given by the line
segment AB.
How can we find the value of γ^(i)? Well, w/||w|| is a unit-length vector pointing in the same direction as w. Since A represents x^(i), we therefore find that the point B is given by x^(i) − γ^(i) · w/||w||. But this point lies on the decision boundary, and all points x on the decision boundary satisfy the equation w^T x + b = 0. Hence,

    w^T (x^(i) − γ^(i) w/||w||) + b = 0.

Solving for γ^(i) yields

    γ^(i) = (w/||w||)^T x^(i) + b/||w||.
This was worked out for the case of a positive training example at A in the figure, where being on the “positive” side of the decision boundary is good. More generally, we define the geometric margin of (w, b) with respect to a training example (x^(i), y^(i)) to be

    γ^(i) = y^(i) ((w/||w||)^T x^(i) + b/||w||).
Note that if ||w|| = 1, then the functional margin equals the geometric
margin—this thus gives us a way of relating these two different notions of
margin. Also, the geometric margin is invariant to rescaling of the parame-
ters; i.e., if we replace w with 2w and b with 2b, then the geometric margin
does not change. This will in fact come in handy later. Specifically, because
of this invariance to the scaling of the parameters, when trying to fit w and b
to training data, we can impose an arbitrary scaling constraint on w without
changing anything important; for instance, we can demand that ||w|| = 1, or
|w1 | = 5, or |w1 + b| + |w2 | = 2, and any of these can be satisfied simply by
rescaling w and b.
Finally, given a training set S = {(x(i) , y (i) ); i = 1, . . . , n}, we also define
the geometric margin of (w, b) with respect to S to be the smallest of the
geometric margins on the individual training examples:
    γ = min_{i=1,...,n} γ^(i).
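As a quick illustration (a numpy sketch of ours, not from the notes), both margins with respect to a training set can be computed directly from their definitions:

    import numpy as np

    def margins(w, b, X, y):
        # X: (n, d) inputs; y: labels in {-1, +1}.
        functional = y * (X @ w + b)                  # each gamma_hat^(i)
        geometric = functional / np.linalg.norm(w)    # each gamma^(i)
        return functional.min(), geometric.min()      # margins w.r.t. S

Rescaling (w, b) to (2w, 2b) doubles the first return value but leaves the second unchanged, matching the invariance discussed above.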
How might we find the parameters that achieve the maximum geometric margin? A natural first attempt is to pose the following optimization problem:

    max_{γ,w,b}  γ
    s.t.  y^(i) (w^T x^(i) + b) ≥ γ,  i = 1, ..., n
          ||w|| = 1.

The ||w|| = 1 constraint ensures that the functional margin equals the geometric margin, so we are guaranteed that all the geometric margins are at least γ. The problem with this formulation is that the ||w|| = 1 constraint is nasty (non-convex), so let's consider transforming the problem into:

    max_{γ̂,w,b}  γ̂/||w||
    s.t.  y^(i) (w^T x^(i) + b) ≥ γ̂,  i = 1, ..., n
Here, we’re going to maximize γ̂/||w||, subject to the functional margins all
being at least γ̂. Since the geometric and functional margins are related by
γ = γ̂/||w|, this will give us the answer we want. Moreover, we’ve gotten rid
of the constraint ||w|| = 1 that we didn’t like. The downside is that we now
γ̂
have a nasty (again, non-convex) objective ||w|| function; and, we still don’t
have any off-the-shelf software that can solve this form of an optimization
problem.
Let’s keep going. Recall our earlier discussion that we can add an arbi-
trary scaling constraint on w and b without changing anything. This is the
key idea we’ll use now. We will introduce the scaling constraint that the
functional margin of w, b with respect to the training set must be 1:
γ̂ = 1.
Since multiplying w and b by some constant results in the functional margin
being multiplied by that same constant, this is indeed a scaling constraint,
and can be satisfied by rescaling w, b. Plugging this into our problem above,
and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing
||w||2 , we now have the following optimization problem:
    min_{w,b}  (1/2)||w||^2
    s.t.  y^(i) (w^T x^(i) + b) ≥ 1,  i = 1, ..., n
We’ve now transformed the problem into a form that can be efficiently
solved. The above is an optimization problem with a convex quadratic ob-
jective and only linear constraints. Its solution gives us the optimal mar-
gin classifier. This optimization problem can be solved using commercial quadratic programming (QP) code.
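For illustration, here is a minimal sketch (ours, not from the notes) of solving this QP with the open-source cvxpy package in place of commercial QP code; X and y are assumed to form a linearly separable dataset:

    import cvxpy as cp
    import numpy as np

    def optimal_margin_classifier(X, y):
        # X: (n, d) inputs; y: labels in {-1, +1}; data assumed separable.
        n, d = X.shape
        w, b = cp.Variable(d), cp.Variable()
        # minimize (1/2)||w||^2  s.t.  y^(i)(w^T x^(i) + b) >= 1 for all i
        problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                             [cp.multiply(y, X @ w + b) >= 1])
        problem.solve()
        return w.value, b.value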
While we could call the problem solved here, what we will instead do is
make a digression to talk about Lagrange duality. This will lead us to our
optimization problem’s dual form, which will play a key role in allowing us to
use kernels to get optimal margin classifiers to work efficiently in very high
dimensional spaces. The dual form will also allow us to derive an efficient
algorithm for solving the above optimization problem that will typically do
much better than generic QP software.
Consider a problem of the following form:

    min_w  f(w)
    s.t.  h_i(w) = 0,  i = 1, ..., l.

Some of you may recall how the method of Lagrange multipliers can be used to solve it. (Don't worry if you haven't seen it before.) In this method, we define the Lagrangian to be

    L(w, β) = f(w) + ∑_{i=1}^l β_i h_i(w)

Here, the β_i's are called the Lagrange multipliers. We would then find and set L's partial derivatives to zero:

    ∂L/∂w_i = 0;  ∂L/∂β_i = 0,

and solve for w and β.
In this section, we will generalize this to constrained optimization prob-
lems in which we may have inequality as well as equality constraints. Due to
time constraints, we won’t really be able to do the theory of Lagrange duality
justice in this class,5 but we will give the main ideas and results, which we
will then apply to our optimal margin classifier’s optimization problem.
Consider the following, which we’ll call the primal optimization problem:
    min_w  f(w)
    s.t.  g_i(w) ≤ 0,  i = 1, ..., k
          h_i(w) = 0,  i = 1, ..., l.

To solve it, we start by defining the generalized Lagrangian

    L(w, α, β) = f(w) + ∑_{i=1}^k α_i g_i(w) + ∑_{i=1}^l β_i h_i(w).
Here, the α_i's and β_i's are the Lagrange multipliers. Consider the quantity

    θ_P(w) = max_{α,β : α_i ≥ 0} L(w, α, β).

Here, the “P” subscript stands for “primal.” Let some w be given. If w violates any of the primal constraints (i.e., if either g_i(w) > 0 or h_i(w) ≠ 0 for some i), then you should be able to verify that

    θ_P(w) = max_{α,β : α_i ≥ 0} [ f(w) + ∑_{i=1}^k α_i g_i(w) + ∑_{i=1}^l β_i h_i(w) ]    (13)
           = ∞.    (14)

Conversely, if the constraints are indeed satisfied for a particular value of w, then θ_P(w) = f(w).
⁵Readers interested in learning more about this topic are encouraged to read, e.g., R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.
Thus, θ_P takes the same value as the objective in our problem for all values of w that satisfy the primal constraints, and is positive infinity if the constraints are violated. Hence, if we consider the minimization problem

    min_w θ_P(w) = min_w max_{α,β : α_i ≥ 0} L(w, α, β),

we see that it is the same problem (i.e., has the same solutions as) our original, primal problem. For later use, we also define the optimal value of the objective to be p* = min_w θ_P(w); we call this the value of the primal problem.
Now, let’s look at a slightly different problem. We define
Here, the “D” subscript stands for “dual.” Note also that whereas in the
definition of θP we were optimizing (maximizing) with respect to α, β, here
we are minimizing with respect to w.
We can now pose the dual optimization problem:
This is exactly the same as our primal problem shown above, except that the
order of the “max” and the “min” are now exchanged. We also define the
optimal value of the dual problem’s objective to be d∗ = maxα,β : αi ≥0 θD (w).
How are the primal and the dual problems related? It can easily be shown
that
    d* = max_{α,β : α_i ≥ 0} min_w L(w, α, β) ≤ min_w max_{α,β : α_i ≥ 0} L(w, α, β) = p*.
(You should convince yourself of this; this follows from the “max min” of a function always being less than or equal to the “min max.”)
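The max-min inequality is easy to check numerically. Here is a tiny sketch of ours on a random finite problem, where the minimization and maximization each range over finitely many choices:

    import numpy as np

    # L[i, j] plays the role of L(w_i, (alpha, beta)_j).
    L = np.random.default_rng(0).standard_normal((50, 40))
    d_star = L.min(axis=0).max()    # max over columns of (min over rows)
    p_star = L.max(axis=1).min()    # min over rows of (max over columns)
    assert d_star <= p_star         # weak duality always holds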
However, under certain conditions, we will have

    d* = p*,

so that we can solve the dual problem in lieu of the primal problem. Let's see what these conditions are.
Suppose f and the gi ’s are convex,6 and the hi ’s are affine.7 Suppose
further that the constraints gi are (strictly) feasible; this means that there
exists some w so that gi (w) < 0 for all i.
Under our above assumptions, there must exist w∗ , α∗ , β ∗ so that w∗ is the
solution to the primal problem, α∗ , β ∗ are the solution to the dual problem,
and moreover p∗ = d∗ = L(w∗ , α∗ , β ∗ ). Moreover, w∗ , α∗ and β ∗ satisfy the
Karush-Kuhn-Tucker (KKT) conditions, which are as follows:
    ∂/∂w_i L(w*, α*, β*) = 0,  i = 1, ..., d    (15)
    ∂/∂β_i L(w*, α*, β*) = 0,  i = 1, ..., l    (16)
    α_i* g_i(w*) = 0,  i = 1, ..., k    (17)
    g_i(w*) ≤ 0,  i = 1, ..., k    (18)
    α_i* ≥ 0,  i = 1, ..., k    (19)

Equation (17) is called the KKT dual complementarity condition. It implies that if α_i* > 0, then g_i(w*) = 0; this will be key for showing that the SVM has only a small number of “support vectors.”
⁶When f has a Hessian, then it is convex if and only if the Hessian is positive semi-definite. For instance, f(w) = w^T w is convex; similarly, all linear (and affine) functions are also convex. (A function f can also be convex without being differentiable, but we won't need those more general definitions of convexity here.)
⁷I.e., there exist a_i, b_i so that h_i(w) = a_i^T w + b_i. “Affine” means the same thing as linear, except that we also allow the extra intercept term b_i.
Returning to our optimal margin classifier, we can write its constraints as

    g_i(w) = −y^(i) (w^T x^(i) + b) + 1 ≤ 0.

We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have α_i > 0 only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, g_i(w) = 0). Consider
the figure below, in which a maximum margin separating hyperplane is shown by the solid line. [Figure: a maximum margin separating hyperplane (solid line), with dashed lines parallel to it passing through the training points closest to the boundary.]
The points with the smallest margins are exactly the ones closest to the
decision boundary; here, these are the three points (one negative and two pos-
itive examples) that lie on the dashed lines parallel to the decision boundary.
Thus, only three of the αi ’s—namely, the ones corresponding to these three
training examples—will be non-zero at the optimal solution to our optimiza-
tion problem. These three points are called the support vectors in this
problem. The fact that the number of support vectors can be much smaller
than the size of the training set will be useful later.
Let’s move on. Looking ahead, as we develop the dual form of the prob-
lem, one key idea to watch out for is that we’ll try to write our algorithm
in terms of only the inner product hx(i) , x(j) i (think of this as (x(i) )T x(j) )
between points in the input feature space. The fact that we can express our
algorithm in terms of these inner products will be key when we apply the
kernel trick.
When we construct the Lagrangian for our optimization problem, we have:

    L(w, b, α) = (1/2)||w||^2 − ∑_{i=1}^n α_i [y^(i) (w^T x^(i) + b) − 1].    (21)

Note that there are only “α_i” but no “β_i” Lagrange multipliers, since the problem has only inequality constraints.
Let’s find the dual form of the problem. To do so, we need to first
minimize L(w, b, α) with respect to w and b (for fixed α), to get θD , which
we’ll do by setting the derivatives of L with respect to w and b to zero. We
have: n
X
∇w L(w, b, α) = w − αi y (i) x(i) = 0
i=1
If we take the definition of w in Equation (22) and plug that back into the Lagrangian (Equation 21), and simplify, we get

    L(w, b, α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n y^(i) y^(j) α_i α_j (x^(i))^T x^(j) − b ∑_{i=1}^n α_i y^(i).

But from Equation (23), the last term must be zero, so we obtain

    L(w, b, α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n y^(i) y^(j) α_i α_j (x^(i))^T x^(j).
Recall that we got to the equation above by minimizing L with respect to w and b. Putting this together with the constraints α_i ≥ 0 (that we always had) and the constraint (23), we obtain the following dual optimization problem:

    max_α  W(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩    (24)
    s.t.  α_i ≥ 0,  i = 1, ..., n
          ∑_{i=1}^n α_i y^(i) = 0.
You should also be able to verify that the conditions required for p∗ = d∗
and the KKT conditions (Equations 15–19) to hold are indeed satisfied in
our optimization problem. Hence, we can solve the dual in lieu of solving
the primal problem. Specifically, in the dual problem above, we have a
maximization problem in which the parameters are the αi ’s. We’ll talk later
about the specific algorithm that we’re going to use to solve the dual problem,
but if we are indeed able to solve it (i.e., find the α’s that maximize W (α)
subject to the constraints), then we can use Equation (22) to go back and
find the optimal w’s as a function of the α’s. Having found w∗ , by considering
the primal problem, it is also straightforward to find the optimal value for
the intercept term b as
n
!T
X
wT x + b = αi y (i) x(i) x+b (26)
i=1
n
X
= αi y (i) hx(i) , xi + b. (27)
i=1
be zero except for the support vectors. Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to calculate (27) and make our prediction.
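In code, the prediction rule (27) might look as follows (a numpy sketch of ours), where only the support vectors (the nonzero α_i's) contribute to the sum:

    import numpy as np

    def predict(alpha, b, X, y, x_new):
        # alpha: (n,) dual solution; X: (n, d) training inputs; y: labels.
        sv = alpha > 1e-8                   # indices of the support vectors
        # Equation (27), restricted to support vectors since all other
        # terms vanish: sum_i alpha_i y^(i) <x^(i), x_new> + b.
        return np.sign(np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b)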
By examining the dual form of the optimization problem, we gained sig-
nificant insight into the structure of the problem, and were also able to write
the entire algorithm in terms of only inner products between input feature
vectors. In the next section, we will exploit this property to apply kernels to our classification problem. The resulting algorithm, support vector machines, will be able to learn efficiently in very high dimensional spaces.
To make the algorithm work for non-linearly separable datasets as well as be less sensitive to outliers, we reformulate our optimization (using ℓ₁ regularization) as follows:

    min_{γ,w,b}  (1/2)||w||^2 + C ∑_{i=1}^n ξ_i
    s.t.  y^(i) (w^T x^(i) + b) ≥ 1 − ξ_i,  i = 1, ..., n
          ξ_i ≥ 0,  i = 1, ..., n.
Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 − ξ_i (with ξ_i > 0), we would pay a cost of the objective function being increased by Cξ_i. The parameter C
controls the relative weighting between the twin goals of making the ||w||2
small (which we saw earlier makes the margin large) and of ensuring that
most examples have functional margin at least 1.
As before, we can form the Lagrangian:

    L(w, b, ξ, α, r) = (1/2) w^T w + C ∑_{i=1}^n ξ_i − ∑_{i=1}^n α_i [y^(i) (x^(i)T w + b) − 1 + ξ_i] − ∑_{i=1}^n r_i ξ_i,

where the α_i's and r_i's are our Lagrange multipliers (constrained to be ≥ 0). Also, the KKT dual complementarity conditions (which will be useful later for testing the convergence of the SMO algorithm) are:

    α_i = 0 ⟹ y^(i) (w^T x^(i) + b) ≥ 1    (28)
    α_i = C ⟹ y^(i) (w^T x^(i) + b) ≤ 1    (29)
    0 < α_i < C ⟹ y^(i) (w^T x^(i) + b) = 1    (30)
Now, all that remains is to give an algorithm for actually solving the dual problem, which we will do in the next section.

9.1 Coordinate ascent

Consider trying to solve the unconstrained optimization problem

    max_α  W(α_1, α_2, ..., α_n).

Here, we think of W as just some function of the parameters α_i's, and for now ignore any relationship between this problem and SVMs. We've already seen two optimization algorithms, gradient ascent and Newton's method. The new algorithm we're going to consider here is called coordinate ascent:

    Loop until convergence: {
        For i = 1, ..., n, {
            α_i := arg max_{α̂_i} W(α_1, ..., α_{i−1}, α̂_i, α_{i+1}, ..., α_n)
        }
    }
Thus, in the innermost loop of this algorithm, we will hold all the variables
except for some αi fixed, and reoptimize W with respect to just the parameter
αi . In the version of this method presented here, the inner-loop reoptimizes
the variables in order α1 , α2 , . . . , αn , α1 , α2 , . . .. (A more sophisticated version
might choose other orderings; for instance, we may choose the next variable
to update according to which one we expect to allow us to make the largest
increase in W (α).)
When the function W happens to be of such a form that the “arg max”
in the inner loop can be performed efficiently, then coordinate ascent can be
a fairly efficient algorithm. Here’s a picture of coordinate ascent in action:
[Figure: contours of a quadratic function, with the axis-parallel zig-zag path taken by coordinate ascent from its initialization at (2, −2) to the global maximum.]
The ellipses in the figure are the contours of a quadratic function that
we want to optimize. Coordinate ascent was initialized at (2, −2), and also
plotted in the figure is the path that it took on its way to the global maximum.
Notice that on each step, coordinate ascent takes a step that’s parallel to one
of the axes, since only one variable is being optimized at a time.
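A minimal sketch of ours of coordinate ascent on a concave quadratic W(α) = −(1/2) α^T A α + c^T α, where each inner-loop arg max has a closed form:

    import numpy as np

    def coordinate_ascent(A, c, num_sweeps=50):
        # Maximize W(alpha) = -0.5 alpha^T A alpha + c^T alpha, A pos. def.
        alpha = np.zeros(len(c))
        for _ in range(num_sweeps):
            for i in range(len(c)):
                # Exact arg max over alpha_i with the others held fixed:
                # alpha_i = (c_i - sum_{j != i} A_ij alpha_j) / A_ii
                alpha[i] = (c[i] - A[i] @ alpha + A[i, i] * alpha[i]) / A[i, i]
        return alpha

    A, c = np.array([[2.0, 0.5], [0.5, 1.0]]), np.array([1.0, 1.0])
    assert np.allclose(coordinate_ascent(A, c), np.linalg.solve(A, c))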
9.2 SMO
We close off the discussion of SVMs by sketching the derivation of the SMO
algorithm. Some details will be left to the homework, and for others you
may refer to the paper excerpt handed out in class.
Here’s the (dual) optimization problem that we want to solve:
n n
X 1 X (i) (j)
maxα W (α) = αi − y y αi αj hx(i) , x(j) i. (31)
i=1
2 i,j=1
s.t. 0 ≤ αi ≤ C, i = 1, . . . , n (32)
Xn
αi y (i) = 0. (33)
i=1
Let's say we have a set of α_i's that satisfy the constraints (32)-(33). Now, suppose we want to hold α_2, ..., α_n fixed, and take a coordinate ascent step and reoptimize the objective with respect to α_1. Can we make any progress? The answer is no, because the constraint (33) ensures that

    α_1 y^(1) = −∑_{i=2}^n α_i y^(i),

or, by multiplying both sides by y^(1),

    α_1 = −y^(1) ∑_{i=2}^n α_i y^(i).
(This step used the fact that y (1) ∈ {−1, 1}, and hence (y (1) )2 = 1.) Hence,
α1 is exactly determined by the other αi ’s, and if we were to hold α2 , . . . , αn
fixed, then we can’t make any change to α1 without violating the con-
straint (33) in the optimization problem.
Thus, if we want to update some subset of the α_i's, we must update at least two of them simultaneously in order to keep satisfying the constraints. This motivates the SMO algorithm, which simply does the following:

    Repeat till convergence: {
        1. Select some pair α_i and α_j to update next (using a heuristic
           that tries to pick the two that will allow us to make the biggest
           progress towards the global maximum).
        2. Reoptimize W(α) with respect to α_i and α_j, while holding all
           the other α_k's (k ≠ i, j) fixed.
    }
To test for convergence of this algorithm, we can check whether the KKT
conditions (Equations 28-30) are satisfied to within some tol. Here, tol is
the convergence tolerance parameter, and is typically set to around 0.01 to
0.001. (See the paper and pseudocode for details.)
The key reason that SMO is an efficient algorithm is that the update to
αi , αj can be computed very efficiently. Let’s now briefly sketch the main
ideas for deriving the efficient update.
Let’s say we currently have some setting of the αi ’s that satisfy the con-
straints (32-33), and suppose we’ve decided to hold α3 , . . . , αn fixed, and
29
Since the right hand side is fixed (as we’ve fixed α3 , . . . αn ), we can just let
it be denoted by some constant ζ:
α1 y (1) + α2 y (2) = ζ. (34)
We can thus picture the constraints on α_1 and α_2 as follows:

[Figure: the box [0, C] × [0, C] in the (α_1, α_2) plane, crossed by the line α_1 y^(1) + α_2 y^(2) = ζ; the feasible values of α_2 on this line range between a lower endpoint L and an upper endpoint H.]
From the constraints (32), we know that α_1 and α_2 must lie within the box [0, C] × [0, C] shown. Also plotted is the line α_1 y^(1) + α_2 y^(2) = ζ, on which we know α_1 and α_2 must lie. Note also that, from these constraints, we know L ≤ α_2 ≤ H; otherwise, (α_1, α_2) can't simultaneously satisfy both the box and the straight line constraint. In this example, L = 0. But depending on what the line α_1 y^(1) + α_2 y^(2) = ζ looks like, this won't always be the case; more generally, there will be some lower-bound L and some upper-bound H on the permissible values for α_2 that will ensure that α_1, α_2 lie within the box [0, C] × [0, C].
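For reference, the endpoint formulas (they appear in Platt's paper; rendered here as a small sketch of ours) depend on whether y^(1) and y^(2) have the same sign:

    def alpha2_bounds(a1, a2, y1, y2, C):
        # Endpoints [L, H] for alpha_2: intersect the line
        # a1*y1 + a2*y2 = zeta with the box [0, C] x [0, C].
        if y1 != y2:    # line of slope +1: a2 - a1 is constant
            return max(0, a2 - a1), min(C, C + a2 - a1)
        else:           # line of slope -1: a2 + a1 is constant
            return max(0, a1 + a2 - C), min(C, a1 + a2)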
Using Equation (34), we can also write α_1 as a function of α_2:

    α_1 = (ζ − α_2 y^(2)) y^(1).

(Check this derivation yourself; we again used the fact that y^(1) ∈ {−1, 1}, so that (y^(1))^2 = 1.) Hence, the objective W(α) can be written

    W(α_1, α_2, ..., α_n) = W((ζ − α_2 y^(2)) y^(1), α_2, ..., α_n).
Treating α_3, ..., α_n as constants, this is a quadratic function in α_2, so it can be maximized in closed form by setting its derivative to zero; clipping the unconstrained maximizer to the interval [L, H] then gives α_2^new. Finally, having found α_2^new, we can use Equation (34) to go back and find the optimal value of α_1^new.
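In code, the final clip-and-back-substitute step might look like this (a sketch of ours; a2_unclipped denotes the unconstrained maximizer of the quadratic):

    import numpy as np

    def smo_pair_update(a1, a2, y1, y2, a2_unclipped, L, H):
        # Clip the unconstrained maximizer to the feasible segment [L, H].
        a2_new = float(np.clip(a2_unclipped, L, H))
        # Recover alpha_1 from (34): a1*y1 + a2*y2 = zeta is unchanged,
        # so a1_new = a1 + y1*y2*(a2 - a2_new)  (using y1^2 = 1).
        return a1 + y1 * y2 * (a2 - a2_new), a2_new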
There’re a couple more details that are quite easy but that we’ll leave you
to read about yourself in Platt’s paper: One is the choice of the heuristics
used to select the next αi , αj to update; the other is how to update b as the
SMO algorithm is run.