Kernel SVM For Image Classification
1 Introduction
This work investigates the utility of Kernel SVMs for the task of image classification. This
is a supervised learning task: the learning algorithm is given a set of labeled data on which
to train a model that will then be used to classify images. We use a subset of the CIFAR-10 data set, which consists of 60,000 32 × 32 color images in 10 classes, with 6000 images per class. These are broken into 50,000 images for training and 10,000 images for testing.
Details of the experiment are given in Section 4.
Because SVMs are binary classifiers and we wish to classify into more than two classes,
we will employ the “one-vs-one” reduction method. That is, for K classes we train (K
choose 2) binary classifiers. Each classifier will be trained on samples from a pair of classes
and learn to distinguish between them. When making a prediction on an unlabeled image,
each of the K(K − 1)/2 classifiers will be given the image, and then “vote” on which class
they think the image belongs to, i.e. the class that gets the highest number of positive
predictions among all classifiers’ predictions will be taken as the “winner”.
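To make the voting scheme concrete, here is a minimal sketch in Python (the experiments in this report were implemented in MATLAB, so the function and variable names below are purely illustrative):

from itertools import combinations

classes = ["airplane", "automobile", "bird", "cat"]
pairs = list(combinations(classes, 2))        # (K choose 2) = 6 pairs for K = 4

def one_vs_one_predict(classifiers, x):
    """Majority vote over all pairwise classifiers.
    `classifiers` maps a class pair (a, b) to a trained binary predictor
    that returns +1 for class a and -1 for class b."""
    votes = {}
    for (a, b), clf in classifiers.items():
        winner = a if clf(x) == +1 else b     # each classifier casts one vote
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)          # class with the most votes wins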
For each binary classifier, the training data will be of the form {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, where x^{(i)} ∈ X is a 32 × 32 pixel RGB image from CIFAR-10 and the label y^{(i)} ∈ {−1, +1} indicates which of the two classes the image belongs to.
• x^{(i)} ∈ R^n is a point
• b ∈ R is a bias parameter (how shifted the separating hyperplane is from the origin)
• ξ_i ∈ R is a slack (error) variable that represents the amount by which a point x^{(i)} has a (functional) margin less than 1
• C ∈ R is a parameter that controls the relative weighting between maximizing the margin (minimizing ‖w‖²) and ensuring that most examples have functional margin at least 1. If a misclassified point (i.e. ξ_i > 0) has functional margin 1 − ξ_i, then the objective function increases by Cξ_i
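For reference, these quantities appear in the standard soft-margin primal problem, which in the notation above reads (this restatement follows the Lagrangian given in Section 2.2.1):

\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \qquad i = 1,\dots,m \\
& \xi_i \ge 0, \qquad i = 1,\dots,m
\end{aligned}
\]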
We note that this optimization problem is a quadratic program, because the objective is quadratic and the constraints are linear.
Furthermore, the vector normal to the separating hyperplane is

\[
w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)},
\]

and the bias term is b = y^{(i)} − w^T x^{(i)} (computed from a support vector x^{(i)}). Lastly, the predictor function f : X → {−1, +1} is f(x) = sign(w^T x + b).
Figure 1: Visualization of the soft-margin SVM. Points are in R² and the two classes are represented by the colors red (y = −1) and blue (y = +1). The gray cross is the axis, the black line is the separating hyperplane, and the dashed blue and red lines are the supporting hyperplanes for the blue and red points, respectively. The points circled in black (touching the supporting hyperplanes) are the support vectors. There are two misclassified red points, both labeled, along with corresponding values of ξ showing how its value depends on how far past the supporting hyperplane of their class they lie. The green arrow between the red and blue dashed lines represents the margin that the SVM is trying to maximize.
2.2 Dual
2.2.1 Lagrangian
The Lagrangian for the optimization problem is given by:
\[
L(w, b, \xi, \alpha, \beta) = \underbrace{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i}_{\substack{\text{Primal objective}\\ f_0}} + \underbrace{\sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right]}_{\substack{\text{First set of constraints}\\ f_1,\dots,f_m}} + \underbrace{\sum_{i=1}^{m}\beta_i(-\xi_i)}_{\substack{\text{Second set of constraints}\\ f_{m+1},\dots,f_{2m}}} \tag{2}
\]
where θ_D(α, β) = min_{w,b,ξ} L(w, b, ξ, α, β) is the dual function and the α_i and β_i are Lagrange multipliers, each constrained to be ≥ 0. To find the dual form of the problem, fix α and β and minimize L with respect to w, b, and ξ. Setting the respective derivatives to zero gives the point where the Lagrangian is minimized; we can then maximize the reduced Lagrangian with respect to α and β.
Taking the derivative of the Lagrangian with respect to w and setting it to zero, we
have:
\[
\frac{\partial}{\partial w} L(w, b, \xi, \alpha, \beta) = \frac{\partial}{\partial w}\left[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right] + \sum_{i=1}^{m}\beta_i(-\xi_i)\right] = 0
\]
\[
\implies w - \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)} = 0
\]
\[
\implies w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)} \tag{4}
\]
Now, taking the derivative of the Lagrangian with respect to b and setting it to zero,
we have:
\[
\frac{\partial}{\partial b} L(w, b, \xi, \alpha, \beta) = \frac{\partial}{\partial b}\left[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right] + \sum_{i=1}^{m}\beta_i(-\xi_i)\right] = 0
\]
\[
\implies \sum_{i=1}^{m}\alpha_i y^{(i)} = 0 \tag{5}
\]
And finally, taking the derivative of the Lagrangian with respect to ξi and setting it to
zero, we have:
\[
\frac{\partial}{\partial \xi_i} L(w, b, \xi, \alpha, \beta) = \frac{\partial}{\partial \xi_i}\left[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right] + \sum_{i=1}^{m}\beta_i(-\xi_i)\right] = 0
\]
\[
\implies C - \alpha_i - \beta_i = 0 \tag{6}
\]
\[
\implies \beta_i = C - \alpha_i
\]
But because the β_i are dual variables with β_i ≥ 0, this leads to the constraint C − α_i ≥ 0, i.e. α_i ≤ C. Combined with the fact that the α_i are dual variables with α_i ≥ 0, we have

\[
0 \le \alpha_i \le C \tag{7}
\]
We will now take these results and plug them back into our full Lagrangian to get a reduced Lagrangian that depends only on α and β.
Substituting (4) for w into the Lagrangian and distributing y^{(i)}:

\[
= \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\Big[(1-\xi_i) - \Big(\sum_{j=1}^{m}\alpha_j y^{(i)} y^{(j)} x^{(j)}\Big)^T x^{(i)} - b\,y^{(i)}\Big] - \sum_{i=1}^{m}\beta_i\xi_i
\]

Distribute \sum_{i=1}^{m}\alpha_i:

\[
= \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(j)})^T x^{(i)} - b\sum_{i=1}^{m}\alpha_i y^{(i)} - \sum_{i=1}^{m}\beta_i\xi_i
\]

Note that (x^{(j)})^T x^{(i)} is the same as (x^{(i)})^T x^{(j)}, so replace:

\[
= \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} - b\sum_{i=1}^{m}\alpha_i y^{(i)} - \sum_{i=1}^{m}\beta_i\xi_i
\]

Simplify (the two quadratic terms combine):

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - b\sum_{i=1}^{m}\alpha_i y^{(i)} - \sum_{i=1}^{m}\beta_i\xi_i
\]

We have from (5) that \sum_{i=1}^{m}\alpha_i y^{(i)} = 0:

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i=1}^{m}\beta_i\xi_i
\]

We have from (6) that C = \alpha_i + \beta_i:

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + \sum_{i=1}^{m}\alpha_i\xi_i + \sum_{i=1}^{m}\beta_i\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i=1}^{m}\beta_i\xi_i
\]

Simplify:

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + \sum_{i=1}^{m}\alpha_i
\]

Re-arrange terms:

\[
= \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}
\]

Finally then, we have a reduced Lagrangian that depends only on the value of α:

\[
L(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i\alpha_j \langle x^{(i)}, x^{(j)} \rangle \tag{8}
\]
\[
\text{s.t.}\quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m \tag{9}
\]
\[
\sum_{i=1}^{m}\alpha_i y^{(i)} = 0
\]
Note that in (8) we have replaced (x^{(i)})^T x^{(j)} with the inner product ⟨x^{(i)}, x^{(j)}⟩. This is just the definition of the inner product, and will be useful conceptually for developing the idea of the kernel shortly.
A quick note on some intuition behind the slack variable: without it, α_i can go to ∞ when constraints are violated (i.e. points are misclassified). Upper-bounding the α_i by C allows some leeway in having points cross the supporting hyperplane of their class; how many points and how much error is tolerated can be tweaked by varying C.
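Since (8)-(9) is just a quadratic program in α, it can be handed to an off-the-shelf QP solver. Below is a minimal sketch using cvxopt for illustration; the report's actual experiments were run in MATLAB, and the function and variable names here are assumptions:

import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(K, y, C):
    """Solve the soft-margin SVM dual (8)-(9) for alpha.

    K : (m, m) Gram matrix, K[i, j] = <x_i, x_j> (or a kernel value).
    y : (m,) numpy array of labels in {-1, +1}.
    C : soft-margin penalty."""
    m = len(y)
    # cvxopt minimizes (1/2) a^T P a + q^T a, so negate the dual objective.
    P = matrix((np.outer(y, y) * K).astype(float))
    q = matrix(-np.ones(m))
    # Box constraints 0 <= alpha_i <= C, written as G a <= h.
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    # Equality constraint sum_i alpha_i y_i = 0.
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])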
In this case, there must exist w∗, b∗, ξ∗ that are a solution to the primal and α∗, β∗ that are a solution to the dual such that p∗ (the optimal value of the primal) equals d∗ (the optimal value of the dual) equals L(w∗, b∗, ξ∗, α∗, β∗).
Furthermore, w∗, b∗, ξ∗, α∗, β∗ satisfy the KKT conditions:
1. Primal feasibility
   • f_i(w∗, b∗, ξ∗) ≤ 0, i = 1, ..., 2m
2. Dual feasibility
   • α∗ ≥ 0
   • β∗ ≥ 0
3. Complementary slackness
   • α_i∗ f_i = 0, i = 1, ..., m
   • β_i∗ f_i = 0, i = m + 1, ..., 2m
4. Stationarity
   • ∂/∂w L(w∗, b∗, ξ∗, α∗, β∗) = 0
   • ∂/∂b L(w∗, b∗, ξ∗, α∗, β∗) = 0
   • ∂/∂ξ L(w∗, b∗, ξ∗, α∗, β∗) = 0
In our formulation of the dual, we have ensured that all the KKT conditions are satisfied
as well as the conditions for p∗ to equal d∗ . So, we are justified in using the dual form of
the problem.
Recall the Lagrangian:

\[
L(w, b, \xi, \alpha, \beta) = \underbrace{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i}_{\substack{\text{Primal objective}\\ f_0}} + \underbrace{\sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right]}_{\substack{\text{First set of constraints}\\ f_1,\dots,f_m}} + \underbrace{\sum_{i=1}^{m}\beta_i(-\xi_i)}_{\substack{\text{Second set of constraints}\\ f_{m+1},\dots,f_{2m}}} \tag{10}
\]
Once the dual is solved and some optimal α∗ is found, 0 < α_k∗ < C for some k implies that the corresponding constraint f_k is tight and ξ_k = 0 (see Section 2.2.5), i.e. y^{(k)}(w^T x^{(k)} + b) = 1.
Figure 2: (a) Points in R². The blue triangles forming the smaller circle are one class and the red dots forming the larger circle are another class. [5] (b) An example classifier (black line) along with shaded regions representing the classification of points into either the blue class or the red class. It should be clear that no matter how a line is drawn in R², a good separation of the points, where almost all blue triangles fall on one side of the black line and all red circles fall on the other, is impossible.
If we were to apply an SVM directly to learn from the data in this space, it's clear that there is no good linear separator. However, if we can "lift" the points to a higher dimensional space, e.g. by applying a function to each of the points in the space, then perhaps in that higher dimensional space there will exist a good separating hyperplane between the classes.
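As a toy illustration of this lifting idea (an example chosen here for illustration, not the exact map in Figure 3): points on two concentric circles in R² are not linearly separable, but appending the squared radius as a third coordinate makes a horizontal plane separate them.

import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)

# Inner circle (radius 1) is one class, outer circle (radius 3) is the other.
inner = np.column_stack([np.cos(angles[:50]), np.sin(angles[:50])])
outer = 3 * np.column_stack([np.cos(angles[50:]), np.sin(angles[50:])])

def lift(points):
    """phi: R^2 -> R^3, appending the squared radius as a new coordinate."""
    return np.column_stack([points, (points ** 2).sum(axis=1)])

# In the lifted space the plane z3 = 5 separates the two classes exactly.
print((lift(inner)[:, 2] < 5).all(), (lift(outer)[:, 2] > 5).all())   # True True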
Figure 3: Points in the original space are on the left, "lifted" points are on the right. The transformation function is shown in the middle. The transformation here essentially takes a parabola centered at the origin and rotates it to create a paraboloid surface; the mapping projects the points from R² onto the resulting surface (i.e. lifting them up into the z3 dimension by this projection while maintaining their values in the z1 and z2 dimensions from the values in the x1 and x2 dimensions, respectively).
Once a separating hyperplane has been found in the higher dimensional space, we can
project it back down to the original space and use that as a classifying function.
Figure 4: On the left, a separating hyperplane is shown in the R3 space that splits points in
the red circle class from the points in the blue triangle class. On the right is the projection of
this separating hyperplane in the original R2 space. While in the original space the classifier
looks non-linear, it is simply the projection of a linear separator in a higher dimensional
space.
So, an inner product is enough for training. Assuming that there will be a payoff by
putting things in terms of the inner product, let’s see how classification would work.
From (4) we have that w = \sum_{i=1}^{m}\alpha_i y^{(i)} \phi(x^{(i)}). Substituting this w we get

\[
b = y^{(k)} - \Big(\sum_{i=1}^{m}\alpha_i y^{(i)} \phi(x^{(i)})\Big)^T \phi(x^{(k)})
\]

Simplifying we get

\[
b = y^{(k)} - \sum_{i=1}^{m}\alpha_i y^{(i)} (\phi(x^{(i)}))^T \phi(x^{(k)})
\]
\[
b = y^{(k)} - \sum_{i=1}^{m}\alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x^{(k)}) \rangle \tag{12}
\]
Having b and the equation from (4) for w, we can rewrite the primal classification function in terms of the inner product as well:

\[
f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m}\alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b\Big) \tag{13}
\]
In conclusion, both training and prediction can be done using just the results of inner
products without ever mapping the points themselves into any higher dimensional space.
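A minimal sketch of (12) and (13) in code, assuming the dual variables alpha have already been found and a function inner(x, x') that returns ⟨φ(x), φ(x')⟩ however we like (the names here are illustrative, not the report's MATLAB implementation; the next section shows that this inner product is exactly what a kernel provides):

import numpy as np

def compute_bias(alpha, X, y, inner, C, tol=1e-6):
    """Bias b from (12), using any support vector with 0 < alpha_k < C."""
    k = int(np.argmax((alpha > tol) & (alpha < C - tol)))   # index of one such vector
    return y[k] - sum(alpha[i] * y[i] * inner(X[i], X[k]) for i in range(len(y)))

def predict(x, alpha, X, y, b, inner):
    """Classifier from (13): sign of the inner-product expansion plus the bias."""
    score = sum(alpha[i] * y[i] * inner(X[i], x) for i in range(len(y))) + b
    return 1 if score >= 0 else -1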
Definition of φ : R² → R⁴:

\[
(x_1, x_2) \mapsto (z_1, z_2, z_3, z_4) := (x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3)
\]

Mapping points x, x' ∈ R² to R⁴ using φ and finding their inner product:

\[
\langle \phi(x_1, x_2), \phi(x_1', x_2') \rangle = \big\langle (x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3),\ (x_1'^3,\ \sqrt{3}\,x_1'^2 x_2',\ \sqrt{3}\,x_1' x_2'^2,\ x_2'^3) \big\rangle
\]

Expanding the inner product we get

\[
= x_1^3 x_1'^3 + 3 x_1^2 x_1'^2 x_2 x_2' + 3 x_1 x_1' x_2^2 x_2'^2 + x_2^3 x_2'^3
\]

We can define a kernel function on the points x, x' ∈ R² that achieves the same result. Define K : R² × R² → R as

\[
K(x, x') := \langle x, x' \rangle^3
\]

Plugging in the values of x and x' we get

\[
K(x, x') = \langle (x_1, x_2), (x_1', x_2') \rangle^3
\]

Expanding the inner product we get

\[
= (x_1 x_1' + x_2 x_2')^3
\]

Expanding the polynomial we get

\[
= x_1^3 x_1'^3 + 3 x_1^2 x_1'^2 x_2 x_2' + 3 x_1 x_1' x_2^2 x_2'^2 + x_2^3 x_2'^3
\]
This is an explicit example of the polynomial kernel of degree 3. In general, the kernel for a polynomial of degree d is defined as:

\[
K_{\mathrm{poly}}(x, x') = \langle x, x' \rangle^d \tag{15}
\]
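The identity above is easy to check numerically; a quick sketch (phi3 is the explicit degree-3 feature map defined at the start of this example):

import numpy as np

def phi3(x):
    """Explicit degree-3 feature map R^2 -> R^4 from the derivation above."""
    x1, x2 = x
    return np.array([x1**3, np.sqrt(3) * x1**2 * x2, np.sqrt(3) * x1 * x2**2, x2**3])

def k_poly3(x, xp):
    """Degree-3 polynomial kernel K(x, x') = <x, x'>^3."""
    return float(np.dot(x, xp)) ** 3

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(phi3(x), phi3(xp)), k_poly3(x, xp))   # both print 166.375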
Other kernels include the linear kernel (in our case, this results in the equivalent of just finding a standard linear separator):

\[
K_{\mathrm{linear}}(x, x') = \langle x, x' \rangle \tag{16}
\]

as well as the radial basis function (RBF) kernel, which centers a "bump" or "cavity", i.e. a Gaussian with variance σ², around each point:

\[
K_{\mathrm{rbf}}(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \tag{17}
\]
The RBF kernel is particularly interesting because it achieves the equivalent of lifting into an infinite dimensional space to find a separating hyperplane. However, using the kernel trick, we can achieve the same results without going through what would otherwise be a computationally intractable procedure.
The RBF kernel has been shown to give some of the best state-of-the-art results with SVMs. It can be thought of as putting a "bump" or "cavity" around each point in the training set, either "pushing up" or "pulling down" the space, depending on the class. The accumulation of these bumps and cavities creates a complicated surface; σ controls how tall or wide the bumps/cavities are.
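For concreteness, the three kernels used later in the experiments can be written as follows (a NumPy sketch; the report's implementation was in MATLAB):

import numpy as np

def k_linear(x, xp):
    return np.dot(x, xp)                                              # (16)

def k_poly(x, xp, d=3):
    return np.dot(x, xp) ** d                                         # (15)

def k_rbf(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))    # (17)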
Figure 5: Two classes shown as red circles and yellow stars. Red points are "bumps" that raise the surface, yellow stars are "cavities" that drag down the surface, both created by the Gaussian kernel (very similar to RBF). A separating hyperplane is shown for classification. (Image from [6])
Figure 6: Another example of the geometric surface created by applying an RBF kernel. While all the points are the same color, the classes can be deduced from whether the points produce a "hill" above the flat surface or a "crevice" below it. The separating hyperplane is not shown. (Image from [7])
And for classification, assuming a point (x^{(k)}, y^{(k)}) corresponding to 0 < α_k < C:

\[
b = y^{(k)} - \sum_{i=1}^{m}\alpha_i y^{(i)} K(x^{(i)}, x^{(k)}) \tag{19}
\]
4 Experiments
4.1 Setup
The images we are classifying are a subset of the CIFAR-10 [1] data set which consists of
60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50,000
training images and 10,000 test images.
Specifically, we will be classifying into a subset of four classes: airplane, automobile, bird, and cat. We choose 1500 images from each class for training and 1000 images for testing. We use this subset of classes and this number of training samples for reasons of training time: training a single classifier takes around 1 hour, and we need to train six classifiers to employ the "one-vs-one" reduction method.
The “one-vs-one” reduction method for K class classification requires (K choose 2)
binary classifiers. Each classifier is trained on samples from a pair of classes. When making
a prediction on an unlabeled image, each of the K(K − 1)/2 classifiers are given the image,
and then “vote” on which class the image belongs to. That is, the class that gets the
highest number of positive predictions among all classifiers’ predictions will be taken as the
“winner”.
We perform a number of different rounds of this multi-class classification task to compare different kernel methods. Specifically, we train with a linear kernel (the same as a linear classifier), a polynomial kernel of degree 2 and of degree 3, and finally with an RBF kernel with σ = 1 and σ = 4. For each of these, the soft-margin penalty C is set to 1.
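For reference, the same five kernel configurations could be set up with scikit-learn's SVC as sketched below (the report's experiments used a custom MATLAB implementation; note that SVC parameterizes the RBF kernel as exp(−γ‖x − x'‖²), so γ = 1/(2σ²)):

from sklearn.svm import SVC

# One classifier per kernel configuration in this section; C = 1 throughout.
configs = {
    "linear":       SVC(kernel="linear", C=1.0),
    "poly, d=2":    SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=1.0),
    "poly, d=3":    SVC(kernel="poly", degree=3, gamma=1.0, coef0=0.0, C=1.0),
    "rbf, sigma=1": SVC(kernel="rbf", gamma=1.0 / (2 * 1.0 ** 2), C=1.0),
    "rbf, sigma=4": SVC(kernel="rbf", gamma=1.0 / (2 * 4.0 ** 2), C=1.0),
}
# SVC handles multi-class labels with the same one-vs-one scheme described above:
#   model.fit(train_features, train_labels); model.predict(test_features)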
We also perform feature extraction using Histogram of Oriented Gradients (HOG) with a cell size of 8x8 on each image, and use the resulting feature vector as the representation of the image on which we train/classify.
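As a sketch of this feature-extraction step (using scikit-image's hog here for illustration; the report's experiments were run in MATLAB, so the exact HOG parameters may differ):

import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_features(image_rgb):
    """HOG descriptor for one 32x32 RGB image with 8x8 cells, as in Section 4.1."""
    gray = rgb2gray(image_rgb)              # HOG is computed on image intensity
    return hog(gray,
               orientations=9,              # number of orientation bins per histogram
               pixels_per_cell=(8, 8),      # the 8x8 cell size chosen above
               cells_per_block=(2, 2),      # block normalization over 2x2 cells
               feature_vector=True)         # concatenate into a single feature vector

# e.g. features = np.vstack([hog_features(img) for img in train_images])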
Figure 7: This is an image from [2], not from CIFAR-10, but it does a nice job of illustrating the way that HOG works. The red cells are the "patches" of the image that will result in different histograms. Because there are 4x8 patches, a total of 4x8=32 histograms will be computed. These will then be concatenated with each other to represent the image. The image, then, would be represented by a vector in R^(32 × #bins) space.
Figure 8: This is also an image from [2]; it illustrates the whole HOG pipeline. An input image is given, gradients (changes in intensity) are found, oriented histograms are computed, block normalization (normalizing the histograms over blocks of neighboring cells) is performed, and finally accumulation (concatenating all of the patches' histograms) represents the image as a single vector.
There is a trade-off between the grid size, i.e. the number of patches, and the results you can get from HOG. If the grid is too fine, then not only is it more computationally expensive to compute the features, but it might also miss broader structure in the image. However, if the grid size is too coarse, then local structure can be lost. It is generally best to try different grid sizes and choose the one that works best.
Figure 9: Here is an image from the automobile class of the CIFAR-10 data set with HOG features shown at different levels of granularity, i.e. with cell (patch) sizes of 4x4, 8x8, and 16x16. We choose 8x8 for our experiments because it seems to strike a nice trade-off between computational cost and accuracy.
Table 2: Polynomial Kernel (Degree 2): Accuracy 46.40%
Table 4: RBF Kernel (σ = 1): Accuracy 77.55%
5 Conclusion
In many cases, there is no linear function in the original feature space that can separate instances into classes. One way of handling this is to embed the points into a different space in which they are linearly separable, e.g. by lifting the points into a higher dimensional space.
The problem with the above is that the process of embedding can be extremely computationally expensive. In fact, the benefits of this process don't come from knowing anything about the points in the new space other than their inner products.
Luckily, everything involved in training an SVM model to learn a separating hyperplane, as well as classifying unlabeled points, requires only knowing the result of such an inner product. So instead of e.g. lifting the points and computing inner products, a kernel function can be used that produces the result of the same inner product at a substantial computational savings. Even more, the kernel function for e.g. the radial basis function (RBF) allows the SVM to learn a classifier that would otherwise require lifting points in the original space into an infinite dimensional space, which is certainly computationally intractable.
In this work, the derivation of the dual form of a soft-margin SVM was performed, as well as the derivation of a non-trivial kernel (polynomial of degree 3), and other kernel methods were presented.
Using these equations, the kernel SVM was implemented in MATLAB and applied to the classification of a subset of images from the CIFAR-10 data set. Results were compared between a linear kernel, polynomial (degree 2 and degree 3) kernels, and RBF (σ = 1 and σ = 2) kernels, and the results were reported and analyzed.
The RBF kernel performed the best in general, but the polynomial degree 3 kernel also performed quite well, with all of these achieving accuracy at or above roughly 75%. The linear kernel performed the poorest, only matching the accuracy of random guessing; in fact it classified all images as cats.
More insight could be gained by looking into the details of the misclassifications, for example by examining the HOG features as well as analyzing the generated hyperplanes and where points lie in the new space.
Accuracy could probably also be improved by using features other than (or in conjunction with) HOG, and by further tuning of the hyperparameters, such as the degree of the polynomial, the σ parameter of the RBF kernel, and the misclassification penalty C in the dual objective.
References
[1] Alex Krizhevsky, The CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html
[2] Sistu Ganesh, What are HOG features in computer vision in layman’s terms?,
https://www.quora.com/What-are-HOG-features-in-computer-vision-in-laymans-terms
[6] Ankit K Sharma, Support Vector Machines without tears (Slide #28),
https://www.slideshare.net/ankitksharma/svm-37753690