Kernel SVM For Image Classification
1 Introduction
This work investigates the utility of Kernel SVMs for the task of image classification. This
is a supervised learning task: the learning algorithm is given a set of labeled data on which
to train a model that will then be used to classify images. We use a subset of the CIFAR-10 data set, which consists of 60,000 32 × 32 color images in 10 classes, with 6000 images per class. These are broken into 50,000 images for training and 10,000 images for testing.
Details of the experiment are given in Section 4.
Because SVMs are binary classifiers and we wish to classify into more than two classes,
we will employ the “one-vs-one” reduction method. That is, for K classes we train (K
choose 2) binary classifiers. Each classifier will be trained on samples from a pair of classes
and learn to distinguish between them. When making a prediction on an unlabeled image,
each of the K(K − 1)/2 classifiers will be given the image, and then “vote” on which class
they think the image belongs to, i.e. the class that gets the highest number of positive
predictions among all classifiers’ predictions will be taken as the “winner”.
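To make the voting scheme concrete, here is a minimal sketch in Python (the experiments in this report were implemented in MATLAB, so the function and variable names below are purely illustrative):

from itertools import combinations

classes = ["airplane", "automobile", "bird", "cat"]
pairs = list(combinations(classes, 2))        # (K choose 2) = 6 pairs for K = 4

def one_vs_one_predict(classifiers, x):
    """Majority vote over all pairwise classifiers.
    `classifiers` maps a class pair (a, b) to a trained binary predictor
    that returns +1 for class a and -1 for class b."""
    votes = {}
    for (a, b), clf in classifiers.items():
        winner = a if clf(x) == +1 else b     # each classifier casts one vote
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)          # class with the most votes wins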
For each binary classifier, the training data will be of the form {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, where x^{(i)} ∈ X is a 32 × 32 pixel RGB image from CIFAR-10 and the label y^{(i)} ∈ {−1, +1} indicates which of the two classes the image belongs to.
• x^{(i)} ∈ R^n is a point
• b ∈ R is a bias parameter (how shifted the separating hyperplane is from the origin)
• ξ_i ∈ R is a slack (error) variable that represents the amount by which a point x^{(i)} has a (functional) margin less than 1
• C ∈ R is a parameter that controls the relative weighting between maximizing the margin (minimizing ‖w‖²) and ensuring that most examples have functional margin at least 1. If a misclassified point (i.e. ξ_i > 0) has functional margin 1 − ξ_i, then the objective function increases by Cξ_i
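For reference, these quantities appear in the standard soft-margin primal problem, which in the notation above reads (this restatement follows the Lagrangian given in Section 2.2.1):

\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \qquad i = 1,\dots,m \\
& \xi_i \ge 0, \qquad i = 1,\dots,m
\end{aligned}
\]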
We note that this optimization problem is a quadratic program, because the objective is quadratic and the constraints are linear.
Furthermore, the vector normal to the separating hyperplane is

\[
w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)},
\]

and the bias term is b = y^{(i)} − w^T x^{(i)} (computed from a support vector x^{(i)}). Lastly, the predictor function f : X → {−1, +1} is f(x) = sign(w^T x + b).
Figure 1: Visualization of the soft-margin SVM. Points are in R² and the two classes are represented by the colors red (y = −1) and blue (y = +1). The gray cross is the axis, the black line is the separating hyperplane, and the dashed blue and red lines are the supporting hyperplanes for the blue and red points, respectively. The points circled in black (touching the supporting hyperplanes) are the support vectors. There are two misclassified red points, both labeled, along with corresponding values of ξ showing how its value depends on how far past the supporting hyperplane of their class they lie. The green arrow between the red and blue dashed lines represents the margin that the SVM is trying to maximize.
2.2 Dual
2.2.1 Lagrangian
The Lagrangian for the optimization problem is given by:
\[
L(w, b, \xi, \alpha, \beta) = \underbrace{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i}_{\substack{\text{Primal objective}\\ f_0}} + \underbrace{\sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right]}_{\substack{\text{First set of constraints}\\ f_1,\dots,f_m}} + \underbrace{\sum_{i=1}^{m}\beta_i(-\xi_i)}_{\substack{\text{Second set of constraints}\\ f_{m+1},\dots,f_{2m}}} \tag{2}
\]
where θ_D(α, β) = min_{w,b,ξ} L(w, b, ξ, α, β) is the dual function and the α_i and β_i are Lagrange multipliers, each constrained to be ≥ 0. To find the dual form of the problem, fix α and β and minimize L with respect to w, b, and ξ. Setting the respective derivatives to zero gives the point where the Lagrangian is minimized; we can then maximize the reduced Lagrangian with respect to α and β.
Taking the derivative of the Lagrangian with respect to w and setting it to zero, we
have:
\[
\frac{\partial}{\partial w} L(w, b, \xi, \alpha, \beta) = \frac{\partial}{\partial w}\left[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right] + \sum_{i=1}^{m}\beta_i(-\xi_i)\right] = 0
\]
\[
\implies w - \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)} = 0
\]
\[
\implies w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)} \tag{4}
\]
Now, taking the derivative of the Lagrangian with respect to b and setting it to zero,
we have:
\[
\frac{\partial}{\partial b} L(w, b, \xi, \alpha, \beta) = \frac{\partial}{\partial b}\left[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right] + \sum_{i=1}^{m}\beta_i(-\xi_i)\right] = 0
\]
\[
\implies \sum_{i=1}^{m}\alpha_i y^{(i)} = 0 \tag{5}
\]
And finally, taking the derivative of the Lagrangian with respect to ξi and setting it to
zero, we have:
\[
\frac{\partial}{\partial \xi_i} L(w, b, \xi, \alpha, \beta) = \frac{\partial}{\partial \xi_i}\left[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right] + \sum_{i=1}^{m}\beta_i(-\xi_i)\right] = 0
\]
\[
\implies C - \alpha_i - \beta_i = 0 \tag{6}
\]
\[
\implies \beta_i = C - \alpha_i
\]
But because the β_i are dual variables with β_i ≥ 0, this leads to the constraint C − α_i ≥ 0, i.e. α_i ≤ C. Combined with the fact that the α_i are dual variables with α_i ≥ 0, we have

\[
0 \le \alpha_i \le C \tag{7}
\]
We will now take these results and plug them back into our full Lagrangian to get a reduced Lagrangian that depends only on α and β.
Substituting (4) for w into the Lagrangian and distributing y^{(i)}:

\[
= \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\Big[(1-\xi_i) - \Big(\sum_{j=1}^{m}\alpha_j y^{(i)} y^{(j)} x^{(j)}\Big)^T x^{(i)} - b\,y^{(i)}\Big] - \sum_{i=1}^{m}\beta_i\xi_i
\]

Distribute \sum_{i=1}^{m}\alpha_i:

\[
= \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(j)})^T x^{(i)} - b\sum_{i=1}^{m}\alpha_i y^{(i)} - \sum_{i=1}^{m}\beta_i\xi_i
\]

Note that (x^{(j)})^T x^{(i)} is the same as (x^{(i)})^T x^{(j)}, so replace:

\[
= \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} - b\sum_{i=1}^{m}\alpha_i y^{(i)} - \sum_{i=1}^{m}\beta_i\xi_i
\]

Simplify (the two quadratic terms combine):

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - b\sum_{i=1}^{m}\alpha_i y^{(i)} - \sum_{i=1}^{m}\beta_i\xi_i
\]

We have from (5) that \sum_{i=1}^{m}\alpha_i y^{(i)} = 0:

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i=1}^{m}\beta_i\xi_i
\]

We have from (6) that C = \alpha_i + \beta_i:

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + \sum_{i=1}^{m}\alpha_i\xi_i + \sum_{i=1}^{m}\beta_i\xi_i + \sum_{i=1}^{m}\alpha_i - \sum_{i=1}^{m}\alpha_i\xi_i - \sum_{i=1}^{m}\beta_i\xi_i
\]

Simplify:

\[
= -\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + \sum_{i=1}^{m}\alpha_i
\]

Re-arrange terms:

\[
= \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}
\]

Finally then, we have a reduced Lagrangian that depends only on the value of α:

\[
L(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i\alpha_j \langle x^{(i)}, x^{(j)} \rangle \tag{8}
\]
\[
\text{s.t.}\quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m \tag{9}
\]
\[
\sum_{i=1}^{m}\alpha_i y^{(i)} = 0
\]
Note that in (8) we have replaced (x^{(i)})^T x^{(j)} with the inner product ⟨x^{(i)}, x^{(j)}⟩. This is just the definition of the inner product, and will be useful conceptually for developing the idea of the kernel shortly.
A quick note on some intuition behind the slack variable: without it, α_i can go to ∞ when constraints are violated (i.e. points are misclassified). Upper-bounding the α_i by C allows some leeway in having points cross the supporting hyperplane of their class; how many points and how much error is tolerated can be tweaked by varying C.
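Since (8)-(9) is just a quadratic program in α, it can be handed to an off-the-shelf QP solver. Below is a minimal sketch using cvxopt for illustration; the report's actual experiments were run in MATLAB, and the function and variable names here are assumptions:

import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(K, y, C):
    """Solve the soft-margin SVM dual (8)-(9) for alpha.

    K : (m, m) Gram matrix, K[i, j] = <x_i, x_j> (or a kernel value).
    y : (m,) numpy array of labels in {-1, +1}.
    C : soft-margin penalty."""
    m = len(y)
    # cvxopt minimizes (1/2) a^T P a + q^T a, so negate the dual objective.
    P = matrix((np.outer(y, y) * K).astype(float))
    q = matrix(-np.ones(m))
    # Box constraints 0 <= alpha_i <= C, written as G a <= h.
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    # Equality constraint sum_i alpha_i y_i = 0.
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])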
In this case, there must exist w∗, b∗, ξ∗ that are a solution to the primal and α∗, β∗ that are a solution to the dual such that p∗ (the optimal value of the primal) equals d∗ (the optimal value of the dual) equals L(w∗, b∗, ξ∗, α∗, β∗).
Furthermore, w∗, b∗, ξ∗, α∗, β∗ satisfy the KKT conditions:
1. Primal feasibility
   • f_i(w∗, b∗, ξ∗) ≤ 0, i = 1, ..., 2m
2. Dual feasibility
   • α∗ ≥ 0
   • β∗ ≥ 0
3. Complementary slackness
   • α_i∗ f_i = 0, i = 1, ..., m
   • β_i∗ f_i = 0, i = m + 1, ..., 2m
4. Stationarity
   • ∂/∂w L(w∗, b∗, ξ∗, α∗, β∗) = 0
   • ∂/∂b L(w∗, b∗, ξ∗, α∗, β∗) = 0
   • ∂/∂ξ L(w∗, b∗, ξ∗, α∗, β∗) = 0
In our formulation of the dual, we have ensured that all the KKT conditions are satisfied
as well as the conditions for p∗ to equal d∗ . So, we are justified in using the dual form of
the problem.
Recall the Lagrangian:

\[
L(w, b, \xi, \alpha, \beta) = \underbrace{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i}_{\substack{\text{Primal objective}\\ f_0}} + \underbrace{\sum_{i=1}^{m}\alpha_i\left[(1-\xi_i) - y^{(i)}(w^T x^{(i)} + b)\right]}_{\substack{\text{First set of constraints}\\ f_1,\dots,f_m}} + \underbrace{\sum_{i=1}^{m}\beta_i(-\xi_i)}_{\substack{\text{Second set of constraints}\\ f_{m+1},\dots,f_{2m}}} \tag{10}
\]
Once the dual is solved and some optimal α∗ is found, 0 < α_k∗ < C for some k implies that the corresponding constraint f_k is tight and ξ_k = 0 (see Section 2.2.5), i.e. y^{(k)}(w^T x^{(k)} + b) = 1.
Figure 2: (a) Points in R². The blue triangles forming the smaller circle are one class and the red dots forming the larger circle are another class. [5] (b) An example classifier (black line) along with shaded regions representing the classification of points into either the blue class or the red class. It should be clear that no matter how a line is drawn in R², a good separation of the points, where almost all blue triangles fall on one side of the black line and all red circles fall on the other, is impossible.
If we were to apply an SVM directly to learn from the data in this space, it's clear that there is no good linear separator. However, if we can "lift" the points to a higher dimensional space, e.g. by applying a function to each of the points in the space, then perhaps in that higher dimensional space there will exist a good separating hyperplane between the classes.
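As a toy illustration of this lifting idea (an example chosen here for illustration, not the exact map in Figure 3): points on two concentric circles in R² are not linearly separable, but appending the squared radius as a third coordinate makes a horizontal plane separate them.

import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)

# Inner circle (radius 1) is one class, outer circle (radius 3) is the other.
inner = np.column_stack([np.cos(angles[:50]), np.sin(angles[:50])])
outer = 3 * np.column_stack([np.cos(angles[50:]), np.sin(angles[50:])])

def lift(points):
    """phi: R^2 -> R^3, appending the squared radius as a new coordinate."""
    return np.column_stack([points, (points ** 2).sum(axis=1)])

# In the lifted space the plane z3 = 5 separates the two classes exactly.
print((lift(inner)[:, 2] < 5).all(), (lift(outer)[:, 2] > 5).all())   # True True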
Figure 3: Points in the original space are on the left, "lifted" points are on the right. The transformation function is shown in the middle. The transformation here essentially takes a parabola centered at the origin and rotates it to create a paraboloid surface; the mapping projects the points from R² onto the resulting surface (i.e. lifting them up into the z3 dimension by this projection while maintaining their values in the z1 and z2 dimensions from the values in the x1 and x2 dimensions, respectively).
Once a separating hyperplane has been found in the higher dimensional space, we can
project it back down to the original space and use that as a classifying function.
Figure 4: On the left, a separating hyperplane is shown in the R3 space that splits points in
the red circle class from the points in the blue triangle class. On the right is the projection of
this separating hyperplane in the original R2 space. While in the original space the classifier
looks non-linear, it is simply the projection of a linear separator in a higher dimensional
space.
So, an inner product is enough for training. Assuming that there will be a payoff by
putting things in terms of the inner product, let’s see how classification would work.
From (4) we have that w = \sum_{i=1}^{m}\alpha_i y^{(i)} \phi(x^{(i)}). Substituting this w we get

\[
b = y^{(k)} - \Big(\sum_{i=1}^{m}\alpha_i y^{(i)} \phi(x^{(i)})\Big)^T \phi(x^{(k)})
\]

Simplifying we get

\[
b = y^{(k)} - \sum_{i=1}^{m}\alpha_i y^{(i)} (\phi(x^{(i)}))^T \phi(x^{(k)})
\]
\[
b = y^{(k)} - \sum_{i=1}^{m}\alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x^{(k)}) \rangle \tag{12}
\]
Having b and the equation from (4) for w, we can rewrite the primal classification function in terms of the inner product as well:

\[
f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m}\alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b\Big) \tag{13}
\]
In conclusion, both training and prediction can be done using just the results of inner
products without ever mapping the points themselves into any higher dimensional space.
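A minimal sketch of (12) and (13) in code, assuming the dual variables alpha have already been found and a function inner(x, x') that returns ⟨φ(x), φ(x')⟩ however we like (the names here are illustrative, not the report's MATLAB implementation; the next section shows that this inner product is exactly what a kernel provides):

import numpy as np

def compute_bias(alpha, X, y, inner, C, tol=1e-6):
    """Bias b from (12), using any support vector with 0 < alpha_k < C."""
    k = int(np.argmax((alpha > tol) & (alpha < C - tol)))   # index of one such vector
    return y[k] - sum(alpha[i] * y[i] * inner(X[i], X[k]) for i in range(len(y)))

def predict(x, alpha, X, y, b, inner):
    """Classifier from (13): sign of the inner-product expansion plus the bias."""
    score = sum(alpha[i] * y[i] * inner(X[i], x) for i in range(len(y))) + b
    return 1 if score >= 0 else -1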
Definition of φ : R² → R⁴:

\[
(x_1, x_2) \mapsto (z_1, z_2, z_3, z_4) := (x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3)
\]

Mapping points x, x' ∈ R² to R⁴ using φ and finding their inner product:

\[
\langle \phi(x_1, x_2), \phi(x_1', x_2') \rangle = \big\langle (x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3),\ (x_1'^3,\ \sqrt{3}\,x_1'^2 x_2',\ \sqrt{3}\,x_1' x_2'^2,\ x_2'^3) \big\rangle
\]

Expanding the inner product we get

\[
= x_1^3 x_1'^3 + 3 x_1^2 x_1'^2 x_2 x_2' + 3 x_1 x_1' x_2^2 x_2'^2 + x_2^3 x_2'^3
\]

We can define a kernel function on the points x, x' ∈ R² that achieves the same result. Define K : R² × R² → R as

\[
K(x, x') := \langle x, x' \rangle^3
\]

Plugging in the values of x and x' we get

\[
K(x, x') = \langle (x_1, x_2), (x_1', x_2') \rangle^3
\]

Expanding the inner product we get

\[
= (x_1 x_1' + x_2 x_2')^3
\]

Expanding the polynomial we get

\[
= x_1^3 x_1'^3 + 3 x_1^2 x_1'^2 x_2 x_2' + 3 x_1 x_1' x_2^2 x_2'^2 + x_2^3 x_2'^3
\]
This is an explicit example of the polynomial kernel of degree 3. In general, the kernel for a polynomial of degree d is defined as:

\[
K_{\mathrm{poly}}(x, x') = \langle x, x' \rangle^d \tag{15}
\]
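The identity above is easy to check numerically; a quick sketch (phi3 is the explicit degree-3 feature map defined at the start of this example):

import numpy as np

def phi3(x):
    """Explicit degree-3 feature map R^2 -> R^4 from the derivation above."""
    x1, x2 = x
    return np.array([x1**3, np.sqrt(3) * x1**2 * x2, np.sqrt(3) * x1 * x2**2, x2**3])

def k_poly3(x, xp):
    """Degree-3 polynomial kernel K(x, x') = <x, x'>^3."""
    return float(np.dot(x, xp)) ** 3

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(phi3(x), phi3(xp)), k_poly3(x, xp))   # both print 166.375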
Other kernels include the linear kernel (in our case, this results in the equivalent of just finding a standard linear separator):

\[
K_{\mathrm{linear}}(x, x') = \langle x, x' \rangle \tag{16}
\]

as well as the radial basis function (RBF) kernel, which centers a "bump" or "cavity", i.e. a Gaussian with variance σ², around each point:

\[
K_{\mathrm{rbf}}(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \tag{17}
\]
The RBF kernel is particularly interesting because it achieves the equivalent of lifting into an infinite dimensional space to find a separating hyperplane. However, using the kernel trick, we can achieve the same results without going through what would otherwise be a computationally intractable procedure.
The RBF kernel has been shown to give some of the best state-of-the-art results with SVMs. It can be thought of as putting a "bump" or "cavity" around each point in the training set, either "pushing up" or "pulling down" the space, depending on the class. The accumulation of these bumps and cavities creates a complicated surface; σ controls how tall or wide the bumps/cavities are.
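For concreteness, the three kernels used later in the experiments can be written as follows (a NumPy sketch; the report's implementation was in MATLAB):

import numpy as np

def k_linear(x, xp):
    return np.dot(x, xp)                                              # (16)

def k_poly(x, xp, d=3):
    return np.dot(x, xp) ** d                                         # (15)

def k_rbf(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))    # (17)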
Figure 5: Two classes shown as red circles and yellow stars. Red points are "bumps" that raise the surface, yellow stars are "cavities" that drag down the surface, both created by the Gaussian kernel (very similar to RBF). A separating hyperplane is shown for classification. (Image from [6])
Figure 6: Another example of the geometric surface created by applying an RBF kernel. While all the points are the same color, the classes can be deduced from whether the points produce a "hill" above the flat surface or a "crevice" below it. The separating hyperplane is not shown. (Image from [7])
And for classification, assuming a point (x^{(k)}, y^{(k)}) corresponding to 0 < α_k < C:

\[
b = y^{(k)} - \sum_{i=1}^{m}\alpha_i y^{(i)} K(x^{(i)}, x^{(k)}) \tag{19}
\]
4 Experiments
4.1 Setup
The images we are classifying are a subset of the CIFAR-10 [1] data set which consists of
60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50,000
training images and 10,000 test images.
Specifically, we will be classifying into a subset of four classes: airplane, automobile, bird, and cat. We choose 1500 images from each class for training and 1000 images for testing. We use this subset of classes and this number of training samples for reasons of training time: training a single classifier takes around 1 hour, and we need to train six classifiers to employ the "one-vs-one" reduction method.
The “one-vs-one” reduction method for K class classification requires (K choose 2)
binary classifiers. Each classifier is trained on samples from a pair of classes. When making
a prediction on an unlabeled image, each of the K(K − 1)/2 classifiers are given the image,
and then “vote” on which class the image belongs to. That is, the class that gets the
highest number of positive predictions among all classifiers’ predictions will be taken as the
“winner”.
We perform a number of different rounds of this multi-class classification task to compare different kernel methods. Specifically, we train with a linear kernel (the same as a linear classifier), a polynomial kernel of degree 2 and of degree 3, and finally with an RBF kernel with σ = 1 and σ = 4. For each of these, the soft-margin penalty C is set to 1.
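For reference, the same five kernel configurations could be set up with scikit-learn's SVC as sketched below (the report's experiments used a custom MATLAB implementation; note that SVC parameterizes the RBF kernel as exp(−γ‖x − x'‖²), so γ = 1/(2σ²)):

from sklearn.svm import SVC

# One classifier per kernel configuration in this section; C = 1 throughout.
configs = {
    "linear":       SVC(kernel="linear", C=1.0),
    "poly, d=2":    SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=1.0),
    "poly, d=3":    SVC(kernel="poly", degree=3, gamma=1.0, coef0=0.0, C=1.0),
    "rbf, sigma=1": SVC(kernel="rbf", gamma=1.0 / (2 * 1.0 ** 2), C=1.0),
    "rbf, sigma=4": SVC(kernel="rbf", gamma=1.0 / (2 * 4.0 ** 2), C=1.0),
}
# SVC handles multi-class labels with the same one-vs-one scheme described above:
#   model.fit(train_features, train_labels); model.predict(test_features)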
We also perform feature extraction using Histogram of Oriented Gradients (HOG) with a cell size of 8x8 on each image, and use the resulting feature vector as the representation of the image on which we train/classify.
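As a sketch of this feature-extraction step (using scikit-image's hog here for illustration; the report's experiments were run in MATLAB, so the exact HOG parameters may differ):

import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_features(image_rgb):
    """HOG descriptor for one 32x32 RGB image with 8x8 cells, as in Section 4.1."""
    gray = rgb2gray(image_rgb)              # HOG is computed on image intensity
    return hog(gray,
               orientations=9,              # number of orientation bins per histogram
               pixels_per_cell=(8, 8),      # the 8x8 cell size chosen above
               cells_per_block=(2, 2),      # block normalization over 2x2 cells
               feature_vector=True)         # concatenate into a single feature vector

# e.g. features = np.vstack([hog_features(img) for img in train_images])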
Figure 7: This is an image from [2], not from CIFAR-10, but it does a nice job of illustrating the way that HOG works. The red cells are the "patches" of the image that will result in different histograms. Because there are 4x8 patches, a total of 4x8=32 histograms will be computed. These will then be concatenated with each other to represent the image. The image, then, would be represented by a vector in R^(32 × #bins) space.
Figure 8: This is also an image from [2]; it illustrates the whole HOG pipeline. An input image is given, gradients (changes in intensity) are found, oriented histograms are computed, block normalization (normalizing the histograms over blocks of neighboring cells) is performed, and finally accumulation (concatenating all of the patches' histograms) represents the image as a single vector.
There is a trade-off between the grid size, i.e. the number of patches, and the results you can get from HOG. If the grid is too fine, then not only is it more computationally expensive to compute the features, but it might also miss broader structure in the image. However, if the grid size is too coarse, then local structure can be lost. It is generally best to try different grid sizes and choose the one that works best.
Figure 9: Here is an image from the automobile class of the CIFAR-10 data set with HOG features shown at different levels of granularity, i.e. with cell (patch) sizes of 4x4, 8x8, and 16x16. We choose 8x8 for our experiments because it seems to strike a nice trade-off between computational cost and accuracy.
Table 2: Polynomial Kernel (Degree 2): Accuracy 46.40%
Table 4: RBF Kernel (σ = 1): Accuracy 77.55%
5 Conclusion
In many cases, there is no linear function in the original feature space that can separate instances into classes. One way of handling this is to embed the points into a different space in which they are linearly separable, e.g. by lifting the points into a higher dimensional space.
The problem with the above is that the process of embedding can be extremely computationally expensive. In fact, the benefits of this process don't come from knowing anything about the points in the new space other than their inner products.
Luckily, everything involved in training an SVM model to learn a separating hyperplane, as well as classifying unlabeled points, requires only knowing the result of such an inner product. So instead of e.g. lifting the points and computing inner products, a kernel function can be used that produces the result of the same inner product at a substantial computational savings. Even more, the kernel function for e.g. the radial basis function (RBF) allows the SVM to learn a classifier that would otherwise require lifting points in the original space into an infinite dimensional space, which is certainly computationally intractable.
In this work, the derivation of the dual form of a soft-margin SVM was performed, as well as the derivation of a non-trivial kernel (polynomial of degree 3), and other kernel methods were presented.
Using these equations, the kernel SVM was implemented in MATLAB and applied to the classification of a subset of images from the CIFAR-10 data set. Results were compared between a linear kernel, polynomial (degree 2 and degree 3) kernels, and RBF (σ = 1 and σ = 2) kernels, and the results were reported and analyzed.
The RBF kernel performed the best in general, but the polynomial degree 3 kernel also performed quite well, with all of these achieving accuracy at or above roughly 75%. The linear kernel performed the poorest, only matching the accuracy of random guessing; in fact it classified all images as cats.
More insight could be gained by looking into the details of the misclassifications, for example by examining the HOG features as well as analyzing the generated hyperplanes and where points lie in the new space.
Accuracy could probably also be improved by using features other than (or in conjunction with) HOG, and by further tuning of the hyperparameters, such as the degree of the polynomial, the σ parameter of the RBF kernel, and the misclassification penalty C in the dual objective.
References
[1] Alex Krizhevsky, The CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html
[2] Sistu Ganesh, What are HOG features in computer vision in layman’s terms?,
https://www.quora.com/What-are-HOG-features-in-computer-vision-in-laymans-terms
[6] Ankit K Sharma, Support Vector Machines without tears (Slide #28),
https://www.slideshare.net/ankitksharma/svm-37753690