Lecture 34

The document discusses the Support Vector Machine (SVM) method for learning classifiers, emphasizing the transformation of feature space and the use of kernel functions to learn nonlinear classifiers. It also covers the formulation of regression problems using SVM, including the use of an epsilon-insensitive loss function and the optimization problem to minimize errors. The document concludes with the derivation of the dual problem related to the optimization of SVM parameters.


• We have been discussing the SVM method for learning classifiers.
• The basic idea is to transform the feature space and learn a linear classifier in the new space.
• Using kernel functions we can do this mapping implicitly.
• Thus kernels give us an elegant method to learn nonlinear classifiers.
• We can use the same idea in regression problems also.


Kernel Trick

• We use φ : ℜm → H to map pattern vectors into an appropriate high dimensional space.
• A kernel function allows us to compute inner products in H implicitly, without using (or even knowing) φ.
• Through kernel functions, many algorithms that use only inner products can be implicitly executed in a high dimensional H (e.g., Fisher discriminant, regression, etc.).
• We can elegantly construct non-linear versions of linear techniques.
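To make the kernel trick concrete, here is a minimal NumPy sketch (not from the original slides): for the degree-2 polynomial kernel on ℜ2, the kernel value equals the inner product of an explicit 6-dimensional feature map, so inner products in H are computed without ever forming Φ explicitly. The feature map phi below is the standard one for this kernel; the example data are arbitrary.

# A minimal sketch of the kernel trick for K(x, z) = (1 + x^T z)^2 on R^2.
import numpy as np

def poly_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2, computed without any feature map."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(poly_kernel(x, z))   # implicit inner product in H
print(phi(x) @ phi(z))     # the same number, via the explicit map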


Support Vector Regression

• Now we consider the regression problem.
• Given training data {(X1, y1), . . . , (Xn, yn)}, Xi ∈ ℜm, yi ∈ ℜ, we want to find the 'best' function to predict y given X.
• We search in a parameterized class of functions

      g(X, W) = w1 φ1(X) + · · · + wm′ φm′(X) + b = W^T Φ(X) + b,

  where φi : ℜm → ℜ are some chosen functions.


• If we choose φi(X) = xi (and hence m = m′) then it is a linear model.
• Denoting Z = Φ(X) ∈ ℜm′, we are essentially learning a linear model in a transformed space.
• This is in accordance with the basic idea of the SVM method.
• We want to formulate the problem so that we can use the kernel idea.
• Then, by using a kernel function, we never need to compute or even precisely specify the mapping Φ.


Loss function

• As in a general regression problem, we need to find W to minimize

      Σ_i L(yi, g(Xi, W))

  where L is a loss function.
• This is the general strategy of empirical risk minimization.
• We consider a special loss function that allows us to use the kernel trick.


ǫ-insensitive loss

• We employ the ǫ-insensitive loss function:

      Lǫ(yi, g(Xi, W)) = 0                            if |yi − g(Xi, W)| < ǫ
                       = |yi − g(Xi, W)| − ǫ          otherwise

  Here, ǫ is a parameter of the loss function.
• If the prediction is within ǫ of the true value, there is no loss.
• Using the absolute value of the error rather than the square of the error allows for better robustness.
• It also gives us an optimization problem with the right structure.
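As a small illustration (not part of the lecture), the ǫ-insensitive loss can be written in a couple of lines of NumPy; the example targets and predictions below are made up.

# Elementwise epsilon-insensitive loss: max(|y - y_pred| - eps, 0).
import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    """L_eps(y, y_pred) = max(|y - y_pred| - eps, 0), applied elementwise."""
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])
print(eps_insensitive_loss(y, y_pred, eps=0.1))   # [0.  0.4 1.9]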


• We have chosen the model as: g(X, W) = Φ(X)^T W + b.
• Hence empirical risk minimization under the ǫ-insensitive loss function would minimize

      Σ_{i=1}^n max( |yi − Φ(Xi)^T W − b| − ǫ, 0 )

• We can write this as an equivalent constrained optimization problem.


• We can pose the problem as follows.

      min over W, b, ξ, ξ′ of    Σ_{i=1}^n ξi + Σ_{i=1}^n ξi′

      subject to    yi − W^T Φ(Xi) − b ≤ ǫ + ξi,       i = 1, . . . , n
                    W^T Φ(Xi) + b − yi ≤ ǫ + ξi′,      i = 1, . . . , n
                    ξi ≥ 0,  ξi′ ≥ 0,                  i = 1, . . . , n

• This does not give a dual with the structure we want.
• So, we reformulate the optimization problem.


The Optimization Problem

• Find W, b and ξi, ξi′ to

      minimize    (1/2) W^T W + C ( Σ_{i=1}^n ξi + Σ_{i=1}^n ξi′ )

      subject to    yi − W^T Φ(Xi) − b ≤ ǫ + ξi,       i = 1, . . . , n
                    W^T Φ(Xi) + b − yi ≤ ǫ + ξi′,      i = 1, . . . , n
                    ξi ≥ 0,  ξi′ ≥ 0,                  i = 1, . . . , n

• We have added the term W^T W in the objective function. This is like a model complexity term in a regularization context.
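For concreteness, here is a hedged sketch of this primal problem written with the CVXPY modelling library, assuming the feature vectors Φ(Xi) have been stacked as rows of a matrix Z; the data and the values of C and ǫ are illustrative, and this is only a sanity-check formulation, not the algorithm used in practice.

# SVR primal: 0.5*||W||^2 + C*(sum xi + sum xi'), with the two epsilon-tube constraints.
import cvxpy as cp
import numpy as np

n, d = 50, 3
rng = np.random.default_rng(0)
Z = rng.normal(size=(n, d))                       # rows play the role of Phi(X_i)
y = Z @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

C, eps = 10.0, 0.1
W   = cp.Variable(d)
b   = cp.Variable()
xi  = cp.Variable(n, nonneg=True)
xip = cp.Variable(n, nonneg=True)

objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * (cp.sum(xi) + cp.sum(xip)))
constraints = [y - Z @ W - b <= eps + xi,
               Z @ W + b - y <= eps + xip]
cp.Problem(objective, constraints).solve()
print(W.value, b.value)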
• As earlier, we can form the Lagrangian and then, using the Kuhn-Tucker conditions, obtain the optimal values of W and b.
• Since this problem is similar to the earlier one, we would get W∗ in terms of the optimal Lagrange multipliers as earlier.
• Essentially, the Lagrange multipliers corresponding to the inequality constraints on the errors would be the determining factors.
• We can use the same technique as earlier to formulate the dual and solve for the optimal Lagrange multipliers.
The dual

• The dual of this problem is

      max over α, α′ of    Σ_{i=1}^n yi (αi − αi′) − ǫ Σ_{i=1}^n (αi + αi′)
                           − (1/2) Σ_{i,j} (αi − αi′)(αj − αj′) Φ(Xi)^T Φ(Xj)

      subject to    Σ_{i=1}^n (αi − αi′) = 0
                    0 ≤ αi, αi′ ≤ C,    i = 1, . . . , n


The solution

• We can use the Kuhn-Tucker conditions to derive the final optimal values of W and b as earlier.
• This gives us

      W∗ = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)

      b∗ = yj − Φ(Xj)^T W∗ + ǫ,    for any j s.t. 0 < αj∗ < C/n


• We have

      W∗ = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)

      b∗ = yj − Φ(Xj)^T W∗ + ǫ,    for any j s.t. 0 < αj∗ < C/n

• Note that we have αi∗ αi∗′ = 0. Also, αi∗ and αi∗′ are zero for examples where the error is less than ǫ.
• The final W is a linear combination of some of the examples – the support vectors.
• Note that the dual and the final solution are such that we can use the kernel trick.
• Let K(X, X′) = Φ(X)^T Φ(X′).
• The optimal model learnt is

      g(X, W∗) = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)^T Φ(X) + b∗
               = Σ_{i=1}^n (αi∗ − αi∗′) K(Xi, X) + b∗

• As earlier, b∗ can also be written in terms of the kernel function.
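As a sketch (not from the lecture) of how the learnt model is evaluated, the code below computes g(X) = Σi (αi∗ − αi∗′) K(Xi, X) + b∗ for new inputs; the Gaussian (RBF) kernel, the toy coefficients and all names are illustrative assumptions.

# Evaluating an SVR model from its dual coefficients (alpha_i - alpha_i') and b*.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all rows of A against rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def svr_predict(X_train, dual_coef, b, X_new, gamma=1.0):
    """dual_coef[i] = alpha_i - alpha_i'; returns g(X) for each row of X_new."""
    K = rbf_kernel(X_train, X_new, gamma)        # shape (n_train, n_new)
    return dual_coef @ K + b

# toy usage with made-up coefficients
X_train = np.array([[0.0], [1.0], [2.0]])
dual_coef = np.array([0.5, -0.2, 0.7])           # alpha - alpha'
print(svr_predict(X_train, dual_coef, b=0.1, X_new=np.array([[0.5], [1.5]])))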


Support vector regression

• Once again, the kernel trick allows us to learn non-linear models using a linear method.
• For example, if we use a Gaussian kernel, we get a Gaussian RBF net as the nonlinear model. The RBF centers are easily learnt here.
• The parameters: C, ǫ and the parameters of the kernel function.
• The basic idea of SVR can be used in many related problems.
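A hedged usage sketch, assuming scikit-learn is available: its SVR estimator exposes exactly these parameters (C, epsilon, and the kernel parameter, here gamma for the RBF kernel). The toy data are made up.

# Fitting an RBF-kernel SVR on noisy samples of sin(x).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(X, y)
print(model.support_.shape)        # indices of the support vectors
print(model.predict(X[:5]))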


SV regression

• With the ǫ-insensitive loss function, points whose targets are within ǫ of the prediction do not contribute any 'loss'.
• This gives rise to some interesting robustness of the method. It can be proved that local movements of target values of points outside the ǫ-tube do not influence the regression.
• The robustness essentially comes through the support vector representation of the regression.


• In our formulation of the regression problem we did not explain why we added the W^T W term in the objective function.
• We are essentially minimizing

      (1/2) W^T W + C Σ_{i=1}^n max( |yi − Φ(Xi)^T W − b| − ǫ, 0 )

• This is 'regularized risk minimization'.
• Then W^T W is the model complexity term, which is intended to favour learning of 'smoother' models.
• Next we explain why W^T W is a good term to capture the degree of smoothness of the model being fitted.
• Let f : ℜm → ℜ be a continuous function.
• Continuity means we can make |f(X) − f(X′)| as small as we want by taking ||X − X′|| sufficiently small.
• There are ways to characterize the 'degree of continuity' of a function.
• We consider one such measure now.


ǫ-Margin of a function

• The ǫ-margin of a function f : ℜm → ℜ is

      mǫ(f) = inf{ ||X − X′|| : |f(X) − f(X′)| ≥ 2ǫ }

• The intuitive idea is: how small can ||X − X′|| be while still keeping |f(X) − f(X′)| 'large'?
• The larger mǫ(f), the smoother the function.


• Obviously, mǫ(f) = 0 if f is discontinuous.
• mǫ(f) can be zero even for continuous functions, e.g., f(x) = 1/x.
• mǫ(f) > 0 for all ǫ > 0 iff f is uniformly continuous.
• A higher margin would mean the function is 'slowly varying' and hence is a 'smoother' model.


SVR and margin

• Consider regression with linear models. Then |f(X) − f(X′)| = |W^T(X − X′)|.
• Over all X, X′ with |W^T(X − X′)| ≥ 2ǫ, ||X − X′|| is smallest when |W^T(X − X′)| = 2ǫ and (X − X′) is parallel to W; that is, X − X′ = ± 2ǫ W / (W^T W).
• Thus, mǫ(f) = 2ǫ / ||W||.
• Thus, in our optimization problem in SVR, minimizing W^T W promotes learning of smoother models.
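A quick numeric check of the claim above (not from the lecture): for a linear f(X) = W^T X, moving from X′ by 2ǫ W / (W^T W) changes f by exactly 2ǫ over a distance of 2ǫ/||W||. The particular W, ǫ and X′ below are arbitrary.

# Verify |f(X) - f(X')| = 2*eps and ||X - X'|| = 2*eps/||W|| for the minimizing displacement.
import numpy as np

W, eps = np.array([3.0, -4.0]), 0.5
Xp = np.array([1.0, 2.0])                 # an arbitrary X'
X = Xp + 2 * eps * W / (W @ W)            # the minimizing displacement

print(abs(W @ X - W @ Xp))                # 2*eps = 1.0
print(np.linalg.norm(X - Xp))             # 2*eps/||W|| = 1.0/5 = 0.2
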
Solving the SVM optimization problem

• So far we have not considered any algorithms for solving for the SVM.
• We have to solve a constrained optimization problem to obtain the Lagrange multipliers and hence the SVM.
• Many specialized algorithms have been proposed for this.


• The optimization problem to be solved is

      max over µ of    q(µ) = Σ_{i=1}^n µi − (1/2) Σ_{i,j=1}^n µi µj yi yj K(Xi, Xj)

      subject to    0 ≤ µi ≤ C,  i = 1, . . . , n,      Σ_{i=1}^n yi µi = 0

• A quadratic programming (QP) problem with interesting structure.
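As a tiny sketch (not from the lecture), the dual objective q(µ) is easy to evaluate once the kernel matrix is available; the numbers below are just one feasible point for a 3-example problem (the same kernel matrix reappears in the worked example that follows).

# q(mu) = sum_i mu_i - 0.5 * sum_ij mu_i mu_j y_i y_j K_ij
import numpy as np

def dual_objective(mu, y, K):
    """Evaluate the SVM dual objective at mu, given labels y and kernel matrix K."""
    v = mu * y
    return mu.sum() - 0.5 * v @ K @ v

y  = np.array([1.0, 1.0, -1.0])
K  = np.array([[4.0, 0.0, 1.0], [0.0, 4.0, 1.0], [1.0, 1.0, 1.0]])
mu = np.array([1.0, 1.0, 2.0])
print(dual_objective(mu, y, K))    # evaluates q at one feasible point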


Example

• We will first consider a very simple example problem in ℜ2 to get a feel for the method of obtaining the SVM.
• Suppose we have 3 examples:

      X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0)

  with y1 = y2 = +1 and y3 = −1.
• As is easy to see, a linear classifier is not sufficient here.
• Suppose we use the kernel function K(X, X′) = (1 + X^T X′)².


• This example is shown in a figure on the original slide (a plot of the three points on the x1-axis; not reproduced here).


• Recall, the examples are

      X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0)

• The objective function involves K(Xi, Xj). These values are given in the matrix below.

      [ (1 + Xi^T Xj)² ]  =  [ 4  0  1 ]
                             [ 0  4  1 ]
                             [ 1  1  1 ]
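This kernel matrix is easy to reproduce numerically; a short NumPy check (not from the lecture) is given below.

# Reproduce the 3x3 kernel matrix (1 + Xi^T Xj)^2 for the three example points.
import numpy as np

X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])   # X1, X2, X3 as rows
K = (1.0 + X @ X.T) ** 2
print(K)
# [[4. 0. 1.]
#  [0. 4. 1.]
#  [1. 1. 1.]]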


• Now the objective function to be maximized is

      q(µ) = Σ_{i=1}^3 µi − (1/2)( 4µ1² + 4µ2² + µ3² − 2µ1µ3 − 2µ2µ3 )

• The constraints are

      µ1 + µ2 − µ3 = 0;    and µi ≥ 0, i = 1, 2, 3.


• The Lagrangian for this problem is

      L(µ, λ, α) = q(µ) + λ(µ1 + µ2 − µ3) − Σ_{i=1}^3 αi µi

• Using the Kuhn-Tucker conditions, we have ∂L/∂µi = 0 and µ1 + µ2 − µ3 = 0.
• This gives us four equations; we have 7 unknowns. We use the complementary slackness conditions on the αi.
• We have αi µi = 0. Essentially, we need to guess which µi > 0.
• In this simple problem we know all µi > 0.
• This is because all the Xi would be support vectors.
• Hence we take all αi = 0.
• We now have four unknowns: µ1, µ2, µ3, λ.
• Using ∂L/∂µi = 0, i = 1, 2, 3, and feasibility, we can solve for the µi.


• The equations are

      1 − 4µ1 + µ3 + λ = 0
      1 − 4µ2 + µ3 + λ = 0
      1 − µ3 + µ1 + µ2 − λ = 0
      µ1 + µ2 − µ3 = 0

• These give us λ = 1 and µ3 = 2µ1 = 2µ2.
• Thus we get µ1 = µ2 = 1 and µ3 = 2.
• This completely determines the SVM.
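As a quick check (not from the lecture), the four linear equations above can be solved directly; the sketch below recovers µ = (1, 1, 2) and λ = 1.

# Solve the four linear equations in (mu1, mu2, mu3, lambda), written as A x = b.
import numpy as np

A = np.array([[-4.0,  0.0,  1.0,  1.0],
              [ 0.0, -4.0,  1.0,  1.0],
              [ 1.0,  1.0, -1.0, -1.0],
              [ 1.0,  1.0, -1.0,  0.0]])
b = np.array([-1.0, -1.0, -1.0, 0.0])

print(np.linalg.solve(A, b))   # [1. 1. 2. 1.]  ->  mu = (1, 1, 2), lambda = 1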


• If we used the penalty constant with C ≥ 2 we would get the same solution. (If C < 2, we cannot get this solution.)
• The classification of any X by this SVM is given by the sign of f(X):

      f(X) = Σ_i µi yi K(Xi, X) + b∗
           = K(X1, X) + K(X2, X) − 2K(X3, X) + b∗

• Let us first calculate b∗.


• Recall the formula

      b∗ = yj − Σ_i µi yi K(Xi, Xj),    for any j with µj > 0

• We can take j = 1, 2 or 3.
• With j = 1 we get b∗ = 1 − (4 + 0 − 2) = −1.
• With j = 3 we get b∗ = −1 − (1 + 1 − 2) = −1.
• If we solved our optimization problem correctly, we should get the same b∗ for every such choice of j!


• We have X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0) and K(X, X′) = (1 + X^T X′)².
• Hence, taking X = (x1, x2)^T, we have

      f(X) = K(X1, X) + K(X2, X) − 2K(X3, X) + b∗
           = (1 − x1)² + (1 + x1)² − 2(1) − 1
           = 2x1² − 1
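A small numeric sketch (not from the lecture) confirming that the kernel expansion of this SVM indeed reduces to f(X) = 2x1² − 1; the test points are arbitrary.

# Compare the kernel expansion of the learnt classifier with 2*x1^2 - 1.
import numpy as np

X_sv  = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])   # X1, X2, X3
coef  = np.array([1.0, 1.0, -2.0])                        # mu_i * y_i
b_opt = -1.0

def f(x):
    k = (1.0 + X_sv @ x) ** 2          # K(Xi, x) for i = 1, 2, 3
    return coef @ k + b_opt

for x in [np.array([0.5, 1.0]), np.array([2.0, -3.0])]:
    print(f(x), 2 * x[0] ** 2 - 1)     # the two values agree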


• Hence this SVM will assign class +1 to X = (x1, x2)^T if

      2x1² ≥ 1,    i.e.,    |x1| ≥ 1/√2

• Why not |x1| ≥ (1/2)?
• We are maximizing the margin of the hyperplane in the transformed ('x²') space, not in the original space: there, the separating hyperplane x1² = 1/2 lies midway between the positive points (x1² = 1) and the negative point (x1² = 0).
• The final SVM is intuitively very reasonable, and we solve essentially the same problem whether we are seeking a linear classifier or a nonlinear classifier.


• Getting back to the general case, we need to solve

      max over µ of    q(µ) = Σ_{i=1}^n µi − (1/2) Σ_{i,j=1}^n µi µj yi yj K(Xi, Xj)

      subject to    0 ≤ µi ≤ C,  i = 1, . . . , n,      Σ_{i=1}^n yi µi = 0

• We need a numerical method.
• Due to the special structure, many efficient algorithms have been proposed.


• One interesting idea – chunking.
• We optimize over only a few variables at a time.
• The dimensionality of the optimization problem is controlled.
• We keep randomly choosing the subset of variables.
• This gave rise to the first specialized algorithm for SVM – SVM Light.


• Taking chunking to the extreme level – what is the smallest set of variables we can optimize over?
• We need to consider at least two variables because there is an equality constraint.
• Sequential Minimal Optimization (SMO) – works by optimizing two variables at a time.
• We can analytically find the optimum with respect to two variables.
• We need to decide which two variables we consider in each iteration.
• A very efficient algorithm.
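To illustrate the analytic two-variable step that SMO builds on, here is a simplified sketch (not from the lecture, and not the full algorithm): it performs one joint update of a pair (µi, µj), keeping Σk yk µk fixed and clipping to the box [0, C]. The working-set selection heuristics, the bias update and the stopping test are all omitted, and the variable names are illustrative.

# One analytic SMO-style update of the pair (mu_i, mu_j) for the dual above.
import numpy as np

def smo_pair_update(mu, i, j, y, K, C, b=0.0):
    """Optimize the dual over (mu_i, mu_j) only, holding all other multipliers fixed."""
    # prediction errors E_k = f(X_k) - y_k with the current multipliers
    f = (mu * y) @ K + b
    Ei, Ej = f[i] - y[i], f[j] - y[j]

    # feasible segment [L, H] for mu_j from the box and equality constraints
    if y[i] == y[j]:
        L, H = max(0.0, mu[i] + mu[j] - C), min(C, mu[i] + mu[j])
    else:
        L, H = max(0.0, mu[j] - mu[i]), min(C, C + mu[j] - mu[i])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]     # curvature along the segment
    if eta <= 0 or L >= H:
        return mu                               # skip degenerate pairs

    mu_new = mu.copy()
    mu_new[j] = np.clip(mu[j] + y[j] * (Ei - Ej) / eta, L, H)
    mu_new[i] = mu[i] + y[i] * y[j] * (mu[j] - mu_new[j])   # keeps sum_k y_k mu_k fixed
    return mu_new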
