Proceedings of the 11th European Symposium on Artificial Neural Networks (ESANN 2003), pp. 215-222, Bruges, Belgium, 2003

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Machines

Vojislav Kecman¹, Michael Vogt², Te Ming Huang¹

¹ School of Engineering, The University of Auckland, Auckland, New Zealand
² Institute of Automatic Control, TU Darmstadt, Darmstadt, Germany
e-mail: [email protected], [email protected]

Abstract: The paper presents the equality of the kernel AdaTron (KA) method (originating from a gradient ascent learning approach) and the sequential minimal optimization (SMO) learning algorithm (based on an analytic quadratic programming step) in designing support vector machines (SVMs) with positive definite kernels. The conditions for the equality of the two methods are established. The equality is valid for both nonlinear classification and nonlinear regression tasks, and it sheds new light on these seemingly different learning approaches. The paper also introduces other learning techniques related to the two mentioned approaches, such as the nonnegative conjugate gradient, the classic Gauss-Seidel (GS) coordinate ascent procedure and its derivative known as the successive over-relaxation (SOR) algorithm, as viable and usually faster training algorithms for performing nonlinear classification and regression tasks. The convergence theorem for these related iterative algorithms is proven.

1. Introduction

One of the mainstream research fields in learning from empirical data by support vector machines, for solving both classification and regression problems, is the implementation of incremental learning schemes when the training data set is huge. Among the several candidates that avoid the use of standard quadratic programming (QP) solvers, the two learning approaches that have recently attracted attention are the KA (Anlauf, Biehl, 1989; Frieß, Cristianini, Campbell, 1998; Veropoulos, 2001) and the SMO (Platt, 1998, 1999; Vogt, 2002). Due to its analytic foundation, the SMO approach is particularly popular and at the moment the most widely used, analyzed and still actively developed algorithm. At the same time, the KA, although providing similar results in solving classification problems (in terms of both the accuracy and the training computation time required), did not attract that many devotees. There are two basic reasons for that. First, until recently (Veropoulos, 2001), the KA seemed to be restricted to classification problems only, and second, it 'lacked' the fleur of a strong theory (despite its beautiful 'simplicity' and strong convergence proofs). The KA is based on a gradient ascent technique, and this fact might also have deterred some researchers aware of the problems that gradient ascent approaches face with a possibly ill-conditioned kernel matrix. Here we show when and why the recently developed algorithms for SMO using positive definite kernels, i.e., models without a bias term (Vogt, 2002), and the KA for both classification (Frieß, Cristianini, Campbell, 1998) and regression (Veropoulos, 2001) are identical. Both the KA and the SMO algorithm attempt to solve the following QP problem in the case of classification (Vapnik, 1995; Cherkassky and Mulier, 1998; Cristianini and Shawe-Taylor, 2000; Kecman, 2001; Schölkopf and Smola, 2002): maximize the dual Lagrangian
L_d(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j),    (1)

subject to

\alpha_i \ge 0,\; i = 1, \ldots, l, \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,    (2)

where l is the number of training data pairs, αi are the dual Lagrange variables, yi are the class labels (±1), and K(xi, xj) are the kernel function values. Because of noise or generic class features, training data points will overlap. In that case, nothing but the constraints changes in solving (1), and they become
0 \le \alpha_i \le C,\; i = 1, \ldots, l, \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,    (3)
where 0 < C < ∞ is a penalty parameter trading off the size of the margin against the number of misclassifications.
In the case of nonlinear regression, the learning problem is the maximization of the dual Lagrangian below,
L_d(\alpha, \alpha^*) = -\varepsilon \sum_{i=1}^{l} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) y_i - \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j),    (4)

s.t. \sum_{i=1}^{l} \alpha_i^* = \sum_{i=1}^{l} \alpha_i,    (4a)

0 \le \alpha_i^* \le C, \quad 0 \le \alpha_i \le C,\; i = 1, \ldots, l,    (4b)
where ε is the prescribed size of the insensitivity zone, and αi and αi* (i = 1, ..., l) are the Lagrange multipliers for the points above and below the regression function, respectively. Learning results in l Lagrange multiplier pairs (αi, αi*). Because no training data point can be on both sides of the tube, at least one of αi and αi* will be zero, i.e., αiαi* = 0.
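To make the classification dual (1)-(3) concrete, the following sketch (our own illustration; the function names, the Gaussian width sigma, and the toy data are ours, not from the paper) builds an RBF kernel matrix and evaluates the dual Lagrangian (1) for a given α:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gaussian RBF kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def dual_lagrangian(alpha, y, K):
    """Classification dual L_d(alpha) of Eq. (1)."""
    H = (y[:, None] * y[None, :]) * K          # H_ij = y_i * y_j * K(x_i, x_j)
    return np.sum(alpha) - 0.5 * alpha @ H @ alpha

# toy data: two Gaussian blobs labelled -1 and +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(1.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

K = rbf_kernel_matrix(X, sigma=1.0)
alpha = np.zeros(len(y))                       # alpha = 0 is feasible for (2)/(3)
print(dual_lagrangian(alpha, y, K))            # L_d = 0 at this starting point
```

A positive definite kernel such as this RBF keeps the dual (1) concave, which is what the no-bias algorithms of Section 2 rely on.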

2. The KA and SMO learning algorithms without-bias-term

It is known that positive definite kernels (such as the most popular and most widely used RBF Gaussian kernels, as well as the complete polynomial ones) do not require a bias term (Evgeniou, Pontil, Poggio, 2000). Below, the KA and the SMO algorithms are presented for such a fixed- (i.e., no-) bias design problem and compared for the classification and regression cases. The equality of the two learning schemes and of the resulting models is established. Originally, in (Platt, 1998, 1999), the SMO classification algorithm was developed for solving the problem (1) including the constraints related to the bias b. In these early publications the case when the bias b is a fixed variable was also mentioned, but a detailed analysis of a fixed-bias update was not carried out.
2.1 Incremental Learning in Classification

a) Kernel AdaTron in classification


The classic AdaTron algorithm as given in (Anlauf and Biehl, 1989) was developed for a linear classifier. The KA is a variant of the classic AdaTron algorithm in the feature space of SVMs (Frieß et al., 1998). The KA algorithm solves the maximization of the dual Lagrangian (1) by implementing a gradient ascent procedure. The update Δαi of the dual variables αi is given as
\Delta\alpha_i = \eta \frac{\partial L_d}{\partial \alpha_i} = \eta \left( 1 - y_i \sum_{j=1}^{l} \alpha_j y_j K(x_i, x_j) \right) = \eta (1 - y_i f_i),    (5a)
where fi is the value of the decision function f at the point xi, i.e., f_i = \sum_{j=1}^{l} \alpha_j y_j K(x_i, x_j), and yi denotes the desired target (or class label), which is either +1 or -1. The update of the dual variables αi is given as

\alpha_i \leftarrow \min(\max(0, \alpha_i + \Delta\alpha_i), C),\quad i = 1, \ldots, l.    (5b)

In other words, the dual variables αi are clipped to zero if (αi + Δαi) < 0. In the case of the soft nonlinear classifier (C < ∞), the αi are clipped between zero and C (0 ≤ αi ≤ C). The algorithm converges from any initial setting of the Lagrange multipliers αi.
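As an illustration only (not the authors' code), a minimal sketch of the KA loop (5a)-(5b) in the no-bias setting might look as follows; the epoch count and the default per-sample rate ηi = 1/K(xi, xi) are our own choices:

```python
import numpy as np

def kernel_adatron_classification(K, y, C=1.0, eta=None, n_epochs=100):
    """KA updates (5a)-(5b) for the no-bias classifier.
    K  : (l, l) positive definite kernel matrix
    y  : labels in {-1, +1}
    eta: learning rate; eta_i = 1 / K_ii reproduces the SMO step (6)."""
    l = len(y)
    alpha = np.zeros(l)
    if eta is None:
        eta = 1.0 / np.diag(K)                      # per-sample optimal rate
    eta = np.broadcast_to(eta, (l,))
    for _ in range(n_epochs):
        for i in range(l):
            f_i = np.sum(alpha * y * K[:, i])       # decision function at x_i
            delta = eta[i] * (1.0 - y[i] * f_i)     # Eq. (5a)
            alpha[i] = min(max(0.0, alpha[i] + delta), C)   # clipping, Eq. (5b)
    return alpha
```

With the default ηi = 1/Kii each pass produces exactly the SMO step (6) discussed next.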

b) SMO without-bias-term in classification


Recently, (Vogt, 2002) derived the update rule for the multipliers αi, including a detailed analysis of the Karush-Kuhn-Tucker (KKT) conditions for checking the optimality of the solution. (As mentioned above, a fixed-bias update was referred to in Platt's papers.) The following update rule for αi in a no-bias SMO algorithm was proposed,

\Delta\alpha_i = -\frac{y_i E_i}{K(x_i, x_i)} = -\frac{y_i f_i - 1}{K(x_i, x_i)} = \frac{1 - y_i f_i}{K(x_i, x_i)},    (6)

where Ei = fi - yi denotes the difference between the value of the decision function f at the point xi and the desired target (label) yi. Note the equality of (5a) and (6) when the learning rate in (5a) is chosen to be ηi = 1/K(xi, xi). An important part of the SMO algorithm is to check the KKT conditions with precision τ (e.g., τ = 10^-3) in each step. An update is performed only if

\alpha_i < C \wedge y_i E_i < -\tau, or
\alpha_i > 0 \wedge y_i E_i > \tau.    (6a)

After an update, the same clipping operation as in (5b) is performed,

\alpha_i \leftarrow \min(\max(0, \alpha_i + \Delta\alpha_i), C),\quad i = 1, \ldots, l.    (6b)
It is the nonlinear clipping operation in (5b) and (6b) that makes the KA and the SMO without-bias-term algorithm strictly equal in solving nonlinear classification problems. This fact sheds new light on both algorithms. The equality is not that obvious in the case of a 'classic' SMO algorithm with a bias term, due to the heuristics involved in the selection of active points, which should ensure the largest increase of the dual Lagrangian Ld during the iterative optimization steps.
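A sketch of the corresponding no-bias SMO loop with the KKT check (6a); the sweep over all i and the stopping rule ("no change in a full pass") are implementation choices of ours, not prescribed by the paper:

```python
import numpy as np

def smo_no_bias_classification(K, y, C=1.0, tau=1e-3, max_epochs=100):
    """No-bias SMO updates (6)-(6b): change alpha_i only when the
    KKT conditions (6a) are violated by more than tau."""
    l = len(y)
    alpha = np.zeros(l)
    for _ in range(max_epochs):
        changed = False
        for i in range(l):
            f_i = np.sum(alpha * y * K[:, i])
            E_i = f_i - y[i]
            violated = (alpha[i] < C and y[i] * E_i < -tau) or \
                       (alpha[i] > 0 and y[i] * E_i > tau)        # Eq. (6a)
            if violated:
                delta = -y[i] * E_i / K[i, i]                     # Eq. (6)
                alpha[i] = min(max(0.0, alpha[i] + delta), C)     # Eq. (6b)
                changed = True
        if not changed:        # all KKT conditions satisfied within tau
            break
    return alpha
```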
2.2 Incremental Learning in Regression

Similarly to the case of classification, there is a strict equality between the KA and the
SMO algorithm when positive definite kernels are used for nonlinear regression.

a) Kernel AdaTron in regression


The first extension of the Kernel AdaTron algorithm for regression is presented in
(Veropoulos, 2001) as the following gradient ascent update rules for αi and αi*
\Delta\alpha_i = \eta_i \frac{\partial L_d}{\partial \alpha_i} = \eta_i \left( y_i - \varepsilon - \sum_{j=1}^{l} (\alpha_j - \alpha_j^*) K(x_j, x_i) \right) = \eta_i (y_i - \varepsilon - f_i) = -\eta_i (E_i + \varepsilon),    (7a)

\Delta\alpha_i^* = \eta_i \frac{\partial L_d}{\partial \alpha_i^*} = \eta_i \left( -y_i - \varepsilon + \sum_{j=1}^{l} (\alpha_j - \alpha_j^*) K(x_j, x_i) \right) = \eta_i (-y_i - \varepsilon + f_i) = \eta_i (E_i - \varepsilon),    (7b)

where yi is the measured value for the input xi, ε is the prescribed insensitivity zone, and Ei = fi - yi stands for the difference between the value of the regression function f at the point xi and the desired target value yi at this point. The calculation of the gradients above does not take into account the geometric reality that no training data point can be on both sides of the tube. In other words, it does not use the fact that at least one of αi and αi* will be zero, i.e., that αiαi* = 0 must be fulfilled in each iteration step. Below we derive the gradients of the dual Lagrangian Ld accounting for this geometry. This new formulation of the KA algorithm strictly equals the SMO method and is given as
\frac{\partial L_d}{\partial \alpha_i} = -K(x_i, x_i)\alpha_i - \sum_{j=1, j \ne i}^{l} (\alpha_j - \alpha_j^*) K(x_j, x_i) + y_i - \varepsilon + K(x_i, x_i)\alpha_i^* - K(x_i, x_i)\alpha_i^*
= -K(x_i, x_i)\alpha_i^* - (\alpha_i - \alpha_i^*) K(x_i, x_i) - \sum_{j=1, j \ne i}^{l} (\alpha_j - \alpha_j^*) K(x_j, x_i) + y_i - \varepsilon
= -K(x_i, x_i)\alpha_i^* + y_i - \varepsilon - f_i = -\left( K(x_i, x_i)\alpha_i^* + E_i + \varepsilon \right)    (8a)
For the αi* multipliers, the value of the gradient is

\frac{\partial L_d}{\partial \alpha_i^*} = -K(x_i, x_i)\alpha_i + E_i - \varepsilon.    (8b)
The update value for αi is now

\Delta\alpha_i = \eta_i \frac{\partial L_d}{\partial \alpha_i} = -\eta_i \left( K(x_i, x_i)\alpha_i^* + E_i + \varepsilon \right),    (9a)

\alpha_i \leftarrow \alpha_i + \Delta\alpha_i = \alpha_i + \eta_i \frac{\partial L_d}{\partial \alpha_i} = \alpha_i - \eta_i \left( K(x_i, x_i)\alpha_i^* + E_i + \varepsilon \right).    (9b)
For the learning rate ηi = 1/K(xi, xi) the gradient ascent learning KA is defined as

\alpha_i \leftarrow \alpha_i - \alpha_i^* - \frac{E_i + \varepsilon}{K(x_i, x_i)}.    (10a)

Similarly, the update rule for αi* is

\alpha_i^* \leftarrow \alpha_i^* - \alpha_i + \frac{E_i - \varepsilon}{K(x_i, x_i)}.    (10b)
As in classification, αi and αi* are clipped between zero and C,

\alpha_i \leftarrow \min(\max(0, \alpha_i), C),\quad i = 1, \ldots, l,    (11a)

\alpha_i^* \leftarrow \min(\max(0, \alpha_i^*), C),\quad i = 1, \ldots, l.    (11b)
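A minimal sketch of the KA regression updates (10a)-(10b) with the clipping (11a)-(11b); evaluating both candidate values from the current αi, αi* before clipping, and the fixed epoch count, are our own implementation choices:

```python
import numpy as np

def kernel_adatron_regression(K, y, C=1.0, eps=0.1, n_epochs=100):
    """KA regression updates (10a)-(10b) with the clipping (11a)-(11b),
    using the learning rate eta_i = 1 / K(x_i, x_i)."""
    l = len(y)
    alpha = np.zeros(l)        # multipliers for points above the tube
    alpha_s = np.zeros(l)      # starred multipliers (points below the tube)
    for _ in range(n_epochs):
        for i in range(l):
            f_i = np.sum((alpha - alpha_s) * K[:, i])    # regression function
            E_i = f_i - y[i]
            a_new = alpha[i] - alpha_s[i] - (E_i + eps) / K[i, i]     # (10a)
            as_new = alpha_s[i] - alpha[i] + (E_i - eps) / K[i, i]    # (10b)
            # a_new + as_new = -2*eps/K_ii <= 0, so after clipping at most
            # one multiplier stays positive and alpha_i * alpha_i* = 0 holds
            alpha[i] = min(max(0.0, a_new), C)                        # (11a)
            alpha_s[i] = min(max(0.0, as_new), C)                     # (11b)
    return alpha, alpha_s
```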
b) SMO without-bias-term in regression

The first algorithm for the SMO without-bias-term in regression (together with a detailed analysis of the KKT conditions for checking the optimality of the solution) is derived in (Vogt, 2002). The following learning rules for the updates of the Lagrange multipliers αi and αi* were proposed,

\alpha_i \leftarrow \alpha_i - \alpha_i^* - \frac{E_i + \varepsilon}{K(x_i, x_i)},    (12a)

\alpha_i^* \leftarrow \alpha_i^* - \alpha_i + \frac{E_i - \varepsilon}{K(x_i, x_i)}.    (12b)
The equality of equations (10a, b) and (12a, b) is obvious when the learning rate, as presented above in (10a, b), is chosen to be ηi = 1/K(xi, xi). Thus, in both classification and regression, the optimal learning rate is not necessarily equal for all training data pairs. For a Gaussian kernel, η = 1 is the same for all data points, while for a complete n-th order polynomial kernel each data point has a different learning rate ηi = 1/(xiTxi + 1)^n. Similarly to classification, a joint update of αi and αi* is performed only if the KKT conditions are violated by at least τ, i.e., if

\alpha_i < C \wedge \varepsilon + E_i < -\tau, or
\alpha_i > 0 \wedge \varepsilon + E_i > \tau, or
\alpha_i^* < C \wedge \varepsilon - E_i < -\tau, or
\alpha_i^* > 0 \wedge \varepsilon - E_i > \tau.    (13)
After the changes, the same clipping operations as defined in (11) are performed,

\alpha_i \leftarrow \min(\max(0, \alpha_i), C),\quad i = 1, \ldots, l,    (14a)

\alpha_i^* \leftarrow \min(\max(0, \alpha_i^*), C),\quad i = 1, \ldots, l.    (14b)
The KA learning as formulated in this paper and the SMO algorithm without-bias-term for solving regression tasks are strictly equal in terms of both the number of iterations required and the final values of the Lagrange multipliers. The equality is strict despite the fact that the implementations are slightly different. Namely, in every iteration step the KA algorithm updates both weights αi and αi* without checking whether the KKT conditions are fulfilled, while the SMO performs an update only according to the conditions (13).
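In code, the only difference of the SMO variant from the KA regression sketch above is that the joint update is gated by the KKT check (13); a sketch of that gate (function and argument names are ours):

```python
def kkt_violated_regression(alpha_i, alpha_s_i, E_i, eps, C, tau=1e-3):
    """True if the regression KKT conditions (13) are violated by more
    than tau, i.e., if the joint (alpha_i, alpha_i*) update should run."""
    return ((alpha_i < C and eps + E_i < -tau) or
            (alpha_i > 0 and eps + E_i > tau) or
            (alpha_s_i < C and eps - E_i < -tau) or
            (alpha_s_i > 0 and eps - E_i > tau))
```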

3. The Coordinate Ascent Based Learning for Nonlinear Classification and Regression Tasks
When positive definite kernels are used, the learning problem for both tasks is the same. In vector-matrix notation, in a dual space, the learning is represented as:

maximize L_d(\alpha) = -0.5\, \alpha^T K \alpha + f^T \alpha,    (15)

s.t. 0 \le \alpha_i \le C,\; i = 1, \ldots, n,    (16)
where, in classification, n = l and the matrix K is an (l, l) symmetric positive definite matrix, while in regression n = 2l and K is a (2l, 2l) symmetric positive semidefinite one. Note that the constraints (16) define a convex subspace over which the dual Lagrangian (15) should be maximized. It is very well known that the vector α may be regarded as the solution of the system of linear equations

K\alpha = f    (17)

subject to the same constraints as given by (16).
Thus, it may seem natural to solve (17), subject to (16), by applying some of the well-known and established techniques for solving a general linear system of equations. The size of the training data set and the constraints (16) eliminate direct techniques. Hence, one has to resort to iterative approaches for solving the problems above. There are three possible iterative avenues that can be followed: the use of the Non-Negative Least Squares (NNLS) technique (Lawson and Hanson, 1974), the application of the Non-Negative Conjugate Gradient (NNCG) method (Hestenes, 1980), and the implementation of the Gauss-Seidel (GS) method, i.e., the related Successive Over-Relaxation (SOR) technique. The first two methods handle the non-negativity constraints only. Thus, they are not suitable for solving 'soft' tasks, when a penalty parameter C < ∞ is used, i.e., when there is an upper bound on the maximal value of αi. Nevertheless, in the case of nonlinear regression, one can apply NNLS and NNCG by taking C = ∞ and compensating (i.e., smoothing or 'softening' the solution) by increasing the insensitivity zone ε. However, the two methods (namely NNLS and NNCG) are not suitable for solving soft-margin (C < ∞) classification problems in their present form, because there is no other parameter that can be used in 'softening' the margin.
Here we show how to extend the application of GS and SOR to both the nonlinear classification and the nonlinear regression tasks. The Gauss-Seidel method solves (17) by using the i-th equation to update the i-th unknown iteratively, i.e., in the k-th step the first equation is used to compute α1^{k+1}, then the second equation is used to calculate α2^{k+1} by using the new α1^{k+1} and the old αi^k (i > 2), and so on. The iterative learning takes the following form,


\alpha_i^{k+1} = \left( f_i - \sum_{j=1}^{i-1} K_{ij} \alpha_j^{k+1} - \sum_{j=i+1}^{n} K_{ij} \alpha_j^{k} \right) / K_{ii} = \alpha_i^{k} - \frac{1}{K_{ii}} \left( \sum_{j=1}^{i-1} K_{ij} \alpha_j^{k+1} + \sum_{j=i}^{n} K_{ij} \alpha_j^{k} - f_i \right) = \alpha_i^{k} + \frac{1}{K_{ii}} \left. \frac{\partial L_d}{\partial \alpha_i} \right|_{k+1}    (18)
where we use the fact that the term within the second bracket is, up to the sign, the i-th element of the gradient of the dual Lagrangian Ld given in (15) at the (k+1)-th iteration step (its negation, f_i - \sum_j K_{ij}\alpha_j, is the residual ri of the mathematical references). Equation (18) shows that the GS method is a coordinate gradient ascent procedure, just as the KA and the SMO are. The KA and SMO for positive definite kernels equal the GS! Note that the optimal learning rate used in both the KA algorithm and the SMO without-bias-term approach is exactly equal to the coefficient 1/Kii in the GS method. Based on this equality, the convergence theorem for the KA, SMO and GS (i.e., SOR) in solving (15) subject to the constraints (16) can be stated and proved as follows:
Theorem: For SVMs with positive definite kernels, the iterative learning algorithms KA, i.e., SMO, i.e., GS, i.e., SOR, in solving the nonlinear classification and regression tasks (15) subject to the constraints (16), converge starting from any initial choice of α^0.

Proof: The proof is based on the very well-known theorem on the convergence of the GS method for symmetric positive definite matrices in solving (17) without constraints (Ostrowski, 1966). First note that for positive definite kernels, the matrix K created by the terms yiyjK(xi, xj) in the second sum in (1), and involved in solving the classification problem, is also positive definite. In regression tasks K is a symmetric positive semidefinite (meaning still convex) matrix, which after a mild regularization given as (K ← K + λI, λ ~ 1e-12) becomes a positive definite one. (Note that the proof in the case of regression does not need regularization at all, but there is no space here to go into these details.) Hence, the learning without the constraints (16) converges, starting from any initial point α^0, and each point in the n-dimensional search space for the multipliers αi is a viable starting point ensuring convergence of the algorithm to the maximum of the dual Lagrangian Ld. This naturally includes all (starting) points within, or on the boundary of, any convex subspace of the search space, ensuring the convergence of the algorithm to the maximum of the dual Lagrangian Ld over the given subspace. The constraints imposed by (16), preventing the variables αi from being negative or larger than C, and implemented by the clipping operators above, define such a convex subspace. Thus, each 'clipped' multiplier value αi defines a new starting point of the algorithm, guaranteeing the convergence to the maximum of Ld over the subspace defined by (16). For a convex constraining subspace such a constrained maximum is unique. Q.E.D.
Due to the lack of space we do not go into a discussion of the convergence rate here and leave it for another occasion. It should only be mentioned that both KA and SMO (i.e., GS and SOR) for positive definite kernels have been successfully applied to many problems (see the references given here, as well as many others benchmarking the mentioned methods on various data sets). Finally, let us just mention that the standard extension of the GS method is the method of successive over-relaxation, which can significantly reduce the number of iterations required by a proper choice of the relaxation parameter ω. The SOR method uses the following updating rule,

\alpha_i^{k+1} = \alpha_i^{k} - \frac{\omega}{K_{ii}} \left( \sum_{j=1}^{i-1} K_{ij} \alpha_j^{k+1} + \sum_{j=i}^{n} K_{ij} \alpha_j^{k} - f_i \right) = \alpha_i^{k} + \frac{\omega}{K_{ii}} \left. \frac{\partial L_d}{\partial \alpha_i} \right|_{k+1}    (19)

and similarly to the KA, SMO, and GS its convergence is guaranteed.
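A sketch of the projected SOR sweep (19) with clipping to the box (16); with ω = 1 it reduces to the GS step (18), i.e., to the KA/SMO update with ηi = 1/Kii. For classification one would pass f = 1 (a vector of ones) and the matrix with entries yiyjK(xi, xj) from (1); the function name and iteration count are our own:

```python
import numpy as np

def projected_sor(K, f, C=1.0, omega=1.0, n_iters=100):
    """Projected SOR iteration (19) for max L_d = -0.5 a'Ka + f'a
    s.t. 0 <= a_i <= C; omega = 1 recovers the Gauss-Seidel step (18),
    i.e., the KA/SMO update with eta_i = 1 / K_ii."""
    n = len(f)
    alpha = np.zeros(n)
    for _ in range(n_iters):
        for i in range(n):
            # in-place updates give the mixed old/new sums of Eq. (19)
            grad_i = f[i] - K[i, :] @ alpha          # i-th gradient of L_d
            alpha_i = alpha[i] + omega * grad_i / K[i, i]   # Eq. (19)
            alpha[i] = min(max(0.0, alpha_i), C)     # clip to the convex set (16)
    return alpha
```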

4. Conclusions
Both the KA and the SMO algorithms were recently developed and introduced as alternatives to solving the quadratic programming problem while training support vector machines on huge data sets. It was shown that, when positive definite kernels are used, the two algorithms are identical in their analytic form and numerical implementation. In addition, for positive definite kernels both algorithms are strictly identical to the classic iterative GS (optimal coordinate ascent) learning and its extension SOR. Until now, these facts were blurred, mainly due to the different ways of posing the learning problems and due to the 'heavy' heuristics involved in SMO implementations, which obscured the possible identity of the methods. It is shown that in the so-called no-bias SVMs both the KA and the SMO procedure are coordinate ascent based methods. Finally, due to the many ways in which all three algorithms (KA, SMO and GS, i.e., SOR) can be implemented, there may be some differences in their overall behaviour. The introduction of the relaxation parameter 0 < ω < 2 will speed up the algorithm. The exact optimal value ωopt is problem dependent.
Acknowledgment: The results presented were initiated during the stay of the first author at Prof. Rolf Isermann's Institute and sponsored by the Deutsche Forschungsgemeinschaft (DFG). He is thankful to both Prof. Rolf Isermann and the DFG for all the support during this stay.

5. References
1. Anlauf, J. K., Biehl, M., The AdaTron - an adaptive perceptron algorithm. Europhys-
ics Letters, 10(7), pp. 687–692, 1989
2. Cherkassky, V., Mulier, F., Learning From Data: Concepts, Theory and Methods,
John Wiley & Sons, New York, NY, 1998
3. Cristianini, N., Shawe-Taylor, J., An introduction to Support Vector Machines and
other kernel-based learning methods, Cambridge University Press, Cambridge, UK,
2000
4. Evgeniou, T., Pontil, M., Poggio, T., Regularization networks and support vector ma-
chines, Advances in Computational Mathematics, 13, pp.1-50, 2000.
5. Frieß, T.-T., Cristianini, N., Campbell, I. C. G., The Kernel-Adatron: a Fast and Sim-
ple Learning Procedure for Support Vector Machines. In Shavlik, J., editor, Proceed-
ings of the 15th International Conference on Machine Learning, Morgan Kaufmann,
pp. 188–196, San Francisco, CA, 1998
6. Kecman V., Learning and Soft Computing, Support Vector Machines, Neural Net-
works, and Fuzzy Logic Models, The MIT Press, Cambridge, MA,
(http://www.support-vector.ws), 2001
7. Lawson, C. I., Hanson, R. J., Solving Least Squares Problems, Prentice-Hall, Engle-
wood Cliffs, N.J., 1974
8. Ostrowski, A.M., Solutions of Equations and Systems of Equations, 2nd ed., Aca-
demic Press, New York, 1966
9. Platt, J. C., Sequential minimal optimization: A fast algorithm for training support
vector machines. TR MSR-TR-98-14, Microsoft Research, 1998
10. Platt, J.C., Fast Training of Support Vector Machines using Sequential Minimal Op-
timization. Ch. 12 in Advances in Kernel Methods – Support Vector Learning, edited
by B. Schölkopf, C. Burges, A. Smola, The MIT Press, Cambridge, MA, 1999
11. Schölkopf B., Smola, A., Learning with Kernels – Support Vector Machines, Optimi-
zation, and Beyond, The MIT Press, Cambridge, MA, 2002
12. Veropoulos, K., Machine Learning Approaches to Medical Decision Making, PhD
Thesis, The University of Bristol, Bristol, UK, 2001
13. Vapnik, V.N., The Nature of Statistical Learning Theory, Springer Verlag Inc, New
York, NY, 1995
14. Vogt, M., SMO Algorithms for Support Vector Machines without Bias, Institute Re-
port, Institute of Automatic Control, TU Darmstadt, Darmstadt, Germany,
(http://w3.rt.e-technik.tu-darmstadt.de/~vogt/), 2002
