Boosting of Support Vector Machines with application to editing

Pedro Rangel, Fernando Lozano, Elkin García


Departamento de Ingeniería Eléctrica y Electrónica
Universidad de los Andes
Bogotá, Colombia
{p-rangel,flozano,elkin-ga}@uniandes.edu.co

Abstract

In this paper, we present a weakened variation of Support Vector Machines that can be used together with Adaboost. Our modified Support Vector Machine algorithm has the following interesting properties: First, it is able to handle distributions over the training data. Second, it is a weak algorithm in the sense that it ensures an empirical error upper bounded by 1/2. Third, when used together with Adaboost, the resulting algorithm is faster than the usual SVM training algorithm. Finally, we show that our boosted SVM can be effective as an editing algorithm.

1. Introduction

Adaboost [4] is an algorithm which employs a weak learner (i.e., an algorithm which returns a hypothesis that is a little better than random guessing) to find a good hypothesis. In this paper, we present a classification algorithm which uses a variant of Support Vector Machines as the weak learner for Adaboost. The success of Support Vector Machines (SVM) in practical classification problems [9] has been explained from the statistical learning standpoint. In particular, it has recently been shown that for certain kernels SVMs are strong learners [11]. This means that they can achieve generalization error arbitrarily close to the Bayes error, given a large enough training data set.

Some empirical evidence shows that using a strong learner as the base classifier for Adaboost is not a good idea. Wickramaratna, Holden and Buxton apply SVM directly with Adaboost [12] and observe that the performance of the boosted classifier degrades as the number of Adaboost rounds increases. Of course, boosting a strong learner does not make much sense from the generalization error point of view. However, we will see that if the SVM is forced to output a weaker hypothesis, boosting still may have some other advantages.

The main drawbacks of the Support Vector Machines are the time and space complexities of the training algorithm. Usually, the SVM training algorithm solves a large Quadratic Programming (QP) problem. Let m denote the number of training examples. The time complexity of a QP problem is O(m³). Several researchers have proposed various methods that improve this time complexity. Some of these algorithms include chunking, the decomposition method, and the sequential minimal optimization method [7]. In practice, these algorithms have a time complexity that usually scales like O(m²).

Pavlov, Mao and Dom [6] propose a method for speeding and scaling up the SVM training algorithm using Adaboost. They implement the SVM training procedure by means of the Sequential Minimal Optimization algorithm. In order to simulate the distributions required by Adaboost, they use bootstrap samples of the original data. In their experiments, the bootstrap sample size was approximately equal to 2-4% of the original data set size. In this way, they substantially reduce the training set in each Adaboost round.

We present a modified version of SVM which has the following properties: First, it is able to handle distributions without using bootstrap samples. Second, it is a weak algorithm in the sense that it ensures an empirical error upper bounded by 1/2. Third, when our algorithm is combined with Adaboost, the resulting algorithm outputs a hypothesis that is also a SVM, but which is trained much faster. Finally, we present empirical evidence that suggests that our Boosted Support Vector Machines algorithm can be used as an editing algorithm [1]. Usually, the SVM classification rule is unnecessarily complex (i.e., the SVM hypothesis has linearly dependent support vectors). An editing algorithm is a procedure that reduces the training set in order to simplify the representation of the SVM.

2. Preliminaries

Let (x, y) be a random couple, where x is an instance in a space X and y ∈ {−1, 1} is a label. Let S = {(xi, yi) : i = 1, ..., m} be a labeled set, consisting of i.i.d. copies of (x, y).

Algorithm: Adaboost(S, D1, T, Weak)
Input: S = {(xi, yi) : i = 1, ..., m}, D1, T, Weak(·, ·)
Output: H(·)
for t = 1 to T do
    Get a weak hypothesis using Dt: ht ← Weak(S, Dt).
    Choose αt ∈ R.
    Update the distribution Dt+1.
end
Output: H(x) = sign( Σ_{t=1}^T αt ht(x) )

Figure 1: Adaboost Algorithm.

A classification rule, also called a hypothesis, is a function h : X → [−1, 1]. The sign of h(x) is interpreted as the predicted label to be assigned to instance x, while the magnitude |h(x)| is interpreted as the "confidence" in this prediction. The goodness of a hypothesis will be evaluated using the generalization error R and the empirical error Remp.

We say that H(x) is a combined classifier when it is a convex combination of several hypotheses hi. That is,

    H(x) = Σ_{i=1}^m αi · hi(x)

where αi ≥ 0 and Σ_{i=1}^m αi = 1. Each hypothesis hi will be called a base classifier.

2.1. Adaboost

Adaboost, first introduced in [4], is a meta-algorithm which has the ability to return a strong hypothesis using a weak algorithm as a subroutine. Figure 1 shows the Adaboost algorithm as presented in [8].

Adaboost takes a labeled examples set S, a discrete distribution D, and a weak learning algorithm Weak, and returns a combined classifier. At each iteration t, Adaboost executes Weak over the distribution Dt to get ht; then it modifies the distribution, assigning larger weights to examples misclassified by ht and smaller weights to examples with high confidence |ht(x)|.
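Figure 1 leaves the choice of αt and the update of Dt+1 abstract. As a concrete, hedged illustration (not the authors' implementation), the following Python sketch uses the standard choices from [4]: αt = ½ ln((1 − εt)/εt) and a multiplicative weight update followed by renormalization. Here weak_learner stands for any routine with the Weak(S, Dt) interface assumed above.

import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal Adaboost sketch; y must take values in {-1, +1}.
    weak_learner(X, y, D) must return a callable h with h(X) in [-1, 1]."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                       # D1: start from the uniform distribution
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                 # weak hypothesis h_t trained under D_t
        pred = np.sign(h(X))
        eps = D[pred != y].sum()                  # weighted empirical error of h_t
        eps = min(max(eps, 1e-12), 0.5 - 1e-12)   # keep alpha_t finite
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # one standard choice of alpha_t
        D = D * np.exp(-alpha * y * pred)         # raise weights on mistakes, lower on hits
        D = D / D.sum()                           # renormalize so D_{t+1} is a distribution
        hypotheses.append(h)
        alphas.append(alpha)

    def H(Xq):                                    # H(x) = sign(sum_t alpha_t h_t(x))
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, hypotheses)))
    return H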
2.2. Support Vector Machines (SVM)

Our algorithm uses the ν formulation of the SVM problem. The ν parameter allows us to control the empirical error of the classifier, as we explain below. The optimization problem in the ν formulation is as follows [10]:

    min_{w,ξ,ρ}  τ(w, ξ, ρ) = (1/2)‖w‖²_H − νρ + (1/m) Σ_{i=1}^m ξi        (1)

    s.t.  yi [⟨φ(xi), w⟩_H + b] ≥ ρ − ξi,   for i = 1, ..., m
          ξi ≥ 0,  ρ ≥ 0

Here H is the reproducing kernel Hilbert space of a positive definite kernel k(·, ·), and φ is a mapping from X to H such that k(xi, xj) = ⟨φ(xi), φ(xj)⟩_H. Using the kernel trick, the dual problem becomes:

    min_α  W(α) = (1/2) Σ_{i=1}^m Σ_{j=1}^m αi αj yi yj k(xi, xj)        (2)

    s.t.  0 ≤ αi ≤ 1/m
          Σ_{i=1}^m αi yi = 0
          Σ_{i=1}^m αi ≥ ν

This optimization problem yields a hypothesis of the form:

    h(x) = sign( Σ_{i=1}^m αi yi k(x, xi) + b )        (3)

The training examples are classified into three categories depending on the value of αi. Within each category, the data margins yi h(xi) are prescribed by the Karush-Kuhn-Tucker optimality conditions. The first category consists of all data points such that αi = 1/m; they satisfy yi h(xi) < ρ. These data points are called margin errors or bouncing support vectors. Note that the set of bouncing vectors includes all the training examples misclassified by the SVM. The second category consists of the data points such that 0 < αi < 1/m; they satisfy yi h(xi) = ρ. They are called ordinary support vectors. The third category consists of examples such that αi = 0; they satisfy yi h(xi) > ρ. These examples play no role in the SVM decision function.
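As a small illustration of the decision function (3) and of the three categories above, the following NumPy sketch (ours, not from the paper) evaluates the kernel expansion and labels each training point as a bouncing, ordinary, or unused vector; alpha, b, and rho are assumed to come from any ν-SVM solver, and the RBF kernel is only an example.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def decision_function(Xq, X, y, alpha, b, kernel=rbf_kernel):
    # Unthresholded version of (3): sum_i alpha_i y_i k(x, x_i) + b
    return kernel(Xq, X) @ (alpha * y) + b

def categorize(X, y, alpha, b, rho, m, kernel=rbf_kernel, tol=1e-8):
    f = decision_function(X, X, y, alpha, b, kernel)
    margin = y * f
    bouncing = np.isclose(alpha, 1.0 / m) & (margin < rho)   # margin errors
    ordinary = (alpha > tol) & (alpha < 1.0 / m - tol)        # sit on the margin
    unused   = np.isclose(alpha, 0.0)                         # play no role in h
    return bouncing, ordinary, unused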
3. Boosting Support Vector Machines

In this section, we present a boosting algorithm which uses SVM as its weak classifier. In order to do that, we need to solve two issues: we need to modify the support vector machine optimization problem so it can deal with distributions, and we need to weaken the performance of the support vector machine.

3.1. Modified Support Vector Machines

We modify the formulation of the ν-SVM to introduce the distribution D by multiplying the slack variables ξi by the corresponding weight Di:

    min_{w,ξ,ρ}  τ(w, ξ, ρ) = (1/2)‖w‖²_H − νρ + Σ_{i=1}^m Di ξi        (4)

    s.t.  yi [⟨φ(xi), w⟩_H + b] ≥ Di (ρ − ξi),   for i = 1, ..., m
          ξi ≥ 0,  ρ ≥ 0

In this way, we encourage solutions of the optimization problem that do not err on examples with larger weights. The dual problem becomes:

    min_α  W(α) = (1/2) Σ_{i=1}^m Σ_{j=1}^m αi αj yi yj k(xi, xj)        (5)

    s.t.  0 ≤ αi ≤ 1
          Σ_{i=1}^m αi yi = 0
          Σ_{i=1}^m Di αi ≥ ν        (6)

We can use several optimization algorithms to solve this problem efficiently. We say that SV_{k,ν}(S, D) is a learning algorithm that solves the optimization problem (5) and returns a hypothesis h_{w,b}.
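The paper solves (5) with a modified SMO routine (Section 4.1). As a hedged, self-contained alternative, the dual can also be handed to a generic constrained solver. The sketch below is our own: K is a precomputed kernel matrix, D is the distribution, and ν is the parameter of (6); b and ρ can afterwards be recovered from the Karush-Kuhn-Tucker conditions at points with 0 < αi < 1.

import numpy as np
from scipy.optimize import minimize

def solve_weighted_nu_svm_dual(K, y, D, nu):
    """Solve the dual (5)-(6): min 0.5 * a^T Q a  with Q_ij = y_i y_j K_ij,
    subject to 0 <= a_i <= 1, sum_i a_i y_i = 0, sum_i D_i a_i >= nu."""
    m = len(y)
    Q = (y[:, None] * y[None, :]) * K
    fun = lambda a: 0.5 * a @ Q @ a
    jac = lambda a: Q @ a
    constraints = [
        {"type": "eq",   "fun": lambda a: a @ y,      "jac": lambda a: y.astype(float)},
        {"type": "ineq", "fun": lambda a: a @ D - nu, "jac": lambda a: D.astype(float)},
    ]
    bounds = [(0.0, 1.0)] * m
    a0 = np.full(m, min(1.0, nu))          # rough starting point; SLSQP handles infeasibility
    res = minimize(fun, a0, jac=jac, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x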
The ν parameter in (5) lets us control the empirical error of the classifier, as is explained in the next lemma.

Lemma 1  Let h*_{w,b} be a hypothesis returned by SV_{k,ν}(S, D). If w and b are a feasible point of (4) with ρ > 0, then Remp(h*_{w,b}, S, D) ≤ ν (i.e., the ν parameter is an upper bound on the empirical error of the classifier).

Proof  By the Karush-Kuhn-Tucker optimality conditions, ρ greater than zero implies that (6) becomes an equality. Hence

    ν = Σ_{i=1}^m Di αi ≥ Σ_{i: αi = 1} Di ≥ Σ_{i: sign(h(xi)) ≠ yi} Di = Remp(h*_{w,b}, S, D),

where the last inequality follows from the fact that all examples with ξi > 0 satisfy αi = 1 (if not, αi could grow further to reduce the value of ξi), and the final equality is just the definition of the weighted empirical error. □

Then, if we can guarantee that ρ is greater than zero, we can guarantee that the empirical error is upper bounded by ν.

We define a special kind of kernels called universal kernels. Let φ be a transformation, and let k be the kernel of φ. We say that the kernel k is universal if for any uniquely labeled data set S = {(xi, yi) : i = 1, ..., m}, the image of S through φ is linearly separable. When we use a universal kernel in (5), we can always find a solution with ρ > 0.

Lemma 2  Let k be a universal kernel, and let h*_{w,b} be a hypothesis returned by SV_{k,ν}(S, D). Then the empirical error of h*_{w,b} is always upper bounded by ν.

The proof of this lemma follows immediately from Lemma 1 and the universality of the kernel.

Algorithm: WSV(S, D, k, γ, λ)
Input: S = {(xi, yi) : i = 1, ..., m}, D, k, γ, λ
Output: h(·)
begin
    Set ν = λ · (1/2 − γ).
    Set μ = (1 − λ) · (1/2 − γ).
    Select J such that Σ_{j∈J} D(j) ≤ μ, and it has maximum cardinality.
    Set S* = {(xj, yj) : j ∈ J′}.
    Set D* = D_{J′}.
end
Output the hypothesis h(·) ← SV_{k,ν}(S*, D*)

Figure 2: Weakened Support Vector Machine Algorithm.

3.2. Weakening Support Vector Machines

As mentioned in the introduction, using Adaboost in combination with Support Vector Machines can be counterproductive. Nevertheless, we claim that using a weakened Support Vector Machine in conjunction with Adaboost has some advantages over a single SVM trained in the usual fashion. The usefulness of the method we propose is twofold. First, it potentially improves the performance of the Support Vector Machines. Second, it speeds up the training algorithm.

Our weakened SVM algorithm, which we will refer to as WSV, takes as inputs a labeled examples set S, a distribution D, a kernel k, and a "weakness" constant γ. WSV returns a hypothesis with empirical error upper bounded by ε = 1/2 − γ. Moreover, it runs faster than the original support vector machine training algorithm. The basic idea of WSV is to reject some examples in the training set S and train the support vector machine using the remaining examples. Figure 2 shows the WSV algorithm.

Let J be the index set of the samples that we eliminate, and let J′ be the complement of J. Let μ be an upper bound on the weight of the rejected samples (i.e., μ ≥ Σ_{j∈J} D(j)). WSV runs SV_{k,ν}(·, ·) using a modified labeled examples set with less data.
It selects J such that Σ_{j∈J} D(j) ≤ μ and such that J has maximum cardinality. We can show that this algorithm guarantees an upper bound on the empirical error. To see this, notice that the allowed error ε can be split into two parts. The first part is due to the original algorithm SV_{k,ν}(S, D), and we can control it using the ν parameter (see Lemma 2). The second part comes from the discarded portion of the training set. If we assume a worst-case scenario in which the hypothesis returned by SV errs on the whole discarded set, then the empirical error of the algorithm is upper bounded by ν + μ. Now, let λ be a real number between 0 and 1. If we choose ν = λε and μ = (1 − λ)ε, we can guarantee that the empirical error of WSV is upper bounded by ε.

Since the time complexity of the training algorithm of SVM scales as O(m²), the training time of WSV is reduced. The number of rejected data points depends on two factors. First, it depends on the distribution over the original data set. For example, if the distribution is uniform (as is the case in the first round of Adaboost), the fraction of points rejected is about μ, so about 1 − μ of the points are retained. On the other hand, if the distribution is skewed (as in further rounds of Adaboost), the percentage of discarded points becomes larger. Hence, if we use WSV as the weak learner, we can execute many rounds of Adaboost in a small amount of time. The second factor which affects the number of rejected data points is the value of μ: the larger μ is, the more data is rejected. In the WSV algorithm, μ can be increased in two ways: decreasing γ, or decreasing λ.
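The rejection step of WSV in Figure 2 (choose J with total weight at most μ and maximum cardinality) amounts to greedily discarding the lowest-weight examples. A small sketch of that step, under our reading of Figure 2 (variable names are ours):

import numpy as np

def wsv_reject(D, gamma, lam):
    """Split indices into rejected set J (weight <= mu, maximum cardinality)
    and kept set J' for the weakened SVM of Figure 2."""
    eps = 0.5 - gamma                 # target weak error
    nu = lam * eps                    # error budget left to the nu-SVM
    mu = (1.0 - lam) * eps            # weight budget for rejected examples
    order = np.argsort(D)             # lightest examples first
    cum = np.cumsum(D[order])
    J = order[cum <= mu]              # max-cardinality set with weight <= mu
    keep = np.setdiff1d(np.arange(len(D)), J)
    return J, keep, nu

The kept indices and their weights D[keep] are then handed to SV_{k,ν}.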
3.3. Boosting SVM as an Editing Algorithm

During both the training and prediction stages, the performance of a SVM is highly influenced by the number of support vectors. When a SVM classifier does not have enough support vectors, it is not able to make predictions with high accuracy. On the other hand, the more support vectors, the slower the prediction. Ideally, the images of the support vectors in the feature space should be linearly independent, giving a more compact representation.

There are several methods to reduce the number of support vectors without degrading the generalization error of a SVM classifier. Most of these techniques focus on reducing the number of support vectors after the training procedure [3]. Since these techniques require computing the SVM solution before being applied, they do not improve the training time. On the other hand, there are methods that selectively reduce the training set before running the training algorithm [1].

In order to obtain a good SVM classifier with a reduced training set, the data points in the new training set must look as if they were drawn from a distribution that has the same Bayes decision boundary as the original problem, but with Bayes error equal to zero. So, the idea is to eliminate the training examples located on the wrong side of the Bayes decision boundary.

A priori, we do not know which training examples are badly placed. We propose to use the distributions generated by Adaboost to estimate the probability of a point being rejected. Adaboost gives high weights to the examples that are misclassified in various rounds. Hence, the points with large weights are more likely to be on the wrong side of the Bayes decision boundary. So, if we reject the examples with the largest weights, we can use the Boosting SVM algorithm to perform editing. The Weakened SVM algorithm for editing, WSVE, is shown in Figure 3. This modified algorithm creates a training set which rejects the points with large weights; then it runs the MWSV algorithm to get a hypothesis with high accuracy on this modified data set.

Algorithm: WSVE(S, D, k, γ, ν, μ, σ)
Input: S = {(xi, yi) : i = 1, ..., m}, D, k, γ, ν, μ, σ
Output: h(·)
begin
    Select J such that Σ_{j∈J} D(j) ≤ σ, and it has minimum cardinality.
    Set S* = {(xj, yj) : j ∈ J′}.
    Set D* = D_{J′} / Σ_{j∈J′} D(j).
    h(·) ← MWSV(S*, D*, k, γ, ν, μ).
end
Output the hypothesis h(·)

Figure 3: Weakened SVM Algorithm for editing (WSVE).
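Our reading of the rejection step in Figure 3 is that J collects the largest-weight examples, up to the weight budget σ, and the SVM is then trained on the complement with renormalized weights. A sketch under that assumption (names are ours):

import numpy as np

def wsve_reject(D, sigma):
    """Drop the heaviest examples (total weight at most sigma, as few points as
    possible) and renormalize the weights of the remaining examples."""
    order = np.argsort(D)[::-1]          # heaviest examples first
    cum = np.cumsum(D[order])
    J = order[cum <= sigma]              # rejected indices: large Adaboost weights
    keep = np.setdiff1d(np.arange(len(D)), J)
    D_star = D[keep] / D[keep].sum()     # D* = D_{J'} / sum_{j in J'} D(j)
    return J, keep, D_star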
4. Experiments

4.1. Boosting SVM

In this section, we report some experiments with the Boosting SVM algorithm proposed above. We implement the SVM training algorithm using the sequential minimal optimization algorithm introduced by Platt [7] and modified by Chang and Lin [2] for the ν formulation. We modify this algorithm so we can solve problem (5). This implementation performs caching of the most frequently used support vectors. Note that if we remove the sign function in (3), then a combined classifier of SVMs is also a SVM. We use this fact to create a SVM as the output of our Adaboost algorithm, which we refer to as the reduced model. In preliminary experiments, the resulting SVM has a similar decision function, but it has a number of support vectors that does not increase with the number of data points in the training set.
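To make the "reduced model" remark concrete: dropping the sign in (3) makes each round's output a kernel expansion, so the convex combination returned by Adaboost can be collapsed into a single expansion (duplicate support vectors could additionally be merged). The bookkeeping below is our own sketch, not the authors' code; the per-round support vectors, coefficients, and offsets are assumed to be exported by the weak learner.

import numpy as np

def merge_svms(rounds, boost_alphas):
    """rounds: list of (sv_X, coef, b) triples, one per Adaboost round, where
    coef_i = alpha_i * y_i for that round's support vectors and b is its offset.
    boost_alphas: the Adaboost combination weights alpha_t.
    Returns one kernel expansion sum_i c_i k(x, x_i) + b0 (the 'reduced model')."""
    A = np.asarray(boost_alphas, dtype=float)
    A = A / A.sum()                              # convex combination of the rounds
    sv_X = np.vstack([X_t for (X_t, coef_t, b_t) in rounds])
    coefs = np.concatenate([a * np.asarray(coef_t)
                            for a, (X_t, coef_t, b_t) in zip(A, rounds)])
    b0 = float(sum(a * b_t for a, (X_t, coef_t, b_t) in zip(A, rounds)))
    return sv_X, coefs, b0

def predict(Xq, sv_X, coefs, b0, kernel):
    # The merged classifier is again of the form (3).
    return np.sign(kernel(Xq, sv_X) @ coefs + b0)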
We show results for an artificial data set and a real-life data set. The artificial data set is called Fournorm. This is a twenty-dimensional binary classification problem.
Data for the first class is drawn with equal probability from one of two multivariate normal distributions with identity covariance matrix and means (a, a, ..., a) and (−a, −a, ..., −a), while data for the other class is drawn with equal probability from one of two multivariate normal distributions with means (a, −a, ..., −a) and (−a, a, −a, a, ..., a) and identity covariance matrices. We set a = 2/√2. In Figure 4 we plot training time and training error against the weight of the data that is rejected. Figure 5 shows the generalization error vs. the weight of the data that is rejected.

Figure 4. Training time and training error vs. weight of rejected data (Fournorm).

Figure 5. Generalization error vs. weight of rejected data (Fournorm).
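For reference, a small generator for a Fournorm-style data set as we understand the description above (the class/mean conventions are our reading of the text, not code from the paper):

import numpy as np

def make_fournorm(n, d=20, a=2.0 / np.sqrt(2.0), rng=None):
    """Each class is an equal mixture of two N(mean, I) Gaussians in R^d.
    Class +1: means (a,...,a) and (-a,...,-a).
    Class -1: means (a,-a,a,-a,...) and (-a,a,-a,a,...)."""
    rng = np.random.default_rng(rng)
    alt = a * np.where(np.arange(d) % 2 == 0, 1.0, -1.0)   # (a,-a,a,-a,...)
    means = {+1: [np.full(d, a), np.full(d, -a)], -1: [alt, -alt]}
    X, y = [], []
    for _ in range(n):
        label = rng.choice([+1, -1])
        mean = means[label][rng.integers(2)]
        X.append(rng.normal(loc=mean, scale=1.0))
        y.append(label)
    return np.array(X), np.array(y)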
For our second set of experiments we used the MNIST data set of handwritten digits [5]. To reduce the classification problem to a binary problem, we use only the data corresponding to the digits 8 and 3. We use 5000 training and 1984 test examples, with a polynomial kernel of degree 3 and ν = 0.1. We set the maximum number of iterations to 10000. The Adaboost algorithm runs 10 iterations for several values of μ between 0 and 0.9.

Figure 6. Training time and training error vs. weight of rejected data (MNIST).
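A comparable 3-vs-8 setup is easy to reproduce with off-the-shelf tools. The sketch below is ours (it uses scikit-learn's ν-SVM rather than the authors' modified SMO, and the particular train/test slice is only illustrative); it shows the data filtering and the kernel and ν settings quoted above.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.svm import NuSVC

# Load MNIST and keep only digits 3 and 8 (labels mapped to -1/+1).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = np.isin(y, ["3", "8"])
X, y = X[mask] / 255.0, np.where(y[mask] == "8", 1, -1)

X_train, y_train = X[:5000], y[:5000]          # 5000 training examples
X_test, y_test = X[5000:6984], y[5000:6984]    # 1984 test examples (illustrative split)

# nu-SVM with a degree-3 polynomial kernel and nu = 0.1 (no example weighting here).
clf = NuSVC(nu=0.1, kernel="poly", degree=3)
clf.fit(X_train, y_train)
print("test error:", 1.0 - clf.score(X_test, y_test))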
Results for both data sets are similar. We can observe that the training error is a decreasing function of the number of rounds, regardless of the amount of data that is rejected (see Figures 4 and 6). This presents an advantage in terms of the total training time relative to the training time using the whole data set: it is faster to train some rounds of Adaboost rejecting a lot of samples than it is to train with the original data set. For example, for the Fournorm data set, rejecting 58% of the data results in a training error of 1% after 5 rounds of Adaboost, but the training time is close to one half of the training time with the whole data set. For the MNIST data set, rejecting 50% of the data, the training error reaches 0% after 2 rounds of Adaboost and the training time is close to one half of the original training time.

Regarding the generalization error, Figure 7 shows that for the MNIST data, the test error with rejected data is less than the test error with the whole data set. After 5 rounds of Adaboost, rejecting 60% of the data set results in a generalization error similar to the error with the entire data set, with a dramatically reduced training time.

4.2. Boosting SVM as an Editing Algorithm

For these experiments, we use a simple two-dimensional binary classification problem. The classes are generated from two uniform distributions on squares of side one, with centers at c1 and c2 respectively. The position of the centers allows us to control the Bayes error of the problem. In these experiments, we set the Bayes error to 10%.
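As a concrete instance of this setup (our own choice of centers, not necessarily the one used in the paper): for two axis-aligned unit squares with equal class priors, the Bayes error is half the overlap area, so centers c1 = (0, 0) and c2 = (0.8, 0) give an overlap of 0.2 and a Bayes error of 10%.

import numpy as np

def make_two_squares(n, c1=(0.0, 0.0), c2=(0.8, 0.0), rng=None):
    """Each class is uniform on a unit square centered at c1 (label +1)
    or c2 (label -1). Bayes error = 0.5 * overlap area of the squares."""
    rng = np.random.default_rng(rng)
    y = rng.choice([+1, -1], size=n)
    centers = np.where(y[:, None] == 1, np.array(c1), np.array(c2))
    X = centers + rng.uniform(-0.5, 0.5, size=(n, 2))
    return X, y

def bayes_error(c1, c2):
    ox = max(0.0, 1.0 - abs(c1[0] - c2[0]))   # overlap along x
    oy = max(0.0, 1.0 - abs(c1[1] - c2[1]))   # overlap along y
    return 0.5 * ox * oy                      # assumes equal class priors

# bayes_error((0, 0), (0.8, 0)) -> 0.1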
Figure 8 shows that the number of support vectors of our methods does not increase very much with the training set size. This fact implies that our algorithm can be used effectively as an editing algorithm during the training procedure. In addition, our algorithms also reduce the training time, as shown in Figure 9. Furthermore, the empirical errors of the three methods are similar.

Figure 7. Generalization error vs. weight of rejected data (MNIST).

Figure 8. Number of support vectors vs. training set size.

Figure 9. Time vs. training set size.

5. Conclusions

Although there exists empirical evidence that boosting a strong learner may not be a good idea from the generalization standpoint, our experiments demonstrate that, when combined with the reduced set method that we propose, it can lead to some advantages in the running time of the algorithm. Moreover, Adaboost is able to identify very effectively the bouncing support vectors in a SVM solution. We exploit this fact to implement an editing algorithm that produces classifiers with a more compact representation.

Topics for future research are to investigate a more theoretical explanation of the editing capabilities of Adaboost, and methods for finding optimal values for all the hyperparameters of our algorithms.

Acknowledgement

This research was funded by a grant from the School of Engineering, Universidad de los Andes.

References

[1] G. Bakır, L. Bottou, and J. Weston. Breaking SVM complexity with cross-training. In Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.
[2] C.-C. Chang and C.-J. Lin. Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119–2147, 2001.
[3] T. Downs, K. E. Gates, and A. Masters. Exact simplification of support vector solutions. Journal of Machine Learning Research, 2:293–297, 2002.
[4] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, Aug. 1997.
[5] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[6] D. Pavlov, J. Mao, and D. Dom. Scaling-up support vector machines using boosting algorithm. In 15th International Conference on Pattern Recognition, volume 2, pages 219–222, 2000.
[7] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.
[8] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, Dec. 1999.
[9] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[10] B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. NeuroCOLT Technical Report NC-TR-98-031, Royal Holloway College, University of London, UK, 1998.
[11] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, January 2005.
[12] J. Wickramaratna, S. Holden, and B. Buxton. Performance degradation in boosting. In J. Kittler and F. Roli, editors, Proceedings of the 2nd International Workshop on Multiple Classifier Systems (MCS 2001), volume 2096 of LNCS, pages 11–21. Springer, 2001.
