Nonconvex Online Support Vector Machines


Şeyda Ertekin, Léon Bottou, and C. Lee Giles, Fellow, IEEE

Abstract—In this paper, we propose a nonconvex online Support Vector Machine (SVM) algorithm (LASVM-NC) based on the Ramp Loss, which strongly suppresses the influence of outliers. Then, again in the online learning setting, we propose an outlier filtering mechanism (LASVM-I) based on approximating nonconvex behavior in convex optimization. These two algorithms are built upon another novel SVM algorithm (LASVM-G) that is capable of generating accurate intermediate models in its iterative steps by leveraging the duality gap. We present experimental results demonstrating the merit of our frameworks in achieving significant robustness to outliers in noisy data classification where mislabeled training instances are abundant. Experimental evaluation shows that the proposed approaches yield a more scalable online SVM algorithm with sparser models and lower computational running time, both in the training and recognition phases, without sacrificing generalization performance. We also point out the relation between nonconvex optimization and min-margin active learning.

Index Terms—Online learning, nonconvex optimization, support vector machines, active learning.

1 INTRODUCTION

In supervised learning systems, the generalization performance of classification algorithms is known to be greatly improved with large margin training. Large margin classifiers find the maximal margin hyperplane that separates the training data in the appropriately chosen kernel-induced feature space. It is well established that if a large margin is obtained, the separating hyperplane is likely to have a small misclassification rate during recognition (or prediction) [1], [2], [3]. However, requiring that all instances be correctly classified with the specified margin often leads to overfitting, especially when the data set is noisy. Support Vector Machines [4] address this problem by using a soft margin criterion, which allows some examples to appear on the wrong side of the hyperplane (i.e., misclassified examples) in the training phase to achieve higher generalization accuracy. With the soft margin criterion, patterns are allowed to be misclassified for a certain cost, and consequently the outliers (the instances that are misclassified outside of the margin) start to play a dominant role in determining the decision hyperplane, since they tend to have the largest margin loss according to the Hinge Loss. Nonetheless, due to its convexity and practicality, the Hinge Loss has become a commonly used loss function in SVMs.

Convexity is viewed as a virtue in the machine learning literature from both a theoretical and an experimental point of view. Convex methods can be easily analyzed mathematically and bounds can be produced. Additionally, convex solutions are guaranteed to reach global optima and are not sensitive to initial conditions. The popularity of convexity further increased after the success of convex algorithms, particularly SVMs, which yield good generalization performance and have strong theoretical foundations. However, in convex SVM solvers, all misclassified examples become support vectors, which may limit the scalability of the algorithm to learning from large-scale data sets. In this paper, we show that nonconvexity can be very effective for achieving sparse and scalable solutions, particularly when the data contain abundant label noise. We present herein experimental results that show how a nonconvex loss function, the Ramp Loss, can be integrated into an online SVM algorithm in order to suppress the influence of misclassified examples.

Various works in the history of machine learning research have focused on using nonconvex loss functions as an alternative to the convex Hinge Loss in large margin classifiers. While Mason et al. [5] and Krause and Singer [6] applied it to Boosting, Perez-Cruz et al. [7] and Xu and Cramer [8] proposed training algorithms for SVMs with the Ramp Loss and solved the nonconvex optimization by utilizing semidefinite programming and convex relaxation techniques. On the other hand, previous works of Liu et al. [9] and Wang et al. [10] used the Concave-Convex Procedure (CCCP) [11] for nonconvex optimization, as does the work presented here. Those studies are worthwhile in the endeavor of achieving sparse models or competitive generalization performance; nevertheless, none of them are efficient in terms of computational running time and scalability for real-world data mining applications, and the improvement in classification accuracy is only marginal. Collobert et al. [12] pointed out the scalability advantages of nonconvex approaches and used CCCP for nonconvex optimization in order to achieve faster batch SVMs and Transductive SVMs. In this paper, we focus on bringing the scalability advantages of nonconvexity to the online learning setting by using an online SVM algorithm, LASVM [13]. We also highlight and discuss the connection between the nonconvex loss and traditional min-margin active learners.

- Ş. Ertekin is with the Massachusetts Institute of Technology, Sloan School of Management, Cambridge, MA 02139. E-mail: [email protected].
- L. Bottou is with the NEC Laboratories America, 4 Independence Way, Suite 200, Princeton, NJ 08540. E-mail: [email protected].
- C.L. Giles is with the College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802. E-mail: [email protected].

Manuscript received 15 Mar. 2009; revised 11 Dec. 2009; accepted 30 Mar. 2010; published online 24 May 2010. Recommended for acceptance by O. Chapelle. Digital Object Identifier no. 10.1109/TPAMI.2010.109.
Online learning offers significant computational advantages over batch learning algorithms, and the benefits of online learning become more evident when dealing with streaming or very large-scale data. Online learners incorporate the information of recently observed training data into the model via incremental model updates, without the need for retraining on the entire previously seen training data. Since these learners process the data one example at a time in the training phase, selective sampling can be applied, and evaluating the informativeness of an example before the learner processes it becomes possible. The computational benefits of avoiding periodic batch optimizations, however, require the online learner to fulfill two critical requirements: the intermediate models need to be well enough trained to capture the characteristics of the training data but, on the other hand, should not be overoptimized, since only part of the entire training data has been seen at that point in time. In this paper, we present an online SVM algorithm, LASVM-G, that maintains a balance between these conditions by leveraging the duality gap between the primal and dual functions throughout the online optimization steps. Based on the online training scheme of LASVM-G, we then present LASVM-NC, an online SVM algorithm with a nonconvex loss function, which yields a significant speed improvement in training and builds a sparser model, hence resulting in faster recognition than its convex version as well. Finally, we propose an SVM algorithm (LASVM-I) that utilizes a selective sampling heuristic by ignoring the instances that lie in the flat regions of the Ramp Loss in advance, before they are processed by the learner. Although this approach may appear to be an overaggressive training sample elimination process, we point out that these instances do not play a large role in determining the decision hyperplane according to the Ramp Loss anyway. We show that for one case of this sample elimination scenario, instances that are misclassified according to the most recent model are not taken into account in the training process. For another case, only the instances in the margin pass the barrier of elimination and are processed in training, leading to an extreme case of the small pool active learning framework [14] in online SVMs. The proposed nonconvex implementation and selective sample ignoring policy yield sparser models with fewer support vectors and faster training with less computational time and fewer kernel computations, which overall leads to a more scalable online SVM algorithm. The benefits of the proposed methods are fully realized for kernel SVMs, and their advantages become more pronounced in noisy data classification, where mislabeled samples are in abundance.

In the next section, we present a background on Support Vector Machines. Section 3 gives a brief overview of the online SVM solver, LASVM [13]. We then present the proposed online SVM algorithms, LASVM-G, LASVM-NC, and LASVM-I. The paper continues with the experimental analysis presented in Section 8, followed by concluding remarks.

2 SUPPORT VECTOR MACHINES

Support Vector Machines [4] are well known for their strong theoretical foundations, generalization performance, and ability to handle high-dimensional data. In the binary classification setting, let ((x_1, y_1), ..., (x_n, y_n)) be the training data set, where the x_i are the feature vectors representing the instances and y_i ∈ {-1, +1} are the labels of those instances. Using the training set, SVM builds an optimum hyperplane (a linear discriminant in a higher dimensional feature space) that separates the two classes by the largest margin. The SVM solution is obtained by minimizing the following primal objective function:

\min_{w,b} \; J(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i,    (1)

subject to  y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i  and  \xi_i \ge 0  for all i,

where w is the normal vector of the hyperplane, b is the offset, the y_i are the labels, Φ(·) is the mapping from input space to feature space, and the ξ_i are the slack variables that permit the nonseparable case by allowing misclassification of training instances.

In practice, the convex quadratic programming (QP) problem in (1) is solved by optimizing the dual cost function:

\max_{\alpha} \; G(\alpha) = \sum_{i=1}^{n} \alpha_i y_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j),    (2)

subject to  \sum_i \alpha_i = 0, \quad A_i \le \alpha_i \le B_i, \quad A_i = \min(0, C y_i), \quad B_i = \max(0, C y_i),    (3)

where K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ is the kernel matrix representing the dot products Φ(x_i) · Φ(x_j) in feature space. We adopt a slight deviation of the coefficients α_i from the standard representation and let them inherit the signs of the labels y_i, permitting the α_i to take on negative values. After solving the QP problem, the normal vector of the hyperplane w can be represented as a linear combination of the vectors in the training set:

w = \sum_i \alpha_i \Phi(x_i).    (4)

Once a model is trained, a soft margin SVM classifies a pattern x according to the sign of a decision function, which can be represented as a kernel expansion:

\hat{y}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b,    (5)

where the sign of ŷ(x) represents the predicted classification of x.
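As a concrete illustration of the kernel expansion in (4) and (5), the following minimal Python sketch evaluates the decision function with an RBF kernel and signed coefficients α_i. The helper name rbf_kernel and the array layout are assumptions made for this illustration, not part of the paper.

import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    # K(x, x_i) = exp(-gamma * ||x - x_i||^2)
    x, xi = np.asarray(x, dtype=float), np.asarray(xi, dtype=float)
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def decision_function(x, sv_x, sv_alpha, b=0.0, gamma=0.5):
    # yhat(x) = sum_i alpha_i K(x, x_i) + b, where alpha_i carries the sign of y_i
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(sv_alpha, sv_x)) + b

The predicted label is simply the sign of decision_function(x, ...).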
A widely popular methodology for solving the SVM QP problem is Sequential Minimal Optimization (SMO) [15]. SMO works by making successive direction searches, which involve finding a pair of instances that violate the KKT conditions and taking an optimization step along that feasible direction. The α coefficients of these instances are modified by opposite amounts, so SMO makes sure that the constraint Σ_i α_i = 0 is not violated. Practical implementations of SMO select working sets based on finding a pair of instances that violate the KKT conditions by more than a precision τ, also known as τ-violating pairs [13]:

(i, j) is a \tau-violating pair  \iff  \alpha_i < B_i, \;\; \alpha_j > A_j, \;\; g_i - g_j > \tau,

where g denotes the gradient of an instance and τ is a small positive threshold. The algorithm terminates when all KKT violations are below the desired precision.

The effect of the bias term. Note that the equality constraint on the sum of the α_i in (3) appears in the SVM formulation only when we allow the offset (bias) term b to be nonzero. While there is a single "optimal" b, different SVM implementations may choose separate ways of adjusting the offset. For instance, it is sometimes beneficial to change b in order to adjust the number of false positives and false negatives [2, page 203], or even to disallow the bias term completely (i.e., b = 0) [16] for computational simplicity. In SVM implementations that disallow the offset term, the constraint Σ_i α_i = 0 is removed from the SVM problem. The online algorithms proposed in this paper also adopt the strategy of setting b = 0. This strategy gives the algorithms the flexibility to update a single α_i at a time at each optimization step, bringing computational simplicity and efficiency to the solution of the SVM problem without adversely affecting the classification accuracy.

3 LASVM

LASVM [13] is an efficient online SVM solver that uses fewer memory resources and trains significantly faster than other state-of-the-art SVM solvers while yielding competitive misclassification rates after a single pass over the training examples. LASVM realizes these benefits due to its novel optimization steps, which have been inspired by SMO. LASVM applies the same pairwise optimization principle to online learning by defining two direction search operations. The first operation, PROCESS, attempts to insert a new example into the set of current support vectors (SVs) by searching for an existing SV that forms a τ-violating pair with maximal gradient. Once such an SV is found, LASVM performs a direction search that can potentially change the coefficient of the new example and make it a support vector. The second operation, REPROCESS, attempts to reduce the current number of SVs by finding two SVs that form a τ-violating pair with maximal gradient. A direction search can zero the coefficient of one or both SVs, removing them from the set of current support vectors of the model. In short, PROCESS adds new instances to the working set and REPROCESS removes the ones that the learner no longer benefits from. In the online iterations, LASVM alternates between running single PROCESS and REPROCESS operations. Finally, LASVM simplifies the kernel expansion by running REPROCESS to remove all τ-violating pairs from the kernel expansion, a step known as FINISHING. The optimizations performed in the FINISHING step reduce the number of support vectors in the SVM model.
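To make the working-set selection concrete, here is a small sketch (written for this text, not taken from LASVM's code) that tests the τ-violating condition and picks the extreme pair from stored gradients g and box bounds A, B; all arrays are assumed to be NumPy arrays indexed consistently.

import numpy as np

def is_tau_violating(i, j, alpha, g, A, B, tau=1e-3):
    # (i, j) violates the KKT conditions by more than tau when alpha_i can still
    # increase, alpha_j can still decrease, and their gradient gap exceeds tau.
    return (alpha[i] < B[i]) and (alpha[j] > A[j]) and (g[i] - g[j] > tau)

def extreme_pair(alpha, g, A, B):
    # Largest gradient among coordinates free to increase,
    # smallest gradient among coordinates free to decrease.
    up = [k for k in range(len(alpha)) if alpha[k] < B[k]]
    down = [k for k in range(len(alpha)) if alpha[k] > A[k]]
    if not up or not down:
        return None
    return max(up, key=lambda k: g[k]), min(down, key=lambda k: g[k])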
4 LASVM WITH GAP-BASED OPTIMIZATION (LASVM-G)

In this section, we present LASVM-G, an efficient online SVM algorithm that brings performance enhancements to LASVM. Instead of running a single REPROCESS operation after each PROCESS step, LASVM-G adjusts the number of REPROCESS operations at each online iteration by leveraging the gap between the primal and the dual functions. Further, LASVM-G replaces LASVM's one-time FINISHING optimization and cleaning stage with the optimizations performed in each REPROCESS cycle at each iteration and with periodic non-SV removal steps. These improvements enable LASVM-G to generate more reliable intermediate models than LASVM, which lead to sparser SVM solutions that can potentially have better generalization performance. For further computational efficiency, the algorithms that we present in the rest of the paper use the SVM formulation with b = 0. As we pointed out in Section 2, the bias term b acts as a hyperparameter that can be used to adjust the number of false positives and false negatives for varying settings of b, or to achieve algorithmic efficiency due to computational simplicity when b = 0. In the rest of the paper, all formulations are based on setting the bias b = 0 and thus optimizing a single α at a time.

4.1 Leveraging the Duality Gap

One question regarding the optimization scheme in the original LASVM formulation is the rate at which to perform REPROCESS operations. A straightforward approach would be to perform one REPROCESS operation after each PROCESS step, which is the default behavior of LASVM. However, this heuristic approach may result in underoptimization of the objective function in the intermediate steps if this rate is smaller than the optimal proportion. Another option would be to run REPROCESS until a small predefined threshold ε exceeds the L1 norm of the projection of the gradient (∂G(α)/∂α_i), but little work has been done to determine the correct value of the threshold ε. A geometrical argument relates this norm to the position of the support vectors relative to the margins [17]. As a consequence, one usually chooses a relatively small threshold, typically in the range 10^{-4} to 10^{-2}. Using such a small threshold to determine the rate of REPROCESS operations results in many REPROCESS steps after each PROCESS operation. This not only increases the training time and computational complexity, but can also potentially overoptimize the objective function at each iteration. Since nonconvex iterations work toward suppressing some training instances (outliers), the intermediate learned models should be well enough trained to capture the characteristics of the training data but, on the other hand, should not be overoptimized, since only part of the entire training data has been seen at that point in time. Therefore, it is necessary to employ a criterion to determine an accurate rate of REPROCESS operations after each PROCESS. We define this policy as the minimization of the gap between the primal and the dual [2].

Optimization of the duality gap. From the formulations of the primal and dual functions in (1) and (2), respectively, it can be shown that the optimal values of the primal and dual are the same [18]. Furthermore, at any nonoptimal point, the primal function is guaranteed to lie above the dual curve. In formal terms, let ŵ and α̂ be solutions of problems (1) and (2), respectively. Strong duality asserts that for any feasible α and w,

G(\alpha) \le G(\hat{\alpha}) = J(\hat{w}) \le J(w) \quad \text{with} \quad \hat{w} = \sum_i \hat{\alpha}_i \Phi(x_i).    (6)

That is, at any time during the optimization, the value of the primal J(w) is higher than the dual G(α). Using the equality w = Σ_l α_l Φ(x_l) and b = 0, we show that this holds as follows:

J(w) - G(\alpha) = \frac{1}{2}\|w\|^2 + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+ - \sum_l \alpha_l y_l + \frac{1}{2}\|w\|^2
                 = \|w\|^2 - \sum_l \alpha_l y_l + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+
                 = w \cdot \sum_l \alpha_l \Phi(x_l) - \sum_l \alpha_l y_l + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+
                 = -\sum_l y_l \alpha_l |1 - y_l (w \cdot \Phi(x_l))|_+ + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+
                 = \sum_l (C - \alpha_l y_l) \, |1 - y_l (w \cdot \Phi(x_l))|_+
                 \ge 0,

where C - α_l y_l ≥ 0 is satisfied by the constraints of the dual function in (3), and both factors of each term are nonnegative. Then, the SVM solution is obtained when one reaches α and w such that

\varepsilon > J(w) - G(\alpha) \quad \text{where} \quad w = \sum_i \alpha_i \Phi(x_i).    (7)

The strong duality in (6) then guarantees that J(w) < J(ŵ) + ε. Few solvers implement this criterion since it requires the additional calculation of the gap J(w) - G(α). In this paper, we advocate using criterion (7) with a threshold value ε that grows sublinearly with the number of examples. Letting ε grow makes the optimization coarser when the number of examples increases. As a consequence, the asymptotic complexity of the optimizations in the online setting can be smaller than that of the exact optimization.

Most SVM solvers use the dual formulation of the QP problem. However, increasing the dual does not necessarily reduce the duality gap. The dual function follows a nicely monotonically increasing pattern at each optimization step, whereas the primal shows significant up and down fluctuations. In order to keep the size of the duality gap in check, before each PROCESS operation we compute the standard deviation of the primal, which we call the Gap Target Ĝ:

\hat{G} = \sqrt{ \sum_{i=1}^{n} h_i^2 - \frac{\left(\sum_{i=1}^{n} h_i\right)^2}{l} },    (8)

where l is the number of support vectors and h_i = C y_i g_i, with C and g_i denoting the misclassification penalty and the gradient of instance i, respectively. After computing the gap target, we run a PROCESS step and check the new gap G between the primal and the dual. After an easy derivation, the gap is computed as

G = \sum_{i=1}^{n} \left( -\alpha_i g_i + \max(0, \, C y_i g_i) \right).    (9)

Note that, as we have indicated earlier, the bias term b is set to zero in all of the formulations. In the online iterations, we cycle between running REPROCESS and computing the gap G until the termination criterion G ≤ max(C, Ĝ) is reached. That is, we require the duality gap after the REPROCESS operations to be no greater than the initial gap target Ĝ. The C parameter is part of the criterion in order to prevent the algorithm from specifying a too narrow gap target and therefore making an excessive number of optimization steps. This heuristic upper bound on the gap is based on the oscillating characteristics of the primal function during the optimization steps; these oscillations are related to the successive choice of examples to REPROCESS. Viewing these oscillations as noise, the gap target enables us to stop the REPROCESS operations when the difference is within the noise. After this point, the learner continues with computing the new Gap Target and running PROCESS and REPROCESS operations on the next fresh instance from the unseen example pool.
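For concreteness, the two quantities that drive this stopping rule, the gap target of (8) and the duality gap of (9), can be computed from the stored gradients g_i = y_i - ŷ(x_i) and the signed coefficients α_i as sketched below. This is an illustration under the b = 0 convention and follows the reconstruction of (8) and (9) given above; it is not the authors' code.

import numpy as np

def gap_target(alpha, g, y, C):
    # h_i = C * y_i * g_i; Ghat measures the spread of the per-example primal terms, cf. (8)
    h = C * y * g
    l = max(1, int(np.sum(alpha != 0)))      # number of support vectors
    return np.sqrt(max(0.0, np.sum(h ** 2) - np.sum(h) ** 2 / l))

def duality_gap(alpha, g, y, C):
    # G = sum_i ( -alpha_i g_i + max(0, C y_i g_i) ), cf. (9)
    return float(np.sum(-alpha * g + np.maximum(0.0, C * y * g)))

# REPROCESS is repeated until duality_gap(...) <= max(C, gap_target(...)).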
4.2 Building Blocks

The implementation of LASVM-G maintains the following pieces of information as its key building blocks: the coefficients α_i of the current kernel expansion S, the bounds A_i and B_i for each α_i, and the partial derivatives of the instances in the expansion, given as

g_k = \frac{\partial G(\alpha)}{\partial \alpha_k} = y_k - \sum_i \alpha_i K(x_i, x_k) = y_k - \hat{y}(x_k).    (10)

The kernel expansion here maintains all of the training instances in the learner's active set, both the support vectors and the instances with α = 0.

In the online iterations of LASVM-G, the optimization is driven by two kinds of direction searches. The first operation, PROCESS, inserts an instance into the kernel expansion and initializes the α_i and gradient g_i of this instance (Step 1). After computing the step size (Step 2), it performs a direction search (Step 3). We set the offset term b of the kernel expansion to zero for computational simplicity. As discussed in the SVM section regarding the offset term, disallowing b removes the necessity of satisfying the constraint Σ_{i∈S} α_i = 0, enabling the algorithm to update a single α at a time, both in PROCESS and REPROCESS operations.

LASVM-G PROCESS(i)
1) α_i ← 0;  g_i ← y_i − Σ_{s∈S} α_s K_is
2) If g_i < 0 then
     λ = max(A_i − α_i, g_i / K_ii)
   Else
     λ = min(B_i − α_i, g_i / K_ii)
3) α_i ← α_i + λ
   g_s ← g_s − λ K_is  ∀s in kernel expansion

The second operation, REPROCESS, searches all of the instances in the kernel expansion and selects the instance with the maximal gradient (Steps 1-3). Once an instance is selected, LASVM-G computes a step size (Step 4) and performs a direction search (Step 5).

LASVM-G REPROCESS()
1) i ← arg min_{s∈S} g_s with α_s > A_s
   j ← arg max_{s∈S} g_s with α_s < B_s
2) Bail out if (i, j) is not a τ-violating pair.
3) If g_i + g_j < 0 then g ← g_i, t ← i
   Else g ← g_j, t ← j
4) If g < 0 then
     λ = max(A_t − α_t, g / K_tt)
   Else
     λ = min(B_t − α_t, g / K_tt)
5) α_t ← α_t + λ
   g_s ← g_s − λ K_ts  ∀s in kernel expansion
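The clipped one-coordinate update at the heart of both PROCESS and REPROCESS can be written compactly. The sketch below is an array-based illustration with an explicit kernel matrix K, not the authors' implementation.

import numpy as np

def direction_search(t, alpha, g, A, B, K):
    # Unconstrained Newton step along coordinate t is g_t / K_tt;
    # clip it so that alpha_t stays inside the box [A_t, B_t].
    step = g[t] / K[t, t]
    lam = max(A[t] - alpha[t], step) if g[t] < 0 else min(B[t] - alpha[t], step)
    alpha[t] += lam
    g -= lam * K[t]           # g_s <- g_s - lam * K_ts for every s in the expansion
    return lam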
Both PROCESS and REPROCESS operate on the instances in the kernel expansion, but neither of them removes any instances from it. A removal step is necessary for improved efficiency because, as the learner evolves, instances that were admitted to the kernel expansion in earlier iterations as support vectors may not serve as support vectors anymore. Keeping such instances in the kernel expansion slows down the optimization steps without serving much benefit to the learner and increases the application's requirement for computational resources. A straightforward approach to address this inefficiency would be to remove all of the instances with α_i = 0, namely, all nonsupport vectors in the kernel expansion. One concern with this approach is that once an instance is removed, it will not be seen by the learner again, and thus it will no longer be eligible to become a support vector in the later stages of training. It is important to find a balance between maintaining the efficiency of a small-sized kernel expansion and not aggressively removing instances from the kernel expansion. Therefore, the cleaning policy needs to preserve the instances that can potentially become SVs at a later stage of training while removing the instances that have the lowest possibility of becoming SVs in the future.

Our cleaning procedure periodically checks the number of non-SVs in the kernel expansion. If the number of non-SVs n is greater than the number of instances m that the algorithm permits in the expansion, CLEAN selects the extra non-SV instances with the highest gradients for removal. It follows from (10) that this heuristic predominantly selects the most misclassified instances, those farthest from the hyperplane, for deletion from the kernel expansion. This policy suppresses the influence of the outliers on the model to yield sparse and scalable SVM solutions.

CLEAN
n: number of non-SVs in the kernel expansion.
m: maximum number of allowed non-SVs.
ṽ: array of partial derivatives.
1) If n < m, return.
2) ṽ ← ṽ ∪ |g_i|,  ∀i with α_i = 0
3) Sort the gradients in ṽ in ascending order.
   g_threshold ← ṽ[m]
4) If |g_i| ≥ g_threshold, then remove x_i,  ∀i with α_i = 0
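A minimal sketch of this cleaning policy is shown below; the dictionary-based layout of the kernel expansion is a hypothetical choice made for illustration. It keeps at most m non-support vectors and discards the ones with the largest gradient magnitudes, mirroring the threshold rule above.

def clean(expansion, m):
    # expansion: list of dicts with keys 'x', 'alpha', 'g' (illustrative layout)
    non_sv = [p for p in expansion if p['alpha'] == 0]
    if len(non_sv) < m:
        return expansion
    # Keep the m non-SVs with the smallest |g|; a large |g| means the instance
    # is far on the wrong side of the hyperplane and unlikely to become an SV.
    non_sv.sort(key=lambda p: abs(p['g']))
    keep = {id(p) for p in non_sv[:m]}
    return [p for p in expansion if p['alpha'] != 0 or id(p) in keep]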
We want to point out that it is immaterial to distinguish whether an instance has not been an SV for many iterations or has just become a non-SV. In either case, these examples do not currently contribute to the classifier and are treated equally from a cleaning point of view.

Note that these algorithmic components are geared toward designing an online SVM solver with nonlinear kernels. Even though it is possible to apply these principles to linear SVMs as well, the algorithmic aspects for linear SVMs will be different. For instance, for linear SVMs it is possible to store the normal vector w of the hyperplane directly instead of manipulating the support vector expansions. In this regard, the focus of the paper is on kernel SVMs.

Fig. 1. The duality gap (J(w) − G(α)), normalized by the number of training instances.

4.3 Online Iterations in LASVM-G

LASVM-G exhibits the same learning principle as LASVM, but in a more systematic way. Both algorithms make one pass (epoch) over the training set. Empirical evidence suggests that a single epoch over the entire training set yields a classifier as good as the SVM solution.

Upon initialization, LASVM-G alternates between PROCESS and REPROCESS steps during the epoch like LASVM, but distributes LASVM's one-time FINISHING step over the optimizations performed in each REPROCESS cycle at each iteration and the periodic CLEAN operations. Another important property of LASVM-G is that it leverages the gap between the primal and the dual functions to determine the number of REPROCESS steps after each PROCESS (the -G suffix emphasizes this distinction). Reducing the duality gap too fast can cause overoptimization in early stages, before sufficient training data have been observed. Conversely, reducing the gap too slowly can result in underoptimization in the intermediate iterations. Fig. 1 shows that as the learner sees more training examples, the duality gap gets smaller.
Fig. 2. LASVM versus LASVM-G for the Adult data set. (a) Test error convergence. (b) Growth of the number of SVs.

Fig. 3. (a) The ramp loss can be decomposed into (b) a convex hinge loss and (c) a concave loss.

The major enhancements introduced to LASVM enable LASVM-G to achieve higher prediction accuracies than LASVM in the intermediate stages of training. Fig. 2 presents a comparative analysis of LASVM-G versus LASVM for the Adult data set (Table 2). Results on other data sets are provided in Section 8.

While both algorithms report the same generalization performance at the end of training, LASVM-G reaches a better classification accuracy at an earlier point in training than LASVM and is able to keep its performance relatively stable with a more reliable model over the course of training. Furthermore, LASVM-G maintains fewer support vectors in the intermediate training steps, as evidenced in Fig. 2b.

LASVM-G
1) Initialization:
   Set α ← 0
2) Online Iterations:
   Pick an example x_i
   Compute Gap Target Ĝ
   Threshold ← max(C, Ĝ)
   Run PROCESS(x_i)
   while Gap G > Threshold
     Run REPROCESS
   end
   Periodically run CLEAN
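To summarize the control flow of the LASVM-G pseudocode above, here is a compact, self-contained Python illustration of a single gap-driven online epoch. It is a sketch written for this text under simplifying assumptions (b = 0, a precomputed kernel matrix, no CLEAN step, a crude coordinate-selection rule, and the reconstructed forms of (8) and (9)); it is not the authors' implementation.

import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def lasvm_g_sketch(X, y, C=1.0, gamma=0.5, tau=1e-3):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)                    # labels in {-1, +1}
    n = len(y)
    K = np.array([[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    g = y.copy()                                      # g_k = y_k - yhat(x_k); yhat = 0 at start
    A, B = np.minimum(0.0, C * y), np.maximum(0.0, C * y)
    seen = []

    def step(t):                                      # clipped coordinate step (PROCESS/REPROCESS core)
        lam = g[t] / K[t, t]
        lam = max(A[t] - alpha[t], lam) if g[t] < 0 else min(B[t] - alpha[t], lam)
        alpha[t] += lam
        g[:] -= lam * K[t]

    def gap(idx):                                     # duality gap over seen examples, cf. (9)
        return np.sum(-alpha[idx] * g[idx] + np.maximum(0.0, C * y[idx] * g[idx]))

    for i in range(n):                                # one pass over the (shuffled) training set
        seen.append(i)
        idx = np.array(seen)
        h = C * y[idx] * g[idx]                       # gap target, cf. (8)
        l = max(1, int(np.sum(alpha[idx] != 0)))
        ghat = np.sqrt(max(0.0, np.sum(h ** 2) - np.sum(h) ** 2 / l))
        step(i)                                       # PROCESS the fresh example
        for _ in range(100):                          # REPROCESS until the gap target is met
            if gap(idx) <= max(C, ghat):
                break
            cand = [t for t in seen
                    if (g[t] > tau and alpha[t] < B[t]) or (g[t] < -tau and alpha[t] > A[t])]
            if not cand:
                break
            step(max(cand, key=lambda t: abs(g[t])))
    return alpha

A new point x is then classified by the sign of sum_i alpha[i] * rbf(x, X[i], gamma).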
In the next sections, we further introduce three SVM algorithms that are implemented based on LASVM-G, namely, LASVM-NC, LASVM-I, and FULL SVM. While these SVM algorithms share the main building blocks of LASVM-G, each algorithm exhibits a distinct learning principle. LASVM-NC uses the LASVM-G methodology in a nonconvex learner setting. LASVM-I is a learning scheme that we propose as a convex variant of LASVM-NC that employs selective sampling. FULL SVM takes advantage of neither the nonconvexity nor the efficiency of the CLEAN operation, and acts as a traditional online SVM solver in our experimental evaluation.

5 NONCONVEX ONLINE SVM SOLVER (LASVM-NC)

In this section, we present LASVM-NC, a nonconvex online SVM solver that achieves sparser SVM solutions in less time than online convex SVMs and batch SVM solvers. We first introduce the nonconvex Ramp Loss function and discuss how nonconvexity can overcome the scalability problems of convex SVM solvers. We then present the methodology to optimize the nonconvex objective function, followed by the description of the online iterations of LASVM-NC.

5.1 Ramp Loss

Traditional convex SVM solvers rely on the Hinge Loss H_1 (as shown in Fig. 3b) to solve the QP problem, which can be represented in primal form as

\min_{w,b} \; J(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{l=1}^{n} H_1(y_l f(x_l)).    (11)

In the Hinge Loss formulation H_s(z) = max(0, s − z), s indicates the hinge point, and the elbow at s = 1 indicates the point at which y_l f(x_l) = y_l(w · Φ(x_l) + b) = 1. Assume, for simplicity, that the Hinge Loss is made differentiable with a smooth approximation on a small interval z ∈ [1 − ε, 1 + ε] near the hinge point. Differentiating (11) shows that the minimum w must satisfy

w = -C \sum_{l=1}^{n} y_l \, H_1'(y_l f(x_l)) \, \Phi(x_l).    (12)

In this setting, correctly classified instances outside of the margin (z ≥ 1) cannot become SVs because H_1'(z) = 0. On the other hand, for the training examples with z < 1, H_1'(z) is −1, so they incur a penalty at the rate of misclassification of those instances. One problem with Hinge Loss-based optimization is that it imposes no limit on the influence of the outliers; that is, the misclassification penalty is unbounded. Furthermore, in Hinge Loss-based optimization, all misclassified training instances become support vectors. Consequently, the number of support vectors scales linearly with the number of training examples [19]. Specifically,

\frac{\#SV}{\#Examples} \rightarrow 2 B_{\Phi},    (13)

where B_Φ is the best possible error achievable linearly in the feature space Φ(·). Such a fast pace of growth of the number of support vectors becomes prohibitive for training SVMs on large-scale data sets.
In practice, not all misclassified training examples are necessarily informative to the learner. For instance, in noisy data sets, many instances with label noise become support vectors due to misclassification, even though they are not informative about the correct classification of new instances at recognition time. Thus, it is reasonable to limit the influence of the outliers and allow the truly informative training instances to define the model. Since the Hinge Loss admits all outliers into the SVM solution, we need to select an alternative loss function that enables selectively ignoring the instances that are misclassified according to the current model. For this purpose, we propose using the Ramp Loss

R_s(z) = H_1(z) - H_s(z),    (14)

which allows us to control the score window for z at which we are willing to convert instances into support vectors. Replacing H_1(z) with R_s(z) in (12), we see that the Ramp Loss suppresses the influence of the instances with score z < s by not converting them into support vectors. However, since the Ramp Loss is nonconvex, it prohibits us from using the widely popular optimization schemes devised for convex functions.
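As a quick numerical illustration of (14) (not from the paper), the sketch below evaluates the hinge and ramp losses; note that the ramp loss is flat both for z > 1 and for z < s, so badly misclassified points contribute a bounded penalty of at most 1 − s.

def hinge(z, s=1.0):
    # H_s(z) = max(0, s - z)
    return max(0.0, s - z)

def ramp(z, s=-1.0):
    # R_s(z) = H_1(z) - H_s(z): zero derivative for z > 1 and for z < s
    return hinge(z, 1.0) - hinge(z, s)

# With s = -1: ramp(2.0) == 0.0, ramp(0.0) == 1.0, ramp(-5.0) == 2.0 (capped at 1 - s).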
While convexity has many advantages and nice mathematical properties, the SV scaling property in (13) may be prohibitive for large-scale learning because all misclassified examples become support vectors. Since nonconvex solvers are not necessarily bound by this constraint, nonconvexity has the potential to generate sparser solutions [12]. In this work, our aim is to achieve the best of both worlds: to generate a reliable and robust SVM solution that is faster and sparser than traditional convex optimizers. This can be achieved by employing the CCCP, thus reducing the complexity of the nonconvex loss function by transforming the problem into a difference of convex parts. The Ramp Loss is amenable to CCCP optimization since it can be decomposed into a difference of convex parts (as shown in Fig. 3 and (14)). The cost function J^s(θ) for the Ramp Loss can then be represented as the sum of a convex part J^s_vex(θ) and a concave part J^s_cav(θ):

\min_{\theta} \; J^s(\theta) = \frac{1}{2}\|w\|^2 + C \sum_{l=1}^{n} R_s(y_l f(x_l))
                             = \underbrace{\frac{1}{2}\|w\|^2 + C \sum_{l=1}^{n} H_1(y_l f(x_l))}_{J^s_{vex}(\theta)} \; \underbrace{-\, C \sum_{l=1}^{n} H_s(y_l f(x_l))}_{J^s_{cav}(\theta)}.    (15)

For simplification purposes, we use the notation

\beta_l = y_l \frac{\partial J^s_{cav}(\theta)}{\partial f(x_l)} = \begin{cases} C & \text{if } y_l f(x_l) < s, \\ 0 & \text{otherwise,} \end{cases}    (16)

where C is the misclassification penalty and f(x_l) is the kernel expansion defined as in (5) with the offset term b = 0. The cost function in (15), along with the notation introduced in (16), is then reformulated as the following dual optimization problem:

\max_{\alpha} \; G(\alpha) = \sum_i y_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j K_{i,j},    (17)

subject to  A_i \le \alpha_i \le B_i, \quad A_i = \min(0, C y_i) - \beta_i y_i, \quad B_i = \max(0, C y_i) - \beta_i y_i, \quad \beta_i \text{ from (16)}.

Collobert et al. [12] use a similar formulation for the CCCP-based nonconvex optimization of batch SVMs, but there are fundamental differences between the optimization of batch SVMs and the online algorithms presented here. In particular, the batch SVM needs a convex initialization step prior to the nonconvex iterations, going over the entire training data or part of it in order to initialize the CCCP parameters and avoid getting stuck in a poor local optimum. Furthermore, batch nonconvex SVMs alternate between solving (17) and updating the β's of all training instances. On the other hand, LASVM-NC runs a few online convex iterations as the initialization stage, adjusts the β of only the new fresh instance based on the current model, and solves (17) while the online algorithm is progressing. Additionally, due to the nature of online learning, our learning scheme also permits selective sampling, which will be further discussed in the LASVM-I section.

We would also like to point out that if the β's of all the training instances are initialized to zero and left unchanged in the online iterations, the algorithm becomes the traditional Hinge Loss SVM. From another viewpoint, if s is taken to be very negative, then the β's will remain zero and the effect of the Ramp Loss will not be realized. Therefore, (17) can be viewed as a generic algorithm that can act as both a Hinge Loss SVM and a Ramp Loss SVM with CCCP-based nonconvex optimization.

5.2 Online Iterations in LASVM-NC

The online iterations in LASVM-NC are similar to those of LASVM-G in the sense that they are also based on alternating PROCESS and REPROCESS steps, with the distinction of replacing the Hinge Loss with the Ramp Loss. LASVM-NC extends the LASVM-G algorithm with the computation of the β_i, followed by updating the α bounds A and B as shown in (17). Note that while the β do not explicitly appear in the PROCESS and REPROCESS algorithm blocks, they do, in fact, affect these optimization steps through the new definition of the bounds A and B.

When a new example x_i is encountered, LASVM-NC first computes the β_i for this instance as presented in the algorithm block, where y_i is the class label, f(x_i) is the decision score for x_i, and s is the score threshold for permitting instances to become support vectors.

We would like to point out that CCCP has convergence guarantees (cf. [12]), but it is necessary to initialize the CCCP algorithm appropriately in order to avoid getting trapped in poor local optima. In batch SVMs, this corresponds to running a classical SVM on the entire set or on a subset of training instances in the first iteration to initialize CCCP, followed by the nonconvex optimization in the subsequent iterations. In the online setting, we initially allow convex optimization for the first few instances by setting their β_i = 0 (i.e., we use the Hinge Loss), and then switch to nonconvex behavior in the remainder of the online iterations.
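Under the b = 0 convention, the per-example bookkeeping of (16) and (17) reduces to a few lines. The sketch below is an illustration written for this text (the min_sv guard mirrors the SS parameter of the pseudocode that follows), not the authors' code.

def beta_and_bounds(y_i, f_xi, C, s=-1.0, num_sv=0, min_sv=50):
    # beta_i = C once the score y_i * f(x_i) falls below s and enough SVs exist, cf. (16)
    beta_i = C if (y_i * f_xi < s and num_sv > min_sv) else 0.0
    # alpha bounds from (17): the usual box shifted by -beta_i * y_i
    A_i = min(0.0, C * y_i) - beta_i * y_i
    B_i = max(0.0, C * y_i) - beta_i * y_i
    return beta_i, A_i, B_i

# Example: a positive outlier (y_i = +1 with y_i * f_xi < s) gets the shifted box [-C, 0],
# so subsequent REPROCESS steps tend to drive its alpha out of the support vector set.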
LASVM-NC
SS: minimum number of SVs to start nonconvex behavior.
1) Initialization:
   Set α ← 0, β ← 0
2) Online Iterations:
   Pick an example x_i
   Set β_i = C if y_i f(x_i) < s and #SV > SS; 0 otherwise
   Set the α_i bounds for x_i to min(0, C y_i) − β_i y_i ≤ α_i ≤ max(0, C y_i) − β_i y_i
   Compute Gap Target Ĝ
   Threshold ← max(C, Ĝ)
   Run PROCESS(x_i)
   while Gap G > Threshold
     Run REPROCESS
   end
   Periodically run CLEAN

Note from (17) that the α bounds for instances with β = 0 follow the formulation of the traditional convex setting. On the other hand, the bounds for the instances with β = C, that is, the outliers with score z < s, are assigned new bounds based on the Ramp Loss criteria. Once LASVM-NC establishes the α bounds for the new instance, it computes the Gap Target Ĝ and takes a PROCESS step. Then, it makes optimizations of the REPROCESS kind until the size of the duality gap comes down to the Gap Threshold. Finally, LASVM-NC periodically runs the CLEAN operation to keep the size of the kernel expansion under control and maintain its efficiency throughout the training stage.

6 LASVM AND IGNORING INSTANCES (LASVM-I)

TABLE 1: Analysis of the Adult Data Set at the End of the Training Stage.

This SVM algorithm employs the Ramp function in Fig. 3a as a filter applied before the PROCESS step. That is, once the learner is presented with a new instance, it first checks whether the instance is in the ramp region of the function (1 > y_i Σ_j α_j K_ij > s). The instances that are outside of the ramp region are not eligible to participate in the optimization steps and are immediately discarded without further action. The rationale is that the instances that lie on the flat regions of the Ramp function have derivative H'(z) = 0, and based on (12), these instances will not play a role in determining the decision hyperplane w.

The LASVM-I algorithm is based on the following record keeping that we conducted when running the LASVM-G experiments. In LASVM-G and LASVM-NC, we kept track of two important data points. First, we intentionally permitted every newly arriving training instance into the kernel expansion in the online iterations and recorded the position of all instances on the Ramp Loss curve just before inserting them into the expansion. Second, we kept track of the number of instances that were removed from the expansion which had been on a flat region of the Ramp Loss curve when they were admitted. The numeric breakdown is presented in Table 1. Based on the distribution of these cleaned instances, it is evident that most of the cleaned examples that were initially admitted from the (z > 1) region were removed from the kernel expansion with CLEAN at a later point in time. This is expected, since the instances with z > 1 are already correctly classified by the current model with a certain confidence, and hence do not become SVs.

On the other hand, Table 1 shows that almost all of the instances inserted from the left flat region (misclassified examples with z < s) became SVs in LASVM-G, and therefore were never removed from the kernel expansion. In contrast, almost all of the training instances that were admitted from the left flat region in LASVM-NC were removed from the kernel expansion, leading to a much larger reduction of the number of support vectors overall.

Intuitively, the examples that are misclassified by a wide margin should not become support vectors. Ideally, the support vectors should be the instances that lie within the margin of the hyperplane. As studies on Active Learning show [14], [20], the most informative instances for determining the hyperplane lie within the margin. Thus, LASVM-I ignores the instances that are misclassified by a margin (z < s) up front and prevents them from becoming support vectors.

LASVM-I
1) Initialization:
   Set α ← 0
2) Online Iterations:
   Pick an example x_i
   Compute z = y_i Σ_j α_j K(x_i, x_j)
   if (z > 1 or z < s)
     Skip x_i and bail out
   else
     Compute Gap Target Ĝ
     Threshold ← max(C, Ĝ)
     Run PROCESS(x_i)
     while Gap G > Threshold
       Run REPROCESS
     end
     Periodically run CLEAN

Note that LASVM-I cannot be regarded as a nonconvex SVM solver since the instances with β = C (which corresponds to z < s) are already filtered out up front, before the optimization steps. Consequently, all of the instances visible to the optimization steps have β = 0, which converts the objective function in (17) into the convex Hinge Loss from an optimization standpoint. Thus, by combining these two filtering criteria (z > 1 and z < s), LASVM-I trades nonconvexity for a filtering Ramp function that determines whether to ignore an instance or proceed with the optimization steps. Our goal in designing LASVM-I is that, based on this initial filtering step, it is possible to achieve further speedups in training time while maintaining competitive generalization performance. The experimental results validate this claim.
7 LASVM-G WITHOUT CLEAN (FULL SVM)

This algorithm serves as a baseline case for comparisons in our experimental evaluation. The learning principle of FULL SVM is based on alternating between LASVM-G's PROCESS and REPROCESS steps throughout the training iterations.

FULL SVM
1) Initialization:
   Set α ← 0
2) Online Iterations:
   Pick an example x_i
   Compute Gap Target Ĝ
   Threshold ← max(C, Ĝ)
   Run PROCESS(x_i)
   while Gap G > Threshold
     Run REPROCESS
   end

When a new example is encountered, FULL SVM computes the Gap Target (given in (8)) and takes a PROCESS step. Then, it makes optimizations of the REPROCESS kind until the size of the duality gap comes down to the Gap Threshold. In this learning scheme, FULL SVM admits every new training example into the kernel expansion without any removal step (i.e., no CLEAN operation). This behavior mimics that of traditional SVM solvers by ensuring that the learner has constant access to all training instances that it has seen during training and can make any of them a support vector at any time if necessary. The SMO-like optimization in the online iterations of FULL SVM enables it to converge to the batch SVM solution.

Each PROCESS operation introduces a new instance to the learner, updates its α coefficient, and optimizes the objective function. This is followed by potentially multiple REPROCESS steps, which exploit τ-violating pairs in the kernel expansion. Within each pair, REPROCESS selects the instance with maximal gradient and can potentially zero the α coefficient of the selected instance. After sufficient iterations, as soon as a τ-approximate solution is reached, the algorithm stops updating the α coefficients. For full convergence to the batch SVM solution, running FULL SVM usually consists of performing a number of epochs, where each epoch performs n online iterations by sequentially visiting the randomly shuffled training examples. Empirical evidence suggests that a single epoch yields a classifier almost as good as the SVM solution. For the theoretical explanation of the convergence properties of the online iterations, refer to [13].

The freedom to maintain and access the whole pool of seen examples during training in FULL SVM does come with a price, though. The kernel expansion needs to grow constantly as new training instances are introduced to the learner, and it needs to hold all non-SVs in addition to the SVs of the current model. Furthermore, the learner still needs to include those non-SVs in the optimization steps, and this additional processing becomes a significant drag on the training time of the learner.

8 EXPERIMENTS

The experimental evaluation compares the outlined SVM algorithms on various data sets in terms of both their classification performance and their algorithmic efficiency, which determines scalability. We also compare these algorithms against the reported metrics of LASVM and LIBSVM on the same data sets. In the experiments presented below, we run a single epoch over the training examples, all experiments use RBF kernels, and the results are averaged over 10 runs for each data set. Table 2 presents the characteristics of the data sets and the SVM parameters, which were determined via 10-fold cross validation.

TABLE 2: Data Sets Used in the Experimental Evaluations and the SVM Parameters C and γ for the RBF Kernel.

Adult is a hard-to-classify census data set used to predict whether the income of a person is greater than 50K based on several census parameters, such as age, education, and marital status. Mnist, USPS, and USPS-N are optical character recognition data sets. USPS-N contains artificial noise, which we generated by changing the labels of 10 percent of the training examples in the USPS data set. Reuters-21578 is a popular text mining benchmark data set of 21,578 news stories that appeared on the Reuters newswire in 1987; we test the algorithms with the Money-fx category of the Reuters-21578 data set. The Banana data set is a synthetic two-dimensional data set with 4,000 patterns consisting of two banana-shaped clusters that contain around 10 percent noise.

8.1 Generalization Performances

One of the metrics that we use in the evaluation of generalization performance is the Precision-Recall Breakeven Point (PRBEP) (see, e.g., [21]). Given the definition of precision as the number of correct positive class predictions among all positive class predictions, and recall as the number of correct positive class predictions among all positive class instances, PRBEP is a widely used metric that measures the accuracy of the positive class at the point where precision equals recall. In particular, PRBEP measures the trade-off between high precision and high recall. Fig. 4 shows the growth of the PRBEP curves sampled over the course of training for the data sets. Compared to the baseline case FULL SVM, all algorithms are able to maintain competitive generalization performance at the end of training on all examples and show a more homogeneous growth compared to LASVM, especially for the Adult and Banana data sets. Furthermore, as shown in Table 3, LASVM-NC and LASVM-I actually yield higher classification accuracy for USPS-N compared to FULL SVM. This can be attributed to their ability to filter bad observations (i.e., noise) from the training data. In noisy data sets, most of the noisy instances are misclassified and become support vectors in FULL SVM, LASVM-G, and LASVM due to the Hinge Loss. This increase in the number of support vectors (see Fig. 6) causes the SVM to learn complex classification boundaries that can overfit the noise, which can adversely affect generalization performance. LASVM-NC and LASVM-I are less sensitive to noise, and they learn simpler models that are able to yield better generalization performance under noisy conditions.
For the evaluation of classification performance, we report three other metrics, namely, prediction accuracy (in Table 3) and AUC and g-means (in Table 4). Prediction accuracy measures a model's ability to correctly predict the class labels of unseen observations. The area under the ROC curve (AUC) is a numerical measure of a model's discrimination performance and shows how correctly the model separates and ranks the positive and negative observations. The Receiver Operating Characteristic (ROC) curve is the plot of sensitivity versus 1 − specificity, and the AUC is the area below the ROC curve. g-means is the geometric mean of sensitivity and specificity, where sensitivity and specificity represent the accuracy on positive and negative instances, respectively. All algorithms presented in this paper yield results for these performance metrics that are as good as those of FULL SVM, and classification accuracy comparable to LASVM and LIBSVM. Furthermore, LASVM-NC yields the highest g-means for the Adult, Reuters, and USPS-N data sets compared to the rest of the algorithms.

Fig. 4. PRBEP versus number of training instances. We used s = −1 for the ramp loss for LASVM-NC.

TABLE 3: Comparison of All Four SVM Algorithms with LASVM and LIBSVM for All Data Sets.

TABLE 4: Experimental Results That Assess the Generalization Performance and Computational Efficiency.

We study the impact of the s parameter on the generalization performance of LASVM-NC and LASVM-I and present our findings in Fig. 5. Since FULL SVM, LASVM-G, LASVM, and LIBSVM do not use the Ramp Loss, they are represented by the testing errors and total numbers of support vectors achieved at the end of training. These plots show that the LASVM-NC and LASVM-I algorithms achieve competitive generalization performance with far fewer support vectors, especially for the Adult, Banana, and USPS-N data sets. In all data sets, increasing the value of s into positive territory has the effect of preventing correctly classified instances that are within the margin from becoming SVs. This becomes detrimental to the generalization performance of LASVM-NC and LASVM-I, since those instances are among the most informative instances for the learner. Likewise, moving s further down into negative territory diminishes the effect of the Ramp Loss on the outliers. If s → −∞, then R_s → H_1; in other words, if s takes large negative values, the Ramp Loss will not help to remove the outliers from the SVM kernel expansion.
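The g-means metric used above is easy to reproduce; the sketch below (illustrative only, assuming both classes are present) computes sensitivity, specificity, and their geometric mean from true and predicted labels.

import numpy as np

def g_means(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)     # accuracy on positive instances
    specificity = np.mean(y_pred[y_true == -1] == -1)   # accuracy on negative instances
    return float(np.sqrt(sensitivity * specificity))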
Fig. 5. Testing error versus number of support vectors for various settings of the s parameter of the ramp loss.

It is important to note that at the point s = −1, the Ramp Loss-based algorithms LASVM-NC and LASVM-I behave as an Active Learning [20], [22] framework. Active learning is widely known as a querying technique for selecting the most informative instances from a pool of unlabeled instances in order to acquire their labels. Even in cases where the labels of all training instances are available beforehand, active learning can still be leveraged to select the most informative instances from the training set. In SVMs, the informativeness of an instance is synonymous with its distance to the hyperplane, and the instances closer to the hyperplane are the most informative. For this reason, traditional min-margin active learners focus on the instances that are within the margin of the hyperplane and pick an example from this region to process next by searching the entire training set. However, such an exhaustive search is impossible in the online setup and computationally expensive in the offline setup. Ertekin et al. [14] suggest that querying for the most informative example does not need to be done over the entire training set; instead, querying from randomly picked small pools can work equally well in a more efficient way. Small pool active learning first samples M random training examples from the entire training set and selects the best one among those M examples. With probability 1 − η^M, where 0 ≤ η ≤ 1, the value of the criterion for this selected example exceeds the η-quantile of the criterion over all training examples, regardless of the size of the training set. In practice, this means that the best example among 59 random training examples has a 95 percent chance of belonging to the best 5 percent of examples in the training set.

In the extreme case of small pool active learning, setting the size of the pool to 1 corresponds to investigating whether that single instance is within the margin or not. In this regard, setting s = −1 for the Ramp Loss in LASVM-NC and LASVM-I constrains the learner's focus to only the instances within the margin. Empirical evidence suggests that the LASVM-NC and LASVM-I algorithms exhibit the benefits of active learning at the s = −1 point, which yields the best results in most of our experiments. However, the exact setting of the s hyperparameter should be determined by the requirements of the classification task and the characteristics of the data set.
Fig. 6. Number of SVs versus number of training instances.

Fig. 7. Number of kernel computations versus number of training instances.
LASVM-NC and LASVM-I end up with a smaller number of support vectors than FULL SVM, LASVM-G, and LIBSVM. Furthermore, compared to LASVM-I, LASVM-NC builds noticeably sparser models with fewer support vectors on the noisy Adult, Banana, and USPS-N data sets. LASVM-I, on the other hand, makes fewer kernel calculations in the training stage than LASVM-NC for those data sets. This is a key distinction between the two algorithms: the computational efficiency of LASVM-NC is the result of its ability to build sparse models. Conversely, LASVM-I creates comparably more support vectors than LASVM-NC, but makes fewer kernel calculations due to early filtering.
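To make the role of the flat regions concrete, the sketch below illustrates the Ramp Loss and the resulting LASVM-I style admission test. This is our own illustration, not the authors' code; it assumes the parametrization R_s(z) = H_1(z) - H_s(z) with z = y f(x).

def hinge(z, t=1.0):
    # H_t(z) = max(0, t - z)
    return max(0.0, t - z)

def ramp_loss(z, s=-1.0):
    # R_s(z) = H_1(z) - H_s(z): linear for s < z < 1, flat elsewhere
    return hinge(z, 1.0) - hinge(z, s)

def admit(y, f_x, s=-1.0):
    # LASVM-I style filter: only examples on the non-flat part of the Ramp Loss
    # (s < y*f(x) < 1) are admitted; examples on the flat regions, that is,
    # well-classified points and suspected outliers, are discarded so that no
    # further kernel products are spent on them.
    z = y * f_x
    return s < z < 1.0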
The overall training times for all data sets and all algorithms are presented both in Fig. 8 and Table 3. LASVM-G, LASVM-NC, and LASVM-I are all significantly more efficient than FULL SVM. LASVM-NC and LASVM-I also yield faster training than LIBSVM and LASVM. Note that the LIBSVM algorithm here uses second-order active set selection. Although second-order selection can also be applied to LASVM-like algorithms to achieve improved speed and accuracy [23], we did not implement it in the algorithms discussed in this paper. Nevertheless, in Fig. 8, the fastest training times belong to LASVM-I, with LASVM-NC a close second. The sparsest solutions are achieved by LASVM-NC, and this time LASVM-I comes a close second. These two algorithms therefore represent a compromise between training time on the one hand and sparsity and recognition time on the other, and the appropriate algorithm should be chosen based on the requirements of the classification task.

Fig. 8. Training times of the algorithms for all data sets after one pass over the training instances.

9 CONCLUSION

In traditional convex SVM optimization, the number of support vectors scales linearly with the number of training examples, which unreasonably increases the training time and the computational resource requirements. This fact has hindered the widespread adoption of SVMs for classification tasks on large-scale data sets. In this work, we have studied the ways in which the computational efficiency of an online SVM solver can be improved without sacrificing generalization performance. This paper is concerned with suppressing the influence of outliers, which becomes particularly problematic in noisy data classification. For this purpose, we first present a systematic optimization approach for an online learning framework that generates more reliable and trustworthy learning models in its intermediate iterations (LASVM-G). We then propose two online algorithms, LASVM-NC and LASVM-I, which leverage the Ramp function to prevent outliers from becoming support vectors. LASVM-NC replaces the traditional Hinge Loss with the Ramp Loss and brings the benefits of nonconvex optimization using CCCP to an online learning setting. LASVM-I uses the Ramp function as a filtering mechanism to discard outliers during the online iterations. In online learning settings, incoming training examples can be discarded accurately enough only when the intermediate models are as reliable as possible. In LASVM-G, the increased stability of the intermediate models is achieved by the duality gap policy. This increased stability significantly reduces the number of wrongly discarded instances in the online iterations of LASVM-NC and LASVM-I. Empirical evidence suggests that the proposed algorithms provide efficient and scalable learning with noisy data sets in two respects: 1) computational: there is a significant decrease in the number of computations and in the running time during training and recognition, and 2) statistical: there is a significant decrease in the number of examples required for good generalization. Our findings also reveal that discarding outliers by leveraging the Ramp function is closely related to the working principles of margin-based Active Learning.

ACKNOWLEDGMENTS

This work was done while Şeyda Ertekin was with the Department of Computer Science and Engineering at the Pennsylvania State University and NEC Laboratories America.

REFERENCES

[1] O. Bousquet and A. Elisseeff, “Stability and Generalization,” J. Machine Learning Research, vol. 2, pp. 499-526, 2002.
[2] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[3] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[4] C. Cortes and V. Vapnik, “Support Vector Networks,” Machine Learning, vol. 20, pp. 273-297, 1995.
[5] L. Mason, P.L. Bartlett, and J. Baxter, “Improved Generalization through Explicit Optimization of Margins,” Machine Learning, vol. 38, pp. 243-255, 2000.
[6] N. Krause and Y. Singer, “Leveraging the Margin More Carefully,” Proc. Int’l Conf. Machine Learning, p. 63, 2004.
[7] F. Perez-Cruz, A. Navia-Vazquez, and A.R. Figueiras-Vidal, “Empirical Risk Minimization for Support Vector Classifiers,” IEEE Trans. Neural Networks, vol. 14, no. 2, pp. 296-303, Mar. 2003.
[8] L. Xu, K. Crammer, and D. Schuurmans, “Robust Support Vector Machine Training via Convex Outlier Ablation,” Proc. 21st Nat’l Conf. Artificial Intelligence, 2006.
[9] Y. Liu, X. Shen, and H. Doss, “Multicategory Learning and Support Vector Machine: Computational Tools,” J. Computational and Graphical Statistics, vol. 14, pp. 219-236, 2005.
[10] L. Wang, H. Jia, and J. Li, “Training Robust Support Vector Machine with Smooth Ramp Loss in the Primal Space,” Neurocomputing, vol. 71, pp. 3020-3025, 2008.
[11] A.L. Yuille and A. Rangarajan, “The Concave-Convex Procedure (CCCP),” Advances in Neural Information Processing Systems, MIT Press, 2002.
[12] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading Convexity for Scalability,” Proc. Int’l Conf. Machine Learning, pp. 201-208, 2006.
[13] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast Kernel Classifiers with Online and Active Learning,” J. Machine Learning Research, vol. 6, pp. 1579-1619, 2005.
[14] S. Ertekin, J. Huang, L. Bottou, and L. Giles, “Learning on the Border: Active Learning in Imbalanced Data Classification,” Proc. ACM Conf. Information and Knowledge Management, pp. 127-136, 2007.
[15] J.C. Platt, “Fast Training of Support Vector Machines Using Sequential Minimal Optimization,” Advances in Kernel Methods: Support Vector Learning, pp. 185-208, MIT Press, 1999.
[16] S. Shalev-Shwartz and N. Srebro, “SVM Optimization: Inverse Dependence on Training Set Size,” Proc. Int’l Conf. Machine Learning, pp. 928-935, 2008.
[17] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy, “Improvements to Platt’s SMO Algorithm for SVM Classifier Design,” Neural Computation, vol. 13, no. 3, pp. 637-649, 2001.
[18] O. Chapelle, “Training a Support Vector Machine in the Primal,” Neural Computation, vol. 19, no. 5, pp. 1155-1178, 2007.
[19] I. Steinwart, “Sparseness of Support Vector Machines,” J. Machine Learning Research, vol. 4, pp. 1071-1105, 2003.
[20] G. Schohn and D. Cohn, “Less Is More: Active Learning with Support Vector Machines,” Proc. Int’l Conf. Machine Learning, pp. 839-846, 2000.
[21] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Technical Report 23, Univ. Dortmund, 1997.
[22] S. Tong and D. Koller, “Support Vector Machine Active Learning with Applications to Text Classification,” J. Machine Learning Research, vol. 2, pp. 45-66, 2001.
[23] T. Glasmachers and C. Igel, “Second-Order SMO Improves SVM Online and Active Learning,” Neural Computation, vol. 20, no. 2, pp. 374-382, 2008.
Şeyda Ertekin received the BSc degree in electrical and electronics engineering from Orta Dogu Teknik Universitesi (ODTU) in Ankara, Turkey, the MSc degree in computer science from the University of Louisiana at Lafayette, and the PhD degree in computer science and engineering from Pennsylvania State University–University Park in 2009. She is currently a postdoctoral research associate at the Massachusetts Institute of Technology (MIT). Her research interests focus on the design, analysis, and implementation of machine learning algorithms for large-scale data sets to solve real-world problems in the fields of data mining, information retrieval, and knowledge discovery. She is mainly known for her research that spans online and active learning for efficient and scalable machine learning algorithms. At Penn State, she was a member of the technical team of CiteSeerX. Throughout her PhD studies, she also worked as a researcher in the Machine Learning Group at NEC Research Laboratories in Princeton, New Jersey. Prior to that, she worked at Aselsan, Inc., in Ankara on digital wireless telecommunication systems. She has also worked as a consultant to several companies in the US on the design of data mining infrastructures and algorithms. She is the recipient of numerous awards from the ACM, the US National Science Foundation (NSF), and Google.

Léon Bottou received the diplôme de l’Ecole Polytechnique, Paris, in 1987, the Magistère en mathématiques fondamentales et appliquées et informatiques from the Ecole Normale Supérieure, Paris, in 1988, and the PhD degree in computer science from the Université de Paris-Sud in 1991. He joined AT&T Bell Labs from 1991 to 1992 and AT&T Labs from 1995 to 2002. Between 1992 and 1995, he was the chairman of Neuristique in Paris, a small company pioneering machine learning for data mining applications. He joined NEC Labs America in Princeton, New Jersey in 2002. His primary research interest is machine learning. His contributions to this field address theory, algorithms, and large-scale applications. His secondary research interest is data compression and coding. His best known contribution in this field is the DjVu document compression technology (http://www.djvu.org). He is serving on the boards of the Journal of Machine Learning Research and the IEEE Transactions on Pattern Analysis and Machine Intelligence. He also serves on the scientific advisory board of Kxen, Inc. (http://www.kxen.com). He won the New York Academy of Sciences Blavatnik Award for Young Scientists in 2007.

C. Lee Giles is the David Reese professor of information sciences and technology at the Pennsylvania State University, University Park. He has appointments in the Departments of Computer Science and Engineering and Supply Chain and Information Systems. Previously, he was at NEC Research Institute, Princeton, New Jersey, and the Air Force Office of Scientific Research, Washington, District of Columbia. His research interests are in intelligent cyberinfrastructure, Web tools, search engines and information retrieval, digital libraries, Web services, knowledge and information extraction, data mining, name matching and disambiguation, and social networks. He was a cocreator of the popular search engines and tools: SeerSuite, CiteSeer (now CiteSeerX) for computer science, and ChemXSeer, for chemistry. He also was a cocreator of an early metasearch engine, Inquirus, the first search engine for robots.txt, BotSeer, and the first for academic business, SmealSearch. He is a fellow of the ACM, IEEE, and INNS.
