Nonconvex Online Support Vector Machines
Ertekin, Bottou, and Giles
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 2, February 2011
Abstract—In this paper, we propose a nonconvex online Support Vector Machine (SVM) algorithm (LASVM-NC) based on the Ramp
Loss, which has the strong ability of suppressing the influence of outliers. Then, again in the online learning setting, we propose an
outlier filtering mechanism (LASVM-I) based on approximating nonconvex behavior in convex optimization. These two algorithms are
built upon another novel SVM algorithm (LASVM-G) that is capable of generating accurate intermediate models in its iterative steps by
leveraging the duality gap. We present experimental results that demonstrate the merit of our frameworks in achieving significant
robustness to outliers in noisy data classification where mislabeled training instances are in abundance. Experimental evaluation
shows that the proposed approaches yield a more scalable online SVM algorithm with sparser models and less computational running
time, both in the training and recognition phases, without sacrificing generalization performance. We also point out the relation between
nonconvex optimization and min-margin active learning.
Index Terms—Online learning, nonconvex optimization, support vector machines, active learning.
1 INTRODUCTION
Online learning offers significant computational advantages over batch learning algorithms, and the benefits of online learning become more evident when dealing with streaming or very large-scale data. Online learners incorporate the information of recently observed training data into the model via incremental model updates and without the need for retraining on the entire previously seen training data. Since these learners process the data one at a time in the training phase, selective sampling can be applied and evaluation of the informativeness of the data prior to processing by the learner becomes possible. The computational benefits of avoiding periodic batch optimizations, however, require that the online learner fulfill two critical requirements: the intermediate models need to be well enough trained in order to capture the characteristics of the training data, but, on the other hand, should not be overoptimized since only part of the entire training data is seen at that point in time. In this paper, we present an online SVM algorithm, LASVM-G, that maintains a balance between these conditions by leveraging the duality gap between the primal and dual functions throughout the online optimization steps. Based on the online training scheme of LASVM-G, we then present LASVM-NC, an online SVM algorithm with a nonconvex loss function, which yields a significant speed improvement in training and builds a sparser model, hence resulting in faster recognition than its convex version as well. Finally, we propose an SVM algorithm (LASVM-I) that utilizes a selective sampling heuristic by ignoring the instances that lie in the flat region of the Ramp Loss in advance, before they are processed by the learner. Although this approach may appear like an overaggressive training sample elimination process, we point out that these instances do not play a large role in determining the decision hyperplane according to the Ramp Loss anyway. We show that in one particular sample elimination scenario, instances misclassified according to the most recent model are not taken into account in the training process. In another case, only the instances in the margin pass the barrier of elimination and are processed in the training, hence leading to an extreme case of the small pool active learning framework [14] in online SVMs. The proposed nonconvex implementation and selective sample ignoring policy yield sparser models with fewer support vectors and faster training with less computational time and fewer kernel computations, which overall leads to a more scalable online SVM algorithm. The benefits of the proposed methods are fully realized for kernel SVMs, and their advantages become more pronounced in noisy data classification, where mislabeled samples are in abundance.

In the next section, we present background on Support Vector Machines. Section 3 gives a brief overview of the online SVM solver, LASVM [13]. We then present the proposed online SVM algorithms, LASVM-G, LASVM-NC, and LASVM-I. The paper continues with the experimental analysis presented in Section 8, followed by concluding remarks.

2 SUPPORT VECTOR MACHINES

Support Vector Machines [4] are well known for their strong theoretical foundations, generalization performance, and ability to handle high-dimensional data. In the binary classification setting, let (x_1, y_1), ..., (x_n, y_n) be the training data set, where x_i are the feature vectors representing the instances and y_i ∈ {-1, +1} are the labels of those instances. Using the training set, SVM builds an optimum hyperplane—a linear discriminant in a higher dimensional feature space—that separates the two classes by the largest margin. The SVM solution is obtained by minimizing the following primal objective function:

    \min_{w,b} J(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i,    (1)

    with \quad y_i (w \cdot \Phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i,

where w is the normal vector of the hyperplane, b is the offset, y_i are the labels, Φ(·) is the mapping from input space to feature space, and ξ_i are the slack variables that permit the nonseparable case by allowing misclassification of training instances.

In practice, the convex quadratic programming (QP) problem in (1) is solved by optimizing the dual cost function:

    \max_{\alpha} G(\alpha) = \sum_{i=1}^{N} \alpha_i y_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j),    (2)

    subject to \quad \begin{cases} \sum_i \alpha_i = 0, \\ A_i \leq \alpha_i \leq B_i, \\ A_i = \min(0, C y_i), \\ B_i = \max(0, C y_i), \end{cases}    (3)

where K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ is the kernel matrix representing the dot products Φ(x_i) · Φ(x_j) in feature space. We adopt a slight deviation of the coefficients α_i from the standard representation and let them inherit the signs of the labels y_i, permitting the α_i to take on negative values. After solving the QP problem, the norm of the hyperplane w can be represented as a linear combination of the vectors in the training set

    w = \sum_{i} \alpha_i \Phi(x_i).    (4)

Once a model is trained, a soft margin SVM classifies a pattern x according to the sign of a decision function, which can be represented as a kernel expansion

    \hat{y}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b,    (5)

where the sign of ŷ(x) represents the predicted classification of x.
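To make the kernel expansion in (5) concrete, the following minimal Python sketch evaluates the decision function of a trained model. The RBF kernel and the variable names (support_vectors, alphas) are illustrative assumptions, not part of the paper.

    import numpy as np

    def rbf_kernel(x1, x2, gamma=0.1):
        # K(x1, x2) = exp(-gamma * ||x1 - x2||^2); the kernel choice is an assumption.
        return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

    def decision_function(x, support_vectors, alphas, b=0.0, kernel=rbf_kernel):
        # Kernel expansion of (5): y_hat(x) = sum_i alpha_i K(x, x_i) + b.
        # The alphas already carry the sign of their labels, per the paper's convention.
        return sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors)) + b

    def predict(x, support_vectors, alphas, b=0.0):
        # The predicted class is the sign of the decision function.
        return 1 if decision_function(x, support_vectors, alphas, b) >= 0 else -1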
2 SUPPORT VECTOR MACHINES modified by opposite amounts, so SMO makes sure that
P
Support Vector Machines [4] are well known for their strong the constraint i i ¼ 0 is not violated. Practical imple-
theoretical foundations, generalization performance, and mentations of SMO select working sets based on finding a
ability to handle high-dimensional data. In the binary pair of instances that violate the KKT conditions more than
classification setting, let ððx1 ; y1 Þ ðxn ; yn ÞÞ be the training -precision, also known as -violating pairs [13]:
    (i, j) \text{ is a } \tau\text{-violating pair} \iff \begin{cases} \alpha_i < B_i, \\ \alpha_j > A_j, \\ g_i - g_j > \tau, \end{cases}

where g denotes the gradient of an instance and τ is a small positive threshold. The algorithm terminates when all KKT violations are below the desired precision.
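As an illustration of this working-set selection rule, the sketch below picks the most τ-violating pair from the current coefficients, gradients, and bounds. It is a simplified reading of the condition above, not the authors' implementation.

    import numpy as np

    def most_violating_pair(alpha, grad, A, B, tau=1e-3):
        # Return indices (i, j) of the maximally tau-violating pair, or None.
        # A pair (i, j) is tau-violating when alpha[i] < B[i], alpha[j] > A[j],
        # and grad[i] - grad[j] > tau (the condition quoted above).
        up = np.where(alpha < B)[0]      # candidates whose coefficient may increase
        down = np.where(alpha > A)[0]    # candidates whose coefficient may decrease
        if len(up) == 0 or len(down) == 0:
            return None
        i = up[np.argmax(grad[up])]      # largest gradient among the "up" set
        j = down[np.argmin(grad[down])]  # smallest gradient among the "down" set
        return (i, j) if grad[i] - grad[j] > tau else None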
The effect of the bias term. Note that the equality constraint on the sum of α_i in (3) appears in the SVM formulation only when we allow the offset (bias) term b to be nonzero. While there is a single "optimal" b, different SVM implementations may choose separate ways of adjusting the offset. For instance, it is sometimes beneficial to change b in order to adjust the number of false positives and false negatives [2, page 203], or even to disallow the bias term completely (i.e., b = 0) [16] for computational simplicity. In SVM implementations that disallow the offset term, the constraint Σ_i α_i = 0 is removed from the SVM problem. The online algorithms proposed in this paper also adopt the strategy of setting b = 0. This strategy gives the algorithms the flexibility to update a single α_i at a time at each optimization step, bringing computational simplicity and efficiency to the solution of the SVM problem without adversely affecting the classification accuracy.

3 LASVM

LASVM [13] is an efficient online SVM solver that uses less memory resources and trains significantly faster than other state-of-the-art SVM solvers, while yielding competitive misclassification rates after a single pass over the training examples. LASVM realizes these benefits due to its novel optimization steps, which have been inspired by SMO. LASVM applies the same pairwise optimization principle to online learning by defining two direction search operations. The first operation, PROCESS, attempts to insert a new example into the set of current support vectors (SVs) by searching for an existing SV that forms a τ-violating pair with maximal gradient. Once such an SV is found, LASVM performs a direction search that can potentially change the coefficient of the new example and make it a support vector. The second operation, REPROCESS, attempts to reduce the current number of SVs by finding two SVs that form a τ-violating pair with maximal gradient. A direction search can zero the coefficient of one or both SVs, removing them from the set of current support vectors of the model. In short, PROCESS adds new instances to the working set and REPROCESS removes the ones that the learner does not benefit from anymore. In the online iterations, LASVM alternates between running single PROCESS and REPROCESS operations. Finally, LASVM simplifies the kernel expansion by running REPROCESS to remove all τ-violating pairs from the kernel expansion, a step known as FINISHING. The optimizations performed in the FINISHING step reduce the number of support vectors in the SVM model.

4 LASVM WITH GAP-BASED OPTIMIZATION—LASVM-G

In this section, we present LASVM-G—an efficient online SVM algorithm that brings performance enhancements to LASVM. Instead of running a single REPROCESS operation after each PROCESS step, LASVM-G adjusts the number of REPROCESS operations at each online iteration by leveraging the gap between the primal and the dual functions. Further, LASVM-G replaces LASVM's one-time FINISHING optimization and cleaning stage with the optimizations performed in each REPROCESS cycle at each iteration and with periodic non-SV removal steps. These improvements enable LASVM-G to generate more reliable intermediate models than LASVM, which lead to sparser SVM solutions that can potentially have better generalization performance. For further computational efficiency, the algorithms that we present in the rest of the paper use the SVM formulation with b = 0. As we pointed out in Section 2, the bias term b acts as a hyperparameter that can be used to adjust the number of false positives and false negatives for varying settings of b, or to achieve algorithmic efficiency due to computational simplicity when b = 0. In the rest of the paper, all formulations are based on setting the bias b = 0 and thus optimizing a single α at a time.

4.1 Leveraging the Duality Gap

One question regarding the optimization scheme in the original LASVM formulation is the rate at which to perform REPROCESS operations. A straightforward approach would be to perform one REPROCESS operation after each PROCESS step, which is the default behavior of LASVM. However, this heuristic approach may result in underoptimization of the objective function in the intermediate steps if this rate is smaller than the optimal proportion. Another option would be to run REPROCESS until a small predefined threshold ε exceeds the L1 norm of the projection of the gradient (∂G(α)/∂α_i), but little work has been done to determine the correct value of the threshold ε. A geometrical argument relates this norm to the position of the support vectors relative to the margins [17]. As a consequence, one usually chooses a relatively small threshold, typically in the range 10^-4 to 10^-2. Using such a small threshold to determine the rate of REPROCESS operations results in many REPROCESS steps after each PROCESS operation. This will not only increase the training time and computational complexity, but can also potentially overoptimize the objective function at each iteration. Since nonconvex iterations work toward suppressing some training instances (outliers), the intermediate learned models should be well enough trained in order to capture the characteristics of the training data, but, on the other hand, should not be overoptimized since only part of the entire training data is seen at that point in time. Therefore, it is necessary to employ a criterion to determine an accurate rate of REPROCESS operations after each PROCESS. We define this policy as the minimization of the gap between the primal and the dual [2].
Optimization of the duality gap. From the formulations of the primal and dual functions in (1) and (2), respectively, it can be shown that the optimal values of the primal and the dual are the same [18]. Furthermore, at any nonoptimal point, the primal function is guaranteed to lie above the dual curve. In formal terms, let ŵ and α̂ be solutions of problems (1) and (2), respectively. Strong duality asserts that, for any feasible w and α,

    G(\alpha) \leq G(\hat{\alpha}) = J(\hat{w}) \leq J(w), \quad \text{with } \hat{w} = \sum_i \hat{\alpha}_i \Phi(x_i).    (6)

That is, at any time during the optimization, the value of the primal J(α) is higher than the dual G(α). Using the equality w = Σ_l α_l Φ(x_l) and b = 0, we show that this holds as follows:

    J(\alpha) - G(\alpha) = \frac{1}{2}\|w\|^2 + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+ - \sum_l \alpha_l y_l + \frac{1}{2}\|w\|^2
                          = \|w\|^2 - \sum_l \alpha_l y_l + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+
                          = w \cdot \sum_l \alpha_l \Phi(x_l) - \sum_l \alpha_l y_l + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+
                          \geq -\sum_l y_l \alpha_l |1 - y_l (w \cdot \Phi(x_l))|_+ + C \sum_l |1 - y_l (w \cdot \Phi(x_l))|_+
                          = \sum_l \underbrace{(C - \alpha_l y_l)}_{\geq 0} \, \underbrace{|1 - y_l (w \cdot \Phi(x_l))|_+}_{\geq 0}
                          \geq 0,

where C − α_l y_l ≥ 0 is satisfied by the constraints of the dual function in (3). Then, the SVM solution is obtained when one reaches w and α such that

    G(\alpha) + \varepsilon > J(w), \quad \text{where } w = \sum_i \alpha_i \Phi(x_i).    (7)

The strong duality in (6) then guarantees that J(w) < J(ŵ) + ε. Few solvers implement this criterion since it requires the additional calculation of the gap J(w) − G(α). In this paper, we advocate using criterion (7) with a threshold value ε that grows sublinearly with the number of examples. Letting ε grow makes the optimization coarser when the number of examples increases. As a consequence, the asymptotic complexity of the optimizations in the online setting can be smaller than that of the exact optimization.
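The stopping criterion in (7) only needs the current primal and dual values. The sketch below computes J, G, and their gap for given coefficients under the paper's b = 0 convention; the dense Gram-matrix representation is an assumption made for brevity.

    import numpy as np

    def primal_dual_gap(alpha, y, K, C):
        # Return (J, G, gap) for signed coefficients alpha, labels y in {-1,+1},
        # Gram matrix K, and misclassification penalty C, assuming b = 0.
        #   J(alpha) = 0.5*||w||^2 + C * sum_l max(0, 1 - y_l * f(x_l))
        #   G(alpha) = sum_l alpha_l * y_l - 0.5*||w||^2
        f = K @ alpha                      # decision scores f(x_l) of the kernel expansion
        w_norm_sq = alpha @ K @ alpha      # ||w||^2 = alpha^T K alpha
        hinge = np.maximum(0.0, 1.0 - y * f)
        J = 0.5 * w_norm_sq + C * hinge.sum()
        G = alpha @ y - 0.5 * w_norm_sq
        return J, G, J - G

    # Training can stop once the gap drops below a tolerance epsilon, as in (7):
    # J, G, gap = primal_dual_gap(alpha, y, K, C); done = gap < epsilon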
Most SVM solvers use the dual formulation of the QP problem. However, increasing the dual does not necessarily reduce the duality gap. The dual function follows a nice monotonically increasing pattern at each optimization step, whereas the primal shows significant up and down fluctuations. In order to keep the size of the duality gap in check, before each PROCESS operation we compute the standard deviation of the primal, which we call the Gap Target Ĝ:

    \hat{G} = \sqrt{\sum_{i=1}^{n} h_i^2 - \frac{\left(\sum_{i=1}^{n} h_i\right)^2}{l}},    (8)

where l is the number of support vectors and h_i = C y_i g_i, with C and g_i denoting the misclassification penalty and the gradient of instance i, respectively. After computing the gap target, we run a PROCESS step and check the new gap G between the primal and the dual. After an easy derivation, the gap is computed as

    G = \sum_{i=1}^{n} \left( \alpha_i g_i + \max(0, C g_i) \right).    (9)
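A possible reading of (8) and (9) in code, using the gradient convention g_i = y_i - ŷ(x_i) from Section 4.2. The exact normalization of the gap target follows the reconstruction above and should be treated as an assumption.

    import numpy as np

    def gap_target(alpha, y, grad, C):
        # Gap target G_hat of (8): a spread measure of h_i = C * y_i * g_i,
        # normalized by the number of support vectors l (nonzero coefficients).
        h = C * y * grad
        l = max(1, np.count_nonzero(alpha))
        # max(0, .) guards the radicand against numerical round-off.
        return np.sqrt(max(0.0, np.sum(h ** 2) - np.sum(h) ** 2 / l))

    def duality_gap(alpha, grad, C):
        # Gap G of (9): G = sum_i (alpha_i * g_i + max(0, C * g_i)).
        return np.sum(alpha * grad + np.maximum(0.0, C * grad))

    # REPROCESS is repeated after a PROCESS step until
    # duality_gap(...) <= max(C, gap_target(...)), per the termination criterion below.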
Note that, as we have indicated earlier, the bias term b is set to zero in all of the formulations. In the online iterations, we cycle between running REPROCESS and computing the gap G until the termination criterion G ≤ max(C, Ĝ) is reached. That is, we require the duality gap after the REPROCESS operations to be not greater than the initial gap target Ĝ. The C parameter is part of the equation in order to prevent the algorithm from specifying a too narrow gap target, and therefore to prevent making an excessive number of optimization steps. The heuristic upper bound on the gap is developed based on the oscillating characteristics of the primal function during the optimization steps, and these oscillations are related to the successive choice of examples to REPROCESS. Viewing these oscillations as noise, the gap target enables us to stop the REPROCESS operations when the difference is within the noise. After this point, the learner continues with computing the new Gap Target and running PROCESS and REPROCESS operations on the next fresh instance from the unseen example pool.

4.2 Building Blocks

The implementation of LASVM-G maintains the following pieces of information as its key building blocks: the coefficients α_i of the current kernel expansion S, the bounds for each α, and the partial derivatives of the instances in the expansion, given as

    g_k = \frac{\partial W(\alpha)}{\partial \alpha_k} = y_k - \sum_i \alpha_i K(x_i, x_k) = y_k - \hat{y}(x_k).    (10)

The kernel expansion here maintains all of the training instances in the learner's active set, both the support vectors and the instances with α = 0.

In the online iterations of LASVM-G, the optimization is driven by two kinds of direction searches. The first operation, PROCESS, inserts an instance into the kernel expansion and initializes the α_i and gradient g_i of this instance (Step 1). After computing the step size (Step 2), it performs a direction search (Step 3). We set the offset term b of the kernel expansion to zero for computational simplicity. As discussed in the SVM section regarding the offset term, disallowing b removes the necessity of satisfying the constraint Σ_{i∈S} α_i = 0, enabling the algorithm to update a single α at a time, both in PROCESS and REPROCESS operations.

LASVM-G PROCESS(i)
1) α_i ← 0;  g_i ← y_i − Σ_{s∈S} α_s K_is
2) If g_i < 0 then
       λ = max(A_i − α_i, g_i / K_ii)
   Else
       λ = min(B_i − α_i, g_i / K_ii)
3) α_i ← α_i + λ
   g_s ← g_s − λ K_is  ∀s in kernel expansion

The second operation, REPROCESS, searches all of the instances in the kernel expansion and selects the instance with the maximal gradient (Steps 1-3). Once an instance is selected, LASVM-G computes a step size (Step 4) and performs a direction search (Step 5).

LASVM-G REPROCESS()
1) i ← arg min_{s∈S} g_s with α_s > A_s
   j ← arg max_{s∈S} g_s with α_s < B_s
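For concreteness, here is a compact Python sketch of the two direction searches above, acting on a dense Gram matrix. The in-place update and the clipping of the step size mirror the pseudocode, but the data layout and the single-coefficient REPROCESS step are simplifying assumptions.

    import numpy as np

    class OnlineState:
        # Building blocks of Section 4.2: signed coefficients alpha, gradients
        # g_k = y_k - sum_s alpha_s K(x_s, x_k), and the bounds A, B.
        def __init__(self, K, y, C):
            self.K, self.y, self.C = np.asarray(K, float), np.asarray(y, float), C
            n = len(self.y)
            self.alpha = np.zeros(n)
            self.grad = self.y.copy()                 # with alpha = 0, g_k = y_k
            self.A = np.minimum(0.0, C * self.y)      # convex (hinge-loss) bounds
            self.B = np.maximum(0.0, C * self.y)

    def process(state, i):
        # PROCESS(i): direction search on a single coefficient (b = 0 convention).
        g = state.grad[i]
        if g < 0:
            lam = max(state.A[i] - state.alpha[i], g / state.K[i, i])
        else:
            lam = min(state.B[i] - state.alpha[i], g / state.K[i, i])
        state.alpha[i] += lam
        state.grad -= lam * state.K[i, :]             # g_s <- g_s - lam * K_is for all s

    def reprocess(state, tau=1e-3):
        # REPROCESS: pick the most violating coefficient and take one step on it.
        down = np.where(state.alpha > state.A)[0]     # coefficients that may decrease
        up = np.where(state.alpha < state.B)[0]       # coefficients that may increase
        if len(down) == 0 or len(up) == 0:
            return
        i = down[np.argmin(state.grad[down])]
        j = up[np.argmax(state.grad[up])]
        # One single-coefficient step on whichever end violates more (a simplification).
        k = j if abs(state.grad[j]) >= abs(state.grad[i]) else i
        if abs(state.grad[k]) > tau:
            process(state, k)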
Fig. 3. (a) The ramp loss can be decomposed into (b) a convex hinge loss and (c) a concave loss.

   Threshold ← max(C, Ĝ)
   Run PROCESS(x_i)
   while Gap G > Threshold
       Run REPROCESS
   end
   Periodically run CLEAN
In the next sections, we further introduce three SVM algorithms that are implemented based on LASVM-G, namely, LASVM-NC, LASVM-I, and FULL SVM. While these SVM algorithms share the main building blocks of LASVM-G, each algorithm exhibits a distinct learning principle. LASVM-NC uses the LASVM-G methodology in a nonconvex learner setting. LASVM-I is a learning scheme that we propose as a convex variant of LASVM-NC that employs selective sampling. FULL SVM does not take advantage of the nonconvexity or the efficiency of the CLEAN operation, and acts as a traditional online SVM solver in our experimental evaluation.

5 NONCONVEX ONLINE SVM SOLVER—LASVM-NC

In this section, we present LASVM-NC, a nonconvex online SVM solver that achieves sparser SVM solutions in less time than online convex SVMs and batch SVM solvers. We first

In this setting, correctly classified instances outside of the margin (z ≥ 1) cannot become SVs because H_1'(z) = 0. On the other hand, for the training examples with (z < 1), H_1'(z) is −1, so they cost a penalty term at the rate of misclassification of those instances. One problem with Hinge Loss-based optimization is that it imposes no limit on the influences of the outliers, that is, the misclassification penalty is unbounded. Furthermore, in Hinge Loss-based optimization, all misclassified training instances become support vectors. Consequently, the number of support vectors scales linearly with the number of training examples [19]. Specifically,

    \frac{\#SV}{\#Examples} \rightarrow 2 B_{\Phi},    (13)

where B_Φ is the best possible error achievable linearly in the feature space Φ(·). Such a fast pace of growth of the number of support vectors becomes prohibitive for training SVMs in large-scale data sets.

In practice, not all misclassified training examples are necessarily informative to the learner. For instance, in noisy data sets, many instances with label noise become support vectors due to misclassification, even though they are not informative about the correct classification of new instances in recognition. Thus, it is reasonable to limit the influence of
the outliers and allow the real informative training instances to define the model. Since Hinge Loss admits all outliers into the SVM solution, we need to select an alternative loss function that enables selectively ignoring the instances that are misclassified according to the current model. For this purpose, we propose using the Ramp Loss

    R_s(z) = H_1(z) - H_s(z),    (14)

which allows us to control the score window for z at which we are willing to convert instances into support vectors. Replacing H_1(z) with R_s(z) in (12), we see that the Ramp Loss suppresses the influence of the instances with score z < s by not converting them into support vectors. However, since the Ramp Loss is nonconvex, it prohibits us from using widely popular optimization schemes devised for convex functions.
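The decomposition in (14) is easy to state in code. The sketch below defines a shifted hinge H_u(z) = max(0, u − z) and the resulting ramp R_s(z); the exact functional form of H is inferred from Fig. 3 and the surrounding text, since its defining equation falls in a gap of the extracted text, so treat it as an assumption.

    def hinge(z, u=1.0):
        # Shifted hinge H_u(z) = max(0, u - z); H_1 is the classical hinge loss.
        return max(0.0, u - z)

    def ramp(z, s=-1.0):
        # Ramp Loss of (14): R_s(z) = H_1(z) - H_s(z).
        # Flat (and equal to 1 - s) for z < s, so badly misclassified points stop
        # contributing gradient; identical to the hinge for z >= s.
        return hinge(z, 1.0) - hinge(z, s)

    # Example: with s = -1, a point with score z = -5 costs ramp(-5) = 2.0 no matter
    # how negative z gets, whereas hinge(-5) = 6.0 keeps growing with the violation.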
While convexity has many advantages and nice mathematical properties, the SV scaling property in (13) may be prohibitive for large-scale learning because all misclassified examples become support vectors. Since nonconvex solvers are not necessarily bounded by this constraint, nonconvexity has the potential to generate sparser solutions [12]. In this work, our aim is to achieve the best of both worlds: generate a reliable and robust SVM solution that is faster and sparser than traditional convex optimizers. This can be achieved by employing the CCCP, and thus reducing the complexity of the nonconvex loss function by transforming the problem into a difference of convex parts. The Ramp Loss is amenable to CCCP optimization since it can be decomposed into a difference of convex parts (as shown in Fig. 3 and (14)). The cost function J^s(θ) for the Ramp Loss can then be represented as the sum of a convex part J^s_vex(θ) and a concave part J^s_cav(θ):

    \min_{\theta} J^s(\theta) = \frac{1}{2}\|w\|^2 + C \sum_{l=1}^{n} R_s(y_l f(x_l))
                              = \underbrace{\frac{1}{2}\|w\|^2 + C \sum_{l=1}^{n} H_1(y_l f(x_l))}_{J^s_{vex}(\theta)} \; \underbrace{- \, C \sum_{l=1}^{n} H_s(y_l f(x_l))}_{J^s_{cav}(\theta)}.    (15)
For simplification purposes, we use the notation

    \beta_l = y_l \frac{\partial J^s_{cav}(\theta)}{\partial f(x_l)} = \begin{cases} C, & \text{if } y_l f(x_l) < s, \\ 0, & \text{otherwise}, \end{cases}    (16)

where C is the misclassification penalty and f(x_l) is the kernel expansion defined as in (5) with the offset term b = 0. The cost function in (15), along with the notation introduced in (16), is then reformulated as the following dual optimization problem:

    \max_{\alpha} G(\alpha) = \sum_i y_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j K_{i,j},
    \quad \text{with} \quad \begin{cases} A_i \leq \alpha_i \leq B_i, \\ A_i = \min(0, C y_i) - \beta_i y_i, \\ B_i = \max(0, C y_i) - \beta_i y_i, \\ \beta_i \text{ from } (16). \end{cases}    (17)

Collobert et al. [12] use a similar formulation for the CCCP-based nonconvex optimization of batch SVMs, but there are fundamental differences between the optimization of batch SVMs and the online algorithms presented here. In particular, the batch SVM needs a convex initialization step prior to the nonconvex iterations, going over the entire or part of the training data in order to initialize the CCCP parameters and avoid getting stuck in a poor local optimum. Furthermore, batch nonconvex SVMs alternate between solving (17) and updating the βs of all training instances. On the other hand, LASVM-NC runs a few online convex iterations as the initialization stage, and adjusts the β of only the new fresh instance based on the current model and solves (17) while the online algorithm is progressing. Additionally, due to the nature of online learning, our learning scheme also permits selective sampling, which will be further discussed in the LASVM-I section.

We would also like to point out that if the βs of all of the training instances are initialized to zero and left unchanged in the online iterations, the algorithm becomes the traditional Hinge Loss SVM. From another viewpoint, if s ≪ 0, then the βs will remain zero and the effect of the Ramp Loss will not be realized. Therefore, (17) can be viewed as a generic algorithm that can act as both Hinge Loss SVM and Ramp Loss SVM with CCCP that enables nonconvex optimization.
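To make the effect of (16) and (17) concrete, the following sketch computes β_i for a fresh example and the Ramp-Loss-adjusted box constraints A_i and B_i. The variable names and the SS-style guard are taken from the algorithm block below; everything else is illustrative.

    def beta_for_instance(y_i, f_x_i, s, C, num_sv, min_sv_for_nonconvex):
        # beta_i of (16), with the online guard used by LASVM-NC: stay convex
        # (beta = 0) until the model has more than `min_sv_for_nonconvex` SVs.
        if num_sv > min_sv_for_nonconvex and y_i * f_x_i < s:
            return C
        return 0.0

    def ramp_bounds(y_i, beta_i, C):
        # Box constraints of (17): A_i = min(0, C*y_i) - beta_i*y_i,
        #                          B_i = max(0, C*y_i) - beta_i*y_i.
        A_i = min(0.0, C * y_i) - beta_i * y_i
        B_i = max(0.0, C * y_i) - beta_i * y_i
        return A_i, B_i

    # With beta_i = C and y_i = +1, the usual hinge-loss box [0, C] shifts to [-C, 0],
    # so a flagged outlier can no longer take the large positive coefficient it would
    # receive under the hinge loss.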
5.2 Online Iterations in LASVM-NC

The online iterations in LASVM-NC are similar to those of LASVM-G in the sense that they are also based on alternating PROCESS and REPROCESS steps, with the distinction of replacing the Hinge Loss with the Ramp Loss. LASVM-NC extends the LASVM-G algorithm with the computation of the β, followed by updating the bounds A and B as shown in (17). Note that while the β do not explicitly appear in the PROCESS and REPROCESS algorithm blocks, they do, in fact, affect these optimization steps through the new definition of the bounds A and B.

When a new example x_i is encountered, LASVM-NC first computes the β_i for this instance as presented in the algorithm block, where y_i is the class label, f(x_i) is the decision score for x_i, and s is the score threshold for permitting instances to become support vectors.

LASVM-NC
SS: min. number of SVs to start nonconvex behavior.
1) Initialization:
   Set α ← 0, β ← 0
2) Online Iterations:
   Pick an example x_i
   Set β_i = C if y_i f(x_i) < s and #SV > SS, 0 otherwise
   Set the α_i bounds for x_i to (min(0, C y_i) − β_i y_i ≤ α_i ≤ max(0, C y_i) − β_i y_i)
   Compute Gap Target Ĝ
   Threshold ← max(C, Ĝ)
   Run PROCESS(x_i)
   while Gap G > Threshold
       Run REPROCESS
   end
   Periodically run CLEAN

We would like to point out that CCCP has convergence guarantees (cf. [12]), but it is necessary to initialize the CCCP algorithm appropriately in order to avoid getting trapped in poor local optima. In batch SVMs, this corresponds to running a classical SVM on the entire set or on a subset of training instances in the first iteration to initialize CCCP, followed by the nonconvex optimization in the subsequent iterations. In the online setting, we initially allow convex optimization for the first few instances by setting their β_i = 0 (i.e., use the Hinge Loss), and then switch to nonconvex behavior in the remainder of the online iterations.

Note from (17) that the bounds for instances with β = 0 follow the formulation for the traditional convex setting. On the other hand, the bounds for the instances with β = C, that is, the outliers with score (z < s), are assigned new bounds based on the Ramp Loss criteria. Once LASVM-NC establishes the bounds for the new instance, it computes the Gap Target Ĝ and takes a PROCESS step. Then, it makes optimizations of the REPROCESS kind until the size of the duality gap comes down to the Gap Threshold. Finally, LASVM-NC periodically runs the CLEAN operation to keep the size of the kernel expansion under control and maintain its efficiency throughout the training stage.
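Putting the pieces together, a single LASVM-NC online step might look like the following sketch, which reuses the hypothetical OnlineState, process, reprocess, gap_target, duality_gap, beta_for_instance, and ramp_bounds helpers sketched earlier. It is an illustrative skeleton under those assumptions, not the authors' code.

    def lasvm_nc_step(state, i, s=-1.0, min_sv_for_nonconvex=100, max_reprocess=1000):
        # One online iteration of the LASVM-NC scheme for a fresh example i:
        # compute beta_i, reset its bounds per (17), PROCESS it, then REPROCESS
        # until the duality gap falls below max(C, gap_target).
        f_x_i = state.K[i, :] @ state.alpha            # decision score f(x_i), b = 0
        num_sv = int((state.alpha != 0).sum())
        beta_i = beta_for_instance(state.y[i], f_x_i, s, state.C, num_sv,
                                   min_sv_for_nonconvex)
        state.A[i], state.B[i] = ramp_bounds(state.y[i], beta_i, state.C)

        threshold = max(state.C, gap_target(state.alpha, state.y, state.grad, state.C))
        process(state, i)
        for _ in range(max_reprocess):                 # guard against endless looping
            if duality_gap(state.alpha, state.grad, state.C) <= threshold:
                break
            reprocess(state)
        # A CLEAN step (dropping non-SV entries) would be run periodically here.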
TABLE 1. Analysis of Adult Data Set at the End of the Training Stage.

6 LASVM AND IGNORING INSTANCES—LASVM-I

This SVM algorithm employs the Ramp function in Fig. 3a as a filter to the learner prior to the PROCESS step. That is, once the learner is presented with a new instance, it first checks whether the instance is in the ramp region of the function (1 > y_i Σ_j α_j K_ij > s). The instances that are outside of the ramp region are not eligible to participate in the optimization steps and are immediately discarded without further action. The rationale is that the instances that lie on the flat regions of the Ramp function have derivative H'(z) = 0, and based on (12), these instances will not play a role in determining the decision hyperplane w.

The LASVM-I algorithm is based on the following record keeping that we conducted when running LASVM-G experiments. In LASVM-G and LASVM-NC, we kept track of two important data points. First, we intentionally permitted every newly arriving training instance into the kernel expansion in the online iterations and recorded the position of all instances on the Ramp Loss curve just before inserting the instances into the expansion. Second, we kept track of the number of instances that were removed from the expansion which were on the flat region of the Ramp Loss curve when they were admitted. The numeric breakdown is presented in Table 1. Based on the distribution of these cleaned instances, it is evident that most of the cleaned examples that were initially admitted from the (z > 1) region were removed from the kernel expansion with CLEAN at a later point in time. This is expected, since the instances with (z > 1) are already correctly classified by the current model with a certain confidence, and hence do not become SVs.

On the other hand, Table 1 shows that almost all of the instances inserted from the left flat region (misclassified examples due to z < s) became SVs in LASVM-G, and therefore were never removed from the kernel expansion. In contrast, almost all of the training instances that were admitted from the left flat region in LASVM-NC were removed from the kernel expansion, leading to a much larger reduction of the number of support vectors overall.

Intuitively, the examples that are misclassified by a wide margin should not become support vectors. Ideally, the support vectors should be the instances that are within the margin of the hyperplane. As studies on Active Learning show [14], [20], the most informative instances to determine the hyperplane lie within the margin. Thus, LASVM-I ignores the instances that are misclassified by a margin (z < s) up front and prevents them from becoming support vectors.

LASVM-I
1) Initialization:
   Set α ← 0
2) Online Iterations:
   Pick an example x_i
   Compute z = y_i Σ_{j=0}^{n} α_j K(x_i, x_j)
   if (z > 1 or z < s)
       Skip x_i and bail out
   else
       Compute Gap Target Ĝ
       Threshold ← max(C, Ĝ)
       Run PROCESS(x_i)
       while Gap G > Threshold
           Run REPROCESS
       end
   Periodically run CLEAN

Note that LASVM-I cannot be regarded as a nonconvex SVM solver, since the instances with β = C (which corresponds to z < s) are already being filtered out up front, before the optimization steps. Consequently, all of the instances visible to the optimization steps have β = 0, which converts the objective function in (17) into the convex Hinge Loss from an optimization standpoint. Thus, by combining these two filtering criteria (z > 1 and z < s), LASVM-I trades nonconvexity for a filtering Ramp function that determines whether to ignore an instance or proceed with the optimization steps. Our goal in designing LASVM-I is that, based on this initial filtering step, it is possible to achieve further speedups in training times while maintaining competitive generalization performance. The experimental results validate this claim.
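The filtering step that distinguishes LASVM-I is only a score check before any optimization work. A minimal sketch, reusing the hypothetical OnlineState and the helpers from the earlier sketches:

    def lasvm_i_should_process(state, i, s=-1.0):
        # LASVM-I front filter: compute z = y_i * f(x_i) and skip the example
        # when it falls on either flat region of the Ramp function.
        z = state.y[i] * (state.K[i, :] @ state.alpha)
        if z > 1 or z < s:
            return False   # outside the ramp region: ignore, no optimization work
        return True        # inside the margin band: run PROCESS/REPROCESS as usual

    # Usage sketch: only instances passing the filter trigger optimization.
    # if lasvm_i_should_process(state, i):
    #     process(state, i)
    #     ... REPROCESS until the gap criterion is met, then periodic CLEAN ...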
7 LASVM-G WITHOUT CLEAN—FULL SVM

This algorithm serves as a baseline case for comparisons in our experimental evaluation. The learning principle of FULL SVM is based on alternating between LASVM-G's PROCESS and REPROCESS steps throughout the training iterations.

FULL SVM
1) Initialization:
   Set α ← 0
TABLE 3. Comparison of All Four SVM Algorithms with LASVM and LIBSVM for All Data Sets.

comparably more support vectors than LASVM-NC, but makes fewer kernel calculations due to early filtering. The overall training times for all data sets and all algorithms are presented both in Fig. 8 and Table 3. LASVM-G, LASVM-NC, and LASVM-I are all significantly more efficient than FULL SVM. LASVM-NC and LASVM-I also yield faster training than LIBSVM and LASVM. Note that the LIBSVM algorithm here uses second order active set selection. Although second order selection can also be applied to LASVM-like algorithms to achieve improved speed and accuracy [23], we did not implement it in the algorithms discussed in this paper. Nevertheless, in Fig. 8, the fastest training times belong to LASVM-I, with LASVM-NC a close second. The sparsest solutions are achieved by LASVM-NC, and this time LASVM-I comes in a close second. These two algorithms represent a compromise between training time versus sparsity and recognition time, and the appropriate algorithm should be chosen based on the requirements of the classification task.

9 CONCLUSION

In traditional convex SVM optimization, the number of support vectors scales linearly with the number of training examples, which unreasonably increases the training time and the computational resource requirements. This fact has hindered the widespread adoption of SVMs for classification tasks in large-scale data sets. In this work, we have studied the ways in which the computational efficiency of an online SVM solver can be improved without sacrificing generalization performance. This paper is concerned with suppressing the influences of the outliers, which becomes particularly problematic in noisy data classification. For this purpose, we first present a systematic optimization approach for an online learning framework to generate more reliable and trustworthy learning models in intermediate iterations (LASVM-G). We then propose two online algorithms, LASVM-NC and LASVM-I, which leverage the Ramp function to prevent the outliers from becoming support vectors. LASVM-NC replaces the traditional Hinge Loss with the Ramp Loss and brings the benefits of nonconvex optimization using CCCP to an online learning setting. LASVM-I uses the Ramp function as a filtering mechanism to discard the outliers during online iterations. In online learning settings, we can discard newly arriving training examples accurately enough only when the intermediate models are as reliable as possible. In LASVM-G, the increased stability of intermediate models is achieved by the duality gap policy. This increased stability in the model significantly reduces the number of wrongly discarded instances in the online iterations of LASVM-NC and LASVM-I. Empirical evidence suggests that the algorithms provide efficient and scalable learning with noisy data sets in two respects: 1) computational: there is a significant decrease in the number of computations and the running time during training and recognition, and 2) statistical: there is a significant decrease in the number of examples required for good generalization. Our findings also reveal that discarding the outliers by leveraging the Ramp function is closely related to the working principles of margin-based Active Learning.
Léon Bottou received the diplôme de l'Ecole Polytechnique, Paris, in 1987, the Magistère en mathématiques fondamentales et appliquées et informatiques from the Ecole Normale Supérieure, Paris, in 1988, and the PhD degree in computer science from the Université de Paris-Sud in 1991. He joined AT&T Bell Labs from 1991 to 1992 and AT&T Labs from 1995 to 2002. Between 1992 and 1995, he was the chairman of Neuristique in Paris, a small company pioneering machine learning for data mining applications. He joined NEC Labs America in Princeton, New Jersey, in 2002. His primary research interest is machine learning. His contributions to this field address theory, algorithms, and large-scale applications. His secondary research interest is data compression and coding. His best known contribution in this field is the DjVu document compression technology (http://www.djvu.org). He is serving on the boards of the Journal of Machine Learning Research and the IEEE Transactions on Pattern Analysis and Machine Intelligence. He also serves on the scientific advisory board of Kxen, Inc. (http://www.kxen.com). He won the New York Academy of Sciences Blavatnik Award for Young Scientists in 2007.

C. Lee Giles is the David Reese professor of information sciences and technology at the Pennsylvania State University, University Park. He has appointments in the Departments of Computer Science and Engineering and Supply Chain and Information Systems. Previously, he was at NEC Research Institute, Princeton, New Jersey, and the Air Force Office of Scientific Research, Washington, District of Columbia. His research interests are in intelligent cyberinfrastructure, Web tools, search engines and information retrieval, digital libraries, Web services, knowledge and information extraction, data mining, name matching and disambiguation, and social networks. He was a cocreator of the popular search engines and tools: SeerSuite, CiteSeer (now CiteSeerX) for computer science, and ChemXSeer, for chemistry. He also was a cocreator of an early metasearch engine, Inquirus, the first search engine for robots.txt, BotSeer, and the first for academic business, SmealSearch. He is a fellow of the ACM, IEEE, and INNS.