
Incremental Support Vector Machine Construction

Carlotta Domeniconi Dimitrios Gunopulos


Computer Science Department
University of California
Riverside, CA 92521
{carlotta,dg}@cs.ucr.edu

Abstract

SVMs suffer from the problem of large memory requirements and CPU time when trained in batch mode on large data sets. We overcome these limitations, and at the same time make SVMs suitable for learning with data streams, by constructing incremental learning algorithms.
We first introduce and compare different incremental learning techniques, and show that they are capable of producing performance results similar to the batch algorithm, and in some cases superior condensation properties. We then consider the problem of training SVMs using stream data. Our objective is to maintain an updated representation of recent batches of data. We apply incremental schemes to the problem and show that their accuracy is comparable to the batch algorithm.

1. Introduction

Many applications that involve massive data sets are emerging. Examples are: telephone records, sales logs, multimedia data. When developing classifiers using learning methods, while a large number of training data can help reduce the generalization error, the learning process itself can become computationally intractable.
One would like to consider all training examples simultaneously, in order to accurately estimate the underlying class distributions. However, these data sets are far too large to fit in main memory, and are typically stored in secondary storage devices, making their access particularly expensive. The fact that not all examples can be loaded into memory at once has two important consequences: the learning algorithm won't be able to see all data in one single batch, and is not allowed to "remember" too much of the data scanned in the past. As a consequence, scaling up classical learning algorithms to handle extremely large data sets and meet these requirements is an important research issue [15], [4].
One approach to satisfy these constraints is to consider incremental learning techniques, in which only a subset of the data is considered at each step of the learning process.
Support Vector Machines (SVMs) [17] have been successfully used as a classification tool in a variety of areas [9, 2, 12]. The solid theoretical foundations that have inspired SVMs convey desirable computational and learning-theoretic properties to the SVM's learning algorithm. Another appealing feature of SVMs is the sparse representation of the decision boundary they provide. The location of the separating hyperplane is specified via real-valued weights on the training examples. Those training examples that lie far away from the hyperplane do not participate in its specification and therefore receive zero weight. Only training examples that lie close to the decision boundary between the two classes (the support vectors) receive non-zero weights.
Therefore, SVMs seem well suited to be trained in an incremental learning fashion [16, 11]. In fact, since their design allows the number of support vectors to be small compared to the total number of training examples, they provide a compact representation of the data, to which new examples can be added as they become available.
New optimization approaches that specifically exploit the structure of the SVM have also been developed for scaling up the learning process. See [1, 14, 3].

2. Incremental Learning with SVMs

In order to make the SVM learning algorithm incremental, we can partition the data set into batches that fit into memory. Then, at each incremental step, the representation of the data seen so far is given by the set of support vectors describing the learned decision boundary (along with the corresponding weights). Such support vectors are incorporated with the new incoming batch of data to provide the training data for the next step. Since the design of SVMs allows the number of support vectors to be small compared to the total number of training examples, this scheme should provide a compact representation of the data set.
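A minimal sketch of this scheme is given below. It is not the paper's actual implementation (the experiments use SVMlight); scikit-learn's SVC stands in for the solver, and the kernel parameters are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

def incremental_svm(batches, C=1.0, gamma=0.1):
    """Train an SVM batch by batch: at each step the support vectors of the
    current model are merged with the next batch and the model is retrained."""
    model, sv_X, sv_y = None, None, None
    for X_batch, y_batch in batches:              # batches: iterable of (X, y) ndarrays
        if model is None:                         # first step: train on the first batch alone
            X_train, y_train = X_batch, y_batch
        else:                                     # later steps: carried support vectors + new batch
            X_train = np.vstack([sv_X, X_batch])
            y_train = np.concatenate([sv_y, y_batch])
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        # keep only the support vectors as the compact summary of the data seen so far
        sv_X, sv_y = X_train[model.support_], y_train[model.support_]
    return model, sv_X, sv_y
```

The Fixed-partition technique described below corresponds directly to this loop; the other techniques differ only in which incoming points are retained before retraining.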



It is reasonable to expect that the model incrementally built in this way won't be too far from the model built with the complete data set at once (batch mode). This is because, at each incremental step, the SVM remembers the essential class boundary information regarding the data seen so far, and this information contributes properly to generating the classifier at the successive iteration.
Once a new batch of data is loaded into memory, there are different possibilities for updating the current model. Here we explore four different techniques. For all the techniques, at each step only the learned model from the previously seen data (preserved in the form of its support vectors) is kept in memory.

Error-driven technique (ED). This technique is a variation of the method introduced in [11], in which a percentage of both the misclassified and the correctly classified data is retained for incremental training. The Error-driven technique, instead, keeps only the misclassified data. Given the model SVM_t at time t, new data are loaded into memory and classified using SVM_t. If a point is misclassified, it is kept; otherwise it is discarded. Once a given number n_e of misclassified points has been collected, the update of SVM_t takes place: the support vectors of SVM_t, together with the n_e misclassified points, are used as training data to obtain the new model SVM_{t+1}.

Fixed-partition technique (FP). This technique was previously introduced in [16]. The training data set is partitioned into batches of fixed size. When a new batch of data is loaded into memory, it is added to the current set of support vectors; the resulting set gives the training set used to train the new model. The support vectors obtained from this process are the new representation of the data seen so far, and they are kept in memory.

Exceeding-margin technique (EM). Given the model SVM_t at time t, new data {(x_i, y_i)} are loaded into memory. The algorithm checks whether (x_i, y_i) exceeds the margin defined by SVM_t, i.e. whether y_i f_t(x_i) ≤ 1. If the condition is satisfied the point is kept; otherwise it is discarded. Once a given number n_e of points exceeding the margin has been collected, the update of SVM_t takes place: the support vectors of SVM_t, together with the n_e points, are used as training data to obtain the new model SVM_{t+1}.

Exceeding-margin+errors technique (EM+E). Given the model SVM_t at time t, new data {(x_i, y_i)} are loaded into memory. The algorithm checks whether (x_i, y_i) exceeds the margin defined by SVM_t, i.e. whether y_i f_t(x_i) ≤ 1. If the condition is satisfied the point is kept; otherwise it is classified using SVM_t: if misclassified it is kept, otherwise it is discarded. Once a given number n_e of points, either exceeding the margin or misclassified, has been collected, the update of SVM_t takes place: the support vectors of SVM_t, together with the n_e points, are used as training data to obtain the new model SVM_{t+1}.
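The selection rules of ED, EM, and EM+E (FP keeps every incoming point) can be written as a small filter. The sketch below is only an illustration of the conditions above; it assumes a trained classifier svm_t exposing scikit-learn-style predict and decision_function, the latter playing the role of f_t:

```python
import numpy as np

def keep_point(svm_t, x, y, technique):
    """Return True if the labelled point (x, y) should be retained for the next
    incremental update of svm_t under the given selection technique."""
    x = np.asarray(x).reshape(1, -1)
    misclassified = svm_t.predict(x)[0] != y
    margin = y * svm_t.decision_function(x)[0]   # y * f_t(x)
    if technique == "ED":        # Error-driven: keep only misclassified points
        return misclassified
    if technique == "EM":        # Exceeding-margin: keep if y * f_t(x) <= 1
        return margin <= 1
    if technique == "EM+E":      # keep if the margin is exceeded or the point is misclassified
        return margin <= 1 or misclassified
    raise ValueError("unknown technique: %s" % technique)
```

Once n_e points have been retained, they are merged with the current support vectors and the SVM is retrained to obtain SVM_{t+1}, as described above.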
3. Training SVMs using Data Streams

We consider here the scenario in which the example generation is time dependent, and follow the data stream model presented in [8], also used in [7], [6], [4]. A data stream is a sequence of items that can be seen only once, and in the same order in which it is generated.
We seek algorithms for classification that maintain an updated representation of recent batches of data. The algorithm therefore must maintain an accurate representation of a window of recent data [6]. This model is useful in practice because the characteristics of the data may change with time, and so old examples may not be a good predictor for future points. The algorithm must perform only one pass over the stream data, and use a workspace that is smaller than the size of the input.
The incremental learning techniques we discussed are capable of achieving these objectives. Our approach is similar to [5], and works as follows. We consider the incoming data in batches of a given size b, and maintain in memory w models representative of the last 1, 2, ..., w batches; thus, the window size is W = wb examples. The w models are trained incrementally as data becomes available. Let us call the models at time t SVM_t^1, SVM_t^2, ..., SVM_t^w, respectively. When a new batch of data comes in, at step t+1, SVM_t^w is discarded, and the remaining SVM_t^1, ..., SVM_t^{w-1} are incrementally updated to take into account the new batch of data, producing SVM_{t+1}^2, ..., SVM_{t+1}^w, respectively. SVM_{t+1}^1 is generated using the new batch of data only. At each step t, SVM_t^w gives the in-memory representation of the current distribution of data, and it is used to predict the class labels of new data. Any of the techniques discussed above can be employed for the incremental updates.
Besides the w SVM models, only b data points need to reside in memory at once. Both b and w can be set according to domain knowledge regarding locality properties of the data distribution over time.
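A sketch of this window bookkeeping follows. Here train_on_batch (fit a fresh SVM on one batch) and incremental_update (extend a model with one batch using any of the Section 2 techniques) are assumed helper functions, not functions defined in the paper:

```python
def stream_training(stream_batches, w, train_on_batch, incremental_update):
    """Maintain w models over a stream; models[i] summarizes the last i+1 batches,
    so models[-1] covers the full window of W = w*b examples and is used for prediction."""
    models = []
    for batch in stream_batches:                     # each batch holds b examples
        if len(models) == w:
            models.pop()                             # drop the model already covering w batches
        # every remaining model is extended by the new batch (it now covers one more batch)
        models = [incremental_update(m, batch) for m in models]
        # a fresh model trained on the new batch only becomes the 1-batch model
        models.insert(0, train_on_batch(batch))
        yield models[-1]                             # current window model, used for prediction
```

During the first w - 1 steps the window is still filling up, so the model returned covers fewer than w batches.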

4. Experimental Evaluation

We compare the four incremental techniques and the SVM learning algorithm in batch mode, to verify their performance and the sizes of the resulting classifiers, i.e. the number of resulting support vectors. We have tested the techniques on both simulated and real data. The real dataset (Pima) is taken from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. We used, for both the incremental and batch algorithms, radial basis function kernels. We used SVMlight [10], and set the value of γ in K(x_i, x) = exp(-γ ||x_i - x||^2) equal to the optimal one determined via cross-validation. The value of C for the soft-margin classifier is also optimized via cross-validation. For the incremental techniques we have tested different batch sizes and values of n_e. In Tables 1-2 we report the best performances obtained (B is for the batch algorithm). We also report, besides the average classification error rates and standard deviations, the number of support vectors of the resulting classifier, the corresponding size of the condensed set (%), and the number of training cycles the SVM underwent.
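The paper performs this tuning with SVMlight; a rough scikit-learn equivalent of selecting γ and C by cross-validation is sketched below (the grid values are illustrative, not the ones used in the experiments):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X_train, y_train):
    """Pick gamma (of K(x_i, x) = exp(-gamma * ||x_i - x||^2)) and the soft-margin
    parameter C by cross-validated grid search."""
    param_grid = {"gamma": [0.001, 0.01, 0.1, 1.0],   # illustrative candidate values
                  "C": [0.1, 1.0, 10.0, 100.0]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_params_["gamma"], search.best_params_["C"]
```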
To test the incremental techniques with stream data, we have used the Noisy-crossed-norm dataset (generated as the Large-noisy-crossed-norm dataset described below), generated streams in batches of size b = 1000, and set w = 3. We have employed the Fixed-partition technique for the incremental updates. At each incremental step, we have tested the performance of the current model using 10 independent test sets of size 1000. We report average classification error rates and classifier sizes over successive steps. For comparison, we have also trained an SVM in batch mode over w = 3 consecutive batches of data over time, and report the average classification error rates obtained at each step.
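As an illustration of this evaluation protocol (not code from the paper): the sketch assumes a sequence of per-step window models exposing a scikit-learn-style predict, and a placeholder make_test_set that draws an independent labelled test set:

```python
import numpy as np

def evaluate_stream(window_models, make_test_set, n_test_sets=10, test_size=1000):
    """At every incremental step, score the current window model on independent
    test sets and record the average classification error."""
    avg_errors = []
    for model in window_models:                          # one model per time step
        step_errors = []
        for _ in range(n_test_sets):
            X_test, y_test = make_test_set(test_size)    # independent test set
            step_errors.append(np.mean(model.predict(X_test) != y_test))
        avg_errors.append(float(np.mean(step_errors)))
    return avg_errors
```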
The Problems: Large-noisy-crossed-norm data. This data set consists of n = 20 attributes and J = 2 classes. Each class is drawn from a multivariate normal distribution with unit covariance matrix. One class has mean 2/√20 along each dimension, and the other has mean -2/√20 along each dimension. We have generated 200,000 data points, and performed 5-fold cross-validation with 100,000 training data and 100,000 testing data. Table 1 shows the results. The last column lists the running times (in hours). Experiments were conducted on a 1.3 GHz machine with 1 GB of RAM.
Pima Indians Diabetes data. This data set consists of n = 8 attributes, J = 2 classes, and l = 768 instances. Results are shown in Table 2. We performed 10-fold cross-validation with 568 training data and 200 testing data.
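Under the description above, the Large-noisy-crossed-norm data could be generated as in the sketch below; the balanced class priors, the function name, and the seeding are our assumptions, not specified in the paper:

```python
import numpy as np

def large_noisy_crossed_norm(n_samples=200_000, n_attributes=20, seed=0):
    """Two classes in 20 dimensions, each drawn from a unit-covariance normal with
    mean +2/sqrt(20) (class +1) or -2/sqrt(20) (class -1) on every attribute."""
    rng = np.random.default_rng(seed)
    shift = 2.0 / np.sqrt(n_attributes)
    y = rng.choice([-1, 1], size=n_samples)              # assumed balanced class priors
    X = rng.normal(size=(n_samples, n_attributes)) + y[:, None] * shift
    return X, y
```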
Results: Tables 1-2 show that, for both data sets we have tested, the performance obtained with the incremental techniques comes close to the performance given by the batch algorithm. Moreover, for each problem considered, more than one incremental scheme provides a much smaller condensed set. In particular, the condensation power (1.5%) that the Exceeding-margin technique shows for the Large-noisy-crossed-norm data is quite remarkable, while still performing close to the batch algorithm. The fact that the classifier is kept smaller also allows for a much faster computation (30 minutes). The results obtained with the Pima data are also of interest: all four incremental techniques perform better than the batch algorithm and, at the same time, compute a smaller condensed set.
In Figure 1, we plot the results obtained with the stream data for 12 time steps. The average estimator sizes for the incremental and batch techniques are 418 and 430, respectively. Since the data distribution is stationary, the performance and estimator size remain stable over time. We observe that the incremental technique employed (Fixed-partition) and the batch-mode algorithm basically provide the same results, both in terms of performance and size of the model. These results provide clear evidence that, although the incremental techniques allow loss of information, they are capable of achieving accuracy results similar to the batch algorithm, while significantly improving training time.

Figure 1. Noisy-crossed-norm data: average error rates of the Fixed-partition and batch algorithms for consecutive time steps.

Table 1. Results for Large-noisy-crossed-norm data.

        batch size   time (hours)
B       -            14
ED      500          17
FP      500          20
EM      500          0.5
EM+E    500          22

Table 2. Results for Pima data.

        error (%)   std dev   #SVs   Cond. set (%)   cycles   batch size
B       31.9        0.47      547    96              -        -
ED      29.3        0.02      291    51.2            13       10
FP      26.2        0.02      405    71.3            38       10
EM      27.1        0.02      394    69.4            34       10
EM+E    26.4        0.02      399    70.2            36       10

5. Related Work

The incremental techniques discussed here can be viewed as approximations of the chunking technique employed to train SVMs [13]. The chunking technique is an exact decomposition method that iterates through the training set to select the support vectors.
The incremental methods introduced here, instead, scan the training data only once; once discarded, data are not considered anymore. This property also makes the methods suited to be employed within the data stream model. Furthermore, the experiments we have performed show that, although the incremental techniques allow loss of information, they are capable of achieving performance results similar to the batch algorithm.

6. Conclusions

We have introduced and compared new and existing incremental techniques for constructing SVMs. The experimental results presented show that the incremental techniques are capable of achieving performance results similar to the batch algorithm, while improving the training time. We extended these approaches to work with stream data, and presented experimental results to show the efficiency and accuracy of the method.

Acknowledgments

This research has been supported by the National Science Foundation under grants NSF CAREER Award 9984729 and NSF IIS-9907477, by the US Department of Defense, and by a research award from AT&T.

References

[1] K. P. Bennett and C. Campbell, "Support Vector Machines: Hype or Hallelujah?", SIGKDD Explorations, Vol. 2, No. 2, 1-13, 2000.
[2] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data using support vector machines", Technical Report, University of California in Santa Cruz, 1999.
[3] G. Cauwenberghs and T. Poggio, "Incremental and Decremental Support Vector Machine Learning", Advances in Neural Information Processing Systems, 2000.
[4] P. Domingos and G. Hulten, "Mining high-speed data streams", SIGKDD 2000: 71-80, Boston, MA.
[5] V. Ganti, J. Gehrke, and R. Ramakrishnan, "DEMON: Mining and Monitoring Evolving Data", ICDE 2000: 439-448, San Diego, CA.
[6] S. Guha and N. Koudas, "Data-Streams and Histograms", Proc. STOC 2001.
[7] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams", IEEE Foundations of Computer Science, 2000.
[8] M. R. Henzinger, P. Raghavan, and S. Rajagopalan, "Computing on data streams", SRC Technical Note 1998-011, Digital Research Center, May 26, 1998.
[9] T. Joachims, "Text categorization with support vector machines", Proc. of the European Conference on Machine Learning, 1998.
[10] T. Joachims, "Making large-scale SVM learning practical", Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999. http://www-ai.cs.uni-dortmund.de/thorsten/svm-light.html
[11] P. Mitra, C. A. Murthy, and S. K. Pal, "Data Condensation in Large Databases by Incremental Learning with Support Vector Machines", International Conference on Pattern Recognition, 2000.
[12] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection", Proc. of Computer Vision and Pattern Recognition, 1997.
[13] E. Osuna, R. Freund, and F. Girosi, "An improved training algorithm for support vector machines", Proceedings of IEEE NNSP'97, 1997.
[14] J. C. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization", Advances in Kernel Methods, B. Scholkopf, C. J. C. Burges, and A. J. Smola (eds.), MIT Press, 185-208, 1999.
[15] F. J. Provost and V. Kolluri, "A survey of methods for scaling up inductive learning algorithms", Technical Report ISL-97-3, Intelligent Systems Lab., Department of Computer Science, University of Pittsburgh, 1997.
[16] N. A. Syed, H. Liu, and K. K. Sung, "Incremental Learning with Support Vector Machines", International Joint Conference on Artificial Intelligence (IJCAI), 1999.
[17] V. Vapnik, Statistical Learning Theory. Wiley, 1998.
