A Parallel Mixture of SVMs for Very Large Scale Problems

Ronan Collobert
Samy Bengio
IDIAP
CP 592, rue du Simplon 4
1920 Martigny, Switzerland
[email protected]
Yoshua Bengio

Part of this work was done while Ronan Collobert was at IDIAP, CP 592, rue du Simplon 4, 1920 Martigny, Switzerland.
Abstract
Support Vector Machines (SVMs) are currently the state-of-the-art models for
many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic in the number of examples.
Hence, it is hopeless to try to solve real-life problems having more than a few
hundreds of thousands of examples with SVMs. The present paper proposes a
new mixture of SVMs that can be easily implemented in parallel and where
each SVM is trained on a small subset of the whole dataset. Experiments on a
large benchmark dataset (Forest) as well as a difficult speech database yielded
significant time improvement (time complexity appears empirically to locally
grow linearly with the number of examples). In addition, and somewhat surprisingly,
a significant improvement in generalization was observed on Forest.
1 Introduction
Recently, a lot of work has been done around Support Vector Machines [9], mainly due to
their impressive generalization performance on classification problems when compared to other
algorithms such as artificial neural networks [3, 6]. However, SVMs require the solution of a quadratic
optimization problem whose resource requirements are at least quadratic in the number of training
examples, and it is thus hopeless to try solving problems having millions of examples using
classical SVMs.
In order to overcome this drawback, we propose in this paper to use a mixture of several SVMs,
each of them trained only on a part of the dataset. The idea of an SVM mixture is not new,
although previous attempts such as Kwok's paper on Support Vector Mixtures [5] did not train
the SVMs on part of the dataset but on the whole dataset, and hence could not overcome the
time complexity problem for large datasets. We propose here a simple method to train such
a mixture, and we will show that in practice this method is much faster than training only
one SVM, and leads to results that are at least as good as those of a single SVM. We conjecture that the
training time complexity of the proposed approach with respect to the number of examples is
sub-quadratic for large datasets. Moreover, this mixture can be easily parallelized, which could
further improve the training time significantly.
The paper is organized as follows: in the next section, we briefly introduce the SVM
model for classification. In section 3 we present our mixture of SVMs, followed in section 4 by
some comparisons to related models. In section 5 we show some experimental results, first on a
toy dataset, then on two large real-life datasets. A short conclusion then follows.
2 Introduction to Support Vector Machines

An SVM classifier [9] takes decisions according to

$$ f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b \right) \qquad (1) $$

where x ∈ ℝ^d is the d-dimensional input vector of a test example, y ∈ {-1, 1} is a class label, x_i
is the input vector for the ith training example, y_i is its associated class label, N is the number
of training examples, K(x, x_i) is a positive definite kernel function, and α = {α_1, ..., α_N} and
b are the parameters of the model. Training an SVM consists in finding the α that minimizes the
objective function
$$ Q(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (2) $$

subject to the constraints

$$ \sum_{i=1}^{N} \alpha_i y_i = 0 \qquad (3) $$

and

$$ 0 \leq \alpha_i \leq C \quad \forall i. \qquad (4) $$

The kernel K(x, x_i) can have different forms, such as the Radial Basis Function (RBF)

$$ K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{\sigma^2} \right) \qquad (5) $$

with parameter σ.
Therefore, to train an SVM, we need to solve a quadratic optimization problem where the
number of parameters is N. This makes the use of SVMs for large datasets difficult: computing
K(x_i, x_j) for every training pair would require O(N^2) computation, and solving the optimization problem may take up
to O(N^3). Note however that current state-of-the-art algorithms appear to have a training time
complexity scaling much closer to O(N^2) than to O(N^3) [2].
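To make these costs concrete, the following sketch (not from the paper; the function names and the use of NumPy are our own) evaluates the RBF kernel of equation (5), the decision function of equation (1), and the full N x N Gram matrix whose construction alone already requires O(N^2) kernel evaluations.

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma):
    """RBF kernel of equation (5): K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / sigma ** 2)

def svm_decision(x, train_x, train_y, alpha, b, sigma):
    """Decision function of equation (1): sign(sum_i y_i alpha_i K(x, x_i) + b)."""
    s = sum(a_i * y_i * rbf_kernel(x, x_i, sigma)
            for a_i, y_i, x_i in zip(alpha, train_y, train_x))
    return np.sign(s + b)

def gram_matrix(X, sigma):
    """Kernel matrix over all training pairs: N^2 entries, hence O(N^2) work."""
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = rbf_kernel(X[i], X[j], sigma)
    return K
```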
3 A Mixture of SVMs

The proposed mixture computes its output as

$$ f(x) = h\left( \sum_{m=1}^{M} w_m(x)\, s_m(x) \right) \qquad (6) $$

where M is the number of experts in the mixture, s_m(x) is the output of the mth expert
given input x, w_m(x) is the weight for the mth expert given by a gater module that also takes
x as input, and h is a transfer function, which could be for example the hyperbolic tangent for
classification tasks. Here each expert is an SVM, and we took a neural network for the gater in
our experiments. In the proposed model, the gater is trained to minimize the cost function

$$ C = \sum_{i=1}^{N} \left[ f(x_i) - y_i \right]^2 . \qquad (7) $$
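The following minimal sketch (our own illustration; the paper provides no code, and the expert and gater interfaces below are hypothetical placeholders) shows how equations (6) and (7) combine the expert scores through the gater.

```python
import numpy as np

def mixture_output(x, experts, gater, h=np.tanh):
    """Equation (6): f(x) = h(sum_m w_m(x) * s_m(x)).

    `experts` is a list of M callables s_m (e.g. trained SVM scoring functions)
    and `gater` maps x to a length-M weight vector w(x) (e.g. an MLP forward pass).
    """
    w = np.asarray(gater(x))                          # weights w_m(x), shape (M,)
    s = np.array([expert(x) for expert in experts])   # expert outputs s_m(x), shape (M,)
    return h(np.dot(w, s))

def gater_cost(X, y, experts, gater):
    """Equation (7): squared error of the mixture over the N training examples."""
    f = np.array([mixture_output(x_i, experts, gater) for x_i in X])
    return np.sum((f - np.asarray(y)) ** 2)
```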
4 Related Models

The Bayesian Committee Machine [8] also scales well with the
number of training data, but it is in fact a transductive method as it cannot operate on a single
test example. As in the previous case, this algorithm assigns the examples randomly to the
experts (however, the Bayesian framework would in principle allow finding better assignments).
Regarding our proposed mixture of SVMs, if the number of experts grows with the number
of examples, and the number of outer loop iterations is a constant, then the total training
time of the experts scales linearly with the number of examples. Indeed, given N the total
number of examples, choose the number of experts M such that the ratio N/M is a constant r.
Then, if k is the number of outer loop iterations, and if the training time for an SVM with r
examples is O(r^α) (empirically, α is slightly above 2), the total training time of the experts is
O(k r^α M) = O(k r^(α-1) N), where k, r and α are constants, which gives a total training time
of O(N). In particular, for α = 2 this gives O(krN). The actual total training time should
however also include k times the training time of the gater, which may potentially grow more
rapidly than O(N). However, this did not appear to be the case in our experiments, thus yielding
apparent linear training time. Future work will focus on methods to reduce the gater training
time and guarantee linear training time per outer loop iteration.
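As an illustration of this argument, here is a schematic and hypothetical rendering of the outer loop (the interfaces train_svm, train_gater and reassign are placeholders we introduce; the paper does not give pseudocode, and the exact reassignment rule is abstracted away): each of the M experts is trained on about r = N/M examples, which can be done in parallel, and the gater is then retrained, for k outer iterations.

```python
import numpy as np
from multiprocessing import Pool

def train_mixture(X, y, M, k, train_svm, train_gater, reassign):
    """Schematic outer loop; all argument interfaces are hypothetical.

    X, y        : N training examples (NumPy arrays) and their labels
    M           : number of experts, so each expert sees about r = N / M examples
    k           : number of outer loop iterations
    train_svm   : fits one SVM on a subset, returns a scoring function s_m
    train_gater : fits the gater (e.g. an MLP minimizing cost (7)), returns w(.)
    reassign    : produces new per-expert index subsets from the current mixture
    """
    N = len(X)
    # Initial random assignment of the N examples to the M experts.
    subsets = np.array_split(np.random.permutation(N), M)
    experts, gater = None, None
    for _ in range(k):
        # Each expert is trained on only r = N / M examples; the M trainings are
        # independent, so they can run in parallel (here with local processes,
        # in practice possibly one machine per expert).
        with Pool(processes=M) as pool:
            experts = pool.starmap(train_svm,
                                   [(X[idx], y[idx]) for idx in subsets])
        # The gater is then trained on the whole training set given the experts.
        gater = train_gater(X, y, experts)
        # Examples are redistributed among experts before the next iteration.
        subsets = reassign(X, y, experts, gater, M)
    return experts, gater
```

The design point of this sketch is simply that the expensive O(r^α) SVM trainings never see more than r examples each, which is what makes the total expert training time linear in N for fixed r and k.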
5 Experiments
In this section, we present three sets of experiments comparing the new mixture of SVMs to
other machine learning algorithms. Note that all the SVMs in these experiments have been
trained using SVMTorch [2].
For the first large-scale experiment, we used the Forest benchmark dataset, with the following setup:

- We kept a separate test set of 50,000 examples to compare the best mixture of SVMs
  to other learning algorithms.
- We used a validation set of 10,000 examples to select the best mixture of SVMs, varying
  the number of experts and the number of hidden units in the gater.
- We trained our models on different training sets, using from 100,000 to 400,000 examples.
- The mixtures had from 10 to 50 expert SVMs with a Gaussian kernel, and the gater was
  an MLP with between 25 and 500 hidden units.
Figure 1: Comparison of the decision surfaces obtained by (a) a linear SVM, (b) a Gaussian
SVM, and (c) a linear mixture of two linear SVMs, on a two-dimensional classification toy
problem.
Note that since the number of examples was quite large, we selected the internal training parameters, such as the σ of the Gaussian kernel of the SVMs or the learning rate of the gater,
using a held-out portion of the training set. We compared our models to:
- a single MLP, where the number of hidden units was selected by cross-validation between
  25 and 250 units,
- a single SVM, where the σ parameter of the kernel was also selected by cross-validation,
- a mixture of SVMs where the gater was replaced by a constant vector, assigning the
  same weight value to every expert.
Table 1 gives the results of a first series of experiments with a fixed training set of 100,000
examples. To select among the variants of the gated SVM mixture, we considered performance
over the validation set as well as training time. All the SVMs used σ = 1.7. The selected model
had 50 experts and a gater with 150 hidden units. A model with 500 hidden units would have
given a performance of 8.1% over the test set, but would have taken 621 minutes on one machine
(and 388 minutes on 50 machines).
Model                  Train Error (%)   Test Error (%)   Time (min, 1 cpu)   Time (min, 50 cpu)
one MLP                17.56             18.15            12                  -
one SVM                16.03             16.76            3231                -
uniform SVM mixture    19.69             20.31            85                  2
gated SVM mixture      5.91              9.28             237                 73

Table 1: Comparison of performance between an MLP (100 hidden units), a single SVM, a
uniform SVM mixture where the gater always outputs the same value for each expert, and finally
a mixture of SVMs as proposed in this paper.
As can be seen, the gated SVM mixture outperformed all the other models in terms of training and test error.
Note that the training error of the single SVM is high because its hyper-parameters were selected
to minimize error on the validation set (other values could yield much lower training error but
larger test error). The mixture was also much faster, even on one machine, than the single SVM, and since the
mixture can easily be parallelized (each expert can be trained separately), we also report
the time it took to train on 50 machines. In a first attempt to understand these results, one
can at least say that the power of the model does not lie only in the MLP gater, since a single
MLP performed poorly; nor does it lie only in the use of SVMs, since a single SVM was not
as good as the gated mixture; nor is it only due to dividing the problem into many
sub-problems, since the uniform mixture also performed badly. It seems to be a combination of
all these elements.
We also ran a series of experiments in order to see the influence of the number of hidden units
of the gater as well as the number of experts in the mixture. Figure 2 shows the validation error
of different mixtures of SVMs, where the number of hidden units varied from 25 to 500 and the
number of experts varied from 10 to 50. There is a clear performance improvement when the
number of hidden units is increased, while the improvement with additional experts exists but
is not as strong. Note however that the training time also increases rapidly with the number of
hidden units, while it slightly decreases with the number of experts if one uses one computer per
expert.
[Figure 2 plot: validation error (%) as a function of the number of hidden units of the gater (25 to 500) and the number of experts (10 to 50).]
Figure 2: Comparison of the validation error of different mixtures of SVMs with various numbers
of hidden units and experts.
In order to find out how the algorithm scales with respect to the number of examples, we then
compared the same mixture of experts (50 experts, 150 hidden units in the gater) on different
training set sizes. Figure 3 shows the validation error and the training time of the mixture of SVMs trained on training
sets of sizes from 100,000 to 400,000 examples. It seems that, at least in this range and for this particular
dataset, the training time of the mixture of SVMs scales linearly with respect to the number of examples, and not
quadratically as for a classical SVM. It is interesting to see, for instance, that the mixture of SVMs
was able to solve a problem of 400,000 examples in less than 7 hours (on 50 computers), while it
would have taken more than one month to solve the same problem with a single SVM.
Finally, Figure 4 shows the evolution of the training and validation errors of a mixture of 50
SVMs gated by an MLP with 150 hidden units, during 5 iterations of the algorithm. This
should convince the reader that the outer loop of the algorithm is essential in order to obtain good performance.
It is also clear that the empirical convergence of the outer loop is extremely rapid.
[Figure 3 plot: training time (min) and validation error (%) as a function of the number of training examples (100,000 to 400,000). Figure 4 plot: error (%) as a function of the number of training iterations (1 to 5).]
For the second large-scale experiment, we used a speech database [1] and
turned it into a binary classification problem where the task was to separate silence frames from
non-silence frames. The total number of frames was around 540,000. The training set
contained 100,000 randomly chosen frames out of the first 400,000 frames. The disjoint validation set also contained 10,000 randomly chosen frames out of the first 400,000 frames. Finally,
the test set contained 50,000 randomly chosen frames out of the last 140,000 frames. Note that
the validation set was used here to select the number of experts in the mixture, the number of
hidden units in the gater, and σ. Each frame was parameterized using standard methods used
in speech recognition (J-RASTA coefficients, with first and second temporal derivatives) and was
thus described by 45 coefficients; we used an input window of three frames, yielding
135 input features per example.
Table 2 shows a comparison between a single SVM and a mixture of SVMs on this dataset. The
number of experts in the mixture was set to 50, the number of hidden units of the gater was set
to 50, and the σ of the SVMs was set to 3.0. As can be seen, the mixture of SVMs was again
many times faster than the single SVM (even on 1 cpu only) but yielded similar generalization
performance.
Model                  Train Error (%)   Test Error (%)   Time (min, 1 cpu)   Time (min, 50 cpu)
one SVM                0.98              7.57             6787                -
gated SVM mixture      4.41              7.32             851                 65

Table 2: Comparison of performance between a single SVM and a mixture of SVMs on the
speech dataset.
6 Conclusion
In this paper we have presented a new algorithm to train a mixture of SVMs that gave very good
results compared to classical SVMs, either in terms of training time or in terms of generalization performance,
on two large-scale difficult databases. Moreover, the algorithm appears to scale linearly with
the number of examples, at least between 100,000 and 400,000 examples.
These results are extremely encouraging and suggest that the proposed method could allow
training SVM-like models for very large multi-million-example datasets in a reasonable time. If training
the neural network gater with stochastic gradient descent takes time that grows much less than
quadratically, as we conjecture to be the case for very large datasets (in order to reach a good enough
solution), then the whole method is clearly sub-quadratic in training time with respect to the
number of training examples. Future work will address several questions: how can linear training
time be guaranteed for the gater as well as for the experts? Can better results be obtained by
tuning the hyper-parameters of each expert separately? Does the approach work well for other
types of experts?
Acknowledgments
RC would like to thank the Swiss NSF for financial support (project FN2100-061234.00). YB
would like to thank the NSERC funding agency and NCM2 network for support.
References
[1] R. A. Cole, M. Noel, T. Lander, and T. Durham. New telephone speech corpora at CSLU.
    Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 1:821-824, 1995.
[2] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression
    problems. Journal of Machine Learning Research, 1:143-160, 2001.
[3] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[4] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local
    experts. Neural Computation, 3(1):79-87, 1991.
[5] J. T. Kwok. Support vector mixture for classification and regression problems. In Proceedings
    of the International Conference on Pattern Recognition (ICPR), pages 255-258, Brisbane,
    Queensland, Australia, 1998.
[6] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to
    face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages
    130-136, San Juan, Puerto Rico, 1997.
[7] A. Rida, A. Labbi, and C. Pellegrini. Local experts combination through density decomposition.
    In International Workshop on AI and Statistics (Uncertainty'99). Morgan Kaufmann, 1999.
[8] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719-2741, 2000.
[9] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 1995.