Large-Scale Image Classification: Fast Feature Extraction and SVM Training
Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour and Kai Yu
NEC Laboratories America, Cupertino, CA 95014
Figure 1. The comparison of the ImageNet dataset with other benchmark datasets for image classification (MNIST, PASCAL, Caltech101, Caltech256, LabelMe), plotted as the number of classes versus the number of data samples. The ImageNet dataset is significantly larger in terms of both the number of data samples (1.2M) and the number of classes (1000).

Figure 2. The overview of our large-scale image classification system. This system represents an image using a bag-of-words (BoW) model and performs classification using a linear SVM classifier. Given an input image, the system first extracts dense local descriptors, HOG [5] or LBP (local binary pattern [22]). Then, each local descriptor is coded using either local coordinate coding (LCC) [26] or Gaussian model supervector coding [28]. The codes of the descriptors are then passed to weighted pooling or max-pooling with spatial pyramid matching (SPM) to form a vector representing the image. Finally, the feature vector is fed to a linear SVM for classification.

have at most 30,000 or so images, and it is still feasible to exploit kernel methods for training nonlinear classifiers, which often provide state-of-the-art performance. In contrast, kernel methods are prohibitively expensive for the ImageNet dataset, which consists of 1.2 million images. Therefore, a key new challenge for ImageNet large-scale image classification is how to efficiently extract image features and train classifiers without compromising performance. This paper shows how we address this challenge and achieve, so far, the state-of-the-art classification performance on the ImageNet dataset.

The major contribution of this paper is to show how to train an image classification system on large-scale datasets at a system level. We develop a fast feature extraction scheme using Hadoop [21]. More importantly, we develop a parallel averaging stochastic gradient descent (ASGD) algorithm with proper step size scheduling to achieve fast SVM training.

2. Classification system overview

For ImageNet large-scale image classification, we employ the classification system shown in Fig. 2. This system follows the approaches described in a number of previous works [24, 28] that showed state-of-the-art performance on medium-scale image classification datasets (such as PASCAL VOC and Caltech101&256). Here, we attempt to integrate the advantages of those previous systems. The contribution of this paper is not to propose a new classification paradigm but to develop efficient algorithms that attain performance on the large-scale ImageNet dataset similar to that achieved by state-of-the-art methods on medium-scale datasets.

Extending the methods for medium-scale datasets to the large-scale ImageNet dataset is not easy. For the reported best performers on the medium-scale datasets [28, 24], extracting image features from one image takes at least a couple of seconds (and even minutes [24]). Even if feature extraction took only 1 second per image, in total it would take 1.2 × 10^6 seconds ≈ 14 days. Even more challenging is SVM training. Let us use PASCAL VOC 2010 for comparison. The PASCAL dataset consists of about 10,000 images in 20 classes. In our experience, training SVMs for this PASCAL dataset (e.g., using LIBLINEAR [8]) would take more than 1 hour if we use the features employed in those state-of-the-art methods (without dimensionality reduction, e.g., by the kernel trick). Scaling that one hour by 50 times as many classes and 120 times as many images, we would need at least 1 × 50 × 120 hours = 6,000 hours ≈ 250 days of computation – not counting the often most painful parts, memory constraints and file I/O bottlenecks. Indeed, we need new thinking on existing algorithms: mostly, more parallelization and efficiency in computation, and faster convergence for iterative algorithms, particularly SVM training. In the following two sections, Section 3 and Section 4, we show how to implement this new thinking in image feature extraction and SVM training, which are the two major functional blocks in our classification system (as shown in Fig. 2).
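The scaling arithmetic above can be checked in a few lines; this is only a sanity check on the numbers quoted in the text (1 second per image for feature extraction, 1 hour of SVM training on PASCAL), not a measurement.

# Sanity check of the back-of-envelope estimates in Section 2.
SECONDS_PER_DAY = 86400.0

# Feature extraction: 1.2 million images at (optimistically) 1 second each.
feature_days = 1.2e6 * 1.0 / SECONDS_PER_DAY
print("feature extraction: %.1f days" % feature_days)            # ~13.9 days

# SVM training: ~1 hour on PASCAL (10k images, 20 classes), scaled to
# ImageNet (1.2M images, 1000 classes), assuming roughly linear cost.
train_hours = 1.0 * (1000 / 20) * (1.2e6 / 1e4)
print("SVM training: %.0f hours = %.0f days" % (train_hours, train_hours / 24))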
3. Feature extraction

As shown in Fig. 2, given an input image, our system first extracts dense HOG (histogram of oriented gradients [5]) and LBP (local binary pattern [22]) local descriptors. Both features have proven successful in various vision tasks such as object classification, texture analysis and face recognition. HOG and LBP are complementary in the sense that HOG focuses more on shape information while LBP emphasizes texture information within each patch. The advantage of such a combination was also reported in [22] for the human detection task. For large images, we downsize them so that neither side exceeds 500 pixels. Such normalization not only considerably reduces the computational cost but, more importantly, makes the representation more robust to scale differences. We used three patch sizes for computing HOG and LBP, namely 16×16, 24×24 and 32×32. The multiple patch sizes provide richer coverage of different scales and make the features more invariant to scale changes.
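As a rough illustration of this step (not the authors' implementation), the following sketch extracts dense multi-scale HOG and LBP descriptors with scikit-image; the 500-pixel downsizing and the three patch sizes follow the text, while the sampling stride, the HOG cell layout and the LBP settings are assumptions.

# Minimal sketch of dense multi-scale HOG + LBP patch extraction (assumed
# parameters; illustrative only).
import numpy as np
from skimage import color, transform
from skimage.feature import hog, local_binary_pattern

def extract_dense_descriptors(image, patch_sizes=(16, 24, 32), step=8):
    gray = color.rgb2gray(image) if image.ndim == 3 else image.astype(float)
    # Downsize so that neither side exceeds 500 pixels (Section 3).
    scale = min(1.0, 500.0 / max(gray.shape))
    if scale < 1.0:
        gray = transform.rescale(gray, scale, anti_aliasing=True)
    descriptors = []
    for p in patch_sizes:
        for y in range(0, gray.shape[0] - p + 1, step):
            for x in range(0, gray.shape[1] - p + 1, step):
                patch = gray[y:y + p, x:x + p]
                # HOG over the patch (shape information): 2x2 cells, 9 bins.
                h = hog(patch, orientations=9,
                        pixels_per_cell=(p // 2, p // 2),
                        cells_per_block=(2, 2))
                # Uniform LBP histogram over the patch (texture information).
                lbp = local_binary_pattern(patch, P=8, R=1, method='uniform')
                l, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
                descriptors.append(np.concatenate([h, l]))
    return np.array(descriptors)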
After extracting dense local image descriptors, denoted by z ∈ R^d, we perform the 'coding' and 'pooling' steps shown in Fig. 2: the coding step encodes each local descriptor z via a nonlinear feature mapping into a new space, and the pooling step aggregates the coding results falling in a local region into a single vector. We apply two state-of-the-art 'coding + pooling' pipelines in our system: one is based on local coordinate coding (LCC) [26], and the other is based on super-vector coding (SVC) [28]. For simplicity, we assume the pooling is global; spatial pyramid pooling is simply implemented by applying the same operation independently within each partitioned block of the image.

3.1. Local Coordinate Coding (LCC)

Let B = [b_1, ..., b_p] ∈ R^{d×p} be the codebook, where d is the dimensionality of the descriptors z and p is the size of the codebook. Like many coding methods, LCC seeks a linear combination of bases in B to reconstruct z, namely z ≈ Bα, and then uses the coefficients α as the coding result for z. Typically α is sparse and its dimensionality is higher than that of z. We note that the mapping φ(z) from z to α is usually nonlinear. The theory of LCC points out that, under a mild manifold assumption, a good coding should satisfy two properties:

• The approximation z ≈ Bα is sufficiently accurate;

• The coding α should be sufficiently local – only those bases close to z are activated.

Based on this theory, we develop a very simple algorithm here (a code sketch is given at the end of this subsection). We first use the K-means algorithm to learn a codebook B, and then encode z as follows:

1. Ensure sufficient locality: find z's κ nearest neighbors in B, typically with κ = 20, and denote the found bases as B_z ∈ R^{d×κ};

2. Ensure tight approximation: solve the optimization problem

   \min_{\alpha_z} \| z - B_z \alpha_z \|^2, \quad \text{subject to } \alpha_z^\top e = 1,    (1)

   where e is a vector of ones. The problem has a closed-form solution.

Then the coding result α ∈ R^p of z is obtained by placing the elements of α_z into the corresponding positions of a p-dimensional vector and leaving the rest as zeros. The algorithm can be seen as one way of sparse coding, because α is very sparse, but the implementation is much simpler and the computation is much faster than traditional sparse coding, because there is no need to solve an L1-norm regularized optimization problem. On the other hand, we empirically find that the performance of this simple LCC coding is often comparable to or better than traditional sparse coding for image classification. In addition, the algorithm can also be explained as a simple extension of vector quantization (VQ) coding, which can be recovered by setting κ = 1.
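For concreteness, here is a minimal NumPy sketch of this encoder, assuming a K-means codebook is already available; it illustrates steps 1–2 and the closed-form solution of Eq. (1), and is not the authors' implementation. The small ridge term is an added assumption for numerical stability.

# Sketch of the LCC encoder of Section 3.1 (illustrative, assumed defaults).
import numpy as np

def lcc_encode(z, B, kappa=20, ridge=1e-6):
    """z: (d,) descriptor; B: (d, p) codebook learned by K-means."""
    d, p = B.shape
    # Step 1: locality -- pick the kappa nearest codewords.
    dists = np.sum((B - z[:, None]) ** 2, axis=0)
    idx = np.argsort(dists)[:kappa]
    Bz = B[:, idx]                                    # (d, kappa)
    # Step 2: solve min ||z - Bz a||^2  s.t.  sum(a) = 1  (Eq. 1).
    # Under the constraint, z - Bz a = -(Bz - z 1^T) a, so the minimizer is
    # proportional to C^{-1} 1 with C the local covariance, then normalized.
    diff = Bz - z[:, None]                            # (d, kappa)
    G = diff.T @ diff
    C = G + ridge * np.trace(G) * np.eye(kappa)       # ridge for stability
    a = np.linalg.solve(C, np.ones(kappa))
    a /= a.sum()
    # Scatter the kappa coefficients back into a length-p sparse code.
    alpha = np.zeros(p)
    alpha[idx] = a
    return alpha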
3.2. Super-vector Coding (SVC)

SVC is another way to extend VQ, which exploits the geometry of the data distribution. Suppose a codebook B = [b_1, ..., b_p] ∈ R^{d×p} is obtained by running the K-means algorithm. For a descriptor z, the coding procedure is:

1. Find z's nearest basis vector in B, whose index is i = \arg\min_j \| z - b_j \|^2;

2. Obtain the VQ coding γ ∈ R^p, where its i-th element is one and all others are zeros;

3. Obtain the SV coding result

   \beta = \bigl[\, \gamma_1 s,\; \gamma_1 (z - b_1)^\top,\; \ldots,\; \gamma_p s,\; \gamma_p (z - b_p)^\top \,\bigr]^\top,    (2)

   where β ∈ R^{(d+1)p} and s is a predefined small constant.

The SVC can be seen as expanding VQ with local tangent directions, and is thus a smoother coding scheme.

At the pooling step, a linear pooling method has been derived by smoothing the Bhattacharyya kernel. Let Z = [z_i]_{i=1}^n be the set of local descriptors of an image, and [β_i]_{i=1}^n be their SV codes. Assigning each z_i to one of the p vector quantization bins, we partition Z into p groups, with sizes proportional to ω_k, where \sum_{k=1}^{p} \omega_k = 1. Then the pooling result for this image is

   x = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{\omega_{\delta(i)}}}\, \beta_i,

where δ(i) indicates the index of the group that z_i belongs to.
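A compact NumPy sketch of SVC and the weighted pooling above, assuming a K-means codebook B is given; it is illustrative only (the value of the constant s, and forming the codes densely rather than sparsely, are assumptions rather than the authors' implementation).

# Sketch of super-vector coding (Eq. 2) and the pooling step of Section 3.2.
import numpy as np

def svc_encode(z, B, s=1e-3):
    """z: (d,) descriptor; B: (d, p) K-means codebook. Returns ((d+1)p,) code."""
    d, p = B.shape
    i = np.argmin(np.sum((B - z[:, None]) ** 2, axis=0))  # nearest codeword
    beta = np.zeros((p, d + 1))
    beta[i, 0] = s                     # the gamma_i * s entry
    beta[i, 1:] = z - B[:, i]          # local tangent direction gamma_i (z - b_i)
    return beta.ravel(), i

def svc_pool(Z, B, s=1e-3):
    """Z: (n, d) descriptors of one image. Implements the pooling formula."""
    n, d = Z.shape
    p = B.shape[1]
    codes, assign = zip(*(svc_encode(z, B, s) for z in Z))
    assign = np.array(assign)
    # omega_k: fraction of descriptors falling into bin k.
    omega = np.bincount(assign, minlength=p) / float(n)
    weights = 1.0 / np.sqrt(omega[assign])              # 1 / sqrt(omega_delta(i))
    return (np.array(codes) * weights[:, None]).mean(axis=0)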
Set | Coding scheme           | Descriptor | Coding dimension | SPM | Feature dimension | Data set size (GB)
1   | Local coordinate coding | HOG+LBP    | 8,192            | 10  | 81,920            | 167*
2   | Local coordinate coding | HOG        | 16,384           | 10  | 163,840           | 187*
3   | Local coordinate coding | HOG+LBP    | 20,480           | 10  | 204,800           | 260*
4   | Super-vector coding     | HOG        | 32,768           | 8   | 262,144           | 1374
5   | Super-vector coding     | HOG+LBP    | 51,200           | 4   | 204,800           | 1073
6   | Super-vector coding     | HOG        | 65,536           | 4   | 262,144           | 1374

Table 1. Extracted feature sets from ImageNet images for SVM training. The data sets marked with * were compressed to reduce data size.
3.3. Parallel feature extraction

Depending on the coding settings, the feature extraction time for one image ranges from 2 seconds to 15 seconds on a dual quad-core 2GHz Intel Xeon CPU machine with 16GB memory (using a single thread). To process 1.2 million images, it would take around 27 to 208 days. Furthermore, feature extraction yields terabytes of data. It is very difficult for a single computer to handle such a huge computation and so much data. To speed up the computation and accommodate the data, we choose Apache Hadoop [21] to distribute the computation over 20 machines and store the data on the Hadoop distributed file system (HDFS). Hadoop is an open-source implementation of the MapReduce computation framework and a distributed file system [3]. Because there is no interdependence between feature extraction tasks, the MapReduce computation framework is very suitable for feature extraction. HDFS distributes images over all machines, and the computation is performed on the images located on local disks, which is called colocation. Colocation can speed up the computation by reducing the overall network I/O cost. The most important advantage of Hadoop is that it provides a reliable infrastructure for large-scale computation. For example, a task can be automatically restarted if it encounters an unexpected error, such as a network issue or memory shortage. In our Hadoop cluster, we only use 6 workers on each machine because of some limitations of the machines. Thus, we have 120 workers in total.

We extracted six sets of features in total, as shown in Table 1. With the help of Hadoop parallel computing, the feature sets took 6 hours to 2 days to compute, depending on the coding settings.
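The paper does not spell out the job configuration, but since feature extraction is embarrassingly parallel it maps naturally onto a map-only job. Below is a hypothetical Hadoop Streaming mapper sketch; the tab-separated input format, the 'features' helper module and the base64 output encoding are assumptions, not the authors' setup.

#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper for feature extraction (illustrative).
import base64
import sys

import numpy as np
from skimage import io

from features import extract_dense_descriptors, encode_and_pool  # assumed helpers

for line in sys.stdin:
    image_id, path = line.rstrip("\n").split("\t")
    try:
        image = io.imread(path)
        descriptors = extract_dense_descriptors(image)
        feature = encode_and_pool(descriptors)   # LCC or SVC coding + pooling + SPM
        payload = base64.b64encode(feature.astype(np.float32).tobytes())
        print("%s\t%s" % (image_id, payload.decode("ascii")))
    except Exception:
        # Report a counter and let Hadoop's retry/skip machinery handle bad images.
        sys.stderr.write("reporter:counter:feature_extraction,failed,1\n")

A job like this would be launched with the standard Hadoop Streaming jar and zero reducers, so the mapper output itself becomes the stored feature data on HDFS.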
4. ASGD for SVM training

After feature extraction, we ended up with terabytes of training data, as shown in Table 1. In general, the features produced by LCC are sparse even after pooling, and they are much smaller than those generated by super-vector coding. The largest feature set is 1.37 terabytes and non-sparse. While one could concatenate those features to learn an overall SVM for classification, we train SVMs separately for each feature set and then combine the SVM scores to yield the final classification. However, even training an SVM on the smallest feature set here (about 160 GB) is not easy. Furthermore, because the ImageNet dataset has 1000 categories, we need to train 1000 binary classifiers – we use one-against-all SVMs because training a joint multi-class SVM is even harder and may not have a significant performance advantage. To the best of our knowledge, training SVMs on such huge datasets with so many classes has never been reported before.

Although there exist many off-the-shelf SVM solvers, such as SVMlight [12], SVMperf [13] or LibSVM/LIBLINEAR [8], they are not feasible for such huge training data. This is because most of them are batch methods, which need to go through all the data to compute the gradient in each iteration and often need many iterations (hundreds or even thousands) to reach a reasonable solution. Even worse, most off-the-shelf batch-type SVM solvers require the training data to be pre-loaded into memory, which is impossible given the size of the training data we have. Therefore, such solvers are unsuitable for our SVM training. Indeed, LIBLINEAR recently released an extended version that explicitly addresses the memory issue [25]. We tested it with a simplified image feature set (HOG descriptor only, with a coding dimension of 4,096, which generated 80GB of training data). However, even on such a small dataset (as compared to our largest one, 1.37TB), the LIBLINEAR solver was not able to provide useful results after 2 weeks of running on a dual quad-core 2GHz Intel Xeon CPU machine with 16GB memory. The slowness of the LIBLINEAR solver is due not only to its inefficient inner-outer loop iterative structure but also to the fact that it needs to learn as many as 1000 binary classifiers.

Therefore, we need a (much) better SVM solver, which should be memory efficient, converge fast, and have a parallelization scheme to train the 1000 binary classifiers in parallel. To meet these needs, we propose a parallel averaging stochastic gradient descent (ASGD) algorithm for training SVM classifiers.

4.1. Averaging stochastic gradient descent

Let us use binary classification as an example to describe the ASGD algorithm. We have training data consisting of T feature-label pairs, denoted as {x_t, y_t}_{t=1}^T, where x_t is a d × 1 feature vector representing an image
and y_t ∈ {−1, +1} is the label of the image. Then, the cost function for binary SVM classification can be written as

   L = \sum_{t=1}^{T} L(w, b, x_t, y_t)
     = \sum_{t=1}^{T} \left[ \frac{\lambda}{2}\|w\|^2 + \max\bigl(0,\ 1 - y_t (w^\top x_t + b)\bigr) \right],    (3)

where w is the d × 1 SVM weight vector, λ (a nonnegative scalar) is a regularization parameter, and b (a scalar) is a bias term. The gradients with respect to w and b are

   \nabla_w L(w, b, x_t, y_t) = \begin{cases} \lambda w - y_t x_t & \text{if } \Delta_t < 1 \\ \lambda w & \text{if } \Delta_t \ge 1 \end{cases}
   \nabla_b L(w, b, x_t, y_t) = \begin{cases} -y_t & \text{if } \Delta_t < 1 \\ 0 & \text{if } \Delta_t \ge 1 \end{cases},    (4)

where Δ_t = y_t (w^\top x_t + b) is the margin of the data pair {x_t, y_t}.
The proposed ASGD algorithm is a modification of the conventional stochastic gradient descent (SGD) algorithm [15, 27]. In conventional SGD, training data are fed to the algorithm one by one, and the update rules for w and b are

   w_t = (1 - \lambda\eta) w_{t-1} + \eta y_t x_t
   b_t = b_{t-1} + \eta y_t    (5)

if the margin Δ_t is less than 1; otherwise, w_t = (1 − λη) w_{t−1} and b_t = b_{t−1}. The parameter η is the step size. This SGD algorithm is easy to implement, but it often takes many iterations to reach a good solution.

The ASGD algorithm adds an averaging scheme to the above SGD algorithm. The averaging scheme is

   \bar{w}_t = (1 - \alpha_t) \bar{w}_{t-1} + \alpha_t w_t
   \bar{b}_t = (1 - \alpha_t) \bar{b}_{t-1} + \alpha_t b_t,    (6)

where α_t (e.g., α_t = 1/t) is the averaging parameter. Note that the averaging scheme does not affect the SGD update rule in Eq. 5, and the averaged SVM weights, \bar{w}_T and \bar{b}_T, are output as the result of SVM training, instead of w_T and b_T.

The ASGD algorithm is known to have the potential to achieve the theoretically optimal performance of stochastic gradient descent algorithms. It was shown that, asymptotically, the ASGD algorithm achieves a convergence rate similar to that of the second-order stochastic gradient descent algorithm [18], which is often much faster than its first-order counterpart. However, unlike second-order SGD, which needs to compute the inverse of the Hessian matrix, the averaging is extremely simple to compute.

Despite the fact that the ASGD method has the potential to converge fast and is simple to implement, it has not been popular. We believe there are two main reasons. First, the ASGD algorithm achieves its asymptotic convergence property (gaining performance similar to second-order stochastic gradient descent) only when the number of data samples is sufficiently large; in fact, with insufficient data samples, ASGD can be inferior to regular SGD. This probably explains why the superiority of the ASGD method may not be observed when dealing with medium-scale data. Second, for the ASGD algorithm to achieve fast convergence, the step size η needs to be carefully scheduled. We adopt the following step size scheduling [23]:

   \eta = \frac{\eta_0}{(1 + \gamma \eta_0 t)^c},    (7)

where η_0 (e.g., η_0 = 10^{-2}), γ and c are positive, problem-dependent constants. Typical values for c are 1 or 2/3. Our analysis (not shown in this paper for brevity) indicates that it is a good strategy to set γ to the smallest eigenvalue of the Hessian matrix of the stochastic objective function. Therefore, for solving the SVM problem in Eq. 3, we set γ = λ for the step size in Eq. 7.
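Putting Eqs. (3)–(7) together, a minimal single-pass NumPy sketch of the binary ASGD SVM solver could look as follows. It is illustrative only: the hyperparameter defaults are assumptions, and the data are assumed to fit in a NumPy array here, whereas the actual system streams terabytes of features.

# Plain ASGD for the binary SVM of Eq. (3): SGD updates (Eq. 5), running
# average (Eq. 6), step-size schedule (Eq. 7) with gamma = lambda.
import numpy as np

def asgd_svm(X, y, lam=1e-5, eta0=1e-2, c=1.0, epochs=1):
    n, d = X.shape
    w = np.zeros(d); b = 0.0            # SGD iterate
    w_avg = np.zeros(d); b_avg = 0.0    # averaged iterate (the output)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / (1.0 + lam * eta0 * t) ** c   # Eq. (7), gamma = lambda
            alpha = 1.0 / t                            # averaging weight
            margin = y[i] * (w @ X[i] + b)
            w *= (1.0 - lam * eta)                     # Eq. (5)
            if margin < 1.0:
                w += eta * y[i] * X[i]
                b += eta * y[i]
            w_avg += alpha * (w - w_avg)               # Eq. (6)
            b_avg += alpha * (b - b_avg)
    return w_avg, b_avg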
There is an important implementation trick that significantly reduces the computational cost of each ASGD iteration. A plain implementation of ASGD needs five scalar-vector multiplications or dot products per iteration: one for computing the margin, two for updating w_t (Eq. 5) and two for the averaging (Eq. 6). We choose to perform the following variable transform:

   w_t = P^t_{1,1} v_t
   \bar{w}_t = P^t_{2,1} v_t + P^t_{2,2} u_t,    (8)

where P_t = \begin{bmatrix} P^t_{1,1} & 0 \\ P^t_{2,1} & P^t_{2,2} \end{bmatrix} is a 2 × 2 projection matrix, and v_t and u_t are updated in the following manner:

   v_t = v_{t-1} + \eta y_t R^t_{1,1} x_t
   u_t = u_{t-1} + \eta y_t (R^t_{2,1} + \alpha_t R^t_{2,2}) x_t,    (9)

where R_t = \begin{bmatrix} R^t_{1,1} & 0 \\ R^t_{2,1} & R^t_{2,2} \end{bmatrix} = P_t^{-1}, and P_t = T_t P_{t-1} with T_t = \begin{bmatrix} 1 - \lambda\eta & 0 \\ \alpha_t (1 - \lambda\eta) & 1 - \alpha_t \end{bmatrix}, P_1 being the identity matrix, w_1 = v_1 and \bar{w}_1 = u_1. It is easy to check that the update in Eq. 8 is equivalent to the updates in Eq. 5 and Eq. 6, but with only three scalar-vector multiplications or dot products: one for computing the margin and two for the computation in Eq. 9 – the transform in Eq. 8 is not computed until the last iteration, when the result is output.
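A sketch of how this trick can be implemented is shown below, again as an illustration under assumed hyperparameters rather than the authors' code; the 2 × 2 matrices P_t and R_t are cheap to maintain, and the full weight vectors are recovered only once at the end.

# ASGD with the variable transform of Eqs. (8)-(9): one dot product and two
# vector additions per iteration. The bias is a scalar and is updated directly.
import numpy as np

def asgd_svm_fast(X, y, lam=1e-5, eta0=1e-2, c=1.0):
    n, d = X.shape
    v = np.zeros(d); u = np.zeros(d)
    b = 0.0; b_avg = 0.0
    P = np.eye(2)                                    # initial projection matrix
    for t in range(1, n + 1):
        x, yt = X[t - 1], y[t - 1]
        eta = eta0 / (1.0 + lam * eta0 * t) ** c     # Eq. (7)
        alpha = 1.0 / t
        margin = yt * (P[0, 0] * (v @ x) + b)        # w . x = P11 * (v . x)
        T = np.array([[1.0 - lam * eta, 0.0],
                      [alpha * (1.0 - lam * eta), 1.0 - alpha]])
        P = T @ P                                    # P_t = T_t P_{t-1}
        R = np.linalg.inv(P)                         # R_t = P_t^{-1}, 2x2 only
        if margin < 1.0:
            v += eta * yt * R[0, 0] * x              # Eq. (9)
            u += eta * yt * (R[1, 0] + alpha * R[1, 1]) * x
            b += eta * yt
        b_avg += alpha * (b - b_avg)                 # scalar average of the bias
    # Recover w_T and its average only once, at the end (Eq. 8).
    w = P[0, 0] * v
    w_avg = P[1, 0] * v + P[1, 1] * u
    return w_avg, b_avg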
4.2. Parallel training

Another important issue is how to parallelize the computation for training the 1000 binary SVM classifiers [2].
Apparently, the training of the binary classifiers can be done independently. However, in contrast to the case where a single machine is used to train all classifiers and the major bottleneck is computation, using a large number of machines for training suffers from file I/O bottlenecks, since the training programs all need to read the same terabytes of training data. The training data is shared by all the programs on a machine through careful memory sharing. Therefore, the multiple programs only need to load the data once (for each epoch). Such a memory sharing scheme significantly reduces file loading traffic and speeds up SVM training dramatically.
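The paper gives no implementation details for this memory sharing, so the following is a hypothetical sketch of the idea: several worker processes on one machine train disjoint subsets of the one-vs-all classifiers against a single read-only, memory-mapped feature matrix, so the features hit the disk once rather than once per classifier. File names, sizes and the worker layout are illustrative assumptions.

# Hypothetical per-machine memory sharing for one-vs-all SVM training.
import numpy as np
from multiprocessing import Process

N_CLASSES, N_WORKERS = 1000, 6            # 6 workers per machine, as in Section 3.3

def train_classifiers(class_ids, feature_file, label_file, n, d):
    X = np.memmap(feature_file, dtype=np.float32, mode="r", shape=(n, d))
    labels = np.load(label_file)          # one integer class label per image
    for c in class_ids:
        y = np.where(labels == c, 1.0, -1.0)   # one-vs-all labels
        w, b = asgd_svm(X, y)                  # the ASGD solver sketched in Sec. 4.1
        np.save("svm_%04d.npy" % c, np.concatenate([w, [b]]))

if __name__ == "__main__":
    n, d = 1200000, 81920                 # illustrative sizes only
    chunks = np.array_split(np.arange(N_CLASSES), N_WORKERS)
    procs = [Process(target=train_classifiers,
                     args=(chunk, "features.f32", "labels.npy", n, d))
             for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()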
5. Results

Figure 3. The convergence comparison between ASGD and regular SGD.

[Figure: Histogram of top 5 accuracy.]
Our system achieves state-of-the-art performance on the ImageNet dataset: 52.9% in classification accuracy and 71.9% in top-5 hit rate. The key factors in our system are fast feature extraction and SVM training. We developed a parallel averaging stochastic gradient descent (ASGD) algorithm for SVM training, which is able to handle terabytes of data and 1000 classes.

In this paper, we observed very fast convergence from the ASGD algorithm, but we are still not able to quantitatively connect this superior empirical performance with existing theoretical analysis, most of which focuses on the asymptotic convergence property of ASGD. We will study how many training data samples are needed for ASGD to enter its asymptotic convergence regime. Meanwhile, we plan to systematically compare the ASGD method with other SGD methods (such as Pegasos [20]) for large-scale image classification.

References

[1] http://yann.lecun.com/exdb/mnist/.
[2] E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing support vector machines on distributed computers. Advances in Neural Information Processing Systems, 20:16, 2007.
[3] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 19, page 281. The MIT Press, 2007.
[4] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9, June 2008.
[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[10] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 264–271, June 2003.
[11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[12] T. Joachims. Making large-scale SVM learning practical. LS8-Report 24, Universität Dortmund, 1998.
[13] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, 2006.
[14] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 2169–2178, June 2006.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[16] D. G. Lowe. Distinctive image features from scale invariant keypoints. Int'l Journal of Computer Vision, 60(2):91–110, 2004.
[17] T. Ojala, M. Pietikainen, and D. Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29:51–59, 1996.
[18] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30, July 1992.
[19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
[20] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, 2007.
[21] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2010.
[22] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
[23] W. Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. Technical Report 2009-L102, NEC Labs America, 2009.
[24] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[25] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, 2010.
[26] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In NIPS, 2009.
[27] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, 2004.
[28] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.
Figure 5. The top 5 hit rates on the 1000 categories in the ImageNet Challenge. The hit rate of each category is indicated by a red bar to the left of the image representing that category.