$$\min_{\bar z}\; \|z - B_z \bar z\|^2, \quad \text{subject to} \quad \bar z^\top e = 1, \qquad (1)$$
where $e$ is a vector of ones. The problem has a closed-form solution. Then the coding result $\gamma \in \mathbb{R}^p$ of $z$ is obtained by placing the elements of $\bar z$ into the corresponding positions of a $p$-dimensional vector and leaving the rest to be zeros.
The algorithm can be seen as one way of sparse coding, because $\gamma$ is very sparse. But the implementation is much simpler and the computation is much faster than traditional sparse coding, because there is no need to solve an L1-norm regularized optimization problem. On the other hand, we empirically find that the performance of this simple LCC coding is often comparable to or better than traditional sparse coding for image classification. In addition, the algorithm can also be explained as a simple extension of vector quantization (VQ) coding, which can be recovered by setting $K = 1$.
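To make the coding step concrete, here is a minimal NumPy sketch of LCC coding as described above. It assumes the usual recipe of selecting a few nearest codebook entries before solving Eq. 1 in closed form; the selection rule, the names `lcc_encode`, `K`, and `eps`, and the regularization are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def lcc_encode(z, B, K=5, eps=1e-8):
    """Encode descriptor z (d,) against codebook B (d, p); returns a sparse p-dim code."""
    d, p = B.shape
    # 1. Pick the K nearest codebook columns (Euclidean distance).
    dist = np.sum((B - z[:, None]) ** 2, axis=0)
    idx = np.argsort(dist)[:K]
    Bz = B[:, idx]                                   # d x K sub-codebook
    # 2. Closed form of  min_c ||z - Bz c||^2  s.t.  e^T c = 1.
    #    Under the constraint, z - Bz c = (z e^T - Bz) c, so the objective is c^T (D^T D) c.
    D = z[:, None] - Bz
    C = D.T @ D + eps * np.eye(K)                    # small ridge for numerical stability
    c = np.linalg.solve(C, np.ones(K))
    c /= c.sum()                                     # enforce e^T c = 1
    # 3. Scatter the K coefficients into a p-dimensional vector.
    gamma = np.zeros(p)
    gamma[idx] = c
    return gamma
```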
3.2. Super-vector Coding (SVC)
SVC is another way to extend VQ, which explores the geometry of the data distribution. Suppose a codebook $B = [b_1, \ldots, b_p] \in \mathbb{R}^{d \times p}$ is obtained by running the K-means algorithm. For a descriptor $z$, the coding procedure is

1. Find $z$'s nearest basis vector in $B$, whose index is $i = \arg\min_j \|z - b_j\|^2$;

2. Obtain the VQ coding $\gamma \in \mathbb{R}^p$, where its $i$-th element is one, and all others are zeros.

3. Obtain the SV coding result
$$\phi = \left[\gamma_1 s,\; \gamma_1 (z - b_1)^\top,\; \ldots,\; \gamma_p s,\; \gamma_p (z - b_p)^\top\right]^\top, \qquad (2)$$
where $\phi \in \mathbb{R}^{(d+1)p}$, and $s$ is a predefined small constant.
The SVC can be seen as expanding VQ with local tangent
directions, and is thus a smoother coding scheme.
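The three steps above translate almost directly into code. The sketch below is our own illustration of Eq. 2 in NumPy; the function name and the default value of `s` are assumptions, not the paper's implementation.

```python
import numpy as np

def svc_encode(z, B, s=1e-3):
    """Super-vector code of descriptor z (d,) w.r.t. codebook B (d, p); returns ((d+1)*p,)."""
    d, p = B.shape
    # 1. Nearest codebook column (hard VQ assignment).
    i = np.argmin(np.sum((B - z[:, None]) ** 2, axis=0))
    # 2./3. Stack [gamma_k * s, gamma_k * (z - b_k)] for every k; only block i is non-zero.
    phi = np.zeros((p, d + 1))
    phi[i, 0] = s
    phi[i, 1:] = z - B[:, i]
    return phi.ravel()
```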
At the pooling step, a linear pooling method has been derived by smoothing the Bhattacharyya kernel. Let $Z = [z_i]_{i=1}^n$ be the set of local descriptors of an image, and $[\phi_i]_{i=1}^n$ be their SV codes. Assigning the $z_i$ into the $p$ vector quantization bins partitions $Z$ into $p$ groups, with sizes proportional to $\omega_k$, $\sum_{k=1}^p \omega_k = 1$. Then the pooling result for this image is
$$x = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{\omega_{k(i)}}}\, \phi_i,$$
where $k(i)$ indicates the index of the group that $z_i$ belongs to.
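Below is a matching NumPy sketch of the pooling rule, again our own illustration rather than the paper's code; it assumes hard VQ assignment and the $1/\sqrt{\omega_{k(i)}}$ weighting reconstructed above, and it inlines the Eq. 2 code so that the single non-zero block per descriptor is accumulated directly.

```python
import numpy as np

def svc_pool(Z, B, s=1e-3):
    """Pool super-vector codes of an image's descriptors Z (n, d) into one ((d+1)*p,) vector."""
    n, d = Z.shape
    p = B.shape[1]
    # Hard-assign every descriptor to its nearest codeword.
    dists = ((Z[:, :, None] - B[None, :, :]) ** 2).sum(axis=1)   # (n, p)
    assign = dists.argmin(axis=1)                                # k(i) for each descriptor
    omega = np.bincount(assign, minlength=p) / float(n)          # group proportions, sum to 1
    x = np.zeros((p, d + 1))
    for zi, k in zip(Z, assign):
        # SV code of zi has a single non-zero block at position k (Eq. 2), weighted per the pooling rule.
        x[k, 0] += s / np.sqrt(omega[k])
        x[k, 1:] += (zi - B[:, k]) / np.sqrt(omega[k])
    return x.ravel() / n
```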
Sets | Coding scheme | Descriptor | Coding dimension | SPM | Feature dimension | Data set size (GB)
1 | Local coordinate coding | HOG+LBP | 8,192 | 10 | 81,920 | 167*
2 | Local coordinate coding | HOG | 16,384 | 10 | 163,840 | 187*
3 | Local coordinate coding | HOG+LBP | 20,480 | 10 | 204,800 | 260*
4 | Super-vector coding | HOG | 32,768 | 8 | 262,144 | 1374
5 | Super-vector coding | HOG+LBP | 51,200 | 4 | 204,800 | 1073
6 | Super-vector coding | HOG | 65,536 | 4 | 262,144 | 1374
Table 1. Extracted feature sets from ImageNet images for SVM training. The datasets marked with * were compressed to reduce data size.
3.3. Parallel feature extraction
Depending on coding settings, the feature extraction time for one image ranges from 2 seconds to 15 seconds on a dual quad-core 2GHz Intel Xeon CPU machine with 16 GB of memory (using a single thread). To process 1.2 million images, it would take around 27 to 208 days. Furthermore, feature extraction yields terabytes of data. It is very difficult for a single computer to handle such a huge amount of computation and data. To speed up the computation and accommodate the data, we choose Apache Hadoop [21] to distribute the computation over 20 machines and store the data on the Hadoop distributed file system (HDFS). Hadoop is an open-source implementation of the MapReduce computation framework and a distributed file system [3]. Because there is no interdependence among feature extraction tasks, the MapReduce framework is very suitable for feature extraction. HDFS distributes the images over all machines, and each task is performed on the images located on its machine's local disks, which is called colocation. Colocation speeds up the computation by reducing overall network I/O cost. The most important advantage of Hadoop is that it provides a reliable infrastructure for large-scale computation. For example, a task can be automatically restarted if it encounters unexpected errors, such as network issues or memory shortage. In our Hadoop cluster, we only use 6 workers on each machine because of some limitations of the machines. Thus, we have 120 workers in total.
In total, we extracted six sets of features, as shown in Table 1. With the help of Hadoop parallel computing, the feature sets took from 6 hours to 2 days to compute, depending on the coding settings.
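The paper does not show its Hadoop job, but the map-only pattern is easy to picture. The following hypothetical Hadoop Streaming-style mapper in Python is a sketch of it: each map task reads a shard of image paths from stdin, runs the Section 3 pipeline locally, and emits one feature line per image. The `extract_features` stub and the tab-separated output format are our own choices, not the authors' code.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper for a map-only (no reducer) feature extraction job.
import sys

def extract_features(image_path):
    # Placeholder for the real pipeline of Sec. 3 (dense HOG/LBP descriptors,
    # LCC or SVC coding, spatial-pyramid pooling); here we just return zeros.
    return [0.0] * 8

for line in sys.stdin:
    image_path = line.strip()
    if not image_path:
        continue
    feature = extract_features(image_path)
    # Emit "image_path <TAB> space-separated feature values".
    print("%s\t%s" % (image_path, " ".join("%g" % v for v in feature)))
```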
4. ASGD for SVM training
After feature extraction, we ended up with terabytes of training data, as shown in Table 1. In general, the features produced by LCC are sparse even after pooling, and they are much smaller than the ones generated by super-vector coding. The largest feature set is 1.37 terabytes and is non-sparse. While one may concatenate those features to learn an overall SVM for classification, we train SVMs separately for each feature set and then combine the SVM scores to yield the final classification. However, even training an SVM for the smallest feature set here (about 160 GB) would not be easy. Furthermore, because the ImageNet dataset has 1000 categories, we need to train 1000 binary classifiers; we use one-against-all SVMs because training a joint multi-class SVM is even harder and may not have a significant performance advantage. To the best of our knowledge, training SVMs on such huge datasets with so many classes has never been reported before.
Although there exist many off-the-shelf SVM solvers, such as SVM^light [12], SVM^perf [13] or LibSVM/LIBLINEAR [8], they are not feasible for such huge training data. This is because most of them are batch methods, which require going through all the data to compute the gradient in each iteration and often need many iterations (hundreds or even thousands) to reach a reasonable solution. Even worse, most off-the-shelf batch-type SVM solvers require pre-loading the training data into memory, which is impossible given the size of the training data we have. Therefore, such solvers are unsuitable for our SVM training. Indeed, LIBLINEAR recently released an extended version that explicitly considers the memory issue [25]. We tested it with a simplified image feature set (HOG descriptor only with a coding dimension of 4,096, which generated 80 GB of training data). However, even on such a small dataset (compared to our largest one, 1.37 TB), the LIBLINEAR solver was not able to provide useful results after 2 weeks of running on a dual quad-core 2GHz Intel Xeon CPU machine with 16 GB of memory. The slowness of the LIBLINEAR solver is not only due to its inefficient inner-outer loop iterative structure, but also because it needs to learn as many as 1000 binary classifiers. Therefore, we need a (much) better SVM solver, which should be memory efficient, converge quickly, and have some parallelization scheme to train 1000 binary classifiers in parallel. To meet these needs, we propose a parallel averaging stochastic gradient descent (ASGD) algorithm for training SVM classifiers.
4.1. Averaging stochastic gradient descent
Let's use binary classification as an example for describing the ASGD algorithm [18][23]. We have training data consisting of $T$ feature-label pairs, denoted as $\{x_t, y_t\}_{t=1}^T$, where $x_t$ is a $d \times 1$ feature vector representing an image and $y_t \in \{-1, +1\}$ is the label of the image.
Then, the cost function for binary SVM classification can be written as
$$L = \sum_{t=1}^T L(w, b, x_t, y_t) = \sum_{t=1}^T \left[ \frac{\lambda}{2}\|w\|^2 + \max\left(0,\; 1 - y_t (w^\top x_t + b)\right) \right], \qquad (3)$$
where $w$ is the $d \times 1$ SVM weight vector, $\lambda$ (a nonnegative scalar) is a regularization parameter, and $b$ (a scalar) is a bias term.
Then, the gradients with respect to $w$ and $b$ are
$$\nabla_w L(w, b, x_t, y_t) = \begin{cases} \lambda w - y_t x_t & \text{if } m_t < 1 \\ \lambda w & \text{if } m_t \ge 1 \end{cases}$$
$$\nabla_b L(w, b, x_t, y_t) = \begin{cases} -y_t & \text{if } m_t < 1 \\ 0 & \text{if } m_t \ge 1 \end{cases}, \qquad (4)$$
where $m_t = y_t (w^\top x_t + b)$ is the margin of the data pair $\{x_t, y_t\}$.
The ASGD algorithm is a modification of conventional stochastic gradient descent (SGD) algorithms [15, 27]. In conventional SGD, training data are fed to the algorithm one by one, and the update rules for $w$ and $b$ are
$$w_t = (1 - \eta\lambda)\, w_{t-1} + \eta\, y_t x_t$$
$$b_t = b_{t-1} + \eta\, y_t \qquad (5)$$
if the margin $m_t$ is less than 1; otherwise, $w_t = (1 - \eta\lambda)\, w_{t-1}$ and $b_t = b_{t-1}$. The parameter $\eta$ is the step size. The above SGD algorithm is easy to implement, but it often takes many iterations to reach a good solution.
The ASGD algorithm adds an averaging scheme to the above SGD algorithm:
$$\bar w_t = (1 - \alpha_t)\, \bar w_{t-1} + \alpha_t w_t$$
$$\bar b_t = (1 - \alpha_t)\, \bar b_{t-1} + \alpha_t b_t, \qquad (6)$$
where $\alpha_t$ (e.g., $\alpha_t = 1/t$) is the averaging parameter. Note that the averaging scheme does not affect the SGD update rule in Eq. 5, and the averaged SVM weights $\bar w_T$ and $\bar b_T$, rather than $w_T$ and $b_T$, are output as the result of SVM training.
The ASGD algorithm is known to have the potential to achieve the theoretically optimal performance of stochastic gradient descent algorithms. It was shown that, asymptotically, the ASGD algorithm achieves a convergence rate similar to that of second-order stochastic gradient descent [18], which is often much faster than its first-order counterpart. However, unlike second-order SGD, which needs to compute the inverse of the Hessian matrix, the averaging is extremely simple to compute.
Despite the fact that the ASGD method has the potential to converge fast and is simple to implement, it has not been popular. We believe there are two main reasons. First, the ASGD algorithm achieves its asymptotic convergence property (gaining performance similar to second-order stochastic gradient descent) only when the number of data samples is sufficiently large. In fact, with insufficient data samples, ASGD can be inferior to regular SGD. This probably explains why the superiority of the ASGD method is often not observed when dealing with medium-scale data. Second, for the ASGD algorithm to achieve fast convergence, the step size needs to be carefully scheduled. We adopt the following step size scheduling [23]:
$$\eta = \eta_0 \,(1 + a\, \eta_0\, t)^{-c}, \qquad (7)$$
where $\eta_0$ (e.g., $\eta_0 = 10^{-2}$), $a$, and $c$ are positive constants, and they are problem-dependent. Typical values for $c$ are 1 or 2/3. Recent analysis [23] shows that it is a good strategy to set $a$ to be the smallest eigenvalue of the Hessian matrix of the stochastic objective function. Therefore, for solving the SVM problem in Eq. 3, we set $a = \lambda$ for the step size in Eq. 7.
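Putting Eqs. 3-7 together, one epoch of plain (untransformed) ASGD for a linear SVM can be sketched as follows. This is a single-machine illustration with assumed hyperparameter defaults (`lam`, `eta0`, `c`), not the authors' implementation; the speed-up trick of Eqs. 8-9, discussed next, is omitted here. Initialize `w` and `w_bar` to zero float vectors and `b = b_bar = 0.0`; after the final epoch, the averaged pair (`w_bar`, `b_bar`) is the classifier, per Eq. 6.

```python
import numpy as np

def asgd_epoch(X, y, w, b, w_bar, b_bar, lam=1e-5, eta0=1e-2, c=2.0/3.0, t0=0):
    """One pass of plain averaging SGD over (X, y) for a linear SVM (Eqs. 3-7)."""
    t = t0
    for x_t, y_t in zip(X, y):                          # X: d-dim float arrays, y: labels in {-1, +1}
        t += 1
        eta = eta0 * (1.0 + lam * eta0 * t) ** (-c)     # step-size schedule, Eq. 7 with a = lambda
        margin = y_t * (np.dot(w, x_t) + b)             # m_t in Eq. 4, using w_{t-1}, b_{t-1}
        w *= (1.0 - eta * lam)                          # SGD update, Eq. 5
        if margin < 1.0:
            w += eta * y_t * x_t
            b += eta * y_t
        alpha = 1.0 / t                                 # averaging weight, Eq. 6 (alpha_t = 1/t)
        w_bar += alpha * (w - w_bar)
        b_bar += alpha * (b - b_bar)
    return w, b, w_bar, b_bar, t
```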
There is an important implementation trick that significantly reduces the computation cost at each iteration of ASGD [23]. A plain implementation of ASGD would need five scalar-vector multiplications or dot products at each iteration: one for computing the margin, two for updating $w_t$ (Eq. 5), and two for averaging (Eq. 6). We choose to perform the following variable transform:
$$w_t = P_t^{1,1} v_t$$
$$\bar w_t = P_t^{2,1} v_t + P_t^{2,2} u_t, \qquad (8)$$
where $P_t = \begin{bmatrix} P_t^{1,1} & 0 \\ P_t^{2,1} & P_t^{2,2} \end{bmatrix}$ is a $2 \times 2$ projection matrix, and $v_t$ and $u_t$ are updated in the following manner:
$$v_t = v_{t-1} + \eta\, y_t\, R_t^{1,1} x_t;$$
$$u_t = u_{t-1} + \eta\, y_t\, (R_t^{2,1} + \alpha_t R_t^{2,2})\, x_t, \qquad (9)$$
where $R_t = \begin{bmatrix} R_t^{1,1} & 0 \\ R_t^{2,1} & R_t^{2,2} \end{bmatrix} = P_t^{-1}$, and $P_t = T_t P_{t-1}$ with $T_t = \begin{bmatrix} 1 - \eta\lambda & 0 \\ \alpha_t (1 - \eta\lambda) & 1 - \alpha_t \end{bmatrix}$, with $P_1$ being the identity matrix, $w_1 = v_1$ and $\bar w_1 = u_1$. It is easy to check that the update in Eq. 8 is equivalent to the updates in Eq. 5 and Eq. 6, but with only three scalar-vector multiplications or dot products: one for computing the margin, and two for the computation in Eq. 9; the transform in Eq. 8 is not computed until the last iteration, when the result is output.
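The same epoch can be written with the Eqs. 8-9 transform so that each iteration touches the long vectors only three times (one dot product for the margin and two scalar-vector updates). The sketch below is our own rendering of the trick, not the paper's code. It assumes `P` is initialized to the 2x2 identity, `v` and `u` to the initial (and averaged) weight vectors, and `b = b_bar = 0.0`; the averaging weight is kept strictly below 1 at the first step so that `P` stays invertible, a small deviation from the $\alpha_t = 1/t$ example. Up to that detail, unrolling the recursions reproduces Eqs. 5 and 6 exactly.

```python
import numpy as np

def asgd_epoch_fast(X, y, v, u, b, b_bar, P, lam=1e-5, eta0=1e-2, c=2.0/3.0, t0=0):
    """One ASGD epoch using the Eq. 8-9 transform (three long-vector operations per step)."""
    t = t0
    for x_t, y_t in zip(X, y):
        t += 1
        eta = eta0 * (1.0 + lam * eta0 * t) ** (-c)        # Eq. 7 with a = lambda
        alpha = 1.0 / (t + 1)                              # averaging weight, kept < 1 so P stays invertible
        margin = y_t * (P[0, 0] * np.dot(v, x_t) + b)      # margin uses w_{t-1} = P[0,0] * v (P not yet updated)
        # Fold the (1 - eta*lam) decay and the averaging into the 2x2 matrix: P_t = T_t P_{t-1}.
        T = np.array([[1.0 - eta * lam, 0.0],
                      [alpha * (1.0 - eta * lam), 1.0 - alpha]])
        P = T @ P
        R = np.linalg.inv(P)                               # 2x2 inverse, negligible cost
        if margin < 1.0:                                   # hinge-loss part of Eqs. 5 and 9
            v += eta * y_t * R[0, 0] * x_t
            u += eta * y_t * (R[1, 0] + alpha * R[1, 1]) * x_t
            b += eta * y_t
        b_bar += alpha * (b - b_bar)                       # bias is a scalar, averaged directly
    # Eq. 8 is applied only once, after the last iteration, to recover the actual weights.
    w = P[0, 0] * v
    w_bar = P[1, 0] * v + P[1, 1] * u
    return w, b, w_bar, b_bar, v, u, P, t
```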
4.2. Parallel training
Another important issue is how to parallelize the computation for training 1000 binary SVM classifiers [2].
Apparently, the training of the binary classifiers can be done independently. However, in contrast to the case where a single machine is used to train all classifiers and the major bottleneck is computation, using a large number of machines for training will suffer from a file I/O bottleneck, since one of our training datasets is as large as 1.37 terabytes. If the file I/O bandwidth is 20 MB/second, simply loading/copying the dataset would take 19 hours. Therefore, we want to load the data as few times as possible while not creating a computation bottleneck.
Our strategy here is to use memory sharing on each multicore machine. Each machine launches several programs to train different subsets of the 1000 binary classifiers, and the programs are synchronized to train on the same chunk of training data. The training data is shared by all the programs on a machine through careful memory sharing. Therefore, the multiple programs only need to load the data once (for each epoch). Such a memory sharing scheme significantly reduces file loading traffic and speeds up SVM training dramatically.
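The paper does not spell out the memory-sharing mechanism. One possible realization on a single multicore machine, sketched with Python's multiprocessing shared memory, is shown below: the chunk is loaded into shared memory once, and each of the 6 training programs attaches a zero-copy view and trains its own subset of the 1000 classifiers. The names, the toy data, and the per-class stand-in computation are illustrative only.

```python
import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

SHM_NAME = "feature_chunk"          # hypothetical name for one chunk of training features

def publish_chunk(chunk):
    """Load one chunk of training data into shared memory once per machine."""
    shm = SharedMemory(create=True, size=chunk.nbytes, name=SHM_NAME)
    np.ndarray(chunk.shape, dtype=chunk.dtype, buffer=shm.buf)[:] = chunk
    return shm

def worker(class_ids, shape, dtype):
    """One training program: sees the shared chunk without copying it, trains its classes."""
    shm = SharedMemory(name=SHM_NAME)                        # attach; no data copy
    X = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    for c in class_ids:
        _ = X.mean(axis=0)   # stand-in for one ASGD pass of classifier c over this chunk (Sec. 4.1)
    shm.close()

if __name__ == "__main__":
    chunk = np.random.rand(10000, 512).astype(np.float32)    # toy chunk standing in for real features
    shm = publish_chunk(chunk)
    groups = np.array_split(np.arange(1000), 6)              # 6 programs per machine, as in the paper
    procs = [Process(target=worker, args=(g, chunk.shape, chunk.dtype)) for g in groups]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
```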
5. Results
5.1. The performance of the ASGD method for SVM training
As mentioned above, the major challenge of large-scale ImageNet classification is training SVMs with terabytes of training data and as many as 1000 categories. This paper proposes a parallel ASGD method aimed at fast convergence and parallel computation. Fig. 3 shows the convergence comparison between the ASGD method and the regular SGD method. Both methods were run on dataset 5 in Table 1, which is about 1 terabyte in total. We see that the ASGD method converged very fast: it reached a fairly good solution after 5 epochs. In contrast, SGD (without averaging) converges much more slowly; it would take tens of epochs to reach a similarly good solution. For this specific dataset, each epoch took about 20 hours on three 8-core machines (with only 6 programs running in parallel on each machine due to some limitations). Therefore, ASGD took about 4 days to finish SVM training, while regular SGD would have taken weeks if not months.
5.2. ImageNet classification results
With the proposed ASGD method and 12 eight-core machines, we were able to train the 1000-class SVM classifiers for all 6 feature sets listed in Table 1 within one week. Classification on each feature set outputs a set of SVM scores, and we combined them linearly to yield the final prediction.
As a result, our classification system achieved 52.9% in classification accuracy and 71.8% in top 5 hit rate.
Figure 3. The convergence comparison between ASGD and regular SGD: top 5 hit rate (%) versus number of training epochs.
Figure 4. The histogram of the top 5 hit rate of the 1000 classes in the ImageNet dataset.
Indeed, we see a huge improvement in performance over the recently reported baseline [6], which achieved about 20% in classification rate. Fig. 4 shows the histogram of the top 5 hit rate over the 1000 classes. We see that the top 5 hit rate is mostly concentrated in the range of 60% to 90%, while it is over 90% for some classes and below 30% for others. The easy classes include odometer, monarch butterfly, cliff dwelling, lunar crater, bonsai, trolleybus, geyser, snowplow, etc.; the difficult classes include China tree, logwood tree, shingle oak, red beech, cap opener, Kentucky coffee tree, teak, alder tree, iron tree, grass pink, etc. The detailed top 5 hit rate for each of the 1000 classes is shown in Fig. 5.
6. Discussion
We have shown how to train an image classification system on the large-scale ImageNet dataset (1.2 million images) with many classes (1000 classes). We achieved state-of-the-art performance on the ImageNet dataset: 52.9% in classification accuracy and 71.9% in top 5 hit rate. The key factors in our system are fast feature extraction and SVM training. We developed a parallel averaging stochastic gradient descent (ASGD) algorithm for SVM training, which is able to handle terabytes of data and 1000 classes.
In this paper, we observed very fast convergence from the ASGD algorithm. However, we are still not able to quantitatively connect this superior empirical performance with existing theoretical analysis, most of which focuses on the asymptotic convergence properties of ASGD. We will study how many training data samples are needed for ASGD to enter its asymptotic convergence regime. The work in [23] has done some very interesting analysis in this regard. Meanwhile, we plan to systematically compare the ASGD method with other SGD methods (such as Pegasos [20]) for large-scale image classification.
References
[1] http://yann.lecun.com/exdb/mnist/.
[2] E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing support vector machines on distributed computers. Advances in Neural Information Processing Systems, 20:16, 2007.
[3] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, page 281. The MIT Press, 2007.
[4] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9, June 2008.
[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[10] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 264-271, Wisconsin, WI, June 16-22, 2003.
[11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[12] T. Joachims. Making large-scale SVM learning practical. LS8-Report 24, Universität Dortmund, 1998.
[13] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, 2006.
[14] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 2169-2178, New York City, June 17-22, 2006.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278-2324, 1998.
[16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int'l Journal of Computer Vision, 60(2):91-110, 2004.
[17] T. Ojala, M. Pietikainen, and D. Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29:51-59, 1996.
[18] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30, July 1992.
[19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157-173, 2008.
[20] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, 2007.
[21] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2010.
[22] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
[23] W. Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. Technical Report 2009-L102, NEC Labs America.
[24] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[25] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, 2010.
[26] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In NIPS '09, 2009.
[27] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, 2004.
[28] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.
Figure 5. The top 5 hit rates on the 1000 categories in the ImageNet Challenge. The hit rate of each category is indicated by a red bar to the left of the image representing that category.