0% found this document useful (0 votes)

16 views8 pages

Large-Scale Image Classification: Fast Feature Extraction and SVM Training

Uploaded by

vpblissolin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views8 pages

Large-Scale Image Classification: Fast Feature Extraction and SVM Training

Uploaded by

vpblissolin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Large-scale Image Classification:

Fast Feature Extraction and SVM Training

Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour and Kai Yu
NEC Laboratories America, Cupertino, CA 95014

Liangliang Cao and Thomas Huang

Beckman Institute, University of Illinois at Urbana-Champaign, IL 61801

Abstract of pixels to untrained computers, and the object it presents.

Therefore, there have been extensive research efforts on
Most research efforts on image classification so far developing effective visual object recognizers [10]. Along
have been focused on medium-scale datasets, which are the line, there are quite a few benchmark datasets for
often defined as datasets that can fit into the memory of a image classification, such as MNIST [1], Caltech 101 [9],
desktop (typically 4G∼48G). There are two main reasons Caltech 256 [11], PASCAL VOC [7], LabelMe[19], etc.
for the limited effort on large-scale image classification. Researchers have developed a wide spectrum of different
First, until the emergence of ImageNet dataset, there local descriptors [17, 16, 5, 22], bag-of-words models [14,
was almost no publicly available large-scale benchmark 24] and classification methods [4], and they compared
data for image classification. This is mostly because to the best available results on those publicly available
class labels are expensive to obtain. Second, large-scale datasets – for PASCAL VOC, many teams from all over
classification is hard because it poses more challenges the world participate in the PASCAL Challenge each year
than its medium-scale counterparts. A key challenge is to compete for the best performance. Such benchmarking
how to achieve efficiency in both feature extraction and activities have played an important role in pushing object
classifier training without compromising performance. This classification research forward in the past years.
paper is to show how we address this challenge using In recent years, there is a growing consensus that it
ImageNet dataset as an example. For feature extraction, we is necessary to build general purpose object recognizers
develop a Hadoop scheme that performs feature extraction that are able to recognize many different classes of objects
in parallel using hundreds of mappers. This allows us – e.g. this can be very useful for image/video tagging
to extract fairly sophisticated features (with dimensions and retrieval. Caltech 101/256 are the pioneer benchmark
being hundreds of thousands) on 1.2 million images within datasets on that front. Newly released ImageNet dataset [6]
one day. For SVM training, we develop a parallel goes a big step further, as shown in Fig. 1 – it further
averaging stochastic gradient descent (ASGD) algorithm for increases the number of classes to 10001, and it has more
training one-against-all 1000-class SVM classifiers. The than 1000 images for each class on average. Indeed, it is
ASGD algorithm is capable of dealing with terabytes of necessary to have so many images for each class to cover
training data and converges very fast – typically 5 epochs visual variance, such as lighting, orientation as well as
are sufficient. As a result, we achieve state-of-the-art fairly wild appearance difference within the same class –
performance on the ImageNet 1000-class classification, i.e., like different cars may look very differently although all
52.9% in classification accuracy and 71.8% in top 5 hit belong to the same class.
rate. However, compared to those previous medium-
scale datasets (such as PASCAL VOC datasets and
Caltech101&256, which can fit into desktop memory),
1. Introduction large-scale ImageNet dataset poses more challenges in
image classification. For example, those previous datesets
It is needless to say how important of image clas- 1 The overall ImageNet dataset consists of 11,231,732 labeled images
sification/recognition is in the field of computer vision of 15589 classes by October 2010. But here we only concern about the
– image recognition is essential for bridging the huge subset of ImageNet dataset (about 1.2 million images) that was used in
semantic gap between an image, which is simply a scatter 2010 ImageNet Large Scale Visual Recognition Challenge

1689
ImageNet
1000

800
# of classes

600

400
Caltech256
200 LabelMe
Caltech101
PASCAL MNIST
0
0.0 200.0k 400.0k 600.0k 800.0k 1.0M 1.2M
# of data samples Figure 2. The overview of our large-scale image classification
system. This system represents an image using a bag-of-words
Figure 1. The comparison of ImageNet dataset with other bench- (BoW) model and performs classification using a linear SVM
mark datasets for image classification. ImageNet dataset is classifier. Given an input image, the system first extracts dense
significantly larger in terms of both the number of data samples local descriptors, HOG [5] or LBP (local binary pattern [22]).
and the number of classes. Then, each local descriptor is coded either using local coordinate
coding (LCC) [26] or Gaussian model supervector coding [28].
The codes of the descriptors are then passed to weighted pooling
have at most 30,000 or so images, and it is still feasible to or max-pooling with spatial pyramid matching (SPM) to form a
exploit kernel methods for training nonlinear classifiers, vector for representing the image. Finally, the feature vector is fed
which often provide state-of-the-art performance. In to a linear SVM for classification.
contrast, the kernel methods are prohibitively expensive
for ImageNet dataset that consists of 1.2 million images.
Therefore, a key new challenge for the ImageNet large- large-scale imageNet dataset is not easy. For the reported
scale image classification is how to efficiently extract best performers on the medium-scale datasets [28, 24],
image features and train classifiers without compromising extracting image features on one image takes at least a
performance. This paper is to show how we address couple of seconds (and even minutes [24]). Even if it
the challenge and achieve so far the state-of-the-art takes 1 second per image for feature extraction, in total
classification performance on the ImageNet dataset. it would take 1.2 × 106 seconds ≈ 14 days. Even more
The major contribution of this paper is to show how to challenging is SVM training. Let’s use PASCAL VOC 2010
train an image classification system on large-scale datasets for comparison. The PASCAL dataset consists of about
in a system level. We develop a fast feature extraction 10,000 images in 20 classes. To our experience, training
scheme using Hadoop [21]. More importantly, we develop SVM for this PASCAL dataset (e.g. using LIBLINEAR [8])
a parallel averaging stochastic gradient descent (ASGD) would take more than 1 hour if we use the features that
algorithm with proper step size scheduling to achieve fast are employed in those state-of-the-art methods (without
SVM training. dimensionality reduction, e.g., by kernel trick). This means
we would need at least 1 × 50 × 120 hours = 250 days
2. Classification system overview in computation – not counting the often most painful part,
memory constraints and file I/O bottlenecks. Indeed, we
For ImageNet large-scale image classification, we em- need new thinking on existing algorithms: mostly, more
ploy a classification system shown in Fig. 2. This system parallelization and efficiency for computation, and faster
follows the approaches described in a number of previous convergence for iterative algorithms, particularly, SVM
works [24, 28] that showed state-of-the-art performance training. In the following two sections, Section 3 and
on medium-scale image classification datasets (such as Section 4, we will show how to implement the new thinking
PASCAL VOC and Caltech101&256). Here, we attempt into image feature extraction and SVM training, which are
to integrate the advantages from those previous systems. the two major functional blocks in our classification system
The contribution of this paper is not to propose a new (as shown in Fig. 2).
classification paradigm but to develop efficient algorithms
to gain similar performance on large-scale ImagetNet 3. Feature extraction
dataset as those achieved by the state-of-the-art methods on
medium-scale datasets. As shown in Fig. 2, given an input image, our system first
Extending the methods for medium-scale datasets to extracts dense HOG (histogram of oriented gradients [5])

1690
and LBP (local binary pattern [22]) local descriptors. Both 2. Ensure tight approximation: solve the optimization
features have been proven successful in various vision problem
tasks such as object classification, texture analysis and face
recognition, etc. HOG and LBP are complementary in the min kz − Bz αz k2 , subject to α⊤
z e = 1, (1)
αz
sense that HOG focuses more on shape information while
LBP emphasizes texture information within each patch. where e is a vector of ones. The problem has a closed
The advantage of such combination was also reported in form solution.
[22] for human detection task. For images with large Then the coding result α ∈ Rp of z is obtained by placing
size, we downsize them to no more than 500 pixels at the elements of αz into the corresponding positions of
either side. Such normalization not only considerably a p-dimensional vector and leaving the rest to be zeros.
reduces computational cost, but more importantly, makes The algorithm can be seen as one way of sparse coding,
the representation more robust to scale difference. We used because α is very sparse. But the implementation is much
three scales of patch size for computing HOG and LBP, simpler and the computation is much faster than traditional
namely, 16×16, 24×24 and 32×32. The multiple patch sparse coding, because there is no need to solve the L1-
sizes provide richer coverage of different scales and make norm regularized optimization problem. On the other hand,
the features more invariant to scale changes. we empirically find that the performance of this simple
After extracting dense local image descriptors, denoted LCC coding is often comparable or better than traditional
by z ∈ Rd , we perform the ‘coding’ and ‘pooling’ steps, as sparse coding for image classification. In addition, the
shown in Fig. 2, where the coding step encodes each local algorithm can also be explained as a simple extension of
descriptor z via a nonlinear feature mapping into a new vector quantization (VQ) coding, which can be recovered
space, then the pooling step aggregates the coding results by setting κ = 1.
fallen in a local region into a single vector. We apply two
state-of-the-art ‘coding + pooling’ pipelines in our system, 3.2. Super-vector Coding (SVC)
one is based on local coordinate coding (LCC) [26], and
SVC is another way to extend VQ, which explores
the other is based on super-vector coding (SVC) [28]. For
the geometry of data distribution. Suppose a codebook
simplicity, we assume the pooling is global. But spatial
B = [b1 , . . . , bp ] ∈ Rd×p is obtained by running K-means
pyramid pooling is simply implemented by applying the
algorithm. For a descriptor z, the coding procedure is
same operation independently within each partitioned block
of images. 1. Find z’s nearest basis vector in B, whose index is i =
arg minj kz − bj k2 ;
3.1. Local Coordinate Coding (LCC)
2. Obtain the VQ coding γ ∈ Rp , where its i-th element
Let B = [b1 , . . . , bp ] ∈ Rd×p be the codebook, where is one, and all others are zeros.
d is the dimensionality of descriptors z and p is the size of
the codebook. Like many coding methods, LCC seeks a 3. Obtain the SV coding result
linear combination of bases in B to reconstruct z, namely h i
z ≈ Bα, and then use the coefficients α as the coding result β = γ1 s, γ1 (z − b1 ) . . . γp s, γp (z − bp ) ,
for z. Typically α is sparse and its dimensionality is higher (2)
than that of z. We note that the mapping φ(z) from z to α is
usually nonlinear. The theory of LCC points out that, under where β ∈ R(d+1)p , and s is a predefined small constant.
a mild manifold assumption, a good coding should satisfy The SVC can be seen as expanding VQ with local tangent
two properties: directions, and is thus a smoother coding scheme.
At the pooling step, a linear pooling method has been
• The approximation z ≈ Bα is sufficiently accurate;
derived by smoothing the Bhattacharyya kernel. Let Z =
• The coding α should be sufficiently local – only those [zi ]ni=1 be the set of local descriptors of an image, and
bases close to z are activated; [βi ]ni=1 be their SV codes. Assigning zi into those p vector
quantization bins, we Ppartition Z into p groups, with sizes
p
Based on the theory, we develop a very simple algorithm proportional to ωk , k=1 ωk = 1. Then the pooling result
here. We first use K-means algorithm to learn a codebook for this image is
B and then for encoding z do the following: n
1X 1
x= √ βi ,
1. Ensure sufficient locality: find z’s κ nearest neighbors n i=1 ωδ(i)
in B, typically κ = 20, and denote the found bases as
Bz ∈ Rd×κ ; where δ(i) indicates the index of the group zi belongs to.

1691
Sets Coding scheme Descriptor Coding dimension SPM Feature dimension Data set Size(GB)
1 HOG+LBP 8,192 10 81,920 167*
2 Local coordinate coding HOG 16,384 10 163,840 187*
3 HOG+LBP 20,480 10 204,800 260*
4 HOG 32,768 8 262,144 1374
5 Super-vector coding HOG+LBP 51,200 4 204,800 1073
6 HOG 65,536 4 262,144 1374
Table 1. Extracted feature sets from ImageNet images for SVM training. The datasets with ∗ were compressed to reduce data size.

3.3. Parallel feature extraction final classification. However, even training SVM for the
smallest feature set here (about 160 GB) would not be
Depending on coding settings, the computation time for
easy. Furthermore, because the ImageNet dataset has 1000
feature extraction of one image ranges from 2 seconds
categories, we need to train 1000 binary classifiers – the
to 15 seconds on a dual quad-core 2GHz Intel Xeon
decision of using one-against-all SVMs is because training
CPU machine with 16G memory (single thread is used
a joint multi-class SVM is even harder and may not have
in computation). To process 1.2 million images, it
significant performance advantage. To our best knowledge,
would take around 27 to 208 days. Furthermore, feature
training SVMs on such huge datasets with so many classes
extraction yields terabytes of data. It is very difficult
has never been reported before.
for a single computer to handle such huge computation
Although there exist many off-the-shelf SVM
and such huge data. To speedup the computation and
solvers, such as SVMlight [12], SVMperf [13] or
accommodate the data, we choose Apache Hadoop [21]
LibSVM/LIBLINEAR [8], they are not feasible for such
to distribute computation over 20 machines and store data
huge training data. This is because most of them are batch
on Hadoop distributed file system (HDFS). Hadoop is an
methods, which require to go through all data to compute
open source implementation of MapReduce computation
gradient in each iteration and often need many iterations
framework and a distributed file system [3]. Because
(hundreds or even thousands of iterations) to reach a
there is no interdependence in feature extraction tasks,
reasonable solution. Even worse, most off-the-shelf batch-
MapReduce computation framework is very suitable for
type SVM solvers require to pre-load training data into
feature extraction. The HDFS distributes images over all
memory, which is impossible given the size of the training
machines and performs computation on the images located
data we have. Therefore, such solvers are unsuitable for
at local disks, which is called colocation. Colocation
our SVM training. Indeed, LIBLINEAR recently released
can speedup the computation by reducing overall network
an extended version that explicitly considered the memory
I/O cost. The most important advantage of Hadoop is
issue [25]. We tested it with a simplified image feature
that, it provides a reliable infrastructure for large scale
set (HOG descriptor only with coding dimension of 4,096,
computation. For example, a task can be automatically
which generated 80GB training data). However, even
restarted if it encounters some unexpected errors, such as
on such a small dataset (as compared to our largest one,
network issues or memory shortage. In our Hadoop cluster,
1.37TB), the LIBLINEAR solver was not able to provide
we only use 6 workers on each machine because of some
useful results after 2 weeks of running on a dual quad-core
limitation of the machines. Thus, we have 120 workers in
2GHz Intel Xeon CPU machine with 16G memory. The
total.
slowness of the LIBLINEAR solver is not only due to
We totally extracted six sets of features, as shown in
its inefficient inner-outer loop iterative structure but also
Table 1. With the help of Hadoop parallel computing, the
because it needs to learn as many as 1000 binary classifiers.
feature sets took 6 hours to 2 days to compute, depending
Therefore, we need a (much) better SVM solver, which
on coding settings.
should be memory efficient, converge fast and have some
parallelization scheme to train 1000 binary classifiers
4. ASGD for SVM training
in parallel. To meet these needs, we propose a parallel
After feature extraction, we ended up with terabytes averaging stochastic gradient descent (ASGD) algorithm
of training data, as shown in Table. 1. In general, the for training SVM classifiers.
features by LCC are sparse even after pooling and they
4.1. Averaging stochastic gradient descent
are much smaller than the ones generated by supervector
coding. The largest feature set is 1.37 terabytes and non- Let’s use binary classification as an example for de-
sparse. While one may concatenate those features to learn scribing the ASGD algorithm. We have training data that
an overall SVM for classification, we train SVMs separately consists of T feature-label pairs, denoted as {xt , yt }Tt=1 ,
for each feature set and then combine SVM scores to yield where xt is a d × 1 feature vector representing an image

1692
and yt ∈ {−1, +1} is the label of the image. Then, the cost popular. We believe there are two main reasons. First,
function for binary SVM classification can be written as the ASGD algorithm achieves asymptotic convergence
property (to gain similar performance as the second-order
T
X stochastic gradient descent) only when the number of data
L = L(w, b, xt , yt )
samples is sufficiently large. In fact, with insufficient
t=1
data samples, ASGD can be inferior to regular SGD.
T
X λ This probably explains – it may not be able to observe
kwk2 + max 0, 1 − yt (wT xt + b) , (3)

=
t=1
2 the superiority of the ASGD method when dealing with
medium-scale data. Second, for the ASGD algorithm
where w is d×1 SVM weight vector, λ (nonnegative scalar) to achieve fast convergence, the step size η needs to be
is a regularization parameter, and b (scalar) is a bias term. carefully scheduled. We adopt the following step size
Then, the gradient of w and b are scheduling [23]:
1

λw − yt xt if ∆t < 1 η = η0 , (7)
∇w L(w, b, xt , yt ) = (1 + γη0 t)c
λw if ∆t ≥ 1

−yt if ∆t < 1 where η0 (e.g. η0 = 10−2 ), γ and c are some positive
∇b L(w, b, xt , yt ) = , (4) constants, and they are problem-dependent. Typical values
0 if ∆t ≥ 1
for c are 1 or 2/3. Our analysis (not shown in this paper
where ∆t = yt (wT xt + b) is the margin of the data pair for brevity) shows that it is a good strategy to set γ to be
{xt , yt }. the smallest eigenvalue of the Hessian matrix of a stochastic
The proposed ASGD algorithm is a modification of con- objective function. Therefore, for solving the SVM problem
ventional stochastic gradient descent (SGD) algorithms [15, in Eq. 3, we set γ = λ for the step size in Eq. 7.
27]. For conventional SGD, training data are fed to the There is an important implementation trick to signif-
algorithm one by one, and the update rule for w and b icantly reduce the computation cost at each iteration of
respectively are ASGD. A plain implementation of ASGD would need
five scalar-vector multiplications or dot products at each
wt = (1 − λη)wt−1 + ηyt xt iteration: one for computing margin, two for updating wt
bt = bt−1 + ηyt (5) (Eq. 5) and two for averaging (Eq. 6). We choose to perform
the following variable transform:
if margin ∆t is less than 1; otherwise, wt = (1 − λη)wt−1
t
and bt = bt−1 . The parameter η is step size. The above wt = P1,1 vt
SGD algorithm is easy to implement, but it often takes many w̄t = t
P2,1 t
vt + P2,2 ut , (8)
iterations to reach a good solution.
t
The ASGD algorithm is to add an averaging scheme to P1,1 0
the above SGD algorithm. The averaging scheme is where Pt = t t is a 2 × 2 projection matrix,
P2,1 P2,2
and vt and ut are updated in the following manner:
w̄t = (1 − αt )w̄t−1 + αt wt
b̄t = (1 − αt )b̄t−1 + αt bt , (6) vt = vt−1 + ηyt R1,1 xt ;
ut = ut−1 + ηyt (R2,1 + αt R2,2 )xt , (9)
where αt (e.g. αt = 1/t) is averaging parameter. Note that t
the averaging scheme does not affect the SGD update rule R1,1 0
where Rt = t t = P−1
t , and Pt = Tt Pt−1
in Eq. 5, and the averaged SVM weights, w̄T and b̄T , will R2,1 R2,2

be output as the result of SVM training, not wT and bT . 1 − λη 0
with Tt = with P1 being an
The ASGD algorithm is known to have potential to αt (1 − λη) 1 − αt
achieve the theoretically optimal performance of stochastic identity matrix, w1 = v1 and w̄1 = u1 . It is easy to check
gradient descent algorithms. It was shown that, asymptot- that the update in Eq. 8 is equivalent to the update in Eq. 5
ically the ASGD algorithm is able to achieve similar con- and Eq. 6 but with only three scalar-vector multiplications
vergence rate as second-order stochastic gradient descent or dot products: one for computing margin, and two for
algorithm [18], which is often much faster than its first- the computation in Eq. 9 – the transform in Eq. 8 is not
order counterpart. However, unlike the second-order SGD computed until the last iteration when to output result.
that needs to compute the inverse of Hessian matrix, the
4.2. Parallel training
averaging is extremely simple to compute.
Despite the fact that the ASGD method has the potential Another important issue is how to parallelize the com-
to converge fast and is simple to implement, it has not been putation for training 1000 binary SVM classifiers [2].

1693
Apparently, the training of the binary classifiers can be
70
done independently. However, in contrast to the case
where a single machine is used to train all classifiers and 65
the major bottleneck is on computation, using a large
number of machines for training will suffer from file I/O 60

Top 5 hit rate (%)

bottleneck since one of our training datasets is as large as
1.37 terabyte. If the file I/O bandwith is 20MB/second, 55
simply loading/copying the dataset would take 19 hours.
Therefore, we hope to load data as less times as possible 50
while not incurring the computation bottleneck.
SGD
Our strategy here is to do memory sharing on each mul- 45 Averaging SGD
ticore machine. Each machine launches several programs
to train different subset of the 1000 binary classifiers, and 40
1 2 3 4 5
the programs are synchronized to train on the same chunk of Epochs

training data. The training data is shared by all the programs Figure 3. The convergence comparison between ASGD and
on a machine through careful memory sharing. Therefore, regular SGD.
the multiple programs only need to load data once (for
each epoch). Such memory sharing scheme significantly Histogram of top 5 accuracy
140
reduces file loading traffic and speeds up SVM training
dramatically. 120

5. Results 100

5.1. The performance of ASGD method for SVM 80

training 60

As aforementioned, the major challenge of the large- 40

scale ImageNet classification is on training SVMs with
terabytes of training data and as many as 1000 categories. 20

This paper proposes a parallel ASGD method that is aimed

0
to have fast convergence and parallel computation. Fig. 3 0 0.2 0.4 0.6 0.8 1
Accuracy
shows the convergence comparison between the ASGD
method and the regular SGD method. Both methods were Figure 4. The histogram of the top 5 hit rate of the 1000 classes in
performed on the dataset 5 in Table 1 – it is about 1 terabyte ImageNet dataset.
in total. We see that the ASGD method converged very fast.
It reached fairly good solution after 5 iterations. In contrast,
Indeed, we see a huge improvement in performance from
SGD (without averaging) converges much more slowly. It
the baseline that was reported recently [6], which achieved
would take tens of iterations to reach a similarly good
about 20% in classification rate. Fig. 4 shows the histogram
solution. For this specific dataset, each epoch took about 20
of the top 5 hit rate on 1000 classes. We see that the top 5 hit
hours on three 8-core machines (only 6 programs running
rate is mostly concentrated in the range of 60 ∼ 90% while
in parallel on each machine due to some limitations).
it is over 90% for some classes but below 30% for some
Therefore, ASGD took about 4 days to finish SVM training
other classes. The easy classes include odometer, monarch
while the regular SGD would have taken weeks if not
butterfly, cliff dwelling, lunar crater, bonsai, trolleybus,
months.
geyser, snowplow, etc; the difficult classes include China
5.2. ImageNet classification results tree, logwood tree, shingle oak, red beech, cap opener,
Kentucky coffee tree, teak, alder tree, iron tree, grass pink,
With the proposed ASGD method and 12 eight-core etc. The detailed top 5 hit rate for each of the 1000 classes
machines, we were able to train 1000-class SVM classifiers is illustrated in Fig. 5.
for all those 6 feature sets listed in Table 1 within one
week. Classification on each feature set outputs a set of 6. Discussion
SVM sores, and we combined them linearly to yield final
prediction. We have shown how to train an image classification
As a result, our classification system achieved 52.9% system on the large-scale ImageNet dataset (1.2 million im-
in classification accuracy and 71.8% in top 5 hit rate. ages) with many classes (1000 classes). We achieved state-

1694
of-the-art performance on the ImageNet dataset: 52.9% [12] T. Joachims. Making large-scale svm learning practical.
in classification accuracy and 71.9% in top 5 hit rate. LS8-Report, 24, Universitt Dortmund, LS VIII-Report,
The key factors in our system are fast feature extraction 1998.
and SVM training. We developed a parallel averaging [13] T. Joachims. Training linear svms in linear time. In Pro-
stochastic gradient descent (ASGD) algorithm for SVM ceedings of the ACM Conference on Knowledge Discovery
training, which is able to handle terabytes of data and 1000 and Data Mining, 2006.
classes. [14] S. Lazebnik, C. Achmid, and J. Ponce. Beyond bags of
features: Spatial pyramid matching for recognizing natural
In this paper, we observed very fast convergence from
scene categories. In IEEE Conf. on Computer Vision and
the ASGD algorithm. But we are still not able to Pattern Recognition, volume 2, pages 2169–2178, New York
quantitatively connect the superior empirical performance City, June17 - 22 2006.
with existing theoretical analysis, most of which focuses on [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-
analyzing the asymptotic convergence property of ASGD. based learning applied to document recognition. Proceed-
We will study how many training data samples would ings of the IEEE, 86:2278–2324, 1998.
be needed for ASGD to enter its asymptotic convergence [16] D. G. Lowe. Distinctive image features from scale invariant
regime. Meanwhile, we plan to systematically compare keypoints. Int’l Journal of Computer Vision, 60(2):91–110,
the ASGD method with other SGD methods (such as 2004.
Pegasos [20]) for large-scale image classification. [17] T. Ojala, M. Pietikainen, and D. Harwood. A comparative
study of texture measures with classification based on feature
References distributions. Pattern Recognition, 29:51–59, 1996.
[18] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic
[1] http://yann.lecun.com/exdb/mnist/. approximation by averaging. SIAM J. Control Optim, 30,
[2] E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. July 1992.
Psvm: Parallelizing support vector machines on distributed [19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T.
computers. Advances in Neural Information Processing Freeman. Labelme: a database and web-based tool for image
Systems, 20:16, 2007. annotation. International Journal of Computer Vision, 77(1-
[3] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and 3):157–173, 2008.
K. Olukotun. Map-reduce for machine learning on multicore. [20] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Pri-
In Advances in Neural Information Processing Systems 19: mal estimated sub-gradient solver for svm. In Proceedings
Proceedings of the 2006 Conference, page 281. The MIT of the 24th international conference on Machine learning,
Press, 2007. ICML ’07, 2007.
[4] C. Cortes and V. Vapnik. Support-vector networks. Machine [21] T. White. Hadoop: The Definitive Guide. O’Reilly Media,
Learning, 20:273–297, 1995. Inc, 2010.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for [22] S. Y. X. Wang, T. X. Han. An hog-lbp human detector with
human detection. In CVPR, 2005. partial occlusion handling. In ICCV, 2009.
[6] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does [23] W. Xu. Towards optimal one pass large scale learning
classifying more than 10,000 image categories tell us? with averaged stochastic gradient descent. Technical Report
ECCV, 2010. 2009-L102, NEC Labs America.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, [24] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial
and A. Zisserman. The PASCAL Visual Object Classes pyramid matching using sparse coding for image classifica-
Challenge 2010 (VOC2010) Results. http://www.pascal- tion. In IEEE Conference on Computer Vision and Pattern
network.org/challenges/VOC/voc2010/workshop/index.html. Recognition, 2009.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. [25] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large
Lin. Liblinear: A library for large linear classification. J. linear classification when data cannot fit in memory. In Pro-
Mach. Learn. Res., 9, June 2008. ceedings of the 16th ACM SIGKDD international conference
[9] L. Fei-Fei, R. Fergus, and P. Peron. Learning generative on Knowledge discovery and data mining, KDD ’10, 2010.
visual models from few training examples: an incremental [26] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local
bayesian approach tested on 101 object categories. In IEEE. coordinate coding. In NIPS’ 09, 2009.
CVPR 2004, Workshop on Generative-Model Based Vision, [27] T. Zhang. Solving large scale linear prediction problems
2004. using stochastic gradient descent algorithms. In Proceedings
[10] R. Fergus, P. Perona, and A. Zisserman. Object class of the twenty-first international conference on Machine
recognition by unsupervised scale-invariant learning. In learning, ICML ’04, 2004.
IEEE Conf. on Computer Vision and Pattern Recognition, [28] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification
volume 2, pages 264–271, Wisconsin, WI, June16 - 22 2003. using super-vector coding of local image descriptors. In
[11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object ECCV, 2010.
category dataset. Technical Report 7694, California Institute
of Technology, 2007.

1695
Figure 5. The top 5 hit rates on the 1000 categories in the ImageNet Challenge. The hit rate of each category is indicated by a red bar left
to the image representing the category.

1696

Operation Research Project Transportation
87% (23)
Operation Research Project Transportation
11 pages
Image Classification Using SVM and CNN: March 2020
No ratings yet
Image Classification Using SVM and CNN: March 2020
6 pages
Bag of Feature
No ratings yet
Bag of Feature
75 pages
DL1 Ver1
No ratings yet
DL1 Ver1
49 pages
Scaling Vision Transformers: Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer
No ratings yet
Scaling Vision Transformers: Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer
31 pages
Mandar Sat Vil Kar
No ratings yet
Mandar Sat Vil Kar
23 pages
Very Deep Convolutional Networks
No ratings yet
Very Deep Convolutional Networks
28 pages
6-DeepVisualLearning L6
No ratings yet
6-DeepVisualLearning L6
82 pages
C8-Modern CNNs
No ratings yet
C8-Modern CNNs
57 pages
ConvNeXt - A ConvNet For The 2020s
No ratings yet
ConvNeXt - A ConvNet For The 2020s
15 pages
Pcanet: A Simple Deep Learning Baseline For Image Classification?
No ratings yet
Pcanet: A Simple Deep Learning Baseline For Image Classification?
15 pages
Master Thesis Paper p-grad-CAM
No ratings yet
Master Thesis Paper p-grad-CAM
87 pages
1409 1556-Cropped
No ratings yet
1409 1556-Cropped
14 pages
ImageNet Large Scale Visual Recognition Challenge PDF
No ratings yet
ImageNet Large Scale Visual Recognition Challenge PDF
43 pages
VietNguyen MasterThesis
No ratings yet
VietNguyen MasterThesis
66 pages
Fei Fei Li
No ratings yet
Fei Fei Li
43 pages
Scaling Vision Transformers
No ratings yet
Scaling Vision Transformers
31 pages
DL7 2
No ratings yet
DL7 2
11 pages
VGG Image Classification Practical
No ratings yet
VGG Image Classification Practical
11 pages
L10 Image Classification
No ratings yet
L10 Image Classification
10 pages
4.saliency Maps
No ratings yet
4.saliency Maps
8 pages
1 s2.0 S0925231221009486 Main
No ratings yet
1 s2.0 S0925231221009486 Main
7 pages
MA AjamMontassar 201704
No ratings yet
MA AjamMontassar 201704
65 pages
Systematic Evaluation of Convolution Neural Network Advances On The Imagenet-2017
No ratings yet
Systematic Evaluation of Convolution Neural Network Advances On The Imagenet-2017
9 pages
Object Recog
No ratings yet
Object Recog
102 pages
Peronnin Etal ECCV10 PDF
No ratings yet
Peronnin Etal ECCV10 PDF
14 pages
Region-Based Convolutional Networks For Accurate Object Detection and Segmentation
No ratings yet
Region-Based Convolutional Networks For Accurate Object Detection and Segmentation
17 pages
Zhang DetectDistractedDriver Report
No ratings yet
Zhang DetectDistractedDriver Report
6 pages
Analyzing The Performance of Multilayer Neural
No ratings yet
Analyzing The Performance of Multilayer Neural
16 pages
Very Deep Convolutional Networks For Large-Scale Image Recognition
No ratings yet
Very Deep Convolutional Networks For Large-Scale Image Recognition
14 pages
Deep Convolutional Neural Networks For Image Classification: Many Slides From Rob Fergus (NYU and Facebook)
No ratings yet
Deep Convolutional Neural Networks For Image Classification: Many Slides From Rob Fergus (NYU and Facebook)
55 pages
Imagenet Classification
No ratings yet
Imagenet Classification
9 pages
VGG (Simonyan and Zisserman)
No ratings yet
VGG (Simonyan and Zisserman)
14 pages
Qi Li Dissertation
No ratings yet
Qi Li Dissertation
116 pages
Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey
No ratings yet
Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey
24 pages
Bag of Words
No ratings yet
Bag of Words
72 pages
Irjet V6i5695 PDF
No ratings yet
Irjet V6i5695 PDF
3 pages
Zeiler Ec CV 2014
No ratings yet
Zeiler Ec CV 2014
16 pages
DNN Architectures
No ratings yet
DNN Architectures
12 pages
Master's Thesis Deep Learning For Visual Recognition: Remi Cadene Supervised by Nicolas Thome and Matthieu Cord
No ratings yet
Master's Thesis Deep Learning For Visual Recognition: Remi Cadene Supervised by Nicolas Thome and Matthieu Cord
58 pages
A 12 Garbage Classification Using Deep Learning Techniques
No ratings yet
A 12 Garbage Classification Using Deep Learning Techniques
7 pages
CV Ss16 0609 Deep Learning
No ratings yet
CV Ss16 0609 Deep Learning
91 pages
14174-English
No ratings yet
14174-English
7 pages
Large-Scale Image Classification
No ratings yet
Large-Scale Image Classification
8 pages
XXXBetter Plain ViT Baselines For ImageNet-1k
No ratings yet
XXXBetter Plain ViT Baselines For ImageNet-1k
3 pages
Ref 9
No ratings yet
Ref 9
12 pages
Data Structure Module 5
No ratings yet
Data Structure Module 5
22 pages
DogCat Report
No ratings yet
DogCat Report
10 pages
Current Affiliation: Google Deepmind Current Affiliation: University of Oxford and Google Deepmind
No ratings yet
Current Affiliation: Google Deepmind Current Affiliation: University of Oxford and Google Deepmind
1 page
CNN Model For Image Classification Using Resnet: Dr. Senbagavalli M & Swetha Shekarappa G
No ratings yet
CNN Model For Image Classification Using Resnet: Dr. Senbagavalli M & Swetha Shekarappa G
10 pages
Simonyan - Visualising Image Classification Models and Saliency Maps
No ratings yet
Simonyan - Visualising Image Classification Models and Saliency Maps
8 pages
Computer Vision Report
No ratings yet
Computer Vision Report
2 pages
Vehicle Detection and Tracking
No ratings yet
Vehicle Detection and Tracking
11 pages
Good - Analyzer - 3 (Manual - En)
No ratings yet
Good - Analyzer - 3 (Manual - En)
6 pages
Image Classification Using Pre-Trained Convolutional Neural Network in COLAB
No ratings yet
Image Classification Using Pre-Trained Convolutional Neural Network in COLAB
6 pages
"I C U N N ": Mage Lassification Sing Eural Etworks
No ratings yet
"I C U N N ": Mage Lassification Sing Eural Etworks
15 pages
Playfair Cipher
100% (1)
Playfair Cipher
6 pages
Statistical Modeling of High Frequency Datasets Using The ARIMA-ANN Hybrid2023
No ratings yet
Statistical Modeling of High Frequency Datasets Using The ARIMA-ANN Hybrid2023
17 pages
Ipsec Components and Ipsec VPN Features
No ratings yet
Ipsec Components and Ipsec VPN Features
45 pages
Lecture 2 - Probability Concepts and Applications
No ratings yet
Lecture 2 - Probability Concepts and Applications
38 pages
Interpolation Lagrange
No ratings yet
Interpolation Lagrange
10 pages
FVBSN
No ratings yet
FVBSN
4 pages
Introduction To Matrix: A11x1 + A12x2 + ... + A1nxn b1 A21x1 + A22x2 + ... + A2nxn b2
No ratings yet
Introduction To Matrix: A11x1 + A12x2 + ... + A1nxn b1 A21x1 + A22x2 + ... + A2nxn b2
42 pages
3 - CentumVP Engineering Course Day 3
No ratings yet
3 - CentumVP Engineering Course Day 3
76 pages
Mathematic Modelling of Dynamic SYSTEMS Ch. 2
No ratings yet
Mathematic Modelling of Dynamic SYSTEMS Ch. 2
31 pages
Class 12 Maths
No ratings yet
Class 12 Maths
7 pages
Monte Carlo Simulation Handouts
No ratings yet
Monte Carlo Simulation Handouts
8 pages
Akar Persamaan 1
No ratings yet
Akar Persamaan 1
15 pages
Lab 2 - Hooke's Law
No ratings yet
Lab 2 - Hooke's Law
3 pages
Lecture 4 The Time Value Is Money
No ratings yet
Lecture 4 The Time Value Is Money
6 pages
Isom Lab Question Bank
No ratings yet
Isom Lab Question Bank
6 pages
Acín 2018 New J. Phys. 20 080201 PDF
No ratings yet
Acín 2018 New J. Phys. 20 080201 PDF
25 pages
Probst DRFP
No ratings yet
Probst DRFP
21 pages
2020 - Implementation Matters in DRL A Case Study On PPO and TRPO
No ratings yet
2020 - Implementation Matters in DRL A Case Study On PPO and TRPO
14 pages
Rivregress
No ratings yet
Rivregress
16 pages
Assignment 2
No ratings yet
Assignment 2
10 pages
EEE539 Spring2024 MT Sol
No ratings yet
EEE539 Spring2024 MT Sol
6 pages
Lect Slides - Dynamic Response Characteristics of More Complicated Processes
No ratings yet
Lect Slides - Dynamic Response Characteristics of More Complicated Processes
31 pages
L 11 Circle Drawing Algorithims 2
No ratings yet
L 11 Circle Drawing Algorithims 2
6 pages
School of Advanced Sciences MAT5005: Advanced Mathematical Methods Question Bank
No ratings yet
School of Advanced Sciences MAT5005: Advanced Mathematical Methods Question Bank
2 pages
2011 - Development of A Consistent Buckling Design Procedure For Tapered Columns
No ratings yet
2011 - Development of A Consistent Buckling Design Procedure For Tapered Columns
1 page
Functions (Algebraic) Summary MAT1510
No ratings yet
Functions (Algebraic) Summary MAT1510
1 page
Module4 Practice Problems
No ratings yet
Module4 Practice Problems
2 pages
Study Guide 300-615 Dcit Troubleshooting Cisco Data Centre Infrastructure
From Everand
Study Guide 300-615 Dcit Troubleshooting Cisco Data Centre Infrastructure
Anand Vemula
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: NAIVE BAYES, NEAREST NEIGHBORS and NEURAL NETWORKS: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: NAIVE BAYES, NEAREST NEIGHBORS and NEURAL NETWORKS: Examples with MATLAB
César Pérez López
No ratings yet
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
From Everand
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
César Pérez López
No ratings yet
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
From Everand
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
alasdair gilchrist
5/5 (1)
Contextual Image Classification: Understanding Visual Data for Effective Classification
From Everand
Contextual Image Classification: Understanding Visual Data for Effective Classification
Fouad Sabry
No ratings yet
Image Compression: Efficient Techniques for Visual Data Optimization
From Everand
Image Compression: Efficient Techniques for Visual Data Optimization
Fouad Sabry
No ratings yet

Large-Scale Image Classification: Fast Feature Extraction and SVM Training

Uploaded by

Large-Scale Image Classification: Fast Feature Extraction and SVM Training

Uploaded by

Large-scale Image Classification:

Fast Feature Extraction and SVM Training

Liangliang Cao and Thomas Huang

Abstract of pixels to untrained computers, and the object it presents.

Top 5 hit rate (%)

5.1. The performance of ASGD method for SVM 80

As aforementioned, the major challenge of the large- 40

This paper proposes a parallel ASGD method that is aimed

You might also like