Aggregating Local Descriptors Into A Compact Image Representation

Uploaded by

argyrishadjipanteli

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views8 pages

Aggregating Local Descriptors Into A Compact Image Representation

Uploaded by

argyrishadjipanteli

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Aggregating local descriptors into a compact image representation

Hervé Jégou Matthijs Douze Cordelia Schmid Patrick Pérez

INRIA Rennes INRIA Grenoble INRIA Grenoble Technicolor

Abstract as search efficiency can be approximated by the amount

of memory to be visited. Most similar to our work is the
We address the problem of image search on a very large approach of [9], which proposes an approximate nearest
scale, where three constraints have to be considered jointly: neighbor search for BOF vectors. However, this method is
the accuracy of the search, its efficiency, and the memory limited to small vocabulary sizes, yielding lower search ac-
usage of the representation. We first propose a simple yet curacy compared to large ones. Moreover, an image still re-
efficient way of aggregating local image descriptors into a quires more than a hundred bytes to reproduce the relatively
vector of limited dimension, which can be viewed as a sim- low search accuracy of a low-dimensional BOF vector.
plification of the Fisher kernel representation. We then show The efficiency problem was partially addressed by the
how to jointly optimize the dimension reduction and the in- min-Hash approach [2, 3] and the method of Torresani et
dexing algorithm, so that it best preserves the quality of vec- al. [25]. However, these techniques still require a significant
tor comparison. The evaluation shows that our approach amount of memory per image, and are more relevant in the
significantly outperforms the state of the art: the search ac- context of near-duplicate image detection, as their search
curacy is comparable to the bag-of-features approach for accuracy is significantly lower than BOF. Some authors ad-
an image representation that fits in 20 bytes. Searching a dress the efficiency and memory constraints by using GIST
10 million image dataset takes about 50ms. descriptors [17], and by converting them to compact binary
vectors [5, 11, 24, 27]. These approaches are limited by the
low degree of invariance of the global GIST descriptor, and
1. Introduction none of them jointly fulfills the three aforementioned con-
straints: [24, 27] still requires an exhaustive search while
There are two reasons why the bag-of-features image [11] is memory consuming due to the redundancy of the
representation (BOF) is popular for indexation and catego- LSH algorithm. In contrast, our approach obtains a signif-
rization applications. First, this representation benefits from icantly higher accuracy with, typically, a 20-byte represen-
powerful local descriptors, such as the SIFT descriptor [12] tation. This is obtained by optimizing:
and more recent ones [13, 28, 29]. Second, these vector rep-
resentations can be compared with standard distances, and 1. the representation, i.e., how to aggregate local image
subsequently be used by robust classification methods such descriptors into a vector representation;
as support vector machines. 2. the dimensionality reduction of these vectors;
When used for large scale image search, the BOF vec- 3. the indexing algorithm.
tor representing the image is advantageously chosen to be
highly dimensional [16, 20, 10], up to a million dimensions. These steps are closely related: representing an image by a
In this case, the search efficiency results from the use of in- high-dimensional vector usually provides better exhaustive
verted lists [23], which speeds up the computation of dis- search results than with a low-dimensional one. However,
tances between sparse vectors. However, two factors limit high dimensional vectors are more difficult to index effi-
the number of images that can be indexed in practice: the ciently. In contrast, a low dimensional vector is more easily
efficiency of the search itself, which becomes prohibitive indexed, but its discriminative power is lower and might not
when considering more than 10 million images, and the be sufficient for recognizing objects or scenes.
memory required to represent an image. Our first contribution consists in proposing a represen-
In this paper, we address the problem of searching the tation that provides excellent search accuracy with a rea-
most similar images in a very large image database (ten mil- sonable vector dimensionality, as we know that the vector
lion images or more). We put an emphasis on the joint op- will be indexed subsequently. We propose a descriptor, de-
timization of three constraints: the search accuracy, its effi- rived from both BOF and Fisher kernel [18], that aggregates
ciency and the memory usage. The last two are related [24], SIFT descriptors and produces a compact representation. It

978-1-4244-6985-7/10/$26.00 ©2010 IEEE 3304

is termed VLAD (vector of locally aggregated descriptors). 2.2. Fisher kernel
Experimental results demonstrate that VLAD significantly
The Fisher kernel [6] is a powerful tool to transform an
outperforms BOF for the same size. It is cheaper to com-
incoming variable-size set of independent samples into a
pute and its dimensionality can be reduced to a few hun-
fixed size vector representation, assuming that the samples
dreds components by principal component analysis (PCA)
follow a parametric generative model estimated on a train-
without noticeably impacting its accuracy.
ing set. This description vector is the gradient of the sam-
As a second contribution, we show the advantage of ple’s likelihood with respect to the parameters of this distri-
jointly optimizing the trade-off between the dimensionality bution, scaled by the inverse square root of the Fisher infor-
reduction and the indexation algorithm. We consider in par- mation matrix. It gives the direction in parameter space into
ticular the recent indexing method of [7], as we can directly which the learnt distribution should be modified to better
compare the error induced by PCA with the error resulting fit the observed data. It has been shown that discriminative
from the indexation, due to the approximate reconstruction classifiers can be learned in this new representation space.
of the vector from its encoded index. Perronnin et al. [18] applied Fisher kernel in the con-
After presenting two image vector representations that text of image classification. They model the visual words
inspired ours, BOF and the Fisher kernel [18], we intro- with a Gaussian mixture model (GMM), restricted to diag-
duce our descriptor aggregation method in Section 2. The onal variance matrices for each of the k components of the
joint optimization of dimensionality reduction and indexing mixture. Deriving a diagonal approximation of the Fisher
is presented in Section 3. Experimental results demonstrate matrix of a GMM, they obtain a (2d + 1) × k − 1 dimen-
the performance of our approach in section 4: we show that sional vector representation of an image feature set, or d×k-
the performance of BOF is attained with an image represen- dimensional when considering only the components associ-
tation of about 20 bytes. This is a significant improvement ated with either the means or the variances of the GMM.
over the state-of-the-art [9], both in terms of memory usage, In comparison with the BOF representation, fewer visual
search accuracy and efficiency. words are required by this more sophisticated representa-
tion.
2. Image vector representation
2.3. VLAD: vector of locally aggregated descriptors
In this section, we briefly review two popular approaches We propose a vector representation of an image which
that produce a vector representation of an image from a set aggregates descriptors based on a locality criterion in fea-
of local descriptors. We then propose our method to aggre- ture space. It can be seen as a simplification of the
gate local descriptors. Fisher kernel. As for BOF, we first learn a codebook
C = {c1 , ...ck } of k visual words with k-means. Each
2.1. Bag of features local descriptor x is associated to its nearest visual word
ci = NN(x). The idea of the VLAD descriptor is to accu-
The BOF representation groups local descriptors. It re- mulate, for each visual word ci , the differences x − ci of the
quires the definition of a codebook of k “visual words” usu- vectors x assigned to ci . This characterizes the distribution
ally obtained by k-means clustering. Each local descriptor of the vectors with respect to the center.
of dimension d from an image is assigned to the closest cen- Assuming the local descriptor to be d-dimensional, the
troid. The BOF representation is obtained as the histogram dimension D of our representation is D = k × d. In the
of the assignment of all image descriptors to visual words. following, we represent the descriptor by vi,j , where the
Therefore, it produces a k-dimensional vector, which is sub- indices i = 1 . . . k and j = 1 . . . d respectively index the
sequently normalized. There are several variations on how visual word and the local descriptor component. Hence, a
to normalize the histogram. When seen as an empirical component of v is obtained as a sum over all the image de-
distribution, the BOF vector is normalized using the Man- scriptors:
hattan distance. Another common choice consists in using
Euclidean normalization. The vector components are then
X
vi,j = xj − ci,j (1)
weighted by idf (inverse document frequency) terms. Sev- x such that NN(x)=ci
eral weighting schemes have been proposed [16, 23]. In the
following, we perform L2 normalization of histograms and where xj and ci,j respectively denote the j th component of
use the idf calculation of [23]. the descriptor x considered and of its corresponding visual
Several variations have been proposed to improve the word ci . The vector v is subsequently L2 -normalized by
quality of this representation. One of the most popu- v := v/||v||2 .
lar [21, 26] consists in using soft quantization techniques Experimental results show that excellent results can be
instead of a k-means. obtained even with a relatively small number of visual

3305
>

<
Figure 1. Images and corresponding VLAD descriptors, for k=16 centroids (D=16×128). The components of the descriptor are represented
like SIFT, with negative components (see Equation 1) in red.

words k: we consider values ranging from k=16 to k=256. resulting vectors. For this purpose, we consider the recent
Figure 1 depicts the VLAD representations associated approximate nearest neighbor search method of [7], which
with a few images, when aggregating 128-dimensional is briefly described in the next section. We will show the
SIFT descriptors. The components of our descriptor map importance of the joint optimization by measuring the mean
to components of SIFT descriptors. Therefore we adopt the squared Euclidean error generated by each step.
usual 4 × 4 spatial grid representation of oriented gradients
for each vi=1..k . We have accumulated the descriptors in 16
of them, one per visual word. In contrast to SIFT descrip- 3.1. Approximate nearest neighbor
tors, a component may be positive or negative, due to the
difference in Equation 1. Approximate nearest neighbors search methods [4, 11,
One can observe that the descriptors are relatively sparse 15, 24, 27] are required to handle large databases in com-
(few values have a significant energy) and very structured: puter vision applications [22]. One of the most popu-
most high descriptor values are located in the same cluster, lar techniques is Euclidean Locality-Sensitive Hashing [4],
and the geometrical structure of SIFT descriptors is observ- which has been extended in [11] to arbitrary metrics. How-
able. Intuitively and as shown later, a principal component ever, these approaches and the one of [15] are memory con-
analysis is likely to capture this structure. For sufficiently suming, as several hash tables or trees are required. The
similar images, the closeness of the descriptors is obvious. method of [27], which embeds the vector into a binary
space, better satisfies the memory constraint. It is, how-
3. From vectors to codes ever, significantly outperformed in terms of the trade-off
between memory and accuracy by the product quantization-
This section addresses the problem of coding an image based approximate search method of [7]. In the following,
vector. Given a D-dimensional input vector, we want to we use this method, as it offers better accuracy and because
produce a code of B bits encoding the image representa- the search algorithm provides an explicit approximation of
tion, such that the nearest neighbors of a (non-encoded) the indexed vectors. This allows us to compare the vector
query vector can be efficiently searched in a set of n en- approximations introduced by the dimensionality reduction
coded database vectors. and the quantization. We use the asymmetric distance com-
We handle this problem in two steps, that must be opti- putation (ADC) variant of this approach, which only en-
mized jointly: 1) a projection that reduces the dimension- codes the vectors of the database, but not the query vector.
ality of the vector and 2) a quantization used to index the This method is summarized in the following.

3306
ADC approach. Let x ∈ ℜD be a query vector and 3.2. Indexation-aware dimensionality reduction
Y = {y1 , . . . , yn } a set of vectors in which we want to
Dimensionality reduction is an important step in approx-
find the nearest neighbor NN(x) of x. The ADC approach
imate nearest neighbor search, as it impacts the subsequent
consists in encoding each vector yi by a quantized version
indexation method. In this section, for the ADC approach1 ,
ci = q(yi ) ∈ ℜD . For a quantizer q(.) with k centroids,
we express the compromise between this operation and the
the vector is encoded by log2 (k) bits, k being a power of 2.
indexing scheme using a single quality measure: the ap-
Finding the a nearest neighbors NNa (x) of x simply con-
proximation error. For the sake of presentation, we assume
sists in computing
that the mean of the vectors is the null vector. This is ap-
NNa (x) = a- arg min ||x − q(yi )||2 . (2) proximately the case for VLAD vectors.
i
Principal component analysis (PCA) is a standard
Note that, in contrast with the embedding method of [27], tool [1] for dimensionality reduction: the eigenvectors as-
the query x is not converted to a code: there is no approxi- sociated with the D′ most energetic eigenvalues of the em-
mation error on the query side. pirical vector covariance matrix are used to define a matrix
To get a good vector approximation, k should be large M mapping a vector x ∈ ℜD to a vector x′ = M x ∈ ℜD .
′

(k = 264 for a 64 bit code). For such large values of k, Matrix M is the D′ × D upper part of an orthogonal matrix.
learning a k-means codebook as well as assignment to the This dimensionality reduction can also be interpreted in the
centroids is not tractable. Our solution is to use a prod- initial space as a projection. In that case, x is approximated
uct quantization method which defines the quantizer with- by
out explicitly enumerating its centroids. A vector x is first
xp = x − εp (x) (6)
split into m subvectors x1 , ... xm of equal length D/m. A
product quantizer is then defined as function where the error vector εp (x) lies in the null space of M . The
vector xp is related to x′ by the pseudo-inverse of M , which
q(x) = q1 (x1 ), ..., qm (xm ) ,

(3)
is the transpose of M in this case. Therefore, the projection
which maps the input vector x to a tuple of indices by sepa- is xp = M ⊤ M x. For the purpose of indexing, the vector x′
rately quantizing the subvectors. Each individual quantizer is subsequently encoded as q(x′ ) using the ADC approach,
qj (.) has ks reproduction values learned by k-means. To which can also be interpreted in the original D-dimensional
limit the assignment complexity O(m × ks ), ks is a small space as the approximation2
value (e.g., ks =256). However, the set k of centroids in-
duced by the product quantizer q(.) is large: k = (ks )m . q(xp ) = x − εp (x) − εq (xp ) (7)
The squared distances in Equation 2 are computed using
the decomposition where εp (x) ∈ Null(M ) and εq (xp ) ∈ Null(M )⊥ (because
X the ADC quantizer is learned in the principal subspace) are
||x − q(yi )||2 = ||xj − qj (yij )||2 , (4) orthogonal. At this point, we make two observations:
j=1,...,m
1. Due to the PCA, the variance of the different com-
where yij is the j th subvector of yi . The square distances
ponents of x′ is not balanced. Therefore the ADC
in this summation are read from look-up tables computed,
structure, which regularly subdivides the space, quan-
prior to the search, between each subvector xj and the ks
tizes the first principal components more coarsely in
centroids associated with the corresponding quantizer qj .
comparison with the last components that are selected.
The generation of the tables is of complexity O(D × ks ).
This allocation introduces a bottleneck on the first
When ks ≪ n, this complexity is negligible compared with
components with respect to the quantization error.
the summation cost of O(D × n) in Equation 2.
2. There is a trade-off on the number of dimension D′ to
This quantization method was chosen because of its ex- be retained by the PCA. If D′ is large, the projection
cellent performance, but also because it represents the in- error vector εp (x) is of limited magnitude, but a large
dexation as a vector approximation: a database vector yi quantization error εq (xp ) is introduced. On the oppo-
can be decomposed as site, keeping a small number of components leads to a
yi = q(yi ) + εq (yi ), (5) high projection error and a low quantization error.
where q(yi ) is the centroid associated with yi and εq (yi ) the Balancing the components’ variance. In [27], this issue
error vector generated by the quantizer. was addressed by allocating different numbers of bits to the
Notation: ADC m × bs refers to the method when using different components. The ADC method does not have this
m subvectors and bs bits to encode each subvector (bs = 1 Note
that [7] did not propose any dimensionality reduction.
log2 ks ). The total number of bits B used to encode a vector 2 For
the sake of conciseness, the quantities M T q(x′ ) and M T εq (x′ )
is then given by B = m bs . are simplified to q(xp ) and εq (xp ) respectively.

3307
x

q(xp )

Figure 2. Effect of the encoding steps on the descriptor. Top: VLAD vector x for k=16 (D=2048). Middle: vector xp altered by the
projection onto the PCA subspace (D′ =128). Bottom: vector q(xp ) after indexing by ADC 16 × 8 (16-bytes code).

flexibility. Therefore, we address the problem by perform- vectors (k=16, D=2048) generated from SIFT descriptors,
ing an orthogonal transformation after the PCA. In other we obtain the following average projection, quantization
terms, given the dimension D′ and X a vector to index, we and total mean square errors:
want to find an orthogonal matrix Q such that the compo- D′ ep (D′ ) eq (D′ ) e(D′ )
nents of the transformed vector X ′′ = QX ′ = QM X have 32 0.0632 0.0164 0.0796
equal variances. It would be optimal to find the matrix Q̃ 48 0.0508 0.0248 0.0757
minimizing the expected quantization error introduced by 64 0.0434 0.0321 0.0755
the ADC method: 80 0.0386 0.0458 0.0844

Q̃ = arg min EX ||εq (QM X)||2 . It is important to compute the measures several times

(8)
Q
(here averaged over 10 runs), as the ADC structure depends
However, this optimization problem is not tractable, as the on the local optimum found by k-means, and leads to vari-
objective function requires to learn the product quantizer able quantization errors. The optimization selects D′ =64
q(.) of the ADC structure at each iteration. Finding a matrix for k=16 and B=128. We will confirm in the experimental
Q satisfying the simplified balancing objective is done by section that this value represents an excellent trade-off in
choosing it in the form of a Householder matrix terms of search results.

Q = I − 2vv ⊤ , (9) Remarks.

with the optimization performed on the D′ components • The choice of D′ is constrained by the structure of
of v. A simple alternative is to avoid this optimization by ADC, which requires that D′ is a multiple of m. For
using, for Q, a D′ ×D′ random orthogonal matrix. For high instance, by keeping D′ =64 eigenvalues, the valid set
dimensional vectors, this choice is an acceptable solution of values for m is {1,2,4,8,16,32,64}.
with respect to our balancing criterion. This will be con- • The optimization is solely based on the mean squared
firmed by our experiments in Section 4, which show that, error quantization criterion. For this reason, it is not
for VLAD descriptors, both choices are equivalent. clear how our framework could be extended to another
Joint optimization of reduction/indexing. Let us now indexation method, such as LSH [4], which does not
consider the second problem, i.e., optimizing the dimen- provide an explicit approximation vector.
sion D′ , having fixed a constraint on the number of bits B • We apply the projection Q × M before the L2 nor-
used to represent the D-dimensional VLAD vector x, for malization of the aggregated representation (see Sub-
instance B=128 (16 bytes). The square Euclidean distance section 2.3). This brings a marginal improvement in
between the reproduction value and x is the sum of the er- terms of image search accuracy.
rors ||εp (x)||2 and ||εq (xp )||2 , which both depend on the
selected D′ . The mean square error e(D′ ) is empirically
measured on a learning vector set L as The impact of dimensionality reduction and indexation
based on ADC is illustrated by the VLAD pictorial repre-
e(D′ ) = ep (D′ ) + eq (D′ ) (10) sentation introduced in Section 2. We can present the pro-
1 X jected and quantized VLAD in this form, as both PCA pro-
= ||εp (x)||2 + ||εq (xp )||2 . (11)
card(L) jection and ADC provide a way of reconstructing the pro-
x∈L
jected/quantized vector. Figure 2 illustrates how each of
This gives us an objective criterion to optimize directly the these operations impacts our representation. One can see
dimensionality, which is obtained by finding on the learning that the vector is only slightly altered, even for a compact
set the value of D′ minimizing this criterion. For VLAD representation of B=16 bytes.

3308
Descriptor k D Holidays (mAP) UKB (score/4)
D → D′ =128 → D′ =64 → D′ =32 D → D′ =128 → D′ =64 → D′ =32
BOF 1 000 1 000 0.401 0.444 0.434 0.408 2.86 2.99 2.91 2.77
20 000 20 000 0.404 0.452 0.445 0.416 2.87 2.95 2.90 2.78
Fisher (µ) 16 2 048 0.497 0.490 0.475 0.452 3.07 3.05 2.98 2.83
64 8 192 0.495 0.492 0.464 0.424 3.09 3.09 2.98 2.75
VLAD 16 2 048 0.496 0.495 0.494 0.451 3.07 3.05 2.99 2.82
64 8 192 0.526 0.510 0.477 0.421 3.17 3.15 3.03 2.79
Table 1. Performance comparison of BOF, Fisher and VLAD representations, before and after dimension reduction: the performance is
given for the full D-dimensional descriptor, and after a dimensionality reduction to D′ =128, 64 and 32 components. Note that for UKB,
the best score reported by Nister and Stewénius is 3.19, for a 1M vocabulary tree [16] learned on an independent dataset.

4. Experiments number of mixture components in the Fisher kernel repre-

sentation. For Fisher, we have only used the components
In this section, we first evaluate our VLAD descriptor associated with the mean vectors (µ), as we observed that,
and the joint dimensionality reduction/indexing approach. although the variance components improve the results, they
We then provide a comparison with the state of the art and provide comparable results after dimensionality reduction
measure the accuracy and efficiency of the search on 10 mil- to the same D′ .
lion images. The evaluation is performed without the indexing
scheme at this stage. Here, we put an emphasis on the per-
4.1. Evaluation datasets and local descriptors formance obtained after dimensionality reduction, as these
To extract local features, we have used the experimen- vectors will be indexed afterwards. Despite its simplic-
tal setup of [9] and the feature extraction software available ity, the VLAD descriptor equals or outperforms the Fisher
online3 . More precisely, the regions of interest are extracted kernel on both Holidays and UKB, and significantly out-
using the Hessian affine-invariant region detector [14] and performs BOF. Note that surprisingly the dimension reduc-
described by the SIFT descriptor [12]. We have used an tion improves the accuracy for BOF. The scores of BOF are
independent image set for all the learning stages. The eval- slightly different from those reported in [9], because we use
uation is performed on three datasets: Euclidean distances to compare representations instead of
the cosine measure. These choices are not strictly equiva-
• The INRIA Holidays dataset [8]. This is a collection lent, because idf weights are applied after the L2 normal-
of 1491 holiday images, 500 of them being used as ization.
queries. The accuracy is measured by the mean Aver- It is worth noticing that higher dimensional representa-
age Precision (mAP). tions, which usually provide better accuracy, suffer more
• The University of Kentucky Benchmark (UKB). This from the dimensionality reduction. This is especially true
set comprises images of 2550 objects, each of which for Fisher kernel and VLAD: for D′ =32, using only 16 cen-
is represented by 4 images. The most commonly used troids/mixtures is significantly better than larger values of
evaluation metric for this dataset counts the average k. On average, VLAD outperforms the Fisher kernel rep-
number of relevant images (including the query itself) resentation. For only D′ =64 dimensions VLAD attains an
that are ranked in the first four positions when search- excellent accuracy of mAP=0.494 on Holidays.
ing the 10 200 images. Note that in a very recent work [19], Perronnin et al. fur-
• To evaluate the behavior of our method on a large ther improved the Fisher kernel in several ways, obtain-
scale, we have downloaded 10M images from Flickr. ing mAP=0.593 on Holidays with a mixture of k=64 Gaus-
The Holidays dataset is merged with this set, as in [9], sians. Hopefully, our VLAD representation would also ben-
to provide ground truth matches. efit from the proposed techniques.

4.2. Image vector representations 4.3. Reduction and indexation

Table 1 compares the different local aggregation meth- Balancing the variances. Table 2 compares the search
ods described in Section 2: BOF, Fisher kernel, and our performance obtained by applying the ADC indexing af-
VLAD aggregation technique. These representations are ter 1) PCA dimension reduction, 2) PCA followed by an
all parametrized by a single parameter k. It corresponds orthogonal transformation optimizing the variance balanc-
to the number of centroids for BOF and VLAD, and to the ing criterion (see Subsection 3.2), and 3) PCA followed by
a random orthogonal transformation. The need for a rota-
3 http://lear.inrialpes.fr/people/jegou/data.php tion is clear. However, using a random one provides results

3309
Method mAP Method bytes UKB Holidays
No transformation 0.445 BOF, k=20,000 (from [9]) 10.364 2.92 0.446
Balancing optimization 0.457 miniBOF [9] 20 2.07 0.255
Random orthogonal transformation 0.457 80 2.72 0.403
Table 2. Comparison of different orthogonal transformation matri- 160 2.83 0.426
ces, with VLAD, k=16, D′ =64, ADC 16 × 8. These measures are VLAD, k=16, ADC 16 × 8 16 2.88 0.460
averaged over 10 runs on the Holidays dataset. VLAD, k=64, ADC 32 × 10 40 3.10 0.495

0.5
Table 3. Comparison with the state of the art on UKB (score/4)
and Holidays (mAP). D′ =64 for k=16 and D′ =96 for k=64.
0.45

0.4 ADC parameters

8x8 16x8 32x10 128x10
0.35
0.55
0.3
mAP

0.25 0.5

0.2 0.45
miniBOF 32
0.15 miniBOF 16
0.4

mAP
miniBOF 8
0.1
VLAD k=16 0.35 miniBOF 4
0.05 VLAD k=64
VLAD k=256
0 0.3
16 64 256 1024 4096 VLAD k=16
0.25 VLAD k=64
D’ miniBOF 1 VLAD k=256
Figure 3. Search accuracy on Holidays with respect to reduction miniBOF [9]
0.2
to different dimensions D′ with ADC 16 × 8. Experiments are av- 8 16 32 64 128 256 512
eraged over 5 learning runs. The error bars represent the standard number of bytes
deviations over those runs. Figure 4. mAP for search on Holidays. For a given number of
bytes, the optimal choice of D′ is computed and only the results
of the best codebook size (k=16, 64 or 256) are reported. The error
comparable to those obtained by optimizing the matrix. Our bars represent the standard deviation over 5 runs. miniBOF results
explanation is that the random rotation sufficiently balances of [9] are reported for reference.
the energy.
Choice of the projection subspace dimension. For a fixed that the choice of the number of centroids k depends on the
image representation with a vector of length D and a fixed number of bits B chosen to represent the image. It shows
number B of bits to encode this vector, Figure 3 confirms that we attain competitive accuracy, with respect to BOF,
the analysis of Section 3: there is an important trade-off on using only 16 bytes. Note that small (resp. large) values of
D′ . The optimum limits the loss introduced by the projec- k should be associated with small (resp. large) values of B:
tion and the quantization step. The best choice of D′ corre- large ones are more impacted by dimensionality reduction.
sponds to the one found by our optimization procedure. For
instance, for VLAD with k=16, we obtain the same opti- 4.5. Large scale experiments
mum (D′ =64) as the one estimated in Section 3.2 based on
our objective error criterion. Figure 5 evaluates our approach on a large scale (up to
10 million images). It gives the mAP performance as a
function of the dataset size for the VLAD vector (k=64,
4.4. Comparison with the state of the art
D=8192) and when indexed with ADC 16 × 8 and 16 bytes
Our objectives are comparable to those of [9] in terms of memory (D′ =64). For this experiment, we have also used
of memory usage and desired degree of invariance (rota- the non exhaustive search variant of ADC proposed in [7].
tion/scale invariance). Table 3 and Figure 4 compare the It is referred to as IVFADC in the following, and gives com-
accuracies obtained by [9] and our approach on the bench- parable results, depending on the operating point (better and
marks Holidays and UKB. Our approach obtains a com- more efficient for very large sets). IVFADC combines ADC
parable search quality with at least an order of magnitude with an inverted file to restrict the search to a subset of vec-
less memory. Equivalently, for the same memory usage, tors. Consequently, it stores the image identifiers explic-
our method is significantly more precise. itly (4 bytes per image), and therefore requires 20 bytes of
Figure 4 also illustrates the trade-off between search memory with the selected parameters. Overall, our results
quality and memory usage. An interesting observation is are significantly better than those reported in [9], where a

3310
0.6 [8] H. Jégou, M. Douze, and C. Schmid. Hamming embedding
and weak geometric consistency for large scale image search.
0.5 In ECCV, October 2008.
[9] H. Jégou, M. Douze, and C. Schmid. Packing bag-of-
0.4 features. In ICCV, September 2009.
[10] H. Jégou, M. Douze, and C. Schmid. Improving bag-of-
mAP

0.3 features for large scale image search. IJCV, 87(3), May 2010.
[11] B. Kulis and K. Grauman. Kernelized locality-sensitive
0.2 hashing for scalable image search. In ICCV, October 2009.
BOF, k=200k [12] D. Lowe. Distinctive image features from scale-invariant
0.1 VLAD k=64, D=8192 keypoints. IJCV, 60(2):91–110, 2004.
VLAD k=64, PCA D’=64
VLAD k=64, ADC 16x8 [13] K. Mikolajczyk and C. Schmid. A performance evaluation
VLAD k=64, IVFADC 16x8
0 of local descriptors. PAMI, 27(10):1615–1630, 2005.
1000 10k 100k 1M 10M
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman,
Database size
J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A
Figure 5. Search accuracy as a function of the database size. comparison of affine region detectors. IJCV, 65(1/2):43–72,
2005.
[15] M. Muja and D. G. Lowe. Fast approximate nearest neigh-
mAP of 0.066 is reported for 1 million images and a 20-
bors with automatic algorithm configuration. In VISAPP,
bytes representation, against mAP=0.193 or 0.241 in our
February 2009.
case, depending on the ADC variant.
[16] D. Nistér and H. Stewénius. Scalable recognition with a vo-
cabulary tree. In CVPR, June 2006.
Timings: All timing experiments have been performed on [17] A. Oliva and A. Torralba. Modeling the shape of the scene:
a single processor core. Searching our 10 million dataset a holistic representation of the spatial envelope. IJCV,
for VLAD vectors with k=64 reduced to D′ =64 dimensions 42(3):145–175, 2001.
takes 7.2 s when Euclidean distances are exhaustively com- [18] F. Perronnin and C. R. Dance. Fisher kernels on visual vo-
puted. Searching with ADC 16 × 8 takes 0.716 s, while cabularies for image categorization. In CVPR, June 2007.
using the IVFADC 16 × 8 variant takes 0.046 s. This timing [19] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier. Large-scale
is better than the one reported in [9], and this for a signifi- image retrieval with compressed Fisher vectors. In CVPR,
cantly better accuracy. June 2010.
[20] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisser-
Acknowledgements man. Object retrieval with large vocabularies and fast spatial
matching. In CVPR, June 2007.
We would like to thank QUAERO and the ANR projects [21] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman.
ICOS-HD and GAIA for their financial support. Many Lost in quantization: Improving particular object retrieval in
thanks to Florent Perronnin for insightful discussions. large scale image databases. In CVPR, June 2008.
[22] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-
References Neighbor Methods in Learning and Vision: Theory and
Practice, chapter 3. MIT Press, March 2006.
[1] C. M. Bishop. Pattern Recognition and Machine Learning.
[23] J. Sivic and A. Zisserman. Video Google: A text retrieval
Springer, 2007.
approach to object matching in videos. In ICCV, October
[2] O. Chum, M. Perdoch, and J. Matas. Geometric min- 2003.
hashing: Finding a (thick) needle in a haystack. In CVPR,
[24] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large
June 2009.
databases for recognition. In CVPR, June 2008.
[3] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image
[25] L. Torresani, M. Szummer, and A. Fitzgibbon. Learning
detection: min-hash and tf-idf weighting. In BMVC, Septem-
query-dependent prefilters for scalable image retrieval. In
ber 2008.
CVPR, June 2009.
[4] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-
[26] J. van Gemert, C. Veenman, A. Smeulders, and J. Geuse-
sensitive hashing scheme based on p-stable distributions. In
broek. Visual word ambiguity. PAMI. To appear.
Symposium on Computational Geometry, June 2004.
[27] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In
[5] M. Douze, H. Jégou, H. Singh, L. Amsaleg, and C. Schmid.
NIPS, December 2008.
Evaluation of GIST descriptors for web-scale image search.
[28] S. Winder and M. Brown. Learning local image descriptors.
In CIVR, July 2009.
In CVPR, June 2007.
[6] T. Jaakkola and D. Haussler. Exploiting generative models
in discriminative classifiers. In NIPS, December 1998. [29] S. Winder, G. Hua, and M. Brown. Picking the best Daisy.
In CVPR, June 2009.
[7] H. Jégou, M. Douze, and C. Schmid. Product quantization
for nearest neighbor search. PAMI. To appear.

3311

Machine Learning Basic Principles
No ratings yet
Machine Learning Basic Principles
124 pages
Data Reduction Techniques
No ratings yet
Data Reduction Techniques
41 pages
Peronnin Etal ECCV10 PDF
No ratings yet
Peronnin Etal ECCV10 PDF
14 pages
Image Descriptor Matching: Vineeth N Balasubramanian
No ratings yet
Image Descriptor Matching: Vineeth N Balasubramanian
20 pages
Content Based Image Retrieval Using Feature Coding
No ratings yet
Content Based Image Retrieval Using Feature Coding
4 pages
DLCV Day2
No ratings yet
DLCV Day2
5 pages
A.yavlinsky PHD
No ratings yet
A.yavlinsky PHD
129 pages
Appearance Based Methods
No ratings yet
Appearance Based Methods
68 pages
Bag of Words
No ratings yet
Bag of Words
72 pages
Q1 - VLAD - Aggregating Local Descriptors Into A Compact Image Representation
No ratings yet
Q1 - VLAD - Aggregating Local Descriptors Into A Compact Image Representation
8 pages
Feature Selection Using Adaboost For Face Expression Recognition
No ratings yet
Feature Selection Using Adaboost For Face Expression Recognition
6 pages
FREAK: Fast Retina Keypoint
No ratings yet
FREAK: Fast Retina Keypoint
8 pages
Applied Sciences: Image Retrieval Method Based On Image Feature Fusion and Discrete Cosine Transform
No ratings yet
Applied Sciences: Image Retrieval Method Based On Image Feature Fusion and Discrete Cosine Transform
28 pages
Bag of Feature
No ratings yet
Bag of Feature
75 pages
LBP Image Processing
No ratings yet
LBP Image Processing
15 pages
Lec 14
No ratings yet
Lec 14
18 pages
Trzcinski PAMI Preprint
No ratings yet
Trzcinski PAMI Preprint
14 pages
BEBLID: Boosted Efficient Binary Local Image Descriptor
No ratings yet
BEBLID: Boosted Efficient Binary Local Image Descriptor
18 pages
Lec 27
No ratings yet
Lec 27
25 pages
10.1 - What Are Image Descriptors, Featu... Feature Vectors - PyImageSearch Gurus
No ratings yet
10.1 - What Are Image Descriptors, Featu... Feature Vectors - PyImageSearch Gurus
11 pages
Visual Indexing and Retrieval (Z-Lib - Io)
No ratings yet
Visual Indexing and Retrieval (Z-Lib - Io)
113 pages
Asymmetric Distances For Binary Embeddings: Albert Gordo, Florent Perronnin, Yunchao Gong, Svetlana Lazebnik
No ratings yet
Asymmetric Distances For Binary Embeddings: Albert Gordo, Florent Perronnin, Yunchao Gong, Svetlana Lazebnik
15 pages
Ijcsereviewpaper
No ratings yet
Ijcsereviewpaper
6 pages
Detectores de Caracteristicas
No ratings yet
Detectores de Caracteristicas
20 pages
APIC Lecture2-3 Feature Selection
No ratings yet
APIC Lecture2-3 Feature Selection
28 pages
NaikalN ICCV2011
No ratings yet
NaikalN ICCV2011
8 pages
Bai09 Descriptors
No ratings yet
Bai09 Descriptors
81 pages
Marchesotti 2011
No ratings yet
Marchesotti 2011
8 pages
Jegou Packingbof
No ratings yet
Jegou Packingbof
9 pages
Bag-Of-Words Models: Noah Snavely
No ratings yet
Bag-Of-Words Models: Noah Snavely
47 pages
Feature Propagation On Image Webs For Enhanced Image Retrieval
No ratings yet
Feature Propagation On Image Webs For Enhanced Image Retrieval
8 pages
Sparse Concept Coding For Visual Analysis
No ratings yet
Sparse Concept Coding For Visual Analysis
6 pages
PHD Visual Object Category Recognition
No ratings yet
PHD Visual Object Category Recognition
193 pages
Local Features and Bag of Words Models
No ratings yet
Local Features and Bag of Words Models
60 pages
V-Unit AIIA Complete Material
No ratings yet
V-Unit AIIA Complete Material
162 pages
26 2010 CVPR Spatial Bag of Words
No ratings yet
26 2010 CVPR Spatial Bag of Words
8 pages
Ijmcer BB0310209217
No ratings yet
Ijmcer BB0310209217
9 pages
12.coupled Binary Embedding For
No ratings yet
12.coupled Binary Embedding For
13 pages
Jegou Improvingbof Preprint
No ratings yet
Jegou Improvingbof Preprint
22 pages
Kernel Visual Recognition
No ratings yet
Kernel Visual Recognition
9 pages
Eccv 06
No ratings yet
Eccv 06
15 pages
Sample New
No ratings yet
Sample New
6 pages
Q1 Fisher Kernels On Visual Vocabularies For Image Categorization
No ratings yet
Q1 Fisher Kernels On Visual Vocabularies For Image Categorization
8 pages
Uijlings ICIVR2009
No ratings yet
Uijlings ICIVR2009
8 pages
Feature Matching: Histogram of Oriented Gradients
No ratings yet
Feature Matching: Histogram of Oriented Gradients
11 pages
cvpr2004 Keypoint Rahuls PDF
No ratings yet
cvpr2004 Keypoint Rahuls PDF
8 pages
VietNguyen MasterThesis
No ratings yet
VietNguyen MasterThesis
66 pages
Local Feature Detectors, Descriptors, and Image Representations: A Survey
No ratings yet
Local Feature Detectors, Descriptors, and Image Representations: A Survey
20 pages
Understanding Bag-Of-Words Model A Statistical Fra
No ratings yet
Understanding Bag-Of-Words Model A Statistical Fra
16 pages
MarszalekSchmid CVPR06 SpatialWeighting
No ratings yet
MarszalekSchmid CVPR06 SpatialWeighting
9 pages
Trainable COSFIRE Filters For Keypoint Detection and Pattern Recognition
No ratings yet
Trainable COSFIRE Filters For Keypoint Detection and Pattern Recognition
15 pages
Lecture10-Featurebased Image Matching
No ratings yet
Lecture10-Featurebased Image Matching
33 pages
PCA and LDA Assignment
No ratings yet
PCA and LDA Assignment
5 pages
Lecture6 2
No ratings yet
Lecture6 2
37 pages
CV 2025 Spring 12 Short
No ratings yet
CV 2025 Spring 12 Short
120 pages
Tutorial 7 Developing A Simple Image Classifier
No ratings yet
Tutorial 7 Developing A Simple Image Classifier
11 pages
3local Features
No ratings yet
3local Features
76 pages
Recognizing Pictures at An Exhibition Using SIFT
No ratings yet
Recognizing Pictures at An Exhibition Using SIFT
5 pages
03-3 Feature Descriptors
No ratings yet
03-3 Feature Descriptors
58 pages
A Step-By-Step Explanation of Principal Component Analysis (PCA) - Built in
No ratings yet
A Step-By-Step Explanation of Principal Component Analysis (PCA) - Built in
8 pages
AI Presentation
No ratings yet
AI Presentation
66 pages
Dimensionality Reduction Techniques You Should Know in 2021
No ratings yet
Dimensionality Reduction Techniques You Should Know in 2021
12 pages
Autoencoder Asset Pricing Models
No ratings yet
Autoencoder Asset Pricing Models
22 pages
UNIT 1 Notes
No ratings yet
UNIT 1 Notes
13 pages
B.tech-I.T 7 & 8 Semester Course Work
No ratings yet
B.tech-I.T 7 & 8 Semester Course Work
16 pages
Fake News Analysis
No ratings yet
Fake News Analysis
46 pages
2 - Types of Machine Learning
No ratings yet
2 - Types of Machine Learning
26 pages
Introduction To Machine Learning and Hands On Sessions
No ratings yet
Introduction To Machine Learning and Hands On Sessions
50 pages
Unit 4 Dimenstionality Reduction
No ratings yet
Unit 4 Dimenstionality Reduction
104 pages
(Monographs On Statistics and Applied Probability (Series) 161) Li, Bing - Sufficient Dimension Reduction - Methods and Applications With R-CRC Press (2018)
No ratings yet
(Monographs On Statistics and Applied Probability (Series) 161) Li, Bing - Sufficient Dimension Reduction - Methods and Applications With R-CRC Press (2018)
307 pages
Advanced Certificate Program in Data Science and AI Curriculum v1.0
No ratings yet
Advanced Certificate Program in Data Science and AI Curriculum v1.0
55 pages
Machine Learning
No ratings yet
Machine Learning
7 pages
Multivariate Statistics Principal Component Analysis (PCA)
No ratings yet
Multivariate Statistics Principal Component Analysis (PCA)
41 pages
ML Visualization NeurIPS Tutorial
No ratings yet
ML Visualization NeurIPS Tutorial
145 pages
ML Imp Ques 1
No ratings yet
ML Imp Ques 1
22 pages
Ia - Eda
No ratings yet
Ia - Eda
10 pages
Retrieve
No ratings yet
Retrieve
40 pages
SensMap R Package and SensMapGUI Shiny Web Application For Sensory and Consumer
No ratings yet
SensMap R Package and SensMapGUI Shiny Web Application For Sensory and Consumer
17 pages
1 s2.0 S0957417423000635 Main
No ratings yet
1 s2.0 S0957417423000635 Main
11 pages
Facial Recognition Technical Report - Group 2
No ratings yet
Facial Recognition Technical Report - Group 2
48 pages
EE769-11 Dimension Reduction
No ratings yet
EE769-11 Dimension Reduction
16 pages
Dimensionality Reduction
No ratings yet
Dimensionality Reduction
16 pages
Machine Learning
No ratings yet
Machine Learning
17 pages
Machine Learning Mock
No ratings yet
Machine Learning Mock
3 pages
Face Recognition Attendance System Based On Real-Time Video Processing
No ratings yet
Face Recognition Attendance System Based On Real-Time Video Processing
9 pages
Machine Learning MID-2 Question Bank
No ratings yet
Machine Learning MID-2 Question Bank
2 pages