0% found this document useful (0 votes)

49 views22 pages

A Survey On Learning To Hash

Uploaded by

xingyanzhou687

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

49 views22 pages

A Survey On Learning To Hash

Uploaded by

xingyanzhou687

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO.

4, APRIL 2018 769

A Survey on Learning to Hash

Jingdong Wang , Ting Zhang, Jingkuan Song , Nicu Sebe, and Heng Tao Shen

Abstract—Nearest neighbor search is a problem of finding the data points from the database such that the distances from them to the
query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this
paper, we present a comprehensive survey of the learning to hash algorithms, categorize them according to the manners of preserving
the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization,
and discuss their relations. We separate quantization from pairwise similarity preserving as the objective function is very different
though quantization, as we show, can be derived from preserving the pairwise similarities. In addition, we present the evaluation
protocols, and the general performance analysis, and point out that the quantization algorithms perform superiorly in terms of search
accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics.

Index Terms—Similarity search, approximate nearest neighbor search, hashing, learning to hash, quantization, pairwise similarity preserving,
multiwise similarity preserving, implicit similarity preserving

Ç
1 INTRODUCTION

T HE problem of nearest neighbor search, also known as

similarity search, proximity search, or close item search,
is aimed at finding an item, called nearest neighbor, which
accuracy [19], [113], developing better search schemes [29],
[29], [95], providing a similarity estimator with smaller vari-
ance [70], [55], [74], [54], smaller storage [72], [73], or faster
is the nearest to a query item under a certain distance mea- computation of hash functions [71], [74], [54], [130]. LSH has
sure from a search (reference) database. The cost of finding been adopted in many applications, e.g., fast object detec-
the exact nearest neighbor is prohibitively high in the case tion [21], image matching [17], [99]. The detailed review on
that the reference database is very large or that computing LSH can be found in [147].
the distance between the query item and the database item Learning to hash, the interest of this survey, is a data-
is costly. The alternative approach, approximate nearest dependent hashing approach which aims to learn hash func-
neighbor search, is more efficient and is shown to be enough tions from a specific dataset so that the nearest neighbor
and useful for many practical problems, thus attracting an search result in the hash coding space is as close as possible to
enormous number of research efforts. the search result in the original space, and the search cost as
Hashing, a widely-studied solution to the approximate well as the space cost are also small. The development of
nearest neighbor search, aims to transform a data item to learning to hash has been inspired by the connection between
a low-dimensional representation, or equivalently a short the Hamming distance and the distance provided from the
code consisting of a sequence of bits, called hash code. There original space, e.g., the cosine distance shown in SimHash [14].
are two main categories of hashing algorithms: locality sensi- Since the two early algorithms, semantic hashing [120], [121]
tive hashing [14], [44] and learning to hash. Locality sensitive and spectral hashing [156] that learns projection vectors
hashing (LSH) is data-independent. Following the pioneer- instead of the random projections as done in [14], learning to
ing works [14], [44], there are a lot of efforts, such as propos- hash has been attracting a large amount of research interest in
ing random hash functions satisfying the locality sensitivity computer vision and machine learning and has been applied
property for various distance measures [10], [11], [14], [19], to a wide-range of applications such as large scale object
[20], [100], [111], proving better search efficiency and retrieval [51], image classification [122] and detection [139].
The main methodology of learning to hash is similarity
preserving, i.e., minimizing the gap between the similarities
J. Wang is with Microsoft Research, Beijing 100080, P.R. China.
E-mail: [email protected]. computed/given in the original space and the similarities in
T. Zhang is with the University of Science and Technology of China, Hefei the hash coding space in various forms. The similarity in
230000, China. E-mail: [email protected]. the original space might be from the semantic (class) infor-
J. Song and H.T. Shen are with the School of Computer Science and Engi-
neering, University of Electronic Science and Technology of China,
mation, or from the distance (e.g., Euclidean distance) com-
Chengdu 610051, China. puted in the original space, which is of broad interest and
E-mail: [email protected], [email protected]. widely studied in real applications, e.g., large scale image
N. Sebe is with the Department of Information Engineering and Computer search and image classification. Hence the latter is the main
Science, University of Trento, Trento 38122, Italy.
E-mail: [email protected]. focus in this paper.
Manuscript received 21 June 2016; revised 31 Dec. 2016; accepted 11 Apr.
This survey categorizes the algorithms according to the
2017. Date of publication 1 May 2017; date of current version 13 Mar. 2018. similarity preserving manners into: pairwise similarity pre-
(Corresponding author: Heng Tao Shen.) serving, multiwise similarity preserving, implicit similarity
Recommended for acceptance by K. Weinberger. preserving, quantization which we will show is also a form
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference the Digital Object Identifier below. of pairwise similarity preserving, as well as end-to-end
Digital Object Identifier no. 10.1109/TPAMI.2017.2699960 hash learning that computes the hash codes directly from
0162-8828 ß 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
770 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

the object, e.g., image, under the deep learning framework the time-constrained search is usually transformed to
instead of first learning the representations and then com- another approximate way: terminate the search after exam-
puting the hash codes from the representations. In addition, ining a fixed number of data points.
we discuss other problems including evaluation datasets
and evaluation schemes, and so on. Meanwhile, we present 2.2 Search with Hashing
the empirical observation that the quantization approach The hashing approach aims to map the reference (and
outperforms other approaches and give some analysis about query) items to the target items so that approximate nearest
this observation. neighbor search is efficiently and accurately performed by
In comparison to other surveys on hash [145], [147], this resorting to the target items and possibly a small subset of
survey focuses more on learning to hash, and discusses the raw reference items. The target items are called hash
more on quantization-based solutions. Our categorization codes (a.k.a., hash values, or simply hashes). In this paper,
methodology is helpful for readers to understand connec- we may also call it short/compact codes interchangeably.
tions and differences between existing algorithms. In The hash function is formally defined as: y ¼ hðxÞ, where
particular, we point out the empirical observation that y is the hash code, may be an integer, or a binary value: 1
quantization is superior in terms of search accuracy, search and 0 (or 1), and hðÞ is the hash function. In the applica-
efficiency and space cost. tion to approximate nearest neighbor search, usually several
hash functions are used together to compute the compound
hash code: y ¼ hðxÞ, where y ¼ ½y1 y2 yM > and hðxÞ ¼
2 BACKGROUND ½h1 ðxÞ h2 ðxÞ hM ðxÞ> . Here we use a vector y to repre-
2.1 Nearest Neighbor Search sent the compound hash code for convenience.
Exact nearest neighbor search is defined as searching an There are two basic strategies for using hash codes to per-
item NNðqÞ (called nearest neighbor) for a query item q form nearest (near) neighbor search: hash table lookup and hash
from a set of N items X ¼ fx1 ; x2 ; . . . ; xN g so that NNðqÞ ¼ code ranking. The search strategies are illustrated in Fig. 1.
arg minx2X distðq; xÞ, where distðq; xÞ is a distance computed The main idea of hash table lookup for accelerating the search
between q and x. A straightforward generalization is is reducing the number of the distance computations. The
K-nearest neighbor search, where we need to find K nearest data structure, called hash table (a form of inverted index), is
neighbors. composed of buckets with each bucket indexed by a hash
The distance between a pair of items x and q depends on code. Each reference item x is placed into a bucket hðxÞ. Differ-
the specific nearest search problem. A typical example is that ent from the conventional hashing algorithm in computer sci-
the search (reference) database X lies in a d-dimensional ence that avoids collisions (i.e., avoids mapping two items
space Rd and P the distance is introduced by an ‘s norm, into some same bucket), the hashing approach using a hash
kx qks ¼ ð di¼1 jxi qi js Þ1=s . The search problem under the table essentially aims to maximize the probability of collision
Euclidean distance, i.e., the ‘2 norm, is widely studied. Other of near items and at the same time minimize the probability of
forms of the data item, for example, the data item is formed collision of the items that are far away. Given the query q, the
by a set, and other forms of distance measures, such as ‘1 dis- items lying in the bucket hðqÞ are retrieved as the candidates
tance, cosine similarity and so on, are also possible. of the nearest items of q. Usually this is followed by a rerank-
There exist efficient algorithms (e.g., k-d trees) for exact ing step: rerank the retrieved nearest neighbor candidates
nearest neighbor search in low-dimensional cases. In large according to the true distances computed using the original
scale high-dimensional cases, it turns out that the problem features and attain the nearest neighbors.
becomes hard and most algorithms even take higher To improve the recall, two ways are often adopted. The
computational cost than the naive solution, i.e., the linear first way is to visit a few more buckets (but with a single
scan. Therefore, a lot of recent efforts moved to searching hash table), whose corresponding hash codes are the nearest
approximate nearest neighbors: error-constrained nearest to (the hash code hðqÞ of) the query according to the distan-
(near) neighbor search, and time-constrained approximate ces in the coding space. The second way is to construct
nearest neighbor search [103], [105]. The error-constrained several (e.g., L) hash tables. The items lying in the L hash
search includes (randomized) ð1 þ Þ-approximate nearest buckets h1 ðqÞ ; . . . ; hL ðqÞ are retrieved as the candidates of
neighbor search [1], [14], [44], (approximate) fixed-radius near items of q which are possibly ordered according to the
near neighbor (R-near neighbor) search [6]. number of hits of each item in the L buckets. To guarantee
Time-constrained approximate nearest neighbor search the high precision, each of the L hash codes, yl , needs to be
limits the time spent during the search and is studied a long code. This means that the total number of the buckets
mostly for real applications, though it usually lacks an ele- is too large to index directly, and thus the buckets that are
gant theory behind. The goal is to make the search as accu- non-empty are retained by using the conventional hashing
rate as possible by comparing the returned K approximate over the hash codes hl ðxÞ.
nearest neighbors and the K exact nearest neighbors, and to The second way essentially stores multiple copies of the
make the query cost as small as possible. For example, id for each reference item. Consequently, the space cost is
when comparing the learning to hash approaches that use larger. In contrast, the space cost for the first way is smaller
linear scan based on the Hamming distance for search, it is as it only uses a single table and stores one copy of the id
typically assumed that the search time is the same for the for each reference item, but it needs to access more buckets
same code length by ignoring other small cost. When com- to guarantee the same recall with the second way. The mul-
paring the indexing structure algorithms, e.g., tree- tiple assignment scheme is also studied: construct a single
based [103], [105], [152] or neighborhood graph-based [151], table, but assign a reference item to multiple hash buckets.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 771

Fig. 1. Illustrating the search strategies. (a) Multi table lookup: the list corresponding to the hash code of the query in each table is retrieved. (b) Single
table lookup: the lists corresponding to and near to the hash code of the query are retrieved. (c) Hash code ranking: compare the query with each ref-
erence item in the coding space. (d) Non-exhaustive search: hash table lookup (or other inverted index structure) retrieves the candidates, followed
by hash code ranking over the candidates.

In essence, it is shown that the second way, multiple hash using the hash codes which are longer, providing the top can-
tables, can be regarded as a form of multiple assignment. didates subsequently reranked using the original features.
Hash code ranking performs an exhaustive search: com- Other research efforts include organizing the hash codes to
pare the query with each reference item by fast evaluating avoid the exhaustive search with a data structure, such as a
their distance (e.g., using distance table lookup or using the tree or a graph structure [104].
CPU instruction popcnt for Hamming distance) according
to (the hash code of) the query and the hash code of the ref- 3 LEARNING TO HASH
erence item, and retrieve the reference items with the small-
Learning to hash is the task of learning a (compound) hash
est distances as the candidates of nearest neighbors. Usually
function, y ¼ hðxÞ, mapping an input item x to a compact
this is followed by a reranking step: rerank the retrieved
code y, aiming that the nearest neighbor search result for a
nearest neighbor candidates according to the true distances
query q is as close as possible to the true nearest search
computed using the original features and attain the nearest
result and the search in the coding space is also efficient.
neighbors.
A learning-to-hash approach needs to consider five prob-
This strategy exploits one main advantage of hash codes:
lems: what hash function hðxÞ is adopted, what similarity in
the distance using hash codes is efficiently computed and
the coding space is used, what similarity is provided in the
the cost is much smaller than that of the distance computa-
input space, what loss function is chosen for the optimiza-
tion in the original input space.
tion objective, and what optimization technique is adopted.
Comments. Hash table lookup is mainly used in locality
sensitive hashing, and has been used for evaluating learning
3.1 Hash Function
to hash in a few publications. It has been pointed in [156] and
also observed from empirical results that LSH-based hash The hash function can be based on linear projection, kernels,
table lookup, except min-hash, is rarely adopted in reality, spherical function, (deep) neural networks, a non-paramet-
while hash table lookup with quantization-based hash codes ric function, and so on. One popular hash function is the lin-
is widely used in the non-exhaustive strategy to retrieve ear hash function, e.g., [136], [141]:
coarse candidates [50]. Hash code ranking goes through all y ¼ hðxÞ ¼ sgnðw> x þ bÞ; (1)
the candidates and thus is inferior in search efficiency com-
pared with hash table lookup which only checks a small sub- where sgnðzÞ ¼ 1 if z 5 0 and sgnðzÞ ¼ 0 (or equivalently 1)
set of candidates, which are determined by a lookup radius. otherwise, w is the projection vector, and b is the bias vari-
A practical way is to do a non-exhaustive search which is able. The kernel function,
!
suggested in [4], [50]: first retrieve a small set of candidates XT
using the inverted index that can be viewed as a hash table, y ¼ hðxÞ ¼ sgn wt Kðst ; xÞ þ b ; (2)
and then compute the distances of the query to the candidates t¼1
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
772 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

is also adopted in some approaches, e.g., [40], [66], where matrix and each diagonal entry is the weight of the corre-
fst g is a set of representative samples that are randomly sponding hash code.
drawn from the dataset or cluster centers of the dataset and Besides the Hamming distance/similarity and its var-
fwt g are the weights. The non-parametric function based on iants, the Euclidean distance is typically used in quantiza-
nearest vector assignment is widely used for quantization- tion approaches, and is evaluated between the vectors
based solutions: corresponding to the hash codes, dhij ¼ kcyi cyj k2 (symmet-
ric distance) or between the query q and the center that is
y ¼ arg mink2f1;...;Kg kx ck k2 ; (3)
the approximation to xj , dhqj ¼ kq cyj k2 (asymmetric dis-
tance, which is preferred because the accuracy is higher and
where fc1 ; . . . ; cK g is a set of centers computed by some
the time cost is almost the same). The distance is usually
algorithms, e.g., K-means, and y 2 Zþ is an integer. In con-
evaluated in the search stage efficiently by using a distance
trast to other hashing algorithms in which the distance, e.g.,
lookup table. There are also some works learning/optimiz-
Hamming distance, is often directly computed from hash
ing the distances between hash codes [37], [148] after the
codes, the hash codes generated from the nearest vector
hash codes are already computed.
assignment-based hash function are the indices of the near-
est vectors, and the distance is computed using the centers
corresponding to the hash codes. 3.3 Loss Function
The form of hash function is an important factor influenc- The basic rule of designing the loss function is to preserve
ing the search accuracy using the hash codes, as well as the the similarity order, i.e., minimize the gap between the
time cost of computing hash codes. A linear function is effi- approximate nearest neighbor search result computed from
ciently evaluated, while the kernel function and the nearest the hash codes and the true search result obtained from the
vector assignment based function lead to better search accu- input space.
racy as they are more flexible. Almost all the methods using The widely-used solution is pairwise similarity preserv-
a linear hash function can be extended to nonlinear hash ing, making the distances or similarities between a pair of
functions, such as kernelized hash functions, or neural net- items from the input and coding spaces as consistent as pos-
works. Thus we do not use the hash function to categorize sible. The multiwise similarity preserving solution, making
the hash algorithms. the orders among multiple items computed from the input
and coding spaces as consistent as possible, is also studied.
3.2 Similarity One class of solutions, e.g., spatial partitioning, implicitly
In the input space the distance doij between any pair of items preserve the similarities. The quantization-based solution
ðxi ; xj Þ could be the Euclidean distance, kxi xj k2 , or others. and other reconstruction-based solutions aim to find the
The similarity soij is often defined as a function about the dis- optimal approximation of the item in terms of the recon-
tance doij , and a typical function is the Gaussian function: struction error through a reconstruction function (e.g., in
ðdoij Þ2 the form of a lookup table in quantization or an auto-
soij ¼ gðdoij Þ ¼ expð 2s 2
Þ. There exist other similarity forms,
x> x
encoder in [120]). Besides similarity preserving items, some
such as cosine similarity kxi ki kxjj k . Besides, the semantic sim- approaches introduce bucket balance or its approximate
2 2
ilarity is often used for semantic similarity search. In this variants as extra constraints, which is also important for
case, the similarity soij is usually binary, valued 1 if the two obtaining better results or avoiding trivial solutions.
items xi and xj belong to the same semantic class, 0 (or 1)
otherwise. The hashing algorithms for semantic similarity 3.4 Optimization
usually can be applied to other distances, such as Euclidean
The challenges for optimizing the hash function parameters
distance, by defining a pseudo-semantic similarity: soij ¼ 1
lie in two main factors. One is that the problem contains the
for nearby points ði; jÞ and soij ¼ 0 (or 1) for farther points
sgn function, which leads to a challenging mixed-binary-
ði; jÞ.
integer optimization problem. The other is that the time
In the hash coding space, the typical distance dhij between
complexity is high when processing a large number of data
yi and yj is the Hamming distance. It is defined as the num-
points, which is usually handled by sampling a subset of
ber of bits where the values are different and is mathemati-
points or a subset of constraints (or equivalent basic terms
cally formulated as
in the objective functions).
X
M The ways to handle the sgn function are summarized
dhij ¼ d½yim 6¼ yjm ; below. The first way is the most widely-adopted continuous
m¼1
relaxation, including sigmoid relaxation, tanh relaxation,
and directly dropping the sign function sgnðzÞ z. The
which is equivalent to dhij ¼ kyi yj k1 if the code is valued by
relaxed problem is then solved using various standard opti-
1 and 0. The distance for the codes valued by 1 and 1 is simi-
mization techniques. The second one is a two-step
larly defined. The similarity based on the Hamming distance
scheme [76], [77] with its extension to alternative optimiza-
is defined as shij ¼ M dhij for the codes valued by 1 and 0, tion [32]: optimize the binary codes without considering the
computing the number of bits where the values are the same. hash function, followed by estimating the function parame-
The inner product shij ¼ y> i yj is used as the similarity for the ters from the optimized hash codes. The third one is discre-
codes valued by 1 and 1. These measuresP are also extended tization: drop the sign function (sgnðzÞ z) and regard the
to the weighted cases: e.g., dhij ¼ M m¼1 m d½yim ¼
6 yjm and hash code as a discrete approximation of z, which is formu-
shij ¼ y>
i L yj , where L ¼ Diagð 1 ; 2 ; . . . ; M Þ is a diagonal lated as a loss ðy zÞ2 . There also exist other ways only
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 773

adopted in a few algorithms, e.g., transforming the problem Normalized similarity-similarity divergence minimi-
into a latent structure-SVM formulation in [107], [109], the zation (NSSDM):
0 1
coordinate-descent approach in [66] (fixing all but one X
weight, optimize the original objective with respect to a sin- min KLðf shij gÞ ¼ min@
soij g; f shij A:
soij log
gle weight in each iteration), both of which do not conduct ði;jÞ2E
continuous relaxation.
Here soij and shij are normalized similarities
P o in the
input
P h space and the coding space:
s ij ¼ 1 and
3.5 Categorization ij
ij ¼ 1.
ij s
Our survey categorizes the existing algorithms to various
The following reviews these groups of algorithms except
classes: the pairwise similarity preserving class, the multi-
the distance-similarity product minimization group for
wise similarity preserving class, the implicit similarity pre-
which we are not aware of any algorithm belonging to. It
serving class, as well as the quantization class, according to
should be noted that merely optimizing the above similarity
what similarity preserving manner is adopted to formulate
preserving function, e.g., SDPM and SSPM, is not enough
the objective function. We separate the quantization class
and may lead to trivial solutions, and it is necessary to com-
from the pairwise similarity preserving class as they are
bine other constraints, which are detailed in the following
very different in formulation though the quantization class
discussion. We also point out the relation between similar-
can be explained from the perspective of pairwise similarity
ity-distance product minimization and similarity-similarity
preserving. In the following description, we may call quan-
product maximization, the relation between similarity-simi-
tization as quantization-based hashing and other algorithms
larity product maximization and similarity-similarity differ-
in which a hash function generates a binary value as binary
ence minimization, as well as the relation between distance-
code hashing. In addition, we will also discuss other studies
distance product maximization and distance-distance dif-
on learning to hash. The summary of the representative
ference minimization.
algorithms is given in Table 1.
The main reason we choose the similarity preserving
4.1 Similarity-Distance Product Minimization
manner to do the categorization is that similarity preserva-
We first introduce spectral hashing and its extensions, and
tion is the essential goal of hashing. It should be noted that
then review other forms.
as pointed in [145], [147], other factors, such as the hash
function, or the optimization algorithm, is also important
for the search performance. 4.1.1 Spectral Hashing
P
The goal of spectral hashing [156] is to minimize ði;jÞ2E soij dhij ,
4 PAIRWISE SIMILARITY PRESERVING where the Euclidean distance in the hashing space, dhij ¼
kyi yj k22 , is used for formulation simplicity and optimiza-
The algorithms aligning the distances or similarities of tion convenience, and the similarity in the input space is
a pair of items computed from the input space and the kx x k2
Hamming coding space are roughly divided to the follow- defined as: soij ¼ expð i2s2j 2 Þ. Note that the Hamming dis-
tance in the search stage can be still used for higher effi-
ing groups:
ciency as the Euclidean distance and the Hamming distance
Similarity-distance product minimization (SDPM): in the coding space are consistent: the larger the Euclidean
P distance, the larger the Hamming distance. The objective
min ði;jÞ2E soij dhij . The distance in the coding space is
expected to be smaller if the similarity in the original function can be written in a matrix form,
X
space is larger. Here E is a set of pairs of items that min soij dhij ¼ traceðYðD SÞY> Þ;
are considered. (4)
ði;jÞ2E
Similarity-similarity product maximization (SSPM):
P where Y ¼ ½y1 y2 yN is a matrix of M N, S ¼ ½soij NN
max ði;jÞ2E soij shij . The similarity in the coding space is the similarity matrix, D ¼ diagðd11 ; . . . ; dNN Þ is a diag-
P and
is expected to be larger if the similarity in the origi- onal matrix, dnn ¼ N o
i¼1 sni .
nal space is larger. There is a trivial solution to the problem (4): y1 ¼
Distance-distance product maximization (DDPM): y2 ¼ ¼ yN . To avoid it, the code balance condition is
P
max ði;jÞ2E doij dhij . The distance in the coding space is introduced: the number of data items mapped to each hash
expected to be larger if the distance in the original code is the same. Bit balance and bit uncorrelation are used
space is larger. to approximate the code balance condition. Bit balance
Distance-similarity product minimization (DSPM): means that each bit has about 50 percent chance of being 1
P
min ði;jÞ2E doij shij . The similarity in the coding space or 1. Bit uncorrelation means that different bits are uncor-
is expected to be smaller if the distance in the origi- related. The two conditions are formulated as,
nal space is larger.
Similarity-similarity difference minimization (SSDM): Y1 ¼ 0; YY> ¼ I; (5)
P
min ði;jÞ2E ðsoij shij Þ2 . The difference between the where 1 is an N-dimensional all-1 vector, and I is an identity
similarities is expected to be as small as possible. matrix of size N.
Distance-distance difference minimization (DDDM): Under the assumption of separate multi-dimensional
P
min ði;jÞ2E ðdoij dhij Þ2 . The difference between the uniform data distribution, the hashing algorithm is given as
distances is expected to be as small as possible. follows,
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
774 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

TABLE 1
A Summary of Representative Hashing Algorithms with Respect to Similarity Preserving Functions, Code Balance,
Hash Function Similarity in the Coding Space, and the Manner to Handle the sgn Function

pres. = preserving, sim. = similarity. BB = bit balance, BU = bit uncorrelation, BMIM = bit Mutual Information Minimization, BKB = bucket balance. H =
Hamming distance, WH = weighted Hamming distance, SH = spherical Hamming distance, C = Cosine, E = Euclidean distance, DNN = deep neural networks;
Drop = drop the sgn operator in the hash function, Sigmoid = Sigmoid relaxation, ½a; b = ½a; b bounded relaxation, Tanh = Tanh relaxation, Discretize = drop
the sgn operator in the hash function and regard the hash code as a discrete approximation of the hash value, Keep = optimization without relaxation for sgn,
Two-step = two-step optimization.

1) Find the principal components of the N d-dimen- Analysis. In the case the spreads along the top M PCA
sional reference data items using principal compo- directions are the same, the hashing algorithm partitions
nent analysis (PCA). each direction into two parts using the median (due to the
2) Compute the M one-dimensional Laplacian eigen- bit balance requirement) as the threshold, which is equiva-
functions with the M smallest eigenvalues along lent to thresholding at the mean value under the assump-
each PCA direction (d directions in total). tion of uniform data distributions. In the case that the true
3) Pick the M eigenfunctions with the smallest eigen- data distribution is a multi-dimensional isotropic Gaussian
values among Md eigenfunctions. distribution, the algorithm is equivalent to two quantization
4) Threshold the eigenfunction at zero, obtaining the algorithms: iterative quantization [36], [35] and isotropic
binary codes. hashing [63].
The one-dimensional Laplacian eigenfunction for the Regarding the performance, this method performs well
case of uniform distribution on ½rl ; rr is fm ðxÞ ¼ for a short hash code but poor for a long hash code. The rea-
sin ðp2 þ rrmp
r xÞ, and the corresponding eigenvalue is m ¼ son includes three aspects. First, the assumption that the
l
2 2
1 expð 2 j rrmp data follow a uniform distribution does not hold in real
rl j Þ, where m ð¼ 1; 2; . . .Þ is the frequency
and is a fixed small value. The hash function is formally cases. Second, the eigenvalue monotonously decreases with
2
written as hðxÞ ¼ sgnð sin ðp2 þ gw> xÞÞ, where g depends on rl j , which means that the PCA direction with
respect to j rrm
the frequency m and the range of the projection along the a large spread (jrr rl j) and a lower frequency (m) is pre-
direction w. ferred. Hence there might be more than one eigenfunction
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 775

picked along a single PCA direction, which breaks the where r is a hyper-parameter used as a threshold in the Ham-
uncorrelation requirement. Last, thresholding the eigen- ming space to differentiate similar pairs from dissimilar
function fm ðxÞ ¼ sin ðp2 þ rrmp
rl xÞ at zero leads to that near
pairs, is another hyper-parameter that controls the ratio of
points may be mapped to different hash values and farther the slopes for the penalties incurred for similar (or dissimilar)
points may be mapped to the same hash value. As a result, points. The hash function is in the linear form: y ¼ sgnðW> xÞ.
the Hamming distance is not well consistent to the distance The projection matrix W is estimated by transforming y ¼
in the input space. sgnðW> xÞ ¼ arg maxh0 2H h0> W> x and optimizing using struc-
Extensions. There are some extensions using PCA. (1) tured prediction with latent variables. The hyper-parameters
Principal component hashing [98] uses the principal direction r and are chosen via cross-validation.
to formulate the hash function; (2) Searching with expecta- Comments. Besides the optimization techniques, the main
tions [123] and transform coding [9] that transforms the data differences of the three representative algorithms, i.e., spec-
using PCA and then adopts the rate distortion optimization tral hashing, LDA hashing, and minimal loss hashing, are
(bits allocation) approach to determine which principal twofold. First, the similarity in the input space in spectral
direction is used and how many bits are assigned to such a hashing is defined as a continuous positive number com-
direction; (3) Double-bit quantization handles the third draw- puted from the Euclidean distance, while in LDA hashing
back in spectral hashing by distributing two bits into each and minimal loss hashing the similarity is set to 1 for a simi-
projection direction, conducting only 3-cluster quantization, lar pair and 1 for a dissimilar pair. Second, the distance in
and assigning 01, 00, and 11 to each cluster. Instead of PCA, the hashing space for formulating the objective function in
ICA hashing [39] adopts independent component analysis minimal loss hashing is different from spectral hashing and
for hashing and uses bit balance and bit mutual information LDA hashing.
minimization for code balance.
There are many other extensions in a wide range, including 4.2 Similarity-Similarity Product Maximization
similarity graph extensions [75], [179], [92], [86], [84], [79], Semi-supervised hashing [141], [142], [143] is the representa-
[128], [170], hash function extensions [40], [124], weighted tive Palgorithm in this group. The objective function is
Hamming distance [153], self-taught hashing [166], sparse max ði;jÞ2E soij shij . The similarity soij in the input space is 1 if
hash codes [177], discrete hashing [164], and so on. the pair of items xi and xj belong to the same class or are
nearby points, and 1 otherwise. The similarity in the cod-
4.1.2 Variants ing space is defined as shij ¼ y> i yj . Thus, the objective func-
Linear discriminant analysis (LDA) P hashing [136] minimizes a tion is rewritten as maximizing:
form of the loss function: min ði;jÞ2E soij dhij , where dhij ¼ X
soij y>
i yj : (8)
kyi yj k22 . Different from spectral hashing, (1) soij ¼ 1 if data ði;jÞ2E
items xi and xj are a similar pair, ði; jÞ 2 E þ , and soij ¼ 1 if
data items xi and xj are a dissimilar pair, ði; jÞ 2 E (2) a linear The hash function is in a linear form y ¼ hðxÞ ¼ sgnðW> xÞ.
hash function is used: y ¼ sgnðW> x þ bÞ, and (3) a weight a is Besides, the bit balance is also considered, and is formu-
imposed to soij dhij for the similar pair. As a result, the objective lated as maximizing the variance, traceðYY> Þ, rather than
function is written as letting the mean be 0, Y1 ¼ 0. The overall objective is to
X X maximize
a kyi yj k22 kyi yj k22 :
(6)
traceðYSY> Þ þ h traceðYY> Þ;
þ
ði;jÞ2E
ði;jÞ2E (9)
The projection matrix W and the threshold b are sepa-
rately optimized: (1) to estimate the orthogonal matrix W, subject to W> W ¼ I, which is a relaxation of the bit uncorre-
drop the sgn function in Equation (6), leading to an eigen- lation condition. The estimation of W is done by directly
value decomposition problem; (2) estimate b by minimizing dropping the sgn operator.
Equation (6) with fixed W through a simple 1D search An unsupervised extension is given in [143]: sequentially
scheme. A similar loss function, contrastive loss, is adopted compute the projection vector fwm gM m¼1 from w1 to wM by
in [18] with a different optimization technique. optimizing the problem (9). In particular, the first iteration
The loss function
P in minimal loss hashing [107] is in the computes the PCA direction as the first w, and at each of
form of min ði;jÞ2E soij dhij . Similar to LDA hashing, soij ¼ 1 the later iterations, soij ¼ 1 if nearby points are mapped to
if ði; jÞ 2 E þ and soij ¼ 1 if ði; jÞ 2 E . Differently, the dis- different hash values in the previous iterations, and soij ¼
tance is hinge-like: dhij ¼ maxðkyi yj k1 þ 1; rÞ for ði; jÞ 2 E þ 1 if far points are mapped to same hash values in the pre-
vious iterations. An extension of the semi-supervised hash-
and dhij ¼ minðkyi yj k1 1; rÞ for ði; jÞ 2 E . The intuition
ing to nonlinear hash functions is presented in [157] using
is that there is no penalty if the Hamming distances for simi-
the kernel hash function. An iterative two-step optimization
lar pairs are small enough and if the Hamming distances for
using graph cuts is given in [32]. P
dissimilar pairs are large enough. The formulation, if r is
Comments. It is interesting to note that ði;jÞ2E soij y>
i yj ¼
fixed, is equivalent to, P P
X 1 o 2
const 2 ði;jÞ2E sij kyi yj k2 ¼ const 2 ði;jÞ2E sij dij if
1 o h
min maxðkyi yj k1 r þ 1; 0Þ
y 2 f1; 1gM , where const is a constant variable (and thus
ði;jÞ2E þ
X (7) traceðYSY> Þ ¼ const traceðYðD SÞY> Þ). In this case,
þ maxðr kyi yj k1 þ 1; 0Þ; similarity-similarity product maximization is equivalent to
ði;jÞ2E similarity-distance product minimization.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
776 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

4.3 Distance-Distance Product Maximization This shows that the difference between distance-distance
The mathematical formulation
P of distance-distance product difference minimization P and distance-distance product
maximization is max ði;jÞ2E doij dhij . Topology preserving hash- maximization lies on min ði;jÞ2E ðdhij Þ2 , minimizing the dis-
ing [169] formulates the objective function by starting with tances between the data items in the hash space. This could
this rule: be regarded as a regularizer, complementary P to distance-
X X distance product maximization max ði;jÞ2E doij dhij which
doij dhij ¼ doij kyi yj k22 ¼ traceðYLd Y> Þ; (10) tends to maximize the distances between the data items in
i;j i;j the hash space.
where Ld ¼ DiagfDo 1g Do and Do ¼ ½doij NN .
In addition, similarity-distance product minimization is 4.5 Similarity-Similarity Difference Minimization
also considered Similarity-similarity difference
P minimization is mathemati-
X cally formulated as min ði;jÞ2E ðsoij shij Þ2 . Supervised hashing
sij kyi yj k22 ¼ traceðYLY> Þ: (11) with kernels [85], one representative approach in this group,
ði;jÞ2E
aims to minimize an objective function,
The overall formulation is given as follows: X 1 >
2
traceðYðLd þ aIÞY Þ > min soij y y ; (18)
max ; (12) ði;jÞ2E
M i j
traceðYLY> Þ
where soij ¼ 1 if ði; jÞ is similar, and soij ¼ 1 if it is dissimi-
where aI is introduced as a regularization term, traceðYY> Þ,
lar. y ¼ hðxÞ is a kernel hash function. Kernel reconstructive
maximizing the variances, which is the same for semi-
hashing [162] extends this technique using a normalized
supervised hashing [141] for bit balance. The problem is
Gaussian kernel similarity. Scalable graph hashing [56] uses
optimized by dropping the sgn operator in the hash function
the feature transformation to approximate the similarity
y ¼ sgnðW> xÞ and letting W> XLX> W be an identity matrix.
matrix (graph) without explicitly computing the similarity
matrix. Binary hashing [25] solves the problem using a two-
4.4 Distance-Distance Difference Minimization
step approach, in which the first step adopts semi-definite
Binary
P reconstructive embedding [66] belongs to this group: relaxation and augmented lagrangian to estimate the dis-
min ði;jÞ2E ðdoij dhij Þ2 . The Euclidean distance is used in crete labels.
both the input and coding spaces. The objective function is Comments. We have the following equation,
formulated as follows,
X
X 1 1
2
min ðsoij shij Þ2
min 2 2
kxi xj k2 kyi yj k2 : (19)
(13) ði;jÞ2E
ði;jÞ2E
2 M
X
The kernel hash function is used: ¼ min ðsoij Þ2 þ ðshij Þ2 2soij shij (20)
! ði;jÞ2E
X
Tm
X
ynm ¼ hm ðxÞ ¼ sgn wmt Kðsmt ; xÞ ; (14) ¼ min ðshij Þ2 2soij shij :
t¼1 (21)
ði;jÞ2E

where fsmt gTt¼1

m
are sampled data items, Kð; Þ is a kernel
This shows that the difference between similarity-similarity
function, and fwmt g are the weights to be learnt.
difference minimization P and similarity-similarity product
Instead of relaxing or dropping the sgn function, a coor-
maximization lies in min ði;jÞ2E ðshij Þ2 , minimizing the simi-
dinate descent optimization scheme is presented in [66]: fix
all but one weight wmt and optimize the problem (13) with larities between the data items in the hash space, intuitively
respect to wmt . There is an exact, optimal update to this letting the hash codes be as different as possible. This could
weight wmt (fixing all the other weights), which is achieved be regarded as a regularizer complementary P to similarity-
with the time complexity OðNlog N þ jEjÞ. Alternatively, a similarity product maximization max ði;jÞ2E soij shij , which
two-step solution is presented in [106]: relax the binary vari- has a trivial solution: the hash codes are the same for all
ables to ð0; 1Þ and optimize the problem via an augmented data points.
Lagrangian formulation and a nonnegative matrix factoriza- Extensions and Variants. Multi-dimensional spectral hash-
tion formulation. ing [155] uses a similar objective function, but with a
Comments. We have the following: weighted Hamming distance,
X X
min ðdoij dhij Þ2 min ðsoij y> 2
i L yj Þ ; (22)
(15)
ði;jÞ2E ði;jÞ2E
X
¼ min ðdoij Þ2 þ ðdhij Þ2 2doij dhij (16) where L is a diagonal matrix. Both L and hash codes fyi g
ði;jÞ2E need to be optimized. The algorithm for solving the prob-
X lem 22 to compute the hash codes is similar to that given
¼ min ðdhij Þ2 2doij dhij : (17)
in spectral hashing [156]. Bilinear hyperplane hashing [87]
ði;jÞ2E extends the formulation of supervised hashing with kernels
by introducing a bilinear hyperplane hashing function.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 777

Label-regularized maximum margin hashing [102] formulates The objective function is given as follows,
the objective function from three components: the similar-
ity-similarity difference, a hinge loss from the hash function, X
‘triplet hðxÞ; hðxþ Þ; hðx Þ þ traceðW> WÞ;
and the maximum margin part. ðx;xþ ;x Þ2D
2

4.6 Normalized Similarity-Similarity Divergence where hðxÞ ¼ hðx; WÞ is the compound hash function. The
Minimization problem is optimized using the algorithm similar to minimal
Spec hashing [78], belonging to this group, views each pair of loss hashing [107]. The extension to asymmetric Hamming
data items as a sample and their (normalized) similarity as distance is also discussed in [109]. Binary optimized hash-
the probability, and finds the hash functions so that the ing [18] also uses a triplet loss function, with a slight differ-
probability distributions from the input space and the cod- ent distance measure in the Hamming space and a different
ing space are well aligned. The objective function is written optimization technique.
as follows: Top rank supervised binary coding [132] presents another
X
KLðf soij g; f
shij gÞ ¼ const soij log shij : form of triplet loss in order to penalize the samples that are
(23) incorrectly ranked at the top of a Hamming-distance rank-
ði;jÞ2E
ing list more than those at the bottom.
o
Here,
P o sij is the normalized similarity in the input space, Listwise supervision hashing [146] also uses triplets of

s
ij ij ¼ 1.
s h
ij is the normalized similarity in the Hamming items. The formulation is based on a triplet tensor So
space, shijP¼ Z1 expðdhij Þ, where Z is a normalization vari- defined as follows:
able Z ¼ ij expðdhij Þ. 8
Supervised binary hash code learning [27] presents a super- <1 ifso ðqi ; xj Þ < so ðqi ; xk Þ
sijk ¼ sðqi ; xj ; xk Þ ¼ 1 ifso ðqi ; xj Þ > so ðqi ; xk Þ :
o
vised learning algorithm based on the Jensen-Shannon :
divergence which is derived from minimizing an upper 0 ifso ðqi ; xj Þ ¼ so ðqi ; xk Þ
bound of the probability of Bayes decision errors. The objective is to maximize triple-similarity-triple-similarity
product:
5 MULTIWISE SIMILARITY PRESERVING X
shijk soijk ; (25)
This section reviews the category of hashing algorithms that i;j;k
formulate the loss function by maximizing the agreement of
the similarity orders over more than two items computed where shijk is a ranking triplet computed in the coding
from the input space and the coding space. space using the cosine similarity, shijk ¼ sgnðhðqi Þ> hðxj Þ
Order preserving hashing [150] aims to learn hash func- hðqi Þ> hðxk ÞÞ. Through dropping the sgn function, the
tions through aligning the orders computed from the origi- objective function is transformed to
nal space and the ones in the coding space. Given a data X
point xn , the database points X are divided into ðM þ 1Þ cat- hðqi Þ> ðhðxj Þ hðxk ÞÞsoijk ; (26)
egories, ðChn0 ; Chn1 ; . . . ; ChnM Þ, where Chnm corresponds to the i;j;k
items whose distances to the given point are m, and
ðCon0 ; Con1 ; . . . ; ConM Þ, using the distances in the hashing space which is solved by dropping the sgn operator in the hash
and the distances in the input (original) space, respectively. function hðxÞ ¼ sgnðW> xÞ.
ðCon0 ; Con1 ; . . . ; ConM Þ is constructed such that in the ideal case Comments. Order preserving hashing considers the rela-
the probability of assigning an item to any hash code is the tion between the search lists while triplet loss hashing and
same. The basic objective function maximizing the align- listwise supervision hashing consider triplewise relation.
ment between the two categories is given as follows: The central ideas of triplet loss hashing and listwise super-
vision hashing are very similar, and their difference lies in
X X
M how to formulate the loss function besides the different
LðhðÞ; X Þ ¼ ðjConm Chnm j þ jChnm Conm jÞ; optimization techniques they adopted.
n2f1;...;Ng m¼0

where jConm Chnm j is the cardinality of the difference of the 6 IMPLICIT SIMILARITY PRESERVING
two sets. The linear hash function hðxÞ is used and dropping We review the category of hashing algorithms that focus on
the sgn function is adopted for optimization. pursuing effective space partitioning without explicitly eval-
Instead of preserving the order, KNN hashing [23] directly uating the relation between the distances/similarities in the
maximizes the KNN accuracy of the search result, which is input and coding spaces. The common idea is to partition
solved by using the factorized neighborhood representation the space, formulated as a classification problem, with the
to parsimoniously model the neighborhood relationships maximum margin criterion or the code balance condition.
inherent in the training data. Random maximum margin hashing [61] learns a hash func-
Triplet loss hashing [109] formulates the hashing problem tion with the maximum margin criterion. The point is that
by maximizing the similarity order agreement defined over the positive and negative labels are randomly generated by
triplets of items, fðx; xþ ; x Þg, where the pair ðx; x Þ is less randomly sampling N data items and randomly labeling
similar than the pair ðx; xþ Þ. The triplet loss is defined as half of the items with 1 and the other half with 1. The for-
mulation is a standard SVM formulation that is equivalent
‘triplet ðy; yþ ; y Þ ¼ maxð1 ky y k1 þ ky yþ k1 ; 0Þ: (24) to the following form,
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
778 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

( ) X
1 4 min 2 ðjxj zj j22 þ jxi zi j22 Þ (33)
max min min ðw> xþ þ bÞ; min ðw> x bÞ ;
kwk2 i¼1;...;N
2
i
i¼1;...;N
2
i i;j2f1;2;...;Ng

X
where fxþ
i g are the positive samples and fxi g are the nega- ¼ min 4 jxi zi j22 : (34)
tive samples. Note that this is different from PICODES [7] i2f1;2;...;Ng
as random maximum margin hashing adopts the hyper-
planes learnt from SVM to form the hash functions while This means that the distance-distance difference minimiza-
PICODES [7] exploits the hyperplanes to check whether the tion rule is transformed to minimizing its upper-bound, the
hash codes are semantically separable rather than forming quantization error (as shown in Equation (34)), which is
hash functions. described as a theorem below.
Complementary projection hashing [60], similar to comple- Theorem 1. The distortion error in the quantization approach is
mentary hashing [160], finds the hash function such that an upper bound (with a scale) of the differences between the
the items are as far away as possible from the partition pairwise distances computed from the input features and from
plane corresponding to the hash function. It is formulated the approximate representations.
as Hð jw> x þ bjÞ, where HðÞ ¼ 12 ð1 þ sgnðÞÞ is the unit
step function. Moreover, the bit balance condition, Y1 ¼ 0, The quantization approach for hashing is roughly
and the bit uncorrelation condition, the non-diagonal divided into two main groups: hypercubic quantization, in
entries in YY> are 0, are considered. An extension is also which the approximation z is equal to the hash code y, and
given by using the kernel hash function. In addition, when Cartesian quantization, in which the approximation z corre-
learning the mth hash function, the data item is weighted sponds to a vector formed by the hash code y, e.g., y repre-
by a variable, which is computed according to the previ- sents the index of a candidate approximation among a set of
ously computed ðm 1Þ hash functions. candidate approximations. In addition, we will review the
Spherical hashing [41] uses a hypersphere to partition related reconstruction-based hashing algorithms.
the space. The spherical hash function is defined as hðxÞ ¼ 1
if dðp; xÞ 4 t and hðxÞ ¼ 0 otherwise. The compound hash 7.1 Hypercubic Quantization
function consists of M spherical functions, depending on M Hypercubic quantization refers to a category of algorithms
pivots fp1 ; . . . ; pM g and M thresholds ft1 ; . . . ; tM g. The dis- that quantize a data item to a vertex in a hypercubic, i.e., a
tance in the coding space is defined based on the distance: vector belonging to a set f½y1 y2 yM > j ym 2 f1; 1gg or
ky1 y2 k1
yT1 y2
. Unlike the pairwise and multiwise similarity pre- the rotated hypercubic vertices. It is in some sense related to
serving algorithms, there is no explicit function penalizing 1-bit compressive sensing [8]: Its goal is to design a mea-
the disagreement of the similarities computed in the input surement matrix A and a recovery algorithm such that a k-
and coding spaces. The M pivots and thresholds are learnt sparse unit vector x can be efficiently recovered from the
such that it satisfies a pairwise bit balance condition: sign of its linear measurements, i.e., b ¼ sgnðAxÞ, while
jfx j hm ðxÞ ¼ 1gj ¼ jfx j hm ðxÞ ¼ 0gj, and jfxjhi ðxÞ ¼ b1 ; hypercubic quantization aims to find the matrix A which is
hj ðxÞ ¼ b2 gj ¼ 14 jX j; b1 ; b2 2 f0; 1g; i 6¼ j. usually a rotation matrix, and the codes b, from the input x.
The widely-used scalar quantization approach with only
7 QUANTIZATION one bit assigned to each dimension can be viewed as a
hypercubic quantization approach, and can be derived by
The following provides a simple derivation showing that minimizing
the quantization approach can be derived from the dis-
tance-distance difference minimization criterion. There is a jjxi yi jj22 ; (35)
similar statement in [50] obtained from the statistical per- subject to yi 2 f1; 1g. The local digit coding approach [64]
spective: the distance reconstruction error is statistically also belongs to this category.
bounded by the quantization error. Considering two points
xi and xj and their approximations zi and zj , we have
7.1.1 Iterative Quantization
jdoij dhij j (27) Iterative quantization [35], [36] preprocesses the centralized
data, by reducing the dimension using PCA to M dimensions,
v ¼ P> x, where P is a matrix of size d M (M 4 d) computed
¼ jjxi xj j2 jzi zj j2 j (28)
using PCA, and then finds an optimal rotation R followed by
a scalar quantization. The formulation is given as,
¼ jjxi xj j2 jxi zj j2 þ jxi zj j2 jzi zj j2 j (29)
minkY R> Vk2F ; (36)
4jjxi xj j2 jxi zj j2 j þ jjxi zj j2 jzi zj j2 j (30)
where R is a matrix of M M, V ¼ ½v1 v2 vN and Y ¼
½y1 y2 yN .
4jxj zj j2 þ jxi zi j2 : (31)
The problem is solved via alternative optimization. There
Thus, jdoij dhij j2 4 2ðjxj zj j22 þ jxi zi j22 Þ, and are two alternative steps. Fixing R, Y ¼ sgnðR> VÞ. Fixing B,
X the problem becomes the classic orthogonal Procrustes
min jdoij dhij j2 (32) problem, and the solution is R ¼ SS ^ > , where S and S ^ are
> > >
i;j2f1;2;...;Ng obtained from the SVD of YV , YV ¼ SL LS .
^
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 779

Comments. We present an integrated objective function 7.2 Cartesian Quantization

that is able to explain the necessity of PCA dimension reduc- Cartesian quantization refers to a class of quantization algo-
tion. Let y
be a d-dimensional vector, which is a concatenated rithms in which the composed dictionary C is formed from
vector from y and an all-zero subvector: y ¼ ½y> 0:::0> . The a Cartesian product of a set of small source dictionaries
integrated objective function is written as follows: fC1 ; C2 ; . . . ; CP g: C ¼ C1 C2 CP ¼ fðc1i1 ; c2i2 ; . . . ; cPiP Þg,
where Cp ¼ fcp0 ; cp2 ; . . . ; cpðKp 1Þ g, ip 2 f0; 1; . . . ; Kp 1g.
> Xk2 ;
R The P benefits include that (1) P small dictionaries, with
minkY (37)
F totally Pp¼1 QKp dictionary items, generate a larger dictio-
nary with Pp¼1 Kp dictionary items; (2) the (asymmetric)
where Y ¼ ½
y1 y
2 y
N , X ¼ ½x1 x2 xN , and R
is a rotation
distance from a query q to the composed dictionary item
matrix of d d. Let P be the projection matrix of d d,

ðc1i1 ; c2i2 ; . . . ; cPiP Þ (an approximation of a data item) is com-
computed using PCA, P ¼ ½PP? , and P? is a matrix
puted from the distances fdistðq; c1i1 Þ; . . . ; distðq; cPiP Þg
of d ðd MÞ. It can be derived that, the solutions for y of through a sum operation, thus the cost of the distance com-
the two problems in (37) and (36) are the same, and R ¼
putation between a query and a data item is OðP Þ, if the
diagðR; IðdMÞðdMÞ Þ.
P distances between the query and the source dictionary items
are precomputed; and (3) the query cost with a set of N
7.1.2 Extensions and Variants database items is reduced from OðNdÞ to OðNP Þ through
Harmonious hashing [159] modifies iterative quantization looking up a distance table which is efficiently computed
by adding an extra constraint: YY> ¼ sI. The problem is between the query and the P source dictionaries.
solved by relaxing Y to continuous values: fixing R, let
R> V ¼ UL LV> , then Y ¼ s 1=2 UV> ; fixing Y, R ¼ SS^ > , where 7.2.1 Product Quantization
S and S ^ are obtained from the SVD of YV , YV> ¼ SL
> ^>.
LS Product quantization [50], which initiates the quantization-
The hash function is finally computed as y ¼ sgnðR> vÞ. based compact coding solution to similarity search, forms
Isotropic hashing [63] finds a rotation following PCA the P source dictionaries by dividing the feature space into P
preprocessing such that R> VV> R ¼ S becomes a matrix disjoint subspaces, accordingly dividing the database into P
S11 ¼ ½S
with equal diagonal values, i.e., ½S S22 ¼ ¼ ½S
SMM . sets, each set consisting of N subvectors fxp1 ; . . . ; xpN g, and
The objective function is written as kR> VV> R ZkF ¼ 0, then quantizing each subspace separately into (usually
where Z is a matrix with all the diagonal entries equal to an K1 ¼ K2 ¼ ¼ KP ¼ K) clusters. Let fcp1 ; cp2 ; . . . ; cpK g be
unknown variable s. The problem can be solved by two the cluster centers of the pth subspace. The operation forming
algorithms: lift and projection, and gradient flow. an item in the dictionary from a P-tuple ðc1i1 ; c2i2 ; . . . ; cPiP Þ is
> >
Comments. The goal of making the variances along the M the concatenation ½c> >
1i1 c2i2 cPiP . A data point assigned to
directions being the same is to make the bits in the hash the nearest dictionary item ðc1i1 ; c2i2 ; . . . ; cPiP Þ is represented
codes equally contributing to the distance evaluation. In the by a compact code ði1 ; i2 ; . . . ; iP Þ, whose length is P log 2 K.
case that the data items satisfy the isotropic Gaussian distri- The distance distðq; cpip Þ between a query q and the dictionary
bution, the solution from isotropic hashing is equivalent to element in the pth dictionary is computed as kqp cpip k22 ,
iterative quantization. where qp is the subvector of q in the pth subspace.
Similar to iterative quantization, the PCA preprocessing Mathematically, product quantization can be viewed as
in isotropic hashing is also interpretable: finding a global minimizing the following objective function,
rotation matrix R such that the first M diagonal entries of
¼R
S > XX> R
are equal, and their sum is as large as possi- X
N
min kxn Cbn k22 : (40)
ble, which is formally written as follows: C;fbn g
n¼1

X
M
Here C is a matrix of d PK in the form of
max ;
½S (38)
mm
2 3
m¼1
C1 0 0
6 0 C2 0 7
6 7
> C ¼ diagðC1 ; C2 ; . . . ; CP Þ ¼ 6 .. .. . . .. 7;
s:t:½S mm ¼ s; m ¼ 1; . . . ; M; R R ¼ I: (39) 4 . . . . 5
0 0 CP
Other extensions include cosine similarity preserving
> >
quantization (Angular quantization [34]), nonlinear embed- where Cp ¼ ½cp1 cp2 cpK . bn ¼ ½b> >
n1 bn2 bnP is the com-
ding replacing PCA embedding [46] [175], matrix hash- position vector, and its subvector bnp of length K is an indi-
ing [33], and so on. Quantization is also applied to cator vector with only one entry being 1 and all others being
supervised problems: Supervised discrete hashing [125], [127], 0, showing which element is selected from the pth source
[168], [170], presents an SVM-like formulation to minimize dictionary for quantization.
the quantization loss and the classification loss in the hash Extensions. Distance-encoded product quantization [42]
coding space, and jointly optimize the hash function param- extends product quantization by encoding both the cluster
eters and the SVM weights. Intuitively, the goal of these index and the distance between the cluster center and the
methods is that the hash codes are semantically separable, point. The cluster index is encoded in a way similar to that
which is guaranteed through maximizing the classification in product quantization. The way of encoding the distance
performance. between a point and its cluster center is as follows: the
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
780 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

points belonging to one cluster are partitioned (quantized)

according to the distances to the cluster center, the points in
each partition are represented by the corresponding parti-
tion index, and accordingly the distance of each partition to
the cluster center is also recorded with the partition index.
Cartesian k-means [108] and optimized production quantiza-
tion [31] extend product quantization and introduce a rota-
tion R into the objective function,
Fig. 2. 2D toy examples illustrating the quantization algorithms. The
X
N space partitioning results are generated by (a) product quantization, (b)
min kR> xn Cbn k22 : (41) Cartesian k-means, and (c) composite quantization. The space partition
R;C;fbn g from composition quantization is more flexible.
n¼1

The introduced rotation does not affect the Euclidean dis- explained from the view of sparse coding, as pointed in [2],
tance as the Euclidean distance is invariant to the rotation, [138], [171]: the dictionary (fCp g) in composite quantization
and helps to find an optimized subspace partition for quan- (product quantization and Cartesian k-means) satisfies the
tization. Locally optimized product quantization [62] applies constant (orthogonality) constraint, and the sparse codes
optimized production quantization to the search algorithm (fbn g) are 0 and 1 vectors where there is only one 1 for each
with the inverted index, where there is a quantizer for each subvector corresponding to a source dictionary.
inverted list. Comments. As discussed in product quantization [50], the
idea of using the summation of several dictionary items as
7.2.2 Composite Quantization an approximation of a data item has already been studied in
In composite quantization [171], the operation forming an the signal processing research area, known as multi-stage
item in the dictionary from a P-tuple ðc1i1 ; c2i2 ; . . . ; cPiP Þ is vector quantization, residual quantization, or more generally
P structured vector quantization [38], and recently re-devel-
the summation Pp¼1 cpip . In order to compute the distance
from a query q to the composed dictionary item formed oped for similarity search under the Euclidean distance
by ðc1i1 ; c2i2 ; . . . ; cPiP Þ from the distances fdistðq; c1i1 Þ; . . . ; (additive quantization [2], [149], and tree quantization [3]
distðq; cPiP Þg, a constraint is introduced: the summation of modifying additive quantization by introducing a tree-
the inner products of all pairs of elements that are used to structure sparsity) and inner product [26].
approximate the vector xn but from different dictionaries,
PP PP 7.2.3 Variants
i¼1 j¼1;6¼i cikin cjkjn , is constant.
The problem is formulated as The work in [37] presents an approach to compute the
source dictionaries given the M hash functions fhm ðxÞ ¼
XN
min kxn ½C1 C2 CP bn k22 bm ðgm ðxÞÞg, where gm ðÞ is a real-valued embedding function
fCp g;fbn g;
n¼1 and bm ðÞ is a binarization function, for a better distance
X
P X
P measure, quantization-like distance, instead of Hamming or
s:t: b> >
ni Ci Cj bnj ¼ ; weighted Hamming distance. It computes M dictionaries,
j¼1 i¼1;i6¼j (42) each corresponding to a hash bit and being computed as
> >
bn ¼ ½b> >
n1 bn2 bnP ;

bnp 2 f0; 1gK ; kbnp k1 ¼ 1; gkb ¼ Eðgk ðxÞjbk ðgk ðxÞÞ ¼ bÞ; (43)
n ¼ 1; 2; . . . ; N; p ¼ 1; 2; . . . P:
where b 2 f0; 1g. The distance computation cost is OðMÞ
Here, Cp is a matrix of size d K, and each column corre- through looking up a distance table, which can be acceler-
sponds to an element of the pth dictionary Cp . ated by dividing the hash functions into groups (e.g., each
Sparse composite quantization [172] improves composite group contains 8 functions, thus the cost is reduced to
quantization
PP PK by constructing a sparse dictionary, OðM8 Þ), building a table (e.g., consisting of 256 entries) per
p¼1 k¼1 kcpk k0 4 S, with S being a parameter controlling group instead of per hash function, and forming a larger
the sparsity degree, resulting in a great reduction of the dis- distance lookup table. In contrast, optimized code ranking [148]
tance table computation cost. directly estimates the distance table rather than computing
Connection with Product Quantization. It is shown in [171] it from the estimated dictionary.
that both product quantization and Cartesian k-means can Composite quantization [171] points to relation between
be regarded as constrained versions of composite quantiza- Cartesian quantization and sparse coding. This indicates
tion. Composite quantization attains smaller quantization the application of sparse coding to similarity search.
errors, yielding better search accuracy with similar search Compact sparse coding [15], the extension of robust sparse
efficiency. A 2D illustration of the three algorithms is given coding [16], adopts sparse codes to represent the database
in Fig. 2, where 2D points are grouped into 9 groups. It is items: the atom indices corresponding to nonzero codes,
observed that composition quantization is more flexible which is equivalent to letting the hash bits associated with
in partitioning the space and thus the quantization error is nonzero codes be 1 and 0 for zero codes, are used to build
possibly smaller. the inverted index, and the nonzero coefficients are used to
Composite quantization, product quantization, and reconstruct the database items and calculate the distances
Cartesian k-means (optimized product quantization) can be between the database items and the query. Anti-sparse
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 781

coding [52] aims to learn a hash code so that non-zero ele- algorithms [30], [67], [158], [174]. Typically, these
ments in the hash code are as many as possible. approaches except [67] simultaneously learn the representa-
tion using a deep neural network and the hashing function
7.3 Reconstruction under some loss functions, rather than separately learn the
We review a few reconstruction-based hashing approaches. features and then learn the hash functions.
Essentially, quantization can be viewed as a reconstruction The methodology is similar to other learning to hash
approach for a data item. Semantic hashing [120], [121] gener- algorithms that do not adopt deep learning, and the hash
ates the hash codes using the deep generative model, a function is more general and could be a deep neural net-
restricted Boltzmann machine (RBM), for reconstructing the work. We provide here a separate discussion because this
data item. As a result, the binary codes are used for finding area is relatively new. However, we will not discuss seman-
similar data. A variant method proposed in [13] recon- tic hashing [120] which is usually not thought as a feature
structs the input vector from the binary codes, which is learning approach but just a hash function learning
effectively solved using the auxiliary coordinates algorithm. approach. In general, almost all non-deep-learning hashing
A simplified algorithm [5] finds a binary hash code that can algorithms if the similarity order (e.g., semantic similarity)
be used to effectively reconstruct the vector through a linear is given, can be extended to deep learning based hashing
transformation. algorithms. In the following, we discuss the deep learning
based algorithms and also categorize them according to
8 OTHER TOPICS their similarity preserving manners.

Most hashing learning algorithms assume that the similarity Pairwise similarity preserving. The similarity-simi-
information in the input space, especially the semantic similarity difference minimization criterion is adopted
larity information, and the database items have already been in [158]. It uses a two-step scheme: the hash codes are
given. There are some approaches to learn hash functions computed by minimizing the similarity-similarity
without such assumptions: active hashing [176] that actively difference without considering the visual informa-
selects the pairs which are most informative for hash function, and then the image representation and hash
tion learning and labels them for further learning, online hash- function are jointly learnt through deep learning.
ing [43], smart hashing [163], online sketching hashing [69], Multiwise similarity preserving. The triplet loss is
and online adaptive hashing [12], which learn the hash func- used in [67], [174], which adopt the loss function
tions when the similar/dissimilar pairs come sequentially. defined in Equation (24) (1 is dropped in [67]).
The manifold structure in the database is exploited for Quantization. Following the scalar quantization
hashing, which is helpful for semantic similarity search, approach, deep hashing [80] defines a loss to penalize
such as locally linear hashing [46], spline regression hashing [93], the difference between the binary hash codes (see
and inductive manifold hashing [126]. Multi-table hashing, Equation (35)) and the real values from which a linear
aimed at improving locality sensitive hashing, is also stud- projection is used to generate the binary codes, and
ied, such as complementary hashing [160] and its multi-view introduces the bit balance and bit uncorrelation
extension [91], reciprocal hash tables [90] and its query-adap- conditions.
tive extension [88], and so on.
There are some works extending the Hamming distance. 8.2 Fast Search in the Hamming Space
In contrast to multi-dimensional spectral hashing [155] in The computation of the Hamming distance is shown much
which the weights for the weighted Hamming distance are faster than the computation of the distance in the input
the same for arbitrary queries, the query-dependent dis- space. It is still expensive, however, to handle a large scale
tance approaches learn a distance measure whose weights data set using linear scan. Thus, some indexing algorithms,
or parameters depend on a specific query. Query adaptive which are shown effective and efficient for general vectors,
hashing [81], a learning-to-hash version extended from are borrowed for the search in the Hamming space. For
query adaptive locality sensitive hashing [48], aims to select example, min-hash, a kind of LSH, is exploited to search
the hash bits (thus hash functions forming the hash bits) over high-dimensional binary data [129]. In the following,
according to the query vector. Query-adaptive class-specific bit we discuss other representative algorithms.
weighting [57], [58] presents a weighted Hamming distance Multi-index hashing [110] and its extension [133] aim to
measure by learning the class-specific bit weights from the partition the binary codes into M disjoint substrings and
class information of the query. Bits reconfiguration [101] aims build M hash tables each corresponding to a substring,
to learn a good distance measure over the hash codes pre- indexing all the binary codes M times. Given a query, the
computed from a pool of hash functions. method outputs the NN candidates which are near to the
The following reviews three research topics: joint feature query at least in one hash table. FLANN-binary [104] extends
and hash learning with deep learning, fast search in the Ham- the FLANN algorithm [103] that was initially designed for
ming space replacing the exhaustive search, and the impor- ANN search over real-value vectors to search over binary
tant application of Cartesian quantization to inverted index. vectors. The key idea is to build multiple hierarchical cluster
trees to organize the binary vectors and to search for the
8.1 Joint Feature and Hash Learning via Deep nearest neighbors simultaneously over the multiple trees by
Learning traversing each tree in a best-first manner.
The great success in deep neural network for representation PQTable [97] extends multi-index hashing from the
learning has inspired a lot of deep compact coding Hamming space to the product-quantization coding space,
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
782 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

for fast exact search. Unlike multi-index hashing flipping in the case the Hamming distance in hash code ranking is
the bits in the binary codes to find candidate tables, used in the coding space, it is not necessary to report the
PQTable adopts the multi-sequence algorithm [4] for effi- search time costs because they are the same. It is necessary
ciently finding candidate tables. The neighborhood graph- to report the search time cost when a non-Hamming dis-
based search algorithm [144] for real-value vectors is tance or the hash table lookup scheme is used.
extended to the Hamming space [59]. The search quality is measured using recall@R (i.e., a
recall-R curve). For each query, we retrieve its R nearest
8.3 Inverted Multi-Index items and compute the ratio of the true nearest items in the
Hash table lookup with binary hash codes is a form of retrieved R items to T , i.e., the fraction of T ground-truth
inverted index. Retrieving multiple hash buckets from nearest neighbors are found in the retrieved R items. The
multiple hash tables is computationally cheaper compared average recall score over all the queries is used as the mea-
with the subsequent reranking step using the true distance sure. The ground-truth nearest neighbors are computed
computed in the input space. It is also cheap to visit more over the original features using linear scan. Note that the
buckets in a single table if the standard Hamming distance recall@R is equivalent to the accuracy computed after reor-
is used, as the nearby hash codes of the hash code of the dering the R retrieved nearest items using the original fea-
query which can be obtained by flipping the bits of the tures and returning the top T items. In the case where the
hash code of the query. If there are a lot of empty buckets, linear scan cost in the hash coding space is not the same
which increase the retrieval cost, the double-hash scheme (e.g., binary code hashing, and quantization-based hashing),
or the fast search algorithm in the Hamming space, e.g., the curve in terms of search recall and search time cost is
[104], [110] can be used to fast retrieve the hash buckets. usually reported.
Thanks to the multi-sequence algorithm, the Cartesian The semantic similarity search, a variant of nearest
quantization algorithms are also applied to the inverted neighbor search, sometimes uses the precision, the recall,
index [4], [172], [31] (called inverted multi-index), in which the precision-recall curve, and mean average precision
each composed quantization center corresponds to an (mAP). The precision is computed at the retrieved position
inverted list. Instead of comparing the query with all the R, i.e., R items are retrieved, as the ratio of the number of
composed quantization centers, which is computationally retrieved true positive items to R. The recall is computed,
expensive, the multi-sequence algorithm [4] is able to effi- also at position R, as the ratio of the number of retrieved
ciently produce a sequence of (T ) inverted lists ordered by true positive items to the number of all true positive items
the increasing distances between the query and the com- in the database. The pairs of recall and precision in the
posed quantization centers, whose cost is OðT log T Þ. The precision-recall curve are computed by varying the
study (Fig. 5 in [151]) shows that the time cost of the multi- retrieved position R. The mAP score is computed as
sequence algorithm, when retrieving 10K candidates over follows: the average precision for a query, the PNarea under
the two datasets: SIFT1M and GIST1M, is the smallest com- the precision-recall curve is computed as t¼1 P ðtÞDðtÞ,
pared with other non-hashing inverted index algorithms. where P ðtÞ is the precision at cut-off t in the ranked list
Though the cost of the multi-sequence algorithm is and DðtÞ is the change in recall from items t 1 to t; the
greater than that with binary hash codes, both are relatively mean of average precisions over all the queries is com-
small and negligible compared with the subsequent rerank- puted as the final score.
ing step that is often conducted in real applications. Thus
the quantization-based inverted index (hash table) is more 9.2 Evaluation Datasets
widely used compared with the conventional hash tables The widely-used evaluation datasets have different scales
with binary hash codes. from small, large, to very large. Various features have been
used, such as SIFT features [94] extracted from Photo-tour-
ism [131] and Caltech 101 [28], GIST features [112] from
9 EVALUATION PROTOCOLS LabelMe [119] and Peekaboom [140], as well as some fea-
9.1 Evaluation Metrics tures used in object retrieval: Fisher vectors [116] and
There are three main concerns for an approximate nearest VLAD vectors [51]. The following presents a brief introduc-
neighbor search algorithm: space cost, search efficiency, tion to several representative datasets, which is summarized
and search quality. The space cost for hashing algorithms in Table 2.
depends on the code length for hash code ranking, and the MNIST [68] includes 60K 784-dimensional raw pixel fea-
code length and the table number for hash table lookup. tures describing grayscale images of handwritten digits as a
The search performance is usually measured under the reference set, and 10K features as the queries.
same space cost, i.e., the code length (and the table number) SIFT10K [50] consists of 10K 128-dimensional SIFT vec-
is chosen the same for different algorithms. tors as the reference set, 25K vectors as the learning set, and
The search efficiency is measured as the time taken to 100 vectors as the query set. SIFT1M [50] is composed of
return the search result for a query, which is usually com- 1M 128-dimensional SIFT vectors as the reference set, 100K
puted as the average time over a number of queries. The vectors as the learning set, and 10K as the query set. The
time cost often does not include the cost of the reranking learning sets in SIFT10K and SIFT1M are extracted from
step (using the original feature representations) as it is Flicker images and the reference sets and the query sets are
assumed that such a cost given the same number of candi- from the INRIA holidays images [49].
dates does not depend on the hashing algorithms and can GIST1M [50] consists of 1M 960-dimensional GIST vec-
be viewed as a constant. When comparing the performance tors as the reference set, 50K vectors as the learning set,
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 783

TABLE 2 validation criterion is not the objective function value but

A Summary of Evaluation Datasets the search performance.
Dim Reference set Learning set Query set
10 PERFORMANCE ANALYSIS
MNIST 784 60K - 10K
SIFT10K 128 10K 25K 100 10.1 Query Performance
SIFT1M 128 1M 100K 10K We summarize empirical observations and the analysis of
GIST1M 960 1M 50K 1K the nearest neighbor search performance using the compact
Tiny1M 384 1M - 100K coding approach, most of which have already been men-
SIFT1B 128 1B 100M/1M 10K tioned or discussed in the existing works. We discuss about
GloVe1:2M 200 1:2M - 10K both hash table lookup and hash code ranking, with more
focus on hash code ranking because the major usage of the
learning to hash algorithms lies in hash code ranking for
1K vectors as the query set. The learning set is extracted retrieving top candidates from a set of candidates obtained
from the first 100K images from the tiny images [137]. The from the inverted index or other hash table lookup algo-
reference set is from the Holiday images combined with rithms. The analysis is mainly focusing on the major
Flickr1M [49]. The query set is from the Holiday image application of hashing: nearest neighbor search with the
queries. Tiny1M [152]1 consists of 1M 384-dimensional Euclidean distance. The conclusion for semantic similarity
GIST vectors as the reference set and 100K vectors as the search is similar in principle and the performance also
query set. The two sets are extracted from the 1100K tiny depends on the ability of representing the semantic mean-
images. ing of the input features. We also present empirical results
SIFT1B [53] includes 1B 128-dimensional BYTE-valued of the quantization algorithms and the representative binary
SIFT vectors as the reference set, 100M vectors as the learn- coding algorithms for hash code ranking.
ing set and 10K vectors as the query set. The three sets are
extracted from around 1M images. This dataset, and 10.1.1 Query Performance with Hash Table Lookup
SIFT10K, SIFT1M and GIST1M are publicly available2.
We give a performance summary of the query scheme using
GloVe1:2M [115]3 contains 1,193,514 200-dimensional
hash table lookup for the two main hash algorithms: the
word feature vectors extracted from Tweets. We randomly
binary hash codes and the quantization-based hash codes.
sample 10K vectors as the query set and use the remaining
In terms of space cost, hash table lookup with binary
as the reference set.
hash codes has a little but negligible advantage over that
9.3 Training Sets and Hyper-Parameters Selection with quantization-based hash codes because the main space
There are three main choices of the training set over which cost comes from the indices of the reference items and the
the hash functions are learnt for learning-to-hash algo- extra cost from the centers corresponding to the buckets
rithms. The first choice is a separate set used for learning using quantization is relatively small. Multi-assignment
hash functions, which is not contained in the reference set. and multiple hash tables increase space cost as they require
The second choice is to sample a small subset from the refer- to store multiple copies of reference vector indices. As an
ence set. The third choice is to use all the reference set to alternative choice, single-assignment with a single table can
train hash functions. The query set and the reference set are be used but more buckets are retrieved for high recall.
then used to evaluate the learnt hash functions. When retrieving the same number of candidates, hash
In the case where the query is transformed to a hash table lookup using binary hash codes is better in terms
code, e.g., when adopting the Hamming distance for most of the query time cost, but inferior to the quantization
binary hash algorithms, learning over the whole reference approach in terms of the recall, which has probably been
set might lead to over-fitting and the performance might be first discussed in [114]. In terms of recall versus time cost
worse than learning with a subset of the reference set or a the quantization approach is overall superior as the cost
separate set. In the case where the raw query is used with- from the multi-sequence algorithm is relatively small and
out any processing, e.g., when adopting the asymmetric dis- negligible compared with the subsequent reranking step,
tance in Cartesian quantization, learning over the whole which is observed from our experience, and can be derived
reference set is better as it results in better approximation of from [103] and [4]. In general, the performance for other
the reference set. algorithms based on the weighted Hamming distance and
There are some hyper-parameters in the objective func- the learnt distance is in between. The observation holds for
tions, e.g, the objective functions in minimal loss hash- a single table with single assignment and multiple assign-
ing [107] and composite quantization [171]. It is unfair ment, or multiple tables.
and not suggested to select the hyper-parameters corre-
sponding to the best performance over the query set. It is 10.1.2 Query Performance with Hash Code Ranking
suggested instead to select the hyper-parameters by vali- The following provides a short summary of the overall per-
dation, e.g., sampling a subset from the reference set as formance for three main categories: pairwise similarity pre-
the validation set which is reasonable because the serving, multiwise similarity preserving, and quantization in
terms of search cost and search accuracy under the same
1. http://research.microsoft.com/jingdw/SimilarImageSearch/ space cost, guaranteed by coding the items using the same
NNData/NNdatasets.html
2. http://corpus-texmex.irisa.fr/ number of bits, ignoring the small space cost of the dictionary
3. http://nlp.stanford.edu/projects/glove/ in Cartesian quantization and the distance lookup tables.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
784 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

TABLE 3
A Summary of Query Performance Comparison for Approximate
Nearest Neighbor Search Under the Euclidean Distance

Accuracy Efficiency Overall

pairwise low high low
multiwise fair high fair
quantization high fair high

Search Accuracy. Multiwise similarity preserving is better

than pairwise similarity preserving as it considers more
information for hash function learning. There is no observa-
tion/conclusion on which algorithm, pairwise or multiwise
Fig. 3. 2D toy examples illustrating the comparison between binary code
similarity preserving algorithm, performs consistently the hashing and quantization. (a) shows the Hamming distances from clus-
best. Nevertheless, there is a large amount of pairwise and ters B and D to cluster A, usually adopted in the binary code hashing
multiwise similarity preserving algorithms because differ- algorithms, are the same while the Euclidean distances, used in the
ent algorithms may be suitable to different data distribu- quantization algorithms, are different. (b) the binary code hashing algo-
rithms need 6 hash bits (red lines show the corresponding hash func-
tions and optimization also affects the performance. tions) to differentiate the 16 uniformly-distributed clusters while the
It has been shown in Section 7.1 that the cost function quantization algorithms only require 4 (¼ log 16) bits (green lines show
of hypercubic quantization is an approximation of the the partition line).
distance-distance difference. But it outperforms pairwise
and multiwise similarity preserving algorithms. This is the quantization algorithms are suggested for hash code
because it is infeasible to consider all pairs (triples) of ranking, hash table lookup, as well as the scheme of combin-
items for the distance-distance difference in pairwise ing inverted index (hash table lookup) and hash code rank-
(multiwise) similarity preserving algorithms, and thus ing. The comparison of the query performances of pairwise
only a small subset of the pairs (triples), by sampling a and multiwise similarity preserving algorithms, as well as
subset of items or pairs (triples), is considered for almost quantization is summarized in Table 3.
all the pairwise (multiwise) similarity preserving hashing Fig. 3 presents 2D toy examples. Fig. 3a shows that the
algorithms, while the cost function for quantization is an quantization algorithm is able to discriminate the non-uni-
approximation for all pairs of items. This point is also dis- formly distributed clusters with different between-cluster
cussed in [154]. distances while the binary code hashing algorithm is lack-
Compared with binary code hashing including hyper- ing such a capability due to the Hamming distance. Fig. 3b
cubic quantization, another reason for the superiority of shows that the binary hash coding algorithms require more
Cartesian quantization, as discussed in [171], is that there (6) hash bits to differentiate the 16 uniformly-distributed
are only a small number (L þ 1) of distinct Hamming dis- clusters while the quantization algorithms only require 4
tances in the coding space for binary code hashing with the (¼ log 16) bits.
code length being L, while the number of distinct distances
for Cartesian quantization is much larger. It is shown that 10.1.3 Empirical Results
the performance when learning a distance measure using a We present the empirical results of the several representa-
way like the quantization approach [37] or directly learning tive hashing and quantization algorithms over SIFT1M [50].
a distance lookup table [148]4 from precomputed hash codes We show the results for searching the nearest neighbor
is comparable to the performance of the Cartesian quantiza- (T ¼ 1) with 128 bits and the conclusion holds for searching
tion approach if the codes from the quantization approach more nearest neighbors (T > 1) and with other numbers of
are given as the input. bits. More results, such as the search time cost, and results
Search Cost. The evaluation of the Hamming distance using inverted multi-index with different quantization algo-
using the CPU instruction popcnt is faster than the dis- rithms can be found in [172]. We also conduct experiments
tance-table lookup. For example, it is around twice faster for over word feature vectors GloVe1:2M. We present the
the same code length L than distance table lookup if a sub- results using recall@R for searching the nearest neighbor
table corresponds to a byte and there are totally L8 sub-tables. (T ¼ 1) with 128 bits.
It is worth pointing (also observed in [171]) that the Carte- We also report the results over deep learning features
sian quantization approaches relying on the distance table extracted from the ILSVRC 2012 dataset. The ILSVRC 2012
lookup still achieve better search accuracy even with a code dataset is a subset of ImageNet [22] and contains over 1.2 mil-
of the half length, which indicates that the overall perfor- lion images. We use the provided training set, 1,281,167
mance of the quantization approaches in terms of space images, as the retrieval database and use the provided vali-
cost, query time cost, and search accuracy is superior. dation set, 50,000 images, as the test set. Similar to [125],
In summary, if the online performance in terms of space the 4096-dimensional feature extracted from the convolution
cost, query time cost, and search accuracy is cared about, neural networks (CNN) in [65] is used to represent
each image. We evaluate the search performance under the
Euclidean distance in terms of recall@R, where R is the num-
4. A similar idea is concurrently proposed in [117], [118] to learn a
better similarity for a bag-of-words representation and quantized ber of the returned top candidates, and under the semantic
kernels. similarity in terms of MAP versus #bits.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 785

Fig. 4. (a) and (b) show the performance in terms of recall@R over SIFT1M and GloVe1:2M for the representative hashing and quantization algo-
rithms. (c) and (d) show the performance over the ILSVRC 2012 ImageNet dataset under the Euclidean distance in terms of recall@R and under the
semantic similarity in terms of mAP versus # bits. BRE = binary reconstructive embedding [66], MLH = minimal loss hashing [107], LSH = locality
sensitive hashing [14], ITQ = iterative quantization [35], [36], SH = spectral hashing [156], AGH-2 = two-layer hashing with graphs [86], USPLH =
unsupervised sequential projection learning hashing [143], PQ = product quantization [50], CKM = Cartesian k-means [108], CQ = composite quanti-
zation [171], SCQ = sparse composite quantization [172] whose dictionary is the same sparse with PQ. CCA-ITQ = iterative quantization with canon-
ical correlation analysis [36], SSH = semi-supervised hashing [143], KSH = supervised hashing with kernels [85], FastHash = fash supervised
hashing [76], SDH = supervised discrete hashing with kernels [125], SDH-linear = supervised discrete hashing without using kernel representa-
tions [125], SQ = supervised quantization [154], Euclidean = linear scan with the Euclidean distance.

Fig. 4 shows the recall@R curves and the MAP results. 11 EMERGING TOPICS
We have several observations. (1) The performance of the
The main goal of the hashing algorithm is to accelerate the
quantization method is better than the hashing method in
online search through fast Hamming distance computation
most cases for both Euclidean distance-based and semantic
or fast distance table lookup. The offline hash function
search. (2) LSH, a data-independent algorithm, is generally
learning and hash code computation are shown to be still
worse than other learning to hash approaches. (3) For
expensive, and have become attractive in research. The
Euclidean distance-based search the performance of CQ is
computation cost of the distance table used for looking up is
the best among quantization methods, which is consistent
thought ignorable and in reality could be higher when han-
with the analysis and the 2D illustration shown in Fig. 2.
dling high-dimensional databases. There is also increasing
interest in topics such as multi-modality and cross-modality
10.2 Training Time Cost
hashing [45] and semantic quantization.
We present the analysis of the training time cost for the case
of using the linear hash function. The pairwise similarity 11.1 Speed up the Learning and Query Processes
preserving category considers the similarities of all pairs of Scalable Hash Function Learning. The algorithms depending
items, and thus in general the training process takes qua- on the pairwise similarity, such as binary reconstructive
dratic time with respect to the number N of the training embedding, usually sample a small subset of pairs to reduce
samples (OðN 2 M þ N 2 dÞ). To reduce the computational the cost of learning hash functions. It has been shown that
cost, sampling schemes are adopted: sample a small num- the search accuracy is increased with a high sampling rate,
ber (e.g., OðNÞ) of pairs, whose time complexity becomes but the training cost is greatly increased. The algorithms
linear with respect to N, resulting in (OðNM þ NdÞ), or sam- even without relying on the pairwise similarity, e.g., quanti-
ple a subset of the training items (e.g., containing N items), zation, were also shown to be slow and even infeasible
whose time complexity becomes smaller (OðN M þ N 2 dÞ).
2
when handling very large data, e.g., 1B data items, and usu-
The multiwise similarity preserving category considers the ally have to learn hash functions over a small subset, e.g.,
similarities of all triples of items, and in general the training 1M data items. This poses a challenging request to learn the
cost is greater and the sampling scheme is also used for hash function over larger datasets.
acceleration. The analysis for kernel hash functions and Hash Code Computation Speedup. Existing hashing algo-
other complex functions is similar, and the time complexity rithms rarely take into consideration the cost of encoding a
for both training hash functions and encoding database data item. Such a cost during the query stage becomes sig-
items is higher. nificant in the case that only a small number of database
Iterative quantization consists of a PCA preprocessing items or a small database are compared to the query. The
step whose time complexity is OðNd2 Þ, and the hash code search combined with the inverted index and compact
and hash function optimization step, whose time complexity codes is such a case. When kernel hash functions are used,
is OðNM 2 þ M 3 Þ (M is the number of hash bits). The whole encoding the database items to binary codes is also much
complexity is OðNd2 þ NM 2 þ M 3 Þ. Product quantization more expensive than that with linear hash functions. The
includes the k-means process for each partition, and the com- composite quantization-like approach also takes much time
plexity is OðTNKP Þ, where K is usually 256, P ¼ M8 , and T is to compute the hash codes.
the number of iterations for the k-means algorithm. The com- A recent work, circulant binary embedding [165], acceler-
plexity of Cartesian k-means is OðNd2 þ d3 Þ. The time com- ates the encoding process for the linear hash functions, and
plexity of composite quantization is OðNKPd þ NP 2 þ tree-quantization [3] sparsifies the dictionary items into a
P 2 K 2 dÞ. In summary, the time complexity of iterative quanti- tree structure, to speeding up the assignment process. How-
zation is the lowest and that of composite quantization is the ever, more research is needed to speed up the hash code
highest. This indicates that it takes larger offline computation computation for other hashing algorithms, such as compos-
cost to get a higher (online) search performance. ite quantization.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
786 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

Distance Table Computation Speedup. Product quantization [5] R. Balu, T. Furon, and H. Jegou, “Beyond ”project and sign” for
cosine estimation with binary codes,” in Proc. IEEE Int. Conf.
and its variants need to precompute the distance table Acoustics Speech Signal Process., 2014, pp. 6884–6888.
between the query and the elements of the dictionaries. [6] J. L. Bentley, D. F. Stanat, and E. H. W. Jr, “The complexity of
Most existing algorithms claim that the cost of distance table finding fixed-radius near neighbors,” Inf. Process. Lett., vol. 6
computation is negligible. However in practice, the cost no. 6, pp. 209–212, 1977.
[7] A. Bergamo, L. Torresani, and A. W. Fitzgibbon, “Picodes: Learn-
becomes bigger when using the codes computed from quan- ing a compact code for novel-category recognition,” in Proc. 24th
tization to rank the candidates retrieved from the inverted Int. Conf. Neural Inf. Proc. Syst., 2011, pp. 2088–2096.
index. This is a research direction that will attract research [8] P. Boufounos and R. G. Baraniuk, “1-bit compressive sensing,” in
Proc. 42nd Annu. Conf. Inf. Sci. Syst., 2008, pp. 16–21.
interest in the near future, such as a recent study, sparse [9] J. Brandt, “Transform coding for fast approximate nearest neigh-
composite quantization [172]. bor search in high dimensions,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., 2010, pp. 1815–1822.
11.2 Promising Extensions [10] A. Z. Broder, “On the resemblance and containment of doc-
Semantic Quantization. Existing quantization algorithms uments,” in Proc. Compression Complexity Sequences, 1997, pp. 21–29.
[11] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig,
focus on the search under the Euclidean distances. Like “Syntactic clustering of the web,” Comput. Netw., vol. 29, no. 8–
binary code hashing algorithms where many studies on 13, pp. 1157–1166, 1997.
semantic similarity have been conducted, learning quantiza- [12] F. Çakir and S. Sclaroff, “Adaptive hashing for fast similarity
search,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1044–1052.
tion-based hash codes with semantic similarity is attracting [13] M. A. Carreira-Perpi~ nan and R. Raziperchikolaei, “Hashing with
interest. There are already a few studies. For example, we binary autoencoders,” in Proc. IEEE Conf. Comput. Vis. Pattern
have proposed a supervised quantization approach [154] Recognit., 2015, pp. 557–566.
and some comparisons are provided in Fig. 4. [14] M. Charikar, “Similarity estimation techniques from rounding
algorithms,” in Proc. 34th Annu. ACM Symp. Theory Comput.,
Multiple and Cross Modality Hashing. One important char- 2002, pp. 380–388.
acteristic of big data is the variety of data types and data [15] A. Cherian, “Nearest neighbors using compact sparse codes,” in
sources. This is particularly true for multimedia data, where Proc. 31st Int. Conf. Int. Conf. Mach. Learning, 2014, pp. 1053–1061.
[16] A. Cherian, V. Morellas, and N. Papanikolopoulos, “Robust
various media types (e.g., video, image, audio and hyper- sparse hashing,” in Proc. 19th IEEE Int. Conf. Image Process., 2012,
text) can be described by many different low- and high-level pp. 2417–2420.
features, and relevant multimedia objects may come from [17] O. Chum and J. Matas, “Large-scale discovery of spatially related
different data sources contributed by different users and images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2,
pp. 371–377, Feb. 2010.
organizations. This raises a research direction, performing [18] Q. Dai, J. Li, J. Wang, and Y. Jiang, “Binary optimized hashing,”
joint-modality hashing learning by exploiting the relation in Proc. ACM Multimedia, 2016, pp. 1247–1256.
among multiple modalities, for supporting some special [19] A. Dasgupta, R. Kumar, and T. Sarl os, “Fast locality-sensitive
applications, such as cross-modal search. This topic is hashing,” in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2011, pp. 1073–1081.
attracting a lot of research efforts nowadays, such as collab- [20] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-
orative hashing [89], [167], collaborative quantization [173], sensitive hashing scheme based on p-stable distributions,” in
and cross-media hashing [134], [135], [178], [161], [83]. Proc. Symp. Comput. Geometry, 2004, pp. 253–262.
[21] T. L. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan,
and J. Yagnik, “Fast, accurate detection of 100, 000 object classes
12 CONCLUSION on a single machine,” in Proc. IEEE Conf. Comput. Vis. Pattern
In this paper, we categorize the learning-to-hash algorithms Recognit., 2013, pp. 1814–1821.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
into four main groups: pairwise similarity preserving, multi- “Imagenet: A large-scale hierarchical image database,” in Proc.
wise similarity preserving, implicit similarity preserving, and IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
quantization, present a comprehensive survey with a discus- [23] K. Ding, C. Huo, B. Fan, and C. Pan, “kNN hashing with factor-
ized neighborhood representation,” in Proc. IEEE Int. Conf. Com-
sion about their relations. We point out the empirical observa- put. Vis., 2015, pp. 1098–1106.
tion that quantization is superior in terms of search accuracy, [24] T. Do, A. Doan, and N. Cheung, “Learning to hash with binary
search efficiency and space cost. In addition, we introduce a deep neural network,” in Proc. Eur. Conf. Comput. Vis., 2016,
few emerging topics and the promising extensions. pp. 219–234.
[25] T. Do, A. Doan, D. T. Nguyen, and N. Cheung, “Binary hashing
with semidefinite relaxation and augmented lagrangian,” in
ACKNOWLEDGMENTS Proc. Eur. Conf. Comput. Vis., 2016, pp. 802–817.
[26] C. Du and J. Wang, “Inner product similarity search using com-
This work was partially supported by the National Nature positional codes,” CoRR, abs/1406.4966, 2014.
Science Foundation of China No. 61632007. Heng Tao Shen [27] L. Fan, “Supervised binary hash code learning with Jensen Shannon
is the corresponding author. divergence,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 2616–2623,
2013.
REFERENCES [28] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual
models from few training examples: an incremental Bayesian
[1] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approach tested on 101 object categories,” in Proc. Workshop Gen-
approximate nearest neighbor in high dimensions,” in Proc. 47th erative-Model Based Vis., 2004.
Annu. IEEE Symp. Found. Comput. Sci., 2006, pp. 459–468. [29] J. Gan, J. Feng, Q. Fang, and W. Ng, “Locality-sensitive hashing
[2] A. Babenko and V. Lempitsky, “Additive quantization for scheme based on dynamic collision counting,” in Proc. SIGMOD
extreme vector compression,” in Proc. IEEE Conf. Comput. Vis. Conf., 2012, pp. 541–552.
Pattern Recognit., 2014, pp. 931–939. [30] L. Gao, J. Song, F. Zou, D. Zhang, and J. Shao, “Scalable multime-
[3] A. Babenko and V. Lempitsky, “Tree quantization for large-scale dia retrieval by deep learning hashing with relative similarity
similarity search and classification,” in Proc. IEEE Conf. Comput. learning,” in Proc. ACM Multimedia, 2015, pp. 903–906.
Vis. Pattern Recognit., 2015, pp. 4240–4248. [31] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization
[4] A. Babenko and V. S. Lempitsky, “The inverted multi-index,” in for approximate nearest neighbor search,” in Proc. IEEE Conf.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3069– Comput. Vis. Pattern Recognit., 2013, pp. 2946–2953.
3076.
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 787

[32] T. Ge, K. He, and J. Sun, “Graph cuts for supervised binary [58] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, “Query-adaptive
coding,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 250–264. image search with hash codes,” IEEE Trans. Multimedia, vol. 15,
[33] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik, “Learning no. 2, pp. 442–453, Feb. 2013.
binary codes for high-dimensional data using bilinear projec- [59] Z. Jiang, L. Xie, X. Deng, W. Xu, and J. Wang, “Fast nearest
tions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, neighbor search in the Hamming space,” in Proc. Int. Conf. Multi-
pp. 484–491. media Model., 2016, pp. 325–336.
[34] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, “Angular quanti- [60] Z. Jin, et al., “Complementary projection hashing,” in Proc. IEEE
zation-based binary codes for fast similarity search,” in Proc. 25th Int. Conf. Comput. Vis., 2013, pp. 257–264.
Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1205–1213. [61] A. Joly and O. Buisson, “Random maximum margin hashing,” in
[35] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 873–880.
approach to learning binary codes,” in Proc. IEEE Conf. Comput. [62] Y. Kalantidis and Y. Avrithis, “Locally optimized product quan-
Vis. Pattern Recognit., 2011, pp. 817–824. tization for approximate nearest neighbor search,” in Proc. IEEE
[36] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2329–2336.
quantization: A procrustean approach to learning binary codes [63] W. Kong and W.-J. Li, “Isotropic hashing,” in Proc. Int. Conf.
for large-scale image retrieval,” IEEE Trans. Pattern Anal. Mach. Neural Inf. Process. Syst., 2012, pp. 1655–1663.
Intell., vol. 35, no. 12, pp. 2916–2929, Dec. 2013. [64] N. Koudas, B. C. Ooi, H. T. Shen, and A. K. H. Tung, “LDC:
[37] A. Gordo, F. Perronnin, Y. Gong, and S. Lazebnik, “Asymmetric Enabling search by partial distance in a hyper-dimensional
distances for binary embeddings,” IEEE Trans. Pattern Anal. space,” in Proc. 20th Int. Conf. Data Eng., 2004, pp. 6–17.
Mach. Intell., vol. 36, no. 1, pp. 33–47, Jan. 2014. [65] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifi-
[38] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. cation with deep convolutional neural networks,” in Proc. Int.
Inform. Theory, vol. 44, no. 6, pp. 2325–2383, Oct. 1998. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[39] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer, “Compact [66] B. Kulis and T. Darrell, “Learning to hash with binary recon-
hashing with joint optimization of search accuracy and time,” in structive embeddings,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 753–760, 2011. 2009, pp. 1042–1050.
[40] J. He, W. Liu, and S.-F. Chang, “Scalable similarity search with [67] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning
optimized kernel hashing,” in Proc. 16th ACM SIGKDD Int. Conf. and hash coding with deep neural networks,” in Proc. IEEE Conf.
Knowl. Discovery Data Mining, 2010, pp. 1129–1138. Comput. Vis. Pattern Recognit., 2015, pp. 3270–3278.
[41] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, “Spherical [68] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, learning applied to document recognition,” in Intell. Signal Pro-
pp. 2957–2964. cess., pages 306–351. IEEE Press, 2001.
[42] J.-P. Heo, Z. Lin, and S.-E. Yoon, “Distance encoded product [69] C. Leng, J. Wu, J. Cheng, X. Bai, and H. Lu, “Online sketching
quantization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015,
2014, pp. 2139–2146. pp. 2503–2511.
[43] L.-K. Huang, Q. Yang, and W.-S. Zheng, “Online hashing,” [70] P. Li, K. W. Church, and T. Hastie, “Conditional random sam-
in Proc. Int. Conf. Artif. Intell., 2013, pp. 1422–1428. pling: A sketch-based sampling technique for sparse data,” in
[44] P. Indyk and R. Motwani, “Approximate nearest neighbors: Proc. Int. Conf. Neural Inf. Process. Syst., 2006, pp. 873–880.
Towards removing the curse of dimensionality,” in Proc. 30th [71] P. Li, T. Hastie, and K. W. Church, “Very sparse random projec-
Annu. ACM Symp. Theory Comput., 1998, pp. 604–613. tions,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery
[45] G. Irie, H. Arai, and Y. Taniguchi, “Alternating co-quantization Data Mining, 2006, pp. 287–296.
for cross-modal hashing,” in Proc. IEEE Int. Conf. Comput. Vis., [72] P. Li and A. C. K€ onig, “b-bit minwise hashing,” in Proc. 19th Int.
2015, pp. 1886–1894. Conf. World Wide Web, 2010, pp. 671–680.
[46] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang, “Locally linear hashing [73] P. Li, A. C. K€ onig, and W. Gui, “b-bit minwise hashing for esti-
for extracting non-linear manifolds,” in Proc. IEEE Conf. Comput. mating three-way similarities,” in Proc. Int. Conf. Neural Inf. Pro-
Vis. Pattern Recognit., 2014, pp. 2123–2130. cess. Syst., 2010, pp. 1387–1395.
[47] H. Jain, P. Perez, R. Gribonval, J. Zepeda, and H. Jegou, [74] P. Li, A. B. Owen, and C.-H. Zhang, “One permutation hashing,”
“Approximate search with quantized sparse representations,” in in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 3122–3130.
Proc. Eur. Conf. Comput. Vis., 2016, pp. 681–696. [75] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing
[48] H. Jegou, L. Amsaleg, C. Schmid, and P. Gros, “Query adaptative with semantically consistent graph for image indexing,” IEEE
locality sensitive hashing,” in Proc. IEEE Int. Conf. Acoustics, Trans. Multimedia, vol. 15, no. 1, pp. 141–152, Jan. 2013.
Speech Signal Process., 2008, pp. 825–828. [76] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, “Fast super-
[49] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and vised hashing with decision trees for high-dimensional data,” in
weak geometric consistency for large scale image search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1971–1978.
Proc. Eur. Conf. Comput. Vis., 2008, pp. 304–317. [77] G. Lin, C. Shen, D. Suter, and A. van den Hengel, “A general
[50] H. Jegou, M. Douze, and C. Schmid, “Product quantization for two-step approach to learning-based hashing,” in Proc. IEEE Int.
nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., Conf. Comput. Vis., 2013, pp. 2552–2559.
vol. 33, no. 1, pp. 117–128, Jan. 2011. [78] R.-S. Lin, D. A. Ross, and J. Yagnik, “Spec hashing: Similarity
[51] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local preserving algorithm for entropy-based coding,” in Proc. IEEE
descriptors into a compact image representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 848–854.
Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3304–3311. [79] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, “Compressed hashing,” in
[52] H. Jegou, T. Furon, and J.-J. Fuchs, “Anti-sparse coding for Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 446–451.
approximate nearest neighbor search,” in Proc. IEEE Int. Conf. [80] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hash-
Acoustics, Speech Signal Process., 2012, pp. 2029–2032. ing for compact binary codes learning,” in Proc. IEEE Conf. Com-
[53] H. Jegou, R. Tavenard, M. Douze, and L. Amsaleg, “Searching in put. Vis. Pattern Recognit., 2015, pp. 2475–2483.
one billion vectors: Re-rank with source coding,” in Proc. IEEE [81] D. Liu, S. Yan, R.-R. Ji, X.-S. Hua, and H.-J. Zhang, “Image
Int. Conf. Acoustics, Speech Signal Process., 2011, pp. 861–864. retrieval with query-adaptive hashing,” ACM Trans. Multimedia
[54] J. Ji, J. Li, S. Yan, Q. Tian, and B. Zhang, “Min-max hash for jac- Comput. Commun. Appl., vol. 9, no. 1, 2013: Art. no. 2.
card similarity,” in Proc. IEEE 13th Int. Conf. Data Mining, 2013, [82] H. Liu, R. Wang, S. Shan, and X. Chen, “Deep supervised hash-
pp. 301–309. ing for fast image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pat-
[55] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-sensi- tern Recognit., 2016, pp. 2064–2072.
tive hashing,” in Proc. 25th Int. Conf. Neural Inf. Process. Syst., [83] L. Liu, Z. Lin, L. Shao, F. Shen, G. Ding, and J. Han, “Sequential
2012, pp. 108–116. discrete hashing for scalable cross-modality similarity retrieval,”
[56] Q. Jiang and W. Li, “Scalable graph hashing with feature trans- IEEE Trans. Image Process., vol. 26, no. 1, pp. 107–118, Jan. 2017.
formation,” in Proc. 24th Int. Conf. Artif. Intell., 2015, pp. 2248–2254. [84] W. Liu, C. Mu, S. Kumar, and S. Chang, “Discrete graph hashing,”
[57] Y.-G. Jiang, J. Wang, and S.-F. Chang, “Lost in binarization: in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 3419–3427.
Query-adaptive ranking for similar image search with compact [85] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised
codes,” in Proc. ACM Int. Conf. Multimedia Retrieval, 2011, hashing with kernels,” in Proc. IEEE Conf. Comput. Vis. Pattern
Art. no. 16. Recognit., 2012, pp. 2074–2081.

Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
788 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

[86] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with [113] R. Panigrahy, “Entropy based nearest neighbor search in high
graphs,” in Proc. Int. Conf. Mach. Learning, 2011, pp. 1–8. dimensions,” in Proc. 17th Annu. ACM-SIAM Symp. Discrete Algo-
[87] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang, “Compact rithm, 2006, pp. 1186–1195.
hyperplane hashing with bilinear functions,” in Proc. Int. Conf. [114] L. Pauleve, H. Jegou, and L. Amsaleg, “Locality sensitive hash-
Mach. Learning, 2012, pp. 467–474. ing: A comparison of hash function types and querying mecha-
[88] X. Liu, C. Deng, B. Lang, D. Tao, and X. Li, “Query-adaptive nisms,” Pattern Recognit. Lett., vol. 31, no. 11, pp. 1348–1358, 2010.
reciprocal hash tables for nearest neighbor search,” IEEE Trans. [115] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vec-
Image Process., vol. 25, no. 2, pp. 907–919, Feb. 2016. tors for word representation,” in Proc. Empirical Methods Natural
[89] X. Liu, J. He, C. Deng, and B. Lang, “Collaborative hashing,” in Language Process., 2014, pp. 1532–1543.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2147–2154. [116] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier, “Large-scale
[90] X. Liu, J. He, and B. Lang, “Reciprocal hash tables for nearest image retrieval with compressed fisher vectors,” in Proc. IEEE
neighbor search,” in Proc. 27th AAAI Conf. Artif. Intell., 2013. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3384–3391.
[91] X. Liu, L. Huang, C. Deng, J. Lu, and B. Lang, “Multi-view com- [117] D. Qin, X. Chen, M. Guillaumin, and L. J. V. Gool, “Quantized
plementary hash tables for nearest neighbor search,” in Proc. kernel learning for feature matching,” in Proc. Int. Conf. Neural
IEEE Int. Conf. Comput. Vis., 2015, pp. 1107–1115. Inf. Process. Syst., 2014, pp. 172–180.
[92] Y. Liu, J. Shao, J. Xiao, F. Wu, and Y. Zhuang, “Hypergraph spec- [118] D. Qin, Y. Chen, M. Guillaumin, and L. J. V. Gool, “Learning to
tral hashing for image retrieval with heterogeneous social con- rank histograms for object retrieval,” in Proc. British Mach. Vis.
texts,” Neurocomputing, vol. 119, pp. 49–58, 2013. Conf., 2014.
[93] Y. Liu, F. Wu, Y. Yang, Y. Zhuang, and A. G. Hauptmann, [119] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman,
“Spline regression hashing for fast image search,” IEEE Trans. “Labelme: A database and web-based tool for image annotation,”
Image Process., vol. 21, no. 10, pp. 4480–4491, Oct. 2012. Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 157–173, 2008.
[94] D. G. Lowe, “Distinctive image features from scale-invariant key- [120] R. Salakhutdinov and G. E. Hinton, “Semantic hashing,” in Proc.
points,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. SIGIR Workshop Inf. Retrieval Appl. Graphical Models, 2007,
[95] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe pp. 969–978.
LSH: Efficient indexing for high-dimensional similarity search,” in [121] R. Salakhutdinov and G. E. Hinton, “Semantic hashing,” Int. J.
Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 950–961. Approx. Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[96] J. Martinez, J. Clement, H. H. Hoos, and J. J. Little, “Revisiting [122] J. Sanchez and F. Perronnin, “High-dimensional signature com-
additive quantization,” in Proc. Eur. Conf. Comput. Vis., 2016, pression for large-scale image classification,” in Proc. IEEE Conf.
pp. 137–153. Comput. Vis. Pattern Recognit., 2011, pp. 1665–1672.
[97] Y. Matsui, T. Yamasaki, and K. Aizawa, “Pqtable: Fast exact [123] H. Sandhawalia and H. Jegou, “Searching with expectations,” in
asymmetric distance neighbor search for product quantization Proc. Int. Conf. Acoustics Speech Signal Process., 2010, pp. 1242–1245.
using hash tables,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, [124] J. Shao, F. Wu, C. Ouyang, and X. Zhang, “Sparse spectral
pp. 1940–1948. hashing,” Pattern Recognit. Lett., vol. 33, no. 3, pp. 271–277, 2012.
[98] Y. Matsushita and T. Wada, “Principal component hashing: An [125] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete
accelerated approximate nearest neighbor search,” in Proc. hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015,
Pacific-Rim Symp. Image Video Technol., 2009, pp. 374–385. pp. 37–45.
[99] Y. Moon, et al., “Capsule: A camera-based positioning system [126] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang,
using learning,” in Proc. ACM Symp. Cloud Comput., 2015, “Inductive hashing on manifolds,” in Proc. IEEE Conf. Comput.
pp. 235–240. Vis. Pattern Recognit., 2013, pp. 1562–1569.
[100] R. Motwani, A. Naor, and R. Panigrahy, “Lower bounds on local- [127] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao, “A
ity sensitive hashing,” SIAM J. Discrete Math., vol. 21, no. 4, fast optimization method for general binary code learning,”
pp. 930–935, 2007. IEEE Trans. Image Process., vol. 25, no. 12, pp. 5610–5621,
[101] Y. Mu, X. Chen, X. Liu, T.-S. Chua, and S. Yan, “Multimedia seman- Dec. 2016.
tics-aware query-adaptive hashing with bits reconfigurability,” Int. [128] X. Shi, F. Xing, J. Cai, Z. Zhang, Y. Xie, and L. Yang, “Kernel-
J. Multimedia Inf. Retrieval, vol. 1, no. 1, pp. 59–70, 2012. based supervised discrete hashing for image retrieval,” in Proc.
[102] Y. Mu, J. Shen, and S. Yan, “Weakly-supervised hashing in ker- Eur. Conf. Comput. Vis., 2016, pp. 419–433.
nel space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [129] A. Shrivastava and P. Li, “Fast near neighbor search in high-
2010, pp. 3344–3351. dimensional binary data,” in Proc. Eur. Conf. Mach. Learn. Princi-
[103] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors ples Practice Knowl. Discovery Databases, 2012, pp. 474–489.
with automatic algorithm configuration,” in Proc. Int. Conf. Com- [130] A. Shrivastava and P. Li, “Densifying one permutation hashing
put. Vis. Theory Appl., 2009, pp. 331–340. via rotation for fast near neighbor,” in Proc. 31st Int. Conf. Int.
[104] M. Muja and D. G. Lowe, “Fast matching of binary features,” in Conf. Mach. Learning, 2014, pp 557–565.
Proc. 9th Conf. Comput. Robot Vis., 2012, pp. 404–410. [131] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Explor-
[105] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms ing photo collections in 3d,” ACM Trans. Graph., vol. 25, no. 3,
for high dimensional data,” IEEE Trans. Pattern Anal. Mach. pp. 835–846, 2006.
Intell., vol. 36, no. 11, pp. 2227–2240, Nov. 2014. [132] D. Song, W. Liu, R. Ji, D. A. Meyer, and J. R. Smith, “Top rank
[106] L. Mukherjee, S. N. Ravi, V. K. Ithapu, T. Holmes, and V. Singh, supervised binary coding for visual search,” in Proc. IEEE Int.
“An NMF perspective on binary hashing,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1922–1930.
Conf. Comput. Vis., 2015, pp. 4184–4192. [133] J. Song, H. T. Shen, J. Wang, Z. Huang, N. Sebe, and J. Wang,
[107] M. Norouzi and D. J. Fleet, “Minimal loss hashing for compact “A distance-computation-free search scheme for binary code
binary codes,” in Proc. Int. Conf. Mach. Learning, 2011, pp. 353– databases,” IEEE Trans. Multimedia, vol. 18, no. 3, pp. 484–495,
360. Mar. 2016.
[108] M. Norouzi and D. J. Fleet, “Cartesian k-means,” in Proc. IEEE [134] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, “Effective
Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3017–3024. multiple feature hashing for large-scale near-duplicate video
[109] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming dis- retrieval,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1997–2008,
tance metric learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., Dec. 2013.
2012, pp. 1070–1078. [135] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Inter-media
[110] M. Norouzi, A. Punjani, and D. J. Fleet, “Fast search in Hamming hashing for large-scale retrieval from heterogeneous data
space with multi-index hashing,” in Proc. IEEE Conf. Comput. Vis. sources,” in Proc. SIGMOD Conf., 2013, pp. 785–796.
Pattern Recognit., 2012, pp. 3108–3115. [136] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “Ldahash:
[111] R. O’Donnell, Y. Wu, and Y. Zhou, “Optimal lower bounds for Improved matching with smaller descriptors,” IEEE Trans. Pattern
locality sensitive hashing (except when q is tiny),” in Proc. Int. Anal. Mach. Intell., vol. 34, no. 1, pp. 66–78, Jan. 2012.
Conf. Supercomputing, 2011, pp. 275–283. [137] A. B. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny
[112] A. Oliva and A. Torralba, “Modeling the shape of the scene: A images: A large data set for nonparametric object and scene rec-
holistic representation of the spatial envelope,” Int. J. Comput. ognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11,
Vis., vol. 42, no. 3, pp. 145–175, 2001. pp. 1958–1970, Nov. 2008.

Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 789

[138] A. Vedaldi and A. Zisserman, “Efficient additive kernels via [164] Y. Yang, F. Shen, H. T. Shen, H. Li, and X. Li, “Robust discrete
explicit feature maps,” IEEE Trans. Pattern Anal. Mach. Intell., spectral hashing for large-scale image semantic indexing,” IEEE
vol. 34, no. 3, pp. 480–492, Mar. 2012. Trans. Big Data, vol. 1, no. 4, pp. 162–171, Oct.-Dec. 2015.
[139] A. Vedaldi and A. Zisserman, “Sparse kernel approximations for [165] F. Yu, S. Kumar, Y. Gong, and S.-F. Chang, “Circulant
efficient classification and detection,” in Proc. IEEE Conf. Comput. binary embedding,” in Proc. Int. Conf. Mach. Learning, 2014,
Vis. Pattern Recognit., 2012, pp. 2320–2327. pp. 946–954.
[140] L. von Ahn, R. Liu, and M. Blum, “Peekaboom: A game for locat- [166] D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast
ing objects in images,” in Proc. CHI Conf., 2006, pp. 55–64. similarity search,” in Proc. 33rd Int. ACM SIGIR Conf. Res. Devel-
[141] J. Wang, O. Kumar, and S.-F. Chang, “Semi-supervised hashing opment Inf. Retrieval, 2010, pp. 18–25.
for scalable image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pat- [167] H. Zhang, F. Shen, W. Liu, X. He, H. Luan, and T. Chua,
tern Recognit., 2010, pp. 3424–3431. “Discrete collaborative filtering,” in Proc. 39th Annu. Int. ACM-
[142] J. Wang, S. Kumar, and S.-F. Chang, “Sequential projection learn- SIGIR Conf. Res. Development Inf. Retrieval, 2016, pp. 325–334.
ing for hashing with compact codes,” in Proc. 27th Int. Conf. [168] H. Zhang, N. Zhao, X. Shang, H. Luan, and T. Chua, “Discrete
Mach. Learn., 2010, pp. 1127–1134. image hashing using large weakly annotated photo collections,”
[143] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 3669–3675.
for large-scale search,” IEEE Trans. Pattern Anal. Mach. Intell., [169] L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian, “Topology
vol. 34, no. 12, pp. 2393–2406, Dec. 2012. preserving hashing for similarity search,” in Proc. ACM Multime-
[144] J. Wang and S. Li, “Query-driven iterated neighborhood graph dia, 2013, pp. 123–132.
search for large scale indexing,” in Proc. ACM Multimedia, 2012, [170] S. Zhang, J. Li, J. Guo, and B. Zhang, “Scalable discrete super-
pp. 179–188. vised hash learning with asymmetric matrix factorization,” IEEE
[145] J. Wang, W. Liu, S. Kumar, and S. Chang, “Learning to hash 16th Int. Conf. Data Mining, Barcelona, Spain, pp. 1347–1352,
for indexing big data—A survey,” Proc. IEEE, vol. 104, no. 1, 2016, doi: 10.1109/ICDM.2016.0184.
pp. 34–57, Jan. 2016. [171] T. Zhang, C. Du, and J. Wang, “Composite quantization for
[146] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang, “Learning hash codes approximate nearest neighbor search,” in Proc. Int. Conf. Mach.
with listwise supervision,” in Proc. IEEE Int. Conf. Comput. Vis., Learning, 2014, pp. 838–846.
2013, pp. 3032–3039. [172] T. Zhang, G.-J. Qi, J. Tang, and J. Wang, “Sparse composite
[147] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity quantization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
search: A survey,” CoRR, abs/1408.2927, 2014. 2015, pp. 4548–4556.
[148] J. Wang, H. T. Shen, S. Yan, N. Yu, S. Li, and J. Wang, [173] T. Zhang and J. Wang, “Collaborative quantization for cross-
“Optimized distances for binary code ranking,” in Proc. ACM modal similarity search,” in Proc. IEEE Conf. Comput. Vis. Pattern
Multimedia, 2014, pp. 517–526. Recognit., 2016, pp. 2036–2045.
[149] J. Wang, J. Wang, J. Song, X.-S. Xu, H. T. Shen, and S. Li, [174] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking
“Optimized cartesian k-means,” IEEE Trans. Knowl. Data Eng., based hashing for multi-label image retrieval,” in Proc. IEEE
vol. 27, no. 1, pp. 180–192, 2015, doi: 10.1109/TKDE.2014.2324592. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1556–1564.
[150] J. Wang, J. Wang, N. Yu, and S. Li, “Order preserving hashing for [175] K. Zhao, H. Lu, and J. Mei, “Locality preserving hashing,” in
approximate nearest neighbor search,” in Proc. ACM Multimedia, Proc. AAAI Conf. Artif. Intell., 2014, pp. 2874–2881.
2013, pp. 133–142. [176] Y. Zhen and D.-Y. Yeung, “Active hashing and its application to
[151] J. Wang, J. Wang, G. Zeng, R. Gan, S. Li, and B. Guo, “Fast neigh- image and text retrieval,” Data Min. Knowl. Discov., vol. 26, no. 2,
borhood graph search using Cartesian concatenation,” in Proc. pp. 255–274, 2013.
IEEE Int. Conf. Comput. Vis., 2013, pp. 2128–2135. [177] X. Zhu, Z. Huang, H. Cheng, J. Cui, and H. T. Shen, “Sparse
[152] J. Wang, et al., “Trinary-projection trees for approximate nearest hashing for fast multimedia search,” ACM Trans. Inf. Syst.,
neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, vol. 31, no. 2, 2013: Art. no. 9.
no. 2, pp. 388–403, Feb. 2014. [178] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, “Linear cross-modal
[153] Q. Wang, D. Zhang, and L. Si, “Weighted hashing for fast large hashing for efficient multimedia search,” in Proc. ACM Multime-
scale similarity search,” in Proc. Conf. Inf. Knowl. Manage., 2013, dia, 2013, pp. 143–152.
pp. 1185–1188. [179] Y. Zhuang, Y. Liu, F. Wu, Y. Zhang, and J. Shao, “Hypergraph
[154] X. Wang, T. Zhang, G.-J. Qi, J. Tang, and J. Wang, “Supervised spectral hashing for similarity search of social image,” in Proc.
quantization for similarity search,” in Proc. IEEE Conf. Comput. ACM Multimedia, 2011, pp. 1457–1460.
Vis. Pattern Recognit., 2016, pp. 2018–2026.
[155] Y. Weiss, R. Fergus, and A. Torralba, “Multidimensional spectral Jingdong Wang received the BEng and MEng
hashing,” in Proc. Eur. Conf. Comput. Vis., 2012, pp. 340–353. degrees from the Department of Automation,
[156] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. Tsinghua University, Beijing, China, in 2001 and
Int. Conf. Neural Inf. Process. Syst., 2008, pp. 1753–1760. 2004, respectively, and the PhD degree from the
[157] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu, “Semi-supervised non- Department of Computer Science and Engineer-
linear hashing using bootstrap sequential projection learning,” ing, the Hong Kong University of Science and
IEEE Trans. Knowl. Data Eng., vol. 25, no. 6, pp. 1380–1393, Jun. Technology, Hong Kong, in 2007. He is a lead
2013. researcher at the Visual Computing Group,
[158] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for Microsoft Research Asia. His areas of interest
image retrieval via image representation learning,” in Proc. AAAI include deep learning, large-scale indexing,
Conf. Artif. Intell., 2014, pp. 2156–2162. human understanding, and person re-identifica-
[159] B. Xu, J. Bu, Y. Lin, C. Chen, X. He, and D. Cai, “Harmonious tion. He has been serving as an associate editor of IEEE TMM, and has
hashing,” in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 1820– served as an area chair of ICCV 2017, CVPR 2017, ECCV 2016 and
1826. ACM Multimedia 2015.
[160] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary
hashing for approximate nearest neighbor search,” in Proc. IEEE
Int. Conf. Comput. Vis., 2011, pp. 1631–1638. Ting Zhang received the bachelor’s degree in
[161] X. Xu, F. Shen, Y. Yang, H. T. Shen, and X. Li, “Learning dis- mathematical science from the School of the
criminative binary codes for large-scale cross-modal Gifted Young, in 2012. She is working toward the
retrieval,” IEEE Trans. Image Process., vol. 26, no. 5, pp. 2494– PhD degree in the Department of Automation,
2507, May 2017. University of Science and Technology of China.
[162] H. Yang, X. Bai, J. Zhou, P. Ren, Z. Zhang, and J. Cheng, “Adaptive Her main research interests include machine
object retrieval with kernel reconstructive hashing,” in Proc. IEEE learning, computer vision and pattern recognition.
Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1955–1962. She is currently a research intern at Microsoft
[163] Q. Yang, L.-K. Huang, W.-S. Zheng, and Y. Ling, “Smart hashing Research, Beijing.
update for fast response,” in Proc. Int. Joint Conf. Artif. Intell., 2013,
pp. 1855–1861.

Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
790 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018

Jingkuan Song received the BS degree in soft- Heng Tao Shen received the BSc degree with
ware engineering from the University of Electronic 1st class Honours and the PhD degree from the
Science and Technology of China. He received Department of Computer Science, National Uni-
the PhD degree in information technology from versity of Singapore, in 2000 and 2004, respec-
The University of Queensland, Australia. Currently, tively. He then joined University of Queensland
he is a Professor at University of Electronic Science as a Lecturer, Senior Lecturer, Reader, and
and Technology of China. His research interest became a Professor in late 2011. He is currently
includes large-scale multimedia search and a professor of National “Thousand Talents Plan”
machine learning. and the director of Future Media Research
Center at University of Electronic Science and
Technology of China. His research interests
mainly include multimedia search, computer vision, and big data man-
Nicu Sebe is currently a professor with the Univer- agement on spatial, temporal, multimedia and social media databases.
sity of Trento, Italy, leading the research in the Heng Tao has extensively published and served on program committees
areas of multimedia information retrieval and in most prestigious international publication venues of interests. He
human behavior understanding. He was the gen- received the Chris Wallace Award for outstanding Research Contribution
eral co-chair of the IEEE FG Conference 2008 in 2010 conferred by Computing Research and Education Association,
and ACM Multimedia 2013, and the program chair Australasia. He has served as a PC co-chair for ACM Multimedia 2015
of the International Conference on Image and and currently is an associate editor of the IEEE Transactions on Knowl-
Video Retrieval in 2007 and 2010, and ACM Multi- edge and Data Engineering.
media 2007 and 2011. He is the program chair of
ECCV 2016 and ICCV 2017. He is a fellow of the
International Association for Pattern Recognition. " For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.

Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.

Book 1 Air Pilot's Manual - Flying Training (Pooleys)
94% (31)
Book 1 Air Pilot's Manual - Flying Training (Pooleys)
408 pages
Vocality Radio Over IP - Introduction
No ratings yet
Vocality Radio Over IP - Introduction
18 pages
Locality-Sensitive Hashing Scheme Based On P-Stable Distributions
No ratings yet
Locality-Sensitive Hashing Scheme Based On P-Stable Distributions
10 pages
Learning Hash Functions Using Column Generation: Xi Li Guosheng Lin Chunhua Shen Anton Van Den Hengel Anthony Dick
No ratings yet
Learning Hash Functions Using Column Generation: Xi Li Guosheng Lin Chunhua Shen Anton Van Den Hengel Anthony Dick
9 pages
Double-Bit Quantization For Hashing: Weihao Kong Wu-Jun Li
No ratings yet
Double-Bit Quantization For Hashing: Weihao Kong Wu-Jun Li
7 pages
Binary Hashing For Approximate Nearest Neighbor Search On Big Data A Survey
No ratings yet
Binary Hashing For Approximate Nearest Neighbor Search On Big Data A Survey
16 pages
Hashing For Similarity Search: A Survey: Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji
No ratings yet
Hashing For Similarity Search: A Survey: Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji
29 pages
p117 Andoni
No ratings yet
p117 Andoni
6 pages
Learning To Hash For Indexing Big Data - A Survey
No ratings yet
Learning To Hash For Indexing Big Data - A Survey
22 pages
SSDH
No ratings yet
SSDH
7 pages
Locality-Sensitive Hashing Scheme Based On P-Stable Distributions
No ratings yet
Locality-Sensitive Hashing Scheme Based On P-Stable Distributions
10 pages
Online Hashing: Long-Kai Huang, Qiang Yang, and Wei-Shi Zheng
No ratings yet
Online Hashing: Long-Kai Huang, Qiang Yang, and Wei-Shi Zheng
14 pages
CSQ - Yuan Central Similarity Quantization For Efficient Image and Video Retrieval CVPR 2020 Paper
No ratings yet
CSQ - Yuan Central Similarity Quantization For Efficient Image and Video Retrieval CVPR 2020 Paper
10 pages
Bit Reduction For Locality-Sensitive Hashing
No ratings yet
Bit Reduction For Locality-Sensitive Hashing
12 pages
Composite Hashing With Multiple Information Sources: Dan Zhang Fei Wang Luo Si
No ratings yet
Composite Hashing With Multiple Information Sources: Dan Zhang Fei Wang Luo Si
10 pages
Compact Structure Hashing Via Sparse and Similarity Preserving Embedding
No ratings yet
Compact Structure Hashing Via Sparse and Similarity Preserving Embedding
12 pages
Nearest Neighbor Retrieval Using Distance-Based Hashing
No ratings yet
Nearest Neighbor Retrieval Using Distance-Based Hashing
10 pages
Principles of Hash-Based Text Retrieval.
100% (1)
Principles of Hash-Based Text Retrieval.
8 pages
CAIM: Cerca I Anàlisi D'informació Massiva: FIB, Grau en Enginyeria Informàtica
No ratings yet
CAIM: Cerca I Anàlisi D'informació Massiva: FIB, Grau en Enginyeria Informàtica
19 pages
Attribute Discovery Via Predictable Discriminative Binary Codes
No ratings yet
Attribute Discovery Via Predictable Discriminative Binary Codes
14 pages
VDSH
No ratings yet
VDSH
11 pages
Deep Learning Approaches For Similarity Computation A Survey
No ratings yet
Deep Learning Approaches For Similarity Computation A Survey
20 pages
Learning To Hash With Binary Deep Neural Network: October 2016
No ratings yet
Learning To Hash With Binary Deep Neural Network: October 2016
17 pages
Asymmetric Distances For Binary Embeddings
No ratings yet
Asymmetric Distances For Binary Embeddings
8 pages
Locality-Sensitive Binary Codes From Shift-Invariant Kernels
No ratings yet
Locality-Sensitive Binary Codes From Shift-Invariant Kernels
9 pages
An Efficient and Robust Semantic Hashing Framework For Similar Text Search
No ratings yet
An Efficient and Robust Semantic Hashing Framework For Similar Text Search
31 pages
Fast Exact Search in Hamming Space With Multi-Index Hashing: Mohammad Norouzi, Ali Punjani, David J. Fleet
No ratings yet
Fast Exact Search in Hamming Space With Multi-Index Hashing: Mohammad Norouzi, Ali Punjani, David J. Fleet
14 pages
Iterative Quantization: A Procrustean Approach To Learning Binary Codes
No ratings yet
Iterative Quantization: A Procrustean Approach To Learning Binary Codes
8 pages
A Literature Survey On Various Approaches On Content Based Image Search
No ratings yet
A Literature Survey On Various Approaches On Content Based Image Search
6 pages
Product Quantization For Nearest Neighbor Search
No ratings yet
Product Quantization For Nearest Neighbor Search
13 pages
Finding Similar Items
No ratings yet
Finding Similar Items
85 pages
Lshanalysis Preprint
No ratings yet
Lshanalysis Preprint
12 pages
Binary Code Ranking With Weighted Hamming Distance
No ratings yet
Binary Code Ranking With Weighted Hamming Distance
8 pages
A Hash Centroid Construction Method With Swin Transformer For Multi-Label Image Retrieval
No ratings yet
A Hash Centroid Construction Method With Swin Transformer For Multi-Label Image Retrieval
17 pages
Fast and Exact Fixed-Radius Neighbor Search Based On Sorting
No ratings yet
Fast and Exact Fixed-Radius Neighbor Search Based On Sorting
17 pages
Elpis
No ratings yet
Elpis
12 pages
Article cvprw15
No ratings yet
Article cvprw15
9 pages
24 SimilaritySearch
No ratings yet
24 SimilaritySearch
52 pages
PSLSH
No ratings yet
PSLSH
10 pages
Tutti Gli Articoli
No ratings yet
Tutti Gli Articoli
140 pages
UNIT 2 Bigdata Mining and Analytics
No ratings yet
UNIT 2 Bigdata Mining and Analytics
18 pages
Online Product Quantization
No ratings yet
Online Product Quantization
18 pages
Big Data Unit II
No ratings yet
Big Data Unit II
23 pages
Secure Mining of Association Rules in Horizontally Distributed Databases
No ratings yet
Secure Mining of Association Rules in Horizontally Distributed Databases
3 pages
Efficient Nearest Neighbor Search in High Dimensional Hamming Space
No ratings yet
Efficient Nearest Neighbor Search in High Dimensional Hamming Space
11 pages
Nips 11
No ratings yet
Nips 11
9 pages
Wang 等 - 2019 - A Memory-Efficient Sketch Method for Estimating Hi
No ratings yet
Wang 等 - 2019 - A Memory-Efficient Sketch Method for Estimating Hi
10 pages
Paper 5
No ratings yet
Paper 5
10 pages
Filtering Search A New Approach To Query-Answering
No ratings yet
Filtering Search A New Approach To Query-Answering
22 pages
Dedupe Theory
No ratings yet
Dedupe Theory
152 pages
Compusoft, 3 (7), 1020-1023 PDF
No ratings yet
Compusoft, 3 (7), 1020-1023 PDF
4 pages
(IJCST-V12I2P10) :CH. Nikitha Reddy, P.V.Shilohini Angel, P. Hrithika Malkan, V. Nikitha, Mr.K. Anil Kumar
No ratings yet
(IJCST-V12I2P10) :CH. Nikitha Reddy, P.V.Shilohini Angel, P. Hrithika Malkan, V. Nikitha, Mr.K. Anil Kumar
4 pages
Paper Geohash
No ratings yet
Paper Geohash
12 pages
E Cient Histogram-Based Similarity Search in Ultra-High Dimensional Space
No ratings yet
E Cient Histogram-Based Similarity Search in Ultra-High Dimensional Space
15 pages
003 05 KNN - Enhancements W3L2
No ratings yet
003 05 KNN - Enhancements W3L2
10 pages
Approximate Data Structures With Applications
No ratings yet
Approximate Data Structures With Applications
8 pages
A Privacy-Preserving Framework For Large-Scale Content-Based Information Retrieval Using K-Secure Sum Protocol
No ratings yet
A Privacy-Preserving Framework For Large-Scale Content-Based Information Retrieval Using K-Secure Sum Protocol
4 pages
39 Remotesensing 13 02924 v2
No ratings yet
39 Remotesensing 13 02924 v2
16 pages
Graph-Based Nearest Neighbor Search: From Practice To Theory
No ratings yet
Graph-Based Nearest Neighbor Search: From Practice To Theory
31 pages
Pattern Recognition: Fundamentals and Applications
From Everand
Pattern Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
From Everand
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
Fouad Sabry
No ratings yet
Murabahah and Murabahah For Purchase Orderer: Islamic Financial Transactions
No ratings yet
Murabahah and Murabahah For Purchase Orderer: Islamic Financial Transactions
14 pages
Teaching Resume 2017
No ratings yet
Teaching Resume 2017
2 pages
Defence Standard 00-970 Part 1 Section 1: Issue 13 Date: 13 Jul 2015
No ratings yet
Defence Standard 00-970 Part 1 Section 1: Issue 13 Date: 13 Jul 2015
23 pages
ASN RB Nov Dec 2023 1
No ratings yet
ASN RB Nov Dec 2023 1
51 pages
API FR - INR.RINR DS2 en Excel v2 2917298
No ratings yet
API FR - INR.RINR DS2 en Excel v2 2917298
74 pages
Subtitle
No ratings yet
Subtitle
4 pages
Types of Sensor and Their Application
50% (2)
Types of Sensor and Their Application
6 pages
CESC Summative Test No.2
No ratings yet
CESC Summative Test No.2
4 pages
Answers To The First General Quick TEST UTME
No ratings yet
Answers To The First General Quick TEST UTME
22 pages
PYQ DEMO COMBO PYQ BANK All Odisha Previous Year Subject Wise Topic Wise 20000 Questions Answer PDF
100% (1)
PYQ DEMO COMBO PYQ BANK All Odisha Previous Year Subject Wise Topic Wise 20000 Questions Answer PDF
51 pages
Classification of Wildlife: Geography Project
No ratings yet
Classification of Wildlife: Geography Project
2 pages
Full Download The Future of HRD, Volume I: Innovation and Technology Mark Loon PDF
100% (2)
Full Download The Future of HRD, Volume I: Innovation and Technology Mark Loon PDF
76 pages
ReportPdfResponseServlet - 2024-12-20T111226.809
No ratings yet
ReportPdfResponseServlet - 2024-12-20T111226.809
9 pages
JPPPF June2025 111 02 13 26 Dwi+Ambar
No ratings yet
JPPPF June2025 111 02 13 26 Dwi+Ambar
14 pages
06 - Class 06 - Trade Setups
No ratings yet
06 - Class 06 - Trade Setups
12 pages
Pre-Call Report 4 Template - MKT3403-S2020
No ratings yet
Pre-Call Report 4 Template - MKT3403-S2020
5 pages
Pre-Schwarzian and Schwarzian Norm Estimates For Subclasses of Univalent Functions
No ratings yet
Pre-Schwarzian and Schwarzian Norm Estimates For Subclasses of Univalent Functions
19 pages
Computer Ports and Cables
No ratings yet
Computer Ports and Cables
7 pages
Amer Shield
No ratings yet
Amer Shield
4 pages
The Threats To The Objectivity in Internal Auditing
No ratings yet
The Threats To The Objectivity in Internal Auditing
2 pages
Structural Foundation Sections Sheet 1 of 2
No ratings yet
Structural Foundation Sections Sheet 1 of 2
1 page
14 Hes
No ratings yet
14 Hes
2 pages
En10272 PDF
100% (1)
En10272 PDF
42 pages
Medical Appointment Application: Acta Electronica Malaysia (AEM)
No ratings yet
Medical Appointment Application: Acta Electronica Malaysia (AEM)
5 pages
Survey Instrument Validation Rating Scale SHS 2023
No ratings yet
Survey Instrument Validation Rating Scale SHS 2023
1 page
Đề Thi Học Kì 1 Lớp 3 Môn Tiếng Anh
No ratings yet
Đề Thi Học Kì 1 Lớp 3 Môn Tiếng Anh
56 pages
Radical-Scavenging Effects of Aloe Arborescens Miller On Prevention of Pancreatic Islet B-Cell Destruction in Rats
No ratings yet
Radical-Scavenging Effects of Aloe Arborescens Miller On Prevention of Pancreatic Islet B-Cell Destruction in Rats
9 pages
Emergency Nursing Questionnaires 2
No ratings yet
Emergency Nursing Questionnaires 2
1 page
Intellexa 2025 Final
No ratings yet
Intellexa 2025 Final
40 pages

A Survey On Learning To Hash

Uploaded by

A Survey On Learning To Hash

Uploaded by

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO.

4, APRIL 2018 769

A Survey on Learning to Hash

T HE problem of nearest neighbor search, also known as

where fsmt gTt¼1

Comments. We present an integrated objective function 7.2 Cartesian Quantization

points belonging to one cluster are partitioned (quantized)

TABLE 2 validation criterion is not the objective function value but

Accuracy Efficiency Overall

Search Accuracy. Multiwise similarity preserving is better

You might also like