A Survey On Learning To Hash
Abstract—Nearest neighbor search is the problem of finding the data points in a database whose distances to a query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of learning to hash algorithms, categorize them according to the manner of preserving similarities into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization, and discuss their relations. We separate quantization from pairwise similarity preserving because its objective function is very different, though quantization, as we show, can be derived from preserving pairwise similarities. In addition, we present the evaluation protocols and a general performance analysis, and point out that the quantization algorithms perform superiorly in terms of search accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics.
Index Terms—Similarity search, approximate nearest neighbor search, hashing, learning to hash, quantization, pairwise similarity preserving,
multiwise similarity preserving, implicit similarity preserving
1 INTRODUCTION
the object, e.g., image, under the deep learning framework, instead of first learning the representations and then computing the hash codes from the representations. In addition, we discuss other problems, including evaluation datasets, evaluation schemes, and so on. Meanwhile, we present the empirical observation that the quantization approach outperforms other approaches and give some analysis of this observation.

In comparison to other surveys on hashing [145], [147], this survey focuses more on learning to hash and discusses quantization-based solutions in greater depth. Our categorization methodology helps readers understand the connections and differences between existing algorithms. In particular, we point out the empirical observation that quantization is superior in terms of search accuracy, search efficiency, and space cost.

2 BACKGROUND

2.1 Nearest Neighbor Search
Exact nearest neighbor search is defined as searching for an item NN(q) (called the nearest neighbor) of a query item q from a set of N items X = {x_1, x_2, ..., x_N} so that NN(q) = arg min_{x∈X} dist(q, x), where dist(q, x) is a distance computed between q and x. A straightforward generalization is K-nearest neighbor search, where we need to find the K nearest neighbors.

The distance between a pair of items x and q depends on the specific nearest neighbor search problem. A typical example is that the search (reference) database X lies in a d-dimensional space R^d and the distance is induced by an ℓ_s norm, ||x − q||_s = (Σ_{i=1}^d |x_i − q_i|^s)^{1/s}. The search problem under the Euclidean distance, i.e., the ℓ_2 norm, is widely studied. Other forms of the data item (for example, a data item formed by a set) and other distance measures, such as the ℓ_1 distance, cosine similarity, and so on, are also possible.

There exist efficient algorithms (e.g., k-d trees) for exact nearest neighbor search in low-dimensional cases. In large-scale high-dimensional cases, the problem becomes hard, and most algorithms even take a higher computational cost than the naive solution, i.e., the linear scan. Therefore, a lot of recent effort has moved to searching for approximate nearest neighbors: error-constrained nearest (near) neighbor search, and time-constrained approximate nearest neighbor search [103], [105]. The error-constrained search includes (randomized) (1 + ε)-approximate nearest neighbor search [1], [14], [44] and (approximate) fixed-radius near neighbor (R-near neighbor) search [6].

Time-constrained approximate nearest neighbor search limits the time spent during the search and is studied mostly for real applications, though it usually lacks an elegant theory behind it. The goal is to make the search as accurate as possible, measured by comparing the returned K approximate nearest neighbors with the K exact nearest neighbors, and to make the query cost as small as possible. For example, when comparing the learning to hash approaches that use linear scan based on the Hamming distance for search, it is typically assumed that the search time is the same for the same code length, ignoring other small costs. When comparing indexing structure algorithms, e.g., tree-based [103], [105], [152] or neighborhood graph-based [151], the time-constrained search is usually transformed to another approximate form: terminate the search after examining a fixed number of data points.

2.2 Search with Hashing
The hashing approach aims to map the reference (and query) items to target items so that approximate nearest neighbor search can be performed efficiently and accurately by resorting to the target items and possibly a small subset of the raw reference items. The target items are called hash codes (a.k.a. hash values, or simply hashes); in this paper we also use the terms short codes and compact codes interchangeably. The hash function is formally defined as y = h(x), where y is the hash code, which may be an integer or a binary value, 1 and 0 (or −1), and h(·) is the hash function. In the application to approximate nearest neighbor search, usually several hash functions are used together to compute a compound hash code: y = h(x), where y = [y_1 y_2 ... y_M]^T and h(x) = [h_1(x) h_2(x) ... h_M(x)]^T. Here we use a vector y to represent the compound hash code for convenience.

There are two basic strategies for using hash codes to perform nearest (near) neighbor search: hash table lookup and hash code ranking. The search strategies are illustrated in Fig. 1. The main idea of hash table lookup for accelerating the search is to reduce the number of distance computations. The data structure, called a hash table (a form of inverted index), is composed of buckets, with each bucket indexed by a hash code. Each reference item x is placed into the bucket h(x). Different from conventional hashing algorithms in computer science, which avoid collisions (i.e., avoid mapping two items into the same bucket), the hashing approach using a hash table essentially aims to maximize the probability of collision of near items and at the same time minimize the probability of collision of items that are far away. Given the query q, the items lying in the bucket h(q) are retrieved as candidates for the nearest items of q. Usually this is followed by a reranking step: rerank the retrieved nearest neighbor candidates according to the true distances computed using the original features and obtain the nearest neighbors.

To improve the recall, two ways are often adopted. The first way is to visit a few more buckets (but with a single hash table), whose corresponding hash codes are the nearest to (the hash code h(q) of) the query according to the distances in the coding space. The second way is to construct several (e.g., L) hash tables. The items lying in the L hash buckets h_1(q), ..., h_L(q) are retrieved as candidates for near items of q, which are possibly ordered according to the number of hits of each item in the L buckets. To guarantee high precision, each of the L hash codes, y_l, needs to be a long code. This means that the total number of buckets is too large to index directly, and thus only the non-empty buckets are retained, by using conventional hashing over the hash codes h_l(x).

The second way essentially stores multiple copies of the id of each reference item. Consequently, the space cost is larger. In contrast, the space cost of the first way is smaller, as it only uses a single table and stores one copy of the id of each reference item, but it needs to access more buckets to guarantee the same recall as the second way. The multiple assignment scheme is also studied: construct a single table, but assign a reference item to multiple hash buckets.
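The multi-table lookup scheme described above can be sketched in a few lines. This is a minimal illustration, not code from the survey: the random-hyperplane hash bits, the `build_tables`/`lookup` names, and the toy 2-D data are all assumptions made here.

```python
from collections import defaultdict

def hash_code(x, planes):
    # Compound hash code: one bit per hyperplane, 1 iff <w, x> >= 0.
    return tuple(1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
                 for w in planes)

def build_tables(points, tables_planes):
    # One bucket dictionary per table; each reference id is stored L times.
    tables = []
    for planes in tables_planes:
        buckets = defaultdict(list)
        for idx, x in enumerate(points):
            buckets[hash_code(x, planes)].append(idx)
        tables.append(buckets)
    return tables

def lookup(q, points, tables, tables_planes, k=1):
    # Union of the L buckets h_1(q), ..., h_L(q), then rerank the
    # candidates by the true distance computed on the original features.
    candidates = set()
    for buckets, planes in zip(tables, tables_planes):
        candidates.update(buckets.get(hash_code(q, planes), []))
    return sorted(candidates,
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(points[i], q)))[:k]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (-4.0, 3.0)]
tables_planes = [[(1.0, 0.0), (0.0, 1.0)],   # table 1: axis-aligned planes
                 [(1.0, 1.0), (1.0, -1.0)]]  # table 2: diagonal planes
tables = build_tables(points, tables_planes)
nearest = lookup((0.08, 0.15), points, tables, tables_planes, k=2)
```

Note that every reference id is stored once per table, which is exactly why the multi-table scheme has the L-fold space cost discussed above.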
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 771
Fig. 1. Illustrating the search strategies. (a) Multi-table lookup: the list corresponding to the hash code of the query in each table is retrieved. (b) Single-table lookup: the lists corresponding to, and near to, the hash code of the query are retrieved. (c) Hash code ranking: compare the query with each reference item in the coding space. (d) Non-exhaustive search: hash table lookup (or another inverted index structure) retrieves the candidates, followed by hash code ranking over the candidates.
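For binary codes, the hash code ranking strategy of Fig. 1(c) reduces to a popcount over XOR-ed integers. The sketch below assumes the linear sign-based hash function with toy projections; `bin(...).count("1")` stands in for the CPU popcnt instruction, and all names and values are illustrative.

```python
def compound_code(x, W, b):
    # Pack M sign bits into one integer: bit m is 1 iff w_m^T x + b_m >= 0.
    code = 0
    for m, (w, bm) in enumerate(zip(W, b)):
        if sum(wi * xi for wi, xi in zip(w, x)) + bm >= 0:
            code |= 1 << m
    return code

def hamming(a, b):
    # Number of differing bits: popcount of the XOR of the two codes.
    return bin(a ^ b).count("1")

def rank(query_code, codes, k=2):
    # Exhaustive search: rank every reference code by Hamming distance.
    return sorted(range(len(codes)),
                  key=lambda i: hamming(query_code, codes[i]))[:k]

W = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]   # three illustrative projections
b = [0.0, 0.0, 0.0]
refs = [(2.0, 1.0), (-1.0, 2.0), (-2.0, -3.0)]
codes = [compound_code(x, W, b) for x in refs]
top = rank(compound_code((1.5, 0.5), W, b), codes)
```

In practice the top candidates returned this way would then be reranked with true distances on the original features, as described in the text.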
In essence, it can be shown that the second way, multiple hash tables, is a form of multiple assignment.

Hash code ranking performs an exhaustive search: compare the query with each reference item by fast evaluation of their distance (e.g., using a distance lookup table, or the CPU instruction popcnt for the Hamming distance) according to (the hash code of) the query and the hash code of the reference item, and retrieve the reference items with the smallest distances as candidates for the nearest neighbors. Usually this is followed by a reranking step: rerank the retrieved nearest neighbor candidates according to the true distances computed using the original features and obtain the nearest neighbors.

This strategy exploits one main advantage of hash codes: the distance using hash codes can be computed efficiently, at a cost much smaller than that of the distance computation in the original input space.

Comments. Hash table lookup is mainly used in locality sensitive hashing, and has been used for evaluating learning to hash in a few publications. It has been pointed out in [156], and also observed from empirical results, that LSH-based hash table lookup, except min-hash, is rarely adopted in reality, while hash table lookup with quantization-based hash codes is widely used in the non-exhaustive strategy to retrieve coarse candidates [50]. Hash code ranking goes through all the candidates and thus is inferior in search efficiency compared with hash table lookup, which only checks a small subset of candidates determined by a lookup radius. A practical way is to do a non-exhaustive search, as suggested in [4], [50]: first retrieve a small set of candidates using the inverted index, which can be viewed as a hash table, and then compute the distances of the query to the candidates using longer hash codes, providing the top candidates that are subsequently reranked using the original features. Other research efforts include organizing the hash codes with a data structure, such as a tree or a graph structure [104], to avoid the exhaustive search.

3 LEARNING TO HASH
Learning to hash is the task of learning a (compound) hash function, y = h(x), mapping an input item x to a compact code y, aiming that the nearest neighbor search result for a query q is as close as possible to the true nearest neighbor search result and that the search in the coding space is also efficient.

A learning-to-hash approach needs to consider five problems: what hash function h(x) is adopted, what similarity in the coding space is used, what similarity is provided in the input space, what loss function is chosen for the optimization objective, and what optimization technique is adopted.

3.1 Hash Function
The hash function can be based on linear projection, kernels, a spherical function, (deep) neural networks, a non-parametric function, and so on. One popular hash function is the linear hash function, e.g., [136], [141]:

    y = h(x) = sgn(w^T x + b),    (1)

where sgn(z) = 1 if z ≥ 0 and sgn(z) = 0 (or equivalently −1) otherwise, w is the projection vector, and b is the bias variable. The kernel function,

    y = h(x) = sgn(Σ_{t=1}^T w_t K(s_t, x) + b),    (2)
772 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018
is also adopted in some approaches, e.g., [40], [66], where {s_t} is a set of representative samples that are randomly drawn from the dataset or are cluster centers of the dataset, and {w_t} are the weights. The non-parametric function based on nearest vector assignment is widely used in quantization-based solutions:

    y = arg min_{k∈{1,...,K}} ||x − c_k||_2,    (3)

where {c_1, ..., c_K} is a set of centers computed by some algorithm, e.g., K-means, and y ∈ Z^+ is an integer. In contrast to other hashing algorithms, in which the distance, e.g., the Hamming distance, is often directly computed from the hash codes, the hash codes generated from the nearest vector assignment-based hash function are the indices of the nearest vectors, and the distance is computed using the centers corresponding to the hash codes.

The form of the hash function is an important factor influencing the search accuracy obtained with the hash codes, as well as the time cost of computing the hash codes. A linear function is efficiently evaluated, while the kernel function and the nearest vector assignment-based function lead to better search accuracy as they are more flexible. Almost all methods using a linear hash function can be extended to nonlinear hash functions, such as kernelized hash functions or neural networks. Thus we do not use the hash function to categorize the hashing algorithms.

3.2 Similarity
In the input space, the distance d^o_ij between any pair of items (x_i, x_j) could be the Euclidean distance, ||x_i − x_j||_2, or others. The similarity s^o_ij is often defined as a function of the distance d^o_ij, a typical choice being the Gaussian function: s^o_ij = g(d^o_ij) = exp(−(d^o_ij)^2 / (2σ^2)). There exist other similarity forms, such as the cosine similarity x_i^T x_j / (||x_i||_2 ||x_j||_2). Besides, the semantic similarity is often used for semantic similarity search. In this case, the similarity s^o_ij is usually binary, valued 1 if the two items x_i and x_j belong to the same semantic class, and 0 (or −1) otherwise. The hashing algorithms for semantic similarity can usually be applied to other distances, such as the Euclidean distance, by defining a pseudo-semantic similarity: s^o_ij = 1 for nearby points (i, j) and s^o_ij = 0 (or −1) for farther points (i, j).

In the hash coding space, the typical distance d^h_ij between y_i and y_j is the Hamming distance. It is defined as the number of bits where the values differ and is mathematically formulated as

    d^h_ij = Σ_{m=1}^M δ[y_im ≠ y_jm],

which is equivalent to d^h_ij = ||y_i − y_j||_1 if the code is valued by 1 and 0. The distance for codes valued by 1 and −1 is similarly defined. The similarity based on the Hamming distance is defined as s^h_ij = M − d^h_ij for codes valued by 1 and 0, counting the number of bits where the values are the same. The inner product s^h_ij = y_i^T y_j is used as the similarity for codes valued by 1 and −1. These measures are also extended to weighted cases: e.g., d^h_ij = Σ_{m=1}^M λ_m δ[y_im ≠ y_jm] and s^h_ij = y_i^T Λ y_j, where Λ = Diag(λ_1, λ_2, ..., λ_M) is a diagonal matrix and each diagonal entry is the weight of the corresponding hash bit.

Besides the Hamming distance/similarity and its variants, the Euclidean distance is typically used in quantization approaches, and is evaluated between the vectors corresponding to the hash codes, d^h_ij = ||c_{y_i} − c_{y_j}||_2 (symmetric distance), or between the query q and the center that approximates x_j, d^h_qj = ||q − c_{y_j}||_2 (asymmetric distance, which is preferred because the accuracy is higher and the time cost is almost the same). The distance is usually evaluated efficiently in the search stage by using a distance lookup table. There are also some works learning/optimizing the distances between hash codes [37], [148] after the hash codes have already been computed.

3.3 Loss Function
The basic rule of designing the loss function is to preserve the similarity order, i.e., to minimize the gap between the approximate nearest neighbor search result computed from the hash codes and the true search result obtained from the input space.

The widely-used solution is pairwise similarity preserving, making the distances or similarities between a pair of items computed from the input and coding spaces as consistent as possible. The multiwise similarity preserving solution, making the orders among multiple items computed from the input and coding spaces as consistent as possible, is also studied. One class of solutions, e.g., spatial partitioning, implicitly preserves the similarities. The quantization-based solution and other reconstruction-based solutions aim to find the optimal approximation of the item in terms of the reconstruction error through a reconstruction function (e.g., in the form of a lookup table in quantization, or an auto-encoder in [120]). Besides similarity preserving terms, some approaches introduce bucket balance or its approximate variants as extra constraints, which is also important for obtaining better results or avoiding trivial solutions.

3.4 Optimization
The challenges in optimizing the hash function parameters lie in two main factors. One is that the problem contains the sgn function, which leads to a challenging mixed-binary-integer optimization problem. The other is that the time complexity is high when processing a large number of data points, which is usually handled by sampling a subset of points or a subset of constraints (or equivalent basic terms in the objective functions).

The ways to handle the sgn function are summarized below. The first way is the most widely-adopted continuous relaxation, including sigmoid relaxation, tanh relaxation, and directly dropping the sign function, sgn(z) ≈ z. The relaxed problem is then solved using various standard optimization techniques. The second is a two-step scheme [76], [77], with its extension to alternative optimization [32]: optimize the binary codes without considering the hash function, and then estimate the function parameters from the optimized hash codes. The third is discretization: drop the sign function (sgn(z) ≈ z) and regard the hash code as a discrete approximation of z, which is formulated as a loss (y − z)^2. There also exist other ways, only
adopted in a few algorithms, e.g., transforming the problem into a latent structure-SVM formulation in [107], [109], and the coordinate-descent approach in [66] (fixing all but one weight and optimizing the original objective with respect to a single weight in each iteration), both of which do not conduct continuous relaxation.

3.5 Categorization
Our survey categorizes the existing algorithms into various classes: the pairwise similarity preserving class, the multiwise similarity preserving class, the implicit similarity preserving class, as well as the quantization class, according to what similarity preserving manner is adopted to formulate the objective function. We separate the quantization class from the pairwise similarity preserving class as they are very different in formulation, though the quantization class can be explained from the perspective of pairwise similarity preserving. In the following description, we may call quantization quantization-based hashing, and call other algorithms, in which a hash function generates a binary value, binary code hashing. In addition, we will also discuss other studies on learning to hash. A summary of the representative algorithms is given in Table 1.

The main reason we choose the similarity preserving manner for the categorization is that similarity preservation is the essential goal of hashing. It should be noted that, as pointed out in [145], [147], other factors, such as the hash function or the optimization algorithm, are also important for the search performance.

4 PAIRWISE SIMILARITY PRESERVING
The algorithms aligning the distances or similarities of a pair of items computed from the input space and the Hamming coding space are roughly divided into the following groups:

Similarity-distance product minimization (SDPM): min Σ_{(i,j)∈E} s^o_ij d^h_ij. The distance in the coding space is expected to be smaller if the similarity in the original space is larger. Here E is the set of pairs of items that are considered.

Similarity-similarity product maximization (SSPM): max Σ_{(i,j)∈E} s^o_ij s^h_ij. The similarity in the coding space is expected to be larger if the similarity in the original space is larger.

Distance-distance product maximization (DDPM): max Σ_{(i,j)∈E} d^o_ij d^h_ij. The distance in the coding space is expected to be larger if the distance in the original space is larger.

Distance-similarity product minimization (DSPM): min Σ_{(i,j)∈E} d^o_ij s^h_ij. The similarity in the coding space is expected to be smaller if the distance in the original space is larger.

Similarity-similarity difference minimization (SSDM): min Σ_{(i,j)∈E} (s^o_ij − s^h_ij)^2. The difference between the similarities is expected to be as small as possible.

Distance-distance difference minimization (DDDM): min Σ_{(i,j)∈E} (d^o_ij − d^h_ij)^2. The difference between the distances is expected to be as small as possible.

Normalized similarity-similarity divergence minimization (NSSDM): min KL({s̄^o_ij}, {s̄^h_ij}) = min (−Σ_{(i,j)∈E} s̄^o_ij log s̄^h_ij). Here s̄^o_ij and s̄^h_ij are normalized similarities in the input space and the coding space: Σ_ij s̄^o_ij = 1 and Σ_ij s̄^h_ij = 1.

The following reviews these groups of algorithms, except the distance-similarity product minimization group, for which we are not aware of any algorithm. It should be noted that merely optimizing one of the above similarity preserving functions, e.g., SDPM or SSPM, is not enough and may lead to trivial solutions, so it is necessary to add other constraints, which are detailed in the following discussion. We also point out the relation between similarity-distance product minimization and similarity-similarity product maximization, the relation between similarity-similarity product maximization and similarity-similarity difference minimization, as well as the relation between distance-distance product maximization and distance-distance difference minimization.

4.1 Similarity-Distance Product Minimization
We first introduce spectral hashing and its extensions, and then review other forms.

4.1.1 Spectral Hashing
The goal of spectral hashing [156] is to minimize Σ_{(i,j)∈E} s^o_ij d^h_ij, where the Euclidean distance in the hashing space, d^h_ij = ||y_i − y_j||^2_2, is used for formulation simplicity and optimization convenience, and the similarity in the input space is defined as s^o_ij = exp(−||x_i − x_j||^2_2 / (2σ^2)). Note that the Hamming distance can still be used in the search stage for higher efficiency, as the Euclidean distance and the Hamming distance in the coding space are consistent: the larger the Euclidean distance, the larger the Hamming distance. The objective function can be written in a matrix form,

    min Σ_{(i,j)∈E} s^o_ij d^h_ij = trace(Y(D − S)Y^T),    (4)

where Y = [y_1 y_2 ... y_N] is a matrix of size M × N, S = [s^o_ij]_{N×N} is the similarity matrix, and D = diag(d_11, ..., d_NN) is a diagonal matrix with d_nn = Σ_{i=1}^N s^o_ni.

There is a trivial solution to problem (4): y_1 = y_2 = ... = y_N. To avoid it, the code balance condition is introduced: the number of data items mapped to each hash code should be the same. Bit balance and bit uncorrelation are used to approximate the code balance condition. Bit balance means that each bit has about a 50 percent chance of being 1 or −1. Bit uncorrelation means that different bits are uncorrelated. The two conditions are formulated as

    Y1 = 0,  YY^T = I,    (5)

where 1 is an N-dimensional all-1 vector and I is an identity matrix.

Under the assumption of a separable multi-dimensional uniform data distribution, the hashing algorithm is given as follows:
TABLE 1
A Summary of Representative Hashing Algorithms with Respect to Similarity Preserving Functions, Code Balance, Hash Function, Similarity in the Coding Space, and the Manner of Handling the sgn Function
pres. = preserving, sim. = similarity. BB = bit balance, BU = bit uncorrelation, BMIM = bit mutual information minimization, BKB = bucket balance. H = Hamming distance, WH = weighted Hamming distance, SH = spherical Hamming distance, C = cosine, E = Euclidean distance, DNN = deep neural networks. Drop = drop the sgn operator in the hash function, Sigmoid = sigmoid relaxation, [a, b] = [a, b]-bounded relaxation, Tanh = tanh relaxation, Discretize = drop the sgn operator in the hash function and regard the hash code as a discrete approximation of the hash value, Keep = optimization without relaxation for sgn, Two-step = two-step optimization.
1) Find the principal components of the N d-dimensional reference data items using principal component analysis (PCA).
2) Compute the M one-dimensional Laplacian eigenfunctions with the M smallest eigenvalues along each PCA direction (d directions in total).
3) Pick the M eigenfunctions with the smallest eigenvalues among the Md eigenfunctions.
4) Threshold the eigenfunctions at zero, obtaining the binary codes.

The one-dimensional Laplacian eigenfunction for the case of a uniform distribution on [r_l, r_r] is φ_m(x) = sin(π/2 + (mπ / (r_r − r_l)) x), and the corresponding eigenvalue is λ_m = 1 − exp(−(ε^2 / 2) |mπ / (r_r − r_l)|^2), where m (= 1, 2, ...) is the frequency and ε is a fixed small value. The hash function is formally written as h(x) = sgn(sin(π/2 + γ w^T x)), where γ depends on the frequency m and the range of the projections along the direction w.

Analysis. In the case that the spreads along the top M PCA directions are the same, the hashing algorithm partitions each direction into two parts using the median (due to the bit balance requirement) as the threshold, which is equivalent to thresholding at the mean value under the assumption of uniform data distributions. In the case that the true data distribution is a multi-dimensional isotropic Gaussian distribution, the algorithm is equivalent to two quantization algorithms: iterative quantization [36], [35] and isotropic hashing [63].

Regarding the performance, this method performs well for short hash codes but poorly for long hash codes. The reason includes three aspects. First, the assumption that the data follow a uniform distribution does not hold in real cases. Second, the eigenvalue monotonically increases with respect to |mπ / (r_r − r_l)|, which means that a PCA direction with a large spread (|r_r − r_l|) and a lower frequency (m) is preferred. Hence there might be more than one eigenfunction
picked along a single PCA direction, which breaks the where r is a hyper-parameter used as a threshold in the Ham-
uncorrelation requirement. Last, thresholding the eigen- ming space to differentiate similar pairs from dissimilar
function fm ðxÞ ¼ sin ðp2 þ rrmp
rl xÞ at zero leads to that near
pairs, is another hyper-parameter that controls the ratio of
points may be mapped to different hash values and farther the slopes for the penalties incurred for similar (or dissimilar)
points may be mapped to the same hash value. As a result, points. The hash function is in the linear form: y ¼ sgnðW> xÞ.
the Hamming distance is not well consistent to the distance The projection matrix W is estimated by transforming y ¼
in the input space. sgnðW> xÞ ¼ arg maxh0 2H h0> W> x and optimizing using struc-
Extensions. There are some extensions using PCA. (1) tured prediction with latent variables. The hyper-parameters
Principal component hashing [98] uses the principal direction r and are chosen via cross-validation.
to formulate the hash function; (2) Searching with expecta- Comments. Besides the optimization techniques, the main
tions [123] and transform coding [9] that transforms the data differences of the three representative algorithms, i.e., spec-
using PCA and then adopts the rate distortion optimization tral hashing, LDA hashing, and minimal loss hashing, are
(bits allocation) approach to determine which principal twofold. First, the similarity in the input space in spectral
direction is used and how many bits are assigned to such a hashing is defined as a continuous positive number com-
direction; (3) Double-bit quantization handles the third draw- puted from the Euclidean distance, while in LDA hashing
back in spectral hashing by distributing two bits into each and minimal loss hashing the similarity is set to 1 for a simi-
projection direction, conducting only 3-cluster quantization, lar pair and 1 for a dissimilar pair. Second, the distance in
and assigning 01, 00, and 11 to each cluster. Instead of PCA, the hashing space for formulating the objective function in
ICA hashing [39] adopts independent component analysis minimal loss hashing is different from spectral hashing and
for hashing and uses bit balance and bit mutual information LDA hashing.
minimization for code balance.
There are many other extensions in a wide range, including 4.2 Similarity-Similarity Product Maximization
similarity graph extensions [75], [179], [92], [86], [84], [79], Semi-supervised hashing [141], [142], [143] is the representa-
[128], [170], hash function extensions [40], [124], weighted tive Palgorithm in this group. The objective function is
Hamming distance [153], self-taught hashing [166], sparse max ði;jÞ2E soij shij . The similarity soij in the input space is 1 if
hash codes [177], discrete hashing [164], and so on.

4.1.2 Variants
Linear discriminant analysis (LDA) hashing [136] minimizes a form of the loss function: min Σ_{(i,j)∈E} s^o_{ij} d^h_{ij}, where d^h_{ij} = ||y_i − y_j||²₂. Different from spectral hashing, (1) s^o_{ij} = 1 if data items x_i and x_j are a similar pair, (i,j) ∈ E⁺, and s^o_{ij} = −1 if data items x_i and x_j are a dissimilar pair, (i,j) ∈ E⁻, (2) a linear hash function is used: y = sgn(W^⊤ x + b), and (3) a weight α is imposed on s^o_{ij} d^h_{ij} for the similar pairs. As a result, the objective function is written as

  α Σ_{(i,j)∈E⁺} ||y_i − y_j||²₂ − Σ_{(i,j)∈E⁻} ||y_i − y_j||²₂.  (6)

The projection matrix W and the threshold b are separately optimized: (1) to estimate the orthogonal matrix W, drop the sgn function in Equation (6), leading to an eigenvalue decomposition problem; (2) estimate b by minimizing Equation (6) with fixed W through a simple 1D search scheme. A similar loss function, the contrastive loss, is adopted in [18] with a different optimization technique.

The loss function in minimal loss hashing [107] is in the form of min Σ_{(i,j)∈E} s^o_{ij} d^h_{ij}. Similar to LDA hashing, s^o_{ij} = 1 if (i,j) ∈ E⁺ and s^o_{ij} = −1 if (i,j) ∈ E⁻. Differently, the distance is hinge-like: d^h_{ij} = max(||y_i − y_j||₁ + 1, ρ) for (i,j) ∈ E⁺ and d^h_{ij} = min(||y_i − y_j||₁ − 1, ρ) for (i,j) ∈ E⁻. The intuition is that there is no penalty if the Hamming distances for similar pairs are small enough and if the Hamming distances for dissimilar pairs are large enough. The formulation, if ρ is fixed, is equivalent to

  min Σ_{(i,j)∈E⁺} max(||y_i − y_j||₁ − ρ + 1, 0) + Σ_{(i,j)∈E⁻} max(ρ − ||y_i − y_j||₁ + 1, 0).  (7)

4.2 Similarity-Similarity Product Maximization
Semi-supervised hashing [141] belongs to this group: s^o_{ij} = 1 if the pair of items x_i and x_j belong to the same class or are nearby points, and −1 otherwise. The similarity in the coding space is defined as s^h_{ij} = y_i^⊤ y_j. Thus, the objective function is rewritten as maximizing

  Σ_{(i,j)∈E} s^o_{ij} y_i^⊤ y_j.  (8)

The hash function is in a linear form, y = h(x) = sgn(W^⊤ x). Besides, the bit balance is also considered, and is formulated as maximizing the variance, trace(YY^⊤), rather than letting the mean be 0, Y1 = 0. The overall objective is to maximize

  trace(YSY^⊤) + η trace(YY^⊤),  (9)

subject to W^⊤ W = I, which is a relaxation of the bit uncorrelation condition. The estimation of W is done by directly dropping the sgn operator.

An unsupervised extension is given in [143]: sequentially compute the projection vectors {w_m}_{m=1}^{M} from w_1 to w_M by optimizing the problem (9). In particular, the first iteration computes the PCA direction as the first w, and at each of the later iterations, s^o_{ij} = 1 if nearby points are mapped to different hash values in the previous iterations, and s^o_{ij} = −1 if far points are mapped to the same hash values in the previous iterations. An extension of semi-supervised hashing to nonlinear hash functions is presented in [157] using the kernel hash function. An iterative two-step optimization using graph cuts is given in [32].

Comments. It is interesting to note that Σ_{(i,j)∈E} s^o_{ij} y_i^⊤ y_j = const − (1/2) Σ_{(i,j)∈E} s^o_{ij} ||y_i − y_j||²₂ = const − (1/2) Σ_{(i,j)∈E} s^o_{ij} d^h_{ij} if y ∈ {−1, 1}^M, where const is a constant (and thus trace(YSY^⊤) = const − trace(Y(D − S)Y^⊤)). In this case, similarity-similarity product maximization is equivalent to similarity-distance product minimization.
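The relaxed problem in Equation (9), after dropping the sgn operator, reduces to an eigendecomposition: maximize trace(W^⊤ X (S + ηI) X^⊤ W) subject to W^⊤ W = I. The following NumPy sketch illustrates this relaxation (function and variable names are ours, not from [141]; a toy signed-similarity matrix stands in for real supervision):

```python
import numpy as np

def ssh_projections(X, S, M, eta=1.0):
    """Relaxed semi-supervised hashing, Eq. (9): maximize
    trace(W^T X (S + eta*I) X^T W) s.t. W^T W = I, solved by taking the
    top-M eigenvectors after dropping the sgn operator.
    X: d x N zero-centered data; S: N x N signed similarity matrix."""
    N = X.shape[1]
    A = X @ (S + eta * np.eye(N)) @ X.T   # d x d matrix to diagonalize
    A = (A + A.T) / 2                     # symmetrize for numerical stability
    vals, vecs = np.linalg.eigh(A)
    return vecs[:, np.argsort(vals)[::-1][:M]]  # top-M eigenvectors

def encode(W, X):
    # y = sgn(W^T x), with bits in {-1, +1}
    return np.where(W.T @ X >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 100))
X -= X.mean(axis=1, keepdims=True)        # center the data
S = np.sign(X.T @ X)                      # toy signed similarity
W = ssh_projections(X, S, M=8)
Y = encode(W, X)
```

The top-M eigenvectors of X(S + ηI)X^⊤ give the projection directions; binarization then recovers the hash codes y = sgn(W^⊤ x).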
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
776 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018
4.3 Distance-Distance Product Maximization
The mathematical formulation of distance-distance product maximization is max Σ_{(i,j)∈E} d^o_{ij} d^h_{ij}. Topology preserving hashing [169] formulates the objective function by starting with this rule:

  Σ_{i,j} d^o_{ij} d^h_{ij} = Σ_{i,j} d^o_{ij} ||y_i − y_j||²₂ = trace(Y L_d Y^⊤),  (10)

where L_d = Diag{D^o 1} − D^o and D^o = [d^o_{ij}]_{N×N}. In addition, similarity-distance product minimization is also considered:

  Σ_{(i,j)∈E} s_{ij} ||y_i − y_j||²₂ = trace(Y L Y^⊤).  (11)

The overall formulation is given as follows:

  max trace(Y (L_d + αI) Y^⊤) / trace(Y L Y^⊤),  (12)

where αI is introduced as a regularization term, trace(YY^⊤), maximizing the variances, which is the same as in semi-supervised hashing [141] for bit balance. The problem is optimized by dropping the sgn operator in the hash function y = sgn(W^⊤ x) and letting W^⊤ X L X^⊤ W be an identity matrix.

4.4 Distance-Distance Difference Minimization
Binary reconstructive embedding [66] belongs to this group: min Σ_{(i,j)∈E} (d^o_{ij} − d^h_{ij})². The Euclidean distance is used in both the input and coding spaces. The objective function is formulated as follows:

  min Σ_{(i,j)∈E} ( (1/2)||x_i − x_j||²₂ − (1/M)||y_i − y_j||²₂ )².  (13)

The kernel hash function is used:

  y_{nm} = h_m(x) = sgn( Σ_{t=1}^{T_m} w_{mt} K(s_{mt}, x) ).  (14)

Expanding the squared difference (d^o_{ij} − d^h_{ij})² shows that the difference between distance-distance difference minimization and distance-distance product maximization lies in min Σ_{(i,j)∈E} (d^h_{ij})², minimizing the distances between the data items in the hash space. This could be regarded as a regularizer, complementary to distance-distance product maximization, max Σ_{(i,j)∈E} d^o_{ij} d^h_{ij}, which tends to maximize the distances between the data items in the hash space.

4.5 Similarity-Similarity Difference Minimization
Similarity-similarity difference minimization is mathematically formulated as min Σ_{(i,j)∈E} (s^o_{ij} − s^h_{ij})². Supervised hashing with kernels [85], one representative approach in this group, aims to minimize an objective function,

  min Σ_{(i,j)∈E} ( s^o_{ij} − (1/M) y_i^⊤ y_j )²,  (18)

where s^o_{ij} = 1 if (i,j) is a similar pair, and s^o_{ij} = −1 if it is a dissimilar pair. y = h(x) is a kernel hash function. Kernel reconstructive hashing [162] extends this technique using a normalized Gaussian kernel similarity. Scalable graph hashing [56] uses a feature transformation to approximate the similarity matrix (graph) without explicitly computing the similarity matrix. Binary hashing [25] solves the problem using a two-step approach, in which the first step adopts semi-definite relaxation and the augmented Lagrangian to estimate the discrete labels.

Comments. We have the following equation:

  min Σ_{(i,j)∈E} (s^o_{ij} − s^h_{ij})²  (19)
  = min Σ_{(i,j)∈E} ( (s^o_{ij})² + (s^h_{ij})² − 2 s^o_{ij} s^h_{ij} )  (20)
  = min Σ_{(i,j)∈E} ( (s^h_{ij})² − 2 s^o_{ij} s^h_{ij} ).  (21)

That is, similarity-similarity difference minimization is equivalent to similarity-similarity product maximization with an extra regularization term, min Σ_{(i,j)∈E} (s^h_{ij})².

Label-regularized maximum margin hashing [102] formulates the objective function from three components: the similarity-similarity difference, a hinge loss from the hash function, and the maximum margin part.

4.6 Normalized Similarity-Similarity Divergence Minimization
Spec hashing [78], belonging to this group, views each pair of data items as a sample and their (normalized) similarity as the probability, and finds the hash functions so that the probability distributions from the input space and the coding space are well aligned. The objective function is written as follows:

  KL({s̄^o_{ij}}, {s̄^h_{ij}}) = const − Σ_{(i,j)∈E} s̄^o_{ij} log s̄^h_{ij}.  (23)

Here, s̄^o_{ij} is the normalized similarity in the input space, Σ_{ij} s̄^o_{ij} = 1, and s̄^h_{ij} is the normalized similarity in the Hamming space, s̄^h_{ij} = (1/Z) exp(−d^h_{ij}), where Z is a normalization variable, Z = Σ_{ij} exp(−d^h_{ij}).

Supervised binary hash code learning [27] presents a supervised learning algorithm based on the Jensen-Shannon divergence, which is derived from minimizing an upper bound of the probability of Bayes decision errors.

5 MULTIWISE SIMILARITY PRESERVING
This section reviews the category of hashing algorithms that formulate the loss function by maximizing the agreement of the similarity orders over more than two items computed from the input space and the coding space.

Order preserving hashing [150] aims to learn hash functions through aligning the orders computed from the original space and the ones in the coding space. Given a data point x_n, the database points X are divided into (M+1) categories, (C^h_{n0}, C^h_{n1}, ..., C^h_{nM}), where C^h_{nm} corresponds to the items whose distances to the given point are m, and (C^o_{n0}, C^o_{n1}, ..., C^o_{nM}), using the distances in the hashing space and the distances in the input (original) space, respectively. (C^o_{n0}, C^o_{n1}, ..., C^o_{nM}) is constructed such that in the ideal case the probability of assigning an item to any hash code is the same. The basic objective function maximizing the alignment between the two categories is given as follows:

  L(h(·); X) = Σ_{n∈{1,...,N}} Σ_{m=0}^{M} ( |C^o_{nm} − C^h_{nm}| + |C^h_{nm} − C^o_{nm}| ),

where |C^o_{nm} − C^h_{nm}| is the cardinality of the difference of the two sets. The linear hash function h(x) is used, and dropping the sgn function is adopted for optimization.

Instead of preserving the order, KNN hashing [23] directly maximizes the kNN accuracy of the search result, which is solved by using the factorized neighborhood representation to parsimoniously model the neighborhood relationships inherent in the training data.

Triplet loss hashing [109] formulates the hashing problem by maximizing the similarity order agreement defined over triplets of items, {(x, x⁺, x⁻)}, where the pair (x, x⁻) is less similar than the pair (x, x⁺). The triplet loss is defined as

  ℓ_triplet(y, y⁺, y⁻) = max(1 − ||y − y⁻||₁ + ||y − y⁺||₁, 0).  (24)

The objective function is given as follows:

  Σ_{(x,x⁺,x⁻)∈D} ℓ_triplet(h(x), h(x⁺), h(x⁻)) + λ trace(W^⊤ W),

where h(x) = h(x; W) is the compound hash function. The problem is optimized using an algorithm similar to that of minimal loss hashing [107]. The extension to the asymmetric Hamming distance is also discussed in [109]. Binary optimized hashing [18] also uses a triplet loss function, with a slightly different distance measure in the Hamming space and a different optimization technique.

Top rank supervised binary coding [132] presents another form of triplet loss in order to penalize the samples that are incorrectly ranked at the top of a Hamming-distance ranking list more than those at the bottom.

Listwise supervision hashing [146] also uses triplets of items. The formulation is based on a triplet tensor S^o defined as follows:

  s^o_{ijk} = s(q_i; x_j, x_k) = { −1 if s^o(q_i, x_j) < s^o(q_i, x_k);  1 if s^o(q_i, x_j) > s^o(q_i, x_k);  0 if s^o(q_i, x_j) = s^o(q_i, x_k) }.

The objective is to maximize the triple-similarity-triple-similarity product:

  Σ_{i,j,k} s^h_{ijk} s^o_{ijk},  (25)

where s^h_{ijk} is a ranking triplet computed in the coding space using the cosine similarity, s^h_{ijk} = sgn( h(q_i)^⊤ h(x_j) − h(q_i)^⊤ h(x_k) ). Through dropping the sgn function, the objective function is transformed to

  Σ_{i,j,k} h(q_i)^⊤ ( h(x_j) − h(x_k) ) s^o_{ijk},  (26)

which is solved by dropping the sgn operator in the hash function h(x) = sgn(W^⊤ x).

Comments. Order preserving hashing considers the relation between the search lists, while triplet loss hashing and listwise supervision hashing consider the triplewise relation. The central ideas of triplet loss hashing and listwise supervision hashing are very similar; their difference lies in how the loss function is formulated, besides the different optimization techniques they adopt.

6 IMPLICIT SIMILARITY PRESERVING
We review the category of hashing algorithms that focus on pursuing effective space partitioning without explicitly evaluating the relation between the distances/similarities in the input and coding spaces. The common idea is to partition the space, formulated as a classification problem, with the maximum margin criterion or the code balance condition.

Random maximum margin hashing [61] learns a hash function with the maximum margin criterion. The point is that the positive and negative labels are randomly generated: N data items are randomly sampled, and half of the items are randomly labeled with −1 and the other half with 1. The formulation is a standard SVM formulation that is equivalent to the following form:

  max (1/||w||₂) min{ min_{i=1,...,N/2} (w^⊤ x_i⁺ + b), min_{i=1,...,N/2} −(w^⊤ x_i⁻ + b) },

where {x_i⁺} are the positive samples and {x_i⁻} are the negative samples. Note that this is different from PICODES [7], as random maximum margin hashing adopts the hyperplanes learnt from the SVM to form the hash functions, while PICODES [7] exploits the hyperplanes to check whether the hash codes are semantically separable rather than forming hash functions.

Complementary projection hashing [60], similar to complementary hashing [160], finds the hash function such that the items are as far away as possible from the partition plane corresponding to the hash function. It is formulated as H(ε − |w^⊤ x + b|), where H(·) = (1/2)(1 + sgn(·)) is the unit step function. Moreover, the bit balance condition, Y1 = 0, and the bit uncorrelation condition, requiring that the non-diagonal entries in YY^⊤ are 0, are considered. An extension is also given by using the kernel hash function. In addition, when learning the mth hash function, each data item is weighted by a variable, which is computed according to the previously computed (m−1) hash functions.

Spherical hashing [41] uses a hypersphere to partition the space. The spherical hash function is defined as h(x) = 1 if d(p, x) ≤ t and h(x) = 0 otherwise. The compound hash function consists of M spherical functions, depending on M pivots {p_1, ..., p_M} and M thresholds {t_1, ..., t_M}. The distance in the coding space is defined based on the ratio ||y_1 − y_2||₁ / (y_1^⊤ y_2). Unlike the pairwise and multiwise similarity preserving algorithms, there is no explicit function penalizing the disagreement of the similarities computed in the input and coding spaces. The M pivots and thresholds are learnt such that a pairwise bit balance condition is satisfied: |{x | h_m(x) = 1}| = |{x | h_m(x) = 0}|, and |{x | h_i(x) = b_1, h_j(x) = b_2}| = (1/4)|X|, b_1, b_2 ∈ {0, 1}, i ≠ j.

7 QUANTIZATION
The following provides a simple derivation showing that the quantization approach can be derived from the distance-distance difference minimization criterion. There is a similar statement in [50] obtained from the statistical perspective: the distance reconstruction error is statistically bounded by the quantization error. Considering two points x_i and x_j and their approximations z_i and z_j, we have

  |d^o_{ij} − d^h_{ij}|  (27)
  = | ||x_i − x_j||₂ − ||z_i − z_j||₂ |  (28)
  = | ||x_i − x_j||₂ − ||x_i − z_j||₂ + ||x_i − z_j||₂ − ||z_i − z_j||₂ |  (29)
  ≤ | ||x_i − x_j||₂ − ||x_i − z_j||₂ | + | ||x_i − z_j||₂ − ||z_i − z_j||₂ |  (30)
  ≤ ||x_j − z_j||₂ + ||x_i − z_i||₂.  (31)

Thus, |d^o_{ij} − d^h_{ij}|² ≤ 2(||x_j − z_j||²₂ + ||x_i − z_i||²₂), and

  min Σ_{i,j∈{1,2,...,N}} |d^o_{ij} − d^h_{ij}|²  (32)
  ≤ min 2 Σ_{i,j∈{1,2,...,N}} ( ||x_j − z_j||²₂ + ||x_i − z_i||²₂ )  (33)
  = min 4N Σ_{i∈{1,2,...,N}} ||x_i − z_i||²₂.  (34)

This means that the distance-distance difference minimization rule is transformed to minimizing its upper bound, the quantization error (as shown in Equation (34)), which is described as a theorem below.

Theorem 1. The distortion error in the quantization approach is an upper bound (with a scale) of the differences between the pairwise distances computed from the input features and from the approximate representations.

The quantization approach for hashing is roughly divided into two main groups: hypercubic quantization, in which the approximation z is equal to the hash code y, and Cartesian quantization, in which the approximation z corresponds to a vector formed by the hash code y, e.g., y represents the index of a candidate approximation among a set of candidate approximations. In addition, we will review the related reconstruction-based hashing algorithms.

7.1 Hypercubic Quantization
Hypercubic quantization refers to a category of algorithms that quantize a data item to a vertex in a hypercube, i.e., a vector belonging to the set {[y_1 y_2 ... y_M]^⊤ | y_m ∈ {−1, 1}} or the rotated hypercube vertices. It is in some sense related to 1-bit compressive sensing [8]: its goal is to design a measurement matrix A and a recovery algorithm such that a k-sparse unit vector x can be efficiently recovered from the sign of its linear measurements, i.e., b = sgn(Ax), while hypercubic quantization aims to find the matrix A, which is usually a rotation matrix, and the codes b, from the input x.

The widely-used scalar quantization approach with only one bit assigned to each dimension can be viewed as a hypercubic quantization approach, and can be derived by minimizing

  ||x_i − y_i||²₂,  (35)

subject to y_i ∈ {−1, 1}. The local digit coding approach [64] also belongs to this category.

7.1.1 Iterative Quantization
Iterative quantization [35], [36] preprocesses the centralized data by reducing the dimension using PCA to M dimensions, v = P^⊤ x, where P is a matrix of size d × M (M ≤ d) computed using PCA, and then finds an optimal rotation R followed by a scalar quantization. The formulation is given as

  min ||Y − R^⊤ V||²_F,  (36)

where R is a matrix of size M × M, V = [v_1 v_2 ... v_N], and Y = [y_1 y_2 ... y_N].

The problem is solved via alternating optimization. There are two alternating steps. Fixing R, Y = sgn(R^⊤ V). Fixing Y, the problem becomes the classic orthogonal Procrustes problem, and the solution is R = Ŝ S^⊤, where S and Ŝ are obtained from the SVD of YV^⊤, YV^⊤ = S Λ Ŝ^⊤.
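The two alternating steps of iterative quantization can be sketched in a few lines of NumPy (assuming PCA-projected, zero-centered input V; an illustrative sketch with our own names, not the authors' reference implementation):

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Iterative quantization, Eq. (36): alternate between Y = sgn(R^T V)
    and solving the orthogonal Procrustes problem for the rotation R.
    V: M x N PCA-projected, zero-centered data. Returns (R, Y)."""
    M = V.shape[0]
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((M, M)))  # random orthogonal init
    for _ in range(n_iter):
        Y = np.sign(R.T @ V)
        Y[Y == 0] = 1                      # break ties toward +1
        # Procrustes step: min_R ||Y - R^T V||_F with R orthogonal;
        # with SVD Y V^T = U Sigma W^T, the optimum is R^T = U W^T.
        U, _, Wt = np.linalg.svd(Y @ V.T)
        R = (U @ Wt).T
    return R, Y

rng = np.random.default_rng(3)
V = rng.standard_normal((8, 200))
V -= V.mean(axis=1, keepdims=True)
R, Y = itq(V)
```

Each iteration can only decrease the quantization objective, since both the code update and the Procrustes rotation are optimal for the fixed other variable.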
WANG ET AL.: A SURVEY ON LEARNING TO HASH 779
  max Σ_{m=1}^{M} [S]_{mm},  (38)

  s.t. [S]_{mm} = σ, m = 1, ..., M;  R^⊤ R = I.  (39)

Other extensions include cosine similarity preserving quantization (angular quantization [34]), nonlinear embedding replacing the PCA embedding [46], [175], matrix hashing [33], and so on. Quantization is also applied to supervised problems: supervised discrete hashing [125], [127], [168], [170] presents an SVM-like formulation to minimize the quantization loss and the classification loss in the hash coding space, and jointly optimizes the hash function parameters and the SVM weights. Intuitively, the goal of these methods is that the hash codes are semantically separable, which is guaranteed through maximizing the classification performance.

Here C is a matrix of size d × PK in the form of

  C = diag(C_1, C_2, ..., C_P) =
      [ C_1  0   ...  0
        0    C_2 ...  0
        ...
        0    0   ...  C_P ],

where C_p = [c_{p1} c_{p2} ... c_{pK}]. b_n = [b_{n1}^⊤ b_{n2}^⊤ ... b_{nP}^⊤]^⊤ is the composition vector, and its subvector b_{np} of length K is an indicator vector with only one entry being 1 and all others being 0, showing which element is selected from the pth source dictionary for quantization.

Extensions. Distance-encoded product quantization [42] extends product quantization by encoding both the cluster index and the distance between the cluster center and the point. The cluster index is encoded in a way similar to that in product quantization. The way of encoding the distance between a point and its cluster center is as follows: the
The introduced rotation does not affect the Euclidean distance, as the Euclidean distance is invariant to rotation, and helps to find an optimized subspace partition for quantization. Locally optimized product quantization [62] applies optimized product quantization to the search algorithm with the inverted index, where there is a quantizer for each inverted list.

7.2.2 Composite Quantization
In composite quantization [171], the operation forming an item in the dictionary from a P-tuple (c_{1i₁}, c_{2i₂}, ..., c_{Pi_P}) is the summation Σ_{p=1}^{P} c_{pi_p}. In order to compute the distance from a query q to the composed dictionary item formed by (c_{1i₁}, c_{2i₂}, ..., c_{Pi_P}) from the distances {dist(q, c_{1i₁}), ..., dist(q, c_{Pi_P})}, a constraint is introduced: the summation of the inner products of all pairs of elements that are used to approximate the vector x_n but come from different dictionaries, Σ_{i=1}^{P} Σ_{j=1,j≠i}^{P} c_{ik_{in}}^⊤ c_{jk_{jn}}, is constant.

The problem is formulated as

  min_{{C_p},{b_n},ε} Σ_{n=1}^{N} ||x_n − [C_1 C_2 ... C_P] b_n||²₂
  s.t. Σ_{j=1}^{P} Σ_{i=1,i≠j}^{P} b_{ni}^⊤ C_i^⊤ C_j b_{nj} = ε,
       b_n = [b_{n1}^⊤ b_{n2}^⊤ ... b_{nP}^⊤]^⊤,
       b_{np} ∈ {0, 1}^K, ||b_{np}||₁ = 1,
       n = 1, 2, ..., N, p = 1, 2, ..., P.  (42)

Here, C_p is a matrix of size d × K, and each column corresponds to an element of the pth dictionary C_p.

Sparse composite quantization [172] improves composite quantization by constructing a sparse dictionary, Σ_{p=1}^{P} Σ_{k=1}^{K} ||c_{pk}||₀ ≤ S, with S being a parameter controlling the sparsity degree, resulting in a great reduction of the distance table computation cost.

Connection with Product Quantization. It is shown in [171] that both product quantization and Cartesian k-means can be regarded as constrained versions of composite quantization. Composite quantization attains smaller quantization errors, yielding better search accuracy with similar search efficiency. A 2D illustration of the three algorithms is given in Fig. 2, where 2D points are grouped into 9 groups. It is observed that composite quantization is more flexible in partitioning the space and thus the quantization error is possibly smaller.

Composite quantization, product quantization, and Cartesian k-means (optimized product quantization) can be explained from the view of sparse coding, as pointed out in [2], [138], [171]: the dictionary ({C_p}) in composite quantization (product quantization and Cartesian k-means) satisfies the constant (orthogonality) constraint, and the sparse codes ({b_n}) are 0-1 vectors, where there is only one 1 in each subvector corresponding to a source dictionary.

Comments. As discussed in product quantization [50], the idea of using the summation of several dictionary items as an approximation of a data item has already been studied in the signal processing research area, known as multi-stage vector quantization, residual quantization, or more generally structured vector quantization [38], and was recently re-developed for similarity search under the Euclidean distance (additive quantization [2], [149], and tree quantization [3], which modifies additive quantization by introducing a tree-structured sparsity) and the inner product [26].

7.2.3 Variants
The work in [37] presents an approach to compute the source dictionaries given the M hash functions {h_m(x) = b_m(g_m(x))}, where g_m(·) is a real-valued embedding function and b_m(·) is a binarization function, for a better distance measure, a quantization-like distance, instead of the Hamming or weighted Hamming distance. It computes M dictionaries, each corresponding to a hash bit and being computed as

  g_{kb} = E( g_k(x) | b_k(g_k(x)) = b ),  (43)

where b ∈ {0, 1}. The distance computation cost is O(M) through looking up a distance table, which can be accelerated by dividing the hash functions into groups (e.g., each group contains 8 functions, thus reducing the cost to O(M/8)), building a table (e.g., consisting of 256 entries) per group instead of per hash function, and forming a larger distance lookup table. In contrast, optimized code ranking [148] directly estimates the distance table rather than computing it from the estimated dictionary.

Composite quantization [171] points out the relation between Cartesian quantization and sparse coding. This indicates the applicability of sparse coding to similarity search. Compact sparse coding [15], the extension of robust sparse coding [16], adopts sparse codes to represent the database items: the atom indices corresponding to nonzero codes, which is equivalent to letting the hash bits associated with nonzero codes be 1 and those with zero codes be 0, are used to build the inverted index, and the nonzero coefficients are used to reconstruct the database items and calculate the distances between the database items and the query. Anti-sparse coding [52] aims to learn a hash code so that the non-zero elements in the hash code are as many as possible.

7.3 Reconstruction
We review a few reconstruction-based hashing approaches. Essentially, quantization can be viewed as a reconstruction approach for a data item. Semantic hashing [120], [121] generates the hash codes using a deep generative model, a restricted Boltzmann machine (RBM), for reconstructing the data item. As a result, the binary codes are used for finding similar data. A variant method proposed in [13] reconstructs the input vector from the binary codes, which is effectively solved using the auxiliary coordinates algorithm. A simplified algorithm [5] finds a binary hash code that can be used to effectively reconstruct the vector through a linear transformation.

8 OTHER TOPICS
Most hashing learning algorithms assume that the similarity information in the input space, especially the semantic similarity information, and the database items have already been given. There are some approaches that learn hash functions without such assumptions: active hashing [176], which actively selects the pairs that are most informative for hash function learning and labels them for further learning, as well as online hashing [43], smart hashing [163], online sketching hashing [69], and online adaptive hashing [12], which learn the hash functions when the similar/dissimilar pairs come sequentially.

The manifold structure in the database is exploited for hashing, which is helpful for semantic similarity search, such as locally linear hashing [46], spline regression hashing [93], and inductive manifold hashing [126]. Multi-table hashing, aimed at improving locality sensitive hashing, is also studied, such as complementary hashing [160] and its multi-view extension [91], reciprocal hash tables [90] and its query-adaptive extension [88], and so on.

There are some works extending the Hamming distance. In contrast to multi-dimensional spectral hashing [155], in which the weights for the weighted Hamming distance are the same for arbitrary queries, the query-dependent distance approaches learn a distance measure whose weights or parameters depend on the specific query. Query adaptive hashing [81], a learning-to-hash version extended from query adaptive locality sensitive hashing [48], aims to select the hash bits (thus the hash functions forming the hash bits) according to the query vector. Query-adaptive class-specific bit weighting [57], [58] presents a weighted Hamming distance measure by learning the class-specific bit weights from the class information of the query. Bits reconfiguration [101] aims to learn a good distance measure over the hash codes pre-computed from a pool of hash functions.

The following reviews three research topics: joint feature and hash learning with deep learning, fast search in the Hamming space replacing the exhaustive search, and the important application of Cartesian quantization to the inverted index.

8.1 Joint Feature and Hash Learning via Deep Learning
The great success of deep neural networks for representation learning has inspired a lot of deep compact coding algorithms [30], [67], [158], [174]. Typically, these approaches, except [67], simultaneously learn the representation using a deep neural network and the hashing function under some loss functions, rather than separately learning the features and then learning the hash functions.

The methodology is similar to other learning to hash algorithms that do not adopt deep learning, and the hash function is more general and could be a deep neural network. We provide a separate discussion here because this area is relatively new. However, we will not discuss semantic hashing [120], which is usually not thought of as a feature learning approach but just a hash function learning approach. In general, almost all non-deep-learning hashing algorithms, if the similarity order (e.g., semantic similarity) is given, can be extended to deep learning based hashing algorithms. In the following, we discuss the deep learning based algorithms and also categorize them according to their similarity preserving manners.

Pairwise similarity preserving. The similarity-similarity difference minimization criterion is adopted in [158]. It uses a two-step scheme: the hash codes are computed by minimizing the similarity-similarity difference without considering the visual information, and then the image representation and hash function are jointly learnt through deep learning.

Multiwise similarity preserving. The triplet loss is used in [67], [174], which adopt the loss function defined in Equation (24) (the 1 is dropped in [67]).

Quantization. Following the scalar quantization approach, deep hashing [80] defines a loss to penalize the difference between the binary hash codes (see Equation (35)) and the real values from which a linear projection is used to generate the binary codes, and introduces the bit balance and bit uncorrelation conditions.

8.2 Fast Search in the Hamming Space
The computation of the Hamming distance is shown to be much faster than the computation of the distance in the input space. It is still expensive, however, to handle a large-scale data set using linear scan. Thus, some indexing algorithms, which have been shown effective and efficient for general vectors, are borrowed for the search in the Hamming space. For example, min-hash, a kind of LSH, is exploited to search over high-dimensional binary data [129]. In the following, we discuss other representative algorithms.

Multi-index hashing [110] and its extension [133] partition the binary codes into M disjoint substrings and build M hash tables, each corresponding to a substring, indexing all the binary codes M times. Given a query, the method outputs the NN candidates that are near to the query in at least one hash table. FLANN-binary [104] extends the FLANN algorithm [103], which was initially designed for ANN search over real-valued vectors, to search over binary vectors. The key idea is to build multiple hierarchical cluster trees to organize the binary vectors and to search for the nearest neighbors simultaneously over the multiple trees by traversing each tree in a best-first manner.

PQTable [97] extends multi-index hashing from the Hamming space to the product-quantization coding space,
for fast exact search. Unlike multi-index hashing, which flips the bits in the binary codes to find candidate tables, PQTable adopts the multi-sequence algorithm [4] to efficiently find the candidate tables. The neighborhood-graph-based search algorithm [144] for real-valued vectors is extended to the Hamming space in [59].

8.3 Inverted Multi-Index
Hash table lookup with binary hash codes is a form of inverted index. Retrieving multiple hash buckets from multiple hash tables is computationally cheaper compared with the subsequent reranking step using the true distance computed in the input space. It is also cheap to visit more buckets in a single table if the standard Hamming distance is used, as the hash codes near to that of the query can be obtained by flipping the bits of the query's hash code. If there are a lot of empty buckets, which increase the retrieval cost, the double-hash scheme or a fast search algorithm in the Hamming space, e.g., [104], [110], can be used to quickly retrieve the hash buckets.

Thanks to the multi-sequence algorithm, the Cartesian quantization algorithms are also applied to the inverted index [4], [172], [31] (called the inverted multi-index), in which each composed quantization center corresponds to an inverted list. Instead of comparing the query with all the composed quantization centers, which is computationally expensive, the multi-sequence algorithm [4] is able to efficiently produce a sequence of T inverted lists ordered by the increasing distances between the query and the composed quantization centers, whose cost is O(T log T). The study (Fig. 5 in [151]) shows that the time cost of the multi-sequence algorithm, when retrieving 10K candidates over the two datasets SIFT1M and GIST1M, is the smallest compared with other non-hashing inverted index algorithms.

Though the cost of the multi-sequence algorithm is greater than that with binary hash codes, both are relatively small and negligible compared with the subsequent reranking step that is often conducted in real applications. Thus the quantization-based inverted index (hash table) is more widely used compared with the conventional hash tables with binary hash codes.

9 EVALUATION PROTOCOLS

9.1 Evaluation Metrics
There are three main concerns for an approximate nearest neighbor search algorithm: space cost, search efficiency, and search quality. The space cost for hashing algorithms depends on the code length for hash code ranking, and on the code length and the table number for hash table lookup. The search performance is usually measured under the same space cost, i.e., the code length (and the table number) is chosen to be the same for different algorithms.

The search efficiency is measured as the time taken to return the search result for a query. In the case that the Hamming distance in hash code ranking is used in the coding space, it is not necessary to report the search time costs, because they are the same. It is necessary to report the search time cost when a non-Hamming distance or the hash table lookup scheme is used.

The search quality is measured using recall@R (i.e., a recall-R curve). For each query, we retrieve its R nearest items and compute the ratio of the true nearest items among the retrieved R items to T, i.e., the fraction of the T ground-truth nearest neighbors found in the retrieved R items. The average recall score over all the queries is used as the measure. The ground-truth nearest neighbors are computed over the original features using linear scan. Note that recall@R is equivalent to the accuracy computed after reordering the R retrieved items using the original features and returning the top T items. In the case where the linear scan cost in the hash coding space is not the same across methods (e.g., binary code hashing versus quantization-based hashing), the curve in terms of search recall and search time cost is usually reported.

The semantic similarity search, a variant of nearest neighbor search, sometimes uses the precision, the recall, the precision-recall curve, and the mean average precision (mAP). The precision is computed at the retrieved position R, i.e., R items are retrieved, as the ratio of the number of retrieved true positive items to R. The recall is computed, also at position R, as the ratio of the number of retrieved true positive items to the number of all true positive items in the database. The pairs of recall and precision in the precision-recall curve are computed by varying the retrieved position R. The mAP score is computed as follows: the average precision for a query, the area under the precision-recall curve, is computed as Σ_{t=1}^{N} P(t)Δ(t), where P(t) is the precision at cut-off t in the ranked list and Δ(t) is the change in recall from item t−1 to t; the mean of the average precisions over all the queries is computed as the final score.

9.2 Evaluation Datasets
The widely-used evaluation datasets have different scales: small, large, and very large. Various features have been used, such as SIFT features [94] extracted from Photo-tourism [131] and Caltech 101 [28], GIST features [112] from LabelMe [119] and Peekaboom [140], as well as some features used in object retrieval: Fisher vectors [116] and VLAD vectors [51]. The following presents a brief introduction to several representative datasets, which are summarized in Table 2.

MNIST [68] includes 60K 784-dimensional raw pixel features describing grayscale images of handwritten digits as the reference set, and 10K features as the queries.

SIFT10K [50] consists of 10K 128-dimensional SIFT vectors as the reference set, 25K vectors as the learning set, and 100 vectors as the query set. SIFT1M [50] is composed of
return the search result for a query, which is usually com- 1M 128-dimensional SIFT vectors as the reference set, 100K
puted as the average time over a number of queries. The vectors as the learning set, and 10K as the query set. The
time cost often does not include the cost of the reranking learning sets in SIFT10K and SIFT1M are extracted from
step (using the original feature representations) as it is Flicker images and the reference sets and the query sets are
assumed that such a cost given the same number of candi- from the INRIA holidays images [49].
dates does not depend on the hashing algorithms and can GIST1M [50] consists of 1M 960-dimensional GIST vec-
be viewed as a constant. When comparing the performance tors as the reference set, 50K vectors as the learning set,
Authorized licensed use limited to: XIDIAN UNIVERSITY. Downloaded on December 15,2023 at 02:28:43 UTC from IEEE Xplore. Restrictions apply.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 783
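The metrics defined in Section 9.1 (recall@R, precision@R, and average precision) can be made concrete with a short sketch. This is an illustrative implementation, not code from the survey; all names are ours.

```python
def recall_at_R(ranked, true_nn, R):
    """Fraction of the T ground-truth neighbors found in the top R items."""
    return len(set(ranked[:R]) & set(true_nn)) / len(true_nn)

def precision_at_R(ranked, true_nn, R):
    """Fraction of the top R retrieved items that are true positives."""
    return len(set(ranked[:R]) & set(true_nn)) / R

def average_precision(ranked, true_nn):
    """AP = sum_t P(t) * Delta(t), where Delta(t) is the change in recall
    from position t-1 to t (non-zero only where a true positive appears)."""
    true_nn = set(true_nn)
    hits, ap = 0, 0.0
    for t, item in enumerate(ranked, start=1):
        if item in true_nn:
            hits += 1
            ap += (hits / t) * (1.0 / len(true_nn))  # P(t) * Delta(t)
    return ap

# The mAP score is the mean of average_precision over all queries.
```

With a ranked list [1, 9, 2, 8, 3] and true neighbors {1, 2, 3}, recall@2 is 1/3, precision@2 is 1/2, and the average precision is (1/1 + 2/3 + 3/5)/3 = 34/45.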
TABLE 3
A Summary of Query Performance Comparison for Approximate Nearest Neighbor Search Under the Euclidean Distance
Fig. 4. (a) and (b) show the performance in terms of recall@R over SIFT1M and GloVe1.2M for the representative hashing and quantization algorithms. (c) and (d) show the performance over the ILSVRC 2012 ImageNet dataset under the Euclidean distance in terms of recall@R, and under the semantic similarity in terms of mAP versus the number of bits. BRE = binary reconstructive embedding [66], MLH = minimal loss hashing [107], LSH = locality-sensitive hashing [14], ITQ = iterative quantization [35], [36], SH = spectral hashing [156], AGH-2 = two-layer hashing with graphs [86], USPLH = unsupervised sequential projection learning hashing [143], PQ = product quantization [50], CKM = Cartesian k-means [108], CQ = composite quantization [171], SCQ = sparse composite quantization [172], whose dictionary is as sparse as that of PQ. CCA-ITQ = iterative quantization with canonical correlation analysis [36], SSH = semi-supervised hashing [143], KSH = supervised hashing with kernels [85], FastHash = fast supervised hashing [76], SDH = supervised discrete hashing with kernels [125], SDH-linear = supervised discrete hashing without using kernel representations [125], SQ = supervised quantization [154], Euclidean = linear scan with the Euclidean distance.
Fig. 4 shows the recall@R curves and the mAP results. We have several observations. (1) The performance of the quantization methods is better than that of the hashing methods in most cases, for both Euclidean distance-based and semantic search. (2) LSH, a data-independent algorithm, is generally worse than the other learning to hash approaches. (3) For Euclidean distance-based search, the performance of CQ is the best among the quantization methods, which is consistent with the analysis and the 2D illustration shown in Fig. 2.

10.2 Training Time Cost

We present the analysis of the training time cost for the case of using the linear hash function. The pairwise similarity preserving category considers the similarities of all pairs of items, and thus in general the training process takes quadratic time with respect to the number N of training samples (O(N²M + N²d)). To reduce the computational cost, sampling schemes are adopted: sample a small number (e.g., O(N)) of pairs, so that the time complexity becomes linear with respect to N, resulting in O(NM + Nd); or sample a subset of the training items (e.g., containing N′ items), so that the time complexity becomes smaller (O(N′²M + N′²d)). The multiwise similarity preserving category considers the similarities of all triples of items, and in general the training cost is greater, so the sampling scheme is also used for acceleration. The analysis for kernel hash functions and other complex functions is similar, and the time complexity for both training hash functions and encoding database items is higher.

Iterative quantization consists of a PCA preprocessing step, whose time complexity is O(Nd²), and the hash code and hash function optimization step, whose time complexity is O(NM² + M³) (M is the number of hash bits). The whole complexity is O(Nd² + NM² + M³). Product quantization includes the k-means process for each partition, and the complexity is O(TNKP), where K is usually 256, P = M/8, and T is the number of iterations of the k-means algorithm. The complexity of Cartesian k-means is O(Nd² + d³). The time complexity of composite quantization is O(NKPd + NP² + P²K²d). In summary, the time complexity of iterative quantization is the lowest and that of composite quantization is the highest. This indicates that a larger offline computation cost is paid for a higher (online) search performance.

11 EMERGING TOPICS

The main goal of a hashing algorithm is to accelerate the online search through fast Hamming distance computation or fast distance table lookup. The offline hash function learning and hash code computation are shown to be still expensive, and have become attractive research topics. The computation cost of the distance table used for lookup is usually thought to be negligible, but in reality it can be higher when handling high-dimensional databases. There is also increasing interest in topics such as multi-modality and cross-modality hashing [45] and semantic quantization.

11.1 Speed up the Learning and Query Processes

Scalable Hash Function Learning. The algorithms depending on the pairwise similarity, such as binary reconstructive embedding, usually sample a small subset of pairs to reduce the cost of learning hash functions. It has been shown that the search accuracy increases with a higher sampling rate, but the training cost is greatly increased. Even the algorithms that do not rely on the pairwise similarity, e.g., quantization, were shown to be slow and even infeasible when handling very large data, e.g., 1B data items, and usually have to learn hash functions over a small subset, e.g., 1M data items. This poses the challenge of learning the hash function over larger datasets.

Hash Code Computation Speedup. Existing hashing algorithms rarely take into consideration the cost of encoding a data item. Such a cost during the query stage becomes significant in the case that only a small number of database items, or a small database, is compared to the query. The search combined with the inverted index and compact codes is such a case. When kernel hash functions are used, encoding the database items into binary codes is also much more expensive than with linear hash functions. The composite quantization-like approaches also take much time to compute the hash codes.

A recent work, circulant binary embedding [165], accelerates the encoding process for linear hash functions, and tree quantization [3] sparsifies the dictionary items into a tree structure to speed up the assignment process. However, more research is needed to speed up the hash code computation for other hashing algorithms, such as composite quantization.
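The two online distance computations mentioned above, fast Hamming distance on binary codes and distance table lookup on quantization codes, can be contrasted with a minimal sketch. It is illustrative only (toy sizes, our own names), assuming squared Euclidean distances and binary codes packed into integers.

```python
def hamming(code1, code2):
    """Hamming distance between binary codes stored as integers:
    one XOR followed by a population count."""
    return bin(code1 ^ code2).count("1")

def build_distance_table(query_subvectors, dictionaries):
    """dtable[p][k] = squared distance from the p-th query subvector to
    the k-th center of the p-th dictionary; computed once per query."""
    return [[sum((a - b) ** 2 for a, b in zip(q, center)) for center in dic]
            for q, dic in zip(query_subvectors, dictionaries)]

def lookup_distance(dtable, code):
    """Approximate distance to one database item: P table lookups and
    additions, with no arithmetic on the original d-dimensional vectors."""
    return sum(dtable[p][k] for p, k in enumerate(code))
```

The per-item cost is one XOR plus a popcount for binary codes, versus P table lookups for quantization codes; the table construction itself is the per-query overhead that Section 11 notes can become non-negligible for high-dimensional data.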
786 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018
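The multi-sequence algorithm [4], used earlier to retrieve candidate inverted lists from the inverted multi-index in order of increasing distance, can be sketched as a best-first traversal with a priority queue. This is a simplified two-dictionary variant using a visited set; the original algorithm uses a tighter admission test but produces the pairs in the same nondecreasing order of distance.

```python
import heapq

def multi_sequence(dist1, dist2):
    """Best-first enumeration of composed quantization centers.

    dist1, dist2: distances from the query's two subvectors to the
    centers of the two dictionaries, each sorted in ascending order.
    Yields ((i, j), d) with d = dist1[i] + dist2[j] in nondecreasing
    order of d, so the first T pairs identify the T closest composed
    centers (inverted lists), at O(log T) heap cost per output pair.
    """
    heap = [(dist1[0] + dist2[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        d, i, j = heapq.heappop(heap)
        yield (i, j), d
        # Only the two monotone successors of (i, j) can be next.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(dist1) and nj < len(dist2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (dist1[ni] + dist2[nj], ni, nj))
```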
Distance Table Computation Speedup. Product quantization and its variants need to precompute the distance table between the query and the elements of the dictionaries. Most existing algorithms claim that the cost of the distance table computation is negligible. In practice, however, the cost becomes larger when the codes computed from quantization are used to rank the candidates retrieved from the inverted index. This is a research direction that will attract interest in the near future, such as a recent study, sparse composite quantization [172].

11.2 Promising Extensions

Semantic Quantization. Existing quantization algorithms focus on the search under the Euclidean distance. Like binary code hashing algorithms, for which many studies on semantic similarity have been conducted, learning quantization-based hash codes with semantic similarity is attracting interest. There are already a few studies. For example, we have proposed a supervised quantization approach [154], and some comparisons are provided in Fig. 4.

Multiple and Cross Modality Hashing. One important characteristic of big data is the variety of data types and data sources. This is particularly true for multimedia data, where various media types (e.g., video, image, audio and hypertext) can be described by many different low- and high-level features, and relevant multimedia objects may come from different data sources contributed by different users and organizations. This raises a research direction: performing joint-modality hashing learning by exploiting the relations among multiple modalities, to support special applications such as cross-modal search. This topic is attracting a lot of research effort nowadays, such as collaborative hashing [89], [167], collaborative quantization [173], and cross-media hashing [134], [135], [178], [161], [83].

12 CONCLUSION

In this paper, we categorize the learning-to-hash algorithms into four main groups: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization, and present a comprehensive survey with a discussion about their relations. We point out the empirical observation that quantization is superior in terms of search accuracy, search efficiency and space cost. In addition, we introduce a few emerging topics and promising extensions.

ACKNOWLEDGMENTS

This work was partially supported by the National Natural Science Foundation of China No. 61632007. Heng Tao Shen is the corresponding author.

REFERENCES

[1] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Proc. 47th Annu. IEEE Symp. Found. Comput. Sci., 2006, pp. 459–468.
[2] A. Babenko and V. Lempitsky, "Additive quantization for extreme vector compression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 931–939.
[3] A. Babenko and V. Lempitsky, "Tree quantization for large-scale similarity search and classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4240–4248.
[4] A. Babenko and V. S. Lempitsky, "The inverted multi-index," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3069–3076.
[5] R. Balu, T. Furon, and H. Jégou, "Beyond 'project and sign' for cosine estimation with binary codes," in Proc. IEEE Int. Conf. Acoustics Speech Signal Process., 2014, pp. 6884–6888.
[6] J. L. Bentley, D. F. Stanat, and E. H. Williams, Jr., "The complexity of finding fixed-radius near neighbors," Inf. Process. Lett., vol. 6, no. 6, pp. 209–212, 1977.
[7] A. Bergamo, L. Torresani, and A. W. Fitzgibbon, "Picodes: Learning a compact code for novel-category recognition," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2011, pp. 2088–2096.
[8] P. Boufounos and R. G. Baraniuk, "1-bit compressive sensing," in Proc. 42nd Annu. Conf. Inf. Sci. Syst., 2008, pp. 16–21.
[9] J. Brandt, "Transform coding for fast approximate nearest neighbor search in high dimensions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1815–1822.
[10] A. Z. Broder, "On the resemblance and containment of documents," in Proc. Compression Complexity Sequences, 1997, pp. 21–29.
[11] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," Comput. Netw., vol. 29, no. 8–13, pp. 1157–1166, 1997.
[12] F. Çakir and S. Sclaroff, "Adaptive hashing for fast similarity search," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1044–1052.
[13] M. A. Carreira-Perpiñán and R. Raziperchikolaei, "Hashing with binary autoencoders," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 557–566.
[14] M. Charikar, "Similarity estimation techniques from rounding algorithms," in Proc. 34th Annu. ACM Symp. Theory Comput., 2002, pp. 380–388.
[15] A. Cherian, "Nearest neighbors using compact sparse codes," in Proc. 31st Int. Conf. Mach. Learning, 2014, pp. 1053–1061.
[16] A. Cherian, V. Morellas, and N. Papanikolopoulos, "Robust sparse hashing," in Proc. 19th IEEE Int. Conf. Image Process., 2012, pp. 2417–2420.
[17] O. Chum and J. Matas, "Large-scale discovery of spatially related images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 371–377, Feb. 2010.
[18] Q. Dai, J. Li, J. Wang, and Y. Jiang, "Binary optimized hashing," in Proc. ACM Multimedia, 2016, pp. 1247–1256.
[19] A. Dasgupta, R. Kumar, and T. Sarlós, "Fast locality-sensitive hashing," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 1073–1081.
[20] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proc. Symp. Comput. Geometry, 2004, pp. 253–262.
[21] T. L. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, "Fast, accurate detection of 100,000 object classes on a single machine," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1814–1821.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[23] K. Ding, C. Huo, B. Fan, and C. Pan, "kNN hashing with factorized neighborhood representation," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1098–1106.
[24] T. Do, A. Doan, and N. Cheung, "Learning to hash with binary deep neural network," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 219–234.
[25] T. Do, A. Doan, D. T. Nguyen, and N. Cheung, "Binary hashing with semidefinite relaxation and augmented Lagrangian," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 802–817.
[26] C. Du and J. Wang, "Inner product similarity search using compositional codes," CoRR, abs/1406.4966, 2014.
[27] L. Fan, "Supervised binary hash code learning with Jensen Shannon divergence," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2616–2623.
[28] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in Proc. Workshop Generative-Model Based Vis., 2004.
[29] J. Gan, J. Feng, Q. Fang, and W. Ng, "Locality-sensitive hashing scheme based on dynamic collision counting," in Proc. SIGMOD Conf., 2012, pp. 541–552.
[30] L. Gao, J. Song, F. Zou, D. Zhang, and J. Shao, "Scalable multimedia retrieval by deep learning hashing with relative similarity learning," in Proc. ACM Multimedia, 2015, pp. 903–906.
[31] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized product quantization for approximate nearest neighbor search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2946–2953.
[32] T. Ge, K. He, and J. Sun, "Graph cuts for supervised binary coding," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 250–264.
[33] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik, "Learning binary codes for high-dimensional data using bilinear projections," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 484–491.
[34] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, "Angular quantization-based binary codes for fast similarity search," in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1205–1213.
[35] Y. Gong and S. Lazebnik, "Iterative quantization: A procrustean approach to learning binary codes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 817–824.
[36] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, Dec. 2013.
[37] A. Gordo, F. Perronnin, Y. Gong, and S. Lazebnik, "Asymmetric distances for binary embeddings," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 33–47, Jan. 2014.
[38] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2325–2383, Oct. 1998.
[39] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer, "Compact hashing with joint optimization of search accuracy and time," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 753–760.
[40] J. He, W. Liu, and S.-F. Chang, "Scalable similarity search with optimized kernel hashing," in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 1129–1138.
[41] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, "Spherical hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2957–2964.
[42] J.-P. Heo, Z. Lin, and S.-E. Yoon, "Distance encoded product quantization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2139–2146.
[43] L.-K. Huang, Q. Yang, and W.-S. Zheng, "Online hashing," in Proc. Int. Conf. Artif. Intell., 2013, pp. 1422–1428.
[44] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proc. 30th Annu. ACM Symp. Theory Comput., 1998, pp. 604–613.
[45] G. Irie, H. Arai, and Y. Taniguchi, "Alternating co-quantization for cross-modal hashing," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1886–1894.
[46] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang, "Locally linear hashing for extracting non-linear manifolds," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2123–2130.
[47] H. Jain, P. Pérez, R. Gribonval, J. Zepeda, and H. Jégou, "Approximate search with quantized sparse representations," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 681–696.
[48] H. Jégou, L. Amsaleg, C. Schmid, and P. Gros, "Query adaptative locality sensitive hashing," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process., 2008, pp. 825–828.
[49] H. Jégou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 304–317.
[50] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[51] H. Jégou, M. Douze, C. Schmid, and P. Pérez, "Aggregating local descriptors into a compact image representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3304–3311.
[52] H. Jégou, T. Furon, and J.-J. Fuchs, "Anti-sparse coding for approximate nearest neighbor search," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process., 2012, pp. 2029–2032.
[53] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg, "Searching in one billion vectors: Re-rank with source coding," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process., 2011, pp. 861–864.
[54] J. Ji, J. Li, S. Yan, Q. Tian, and B. Zhang, "Min-max hash for Jaccard similarity," in Proc. IEEE 13th Int. Conf. Data Mining, 2013, pp. 301–309.
[55] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, "Super-bit locality-sensitive hashing," in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2012, pp. 108–116.
[56] Q. Jiang and W. Li, "Scalable graph hashing with feature transformation," in Proc. 24th Int. Conf. Artif. Intell., 2015, pp. 2248–2254.
[57] Y.-G. Jiang, J. Wang, and S.-F. Chang, "Lost in binarization: Query-adaptive ranking for similar image search with compact codes," in Proc. ACM Int. Conf. Multimedia Retrieval, 2011, Art. no. 16.
[58] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, "Query-adaptive image search with hash codes," IEEE Trans. Multimedia, vol. 15, no. 2, pp. 442–453, Feb. 2013.
[59] Z. Jiang, L. Xie, X. Deng, W. Xu, and J. Wang, "Fast nearest neighbor search in the Hamming space," in Proc. Int. Conf. Multimedia Model., 2016, pp. 325–336.
[60] Z. Jin, et al., "Complementary projection hashing," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 257–264.
[61] A. Joly and O. Buisson, "Random maximum margin hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 873–880.
[62] Y. Kalantidis and Y. Avrithis, "Locally optimized product quantization for approximate nearest neighbor search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2329–2336.
[63] W. Kong and W.-J. Li, "Isotropic hashing," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1655–1663.
[64] N. Koudas, B. C. Ooi, H. T. Shen, and A. K. H. Tung, "LDC: Enabling search by partial distance in a hyper-dimensional space," in Proc. 20th Int. Conf. Data Eng., 2004, pp. 6–17.
[65] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[66] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Proc. Int. Conf. Neural Inf. Process. Syst., 2009, pp. 1042–1050.
[67] H. Lai, Y. Pan, Y. Liu, and S. Yan, "Simultaneous feature learning and hash coding with deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3270–3278.
[68] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Intell. Signal Process., IEEE Press, 2001, pp. 306–351.
[69] C. Leng, J. Wu, J. Cheng, X. Bai, and H. Lu, "Online sketching hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2503–2511.
[70] P. Li, K. W. Church, and T. Hastie, "Conditional random sampling: A sketch-based sampling technique for sparse data," in Proc. Int. Conf. Neural Inf. Process. Syst., 2006, pp. 873–880.
[71] P. Li, T. Hastie, and K. W. Church, "Very sparse random projections," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 287–296.
[72] P. Li and A. C. König, "b-bit minwise hashing," in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 671–680.
[73] P. Li, A. C. König, and W. Gui, "b-bit minwise hashing for estimating three-way similarities," in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 1387–1395.
[74] P. Li, A. B. Owen, and C.-H. Zhang, "One permutation hashing," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 3122–3130.
[75] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, "Spectral hashing with semantically consistent graph for image indexing," IEEE Trans. Multimedia, vol. 15, no. 1, pp. 141–152, Jan. 2013.
[76] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, "Fast supervised hashing with decision trees for high-dimensional data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1971–1978.
[77] G. Lin, C. Shen, D. Suter, and A. van den Hengel, "A general two-step approach to learning-based hashing," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2552–2559.
[78] R.-S. Lin, D. A. Ross, and J. Yagnik, "SPEC hashing: Similarity preserving algorithm for entropy-based coding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 848–854.
[79] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, "Compressed hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 446–451.
[80] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, "Deep hashing for compact binary codes learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2475–2483.
[81] D. Liu, S. Yan, R.-R. Ji, X.-S. Hua, and H.-J. Zhang, "Image retrieval with query-adaptive hashing," ACM Trans. Multimedia Comput. Commun. Appl., vol. 9, no. 1, 2013, Art. no. 2.
[82] H. Liu, R. Wang, S. Shan, and X. Chen, "Deep supervised hashing for fast image retrieval," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2064–2072.
[83] L. Liu, Z. Lin, L. Shao, F. Shen, G. Ding, and J. Han, "Sequential discrete hashing for scalable cross-modality similarity retrieval," IEEE Trans. Image Process., vol. 26, no. 1, pp. 107–118, Jan. 2017.
[84] W. Liu, C. Mu, S. Kumar, and S. Chang, "Discrete graph hashing," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 3419–3427.
[85] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2074–2081.
[86] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, "Hashing with graphs," in Proc. Int. Conf. Mach. Learning, 2011, pp. 1–8.
[87] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang, "Compact hyperplane hashing with bilinear functions," in Proc. Int. Conf. Mach. Learning, 2012, pp. 467–474.
[88] X. Liu, C. Deng, B. Lang, D. Tao, and X. Li, "Query-adaptive reciprocal hash tables for nearest neighbor search," IEEE Trans. Image Process., vol. 25, no. 2, pp. 907–919, Feb. 2016.
[89] X. Liu, J. He, C. Deng, and B. Lang, "Collaborative hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2147–2154.
[90] X. Liu, J. He, and B. Lang, "Reciprocal hash tables for nearest neighbor search," in Proc. 27th AAAI Conf. Artif. Intell., 2013.
[91] X. Liu, L. Huang, C. Deng, J. Lu, and B. Lang, "Multi-view complementary hash tables for nearest neighbor search," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1107–1115.
[92] Y. Liu, J. Shao, J. Xiao, F. Wu, and Y. Zhuang, "Hypergraph spectral hashing for image retrieval with heterogeneous social contexts," Neurocomputing, vol. 119, pp. 49–58, 2013.
[93] Y. Liu, F. Wu, Y. Yang, Y. Zhuang, and A. G. Hauptmann, "Spline regression hashing for fast image search," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4480–4491, Oct. 2012.
[94] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[95] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 950–961.
[96] J. Martinez, J. Clement, H. H. Hoos, and J. J. Little, "Revisiting additive quantization," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 137–153.
[97] Y. Matsui, T. Yamasaki, and K. Aizawa, "PQTable: Fast exact asymmetric distance neighbor search for product quantization using hash tables," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1940–1948.
[98] Y. Matsushita and T. Wada, "Principal component hashing: An accelerated approximate nearest neighbor search," in Proc. Pacific-Rim Symp. Image Video Technol., 2009, pp. 374–385.
[99] Y. Moon, et al., "Capsule: A camera-based positioning system using learning," in Proc. ACM Symp. Cloud Comput., 2015, pp. 235–240.
[100] R. Motwani, A. Naor, and R. Panigrahy, "Lower bounds on locality sensitive hashing," SIAM J. Discrete Math., vol. 21, no. 4, pp. 930–935, 2007.
[101] Y. Mu, X. Chen, X. Liu, T.-S. Chua, and S. Yan, "Multimedia semantics-aware query-adaptive hashing with bits reconfigurability," Int. J. Multimedia Inf. Retrieval, vol. 1, no. 1, pp. 59–70, 2012.
[102] Y. Mu, J. Shen, and S. Yan, "Weakly-supervised hashing in kernel space," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3344–3351.
[103] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in Proc. Int. Conf. Comput. Vis. Theory Appl., 2009, pp. 331–340.
[104] M. Muja and D. G. Lowe, "Fast matching of binary features," in Proc. 9th Conf. Comput. Robot Vis., 2012, pp. 404–410.
[105] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2227–2240, Nov. 2014.
[106] L. Mukherjee, S. N. Ravi, V. K. Ithapu, T. Holmes, and V. Singh, "An NMF perspective on binary hashing," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4184–4192.
[107] M. Norouzi and D. J. Fleet, "Minimal loss hashing for compact binary codes," in Proc. Int. Conf. Mach. Learning, 2011, pp. 353–360.
[108] M. Norouzi and D. J. Fleet, "Cartesian k-means," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3017–3024.
[109] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, "Hamming dis-
[113] R. Panigrahy, "Entropy based nearest neighbor search in high dimensions," in Proc. 17th Annu. ACM-SIAM Symp. Discrete Algorithm, 2006, pp. 1186–1195.
[114] L. Paulevé, H. Jégou, and L. Amsaleg, "Locality sensitive hashing: A comparison of hash function types and querying mechanisms," Pattern Recognit. Lett., vol. 31, no. 11, pp. 1348–1358, 2010.
[115] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proc. Empirical Methods Natural Language Process., 2014, pp. 1532–1543.
[116] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, "Large-scale image retrieval with compressed Fisher vectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3384–3391.
[117] D. Qin, X. Chen, M. Guillaumin, and L. J. V. Gool, "Quantized kernel learning for feature matching," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 172–180.
[118] D. Qin, Y. Chen, M. Guillaumin, and L. J. V. Gool, "Learning to rank histograms for object retrieval," in Proc. British Mach. Vis. Conf., 2014.
[119] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 157–173, 2008.
[120] R. Salakhutdinov and G. E. Hinton, "Semantic hashing," in Proc. SIGIR Workshop Inf. Retrieval Appl. Graphical Models, 2007, pp. 969–978.
[121] R. Salakhutdinov and G. E. Hinton, "Semantic hashing," Int. J. Approx. Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[122] J. Sánchez and F. Perronnin, "High-dimensional signature compression for large-scale image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1665–1672.
[123] H. Sandhawalia and H. Jégou, "Searching with expectations," in Proc. Int. Conf. Acoustics Speech Signal Process., 2010, pp. 1242–1245.
[124] J. Shao, F. Wu, C. Ouyang, and X. Zhang, "Sparse spectral hashing," Pattern Recognit. Lett., vol. 33, no. 3, pp. 271–277, 2012.
[125] F. Shen, C. Shen, W. Liu, and H. T. Shen, "Supervised discrete hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 37–45.
[126] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang, "Inductive hashing on manifolds," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1562–1569.
[127] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao, "A fast optimization method for general binary code learning," IEEE Trans. Image Process., vol. 25, no. 12, pp. 5610–5621, Dec. 2016.
[128] X. Shi, F. Xing, J. Cai, Z. Zhang, Y. Xie, and L. Yang, "Kernel-based supervised discrete hashing for image retrieval," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 419–433.
[129] A. Shrivastava and P. Li, "Fast near neighbor search in high-dimensional binary data," in Proc. Eur. Conf. Mach. Learn. Principles Practice Knowl. Discovery Databases, 2012, pp. 474–489.
[130] A. Shrivastava and P. Li, "Densifying one permutation hashing via rotation for fast near neighbor search," in Proc. 31st Int. Conf. Mach. Learning, 2014, pp. 557–565.
[131] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: Exploring photo collections in 3D," ACM Trans. Graph., vol. 25, no. 3, pp. 835–846, 2006.
[132] D. Song, W. Liu, R. Ji, D. A. Meyer, and J. R. Smith, "Top rank supervised binary coding for visual search," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1922–1930.
[133] J. Song, H. T. Shen, J. Wang, Z. Huang, N. Sebe, and J. Wang, "A distance-computation-free search scheme for binary code databases," IEEE Trans. Multimedia, vol. 18, no. 3, pp. 484–495, Mar. 2016.
[134] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, "Effective multiple feature hashing for large-scale near-duplicate video retrieval," IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1997–2008,
tance metric learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., Dec. 2013.
2012, pp. 1070–1078. [135] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Inter-media
[110] M. Norouzi, A. Punjani, and D. J. Fleet, “Fast search in Hamming hashing for large-scale retrieval from heterogeneous data
space with multi-index hashing,” in Proc. IEEE Conf. Comput. Vis. sources,” in Proc. SIGMOD Conf., 2013, pp. 785–796.
Pattern Recognit., 2012, pp. 3108–3115. [136] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “Ldahash:
[111] R. O’Donnell, Y. Wu, and Y. Zhou, “Optimal lower bounds for Improved matching with smaller descriptors,” IEEE Trans. Pattern
locality sensitive hashing (except when q is tiny),” in Proc. Int. Anal. Mach. Intell., vol. 34, no. 1, pp. 66–78, Jan. 2012.
Conf. Supercomputing, 2011, pp. 275–283. [137] A. B. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny
[112] A. Oliva and A. Torralba, “Modeling the shape of the scene: A images: A large data set for nonparametric object and scene rec-
holistic representation of the spatial envelope,” Int. J. Comput. ognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11,
Vis., vol. 42, no. 3, pp. 145–175, 2001. pp. 1958–1970, Nov. 2008.
WANG ET AL.: A SURVEY ON LEARNING TO HASH 789
Jingdong Wang received the BEng and MEng degrees from the Department of Automation, Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the PhD degree from the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology, Hong Kong, in 2007. He is a lead researcher at the Visual Computing Group, Microsoft Research Asia. His areas of interest include deep learning, large-scale indexing, human understanding, and person re-identification. He has been serving as an associate editor of IEEE TMM, and has served as an area chair of ICCV 2017, CVPR 2017, ECCV 2016, and ACM Multimedia 2015.

Ting Zhang received the bachelor's degree in mathematical science from the School of the Gifted Young in 2012. She is working toward the PhD degree in the Department of Automation, University of Science and Technology of China. Her main research interests include machine learning, computer vision, and pattern recognition. She is currently a research intern at Microsoft Research, Beijing.
790 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018
Jingkuan Song received the BS degree in software engineering from the University of Electronic Science and Technology of China, and the PhD degree in information technology from The University of Queensland, Australia. He is currently a professor at the University of Electronic Science and Technology of China. His research interests include large-scale multimedia search and machine learning.

Nicu Sebe is currently a professor with the University of Trento, Italy, leading the research in the areas of multimedia information retrieval and human behavior understanding. He was the general co-chair of the IEEE FG Conference 2008 and ACM Multimedia 2013, and the program chair of the International Conference on Image and Video Retrieval in 2007 and 2010, and of ACM Multimedia 2007 and 2011. He is the program chair of ECCV 2016 and ICCV 2017. He is a fellow of the International Association for Pattern Recognition.

Heng Tao Shen received the BSc degree with first-class honours and the PhD degree from the Department of Computer Science, National University of Singapore, in 2000 and 2004, respectively. He then joined the University of Queensland as a lecturer, senior lecturer, and reader, and became a professor in late 2011. He is currently a professor of the National "Thousand Talents Plan" and the director of the Future Media Research Center at the University of Electronic Science and Technology of China. His research interests mainly include multimedia search, computer vision, and big data management on spatial, temporal, multimedia, and social media databases. He has published extensively and served on program committees at the most prestigious international venues in these areas. He received the Chris Wallace Award for Outstanding Research Contribution in 2010, conferred by the Computing Research and Education Association of Australasia. He has served as a PC co-chair for ACM Multimedia 2015 and is currently an associate editor of the IEEE Transactions on Knowledge and Data Engineering.