
Hashing for Similarity Search: A Survey


Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji

August 14, 2014


Abstract—Similarity search (nearest neighbor search) is the problem of finding, from a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.

Index Terms—Approximate Nearest Neighbor Search, Similarity Search, Hashing, Locality Sensitive Hashing, Learning to Hash, Quantization.

• J. Wang is with Microsoft Research, Beijing, P.R. China. E-mail: [email protected]
• J. Song and H.T. Shen are with the School of Information Technology and Electrical Engineering, The University of Queensland, Australia. E-mail: {jk.song,shenht}@itee.uq.edu.au
• J. Ji is with the Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China. E-mail: [email protected]

1 INTRODUCTION

The problem of similarity search, also known as nearest neighbor search, proximity search, or close item search, is to find the item that is nearest to a query item, called the nearest neighbor, under some distance measure, from a search (reference) database. In the case that the reference database is very large or that the distance computation between the query item and a database item is costly, it is often computationally infeasible to find the exact nearest neighbor. Thus, a lot of research effort has been devoted to approximate nearest neighbor search, which is shown to be sufficient and useful for many practical problems.

Hashing is one of the popular solutions for approximate nearest neighbor search. In general, hashing is an approach of transforming a data item into a low-dimensional representation, or equivalently a short code consisting of a sequence of bits. The application of hashing to approximate nearest neighbor search includes two ways: indexing data items using hash tables that are formed by storing the items with the same code in a hash bucket, and approximating the distance using the one computed with short codes.

The former way regards the items lying in the buckets corresponding to the codes of the query as the nearest neighbor candidates, which exploits the locality sensitive property that similar items have a larger probability of being mapped to the same code than dissimilar items. The main research efforts along this direction consist of designing hash functions satisfying the locality sensitive property and designing efficient search schemes using and beyond hash tables.

The latter way ranks the items according to the distances computed using the short codes, which exploits the property that the distance computation using the short codes is efficient. The main research effort along this direction is to design effective ways to compute the short codes and to design the distance measure using the short codes, guaranteeing computational efficiency and preserving the similarity.

2 OVERVIEW

2.1 The Nearest Neighbor Search Problem

2.1.1 Exact nearest neighbor search

Nearest neighbor search, also known as similarity search, proximity search, or close item search, is defined as: Given a query item q, the goal is to find an item NN(q), called the nearest neighbor, from a set of items X = {x1, x2, · · · , xN} so that NN(q) = arg min_{x∈X} dist(q, x), where dist(q, x) is a distance computed between q and x. A straightforward generalization is K-NN search, where the K nearest neighbors (KNN(q)) need to be found.

The problem is not fully specified without the distance between an arbitrary pair of items x and q. As a typical example, the search (reference) database X lies in a d-dimensional space R^d and the distance is induced by an ℓ_s norm, ||x − q||_s = (Σ_{i=1}^d |x_i − q_i|^s)^{1/s}. The search problem under the Euclidean distance, i.e., the ℓ2 norm, is widely studied. Other notions of search database, such as each item formed by a set, and other distance measures, such as the ℓ1 distance, cosine similarity, and so on, are also possible.

The fixed-radius near neighbor (R-near neighbor) problem, an alternative of nearest neighbor search, is defined as: Given a query item q, the goal is to find the set of items R that are within distance R of q, R = {x | dist(q, x) ≤ R, x ∈ X}.
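To make the definitions above concrete, the sketch below performs exact K-NN and R-near neighbor search by linear scan over a toy dataset; the array shapes, names, and the use of NumPy are illustrative assumptions, not part of the survey.

```python
import numpy as np

def knn_search(X, q, K=1):
    """Exact K-NN by linear scan: compute all Euclidean distances and keep the K smallest."""
    dists = np.linalg.norm(X - q, axis=1)      # dist(q, x) for every x in X
    order = np.argsort(dists)[:K]              # indices of the K nearest items
    return order, dists[order]

def r_near_search(X, q, R):
    """Fixed-radius (R-near neighbor) search: return all items with dist(q, x) <= R."""
    dists = np.linalg.norm(X - q, axis=1)
    return np.nonzero(dists <= R)[0]

X = np.random.randn(1000, 16)                  # toy reference database
q = np.random.randn(16)                        # query item
print(knn_search(X, q, K=5))
print(r_near_search(X, q, R=4.0))
```

This linear scan is the baseline that the hashing approaches discussed below try to beat in the large-scale, high-dimensional setting.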
2.1.2 Approximate nearest neighbor search

There exist efficient algorithms for the exact nearest neighbor and R-near neighbor search problems in low-dimensional cases. It turns out that the problems become hard in the large-scale, high-dimensional case, and most algorithms even take a higher computational cost than the naive solution, linear scan. Therefore, a lot of recent effort has moved to approximate nearest neighbor search problems. The (1 + ǫ)-approximate nearest neighbor search problem, ǫ > 0, is defined as: Given a query q, the goal is to find an item x so that dist(q, x) ≤ (1 + ǫ) dist(q, x*), where x* is the true nearest neighbor. The c-approximate R-near neighbor search problem is defined as: Given a query q, the goal is to find some item x, called a cR-near neighbor, so that dist(q, x) ≤ cR.

2.1.3 Randomized nearest neighbor search

The randomized search problem aims to report the (approximate) nearest (or near) neighbors with some probability instead of deterministically. There are two widely-studied randomized search problems: randomized c-approximate R-near neighbor search and randomized R-near neighbor search. The former is defined as: Given a query q, the goal is to report some cR-near neighbor of the query q with probability 1 − δ, where 0 < δ < 1. The latter is defined as: Given a query q, the goal is to report some R-near neighbor of the query q with probability 1 − δ.

2.2 The Hashing Approach

The hashing approach aims to map the reference and/or query items to target items so that approximate nearest neighbor search can be efficiently and accurately performed using the target items and possibly a small subset of the raw reference items. The target items are called hash codes (also known as hash values, or simply hashes). In this paper, we may also call them short or compact codes interchangeably.

Formally, the hash function is defined as y = h(x), where y is the hash code and h(·) is the hash function. In the application to approximate nearest neighbor search, usually several hash functions are used together to compute the hash code: y = h(x), where y = [y1 y2 · · · yM]^T and h(x) = [h1(x) h2(x) · · · hM(x)]^T. Here we use a vector y to represent the hash code for presentation convenience.

There are two basic strategies for using hash codes to perform nearest (near) neighbor search: hash table lookup and fast distance approximation.

2.2.1 Hash table lookup.

The hash table is a data structure that is composed of buckets, each of which is indexed by a hash code. Each reference item x is placed into a bucket h(x). Different from the conventional hashing algorithms in computer science that avoid collisions (i.e., avoid mapping two items into the same bucket), the hashing approach using a hash table aims to maximize the probability of collision of near items. Given the query q, the items lying in the bucket h(q) are retrieved as near items of q.

To improve the recall, L hash tables are constructed, and the items lying in the L (or L′, L′ < L) hash buckets h1(q), · · · , hL(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, yi, needs to be a long code, which means that the total number of buckets is too large to index directly. Thus, only the nonempty buckets are retained, by resorting to conventional hashing of the hash codes hl(x).

2.2.2 Fast distance approximation.

The direct way is to perform an exhaustive search: compare the query with each reference item by quickly computing the distance between the query and the hash code of the reference item, and retrieve the reference items with the smallest distances as the candidates of nearest neighbors. This is usually followed by a reranking step: rerank the nearest neighbor candidates retrieved with hash codes according to the true distances computed using the original features, and obtain the K nearest neighbors or R-near neighbors.

This strategy exploits two advantages of hash codes. The first is that the distance using hash codes can be efficiently computed and the cost is much smaller than that of the computation in the input space. The second is that the size of the hash codes is much smaller than that of the input features and hence they can be loaded into memory, resulting in a reduction of the disk I/O cost in the case that the original features are too large to be loaded into memory.

One practical way of speeding up the search is to perform a non-exhaustive search: first retrieve a set of candidates using an inverted index and then compute the distances of the query to the candidates using the short codes. Other research efforts include organizing the hash codes with a data structure, such as a tree or a graph structure, to avoid exhaustive search.
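As an illustration of the fast-distance-approximation strategy, the sketch below packs binary codes into bytes, computes Hamming distances with XOR and popcount, and reranks the top candidates with the true Euclidean distance; the helper names and the NumPy bit-packing scheme are assumptions made for the example, not notation from the survey.

```python
import numpy as np

def pack_codes(bits):
    """Pack an (N, M) array of 0/1 bits into byte arrays for compact storage."""
    return np.packbits(bits, axis=1)

def hamming_distances(packed_db, packed_q):
    """Hamming distance between one packed query code and all packed database codes."""
    xor = np.bitwise_xor(packed_db, packed_q)          # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)      # popcount per item

def search(X, db_bits, q, q_bits, topk=100, K=10):
    """Exhaustive search with hash codes, followed by reranking with the original features."""
    cand = np.argsort(hamming_distances(pack_codes(db_bits), pack_codes(q_bits[None, :])))[:topk]
    true_d = np.linalg.norm(X[cand] - q, axis=1)       # rerank candidates exactly
    return cand[np.argsort(true_d)[:K]]
```

The packed codes play the role of the small in-memory representation described above, while the original features X are only touched for the short candidate list.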
2.3 Organization of This Paper

The organization of the remaining part is given as follows. Section 3 presents the definition of the locality sensitive hashing (LSH) family and the instances of LSH with various distances. Section 4 presents some research works on how to perform efficient search given LSH codes and how to model and analyze LSH. Sections 5, 6, and 7 review the learning-to-hash algorithms. Finally, Section 9 concludes this survey.

3 LOCALITY SENSITIVE HASHING: DEFINITION AND INSTANCES

The term "locality-sensitive hashing" (LSH) was introduced in 1998 [42], to name a randomized hashing framework for efficient approximate nearest neighbor (ANN) search in high-dimensional space. It is based on the definition of an LSH family H, a family of hash functions mapping similar input items to the same hash code with higher probability than dissimilar items. However, the first specific LSH family, min-hash, was invented in 1997 by Andrei Broder [11], for near-duplicate web page detection and clustering, and it is one of the most popular LSH methods, extensively studied in theory and widely used in practice.

Locality-sensitive hashing was first studied by the theoretical computer science community. The theoretical research mainly focuses on three aspects. The first is developing different LSH families for various distances or similarities, for example, p-stable distribution LSH for the ℓp distance [20], sign-random-projection (or sim-hash) for the angle-based distance [13], min-hash for the Jaccard coefficient [11], [12], and so on; many variants are developed based on these basic LSH families [19]. The second is exploring the theoretical boundary of the LSH framework, including the bound on the search efficiency (both time and space) that the best possible LSH family can achieve for certain distances and similarities [20], [94], [105], the tight characteristics for a similarity measure to admit an LSH family [13], [16], and so on. The third focuses on improving the search scheme of the LSH methods, to achieve theoretically provable better search efficiency [107], [19].

Shortly after it was proposed by the theoretical computer science community, the database and related communities began to study LSH, aiming at building real database systems for high-dimensional similarity search. Research from this side mainly focuses on developing better data structures and search schemes that lead to better search quality and efficiency in practice [91], [25]. The quality criteria include precision and recall, and the efficiency criteria are commonly the query time, storage requirement, I/O consumption, and so on. Some of these works also provide theoretical guarantees on the search quality of their algorithms [25].

In recent years, LSH has attracted extensive attention from other communities including computer vision (CV), machine learning, statistics, natural language processing (NLP), and so on. For example, in computer vision, high-dimensional features are often required for various tasks, such as image matching and classification. LSH, as a probabilistic dimension reduction method, has been used in various CV applications which often reduce to approximate nearest neighbor search [17], [18]. However, the performance of LSH is limited due to the fact that it is totally probabilistic and data-independent, and thus it does not take the data distribution into account. On the other hand, as an inspiration of LSH, the concept of "small code" or "compact code" has become the focus of many researchers from the CV community, and many learning-based hashing methods have come into being [135], [125], [126], [30], [83], [139], [127], [29], [82], [130], [31]. These methods aim at learning the hash functions to better fit the data distribution and labeling information, and thus overcome the drawback of LSH. This part of the research often takes LSH as the baseline for comparison.

The machine learning and statistics communities also contribute to the study of LSH. Research from this side often views LSH as a probabilistic similarity-preserving dimensionality reduction method, from which the produced hash codes can provide estimates of some pairwise distance or similarity. This part of the study mainly focuses on developing variants of LSH functions that provide an (unbiased) estimator of a certain distance or similarity with smaller variance [68], [52], [73], [51], with a smaller storage requirement for the hash codes [70], [71], or with faster computation of the hash functions [69], [73], [51], [118]. Besides, the machine learning community is also devoted to developing learning-based hashing methods.

In practice, LSH is widely and successfully used in the IT industry, for near-duplicate web page and image detection, clustering, and so on. Specifically, the AltaVista search engine uses min-hash to detect near-duplicate web pages [11], [12], while Google uses sim-hash to fulfill the same goal [92].

In the subsequent sections, we will first introduce different LSH families for various kinds of distances or similarities, and then review the studies focusing on the search scheme and the work devoted to modeling LSH and the ANN problem.

3.1 The Family

The locality-sensitive hashing (LSH) algorithm is introduced in [42], [27], to solve the (R, c)-near neighbor problem. It is based on the definition of an LSH family H, a family of hash functions mapping similar input items to the same hash code with higher probability than dissimilar items. Formally, an LSH family is defined as follows:

Definition 1 (Locality-sensitive hashing): A family H is called (R, cR, P1, P2)-sensitive if for any two items p and q,
• if dist(p, q) ≤ R, then Prob[h(p) = h(q)] ≥ P1,
• if dist(p, q) ≥ cR, then Prob[h(p) = h(q)] ≤ P2.

Here c > 1, and P1 > P2. The parameter ρ = log(1/P1) / log(1/P2) governs the search performance: the smaller ρ, the better the search performance. Given such an LSH family for a distance measure dist, there exists an algorithm for the (R, c)-near neighbor problem which uses O(dn + n^{1+ρ}) space, with query time dominated by O(n^ρ) distance computations and O(n^ρ log_{1/P2} n) evaluations of hash functions [20].

The LSH scheme indexes all items in hash tables and searches for near items via hash table lookup. The hash table is a data structure that is composed of buckets, each of which is indexed by a hash code. Each reference item x is placed into a bucket h(x). Different from the conventional hashing algorithms in computer science that avoid collisions (i.e., avoid mapping two items into the same bucket), the LSH approach aims to maximize the probability of collision of near items. Given the query q, the items lying in the bucket h(q) are considered as near items of q.

Given an LSH family H, the LSH scheme amplifies the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for a parameter K, K functions h1(x), ..., hK(x), where hk (1 ≤ k ≤ K) are chosen independently and uniformly at random from H, form a compound hash function g(x) = (h1(x), · · · , hK(x)). The output of this compound hash function identifies a bucket id in a hash table. However, the concatenation of K functions also reduces the chance of collision between similar items. To improve the recall, L such compound hash functions g1, g2, ..., gL are sampled independently, each of which corresponds to a hash table. These functions are used to hash each data point into L hash codes, and L hash tables are constructed to index the buckets corresponding to these hash codes, respectively. The items lying in the L hash buckets are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search).

In practice, to guarantee the precision, each of the L hash codes, gl(x), needs to be a long code (or K is large), and thus the total number of buckets is too large to index directly. Therefore, only the nonempty buckets are retained, by resorting to conventional hashing of the hash codes gl(x).

There are different kinds of LSH families for different distances or similarities, including the ℓp distance, arccos or angular distance, Hamming distance, Jaccard coefficient, and so on.
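The (K, L) amplification scheme just described can be sketched as follows. The code builds L tables keyed by K-wise compound codes for an arbitrary user-supplied LSH family; the class and parameter names are illustrative assumptions rather than notation from the survey.

```python
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by a compound code g(x) = (h_1(x), ..., h_K(x))."""

    def __init__(self, sample_hash, K, L):
        # sample_hash() must return one hash function drawn from the LSH family H
        self.tables = [defaultdict(list) for _ in range(L)]
        self.funcs = [[sample_hash() for _ in range(K)] for _ in range(L)]

    def _code(self, l, x):
        return tuple(h(x) for h in self.funcs[l])        # bucket id in table l

    def insert(self, item_id, x):
        for l, table in enumerate(self.tables):
            table[self._code(l, x)].append(item_id)      # only nonempty buckets are stored

    def query(self, q):
        candidates = set()
        for l, table in enumerate(self.tables):
            candidates.update(table.get(self._code(l, q), []))
        return candidates                                 # near-item candidates of q
```

Keying a Python dict by the code tuple plays the role of the "conventional hashing of gl(x)" mentioned above: only nonempty buckets consume memory.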
3.2 ℓp Distance

3.2.1 LSH with p-stable distributions

The LSH scheme based on p-stable distributions, presented in [20], is designed to solve the search problem under the ℓp distance ||xi − xj||_p, where p ∈ (0, 2]. The p-stable distribution is defined as follows: a distribution D is called p-stable, where p ≥ 0, if for any n real numbers v1, · · · , vn and i.i.d. variables X1, · · · , Xn with distribution D, the random variable Σ_{i=1}^n vi Xi has the same distribution as the variable (Σ_{i=1}^n |vi|^p)^{1/p} X, where X is a random variable with distribution D. The well-known Gaussian distribution DG, defined by the density function g(x) = (1/√(2π)) e^{−x²/2}, is 2-stable.

In the case that p = 1, the exponent ρ is equal to 1/c + O(R/r), and it is later shown in [94] that it is impossible to achieve ρ ≤ 1/(2c). A recent study in [105] provides more lower bound analysis for the Hamming distance, Euclidean distance, and Jaccard distance.

The LSH scheme using the p-stable distribution to generate hash codes is described as follows. The hash function is formulated as h_{w,b}(x) = ⌊(w^T x + b) / r⌋. Here, w is a d-dimensional vector with entries chosen independently from a p-stable distribution, b is a real number chosen uniformly from the range [0, r], and r is the window size, thus a positive real number.

The following equation can be proved:

  P(h_{w,b}(x1) = h_{w,b}(x2)) = ∫_0^r (1/c) f_p(t/c) (1 − t/r) dt,    (1)

where c = ||x1 − x2||_p and f_p denotes the probability density function of the absolute value of the p-stable distribution, which means that such a hash function belongs to the LSH family under the ℓp distance.

Specifically, to solve the search problem under the Euclidean distance, the 2-stable distribution, i.e., the Gaussian distribution, is chosen to generate the random projection w. In this case (p = 2), the exponent ρ drops strictly below 1/c for some (carefully chosen) finite value of r.

It is claimed that uniform quantization [72] without the offset b, h_w(x) = ⌊w^T x / r⌋, is more accurate and uses fewer bits than the scheme with the offset.
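A minimal sketch of the p-stable hash function for the Euclidean case (p = 2), with Gaussian projections, is given below; the window size r = 4.0, the seed, and the helper names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pstable_hash(d, r=4.0):
    """One E2LSH-style function h_{w,b}(x) = floor((w^T x + b) / r), w ~ N(0, I), b ~ Unif[0, r]."""
    w = rng.normal(size=d)        # 2-stable (Gaussian) projection
    b = rng.uniform(0.0, r)       # random offset
    return lambda x: int(np.floor((w @ x + b) / r))

d = 32
h = make_pstable_hash(d)
x1 = rng.normal(size=d)
x2 = x1 + 0.05 * rng.normal(size=d)     # a nearby point
print(h(x1), h(x2))                     # nearby points are likely to collide; far points rarely do
```

Such functions would be fed to the compound-code construction of Section 3.1 to build the hash tables.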
3.2.2 Leech lattice LSH

Leech lattice LSH [1] is an LSH algorithm for search in the Euclidean space. It is a multi-dimensional version of the aforementioned approach. The approach first randomly projects the data points into R^t, where t is a small super-constant (t = 1 in the aforementioned approach). The space R^t is partitioned into cells using the Leech lattice, which is a constellation in 24 dimensions. The nearest point in the Leech lattice can be found using a (bounded) decoder which performs only 519 floating point operations per decoded point. On the other hand, the exponent ρ(c) is quite attractive: ρ(2) is less than 0.37. The E8 lattice is also used because its decoding is much cheaper than that of the Leech lattice (though its quantization performance is slightly worse). A comparison of LSH methods for the Euclidean distance is given in [108].

3.2.3 Spherical LSH

Spherical LSH [123] is an LSH algorithm designed for points that lie on a unit hypersphere in the Euclidean space. The idea is to consider a regular polytope (simplex, orthoplex, or hypercube, for example) that is inscribed into the hypersphere and rotated at random. The hash function maps a vector on the hypersphere to the closest polytope vertex lying on the hypersphere. This means that the buckets of the hash function are the Voronoi cells of the polytope vertices. Though there is no theoretic analysis of the exponent ρ, Monte Carlo simulation shows that it is an improvement over the Leech lattice approach [1].

3.2.4 Beyond LSH

Beyond LSH [3] improves ANN search in the Euclidean space, specifically solving (c, 1)-ANN. It consists of two-level hashing structures: an outer hash table and inner hash tables. The outer hash scheme aims to partition the data into buckets, with a filtering process such that all pairs of points in a bucket are no more than a threshold apart, and to find a (1 + 1/c)-approximation to the minimum enclosing ball for the remaining points. The inner hash tables are constructed by first computing the center of the ball corresponding to a non-empty bucket in the outer hash table and partitioning the points belonging to the ball into a set of overlapped subsets, for each of which the differences of the distances of the points to the center are within [−1, 1] and the distance of the overlapped area to the center is within [0, 1]. For each subset, an LSH scheme is conducted. The query process first locates a bucket from the outer hash table for a query. If the bucket is empty, the algorithm stops. If the distance of the query to the bucket center is not larger than c, then the points in the bucket are output as the results. Otherwise, the process further checks the subsets in the bucket whose distances to the query lie in a specific range and then performs the LSH query in those subsets.

3.3 Angle-Based Distance

3.3.1 Random projection

The LSH algorithm based on random projection [2], [13] is developed to solve the near neighbor search problem under the angle between vectors, θ(xi, xj) = arccos(xi^T xj / (||xi||_2 ||xj||_2)). The hash function is formulated as h(x) = sign(w^T x), where w follows the standard Gaussian distribution. It is easily shown that P(h(xi) = h(xj)) = 1 − θ(xi, xj)/π, where θ(xi, xj) is the angle between xi and xj; thus such a hash function belongs to the LSH family for the angle-based distance.
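The sign-random-projection (sim-hash) function and its collision probability can be checked empirically with the short sketch below; the number of bits, the dimension, and the seed are arbitrary assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

def srp_hash(W, x):
    """M-bit sim-hash code: one bit per Gaussian projection, h(x) = sign(w^T x)."""
    return (W @ x > 0).astype(np.uint8)

d, M = 64, 256
W = rng.normal(size=(M, d))                      # M independent Gaussian directions
x1, x2 = rng.normal(size=d), rng.normal(size=d)

theta = np.arccos(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
collisions = np.mean(srp_hash(W, x1) == srp_hash(W, x2))
print(f"empirical P[h(x1)=h(x2)] = {collisions:.3f}, predicted 1 - theta/pi = {1 - theta/np.pi:.3f}")
```

With a few hundred bits the empirical collision rate is usually close to the predicted 1 − θ/π.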
3.3.2 Super-bit LSH

Super-bit LSH [52] aims to improve the above hashing functions for the arccos (angular) similarity, by dividing the random projections into G groups and then orthogonalizing the B random projections in each group, obtaining new GB random projections and thus G B-super-bits. It is shown that the Hamming distance over the super-bits is an unbiased estimate of the angular distance, and that its variance is smaller than that of the above random projection algorithm.

3.3.3 Kernel LSH

Kernel LSH [64], [65] aims to build LSH functions with the angle defined in the kernel space, θ(xi, xj) = arccos(φ(xi)^T φ(xj) / (||φ(xi)||_2 ||φ(xj)||_2)). The key challenge is in constructing a projection vector w from the Gaussian distribution. Define z_t = (1/t) Σ_{i∈S} φ(xi), where t is a natural number and S is a set of t database items chosen i.i.d. The central limit theorem shows that for sufficiently large t, the random variable z̃_t = √t Σ^{−1/2} (z_t − µ) follows the normal distribution N(0, I). Then the hash function is given as

  h(φ(x)) = 1 if φ(x)^T Σ^{−1/2} z̃_t ≥ 0, and 0 otherwise.    (2)

The covariance matrix Σ and the mean µ are estimated over a set of p randomly chosen database items, using a technique similar to that used in kernel principal component analysis.

Multi-kernel LSH [133], [132] uses multiple kernels instead of a single kernel to form the hash functions, assigning the same number of bits to each kernel hash function. A boosted version of multi-kernel LSH is presented in [137], which adopts a boosting scheme to automatically assign various numbers of bits to each kernel hash function.

3.3.4 LSH with learnt metric

Semi-supervised LSH [45], [46], [66] first learns a Mahalanobis metric from the semi-supervised information and then forms the hash function according to the pairwise similarity θ(xi, xj) = arccos(xi^T A xj / (||Gxi||_2 ||Gxj||_2)), where G^T G = A and A is the metric learnt from the semi-supervised information. An extension, distribution-aware LSH [146], is proposed, which, however, partitions the data along each projection direction into multiple parts instead of only two parts.

3.3.5 Concomitant LSH

Concomitant LSH [23] is an LSH algorithm that uses concomitant rank order statistics to form the hash functions for cosine similarity. There are two schemes: concomitant min hash and concomitant min L-multi-hash.

Concomitant min hash is formulated as follows: generate 2^K random projections {w1, w2, · · · , w_{2^K}}, each of which is drawn independently from the standard normal distribution N(0, I). The hash code is computed in two steps: compute the 2^K projections along the 2^K projection directions, and output the index of the projection direction along which the projection value is the smallest, formally written as h_c(x) = arg min_{k=1,...,2^K} w_k^T x. It is shown that the probability Prob[h_c(x1) = h_c(x2)] is a monotonically increasing function of x1^T x2 / (||x1||_2 ||x2||_2).

Concomitant min L-multi-hash instead generates L hash codes: the indices of the projection directions along which the projection values are the top L smallest. It can be shown that the collision probability is similar to that of concomitant min hash.

Generating a hash code of length K = 20 requires 1,048,576 random projections and vector multiplications, which is too costly. To solve this problem, a cascading scheme is adopted: e.g., generate two concomitant hash functions, each of which generates a code of length 10, and compose them together, yielding a code of 20 bits, which only requires 2 × 2^10 random projections and vector multiplications. There are two schemes proposed in [23]: cascade concomitant min & max hash, which composes the two codes [arg min_{k=1,...,2^K} w_k^T x, arg max_{k=1,...,2^K} w_k^T x], and cascade concomitant L2 min & max multi-hash, which is formed using the indices of the top smallest and largest projection values.

3.3.6 Hyperplane hashing

The goal of searching the nearest neighbors to a query hyperplane is to retrieve the points from the database X that are closest to a query hyperplane whose normal is given by n ∈ R^d. The Euclidean distance of a point x to a hyperplane with normal n is:

  d(P_n, x) = ||n^T x||.    (3)

The hyperplane hashing family [47], [124], under the assumption that the hyperplane passes through the origin and that the data points and the normal have unit norm (which indicates that hyperplane hashing corresponds to search with absolute cosine similarity), is defined as follows:

  h(z) = h_{u,v}(z, z) if z is a database vector, and h_{u,v}(z, −z) if z is a query hyperplane normal.    (4)

Here h_{u,v}(a, b) = [h_u(a) h_v(b)] = [sign(u^T a) sign(v^T b)], where u and v are sampled independently from a standard Gaussian distribution.

It is shown that the above hashing family belongs to LSH: it is (r, r(1 + ǫ), 1/4 − r/π², 1/4 − r(1 + ǫ)/π²)-sensitive for the angle distance dθ(x, n) = (θ_{x,n} − π/2)², where r, ǫ > 0. The angle distance is equivalent to the distance of a point to the query hyperplane.

The family below, called XOR 1-bit hyperplane hashing,

  h(z) = h_u(z) ⊕ h_v(z) if z is a database vector, and h_u(z) ⊕ h_v(−z) if z is a hyperplane normal,    (5)

is shown to be (r, r(1 + ǫ), 1/2 − r/π², 1/2 − r(1 + ǫ)/π²)-sensitive for the angle distance dθ(x, n) = (θ_{x,n} − π/2)², where r, ǫ > 0.

Embedded hyperplane hashing transforms the database vector (or the normal of the query hyperplane) into a high-dimensional vector,

  ā = vec(aa^T) = [a1², a1 a2, · · · , a1 ad, a2 a1, a2², a2 a3, · · · , ad²].    (6)

Assuming a and b to be unit vectors, the Euclidean distance between the embeddings ā and −b̄ is given by ||ā − (−b̄)||²_2 = 2 + 2(a^T b)², which means that minimizing the distance between the two embeddings is equivalent to minimizing |a^T b|.

The embedded hyperplane hash function family is defined as

  h(z) = h_u(z̄) if z is a database vector, and h_u(−z̄) if z is a query hyperplane normal.    (7)

It is shown to be (r, r(1 + ǫ), (1/π) cos⁻¹ sin²(√r), (1/π) cos⁻¹ sin²(√(r(1 + ǫ))))-sensitive for the angle distance dθ(x, n) = (θ_{x,n} − π/2)², where r, ǫ > 0.

It is also shown that the exponent for embedded hyperplane hashing is similar to that for XOR 1-bit hyperplane hashing and stronger than that for hyperplane hashing.

3.4 Hamming Distance

One LSH function for the Hamming distance over binary vectors y ∈ {0, 1}^d is proposed in [42]: h(y) = y_k, where k ∈ {1, 2, · · · , d} is a randomly sampled index. It can be shown that P(h(yi) = h(yj)) = 1 − ||yi − yj||_H / d. It is proven that the exponent ρ is 1/c.

3.5 Jaccard Coefficient

3.5.1 Min-hash

The Jaccard coefficient, a similarity measure between two sets A, B ⊆ U, is defined as sim(A, B) = |A ∩ B| / |A ∪ B|. Its corresponding distance is taken as 1 − sim(A, B). Min-hash [11], [12] is an LSH function for the Jaccard similarity. Min-hash is defined as follows: pick a random permutation π of the ground universe U, and define h(A) = min_{a∈A} π(a). It is easily shown that P(h(A) = h(B)) = sim(A, B). Given K hash values of the two sets, the Jaccard similarity is estimated as (1/K) Σ_{k=1}^K δ[h_k(A) = h_k(B)], where each h_k corresponds to an independently generated random permutation.
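The min-hash estimator just described can be written in a few lines; the sketch below uses random permutations of a small integer universe, and the universe size, the number of permutations, and the example sets are arbitrary assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def minhash_signature(A, perms):
    """One min-hash value per permutation: h(A) = min over a in A of pi(a)."""
    A = np.fromiter(A, dtype=int)
    return np.array([perm[A].min() for perm in perms])

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing min-hash values estimates |A ∩ B| / |A ∪ B|."""
    return float(np.mean(sig_a == sig_b))

U, K = 1000, 200                                   # universe size, number of permutations
perms = [rng.permutation(U) for _ in range(K)]
A, B = set(range(0, 600)), set(range(300, 900))    # true Jaccard = 300 / 900 = 1/3
print(estimate_jaccard(minhash_signature(A, perms), minhash_signature(B, perms)))
```

The estimate converges to the true Jaccard coefficient as K grows, which is the property the variance-reduction variants below try to obtain with fewer permutations.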
3.5.2 K-min sketch

K-min sketch [11], [12] is a generalization of the min-wise sketch (forming the hash values using the K smallest nonzeros from one permutation) used for min-hash. It also provides an unbiased estimator of the Jaccard coefficient but with a smaller variance, which however cannot be used for approximate nearest neighbor search with hash tables, unlike min-hash. Conditional random sampling [68], [67] also takes the k smallest nonzeros from one permutation, and is shown to be a more accurate similarity estimator. One-permutation hashing [73] also uses one permutation, but breaks the space into K bins, stores the smallest nonzero position in each bin, and concatenates them together to generate a sketch. However, it is not directly applicable to nearest neighbor search by building hash tables, due to empty bins. This issue is solved by performing rotation over one-permutation hashing [118]. Specifically, if one bin is empty, the hashed value from the first non-empty bin on the right (circularly) is borrowed as the key of this bin, which supplies an unbiased estimate of the resemblance, unlike [73].

3.5.3 Min-max hash

Min-max hash [51], instead of keeping only the smallest hash value of each random permutation, keeps both the smallest and the largest values of each random permutation. Min-max hash can generate K hash values using K/2 random permutations, while still providing an unbiased estimator of the Jaccard coefficient, with a slightly smaller variance than min-hash.

3.5.4 B-bit minwise hashing

B-bit minwise hashing [71], [70] only uses the lowest b bits of the min-hash value as a short hash value, which gains substantial advantages in terms of storage space while still leading to an unbiased estimator of the resemblance (the Jaccard coefficient).

3.5.5 Sim-min-hash

Sim-min-hash [149] extends min-hash to compare sets of real-valued vectors. This approach first quantizes the real-valued vectors and assigns an index (word) to each real-valued vector. Then, like the conventional min-hash, several random permutations are used to generate the hash keys. The difference is that the similarity is estimated as (1/K) Σ_{k=1}^K sim(x_k^A, x_k^B), where x_k^A (x_k^B) is the real-valued vector (or Hamming embedding) that is assigned to the word h_k(A) (h_k(B)), and sim(·, ·) is the similarity measure.

3.6 χ² Distance

χ²-LSH [33] is a locality sensitive hashing function for the χ² distance. The χ² distance over two vectors xi and xj is defined as

  χ²(xi, xj) = sqrt( Σ_{t=1}^d (x_{it} − x_{jt})² / (x_{it} + x_{jt}) ).    (8)

The χ² distance can also be defined without the square root, and the developments below still hold by substituting r with r² in all the equations.

The χ²-LSH function is defined as

  h_{w,b}(x) = ⌊g_r(w^T x) + b⌋,    (9)

where g_r(x) = (1/2)(sqrt(8x/r² + 1) − 1), each entry of w is drawn from a 2-stable distribution, and b is drawn from a uniform distribution over [0, 1].

It can be shown that

  P(h_{w,b}(xi) = h_{w,b}(xj)) = ∫_0^{(n+1)r²} (1/c) f(t/c) (1 − t/((n+1)r²)) dt,    (10)

where f(t) denotes the probability density function of the absolute value of the 2-stable distribution and c = ||xi − xj||_2.

Let c′ = χ²(xi, xj). It can be shown that P(h_{w,b}(xi) = h_{w,b}(xj)) decreases monotonically with respect to c and c′. Thus, it can be shown that the function belongs to the LSH family.

3.7 Other Similarities

3.7.1 Rank similarity

Winner Take All (WTA) hash [140] is a sparse embedding method that transforms the input feature space into binary codes such that the Hamming distance in the resulting space closely correlates with a rank similarity measure. The rank similarity measure is shown to be more useful for high-dimensional features than the Euclidean distance, in particular in the case of normalized feature vectors (e.g., when the ℓ2 norm is equal to 1). The similarity measure used is a pairwise-order function, defined as

  sim_po(x1, x2) = Σ_{i=0}^{d−1} Σ_{j=1}^{i} δ[(x_{1i} − x_{1j})(x_{2i} − x_{2j}) > 0]    (11)
                 = Σ_{i=1}^{d} R_i(x1, x2),    (12)

where R_i(x1, x2) = |L(x1, i) ∩ L(x2, i)| and L(x1, i) = {j | x_{1i} > x_{1j}}.

WTA hash generates a set of K random permutations {π_k}. Each permutation π_k is used to reorder the elements of x, yielding a new vector x̄. The kth hash code is computed as arg max_{i=1,...,T} x̄_i, taking a value between 0 and T − 1. The final hash code is a concatenation of the K values, each corresponding to a permutation. It is shown that WTA hash codes satisfy the LSH property and that min-hash is a special case of WTA hash.
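A small sketch of the WTA code computation follows; the values of K, T, and the dimension are arbitrary example choices, and the permutation handling mirrors the description above rather than the reference implementation of [140].

```python
import numpy as np

rng = np.random.default_rng(3)

def wta_hash(x, perms, T):
    """For each permutation, keep the first T permuted entries and record the argmax index (0..T-1)."""
    return np.array([np.argmax(x[perm][:T]) for perm in perms])

d, K, T = 128, 16, 4
perms = [rng.permutation(d) for _ in range(K)]     # K random permutations
x1 = rng.normal(size=d)
x2 = x1 + 0.1 * rng.normal(size=d)                 # a rank-similar vector
codes1, codes2 = wta_hash(x1, perms, T), wta_hash(x2, perms, T)
print("matching codes:", int(np.sum(codes1 == codes2)), "of", K)
```

Vectors with similar orderings of their coordinates tend to agree on many of the K codes, which is the rank-correlation property the method relies on.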
3.7.2 Shift invariant kernels

Locality sensitive binary coding using shift-invariant kernel hashing [109] exploits the property that the binary mapping of the original data is guaranteed to preserve the value of a shift-invariant kernel, through random Fourier features (RFF) [110]. The RFF is defined as

  φ_{w,b}(x) = √2 cos(w^T x + b),    (13)

where w ∼ P_K and b ∼ Unif[0, 2π]. For example, for the Gaussian kernel K(s) = e^{−γ||s||²/2}, w ∼ Normal(0, γI). It can be shown that E_{w,b}[φ_{w,b}(x) φ_{w,b}(y)] = K(x, y).

The binary code is computed as

  sign(φ_{w,b}(x) + t),    (14)

where t is a random threshold, t ∼ Unif[−1, 1]. It is shown that the normalized Hamming distance (i.e., the Hamming distance divided by the number of bits in the code string) is both lower bounded and upper bounded, and that the codes preserve the similarity in a probabilistic way.
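The shift-invariant-kernel binary coding above can be sketched as follows for the Gaussian kernel; the bandwidth γ, the number of bits, and the helper names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def rff_binary_codes(X, M=64, gamma=0.5):
    """M-bit codes: sign(sqrt(2) * cos(w^T x + b) + t), w ~ N(0, gamma*I), b ~ U[0, 2pi], t ~ U[-1, 1]."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(gamma), size=(M, d))
    b = rng.uniform(0.0, 2 * np.pi, size=M)
    t = rng.uniform(-1.0, 1.0, size=M)
    phi = np.sqrt(2.0) * np.cos(X @ W.T + b)      # random Fourier features
    return (phi + t > 0).astype(np.uint8)         # binarize with the random thresholds

X = rng.normal(size=(5, 32))
print(rff_binary_codes(X).shape)                  # (5, 64)
```

The resulting codes can be compared with the normalized Hamming distance discussed above.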
3.7.3 Non-metric distance

Non-metric LSH [98] extends LSH to non-metric data by embedding the data in the original space into an implicit reproducing kernel Kreĭn space where the hash function is defined. The Kreĭn space with the indefinite inner product ⟨·, ·⟩_K admits an orthogonal decomposition as a direct sum K = K+ ⊕ K−, where (K+, κ+(·, ·)) and (K−, κ−(·, ·)) are separable Hilbert spaces with their corresponding positive definite inner products. The inner product in K is then computed as

  ⟨ξ+ + ξ−, ξ′+ + ξ′−⟩_K = κ+(ξ+, ξ′+) − κ−(ξ−, ξ′−).    (15)

Given the orthogonality of K+ and K−, the pairwise ℓ2 distance in K is computed as

  ||ξ − ξ′||²_K = ||ξ+ − ξ′+||²_{K+} − ||ξ− − ξ′−||²_{K−}.    (16)

The projections associated with the definite inner products κ+ and κ− can be computed using the technique of kernel LSH; they are denoted by p+ and p−, respectively. The hash function, with the input being (p+(ξ) − p−(ξ), p+(ξ) + p−(ξ)) = (a1(ξ), a2(ξ)) and the output being two binary bits, is defined as

  h(ξ) = [δ[a1(ξ) > θ], δ[a2(ξ) > θ]],    (17)

where a1(ξ) and a2(ξ) are assumed to be normalized to [0, 1] and θ is a real number uniformly drawn from [0, 1]. It can be shown that P(h(ξ) = h(ξ′)) = (1 − |a1(ξ) − a1(ξ′)|)(1 − |a2(ξ) − a2(ξ′)|), which indicates that the hash function belongs to the LSH family.

3.7.4 Arbitrary distance measures

The basic idea of distance-based hashing [4] is to use a line projection function

  f(x; a1, a2) = (dist²(x, a1) + dist²(a1, a2) − dist²(x, a2)) / (2 dist(a1, a2)),    (18)

to formulate a hash function,

  h(x; a1, a2) = 1 if f(x; a1, a2) ∈ [t1, t2], and 0 otherwise.    (19)

Here, a1 and a2 are randomly selected data items, dist(·, ·) is the distance measure, and t1 and t2 are two thresholds, selected so that half of the data items are hashed to 1 and the other half to 0.

Similar to LSH, distance-based hashing (DBH) generates a compound hash function using K distance-based hash functions and accordingly L compound hash functions, yielding L hash tables. However, it cannot be shown that the theoretic guarantee of LSH holds for DBH. There are some other schemes discussed in [4], including optimizing L and K from the dataset, applying DBH hierarchically so that different sets of queries use different parameters L and K, and so on.

4 LOCALITY SENSITIVE HASHING: SEARCH, MODELING, AND ANALYSIS

4.1 Search

4.1.1 Entropy-based search

The entropy-based search algorithm [107], given a query point q, picks a set of O(N^ρ) random points v from B(q, R), a ball centered at q with radius R, and searches in the buckets H(v) to find cR-near neighbors. Here N is the number of database items, ρ = E / log(1/g), E is the entropy I(h(p) | q, R) where p is a random point in B(q, R), and g denotes the upper bound on the probability that two points that are at least distance cR apart will be hashed to the same bucket. In addition, the search algorithm suggests building a single hash table with K = log_{1/g} N hash bits. The paper [107] presents theoretical evidence guaranteeing the search quality.

4.1.2 LSH forest

LSH forest [9] represents each hash table, built from LSH, using a tree, by pruning subtrees (nodes) that do not contain any database points and also restricting the depth of each leaf node to be no larger than a threshold. Different from the conventional scheme that finds the candidates from the hash buckets corresponding to the hash codes of the query point, the search algorithm finds the points contained in subtrees of the LSH forest having the largest prefix match, by a two-phase approach: the first, top-down phase descends each LSH tree to find the leaf having the largest prefix match with the hash code of the query; the second, bottom-up phase back-tracks each tree from the leaf nodes discovered in the first phase, in the largest-prefix-match-first manner, to find subtrees having the largest prefix match with the hash code of the query.

4.1.3 Adaptative LSH

The basic idea of adaptative LSH [48] is to select the most relevant hash codes based on a relevance value. The relevance value is computed by accumulating the differences between the projection value and the mean of the corresponding line segment along the projection direction (or, equivalently, the differences between the projection values along the projection directions and the center of the corresponding bucket).

4.1.4 Multi-probe LSH

The basic idea of multi-probe LSH [91] is to intelligently probe multiple buckets that are likely to contain query results in a hash table, whose hash values may not necessarily be the same as the hash value of the query vector. Given a query q, with its hash code denoted by g(q) = (h1(q), h2(q), · · · , hK(q)), multi-probe LSH finds a sequence of hash perturbation vectors, {δ_i = (δ_{i1}, δ_{i2}, · · · , δ_{iK})}, and sequentially probes the hash buckets {g(q) + δ_i}. A score, computed as Σ_{j=1}^K x_j(δ_{ij})², where x_j(δ_{ij}) is the distance of q from the boundary of the slot h_j(q) + δ_{ij}, is used to sort the perturbation vectors, so that the buckets are accessed in order of increasing score. The paper [91] also proposes to use the expectation E[x_j²(δ_{ij})], estimated under the assumption that x_j(δ_{ij}) is uniformly distributed in [0, r] (r is the width of the hash function used for Euclidean LSH), to replace x_j²(δ_{ij}) for sorting the perturbation vectors. Compared with conventional LSH, to achieve the same search quality, multi-probe LSH has a similar time efficiency while reducing the number of hash tables by an order of magnitude.

The posteriori multi-probe LSH algorithm presented in [56] gives a probabilistic interpretation of multi-probe LSH and presents a probabilistic score to sort the perturbation vectors. The basic ideas of the probabilistic score computation include the property (likelihood) that the difference of the projections of two vectors along a random projection direction drawn from a Gaussian distribution follows a Gaussian distribution, as well as estimating the distribution (prior) of the neighboring points of a point from the training query points and their neighboring points, under the assumption that the neighbor points of a query point follow a Gaussian distribution.

4.1.5 Dynamic collision counting for search

The collision counting LSH scheme introduced in [25] uses a base of m single hash functions to construct dynamic compound hash functions, instead of L static compound hash functions each of which is composed of K hash functions. This scheme regards a data vector that collides with the query vector over at least K hash functions out of the base of m single hash functions as a good cR-NN candidate. The theoretical analysis shows that such a scheme, by appropriately choosing m and K, can have a guarantee on search quality. In case no data is returned for a query (i.e., no data vector has at least K collisions with the query), a virtual reranking scheme is presented, with the essential idea of gradually expanding the window width in the hash function for E2LSH, to increase the collision chance, until enough data vectors having at least K collisions with the query are found.

4.1.6 Bayesian LSH

The goal of Bayesian LSH [113] is to estimate the probability distribution, p(s | M(m, k)), of the true similarity s in the case that m matches are observed out of k hash bits for a pair of hash codes (g(q), g(p)) of the query vector q and a NN candidate p, which is denoted by M(m, k), and to prune the candidate p if the probability of the event s ≥ t, with t being a threshold, is less than ǫ. In addition, if the concentration probability P(|s − s*| ≤ δ | M(m, k)) ≥ λ, or intuitively the true similarity s under the distribution p(s | M(m, k)) is almost located near the mode, s* = arg max_s p(s | M(m, k)), the similarity evaluation is stopped early and such a pair is regarded as similar enough, which is an alternative to computing the exact similarity of such a pair in the original space. The paper [113] gives two examples of Bayesian LSH, for the Jaccard similarity and the arccos similarity, for which p(s | M(m, k)) is instantiated.

4.1.7 Fast LSH

Fast LSH [19] presents two algorithms, ACHash and DHHash, that formulate L K-bit compound hash functions. ACHash pre-conditions the input vector using a random diagonal matrix and a Hadamard transform, and then applies a sparse Gaussian matrix followed by a rounding. DHHash does the same pre-conditioning and then applies a random permutation, followed by a random diagonal Gaussian matrix and another Hadamard transform. It is shown that both ACHash and DHHash take only O(d log d + KL) operations to compute the hash codes, instead of O(dKL). The algorithms are also extended to the angle-based similarity, where the query time to ǫ-approximate the angle between two vectors is reduced from O(d/ǫ²) to O(d log(1/ǫ) + 1/ǫ²).

4.1.8 Bi-level LSH

The first level of bi-level LSH [106] uses a random-projection tree to divide the dataset into subgroups with bounded aspect ratios. The second level is an LSH table, which is basically implemented by randomly projecting data points into a low-dimensional space and then partitioning the low-dimensional space into cells. The table is enhanced using a hierarchical structure. The hierarchy, implemented using the space-filling Morton curve (a.k.a. the Lebesgue or Z-order curve), is useful when not enough candidates are retrieved for the multi-probe LSH algorithm. In addition, the E8 lattice is used for partitioning the low-dimensional space to overcome the curse of dimensionality caused by the basic Z^M lattice.

4.2 SortingKeys-LSH

SortingKeys-LSH [88] aims at improving the search scheme of LSH by reducing random I/O operations when retrieving candidate data points. The paper defines a distance measure between compound hash keys to estimate the true distance between data points, and introduces a linear order on the set of compound hash keys. The method sorts all the compound hash keys in ascending order and stores the corresponding data points on disk according to this order, so that close data points are likely to be stored locally. During ANN search, only a limited number of pages on the disk, which are "close" to the query in terms of the distance defined between compound hash keys, need to be accessed for sufficient candidate generation, leading to a much shorter response time due to the reduction of random I/O operations, yet with higher search accuracy.

4.3 Analysis and Modeling

4.3.1 Modeling LSH

The purpose of [21] is to model the recall and the selectivity and to apply the model to determine the optimal parameters: the window size r, the number of hash functions K forming the compound hash function, the number of tables L, and the number of bins T probed in each table for E2LSH. The recall is defined as the percentage of the true NNs among the retrieved NN candidates. The selectivity is defined as the ratio of the number of retrieved candidates to the number of database points. The two factors are formulated as a function of the data distribution, for which the squared ℓ2 distance is assumed to follow a Gamma distribution that is estimated from the real data. The estimated distributions of the 1-NN, 2-NNs, and so on are used to compute the recall and selectivity. Finally, the optimal parameters are computed to minimize the selectivity with the constraint that the recall is not less than a required value. A similar and more complete analysis for parameter optimization is given in [119].

4.3.2 The difficulty of nearest neighbor search

[36] introduces a new measure, relative contrast, for analyzing the meaningfulness and difficulty of nearest neighbor search. The relative contrast for a query q, given a dataset X, is defined as C_r^q = E_x[d(q, x)] / min_x d(q, x). The relative contrast expectation with respect to the queries is given as C_r = E_{x,q}[d(q, x)] / E_q[min_x d(q, x)].

Define a random variable R = Σ_{j=1}^d R_j = Σ_{j=1}^d E_q[|x_j − q_j|^p], and let its mean be µ and its variance be σ². Define the normalized variance σ′² = σ²/µ². It is shown that if {R1, R2, · · · , Rd} are independent and satisfy Lindeberg's condition, the expected relative contrast is approximated as

  C_r ≈ 1 / [1 + φ^{−1}(1/N + φ(−1/σ′)) σ′]^{1/p},    (20)

where N is the number of database points, φ(·) is the cumulative distribution function of the standard Gaussian, σ′ is the normalized standard deviation, and p is the distance metric norm. It can also be generalized to the relative contrast for the kth nearest neighbor,

  C_r^k = E_{x,q}[d(q, x)] / E_q[k-min_x d(q, x)] ≈ 1 / [1 + φ^{−1}(k/N + φ(−1/σ′)) σ′]^{1/p},    (21)

where k-min_x d(q, x) is the distance of the query to the kth nearest neighbor.

Given the approximate relative contrast, it is clear how the data dimensionality d, the database size N, the metric norm p, and the sparsity of the data vector (which determines σ′) influence the relative contrast.
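As a worked example of Eq. (20)/(21), the snippet below evaluates the approximate relative contrast with SciPy's standard normal CDF and inverse CDF; the chosen values of N, σ′, and p are arbitrary, and reading φ and φ^{-1} as the Gaussian CDF and its inverse follows the definition above.

```python
from scipy.stats import norm

def relative_contrast(N, sigma_prime, p, k=1):
    """Approximate C_r^k: 1 / [1 + phi^{-1}(k/N + phi(-1/sigma')) * sigma']^(1/p)."""
    inner = k / N + norm.cdf(-1.0 / sigma_prime)
    return 1.0 / (1.0 + norm.ppf(inner) * sigma_prime) ** (1.0 / p)

# Smaller sigma' (typical of dense, high-dimensional data) pushes C_r toward 1,
# i.e., the nearest neighbor barely stands out and the search problem is harder.
print(relative_contrast(N=1_000_000, sigma_prime=0.2, p=2))   # roughly 4, an easier problem
print(relative_contrast(N=1_000_000, sigma_prime=0.05, p=2))  # close to 1, a hard problem
```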
It is shown that LSH, under the ℓp-norm distance, can find the exact nearest neighbor with probability 1 − δ by returning O(log(1/δ) n^{g(C_r)}) candidate points, where g(C_r) is a function monotonically decreasing with C_r. It is also shown that, in the context of linear hashing sign(w^T x + b), the optimal projection that maximizes the relative contrast is w* = arg max_w (w^T Σ_x w) / (w^T S_NN w), where Σ_x = (1/N) Σ_{i=1}^N x_i x_i^T and S_NN = E_q[(q − NN(q))(q − NN(q))^T]; when S_NN = I, this reduces to w* = arg max_w w^T Σ_x w.

The LSH scheme has very nice theoretic properties. However, as the hash functions are data-independent, the practical performance is not as good as expected in certain applications. Therefore, there are a lot of follow-ups that learn hash functions from the data.

5 LEARNING TO HASH: HAMMING EMBEDDING AND EXTENSIONS

Learning to hash is the task of learning a compound hash function, y = h(x), mapping an input item x to a compact code y, such that nearest neighbor search in the coding space is efficient and the result is an effective approximation of the true nearest neighbor search result in the input space. An instance of the learning-to-hash approach includes three elements: hash function, similarity measure in the coding space, and optimization criterion. Here the similarity in "similarity measure" is a general concept, and may mean distance or other forms of similarity.

Hash function. The hash function can be based on linear projection, spherical functions, kernels, neural networks, or even a non-parametric function, and so on. One popular hash function is a linear hash function: y = sign(w^T x) ∈ {0, 1}, where sign(w^T x) = 1 if w^T x ≥ 0 and sign(w^T x) = 0 otherwise. Another widely-used hash function is a function based on nearest vector assignment: y = arg min_{k∈{1,··· ,K}} ||x − c_k||_2 ∈ Z, where {c_1, · · · , c_K} is a set of centers, computed by some algorithm, e.g., K-means. The choice of hash function type influences the efficiency of computing hash codes and the flexibility of the hash codes, or the flexibility of partitioning the space. The optimization of hash function parameters depends on both the distance measure and the distance-preserving criterion.
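The two hash function families mentioned above (linear sign functions and nearest-vector assignment) are sketched below; the random projection matrix and the use of scikit-learn's KMeans are illustrative assumptions, not a prescription from the survey.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

def linear_hash(W, x):
    """M binary bits: y_m = 1 if w_m^T x >= 0, else 0."""
    return (W @ x >= 0).astype(np.uint8)

def quantizer_hash(centers, x):
    """Single integer code: index of the nearest center (nearest vector assignment)."""
    return int(np.argmin(np.linalg.norm(centers - x, axis=1)))

X = rng.normal(size=(2000, 32))                    # training/reference data
W = rng.normal(size=(16, 32))                      # 16 linear hash functions (random here; learned in practice)
centers = KMeans(n_clusters=64, n_init=4, random_state=0).fit(X).cluster_centers_

x = X[0]
print(linear_hash(W, x), quantizer_hash(centers, x))
```

The first family pairs naturally with the Hamming distance, the second with (asymmetric) Euclidean distance computed via a lookup table, as discussed next.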
11

TABLE 1 TABLE 3
Hash functions Optimization criterion.
type abbreviation type abbreviation
linear LI Hamming embedding
bilinear BILI coding consistent CC
Laplacian eigenfunction LE coding consistent to distance CCD
kernel KE code balance CB
quantizer QU bit balance BB
1D quantizer OQ bit uncorrelation BU
spline SP projection uncorrelation PU
neural network NN mutual information maximization MIM
spherical function SF minimizing differences between distances MDD
classifier CL minimizing differences between similarities MDS
minimizing differences between similarity distribution MDSD
hinge-like loss HL
TABLE 2 rank order loss ROL
Distance measures in the coding space triplet loss TL
classification error CE
type abbreviation space partitioning SP
complementary partitioning CP
Hamming distance HD
pair-wise bit balance PBB
normalized Hamming distance NHD
maximum margin MM
asymmetric Hamming distance AHD
weighted Hamming distance WHD Quantization
query-dependent weighted Hamming distance QWHD bit allocation BA
normalized Hamming affinity NHA quantization error QE
Manhattan MD equal variance EV
maximum cosine similarity MCS
asymmetric Euclidean distance AED
symmetric Euclidean distance SED
lower bound LB
sij is the similarity between xi and xj computed from the input space or given from the semantic meaning. This motivates the so-called similarity alignment criterion, which directly minimizes the differences between the distances (similarities) computed in the coding space and in the input space. An alternative surrogate is coding consistent hashing, which penalizes larger distances in the coding space for pairs with larger similarities in the input space (called coding consistent to similarity, and shortened to coding consistent since the majority of algorithms use it) and encourages smaller (larger) distances in the coding space for pairs with smaller (larger) distances in the input space (called coding consistent to distance). One typical approach, the space partitioning approach, assumes that space partitioning has already implicitly preserved the similarity to some degree.

Besides similarity preserving, another widely-used criterion is coding balance, which means that the reference vectors should be uniformly distributed over the buckets (each corresponding to a hash code). Other related criteria, such as bit balance, bit independence, and search efficiency, are essentially (degraded) forms of coding balance.

In the following, we review Hamming embedding based hashing algorithms. Table 4 presents a summary of the algorithms reviewed from Section 5.1 to Section 5.5, with some concepts given in Tables 1, 2 and 3.

5.1 Coding Consistent Hashing

Coding consistent hashing refers to a category of hashing functions based on minimizing the similarity weighted distance, sij d(yi, yj) (and possibly maximizing dij d(yi, yj)), to formulate the objective function. Here, d(yi, yj) denotes the distance between the hash codes yi and yj.

5.1.1 Spectral hashing

Spectral hashing [135], the pioneering coding consistency hashing algorithm, aims to find an easily-evaluated hash function so that (1) similar items are mapped to similar hash codes under the Hamming distance (coding consistency) and (2) a small number of hash bits is required. The second requirement is a form similar to coding balance and is transformed into two requirements: bit balance and bit uncorrelation. Bit balance means that each bit has around a 50% chance of being 1 or 0 (−1). Bit uncorrelation means that different bits are uncorrelated.

Let {yn}_{n=1}^{N} be the hash codes of the N data items, each yn being a binary vector of length M. Let sij be a similarity that correlates with the Euclidean distance. The formulation is given as follows:

    min_Y   Trace(Y(D − S)Y^T)          (22)
    s. t.   Y1 = 0                      (23)
            YY^T = I                    (24)
            y_{im} ∈ {−1, 1},           (25)

where Y = [y1 y2 · · · yN], S is the N × N matrix [sij], and D is the diagonal matrix Diag(d11, · · · , dNN) with dnn = Σ_{i=1}^{N} sni. D − S is called the Laplacian matrix, and Trace(Y(D − S)Y^T) = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} sij ||yi − yj||_2^2. The constraint Y1 = 0 corresponds to the bit balance requirement, and YY^T = I corresponds to the bit uncorrelation requirement.
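As a quick numerical illustration of the coding consistency criterion, the following small numpy sketch (our own, not from the survey; the variable names and toy sizes are ours) checks that Trace(Y(D − S)Y^T) equals the similarity-weighted sum of squared code differences up to the factor 1/2.

import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 4                                          # number of items, code length
Y = rng.choice([-1.0, 1.0], size=(M, N))             # toy codes, one column per item
S = rng.random((N, N))
S = (S + S.T) / 2.0                                  # toy symmetric similarity matrix
D = np.diag(S.sum(axis=1))                           # degree matrix
lhs = np.trace(Y @ (D - S) @ Y.T)                    # graph Laplacian form, Equation (22)
rhs = 0.5 * sum(S[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
                for i in range(N) for j in range(N))
print(np.isclose(lhs, rhs))                          # True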

TABLE 4
A summary of hashing algorithms. ∗ means that hash function learning does not explicitly rely on the distance
measure in the coding space. S = semantic similarity. E = Euclidean distance. sim. = similarity. dist. = distance.

method input sim. hash function dist. measure optimization criteria


spectral hashing [135] E LE HD CC + BB + BU
kernelized spectral hashing [37] S, E KE HD CC + BB + BU
Hypergraph spectral hashing [153], [89] S CL HD CC + BB + BU
Topology preserving hashing [145] E LI HD CC + CCD + BB + BU
hashing with graphs [83] S KE HD CC + BB
ICA Hashing [35] E LI, KE HD CC + BB + BU + MIM
Semi-supervised hashing [125], [126], [127] S, E LI HD CC + BB + PU
LDA hash [122] S LI HD CC + PU
binary reconstructive embedding [63] E LI, KE HD MDD
supervised hashing with kernels [82] E, S LI, KE HD MDS
spec hashing [78] S CL HD MDSD
bilinear hyperplane hashing [84] ACS BILI HD MDS
minimal loss hashing [101] E, S LI HD HL
order preserving hashing [130] E LI HD ROL
Triplet loss hashing [103] E, S Any HD, AHD TL
listwise supervision hashing [128] E, S LI HD TL
Similarity sensitive coding (SSC) [114] S CL WHD CE
parameter sensitive hashing [115] S CL WHD CE
column generation hashing [75] S CL WHD CE
complementary projection hashing [55]∗ E LI, KE HD SP + CP + PBB
label-regularized maximum margin hashing [96]∗ E, S KE HD SP + MM + BB
Random maximum margin hashing [57]∗ E LI, KE HD SP + MM + BB
spherical hashing [38]∗ E SF NHD SP + PBB
density sensitive hashing [79]∗ E LI HD SP + BB
multi-dimensional spectral hashing [134] E LE WHD CC + BB + BU
Weighted hashing [131] E LI WHD CC + BB + BU
Query-adaptive bit weights [53], [54] S LI (all) QWHD CE
Query adaptive hashing [81] S LI QWHD CE

Rather than solving the problem in Equation (25) directly, a simple approximate solution, under the assumption of a uniform data distribution, is presented in [135]. The algorithm is given as follows (a code sketch is given at the end of this subsection):
1) Find the principal components of the N d-dimensional reference data items using principal component analysis (PCA).
2) Compute the M 1D Laplacian eigenfunctions with the smallest eigenvalues along each PCA direction.
3) Pick the M eigenfunctions with the smallest eigenvalues among the M · d candidate eigenfunctions.
4) Threshold the picked eigenfunctions at zero, obtaining the binary codes.
The 1D Laplacian eigenfunction for a uniform distribution on [r_l, r_r] is φ_f(x) = sin(π/2 + (fπ/(r_r − r_l)) x), and the corresponding eigenvalue is λ_f = 1 − exp(−(ε^2/2) |fπ/(r_r − r_l)|^2), where f = 1, 2, · · · is the frequency and ε is a fixed small value.

The assumption that the data is uniformly distributed does not hold in real cases, which deteriorates the performance of spectral hashing. Second, the eigenvalue increases monotonously with |f/(r_r − r_l)|^2, which means that a PCA direction with a large spread (|r_r − r_l|) and a low frequency (f) is preferred. As a consequence, more than one eigenfunction may be picked along a single PCA direction, which breaks the uncorrelation requirement and degrades the performance. Last, thresholding the eigenfunction φ_f(x) = sin(π/2 + (fπ/(r_r − r_l)) x) at zero leads to near points being mapped to different values and even far points being mapped to the same value, so the Hamming distance is not well consistent with the Euclidean distance.

In the case that the spreads along the top M PCA directions are the same, the spectral hashing algorithm actually partitions each direction into two parts using the median (due to the bit balance requirement) as the threshold. It is noted that, in the case of a uniform distribution, this solution is equivalent to thresholding at the mean value, and in the case that the true data distribution is a multi-dimensional isotropic Gaussian, it is equivalent to iterative quantization [30], [31] and isotropic hashing [60].

Principal component hashing [93] also uses the principal directions to formulate the hash function. Specifically, it partitions the data points into K buckets so that the projected points along the principal direction are uniformly distributed over the K buckets. In addition, bucket overlapping is adopted to deal with the boundary issue (otherwise, neighboring points around the partitioning position are assigned to different buckets). Different from spectral hashing, principal component hashing aims at constructing hash tables rather than compact codes.

The approach in [74], spectral hashing with a semantically consistent graph, first learns a linear transform matrix such that the similarities computed over the transformed space are consistent with the semantic similarity as well as the Euclidean distance-based similarity, and then applies spectral hashing to learn the hash codes.
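The code sketch referred to above: a rough, self-contained numpy illustration of the approximate spectral hashing procedure (our own sketch, not the authors' implementation; the eigenfunction and eigenvalue formulas follow the text, while the parameter eps and the function names are our choices).

import numpy as np

def spectral_hashing_train(X, M, eps=0.1):
    # X: (N, d) data matrix; M: code length; eps: the small constant in lambda_f
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # principal directions (step 1)
    proj = Xc @ Vt.T
    r_lo, r_hi = proj.min(axis=0), proj.max(axis=0)
    cand = []                                            # enumerate eigenvalues (step 2)
    for k in range(Vt.shape[0]):
        for f in range(1, M + 1):
            lam = 1.0 - np.exp(-(eps ** 2 / 2.0) *
                               (f * np.pi / (r_hi[k] - r_lo[k])) ** 2)
            cand.append((lam, k, f))
    cand.sort()                                          # keep the M smallest (step 3)
    selected = [(k, f) for _, k, f in cand[:M]]
    return mean, Vt, r_lo, r_hi, selected

def spectral_hashing_encode(X, model):
    mean, Vt, r_lo, r_hi, selected = model
    proj = (X - mean) @ Vt.T
    bits = []
    for k, f in selected:
        # threshold the 1D eigenfunction at zero (step 4)
        phi = np.sin(np.pi / 2.0 + f * np.pi / (r_hi[k] - r_lo[k]) * proj[:, k])
        bits.append(phi > 0)
    return np.stack(bits, axis=1).astype(np.uint8)

X = np.random.default_rng(0).normal(size=(1000, 16))
model = spectral_hashing_train(X, M=8)
codes = spectral_hashing_encode(X, model)                # (1000, 8) binary codes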

5.1.2 Kernelized spectral hashing 5.1.5 ICA hashing


The approach introduced in [37] extends spectral hash- The idea of independent component analysis (ICA)
ing by explicitly defining the hash function using ker- Hashing [35] starts from coding balance. Intuitively cod-
nels. The mth hash function is given as follows, ing balance means that the average number of data items
mapped to each hash code is the same. The coding
ym = hm (x) (26)
balance requirement is formulated as maximizing the en-
Tm
X tropy entropy(y1 , y2 , · · · , yM ), and subsequently formu-
= sign( wmt K(smt , x) − bm ) (27)
lated as bit balance: E(ym ) = 0 and mutual information
t=1
Tm
minimization: I(y1 , y2 , · · · , yM ).
= sign(
X
wmt < φ(smt ), φ(x) > −bm ) (28) The approach approximates the mutual information
t=1
using the scheme similar to the one widely used inde-
= sign(< vm , φ(x) > −bm ). (29) pendent component analysis. The mutual information
is relaxed: I(y1 , y2 , · · · , yM ) = I(w1T x, w2T x), · · · , wM
T
x)
Here {smt }Tt=1
m
is the set of randomly-sampled anchor and is approximated as maximizing
items for forming the hash function, and its size Tm is M N
usually the same for all M hash functions. K(·, ·) is a X 1 X
kc − g(WT xn )k22 , (31)
kernel function, and φ(·) is its corresponding mapping m=1
N n=1
function. vm = [wm1 φ(sm1 ) · · · wmTm φ(smTm )]T .
The objective function is written as: under the constraint of whiten condition (which can be
derived from bit uncorrelation), wiT E(xxT )wj = δ[i =
M
X j], c is a constant, g(u) is some non-quadratic functions,
min Trace(Y(D − S)YT ) + kvm k22 , (30) 2
{wmt } such that g(u) = − exp (− u2 ) or g(u) = log cosh(u).
m=1
The whole objective function together preserving the
The constraints are the same to those of spectral hashing, similarities as done in spectral hashing is written as
and differently the hash function is given in Equation 29. follows,
To efficiently solve the problem, the sparse similarity
M N
matrix W and the Nyström algorithm are used to reduce X 1 X
the computation cost. max kc − g(WT xn )k22 (32)
W
m=1
N n=1

5.1.3 Hypergraph spectral hashing s.t. wiT E(xxT )wj = δ[i = j] (33)
T
Hypergraph spectral hashing [153], [89] extends spec- trace(W ΣW) ≤ η. (34)
tral hashing from an ordinary (pair-wise) graph to a
The paper [35] also presents a kernelized version by
hypergraph (multi-wise graph), formulates the prob-
using the kernel hash function.
lem using the hypergraph Laplacian (replace the graph
Laplacian [135], [134]) to form the objective function,
5.1.6 Semi-supervised hashing
with the same constraints to spectral hashing. The al-
gorithm in [153], [89] solves the optimization problem, Semi-supervised hashing [125], [126], [127] extends spec-
by relaxing the binary constraint eigen-decomposing the tral hashing into the semi-supervised case, in which
the hypergraph Laplacian matrix, and thresholding the some pairs of data items are labeled as belonging to
eigenvectors at zero. It computes the code for an out-of- the same semantic concept, some pairs are labeled as
sample vector, by regarding each hash bit as a class label belonging to different semantic concepts. Specifically,
of the data vector and learning a classifier for each bit. In the similarity weight sij is assigned to 1 and −1 if
essence, this approach is a two-step approach that sepa- the corresponding pair of data items, (xi , xj ), belong to
rates the optimization of coding and hash functions. The the same concept, and different concepts, and 0 if no
remaining challenge lies in how to extend the algorithm labeling information is given. This leads to a formulation
to large scale because the eigen-decomposition step is maximizing the empirical fitness,
quite time-consuming. M
X X
sij hm (xi )hm (xj ), (35)
5.1.4 Sparse spectral hashing i,j∈{1,··· ,N } m=1
Sparse spectral hashing [116] combines sparse principal
component analysis (Sparse PCA) and Boosting Simi- where hk (·) ∈ {1, −1}. It is easily shown
larity Sensitive Hashing (Boosting SSC) into traditional that this objective function 35 is equivalent to
P PM (hm (xi )−hm (xj ))2
spectral hashing. The problem is formulated as as thresh- minimizing i,j∈{1,··· ,N } sij m=1 2 =
1 2
P
olding a subset of eigenvectors of the Laplacian graph by 2 i,j∈{1,··· ,N } s ij kyi − yj k 2 .
constraining the number of nonzero features. The convex In addition, the bit balance requirement (over each
relaxation makes the learnt codes globally optimal and hash bit) is explained as maximizing the variance over
the out-of-sample extension is achieved by learning the the hash bits. Assuming the hash function is a sign
eigenfunctions. function, h(x) = sign(wT x), variance maximization is
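Kernelized spectral hashing (Equation (27)), and several later methods such as binary reconstructive embedding and supervised hashing with kernels, share the kernel hash function form h_m(x) = sign(Σ_t w_mt K(s_mt, x) − b_m). A minimal sketch of evaluating such a function is given below; the RBF kernel and the random (untrained) weights are assumptions made purely for illustration.

import numpy as np

def rbf_kernel(A, x, gamma=0.5):
    # K(a, x) = exp(-gamma * ||a - x||^2), evaluated for every row a of A
    return np.exp(-gamma * np.sum((A - x) ** 2, axis=-1))

def kernel_hash_bit(x, anchors, w, b, gamma=0.5):
    # h(x) = sign( sum_t w_t K(s_t, x) - b ), cf. Equation (27)
    return 1 if np.dot(w, rbf_kernel(anchors, x, gamma)) - b > 0 else 0

rng = np.random.default_rng(0)
anchors = rng.normal(size=(10, 16))                  # randomly sampled anchor items s_t
w, b = rng.normal(size=10), 0.0                      # weights would be learnt; random here
print(kernel_hash_bit(rng.normal(size=16), anchors, w, b))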

relaxed as maximizing the variance of the projected data which is then decomposed into K subproblems each
wT x. In summary, the formulation is given as of which finds bk for each hash function wkT x − bk .
The subproblem can be exactly solved using simple 1D
trace[WT Xl SXTl W] + η trace[WT XXT W], (36)
search.
where S is the similarity matrix over the labeled data Xl ,
X is the data matrix withe each column corresponding 5.1.8 Topology preserving hashing
to one data item, and η is a balance variable. Topology preserving hashing [145] formulates the hash-
In the case that W is an orthogonal matrix (the ing problem by considering two forms of coding con-
columns are orthogonal to each other, WT W = I, sistency: preserving the neighborhood ranking and pre-
which is called projection uncorrelation) (equivalent to serving the data topology.
the independence requirement in spectral hashing), it is The first coding consistency form is presented as a
solved by eigen-decomposition. The authors present a maximization problem,
sequential projection learning algorithm by embedding
1 X
WT W = I into the objective function as a soft constraint sign(doi,j − dos,t ) sign(dhi,j − dhs,t ) (43)
2 i,j,s,t
trace[WT Xl SXTl W] + η trace[WT XXT W]
1 X o
+ ρkWT W − Ik2F , (37) ≈ (d − dos,t )(dhi,j − dhs,t ) (44)
2 i,j,s,t i,j
where ρ is a tradeoff variable. An extension of semi-
supervised hashing to nonlinear hash functions is pre- where do and dh are the distances in the original space
and the Hamming space. This ranking preserving formu-
PTin [136], where the kernel hash function, h(x) =
sented
lation, based on the rearrangement inequality, is trans-
sign( t=1 wt < φ(st ), φ(x) > −b) , is used.
formed to
5.1.7 LDA hash 1X o h
d d (45)
LDA (linear discriminant analysis) hash [122] aims to 2 i,j i,j i,j
find the binary codes by minimizing the following ob-
1X o
jective function, = d kyi − yj k22 (46)
2 i,j i,j
α E{kyi − yj k2 |(i, j) ∈ P} − E{kyi − yj k2 |(i, j) ∈ N },
(38) = trace(YLt YT ), (47)

where y = sign(WT x + b), P is the set of positive where Lt = Dt − St , Dt = diag(St 1) and st (i, j) = f (doij )
(similar) pairs, and N is the set of negative (dissimilar) with f (·) is monotonically non-decreasing.
pairs. Data topology preserving is formulated in a way
LDA hash consists of two steps: (1) finding the projec- similar to spectral hashing, by minimizing the following
tion matrix that best discriminates the nearer pairs from function
the farther pairs, which is a form of coding consistency, 1X
sij kyi − yj k22 (48)
and (2) finding the threshold to generate binary hash 2 ij
codes. The first step relaxes the problem, by removing
the sign and minimizes a related function, = trace(YLs YT ), (49)

α E{kWT xi − WT xj k2 |(i, j) ∈ P} where Ls = Ds − Ss , Ds = diag(Ss 1), and ss (i, j) is the


similarity between xi and xj in the original space.
− E{kWT xi − WT xj k2 |(i, j) ∈ N }. (39) Assume the hash function is in the form of sign(WT x)
This formulation is then transformed to an equivalent (the following formulation can also be extended to the
form, kernel hash function), the overall formulation, by a
relaxation step sign(WT x) ≈ WT x, is given as follows,
α trace{WT Σp W} − trace{WT Σn W}, (40)
trace(WT X(Lt + αI)XT W)
where Σp = E{(xi − xj )(xi − xj )T |(i, j) ∈ P} and max , (50)
trace(WT XLs XT W)
Σn = E{(xi − xj )(xi − xj )T |(i, j) ∈ N }. There are two so-
lutions given in [122]: minimizing trace{WT Σp Σ−1 n W},
where αI introduces a regularization term,
which does not need to specify α, and minimizing trace(WT XXT W), similar to the bit balance condition
trace{WT (αΣp − Σn )}. in semi-supervised hashing [125], [126], [127].
The second step aims to find the threshold by mini-
mizing 5.1.9 Hashing with graphs
T T
α E{sign{W xi − b} − sign{W xj − b}|(i, j) ∈ P} The key ideas of hashing with graphs [83] consist of us-
(41) ing the anchor graph to approximate the neighborhood
graph, (accordingly using the graph Laplacian over the
− E{sign{WT xi − b} − sign{WT xj − b}|(i, j) ∈ N }, anchor graph to approximate the graph Laplacian of the
(42) original graph) for fast computing the eigenvectors and
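For LDA hash, one of the two relaxed solutions above minimizes trace{W^T(αΣ_p − Σ_n)W}. Assuming an orthogonality constraint on the columns of W (our assumption, used only to make the sketch concrete), this amounts to taking the eigenvectors of αΣ_p − Σ_n with the smallest eigenvalues, as in the following illustration.

import numpy as np

def pair_covariance(X, pairs):
    # Sigma = E[(x_i - x_j)(x_i - x_j)^T] over the given index pairs; X is (d, N)
    D = np.stack([X[:, i] - X[:, j] for i, j in pairs], axis=1)
    return (D @ D.T) / len(pairs)

def lda_hash_projections(X, pos_pairs, neg_pairs, M, alpha=1.0):
    Sigma_p = pair_covariance(X, pos_pairs)
    Sigma_n = pair_covariance(X, neg_pairs)
    vals, vecs = np.linalg.eigh(alpha * Sigma_p - Sigma_n)
    return vecs[:, np.argsort(vals)[:M]]             # directions with smallest eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 100))
pos = [(i, i + 1) for i in range(0, 50, 2)]          # toy "similar" pairs
neg = [(i, 99 - i) for i in range(25)]               # toy "dissimilar" pairs
W = lda_hash_projections(X, pos, neg, M=8)
codes = (W.T @ X > 0).astype(np.uint8)               # thresholds set to 0 for simplicity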

using a hierarchical hashing to address the boundary embedding [63], and (2) minimizing the differences be-
issue for which the points around the hash plane are tween the Hamming affinity over the hash codes and
assigned different hash bits. The first idea aims to solve the similarity over the data items, which has two types,
the same problem in spectral hashing [135], present an similar (s = 1) or dissimilar (s = −1) e.g., given by the
approximate solution using the anchor graph rather than Euclidean distance or the labeling information.
the PCA-based solution with the assumption that the The hash function is given as follows,
data points are uniformly distributed. The second idea Tm
breaks the independence constraint over hash bits. ynm = hm (xn ) = sign(
X
wmt K(smt , x) + b), (53)
Compressed hashing [80] borrows the idea about an- t=1
chor graph in [83] uses the anchors to generate a sparse
representation of data items by computing the kernels where b is the bias. The objective function is given as the
with the nearest anchors and normalizing it so that the following,
summation is 1. Then it uses M random projections and X
min (sij − affinity(yi , yj ))2 , (54)
the median of the projections of the sparse projections
(i,j)∈L
along each random projection as the bias to generate the
hash functions. where L is the set of labeled pairs, affinity(yi , yj ) = M −
kyi − yj k1 is the Hamming affinity, and y ∈ {1, −1}M .
5.2 Similarity Alignment Hashing Kernel reconstructive hashing [141] extends this tech-
nique using a normalized Gaussian kernel similarity.
Similarity alignment hashing is a category of hashing
algorithms that directly compare the similarities (dis- 5.2.3 Spec hashing
tances) computed from the input space and the coding
space. In addition, the approach aligning the distance The idea of spec hashing [78] is to view each pair of
distribution is also discussed in this section. Other al- data items as a sample and their (normalized) similarity
gorithms, such as quantization, can also be interpreted as the probability, and to find the hash functions so
as similarity alignment, and for clarity, are described in that the probability distributions from the input space
separate paragraphs. and the Hamming spacePare well aligned. Let siij be the
normalized similarity ( ij siij = 1) given in the input
5.2.1 Binary reconstructive embedding space, and shij be the normalized similarity computed in
The key idea of binary reconstructive embedding [63] the Hamming space, shij = Z1 exp (−λ
P disth (i, j)), where Z
is to learn the hash codes such that the difference be- is a normalization variable Z = ij exp (−λ disth (i, j)).
tween the Euclidean distance in the input space and the Then, the objective function is given as follows,
Hamming distance in the hash codes is minimized. The min KL({siij }||{shij })
objective function is formulated as follows, X
=− λsiij log shij (55)
X 1 1
min ( kxi − xj k2f − kyi − yj k22 )2 . (51) ij
2 M X X
(i,j)∈N =λ suij disth (i, j) + log exp (−λ disth (i, j)).
ij ij
The set N is composed of point pairs, which includes
(56)
both the nearest neighbors and other pairs.
The hash function is parameterized as: Supervised binary hash code learning [24] presents
Tm a supervised binary hash code learning algorithm us-
ing Jensen Shannon Divergence which is derived from
X
ynm = hm (x) = sign( wmt K(smt , x)), (52)
t=1 minimizing an upper bound of the probability of Bayes
decision errors.
where {smt }Tt=1m
are sampled data items forming the
hashing function hm (·) ∈ {h1 (·), · · · , hM (·)}, K(·, ·) is a 5.2.4 Bilinear hyperplane hashing
kernel function, and {wmt } are the weights to be learnt.
Instead of relaxing the sign function to a continuous Bilinear hyperplane hashing [84] transforms the database
function, an alternative optimization scheme is presented vector (the normal of the query hyperplane) into a high-
in [63]: fixing all but one weight wmt and optimizing the dimensional vector,
problem 51 with respect to wmt . It is shown that an exact, ā = vec(aaT )[a21 , a1 a2 , · · · , a1 ad , a2 a1 , a22 , a2 a3 , · · · , a2d ].
optimal update to this weight wmt (fixing all the other (57)
weights) can be achieved in time O(N log N + n|N |).
The bilinear hyperplane hashing family is defined as
5.2.2 Supervised hashing with kernels follows,
The idea of supervised hashing with kernels [82] con- sign(uT zzT v)

if z is a database vector
h(z) =
sists of two aspects: (1) using the kernels to form the sign(−uT zzT v) if z is a hyperplane normal.
hash functions, which is similar to binary reconstructive (58)
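A small numpy sketch (ours) of evaluating the objective of Equation (54) in supervised hashing with kernels: for ±1 codes, the Hamming affinity M − ||y_i − y_j||_1 equals the code inner product y_i^T y_j, so the whole affinity matrix can be computed at once.

import numpy as np

def hamming_affinity(Y):
    # Y: (M, N) codes in {-1, +1}; affinity(y_i, y_j) = M - ||y_i - y_j||_1 = y_i^T y_j
    return Y.T @ Y

def ksh_objective(Y, S, labeled_pairs):
    # Equation (54): sum over labeled pairs of (s_ij - affinity(y_i, y_j))^2
    A = hamming_affinity(Y)
    return sum((S[i, j] - A[i, j]) ** 2 for i, j in labeled_pairs)

rng = np.random.default_rng(0)
Y = np.sign(rng.normal(size=(16, 30)))               # 16-bit codes for 30 items
S = np.sign(rng.normal(size=(30, 30)))               # toy labels: similar (+1) / dissimilar (-1)
pairs = [(i, j) for i in range(30) for j in range(i + 1, 30)]
print(ksh_objective(Y, S, pairs))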

Here u and v are sampled independently from a stan- space and the distance in the Hamming space, respec-
dard Gaussian distribution. It is shown to be r, r(1 + tively. The objective function maximizing the alignment
ǫ), 12 − π2r2 , 12 − 2r(1+ǫ)
π2 -sensitive to the angle distance between the two categories is given as follows,
dθ (x, n) = (θx,n − π2 )2 , where r, ǫ > 0.
L(h(·); X ) (62)
Rather than randomly drawn, u and u can be also XN
learnt according to the similarity information. A formu- = L(h(·); xn ) (63)
n=1
lation is given in [84] as the below, XN XM−1
= L(h(·); xn , m) (64)
1 n=1 m=0
min k YT Y − Sk, (59) XN XM−1
e h h e
{uk ,vk }K
k=1
K = (|Nnm − Nnm | + λ|Nnm − Nnm |), (65)
n=1 m=0
e
where Y = [y1 , y2 , · · · , yN ] and S is the similarity where Nnm = ∪m e h m h
j=0 Cnj and Nnm = ∪j=0 Cnj .
matrix, Given the compound hash function defined as below,
h(x) = sign(WT x + b) (66)

 1 if cos(θxi ,xj ) > t1
sij = −1 if cos(θxi ,xj ) 6 t2 (60) = [sign(w1T x + b1 ) · · · T
sign(wm x T
+ bm )] ,
2| cos(θxi ,xj )| − 1 otherwise,

the loss is transformed to:
The above problem is solved by relaxing sign with the L(W; xn , i)
sigmoid-shaped function and finding the solution with X
the gradient descent algorithm. = sign(kh(xn ) − h(x′ )k22 − i)
x′ ∈Nni
e
X
+λ ′ e
sign(i + 1 − kh(xn ) − h(x′ )k22 ). (67)
x ∈N
/ ni
5.3 Order Preserving Hashing
This problem is solved by dropping the sign function
This section reviews the category of hashing algorithms and using the quadratic penalty algorithm [130].
that depend on various forms of maximizing the align-
ment between the orders of the reference data items 5.3.3 Triplet loss hashing
computed from the input space and the coding space. Triplet loss hashing [103] formulates the hashing prob-
lem by preserving the relative similarity defined over
5.3.1 Minimal loss hashing triplets of items, (x, x+ , x− ), where the pair (x, x+ ) is
more similar than the pair (x, x− ). The triplet loss is
The key point of minimal loss hashing [101] is to use
defined as
a hinge-like loss function to assign penalties for similar
(or dissimilar) points when they are too far apart (or too ℓtriplet (y, y+ , y− ) = max(ky − y+ k1 − ky − y− k1 + 1, 0).
close). The formulation is given as follows, (68)
X Suppose the compound hash function is defined as
min I[sij = 1] max(kyi − yi k1 − ρ + 1, 0)
h(x; W), the objective function is given as follows,
(i,j)∈L X
+ I[sij = 0]λ max(ρ − kyi − yi k1 + 1, 0), (61) ℓtriplet (h(x; W), h(x+ ; W), h(x− ; W))
(x,x+ ,x− )∈D
where ρ is a hyper-parameter and is uses as a threshold λ
in the Hamming space that differentiates neighbors from trace (WT W).
+ (69)
2
non-neighbors, λ is also a hyper-parameter that controls The problem is optimized using the algorithm similar to
the ratio of the slopes for the penalties incurred for simi- minimal loss hashing [101]. The extension to asymmetric
lar (or dissimilar) points. Both the two hyper-parameters Hamming distance is also discussed in [101].
are selected using the validation set.
Minimal loss hashing [101] solves the problem by 5.3.4 Listwise supervision hashing
building the convex-concave upper bound of the Similar to [101], listwise supervision hashing [128] also
above objective function and optimizing it using the uses triplets of items to approximate the listwise loss.
perceptron-like learning procedure. The formulation is based on a triplet tensor S defined as
follows,
5.3.2 Rank order loss 
 1 if sim(xi ; i) > sim(xi ; j)
The idea of order preserving hashing [130] is to learn s(i; j, k) = −1 if sim(xi ; i) < sim(xi ; j) (70)
hash functions by maximizing the alignment between 0 if sim(xi ; i) = sim(xi ; j).

the similarity orders computed from the original space The goal is to minimize the following objective func-
and the ones in the Hamming space. To formulate the tion,
problem, given a data point xn , the database points X X
are divided into M categories, (Cn0 e e
, Cn1 e
, · · · , CnM ) and − h(xi )T (h(xj ) − h(xk ))sijk , (71)
h h h
(Cn0 , Cn1 , · · · , CnM ), using the distance in the original i,j,k
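To make the triplet loss of Equation (68) concrete, here is a tiny example of ours using the Hamming distance, which coincides with the l1 difference for 0/1 codes.

import numpy as np

def hamming(a, b):
    return int(np.sum(a != b))

def triplet_loss(y, y_pos, y_neg, margin=1):
    # Equation (68): zero only if the similar pair is closer by at least `margin` bits
    return max(hamming(y, y_pos) - hamming(y, y_neg) + margin, 0)

y     = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=np.uint8)
y_pos = np.array([1, 0, 1, 0, 0, 1, 0, 0], dtype=np.uint8)   # 1 bit away from y
y_neg = np.array([0, 1, 1, 0, 1, 1, 0, 1], dtype=np.uint8)   # 5 bits away from y
print(triplet_loss(y, y_pos, y_neg))                         # 0: the ranking is satisfied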

where is solved by dropping the sign operator in Besides, it generalizes the bit balance condition, for
h(x; W) = sign(WT x). each hit, half of points are mapped to −1 and the rest
mapped to 1, and introduces a pair-wise bit balance
5.3.5 Similarity sensitive coding condition to approximate the coding balance condition,
Similarity sensitive coding (SSC) [114] aims to learn an i.e. every two hyperplanes spit the space into four sub-
embedding, which can be called weighted Hamming spaces, and each subspace contains N/4 data points. The
embedding: h(x) = [α1 h(x1 ) α2 h(x2 ) · · · αM h(xM )] that condition is guaranteed by
is faithful to a task-specific similarity. An example algo- N
X
rithm, boosted SSC, uses adaboost to learn a classifier. h1 (xn ) = 0, (76)
The output of each weak learner on an input item is a n=1
N
binary code, and the outputs of all the weak learners X
are aggregated as the hash code. The weight of each h2 (xn ) = 0, (77)
n=1
weak learner forms the weight in the embedding, and
N
is used to compute the weighted Hamming distance. X
h1 (xn )h2 (xn ) = 0. (78)
Parameter sensitive hashing [115] is a simplified version
n=1
of SSC with the standard LSH search procedure instead
of the linear scan with weighted Hamming distance and The whole formulation for updating the mth hash func-
uses decision stumps to form hash functions with thresh- tion is written as the following
old optimally decided according to the information of N
X N
X
similar pairs, dissimilar pairs and pairs with undefined min um T
n H(ǫ − |wm xn + b|) + α(( hm (xn ))2
similarities. The forgiving hashing approach [6], [7], n=1 n=1
[8] extends parameter sensitive hashing and does not m−1
X N
X
explicitly create dissimilar pairs, but instead relies on the + ( hj (xn )hm (xn ))2 , (79)
maximum entropy constraint to provide that separation. j=1 n=1

A column generation algorithm, which can be used T


where hm (x) = (wm x + b).
to solve adaboost, is presented to simultaneously learn The paper [55] also extends the linear hash function
the weights and hash functions [75], with the following the kernel function, and presents the gradient descent
objective function algorithm to optimize the continuous-relaxed objective
N
function which is formed by dropping the sign function.
X
min ζi + Ckαkp (72) 5.4.2 Label-regularized maximum margin hashing
α,ζ
i=1
The idea of label-regularized maximum margin hash-
s. t. α > 0, ζ ≥ 0, (73)
ing [96] is to use the side information to find the hash
dh (xi , x−
i ) − dh (xi , x+
i ) > 1 − ζi ∀i. (74) function with the maximum margin criterion. Specifi-
Here k · kl is a ℓp norm, e.g., l = 1, 2, ∞. cally, the hash function is computed so that ideally one
pair of similar points are mapped to the same hash bit
and one pair of dissimilar points are mapped to different
5.4 Regularized Space Partitioning hash bits. Let P be a set of pairs {(i, j)} labeled to be
Almost all hashing algorithms can be interpreted from similar. The formulation is given as follows,
the view of partitioning the space. In this section, we N
λ1 X λ2 X
review the category of hashing algorithms that focus on min kwk22 + ξn + ζij (80)
{yi },w,b,{ξi },{ζ} N n=1 N
pursuiting effective space partitioning without explicitly (i,j)∈S
evaluating the distance in the coding space. s. t. yi (wT xi + b) + ξi > 1, ξi > 0, ∀i, (81)
yi yj + ζij > 0.∀(i, j) ∈ P, (82)
5.4.1 Complementary projection hashing T
− l 6 w xi + b 6 l. (83)
Complementary projection hashing [55] computes the
mth hash function according to the previously computed Here, kwk22 corresponds to the maximum margin crite-
(m − 1) hash functions, using a way similar to comple- rion. The second constraint comes from the side infor-
mentary hashing [139], checking the distance of the point mation for similar pairs, and its extension to dissimilar
to the previous (m − 1) partition planes. The penalty pairs is straightforward. The last constraint comes from
weight for xn when learning the mth hash function is the bit balance constraint, half of data items mapped to
given as −1 or 1.
Similar to Pthe BRE, the hash function is defined as
m−1 T
X h(x) = sign( t=1 vt < φ(st ), φ(x) > −b) , which means
um
n = 1+
T
H(ǫ − |wm xn + b|), (75) PT
that w = t=1 vt φ(st ). This definition reduces the op-
j=1
timization cost. Constrained-concave-convex-procedure
where H(·) = 12 (1 + sign(·)) is a unit function. (CCCP) and cutting plane are used for the optimization.
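Similarity sensitive coding, and the methods in Section 5.5, score codes with a weighted Hamming distance. A short illustration of ours (the example weights are arbitrary, e.g. boosting weights in SSC):

import numpy as np

def weighted_hamming(y1, y2, alpha):
    # d(y1, y2) = sum_m alpha_m * [y1_m != y2_m]; the plain Hamming distance when alpha = 1
    return float(np.sum(alpha * (y1 != y2)))

alpha = np.array([0.4, 0.3, 0.2, 0.1])               # per-bit weights
y1 = np.array([1, 0, 1, 1], dtype=np.uint8)
y2 = np.array([1, 1, 1, 0], dtype=np.uint8)
print(weighted_hamming(y1, y2, alpha))               # 0.3 + 0.1 = 0.4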

5.4.3 Random maximum margin hashing entropy, −Pm0 log Pm0 − −Pm1 log Pm1 , where Pm0 = nn0
Random maximum margin hashing [57] learns a hash and Pm1 = 1 − Pm0 . n is the number of the data points,
function with the maximum margin criterion, where the and n0 is the number of the data points lying one
positive and negative labels are randomly generated, by partition formed by the hyperplane of the corresponding
randomly sampling N data items and randomly labeling hash function. Lastly, L hash functions with the greatest
half of the items with −1 and the other half with 1. entropy scores are selected to form the compound hash
The formulation is a standard SVM formulation that is function.
equivalent to the following form,
5.5 Hashing with Weighted Hamming Distance
N′ N′
1 2 2
T − This section presents the hashing algorithms which eval-
max min[min(wT x+
i + b), min(−w xi − b)], (84)
kwk2 i=1 i=1 uates the distance in the coding space using the query-
− dependent and query-independent weighted Hamming
where {x+ i } are the positive samples and {xi } are distance scheme.
the negative samples. Using the kernel trick, the
hashP function can be a kernel-based function, h(x = 5.5.1 Multi-dimensional spectral hashing
v
sign( i=1 αi < φ(x), φ(s) >) + b), where {s} are the Multi-dimensional spectral hashing [134] seeks hash
selected v support vectors. codes such that the weighted Hamming affinity is equal
to the original affinity,
5.4.4 Spherical hashing X
The basic idea of spherical hashing [38] is to use a min (wij − yiT Λyj )2 = kW − YT ΛYk2F , (87)
hypersphere to formulate a spherical hash function, (i,j)∈N
 where Λ is a diagonal matrix, and both Λ and hash codes
+1 if d(p, x) 6 t
h(x) = (85) {yi } are needed to be optimized.
0 otherwise.
The algorithm for solving the problem 87 to compute
The compound hash function consists of K spherical hash codes is exactly the same to that given in [135].
functions, depending on K pivots {p1 , · · · , pK } and K Differently, the affinity over hash codes for multi-
thresholds {t1 , · · · , tK }. Given two hash codes, y1 and dimensional spectral hashing is the weighted Hamming
y2 , the distance is computed as affinity rather than the ordinary (isotropically weighted)
Hamming affinity. Let (d, l) correspond to the index of
ky1 − y2 k1
, (86) one selected eigenfunction for computing the hash bit,
y1T y2
the l eigenfunction along the PC direction d, I = {(d, l)}
where ky1 − y2 k1 is similar to the Hamming distance, be the set of the indices of all the selected eigenfunctions.
i.e., the frequency that both the two points lie inside (or The weighted Hamming affinity using pure eigenfunc-
outside) the hypersphere, and y1T y2 is equivalent to the tions along (PC) dimension d is computed as
number of common 1 bits between two binary codes, X
affinityd (i, j) = λdl sign(φdl (xid )) sign(φdl (xjd )),
i.e., the frequency that both the two points lie inside the
(d,l)∈I
hypersphere.
(88)
The paper [38] proposes an iterative optimization al-
gorithm to learn K pivots and thresholds such that it where xid is the projection of xi along dimension d,
satisfies a pairwise bit balanced condition: φdl (·) is the lth eigenfunction along dimension d, λdl is
the corresponding eigenvalue. The weighted Hamming
k{x|hk (x) = 1}k = k{x|hk (x) = 0}k, affinity using all the hash codes is then computed as
and follows,
Y
1 affinity(yi , yj ) = (1 + affinityd (i, j)) − 1. (89)
k{x|hi (x) = b1 , hj (x) = b2 }k = kX k, b1 , b2 ∈ {0, 1}.
4 d
The computation can be accelerated using lookup tables.
5.4.5 Density sensitive hashing
The idea of density sensitive hashing [79] is to exploit 5.5.2 Weighted hashing
the clustering results to generate a set of candidate hash Weighted hashing [131] uses the weighted Hamming
functions and to select the hash functions which can split distance to evaluate the distance between hash codes,
the data most equally. First, the k-means algorithm is run kαT (yi − yj )k22 . It optimizes the following problem,
over the data set, yielding K clusters with centers being 1
{µ1 , µ2 , · · · , µK }. Second, a hash function is defined min trace(diag(α)YLYT ) + λk YYT − Ik2F (90)
n
over two clusters (µi , µj ) if the center is one of the r
s. t. Y ∈ {−1, 1}M×N , YT 1 = 0 (91)
nearest neighbors of the other, h(x) = sign(wT x − b),
where w = µi − µj and b = 21 (µi + µj )T (µi − µj ). The kαk1 = 1 (92)
third step aims to evaluate if the hash function (wm , bm ) α1 α2 αM
= = ··· = , (93)
can split the data most equally, which is evaluated by the var(y1 ) var(y2 ) var(yM )
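A short sketch (ours) of the spherical hash function of Equation (85) and the distance of Equation (86); the handling of codes with no common 1 bits is our own convention, since the text above does not specify it.

import numpy as np

def spherical_code(x, pivots, thresholds):
    # Equation (85): bit k is 1 iff x lies inside the k-th hypersphere
    return (np.linalg.norm(pivots - x, axis=1) <= thresholds).astype(np.uint8)

def spherical_distance(y1, y2):
    # Equation (86): ||y1 - y2||_1 / (y1^T y2); the denominator counts common 1 bits
    common_ones = int(np.dot(y1, y2))
    if common_ones == 0:
        return np.inf                                # our convention for no common 1 bits
    return float(np.sum(y1 != y2)) / common_ones

y1 = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=np.uint8)
y2 = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.uint8)
print(spherical_distance(y1, y2))                    # 2 differing bits / 3 common 1 bits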

where L = D − S is the Laplacian matrix. The formula- are the set of primitive polynomials which can span the
tion is essentially similar to spectral hashing [135], and polynomial space with a degree less than s, {gni (x)}
the difference lies in including the weights for weighed are the green functions, and {αni } and {βni} are the
Hamming distance. corresponding coefficients. The whole formulation is
The above problem is solved by discarding the first given as follows,
constraint and then binarizing y at the M medians. The N
T
hash function wm x + b is learnt by mapping the input
X X
min ( khn (xi ) − yi k22 + γψn (hn )
x to a hash bit ym . v,{hi },{yn }
n=1 xi ∈Nn
N
5.5.3 Query-adaptive bit weights
X
+ λ( kh(xn ) − yn k22 + γkvk22 ). (95)
[53], [54] presents a weighted Hamming distance mea- n=1
sure by learning the weights from the query information.
5.6.3 Inductive manifold hashing
Specifically, the approach learns class-specific bit weights
so that the weighted Hamming distance between the Inductive manifold mashing [117] consists of three steps:
hash codes belong the class and the center, the mean cluster the data items into K clusters, whose centers
of those hash codes is minimized. The weight for a are {c1 , c2 , · · · , cK }, embed the cluster centers into a
specific query is the average weight of the weights of the low-dimensional space, {y1 , y2 , · · · , yK }, using existing
classes that the query most likely belong to and that are manifold embedding technologies, and finally the hash
discovered using the top similar images (each of which function is given as follows,
is associated with a semantic label). PK
k=1 w(x, ck )yk
h(x) = sign( P K
). (96)
5.5.4 Query-adaptive hashing k=1 w(x, ck )

Query adaptive hashing [81] aims to select the hash bits 5.6.4 Nonlinear embedding
(thus hash functions forming the hash bits) according The approach introduced in [41] is an exact nearest
to the query vector (image). The approach consists of neighbor approach, which relies on a key inequality,
two steps: offline hash functions h(x) = sign(WT x)
({hb (x) = sign(wbT x)}) and online hash function selec- kx1 − x2 k22 > d((µ1 − µ2 )2 + (σ1 − σ2 )2 ), (97)
tion. The online hash function selection, given the query d
where µ = d1 i=1 xi is the mean of all the entries of
P
q, is formulated as the following, Pd
the vector x, and σ = d1 i=1 (xi − µ)2 is the standard
min kq − Wαk22 + ρkαk1 . (94) deviation. The above inequality is generalized by divid-
α
ing the vector into M subvectors, with the length of
Given the optimal solution α∗ , α∗i = 0 means the ith each subvector being dm , and the resulting inequality
hash function is not selected, and the hash function is formulated as follows,
corresponding to the nonzero entries in α∗ . A solution
M
based on biased discriminant analysis is given to find X
kx1 − x2 k22 > dm ((µ1m − µ2m )2 + (σ1m − σ2m )2 ).
W, for which more details can be found from [81].
m=1
(98)
5.6 Other Hash Learning Algorithms In the search strategy, before computing the exact
5.6.1 Semantic hashing Euclidean distance between the query and the database
Semantic hashing [111], [112] generate the hash codes, point, the lower bound is first computed and is com-
which can be used to reconstruct the input data, us- pared with the current minimal Euclidean distance, to
ing the deep generative model (based on the pretrain- determine if the exact distance is necessary to be com-
ing technique and the fine-tuning scheme originally puted.
designed for the restricted Boltzmann machines). This
algorithm does not use any similarity information. The 5.6.5 Anti-sparse coding
binary codes can be used for finding similarity data as The idea of anti-sparse coding [50] is to learn a hash
they can be used to well reconstruct the input data. code so that non-zero elements in the hash code as many as
possible. The binarization process is as follows. First, it
5.6.2 Spline regression hashing solves the following problem,
Spline regression hashing [90] aims to find a global hash z∗ = arg min kzk∞ , (99)
function in the kernel form, h(x) = vT φ(x), such that the z:Wz=x

hash value from the global hash function is consistent to where kzk∞ = maxi∈{1,2,··· ,K} |zi |, and W is a projection
those from the local hash functions that corresponds to matrix. It is proved that in the optimal solution (mini-
its neighborhood points. Each data point corresponds to mizing the range of the components), K − d + 1 of the
a local hash function in the form of spline regression, components are stuck to the limit, i.e., zi = ±kzk∞ . The
Pt Pk
hn (x = i=1 βni pi (x)) + i=1 αni gni (x), where {pi (x)} binary code (of the length K) is computed as y = sign(z).
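The pruning inequality of Equation (97) can serve as a cheap filter: the exact Euclidean distance only needs to be computed when the lower bound is smaller than the current best distance. A small check of the bound (ours; σ is taken as the standard deviation of the entries):

import numpy as np

def mean_std_lower_bound(x1, x2):
    # Equation (97): ||x1 - x2||^2 >= d * ((mu1 - mu2)^2 + (sigma1 - sigma2)^2)
    d = len(x1)
    return d * ((x1.mean() - x2.mean()) ** 2 + (x1.std() - x2.std()) ** 2)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=32), rng.normal(size=32)
exact = float(np.sum((x1 - x2) ** 2))
print(mean_std_lower_bound(x1, x2) <= exact)         # True: the bound never exceeds the distance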

The distance between the query q and a vector x can Here Z is a nonlinear embedding, similar to locally linear
be evaluated based on the similarity in the Hamming embedding and M is a sparse matrix, M = (I − W)T (I −
space, yqT yx or the asymmetric similarity zTq yx . The nice W). W is the locally linear reconstruction weight matrix,
property is that the anti-sparse code allows, up to a which is computed by solving the following optimiza-
scaling factor, the explicit reconstruction of the original tion problem for each database item,
vector x ∝ Wy. 1 X
min λksTn wn k1 + kxn − wij xn k22 (107)
5.6.6 Two-Step Hashing wn 2
j∈N (xn )
The paper [77] presents a general two-step approach to s. t. wnT 1 = 1, (108)
learning-based hashing: learn binary embedding (codes)
and then learn the hash function mapping the input item where wn = [wn1 , wn2 , · · · , wnn ]T , and wnj = 0 if j ∈ /
to the learnt binary codes. An instance algorithm [76] N (xn ). sn = [sn1 , sn2 , · · · , snn ]T is a vector and snj =
P kxn −xj k2
uses an efficient GraphCut based block search method .
t∈N (xn ) kxn −xt k2
for inferring binary codes for large databases and trains Out-of-sample extension computes the binary embed-
boosted decision trees fit the binary codes. ding of a query q as yq = sign(YT wq ). Here wq is
Self-taught hashing [144] optimizes an objective func- a locally linear reconstruction weight, and computed
tion, similar to the spectral hashing, similarly to the above optimization problem. Differently,
min trace(YLYT ) (100) Y and wq correspond to the cluster centers, computed
T using k-means, of the database X.
s. t. YDY = I (101)
YD1 = 0, (102)
5.7 Beyond Hamming Distances in the Coding
where Y is a real-valued matrix, relaxed from the binary Space
matrix), L is the Laplacian matrix and D is the degree
This section reviews the algorithms focusing on design-
matrix. The solution is the M eigenvectors correspond-
ing effective distance measures given the binary codes
ing to the smallest M eigenvalues (except the trivial
and possibly the hash functions. The summary is given
eigenvalue 0), Lv = λDv. To get the binary code, each
in Table 5.
row of Y is thresholded using the median value of
the column. To form the hash function, mapping the
vector to a single hash bit is regarded as a classification 5.7.1 Manhattan distance
problem, which is solved by linear SVM, sign(wT x + b). When assigning multiple bits into a projection direc-
The linear SVM is then regarded as the hash function. tion, the Hamming distance breaks the neighborhood
Sparse hashing [151] also is a two step approach. The structure, thus the points with smaller Hamming dis-
first step learns a sparse nonnegative embedding, in tance along the projection direction might have large
which the positive embedding is encodes as 1 and the Euclidean distance along the projection direction. Man-
zero embedding is encoded as 0. The formulation is as hattan hashing [61] introduces a scheme to address this
follows, issue, the Hamming codes along the projection direction
N N X
N N are in turn (e.g., from the left to the right) transformed
integers, and the difference of the integers is used to
X X X
kxn − PT zn k22 + α sij kzi − zj k22 + λ kzn k1 ,
n=1 i=1 j=1 n=1 replace the Hamming distance. The aggregation of the
(103) differences along all the projection directions is used as
the distance of the hash codes.
kxi −xj k22
where sij = exp (− ) is the similarity between xi
σ2
and xj . 5.7.2 Asymmetric distance
The second step is to learn a linear hash function for Let the compound hash function consist of K hash func-
each hash bit, which is optimized based on the elastic tions {hk (x) = bk (gk (x))}, where gk () is a real-valued
net estimator (for the mth hash function), embedding function and bk () is a binarization function.
N
X Asymmetric distance [32]presents two schemes. The first
t
min kyn − wm xn k22 + λ1 kwm k1 + λ2 kwm k22 . (104) one (Asymmetric distance I) is based on the expectation
w
n=1 ḡkb = E(gk (x)|hk (x) = bk (gk (x)) = b), where b = 0 and
Locally linear hashing [43] first learns binary codes b = 1. When performing an online search, a distance
that preserves the locally linear structures and then lookup table is precomputed:
introduces a locally linear extension algorithm for out-
of-sample extension. The objective function of the first {de (g1 (q), ḡ10 ), de (g1 (q), ḡ11 ), de (g2 (q), ḡ20 ),
step to obtain the binary embedding Y is given as de (g2 (q), ḡ21 ), · · · , de (gK (q), ḡK0 ), de (gK (q), ḡK1 ),
(109)
min trace(ZT MZ) + ηkY − ZRk2F (105)
Z,R,Y
where de (·, ·) is an Euclidean distance operation.
s. t. Y ∈ {1, −1}N ×M , RT R = I. (106) Then the distance is computed as dah (q, x) =
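A sketch (ours) of the distance used by Manhattan hashing: each group of bits along a projection direction is decoded to an integer bin index and the absolute differences of the indices are summed. Plain binary decoding is used here for illustration; the actual multi-bit encoding per direction may differ.

import numpy as np

def manhattan_hash_distance(c1, c2, bits_per_dim):
    # c1, c2: binary codes of length num_dims * bits_per_dim; each group of bits_per_dim
    # bits encodes the quantization bin index along one projection direction
    c1 = np.asarray(c1, dtype=np.int64).reshape(-1, bits_per_dim)
    c2 = np.asarray(c2, dtype=np.int64).reshape(-1, bits_per_dim)
    weights = 2 ** np.arange(bits_per_dim - 1, -1, -1)   # binary-to-integer decoding
    return int(np.abs(c1 @ weights - c2 @ weights).sum())

c1 = [0, 1,  1, 0,  1, 1]                                # bins (1, 2, 3), 2 bits per direction
c2 = [1, 0,  1, 0,  0, 0]                                # bins (2, 2, 0)
print(manhattan_hash_distance(c1, c2, bits_per_dim=2))   # |1-2| + |2-2| + |3-0| = 4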

TABLE 5
A summary of algorithms beyond Hamming distances in the coding space.

method input similarity distance measure


Manhattan hashing [61] E MD
Asymmetric distance I [32] E AED, SED
Asymmetric distance II [32] E LB
asymmetric Hamming embedding [44] E LB

PK
k=1 de (gk (q), ḡkhk (x) ), which can be speeded up e.g., is given as the following maximization problem,
by grouping the hash functions in blocks of 8 bits and 1 1
have one 256-dimensional look-up table per block (rather max trace(WT Dc DTc W) − trace(WT Dm DTm W)
W nc nm
than one 2-dimensional look-up table per hash function.) η
This reduces the number of summations as well as the + trace(WT Ys YsT W) − η trace(WT µµT W),
ns
number of lookup operations. (112)
The second scheme (Asymmetric distance II) is under
the assumption that bk (gk (x)) = δ[gk (x) > tk ], and where W is a projection matrix of size b × t. The first
computes the distance lower bound (similar way also term aims to maximize the differences between dissim-
adopted in asymmetric Hamming embedding [44]) over ilar pairs, and the second term aims to minimize the
the k-th hash function, differences between similar pairs. The last two terms are
maximized so that the bit distribution is balanced, which

|gk (q)| if hk (x) 6= hk (q) is derived by maximizing E[kWT (y − µ)k22 ], where µ
d(gk (q), bk (gk (x))) = represents the mean of the hash vectors, and Ys is a
0 otherwise.
(110) subset of input hash vectors with cardinality ns . [95]
Similar to the first one, the distance is computed as furthermore refines the hash vectors using the idea of
dah (q, x) = K supervised locality-preserving method based on graph
P
k=1 d(gk (q), bk (gk (x))). Similar hash func-
tion grouping scheme is used to speed up the search Laplacian.
efficiency.
6 L EARNING TO H ASH : Q UANTIZATION
5.7.3 Query sensitive hash code ranking This section focuses on the algorithms that are based on
quantization. The representative algorithms are summa-
Query sensitive hash code ranking [148] presented a rized in Table 6.
similar asymmetric scheme for R-neighbor search. This
method uses the PCA projection W to formulate the
hash functions sign(WT x) = sign(z). The similarity 6.1 1D Quantization
along the k projection is computed as This section reviews the hashing algorithms that focuses
on how to do the quantization along a projection direc-
P (zk yk > 0, |qk − zk | 6 R) tion (partitioning the projection values of the reference
sk (qk , yk , R) = , (111)
P (|qk − zk | 6 R)) data items along the direction into multiple parts).

which intuitively means that the fraction of the points 6.1.1 Transform coding
that lie in the range |qk − zk | 6 R and are mapped to Similar to spectral hashing, transform coding [10] first
yk over the points that lie in the range |qk − zk | 6 R. transforms the data using PCA and then assigns several
The similarity is computed with the assumption that bits to each principal direction. Different from spectral
p(zk ) is a Gaussian distribution.
QK The whole similar- hashing that uses Laplacian eigenvalues computed along
ity is then computed as k=1 sk (qk , yk , R), equivalently each direction to select Laplacian eigenfunctions to form
PK
k=1 log sk (qk , yk , R). The lookup table is also used to hash functions, transform coding first adopts bit alloca-
speed up the distance computation. tion to determine which principal direction is used and
how many bits are assigned to such a direction.
The bit allocation algorithm is given as follows in
5.7.4 Bit reconfiguration
Algorithm 1. To form the hash function, each selected
The goal of bits reconfiguration [95] is to learn a good principal direction i is quantized into 2mi clusters with
distance measure over the hash codes precomputed from the centers as {ci1 , ci2 , · · · , ci2mi }, where each center is
a pool of hash functions. Given the hash codes {yn }N n=1 represented by a binary code of length mi . Encoding an
with length M , the similar pairs M = {(i, j)} and the item consists of PCA projection followed by quantization
dissimilar pairs C = {(i, j)}, compute the difference of the components. the hash function can be formulated.
matrix Dm (Dc ) over M (C) each column of which corre- The distance between a query item and the hash code is
sponds to yi − yj , (i, j) ∈ M ((i, j) ∈ C). The formulation evaluated as the aggregation of the distance between the
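A hedged sketch (ours) of the first asymmetric distance scheme of Section 5.7.2: for each hash function k, the squared distances from g_k(q) to the two per-bucket expectations are precomputed once per query, and the distance to every database item is obtained by table lookups.

import numpy as np

def asymmetric_distance_I(q_proj, codes, gbar):
    # q_proj: (K,) real-valued embeddings g_k(q) of the query
    # codes:  (N, K) binary codes of the database items
    # gbar:   (K, 2) expectations E[g_k(x) | h_k(x) = b] for b = 0, 1
    lut = (q_proj[:, None] - gbar) ** 2              # (K, 2) lookup table, built once per query
    return lut[np.arange(len(q_proj)), codes].sum(axis=1)

rng = np.random.default_rng(0)
K, N = 8, 5
q_proj = rng.normal(size=K)
codes = rng.integers(0, 2, size=(N, K))
gbar = np.stack([rng.normal(size=K) - 1.0, rng.normal(size=K) + 1.0], axis=1)
print(asymmetric_distance_I(q_proj, codes, gbar))    # one distance per database item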

TABLE 6
A summary of quantization algorithms. sim. = similarity. dist. = distance.

method input sim. hash function dist. measure optimization criteria


transform coding [10] E OQ AED, SED BA
double-bit quantization [59] E OQ HD 3 partitions
iterative quantization [30], [31] E LI HD QE
isotropic hashing [60] E LI HD EV
harmonious hashing [138] E LI HD QE + EV
Angular quantization [29] CS LI NHA MCS
product quantization [49] E QU (A)ED QE
Cartesian k-means [102] E QU (A)ED QE
composite quantization [147] E QU (A)ED QE

Algorithm 1 Distribute M bits into the principal directions matrix of size d × M (M 6 d) computed using PCA, and
1. Initialization: ei ← log2 σi , mi ← 0. (2) find the hash codes as well as an optimal rotation R,
2. for j = 1 to b do
3. i ← arg max ei .
by solving the following optimization problem,
4. mi ← mi + 1.
5. ei ← ei − 1. min kY − RT Vk2F , (113)
6. end for
where V = [v1 v2 · · · vN ] and Y = [y1 y2 · · · yN ].
The problem is solved via alternative optimiza-
centers of the query and the database item along each tion. There are two alternative steps. Fixing R, Y =
selected principal direction, or the aggregation of the sign(RT V). Fixing B, the problem becomes the clas-
distance between the center of the database item and the sic orthogonal Procrustes problem, and the solution is
projection of the query of the corresponding principal R = ŜST , where S and Ŝ is obtained from the SVD of
direction along all the selected principal direction. YVT , YVT = SΛŜT .
We present an integrated objective function that is
6.1.2 Double-bit quantization able to explain the necessity of the first step. Let ȳ be
a d-dimensional vector, which is a concatenated vector
The double-bit quantization-based hashing
from y and an all-zero subvector: ȳ = [yT 0...0]T . The
algorithm [59] distributes two bits into each projection
integrated objective function is written as follows:
direction instead of one bit in ITQ or hierarchical
hashing [83]. Unlike transform coding quantizing the min kȲ − R̄T Xk2F , (114)
points into 2b clusters along each direction, double-bit
quantization conducts 3-cluster quantization, and then where Ȳ = [ȳ1 ȳ2 · · · ȳN ] X = [x1 x2 · · · xN ], and R̄ is a
assigns 01, 00, and 11 to each cluster so that the rotation matrix.
Let P̄ be the projection matrix of d×d, computed using
Hamming distance between the points belonging to
neighboring clusters is 1, and the Hamming distance PCA, P̄ = [PP− ]. It can be seen that, the solutions for
between the points not belonging to neighboring y of the two problems in 114 and 113 are the same, if
R̄ = P̄ Diag(R, I).
clusters is 2.
Local digit coding [62] represents each dimension of
6.2.2 Isotropic hashing
a point by a single bit, which is set to 1 if the value
of the dimension it corresponds to is larger than a The idea of isotropic hashing [60] is to rotate the space
threshold (derived from the mean of the corresponding so that the variance along each dimension is the same.
data points), and 0 otherwise. It consists of three steps: (1) reduce the dimension using
PCA to M dimensions, v = PT x, where P is a matrix of
size d × M (M 6 d) computed using PCA, and (2) find
6.2 Hypercubic Quantization an optimal rotation R, so that RT VVT R = Σ becomes
Hypercubic quantization refers to a category of algo- a matrix with equal diagonal values, i.e., [Σ]11 = [Σ]22 =
rithms that quantize a data item to a vertex in a hyper- · · · = [Σ]MM .
1
cubic, i.e., a vector belonging to {[y1 , y2 , · · · , yM ]|ym ∈ Let σ = M Trace VVT . The isotropic hashing algo-
{−1, 1}}. rithm then aims to find an rotation matrix, by solving
the following problem:
6.2.1 Iterative quantization
kRT VVT R − ZkF = 0, (115)
Iterative quantization [30], [31] aims to find the hash
codes such that the difference between the hash codes where Z is a matrix with all the diagonal entries equal
and the data items, by viewing each bit as the quantiza- to σ. The problem can be solved by two algorithms: lift
tion value along the corresponding dimension, is mini- and projection and gradient flow.
mized. It consists of two steps: (1) reduce the dimension The goal of making the variances along the M direc-
using PCA to M dimensions, v = PT x, where P is a tions same is to make the bits in the hash codes equally
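A compact numpy sketch of the iterative quantization alternation described above (our own illustration; the PCA step, the random orthogonal initialization, and the iteration count are common choices rather than values prescribed by the text).

import numpy as np

def itq(X, M, n_iter=50, seed=0):
    # X: (N, d) data (centered internally); M: code length
    # Returns the rotation R, the PCA projection P, and codes in {-1, +1}
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:M].T                                      # (d, M) PCA projection, v = P^T x
    V = P.T @ Xc.T                                    # (M, N) projected data
    R, _ = np.linalg.qr(rng.normal(size=(M, M)))      # random orthogonal initialization
    for _ in range(n_iter):
        Y = np.sign(R.T @ V)                          # fix R, update the codes
        Y[Y == 0] = 1
        U, _, Wt = np.linalg.svd(V @ Y.T)             # fix Y, update R: orthogonal Procrustes
        R = U @ Wt
    Y = np.sign(R.T @ V)
    Y[Y == 0] = 1
    return R, P, Y

X = np.random.default_rng(1).normal(size=(500, 32))
R, P, Y = itq(X, M=16)                                # Y has shape (16, 500)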

contributed to the distance evaluation. In the case that denominator kRT xn k2 :


the data items satisfy the isotropic Gaussian distribution, N
the solution from isotropic hashing is equivalent to
X ynT
max RT xn (125)
iterative quantization. R,{yn }
n=1
kyn k 2
Similar to generalized iterative quantization, the PCA s. t. yn ∈ {0, 1}M , (126)
preprocess in isotropic hashing is also interpretable: T
R R = IM . (127)
finding a global rotation matrix R̄ such that the first
M diagonal entries of Σ̄R̄T XXT R̄ are equal, and their The above problem is solved using alternative optimiza-
sum is as large as possible, which is formally written as tion.
follows,
M
X 6.3 Cartesian Quantization
max [Σ]mm (116) 6.3.1 Product quantization
m=1
The basic idea of product quantization [49] is to divide
s. t. [Σ] = σ, m = 1, · · · , M (117) the feature space into (P ) disjoint subspaces, thus the
T
R R = I. (118) database is divided into P sets, each set consisting
of N subvectors {xp1 , · · · , xpN }, and then to quan-
tize each subspace separately into (K) clusters. Let
6.2.3 Harmonious hashing
{cp1 , cp2 , · · · , cpK } be the cluster centers of the p sub-
Harmonious hashing [138] can be viewed as a combi- space, each of which can be encoded as a code of length
nation of ITQ and Isotropic hashing. The formulation is log2 K.
given as follows, A data item xn is divided into P subvectors {xpn },
and each subvector is assigned to the nearest center
min kY − RT Vk2F (119) cpkpn among the cluster centers of the pth subspace.
Y,R
Then the data item xn is represent by P subvec-
s. t. YYT = σI (120) tors {cpkpn }P p=1 , thus represented by a code of length
T P log2 K, k1n k2n · · · kP n . Product quantization can be
R R = I. (121)
viewed as minimizing the following objective function,
It is different from ITQ in that the formulation does not N
X
require Y to be a binary matrix. An iterative algorithm min kxn − Cbn k22 . (128)
is presented to optimize the above problem. Fixing R, C,{bn }
n=1
let RT V = UΛVT , then Y = σ 1/2 UVT . Fixing Y, R =
Here C is a matrix of d × P K in the form of
ŜST , where S and Ŝ is obtained from the SVD of YVT ,
C1 0 ··· 0
 
YVT = SΛŜT . Finally, Y is cut at zero, attaining binary
codes.  0 C2
 ··· 0 

diag(C1 , C2 , · · · , CP ) = 
 .. .. .. ..
, (129)
.

 . . . 
6.2.4 Angular quantization
0 0 ··· CP
Angular quantization [29] addresses the ANN search
where Cp = [cp1 cp2 · · · cpK ]. bn is the composition vector,
problem under the cosine similarity. The basic idea is
and its subvector bnp of length K is an indicator vector
to use the nearest vertex from the vertices of the binary
with only one entry being 1 and all others being 0, show-
hypercube {0, 1}d to approximate the data vector x,
yT x ing which element is selected from the pth dictionary for
arg maxy kbk 2
, subject to y ∈ {0, 1}d, which is shown quantization.
to be solved in O(d log d) time, and then to evaluate the Given a query vector xt , the distance to a vector xn ,
bT b
similarity kbq kx2 kbqx k2 in the Hamming space. represented by a code k1n k2n · · · kP n can be evaluated
The objective function of finding the binary codes, in symmetric and asymmetric ways. The symmetric dis-
similar to iterative quantization [30], is formulated as tance is computed as follows. First, the code of the query
below, xt is computed using the way similar to the database
vector, denoted by k1t k2t · · · kP t . Second, a distance table
N
X ynT RT xn is computed. The table consists of P K distance entries,
max (122) {dpk = kcpkpt − cpk k22 |p = 1, · · · , P, k = 1, · · · , K}.
R,{yn }
n=1
kyn k2 kRT xn k2
Finally, the distance of the query to the vector xn is
s. t. yn ∈ {0, 1}M , (123) computed by looking up the distance table and summing
T PP
R R = IM . (124) up P distances, p=1 dpkpn . The asymmetric distance
does not encode the query vector, directly computes the
Here R is a projection matrix of d × M . This is trans- distance table that also includes P K distance entries,
formed to an easily-solved problem by discarding the {dpk = kxpt − cpk k22 |p = 1, · · · , P, k = 1, · · · , K}, and
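A self-contained numpy sketch (ours) of product quantization as described above: split the vectors into P subvectors, run a small k-means per subspace, and compute asymmetric distances through per-subspace lookup tables. The tiny k-means routine and all parameter values are our own choices, and the dimension is assumed divisible by P.

import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    # a tiny Lloyd's algorithm, enough for illustration
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(assign == k):
                C[k] = X[assign == k].mean(axis=0)
    return C

def pq_train(X, P, K):
    sub = X.shape[1] // P
    return [kmeans(X[:, p * sub:(p + 1) * sub], K) for p in range(P)]

def pq_encode(X, codebooks):
    P = len(codebooks)
    sub = X.shape[1] // P
    codes = np.empty((len(X), P), dtype=np.int32)
    for p, C in enumerate(codebooks):
        block = X[:, p * sub:(p + 1) * sub]
        codes[:, p] = np.argmin(((block[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    return codes

def pq_asymmetric_distances(q, codes, codebooks):
    P = len(codebooks)
    sub = len(q) // P
    # P x K table of squared distances between the query subvectors and the centers
    table = np.stack([((q[p * sub:(p + 1) * sub] - C) ** 2).sum(-1)
                      for p, C in enumerate(codebooks)])
    return table[np.arange(P), codes].sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
books = pq_train(X, P=4, K=16)                        # each code entry indexes one of K centers
codes = pq_encode(X, books)                           # (2000, 4)
dists = pq_asymmetric_distances(rng.normal(size=32), codes, books)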

finally conducts the same step to the symmetric distance The problem is formulated as
evaluation, computing the distance as P
P
p=1 dpkpn . XN
Distance-encoded product quantization [39] extends min kxn − [C1 C2 · · · CP ]bn k22 (131)
{Cp },{bn },ǫ n=1
product quantization by encoding both the cluster index XP XP
and the distance between a point and its cluster center. s. t. bT CT Cj bnj =ǫ
i=1 j=1,j6=i ni i
The way of encoding the cluster index is similar to that in
product quantization. The way of encoding the distance bn = [bTn1 bTn2 · · · bTnP ]T
between a point and its cluster center is given as follows. bnp ∈ {0, 1}K , kbnp k1 = 1
Given a set of points belonging to a cluster, those points n = 1, 2, · · · , N, p = 1, 2, · · · P.
are partitioned (quantized) according to the distances to
the cluster center. Here, Cp is a matrix of size d × K, and each column
corresponds to an element of the pth dictionary Cp .
6.3.2 Cartesian k-means To get an easily optimization algorithm, the objective
function is transformed as
Cartesian k-means [102], [26] extends product quanti-
XN
zation and introduces a rotation R into the objective φ({Cp }, {bn }, ǫ) = kxn − Cbn k22
function, n=1
XN XP
N +µ ( bTni CTi Cj bnj − ǫ)2 , (132)
X n=1 i6=j
min kRT xn − Cbn k22 . (130)
R,C,{bn }
n=1
where µ is the penalty parameter, C = [C1 C2 · · · CP ]
PP PP PP
and i6=j = i=1 j=1,j6=i . The transformed problem
The introduced rotation does not affect the Euclidean is solved by alternative optimization.
distance as the Euclidean distance is invariant to the ro- The idea of using the summation of several dictionary
tation, and helps to find an optimized subspace partition items as an approximation of a data item has already
for quantization. been studied in the signal processing area, known as
The problem is solved by an alternative optimization multi-stage vector quantization, residual quantization,
algorithm. Each iteration alternatively solves C, {bn }, or more generally structured vector quantization [34],
and R. Fixing R, C and {bn } are solved using the same and recently re-developed for similarity search under the
way as the one in product quantization but with fewer Euclidean distance [5], [129] and inner product [22].
iterations and the necessity of reaching the converged
solution. Fixing C and {bn }, the problem of optimizing
R is the classic orthogonal Procrustes problem, also 7 L EARNING TO H ASH : OTHER TOPICS
occurring in iterative quantization. 7.1 Multi-Table Hashing
The database vector x_n under Cartesian k-means is represented by P subvectors {c_{p k_{pn}}}_{p=1}^P, and is thus encoded as k_{1n} k_{2n} \cdots k_{Pn}, with a single rotation matrix R shared by all database vectors (thus the rotation matrix does not increase the code length). Given a query vector x_t, it is first rotated as R^T x_t. Then the distance is computed in the same way as in product quantization. As rotating the query vector is done only once per query, its computation cost is negligible for a large database compared with the cost of computing the approximate distances to a large number of database vectors.
Locally optimized product quantization [58] applies Cartesian k-means to the search algorithm with an inverted index, where there is a separate quantizer for each inverted list.

6.3.3 Composite quantization

The basic ideas of composite quantization [147] consist of (1) approximating the database vector x_n using P vectors of the same dimension d, c_{1 k_{1n}}, c_{2 k_{2n}}, \cdots, c_{P k_{Pn}}, each selected from the K elements of one of the P source dictionaries {C_1, C_2, \cdots, C_P}, respectively, and (2) making the summation of the inner products of all pairs of elements that are used to approximate the vector but come from different dictionaries, \sum_{i=1}^P \sum_{j=1, j \neq i}^P c_{i k_{in}}^T c_{j k_{jn}}, be constant. The problem is formulated as

  \min_{\{C_p\}, \{b_n\}, \epsilon} \sum_{n=1}^N ||x_n - [C_1 C_2 \cdots C_P] b_n||_2^2    (131)
  s.t.  \sum_{i=1}^P \sum_{j=1, j \neq i}^P b_{ni}^T C_i^T C_j b_{nj} = \epsilon,
        b_n = [b_{n1}^T b_{n2}^T \cdots b_{nP}^T]^T,
        b_{np} \in \{0, 1\}^K,  ||b_{np}||_1 = 1,
        n = 1, 2, \cdots, N,  p = 1, 2, \cdots, P.

Here, C_p is a matrix of size d \times K, and each column corresponds to an element of the pth dictionary C_p. To obtain an easily optimized problem, the objective function is transformed as

  \phi(\{C_p\}, \{b_n\}, \epsilon) = \sum_{n=1}^N ||x_n - C b_n||_2^2 + \mu \sum_{n=1}^N \Big( \sum_{i \neq j} b_{ni}^T C_i^T C_j b_{nj} - \epsilon \Big)^2,    (132)

where \mu is the penalty parameter, C = [C_1 C_2 \cdots C_P], and \sum_{i \neq j} = \sum_{i=1}^P \sum_{j=1, j \neq i}^P. The transformed problem is solved by alternating optimization.
The idea of using the summation of several dictionary items as an approximation of a data item has already been studied in the signal processing area, known as multi-stage vector quantization, residual quantization, or more generally structured vector quantization [34], and has recently been re-developed for similarity search under the Euclidean distance [5], [129] and the inner product [22].

7 LEARNING TO HASH: OTHER TOPICS

7.1 Multi-Table Hashing

7.1.1 Complementary hashing

The purpose of complementary hashing [139] is to learn multiple hash tables such that nearest neighbors have a large probability of appearing in the same bucket in at least one hash table. The algorithm learns the hash functions for the multiple hash tables in a sequential way. The compound hash function for the first table is learnt by solving the same problem as in [125], formulated below:

  trace[W^T X_l S X_l^T W] + \eta trace[W^T X X^T W],    (133)

where s_{ij} is initialized as K(a_{ij} - \alpha), a_{ij} is the similarity between x_i and x_j, and \alpha is a super-constant.
To compute the second compound hash function, the same objective function is optimized but with a different matrix S:

  s_{ij}^t = \begin{cases} 0 & b_{ij}^a = b_{ij}^{(t-1)} \\ \min(s_{ij}, f_{ij}) & b_{ij}^a = 1,\ b_{ij}^{(t-1)} = -1 \\ -\min(-s_{ij}, f_{ij}) & b_{ij}^a = -1,\ b_{ij}^{(t-1)} = 1 \end{cases}    (134)

where f_{ij} = (a_{ij} - \alpha)(\frac{1}{4} d_h^{(t-1)}(x_i, x_j) - \beta), \beta is a super-constant, and b_{ij}^{(t-1)} = 1 - 2 sign[\frac{1}{4} d_h^{(t-1)}(x_i, x_j) - \beta]. Some tricks are also given to scale the problem up to large-scale databases.
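Roughly speaking, the reweighting in Eq. (134) zeroes out pairs that the previous tables already handle and keeps a capped weight on the mishandled pairs, so that the next table focuses on them. The sketch below illustrates this under assumed variable names and by reading sign[.] as a 0/1 indicator; it is not the authors' code.

```python
import numpy as np

def reweight_pairs(S, A, D_hamming, alpha, beta):
    """One sequential-table reweighting in the spirit of Eq. (134).

    S         : (n, n) current pair weights s_ij.
    A         : (n, n) pairwise similarities a_ij.
    D_hamming : (n, n) Hamming distances d_h^{(t-1)}(x_i, x_j) under the previous table.
    """
    B_true = np.where(A - alpha >= 0, 1, -1)                   # assumed label side: similar vs dissimilar
    B_prev = np.where(0.25 * D_hamming - beta < 0, 1, -1)      # +1 if the pair is close in Hamming space
    F = (A - alpha) * (0.25 * D_hamming - beta)

    S_new = np.zeros_like(S)                                   # correctly handled pairs get weight 0
    wrong_sim = (B_true == 1) & (B_prev == -1)                 # similar pair mapped far apart
    wrong_dis = (B_true == -1) & (B_prev == 1)                 # dissimilar pair mapped close together
    S_new[wrong_sim] = np.minimum(S[wrong_sim], F[wrong_sim])
    S_new[wrong_dis] = -np.minimum(-S[wrong_dis], F[wrong_dis])
    return S_new
```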
7.1.2 Reciprocal hash tables

The reciprocal hash tables approach [86] extends complementary hashing by building a graph over a pool of B hash functions (each outputting a binary value) and searching for the best hash functions over this graph to build a hash table, updating the graph weights using a boosting-style algorithm, and then finding the subsequent hash tables. Each vertex in the graph corresponds to a hash function and is associated with a weight indicating the degree to which similar pairs are mapped to the same binary value and dissimilar pairs are mapped to different binary values. The weight over the edge connecting two hash functions reflects the independence between the two hash functions: the weight is higher if the difference between the distributions of the binary values {-1, 1} computed from the two hash functions is larger. [87] shows how to formulate the hash bit selection problem as a quadratic program, which is derived from organizing the candidate bits in a graph.

7.2 Active and Online Hashing

7.2.1 Active hashing

Active hashing [150] starts with a small set of pairs of points with labeling information and actively selects the most informative labeled pairs for hash function learning. Given the sets of labeled data L, unlabeled data U, and candidate data C, the algorithm first learns the compound hash function h = sign(W^T x), and then computes the data certainty score for each point in the candidate set, f(x) = ||W^T x||_2, which reflects the distance of a point to the hyperplanes forming the hash functions. Points with smaller data certainty scores should be selected for further labeling. On the other hand, the selected points should not be similar to each other. To this end, the problem of finding the most informative points is formulated as follows,

  \min_{b} b^T \bar{f} + \frac{\lambda}{M} b^T K b    (135)
  s.t.  b \in \{0, 1\}^{|C|},    (136)
        ||b||_1 = M,    (137)

where b is an indicator vector in which b_i = 1 when x_i is selected and b_i = 0 when x_i is not selected, M is the number of points that need to be selected, \bar{f} is the vector of normalized certainty scores over the candidate set, with each element \bar{f}_i = f_i / \max_{j=1}^{|C|} f_j, K is the similarity matrix computed over C, and \lambda is the trade-off parameter.
data U, and candidate data C, the algorithm first learns
the compound hash function h = sign(WT x), and then 7.4.1 Bilinear projection
computes the data certainty score for each point in A bilinear projection algorithm is proposed in [28] to
the candidate set, f (x) = kWT xk2 , which reflects the hash a matrix feature to short codes. The (compound)
distance of a point to the hyperplane forming the hash hash function is defined as
functions. Points with smaller the data certainty scores
should be selected for further labeling. On the other vec(sign(RTl XRr )), (138)
hand, the selected points should not be similar to each
other. To this end, the problem of finding the most where X is a matrix of dl × dr , Rl of size dl × dl and Rr
informative points is formulated as the following, of size dr × dr are two random orthogonal matrices. It is
easy to show that
λ T
min bT f̄ + b Kb (135)
b M vec(RTl XRr ) = (RTr ⊗ RTl ) vec(X) = RT vec(X). (139)
s. t. b ∈ {0, 1}kCk (136)
kbT k1 = M, (137) The objective is to minimize the angle between
a rotated feature RT vec(X) and its binary encoding
where b is an indicator vector in which bi = 1 when sign(RT vec(X)) = vec(sign(RTl XRr )). The formulation
xi is selected and bi = 0 when xi is not selected, M is given as follows,
is the number of points that need to be selected, f̄ is
a vector of the normalized certainty scores over the N
candidate set, with each element f¯i = fi X
kCk ¯ , K is the
maxj=1 fj
max trace(Bn RTr XTn Rl ) (140)
Rl ,Rr ,{Bn }
similarity matrix computed over C, and λ is the trade-off n=1

parameter. s. t. Bn ∈ {−1, +1}dl×dr (141)


RTl Rl = I (142)
7.2.2 Online hashing RTr Rr = I, (143)
Online hashing [40] presents an algorithm to learn the
hash functions when the similar/dissimilar pairs come where Bn = sign(RTl Xn Rr ). The problem is optimized
sequentially rather than at the beginning, all the simi- by alternating between {Bn }, Rl and Rr . To reduce the
lar/dissimilar pairs come together. Smart hashing [142] code length, the low-dimensional orthogonal matrices
also addresses the problem when the similar/dissimilar can be used: Rl ∈ Rdl ×cl and Rr ∈ Rdr ×cr .
26

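A minimal sketch of the bilinear code computation in Eq. (138) is given below (illustrative shapes and names, not the authors' implementation); the point of Eq. (139) is that the two small projections are equivalent to one large rotation of vec(X) without ever forming the (d_l d_r) x (c_l c_r) matrix R explicitly.

```python
import numpy as np

def bilinear_code(X, R_l, R_r):
    """Compute vec(sign(R_l^T X R_r)) for a matrix feature X.

    X   : (d_l, d_r) matrix feature (e.g., a reshaped descriptor).
    R_l : (d_l, c_l) and R_r : (d_r, c_r) column-orthogonal projections.
    Returns a {-1, +1} code of length c_l * c_r.
    """
    projected = R_l.T @ X @ R_r                  # (c_l, c_r): two small multiplications
    return np.sign(projected).astype(np.int8).ravel()

rng = np.random.default_rng(2)
d_l, d_r, c_l, c_r = 64, 32, 16, 8
R_l, _ = np.linalg.qr(rng.normal(size=(d_l, c_l)))   # random column-orthogonal matrices
R_r, _ = np.linalg.qr(rng.normal(size=(d_r, c_r)))
code = bilinear_code(rng.normal(size=(d_l, d_r)), R_l, R_r)
print(code.shape)   # (128,) -- a 128-bit code from a 64 x 32 feature
```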
7.5 Compact Sparse Coding

Compact sparse coding [14], an extension of the earlier work on robust sparse coding [15], adopts sparse codes to represent the database items: the atom indices corresponding to nonzero codes are used to build the inverted index, and the nonzero coefficients are used to reconstruct the database items and to compute the approximate distances between the query and the database items. The sparse coding objective function, with the incoherence constraint on the dictionary introduced, is given as follows,

  \min_{C, \{z_n\}_{n=1}^N} \sum_{i=1}^N \Big( \frac{1}{2} ||x_i - \sum_{j=1}^K z_{ij} c_j||_2^2 + \lambda ||z_i||_1 \Big)    (144)
  s.t.  ||C_{\sim k}^T c_k||_\infty \le \gamma;  k = 1, 2, \cdots, K,    (145)

where C is the dictionary, C_{\sim k} is the dictionary C with the kth atom removed, and {z_n}_{n=1}^N are the N sparse codes. The constraint ||C_{\sim k}^T c_k||_\infty \le \gamma aims to control the coherence degree of the dictionary.
The support of x_n is defined as the indices corresponding to the nonzero coefficients in z_n: b_n = \delta.[z_n \neq 0], where \delta.[\cdot] is an element-wise indicator operation. The introduced approach uses {b_n}_{n=1}^N to build the inverted indices, which is similar to min-hash, and also uses the Jaccard similarity to get the search results. Finally, the asymmetric distances between the query and the results retrieved using the Jaccard similarity, ||q - B z_n||_2, are computed for reranking.

7.6 Fast Search in Hamming Space

7.6.1 Multi-index hashing

The idea [104] is that the binary codes in the reference database are indexed M times into M different hash tables, based on M disjoint binary substrings. Given a query binary code, entries that fall close to the query in at least one substring are considered neighbor candidates. Specifically, each code y is split into M disjoint subcodes {y_1, \cdots, y_M}. For each subcode y_m, one hash table is built, where each entry corresponds to the list of indices of the binary codes whose mth subcode is equal to the code associated with this entry.
To find the R-neighbors of a query q with substrings {q_m}_{m=1}^M, the algorithm searches the mth hash table for entries that are within a Hamming distance \lfloor R/M \rfloor of q_m, thereby retrieving a set of candidates, denoted by N_m(q), and thus a set of final candidates, N = \cup_{m=1}^M N_m(q). Lastly, the algorithm computes the Hamming distance between q and each candidate, retaining only those codes that are true R-neighbors of q. [104] also discusses how to choose the optimal number M of substrings.
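A compact sketch of the multi-index lookup described above is shown below, with hash tables implemented as Python dictionaries keyed by subcode bytes; the names and the brute-force enumeration of nearby subcodes are illustrative assumptions, not the implementation of [104].

```python
import itertools
import numpy as np

def build_tables(codes, M):
    """codes: (N, L) array of 0/1 bits; index each code M times by its M substrings."""
    N, L = codes.shape
    chunks = np.array_split(np.arange(L), M)
    tables = [dict() for _ in range(M)]
    for n in range(N):
        for m, idx in enumerate(chunks):
            key = codes[n, idx].tobytes()
            tables[m].setdefault(key, []).append(n)
    return tables, chunks

def r_neighbors(q, codes, tables, chunks, R):
    """Return indices of database codes within Hamming distance R of query code q."""
    M = len(tables)
    r = R // M                                   # substring search radius floor(R/M)
    candidates = set()
    for m, idx in enumerate(chunks):
        sub = q[idx].copy()
        # probe every subcode within Hamming distance r of the query substring
        for d in range(r + 1):
            for flips in itertools.combinations(range(len(idx)), d):
                probe = sub.copy()
                probe[list(flips)] ^= 1
                candidates.update(tables[m].get(probe.tobytes(), []))
    # verify the candidates against the full code
    return [n for n in candidates if np.count_nonzero(codes[n] ^ q) <= R]

rng = np.random.default_rng(3)
codes = rng.integers(0, 2, size=(2000, 64), dtype=np.uint8)
tables, chunks = build_tables(codes, M=4)
print(r_neighbors(codes[0], codes, tables, chunks, R=8))   # always contains 0 itself
```

By the pigeonhole argument, any code within Hamming distance R of the query matches at least one of its M substrings within distance floor(R/M), so no true R-neighbor is missed.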
7.6.2 FLANN

[100] extends the FLANN algorithm [99], which was initially designed for ANN search over real-valued vectors, to search over binary vectors. The key idea is to build multiple hierarchical cluster trees to organize the binary vectors and to search for the nearest neighbors simultaneously over the multiple trees.
The tree building process starts with all the points and divides them into K clusters, with the cluster centers randomly selected from the input points and each point assigned to the center that is closest to it. The algorithm is repeated recursively for each of the resulting clusters until the number of points in a cluster is below a certain threshold, in which case that node becomes a leaf node. The whole process is repeated several times, yielding multiple trees.
The search process starts with a single traversal of each of the trees, during which the algorithm always picks the node closest to the query point and recursively explores it, while adding the unexplored nodes to a priority queue. When reaching a leaf node, all the points contained within it are linearly searched. After each of the trees has been explored once, the search is continued by extracting from the priority queue the node closest to the query point and resuming the tree traversal from there. The search ends when the number of points examined exceeds a maximum limit.

8 DISCUSSIONS AND FUTURE TRENDS

8.1 Scalable Hash Function Learning

The algorithms depending on pairwise similarity, such as binary reconstructive embedding, usually sample a small subset of pairs to reduce the cost of learning hash functions. It is shown that the search accuracy is increased with a higher sampling rate, but the training cost is greatly increased. The algorithms that do not rely on pairwise similarity are also shown to be slow, and even infeasible, when handling very large data, e.g., 1B data items, and usually learn the hash functions over a small subset, e.g., 1M data items. This poses a challenging request: learning the hash functions over larger datasets.

8.2 Hash Code Computation Speedup

Existing hashing algorithms rarely take into consideration the cost of encoding a data item. Such a cost during the query stage becomes significant in the case that only a small number of database items or a small database is compared to the query. The search scheme combining an inverted index and compact codes is such a case. A recent work, circulant binary embedding [143], formulates the projection matrix (the weights in the hash function) using a circulant matrix R = circ(r). The compound hash function is formulated as h(x) = sign(R^T x), where the computation is accelerated using the fast Fourier transform, with the time cost reduced from O(d^2) to O(d log d). More research is expected on speeding up the hash code computation for other hashing algorithms, such as composite quantization.
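To illustrate why the circulant structure helps, the following sketch (assumed names, not the code of [143]) computes sign(circ(r)^T x) with FFTs instead of an explicit d x d matrix-vector product; a circulant product is a circular convolution, so its transpose is a circular correlation.

```python
import numpy as np

def circulant(r):
    """Explicit circulant matrix with first column r (for checking only; O(d^2))."""
    d = len(r)
    return np.stack([np.roll(r, i) for i in range(d)], axis=1)

def circulant_hash(x, r):
    """h(x) = sign(R^T x) with R = circ(r), computed in O(d log d) via the FFT.

    R^T x is a circular correlation, so it equals ifft(conj(fft(r)) * fft(x)).
    """
    y = np.fft.ifft(np.conj(np.fft.fft(r)) * np.fft.fft(x)).real
    return np.sign(y)

rng = np.random.default_rng(4)
d = 8
r, x = rng.normal(size=d), rng.normal(size=d)
# sanity check against the explicit O(d^2) product
assert np.allclose(circulant(r).T @ x,
                   np.fft.ifft(np.conj(np.fft.fft(r)) * np.fft.fft(x)).real)
print(circulant_hash(x, r))
```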
8.3 Distance Table Computation Speedup

Product quantization and its variants need to precompute the distance table between the query and the elements of the dictionaries. Existing algorithms claim that the cost of the distance table computation is negligible. In practice, however, the cost becomes larger when the codes computed from quantization are used to rank the candidates retrieved from an inverted index. This is a research direction that will attract research interest.

8.4 Multiple and Cross Modality Hashing

One important characteristic of big data is the variety of data types and data sources. This is particularly true for multimedia data, where various media types (e.g., video, image, audio and hypertext) can be described by many different low- and high-level features, and relevant multimedia objects may come from different data sources contributed by different users and organizations. This raises a research direction: performing joint-modality hashing learning by exploiting the relation among multiple modalities, to support special applications such as cross-modal search. This topic is attracting a lot of research effort, such as collaborative hashing [85] and cross-media hashing [120], [121], [152].

9 CONCLUSION

In this paper, we review two categories of hashing algorithms developed for similarity search: locality sensitive hashing and learning to hash, and show how they are designed to conduct similarity search. We also point out the future trends of hashing for similarity search.

REFERENCES

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459-468, 2006. 4
[2] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008. 5
[3] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn. Beyond locality-sensitive hashing. In SODA, pages 1018-1028, 2014. 4
[4] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios. Nearest neighbor retrieval using distance-based hashing. In ICDE, pages 327-336, 2008. 8
[5] A. Babenko and V. Lempitsky. Additive quantization for extreme vector compression. In CVPR, pages 931-939, 2014. 24
[6] S. Baluja and M. Covell. Learning "forgiving" hash functions: Algorithms and large scale tests. In IJCAI, pages 2663-2669, 2007. 17
[7] S. Baluja and M. Covell. Learning to hash: forgiving hash functions and applications. Data Min. Knowl. Discov., 17(3):402-430, 2008. 17
[8] S. Baluja and M. Covell. Beyond "near duplicates": Learning hash codes for efficient similar-image retrieval. In ICPR, pages 543-547, 2010. 17
[9] M. Bawa, T. Condie, and P. Ganesan. Lsh forest: self-tuning indexes for similarity search. In WWW, pages 651-660, 2005. 8
[10] J. Brandt. Transform coding for fast approximate nearest neighbor search in high dimensions. In CVPR, pages 1815-1822, 2010. 21, 22
[11] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES '97, pages 21-29, Washington, DC, USA, 1997. IEEE Computer Society. 3, 6
[12] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157-1166, 1997. 3, 6
[13] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380-388, 2002. 3, 5
[14] A. Cherian. Nearest neighbors using compact sparse codes. In ICML (2), pages 1053-1061, 2014. 26
[15] A. Cherian, V. Morellas, and N. Papanikolopoulos. Robust sparse hashing. In ICIP, pages 2417-2420, 2012. 26
[16] F. Chierichetti and R. Kumar. Lsh-preserving functions and their applications. In SODA, pages 1078-1094, 2012. 3
[17] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In CIVR, pages 549-556, 2007. 3
[18] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, pages 1-10, 2008. 3
[19] A. Dasgupta, R. Kumar, and T. Sarlós. Fast locality-sensitive hashing. In KDD, pages 1073-1081, 2011. 3, 9
[20] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253-262, 2004. 3, 4
[21] W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li. Modeling lsh for performance tuning. In CIKM, pages 669-678, 2008. 9
[22] C. Du and J. Wang. Inner product similarity search using compositional codes. CoRR, abs/1406.4966, 2014. 24
[23] K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In KDD, pages 221-229, 2008. 5, 25
[24] L. Fan. Supervised binary hash code learning with jensen shannon divergence. In ICCV, pages 2616-2623, 2013. 15
[25] J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD Conference, pages 541-552, 2012. 3, 9
[26] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, pages 2946-2953, 2013. 24
[27] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518-529, 1999. 3
[28] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In CVPR, pages 484-491, 2013. 25
[29] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik. Angular quantization-based binary codes for fast similarity search. In NIPS, pages 1205-1213, 2012. 3, 22, 23
[30] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817-824, 2011. 3, 12, 22, 23
[31] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916-2929, 2013. 3, 12, 22
[32] A. Gordo, F. Perronnin, Y. Gong, and S. Lazebnik. Asymmetric distances for binary embeddings. IEEE Trans. Pattern Anal. Mach. Intell., 36(1):33-47, 2014. 20
[33] D. Gorisse, M. Cord, and F. Precioso. Locality-sensitive hashing for chi2 distance. IEEE Trans. Pattern Anal. Mach. Intell., 34(2):402-409, 2012. 7
[34] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325-2383, 1998. 24
[35] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer. Compact hashing with joint optimization of search accuracy and time. In CVPR, pages 753-760, 2011. 12, 13
[36] J. He, S. Kumar, and S.-F. Chang. On the difficulty of nearest neighbor search. In ICML, 2012. 10
[37] J. He, W. Liu, and S.-F. Chang. Scalable similarity search with optimized kernel hashing. In KDD, pages 1129-1138, 2010. 12, 13
[38] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In CVPR, pages 2957-2964, 2012. 12, 18
[39] J.-P. Heo, Z. Lin, and S.-E. Yoon. Distance encoded product quantization. In CVPR, pages 2139-2146, 2014. 24
[40] L.-K. Huang, Q. Yang, and W.-S. Zheng. Online hashing. In IJCAI, 2013. 25
[41] Y. Hwang, B. Han, and H.-K. Ahn. A fast nearest neighbor search algorithm by nonlinear embedding. In CVPR, pages 3053-3060, 2012. 19
[42] P. Indyk and R. Motwani. Approximate nearest neighbors:
Towards removing the curse of dimensionality. In STOC, pages with semantically consistent graph for image indexing. IEEE
604–613, 1998. 3, 6 Transactions on Multimedia, 15(1):141–152, 2013. 12
[43] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang. Locally linear hashing [75] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. R. Dick.
for extracting non-linear manifolds. In CVPR, pages 2123–2130, Learning hash functions using column generation. In ICML (1),
2014. 20 pages 142–150, 2013. 12, 17
[44] M. Jain, H. Jégou, and P. Gros. Asymmetric hamming embed- [76] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast
ding: taking the best of our bits for large scale image search. In supervised hashing with decision trees for high-dimensional
ACM Multimedia, pages 1441–1444, 2011. 21 data. In CVPR, pages 1971–1978, 2014. 20
[45] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric [77] G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general
learning and fast similarity search. In NIPS, pages 761–768, 2008. two-step approach to learning-based hashing. In ICCV, pages
5 2552–2559, 2013. 20
[46] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned [78] R.-S. Lin, D. A. Ross, and J. Yagnik. Spec hashing: Similarity
metrics. In CVPR, 2008. 5 preserving algorithm for entropy-based coding. In CVPR, pages
[47] P. Jain, S. Vijayanarasimhan, and K. Grauman. Hashing hy- 848–854, 2010. 12, 15
perplane queries to near points with applications to large-scale [79] Y. Lin, D. Cai, and C. Li. Density sensitive hashing. CoRR,
active learning. In NIPS, pages 928–936, 2010. 6 abs/1205.2930, 2012. 12, 18
[48] H. Jégou, L. Amsaleg, C. Schmid, and P. Gros. Query adaptative [80] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li. Compressed hashing. In
locality sensitive hashing. In ICASSP, pages 825–828, 2008. 8 CVPR, pages 446–451, 2013. 15
[49] H. Jégou, M. Douze, and C. Schmid. Product quantization for [81] D. Liu, S. Yan, R.-R. Ji, X.-S. Hua, and H.-J. Zhang. Image
nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., retrieval with query-adaptive hashing. TOMCCAP, 9(1):2, 2013.
33(1):117–128, 2011. 22, 23 12, 19
[50] H. Jégou, T. Furon, and J.-J. Fuchs. Anti-sparse coding for [82] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised
approximate nearest neighbor search. In ICASSP, pages 2029– hashing with kernels. In CVPR, pages 2074–2081, 2012. 3, 12, 15
2032, 2012. 19 [83] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with
[51] J. Ji, J. Li, S. Yan, Q. Tian, and B. Zhang. Min-max hash for graphs. In ICML, pages 1–8, 2011. 3, 12, 14, 15, 22
jaccard similarity. In ICDM, pages 301–309, 2013. 3, 6 [84] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang. Compact
[52] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian. Super-bit locality- hyperplane hashing with bilinear functions. In ICML, 2012. 12,
sensitive hashing. In NIPS, pages 108–116, 2012. 3, 5 15, 16
[53] Y.-G. Jiang, J. Wang, and S.-F. Chang. Lost in binarization: query- [85] X. Liu, J. He, C. Deng, and B. Lang. Collaborative hashing. In
adaptive ranking for similar image search with compact codes. CVPR, pages 2147–2154, 2014. 27
In ICMR, page 16, 2011. 12, 19 [86] X. Liu, J. He, and B. Lang. Reciprocal hash tables for nearest
[54] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang. Query-adaptive neighbor search. In AAAI, 2013. 25
image search with hash codes. IEEE Transactions on Multimedia, [87] X. Liu, J. He, B. Lang, and S.-F. Chang. Hash bit selection: A
15(2):442–453, 2013. 12, 19 unified solution for selection problems in hashing. In CVPR,
[55] Z. Jin, Y. Hu, Y. Lin, D. Zhang, S. Lin, D. Cai, and X. Li. pages 1570–1577, 2013. 25
Complementary projection hashing. In ICCV, pages 257–264, [88] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen. Sk-lsh: An
2013. 12, 17 efficient index structure for approximate nearest neighbor search.
[56] A. Joly and O. Buisson. A posteriori multi-probe locality sensi- PVLDB, 7(9):745–756, 2014. 9
tive hashing. In ACM Multimedia, pages 209–218, 2008. 8 [89] Y. Liu, J. Shao, J. Xiao, F. Wu, and Y. Zhuang. Hypergraph
[57] A. Joly and O. Buisson. Random maximum margin hashing. In spectral hashing for image retrieval with heterogeneous social
CVPR, pages 873–880, 2011. 12, 18 contexts. Neurocomputing, 119:49–58, 2013. 12, 13
[58] Y. Kalantidis and Y. Avrithis. Locally optimized product quanti- [90] Y. Liu, F. Wu, Y. Yang, Y. Zhuang, and A. G. Hauptmann. Spline
zation for approximate nearest neighbor search. In CVPR, pages regression hashing for fast image search. IEEE Transactions on
2329–2336, 2014. 24 Image Processing, 21(10):4480–4491, 2012. 19
[59] W. Kong and W.-J. Li. Double-bit quantization for hashing. In [91] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-
AAAI, 2012. 22 probe lsh: Efficient indexing for high-dimensional similarity
[60] W. Kong and W.-J. Li. Isotropic hashing. In NIPS, pages 1655– search. In VLDB, pages 950–961, 2007. 3, 8
1663, 2012. 12, 22 [92] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates
[61] W. Kong, W.-J. Li, and M. Guo. Manhattan hashing for large- for web crawling. In WWW, pages 141–150, 2007. 3
scale image retrieval. In SIGIR, pages 45–54, 2012. 20, 21 [93] Y. Matsushita and T. Wada. Principal component hashing: An
[62] N. Koudas, B. C. Ooi, H. T. Shen, and A. K. H. Tung. Ldc: accelerated approximate nearest neighbor search. In PSIVT,
Enabling search by partial distance in a hyper-dimensional pages 374–385, 2009. 12
space. In ICDE, pages 6–17, 2004. 22 [94] R. Motwani, A. Naor, and R. Panigrahy. Lower bounds on
[63] B. Kulis and T. Darrell. Learning to hash with binary reconstruc- locality sensitive hashing. SIAM J. Discrete Math., 21(4):930–935,
tive embeddings. In NIPS, pages 1042–1050, 2009. 12, 15 2007. 3, 4
[64] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing [95] Y. Mu, X. Chen, X. Liu, T.-S. Chua, and S. Yan. Multimedia
for scalable image search. In ICCV, pages 2130–2137, 2009. 5 semantics-aware query-adaptive hashing with bits reconfigura-
[65] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. bility. IJMIR, 1(1):59–70, 2012. 21
IEEE Trans. Pattern Anal. Mach. Intell., 34(6):1092–1104, 2012. 5 [96] Y. Mu, J. Shen, and S. Yan. Weakly-supervised hashing in kernel
[66] B. Kulis, P. Jain, and K. Grauman. Fast similarity search space. In CVPR, pages 3344–3351, 2010. 12, 17
for learned metrics. IEEE Trans. Pattern Anal. Mach. Intell., [97] Y. Mu, J. Wright, and S.-F. Chang. Accelerated large scale
31(12):2143–2157, 2009. 5 optimization by concomitant hashing. In ECCV (1), pages 414–
[67] P. Li and K. W. Church. A sketch algorithm for estimating 427, 2012. 25
two-way and multi-way associations. Computational Linguistics, [98] Y. Mu and S. Yan. Non-metric locality-sensitive hashing. In
33(3):305–354, 2007. 6 AAAI, 2010. 7
[68] P. Li, K. W. Church, and T. Hastie. Conditional random sampling: [99] M. Muja and D. G. Lowe. Fast approximate nearest neighbors
A sketch-based sampling technique for sparse data. In NIPS, with automatic algorithm configuration. In VISSAPP (1), pages
pages 873–880, 2006. 3, 6 331–340, 2009. 26
[69] P. Li, T. Hastie, and K. W. Church. Very sparse random projec- [100] M. Muja and D. G. Lowe. Fast matching of binary features. In
tions. In KDD, pages 287–296, 2006. 3 CRV, pages 404–410, 2012. 26
[70] P. Li and A. C. König. b-bit minwise hashing. In WWW, pages [101] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact
671–680, 2010. 3, 7 binary codes. In ICML, pages 353–360, 2011. 12, 16
[71] P. Li, A. C. König, and W. Gui. b-bit minwise hashing for [102] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, pages
estimating three-way similarities. In NIPS, pages 1387–1395, 3017–3024, 2013. 22, 24
2010. 3, 7 [103] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance
[72] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random metric learning. In NIPS, pages 1070–1078, 2012. 12, 16
projections. In ICML (2), pages 676–684, 2014. 4 [104] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in hamming
[73] P. Li, A. B. Owen, and C.-H. Zhang. One permutation hashing. space with multi-index hashing. In CVPR, pages 3108–3115,
In NIPS, pages 3122–3130, 2012. 3, 6 2012. 26
[74] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu. Spectral hashing [105] R. O’Donnell, Y. Wu, and Y. Zhou. Optimal lower bounds for

locality sensitive hashing (except when q is tiny). In ICS, pages [133] S. Wang, S. Jiang, Q. Huang, and Q. Tian. S3mkl: scalable semi-
275–283, 2011. 3, 4 supervised multiple kernel learning for image data mining. In
[106] J. Pan and D. Manocha. Bi-level locality sensitive hashing for ACM Multimedia, pages 163–172, 2010. 5
k-nearest neighbor computation. In ICDE, pages 378–389, 2012. [134] Y. Weiss, R. Fergus, and A. Torralba. Multidimensional spectral
9 hashing. In ECCV (5), pages 340–353, 2012. 12, 13, 18
[107] R. Panigrahy. Entropy based nearest neighbor search in high [135] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS,
dimensions. In SODA, pages 1186–1195, 2006. 3, 8 pages 1753–1760, 2008. 3, 11, 12, 13, 15, 18, 19
[108] L. Paulevé, H. Jégou, and L. Amsaleg. Locality sensitive hashing: [136] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu. Semi-supervised
A comparison of hash function types and querying mechanisms. nonlinear hashing using bootstrap sequential projection learning.
Pattern Recognition Letters, 31(11):1348–1358, 2010. 4 IEEE Trans. Knowl. Data Eng., 25(6):1380–1393, 2013. 14
[109] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes [137] H. Xia, P. Wu, S. C. H. Hoi, and R. Jin. Boosting multi-kernel
from shift-invariant kernels. In NIPS, pages 1509–1517, 2009. 7 locality-sensitive hashing for scalable image retrieval. In SIGIR,
[110] A. Rahimi and B. Recht. Random features for large-scale kernel pages 55–64, 2012. 5
machines. In NIPS, 2007. 7 [138] B. Xu, J. Bu, Y. Lin, C. Chen, X. He, and D. Cai. Harmonious
[111] R. Salakhutdinov and G. E. Hinton. Semantic hashing. In SIGIR hashing. In IJCAI, 2013. 22, 23
workshop on Information Retrieval and applications of Graphical [139] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu. Complementary
Models, 2007. 19 hashing for approximate nearest neighbor search. In ICCV, pages
[112] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. 1631–1638, 2011. 3, 17, 24
Approx. Reasoning, 50(7):969–978, 2009. 19 [140] J. Yagnik, D. Strelow, D. A. Ross, and R.-S. Lin. The power of
[113] V. Satuluri and S. Parthasarathy. Bayesian locality sensitive comparative reasoning. In ICCV, pages 2431–2438, 2011. 7
hashing for fast similarity search. PVLDB, 5(5):430–441, 2012. [141] H. Yang, X. Bai, J. Zhou, P. Ren, Z. Zhang, and J. Cheng.
9 Adaptive object retrieval with kernel reconstructive hashing. In
[114] G. Shakhnarovich. Learning Task-Specific Similarity. PhD thesis, CVPR, pages 1955–1962, 2014. 15
Department of Electrical Engineering and Computer Science, [142] Q. Yang, L.-K. Huang, W.-S. Zheng, and Y. Ling. Smart hashing
Massachusetts Institute of Technology, 2005. 12, 17 update for fast response. In IJCAI, 2013. 25
[115] G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estima- [143] F. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant binary
tion with parameter-sensitive hashing. In ICCV, pages 750–759, embedding. In ICML (2), pages 946–954, 2014. 26
2003. 12, 17 [144] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for
[116] J. Shao, F. Wu, C. Ouyang, and X. Zhang. Sparse spectral fast similarity search. In SIGIR, pages 18–25, 2010. 20
hashing. Pattern Recognition Letters, 33(3):271–277, 2012. 13 [145] L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian. Topology
[117] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang. preserving hashing for similarity search. In ACM Multimedia,
Inductive hashing on manifolds. In CVPR, pages 1562–1569, pages 123–132, 2013. 12, 14
2013. 19 [146] L. Zhang, Y. Zhang, D. Zhang, and Q. Tian. Distribution-aware
[118] A. Shrivastava and P. Li. Densifying one permutation hashing locality sensitive hashing. In MMM (2), pages 395–406, 2013. 5
via rotation for fast near neighbor. In ICML (1), page 557565, [147] T. Zhang, C. Du, and J. Wang. Composite quantization for
2014. 3, 6 approximate nearest neighbor search. In ICML (2), pages 838–
[119] M. Slaney, Y. Lifshits, and J. He. Optimal parameters for locality- 846, 2014. 22, 24
sensitive hashing. Proceedings of the IEEE, 100(9):2604–2623, 2012. [148] X. Zhang, L. Zhang, and H.-Y. Shum. Qsrank: Query-sensitive
10 hash code ranking for efficient neighbor search. In CVPR, pages
[120] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective 2058–2065, 2012. 21
multiple feature hashing for large-scale near-duplicate video [149] W.-L. Zhao, H. Jégou, and G. Gravier. Sim-min-hash: an efficient
retrieval. IEEE Transactions on Multimedia, 15(8):1997–2008, 2013. matching technique for linking large image collections. In ACM
27 Multimedia, pages 577–580, 2013. 7
[121] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter- [150] Y. Zhen and D.-Y. Yeung. Active hashing and its application to
media hashing for large-scale retrieval from heterogeneous data image and text retrieval. Data Min. Knowl. Discov., 26(2):255–274,
sources. In SIGMOD Conference, pages 785–796, 2013. 27 2013. 25
[122] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. [151] X. Zhu, Z. Huang, H. Cheng, J. Cui, and H. T. Shen. Sparse
Ldahash: Improved matching with smaller descriptors. IEEE hashing for fast multimedia search. ACM Trans. Inf. Syst., 31(2):9,
Trans. Pattern Anal. Mach. Intell., 34(1):66–78, 2012. 12, 14 2013. 20
[123] K. Terasawa and Y. Tanaka. Spherical lsh for approximate nearest [152] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao. Linear cross-modal
neighbor search on unit hypersphere. In WADS, pages 27–38, hashing for efficient multimedia search. In ACM Multimedia,
2007. 4 pages 143–152, 2013. 27
[124] S. Vijayanarasimhan, P. Jain, and K. Grauman. Hashing hy- [153] Y. Zhuang, Y. Liu, F. Wu, Y. Zhang, and J. Shao. Hypergraph
perplane queries to near points with applications to large-scale spectral hashing for similarity search of social image. In ACM
active learning. IEEE Trans. Pattern Anal. Mach. Intell., 36(2):276– Multimedia, pages 1457–1460, 2011. 12, 13
288, 2014. 6
[125] J. Wang, O. Kumar, and S.-F. Chang. Semi-supervised hashing
for scalable image retrieval. In CVPR, pages 3424–3431, 2010. 3,
12, 13, 14, 24
[126] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection
learning for hashing with compact codes. In ICML, pages 1127–
1134, 2010. 3, 12, 13, 14
[127] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing
for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell.,
34(12):2393–2406, 2012. 3, 12, 13, 14
[128] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang. Learning hash codes
with listwise supervision. In ICCV, pages 3032–3039, 2013. 12,
16
[129] J. Wang, J. Wang, J. Song, X.-S. Xu, H. T. Shen, and S. Li.
Optimized cartesian k-means. CoRR, abs/1405.4054, 2014. 24
[130] J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for
approximate nearest neighbor search. In ACM Multimedia, pages
133–142, 2013. 3, 12, 16
[131] Q. Wang, D. Zhang, and L. Si. Weighted hashing for fast large
scale similarity search. In CIKM, pages 1185–1188, 2013. 12, 18
[132] S. Wang, Q. Huang, S. Jiang, and Q. Tian. S3 mkl: Scalable
semi-supervised multiple kernel learning for real-world image
applications. IEEE Transactions on Multimedia, 14(4):1259–1274,
2012. 5