Figure 2. The probability of a query q's neighbor, p, being mapped to a bit different from h(q) by the hash function h(x) = sgn(f(x) − T). The abscissa of the Left plot is |f(q) − T|; the abscissa of the Right plot is the standard deviation of the distribution of f(p) − f(q).

Figure 3. The Left gives an example where the Hamming distance causes ambiguity for binary code ranking. The Right illustrates the probability of a neighbor of q being mapped to a different bit by the hash function defined by f_k(x) and the binarization threshold T_k.
bit-flipping increases with |f_k(q) − T_k| for the most part. As a result, ω_k in eq. (1) depends not only on the hash function h_k(x) but also on the specified data point q; on this account, ω_k is also a function of q, and eq. (1) can be rewritten as:

ω_k(q) = g(μ_k, σ_k, q)    (2)

Moreover, the smaller |f_k(q) − T_k|, the larger the probability of bit-flipping on hash bit k, and thus the smaller ω_k. Therefore, ω_k(q) should also be monotonically non-decreasing w.r.t. |f_k(q) − T_k|.

3.3. Dynamic Bit-level Weighting

In the previous Sections 3.1 and 3.2, we showed that an effective bit-level weight is not only data-dependent but also query-dependent. In this section, we give a simple method to calculate the data-adaptive and query-sensitive bit-level weight ω_k(q) of each hash bit k for a given query q, and we show theoretically that ω_k(q) satisfies the abovementioned constraints. The intuition behind our method is: given a query q and two binary codes h^(1) and h^(2), after adding a random noise ñ to q, if the probability of H(q + ñ) = h^(1), denoted as Pr(h^(1) | H(q)), is larger than Pr(h^(2) | H(q)), then the data points mapped to h^(1) are considered to be more similar neighbors of q than those mapped to h^(2), which means the weighted Hamming distance D_H^w(H(q), h^(1)) should be smaller than D_H^w(H(q), h^(2)). Therefore, given a query q and a binary code h, a function parameterized by Pr(h | H(q)) is used as a probabilistic interpretation of D_H^w(H(q), h). This function should be monotonically non-increasing w.r.t. Pr(h | H(q)). Furthermore, if Pr(h | H(q)) ≈ 1, D_H^w(H(q), h) should be small, and if Pr(h | H(q)) ≈ 0, D_H^w(H(q), h) should be relatively large. A well-known function that satisfies these constraints is the information entropy. As a result, given a query q, the weighted Hamming distance between H(q) and a binary code h is defined as follows:

D_H^w(H(q), h) = −log Pr(h | H(q))    (3)

Assuming all hash bits are independent [22], we have:

Pr(h | H(q)) = ∏_{k: h_k = h_k(q)} Pr(h_k = h_k(q)) · ∏_{k: h_k ≠ h_k(q)} Pr(h_k ≠ h_k(q))    (4)

where Pr(h_k ≠ h_k(q)) (denoted by Pr(Δh_k(q) ≠ 0)) is the probability that hash bit k of h is flipped compared with that of H(q), and Pr(h_k = h_k(q)) (denoted by Pr(Δh_k(q) = 0)) is the probability that hash bit k of h is not flipped. Apparently, these two probabilities depend on the specified query q and the hash function h_k(x).

Since the weighted Hamming distance is used for ranking, the ranking induced by D_H^w(H(q), h) is more crucial than its actual value. Therefore, by dividing each Pr(h | H(q)) by ∏_{k=1}^{K} Pr(Δh_k(q) = 0) (a factor that does not depend on h, so it does not change the ranking of D_H^w(H(q), h)), we get a modified weighted Hamming distance:

D_H^w(H(q), h) = Σ_{k∈S} λ_k(q)    (5)

where S is the set of hash bits in h that differ from H(q), and

λ_k(q) = log [ Pr(Δh_k(q) = 0) / Pr(Δh_k(q) ≠ 0) ] = log [ (1 − Pr(Δh_k(q) ≠ 0)) / Pr(Δh_k(q) ≠ 0) ]    (6)

Equation (6) is a monotonically decreasing function w.r.t. Pr(Δh_k(q) ≠ 0). The smaller Pr(Δh_k(q) ≠ 0), the smaller the probability of a data point p ∈ N(q) being mapped to a different bit by h_k(x), and thus the more discriminative hash bit k. Therefore, λ_k(q) satisfies the constraints for a data-adaptive weight introduced in Section 3.1.
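To make eqs. (5) and (6) concrete, the short Python sketch below (our illustration, not code from the paper) computes λ_k(q) from given per-bit flip probabilities and sums it over the flipped bits; the toy codes and probabilities are invented for the example.

```python
import numpy as np

def bit_weights(flip_prob):
    """lambda_k(q) = log((1 - Pr(flip_k)) / Pr(flip_k)), as in eq. (6)."""
    flip_prob = np.clip(flip_prob, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.log((1.0 - flip_prob) / flip_prob)

def weighted_hamming(code_q, code_h, flip_prob):
    """Sum of lambda_k(q) over the bits where h differs from H(q), as in eq. (5)."""
    lam = bit_weights(flip_prob)
    differing = code_q != code_h          # the set S of flipped bits
    return lam[differing].sum()

# Toy example with K = 4 bits (all numbers are made up for illustration).
H_q    = np.array([1, 0, 1, 1], dtype=np.uint8)
h1     = np.array([1, 0, 0, 1], dtype=np.uint8)   # differs on a reliable bit
h2     = np.array([1, 1, 1, 1], dtype=np.uint8)   # differs on an unreliable bit
p_flip = np.array([0.05, 0.45, 0.10, 0.20])       # Pr(Delta h_k(q) != 0)

print(weighted_hamming(H_q, h1, p_flip))  # larger: a trustworthy bit was flipped
print(weighted_hamming(H_q, h2, p_flip))  # smaller: a near-random bit was flipped
```

Both h1 and h2 have Hamming distance 1 to H(q), yet their weighted distances differ, which is exactly the ambiguity the weighting is meant to resolve.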
To calculate Pr(Δh_k(q) = 0) or Pr(Δh_k(q) ≠ 0), the distribution of h_k(q + ñ) − h_k(q) is essential. Based on our discussion in Section 3.2, we can use the distribution of s_k(q + ñ, q) = f_k(q + ñ) − f_k(q), with density function pdf_k(s), to estimate Pr(Δh_k(q) ≠ 0). The Right of Fig. 3 shows the probability of a neighbor p of q being mapped to a bit different from h_k(q). If f_k(q) > T_k, we have:

Pr(Δh_k(q) ≠ 0) = Pr(f_k(p) < T_k) = Pr(s_k(p, q) ≤ T_k − f_k(q)) = ∫_{−∞}^{T_k − f_k(q)} pdf_k(s) ds    (7)

The Gaussian distribution assumption for pdf_k is used for PCAH, LSH and ITQ in our experiments, since their s_k distributions are all Gaussian-like, as shown in Fig. 1. Therefore, pdf_k(s) = N(μ_k, σ_k), and if f_k(q) > T_k, we have:

Pr(Δh_k(q) ≠ 0) = (1/2) [ 1 + erf( (T_k − f_k(q) − μ_k) / (σ_k √2) ) ]    (8)

Similarly, for f_k(q) < T_k:

Pr(Δh_k(q) ≠ 0) = (1/2) [ 1 − erf( (T_k − f_k(q) − μ_k) / (σ_k √2) ) ]    (9)

where erf is the Gauss error function. For SPH and AGH, we use the Laplace distribution assumption for pdf_k, i.e. pdf_k(s) = exp{ −|s − μ_k| / b_k } / (2b_k), where b_k = σ_k / √2.
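Eqs. (8) and (9) are cheap to evaluate once μ_k, σ_k and T_k are known. The sketch below is our own illustration (parameter values are made up), covering both the Gaussian case and the Laplace alternative used for SPH and AGH.

```python
from math import erf, exp, sqrt

def flip_prob_gaussian(f_q, T, mu, sigma):
    """Pr(Delta h_k(q) != 0) under the Gaussian assumption, eqs. (8)-(9)."""
    z = (T - f_q - mu) / (sigma * sqrt(2.0))
    if f_q > T:                      # bit is 1; a neighbor flips it if f_k(p) < T_k
        return 0.5 * (1.0 + erf(z))
    return 0.5 * (1.0 - erf(z))      # bit is 0; a neighbor flips it if f_k(p) >= T_k

def flip_prob_laplace(f_q, T, mu, sigma):
    """Same quantity under the Laplace assumption, pdf_k(s) = exp(-|s - mu|/b) / (2b)."""
    b = sigma / sqrt(2.0)
    x = T - f_q                      # integrate pdf_k up to T_k - f_k(q)
    cdf = 0.5 * exp((x - mu) / b) if x < mu else 1.0 - 0.5 * exp(-(x - mu) / b)
    return cdf if f_q > T else 1.0 - cdf

# A projection close to its threshold is easy to flip; one far away is not.
print(flip_prob_gaussian(f_q=0.02, T=0.0, mu=0.0, sigma=0.1))  # near T: probability close to 0.5
print(flip_prob_gaussian(f_q=0.50, T=0.0, mu=0.0, sigma=0.1))  # far from T: probability near 0
```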
In our experiments, we set ω_k(q) = λ_k(q) and denote this weighting scheme as WhRank. Given a query q, first the unbinarized hash value f_k(q) of each hash bit k is calculated. Then, the adaptive weight ω_k(q) is calculated using eqs. (8), (9) and (6). Apparently, the larger σ_k, the smaller ω_k(q). Moreover, the smaller |f_k(q) − T_k|, the larger Pr(Δh_k(q) ≠ 0), and thus the smaller ω_k(q). Therefore, ω_k(q) satisfies the constraints for a data-adaptive and query-sensitive weight introduced in Sections 3.1 and 3.2. For the Laplace distribution assumption and the Student's t-distribution assumption used in our experiments, these discussions still hold. Another straightforward dynamic bit-level weighting is setting ω_k(q) = |T_k − f_k(q)| / σ_k. In our experiments, we use this weighting scheme as a natural baseline and denote it as WhRank1. Note that, since we make no assumption about the hashing method used in the bit-level weight learning, our algorithm, WhRank, can be applied to different kinds of hashing methods.

4. Analysis

As shown in eq. (5), given a query q and a binary code h, D_H^w(H(q), h) can be calculated efficiently as ω(q)^T (H(q) ⊗ h), where ⊗ denotes the XOR of two binary codes and ω(q) = (ω_1(q), ω_2(q), ..., ω_K(q))^T. While the weighted distances can be calculated by this inner-product operation, it is actually possible to avoid most of this computational cost by computing the traditional Hamming distance first, and then ranking the returned binary codes by their weighted Hamming distances to H(q). Therefore, the ranking of the returned binary codes can be obtained with minor additional cost.
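A rough sketch of this two-stage procedure (our own illustration with hypothetical variable names, not the authors' implementation): stage one runs a plain XOR/popcount Hamming search on packed codes, and only the returned candidates are re-ranked by their weighted distances to H(q).

```python
import numpy as np

def hamming_candidates(query_code, db_codes, radius):
    """Stage 1: traditional Hamming search on packed uint8 codes (XOR + popcount)."""
    xor = np.bitwise_xor(db_codes, query_code)        # (n, n_bytes)
    dist = np.unpackbits(xor, axis=1).sum(axis=1)     # per-code popcount
    return np.flatnonzero(dist <= radius)

def weighted_rerank(query_bits, db_bits, candidates, weights):
    """Stage 2: order the candidates by the weighted Hamming distance of eq. (5)."""
    diff = db_bits[candidates] != query_bits          # which bits are flipped
    wdist = (diff * weights).sum(axis=1)              # sum of omega_k(q) over flipped bits
    return candidates[np.argsort(wdist)]

# Hypothetical usage: 32-bit codes kept both packed (for XOR) and unpacked (for re-ranking).
rng = np.random.default_rng(0)
db_bits = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)
db_codes = np.packbits(db_bits, axis=1)
q_bits = rng.integers(0, 2, size=32, dtype=np.uint8)
q_code = np.packbits(q_bits)
omega_q = rng.random(32)                              # stand-in for the learned weights omega_k(q)
ranked = weighted_rerank(q_bits, db_bits, hamming_candidates(q_code, db_codes, radius=10), omega_q)
```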
To learn the μ_k and σ_k of a hash function h_k(x), we construct a training set consisting of s query points, each of which has m neighbors. The complexity of calculating the unbinarized hash values of each query and its neighbors is almost O(s(m + 1)d), and the complexity of calculating μ_k and σ_k is bounded by O(3sm). Therefore, the overall training complexity of our parameter-learning stage is bounded by O(K · s(md + d + 3m)) ≈ O(Ksmd).
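A minimal sketch of this parameter-learning stage (ours; hash_projection is a hypothetical stand-in for the unbinarized hash functions f_k): collect s_k = f_k(p) − f_k(q) over all sampled query-neighbor pairs and take the per-bit mean and standard deviation.

```python
import numpy as np

def learn_bit_params(hash_projection, queries, neighbors):
    """Estimate (mu_k, sigma_k) of s_k = f_k(p) - f_k(q) for every hash bit k.

    queries:         (s, d) array of sampled query points
    neighbors:       (s, m, d) array, m neighbors per query
    hash_projection: callable mapping an (n, d) array to (n, K) unbinarized hash values
    """
    s, m, d = neighbors.shape
    f_q = hash_projection(queries)                         # (s, K)
    f_p = hash_projection(neighbors.reshape(s * m, d))     # (s*m, K)
    diff = f_p.reshape(s, m, -1) - f_q[:, None, :]         # s_k for every (query, neighbor) pair
    diff = diff.reshape(s * m, -1)
    return diff.mean(axis=0), diff.std(axis=0)             # per-bit mu_k and sigma_k

# Hypothetical usage with a random linear projection standing in for f_k(x) = w_k^T x.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 32))                             # d = 128, K = 32
mu, sigma = learn_bit_params(lambda X: X @ W,
                             rng.normal(size=(100, 128)),
                             rng.normal(size=(100, 5, 128)))
```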
5. Experiments

5.1. Experimental Setup

Our experiments are carried out on two benchmark datasets: MNIST70K and ANN-SIFT1M. MNIST70K [10] consists of 70K 784-dimensional images, each associated with a digit label from '0' to '9', and is split into a database set (i.e. training set, 60K) and a query set (10K). ANN-SIFT1M [5] consists of 1M images, each represented as a 128-dimensional SIFT descriptor [13]. It contains three vector subsets: a learning set (100K), a database set (1M) and a query set (10K). The learning subset is retrieved from Flickr images, and the database and query subsets are extracted from the INRIA Holidays images [4].

As stated in Section 3.3, our method can be applied to different kinds of binary hashing methods. In our experiments, several representative hashing methods, Locality Sensitive Hashing (LSH) [1], PCA Hashing (PCAH) [20], Iterative Quantization (ITQ) [3], Spectral Hashing (SPH) [21] and Anchor Graph Hashing (AGH) [12], are chosen to evaluate the effectiveness of WhRank. The source code generously provided by the authors and the recommended parameter settings in their papers are used in our experiments. For AGH, the number of anchors is set to 500 and the number of nearest neighbors for anchor graph construction is set to 2 for MNIST70K and 5 for ANN-SIFT1M, respectively. Note that the hash functions of LSH, PCAH and ITQ are linear, while those of SPH and AGH are nonlinear. Experimental results in Section 5.2 show that WhRank is applicable to both linear and nonlinear hashing methods. Moreover, we also compare our algorithm with QsRank [22], a ranking algorithm for binary codes. Since QsRank is developed only for PCA-based hashing methods, the comparisons are carried out on PCAH and ITQ.

Given a query, the top N nearest neighbors returned by ranking with the traditional Hamming distance and with our weighted Hamming distance differ, as do their rankings. The efficacy of WhRank can be measured by the Precision@N, Recall@N and distance error ratio@N [15], defined as:

Precision@N = (number of similar points in top N) / N

Recall@N = (number of similar points in top N) / (number of all similar points)

error ratio@N = (1 / (N|Q|)) Σ_{q∈Q} Σ_{k=1}^{N} ( d(q, n_k) − d(q, n_k*) ) / d(q, n_k*)

where q ∈ Q is a query, n_k is the k-th nearest neighbor in the ranked results, and n_k* is the actual k-th nearest neighbor of q in the database set. For MNIST70K, a returned point is considered a true neighbor of a query if they share the same digit label. For ANN-SIFT1M, we use the same criterion as in [19]: a returned point is considered a true neighbor if it lies in the top 1% of points closest to the query in terms of Euclidean distance in the original space.
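For completeness, a small sketch (ours, not the authors' evaluation code) of how these measures can be computed for a single query's ranked return list; averaging the error ratio over the query set Q yields the definition above. All numbers are toy values.

```python
import numpy as np

def precision_recall_at_n(is_similar_ranked, total_similar, N):
    """Precision@N and Recall@N for one query's ranked return list."""
    hits = int(np.sum(is_similar_ranked[:N]))
    return hits / N, hits / total_similar

def distance_error_ratio_at_n(d_ranked, d_true, N):
    """Mean relative gap between returned and true k-th nearest-neighbor distances (one query)."""
    ratio = (d_ranked[:N] - d_true[:N]) / d_true[:N]
    return ratio.mean()

# Toy numbers for a single query (illustrative only).
is_similar = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])  # relevance of the top-10 returned points
d_ranked   = np.array([0.9, 1.1, 1.3, 1.4, 1.8])       # distances of the returned top 5
d_true     = np.array([0.9, 1.0, 1.2, 1.4, 1.5])       # distances of the actual top 5 neighbors
p, r = precision_recall_at_n(is_similar, total_similar=20, N=10)
err  = distance_error_ratio_at_n(d_ranked, d_true, N=5)
```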
Figure 4. Evaluations of Precision@N of WhRank, WhRank1 and QsRank on MNIST70K using 32-bit binary codes. As shown in this figure, by applying WhRank for query result ranking, the retrieval accuracy of each method is improved.

5.2. Experimental Results

To demonstrate the efficacy of applying our weighted Hamming distance for ranking, given a query, the returned results of each baseline hashing method are ranked by their traditional Hamming distance and by their weighted Hamming distance to the query, respectively. The Precision@N and distance error ratio@N of each ranked result list are reported to show the efficacy of WhRank (the number of returned results is predefined in our experiments, so a higher Precision@N implies a higher Recall@N; thus only the Precision@N is reported). Note that ranking with the weighted Hamming distance is only performed on the results returned by computing the traditional Hamming distance, so the additional computational cost is minor.

Since MNIST70K is fully annotated, we can use Precision@N and Recall@N to show the efficacy of WhRank. The dataset is first embedded into Hamming space using each baseline hashing method. After that, from each digit class, we randomly sample 50 images from the query set, constituting a subset containing 500 images. For each training image, we find its 1,000 neighbors in the dataset based on their digit labels. The training set and the corresponding neighbors are used for distribution parameter estimation. The rest of the query set is used as queries in our experiments. For LSH, PCAH and ITQ, the Gaussian distribution is used as the distribution assumption, while for SPH and AGH, the Laplace distribution is used.

Fig. 4 gives the Precision@N on MNIST70K using 32-bit binary codes. For clarity, the results are shown in two parts. It is easy to see that, by ranking with our weighted Hamming distance (WhRank), all baseline hashing methods achieve better search performance. On average, we get a 5% higher precision for each hashing method. For SPH and PCAH, the improvements are even higher (almost 10%). Meanwhile, as shown in this figure, each baseline method combined with WhRank1 also achieves a reasonably good performance improvement, though the improvement is a little inferior to that of WhRank (2% on average). In our subsequent experiments, this result still holds. Therefore, the results of WhRank1 are not given in subsequent figures for the sake of clarity.

Fig. 5 gives the Precision@N on MNIST70K under different code lengths. Once again, we can easily see that the performance of each baseline hashing method is improved when combined with WhRank. Moreover, as can be seen from Fig. 4 and Fig. 5(c), even with a relatively short binary code (32 bits), the retrieval accuracy of each baseline method combined with WhRank is almost the same as, sometimes better than, that of the baseline method itself with a binary code of larger size (64 bits, 96 bits).

In the experiments on ANN-SIFT1M, for distribution parameter estimation, we randomly sample 100 points from the query set as the training set, and for each training sample, we find its top 5,000 nearest neighbors in the database set, measured by the Euclidean distance. For LSH, PCAH and ITQ, we still use the Gaussian distribution as the distribution assumption. For SPH, the Laplace distribution is used, and for AGH, the Student's t-distribution is used.

Since the neighborhood relationship of a data pair in ANN-SIFT1M is defined based on the Euclidean distance, we use Precision@N and distance error ratio@N to show the efficacy of ranking with our weighted Hamming distance. Fig. 6 and Fig. 7 give the evaluations of Precision@N and distance error ratio@N on ANN-SIFT1M under different code lengths, respectively. As shown in these two figures, when combined with WhRank, each method achieves a 10% higher precision on average. Moreover, the distance error ratio of each baseline method is reduced by 40% compared with the original. The experimental results demonstrate that applying WhRank to existing hashing methods yields more accurate similarity search results.

We also compare our algorithm with QsRank [22]. Since QsRank is developed only for PCA-based hashing methods, the comparisons are carried out on PCAH [20] and ITQ [3].
Figure 5. Evaluations of Precision@N of WhRank and QsRank on MNIST70K. Code lengths: (a) 48 bits; (b) 64 bits; (c) 96 bits. As shown, the retrieval accuracy of each baseline method is improved when combined with WhRank under different code lengths.

Figure 6. Evaluations of Precision@N of WhRank and QsRank on ANN-SIFT1M. Code lengths: (a) 32; (b) 48; (c) 64; (d) 96. The retrieval accuracy of each baseline method is improved when combined with WhRank under each code length setting. Moreover, the retrieval accuracy of each method combined with WhRank is as good as, sometimes better than, that of the same method combined with QsRank.
As QsRank is designed for ε-neighbor search, in our experiments on MNIST70K, given a query q and N, the search radius is set to the mean of the distances between q and all of its neighbors. On ANN-SIFT1M, the radius is set to the distance between q and its actual N-th nearest neighbor in the database set. The comparison results are reported in Fig. 4(b) to Fig. 7. As shown in these figures, the performance improvements of our algorithm are as good as, sometimes better than, those of QsRank. One remarkable advantage of WhRank over QsRank is that the ranking model of WhRank is more general; thus WhRank is also applicable to other, non-PCA-based hashing methods, e.g. SPH and AGH. Furthermore, WhRank can be easily applied to ε-neighbor search, while QsRank is not very effective for nearest-neighbor search, since the distance between a query and its nearest neighbor is often unknown in practice.

6. Conclusion

Most existing binary hashing methods rank the returned results of a query simply by the traditional Hamming distance, which poses a critical issue for similarity search where ranking is important, since there can be many results sharing the same Hamming distance to the query. This paper proposes a weighted Hamming distance ranking algorithm (WhRank) to alleviate this ranking ambiguity. When applied to existing hashing methods, different bit-level weights are assigned to different hash bits, and the returned results can be ranked at a finer-grained binary code level rather than at the original integer Hamming distance level. We demonstrate that an effective bit-level weight is not only data-dependent but also query-dependent, and give a simple yet effective algorithm to learn the weights.

The experimental results on two large-scale image datasets containing up to one million high-dimensional data points demonstrate the efficacy of WhRank. The search performance of every evaluated hashing method is improved when combined with WhRank. Moreover, compared with QsRank, a novel ranking algorithm for binary codes, the performance improvements of WhRank are as good as (sometimes better than) those of QsRank. There are two remarkable advantages of WhRank over QsRank. First, WhRank can be applied to various kinds of hashing methods, while QsRank is developed only for PCA-based hashing methods. Second, as QsRank is developed for ε-neighbor search, it is not very effective for nearest-neighbor search, since the distance of a query to its nearest neighbor is unknown in practice. In contrast, WhRank can be easily applied to ε-neighbor search.
Figure 7. Evaluations of distance error ratio@N of WhRank and QsRank on MNIST70K and ANN-SIFT1M. Code lengths: (a) 32 bits; (b) 48 bits; (c) 64 bits; (d) 96 bits.