Hash Code Indexing in Cross-Modal Retrieval

978-1-7281-4673-7/19/$31.00 ©2019 IEEE

Abstract—Cross-modal hashing, which searches nearest neighbors across different modalities in the Hamming space, has recently become a popular technique to overcome the storage and computation barriers in multimedia retrieval. Although dozens of cross-modal hashing algorithms have been proposed to yield compact binary code representations, applying exhaustive search in a large-scale dataset is impractical for real-time purposes, and the Hamming distance computation suffers from inaccurate results. In this paper, we propose a novel index scheme over binary hash codes in cross-modal retrieval. The proposed indexing scheme exploits a few binary bits of the hash code as the index code. Based on the index code representation, we construct an inverted index structure to accelerate the retrieval efficiency and train a neural network to improve the indexing accuracy. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated by three cross-modal hashing methods. Results show the proposed method effectively boosts the performance over the benchmark datasets and hash methods.

Index Terms—cross-modal hashing, inverted indexing, nearest neighbor search

I. INTRODUCTION

Nearest neighbor (NN) search plays a fundamental role in machine learning and information retrieval. Cross-modal retrieval, an application built on nearest neighbor search, has attracted much research attention recently. It is natural that multimedia data have multiple modalities; these modalities may contribute correlated semantic information, such as video-tag pairs on YouTube and image-text pairs on Flickr. Cross-modal retrieval returns relevant results of one modality for a query of another modality. For example, we can use text queries to retrieve images, and image queries to retrieve texts. This retrieval paradigm provides a flexible interface for users to search multimedia data across different modalities.

With the rapid growth of multimedia data, it is impractical to apply exhaustive search, which consumes tremendous computation resources in a large-scale dataset. To address this issue, existing cross-modal retrieval methods mainly leverage the hashing technique to generate compact data representations. The goal of hashing is to embed the data points from the original space into a Hamming space as binary hash codes. It generally exploits inter/intra-class correlations or the underlying data distribution/manifold to learn a set of hash functions, so that similar binary codes are usually generated for similar data points. The Hamming distance computation between binary codes enables a fast nearest neighbor search through hardware-supported bit operations with minimal memory consumption. However, hashing faces the critical problem of quantization loss after binary embedding. Even though numerous hashing methods have been proposed to address this issue, there remains an inevitable information gap between a real-valued vector and its corresponding binary code. Searching nearest neighbors in the binary Hamming space is therefore less accurate than searching in the real-valued Euclidean space.

In this paper, we propose to utilize a novel index scheme over binary hash codes for cross-modal retrieval. The proposed index scheme exploits a few binary bits of the hash code as the index code. An index structure is built by compiling reference data points into inverted lists according to their index codes. Given a query, we estimate the relevance of each index code, which implicitly reflects the nearest neighbor probability for the query. The estimation is realized by a prediction model that learns a nonlinear mapping between the query of one modality and the index space of another modality through deep learning. Then we look up the index table for the top-ranked index codes with high relevance scores to retrieve high-quality candidates. We evaluate the proposed index scheme on three state-of-the-art cross-modal hashing methods over two widely-used benchmark datasets. Experimental results show the proposed method can effectively improve the search performance in both retrieval accuracy and computation time. The proposed index scheme can be built upon a binary code dataset generated by any hashing method to derive the following benefits:
• Based on the built index structure, the retrieval process can achieve sub-linear time complexity through inverted table lookup, compared with the exhaustive search that takes linear time complexity, so that the retrieval process can be accelerated.
• Given a query, the learned prediction model is employed to estimate the relevance scores of the index codes for more precise ranking, rather than ranking by inaccurate Hamming distances, so that the retrieval accuracy can be improved.
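The first benefit above can be illustrated with a minimal sketch (the toy codes and the helper name `build_inverted_table` are illustrative assumptions, not the paper's implementation): bucketing reference codes by their d-bit index code lets a query touch only a few buckets instead of scanning all N codes.

```python
from collections import defaultdict

def build_inverted_table(codes, d):
    """Group binary hash codes (tuples of 0/1 bits) by their first d bits."""
    table = defaultdict(list)
    for code in codes:
        table[code[:d]].append(code)
    return table

# Toy reference set of four 6-bit codes with 2-bit index codes (d = 2).
refs = [(0,0,1,0,1,1), (0,0,0,1,1,0), (0,1,1,1,0,0), (1,0,0,0,1,1)]
table = build_inverted_table(refs, d=2)

# Exhaustive search scans all N reference codes; the inverted table
# touches only the buckets of the probed index codes.
candidates = table[(0, 0)]   # a single bucket lookup
```

Only the two codes sharing the probed prefix are accessed, which is the source of the sub-linear behavior reported later for the candidate set.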
The remainder of this paper is organized as follows. In Section 2, we discuss previous work on cross-modal hashing. Section 3 presents the proposed probability-based index scheme and search method. Section 4 shows experimental results. Concluding remarks are given in Section 5.

II. RELATED WORK

The hashing technique can be classified into three main categories: uni-modal hashing, multi-view hashing, and cross-modal hashing. Uni-modal hashing derives binary hash codes from a single type of features. The seminal work includes locality-sensitive hashing [1] and iterative quantization [2]. Multi-view hashing utilizes multiple types of features to learn a better binary representation [3], [4]. Cross-modal hashing (CMH) aims to facilitate information retrieval across different modalities. It usually embeds multiple heterogeneous data into a common latent space where the discriminability or similarity correlation is preserved.

Existing CMH methods can be further divided into unsupervised and supervised approaches. The unsupervised CMH approach basically employs the data distribution to learn hash functions without label information. For example, composite correlation quantization (CCQ) [5] uses correlation-maximal mappings to transform data from different modality types into an isomorphic latent space. Fusion similarity hashing (FSH) [6] constructs an undirected asymmetric graph to model the fusion similarity among different modalities and embeds the fusion similarity across modalities into a common Hamming space. On the other hand, the supervised CMH approach leverages label information to assist the learning process. For example, deep cross-modal hashing (DCMH) [7] learns hash functions for corresponding modalities through supervised deep neural networks. Semantics-preserving hashing (SePH) [8] transforms semantic affinities into a probability distribution and approximates it with hash codes by using kernel logistic regression. The discrete latent factor model (DLFH) [9] utilizes discrete latent factors to model the supervised information and adopts a maximum likelihood loss function without relaxation. Deep discrete cross-modal hashing (DDCMH) [10] learns discrete nonlinear hash functions by preserving the intra-modality similarity at each hidden layer of the networks and the inter-modality similarity at the output layer of each individual network.

III. HASH CODE INDEXING

To adopt the proposed hash code index scheme, we first construct an index structure that consists of a set of index codes with inverted lists of reference data points. Then we train a prediction model that estimates the relevance scores of the index codes for a given query. With the proposed index scheme, we can perform a more accurate and more efficient cross-modal retrieval, as elaborated in the following.

A. Index Model Construction and Training

Suppose that we have a reference dataset of N binary codes of length c, denoted as B = {b_i ∈ {0,1}^c | i = 1, 2, ..., N}. Notice that the binary codes can be generated by any one of the CMH methods. Without loss of generality, we select the first d binary bits of b_i as the index code x_i ∈ {0,1}^d. An index table with 2^d entries is constructed based on the index codes, where each entry E_X = {b_i | x_i = X} represents a particular index code X and attaches the set of associated reference data points.

We train a prediction model that learns a nonlinear mapping between the query of one modality (e.g., texts) and the index space of another modality (e.g., images) through deep learning. The model is used to estimate the relevance scores of index codes for a given query. To compile the training dataset, we prepare a set of queries of one modality, denoted as Q = {q_j^θ | j = 1, 2, ..., J}, where q_j^θ is the jth query. The relevant examples of another modality for q_j^θ are denoted as {b_jk^θ | k = 1, 2, ..., K} ⊆ B, where b_jk^θ is the kth relevant example for q_j^θ. The definition of a relevant example is based on the class label information. For example, the relevant examples of a text query are the images whose class labels are the same as the query's. The relevance score for each index code X is defined as the proportion of relevant examples in the index entry:

    R_jX^θ = |{b_jk^θ | x_jk^θ = X}| / |E_X|,    (1)

where |·| denotes set cardinality. The training set is compiled as pairs of query features and relevance scores; the jth query q_j^θ is associated with the set of 2^d relevance scores of index codes {R_jX^θ}.

A fully-connected neural network is employed to learn the relation between the query and index codes based on the training set. The input layer receives the feature representation of q_j^θ, and the output layer predicts the 2^d relevance scores of index codes {P_jX^θ}. Based on the cross-entropy loss between the predictions {P_jX^θ} and the targets {R_jX^θ}, we compute the error derivative with respect to the output of each neuron, which is backward propagated through each layer in order to update the weights of the neural network.

B. NN Search

Given a query q for cross-modal retrieval, we utilize the trained network to predict the relevance scores of index codes {P_X}. The index codes are ranked to select the top-R index codes {X_1, X_2, ..., X_R} with the highest relevance scores, and the reference data points associated with the top-ranking index codes are retrieved into a candidate set C = {b_i | x_i = X_r, r = 1, 2, ..., R}. We calculate the Hamming distance between the query and each of the candidates in C, then sort the distances of the candidates in ascending order to return the desired number of NNs.
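The construction of Section III-A and the search of Section III-B can be sketched end to end. This is a minimal illustration on toy data: the helper names are assumptions, and Eq. (1) scores stand in for the predictions that, in the paper, come from the trained neural network.

```python
from collections import defaultdict

def build_index(codes, d):
    """Inverted table: d-bit index code X -> bucket E_X of full codes."""
    table = defaultdict(list)
    for c in codes:
        table[c[:d]].append(c)
    return table

def relevance_targets(relevant, table, d):
    """Eq. (1): R_X = |{relevant examples with index code X}| / |E_X|."""
    hits = defaultdict(int)
    for b in relevant:
        hits[b[:d]] += 1
    return {X: hits[X] / len(bucket) for X, bucket in table.items()}

def hamming(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(x != y for x, y in zip(a, b))

def nn_search(query_code, scores, table, top_r, k):
    """Gather candidates from the top-R buckets, rerank by Hamming distance."""
    top = sorted(table, key=lambda X: scores.get(X, 0.0), reverse=True)[:top_r]
    cands = [b for X in top for b in table[X]]
    return sorted(cands, key=lambda b: hamming(query_code, b))[:k]

# Toy 6-bit reference codes bucketed by 2-bit index codes.
refs = [(0,0,1,0,1,1), (0,0,0,1,1,0), (0,1,1,1,0,0),
        (1,0,0,0,1,1), (1,0,1,1,0,1)]
table = build_index(refs, d=2)

# Training targets for one query whose relevant examples are known:
# bucket (0,0) holds 2 refs, 1 relevant; bucket (1,0) likewise.
targets = relevance_targets([(0,0,1,0,1,1), (1,0,0,0,1,1)], table, d=2)

# At search time the trained network would supply the scores; here the
# targets serve as stand-in predictions.
out = nn_search((0,0,1,0,1,0), targets, table, top_r=2, k=2)
```

Only the buckets with the highest scores are opened, so the candidate set `out` is reranked from a fraction of the reference data, matching the complexity analysis that follows.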
The time complexity of NN search mainly involves three parts, namely relevance score prediction, index code ranking, and candidate computation. The time spent on relevance score prediction is related to the size of the neural network; it is regarded as constant time. Index code ranking requires sorting all index codes by their relevance scores; it takes 2^d · log 2^d = d · 2^d computation time. Candidate computation computes the Hamming distance between the query and every candidate; it spends s · |C| time, where s is the tiny constant time of one Hamming distance computation. The candidate set C is usually a small fraction of the reference dataset B, so the computation time is reduced significantly compared with exhaustive search. Interestingly, the quality of the candidate set is good enough to further boost the search accuracy, as illustrated in the experimental section.

IV. EXPERIMENT

To evaluate the proposed method, the experiment is conducted by using three state-of-the-art CMH methods on two widely-used benchmark datasets. The benchmark datasets are MIRFlickr [11] and NUS-WIDE [12], each of which consists of an image modality and a text modality. The CMH methods, including DLFH [9], DCMH [7], and FSH [6], were employed to generate binary code datasets for MIRFlickr and NUS-WIDE individually. The program was implemented in Python and run on a PC with an Intel i7 [email protected] GHz and 32 GB RAM.

A. Implementation and Comparison

For each CMH algorithm, three kinds of index schemes are implemented for comparison:
• Exhaustive. It applies the exhaustive search that calculates Hamming distances between the query and all reference data without adopting any index structure.
• Naïve-index (d bits). It takes the first d bits of the hash code as the index code for each reference data point. The query's index code is compared to find candidates, which are then reranked according to their Hamming distances. Here d = 14.
• DNN-index (d bits). It is the proposed method. Based on the above naïve index structure, we learn a 4-layer neural network, where the input layer receives the raw query, followed by three hidden layers with 8 units each. The output layer predicts the relevance scores of the 2^d clusters. ReLU and softmax are used as the activation functions. Here d ∈ {8, 10, 12, 14}.

Mean average precision (MAP) is used to evaluate the retrieval accuracy for a set of queries Q:

    MAP@R = (1/|Q|) Σ_{i=1}^{|Q|} (1/R) Σ_{j=1}^{R} pr(j) · rel(j),    (2)

where R is the number of retrieved documents, pr(j) denotes the precision of the top j retrieved documents, and rel(j) = 1 if the jth retrieved document is relevant to the query and rel(j) = 0 otherwise. The relevant documents are defined as those image-text pairs which share at least one common label. MAP is computed as the mean of all queries' average precision.

Figure 1 shows the results for the search modality "text query vs. image dataset" (T → I) with 32-bit codes in the MIRFlickr and NUS-WIDE datasets, and Figure 2 shows the other search modality, "image query vs. text dataset" (I → T). The exhaustive scheme did not benefit from reranking all reference data since Hamming distances are not accurate enough to reflect the similarities to the query. However, the naïve-index scheme can reach a comparable accuracy by taking only a few candidates for reranking. Moreover, the proposed DNN-index scheme effectively boosts the accuracy compared with the above two schemes.

Tables I and II compare the proposed DNN-index (14 bits) scheme with these CMH methods for T → I and I → T, respectively, in terms of MAP@50, the fraction of accessed reference data (ARD%), and runtime. ARD% is defined by:

    ARD% = (the number of candidates / the number of reference data points) × 100%.    (3)

A lower ARD% induces a smaller computation cost to access the reference data. The 14-bit DNN-index scheme, which obtained the highest accuracy and smallest computation cost, showed a significant improvement when integrated with these CMH methods.

V. CONCLUSION

In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes in cross-modal retrieval. The index scheme, which ranks the index codes of the inverted table through a DNN, can effectively increase the search accuracy and decrease the computation cost. Extensive experimental results show that the proposed method outperforms several state-of-the-art CMH methods on the MIRFlickr and NUS-WIDE datasets.

ACKNOWLEDGEMENT

This work was supported by the Ministry of Science and Technology under grant MOST 106-2221-E-415-019-MY3.

REFERENCES
[1] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 604-613, 1998.
[2] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916-2929, 2013.
[3] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 225-234, 2011.
[4] L. Liu, M. Yu, and L. Shao, “Multiview alignment hashing for efficient image search,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 956-966, 2015.
[5] M. Long, Y. Cao, J. Wang, and P. S. Yu, “Composite correlation quantization for efficient multimodal retrieval,” in Proceedings of the ACM International Conference on Information Retrieval, pp. 579-588, 2016.
[6] H. Liu, R. Ji, Y. Wu, F. Huang, and B. Zhang, “Cross-modality binary code learning via fusion similarity hashing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7380-7388, 2017.
[7] Q. Y. Jiang and W. J. Li, “Deep cross-modal hashing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3232-3240, 2017.
[8] Z. Lin, G. Ding, J. Han, and J. Wang, “Cross-view retrieval via probability-based semantics-preserving hashing,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4342-4355, 2017.
[9] Q. Y. Jiang and W. J. Li, “Discrete latent factor model for cross-modal hashing,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3490-3501, 2019.
[10] F. Zhong, Z. Chen, and G. Min, “Deep discrete cross-modal hashing for cross-media retrieval,” Pattern Recognition, vol. 83, pp. 64-77, 2018.
[11] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 39-43, 2008.
[12] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in Proceedings of the ACM International Conference on Image and Video Retrieval, p. 48, 2009.

TABLE I
COMPARISON FOR CMH METHODS IN TERMS OF MAP@50, ARD%, AND RUNTIME (MILLISECONDS PER QUERY) UNDER TEXT QUERY VS. IMAGE DATASET

                       MIRFlickr 16-bit         MIRFlickr 32-bit         NUS-WIDE 16-bit          NUS-WIDE 32-bit
T → I                  MAP@50  ARD%    time     MAP@50  ARD%    time     MAP@50  ARD%    time     MAP@50  ARD%    time
DLFH                   0.8529  100%    1.46     0.8887  100%    2.32     0.8457  100%    13.90    0.8418  100%    18.93
DCMH                   0.7451  100%    1.49     0.7660  100%    2.19     0.5777  100%    13.20    0.5961  100%    19.80
FSH                    0.4636  100%    1.84     0.4851  100%    3.00     0.4337  100%    12.04    0.2861  100%    16.74
DNN-index (14 bits)    0.9147  1.08%   1.04     0.9147  1.08%   1.04     0.8780  0.19%   1.27     0.8780  0.19%   1.27

TABLE II
COMPARISON FOR CMH METHODS IN TERMS OF MAP@50, ARD%, AND RUNTIME (MILLISECONDS PER QUERY) UNDER IMAGE QUERY VS. TEXT DATASET

                       MIRFlickr 16-bit         MIRFlickr 32-bit         NUS-WIDE 16-bit          NUS-WIDE 32-bit
I → T                  MAP@50  ARD%    time     MAP@50  ARD%    time     MAP@50  ARD%    time     MAP@50  ARD%    time
DLFH                   0.8160  100%    1.47     0.8283  100%    2.17     0.7289  100%    13.89    0.7881  100%    18.95
DCMH                   0.6899  100%    1.54     0.7075  100%    2.44     0.4823  100%    14.97    0.6005  100%    19.34
FSH                    0.4887  100%    1.91     0.5073  100%    2.94     0.4261  100%    11.49    0.2920  100%    15.80
DNN-index (14 bits)    0.9021  0.55%   0.93     0.9021  0.55%   0.93     0.9095  0.05%   1.02     0.9095  0.05%   1.02

Fig. 1. MAP@R in the 32-bit CMH dataset under text query vs. image dataset.
Fig. 2. MAP@R in the 32-bit CMH dataset under image query vs. text dataset.
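As a worked example of the MAP@R metric in Eq. (2), the following sketch (using toy relevance lists, not the paper's data) sums pr(j) · rel(j) over the top-R positions of each query's ranked result list, divides by R, and averages over queries.

```python
def map_at_r(ranked_rel_lists, R):
    """MAP@R of Eq. (2): mean over queries of (1/R) * sum_j pr(j) * rel(j),
    where rel(j) is 1 if the j-th retrieved document is relevant and
    pr(j) is the precision of the top-j retrieved documents."""
    total = 0.0
    for rels in ranked_rel_lists:
        ap = 0.0
        hits = 0
        for j, rel in enumerate(rels[:R], start=1):
            hits += rel
            if rel:
                ap += hits / j        # pr(j) * rel(j)
        total += ap / R
    return total / len(ranked_rel_lists)

# Two toy queries with binary relevance of their top-4 retrieved documents.
queries = [[1, 0, 1, 0],   # AP contribution: (1/1 + 2/3) / 4
           [0, 1, 1, 1]]   # AP contribution: (1/2 + 2/3 + 3/4) / 4
score = map_at_r(queries, R=4)
```

A perfect ranking (all top-R documents relevant) yields MAP@R = 1, while irrelevant leading results are penalized through the precision terms.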