Cross Modal Survey
Cross Modal Survey
Abstract
Human beings experience life through a spectrum of modes such as vision, taste, hearing, smell, and touch. These multiple modes
are integrated for information processing in our brain using a complex network of neuron connections. Likewise for artificial
intelligence to mimic the human way of learning and evolve into the next generation, it should elucidate multi-modal information
fusion efficiently. Modality is a channel that conveys information about an object or an event such as image, text, video, and audio. A
research problem is said to be multi-modal when it incorporates information from more than a single modality. Multi-modal systems
involve one mode of data to be inquired for any (same or varying) modality outcome whereas cross-modal system strictly retrieves
the information from a dissimilar modality. As the input-output queries belong to diverse modal families, their coherent comparison
is still an open challenge with their primitive forms and subjective definition of content similarity. Numerous techniques have been
proposed by researchers to handle this issue and to reduce the semantic gap of information retrieval among different modalities. This
paper focuses on a comparative analysis of various research works in the field of cross-modal information retrieval. Comparative
analysis of several cross-modal representations and the results of the state-of-the-art methods when applied on benchmark datasets
have also been discussed. In the end, open issues are presented to enable the researchers to a better understanding of the present
scenario and to identify future research directions.
Keywords: Cross-modal, multimedia, information retrieval, data fusion, comparative analysis
∗ Correspondingauthor
Email addresses: [email protected] (Parminder Kaur),
[email protected] (Husanbir Singh Pannu), [email protected]
(Avleen Kaur Malhi)
1 Introduction 3
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Related surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Article organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Review methodology 4
2.1 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Sources of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Search criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Publication metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Background 6
3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Origin and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5 Benchmark datasets 23
6 Comparative analysis 25
6.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Comparison of results using diverse techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7 Discussion 29
8 Open issues 30
9 Conclusion 36
2
1. Introduction Cross-modal and multi-modal are explained using a simple ex-
ample in figure (2) where + represents both text and images can
When we fail to understand the contents of an image embed- be retrieved using an image query and vice versa in multi-modal
ded in a text, figure captions, and referral text often help. Just by approach.
looking at a figure, a person might not be able to understand it
exactly but with the help of collateral text, it can be understood
efficiently. For instance, when we see a volleyball picture(figure
1), we may not be able to understand or know about the volley-
ball game. However, the picture can be completely understood
with the help of collateral text (such as caption, figure reference,
and related citation) describing the volleyball game. This im-
plies that information from more than one source is beneficial
in further understanding of things and also helpful in better in-
formation retrieval. This is where cross-modal data fusion and
retrieval come into the picture. Figure 2: An illustration of information retrieval in cross-modal and multi-
modal system.
represented in spatial or spectral while the text is symbolic 4.1.1. Subspace learning
and dependent upon grammar rules and cultural norms [2]. Subspace learning plays a vital role in cross-modal informa-
tion retrieval. Diverse modalities have different representation
features as well as they are located in diverse feature spaces
4. Cross-modal representation and retrieval techniques [52]. The modalities can be mapped to common isomorphic
subspaces from old miscellaneous spaces by learning potential
Cross-modal representation techniques can be broadly clas- common subspaces (as shown in figure 17).
sified into two categories: (a) Real-valued representation and
(b) Binary representation. In real-valued representation learn-
ing, the learned common representations of diverse modalities CCA and its variants, CM, SM and SCM. CCA is the most pop-
are real-valued. However, in binary representation learning, ular unsupervised technique of subspace learning which was
diverse modalities are mapped into a common hamming space. introduced by Hotelling [53] in 1936. The principal logic be-
Cross-modal similarity searching is faster in binary representa- hind this technique is to find the pair of projections for di-
tion, so the retrieval process also becomes faster. However, the verse modalities such that the correlation between them is max-
retrieval accuracy becomes less in binary representation as the imized [54]. CCA can be recognized as an issue of identi-
information is lost because representation is encoded to binary fying the basis vectors for two group of variables aiming to
codes. Prominent cross-modal learning methods and related mutually maximize the correlation between variables’ projec-
works are presented in the following sub-sections. Figure(14) tions onto the basis vectors [55]. Let h·, ·i represents the eu-
presents a taxonomy of cross-modal retrieval methods. Table clidean inner product of vectors p, q which is equal to p0 q,
(4) shows the list of acronyms used in this article. Figure (15) where A0 is the transpose of a vector or matrix A. Let (p, q)
presents the literature classification utilized in this survey. denotes a multivariate random vector and its sample instances
as S = ((p1 , q1 ), ..., (pn , qn )). S p represents (p1 , ..., pn ) and
S q = (q1 , ..., qn ), consider defining a new coordinate for p by
4.1. Real-valued representation learning
choosing a direction d p and projecting p onto the direction:
This section presents the information regarding various real- p → hd p , pi, similarly for q, the direction is dq . A sample of
valued representation learning methods and their application on new coordinate is obtained: S p,d p = (hd p , p1 i, ..., hd p , pn i) and
different datasets. Figure (16) presents the evolution of real- similarly S q,dq = (hdq , q1 i, ..., hdq , qn i). First step is to choose
valued representation learning methods in recent years. d p and dq for maximizing the correlation between vectors, such
11
Figure 14: Taxonomy of cross-modal retrieval methods
that:
E[hd p , pihdq , qi]
ρ = max q (4)
d p ,dq
ρ = max Corr(S p d p , S q dq ) (1) E[hd p , pi2 ]E[hdq , qi2 ]
d p ,dq
E[d0p pq0 dq ]
hS p d p , S q dq i = max q (5)
= max (2) d p ,dq
d p ,dq S pdp S q dq E[d0p pp0 d p ]E[dq0 qq0 dq ]
d0p E[pq0 ]dq
where ρ represents the equation to be maximized. Let E de- = max q (6)
d p ,dq
notes the empirical expectation of function f (p, q) and given d0p E[pp0 ]d p dq0 E[qq0 ]dq
by
m Covariance matrix of (p, q) is defined as:
1X
E= f (pi , qi ) (3)
m i=1 C C
0 pp pq
Cov(p, q) = E qp qp = = C (7)
then ρ can be redefined as C C qp qq
12
Figure 15: Overview of literature based on image-text cross-modal retrieval
14
are then obtained from these hypotheses are CM, SM and SCM. data are learned at the output of the network. A similarity met-
CM is an unsupervised approach that models cross-modal cor- ric is also presented for improving distance measure which is
relations, SM is a supervised method that relies on semantic inspired by large scale similarity learning. In [62], an extension
representation and SCM is the combination of both of them. of the CCA approach has been introduced, named multi-label
In [58], a cross-modal retrieval framework has been pre- CCA (ml-CCA). It learns the shared subspaces by taking care of
sented which outputs a ranked list of semantically relevant text high-level semantic information in the formation of multi-label
from a separate text corpus (having no related images) when annotations. This approach utilizes the multi-label information
queried using an image and vice versa. For these two tasks, a for generating correspondences instead of relying on explicit
novel Structural SVM based unified formulation has been pro- pairing among different modalities like CCA. A fast ml-CCA
posed. Two representations considered for both image and text technique is also presented in this which has the capability of
modality are: (a) uni-modal probability distributions over top- handling huge datasets.
ics learned using LDA, and (b) explicit learning multi-modal An unsupervised learning framework based on KCCA is pro-
correlations using CCA. The work done in [41] is an extension posed which identifies the relation between image annotation
of [58]. A new loss function based on normalized correlation is by humans and the corresponding importance of things and
introduced in this which is found to be better than the previous their layout in the scene [63]. This uncovered relation is uti-
two loss functions. Along with this, the proposed method is lized in increasing the accuracy of search results as per queries.
compared with other baseline methods, extensive analysis of A novel approach for image retrieval and auto-tagging has been
training, and run-time efficiency. Comparison based on two introduced in [64] which utilizes the object importance infor-
new evaluation metrics and recent image and text features is mation provided by keyword tag list. It is an unsupervised ap-
also incorporated in the new work. [59] has proposed a cross- proach based on KCCA which finds the relationship between
modal technique for extracting semantic relationship between image tagging by humans and the corresponding importance
classes using annotated images. Firstly, both visual features of objects and their outline in the scene. As the KCCA tech-
and text are projected onto a latent space using CCA, and then nique is non-parametric, so it scales poorly with the training set
the probabilistic interpretation of CCA is utilized for calculat- size and has trouble with huge real-world datasets [2]. To han-
ing the representative distribution of the latent variable for each dle KCCA drawbacks and to provide an alternative, Deep CCA
class. Two measures are obtained based on the representative (DCCA) has been proposed. It tackles the scalability issue and
distributions: (1) semantic relation between classes, and (2) ab- leads to better correlated representation space.
straction level of each class.
Classic CCA method has few drawbacks [54]: (1) It is able Graph regularization based methods. Cross-modal retrieval
to compute only the linear correlation between two sets of vari- typically includes two fundamental issues: (a) Relevance es-
ables, however, the relationship may be non-linear in most of timation; and (b) Coupled feature selection. In [65], authors are
the real-world implementations; (2) It is able to operate only on dealing with both the issues. To deal with the first issue, multi-
two modalities; (3) If it is applied on a supervised problem then modal data is mapped to a common subspace to measure the
it wastes the information available in the form of labels because similarity among modalities. Projection matrices are learned
it is an unsupervised technique, and (4) Intra-modal semantic for this mapping and l21 -norm penalties are imposed on them
consistency is an important factor to improve retrieval accuracy separately to deal with the second issue, which selects appro-
but CCA fails to capture this [60]. To handle the drawbacks priate and discriminative features from diverse feature spaces
of classic CCA, several variants of this method are introduced at the same time. Further, a multi-modal graph regularization
such as Generalized CCA (GCCA), Kernal CCA (KCCA), Lo- term is applied to the projected data to preserve intra and in-
cality Preserving CCA (LPCCA), and Deep CCA (DCCA) to ter modality similarity relationships. An iterative algorithm is
name a few. CCA extension techniques seek to construct a cor- introduced for solving the joint learning issue along with its
relation that maximizes non-linear projection. In [61], authors convergence analysis. The excessive experimentation on three
have introduced a new dataset containing images, text (para- popular datasets proved the proposed technique to outperform
graph), and hyperlinks. This dataset is named as WIKI-CMR the state-of-art techniques.
and it is composed of Wikipedia articles. It consists of to- To overcome the semantic and heterogeneity gap between
tal of 74961 documents including images, textual paragraphs, modalities, the potential correlation of diverse modalities need
and hyperlinks. Documents are classified into 11 diverse se- to be considered. Also, the semantic information of class labels
mantic classes. CCA and KCCA cross-modal retrieval tech- required to be utilized for reducing the semantic gap among
niques have been applied to the dataset. An Improved CCA different modalities as well as realizing the inter-dependence
(ICCA) technique has been proposed in [60] to control the lim- and interoperability of divergent modalities. So, authors in
itations of traditional 2-view CCA. For improvement in intra- [52] have proposed a cross-modal retrieval framework which is
modal semantic consistency, two effective semantic features are based on graph regularization and modality dependence, fully
proposed which are based on text features. Traditional 2-view utilizing the correlation between modalities. After consider-
CCA has been expanded to 4-view CCA and it is embedded into ing the semantic and feature correlation, projection matrices
an escalating framework to reduce the over-fitting. The frame- are learned separately for Image-to-Text and Text-to-Image re-
work combines training of linear projection and non-linear hid- trievals. Then the internal arrangement of original feature space
den layers to make sure that fine representations of input raw is utilized to construct an adjoining graph having semantic in-
15
formation constraints which enables the diverse labels of mis- Other subspace learning methods. A modality-dependent
cellaneous modality data to get closer to respective semantic cross-media retrieval (MDCR) model has been proposed in [68]
information. The whole process can be visualized in figure in which two couple of projections are learned for diverse cross-
(18). The objective function for I2T and T2I tasks are defined media retrieval tasks rather than one couple of projections. Two
in equation (9 and 10) respectively. couple of mappings are learned to project text and images from
original feature space into separate common latent subspaces
2 2
F(U1 , V1 ) = λ U1T X − V1T Y F
+ (1 − λ) U1T X − S F
+ by simultaneously optimizing the correlation between text and
αtr(U1 X T L1 XU1T − S T L1 S )+ (9) images and linear regression from one modal space to seman-
tic space. A novel discriminative dictionary learning (DDL)
β1 kU1 k2F + β2 kV1 k2F approach amplified with common label alignment has been in-
troduced in [69] for effective cross-modal retrieval. It increases
2 2
F(U2 , V2 ) = λ U2T X − V2T Y F
+ (1 − λ) V2T Y −S F
+ the discriminative ability of intra-modality information from di-
αtr(V2 Y T L2 YV2T − S T L2 S )+ (10) verse concepts and relevance of inter-modality information in
the same class. To handle the huge multi-modal web data, [70]
β1 kU2 k2F + β2 kV2 k2F has proposed a cluster-sensitive cross-modal correlation learn-
where U1 , U2 and V1 , V2 represent the image and text projec- ing framework. A novel correlation subspace learning tech-
tion matrices in I2T and T2I respectively. S is the semantic nique which learns a group of a cluster–sensitive sub-models is
matrix of image and text, X and Y represents the feature ma- presented to better fit the content divergence of various modal-
trices of image and text respectively, λ, α, β1 and β2 are bal- ities.
ance parameters. A semantic consistency cross-modal retrieval A Multi-ordered Discriminative Structured Subspace Learn-
with semi-supervised graph regularization (SCCMR) method ing (MDSSL) approach is proposed in [71]. This metric
is introduced in [66] which ensures a globally optimal solu- learning framework learns a discriminative structured subspace
tion by merging prediction of labels and optimization of pro- where data distribution is reserved for ensuring a required met-
jection matrices to a unified architecture. Simultaneously, the ric. An adversarial cross-modal retrieval method has been pro-
method also considers nearest neighbors in potential image-text posed in [72] which attempts to make an effective common sub-
subspace and image-text with the same semantics using graph space based on adversarial learning. To handle the problem of
embedding. discriminative features are captured from different multi-view embedding from diverse visual hints and modalities,
modalities by applying l21 -norm constraint to projection matri- a unified solution is proposed for subspace learning techniques
ces. which makes use of Rayleigh quotient [73]. It is extendable
for supervised learning, multiple views, and non-linear embed-
ding. A multi-view modular discriminant analysis (MvMDA)
approach is introduced for considering the view difference. Af-
ter getting motivation from the fact that un-annotated data can
be easily compiled and helps to utilize the correlations among
diverse modalities, a novel generalized semi-supervised struc-
tured subspace learning (GSS-SL) approach is proposed in [67]
for the task of cross-modal retrieval. For aligning diverse
modality data by moving one source modality to another tar-
get modality, a cross-modal retrieval approach with augmented
Figure 18: Process of cross-modal retrieval framework followed in [52] adversarial training is proposed in [74]. An augmented version
of the conditional generative adversarial network is utilized for
Inspired by the fact that unlabelled data can be composed reserving the semantic meaning in the modality transfer pro-
easily and aid to exploit the correlation between modali- cess.
ties, [67] has proposed a novel framework generalized semi-
supervised structured subspace learning (GSS-SL) for cross- 4.1.2. Statistical and probabilistic methods
modal retrieval. A label graph constraint is proposed for pre- Statistical methods include the Markov model (MM), Hid-
dicting appropriate concept labels for un-annotated data. For den Markov Model (HMM), Markov Random Field, and so
modeling correlation between modalities, GSS-SL utilizes the forth. Probabilistic methods incorporate the use of probability
label space as a linkage after consideration of the fact that con- and various probabilistic models. They are typically utilized to
cept labels directly unveils the semantic information of multi- find out the probability of generating a particular modality re-
modal data. Specifically, a joint minimization formulation is sult based on a given query modality. Scientific biomedical ar-
created from the combination of the label-linked loss function, ticles contain multi-modal information such as images and text.
label graph constraint, and regularization for learning discrim- Considering the growth of the healthcare industry, important
inative common subspace. Multiple linear transformations are text and images keep on hiding under the inessential data which
alternatively optimized by an effective optimization method for makes it hard to retrieve the relevant information. Biomedical
diverse modalities and updating of the class indicator matrices articles often contain annotation markers or tags such as letters,
for un-annotated data is also performed. stars, symbols, or arrows in figures which highlight the cru-
16
cial area in the figure. These markers are also correlated with phase. In the end, a dynamic interpolation algorithm is se-
the image captions and text in the article. Identification of the lected for dealing with the problem of fusion of loss function.
markers becomes important to extract the ROIs from images. A Cross-Modal Online Low-Rank Similarity function learning
A novel technique has been proposed in [26] with the combi- (CMOLRS) technique is proposed in [79] that learns a low-rank
nation of rule-based and statistical image processing ways for bilinear similarity measurement for the task of cross-modal re-
localizing and annotating the medical image regions or ROIs. trieval. A fast-CMOLRS technique is also introduced which
Moreover, a cross-modal image retrieval technique has been has less processing time than the former technique.
implemented on articles and it is based upon ROI identification
and classification. 4.1.4. Topic Models
Automatic image annotation and retrieval framework based Topic models are a kind of statistical model that finds the
on probabilistic models have been proposed in [75] with an as- abstract topics which arise in a set of documents. A cross-
sumption that image regions can be explained using blobs (a modal topic correlation model has been introduced in [80]
kind of vocabulary). Blob is an acronym for Binary Large Ob- which jointly models the text and image modalities. A sta-
ject and it is a collection of binary data that is stored as a sin- tistical correlation model is examined which is conditioned on
gle unit in a database. Blobs are created from image features category information. [81] proposed a novel supervised multi-
using clustering. To automatically annotate or retrieve images modal mutual topic reinforcement modeling (M3R) technique
using a word as a query, the trained probabilistic model pre- that makes a joint cross-modal probabilistic graphical model for
dicts the probability of producing a word with the help of im- finding the mutually consistent semantic topics using required
age blobs. After experimentation, the proposed probabilistic interaction between model factors.
model based on the cross-media relevance model is proved to A topic correlation model (TCM) is presented in [82] by mu-
be almost six times better than a model based on the word- tual modeling of images and text modalities for cross-modal
blob co-occurrence model and two times better than a model retrieval task. Images are represented by the bag-of-features
derived from machine translation in terms of mean precision. model based on SIFT and text is represented by topic distri-
An improvement of cross-media relevance model [75] is pre- bution learned from the latent topic model. These features are
sented in [76] to automatically assign related keywords to un- mapped into a common semantic space and statistical correla-
annotated images based on images’ train data. Images present tions are analyzed. These correlations are utilized for finding
in the training dataset are fragmented into parts and then these out the conditional probability of results in one modality while
parts are represented using a blob. K-means algorithm is used querying in another modality.
for blobs’ creation for clustering those image parts. Using this
model, the probability for assigning a keyword into a blob is 4.1.5. Machine learning and Deep learning based methods
predicted and after annotation success, one image part is repre- Machine learning (ML) refers to the capability of a ma-
sented by a keyword. TF-IDF method is used for text document chine to enhance its performance on the basis of previous out-
feature extraction and appropriate text documents are retrieved comes. ML approaches allow systems to learn without being
using images’ automatic annotation information. Experimen- programmed explicitly. Deep learning mimics the way the hu-
tation is performed on IAPR TC-12 and 500 Wikipedia web- man brain works for both feature extraction and classification
pages (landscape related) dataset to show the usefulness of the as discussed in [83]. This section includes the works which
proposed technique. are based on machine learning and deep learning. Summary of
deep learning based cross-modal systems incorporating image
4.1.3. Rank based methods and text have been presented separately in the table (18). In
These methods see the issue of cross-modal retrieval as a [40], authors have proposed a novel technique of multi-modal
problem of learning to rank. Ranking of images and tags is suit- Deep Belief Network for finding out the missing data in text
able for efficient tag recommendation or image search. In [77], or image modality. Also, the proposed model can be used for
a new Multi-correlation Learning to Rank (MLRank) approach multi-modal data retrieval as well as annotation purpose. After
is proposed for image annotation which ranks the tags for im- experimentation on MIR Flickr data containing images and cor-
ages as per their relevance after considering semantic impor- responding tags, the proposed model is found to be better than
tance and visual similarity. Two cases are defined: (a) image- bi-modal data of images and text. Moreover, its performance
bias consistency; and (b) tag-bias consistency that is developed outperforms the performance of Linear Discriminant Analysis
into an optimization problem for rank learning. (LDA) and Support Vector Machine (SVM) models. As the
In [78], a ranking model has been optimized as a listwise cross-modal data is heterogeneous in nature, so it is trouble-
ranking problem considering cross-modal retrieval process and some to compare directly. For making it comparable, authors in
a learning to rank with relational graph and pointwise con- [30] have made use of deep learning by proposing a deep corre-
straint (LR2GP) technique has been proposed. Firstly, a dis- lation mining technique. Various media features are trained in
criminative ranking model is introduced that utilizes the rela- this technique and then fused together with the help of correla-
tionship between a single modality for improvement in ranking tion between their trained features. Moreover, the Levenberg-
performance and learning of an optimal embedding shared sub- Marquart technique has been used for avoiding the local min-
space. A pointwise constraint is proposed in low-dimension ima problem in deep learning. Experiments are performed on
embedding space to make up for the real loss in the training image-audio and image-text databases to validate the proposed
17
solution. Authors have proposed a novel cross-modal retrieval convolutional network.
technique based on similarity theory and deep learning [84]. A novel correspondence autoencoder model is proposed in
They have utilized Local Binary Pattern (LBP) as an image de- [89] which is designed by correlating hidden representations of
scriptor and Deep Belief Network (DBN) as a deep learning two uni-modal autoencoders. For this model training, an opti-
algorithm. mal objective that minimizes the linear combination of repre-
In [85], a new Scalable Deep Multi-modal Learning (SDML) sentation learning errors for every mode and correlation learn-
data retrieval method has been introduced. A common sub- ing error between the hidden representation of the modalities.
space is predefined to maximize between-class variation and A correspondence restricted Boltzmann machine (Corr-RBM)
minimize within-class variation. Then a network is trained for is proposed in [90] for mapping the original features of modal-
each modality separately such that n networks are obtained for ity data into a low-dimensional common space where hetero-
n modalities. It is done to transform multi-modal data into the geneous data can be compared. Two deep neural structures are
common predefined subspace for achieving multi-modal learn- made from corr-RBM as the chief building block for the cross-
ing. The method is scalable to a number of modalities as it can modal retrieval process. Cross-modal retrieval is performed
train different modality-specific networks separately. It is the using CNN visual features with various classic approaches in
first proposed technique which is individually projecting data [91]. A deep semantic matching (DSM) technique is also intro-
of different modalities into a predefined common subspace. Ex- duced for handling cross-modal retrieval w.r.t. samples labeled
perimentation is performed on four benchmark datasets such as with one or multiple labels. In [92], authors have proposed a
PKU XMedia, Wikipedia, NUS-WIDE, and MS-COCO dataset deep and bidirectional representation learning model (DBRLM)
to validate the proposed technique. To solve the problem of where images and text are represented by two separate convo-
image-text cross-modal retrieval, various novel models are in- lutional based networks.
troduced in [86] which are designed by correlating hidden rep- A novel modal-adversarial hybrid transfer network has been
resentations of a pair of autoencoders. Minimizing correlation proposed in [93]. It realizes the knowledge transfer from the
learning error enables the model to learn invisible representa- single-modal source domain to the cross-modal target domain
tions by just utilizing the general information in diverse modal- and then learns the common cross-modal representation. The
ities. On the other hand, minimizing the representation learn- architecture is based on deep learning and is divided into two
ing error builds hidden representations good enough for recon- subnetworks: (a) Modal-sharing knowledge transfer subnet-
structing inputs of each modality. A specific parameter is set in work; and (b) Modal adversarial semantic learning subnet-
the models to make a balance between two types of error gener- work. A deep learning model has been introduced in [94],
ated by representation and correlation learning. Models are di- named, AdaMine (ADAptive MINing Embedding) for learn-
vided into two groups: (1) one contains three models that recon- ing the common representation of recipe items incorporating
struct both modalities and so named as multimodal reconstruc- recipe images and their recipe in textual form. In [95], au-
tion correspondence autoencoder, and (2) the second contains thors have proposed a novel approach generative cross-modal
two models that reconstruct a single modality and so named learning network (GXN) which includes generative processes
as unimodal reconstruction correspondence autoencoder. Ex- into the cross-modal feature embedding which will be useful in
perimentation is performed on three popular datasets and the learning both global abstract features and local grounded fea-
proposed technique is found to be better than two popular mul- tures. A deep neural network based approach known as hybrid
timodal deep models and three CCA based models. representation learning (HRL) is proposed for learning com-
Supervised cross-modal retrieval techniques provide better mon representation for each modality [96].
accuracy than unsupervised techniques at the additional cost of A new deep adversarial metric learning (DAML) technique
data labeling or annotation. Lately, semi-supervised techniques is introduced for cross-modal retrieval which maps annotated
are gaining popularity as they provide a better framework to data pairs of diverse modalities non-linearly into a shared la-
balance the trade-off between annotation cost and retrieval ac- tent feature subspace [97]. The inter-concept difference is max-
curacy. A novel deep semi-supervised framework is proposed imized and the intra-concept difference is minimized. Each
in [87] to handle both annotated and un-annotated data. Firstly, data pair difference caught from modalities of the same class
an un-annotated part of training data is labeled using the la- is also minimized. Motivated by zero-shot learning, [98] has
bel prediction component and then a common representation presented a ternary adversarial network with self-supervision
of both modalities is learned to perform cross-modal retrieval. (TANSS) model. It includes three parallel sub-networks: (1)
The two modules of the network are trained in a sequential two semantic feature learning subnetworks which capture the
manner. After extensive experimentation on pascal, Wikipedia, intrinsic data structures of diverse modes and preserve their re-
and NUS-WIDE datasets, the proposed framework is found to lationships using semantic features in shared semantic space;
be outperforming both supervised and semi-supervised exist- (2) a self-supervised semantic subnetwork that utilizes seen and
ing methods. In [88], authors have introduced an image-text unseen label word vectors to use them as guidance for supervis-
multi-modal neural language model which can be utilized for ing semantic feature learning and increases knowledge transfer
retrieving related images from complex sentence queries and to unseen labels; and (3) adversarial learning scheme is used
vice versa. It has been presented here that text representations for maximizing the correlation and consistency of semantic fea-
and image features can be jointly learned in the case of image- tures among various modalities. This whole network facilitates
text modeling by training the models in conjunction using a effective iterative parameter optimization. In [99], a shared se-
18
mantic space with correlation alignment (S3CA) is proposed for Hashing function is practically utilized in a hash table data
cross-modal data representation. It aligns the non-linear corre- structure which is highly popular for quick data lookup.
lations of cross-modal data distribution in deep neural networks
made for diversified data. • Nearest neighbour (NN): It represents one or more data
entities in A = [a1 , a2 , ..., an ] ∈ RD which are nearest to the
4.1.6. Other methods query point aq .
This section includes the summary of those works which can- • Approximate nearest neighbour (ANN): It attempts to find
not be classified under any of the above classes. In [100], au- a data point a x ∈ A which is an ε−approximate nearest
thors have proposed an Annotation by Image-to-Concept Dis- neighbour of the query point aq in that ∀a x ∈ A, the dis-
tribution Model (AICDM) for image annotation using the links tance between a x and a satisfies the relation d(a x , a) ≤
between visual features and human concepts from image-to- (1 + ε)d(aq , a).
concept distribution. There is a rapid increase in the discussions
regarding disaster and emergency management on social me- Cross-modal hashing techniques are effective in resolving
dia these days. Flood event observation has a principal role in the issue of large scale cross-modal retrieval because it com-
emergency management and the related videos and images are bines the benefits of classic cross-modal retrieval and hash-
also uploaded and searched on the web while disasters. This ing. These techniques either rely on annotated training data
data can be helpful in emergency management by using it in or they lack semantic analysis [106]. For correlating diverse
sensors. Inspired by this, authors in [25] are performing image modalities, typical cross-modal hashing techniques learn a uni-
retrieval enhancement in the field of floods and flood aids. Inte- fied hash space. Then the search process is improved based
gration of image and text features is performed after extracting on hash codes. Hashing methods are broadly classified into
visual features from images using BoW and text features using Data-dependent and Data-independent methods [107]. In data-
TF-IDF and weirdness. After extensive experimentation on US dependent methods, an appropriate hash function is learned us-
FEMA and Facebook datasets, it has been demonstrated that ing the available training data, however, the hash function is
the proposed method is enhancing the emergency management generated using random mapping independent of the training
efficiency by showing improvement in image recognition with data in data-independent methods. Hash function learning is
the incorporation of text features in it. categorized into two stages: (1) Dimensionality reduction; and
Images are ranked as per similarity of semantic features in (2) Quantization. Dimensionality reduction means mapping the
the query by semantic example retrieval. So, in [38], the accu- information from the original space to a low-dimensional spa-
racy of semantic features is improved using cross-modal regu- tial representation. Quantization means a linear or non-linear
larization which is based on associated text. transformation of actual features to binary segment the feature
space for acquiring hash codes. The aim of hashing methods
4.2. Binary representation learning or Cross-modal hashing is to minimize the semantic gap among modalities as much as
In general, the word hash means chop and mix which con- possible. A typical resolution for this issue can be learning of a
secutively means that the hashing function chops and mixes in- uniform hash code to make it more consistent. Another resolu-
formation to obtain hash results [101]. The idea of hashing was tion can be the minimization of the coding distance and enhance
first introduced by H. P. Luhn in his 1953 article A new method its compactness. Hashing taxonomy followed in this survey is:
of recording and searching information [102]. Entire informa- (1) General hashing methods which are defined first; and (2)
tion regarding the birth of hashing is presented in [103]. It is Deep learning based hashing methods which are defined later
nearly impossible to achieve a completely even distribution. It in a different subsection. General hashing methods include all
can only be created by considering of structure of keys. For a the methods which do not incorporate deep learning. Figure
random group of keys, it is impractical to generate an appropri- (19) presents an evolution of cross-modal hashing techniques.
ate generic hash function as the keys are not known beforehand. Table (5) presents the comparison of hashing techniques on
Random uniform hash works best in this case. So, inspired by various characteristics such as optimization, time complexity,
the need of using random access system having a huge capacity hash function, and distance metric utilized for similarity cal-
for business applications, Peterson gave an estimation for the culation. While optimizing the objective function, either the
amount of search needed for the exact location of a record in relaxation is given for easy optimization or not which we call
numerous storage systems including the sorted-file and index discrete type. Relaxation of discrete hash codes may result
table method [104]. Then the term hashing was first used by in quantization loss and performance degradation [108]. Time
Morris in his article [105] in 1968. Few general definitions in complexity mentioned here is for the whole method execution
hashing are described below [101]: where n is the number of training samples used in it. Hash-
ing models can be categorized into linear and non-linear type
• Hashing function: This function (h(·)) is used to map the [109]. The distance metric is the metric utilized in the inter or
random size of data to a fixed interval [0, p]. Given a data intra similarity among modalities’ calculation.
having n data points i.e. A = [a1 , a2 , ..., an ] ∈ RD (real
coordinate space of dimension D) and a hashing function 4.2.1. General hashing methods
h(·), then h(A) = [h(a1 ), h(a2 ), ..., h(an )] ∈ [0, p] are known This section includes all the cross-modal retrieval works
as hashes or hash values of data points represented by A. based on hashing technique and which does not incorporate a
19
Figure 19: Evolution of research in cross-modal hashing
Table 5: Comparison of hashing methods on the basis of various characteristics. T = Traditional hashing method and D = Deep learning based hashing method
Characteristics Type Hashing method Methods
T LCMH [110], QCH [111]
Relaxation
D TDH [112], SDCH [113]
T DLSH [114], SRLCH [109]
Optimization Discrete
D DVSH [115], DCMH [116]
T MFDH [117], MSFH [108], SMFH [118]
Alternative solution
D ZSH [119]
Linear T UCH [120], LCMH [110], CMSTH [106], MSFH [108]
Hash function T SRLCH [109]
Non-linear
D DVSH [115]
T QCH [111]
Cosine
D DVSH [115]
T LCMH [110], MSFH [108]
Distance metric Euclidean
D ZSH [119]
T DLSH [114], CMSTH [106]
Hamming
D DCMH [116], TDH [112], AADAH [121]
deep learning approach. In [120], authors have proposed an in the end hash functions are learned for projecting the modal-
Unsupervised Concatenation Hashing (UCH) technique where ities to a unified hash space. A new cross-modal hashing tech-
Locally Linear Embedding and Locality Preserving Projection nique is proposed in [110] to handle the method scalability is-
are introduced for reconstructing the manifold structure of orig- sue in the training period. The time complexity of the technique
inal space in the hamming space. l2,1 -norm regularization is varies linearly with training data size which allows scalable in-
imposed on the projection matrices for exploiting the diverse dexing for multi-media search over various modalities. Hash
characteristics of various modalities. The proposed technique functions are learned accurately while considering inter and in-
has been compared with other hashing techniques such as CVH, tra modality similarities. Experiments are performed on NUS-
IMH, RCH, FSH, and CCA [122] as well. CVH [123] is an ex- WIDE and Wikipedia dataset to prove the effectiveness of the
tension of classic uni-modal spectral hashing [124] to multi- method. The objective function utilized here for preservation
modal field. In IMH [125], learned binary codes conserve of inter-similarity between modalities for the bi-modal case is
both inter and intra-media consistency. FSH [126] embeds the defined as:
graph-based fusion similarity to a common hamming space.
In RCH [127], common hamming space is learned in diverse
2
modalities’ binary codes are created as consistent as possible. min B(1) − B(2) ;
B(1) ,B(2) F
Table (6 shows the comparison of these techniques when ap-
T
plied on Wikipedia and Pascal dataset. This comparison is s.t., B(i) e = 0, (11)
based on MAP scores when images are retrieved from the text
b(i) ∈ {−1, 1},
(T2I), the text is retrieved from image (I2T) and the average of T
both scores. Bold values in the table represent the highest MAP B(i) B(i) = Ic , i = 1, 2;
score in the respective task and hash code length.
In [106], authors have introduced Cross-Modal Self-Taught where B(1) and B(2) represents the data matrices of image and
Hashing (CMSTH) technique for both cross-modal and uni- text modalities, e is n × 1 vector having each entry equal to 1,
modal image retrieval. It can successfully catch the semantic k·kF is a Frobenius norm, Ic is c × c identity matrix, B depicts
T
correlation from un-annotated training data. Three steps are fol- final binary codes obtained, constraint B(i) e = 0 needs each bit
T
lowed in the learning procedure: (1) Hierarchical Multi-Modal has same chance to be 1 or −1 and constraint B(i) B(i) = Ic
Topic Learning (HMMTL) is proposed for identifying multi- requires the bits of each modality to be acquired separately.
2
modal topics using semantic information; (2) Robust Matrix Loss function term B(1) − B(2) F obtains the maximal consis-
Factorization (RMF) is utilized for transferring the multi-modal tency (or the minimal difference) on two object representations.
topics to hash codes which form a unified hash space, and (3) Equation (11) is extended for more than two modality case and
20
Figure 20: Cross-modal hashing approach proposed in [109]
the new general equation obtained is: of one modal and yTi represents ith row of data matrix Y ∈ Rn×dy
of another modal. d x and dy are dimensions of the modalities.
p X
p
X 2 Similarity information between data points across domains is
min B(i) − B( j) ;
B(i) ,i=1,...,p F defined as: S i j = 1 iff xi and y j are similar and 0 otherwise.
i=1 i< j
s.t., B (i)T
e = 0, (12) min O(Bx , By , W x , Wy ) = (kBx − XW x k2F +
2
X
b(i) ∈ {−1, 1}, By − YWy F ) − α0 S ij
T (i, j)
B(i) B(i) = Ic , i = 1, ..., p,
(13)
q q
xiT W x WyT y j − xiT W x W xT xi yTj Wy WyT y j
where p represents no. of diverse modalities and rest of the
notations are same as eq. (11). s.t. W xT W x = Ic×c
The issue of cross-modal hashing is how to efficiently con- WyT Wy = Ic×c
struct the correlation among diverse modality representations in
the hash function learning process. Most of the traditional hash- where Bx ∈ {−1, 1}n×c and By ∈ {−1, 1}n×c are two kinds of bi-
ing techniques map the miscellaneous modality data to a joint nary codes with same code length c for each object. W x ∈ Rdx ×c
abstraction space by linear projections similar to CCA. Due to and Wy ∈ Rdy ×c depicts two projection matrices for two modali-
this, these methods are unable to effectively reduce the seman- ties, WyT means transpose of a matrix Wy and similarly for other
tic gap among modalities which has been proved to lead to bet- matrices. α0 represents control parameter for balancing quan-
ter accuracy in information retrieval. So to tackle this issue, a tization loss and cosine similarity constraint. For making W x
Latent Semantic Sparse Hashing method has been proposed in and Wy as orthogonal projections, constraints W xT W x = Ic×c and
[128]. This method executes the cross-modal similarity with WyT Wy = Ic×c are used.
the use of sparse coding, for capturing important images’ struc- Most of the classic hashing techniques either suffer from high
tures, and matrix factorization, for learning latent concepts from training costs or fail to capture the diverse semantics of various
the text. In [111], a quantized correlation hashing (QCH) tech- modalities. In order to tackle this issue, [114] has presented an
nique is proposed which considers the quantization loss over efficient Discrete Latent Semantic Hashing (DLSH) approach.
different modalities and the relation among them simultane- Firstly, it learns the latent semantic representations of miscella-
ously. The relation among diverse modalities that explains the neous modalities and afterward, projects them into a common
similar object is established by maximizing the correlation be- hamming space for supporting scalable cross-modal retrieval.
tween the hash codes across modes. The resultant objective This approach directly correlates the explicit semantic labels
function is converted to a uni-modal formulation which is then with binary codes, so it increases the discriminative ability of
optimized using another process. Objective function is defined learned hashing codes. Unlike traditional hashing approaches,
in equation (13). Suppose two modalities (xi , yi ) are represent- DLSH directly learns binary codes using an effective discrete
ing n object, where xiT depicts ith row of data matrix X ∈ Rn×dx hash optimization. The overall objective function of the DLSH
21
Table 6: Comparison of benchmark techniques on the basis of MAP scores on Wikipedia and Pascal VOC dataset with different hash code lengths presented in
[120].
Length of hash codes
Tasks Methods Wikipedia Pascal VOC 2007
16 32 64 128 16 32 64 128
CVH [123] 0.1499 0.1408 0.1372 0.1323 0.1484 0.1187 0.1651 0.1411
CCA [122] 0.1699 0.1519 0.1495 0.1472 0.1245 0.1267 0.123 0.1218
IMH [125] 0.2022 0.2127 0.2164 0.2171 0.2087 0.2016 0.1873 0.1718
I2T RCH [127] 0.2102 0.2234 0.2397 0.2497 0.2633 0.3013 0.3209 0.333
FSH [126] 0.2346 0.2491 0.2531 0.2573 0.289 0.3173 0.334 0.3496
UCH LPP [120] 0.242 0.2497 0.255 0.2576 0.2706 0.3074 0.3255 0.3277
UCH LLE [120] 0.2429 0.2518 0.2578 0.2588 0.2905 0.3245 0.3345 0.3396
CVH 0.1315 0.1171 0.108 0.1093 0.0931 0.0945 0.0978 0.0918
CCA 0.1587 0.1392 0.1272 0.1211 0.1283 0.1362 0.1465 0.1553
IMH 0.1648 0.1703 0.1737 0.172 0.1631 0.1558 0.1537 0.1464
T2I RCH 0.2171 0.2497 0.2825 0.2973 0.2145 0.2656 0.3275 0.3983
FSH 0.2149 0.2241 0.2332 0.2368 0.2617 0.303 0.3216 0.3428
UCH LPP 0.2351 0.2518 0.2623 0.2689 0.3945 0.4877 0.5187 0.5321
UCH LLE 0.2363 0.2567 0.2845 0.2993 0.4106 0.4913 0.5217 0.5343
CVH 0.1407 0.129 0.1226 0.1208 0.1208 0.1066 0.1315 0.1165
CCA 0.1643 0.1456 0.1384 0.1341 0.1264 0.1315 0.1347 0.1386
IMH 0.1835 0.1915 0.1951 0.1946 0.1859 0.1787 0.1705 0.1591
Average RCH 0.2137 0.2365 0.2611 0.2735 0.2389 0.2834 0.3242 0.3657
FSH 0.2248 0.2366 0.2431 0.247 0.2753 0.3102 0.3278 0.3462
UCH LPP 0.2385 0.2508 0.2586 0.2632 0.3326 0.3976 0.4221 0.4299
UCH LLE 0.2396 0.2542 0.2712 0.2791 0.3506 0.4079 0.4281 0.437
approach for two modalities is given as: increases the training accuracy. Both hash functions and uni-
2
fied binary codes are learned at the same time using an iterative
alternative optimization algorithm. Using these hash functions
X
min kφ(Xi ) − Ui Ai k2F +
Ui |i=1,2 ,Ai |i=1,2 ,Wi |i=1,2 ,Q and binary codes, multi-modal data can be effectively indexed
i=1
2 and searched. The framework of the proposed SRLCH tech-
X
β kB − Wi Ai k2F + nique is shown in figure (20).
i=1 (14)
2 2
X X In [118], authors have proposed an approach of supervised
δkB − QYk2F + γ kUi k2F + kWi k2F + kQk2F matrix factorization hashing for using label information and
i=1 i=1
effective cross-modal retrieval. This method is based on col-
s.t.B ∈ {−1, 1}L×N lective matrix factorization which considers both local geomet-
ric consistency in each mode and label consistency across sev-
where B is binary hash code matrix, k·kF is the Frobenius norm
eral modalities. To resolve the issue of quantization loss which
of matrix, L is hash code length and N is no. of training in-
happens by relaxing discrete hash codes in the cross-modal re-
stances, Xi denotes the original feature matrices of modalities,
trieval process, [108] has proposed a multi-modal graph reg-
Q is semantic transfer matrix, Ai ∈ Rk×N is the latent semantic
ularized smooth matrix factorization hashing which is an un-
representation of modalities and k is its dimension, Ui ∈ Rm×k
supervised technique. The aim of this technique is to learn
is basis matrix and m is no. of anchors, Wi ∈ RL×k represents
unified hash codes for multi-media data in a common latent
projection matrices for two sub-retrieval tasks, φ(Xi ) ∈ Rm×N
space where similarity of diverse modalities can be identified
is Gaussian kernel projection of image and text features, β and
efficiently.
δ are penalty parameters and γ is regularization parameter for
avoiding over-fitting.
In [109], authors have proposed a novel supervised Subspace [117] utilizes multiple views for image and text representa-
Relation Learning for Cross-modal Hashing technique which tion to enhance feature information. A discrete hashing learning
utilizes the relation information of labels in semantic space for framework is proposed which employs complementary infor-
making similar data from diverse modalities nearer in the low- mation among multiple views to make discriminative compact
dimension hamming subspace. This technique preserves the hash codes learning better. It performs classifier and subspace
discrete constraints, modality relations, and non-linear struc- learning simultaneously for completing multiple searches at the
tures while admitting a closed-form binary code solution which same time.
22
4.2.2. Cross-modal hashing methods based on deep learning shot hashing learns a hashing model that is trained using only
Deep learning has become highly popular in recent years. samples from seen classes, however, it has the capability of
Features extracted by deep learning methods have a powerful good generalization for unseen classes’ samples. Typically, it
capability of expressing the data and they also have rich seman- utilizes the class attributes to seek a semantic embedding space
tic information contained in them [106]. Thus, multi-media information retrieval accuracy improves significantly when hashing methods are combined with deep learning. Various works incorporating cross-modal hashing methods based on deep learning have been introduced recently and are discussed in this section.

Capturing the spatial dependency of images and the temporal dynamics of text is an important task in learning potential feature representations and cross-modal relations, as it reduces the heterogeneity gap among modalities. To this end, a Deep Visual Semantic Hashing (DVSH) model has been introduced in [115]. It creates concise hash codes of textual sentences and images in an end-to-end deep learning architecture that captures the essential cross-modal correspondences between natural language and visual data. DVSH has a hybrid deep framework that comprises a visual-semantic fusion network to learn a joint embedding space of text and images, and two mode-specific hashing networks to learn hash functions for generating concise binary codes. The framework unites cross-modal hashing and joint multi-modal embedding through a combination of an RNN over sentences, a CNN over images, and a structured max-margin objective that ties everything together to facilitate the learning of similarity-preserving, high-quality hash codes. Many cross-modal hashing techniques rely on hand-crafted features, which may limit their accuracy. A deep cross-modal hashing technique has therefore been introduced in [116] that combines hash-code learning and feature learning in the same end-to-end framework, with one deep neural network per modality performing feature learning from scratch.

A triplet-based deep hashing network is proposed in [112]. First, triplet labels, which express the relative relationship among three instances, are used as supervision to capture more general semantic correlations among cross-modal instances. To boost the discriminative ability of the hash codes, a loss function is formulated from intra-modal and inter-modal views. Finally, graph regularization is used to preserve the original semantic similarity between hash codes in Hamming space. A deep adversarial hashing network with an attention mechanism has been proposed in [121] to improve the measurement of content similarity by focusing on the informative parts of multi-media items. It has three modules: (a) a feature learning module for obtaining feature representations; (b) an attention module for creating an attention mask; and (c) a hashing module for learning hash functions. Another deep cross-modal hashing framework is proposed in [113], which combines hash-code and feature learning in the same network; it considers both inter- and intra-modality correlation and uses a loss function with dual semantic supervision for hash learning.

In [119], a cross-modal zero-shot hashing method has been introduced which efficiently utilizes both labeled and unlabeled data for transferring knowledge from seen classes to unseen classes; it may, however, perform poorly when little labeled data is available. In [129], the authors propose a multi-level semantic supervision generating method that exploits label relevance, together with a deep hashing framework for multi-label image-text cross-modal retrieval. It can simultaneously capture binary similarity as well as the complex multi-label semantic structure of the data in diverse forms.
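To make the retrieval side of these hashing methods concrete, the sketch below shows how binary codes produced by modality-specific encoders can be compared by Hamming distance to answer a text-to-image query. It is only an illustration: the random projection matrices stand in for trained deep hashing networks, and all names, dimensions and data are assumptions rather than the setup of any cited method.

```python
# Illustrative sketch (not a specific published method): cross-modal retrieval
# with binary hash codes ranked by Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
CODE_LEN = 16                  # hash code length in bits
IMG_DIM, TXT_DIM = 512, 300    # assumed feature dimensions for images and text

# Random projections stand in for the trained image/text hashing networks.
W_img = rng.normal(size=(IMG_DIM, CODE_LEN))
W_txt = rng.normal(size=(TXT_DIM, CODE_LEN))

def hash_image(x):
    """Binarize the projected image feature(s) into {0,1} codes."""
    return (x @ W_img > 0).astype(np.uint8)

def hash_text(t):
    """Binarize the projected text feature(s) into {0,1} codes."""
    return (t @ W_txt > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)

# Text-to-image (T2I) retrieval on dummy data.
image_feats = rng.normal(size=(1000, IMG_DIM))      # database image features
db_codes = hash_image(image_feats)                  # (1000, CODE_LEN) codes
text_query = rng.normal(size=TXT_DIM)               # one text query feature
ranking = hamming_rank(hash_text(text_query), db_codes)
print("Top-5 retrieved image indices:", ranking[:5])
```

In the deep methods discussed above, the thresholding is applied to the outputs of learned networks (for example a CNN branch for images and an RNN or bag-of-words branch for text) trained so that semantically related image-text pairs receive nearby codes; the ranking step itself, however, remains a Hamming distance comparison as sketched here.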
5. Benchmark datasets

With the advent of huge multi-modal data generation, cross-modal retrieval has become a crucial and interesting problem. Researchers have composed diverse multi-modal datasets for evaluating the proposed cross-modal techniques. Figure (21) presents the evolution of the datasets in recent years. A summary of prominent multi-modal datasets is given in table (7), which includes dataset name, mode, total concepts, dataset size, image representation, text representation, related article, and data source. Figure (22) presents a graph of the total number of categories in the datasets. Information regarding prominent benchmark datasets is given in the following points. After going through all the references related to cross-modal retrieval used in this survey, the approximate usage frequencies of popular datasets were counted and are represented as a bar chart in figure (23).

Figure 21: Evolution of benchmark datasets

Figure 22: A chart displaying the total number of categories in the popular datasets

Figure 24: Two examples from NUS-WIDE dataset in which an image is associated with numerous related tags

1. NUS-WIDE1 [130]: This is a real-world web image dataset composed by the Lab for Media Search at the National University of Singapore. It consists of: (a) 2,69,648 images and associated tags from Flickr with 5,018 unique tags,

1 https://lms.comp.nus.edu.sg/wp-content/uploads/2019/guidelines.html
2 https://www.imageclef.org/photodata
7 http://press.liacs.nl/mirflickr/
available in 2 sizes: 25k and 1M. The images have been collected from Flickr for research purposes related to image content and image tags. Moreover, tags and EXIF (Exchangeable image file format) image metadata have also been extracted and made publicly available. Image tags are presented in two forms: (a) the raw form in which they are obtained from users; and (b) the processed form, where raw tags have been cleaned by Flickr (e.g. removal of spaces and special characters). In MIR Flickr 25k data, images have been manually annotated. Each image has an average of 8.94 tags, and 1,386 tags are associated with at least 20 images. Images are split into 15,000 training and 10,000 testing images. MIR Flickr 1M data is an extension of MIR Flickr 25k; its images have not been annotated manually, unlike the original 25k data. Images are represented using MPEG-7 edge histogram, homogeneous texture descriptors, and color descriptors.

6. INRIA-Websearch [136]: This dataset consists of 71,478 images resulting from a web search engine for 353 miscellaneous search queries. Top-ranked images have been chosen from this search along with their corresponding metadata and ground-truth annotations. For each searched query, the dataset comprises the initial textual query, the top-ranked images, and an annotation file. More than 200 images have been retrieved for 80% of the queries. The annotation file consists of manual labels for image relevance to the query and other related metadata such as the web page URL, image URL, page title, the image's alternate text, the 10 words before the image on the web page, and the 10 words after it. Images have been scaled to fit in a 150 × 150 pixel square while preserving the original aspect ratio.

7. Flickr 8k and 30k8 [137], [138]: Flickr 30k is an extension of the Flickr 8k dataset. Both datasets have been created from the Flickr website. Flickr 8k contains 8,092 images and its main focus is on people or animals (mainly dogs) carrying out some action. Images have been collected manually from six different Flickr groups and annotated with multiple captions in the form of sentences by selected workers from the US. Flickr 30k contains 31,783 images of everyday scenes, activities, and events. The images are associated with 1,58,915 captions obtained via crowd-sourcing. The approach followed to collect this data is the same as that followed by [137].

8. PASCAL sentence data9 [139]: The images for this dataset have been collected from the PASCAL VOC 2008 challenge [132]. The data consists of 1,000 images selected from around 6,000 images of the PASCAL VOC 2008 training data. Images have been categorized into 20 categories depending upon the objects that appear in them, and a few images are present in multiple classes. Fifty random images have been chosen from each class to compose the dataset. Each image is annotated with five different captions in the form of sentences.

9. MS-COCO10 [140]: The Microsoft Common Objects in COntext (MS COCO) dataset has been composed of pictures of daily scenes containing common objects in their usual environment. The objects are labelled using per-instance segmentation to help in precise object localization. The dataset consists of a total of 3,28,000 images with 25,00,000 labelled instances. The objects chosen for the dataset are from 91 diverse categories. The annotation pipeline has been divided into three prominent exercises: (1) labelling concepts which are present in the image; (2) locating and marking all instances of the labelled concepts; and (3) segmentation of each object instance.

10. WIKI-CMR [61]: This dataset has been collected from Wikipedia articles which contain images, paragraphs and hyperlinks. The authors mainly focused on the areas of geography, people, nature, culture and history for dataset collection. It consists of a total of 74,961 documents categorized into 11 diverse concepts. Each document includes one paragraph, one associated image (or no image), a category label and hyperlinks. Images are represented using eight types of features including dense SIFT, Gist, PHOG, LBP and other features. Text is represented using TF-IDF.

8 http://shannon.cs.illinois.edu/DenotationGraph/
9 http://vision.cs.uiuc.edu/pascal-sentences/
10 http://cocodataset.org/

6. Comparative analysis

In this section, prominent evaluation metrics used for cross-modal retrieval performance analysis are defined. Afterward, comparisons of various cross-modal retrieval methods when applied on diverse datasets are presented on the basis of MAP score.

6.1. Evaluation metrics

For image and text modalities, two cross-modal retrieval directions are considered: (a) image to text retrieval (I2T), i.e. retrieving text related to a query image; and (b) text to image retrieval (T2I), i.e. retrieving images that match a textual query [1]. Precisely, in the testing phase, given a text or an image query, the aim of the cross-modal method is to search and retrieve the images or text that closely match the query modality respectively. A retrieved outcome is considered to be relevant if it belongs to the same concept as the query modality. Two typical factors considered during quantitative performance evaluation are: (1) class relevance evaluation between query and outcome; and (2) examining cross-modal relevance for image-text pairs. The first factor reflects the ability to learn diverse cross-modal latent representations, while the second reflects the capability of learning correlated latent concepts [81]. The metrics related to these two factors are as follows:
Table 7: Summary of prominent image-text multi-modal datasets

Sr. No. | Dataset | Year | Mode | Total concepts | Total images / text | Image representation | Text representation
1 | IAPR TC-12 [131] | 2006 | Image / caption | Diverse | 20,000 / 60,000 | - | -
2 | MIRFlickr 25k [134] | 2008 | Image / tags | Diverse | 25,000 / 2,23,500 | - | -
3 | NUS-WIDE [130] | 2009 | Image / tags | 81 | 2,69,648 / 5,018 unique tags | Color correlogram, wavelet texture, color histogram, BoW based on SIFT descriptions, edge direction histogram and block-wise color moments | Tag occurrence feature
4 | ImageNet11 [141] | 2009 | Images / synsets | 12 subtrees | 32,00,000 / 5,247 | SIFT | -
5 | Wikipedia [56] | 2010 | Image / text | 29 (10 major) | 2,866 / 2,866 | SIFT | LDA
6 | Pascal VOC 2007 [132] | 2010 | Image / tags | 20 | 9,963 / 24,640 | - | -
7 | MIRFlickr 1M [135] | 2010 | Image / tags | Diverse | 10,00,000 / 89,40,000 | MPEG-7 edge histogram and homogeneous texture descriptors, color descriptor | Flickr user tags, EXIF metadata
8 | INRIA-websearch [136] | 2010 | Image / labels | 353 | 71,478 / - | - | -
9 | Pascal sentence data [139] | 2010 | Image / sentences | 20 | 1,000 / 5,000 | - | -
10 | Wikipedia POTD [142] | 2011 | Images / paragraphs | NA | 1,987 / 1,987 | SIFT | Text tokenization using rainbow
11 | Flickr 8k [137] | 2013 | Image / captions or sentences | Diverse | 8,092 / 40,460 | - | -
12 | WIKI-CMR [61] | 2013 | Images / paragraphs / hyperlinks | 11 | 38,804 / 74,961 | SIFT, gist, PHOG, LBP, self similarity, spatial pyramid method | TF-IDF
13 | Flickr 30k [138] | 2014 | Image / captions or sentences | Diverse | 31,783 / 1,58,915 | - | -
14 | MS COCO [140] | 2014 | Images / labels | 91 | 3,28,000 / 25,00,000 | - | -

11 http://www.image-net.org
1. Precision, recall and PR curve: Precision is defined as the ratio of TP to TP + FP, where TP is the number of outcomes which are similar to the query and TP + FP is the number of total retrieved outcomes. It is useful in measuring the probability of success for an information retrieval system. On the other hand, Recall is defined as the ratio of TP to TP + FN, where TP is the same as explained above and TP + FN is the total number of relevant outcomes in the repository. It is useful in measuring the percentage of retrieved relevant results for an information retrieval system
[76, 84]. Refer to table (8) for a complete understanding of the definitions of precision and recall. Precision and recall can be expressed as (eq. 15, 16):

prec = \frac{TP}{TP + FP}    (15)

rec = \frac{TP}{TP + FN}    (16)

where prec represents precision, rec is recall, TP indicates true positives, FP is false positives and FN represents false negatives.

Table 8: Table for better understanding of precision and recall

              | Relevant            | Irrelevant           | Total
Retrieved     | True Positive (TP)  | False Positive (FP)  | Predicted Positive
Not retrieved | False Negative (FN) | True Negative (TN)   | Predicted Negative
Total         | Actual Positive     | Actual Negative      | TP + FP + TN + FN

Most of the works [143, 52, 144, 145] have used the precision-recall curve to visualize the performance of their algorithm. The curve indicates the precision value at different recall levels. Authors in [146] have also used the precision curve for performance visualization; it indicates the change in precision with respect to the number of retrieved results.

2. F-measure: It is a typical metric utilized for evaluating the performance of information retrieval systems [84]. After considering the effects of both precision and recall, F-measure can be defined mathematically as eq. (17):

F = \frac{(\theta^2 + 1) \times prec \times rec}{\theta^2 \times (prec + rec)}    (17)

where \theta adjusts the weighted proportion of recall (rec) and precision (prec). If \theta is set to 1, the F-measure reduces to F_1 (eq. 18):

F_1 = \frac{2 \times prec \times rec}{prec + rec}    (18)

Here, F_1 balances recall and precision; the higher the F_1 value, the better the algorithm.

3. MAP: Mean Average Precision (MAP) is the most popular metric used for evaluating the performance of a cross-modal retrieval algorithm. It measures whether the retrieved result belongs to the same class as the query data (relevant) or not (irrelevant) [81]. It is the mean of the average precision calculated over all queries. Given a query (an image or a text) and a group of its O corresponding retrieved outcomes, average precision is defined as (eq. 19):

AP = \frac{1}{R} \sum_{o=1}^{O} P(o)\, rel(o)    (19)

where R is the number of relevant outcomes among the retrieved outcomes, P(o) is the precision of the top o retrieved outcomes, and rel(o) = 1 if the o-th retrieved outcome is relevant and 0 otherwise. Now, MAP can be defined as (eq. 20):

MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q    (20)

where Q is the total number of queries. A large MAP value signifies a better cross-modal algorithm on a particular dataset.

4. Percentage: The MAP metric only considers whether an outcome is relevant to the query or not. For a more precise evaluation, all the retrieved outcomes are ranked by correlation. Typically, a query text (or image) is considered successful in retrieving results if its corresponding ground-truth image (or text) appears in the first a percent of the ranked list of retrieved outcomes. Percentage is the ratio of correctly retrieved query outcomes among all the query outcomes. Authors in [84, 81, 142] have utilized this metric for algorithm evaluation and have chosen the value of a as 0.2, i.e. 20%.

5. MRR: Mean Reciprocal Rank (MRR) is another performance evaluation metric similar to percentage. It has been applied in [84, 81] to evaluate methods with respect to the position of the ground-truth outcome paired with the query. It is mathematically expressed as (eq. 21):

MRR = \frac{1}{|O|} \sum_{n=1}^{|O|} \frac{1}{rank_n}    (21)

where |O| is the number of query outcomes and rank_n indicates the position of the unique ground-truth item paired with the n-th query in the retrieved set.
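As a concrete illustration of the metrics defined above, the following minimal sketch computes precision, recall, F1, average precision, MAP and MRR for ranked retrieval lists. It is plain Python with toy relevance judgments; the function names and data are illustrative and are not taken from any of the surveyed works.

```python
# Illustrative implementations of the retrieval metrics of Section 6.1.
# A "relevance" list marks each retrieved outcome as relevant (1) or not (0).

def precision_recall(relevance, total_relevant):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(relevance)
    prec = tp / len(relevance) if relevance else 0.0
    rec = tp / total_relevant if total_relevant else 0.0
    return prec, rec

def f1_score(prec, rec):
    """F1, the harmonic mean of precision and recall (theta = 1 in eq. 17)."""
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

def average_precision(relevance):
    """Eq. 19: mean of P(o) over the ranks o at which relevant items appear."""
    relevant_ranks = [i + 1 for i, r in enumerate(relevance) if r]
    if not relevant_ranks:
        return 0.0
    return sum(sum(relevance[:k]) / k for k in relevant_ranks) / len(relevant_ranks)

def mean_average_precision(relevance_per_query):
    """Eq. 20: average of the per-query average precision values."""
    return sum(average_precision(r) for r in relevance_per_query) / len(relevance_per_query)

def mean_reciprocal_rank(ground_truth_ranks):
    """Eq. 21: average of 1/rank of each query's paired ground-truth outcome."""
    return sum(1.0 / r for r in ground_truth_ranks) / len(ground_truth_ranks)

# Toy example: two queries, each with a ranked list of 5 retrieved outcomes.
runs = [[1, 0, 1, 1, 0],     # query 1: relevant items at ranks 1, 3, 4
        [0, 1, 0, 0, 1]]     # query 2: relevant items at ranks 2, 5
prec, rec = precision_recall(runs[0], total_relevant=4)
print("precision=%.2f recall=%.2f F1=%.2f" % (prec, rec, f1_score(prec, rec)))
print("MAP =", mean_average_precision(runs))
print("MRR =", mean_reciprocal_rank([1, 2]))  # ground truth found at ranks 1 and 2
```

In a cross-modal setting the same functions apply to both directions: an I2T run scores retrieved texts against the class of an image query, and a T2I run scores retrieved images against the class of a text query.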
Table 9: Comparison of prominent hashing techniques on the basis of MAP scores on NUS-WIDE dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
IMH [125] 0.2056 0.2145 0.2317 0.2381 0.2533 0.2613 0.22185 0.2339 0.2465
LSSH [128] 0.4933 0.5006 0.5069 0.5084 0.625 0.6578 0.6823 0.6913 0.55915 0.5792 0.5946 0.59985
QCH [111] 0.5395 0.5489 0.5568 0.5741 0.54815 0.5615
CMSTH [106] 0.5032 0.5073 0.527 0.5439 0.4761 0.4965 0.5088 0.5243 0.48965 0.5019 0.5179 0.5341
FSH-S [126] 0.4996 0.461 0.4556 0.4776 0.446 0.4423 0.4886 0.4535 0.44895
FSH [126] 0.5059 0.5063 0.5171 0.479 0.481 0.4965 0.49245 0.49365 0.5068
SMFH [118] 0.4553 0.4623 0.4658 0.468 0.5033 0.5056 0.5065 0.5079 0.4793 0.48395 0.48615 0.48795
MFDH [117] 0.646 0.6714 0.7014 0.7121 0.7811 0.8285 0.8653 0.8824 0.71355 0.74995 0.78335 0.79725
DLSH [114] 0.5127 0.516 0.5179 0.5203 0.5234 0.5284 0.5165 0.5197 0.52315
DCMH [116] 0.5903 0.6031 0.6093 0.6389 0.6511 0.6571 0.6146 0.6271 0.6332
AADAH [121] 0.6403 0.6294 0.652 0.6789 0.6975 0.7039 0.6596 0.66345 0.67795
TDH H [112] 0.6393 0.6626 0.6754 0.6647 0.6758 0.6803 0.652 0.6692 0.67785
TDH C [112] 0.6393 0.6626 0.6754 0.6647 0.6758 0.6803 0.652 0.6692 0.67785
ZSH1 [119] 0.6411 0.6434 0.6457 0.6468 0.6755 0.6763 0.6789 0.6796 0.6583 0.65985 0.6623 0.6632
ZSH2 [119] 0.5982 0.6017 0.6033 0.6059 0.6286 0.6297 0.6325 0.6339 0.6134 0.6157 0.6179 0.6199
ZSH3 [119] 0.1733 0.1756 0.1771 0.1783 0.1721 0.1736 0.1743 0.1748 0.1727 0.1746 0.1757 0.17655
ZSH4 [119] 0.1481 0.1492 0.1511 0.1519 0.1437 0.1453 0.1475 0.1498 0.1459 0.14725 0.1493 0.15085
SDCH [113] 0.813 0.834 0.841 0.823 0.857 0.868 0.818 0.8455 0.8545
Table 10: Comparison of prominent hashing techniques on the basis of MAP scores on Wikipedia dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
LSSH [128] 0.233 0.234 0.2387 0.234 0.5571 0.5743 0.571 0.5577 0.39505 0.40415 0.40485 0.39585
QCH [111] 0.2343 0.2477 0.3034 0.317 0.26885 0.28235
CMSTH [106] 0.3155 0.3293 0.3313 0.3375 0.3562 0.37 0.3825 0.3878 0.33585 0.34965 0.3569 0.36265
SMFH [118] 0.2572 0.2759 0.2863 0.2913 0.5784 0.604 0.6163 0.6219 0.4178 0.43995 0.4513 0.4566
MFDH [117] 0.3548 0.3763 0.3878 0.3954 0.8318 0.8458 0.8568 0.8666 0.5933 0.61105 0.6223 0.631
DLSH [114] 0.2838 0.3429 0.352 0.6764 0.7478 0.749 0.4801 0.54535 0.5505
ZSH1 [119] 0.2998 0.3017 0.3035 0.3063 0.3016 0.3025 0.3044 0.3061 0.3007 0.3021 0.30395 0.3062
ZSH2 [119] 0.2543 0.2551 0.2576 0.2581 0.2526 0.2541 0.2563 0.2587 0.25345 0.2546 0.25695 0.2584
ZSH3 [119] 0.1214 0.1233 0.1247 0.1251 0.1178 0.1196 0.1218 0.1232 0.1196 0.12145 0.12325 0.12415
ZSH4 [119] 0.0982 0.0997 0.1012 0.1019 0.0936 0.0949 0.0971 0.0995 0.0959 0.0973 0.09915 0.1007
Figure 25: Average MAP score chart of different hashing methods on NUS-WIDE data

6.2. Comparison of results using diverse techniques

This section presents the comparison of various cross-modal retrieval techniques on the primary datasets. Techniques are compared on the basis of the MAP score, as it is the most popular and widely used evaluation metric. Three MAP scores are considered: I2T (image queries retrieving related text), T2I (text queries retrieving matching images), and the average of these two values. Table (9) shows the MAP scores on the NUS-WIDE dataset. Blank spaces in the tables indicate that no value is reported for that particular hash code length. The bold value in each hash code column represents the highest value in that column. Figure (25) presents a chart of the average MAP scores of table (9). It is evident from table (9) and figure (25) that the SDCH [113] method performs best in both I2T and T2I tasks, while MFDH [117] shows the best performance at a hash code length of 128. Tables (10 and 11) present the MAP scores on the Wikipedia and MIRFlickr 25k datasets respectively. For table (10), the best results are obtained by the MFDH [117] technique in both I2T and T2I tasks. On the MIRFlickr 25k dataset, the ZSH1 [119] method shows the best performance at a hash code length of 128, and SDCH [113] performs best otherwise. Table (12) shows the comparison of various deep learning based cross-modal hashing techniques on the IAPR TC-12 dataset. The SDCH [113] method has the highest MAP score in both I2T and T2I tasks for all hash code lengths except 128; at length 128, the DVSH-B [115] method shows the highest performance for both tasks. The average MAP results of tables (10, 11 and 12) are visualized in figures (26, 27 and 28) for the Wikipedia, MIRFlickr and IAPR TC-12 datasets respectively.

Tables (13 and 14) show the comparison of various real-valued learning techniques based on MAP score on the Wikipedia and NUS-WIDE datasets respectively. Four types of methods are included: (1) deep learning based methods; (2) subspace learning methods; (3) topic models; and (4) rank-based methods. A MAP score in bold font represents the highest value in that particular column, and italic font represents the highest value within a particular method type.
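The average columns in these tables are simply the mean of the corresponding I2T and T2I MAP scores at each code length. The small sketch below shows how such a comparison could be assembled and the best method per code length identified; the numbers are invented for illustration and are not taken from the tables above.

```python
# Illustrative assembly of a MAP comparison table (numbers are invented).
# For each method and hash code length we hold a pair of (I2T, T2I) MAP scores.
scores = {
    "Method-A": {16: (0.64, 0.66), 32: (0.65, 0.68)},
    "Method-B": {16: (0.59, 0.63), 32: (0.61, 0.65)},
}

for length in (16, 32):
    averages = {m: sum(v[length]) / 2 for m, v in scores.items()}
    best = max(averages, key=averages.get)
    row = ", ".join(f"{m}: {avg:.3f}" for m, avg in averages.items())
    print(f"{length}-bit average MAP -> {row}; best: {best}")
```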
Table 11: Comparison of prominent hashing techniques on the basis of MAP scores on MIRFlickr 25k dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
FSH-S [126] 0.609 0.5969 0.593 0.6036 0.5944 0.5923 0.6063 0.59565 0.59265
FSH [126] 0.5968 0.6189 0.6195 0.5924 0.6128 0.6091 0.5946 0.61585 0.6143
MFDH [117] 0.6836 0.6939 0.7066 0.723 0.7408 0.7506 0.7602 0.7797 0.7122 0.72225 0.7334 0.75135
DLSH [114] 0.6379 0.648 0.6603 0.6764 0.6777 0.685 0.65715 0.66285 0.67265
DCMH [116] 0.741 0.7465 0.7485 0.7827 0.79 0.7932 0.76185 0.76825 0.77085
AADAH [121] 0.7563 0.7719 0.772 0.7922 0.8062 0.8074 0.77425 0.78905 0.7897
TDH H [112] 0.711 0.7228 0.7289 0.7422 0.75 0.7548 0.7266 0.7364 0.74185
TDH C [112] 0.711 0.7228 0.7289 0.7422 0.75 0.7548 0.7266 0.7364 0.74185
DMSH [129] 0.726 0.737 0.75 0.755 0.763 0.775 0.7405 0.75 0.7625
ZSH1 [119] 0.7812 0.7831 0.7862 0.7874 0.7964 0.7989 0.8025 0.8037 0.7888 0.791 0.79435 0.79555
ZSH2 [119] 0.7302 0.7334 0.7351 0.7363 0.7092 0.7113 0.7132 0.7148 0.7197 0.72235 0.72415 0.72555
ZSH3 [119] 0.2126 0.2135 0.2141 0.2147 0.2016 0.2023 0.2027 0.2031 0.2071 0.2079 0.2084 0.2089
ZSH4 [119] 0.1873 0.1899 0.1917 0.1926 0.1795 0.1807 0.1816 0.1822 0.1834 0.1853 0.18665 0.1874
SDCH [113] 0.845 0.866 0.873 0.831 0.856 0.863 0.838 0.861 0.868
Table 12: Comparison of prominent deep learning based hashing techniques on the basis of MAP scores on IAPR TC-12 dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
DVSH [115] 0.5696 0.6321 0.6964 0.7236 0.6037 0.6395 0.6806 0.6751 0.58665 0.6358 0.6885 0.69935
DVSH-B [115] 0.626 0.6761 0.7359 0.7554 0.6285 0.6728 0.6922 0.6756 0.62725 0.67445 0.71405 0.7155
DVSH-Q [115] 0.5385 0.6113 0.6869 0.7097 0.5684 0.6153 0.6618 0.6693 0.55345 0.6133 0.67435 0.6895
DVSH-I [115] 0.4792 0.5035 0.5583 0.589 0.4903 0.5496 0.589 0.6012 0.48475 0.52655 0.57365 0.5951
DVSH-H [115] 0.4575 0.4975 0.5493 0.569 0.4396 0.4853 0.5185 0.5337 0.44855 0.4914 0.5339 0.55135
DCMH [116] 0.4526 0.4732 0.4844 0.5185 0.5378 0.5468 0.48555 0.5055 0.5156
AADAH [121] 0.5293 0.5283 0.5439 0.5358 0.5565 0.5648 0.53255 0.5424 0.55435
SDCH [113] 0.726 0.787 0.803 0.704 0.783 0.797 0.715 0.785 0.8
Figure 26: Average MAP score chart of different hashing methods on Wikipedia data
Figure 27: Average MAP score chart of different hashing methods on MIRFlickr data
Table 16: Comparison of general and deep learning based hashing methods
Hashing method | Generality | Modeling complexity | Retrieval performance | Parameter scale | Hardware cost
General | Poor | Complex | Fair | Small | Small
Deep learning based | Good | Simple | Good | Large | Large
Algorithm level:
3. Need of a scalable algorithm | Algorithms have restrictions of data size, modalities and application areas
4. Need of a reproducible cross-modal retrieval method | Most methods are applicable in a particular application area
5. Cross-modal retrieval implementation in big data, cloud, and IoT environments | Rarely applied
6. More utilization of semi-supervised cross-modal retrieval techniques | Less used
7. Lack of huge datasets incorporating diverse modalities | Most existing benchmark datasets are old and consist of only image and text modality
Data level:
8. Requirement of proper and exact labeling of images | Poor and noisy annotations
9. Diversity in data composition area | Datasets majorly composed from common social media websites
Table 18: Summary of works done in image-text cross-modal retrieval
SR. TECHNIQUE IMAGE REP. TEXT REP. DATA TYPE METRIC REF.
Real-valued representation techniques
1 Linkage of each image Bag-of-Words CiCui system [154], US FEMA flood data, Face- - MAP [25]
feature with text feature TF-IDF and weird- book pages’ and groups’ data
ness [155] related to floods
2 Structural SVM, SR, BoW of dense SIFT features Probability distribu- UIUC Pascal Sentence dataset, - BLEU score, [58]
CSR (using CCA) tion IAPR TC-12 benchmark and rouge score
SBUCaptioned Photo dataset
3 Markov Random Field Image moments, gray level co- Bag-of-Keywords Thoracic CT scan data of nine Supervised Precision, [26]
(MRF) and Hidden occurrence matrix (GLCM) distinct concepts containing recall and their
Markov Model (HMM) moments, auto-correlation 842 ROIs (created) curve, ten-fold
coefficients (AC), edge fre- cross valida-
quency (EF), Gabor filter tion accuracy,
descriptor, Tamura descriptor, classification
color edge directional de- accuracy
scriptor (CEDD), fuzzy color
texture histogram (FCTH)
descriptor and combined
texture feature
4 Cluster sensitive cross- Wavelet feature, 3 level spa- TF-IDF and Latent Image Clef and Wikipedia Semi-supervised MAP [70]
modal correlation learn- cial max-pooling, GIST, dense Dirichlet alloca- [156] dataset
ing framework SIFT with sparse coding, tion(LDA)
PHOG and color histogram
5 AICDM Scalable color descriptor, - ESP, pascal VOC 2007, web - PR curve [100]
color layout descriptor, ho- image
mogeneous texture descriptor,
edge histogram, grid color
moment and gabor wavelet
moment
6 Probabilistic model of Blobs to represent image re- TF-IDF IAPR TC-12 and 500 - Precision, [76]
automatic image annota- gions Wikipedia web-pages dataset recall, F-
tion measure
7 Joint feature selection Gist, SIFT LDA Pascal, Wikipedia and NUS- - MAP, PR [65]
and subspace learning WIDE dataset curve
8 Local Group based Con- GIST, HoG word frequency fea- LabelMe, Wikipedia, Pascal Supervised MAP, PR [157]
sistent Feature Learning ture, latent Dirich- VOC2007, NUS-WIDE curve
(LGCFL) let allocation model
with 10 dimensions
9 KCCA based approach Gist, color histogram, BoVW word frequency, rel- Pascal VOC 2007, labelme, Unsupervised Normalized [64]
ative tag rank, abso- discounted
lute tag rank cumulative
gain
10 Structural SVM based SIFT, BoVW, LDA BoW, LDA IAPR TC-12 Benchmark, - BLEU, pre- [41]
unified framework UIUC Pascal Sentence, cision, recall,
SBU-Captioned Photo median rank,
MAP
11 Cross mOdal Similar- SIFT, GIST latent Dirichlet allo- Wikipedia , Pascal VOC2007, Supervised MAP [42]
ity Learning algorithm cation model NUSWIDE-1.5K, LabelMe
with Active Queries
(COSLAQ)
12 CM, SM, SCM SIFT LDA Wikipedia Unsupervised MAP, PR [56]
curve
13 Graph regularization and CNN LDA INRIA-websearch, Pascal sen- - MAP, PR [52]
modality dependence tence, Wikipedia 2010 curve
(GRMD)
14 CCA, KCCA SIFT, gist, PHOG, LBP, self TF-IDF WIKI-CMR - Precision [61]
similarity, spatial pyramid
method
15 Improved CCA - - NUS-WIDE, Pascal sentence, - MAP [60]
Wikipedia
16 Unsupervised KCCA Gist, HSV color histogram, Words frequency, Labelme, Pascal VOC Unsupervised Normalized [63]
based framework SIFT relative tag rank, Discounted
absolute tag rank Cumulative
Gain (NDCG)
17 MLRank Gist, color histogram, color - Corel 5k, NUS-WIDE, IAPR Semi-supervised Precision, re- [77]
suto-correlation, edge direc- TC12 call, F1 score,
tion histogram, wavelet tex- MAP, N+ (no.
ture, block-wise color mo- of keywords
ments with non-zero
recall value)
18 CM, SM, SCM SIFT LDA TVGraz, Wikipedia Supervised and MAP, PR [57]
unsupervised curve
19 CCA and its probabilistic RGB-SIFT Binary features MIRFlickr 1M - Precision [59]
interpretation
20 Regularizer of image se- SIFT LDA TVGraz, Wikipedia, pascal - MAP, PR [38]
mantics sentence dataset curve
21 Modality-dependent CNN visual features LDA Wikipedia, pascal sentence, Supervised MAP [68]
cross-media retrieval INRIA-websearch
(MDCR) model
22 Semantic consistency CNN, VGG LDA, BoW Wikipedia, pascal sentence, Semi-supervised MAP, PR [66]
cross-modal retrieval NUS-WIDE-10k, INRIA- curve
(SCCMR) websearch
Binary-valued or cross-modal hashing techniques
23 Unsupervised Concate- Gist word frequency Pascal, UCI handwritten digit, Unsupervised MAP [120]
nation Hashing (UCH) count Wikipedia
24 Cross-modal self-taught SIFT, HoG, GIST TF-IDF Wikipedia, NUS-WIDE Unsupervised MAP [106]
hashing (CMSTH)
25 Linear cross-modal hash- SIFT LDA NUS-WIDE, Wikipedia - MAP, recall [110]
ing
26 Latent semantic sparse Sparse coding Matrix factorization Labelme, NUS-WIDE, - MAP, PR [128]
hashing Wikipedia curve
27 Quantized correlation SIFT, BoW LDA, tag vector 58W-CIFAR, NUS-WIDE, Supervised MAP, preci- [111]
hashing Wikipedia sion
28 Discrete Latent Semantic SIFT, gist, edge histogram Topic vectors, index Labelme, MIRFlickr 25k, Supervised MAP, PR [114]
Hashing vector of selected NUS-WIDE, Wikipedia curve
tags, feature vector
derived from PCA,
binary tagging vec-
tor
29 Subspace Relation SIFT, gist LDA, tag occur- ImageNet, Labelme, MIR- Supervised MAP, preci- [109]
Learning for Cross- rence feature vector Flickr 25k, NUS-WIDE, sion
modal Hashing UCI handwritten digit data,
Wikipedia
30 Deep Visual Semantic Deep f c7 features BoW vector IAPR TC-12, MS COCO - MAP, preci- [115]
Hashing model sion
31 Deep cross-modal hash- Gist, bag-of-visual-words BoW vector IAPR TC-12, MIRFlickr 25k, - MAP, PR [116]
ing (BOVW) NUS-WIDE curve
32 Triplet-based deep hash- SIFT BoW MIRFlickr 25k, NUS-WIDE Supervised MAP, PR [112]
ing network curve
33 Attention-Aware Deep - BoW IAPR TC-12, NUS-WIDE, - MAP, PR [121]
Adversarial Hashing MIRFlickr 25k curve
34 Supervised matrix factor- SIFT Topic vector, BoW NUS-WIDE, Wikipedia Supervised MAP, pre- [118]
ization hashing cision, PR
curve
35 Semantic deep cross- - BoW IAPR TC-12, MIRFlickr, Supervised MAP, preci- [113]
modal hashing NUS-wIDE sion curve, PR
curve
36 Zero-shot hashing BoVW, SIFT, gist LDA, BoW MIRFlickr, NUS-WIDE, Semi-supervised MAP [119]
Wikipedia
37 Deep multi-level seman- - BoW MIRFlickr 25k Supervised MAP, PR [129]
tic hashing curve
38 Cycle-Consistent Deep CNN LDA Microsoft COCO, IAPR TC- - MAP, preci- [158]
Generative Hashing 12, wiki sion curve, PR
(CYC-DGH) curve
39 Multi-modal graph reg- SIFT, BoW, edge histogram LDA, tag vector fea- MIRFlickr, NUS-WIDE, Unsupervised MAP, PR [108]
ularized smooth matrix ture vectors Wikipedia curve
factorization hashing
40 Multi-view feature dis- SIFT, histogram feature, Word vector, mean MIRFlickr, MMED, NUS- Supervised MAP, PR [117]
crete hashing BoVW vector, covariance WIDE, Wikipedia curve
matrix, feature
histogram
Cross-modal methods based on deep learning
41 Multi-modal Deep Belief Image specific DBN which Text specific DBN MIR Flickr Data Unsupervised MAP [40]
Network (DBN) used Gaussian Restricted which used Repli-
Boltzmann Machines (RBM) cated Softmax
model
42 Levinberg-Marquardt Deep neural network Deep neural net- Wikipedia Articles data - Precision [30]
deep canonical correla- work recovery curve
tion analysis (LMDCCA)
43 Cross-media multiple GIST, Pyramid Histogram of BoW NUSWIDE-10k, Wikipedia, - MAP, PR [159]
deep network Words (PHoW), MPEG-7, Pascal Sentences curve
SIFT, color correlogram, color
histogram, wavelet texture,
edge direction histogram,
block-wise color moments
44 Deep canonical correla- color histogram, color cor- Bag-of-Words Wikipedia, pascal, NUS- Supervised MAP [160]
tion analysis(DCCA) relogram, edge direction (BoW) WIDE10k
histogram, wavelet texture,
block-wise color moments,
SIFT, GIST, MPEG-7
45 Deep coupled met- SIFT, BoVW, GIST, color his- Latent Dirichlet Wikipedia, Pascal VOC 2007, - Precision, [161]
ric learning (DCML) togram allocation (LDA) NUS-WIDE MAP, ROC
method model and CMC
curve
46 Deep semi-supervised CNN, GIST, SIFT 100-d, 399-d and Wikipedia, pascal VOC, NUS- Semi-supervised MAP [87]
framework 1000-d word freq WIDE
vectors
47 Correspondence autoen- Pyramid Histogram of Words Bag-of-Words Wikipedia, Pascal, NUS- - MAP [86]
coder (PHOW), MPEG-7 descrip- WIDE
tors and Gist
48 Multitask learning ap- 4096-dimensional vector ex- 1386/2000- MIRFLICKR-25K, MS - MAP [162]
proach with 3 modules: tracted by the fc10 layer of dimensional bag-of- COCO
Correlation Network, VGGNet word vectors
Cross-modal Autoen-
coder, Latent Semantic
Hashing
49 Deep Adversarial Met- SIFT, VGG LDA, BoW Wikipedia, pascal, NUS- Supervised MAP [97]
ric Learning approach WIDE
(DAML)
50 Deep Pairwise Ranking CNN, GIST, SIFT 100-d, 399-d and Wikipedia, pascal, NUS- Supervised MAP [163]
model with multi- 1000-d word freq WIDE
label information for vectors
Cross-Modal retrieval
(DPRCM)
51 Deep Belief Network LBP - NUS-WIDE, Wikipedia - MAP, percent- [84]
age, MRR
52 Log-Bilinear Model - - IAPR TC-12, attribute discov- - Bleu, per- [88]
ery, SBU captioned photos plexity and
retrieval evalu-
ation
9. Conclusion

From this review on cross-modal information retrieval, it has been found that cross-modal retrieval techniques are better than classic uni-modal systems at retrieving multi-modal data and at adding value by complementing meaningful information. The article summarizes the prominent works done by various researchers in the field of image-text cross-modal retrieval. Primary information has been presented with the help of tables, figures, and graphs to make it more understandable. A taxonomy of cross-modal retrieval techniques has been demonstrated. Information regarding popular image-text multi-modal datasets has been presented, and comparisons among various cross-modal techniques applied to particular datasets are shown. Miscellaneous applications in the field of cross-modal retrieval are mentioned and the general architecture is shown. Challenges and open issues have also been discussed to help the research community in further research. Although significant work has been proposed in this field, we are still far from an ideal position in terms of accuracy, and the approach has not yet been widely adopted in most real-life applications. Moreover, ample work remains to be done to introduce better novel algorithms or to enhance the retrieval efficiency of the classic algorithms. It is expected that this article will be useful for readers and researchers to better understand the present situation and the state-of-the-art cross-modal retrieval methods, and that it will motivate further research in the field.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016).
[2] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2) (2018) 423–443.
[3] M. Ayyavaraiah, B. Venkateswarlu, Cross media feature retrieval and optimization: A contemporary review of research scope, challenges and objectives, in: International Conference on Computational Vision and Bio Inspired Computing, Springer, 2019, pp. 1125–1136.
[4] Y. Peng, X. Huang, Y. Zhao, An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges, IEEE Transactions on Circuits and Systems for Video Technology 28 (9) (2017) 2372–2385.
[5] M. Ayyavaraiah, B. Venkateswarlu, Joint graph regularization based semantic analysis for cross-media retrieval: a systematic review, International Journal of Engineering & Technology 7 (2.7) (2018) 257–261.
[6] Y.-x. Peng, W.-w. Zhu, Y. Zhao, C.-s. Xu, Q.-m. Huang, H.-q. Lu, Q.-h. Zheng, T.-j. Huang, W. Gao, Cross-media analysis and reasoning: advances and directions, Frontiers of Information Technology & Electronic Engineering 18 (1) (2017) 44–57.
[7] M. Priyanka, B. S. Devi, S. Riyazoddin, M. J. Reddy, Analysis of cross-media web information fusion for text and image association - a survey paper, Global Journal of Computer Science and Technology (2013).
[8] B. Kitchenham, S. Charters, Guidelines for performing systematic literature reviews in software engineering (2007).
[9] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic literature reviews in software engineering - a systematic literature review, Information and Software Technology 51 (1) (2009) 7–15.
[10] B. E. Stein, T. R. Stanford, B. A. Rowland, Development of multisensory integration from the perspective of the individual neuron, Nature Reviews Neuroscience 15 (8) (2014) 520–535.
[11] R. L. Miller, B. A. Rowland, Multisensory integration: How the brain combines information across the senses, Computational Models of Brain and Behavior (2017) 215–228.
[12] R. K. Srihari, Use of captions and other collateral text in understanding photographs, in: Integration of Natural Language and Vision Processing, Springer, 1995, pp. 245–266.
[13] B. E. Stein, M. A. Meredith, The merging of the senses, The MIT Press, 1993.
[14] B. E. Stein, M. A. Meredith, W. S. Huneycutt, L. McDade, Behavioral indices of multisensory integration: orientation to visual cues is affected by auditory stimuli, Journal of Cognitive Neuroscience 1 (1) (1989) 12–24.
[15] M. Otoom, Beyond von Neumann: Brain-computer structural metaphor, in: 2016 Third International Conference on Electrical, Electronics, Computer Engineering and their Applications (EECEA), IEEE, 2016, pp. 46–51.
[16] B. P. Yuhas, M. H. Goldstein, T. J. Sejnowski, Integration of acoustic and visual speech signals using neural networks, IEEE Communications Magazine 27 (11) (1989) 65–71.
[17] C. Saraceno, R. Leonardi, Indexing audiovisual databases through joint audio and video processing, International Journal of Imaging Systems and Technology 9 (5) (1998) 320–331.
[18] D. Roy, Integration of speech and vision using mutual information, in: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, IEEE, 2000, pp. 2369–2372.
[19] H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature 264 (5588) (1976) 746–748.
[20] T. Westerveld, D. Hiemstra, F. De Jong, Extracting bimodal representations for language-based image retrieval, in: Multimedia'99, Springer, 2000, pp. 33–42.
[21] T. Westerveld, Image retrieval: Content versus context, in: RIAO, Citeseer, 2000, pp. 276–284.
[22] C. Xiong, D. Zhang, T. Liu, X. Du, Voice-face cross-modal matching and retrieval: A benchmark, arXiv preprint arXiv:1911.09338 (2019).
[23] A. C. Duarte, Cross-modal neural sign language translation, in: Proceedings of the 27th ACM International Conference on Multimedia, ACM, 2019, pp. 1650–1654.
[24] S. Mariooryad, C. Busso, Exploring cross-modality affective reactions for audiovisual emotion recognition, IEEE Transactions on Affective Computing 4 (2) (2013) 183–196.
[25] M. Jing, B. W. Scotney, S. A. Coleman, M. T. McGinnity, X. Zhang, S. Kelly, K. Ahmad, A. Schlaf, S. Gründer-Fahrer, G. Heyer, Integration of text and image analysis for flood event image recognition, in: 2016 27th Irish Signals and Systems Conference (ISSC), IEEE, 2016, pp. 1–6.
[26] M. M. Rahman, D. You, M. S. Simpson, S. K. Antani, D. Demner-Fushman, G. R. Thoma, Interactive cross and multimodal biomedical image retrieval based on automatic region-of-interest (ROI) identification and classification, International Journal of Multimedia Information Retrieval 3 (3) (2014) 131–146.
[27] Z. Liu, H. Liu, W. Huang, B. Wang, F. Sun, Audiovisual cross-modal material surface retrieval, Neural Computing and Applications (2019) 1–9.
[28] D. Cao, Z. Yu, H. Zhang, J. Fang, L. Nie, Q. Tian, Video-based cross-modal recipe retrieval, in: Proceedings of the 27th ACM International Conference on Multimedia, ACM, 2019, pp. 1685–1693.
[29] M. Lazaridis, A. Axenopoulos, D. Rafailidis, P. Daras, Multimedia search and retrieval using multimodal annotation propagation and indexing techniques, Signal Processing: Image Communication 28 (4) (2013) 351–367.
[30] D. Xia, L. Miao, A. Fan, A cross-modal multimedia retrieval method us-
ing depth correlation mining in big data environment, Multimedia Tools trieval based on graph regularization, Mobile Information Systems 2020
and Applications (2019) 1–16 (2019). (2020).
[31] X. Zhai, Y. Peng, J. Xiao, Heterogeneous metric learning with joint [53] H. Hotelling, Relations between two sets of variates, in: Breakthroughs
graph regularization for cross-media retrieval, in: Twenty-seventh AAAI in statistics, Springer, 1992, pp. 162–190 (1992).
conference on artificial intelligence, 2013 (2013). [54] C. Guo, D. Wu, Canonical correlation analysis (cca) based multi-view
[32] B. Elizalde, S. Zarar, B. Raj, Cross modal audio search and retrieval with learning: An overview, arXiv preprint arXiv:1907.01693 (2019).
joint embeddings based on text and audio, in: ICASSP 2019-2019 IEEE [55] D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation
International Conference on Acoustics, Speech and Signal Processing analysis: An overview with application to learning methods, Neural
(ICASSP), IEEE, 2019, pp. 4095–4099 (2019). computation 16 (12) (2004) 2639–2664 (2004).
[33] Y. Yu, S. Tang, F. Raposo, L. Chen, Deep cross-modal correlation learn- [56] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet,
ing for audio and lyrics in music retrieval, ACM Transactions on Multi- R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia
media Computing, Communications, and Applications (TOMM) 15 (1) retrieval, in: Proceedings of the 18th ACM international conference on
(2019) 20 (2019). Multimedia, 2010, pp. 251–260 (2010).
[34] D. Zeng, Y. Yu, K. Oyama, Deep triplet neural networks with cluster-cca [57] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet,
for audio-visual cross-modal retrieval, arXiv preprint arXiv:1908.03737 R. Levy, N. Vasconcelos, On the role of correlation and abstraction in
(2019). cross-modal multimedia retrieval, IEEE transactions on pattern analysis
[35] P. Tripathi, P. P. Watwani, S. Thakur, A. Shaw, S. Sengupta, Discover and machine intelligence 36 (3) (2013) 521–535 (2013).
cross-modal human behavior analysis, in: 2018 Second International [58] Y. Verma, C. Jawahar, Im2text and text2im: Associating images and
Conference on Electronics, Communication and Aerospace Technology texts for cross-modal retrieval., in: BMVC, Vol. 1, Citeseer, 2014, p. 2
(ICECA), IEEE, 2018, pp. 1818–1824 (2018). (2014).
[36] J. Imura, T. Fujisawa, T. Harada, Y. Kuniyoshi, Efficient multi-modal [59] M. Katsurai, T. Ogawa, M. Haseyama, A cross-modal approach for ex-
retrieval in conceptual space, in: Proceedings of the 19th ACM interna- tracting semantic relationships between concepts using tagged images,
tional conference on Multimedia, ACM, 2011, pp. 1085–1088 (2011). IEEE transactions on multimedia 16 (4) (2014) 1059–1074 (2014).
[37] P. Goyal, S. Sahu, S. Ghosh, C. Lee, Cross-modal learning for multi- [60] J. Shao, Z. Zhao, F. Su, T. Yue, Towards improving canonical correlation
modal video categorization, arXiv preprint arXiv:2003.03501 (2020). analysis for cross-modal retrieval, in: Proceedings of the on Thematic
[38] J. C. Pereira, N. Vasconcelos, Cross-modal domain adaptation for text- Workshops of ACM Multimedia 2017, 2017, pp. 332–339 (2017).
based regularization of image semantics in image retrieval systems, [61] W. Xiong, S. Wang, C. Zhang, Q. Huang, Wiki-cmr: A web cross modal-
Computer Vision and Image Understanding 124 (2014) 123–135 (2014). ity dataset for studying and evaluation of cross modality retrieval mod-
[39] T. Gou, L. Liu, Q. Liu, Z. Deng, A new approach to cross-modal re- els, in: 2013 IEEE International Conference on Multimedia and Expo
trieval, in: Journal of Physics: Conference Series, Vol. 1288, IOP Pub- (ICME), IEEE, 2013, pp. 1–6 (2013).
lishing, 2019, p. 012044 (2019). [62] V. Ranjan, N. Rasiwasia, C. Jawahar, Multi-label cross-modal retrieval,
[40] N. Srivastava, R. Salakhutdinov, Learning representations for multi- in: Proceedings of the IEEE International Conference on Computer Vi-
modal data with deep belief nets, in: International conference on ma- sion, 2015, pp. 4094–4102 (2015).
chine learning workshop, Vol. 79, 2012 (2012). [63] S. J. Hwang, K. Grauman, Accounting for the relative importance of
[41] Y. Verma, C. Jawahar, A support vector approach for cross-modal search objects in image retrieval., in: BMVC, Vol. 1, 2010, p. 5 (2010).
of images and texts, Computer Vision and Image Understanding 154 [64] S. Hwang, K. Grauman, Learning the relative importance of objects from
(2017) 48–63 (2017). tagged images for retrieval and cross-modal search, International journal
[42] N. Gao, S.-J. Huang, Y. Yan, S. Chen, Cross modal similarity learning of computer vision 100 (2) (2012) 134–153 (2012).
with active queries, Pattern Recognition 75 (2018) 214–222 (2018). [65] K. Wang, R. He, L. Wang, W. Wang, T. Tan, Joint feature selection
[43] A. Habibian, T. Mensink, C. G. Snoek, Discovering semantic vocabular- and subspace learning for cross-modal retrieval, IEEE transactions on
ies for cross-media retrieval, in: Proceedings of the 5th ACM on Inter- pattern analysis and machine intelligence 38 (10) (2015) 2010–2023
national Conference on Multimedia Retrieval, ACM, 2015, pp. 131–138 (2015).
(2015). [66] G. Xu, X. Li, Z. Zhang, Semantic consistency cross-modal retrieval with
[44] N. Van Nguyen, M. Coustaty, J.-M. Ogier, Multi-modal and cross-modal semi-supervised graph regularization, IEEE Access 8 (2020) 14278–
for lecture videos retrieval, in: 2014 22nd International Conference on 14288 (2020).
Pattern Recognition, IEEE, 2014, pp. 2667–2672 (2014). [67] L. Zhang, B. Ma, G. Li, Q. Huang, Q. Tian, Generalized semi-supervised
[45] T. Nakano, A. Kimura, H. Kameoka, S. Miyabe, S. Sagayama, N. Ono, and structured subspace learning for cross-modal retrieval, IEEE Trans-
K. Kashino, T. Nishimoto, Automatic video annotation via hierarchical actions on Multimedia 20 (1) (2017) 128–141 (2017).
topic trajectory model considering cross-modal correlations, in: 2011 [68] Y. Wei, Y. Zhao, Z. Zhu, S. Wei, Y. Xiao, J. Feng, S. Yan, Modality-
IEEE International Conference on Acoustics, Speech and Signal Pro- dependent cross-media retrieval, ACM Transactions on Intelligent Sys-
cessing (ICASSP), IEEE, 2011, pp. 2380–2383 (2011). tems and Technology (TIST) 7 (4) (2016) 1–13 (2016).
[46] B. Jiang, X. Huang, C. Yang, J. Yuan, Cross-modal video moment re- [69] C. Deng, X. Tang, J. Yan, W. Liu, X. Gao, Discriminative dictionary
trieval with spatial and language-temporal attention, in: Proceedings of learning with common label alignment for cross-modal retrieval, IEEE
the 2019 on International Conference on Multimedia Retrieval, ACM, Transactions on Multimedia 18 (2) (2015) 208–218 (2015).
2019, pp. 217–225 (2019). [70] S. Wang, F. Zhuang, S. Jiang, Q. Huang, Q. Tian, Cluster-sensitive struc-
[47] X. Xu, L. He, A. Shimada, R.-i. Taniguchi, H. Lu, Learning unified tured correlation analysis for web cross-modal retrieval, Neurocomput-
binary codes for cross-modal retrieval via latent semantic hashing, Neu- ing 168 (2015) 747–760 (2015).
rocomputing 213 (2016) 191–203 (2016). [71] L. Zhang, B. Ma, G. Li, Q. Huang, Q. Tian, Cross-modal retrieval using
[48] K. Ahmad, Slandail: A security system for language and image analysis- multiordered discriminative structured subspace learning, IEEE Trans-
project no: 607691, Available at SSRN 3060047 (2017). actions on Multimedia 19 (6) (2016) 1220–1233 (2016).
[49] A. Hanbury, A survey of methods for image annotation, Journal of Vi- [72] B. Wang, Y. Yang, X. Xu, A. Hanjalic, H. T. Shen, Adversarial cross-
sual Languages & Computing 19 (5) (2008) 617–627 (2008). modal retrieval, in: Proceedings of the 25th ACM international confer-
[50] B. Rafkind, M. Lee, S.-F. Chang, H. Yu, Exploring text and image ence on Multimedia, 2017, pp. 154–162 (2017).
features to classify images in bioscience literature, in: Proceedings of [73] G. Cao, A. Iosifidis, K. Chen, M. Gabbouj, Generalized multi-view em-
the HLT-NAACL BioNLP Workshop on Linking Natural Language and bedding for visual recognition and cross-modal retrieval, IEEE transac-
Biology, Association for Computational Linguistics, 2006, pp. 73–80 tions on cybernetics 48 (9) (2017) 2542–2555 (2017).
(2006). [74] Y. Wu, S. Wang, G. Song, Q. Huang, Augmented adversarial training for
[51] G. Wang, D. Hoiem, D. Forsyth, Building text features for object image cross-modal retrieval, IEEE Transactions on Multimedia (2020).
classification, in: 2009 IEEE conference on computer vision and pattern [75] J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and re-
recognition, IEEE, 2009, pp. 1367–1374 (2009). trieval using cross-media relevance models, in: Proceedings of the 26th
[52] G. Wang, H. Ji, D. Kong, N. Zhang, Modality-dependent cross-modal re- annual international ACM SIGIR conference on Research and develop-
ment in informaion retrieval, 2003, pp. 119–126 (2003). transactions on cybernetics (2019).
[76] Y. Xia, Y. Wu, J. Feng, Cross-media retrieval using probabilistic model [99] Z. Yang, Z. Lin, P. Kang, J. Lv, Q. Li, W. Liu, Learning shared semantic
of automatic image annotation, International Journal of Signal Process- space with correlation alignment for cross-modal event retrieval, ACM
ing, Image Processing and Pattern Recognition 8 (4) (2015) 145–154 Transactions on Multimedia Computing, Communications, and Appli-
(2015). cations (TOMM) 16 (1) (2020) 1–22 (2020).
[77] Z. Li, J. Liu, C. Xu, H. Lu, Mlrank: Multi-correlation learning to rank [100] J.-H. Su, C.-L. Chou, C.-Y. Lin, V. S. Tseng, Effective semantic an-
for image annotation, Pattern Recognition 46 (10) (2013) 2700–2710 notation by image-to-concept distribution model, IEEE Transactions on
(2013). Multimedia 13 (3) (2011) 530–538 (2011).
[78] Q. Xu, M. Li, M. Yu, Learning to rank with relational graph and point- [101] L. Chi, X. Zhu, Hashing techniques: A survey and taxonomy, ACM
wise constraint for cross-modal retrieval, Soft Computing 23 (19) (2019) Computing Surveys (CSUR) 50 (1) (2017) 1–36 (2017).
9413–9427 (2019). [102] H. P. Luhn, A new method of recording and searching information,
[79] Y. Wu, S. Wang, Q. Huang, Online fast adaptive low-rank similarity American Documentation 4 (1) (1953) 14–16 (1953).
learning for cross-modal retrieval, IEEE Transactions on Multimedia [103] H. Stevens, Hans peter luhn and the birth of the hashing algorithm, IEEE
(2019). Spectrum 55 (2) (2018) 44–49 (2018).
[80] J. Yu, Y. Cong, Z. Qin, T. Wan, Cross-modal topic correlations for mul- [104] W. W. Peterson, Addressing for random-access storage, IBM journal of
timedia retrieval, in: Proceedings of the 21st International Conference Research and Development 1 (2) (1957) 130–146 (1957).
on Pattern Recognition (ICPR2012), IEEE, 2012, pp. 246–249 (2012). [105] R. Morris, Scatter storage techniques, Communications of the ACM
[81] Y. Wang, F. Wu, J. Song, X. Li, Y. Zhuang, Multi-modal mutual topic 11 (1) (1968) 38–44 (1968).
reinforce modeling for cross-media retrieval, in: Proceedings of the [106] L. Xie, L. Zhu, P. Pan, Y. Lu, Cross-modal self-taught hashing for large-
22nd ACM international conference on Multimedia, 2014, pp. 307–316 scale image retrieval, Signal Processing 124 (2016) 81–92 (2016).
(2014). [107] W. Cao, W. Feng, Q. Lin, G. Cao, Z. He, A review of hashing methods
[82] Z. Qin, J. Yu, Y. Cong, T. Wan, Topic correlation model for cross- for multimodal retrieval, IEEE Access 8 (2020) 15377–15391 (2020).
modal multimedia information retrieval, Pattern Analysis and Applica- [108] Y. Fang, H. Zhang, Y. Ren, Unsupervised cross-modal retrieval via
tions 19 (4) (2016) 1007–1022 (2016). multi-modal graph regularized smooth matrix factorization hashing,
[83] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (7553) Knowledge-Based Systems 171 (2019) 69–80 (2019).
(2015) 436–444 (2015). [109] H. T. Shen, L. Liu, Y. Yang, X. Xu, Z. Huang, F. Shen, R. Hong, Exploit-
[84] B. Jiang, J. Yang, Z. Lv, K. Tian, Q. Meng, Y. Yan, Internet cross-media ing subspace relation in semantic labels for cross-modal hashing, IEEE
retrieval based on deep learning, Journal of Visual Communication and Transactions on Knowledge and Data Engineering (2020).
Image Representation 48 (2017) 356–366 (2017). [110] X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for
[85] P. Hu, L. Zhen, D. Peng, P. Liu, Scalable deep multimodal learning for efficient multimedia search, in: Proceedings of the 21st ACM interna-
cross-modal retrieval, in: Proceedings of the 42nd International ACM tional conference on Multimedia, 2013, pp. 143–152 (2013).
SIGIR Conference on Research and Development in Information Re- [111] B. Wu, Q. Yang, W.-S. Zheng, Y. Wang, J. Wang, Quantized correla-
trieval, 2019, pp. 635–644 (2019). tion hashing for fast cross-modal search, in: Twenty-Fourth International
[86] F. Feng, X. Wang, R. Li, I. Ahmad, Correspondence autoencoders for Joint Conference on Artificial Intelligence, 2015 (2015).
cross-modal retrieval, ACM Transactions on Multimedia Computing, [112] C. Deng, Z. Chen, X. Liu, X. Gao, D. Tao, Triplet-based deep hashing
Communications, and Applications (TOMM) 12 (1s) (2015) 26 (2015). network for cross-modal retrieval, IEEE Transactions on Image Process-
[87] D. Mandal, P. Rao, S. Biswas, Semi-supervised cross-modal retrieval ing 27 (8) (2018) 3893–3903 (2018).
with label prediction, IEEE Transactions on Multimedia (2019). [113] C. Yan, X. Bai, S. Wang, J. Zhou, E. R. Hancock, Cross-modal hash-
[88] R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal neural language mod- ing with semantic deep embedding, Neurocomputing 337 (2019) 58–66
els, in: International conference on machine learning, 2014, pp. 595–603 (2019).
(2014). [114] X. Lu, L. Zhu, Z. Cheng, X. Song, H. Zhang, Efficient discrete latent
[89] F. Feng, X. Wang, R. Li, Cross-modal retrieval with correspondence semantic hashing for scalable cross-modal retrieval, Signal Processing
autoencoder, in: Proceedings of the 22nd ACM international conference 154 (2019) 217–231 (2019).
on Multimedia, 2014, pp. 7–16 (2014). [115] Y. Cao, M. Long, J. Wang, Q. Yang, P. S. Yu, Deep visual-semantic
[90] F. Feng, R. Li, X. Wang, Deep correspondence restricted boltzmann hashing for cross-modal retrieval, in: Proceedings of the 22nd ACM
machine for cross-modal retrieval, Neurocomputing 154 (2015) 50–60 SIGKDD International Conference on Knowledge Discovery and Data
(2015). Mining, 2016, pp. 1445–1454 (2016).
[91] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, S. Yan, Cross-modal [116] Q.-Y. Jiang, W.-J. Li, Deep cross-modal hashing, in: Proceedings of the
retrieval with cnn visual features: A new baseline, IEEE transactions on IEEE conference on computer vision and pattern recognition, 2017, pp.
cybernetics 47 (2) (2016) 449–460 (2016). 3232–3240 (2017).
[92] Y. He, S. Xiang, C. Kang, J. Wang, C. Pan, Cross-modal retrieval via [117] J. Yu, X.-J. Wu, J. Kittler, Learning discriminative hashing codes for
deep and bidirectional representation learning, IEEE Transactions on cross-modal retrieval based on multi-view features, Pattern Analysis and
Multimedia 18 (7) (2016) 1363–1377 (2016). Applications (2020) 1–18 (2020).
[93] X. Huang, Y. Peng, M. Yuan, Mhtn: Modal-adversarial hybrid trans- [118] J. Tang, K. Wang, L. Shao, Supervised matrix factorization hashing for
fer network for cross-modal retrieval, IEEE transactions on cybernetics cross-modal retrieval, IEEE Transactions on Image Processing 25 (7)
(2018). (2016) 3157–3166 (2016).
[94] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, M. Cord, [119] X. Liu, Z. Li, J. Wang, G. Yu, C. Domeniconi, X. Zhang, Cross-modal
Cross-modal retrieval in the cooking context: Learning semantic text- zero-shot hashing, arXiv preprint arXiv:1908.07388 (2019).
image embeddings, in: The 41st International ACM SIGIR Conference [120] J. Yu, X.-J. Wu, Unsupervised concatenation hashing with sparse
on Research & Development in Information Retrieval, 2018, pp. 35–44 constraint for cross-modal retrieval, arXiv preprint arXiv:1904.00726
(2018). (2019).
[95] J. Gu, J. Cai, S. R. Joty, L. Niu, G. Wang, Look, imagine and match: [121] X. Zhang, H. Lai, J. Feng, Attention-aware deep adversarial hashing for
Improving textual-visual cross-modal retrieval with generative models, cross-modal retrieval, in: Proceedings of the European Conference on
in: Proceedings of the IEEE Conference on Computer Vision and Pattern Computer Vision (ECCV), 2018, pp. 591–606 (2018).
Recognition, 2018, pp. 7181–7189 (2018). [122] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization:
[96] W. Cao, Q. Lin, Z. He, Z. He, Hybrid representation learning for cross- A procrustean approach to learning binary codes for large-scale image
modal retrieval, Neurocomputing 345 (2019) 45–57 (2019). retrieval, IEEE transactions on pattern analysis and machine intelligence
[97] X. Xu, L. He, H. Lu, L. Gao, Y. Ji, Deep adversarial metric learning for 35 (12) (2012) 2916–2929 (2012).
cross-modal retrieval, World Wide Web 22 (2) (2019) 657–672 (2019). [123] S. Kumar, R. Udupa, Learning hash functions for cross-view similarity
[98] X. Xu, H. Lu, J. Song, Y. Yang, H. T. Shen, X. Li, Ternary adversarial search, in: Twenty-Second International Joint Conference on Artificial
networks with self-supervision for zero-shot cross-modal retrieval, IEEE Intelligence, 2011 (2011).
[124] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems, 2009, pp. 1753–1760.
[125] J. Song, Y. Yang, Y. Yang, Z. Huang, H. T. Shen, Inter-media hashing for large-scale retrieval from heterogeneous data sources, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 785–796.
[126] H. Liu, R. Ji, Y. Wu, F. Huang, B. Zhang, Cross-modality binary code learning via fusion similarity hashing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7380–7388.
[127] X. Shen, F. Shen, Q.-S. Sun, Y.-H. Yuan, H. T. Shen, Robust cross-view hashing for multimedia retrieval, IEEE Signal Processing Letters 23 (6) (2016) 893–897.
[128] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, pp. 415–424.
[129] Z. Ji, W. Yao, W. Wei, H. Song, H. Pi, Deep multi-level semantic hashing for cross-modal retrieval, IEEE Access 7 (2019) 23667–23674.
[130] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng, NUS-WIDE: A real-world web image database from National University of Singapore, in: Proc. of ACM Conf. on Image and Video Retrieval (CIVR'09), Santorini, Greece, July 8–10, 2009.
[131] M. Grubinger, P. Clough, H. Müller, T. Deselaers, The IAPR TC-12 benchmark: A new evaluation resource for visual information systems, in: International Workshop OntoImage, Vol. 2, 2006.
[132] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2) (2010) 303–338.
[133] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results (2007).
[134] M. J. Huiskes, M. S. Lew, The MIR Flickr retrieval evaluation, in: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[135] M. J. Huiskes, B. Thomee, M. S. Lew, New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative, in: Proceedings of the International Conference on Multimedia Information Retrieval, 2010, pp. 527–536.
[136] J. Krapac, M. Allan, J. Verbeek, F. Jurie, Improving web image search results using query-relative classifiers, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 1094–1101.
[137] M. Hodosh, P. Young, J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research 47 (2013) 853–899.
[138] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics 2 (2014) 67–78.
[139] C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, in: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Association for Computational Linguistics, 2010, pp. 139–147.
[140] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[141] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[142] Y. Jia, M. Salzmann, T. Darrell, Learning cross-modality similarity for multinomial data, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 2407–2414.
[143] F. Zhong, G. Wang, Z. Chen, F. Xia, G. Min, Cross-modal retrieval for CPSS data, IEEE Access 8 (2020) 16689–16701.
[144] G. Xu, X. Li, L. Shi, Z. Zhang, A. Zhai, Combination subspace graph learning for cross-modal retrieval, Alexandria Engineering Journal (2020).
[145] Y. Wang, X. Lin, L. Wu, W. Zhang, Q. Zhang, LBMCH: Learning bridging mapping for cross-modal hashing, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015, pp. 999–1002.
[146] G. Ding, Y. Guo, J. Zhou, Y. Gao, Large-scale cross-modality search via collective matrix factorization hashing, IEEE Transactions on Image Processing 25 (11) (2016) 5427–5440.
[147] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, S. Belongie, Learning from noisy large-scale datasets with minimal supervision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 839–847.
[148] C. Tian, V. De Silva, M. Caine, S. Swanson, Use of machine learning to automate the identification of basketball strategies using whole team player tracking data, Applied Sciences 10 (1) (2020) 24.
[149] D. J. Armaghani, G. D. Hatzigeorgiou, C. Karamani, A. Skentou, I. Zoumpoulaki, P. G. Asteris, Soft computing-based techniques for concrete beams shear strength, Procedia Structural Integrity 17 (2019) 924–933.
[150] C. Raghuraman, S. Suresh, S. Shivshankar, R. Chapaneri, Static and dynamic malware analysis using machine learning, in: First International Conference on Sustainable Technologies for Computational Intelligence, Springer, 2020, pp. 793–806.
[151] H. Müller, D. Unay, Retrieval from and understanding of large-scale multi-modal medical datasets: A review, IEEE Transactions on Multimedia 19 (9) (2017) 2093–2104.
[152] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1798–1828.
[153] Y. Jia, L. Bai, S. Liu, P. Wang, J. Guo, Y. Xie, Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval, Multimedia Tools and Applications 78 (10) (2019) 13169–13188.
[154] X. Zhang, K. Ahmad, Ontology and terminology of disaster management, in: DIMPLE: DIsaster Management and Principled Large-scale information Extraction Workshop Programme, 2014, p. 46.
[155] M. Rogers, K. Ahmad, Corpus linguistics and terminology extraction (2001).
[156] Z. Zhongming, L. Linong, Y. Xiaona, Z. Wangqiang, L. Wei, et al., Wiki-CMR: A web cross modality database for studying and evaluation of cross modality retrieval methods (2013).
[157] C. Kang, S. Xiang, S. Liao, C. Xu, C. Pan, Learning consistent feature representation for cross-modal multimedia retrieval, IEEE Transactions on Multimedia 17 (3) (2015) 370–381.
[158] L. Wu, Y. Wang, L. Shao, Cycle-consistent deep generative hashing for cross-modal retrieval, IEEE Transactions on Image Processing 28 (4) (2018) 1602–1612.
[159] Y. Peng, X. Huang, J. Qi, Cross-media shared representation by hierarchical learning with multiple deep networks, in: IJCAI, 2016, pp. 3846–3853.
[160] J. Shao, L. Wang, Z. Zhao, A. Cai, et al., Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval, Neurocomputing 214 (2016) 618–628.
[161] V. E. Liong, J. Lu, Y.-P. Tan, J. Zhou, Deep coupled metric learning for cross-modal matching, IEEE Transactions on Multimedia 19 (6) (2016) 1234–1244.
[162] J. Luo, Y. Shen, X. Ao, Z. Zhao, M. Yang, Cross-modal image-text retrieval with multitask learning, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2309–2312.
[163] Y. Jian, J. Xiao, Y. Cao, A. Khan, J. Zhu, Deep pairwise ranking with multi-label information for cross-modal retrieval, in: 2019 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2019, pp. 1810–1815.