0% found this document useful (0 votes)

16 views39 pages

Cross Modal Survey

This document presents a comparative analysis of cross-modal information retrieval, focusing on the integration of multiple modalities such as text, images, and audio for effective information processing. It discusses various techniques and state-of-the-art methods used to address challenges in cross-modal systems, highlighting the semantic gap in information retrieval. The paper concludes with open issues and future research directions to enhance understanding in this field.

Uploaded by

Nissa Liane

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views39 pages

Cross Modal Survey

Uploaded by

Nissa Liane

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

Comparative analysis on cross-modal information retrieval: a review

Parminder Kaura,∗, Husanbir Singh Pannua , Avleen Kaur Malhib

a Computer Science and Engineering Department,

Thapar Institute of Engineering and Technology, Patiala, India

b Department of Computer Science,

Aalto University, Finland

Abstract
Human beings experience life through a spectrum of modes such as vision, taste, hearing, smell, and touch. These multiple modes
are integrated for information processing in our brain using a complex network of neuron connections. Likewise for artificial
intelligence to mimic the human way of learning and evolve into the next generation, it should elucidate multi-modal information
fusion efficiently. Modality is a channel that conveys information about an object or an event such as image, text, video, and audio. A
research problem is said to be multi-modal when it incorporates information from more than a single modality. Multi-modal systems
involve one mode of data to be inquired for any (same or varying) modality outcome whereas cross-modal system strictly retrieves
the information from a dissimilar modality. As the input-output queries belong to diverse modal families, their coherent comparison
is still an open challenge with their primitive forms and subjective definition of content similarity. Numerous techniques have been
proposed by researchers to handle this issue and to reduce the semantic gap of information retrieval among different modalities. This
paper focuses on a comparative analysis of various research works in the field of cross-modal information retrieval. Comparative
analysis of several cross-modal representations and the results of the state-of-the-art methods when applied on benchmark datasets
have also been discussed. In the end, open issues are presented to enable the researchers to a better understanding of the present
scenario and to identify future research directions.
Keywords: Cross-modal, multimedia, information retrieval, data fusion, comparative analysis

∗ Correspondingauthor
Email addresses: [email protected] (Parminder Kaur),
[email protected] (Husanbir Singh Pannu), [email protected]
(Avleen Kaur Malhi)

Preprint submitted to Elsevier December 17, 2021

Contents

1 Introduction 3
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Related surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Article organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Review methodology 4
2.1 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Sources of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Search criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Publication metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Background 6
3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Origin and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 Cross-modal representation and retrieval techniques 11

4.1 Real-valued representation learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Subspace learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Statistical and probabilistic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.3 Rank based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.4 Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.5 Machine learning and Deep learning based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.6 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Binary representation learning or Cross-modal hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 General hashing methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.2 Cross-modal hashing methods based on deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5 Benchmark datasets 23

6 Comparative analysis 25
6.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Comparison of results using diverse techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

7 Discussion 29

8 Open issues 30

9 Conclusion 36

2
1. Introduction Cross-modal and multi-modal are explained using a simple ex-
ample in figure (2) where + represents both text and images can
When we fail to understand the contents of an image embed- be retrieved using an image query and vice versa in multi-modal
ded in a text, figure captions, and referral text often help. Just by approach.
looking at a figure, a person might not be able to understand it
exactly but with the help of collateral text, it can be understood
efficiently. For instance, when we see a volleyball picture(figure
1), we may not be able to understand or know about the volley-
ball game. However, the picture can be completely understood
with the help of collateral text (such as caption, figure reference,
and related citation) describing the volleyball game. This im-
plies that information from more than one source is beneficial
in further understanding of things and also helpful in better in-
formation retrieval. This is where cross-modal data fusion and
retrieval come into the picture. Figure 2: An illustration of information retrieval in cross-modal and multi-
modal system.

Therefore, the fundamental idea of cross-modal is to in-

tegrate numerous modes of information to derive better re-
sults than just one channel. For instance, an image-text cross-
modal system integrates textual information along with an im-
age which is known as image annotation. Vice-versa, it also
queries text keywords to retrieve images, known as image re-
trieval. In simple words, image annotation is a process of ex-
plaining an image with appropriate linguistic cues. It is use-
ful in knowledge transfer sessions for application areas such
as medical science, military, business, education, and sports to
Figure 1: An example of a volleyball image and collateral text in the form of
the caption, figure reference, and related citation. name a few. For example, a CT scan is known to the radiologist
but not to an intern or a patient. Therefore, the expert has to
Recently, cross-modal retrieval has gained a lot of attention explain it using proper terminology by pointing out key areas
due to the rapid increase in multi-modal data such as images, on the given image. Image retrieval is a process of retrieving an
text, video, and audio. The term modality represents a spe- appropriate image from the database as per the user query, for
cific form in which data exists and it is also associated with a instance, with text keywords. With the evolution of the seman-
sensory perception such as vision and hear modalities which tic web and huge data repositories, a major challenge comes
are major sources of communication and responsiveness in hu- into the picture which is effective indexation and retrieval of
man beings and animals. The data consisting of more than one both still and moving images and the identification of key areas
modality is known as multi-modal data. It has the characteristic inside the images. An image cannot be expressed completely
of high-level semantic homogeneity and low-level expressive just by using visual features only as they under-constrain the
heterogeneity such as the same thing having diverse representa- information contained in it. Visual features of an image include
tions. Different forms of representation help people better un- color distribution, texture, shape, and edges. Typically, image
derstand things as illustrated in the volleyball example above. retrieval systems make use of images and the corresponding
While searching for something, people often want to get accu- text/keywords for indexing and retrieving images using both
rate results in different forms which create a need for an efficient keywords and visual features of the image. Cross-modal image
multi-media information retrieval platform. Classic approaches retrieval aims to use text for retrieving relevant images related
to information retrieval are of uni-modal nature. Uni-modal to the text.
means information derived just from one channel, such as only
from images or only from the text (but not both). For example, 1.1. Motivation
only the text query is used for information search and retrieval
from a text repository. This retrieval approach is of the least use Cross-modal learning has become tremendously popular be-
these days when enormous multimedia data is being generated. cause of its effective information retrieval capability. Numer-
Cross-modal and multi-modal systems, on the other hand, are ous cross-modal representation and retrieval methods have been
able to link more than one modalities such as image, text, au- proposed by researchers to resolve the issue of cross-modal re-
dio, and video. In cross-modal, input query mode and resultant trieval considering several modalities. Various appealing sur-
mode are dissimilar. For example, query text for related images veys have been introduced which summarizes the work done in
and query image for related text. However, the resultant mode this field. Image and text are the highly utilized modalities and
can be similar to the query mode in a multi-modal system. For a number of articles on cross-modal retrieval have been pub-
example, query text to retrieve related images and matched text. lished considering these. However, there is no proper survey
3
mainly focusing on the image-text cross-modal retrieval tech- surveys. It will act as a valuable resource for beginners to
niques. The objective of this article is to conduct a comprehen- get acquainted with the topic.
sive review of cross-modal retrieval which incorporates image 2. A broad classification of various cross-modal approaches
and text modalities, the main concerns of which are different has been presented and difference among them is also dis-
from previous surveys and reviews. So, the motivation behind cussed.
this review article is: 3. It provides information regarding various prominent
1. Lack of a full-fledged review article on image and text benchmark datasets and evaluation metrics utilized for
modalities. cross-modal method performance estimation.
2. To present various challenges and open issues in the cross- 4. It presents a comparative analysis of diverse cross-modal
modal retrieval field. representation techniques when applied on benchmark
3. Image and text modalities are the basic and highly utilized datasets. This analysis will be highly useful for future re-
modalities, however, we are still away from achieving an search.
ideal level in their cross-modal retrieval process. 5. The article summarizes various challenges in the field of
cross-modal retrieval and open issues to work upon by fu-
1.2. Related surveys ture researchers.
Existing literature reviews related to cross-modal informa-
tion retrieval have presented the topic quite well to the research 1.4. Article organization
community. [1] presented an overview of cross-modal retrieval
in 2016, however, it does not comprise several significant works This article starts with an introduction of cross-modal re-
proposed in recent years. In [2], authors have presented nu- trieval in Section 1 which includes motivation for the survey,
merous multi-modal techniques, but their focus is only on tech- contributions, comparison with existing surveys, article road
niques based on machine learning. [3] is a contemporary sur- map, and organization. An appropriate review methodology
vey, however, it presents a brief study of cross-media retrieval (Section 2) has been shadowed in writing the proposed survey
methods compared to the vastness of the topic. An overview which incorporates five subtopics: research questions, sources
of different cross-media retrieval techniques incorporating mis- of information, search criteria, data extraction, and publica-
cellaneous modalities has been provided in [4]. [5] article only tion metrics. The inception of cross-modal retrieval, its general
explore various cross-media retrievals with joint graph regular- architecture, applications, observed challenges in the process,
ization. The focus of [6] is on cross-media analysis and reason- and the initial related articles are presented in concert under
ing and the various analysis methods rather than cross-media the background section (Section 3). Section 4 discusses about
retrieval. [7] has provided a survey on cross-media image and the diverse cross-modal representation and retrieval techniques
text information fusion where the main focus is on analyzing which are broadly classified into real-valued and binary tech-
two methods of image and text associations. niques. The literature related to these techniques has also been
Table (1) shows the comparison of the current survey with included in this section. The famous image-text benchmark
the existing reviews related to cross-modal learning sorted year datasets which have been widely used by the researchers in the
wise. Comparison is performed on the basis of the domain, dif- cross-modal field have been presented in Section 5. Section
ferent modalities incorporated in the paper, comparative anal- 6 is of comparative analysis which introduces different perfor-
ysis, challenges, open issues, benchmark datasets, and evalua- mance evaluation metrics along with a comparison of various
tion metrics. It can be seen in the table that only one survey is cross-modal retrieval methods. A summary of several state-of-
focusing on image and text modalities but their main concern the-art cross-modal retrieval works has been demonstrated with
is an image-text association and not cross-modal retrieval. A the use of tables in Section 7. The miscellaneous open issues
blank cell in the table implies that the information is missing in cross-modal retrieval domain have been discussed in Section
for that particular column and Xmeans that it is present in the 8. Finally, Section 9 culminates the survey with the conclusion.
article. Domain column specifies the main focus of the arti- Figure (3) depicts the road map of the article.
cle and all value under Mode column means that the article is
not particularly focusing on any two or three modalities rather
it is talking about the whole multi-media. Comparative analy-
2. Review methodology
sis depicts whether the comparison among techniques has been
performed quantitatively or qualitatively.
The categorical survey technique described in this research
1.3. Contributions article has been obtained from the technique described by
Kitchenham et al. [8, 9]. Distinct stages used in this review
The significant contributions of this paper are as follows:
are: to create a review technique, planning an exhaustive sur-
1. This review focuses to present a summary of recent vey, executing the survey, comparison of results, comparative
progress in cross-modal retrieval considering text and im- result analysis, and exploring open issues. The review tech-
age (image-to-text and text-to-image). It comprises several nique employed in this categorical survey is described in figure
novel works and references which are absent in previous (4).
4
Table 1: Comparison of the proposed survey with existing surveys
Sr. Article Year Domain Mode Comparative Challenges Open Datasets Eval.
No. analysis issues met-
rics
1 Priyanka [7] 2013 Cross-media text and image informa- Image
tion fusion and
text
2 Wang et al. [1] 2016 Cross-modal retrieval all Quantitative X X X X
3 Peng et al. [4] 2017 Cross-media retrieval: concepts, all Quantitative X X X X
benchmarks, methodologies and
challenges
4 Peng et al. [6] 2017 Cross-media analysis and reasoning all X X
advances and directions
5 Baltruvsaitis et al.[2] 2018 Multimodal machine learning all Qualitative X
6 Monelli and Bondu et 2018 Joint graph regularization based se- all Qualitative
al.[5] mantic analysis for cross-media re-
trieval
7 Monelli and Bondu [3] 2019 Cross-media feature retrieval and opti- all X X
mization
8 Proposed survey - Cross-modal retrieval considering im- Image Qualitative and X X X X
age and text modalities and quantitative
text

Figure 4: Shortlisting of the research articles starting from title, ab-

stract/conclusion, full text and redundancy

highly improved when information from more than one

mode is incorporated into the process. [Details in Section
1]
2. RQ2: What is the background of the word cross-modal?
AS2: Cross-modal learning is inspired from working of
human brain. [Details in Section 3]
3. RQ3: What are the different challenges faced during the
Figure 3: Road map for the article process of cross-modal retrieval?
AS3: Handling of huge multi-modal datasets, heteroge-
neous modalities and others. [Details in Section 3.3]
2.1. Research questions
4. RQ4: What are various applications of cross-modal re-
A number of vital areas required to be considered in the case trieval?
of cross-modal retrieval are summarized in the following re- AS4: Emotion recognition system, biomedical image re-
search questions: trieval, and spoken to sign language transcription to name
1. RQ1: What is the need of cross-modal retrieval? a few. [Details in Section 3.2.1]
AS1: The results achieved in information retrieval can be 5. RQ5: Which state-of-the-art methods have been proposed
5
recently for multi-modal data representation and retrieval were excluded (such as papers based on other cross-modal re-
and which image and text descriptors have been used in trievals were excluded). The number was further reduced to
combination? 280 based on the abstract and conclusion of the article. The ab-
AS5: The popular multi-modal data representation method stracts/conclusions of all the papers were read and relevant pa-
is CCA, SIFT and LDA are famous methods utilized for pers to cross-modal retrieval considering image and text were
image and text representation respectively. [Details in Sec- selected among all other papers. Finally, 189 articles were se-
tion 4 and Table 18] lected based on their full text as the papers whose technique
6. RQ6: What are the various existing prominent image-text was not relevant to our domain were excluded. Then, these 189
benchmark datasets? articles were examined thoroughly to give the final list of 132
AS6: NUS-WIDE, Wikipedia and MIRFlickr 25k are few research papers with the help of reference investigation and re-
popular datasets. [Details in Section 5 and Table 7] dundancies to eliminate common challenges based on inclusion
7. RQ7: What are the primary evaluation metrics utilized by and exclusion criteria.
researchers and comparison of miscellaneous techniques
on different benchmark datasets? 2.4. Data extraction
AS7: MAP and PR curve are commonly accepted evalu- Many problems were faced in extracting relevant information
ation metrics by research community. [Details in Section from the sources specified. Various authors were contacted for
6] finding the in-depth knowledge of research if required. Our
8. RQ8: What are the open issues in the field of cross-modal review used the following procedure for data extraction:
retrieval?
• The data from 132 research articles was extracted by one
AS8: Lack of huge multi-modal datasets, restricted anno-
of the authors after a detailed review.
tations and diversity requirement in dataset composition
are few open issues. [Details in Section 8 and Table 17] • Another author then cross-checked the review results ex-
tracted.
2.2. Sources of information
• If any conflict arose during cross-checking, then a compro-
We searched broadly in electronic database sources as rec- mise meeting was held by authors to resolve the conflict.
ommended by Kitchenham et al. [8, 9]; the electronic databases
used for searching are given in figure (5). The aim of this review is to find the available research in image-
text cross-modal retrieval. Most of the research articles on
cross-modal retrieval are published in a comprehensive variety
of referred journals and conference proceedings.

2.5. Publication metrics

This section involves the publication metrics related to cross-
modal retrieval. Figure (7) presents a chart of year-wise publi-
cations from year 2010 to 2020 which has been obtained by
executing the search string allintitle: retrieval ”cross modal”
OR ”cross media” OR ”multi modal” OR ”multi media” on
Google Scholar as per year. It can be inferred from the chart
that the publication count in the field has increased overall.
Figure 5: The electronic databases used in the survey Figure (6) displays the publication count of prominent jour-
nals in the field of cross-modal retrieval during last decade.
Journal Multimedia Tools and Applications has the highest pub-
2.3. Search criteria lication count of 23. Here, X represents all journals having
The survey conducted contains the literature review of the publication count of 1. They are given in table (2). Fig-
qualitative and quantitative research articles from the year 2010 ure (8) demonstrates the cross-modal retrieval journal publica-
to 2020 in the English language. In this article, we have in- tion count geographically on the world map during last decade.
cluded research papers from peer-review journals, symposiums, China has the maximum publication count with a score of 174.
conferences, technical reports, and workshops. The exclusion India and US are on second and third position with a score of
criteria used in the search is given in figure (4). An individ- 15 and 11 respectively.
ual search was applied to a few articles from Springer, Elsevier,
and IEEE to name a few to cross-check the electronic search- 3. Background
ing. There were 579 articles gathered from the search which
were then further reduced to 485 based on the titles of the ar- The inception of the terms cross-modal and multi-modal is in
ticles. The exclusion was done based on the titles as the ti- neurology and are inspired from multi-sensory integration in-
tles which were relevant to image-text cross-modal were kept side brain [10, 11]. We often need to understand images of ob-
in the list and titles which seemed out of the scope of the area jects/scenes through the use of phrases because image does not
6
Figure 6: Journal publications count in the area of Cross-modal retrieval during last decade. Journals represented by X are given in table (2)

Table 2: List of journals represented by X in figure (6), having a publication

count of 1.
Publisher Journals
Alexandria Engineering; Computer Vision and Image Under-
Elsevier standing; Digital Signal Processing
Information Sciences; Procedia Computer Science; Signal
Processing: Image Communication
Hindawi Mathematical Problems in Engineering; Mobile Information
Systems
Journal of Se- Applied Earth Observations and Remote Sens-
lected Topics in ing; Signal Processing

IEEE IEEE Signal Processing Magazine

Automation Science and Engineering; Biomet-
Transactions on rics, Behavior, and Identity Science; Circuits
and Systems for Video Technology
Dependable and Secure Computing; Geo-
science and Remote Sensing; Industrial Elec-
tronics Figure 7: Cross-modal retrieval article publishing trend
Industrial Informatics; Neural Networks and
Learning Systems
CAAI Transactions on Intelligence Technoloy
IET
Image Processing
Computer Vision; Computer Science and Technology
Springer Neural Computing and Applications; Pattern Analysis and Ap-
plications
Signal, image and video processing; Soft Computing
Taylor & Intelligent Automation & Soft Computing; The imaging sci-
Francis ence
Other International Journal of Signal Processing, Image Processing
and Pattern Recognition

Figure 8: Geographical representation of country wise journal publication

contain all the relevant information. Thus, we use one modality count in last decade
of communication to compensate for the absence of informa-
tion in another mode [12] which implies co-relating text and
image. In simple terms, cross-modal or multi-modal is linking text using an image is called image annotation and retrieval of
of more than one modality such as text, image and video. The an image using text is known as image retrieval.
process of using one modality to retrieve related information in Figure (9) shows a lucid comparison of various modalities
other modality is known as cross-modal retrieval. Retrieval of utilized inside the brain and computer. For instance, sight
7
mon representation enables the cross-modal retrieval process by
appropriate ways of data indexing, summarization, and ranking.

Figure 10: General framework of cross-modal retrieval process

Figure 9: Comparison of computer and brain, based on different modalities
which are integrated inside them to make a decision. For instance, image Figure (11) shows the general topological architecture of
modality in a computer is similar to vision modality in the brain, tongue bio- cross-modal image annotation and retrieval system. It defines
metric is similar to taste and so on.
the relative locations of each entity in the rectangular boxes and
process flow through directed arrows. Working of each entity is
modality for the brain is collated with image modality for com- briefed as follows:
puter, hear is collated with audio, and so on. Humans get fa-
miliar with the outside world using multiple sensory channels
where each channel provides a distinctive impression of the en-
vironment [13]. Each sense or sensory modality seems to ex-
ercise separately to interact with the environment and acquire
knowledge, however, the information received from all the sen-
sory channels is integrated by the brain into an extensive aware-
ness regarding the outside world [14]. As an example, when it is
difficult to understand a person’s articulation then the interlocu-
tor will automatically start observing other modes of expression
such as mouth movements and facial/body expressions. In a
similar fashion, only one modality is not enough to understand
an incident/object for decision making. So, computers can be
evolved to the next level by integrating information from di- Figure 11: General topological architecture of image-text cross-modal system
verse modalities to achieve better results than just one infor-
mation channel. Current generation computers are not capa-
1. Manual annotation: Manual annotation implies the label-
ble to perform cross-modal computation efficiently due to their
ing of unlabelled images or providing an appropriate ex-
existing structure (separation of memory and processor unlike
planation for each image by an expert.
brain neurons [15]) and complexity involved in the cross-modal
2. Repository: A repository refers to a central location for
learning phenomenon. Hence, cross-modal learning is an at-
the storage and management of data consisting of images
tempt to make the computer evolve to the next level.
and text. It is the common location for fetching and saving
data for all the entities present in the system. The pre-
3.1. Architecture pared and organized data after an annotation is put into the
Figure (10) demonstrates the general framework of a cross- repository for further analysis. The image and textual data
modal retrieval system. Four modalities such as text, image, are picked up by image and text analysis modules sepa-
video, and audio are shown in the figure as an example. Typ- rately for analysis and then the extracted features are put
ically, the initial form of data contains noise which affects the back in the repository. The cross-modal learning module
overall building and accuracy of the system. Pre-processing is connected to the repository so that when a query is fed
is performed on the data to remove that noise and to make it into the module, it can retrieve the related result from the
appropriate for further processing on it. The second step is to repository module.
represent each modality separately by doing a feature extraction 3. Image analysis: This module consists of three sub-
process using varied algorithms such as BoW, SIFT, and CNN modules which are as follows:
depending upon the multi-modal data. As per the multi-modal • Image pre-processing: It involves the cleaning of
representations, common representations for diverse modalities noisy images, improving the quality of blur images,
are learned using correlation modeling. In the end, this com- and image resizing.
8
• Image segmentation: It is a process of segregation of
an image into several small segments such as a group
of pixels which makes the image representation eas-
ier to analyze.
• Feature extraction: It consists of withdrawing the
useful features from an image which uniquely iden-
tifies it.
4. Text analysis: This is further divided into three steps:
• Corpus creation: It includes text pre-processing, typ-
ically consisting of noise and stop words removal.
After pre-processing, the final word corpus is cre-
ated.
• Frequency analysis: It involves assigning a fre-
quency to the words in the corpus.
• Feature extraction: Like image feature extraction,
text feature extraction identifies the vital features Figure 12: Applications of cross-modal retrieval
from the text which differs it from other text. Few
feature extraction methods include Bag-of-Words
(BoW), TF-IDF, and weirdness coefficient. 1. Face-voice matching and retrieval: Studies reveal that hu-
man appearances are linked to their voices and humans
5. Cross-modal learning: This is the most important module have the tendency to recognize the association of voice
in the architecture and is the final system which is used for and face. As an example, after hearing a voice, humans
image annotation and retrieval purpose. When a text query can easily identify the gender of the person and the ap-
is fed into it, it fetches the matched text and related images proximate age. Inspired by this, a cross-modal framework
from the repository and returns them to the user. for face voice matching and retrieval is proposed [22]. In
addition to the framework, a novel face-voice dataset has
3.2. Origin and applications been constructed from Chinese speakers and formed a data
collection tool as well.
Most of the initial works which are inspired by cross-modal 2. Spoken to sign language transcription: Cross-modal data
human behavior are related to the integration of acoustic and vi- fusion has been applied to explore the spoken language to
sual modalities [16, 17, 18]. Numerous works proposed in the sign language and vice versa [23]. Sign language is help-
80s and 90s are influenced by the McGurk effect [19]. As per ful to assist the communication with hard-of-hearing or
this effect, an illusion happens when an acoustic element of one deaf people and the proposed methodology has introduced
sound is combined with the visual element of another sound How2Sign technology which is based upon the American
which leads to the perception of a third sound. After getting Sign Language dataset.
motivation from previous researches on cross-language infor- 3. Emotion recognition system: Psycho-linguistics is the
mation retrieval (CLIR) and spoken document retrieval (SDR) study of the mental aspects of speech and language. As per
Thijs Westerveld presented one of the earliest researches on the studies on human communication, people start adjust-
cross-modal retrieval considering image and text [20, 21]. In ing their behavior by imitating the expressions, gestures,
both of his works, he proposed the use of the Latent Seman- and speaking style of their interlocutor which is known
tic Indexing (LSI) method for cross-modal image-text retrieval. as entrainment. Entrainment presence in cross-modal set-
This method builds a common multi-modal semantic space to tings is investigated in [24] and its effect on multi-modal
represent images and text simultaneously which benefits the re- emotion recognition system. Moreover, cross-speaker
trieval of related text using an image and vice-versa. dependence and cross-modality have also been inquired
about. After empirical analysis, it has been found that
3.2.1. Applications there is a strong relationship between acoustic and fa-
Authors have introduced numerous techniques for cross- cial features of one person with the emotional state of
modal information retrieval considering miscellaneous modal- other and speakers show alike emotions specifying pow-
ities and applied them in varied applications. Summary of erful mutual influence in their expressive behaviors 72%
prominent and recent applications exploiting various modalities of the time. It has been found that there is a robust depen-
is stated in the table (3). As this survey is focused on image and dence among heterogeneous modalities across interlocu-
text modalities, the prominent works and representation tech- tors and the emotional state of an interlocutor can be iden-
niques related to those are mentioned in section (4). Few ap- tified from the information provided by the other interlocu-
plications are described as below and also presented in figure tor.
(12): 4. Disaster and emergency management system: A large
9
number of disaster videos, images, and news are uploaded
and searched on social media regularly. These multime-
dia can be utilized as sensors to extract important infor-
mation about the disasters. Images and text have been as-
sociated by [25] to extract prominent phrases related to
floods. Bag of words model for text features and Speeded-
up Robust Features (SURF) for image feature extraction
has been used. For the integration of image and text, anal-
ysis has been done using a proposed novel method. Two
flood event corpora were used for experiments (a) US Fed-
eral Emergency Management Agency media library, and
(b) public Facebook groups and pages for the flood and
the aid (in German).
5. Biomedical image retrieval: Considering the growth of Figure 13: Challenges in cross-modal retrieval
the healthcare industry, important text and images keep on
hiding under the inessential data which makes it hard to re-
trieve the relevant information. Biomedical articles often in real-time cross-modal systems to serve a useful purpose
contain annotation markers or tags such as letters, stars, and claim the automatic semantic application.
symbols, or arrows in their figures to spot the highlights. 2. Heterogeneity among modalities: The ever increasing size
These markers are also correlated with the image captions of multimedia data on social media every day creates a
and text in the article. Identification of the markers be- bottleneck for efficient information retrieval [48]. Mo-
comes important to extract the Region of Interest (ROI) bile devices and social websites such as Twitter, Facebook,
from images. A novel technique has been proposed in [26] and Flickr are generating a variety of heterogeneous data
with the combination of rule-based and statistical image which is semantically different and cannot be compared
processing ways for localizing and annotating the medical directly in their initial form. It is required to reduce the se-
image regions or ROIs. Moreover, a cross-modal image mantic gap among miscellaneous modalities so that they
retrieval technique has been implemented based on ROI can be compared/matched with each other to find similar-
identification and classification. ities. Semantic gap refers to the difference between low-
level features and high-level concepts. The nature of data
6. Material identification: A combination of visual and au-
distributions, noise/artifacts, and key features involved in
dio can be utilized for identifying a particular material. In
various modalities are subtle and prone to errors while or-
[27], authors are identifying a wooden material based on
chestrating them for mutual information retrieval. There-
sound and its image. ELM-LRF method is used for feature
fore multimedia data and massive size present a challenge.
extraction from images and audio. For cross-modal data
representation, CCA, MCCA (Mean CCA), and CCCA 3. Manual procedure is expensive: Most of the data that is
(Cluster CCA) approaches are utilized and out of which found on the Internet these days is either not annotated
CCCA is found to be the best among the three. or inaccurate. It is quite difficult to annotate the raw data
(images for example) manually by an expert due to its mas-
7. Recipe video retrieval: A video of a recipe can be re-
sive volume and diversity. Hence it is needed to leverage
trieved from a textual recipe. A cross-modal fusion sub-
this manual process for an automatic replacement which is
network is proposed in [28] for obtaining video from a
comparatively accurate [49].
written recipe. It learns both independent and collabora-
tive dynamics which enhances the associated representa- 4. Need of efficient feature extraction: Choosing optimal fea-
tion of videos and recipes. Co-attention network utilized ture extraction method for underlying multi-modal data is
for explicitly emphasizing the cross-modal interactive fea- still an open question [50], [51]. How effectively a modal-
tures between recipe and video. ity has been represented through its feature vectors eventu-
ally affects the overall model quality and reliability. There
is a traditional trade-off between the time complexity and
3.3. Challenges model validation accuracy so the art is to find mutual equi-
The main challenges observed in the process of cross-modal librium. With effective feature extraction best practices,
retrieval are represented in figure (13) and defined as follows: identifying similarities between modalities will be easier
and efficient.
1. Massive multimedia data sets: Sophisticated content re- 5. Representation complication: In the case of multiple
trieval has become a challenge in ever-growing multime- modalities, the basic problem is about coherent representa-
dia data volume over the internet [47]. Thus the model tion and synchronization among various modalities which
efficiency and accuracy suffer from the relevant feature ex- are often not complementary and thus carry redundancy.
traction, selective data retention by removing redundancy Thus, it is important to have precise spectrum representa-
while taking care of language syntax, and semantic inter- tion with maximum information gain and no redundancy.
pretations. It is also challenging to store and retrieve data For example, in the case of image and text, images can be
10
Table 3: Various application areas of cross-modal information retrieval along with modalities used, year, and reference. Modalities’ representation: A-audio,
I-image,3D-three dimensional, T-text, V-video, L-location.
Sr. Modalities Applications Year Ref.
Face-voice matching and retrieval 2019 [22]
1 A, I
Material identification 2019 [27]
2 A, I, 3D Image-audio-3D retrieval in multiple concepts 2013 [29]
3 A, I, T Multiple distinct category image-audio and image-text retrieval 2019 [30]
4 A, I, T, V, 3D Image-text-audio-video cross modal retrieval in various concepts 2013 [31]
Text-audio retrieval in audios from Freesound library 2019 [32]
5 A, T
Text-audio retrieval in multiple concepts from Spotify, YouTube and Musixmatch 2019 [33]
sources
Audio-visual retrieval in distinct categories of YouTube and Google audio set 2019 [34]
6 A, V videos
Emotion recognition systems 2013 [24]
Human behaviour analysis 2018 [35]
7 A, V, L Audio-visual-location cross-modal retrieval in distinct categories 2011 [36]
8 A, V, T Youtube video categorization 2020 [37]
RoI identification and classification in CT scan images 2014 [26]
Image retrieval in different categories 2014, 2019 [38], [39]
9 I, T 24 distinct category image-text retrieval 2012 [40]
Disaster and emergency management 2016 [25]
image-text retrieval in various categories 2017, 2018 [41], [42]
Image-text and video-text retrieval in multiple categories 2015 [43]
10 I, T, V
Video, image and text retrieval in video lectures 2014 [44]
Multiple concepts’ video annotation 2011 [45]
11 T, V cooking activities’ video annotation, videos’ temporal activity localization evalu- 2019 [46]
ation, personal videos’ annotation
Cooking recipe retrieval 2019 [28]

represented in spatial or spectral while the text is symbolic 4.1.1. Subspace learning
and dependent upon grammar rules and cultural norms [2]. Subspace learning plays a vital role in cross-modal informa-
tion retrieval. Diverse modalities have different representation
features as well as they are located in diverse feature spaces
4. Cross-modal representation and retrieval techniques [52]. The modalities can be mapped to common isomorphic
subspaces from old miscellaneous spaces by learning potential
Cross-modal representation techniques can be broadly clas- common subspaces (as shown in figure 17).
sified into two categories: (a) Real-valued representation and
(b) Binary representation. In real-valued representation learn-
ing, the learned common representations of diverse modalities CCA and its variants, CM, SM and SCM. CCA is the most pop-
are real-valued. However, in binary representation learning, ular unsupervised technique of subspace learning which was
diverse modalities are mapped into a common hamming space. introduced by Hotelling [53] in 1936. The principal logic be-
Cross-modal similarity searching is faster in binary representa- hind this technique is to find the pair of projections for di-
tion, so the retrieval process also becomes faster. However, the verse modalities such that the correlation between them is max-
retrieval accuracy becomes less in binary representation as the imized [54]. CCA can be recognized as an issue of identi-
information is lost because representation is encoded to binary fying the basis vectors for two group of variables aiming to
codes. Prominent cross-modal learning methods and related mutually maximize the correlation between variables’ projec-
works are presented in the following sub-sections. Figure(14) tions onto the basis vectors [55]. Let h·, ·i represents the eu-
presents a taxonomy of cross-modal retrieval methods. Table clidean inner product of vectors p, q which is equal to p0 q,
(4) shows the list of acronyms used in this article. Figure (15) where A0 is the transpose of a vector or matrix A. Let (p, q)
presents the literature classification utilized in this survey. denotes a multivariate random vector and its sample instances
as S = ((p1 , q1 ), ..., (pn , qn )). S p represents (p1 , ..., pn ) and
S q = (q1 , ..., qn ), consider defining a new coordinate for p by
4.1. Real-valued representation learning
choosing a direction d p and projecting p onto the direction:
This section presents the information regarding various real- p → hd p , pi, similarly for q, the direction is dq . A sample of
valued representation learning methods and their application on new coordinate is obtained: S p,d p = (hd p , p1 i, ..., hd p , pn i) and
different datasets. Figure (16) presents the evolution of real- similarly S q,dq = (hdq , q1 i, ..., hdq , qn i). First step is to choose
valued representation learning methods in recent years. d p and dq for maximizing the correlation between vectors, such
11
Figure 14: Taxonomy of cross-modal retrieval methods

Table 4: List of acronyms in alphabetical order

Acronym Definition Acronym Definition Acronym Definition
AADAH Attention-Aware Deep Adversarial Hashing DVSH-Q DVSH variant without bitwise max-margin loss SCM Semantic Correlation Matching
ACMR Adversarial Cross-Modal Retrieval FSH Fusion Similarity Hashing S CM KL SCM with Kullback-Leibler divergence measure
AICDM Annotation by Image-to-Concept Distribution Model FSH-S FSH with a simple fusion graph construction S CM l1 SCM with l1 distance measure
BoVW Bag of Visual Words GSS-SL Generalized Semisupervised Structured Subspace learning SCM l2 SCM with l2 distance measure
BoW Bag of Words HOG Histogram of Oriented Gradients S CM NC SCM with normalized correlation measure
CCA Canonical Correlation Analysis HRL Hybrid Representation Learning S CM NCc SCM with centered normalized correlation measure
CM Correlation Matching HRL C Hybrid Representation Learning with CNN features SDCH Semantic Deep Cross-modal Hashing
CM l1 CM with l1 distance measure HRL H Hybrid Representation Learning with handcrafted features SIFT Scale Invariant Feature Transformation
CM l2 CM with l2 distance measure IMH Inter Media Hashing SM Semantic Matching
CM NC CM with normalized correlation measure JFSSL Joint Feature Selection and Subspace Learning SM KL SM with Kullback-Leibler divergence measure
CM NCc CM with centered normalized correlation measure KCCA Kernel Canonical Correlation Analysis S M l1 SM with l1 distance measure
CMOLRS Cross-Modal Online Low-Rank Similarity LBP Local Binary Pattern S M l2 SM with l2 distance measure
CMSTH Cross-Modal Self-Taught Hashing LCMH Linear Cross-Modal Hashing SM NC SM with normalized correlation measure
CMTC Cross-Modal Topic Correlation LDA Latent Dirichlet Allocation S M NCc SM with centered normalized correlation measure
CNN Convolutional Neural Network LSSH Latent Semantic Sparse Hashing SMFH Supervised Matrix Factorization Hashing
Corr-AE Correspondence Autoencoder M3R Multi-Modal Mutual topic Reinforcement modelling SRLCH Subspace Relation Learning for Cross-modal Hashing tech-
nique
CVH Cross-View Hashing MAP Mean Average Precision SVM Support Vector Machine
DAML Deep Adversarial Metric Learning MDCR Modality Dependent Cross-media Retrieval TDH Triplet based Deep Hashing
DAML D Deep Adversarial Metric Learning with deep features MDSSL Multiordered Discriminative Structured Subspace Learning TDH C TDH with CNN-F features
DAML S Deep Adversarial Metric Learning with shallow features MFDH Multi-view Feature Discrete Hashing T DH H TDH with handcrafted features
DCMH Deep cross-modal hashing MHTN Modal-adversarial Hybrid Transfer Network UCH Unsupervised Concatenation Hashing
DDL Discriminative Dictionary Learning MLRank Multi-correlation Learning to Rank UCH LLE UCH with Locally Linear Embedding
DLSH Discrete Latent Semantic Hashing MRR Mean Reciprocal Rank UCH LPP UCH with Locality Preserving Projection
DMSH Deep Multi-level Semantic Hashing MSFH Multi-modal graph regularized Smooth matrix Factorization ZSH Zero-Shot Hashing
Hashing
DVSH Deep Visual Semantic Hashing PR curve Precision Recall curve ZSH1 ZSH (with complete data)
DVSH-B DVSH variant without binarization QCH Quantized Correlation Hashing ZSH2 ZSH (with zero shot data)
DVSH-H DVSH variant without using the hashing networks RCH Robust Cross-view Hashing ZSH3 ZSH (with semi-supervised zero shot data)
DVSH-I DVSH variant by replacing the cosine max-margin loss S3CA Shared Semantic Space with Correlation Alignment ZSH4 ZSH (with semi-supervised zero shot data and different label
spaces)

that:
E[hd p , pihdq , qi]
ρ = max q (4)
d p ,dq
ρ = max Corr(S p d p , S q dq ) (1) E[hd p , pi2 ]E[hdq , qi2 ]
d p ,dq
E[d0p pq0 dq ]
hS p d p , S q dq i = max q (5)
= max (2) d p ,dq
d p ,dq S pdp S q dq E[d0p pp0 d p ]E[dq0 qq0 dq ]
d0p E[pq0 ]dq
where ρ represents the equation to be maximized. Let E de- = max q (6)
d p ,dq
notes the empirical expectation of function f (p, q) and given d0p E[pp0 ]d p dq0 E[qq0 ]dq
by
m Covariance matrix of (p, q) is defined as:
1X
E= f (pi , qi ) (3)  
m i=1 C C 
0 pp pq
Cov(p, q) = E qp qp =   = C (7)
 
then ρ can be redefined as C C  qp qq

12
Figure 15: Overview of literature based on image-text cross-modal retrieval

Total covariance matrix C is a block matrix where C pq = Cqp 0

as:
are between-sets covariance matrices and C pp , Cqq are within- d0pC pq dq
ρ = max q (8)
sets covariance matrices, although eq (7) represents the covari- d p ,dq
d0pC pp d p dq0 Cqq dq
ance matrix in zero-mean case only. Now ρ can be redefined
The maximum of ρ w.r.t. d p and dq is the maximum canonical
13
Figure 16: Real-valued representation learning methods’ research evolution.

in cross-modal document retrieval task. Two hypotheses have

been investigated in this: (a) Benefit to explicitly modelling the
correlation between two elements, and (b) This modelling is
more useful in feature spaces having higher levels of abstrac-
tion. Images are represented by Scale Invariant Feature Trans-
formation (SIFT) features and text using Latent Dirichlet Al-
location (LDA). The motive is to retrieve images that closely
match the text query and to retrieve text which closely matches
the image query. A new Wikipedia dataset has been composed
for the experimentation. The cross-modal framework proved to
outperform the state-of-art cross-modal retrieval methods and
even the novel image retrieval systems on the uni-modal re-
trieval tasks. A mathematical formulation is introduced in [57]
Figure 17: General representation of subspace learning process which associates cross-modal retrieval systems’ design with
isomorphic feature spaces for diverse modalities. Two hypothe-
ses are inspected related to the principal characteristics of these
correlation. feature spaces: (1) low-level cross-modal correlations should
Authors in [56] have proposed the use of CCA, Semantic be accounted for, and (2) space should allow semantic abstrac-
Matching (SM), and Semantic Correlation Matching (SCM) tion. So, three novel solutions to cross-modal retrieval problem

14
are then obtained from these hypotheses are CM, SM and SCM. data are learned at the output of the network. A similarity met-
CM is an unsupervised approach that models cross-modal cor- ric is also presented for improving distance measure which is
relations, SM is a supervised method that relies on semantic inspired by large scale similarity learning. In [62], an extension
representation and SCM is the combination of both of them. of the CCA approach has been introduced, named multi-label
In [58], a cross-modal retrieval framework has been pre- CCA (ml-CCA). It learns the shared subspaces by taking care of
sented which outputs a ranked list of semantically relevant text high-level semantic information in the formation of multi-label
from a separate text corpus (having no related images) when annotations. This approach utilizes the multi-label information
queried using an image and vice versa. For these two tasks, a for generating correspondences instead of relying on explicit
novel Structural SVM based unified formulation has been pro- pairing among different modalities like CCA. A fast ml-CCA
posed. Two representations considered for both image and text technique is also presented in this which has the capability of
modality are: (a) uni-modal probability distributions over top- handling huge datasets.
ics learned using LDA, and (b) explicit learning multi-modal An unsupervised learning framework based on KCCA is pro-
correlations using CCA. The work done in [41] is an extension posed which identifies the relation between image annotation
of [58]. A new loss function based on normalized correlation is by humans and the corresponding importance of things and
introduced in this which is found to be better than the previous their layout in the scene [63]. This uncovered relation is uti-
two loss functions. Along with this, the proposed method is lized in increasing the accuracy of search results as per queries.
compared with other baseline methods, extensive analysis of A novel approach for image retrieval and auto-tagging has been
training, and run-time efficiency. Comparison based on two introduced in [64] which utilizes the object importance infor-
new evaluation metrics and recent image and text features is mation provided by keyword tag list. It is an unsupervised ap-
also incorporated in the new work. [59] has proposed a cross- proach based on KCCA which finds the relationship between
modal technique for extracting semantic relationship between image tagging by humans and the corresponding importance
classes using annotated images. Firstly, both visual features of objects and their outline in the scene. As the KCCA tech-
and text are projected onto a latent space using CCA, and then nique is non-parametric, so it scales poorly with the training set
the probabilistic interpretation of CCA is utilized for calculat- size and has trouble with huge real-world datasets [2]. To han-
ing the representative distribution of the latent variable for each dle KCCA drawbacks and to provide an alternative, Deep CCA
class. Two measures are obtained based on the representative (DCCA) has been proposed. It tackles the scalability issue and
distributions: (1) semantic relation between classes, and (2) ab- leads to better correlated representation space.
straction level of each class.
Classic CCA method has few drawbacks [54]: (1) It is able Graph regularization based methods. Cross-modal retrieval
to compute only the linear correlation between two sets of vari- typically includes two fundamental issues: (a) Relevance es-
ables, however, the relationship may be non-linear in most of timation; and (b) Coupled feature selection. In [65], authors are
the real-world implementations; (2) It is able to operate only on dealing with both the issues. To deal with the first issue, multi-
two modalities; (3) If it is applied on a supervised problem then modal data is mapped to a common subspace to measure the
it wastes the information available in the form of labels because similarity among modalities. Projection matrices are learned
it is an unsupervised technique, and (4) Intra-modal semantic for this mapping and l21 -norm penalties are imposed on them
consistency is an important factor to improve retrieval accuracy separately to deal with the second issue, which selects appro-
but CCA fails to capture this [60]. To handle the drawbacks priate and discriminative features from diverse feature spaces
of classic CCA, several variants of this method are introduced at the same time. Further, a multi-modal graph regularization
such as Generalized CCA (GCCA), Kernal CCA (KCCA), Lo- term is applied to the projected data to preserve intra and in-
cality Preserving CCA (LPCCA), and Deep CCA (DCCA) to ter modality similarity relationships. An iterative algorithm is
name a few. CCA extension techniques seek to construct a cor- introduced for solving the joint learning issue along with its
relation that maximizes non-linear projection. In [61], authors convergence analysis. The excessive experimentation on three
have introduced a new dataset containing images, text (para- popular datasets proved the proposed technique to outperform
graph), and hyperlinks. This dataset is named as WIKI-CMR the state-of-art techniques.
and it is composed of Wikipedia articles. It consists of to- To overcome the semantic and heterogeneity gap between
tal of 74961 documents including images, textual paragraphs, modalities, the potential correlation of diverse modalities need
and hyperlinks. Documents are classified into 11 diverse se- to be considered. Also, the semantic information of class labels
mantic classes. CCA and KCCA cross-modal retrieval tech- required to be utilized for reducing the semantic gap among
niques have been applied to the dataset. An Improved CCA different modalities as well as realizing the inter-dependence
(ICCA) technique has been proposed in [60] to control the lim- and interoperability of divergent modalities. So, authors in
itations of traditional 2-view CCA. For improvement in intra- [52] have proposed a cross-modal retrieval framework which is
modal semantic consistency, two effective semantic features are based on graph regularization and modality dependence, fully
proposed which are based on text features. Traditional 2-view utilizing the correlation between modalities. After consider-
CCA has been expanded to 4-view CCA and it is embedded into ing the semantic and feature correlation, projection matrices
an escalating framework to reduce the over-fitting. The frame- are learned separately for Image-to-Text and Text-to-Image re-
work combines training of linear projection and non-linear hid- trievals. Then the internal arrangement of original feature space
den layers to make sure that fine representations of input raw is utilized to construct an adjoining graph having semantic in-
15
formation constraints which enables the diverse labels of mis- Other subspace learning methods. A modality-dependent
cellaneous modality data to get closer to respective semantic cross-media retrieval (MDCR) model has been proposed in [68]
information. The whole process can be visualized in figure in which two couple of projections are learned for diverse cross-
(18). The objective function for I2T and T2I tasks are defined media retrieval tasks rather than one couple of projections. Two
in equation (9 and 10) respectively. couple of mappings are learned to project text and images from
original feature space into separate common latent subspaces
2 2
F(U1 , V1 ) = λ U1T X − V1T Y F
+ (1 − λ) U1T X − S F
+ by simultaneously optimizing the correlation between text and
αtr(U1 X T L1 XU1T − S T L1 S )+ (9) images and linear regression from one modal space to seman-
tic space. A novel discriminative dictionary learning (DDL)
β1 kU1 k2F + β2 kV1 k2F approach amplified with common label alignment has been in-
troduced in [69] for effective cross-modal retrieval. It increases
2 2
F(U2 , V2 ) = λ U2T X − V2T Y F
+ (1 − λ) V2T Y −S F
+ the discriminative ability of intra-modality information from di-
αtr(V2 Y T L2 YV2T − S T L2 S )+ (10) verse concepts and relevance of inter-modality information in
the same class. To handle the huge multi-modal web data, [70]
β1 kU2 k2F + β2 kV2 k2F has proposed a cluster-sensitive cross-modal correlation learn-
where U1 , U2 and V1 , V2 represent the image and text projec- ing framework. A novel correlation subspace learning tech-
tion matrices in I2T and T2I respectively. S is the semantic nique which learns a group of a cluster–sensitive sub-models is
matrix of image and text, X and Y represents the feature ma- presented to better fit the content divergence of various modal-
trices of image and text respectively, λ, α, β1 and β2 are bal- ities.
ance parameters. A semantic consistency cross-modal retrieval A Multi-ordered Discriminative Structured Subspace Learn-
with semi-supervised graph regularization (SCCMR) method ing (MDSSL) approach is proposed in [71]. This metric
is introduced in [66] which ensures a globally optimal solu- learning framework learns a discriminative structured subspace
tion by merging prediction of labels and optimization of pro- where data distribution is reserved for ensuring a required met-
jection matrices to a unified architecture. Simultaneously, the ric. An adversarial cross-modal retrieval method has been pro-
method also considers nearest neighbors in potential image-text posed in [72] which attempts to make an effective common sub-
subspace and image-text with the same semantics using graph space based on adversarial learning. To handle the problem of
embedding. discriminative features are captured from different multi-view embedding from diverse visual hints and modalities,
modalities by applying l21 -norm constraint to projection matri- a unified solution is proposed for subspace learning techniques
ces. which makes use of Rayleigh quotient [73]. It is extendable
for supervised learning, multiple views, and non-linear embed-
ding. A multi-view modular discriminant analysis (MvMDA)
approach is introduced for considering the view difference. Af-
ter getting motivation from the fact that un-annotated data can
be easily compiled and helps to utilize the correlations among
diverse modalities, a novel generalized semi-supervised struc-
tured subspace learning (GSS-SL) approach is proposed in [67]
for the task of cross-modal retrieval. For aligning diverse
modality data by moving one source modality to another tar-
get modality, a cross-modal retrieval approach with augmented
Figure 18: Process of cross-modal retrieval framework followed in [52] adversarial training is proposed in [74]. An augmented version
of the conditional generative adversarial network is utilized for
Inspired by the fact that unlabelled data can be composed reserving the semantic meaning in the modality transfer pro-
easily and aid to exploit the correlation between modali- cess.
ties, [67] has proposed a novel framework generalized semi-
supervised structured subspace learning (GSS-SL) for cross- 4.1.2. Statistical and probabilistic methods
modal retrieval. A label graph constraint is proposed for pre- Statistical methods include the Markov model (MM), Hid-
dicting appropriate concept labels for un-annotated data. For den Markov Model (HMM), Markov Random Field, and so
modeling correlation between modalities, GSS-SL utilizes the forth. Probabilistic methods incorporate the use of probability
label space as a linkage after consideration of the fact that con- and various probabilistic models. They are typically utilized to
cept labels directly unveils the semantic information of multi- find out the probability of generating a particular modality re-
modal data. Specifically, a joint minimization formulation is sult based on a given query modality. Scientific biomedical ar-
created from the combination of the label-linked loss function, ticles contain multi-modal information such as images and text.
label graph constraint, and regularization for learning discrim- Considering the growth of the healthcare industry, important
inative common subspace. Multiple linear transformations are text and images keep on hiding under the inessential data which
alternatively optimized by an effective optimization method for makes it hard to retrieve the relevant information. Biomedical
diverse modalities and updating of the class indicator matrices articles often contain annotation markers or tags such as letters,
for un-annotated data is also performed. stars, symbols, or arrows in figures which highlight the cru-
16
cial area in the figure. These markers are also correlated with phase. In the end, a dynamic interpolation algorithm is se-
the image captions and text in the article. Identification of the lected for dealing with the problem of fusion of loss function.
markers becomes important to extract the ROIs from images. A Cross-Modal Online Low-Rank Similarity function learning
A novel technique has been proposed in [26] with the combi- (CMOLRS) technique is proposed in [79] that learns a low-rank
nation of rule-based and statistical image processing ways for bilinear similarity measurement for the task of cross-modal re-
localizing and annotating the medical image regions or ROIs. trieval. A fast-CMOLRS technique is also introduced which
Moreover, a cross-modal image retrieval technique has been has less processing time than the former technique.
implemented on articles and it is based upon ROI identification
and classification. 4.1.4. Topic Models
Automatic image annotation and retrieval framework based Topic models are a kind of statistical model that finds the
on probabilistic models have been proposed in [75] with an as- abstract topics which arise in a set of documents. A cross-
sumption that image regions can be explained using blobs (a modal topic correlation model has been introduced in [80]
kind of vocabulary). Blob is an acronym for Binary Large Ob- which jointly models the text and image modalities. A sta-
ject and it is a collection of binary data that is stored as a sin- tistical correlation model is examined which is conditioned on
gle unit in a database. Blobs are created from image features category information. [81] proposed a novel supervised multi-
using clustering. To automatically annotate or retrieve images modal mutual topic reinforcement modeling (M3R) technique
using a word as a query, the trained probabilistic model pre- that makes a joint cross-modal probabilistic graphical model for
dicts the probability of producing a word with the help of im- finding the mutually consistent semantic topics using required
age blobs. After experimentation, the proposed probabilistic interaction between model factors.
model based on the cross-media relevance model is proved to A topic correlation model (TCM) is presented in [82] by mu-
be almost six times better than a model based on the word- tual modeling of images and text modalities for cross-modal
blob co-occurrence model and two times better than a model retrieval task. Images are represented by the bag-of-features
derived from machine translation in terms of mean precision. model based on SIFT and text is represented by topic distri-
An improvement of cross-media relevance model [75] is pre- bution learned from the latent topic model. These features are
sented in [76] to automatically assign related keywords to un- mapped into a common semantic space and statistical correla-
annotated images based on images’ train data. Images present tions are analyzed. These correlations are utilized for finding
in the training dataset are fragmented into parts and then these out the conditional probability of results in one modality while
parts are represented using a blob. K-means algorithm is used querying in another modality.
for blobs’ creation for clustering those image parts. Using this
model, the probability for assigning a keyword into a blob is 4.1.5. Machine learning and Deep learning based methods
predicted and after annotation success, one image part is repre- Machine learning (ML) refers to the capability of a ma-
sented by a keyword. TF-IDF method is used for text document chine to enhance its performance on the basis of previous out-
feature extraction and appropriate text documents are retrieved comes. ML approaches allow systems to learn without being
using images’ automatic annotation information. Experimen- programmed explicitly. Deep learning mimics the way the hu-
tation is performed on IAPR TC-12 and 500 Wikipedia web- man brain works for both feature extraction and classification
pages (landscape related) dataset to show the usefulness of the as discussed in [83]. This section includes the works which
proposed technique. are based on machine learning and deep learning. Summary of
deep learning based cross-modal systems incorporating image
4.1.3. Rank based methods and text have been presented separately in the table (18). In
These methods see the issue of cross-modal retrieval as a [40], authors have proposed a novel technique of multi-modal
problem of learning to rank. Ranking of images and tags is suit- Deep Belief Network for finding out the missing data in text
able for efficient tag recommendation or image search. In [77], or image modality. Also, the proposed model can be used for
a new Multi-correlation Learning to Rank (MLRank) approach multi-modal data retrieval as well as annotation purpose. After
is proposed for image annotation which ranks the tags for im- experimentation on MIR Flickr data containing images and cor-
ages as per their relevance after considering semantic impor- responding tags, the proposed model is found to be better than
tance and visual similarity. Two cases are defined: (a) image- bi-modal data of images and text. Moreover, its performance
bias consistency; and (b) tag-bias consistency that is developed outperforms the performance of Linear Discriminant Analysis
into an optimization problem for rank learning. (LDA) and Support Vector Machine (SVM) models. As the
In [78], a ranking model has been optimized as a listwise cross-modal data is heterogeneous in nature, so it is trouble-
ranking problem considering cross-modal retrieval process and some to compare directly. For making it comparable, authors in
a learning to rank with relational graph and pointwise con- [30] have made use of deep learning by proposing a deep corre-
straint (LR2GP) technique has been proposed. Firstly, a dis- lation mining technique. Various media features are trained in
criminative ranking model is introduced that utilizes the rela- this technique and then fused together with the help of correla-
tionship between a single modality for improvement in ranking tion between their trained features. Moreover, the Levenberg-
performance and learning of an optimal embedding shared sub- Marquart technique has been used for avoiding the local min-
space. A pointwise constraint is proposed in low-dimension ima problem in deep learning. Experiments are performed on
embedding space to make up for the real loss in the training image-audio and image-text databases to validate the proposed
17
solution. Authors have proposed a novel cross-modal retrieval convolutional network.
technique based on similarity theory and deep learning [84]. A novel correspondence autoencoder model is proposed in
They have utilized Local Binary Pattern (LBP) as an image de- [89] which is designed by correlating hidden representations of
scriptor and Deep Belief Network (DBN) as a deep learning two uni-modal autoencoders. For this model training, an opti-
algorithm. mal objective that minimizes the linear combination of repre-
In [85], a new Scalable Deep Multi-modal Learning (SDML) sentation learning errors for every mode and correlation learn-
data retrieval method has been introduced. A common sub- ing error between the hidden representation of the modalities.
space is predefined to maximize between-class variation and A correspondence restricted Boltzmann machine (Corr-RBM)
minimize within-class variation. Then a network is trained for is proposed in [90] for mapping the original features of modal-
each modality separately such that n networks are obtained for ity data into a low-dimensional common space where hetero-
n modalities. It is done to transform multi-modal data into the geneous data can be compared. Two deep neural structures are
common predefined subspace for achieving multi-modal learn- made from corr-RBM as the chief building block for the cross-
ing. The method is scalable to a number of modalities as it can modal retrieval process. Cross-modal retrieval is performed
train different modality-specific networks separately. It is the using CNN visual features with various classic approaches in
first proposed technique which is individually projecting data [91]. A deep semantic matching (DSM) technique is also intro-
of different modalities into a predefined common subspace. Ex- duced for handling cross-modal retrieval w.r.t. samples labeled
perimentation is performed on four benchmark datasets such as with one or multiple labels. In [92], authors have proposed a
PKU XMedia, Wikipedia, NUS-WIDE, and MS-COCO dataset deep and bidirectional representation learning model (DBRLM)
to validate the proposed technique. To solve the problem of where images and text are represented by two separate convo-
image-text cross-modal retrieval, various novel models are in- lutional based networks.
troduced in [86] which are designed by correlating hidden rep- A novel modal-adversarial hybrid transfer network has been
resentations of a pair of autoencoders. Minimizing correlation proposed in [93]. It realizes the knowledge transfer from the
learning error enables the model to learn invisible representa- single-modal source domain to the cross-modal target domain
tions by just utilizing the general information in diverse modal- and then learns the common cross-modal representation. The
ities. On the other hand, minimizing the representation learn- architecture is based on deep learning and is divided into two
ing error builds hidden representations good enough for recon- subnetworks: (a) Modal-sharing knowledge transfer subnet-
structing inputs of each modality. A specific parameter is set in work; and (b) Modal adversarial semantic learning subnet-
the models to make a balance between two types of error gener- work. A deep learning model has been introduced in [94],
ated by representation and correlation learning. Models are di- named, AdaMine (ADAptive MINing Embedding) for learn-
vided into two groups: (1) one contains three models that recon- ing the common representation of recipe items incorporating
struct both modalities and so named as multimodal reconstruc- recipe images and their recipe in textual form. In [95], au-
tion correspondence autoencoder, and (2) the second contains thors have proposed a novel approach generative cross-modal
two models that reconstruct a single modality and so named learning network (GXN) which includes generative processes
as unimodal reconstruction correspondence autoencoder. Ex- into the cross-modal feature embedding which will be useful in
perimentation is performed on three popular datasets and the learning both global abstract features and local grounded fea-
proposed technique is found to be better than two popular mul- tures. A deep neural network based approach known as hybrid
timodal deep models and three CCA based models. representation learning (HRL) is proposed for learning com-
Supervised cross-modal retrieval techniques provide better mon representation for each modality [96].
accuracy than unsupervised techniques at the additional cost of A new deep adversarial metric learning (DAML) technique
data labeling or annotation. Lately, semi-supervised techniques is introduced for cross-modal retrieval which maps annotated
are gaining popularity as they provide a better framework to data pairs of diverse modalities non-linearly into a shared la-
balance the trade-off between annotation cost and retrieval ac- tent feature subspace [97]. The inter-concept difference is max-
curacy. A novel deep semi-supervised framework is proposed imized and the intra-concept difference is minimized. Each
in [87] to handle both annotated and un-annotated data. Firstly, data pair difference caught from modalities of the same class
an un-annotated part of training data is labeled using the la- is also minimized. Motivated by zero-shot learning, [98] has
bel prediction component and then a common representation presented a ternary adversarial network with self-supervision
of both modalities is learned to perform cross-modal retrieval. (TANSS) model. It includes three parallel sub-networks: (1)
The two modules of the network are trained in a sequential two semantic feature learning subnetworks which capture the
manner. After extensive experimentation on pascal, Wikipedia, intrinsic data structures of diverse modes and preserve their re-
and NUS-WIDE datasets, the proposed framework is found to lationships using semantic features in shared semantic space;
be outperforming both supervised and semi-supervised exist- (2) a self-supervised semantic subnetwork that utilizes seen and
ing methods. In [88], authors have introduced an image-text unseen label word vectors to use them as guidance for supervis-
multi-modal neural language model which can be utilized for ing semantic feature learning and increases knowledge transfer
retrieving related images from complex sentence queries and to unseen labels; and (3) adversarial learning scheme is used
vice versa. It has been presented here that text representations for maximizing the correlation and consistency of semantic fea-
and image features can be jointly learned in the case of image- tures among various modalities. This whole network facilitates
text modeling by training the models in conjunction using a effective iterative parameter optimization. In [99], a shared se-
18
mantic space with correlation alignment (S3CA) is proposed for Hashing function is practically utilized in a hash table data
cross-modal data representation. It aligns the non-linear corre- structure which is highly popular for quick data lookup.
lations of cross-modal data distribution in deep neural networks
made for diversified data. • Nearest neighbour (NN): It represents one or more data
entities in A = [a1 , a2 , ..., an ] ∈ RD which are nearest to the
4.1.6. Other methods query point aq .
This section includes the summary of those works which can- • Approximate nearest neighbour (ANN): It attempts to find
not be classified under any of the above classes. In [100], au- a data point a x ∈ A which is an ε−approximate nearest
thors have proposed an Annotation by Image-to-Concept Dis- neighbour of the query point aq in that ∀a x ∈ A, the dis-
tribution Model (AICDM) for image annotation using the links tance between a x and a satisfies the relation d(a x , a) ≤
between visual features and human concepts from image-to- (1 + ε)d(aq , a).
concept distribution. There is a rapid increase in the discussions
regarding disaster and emergency management on social me- Cross-modal hashing techniques are effective in resolving
dia these days. Flood event observation has a principal role in the issue of large scale cross-modal retrieval because it com-
emergency management and the related videos and images are bines the benefits of classic cross-modal retrieval and hash-
also uploaded and searched on the web while disasters. This ing. These techniques either rely on annotated training data
data can be helpful in emergency management by using it in or they lack semantic analysis [106]. For correlating diverse
sensors. Inspired by this, authors in [25] are performing image modalities, typical cross-modal hashing techniques learn a uni-
retrieval enhancement in the field of floods and flood aids. Inte- fied hash space. Then the search process is improved based
gration of image and text features is performed after extracting on hash codes. Hashing methods are broadly classified into
visual features from images using BoW and text features using Data-dependent and Data-independent methods [107]. In data-
TF-IDF and weirdness. After extensive experimentation on US dependent methods, an appropriate hash function is learned us-
FEMA and Facebook datasets, it has been demonstrated that ing the available training data, however, the hash function is
the proposed method is enhancing the emergency management generated using random mapping independent of the training
efficiency by showing improvement in image recognition with data in data-independent methods. Hash function learning is
the incorporation of text features in it. categorized into two stages: (1) Dimensionality reduction; and
Images are ranked as per similarity of semantic features in (2) Quantization. Dimensionality reduction means mapping the
the query by semantic example retrieval. So, in [38], the accu- information from the original space to a low-dimensional spa-
racy of semantic features is improved using cross-modal regu- tial representation. Quantization means a linear or non-linear
larization which is based on associated text. transformation of actual features to binary segment the feature
space for acquiring hash codes. The aim of hashing methods
4.2. Binary representation learning or Cross-modal hashing is to minimize the semantic gap among modalities as much as
In general, the word hash means chop and mix which con- possible. A typical resolution for this issue can be learning of a
secutively means that the hashing function chops and mixes in- uniform hash code to make it more consistent. Another resolu-
formation to obtain hash results [101]. The idea of hashing was tion can be the minimization of the coding distance and enhance
first introduced by H. P. Luhn in his 1953 article A new method its compactness. Hashing taxonomy followed in this survey is:
of recording and searching information [102]. Entire informa- (1) General hashing methods which are defined first; and (2)
tion regarding the birth of hashing is presented in [103]. It is Deep learning based hashing methods which are defined later
nearly impossible to achieve a completely even distribution. It in a different subsection. General hashing methods include all
can only be created by considering of structure of keys. For a the methods which do not incorporate deep learning. Figure
random group of keys, it is impractical to generate an appropri- (19) presents an evolution of cross-modal hashing techniques.
ate generic hash function as the keys are not known beforehand. Table (5) presents the comparison of hashing techniques on
Random uniform hash works best in this case. So, inspired by various characteristics such as optimization, time complexity,
the need of using random access system having a huge capacity hash function, and distance metric utilized for similarity cal-
for business applications, Peterson gave an estimation for the culation. While optimizing the objective function, either the
amount of search needed for the exact location of a record in relaxation is given for easy optimization or not which we call
numerous storage systems including the sorted-file and index discrete type. Relaxation of discrete hash codes may result
table method [104]. Then the term hashing was first used by in quantization loss and performance degradation [108]. Time
Morris in his article [105] in 1968. Few general definitions in complexity mentioned here is for the whole method execution
hashing are described below [101]: where n is the number of training samples used in it. Hash-
ing models can be categorized into linear and non-linear type
• Hashing function: This function (h(·)) is used to map the [109]. The distance metric is the metric utilized in the inter or
random size of data to a fixed interval [0, p]. Given a data intra similarity among modalities’ calculation.
having n data points i.e. A = [a1 , a2 , ..., an ] ∈ RD (real
coordinate space of dimension D) and a hashing function 4.2.1. General hashing methods
h(·), then h(A) = [h(a1 ), h(a2 ), ..., h(an )] ∈ [0, p] are known This section includes all the cross-modal retrieval works
as hashes or hash values of data points represented by A. based on hashing technique and which does not incorporate a
19
Figure 19: Evolution of research in cross-modal hashing

Table 5: Comparison of hashing methods on the basis of various characteristics. T = Traditional hashing method and D = Deep learning based hashing method
Characteristics Type Hashing method Methods
T LCMH [110], QCH [111]
Relaxation
D TDH [112], SDCH [113]
T DLSH [114], SRLCH [109]
Optimization Discrete
D DVSH [115], DCMH [116]
T MFDH [117], MSFH [108], SMFH [118]
Alternative solution
D ZSH [119]
Linear T UCH [120], LCMH [110], CMSTH [106], MSFH [108]
Hash function T SRLCH [109]
Non-linear
D DVSH [115]
T QCH [111]
Cosine
D DVSH [115]
T LCMH [110], MSFH [108]
Distance metric Euclidean
D ZSH [119]
T DLSH [114], CMSTH [106]
Hamming
D DCMH [116], TDH [112], AADAH [121]

deep learning approach. In [120], authors have proposed an in the end hash functions are learned for projecting the modal-
Unsupervised Concatenation Hashing (UCH) technique where ities to a unified hash space. A new cross-modal hashing tech-
Locally Linear Embedding and Locality Preserving Projection nique is proposed in [110] to handle the method scalability is-
are introduced for reconstructing the manifold structure of orig- sue in the training period. The time complexity of the technique
inal space in the hamming space. l2,1 -norm regularization is varies linearly with training data size which allows scalable in-
imposed on the projection matrices for exploiting the diverse dexing for multi-media search over various modalities. Hash
characteristics of various modalities. The proposed technique functions are learned accurately while considering inter and in-
has been compared with other hashing techniques such as CVH, tra modality similarities. Experiments are performed on NUS-
IMH, RCH, FSH, and CCA [122] as well. CVH [123] is an ex- WIDE and Wikipedia dataset to prove the effectiveness of the
tension of classic uni-modal spectral hashing [124] to multi- method. The objective function utilized here for preservation
modal field. In IMH [125], learned binary codes conserve of inter-similarity between modalities for the bi-modal case is
both inter and intra-media consistency. FSH [126] embeds the defined as:
graph-based fusion similarity to a common hamming space.
In RCH [127], common hamming space is learned in diverse
2
modalities’ binary codes are created as consistent as possible. min B(1) − B(2) ;
B(1) ,B(2) F
Table (6 shows the comparison of these techniques when ap-
T
plied on Wikipedia and Pascal dataset. This comparison is s.t., B(i) e = 0, (11)
based on MAP scores when images are retrieved from the text
b(i) ∈ {−1, 1},
(T2I), the text is retrieved from image (I2T) and the average of T
both scores. Bold values in the table represent the highest MAP B(i) B(i) = Ic , i = 1, 2;
score in the respective task and hash code length.

In [106], authors have introduced Cross-Modal Self-Taught where B(1) and B(2) represents the data matrices of image and
Hashing (CMSTH) technique for both cross-modal and uni- text modalities, e is n × 1 vector having each entry equal to 1,
modal image retrieval. It can successfully catch the semantic k·kF is a Frobenius norm, Ic is c × c identity matrix, B depicts
T
correlation from un-annotated training data. Three steps are fol- final binary codes obtained, constraint B(i) e = 0 needs each bit
T
lowed in the learning procedure: (1) Hierarchical Multi-Modal has same chance to be 1 or −1 and constraint B(i) B(i) = Ic
Topic Learning (HMMTL) is proposed for identifying multi- requires the bits of each modality to be acquired separately.
2
modal topics using semantic information; (2) Robust Matrix Loss function term B(1) − B(2) F obtains the maximal consis-
Factorization (RMF) is utilized for transferring the multi-modal tency (or the minimal difference) on two object representations.
topics to hash codes which form a unified hash space, and (3) Equation (11) is extended for more than two modality case and
20
Figure 20: Cross-modal hashing approach proposed in [109]

the new general equation obtained is: of one modal and yTi represents ith row of data matrix Y ∈ Rn×dy
of another modal. d x and dy are dimensions of the modalities.
p X
p
X 2 Similarity information between data points across domains is
min B(i) − B( j) ;
B(i) ,i=1,...,p F defined as: S i j = 1 iff xi and y j are similar and 0 otherwise.
i=1 i< j

s.t., B (i)T
e = 0, (12) min O(Bx , By , W x , Wy ) = (kBx − XW x k2F +
2
X
b(i) ∈ {−1, 1}, By − YWy F ) − α0 S ij
T (i, j)
B(i) B(i) = Ic , i = 1, ..., p,
(13)
q q
xiT W x WyT y j − xiT W x W xT xi yTj Wy WyT y j
where p represents no. of diverse modalities and rest of the
notations are same as eq. (11). s.t. W xT W x = Ic×c
The issue of cross-modal hashing is how to efficiently con- WyT Wy = Ic×c
struct the correlation among diverse modality representations in
the hash function learning process. Most of the traditional hash- where Bx ∈ {−1, 1}n×c and By ∈ {−1, 1}n×c are two kinds of bi-
ing techniques map the miscellaneous modality data to a joint nary codes with same code length c for each object. W x ∈ Rdx ×c
abstraction space by linear projections similar to CCA. Due to and Wy ∈ Rdy ×c depicts two projection matrices for two modali-
this, these methods are unable to effectively reduce the seman- ties, WyT means transpose of a matrix Wy and similarly for other
tic gap among modalities which has been proved to lead to bet- matrices. α0 represents control parameter for balancing quan-
ter accuracy in information retrieval. So to tackle this issue, a tization loss and cosine similarity constraint. For making W x
Latent Semantic Sparse Hashing method has been proposed in and Wy as orthogonal projections, constraints W xT W x = Ic×c and
[128]. This method executes the cross-modal similarity with WyT Wy = Ic×c are used.
the use of sparse coding, for capturing important images’ struc- Most of the classic hashing techniques either suffer from high
tures, and matrix factorization, for learning latent concepts from training costs or fail to capture the diverse semantics of various
the text. In [111], a quantized correlation hashing (QCH) tech- modalities. In order to tackle this issue, [114] has presented an
nique is proposed which considers the quantization loss over efficient Discrete Latent Semantic Hashing (DLSH) approach.
different modalities and the relation among them simultane- Firstly, it learns the latent semantic representations of miscella-
ously. The relation among diverse modalities that explains the neous modalities and afterward, projects them into a common
similar object is established by maximizing the correlation be- hamming space for supporting scalable cross-modal retrieval.
tween the hash codes across modes. The resultant objective This approach directly correlates the explicit semantic labels
function is converted to a uni-modal formulation which is then with binary codes, so it increases the discriminative ability of
optimized using another process. Objective function is defined learned hashing codes. Unlike traditional hashing approaches,
in equation (13). Suppose two modalities (xi , yi ) are represent- DLSH directly learns binary codes using an effective discrete
ing n object, where xiT depicts ith row of data matrix X ∈ Rn×dx hash optimization. The overall objective function of the DLSH
21
Table 6: Comparison of benchmark techniques on the basis of MAP scores on Wikipedia and Pascal VOC dataset with different hash code lengths presented in
[120].
Length of hash codes
Tasks Methods Wikipedia Pascal VOC 2007
16 32 64 128 16 32 64 128
CVH [123] 0.1499 0.1408 0.1372 0.1323 0.1484 0.1187 0.1651 0.1411
CCA [122] 0.1699 0.1519 0.1495 0.1472 0.1245 0.1267 0.123 0.1218
IMH [125] 0.2022 0.2127 0.2164 0.2171 0.2087 0.2016 0.1873 0.1718
I2T RCH [127] 0.2102 0.2234 0.2397 0.2497 0.2633 0.3013 0.3209 0.333
FSH [126] 0.2346 0.2491 0.2531 0.2573 0.289 0.3173 0.334 0.3496
UCH LPP [120] 0.242 0.2497 0.255 0.2576 0.2706 0.3074 0.3255 0.3277
UCH LLE [120] 0.2429 0.2518 0.2578 0.2588 0.2905 0.3245 0.3345 0.3396
CVH 0.1315 0.1171 0.108 0.1093 0.0931 0.0945 0.0978 0.0918
CCA 0.1587 0.1392 0.1272 0.1211 0.1283 0.1362 0.1465 0.1553
IMH 0.1648 0.1703 0.1737 0.172 0.1631 0.1558 0.1537 0.1464
T2I RCH 0.2171 0.2497 0.2825 0.2973 0.2145 0.2656 0.3275 0.3983
FSH 0.2149 0.2241 0.2332 0.2368 0.2617 0.303 0.3216 0.3428
UCH LPP 0.2351 0.2518 0.2623 0.2689 0.3945 0.4877 0.5187 0.5321
UCH LLE 0.2363 0.2567 0.2845 0.2993 0.4106 0.4913 0.5217 0.5343
CVH 0.1407 0.129 0.1226 0.1208 0.1208 0.1066 0.1315 0.1165
CCA 0.1643 0.1456 0.1384 0.1341 0.1264 0.1315 0.1347 0.1386
IMH 0.1835 0.1915 0.1951 0.1946 0.1859 0.1787 0.1705 0.1591
Average RCH 0.2137 0.2365 0.2611 0.2735 0.2389 0.2834 0.3242 0.3657
FSH 0.2248 0.2366 0.2431 0.247 0.2753 0.3102 0.3278 0.3462
UCH LPP 0.2385 0.2508 0.2586 0.2632 0.3326 0.3976 0.4221 0.4299
UCH LLE 0.2396 0.2542 0.2712 0.2791 0.3506 0.4079 0.4281 0.437

approach for two modalities is given as: increases the training accuracy. Both hash functions and uni-
2
fied binary codes are learned at the same time using an iterative
alternative optimization algorithm. Using these hash functions
X
min kφ(Xi ) − Ui Ai k2F +
Ui |i=1,2 ,Ai |i=1,2 ,Wi |i=1,2 ,Q and binary codes, multi-modal data can be effectively indexed
i=1
2 and searched. The framework of the proposed SRLCH tech-
X
β kB − Wi Ai k2F + nique is shown in figure (20).
i=1 (14)
 2 2

X X  In [118], authors have proposed an approach of supervised
δkB − QYk2F + γ  kUi k2F + kWi k2F + kQk2F  matrix factorization hashing for using label information and
i=1 i=1
effective cross-modal retrieval. This method is based on col-
s.t.B ∈ {−1, 1}L×N lective matrix factorization which considers both local geomet-
ric consistency in each mode and label consistency across sev-
where B is binary hash code matrix, k·kF is the Frobenius norm
eral modalities. To resolve the issue of quantization loss which
of matrix, L is hash code length and N is no. of training in-
happens by relaxing discrete hash codes in the cross-modal re-
stances, Xi denotes the original feature matrices of modalities,
trieval process, [108] has proposed a multi-modal graph reg-
Q is semantic transfer matrix, Ai ∈ Rk×N is the latent semantic
ularized smooth matrix factorization hashing which is an un-
representation of modalities and k is its dimension, Ui ∈ Rm×k
supervised technique. The aim of this technique is to learn
is basis matrix and m is no. of anchors, Wi ∈ RL×k represents
unified hash codes for multi-media data in a common latent
projection matrices for two sub-retrieval tasks, φ(Xi ) ∈ Rm×N
space where similarity of diverse modalities can be identified
is Gaussian kernel projection of image and text features, β and
efficiently.
δ are penalty parameters and γ is regularization parameter for
avoiding over-fitting.
In [109], authors have proposed a novel supervised Subspace [117] utilizes multiple views for image and text representa-
Relation Learning for Cross-modal Hashing technique which tion to enhance feature information. A discrete hashing learning
utilizes the relation information of labels in semantic space for framework is proposed which employs complementary infor-
making similar data from diverse modalities nearer in the low- mation among multiple views to make discriminative compact
dimension hamming subspace. This technique preserves the hash codes learning better. It performs classifier and subspace
discrete constraints, modality relations, and non-linear struc- learning simultaneously for completing multiple searches at the
tures while admitting a closed-form binary code solution which same time.
22
4.2.2. Cross-modal hashing methods based on deep learning shot hashing learns a hashing model that is trained using only
Deep learning has become highly popular in recent years. samples from seen classes, however, it has the capability of
Features extracted by deep learning methods have a powerful good generalization for unseen classes’ samples. Typically, it
capability of expressing the data and they also have rich seman- utilizes the class attributes to seek a semantic embedding space
tic information contained in them [106]. Thus, the multi-media for transferring knowledge from seen classes to unseen classes.
information retrieval accuracy enhances significantly by com- So, it may perform poorly in the case of less labeled data. In
bining hashing methods with deep learning. Various works in- [129], authors have proposed a multi-level semantic supervi-
corporating cross-modal hashing methods based on deep learn- sion generating method after exploring the label relevance, and
ing have been introduced recently which are discussed in this a deep hashing framework is introduced for multi-label image-
section. text cross-modal retrieval. It can capture the binary similarity
as well as the complex multi-label semantic structure of data in
Capturing of spatial dependency of images and temporal dy-
diverse forms at the same time.
namics of text is an important task in learning potential feature
representations and cross-modal relations as it reduces the het-
erogeneity gap among modalities. So, a novel Deep Visual Se- 5. Benchmark datasets
mantic Hashing model has been introduced in [115]. It creates
concise hash codes of textual sentences and images in a com- With the advent of huge multi-modal data generation, cross-
plete deep learning architecture that catches the essential cross- modal retrieval has become a crucial and interesting problem.
modal correspondences between natural language and visual Researchers have composed diverse multi-modal datasets for
data. DVSH model has a hybrid deep framework that comprises evaluating the proposed cross-modal techniques. Figure (21)
a visual semantic fusion network to learn joint embedding space presents the evolution of the datasets in recent years. Sum-
of text and images, and two mode-specific hashing networks to mary of prominent multi-modal datasets is given in table (7)
learn hash functions for generating concise binary codes. The which includes dataset name, mode, total concepts, dataset size,
proposed framework efficiently unites cross-modal hashing and image representation, text representation, related article, and
joint multi-modal embedding that is based on a new amalgama- data source. Figure (22) presents a graph of the total number
tion of RNN over sentences, CNN over images, and a structures of categories in the datasets. Information regarding prominent
max-margin objective which combines everything together to benchmark datasets is given in the following points. After go-
facilitate the learning of similarity preserving and high-quality ing through all the references related to cross-modal retrieval
hash codes. Various cross-modal hashing techniques are based used in this survey, approximately used frequencies of popular
on hand-crafted features that may not attain a good accuracy datasets have been found and are represented in the form of a
value. A novel deep cross-modal hashing technique has been bar chart in figure (23).
introduced in [116] by combining hash-code learning and fea-
ture learning into the same framework. From beginning to end,
this framework consists of deep neural networks, one for each
mode to do feature learning from starting.
A triplet based deep hashing network is proposed in [112].
firstly, the triplet labels are utilized that explains the relative
relationship among three instances as supervision for catching
more common semantic correlations among cross-modal in-
stances. For boosting the discriminative ability of hash codes,
a loss function is generated from intra-modal and inter-modal
views. In the end, graph regularization is utilized for preserv-
ing the actual semantic similarity between hash codes in the
hamming space. A deep adversarial hashing network has been
proposed in [121] with attention mechanism for increasing the
measurement of content similarities for particularly aiming at
the informative pieces of multi-media. It has three modules:
(a) feature learning module for getting feature representations; Figure 21: Evolution of benchmark datasets
(b) attention module for creating attention mask; (c) hashing
module for learning hash functions. A novel deep cross-modal
hashing framework is proposed in [113] which combines hash 1. NUS-WIDE1 [130]: This is a real-world web image dataset
codes and feature learning into the same network. It has consid- composed by Lab for Media Search in the National Uni-
ered both inter and intra modality correlation and a loss function versity of Singapore. It consists of: (a) 2,69,648 images
with dual semantic supervision for hash learning. and associated tags from Flickr with 5,018 unique tags,
In [119], a novel cross-modal zero-shot hashing method has
been introduced which efficiently utilizes both labeled and un- 1 https://lms.comp.nus.edu.sg/wp-content/uploads/2019/

labeled multi-modal data having separate label spaces. Zero- research/nuswide/NUS-WIDE.html

23
Figure 24: Two examples from NUS-WIDE dataset in which an image is asso-
Figure 22: A chart displaying the total number of categories in the popular ciated with numerous related tags
datasets

lished from an initiative started by Technical Commit-

tee 12 (TC-12) of the International Association of Pattern
Recognition (IAPR). The idea behind this dataset creation
was to use it for evaluating the efficiency of both visual
and text-based retrieval techniques.
3. Wikipedia3 [56]: It consists of a document corpus with as-
sociated text and image pairs. It has been designed from
Wikipedia’s featured articles which are complemented by
one or more images from Wikipedia Commons, providing
a pair of desirable variety. Each article is classified into
one of 29 concepts by Wikipedia and the concepts are as-
signed to both image and text modules of the article. The
researchers have considered the top 10 highly populated
concepts as some of the concepts are rare. The final cor-
pus consists of 2,866 documents. These are image-text
pairs that have been assigned a class from the vocabulary
of 10 semantic classes.
4. PASCAL VOC 20074 [132]: This dataset has been taken
from the PASCAL (pattern analysis, statistical modeling,
and computational learning) VOC (Visual Object Chal-
lenge) challenge. The dataset provided in this challenge
is being utilized by researchers for the evaluation of the
Figure 23: Approximate used frequencies of prominent datasets in the refer- proposed cross-modal techniques. PASCAL VOC 2007
ences on cross-modal retrieval dataset has been widely used by the research commu-
nity. It contains annotated consumer pictures composed
(b) Ground-truth for 81 concepts; and (c) low-level im- from Flickr5 (photo and video sharing website). The
age features of six types, comprising 144-D color cor- dataset consists of a total of 9,963 images and 24,640 an-
relogram, 128-D wavelet texture, 64-D color histogram, notated objects which have been categorized into 20 differ-
500-D BoW based on SIFT descriptions, 73-D edge direc- ent classes with four main concepts. The images consist of
tion histogram and 225-D block-wise color moments. Fig- varied viewing conditions such as lightning, pose, and oth-
ure (24) shows two examples (angelfish and autumn class) ers. Annotators took guidance from annotation guidelines6
from the dataset with the image and the associated tags. for appropriately annotating each image in the ground-
2. IAPR TC-122 [131]: This dataset is also known as Im- truth[133]. The entities mentioned in the annotation are
ageCLEF 2006. It has been created for CLEF (Cross- class, bounding box, view, truncated, and difficult.
Language Evaluation Forum) cross-language image re- 5. MIR FLickr 25k and 1M7 [134], [135]: The dataset is
trieval task. It is composed of 20,000 images taken from
3 http://www.svcl.ucsd.edu/projects/crossmodal/
a private photographic image collection and associated 4 http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.
captions are in three different languages such as English,
html
Spanish, and German. This benchmark has been estab- 5 https://www.flickr.com/
6 http://host.robots.ox.ac.uk/pascal/VOC/voc2007/

guidelines.html
2 https://www.imageclef.org/photodata 7 http://press.liacs.nl/mirflickr/

24
available in 2 sizes: 25k and 1M. The images have been 9. MS-COCO10 [140]: Microsoft Common Objects in COn-
collected from Flickr for the research purpose related to text (MS COCO) dataset has been composed of the pic-
image content and image tags. Moreover, tags and EXIF tures of daily scenes consisting of general objects in their
(Exchangeable image file format) image metadata has also usual environment. The objects are labelled using per-
been extracted and made publicly available. Image tags instance segmentation to help in precise object localiza-
have been presented in two forms: (a) raw form in which tion. The dataset consists of total 3,28,000 images with
they are obtained from users; and, (b) in the processed 25,00,000 labelled instances. The objects chosen for the
form where raw tags have been cleaned by Flickr (e.g. re- dataset are from 91 diverse categories. The annotation
moval of spaces and special characters). In MIR Flickr 25k pipeline has been divided into three prominent exercises:
data, images have been manually annotated. Each image (1) labelling concepts which are present in the image, (2)
has an average of 8.94 tags. So, there are 1386 tags that locating and marking all instances of labelled concepts;
are associated with at least 20 images. Images are split into and (3) segmentation of each object instance.
15,000 training and 10,000 testing images. MIR Flickr 1M 10. WIKI-CMR [61]: This dataset has been collected from
data is an extension of MIR Flickr 25k. Images have not Wikipedia articles which contain images, paragraphs and
been annotated manually, unlike original 25k data. Images hyperlinks. Authors mainly focused on the areas: geog-
are represented using MPEG-7 edge histogram and homo- raphy, people, nature, culture and history for dataset col-
geneous texture descriptors and color descriptors. lection. It consists of total 74,961 documents categorized
6. INRIA-Websearch [136]: This dataset consists of 71,478 into 11 diverse concepts. Each of the document includes
images resulted from a web search engine for 353 mis- one paragraph, one associated image (or no image), a cat-
cellaneous search queries. Top-ranked images have been egory label and hyperlinks. Images are represented using
chosen from this search along with their corresponding eight types of features including dense SIFT, Gist, PHOG,
metadata and ground-truth annotations. For each searched LBP and other features. Text is represented using TF-IDF.
query, the dataset comprises the initial textual query, top-
ranked images, and an annotation file. More than 200 im-
ages have been retrieved for 80% of queries. Annotation 6. Comparative analysis
file consists of manual labels for image relevance to the
query and other related metadata such as web page URL, In this section, prominent evaluation metrics used for cross-
image URL, page title, image’s alternate text, 10 words modal retrieval method performance analysis are defined. Af-
before the image on a web page, and 10 words after. Im- terward, comparisons of various cross-modal retrieval methods
ages have been scaled to fit in a 150 × 150 pixel square, when applied on diverse datasets are presented on the basis of
however, preserving the original aspect ratio. MAP score.
7. Flickr 8k and 30k8 [137], [138]: Flickr 30k is an exten-
sion of the Flickr 8k dataset. Both datasets have been cre- 6.1. Evaluation metrics
ated from the Flickr website. Flickr 8k contains 8,092 im- For image and text modality, two cross-modal retrievals are
ages and its main focus is on people or animals (mainly considered: (a) image to text retrieval (I2T), means retrieving
dogs) carrying out some action. Images have been col- text related to the query image; and (2) text to image retrieval
lected from six different Flickr groups manually and anno- (T2I), retrieving images that match with the textual query [1].
tated using multiple captions in the form of sentences by Precisely in the testing phase, given a text or an image query,
selected workers from the US. Flickr 30k contains 31,783 the aim of the cross-modal method is to search and retrieve the
images of everyday scenes, activities, and events. Images images or text that closely matches the query modality respec-
are associated with 1,58,915 captions which have been at- tively. A retrieved outcome is considered to be relevant if it
tained via crowd-sourcing. The approach followed to col- belongs to the same concept as the query modality. Two typical
lect this data is the same as followed by [137]. factors considered while quantitative performance evaluation
8. PASCAL sentence data9 [139]: The images for this dataset are: (1) class relevance evaluation between query and outcome;
have been collected from PASCAL VOC 2008 challenge (2) examining cross-modal relevance for image-text pairs. The
[132]. Data consists of 1000 images selected from around first factor tells about the ability to learn diverse cross-modal
6000 images of PASCAL VOC 2008 training data. Images latent representations while the second factor says about the ca-
have been categorized into 20 categories depending upon pability of learning correlated latent concepts [81]. The metrics
the objects that appear in them and few images are present related to the above two factors are as follows:
in multiple classes. Fifty random images have been cho-
sen from each class to compose the dataset. Each image is 1. Precision, recall and PR curve: Precision is defined as the
annotated with five different captions in the form of sen- ratio of T P to T P + FP, where T P is the number of out-
tences. comes which are similar to query and T P + FP is the num-
ber of total retrieved outcomes. It is useful in measuring

8 http://shannon.cs.illinois.edu/DenotationGraph/
9 http://vision.cs.uiuc.edu/pascal-sentences/ 10 http://cocodataset.org/

25
Table 7: Summary of prominent image-text multi-modal datasets
Sr. Dataset Year Mode Total con- Total images/ Image representation Text repre-
No. cepts text sentation
1 IAPR TC-12 2006 Image/ cap- Diverse 20,000/ - -
[131] tion 60,000
2 MIRFlickr 25k 2008 Image/ tags Diverse 25,000/ - -
[134] 2,23,500
3 NUS-WIDE 2009 Image/ tags 81 2,69,648/ Color correlogram, Tag occur-
[130] 5,018 unique wavelet texture, color rence feature
tags histogram, BoW based
on SIFT descriptions,
edge direction histogram
and block-wise color
moments
4 ImageNet11 [141] 2009 Images/ 12 subtrees 32,00,000/ SIFT -
synsets 5,247
5 Wikipedia [56] 2010 Image/ text 29 (10 major) 2,866/ 2,866 SIFT LDA
6 Pascal VOC 2007 2010 Image/ tags 20 9,963/ 24,640 - -
[132]
7 MIRFlickr 1M 2010 Image/ tags Diverse 10,00,000/ MPEG-7 edge histogram Flickr user
[135] 89,40,000 and homogeneous tex- tags, EXIF
ture descriptors, color metadata
descriptor
8 INRIA- 2010 Image/ labels 353 71478/ - - -
websearch
[136]
9 Pascal sentence 2010 Image/ sen- 20 1000/ 5000 - -
data [139] tences
10 Wikipedia POTD 2011 Images/ para- NA 1987/ 1987 SIFT Text tokeniza-
[142] graphs tion using
rainbow
11 Flickr 8k [137] 2013 Image/ cap- Diverse 8092/ 40460 - -
tions or
sentences
12 WIKI-CMR [61] 2013 Images/ 11 38,804/ SIFT, gist, PHOG, LBP, TF-IDF
paragraphs/ 74,961 self similarity, spatial
hyperlinks pyramid method
13 Flickr 30k [138] 2014 Image/ cap- Diverse 31,783/ - -
tions or 1,58,915
sentences
14 MS COCO [140] 2014 Images/ labels 91 3,28,000/ - -
25,00,000
11
http://www.image-net.org

the probability of success for an information retrieval sys- T P + FN is the total number of relevant outcomes in the
tem. On the other hand, Recall is defined as the ratio of T P repository. It is useful in measuring the percentage of re-
to T P + FN, where T P is the same as explained above and trieved relevant results for an information retrieval system

26
[76, 84]. Refer to the table (8) for a complete understand- rel(o) = 1 and otherwise 0. Now, MAP can be defined
ing of the definition of precision and recall. Precision and as (eq. 20):
recall can be expressed as (eq. 15, 16):
Q
1 X
TP MAP = AP (20)
prec = (15) Q q=1
T P + FP
TP where Q is the total number of queries. A large MAP value
rec = (16) signifies the betterment of the cross-modal algorithm when
T P + FN
applied on a particular dataset.
where prec represents precision, rec is recall, T P indicates 4. Percentage: MAP metric only considers the factors that
true positive, FP is false positive and FN represents false whether the outcome is relevant to query or not. For more
negative. precise evaluation, all the retrieved outcomes are ranked as
per correlation. Typically, a query text (or image) is con-
Table 8: Table for better understanding of precision and recall sidered to be successful in retrieving results if its corre-
Relevant Irrelevant Total sponding ground-truth image (or text) appears in the first
Retrieved True Positive (TP) False Positive (FP) Predicted Positive
a percent of the ranked list of retrieved outcomes. Per-
Not retrieved False Negative (FN) True Negative (TN) Predicted Negative
Total Actual Positive Actual Negative T P + FP + T N + FN
centage is the ratio of correctly retrieved query outcomes
among all the query outcomes. Authors in [84, 81, 142]
Most of the works [143, 52, 144, 145] have used the have utilized this metric for algorithm evaluation and have
precision-recall curve to visualize the performance of their chosen the value of a as 0.2 or 20%.
algorithm. The curve indicates the precision value at dif- 5. MRR: Mean Reciprocal Rank (MRR) is another perfor-
ferent recall levels. Authors in [146] have also used preci- mance evaluation metric similar to percentage. It has been
sion curve for performance visualization. It indicates the applied in [84, 81] for method evaluation regarding the po-
change in precision with respect to the number of retrieved sition of the corresponding ground-truth outcome paired
results. with the query. It is mathematically expressed as (eq. 21):
2. F-measure: It is a typical metric utilized for evaluating |O|
1 X 1
the performance of information retrieval systems [84]. Af- MRR = (21)
ter considering the effects of both precision and recall, F- |O| n=0 rankn
measure can be defined mathematically as eq. (17): where |O| is the number of query outcomes, rankn in-
dicates the position of corresponding unique groud-truth
(θ2 + 1) × prec × rec paired with nth query in the retrieved set.
F= (17)
θ2 × (prec + rec)
here θ has been used for adjusting the weighted proportion
of both recall (rec) and precision (prec). If θ becomes 1
then F-measure can be redefined as F1 (eq. 18):

2 × prec × rec
F1 = (18)
prec + rec
Here, F1 is the perfect combination of recall and precision.
More the value of F1 , more better is the algorithm.
3. MAP: Mean Average Precision (MAP) is the most popu-
lar metric used for evaluating the performance of a cross-
modal retrieval algorithm. It measures whether the re-
trieved result belongs to the same class as the query data
(relevant) or not (irrelevant) [81]. It is the average of av-
erage precision calculated over all the queries. Given a
query (an image or a text) and a group of its correspond- Figure 25: Average MAP score chart of different hashing methods on NUS-
ing O retrieved outcomes, average precision is defined as WIDE data
(eq. 19):

1X
O 6.2. Comparison of results using diverse techniques
AP = P(o)rel(o) (19) This section presents the comparison of various cross-modal
R o=1
retrieval techniques on primary datasets. Techniques are com-
where R is the number of relevant outcomes in the re- pared on the basis of the MAP score as it is the most popular and
trieved outcomes, P(o) is the precision of top o retrieved widely used evaluation metric. Three MAP scores are consid-
outcomes, if the oth retrieved outcome is relevant then ered here which are I2T (when image queries related text), T 2I
27
Table 9: Comparison of prominent hashing techniques on the basis of MAP scores on NUS-WIDE dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
IMH [125] 0.2056 0.2145 0.2317 0.2381 0.2533 0.2613 0.22185 0.2339 0.2465
LSSH [128] 0.4933 0.5006 0.5069 0.5084 0.625 0.6578 0.6823 0.6913 0.55915 0.5792 0.5946 0.59985
QCH [111] 0.5395 0.5489 0.5568 0.5741 0.54815 0.5615
CMSTH [106] 0.5032 0.5073 0.527 0.5439 0.4761 0.4965 0.5088 0.5243 0.48965 0.5019 0.5179 0.5341
FSH-S [126] 0.4996 0.461 0.4556 0.4776 0.446 0.4423 0.4886 0.4535 0.44895
FSH [126] 0.5059 0.5063 0.5171 0.479 0.481 0.4965 0.49245 0.49365 0.5068
SMFH [118] 0.4553 0.4623 0.4658 0.468 0.5033 0.5056 0.5065 0.5079 0.4793 0.48395 0.48615 0.48795
MFDH [117] 0.646 0.6714 0.7014 0.7121 0.7811 0.8285 0.8653 0.8824 0.71355 0.74995 0.78335 0.79725
DLSH [114] 0.5127 0.516 0.5179 0.5203 0.5234 0.5284 0.5165 0.5197 0.52315
DCMH [116] 0.5903 0.6031 0.6093 0.6389 0.6511 0.6571 0.6146 0.6271 0.6332
AADAH [121] 0.6403 0.6294 0.652 0.6789 0.6975 0.7039 0.6596 0.66345 0.67795
TDH H [112] 0.6393 0.6626 0.6754 0.6647 0.6758 0.6803 0.652 0.6692 0.67785
TDH C [112] 0.6393 0.6626 0.6754 0.6647 0.6758 0.6803 0.652 0.6692 0.67785
ZSH1 [119] 0.6411 0.6434 0.6457 0.6468 0.6755 0.6763 0.6789 0.6796 0.6583 0.65985 0.6623 0.6632
ZSH2 [119] 0.5982 0.6017 0.6033 0.6059 0.6286 0.6297 0.6325 0.6339 0.6134 0.6157 0.6179 0.6199
ZSH3 [119] 0.1733 0.1756 0.1771 0.1783 0.1721 0.1736 0.1743 0.1748 0.1727 0.1746 0.1757 0.17655
ZSH4 [119] 0.1481 0.1492 0.1511 0.1519 0.1437 0.1453 0.1475 0.1498 0.1459 0.14725 0.1493 0.15085
SDCH [113] 0.813 0.834 0.841 0.823 0.857 0.868 0.818 0.8455 0.8545

Table 10: Comparison of prominent hashing techniques on the basis of MAP scores on Wikipedia dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
LSSH [128] 0.233 0.234 0.2387 0.234 0.5571 0.5743 0.571 0.5577 0.39505 0.40415 0.40485 0.39585
QCH [111] 0.2343 0.2477 0.3034 0.317 0.26885 0.28235
CMSTH [106] 0.3155 0.3293 0.3313 0.3375 0.3562 0.37 0.3825 0.3878 0.33585 0.34965 0.3569 0.36265
SMFH [118] 0.2572 0.2759 0.2863 0.2913 0.5784 0.604 0.6163 0.6219 0.4178 0.43995 0.4513 0.4566
MFDH [117] 0.3548 0.3763 0.3878 0.3954 0.8318 0.8458 0.8568 0.8666 0.5933 0.61105 0.6223 0.631
DLSH [114] 0.2838 0.3429 0.352 0.6764 0.7478 0.749 0.4801 0.54535 0.5505
ZSH1 [119] 0.2998 0.3017 0.3035 0.3063 0.3016 0.3025 0.3044 0.3061 0.3007 0.3021 0.30395 0.3062
ZSH2 [119] 0.2543 0.2551 0.2576 0.2581 0.2526 0.2541 0.2563 0.2587 0.25345 0.2546 0.25695 0.2584
ZSH3 [119] 0.1214 0.1233 0.1247 0.1251 0.1178 0.1196 0.1218 0.1232 0.1196 0.12145 0.12325 0.12415
ZSH4 [119] 0.0982 0.0997 0.1012 0.1019 0.0936 0.0949 0.0971 0.0995 0.0959 0.0973 0.09915 0.1007

(when text queries matched images), and the average of these IAPR TC-12 dataset. SDCH [113] method has the highest map
(I2T and T 2I) two values. Table (9) shows the MAP scores score in both I2T and T2I tasks on all hash code lengths except
on NUS-WIDE dataset. The blank spaces in the tables indi- 128. On length 128, DVSH-B [115] method shows the high-
cate that there is no value provided for that particular hash code est performance for both tasks. Average MAP results shown
length. The bold value in each of the hash code columns rep- in table (10, 11 and 12) can be visualized in figure (26, 27 and
resents the highest value in that column. Figure (25) presents 28) for Wikipedia, MIRFlickr and IAPR TC-12 datasets respec-
a chart of average MAP scores for table (9). It is evident from tively.
table (9) and figure (25) that the performance of SDCH [113]
method is the best in both I2T and T2I tasks, however, MFDH
[117] shows best performance on 128 hash code length. Table Table (13 and 14) show the comparison of various real-
(10 and 11) presents the MAP scores on Wikipedia and MIR- valued learning techniques based on MAP score on Wikipedia
Flickr 25k dataset respectively. For table (10), the best results and NUS-WIDE dataset respectively. Four types of methods are
are shown by MFDH [117] technique in both I2T and T2I tasks. included: (1) deep learning based; (2) subspace learning meth-
ZSH1 [119] method shows the best performance on MIRFlickr ods; (3) topic models; and (4) rank-based methods. Map score
25k dataset for 128 hash code and otherwise SDCH [113] per- in bold font represents the highest value in that particular col-
forms the best. Table (12) shows the comparison of various umn and italic font represents the highest value in a particular
cross-modal hashing techniques based on deep learning on the method type.
28
Table 11: Comparison of prominent hashing techniques on the basis of MAP scores on MIRFlickr 25k dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
FSH-S [126] 0.609 0.5969 0.593 0.6036 0.5944 0.5923 0.6063 0.59565 0.59265
FSH [126] 0.5968 0.6189 0.6195 0.5924 0.6128 0.6091 0.5946 0.61585 0.6143
MFDH [117] 0.6836 0.6939 0.7066 0.723 0.7408 0.7506 0.7602 0.7797 0.7122 0.72225 0.7334 0.75135
DLSH [114] 0.6379 0.648 0.6603 0.6764 0.6777 0.685 0.65715 0.66285 0.67265
DCMH [116] 0.741 0.7465 0.7485 0.7827 0.79 0.7932 0.76185 0.76825 0.77085
AADAH [121] 0.7563 0.7719 0.772 0.7922 0.8062 0.8074 0.77425 0.78905 0.7897
TDH H [112] 0.711 0.7228 0.7289 0.7422 0.75 0.7548 0.7266 0.7364 0.74185
TDH C [112] 0.711 0.7228 0.7289 0.7422 0.75 0.7548 0.7266 0.7364 0.74185
DMSH [129] 0.726 0.737 0.75 0.755 0.763 0.775 0.7405 0.75 0.7625
ZSH1 [119] 0.7812 0.7831 0.7862 0.7874 0.7964 0.7989 0.8025 0.8037 0.7888 0.791 0.79435 0.79555
ZSH2 [119] 0.7302 0.7334 0.7351 0.7363 0.7092 0.7113 0.7132 0.7148 0.7197 0.72235 0.72415 0.72555
ZSH3 [119] 0.2126 0.2135 0.2141 0.2147 0.2016 0.2023 0.2027 0.2031 0.2071 0.2079 0.2084 0.2089
ZSH4 [119] 0.1873 0.1899 0.1917 0.1926 0.1795 0.1807 0.1816 0.1822 0.1834 0.1853 0.18665 0.1874
SDCH [113] 0.845 0.866 0.873 0.831 0.856 0.863 0.838 0.861 0.868

Table 12: Comparison of prominent deep learning based hashing techniques on the basis of MAP scores on IAPR TC-12 dataset with different hash code lengths.
Length of hash codes
Methods I2T T2I Average
16 32 64 128 16 32 64 128 16 32 64 128
DVSH [115] 0.5696 0.6321 0.6964 0.7236 0.6037 0.6395 0.6806 0.6751 0.58665 0.6358 0.6885 0.69935
DVSH-B [115] 0.626 0.6761 0.7359 0.7554 0.6285 0.6728 0.6922 0.6756 0.62725 0.67445 0.71405 0.7155
DVSH-Q [115] 0.5385 0.6113 0.6869 0.7097 0.5684 0.6153 0.6618 0.6693 0.55345 0.6133 0.67435 0.6895
DVSH-I [115] 0.4792 0.5035 0.5583 0.589 0.4903 0.5496 0.589 0.6012 0.48475 0.52655 0.57365 0.5951
DVSH-H [115] 0.4575 0.4975 0.5493 0.569 0.4396 0.4853 0.5185 0.5337 0.44855 0.4914 0.5339 0.55135
DCMH [116] 0.4526 0.4732 0.4844 0.5185 0.5378 0.5468 0.48555 0.5055 0.5156
AADAH [121] 0.5293 0.5283 0.5439 0.5358 0.5565 0.5648 0.53255 0.5424 0.55435
SDCH [113] 0.726 0.787 0.803 0.704 0.783 0.797 0.715 0.785 0.8

Figure 26: Average MAP score chart of different hashing methods on Wikipedia Figure 27: Average MAP score chart of different hashing methods on MIR-
data Flickr data

7. Discussion To handle this issue, researchers have introduced several tech-

niques for multi-modal data representation in the past few
Cross-modal information retrieval is a burdensome task be- years. Table (18) presents a summary of recent literature for
cause of the semantic gap among modalities. Due to which state-of-the-art techniques used for image-text cross-modal re-
different modalities cannot be compared directly to each other. trieval. It is divided into three parts: the first part contains works
29
Table 13: Performance comparison of prominent real-valued learning methods
on the basis of MAP score on Wikipedia dataset
MAP score
Method type Methods
I2T T2I Average
LBP+DBN [84] 0.2576 0.2761 0.2669
Deep semi- 0.436 0.341 0.388
supervised frame-
work [87]
MHTN [93] 0.514 0.444 0.479
HRL H [96] 0.672 0.686 0.679
Deep learning
based methods HRL C [96] 0.647 0.666 0.656
DAML S [97] 0.356 0.267 0.322
DAML D [97] 0.559 0.481 0.52
S3CA [99] 0.551 0.485 0.518
Corr-AE [89] 0.326 0.361 0.344
Figure 28: Average MAP score chart of different hashing methods on IAPR Corr-Cross-AE [89] 0.336 0.341 0.338
TC-12 data
Corr-Full-AE [89] 0.335 0.368 0.352
JFSSL [65] 0.3063 0.2275 0.2669
CM [56] 0.249 0.196 0.223
SM [56] 0.225 0.223 0.224
SCM [56] 0.277 0.226 0.252
incorporating real-valued representation learning, the second CM l1 [57] 0.193 0.234 0.214
includes binary representation learning works and the third is CM l2 [57] 0.199 0.243 0.221
devoted to works based on deep learning. The table describes CM NC [57] 0.288 0.239 0.263
the cross-modal method, image and text feature extractors, the CM NCc [57] 0.287 0.239 0.263
dataset used, method type, evaluation metric, and references. SM l1 [57] 0.22 0.274 0.247
SM l2 [57] 0.205 0.276 0.241
Subspace
learning meth- SM NC [57] 0.301 0.276 0.289
The data-dependent hashing methods can be categorized into ods
S M NCc [57] 0.352 0.272 0.312
supervised, unsupervised, and semi-supervised as per utiliza-
SM KL [57] 0.206 0.274 0.24
tion of data supervision information. Supervised methods usu-
SCM l1 [57] 0.334 0.273 0.304
ally obtain better search accuracy than the other two methods
SCM l2 [57] 0.315 0.267 0.291
because of the utilization of semantic label information. Un-
SCM NC [57] 0.371 0.279 0.325
supervised methods are appropriate for small scale and data-
SCM NC c [57] 0.382 0.281 0.332
distributed retrieval, however, semi-supervised methods per-
SCM KL [57] 0.311 0.27 0.291
form better in case of less label information. Table (15) shows
MDCR [68] 0.287 0.225 0.256
the comparison of these three types of methods. Deep learn-
SCCMR [66] 0.431 0.403 0.417
ing plays a vital role in hash learning, feature extraction, and
MDSSL [71] 0.3517 0.2851 0.3184
retrieval performance in a hashing method. Usually, the deep
CMTC [80] 0.293 0.232 0.266
learning based hashing method performs better than the general Topic models
M3R [81] 0.2298 0.2677 0.2488
hashing method as it is data-dependent and its performance de-
CMOLRS [79] 0.454 0.414 0.434
pends upon a substantial increase in data scale. So, deep hash- Rank based
methods fast CMOLRS [79] 0.451 0.411 0.431
ing methods usually perform better in case of a colossal amount
of multi-modal data but with higher hardware cost. Besides,
the black box feature extraction attributes of deep learning may
lead to the exclusion of vital information from the original data. 8. Open issues
Moreover, the optimization process of deep learning requires
plenty of manual fine-tuning [107]. The refinement and potent The motive of cross-modal learning is to prepare a model to
of feature extraction part of the deep hashing method must be which one type of modality is inserted as a query to retrieve
considered in future works. Table (16) presents a comparison of the results in another modality. For this process, the collected
general and deep learning based hashing methods in the cross- data has to be arranged in a manner so that retrieval can happen
modal retrieval field. As the hash retrieval is a type of statisti- in less time as well as the results must be accurate and seman-
cal task, label information plays an important role in it without tically relate to the queried modality data. Researchers have
particular method consideration. In the case of un-annotated or proposed miscellaneous algorithms for making cross-modal re-
incomplete labeled data, impulsively following retrieval perfor- trieval task more effective, however, there are few open issues
mance under a supervised situation may lead to poor algorithm which still need to be considered in the future to make the re-
performance. So, consideration of algorithm performance in trieval process much better. These issues are discussed in a
case of diverse data labeling degrees is required in the future. table (17) which can act as future research directions. The table
30
good results in data representation [148, 149, 150]. Al-
Table 14: Performance comparison of prominent real-valued learning methods
on the basis of MAP score on NUS-WIDE dataset though, various authors have applied these techniques for
MAP score cross-modal system design, however, they are still in their
Method type Methods
I2T T2I Average inception stage and need to be explored more. Moreover,
LBP+DBN [84] 0.3373 0.4221 0.3797 researchers either use a soft-computing or algorithmic ap-
Deep semi- 0.556 0.422 0.489 proach. Both have their own limitations and strengths. So,
supervised frame- there is a need for a hybrid approach to link heterogeneous
work [87]
data of various modalities.
MHTN [93] 0.52 0.534 0.527
Deep learning 3. Lack of large scale multi-modal datasets: Researchers are
HRL H [96] 0.446 0.476 0.461
based methods designing various algorithms these days for cross-modal
HRL C [96] 0.603 0.599 0.601
retrieval and annotation. However, there is a lack of huge
DAML S [97] 0.531 0.539 0.535
datasets that contain data of various modalities to test and
DAML D [97] 0.512 0.534 0.523
validate the proposed algorithms. The algorithms have
Corr-AE [89] 0.319 0.375 0.347
been tested on extremely small datasets like the Wikipedia
Corr-Cross-AE [89] 0.349 0.348 0.349
dataset which consists of 2866 documents only. After sur-
Corr-Full-AE [89] 0.331 0.379 0.355
veying, it has been found that there is a lack of large multi-
JFSSL [65] 0.4035 0.3747 0.3891
modal datasets and especially in the medical field [151].
SCCMR [66] 0.434 0.386 0.41
Subspace
4. Confusion in choosing data feature extraction method:
DDL [69] 0.4498 0.498 0.4739
learning meth- Most of the approaches used by authors consists of inde-
ods MDSSL [71] 0.5218 0.4079 0.4649
pendent feature extraction from each modality to be used
ACMR [72] 0.544 0.538 0.541
in cross-modal system construction as an initial step. If
GSS-SL [67] 0.5364 0.404 0.4702
the initial step is inappropriate then it will affect the whole
Topic model M3R [81] 0.2445 0.3044 0.2742
cross-modal technique. As an example, the performance
Rank based CMOLRS [79] 0.415 0.34 0.378
methods
of a machine learning model extremely depends upon the
fast CMOLRS [79] 0.414 0.348 0.381
feature representation used for building the model. This
happens because various feature representations hide more
is categorized into two parts: (1) Algorithm level; and (2) Data or less diverse descriptive factors of variations behind the
level. Former discusses the issues related to the cross-modal al- data [152]. So, it is necessary to choose an appropriate
gorithm and later presents the open issues related to the multi- feature extraction method for each modality under consid-
modal data considered in the algorithm. The current situation eration depending upon the type of data, application, and
corresponding to the open issue is also described separately in cross-modal method.
the table. 5. Lack of scalable algorithms: A colossal amount of multi-
modal data is being generated and spread on the internet
1. Noisy and restricted annotations: A large amount of multi- nowadays due to the availability of fast networks, mobile
modal data is created by people on various websites such devices, and huge storage devices. So, productive cross-
as YouTube, Facebook, and Flickr to name a few. This modal methods are needed which can be applied in a dis-
data on the web is not properly organized and annotations tributed environment as well [23]. Moreover, further re-
related to it are also noisy and restricted. Annotations pro- search is required for designing efficient cross-modal algo-
vide the required semantic information to understand the rithms which can be applied on huge multi-modal datasets
particular modality data and labeling a huge data manu- [153].
ally is almost impossible. In [147], authors have used the 6. Need of novel and diverse fields’ datasets: It can be identi-
combination of noisy and cleanly annotated images for ro- fied from Section (4) and Table (7) that most of the datasets
bust image representations. One technique for combining are comprised from social media websites, so their content
noisy and clean data is to train a network with noisy data is very similar to each other. There is an immense need
and then fine-tune it using clean data. However, this tech- for diversity in the data content. Moreover, the datasets
nique is not suitable for the proper usage of clean data. The which have been utilized by most of the researchers and
proposed method represents the technique of using clean are very popular such as NUS-WIDE, Pascal VOC 2007,
annotations for a reduction in noise in a large dataset and and Wikipedia, have become too old now. Novel and di-
fine-tuning of the network with both clean and reduced verse multi-modal datasets are required to be introduced.
noise data. The method consists of a multi-task network 7. Requirement of semi-supervised cross-modal techniques:
that learns to clean noisy annotations together with effi- Supervised techniques perform better than unsupervised
cient classification of images. Extensive experimentation because of the utilization of semantic label information
is performed on the Open Images dataset to show the effi- [117]. However, most of the generated multi-media data
ciency of the proposed technique. is either unlabelled or has noisy annotations. Semi-
2. Need of a hybrid approach for designing cross-modal sys- supervised methods are getting highly popular now and are
tem: Soft computing techniques have been used exten- the future of cross-modal retrieval as they are the combina-
sively these days to solve real-life problems and they show tion of both supervised and unsupervised and also provide
31
Table 15: Comparison of hashing techniques in diverse supervision modes
Mode Label use Data process Hash learning Retrieval performance Performance in huge
data
Supervised Yes Complex Complex Good Good
Unsupervised No Simple Simple Fair Poor
Semi-supervised Partly Simple Complex Average Fair

Table 16: Comparison of general and deep learning based hashing methods
Hashing method Generality Modeling complexity Retrieval performance Parameter scale Hardware cost
General Poor Complex Fair Small Small
Deep learning based Good Simple Good Large Large

Table 17: Open issues in cross-modal retrieval

Type Open issue Current state
1. Appropriate adoption of diverse modality feature de- 1. Descriptors provided with benchmark datasets are
scriptors chosen mostly
2. Need of a hybrid of soft and hard computing ap- 2. Either soft or hard computing approach is utilized
proaches

Algorithm level 3. Need of a scalable algorithm 3. Algorithms have restrictions of data size, modalities
and application areas
4. Need of a reproducible cross-modal retrieval method 4. Most methods are applicable in a particular applica-
tion area
5. Cross-modal retrieval implementation in big data, 5. Rarely applied
cloud, and IoT environments
6. More utilization of semi-supervised cross-modal re- 6. Less used
trieval techniques
7. Lack of huge datasets incorporating diverse modali- 7. Most existing benchmark datasets are old and consist
ties of only image and text modality

Data level 8. Requirement of proper and exact labeling of images 8. Poor and noisy annotations
9. Diversity in data composition area 9. Datasets majorly composed from common social me-
dia websites

promising results [66].

32
Table 18: Summary of works done in image-text cross-modal retrieval

SR. TECHNIQUE IMAGE REP. TEXT REP. DATA TYPE METRIC REF.
Real-valued representation techniques
1 Linkage of each image Bag-of-Words CiCui system [154], US FEMA flood data, Face- - MAP [25]
feature with text feature TF-IDF and weird- book pages’ and groups’ data
ness [155] related to floods
2 Structural SVM, SR, BoW of dense SIFT features Probability distribu- UIUC Pascal Sentence dataset, - BLEU score, [58]
CSR (using CCA) tion IAPR TC-12 benchmark and rouge score
SBUCaptioned Photo dataset
3 Markov Random Field Image moments, gray level co- Bag-of-Keywords Thoracic CT scan data of nine Supervised Precision, [26]
(MRF) and Hidden occurrence matrix (GLCM) distinct concepts containing recall and their
Markov Model (HMM) moments, auto-correlation 842 ROIs (created) curve, ten-fold
coefficients (AC), edge fre- cross valida-
quency (EF), Gabor filter tion accuracy,
descriptor, Tamura descriptor, classification
color edge directional de- accuracy
scriptor (CEDD), fuzzy color
texture histogram (FCTH)
descriptor and combined
texture feature
4 Cluster sensitive cross- Wavelet feature, 3 level spa- TF-IDF and Latent Image Clef and Wikipedia Semi-supervised MAP [70]
modal correlation learn- cial max-pooling, GIST, dense Dirichlet alloca- [156] dataset
ing framework SIFT with sparse coding, tion(LDA)
PHOG and color histogram
5 AICDM Scalable color descriptor, - ESP, pascal VOC 2007, web - PR curve [100]
color layout descriptor, ho- image
mogeneous texture descriptor,
edge histogram, grid color
moment and gabor wavelet
moment
6 Probabilistic model of Blobs to represent image re- TF-IDF IAPR TC-12 and 500 - Precision, [76]
automatic image annota- gions Wikipedia web-pages dataset recall, F-
tion measure
7 Joint feature selection Gist, SIFT LDA Pascal, Wikipedia and NUS- - MAP, PR [65]
and subspace learning WIDE dataset curve
8 Local Group based Con- GIST, HoG word frequency fea- LabelMe, Wikipedia, Pascal Supervised MAP, PR [157]
sistent Feature Learning ture, latent Dirich- VOC2007, NUS-WIDE curve
(LGCFL) let allocation model
with 10 dimensions
9 KCCA based approach Gist, color histogram, BoVW word frequency, rel- Pascal VOC 2007, labelme, Unsupervised Normalized [64]
ative tag rank, abso- discounted
lute tag rank cumulative
gain
10 Structural SVM based SIFT, BoVW, LDA BoW, LDA IAPR TC-12 Benchmark, - BLEU, pre- [41]
unified framework UIUC Pascal Sentence, cision, recall,
SBU-Captioned Photo median rank,
MAP
11 Cross mOdal Similar- SIFT, GIST latent Dirichlet allo- Wikipedia , Pascal VOC2007, Supervised MAP [42]
ity Learning algorithm cation model NUSWIDE-1.5K, LabelMe
with Active Queries
(COSLAQ)
12 CM, SM, SCM SIFT LDA Wikipedia Unsupervised MAP, PR [56]
curve
13 Graph regularization and CNN LDA INRIA-websearch, Pascal sen- - MAP, PR [52]
modality dependence tence, Wikipedia 2010 curve
(GRMD)
14 CCA, KCCA SIFT, gist, PHOG, LBP, self TF-IDF WIKI-CMR - Precision [61]
similarity, spatial pyramid
method
15 Improved CCA - - NUS-WIDE, Pascal sentence, - MAP [60]
Wikipedia
16 Unsupervised KCCA Gist, HSV color histogram, Words frequency, Labelme, Pascal VOC Unsupervised Normalized [63]
based framework SIFT relative tag rank, Discounted
absolute tag rank Cumulative
Gain (NDCG)
17 MLRank Gist, color histogram, color - Corel 5k, NUS-WIDE, IAPR Semi-supervised Precision, re- [77]
suto-correlation, edge direc- TC12 call, F1 score,
tion histogram, wavelet tex- MAP, N+ (no.
ture, block-wise color mo- of keywords
ments with non-zero
recall value)

33
18 CM, SM, SCM SIFT LDA TVGraz, Wikipedia Supervised and MAP, PR [57]
unsupervised curve
19 CCA and its probabilistic RGB-SIFT Binary features MIRFlickr 1M - Precision [59]
interpretation
20 Regularizer of image se- SIFT LDA TVGraz, Wikipedia, pascal - MAP, PR [38]
mantics sentence dataset curve
21 Modality-dependent CNN visual features LDA Wikipedia, pascal sentence, Supervised MAP [68]
cross-media retrieval INRIA-websearch
(MDCR) model
22 Semantic consistency CNN, VGG LDA, BoW Wikipedia, pascal sentence, Semi-supervised MAP, PR [66]
cross-modal retrieval NUS-WIDE-10k, INRIA- curve
(SCCMR) websearch
Binary-valued or cross-modal hashing techniques
23 Unsupervised Concate- Gist word frequency Pascal, UCI handwritten digit, Unsupervised MAP [120]
nation Hashing (UCH) count Wikipedia
24 Cross-modal self-taught SIFT, HoG, GIST TF-IDF Wikipedia, NUS-WIDE Unsupervised MAP [106]
hashing (CMSTH)
25 Linear cross-modal hash- SIFT LDA NUS-WIDE, Wikipedia - MAP, recall [110]
ing
26 Latent semantic sparse Sparse coding Matrix factorization Labelme, NUS-WIDE, - MAP, PR [128]
hashing Wikipedia curve
27 Quantized correlation SIFT, BoW LDA, tag vector 58W-CIFAR, NUS-WIDE, Supervised MAP, preci- [111]
hashing Wikipedia sion
28 Discrete Latent Semantic SIFT, gist, edge histogram Topic vectors, index Labelme, MIRFlickr 25k, Supervised MAP, PR [114]
Hashing vector of selected NUS-WIDE, Wikipedia curve
tags, feature vector
derived from PCA,
binary tagging vec-
tor
29 Subspace Relation SIFT, gist LDA, tag occur- ImageNet, Labelme, MIR- Supervised MAP, preci- [109]
Learning for Cross- rence feature vector Flickr 25k, NUS-WIDE, sion
modal Hashing UCI handwritten digit data,
Wikipedia
30 Deep Visual Semantic Deep f c7 features BoW vector IAPR TC-12, MS COCO - MAP, preci- [115]
Hashing model sion
31 Deep cross-modal hash- Gist, bag-of-visual-words BoW vector IAPR TC-12, MIRFlickr 25k, - MAP, PR [116]
ing (BOVW) NUS-WIDE curve
32 Triplet-based deep hash- SIFT BoW MIRFlickr 25k, NUS-WIDE Supervised MAP, PR [112]
ing network curve
33 Attention-Aware Deep - BoW IAPR TC-12, NUS-WIDE, - MAP, PR [121]
Adversarial Hashing MIRFlickr 25k curve
34 Supervised matrix factor- SIFT Topic vector, BoW NUS-WIDE, Wikipedia Supervised MAP, pre- [118]
ization hashing cision, PR
curve
35 Semantic deep cross- - BoW IAPR TC-12, MIRFlickr, Supervised MAP, preci- [113]
modal hashing NUS-wIDE sion curve, PR
curve
36 Zero-shot hashing BoVW, SIFT, gist LDA, BoW MIRFlickr, NUS-WIDE, Semi-supervised MAP [119]
Wikipedia
37 Deep multi-level seman- - BoW MIRFlickr 25k Supervised MAP, PR [129]
tic hashing curve
38 Cycle-Consistent Deep CNN LDA Microsoft COCO, IAPR TC- - MAP, preci- [158]
Generative Hashing 12, wiki sion curve, PR
(CYC-DGH) curve
39 Multi-modal graph reg- SIFT, BoW, edge histogram LDA, tag vector fea- MIRFlickr, NUS-WIDE, Unsupervised MAP, PR [108]
ularized smooth matrix ture vectors Wikipedia curve
factorization hashing
40 Multi-view feature dis- SIFT, histogram feature, Word vector, mean MIRFlickr, MMED, NUS- Supervised MAP, PR [117]
crete hashing BoVW vector, covariance WIDE, Wikipedia curve
matrix, feature
histogram
Cross-modal methods based on deep learning
41 Multi-modal Deep Belief Image specific DBN which Text specific DBN MIR Flickr Data Unsupervised MAP [40]
Network (DBN) used Gaussian Restricted which used Repli-
Boltzmann Machines (RBM) cated Softmax
model
42 Levinberg-Marquardt Deep neural network Deep neural net- Wikipedia Articles data - Precision [30]
deep canonical correla- work recovery curve
tion analysis (LMDCCA)

34
43 Cross-media multiple GIST, Pyramid Histogram of BoW NUSWIDE-10k, Wikipedia, - MAP, PR [159]
deep network Words (PHoW), MPEG-7, Pascal Sentences curve
SIFT, color correlogram, color
histogram, wavelet texture,
edge direction histogram,
block-wise color moments
44 Deep canonical correla- color histogram, color cor- Bag-of-Words Wikipedia, pascal, NUS- Supervised MAP [160]
tion analysis(DCCA) relogram, edge direction (BoW) WIDE10k
histogram, wavelet texture,
block-wise color moments,
SIFT, GIST, MPEG-7
45 Deep coupled met- SIFT, BoVW, GIST, color his- Latent Dirichlet Wikipedia, Pascal VOC 2007, - Precision, [161]
ric learning (DCML) togram allocation (LDA) NUS-WIDE MAP, ROC
method model and CMC
curve
46 Deep semi-supervised CNN, GIST, SIFT 100-d, 399-d and Wikipedia, pascal VOC, NUS- Semi-supervised MAP [87]
framework 1000-d word freq WIDE
vectors
47 Correspondence autoen- Pyramid Histogram of Words Bag-of-Words Wikipedia, Pascal, NUS- - MAP [86]
coder (PHOW), MPEG-7 descrip- WIDE
tors and Gist
48 Multitask learning ap- 4096-dimensional vector ex- 1386/2000- MIRFLICKR-25K, MS - MAP [162]
proach with 3 modules: tracted by the fc10 layer of dimensional bag-of- COCO
Correlation Network, VGGNet word vectors
Cross-modal Autoen-
coder, Latent Semantic
Hashing
49 Deep Adversarial Met- SIFT, VGG LDA, BoW Wikipedia, pascal, NUS- Supervised MAP [97]
ric Learning approach WIDE
(DAML)
50 Deep Pairwise Ranking CNN, GIST, SIFT 100-d, 399-d and Wikipedia, pascal, NUS- Supervised MAP [163]
model with multi- 1000-d word freq WIDE
label information for vectors
Cross-Modal retrieval
(DPRCM)
51 Deep Belief Network LBP - NUS-WIDE, Wikipedia - MAP, percent- [84]
age, MRR
52 Log-Bilinear Model - - IAPR TC-12, attribute discov- - Bleu, per- [88]
ery, SBU captioned photos plexity and
retrieval evalu-
ation

35
9. Conclusion [8] B. Kitchenham, S. Charters, Guidelines for performing systematic liter-
ature reviews in software engineering (2007).
From the review on cross-modal information retrieval, it has [9] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey,
S. Linkman, Systematic literature reviews in software engineering–a
been found that cross-modal retrieval techniques are better than systematic literature review, Information and software technology 51 (1)
classic uni-modal systems in retrieving the multi-modal data (2009) 7–15 (2009).
and adding values to complement meaningful information. The [10] B. E. Stein, T. R. Stanford, B. A. Rowland, Development of multisen-
sory integration from the perspective of the individual neuron, Nature
article summarizes the prominent works done by various re-
Reviews Neuroscience 15 (8) (2014) 520–535 (2014).
searchers in the field of image-text cross-modal retrieval. Pri- [11] R. L. Miller, B. A. Rowland, Multisensory integration: How the brain
mary information has been presented with the help of tables, combines information across the senses, Computational Models of Brain
figures, and graphs to make it more understandable. A tax- and Behavior (2017) 215–228 (2017).
[12] R. K. Srihari, Use of captions and other collateral text in understanding
onomy of cross-modal retrieval techniques has been demon- photographs, in: Integration of Natural Language and Vision Processing,
strated. Information regarding famous image-text multi-modal Springer, 1995, pp. 245–266 (1995).
datasets has been presented. Comparison among various cross- [13] B. E. Stein, M. A. Meredith, The merging of the senses., The MIT Press,
modal techniques when applied on a particular dataset is shown. 1993 (1993).
[14] B. E. Stein, M. A. Meredith, W. S. Huneycutt, L. McDade, Behavioral
Miscellaneous applications in the field of cross-modal retrieval indices of multisensory integration: orientation to visual cues is affected
are mentioned and general architecture is shown. Challenges by auditory stimuli, Journal of Cognitive Neuroscience 1 (1) (1989) 12–
and open issues have also been discussed to help the research 24 (1989).
community in further research. Although significant work has [15] M. Otoom, Beyond von neumann: Brain-computer structural metaphor,
in: 2016 Third International Conference on Electrical, Electronics,
been proposed in this field, still we are far away from achieving Computer Engineering and their Applications (EECEA), IEEE, 2016,
an ideal position and accuracy in the process. This approach pp. 46–51 (2016).
has still not been accepted and applied worldwide in most of [16] B. P. Yuhas, M. H. Goldstein, T. J. Sejnowski, Integration of acoustic
the real-life applications. Moreover, ample work is required to and visual speech signals using neural networks, IEEE Communications
Magazine 27 (11) (1989) 65–71 (1989).
be done in this field to introduce novel better algorithms or to [17] C. Saraceno, R. Leonardi, Indexing audiovisual databases through joint
enhance the retrieval efficiency of the classic algorithms. It is audio and video processing, International Journal of Imaging Systems
expected that this article will be useful for the readers and re- and Technology 9 (5) (1998) 320–331 (1998).
[18] D. Roy, Integration of speech and vision using mutual information, in:
searchers to better understand the present situation and state-of-
2000 IEEE International Conference on Acoustics, Speech, and Signal
the-art cross-modal retrieval methods and motivate researches Processing. Proceedings (Cat. No. 00CH37100), Vol. 4, IEEE, 2000, pp.
in the field. 2369–2372 (2000).
[19] H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature
264 (5588) (1976) 746–748 (1976).
Declaration of Competing Interest [20] T. Westerveld, D. Hiemstra, F. De Jong, Extracting bimodal representa-
tions for language-based image retrieval, in: Multimedia’99, Springer,
2000, pp. 33–42 (2000).
The authors declare that they have no known competing fi- [21] T. Westerveld, Image retrieval: Content versus context., in: RIAO, Cite-
nancial interests or personal relationships that could have ap- seer, 2000, pp. 276–284 (2000).
peared to influence the work reported in this paper. [22] C. Xiong, D. Zhang, T. Liu, X. Du, Voice-face cross-modal matching
and retrieval: A benchmark, arXiv preprint arXiv:1911.09338 (2019).
[23] A. C. Duarte, Cross-modal neural sign language translation, in: Proceed-
ings of the 27th ACM International Conference on Multimedia, ACM,
References 2019, pp. 1650–1654 (2019).
[24] S. Mariooryad, C. Busso, Exploring cross-modality affective reactions
[1] K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey for audiovisual emotion recognition, IEEE Transactions on affective
on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016). computing 4 (2) (2013) 183–196 (2013).
[2] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: [25] M. Jing, B. W. Scotney, S. A. Coleman, M. T. McGinnity, X. Zhang,
A survey and taxonomy, IEEE Transactions on Pattern Analysis and Ma- S. Kelly, K. Ahmad, A. Schlaf, S. Gründer-Fahrer, G. Heyer, Integration
chine Intelligence 41 (2) (2018) 423–443 (2018). of text and image analysis for flood event image recognition, in: 2016
[3] M. Ayyavaraiah, B. Venkateswarlu, Cross media feature retrieval and 27th Irish Signals and Systems Conference (ISSC), IEEE, 2016, pp. 1–6
optimization: A contemporary review of research scope, challenges and (2016).
objectives, in: International Conference On Computational Vision and [26] M. M. Rahman, D. You, M. S. Simpson, S. K. Antani, D. Demner-
Bio Inspired Computing, Springer, 2019, pp. 1125–1136 (2019). Fushman, G. R. Thoma, Interactive cross and multimodal biomedical
[4] Y. Peng, X. Huang, Y. Zhao, An overview of cross-media retrieval: Con- image retrieval based on automatic region-of-interest (roi) identification
cepts, methodologies, benchmarks, and challenges, IEEE Transactions and classification, International Journal of Multimedia Information Re-
on circuits and systems for video technology 28 (9) (2017) 2372–2385 trieval 3 (3) (2014) 131–146 (2014).
(2017). [27] Z. Liu, H. Liu, W. Huang, B. Wang, F. Sun, Audiovisual cross-modal
[5] M. Ayyavaraiah, B. Venkateswarlu, Joint graph regularization based se- material surface retrieval, Neural Computing and Applications (2019)
mantic analysis for cross-media retrieval: a systematic review, Inter- 1–9 (2019).
national Journal of Engineering & Technology 7 (2.7) (2018) 257–261 [28] D. Cao, Z. Yu, H. Zhang, J. Fang, L. Nie, Q. Tian, Video-based cross-
(2018). modal recipe retrieval, in: Proceedings of the 27th ACM International
[6] Y.-x. Peng, W.-w. Zhu, Y. Zhao, C.-s. Xu, Q.-m. Huang, H.-q. Lu, Q.- Conference on Multimedia, ACM, 2019, pp. 1685–1693 (2019).
h. Zheng, T.-j. Huang, W. Gao, Cross-media analysis and reasoning: [29] M. Lazaridis, A. Axenopoulos, D. Rafailidis, P. Daras, Multimedia
advances and directions, Frontiers of Information Technology & Elec- search and retrieval using multimodal annotation propagation and index-
tronic Engineering 18 (1) (2017) 44–57 (2017). ing techniques, Signal Processing: Image Communication 28 (4) (2013)
[7] M. Priyanka, B. S. Devi, S. Riyazoddin, M. J. Reddy, Analysis of cross- 351–367 (2013).
media web information fusion for text and image association-a survey [30] D. Xia, L. Miao, A. Fan, A cross-modal multimedia retrieval method us-
paper, Global Journal of Computer Science and Technology (2013).

36
ing depth correlation mining in big data environment, Multimedia Tools trieval based on graph regularization, Mobile Information Systems 2020
and Applications (2019) 1–16 (2019). (2020).
[31] X. Zhai, Y. Peng, J. Xiao, Heterogeneous metric learning with joint [53] H. Hotelling, Relations between two sets of variates, in: Breakthroughs
graph regularization for cross-media retrieval, in: Twenty-seventh AAAI in statistics, Springer, 1992, pp. 162–190 (1992).
conference on artificial intelligence, 2013 (2013). [54] C. Guo, D. Wu, Canonical correlation analysis (cca) based multi-view
[32] B. Elizalde, S. Zarar, B. Raj, Cross modal audio search and retrieval with learning: An overview, arXiv preprint arXiv:1907.01693 (2019).
joint embeddings based on text and audio, in: ICASSP 2019-2019 IEEE [55] D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation
International Conference on Acoustics, Speech and Signal Processing analysis: An overview with application to learning methods, Neural
(ICASSP), IEEE, 2019, pp. 4095–4099 (2019). computation 16 (12) (2004) 2639–2664 (2004).
[33] Y. Yu, S. Tang, F. Raposo, L. Chen, Deep cross-modal correlation learn- [56] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet,
ing for audio and lyrics in music retrieval, ACM Transactions on Multi- R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia
media Computing, Communications, and Applications (TOMM) 15 (1) retrieval, in: Proceedings of the 18th ACM international conference on
(2019) 20 (2019). Multimedia, 2010, pp. 251–260 (2010).
[34] D. Zeng, Y. Yu, K. Oyama, Deep triplet neural networks with cluster-cca [57] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet,
for audio-visual cross-modal retrieval, arXiv preprint arXiv:1908.03737 R. Levy, N. Vasconcelos, On the role of correlation and abstraction in
(2019). cross-modal multimedia retrieval, IEEE transactions on pattern analysis
[35] P. Tripathi, P. P. Watwani, S. Thakur, A. Shaw, S. Sengupta, Discover and machine intelligence 36 (3) (2013) 521–535 (2013).
cross-modal human behavior analysis, in: 2018 Second International [58] Y. Verma, C. Jawahar, Im2text and text2im: Associating images and
Conference on Electronics, Communication and Aerospace Technology texts for cross-modal retrieval., in: BMVC, Vol. 1, Citeseer, 2014, p. 2
(ICECA), IEEE, 2018, pp. 1818–1824 (2018). (2014).
[36] J. Imura, T. Fujisawa, T. Harada, Y. Kuniyoshi, Efficient multi-modal [59] M. Katsurai, T. Ogawa, M. Haseyama, A cross-modal approach for ex-
retrieval in conceptual space, in: Proceedings of the 19th ACM interna- tracting semantic relationships between concepts using tagged images,
tional conference on Multimedia, ACM, 2011, pp. 1085–1088 (2011). IEEE transactions on multimedia 16 (4) (2014) 1059–1074 (2014).
[37] P. Goyal, S. Sahu, S. Ghosh, C. Lee, Cross-modal learning for multi- [60] J. Shao, Z. Zhao, F. Su, T. Yue, Towards improving canonical correlation
modal video categorization, arXiv preprint arXiv:2003.03501 (2020). analysis for cross-modal retrieval, in: Proceedings of the on Thematic
[38] J. C. Pereira, N. Vasconcelos, Cross-modal domain adaptation for text- Workshops of ACM Multimedia 2017, 2017, pp. 332–339 (2017).
based regularization of image semantics in image retrieval systems, [61] W. Xiong, S. Wang, C. Zhang, Q. Huang, Wiki-cmr: A web cross modal-
Computer Vision and Image Understanding 124 (2014) 123–135 (2014). ity dataset for studying and evaluation of cross modality retrieval mod-
[39] T. Gou, L. Liu, Q. Liu, Z. Deng, A new approach to cross-modal re- els, in: 2013 IEEE International Conference on Multimedia and Expo
trieval, in: Journal of Physics: Conference Series, Vol. 1288, IOP Pub- (ICME), IEEE, 2013, pp. 1–6 (2013).
lishing, 2019, p. 012044 (2019). [62] V. Ranjan, N. Rasiwasia, C. Jawahar, Multi-label cross-modal retrieval,
[40] N. Srivastava, R. Salakhutdinov, Learning representations for multi- in: Proceedings of the IEEE International Conference on Computer Vi-
modal data with deep belief nets, in: International conference on ma- sion, 2015, pp. 4094–4102 (2015).
chine learning workshop, Vol. 79, 2012 (2012). [63] S. J. Hwang, K. Grauman, Accounting for the relative importance of
[41] Y. Verma, C. Jawahar, A support vector approach for cross-modal search objects in image retrieval., in: BMVC, Vol. 1, 2010, p. 5 (2010).
of images and texts, Computer Vision and Image Understanding 154 [64] S. Hwang, K. Grauman, Learning the relative importance of objects from
(2017) 48–63 (2017). tagged images for retrieval and cross-modal search, International journal
[42] N. Gao, S.-J. Huang, Y. Yan, S. Chen, Cross modal similarity learning of computer vision 100 (2) (2012) 134–153 (2012).
with active queries, Pattern Recognition 75 (2018) 214–222 (2018). [65] K. Wang, R. He, L. Wang, W. Wang, T. Tan, Joint feature selection
[43] A. Habibian, T. Mensink, C. G. Snoek, Discovering semantic vocabular- and subspace learning for cross-modal retrieval, IEEE transactions on
ies for cross-media retrieval, in: Proceedings of the 5th ACM on Inter- pattern analysis and machine intelligence 38 (10) (2015) 2010–2023
national Conference on Multimedia Retrieval, ACM, 2015, pp. 131–138 (2015).
(2015). [66] G. Xu, X. Li, Z. Zhang, Semantic consistency cross-modal retrieval with
[44] N. Van Nguyen, M. Coustaty, J.-M. Ogier, Multi-modal and cross-modal semi-supervised graph regularization, IEEE Access 8 (2020) 14278–
for lecture videos retrieval, in: 2014 22nd International Conference on 14288 (2020).
Pattern Recognition, IEEE, 2014, pp. 2667–2672 (2014). [67] L. Zhang, B. Ma, G. Li, Q. Huang, Q. Tian, Generalized semi-supervised
[45] T. Nakano, A. Kimura, H. Kameoka, S. Miyabe, S. Sagayama, N. Ono, and structured subspace learning for cross-modal retrieval, IEEE Trans-
K. Kashino, T. Nishimoto, Automatic video annotation via hierarchical actions on Multimedia 20 (1) (2017) 128–141 (2017).
topic trajectory model considering cross-modal correlations, in: 2011 [68] Y. Wei, Y. Zhao, Z. Zhu, S. Wei, Y. Xiao, J. Feng, S. Yan, Modality-
IEEE International Conference on Acoustics, Speech and Signal Pro- dependent cross-media retrieval, ACM Transactions on Intelligent Sys-
cessing (ICASSP), IEEE, 2011, pp. 2380–2383 (2011). tems and Technology (TIST) 7 (4) (2016) 1–13 (2016).
[46] B. Jiang, X. Huang, C. Yang, J. Yuan, Cross-modal video moment re- [69] C. Deng, X. Tang, J. Yan, W. Liu, X. Gao, Discriminative dictionary
trieval with spatial and language-temporal attention, in: Proceedings of learning with common label alignment for cross-modal retrieval, IEEE
the 2019 on International Conference on Multimedia Retrieval, ACM, Transactions on Multimedia 18 (2) (2015) 208–218 (2015).
2019, pp. 217–225 (2019). [70] S. Wang, F. Zhuang, S. Jiang, Q. Huang, Q. Tian, Cluster-sensitive struc-
[47] X. Xu, L. He, A. Shimada, R.-i. Taniguchi, H. Lu, Learning unified tured correlation analysis for web cross-modal retrieval, Neurocomput-
binary codes for cross-modal retrieval via latent semantic hashing, Neu- ing 168 (2015) 747–760 (2015).
rocomputing 213 (2016) 191–203 (2016). [71] L. Zhang, B. Ma, G. Li, Q. Huang, Q. Tian, Cross-modal retrieval using
[48] K. Ahmad, Slandail: A security system for language and image analysis- multiordered discriminative structured subspace learning, IEEE Trans-
project no: 607691, Available at SSRN 3060047 (2017). actions on Multimedia 19 (6) (2016) 1220–1233 (2016).
[49] A. Hanbury, A survey of methods for image annotation, Journal of Vi- [72] B. Wang, Y. Yang, X. Xu, A. Hanjalic, H. T. Shen, Adversarial cross-
sual Languages & Computing 19 (5) (2008) 617–627 (2008). modal retrieval, in: Proceedings of the 25th ACM international confer-
[50] B. Rafkind, M. Lee, S.-F. Chang, H. Yu, Exploring text and image ence on Multimedia, 2017, pp. 154–162 (2017).
features to classify images in bioscience literature, in: Proceedings of [73] G. Cao, A. Iosifidis, K. Chen, M. Gabbouj, Generalized multi-view em-
the HLT-NAACL BioNLP Workshop on Linking Natural Language and bedding for visual recognition and cross-modal retrieval, IEEE transac-
Biology, Association for Computational Linguistics, 2006, pp. 73–80 tions on cybernetics 48 (9) (2017) 2542–2555 (2017).
(2006). [74] Y. Wu, S. Wang, G. Song, Q. Huang, Augmented adversarial training for
[51] G. Wang, D. Hoiem, D. Forsyth, Building text features for object image cross-modal retrieval, IEEE Transactions on Multimedia (2020).
classification, in: 2009 IEEE conference on computer vision and pattern [75] J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and re-
recognition, IEEE, 2009, pp. 1367–1374 (2009). trieval using cross-media relevance models, in: Proceedings of the 26th
[52] G. Wang, H. Ji, D. Kong, N. Zhang, Modality-dependent cross-modal re- annual international ACM SIGIR conference on Research and develop-

37
ment in informaion retrieval, 2003, pp. 119–126 (2003). transactions on cybernetics (2019).
[76] Y. Xia, Y. Wu, J. Feng, Cross-media retrieval using probabilistic model [99] Z. Yang, Z. Lin, P. Kang, J. Lv, Q. Li, W. Liu, Learning shared semantic
of automatic image annotation, International Journal of Signal Process- space with correlation alignment for cross-modal event retrieval, ACM
ing, Image Processing and Pattern Recognition 8 (4) (2015) 145–154 Transactions on Multimedia Computing, Communications, and Appli-
(2015). cations (TOMM) 16 (1) (2020) 1–22 (2020).
[77] Z. Li, J. Liu, C. Xu, H. Lu, Mlrank: Multi-correlation learning to rank [100] J.-H. Su, C.-L. Chou, C.-Y. Lin, V. S. Tseng, Effective semantic an-
for image annotation, Pattern Recognition 46 (10) (2013) 2700–2710 notation by image-to-concept distribution model, IEEE Transactions on
(2013). Multimedia 13 (3) (2011) 530–538 (2011).
[78] Q. Xu, M. Li, M. Yu, Learning to rank with relational graph and point- [101] L. Chi, X. Zhu, Hashing techniques: A survey and taxonomy, ACM
wise constraint for cross-modal retrieval, Soft Computing 23 (19) (2019) Computing Surveys (CSUR) 50 (1) (2017) 1–36 (2017).
9413–9427 (2019). [102] H. P. Luhn, A new method of recording and searching information,
[79] Y. Wu, S. Wang, Q. Huang, Online fast adaptive low-rank similarity American Documentation 4 (1) (1953) 14–16 (1953).
learning for cross-modal retrieval, IEEE Transactions on Multimedia [103] H. Stevens, Hans peter luhn and the birth of the hashing algorithm, IEEE
(2019). Spectrum 55 (2) (2018) 44–49 (2018).
[80] J. Yu, Y. Cong, Z. Qin, T. Wan, Cross-modal topic correlations for mul- [104] W. W. Peterson, Addressing for random-access storage, IBM journal of
timedia retrieval, in: Proceedings of the 21st International Conference Research and Development 1 (2) (1957) 130–146 (1957).
on Pattern Recognition (ICPR2012), IEEE, 2012, pp. 246–249 (2012). [105] R. Morris, Scatter storage techniques, Communications of the ACM
[81] Y. Wang, F. Wu, J. Song, X. Li, Y. Zhuang, Multi-modal mutual topic 11 (1) (1968) 38–44 (1968).
reinforce modeling for cross-media retrieval, in: Proceedings of the [106] L. Xie, L. Zhu, P. Pan, Y. Lu, Cross-modal self-taught hashing for large-
22nd ACM international conference on Multimedia, 2014, pp. 307–316 scale image retrieval, Signal Processing 124 (2016) 81–92 (2016).
(2014). [107] W. Cao, W. Feng, Q. Lin, G. Cao, Z. He, A review of hashing methods
[82] Z. Qin, J. Yu, Y. Cong, T. Wan, Topic correlation model for cross- for multimodal retrieval, IEEE Access 8 (2020) 15377–15391 (2020).
modal multimedia information retrieval, Pattern Analysis and Applica- [108] Y. Fang, H. Zhang, Y. Ren, Unsupervised cross-modal retrieval via
tions 19 (4) (2016) 1007–1022 (2016). multi-modal graph regularized smooth matrix factorization hashing,
[83] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (7553) Knowledge-Based Systems 171 (2019) 69–80 (2019).
(2015) 436–444 (2015). [109] H. T. Shen, L. Liu, Y. Yang, X. Xu, Z. Huang, F. Shen, R. Hong, Exploit-
[84] B. Jiang, J. Yang, Z. Lv, K. Tian, Q. Meng, Y. Yan, Internet cross-media ing subspace relation in semantic labels for cross-modal hashing, IEEE
retrieval based on deep learning, Journal of Visual Communication and Transactions on Knowledge and Data Engineering (2020).
Image Representation 48 (2017) 356–366 (2017). [110] X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for
[85] P. Hu, L. Zhen, D. Peng, P. Liu, Scalable deep multimodal learning for efficient multimedia search, in: Proceedings of the 21st ACM interna-
cross-modal retrieval, in: Proceedings of the 42nd International ACM tional conference on Multimedia, 2013, pp. 143–152 (2013).
SIGIR Conference on Research and Development in Information Re- [111] B. Wu, Q. Yang, W.-S. Zheng, Y. Wang, J. Wang, Quantized correla-
trieval, 2019, pp. 635–644 (2019). tion hashing for fast cross-modal search, in: Twenty-Fourth International
[86] F. Feng, X. Wang, R. Li, I. Ahmad, Correspondence autoencoders for Joint Conference on Artificial Intelligence, 2015 (2015).
cross-modal retrieval, ACM Transactions on Multimedia Computing, [112] C. Deng, Z. Chen, X. Liu, X. Gao, D. Tao, Triplet-based deep hashing
Communications, and Applications (TOMM) 12 (1s) (2015) 26 (2015). network for cross-modal retrieval, IEEE Transactions on Image Process-
[87] D. Mandal, P. Rao, S. Biswas, Semi-supervised cross-modal retrieval ing 27 (8) (2018) 3893–3903 (2018).
with label prediction, IEEE Transactions on Multimedia (2019). [113] C. Yan, X. Bai, S. Wang, J. Zhou, E. R. Hancock, Cross-modal hash-
[88] R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal neural language mod- ing with semantic deep embedding, Neurocomputing 337 (2019) 58–66
els, in: International conference on machine learning, 2014, pp. 595–603 (2019).
(2014). [114] X. Lu, L. Zhu, Z. Cheng, X. Song, H. Zhang, Efficient discrete latent
[89] F. Feng, X. Wang, R. Li, Cross-modal retrieval with correspondence semantic hashing for scalable cross-modal retrieval, Signal Processing
autoencoder, in: Proceedings of the 22nd ACM international conference 154 (2019) 217–231 (2019).
on Multimedia, 2014, pp. 7–16 (2014). [115] Y. Cao, M. Long, J. Wang, Q. Yang, P. S. Yu, Deep visual-semantic
[90] F. Feng, R. Li, X. Wang, Deep correspondence restricted boltzmann hashing for cross-modal retrieval, in: Proceedings of the 22nd ACM
machine for cross-modal retrieval, Neurocomputing 154 (2015) 50–60 SIGKDD International Conference on Knowledge Discovery and Data
(2015). Mining, 2016, pp. 1445–1454 (2016).
[91] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, S. Yan, Cross-modal [116] Q.-Y. Jiang, W.-J. Li, Deep cross-modal hashing, in: Proceedings of the
retrieval with cnn visual features: A new baseline, IEEE transactions on IEEE conference on computer vision and pattern recognition, 2017, pp.
cybernetics 47 (2) (2016) 449–460 (2016). 3232–3240 (2017).
[92] Y. He, S. Xiang, C. Kang, J. Wang, C. Pan, Cross-modal retrieval via [117] J. Yu, X.-J. Wu, J. Kittler, Learning discriminative hashing codes for
deep and bidirectional representation learning, IEEE Transactions on cross-modal retrieval based on multi-view features, Pattern Analysis and
Multimedia 18 (7) (2016) 1363–1377 (2016). Applications (2020) 1–18 (2020).
[93] X. Huang, Y. Peng, M. Yuan, Mhtn: Modal-adversarial hybrid trans- [118] J. Tang, K. Wang, L. Shao, Supervised matrix factorization hashing for
fer network for cross-modal retrieval, IEEE transactions on cybernetics cross-modal retrieval, IEEE Transactions on Image Processing 25 (7)
(2018). (2016) 3157–3166 (2016).
[94] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, M. Cord, [119] X. Liu, Z. Li, J. Wang, G. Yu, C. Domeniconi, X. Zhang, Cross-modal
Cross-modal retrieval in the cooking context: Learning semantic text- zero-shot hashing, arXiv preprint arXiv:1908.07388 (2019).
image embeddings, in: The 41st International ACM SIGIR Conference [120] J. Yu, X.-J. Wu, Unsupervised concatenation hashing with sparse
on Research & Development in Information Retrieval, 2018, pp. 35–44 constraint for cross-modal retrieval, arXiv preprint arXiv:1904.00726
(2018). (2019).
[95] J. Gu, J. Cai, S. R. Joty, L. Niu, G. Wang, Look, imagine and match: [121] X. Zhang, H. Lai, J. Feng, Attention-aware deep adversarial hashing for
Improving textual-visual cross-modal retrieval with generative models, cross-modal retrieval, in: Proceedings of the European Conference on
in: Proceedings of the IEEE Conference on Computer Vision and Pattern Computer Vision (ECCV), 2018, pp. 591–606 (2018).
Recognition, 2018, pp. 7181–7189 (2018). [122] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization:
[96] W. Cao, Q. Lin, Z. He, Z. He, Hybrid representation learning for cross- A procrustean approach to learning binary codes for large-scale image
modal retrieval, Neurocomputing 345 (2019) 45–57 (2019). retrieval, IEEE transactions on pattern analysis and machine intelligence
[97] X. Xu, L. He, H. Lu, L. Gao, Y. Ji, Deep adversarial metric learning for 35 (12) (2012) 2916–2929 (2012).
cross-modal retrieval, World Wide Web 22 (2) (2019) 657–672 (2019). [123] S. Kumar, R. Udupa, Learning hash functions for cross-view similarity
[98] X. Xu, H. Lu, J. Song, Y. Yang, H. T. Shen, X. Li, Ternary adversarial search, in: Twenty-Second International Joint Conference on Artificial
networks with self-supervision for zero-shot cross-modal retrieval, IEEE Intelligence, 2011 (2011).

38
[124] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in neu- [145] Y. Wang, X. Lin, L. Wu, W. Zhang, Q. Zhang, Lbmch: Learning bridg-
ral information processing systems, 2009, pp. 1753–1760 (2009). ing mapping for cross-modal hashing, in: Proceedings of the 38th inter-
[125] J. Song, Y. Yang, Y. Yang, Z. Huang, H. T. Shen, Inter-media hashing for national ACM SIGIR conference on research and development in infor-
large-scale retrieval from heterogeneous data sources, in: Proceedings of mation retrieval, 2015, pp. 999–1002 (2015).
the 2013 ACM SIGMOD International Conference on Management of [146] G. Ding, Y. Guo, J. Zhou, Y. Gao, Large-scale cross-modality search
Data, 2013, pp. 785–796 (2013). via collective matrix factorization hashing, IEEE Transactions on Image
[126] H. Liu, R. Ji, Y. Wu, F. Huang, B. Zhang, Cross-modality binary code Processing 25 (11) (2016) 5427–5440 (2016).
learning via fusion similarity hashing, in: Proceedings of the IEEE Con- [147] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, S. Belongie,
ference on Computer Vision and Pattern Recognition, 2017, pp. 7380– Learning from noisy large-scale datasets with minimal supervision, in:
7388 (2017). Proceedings of the IEEE Conference on Computer Vision and Pattern
[127] X. Shen, F. Shen, Q.-S. Sun, Y.-H. Yuan, H. T. Shen, Robust cross-view Recognition, 2017, pp. 839–847 (2017).
hashing for multimedia retrieval, IEEE Signal Processing Letters 23 (6) [148] C. Tian, V. De Silva, M. Caine, S. Swanson, Use of machine learning
(2016) 893–897 (2016). to automate the identification of basketball strategies using whole team
[128] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross- player tracking data, Applied Sciences 10 (1) (2020) 24 (2020).
modal similarity search, in: Proceedings of the 37th international ACM [149] D. J. Armaghani, G. D. Hatzigeorgiou, C. Karamani, A. Skentou,
SIGIR conference on Research & development in information retrieval, I. Zoumpoulaki, P. G. Asteris, Soft computing-based techniques for con-
2014, pp. 415–424 (2014). crete beams shear strength, Procedia Structural Integrity 17 (2019) 924–
[129] Z. Ji, W. Yao, W. Wei, H. Song, H. Pi, Deep multi-level semantic hashing 933 (2019).
for cross-modal retrieval, IEEE Access 7 (2019) 23667–23674 (2019). [150] C. Raghuraman, S. Suresh, S. Shivshankar, R. Chapaneri, Static and
[130] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng, Nus-wide: dynamic malware analysis using machine learning, in: First Interna-
A real-world web image database from national university of singapore, tional Conference on Sustainable Technologies for Computational In-
in: Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), San- telligence, Springer, 2020, pp. 793–806 (2020).
torini, Greece., July 8-10, 2009 (July 8-10, 2009). [151] H. Müller, D. Unay, Retrieval from and understanding of large-scale
[131] M. Grubinger, P. Clough, H. Müller, T. Deselaers, The iapr tc-12 bench- multi-modal medical datasets: A review, IEEE transactions on multime-
mark: A new evaluation resource for visual information systems, in: dia 19 (9) (2017) 2093–2104 (2017).
International workshop ontoImage, Vol. 2, 2006 (2006). [152] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review
[132] M. Everingham, L. Gool, C. K. Williams, J. Winn, A. Zisserman, The and new perspectives, IEEE transactions on pattern analysis and ma-
pascal visual object classes (voc) challenge, International journal of chine intelligence 35 (8) (2013) 1798–1828 (2013).
computer vision 88 (2) (2010) 303–338 (2010). [153] Y. Jia, L. Bai, S. Liu, P. Wang, J. Guo, Y. Xie, Semantically-enhanced
[133] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisser- kernel canonical correlation analysis: a multi-label cross-modal re-
man, The pascal visual object classes challenge 2007 (voc2007) results trieval, Multimedia Tools and Applications 78 (10) (2019) 13169–13188
(2007). (2019).
[134] M. J. Huiskes, M. S. Lew, The mir flickr retrieval evaluation, in: Pro- [154] X. Zhang, K. Ahmad, Ontology and terminology of disaster manage-
ceedings of the 1st ACM international conference on Multimedia infor- ment, in: DIMPLE: DIsaster Management and Principled Large-scale
mation retrieval, 2008, pp. 39–43 (2008). information Extraction Workshop Programme, 2014, p. 46 (2014).
[135] M. J. Huiskes, B. Thomee, M. S. Lew, New trends and ideas in visual [155] M. Rogers, K. Ahmad, Corpus linguistics and terminology extraction
concept detection: the mir flickr retrieval evaluation initiative, in: Pro- (2001).
ceedings of the international conference on Multimedia information re- [156] Z. Zhongming, L. Linong, Y. Xiaona, Z. Wangqiang, L. Wei, et al., Wiki-
trieval, 2010, pp. 527–536 (2010). cmr: A web cross modality database for studing and evaluation of cross
[136] J. Krapac, M. Allan, J. Verbeek, F. Juried, Improving web image search modality retrieval methods (2013).
results using query-relative classifiers, in: 2010 IEEE Computer Society [157] C. Kang, S. Xiang, S. Liao, C. Xu, C. Pan, Learning consistent feature
Conference on Computer Vision and Pattern Recognition, IEEE, 2010, representation for cross-modal multimedia retrieval, IEEE Transactions
pp. 1094–1101 (2010). on Multimedia 17 (3) (2015) 370–381 (2015).
[137] M. Hodosh, P. Young, J. Hockenmaier, Framing image description as a [158] L. Wu, Y. Wang, L. Shao, Cycle-consistent deep generative hashing for
ranking task: Data, models and evaluation metrics, Journal of Artificial cross-modal retrieval, IEEE Transactions on Image Processing 28 (4)
Intelligence Research 47 (2013) 853–899 (2013). (2018) 1602–1612 (2018).
[138] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions [159] Y. Peng, X. Huang, J. Qi, Cross-media shared representation by hier-
to visual denotations: New similarity metrics for semantic inference over archical learning with multiple deep networks., in: IJCAI, 2016, pp.
event descriptions, Transactions of the Association for Computational 3846–3853 (2016).
Linguistics 2 (2014) 67–78 (2014). [160] J. Shao, L. Wang, Z. Zhao, A. Cai, et al., Deep canonical correlation
[139] C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, Collecting im- analysis with progressive and hypergraph learning for cross-modal re-
age annotations using amazon’s mechanical turk, in: Proceedings of the trieval, Neurocomputing 214 (2016) 618–628 (2016).
NAACL HLT 2010 Workshop on Creating Speech and Language Data [161] V. E. Liong, J. Lu, Y.-P. Tan, J. Zhou, Deep coupled metric learning for
with Amazon’s Mechanical Turk, Association for Computational Lin- cross-modal matching, IEEE Transactions on Multimedia 19 (6) (2016)
guistics, 2010, pp. 139–147 (2010). 1234–1244 (2016).
[140] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, [162] J. Luo, Y. Shen, X. Ao, Z. Zhao, M. Yang, Cross-modal image-text re-
P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, trieval with multitask learning, in: Proceedings of the 28th ACM Inter-
in: European conference on computer vision, Springer, 2014, pp. 740– national Conference on Information and Knowledge Management, 2019,
755 (2014). pp. 2309–2312 (2019).
[141] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: [163] Y. Jian, J. Xiao, Y. Cao, A. Khan, J. Zhu, Deep pairwise ranking with
A large-scale hierarchical image database, in: 2009 IEEE conference multi-label information for cross-modal retrieval, in: 2019 IEEE Inter-
on computer vision and pattern recognition, Ieee, 2009, pp. 248–255 national Conference on Multimedia and Expo (ICME), IEEE, 2019, pp.
(2009). 1810–1815 (2019).
[142] Y. Jia, M. Salzmann, T. Darrell, Learning cross-modality similarity for
multinomial data, in: 2011 International Conference on Computer Vi-
sion, IEEE, 2011, pp. 2407–2414 (2011).
[143] F. Zhong, G. Wang, Z. Chen, F. Xia, G. Min, Cross-modal retrieval for
cpss data, IEEE Access 8 (2020) 16689–16701 (2020).
[144] G. Xu, X. Li, L. Shi, Z. Zhang, A. Zhai, Combination subspace
graph learning for cross-modal retrieval, Alexandria Engineering Jour-
nal (2020).

Intrusion Detection Honeypots
From Everand
Intrusion Detection Honeypots
Chris Sanders
3/5 (2)
Venkatesh Ganti, Anish Das Sarma - Data Cleaning. A Practical Perspective-Morgan & Claypool (2013)
No ratings yet
Venkatesh Ganti, Anish Das Sarma - Data Cleaning. A Practical Perspective-Morgan & Claypool (2013)
72 pages
Master Asl
No ratings yet
Master Asl
83 pages
Scalable Multi-Document Summarization Using Natural Language Proc
No ratings yet
Scalable Multi-Document Summarization Using Natural Language Proc
58 pages
PhoCLIP 232 Specialized Project OFFICIAL
No ratings yet
PhoCLIP 232 Specialized Project OFFICIAL
105 pages
Improving Retrieval Augmented Generation
No ratings yet
Improving Retrieval Augmented Generation
33 pages
Latex Conversion
No ratings yet
Latex Conversion
42 pages
Full Text 01
No ratings yet
Full Text 01
68 pages
Mir2ed Toc
No ratings yet
Mir2ed Toc
17 pages
Social Information Retrival Thesis
No ratings yet
Social Information Retrival Thesis
104 pages
ND Review 2014
No ratings yet
ND Review 2014
35 pages
Dissertation Krueger Robert PDF
No ratings yet
Dissertation Krueger Robert PDF
212 pages
Introduction To Machine Learning - Wikipedia
No ratings yet
Introduction To Machine Learning - Wikipedia
456 pages
Prediction of Consumer Behavior
No ratings yet
Prediction of Consumer Behavior
201 pages
Machine Learning General Concepts
100% (4)
Machine Learning General Concepts
80 pages
Unlocking Statistics for the Social Sciences
From Everand
Unlocking Statistics for the Social Sciences
Norma Sinclair
No ratings yet
Predicting User Interaction On Social Media Using Machine Learnin
No ratings yet
Predicting User Interaction On Social Media Using Machine Learnin
76 pages
Mining Social Media: Tracking Content and Predicting Behaviour
50% (2)
Mining Social Media: Tracking Content and Predicting Behaviour
228 pages
CSE 21-131 Carlsson Lindgren
No ratings yet
CSE 21-131 Carlsson Lindgren
78 pages
Statistical Machine Learning For Information Retrieval - Adam Berger PDF
No ratings yet
Statistical Machine Learning For Information Retrieval - Adam Berger PDF
147 pages
ENIT TIC Report Template
No ratings yet
ENIT TIC Report Template
35 pages
1 s2.0 S1319157820303712 Mainext
No ratings yet
1 s2.0 S1319157820303712 Mainext
25 pages
G.W. Hiddink: Educational Multimedia Databases
No ratings yet
G.W. Hiddink: Educational Multimedia Databases
267 pages
Wang Asu 0010N 21448
No ratings yet
Wang Asu 0010N 21448
81 pages
Named Entity Recognition
No ratings yet
Named Entity Recognition
120 pages
TFM Jenifer Tabita Ciuciu-Kis
No ratings yet
TFM Jenifer Tabita Ciuciu-Kis
83 pages
On Path To Multimodal Generalist:: General-Level and General-Bench
No ratings yet
On Path To Multimodal Generalist:: General-Level and General-Bench
305 pages
On The Applicability of Deep Learning To Construct Process Models From Natural Text 16 05
No ratings yet
On The Applicability of Deep Learning To Construct Process Models From Natural Text 16 05
66 pages
FYP Proposal
No ratings yet
FYP Proposal
18 pages
Book Genre Classification Using ML
No ratings yet
Book Genre Classification Using ML
46 pages
Master Thesis
No ratings yet
Master Thesis
58 pages
Neural Transfer Learning For NLP
No ratings yet
Neural Transfer Learning For NLP
329 pages
Scalable Entity Resolution
No ratings yet
Scalable Entity Resolution
66 pages
2017 Phrase Mining From Massive Text and Its Applications
No ratings yet
2017 Phrase Mining From Massive Text and Its Applications
89 pages
Content Modeling For Social Media Text: Christina Sauper
No ratings yet
Content Modeling For Social Media Text: Christina Sauper
136 pages
1 s2.0 S0262885620301748 Main
No ratings yet
1 s2.0 S0262885620301748 Main
17 pages
Big Data and Social Science-Dikompresi
No ratings yet
Big Data and Social Science-Dikompresi
81 pages
Deep Learning in Indus Valley Script Digitization
No ratings yet
Deep Learning in Indus Valley Script Digitization
55 pages
2024 Donner Catherine Thesis
No ratings yet
2024 Donner Catherine Thesis
161 pages
Computer Science 2
No ratings yet
Computer Science 2
66 pages
Seed1.5-VL Technical Report
No ratings yet
Seed1.5-VL Technical Report
77 pages
Advances in Data Analysis Theory and Applications To Reliability and Inference, Data Mining, Bioinformatics, Lifetime Data, and Neural Networks PDF
100% (17)
Advances in Data Analysis Theory and Applications To Reliability and Inference, Data Mining, Bioinformatics, Lifetime Data, and Neural Networks PDF
17 pages
Speech Recognition in Noisy Environments
No ratings yet
Speech Recognition in Noisy Environments
130 pages
Recommender Systems: A, B, C B B, D A, B, C, A, B, C A, B, C, e
No ratings yet
Recommender Systems: A, B, C B B, D A, B, C, A, B, C A, B, C, e
97 pages
Logo Recognition Theory and Practice Compress
No ratings yet
Logo Recognition Theory and Practice Compress
189 pages
Software Patterns Made Easy
From Everand
Software Patterns Made Easy
Justice Nanhou
No ratings yet
Soft Computing and Medical Bioinformatics (1st Ed.)
No ratings yet
Soft Computing and Medical Bioinformatics (1st Ed.)
145 pages
Thesis Philippe Saade
No ratings yet
Thesis Philippe Saade
69 pages
VenkteshV Thesis PhD19016 Revised Final
No ratings yet
VenkteshV Thesis PhD19016 Revised Final
172 pages
Review of Automatic Text Summarization Techniques & Methods - 1-S2.0-S1319157820303712-Mainext
No ratings yet
Review of Automatic Text Summarization Techniques & Methods - 1-S2.0-S1319157820303712-Mainext
25 pages
Python Data Science
100% (1)
Python Data Science
173 pages
Cristian-Stefan Tutuianu PDF
No ratings yet
Cristian-Stefan Tutuianu PDF
40 pages
Tong 2020
No ratings yet
Tong 2020
14 pages
Thesis Darius Dragnea
No ratings yet
Thesis Darius Dragnea
64 pages
Big Data and The Web
No ratings yet
Big Data and The Web
170 pages
Exploration of Thesis
No ratings yet
Exploration of Thesis
93 pages
IRS Extended
No ratings yet
IRS Extended
15 pages
Active Learning
100% (3)
Active Learning
116 pages
Chang Joseph Thesis
No ratings yet
Chang Joseph Thesis
144 pages
A Survey On Large Language Models With Some Insights
No ratings yet
A Survey On Large Language Models With Some Insights
174 pages
David Coimbra - Dissertacao
No ratings yet
David Coimbra - Dissertacao
74 pages
Introducing British Sign Language
No ratings yet
Introducing British Sign Language
24 pages
ASL 101 Notes - 06 - 10 - 2019
No ratings yet
ASL 101 Notes - 06 - 10 - 2019
3 pages
The Significance of Language As A Tool of Communication: Conference Paper
No ratings yet
The Significance of Language As A Tool of Communication: Conference Paper
3 pages
12A7 DT B34 PracTest10 Nokey
No ratings yet
12A7 DT B34 PracTest10 Nokey
8 pages
Mat 4
No ratings yet
Mat 4
9 pages
Linguistics of American Sign Language An Introduction 5th Edition Clayton Valli PDF Download
No ratings yet
Linguistics of American Sign Language An Introduction 5th Edition Clayton Valli PDF Download
84 pages
Smart Glove For Hearing-Impaired: Abhilasha C Chougule, Sanjeev S Sannakki, Vijay S Rajpurohit
No ratings yet
Smart Glove For Hearing-Impaired: Abhilasha C Chougule, Sanjeev S Sannakki, Vijay S Rajpurohit
5 pages
Tips For Teaching Students Who Are Deaf or Hard of Hearing
No ratings yet
Tips For Teaching Students Who Are Deaf or Hard of Hearing
18 pages
Goldin-Meadow Mylander 1990 Language
No ratings yet
Goldin-Meadow Mylander 1990 Language
34 pages
The Deaf and Language: Oral, Sign and Written (MCQS)
0% (1)
The Deaf and Language: Oral, Sign and Written (MCQS)
1 page
Thời gian làm bài: 120 phút (không kể thời gian giao đề)
No ratings yet
Thời gian làm bài: 120 phút (không kể thời gian giao đề)
97 pages
Deep Learning For Sign Language Recognition Current Techniques Benchmarks and Open Issues
No ratings yet
Deep Learning For Sign Language Recognition Current Techniques Benchmarks and Open Issues
35 pages
MINI - PROJECT (Sign Language)
No ratings yet
MINI - PROJECT (Sign Language)
3 pages
Unit 10: Communication A. Phonetics
100% (1)
Unit 10: Communication A. Phonetics
14 pages
Eng511 Solved Assignment Solution by Leo ?-1
No ratings yet
Eng511 Solved Assignment Solution by Leo ?-1
3 pages
Silent Conversation 1
No ratings yet
Silent Conversation 1
4 pages
Named Studies: Summaries and Reference List: Cambridge IGCSE Psychology 0266
No ratings yet
Named Studies: Summaries and Reference List: Cambridge IGCSE Psychology 0266
20 pages
UNICEF ESA Guidance Use Sign Language ADT
No ratings yet
UNICEF ESA Guidance Use Sign Language ADT
41 pages
Deaf Culture
No ratings yet
Deaf Culture
14 pages
6468
67% (3)
6468
266 pages
Unit 1 - Theories of Origin of Human Language
100% (1)
Unit 1 - Theories of Origin of Human Language
7 pages
Communication Options
No ratings yet
Communication Options
1 page
English Sample Papers
No ratings yet
English Sample Papers
12 pages
Hand To Mouth
No ratings yet
Hand To Mouth
4 pages
Communicative Language Activities and Strategies (CEFR Section 4.4)
No ratings yet
Communicative Language Activities and Strategies (CEFR Section 4.4)
3 pages
Michelle Ruiz Resume 4
No ratings yet
Michelle Ruiz Resume 4
2 pages
Project Synopsis
100% (1)
Project Synopsis
3 pages
R E J ADAM PHD FINAL Version
No ratings yet
R E J ADAM PHD FINAL Version
267 pages
Conversion of Sign Language To Text and Audio Using Deep Learning Techniques
No ratings yet
Conversion of Sign Language To Text and Audio Using Deep Learning Techniques
9 pages

Cross Modal Survey

Uploaded by

Cross Modal Survey

Uploaded by

Comparative analysis on cross-modal information retrieval: a review

Parminder Kaura,∗, Husanbir Singh Pannua , Avleen Kaur Malhib

Thapar Institute of Engineering and Technology, Patiala, India

Aalto University, Finland

Preprint submitted to Elsevier December 17, 2021

4 Cross-modal representation and retrieval techniques 11

Therefore, the fundamental idea of cross-modal is to in-

Figure 4: Shortlisting of the research articles starting from title, ab-

highly improved when information from more than one

2.5. Publication metrics

Table 2: List of journals represented by X in figure (6), having a publication

IEEE IEEE Signal Processing Magazine

Figure 8: Geographical representation of country wise journal publication

Figure 10: General framework of cross-modal retrieval process

Table 4: List of acronyms in alphabetical order

Total covariance matrix C is a block matrix where C pq = Cqp 0

in cross-modal document retrieval task. Two hypotheses have

labeled multi-modal data having separate label spaces. Zero- research/nuswide/NUS-WIDE.html

lished from an initiative started by Technical Commit-

7. Discussion To handle this issue, researchers have introduced several tech-

Table 17: Open issues in cross-modal retrieval

promising results [66].

You might also like