Deep Multimodal Representation Learning: A Survey
Corresponding author: Shiping Wang ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 61502104 and Grant 61672159, in part
by the Fujian Collaborative Innovation Center for Big Data Application in Governments, and in part by the Technology Innovation
Platform Project of Fujian Province under Grant 2014H2005.
ABSTRACT Multimodal representation learning, which aims to narrow the heterogeneity gap among different modalities, plays an indispensable role in the utilization of ubiquitous multimodal data. Owing to the powerful representation ability of deep learning, with its multiple levels of abstraction, deep learning-based multimodal representation learning has attracted much attention in recent years. In this paper, we provide a comprehensive survey of deep multimodal representation learning, a topic that has rarely been surveyed in its own right. To facilitate the discussion of how the heterogeneity gap is narrowed, we categorize deep multimodal representation learning methods into three frameworks according to the underlying structures in which different modalities are integrated: joint representation, coordinated representation, and encoder-decoder. Additionally, we review typical models in this area, ranging from conventional models to newly developed technologies. This paper highlights key issues of the newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective, which, to the best of our knowledge, have not previously been reviewed from this angle, even though they have become major focuses of much contemporary research. For each framework or model, we discuss its basic structure, learning objective, application scenarios, key issues, advantages, and disadvantages, so that both new and experienced researchers can benefit from this survey. Finally, we suggest some important directions for future work.
INDEX TERMS Multimodal representation learning, multimodal deep learning, deep multimodal fusion,
multimodal translation, multimodal adversarial learning.
TABLE 1. The relationship between typical models and three types of deep multimodal representation learning frameworks. Each of the typical models may belong to (denoted by X) or can be integrated with (denoted by a) the relevant framework.
TABLE 2. A summary of typical applications of the three frameworks. Each application may include some of the modalities, such as audio, video, image, and text, which are denoted by their first letter. Here, different integration ways are denoted by + (fusion), ∼ (coordination) and → (translation).
FIGURE 2. Three types of frameworks for deep multimodal representation. (a) Joint representation aims to learn a shared semantic subspace.
(b) The coordinated representation framework learns separated but coordinated representations for each modality under some constraints.
(c) The encoder-decoder framework translates one modality into another and keeps their semantics consistent.
and ResNet [49]. They can be integrated into multimodal learning models and trained together with other components. However, considering the requirement for sufficient training data and computation resources, a pre-trained CNN may be a better choice for multimodal representation learning.

The fundamental tasks in natural language processing involve representing words and encoding sentences. A popular way to represent words is word embedding, such as word2vec [50] or GloVe [51], which maps words into a distributional vector space where the similarity between words can be measured. In NLP tasks, a common issue that should be considered is the unknown word problem, also known as out-of-vocabulary (OOV) words, which can potentially affect the performance of many systems. To deal with the unknown word issue, character embeddings [52], [53] are a viable option for representing language inputs. For example, Kim et al. [52] trained a convolutional neural network to yield word representations based on character-level embeddings. Bojanowski et al. [53] proposed to learn vector representations of character n-grams; then, by treating each word as a bag of character n-grams, the embedding of a word can be obtained as the sum of these vector representations. Experiments [54], [55] showed that handling the OOV issue properly improves the performance of NLP systems considerably.

Recurrent neural networks (RNN) [56] are a powerful tool for dealing with variable-length sequences such as sentences, videos, and audio. Since the activation of the hidden state at time t depends on that of all the previous time steps, it can be seen as a summarization of the sequence up to step t. However, vanilla RNNs have difficulty capturing long-term dependencies because of the gradient vanishing problem [57]. In practice, a better choice is long short-term memory (LSTM) [58], [59] or gated recurrent unit (GRU) [60] networks, which perform better at capturing long-term dependencies [61], [62]. Further, bidirectional recurrent neural networks (BRNN) [63] and the bidirectional variants of LSTM [64] or GRU [65] are also widely used for capturing the semantics.
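To make the sequence encoders above concrete, the following is a minimal PyTorch sketch, not tied to any particular cited paper, of a bidirectional GRU that summarizes a token sequence into a single vector; the vocabulary size and dimensions are arbitrary placeholder values.

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Summarize a token sequence into a single vector with a bidirectional GRU."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        _, last = self.rnn(x)                  # last: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states as the sentence vector.
        return torch.cat([last[0], last[1]], dim=-1)   # (batch, 2 * hidden_dim)

encoder = SentenceEncoder()
sentence_vec = encoder(torch.randint(1, 10000, (4, 12)))   # four sentences of 12 tokens

The same pattern applies to video or audio frame sequences: replace the embedding layer with per-frame features and keep the recurrent summarization.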
In addition to RNNs, the CNN is another widely used model for extracting salient n-gram features from sentences. Experiments showed that CNN-based models perform remarkably well in sentence-level classification [66] and sentiment analysis tasks [67].

As for the video modality, since the input of each time step is an image, its features can be extracted via the techniques used for handling images. In addition to deep features, handcrafted features are still widely used in the video and audio modalities [10], [68]. Further, some toolkits have been developed to extract handcrafted features. For example, OpenFace [69] can be used to extract facial features such as facial landmarks, head pose, and eye gaze. Another tool is OpenSMILE [70], which can be used to extract acoustic features including Mel-frequency cepstral coefficients (MFCC), voice intensity, pitch, and their statistics. After the frames of videos and audio have been encoded, the CNN or RNN networks mentioned above can be used to summarize the sequences into individual vector representations.

B. JOINT REPRESENTATION
The strategy of integrating different types of features to improve the performance of machine learning methods has long been used by researchers. A natural extension of this strategy to the multimodal setting is the utilization of fused heterogeneous features. Following this strategy, promising results have been shown in many multimodal classification or clustering tasks, such as video classification [6], [21], event detection [7], [8], sentiment analysis [9], [10], and visual question answering [23].

To bridge the heterogeneity gap between different modalities, joint representation aims to project unimodal representations into a shared semantic subspace, where the multimodal features can be fused [18]. As Fig. 2(a) shows, after each modality is encoded via an individual neural network, both of them are mapped into a shared subspace, where the concepts shared by the modalities are extracted and fused into a single vector.

The simplest way to fuse multimodal features is to concatenate them directly.
However, mostly this subspace is implemented by a distinct hidden layer, in which the transformed modality-specific vectors are added, and thus the semantics from different modalities are combined. This property can be seen from (1), where z is the activation of the output nodes in the shared layer, v is the output of a modality-specific encoding network, w is the weight matrix connecting a modality-specific encoding layer to the shared layer, and the subscript index denotes different modalities.

$z = f(w_1^T v_1 + w_2^T v_2)$   (1)

Other than the fusion process in a distinct hidden layer, usually called an additive approach, a multiplicative method is also adopted in some literature. In a sentiment analysis task, Zadeh et al. [10] proposed to fuse the language, video, and audio modalities in a tensor, which is constructed from the outer product of all the modality-specific feature vectors. In this way, the authors intend to exploit both intra-modality and inter-modality dynamics. The definition of the fused tensor can be formulated as follows:

$z^m = \begin{bmatrix} z^l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z^v \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z^a \\ 1 \end{bmatrix}$   (2)

where $z^m$ denotes the fused tensor, $z^l$, $z^v$, $z^a$ denote the different modalities respectively, and $\otimes$ indicates the outer product operator. However, since the outer product is computationally expensive, Fukui et al. [23] alternatively propose a more efficient way, Multimodal Compact Bilinear pooling (MCB), to fuse the language and image modalities. As formulated in (3), given vectors x and q, the proposed method seeks to reduce the dimension of the outer product $x \otimes q$ by the Count Sketch projection function $\Psi$. In particular, the count sketch of the outer product can be decomposed into a convolution of separate count sketches [71], which means that the explicit computation of the outer product can be avoided. Further, the authors use the Fast Fourier Transform (FFT) to accelerate the computation.

$\Phi = \Psi(x \otimes q) = \Psi(x) * \Psi(q) = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(\Psi(x)) \odot \mathrm{FFT}(\Psi(q))\big)$   (3)
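As an illustration of the three fusion operators discussed above, here is a small NumPy sketch, with arbitrary placeholder dimensions, of additive fusion in a shared layer as in (1), tensor fusion via the outer product of vectors padded with a constant 1 as in (2), and an MCB-style approximation that count-sketches both vectors and convolves them through the FFT as in (3). It is a simplified, single-sample illustration rather than the implementation used in the cited works.

import numpy as np

rng = np.random.default_rng(0)
d1, d2, d_sketch = 512, 512, 1024
v1, v2 = rng.normal(size=d1), rng.normal(size=d2)

# (1) additive fusion in a shared layer: z = f(W1^T v1 + W2^T v2)
W1, W2 = rng.normal(size=(d1, 256)), rng.normal(size=(d2, 256))
z_add = np.tanh(W1.T @ v1 + W2.T @ v2)

# (2) tensor fusion: outer product of the vectors, each padded with a constant 1
# so that unimodal and bimodal interactions are both retained.
z_tensor = np.outer(np.append(v1, 1.0), np.append(v2, 1.0))

# (3) MCB-style fusion: count-sketch both vectors, then convolve them via the FFT.
def count_sketch(x, d_out, h, s):
    y = np.zeros(d_out)
    np.add.at(y, h, s * x)        # scatter each signed entry into its hashed bin
    return y

h1, h2 = rng.integers(0, d_sketch, d1), rng.integers(0, d_sketch, d2)
s1, s2 = rng.choice([-1.0, 1.0], d1), rng.choice([-1.0, 1.0], d2)
phi = np.fft.irfft(np.fft.rfft(count_sketch(v1, d_sketch, h1, s1)) *
                   np.fft.rfft(count_sketch(v2, d_sketch, h2, s2)), n=d_sketch)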
Although the model shown in Fig. 2(a) is designed for the setting in which parallel data are available during the training and inference steps, the ability to deal with partially missing data in some modalities is also desired, such that more training data can be exploited or the performance of downstream tasks is influenced only slightly when data are missing from one or more modalities. To this end, a widely used method is training the model on data that include only some of the modalities, excluding a modality in different training epochs [1], [72].
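A minimal sketch of this training trick, assuming the inputs are already encoded as modality-specific feature tensors: with some probability, one modality is blanked out before fusion so that the joint network learns to tolerate missing inputs. The drop probability and the zero-filling strategy are illustrative choices, not those of any specific cited model.

import random
import torch

def modality_dropout(v_img, v_txt, p_drop=0.3):
    """Randomly blank one modality so the fusion network learns to cope with missing inputs."""
    if random.random() < p_drop:
        if random.random() < 0.5:
            v_img = torch.zeros_like(v_img)   # pretend the image modality is missing
        else:
            v_txt = torch.zeros_like(v_txt)   # pretend the text modality is missing
    return v_img, v_txt

# Typical use: call once per batch (or per epoch) before the fusion layer.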
Interestingly, the training trick used for tackling missing data is also helpful for obtaining the modality-invariant property, which means that the difference between the statistical distributions of the modalities is minimized or, in other words, that the feature vectors contain minimal modality-specific characteristics. The work proposed by Aytar et al. [73] shows that, constrained by a statistical regularization which encourages activations in the intermediate hidden layers to have similar statistical distributions across modalities, the modality-invariant property can be strengthened. Their model encourages different modalities to be aligned with each other automatically in the representation layer, even when the training data are unaligned.

To be more expressive, the learned vector is expected to fuse complementary semantics from different modalities. This complementary property cannot be guaranteed automatically, since joint representation tends to preserve shared semantics across modalities while ignoring modality-specific information. A solution is adding extra regularization terms to the optimization objectives [74]. For example, the reconstruction loss used in multimodal autoencoders [1] can be considered a regularization term that plays a role in preserving modality independence. Another example is the approach proposed by Jiang et al. [21], which imposes a trace norm regularization over the network weights to reveal the hidden correlations and diversity of the multimodal features. Intuitively, if a pair of features are highly correlated, the weights used for fusing them should be similar, such that their contributions to the fused representation will be roughly equal. Thus, the goal of trace norm regularization is to discover the relationships between modalities and adjust the weights of the fusion layer accordingly. Their experiments in video classification tasks showed that this regularization term is helpful for improving performance.

Compared to other frameworks, one of the advantages of joint representation is that it is convenient to fuse several modalities, since there is no need to coordinate the modalities explicitly. Another advantage is that the shared common subspace tends to be modality-invariant, which is helpful for transferring knowledge from one modality to another [1], [73]. One of the disadvantages of this framework is that it cannot be used to infer separate representations for each modality.

C. COORDINATED REPRESENTATION
Another type of method popular in multimodal learning is coordinated representation. As Fig. 2(b) shows, instead of learning representations in a joint subspace, the coordinated representation framework learns separate but coordinated representations for each modality under some constraints [18]. Since the information contained in different modalities is unequal, learning separate representations is beneficial for preserving the exclusive and useful modality-specific characteristics [31]. Typically, conditioned on the constraint types, coordinated representation methods can be categorized into two groups: cross-modal similarity based and cross-modal correlation based. Cross-modal similarity based methods aim to learn a common subspace where the distance between vectors from different modalities can be measured directly [75], while cross-modal correlation based methods aim to learn a shared subspace such that the correlation between the representation sets of different modalities is maximized [5]. In this section, we review the former and leave the latter to Section III-C.
Cross-modal similarity methods learn coordinated representations under constraints of similarity measurement. The learning objective of this class of models is to preserve both inter-modality and intra-modality similarity structure, which expects the cross-modal distance between items associated with the same semantics or object to be as small as possible, while expecting the distance between items with dissimilar semantics to be as large as possible.

A widely used constraint is cross-modal ranking. Taking visual-text embedding for example, ignoring the regularization terms and denoting the matched embedding vectors of visual and text as (v, t) ∈ D, the optimization objective can be expressed as the loss function in (4), where α is the margin, S is the similarity measurement function, $t^-$ is an embedding vector unmatched to v and $v^-$ is an embedding vector unmatched to t. Commonly, $t^-$ and $v^-$ are known as negative samples, which are selected randomly from the dataset D, and (4) is known as the margin rank loss [36].

$\mathrm{rankLoss} = \sum_{v}\sum_{t^-} \max\big(0, \alpha - S(v, t) + S(v, t^-)\big) + \sum_{t}\sum_{v^-} \max\big(0, \alpha - S(t, v) + S(t, v^-)\big)$   (4)
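The following PyTorch sketch implements a bi-directional margin rank loss in the spirit of (4), using cosine similarity as S and treating the other items of a mini-batch as random negatives; the margin value is an arbitrary placeholder.

import torch

def margin_rank_loss(v, t, margin=0.2):
    """Bi-directional margin ranking loss over a batch of matched (v_i, t_i) pairs.
    Negatives are the non-matching items in the batch; S is cosine similarity."""
    v = torch.nn.functional.normalize(v, dim=1)
    t = torch.nn.functional.normalize(t, dim=1)
    scores = v @ t.t()                       # scores[i, j] = S(v_i, t_j)
    pos = scores.diag().unsqueeze(1)         # S(v_i, t_i), shape (batch, 1)
    mask = 1.0 - torch.eye(len(v), device=v.device)
    loss_v2t = (torch.clamp(margin - pos + scores, min=0) * mask).sum()
    loss_t2v = (torch.clamp(margin - pos + scores.t(), min=0) * mask).sum()
    return (loss_v2t + loss_t2v) / len(v)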
Based on the cross-modal ranking constraint, a variety of cross-modal applications have been developed. For example, Frome et al. [34] used a combination of dot-product similarity and margin rank loss to learn a visual-semantic embedding model (DeViSE) for visual recognition. DeViSE firstly pre-trains a pair of deep networks to map images and their correlated labels into embedding vectors v and t, then leverages the cross-modal similarity model to learn a shared semantic embedding space for both modalities. Following the notation in (4), the loss function for each training sample can be defined as follows:

$\mathrm{loss}(v, t) = \sum_{t^-} \max(0, \alpha - t M v + t^- M v)$   (5)

where M is a linear transformation matrix used for transforming v into the shared semantic embedding space, and the dot product between t and Mv is the similarity measurement used for both training and testing. Under the constraint in (5), the model is expected to produce a higher dot-product similarity between matched vectors than between unmatched ones, and it subsequently endows the image embeddings with rich semantic information transferred from the language modality. This idea is also shared by the work of Lazaridou and Baroni [35], which aims to integrate and propagate visual information into word embeddings. Their experimental results implied that the transferred visual knowledge is helpful for representing abstract concepts.

Inspired by the success of DeViSE, Kiros et al. [36] extended this model to learn a joint image-sentence embedding used for image captioning. They pre-trained a CNN network to obtain image features v and trained an LSTM network to encode the relevant sentences into t, then mapped both encodings into a coordinated embedding space where the similarity between them can be exploited by a cross-modal similarity model similar to [34]. Their model adopted the same similarity measurement used in DeViSE but employed the bi-directional rank loss formulated in (4), such that much richer cross-modal relationships can be discovered. This model is also employed in the work proposed by Socher et al. [32], which aims to map sentences and images into a common space for cross-modal retrieval. They introduced a dependency-tree based recursive neural network (DTRNN) to encode the language modality and argued that the proposed DTRNN is robust to surface changes such as word order.

Further, Karpathy and Fei-Fei [76] extended this framework to learn a fine-grained cross-modal alignment between words and image regions for generating region-level descriptions of images. Unfortunately, this task suffers from a lack of the necessary supervision information. Given images and their correlated sentences, the one-to-one correspondence between a word and the region it describes is not known. To address this problem, they chose to infer the alignment between segments of sentences and regions of the image in a cross-modal embedding space. The key idea is to formulate the image-sentence score as a function of the individual region-word similarities. Letting $v_i$ denote the image regions and $s_t$ denote the words in a sentence, the score between image k and sentence l is defined as follows:

$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t$   (6)

where $g_k$ is the set of fragments in image k, $g_l$ is the set of snippets in sentence l, and each word $s_t$ aligns to a unique best image region. Additionally, assuming that k = l denotes a matched image-sentence pair, the cross-modal ranking constraint can be defined as the loss function in (7), which encourages aligned image-sentence pairs to have a higher score than misaligned pairs.

$\mathrm{rankLoss} = \sum_{k}\sum_{l} \max(0, 1 - S_{kk} + S_{kl}) + \sum_{k}\sum_{l} \max(0, 1 - S_{kk} + S_{lk})$   (7)
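Below is a small PyTorch sketch of the region-word scoring idea in (6) and the corresponding ranking objective in (7); it assumes region and word features are already projected into a common space and is meant only to illustrate the computation.

import torch

def image_sentence_score(regions, words):
    """Eq. (6)-style score: each word aligns to its best-matching image region,
    and the region-word dot products are summed over the sentence.
    regions: (num_regions, dim), words: (num_words, dim)."""
    sims = words @ regions.t()               # (num_words, num_regions)
    return sims.max(dim=1).values.sum()

def batch_rank_loss(region_sets, word_sets):
    """Eq. (7)-style ranking over a small batch of aligned (image, sentence) pairs."""
    n = len(region_sets)
    S = torch.stack([torch.stack([image_sentence_score(region_sets[k], word_sets[l])
                                  for l in range(n)]) for k in range(n)])   # S[k, l]
    diag = S.diag()
    mask = 1.0 - torch.eye(n)
    loss = ((1 - diag.unsqueeze(1) + S).clamp(min=0) * mask).sum() \
         + ((1 - diag.unsqueeze(0) + S).clamp(min=0) * mask).sum()
    return loss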
(5), the model is expected to produce a higher dot-product
similarity between matched vectors than between unmatched The strategy to measure image-sentence similarity based
ones and subsequently endows images embedding with rich on individual region-word scores is also adopted by
semantic information which is transferred from language Peng et al. [31], who aim to preserve the modality-specific
modality. This idea is also shared by the work proposed by characteristics by utilizing the fine-grained information
Lazaridou and Baroni [35], which aims to integrate and prop- within each modality during the cross-modal correlation
agate visual information into word embeddings. Their exper- learning. The authors argued that different modalities have
imental results implied that the transferred visual knowledge imbalanced and complementary relationships, thus, instead
is helpful for representing abstract concepts. of measuring the similarity in a common space, they con-
Inspired by the success of DeViSE, Kiros et al. [36] struct an independent semantic space for each modality and
extended this model to learn a joint image-sentence measure the cross-modal similarity in both spaces simultane-
embedding used for image captioning. They pre-trained a ously. After that, the modality-specific similarity scores will
In addition to cross-modal ranking, another widely used constraint is Euclidean distance. The mainstream approach in this category is to minimize the distance between paired samples [33], [77], [78]. An example is the model proposed by Pan et al. [33], which aims to learn a visual-semantic embedding used for generating video descriptions. The model projects both visual and language representations into a low-dimensional embedding space, where the distances between paired samples are minimized such that the semantics of the visual embeddings are consistent with their relevant sentences. This constraint can be expressed as a loss term:

$\mathrm{distanceLoss} = \sum_{(v,s) \in D} \lVert T_v v - T_s s \rVert_2^2$   (8)

where $T_v$ and $T_s$ are transformation matrices for video v and sentence s, and v and s are paired samples from dataset D. Another example is the model for cross-modal matching proposed by Liong et al. [78], which aims to reduce the modality gap of paired data by minimizing the difference between hidden representations over all layers. Supposing that visual modality v and text modality t are encoded via homogeneous feed-forward neural networks, the loss can be formulated as follows:

$\mathrm{distanceLoss} = \sum_{l=1}^{L-1} \sum_{i=1}^{N} \lVert h_{it}^{l} - h_{iv}^{l} \rVert_2^2$   (9)

where l indicates a layer of both modality-specific networks, i indicates a pair of instances of the training data, and h denotes the hidden representations.
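The distance constraints in (8) and (9) can be written compactly as below; this PyTorch sketch assumes the features are already extracted and uses plain linear projections, so it is only a schematic version of the cited models.

import torch

def paired_distance_loss(video_feats, sent_feats, T_v, T_s):
    """Eq. (8)-style loss: project paired video/sentence features with linear maps
    T_v, T_s (stored as (input_dim, joint_dim) matrices) and penalize the squared
    Euclidean distance in the joint space."""
    return ((video_feats @ T_v - sent_feats @ T_s) ** 2).sum(dim=1).mean()

def layerwise_distance_loss(hidden_t, hidden_v):
    """Eq. (9)-style loss: sum the squared distances of paired hidden activations
    over corresponding layers of the two modality-specific networks."""
    return sum(((ht - hv) ** 2).sum() for ht, hv in zip(hidden_t, hidden_v))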
Further, the authors also imposed a large-margin criterion on the distances between unpaired data, which aims to minimize the intra-class distance and maximize the inter-class distance, such that more discriminative information can be exploited. This criterion is defined as follows:

$\begin{cases} \lVert t_i - v_j \rVert_2^2 \le \theta_1, & \text{if } l_{t_i,v_j} = 1 \\ \lVert t_i - v_j \rVert_2^2 \ge \theta_2, & \text{if } l_{t_i,v_j} = -1 \end{cases}$   (10)

where $t_i$ denotes sentence i, $v_j$ denotes image j, and $\theta_1$, $\theta_2$ are the small and large thresholds respectively. The condition $l_{t_i,v_j} = 1$ means that $t_i$ and $v_j$ belong to the same class; otherwise, they belong to different classes.
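The two-sided condition in (10) is usually optimized through a hinge-style relaxation; the following sketch is one such relaxation (an illustrative formulation with placeholder thresholds), not the exact criterion used by Liong et al. [78].

import torch

def class_margin_loss(t, v, same_class, theta1=0.5, theta2=2.0):
    """Hinge relaxation of eq. (10): squared distances of same-class pairs are pushed
    below theta1 and those of different-class pairs above theta2.
    same_class: boolean tensor of shape (batch,)."""
    d2 = ((t - v) ** 2).sum(dim=1)
    pos = torch.clamp(d2 - theta1, min=0)        # same-class pairs that are too far apart
    neg = torch.clamp(theta2 - d2, min=0)        # different-class pairs that are too close
    return torch.where(same_class, pos, neg).mean()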
Besides learning an inter-modality similarity measurement, another key issue for cross-modal applications is to preserve the intra-modality similarity structure. A widely used strategy is classifying the category of the learned features such that they are also discriminative within each modality [30], [79]. Additionally, another method is to keep the neighborhood structure within each view. The constraint in (10) is one of the implementations in this group. Another example is the work from Wang et al. [80], which proposed to learn image-text embeddings via a coordinated representation model that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints in the loss function. Letting $N(v_i)$ denote the neighborhood of image $v_i$ and $N(t_i)$ denote the neighborhood of sentence $t_i$, the within-view neighborhood structure preservation constraints can be formulated as follows:

$\begin{cases} d(v_i, v_j) + m < d(v_i, v_k) & \forall v_j \in N(v_i), \; \forall v_k \notin N(v_i) \\ d(t_i, t_j) + m < d(t_i, t_k) & \forall t_j \in N(t_i), \; \forall t_k \notin N(t_i) \end{cases}$   (11)

In addition to the applications characterized as finding one modality from another, such as cross-modal retrieval [75], [77], [80] and retrieval-based visual description [32], another type of application of coordinated representation is transferring knowledge across modalities, which may enhance the semantic description capability of the embeddings in the target modality. The basic idea is to maximize the cross-modal similarity of paired multimodal data in a common subspace during training, such that the embeddings capture their shared semantics, which means that the knowledge has been transferred. Several pieces of literature mentioned above [33]–[36] can be considered representative examples of this idea. Furthermore, coordinated representation can also be used for cross-domain transfer learning, which can partially reduce the need for labeled data. For example, in order to transfer knowledge from a large-scale cross-media dataset to a small-scale one, the works of Huang et al. [37], [38] proposed to train a pair of networks, one for each domain, and coordinate them by minimizing the maximum mean discrepancy (MMD) [81].

Compared to other frameworks, coordinated representation tends to preserve the exclusive and useful modality-specific characteristics within each modality [31]. Since different modalities are encoded in separate networks, one of the advantages is that each modality can be inferred individually. This property is also beneficial for cross-modal transfer learning, which aims to transfer knowledge across different modalities or domains. A disadvantage of this framework is that, mostly, it is hard to learn representations with more than two modalities.

D. ENCODER-DECODER
Recently, the encoder-decoder framework has been widely used for multimodal translation tasks which map one modality into another, such as image captioning [13], [39], video description [14], [41], and image synthesis [15], [82]. Typically, as shown in Fig. 2(c), the encoder-decoder framework is mainly composed of two components, an encoder and a decoder. The encoder maps the source modality into a latent vector v, and then, based on the vector v, the decoder generates a novel sample of the target modality.

Although most encoder-decoder models contain only one encoder and one decoder, some variants can also be composed of several encoders or decoders. For example, Mor et al. [83] proposed a model to translate music across musical instruments, where a single encoder and several decoders are involved. The shared encoder is responsible for extracting domain-independent music semantics, and each decoder reproduces a piece of music in the target domain.
An example including two encoders is the image-to-image translation model proposed by Huang et al. [84]. It consists of a content encoder and a style encoder, each responsible for part of the duty.

The generalized learning objective of encoder-decoder models, taking visual description as an example [41], can be expressed as follows:

$\theta^{*} = \arg\max_{\theta} \sum_{(V,S)} \log p(S \mid V; \theta)$   (12)

which maximizes the log likelihood of the sentence S given the corresponding visual content V and the model parameters θ. Further, assuming that each word in the sequence is produced in order, the log probability of the sentence can be expressed as:

$\log p(S \mid V; \theta) = \sum_{t=0}^{N} \log p(S_{w_t} \mid V, S_{w_1}, \ldots, S_{w_{t-1}})$   (13)

where $S_{w_i}$ represents the i-th word in the sentence and N is the total number of words.
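To illustrate the objective in (12)-(13), here is a minimal PyTorch decoder trained with teacher forcing: conditioned on a visual feature, it predicts each word given the previous ones, and the cross-entropy it returns is the negative of the log likelihood being maximized. The dimensions and the GRU choice are placeholder assumptions, not details of any specific cited model.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal decoder for eq. (12)-(13): maximize log p(S | V) word by word."""
    def __init__(self, vocab_size=10000, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, captions):            # captions: (batch, T)
        h0 = torch.tanh(self.init_h(visual_feat)).unsqueeze(0)
        states, _ = self.rnn(self.embed(captions[:, :-1]), h0)   # teacher forcing
        logits = self.out(states)                          # next-word prediction at each step
        # Negative log likelihood of the ground-truth continuation = -log p(S | V; theta)
        return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                           captions[:, 1:].reshape(-1))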
Superficially, the latent vector learned by the encoder-decoder model seems to relate only to the source modality, but in fact it closely relates to both the source and target modalities. Since the error correction signal flows from the decoder to the encoder, the encoder is guided by the decoder during training. Consequently, the generated representation tends to capture the shared semantics of both modalities.

To capture shared semantics more effectively, a popular solution is keeping the semantics consistent among modalities via some regularization terms. This depends on the coordination between the encoder and the decoder: both the correct understanding of the semantics in the source modality and the pertinent generation of novel samples in the target modality are important for success. Take image captioning [85] as an example: the description generated by the decoder may cover multiple visual aspects of an image, including objects, attributes such as color and size, backgrounds, scenes, and spatial relationships. Hence, the encoder has to detect and encode the necessary information correctly, and, further, the decoder is responsible for reasoning about high-level semantics and generating grammatically well-formed sentences.

An example of explicitly considering the semantic consistency between modalities is the model proposed by Gao et al. [42], which aims to translate videos into sentences. To tackle this problem, on the one hand, they maximized the likelihood formulated in (13) such that sentences can be generated correctly; on the other hand, they minimized the representation difference in a common subspace such that the semantics of the two modalities are correlated with each other. Supposing that v denotes the visual features, s denotes the sentence embeddings, and R denotes a matrix used for linearly projecting s into the subspace where v is located, the consistency constraint can be written as the loss term in (14).

$\mathrm{loss} = \lVert v - Rs \rVert_F^2$   (14)

Another example is the work proposed by Reed et al. [15], which endeavors to translate characters into pixels via a generative adversarial network (GAN) [82]. In their model, within each class, the similarity between the source and target encodings is maximized such that the semantics of both modalities stay consistent. Since models for image synthesis are mostly implemented with GANs, more examples of this task are left to Section III-D, which concentrates on generative adversarial learning.

On condition that the semantic consistency between modalities has been modeled explicitly, this framework can be used to learn cross-modal semantic embeddings. For example, based on the encoder-decoder framework, Gu et al. [86] proposed to learn cross-modal embeddings used for retrieval. Their model translates each modality into the other via distinct encoder-decoder networks and expects that the generated images or sentences are similar to their sources. In this model, the similarity between the generated sentence and its corresponding reference sentences is measured by a standard evaluation metric like BLEU [87], and the similarity between images is measured by a discriminator which is responsible for distinguishing whether an image comes from the generator or not.

In early works [88], [89], the representation of the visual modality is usually a fixed visual semantic list, such as objects and their relationships, which is detected explicitly by the encoder. Then, based on n-gram language models or sentence templates, a sentence is generated by the decoder. In this way, the problem is simplified. However, it is difficult for these models to deal with a large vocabulary or to model complex sentence structure [41].

Recently, a more accessible way of representing the source modality is encoding the essential information into a single vectorial representation [14]. Compared to traditional methods, it is more convenient for neural networks to encode information and generate samples. However, using a single vector as a bridge, it is challenging for both the encoder and the decoder to translate semantics between modalities. A problem for the encoder is that the high-level vectorial representation distilled from the source may lose some information which is useful for generating the target modality [13]. Also, another problem arises in the decoder once RNN models are used for generating a long sequence: the information contained in the original representation vector is diminished as it is propagated through the time steps.

The attention mechanism has become a popular solution for both of the aforementioned problems. Rather than merely using a single vector resulting from the last step of the encoder, the attention mechanism allows utilizing the intermediate representations distributed among time steps in an RNN network [90] or localized regions in a CNN network [91]. For the encoder, this mechanism relieves the requirement that the full information should be integrated into a single vector, and thus gives more flexibility to the design of the encoder.
… such that salient features can be involved while noise will be excluded. Conversely, an exemplary application of deep reinforcement learning during decoding is image captioning [94], [95].

Compared to other frameworks, one of the advantages of the encoder-decoder framework is its ability to generate novel samples of the target modality conditioned on the representations of the source modality. On the contrary, a disadvantage of this framework is that each encoder-decoder can only encode one of the modalities. Further, the complexity of designing the generator should be taken into consideration, since the techniques for generating plausible targets are still under development.
function. In this way, the essential cross-model correlation
III. TYPICAL MODELS for cross-modal retrieval is captured.
In this section, some typical models in deep multimodal By fusing modalities together in a unified latent space,
representation learning will be summarized. They range from probabilistic graphical models can be used to learn
conventional models, including probabilistic graphical mod- the essential cross-modal correlations. Based on multi-
els, multimodal autoencoders, and deep canonical correlation modal deep belief networks, several applications such
analysis, to newly developed technologies, including gen- as audio-visual emotion recognition [25], audio-visual
erative adversarial networks and attention mechanism. The speech recognition [27], and information trustworthiness
typical models described here can be categorized into one or estimation [100] have been reported. Also, based on
more of the frameworks above introduced or can be integrated multimodal deep Boltzmann machines, several solutions
with them. used for human pose estimation [101] and video emotion
prediction [26] have been proposed.
A. PROBABILISTIC GRAPHICAL MODELS One of the advantages of probabilistic graphical models is
In the deep representation learning area, probabilistic graph- that they can be trained in an unsupervised fashion, allowing
ical models include deep belief networks (DBN) [97] and the use of unlabeled data. Another advantage comes from
Since ρ is invariant to the scale of $w_x$ or $w_y$, the optimization objective can be further reformulated as a constrained optimization problem as follows:

$\max_{w_x, w_y} \; w_x^T C_{xy} w_y \quad \text{s.t.} \quad w_x^T C_{xx} w_x = 1, \; w_y^T C_{yy} w_y = 1$   (18)

The basic CCA is limited to modeling linear relationships, regardless of the true probability distributions of the different data views. To address this problem, many extensions have been proposed. One of the non-linear extensions is kernel CCA [111], which transforms the data into a higher dimensional Hilbert space before applying the CCA method. However, KCCA suffers from poor scalability [112], in that its closed-form solution requires computations with high time complexity and memory consumption. Alternatively, some approximation methods such as the Nyström method [113], incomplete Cholesky decomposition [114], partial Gram-Schmidt orthogonalization [115], and block incremental SVD [116] can be used to speed up this model. Another drawback of KCCA is its poor efficiency, which results from its requirement of accessing the whole training set when transforming an unseen instance [117].

A newer extension of CCA is deep CCA [117], which aims to learn a pair of more complex non-linear transformations for different modalities. The basic structure of this model can be illustrated by Fig. 2(b), where each modality is encoded by a deep neural network and then, in a common subspace, the canonical correlation between modalities is maximized. Letting $H_x = f_x(x, \theta_x)$ and $H_y = f_y(y, \theta_y)$ be non-linear transformation functions, implemented by neural networks, which map x and y into a shared subspace, the optimization objective is to maximize the cross-modality correlation between $H_x$ and $H_y$, formulated as follows:

$\max_{\theta_x, \theta_y} \mathrm{corr}(H_x, H_y) = \max_{\theta_x, \theta_y} \mathrm{corr}\big(f_x(x, \theta_x), f_y(y, \theta_y)\big)$   (19)

Compared to the particular kernel function used in KCCA, the non-linear function learned by a neural network is far more general. Hence, DCCA exhibits better adaptability and flexibility. Meanwhile, as a parametric method, DCCA scales better with data size and does not require access to the training data during testing.
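The correlation objective in (19) is usually evaluated on a (mini-)batch as the sum of canonical correlations between the two projected views. The NumPy sketch below shows that computation; the small ridge term eps is a common numerical safeguard, and the implementation is schematic rather than the one used in [117].

import numpy as np

def canonical_correlations(Hx, Hy, eps=1e-4):
    """Sum of canonical correlations between two projected views (the quantity that
    DCCA maximizes in eq. (19)). Hx, Hy: (n_samples, d) network outputs."""
    n = Hx.shape[0]
    Hx = Hx - Hx.mean(axis=0)
    Hy = Hy - Hy.mean(axis=0)
    Cxx = Hx.T @ Hx / (n - 1) + eps * np.eye(Hx.shape[1])
    Cyy = Hy.T @ Hy / (n - 1) + eps * np.eye(Hy.shape[1])
    Cxy = Hx.T @ Hy / (n - 1)

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)              # symmetric eigendecomposition
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(T, compute_uv=False).sum()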
Commonly, a maximized correlation objective focuses on learning the shared semantic information but tends to ignore modality-specific knowledge. To address this problem, extra regularization terms should be considered. For example, Wang et al. [118] proposed a variant of DCCA named deep canonically correlated autoencoders (DCCAE). In addition to maximizing the correlation between views, this model also minimizes the reconstruction error of each view via an autoencoder architecture. The role of the additional autoencoders can be interpreted as a regularization term which aims to raise the lower bound of the mutual information between views.

So far, most DCCA based applications can be characterized as predicting one modality given another, while DCCA can also be used to generate novel samples. Based on the probabilistic interpretation of CCA [119], Wang et al. [120] proposed an extension named deep variational canonical correlation analysis (VCCA). As a generative model, VCCA enables us to obtain unseen samples of each view. The basic probabilistic interpretation of CCA assumes that the two views of the observed variables, x and y, are generated according to the conditional probabilities p(x|z) and p(y|z), where z is a latent variable shared by both views. Rather than assuming a linear relation between x, y and z, VCCA, implemented via DNN networks, aims to model a non-linear relationship among them, which potentially has stronger representation power. Specifically, the optimization objective of VCCA is a variational lower bound of the likelihood, which can be expressed as a sum over data samples. Hence, the model can be trained conveniently via stochastic gradient descent.

A challenge for DCCA is its relatively poor scalability. Directly inherited from basic CCA, the standard correlation function couples all training samples together and cannot be expressed as a sum over data samples. Thus, Andrew et al. [117] chose a batch-based algorithm (L-BFGS) to optimize the network. However, it computes gradients over the entire dataset and requires a large memory volume, which is infeasible for large datasets. In order to improve the scalability of DCCA, some efforts have been made. Wang et al. [121], [122] adopted a stochastic optimization method with large mini-batches to approximate the gradients. As a result, the problem of memory consumption is relieved.

Recently, a more efficient optimization solution named Soft CCA, which requires lower computational complexity, has been proposed by Chang et al. [123]. Unlike traditional CCA, which constrains the correlation matrix over the training batch to be an identity matrix, Soft CCA relaxes this constraint to the loss in (20), which minimizes the L1 loss of the off-diagonal elements of the constraint matrix. By expressing the CCA objective as a loss function, Soft CCA avoids some computationally expensive operations such as matrix inversion and singular value decomposition (SVD). Thus, Soft CCA is effective and more scalable in computation.

$L_{\mathrm{SDL}} = \sum_{i=1}^{k} \sum_{j \ne i} \phi_{ij}$   (20)
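A sketch of the decorrelation term in (20), written as a differentiable PyTorch loss over a batch of projected features: the batch correlation matrix is pushed towards the identity by penalizing the absolute values of its off-diagonal entries. In Soft CCA this term is combined with a distance loss between the two views, which is omitted here.

import torch

def soft_decorrelation_loss(H):
    """Eq. (20)-style term: L1 penalty on the off-diagonal entries of the correlation
    matrix of a (batch, k) representation, pushing it towards the identity."""
    H = (H - H.mean(dim=0)) / (H.std(dim=0) + 1e-6)
    corr = H.t() @ H / (H.size(0) - 1)                  # (k, k) batch correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.abs().sum()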
Compared to the other type of model in the coordinated framework, the cross-modal similarity methods, one of the advantages of DCCA is that it can be trained in an unsupervised manner. Due to these advantages, DCCA has been widely used for various multi-view and multimodal learning tasks, including word embedding in a multilingual context [124], [125], acoustic feature representation [121], matching images and text [29], music retrieval [126], and speech recognition [127], [128]. On the contrary, the drawback of DCCA is its higher computational complexity, which may limit its scalability in data size.

D. GENERATIVE ADVERSARIAL NETWORK
The generative adversarial network (GAN) is an emerging deep learning technique.
As an unsupervised learning method, it can be used for learning data representations without involving labels, which significantly lowers the dependence on manual annotations. Also, as a generative method, it can be used for generating high-quality novel samples according to the distribution of the training data. Since 2014, after being proposed by Goodfellow et al. [82], the generative adversarial learning strategy has been successfully used in various unimodal applications. One of the best-known applications is image synthesis [82], [129], [130], which generates high-quality images according to a random input drawn from a normal distribution. Other successful examples include image-to-image translation [131] and image super-resolution [132]. Most recently, the generative adversarial learning strategy has been further extended to multimodal cases such as text-to-image synthesis [15], [44], visual captioning [40], [43], cross-modal retrieval [30], multimodal feature fusion [4], and multimodal storytelling [133]. In this section, we briefly introduce the fundamental concepts of GANs and discuss their role in multimodal representation learning.

FIGURE 5. The conceptual structure of basic generative adversarial networks.

Generally, a generative adversarial network is composed of two components contesting with each other: a generative network G playing as a generator and a discriminative network D playing as a discriminator. The network G is responsible for generating new samples according to the learned data distribution, while the network D aims to discriminate between an instance generated by network G and an item sampled from the training set. Commonly, both components, G and D, are implemented via deep neural networks. The generator G can be considered a function mapping a vector in the latent space, z, into a sample in the data space, and this mapping can be formulated as $G(z; \theta_g) \rightarrow x$, where $\theta_g$ denotes the parameters of G. Similarly, the discriminator D can be formulated as $D(x; \theta_d) \rightarrow p$, mapping a matrix or a vector into a scalar probability value predicting whether a sample is drawn from the training data or not, where $\theta_d$ denotes the parameters of D and $p \in (0, 1)$. Although G generates novel samples from a distribution $P_g(x)$, it endeavors to capture the ground truth $P_{data}(x)$. Once the distribution $P_g$ estimates $P_{data}$ well enough, the discriminator D will be confused, and its prediction accuracy will be lowered. Theoretically, Goodfellow et al. [82] show that the global optimum can be achieved on condition that $P_g = P_{data}$. In such a case, the discriminator is unable to distinguish between them, and the predicted probability p will be 0.5 for all inputs.

$\min_G \max_D V(G, D)$   (21)

$V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$   (22)

The optimization objective of GANs is a solution of (21), where the function V(G, D) is the cross-entropy loss of the discriminator D formulated in (22). During the training process, G and D are updated in an iterative paradigm: while one of the components is updated, the parameters of the other are kept fixed. In step one, given samples from either the generator or the training dataset, the discriminator is trained to tell them apart; this objective is achieved by maximizing the function V. In step two, the generator is trained to produce samples sufficient to confuse the discriminator; this objective is achieved by minimizing the function V. In such an adversarial manner, both subnets evolve alternately.
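The alternating procedure described above can be written as a short PyTorch training loop; the toy data, network sizes, and the non-saturating generator loss (maximizing log D(G(z)) instead of literally minimizing V) are common practical choices rather than details taken from the cited works.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))       # z -> x
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 2) * 0.5 + 3.0           # stand-in for samples from p_data
    z = torch.randn(32, 64)
    fake = G(z)

    # Step one: update D to maximize V(G, D) -- tell real and generated samples apart.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step two: update G so that D labels generated samples as real.
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()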
Compared to classic representation learning methods, a visible difference for GANs is that the learning process of the data representation is not straightforward; it is rather implicit. Unlike traditional unsupervised representation methods, such as autoencoders, which learn a mapping from the data to the latent variables directly, GANs learn a reverse mapping from the latent variables to the data samples. Specifically, the generator maps a random vector into a distinctive sample. Thus, this random signal is a representation corresponding to the generated data. On condition that $P_g$ fits $P_{data}$ well, this random signal is a good enough representation for realistic training data.

However, despite the success of GANs in image synthesis, a disadvantage of basic GANs is that the latent representation is hard to interpret, since such a random representation has no connection with meaningful semantics. To improve the interpretability of this latent representation, Chen et al. [134] introduced a semantically meaningful method named InfoGAN, which separates the random noise vector into several groups, z and $c = (c_1, \ldots, c_L)$. By maximizing the mutual information between the latent variables c and the generator distribution G(z, c), the model encourages the different $c_i$ to represent uncoupled salient attributes. As a result, a modification of the value of $c_i$ leads to a change of its relevant data attributes, such as shape or style.

Another disadvantage of basic GANs is their lack of a direct mapping from the data to the latent space, which is critical for representation learning in traditional tasks such as retrieval and classification. To address this problem, some techniques equipped with an additional inference network have been proposed [135], [136]. Other typical models which can translate representations between the data space and the latent space bi-directionally include the Adversarially Learned Inference model (ALI) [137] and Bidirectional Generative Adversarial Networks (BiGANs) [138]. In these models, the generator comprises a pair of parallel networks: a decoder used for mapping a latent vector z into a novel sample x̂, …
FIGURE 7. Two methods used for improving the modality-invariant property via adversarial learning. The key idea is mapping paired inputs into a common subspace such that the discriminator cannot distinguish which modality a feature comes from. (a) Discriminate which modality a feature comes from. (b) Discriminate whether the input is a pair or not.

… of feature vectors, the distribution gap between different modalities will be minimized accordingly.

Based on the learning strategies of the first category, several models used for cross-modal retrieval have been proposed [4], [30], [143]. In these models, the adversarial process serves to enforce the distributions of the projected representations from different modalities to be closer to each other. The main difference between them is the way they preserve the intra-modality and inter-modality similarities simultaneously. For example, Wang et al. [30] proposed to learn representations that are modality-invariant and discriminative. In addition to the modality classifier, a label predictor is also integrated into this model to keep the learned features discriminative within each modality. Further, a triplet margin rank constraint is added to the label classifier such that inter-modality similarity can be preserved.

Peng et al. [4] proposed to learn discriminative common representations for bridging the heterogeneity gap. In their model, the generator is formed by a cross-modal autoencoder with a weight-sharing constraint, and the discriminator is composed of two kinds of discriminative modules: intra-modality and inter-modality discriminators. The generator seeks to project multimodal inputs into a common subspace with two useful properties, keeping semantic consistency within each modality and distribution consistency among modalities; on the contrary, the discriminators try to detect the inconsistency. Specifically, the intra-modality discriminator aims to distinguish the generated reconstruction feature from the original input, while the inter-modality discriminator endeavors to tell which modality a feature comes from.
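A minimal sketch of the Fig. 7(a) idea: a modality discriminator is trained to classify which modality an embedding comes from, while the encoders receive the opposite objective. The sign-flip formulation below is one simple way to express the adversarial signal (a gradient reversal layer is another); it is not the exact loss used in [4], [30], or [143].

import torch
import torch.nn as nn

modality_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
ce = nn.CrossEntropyLoss()

def adversarial_modality_losses(img_emb, txt_emb):
    """The discriminator learns to tell which modality an embedding comes from,
    while the encoders are trained to make that impossible."""
    feats = torch.cat([img_emb, txt_emb], dim=0)
    labels = torch.cat([torch.zeros(len(img_emb)), torch.ones(len(txt_emb))]).long()
    d_loss = ce(modality_clf(feats.detach()), labels)     # train the modality discriminator
    g_loss = -ce(modality_clf(feats), labels)              # encoders maximize its error
    return d_loss, g_loss                                   # optimize with separate optimizers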
The model proposed by Xu et al. [143] aims to learn cross-modal representations which are maximally correlated and statistically indistinguishable in the common subspace. They decompose the whole problem into three loss terms: an adversarial loss which is utilized to minimize the statistical difference between the distributions of different modalities, a feature discrimination loss which ensures the representations are discriminative within each modality, and a cross-modal correlation loss which is responsible for keeping the cross-modal similarity structure. Specifically, the cross-modal correlation loss is measured by the squared distance between pairs of samples that come from different modalities: if a pair comes from the same category, its distance is minimized; otherwise, it is maximized.

As Fig. 7(b) shows, the cross-modal adversarial model of the second category contains an encoder-decoder network, which translates one modality into another. For example, given a pair of inputs (v, t), the encoder maps t into a representation vector; then the decoder, playing as the generator, maps this vector into a reproduced sample v̂. The generated sample v̂ is expected to be sufficiently similar to v, such that the reproduced pair (v̂, t) is considered a real pair by the discriminator. On condition that the learned representation can be translated into another modality soundly, it is believable that the cross-modal invariant property has been preserved. An example in this category is the model proposed by Gu et al. [86], which integrated a generative adversarial network into their model to train a text encoder. In the following, more examples will be shown to demonstrate how this model can be used in practice.

Zhang et al. [144] adopted GANs to model cross-modal hashing in an unsupervised fashion. In addition to preserving inter-modality and intra-modality correlations in the common hash space, the property of preserving the manifold structure across different modalities is also desired in their model. Given a sample from one modality, the generator is trained to select a sample from another modality located in the same manifold. Then, the discriminator determines whether the generated pair of samples belongs to the same manifold structure or not. Here, the hash codes play a key role for both the generator and the discriminator. Specifically, the generator selects samples conditioned on hash codes; also, the discriminator judges their correlation between modalities based on hash codes. The adversarial learning process is used for enhancing the property of preserving the cross-modal manifold structure in a common hash space.

Wu et al. [145] extended CycleGAN [146] to learn cross-modal hash functions for the condition in which no paired training samples are available. CycleGAN can be seen as a special case of the second category, which includes a pair of encoder-decoders, each of them designed to translate one modality into another.
For example, given an input v, the model translates v into t, and then t is reversely translated back to v̂; it is expected that v ≈ v̂. Similarly, given an input t, a reconstructed t̂ is expected to be roughly equal to t. Based on the cycle-consistency constraint in both modalities, the model can be trained in the absence of paired training samples.

One of the advantages of GANs is that they can be trained by unsupervised learning, which significantly lowers the dependence on manual annotations. Another advantage is their powerful ability to generate high-quality novel samples according to the distribution of the training data. However, though a unique global optimum exists theoretically, it is challenging to train a GAN system, which may suffer from training instability, either ''collapsing'' or failing to converge [147]. Although several improvements have been proposed [147]–[150], the way to stabilize the training of GANs remains an open problem.

E. ATTENTION MECHANISM
The attention mechanism allows a model to focus on specific regions of a feature map or specific time steps of a feature sequence. Via the attention mechanism, not only can improved performance be achieved, but better interpretability of the feature representations can also be seen. This mechanism mimics the human ability to extract the most discriminative information for recognition: rather than using all of the information at once, the attention decision process prefers to concentrate selectively on the part of the scene which is needed [151]. Recently, this method has demonstrated its unique power in improving performance in many applications, such as visual classification [152]–[154], neural machine translation [155], [156], speech recognition [92], image captioning [13], [91], video description [42], [90], visual question answering [24], [157], cross-modal retrieval [31], [158], and sentiment analysis [22].

According to whether a key is used while selecting part of the features, attention mechanisms can be categorized into two groups: key-based attention and keyless attention. Key-based attention uses a key to search for salient localized features. Taking image captioning as an example [13], its typical structure can be illustrated as in Fig. 8, where a CNN network encodes the image into a feature set $\{a_i\}$, and then an RNN network decodes the input into hidden states $\{h_t\}$. At time step t, the output $y_t$ is predicted based on $h_t$ and $c_t$, where $c_t$ is the salient feature summarized from $\{a_i\}$.

FIGURE 8. The typical structure of the key-based attention mechanism. The attention module uses the current state ($h_t$) as a key to search for salient elements in the source ($\{a_i\}$).

During the process of extracting the salient feature $c_t$, the current state $h_t$ in the decoder plays as a key and the encoder states $\{a_i\}$ play as a source to be searched [159]. The computation of the attention mechanism [13], [156] can be defined as in (26) to (28), and the compatibility scores between the key and the sources can be evaluated via one of the three different functions listed in (29).

$e_{ti} = \mathrm{score}(a_i, h_t)$   (26)

$\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{i=1}^{L} \exp(e_{ti})}$   (27)

$c_t = \sum_{i=1}^{L} \alpha_{ti} a_i$   (28)

$\mathrm{score}(a_i, h_t) = \begin{cases} h_t^T a_i \\ h_t^T W_a a_i \\ v_a^T \tanh(W_a [h_t; a_i]) \end{cases}$   (29)
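Equations (26)-(29) amount to a few lines of code. The sketch below uses the bilinear score from (29); shapes are stated in the comments and the projection matrix W_a is assumed to be learned elsewhere.

import torch

def attend(a, h_t, W_a):
    """Eqs. (26)-(28) with the bilinear score from eq. (29): h_t queries the encoder
    states a and returns their weighted summary c_t.
    a: (L, d_a), h_t: (d_h,), W_a: (d_h, d_a)."""
    scores = a @ (W_a.t() @ h_t)             # e_ti = h_t^T W_a a_i, shape (L,)
    alphas = torch.softmax(scores, dim=0)    # eq. (27)
    return alphas @ a                        # c_t = sum_i alpha_ti a_i, eq. (28)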
Key-based attention is widespread in visual description applications [13], [90], [160], where an encoder-decoder network is commonly used. It provides an approach to evaluate the importance of features within a modality or among modalities. On the one hand, the attention mechanism can be used to select the most salient features within a modality; on the other hand, it can be used to balance the contributions of the modalities when fusing several modalities.

In order to recognize and describe the objects contained in the visual modality, a set of localized region features, which potentially encode different objects distinctly, would be more helpful than a single feature vector. By dynamically selecting the most salient regions in an image or time steps of a video sequence, both system performance and noise tolerance can be improved. For example, Xu et al. [13] adopted the attention mechanism to detect salient objects in an image and fused them with text features in a decoder unit for captioning. In such a case, guided by the current text generated at time step t, the attention module is used to search for local regions appropriate for predicting the next word.

For locating local features more accurately, several attention models have been proposed. Yang et al. [157] proposed a stacked attention network for searching image regions. They suggested that multiple steps of search or reasoning are helpful to locate fine-grained regions. In the beginning, the model locates one or more local regions in the image by attention, using language features as a key, and then combines the attended visual and language features into a vector, which also plays as the key used for the next iteration. After K steps, not only are the appropriate local regions located, but both features are fused.
Zhu et al. [161] proposed a structured attention model to capture the semantic structure among image regions, and their experiments showed that this model is capable of inferring spatial relations and attending to the right region. Chen et al. [162] proposed to incorporate spatial and channel-wise attention in a CNN network. In their model, not only local regions but also channels of the CNN features are filtered simultaneously.

So far, attention models are mostly trained using indirect cues because of the lack of explicit attention annotations. Alternatively, Gan et al. [163] trained the attention module using direct supervision. They collected link information between visual segments and words from several datasets and then utilized the link information to guide the training of the attention module explicitly. The experiments showed that improved performance could be achieved.

Balancing the contributions of different modalities is a key issue that should be considered when fusing multimodal features. In contrast to concatenation or fixed-weight fusion methods, an attention-based method can adaptively balance the contributions of different modalities. Several pieces of research [90], [91], [164] have reported that dynamically assigning weights to modality-specific features conditioned on a context is helpful for improving application performance.

Hori et al. [90] proposed to tackle multimodal fusion based on attention for video description. In addition to attending to specific regions and time steps, the proposed method highlights attending to modality-specific information. After the modality-specific features have been extracted, the attention module produces appropriate weights to combine the features from different modalities based on the context. In a cross-modal retrieval task, Chen et al. [164] adopted a similar strategy to adaptively fuse modalities and filter out unrelated information within each modality according to search keys.

Lu et al. [91] introduced an adaptive attention framework to determine whether or not to include a visual feature during generation of the caption. They argued that some words, such as ''the'', are not related to any visual object; therefore, no visual feature is needed in this case. Supposing that the visual feature is excluded, the decoder would just depend on the language features to predict a word.

Keyless attention is mostly used for classification or regression tasks. In such an application scenario, since the result is generated in a single step, it is hard to define a key to guide the attention module. Alternatively, the attention is applied directly to the localized features without any key involved. The computation functions can be illustrated as follows:

$\mathrm{score}(a_i) = \begin{cases} v^T a_i \\ v^T \tanh(W a_i) \end{cases}$   (33)
dynamically assigning weights to modality-specific features A special issue on multimodal feature fusion is fusing fea-
condition on a context is helpful to improve application tures from several variable length sequences such as videos,
performance. audios, sentences or a set of localized features. A simple way
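The following sketch shows one way such context-conditioned weighting can be implemented. It is a generic illustration of attention-based fusion rather than the exact architecture of [90], [91], or [164], and all layer sizes are assumed for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttentionFusion(nn.Module):
    """Assigns a weight to each modality-specific feature conditioned on a
    context vector, then fuses the weighted features (a generic sketch of
    attention-based fusion, not a reimplementation of any cited model)."""
    def __init__(self, dims, context_dim, common_dim):
        super().__init__()
        # project every modality into a common space before weighting
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in dims])
        self.score = nn.Linear(context_dim + common_dim, 1)

    def forward(self, feats, context):
        # feats: list of (batch, d_m) modality features, context: (batch, context_dim)
        z = [torch.tanh(p(x)) for p, x in zip(self.proj, feats)]       # common space
        e = [self.score(torch.cat([context, zm], dim=1)) for zm in z]  # relevance per modality
        beta = F.softmax(torch.cat(e, dim=1), dim=1)                   # (batch, M) modality weights
        fused = sum(beta[:, m:m + 1] * z[m] for m in range(len(z)))    # weighted sum
        return fused, beta

# usage: audio, visual, and text features weighted by a query/decoder context
fusion = ModalityAttentionFusion(dims=[128, 2048, 300], context_dim=512, common_dim=256)
audio, visual, text = torch.randn(4, 128), torch.randn(4, 2048), torch.randn(4, 300)
context = torch.randn(4, 512)
fused, beta = fusion([audio, visual, text], context)
```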
Hori et al. [90] proposed to tackle multimodal fusion based on attention for video description. In addition to attending to specific regions and time steps, the proposed method highlights attending to modality-specific information. After modality-specific features have been extracted, the attention module produces appropriate weights to combine the features from different modalities based on the context. In a cross-modal retrieval task, Chen et al. [164] adopted a similar strategy to adaptively fuse modalities and to filter out unrelated information within each modality according to search keys. Lu et al. [91] introduced an adaptive attention framework to determine whether a visual feature should be included during caption generation. They argued that some words such as ‘‘the’’ are not related to any visual object, so no visual feature is needed in this case. If the visual feature is excluded, the decoder depends only on the language features to predict the word.
Keyless attention is mostly used for classification or regression tasks. In such an application scene, since the result is generated in a single step, it is hard to define a key to guide the attention module. Alternatively, the attention is applied directly to the localized features without any key involved. The computation functions can be illustrated as follows:

e_i = score(a_i),   (30)
\alpha_i = \exp(e_i) / \sum_{k=1}^{L} \exp(e_k),   (31)
c = \sum_{i=1}^{L} \alpha_i a_i,   (32)
score(a_i) = v^T a_i   or   v^T \tanh(W a_i),   (33)

where v and W are learnable parameters.
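A minimal PyTorch sketch of this keyless pooling, directly mirroring Eqs. (30)–(33), is given below; the module and dimension names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeylessAttention(nn.Module):
    """Keyless attention pooling following Eqs. (30)-(33): score each localized
    feature a_i on its own, softmax-normalize, and sum (a minimal sketch)."""
    def __init__(self, feat_dim, attn_dim=None):
        super().__init__()
        if attn_dim is None:
            self.score = nn.Linear(feat_dim, 1, bias=False)        # score(a_i) = v^T a_i
        else:
            self.score = nn.Sequential(                            # score(a_i) = v^T tanh(W a_i)
                nn.Linear(feat_dim, attn_dim, bias=False), nn.Tanh(),
                nn.Linear(attn_dim, 1, bias=False))

    def forward(self, a):
        # a: (batch, L, feat_dim) -- a variable-length set or sequence of features
        e = self.score(a)                      # Eqs. (30), (33): unnormalized scores e_i
        alpha = F.softmax(e, dim=1)            # Eq. (31): attention weights alpha_i
        c = (alpha * a).sum(dim=1)             # Eq. (32): pooled representation c
        return c, alpha.squeeze(-1)

# usage: pool 20 frame-level video features into one fixed-length vector
pool = KeylessAttention(feat_dim=512, attn_dim=128)
frames = torch.randn(4, 20, 512)
c, alpha = pool(frames)
```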
Due to its nature of selecting prominent cues from raw input, the keyless attention mechanism is suitable for multimodal feature fusion tasks, which suffer from issues such as semantic conflict, duplication, and noise. Through the attention mechanism, we obtain an approach to evaluating the relationship between parts of modalities, which may be complementary or supplementary. By selecting complementary features from different modalities and fusing them into a single representation, the semantic ambiguity can be eased.

The advantage of the attention mechanism in multimodal fusion has been proven in many applications. For example, Long et al. [165] compared four multimodal fusion methods and demonstrated that the attention-based method is the most effective one for addressing the video classification problem. They performed experiments in different setups: early fusion, middle-level fusion, attention-based fusion, and late fusion, which correspond to different fusion points. The experimental results also show that attention-based fusion is robust across various datasets. Other studies have likewise demonstrated the promising perspective of attention-based methods for multimodal feature fusion [166], [167].

A special issue in multimodal feature fusion is fusing features from several variable-length sequences, such as videos, audios, sentences, or a set of localized features. A simple way to tackle this problem is to fuse each sequence independently via the attention mechanism. After each sequence has been combined into a weighted representation of fixed length, the representations are concatenated or fused into a single vector. This way is beneficial for fusing several sequences even when their lengths differ, which is common in multimodal datasets. However, such a simplified method does not explicitly consider the interaction between modalities and thus may ignore fine-grained cross-modal relationships.

A solution to modeling the interactions between attention modules is to construct a shared context as an extra condition for the computation of the modality-specific attention modules. For example, Lu et al. [24] proposed to construct a global context by calculating the similarity between visual and text features. Nam et al. [158] used an iterative strategy to update the shared context and the modality-specific attention distributions: firstly, the modality-specific features are summarized based on the attention modules; then they are fused into a context used for the next iteration.
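The sketch below illustrates this shared-context idea: each modality is attended conditioned on a common context vector, which is then refreshed from the attended features and reused in the next iteration. It is a rough illustration of the iterative strategy described above, not the exact architecture of [158]; all layer names and sizes are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedContextCoAttention(nn.Module):
    """Iteratively attends each modality conditioned on a shared context and
    refreshes the context from the attended features (a rough sketch of the
    iterative strategy, not the exact architecture of [158])."""
    def __init__(self, dim, steps=2):
        super().__init__()
        self.steps = steps
        self.score_v = nn.Linear(2 * dim, 1)   # relevance of a visual region to the context
        self.score_t = nn.Linear(2 * dim, 1)   # relevance of a word to the context
        self.update = nn.Linear(2 * dim, dim)  # fuse attended features into a new context

    def attend(self, feats, context, scorer):
        # feats: (batch, L, dim), context: (batch, dim)
        ctx = context.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = F.softmax(scorer(torch.cat([feats, ctx], dim=-1)), dim=1)
        return (alpha * feats).sum(dim=1)

    def forward(self, visual, text, context):
        for _ in range(self.steps):
            v = self.attend(visual, context, self.score_v)
            t = self.attend(text, context, self.score_t)
            context = torch.tanh(self.update(torch.cat([v, t], dim=1)))
        return context  # shared representation after the final iteration

coattn = SharedContextCoAttention(dim=256, steps=2)
visual, text = torch.randn(4, 49, 256), torch.randn(4, 12, 256)
context = torch.zeros(4, 256)
joint = coattn(visual, text, context)
```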
TABLE 3. A summary of the key issues, advantages, and disadvantages of each framework or typical model described in this paper. Note that both the cross-modal similarity model and deep canonical correlation analysis (DCCA) belong to the coordinated representation framework.
Recently, a novel learning strategy named the multi-attention mechanism, which utilizes several attention modules to extract different types of features from the same input data, has been exploited. Generally, each type of feature locates in a distinct subspace and reflects different semantics. Hence, the multi-attention mechanism is helpful in discovering different inter-modal dynamics. For example, Zadeh et al. [22] proposed to discover diverse interactions between modalities using the multi-attention mechanism. At each time step t, the hidden states h^m_t from all modalities are concatenated into a vector h_t; then multiple attentions are applied to h_t to extract K differently weighted vectors, which reflect distinctive cross-modal relationships. After that, all K vectors are fused into a single vector that represents the shared hidden state across modalities at time t.
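A compact sketch of this idea is shown below: K parallel attention blocks reweight the concatenated multimodal state, and the resulting views are fused into one shared state. It is a simplified illustration inspired by the multi-attention idea of [22], not a faithful reimplementation; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttentionBlock(nn.Module):
    """K parallel attention blocks over the concatenated multimodal state h_t,
    each extracting a differently weighted view; the K views are then fused
    (a simplified sketch inspired by the multi-attention idea in [22])."""
    def __init__(self, state_dim, num_attentions, out_dim):
        super().__init__()
        self.K = num_attentions
        self.attn = nn.Linear(state_dim, num_attentions * state_dim)  # K sets of weights
        self.fuse = nn.Linear(num_attentions * state_dim, out_dim)    # merge the K views

    def forward(self, h):
        # h: (batch, state_dim) -- concatenation of the modality-specific hidden states
        b, d = h.shape
        logits = self.attn(h).view(b, self.K, d)
        weights = F.softmax(logits, dim=-1)       # each of the K rows attends over the dims of h
        views = weights * h.unsqueeze(1)          # (batch, K, state_dim) weighted views
        fused = torch.tanh(self.fuse(views.reshape(b, -1)))
        return fused

# usage: audio/visual/text hidden states at step t are concatenated first
h_t = torch.cat([torch.randn(4, 64), torch.randn(4, 128), torch.randn(4, 64)], dim=1)
block = MultiAttentionBlock(state_dim=256, num_attentions=4, out_dim=128)
shared_state = block(h_t)  # shared hidden state across modalities at time t
```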
Another example is the model from Zhou et al. [167], which fuses heterogeneous features of user behaviors based on the multi-attention mechanism. Here, a user behavior type can be seen as a distinctive modality, because different types of behaviors have distinctive attributes. The authors supposed that the semantics of a user behavior can be affected by the context; hence, the semantic intensity of that behavior also depends on the context. Firstly, the model projects all types of behaviors into a concatenated vector denoted by S, which is a global feature and plays the role of the context in the attention module. Then, S is projected into K latent semantic subspaces to represent different semantics. After that, the model fuses the K subspaces through the attention module.

One of the advantages of the attention mechanism is its capability to select salient and discriminative localized features, which can not only improve the performance of multimodal representations but also lead to better interpretability. Additionally, by selecting prominent cues, this technique can help to tackle issues such as noise and to fuse complementary semantics into multimodal representations.

IV. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we provided a comprehensive survey on deep multimodal representation learning. According to the underlying structures in which different modalities are integrated, we categorize deep multimodal representation learning methods into three groups of frameworks: joint representation, coordinated representation, and encoder-decoder. Additionally, we summarize some typical models in this area, which range from conventional models to newly developed technologies, including probabilistic graphical models, multimodal autoencoders, deep canonical correlation analysis, generative adversarial networks, and the attention mechanism. For each framework or model, we describe its basic structure, learning objective, and application scenes. We also discuss their key issues, advantages, and disadvantages, which are briefly summarized in Table 3.

When it comes to the learning objectives and key issues of these frameworks and typical models, we can clearly see that the primary objective of multimodal representation learning is to narrow the distribution gap in a joint semantic subspace while keeping the modality-specific semantics intact. Different methods achieve this objective in different ways: the joint representation framework maps all modalities into a global common subspace; the coordinated representation framework maximizes the similarity or correlation between modalities while keeping each modality independent; the encoder-decoder framework maximizes the conditional distribution among modalities and keeps their semantics consistent; probabilistic graphical models maximize the joint probability distribution across modalities; multimodal autoencoders endeavor to keep the modality-specific distributions intact by minimizing reconstruction errors; generative adversarial networks aim to narrow the distribution difference between modalities through an adversarial process; and the attention mechanism selects salient features from the modalities such that they are similar in local manifolds or complementary with each other.
With the rapid development of deep multimodal representation learning methods, the need for training data keeps growing. However, the volume of current multimodal datasets is limited because of the high cost of manual labeling: the acquisition of high-quality labeled datasets is extremely labor-consuming. A popular solution to this problem is transfer learning, which transfers general knowledge from a source domain with a large-scale dataset to a target domain with insufficient data [168]. Transfer learning has been widely used in the multimodal representation learning area and has been shown to be effective in improving performance on many multimodal tasks. One example is the reuse of pre-trained CNN networks such as VGGNet [48] and ResNet [49], which can be used for extracting image features in a multimodal system. A second example is word embeddings such as word2vec [50] and GloVe [51]. Although these representations of words are trained only on general-purpose language corpora, they can be transferred to other datasets directly, even without fine-tuning.
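The following sketch shows the typical reuse pattern described here: a pre-trained CNN is frozen and used as an image feature extractor, and pre-trained word vectors are loaded into a frozen embedding layer. The specific model, vector matrix, and token ids are assumptions for illustration, and the torchvision `pretrained` flag is the classic interface (newer versions replace it with a `weights=` argument).

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Reuse a pre-trained CNN as a frozen image encoder.
cnn = models.resnet50(pretrained=True)
cnn.fc = nn.Identity()          # drop the classifier, keep the 2048-d pooled feature
cnn.eval()
for p in cnn.parameters():
    p.requires_grad = False     # no fine-tuning: pure feature extraction

with torch.no_grad():
    image_feat = cnn(torch.randn(1, 3, 224, 224))    # (1, 2048)

# Reuse pre-trained word embeddings (e.g., word2vec/GloVe vectors loaded into a
# matrix `pretrained_vectors` of shape (vocab_size, 300); random data here).
pretrained_vectors = torch.randn(10000, 300)
embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
token_ids = torch.tensor([[12, 57, 4031]])
text_feat = embed(token_ids).mean(dim=1)             # (1, 300) averaged word vectors

# Both frozen encoders can now feed a multimodal fusion model trained on a small dataset.
```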
In contrast to the widespread use of this convenient and effective knowledge transfer strategy in the image and language modalities, similar methods are not yet available for the audio or video modality. Hence, the deep networks used for extracting audio or video features more easily suffer from overfitting due to the limited training instances. As a result, in many applications such as sentiment analysis and emotion recognition, which are based on fused multimodal features, it is relatively hard to improve performance when only audio and video data are available. Alternatively, most works have to rely increasingly on a stronger language model. Although some efforts have been made to transfer cross-domain knowledge to the audio and video modalities, more convenient and effective methods are still required in the multimodal representation learning area.

In addition to knowledge transfer within the same modality, cross-modal transfer learning, which aims to transfer knowledge from one modality to another, is also a significant research direction. For example, recent studies show that knowledge transferred from images can help to improve the performance of video analysis tasks [169]. Besides, an alternative but more challenging approach is transfer learning between multimodal datasets. The advantage of this method is that the correlation information among different modalities in the source domain can also be exploited, while its weakness is its complexity: both the modality difference and the domain discrepancy must be tackled simultaneously.

Another feasible future direction for tackling the reliance on large-scale labeled datasets is unsupervised or weakly supervised learning, which can be trained using the ubiquitous multimodal data generated by Internet users. Unsupervised learning has been widely used for dimensionality reduction and feature extraction on unlabeled datasets. That is why conventional unsupervised learning methods such as multimodal autoencoders are still active today, although their performance is not as good as that of CNN or RNN features. For a similar reason, generative adversarial networks have recently attracted much attention in the multimodal learning area.

Most recently, weakly supervised learning has demonstrated its potential in exploiting the useful knowledge hidden behind multimodal data. For example, given an image and its description, it is highly possible that an image segment can be described by some words in the sentence. Although the one-to-one correspondences between them are fully unknown, the work proposed by Karpathy and Fei-Fei [76] shows that these hidden relationships can be discovered via weakly supervised learning. Potentially, a more promising application of this type of weakly supervised method is video analysis, where different modalities such as actions, audio, and language have been roughly aligned on the timeline.

For a long time, multimodal representation learning has suffered from issues such as semantic conflict, duplication, and noise. Although the attention mechanism can be used to address these problems partially, it works implicitly and cannot be controlled actively. A more promising method for this problem is integrating reasoning ability into multimodal representation learning networks. Via a reasoning mechanism, a system would have the capability to actively select the evidence that is sorely needed, which could play an important role in mitigating the impact of these troubling issues. We believe that the close combination of representation learning and reasoning mechanisms will endow machines with intelligent cognitive capabilities.

REFERENCES
[1] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, ‘‘Multimodal deep learning,’’ in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[2] S. Wang and W. Guo, ‘‘Sparse multigraph embedding for multimodal feature representation,’’ IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1454–1466, Jul. 2017.
[3] H. McGurk and J. MacDonald, ‘‘Hearing lips and seeing voices,’’ Nature, vol. 264, no. 5588, p. 746, 1976.
[4] Y. Peng, J. Qi, and Y. Yuan. (2017). ‘‘CM-GANs: Cross-modal generative adversarial networks for common representation learning.’’ [Online]. Available: https://arxiv.org/abs/1710.05106
[5] N. Rasiwasia et al., ‘‘A new approach to cross-modal multimedia retrieval,’’ in Proc. 18th ACM Int. Conf. Multimedia, 2010, pp. 251–260.
[6] Y. Liu, X. Feng, and Z. Zhou, ‘‘Multimodal video classification with stacked contractive autoencoders,’’ Signal Process., vol. 120, pp. 761–766, Mar. 2016.
[7] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan, ‘‘Zero-shot event detection using multi-modal fusion of weakly supervised concepts,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2665–2672.
[8] A. Habibian, T. Mensink, and C. G. M. Snoek, ‘‘Video2vec embeddings recognize events when examples are scarce,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 10, pp. 2089–2103, Oct. 2017.
[9] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, ‘‘Fusing audio, visual and textual clues for sentiment analysis from multimodal content,’’ Neurocomputing, vol. 174, pp. 50–59, Jan. 2016.
[10] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, ‘‘Tensor fusion network for multimodal sentiment analysis,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2017, pp. 1103–1114.
[11] F. Feng, X. Wang, and R. Li, ‘‘Cross-modal retrieval with correspondence autoencoder,’’ in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 7–16.
[12] J. Qi and Y. Peng, ‘‘Cross-modal bidirectional translation via rein- [36] R. Kiros, R. Salakhutdinov, and R. S. Zemel. (2014). ‘‘Unifying
forcement learning,’’ in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, visual-semantic embeddings with multimodal neural language models.’’
pp. 2630–2636. [Online]. Available: https://arxiv.org/abs/1411.2539
[13] K. Xu et al., ‘‘Show, attend and tell: Neural image caption generation [37] X. Huang, Y. Peng, and M. Yuan, ‘‘Cross-modal common representation
with visual attention,’’ in Proc. 32nd Int. Conf. Mach. Learn., 2015, learning by hybrid transfer network,’’ in Proc. 26th Int. Joint Conf. Artif.
pp. 2048–2057. Intell., 2017, pp. 1893–1900.
[14] J. Donahue et al., ‘‘Long-term recurrent convolutional networks for visual [38] X. Huang and Y. Peng, ‘‘Deep cross-media knowledge transfer,’’ in Proc.
recognition and description,’’ in Proc. IEEE Conf. Comput. Vis. Pattern IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8837–8846.
Recognit., Jun. 2015, pp. 2625–2634. [39] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, ‘‘Show and tell: A neural
[15] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, image caption generator,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
‘‘Generative adversarial text to image synthesis,’’ in Proc. 33rd Int. Conf. Recognit., Jun. 2015, pp. 3156–3164.
Mach. Learn., 2016, pp. 1060–1069. [40] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, ‘‘Recurrent
[16] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, topic-transition GAN for visual paragraph generation,’’ in Proc. IEEE Int.
no. 7553, p. 436, 2015. Conf. Comput. Vis., Jun. 2017, pp. 3362–3371.
[17] J. Zhao, X. Xie, X. Xu, and S. Sun, ‘‘Multi-view learning overview: [41] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and
Recent progress and new challenges,’’ Inf. Fusion, vol. 38, pp. 43–54, K. Saenko, ‘‘Translating videos to natural language using deep recurrent
Nov. 2017. neural networks,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput.
[18] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, ‘‘Multimodal machine learn- Linguistics, Hum. Lang. Technol., 2015, pp. 1494–1504.
ing: A survey and taxonomy,’’ IEEE Trans. Pattern Anal. Mach. Intell., [42] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, ‘‘Video captioning
vol. 41, no. 2, pp. 423–443, Feb. 2019. with attention-based LSTM and semantic consistency,’’ IEEE Trans.
[19] Y. Li, M. Yang, and Z. Zhang. (2016). ‘‘A survey of multi-view represen- Multimedia, vol. 19, no. 9, pp. 2045–2055, Sep. 2017.
tation learning.’’ [Online]. Available: https://arxiv.org/abs/1610.01206 [43] Y. Yang et al., ‘‘Video captioning by adversarial LSTM,’’ IEEE Trans.
[20] D. Ramachandram and G. W. Taylor, ‘‘Deep multimodal learning: A sur- Image Process., vol. 27, no. 11, pp. 5600–5611, Nov. 2018.
vey on recent advances and trends,’’ IEEE Signal Process. Mag., vol. 34, [44] H. Zhang et al., ‘‘StackGAN: Text to photo-realistic image synthesis
no. 6, pp. 96–108, Nov. 2017. with stacked generative adversarial networks,’’ in Proc. IEEE Int. Conf.
[21] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, ‘‘Exploiting Comput. Vis., Jun. 2017, pp. 5907–5915.
feature and class relationships in video categorization with regularized [45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
deep neural networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, ing applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11,
no. 2, pp. 352–364, Feb. 2018. pp. 2278–2324, Nov. 1998.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
[22] A. Zadeh, P. P. Liang, S. Poria, E. Cambria, P. Vij, and L.-P. Morency,
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf.
‘‘Multi-attention recurrent network for human communication compre-
Process. Syst., 2012, pp. 1097–1105.
hension,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–35.
[47] C. Szegedy et al., ‘‘Going deeper with convolutions,’’ in Proc. IEEE Conf.
[23] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach,
Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
‘‘Multimodal compact bilinear pooling for visual question answering
[48] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for
and visual grounding,’’ in Proc. Conf. Empirical Methods Natural Lang.
large-scale image recognition,’’ in Proc. Int. Conf. Learn. Represent.,
Process., 2016, pp. 457–468.
2015, pp. 1–14.
[24] J. Lu, J. Yang, D. Batra, and D. Parikh, ‘‘Hierarchical question-image
[49] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
co-attention for visual question answering,’’ in Proc. Adv. Neural Inf.
recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
Process. Syst., 2016, pp. 289–297.
pp. 770–778.
[25] Y. Kim, H. Lee, and E. M. Provost, ‘‘Deep learning for robust feature
[50] T. Mikolov, K. Chen, G. Corrado, and J. Dean. (2013). ‘‘Efficient esti-
generation in audiovisual emotion recognition,’’ in Proc. IEEE Int. Conf.
mation of word representations in vector space.’’ [Online]. Available:
Acoust., Speech Signal Process., May 2013, pp. 3687–3691
https://arxiv.org/abs/1301.3781
[26] L. Pang and C.-W. Ngo, ‘‘Mutlimodal learning with deep Boltzmann [51] J. Pennington, R. Socher, and C. D. Manning, ‘‘GloVe: Global vectors for
machine for emotion prediction in user generated videos,’’ in Proc. 5th word representation,’’ in Proc. Conf. Empirical Methods Natural Lang.
ACM Int. Conf. Multimedia Retr., 2015, pp. 619–622. Process., 2014, pp. 1532–1543.
[27] J. Huang and B. Kingsbury, ‘‘Audio-visual deep learning for noise robust [52] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, ‘‘Character-aware neu-
speech recognition,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal ral language models,’’ in Proc. 30th AAAI Conf. Artif. Intell., 2016,
Process., May 2013, pp. 7596–7599. pp. 2741–2749.
[28] F. Feng, R. Li, and X. Wang, ‘‘Deep correspondence restricted [53] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, ‘‘Enriching word
Boltzmann machine for cross-modal retrieval,’’ Neurocomputing, vectors with subword information,’’ Trans. Assoc. Comput. Linguistics,
vol. 154, pp. 50–60, Apr. 2015. vol. 5, pp. 135–146, Dec. 2017.
[29] F. Yan and K. Mikolajczyk, ‘‘Deep correlation for matching images and [54] R. Sennrich, B. Haddow, and A. Birch, ‘‘Neural machine translation of
text,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, rare words with subword units,’’ in Proc. 54th Annu. Meeting Assoc.
pp. 3441–3450. Comput. Linguistics, vol. 1, 2016, pp. 1715–1725.
[30] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, ‘‘Adversarial [55] H. Peng, E. Cambria, and X. Zou, ‘‘Radical-based hierarchical embed-
cross-modal retrieval,’’ in Proc. 25th ACM Int. Conf. Multimedia, 2017, dings for Chinese sentiment analysis at sentence level,’’ in Proc. 13th Int.
pp. 154–162. Flairs Conf., 2017, pp. 1–6.
[31] Y. Peng, J. Qi, and Y. Yuan, ‘‘Modality-specific cross-modal similarity [56] J. L. Elman, ‘‘Finding structure in time,’’ Cognit. Sci., vol. 14, no. 2,
measurement with recurrent attention network,’’ IEEE Trans. Image Pro- pp. 179–211, Mar. 1990.
cess., vol. 27, no. 11, pp. 5585–5599, Nov. 2018. [57] Y. Bengio, P. Simard, and P. Frasconi, ‘‘Learning long-term dependencies
[32] R. Socher, Q. V. L. A. Karpathy, C. D. Manning, and A. Y. Ng, ‘‘Grounded with gradient descent is difficult,’’ IEEE Trans. Neural Netw., vol. 5, no. 2,
compositional semantics for finding and describing images with sen- pp. 157–166, Mar. 1994.
tences,’’ Trans. Assoc. Comput. Linguistics, vol. 2, no. 1, pp. 207–218, [58] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural
2014. Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[33] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, ‘‘Jointly modeling embedding [59] F. A. Gers, J. Schmidhuber, and F. Cummins, ‘‘Learning to forget:
and translation to bridge video and language,’’ in Proc. IEEE Conf. Continual prediction with LSTM,’’ Neural Comput., vol. 12, no. 10,
Comput. Vis. Pattern Recognit., Jun. 2016, pp. 4594–4602. pp. 2451–2471, 2000.
[34] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov, [60] K. Cho et al., ‘‘Learning phrase representations using RNN encoder–
‘‘DeViSE: A deep visual-semantic embedding model,’’ in Proc. 26th Int. decoder for statistical machine translation,’’ in Proc. Conf. Empirical
Conf. Neural Inf. Process. Syst., vol. 2, 2013, pp. 2121–2129. Methods Natural Lang. Process., 2014, pp. 1724–1734.
[35] A. Lazaridou and M. Baroni, ‘‘Combining language and vision with [61] R. Jozefowicz, W. Zaremba, and I. Sutskever, ‘‘An empirical exploration
a multimodal skip-gram model,’’ in Proc. Conf. North Amer. Chapter of recurrent network architectures,’’ in Proc. 32nd Int. Conf. Mach.
Assoc. Comput. Linguistics, Hum. Lang. Technol., 2015, pp. 153–163. Learn., 2015, pp. 2342–2350.
[62] Y. Dai, W. Guo, X. Chen, and Z. Zhang, ‘‘Relation classification via [86] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, ‘‘Look, imagine and
LSTMs based on sequence and tree structure,’’ IEEE Access, to be match: Improving textual-visual cross-modal retrieval with generative
published. models,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[63] M. Schuster and K. K. Paliwal, ‘‘Bidirectional recurrent neural net- pp. 7181–7189.
works,’’ IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, [87] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘‘BLEU: A method
Nov. 1997. for automatic evaluation of machine translation,’’ in Proc. 40th Annu.
[64] A. Graves and J. Schmidhuber, ‘‘Framewise phoneme classification with Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
bidirectional LSTM and other neural network architectures,’’ Neural [88] G. Kulkarni et al., ‘‘Baby talk: Understanding and generating simple
Netw., vol. 18, no. 5, pp. 602–610, 2005. image descriptions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[65] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. (2014). ‘‘Empirical Jun. 2011, pp. 1601–1608.
evaluation of gated recurrent neural networks on sequence modeling.’’ [89] S. Guadarrama et al., ‘‘YouTube2Text: Recognizing and describing arbi-
[Online]. Available: https://arxiv.org/abs/1412.3555 trary activities using semantic hierarchies and zero-shot recognition,’’ in
[66] Y. Kim, ‘‘Convolutional neural networks for sentence classification,’’ Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2712–2719.
in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, [90] C. Hori et al., ‘‘Attention-based multimodal fusion for video description,’’
pp. 1746–1751. in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 4203–4212.
[67] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, ‘‘A convolutional [91] J. Lu, C. Xiong, D. Parikh, and R. Socher, ‘‘Knowing when to look: Adap-
neural network for modelling sentences,’’ in Proc. 52nd Annu. Meeting tive attention via a visual sentinel for image captioning,’’ in Proc. IEEE
Assoc. Comput. Linguistics, 2014, pp. 655–665. Conf. Comput. Vis. Pattern Recognit., vol. 6, Jun. 2017, pp. 375–383.
[68] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and [92] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio,
L.-P. Morency, ‘‘Context-dependent sentiment analysis in user-generated ‘‘End-to-end attention-based large vocabulary speech recognition,’’ in
videos,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2016,
2017, pp. 873–883. pp. 4945–4949.
[69] T. Baltrušaitis, P. Robinson, and L.-P. Morency, ‘‘OpenFace: An open [93] M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, and L.-P.
source facial behavior analysis toolkit,’’ in Proc. IEEE Winter Conf. Appl. Morency, ‘‘Multimodal sentiment analysis with word-level fusion and
Comput. Vis., Mar. 2016, pp. 1–10. reinforcement learning,’’ in Proc. 19th ACM Int. Conf. Multimodal Inter-
[70] F. Eyben, M. Wöllmer, and B. Schuller, ‘‘Opensmile: The munich versa- act., 2017, pp. 163–171.
tile and fast open-source audio feature extractor,’’ in Proc. 18th ACM Int. [94] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, ‘‘Deep reinforcement
Conf. Multimedia, 2010, pp. 1459–1462. learning-based image captioning with embedding reward,’’ in Proc. IEEE
[71] N. Pham and R. Pagh, ‘‘Fast and scalable polynomial kernels via explicit Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1151–1159.
feature maps,’’ in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery [95] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, ‘‘Self-critical
Data Mining, 2013, pp. 239–247. sequence training for image captioning,’’ in Proc. IEEE Conf. Comput.
[72] N. Srivastava and R. Salakhutdinov, ‘‘Learning representations for mul- Vis. Pattern Recognit., Jun. 2017, pp. 7008–7024.
timodal data with deep belief nets,’’ in Proc. Int. Conf. Mach. Learn. [96] N. Srivastava and R. R. Salakhutdinov, ‘‘Multimodal learning with deep
Workshop, vol. 79, 2012, pp. 1–8. Boltzmann machines,’’ in Proc. Adv. Neural Inf. Process. Syst., 2012,
[73] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba, pp. 2222–2230.
‘‘Cross-modal scene networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., [97] G. E. Hinton, S. Osindero, and Y.-W. Teh, ‘‘A fast learning algorithm for
vol. 40, no. 10, pp. 2303–2314, Oct. 2018. deep belief nets,’’ Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[74] S. Wang, H. Zhang, and H. Wang, ‘‘Object co-segmentation via weakly [98] R. Salakhutdinov and G. Hinton, ‘‘Deep Boltzmann machines,’’ in Proc.
supervised data fusion,’’ Comput. Vis. Image Understand., vol. 155, 29th Int. Conf. Artif. Intell. Statist., 2009, pp. 448–455.
pp. 43–54, Feb. 2017. [99] G. E. Hinton, ‘‘Training products of experts by minimizing contrastive
[75] Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, ‘‘Cross-modal retrieval via divergence,’’ Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
deep and bidirectional representation learning,’’ IEEE Trans. Multimedia, [100] L. Ge, J. Gao, X. Li, and A. Zhang, ‘‘Multi-source deep learning for
vol. 18, no. 7, pp. 1363–1377, Jul. 2016. information trustworthiness estimation,’’ in Proc. 19th ACM SIGKDD Int.
[76] A. Karpathy and L. Fei-Fei, ‘‘Deep visual-semantic alignments for gen- Conf. Knowl. Discovery Data Mining, 2013, pp. 766–774.
erating image descriptions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [101] W. Ouyang, X. Chu, and X. Wang, ‘‘Multi-source deep learning for
Recognit., Jun. 2015, pp. 3128–3137. human pose estimation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
[77] R. Xu, C. Xiong, W. Chen, and J. J. Corso, ‘‘Jointly modeling deep Recognit., Jun. 2014, pp. 2329–2336.
video and compositional text to bridge vision and language in a unified [102] R. Salakhutdinov and H. Larochelle, ‘‘Efficient learning of deep
framework,’’ in Proc. 29th AAAI Conf. Artif. Intell., 2015, pp. 2346–2352. Boltzmann machines,’’ in Proc. 30th Int. Conf. Artif. Intell. Statist., 2010,
[78] V. E. Liong, J. Lu, Y. Tan, and J. Zhou, ‘‘Deep coupled metric learning pp. 693–700.
for cross-modal matching,’’ IEEE Trans. Multimedia, vol. 19, no. 6, [103] G. E. Hinton and R. S. Zemel, ‘‘Autoencoders, minimum description
pp. 1234–1244, Jun. 2017. length and Helmholtz free energy,’’ in Proc. Adv. Neural Inf. Process.
[79] Y. Peng, J. Qi, X. Huang, and Y. Yuan, ‘‘CCL: Cross-modal correlation Syst., 1994, pp. 3–10.
learning with multigrained fusion by hierarchical network,’’ IEEE Trans. [104] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, ‘‘Extracting
Multimedia, vol. 20, no. 2, pp. 405–420, Feb. 2017. and composing robust features with denoising autoencoders,’’ in Proc.
[80] L. Wang, Y. Li, and S. Lazebnik, ‘‘Learning deep structure-preserving 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
image-text embeddings,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [105] C. Silberer and M. Lapata, ‘‘Learning grounded meaning representations
Recognit., Jun. 2016, pp. 5005–5013. with autoencoders,’’ in Proc. 52nd Annu. Meeting Assoc. Comput. Lin-
[81] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, guistics, vol. 1, 2014, pp. 721–732.
‘‘A kernel two-sample test,’’ J. Mach. Learn. Res., vol. 13, pp. 723–773, [106] D. Wang, P. Cui, M. Ou, and W. Zhu, ‘‘Deep multimodal hashing with
Mar. 2012. orthogonal regularization,’’ in Proc. 24th Int. Conf. Artif. Intell., 2015,
[82] I. J. Goodfellow et al., ‘‘Generative adversarial nets,’’ in Proc. 27th Int. pp. 2291–2297.
Conf. Neural Inf. Process. Syst. (NIPS), vol. 2. Cambridge, MA, USA: [107] W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, ‘‘Effective
MIT Press, 2014, pp. 2672–2680. multi-modal retrieval based on stacked auto-encoders,’’ VLDB Endow-
[83] N. Mor, L. Wolf, A. Polyak, and Y. Taigman. (2018). ‘‘A universal ment, vol. 7, no. 8, pp. 649–660, 2014.
music translation network.’’ [Online]. Available: https://arxiv.org/abs/ [108] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, ‘‘Multimodal deep autoen-
1805.07848 coder for human pose recovery,’’ IEEE Trans. Image Process., vol. 24,
[84] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, ‘‘Multimodal unsu- no. 12, pp. 5659–5670, Dec. 2015.
pervised image-to-image translation,’’ in Proc. Eur. Conf. Comput. Vis., [109] H. Hotelling, ‘‘Relations between two sets of variates,’’ Biometrika,
2018, pp. 172–189. vol. 28, nos. 3–4, pp. 321–377, 1936.
[85] R. Bernardi et al., ‘‘Automatic description generation from images: A sur- [110] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, ‘‘Canonical correlation
vey of models, datasets, and evaluation measures,’’ J. Artif. Intell. Res., analysis: An overview with application to learning methods,’’ Neural
vol. 55, pp. 409–442, Jan. 2016. Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[111] S. Akaho. (2006). ‘‘A kernel method for canonical correlation analysis.’’ [135] A. Creswell and A. A. Bharath. (2016). ‘‘Inverting the generator
[Online]. Available: https://arxiv.org/abs/cs/0609071 of a generative adversarial network.’’ [Online]. Available:
[112] N. Mallinar and C. Rosset. (2018). ‘‘Deep canonically correlated https://arxiv.org/abs/1611.05644
LSTMs.’’ [Online]. Available: https://arxiv.org/abs/1801.05407 [136] Z. C. Lipton and S. Tripathi. (2017). ‘‘Precise recovery of latent
[113] C. K. I. Williams and M. Seeger, ‘‘Using the Nyström method to speed vectors from generative adversarial networks.’’ [Online]. Available:
up kernel machines,’’ in Proc. Adv. Neural Inf. Process. Syst., 2001, https://arxiv.org/abs/1702.04782
pp. 682–688. [137] V. Dumoulin et al., ‘‘Adversarially learned inference,’’ in Proc. Int. Conf.
[114] F. R. Bach and M. I. Jordan, ‘‘Kernel independent component analysis,’’ Learn. Represent., 2017, pp. 1–18.
J. Mach. Learn. Res., vol. 3, pp. 1–48, Jan. 2002. [138] J. Donahue, P. Krähenbuhl, and T. Darrell, ‘‘Adversarial feature learning,’’
[115] N. Cristianini, J. Shawe-Taylor, and H. Lodhi, ‘‘Latent semantic kernels,’’ in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–18.
J. Intell. Inf. Syst., vol. 18, nos. 2–3, pp. 127–152, 2002. [139] M. Mirza and S. Osindero. (2014). ‘‘Conditional generative adversarial
[116] R. Arora and K. Livescu, ‘‘Kernel CCA for multi-view learning of nets.’’ [Online]. Available: https://arxiv.org/abs/1411.1784
acoustic features using articulatory measurements,’’ in Proc. Symp. Mach. [140] S. Reed, Z. Akata, H. Lee, and B. Schiele, ‘‘Learning deep representations
Learn. Speech Lang. Process., 2012, pp. 1–4. of fine-grained visual descriptions,’’ in Proc. IEEE Conf. Comput. Vis.
[117] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, ‘‘Deep canonical Pattern Recognit., Jun. 2016, pp. 49–58.
correlation analysis,’’ in Proc. 30th Int. Conf. Mach. Learn., 2013, [141] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee,
pp. 1247–1255. ‘‘Learning what and where to draw,’’ in Proc. Adv. Neural Inf. Process.
[118] W. Wang, R. Arora, K. Livescu, and J. Bilmes, ‘‘On deep multi-view Syst., 2016, pp. 217–225.
representation learning,’’ in Proc. 32nd Int. Conf. Mach. Learn., vol. 37, [142] J. Johnson, A. Gupta, and L. Fei-Fei, ‘‘Image generation from scene
2015, pp. 1083–1092. graphs,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[119] F. R. Bach and M. I. Jordan, ‘‘A probabilistic interpretation of canonical pp. 1219–1228.
correlation analysis,’’ Dept. Statist., Univ. California, Berkeley, Berkeley, [143] X. Xu, L. He, H. Lu, L. Gao, and Y. Ji, ‘‘Deep adversarial metric learning
CA, USA, Tech. Rep. 688, 2005. for cross-modal retrieval,’’ World Wide Web, vol. 22, no. 2, pp. 657–672,
[120] W. Wang, X. Yan, H. Lee, and K. Livescu. (2016). ‘‘Deep Mar. 2019.
variational canonical correlation analysis.’’ [Online]. Available: [144] J. Zhang, Y. Peng, and M. Yuan, ‘‘Unsupervised generative adversarial
https://arxiv.org/abs/1610.03454 cross-modal hashing,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018,
[121] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, ‘‘Unsupervised pp. 1–8.
learning of acoustic features via deep canonical correlation analysis,’’ [145] L. Wu, Y. Wang, and L. Shao, ‘‘Cycle-consistent deep generative hashing
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Apr. 2015, for cross-modal retrieval,’’ IEEE Trans. Image Process., vol. 28, no. 4,
pp. 4590–4594. pp. 1602–1612, Apr. 2019.
[122] W. Wang, R. Arora, K. Livescu, and N. Srebro, ‘‘Stochastic optimization [146] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, ‘‘Unpaired image-to-image
for deep CCA via nonlinear orthogonal iterations,’’ in Proc. Allerton Conf. translation using cycle-consistent adversarial networks,’’ in Proc. IEEE
Commun., Control Comput., Sep./Oct. 2015, pp. 688–695. Int. Conf. Comput. Vis., Jun. 2017, pp. 2223–2232.
[123] X. Chang, T. Xiang, and T. M. Hospedales, ‘‘Scalable and effective deep [147] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and
CCA via soft decorrelation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern X. Chen, ‘‘Improved techniques for training GANs,’’ in Proc. Adv. Neural
Recognit., 2018, pp. 1488–1497. Inf. Process. Syst., 2016, pp. 2234–2242.
[124] A. Lu, W. Wang, M. Bansal, K. Gimpel, and K. Livescu, ‘‘Deep multilin- [148] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, ‘‘Unrolled genera-
gual correlation for improved word embeddings,’’ in Proc. Conf. North tive adversarial networks,’’ in Proc. Int. Conf. Learn. Represent., 2017,
Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., 2015, pp. 1–25.
pp. 250–256. [149] M. Arjovsky, S. Chintala, and L. Bottou. (2017). ‘‘Wasserstein GAN.’’
[125] G. Rotman, I. Vulić, and R. Reichart, ‘‘Bridging languages through [Online]. Available: https://arxiv.org/abs/1701.07875
images with deep partial canonical correlation analysis,’’ in Proc. 56th [150] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 910–921. ‘‘Improved training of wasserstein gans,’’ in Proc. 31st Conf. Neural Inf.
[126] Y. Yu, S. Tang, F. Raposo, and L. Chen. (2017). ‘‘Deep cross-modal Process. Syst., 2017, pp. 5767–5777.
correlation learning for audio and lyrics in music retrieval.’’ [Online]. [151] R. A. Rensink, ‘‘The dynamic representation of scenes,’’ Vis. Cognit.,
Available: https://arxiv.org/abs/1711.08976 vol. 7, nos. 1–3, pp. 17–42, 2000.
[127] Y. Takashima, T. Takiguchi, Y. Ariki, and K. Omori, ‘‘Audio-visual [152] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, ‘‘Recurrent mod-
speech recognition for a person with severe hearing loss using deep els of visual attention,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014,
canonical correlation analysis,’’ in Proc. 1st Int. Workshop Challenges pp. 2204–2212.
Hearing Assistive Technol., 2017, pp. 77–81. [153] W. Pei, T. Baltrušaitis, D. M. Tax, and L.-P. Morency, ‘‘Temporal
[128] Q. Tang, W. Wang, and K. Livescu, ‘‘Acoustic feature learning via deep attention-gated model for robust sequence classification,’’ in Proc. IEEE
variational canonical correlation analysis,’’ in Proc. Conf. Int. Speech Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 820–829.
Commun. Assoc., 2017, pp. 1656–1660. [154] F. Wang et al., ‘‘Residual attention network for image classifica-
[129] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, ‘‘Deep generative tion,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017,
image models using a Laplacian pyramid of adversarial networks,’’ in pp. 3156–3164.
Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1486–1494. [155] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by
[130] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised representation jointly learning to align and translate,’’ in Proc. Int. Conf. Learn. Repre-
learning with deep convolutional generative adversarial networks,’’ in sent., 2015, pp. 1–15.
Proc. Int. Conf. Learn. Represent., 2016, pp. 1–16. [156] T. Luong, H. Pham, and C. D. Manning, ‘‘Effective approaches to
[131] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation attention-based neural machine translation,’’ in Proc. Conf. Empirical
with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis. Methods Natural Lang. Process., 2015, pp. 1412–1421.
Pattern Recognit., Jun. 2017, pp. 5967–5976. [157] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, ‘‘Stacked attention
[132] C. Ledig et al., ‘‘Photo-realistic single image super-resolution using networks for image question answering,’’ in Proc. IEEE Conf. Comput.
a generative adversarial network,’’ in Proc. IEEE Conf. Comput. Vis. Vis. Pattern Recognit., Jun. 2016, pp. 21–29.
Pattern Recognit., Jun. 2017, pp. 4681–4690. [158] H. Nam, J.-W. Ha, and J. Kim, ‘‘Dual attention networks for multimodal
[133] Z. Chen, X. Zhang, A. P. Boedihardjo, J. Dai, and C.-T. Lu, ‘‘Multimodal reasoning and matching,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
storytelling via generative adversarial imitation learning,’’ in Proc. 26th Recognit., Jun. 2017, pp. 299–307.
Int. Joint Conf. Artif. Intell., 2017, pp. 3967–3973. [159] A. Vaswani et al., ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf.
[134] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Process. Syst., 2017, pp. 5998–6008.
Abbeel, ‘‘InfoGAN: Interpretable representation learning by information [160] H. Xu and K. Saenko, ‘‘Ask, attend and answer: Exploring
maximizing generative adversarial nets,’’ in Proc. 30th Int. Conf. Neural question-guided spatial attention for visual question answering,’’ in
Inf. Process. Syst., 2016, pp. 2172–2180. Proc. Eur. Conf. Comput. Vis., 2016, pp. 451–466.
[161] C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, ‘‘Structured attentions JIANWEN WANG is currently pursuing the Ph.D.
for visual question answering,’’ in Proc. IEEE Int. Conf. Comput. Vis., degree with the College of Mathematics and
Jun. 2017, pp. 1291–1300. Computer Science, Fuzhou University, Fuzhou,
[162] L. Chen et al., ‘‘SCA-CNN: Spatial and channel-wise attention in con- China. He is currently a Lecturer with the College
volutional networks for image captioning,’’ in Proc. IEEE Conf. Comput. of Mathematics and Informatics, Fujian Nor-
Vis. Pattern Recognit., Jun. 2017, pp. 6298–6306. mal University, Fuzhou. His research interests
[163] C. Gan, Y. Li, H. Li, C. Sun, and B. Gong, ‘‘VQS: Linking segmen- include multimodal machine learning and com-
tations to questions and answers for supervised attention in VQA and
puter vision.
question-focused semantic segmentation,’’ in Proc. IEEE Int. Conf. Com-
put. Vis., Jun. 2017, pp. 1811–1820.
[164] K. Chen, T. Bui, C. Fang, Z. Wang, and R. Nevatia, ‘‘AMC: Attention
guided multi-modal correlation learning for image search,’’ in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 6203–6211.
[165] X. Long et al., ‘‘Multimodal keyless attention fusion for video classifica-
tion,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–8.
[166] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and
L.-P. Morency, ‘‘Memory fusion network for multi-view sequential learn-
ing,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–8.
[167] C. Zhou et al., ‘‘ATRank: An attention-based user behavior modeling
framework for recommendation,’’ in Proc. 32nd AAAI Conf. Artif. Intell.,
2018, pp. 1–8.
[168] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans.
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[169] J. Zhang, Y. Han, J. Tang, Q. Hu, and J. Jiang, ‘‘Semi-supervised image-
to-video adaptation for video action recognition,’’ IEEE Trans. Cybern.,
vol. 47, no. 4, pp. 960–973, Apr. 2017.