Deep Multimodal Representation Learning: A Survey
Corresponding author: Shiping Wang ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 61502104 and Grant 61672159, in part
by the Fujian Collaborative Innovation Center for Big Data Application in Governments, and in part by the Technology Innovation
Platform Project of Fujian Province under Grant 2014H2005.
ABSTRACT Multimodal representation learning, which aims to narrow the heterogeneity gap among different modalities, plays an indispensable role in the utilization of ubiquitous multimodal data. Owing to the powerful representation ability of deep learning, with its multiple levels of abstraction, deep learning-based multimodal representation learning has attracted much attention in recent years. In this paper, we provide a comprehensive survey of deep multimodal representation learning, a topic that has rarely been surveyed in its own right. To facilitate the discussion of how the heterogeneity gap is narrowed, we categorize deep multimodal representation learning methods into three frameworks according to the underlying structures in which different modalities are integrated: joint representation, coordinated representation, and encoder-decoder. Additionally, we review typical models in this area, ranging from conventional models to newly developed technologies. This paper highlights key issues of the newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective, which, to the best of our knowledge, have not previously been reviewed from this angle, even though they have become major focuses of much contemporary research. For each framework or model, we discuss its basic structure, learning objective, application scenarios, key issues, advantages, and disadvantages, so that both new and experienced researchers can benefit from this survey. Finally, we suggest some important directions for future work.
INDEX TERMS Multimodal representation learning, multimodal deep learning, deep multimodal fusion,
multimodal translation, multimodal adversarial learning.
TABLE 1. The relationship between typical models and three types of deep multimodal representation learning frameworks. Each of the typical models may belong to (denoted by X) or can be integrated with (denoted by a) the relevant framework.
TABLE 2. A summary of typical applications of the three frameworks. Each application may include some of the modalities, such as audio, video, image, and text, which are denoted by their first letter. Here, different integration ways are denoted by + (fusion), ∼ (coordination) and → (translation).
FIGURE 2. Three types of frameworks for deep multimodal representation. (a) Joint representation aims to learn a shared semantic subspace.
(b) The coordinated representation framework learns separated but coordinated representations for each modality under some constraints.
(c) The encoder-decoder framework translates one modality into another and keeps their semantics consistent.
and ResNet [49]. They can be integrated into multimodal learning models and trained together with other components. However, considering the requirement for sufficient training data and computation resources, a pre-trained CNN may be a better choice for multimodal representation learning.

The fundamental tasks in natural language processing involve representing words and encoding sentences. A popular way to represent words is word embedding, such as word2vec [50] or GloVe [51], which maps words into a distributional vector space where the similarity between words can be measured. In NLP tasks, a common issue that should be considered is the unknown word problem, also known as out-of-vocabulary (OOV) words, which can potentially affect the performance of many systems. To deal with the unknown word issue, character embeddings [52], [53] are a viable option for representing language inputs. For example, Kim et al. [52] trained a convolutional neural network to yield word representations based on character-level embeddings. Bojanowski et al. [53] proposed to learn vector representations of character n-grams; then, by treating each word as a bag of character n-grams, the embedding of a word can be obtained as the sum of these vector representations. Experiments [54], [55] showed that handling the OOV issue properly improves the performance of NLP systems considerably.

Recurrent neural networks (RNN) [56] are a powerful tool for dealing with variable-length sequences such as sentences, videos, and audio. Since the activation of the hidden state at time t depends on that of all the previous time steps, it can be seen as a summarization of the sequence up to step t. However, vanilla RNNs have difficulty capturing long-term dependencies because of the gradient vanishing problem [57]. In practice, a better choice is long short-term memory (LSTM) [58], [59] or gated recurrent unit (GRU) [60] networks, which perform better at capturing long-term dependencies [61], [62]. Further, bidirectional recurrent neural networks (BRNN) [63] and the bidirectional variants of LSTM [64] or GRU [65] are also widely used for capturing the semantics.
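To make the sequence encoders above concrete, the following is a minimal PyTorch sketch, not tied to any particular cited paper, of a bidirectional GRU that summarizes a token sequence into a single vector; the vocabulary size and dimensions are arbitrary placeholder values.

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Summarize a token sequence into a single vector with a bidirectional GRU."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        _, last = self.rnn(x)                  # last: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states as the sentence vector.
        return torch.cat([last[0], last[1]], dim=-1)   # (batch, 2 * hidden_dim)

encoder = SentenceEncoder()
sentence_vec = encoder(torch.randint(1, 10000, (4, 12)))   # four sentences of 12 tokens

The same pattern applies to video or audio frame sequences: replace the embedding layer with per-frame features and keep the recurrent summarization.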
In addition to RNNs, the CNN is another widely used model for extracting salient n-gram features from sentences. Experiments showed that CNN-based models perform remarkably well in sentence-level classification [66] and sentiment analysis tasks [67].

As for the video modality, since the input of each time step is an image, its features can be extracted via the techniques used for handling images. In addition to deep features, handcrafted features are still widely used in the video and audio modalities [10], [68]. Further, some toolkits have been developed to extract handcrafted features. For example, OpenFace [69] can be used to extract facial features such as facial landmarks, head pose, and eye gaze. Another tool is OpenSMILE [70], which can be used to extract acoustic features including Mel-frequency cepstral coefficients (MFCC), voice intensity, pitch, and their statistics. After the frames of videos and audio have been encoded, the CNN or RNN networks mentioned above can be used to summarize the sequences into individual vector representations.

B. JOINT REPRESENTATION
The strategy of integrating different types of features to improve the performance of machine learning methods has long been used by researchers. A natural extension of this strategy to the multimodal setting is the utilization of fused heterogeneous features. Following this strategy, promising results have been shown in many multimodal classification or clustering tasks, such as video classification [6], [21], event detection [7], [8], sentiment analysis [9], [10], and visual question answering [23].

To bridge the heterogeneity gap between different modalities, joint representation aims to project unimodal representations into a shared semantic subspace, where the multimodal features can be fused [18]. As Fig. 2(a) shows, after each modality is encoded via an individual neural network, both of them are mapped into a shared subspace, where the concepts shared by the modalities are extracted and fused into a single vector.

The simplest way to fuse multimodal features is to concatenate them directly.
However, mostly this subspace is implemented by a distinct hidden layer, in which the transformed modality-specific vectors are added, and thus the semantics from different modalities are combined. This property can be seen from (1), where z is the activation of the output nodes in the shared layer, v is the output of a modality-specific encoding network, w is the weight matrix connecting a modality-specific encoding layer to the shared layer, and the subscript index denotes different modalities.

$z = f(w_1^T v_1 + w_2^T v_2)$   (1)

Other than the fusion process in a distinct hidden layer, usually called an additive approach, a multiplicative method is also adopted in some literature. In a sentiment analysis task, Zadeh et al. [10] proposed to fuse the language, video, and audio modalities in a tensor, which is constructed from the outer product of all the modality-specific feature vectors. In this way, the authors intend to exploit both intra-modality and inter-modality dynamics. The definition of the fused tensor can be formulated as follows:

$z^m = \begin{bmatrix} z^l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z^v \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z^a \\ 1 \end{bmatrix}$   (2)

where $z^m$ denotes the fused tensor, $z^l$, $z^v$, $z^a$ denote the different modalities respectively, and $\otimes$ indicates the outer product operator. However, since the outer product is computationally expensive, Fukui et al. [23] alternatively propose a more efficient way, Multimodal Compact Bilinear pooling (MCB), to fuse the language and image modalities. As formulated in (3), given vectors x and q, the proposed method seeks to reduce the dimension of the outer product $x \otimes q$ by the Count Sketch projection function $\Psi$. In particular, the count sketch of the outer product can be decomposed into a convolution of separate count sketches [71], which means that the explicit computation of the outer product can be avoided. Further, the authors use the Fast Fourier Transform (FFT) to accelerate the computation.

$\Phi = \Psi(x \otimes q) = \Psi(x) * \Psi(q) = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(\Psi(x)) \odot \mathrm{FFT}(\Psi(q))\big)$   (3)
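As an illustration of the three fusion operators discussed above, here is a small NumPy sketch, with arbitrary placeholder dimensions, of additive fusion in a shared layer as in (1), tensor fusion via the outer product of vectors padded with a constant 1 as in (2), and an MCB-style approximation that count-sketches both vectors and convolves them through the FFT as in (3). It is a simplified, single-sample illustration rather than the implementation used in the cited works.

import numpy as np

rng = np.random.default_rng(0)
d1, d2, d_sketch = 512, 512, 1024
v1, v2 = rng.normal(size=d1), rng.normal(size=d2)

# (1) additive fusion in a shared layer: z = f(W1^T v1 + W2^T v2)
W1, W2 = rng.normal(size=(d1, 256)), rng.normal(size=(d2, 256))
z_add = np.tanh(W1.T @ v1 + W2.T @ v2)

# (2) tensor fusion: outer product of the vectors, each padded with a constant 1
# so that unimodal and bimodal interactions are both retained.
z_tensor = np.outer(np.append(v1, 1.0), np.append(v2, 1.0))

# (3) MCB-style fusion: count-sketch both vectors, then convolve them via the FFT.
def count_sketch(x, d_out, h, s):
    y = np.zeros(d_out)
    np.add.at(y, h, s * x)        # scatter each signed entry into its hashed bin
    return y

h1, h2 = rng.integers(0, d_sketch, d1), rng.integers(0, d_sketch, d2)
s1, s2 = rng.choice([-1.0, 1.0], d1), rng.choice([-1.0, 1.0], d2)
phi = np.fft.irfft(np.fft.rfft(count_sketch(v1, d_sketch, h1, s1)) *
                   np.fft.rfft(count_sketch(v2, d_sketch, h2, s2)), n=d_sketch)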
Although the model shown in Fig. 2(a) is designed for the setting in which parallel data are available during the training and inference steps, the ability to deal with partially missing data in some modalities is also desired, such that more training data can be exploited or the performance of downstream tasks is influenced only slightly when data are missing from one or more modalities. To this end, a widely used method is training the model on data that include only some of the modalities, excluding a modality in different training epochs [1], [72].
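A minimal sketch of this training trick, assuming the inputs are already encoded as modality-specific feature tensors: with some probability, one modality is blanked out before fusion so that the joint network learns to tolerate missing inputs. The drop probability and the zero-filling strategy are illustrative choices, not those of any specific cited model.

import random
import torch

def modality_dropout(v_img, v_txt, p_drop=0.3):
    """Randomly blank one modality so the fusion network learns to cope with missing inputs."""
    if random.random() < p_drop:
        if random.random() < 0.5:
            v_img = torch.zeros_like(v_img)   # pretend the image modality is missing
        else:
            v_txt = torch.zeros_like(v_txt)   # pretend the text modality is missing
    return v_img, v_txt

# Typical use: call once per batch (or per epoch) before the fusion layer.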
Interestingly, the training trick used for tackling missing data is also helpful for obtaining the modality-invariant property, which means that the difference between the statistical distributions of the modalities is minimized or, in other words, that the feature vectors contain minimal modality-specific characteristics. The work proposed by Aytar et al. [73] shows that, constrained by a statistical regularization which encourages activations in the intermediate hidden layers to have similar statistical distributions across modalities, the modality-invariant property can be strengthened. Their model encourages different modalities to be aligned with each other automatically in the representation layer, even when the training data are unaligned.

To be more expressive, the learned vector is expected to fuse complementary semantics from different modalities. This complementary property cannot be guaranteed automatically, since joint representation tends to preserve shared semantics across modalities while ignoring modality-specific information. A solution is adding extra regularization terms to the optimization objectives [74]. For example, the reconstruction loss used in multimodal autoencoders [1] can be considered a regularization term that plays a role in preserving modality independence. Another example is the approach proposed by Jiang et al. [21], which imposes a trace norm regularization over the network weights to reveal the hidden correlations and diversity of the multimodal features. Intuitively, if a pair of features are highly correlated, the weights used for fusing them should be similar, such that their contributions to the fused representation will be roughly equal. Thus, the goal of trace norm regularization is to discover the relationships between modalities and adjust the weights of the fusion layer accordingly. Their experiments in video classification tasks showed that this regularization term is helpful for improving performance.

Compared to other frameworks, one of the advantages of joint representation is that it is convenient to fuse several modalities, since there is no need to coordinate the modalities explicitly. Another advantage is that the shared common subspace tends to be modality-invariant, which is helpful for transferring knowledge from one modality to another [1], [73]. One of the disadvantages of this framework is that it cannot be used to infer separate representations for each modality.

C. COORDINATED REPRESENTATION
Another type of method popular in multimodal learning is coordinated representation. As Fig. 2(b) shows, instead of learning representations in a joint subspace, the coordinated representation framework learns separate but coordinated representations for each modality under some constraints [18]. Since the information contained in different modalities is unequal, learning separate representations is beneficial for preserving the exclusive and useful modality-specific characteristics [31]. Typically, conditioned on the constraint types, coordinated representation methods can be categorized into two groups: cross-modal similarity based and cross-modal correlation based. Cross-modal similarity based methods aim to learn a common subspace where the distance between vectors from different modalities can be measured directly [75], while cross-modal correlation based methods aim to learn a shared subspace such that the correlation between the representation sets of different modalities is maximized [5]. In this section, we review the former and leave the latter to Section III-C.
Cross-modal similarity methods learn coordinated representations under constraints of similarity measurement. The learning objective of this class of models is to preserve both inter-modality and intra-modality similarity structure, which expects the cross-modal distance between items associated with the same semantics or object to be as small as possible, while expecting the distance between items with dissimilar semantics to be as large as possible.

A widely used constraint is cross-modal ranking. Taking visual-text embedding for example, ignoring the regularization terms and denoting the matched embedding vectors of visual and text as (v, t) ∈ D, the optimization objective can be expressed as the loss function in (4), where α is the margin, S is the similarity measurement function, $t^-$ is an embedding vector unmatched to v and $v^-$ is an embedding vector unmatched to t. Commonly, $t^-$ and $v^-$ are known as negative samples, which are selected randomly from the dataset D, and (4) is known as the margin rank loss [36].

$\mathrm{rankLoss} = \sum_{v}\sum_{t^-} \max\big(0, \alpha - S(v, t) + S(v, t^-)\big) + \sum_{t}\sum_{v^-} \max\big(0, \alpha - S(t, v) + S(t, v^-)\big)$   (4)
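The following PyTorch sketch implements a bi-directional margin rank loss in the spirit of (4), using cosine similarity as S and treating the other items of a mini-batch as random negatives; the margin value is an arbitrary placeholder.

import torch

def margin_rank_loss(v, t, margin=0.2):
    """Bi-directional margin ranking loss over a batch of matched (v_i, t_i) pairs.
    Negatives are the non-matching items in the batch; S is cosine similarity."""
    v = torch.nn.functional.normalize(v, dim=1)
    t = torch.nn.functional.normalize(t, dim=1)
    scores = v @ t.t()                       # scores[i, j] = S(v_i, t_j)
    pos = scores.diag().unsqueeze(1)         # S(v_i, t_i), shape (batch, 1)
    mask = 1.0 - torch.eye(len(v), device=v.device)
    loss_v2t = (torch.clamp(margin - pos + scores, min=0) * mask).sum()
    loss_t2v = (torch.clamp(margin - pos + scores.t(), min=0) * mask).sum()
    return (loss_v2t + loss_t2v) / len(v)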
Based on the cross-modal ranking constraint, a variety of cross-modal applications have been developed. For example, Frome et al. [34] used a combination of dot-product similarity and margin rank loss to learn a visual-semantic embedding model (DeViSE) for visual recognition. DeViSE firstly pre-trains a pair of deep networks to map images and their correlated labels into embedding vectors v and t, then leverages the cross-modal similarity model to learn a shared semantic embedding space for both modalities. Following the notation in (4), the loss function for each training sample can be defined as follows:

$\mathrm{loss}(v, t) = \sum_{t^-} \max(0, \alpha - t M v + t^- M v)$   (5)

where M is a linear transformation matrix used for transforming v into the shared semantic embedding space, and the dot product between t and Mv is the similarity measurement used for both training and testing. Under the constraint in (5), the model is expected to produce a higher dot-product similarity between matched vectors than between unmatched ones, and it subsequently endows the image embeddings with rich semantic information transferred from the language modality. This idea is also shared by the work of Lazaridou and Baroni [35], which aims to integrate and propagate visual information into word embeddings. Their experimental results implied that the transferred visual knowledge is helpful for representing abstract concepts.

Inspired by the success of DeViSE, Kiros et al. [36] extended this model to learn a joint image-sentence embedding used for image captioning. They pre-trained a CNN network to obtain image features v and trained an LSTM network to encode the relevant sentences into t, then mapped both encodings into a coordinated embedding space where the similarity between them can be exploited by a cross-modal similarity model similar to [34]. Their model adopted the same similarity measurement used in DeViSE but employed the bi-directional rank loss formulated in (4), such that much richer cross-modal relationships can be discovered. This model is also employed in the work proposed by Socher et al. [32], which aims to map sentences and images into a common space for cross-modal retrieval. They introduced a dependency-tree based recursive neural network (DTRNN) to encode the language modality and argued that the proposed DTRNN is robust to surface changes such as word order.

Further, Karpathy and Fei-Fei [76] extended this framework to learn a fine-grained cross-modal alignment between words and image regions for generating region-level descriptions of images. Unfortunately, this task suffers from a lack of the necessary supervision information. Given images and their correlated sentences, the one-to-one correspondence between a word and the region it describes is not known. To address this problem, they chose to infer the alignment between segments of sentences and regions of the image in a cross-modal embedding space. The key idea is to formulate the image-sentence score as a function of the individual region-word similarities. Letting $v_i$ denote the image regions and $s_t$ denote the words in a sentence, the score between image k and sentence l is defined as follows:

$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t$   (6)

where $g_k$ is the set of fragments in image k, $g_l$ is the set of snippets in sentence l, and each word $s_t$ aligns to a unique best image region. Additionally, assuming that k = l denotes a matched image-sentence pair, the cross-modal ranking constraint can be defined as the loss function in (7), which encourages aligned image-sentence pairs to have a higher score than misaligned pairs.

$\mathrm{rankLoss} = \sum_{k}\sum_{l} \max(0, 1 - S_{kk} + S_{kl}) + \sum_{k}\sum_{l} \max(0, 1 - S_{kk} + S_{lk})$   (7)
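Below is a small PyTorch sketch of the region-word scoring idea in (6) and the corresponding ranking objective in (7); it assumes region and word features are already projected into a common space and is meant only to illustrate the computation.

import torch

def image_sentence_score(regions, words):
    """Eq. (6)-style score: each word aligns to its best-matching image region,
    and the region-word dot products are summed over the sentence.
    regions: (num_regions, dim), words: (num_words, dim)."""
    sims = words @ regions.t()               # (num_words, num_regions)
    return sims.max(dim=1).values.sum()

def batch_rank_loss(region_sets, word_sets):
    """Eq. (7)-style ranking over a small batch of aligned (image, sentence) pairs."""
    n = len(region_sets)
    S = torch.stack([torch.stack([image_sentence_score(region_sets[k], word_sets[l])
                                  for l in range(n)]) for k in range(n)])   # S[k, l]
    diag = S.diag()
    mask = 1.0 - torch.eye(n)
    loss = ((1 - diag.unsqueeze(1) + S).clamp(min=0) * mask).sum() \
         + ((1 - diag.unsqueeze(0) + S).clamp(min=0) * mask).sum()
    return loss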
(5), the model is expected to produce a higher dot-product
similarity between matched vectors than between unmatched The strategy to measure image-sentence similarity based
ones and subsequently endows images embedding with rich on individual region-word scores is also adopted by
semantic information which is transferred from language Peng et al. [31], who aim to preserve the modality-specific
modality. This idea is also shared by the work proposed by characteristics by utilizing the fine-grained information
Lazaridou and Baroni [35], which aims to integrate and prop- within each modality during the cross-modal correlation
agate visual information into word embeddings. Their exper- learning. The authors argued that different modalities have
imental results implied that the transferred visual knowledge imbalanced and complementary relationships, thus, instead
is helpful for representing abstract concepts. of measuring the similarity in a common space, they con-
Inspired by the success of DeViSE, Kiros et al. [36] struct an independent semantic space for each modality and
extended this model to learn a joint image-sentence measure the cross-modal similarity in both spaces simultane-
embedding used for image captioning. They pre-trained a ously. After that, the modality-specific similarity scores will
In addition to cross-modal ranking, another widely used constraint is Euclidean distance. The mainstream approach in this category is to minimize the distance between paired samples [33], [77], [78]. An example is the model proposed by Pan et al. [33], which aims to learn a visual-semantic embedding used for generating video descriptions. The model projects both visual and language representations into a low-dimensional embedding space, where the distances between paired samples are minimized such that the semantics of the visual embeddings are consistent with their relevant sentences. This constraint can be expressed as a loss term:

$\mathrm{distanceLoss} = \sum_{(v,s) \in D} \lVert T_v v - T_s s \rVert_2^2$   (8)

where $T_v$ and $T_s$ are transformation matrices for video v and sentence s, and v and s are paired samples from dataset D. Another example is the model for cross-modal matching proposed by Liong et al. [78], which aims to reduce the modality gap of paired data by minimizing the difference between hidden representations over all layers. Supposing that visual modality v and text modality t are encoded via homogeneous feed-forward neural networks, the loss can be formulated as follows:

$\mathrm{distanceLoss} = \sum_{l=1}^{L-1} \sum_{i=1}^{N} \lVert h_{it}^{l} - h_{iv}^{l} \rVert_2^2$   (9)

where l indicates a layer of both modality-specific networks, i indicates a pair of instances of the training data, and h denotes the hidden representations.
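The distance constraints in (8) and (9) can be written compactly as below; this PyTorch sketch assumes the features are already extracted and uses plain linear projections, so it is only a schematic version of the cited models.

import torch

def paired_distance_loss(video_feats, sent_feats, T_v, T_s):
    """Eq. (8)-style loss: project paired video/sentence features with linear maps
    T_v, T_s (stored as (input_dim, joint_dim) matrices) and penalize the squared
    Euclidean distance in the joint space."""
    return ((video_feats @ T_v - sent_feats @ T_s) ** 2).sum(dim=1).mean()

def layerwise_distance_loss(hidden_t, hidden_v):
    """Eq. (9)-style loss: sum the squared distances of paired hidden activations
    over corresponding layers of the two modality-specific networks."""
    return sum(((ht - hv) ** 2).sum() for ht, hv in zip(hidden_t, hidden_v))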
Further, the authors also imposed a large-margin criterion on the distances between unpaired data, which aims to minimize the intra-class distance and maximize the inter-class distance, such that more discriminative information can be exploited. This criterion is defined as follows:

$\begin{cases} \lVert t_i - v_j \rVert_2^2 \le \theta_1, & \text{if } l_{t_i,v_j} = 1 \\ \lVert t_i - v_j \rVert_2^2 \ge \theta_2, & \text{if } l_{t_i,v_j} = -1 \end{cases}$   (10)

where $t_i$ denotes sentence i, $v_j$ denotes image j, and $\theta_1$, $\theta_2$ are the small and large thresholds respectively. The condition $l_{t_i,v_j} = 1$ means that $t_i$ and $v_j$ belong to the same class; otherwise, they belong to different classes.
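The two-sided condition in (10) is usually optimized through a hinge-style relaxation; the following sketch is one such relaxation (an illustrative formulation with placeholder thresholds), not the exact criterion used by Liong et al. [78].

import torch

def class_margin_loss(t, v, same_class, theta1=0.5, theta2=2.0):
    """Hinge relaxation of eq. (10): squared distances of same-class pairs are pushed
    below theta1 and those of different-class pairs above theta2.
    same_class: boolean tensor of shape (batch,)."""
    d2 = ((t - v) ** 2).sum(dim=1)
    pos = torch.clamp(d2 - theta1, min=0)        # same-class pairs that are too far apart
    neg = torch.clamp(theta2 - d2, min=0)        # different-class pairs that are too close
    return torch.where(same_class, pos, neg).mean()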
Besides learning an inter-modality similarity measurement, another key issue for cross-modal applications is to preserve the intra-modality similarity structure. A widely used strategy is classifying the category of the learned features such that they are also discriminative within each modality [30], [79]. Additionally, another method is to keep the neighborhood structure within each view. The constraint in (10) is one of the implementations in this group. Another example is the work from Wang et al. [80], which proposed to learn image-text embeddings via a coordinated representation model that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints in the loss function. Letting $N(v_i)$ denote the neighborhood of image $v_i$ and $N(t_i)$ denote the neighborhood of sentence $t_i$, the within-view neighborhood structure preservation constraints can be formulated as follows:

$\begin{cases} d(v_i, v_j) + m < d(v_i, v_k) & \forall v_j \in N(v_i), \; \forall v_k \notin N(v_i) \\ d(t_i, t_j) + m < d(t_i, t_k) & \forall t_j \in N(t_i), \; \forall t_k \notin N(t_i) \end{cases}$   (11)

In addition to the applications characterized as finding one modality from another, such as cross-modal retrieval [75], [77], [80] and retrieval-based visual description [32], another type of application of coordinated representation is transferring knowledge across modalities, which may enhance the semantic description capability of the embeddings in the target modality. The basic idea is to maximize the cross-modal similarity of paired multimodal data in a common subspace during training, such that the embeddings capture their shared semantics, which means that the knowledge has been transferred. Several pieces of literature mentioned above [33]–[36] can be considered representative examples of this idea. Furthermore, coordinated representation can also be used for cross-domain transfer learning, which can partially reduce the need for labeled data. For example, in order to transfer knowledge from a large-scale cross-media dataset to a small-scale one, the works of Huang et al. [37], [38] proposed to train a pair of networks, one for each domain, and coordinate them by minimizing the maximum mean discrepancy (MMD) [81].

Compared to other frameworks, coordinated representation tends to preserve the exclusive and useful modality-specific characteristics within each modality [31]. Since different modalities are encoded in separate networks, one of the advantages is that each modality can be inferred individually. This property is also beneficial for cross-modal transfer learning, which aims to transfer knowledge across different modalities or domains. A disadvantage of this framework is that, mostly, it is hard to learn representations with more than two modalities.

D. ENCODER-DECODER
Recently, the encoder-decoder framework has been widely used for multimodal translation tasks which map one modality into another, such as image captioning [13], [39], video description [14], [41], and image synthesis [15], [82]. Typically, as shown in Fig. 2(c), the encoder-decoder framework is mainly composed of two components, an encoder and a decoder. The encoder maps the source modality into a latent vector v, and then, based on the vector v, the decoder generates a novel sample of the target modality.

Although most encoder-decoder models contain only one encoder and one decoder, some variants can also be composed of several encoders or decoders. For example, Mor et al. [83] proposed a model to translate music across musical instruments, where a single encoder and several decoders are involved. The shared encoder is responsible for extracting domain-independent music semantics, and each decoder reproduces a piece of music in the target domain.
An example including two encoders is the image-to-image translation model proposed by Huang et al. [84]. It consists of a content encoder and a style encoder, each responsible for part of the duty.

The generalized learning objective of encoder-decoder models, taking visual description as an example [41], can be expressed as follows:

$\theta^{*} = \arg\max_{\theta} \sum_{(V,S)} \log p(S \mid V; \theta)$   (12)

which maximizes the log likelihood of the sentence S given the corresponding visual content V and the model parameters θ. Further, assuming that each word in the sequence is produced in order, the log probability of the sentence can be expressed as:

$\log p(S \mid V; \theta) = \sum_{t=0}^{N} \log p(S_{w_t} \mid V, S_{w_1}, \ldots, S_{w_{t-1}})$   (13)

where $S_{w_i}$ represents the i-th word in the sentence and N is the total number of words.
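To illustrate the objective in (12)-(13), here is a minimal PyTorch decoder trained with teacher forcing: conditioned on a visual feature, it predicts each word given the previous ones, and the cross-entropy it returns is the negative of the log likelihood being maximized. The dimensions and the GRU choice are placeholder assumptions, not details of any specific cited model.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal decoder for eq. (12)-(13): maximize log p(S | V) word by word."""
    def __init__(self, vocab_size=10000, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, captions):            # captions: (batch, T)
        h0 = torch.tanh(self.init_h(visual_feat)).unsqueeze(0)
        states, _ = self.rnn(self.embed(captions[:, :-1]), h0)   # teacher forcing
        logits = self.out(states)                          # next-word prediction at each step
        # Negative log likelihood of the ground-truth continuation = -log p(S | V; theta)
        return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                           captions[:, 1:].reshape(-1))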
Superficially, the latent vector learned by the encoder-decoder model seems to relate only to the source modality, but in fact it closely relates to both the source and target modalities. Since the error correction signal flows from the decoder to the encoder, the encoder is guided by the decoder during training. Consequently, the generated representation tends to capture the shared semantics of both modalities.

To capture shared semantics more effectively, a popular solution is keeping the semantics consistent among modalities via some regularization terms. This depends on the coordination between the encoder and the decoder: both the correct understanding of the semantics in the source modality and the pertinent generation of novel samples in the target modality are important for success. Take image captioning [85] as an example: the description generated by the decoder may cover multiple visual aspects of an image, including objects, attributes such as color and size, backgrounds, scenes, and spatial relationships. Hence, the encoder has to detect and encode the necessary information correctly, and, further, the decoder is responsible for reasoning about high-level semantics and generating grammatically well-formed sentences.

An example of explicitly considering the semantic consistency between modalities is the model proposed by Gao et al. [42], which aims to translate videos into sentences. To tackle this problem, on the one hand, they maximized the likelihood formulated in (13) such that sentences can be generated correctly; on the other hand, they minimized the representation difference in a common subspace such that the semantics of the two modalities are correlated with each other. Supposing that v denotes the visual features, s denotes the sentence embeddings, and R denotes a matrix used for linearly projecting s into the subspace where v is located, the consistency constraint can be written as the loss term in (14).

$\mathrm{loss} = \lVert v - Rs \rVert_F^2$   (14)

Another example is the work proposed by Reed et al. [15], which endeavors to translate characters into pixels via a generative adversarial network (GAN) [82]. In their model, within each class, the similarity between the source and target encodings is maximized such that the semantics of both modalities stay consistent. Since models for image synthesis are mostly implemented with GANs, more examples of this task are left to Section III-D, which concentrates on generative adversarial learning.

On condition that the semantic consistency between modalities has been modeled explicitly, this framework can be used to learn cross-modal semantic embeddings. For example, based on the encoder-decoder framework, Gu et al. [86] proposed to learn cross-modal embeddings used for retrieval. Their model translates each modality into the other via distinct encoder-decoder networks and expects that the generated images or sentences are similar to their sources. In this model, the similarity between the generated sentence and its corresponding reference sentences is measured by a standard evaluation metric like BLEU [87], and the similarity between images is measured by a discriminator which is responsible for distinguishing whether an image comes from the generator or not.

In early works [88], [89], the representation of the visual modality is usually a fixed visual semantic list, such as objects and their relationships, which is detected explicitly by the encoder. Then, based on n-gram language models or sentence templates, a sentence is generated by the decoder. In this way, the problem is simplified. However, it is difficult for these models to deal with a large vocabulary or to model complex sentence structure [41].

Recently, a more accessible way of representing the source modality is encoding the essential information into a single vectorial representation [14]. Compared to traditional methods, it is more convenient for neural networks to encode information and generate samples. However, using a single vector as a bridge, it is challenging for both the encoder and the decoder to translate semantics between modalities. A problem for the encoder is that the high-level vectorial representation distilled from the source may lose some information which is useful for generating the target modality [13]. Also, another problem arises in the decoder once RNN models are used for generating a long sequence: the information contained in the original representation vector is diminished as it is propagated through the time steps.

The attention mechanism has become a popular solution for both of the aforementioned problems. Rather than merely using a single vector resulting from the last step of the encoder, the attention mechanism allows utilizing the intermediate representations distributed among time steps in an RNN network [90] or localized regions in a CNN network [91]. For the encoder, this mechanism relieves the requirement that the full information should be integrated into a single vector, and thus gives more flexibility to the design of the encoder.
… such that salient features can be involved while noise will be excluded. Conversely, an exemplary application of deep reinforcement learning during decoding is image captioning [94], [95].

Compared to other frameworks, one of the advantages of the encoder-decoder framework is its ability to generate novel samples of the target modality conditioned on the representations of the source modality. On the contrary, a disadvantage of this framework is that each encoder-decoder can only encode one of the modalities. Further, the complexity of designing the generator should be taken into consideration, since the techniques for generating plausible targets are still under development.
function. In this way, the essential cross-model correlation
III. TYPICAL MODELS for cross-modal retrieval is captured.
In this section, some typical models in deep multimodal By fusing modalities together in a unified latent space,
representation learning will be summarized. They range from probabilistic graphical models can be used to learn
conventional models, including probabilistic graphical mod- the essential cross-modal correlations. Based on multi-
els, multimodal autoencoders, and deep canonical correlation modal deep belief networks, several applications such
analysis, to newly developed technologies, including gen- as audio-visual emotion recognition [25], audio-visual
erative adversarial networks and attention mechanism. The speech recognition [27], and information trustworthiness
typical models described here can be categorized into one or estimation [100] have been reported. Also, based on
more of the frameworks above introduced or can be integrated multimodal deep Boltzmann machines, several solutions
with them. used for human pose estimation [101] and video emotion
prediction [26] have been proposed.
A. PROBABILISTIC GRAPHICAL MODELS One of the advantages of probabilistic graphical models is
In the deep representation learning area, probabilistic graph- that they can be trained in an unsupervised fashion, allowing
ical models include deep belief networks (DBN) [97] and the use of unlabeled data. Another advantage comes from
Since ρ is invariant to the scale of $w_x$ or $w_y$, the optimization objective can be further reformulated as a constrained optimization problem as follows:

$\max_{w_x, w_y} \; w_x^T C_{xy} w_y \quad \text{s.t.} \quad w_x^T C_{xx} w_x = 1, \; w_y^T C_{yy} w_y = 1$   (18)

The basic CCA is limited to modeling linear relationships, regardless of the true probability distributions of the different data views. To address this problem, many extensions have been proposed. One of the non-linear extensions is kernel CCA [111], which transforms the data into a higher dimensional Hilbert space before applying the CCA method. However, KCCA suffers from poor scalability [112], in that its closed-form solution requires computations with high time complexity and memory consumption. Alternatively, some approximation methods such as the Nyström method [113], incomplete Cholesky decomposition [114], partial Gram-Schmidt orthogonalization [115], and block incremental SVD [116] can be used to speed up this model. Another drawback of KCCA is its poor efficiency, which results from its requirement of accessing the whole training set when transforming an unseen instance [117].

A newer extension of CCA is deep CCA [117], which aims to learn a pair of more complex non-linear transformations for different modalities. The basic structure of this model can be illustrated by Fig. 2(b), where each modality is encoded by a deep neural network and then, in a common subspace, the canonical correlation between modalities is maximized. Letting $H_x = f_x(x, \theta_x)$ and $H_y = f_y(y, \theta_y)$ be non-linear transformation functions, implemented by neural networks, which map x and y into a shared subspace, the optimization objective is to maximize the cross-modality correlation between $H_x$ and $H_y$, formulated as follows:

$\max_{\theta_x, \theta_y} \mathrm{corr}(H_x, H_y) = \max_{\theta_x, \theta_y} \mathrm{corr}\big(f_x(x, \theta_x), f_y(y, \theta_y)\big)$   (19)

Compared to the particular kernel function used in KCCA, the non-linear function learned by a neural network is far more general. Hence, DCCA exhibits better adaptability and flexibility. Meanwhile, as a parametric method, DCCA scales better with data size and does not require access to the training data during testing.
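The correlation objective in (19) is usually evaluated on a (mini-)batch as the sum of canonical correlations between the two projected views. The NumPy sketch below shows that computation; the small ridge term eps is a common numerical safeguard, and the implementation is schematic rather than the one used in [117].

import numpy as np

def canonical_correlations(Hx, Hy, eps=1e-4):
    """Sum of canonical correlations between two projected views (the quantity that
    DCCA maximizes in eq. (19)). Hx, Hy: (n_samples, d) network outputs."""
    n = Hx.shape[0]
    Hx = Hx - Hx.mean(axis=0)
    Hy = Hy - Hy.mean(axis=0)
    Cxx = Hx.T @ Hx / (n - 1) + eps * np.eye(Hx.shape[1])
    Cyy = Hy.T @ Hy / (n - 1) + eps * np.eye(Hy.shape[1])
    Cxy = Hx.T @ Hy / (n - 1)

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)              # symmetric eigendecomposition
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(T, compute_uv=False).sum()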
Commonly, a maximized correlation objective focuses on learning the shared semantic information but tends to ignore modality-specific knowledge. To address this problem, extra regularization terms should be considered. For example, Wang et al. [118] proposed a variant of DCCA named deep canonically correlated autoencoders (DCCAE). In addition to maximizing the correlation between views, this model also minimizes the reconstruction error of each view via an autoencoder architecture. The role of the additional autoencoders can be interpreted as a regularization term which aims to raise the lower bound of the mutual information between views.

So far, most DCCA based applications can be characterized as predicting one modality given another, while DCCA can also be used to generate novel samples. Based on the probabilistic interpretation of CCA [119], Wang et al. [120] proposed an extension named deep variational canonical correlation analysis (VCCA). As a generative model, VCCA enables us to obtain unseen samples of each view. The basic probabilistic interpretation of CCA assumes that the two views of the observed variables, x and y, are generated according to the conditional probabilities p(x|z) and p(y|z), where z is a latent variable shared by both views. Rather than assuming a linear relation between x, y and z, VCCA, implemented via DNN networks, aims to model a non-linear relationship among them, which potentially has stronger representation power. Specifically, the optimization objective of VCCA is a variational lower bound of the likelihood, which can be expressed as a sum over data samples. Hence, the model can be trained conveniently via stochastic gradient descent.

A challenge for DCCA is its relatively poor scalability. Directly inherited from basic CCA, the standard correlation function couples all training samples together and cannot be expressed as a sum over data samples. Thus, Andrew et al. [117] chose a batch-based algorithm (L-BFGS) to optimize the network. However, it computes gradients over the entire dataset and requires a large memory volume, which is infeasible for large datasets. In order to improve the scalability of DCCA, some efforts have been made. Wang et al. [121], [122] adopted a stochastic optimization method with large mini-batches to approximate the gradients. As a result, the problem of memory consumption is relieved.

Recently, a more efficient optimization solution named Soft CCA, which requires lower computational complexity, has been proposed by Chang et al. [123]. Unlike traditional CCA, which constrains the correlation matrix over the training batch to be an identity matrix, Soft CCA relaxes this constraint to the loss in (20), which minimizes the L1 loss of the off-diagonal elements of the constraint matrix. By expressing the CCA objective as a loss function, Soft CCA avoids some computationally expensive operations such as matrix inversion and singular value decomposition (SVD). Thus, Soft CCA is effective and more scalable in computation.

$L_{\mathrm{SDL}} = \sum_{i=1}^{k} \sum_{j \ne i} \phi_{ij}$   (20)
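A sketch of the decorrelation term in (20), written as a differentiable PyTorch loss over a batch of projected features: the batch correlation matrix is pushed towards the identity by penalizing the absolute values of its off-diagonal entries. In Soft CCA this term is combined with a distance loss between the two views, which is omitted here.

import torch

def soft_decorrelation_loss(H):
    """Eq. (20)-style term: L1 penalty on the off-diagonal entries of the correlation
    matrix of a (batch, k) representation, pushing it towards the identity."""
    H = (H - H.mean(dim=0)) / (H.std(dim=0) + 1e-6)
    corr = H.t() @ H / (H.size(0) - 1)                  # (k, k) batch correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.abs().sum()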
Compared to the other type of model in the coordinated framework, the cross-modal similarity methods, one of the advantages of DCCA is that it can be trained in an unsupervised manner. Due to these advantages, DCCA has been widely used for various multi-view and multimodal learning tasks, including word embedding in a multilingual context [124], [125], acoustic feature representation [121], matching images and text [29], music retrieval [126], and speech recognition [127], [128]. On the contrary, the drawback of DCCA is its higher computational complexity, which may limit its scalability in data size.

D. GENERATIVE ADVERSARIAL NETWORK
The generative adversarial network (GAN) is an emerging deep learning technique.
As an unsupervised learning method, it can be used for learning data representations without involving labels, which significantly lowers the dependence on manual annotations. Also, as a generative method, it can be used for generating high-quality novel samples according to the distribution of the training data. Since 2014, after being proposed by Goodfellow et al. [82], the generative adversarial learning strategy has been successfully used in various unimodal applications. One of the best-known applications is image synthesis [82], [129], [130], which generates high-quality images according to a random input drawn from a normal distribution. Other successful examples include image-to-image translation [131] and image super-resolution [132]. Most recently, the generative adversarial learning strategy has been further extended to multimodal cases such as text-to-image synthesis [15], [44], visual captioning [40], [43], cross-modal retrieval [30], multimodal feature fusion [4], and multimodal storytelling [133]. In this section, we briefly introduce the fundamental concepts of GANs and discuss their role in multimodal representation learning.

FIGURE 5. The conceptual structure of basic generative adversarial networks.

Generally, a generative adversarial network is composed of two components contesting with each other: a generative network G playing as a generator and a discriminative network D playing as a discriminator. The network G is responsible for generating new samples according to the learned data distribution, while the network D aims to discriminate between an instance generated by network G and an item sampled from the training set. Commonly, both components, G and D, are implemented via deep neural networks. The generator G can be considered a function mapping a vector in the latent space, z, into a sample in the data space, and this mapping can be formulated as $G(z; \theta_g) \rightarrow x$, where $\theta_g$ denotes the parameters of G. Similarly, the discriminator D can be formulated as $D(x; \theta_d) \rightarrow p$, mapping a matrix or a vector into a scalar probability value predicting whether a sample is drawn from the training data or not, where $\theta_d$ denotes the parameters of D and $p \in (0, 1)$. Although G generates novel samples from a distribution $P_g(x)$, it endeavors to capture the ground truth $P_{data}(x)$. Once the distribution $P_g$ estimates $P_{data}$ well enough, the discriminator D will be confused, and its prediction accuracy will be lowered. Theoretically, Goodfellow et al. [82] show that the global optimum can be achieved on condition that $P_g = P_{data}$. In such a case, the discriminator is unable to distinguish between them, and the predicted probability p will be 0.5 for all inputs.

$\min_G \max_D V(G, D)$   (21)

$V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$   (22)

The optimization objective of GANs is a solution of (21), where the function V(G, D) is the cross-entropy loss of the discriminator D formulated in (22). During the training process, G and D are updated in an iterative paradigm: while one of the components is updated, the parameters of the other are kept fixed. In step one, given samples from either the generator or the training dataset, the discriminator is trained to tell them apart; this objective is achieved by maximizing the function V. In step two, the generator is trained to produce samples sufficient to confuse the discriminator; this objective is achieved by minimizing the function V. In such an adversarial manner, both subnets evolve alternately.
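The alternating procedure described above can be written as a short PyTorch training loop; the toy data, network sizes, and the non-saturating generator loss (maximizing log D(G(z)) instead of literally minimizing V) are common practical choices rather than details taken from the cited works.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))       # z -> x
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 2) * 0.5 + 3.0           # stand-in for samples from p_data
    z = torch.randn(32, 64)
    fake = G(z)

    # Step one: update D to maximize V(G, D) -- tell real and generated samples apart.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step two: update G so that D labels generated samples as real.
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()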
Compared to classic representation learning methods, a visible difference for GANs is that the learning process of the data representation is not straightforward; it is rather implicit. Unlike traditional unsupervised representation methods, such as autoencoders, which learn a mapping from the data to the latent variables directly, GANs learn a reverse mapping from the latent variables to the data samples. Specifically, the generator maps a random vector into a distinctive sample. Thus, this random signal is a representation corresponding to the generated data. On condition that $P_g$ fits $P_{data}$ well, this random signal is a good enough representation for realistic training data.

However, despite the success of GANs in image synthesis, a disadvantage of basic GANs is that the latent representation is hard to interpret, since such a random representation has no connection with meaningful semantics. To improve the interpretability of this latent representation, Chen et al. [134] introduced a semantically meaningful method named InfoGAN, which separates the random noise vector into several groups, z and $c = (c_1, \ldots, c_L)$. By maximizing the mutual information between the latent variables c and the generator distribution G(z, c), the model encourages the different $c_i$ to represent uncoupled salient attributes. As a result, a modification of the value of $c_i$ leads to a change of its relevant data attributes, such as shape or style.

Another disadvantage of basic GANs is their lack of a direct mapping from the data to the latent space, which is critical for representation learning in traditional tasks such as retrieval and classification. To address this problem, some techniques equipped with an additional inference network have been proposed [135], [136]. Other typical models which can translate representations between the data space and the latent space bi-directionally include the Adversarially Learned Inference model (ALI) [137] and Bidirectional Generative Adversarial Networks (BiGANs) [138]. In these models, the generator comprises a pair of parallel networks: a decoder used for mapping a latent vector z into a novel sample x̂, …
FIGURE 7. Two methods used for improving the modality-invariant property via adversarial learning. The key idea is mapping paired inputs into a common subspace such that the discriminator cannot distinguish which modality a feature comes from. (a) Discriminate which modality a feature comes from. (b) Discriminate whether the input is a pair or not.

… of feature vectors, the distribution gap between different modalities will be minimized accordingly.

Based on the learning strategies of the first category, several models used for cross-modal retrieval have been proposed [4], [30], [143]. In these models, the adversarial process serves to enforce the distributions of the projected representations from different modalities to be closer to each other. The main difference between them is the way they preserve the intra-modality and inter-modality similarities simultaneously. For example, Wang et al. [30] proposed to learn representations that are modality-invariant and discriminative. In addition to the modality classifier, a label predictor is also integrated into this model to keep the learned features discriminative within each modality. Further, a triplet margin rank constraint is added to the label classifier such that inter-modality similarity can be preserved.

Peng et al. [4] proposed to learn discriminative common representations for bridging the heterogeneity gap. In their model, the generator is formed by a cross-modal autoencoder with a weight-sharing constraint, and the discriminator is composed of two kinds of discriminative modules: intra-modality and inter-modality discriminators. The generator seeks to project multimodal inputs into a common subspace with two useful properties, keeping semantic consistency within each modality and distribution consistency among modalities; on the contrary, the discriminators try to detect the inconsistency. Specifically, the intra-modality discriminator aims to distinguish the generated reconstruction feature from the original input, while the inter-modality discriminator endeavors to tell which modality a feature comes from.
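A minimal sketch of the Fig. 7(a) idea: a modality discriminator is trained to classify which modality an embedding comes from, while the encoders receive the opposite objective. The sign-flip formulation below is one simple way to express the adversarial signal (a gradient reversal layer is another); it is not the exact loss used in [4], [30], or [143].

import torch
import torch.nn as nn

modality_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
ce = nn.CrossEntropyLoss()

def adversarial_modality_losses(img_emb, txt_emb):
    """The discriminator learns to tell which modality an embedding comes from,
    while the encoders are trained to make that impossible."""
    feats = torch.cat([img_emb, txt_emb], dim=0)
    labels = torch.cat([torch.zeros(len(img_emb)), torch.ones(len(txt_emb))]).long()
    d_loss = ce(modality_clf(feats.detach()), labels)     # train the modality discriminator
    g_loss = -ce(modality_clf(feats), labels)              # encoders maximize its error
    return d_loss, g_loss                                   # optimize with separate optimizers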
The model proposed by Xu et al. [143] aims to learn cross-modal representations which are maximally correlated and statistically indistinguishable in the common subspace. They decompose the whole problem into three loss terms: an adversarial loss which is utilized to minimize the statistical difference between the distributions of different modalities, a feature discrimination loss which ensures the representations are discriminative within each modality, and a cross-modal correlation loss which is responsible for keeping the cross-modal similarity structure. Specifically, the cross-modal correlation loss is measured by the squared distance between pairs of samples that come from different modalities: if a pair comes from the same category, its distance is minimized; otherwise, it is maximized.

As Fig. 7(b) shows, the cross-modal adversarial model of the second category contains an encoder-decoder network, which translates one modality into another. For example, given a pair of inputs (v, t), the encoder maps t into a representation vector; then the decoder, playing as the generator, maps this vector into a reproduced sample v̂. The generated sample v̂ is expected to be sufficiently similar to v, such that the reproduced pair (v̂, t) is considered a real pair by the discriminator. On condition that the learned representation can be translated into another modality soundly, it is believable that the cross-modal invariant property has been preserved. An example in this category is the model proposed by Gu et al. [86], which integrated a generative adversarial network into their model to train a text encoder. In the following, more examples will be shown to demonstrate how this model can be used in practice.

Zhang et al. [144] adopted GANs to model cross-modal hashing in an unsupervised fashion. In addition to preserving inter-modality and intra-modality correlations in the common hash space, the property of preserving the manifold structure across different modalities is also desired in their model. Given a sample from one modality, the generator is trained to select a sample from another modality located in the same manifold. Then, the discriminator determines whether the generated pair of samples belongs to the same manifold structure or not. Here, the hash codes play a key role for both the generator and the discriminator. Specifically, the generator selects samples conditioned on hash codes; also, the discriminator judges their correlation between modalities based on hash codes. The adversarial learning process is used for enhancing the property of preserving the cross-modal manifold structure in a common hash space.

Wu et al. [145] extended CycleGAN [146] to learn cross-modal hash functions for the condition in which no paired training samples are available. CycleGAN can be seen as a special case of the second category, which includes a pair of encoder-decoders, each of them designed to translate one modality into another.
For example, given an input v, the model translates v into t, and then t is reversely translated back to v̂; it is expected that v ≈ v̂. Similarly, given an input t, a reconstructed t̂ is expected to be roughly equal to t. Based on the cycle-consistency constraint in both modalities, the model can be trained in the absence of paired training samples.

One of the advantages of GANs is that they can be trained by unsupervised learning, which significantly lowers the dependence on manual annotations. Another advantage is their powerful ability to generate high-quality novel samples according to the distribution of the training data. However, though a unique global optimum exists theoretically, it is challenging to train a GAN system, which may suffer from training instability, either ''collapsing'' or failing to converge [147]. Although several improvements have been proposed [147]–[150], the way to stabilize the training of GANs remains an open problem.

E. ATTENTION MECHANISM
The attention mechanism allows a model to focus on specific regions of a feature map or specific time steps of a feature sequence. Via the attention mechanism, not only can improved performance be achieved, but better interpretability of the feature representations can also be seen. This mechanism mimics the human ability to extract the most discriminative information for recognition: rather than using all of the information at once, the attention decision process prefers to concentrate selectively on the part of the scene which is needed [151]. Recently, this method has demonstrated its unique power in improving performance in many applications, such as visual classification [152]–[154], neural machine translation [155], [156], speech recognition [92], image captioning [13], [91], video description [42], [90], visual question answering [24], [157], cross-modal retrieval [31], [158], and sentiment analysis [22].

According to whether a key is used while selecting part of the features, attention mechanisms can be categorized into two groups: key-based attention and keyless attention. Key-based attention uses a key to search for salient localized features. Taking image captioning as an example [13], its typical structure can be illustrated as in Fig. 8, where a CNN network encodes the image into a feature set $\{a_i\}$, and then an RNN network decodes the input into hidden states $\{h_t\}$. At time step t, the output $y_t$ is predicted based on $h_t$ and $c_t$, where $c_t$ is the salient feature summarized from $\{a_i\}$.

FIGURE 8. The typical structure of the key-based attention mechanism. The attention module uses the current state ($h_t$) as a key to search for salient elements in the source ($\{a_i\}$).

During the process of extracting the salient feature $c_t$, the current state $h_t$ in the decoder plays as a key and the encoder states $\{a_i\}$ play as a source to be searched [159]. The computation of the attention mechanism [13], [156] can be defined as in (26) to (28), and the compatibility scores between the key and the sources can be evaluated via one of the three different functions listed in (29).

$e_{ti} = \mathrm{score}(a_i, h_t)$   (26)

$\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{i=1}^{L} \exp(e_{ti})}$   (27)

$c_t = \sum_{i=1}^{L} \alpha_{ti} a_i$   (28)

$\mathrm{score}(a_i, h_t) = \begin{cases} h_t^T a_i \\ h_t^T W_a a_i \\ v_a^T \tanh(W_a [h_t; a_i]) \end{cases}$   (29)
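Equations (26)-(29) amount to a few lines of code. The sketch below uses the bilinear score from (29); shapes are stated in the comments and the projection matrix W_a is assumed to be learned elsewhere.

import torch

def attend(a, h_t, W_a):
    """Eqs. (26)-(28) with the bilinear score from eq. (29): h_t queries the encoder
    states a and returns their weighted summary c_t.
    a: (L, d_a), h_t: (d_h,), W_a: (d_h, d_a)."""
    scores = a @ (W_a.t() @ h_t)             # e_ti = h_t^T W_a a_i, shape (L,)
    alphas = torch.softmax(scores, dim=0)    # eq. (27)
    return alphas @ a                        # c_t = sum_i alpha_ti a_i, eq. (28)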
Key-based attention is widespread in visual description applications [13], [90], [160], where an encoder-decoder network is commonly used. It provides an approach to evaluate the importance of features within a modality or among modalities. On the one hand, the attention mechanism can be used to select the most salient features within a modality; on the other hand, it can be used to balance the contributions of the modalities when fusing several modalities.

In order to recognize and describe the objects contained in the visual modality, a set of localized region features, which potentially encode different objects distinctly, would be more helpful than a single feature vector. By dynamically selecting the most salient regions in an image or time steps of a video sequence, both system performance and noise tolerance can be improved. For example, Xu et al. [13] adopted the attention mechanism to detect salient objects in an image and fused them with text features in a decoder unit for captioning. In such a case, guided by the current text generated at time step t, the attention module is used to search for local regions appropriate for predicting the next word.

For locating local features more accurately, several attention models have been proposed. Yang et al. [157] proposed a stacked attention network for searching image regions. They suggested that multiple steps of search or reasoning are helpful to locate fine-grained regions. In the beginning, the model locates one or more local regions in the image by attention, using language features as a key, and then combines the attended visual and language features into a vector, which also plays as the key used for the next iteration. After K steps, not only are the appropriate local regions located, but both features are fused.
Zhu et al. [161] proposed a structured attention model to capture the semantic structure among image regions, and their experiments showed that this model is capable of inferring spatial relations and attending to the right region. Chen et al. [162] proposed to incorporate spatial and channel-wise attention in a CNN network. In their model, not only local regions but also channels of the CNN features are filtered simultaneously.

So far, attention models are mostly trained using indirect cues because of the lack of explicit attention annotations. Alternatively, Gan et al. [163] trained the attention module using direct supervision. They collected link information between visual segments and words from several datasets and then utilized the link information to guide the training of the attention module explicitly. The experiments showed that improved performance could be achieved.

Balancing the contributions of different modalities is a key issue that should be considered when fusing multimodal features. In contrast to concatenation or fixed-weight fusion methods, an attention-based method can adaptively balance the contributions of different modalities. Several pieces of research [90], [91], [164] have reported that dynamically assigning weights to modality-specific features conditioned on a context is helpful for improving application performance.

Hori et al. [90] proposed to tackle multimodal fusion based on attention for video description. In addition to attending to specific regions and time steps, the proposed method highlights attending to modality-specific information. After the modality-specific features have been extracted, the attention module produces appropriate weights to combine the features from different modalities based on the context. In a cross-modal retrieval task, Chen et al. [164] adopted a similar strategy to adaptively fuse modalities and filter out unrelated information within each modality according to search keys.

Lu et al. [91] introduced an adaptive attention framework to determine whether or not to include a visual feature during generation of the caption. They argued that some words, such as ''the'', are not related to any visual object; therefore, no visual feature is needed in this case. Supposing that the visual feature is excluded, the decoder would just depend on the language features to predict a word.

Keyless attention is mostly used for classification or regression tasks. In such an application scenario, since the result is generated in a single step, it is hard to define a key to guide the attention module. Alternatively, the attention is applied directly to the localized features without any key involved. The computation functions can be illustrated as follows:

$\mathrm{score}(a_i) = \begin{cases} v^T a_i \\ v^T \tanh(W a_i) \end{cases}$   (33)
dynamically assigning weights to modality-specific features A special issue on multimodal feature fusion is fusing fea-
condition on a context is helpful to improve application tures from several variable length sequences such as videos,
performance. audios, sentences or a set of localized features. A simple way
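The following sketch shows one way such context-conditioned weighting can be implemented. It is a generic illustration of attention-based fusion rather than the exact architecture of [90], [91], or [164], and all layer sizes are assumed for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttentionFusion(nn.Module):
    """Assigns a weight to each modality-specific feature conditioned on a
    context vector, then fuses the weighted features (a generic sketch of
    attention-based fusion, not a reimplementation of any cited model)."""
    def __init__(self, dims, context_dim, common_dim):
        super().__init__()
        # project every modality into a common space before weighting
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in dims])
        self.score = nn.Linear(context_dim + common_dim, 1)

    def forward(self, feats, context):
        # feats: list of (batch, d_m) modality features, context: (batch, context_dim)
        z = [torch.tanh(p(x)) for p, x in zip(self.proj, feats)]       # common space
        e = [self.score(torch.cat([context, zm], dim=1)) for zm in z]  # relevance per modality
        beta = F.softmax(torch.cat(e, dim=1), dim=1)                   # (batch, M) modality weights
        fused = sum(beta[:, m:m + 1] * z[m] for m in range(len(z)))    # weighted sum
        return fused, beta

# usage: audio, visual, and text features weighted by a query/decoder context
fusion = ModalityAttentionFusion(dims=[128, 2048, 300], context_dim=512, common_dim=256)
audio, visual, text = torch.randn(4, 128), torch.randn(4, 2048), torch.randn(4, 300)
context = torch.randn(4, 512)
fused, beta = fusion([audio, visual, text], context)
```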
Hori et al. [90] proposed to tackle multimodal fusion based on attention for video description. In addition to attending to specific regions and time steps, the proposed method highlights attending to modality-specific information. After modality-specific features have been extracted, the attention module produces appropriate weights to combine the features from different modalities based on the context. In a cross-modal retrieval task, Chen et al. [164] adopted a similar strategy to adaptively fuse modalities and to filter out unrelated information within each modality according to search keys. Lu et al. [91] introduced an adaptive attention framework to determine whether a visual feature should be included during caption generation. They argued that some words such as ‘‘the’’ are not related to any visual object, so no visual feature is needed in this case. If the visual feature is excluded, the decoder depends only on the language features to predict the word.
Keyless attention is mostly used for classification or regression tasks. In such an application scene, since the result is generated in a single step, it is hard to define a key to guide the attention module. Alternatively, the attention is applied directly to the localized features without any key involved. The computation functions can be illustrated as follows:

e_i = score(a_i),   (30)
\alpha_i = \exp(e_i) / \sum_{k=1}^{L} \exp(e_k),   (31)
c = \sum_{i=1}^{L} \alpha_i a_i,   (32)
score(a_i) = v^T a_i   or   v^T \tanh(W a_i),   (33)

where v and W are learnable parameters.
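A minimal PyTorch sketch of this keyless pooling, directly mirroring Eqs. (30)–(33), is given below; the module and dimension names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeylessAttention(nn.Module):
    """Keyless attention pooling following Eqs. (30)-(33): score each localized
    feature a_i on its own, softmax-normalize, and sum (a minimal sketch)."""
    def __init__(self, feat_dim, attn_dim=None):
        super().__init__()
        if attn_dim is None:
            self.score = nn.Linear(feat_dim, 1, bias=False)        # score(a_i) = v^T a_i
        else:
            self.score = nn.Sequential(                            # score(a_i) = v^T tanh(W a_i)
                nn.Linear(feat_dim, attn_dim, bias=False), nn.Tanh(),
                nn.Linear(attn_dim, 1, bias=False))

    def forward(self, a):
        # a: (batch, L, feat_dim) -- a variable-length set or sequence of features
        e = self.score(a)                      # Eqs. (30), (33): unnormalized scores e_i
        alpha = F.softmax(e, dim=1)            # Eq. (31): attention weights alpha_i
        c = (alpha * a).sum(dim=1)             # Eq. (32): pooled representation c
        return c, alpha.squeeze(-1)

# usage: pool 20 frame-level video features into one fixed-length vector
pool = KeylessAttention(feat_dim=512, attn_dim=128)
frames = torch.randn(4, 20, 512)
c, alpha = pool(frames)
```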
Due to its nature of selecting prominent cues from raw input, the keyless attention mechanism is suitable for multimodal feature fusion tasks, which suffer from issues such as semantic conflict, duplication, and noise. Through the attention mechanism, we obtain an approach to evaluating the relationship between parts of modalities, which may be complementary or supplementary. By selecting complementary features from different modalities and fusing them into a single representation, the semantic ambiguity can be eased.

The advantage of the attention mechanism in multimodal fusion has been proven in many applications. For example, Long et al. [165] compared four multimodal fusion methods and demonstrated that the attention-based method is the most effective one for addressing the video classification problem. They performed experiments in different setups: early fusion, middle-level fusion, attention-based fusion, and late fusion, which correspond to different fusion points. The experimental results also show that attention-based fusion is robust across various datasets. Other studies have likewise demonstrated the promising perspective of attention-based methods for multimodal feature fusion [166], [167].

A special issue in multimodal feature fusion is fusing features from several variable-length sequences, such as videos, audios, sentences, or a set of localized features. A simple way to tackle this problem is to fuse each sequence independently via the attention mechanism. After each sequence has been combined into a weighted representation of fixed length, the representations are concatenated or fused into a single vector. This way is beneficial for fusing several sequences even when their lengths differ, which is common in multimodal datasets. However, such a simplified method does not explicitly consider the interaction between modalities and thus may ignore fine-grained cross-modal relationships.

A solution to modeling the interactions between attention modules is to construct a shared context as an extra condition for the computation of the modality-specific attention modules. For example, Lu et al. [24] proposed to construct a global context by calculating the similarity between visual and text features. Nam et al. [158] used an iterative strategy to update the shared context and the modality-specific attention distributions: firstly, the modality-specific features are summarized based on the attention modules; then they are fused into a context used for the next iteration.
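The sketch below illustrates this shared-context idea: each modality is attended conditioned on a common context vector, which is then refreshed from the attended features and reused in the next iteration. It is a rough illustration of the iterative strategy described above, not the exact architecture of [158]; all layer names and sizes are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedContextCoAttention(nn.Module):
    """Iteratively attends each modality conditioned on a shared context and
    refreshes the context from the attended features (a rough sketch of the
    iterative strategy, not the exact architecture of [158])."""
    def __init__(self, dim, steps=2):
        super().__init__()
        self.steps = steps
        self.score_v = nn.Linear(2 * dim, 1)   # relevance of a visual region to the context
        self.score_t = nn.Linear(2 * dim, 1)   # relevance of a word to the context
        self.update = nn.Linear(2 * dim, dim)  # fuse attended features into a new context

    def attend(self, feats, context, scorer):
        # feats: (batch, L, dim), context: (batch, dim)
        ctx = context.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = F.softmax(scorer(torch.cat([feats, ctx], dim=-1)), dim=1)
        return (alpha * feats).sum(dim=1)

    def forward(self, visual, text, context):
        for _ in range(self.steps):
            v = self.attend(visual, context, self.score_v)
            t = self.attend(text, context, self.score_t)
            context = torch.tanh(self.update(torch.cat([v, t], dim=1)))
        return context  # shared representation after the final iteration

coattn = SharedContextCoAttention(dim=256, steps=2)
visual, text = torch.randn(4, 49, 256), torch.randn(4, 12, 256)
context = torch.zeros(4, 256)
joint = coattn(visual, text, context)
```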
TABLE 3. A summary of the key issues, advantages, and disadvantages of each framework or typical model described in this paper. Note that both the cross-modal similarity model and deep canonical correlation analysis (DCCA) belong to the coordinated representation framework.
Recently, a novel learning strategy named the multi-attention mechanism, which utilizes several attention modules to extract different types of features from the same input data, has been exploited. Generally, each type of feature locates in a distinct subspace and reflects different semantics. Hence, the multi-attention mechanism is helpful in discovering different inter-modal dynamics. For example, Zadeh et al. [22] proposed to discover diverse interactions between modalities using the multi-attention mechanism. At each time step t, the hidden states h^m_t from all modalities are concatenated into a vector h_t; then multiple attentions are applied to h_t to extract K differently weighted vectors, which reflect distinctive cross-modal relationships. After that, all K vectors are fused into a single vector that represents the shared hidden state across modalities at time t.
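A compact sketch of this idea is shown below: K parallel attention blocks reweight the concatenated multimodal state, and the resulting views are fused into one shared state. It is a simplified illustration inspired by the multi-attention idea of [22], not a faithful reimplementation; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttentionBlock(nn.Module):
    """K parallel attention blocks over the concatenated multimodal state h_t,
    each extracting a differently weighted view; the K views are then fused
    (a simplified sketch inspired by the multi-attention idea in [22])."""
    def __init__(self, state_dim, num_attentions, out_dim):
        super().__init__()
        self.K = num_attentions
        self.attn = nn.Linear(state_dim, num_attentions * state_dim)  # K sets of weights
        self.fuse = nn.Linear(num_attentions * state_dim, out_dim)    # merge the K views

    def forward(self, h):
        # h: (batch, state_dim) -- concatenation of the modality-specific hidden states
        b, d = h.shape
        logits = self.attn(h).view(b, self.K, d)
        weights = F.softmax(logits, dim=-1)       # each of the K rows attends over the dims of h
        views = weights * h.unsqueeze(1)          # (batch, K, state_dim) weighted views
        fused = torch.tanh(self.fuse(views.reshape(b, -1)))
        return fused

# usage: audio/visual/text hidden states at step t are concatenated first
h_t = torch.cat([torch.randn(4, 64), torch.randn(4, 128), torch.randn(4, 64)], dim=1)
block = MultiAttentionBlock(state_dim=256, num_attentions=4, out_dim=128)
shared_state = block(h_t)  # shared hidden state across modalities at time t
```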
Another example is the model from Zhou et al. [167], which fuses heterogeneous features of user behaviors based on the multi-attention mechanism. Here, a user behavior type can be seen as a distinctive modality, because different types of behaviors have distinctive attributes. The authors supposed that the semantics of a user behavior can be affected by the context; hence, the semantic intensity of that behavior also depends on the context. Firstly, the model projects all types of behaviors into a concatenated vector denoted by S, which is a global feature and plays the role of the context in the attention module. Then, S is projected into K latent semantic subspaces to represent different semantics. After that, the model fuses the K subspaces through the attention module.

One of the advantages of the attention mechanism is its capability to select salient and discriminative localized features, which can not only improve the performance of multimodal representations but also lead to better interpretability. Additionally, by selecting prominent cues, this technique can help to tackle issues such as noise and to fuse complementary semantics into multimodal representations.

IV. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we provided a comprehensive survey on deep multimodal representation learning. According to the underlying structures in which different modalities are integrated, we categorize deep multimodal representation learning methods into three groups of frameworks: joint representation, coordinated representation, and encoder-decoder. Additionally, we summarize some typical models in this area, which range from conventional models to newly developed technologies, including probabilistic graphical models, multimodal autoencoders, deep canonical correlation analysis, generative adversarial networks, and the attention mechanism. For each framework or model, we describe its basic structure, learning objective, and application scenes. We also discuss their key issues, advantages, and disadvantages, which are briefly summarized in Table 3.

When it comes to the learning objectives and key issues of these frameworks and typical models, we can clearly see that the primary objective of multimodal representation learning is to narrow the distribution gap in a joint semantic subspace while keeping the modality-specific semantics intact. Different methods achieve this objective in different ways: the joint representation framework maps all modalities into a global common subspace; the coordinated representation framework maximizes the similarity or correlation between modalities while keeping each modality independent; the encoder-decoder framework maximizes the conditional distribution among modalities and keeps their semantics consistent; probabilistic graphical models maximize the joint probability distribution across modalities; multimodal autoencoders endeavor to keep the modality-specific distributions intact by minimizing reconstruction errors; generative adversarial networks aim to narrow the distribution difference between modalities through an adversarial process; and the attention mechanism selects salient features from the modalities such that they are similar in local manifolds or complementary with each other.
With the rapid development of deep multimodal representation learning methods, the need for training data keeps growing. However, the volume of current multimodal datasets is limited because of the high cost of manual labeling: the acquisition of high-quality labeled datasets is extremely labor-consuming. A popular solution to this problem is transfer learning, which transfers general knowledge from a source domain with a large-scale dataset to a target domain with insufficient data [168]. Transfer learning has been widely used in the multimodal representation learning area and has been shown to be effective in improving performance on many multimodal tasks. One example is the reuse of pre-trained CNN networks such as VGGNet [48] and ResNet [49], which can be used for extracting image features in a multimodal system. A second example is word embeddings such as word2vec [50] and GloVe [51]. Although these representations of words are trained only on general-purpose language corpora, they can be transferred to other datasets directly, even without fine-tuning.
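The following sketch shows the typical reuse pattern described here: a pre-trained CNN is frozen and used as an image feature extractor, and pre-trained word vectors are loaded into a frozen embedding layer. The specific model, vector matrix, and token ids are assumptions for illustration, and the torchvision `pretrained` flag is the classic interface (newer versions replace it with a `weights=` argument).

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Reuse a pre-trained CNN as a frozen image encoder.
cnn = models.resnet50(pretrained=True)
cnn.fc = nn.Identity()          # drop the classifier, keep the 2048-d pooled feature
cnn.eval()
for p in cnn.parameters():
    p.requires_grad = False     # no fine-tuning: pure feature extraction

with torch.no_grad():
    image_feat = cnn(torch.randn(1, 3, 224, 224))    # (1, 2048)

# Reuse pre-trained word embeddings (e.g., word2vec/GloVe vectors loaded into a
# matrix `pretrained_vectors` of shape (vocab_size, 300); random data here).
pretrained_vectors = torch.randn(10000, 300)
embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
token_ids = torch.tensor([[12, 57, 4031]])
text_feat = embed(token_ids).mean(dim=1)             # (1, 300) averaged word vectors

# Both frozen encoders can now feed a multimodal fusion model trained on a small dataset.
```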
In contrast to the widespread use of this convenient and effective knowledge transfer strategy in the image and language modalities, similar methods are not yet available for the audio or video modality. Hence, the deep networks used for extracting audio or video features more easily suffer from overfitting due to the limited training instances. As a result, in many applications such as sentiment analysis and emotion recognition, which are based on fused multimodal features, it is relatively hard to improve performance when only audio and video data are available. Alternatively, most works have to rely increasingly on a stronger language model. Although some efforts have been made to transfer cross-domain knowledge to the audio and video modalities, more convenient and effective methods are still required in the multimodal representation learning area.

In addition to knowledge transfer within the same modality, cross-modal transfer learning, which aims to transfer knowledge from one modality to another, is also a significant research direction. For example, recent studies show that knowledge transferred from images can help to improve the performance of video analysis tasks [169]. Besides, an alternative but more challenging approach is transfer learning between multimodal datasets. The advantage of this method is that the correlation information among different modalities in the source domain can also be exploited, while its weakness is its complexity: both the modality difference and the domain discrepancy must be tackled simultaneously.

Another feasible future direction for tackling the reliance on large-scale labeled datasets is unsupervised or weakly supervised learning, which can be trained using the ubiquitous multimodal data generated by Internet users. Unsupervised learning has been widely used for dimensionality reduction and feature extraction on unlabeled datasets. That is why conventional unsupervised learning methods such as multimodal autoencoders are still active today, although their performance is not as good as that of CNN or RNN features. For a similar reason, generative adversarial networks have recently attracted much attention in the multimodal learning area.

Most recently, weakly supervised learning has demonstrated its potential in exploiting the useful knowledge hidden behind multimodal data. For example, given an image and its description, it is highly possible that an image segment can be described by some words in the sentence. Although the one-to-one correspondences between them are fully unknown, the work proposed by Karpathy and Fei-Fei [76] shows that these hidden relationships can be discovered via weakly supervised learning. Potentially, a more promising application of this type of weakly supervised method is video analysis, where different modalities such as actions, audio, and language have been roughly aligned on the timeline.

For a long time, multimodal representation learning has suffered from issues such as semantic conflict, duplication, and noise. Although the attention mechanism can be used to address these problems partially, it works implicitly and cannot be controlled actively. A more promising method for this problem is integrating reasoning ability into multimodal representation learning networks. Via a reasoning mechanism, a system would have the capability to actively select the evidence that is sorely needed, which could play an important role in mitigating the impact of these troubling issues. We believe that the close combination of representation learning and reasoning mechanisms will endow machines with intelligent cognitive capabilities.

REFERENCES
[1] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, ‘‘Multimodal deep learning,’’ in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[2] S. Wang and W. Guo, ‘‘Sparse multigraph embedding for multimodal feature representation,’’ IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1454–1466, Jul. 2017.
[3] H. McGurk and J. MacDonald, ‘‘Hearing lips and seeing voices,’’ Nature, vol. 264, no. 5588, p. 746, 1976.
[4] Y. Peng, J. Qi, and Y. Yuan. (2017). ‘‘CM-GANs: Cross-modal generative adversarial networks for common representation learning.’’ [Online]. Available: https://arxiv.org/abs/1710.05106
[5] N. Rasiwasia et al., ‘‘A new approach to cross-modal multimedia retrieval,’’ in Proc. 18th ACM Int. Conf. Multimedia, 2010, pp. 251–260.
[6] Y. Liu, X. Feng, and Z. Zhou, ‘‘Multimodal video classification with stacked contractive autoencoders,’’ Signal Process., vol. 120, pp. 761–766, Mar. 2016.
[7] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan, ‘‘Zero-shot event detection using multi-modal fusion of weakly supervised concepts,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2665–2672.
[8] A. Habibian, T. Mensink, and C. G. M. Snoek, ‘‘Video2vec embeddings recognize events when examples are scarce,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 10, pp. 2089–2103, Oct. 2017.
[9] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, ‘‘Fusing audio, visual and textual clues for sentiment analysis from multimodal content,’’ Neurocomputing, vol. 174, pp. 50–59, Jan. 2016.
[10] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, ‘‘Tensor fusion network for multimodal sentiment analysis,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2017, pp. 1103–1114.
[11] F. Feng, X. Wang, and R. Li, ‘‘Cross-modal retrieval with correspondence autoencoder,’’ in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 7–16.
[12] J. Qi and Y. Peng, ‘‘Cross-modal bidirectional translation via rein- [36] R. Kiros, R. Salakhutdinov, and R. S. Zemel. (2014). ‘‘Unifying
forcement learning,’’ in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, visual-semantic embeddings with multimodal neural language models.’’
pp. 2630–2636. [Online]. Available: https://arxiv.org/abs/1411.2539
[13] K. Xu et al., ‘‘Show, attend and tell: Neural image caption generation [37] X. Huang, Y. Peng, and M. Yuan, ‘‘Cross-modal common representation
with visual attention,’’ in Proc. 32nd Int. Conf. Mach. Learn., 2015, learning by hybrid transfer network,’’ in Proc. 26th Int. Joint Conf. Artif.
pp. 2048–2057. Intell., 2017, pp. 1893–1900.
[14] J. Donahue et al., ‘‘Long-term recurrent convolutional networks for visual [38] X. Huang and Y. Peng, ‘‘Deep cross-media knowledge transfer,’’ in Proc.
recognition and description,’’ in Proc. IEEE Conf. Comput. Vis. Pattern IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8837–8846.
Recognit., Jun. 2015, pp. 2625–2634. [39] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, ‘‘Show and tell: A neural
[15] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, image caption generator,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
‘‘Generative adversarial text to image synthesis,’’ in Proc. 33rd Int. Conf. Recognit., Jun. 2015, pp. 3156–3164.
Mach. Learn., 2016, pp. 1060–1069. [40] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, ‘‘Recurrent
[16] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, topic-transition GAN for visual paragraph generation,’’ in Proc. IEEE Int.
no. 7553, p. 436, 2015. Conf. Comput. Vis., Jun. 2017, pp. 3362–3371.
[17] J. Zhao, X. Xie, X. Xu, and S. Sun, ‘‘Multi-view learning overview: [41] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and
Recent progress and new challenges,’’ Inf. Fusion, vol. 38, pp. 43–54, K. Saenko, ‘‘Translating videos to natural language using deep recurrent
Nov. 2017. neural networks,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput.
[18] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, ‘‘Multimodal machine learn- Linguistics, Hum. Lang. Technol., 2015, pp. 1494–1504.
ing: A survey and taxonomy,’’ IEEE Trans. Pattern Anal. Mach. Intell., [42] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, ‘‘Video captioning
vol. 41, no. 2, pp. 423–443, Feb. 2019. with attention-based LSTM and semantic consistency,’’ IEEE Trans.
[19] Y. Li, M. Yang, and Z. Zhang. (2016). ‘‘A survey of multi-view represen- Multimedia, vol. 19, no. 9, pp. 2045–2055, Sep. 2017.
tation learning.’’ [Online]. Available: https://arxiv.org/abs/1610.01206 [43] Y. Yang et al., ‘‘Video captioning by adversarial LSTM,’’ IEEE Trans.
[20] D. Ramachandram and G. W. Taylor, ‘‘Deep multimodal learning: A sur- Image Process., vol. 27, no. 11, pp. 5600–5611, Nov. 2018.
vey on recent advances and trends,’’ IEEE Signal Process. Mag., vol. 34, [44] H. Zhang et al., ‘‘StackGAN: Text to photo-realistic image synthesis
no. 6, pp. 96–108, Nov. 2017. with stacked generative adversarial networks,’’ in Proc. IEEE Int. Conf.
[21] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, ‘‘Exploiting Comput. Vis., Jun. 2017, pp. 5907–5915.
feature and class relationships in video categorization with regularized [45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
deep neural networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, ing applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11,
no. 2, pp. 352–364, Feb. 2018. pp. 2278–2324, Nov. 1998.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
[22] A. Zadeh, P. P. Liang, S. Poria, E. Cambria, P. Vij, and L.-P. Morency,
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf.
‘‘Multi-attention recurrent network for human communication compre-
Process. Syst., 2012, pp. 1097–1105.
hension,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–35.
[47] C. Szegedy et al., ‘‘Going deeper with convolutions,’’ in Proc. IEEE Conf.
[23] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach,
Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
‘‘Multimodal compact bilinear pooling for visual question answering
[48] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for
and visual grounding,’’ in Proc. Conf. Empirical Methods Natural Lang.
large-scale image recognition,’’ in Proc. Int. Conf. Learn. Represent.,
Process., 2016, pp. 457–468.
2015, pp. 1–14.
[24] J. Lu, J. Yang, D. Batra, and D. Parikh, ‘‘Hierarchical question-image
[49] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
co-attention for visual question answering,’’ in Proc. Adv. Neural Inf.
recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
Process. Syst., 2016, pp. 289–297.
pp. 770–778.
[25] Y. Kim, H. Lee, and E. M. Provost, ‘‘Deep learning for robust feature
[50] T. Mikolov, K. Chen, G. Corrado, and J. Dean. (2013). ‘‘Efficient esti-
generation in audiovisual emotion recognition,’’ in Proc. IEEE Int. Conf.
mation of word representations in vector space.’’ [Online]. Available:
Acoust., Speech Signal Process., May 2013, pp. 3687–3691
https://arxiv.org/abs/1301.3781
[26] L. Pang and C.-W. Ngo, ‘‘Mutlimodal learning with deep Boltzmann [51] J. Pennington, R. Socher, and C. D. Manning, ‘‘GloVe: Global vectors for
machine for emotion prediction in user generated videos,’’ in Proc. 5th word representation,’’ in Proc. Conf. Empirical Methods Natural Lang.
ACM Int. Conf. Multimedia Retr., 2015, pp. 619–622. Process., 2014, pp. 1532–1543.
[27] J. Huang and B. Kingsbury, ‘‘Audio-visual deep learning for noise robust [52] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, ‘‘Character-aware neu-
speech recognition,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal ral language models,’’ in Proc. 30th AAAI Conf. Artif. Intell., 2016,
Process., May 2013, pp. 7596–7599. pp. 2741–2749.
[28] F. Feng, R. Li, and X. Wang, ‘‘Deep correspondence restricted [53] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, ‘‘Enriching word
Boltzmann machine for cross-modal retrieval,’’ Neurocomputing, vectors with subword information,’’ Trans. Assoc. Comput. Linguistics,
vol. 154, pp. 50–60, Apr. 2015. vol. 5, pp. 135–146, Dec. 2017.
[29] F. Yan and K. Mikolajczyk, ‘‘Deep correlation for matching images and [54] R. Sennrich, B. Haddow, and A. Birch, ‘‘Neural machine translation of
text,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, rare words with subword units,’’ in Proc. 54th Annu. Meeting Assoc.
pp. 3441–3450. Comput. Linguistics, vol. 1, 2016, pp. 1715–1725.
[30] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, ‘‘Adversarial [55] H. Peng, E. Cambria, and X. Zou, ‘‘Radical-based hierarchical embed-
cross-modal retrieval,’’ in Proc. 25th ACM Int. Conf. Multimedia, 2017, dings for Chinese sentiment analysis at sentence level,’’ in Proc. 13th Int.
pp. 154–162. Flairs Conf., 2017, pp. 1–6.
[31] Y. Peng, J. Qi, and Y. Yuan, ‘‘Modality-specific cross-modal similarity [56] J. L. Elman, ‘‘Finding structure in time,’’ Cognit. Sci., vol. 14, no. 2,
measurement with recurrent attention network,’’ IEEE Trans. Image Pro- pp. 179–211, Mar. 1990.
cess., vol. 27, no. 11, pp. 5585–5599, Nov. 2018. [57] Y. Bengio, P. Simard, and P. Frasconi, ‘‘Learning long-term dependencies
[32] R. Socher, Q. V. L. A. Karpathy, C. D. Manning, and A. Y. Ng, ‘‘Grounded with gradient descent is difficult,’’ IEEE Trans. Neural Netw., vol. 5, no. 2,
compositional semantics for finding and describing images with sen- pp. 157–166, Mar. 1994.
tences,’’ Trans. Assoc. Comput. Linguistics, vol. 2, no. 1, pp. 207–218, [58] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural
2014. Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[33] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, ‘‘Jointly modeling embedding [59] F. A. Gers, J. Schmidhuber, and F. Cummins, ‘‘Learning to forget:
and translation to bridge video and language,’’ in Proc. IEEE Conf. Continual prediction with LSTM,’’ Neural Comput., vol. 12, no. 10,
Comput. Vis. Pattern Recognit., Jun. 2016, pp. 4594–4602. pp. 2451–2471, 2000.
[34] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov, [60] K. Cho et al., ‘‘Learning phrase representations using RNN encoder–
‘‘DeViSE: A deep visual-semantic embedding model,’’ in Proc. 26th Int. decoder for statistical machine translation,’’ in Proc. Conf. Empirical
Conf. Neural Inf. Process. Syst., vol. 2, 2013, pp. 2121–2129. Methods Natural Lang. Process., 2014, pp. 1724–1734.
[35] A. Lazaridou and M. Baroni, ‘‘Combining language and vision with [61] R. Jozefowicz, W. Zaremba, and I. Sutskever, ‘‘An empirical exploration
a multimodal skip-gram model,’’ in Proc. Conf. North Amer. Chapter of recurrent network architectures,’’ in Proc. 32nd Int. Conf. Mach.
Assoc. Comput. Linguistics, Hum. Lang. Technol., 2015, pp. 153–163. Learn., 2015, pp. 2342–2350.
[62] Y. Dai, W. Guo, X. Chen, and Z. Zhang, ‘‘Relation classification via [86] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, ‘‘Look, imagine and
LSTMs based on sequence and tree structure,’’ IEEE Access, to be match: Improving textual-visual cross-modal retrieval with generative
published. models,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[63] M. Schuster and K. K. Paliwal, ‘‘Bidirectional recurrent neural net- pp. 7181–7189.
works,’’ IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, [87] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘‘BLEU: A method
Nov. 1997. for automatic evaluation of machine translation,’’ in Proc. 40th Annu.
[64] A. Graves and J. Schmidhuber, ‘‘Framewise phoneme classification with Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
bidirectional LSTM and other neural network architectures,’’ Neural [88] G. Kulkarni et al., ‘‘Baby talk: Understanding and generating simple
Netw., vol. 18, no. 5, pp. 602–610, 2005. image descriptions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[65] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. (2014). ‘‘Empirical Jun. 2011, pp. 1601–1608.
evaluation of gated recurrent neural networks on sequence modeling.’’ [89] S. Guadarrama et al., ‘‘YouTube2Text: Recognizing and describing arbi-
[Online]. Available: https://arxiv.org/abs/1412.3555 trary activities using semantic hierarchies and zero-shot recognition,’’ in
[66] Y. Kim, ‘‘Convolutional neural networks for sentence classification,’’ Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2712–2719.
in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, [90] C. Hori et al., ‘‘Attention-based multimodal fusion for video description,’’
pp. 1746–1751. in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 4203–4212.
[67] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, ‘‘A convolutional [91] J. Lu, C. Xiong, D. Parikh, and R. Socher, ‘‘Knowing when to look: Adap-
neural network for modelling sentences,’’ in Proc. 52nd Annu. Meeting tive attention via a visual sentinel for image captioning,’’ in Proc. IEEE
Assoc. Comput. Linguistics, 2014, pp. 655–665. Conf. Comput. Vis. Pattern Recognit., vol. 6, Jun. 2017, pp. 375–383.
[68] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and [92] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio,
L.-P. Morency, ‘‘Context-dependent sentiment analysis in user-generated ‘‘End-to-end attention-based large vocabulary speech recognition,’’ in
videos,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2016,
2017, pp. 873–883. pp. 4945–4949.
[69] T. Baltrušaitis, P. Robinson, and L.-P. Morency, ‘‘OpenFace: An open [93] M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, and L.-P.
source facial behavior analysis toolkit,’’ in Proc. IEEE Winter Conf. Appl. Morency, ‘‘Multimodal sentiment analysis with word-level fusion and
Comput. Vis., Mar. 2016, pp. 1–10. reinforcement learning,’’ in Proc. 19th ACM Int. Conf. Multimodal Inter-
[70] F. Eyben, M. Wöllmer, and B. Schuller, ‘‘Opensmile: The munich versa- act., 2017, pp. 163–171.
tile and fast open-source audio feature extractor,’’ in Proc. 18th ACM Int. [94] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, ‘‘Deep reinforcement
Conf. Multimedia, 2010, pp. 1459–1462. learning-based image captioning with embedding reward,’’ in Proc. IEEE
[71] N. Pham and R. Pagh, ‘‘Fast and scalable polynomial kernels via explicit Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1151–1159.
feature maps,’’ in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery [95] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, ‘‘Self-critical
Data Mining, 2013, pp. 239–247. sequence training for image captioning,’’ in Proc. IEEE Conf. Comput.
[72] N. Srivastava and R. Salakhutdinov, ‘‘Learning representations for mul- Vis. Pattern Recognit., Jun. 2017, pp. 7008–7024.
timodal data with deep belief nets,’’ in Proc. Int. Conf. Mach. Learn. [96] N. Srivastava and R. R. Salakhutdinov, ‘‘Multimodal learning with deep
Workshop, vol. 79, 2012, pp. 1–8. Boltzmann machines,’’ in Proc. Adv. Neural Inf. Process. Syst., 2012,
[73] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba, pp. 2222–2230.
‘‘Cross-modal scene networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., [97] G. E. Hinton, S. Osindero, and Y.-W. Teh, ‘‘A fast learning algorithm for
vol. 40, no. 10, pp. 2303–2314, Oct. 2018. deep belief nets,’’ Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[74] S. Wang, H. Zhang, and H. Wang, ‘‘Object co-segmentation via weakly [98] R. Salakhutdinov and G. Hinton, ‘‘Deep Boltzmann machines,’’ in Proc.
supervised data fusion,’’ Comput. Vis. Image Understand., vol. 155, 29th Int. Conf. Artif. Intell. Statist., 2009, pp. 448–455.
pp. 43–54, Feb. 2017. [99] G. E. Hinton, ‘‘Training products of experts by minimizing contrastive
[75] Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, ‘‘Cross-modal retrieval via divergence,’’ Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
deep and bidirectional representation learning,’’ IEEE Trans. Multimedia, [100] L. Ge, J. Gao, X. Li, and A. Zhang, ‘‘Multi-source deep learning for
vol. 18, no. 7, pp. 1363–1377, Jul. 2016. information trustworthiness estimation,’’ in Proc. 19th ACM SIGKDD Int.
[76] A. Karpathy and L. Fei-Fei, ‘‘Deep visual-semantic alignments for gen- Conf. Knowl. Discovery Data Mining, 2013, pp. 766–774.
erating image descriptions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [101] W. Ouyang, X. Chu, and X. Wang, ‘‘Multi-source deep learning for
Recognit., Jun. 2015, pp. 3128–3137. human pose estimation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
[77] R. Xu, C. Xiong, W. Chen, and J. J. Corso, ‘‘Jointly modeling deep Recognit., Jun. 2014, pp. 2329–2336.
video and compositional text to bridge vision and language in a unified [102] R. Salakhutdinov and H. Larochelle, ‘‘Efficient learning of deep
framework,’’ in Proc. 29th AAAI Conf. Artif. Intell., 2015, pp. 2346–2352. Boltzmann machines,’’ in Proc. 30th Int. Conf. Artif. Intell. Statist., 2010,
[78] V. E. Liong, J. Lu, Y. Tan, and J. Zhou, ‘‘Deep coupled metric learning pp. 693–700.
for cross-modal matching,’’ IEEE Trans. Multimedia, vol. 19, no. 6, [103] G. E. Hinton and R. S. Zemel, ‘‘Autoencoders, minimum description
pp. 1234–1244, Jun. 2017. length and Helmholtz free energy,’’ in Proc. Adv. Neural Inf. Process.
[79] Y. Peng, J. Qi, X. Huang, and Y. Yuan, ‘‘CCL: Cross-modal correlation Syst., 1994, pp. 3–10.
learning with multigrained fusion by hierarchical network,’’ IEEE Trans. [104] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, ‘‘Extracting
Multimedia, vol. 20, no. 2, pp. 405–420, Feb. 2017. and composing robust features with denoising autoencoders,’’ in Proc.
[80] L. Wang, Y. Li, and S. Lazebnik, ‘‘Learning deep structure-preserving 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
image-text embeddings,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [105] C. Silberer and M. Lapata, ‘‘Learning grounded meaning representations
Recognit., Jun. 2016, pp. 5005–5013. with autoencoders,’’ in Proc. 52nd Annu. Meeting Assoc. Comput. Lin-
[81] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, guistics, vol. 1, 2014, pp. 721–732.
‘‘A kernel two-sample test,’’ J. Mach. Learn. Res., vol. 13, pp. 723–773, [106] D. Wang, P. Cui, M. Ou, and W. Zhu, ‘‘Deep multimodal hashing with
Mar. 2012. orthogonal regularization,’’ in Proc. 24th Int. Conf. Artif. Intell., 2015,
[82] I. J. Goodfellow et al., ‘‘Generative adversarial nets,’’ in Proc. 27th Int. pp. 2291–2297.
Conf. Neural Inf. Process. Syst. (NIPS), vol. 2. Cambridge, MA, USA: [107] W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, ‘‘Effective
MIT Press, 2014, pp. 2672–2680. multi-modal retrieval based on stacked auto-encoders,’’ VLDB Endow-
[83] N. Mor, L. Wolf, A. Polyak, and Y. Taigman. (2018). ‘‘A universal ment, vol. 7, no. 8, pp. 649–660, 2014.
music translation network.’’ [Online]. Available: https://arxiv.org/abs/ [108] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, ‘‘Multimodal deep autoen-
1805.07848 coder for human pose recovery,’’ IEEE Trans. Image Process., vol. 24,
[84] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, ‘‘Multimodal unsu- no. 12, pp. 5659–5670, Dec. 2015.
pervised image-to-image translation,’’ in Proc. Eur. Conf. Comput. Vis., [109] H. Hotelling, ‘‘Relations between two sets of variates,’’ Biometrika,
2018, pp. 172–189. vol. 28, nos. 3–4, pp. 321–377, 1936.
[85] R. Bernardi et al., ‘‘Automatic description generation from images: A sur- [110] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, ‘‘Canonical correlation
vey of models, datasets, and evaluation measures,’’ J. Artif. Intell. Res., analysis: An overview with application to learning methods,’’ Neural
vol. 55, pp. 409–442, Jan. 2016. Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[111] S. Akaho. (2006). ‘‘A kernel method for canonical correlation analysis.’’ [135] A. Creswell and A. A. Bharath. (2016). ‘‘Inverting the generator
[Online]. Available: https://arxiv.org/abs/cs/0609071 of a generative adversarial network.’’ [Online]. Available:
[112] N. Mallinar and C. Rosset. (2018). ‘‘Deep canonically correlated https://arxiv.org/abs/1611.05644
LSTMs.’’ [Online]. Available: https://arxiv.org/abs/1801.05407 [136] Z. C. Lipton and S. Tripathi. (2017). ‘‘Precise recovery of latent
[113] C. K. I. Williams and M. Seeger, ‘‘Using the Nyström method to speed vectors from generative adversarial networks.’’ [Online]. Available:
up kernel machines,’’ in Proc. Adv. Neural Inf. Process. Syst., 2001, https://arxiv.org/abs/1702.04782
pp. 682–688. [137] V. Dumoulin et al., ‘‘Adversarially learned inference,’’ in Proc. Int. Conf.
[114] F. R. Bach and M. I. Jordan, ‘‘Kernel independent component analysis,’’ Learn. Represent., 2017, pp. 1–18.
J. Mach. Learn. Res., vol. 3, pp. 1–48, Jan. 2002. [138] J. Donahue, P. Krähenbuhl, and T. Darrell, ‘‘Adversarial feature learning,’’
[115] N. Cristianini, J. Shawe-Taylor, and H. Lodhi, ‘‘Latent semantic kernels,’’ in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–18.
J. Intell. Inf. Syst., vol. 18, nos. 2–3, pp. 127–152, 2002. [139] M. Mirza and S. Osindero. (2014). ‘‘Conditional generative adversarial
[116] R. Arora and K. Livescu, ‘‘Kernel CCA for multi-view learning of nets.’’ [Online]. Available: https://arxiv.org/abs/1411.1784
acoustic features using articulatory measurements,’’ in Proc. Symp. Mach. [140] S. Reed, Z. Akata, H. Lee, and B. Schiele, ‘‘Learning deep representations
Learn. Speech Lang. Process., 2012, pp. 1–4. of fine-grained visual descriptions,’’ in Proc. IEEE Conf. Comput. Vis.
[117] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, ‘‘Deep canonical Pattern Recognit., Jun. 2016, pp. 49–58.
correlation analysis,’’ in Proc. 30th Int. Conf. Mach. Learn., 2013, [141] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee,
pp. 1247–1255. ‘‘Learning what and where to draw,’’ in Proc. Adv. Neural Inf. Process.
[118] W. Wang, R. Arora, K. Livescu, and J. Bilmes, ‘‘On deep multi-view Syst., 2016, pp. 217–225.
representation learning,’’ in Proc. 32nd Int. Conf. Mach. Learn., vol. 37, [142] J. Johnson, A. Gupta, and L. Fei-Fei, ‘‘Image generation from scene
2015, pp. 1083–1092. graphs,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[119] F. R. Bach and M. I. Jordan, ‘‘A probabilistic interpretation of canonical pp. 1219–1228.
correlation analysis,’’ Dept. Statist., Univ. California, Berkeley, Berkeley, [143] X. Xu, L. He, H. Lu, L. Gao, and Y. Ji, ‘‘Deep adversarial metric learning
CA, USA, Tech. Rep. 688, 2005. for cross-modal retrieval,’’ World Wide Web, vol. 22, no. 2, pp. 657–672,
[120] W. Wang, X. Yan, H. Lee, and K. Livescu. (2016). ‘‘Deep Mar. 2019.
variational canonical correlation analysis.’’ [Online]. Available: [144] J. Zhang, Y. Peng, and M. Yuan, ‘‘Unsupervised generative adversarial
https://arxiv.org/abs/1610.03454 cross-modal hashing,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018,
[121] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, ‘‘Unsupervised pp. 1–8.
learning of acoustic features via deep canonical correlation analysis,’’ [145] L. Wu, Y. Wang, and L. Shao, ‘‘Cycle-consistent deep generative hashing
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Apr. 2015, for cross-modal retrieval,’’ IEEE Trans. Image Process., vol. 28, no. 4,
pp. 4590–4594. pp. 1602–1612, Apr. 2019.
[122] W. Wang, R. Arora, K. Livescu, and N. Srebro, ‘‘Stochastic optimization [146] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, ‘‘Unpaired image-to-image
for deep CCA via nonlinear orthogonal iterations,’’ in Proc. Allerton Conf. translation using cycle-consistent adversarial networks,’’ in Proc. IEEE
Commun., Control Comput., Sep./Oct. 2015, pp. 688–695. Int. Conf. Comput. Vis., Jun. 2017, pp. 2223–2232.
[123] X. Chang, T. Xiang, and T. M. Hospedales, ‘‘Scalable and effective deep [147] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and
CCA via soft decorrelation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern X. Chen, ‘‘Improved techniques for training GANs,’’ in Proc. Adv. Neural
Recognit., 2018, pp. 1488–1497. Inf. Process. Syst., 2016, pp. 2234–2242.
[124] A. Lu, W. Wang, M. Bansal, K. Gimpel, and K. Livescu, ‘‘Deep multilin- [148] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, ‘‘Unrolled genera-
gual correlation for improved word embeddings,’’ in Proc. Conf. North tive adversarial networks,’’ in Proc. Int. Conf. Learn. Represent., 2017,
Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., 2015, pp. 1–25.
pp. 250–256. [149] M. Arjovsky, S. Chintala, and L. Bottou. (2017). ‘‘Wasserstein GAN.’’
[125] G. Rotman, I. Vulić, and R. Reichart, ‘‘Bridging languages through [Online]. Available: https://arxiv.org/abs/1701.07875
images with deep partial canonical correlation analysis,’’ in Proc. 56th [150] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 910–921. ‘‘Improved training of wasserstein gans,’’ in Proc. 31st Conf. Neural Inf.
[126] Y. Yu, S. Tang, F. Raposo, and L. Chen. (2017). ‘‘Deep cross-modal Process. Syst., 2017, pp. 5767–5777.
correlation learning for audio and lyrics in music retrieval.’’ [Online]. [151] R. A. Rensink, ‘‘The dynamic representation of scenes,’’ Vis. Cognit.,
Available: https://arxiv.org/abs/1711.08976 vol. 7, nos. 1–3, pp. 17–42, 2000.
[127] Y. Takashima, T. Takiguchi, Y. Ariki, and K. Omori, ‘‘Audio-visual [152] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, ‘‘Recurrent mod-
speech recognition for a person with severe hearing loss using deep els of visual attention,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014,
canonical correlation analysis,’’ in Proc. 1st Int. Workshop Challenges pp. 2204–2212.
Hearing Assistive Technol., 2017, pp. 77–81. [153] W. Pei, T. Baltrušaitis, D. M. Tax, and L.-P. Morency, ‘‘Temporal
[128] Q. Tang, W. Wang, and K. Livescu, ‘‘Acoustic feature learning via deep attention-gated model for robust sequence classification,’’ in Proc. IEEE
variational canonical correlation analysis,’’ in Proc. Conf. Int. Speech Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 820–829.
Commun. Assoc., 2017, pp. 1656–1660. [154] F. Wang et al., ‘‘Residual attention network for image classifica-
[129] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, ‘‘Deep generative tion,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017,
image models using a Laplacian pyramid of adversarial networks,’’ in pp. 3156–3164.
Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1486–1494. [155] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by
[130] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised representation jointly learning to align and translate,’’ in Proc. Int. Conf. Learn. Repre-
learning with deep convolutional generative adversarial networks,’’ in sent., 2015, pp. 1–15.
Proc. Int. Conf. Learn. Represent., 2016, pp. 1–16. [156] T. Luong, H. Pham, and C. D. Manning, ‘‘Effective approaches to
[131] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation attention-based neural machine translation,’’ in Proc. Conf. Empirical
with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis. Methods Natural Lang. Process., 2015, pp. 1412–1421.
Pattern Recognit., Jun. 2017, pp. 5967–5976. [157] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, ‘‘Stacked attention
[132] C. Ledig et al., ‘‘Photo-realistic single image super-resolution using networks for image question answering,’’ in Proc. IEEE Conf. Comput.
a generative adversarial network,’’ in Proc. IEEE Conf. Comput. Vis. Vis. Pattern Recognit., Jun. 2016, pp. 21–29.
Pattern Recognit., Jun. 2017, pp. 4681–4690. [158] H. Nam, J.-W. Ha, and J. Kim, ‘‘Dual attention networks for multimodal
[133] Z. Chen, X. Zhang, A. P. Boedihardjo, J. Dai, and C.-T. Lu, ‘‘Multimodal reasoning and matching,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
storytelling via generative adversarial imitation learning,’’ in Proc. 26th Recognit., Jun. 2017, pp. 299–307.
Int. Joint Conf. Artif. Intell., 2017, pp. 3967–3973. [159] A. Vaswani et al., ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf.
[134] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Process. Syst., 2017, pp. 5998–6008.
Abbeel, ‘‘InfoGAN: Interpretable representation learning by information [160] H. Xu and K. Saenko, ‘‘Ask, attend and answer: Exploring
maximizing generative adversarial nets,’’ in Proc. 30th Int. Conf. Neural question-guided spatial attention for visual question answering,’’ in
Inf. Process. Syst., 2016, pp. 2172–2180. Proc. Eur. Conf. Comput. Vis., 2016, pp. 451–466.
[161] C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, ‘‘Structured attentions JIANWEN WANG is currently pursuing the Ph.D.
for visual question answering,’’ in Proc. IEEE Int. Conf. Comput. Vis., degree with the College of Mathematics and
Jun. 2017, pp. 1291–1300. Computer Science, Fuzhou University, Fuzhou,
[162] L. Chen et al., ‘‘SCA-CNN: Spatial and channel-wise attention in con- China. He is currently a Lecturer with the College
volutional networks for image captioning,’’ in Proc. IEEE Conf. Comput. of Mathematics and Informatics, Fujian Nor-
Vis. Pattern Recognit., Jun. 2017, pp. 6298–6306. mal University, Fuzhou. His research interests
[163] C. Gan, Y. Li, H. Li, C. Sun, and B. Gong, ‘‘VQS: Linking segmen- include multimodal machine learning and com-
tations to questions and answers for supervised attention in VQA and
puter vision.
question-focused semantic segmentation,’’ in Proc. IEEE Int. Conf. Com-
put. Vis., Jun. 2017, pp. 1811–1820.
[164] K. Chen, T. Bui, C. Fang, Z. Wang, and R. Nevatia, ‘‘AMC: Attention
guided multi-modal correlation learning for image search,’’ in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 6203–6211.
[165] X. Long et al., ‘‘Multimodal keyless attention fusion for video classifica-
tion,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–8.
[166] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and
L.-P. Morency, ‘‘Memory fusion network for multi-view sequential learn-
ing,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–8.
[167] C. Zhou et al., ‘‘ATRank: An attention-based user behavior modeling
framework for recommendation,’’ in Proc. 32nd AAAI Conf. Artif. Intell.,
2018, pp. 1–8.
[168] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans.
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[169] J. Zhang, Y. Han, J. Tang, Q. Hu, and J. Jiang, ‘‘Semi-supervised image-
to-video adaptation for video action recognition,’’ IEEE Trans. Cybern.,
vol. 47, no. 4, pp. 960–973, Apr. 2017.