
Multimodal Language Analysis in the Wild:

CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph


Amir Zadeh1 , Paul Pu Liang2 , Jonathan Vanbriesen1 , Soujanya Poria3 ,
Edmund Tong1 , Erik Cambria4 , Minghai Chen1 , Louis-Philippe Morency1
{1- Language Technologies Institute, 2- Machine Learning Department}, CMU, USA
{3- A*STAR, 4- Nanyang Technological University}, Singapore
{abagherz,pliang,jvanbrie}@cs.cmu.edu, [email protected]
[email protected], [email protected], [email protected]

Abstract

Analyzing human multimodal language is an emerging area of research in NLP. Intrinsically, human communication is multimodal (heterogeneous), temporal and asynchronous; it consists of the language (words), visual (expressions), and acoustic (paralinguistic) modalities, all in the form of asynchronous coordinated sequences. From a resource perspective, there is a genuine need for large-scale datasets that allow for in-depth studies of multimodal language. In this paper we introduce CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset of sentiment analysis and emotion recognition to date. Using data from CMU-MOSEI and a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), we conduct experiments to investigate how modalities interact with each other in human multimodal language. Unlike previously proposed fusion techniques, DFG is highly interpretable and achieves competitive performance compared to the current state of the art.

1 Introduction

Theories of language origin identify the combination of language and nonverbal behaviors (the vision and acoustic modalities) as the prime form of communication utilized by humans throughout evolution (Müller, 1866). In natural language processing, this form of language is regarded as human multimodal language. Modeling multimodal language has recently become a central research direction in both NLP and multimodal machine learning (Hazarika et al., 2018; Zadeh et al., 2018a; Poria et al., 2017a; Baltrušaitis et al., 2017; Chen et al., 2017). Studies strive to model the dual dynamics of multimodal language: intra-modal dynamics (dynamics within each modality) and cross-modal dynamics (dynamics across different modalities). However, from a resource perspective, previous multimodal language datasets have severe shortcomings in the following aspects:

Diversity in the training samples: Diversity in training samples is crucial for comprehensive multimodal language studies due to the complexity of the underlying distribution. This complexity is rooted in the variability of intra-modal and cross-modal dynamics for the language, vision and acoustic modalities (Rajagopalan et al., 2016). Previously proposed datasets for multimodal language are generally small in size due to the difficulties associated with data acquisition and the costs of annotation.

Variety in the topics: Variety in topics opens the door to generalizable studies across different domains. Models trained on only a few topics generalize poorly, as language and nonverbal behaviors tend to change based on the impression the topic makes on speakers' internal mental state.

Diversity of speakers: Much like writing styles, speaking styles are highly idiosyncratic. Training models on only a few speakers can lead to degenerate solutions where models learn the identity of the speakers as opposed to a generalizable model of multimodal language (Wang et al., 2016).

Variety in annotations: Having multiple labels to predict allows for studying the relations between labels. Another positive aspect of having a variety of labels is allowing for multi-task learning, which has shown excellent performance in past research.

Our first contribution in this paper is to introduce the largest dataset of multimodal sentiment and emotion recognition, called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). CMU-MOSEI contains 23,453 annotated video segments from 1,000 distinct speakers and 250 topics.
Each video segment contains a manual transcription aligned with the audio at the phoneme level. All the videos are gathered from online video sharing websites¹. The dataset is currently a part of the CMU Multimodal Data SDK and is freely available to the scientific community through GitHub².

¹ The Creative Commons license of these videos allows for personal unrestricted use and redistribution.
² https://github.com/A2Zadeh/CMU-MultimodalDataSDK

Our second contribution is an interpretable fusion model called the Dynamic Fusion Graph (DFG), used to study the nature of cross-modal dynamics in multimodal language. DFG contains built-in efficacies that are directly related to how modalities interact. These efficacies are visualized and studied in detail in our experiments. Aside from interpretability, DFG achieves superior performance compared to previously proposed models for multimodal sentiment and emotion recognition on CMU-MOSEI.

2 Background

In this section we compare the CMU-MOSEI dataset to previously proposed datasets for modeling multimodal language. We then describe the baselines and recent models for sentiment analysis and emotion recognition.

2.1 Comparison to other Datasets

We compare CMU-MOSEI to an extensive pool of datasets for sentiment analysis and emotion recognition. The following datasets include a combination of language, visual and acoustic modalities as their input data.

Dataset      #S      #Sp    Mod        Sent  Emo  TL (hh:mm:ss)
CMU-MOSEI    23,453  1,000  {l, v, a}  yes   yes  65:53:36
CMU-MOSI     2,199   98     {l, v, a}  yes   no   02:36:17
ICT-MMMO     340     200    {l, v, a}  yes   no   13:58:29
YouTube      300     50     {l, v, a}  yes   no   00:29:41
MOUD         400     101    {l, v, a}  yes   no   00:59:00
SST          11,855  -      {l}        yes   no   -
Cornell      2,000   -      {l}        yes   no   -
Large Movie  25,000  -      {l}        yes   no   -
STS          5,513   -      {l}        yes   no   -
IEMOCAP      10,000  10     {l, v, a}  no    yes  11:28:12
SAL          23      4      {v, a}     no    yes  11:00:00
VAM          499     20     {v, a}     no    yes  12:00:00
VAM-faces    1,867   20     {v}        no    yes  -
HUMAINE      50      4      {v, a}     no    yes  04:11:00
RECOLA       46      46     {v, a}     no    yes  03:50:00
SEWA         538     408    {v, a}     no    yes  04:39:00
SEMAINE      80      20     {v, a}     no    yes  06:30:00
AFEW         1,645   330    {v, a}     no    yes  02:28:03
AM-FED       242     242    {v}        no    yes  03:20:25
Mimicry      48      48     {v, a}     no    yes  11:00:00
AFEW-VA      600     240    {v, a}     no    yes  00:40:00

Table 1: Comparison of the CMU-MOSEI dataset with previous sentiment analysis and emotion recognition datasets. #S denotes the number of annotated data points. #Sp is the number of distinct speakers. Mod indicates the subset of modalities present from {(l)anguage, (v)ision, (a)udio}. The Sent and Emo columns indicate the presence of sentiment and emotion labels. TL denotes the total number of video hours.

2.1.1 Multimodal Datasets
CMU-MOSI (Zadeh et al., 2016b) is a collection of 2,199 opinion video clips, each annotated with sentiment in the range [-3,3]. CMU-MOSEI is the next generation of CMU-MOSI. ICT-MMMO (Wöllmer et al., 2013) consists of online social review videos annotated at the video level for sentiment. YouTube (Morency et al., 2011) contains videos from the social media website YouTube that span a wide range of product reviews and opinion videos. MOUD (Perez-Rosas et al., 2013) consists of product review videos in Spanish. Each video consists of multiple segments labeled as displaying positive, negative or neutral sentiment. IEMOCAP (Busso et al., 2008) consists of 151 videos of recorded dialogues, with 2 speakers per session, for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral) as well as valence, arousal and dominance.

2.1.2 Language Datasets
Stanford Sentiment Treebank (SST) (Socher et al., 2013) includes fine-grained sentiment labels for phrases in the parse trees of sentences collected from movie review data. While SST has a larger pool of annotations, we only consider the root-level annotations for comparison. Cornell Movie Review (Pang et al., 2002) is a collection of 2,000 movie-review documents and sentences labeled with respect to their overall sentiment polarity or subjective rating. The Large Movie Review dataset (Maas et al., 2011) contains text from highly polar movie reviews. Sanders Tweets Sentiment (STS) consists of 5,513 hand-classified tweets, each classified with respect to one of four topics: Microsoft, Apple, Twitter, and Google.

2.1.3 Visual and Acoustic Datasets
The Vera am Mittag (VAM) corpus consists of 12 hours of recordings of the German TV
talk-show "Vera am Mittag" (Grimm et al., 2008). This audio-visual data is labeled on a continuous-valued scale for three emotion primitives: valence, activation and dominance. VAM-Audio and VAM-Faces are subsets that contain only the acoustic and visual inputs respectively. RECOLA (Ringeval et al., 2013) consists of 9.5 hours of audio, visual, and physiological (electrocardiogram and electrodermal activity) recordings of online dyadic interactions. Mimicry (Bilakhia et al., 2015) consists of audiovisual recordings of human interactions in two situations: while discussing a political topic and while playing a role-playing game. AFEW (Dhall et al., 2012, 2015) is a dynamic temporal facial expression corpus consisting of close-to-real-world material extracted from movies.

A detailed comparison of CMU-MOSEI to the datasets in this section is presented in Table 1. CMU-MOSEI has a longer total duration as well as a larger number of data points in total. Furthermore, CMU-MOSEI has a larger variety in the number of speakers and topics. It has all three modalities provided, as well as annotations for both sentiment and emotions.

2.2 Baseline Models

Modeling multimodal language has been the subject of studies in NLP and multimodal machine learning. Notable approaches are listed as follows and indicated with a symbol for reference in the Experiments and Discussion section (Section 5).

# MFN (Memory Fusion Network) (Zadeh et al., 2018a) synchronizes multimodal sequences using a multi-view gated memory that stores intra-view and cross-view interactions through time. ∎ MARN (Multi-attention Recurrent Network) (Zadeh et al., 2018b) models intra-modal and multiple cross-modal interactions by assigning multiple attention coefficients. Intra-modal and cross-modal interactions are stored in a hybrid LSTM memory component. ∗ TFN (Tensor Fusion Network) (Zadeh et al., 2017) models inter- and intra-modal interactions by creating a multi-dimensional tensor that captures unimodal, bimodal and trimodal interactions. ◇ MV-LSTM (Multi-View LSTM) (Rajagopalan et al., 2016) is a recurrent model that designates regions inside an LSTM to different views of the data. § EF-LSTM (Early Fusion LSTM) concatenates the inputs from the different modalities at each time-step and uses that as the input to a single LSTM (Hochreiter and Schmidhuber, 1997; Graves et al., 2013; Schuster and Paliwal, 1997). In the case of unimodal models, EF-LSTM refers to a single LSTM.

We also compare to the following baseline models: † BC-LSTM (Poria et al., 2017b), ♣ C-MKL (Poria et al., 2016), ♭ DF (Nojavanasghari et al., 2016), ♡ SVM (Cortes and Vapnik, 1995; Zadeh et al., 2016b; Perez-Rosas et al., 2013; Park et al., 2014), ● RF (Breiman, 2001), THMM (Morency et al., 2011), SAL-CNN (Wang et al., 2016) and 3D-CNN (Ji et al., 2013). For language-only baseline models: ∪ CNN-LSTM (Zhou et al., 2015), RNTN (Socher et al., 2013), × DynamicCNN (Kalchbrenner et al., 2014), ⊳ DAN (Iyyer et al., 2015), ≀ DHN (Srivastava et al., 2015), ⊲ RHN (Zilly et al., 2016). For acoustic-only baseline models: AdieuNet (Trigeorgis et al., 2016) and SER-LSTM (Lim et al., 2016).
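As a concrete illustration of the simplest of these baselines, the early-fusion idea can be sketched as follows. This is a minimal PyTorch sketch, not the exact baseline configuration: the hidden size, the visual and acoustic feature dimensions, and the use of the final hidden state with a single regression head are illustrative assumptions (only the 300-dimensional GloVe embeddings are taken from the paper).

```python
# Hypothetical sketch of an EF-LSTM (early fusion) baseline: concatenate the per-timestep
# language, visual and acoustic features and feed the result to a single LSTM.
import torch
import torch.nn as nn

class EFLSTM(nn.Module):
    def __init__(self, dims=(300, 35, 74), hidden=128, num_outputs=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=sum(dims), hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_outputs)   # e.g. regression of sentiment in [-3, 3]

    def forward(self, lang, vis, aco):
        # lang/vis/aco: (batch, time, dim) sequences, already word-aligned (see Section 3.3).
        fused = torch.cat([lang, vis, aco], dim=-1)   # early fusion at every time-step
        _, (h_n, _) = self.lstm(fused)
        return self.head(h_n[-1])                     # predict from the last hidden state
```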
3 CMU-MOSEI Dataset

Expressed sentiment and emotions are two crucial components of human multimodal language. We introduce a novel dataset for multimodal sentiment and emotion recognition called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). In the following subsections, we first explain the details of the CMU-MOSEI data acquisition, followed by details of the annotation and feature extraction.

3.1 Data Acquisition

Social multimedia presents a unique opportunity for acquiring large quantities of data from various speakers and topics. Users of social multimedia websites often post their opinions in the form of monologue videos: videos with only one person in front of the camera discussing a certain topic of interest. Each video inherently contains three modalities: language in the form of spoken text, visual via perceived gestures and facial expressions, and acoustic through intonation and prosody.

During our automatic data acquisition process, videos from YouTube are analyzed for the presence of one speaker in the frame using face detection, to ensure the video is a monologue. We limit the videos to setups where the speaker's attention is exclusively towards the camera by rejecting videos with moving cameras (such as cameras mounted on bikes or selfies recorded while walking). We use a diverse set of 250 frequently used topics in online videos as the seed for acquisition.
We restrict the number of videos acquired from each channel to a maximum of 10. This resulted in discovering 1,000 identities from YouTube. The definition of an identity is a proxy based on the number of channels, since accurate identification would require quadratic manual annotation, which is infeasible for a high number of speakers. Furthermore, we limited the videos to those with manual and properly punctuated transcriptions provided by the uploader. The final pool of acquired videos included 5,000 videos, which were then manually checked for quality of video, audio and transcript by 14 expert judges over three months. The judges also annotated each video for gender and confirmed that each video is an acceptable monologue. A set of 3,228 videos remained after manual quality inspection. We also performed automatic checks on the quality of video and transcript, discussed in Section 3.3, using facial feature extraction confidence and forced alignment confidence. Furthermore, we balance the gender in the dataset using the data provided by the judges (57% male to 43% female). This constitutes the final set of raw videos in CMU-MOSEI. The topics covered in the final set of videos are shown in Figure 1 as a Venn-style word cloud (Coppersmith and Kelly, 2014), with word size proportional to the number of videos gathered for that topic. The most frequent 3 topics are reviews (16.2%), debate (2.9%) and consulting (1.8%). The remaining topics are almost uniformly distributed³.

³ More detailed analysis, such as exact percentages and the number of videos per topic, is available in the supplementary material.

Figure 1: The diversity of topics of videos in CMU-MOSEI, displayed as a word cloud. Larger words indicate more videos from that topic. The most frequent 3 topics are reviews (16.2%), debate (2.9%) and consulting (1.8%), while the remaining topics are almost uniformly distributed.

The final set of videos was then tokenized into sentences using the punctuation markers manually provided in the transcripts. Due to the high quality of the transcripts, using punctuation markers gave better sentence quality than using the Stanford CoreNLP tokenizer (Manning et al., 2014). This was verified on a set of 20 random videos by two experts. After tokenization, a set of 23,453 sentences were chosen as the final sentences in the dataset. This was achieved by restricting each identity to contribute at least 10 and at most 50 sentences to the dataset. Table 2 shows high-level summary statistics of the CMU-MOSEI dataset.

Total number of sentences                                          23,453
Total number of videos                                             3,228
Total number of distinct speakers                                  1,000
Total number of distinct topics                                    250
Average number of sentences in a video                             7.3
Average length of sentences in seconds                             7.28
Total number of words in sentences                                 447,143
Total number of unique words in sentences                          23,026
Total number of words appearing at least 10 times in the dataset   3,413
Total number of words appearing at least 20 times in the dataset   1,971
Total number of words appearing at least 50 times in the dataset   888

Table 2: Summary of CMU-MOSEI dataset statistics.

3.2 Annotation

Annotation of CMU-MOSEI closely follows the annotation of CMU-MOSI (Zadeh et al., 2016a) and the Stanford Sentiment Treebank (Socher et al., 2013). Each sentence is annotated for sentiment on a [-3,3] Likert scale of: [-3: highly negative, -2: negative, -1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. Ekman emotions (Ekman et al., 1980) of {happiness, sadness, anger, fear, disgust, surprise} are annotated on a [0,3] Likert scale for the presence of emotion x: [0: no evidence of x, 1: weakly x, 2: x, 3: highly x]. The annotation was carried out by 3 crowdsourced judges from the Amazon Mechanical Turk platform. To avert implicitly biasing the judges and to capture the raw perception of the crowd, we avoided extreme annotation training and instead provided the judges with a 5-minute training video on how to use the annotation system. All the annotations were carried out only by master workers with a higher than 98% approval rate, to assure high-quality annotations⁴.

⁴ Extensive statistics of the dataset, including the crawling mechanism, the annotation UI, the training procedure for the workers and agreement scores, are available in the supplementary material on arXiv.
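To make the label format concrete, here is a small sketch of how the three judges' Likert ratings could be turned into per-sentence labels. Averaging over judges and treating an average intensity above zero as "emotion present" are assumptions for illustration, not the authors' documented aggregation protocol.

```python
# Hypothetical aggregation of the three crowdsourced ratings into sentence-level labels.
EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

def aggregate(judge_sentiments, judge_emotions):
    """judge_sentiments: three ratings on the [-3, 3] Likert scale.
    judge_emotions: three dicts mapping each emotion to a [0, 3] intensity rating."""
    sentiment = sum(judge_sentiments) / len(judge_sentiments)
    intensity = {e: sum(j[e] for j in judge_emotions) / len(judge_emotions) for e in EMOTIONS}
    # Presence of an emotion (as counted in the Figure 2 histogram) assumed to mean intensity > 0.
    present = {e: v > 0 for e, v in intensity.items()}
    return sentiment, intensity, present

ratings = [{**dict.fromkeys(EMOTIONS, 0), "happiness": 3},
           {**dict.fromkeys(EMOTIONS, 0), "happiness": 2},
           {**dict.fromkeys(EMOTIONS, 0), "happiness": 2, "surprise": 1}]
print(aggregate([2, 3, 2], ratings))   # sentiment 2.33, happiness intensity 2.33, surprise 0.33
```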
Figure 2 shows the distribution of sentiment and emotions in the CMU-MOSEI dataset. The distribution shows a slight shift in favor of positive sentiment, which is similar to the distributions of CMU-MOSI and SST. We believe this reflects an implicit bias of online opinions being slightly shifted towards the positive, since it is also present in CMU-MOSI. The emotion histogram shows different prevalences for different emotions. The most common category is happiness, with more than 12,000 positive sample points. The least prevalent emotion is fear, with almost 1,900 positive sample points, which is still an acceptable number for machine learning studies.

Figure 2: Distribution of sentiment and emotions in the CMU-MOSEI dataset (two histograms: one over the five sentiment categories from negative to positive, and one over the six emotions). The distribution shows a natural skew towards more frequently expressed emotions. However, the least frequent emotion, fear, still has 1,900 data points, which is an acceptable number for machine learning studies.

3.3 Extracted Features

Data points in CMU-MOSEI come in video format with one speaker in front of the camera. The extracted features for each modality are as follows (for the other benchmarks we extract the same features):

Language: All videos have manual transcriptions. GloVe word embeddings (Pennington et al., 2014) were used to extract word vectors from the transcripts. Words and audio are aligned at the phoneme level using the P2FA forced alignment model (Yuan and Liberman, 2008). Following this, the visual and acoustic modalities are aligned to the words by interpolation. Since the utterance duration of words in English is usually short, this interpolation does not lead to substantial information loss.

Visual: Frames are extracted from the full videos at 30Hz. The bounding box of the face is extracted using the MTCNN face detection algorithm (Zhang et al., 2016). We extract facial action units through the Facial Action Coding System (FACS) (Ekman et al., 1980). Extracting these action units allows for accurate tracking and understanding of facial expressions (Baltrušaitis et al., 2016). We also extract a set of six basic emotions purely from static faces using Emotient FACET (iMotions, 2017). MultiComp OpenFace (Baltrušaitis et al., 2016) is used to extract the set of 68 facial landmarks, 20 facial shape parameters, facial HoG features, head pose, head orientation and eye gaze (Baltrušaitis et al., 2016). Finally, we extract face embeddings from commonly used facial recognition models such as DeepFace (Taigman et al., 2014), FaceNet (Schroff et al., 2015) and SphereFace (Liu et al., 2017).

Acoustic: We use the COVAREP software (Degottex et al., 2014) to extract acoustic features including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features (Drugman and Alwan, 2011), glottal source parameters (Drugman et al., 2012; Alku et al., 1997, 2002), peak slope parameters and maxima dispersion quotients (Kane and Gobl, 2013). All extracted features are related to emotions and the tone of speech.
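A minimal sketch of the word-level alignment step described above, assuming that the interpolation amounts to averaging the visual and acoustic frames that fall inside each word's forced-alignment time span; the exact resampling performed by the CMU Multimodal Data SDK may differ.

```python
# Hypothetical sketch: align frame-level features to word boundaries by averaging the frames
# that fall inside each word's (start, end) interval from forced alignment.
import numpy as np

def align_to_words(frame_feats, frame_times, word_spans):
    """frame_feats: (num_frames, dim); frame_times: (num_frames,) in seconds;
    word_spans: list of (start, end) times in seconds, one per word."""
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:  # very short word with no frame inside: fall back to the nearest frame
            nearest = np.argmin(np.abs(frame_times - (start + end) / 2))
            aligned.append(frame_feats[nearest])
    return np.stack(aligned)            # (num_words, dim): one feature vector per word

# Example: 3 seconds of 30 Hz visual features (35-dim is an assumed size) and three words.
feats = np.random.randn(90, 35)
times = np.arange(90) / 30.0
words = [(0.0, 0.4), (0.4, 1.1), (1.1, 2.9)]
print(align_to_words(feats, times, words).shape)   # (3, 35)
```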
4 Multimodal Fusion Study

From a linguistics perspective, understanding the interactions between the language, visual and audio modalities in multimodal language is a fundamental research problem. While previous works have been successful with respect to accuracy metrics, they have not created new insights into how fusion is performed, in terms of which modalities are related and how modalities engage in an interaction during fusion. Specifically, to understand the fusion process one must first understand the n-modal dynamics (Zadeh et al., 2017). n-modal dynamics state that there exist different combinations of modalities and that all of these combinations must be captured to better understand multimodal language. In this paper, we define building the n-modal dynamics as a hierarchical process and propose a new fusion model called the Dynamic Fusion Graph (DFG). DFG is easily interpretable through what we call efficacies in its graph connections. To utilize this new fusion model in a multimodal language framework, we build upon the Memory Fusion Network (MFN) by replacing the original fusion component in the MFN with our DFG. We call the resulting model the Graph Memory Fusion Network (Graph-MFN). Once the model is trained end to end, we analyze the efficacies in the DFG to study the fusion mechanism learned for the modalities in multimodal language.
In addition to being an interpretable fusion mechanism, Graph-MFN also outperforms previously proposed state-of-the-art models for sentiment analysis and emotion recognition on CMU-MOSEI.

Figure 3: The structure of the Dynamic Fusion Graph (DFG) for the three modalities {(l)anguage, (v)ision, (a)coustic}. Dashed lines in the DFG show the dynamic connections between vertices, controlled by the efficacies (α).

Figure 4: Overview of the Graph Memory Fusion Network (Graph-MFN) pipeline. Graph-MFN replaces the fusion block in MFN with a Dynamic Fusion Graph (DFG). For a description of the variables and the memory formulation, please refer to the original Memory Fusion Network paper (Zadeh et al., 2018a).

4.1 Dynamic Fusion Graph

In this section we discuss the internal structure of the proposed Dynamic Fusion Graph (DFG) neural model (Figure 3). DFG has the following properties: 1) it explicitly models the n-modal interactions, 2) it does so with an efficient number of parameters (as opposed to previous approaches such as Tensor Fusion (Zadeh et al., 2017)), and 3) it can dynamically alter its structure and choose the proper fusion graph based on the importance of each n-modal dynamics during inference. We assume the set of modalities to be M = {(l)anguage, (v)ision, (a)coustic}. The unimodal dynamics are denoted as {l}, {v}, {a}, the bimodal dynamics as {l, v}, {v, a}, {l, a}, and the trimodal dynamics as {l, v, a}. These dynamics are in the form of latent representations and are each considered a vertex inside a graph G = (V, E), with V the set of vertices and E the set of edges. A directional neural connection is established between two vertices v_i and v_j only if v_i ⊂ v_j. For example, {l} ⊂ {l, v}, which results in a connection between ⟨language⟩ and ⟨language, vision⟩. This connection is denoted as an edge e_ij. D_j takes as input all v_i that satisfy the connection rule above for v_j.

We define an efficacy for each edge e_ij, denoted as α_ij; v_i is multiplied by α_ij before being used as input to D_j. Each α is a sigmoid-activated probability neuron which indicates how strong or weak the connection between v_i and v_j is. The αs are the main source of interpretability in DFG. The vector of all αs is inferred using a deep neural network D_α, which takes as input the singleton vertices in V (l, v, and a). We leave it to the supervised training objective to learn the parameters of D_α and make good use of the efficacies, thus dynamically controlling the structure of the graph. The singleton vertices are chosen for this purpose since they have no incoming edges and thus no efficacy associated with them (no efficacy is needed to infer the singleton vertices). The same singleton vertices l, v, and a are the inputs to the DFG. In the next section we discuss how these inputs are given to the DFG. All vertices are connected to the output vertex T_t of the network via edges scaled by their respective efficacies. The overall structure of the vertices, edges and respective efficacies is shown in Figure 3. There are a total of 8 vertices (counting the output vertex), 19 edges and consequently 19 efficacies.
example, {l} ⊂ {l, v} which results in a connection
between < language > and < language, vision >. To test the performance of DFG, we use a similar
This connection is denoted as an edge eij . Dj takes recurrent architecture to Memory Fusion Network
as input all vi that satisfy the neural connection (MFN). MFN is a recurrent neural model with three
formula above for vj . main components 1) System of LSTMs: a set of
We define an efficacy for each edge eij denoted parallel LSTMs with each LSTM modeling a sin-
as αij . vi is multiplied by αij before being used as gle modality. 2) Delta-memory Attention Network
input to Dj . Each α is a sigmoid activated probabil- is the component that performs multimodal fusion

2241
by assigning coefficients to highlight cross-modal dynamics. 3) Multi-view Gated Memory: a component that stores the output of multimodal fusion. We replace the Delta-memory Attention Network with DFG and refer to the modified model as the Graph Memory Fusion Network (Graph-MFN). Figure 4 shows the overall architecture of Graph-MFN.

Similar to MFN, Graph-MFN employs a system of LSTMs for modeling the individual modalities. c_l, c_v, and c_a represent the memories of the LSTMs for the language, vision and acoustic modalities respectively. D_m, m ∈ {l, v, a}, is a fully connected deep neural network that takes in h^m_[t-1,t], the LSTM representation across two consecutive timestamps, which allows the network to track changes in the memory dimensions across time. The outputs of D_l, D_v and D_a are the singleton vertices for the DFG. The DFG models cross-modal interactions and encodes the cross-modal representations in its output vertex T_t for storage in the Multi-view Gated Memory u_t. The Multi-view Gated Memory functions using a network D_u that transforms T_t into a proposed memory update û_t. γ_1 and γ_2 are the Multi-view Gated Memory's retain and update gates respectively and are learned using the networks D_γ1 and D_γ2. Finally, a network D_z transforms T_t into a multimodal representation z_t to update the system of LSTMs. The output of Graph-MFN in all the experiments is the output of each LSTM, h^m_T, together with the contents of the Multi-view Gated Memory at time T (the last recurrence timestep), u_T. This output is subsequently connected to a classification or regression layer for the final prediction (for sentiment and emotion recognition).
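Written out, the memory path just described can be summarized as follows. This is a sketch rather than the authors' exact formulation: the assumption that the gating networks D_γ1, D_γ2 and the update network D_u all read the DFG output T_t, and the tanh on the proposed update, follow the original MFN design and should be checked against Zadeh et al. (2018a).

```latex
% Sketch of the Multi-view Gated Memory path in Graph-MFN (notation as above).
\hat{u}_t = D_u(\mathcal{T}_t), \qquad
\gamma_1 = \sigma\!\big(D_{\gamma_1}(\mathcal{T}_t)\big), \qquad
\gamma_2 = \sigma\!\big(D_{\gamma_2}(\mathcal{T}_t)\big), \\
u_t = \gamma_1 \odot u_{t-1} + \gamma_2 \odot \tanh(\hat{u}_t), \qquad
z_t = D_z(\mathcal{T}_t)
```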
5 Experiments and Discussion

In our experiments, we seek to evaluate how modalities interact during multimodal fusion by studying the efficacies of the DFG through time.

Table 3 shows the results on CMU-MOSEI. Accuracy is reported as Ax, where x is the number of sentiment classes, together with the F1 measure. For regression we report MAE and correlation (r). For emotion recognition, due to the natural imbalance across the various emotions, we use weighted accuracy (Tong et al., 2017) and the F1 measure. Graph-MFN shows superior performance in sentiment analysis and competitive performance in emotion recognition. Therefore, DFG is both an effective and an interpretable model for multimodal fusion.

Dataset: MOSEI Sentiment | MOSEI Emotions
Task: Sentiment | Anger | Disgust | Fear | Happy | Sad | Surprise
Metric: A2 F1 A5 A7 MAE r | WA F1 | WA F1 | WA F1 | WA F1 | WA F1 | WA F1
LANGUAGE
SOTA2 74.1§ 74.1⊳ 43.1≀ 42.9≀ 0.75§ 0.46≀ | 56.0∪ 71.0× | 59.0§ 67.1⊳ | 56.2§ 79.7§ | 53.0⊳ 44.1⊳ | 53.8≀ 49.9≀ | 53.2× 70.0⊳
SOTA1 74.3⊳ 74.1§ 43.2§ 43.2§ 0.74⊳ 0.47§ | 56.6≀ 71.8● | 64.0⊳ 72.6● | 58.8× 89.8● | 54.0§ 47.0§ | 54.0§ 61.2● | 54.3⊳ 85.3●
VISUAL
SOTA2 73.8§ 73.5§ 42.5⊳ 42.5⊳ 0.78≀ 0.41♡ | 54.4≀ 64.6§ | 54.4♡ 71.5⊲ | 51.3§ 78.4§ | 53.4≀ 40.8§ | 54.3⊳ 60.8● | 51.3⊳ 84.2§
SOTA1 73.9⊳ 73.7⊳ 42.7≀ 42.7≀ 0.78§ 0.43≀ | 60.0§ 71.0● | 60.3≀ 72.4● | 64.2♡ 89.8● | 57.4● 49.3● | 57.7§ 61.5⊲ | 51.8§ 85.4●
ACOUSTIC
SOTA2 74.2≀ 73.8△ 42.1△ 42.1△ 0.78⊳ 0.43§ | 55.5⊲ 51.8△ | 58.9⊳ 72.4● | 58.5⊳ 89.8● | 57.2∩ 55.5∩ | 58.9⊲ 65.9⊲ | 52.2♡ 83.6∩
SOTA1 74.2△ 73.9≀ 42.4∩ 42.4∩ 0.74∩ 0.43⊳ | 56.4△ 71.9● | 60.9§ 72.4● | 62.7§ 89.8⊲ | 61.5§ 61.4§ | 62.0∩ 69.2∩ | 54.3⊲ 85.4●
MULTIMODAL
SOTA2 76.0# 76.0# 44.7† 44.6† 0.72∗ 0.52∗ | 56.0◇ 71.4♭ | 65.2# 71.4# | 56.7§ 89.9# | 57.8§ 66.6∗ | 58.9∗ 60.8# | 52.2∗ 85.4●
SOTA1 76.4◇ 76.4◇ 44.8∗ 44.7∗ 0.72# 0.52# | 60.5∗ 72.0● | 67.0♭ 73.2● | 60.0♡ 89.9● | 66.5∗ 71.0∎ | 59.2§ 61.8● | 53.3# 85.4#
Graph-MFN 76.9 77.0 45.1 45.0 0.71 0.54 | 62.6 72.8 | 69.1 76.6 | 62.0 89.9 | 66.3 66.3 | 60.4 66.9 | 53.7 85.5

Table 3: Results for sentiment analysis and emotion recognition on the MOSEI dataset (reported results are as of 5/11/2018; please check the CMU Multimodal Data SDK GitHub repository for the current state of the art and new features for CMU-MOSEI and other datasets). SOTA1 and SOTA2 refer to the previous best and second-best state-of-the-art models (from Section 2) respectively. Compared to the baselines, Graph-MFN achieves superior performance in sentiment analysis and competitive performance in emotion recognition. For all metrics, higher values indicate better performance, except for MAE, where lower values indicate better performance.
the cross-modal representations in its output vertex interpretable model for multimodal fusion.
Tt for storage in the Multi-view Gated Memory To better understand the internal fusion mecha-
ut . The Multi-view Gated Memory functions using nism between modalities, we visualize the behavior
a network Du that transforms Tt into a proposed of the learned DFG efficacies in Figure 5 for vari-
memory update ût . γ1 and γ2 are the Multi-view ous cases (deep red denotes high efficacy and deep
Gated Memory’s retain and update gates respec- blue denotes low efficacy).
tively and are learned using networks Dγ1 and Dγ2 . Multimodal Fusion has a Volatile Nature:
Finally, a network Dz transforms Tt into a multi- The first observation is that the structure of the
modal representation zt to update the system of DFG is changing case by case and for each case
LSTMs. The output of Graph-MFN in all the ex- over time. As a result, the model seems to be selec-
periments is the output of each LSTM hm T as well tively prioritizing certain dynamics over the others.
as contents of the Multi-view Gated Memory at For example, in case (I) where all modalities are
time T (last recurrence timestep), uT . This output informative, all efficacies seem to be high, imply-

2242
information in the unimodal, bimodal and trimodal interactions. However, in cases (II) and (III), where the visual modality is either uninformative or contradictory, the efficacies of v → l,v, v → l,a,v and l,a → l,a,v are reduced, since no meaningful interactions involve the visual modality.

Figure 5: Visualization of DFG efficacies across time. The efficacies (and thus the DFG structure) change over time as DFG is exposed to new information. DFG is able to choose which n-modal dynamics to rely on. It also learns priors about human communication, since certain efficacies (and thus edges in DFG) remain unchanged across time and across data points. (The figure shows the 19 efficacies over time for four example videos: (I) vision and acoustic modalities informative, (II) vision modality uninformative, (III) language modality uninformative, and (IV) acoustic modality uninformative, together with each example's transcript, visual behavior and acoustic tone.)

Priors in fusion: Certain efficacies remain unchanged across cases and across time. These are priors of human multimodal language that DFG learns. For example, the model always seems to prioritize fusion between language and audio in (l → l,a) and (a → l,a). Subsequently, DFG gives low values to efficacies that rely unilaterally on language or audio alone: the (l → T) and (a → T) efficacies seem to be consistently low. On the other hand, the visual modality appears to have a partially isolated behavior. In the presence of informative visual information, the model increases the efficacy of (v → T), although the values of the other visual efficacies also increase.

Trace of multimodal fusion: We trace the dominant path that every modality undergoes during fusion. 1) Language tends to first fuse with audio via (l → l,a), and the language and acoustic modalities together engage in higher-level fusions such as (l,a → l,a,v); intuitively, this is aligned with the close ties between language and audio through word intonations. 2) The visual modality seems to engage in fusion only if it contains meaningful information: in cases (I) and (IV), all the paths involving the visual modality are relatively active, while in cases (II) and (III) the paths involving the visual modality have low efficacies. 3) The acoustic modality is mostly present in fusion with the language modality. However, unlike language, the acoustic modality also appears to fuse with the visual modality if both modalities are meaningful, as in case (I).

An interesting observation is that in almost all cases the efficacies of the unimodal connections to the terminal T are low, implying that T prefers not to rely on just one modality. Also, DFG always prefers to perform fusion between language and audio, as in most cases both l → l,a and a → l,a have high efficacies; intuitively, in most natural scenarios the language and acoustic modalities are highly aligned. Both of these cases show unchanging behaviors which we believe DFG has learned as natural priors of the human communicative signal.

With these observations, we believe that DFG has successfully learned how to manage its internal structure to model human communication.

6 Conclusion

In this paper we presented the largest dataset of multimodal sentiment analysis and emotion recognition, called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). CMU-MOSEI consists of 23,453 annotated sentences from more than 1,000 online speakers and 250 different topics. The dataset expands the horizons of human multimodal language studies in NLP. One such study was presented in this paper, where we analyzed the structure of multimodal fusion in sentiment analysis and emotion recognition.
This was done using a novel interpretable fusion mechanism called the Dynamic Fusion Graph (DFG). In our studies we investigated how the modalities interact with each other using the built-in efficacies of DFG. Aside from the analysis of fusion, DFG was trained in the Memory Fusion Network pipeline and showed superior performance in sentiment analysis and competitive performance in emotion recognition.

Acknowledgments

This material is based upon work partially supported by the National Science Foundation (Award #1833355) and Oculus VR. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Oculus VR, and no official endorsement should be inferred.

References

Paavo Alku, Tom Bäckström, and Erkki Vilkman. 2002. Normalized amplitude quotient for parametrization of the glottal flow. The Journal of the Acoustical Society of America 112(2):701–710.

Paavo Alku, Helmer Strik, and Erkki Vilkman. 1997. Parabolic spectral parameter—a new method for quantification of the glottal flow. Speech Communication 22(1):67–79.

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406.

Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10.

Sanjay Bilakhia, Stavros Petridis, Anton Nijholt, and Maja Pantic. 2015. The MAHNOB mimicry database: A database of naturalistic human interactions. Pattern Recognition Letters 66:52–61.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5–32.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation 42(4):335–359.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI 2017), pages 163–171.

Glen Coppersmith and Erin Kelly. 2014. Dynamic wordclouds and vennclouds for exploratory data analysis. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 22–29.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3):273–297.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP—a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964.

A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. 2012. Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia 19(3):34–41.

Abhinav Dhall, O. V. Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. 2015. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (ICMI '15), pages 423–426.

Thomas Drugman and Abeer Alwan. 2011. Joint robust voicing detection and pitch estimation based on residual harmonics. In Interspeech, pages 1973–1976.

Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, and Thierry Dutoit. 2012. Detection of glottal closure instants from speech signals: A quantitative review. IEEE Transactions on Audio, Speech, and Language Processing 20(3):994–1006.

Paul Ekman, Wallace V. Freisen, and Sonia Ancoli. 1980. Facial signs of emotional experience. Journal of Personality and Social Psychology 39(6):1125.

A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649.

Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. 2008. The Vera am Mittag German audio-visual emotional speech database. In ICME, pages 865–868.
Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmerman. 2018. MEMN: Multimodal emotional memory network for emotion recognition in dyadic conversational videos. In NAACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

iMotions. 2017. Facial expression analysis. goo.gl/1rh1JN.

Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL, pages 1681–1691.

Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):221–231.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.

John Kane and Christer Gobl. 2013. Wavelet maxima dispersion for breathy to tense voice discrimination. IEEE Transactions on Audio, Speech, and Language Processing 21(6):1170–1179.

Wootaek Lim, Daeyoung Jang, and Taejin Lee. 2016. Speech emotion recognition using convolutional and recurrent neural networks. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–4.

Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. 2017. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interaction, pages 169–176.

Friedrich Max Müller. 1866. Lectures on the Science of Language: Delivered at the Royal Institution of Great Britain in April, May, & June 1861, volume 1. Longmans, Green.

Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016), pages 284–288.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79–86.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction (ICMI '14), pages 50–57.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Association for Computational Linguistics (ACL).

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017a. Context-dependent sentiment analysis in user-generated videos. In Association for Computational Linguistics.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017b. Context-dependent sentiment analysis in user-generated videos. In Association for Computational Linguistics.

Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 439–448.

Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, and Roland Goecke. 2016. Extending long short-term memory for multi-view structured learning. In European Conference on Computer Vision.
Fabien Ringeval, Andreas Sonderegger, Jürgen S. Sauer, and Denis Lalanne. 2013. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In FG, pages 1–8.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, Christopher Potts, et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rupesh K. Srivastava, Klaus Greff, and Juergen Schmidhuber. 2015. Training very deep networks. In Advances in Neural Information Processing Systems 28, pages 2377–2385.

Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pages 1701–1708.

Edmund Tong, Amir Zadeh, Cara Jones, and Louis-Philippe Morency. 2017. Combating human trafficking with multimodal deep models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1547–1556.

George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A. Nicolaou, Björn Schuller, and Stefanos Zafeiriou. 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5200–5204.

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P. Xing. 2016. Select-additive learning: Improving cross-individual generalization in multimodal sentiment analysis. arXiv preprint arXiv:1609.05244.

Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28(3):46–53.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America 123(5):3878.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Empirical Methods in Natural Language Processing (EMNLP).

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016a. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016b. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems 31(6):82–88.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10):1499–1503.

Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis C. M. Lau. 2015. A C-LSTM neural network for text classification. CoRR abs/1511.08630.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.