Multimodal Language Analysis in The Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
250 topics. Each video segment contains a manual transcription aligned with the audio at the phoneme level. All the videos are gathered from online video-sharing websites¹. The dataset is currently a part of the CMU Multimodal Data SDK and is freely available to the scientific community through GitHub².

Our second contribution is an interpretable fusion model called the Dynamic Fusion Graph (DFG), which we use to study the nature of cross-modal dynamics in multimodal language. DFG contains built-in efficacies that are directly related to how modalities interact. These efficacies are visualized and studied in detail in our experiments. Aside from interpretability, DFG achieves superior performance compared to previously proposed models for multimodal sentiment and emotion recognition on CMU-MOSEI.

Table 1: Comparison of CMU-MOSEI with the datasets discussed in this section. #S: number of annotated data points; #Sp: number of distinct speakers; Mod: modalities (l: language, v: visual, a: acoustic); Sent/Emo: availability of sentiment/emotion annotations; TL: total length (hh:mm:ss).

Dataset       #S      #Sp    Mod        Sent  Emo  TL (hh:mm:ss)
CMU-MOSEI     23,453  1,000  {l, v, a}  yes   yes  65:53:36
CMU-MOSI      2,199   98     {l, v, a}  yes   no   02:36:17
ICT-MMMO      340     200    {l, v, a}  yes   no   13:58:29
YouTube       300     50     {l, v, a}  yes   no   00:29:41
MOUD          400     101    {l, v, a}  yes   no   00:59:00
SST           11,855  –      {l}        yes   no   –
Cornell       2,000   –      {l}        yes   no   –
Large Movie   25,000  –      {l}        yes   no   –
STS           5,513   –      {l}        yes   no   –
IEMOCAP       10,000  10     {l, v, a}  no    yes  11:28:12
SAL           23      4      {v, a}     no    yes  11:00:00
VAM           499     20     {v, a}     no    yes  12:00:00
VAM-faces     1,867   20     {v}        no    yes  –
HUMAINE       50      4      {v, a}     no    yes  04:11:00
RECOLA        46      46     {v, a}     no    yes  03:50:00
SEWA          538     408    {v, a}     no    yes  04:39:00
SEMAINE       80      20     {v, a}     no    yes  06:30:00
AFEW          1,645   330    {v, a}     no    yes  02:28:03
AM-FED        242     242    {v}        no    yes  03:20:25
Mimicry       48      48     {v, a}     no    yes  11:00:00
AFEW-VA       600     240    {v, a}     no    yes  00:40:00
show “Vera am Mittag” (Grimm et al., 2008). This audio-visual data is labeled on a continuous-valued scale for three emotion primitives: valence, activation and dominance. VAM-Audio and VAM-Faces are subsets that contain only the acoustic and visual inputs respectively. RECOLA (Ringeval et al., 2013) consists of 9.5 hours of audio, visual, and physiological (electrocardiogram and electrodermal activity) recordings of online dyadic interactions. Mimicry (Bilakhia et al., 2015) consists of audiovisual recordings of human interactions in two situations: while discussing a political topic and while playing a role-playing game. AFEW (Dhall et al., 2012, 2015) is a dynamic temporal facial expression corpus extracted from movies under close-to-real-world conditions.

A detailed comparison of CMU-MOSEI to the datasets in this section is presented in Table 1. CMU-MOSEI has a longer total duration as well as a larger number of data points in total. Furthermore, CMU-MOSEI covers a larger variety of speakers and topics. It provides all three modalities, as well as annotations for both sentiment and emotions.

2.2 Baseline Models

Modeling multimodal language has been the subject of studies in NLP and multimodal machine learning. Notable approaches are listed below and indicated with a symbol for reference in the Experiments and Discussion section (Section 5). # MFN (Memory Fusion Network) (Zadeh et al., 2018a) synchronizes multimodal sequences using a multi-view gated memory that stores intra-view and cross-view interactions through time. ∎ MARN (Multi-attention Recurrent Network) (Zadeh et al., 2018b) models intra-modal and multiple cross-modal interactions by assigning multiple attention coefficients. Intra-modal and cross-modal interactions are stored in a hybrid LSTM memory component. ∗ TFN (Tensor Fusion Network) (Zadeh et al., 2017) models inter- and intra-modal interactions by creating a multi-dimensional tensor that captures unimodal, bimodal and trimodal interactions. ◇ MV-LSTM (Multi-View LSTM) (Rajagopalan et al., 2016) is a recurrent model that designates regions inside an LSTM to different views of the data. § EF-LSTM (Early Fusion LSTM) concatenates the inputs from the different modalities at each time step and uses the result as the input to a single LSTM (Hochreiter and Schmidhuber, 1997; Graves et al., 2013; Schuster and Paliwal, 1997); a minimal code sketch of this early-fusion scheme is given at the end of this subsection. In the case of unimodal models, EF-LSTM refers to a single LSTM.

We also compare to the following baseline models: † BC-LSTM (Poria et al., 2017b), ♣ C-MKL (Poria et al., 2016), ♭ DF (Nojavanasghari et al., 2016), ♡ SVM (Cortes and Vapnik, 1995; Zadeh et al., 2016b; Perez-Rosas et al., 2013; Park et al., 2014), ● RF (Breiman, 2001), THMM (Morency et al., 2011), SAL-CNN (Wang et al., 2016), and 3D-CNN (Ji et al., 2013). For language-only baseline models: ∪ CNN-LSTM (Zhou et al., 2015), RNTN (Socher et al., 2013), × DynamicCNN (Kalchbrenner et al., 2014), ⊳ DAN (Iyyer et al., 2015), ≀ DHN (Srivastava et al., 2015), ⊲ RHN (Zilly et al., 2016). For acoustic-only baseline models: AdieuNet (Trigeorgis et al., 2016) and SER-LSTM (Lim et al., 2016).
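As a concrete reference point for the fusion models above, the following is a minimal sketch of the early-fusion scheme, not the original implementation used in the experiments; the feature dimensionalities and hidden size are placeholders, and the three modality sequences are assumed to already be word-aligned as described in Section 3.3.

```python
import torch
import torch.nn as nn

class EarlyFusionLSTM(nn.Module):
    """EF-LSTM sketch: concatenate language, visual and acoustic features
    at every time step and run a single LSTM over the fused sequence."""

    def __init__(self, dims=(300, 35, 74), hidden=128, num_outputs=1):
        super().__init__()
        d_l, d_v, d_a = dims                              # placeholder feature sizes
        self.lstm = nn.LSTM(d_l + d_v + d_a, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_outputs)

    def forward(self, lang, vis, acou):
        # lang, vis, acou: (batch, time, dim), word-aligned sequences
        fused = torch.cat([lang, vis, acou], dim=-1)      # early fusion by concatenation
        _, (h_n, _) = self.lstm(fused)                    # final hidden state
        return self.out(h_n[-1])                          # sentiment score or emotion logits

# Example with an 8-sequence batch of 20 word-aligned time steps.
model = EarlyFusionLSTM()
pred = model(torch.randn(8, 20, 300), torch.randn(8, 20, 35), torch.randn(8, 20, 74))
print(pred.shape)  # torch.Size([8, 1])
```

The multimodal baselines above differ mainly in replacing this naive concatenation with structured fusion (tensors, attention coefficients, or gated memories).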
3 CMU-MOSEI Dataset

Expressed sentiment and emotions are two crucial factors in human multimodal language. We introduce a novel dataset for multimodal sentiment and emotion recognition called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). In the following subsections, we first explain the details of the CMU-MOSEI data acquisition, followed by details of the annotation and feature extraction.

3.1 Data Acquisition

Social multimedia presents a unique opportunity for acquiring large quantities of data from various speakers and topics. Users of these social multimedia websites often post their opinions in the form of monologue videos: videos with only one person in front of the camera discussing a certain topic of interest. Each video inherently contains three modalities: language in the form of spoken text, visual via perceived gestures and facial expressions, and acoustic through intonations and prosody.

During our automatic data acquisition process, videos from YouTube are analyzed for the presence of a single speaker in the frame using face detection, to ensure the video is a monologue. We limit the videos to setups where the speaker's attention is exclusively towards the camera by rejecting videos with moving cameras (such as cameras mounted on bikes or selfies recorded while walking). We use a diverse set of 250 frequently used topics in online videos as the seed for acquisition. We restrict the number of videos acquired from each channel to a maximum of 10. This resulted in discovering 1,000 identities from YouTube. The definition of an identity is a proxy for the number of channels, since accurate identification would require quadratic manual annotations, which is infeasible for a high number of speakers. Furthermore, we limited the videos to those with manual and properly punctuated transcriptions provided by the uploader. The final pool of acquired videos included 5,000 videos, which were then manually checked for quality of video, audio and transcript by 14 expert judges over three months. The judges also annotated each video for gender and confirmed that each video is an acceptable monologue. A set of 3,228 videos remained after manual quality inspection. We also performed automatic checks on the quality of video and transcript, discussed in Section 3.3, using facial feature extraction confidence and forced alignment confidence. Furthermore, we balance the gender in the dataset using the data provided by the judges (57% male to 43% female). This constitutes the final set of raw videos in CMU-MOSEI. The topics covered in the final set of videos are shown in Figure 1 as a Venn-style word cloud (Coppersmith and Kelly, 2014), with word size proportional to the number of videos gathered for that topic. The most frequent 3 topics are reviews (16.2%), debate (2.9%) and consulting (1.8%); the remaining topics are almost uniformly distributed³.

[Figure 1: The diversity of topics of videos in CMU-MOSEI, displayed as a word cloud. Larger words indicate more videos from that topic. The most frequent 3 topics are reviews (16.2%), debate (2.9%) and consulting (1.8%), while the remaining topics are almost uniformly distributed.]

The final set of videos is then tokenized into sentences using the punctuation markers manually provided by the transcripts. Due to the high quality of the transcripts, using punctuation markers showed better sentence quality than using the Stanford CoreNLP tokenizer (Manning et al., 2014). This was verified on a set of 20 random videos by two experts. After tokenization, a set of 23,453 sentences was chosen as the final sentences in the dataset. This was achieved by restricting each identity to contribute at least 10 and at most 50 sentences to the dataset. Table 2 shows high-level summary statistics of the CMU-MOSEI dataset.

Table 2: Summary of CMU-MOSEI dataset statistics.

Total number of distinct speakers                                      1,000
Total number of distinct topics                                          250
Average number of sentences in a video                                   7.3
Average length of sentences in seconds                                  7.28
Total number of unique words in the sentences                         23,026
Total number of words appearing at least 10 times in the dataset       3,413
Total number of words appearing at least 50 times in the dataset         888

3.2 Annotation

Annotation of CMU-MOSEI closely follows the annotation of CMU-MOSI (Zadeh et al., 2016a) and the Stanford Sentiment Treebank (Socher et al., 2013). Each sentence is annotated for sentiment on a [-3, 3] Likert scale: [−3: highly negative, −2: negative, −1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. Ekman emotions (Ekman et al., 1980) of {happiness, sadness, anger, fear, disgust, surprise} are annotated on a [0, 3] Likert scale for the presence of emotion x: [0: no evidence of x, 1: weakly x, 2: x, 3: highly x]. The annotation was carried out by 3 crowdsourced judges from the Amazon Mechanical Turk platform. To avoid implicitly biasing the judges and to capture the raw perception of the crowd, we avoided extensive annotation training and instead provided the judges with a 5-minute training video on how to use the annotation system. All the annotations were carried out only by master workers with an approval rate higher than 98%, to assure high-quality annotations⁴.

³ More detailed analysis, such as exact percentages and the number of videos per topic, is available in the supplementary material.
⁴ Extensive statistics of the dataset, including the crawling mechanism, the annotation UI, the training procedure for the workers, and agreement scores, are available in the supplementary material on arXiv.
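To make the label format concrete, the sketch below shows one plausible way to turn the three judges' ratings into task labels: averaging the [-3, 3] sentiment ratings and binning them into binary or 7-class targets, and averaging the [0, 3] emotion ratings. The function names, the simple averaging, and the thresholding choices are illustrative assumptions rather than the exact aggregation procedure used for the dataset.

```python
import numpy as np

def aggregate_sentiment(ratings):
    """ratings: list of three judges' sentiment scores in [-3, 3]."""
    score = float(np.mean(ratings))          # consensus by simple averaging (assumed)
    binary = int(score >= 0)                 # non-negative vs. negative (assumed split)
    seven_class = int(np.clip(round(score), -3, 3)) + 3   # 0..6 bucket for 7-class tasks
    return score, binary, seven_class

def aggregate_emotions(ratings_per_emotion):
    """ratings_per_emotion: dict emotion -> list of three [0, 3] presence ratings."""
    return {emo: float(np.mean(r)) for emo, r in ratings_per_emotion.items()}

# Example: one sentence rated by three judges.
sent_score, sent_binary, sent_7cls = aggregate_sentiment([2, 1, 2])
emotions = aggregate_emotions({"happiness": [2, 3, 2], "fear": [0, 0, 1]})
print(sent_score, sent_binary, sent_7cls, emotions)
```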
Figure 2 shows the distribution of sentiment and emotions in the CMU-MOSEI dataset. The distribution shows a slight shift in favor of positive sentiment, which is similar to the distributions of CMU-MOSI and SST. We believe that this is an implicit bias of online opinions being slightly shifted towards the positive, since it is also present in CMU-MOSI. The emotion histogram shows different prevalences for different emotions. The most common category is happiness, with more than 12,000 positive sample points. The least prevalent emotion is fear, with almost 1,900 positive sample points, which is still an acceptable number for machine learning studies.

[Figure 2: Distribution of sentiment (negative, weakly negative, neutral, weakly positive, positive) and emotions (happiness, sadness, anger, disgust, surprise, fear) in the CMU-MOSEI dataset. The distribution shows a natural skew towards more frequently used emotions. However, the least frequent emotion, fear, still has 1,900 data points, which is an acceptable number for machine learning studies.]

3.3 Extracted Features

Data points in CMU-MOSEI come in video format with one speaker in front of the camera. The extracted features for each modality are as follows (for the other benchmarks we extract the same features):

Language: All videos have manual transcriptions. GloVe word embeddings (Pennington et al., 2014) were used to extract word vectors from the transcripts. Words and audio are aligned at the phoneme level using the P2FA forced alignment model (Yuan and Liberman, 2008). Following this, the visual and acoustic modalities are aligned to the words by interpolation. Since the utterance duration of words in English is usually short, this interpolation does not lead to substantial information loss.

Visual: Frames are extracted from the full videos at 30Hz. The bounding box of the face is extracted using the MTCNN face detection algorithm (Zhang et al., 2016). We extract facial action units through the Facial Action Coding System (FACS) (Ekman et al., 1980); extracting these action units allows for accurate tracking and understanding of facial expressions (Baltrušaitis et al., 2016). We also extract a set of six basic emotions purely from static faces using Emotient FACET (iMotions, 2017). MultiComp OpenFace (Baltrušaitis et al., 2016) is used to extract the set of 68 facial landmarks, 20 facial shape parameters, facial HoG features, head pose, head orientation and eye gaze (Baltrušaitis et al., 2016). Finally, we extract face embeddings from commonly used facial recognition models such as DeepFace (Taigman et al., 2014), FaceNet (Schroff et al., 2015) and SphereFace (Liu et al., 2017).

Acoustic: We use the COVAREP software (Degottex et al., 2014) to extract acoustic features including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features (Drugman and Alwan, 2011), glottal source parameters (Drugman et al., 2012; Alku et al., 1997, 2002), peak slope parameters and maxima dispersion quotients (Kane and Gobl, 2013). All extracted features are related to emotions and the tone of speech.
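The word-level alignment described under “Language” above can be summarized in a few lines: given per-word time intervals from forced alignment, the 30Hz visual and frame-level acoustic features are collapsed to one vector per word. The sketch below uses simple averaging of the frames that fall inside each word interval, which is one common realization of alignment by interpolation and is shown here only for illustration; the function name and dimensions are placeholders.

```python
import numpy as np

def align_to_words(frame_feats, frame_times, word_intervals):
    """Collapse frame-level features to word-level features.

    frame_feats:    (num_frames, dim) visual or acoustic features
    frame_times:    (num_frames,) timestamp of each frame in seconds
    word_intervals: list of (start, end) word times from forced alignment (e.g. P2FA)
    Returns one averaged feature vector per word.
    """
    word_feats = []
    for start, end in word_intervals:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            word_feats.append(frame_feats[mask].mean(axis=0))   # average frames in the word span
        else:
            word_feats.append(np.zeros(frame_feats.shape[1]))   # no frame fell inside the span
    return np.stack(word_feats)

# Example: 3 seconds of 30Hz visual features aligned to three words.
feats = np.random.randn(90, 35)                   # placeholder feature dimension
times = np.arange(90) / 30.0
words = [(0.0, 0.8), (0.8, 1.9), (1.9, 3.0)]
print(align_to_words(feats, times, words).shape)  # (3, 35)
```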
4 Multimodal Fusion Study

From the linguistic perspective, understanding the interactions between the language, visual and audio modalities in multimodal language is a fundamental research problem. While previous works have been successful with respect to accuracy metrics, they have not created new insights into how fusion is performed, in terms of which modalities are related and how modalities engage in an interaction during fusion. Specifically, to understand the fusion process one must first understand the n-modal dynamics (Zadeh et al., 2017). n-modal dynamics state that there exist different combinations of modalities and that all of these combinations must be captured to better understand multimodal language. In this paper, we define building the n-modal dynamics as a hierarchical process and propose a new fusion model called the Dynamic Fusion Graph (DFG). DFG is easily interpretable through what are called efficacies in its graph connections. To utilize this new fusion model in a multimodal language framework, we build upon the Memory Fusion Network (MFN) by replacing the original fusion component of the MFN with our DFG. We call the resulting model the Graph Memory Fusion Network (Graph-MFN). Once the model is trained end to end, we analyze the efficacies in the DFG to study the fusion mechanism learned for the modalities in multimodal language. In addition to being an interpretable fusion mechanism,

[Figure: Graph-MFN architecture. The Dynamic Fusion Graph, with unimodal, bimodal and trimodal vertices, produces the output 𝒯, which is gated into the multimodal state memory at each time step.]
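To make the notion of efficacies concrete, the following is a deliberately simplified sketch of a dynamic fusion graph over the three modalities, not the Graph-MFN implementation evaluated in Section 5. Each vertex holds a representation for a subset of {l, v, a}; each edge from a vertex to a larger subset, or to the output vertex 𝒯, carries a scalar efficacy in [0, 1] produced by a small sigmoid gate; and each higher-order vertex is built from the efficacy-weighted representations of its ancestors. The layer sizes, gating inputs and single-linear-layer vertices are illustrative assumptions.

```python
import itertools
import torch
import torch.nn as nn

class DynamicFusionGraphSketch(nn.Module):
    """Simplified DFG: vertices for every non-empty subset of {l, v, a} plus an
    output vertex T. Each edge (ancestor -> vertex) has a sigmoid efficacy, and a
    vertex embedding is a linear map of its efficacy-weighted ancestors."""

    def __init__(self, dims, d_vertex=64):
        super().__init__()
        self.singletons = list(dims)                       # e.g. ['l', 'v', 'a']
        self.subsets = [c for r in (2, 3)
                        for c in itertools.combinations(self.singletons, r)]
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_vertex) for m, d in dims.items()})
        self.vertex_nets, self.gates = nn.ModuleDict(), nn.ModuleDict()
        for s in self.subsets:                             # bimodal and trimodal vertices
            key, ancestors = "".join(s), self._ancestors(s)
            self.vertex_nets[key] = nn.Linear(len(ancestors) * d_vertex, d_vertex)
            for a in ancestors:                            # one efficacy per incoming edge
                self.gates[f"{''.join(a)}->{key}"] = nn.Linear(2 * d_vertex, 1)
        all_vertices = [(m,) for m in self.singletons] + self.subsets
        self.out_net = nn.Linear(len(all_vertices) * d_vertex, d_vertex)
        self.out_gates = nn.ModuleDict(
            {f"{''.join(v)}->T": nn.Linear(d_vertex, 1) for v in all_vertices})

    @staticmethod
    def _ancestors(s):
        # all strict, non-empty subsets of s (each is itself a vertex)
        return [c for r in range(1, len(s)) for c in itertools.combinations(s, r)]

    def forward(self, feats):
        # feats: dict modality -> (batch, dim) features for one time step
        vert, eff = {}, {}
        for m in self.singletons:                          # unimodal vertices
            vert[(m,)] = torch.relu(self.proj[m](feats[m]))
        for s in self.subsets:                             # bimodal, then trimodal
            key, ancestors = "".join(s), self._ancestors(s)
            pooled = torch.stack([vert[a] for a in ancestors]).mean(0)
            parts = []
            for a in ancestors:
                gate_in = torch.cat([vert[a], pooled], dim=-1)
                e = torch.sigmoid(self.gates[f"{''.join(a)}->{key}"](gate_in))
                eff[f"{''.join(a)}->{key}"] = e            # efficacy of this edge
                parts.append(e * vert[a])
            vert[s] = torch.relu(self.vertex_nets[key](torch.cat(parts, dim=-1)))
        outs = []
        for v, h in vert.items():                          # every vertex feeds the output T
            e = torch.sigmoid(self.out_gates[f"{''.join(v)}->T"](h))
            eff[f"{''.join(v)}->T"] = e
            outs.append(e * h)
        T = self.out_net(torch.cat(outs, dim=-1))          # DFG output vertex
        return T, eff                                      # efficacies remain inspectable

# One time step with a batch of 8 examples and placeholder feature sizes.
dims = {"l": 300, "v": 35, "a": 74}
dfg = DynamicFusionGraphSketch(dims)
T, efficacies = dfg({m: torch.randn(8, d) for m, d in dims.items()})
print(T.shape, sorted(efficacies)[:3])
```

Because every efficacy is exposed by name, the same trained object can be queried over time to produce analyses of the kind shown in Figure 5.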
Dataset MOSEI Sentiment MOSEI Emotions
Task Sentiment Anger Disgust Fear Happy Sad Surprise
Metric A2 F1 A5 A7 MAE r WA F1 WA F1 WA F1 WA F1 WA F1 WA F1
LANGUAGE
SOTA2 74.1§ 74.1⊳ 43.1≀ 42.9≀ 0.75§ 0.46≀ 56.0∪ 71.0× 59.0§ 67.1⊳ 56.2§ 79.7§ 53.0⊳ 44.1⊳ 53.8≀ 49.9≀ 53.2× 70.0⊳
SOTA1 74.3⊳ 74.1§ 43.2§ 43.2§ 0.74⊳ 0.47§ 56.6≀ 71.8● 64.0⊳ 72.6● 58.8× 89.8● 54.0§ 47.0§ 54.0§ 61.2● 54.3⊳ 85.3●
VISUAL
SOTA2 73.8§ 73.5§ 42.5⊳ 42.5⊳ 0.78≀ 0.41♡ 54.4≀ 64.6§ 54.4♡ 71.5⊲ 51.3§ 78.4§ 53.4≀ 40.8§ 54.3⊳ 60.8● 51.3⊳ 84.2§
SOTA1 73.9⊳ 73.7⊳ 42.7≀ 42.7≀ 0.78§ 0.43≀ 60.0§ 71.0● 60.3≀ 72.4● 64.2♡ 89.8● 57.4● 49.3● 57.7§ 61.5⊲ 51.8§ 85.4●
ACOUSTIC
SOTA2 74.2≀ 73.8△ 42.1△ 42.1△ 0.78⊳ 0.43§ 55.5⊲ 51.8△ 58.9⊳ 72.4● 58.5⊳ 89.8● 57.2∩ 55.5∩ 58.9⊲ 65.9⊲ 52.2♡ 83.6∩
SOTA1 74.2△ 73.9≀ 42.4∩ 42.4∩ 0.74∩ 0.43⊳ 56.4△ 71.9● 60.9§ 72.4● 62.7§ 89.8⊲ 61.5§ 61.4§ 62.0∩ 69.2∩ 54.3⊲ 85.4●
MULTIMODAL
SOTA2 76.0# 76.0# 44.7† 44.6† 0.72∗ 0.52∗ 56.0◇ 71.4♭ 65.2# 71.4# 56.7§ 89.9# 57.8§ 66.6∗ 58.9∗ 60.8# 52.2∗ 85.4●
SOTA1 76.4◇ 76.4◇ 44.8∗ 44.7∗ 0.72# 0.52# 60.5∗ 72.0● 67.0♭ 73.2● 60.0♡ 89.9● 66.5∗ 71.0∎ 59.2§ 61.8● 53.3# 85.4#
Graph-MFN 76.9 77.0 45.1 45.0 0.71 0.54 62.6 72.8 69.1 76.6 62.0 89.9 66.3 66.3 60.4 66.9 53.7 85.5
Table 3: Results for sentiment analysis and emotion recognition on the CMU-MOSEI dataset (reported results are as of 5/11/2018; please check the CMU Multimodal Data SDK GitHub for the current state of the art and new features for CMU-MOSEI and other datasets). SOTA1 and SOTA2 refer to the previous best and second-best state-of-the-art models (from Section 2), respectively. A2, A5 and A7 denote 2-, 5- and 7-class accuracy, MAE the mean absolute error, r the Pearson correlation, WA the weighted accuracy, and F1 the F1 score. Compared to the baselines, Graph-MFN achieves superior performance in sentiment analysis and competitive performance in emotion recognition. For all metrics, higher values indicate better performance, except for MAE, where lower values indicate better performance.
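For reference, the sketch below computes metrics of the kind reported in Table 3 from raw predictions: binary accuracy and F1 on a two-class sentiment split, 7-class accuracy, MAE and Pearson correlation on the continuous sentiment score, and a weighted accuracy for a binary emotion label. Treating WA as the mean of the two class-wise recalls, and splitting binary sentiment at zero, are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
import numpy as np

def sentiment_metrics(y_true, y_pred):
    """y_true, y_pred: continuous sentiment scores in [-3, 3]."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.abs(y_true - y_pred).mean()
    r = np.corrcoef(y_true, y_pred)[0, 1]                  # Pearson correlation
    a7 = (np.clip(np.round(y_true), -3, 3)
          == np.clip(np.round(y_pred), -3, 3)).mean()      # 7-class accuracy
    t, p = y_true >= 0, y_pred >= 0                        # assumed binary split at zero
    a2 = (t == p).mean()
    tp, fp, fn = (t & p).sum(), (~t & p).sum(), (t & ~p).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"A2": a2, "F1": f1, "A7": a7, "MAE": mae, "r": r}

def weighted_accuracy(y_true, y_pred):
    """Mean of true-positive and true-negative rates for one binary emotion label."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tpr = (y_true & y_pred).sum() / max(y_true.sum(), 1)
    tnr = (~y_true & ~y_pred).sum() / max((~y_true).sum(), 1)
    return (tpr + tnr) / 2

print(sentiment_metrics([1.5, -2.0, 0.5], [1.0, -1.0, -0.5]))
print(weighted_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```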
[Figure 5: Visualization of DFG efficacies across time. The efficacies (and thus the DFG structure) change over time as the DFG is exposed to new information. The DFG is able to choose which n-modal dynamics to rely on. It also learns priors about human communication, since certain efficacies (and thus edges in the DFG) remain unchanged across time and across data points. The four panels show heatmaps over t = 1 ... T for example videos in which (I) the vision and acoustic modalities are informative, (II) the vision modality is uninformative, (III) the language modality is uninformative, and (IV) the acoustic modality is uninformative. Rows correspond to efficacies such as l → (l, a), v → (l, a, v), (l, a) → (l, a, v) and (l, a, v) → 𝒯, shown alongside each video's transcript and a description of its visual and acoustic behavior.]

ing that the DFG is able to find useful information in unimodal, bimodal and trimodal interactions. However, in cases (II) and (III), where the visual modality is either uninformative or contradictory, the efficacies of v → (l, v), v → (l, a, v) and (l, a) → (l, a, v) are reduced, since no meaningful interactions involve the visual modality.

Priors in Fusion: Certain efficacies remain unchanged across cases and across time. These are priors from human multimodal language that the DFG learns. For example, the model always seems to prioritize fusion between language and audio in (l → l, a) and (a → l, a). Subsequently, the DFG gives low values to efficacies that rely unilaterally on language or audio alone: the (l → 𝒯) and (a → 𝒯) efficacies seem to be consistently low. On the other hand, the visual modality appears to have a partially isolated behavior. In the presence of informative visual information, the model increases the efficacies of (v → 𝒯), although the values of the other visual efficacies also increase.

Trace of Multimodal Fusion: We trace the dominant path that every modality undergoes during fusion. 1) Language tends to first fuse with audio via (l → l, a), and the language and acoustic modalities together engage in higher-level fusions such as (l, a → l, a, v). Intuitively, this is aligned with the close ties between language and audio through word intonations. 2) The visual modality seems to engage in fusion only if it contains meaningful information. In cases (I) and (IV), all the paths involving the visual modality are relatively active, while in cases (II) and (III) the paths involving the visual modality have low efficacies. 3) The acoustic modality is mostly present in fusion with the language modality. However, unlike language, the acoustic modality also appears to fuse with the visual modality if both modalities are meaningful, such as in case (I).

An interesting observation is that in almost all cases the efficacies of unimodal connections to the terminal 𝒯 are low, implying that 𝒯 prefers not to rely on just one modality. Also, the DFG always prefers to perform fusion between language and audio, as in most cases both l → (l, a) and a → (l, a) have high efficacies; intuitively, in most natural scenarios the language and acoustic modalities are highly aligned. Both of these behaviors remain unchanged across data points, which we believe the DFG has learned as natural priors of the human communicative signal.

With these observations, we believe that the DFG has successfully learned how to manage its internal structure to model human communication.
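Producing an analysis like Figure 5 only requires reading the learned efficacies back out of a trained model at every time step. Assuming the model exposes them as a dictionary keyed by edge name with scalar values (for instance averaged over the batch), as in the sketch after Section 4, the snippet below stacks them into an edge-by-time matrix and renders a heatmap with matplotlib; the edge names and ordering are simply whatever the model exposes.

```python
import numpy as np
import matplotlib.pyplot as plt

def efficacy_heatmap(efficacy_per_step, title="DFG efficacies over time"):
    """efficacy_per_step: list over time steps of dicts {edge_name: value in [0, 1]}."""
    edges = sorted(efficacy_per_step[0])                   # fixed row order
    matrix = np.array([[step[e] for step in efficacy_per_step] for e in edges])
    fig, ax = plt.subplots(figsize=(8, 0.3 * len(edges) + 1))
    im = ax.imshow(matrix, aspect="auto", vmin=0.0, vmax=1.0, cmap="viridis")
    ax.set_yticks(range(len(edges)))
    ax.set_yticklabels(edges)
    ax.set_xlabel("time step t")
    ax.set_title(title)
    fig.colorbar(im, ax=ax, label="efficacy")
    return fig

# Example with synthetic efficacies for three edges over 20 time steps.
steps = [{"l->la": 0.8, "a->la": 0.7, "v->T": float(np.random.rand())} for _ in range(20)]
efficacy_heatmap(steps)
plt.show()
```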
6 Conclusion

In this paper we presented the largest dataset for multimodal sentiment analysis and emotion recognition, called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). CMU-MOSEI consists of 23,453 annotated sentences from more than 1,000 online speakers and 250 different topics. The dataset expands the horizons of Human Multimodal Language studies in NLP. One such study was presented in this paper, where we analyzed the structure of multimodal fusion in sentiment analysis and emotion recognition. This was done using a novel interpretable fusion mechanism called the Dynamic Fusion Graph (DFG). In our studies we investigated the behavior of the modalities in interacting with each other using the built-in efficacies of the DFG. Aside from the analysis of fusion, the DFG was trained in the Memory Fusion Network pipeline and showed superior performance in sentiment analysis and competitive performance in emotion recognition.

Acknowledgments

This material is based upon work partially supported by the National Science Foundation (Award #1833355) and Oculus VR. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Oculus VR, and no official endorsement should be inferred.

References

Paavo Alku, Tom Bäckström, and Erkki Vilkman. 2002. Normalized amplitude quotient for parametrization of the glottal flow. The Journal of the Acoustical Society of America 112(2):701–710.

Paavo Alku, Helmer Strik, and Erkki Vilkman. 1997. Parabolic spectral parameter—a new method for quantification of the glottal flow. Speech Communication 22(1):67–79.

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406.

Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, pages 1–10.

Sanjay Bilakhia, Stavros Petridis, Anton Nijholt, and Maja Pantic. 2015. The MAHNOB mimicry database: A database of naturalistic human interactions. Pattern Recognition Letters 66(Supplement C):52–61. Pattern Recognition in Human Computer Interaction. https://doi.org/10.1016/j.patrec.2015.03.005.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5–32. https://doi.org/10.1023/A:1010933404324.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation 42(4):335–359. https://doi.org/10.1007/s10579-008-9076-6.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI 2017, pages 163–171. https://doi.org/10.1145/3136755.3136801.

Glen Coppersmith and Erin Kelly. 2014. Dynamic wordclouds and vennclouds for exploratory data analysis. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. Association for Computational Linguistics, Baltimore, Maryland, USA, pages 22–29.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3):273–297. https://doi.org/10.1023/A:1022627411411.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP—a collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pages 960–964.

A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. 2012. Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia 19(3):34–41. https://doi.org/10.1109/MMUL.2012.26.

Abhinav Dhall, O.V. Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. 2015. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI '15, pages 423–426. https://doi.org/10.1145/2818346.2829994.

Thomas Drugman and Abeer Alwan. 2011. Joint robust voicing detection and pitch estimation based on residual harmonics. In Interspeech, pages 1973–1976.

Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, and Thierry Dutoit. 2012. Detection of glottal closure instants from speech signals: A quantitative review. IEEE Transactions on Audio, Speech, and Language Processing 20(3):994–1006.

Paul Ekman, Wallace V. Freisen, and Sonia Ancoli. 1980. Facial signs of emotional experience. Journal of Personality and Social Psychology 39(6):1125.

A. Graves, A.-r. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947.

Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. 2008. The Vera am Mittag German audio-visual emotional speech database. In ICME. IEEE, pages 865–868.
Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmerman. 2018. Memn: Multimodal emotional memory network for emotion recognition in dyadic conversational videos. In NAACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

iMotions. 2017. Facial expression analysis. goo.gl/1rh1JN.

Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL (1), pages 1681–1691.

Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):221–231. https://doi.org/10.1109/TPAMI.2012.59.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.

John Kane and Christer Gobl. 2013. Wavelet maxima dispersion for breathy to tense voice discrimination. IEEE Transactions on Audio, Speech, and Language Processing 21(6):1170–1179.

Wootaek Lim, Daeyoung Jang, and Taejin Lee. 2016. Speech emotion recognition using convolutional and recurrent neural networks. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific. IEEE, pages 1–4.

Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. 2017. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 142–150. http://www.aclweb.org/anthology/P11-1015.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60. http://www.aclweb.org/anthology/P/P14/P14-5010.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interactions. ACM, pages 169–176.

Friedrich Max Müller. 1866. Lectures on the Science of Language: Delivered at the Royal Institution of Great Britain in April, May, & June 1861, volume 1. Longmans, Green.

Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI 2016, pages 284–288. https://doi.org/10.1145/2993148.2993176.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79–86.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI '14, pages 50–57. https://doi.org/10.1145/2663204.2663260.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Association for Computational Linguistics (ACL). Sofia, Bulgaria.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017a. Context dependent sentiment analysis in user generated videos. In Association for Computational Linguistics.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017b. Context-dependent sentiment analysis in user-generated videos. In Association for Computational Linguistics.

Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, pages 439–448.

Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, and Roland Goecke. 2016. Extending long short-term memory for multi-view structured learning. In European Conference on Computer Vision.
Fabien Ringeval, Andreas Sonderegger, Jürgen S. Sauer, and Denis Lalanne. 2013. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In FG. IEEE Computer Society, pages 1–8.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR. IEEE Computer Society, pages 815–823.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681. https://doi.org/10.1109/78.650093.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, Christopher Potts, et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Citeseer, volume 1631, page 1642.

Rupesh K. Srivastava, Klaus Greff, and Juergen Schmidhuber. 2015. Training very deep networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pages 2377–2385. http://papers.nips.cc/paper/5850-training-very-deep-networks.pdf.

Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, Washington, DC, USA, CVPR '14, pages 1701–1708. https://doi.org/10.1109/CVPR.2014.220.

Edmund Tong, Amir Zadeh, Cara Jones, and Louis-Philippe Morency. 2017. Combating human trafficking with multimodal deep models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1547–1556.

George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A. Nicolaou, Björn Schuller, and Stefanos Zafeiriou. 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, pages 5200–5204.

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P. Xing. 2016. Select-additive learning: Improving cross-individual generalization in multimodal sentiment analysis. arXiv preprint arXiv:1609.05244.

Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28(3):46–53.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America 123(5):3878.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Empirical Methods in Natural Language Processing, EMNLP.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016a. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016b. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems 31(6):82–88.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10):1499–1503.

Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis C. M. Lau. 2015. A C-LSTM neural network for text classification. CoRR abs/1511.08630.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.