


Investigation of Multimodal Features, Classifiers and Fusion
Methods for Emotion Recognition
Zheng Lian
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]

Ya Li
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]

Jianhua Tao
National Laboratory of Pattern Recognition, CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]

Jian Huang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]
ABSTRACT
Automatic emotion recognition is a challenging task. In this paper, we present our effort for the audio-video based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2018 challenge, which requires participants to assign a single emotion label to a video clip from the six universal emotions (Anger, Disgust, Fear, Happiness, Sad and Surprise) and Neutral. The proposed multimodal emotion recognition system takes into account audio, video and text information. Besides handcrafted features, we also extract bottleneck features from deep neural networks (DNNs) via transfer learning. Both temporal classifiers and non-temporal classifiers are evaluated to obtain the best unimodal emotion classification result. Then emotion possibilities are calculated and fused by the Beam Search Fusion (BS-Fusion). We test our method1 in the EmotiW 2018 challenge and we gain a promising result: 60.34% on the testing dataset. Compared with the baseline system, there is a significant improvement. What's more, our result is only 1.5% lower than the winner's.

CCS CONCEPTS
• Computing methodologies → Activity recognition and understanding;

KEYWORDS
Emotion Recognition; Multimodal Features; Classifiers; Fusion Methods

1 INTRODUCTION
With the development of artificial intelligence, there is an explosion of interest in realizing more natural human-machine interaction (HMI) systems. Emotion, as an important aspect of HMI, is attracting more and more attention. Due to the complexity of emotion recognition and the diversity of application scenarios, a single modality can hardly meet the demand. Multimodal recognition methods, which take into account audio, video, text and biological information, can improve the recognition performance.
The audio-video based sub-challenge of the Emotion Recognition in the Wild (EmotiW) challenge plays an important role in emotion recognition. The Acted Facial Expressions in the Wild (AFEW) dataset [1] is the dataset of the EmotiW challenge. Organizers provide an open platform for participants to evaluate their recognition systems. The first EmotiW challenge was organized in 2013, and this year is the 6th challenge. The recognition accuracy on the seven emotions has increased every year: 41.03% [2], 50.37% [3], 53.80% [4], 59.02% [5], 60.34% [6].
It is important to extract discriminative features for emotion classification. Before the popularity of deep neural networks (DNNs), frame-level handcrafted features were widely studied and utilized [2, 7, 8], including Histogram of Oriented Gradient (HOG) [9], Local Binary Patterns (LBP) [10], Local Phase Quantization (LPQ) [11] and Scale Invariant Feature Transform (SIFT) [12]. Three Orthogonal Planes (TOP) [13], summarizing functionals (FUN), Fisher Vector encoding (FV) [14], Spatial Pyramid Matching (SPM) [15] and Bag of Words (BOW) are also utilized to capture temporal information [7, 16]. Now, DNN-based approaches generate state-of-the-art performance in many fields [17-21]. However, due to limited training samples in the AFEW database, complex DNNs are difficult to train [22]. To deal with that problem, transfer learning is adopted, and bottleneck features are extracted from fine-tuned models [5, 23, 24].
Classifiers are also important in emotion recognition. Liu et al. [3] exploited Partial Least Squares (PLS), Logistic Regression (LR) and Kernel Support Vector Machine (KSVM) operating in vector space to classify data points on Riemannian manifolds for emotion recognition.
1 NLPR's method: https://github.com/zeroQiaoba/EmotiW2018
Figure 1: An overview of the proposed multimodal emotion recognition system. Features from different modalities are trained
individually based on multiple classifiers. Emotion possibilities are fused by the BS-Fusion.

Kaya et al. [25] chose Extreme Learning Machines (ELM) and Kernel Extreme Learning Machines (KELM) for modeling modality-specific features, which were faster and more accurate than SVM. Recently, many temporal models have also been tested, such as Long Short-Term Memory (LSTM) [26], Gated Recurrent Unit (GRU) [27] and 3D Convolution Networks (C3D) [28].
To gain better performance, fusion methods that merge different modalities are essential. Fusion methods can be classified into feature-level fusion (also called early fusion), decision-level fusion (also called late fusion) and model-level fusion. Most teams chose late fusion in previous challenges [5, 23, 29]. Vielzeuf et al. [30] discussed five fusion methods: Majority Vote, Mean, ModDrop, Score Tree and Weighted Mean. They found that Weighted Mean was the most effective fusion method, with less risk of overfitting. Ouyang et al. [22] utilized a reinforcement learning strategy to find the best fusion weights.
In EmotiW 2018 [31], we participate in the Audio-Video based sub-challenge. The task is to assign a single emotion label to each video clip, and classification accuracy is the comparison metric. In this paper, we propose our multimodal emotion recognition system, which is shown in Fig. 1. Features from different modalities are trained individually based on multiple classifiers. Emotion possibilities are fused by the BS-Fusion. Compared with the previous solutions in EmotiW challenges, our innovations mainly focus on three parts:
1. Multimodal features: To the best of our knowledge, this is the first time that text, identity and background information are taken into account.
2. Classifiers: Different types of aggregation models are investigated, including NetFV, NetVLAD, NetRVLAD and SoftDBoW [32].
3. Fusion methods: The Beam Search Fusion (BS-Fusion) is proposed for model selection and weight determination.

The rest of the paper is organized as follows. Multimodal features and various classifiers are illustrated in Section 2 and Section 3, respectively. In Section 4, we focus on our proposed BS-Fusion. Datasets and experimental results are illustrated in Section 5 and Section 6, respectively. Section 7 concludes the whole paper.

2 MULTIMODAL FEATURES
In our approach, audio, video and text features are taken into account to improve the recognition performance. Besides handcrafted features, bottleneck features extracted from fine-tuned models are also considered.

2.1 Audio Features
In this section, multiple audio features are discussed. Besides handcrafted feature sets, bottleneck features of the automatic speech recognition (ASR) acoustic model, SoundNet and VGGish are also evaluated.
2.1.1 OpenSMILE-based Audio Features. The OpenSMILE toolkit [33] is utilized to extract audio feature sets, including eGemaps (eGeMAPSv-01.conf) [34], IS09 (IS09_emotion.conf), IS10 (IS10_paraling.conf), IS11 (IS11_speaker_state.conf), IS13 (IS13_ComParE.conf) and MFCC (MFCC12_0_D_A.conf).
To extract those feature sets, the acoustic low-level descriptors (LLDs), covering spectral, cepstral, prosodic and voice quality information, are first extracted within a 25ms frame with a window shift of 10ms. Then statistical functions such as mean and maximum are calculated over the LLDs to get segment-level features. We test two segment lengths in the paper: 100ms and the length of the whole utterance.
information. window shift of 10ms. Then statistical functions such as mean and
2. Classifiers: Different types of aggregation models are maximum are calculated over LLDs to get segment-level features.
investigated, including NetFV, NetVLAD, NetRVLAD We test two segment lengths in the paper: 100ms and the length of
and SoftDBoW[32]. the whole utterance.
3. Fusion methods: The Beam Search Fusion (BS-Fusion) is 2.1.2 ASR Bottleneck Features. We extract bottleneck features
proposed for the modal selection and weight determination. from the ASR acoustic model. At first, we train a Chinese ASR
system with a 500-hour spontaneous and accented Mandarin speech corpus. The ASR acoustic model has six hidden layers. The first five layers have 1024 nodes and the last layer has 60 nodes. As most speakers in the AFEW dataset speak English, we fine-tune the Chinese ASR system with a 300-hour English speech corpus, since the available English corpus is limited. Then, we extract bottleneck features from two acoustic models: the English ASR acoustic model and the Chinese ASR acoustic model.

Figure 2: The architecture of our ASR acoustic model. FBank features extracted from waveforms are used as inputs. The last layer of the ASR acoustic model is treated as the bottleneck layer.
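A minimal sketch of an acoustic model with this layer layout is given below (PyTorch). Only the five 1024-unit layers and the 60-unit bottleneck follow the description above; the input splicing and the number of output senone targets are assumptions.

import torch.nn as nn

class BottleneckAcousticModel(nn.Module):
    """Five 1024-unit hidden layers plus a 60-unit bottleneck, then a senone classifier."""
    def __init__(self, input_dim=40 * 11, num_senones=3000):
        # input_dim: 40-dim FBank with +/-5 frames of context (an assumption)
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(5):
            layers += [nn.Linear(dim, 1024), nn.ReLU()]
            dim = 1024
        self.hidden = nn.Sequential(*layers)
        self.bottleneck = nn.Linear(1024, 60)   # activations here are the bottleneck features
        self.output = nn.Linear(60, num_senones)

    def forward(self, fbank_frames):
        h = self.hidden(fbank_frames)
        bottleneck = self.bottleneck(h)
        return self.output(bottleneck), bottleneck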

2.1.3 SoundNet Bottleneck Features. We extract bottleneck features from the SoundNet network [35], which learns rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. The SoundNet network is a 1-dimensional convolutional network, which consists of convolutional layers and pooling layers.
In this paper, we divide raw waveforms into multiple 1s segments. Then those segments are treated as inputs to the network, and we extract SoundNet bottleneck features from the conv7 layer shown in Fig. 3.

Figure 3: The architecture of the SoundNet network [35]. Visual knowledge is transferred into the sound modality using unlabeled video as a bridge.

2.1.4 VGGish Bottleneck Features. The VGGish network [36] is trained on AudioSet [37], which contains over 2 million human-labeled 10s YouTube video soundtracks with more than 600 audio event classes.
In this paper, the VGGish network is used as the feature extractor. We divide raw waveforms into multiple 1s segments. Log spectrograms extracted from segments are treated as inputs. VGGish extracts semantically meaningful, high-level 128-D embedding features from fc2. Then Principal Component Analysis (PCA) is utilized to extract normalized features.

Figure 4: The structure of the VGGish network. The input log spectrogram is 96×64. Yellow boxes, green boxes and grey boxes denote the 2D convolutional layers, max pooling layers and fully-connected layers, respectively. The number inside of the yellow box is the number of filters and the number inside of the grey box is the number of neurons.
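The sketch below illustrates the steps around the VGGish model: slicing a waveform into roughly 1s log-mel patches of shape 96×64, and PCA-normalizing the resulting embeddings. The mel/hop parameters are assumptions chosen to approximate that input size, and the VGGish forward pass itself (patches to 128-D fc2 embeddings) is not shown; the embeddings array is a placeholder.

import numpy as np
import librosa
from sklearn.decomposition import PCA

def log_mel_patches(wav_path, sr=16000, n_mels=64, frames_per_patch=96):
    """Slice a waveform into ~1s log-mel patches shaped like the 96x64 VGGish input."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6).T                      # (frames, n_mels)
    n_patches = len(log_mel) // frames_per_patch
    return log_mel[: n_patches * frames_per_patch].reshape(n_patches, frames_per_patch, n_mels)

# patches -> 128-D fc2 embeddings would come from a pre-trained VGGish model (not shown);
# PCA post-processing on the pooled embeddings could then look like this:
embeddings = np.random.randn(500, 128)                  # placeholder for real fc2 outputs
pca = PCA(n_components=128, whiten=True).fit(embeddings)
normalized = pca.transform(embeddings)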

2.2 Video Features
In this paper, we extract multiple video features. Besides handcrafted features such as Local Binary Patterns from Three Orthogonal Planes (LBPTOP) [13], HOG and Dense SIFT (DSIFT), bottleneck features extracted from VGG, DenseNet and C3D are also considered. Furthermore, we take into account geometry features, background features and identity features.
2.2.1 Handcrafted Video Features. In general, facial features consist of two parts: appearance features and geometry features.
As for appearance features, LBPTOP features are widely used in previous EmotiW challenges. Basic LBP features have 59 dimensions when using the uniform code. LBPTOP features extend LBP from two dimensions to three dimensions, applying the relevant descriptor on the XY, XT and YT planes independently and concatenating the histograms together.
Besides LBPTOP features, LBP, HOG, HOGLBP and DSIFT features are also tested. HOGLBP features apply the HOG descriptor on the XY plane and the LBP descriptor on the XT and YT planes, and then concatenate them together. As for DSIFT features, they are equivalent to performing the SIFT descriptor on a dense grid of locations on an image at a fixed scale and orientation.
As for geometry features, the head pose and landmarks are considered. Emotion is related to landmarks and the head pose. When people feel neutral, movement of landmarks is relatively small. When people feel sad, they tend to lower their heads. Therefore, we take those features into account, which are marked as Geo-Features.
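A simplified LBP-TOP sketch using scikit-image is shown below. For brevity it computes one 59-bin uniform-code LBP histogram per plane, taking only the central XY, XT and YT slices, whereas the full descriptor aggregates histograms over all slices; it is only meant to illustrate the three-plane idea.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(video, P=8, R=1):
    """Simplified LBP-TOP: uniform LBP histograms on the XY, XT and YT planes, concatenated.

    video : (T, H, W) grey-scale face crops stacked over time.
    """
    T, H, W = video.shape
    planes = [
        video[T // 2],            # XY plane (a single frame)
        video[:, H // 2, :],      # XT plane (central row over time)
        video[:, :, W // 2],      # YT plane (central column over time)
    ]
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method="nri_uniform")  # 59 codes for P=8
        hist, _ = np.histogram(codes, bins=59, range=(0, 59), density=True)
        hists.append(hist)
    return np.concatenate(hists)   # 3 x 59 dimensional descriptor

descriptor = lbp_top(np.random.randint(0, 255, size=(16, 64, 64)).astype(np.uint8))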
2.2.2 CNN Bottleneck Features. To extract bottleneck features from images, the VGG (configuration "D") [38] and DenseNet-BC [39] structures are chosen.
In this paper, the VGG and DenseNet-BC networks are pre-trained on ImageNet [40] and fine-tuned on the FER+ (Facial Expression Recognition +) [41] and Static Facial Expression in the Wild (SFEW) 2.0 [42] databases. Grey-scale images are treated as inputs. As for the VGG network, we extract bottleneck features from conv5-b, conv5-c, fc1 and fc2 in Fig. 5. As for the DenseNet-
BC structure, we extract bottleneck features from the last mean pooling layer, which is marked as pool3.

Figure 5: The structure of the VGG network. The input image is 64×64 pixels. Meanings of other components are the same as the definitions in Fig. 4.

Figure 6: The structure of the DenseNet-BC [43] network. The input image is 64×64 pixels. There are three Dense Blocks. Yellow boxes and green boxes denote the convolutional layers and the mean pooling layers, respectively.
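One common way to obtain such bottleneck features is to register forward hooks on the layers of interest. The sketch below does this for a stock torchvision VGG-16; the fine-tuned grey-scale FER+/SFEW weights, the 64×64 input size and the exact layers used in the paper are not reproduced here, so treat the layer indices as assumptions.

import torch
import torchvision.models as models

vgg = models.vgg16()                      # untrained; load the fine-tuned checkpoint instead
features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

vgg.features[28].register_forward_hook(save_output("conv5-c"))   # last conv layer of block 5
vgg.classifier[0].register_forward_hook(save_output("fc1"))
vgg.classifier[3].register_forward_hook(save_output("fc2"))

with torch.no_grad():
    batch = torch.randn(8, 3, 224, 224)   # replace with pre-processed face crops
    vgg(batch)

fc1_bottleneck = features["fc1"]          # (8, 4096) bottleneck features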
2.2.3 C3D Features. The C3D network is an extension of the 2D convolutional process, which captures spatial-temporal features from videos. The C3D network has shown its performance in previous EmotiW challenges [5, 22]. The architecture of C3D is shown in Fig. 7.
In this paper, the C3D network is pre-trained on Sports-1M [44] and fine-tuned on the AFEW database. It takes 16 continuous frames as inputs, with 8 overlapped frames. Outputs of fc6 are exploited as bottleneck features.

Figure 7: The structure of the C3D network. It takes 16 continuous images as inputs. Each image is cropped into 112×112 pixels. Yellow boxes denote 3D convolutional layers. Meanings of other components are the same as the definitions in Fig. 4.

2.2.4 Background Features. Background information is helpful to judge emotion states. Fear is often accompanied by a dim environment, while happiness is often accompanied by a bright environment. To take background information into account, we take the Inception network [45] as the feature extractor, which is pre-trained on ImageNet. Original frames extracted from videos are passed into the network. The last mean pooling layer is treated as the bottleneck layer. Then PCA is utilized to extract normalized features and reduce feature dimensions.
2.2.5 Identity Features. Identity information also counts. As some samples in the AFEW database are continuous, their emotions are likely to be continuous as well.
In the experiment, SeetaFace1 is utilized to extract identity features. SeetaFace identification is based on a deep convolutional neural network (DCNN). Specifically, it is an implementation of the VIPLFaceNet [46], which consists of 7 convolutional layers and 2 fully-connected layers with an input size of 256×256×3. In the SeetaFace open-source face identification toolkit, the outputs of the 2048 nodes of the FC2 layer in the VIPLFaceNet are exploited as the feature of the input face.

2.3 Text Features
The content of the audio reflects the emotion. For example, dirty words such as 'fuck' and 'shit' are common when people are angry. 'Sorry' is often utilized to express one's guilt towards others. People often use 'oh my god' to express their surprise.
To take text information into account, Term Frequency-Inverse Document Frequency (TF-IDF) [47] and Word Vectors (WV) [48] are utilized to extract computable features from raw texts.
2.3.1 TF-IDF. TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document. TF means term frequency, while IDF means inverse document frequency. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.

TF-IDF(t, d) = TF(t, d) × IDF(t)                      (1)

IDF(t) = log((1 + n_d) / (1 + df(d, t))) + 1          (2)

where TF(t, d) is the number of times the word t appears in the document d, n_d is the total number of documents, and df(d, t) is the number of documents that contain the word t.
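A direct implementation of Eq. (1)-(2) on tokenized transcripts might look as follows; the toy documents are placeholders.

import numpy as np
from collections import Counter

def tfidf_matrix(documents):
    """tf-idf(t, d) = tf(t, d) * (log((1 + n_d) / (1 + df(t))) + 1), as in Eq. (1)-(2)."""
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_d = len(documents)
    df = Counter(w for doc in documents for w in set(doc))
    tfidf = np.zeros((n_d, len(vocab)))
    for row, doc in enumerate(documents):
        for word, tf in Counter(doc).items():
            idf = np.log((1 + n_d) / (1 + df[word])) + 1
            tfidf[row, index[word]] = tf * idf
    return vocab, tfidf

docs = [["oh", "my", "god"], ["sorry", "sorry"], ["oh", "no"]]   # toy transcripts
vocab, X = tfidf_matrix(docs)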
2.3.2 WV. Word vectors are high-level representations of words, which capture grammatical relations between words learned from a large corpus.
In this paper, we utilize pre-trained FastText word vectors. They include 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset. Each word can be mapped into a 300-D computable vector.

3 CLASSIFIERS
Besides classic classifiers such as SVM, Random Forest (RF) and LR, we also test temporal models, including Mean Pooling LSTM, Temporal LSTM and CNN-LSTM. Furthermore, several types of aggregation models are also investigated: NetVLAD, NetRVLAD, SoftDBoW and NetFV.

3.1 Mean Pooling LSTM
As for Mean Pooling LSTM, the encoder uses a one-layer LSTM and averages the time-step outputs as the video representation, and the decoder is a fully-connected layer. The softmax layer is treated as the classifier. The structure of Mean Pooling LSTM is shown in Fig. 8.

1 SeetaFace: https://github.com/seetaface/SeetaFaceEngine
Figure 8: The structure of Mean Pooling LSTM. Red boxes represent features in different time steps.
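A minimal PyTorch sketch of this encoder-decoder is given below; the feature dimension is an assumption, and the decoder maps straight to the seven emotion logits rather than through an extra 128-unit layer as in the experiments later in the paper.

import torch
import torch.nn as nn

class MeanPoolingLSTM(nn.Module):
    """One-layer LSTM encoder whose time-step outputs are averaged, then classified."""
    def __init__(self, feature_dim, hidden_dim=128, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=1, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, time, feature_dim)
        outputs, _ = self.lstm(x)         # (batch, time, hidden_dim)
        video_repr = outputs.mean(dim=1)  # mean pooling over time steps
        return self.decoder(video_repr)   # logits; softmax is applied in the loss

model = MeanPoolingLSTM(feature_dim=1582)        # e.g. IS10-sized segment features
logits = model(torch.randn(4, 20, 1582))         # 4 clips, 20 segments each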

3.2 Temporal LSTM
To consider more contextual information, we propose Temporal LSTM. The difference between Temporal LSTM and Mean Pooling LSTM mainly lies in the inputs. Instead of processing features at a single time step, features in the same window are concatenated together as inputs in Temporal LSTM. The overlap size can be adjusted. If the overlap size is set to 0, adjacent windows are processed independently. Temporal LSTM can therefore consider more contextual information. The structure of Temporal LSTM is shown in Fig. 9.

Figure 9: The structure of Temporal LSTM. Red boxes represent features in different time steps.
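The windowing of inputs can be sketched as follows; the window size and overlap are illustrative values.

import numpy as np

def window_features(frames, window_size=4, overlap=2):
    """Concatenate features inside a sliding window, as Temporal LSTM expects.

    frames : (T, D) per-frame (or per-segment) features
    returns: (num_windows, window_size * D) inputs for the LSTM
    """
    step = window_size - overlap            # overlap = 0 means independent adjacent windows
    windows = []
    for start in range(0, len(frames) - window_size + 1, step):
        windows.append(frames[start:start + window_size].reshape(-1))
    return np.stack(windows)

windowed = window_features(np.random.randn(40, 128), window_size=4, overlap=2)
print(windowed.shape)   # (19, 512)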
3.3 CNN-LSTM
CNN-LSTM is an end-to-end classifier. Mean Pooling LSTM and Temporal LSTM are both multi-step processes, where features are extracted first and then fed into classifiers. However, the targets of the multi-step process are not consistent. Besides, there is no agreement on appropriate features for emotion classification. To solve these problems, we introduce the end-to-end classifier CNN-LSTM, whose structure is shown in Fig. 10.
CNN-LSTM takes raw images as inputs. The CNN network is treated as a feature extractor, which extracts the high-level representation for the inputs. Then LSTM is utilized to capture temporal information. The whole structure is trained in an end-to-end manner.

Figure 10: The structure of CNN-LSTM.

3.4 Aggregation Models
Aggregation models have shown their performance in the YouTube-8M Large-Scale Video Understanding challenge [49]. They are an efficient way to remember all of the relevant visual cues. We investigate several types of trainable aggregation models, including NetVLAD, NetRVLAD, SoftDBoW and NetFV [32].
As VLAD encoding is not trainable in DNNs, the NetVLAD architecture is proposed to reproduce the VLAD encoding in a trainable manner. Therefore, parameters can be optimized through backpropagation instead of using k-means clustering. The NetVLAD descriptor can be written as:

NetVLAD(j, k) = Σ_{i=1}^{N} a_k(x_i) (x_i(j) − c_k(j))        (3)

a_k(x_i) is the soft assignment of descriptor x_i to cluster k. The NetVLAD descriptor computes the weighted sum of residuals (x_i − c_k) of descriptors x_i from the learnable anchor point c_k of cluster k.
The SoftDBoW and NetFV descriptors exploit the same idea as the NetVLAD descriptor to imitate BOW and FV. Compared with the NetVLAD descriptor, the NetRVLAD descriptor averages the actual descriptors instead of the residuals.
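A compact PyTorch sketch of a NetVLAD layer implementing Eq. (3) is shown below; producing the soft assignment with a linear layer followed by a softmax, and the final normalization steps, are common choices rather than details taken from this paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Trainable VLAD pooling: soft-assigned residuals to K learnable anchor points."""
    def __init__(self, feature_dim, num_clusters=64):
        super().__init__()
        self.assignment = nn.Linear(feature_dim, num_clusters)               # produces a_k(x_i)
        self.anchors = nn.Parameter(torch.randn(num_clusters, feature_dim))  # c_k

    def forward(self, x):                                   # x: (batch, N, feature_dim)
        a = F.softmax(self.assignment(x), dim=-1)           # (batch, N, K) soft assignments
        residuals = x.unsqueeze(2) - self.anchors           # (batch, N, K, D): x_i - c_k
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)     # (batch, K, D) weighted residual sum
        vlad = F.normalize(vlad, dim=-1)                    # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)         # (batch, K * D) descriptor

pooled = NetVLAD(feature_dim=128, num_clusters=64)(torch.randn(2, 300, 128))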
4 FUSION METHODS
Weighted mean [5, 30] is an efficient late fusion method used in previous EmotiW challenges. However, how to efficiently compute weights for a subset of models and ignore useless models is still an open question.
In this paper, we propose the BS-Fusion, which borrows from the beam search method. As there is a combinatorial explosion in the number of feasible subsets (2^N subsets for N models), we employ a sampling procedure with the goal of filtering out subsets that are less likely to yield good results. We use a beam search of size K and select the top-K subsets in each turn. The selection approach is based on the classification accuracy of the subset.
Algorithm 1: Beam Search Fusion (BS-Fusion)
1:  procedure BS-Fusion(K, N)
2:      Init empty storage S
3:      S_i denotes the i-th component in S
4:      pre_best_score ← 0; now_best_score ← 0
5:      for round = 1, ..., N do
6:          pre_best_score ← now_best_score
7:          Init empty storage S*
8:          for i = 1, ..., K do
9:              for j = 1, ..., N do
10:                 if model j not in S_i then
11:                     S_test ← S_i ∪ {j}
12:                     S_test_score ← calculate score for S_test
13:                     if S_test_score > pre_best_score then
14:                         S* ← S* ∪ {S_test}
15:         if S* is empty then
16:             break
17:         S ← topK(S*)
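A runnable simplification of Algorithm 1 is sketched below. It assumes each model's validation posteriors are available as arrays, scores a subset by the accuracy of the equal-weight average of its members' posteriors, and keeps a single running best score instead of the per-round pre_best_score bookkeeping above; the per-model weight determination is omitted.

import numpy as np

def subset_accuracy(probs, labels, subset):
    """Score a subset of models by the accuracy of their averaged posteriors."""
    fused = np.mean([probs[j] for j in subset], axis=0)     # (num_clips, num_classes)
    return np.mean(fused.argmax(axis=1) == labels)

def bs_fusion(probs, labels, beam_size=8):
    """Beam search over model subsets.

    probs  : list of N arrays, each (num_clips, num_classes) of validation posteriors
    labels : (num_clips,) ground-truth emotion indices
    """
    num_models = len(probs)
    beam = [frozenset()]                      # current top-K subsets
    best_subset, best_score = frozenset(), 0.0
    for _ in range(num_models):               # at most N expansion rounds
        candidates = {}
        for subset in beam:
            for j in range(num_models):
                if j in subset:
                    continue
                trial = subset | {j}
                score = subset_accuracy(probs, labels, trial)
                if score > best_score:        # keep only subsets that improve the score
                    candidates[trial] = score
        if not candidates:
            break
        beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_size]
        best_subset, best_score = beam[0], candidates[beam[0]]
    return sorted(best_subset), best_score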
5 DATASETS
The AFEW database (version 2018) contains video clips labeled using the semi-automatic approach defined in [1]. There are 1809 video clips: 773 for training, 383 for validation and 653 for testing. The LBPTOP features and the meta-data are also provided for the Training dataset and the Validation dataset. The category distribution of the AFEW dataset is shown in Table 1.

Table 1: Emotion Category Distribution of the AFEW Dataset

Emotion    Training   Validation   Testing
Angry      133        64           98
Disgust    74         40           40
Fear       81         46           70
Happy      150        63           144
Neutral    144        63           193
Sad        117        61           80
Surprise   74         46           28
Total      773        383          653

6 EXPERIMENTAL RESULTS
In this section, we investigate the performance of audio, video and text features. Furthermore, we demonstrate the effectiveness of the BS-Fusion.

6.1 Audio Feature Analysis
Since statistical functions have already been applied, the feature dimensions of utterance-level features are fixed. We only evaluate their performance with SVM, RF and LR.
Feature dimensions of segment-level features and frame-level features are variable due to variable-length waveforms. Since classifiers take fixed-length features as inputs, we test two methods to compress variable-length features into fixed-length features. As for statistical functions, mean, maximum and FV are utilized to extract fixed-length features. Then we pass them into classifiers such as SVM, RF and LR. As for aggregation models and temporal models, variable-length features are padded into fixed-length features. Then aggregation models (such as NetFV, NetVLAD, NetRVLAD and SoftDBoW) and temporal models (such as Mean Pooling LSTM, Temporal LSTM and CNN-LSTM) are tested.
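The statistical-functional route can be illustrated in a few lines: mean and maximum over frames turn any-length inputs into fixed-length vectors (the FV encoding is omitted here).

import numpy as np

def functionals(frame_features):
    """Compress variable-length frame-level features into a fixed-length vector (mean + max)."""
    return np.concatenate([frame_features.mean(axis=0), frame_features.max(axis=0)])

utterance_a = np.random.randn(220, 39)   # 220 MFCC frames
utterance_b = np.random.randn(143, 39)   # a shorter clip
X = np.stack([functionals(utterance_a), functionals(utterance_b)])   # (2, 78), ready for SVM/RF/LR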
Through experimental analysis, we find that FV has the worst performance among the statistical functions. Although CNN-LSTM gains the highest accuracy on the training dataset compared with Mean Pooling LSTM and Temporal LSTM, it has an overfitting problem on the validation dataset. Temporal LSTM gains similar results compared with Mean Pooling LSTM. Therefore, FV is ignored, and LSTM refers to Mean Pooling LSTM in the following experiments.
6.1.1 Results of Temporal Models and Aggregation Models. In this section, we compare the performance of LSTM, NetVLAD, NetRVLAD, SoftDBoW and NetFV. Experimental results are listed in Table 2.
In the experiments, we choose segment-level audio features, including SoundNet bottleneck features, MFCCs, IS10 and eGemaps. The segment length for SoundNet bottleneck features is set to 1000ms, and the segment length for the other features is set to 100ms. The number of neurons in the LSTM layer and the number of neurons in the fully-connected layer are fixed at 128. The cluster size of NetVLAD, NetRVLAD, SoftDBoW and NetFV is set to 64.

Table 2: Comparison of Temporal Models and Aggregation Models for Audio Features (%)

            1000ms SoundNet   100ms MFCCs   100ms IS10   100ms eGemaps
NetVLAD     32.64             26.63         21.41        27.94
NetRVLAD    33.68             24.80         19.58        26.11
NetFV       32.11             27.68         21.41        26.11
SoftDBoW    32.38             25.85         20.37        27.42
LSTM        34.99             27.15         24.03        26.11

From the experimental results in Table 2, we find that LSTM has better performance in most cases. Therefore, we only consider LSTM in the following experiments and ignore the aggregation models.
6.1.2 Performance of Audio Features. In this section, we compare the performance of multiple audio features. Experimental results are listed in Table 3.

Table 3: Classification Accuracy of Audio Features (%)

Exp.   Features             Statistical Functions   Classifiers   Accuracy
1      1000ms SoundNet      None                    LSTM          31.33
2      1000ms VGGish        None                    LSTM          34.86
3      1000ms Chinese ASR   Max                     RF            36.03
4      1000ms English ASR   Mean                    RF            33.42
5      100ms eGemaps        None                    LSTM          26.89
6      100ms IS10           Max                     RF            25.59
7      100ms MFCCs          None                    LSTM          26.63
8      eGemaps              —                       RF            34.46
9      IS09                 —                       RF            32.11
10     IS11                 —                       RF            21.15
11     IS13                 —                       RF            20.10

Exp. 1~7 in Table 3 choose segment-level audio features. Exp. 8~11 in Table 3 test multiple utterance-level audio features. For segment-level features, we list the segment length in front of the feature name. As statistical functions are not needed for LSTM, they are set to None.
From the experimental results in Table 3, we find that different audio features need different statistical functions and classifiers. Chinese ASR bottleneck features gain the highest accuracy, 36.03%. As the Chinese ASR system is trained on a larger speech corpus than the English ASR system, Chinese ASR bottleneck features have better performance. It shows the efficiency of features extracted from the multilingual system.

6.2 Video Feature Analysis
In this section, we show our face detection approach and the performance of video features.
6.2.1 Face Detection Methods. Among the provided faces, 17 videos in the training dataset and 12 videos in the validation dataset are falsely detected. For those videos, we manually initialize the position of the first face and then use an object tracking method to extract the following faces. In the end, we convert faces into grey-scale images and apply histogram equalization to alleviate the impact of lighting.
6.2.2 Performance of Video Features. We extract bottleneck features from both SFEW fine-tuned models and FER+ fine-tuned models. We find that SFEW fine-tuned models gain worse performance compared with FER+ fine-tuned models. Therefore, only FER+ fine-tuned models are considered.

Table 4: Classification Accuracy of Video Features (%)

Exp.   Features            Statistical Functions   Classifiers   Accuracy
1      DenseNet_pool3      None                    LSTM          41.25
2      VGG_conv5-b         Mean                    RF            43.08
3      VGG_conv5-c         None                    LSTM          43.34
4      VGG_fc1             Mean                    RF            41.78
5      VGG_fc2             None                    LSTM          39.16
6      C3D_fc6             None                    LSTM          37.86
7      Geo-Features        Max                     SVM           28.86
8      Background          None                    LSTM          24.54
9      Identity Features   Mean                    RF            36.81
10     LBPTOP              —                       SVM           38.81
11     LBP                 —                       SVM           29.24
12     HOG                 —                       RF            39.95
13     HOGLBP              —                       RF            40.73
14     DSIFT               —                       RF            39.69

Exp. 1~9 in Table 4 choose frame-level or segment-level features. Exp. 10~14 in Table 4 evaluate multiple video-level features. From the experimental results, we find that different video features need different statistical functions and classifiers. VGG_conv5-c features gain the highest accuracy, 43.34%, which outperforms the best result in the audio modality. HOGLBP features are the best handcrafted features, gaining 40.73% accuracy. From Exp. 7~9 in Table 4, we find that our newly proposed features have worse performance compared with the other visual features. However, through further experiments, we find that those features (especially the Identity features) are helpful during the fusion phase. We can gain higher accuracy if we take those features into account.

6.3 Text Feature Analysis
We utilize the open-source Baidu API1 to recognize the audio contents. To reduce the size of the vocabulary, we remove words whose frequency is less than three. Furthermore, we convert each word to its base form. For example, 'go', 'going' and 'gone' are all converted into 'go'. Then TF-IDF and WV features are extracted.

Table 5: Classification Accuracy of Text Features (%)

Exp.   Features   Statistical Functions   Classifiers   Accuracy
1      WV         Max                     SVM           36.94
2      TF-IDF     —                       RF            27.68

From the experimental results in Table 5, we find that WV features are more suitable for the limited dataset. WV features gain the highest accuracy, 36.94%, which outperforms the best features in the audio modality. It shows the effectiveness of textual features.

6.4 Fusion Results
Through the BS-Fusion, a subset of emotion possibilities is selected according to the classification performance on the validation dataset. On the testing dataset, we achieve 60.34% accuracy.

1 Baidu API: http://ai.baidu.com/docs#/ASR-Online-Python-SDK/top
Figure 11: The confusion matrix in the testing dataset.

From Fig. 11, we can see that our approach performs well on angry, happy and neutral. However, disgust and surprise are easily confused with other categories.
Surprise is easily confused with fear and neutral. This is related to the vague definition of surprise, which contains both pleasant surprises and fright. Pleasant surprises are easily confused with happy, and fright is easily confused with fear.
Disgust is more blurred than surprise. Disgust is related to the video content. If we add video description information, the recognition accuracy of disgust can be increased.

7 CONCLUSIONS
In this paper, we present the audio-video-text based emotion recognition system submitted to EmotiW 2018. Features from different modalities are trained individually. Then emotion possibilities are extracted and passed into the BS-Fusion. We evaluate our method in the EmotiW 2018 Audio-Video based sub-challenge. Multiple features and classifiers are investigated. Through experimental analysis, we find that the video modality has the highest recognition accuracy among the three modalities. Finally, we achieve 60.34% recognition accuracy on the testing dataset via the BS-Fusion.
In the future, we will add more discriminative features for emotion recognition. Since emotion expression is related to the video content, video description information will be considered. Furthermore, movie types also count. Fear is common in horror films.

REFERENCES
[1] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting Large, Richly Annotated Facial-Expression Databases from Movies," IEEE Multimedia, vol. 19, no. 3, pp. 34-41, 2012.
[2] S. E. Kahou et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proceedings of the 15th ACM on International Conference on Multimodal Interaction, 2013, pp. 543-550: ACM.
[3] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, "Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild," in Proceedings of the 16th International Conference on Multimodal Interaction, 2014, pp. 494-501: ACM.
[4] A. Yao, J. Shao, N. Ma, and Y. Chen, "Capturing au-aware facial features and their latent relations for emotion recognition in the wild," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 451-458: ACM.
[5] Y. Fan, X. Lu, D. Li, and Y. Liu, "Video-based emotion recognition using CNN-RNN and C3D hybrid networks," in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 445-450: ACM.
[6] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen, "Learning supervised scoring ensemble for emotion recognition in the wild," in The ACM International Conference, 2017, pp. 553-560.
[7] H. Kaya, F. Gürpinar, S. Afshar, and A. A. Salah, "Contrasting and combining least squares based learners for emotion recognition in the wild," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 459-466: ACM.
[8] B. Sun, L. Li, T. Zuo, Y. Chen, G. Zhou, and X. Wu, "Combining Multimodal Features with Hierarchical Classifier Fusion for Emotion Recognition in the Wild," in ACM on International Conference on Multimodal Interaction, 2014, pp. 481-486.
[9] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886-893: IEEE.
[10] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
[11] V. Ojansivu and J. Heikkilä, "Blur insensitive texture classification using local phase quantization," in International Conference on Image and Signal Processing, 2008, pp. 236-243: Springer.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[13] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915-928, 2007.
[14] F. Perronnin and C. Dance, "Fisher Kernels on Visual Vocabularies for Image Categorization," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.
[15] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169-2178: IEEE.
[16] B. Sun et al., "Combining multimodal features within a fusion network for emotion recognition in the wild," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 497-502: ACM.
[17] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual Path Networks," 2017.
[18] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 6000-6010.
[19] L. Chen et al., "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning," pp. 6298-6306, 2017.
[20] A. Van Den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[21] L. Shen, Z. Lin, and Q. Huang, "Relay backpropagation for effective learning of deep convolutional neural networks," in European Conference on Computer Vision, 2016, pp. 467-482: Springer.
[22] X. Ouyang et al., "Audio-visual emotion recognition using deep transfer learning and multiple temporal models," in The ACM International Conference, 2017, pp. 577-582.
[23] J. Yan et al., "Multi-clue fusion for emotion recognition in the wild," in ACM International Conference on Multimodal Interaction, 2016, pp. 458-463.
[24] S. A. Bargal, E. Barsoum, C. C. Ferrer, and C. Zhang, "Emotion recognition in the wild from videos using images," in ACM International Conference on Multimodal Interaction, 2016, pp. 433-436.
[25] H. Kaya and A. A. Salah, "Combining Modality-Specific Extreme Learning Machines for Emotion Recognition in the Wild," 2014, pp. 487-493.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[27] K. Cho, B. V. Merrienboer, D. Bahdanau, and Y. Bengio, "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches," Computer Science, 2014.
[28] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489-4497: IEEE.
[29] A. Yao, D. Cai, P. Hu, S. Wang, L. Sha, and Y. Chen, "HoloNet: towards robust emotion recognition in the wild," in ACM International Conference on Multimodal Interaction, 2016, pp. 472-478.
[30] V. Vielzeuf, S. Pateux, and F. Jurie, "Temporal multimodal fusion for video emotion classification in the wild," pp. 569-576, 2017.

[31] A. Dhall, A. Kaur, R. Goecke, and T. Gedeon, "EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction," in ACM on International Conference on Multimodal Interaction, 2018.
[32] A. Miech, I. Laptev, and J. Sivic, "Learnable pooling with Context Gating for video classification," 2017.
[33] F. Eyben, M. Wöllmer, and B. Schuller, "Opensmile: the munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459-1462: ACM.
[34] F. Eyben et al., "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190-202, 2016.
[35] Y. Aytar, C. Vondrick, and A. Torralba, "Soundnet: Learning sound representations from unlabeled video," in Advances in Neural Information Processing Systems, 2016, pp. 892-900.
[36] S. Hershey et al., "CNN architectures for large-scale audio classification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 131-135.
[37] J. F. Gemmeke et al., "Audio Set: An ontology and human-labeled dataset for audio events," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776-780.
[38] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Computer Science, 2014.
[39] G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2261-2269.
[40] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[41] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, "Training deep networks for facial expression recognition with crowd-sourced label distribution," in ACM International Conference on Multimodal Interaction, 2016, pp. 279-283.
[42] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon, "Video and image based emotion recognition challenges in the wild: EmotiW 2015," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 423-426: ACM.
[43] S. Chen, Q. Jin, J. Zhao, and S. Wang, "Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition," in The Workshop, 2017, pp. 19-26.
[44] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li, "Large-Scale Video Classification with Convolutional Neural Networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732.
[45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[46] X. Liu, M. Kan, W. U. Wanglong, S. Shan, and X. Chen, "VIPLFaceNet: an open source deep face recognition SDK," Frontiers of Computer Science, vol. 11, no. 2, pp. 208-218, 2017.
[47] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in European Conference on Machine Learning, 1998, pp. 137-142.
[48] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," Computer Science, 2013.
[49] S. Abu-El-Haija et al., "YouTube-8M: A Large-Scale Video Classification Benchmark," 2016.
