Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching
Fig. 1. An overview of the proposed speech emotion recognition framework using DCNNs and DTPM: (1) Three channels of log Mel-spectrograms (static, delta
and delta-delta) are extracted and divided into N overlapping segments as the DCNN input. (2) A DCNN model is employed for automatic feature learning on each
segment to generate segment-level features. (3) A DTPM scheme is designed to concatenate the learned segment-level features to form a global utterance-level
feature representation. (4) With utterance-level features, a linear SVM classifier is employed to predict utterance-level emotions.
…employ 1-D convolution, such as frequency convolution [14]–[16] or time convolution [17], rather than the 2-D convolution widely used in DCNN models [7], [10]. Additionally, these 1-layer or 2-layer CNNs are much shallower than the deep structures in DCNN models [7], [10]. Accordingly, they may not learn affective features discriminative enough to distinguish the subjective emotions.

It has recently been found that, with deep multi-level convolutional and pooling layers, DCNNs usually exhibit much better performance than shallow CNNs in computer vision [19], [20]. This is reasonable because the deep structures of DCNNs can effectively model the hierarchical architecture of information processing in the primate visual perception system [7], [10]. Motivated by the promising performance of deep models, this work aims to employ DCNNs to develop an effective speech emotion recognition system.

The success of DCNNs in visual tasks motivates us to test DCNNs in speech emotion recognition. To achieve this, three issues need to be addressed. First, a proper speech representation should be designed as the DCNN input. Previous works [14]–[17] have employed 1-D speech signals as the CNN inputs, and 1-D convolution is adopted for CNNs. Compared with 1-D convolution, 2-D convolution involves more parameters to capture more detailed temporal-frequency correlations, and thus potentially offers stronger feature learning ability. Therefore, it is important to convert 1-D speech signals into suitable 2-D representations as the DCNN input. Second, most existing emotional speech datasets [3]–[5] contain limited numbers of samples, which are not sufficient to train deep models with a large number of parameters. Finally, speech signals may have varying durations, whereas DCNN models require a fixed input size. It is hence easier to design DCNN models for speech segments of a fixed length than for the global utterance. Therefore, proper pooling strategies are needed to generate a global utterance-level feature representation from the segment-level features learned by DCNNs.

In this paper, we use deep features learned by DCNNs [7] and propose a Discriminant Temporal Pyramid Matching (DTPM) algorithm to pool deep features for speech emotion recognition. As illustrated in Fig. 1, three channels of log Mel-spectrograms (static, delta and delta-delta) are extracted as the DCNN input. The DCNN models are trained to produce deep features for each segment. The DTPM pools the learned segment-level features into a global utterance-level feature representation, followed by the linear SVM emotion classifier. Extensive experiments on four public datasets, i.e., the Berlin dataset of German emotional speech (EMO-DB) [21], the RML audio-visual dataset [22], the eNTERFACE05 audio-visual dataset [23], and the BAUM-1s dataset [24], demonstrate the promising performance of our proposed method.

The main contributions of this paper can be summarized as:
1) We propose to use three channels of log Mel-spectrograms generated from the original 1-D utterances as the DCNN input. This input is similar to the red, green, blue (RGB) image representation, and thus makes it possible to use existing DCNNs pre-trained on image datasets for affective feature extraction.
2) The proposed DTPM strategy combines temporal pyramid matching and optimal Lp-norm pooling to generate a discriminative utterance-level feature representation from segment-level features learned by DCNNs.
3) We find that a DCNN model pre-trained for image applications performs reasonably well in affective feature extraction. Further fine-tuning on the target speech emotion recognition tasks substantially improves the recognition performance.

The rest of this paper is structured as follows. The related works are reviewed in Section II. Section III describes our DCNN model for affective feature extraction. Section IV presents the details of our DTPM scheme. Section V describes and analyzes the experimental results. Section VI provides discussions, followed by the conclusions in Section VII.
II. RELATED WORK

Generally, feature extraction and emotion classification are two key steps in speech emotion recognition. In this section, we first briefly review emotion classifiers and then focus on feature extraction since it is more relevant to our work.

A. Emotion Classifier

For emotion classification, various machine learning algorithms have been utilized to build a good classifier that distinguishes the underlying emotion categories. Early emotion classifiers include K-Nearest-Neighbor (KNN) [25] and Artificial Neural Network (ANN) [26]. Later, a number of statistical pattern recognition approaches, such as the Gaussian Mixture Model (GMM) [27], Hidden Markov Models (HMM) [28], and SVM [29], were widely adopted for speech emotion recognition. Recently, some advanced classifiers based on sparse representation [30], [31] have also been studied. Nevertheless, each classifier has its own advantages and disadvantages. To integrate the merits of different classifiers, ensembles of multiple classifiers have been investigated for speech emotion recognition [32], [33].

B. Feature Extraction

Affective speech features widely used for emotion recognition can be roughly divided into four categories: 1) acoustic features [34], [35]; 2) language features, such as lexical information [36], [37]; 3) context information, such as subject, gender, and culture influences [38], [39]; and 4) hybrid features [36], [40], i.e., the integration of two or three of the above.

Acoustic features, one of the most popular types of affective features, mainly contain prosody features, voice quality features, and spectral features [34], [35]. Pitch, loudness, and duration are commonly used as prosody features [41], since they express the stress and intonation patterns of spoken language. Voice quality features, as the characteristic auditory colouring of an individual voice, have been shown to be discriminative in expressing positive or negative emotions [42]. The widely used voice quality features are the first three formants (F1, F2, F3), spectral energy distribution, harmonics-to-noise ratio, pitch irregularity (jitter), amplitude irregularity (shimmer), and so on. Combining prosody features and voice quality features shows better performance than using prosody features alone [43], [44]. In recent years, glottal features [45] and voice source parameters [46] have been used as more advanced voice quality features for speech emotion recognition. The third typical type of acoustic features is spectral features, computed from the short-term power spectrum of sound, such as Linear Prediction Cepstral Coefficients (LPCC), Log Frequency Power Coefficients (LFPC) and Mel-frequency Cepstral Coefficients (MFCC). Among them, MFCC is the most popular spectral feature, since it is able to model the human auditory perception system. In recent years, modulation spectral features [47] from an auditory-inspired long-term spectro-temporal representation, and weighted spectral features [48] based on local Hu moments, have also been studied. In addition, the newly-developed Geneva minimalistic acoustic parameter set (GeMAPS) [5], which covers frequency, energy, and spectral related features, has shown promising performance in speech emotion recognition.

Language features, which are computed based on the verbal contents of speech, are another important representation conveying emotion information. Note that language features are usually combined with acoustic features for speech emotion recognition [36], [37]. In [37], language features are extracted with bag-of-n-gram and character n-gram approaches. The linguistic features are then combined with acoustic features to predict dimensional emotions in a 3-D continuous space. In [36], by computing the weight of every word, a four-dimensional emotion lexicon for four emotion classes, i.e., anger, joy, sadness and neutral, is obtained. These feature representations are then integrated via early fusion and late fusion for speech emotion recognition.

Context information has also been investigated in the recent literature [38], [39] for emotion recognition. In [38], the authors present a context analysis of subject and text for speech emotion recognition, and find that gender-based context information enhances recognition performance. In [39], the influence of cultural information on speech emotion recognition is explored. The authors claim that intra-cultural and multi-cultural emotion recognition paradigms give better performance than cross-cultural recognition.

Note that, since the hand-designed features mentioned above are low-level, they may not be discriminative enough to identify the subjective emotions. To tackle this issue, it may be feasible to employ deep learning techniques to automatically learn high-level affective features for speech emotion recognition.

III. DCNNS FOR AFFECTIVE FEATURE EXTRACTION

To utilize DCNNs in speech emotion recognition, three problems should be addressed. First, the DCNN input should be properly computed from 1-D speech signals. Second, DCNN training requires a large amount of labeled data. Third, a feature pooling strategy is required to generate the global utterance-level feature representation from the DCNN outputs on local segments. In this section, we present the details of how the first two problems are addressed.

Fig. 2 illustrates the framework for affective feature extraction. From the original 1-D utterance, we first extract the static 2-D log Mel-spectrogram and then reorganize it into three channels of log Mel-spectrograms (static, delta and delta-delta). For data augmentation, the log Mel-spectrogram extracted from an utterance is divided into a certain number of overlapping segments as the DCNN input. More details about data augmentation can be found in Section V-B. Then the AlexNet DCNN model [7] pre-trained on the large-scale ImageNet dataset is fine-tuned for affective feature extraction. We present more details of these two steps in the following two sections.

Fig. 2. The flowchart of our DCNN model for affective feature extraction. Three channels of log Mel-spectrograms with size 64 × 64 × 3 (static, delta and delta-delta) are first produced, and then resized to 227 × 227 × 3 as the DCNN input. The DCNN model is first initialized with the AlexNet [7], then fine-tuned on target emotional datasets. The 4096-D FC7 output is finally used as the segment-level affective features.

A. Generation of DCNN Input

Because of the limited training data for speech emotion recognition, it is not possible to directly train a robust deep model.
Motivated by the promising performance of available DCNN models, we propose to first initialize deep models with an available DCNN model like AlexNet [7], and then fine-tune it for transfer learning on target emotional datasets. Because available DCNN models take 2-D or 3-D images as inputs, we transform the raw 1-D speech signal into a 3-D array as the DCNN input.

Abdel-Hamid et al. [14] adopt the extracted log Mel-spectrogram and organize it into a 2-D array as the CNN input, with a shallow 1-layer structure, for speech recognition. Specifically, for each frame with a context window of 15 frames and 40 Mel-filter banks, they construct 45 (i.e., 15 × 3) 1-D feature maps of 40 Mel bins each, i.e., a 40 × 45 array. Then, a 1-D convolutional kernel is applied along the frequency axis. However, speech emotion recognition using DCNNs is different from speech recognition in [14]. First, 1-D convolution along the frequency axis cannot capture the temporal information, which is important for emotion recognition. Second, the divided segments of 15 frames (about 165 ms) used for speech recognition are too short to distinguish emotions, since it has been found that only a speech segment longer than 250 ms presents sufficient information for identifying emotions [49], [50].

To address these two issues, from the raw 1-D speech signals we generate the following overlapping Mel-spectrogram segments (abbreviated as Mel_SS) as the DCNN input:

$$\text{Mel\_SS} \in \mathbb{R}^{F \times T \times C}, \qquad (1)$$

where F is the number of Mel-filter banks, T is the segment length corresponding to the number of frames in a context window, and C (C = 1, 2, 3) represents the number of channels of the Mel-spectrogram. Note that C = 1 denotes one channel of Mel-spectrogram, i.e., the original static spectrogram; C = 2 denotes the static and delta coefficients of the Mel-spectrogram; and C = 3 represents three channels of Mel-spectrograms including the static, delta and delta-delta coefficients.

As an example, described in Fig. 2, we extract Mel_SS with size 64 × 64 × 3 (F = 64, T = 64, C = 3) as the DCNN input. This kind of three-channel spectrogram is analogous to the RGB image representation of visual data. In detail, for an utterance we adopt 64 Mel-filter banks from 20 to 8000 Hz to obtain the whole log Mel-spectrogram, using a 25 ms Hamming window applied every 10 ms. Then, a context window of 64 frames is applied to the whole log Mel-spectrogram to extract static 2-D Mel-spectrogram segments with size 64 × 64. A frame shift of 30 frames is used to produce such overlapping segments of the Mel-spectrogram. Each segment hence includes a context window of 64 frames and its length is 10 ms × 63 + 25 ms = 655 ms. In this case, the segment length is about 2.6 times longer than the suggested length of 250 ms in [49], [50], and conveys sufficient clues for emotion recognition. Note that we set F to 64 because the input height-width ratio of our DCNN model is 1:1. Besides, F is usually set to relatively large values when CNNs are used; for example, F is set to 40 in speech recognition [14] and 60 in speech emotion recognition [16]. Therefore, it is reasonable to set F to 64 in this work.

In speech recognition, the first and second temporal derivatives of the extracted acoustic features, such as MFCC, are widely used as additional features. Similarly, after extracting the static 2-D Mel-spectrogram, we also calculate the first-order and second-order regression coefficients along the time axis as the delta and delta-delta coefficients of the Mel-spectrogram. In this way, we organize the 1-D speech signals into three channels of Mel-spectrogram segments, i.e., Mel_SS with size 64 × 64 × 3 (three channels: static, delta and delta-delta), as the DCNN input. Then, 2-D convolution along the frequency axis and time axis can be performed for DCNN training on this input.

When using the AlexNet DCNN model [7] for affective feature extraction, we have to resize the 64 × 64 × 3 spectrogram to 227 × 227 × 3, which is the input size of AlexNet. Since the extracted three channels of Mel-spectrograms can be regarded as an RGB image representation, we perform the resize operation with bilinear interpolation, which is commonly used for image resizing. Note that the number of channels of the Mel-spectrogram C and the segment length T may have an important impact on the learned deep features. Therefore, we will investigate their effects on the recognition accuracy in experiments.
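To make the input generation concrete, the following is a minimal Python sketch of the Mel_SS construction described above. It is an illustrative approximation, not the authors' implementation: the librosa/SciPy calls, the parameter names, and the bilinear-style resize via scipy.ndimage.zoom are our own assumptions.

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def mel_ss_segments(wav_path, sr=16000, n_mels=64, context=64, shift=30, out_size=227):
    """Build 3-channel log Mel-spectrogram segments (static, delta, delta-delta)."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms Hamming window applied every 10 ms, 64 Mel-filter banks from 20 Hz to 8 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        win_length=int(0.025 * sr), window="hamming",
        n_mels=n_mels, fmin=20, fmax=8000)
    log_mel = librosa.power_to_db(mel)                 # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)   # first-order regression (delta)
    delta2 = librosa.feature.delta(log_mel, order=2)   # second-order regression (delta-delta)
    segments = []
    for start in range(0, log_mel.shape[1] - context + 1, shift):
        seg = np.stack([c[:, start:start + context]
                        for c in (log_mel, delta1, delta2)], axis=-1)   # 64 x 64 x 3
        # resize to the AlexNet input size 227 x 227 x 3 (order=1 is (bi)linear)
        scale = (out_size / seg.shape[0], out_size / seg.shape[1], 1)
        segments.append(zoom(seg, scale, order=1))
    return np.asarray(segments)                        # shape: (N, 227, 227, 3)
```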
B. DCNN Architecture

As shown in Fig. 2, our DCNN model includes five convolutional layers, three of which are followed by max-pooling layers, and two fully-connected layers. The last fully-connected layer consists of 4096 units, giving a 4096-D feature representation. This structure is identical to that of AlexNet [7], which is trained on the large-scale ImageNet dataset. The initial parameters of this DCNN model can thus be copied from the AlexNet, making this DCNN model easier to train on speech emotion recognition tasks. In the following, we introduce the computations and principles of the convolutional layer, pooling layer and fully-connected layer, respectively.

Convolutional layer: A convolutional layer employs a set of convolutional filters to extract multiple local patterns at each local region in the input space, and produces many feature maps. This can be denoted as

$$(h_k)_{ij} = (W_k \otimes q)_{ij} + b_k, \qquad (2)$$

where $(h_k)_{ij}$ denotes the (i, j) element of the k-th output feature map, q represents the input feature maps, and $W_k$ and $b_k$ denote the k-th filter and bias, respectively. The symbol ⊗ represents the 2-D spatial convolution operation.

Pooling layer: After each convolutional layer, a pooling layer may be used. The pooling layer down-samples the feature maps obtained from the previous convolutional layer and produces a single output from local regions of the convolution feature maps. Two widely used pooling operators are max-pooling and average-pooling. A max-pooling or average-pooling layer produces a lower-resolution version of the convolution layer activations by taking the maximum or average filter activation from different positions within a specified window.

Fully-connected layer: This layer integrates the outputs from previous layers to yield the final feature representations for classification or regression. The activation function is a sigmoid or tanh function. The output of a fully-connected layer is computed by

$$x_k = \sum_l W_{kl}\, q_l + b_k, \qquad (3)$$

where $x_k$ denotes the k-th output neuron, $q_l$ denotes the l-th input neuron, $W_{kl}$ represents the weight connecting $q_l$ with $x_k$, and $b_k$ denotes the bias term of $x_k$.

Since fully-connected layers can be taken as convolutional layers with a kernel size of 1 × 1, (3) can be reformulated as

$$(x_k)_{1,1} = (W_k \otimes q)_{1,1} + b_k. \qquad (4)$$

For DCNN training, Stochastic Gradient Descent (SGD) is commonly employed with parameters like the mini-batch size, the momentum value (e.g., 0.9), and the weight decay value (e.g., 0.0005). In this case, the weight w is updated by

$$v_{i+1} = 0.9 \cdot v_i - 0.0005 \cdot \eta \cdot w_i - \eta \cdot \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1}, \qquad (5)$$

where v denotes the momentum variable, η is the learning rate, i is the iteration index, and $\langle \frac{\partial L}{\partial w}|_{w_i}\rangle_{D_i}$ is the mean of the derivatives over the i-th batch $D_i$. The network can hence be updated by back-propagation. More details of DCNN training can be found in [7].

In our DCNN training, we first initialize the network with the parameters of the AlexNet, and then fine-tune the network on the emotion classification task, which uses the Mel_SS with size 227 × 227 × 3 as input and multiple emotion classes as output. Note that the number of classes used in the AlexNet model is 1000, but in our emotion classification tasks the number of emotion categories is 6 or 7. Therefore, our DCNN model differs from the AlexNet in the last two layers, where our model predicts 6 or 7 emotion categories.

After fine-tuning the AlexNet model, we take the output of its FC7 layer as the segment-level affective features x. Given N overlapping Mel-spectrogram segments as the inputs of the DCNN model, we obtain a segment-level feature representation $X = (x_1, x_2, \cdots, x_N) \in \mathbb{R}^{d \times N}$ with feature dimensionality d = 4096. This representation X is then used as the input of the following DTPM algorithm to produce the global utterance-level features for emotion classification.

IV. DTPM FOR UTTERANCE-LEVEL FEATURE REPRESENTATION

Because speech utterances have varying durations, the above-mentioned segment-level features X have a varying number of segments. This unfixed dimensionality makes such segment-level features not directly usable for emotion recognition. Therefore, we proceed to convert the segment-level features into an utterance-level feature representation with fixed dimensionality. This process, also called feature pooling, is widely used in computer vision to convert local features into global features for image classification and retrieval.

There are two widely-used pooling strategies, i.e., average-pooling and max-pooling, which compute the averaged values and the max values on each dimension, respectively. Note that different pooling strategies are suited to different types of features, e.g., max-pooling is suited to sparse features. It is difficult to decide which pooling strategy is optimal for our segment-level affective features. Moreover, most pooling strategies discard the temporal clues of speech signals, which might be important for distinguishing emotions.

Our DTPM is motivated to simultaneously embed the temporal clues and find the optimal pooling strategy. It is partially inspired by Spatial Pyramid Matching (SPM) [51], which embeds the spatial clues during feature pooling for image classification. In SPM, an image is first divided into regions at different scales, then feature pooling is conducted on each region. The final feature is hence the concatenation of the pooled features at each scale. Similarly, we also divide the segment-level features X into non-overlapping sub-blocks along the time axis at different scales, then conduct feature pooling on each sub-block. The final concatenated feature thus integrates the temporal clues at different scales. The details will be presented in Section IV-A.
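The segment-level features x pooled by DTPM are simply the FC7 activations of the fine-tuned network described in Section III-B. The paper's implementation uses MatConvNet in MATLAB; the following PyTorch sketch is only a rough illustrative equivalent (the swapped-in classifier head, the helper name fc7_features, and the torchvision calls are our own assumptions).

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7   # 6 or 7 emotion categories, depending on the dataset

# Start from AlexNet pre-trained on ImageNet and replace its 1000-way
# classifier with an emotion classifier. ("pretrained=True" assumes an older
# torchvision API; newer versions use weights=models.AlexNet_Weights.IMAGENET1K_V1.)
model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# ... fine-tune `model` on the 227x227x3 Mel_SS segments here
#     (the paper uses SGD with momentum 0.9 and learning rate 0.001) ...

def fc7_features(model, batch):
    """Return the 4096-D FC7 activations for a batch of segments
    (a tensor of shape [N, 3, 227, 227])."""
    model.eval()
    with torch.no_grad():
        h = model.features(batch)        # convolutional layers
        h = model.avgpool(h)
        h = torch.flatten(h, 1)
        h = model.classifier[:6](h)      # stop before the final emotion layer
    return h                             # shape: [N, 4096]
```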
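Before turning to how the pooling parameter is learned, the temporal pyramid pooling idea just described can be written as the following minimal numpy sketch. It assumes, in the spirit of SPM, that level l splits the segments into 2^(l-1) equal sub-blocks and that the FC7 features are non-negative; the exact partition used in the paper may differ.

```python
import numpy as np

def lp_pool(X, p):
    """Lp-norm pooling of a d x n block of segment-level features into a
    d-dimensional vector: ((1/n) * sum_i x_i**p) ** (1/p). p = 1 recovers
    average pooling and a large p approaches max pooling. Assumes
    non-negative features (e.g., FC7 activations after ReLU)."""
    return np.mean(X ** p, axis=1) ** (1.0 / p)

def dtpm_pool(X, p, L=2):
    """Temporal pyramid pooling of X (d x N segment-level features): level l
    splits the N segments into 2**(l-1) non-overlapping sub-blocks along the
    time axis, each sub-block is Lp-pooled, and all pooled vectors are
    concatenated into one fixed-length utterance-level feature.
    Assumes N >= 2**(L-1)."""
    n = X.shape[1]
    pooled = []
    for level in range(1, L + 1):
        for idx in np.array_split(np.arange(n), 2 ** (level - 1)):
            pooled.append(lp_pool(X[:, idx], p))
    return np.concatenate(pooled)
```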
To acquire the optimal pooling strategy, we formulate the feature pooling on each sub-block $X_m$ as Lp-norm pooling, which computes a d-dimension feature representation $f_p(X_m)$, i.e.,

$$f_p(X_m) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\,p}\right)^{\frac{1}{p}},$$

and seek the optimal pooling parameter p jointly with a discriminant vector α that maximizes the class separability of the pooled features, i.e.,

$$\alpha^*, p^* = \arg\max_{\alpha, p}\ \Omega(\alpha, p) := \frac{\alpha^T S_b(p)\,\alpha}{\alpha^T S_\omega(p)\,\alpha}, \qquad (11)$$

where $S_b(p)$ represents the inter-class separability and $S_\omega(p)$ represents the intra-class separability. They are computed by

$$S_b(p) = \sum_i \sum_{j \in N_k^-(i)} (v_i^p - v_j^p)(v_i^p - v_j^p)^T, \qquad S_\omega(p) = \sum_i \sum_{j \in N_k^+(i)} (v_i^p - v_j^p)(v_i^p - v_j^p)^T, \qquad (12)$$

where $N_k^-(i)$ denotes the index set of the k nearest neighbors of the pooled feature $v_i^p$ from different classes, and $N_k^+(i)$ denotes the k nearest neighbors of $v_i^p$ from the same class.

Eq. (11) can be solved by optimizing α and p alternately. When p is fixed, the objective function becomes the classical Linear Discriminant Analysis (LDA) [55], [56] problem. In this case, $S_b(p)$ and $S_\omega(p)$ represent the between-class scatter matrix and the within-class scatter matrix, respectively. Therefore, the optimal solution $\alpha^*$ can be obtained in closed form for a fixed p:

$$\alpha^* = \arg\max_\alpha \lambda, \qquad \text{s.t.}\ \ S_b\,\alpha = \lambda\, S_\omega\,\alpha. \qquad (13)$$

The optimal solution $\alpha^*$ is the eigenvector corresponding to the largest eigenvalue $\lambda_{\max}$.

When α is fixed, the optimization problem in (11) has no closed-form solution. Nevertheless, it can be solved with an iterative gradient procedure. Specifically, with a fixed α, we can get

$$\tilde{S}_b(p) = \alpha^T S_b(p)\,\alpha = \sum_i \sum_{j \in N_k^-(i)} (u_i^p - u_j^p)^2, \qquad \tilde{S}_\omega(p) = \alpha^T S_\omega(p)\,\alpha = \sum_i \sum_{j \in N_k^+(i)} (u_i^p - u_j^p)^2. \qquad (14)$$

The partial derivatives of $\tilde{S}_b(p)$ and $\tilde{S}_\omega(p)$ with respect to p are then computed by

$$\frac{\partial \tilde{S}_b}{\partial p} = 2 \sum_i \sum_{j \in N_k^-(i)} (u_i^p - u_j^p)\,\alpha^T(\beta_i - \beta_j), \qquad \frac{\partial \tilde{S}_\omega}{\partial p} = 2 \sum_i \sum_{j \in N_k^+(i)} (u_i^p - u_j^p)\,\alpha^T(\beta_i - \beta_j), \qquad (15)$$

where β denotes the Hadamard product $\beta = v^p \circ \ln v$. Then we can get the partial derivative of (11) with respect to p:

$$\nabla p = \frac{\partial}{\partial p}\Omega(\alpha, p) = \frac{1}{\tilde{S}_\omega^{\,2}}\left(\frac{\partial \tilde{S}_b}{\partial p}\,\tilde{S}_\omega - \frac{\partial \tilde{S}_\omega}{\partial p}\,\tilde{S}_b\right). \qquad (16)$$

The value of p can then be updated along the gradient direction with a step size γ, i.e.,

$$p^{(t+1)} = p^{(t)} + \gamma \cdot \nabla p, \qquad (17)$$

where the superscript t denotes the t-th iteration. In our implementation, the iteration stops if the number of iterations exceeds the permitted number $N_{iter}$. After acquiring the final feature representation $u_p(X)$, we use it for emotion classification with classifiers such as SVM.

Our training strategy divides the utterances into segments. This enlarges the training set for DCNNs, but may make emotion recognition on each segment more difficult if the segment is too short. We have carefully set the length of each segment to 655 ms, which is about 2.6 times longer than the 250 ms suggested for emotion recognition in [49], [50]. Therefore, each segment should preserve sufficient clues for emotion recognition. To conduct utterance-level emotion recognition, we generate utterance-level features with the DTPM, which aggregates segment-level features at different scales with Lp-norm pooling. DTPM is inspired by the Spatial Pyramid Matching (SPM) [51] commonly used in image classification. SPM aggregates low-level features from image patches to form a global feature discriminative to high-level semantics. Similar to SPM, DTPM is capable of learning a discriminative utterance-level feature from local segment-level features. In the following section, we test the validity of this training strategy.

V. EXPERIMENTS

A. Datasets

We test the proposed method on four public datasets, including the Berlin dataset of German emotional speech (EMO-DB) [21], the RML audio-visual dataset [22], the eNTERFACE05 audio-visual dataset [23], and the BAUM-1s audio-visual dataset [24].

EMO-DB: The acted EMO-DB speech corpus [21] contains 535 emotional utterances with seven different acted emotions: anger, joy, sadness, neutral, boredom, disgust and fear. Ten professional native German-speaking actors (five female and five male) are asked to simulate these emotions, giving 10 German utterances (five short and five long sentences) that can be used in everyday communication. These actors are required to read the predefined sentences in the targeted seven emotions. The recordings in this dataset are taken in an anechoic chamber with high-quality recording equipment and produced at a sampling rate of 16 kHz with 16-bit resolution and a mono channel. The audio files are on average around 3 seconds long. A human perception test with 20 additional subjects is conducted to evaluate the quality of the recorded data.

RML: The acted RML audio-visual dataset [22], collected at the Ryerson Multimedia Research Lab, Ryerson University, contains 720 utterances of eight subjects of different genders and cultures, in six different spoken languages. It consists of six emotions: anger, disgust, fear, joy, sadness, and surprise. The samples were recorded at a sampling rate of 44,100 Hz with 16-bit resolution and a mono channel. The audio files are on average around 5 seconds long. To ensure the context independency of the speech samples, more than ten reference sentences for each emotion are provided. At least two participants who do not know the corresponding language are employed in a human perception test to evaluate whether the correct emotion is expressed.
eNTERFACE05: The eNTERFACE05 dataset [23] is an induced audio-visual emotion dataset with six basic emotions, i.e., anger, disgust, fear, joy, sadness, and surprise. 42 subjects from 14 different nationalities are included. Each subject is asked to listen to six successive short stories, each of which is used to induce a particular emotion. Two experts are employed to evaluate whether the reaction expresses the intended emotion in an unambiguous way. The speech utterances are extracted from video files of the subjects speaking in English. The audio sampling rate is 48 kHz. The audio files are on average around 3 seconds long. Overall, the eNTERFACE05 dataset contains 1290 utterances.

BAUM-1s: The spontaneous BAUM-1s audio-visual dataset [24] contains eight emotions (joy, anger, sadness, disgust, fear, surprise, boredom and contempt) and four mental states (unsure, thinking, concentrating and bothered). It has 1222 utterances collected from 31 Turkish subjects, 17 of whom are female. Emotion elicitation using video clips is employed to obtain spontaneous audio-visual expressions. Each utterance is given an emotion label by majority voting over five annotators. The audio files have a sampling rate of 48 kHz, and the average duration is around 3 seconds. As done in [22], [23], this work aims to identify six basic emotions (joy, anger, sadness, disgust, fear, surprise), giving 521 utterances in total for experiments. Note that BAUM-1s is a recent audio-visual emotional dataset released in 2016. Moreover, BAUM-1s records spontaneous rather than acted emotions, and thus defines a more challenging emotion recognition problem than acted datasets like EMO-DB and eNTERFACE05. Therefore, BAUM-1s is a reasonable and challenging test set.

B. Experimental Setup

1) Details of DCNN Training: Each of the four emotional datasets contains a limited number of samples. It is thus desirable to generate more samples for DCNN training. To address this issue, we directly split an utterance into a certain number of overlapping segments. Each segment is labeled with the utterance's emotion category for DCNN training. In this case, the number of training samples is decided by the overlap length (frame shift) between two adjacent segments, i.e., a smaller overlap results in a larger number of training samples. However, as suggested in [50], the overlap length should be larger than 250 ms in speech emotion recognition. Therefore, we set the overlap length to 30 frames, which is about 10 ms × 29 + 25 ms = 315 ms. As a result, when extracting Mel-spectrogram segments with size 64 × 64 × 3, we can significantly augment the size of the training data, i.e., from 535 utterances to 11,842 segments for the EMO-DB dataset, from 720 utterances to 11,316 segments for the RML dataset, from 1290 utterances to 16,186 segments for the eNTERFACE05 dataset, and from 521 utterances to 6368 segments for the BAUM-1s dataset, respectively.

Note that segmenting an utterance into small segments has been widely used for discrete emotion classification, as in [13], [57], [58]. Although it is not necessarily true that the emotion labels of all segments divided from an utterance are equivalent to that of the whole utterance, we can still employ DCNNs to learn effective segment-level features from the segment-level emotions, which can be utilized to predict utterance-level emotions.

The structure of the used DCNN model [7] is presented in Fig. 2. The DCNN model is trained with a mini-batch size of 30, Stochastic Gradient Descent (SGD) with a momentum of 0.9, and a learning rate of 0.001. The maximum number of epochs is set to 300. We run the DCNNs on the MATLAB2014 platform with the MatConvNet package [59], which is a MATLAB toolbox implementing CNNs for computer vision applications. One NVIDIA GTX TITAN X GPU with 12 GB memory is used to train the DCNNs in GPU mode. We employ the LIBSVM package [60] with the linear kernel function and the one-versus-one strategy for multi-class classification. When implementing optimal Lp-norm pooling, we set the number of permitted iterations $N_{iter} = 50$ and the number of nearest neighbors k = 20, as done in [52].

It is noted that the used DCNN model, AlexNet, was first reported in [7] with an input size of 224 × 224 × 3. However, in many practical implementations such as imagenet-caffe-alex, available at http://www.vlfeat.org/matconvnet/pretrained/, researchers commonly use an input size of 227 × 227 × 3 rather than 224 × 224 × 3.

2) Evaluation Methods: As suggested in [61], test-runs are implemented using a speaker-independent Leave-One-Speaker-Out (LOSO) or Leave-One-Speakers-Group-Out (LOSGO) cross-validation strategy, which is usually adopted in most real applications. Specifically, for the EMO-DB and RML datasets, we employ the LOSO scheme. For the eNTERFACE05 and BAUM-1s datasets, we use the LOSGO scheme with five speaker groups, similar to [24]. Note that we adopt speaker-independent test-runs, which are more realistic and challenging than speaker-dependent test-runs. Therefore, we only compare with works using the same setting and do not compare with works like [58] that report speaker-dependent results. The Weighted Average Recall (WAR), also known as the standard accuracy, is reported to evaluate the performance of speech emotion recognition. Here, WAR denotes the recognition rates of the individual classes weighted by the class distribution.

We evaluate the performance of two methods, i.e., DCNN-Average and DCNN-DTPM. The details of these two methods are described below.

DCNN-Average also uses DCNNs as the feature extractor. After extracting features on each Mel-spectrogram segment with DCNNs, conventional average-pooling is employed over all the segments to produce the final fixed-length global utterance-level features. Then the linear SVM classifier is adopted for emotion identification. We compare our method to DCNN-Average to show the validity of the proposed DTPM.

DCNN-DTPM is our proposed method described in Fig. 3.

C. Experimental Results and Analysis

We use Mel-spectrogram segments with size $\text{Mel\_SS} \in \mathbb{R}^{F \times T \times C}$ as the DCNN input, where F is the number of Mel-filter banks, commonly set to 64, T is the number of frames in each segment, and C represents the number of channels of the Mel-spectrogram. The parameters C and T largely affect the learned deep features.
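As a concrete illustration of the speaker-independent protocol in Section V-B2, the sketch below runs LOSO cross-validation with a linear SVM on utterance-level features and returns WAR (the overall accuracy). It is an assumption-laden stand-in rather than the authors' setup: the paper uses LIBSVM with a one-versus-one scheme, whereas scikit-learn's LinearSVC shown here defaults to one-versus-rest.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC

def loso_war(features, labels, speakers):
    """Speaker-independent Leave-One-Speaker-Out evaluation of utterance-level
    features with a linear SVM. `speakers` holds the speaker identity of
    every utterance; the returned WAR equals the overall accuracy."""
    correct, total = 0, 0
    for tr, te in LeaveOneGroupOut().split(features, labels, groups=speakers):
        clf = LinearSVC(C=1.0, max_iter=10000).fit(features[tr], labels[tr])
        pred = clf.predict(features[te])
        correct += int(np.sum(pred == labels[te]))
        total += len(te)
    return correct / total
```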
TABLE I
SPEAKER-INDEPENDENT ACCURACY (%) COMPARISONS BY SETTING DIFFERENT VALUES OF C USING A SIMPLIFIED DCNN MODEL

Dataset          EMO-DB                      RML                         eNTERFACE05                 BAUM-1s
C                1        2        3         1        2        3         1        2        3         1        2        3
DCNN-Average     73.86    77.44    78.92     59.25    61.84    61.19     62.33    65.01    66.42     36.49    38.21    38.62
DCNN-DTPM (L*)   77.03(3) 82.69(2) 83.53(3)  61.48(2) 64.88(3) 64.21(3)  65.95(1) 69.88(2) 70.25(2)  38.74(2) 39.05(3) 40.57(2)

The size of the spectrogram is 64 × 64 × C. L* denotes the value of L corresponding to the best performance of DCNN-DTPM.
Fig. 4. The effects of T on the EMO-DB dataset.
Fig. 5. The effects of T on the RML dataset.
Fig. 6. The effects of T on the eNTERFACE05 dataset.
Fig. 7. The effects of T on the BAUM-1s dataset.
…AlexNet, then fine-tune on the target emotional speech datasets. The reason why the AlexNet helps emotion recognition might be that we convert the audio signals into an image-like representation, together with the deep structure and huge training data of the AlexNet.

3) Effects of the Segment Length: The segment length T decides the duration of the audio signal the DCNN model processes. It hence may largely affect the discriminative power of the extracted affective features. We thus show the effects of T on the emotion recognition performance.

The shortest utterance is 1.23 seconds long on the EMO-DB dataset and 1.12 seconds long on the eNTERFACE05 dataset. Accordingly, for the EMO-DB and eNTERFACE05 datasets, we test T in the range [15, 30, 45, 64, 80, 100, 120], where T = 120 corresponds to about 1.22 seconds, which is close to the length of the shortest utterance. The shortest utterance is 3.27 seconds long on the RML dataset, so we test T in the range [15, 30, 45, 64, 80, 100, 120, 140, · · · , 320] on the RML dataset. On the BAUM-1s dataset, we test T in the range [15, 30, 45, 64, 80], since the shortest utterance is 0.768 seconds long. For utterances shorter than T, we simply repeat the first frame and the last frame of the utterance so that its length equals T. Note that for T = 15, a benchmark used in speech recognition, the overlap length of the Mel-spectrogram segments is 15 frames, whereas for T ≥ 30 the overlap length is 30 frames. All spectrograms with different T are resized to 227 × 227 × 3 with bilinear interpolation as the input of the DCNN. Figs. 4, 5, 6, and 7 show the effects of T on the four datasets. Table IV presents the best performance and the optimal T on the four datasets.

TABLE IV
THE BEST RECOGNITION ACCURACY (%) AND CORRESPONDING T USING THE FINE-TUNED ALEXNET ON FOUR DATASETS

Fine-tuning        EMO-DB     RML        eNTERFACE05   BAUM-1s
Segment length     T = 64     T = 220    T = 80        T = 64
DCNN-DTPM (L*)     87.31 (2)  75.34 (3)  79.25 (2)     44.61 (2)

L* denotes the value of L corresponding to the best performance.

From the experimental results, we can draw two conclusions. First, it can be observed that a larger T is helpful for better performance. However, too large a T does not consistently improve the performance. Table IV shows that the best performances on the four datasets are 87.31%, 75.34%, 79.25%, and 44.61%, respectively. The corresponding optimal T on the four datasets are 64, 220, 80, and 64, respectively. This may be because setting a larger T decreases the number of generated training samples for the DCNNs. Therefore, DCNN-DTPM does not always improve the performance as the segment length increases.

Second, the four curves show that the recognition performance of DCNN-DTPM remains stable when T is larger than 64. Setting T = 64 generally gives promising performance on the four datasets. This might be because the DTPM also considers the temporal clues, which makes the algorithm more robust to T. It is also interesting to observe that a segment length of 15 frames, i.e., T = 15, widely used for speech recognition [14], does not …
Fig. 8. Confusion matrix of DCNN-DTPM with an average accuracy of 87.31% on the EMO-DB dataset.
Fig. 10. Confusion matrix of DCNN-DTPM with an average accuracy of 79.25% on the eNTERFACE05 dataset.
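For reference, confusion matrices like those in Figs. 8 and 10 and the WAR metric used throughout the experiments can be computed from utterance-level predictions as in the following numpy sketch (the function and variable names are ours, and integer class labels 0..num_classes-1 are assumed):

```python
import numpy as np

def confusion_and_war(y_true, y_pred, num_classes):
    """Build a row-normalized confusion matrix and the Weighted Average Recall
    (per-class recall weighted by the class distribution), which equals the
    standard overall accuracy."""
    cm = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    class_counts = cm.sum(axis=1)
    recalls = np.diag(cm) / np.maximum(class_counts, 1)
    war = np.sum(recalls * class_counts / len(y_true))    # == overall accuracy
    cm_normalized = cm / np.maximum(class_counts[:, None], 1)
    return cm_normalized, war
```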
From Table V, we can see that our method is very competitive with the state-of-the-art results. Specifically, on the EMO-DB dataset our method performs best compared with [5], [12], [61], [62]. On the RML dataset, our method gives much better performance than [63], [64]. On the eNTERFACE05 dataset, our method clearly outperforms [12], [24], [61], and presents slightly lower performance than [62]. On the BAUM-1s dataset, our method also clearly outperforms [24], i.e., our 44.61% vs. 29.41% of [24] in terms of WAR. Therefore, although BAUM-1s is a relatively small dataset, it defines a challenging emotion recognition problem and also validates the advantages of the proposed algorithm. Note that in [61], the authors employ 6552 LLD acoustic features, such as prosody and MFCC, for emotion classification. This shows the advantage of our learned affective features using DCNNs. [12] also uses a DNN to learn discriminative features. Different from our work, [12] learns features from 6552 LLD acoustic features rather than from the raw speech signals or the spectrogram. This thus clearly shows the advantages of our DCNN model, i.e., using three channels of spectrograms as input and coding the raw DCNN features with DTPM to get the final feature representation. [62] reports its best performance by using the large AVEC-2013 feature set [65] on the EMO-DB dataset and the large ComParE feature set [66] on the eNTERFACE05 dataset.

Our experimental results show that our method achieves impressive recognition accuracies in comparison with state-of-the-art works. For example, we report a UAR accuracy of 86.30% on the EMO-DB dataset, which outperforms all the compared works, i.e., 79.1% by [12], 84.6% by [61], 86.0% by [5] and 86.1% by [62]. As far as we know, this is an early work using DCNNs pre-trained on the image domain for emotion recognition. The success of this work warrants further investigation in this direction. These distinctive characteristics distinguish our work from existing efforts on speech emotion recognition.

VI. DISCUSSIONS

The pyramid level L controls the number of levels in DTPM and thus may affect the recognition performance. In our experiments, we investigate the effects of L with values between 1 and 3. We do not use L ≥ 4, since the resulting feature dimensionality is too large. As shown in the above experimental results, L = 2 or L = 3 generally gives the best results. This indicates that dividing the Mel-spectrogram into multiple levels, i.e., L ≥ 2, helps to improve the performance. It can also be inferred that our algorithm is not very sensitive to L, and setting L = 2 or L = 3 is a reasonable option in most cases.

To verify the effectiveness of our Lp-norm pooling, we compare it with two commonly used pooling methods, i.e., average-pooling and max-pooling, in Table VI. This is conducted by modifying the value of p in DTPM, e.g., p = 1 corresponds to average-pooling, whereas p = ∞ corresponds to max-pooling. It can be seen from Table VI that our Lp-norm pooling performs better than the other two pooling methods. It can also be seen that it is hard to decide from experience which pooling strategy performs better for a specific task; e.g., max-pooling performs better than average-pooling on the RML and eNTERFACE05 datasets, but average-pooling performs better on the EMO-DB and BAUM-1s datasets. This shows the necessity of pooling strategy learning.

TABLE VI
RECOGNITION ACCURACY (%) COMPARISON OF THREE POOLING METHODS IN DTPM USING 64 × 64 × 3 MEL-SPECTROGRAM AND L = 2 ON FOUR DATASETS. p* DENOTES THE MEAN OPTIMAL VALUE OF p IN LOSO OR LOSGO TEST-RUNS

Feature pooling   EMO-DB        RML           eNTERFACE05   BAUM-1s
Average           83.28         60.73         71.08         41.94
Max               82.64         63.48         72.75         40.26
Ours (p*)         87.31 (1.12)  69.70 (1.50)  76.56 (1.58)  44.61 (0.21)

Since the Mel-spectrogram is a 2-D matrix, it is natural to utilize CNNs to learn emotion information from it. To this end, it is straightforward to train a deep model on the 64 × 64 spectrogram data. However, Tables I and IV indicate that directly using the 64 × 64 features to train a deep model obtains lower performance than our fine-tuned AlexNet. The reason might be the limited training data for speech emotion recognition. This motivates us to use the pre-trained AlexNet, which is already trained with millions of images and shows reasonably good performance in emotion feature extraction, as shown in Table II. Therefore, we initialize a deep model with the same structure and parameters as the AlexNet and fine-tune it on target emotional datasets. The experimental results in Tables II and IV have shown the effectiveness of the pre-trained AlexNet as well as our fine-tuned deep model.

It is a challenging problem to collect and annotate large numbers of utterances for emotion classification due to the difficulty of emotion annotation. At present, on existing small emotional speech datasets, fine-tuning pre-trained deep models is a good choice. As shown in our experiments, fine-tuning the AlexNet pre-trained on ImageNet works well on speech emotion recognition tasks. The reason why the AlexNet helps emotion recognition might be that we convert the audio signals into an image-like representation, together with the strong feature learning ability of the AlexNet, e.g., higher-level convolutions gradually deduce semantics from larger receptive fields. The extracted three channels of Mel-spectrograms are analogous to the RGB image representation. This representation makes it feasible to first generate meaningful low-level time-frequency features with low-level 2-D convolutions, and then deduce more discriminative features with higher levels of convolutions. Besides, the three channels of Mel-spectrograms may characterize emotions as certain shapes and structures, which can thus be effectively perceived by the AlexNet pre-trained on the image domain.

The proposed method is based on the AlexNet. Similar to the AlexNet for ImageNet large-scale classification, our method is capable of learning on million-scale training data with a commonly used GPU, e.g., NVIDIA TITAN X. It is thus also interesting to retrain deep models on larger emotional speech datasets than the used EMO-DB, eNTERFACE05, and BAUM-1s in our future work.

VII. CONCLUSIONS AND FUTURE WORK

This paper is motivated by how to employ DCNNs for automatic feature learning on speech emotion recognition tasks.
We present a new method combining DCNNs with DTPM for automatic affective feature learning. A DCNN is used to learn discriminative segment-level features from three channels of log Mel-spectrograms, similar to the RGB image representation. DTPM is designed to aggregate the learned segment-level features into a global utterance-level feature representation for emotion recognition. Extensive experiments on four datasets show that our method yields promising performance in comparison with the state of the art. In addition, we also find that, with our generated DCNN input, DCNN models pre-trained on the large-scale ImageNet data can be leveraged for speech affective feature extraction. This makes DCNN training with a limited amount of annotated speech data easier. The success of this work warrants further investigation of deep learning for speech emotion recognition.

Although this paper focuses on discrete emotion recognition, it is interesting to explore the effectiveness of deep features in continuous dimensional emotion recognition on datasets like SEMAINE [67], RECOLA [68] and JESTKOD [69]. Note that this work focuses on global utterance-level emotion classification and proposes the algorithm accordingly, i.e., it first uses DCNNs to extract segment-level features, then aggregates segment-level features with DTPM to form a global feature, and finally performs emotion classification with the linear SVM. Therefore, this algorithm is not yet capable of dealing with continuous dimensional emotion recognition. To tackle this problem, one possible way is to consider extra temporal cues and combine a CNN with an LSTM [17], which is commonly used to select and accumulate frame-level features for video categorization. This will be one of our future works. Moreover, there are many open issues that still need to be further studied to make emotion recognition work well in real-life settings. For example, as shown in Table V, it is more difficult for our model to recognize spontaneous emotions. It is also necessary to take personality into consideration, because different persons may have different ways of expressing emotions. Additionally, it is also interesting to apply our proposed method to affective analysis of music video [70].

REFERENCES

[1] R. Cowie et al., "Emotion recognition in human-computer interaction," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 32–80, Jan. 2001.
[2] S. Ramakrishnan and I. M. El Emary, "Speech emotion recognition approaches in human computer interaction," Telecommun. Syst., vol. 52, no. 3, pp. 1467–1478, 2013.
[3] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recogn., vol. 44, no. 3, pp. 572–587, 2011.
[4] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, "Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011," Artif. Intell. Rev., vol. 43, no. 2, pp. 155–177, 2015.
[5] F. Eyben et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Trans. Affect. Comput., vol. 7, no. 2, pp. 190–202, Apr.–Jun. 2016.
[6] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[9] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. 13th Eur. Conf. Comput. Vis., New York, NY, USA: Springer, 2014, pp. 346–361.
[11] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2015, pp. 5325–5334.
[12] A. Stuhlsatz et al., "Deep neural networks for acoustic emotion recognition: Raising the benchmarks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011, pp. 5688–5691.
[13] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Proc. Interspeech, 2014, pp. 223–227.
[14] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[15] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, "Speech emotion recognition using CNN," in Proc. ACM Int. Conf. Multimedia, New York, NY, USA, 2014, pp. 801–804.
[16] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, "Learning salient features for speech emotion recognition using convolutional neural networks," IEEE Trans. Multimedia, vol. 16, no. 8, pp. 2203–2213, Dec. 2014.
[17] G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in Proc. 41st IEEE Int. Conf. Acoust., Speech, Signal Process., Shanghai, China, 2016, pp. 5200–5204.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[19] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in Proc. Int. Joint Conf. Artif. Intell., Barcelona, Spain, 2011, vol. 22, pp. 1237–1242.
[20] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., Boston, MA, USA, 2015, pp. 1–9.
[21] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. Interspeech, 2005, vol. 5, pp. 1517–1520.
[22] Y. Wang and L. Guan, "Recognizing human emotional state from audiovisual signals," IEEE Trans. Multimedia, vol. 10, no. 5, pp. 936–946, Aug. 2008.
[23] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The eNTERFACE'05 audio-visual emotion database," in Proc. 22nd Int. Conf. Data Eng. Workshops, Atlanta, GA, USA, 2006, p. 8.
[24] S. Zhalehpour, O. Onder, Z. Akhtar, and C. E. Erdem, "BAUM-1: A spontaneous audio-visual face database of affective and mental states," IEEE Trans. Affect. Comput., vol. 8, no. 3, pp. 300–313, Jul.–Sep. 2016.
[25] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. 4th Int. Conf. Spoken Lang., 1996, vol. 3, pp. 1970–1973.
[26] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Comput. Appl., vol. 9, no. 4, pp. 290–296, 2000.
[27] D. Ververidis and C. Kotropoulos, "Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm," in Proc. IEEE Int. Conf. Multimedia Expo, Amsterdam, The Netherlands, 2005, pp. 1500–1503.
[28] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, no. 4, pp. 603–623, 2003.
[29] B. Schuller, G. Rigoll, and M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2004, vol. 1, pp. 577–580.
[30] X. Zhao, S. Zhang, and B. Lei, "Robust emotion recognition in noisy speech via sparse representation," Neural Comput. Appl., vol. 24, no. 7/8, pp. 1539–1553, 2014.
[31] X. Zhao and S. Zhang, "Spoken emotion recognition via locality-constrained kernel sparse representation," Neural Comput. Appl., vol. 26, no. 3, pp. 735–744, 2015.
[32] D. Morrison, R. Wang, and L. C. De Silva, "Ensemble methods for spoken emotion recognition in call-centres," Speech Commun., vol. 49, no. 2, pp. 98–112, 2007.
[33] E. M. Albornoz, D. H. Milone, and H. L. Rufiner, “Spoken emotion recognition using hierarchical classifiers,” Comput. Speech Lang., vol. 25, no. 3, pp. 556–570, 2011.
[34] I. Luengo, E. Navas, and I. Hernáez, “Feature analysis and evaluation for automatic emotion identification in speech,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 490–501, Oct. 2010.
[35] K. Wang, N. An, B. N. Li, Y. Zhang, and L. Li, “Speech emotion recognition using Fourier parameters,” IEEE Trans. Affect. Comput., vol. 6, no. 1, pp. 69–75, Jan.–Mar. 2015.
[36] Q. Jin, C. Li, S. Chen, and H. Wu, “Speech emotion recognition with acoustic and lexical features,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 4749–4753.
[37] B. Schuller, “Recognizing affect from linguistic information in 3D continuous space,” IEEE Trans. Affect. Comput., vol. 2, no. 4, pp. 192–205, Oct.–Dec. 2011.
[38] A. Tawari and M. M. Trivedi, “Speech emotion analysis: Exploring the role of context,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 502–509, Oct. 2010.
[39] M. A. Quiros-Ramirez and T. Onisawa, “Considering cross-cultural context in the automatic recognition of emotions,” Int. J. Mach. Learn. Cybern., vol. 6, no. 1, pp. 119–127, 2015.
[40] H. Cao, A. Savran, R. Verma, and A. Nenkova, “Acoustic and lexical representations for affect prediction in spontaneous conversations,” Comput. Speech Lang., vol. 29, no. 1, pp. 203–217, 2015.
[41] V. A. Petrushin, “Emotion recognition in speech signal: Experimental study, development, and application,” in Proc. 6th Int. Conf. Spoken Language Process., Beijing, China, 2000, pp. 222–225.
[42] R. Tato, R. Santos, R. Kompe, and J. M. Pardo, “Emotional space improves emotion recognition,” in Proc. Interspeech, 2002, pp. 2029–2032.
[43] M. Lugger and B. Yang, “The relevance of voice quality features in speaker independent emotion recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2007, vol. 4, pp. 17–20.
[44] S. Zhang, “Emotion recognition in Chinese natural speech by combining prosody and voice quality features,” in Proc. Adv. Neural Netw., 2008, pp. 457–464.
[45] A. I. Iliev, M. S. Scordilis, J. P. Papa, and A. X. Falcão, “Spoken emotion recognition through optimum-path forest classification using glottal features,” Comput. Speech Lang., vol. 24, no. 3, pp. 445–460, 2010.
[46] J. Sundberg, S. Patel, E. Björkner, and K. R. Scherer, “Interdependencies among voice source parameters in emotional speech,” IEEE Trans. Affect. Comput., vol. 2, no. 3, pp. 162–174, Jul.–Sep. 2011.
[47] S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech Commun., vol. 53, no. 5, pp. 768–785, 2011.
[48] Y. Sun, G. Wen, and J. Wang, “Weighted spectral features based on local Hu moments for speech emotion recognition,” Biomed. Signal Process. Control, vol. 18, pp. 80–90, 2015.
[49] E. M. Provost, “Identifying salient sub-utterance emotion dynamics using flexible units and estimates of affective flow,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2013, pp. 3682–3686.
[50] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, “LSTM-modeling of continuous emotions in an audiovisual affect recognition framework,” Image Vis. Comput., vol. 31, no. 2, pp. 153–163, 2013.
[51] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2006, vol. 2, pp. 2169–2178.
[52] J. Feng, B. Ni, Q. Tian, and S. Yan, “Geometric Lp-norm feature pooling for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2011, pp. 2609–2704.
[53] C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Machine Learning and Knowledge Discovery in Databases. New York, NY, USA: Springer, 2014, pp. 530–546.
[54] S. Yan et al., “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
[55] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[56] K. Fukunaga, Introduction to Statistical Pattern Recognition. Cambridge, MA, USA: Academic, 2013.
[57] M. T. Shami and M. S. Kamel, “Segment-based approach to the recognition of emotions in speech,” in Proc. IEEE Int. Conf. Multimedia Expo, Amsterdam, The Netherlands, 2005, pp. 4–7.
[58] B. W. Schuller and G. Rigoll, “Timing levels in segment-based speech emotion recognition,” in Proc. Interspeech, 2006, pp. 1818–1821.
[59] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” in Proc. 23rd ACM Int. Conf. Multimedia, 2015, pp. 689–692.
[60] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. no. 27.
[61] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, and A. Wendemuth, “Acoustic emotion recognition: A benchmark comparison of performances,” in Proc. IEEE Workshop Autom. Speech Recogn. Understanding, 2009, pp. 552–557.
[62] F. Eyben, Real-Time Speech and Music Classification by Large Audio Feature Space Extraction. New York, NY, USA: Springer, 2016.
[63] L. Gao, L. Qi, and L. Guan, “Information fusion based on kernel entropy component analysis in discriminative canonical correlation space with application to audio emotion recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Shanghai, China, 2016, pp. 2817–2821.
[64] N. E. D. Elmadany, Y. He, and L. Guan, “Multiview emotion recognition via multi-set locality preserving canonical correlation analysis,” in Proc. IEEE Int. Symp. Circuits Syst., Montreal, QC, Canada, 2016, pp. 590–593.
[65] M. Valstar et al., “AVEC 2013: The continuous audio/visual emotion and depression recognition challenge,” in Proc. 3rd ACM Int. Workshop Audio/Visual Emotion Challenge, Barcelona, Spain, 2013, pp. 3–10.
[66] B. Schuller et al., “The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” in Proc. Interspeech, Lyon, France, 2013, pp. 148–152.
[67] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schröder, “The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 5–17, Jan. 2012.
[68] F. Ringeval, A. Sonderegger, J. S. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in Proc. 10th IEEE Int. Conf. Workshops Autom. Face Gesture Recogn., Shanghai, China, 2013, pp. 1–8.
[69] E. Bozkurt et al., “JESTKOD database: Dyadic interaction analysis,” in Proc. 23rd Signal Process. Commun. Appl. Conf., 2015, pp. 1374–1377.
[70] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian, “Affective visualization and retrieval for music video,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 510–522, Oct. 2010.

Shiqing Zhang received the Ph.D. degree from the School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China, in 2012. He is currently a Postdoctoral Researcher with the School of Electronic Engineering and Computer Science, Peking University, Beijing, China, and also an Associate Professor with the Institute of Intelligent Information Processing, Taizhou University, Taizhou, China. His research interests include audio and image processing, affective computing, and pattern recognition.

Shiliang Zhang received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He was a Postdoctoral Scientist with NEC Labs America and a Postdoctoral Research Fellow at the University of Texas at San Antonio. He is currently a tenure-track Assistant Professor in the School of Electronic Engineering and Computer Science, Peking University, Beijing, China. His research interests include large-scale image retrieval and computer vision for autonomous driving. Dr. Zhang was awarded the National 1000 Youth Talents Plan of China, Outstanding Doctoral Dissertation Awards from both the Chinese Academy of Sciences and the China Computer Federation, the President Scholarship of the Chinese Academy of Sciences, the NEC Laboratories America Spot Recognition Award, and the Microsoft Research Fellowship. He has published more than 30 papers in journals and conferences, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, ACM Multimedia, and the International Conference on Computer Vision. He received the Top 10% Paper Award at IEEE MMSP 2011. His research is supported by the National 1000 Youth Talents Plan and the Natural Science Foundation of China (NSFC).
Tiejun Huang (M’01–SM’12) received the Bachelor’s and Master’s degrees in computer science from Wuhan University of Technology, Wuhan, China, in 1992 and 1995, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from Huazhong (Central China) University of Science and Technology, Wuhan, China, in 1998. He is currently a Professor in the School of Electronic Engineering and Computer Science, Peking University, Beijing, China, where he is also the Director of the Institute for Digital Media Technology. He has authored or coauthored more than 100 peer-reviewed papers and three books. His research interests include video coding, image understanding, digital rights management, and digital libraries. Prof. Huang is a member of the Board of Directors for the Digital Media Project, the Advisory Board of the IEEE Computing Society, and the Board of the Chinese Institute of Electronics.

Wen Gao (M’92–SM’05–F’09) received the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991. He is currently a Professor in the School of Electronic Engineering and Computer Science, Peking University, Beijing, China. Before joining Peking University, he was a Professor of computer science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He has authored five books and more than 600 technical articles in refereed journals and conference proceedings in the areas of image processing, video coding and communication, pattern recognition, multimedia information retrieval, multimodal interfaces, and bioinformatics. Dr. Gao serves on the editorial boards of several journals, such as the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, EURASIP Journal of Image Communications, and the Journal of Visual Communication and Image Representation. He chaired a number of prestigious international conferences on multimedia and video signal processing, such as IEEE ICME and ACM Multimedia, and also served on the advisory and technical committees of numerous professional organizations.