Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching
Fig. 1. An overview of the proposed speech emotion recognition framework using DCNNs and DTPM: (1) Three channels of log Mel-spectrograms (static, delta
and delta-delta) are extracted and divided into N overlapping segments as the DCNN input. (2) A DCNN model is employed for automatic feature learning on each
segment to generate segment-level features. (3) A DTPM scheme is designed to concatenate the learned segment-level features to form a global utterance-level
feature representation. (4) With utterance-level features, a linear SVM classifier is employed to predict utterance-level emotions.
…employ 1-D convolution, such as frequency convolution [14]–[16] or time convolution [17], rather than the 2-D convolution widely used in DCNN models [7], [10]. Additionally, these 1-layer or 2-layer CNNs are much shallower than the deep structures in DCNN models [7], [10]. Accordingly, they may not learn affective features discriminative enough to distinguish the subjective emotions.

It has recently been found that, with deep multi-level convolutional and pooling layers, DCNNs usually exhibit much better performance than shallow CNNs in computer vision [19], [20]. This is reasonable because the deep structures of DCNNs can effectively model the hierarchical architecture of information processing in the primate visual perception system [7], [10]. Motivated by the promising performance of deep models, this work aims to employ DCNNs to develop an effective speech emotion recognition system.

The success of DCNNs in visual tasks motivates us to test DCNNs in speech emotion recognition. To achieve this, three issues need to be addressed. First, a proper speech representation should be designed as the DCNN input. Previous works [14]–[17] have employed 1-D speech signals as the CNN inputs, and 1-D convolution is adopted for CNNs. Compared with 1-D convolution, 2-D convolution involves more parameters to capture more detailed temporal-frequency correlations, and thus potentially offers stronger feature learning ability. Therefore, it is important to convert 1-D speech signals into suitable 2-D representations as the DCNN input. Second, most existing emotional speech datasets [3]–[5] contain limited numbers of samples, which are not sufficient to train deep models with a large number of parameters. Finally, speech signals may have varying durations, whereas DCNN models require a fixed input size. It is hence easier to design DCNN models for speech segments of a fixed length than for the global utterance. Therefore, proper pooling strategies are needed to generate a global utterance-level feature representation from the segment-level features learned by DCNNs.

In this paper, we use deep features learned by DCNNs [7] and propose a Discriminant Temporal Pyramid Matching (DTPM) algorithm to pool deep features for speech emotion recognition. As illustrated in Fig. 1, three channels of log Mel-spectrograms (static, delta and delta-delta) are extracted as the DCNN input. The DCNN models are trained to produce deep features for each segment. The DTPM pools the learned segment-level features into a global utterance-level feature representation, followed by the linear SVM emotion classifier. Extensive experiments on four public datasets, i.e., the Berlin dataset of German emotional speech (EMO-DB) [21], the RML audio-visual dataset [22], the eNTERFACE05 audio-visual dataset [23], and the BAUM-1s dataset [24], demonstrate the promising performance of our proposed method.

The main contributions of this paper can be summarized as:
1) We propose to use three channels of log Mel-spectrograms generated from the original 1-D utterances as the DCNN input. This input is similar to the red, green, blue (RGB) image representation, and thus makes it possible to use existing DCNNs pre-trained on image datasets for affective feature extraction.
2) The proposed DTPM strategy combines temporal pyramid matching and optimal Lp-norm pooling to generate a discriminative utterance-level feature representation from segment-level features learned by DCNNs.
3) We find that a DCNN model pre-trained for image applications performs reasonably well in affective feature extraction. Further fine-tuning on the target speech emotion recognition tasks substantially improves the recognition performance.

The rest of this paper is structured as follows. The related works are reviewed in Section II. Section III describes our DCNN model for affective feature extraction. Section IV presents the details of our DTPM scheme. Section V describes and analyzes the experimental results. Section VI provides discussions, followed by the conclusions in Section VII.
II. RELATED WORK

Generally, feature extraction and emotion classification are two key steps in speech emotion recognition. In this section, we first briefly review emotion classifiers and then focus on feature extraction since it is more relevant to our work.

A. Emotion Classifier

For emotion classification, various machine learning algorithms have been utilized to build a good classifier that distinguishes the underlying emotion categories. Early emotion classifiers include K-Nearest-Neighbor (KNN) [25] and Artificial Neural Network (ANN) [26]. Later, a number of statistical pattern recognition approaches, such as the Gaussian Mixture Model (GMM) [27], Hidden Markov Models (HMM) [28], and SVM [29], were widely adopted for speech emotion recognition. Recently, some advanced classifiers based on sparse representation [30], [31] have also been studied. Nevertheless, each classifier has its own advantages and disadvantages. To integrate the merits of different classifiers, ensembles of multiple classifiers have been investigated for speech emotion recognition [32], [33].

B. Feature Extraction

Affective speech features widely used for emotion recognition can be roughly divided into four categories: 1) acoustic features [34], [35]; 2) language features, such as lexical information [36], [37]; 3) context information, such as subject, gender, and culture influences [38], [39]; and 4) hybrid features [36], [40], i.e., the integration of two or three of the above.

Acoustic features, one of the most popular types of affective features, mainly contain prosody features, voice quality features, and spectral features [34], [35]. Pitch, loudness, and duration are commonly used as prosody features [41], since they express the stress and intonation patterns of spoken language. Voice quality features, as the characteristic auditory colouring of an individual voice, have been shown to be discriminative in expressing positive or negative emotions [42]. The widely used voice quality features are the first three formants (F1, F2, F3), spectral energy distribution, harmonics-to-noise ratio, pitch irregularity (jitter), amplitude irregularity (shimmer), and so on. Combining prosody features and voice quality features shows better performance than using prosody features alone [43], [44]. In recent years, glottal features [45] and voice source parameters [46] have been used as more advanced voice quality features for speech emotion recognition. The third typical type of acoustic features is spectral features, computed from the short-term power spectrum of sound, such as Linear Prediction Cepstral Coefficients (LPCC), Log Frequency Power Coefficients (LFPC) and Mel-frequency Cepstral Coefficients (MFCC). Among them, MFCC is the most popular spectral feature, since it is able to model the human auditory perception system. In recent years, modulation spectral features [47] from an auditory-inspired long-term spectro-temporal representation, and weighted spectral features [48] based on local Hu moments, have also been studied. In addition, the newly-developed Geneva minimalistic acoustic parameter set (GeMAPS) [5], which covers frequency, energy, and spectral related features, has shown promising performance in speech emotion recognition.

Language features, which are computed based on the verbal contents of speech, are another important representation conveying emotion information. Note that language features are usually combined with acoustic features for speech emotion recognition [36], [37]. In [37], language features are extracted with bag-of-n-gram and character n-gram approaches. The linguistic features are then combined with acoustic features to predict dimensional emotions in a 3-D continuous space. In [36], by computing the weight of every word, a four-dimensional emotion lexicon for four emotion classes, i.e., anger, joy, sadness and neutral, is obtained. These feature representations are then integrated via early fusion and late fusion for speech emotion recognition.

Context information has also been investigated in the recent literature [38], [39] for emotion recognition. In [38], the authors present a context analysis of subject and text for speech emotion recognition, and find that gender-based context information enhances recognition performance. In [39], the influence of cultural information on speech emotion recognition is explored. The authors claim that intra-cultural and multi-cultural emotion recognition paradigms give better performance than cross-cultural recognition.

Note that, since the hand-designed features mentioned above are low-level, they may not be discriminative enough to identify the subjective emotions. To tackle this issue, it may be feasible to employ deep learning techniques to automatically learn high-level affective features for speech emotion recognition.

III. DCNNS FOR AFFECTIVE FEATURE EXTRACTION

To utilize DCNNs in speech emotion recognition, three problems should be addressed. First, the DCNN input should be properly computed from 1-D speech signals. Second, DCNN training requires a large amount of labeled data. Third, a feature pooling strategy is required to generate the global utterance-level feature representation from the DCNN outputs on local segments. In this section, we present the details of how the first two problems are addressed.

Fig. 2 illustrates the framework for affective feature extraction. From the original 1-D utterance, we first extract the static 2-D log Mel-spectrogram and then reorganize it into three channels of log Mel-spectrograms (static, delta and delta-delta). For data augmentation, the log Mel-spectrogram extracted from an utterance is divided into a certain number of overlapping segments as the DCNN input. More details about data augmentation can be found in Section V-B. Then the AlexNet DCNN model [7] pre-trained on the large-scale ImageNet dataset is fine-tuned for affective feature extraction. We present more details of these two steps in the following two sections.

Fig. 2. The flowchart of our DCNN model for affective feature extraction. Three channels of log Mel-spectrograms with size 64 × 64 × 3 (static, delta and delta-delta) are first produced, and then resized to 227 × 227 × 3 as the DCNN input. The DCNN model is first initialized with the AlexNet [7], then fine-tuned on target emotional datasets. The 4096-D FC7 output is finally used as the segment-level affective features.

A. Generation of DCNN Input

Because of the limited training data for speech emotion recognition, it is not possible to directly train a robust deep model.
Motivated by the promising performance of available DCNN models, we propose to first initialize deep models with an available DCNN model like AlexNet [7], and then fine-tune it for transfer learning on target emotional datasets. Because available DCNN models take 2-D or 3-D images as inputs, we transform the raw 1-D speech signal into a 3-D array as the DCNN input.

Abdel-Hamid et al. [14] adopt the extracted log Mel-spectrogram and organize it into a 2-D array as the CNN input, with a shallow 1-layer structure, for speech recognition. Specifically, for each frame with a context window of 15 frames and 40 Mel-filter banks, they construct 45 (i.e., 15 × 3) 1-D feature maps of 40 Mel bins each, i.e., a 40 × 45 array. Then, a 1-D convolutional kernel is applied along the frequency axis. However, speech emotion recognition using DCNNs is different from speech recognition in [14]. First, 1-D convolution along the frequency axis cannot capture the temporal information, which is important for emotion recognition. Second, the divided segments of 15 frames (about 165 ms) used for speech recognition are too short to distinguish emotions, since it has been found that only a speech segment longer than 250 ms presents sufficient information for identifying emotions [49], [50].

To address these two issues, from the raw 1-D speech signals we generate the following overlapping Mel-spectrogram segments (abbreviated as Mel_SS) as the DCNN input:

$$\text{Mel\_SS} \in \mathbb{R}^{F \times T \times C}, \qquad (1)$$

where F is the number of Mel-filter banks, T is the segment length corresponding to the number of frames in a context window, and C (C = 1, 2, 3) represents the number of channels of the Mel-spectrogram. Note that C = 1 denotes one channel of Mel-spectrogram, i.e., the original static spectrogram; C = 2 denotes the static and delta coefficients of the Mel-spectrogram; and C = 3 represents three channels of Mel-spectrograms including the static, delta and delta-delta coefficients.

As an example, described in Fig. 2, we extract Mel_SS with size 64 × 64 × 3 (F = 64, T = 64, C = 3) as the DCNN input. This kind of three-channel spectrogram is analogous to the RGB image representation of visual data. In detail, for an utterance we adopt 64 Mel-filter banks from 20 to 8000 Hz to obtain the whole log Mel-spectrogram, using a 25 ms Hamming window applied every 10 ms. Then, a context window of 64 frames is applied to the whole log Mel-spectrogram to extract static 2-D Mel-spectrogram segments with size 64 × 64. A frame shift of 30 frames is used to produce such overlapping segments of the Mel-spectrogram. Each segment hence includes a context window of 64 frames and its length is 10 ms × 63 + 25 ms = 655 ms. In this case, the segment length is about 2.6 times longer than the suggested length of 250 ms in [49], [50], and conveys sufficient clues for emotion recognition. Note that we set F to 64 because the input height-width ratio of our DCNN model is 1:1. Besides, F is usually set to relatively large values when CNNs are used; for example, F is set to 40 in speech recognition [14] and 60 in speech emotion recognition [16]. Therefore, it is reasonable to set F to 64 in this work.

In speech recognition, the first and second temporal derivatives of the extracted acoustic features, such as MFCC, are widely used as additional features. Similarly, after extracting the static 2-D Mel-spectrogram, we also calculate the first-order and second-order regression coefficients along the time axis as the delta and delta-delta coefficients of the Mel-spectrogram. In this way, we organize the 1-D speech signals into three channels of Mel-spectrogram segments, i.e., Mel_SS with size 64 × 64 × 3 (three channels: static, delta and delta-delta), as the DCNN input. Then, 2-D convolution along the frequency axis and time axis can be performed for DCNN training on this input.

When using the AlexNet DCNN model [7] for affective feature extraction, we have to resize the 64 × 64 × 3 spectrogram to 227 × 227 × 3, which is the input size of AlexNet. Since the extracted three channels of Mel-spectrograms can be regarded as an RGB image representation, we perform the resize operation with bilinear interpolation, which is commonly used for image resizing. Note that the number of channels of the Mel-spectrogram C and the segment length T may have an important impact on the learned deep features. Therefore, we will investigate their effects on the recognition accuracy in experiments.
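To make the input generation concrete, the following is a minimal Python sketch of the Mel_SS construction described above. It is an illustrative approximation, not the authors' implementation: the librosa/SciPy calls, the parameter names, and the bilinear-style resize via scipy.ndimage.zoom are our own assumptions.

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def mel_ss_segments(wav_path, sr=16000, n_mels=64, context=64, shift=30, out_size=227):
    """Build 3-channel log Mel-spectrogram segments (static, delta, delta-delta)."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms Hamming window applied every 10 ms, 64 Mel-filter banks from 20 Hz to 8 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        win_length=int(0.025 * sr), window="hamming",
        n_mels=n_mels, fmin=20, fmax=8000)
    log_mel = librosa.power_to_db(mel)                 # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)   # first-order regression (delta)
    delta2 = librosa.feature.delta(log_mel, order=2)   # second-order regression (delta-delta)
    segments = []
    for start in range(0, log_mel.shape[1] - context + 1, shift):
        seg = np.stack([c[:, start:start + context]
                        for c in (log_mel, delta1, delta2)], axis=-1)   # 64 x 64 x 3
        # resize to the AlexNet input size 227 x 227 x 3 (order=1 is (bi)linear)
        scale = (out_size / seg.shape[0], out_size / seg.shape[1], 1)
        segments.append(zoom(seg, scale, order=1))
    return np.asarray(segments)                        # shape: (N, 227, 227, 3)
```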
B. DCNN Architecture

As shown in Fig. 2, our DCNN model includes five convolutional layers, three of which are followed by max-pooling layers, and two fully-connected layers. The last fully-connected layer consists of 4096 units, giving a 4096-D feature representation. This structure is identical to that of AlexNet [7], which is trained on the large-scale ImageNet dataset. The initial parameters of this DCNN model can thus be copied from the AlexNet, making this DCNN model easier to train on speech emotion recognition tasks. In the following, we introduce the computations and principles of the convolutional layer, pooling layer and fully-connected layer, respectively.

Convolutional layer: A convolutional layer employs a set of convolutional filters to extract multiple local patterns at each local region in the input space, and produces many feature maps. This can be denoted as

$$(h_k)_{ij} = (W_k \otimes q)_{ij} + b_k, \qquad (2)$$

where $(h_k)_{ij}$ denotes the (i, j) element of the k-th output feature map, q represents the input feature maps, and $W_k$ and $b_k$ denote the k-th filter and bias, respectively. The symbol ⊗ represents the 2-D spatial convolution operation.

Pooling layer: After each convolutional layer, a pooling layer may be used. The pooling layer down-samples the feature maps obtained from the previous convolutional layer and produces a single output from local regions of the convolution feature maps. Two widely used pooling operators are max-pooling and average-pooling. A max-pooling or average-pooling layer produces a lower-resolution version of the convolution layer activations by taking the maximum or average filter activation from different positions within a specified window.

Fully-connected layer: This layer integrates the outputs from previous layers to yield the final feature representations for classification or regression. The activation function is a sigmoid or tanh function. The output of a fully-connected layer is computed by

$$x_k = \sum_l W_{kl}\, q_l + b_k, \qquad (3)$$

where $x_k$ denotes the k-th output neuron, $q_l$ denotes the l-th input neuron, $W_{kl}$ represents the weight connecting $q_l$ with $x_k$, and $b_k$ denotes the bias term of $x_k$.

Since fully-connected layers can be taken as convolutional layers with a kernel size of 1 × 1, (3) can be reformulated as

$$(x_k)_{1,1} = (W_k \otimes q)_{1,1} + b_k. \qquad (4)$$

For DCNN training, Stochastic Gradient Descent (SGD) is commonly employed with parameters like the mini-batch size, the momentum value (e.g., 0.9), and the weight decay value (e.g., 0.0005). In this case, the weight w is updated by

$$v_{i+1} = 0.9 \cdot v_i - 0.0005 \cdot \eta \cdot w_i - \eta \cdot \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1}, \qquad (5)$$

where v denotes the momentum variable, η is the learning rate, i is the iteration index, and $\langle \frac{\partial L}{\partial w}|_{w_i}\rangle_{D_i}$ is the mean of the derivatives over the i-th batch $D_i$. The network can hence be updated by back-propagation. More details of DCNN training can be found in [7].

In our DCNN training, we first initialize the network with the parameters of the AlexNet, and then fine-tune the network on the emotion classification task, which uses the Mel_SS with size 227 × 227 × 3 as input and multiple emotion classes as output. Note that the number of classes used in the AlexNet model is 1000, but in our emotion classification tasks the number of emotion categories is 6 or 7. Therefore, our DCNN model differs from the AlexNet in the last two layers, where our model predicts 6 or 7 emotion categories.

After fine-tuning the AlexNet model, we take the output of its FC7 layer as the segment-level affective features x. Given N overlapping Mel-spectrogram segments as the inputs of the DCNN model, we obtain a segment-level feature representation $X = (x_1, x_2, \cdots, x_N) \in \mathbb{R}^{d \times N}$ with feature dimensionality d = 4096. This representation X is then used as the input of the following DTPM algorithm to produce the global utterance-level features for emotion classification.

IV. DTPM FOR UTTERANCE-LEVEL FEATURE REPRESENTATION

Because speech utterances have varying durations, the above-mentioned segment-level features X have a varying number of segments. This unfixed dimensionality makes such segment-level features not directly usable for emotion recognition. Therefore, we proceed to convert the segment-level features into an utterance-level feature representation with fixed dimensionality. This process, also called feature pooling, is widely used in computer vision to convert local features into global features for image classification and retrieval.

There are two widely-used pooling strategies, i.e., average-pooling and max-pooling, which compute the averaged values and the max values on each dimension, respectively. Note that different pooling strategies are suited to different types of features, e.g., max-pooling is suited to sparse features. It is difficult to decide which pooling strategy is optimal for our segment-level affective features. Moreover, most pooling strategies discard the temporal clues of speech signals, which might be important for distinguishing emotions.

Our DTPM is motivated to simultaneously embed the temporal clues and find the optimal pooling strategy. It is partially inspired by Spatial Pyramid Matching (SPM) [51], which embeds the spatial clues during feature pooling for image classification. In SPM, an image is first divided into regions at different scales, then feature pooling is conducted on each region. The final feature is hence the concatenation of the pooled features at each scale. Similarly, we also divide the segment-level features X into non-overlapping sub-blocks along the time axis at different scales, then conduct feature pooling on each sub-block. The final concatenated feature thus integrates the temporal clues at different scales. The details will be presented in Section IV-A.
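The segment-level features x pooled by DTPM are simply the FC7 activations of the fine-tuned network described in Section III-B. The paper's implementation uses MatConvNet in MATLAB; the following PyTorch sketch is only a rough illustrative equivalent (the swapped-in classifier head, the helper name fc7_features, and the torchvision calls are our own assumptions).

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7   # 6 or 7 emotion categories, depending on the dataset

# Start from AlexNet pre-trained on ImageNet and replace its 1000-way
# classifier with an emotion classifier. ("pretrained=True" assumes an older
# torchvision API; newer versions use weights=models.AlexNet_Weights.IMAGENET1K_V1.)
model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# ... fine-tune `model` on the 227x227x3 Mel_SS segments here
#     (the paper uses SGD with momentum 0.9 and learning rate 0.001) ...

def fc7_features(model, batch):
    """Return the 4096-D FC7 activations for a batch of segments
    (a tensor of shape [N, 3, 227, 227])."""
    model.eval()
    with torch.no_grad():
        h = model.features(batch)        # convolutional layers
        h = model.avgpool(h)
        h = torch.flatten(h, 1)
        h = model.classifier[:6](h)      # stop before the final emotion layer
    return h                             # shape: [N, 4096]
```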
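Before turning to how the pooling parameter is learned, the temporal pyramid pooling idea just described can be written as the following minimal numpy sketch. It assumes, in the spirit of SPM, that level l splits the segments into 2^(l-1) equal sub-blocks and that the FC7 features are non-negative; the exact partition used in the paper may differ.

```python
import numpy as np

def lp_pool(X, p):
    """Lp-norm pooling of a d x n block of segment-level features into a
    d-dimensional vector: ((1/n) * sum_i x_i**p) ** (1/p). p = 1 recovers
    average pooling and a large p approaches max pooling. Assumes
    non-negative features (e.g., FC7 activations after ReLU)."""
    return np.mean(X ** p, axis=1) ** (1.0 / p)

def dtpm_pool(X, p, L=2):
    """Temporal pyramid pooling of X (d x N segment-level features): level l
    splits the N segments into 2**(l-1) non-overlapping sub-blocks along the
    time axis, each sub-block is Lp-pooled, and all pooled vectors are
    concatenated into one fixed-length utterance-level feature.
    Assumes N >= 2**(L-1)."""
    n = X.shape[1]
    pooled = []
    for level in range(1, L + 1):
        for idx in np.array_split(np.arange(n), 2 ** (level - 1)):
            pooled.append(lp_pool(X[:, idx], p))
    return np.concatenate(pooled)
```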
To acquire the optimal pooling strategy, we formulate the feature pooling on each sub-block $X_m$ as Lp-norm pooling, which computes a d-dimension feature representation $f_p(X_m)$, i.e.,

$$f_p(X_m) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\,p}\right)^{\frac{1}{p}},$$

and seek the optimal pooling parameter p jointly with a discriminant vector α that maximizes the class separability of the pooled features, i.e.,

$$\alpha^*, p^* = \arg\max_{\alpha, p}\ \Omega(\alpha, p) := \frac{\alpha^T S_b(p)\,\alpha}{\alpha^T S_\omega(p)\,\alpha}, \qquad (11)$$

where $S_b(p)$ represents the inter-class separability and $S_\omega(p)$ represents the intra-class separability. They are computed by

$$S_b(p) = \sum_i \sum_{j \in N_k^-(i)} (v_i^p - v_j^p)(v_i^p - v_j^p)^T, \qquad S_\omega(p) = \sum_i \sum_{j \in N_k^+(i)} (v_i^p - v_j^p)(v_i^p - v_j^p)^T, \qquad (12)$$

where $N_k^-(i)$ denotes the index set of the k nearest neighbors of the pooled feature $v_i^p$ from different classes, and $N_k^+(i)$ denotes the k nearest neighbors of $v_i^p$ from the same class.

Eq. (11) can be solved by optimizing α and p alternately. When p is fixed, the objective function becomes the classical Linear Discriminant Analysis (LDA) [55], [56] problem. In this case, $S_b(p)$ and $S_\omega(p)$ represent the between-class scatter matrix and the within-class scatter matrix, respectively. Therefore, the optimal solution $\alpha^*$ can be obtained in closed form for a fixed p:

$$\alpha^* = \arg\max_\alpha \lambda, \qquad \text{s.t.}\ \ S_b\,\alpha = \lambda\, S_\omega\,\alpha. \qquad (13)$$

The optimal solution $\alpha^*$ is the eigenvector corresponding to the largest eigenvalue $\lambda_{\max}$.

When α is fixed, the optimization problem in (11) has no closed-form solution. Nevertheless, it can be solved with an iterative gradient procedure. Specifically, with a fixed α, we can get

$$\tilde{S}_b(p) = \alpha^T S_b(p)\,\alpha = \sum_i \sum_{j \in N_k^-(i)} (u_i^p - u_j^p)^2, \qquad \tilde{S}_\omega(p) = \alpha^T S_\omega(p)\,\alpha = \sum_i \sum_{j \in N_k^+(i)} (u_i^p - u_j^p)^2. \qquad (14)$$

The partial derivatives of $\tilde{S}_b(p)$ and $\tilde{S}_\omega(p)$ with respect to p are then computed by

$$\frac{\partial \tilde{S}_b}{\partial p} = 2 \sum_i \sum_{j \in N_k^-(i)} (u_i^p - u_j^p)\,\alpha^T(\beta_i - \beta_j), \qquad \frac{\partial \tilde{S}_\omega}{\partial p} = 2 \sum_i \sum_{j \in N_k^+(i)} (u_i^p - u_j^p)\,\alpha^T(\beta_i - \beta_j), \qquad (15)$$

where β denotes the Hadamard product $\beta = v^p \circ \ln v$. Then we can get the partial derivative of (11) with respect to p:

$$\nabla p = \frac{\partial}{\partial p}\Omega(\alpha, p) = \frac{1}{\tilde{S}_\omega^{\,2}}\left(\frac{\partial \tilde{S}_b}{\partial p}\,\tilde{S}_\omega - \frac{\partial \tilde{S}_\omega}{\partial p}\,\tilde{S}_b\right). \qquad (16)$$

The value of p can then be updated along the gradient direction with a step size γ, i.e.,

$$p^{(t+1)} = p^{(t)} + \gamma \cdot \nabla p, \qquad (17)$$

where the superscript t denotes the t-th iteration. In our implementation, the iteration stops if the number of iterations exceeds the permitted number $N_{iter}$. After acquiring the final feature representation $u_p(X)$, we use it for emotion classification with classifiers such as SVM.

Our training strategy divides the utterances into segments. This enlarges the training set for DCNNs, but may make emotion recognition on each segment more difficult if the segment is too short. We have carefully set the length of each segment to 655 ms, which is about 2.6 times longer than the 250 ms suggested for emotion recognition in [49], [50]. Therefore, each segment should preserve sufficient clues for emotion recognition. To conduct utterance-level emotion recognition, we generate utterance-level features with the DTPM, which aggregates segment-level features at different scales with Lp-norm pooling. DTPM is inspired by the Spatial Pyramid Matching (SPM) [51] commonly used in image classification. SPM aggregates low-level features from image patches to form a global feature discriminative to high-level semantics. Similar to SPM, DTPM is capable of learning a discriminative utterance-level feature from local segment-level features. In the following section, we test the validity of this training strategy.

V. EXPERIMENTS

A. Datasets

We test the proposed method on four public datasets, including the Berlin dataset of German emotional speech (EMO-DB) [21], the RML audio-visual dataset [22], the eNTERFACE05 audio-visual dataset [23], and the BAUM-1s audio-visual dataset [24].

EMO-DB: The acted EMO-DB speech corpus [21] contains 535 emotional utterances with seven different acted emotions: anger, joy, sadness, neutral, boredom, disgust and fear. Ten professional native German-speaking actors (five female and five male) are asked to simulate these emotions, giving 10 German utterances (five short and five long sentences) that can be used in everyday communication. These actors are required to read the predefined sentences in the targeted seven emotions. The recordings in this dataset are taken in an anechoic chamber with high-quality recording equipment and produced at a sampling rate of 16 kHz with 16-bit resolution and a mono channel. The audio files are on average around 3 seconds long. A human perception test with 20 additional subjects is conducted to evaluate the quality of the recorded data.

RML: The acted RML audio-visual dataset [22], collected at the Ryerson Multimedia Research Lab, Ryerson University, contains 720 utterances of eight subjects of different genders and cultures, in six different spoken languages. It consists of six emotions: anger, disgust, fear, joy, sadness, and surprise. The samples were recorded at a sampling rate of 44,100 Hz with 16-bit resolution and a mono channel. The audio files are on average around 5 seconds long. To ensure the context independency of the speech samples, more than ten reference sentences for each emotion are provided. At least two participants who do not know the corresponding language are employed in a human perception test to evaluate whether the correct emotion is expressed.
eNTERFACE05: The eNTERFACE05 dataset [23] is an induced audio-visual emotion dataset with six basic emotions, i.e., anger, disgust, fear, joy, sadness, and surprise. 42 subjects from 14 different nationalities are included. Each subject is asked to listen to six successive short stories, each of which is used to induce a particular emotion. Two experts are employed to evaluate whether the reaction expresses the intended emotion in an unambiguous way. The speech utterances are extracted from video files of the subjects speaking in English. The audio sampling rate is 48 kHz. The audio files are on average around 3 seconds long. Overall, the eNTERFACE05 dataset contains 1290 utterances.

BAUM-1s: The spontaneous BAUM-1s audio-visual dataset [24] contains eight emotions (joy, anger, sadness, disgust, fear, surprise, boredom and contempt) and four mental states (unsure, thinking, concentrating and bothered). It has 1222 utterances collected from 31 Turkish subjects, 17 of whom are female. Emotion elicitation using video clips is employed to obtain spontaneous audio-visual expressions. Each utterance is given an emotion label by majority voting over five annotators. The audio files have a sampling rate of 48 kHz, and the average duration is around 3 seconds. As done in [22], [23], this work aims to identify six basic emotions (joy, anger, sadness, disgust, fear, surprise), giving 521 utterances in total for experiments. Note that BAUM-1s is a recent audio-visual emotional dataset released in 2016. Moreover, BAUM-1s records spontaneous rather than acted emotions, and thus defines a more challenging emotion recognition problem than acted datasets like EMO-DB and eNTERFACE05. Therefore, BAUM-1s is a reasonable and challenging test set.

B. Experimental Setup

1) Details of DCNN Training: Each of the four emotional datasets contains a limited number of samples. It is thus desirable to generate more samples for DCNN training. To address this issue, we directly split an utterance into a certain number of overlapping segments. Each segment is labeled with the utterance's emotion category for DCNN training. In this case, the number of training samples is decided by the overlap length (frame shift) between two adjacent segments, i.e., a smaller overlap results in a larger number of training samples. However, as suggested in [50], the overlap length should be larger than 250 ms in speech emotion recognition. Therefore, we set the overlap length to 30 frames, which is about 10 ms × 29 + 25 ms = 315 ms. As a result, when extracting Mel-spectrogram segments with size 64 × 64 × 3, we can significantly augment the size of the training data, i.e., from 535 utterances to 11,842 segments for the EMO-DB dataset, from 720 utterances to 11,316 segments for the RML dataset, from 1290 utterances to 16,186 segments for the eNTERFACE05 dataset, and from 521 utterances to 6368 segments for the BAUM-1s dataset, respectively.

Note that segmenting an utterance into small segments has been widely used for discrete emotion classification, as in [13], [57], [58]. Although it is not necessarily true that the emotion labels of all segments divided from an utterance are equivalent to that of the whole utterance, we can still employ DCNNs to learn effective segment-level features from the segment-level emotions, which can be utilized to predict utterance-level emotions.

The structure of the used DCNN model [7] is presented in Fig. 2. The DCNN model is trained with a mini-batch size of 30, Stochastic Gradient Descent (SGD) with a momentum of 0.9, and a learning rate of 0.001. The maximum number of epochs is set to 300. We run the DCNNs on the MATLAB2014 platform with the MatConvNet package [59], which is a MATLAB toolbox implementing CNNs for computer vision applications. One NVIDIA GTX TITAN X GPU with 12 GB memory is used to train the DCNNs in GPU mode. We employ the LIBSVM package [60] with the linear kernel function and the one-versus-one strategy for multi-class classification. When implementing optimal Lp-norm pooling, we set the number of permitted iterations $N_{iter} = 50$ and the number of nearest neighbors k = 20, as done in [52].

It is noted that the used DCNN model, AlexNet, was first reported in [7] with an input size of 224 × 224 × 3. However, in many practical implementations such as imagenet-caffe-alex, available at http://www.vlfeat.org/matconvnet/pretrained/, researchers commonly use an input size of 227 × 227 × 3 rather than 224 × 224 × 3.

2) Evaluation Methods: As suggested in [61], test-runs are implemented using a speaker-independent Leave-One-Speaker-Out (LOSO) or Leave-One-Speakers-Group-Out (LOSGO) cross-validation strategy, which is usually adopted in most real applications. Specifically, for the EMO-DB and RML datasets, we employ the LOSO scheme. For the eNTERFACE05 and BAUM-1s datasets, we use the LOSGO scheme with five speaker groups, similar to [24]. Note that we adopt speaker-independent test-runs, which are more realistic and challenging than speaker-dependent test-runs. Therefore, we only compare with works using the same setting and do not compare with works like [58] that report speaker-dependent results. The Weighted Average Recall (WAR), also known as the standard accuracy, is reported to evaluate the performance of speech emotion recognition. Here, WAR denotes the recognition rates of the individual classes weighted by the class distribution.

We evaluate the performance of two methods, i.e., DCNN-Average and DCNN-DTPM. The details of these two methods are described below.

DCNN-Average also uses DCNNs as the feature extractor. After extracting features on each Mel-spectrogram segment with DCNNs, conventional average-pooling is employed over all the segments to produce the final fixed-length global utterance-level features. Then the linear SVM classifier is adopted for emotion identification. We compare our method to DCNN-Average to show the validity of the proposed DTPM.

DCNN-DTPM is our proposed method described in Fig. 3.

C. Experimental Results and Analysis

We use Mel-spectrogram segments with size $\text{Mel\_SS} \in \mathbb{R}^{F \times T \times C}$ as the DCNN input, where F is the number of Mel-filter banks, commonly set to 64, T is the number of frames in each segment, and C represents the number of channels of the Mel-spectrogram. The parameters C and T largely affect the learned deep features.
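As a concrete illustration of the speaker-independent protocol in Section V-B2, the sketch below runs LOSO cross-validation with a linear SVM on utterance-level features and returns WAR (the overall accuracy). It is an assumption-laden stand-in rather than the authors' setup: the paper uses LIBSVM with a one-versus-one scheme, whereas scikit-learn's LinearSVC shown here defaults to one-versus-rest.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC

def loso_war(features, labels, speakers):
    """Speaker-independent Leave-One-Speaker-Out evaluation of utterance-level
    features with a linear SVM. `speakers` holds the speaker identity of
    every utterance; the returned WAR equals the overall accuracy."""
    correct, total = 0, 0
    for tr, te in LeaveOneGroupOut().split(features, labels, groups=speakers):
        clf = LinearSVC(C=1.0, max_iter=10000).fit(features[tr], labels[tr])
        pred = clf.predict(features[te])
        correct += int(np.sum(pred == labels[te]))
        total += len(te)
    return correct / total
```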
TABLE I
SPEAKER-INDEPENDENT ACCURACY (%) COMPARISONS BY SETTING DIFFERENT VALUES OF C USING A SIMPLIFIED DCNN MODEL

Dataset          EMO-DB                      RML                         eNTERFACE05                 BAUM-1s
C                1        2        3         1        2        3         1        2        3         1        2        3
DCNN-Average     73.86    77.44    78.92     59.25    61.84    61.19     62.33    65.01    66.42     36.49    38.21    38.62
DCNN-DTPM (L*)   77.03(3) 82.69(2) 83.53(3)  61.48(2) 64.88(3) 64.21(3)  65.95(1) 69.88(2) 70.25(2)  38.74(2) 39.05(3) 40.57(2)

The size of the spectrogram is 64 × 64 × C. L* denotes the value of L corresponding to the best performance of DCNN-DTPM.
Fig. 4. The effects of T on the EMO-DB dataset.
Fig. 5. The effects of T on the RML dataset.
Fig. 6. The effects of T on the eNTERFACE05 dataset.
Fig. 7. The effects of T on the BAUM-1s dataset.
…AlexNet, then fine-tune on the target emotional speech datasets. The reason why the AlexNet helps emotion recognition might be that we convert the audio signals into an image-like representation, together with the deep structure and huge training data of the AlexNet.

3) Effects of the Segment Length: The segment length T decides the duration of the audio signal the DCNN model processes. It hence may largely affect the discriminative power of the extracted affective features. We thus show the effects of T on the emotion recognition performance.

The shortest utterance is 1.23 seconds long on the EMO-DB dataset and 1.12 seconds long on the eNTERFACE05 dataset. Accordingly, for the EMO-DB and eNTERFACE05 datasets, we test T in the range [15, 30, 45, 64, 80, 100, 120], where T = 120 corresponds to about 1.22 seconds, which is close to the length of the shortest utterance. The shortest utterance is 3.27 seconds long on the RML dataset, so we test T in the range [15, 30, 45, 64, 80, 100, 120, 140, · · · , 320] on the RML dataset. On the BAUM-1s dataset, we test T in the range [15, 30, 45, 64, 80], since the shortest utterance is 0.768 seconds long. For utterances shorter than T, we simply repeat the first frame and the last frame of the utterance so that its length equals T. Note that for T = 15, a benchmark used in speech recognition, the overlap length of the Mel-spectrogram segments is 15 frames, whereas for T ≥ 30 the overlap length is 30 frames. All spectrograms with different T are resized to 227 × 227 × 3 with bilinear interpolation as the input of the DCNN. Figs. 4, 5, 6, and 7 show the effects of T on the four datasets. Table IV presents the best performance and the optimal T on the four datasets.

TABLE IV
THE BEST RECOGNITION ACCURACY (%) AND CORRESPONDING T USING THE FINE-TUNED ALEXNET ON FOUR DATASETS

Fine-tuning        EMO-DB     RML        eNTERFACE05   BAUM-1s
Segment length     T = 64     T = 220    T = 80        T = 64
DCNN-DTPM (L*)     87.31 (2)  75.34 (3)  79.25 (2)     44.61 (2)

L* denotes the value of L corresponding to the best performance.

From the experimental results, we can draw two conclusions. First, it can be observed that a larger T is helpful for better performance. However, too large a T does not consistently improve the performance. Table IV shows that the best performances on the four datasets are 87.31%, 75.34%, 79.25%, and 44.61%, respectively. The corresponding optimal T on the four datasets are 64, 220, 80, and 64, respectively. This may be because setting a larger T decreases the number of generated training samples for the DCNNs. Therefore, DCNN-DTPM does not always improve the performance as the segment length increases.

Second, the four curves show that the recognition performance of DCNN-DTPM remains stable when T is larger than 64. Setting T = 64 generally gives promising performance on the four datasets. This might be because the DTPM also considers the temporal clues, which makes the algorithm more robust to T. It is also interesting to observe that a segment length of 15 frames, i.e., T = 15, widely used for speech recognition [14], does not …
Fig. 8. Confusion matrix of DCNN-DTPM with an average accuracy of 87.31% on the EMO-DB dataset.
Fig. 10. Confusion matrix of DCNN-DTPM with an average accuracy of 79.25% on the eNTERFACE05 dataset.
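For reference, confusion matrices like those in Figs. 8 and 10 and the WAR metric used throughout the experiments can be computed from utterance-level predictions as in the following numpy sketch (the function and variable names are ours, and integer class labels 0..num_classes-1 are assumed):

```python
import numpy as np

def confusion_and_war(y_true, y_pred, num_classes):
    """Build a row-normalized confusion matrix and the Weighted Average Recall
    (per-class recall weighted by the class distribution), which equals the
    standard overall accuracy."""
    cm = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    class_counts = cm.sum(axis=1)
    recalls = np.diag(cm) / np.maximum(class_counts, 1)
    war = np.sum(recalls * class_counts / len(y_true))    # == overall accuracy
    cm_normalized = cm / np.maximum(class_counts[:, None], 1)
    return cm_normalized, war
```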
From Table V, we can see that our method is very competitive with the state-of-the-art results. Specifically, on the EMO-DB dataset our method performs best compared with [5], [12], [61], [62]. On the RML dataset, our method gives much better performance than [63], [64]. On the eNTERFACE05 dataset, our method clearly outperforms [12], [24], [61], and presents slightly lower performance than [62]. On the BAUM-1s dataset, our method also clearly outperforms [24], i.e., our 44.61% vs. 29.41% of [24] in terms of WAR. Therefore, although BAUM-1s is a relatively small dataset, it defines a challenging emotion recognition problem and also validates the advantages of the proposed algorithm. Note that in [61], the authors employ 6552 LLD acoustic features, such as prosody and MFCC, for emotion classification. This shows the advantage of our learned affective features using DCNNs. [12] also uses a DNN to learn discriminative features. Different from our work, [12] learns features from 6552 LLD acoustic features rather than from the raw speech signals or the spectrogram. This thus clearly shows the advantages of our DCNN model, i.e., using three channels of spectrograms as input and coding the raw DCNN features with DTPM to get the final feature representation. [62] reports its best performance by using the large AVEC-2013 feature set [65] on the EMO-DB dataset and the large ComParE feature set [66] on the eNTERFACE05 dataset.

Our experimental results show that our method achieves impressive recognition accuracies in comparison with state-of-the-art works. For example, we report a UAR accuracy of 86.30% on the EMO-DB dataset, which outperforms all the compared works, i.e., 79.1% by [12], 84.6% by [61], 86.0% by [5] and 86.1% by [62]. As far as we know, this is an early work using DCNNs pre-trained on the image domain for emotion recognition. The success of this work warrants further investigation in this direction. These distinctive characteristics distinguish our work from existing efforts on speech emotion recognition.

VI. DISCUSSIONS

The pyramid level L controls the number of levels in DTPM and thus may affect the recognition performance. In our experiments, we investigate the effects of L with values between 1 and 3. We do not use L ≥ 4, since the resulting feature dimensionality is too large. As shown in the above experimental results, L = 2 or L = 3 generally gives the best results. This indicates that dividing the Mel-spectrogram into multiple levels, i.e., L ≥ 2, helps to improve the performance. It can also be inferred that our algorithm is not very sensitive to L, and setting L = 2 or L = 3 is a reasonable option in most cases.

To verify the effectiveness of our Lp-norm pooling, we compare it with two commonly used pooling methods, i.e., average-pooling and max-pooling, in Table VI. This is conducted by modifying the value of p in DTPM, e.g., p = 1 corresponds to average-pooling, whereas p = ∞ corresponds to max-pooling. It can be seen from Table VI that our Lp-norm pooling performs better than the other two pooling methods. It can also be seen that it is hard to decide from experience which pooling strategy performs better for a specific task; e.g., max-pooling performs better than average-pooling on the RML and eNTERFACE05 datasets, but average-pooling performs better on the EMO-DB and BAUM-1s datasets. This shows the necessity of pooling strategy learning.

TABLE VI
RECOGNITION ACCURACY (%) COMPARISON OF THREE POOLING METHODS IN DTPM USING 64 × 64 × 3 MEL-SPECTROGRAM AND L = 2 ON FOUR DATASETS. p* DENOTES THE MEAN OPTIMAL VALUE OF p IN LOSO OR LOSGO TEST-RUNS

Feature pooling   EMO-DB        RML           eNTERFACE05   BAUM-1s
Average           83.28         60.73         71.08         41.94
Max               82.64         63.48         72.75         40.26
Ours (p*)         87.31 (1.12)  69.70 (1.50)  76.56 (1.58)  44.61 (0.21)

Since the Mel-spectrogram is a 2-D matrix, it is natural to utilize CNNs to learn emotion information from it. To this end, it is straightforward to train a deep model on the 64 × 64 spectrogram data. However, Tables I and IV indicate that directly using the 64 × 64 features to train a deep model obtains lower performance than our fine-tuned AlexNet. The reason might be the limited training data for speech emotion recognition. This motivates us to use the pre-trained AlexNet, which is already trained with millions of images and shows reasonably good performance in emotion feature extraction, as shown in Table II. Therefore, we initialize a deep model with the same structure and parameters as the AlexNet and fine-tune it on target emotional datasets. The experimental results in Tables II and IV have shown the effectiveness of the pre-trained AlexNet as well as our fine-tuned deep model.

It is a challenging problem to collect and annotate large numbers of utterances for emotion classification due to the difficulty of emotion annotation. At present, on existing small emotional speech datasets, fine-tuning pre-trained deep models is a good choice. As shown in our experiments, fine-tuning the AlexNet pre-trained on ImageNet works well on speech emotion recognition tasks. The reason why the AlexNet helps emotion recognition might be that we convert the audio signals into an image-like representation, together with the strong feature learning ability of the AlexNet, e.g., higher-level convolutions gradually deduce semantics from larger receptive fields. The extracted three channels of Mel-spectrograms are analogous to the RGB image representation. This representation makes it feasible to first generate meaningful low-level time-frequency features with low-level 2-D convolutions, and then deduce more discriminative features with higher levels of convolutions. Besides, the three channels of Mel-spectrograms may characterize emotions as certain shapes and structures, which can thus be effectively perceived by the AlexNet pre-trained on the image domain.

The proposed method is based on the AlexNet. Similar to the AlexNet for ImageNet large-scale classification, our method is capable of learning on million-scale training data with a commonly used GPU, e.g., NVIDIA TITAN X. It is thus also interesting to retrain deep models on larger emotional speech datasets than the used EMO-DB, eNTERFACE05, and BAUM-1s in our future work.

VII. CONCLUSIONS AND FUTURE WORK

This paper is motivated by how to employ DCNNs for automatic feature learning on speech emotion recognition tasks.
We present a new method combining DCNNs with DTPM for automatic affective feature learning. A DCNN is used to learn discriminative segment-level features from three channels of log Mel-spectrograms, similar to the RGB image representation. DTPM is designed to aggregate the learned segment-level features into a global utterance-level feature representation for emotion recognition. Extensive experiments on four datasets show that our method yields promising performance in comparison with the state of the art. In addition, we also find that, with our generated DCNN input, DCNN models pre-trained on the large-scale ImageNet data can be leveraged for speech affective feature extraction. This makes DCNN training with a limited amount of annotated speech data easier. The success of this work warrants further investigation of deep learning for speech emotion recognition.

Although this paper focuses on discrete emotion recognition, it is interesting to explore the effectiveness of deep features in continuous dimensional emotion recognition on datasets like SEMAINE [67], RECOLA [68] and JESTKOD [69]. Note that this work focuses on global utterance-level emotion classification and proposes the algorithm accordingly, i.e., it first uses DCNNs to extract segment-level features, then aggregates segment-level features with DTPM to form a global feature, and finally performs emotion classification with the linear SVM. Therefore, this algorithm is not yet capable of dealing with continuous dimensional emotion recognition. To tackle this problem, one possible way is to consider extra temporal cues and combine a CNN with an LSTM [17], which is commonly used to select and accumulate frame-level features for video categorization. This will be one of our future works. Moreover, there are many open issues that still need to be further studied to make emotion recognition work well in real-life settings. For example, as shown in Table V, it is more difficult for our model to recognize spontaneous emotions. It is also necessary to take personality into consideration, because different persons may have different ways of expressing emotions. Additionally, it is also interesting to apply our proposed method to affective analysis of music video [70].

REFERENCES

[1] R. Cowie et al., "Emotion recognition in human-computer interaction," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 32–80, Jan. 2001.
[2] S. Ramakrishnan and I. M. El Emary, "Speech emotion recognition approaches in human computer interaction," Telecommun. Syst., vol. 52, no. 3, pp. 1467–1478, 2013.
[3] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recogn., vol. 44, no. 3, pp. 572–587, 2011.
[4] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, "Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011," Artif. Intell. Rev., vol. 43, no. 2, pp. 155–177, 2015.
[5] F. Eyben et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Trans. Affect. Comput., vol. 7, no. 2, pp. 190–202, Apr.–Jun. 2016.
[6] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[9] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. 13th Eur. Conf. Comput. Vis., New York, NY, USA: Springer, 2014, pp. 346–361.
[11] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2015, pp. 5325–5334.
[12] A. Stuhlsatz et al., "Deep neural networks for acoustic emotion recognition: Raising the benchmarks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011, pp. 5688–5691.
[13] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Proc. Interspeech, 2014, pp. 223–227.
[14] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[15] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, "Speech emotion recognition using CNN," in Proc. ACM Int. Conf. Multimedia, New York, NY, USA, 2014, pp. 801–804.
[16] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, "Learning salient features for speech emotion recognition using convolutional neural networks," IEEE Trans. Multimedia, vol. 16, no. 8, pp. 2203–2213, Dec. 2014.
[17] G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in Proc. 41st IEEE Int. Conf. Acoust., Speech, Signal Process., Shanghai, China, 2016, pp. 5200–5204.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[19] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in Proc. Int. Joint Conf. Artif. Intell., Barcelona, Spain, 2011, vol. 22, pp. 1237–1242.
[20] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., Boston, MA, USA, 2015, pp. 1–9.
[21] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. Interspeech, 2005, vol. 5, pp. 1517–1520.
[22] Y. Wang and L. Guan, "Recognizing human emotional state from audiovisual signals," IEEE Trans. Multimedia, vol. 10, no. 5, pp. 936–946, Aug. 2008.
[23] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The eNTERFACE'05 audio-visual emotion database," in Proc. 22nd Int. Conf. Data Eng. Workshops, Atlanta, GA, USA, 2006, p. 8.
[24] S. Zhalehpour, O. Onder, Z. Akhtar, and C. E. Erdem, "BAUM-1: A spontaneous audio-visual face database of affective and mental states," IEEE Trans. Affect. Comput., vol. 8, no. 3, pp. 300–313, Jul.–Sep. 2016.
[25] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. 4th Int. Conf. Spoken Lang., 1996, vol. 3, pp. 1970–1973.
[26] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Comput. Appl., vol. 9, no. 4, pp. 290–296, 2000.
[27] D. Ververidis and C. Kotropoulos, "Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm," in Proc. IEEE Int. Conf. Multimedia Expo, Amsterdam, The Netherlands, 2005, pp. 1500–1503.
[28] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, no. 4, pp. 603–623, 2003.
[29] B. Schuller, G. Rigoll, and M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2004, vol. 1, pp. 577–580.
[30] X. Zhao, S. Zhang, and B. Lei, "Robust emotion recognition in noisy speech via sparse representation," Neural Comput. Appl., vol. 24, no. 7/8, pp. 1539–1553, 2014.
[31] X. Zhao and S. Zhang, "Spoken emotion recognition via locality-constrained kernel sparse representation," Neural Comput. Appl., vol. 26, no. 3, pp. 735–744, 2015.
[32] D. Morrison, R. Wang, and L. C. De Silva, "Ensemble methods for spoken emotion recognition in call-centres," Speech Commun., vol. 49, no. 2, pp. 98–112, 2007.
[33] E. M. Albornoz, D. H. Milone, and H. L. Rufiner, “Spoken emotion recognition using hierarchical classifiers,” Comput. Speech Lang., vol. 25, no. 3, pp. 556–570, 2011.
[34] I. Luengo, E. Navas, and I. Hernáez, “Feature analysis and evaluation for automatic emotion identification in speech,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 490–501, Oct. 2010.
[35] K. Wang, N. An, B. N. Li, Y. Zhang, and L. Li, “Speech emotion recognition using Fourier parameters,” IEEE Trans. Affect. Comput., vol. 6, no. 1, pp. 69–75, Jan.–Mar. 2015.
[36] Q. Jin, C. Li, S. Chen, and H. Wu, “Speech emotion recognition with acoustic and lexical features,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 4749–4753.
[37] B. Schuller, “Recognizing affect from linguistic information in 3D continuous space,” IEEE Trans. Affect. Comput., vol. 2, no. 4, pp. 192–205, Oct.–Dec. 2011.
[38] A. Tawari and M. M. Trivedi, “Speech emotion analysis: Exploring the role of context,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 502–509, Oct. 2010.
[39] M. A. Quiros-Ramirez and T. Onisawa, “Considering cross-cultural context in the automatic recognition of emotions,” Int. J. Mach. Learn. Cybern., vol. 6, no. 1, pp. 119–127, 2015.
[40] H. Cao, A. Savran, R. Verma, and A. Nenkova, “Acoustic and lexical representations for affect prediction in spontaneous conversations,” Comput. Speech Lang., vol. 29, no. 1, pp. 203–217, 2015.
[41] V. A. Petrushin, “Emotion recognition in speech signal: Experimental study, development, and application,” in Proc. 6th Int. Conf. Spoken Language Process., Beijing, China, 2000, pp. 222–225.
[42] R. Tato, R. Santos, R. Kompe, and J. M. Pardo, “Emotional space improves emotion recognition,” in Proc. Interspeech, 2002, pp. 2029–2032.
[43] M. Lugger and B. Yang, “The relevance of voice quality features in speaker independent emotion recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2007, vol. 4, pp. 17–20.
[44] S. Zhang, “Emotion recognition in Chinese natural speech by combining prosody and voice quality features,” in Proc. Adv. Neural Netw., 2008, pp. 457–464.
[45] A. I. Iliev, M. S. Scordilis, J. P. Papa, and A. X. Falcão, “Spoken emotion recognition through optimum-path forest classification using glottal features,” Comput. Speech Lang., vol. 24, no. 3, pp. 445–460, 2010.
[46] J. Sundberg, S. Patel, E. Björkner, and K. R. Scherer, “Interdependencies among voice source parameters in emotional speech,” IEEE Trans. Affect. Comput., vol. 2, no. 3, pp. 162–174, Jul.–Sep. 2011.
[47] S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech Commun., vol. 53, no. 5, pp. 768–785, 2011.
[48] Y. Sun, G. Wen, and J. Wang, “Weighted spectral features based on local Hu moments for speech emotion recognition,” Biomed. Signal Process. Control, vol. 18, pp. 80–90, 2015.
[49] E. M. Provost, “Identifying salient sub-utterance emotion dynamics using flexible units and estimates of affective flow,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2013, pp. 3682–3686.
[50] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, “LSTM-modeling of continuous emotions in an audiovisual affect recognition framework,” Image Vis. Comput., vol. 31, no. 2, pp. 153–163, 2013.
[51] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2006, vol. 2, pp. 2169–2178.
[52] J. Feng, B. Ni, Q. Tian, and S. Yan, “Geometric Lp-norm feature pooling for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2011, pp. 2609–2704.
[53] C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Machine Learning and Knowledge Discovery in Databases. New York, NY, USA: Springer, 2014, pp. 530–546.
[54] S. Yan et al., “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
[55] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[56] K. Fukunaga, Introduction to Statistical Pattern Recognition. Cambridge, MA, USA: Academic, 2013.
[57] M. T. Shami and M. S. Kamel, “Segment-based approach to the recognition of emotions in speech,” in Proc. IEEE Int. Conf. Multimedia Expo, Amsterdam, The Netherlands, 2005, pp. 4–7.
[58] B. W. Schuller and G. Rigoll, “Timing levels in segment-based speech emotion recognition,” in Proc. Interspeech, 2006, pp. 1818–1821.
[59] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” in Proc. 23rd ACM Int. Conf. Multimedia, 2015, pp. 689–692.
[60] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. no. 27.
[61] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, and A. Wendemuth, “Acoustic emotion recognition: A benchmark comparison of performances,” in Proc. IEEE Workshop Autom. Speech Recogn. Understanding, 2009, pp. 552–557.
[62] F. Eyben, Real-Time Speech and Music Classification by Large Audio Feature Space Extraction. New York, NY, USA: Springer, 2016.
[63] L. Gao, L. Qi, and L. Guan, “Information fusion based on kernel entropy component analysis in discriminative canonical correlation space with application to audio emotion recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Shanghai, China, 2016, pp. 2817–2821.
[64] N. E. D. Elmadany, Y. He, and L. Guan, “Multiview emotion recognition via multi-set locality preserving canonical correlation analysis,” in Proc. IEEE Int. Symp. Circuits Syst., Montreal, QC, Canada, 2016, pp. 590–593.
[65] M. Valstar et al., “AVEC 2013: The continuous audio/visual emotion and depression recognition challenge,” in Proc. 3rd ACM Int. Workshop Audio/Visual Emotion Challenge, Barcelona, Spain, 2013, pp. 3–10.
[66] B. Schuller et al., “The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” in Proc. Interspeech, Lyon, France, 2013, pp. 148–152.
[67] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schröder, “The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 5–17, Jan. 2012.
[68] F. Ringeval, A. Sonderegger, J. S. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in Proc. 10th IEEE Int. Conf. Workshops Autom. Face Gesture Recogn., Shanghai, China, 2013, pp. 1–8.
[69] E. Bozkurt et al., “JESTKOD database: Dyadic interaction analysis,” in Proc. 23rd Signal Process. Commun. Appl. Conf., 2015, pp. 1374–1377.
[70] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian, “Affective visualization and retrieval for music video,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 510–522, Oct. 2010.

Shiqing Zhang received the Ph.D. degree from the School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China, in 2012. He is currently a Postdoctoral Researcher with the School of Electronic Engineering and Computer Science, Peking University, Beijing, China, and also an Associate Professor with the Institute of Intelligent Information Processing, Taizhou University, Taizhou, China. His research interests include audio and image processing, affective computing, and pattern recognition.

Shiliang Zhang received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He was a Postdoctoral Scientist with NEC Labs America and a Postdoctoral Research Fellow at the University of Texas at San Antonio. He is currently a tenure-track Assistant Professor in the School of Electronic Engineering and Computer Science, Peking University, Beijing, China. His research interests include large-scale image retrieval and computer vision for autonomous driving. Dr. Zhang was awarded the National 1000 Youth Talents Plan of China, Outstanding Doctoral Dissertation Awards from both the Chinese Academy of Sciences and the China Computer Federation, the President Scholarship of the Chinese Academy of Sciences, the NEC Laboratories America Spot Recognition Award, and the Microsoft Research Fellowship. He has published more than 30 papers in journals and conferences, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, ACM Multimedia, and the International Conference on Computer Vision. He received the Top 10% Paper Award at IEEE MMSP 2011. His research is supported by the National 1000 Youth Talents Plan and the Natural Science Foundation of China (NSFC).
Tiejun Huang (M’01–SM’12) received the Bachelor’s and Master’s degrees in computer science from Wuhan University of Technology, Wuhan, China, in 1992 and 1995, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from Huazhong (Central China) University of Science and Technology, Wuhan, China, in 1998. He is currently a Professor in the School of Electronic Engineering and Computer Science, Peking University, Beijing, China, where he is also the Director of the Institute for Digital Media Technology. He has authored or coauthored more than 100 peer-reviewed papers and three books. His research interests include video coding, image understanding, digital rights management, and digital libraries. Prof. Huang is a member of the Board of Directors for the Digital Media Project, the Advisory Board of the IEEE Computing Society, and the Board of the Chinese Institute of Electronics.

Wen Gao (M’92–SM’05–F’09) received the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991. He is currently a Professor in the School of Electronic Engineering and Computer Science, Peking University, Beijing, China. Before joining Peking University, he was a Professor of computer science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He has authored five books and more than 600 technical articles in refereed journals and conference proceedings in the areas of image processing, video coding and communication, pattern recognition, multimedia information retrieval, multimodal interfaces, and bioinformatics. Dr. Gao serves on the editorial boards of several journals, such as the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, EURASIP Journal of Image Communications, and the Journal of Visual Communication and Image Representation. He chaired a number of prestigious international conferences on multimedia and video signal processing, such as IEEE ICME and ACM Multimedia, and also served on the advisory and technical committees of numerous professional organizations.