A Survey of Speech Emotion Recognition in Natural Environment

Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
doi:10.1016/j.patcog.2010.09.020

Article history: Received 4 February 2009; received in revised form 25 July 2010; accepted 1 September 2010.

Abstract
Recently, increasing attention has been directed to the study of the emotional content of speech signals, and hence, many systems have been proposed to identify the emotional content of a spoken utterance. This paper is a survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system. The first one is the choice of suitable features for speech representation. The second issue is the design of an appropriate classification scheme and the third issue is the proper preparation of an emotional speech database for evaluating system performance. Conclusions about the performance and limitations of current speech emotion recognition systems are discussed in the last section of this survey. This section also suggests possible ways of improving speech emotion recognition systems.

Keywords: Archetypal emotions; Speech emotion recognition; Statistical classifiers; Dimensionality reduction techniques; Emotional speech databases

© 2010 Elsevier Ltd. All rights reserved.
the automatic emotion recognizer will detect: the long-term emotion or the transient one. Emotion does not have a commonly agreed theoretical definition [62]. However, people know emotions when they feel them. For this reason, researchers were able to study and define different aspects of emotions. It is widely thought that emotion can be characterized in two dimensions: activation and valence [40]. Activation refers to the amount of energy required to express a certain emotion. According to some physiological studies made by Williams and Stevens [136] of the emotion production mechanism, it has been found that the sympathetic nervous system is aroused with the emotions of joy, anger, and fear. This induces an increased heart rate, higher blood pressure, changes in the depth of respiratory movements, greater sub-glottal pressure, dryness of the mouth, and occasional muscle tremor. The resulting speech is correspondingly loud, fast, and enunciated with strong high-frequency energy, a higher average pitch, and a wider pitch range. On the other hand, with the arousal of the parasympathetic nervous system, as with sadness, heart rate and blood pressure decrease and salivation increases, producing speech that is slow, low-pitched, and with little high-frequency energy. Thus, acoustic features such as the pitch, timing, voice quality, and articulation of the speech signal highly correlate with the underlying emotion [20]. However, emotions cannot be distinguished using only activation. For example, both the anger and the happiness emotions correspond to high activation, but they convey different affect. This difference is characterized by the valence dimension. Unfortunately, there is no agreement among researchers on how, or even if, acoustic features correlate with this dimension [79]. Therefore, while classification between high-activation (also called high-arousal) emotions and low-activation emotions can be achieved at high accuracies, classification between different emotions is still challenging.

An important issue in speech emotion recognition is the need to determine a set of the important emotions to be classified by an automatic emotion recognizer. Linguists have defined inventories of the emotional states most encountered in our lives. A typical set is given by Schubiger [111] and O'Connor and Arnold [95], which contains 300 emotional states. However, classifying such a large number of emotions is very difficult. Many researchers agree with the 'palette theory', which states that any emotion can be decomposed into primary emotions similar to the way that any color is a combination of some basic colors. The primary emotions are anger, disgust, fear, joy, sadness, and surprise [29]. These emotions are the most obvious and distinct emotions in our life. They are called the archetypal emotions [29].

In this paper, we present a comprehensive review of speech emotion recognition systems targeting pattern recognition researchers who do not necessarily have a deep background in speech analysis. We survey three important aspects of speech emotion recognition: (1) important design criteria of emotional speech corpora, (2) the impact of speech features on the classification performance of speech emotion recognition, and (3) classification systems employed in speech emotion recognition. Though there are many reviews on speech emotion recognition such as [129,5,12], our survey is more comprehensive in surveying the speech features and the classification techniques used in speech emotion recognition. We surveyed different types of features and considered the benefits of combining the available acoustic information with other sources of information such as linguistic, discourse, and video information. We covered, in some theoretical detail, different classification techniques commonly used in speech emotion recognition. We also included numerous speech emotion recognition systems implemented in other research papers in order to gain insight into the performance of existing speech emotion recognizers. However, the reader should interpret the recognition rates of those systems carefully, since different emotional speech corpora and experimental setups were used with each of them.

The paper is divided into five sections. In Section 2, important issues in the design of an emotional speech database are discussed. Section 3 reviews in detail speech feature extraction methods. Classification techniques applied in speech emotion recognition are addressed in Section 4. Finally, important conclusions are drawn in Section 5.

2. Emotional speech databases

An important issue to be considered in the evaluation of an emotional speech recognizer is the degree of naturalness of the database used to assess its performance. Incorrect conclusions may be established if a low-quality database is used. Moreover, the design of the database is critically important to the classification task being considered. For example, the emotions being classified may be infant-directed, e.g. soothing and prohibition [15,120], or adult-directed, e.g. joy and anger [22,38]. In other databases, the classification task is to detect stress in speech [140]. The classification task is also defined by the number and type of emotions included in the database. This section is divided into three subsections. In Section 2.1, different criteria used to evaluate the goodness of an emotional speech database are discussed. In Section 2.2, a brief overview of some of the available databases is given. Finally, limitations of the emotional speech databases are addressed in Section 2.3.

2.1. Design criteria

There should be some criteria that can be used to judge how well a certain emotional database simulates a real-world environment. According to some studies [69,22], the following are the most relevant factors to be considered:

Real-world emotions or acted ones? It is more realistic to use speech data that are collected from real-life situations. A famous example is the recordings of the radio news broadcast of major events such as the crash of the Hindenburg [22]. Such recordings contain utterances with very naturally conveyed emotions. Unfortunately, there may be some legal and moral issues that prohibit their use for research purposes. Alternatively, emotional sentences can be elicited in sound laboratories, as in the majority of the existing databases. It has often been criticized that acted emotions are not the same as real ones. Williams and Stevens [135] found that acted emotions tend to be more exaggerated than real ones. Nonetheless, the relationship between the acoustic correlates and the acted emotions does not contradict that between acoustic correlates and real ones.

Who utters the emotions? In most emotional speech databases, professional actors are invited to express (or feign) pre-determined sentences with the required emotions. However, in some of them, such as the Danish Emotional Speech (DES) database [38], semi-professional actors are employed instead in order to avoid exaggeration in expressing emotions and to be closer to real-world situations.

How to simulate the utterances? The recorded utterances in most emotional speech databases are not produced in a conversational context [69]. Therefore, utterances may lack some naturalness, since it is believed that most emotions are outcomes of our response to different situations. Generally, there are two approaches for eliciting emotional utterances. In the first approach, experienced speakers act as if they were in a specific emotional state, e.g. being glad, angry, or sad. In many developed corpora [15,38], such experienced actors were not available and semi-professional or amateur actors were invited to utter the emotional utterances.
Alternatively, a Wizard-of-Oz scenario is used in order to help the actor reach the required emotional states. This wizard involves the interaction between the actor and the computer as if the latter were a human [8]. In a recent study [59], it was proposed to use computer games to induce natural emotional speech. Voice samples were elicited following game events, whether the player won or lost the game, and were accompanied by either pleasant or unpleasant sounds.

Balanced utterances or unbalanced utterances? While balanced utterances are useful for controlled scientific analysis and experiments, they may reduce the validity of the data. As an alternative, a large set of unbalanced and valid utterances may be used.

Are utterances uniformly distributed over emotions? Some corpus developers prefer that the number of utterances for each emotion is almost the same in order to properly evaluate the classification accuracy, as in the Berlin corpus [18]. On the other hand, many other researchers prefer that the distribution of the emotions in the database reflects their frequency in the world [140,91]. For example, the neutral emotion is the most frequent emotion in our daily life. Hence, the number of utterances with the neutral emotion should be the largest in the emotional speech corpus.

Same statement with different emotions? In order to study the explicit effect of emotions on the acoustic features of speech utterances, it is common in many databases to record the same sentence with different emotions. One advantage of such a database is to ensure that the human judgment of the perceived emotion is solely based on the emotional content of the sentence and not on its lexical content.

2.2. Available and known emotional speech databases

Most of the developed emotional speech databases are not available for public use. Thus, there are very few benchmark databases that can be shared among researchers. Another consequence of this privacy is the lack of coordination among researchers in this field: the same mistakes in recording are being repeated for different emotional speech databases. Table 1 summarizes characteristics of some databases commonly used in speech emotion recognition. From this table, we notice that the emotions are usually stimulated by professional or nonprofessional actors. In fact, there are some legal and ethical issues that may prevent researchers from recording real voices. In addition, nonprofessional actors are invited to produce emotions in many databases in order to avoid exaggeration in the perceived emotions. Moreover, we notice that most of the databases share the following emotions: anger, joy, sadness, surprise, boredom, disgust, and neutral, following the palette theory. Finally, most of the databases addressed adult-directed emotions, while only two, KISMET and BabyEars, considered infant-directed emotions. It is believed that recognizing infant-directed emotions is very useful in the interaction between humans and robots [15].

2.3. Problems in existing emotional speech databases

Almost all the existing emotional speech databases have some limitations for assessing the performance of proposed emotion recognizers. Some of the limitations of emotional speech databases are briefly mentioned:

(1) Most emotional speech databases do not simulate emotions well enough in a natural and clear way. This is evidenced by the relatively low recognition rates of human subjects. In some databases (see [94]), the human recognition performance is as low as about 65%.
(2) In some databases, such as KISMET, the quality of the recorded utterances is not very good. Moreover, the sampling frequency is somewhat low (8 kHz).
(3) Phonetic transcriptions are not provided with some databases, such as BabyEars [120]. Thus, it is difficult to extract linguistic content from the utterances of such databases.

3. Features for speech emotion recognition

An important issue in the design of a speech emotion recognition system is the extraction of suitable features that efficiently characterize different emotions. Since pattern recognition techniques are rarely independent of the problem domain, it is believed that a proper selection of features significantly affects the classification performance.

Four issues must be considered in feature extraction. The first issue is the region of analysis used for feature extraction. While some researchers follow the ordinary framework of dividing the speech signal into small intervals, called frames, from each of which a local feature vector is extracted, other researchers prefer to extract global statistics from the whole speech utterance. Another important question is which feature types, e.g. pitch, energy, or zero-crossing rate, are best for this task. A third question concerns the effect of ordinary speech processing, such as post-filtering and silence removal, on the overall performance of the classifier. Finally, does it suffice to use acoustic features for modeling emotions, or is it necessary to combine them with other types of features such as linguistic, discourse, or facial features?

The above issues are discussed in detail in the following subsections. In Section 3.1, a comparison between local features and global features is given. Section 3.2 describes different types of speech features used in speech emotion recognition. This subsection is concluded with our recommendations for the choice of speech features. Section 3.3 explains the pre-processing and post-processing steps required for the extracted speech features. Finally, Section 3.4 discusses other sources of information that can be integrated with the acoustic one in order to improve classification performance.

3.1. Local features versus global features

Since speech signals are not stationary even in the wide sense, it is common in speech processing to divide a speech signal into small segments called frames. Within each frame the signal is considered to be approximately stationary [104]. Prosodic speech features such as pitch and energy are extracted from each frame and are called local features. On the other hand, global features are calculated as statistics of all speech features extracted from an utterance. There has been a disagreement on which of local and global features are more suitable for speech emotion recognition. The majority of researchers have agreed that global features are superior to local ones in terms of classification accuracy and classification time [128,57,117,100]. Global features have another advantage over local features: their number is much smaller. Therefore, the application of cross-validation and feature selection algorithms to global features is executed much faster than if applied to local features.
Table 1
Characteristics of common emotional speech databases. Each entry lists: Corpus | Access | Language | Content | Speakers/source | Emotions.

LDC Emotional Prosody Speech and Transcripts [78] | Commercially available (a) | English | 7 actors × 15 emotions × 10 utterances | Professional actors | Neutral, panic, anxiety, hot anger, cold anger, despair, sadness, elation, joy, interest, boredom, shame, pride, contempt
Berlin emotional database [18] | Public and free (b) | German | 10 actors × 7 emotions × 10 utterances + some second versions = 800 utterances | Professional actors | Anger, joy, sadness, fear, disgust, boredom, neutral
Danish emotional database [38] | Public with license fee (c) | Danish | 4 actors × 5 emotions (2 words + 9 sentences + 2 passages) | Nonprofessional actors | Anger, joy, sadness, surprise, neutral
Natural [91] | Private | Mandarin | 388 utterances, 11 speakers, 2 emotions | Call centers | Anger, neutral
ESMBS [94] | Private | Mandarin | 720 utterances, 12 speakers, 6 emotions | Nonprofessional actors | Anger, joy, sadness, disgust, fear, surprise
INTERFACE [54] | Commercially available (d) | English, Slovenian, Spanish, French | English (186 utterances), Slovenian (190 utterances), Spanish (184 utterances), French (175 utterances) | Actors | Anger, disgust, fear, joy, surprise, sadness, slow neutral, fast neutral
KISMET [15] | Private | American English | 1002 utterances, 3 female speakers, 5 emotions | Nonprofessional actors | Approval, attention, prohibition, soothing, neutral
BabyEars [120] | Private | English | 509 utterances, 12 speakers (6 males + 6 females), 3 emotions | Mothers and fathers | Approval, attention, prohibition
SUSAS [140] | Public with license fee (e) | English | 16,000 utterances, 32 speakers (13 females + 19 males) | Speech under simulated and actual stress | Four stress styles: Simulated Stress, Calibrated Workload Tracking Task, Acquisition and Compensatory Tracking Task, Amusement Park Roller-Coaster, Helicopter Cockpit Recordings
MPEG-4 [114] | Private | English | 2440 utterances, 35 speakers | U.S. American movies | Joy, anger, disgust, fear, sadness, surprise, neutral
Beihang University [43] | Private | Mandarin | 7 actors × 5 emotions × 20 utterances | Nonprofessional actors | Anger, joy, sadness, disgust, surprise
FERMUS III [112] | Public with license fee (f) | German, English | 2829 utterances, 7 emotions, 13 actors | Automotive environment | Anger, disgust, joy, neutral, sadness, surprise
KES [65] | Private | Korean | 5400 utterances, 10 actors | Nonprofessional actors | Neutral, joy, sadness, anger
CLDC [146] | Private | Chinese | 1200 utterances, 4 actors | Nonprofessional actors | Joy, anger, surprise, fear, neutral, sadness
Hao Hu et al. [56] | Private | Chinese | 8 actors × 5 emotions × 40 utterances | Nonprofessional actors | Anger, fear, joy, sadness, neutral
Amir et al. [2] | Private | Hebrew | 60 Hebrew and 1 Russian actors | Nonprofessional actors | Anger, disgust, fear, joy, neutral, sadness
Pereira [55] | Private | English | 2 actors × 5 emotions × 8 utterances | Nonprofessional actors | Hot anger, cold anger, joy, neutral, sadness

(a) Linguistic Data Consortium, University of Pennsylvania, USA.
(b) Institute for Speech and Communication, Department of Communication Science, the Technical University, Germany.
(c) Department of Electronic Systems, Aalborg University, Denmark.
(d) Center for Language and Speech Technologies and Applications (TALP), the Technical University of Catalonia, Spain.
(e) Linguistic Data Consortium, University of Pennsylvania, USA.
(f) FERMUS research group, Institute for Human-Machine Communication, Technische Universität München, Germany.
However, researchers have claimed that global features are efficient only in distinguishing between high-arousal emotions, e.g. anger, fear, and joy, and low-arousal ones, e.g. sadness [94]. They claim that global features fail to classify emotions which have similar arousal, e.g. anger versus joy. Another disadvantage of global features is that the temporal information present in speech signals is completely lost. Moreover, it may be unreliable to use complex classifiers such as the hidden Markov model (HMM) and the support vector machine (SVM) with global speech features, since the number of training vectors may not be sufficient for reliably estimating the model parameters. On the other hand, complex classifiers can be trained reliably using the large number of local feature vectors, and hence their parameters will be accurately estimated. This may lead to higher classification accuracy than that achieved if global features are used.
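To make the contrast between the two analysis levels concrete, the following minimal Python sketch (an illustration only, not code from any of the surveyed systems; the frame length, hop size, and feature choices are assumptions of the example) extracts simple local features (per-frame log-energy and zero-crossing rate) and then summarizes them into a single global feature vector of utterance-level statistics.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (e.g. 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def local_features(x):
    """Per-frame (local) features: log-energy and zero-crossing rate."""
    frames = frame_signal(x)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([log_energy, zcr])      # shape: (n_frames, 2)

def global_features(x):
    """Utterance-level (global) statistics of the local feature contours."""
    feats = local_features(x)
    stats = [np.mean(feats, axis=0), np.std(feats, axis=0),
             np.min(feats, axis=0), np.max(feats, axis=0),
             np.max(feats, axis=0) - np.min(feats, axis=0)]
    return np.concatenate(stats)                   # one fixed-length vector per utterance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    utterance = rng.standard_normal(16000)         # 1 s of placeholder audio at 16 kHz
    print(local_features(utterance).shape)         # many local feature vectors
    print(global_features(utterance).shape)        # a single global feature vector
```

In such a setup, the sequence of local vectors would typically feed a dynamic classifier such as an HMM, while the single global vector would feed a static classifier such as an SVM.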
576 M. El Ayadi et al. / Pattern Recognition 44 (2011) 572–587
A third approach for feature extraction is based on segmenting speech signals into the underlying phonemes and then calculating one feature vector for each segmented phoneme [73]. This approach relies on a study that observed variation in the spectral shapes of the same phone under different emotions [74]. This observation is essentially true for vowel sounds. However, the poor performance of phoneme segmentation algorithms can be another problem, especially when the phonetic transcriptions of the utterances are not provided. An alternative method is to extract a feature vector for each voiced speech segment rather than for each phoneme. Voiced speech segments refer to continuous parts of speech that are caused by vibrations of the vocal cords and are oscillatory [104]. This approach is much easier to implement than the phoneme-based approach. In [117], the feature vector contained a combination of segment-based and global features. The k-nearest neighbor (k-NN) and the SVM were used for classification. The KISMET emotional corpus [15] was used for assessing the classification performance. The corpus contained 1002 utterances from three English speakers with the following infant-directed emotions: approval, attention, prohibition, soothing, and neutral. Speaker-dependent classification was mainly considered. Employing their feature representation resulted in a 5% increase over the baseline accuracy corresponding to using only global features. In particular, the segment-based approach achieved classification accuracies of 87% and 83% using the k-NN and the SVM, respectively, versus 81% and 78% obtained by utterance-level features and using the same classifiers.

3.2. Categories of speech features

An important issue in speech emotion recognition is the extraction of speech features that efficiently characterize the emotional content of speech and, at the same time, do not depend on the speaker or the lexical content. Although many speech features have been explored in speech emotion recognition, researchers have not identified the best speech features for this task.

Speech features can be grouped into four categories: continuous features, qualitative features, spectral features, and TEO (Teager energy operator)-based features. Fig. 1 shows examples of features belonging to each category. The main purpose of this section is to compare the pros and cons of each category. However, it is common in speech emotion recognition to combine features that belong to different categories to represent the speech signal.

3.2.1. Continuous speech features

Most researchers believe that continuous prosodic features such as pitch and energy convey much of the emotional content of an utterance [29,19,12]. According to the studies performed by Williams and Stevens [136], the arousal state of the speaker (high activation versus low activation) affects the overall energy, the energy distribution across the frequency spectrum, and the frequency and duration of pauses of the speech signal. Recently, several studies have confirmed this conclusion [60,27].

Continuous speech features have been heavily used in speech emotion recognition. For example, Banse et al. examined vocal cues for 14 emotion categories [7]. The speech features they used are related to the fundamental frequency (F0), the energy, the articulation rate, and the spectral information in voiced and unvoiced portions. According to many studies (see [29,92,69]), these acoustic features can be grouped into the following categories:

(1) pitch-related features;
(2) formant features;
(3) energy-related features;
(4) timing features;
(5) articulation features.

Some of the most commonly used global features in speech emotion recognition are:

Fundamental frequency (F0): mean, median, standard deviation, maximum, minimum, range (max–min), linear regression coefficients, 4th-order Legendre parameters, vibrations, mean of the first difference, mean of the absolute value of the first difference, jitter, and the ratio of the number of up-slope samples to that of down-slope samples of the pitch contour.

Energy: mean, median, standard deviation, maximum, minimum, range (max–min), linear regression coefficients, shimmer, and 4th-order Legendre parameters.

Duration: speech rate, ratio of the durations of voiced and unvoiced regions, and duration of the longest voiced speech.

Formants: first and second formants, and their bandwidths.

More complex statistics are also used, such as the parameters of the F0-pattern generation model proposed by Fujisaki (for more details, see [51]).
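To make part of the above list concrete, the sketch below (a hedged illustration rather than code from any of the cited studies; it assumes an F0 contour has already been produced by some pitch tracker and that unvoiced frames are marked with zeros) computes several of the listed global pitch statistics, including the slope of a linear regression over the contour and a rough frame-to-frame jitter estimate.

```python
import numpy as np

def f0_global_stats(f0_contour):
    """Global statistics of an F0 contour (Hz per frame; 0 marks unvoiced frames)."""
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0[f0 > 0]                              # keep voiced frames only
    slope = np.polyfit(np.arange(len(voiced)), voiced, 1)[0]   # linear regression slope
    # Relative frame-to-frame F0 change, used here as a rough stand-in for jitter.
    jitter = np.mean(np.abs(np.diff(voiced))) / np.mean(voiced)
    return {
        "mean": voiced.mean(),
        "median": np.median(voiced),
        "std": voiced.std(),
        "min": voiced.min(),
        "max": voiced.max(),
        "range": voiced.max() - voiced.min(),
        "slope": slope,
        "jitter": jitter,
    }

if __name__ == "__main__":
    # A synthetic rising contour with a short unvoiced gap, for illustration only.
    contour = np.concatenate([np.linspace(180, 220, 50), np.zeros(10),
                              np.linspace(220, 240, 40)])
    print(f0_global_stats(contour))
```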
Several studies on the relationship between the above-mentioned speech features and the basic archetypal emotions have been made [28,29,7,92,96,9,11,123]. From these studies, it has been shown that prosodic features provide a reliable indication of the emotion. However, there are contradictory reports on the effect of emotions on prosodic features. For example, while Murray and Arnott [92] indicate that a high speaking rate is associated with the emotion of anger, Oster and Risberg [96] reach the opposite conclusion. In addition, it seems that there are similarities between the characteristics of some emotions. For instance, the emotions of anger, fear, joy, and surprise have similar characteristics for the fundamental frequency (F0) [104,20], such as:

• Average pitch: average value of F0 for the utterance.
• Contour slope: the slope of the F0 contour.
• Final lowering: the steepness of the F0 decrease at the end of a falling contour, or of the rise at the end of a rising contour.
• Pitch range: the difference between the highest and the smallest value of F0.
• Reference line: the steady value of F0 after an excursion of high or small pitch.

3.2.2. Voice quality features

It is believed that the emotional content of an utterance is strongly related to its voice quality [29,109,31]. Experimental studies with listening human subjects demonstrated a strong relation between voice quality and the perceived emotion [46]. Many researchers studying the auditory aspects of emotions have been trying to define such a relation [29,92,28,110]. Voice quality seems to be described most regularly with reference to full-blown emotions, i.e. emotions that strongly direct people into a course of action [29]. This is opposed to "underlying emotions", which influence a person's actions and thoughts positively or negatively without seizing control [29]. A wide range of phonetic variables contributes to the subjective impression of voice quality [92]. According to an extensive study made by Cowie et al. [29], the acoustic correlates related to voice quality are grouped into the following categories:

(1) voice level: signal amplitude, energy, and duration have been shown to be reliable measures of voice level;
(2) voice pitch;
(3) phrase, phoneme, word, and feature boundaries;
(4) temporal structures.

However, relatively little is known about the role of voice quality in delivering emotions, for two reasons. First, impressionistic labels such as tense, harsh, and breathy are used to describe voice quality. Those terms can have different interpretations based on the understanding of the researcher [46]. This has led to a disagreement between researchers on how to associate vocal quality terms with emotions. For example, Scherer [109] suggested that a tense voice is associated with anger, joy, and fear, and that a lax voice is associated with sadness. On the other hand, Murray and Arnott [92] suggested that a breathy voice is associated with both anger and happiness, while sadness is associated with a 'resonant' voice quality.

The second problem is the difficulty of automatically determining those voice quality terms directly from the speech signal. There has been considerable research on the latter problem, which can be categorized into two approaches. The first approach depends on the fact that the speech signal can be modelled as the output of a vocal tract filter excited by a glottal source signal [104]. Therefore, voice quality can be better measured by removing the filtering effect of the vocal tract and measuring parameters of the glottal signal. However, neither the glottal source signal nor the vocal tract filter is known, and hence the glottal signal is estimated by exploiting knowledge about the characteristics of the source signal and of the vocal tract filter. For a review of inverse-filtering techniques, the reader is referred to [46] and the references therein. Because of the inherent difficulty of this approach, it is not much used in speech emotion recognition; e.g. [122]. In the second approach, the voice quality is numerically represented by parameters estimated directly from the speech signal, i.e. no estimation of the glottal source signal is performed. In [76], voice quality was represented by the jitter and shimmer [44]. The speech emotion recognition system used a continuous HMM as a classifier and was applied to utterances from the SUSAS database [140] with the following selected speaking styles: angry, fast, Lombard, question, slow, and soft. The classification task was speaker independent but dialect-dependent. The baseline accuracy corresponding to using only the MFCC as features was 65.5%. The classification accuracy was 68.1% when the MFCC were combined with the jitter, 68.5% when the MFCC were combined with the shimmer, and 69.1% when the MFCC were combined with both of them.

In [81,83,84], voice quality parameters are roughly calculated as follows. The pitch, the first four formant frequencies, and their bandwidths are estimated from the speech signal. The effect of the vocal tract is equalized mathematically by subtracting terms which represent the vocal tract influence from the amplitudes of each harmonic (see [85] for details). Finally, voice quality parameters, called spectral gradients, are calculated as simple functions of the compensated harmonic amplitudes. The experimental results of their study are discussed in Section 4.5.

3.2.3. Spectral-based speech features

In addition to time-dependent acoustic features such as pitch and energy, spectral features are often selected as a short-time representation of the speech signal. It is recognized that the emotional content of an utterance has an impact on the distribution of the spectral energy across the speech frequency range [94]. For example, it is reported that utterances with the happiness emotion have high energy in the high-frequency range, while utterances with the sadness emotion have little energy in the same range [7,64]. Spectral features can be extracted in a number of ways, including the ordinary linear predictor coefficients (LPC) [104], one-sided autocorrelation linear predictor coefficients (OSALPC) [50], the short-time coherence method (SMC) [14], and least-squares modified Yule–Walker equations (LSMYWE) [13]. However, in order to better exploit the spectral distribution over the audible frequency range, the estimated spectrum is often passed through a bank of band-pass filters. Spectral features are then extracted from the outputs of these filters. Since human perception of pitch does not follow a linear scale [103], the filters' bandwidths are usually evenly distributed with respect to a suitable nonlinear frequency scale such as the Bark scale [103], the Mel-frequency scale [103,61], the modified Mel-frequency scale, and the ExpoLog scale [13].

Cepstral-based features can be derived from the corresponding linear features, as in the case of linear predictor cepstral coefficients (LPCC) [4] and cepstral-based OSALPC (OSALPCC) [13]. There have been contradictory reports on whether cepstral-based features are better than linear-based ones in emotion recognition. In [13], it was shown that features based on cepstral analysis, such as LPCC, OSALPCC, and Mel-frequency cepstrum coefficients (MFCC), clearly outperform the linear-based features LPC and OSALPC in detecting stress in speech signals. However, Nwe et al. [94] compared a linear-based feature, namely log-frequency power coefficients (LFPC), with two cepstral-based features, namely LPCC and MFCC. They mainly used the HMM for classification. The emotional speech database they used was locally recorded. It contained 720 utterances from six Burmese speakers and six Mandarin speakers with the six archetypal emotions: anger, disgust, fear, joy, sadness, and surprise. Sixty percent of the emotion utterances of each speaker were used to train each emotion model, while the remaining 40% of the utterances were used for testing. They showed that the LFPC provided an average classification accuracy of 77.1%, while the LPCC and the MFCC gave 56.1% and 59.0% identification accuracies, respectively.
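For readers unfamiliar with filter-bank based spectral features, the sketch below shows one common way to obtain them in Python using the librosa library (assuming it is installed; the sampling rate, number of Mel bands, and number of cepstral coefficients are illustrative choices, and log Mel-band energies are used here only as a rough analogue of band-based features such as LFPC, not as a reproduction of the feature sets in the cited studies).

```python
import numpy as np
import librosa

def spectral_features(y, sr=16000, n_mfcc=13, n_mels=26):
    """Frame-level spectral features: log Mel filter-bank energies and MFCCs."""
    # Energy at the output of each Mel band-pass filter, expressed in dB.
    mel_power = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel_power)
    # Cepstral features derived from the Mel spectrum.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return log_mel.T, mfcc.T   # shapes: (n_frames, n_mels) and (n_frames, n_mfcc)

if __name__ == "__main__":
    y = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
    log_mel, mfcc = spectral_features(y)
    print(log_mel.shape, mfcc.shape)
```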
3.2.4. Nonlinear TEO-based features

According to experimental studies done by Teager, speech is produced by nonlinear air flow in the vocal system [125]. Under stressful conditions, the muscle tension of the speaker affects the air flow in the vocal system producing the sound. Therefore, nonlinear speech features are necessary for detecting stress in speech. The Teager energy operator (TEO), first introduced by Teager [124] and Kaiser [63], was originally developed with the supporting evidence that hearing is the process of detecting energy. For a discrete-time signal x[n], the TEO is defined as

Ψ{x[n]} = x²[n] − x[n−1] x[n+1].   (1)

It has been observed that under stressful conditions the fundamental frequency changes, as does the distribution of harmonics over the critical bands [13,125]. It has been verified that the TEO of a multi-frequency signal reflects not only the individual frequency components but also the interaction between them [145]. Based on this fact, TEO-based features can be used for detecting stress in speech.
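Eq. (1) translates directly into a few lines of code. The sketch below is an illustrative implementation of the operator itself only (not of derived features such as TEO-FM-Var or TEO-CB-Auto-Env); the test tone is an arbitrary example.

```python
import numpy as np

def teager_energy(x):
    """Teager energy operator of Eq. (1): psi[n] = x[n]**2 - x[n-1] * x[n+1].

    The first and last samples lack a complete neighborhood, so the output is
    defined for n = 1 .. N-2 only.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

if __name__ == "__main__":
    # For a pure tone A*cos(w*n), the TEO is approximately A**2 * sin(w)**2,
    # i.e. it grows with both the amplitude and the frequency of the component.
    n = np.arange(1000)
    tone = 0.5 * np.cos(0.1 * np.pi * n)
    print(teager_energy(tone)[:5])
```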
In [21], the Teager energy profile of the pitch contour was the feature used to classify the following effects in speech: loud, angry, Lombard, clear, and neutral. Classification was performed by a combination of vector quantization and HMM. The classification system was applied to utterances from the SUSAS database [140] and it was speaker dependent. While the classification system detected the loud and angry effects of speech with high accuracies of 98.1% and 99.1%, respectively, the classification accuracies of detecting the Lombard and clear effects were much lower: 86.1% and 64.8%. Moreover, two assumptions were made: (1) the text of the spoken utterances is already known to the system, and (2) the spoken words have the structure of vowel-consonant or consonant-vowel-consonant. Therefore, much lower accuracies are expected for free-style speech.

In another study [145], other TEO-based features, namely TEO-decomposed FM variation (TEO-FM-Var), normalized TEO autocorrelation envelope area (TEO-Auto-Env), and critical band-based TEO autocorrelation envelope area (TEO-CB-Auto-Env), were proposed for detecting neutral versus stressed speech and for classifying the stressed speech into three styles: angry, loud, and Lombard. A five-state HMM was used as a baseline classifier and tested using utterances from the SUSAS database [140]. The developed features were compared against the MFCC and the pitch features in three classification tasks:

(1) Text-dependent pairwise stress classification: TEO-FM-Var (70.5% ± 15.77%), TEO-Auto-Env (79.4% ± 4.01%), TEO-CB-Auto-Env (92.9% ± 3.97%), MFCC (90.9% ± 5.73%), pitch (79.9% ± 17.18%).
(2) Text-independent pairwise stress classification: TEO-CB-Auto-Env (89.0% ± 8.39%), MFCC (67.7% ± 8.78%), pitch (79.9% ± 17.18%).
(3) Text-independent multi-style stress classification: TEO-CB-Auto-Env (neutral 70.6%, angry 65.0%, loud 51.9%, Lombard 44.9%), MFCC (neutral 46.3%, angry 58.6%, loud 20.7%, Lombard 35.1%), pitch (neutral 52.2%, angry 44.4%, loud 53.3%, Lombard 89.5%).

Based on the extensive experimental evaluations, the authors concluded that TEO-CB-Auto-Env outperformed the MFCC and the pitch in stress detection, but that it completely fails for the composite task of speech recognition and stress classification.

We also conclude that the choice of proper features for speech emotion recognition highly depends on the classification task being considered. In particular, based on the review in this section, we recommend the use of TEO-based features for detecting stress in speech. For classifying high-arousal versus low-arousal emotions, continuous features such as the fundamental frequency and the pitch contour statistics should be used. For N-way classification, spectral features such as the MFCC are the most promising features for speech representation. We also believe that combining continuous and spectral features will provide an even better classification performance for the same task. Clearly, there are some relationships among the feature types described above. For example, spectral variables relate to voice quality, and pitch contours relate to the patterns arising from different tones. But such links are rarely made in the literature.

3.3. Speech processing

The term pre-processing refers to all operations required to be performed on the time samples of the speech signal before extracting features. For example, due to recording environment differences, some sort of energy normalization has to be applied to all utterances. In order to equalize the effect of the propagation of speech through air, a pre-emphasis radiation filter is used to process the speech signal before the extraction of features. The transfer function of the pre-emphasis filter is usually given by [104]

H(z) = 1 − 0.97 z⁻¹.   (2)

In order to smooth the extracted contours, overlapped frames are commonly used. In addition, to reduce ripples in the spectrum of the speech signal, each frame is often multiplied by a Hamming window before feature extraction [104].

Since the silence intervals carry important information about the expressed emotion [94], these intervals are usually kept intact in speech emotion recognition. Note that silent intervals are frequently omitted from analysis in other spoken language tasks, such as speaker identification [107].

Having extracted the suitable speech features from the pre-processed time samples, some post-processing may be necessary before the feature vectors are used to train or test the classifier. For example, the extracted features may be of different units and hence their numerical values may have different orders of magnitude. In addition, some of them may be biased. This can cause numerical problems in training some classifiers, e.g. the Gaussian mixture model (GMM), since the covariance matrix of the training data may be ill conditioned. Therefore, feature normalization may be necessary in such cases. The most common method for feature normalization is z-score normalization [116,115]:

x̂ = (x − μ)/σ,   (3)

where μ is the mean of the feature x and σ is its standard deviation. However, a disadvantage of this method is that all the normalized features have unit variance. It is believed that the variances of features have high information content [90].
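The pre-processing and post-processing steps just described can be sketched as follows. This is a minimal illustration: the filter coefficient of Eq. (2), the Hamming window, and the z-score rule of Eq. (3) follow the text, while the frame sizes and the toy feature are assumptions of the example.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply the pre-emphasis filter of Eq. (2): y[n] = x[n] - 0.97 * x[n-1]."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

def windowed_frames(x, frame_len=400, hop=160):
    """Overlapping frames, each multiplied by a Hamming window."""
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])

def zscore_normalize(features):
    """Z-score normalization of Eq. (3), applied column-wise to a feature matrix."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10     # guard against constant features
    return (features - mu) / sigma

if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal(16000)
    frames = windowed_frames(pre_emphasis(x))
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]  # toy feature
    normalized = zscore_normalize(log_energy)
    print(normalized.mean(), normalized.std())   # approximately 0 and 1
```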
It is also common to use dimensionality reduction techniques in speech emotion recognition applications in order to reduce the storage and computation requirements of the classifier and to gain insight into the discriminating features. There are two approaches to dimensionality reduction: feature selection and feature extraction (also called feature transform [80]). In feature selection, the main objective is to find the feature subset that achieves the best possible classification between classes. The classification ability of a feature subset is usually characterized by an easy-to-calculate function, called the feature selection criterion, such as the cross-validation error [10] or the mutual information between the class label and the feature [137]. On the other hand, feature extraction techniques aim at finding a suitable linear or nonlinear mapping from the original feature space to another space of reduced dimensionality while preserving as much relevant classification information as possible. The reader may refer to [58,34] for excellent reviews of dimensionality reduction techniques.

The principal component analysis (PCA) feature extraction method has been used extensively in the context of speech emotion recognition [141,143,130,71]. In [25], it is observed that increasing the number of principal components improves the classification performance up to a certain order, after which the classification accuracy begins to decrease. This means that employing PCA may provide an improvement in classification performance over using the whole feature set. It is also not clear whether PCA is superior to other dimensionality reduction techniques. While the performance of PCA was very comparable to that of linear discriminant analysis (LDA) in [143], it is reported in [141,142] that PCA is significantly inferior to LDA and the sequential floating search (SFS) dimensionality reduction techniques. The obvious interpretation is that different acoustic features and emotional databases are used in those studies.

LDA has also been applied in speech emotion recognition applications [141,143], though it has the limitation that the reduced dimensionality must be less than the number of classes [34]. In [116], the LDA technique is used to compare more than 200 speech features. According to this study, it is concluded that pitch-related features yield about 69.81% recognition accuracy, versus 36.58% provided by energy-related features. This result is opposed to that established in [101], where it is concluded that the first and third quartiles of the energy distribution are important features in the task of emotion classification. In order to establish a reliable conclusion about a certain feature being powerful in distinguishing different emotional classes, one has to perform such a ranking over more than one database.
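As an illustration of the two transform-based options discussed above, the following sketch applies z-score scaling followed by either PCA or LDA to a matrix of global features, assuming scikit-learn is available. The data are random placeholders and the choice of 10 principal components is arbitrary; note how LDA is capped at one fewer dimension than the number of classes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 40))        # 300 utterances, 40 global features (placeholder)
y = rng.integers(0, 5, size=300)          # labels for 5 emotion classes (placeholder)

X_scaled = StandardScaler().fit_transform(X)           # z-score normalization

X_pca = PCA(n_components=10).fit_transform(X_scaled)   # unsupervised projection
# LDA is supervised and can keep at most (number of classes - 1) dimensions.
X_lda = LinearDiscriminantAnalysis(n_components=4).fit_transform(X_scaled, y)

print(X_pca.shape, X_lda.shape)           # (300, 10) (300, 4)
```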
3.4. Combining acoustic features with other information sources

In many situations, nonacoustic emotional cues such as facial expressions or some specific words are helpful for understanding the speaker's emotion. This fact has motivated some researchers to employ other sources of information in conjunction with the acoustic correlates in order to improve the recognition performance. In this section, a detailed overview of some emotion recognition systems that apply this idea is presented.

3.4.1. Combining acoustic and linguistic information

The linguistic content of the spoken utterance is an important part of the conveyed emotion [33]. Recently, there has been a focus on the integration of acoustic and linguistic information [72]. In order to make use of the linguistic information, it is first necessary to recognize the word sequence of the spoken utterance. Therefore, a language model is necessary. Language models describe constraints on possible word sequences in a certain language. A common language model is the N-gram model [144]. This model assigns high probabilities to typical word sequences and low probabilities to atypical word sequences [5].

Fig. 2 shows the basic architecture of a speech emotion recognition system that combines the roles of acoustic and linguistic models in finding the most probable word sequence. The input word transcriptions are processed in order to produce the language model.¹ In parallel, the feature extraction module converts the speech signal into a sequence of feature vectors. The extracted feature vectors, together with the pronunciation dictionary and the input word transcriptions, are then used to train the phoneme acoustic models. In the recognition phase, both the language model and the acoustic models obtained in the training phase are used to recognize the output word sequence according to the following Bayes rule:

Ŵ = arg max_W P(W|Y) = arg max_W [P(W) P(Y|W) / P(Y)] = arg max_W P(W) P(Y|W),   (4)

where Y is the set of acoustic feature vectors produced by the feature extraction module. The prior word probability P(W) is determined directly from the language model. In order to estimate the conditional probability of the acoustic feature set given a certain word sequence, an HMM for each phoneme is constructed and trained on the available speech database. The required conditional probability is estimated as the likelihood value produced by a set of phoneme HMMs concatenated in a sequence according to the word transcription stored in the dictionary. The Viterbi algorithm [131] is usually used for searching for the optimum word sequence that produced the given testing utterances.

In [116], a spotting algorithm that searches for emotional keywords or phrases in the utterances was employed. A Bayesian belief network was used to recognize emotions based on the acoustic features extracted from these keywords. The emotional speech corpus was collected from the FERMUS III project [112]. The corpus contained the following emotions: angry, disgust, fear, joy, neutral, sad, and surprise. The k-means, GMM, multi-layer perceptron (MLP), and SVM classifiers were used to classify emotions based on the acoustic information. The SVM provided the best speaker-independent classification accuracy (81.29%) and was thus selected as the acoustic classifier to be integrated with the linguistic classifier. The decisions of the acoustic and linguistic classifiers were fused by an MLP neural network. In that study, it was shown that the average recognition accuracy was 74.2% for acoustic features alone, 59.6% for linguistic information alone, 83.1% for both acoustic and linguistic information using fusion by the mean, and 92.0% for fusion by the MLP neural network.

An alternative procedure for detecting emotions using lexical information is found in [69]. In this work, a new information-theoretic measure, named emotional salience, was defined. Emotional salience measures how much information a word provides towards a certain emotion. This measure is more or less related to the mutual information between a particular word and a certain emotional category [47]. The training data set was selected 10 times in a random manner from the whole data set for each gender, with the same number of data for each class (200 for male data and 240 for female data). Using acoustic information only, the classification error ranged from 17.85% to 25.45% for male data and from 12.04% to 24.25% for female data. The increase in classification accuracy due to combining the linguistic information with the acoustic information was in the range from 7.3% to 11.05% for male data and from 4.05% to 9.47% for female data.

3.4.2. Combining acoustic, linguistic, and discourse information

Discourse markers are linguistic expressions that convey explicit information about the structure of the discourse or have a specific semantic contribution [48,26]. In the context of speech emotion recognition, discourse information may also refer to the way a user interacts with a machine [69]. Often, these systems do not operate in a perfect manner, and hence it might happen that the user expresses some emotion, such as frustration, in response to them [3]. Therefore, it is believed that there is a strong relation between the way a user interacts with a system and his/her expressed emotion [23,35]. Discourse information has been combined with acoustic correlates in order to improve the recognition performance of emotion recognition systems [8,3]. In [69], the following speech-acts are used for labeling the user response: rejection, repeat, rephrase, ask-start over, and none of the above. The speech data in this study were obtained from real users engaged in spoken dialog with a machine agent over the telephone using a commercially developed call center application. The main focus of this study was on detecting negative emotions (anger and frustration) versus nonnegative emotions, e.g. neutral and happy. As expected, there is a strong correlation between the speech-act of rejection and the negative emotions. In that work, acoustic, linguistic, and discourse information were combined for recognizing emotions. A linear discriminant classifier (LDC) was used for classification with both linguistic and discourse information. For acoustic information, both the LDC and the k-NN classifier were used. The increase in classification accuracy due to combining the discourse information with the acoustic information was in the range from 1.4% to 6.75% for male data and from 0.75% to 3.96% for female data.

The above information sources can be combined in a variety of ways. The most straightforward way is to combine all measurements output by these sources into one long feature vector [3]. However, as mentioned earlier, having feature vectors with high dimensionality is not desirable.

¹ It is also possible to use a ready-made language model.
Fig. 2. The architecture of a speech emotion recognition engine combining acoustic and linguistic information (block diagram: training and testing acoustic files, feature extraction, word transcriptions, pronunciation dictionary, acoustic models, and a search engine producing the recognized word sequence and emotions).
Another method is to implement three classifiers, one for each information source, and combine their output decisions using any decision fusion method such as bagging [16]. In [69], the final decision is based on the average of the likelihood values of all individual classifiers.
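The two fusion strategies just mentioned can be contrasted with a small sketch. The classifiers and scores below are hypothetical; the averaging rule mirrors the decision-level fusion described in [69], while the concatenation function corresponds to the single long feature vector of [3].

```python
import numpy as np

def feature_level_fusion(acoustic_vec, linguistic_vec, discourse_vec):
    """Early fusion: concatenate all measurements into one long feature vector."""
    return np.concatenate([acoustic_vec, linguistic_vec, discourse_vec])

def decision_level_fusion(class_likelihoods):
    """Late fusion: average the per-class scores of the individual classifiers
    and pick the class with the highest average score."""
    avg = np.mean(np.stack(class_likelihoods), axis=0)
    return int(np.argmax(avg)), avg

if __name__ == "__main__":
    # Hypothetical per-class scores from three classifiers (acoustic, linguistic,
    # discourse) for a two-class negative / non-negative decision.
    scores = [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.8, 0.2])]
    label, fused = decision_level_fusion(scores)
    print(label, fused)
```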
3.4.3. Combining acoustic and video information

Human facial expressions can be used in detecting emotions. There have been many studies on recognizing emotions based only on video recordings of facial expressions [36,97]. According to an experimental study based on human subjective evaluation, De Silva et al. [118] concluded that some emotions are more easily recognized using audio information than using video information, and vice versa. Based on this observation, they proposed combining the outputs of audio-based and video-based systems using an aggregation scheme. In fact, not much research work has been done in this area. In this survey, a brief overview of only two studies is given.

The first one is provided in [24]. Regarding the speech signal, pitch- and energy-related features such as the minimum and maximum values were first extracted from all the utterances. To analyze the video signal, the Fourier transform of the optical flow vectors for the eye region and the mouth region was computed. This method has been shown to be useful in analyzing video sequences [97,77]. The coefficients of the Fourier transform were then used as features for an HMM emotion recognizer. Synchronization was made between the audio and the video signals and all features were pooled in one long vector. The classification scheme was tested using the emotional video corpus developed by De Silva et al. [119]. The corpus contained six emotions: anger, happiness, sadness, surprise, dislike, and fear. The overall decision was made using a rule-based classification approach. Unfortunately, no classification accuracy was reported in this study.

In the other study [45], there were two classifiers: one for the video part and the other for the audio part. The emotional database was locally recorded and contained the six basic archetypal emotions. Features were extracted from the video data using multi-resolution analysis based on the discrete wavelet transform. The dimensionality of the obtained wavelet coefficient vectors was reduced using a combination of the PCA and LDA techniques. In the training phase, a codebook was constructed based on the feature vectors for each emotion. In the testing phase, the extracted features were compared to the reference vectors in each codebook and a membership value was returned. The same was repeated for the audio data. The two obtained membership values for each emotion were combined using the maximum rule. The fusion algorithm was applied to a locally recorded database which contained the following emotions: happiness, sadness, anger, surprise, fear, and dislike. Speaker-dependent classification was mainly considered in this study. When only acoustic features were used, the recognition accuracies ranged from 57% to 93.3% for male speakers and from 68% to 93.3% for female speakers. The facial emotion recognition rates ranged from 65% to 89.4% for male subjects and from 60% to 88.8% for female subjects when the PCA method was used for feature extraction. When the LDA method was used for feature extraction, the accuracies ranged from 70% to 90% for male subjects and from 64.4% to 95% for female subjects. When both acoustic and facial information sources were combined, the recognition accuracies were 98.3% for female speakers and 95% for male speakers.

Finally, it should be mentioned that though the combination of audio and video information seems to be powerful in detecting emotions, the application of such a scheme may not always be feasible. Video data may not be available for some applications such as automated dialog systems.

4. Classification schemes

Many types of classifiers have been applied to speech emotion recognition, including the HMM, the GMM, the SVM, artificial neural networks (ANN), k-NN, and many others. In fact, there has been no agreement on which classifier is the most suitable for emotion classification. It seems also that each classifier has its own advantages and limitations. In order to combine the merits of several classifiers, aggregating a group of classifiers has also recently been employed [113,84]. Based on several studies [94,21,72,97,115,43,138,129], we can conclude that the HMM is the most used classifier in emotion classification, probably because it is widely used in almost all speech applications. The objective of this section is to give an overview of various classifiers used in speech emotion recognition and to discuss the limitations of each of them. The focus will be on statistical classifiers because they are the most widely used in the context of speech emotion recognition. The classifiers are presented according to their relevance in the literature of speech emotion recognition. Multiple classifier systems are also discussed in this section.

In the statistical approach to pattern recognition, each class is modelled by a probability distribution estimated from the available training data. Statistical classifiers have been used in many speech recognition applications. While the HMM is the most widely used classifier in the task of automatic speech recognition (ASR), the GMM is considered the state-of-the-art classifier for speaker identification and verification [106].

The HMM and the GMM generally have many interesting properties such as the ease of implementation and their solid mathematical basis. However, compared to simple parametric classifiers such as the LDC and quadratic discriminant analysis (QDC), they have some minor drawbacks, such as the need for a proper initialization of the model parameters before training and the long training time often associated with them [10].

4.1. Hidden Markov model

The HMM classifier has been extensively used in speech applications such as isolated word recognition and speech segmentation because it is physically related to the production mechanism of the speech signal [102]. The HMM is a doubly stochastic process which consists of a first-order Markov chain whose states are hidden from the observer. Associated with each state is a random process which generates the observation sequence. Thus, the hidden states of the model capture the temporal structure of the data. Mathematically, for modeling a sequence of observable data vectors x_1, …, x_T by an HMM, we assume the existence of a hidden Markov chain responsible for generating this observable data sequence. Let K be the number of states, π_i, i = 1, …, K, be the initial state probabilities of the hidden Markov chain, and a_ij, i = 1, …, K, j = 1, …, K, be the transition probability from state i to state j. Usually, the HMM parameters are estimated based on the maximum likelihood (ML) principle. Assuming the true state sequence is s_1, …, s_T, the likelihood of the observable data is given by

p(x_1, s_1, …, x_T, s_T) = π_{s_1} b_{s_1}(x_1) a_{s_1,s_2} b_{s_2}(x_2) ⋯ a_{s_{T−1},s_T} b_{s_T}(x_T) = π_{s_1} b_{s_1}(x_1) ∏_{t=2}^{T} a_{s_{t−1},s_t} b_{s_t}(x_t),   (5)

where b_i(·) denotes the observation probability density associated with state i. Since the state sequence is hidden, the likelihood of the observed data is obtained by summing over all possible state sequences:

p(x_1, …, x_T) = Σ_{s_1,…,s_T} π_{s_1} b_{s_1}(x_1) ∏_{t=2}^{T} a_{s_{t−1},s_t} b_{s_t}(x_t).   (6)

Fortunately, very efficient algorithms have been proposed for the calculation of this likelihood function in a time of order O(K²T), such as the forward recursion and the backward recursion algorithms (for details about these algorithms, the reader is referred to [102,39]). In the training phase, the HMM parameters are determined as those maximizing the likelihood of (6). This is commonly achieved using the expectation maximization (EM) algorithm [32].
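The likelihood of (6) need not be computed by explicit enumeration of state sequences: the forward recursion mentioned above evaluates it efficiently. The sketch below is a generic illustration for an HMM with discrete observations; the parameter values are made up and it is not tied to any of the surveyed systems.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Forward recursion for a discrete-observation HMM.

    obs : sequence of observation symbol indices, length T
    pi  : initial state probabilities, shape (K,)
    A   : transition matrix, A[i, j] = P(next state j | state i), shape (K, K)
    B   : emission matrix, B[i, o] = P(symbol o | state i), shape (K, M)
    Returns log p(obs), i.e. the log of the likelihood in Eq. (6).
    """
    alpha = pi * B[:, obs[0]]               # alpha_1(i) = pi_i * b_i(x_1)
    log_scale = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(x_t)
        s = alpha.sum()                     # rescale to avoid numerical underflow
        alpha /= s
        log_scale += np.log(s)
    return log_scale + np.log(alpha.sum())

if __name__ == "__main__":
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(forward_log_likelihood([0, 1, 2, 2, 1], pi, A, B))
```

In an emotion recognition setting, one such model would typically be trained per emotion, and a test utterance would be assigned to the model with the highest likelihood.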
are also discussed in this section. emotion recognition since, in this case, the HMM states correspond
In the statistical approach to pattern recognition, each class is to emotional cues such as pauses. For example, if the pause is
modelled by a probability distribution based on the available associated with the emotion of sadness, there is no definite time
training data. Statistical classifiers have been used in many speech instant of this state; the pause may occur at the beginning, the
recognition applications. While HMM is the most widely used middle, or at the end of the utterance. Thus, any state should be
classifier in the task of automatic speech recognition (ASR), GMM is reachable from any other state and a fully connected HMM may be
considered the state-of-the-art classifier for speaker identification more suitable. Another distinction between ASR and emotion
and verification [106]. recognition is that the HMM states in the former are aligned
HMM and GMM generally have many interesting properties with a small number of acoustic features which correspond to small
such as the ease of implementation and their solid mathematical speech units such as phonemes or syllables. On the other hand,
basis. However, compared to simple parametric classifiers such as prosodic acoustic features associated with emotions only make
LDC and quadratic discriminant analysis (QDC), they have some sense with larger time units spanning at least a word [12]. Other
minor drawbacks compared such as the need of a proper initializa- design issues of the HMM classifier include determining the
tion for the model parameters before training and the long training optimal number of states, the type of the observations (discrete
time often associated with them [10]. versus continuous) and the optimal number of observation
symbols (also called codebook size [102]) in case of using
4.1. Hidden Markov model discrete HMM or the optimum number of Gaussian components
in case of using continuous HMM.
The HMM classifier has been extensively used in speech Generally, HMM provides classification accuracies for speech
applications such as isolated word recognition and speech emotion recognition tasks that are comparable to other
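To make the likelihood computations in (5) and (6) concrete, the following minimal Python sketch evaluates the observation likelihood of a discrete HMM with the forward recursion discussed above. The toy initial probabilities, transition matrix, emission table, and observation sequence are illustrative placeholders, not parameters from any of the surveyed systems.

import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Evaluate p(x_1,...,x_T) for a discrete HMM by the forward recursion.

    pi  : (K,)   initial state probabilities
    A   : (K, K) transition probabilities, A[i, j] = a_ij
    B   : (K, M) emission probabilities over M discrete codebook symbols
    obs : sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(x_1)
    for x_t in obs[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(x_t)
        alpha = (alpha @ A) * B[:, x_t]
    return alpha.sum()                     # p(x_1,...,x_T) = sum_i alpha_T(i)

# Toy example: K = 2 states, M = 3 codebook symbols (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_likelihood(pi, A, B, obs=[0, 2, 1, 1]))

Each recursion step costs O(K^2) operations, which gives the O(K^2 T) total mentioned above; in practice the recursion is carried out with scaling or in the log domain to avoid numerical underflow on long utterances.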
Generally, HMM provides classification accuracies for speech emotion recognition tasks that are comparable to other well-known classifiers. In [94], an HMM-based system for the classification of the six archetypal emotions was proposed. The LFPC, MFCC, and LPCC were used as representations of the speech signal. A four-state fully connected HMM was built for each emotion and for each speaker. The HMMs were discrete and a codebook of size 64 was constructed for the data of each speaker. Two speech databases were developed by the authors to train and test the HMM classifier: Burmese and Mandarin. Four hundred and thirty-two out of 720 utterances were used for training while the others were used for testing. The best average rates were 78.5% and 75.5% for the Burmese and Mandarin databases, respectively, while the human classification accuracy was 65.8%. That is, their proposed speech emotion recognition system performed better than humans for those particular databases. However, this result cannot be generalized unless a more comprehensive study involving more than one database is performed.

HMMs are used in many other studies such as [68,73]. In the former study, the recognition accuracy was 70.1% for 4-class style classification of utterances from the text-independent SUSAS database. In [73], two systems were proposed: the first was an ordinary system in which each emotion was modelled by a continuous HMM with 12 Gaussian mixtures for each state. In the second system, a three-state continuous HMM was built for each phoneme class. There were 46 phonemes, which were grouped into five classes: vowel, glide, nasal, stop, and fricative sounds. Each state was modeled by 16 Gaussian components. The TIMIT speech database was used to train the HMM for each phoneme class. The evaluation was performed using utterances of another locally recorded emotional speech database which contained the emotions of anger, happiness, neutral, and sadness. Each utterance was segmented to the phoneme level and the phoneme sequence was reported. For each testing utterance, a global HMM was built for this utterance, composed of phoneme-class HMMs concatenated in the same order as the corresponding phoneme sequence. The start and end frame numbers of each segment were determined using the Viterbi algorithm. This procedure was repeated for each emotion and the ML criterion was used to determine the expressed emotion. Applying this scheme to a locally recorded speech database containing 704 training utterances and 176 testing utterances, the obtained overall accuracy using the phoneme-class dependent HMM was 76.12%, versus 55.68% for an SVM using the prosodic features and 64.77% for a generic emotional HMM. Based on the obtained results, the authors claimed that phoneme-based modeling provides better discrimination between emotions. This may be true since there are variations across emotional states in the spectral features at the phoneme level, especially for vowel sounds [75].

4.2. Gaussian mixture models

The Gaussian mixture model is a probabilistic model for density estimation using a convex combination of multi-variate normal densities [133]. It can be considered as a special continuous HMM which contains only one state [107]. GMMs are very efficient in modeling multi-modal distributions [10] and their training and testing requirements are much lower than those of a general continuous HMM. Therefore, GMMs are more appropriate for speech emotion recognition when only global features are to be extracted from the training utterances. However, GMMs cannot model the temporal structure of the training data since all the training and testing equations are based on the assumption that all vectors are independent. Similar to many other classifiers, determining the optimum number of Gaussian components is an important but difficult problem [107]. The most common way to determine the optimal number of Gaussian components is through model order selection criteria such as the classification error with respect to a cross validation set, the minimum description length (MDL) [108], the Akaike information criterion (AIC) [1], and kurtosis-based goodness-of-fit (GOF) measures [37,132]. Recently, a greedy version of the EM algorithm has been developed such that both the model parameters and the model order are estimated simultaneously [133].

In [15], a GMM classifier was used with the KISMET infant-directed speech database, which contains 726 utterances. The emotions encountered were approval, attention, prohibition, soothing, and neutral. A kurtosis-based model selection criterion was used to determine the optimum number of Gaussian components for each model [132]. Due to the limited number of available utterances, a 100-fold cross validation was used to assess the classification performance. The SFS feature selection technique was used to select the best features from a set containing pitch-related and energy-related features. A maximum accuracy of 78.77% was achieved when the best five features were used. Using a hierarchical sequential classification scheme, the classification accuracy was increased to 81.94%.

The GMM is also used with some other databases such as the BabyEars emotional speech database [120]. This database contains 509 utterances: 212 utterances for the approval emotion, 149 for the attention emotion, and 148 for the prohibition emotion. The cross validation error was measured for a wide range of GMM orders (from 1 to 100). The best average performance obtained was about 75% (speaker-independent classification), which corresponded to a model order of 10. A similar result was obtained with the FERMUS III database [112], which contained a total of 5250 samples for the basic archetypal emotions plus the neutral emotion. Sixteen-component GMMs were used to model each emotion. The average classification accuracy was 74.83% for speaker-independent recognition and 89.12% for speaker-dependent recognition. These results were based on threefold cross validation.

In order to model the temporal structure of the data, the GMM was integrated with the vector autoregressive (VAR) process, resulting in what is called the Gaussian mixture vector autoregressive model (GMVAR) [6]. The GMVAR model was applied to the Berlin emotional speech database [18], which contained the anger, fear, happiness, boredom, sadness, disgust, and neutral emotions. The disgust emotion was discarded because of the small number of utterances. The GMVAR provided a classification accuracy of 76% versus 71% for the hidden Markov model, 67% for the k-nearest neighbors, and 55% for feed-forward neural networks. All the classification accuracies were based on fivefold cross validation where speaker information was not considered in the split of the data into training and testing sets; i.e. the classification was speaker dependent. In addition, the GMVAR model provided a 90% accuracy for classification among high-arousal emotions, low-arousal emotions, and the neutral emotion, versus 86.00% for the HMM technique.
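As an illustration of the GMM-based approach reviewed above, the sketch below fits one Gaussian mixture per emotion on utterance-level feature vectors and assigns a test utterance to the emotion whose model yields the highest log-likelihood. The feature matrices, emotion labels, and number of mixture components are placeholders and do not reproduce the setups of [15,120,112].

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifiers(features_by_emotion, n_components=16, seed=0):
    """Fit one GMM per emotion on utterance-level feature vectors."""
    models = {}
    for emotion, X in features_by_emotion.items():   # X: (n_utterances, n_features)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=seed)
        models[emotion] = gmm.fit(X)
    return models

def classify(models, x):
    """Return the emotion whose GMM gives the highest log-likelihood for x."""
    x = np.atleast_2d(x)
    return max(models, key=lambda e: models[e].score(x))

# Placeholder data: random vectors standing in for pitch/energy statistics.
rng = np.random.default_rng(0)
train = {"anger": rng.normal(1.0, 1.0, (200, 12)),
         "sadness": rng.normal(-1.0, 1.0, (200, 12))}
models = train_gmm_classifiers(train, n_components=4)
print(classify(models, rng.normal(1.0, 1.0, 12)))

The model order could be selected as discussed above, for example by the cross-validated classification error or an information criterion (scikit-learn exposes GaussianMixture.bic for this purpose), rather than being fixed in advance.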
4.3. Neural networks

Another common classifier used for many pattern recognition applications is the artificial neural network (ANN). ANNs have some advantages over GMM and HMM. They are known to be more effective in modeling nonlinear mappings. Also, their classification performance is usually better than that of HMM and GMM when the number of training examples is relatively low. Almost all ANNs can be categorized into three main basic types: MLP, recurrent neural networks (RNN), and radial basis function (RBF) networks [10]. The latter is rarely used in speech emotion recognition.

MLP neural networks are relatively common in speech emotion recognition. The reason for that may be the ease of implementation and the well-defined training algorithm once the structure of the ANN is completely specified. However, ANN classifiers in general have many design parameters, e.g. the form of the neuron activation function, the number of hidden layers and the number of neurons in each layer, which are usually set in an ad hoc manner. In fact, the performance of an ANN heavily depends on these parameters. Therefore, in some speech emotion recognition systems, more than one ANN is used [93]. An appropriate aggregation scheme is used to combine the outputs of the individual ANN classifiers.

The classification accuracy of ANN is fairly low compared to other classifiers. In [93], the main objective was to classify the following eight emotions: joy, teasing, fear, sadness, disgust, anger, surprise, and neutral from a locally recorded emotional speech database. The basic classifier was a One-Class-in-One Neural Network (OCON) [87], which consists of eight MLP sub-neural networks and a decision logic control. Each sub-neural network contained two hidden layers in addition to the input and the output layers. The output layer contained only one neuron whose output was an analog value from 0 to 1. Each sub-neural network was trained to recognize one of the eight emotions. In the testing phase, the output of each ANN specified how likely the input speech vectors were produced by a certain emotion. The decision logic control generated a single hypothesis based on the outputs of the eight sub-neural networks. This scheme was applied to a locally recorded speech database, which contained the recordings of 100 speakers. Each speaker uttered 100 words eight times, once for each of the above mentioned emotions. The best classification accuracy was only 52.87%, obtained by training on the utterances of 30 speakers and testing on the remaining utterances; i.e. the classification task was speaker independent.

Similar classification accuracies were obtained in [53] with an All-Class-in-One neural network architecture. Four topologies were tried in that work. In all of them, the neural network had only one hidden layer, which contained 26 neurons. The input layer had either 7 or 8 neurons and the output layer had either 14 or 26 neurons. The best achieved classification accuracy in this work was 51.19%. However, the classification models were speaker dependent.

A better result is found in [99]. In this study, three ANN configurations were applied. The first one was an ordinary two-layer MLP classifier. The speech database was also locally recorded and contained 700 utterances for the following emotions: happiness, anger, sadness, fear, and normal. A subset of the data containing 369 utterances was selected based on subjects' decisions and was randomly split into training (70% of the utterances) and testing (30%) subsets. The average classification accuracy was about 65%. The average classification accuracy was 70% for the second configuration, in which the bootstrap aggregation (bagging) scheme was employed. Bagging is a method for generating multiple versions of the classifier and using them to get an aggregated classifier with higher classification accuracy [16]. Finally, an average classification accuracy of 63% was achieved in the third configuration, which is very similar to that described in the previous system. The better performance of this study compared to the other two studies discussed is attributed to the use of a different emotional corpus in each study.

4.4. Support vector machines

An important example of the general discriminant classifiers is the support vector machine [34]. SVM classifiers are mainly based on the use of kernel functions to nonlinearly map the original features to a high-dimensional space where the data can be well classified using a linear classifier. SVM classifiers are widely used in many pattern recognition applications and have been shown to outperform other well-known classifiers [70]. They have some advantages over GMM and HMM, including the global optimality of the training algorithm [17] and the existence of excellent data-dependent generalization bounds [30]. However, their treatment of nonseparable cases is somewhat heuristic. In fact, there is no systematic way to choose the kernel functions, and hence, separability of the transformed features is not guaranteed. Moreover, in many pattern recognition applications including speech emotion recognition, it is not advised to have a perfect separation of the training data, so as to avoid over-fitting.

SVM classifiers are also used extensively for the problem of speech emotion recognition in many studies [116,73,68,101]. The performances of almost all of them are similar, and hence, only the first one will be briefly described. In this study, three approaches were investigated in order to extend the basic SVM binary classification to the multi-class case. In the first two approaches, an SVM classifier is used to model each emotion and is trained against all other emotions. In the first approach, the decision is made for the class with the highest distance to the other classes. In the second approach, the SVM output distances are fed to a 3-layer MLP classifier that produces the final output decision. The third approach followed a hierarchical classification scheme, which is described in Section 4.5. The three systems were tested using utterances from the FERMUS III corpus [112]. For speaker-independent classification, the classification accuracies are 76.12%, 75.45%, and 81.29% for the first, the second, and the third approaches, respectively. For speaker-dependent classification, the classification accuracies are 92.95%, 88.7%, and 90.95% for the first, the second, and the third approaches, respectively.

There are many other classifiers that have been applied in other studies to the problem of speech emotion recognition, such as k-NN classifiers [116], fuzzy classifiers [105], and decision trees [101]. However, the above-mentioned classifiers, especially the GMM and the HMM, are the most used ones for this task. Moreover, the performance of many of them is not significantly different from that of the above mentioned classification techniques. Table 2 compares the performance of popular classifiers employed for the task of speech emotion recognition. One might conclude that the GMM achieves the best compromise between the classification performance and the computational requirements for training and testing. However, we should be cautious that different emotional corpora with different emotion inventories were used in those individual studies. Moreover, some of those corpora are locally recorded and inaccessible to other researchers. Therefore, such a conclusion cannot be established without performing more comprehensive experiments that employ many accessible corpora for comparing the performance of different classifiers.

Table 2
Classification performance of popular classifiers employed for the task of speech emotion recognition.

                                      HMM                    GMM                     ANN                                  SVM
Average classification accuracy       75.5-78.5% [94,115]    74.83-81.94% [15,120]   51.19-52.82% [93,53]; 63-70% [99]    75.45-81.29% [116]
Average training time                 Small                  Smallest                Back-propagation: large              Large
Sensitivity to model initialization   Sensitive              Sensitive               Sensitive                            Insensitive
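The first of the three multi-class strategies described above (one SVM per emotion trained against all other emotions, deciding for the class with the largest distance to the separating hyperplane) can be sketched as follows. This is a minimal illustration based on scikit-learn, not the exact configuration of [116]; the features, labels, and kernel settings are placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder utterance-level features and emotion labels.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 20))
y_train = rng.choice(["anger", "happiness", "sadness", "neutral"], size=300)

emotions = sorted(set(y_train))
classifiers = {}
for emotion in emotions:
    # One SVM per emotion, trained against all other emotions (one-vs-rest).
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    classifiers[emotion] = clf.fit(X_train, (y_train == emotion).astype(int))

def predict(x):
    # Decide for the class whose SVM gives the largest signed distance to its
    # separating hyperplane, as in the first approach described above.
    x = np.atleast_2d(x)
    return max(emotions, key=lambda e: classifiers[e].decision_function(x)[0])

print(predict(rng.normal(size=20)))

The second approach in the text would replace the final max operation with a small MLP that takes the vector of per-class distances as input and outputs the final decision.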
4.5. Multiple classifier systems

As an alternative to highly complex classifiers that may require large computational resources for training, multiple classifier systems (MCS) have been proposed recently for the task of speech emotion recognition [113,84]. There are three approaches for combining classifiers [67,84]: hierarchical, serial, and parallel. In the hierarchical approach, classifiers are arranged in a tree structure where the set of candidate classes becomes smaller as we go deeper into the tree. At the leaf-node classifiers, only one class remains after decision. In the serial approach, classifiers are placed in a queue where each classifier reduces the number of candidate classes for the next classifier [88,139]. In the parallel approach, all classifiers work independently and a decision fusion algorithm is applied to their outputs [66].

The hierarchical approach was applied in [83] for classifying utterances from the Berlin emotional database [18], where the main goal was to improve speaker-independent emotion classification. The following emotions were selected for classification: anger, happiness, sadness, boredom, anxiety, and neutral. The hierarchical classification system was motivated by the psychological study of emotions in [110], in which emotions are represented in three dimensions: activation (arousal), potency (power), and evaluation (pleasure). Therefore, 2-stage and 3-stage hierarchical classification systems were proposed in [83]. The naive Bayesian classifier [34] was used for all classifications. Both systems are shown in Fig. 3. Prosody features included statistics of pitch, energy, duration, articulation, and zero-crossing rate. Voice quality features were calculated as parameters of the excitation spectrum, called spectral gradients [121]. The 2-stage system provided a classification accuracy of 83.5%, which is about 9% more than that obtained by the same authors in a previous study using the same voice quality features [82]. For 3-stage classification, the classification accuracy was further increased to 88.8%. In the two studies, classification accuracies are based on 10-fold cross validation, but the validation data vectors were used for both feature selection (they used the Sequential Floating Forward Search (SFFS) algorithm) and testing.

Fig. 3. 2-stage and 3-stage hierarchical classification of emotions by Lugger and Yang [83].

All three approaches for combining classifiers were applied to speech emotion recognition in [84]. The authors applied the same experimental setup as in the previous study. When the validation vectors were used for both feature selection and testing, the classification accuracies for the hierarchical, the serial, and the parallel approaches for classifier combination were 88.6%, 96.5%, and 92.6%, respectively, versus 74.6% for a single classifier. When the validation and test data sets are different, the classification accuracies reduce considerably to 58.6%, 59.7%, 61.8%, and 70.1% for the single classifier, the hierarchical approach, the serial approach, and the parallel approach for combining classifiers, respectively.

5. Conclusions

In this paper, a survey of current research work in speech emotion recognition has been given. Three important issues have been studied: the features used to characterize different emotions, the classification techniques used in previous research, and the important design criteria of emotional speech databases. There are several conclusions that can be drawn from this study.

The first one is that while high classification accuracies have been obtained for classification between high-arousal and low-arousal emotions, N-way classification is still challenging. Moreover, the performance of current stress detectors still needs significant improvement. The average classification accuracy of speaker-independent speech emotion recognition systems is less than 80% in most of the proposed techniques. In some cases, such as [93], it is as low as 50%. For speaker-dependent classification, the recognition accuracy exceeded 90% in only a few studies [116,101,98]. Many classifiers have been tried for speech emotion recognition, such as the HMM, the GMM, the ANN, and the SVM. However, it is hard to decide which classifier performs best for this task because different emotional corpora with different experimental setups were applied.

Most of the current body of research focuses on studying many speech features and their relations to the emotional content of the speech utterance. New features have also been developed, such as the TEO-based features. There are also attempts to employ different feature selection techniques in order to find the best features for this task. However, the conclusions obtained from different studies are not consistent. The main reason may be attributed to the fact that only one emotional speech database is investigated in each study.

Most of the existing databases are not perfect for evaluating the performance of a speech emotion recognizer. In many databases, it is difficult even for human subjects to determine the emotion of some recorded utterances; e.g. the human recognition accuracy was 67% for DED [38], 80% for Berlin [18], and 65% in [94]. There are some other problems for some databases, such as the low quality of the recorded utterances, the small number of available utterances, and the unavailability of phonetic transcriptions. Therefore, it is likely that some of the conclusions established in some studies cannot be generalized to other databases. To address this problem, more cooperation across research institutes in developing benchmark emotional speech databases is necessary.

In order to improve the performance of current speech emotion recognition systems, the following possible extensions are proposed. The first extension relies on the fact that speaker-dependent classification is generally easier than speaker-independent classification. At the same time, there exist speaker identification techniques with high recognition performance, such as the GMM-based text-independent speaker identification system proposed by Reynolds [107]. Thus, a speaker-independent emotion recognition system may be implemented as a combination of a speaker identification system followed by a speaker-dependent emotion recognition system.
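A minimal sketch of the suggested two-stage combination is given below. The speaker-identifier component and the per-speaker emotion models are hypothetical placeholders for concrete recognizers (for instance, GMM-based models as discussed in Section 4.2 and [107]); the sketch only illustrates the proposed system structure, not an implementation from the surveyed work.

from dataclasses import dataclass
from typing import Dict, Protocol, Sequence

class Classifier(Protocol):
    def predict(self, features: Sequence[float]) -> str: ...

@dataclass
class TwoStageEmotionRecognizer:
    """Speaker identification followed by a speaker-dependent emotion model."""
    speaker_identifier: Classifier           # hypothetical: returns a speaker id
    emotion_models: Dict[str, Classifier]    # one emotion classifier per known speaker
    fallback_model: Classifier               # speaker-independent model for unknown speakers

    def predict_emotion(self, features: Sequence[float]) -> str:
        speaker = self.speaker_identifier.predict(features)
        model = self.emotion_models.get(speaker, self.fallback_model)
        return model.predict(features)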
It is also noted that the majority of the existing classification techniques do not model the temporal structure of the training data. The only exception may be the HMM, in which time dependency may be modelled using its states. However, all the Baum–Welch re-estimation formulae are based on the assumption that all the feature vectors are statistically independent [102]. This assumption is invalid in practice. It is thought that direct modeling of the dependency between feature vectors, e.g. through the use of autoregressive models, may provide an improvement in the classification performance. Potential discriminative sequential classifiers that do not assume statistical independence between feature vectors include conditional random fields (CRF) [134] and switching linear dynamic systems (SLDS) [89].

Finally, there are only a few studies that considered applying multiple classifier systems (MCS) to speech emotion recognition [84,113]. We believe that this research direction has to be further explored. In fact, MCS is now a well-established area in pattern recognition [66,67,127,126] and there are many aggregation techniques that have not been applied to speech emotion recognition, such as AdaBoost.M1 [42] and dynamic classifier selection (DCS) [52].

References

[1] H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control 19 (6) (1974) 716–723.
[2] N. Amir, S. Ron, N. Laor, Analysis of an emotional speech corpus in Hebrew based on objective criteria, in: SpeechEmotion-2000, 2000, pp. 29–33.
[3] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, A. Stolcke, Prosody-based automatic detection of annoyance and frustration in human–computer dialog, in: Proceedings of the ICSLP 2002, 2002, pp. 2037–2040.
[4] B.S. Atal, Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, J. Acoust. Soc. Am. 55 (6) (1974) 1304–1312.
[5] T. Athanaselis, S. Bakamidis, I. Dologlou, R. Cowie, E. Douglas-Cowie, C. Cox, ASR for emotional speech: clarifying the issues and enhancing the performance, Neural Networks 18 (2005) 437–444.
[6] M.M.H. El Ayadi, M.S. Kamel, F. Karray, Speech emotion recognition using Gaussian mixture vector autoregressive models, in: ICASSP 2007, vol. 4, 2007, pp. 957–960.
[7] R. Banse, K. Scherer, Acoustic profiles in vocal emotion expression, J. Pers. Soc. Psychol. 70 (3) (1996) 614–636.
[8] A. Batliner, K. Fischer, R. Huber, J. Spiker, E. Noth, Desperately seeking emotions: actors, wizards and human beings, in: Proceedings of the ISCA Workshop Speech Emotion, 2000, pp. 195–200.
[9] S. Beeke, R. Wilkinson, J. Maxim, Prosody as a compensatory strategy in the conversations of people with agrammatism, Clin. Linguist. Phonetics 23 (2) (2009) 133–155.
[10] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[11] M. Borchert, A. Dusterhoft, Emotions in speech—experiments with prosody and quality features in speech for use in categorical and dimensional emotion recognition environments, in: Proceedings of 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE'05, 2005, pp. 147–151.
[12] L. Bosch, Emotions, speech and the ASR framework, Speech Commun. 40 (2003) 213–225.
[13] S. Bou-Ghazale, J. Hansen, A comparative study of traditional and newly proposed features for recognition of speech under stress, IEEE Trans. Speech Audio Process. 8 (4) (2000) 429–442.
[14] R. Le Bouquin, Enhancement of noisy speech signals: application to mobile radio communications, Speech Commun. 18 (1) (1996) 3–19.
[15] C. Breazeal, L. Aryananda, Recognition of affective communicative intent in robot-directed speech, Autonomous Robots 2 (2002) 83–104.
[16] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[17] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Discovery 2 (2) (1998) 121–167.
[18] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, A database of German emotional speech, in: Proceedings of the Interspeech 2005, Lissabon, Portugal, 2005, pp. 1517–1520.
[19] C. Busso, S. Lee, S. Narayanan, Analysis of emotionally salient aspects of fundamental frequency for emotion detection, IEEE Trans. Audio Speech Language Process. 17 (4) (2009) 582–596.
[20] J. Cahn, The generation of affect in synthesized speech, J. Am. Voice Input/Output Soc. 8 (1990) 1–19.
[21] D. Cairns, J. Hansen, Nonlinear analysis and detection of speech under stressed conditions, J. Acoust. Soc. Am. 96 (1994) 3392–3400.
[22] W. Campbell, Databases of emotional speech, in: Proceedings of the ISCA (International Speech Communication and Association) ITRW on Speech and Emotion, 2000, pp. 34–38.
[23] C. Chen, M. You, M. Song, J. Bu, J. Liu, An enhanced speech emotion recognition system based on discourse information, in: Lecture Notes in Computer Science—I (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3991, 2006, pp. 449–456.
[24] L. Chen, T. Huang, T. Miyasato, R. Nakatsu, Multimodal human emotion/expression recognition, in: Proceedings of the IEEE Automatic Face and Gesture Recognition, 1998, pp. 366–371.
[25] Z. Chuang, C. Wu, Emotion recognition using acoustic features and textual content, in: IEEE International Conference on Multimedia and Expo, ICME '04, vol. 1, 2004, pp. 53–56.
[26] R. Cohen, A computational theory of the function of clue words in argument understanding, in: ACL-22: Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting on Association for Computational Linguistics, 1984, pp. 251–258.
[27] R. Cowie, R.R. Cornelius, Describing the emotional states that are expressed in speech, Speech Commun. 40 (1–2) (2003) 5–32.
[28] R. Cowie, E. Douglas-Cowie, Automatic statistical analysis of the signal and prosodic signs of emotion in speech, in: Proceedings, Fourth International Conference on Spoken Language, ICSLP 96, vol. 3, 1996, pp. 1989–1992.
[29] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, S. Kollias, W. Fellenz, J. Taylor, Emotion recognition in human–computer interaction, IEEE Signal Process. Mag. 18 (2001) 32–80.
[30] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[31] J.R. Davitz, The Communication of Emotional Meaning, McGraw-Hill, New York, 1964.
[32] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. 39 (1977) 1–38.
[33] L. Devillers, L. Lamel, Emotion detection in task-oriented dialogs, in: Proceedings of the International Conference on Multimedia and Expo 2003, 2003, pp. 549–552.
[34] R. Duda, P. Hart, D. Stork, Pattern Recognition, John Wiley and Sons, 2001.
[35] D. Edwards, Emotion discourse, Culture Psychol. 5 (3) (1999) 271–291.
[36] P. Ekman, Emotion in the Human Face, Cambridge University Press, Cambridge, 1982.
[37] M. Abu El-Yazeed, M. El Gamal, M. El Ayadi, On the determination of optimal model order for GMM-based text-independent speaker identification, EURASIP J. Appl. Signal Process. 8 (2004) 1078–1087.
[38] I. Engberg, A. Hansen, Documentation of the Danish emotional speech database DES, <http://cpk.auc.dk/tb/speech/Emotions/>, 1996.
[39] Y. Ephraim, N. Merhav, Hidden Markov processes, IEEE Trans. Inf. Theory 48 (6) (2002) 1518–1569.
[40] R. Fernandez, A computational model for the automatic recognition of affect in speech, Ph.D. Thesis, Massachusetts Institute of Technology, February 2004.
[41] D.J. France, R.G. Shiavi, S. Silverman, M. Silverman, M. Wilkes, Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Trans. Biomedical Eng. 47 (7) (2000) 829–837.
[42] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[43] L. Fu, X. Mao, L. Chen, Speaker independent emotion recognition based on SVM/HMMs fusion system, in: International Conference on Audio, Language and Image Processing, ICALIP 2008, 2008, pp. 61–65.
[44] M. Gelfer, D. Fendel, Comparisons of jitter, shimmer, and signal-to-noise ratio from directly digitized versus taped voice samples, J. Voice 9 (4) (1995) 378–382.
[45] H. Go, K. Kwak, D. Lee, M. Chun, Emotion recognition from the facial image and speech signal, in: Proceedings of the IEEE SICE 2003, vol. 3, 2003, pp. 2890–2895.
[46] C. Gobl, A.N. Chasaide, The role of voice quality in communicating emotion, mood and attitude, Speech Commun. 40 (1–2) (2003) 189–212.
[47] A. Gorin, On automated language acquisition, J. Acoust. Soc. Am. 97 (1995) 3441–3461.
[48] B.J. Grosz, C.L. Sidner, Attention, intentions, and the structure of discourse, Comput. Linguist. 12 (3) (1986) 175–204.
[49] J. Hansen, D. Cairns, Icarus: source generator based real-time recognition of speech in noisy stressful and Lombard effect environments, Speech Commun. 16 (4) (1995) 391–422.
[50] J. Hernando, C. Nadeu, Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition, IEEE Trans. Speech Audio Process. 5 (1) (1997) 80–84.
[51] K. Hirose, H. Fujisaki, M. Yamaguchi, Synthesis by rule of voice fundamental frequency contours of spoken Japanese from linguistic information, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '84, vol. 9, 1984, pp. 597–600.
[52] T. Ho, J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1) (1994) 66–75.
[53] V. Hozjan, Z. Kacic, Context-independent multilingual emotion recognition from speech signal, Int. J. Speech Technol. 6 (2003) 311–320.
[54] V. Hozjan, Z. Moreno, A. Bonafonte, A. Nogueiras, Interface databases: design and collection of a multilingual emotional speech database, in: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC'02), Las Palmas de Gran Canaria, Spain, 2002, pp. 2019–2023.
[55] H. Hu, M. Xu, W. Wu, Dimensions of emotional meaning in speech, in: Proceedings of the ISCA ITRW on Speech and Emotion, 2000, pp. 25–28.
[56] H. Hu, M. Xu, W. Wu, GMM supervector based SVM with spectral features for speech emotion recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 4, 2007, pp. IV-413–IV-416.
[57] H. Hu, M.-X. Xu, W. Wu, Fusion of global statistical and segmental spectral features for speech emotion recognition, in: Interspeech 2007—8th Annual Conference of the International Speech Communication Association, vol. 2, 2007, pp. 1013–1016.
[58] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 4–37.
[59] T. Johnstone, C.M. Van Reekum, K. Hird, K. Kirsner, K.R. Scherer, Affective speech elicited with a computer game, Emotion 5 (4) (2005) 513–518.
[60] T. Johnstone, K.R. Scherer, Vocal Communication of Emotion, second ed., Guilford, New York, 2000, pp. 226–235.
[61] J. Deller Jr., J. Proakis, J. Hansen, Discrete Time Processing of Speech Signal, Macmillan, 1993.
[62] P.R. Kleinginna Jr., A.M. Kleinginna, A categorized list of emotion definitions, with suggestions for a consensual definition, Motivation Emotion 5 (4) (1981) 345–379.
[63] J. Kaiser, On a simple algorithm to calculate the 'energy' of the signal, in: ICASSP-90, 1990, pp. 381–384.
[64] L. Kaiser, Communication of affects by single vowels, Synthese 14 (4) (1962) 300–319.
[65] E. Kim, K. Hyun, S. Kim, Y. Kwak, Speech emotion recognition using eigen-FFT in clean and noisy environments, in: The 16th IEEE International Symposium on Robot and Human Interactive Communication, RO-MAN 2007, 2007, pp. 689–694.
[66] L.I. Kuncheva, A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 281–286.
[67] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.
[68] O. Kwon, K. Chan, J. Hao, T. Lee, Emotion recognition by speech signal, in: EUROSPEECH Geneva, 2003, pp. 125–128.
[69] C. Lee, S. Narayanan, Toward detecting emotions in spoken dialogs, IEEE Trans. Speech Audio Process. 13 (2) (2005) 293–303.
[70] C. Lee, S. Narayanan, R. Pieraccini, Classifying emotions in human–machine spoken dialogs, in: Proceedings of the ICME'02, vol. 1, 2002, pp. 737–740.
[71] C. Lee, S.S. Narayanan, R. Pieraccini, Classifying emotions in human–machine spoken dialogs, in: 2002 IEEE International Conference on Multimedia and Expo, ICME '02, Proceedings, vol. 1, 2002, pp. 737–740.
[72] C. Lee, R. Pieraccini, Combining acoustic and language information for emotion recognition, in: Proceedings of the ICSLP 2002, 2002, pp. 873–876.
[73] C. Lee, S. Yildrim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, S. Narayanan, Emotion recognition based on phoneme classes, in: Proceedings of ICSLP, 2004, pp. 2193–2196.
[74] L. Leinonen, T. Hiltunen, Expression of emotional-motivational connotations with a one-word utterance, J. Acoust. Soc. Am. 102 (3) (1997) 1853–1863.
[75] L. Leinonen, T. Hiltunen, I. Linnankoski, M. Laakso, Expression of emotional-motivational connotations with a one-word utterance, J. Acoust. Soc. Am. 102 (3) (1997) 1853–1863.
[76] X. Li, J. Tao, M.T. Johnson, J. Soltis, A. Savage, K.M. Leong, J.D. Newman, Stress and emotion classification using jitter and shimmer features, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 4, April 2007, pp. IV-1081–IV-1084.
[77] J. Lien, T. Kanade, C. Li, Detection, tracking and classification of action units in facial expression, J. Robotics Autonomous Syst. 31 (3) (2002) 131–146.
[78] University of Pennsylvania Linguistic Data Consortium, Emotional prosody speech and transcripts, <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S28>, July 2002.
[79] J. Liscombe, Prosody and speaker state: paralinguistics, pragmatics, and proficiency, Ph.D. Thesis, Columbia University, 2007.
[80] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1150–1157.
[81] M. Lugger, B. Yang, The relevance of voice quality features in speaker independent emotion recognition, in: ICASSP, vol. 4, 2007, pp. 17–20.
[82] M. Lugger, B. Yang, The relevance of voice quality features in speaker independent emotion recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 4, April 2007, pp. IV-17–IV-20.
[83] M. Lugger, B. Yang, Psychological motivated multi-stage emotion classification exploiting voice quality features, in: F. Mihelic, J. Zibert (Eds.), Speech Recognition, In-Tech, 2008.
[84] M. Lugger, B. Yang, Combining classifiers with diverse feature sets for robust speaker independent emotion recognition, in: Proceedings of EUSIPCO, 2009.
[85] M. Lugger, B. Yang, W. Wokurek, Robust estimation of voice quality parameters under real-world disturbances, in: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006 Proceedings, vol. 1, May 2006.
[86] J. Ma, H. Jin, L. Yang, J. Tsai, Ubiquitous Intelligence and Computing: Third International Conference, UIC 2006, Wuhan, China, September 3–6, 2006, Proceedings (Lecture Notes in Computer Science), Springer-Verlag, New York, Inc., Secaucus, NJ, USA, 2006.
[87] J. Markel, A. Gray, Linear Prediction of Speech, Springer-Verlag, 1976.
[88] D. Mashao, M. Skosan, Combining classifier decisions for robust speaker identification, Pattern Recognition 39 (1) (2006) 147–155.
[89] B. Mesot, D. Barber, Switching linear dynamical systems for noise robust speech recognition, IEEE Trans. Audio Speech Language Process. 15 (6) (2007) 1850–1858.
[90] P. Mitra, C. Murthy, S. Pal, Unsupervised feature selection using feature similarity, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 301–312.
[91] D. Morrison, R. Wang, L. De Silva, Ensemble methods for spoken emotion recognition in call-centres, Speech Commun. 49 (2) (2007) 98–112.
[92] I. Murray, J. Arnott, Toward a simulation of emotions in synthetic speech: a review of the literature on human vocal emotion, J. Acoust. Soc. Am. 93 (2) (1993) 1097–1108.
[93] J. Nicholson, K. Takahashi, R. Nakatsu, Emotion recognition in speech using neural networks, Neural Comput. Appl. 9 (2000) 290–296.
[94] T. Nwe, S. Foo, L. De Silva, Speech emotion recognition using hidden Markov models, Speech Commun. 41 (2003) 603–623.
[95] J. O'Connor, G. Arnold, Intonation of Colloquial English, second ed., Longman, London, UK, 1973.
[96] A. Oster, A. Risberg, The identification of the mood of a speaker by hearing impaired listeners, Speech Transmission Lab. Quarterly Progress Status Report 4, Stockholm, 1986, pp. 79–90.
[97] T. Otsuka, J. Ohya, Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences, in: Proceedings of the International Conference on Image Processing (ICIP-97), 1997, pp. 546–549.
[98] T.L. Pao, Y.-T. Chen, J.-H. Yeh, W.-Y. Liao, Combining acoustic features for improved emotion recognition in Mandarin speech, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3784, 2005, pp. 279–285.
[99] V. Petrushin, Emotion recognition in speech signal: experimental study, development and application, in: Proceedings of the ICSLP 2000, 2000, pp. 222–225.
[100] R.W. Picard, E. Vyzas, J. Healey, Toward machine emotional intelligence: analysis of affective physiological state, IEEE Trans. Pattern Anal. Mach. Intell. 23 (10) (2001) 1175–1191.
[101] O. Pierre-Yves, The production and recognition of emotions in speech: features and algorithms, Int. J. Human–Computer Stud. 59 (2003) 157–183.
[102] L. Rabiner, B. Juang, An introduction to hidden Markov models, IEEE ASSP Mag. 3 (1) (1986) 4–16.
[103] L. Rabiner, B. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[104] L. Rabiner, R. Schafer, Digital Processing of Speech Signals, first ed., Pearson Education, 1978.
[105] A. Razak, R. Komiya, M. Abidin, Comparison between fuzzy and NN method for speech emotion recognition, in: 3rd International Conference on Information Technology and Applications, ICITA 2005, vol. 1, 2005, pp. 297–302.
[106] D. Reynolds, T. Quatieri, R. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10 (2000) 19–41.
[107] D. Reynolds, C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Trans. Speech Audio Process. 3 (1) (1995) 72–83.
[108] J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465–471.
[109] K.R. Scherer, Vocal affect expression. A review and a model for future research, Psychological Bull. 99 (2) (1986) 143–165.
[110] H. Schlosberg, Three dimensions of emotion, Psychological Rev. 61 (2) (1954) 81–88.
[111] M. Schubiger, English intonation: its form and function, Niemeyer, Tubingen, Germany, 1958.
[112] B. Schuller, Towards intuitive speech interaction by the integration of emotional aspects, in: 2002 IEEE International Conference on Systems, Man and Cybernetics, vol. 6, 2002, p. 6.
[113] B. Schuller, M. Lang, G. Rigoll, Robust acoustic speech emotion recognition by ensembles of classifiers, in: Proceedings of the DAGA'05, 31. Deutsche Jahrestagung für Akustik, DEGA, 2005, pp. 329–330.
[114] B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, G. Rigoll, Speaker independent speech emotion recognition by ensemble classification, in: IEEE International Conference on Multimedia and Expo, ICME 2005, 2005, pp. 864–867.
[115] B. Schuller, G. Rigoll, M. Lang, Hidden Markov model-based speech emotion recognition, in: International Conference on Multimedia and Expo (ICME), vol. 1, 2003, pp. 401–404.
[116] B. Schuller, G. Rigoll, M. Lang, Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture, in: Proceedings of the ICASSP 2004, vol. 1, 2004, pp. 577–580.
[117] M.T. Shami, M.S. Kamel, Segment-based approach to the recognition of emotions in speech, in: IEEE International Conference on Multimedia and Expo, ICME 2005, 2005, 4 pp.
[118] L.C. De Silva, T. Miyasato, R. Nakatsu, Facial emotion recognition using multimodal information, in: Proceedings of the IEEE International Conference on Information, Communications and Signal Processing (ICICS'97), 1997, pp. 397–401.
[119] L.C. De Silva, T. Miyasato, R. Nakatsu, Facial emotion recognition using multi-modal information, in: Proceedings of 1997 International Conference on Information, Communications and Signal Processing, ICICS, vol. 1, September 1997, pp. 397–401.
[120] M. Slaney, G. McRoberts, BabyEars: a recognition system for affective vocalizations, Speech Commun. 39 (2003) 367–384.
[121] K. Stevens, H. Hanson, Classification of glottal vibration from acoustic measurements, Vocal Fold Physiol. (1994) 147–170.
[122] R. Sun, E. Moore, J.F. Torres, Investigating glottal parameters for differentiating emotional categories with similar prosodics, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2009, April 2009, pp. 4509–4512.
[123] J. Tao, Y. Kang, A. Li, Prosody conversion from neutral speech to emotional speech, IEEE Trans. Audio Speech Language Process. 14 (4) (2006) 1145–1154.
[124] H. Teager, Some observations on oral air flow during phonation, IEEE Trans. Acoust. Speech Signal Process. 28 (5) (1990) 599–601.
[125] H. Teager, S. Teager, Evidence for nonlinear production mechanisms in the vocal tract, in: Speech Production and Speech Modelling, NATO Advanced Institute, vol. 55, 1990, pp. 241–261.
[126] A. Tsymbal, M. Pechenizkiy, P. Cunningham, Diversity in search strategies for ensemble feature selection, Inf. Fusion 6 (32) (2005) 146–156.
[127] A. Tsymbal, S. Puuronen, D.W. Patterson, Ensemble feature selection with the simple Bayesian classification, Inf. Fusion 4 (32) (2003) 146–156.
[128] D. Ververidis, C. Kotropoulos, Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm, in: IEEE International Conference on Multimedia and Expo, ICME 2005, July 2005, pp. 1500–1503.
[129] D. Ververidis, C. Kotropoulos, Emotional speech recognition: resources, features and methods, Speech Commun. 48 (9) (2006) 1162–1181.
[130] D. Ververidis, C. Kotropoulos, I. Pitas, Automatic emotional speech classification, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (ICASSP '04), vol. 1, 2004, pp. I-593–I-596.
[131] A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inf. Theory 13 (2) (1967) 260–269.
[132] N. Vlassis, A. Likas, A kurtosis-based dynamic approach to Gaussian mixture modeling, IEEE Trans. Syst. Man Cybern. 29 (4) (1999) 393–399.
[133] N. Vlassis, A. Likas, A greedy EM algorithm for Gaussian mixture learning, Neural Process. Lett. 15 (2002) 77–87.
[134] Y. Wang, K.-F. Loe, J.-K. Wu, A dynamic conditional random field model for foreground and shadow segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2) (2006) 279–289.
[135] C. Williams, K. Stevens, Emotions and speech: some acoustical correlates, J. Acoust. Soc. Am. 52 (4 Pt 2) (1972) 1238–1250.
[136] C. Williams, K. Stevens, Vocal correlates of emotional states, in: Speech Evaluation in Psychiatry, Grune and Stratton, 1981, pp. 189–220.
[137] I. Witten, E. Frank, Data Mining, Morgan Kaufmann, Los Altos, CA, 2000.
[138] B.D. Womack, J.H.L. Hansen, N-channel hidden Markov models for combined stressed speech classification and recognition, IEEE Trans. Speech Audio Process. 7 (6) (1999) 668–677.
[139] J. Wu, M.D. Mullin, J.M. Rehg, Linear asymmetric classifier for cascade detectors, in: 22nd International Conference on Machine Learning, 2005.
[140] M. You, C. Chen, J. Bu, J. Liu, J. Tao, Getting started with SUSAS: a speech under simulated and actual stress database, in: EUROSPEECH-97, vol. 4, 1997, pp. 1743–1746.
[141] M. You, C. Chen, J. Bu, J. Liu, J. Tao, Emotion recognition from noisy speech, in: IEEE International Conference on Multimedia and Expo, 2006, pp. 1653–1656.
[142] M. You, C. Chen, J. Bu, J. Liu, J. Tao, Emotional speech analysis on nonlinear manifold, in: 18th International Conference on Pattern Recognition, ICPR 2006, vol. 3, 2006, pp. 91–94.
[143] M. You, C. Chen, J. Bu, J. Liu, J. Tao, A hierarchical framework for speech emotion recognition, in: IEEE International Symposium on Industrial Electronics, 2006, vol. 1, 2006, pp. 515–519.
[144] S. Young, Large vocabulary continuous speech recognition, IEEE Signal Process. Mag. 13 (5) (1996) 45–57.
[145] G. Zhou, J. Hansen, J. Kaiser, Nonlinear feature based classification of speech under stress, IEEE Trans. Speech Audio Process. 9 (3) (2001) 201–216.
[146] J. Zhou, G. Wang, Y. Yang, P. Chen, Speech emotion recognition based on rough set and SVM, in: 5th IEEE International Conference on Cognitive Informatics, ICCI 2006, vol. 1, 2006, pp. 53–61.
Moataz M.H. El Ayadi received his B.Sc. degree (Hons) in Electronics and Communication Engineering, Cairo University, in 2000, M.Sc. degree in Engineering Mathematics and
Physics, Cairo University, in 2004, and Ph.D. degree in Electrical and Computer Engineering, University of Waterloo, in 2008.
He worked as a postdoctoral research fellow in the Electrical and Computer Engineering Department, University of Toronto, from January 2009 to March 2010. Since April
2010, he has been an assistant professor in the Engineering Mathematics and Physics Department, Cairo University.
His research interests include statistical pattern recognition and speech processing. His master's work was on enhancing the performance of text-independent speaker
identification systems that use Gaussian mixture models as the core statistical classifier. The main contribution was the development of a new model order selection technique
based on a goodness-of-fit statistical test. He followed the same line of research in his Ph.D.
Mohamed S. Kamel received the B.Sc. (Hons) EE (Alexandria University), M.A.Sc. (McMaster University), Ph.D. (University of Toronto).
He joined the University of Waterloo, Canada, in 1985 where he is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory at the
Department of Electrical and Computer Engineering and holds a University Research Chair. Professor Kamel held Canada Research Chair in Cooperative Intelligent Systems
from 2001 to 2008.
Dr. Kamel’s research interests are in Computational Intelligence, Pattern Recognition, Machine Learning and Cooperative Intelligent Systems. He has authored and
co-authored over 390 papers in journals and conference proceedings, 11 edited volumes, two patents and numerous technical and industrial project reports. Under his
supervision, 81 Ph.D. and M.A.Sc. students have completed their degrees.
He is the Editor-in-Chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, Pattern Recognition Letters, Cognitive
Neurodynamics journal and Pattern Recognition J. He is also member of the editorial advisory board of the International Journal of Image and Graphics and the Intelligent
Automation and Soft Computing journal. He also served as Associate Editor of Simulation, the Journal of The Society for Computer Simulation.
Based on his work at the NCR, he received the NCR Inventor Award. He is also a recipient of the Systems Research Foundation Award for outstanding presentation in 1985 and
the ISRAM best paper award in 1992. In 1994 he was awarded the IEEE Computer Society Press outstanding referee award. He was also a coauthor of the best paper in the
2000 IEEE Canadian Conference on Electrical and Computer Engineering. Dr. Kamel is a recipient of the University of Waterloo outstanding performance award twice and of the Faculty
of Engineering distinguished performance award. Dr. Kamel is a member of ACM and PEO, Fellow of IEEE, Fellow of the Engineering Institute of Canada (EIC), Fellow of the Canadian
Academy of Engineering (CAE) and selected to be a Fellow of the International Association of Pattern Recognition (IAPR) in 2008. He served as consultant for General Motors,
NCR, IBM, Northern Telecom and Spar Aerospace. He is co-founder of Virtek Vision Inc. of Waterloo and chair of its Technology Advisory Group. He served as member of the
board from 1992 to 2008 and VP research and development from 1987 to 1992.
Fakhreddine Karray (S’89,M90,SM’01) received Ing. Dipl. in Electrical Engineering from University of Tunis, Tunisia (84) and Ph.D. degree from the University of Illinois,
Urbana-Champaign, USA (89). He is Professor of Electrical and Computer Engineering at the University of Waterloo and the Associate Director of the Pattern Analysis and
Machine Intelligence Lab. Dr. Karray’s current research interests are in the areas of autonomous systems and intelligent man–machine interfacing design. He has authored
more than 200 articles in journals and conference proceedings. He is the co-author of 13 patents and the co-author of a recent textbook on soft computing: Soft Computing and
Intelligent Systems Design, Addison Wesley Publishing, 2004. He serves as the associate editor of the IEEE Transactions on Mechatronics, the IEEE Transactions on Systems Man
and Cybernetics (B), the International Journal of Robotics and Automation and the Journal of Control and Intelligent Systems. He is the Associate Editor of the IEEE Control
Systems Society’s Conference Proceedings. He has served as Chair (or) co-Chair of more than eight International conferences. He is the General Co-Chair of the IEEE Conference
on Logistics and Automation, China, 2008. Dr. Karray is the KW Chapter Chair of the IEEE Control Systems Society and the IEEE Computational Intelligence Society. He is
co-founder of Intelligent Mechatronics Systems Inc. and of Voice Enabling Systems Technology Inc.