
International Journal of Soft Computing and Engineering (IJSCE)

ISSN: 2231-2307, Volume-2, Issue-1, March 2012

Speech Emotion Recognition


Ashish B. Ingale, D. S. Chaudhari

Abstract— In human–machine interface applications, emotion recognition from the speech signal has been a research topic for many years, and many systems have been developed to identify emotions from the speech signal. This paper reviews speech emotion recognition based on previous work that uses different classifiers for emotion recognition. The classifiers are used to differentiate emotions such as anger, happiness, sadness, surprise, and the neutral state. The database for a speech emotion recognition system consists of emotional speech samples, and the features extracted from these samples are energy, pitch, linear prediction cepstrum coefficients (LPCC), and Mel frequency cepstrum coefficients (MFCC). Classification performance depends on the extracted features. Inferences about the performance and limitations of speech emotion recognition systems based on the different classifiers are also discussed.

Keywords— Classifier, Emotion recognition, Feature extraction, Feature Selection.

I. INTRODUCTION

There are many ways of communicating, but the speech signal is one of the fastest and most natural methods of communication between humans. Speech can therefore also be a fast and efficient method of interaction between human and machine [1]. Humans have the natural ability to use all their available senses for maximum awareness of the received message, and through these senses people sense the emotional state of their communication partner. Emotion detection is natural for humans but a very difficult task for a machine. The purpose of an emotion recognition system is therefore to use emotion-related knowledge in such a way that human–machine communication is improved [2].

In speech emotion recognition, the emotions in the speech of male or female speakers are identified [1]. In the past century some speech features were studied, including the fundamental frequency, Mel frequency cepstrum coefficients (MFCC), and linear prediction cepstrum coefficients (LPCC), which form the basis of speech processing even today. In one study, spectrograms of real and acted emotional speech were compared and similar recognition rates were found for both, which suggests that the latter can be used for speech emotion recognition. In another study, a correlation between emotions and speech features was presented. Human and machine emotion recognition rates have also been compared, and the same recognition rates were found for both. After this study, a speech emotion recognition system using a Hidden Markov Model was presented and achieved an accuracy of 70% for seven emotional states. In another study, a Support Vector Machine for speech emotion recognition of four different emotions obtained an accuracy of 73% [4, 5].

Emotion recognition from a speaker's speech is very difficult for the following reasons. It is not clear which particular speech features are most useful for differentiating between emotions. The existence of different sentences, speakers, speaking styles, and speaking rates introduces acoustic variability that directly affects the speech features. The same utterance may express different emotions, and each emotion may correspond to different portions of the spoken utterance, so it is very difficult to differentiate these portions. Another problem is that emotion expression depends on the speaker and his or her culture and environment: as the culture and environment change, the speaking style also changes, which is another challenge for a speech emotion recognition system. Finally, there may be two or more types of emotion present, a long-term emotion and a transient one, and it is not clear which type the recognizer will detect [1].

Emotion recognition from speech may be speaker dependent or speaker independent. The available classifiers include k-nearest neighbors (KNN), Hidden Markov Models (HMM), Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Gaussian Mixture Models (GMM); this paper reviews these classifiers [4]. Applications of speech emotion recognition include psychiatric diagnosis, intelligent toys, lie detection, call centre conversations, which are the most important application for automated recognition of emotions from speech, and in-car board systems, where information about the mental state of the driver may be provided to the system for his or her safety [1].

The paper is organized as follows. Section two describes the overall structure of a speech emotion recognition system. The features extracted in feature extraction and the details of feature selection are discussed in section three. Classification schemes that can be used in a speech emotion recognition system are described in section four.

II. SPEECH EMOTION RECOGNITION SYSTEM

Speech emotion recognition is essentially a pattern recognition system, so the stages present in a pattern recognition system are also present in a speech emotion recognition system. The speech emotion recognition system contains five main modules: emotional speech input, feature extraction, feature selection, classification, and recognized emotional output [2].

Ashish B. Ingale, Department of Electronics and Telecommunication Engineering, Government College of Engineering, Amravati, India, Mobile Phone No.: +919860305773, (e-mail: [email protected]).
D. S. Chaudhari, Department of Electronics and Telecommunication Engineering, Government College of Engineering, Amravati, India, (e-mail: [email protected]).
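The five-module flow described above can be sketched as a minimal skeleton. This is illustrative only: the single energy/variance features, the threshold, and the two emotion labels are invented placeholders standing in for real trained models, not part of the paper.

```python
# Minimal sketch of the five-module pipeline: emotional speech input ->
# feature extraction -> feature selection -> classification -> recognized
# emotional output. All numeric values and the rule-based "classifier"
# are hypothetical placeholders for trained models.

def extract_features(samples):
    """Feature extraction: global statistics over one utterance."""
    n = len(samples)
    energy = sum(x * x for x in samples) / n            # mean short-term energy
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return {"energy": energy, "variance": variance}

def select_features(features, keep=("energy",)):
    """Feature selection: retain only the chosen subset of features."""
    return {name: features[name] for name in keep}

def classify(features, threshold=0.5):
    """Classifier: a toy threshold rule standing in for GMM/HMM/SVM/ANN/KNN."""
    return "anger" if features["energy"] > threshold else "sadness"

# Emotional speech input (toy waveform) -> recognized emotional output
utterance = [0.9, -0.8, 0.85, -0.9, 0.8]
emotion = classify(select_features(extract_features(utterance)))
print(emotion)
```

A real system would replace `extract_features` with energy, pitch, MFCC, and LPCC extraction and `classify` with one of the classifiers reviewed in section four.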


The structure of the speech emotion recognition system is shown in Figure 1.

Speech input → Feature Extraction → Feature Selection → Classifier → Recognized emotion

Figure 1. Structure of the Speech Emotion Recognition System.

A main concern in a speech emotion recognition system is the need to find the set of significant emotions to be classified by an automatic emotion recognizer. A typical set of emotions contains 300 emotional states, and classifying such a great number of emotions is very complicated. According to the "palette theory", any emotion can be decomposed into primary emotions, similar to the way any color is a combination of some basic colors. The primary emotions are anger, disgust, fear, joy, sadness, and surprise [1].

The evaluation of a speech emotion recognition system is based on the degree of naturalness of the database used as its input. If an inferior database is used as the input, incorrect conclusions may be drawn. The database may contain real-world emotions or acted ones; it is more practical to use a database collected from real-life situations [1].

III. FEATURE EXTRACTION AND SELECTION

Any emotion in a speaker's speech is represented by a large number of parameters contained in the speech, and changes in these parameters result in corresponding changes in emotion. The extraction of the speech features that represent emotions is therefore an important factor in a speech emotion recognition system [6]. Speech features can be divided into two main categories, long-term and short-term features. The region of analysis of the speech signal used for feature extraction is an important issue to be considered: the speech signal is divided into small intervals referred to as frames [1].

The prosodic features are known as the primary indicators of a speaker's emotional state. Research on the emotion of speech indicates that pitch, energy, duration, formants, Mel frequency cepstrum coefficients (MFCC), and linear prediction cepstrum coefficients (LPCC) are the important features [5, 6]. With different emotional states, corresponding changes occur in the speaking rate, pitch, energy, and spectrum. Typically, anger has a higher mean value and variance of pitch and a higher mean energy. In the happy state there is an increase in the mean value, variation range, and variance of pitch and in the mean energy. On the other hand, the mean value, variation range, and variance of pitch decrease in sadness; the energy is also weak, the speaking rate is slow, and the high-frequency components of the spectrum decrease. Fear shows a high mean value and variation range of pitch and an increase in the high-frequency components of the spectrum. Statistics of pitch, energy, and some spectral features can therefore be extracted to recognize emotions from speech [5, 6].

One of the main speech features indicating emotion is energy, and the study of energy depends on the short-term energy and the short-term average amplitude [6]. As the arousal level of emotions is associated with the short-term speech energy, it can be used in the field of emotion recognition. The pitch signal, also referred to as the glottal waveform, is another main feature indicating emotion in speech. The pitch signal depends on the tension of the vocal folds and the subglottal air pressure, and it is produced by the vibration rate of the vocal cords. The pitch signal is characterized by two features: the pitch frequency and the glottal air velocity at the instant the vocal folds open. The number of harmonics present in the spectrum is directly affected by the pitch frequency [7].

Linear prediction cepstrum coefficients (LPCC) give details about the characteristics of the vocal channel of an individual person, and these channel characteristics change in accordance with different emotions, so these features can be used to extract emotions from speech. The merits of LPCC are that it involves less computation, its algorithm is more efficient, and it describes the vowels well. Mel frequency cepstrum coefficients (MFCC) are extensively used in speech recognition and speech emotion recognition systems, and the recognition rate of MFCC is very good. With MFCC, better frequency resolution and robustness to noise can be achieved in the low-frequency region than in the high-frequency region [6]. The Mel frequency cepstrum is a representation of the short-term power spectrum of a sound [4].

Not all of the basic speech features extracted may be helpful and essential for a speech emotion recognition system. Giving all the extracted features as input to the classifier does not guarantee the best system performance, which shows that there is a need to remove unhelpful features from the base feature set. A systematic feature selection is therefore needed to reduce the feature set. The forward selection (FS) method can be used to select the best feature subset. Forward selection initializes with the single best feature out of the whole feature set; the remaining features that increase the classification accuracy are then added one by one, and when the number of added features reaches a preset number, the selection process stops [10].

IV. CLASSIFIER SELECTION

In a speech emotion recognition system, after the features are calculated, the best features are provided to a classifier, which recognizes the emotion in the speaker's utterance. Various types of classifier have been proposed for the task of speech emotion recognition: Gaussian Mixture Models (GMM), k-nearest neighbors (KNN), Hidden Markov Models (HMM), Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc. Each classifier has some advantages and limitations over the others.

The Gaussian Mixture Model is most suitable for speech emotion recognition only when global features are extracted from the training utterances. All the training and testing equations are based on the supposition that all vectors are independent.
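The likelihood-based decision rule underlying this kind of classification can be illustrated with a deliberately simplified sketch: one single-component Gaussian per emotion over one global feature (a hypothetical mean-pitch value), with the utterance assigned to the model giving the highest log-likelihood. The means and variances below are invented for illustration, not trained parameters, and a real GMM would use several weighted components over multi-dimensional feature vectors.

```python
import math

# One (single-component) Gaussian per emotion over a global feature.
# Parameters are invented illustrative values, not trained estimates.
EMOTION_MODELS = {
    "anger":   (250.0, 900.0),   # (mean, variance) of mean pitch in Hz
    "sadness": (150.0, 400.0),
}

def log_likelihood(x, mean, var):
    """Log density of a one-dimensional Gaussian evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(mean_pitch):
    """Assign the utterance to the emotion model with the highest likelihood."""
    return max(EMOTION_MODELS,
               key=lambda e: log_likelihood(mean_pitch, *EMOTION_MODELS[e]))

print(classify(240.0))   # high mean pitch
print(classify(140.0))   # low mean pitch
```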

Because of this independence assumption, GMM cannot model the temporal structure of the training data. For the best features, a maximum accuracy of 78.77% can be achieved using GMM. In speaker-independent recognition, a typical performance of 75% is obtained, and 89.12% for speaker-dependent recognition using GMM [1].

Another classifier used for emotion classification is the artificial neural network (ANN), which is used due to its ability to find nonlinear boundaries separating the emotional states. Of the many types, the feed-forward neural network is used most frequently in speech emotion recognition [7]. Multilayer perceptron neural networks are relatively common in speech emotion recognition, as they are easy to implement and have a well-defined training algorithm [1]. ANN-based classifiers may achieve a correct classification rate of 51.19% in speaker-dependent recognition and 52.87% in speaker-independent recognition.

The k-nearest neighbor (k-NN) classifier allocates an utterance to an emotional state according to the emotional states of the k nearest utterances. The classifier can classify all the utterances in the design set properly if k equals 1, but its performance on the test set is then reduced. Utilizing the information of pitch and energy contours, the k-NN classifier attains a classification rate of 64% for four emotional states [7].

In speech recognition tasks such as isolated word recognition and in speech emotion recognition, the hidden Markov model is generally used; the main reason is its physical relation to the speech production mechanism. In speech emotion recognition, the HMM has achieved great success in modeling the temporal information in the speech spectrum. The HMM is a doubly stochastic process consisting of a first-order Markov chain whose states are hidden from the observer [1]. For speech emotion recognition, typically a single HMM is trained for each emotion, and an unknown sample is classified according to the model that best describes the derived feature sequence [3]. The HMM has two important advantages: the temporal dynamics of speech features can be captured, and well-established procedures are available for optimizing the recognition framework. The main problem in building an HMM-based recognition model is the feature selection process, because it is not enough that the features carry information about the emotional states; they must fit the HMM structure as well. HMM provides better classification accuracies for speech emotion recognition compared with the other classifiers [5]. HMM classifiers using prosody and formant features have considerably lower recall rates than classifiers using spectral features [9]. In a previous study, the accuracy of speech emotion recognition using an HMM classifier was observed to be 76.12% for the speaker-dependent case and 64.77% for the speaker-independent case [1].

Transforming the original feature set into a high-dimensional feature space by using a kernel function is the main idea behind the support vector machine (SVM) classifier, which leads to optimum classification in this new feature space. Kernel functions such as the linear, polynomial, and radial basis function (RBF) kernels can be used in an SVM model. SVM classifiers are generally used in pattern recognition and classification problems, and because of that they are used in the speech emotion recognition system. SVM has much better classification performance compared to other classifiers [1, 4]. The emotional states can be separated with a large margin by using an SVM classifier; this margin is the width of the largest tube, containing no utterances, that can be drawn around the decision boundary, and the support vectors are the measurement vectors that define the boundaries of the margin. The original SVM classifier was designed only for two-class problems, but it can be used for more classes. Because of its training oriented toward structural risk minimization, SVM has high generalization capability. The accuracies of SVM for speaker-independent and speaker-dependent classification are 75% and above 80%, respectively [1, 7].

V. CONCLUSION

Speech emotion recognition systems based on several classifiers have been illustrated. The important components of a speech emotion recognition system are the signal processing unit, in which appropriate features are extracted from the available speech signal, and the classifier, which recognizes emotions from the speech signal. The average accuracy of most of the classifiers for a speaker-independent system is lower than that for a speaker-dependent one.

Automatic emotion recognition from human speech is increasing nowadays because it results in better interaction between human and machine. To improve the emotion recognition process, combinations of the given methods can be derived. The accuracy of the speech emotion recognition system can also be enhanced by extracting more effective speech features.

REFERENCES
[1] M. E. Ayadi, M. S. Kamel, F. Karray, "Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases", Pattern Recognition 44, pp. 572-587, 2011.
[2] I. Chiriacescu, "Automatic Emotion Analysis Based On Speech", M.Sc. Thesis, Delft University of Technology, 2009.
[3] T. Vogt, E. Andre and J. Wagner, "Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realization", LNCS 4868, pp. 75-91, 2008.
[4] S. Emerich, E. Lupu, A. Apatean, "Emotions Recognition by Speech and Facial Expressions Analysis", 17th European Signal Processing Conference, 2009.
[5] A. Nogueiras, A. Moreno, A. Bonafonte, J. B. Marino, "Speech Emotion Recognition Using Hidden Markov Model", Eurospeech, 2001.
[6] P. Shen, Z. Changjun, X. Chen, "Automatic Speech Emotion Recognition Using Support Vector Machine", International Conference on Electronic and Mechanical Engineering and Information Technology, 2011.
[7] D. Ververidis and C. Kotropoulos, "Emotional Speech Recognition: Resources, Features and Methods", Elsevier Speech Communication, vol. 48, no. 9, pp. 1162-1181, September 2006.
[8] Z. Ciota, "Feature Extraction of Spoken Dialogs for Emotion Detection", ICSP, 2006.
[9] E. Bozkurt, E. Erzin, C. E. Erdem, A. Tanju Erdem, "Formant Position Based Weighted Spectral Features for Emotion Recognition", ScienceDirect Speech Communication, 2011.
[10] C. M. Lee, S. S. Narayanan, "Towards Detecting Emotions in Spoken Dialogs", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, March 2005.

Ashish B. Ingale received the B.E. degree in Electronics and Telecommunication Engineering from Sant Gadge Baba Amravati University, Amravati, in 2008, and is currently pursuing the M. Tech. degree in Electronic System and Communication (ESC) at Government College of Engineering, Amravati. He has attended one-day workshops on "VLSI & EDA Tools & Technology in Education" and "Cadence-OrCad EDA Technology" at Government College of Engineering, Amravati.


Devendra S. Chaudhari obtained the B.E. and M.E. degrees from Marathwada University, Aurangabad, and the Ph.D. from Indian Institute of Technology, Bombay, Powai, Mumbai. He has been engaged in teaching and research for about 25 years and worked on a DST-SERC sponsored Fast Track Project for Young Scientists. He has worked as Head of Electronics and Telecommunication, Instrumentation, Electrical, and Research, and as in-charge Principal at Government Engineering Colleges. Presently he is working as Head, Department of Electronics and Telecommunication Engineering at Government College of Engineering, Amravati.
Dr. Chaudhari has published research papers and presented papers at international conferences abroad, in Seattle, USA and in Austria. He has worked as Chairman / Expert Member on different committees of the All India Council for Technical Education and the Directorate of Technical Education for approval, graduation, inspection, and variation of intake of diploma and degree engineering institutions. As a university-recognized Ph.D. research supervisor in Electronics and Computer Science Engineering, he has been supervising research work since 2001. One research scholar received a Ph.D. under his supervision.
He has worked as Chairman / Member on different university and college level committees such as Examination, Academic, Senate, and Board of Studies. He chaired one of the technical sessions of an international conference held at Nagpur. He is a fellow of IE and IETE, a life member of ISTE and BMESI, and a member of IEEE (2007). He is a recipient of the Best Engineering College Teacher Award of ISTE, New Delhi, the Gold Medal Award of IETE, New Delhi, and the Engineering Achievement Award of IE (I), Nashik. He has organized
various Continuing Education Programmes and delivered Expert Lectures on
research at different places. He has also worked as ISTE Visiting Professor
and visiting faculty member at Asian Institute of Technology, Bangkok,
Thailand. His present research and teaching interests are in the field of
Biomedical Engineering, Digital Signal Processing and Analogue Integrated
Circuits.

