structure of the speech emotion recognition system is shown in Figure 1.

[Figure 1: Speech input → Feature Extraction → Feature Selection → Classifier → Recognized Emotion]

Figure 1. Structure of the Speech Emotion Recognition System.

A main concern in a speech emotion recognition system is the need to determine the set of significant emotions to be classified by an automatic emotion recognizer. A typical set contains as many as 300 emotional states, so classifying such a great number of emotions is very complicated. According to the "palette theory", any emotion can be decomposed into primary emotions, similar to the way any color is a combination of a few basic colors. The primary emotions are anger, disgust, fear, joy, sadness, and surprise [1].

The evaluation of a speech emotion recognition system depends on the degree of naturalness of the database used as its input. If an inferior database is used, incorrect conclusions may be drawn. The input database may contain real-world emotions or acted ones; it is more practical to use a database collected from real-life situations [1].
III. FEATURE EXTRACTION AND SELECTION

Any emotion in the speaker's speech is represented by a large number of parameters contained in the speech signal, and changes in these parameters produce corresponding changes in emotion. The extraction of the speech features that represent emotion is therefore an important factor in a speech emotion recognition system [6]. Speech features can be divided into two main categories, long-term and short-term features. The region of the speech signal used for feature extraction is an important issue to consider: the speech signal is divided into small intervals referred to as frames [1].
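As a concrete illustration of the framing step just described, the sketch below splits a signal into short overlapping windowed frames. It is a minimal example under assumed parameter values (25 ms frames with a 10 ms hop at 16 kHz), not code from this paper.

    import numpy as np

    def frame_signal(signal, frame_len=400, hop_len=160):
        # Split a 1-D speech signal into overlapping short-term frames.
        # At 16 kHz, 400 samples = 25 ms frames and a 160-sample hop
        # = 10 ms (assumed, typical values). Assumes len(signal) >= frame_len.
        n_frames = 1 + (len(signal) - frame_len) // hop_len
        frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                           for i in range(n_frames)])
        # A Hamming window on each frame reduces spectral leakage.
        return frames * np.hamming(frame_len)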
Prosodic features are known as the primary indicators of the speaker's emotional state. Research on emotion in speech indicates that pitch, energy, duration, formants, Mel frequency cepstrum coefficients (MFCC), and linear prediction cepstrum coefficients (LPCC) are the important features [5, 6]. Different emotional states produce corresponding changes in speaking rate, pitch, energy, and spectrum. Typically, anger shows a higher mean value and variance of pitch and a higher mean energy. In the happy state, the mean value, variation range, and variance of pitch increase, as does the mean energy. In sadness, by contrast, the mean value, variation range, and variance of pitch decrease, the energy is weak, the speaking rate is slow, and the high-frequency components of the spectrum are reduced. Fear is characterized by a high mean value and variation range of pitch and increased high-frequency spectral components. Statistics of pitch, energy, and some spectral features can therefore be extracted to recognize emotions from speech [5, 6].

One of the main speech features indicating emotion is energy, whose study depends on the short-term energy and the short-term average amplitude [6]. Since the arousal level of an emotion is associated with the short-term speech energy, energy can be used for emotion recognition. The pitch signal, also referred to as the glottal waveform, is another main feature indicating emotion in speech. It depends on the tension of the vocal folds and the subglottal air pressure, and it is produced by the vibration rate of the vocal cords. The pitch signal is characterized by two features: the pitch frequency and the glottal air velocity at the instant the vocal folds open. The number of harmonics present in the spectrum is directly affected by the pitch frequency [7].
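A minimal sketch of the two short-term measurements just discussed: frame energy and a simple autocorrelation-based pitch estimate. This is an illustrative implementation under assumed values (16 kHz sampling, a 50-400 Hz pitch search range), not the method of any cited study.

    import numpy as np

    def short_term_energy(frames):
        # Energy of each windowed frame: the sum of squared samples.
        return np.sum(frames ** 2, axis=1)

    def pitch_autocorr(frame, sr=16000, fmin=50, fmax=400):
        # Crude pitch estimate: the lag of the autocorrelation peak
        # within the plausible pitch range is taken as the period.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = sr // fmax, sr // fmin
        lag = lo + np.argmax(ac[lo:hi])
        return sr / lag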
Linear prediction cepstrum coefficients (LPCC) give details about the characteristics of the vocal channel of an individual person, and these channel characteristics change with the different emotions, so these features can be used to extract the emotion in speech. The merits of LPCC are that it involves less computation, its algorithm is efficient, and it describes vowels well. Mel frequency cepstrum coefficients (MFCC) are extensively used in speech recognition and speech emotion recognition systems, and the recognition rate achieved with MFCC is very good. MFCC offers better frequency resolution and robustness to noise in the low-frequency region than in the high-frequency region [6]. The Mel frequency cepstrum is a representation of the short-term power spectrum of a sound [4].
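For concreteness, MFCC features of the kind described above can be computed with an off-the-shelf library such as librosa; the sketch below is an assumption for illustration, not part of the paper. It extracts 13 coefficients per frame and summarizes them as utterance-level statistics, a common way to obtain a fixed-length global feature vector.

    import numpy as np
    import librosa

    def mfcc_features(wav_path, n_mfcc=13):
        y, sr = librosa.load(wav_path, sr=16000)  # load and resample
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        # Mean and standard deviation of each coefficient over time
        # give one fixed-length vector per utterance.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])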
Not all of the basic speech features extracted may be helpful and essential for the speech emotion recognition system. Feeding all the extracted features to the classifier does not guarantee the best system performance, which shows that the unhelpful features need to be removed from the base set; systematic feature selection is therefore required to reduce the feature set. The forward selection (FS) method can be used to select the best feature subset. Forward selection initializes with the single best feature of the whole feature set; the remaining features are then added one by one, keeping those that increase classification accuracy. When the number of selected features reaches a preset limit, the selection process stops [10].
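The greedy procedure described above might be sketched as follows; the scoring criterion (cross-validated classifier accuracy) and the feature-matrix layout are assumptions made for illustration.

    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, clf, max_features=10):
        # Greedily add the feature that most improves cross-validated
        # accuracy until the preset number of features is reached.
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            best_score, best_f = max(
                (cross_val_score(clf, X[:, selected + [f]], y).mean(), f)
                for f in remaining)
            selected.append(best_f)
            remaining.remove(best_f)
        return selected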
IV. CLASSIFIER SELECTION

In a speech emotion recognition system, after the features are calculated, the best features are provided to a classifier, which recognizes the emotion in the speaker's utterance. Various types of classifier have been proposed for this task, including the Gaussian Mixture Model (GMM), k-nearest neighbors (KNN), the Hidden Markov Model (HMM), the Support Vector Machine (SVM), and the Artificial Neural Network (ANN). Each classifier has some advantages and limitations over the others.

The Gaussian Mixture Model is suitable for speech emotion recognition only when global features are extracted from the training utterances. All the training and testing equations are based on the supposition that the feature vectors are independent, so a GMM cannot model the temporal structure of the training data. With the best features, a maximum accuracy of 78.77% could be achieved using a GMM; typical performance is 75% for speaker-independent recognition and 89.12% for speaker-dependent recognition [1].
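One common arrangement consistent with this description fits one GMM per emotion on global feature vectors and classifies an utterance by maximum likelihood. The component count and data layout below are illustrative assumptions.

    from sklearn.mixture import GaussianMixture

    def train_gmms(features_by_emotion, n_components=8):
        # One GMM per emotion, fitted on that emotion's (N, D) matrix
        # of global utterance-level feature vectors.
        return {emo: GaussianMixture(n_components=n_components).fit(X)
                for emo, X in features_by_emotion.items()}

    def classify_gmm(gmms, x):
        # Choose the emotion whose model yields the highest
        # log-likelihood for the utterance vector x.
        return max(gmms, key=lambda emo: gmms[emo].score(x.reshape(1, -1)))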
Another classifier used for emotion classification is the artificial neural network (ANN), chosen for its ability to find the nonlinear boundaries separating the emotional states. Of the many types, the feed-forward neural network is used most frequently in speech emotion recognition [7]. Multilayer perceptron networks are relatively common in speech emotion recognition because they are easy to implement and have a well-defined training algorithm [1]. ANN-based classifiers may achieve a correct classification rate of 51.19% in speaker-dependent recognition and 52.87% in speaker-independent recognition.
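A feed-forward network of the kind mentioned can be sketched with scikit-learn's MLPClassifier; the hidden-layer size and iteration limit are assumed values, not settings from the cited studies.

    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Feature scaling followed by a small multilayer perceptron;
    # one hidden layer of 64 units is an illustrative choice.
    ann = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))
    # Usage: ann.fit(X_train, y_train); y_pred = ann.predict(X_test)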
The k-nearest neighbor (k-NN) classifier assigns an utterance to an emotional state according to the emotional states of its k nearest utterances. If k equals 1, the classifier can classify all the utterances in the design set correctly, but its performance on the test set is reduced. Utilizing the information in pitch and energy contours, the k-NN classifier attains a correct classification rate of 64% for four emotional states [7].
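A minimal k-NN setup matching this description, with the neighbor count chosen as an assumption (a small odd k avoids ties and the k = 1 overfitting noted above):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Distance-based classification over pitch/energy contour statistics;
    # scaling matters because k-NN is sensitive to feature ranges.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    # Usage: knn.fit(X_train, y_train); y_pred = knn.predict(X_test)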
The hidden Markov model (HMM) is widely used both in speech recognition tasks such as isolated word recognition and in speech emotion recognition, the main reason being its physical relation to the production mechanism of the speech signal. In speech emotion recognition, the HMM has achieved great success in modeling the temporal information in the speech spectrum. The HMM is a doubly stochastic process consisting of a first-order Markov chain whose states are hidden from the observer [1]. For speech emotion recognition, typically a single HMM is trained for each emotion, and an unknown sample is classified according to the model that best describes the derived feature sequence [3]. The HMM has the important advantages that it can capture the temporal dynamics of speech features and that well-established procedures exist for optimizing the recognition framework. The main problem in building an HMM-based recognition model is the feature selection process: it is not enough that the features carry information about the emotional states; they must also fit the HMM structure. The HMM provides better classification accuracies for speech emotion recognition than the other classifiers [5], although HMM classifiers using prosody and formant features have considerably lower recall rates than classifiers using spectral features [9]. In a previous study, the accuracy of speech emotion recognition using an HMM classifier was observed to be 76.12% for speaker-dependent and 64.77% for speaker-independent recognition [1].
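The one-model-per-emotion scheme could be sketched with the hmmlearn package (an implementation choice assumed here; the paper names no library). Each model is trained on frame-level feature sequences of one emotion, and a test sequence is assigned to the highest-scoring model.

    import numpy as np
    from hmmlearn import hmm

    def train_hmms(sequences_by_emotion, n_states=5):
        # sequences_by_emotion maps an emotion to a list of (T, D)
        # frame-feature arrays; one Gaussian HMM is fitted per emotion.
        models = {}
        for emo, seqs in sequences_by_emotion.items():
            X, lengths = np.vstack(seqs), [len(s) for s in seqs]
            models[emo] = hmm.GaussianHMM(n_components=n_states).fit(X, lengths)
        return models

    def classify_hmm(models, seq):
        # The model giving the feature sequence the highest
        # log-likelihood determines the recognized emotion.
        return max(models, key=lambda emo: models[emo].score(seq))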
The main idea behind the support vector machine (SVM) classifier is to transform the original feature set into a high-dimensional feature space using a kernel function, which leads to optimum classification in this new feature space. Kernel functions such as the linear, polynomial, and radial basis function (RBF) kernels are widely used in SVM models. SVM classifiers are generally used in major applications such as pattern recognition and classification problems, and for that reason they are also used in speech emotion recognition systems, where the SVM gives much better classification performance than other classifiers [1, 4]. With an SVM classifier, the emotional states can be separated by a large margin; this margin is the width of the largest tube, containing no utterances, that can be drawn around the decision boundary, and the support vectors are the measurement vectors that define the boundaries of the margin. The original SVM classifier was designed only for two-class problems, but it can be used for more classes. Because of its structural risk minimization oriented training, the SVM has high generalization capability. The accuracies of the SVM for speaker-independent and speaker-dependent classification are 75% and above 80%, respectively [1, 7].
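An RBF-kernel SVM of the kind described can be set up as below; scikit-learn handles the extension beyond two classes internally via one-vs-one voting. The C and gamma values are assumptions that would normally be tuned by cross-validation.

    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # The RBF kernel maps features into a high-dimensional space where
    # a maximum-margin boundary is sought between emotion classes.
    svm = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=10.0, gamma="scale"))
    # Usage: svm.fit(X_train, y_train); y_pred = svm.predict(X_test)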
V. CONCLUSION

Speech emotion recognition systems based on several classifiers have been illustrated. The important issues in a speech emotion recognition system are the signal processing unit, in which appropriate features are extracted from the available speech signal, and the classifier, which recognizes emotions from the speech signal. For most classifiers, the average accuracy of a speaker-independent system is lower than that of a speaker-dependent one.

Automatic emotion recognition from human speech is receiving growing attention nowadays because it results in better interaction between human and machine. To improve the emotion recognition process, combinations of the given methods can be derived; the accuracy of the speech emotion recognition system can also be enhanced by extracting more effective speech features.

REFERENCES

[1] M. E. Ayadi, M. S. Kamel, F. Karray, "Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases", Pattern Recognition, vol. 44, pp. 572-587, 2011.
[2] I. Chiriacescu, "Automatic Emotion Analysis Based On Speech", M.Sc. thesis, Delft University of Technology, 2009.
[3] T. Vogt, E. Andre and J. Wagner, "Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realization", LNCS 4868, pp. 75-91, 2008.
[4] S. Emerich, E. Lupu, A. Apatean, "Emotions Recognition by Speech and Facial Expressions Analysis", 17th European Signal Processing Conference, 2009.
[5] A. Nogueiras, A. Moreno, A. Bonafonte, J. B. Marino, "Speech Emotion Recognition Using Hidden Markov Model", Eurospeech, 2001.
[6] P. Shen, Z. Changjun, X. Chen, "Automatic Speech Emotion Recognition Using Support Vector Machine", International Conference on Electronic and Mechanical Engineering and Information Technology, 2011.
[7] D. Ververidis and C. Kotropoulos, "Emotional Speech Recognition: Resources, Features and Methods", Speech Communication, vol. 48, no. 9, pp. 1162-1181, September 2006.
[8] Z. Ciota, "Feature Extraction of Spoken Dialogs for Emotion Detection", ICSP, 2006.
[9] E. Bozkurt, E. Erzin, C. E. Erdem, A. T. Erdem, "Formant Position Based Weighted Spectral Features for Emotion Recognition", Speech Communication, 2011.
[10] C. M. Lee, S. S. Narayanan, "Towards Detecting Emotions in Spoken Dialogs", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, March 2005.

Ashish B. Ingale received the B.E. degree in Electronics and Telecommunication Engineering from Sant Gadge Baba Amravati University, Amravati, in 2008, and is currently pursuing the M.Tech. degree in Electronic System and Communication (ESC) at Government College of Engineering Amravati. He has attended one-day workshops on "VLSI & EDA Tools & Technology in Education" and "Cadence-OrCad EDA Technology" at Government College of Engineering Amravati.