
2020 IEEE International Conference on Advent Trends in Multidisciplinary Research and Innovation (ICATMRI)

Emotion Recognition from German Speech Using Feature Selection Based Decision Support System

978-1-7281-7734-2/20/$31.00 ©2020 IEEE | DOI: 10.1109/ICATMRI51801.2020.9398316

Sanjay Sharma
Department of Computer Science & Engg
Kautilya Institute of Technology & Engineering, Jaipur, Rajasthan, India
[email protected]

Chetan Kumar
Department of Computer Science & Engg
Kautilya Institute of Technology & Engineering, Jaipur, Rajasthan, India
[email protected]

Abstract— The speech signal plays a vital role in human communication. In recent years, emotion characterization from speech has gained popularity in the domain of signal and speech processing. In this article we propose a decision support system for emotion recognition from a German speech signal database. A set of 535 speech signals with 7 emotion labels is investigated using prominent feature descriptors, viz. Mel frequency coefficients, signal energy, pitch, zero crossing rate, and formant features, to generate a set of 37 features for each speech signal. The decision support system is modelled in feature space using a Support Vector Machine, and the resulting classification accuracy is further enhanced using Information Theory based feature selection measures to achieve a highest test classification accuracy of 93 percent.

Keywords: Emotion, machine learning, feature selection, information theory.

I INTRODUCTION

Both stress and emotion are psycho-physiological states involving characteristic somatic and autonomic responses [1]. Stress is a mental and biological term describing the loss of the ability to respond appropriately to difficult psychological and physical conditions, whether actual or anticipated. Stress is observed as subjective strain, dysfunctional physiological activity, and deterioration of performance [2]. Common stress symptoms include a state of high arousal, elevated heart rate, excessive adrenaline production, breakdown of coping mechanisms, feelings of tension and exhaustion, and an inability to concentrate. Stress may be caused by external factors (workload, noise, vibration, sleep loss, and so forth) or by internal factors (fatigue, and so on) [3].

Until recently, speech scientists and engineers tended to overlook the pragmatic and paralinguistic aspects of dialogue in their effort to develop models of spoken interaction for speech technology applications. Yet the speech signal conveys not only linguistic information between speakers but also paralinguistic details of the speaker's inner state, personality, behaviour, thoughts, stress level, and current emotional state.

Speech stress and emotion recognition and classification aim to automatically detect stress and emotion in speech signals by analyzing vocal behaviour as a marker of affect (e.g. emotions and stress), concentrating on the nonverbal facet of dialogue. Moreover, by assessing a speaker's stress level, stress classification can support intelligent assessment of the mental state of individuals working in hazardous environments and of people carrying a high level of responsibility. Various clinical applications of affect analysis in speech have been reported for the diagnosis of conditions such as depression, autism, Alzheimer's disease, dementia, and schizophrenia [4]. These applications illustrate the importance of speech as a diagnostic signal providing valuable information about the speaker's psychological and physiological condition. Recent mobile technologies can use emotion recognition to triage emergency call information, or to handle calls by monitoring the mental state of callers. Other commercial applications of emotion detection are the interactive video game industry, which seeks to provide the sensation of naturalistic human-like interaction, and the entertainment market, which equips intelligent agents with sensitivity to the player's mood and the capacity to react accordingly through affective voice or facial expression. Thus emotion identification using automated systems has grown significantly in recent years.

During the early 1990s, progress in computer systems made it feasible to develop new, sophisticated feature extraction and classification techniques. Computers became capable of handling the huge amount of information contained in hours of audio-visual recordings within reasonable computation time.

These improvements opened the way to practical implementations of significant new speech technologies, including speaker recognition and, more recently, stress and emotion recognition [1].

The idea of automatic emotion recognition was introduced in the mid-1980s, when several authors suggested applying statistical properties of speech to intelligent emotion identification [2]. The prosodic characteristics (fundamental frequency, formants, energy, and rhythm) were widely exploited for stress and emotion recognition because they are linked to the arousal dimension of emotions. It was observed that the prosodic features of speech produced under stress and emotion differ from those produced under neutral conditions. The most frequently occurring changes involve variations in utterance duration, lowered or raised pitch, shifting of formants, and different levels of energy. For example, in [3] the authors showed that anger is characterized by a higher level of energy and pitch than the other four emotions studied: disgust, anxiety, pleasure, and sadness. It was also shown that male speakers express anger with a lower speech rate than female speakers under similar conditions. In article [4] the authors extracted 10 statistically analyzed prosodic features from energy, formants, pitch, and the percentage of voiced frames to separate non-negative (positive and neutral) from negative emotions. Comparable feature sets were used in [5] to distinguish different kinds of emotions. Typically, statistical prosodic features failed to achieve high emotion recognition accuracy and reliability. This may be partly due to the difficulty of determining the exact values of the prosodic features, and to the problems of estimating parameters on a speaker- and text-independent basis.

MFCCs have also been applied to stress and emotion classification in speech [6], leading to relatively reasonable results. In [7] the authors deployed MFCCs to differentiate between six types of emotions (anger, disgust, fear, pleasure, sadness, and joy). A total of 12 speakers was used to build 720 utterances. The classification accuracy for the six types of emotions using the MFCC feature was 59 percent, better than the linear prediction cepstral coefficients (LPCC), which attained 56.1 percent, but not as good as the 78 percent accuracy achieved by the proposed short-time log frequency power coefficients. In [8] the authors employed MFCC features computed from both the glottal waveform and the speech signal to categorize 4 emotions: happy, angry, sad, and neutral. The database of emotional speech was gathered in an anechoic chamber. The results revealed that the best performance was offered by MFCCs deploying 6 coefficients for the glottal waveform, achieving a classification accuracy of around 60 percent. In article [9] the authors utilized feature vectors including pitch, jitter, shimmer, formants, and MFCCs to detect anxiety-type emotions arising during abnormal situations. The SAFE corpus was developed within this work, based on fiction movies. The corpus includes recordings of both normal and abnormal situations to separate fearful from neutral emotions. Despite the diversity of the data, the system obtained a promising result, with a recognition rate close to 70%.

In [10] the authors applied MFCC features to detect three emotions (frustrated, polite, and neutral) in children during spontaneous dialogue interactions with computer characters. It was observed that the MFCC deployment obtained the better classification accuracies, ranging from 66.4 percent to 70.6 percent.

Principal component analysis (PCA) is a relatively simple and by far the most commonly used filter type of feature selection. The authors in [11] labeled four emotions (happy, angry, neutral, and sad) and employed principal component analysis to reduce the dimensionality of acoustic features derived from pitch, speaking rate, and energy.

In [12], features linked to energy, pitch, and formants were extracted from the Danish Emotional Speech database. Sequential forward selection (SFS) was used to automatically pick the best performing features for each gender. The classification results were 50 percent for both genders, 61 percent for males, and 57 percent for females. In the follow-up research, article [13] reported a small improvement in the classification rates when sequential forward selection (SFS) was replaced with sequential floating forward selection (SFFS).

II FEATURE SPACE DESCRIPTORS

Before any audio signal can be assigned to a specific class, the features of the sound signal need to be extracted. These extracted features determine the classification of the signal. Feature extraction strategies may be classified as temporal analysis and spectral analysis methods. The computational load is weighed against the importance of each feature descriptor choice. The prominent features correspond to duration, energy, and pitch, which are directly connected with tempo, loudness, and intonation respectively.

Duration: These features give the temporal properties of voiced and unvoiced segments. They operate directly on the temporal signal.

Mel-Frequency Cepstrum Coefficients: These coefficients result from a transformation to a cepstral space, in order to capture the time-varying spectral envelope. A cepstrum is obtained by applying an inverse Fourier transform to the logarithm of the signal spectrum, separating in the quefrency domain the slowly varying spectral envelope from the more rapidly varying spectral fine structure:

C = | F^-1( log( | F(f(t)) |^2 ) ) |^2
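For illustration, the cepstrum transformation above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' MATLAB pipeline; the frame length, sampling rate, and test tone are arbitrary assumptions.

```python
import numpy as np

def power_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Power cepstrum of one windowed frame:
    C = |IFFT(log(|FFT(x)|^2))|^2; the epsilon guards log(0)."""
    spectrum = np.fft.fft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    return np.abs(np.fft.ifft(log_power)) ** 2

# Example: 25 ms Hann-windowed frame of a synthetic 200 Hz tone at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t) * np.hanning(t.size)
cep = power_cepstrum(frame)
# Low-quefrency bins carry the smooth spectral envelope; a peak near
# lag sr/200 = 80 samples reflects the harmonic fine structure.
```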

Energy: The energy of a signal x within a particular window of N samples is given by:

E = sum_{n=1..N} x(n) x*(n)

Zero Traversing Rate (ZTR): The zero crossings feature counts the number of times the sign of the speech signal amplitude changes in the time domain within a frame. For single-voiced speech signals, zero crossings are used to obtain a rough estimate of the fundamental frequency. For complex signals, it is a simple measure of noisiness. The Zero Traversing Rate computes how often the speech signal changes its sign:

ZTR = (1/2) sum_{n} | sgn(x(n)) - sgn(x(n-1)) |
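A direct frame-wise implementation of the two formulas above might look as follows. This is a sketch only; the 25 ms frame and 10 ms hop at 16 kHz are assumptions, not values taken from the paper.

```python
import numpy as np

def frame_energy_ztr(x: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Per-frame energy E = sum x(n)x*(n) and zero traversing rate
    ZTR = (1/2) sum |sgn(x(n)) - sgn(x(n-1))| for a real signal x."""
    energies, ztrs = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energies.append(float(np.sum(frame * frame)))
        # each sign flip contributes |diff| = 2, so 0.5 counts one crossing
        ztrs.append(0.5 * float(np.sum(np.abs(np.diff(np.sign(frame))))))
    return np.array(energies), np.array(ztrs)
```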
Pitch: The signal that emerges from the vocal tract begins at the larynx, where the vocal cords are located, and ends at the mouth. The shape of the vocal tract and the vibration of the vocal cords are controlled by nerves from the brain. The sounds we make can be classified into voiced and unvoiced sounds. During the generation of unvoiced sounds the vocal cords do not vibrate and remain open, whereas during voiced sounds they vibrate and generate what is known as the glottal pulse. A pulse is a superposition of a sinusoidal wave at the fundamental frequency and its harmonics (amplitude decreases as frequency increases). The fundamental frequency of the glottal pulse is called the pitch.
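The paper does not state which pitch tracker was used; a common baseline consistent with the description above is the autocorrelation method, sketched here under that assumption (the 60-400 Hz search range is also an assumption):

```python
import numpy as np

def pitch_autocorr(frame: np.ndarray, sr: int,
                   fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of a voiced frame as the lag of
    the autocorrelation maximum within [sr/fmax, sr/fmin] samples.
    The frame must be at least sr/fmin samples long."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```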
Centroid: The spectral centroid is defined as the center of gravity of the spectrum. The centroid is a measure of spectral shape, and higher centroid values correspond to brighter textures with more high frequencies:

C = ( sum_{k=1..N/2} f[k] X[k] ) / ( sum_{k=1..N/2} X[k] )

where f[k] is the frequency at bin k and X[k] the corresponding spectral magnitude. The centroid tracks the sharpness of the sound. Sharpness corresponds to the high-frequency content of the spectrum; higher centroid values indicate spectra concentrated in the higher frequencies. Because of its effectiveness in describing spectral shape, centroid measures are used in audio classification tasks.
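The centroid formula translates directly into code; a minimal sketch over one frame, with the frame and sampling rate supplied by the caller:

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sr: int) -> float:
    """C = sum(f[k] * X[k]) / sum(X[k]) over the one-sided (N/2)
    magnitude spectrum, as in the formula above."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
```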
Audio Spectrum Flatness: This is defined as the deviation of the spectral shape from that of a flat spectrum. Flat spectra correspond to noise- or impulse-like signals, hence higher flatness values represent noisiness, while lower flatness values usually indicate the presence of harmonic components. Instead of computing one flatness value for the whole spectrum, a division into frequency bands is carried out, resulting in one vector of flatness values per frame. The flatness of any band is defined as the ratio of the geometric to the arithmetic mean of the power spectrum coefficients within that band. Each vector is reduced to a scalar by computing the mean value over the bands for each particular frame, thereby arriving at a scalar feature that specifies the overall flatness.

Harmonic Ratio: A computation of the proportion of harmonic components within the spectrum, defined as the maximum value of the autocorrelation (AC) of each frame.

All of these classical acoustic features still show some difficulty in distinguishing contentment from frustration, or misery from boredom. Our goal is therefore to improve results by concentrating on the harmony and melody of speech. This approach is possible under the assumption that our brain temporally integrates what we hear within a short period of time of at least 3 or 4 seconds. It is therefore possible to analyze pitch intervals or triads as though they occurred simultaneously, even though speech is by nature sequential.
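The band-wise flatness just described can be sketched as below. Equal-width linear bands are an assumption made for simplicity (standards such as MPEG-7 use logarithmically spaced bands):

```python
import numpy as np

def mean_band_flatness(frame: np.ndarray, n_bands: int = 4) -> float:
    """Flatness of each band = geometric mean / arithmetic mean of the
    power spectrum coefficients in that band; the per-band vector is
    then reduced to one scalar by averaging over the bands."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    bands = np.array_split(power, n_bands)
    flatness = [float(np.exp(np.mean(np.log(b))) / np.mean(b)) for b in bands]
    return float(np.mean(flatness))
```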
III FEATURE SELECTION METHOD

Of the extracted feature set, only a few features are significant, which means irrelevant features have to be identified and excluded from the retrieved feature set. For this purpose, in this work we have used Information Theory to rank features on the basis of their Joint Mutual Information (JMI) with respect to the remaining features as well as the associated class label. Let A and B be two discrete random variables representing elements of the feature set; the mutual information between these two features is defined as:

I(A;B) = H(A) - H(A|B),

where H(.) represents the Shannon entropy of a single variable, while H(.|.) denotes the conditional entropy of two random variables measured over their conditional probability distribution. Let the features of a sample instance be associated with some class label C; then the conditional mutual information between features A and B given class label C can be represented as:

I(A;B|C) = H(A|B) - H(A|B,C)

Using the joint mutual information principle, features are ranked according to their information with respect to the next ranked feature and the associated class label for that instance.
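As an illustration of the ranking procedure, the sketch below implements one common formulation of the JMI criterion on discretized features: a candidate is scored by the mutual information it carries jointly with each already-ranked feature about the class label. The paper does not spell out its exact recipe, so the details here are assumptions; features are presumed binned to non-negative integers.

```python
import numpy as np

def entropy(v: np.ndarray) -> float:
    """Shannon entropy H(V) of a discrete variable, in nats."""
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_info(a: np.ndarray, b: np.ndarray) -> float:
    """I(A;B) = H(A) + H(B) - H(A,B), equivalent to H(A) - H(A|B)."""
    joint = a * (b.max() + 1) + b          # encode the pair (a, b) as one variable
    return entropy(a) + entropy(b) - entropy(joint)

def jmi_rank(X: np.ndarray, y: np.ndarray) -> list:
    """Greedy JMI ranking over a matrix X of discretized integer features:
    seed with argmax I(f; y), then repeatedly add the feature f that
    maximizes the sum over ranked s of I((f, s); y)."""
    remaining = set(range(X.shape[1]))
    order = [max(remaining, key=lambda j: mutual_info(X[:, j], y))]
    remaining.discard(order[0])
    while remaining:
        def score(j):
            return sum(
                mutual_info(X[:, j] * (X[:, s].max() + 1) + X[:, s], y)
                for s in order)
        best = max(remaining, key=score)
        order.append(best)
        remaining.discard(best)
    return order                           # feature indices, best first
```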

IV EXPERIMENTAL SETUP & SIMULATION RESULTS

The experiments were performed in MATLAB 2013a running on an Intel Core i5 processor operating at 5.5 GHz, with Windows 7 as the base platform for executing MATLAB commands. In this work we have investigated the performance of the support vector machine on features extracted from the speech audio files of the EMO-DB Berlin database of emotional speech [14].

Seven categories of emotion classes are incorporated in this research work, corresponding to the happy, anger, boredom, disgust, anxiety, neutral, and sadness moods of persons, distributed over 526 speech files in the German language. Various relevant features are extracted from each audio file, including pitch, energy, zero transition rate, Mel frequency coefficient means and variances, and a set of four elementary formant frequencies, defining a feature vector of size 37.

The feature matrix is ranked by Joint Mutual Information so as to produce a newly generated feature matrix of the same size as the original, but with the features ordered according to their mutual information with the next ranked feature and the associated class label. The whole dataset is segregated into training and test datasets of sizes 421 and 105 respectively. During the simulation, incremental sets of the top-ranked features are sliced and applied as input to the Support Vector Machine to train and delineate multiclass boundaries in feature space, which are later evaluated on the test dataset to obtain the test classification accuracy. The simulation outcomes for decision space learning on the training and test datasets are depicted in Fig. 1 and Fig. 2 respectively.

Fig. 1. Training Accuracy Using Support Vector Machine

Fig. 2. Test Accuracy Using Support Vector Machine

It is apparent from Figs. 1 and 2 that as the feature set size increases, the classification accuracy improves, since more feature descriptors are introduced to distinguish the different emotion labels in feature space. Moreover, it is important to note that with the set of the initially ranked 29 features, the model reaches the highest test accuracy of 93 percent, which later reduces to 91 percent for the full feature set. This conveys that, among all 37 features, the JMI-ranked top 29 features are relevant for classifying the emotion labels in feature space, while the remaining features are redundant.
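The incremental evaluation described above can be reproduced with a short scikit-learn sketch. The split sizes (421/105) and the 37-feature count follow the paper; the RBF kernel, feature scaling, and random seed are assumptions, since the paper does not report SVM hyperparameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def incremental_svm_accuracy(X: np.ndarray, y: np.ndarray) -> list:
    """X: (526, 37) feature matrix with columns already in JMI rank order;
    y: the seven emotion labels.  Returns test accuracy for the top-k
    feature subsets, k = 1..37."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=421, test_size=105, stratify=y, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    accs = []
    for k in range(1, X.shape[1] + 1):
        clf = SVC(kernel="rbf")            # multiclass handled one-vs-one
        clf.fit(X_tr[:, :k], y_tr)
        accs.append(clf.score(X_te[:, :k], y_te))
    return accs  # the paper reports a peak of 93% at the top-29 subset
```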

V CONCLUSION & FUTURE SCOPE


Emotion plays a very significant role in human life, and speech is a unique way of communicating with other persons. In this work we developed a model to classify German speech signals into various emotion labels. The research objective was to improve on the performance of a raw Support Vector Machine model. The efficiency of emotion classification was successfully improved by using the Joint Mutual Information based feature selection method to exclude irrelevant features from the extracted feature set. The experimental outcomes evidence a promising result, with an emotion classification accuracy of 93 percent.

REFERENCES
[1] Sailunaz, Kashfia, Manmeet Dhaliwal, Jon Rokne, and Reda Alhajj. "Emotion detection from text and speech: a survey." Social Network Analysis and Mining 8, no. 1 (2018): 28.
[2] Chebbi, Safa, and Sofia Ben Jebara. "On the use of pitch-based features for fear emotion detection from speech." In 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 1-6. IEEE, 2018.
[3] Jain, Udit, Karan Nathani, Nersisson Ruban, Alex Noel Joseph Raj, Zhemin Zhuang, and Vijayalakshmi GV Mahesh. "Cubic SVM classifier based feature extraction and emotion detection from speech signals." In 2018 International Conference on Sensor Networks and Signal Processing (SNSP), pp. 386-391. IEEE, 2018.

[4] Healy, Michael, Ryan Donovan, Paul Walsh, and Huiru Zheng. "A
machine learning emotion detection platform to support affective well
being." In 2018 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), pp. 2694-2700. IEEE, 2018.
[5] Latif, Siddique, Rajib Rana, Shahzad Younis, Junaid Qadir, and Julien
Epps. "Transfer learning for improving speech emotion classification
accuracy." arXiv preprint arXiv:1801.06353 (2018).
[6] Kudiri, Krishna Mohan, Abas Md Said, and M. YunusNayan. "Human
emotion detection through speech and facial expressions." In 2016 3rd
International Conference on Computer and Information Sciences
(ICCOINS), pp. 351-356. IEEE, 2016.
[7] Khan, Atreyee, and Uttam Kumar Roy. "Emotion recognition using
prosodie and spectral features of speech and Naïve Bayes Classifier."
In 2017 international conference on wireless communications, signal
processing and networking (WiSPNET), pp. 1017-1021. IEEE, 2017.
[8] Cevher, Deniz, Sebastian Zepf, and Roman Klinger. "Towards
multimodal emotion recognition in german speech events in cars using
transfer learning." arXiv preprint arXiv:1909.02764 (2019).
[9] Rodrigues, Manuel, Dalila Durães, Ricardo Santos, and Cesar Analide.
"Emotion Detection Throughout the Speech." In Proceedings of SAI
Intelligent Systems Conference, pp. 304-314. Springer, Cham, 2020.
[10] Aggarwal, Gaurav, and Rekha Vig. "Acoustic Methodologies for
Classifying Gender and Emotions using Machine Learning Algorithms."
In 2019 Amity International Conference on Artificial Intelligence
(AICAI), pp. 672-677. IEEE, 2019.
[11] Tripathi, Anjali, Upasana Singh, Garima Bansal, Rishabh Gupta, and
Ashutosh Kumar Singh. "A Review on Emotion Detection and
Classification using Speech." Available at SSRN 3601803 (2020).
[12] Mohanty, Mihir Narayan. "Emotion Analysis of Different Age Groups
From Voice Using Machine Learning Approach." In Critical
Approaches to Information Retrieval Research, pp. 150-171. IGI Global,
2020.
[13] Özseven, Turgut. "A novel feature selection method for speech emotion
recognition." Applied Acoustics 146 (2019): 320-326.
[14] Burkhardt, Felix, Astrid Paeschke, Miriam Rolfes, Walter Sendlmeier, and Benjamin Weiss. "A Database of German Emotional Speech." In Proceedings of Interspeech 2005, Lisbon, Portugal, 2005.
