


Speech Emotion Recognition with Deep Learning
Pavol Harár1, Radim Burget1 and Malay Kishore Dutta2

1 Dept. of Telecommunications, Brno University of Technology, Brno, Czech Republic; [email protected], [email protected]
2 Dept. of Electronics and Communication Engineering, Amity University, Sector 125, Noida, 201313 Uttar Pradesh, India; [email protected]

Abstract— This paper describes a method for Speech Emotion Recognition (SER) using a Deep Neural Network (DNN) architecture with convolutional, pooling and fully connected layers. We used a 3-class subset (angry, neutral, sad) of a German corpus (Berlin Database of Emotional Speech) containing 271 labeled recordings with a total length of 783 seconds. Raw audio data were standardized so that every audio file has zero mean and unit variance. Every file was split into 20 millisecond segments without overlap. We used a Voice Activity Detection (VAD) algorithm to eliminate silent segments and divided all data into TRAIN (80%), VALIDATION (10%) and TESTING (10%) sets. The DNN is optimized using Stochastic Gradient Descent. As input we used raw data without any feature selection. Our trained model achieved an overall test accuracy of 96.97% on whole-file classification.

I. INTRODUCTION

The primary objective of Speech Emotion Recognition (SER) is to aid human-to-machine interaction. Despite the fact that a lot of progress has been made in the area of Speech Recognition (SR) in past years, there is still a need for better computer understanding of human emotions, which could lead to further improvement of human-to-machine interaction systems [1].

The ideal way to reach this objective, as the trends are showing, might be to create an end-to-end learning algorithm which is capable of processing the raw input signal directly, resulting in the desired performance with as little human knowledge and work as possible [2]. Hence, in this paper we investigate the possibilities in unison with this idea.

Nowadays, we are at the dawn of Deep Learning (DL), because in a short time it has dramatically improved the state-of-the-art in many domains, including SR. This approach allows us to use complex multi-layer models that learn representations of data with multiple levels of abstraction. SER is not an exception, since convolutional nets as well as recurrent nets are applicable to this problem. The main advantage of DL is the fact that it requires very little engineering by hand, and it can benefit from today's increases in data amounts and computational power [3].

The remainder of this paper is organized as follows. Section II introduces the related papers in this area of expertise. In Section III, the methodology of our experiment is discussed. The results are presented in Section IV. Conclusions are drawn in Section V.

II. RELATED WORK

Most related papers published in the past decade use spectral and prosodic features extracted from the raw audio signal prior to recognition itself, usually Mel frequency cepstrum coefficients (MFCC), Mel energy spectrum dynamic coefficients (MEDC), Linear prediction cepstrum coefficients (LPCC), Perceptual linear prediction cepstrum coefficients (PLP), pitch, formants and energy. After extraction, several classifiers have been proposed for this task, including Support Vector Machines (SVM), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), Bayesian Networks (BN), K-nearest Neighbors (KNN) or Gaussian Mixture Models (GMM) [4][5][6].

Yixiong Pan in [7] used an SVM for 3-class emotion classification on the Berlin Database of Emotional Speech [8] and achieved 95.1% accuracy, which is, to our knowledge, the best result yet published in this particular matter.

S. Lalitha in [9] used pitch and prosody features with an SVM classifier, reporting 81.1% accuracy on 7 classes of the whole Berlin Database of Emotional Speech. Yu Zhou in [10] combined prosodic and spectral features with a Gaussian mixture model supervector based SVM and reported 88.35% accuracy on 5 classes of the Chinese-LDC corpus. Fei Wang used a combination of a Deep Auto Encoder, various features and an SVM in [5] and reported 83.5% accuracy on 6 classes of the Chinese emotion corpus CASIA.

In contrast to these traditional approaches, more novel papers employing Deep Neural Networks in their experiments have been published recently, with promising results.

Jianwei Niu in [11] used various features in their recognition system and combined a DNN with an HMM, reporting 92.3% accuracy on 6 classes of 7676 spoken Mandarin Chinese sentences. H.M. Fayek in [12] explored various DNN architectures and reported accuracy around 60% on two different databases, eNTERFACE [13] and SAVEE [14], with 6 and 7 classes respectively.

Sadly, we are not aware of any paper that used Deep Learning for Speech Emotion Recognition on the Berlin Database of Emotional Speech.

III. METHODOLOGY

A. Data
The data set we used was a German corpus (Berlin Database of Emotional Speech) that contains about 800 sentences (7 emotion classes * 5 female and 5 male actors * 10 different sentences + some second versions).
All sentences were recorded in an anechoic chamber using high-quality equipment with a sampling frequency of 48 kHz and later downsampled to 16 kHz (mono) [8].

This preliminary experiment was conducted on a smaller subset of this corpus containing 271 labeled recordings with a total length of 783 seconds. Because of the inequality between classes, and in order to get results comparable with [7], we used all sentences from all actors but only from 3 emotional states: angry (127 recordings, 334 seconds), neutral (79 recordings, 186 seconds) and sad (65 recordings, 263 seconds). To remove the silent parts of the audio signal, as depicted in Fig. 1, a Google WebRTC voice activity detector (VAD) [15] was incorporated into our preprocessing. All audio files were standardized to have zero mean and unit variance.

Fig. 1. An example of silent segment detection on the raw audio signal of utterance 08a01Wa.wav.

We split every file into 20 millisecond chunks with no overlap - vectors of length 320 (16 kHz * 20 ms) - obtaining a total of 39052 segments. We then removed the 3098 silent segments according to the VAD and split the prepared data into TRAINING (79.56%), VALIDATION (9.84%) and TESTING (10.60%) sets. To obtain an even distribution of all classes in the training and validation sets, we adjusted the number of segments used from each class according to the class with the smallest number of segments available; hence we used only 73.86% of the valid (non-silent) speech segments, resulting in 21129 training segments (7043 for each class) and 2613 validation segments (871 for each class). The rest was used as testing data, with 2814 segments from 33 audio files (11 angry, 12 neutral, 10 sad). All segments used for testing were taken from files that had not been seen by the DNN during training or validation.
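To make the preprocessing pipeline concrete, the following Python sketch illustrates the segmentation and silence-removal steps described above. It is our own illustration, not the authors' code: it assumes 16 kHz mono WAV input, uses the `webrtcvad` Python binding of the Google WebRTC VAD [15] and the `soundfile` library for I/O, and the aggressiveness setting and function name are our choices.

```python
# Minimal sketch of the preprocessing described above (not the authors' code):
# standardize each file, cut it into non-overlapping 20 ms frames (320 samples
# at 16 kHz) and drop frames that the WebRTC VAD marks as silence.
import numpy as np
import soundfile as sf          # assumed I/O library
import webrtcvad                # Python binding of the Google WebRTC VAD

SAMPLE_RATE = 16000
FRAME_LEN = 320                 # 16 kHz * 20 ms

def speech_segments(path, vad_mode=2):
    signal, sr = sf.read(path, dtype="int16")
    assert sr == SAMPLE_RATE, "expects 16 kHz mono input"
    vad = webrtcvad.Vad(vad_mode)

    # Standardize the whole file to zero mean and unit variance.
    x = signal.astype(np.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)

    segments = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN):
        frame_pcm = signal[start:start + FRAME_LEN].tobytes()
        if vad.is_speech(frame_pcm, SAMPLE_RATE):   # keep voiced frames only
            segments.append(x[start:start + FRAME_LEN])
    return np.stack(segments) if segments else np.empty((0, FRAME_LEN))
```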

B. DNN architecture
As the first two layers of our model we used a convolutional layer [16] with 32 kernels of size 7 x 1, succeeded by an average pooling layer [17]. The third and fourth layers were also convolutional, with 32 kernels of size 13 x 1, again succeeded by average pooling. The last two convolutional layers had 16 kernels, again of size 13 x 1. After the last convolutional layer we divided the network into two branches. Both branches consisted of only one pooling layer, one with average pooling and the second with max pooling [17], which were afterwards flattened and concatenated back into the main branch.
From this point on, the DNN consisted of only fully connected layers: the first of size 480, the second of size 240, and the last an output Softmax layer [18] with 3 output neurons. All pooling layers were used with a pool size of 2. The border mode of all convolutional layers was set to 'valid', therefore no zero padding was performed on the borders.
For regularization of the DNN we used a dropout [19] of 0.1 for all convolutional and fully connected layers, except the last layer, which used a dropout of 0.2. ReLU [20] was used as the activation function for all layers except the Softmax output layer. All layers were initialized using Glorot uniform initialization [21]. The whole DNN had 468003 parameters in total, and its architecture is depicted in Fig. 2.

Fig. 2. Detailed architecture of proposed DNN with in/out vector shapes.
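To make the layer-by-layer description concrete, the sketch below assembles a model in the spirit of Fig. 2 using the modern Keras functional API. It is an approximation reconstructed from the text rather than the authors' released Keras/Theano code; in particular, the exact placement of the dropout layers and the reading of "the third and fourth layers" as two consecutive convolutions are our assumptions.

```python
# Approximate reconstruction of the described architecture (a sketch, not the
# authors' original model). Input: one 20 ms raw-audio segment of 320
# standardized samples.
from tensorflow.keras import Model, Input, layers

def build_model(segment_len=320, num_classes=3):
    inp = Input(shape=(segment_len, 1))            # batch dim stays "None", as in Fig. 2

    x = layers.Conv1D(32, 7, padding="valid", activation="relu",
                      kernel_initializer="glorot_uniform")(inp)
    x = layers.Dropout(0.1)(x)
    x = layers.AveragePooling1D(pool_size=2)(x)

    for _ in range(2):                             # third and fourth layers: 32 kernels, 13 x 1
        x = layers.Conv1D(32, 13, padding="valid", activation="relu",
                          kernel_initializer="glorot_uniform")(x)
        x = layers.Dropout(0.1)(x)
    x = layers.AveragePooling1D(pool_size=2)(x)

    for _ in range(2):                             # last two convolutional layers: 16 kernels, 13 x 1
        x = layers.Conv1D(16, 13, padding="valid", activation="relu",
                          kernel_initializer="glorot_uniform")(x)
        x = layers.Dropout(0.1)(x)

    # Two parallel pooling branches, flattened and concatenated back together.
    avg = layers.Flatten()(layers.AveragePooling1D(pool_size=2)(x))
    mx = layers.Flatten()(layers.MaxPooling1D(pool_size=2)(x))
    x = layers.Concatenate()([avg, mx])

    x = layers.Dense(480, activation="relu", kernel_initializer="glorot_uniform")(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Dense(240, activation="relu", kernel_initializer="glorot_uniform")(x)
    x = layers.Dropout(0.2)(x)                     # heavier dropout before the output layer
    out = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inp, out)
```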


C. Experiment
For training our proposed model we utilized the Stochastic Gradient Descent algorithm with a fixed learning rate of 0.11 to optimize a binary cross-entropy loss function, also known as logloss [22].
The input data were presented to the DNN in batches of size 21 over multiple epochs (iterations). Every batch contained exactly the same number of segments from every class, and each sample was succeeded by a sample of a different class (e.g. angry, neutral, sad, angry, neutral, sad, angry ...).
The last batch was populated with the remaining segments to complete an epoch and therefore might be smaller if necessary, yet all previous rules of equality remained honored. At each epoch the data were presented to the DNN in a different order. Since the batch size can vary, the vector shapes in Fig. 2 start with "None" instead of a scalar.
To eliminate overfitting we set the patience to 15, meaning the experiment was terminated if no progress on the validation loss had been made for more than 15 epochs of training. The best results were recorded after the 38th epoch of training.
We utilized the capabilities of the Keras [23] and Theano [24] frameworks to build the DNN model and accelerate the training on a GPU (NVIDIA GeForce GTX 690). The whole 38-epoch training took only 4.12 minutes to finish. All hyperparameters were tuned based on the validation results.
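As an illustration of this training setup, the following sketch compiles and fits the model from the previous listing. It is not the authors' script: `x_train`, `y_train`, `x_val` and `y_val` are assumed to be the class-balanced segment arrays and one-hot labels described above, and the class-interleaved batch ordering would require a custom data generator that is omitted here.

```python
# Sketch of the training configuration described in the text (assumed, not the
# authors' original script). x_train/x_val have shape (n, 320, 1); y_train/y_val
# are one-hot vectors over the 3 classes.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD

model = build_model()
model.compile(optimizer=SGD(learning_rate=0.11),
              loss="binary_crossentropy",      # the loss named in the paper ("logloss")
              metrics=["accuracy"])

# Stop when validation loss has not improved for 15 epochs and keep the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=15,
                           restore_best_weights=True)

history = model.fit(x_train, y_train,
                    batch_size=21,             # mini-batches of 21 segments
                    epochs=200,                # upper bound; early stopping ends training
                    shuffle=True,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])
```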
Using the trained model we obtained the prediction probabilities for each class of each segment from the validation and testing sets. We took the maximum of the predicted probabilities to denote the predicted class.
It was of course important to deliver the final results for whole audio files. For that purpose we computed the average probability over all segments belonging to a particular file and used it to denote the final predicted class. The value of this average probability can be viewed as the confidence of the prediction.
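The whole-file decision rule can be written in a few lines. The helper below is our own illustration of the described averaging; the function name, the input format (an array of the non-silent, standardized segments of one file) and the returned (label, confidence) pair are assumptions, not part of the paper.

```python
import numpy as np

def predict_file(model, file_segments, class_names=("angry", "neutral", "sad")):
    """Average the per-segment class probabilities of one file and pick the
    class with the highest mean probability; the mean itself serves as the
    confidence of the whole-file prediction."""
    probs = model.predict(file_segments[..., np.newaxis])   # shape (n_segments, 3)
    mean_probs = probs.mean(axis=0)
    best = int(np.argmax(mean_probs))
    return class_names[best], float(mean_probs[best])
```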
IV. RESULTS
In order to perform 3-class SER predictions on 33 testing audio files of German speech, we built a deep neural network model consisting of convolutional, pooling and fully connected layers. We trained it with raw, standardized input data cleared of silent segments, using mini-batches of size 21 over 38 epochs on a GPU. The whole training took only 4.12 minutes.
The training and validation sets contained exactly the same number of segments per class, as can be seen in Tab. I. The DNN predicted 545 validation segments to belong to a wrong class as opposed to 2068 correct predictions, resulting in 79.14% validation accuracy. The precision, recall, F1-score and overall accuracy of the validation segments are shown in Tab. II.

TABLE I. THE CONFUSION MATRIX OF VALIDATION SEGMENTS
TABLE II. THE CLASSIFICATION REPORT OF VALIDATION SEGMENTS

Tab. III shows that the DNN predicted 633 testing segments to belong to a wrong class as opposed to 2181 correct predictions, resulting in 77.51% testing accuracy. The precision, recall, F1-score and overall accuracy of the testing segments are shown in Tab. IV.

TABLE III. THE CONFUSION MATRIX OF TESTING SEGMENTS
TABLE IV. THE CLASSIFICATION REPORT OF TESTING SEGMENTS

After combining the predictions of segments, only one file's emotion class out of the 33 testing files was predicted incorrectly, as the confusion matrix in Tab. V shows. As shown in Tab. VI, our DNN achieved 96.97% accuracy on the speech emotion recognition task on the testing files. The average confidence of the file predictions was 69.55%.

TABLE V. THE CONFUSION MATRIX OF TESTING FILES
TABLE VI. THE CLASSIFICATION REPORT OF TESTING FILES

V. CONCLUSION
The objective of this paper was to predict the emotional state of a person from a short voice recording split into 20 millisecond segments. Our method achieved 96.97% accuracy on testing data with an average confidence of 69.55% on file prediction.
Our approach is context independent, which means that all audio segments were classified independently. The DNN thus had no knowledge of the actual context of what the actor is saying, nor did it have any knowledge of the rhythm, etc. On one hand this can be viewed as an advantage, but we think that a context-dependent approach using recurrent nets, as in [2], might significantly improve the results on this task.
Even though the resulting accuracy is high, our future work will try to further improve the approach by incorporating recurrent neural networks and using over-sampling or bigger data sets, so that the model is capable of delivering satisfying results on more than 3 classes and across multiple data sets, with higher accuracy on validation sets, higher confidence of predictions and higher reliability on real-world data.

ACKNOWLEDGMENT
Research described in this paper was financed by the National Sustainability Program under grant LO1401.

REFERENCES
[1] El Ayadi, M., Kamel, M.S. and Karray, F., 2011. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), pp.572-587.
[2] Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M.A. and Zafeiriou, S., 2016, March. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5200-5204). IEEE.
[3] LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436-444.
[4] Chakraborty, R. and Kopparapu, S.K., 2016, July. Improved speech emotion recognition using error correcting codes. In Multimedia & Expo Workshops (ICMEW), 2016 IEEE International Conference on (pp. 1-6). IEEE.
[5] Fei, W., Ye, X., Sun, Z., Huang, Y., Zhang, X. and Shang, S., 2016, June. Research on speech emotion recognition based on deep auto-encoder. In Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 2016 IEEE International Conference on (pp. 308-312). IEEE.
[6] Vyas, G., Dutta, M.K., Riha, K. and Prinosil, J., 2015, October. An automatic emotion recognizer using MFCCs and Hidden Markov Models. In Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), 2015 7th International Congress on (pp. 320-324). IEEE.
[7] Pan, Y., Shen, P. and Shen, L., 2012. Speech emotion recognition using support vector machine. International Journal of Smart Home, 6(2), pp.101-108.
[8] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F. and Weiss, B., 2005, September. A database of German emotional speech. In Interspeech (Vol. 5, pp. 1517-1520).
[9] Lalitha, S., Madhavan, A., Bhushan, B. and Saketh, S., 2014, October. Speech emotion recognition. In Advances in Electronics, Computers and Communications (ICAECC), 2014 International Conference on (pp. 1-4). IEEE.
[10] Zhou, Y., Sun, Y., Zhang, J. and Yan, Y., 2009, December. Speech emotion recognition using both spectral and prosodic features. In 2009 International Conference on Information Engineering and Computer Science (pp. 1-4). IEEE.
[11] Niu, J., Qian, Y. and Yu, K., 2014, September. Acoustic emotion recognition using deep neural network. In Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on (pp. 128-132). IEEE.
[12] Fayek, H.M., Lech, M. and Cavedon, L., 2015. Towards real-time speech emotion recognition using deep neural networks. In Signal Processing and Communication Systems (ICSPCS), 2015 9th International Conference on. IEEE.
[13] Martin, O., Kotsia, I., Macq, B. and Pitas, I., 2006, April. The eNTERFACE'05 audio-visual emotion database. In 22nd International Conference on Data Engineering Workshops (ICDEW'06) (pp. 8-8). IEEE.
[14] Jackson, P. and Haq, S., 2014. Surrey Audio-Visual Expressed Emotion (SAVEE) Database.
[15] Google WebRTC. https://webrtc.org/. Accessed 10 Oct 2016.
[16] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp.2278-2324.
[17] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R.R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[18] Bridle, J.S., 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing (pp. 227-236). Springer Berlin Heidelberg.
[19] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), pp.1929-1958.
[20] Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814).
[21] Glorot, X. and Bengio, Y., 2010, May. Understanding the difficulty of training deep feedforward neural networks. In AISTATS (Vol. 9, pp. 249-256).
[22] Zhang, T., 2004, July. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning (p. 116). ACM.
[23] Chollet, F., 2015. Keras: Deep learning library for Theano and TensorFlow.
[24] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. and Bengio, Y., 2010, June. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf (pp. 1-7).
