
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

BELAGAVI, KARNATAKA- 590018

Technical Seminar Report


on
“SPEECH EMOTION RECOGNITION BASED ON CONVOLUTION NEURAL
NETWORK COMBINED WITH RANDOM FOREST”

Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Engineering
in
ELECTRONICS AND COMMUNICATION ENGINEERING
Prescribed by
Visvesvaraya Technological University

By
Kari Tejasri
1BY16EC036

Under the guidance of

Dr. Jagannath K B
Assistant Professor, Dept. of ECE, BMSIT & M

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

BMS INSTITUTE OF TECHNOLOGY AND MANAGEMENT


Avalahalli, Yelahanka, Bengaluru-560064
2023-2024
B.M.S. INSTITUTE OF TECHNOLOGY AND MANAGEMENT
Yelahanka, Bengaluru-560064

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

Vision
Be a pioneer in providing quality education in electronics, communication and
allied engineering fields, to serve as a valuable resource for industry and society.
Mission
1. Impart sound theoretical concepts and practical skills through innovative pedagogy.
2. Promote interdisciplinary research.
3. Inculcate professional ethics.

Program Outcomes (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals and an engineering specialization to the solution of complex engineering problems.

2. Problem analysis: Identify, formulate, review research literature, and analyse complex
engineering problems, reaching substantiated conclusions using first principles of mathematics,
natural sciences and engineering sciences.

3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for public health and safety, and cultural, societal and environmental considerations.

4. Conduct investigations of complex problems: Use research-based knowledge and research
methods, including design of experiments, analysis and interpretation of data, and synthesis of
the information, to provide valid conclusions.

5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools, including prediction and modelling, to complex engineering activities
with an understanding of the limitations.

6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to professional engineering practice.

7. Environment and sustainability: Understand the impact of professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for,
sustainable development.

8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of engineering practice.

9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams and in multidisciplinary settings.

10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

11. Project management and finance: Demonstrate knowledge and understanding of engineering
and management principles and apply these to one's own work, as a member and leader in a
team, to manage projects in multidisciplinary environments.

12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in, independent and life-long learning in the broadest context of technological change.

Program Educational Objectives (PEOs)

Our Graduates will be able to

PEO–1: Work as professionals in the areas of Electronics, Communication and allied
engineering fields.

PEO–2: Pursue higher studies and engage in interdisciplinary research work.

PEO–3: Exhibit ethics, professional skills and leadership qualities in their profession.
Program Specific Outcomes (PSOs)

PSO–1: Exhibit competency in Embedded System and VLSI Design.

PSO–2: Comprehend the technological advancements in Radio Frequency Communication
and Digital Signal Processing.
B.M.S. INSTITUTE OF TECHNOLOGY AND MANAGEMENT
Avalahalli, Yelahanka, Bengaluru-560064

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

CERTIFICATE
Certified that the technical seminar entitled "SPEECH EMOTION RECOGNITION BASED ON
CONVOLUTION NEURAL NETWORK COMBINED WITH RANDOM FOREST" was presented by KARI TEJASRI
(1BY16EC036), a bonafide student of BMS Institute of Technology and Management, in partial
fulfilment of the requirements for the award of the degree of Bachelor of Engineering in
Electronics and Communication Engineering under Visvesvaraya Technological University,
Belagavi, during the year 2023-2024. It is certified that all corrections/suggestions
indicated for internal assessment have been incorporated in the report deposited in the
department library.

The seminar report has been approved as it satisfies the academic requirements in respect
of the seminar work prescribed for the said degree.

Signature of Guide                              Signature of HOD

Dr. Jagannath K B                               Dr. Jayadeva G S
Dept. of ECE, BMSIT&M                           Dept. of ECE, BMSIT&M

ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without mention of the people who made it possible, and whose
constant guidance and encouragement crowned our efforts with success.

We express our profound gratitude to our Principal, Dr. Sanjay H S, BMS Institute
of Technology and Management, for providing all the facilities and encouragement.

We would like to thank our HOD, Dr. Jayadeva G S, for the inspiration, guidance
and valuable suggestions.

Our sincere gratitude to our seminar coordinators, Dr. Jagannath K B and
Dr. Saneesh C T, for their valuable time, suggestions and technical support in
conducting the seminar presentation and writing the report.

Our sincere gratitude to our guide, Dr. Jagannath K B, for his valuable time,
patience, suggestions and periodic evaluation, which were conducive to the seminar work.

We would also like to thank all the teaching and non-teaching staff of the
Department of Electronics and Communication Engineering for their cooperation
and motivation.

Finally, we express our cordial thanks to our parents and friends for their support
and guidance throughout the preparation of this report.

Kari Tejasri

ABSTRACT

The key to speech emotion recognition is the extraction of speech emotion features. In this
paper, a new network model (CNN-RF), based on a convolution neural network combined with a
random forest, is proposed. First, the convolution neural network is used as a feature
extractor to extract speech emotion features from the normalized spectrogram, and a random
forest classification algorithm is then used to classify these features. Experimental results
show that the CNN-RF model is superior to the traditional CNN model. Second, the Record Sound
command box of NAO is improved and the CNN-RF model is applied to the NAO robot. Finally, the
NAO robot can "try to figure out" a person's psychology through speech emotion recognition and
perceive emotions such as happiness, anger, sadness and joy, achieving more intelligent
human-computer interaction.

CONTENTS

Certificate

Acknowledgement

Abstract

Contents

List of Figures

List of Tables

Chapter 1: INTRODUCTION

Chapter 2: LITERATURE SURVEY

Chapter 3: METHODOLOGY

3.0 Speech Emotion Recognition Based on CNN-RF

3.0.1 CNN Feature Extraction

3.0.2 RF Classifier

3.0.3 CNN-RF Model Analysis

Chapter 4: CONTRAST EXPERIMENT

Chapter 5: APPLICATIONS OF CNN-RF MODEL IN NAO ROBOT

Chapter 6: RESULTS AND DISCUSSION

6.1 The Improved Record Sound Box

6.2 Speech Emotion Test on NAO

Chapter 7: CONCLUSION

REFERENCES

LIST OF FIGURES

Fig 3.0.1: Framework of speech emotion recognition system

Fig 3.0.2: Structure of CNN model

Fig 4.1: Normalized spectrogram corresponding to speech signals

Fig 6.1: Choregraphe interface

Fig 6.2: Memory space of NAO

Fig 6.3: Time-domain waveforms of Recorder, RS1 and RS2

Fig 6.4: Speech emotion recognition of NAO robot

LIST OF TABLES

Table 4.1: Classification results of CNN and CNN-RF

Table 6.1: Test results on the NAO robot


CHAPTER 1

INTRODUCTION
Facing the development trend of the era and people's increasing material and spiritual
demands, the robot industry is extending from manufacturing to services, and home service
robots based mainly on speech interaction are set to enter thousands of households. With the
gradual development of the new generation of man-machine interaction technology, as well as
the continuous growth of people's demand for emotional intelligence in home service robots,
man-machine interaction technology based on speech emotion recognition has attracted wide
research attention. Traditional machine learning methods have made great progress in speech
emotion recognition. However, some problems remain: on the one hand, there is no unified
opinion on which features reflect the differences between emotions; on the other hand, these
hand-designed features depend heavily on the database, have low generalization ability, and
take a long time to extract. Deep learning can automatically extract features at different
levels from the original data, so it has been widely used in speech recognition, image
recognition and other fields. In this study, a CNN model was used as the feature extractor to
extract high-order features from the spectrogram, RF was used as the classifier, and the
resulting speech emotion recognition system was designed, implemented and applied to the
NAO robot.


CHAPTER 2

LITERATURE SURVEY
The literature survey is an important phase in the system development life cycle, as the
information needed to handle or develop a project is collected and acquired during this phase.
A literature review is a description of the literature relevant to a particular field or
topic. It gives an overview of what has been said, who the key writers are, what the
prevailing theories and hypotheses are, and what methods and methodologies are appropriate
and useful.

In this chapter, the research done prior to taking up the project is reviewed to understand
the methods used previously. A detailed analysis of the existing systems was performed, which
helped to identify their benefits and drawbacks.

In traditional speech emotion recognition methods, the extraction of speech emotional features
mainly focuses on suprasegmental phonetics and linguistics, extracting prosodic features [1],
spectral features [2] and other features [3-4]. These features are combined with support
vector machines [5] (SVM) and other classifiers to recognize speech emotions. Yu Gu et al. [6]
proposed the voiced segment selection (VSS) algorithm, which realizes accurate segmentation of
voice signals. In contrast to traditional methods, the VSS algorithm treats the voiced signal
segment as a texture image: voiced and unvoiced features in the spectrogram were classified
using the Log-Gabor filter. Jun Deng et al. [7] established an emotion recognition system
based on phase information, introducing Fisher kernels and GMM models to extract phase
features.

In other words, Fisher vectors were encoded using a GMM model, and the phase features of the
speech signals were then extracted from these Fisher vectors. Finally, a linear support vector
machine was used for classification and recognition. Pavitra Patel et al. [8] applied the PCA
algorithm to extract pitch, loudness and resonance peaks from speech and reduce the data
dimension. In their study, an improved GMM algorithm was introduced and the expectation
maximization (EM) algorithm was integrated into the Boosting framework. It was proved by
experiment that the Boosted-GMM algorithm increases the speech emotion recognition rate
effectively.


Considering the shortcomings of traditional machine learning and the advantages of deep
learning, Wootaek Lim et al. [9] combined a convolution neural network with a special
recurrent neural network and constructed a new deep neural network using the TimeDistributed
layer: Time Distributed CNNs. The CNNs and the LSTM network accomplish feature learning
together. Experiments demonstrated that the recognition rate of Time Distributed CNNs on the
7 emotions of the EmoDB database was higher than that of CNNs and LSTM networks alone.
Reference [10] applied the same idea, combining CNNs and LSTM networks to automatically learn
features that are easy to distinguish from the original speech signals, and used these
features to solve context-related feature extraction problems. Compared with traditional
speech emotion recognition methods based on signal processing, the proposed method has good
prediction capacity on the RECOLA natural emotion database.


CHAPTER 3

METHODOLOGY
3.0 SPEECH EMOTION RECOGNITION BASED ON CNN-RF

The framework of the speech emotion recognition system is shown in Figure 3.0.1. The
spectrogram is calculated from the emotional speech samples through framing, windowing, the
short-time Fourier transform (STFT) and power spectral density (PSD) estimation, and the
normalized spectrogram is used as the input of the CNN. Speech emotion features are extracted
by the CNN, and the output of the CNN's Flatten layer is fed into the RF classifier as the
feature vector of each sample. A minimal sketch of this spectrogram front end is given below.

Figure 3.0.1. The framework of speech emotion recognition
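
As a concrete illustration of this front end, the following Python sketch computes a
normalized spectrogram through framing, Hann windowing, STFT and PSD. The use of SciPy and
the frame-length and overlap values are illustrative assumptions, not the authors' exact
settings.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    def normalized_spectrogram(wav_path, frame_len=400, overlap=200):
        # Load the emotional speech sample
        sr, x = wavfile.read(wav_path)
        x = x.astype(np.float64)
        # Framing + Hann windowing + STFT in one call
        _, _, Z = stft(x, fs=sr, window="hann",
                       nperseg=frame_len, noverlap=overlap)
        psd = (np.abs(Z) ** 2) / frame_len       # power spectral density
        log_psd = 10 * np.log10(psd + 1e-10)     # log scale for dynamic range
        # Normalize to [0, 1] so all spectrograms share one value range
        return (log_psd - log_psd.min()) / (log_psd.max() - log_psd.min())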

3.0.1 CNN Feature Extraction

Speech emotion features are the basis of speech emotion recognition. How accurately these
features are extracted from the original speech samples directly influences the final
recognition rate.

Figure 3.0.2. Structure of CNN model

To learn more global features from different angles, multiple convolution kernels were set up
in the different layers. The parameter settings of the CNN feature extraction model are as
follows (a Keras sketch of this architecture is given after the layer descriptions):


INPUT layer: Emotional speech samples of different durations were normalized into colour
spectrograms of 99*145 pixels, 3 channels, in JPG format, and used as the input.

C1 and C2 layers: Convolution layers; each uses 16 convolution kernels of size 5*5. The
feature map size changes from 99*145 to 95*141 in C1 and to 91*137 in C2; each layer
generates 16 feature maps.

S3 layer: Pooling layer; 2*2 max pooling is used to reduce the data dimension, generating
16 feature maps of size 45*68.

C4 and C5 layers: Convolution layers; each uses 16 convolution kernels of size 3*3. The 16
feature maps change from 43*66 in C4 to 41*64 in C5.

S6 layer: Pooling layer; 2*2 max pooling is used, yielding 16 feature maps of size 20*32.

F1 and F2 layers: Fully connected layers. F1 has 128 units and F2 has 6 units (6 is the
number of speech emotion categories); softmax is used as the classifier.

D layer: Dropout layer; dropout with rate 0.6 is applied after F1, the value at which the
training loss and the validation loss (val_loss) were lowest. Its purpose is to improve the
generalization ability of the model. It can be seen from Figure 3.0.2 that the CNN feature
extraction model resembles VGG16 and VGG19 [11] in the design of its convolution layers: the
repeated convolution layers are conducive to fuller feature learning.

3.0.2 RF Classifier

RF [12] was used as the classifier after the CNN finished feature extraction. The RF designed
in this study contains 200 decision trees, and the Gini index is used as the splitting
criterion. Apart from these two parameters, all other parameters take the default values of
the RF classifier in the scikit-learn package (see the sketch after this list).

During the generation of the RF, the following two random sampling methods were used to
construct the decision trees, which increases the generalization ability effectively:

● Row sampling. From the original training set of N samples, n (n << N) samples are chosen
randomly with replacement to grow each decision tree. The input samples of each tree thus
form a proper subset of the original training set, which helps prevent overfitting.

● Column sampling. Suppose there are M features; each node chooses m (m << M) of them to
determine the best splitting point. This ensures that each leaf node of the decision tree
either splits completely or points to a single class.

In addition, the classification result of the RF is determined by a vote over the outputs of
all the decision trees, resulting in relatively high classification accuracy.
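
A minimal sketch of this classifier configuration in scikit-learn, with 200 trees, the Gini
criterion, and all other parameters left at their defaults. The random training data below is
only a placeholder for the CNN flatten-layer features (20*32*16 = 10240 dimensions per sample).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data standing in for flatten-layer features and labels
    train_feats = np.random.rand(100, 10240)
    train_labels = np.random.randint(0, 6, size=100)

    rf = RandomForestClassifier(n_estimators=200, criterion="gini")
    rf.fit(train_feats, train_labels)   # row/column sampling happens inside
    pred = rf.predict(train_feats)      # majority vote over the 200 trees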

3.0.3 CNN-RF Model Analysis

In practical applications of a speech emotion recognition system, environmental noise means
that the purity of the speech samples collected by the robot is far lower than that of the
samples in the database. The CNN-RF model uses the CNN as a dedicated feature extractor, and
each convolution layer is equipped with multiple convolution kernels to make the extracted
features more comprehensive. Random sampling is used while generating the RF classifier,
which effectively prevents overfitting and improves the generalization ability of the system.
Therefore, the proposed CNN-RF model can meet the practical demands of speech emotion
recognition. A sketch of how the two parts connect is given below.
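
The connection between the two parts can be sketched as follows, assuming the cnn and rf
objects from the earlier sketches: a second Keras model exposes the Flatten-layer output,
which the RF then classifies.

    from keras.models import Model

    # Reuse the trained CNN up to its Flatten layer as a feature extractor
    extractor = Model(inputs=cnn.input,
                      outputs=cnn.get_layer("flatten").output)

    def cnn_rf_predict(spectrograms):
        # spectrograms: array of shape (batch, 99, 145, 3)
        feats = extractor.predict(spectrograms)   # CNN feature extraction
        return rf.predict(feats)                  # RF classification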


CHAPTER 4

CONTRAST EXPERIMENT
The experimental environment was a Core i7 at 3.4 GHz with 16 GB of memory, running Windows.
The graphics card was an NVIDIA GeForce GTX 1070, the underlying framework was Keras on
Theano, and GPU acceleration was set up through the Anaconda software. The experiments used
the CASIA Chinese emotion database of the Institute of Automation, Chinese Academy of
Sciences [13]. This database was recorded by 4 professional actors, covers six emotions
(angry, fear, happy, neutral, sad and surprise) and contains 9600 samples in total, with each
emotion accounting for 1/6 of the samples. The normalized spectrograms of two neutral
utterances chosen randomly from the database are shown in Figure 4.1.

Fig 4.1 Normalized spectrogram corresponding to speech signals

The CASIA database was divided into a training set and a test set in the proportion 5:1.
Classification performance is evaluated by accuracy (ACC). The softmax output size was set to
6, corresponding to the 6 emotions. The classification results of the CNN model and the
CNN-RF model are shown in Table 4.1: the recognition accuracy of CNN-RF is 3.25 percentage
points higher than that of the CNN model. A sketch of the split and evaluation is given
below the table.

Table 4.1. Classification results of CNN and CNN-RF

Network Model    ACC
CNN              0.8143
CNN-RF           0.8468
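
A sketch of the 5:1 split and the ACC evaluation described above. The scikit-learn utilities,
the placeholder data, and cnn_rf_predict from the earlier sketch are assumed conveniences,
not necessarily the authors' code.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Small placeholder batch standing in for the CASIA spectrograms/labels
    spectrograms = np.random.rand(600, 99, 145, 3)
    labels = np.random.randint(0, 6, size=600)

    # Hold out 1 of every 6 samples (training:test = 5:1)
    X_train, X_test, y_train, y_test = train_test_split(
        spectrograms, labels, test_size=1.0 / 6, stratify=labels)

    acc = accuracy_score(y_test, cnn_rf_predict(X_test))  # ACC on test set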


CHAPTER 5

APPLICATIONS OF CNN-RF MODEL IN NAO ROBOT


NAO is a humanoid robot developed by the company Aldebaran Robotics. NAO is especially
suitable as a research platform for service robots, and users can program NAO visually
through the Choregraphe software. The Record Sound box provided by Choregraphe calls the
microphones of the NAO robot and supports the following two types of audio recording:

● Four tracks, 48000 Hz, .wav format.

● Single track (front microphone), 16000 Hz, .ogg format.

In existing studies of man-machine interaction, speech signal processing in Python and Matlab
mainly uses single-track audio in .wav format. Reference [14] proposed methods for extracting
a single track from multi-track audio and for converting the audio format. However, these
extra operations make man-machine interaction more complicated. Therefore, this study makes
the improvement at the source and obtains the desired single-track audio by modifying the
command box. This not only simplifies audio processing but also improves the real-time
performance of man-machine interaction. A hedged sketch of the underlying recording call is
given below.
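
A sketch of the recording call behind the improved box, using the ALAudioRecorder module of
the NAOqi Python SDK. The robot address and the channel-vector layout (one enable flag per
microphone) are assumptions based on SDK conventions, not the authors' exact code.

    from naoqi import ALProxy

    NAO_IP, NAO_PORT = "192.168.1.10", 9559    # placeholder robot address

    recorder = ALProxy("ALAudioRecorder", NAO_IP, NAO_PORT)
    # Enable one microphone channel (order assumed: left, right, front, rear)
    recorder.startMicrophonesRecording(
        "/home/nao/recordings/Recorder.wav", "wav", 16000, [0, 0, 1, 0])
    # ... the user speaks to the robot ...
    recorder.stopMicrophonesRecording()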


CHAPTER 6

RESULTS AND DISCUSSION

6.1 The Improved Record Sound Box

Fig 6.1. Choregraphe interface

In this study, the Record Sound box was improved: the "ALAudioRecorder" application
programming interface was used in place of the "ALAudioDevice" interface to generate a new
Recorder box. As the Choregraphe interface in Figure 6.1 shows, the improved box adds
single-track (front microphone) recording in .wav format and four-track recording in .ogg
format.

The same text, "Hello, Nao!", was recorded by both the original Record Sound box and the
improved Recorder box. The two files generated by the original Record Sound box were named
RS1.wav and RS2.ogg, while the file generated by the improved Recorder box was named
Recorder.wav. The three recordings in the internal memory of NAO are shown in Figure 6.2. All
three audio files were downloaded and loaded into Audacity, and their names, numbers of
tracks and sampling frequencies were examined (Figure 6.3). The first row is Recorder.wav,
and the second to fifth rows are RS1.wav, indicating that RS1 is a four-track mixed
recording.


The four tracks of RS1, collected by the left, right, front and rear microphones of the
robot, are shown from top to bottom. The sixth row is the single-track RS2.ogg; however, .ogg
is a lossy compressed audio format and can degrade tone quality to some extent.

In addition, the .wav format is preferred, to keep the audio collected on the NAO consistent
in format with the CASIA database and to eliminate any interference of the audio format with
the application effect. Recorder.wav meets both requirements.

Fig 6.2. Memory space of NAO

As Figures 6.2 and 6.3 show, the audio collected by the Recorder box needs no further
processing and is directly usable for NAO-based speech signal analysis, Chinese word
segmentation, speech recognition and speech emotion recognition. It simplifies the
man-machine interaction process, increasing time efficiency significantly.

6.2 Speech Emotion Test on NAO

Applying the CNN-RF speech emotion recognition model to the NAO robot requires the three
steps shown in Figure 6.4 (a hedged end-to-end sketch follows the figure). The detailed
process is as follows:

(1) Acquire emotional speech samples in real time with NAO. Connect the NAO robot to the
local computer, use the improved Recorder box to store the emotional speech samples in .wav
format, and download them through the SFTP functionality of Python;


Fig 6.3. Time-domain waveforms of Recorder, RS1 and RS2

(2) Draw the spectrogram;

(3) Classify the test samples with the trained and stored CNN-RF network model, and use the
"ALTextToSpeech" application interface of the NAO robot to "speak out" the recognition result.

Fig 6.4 Speech emotion recognition of NAO robot
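
The three steps can be sketched end to end as follows. The SFTP credentials, file paths, the
label-to-emotion mapping, and the helpers normalized_spectrogram and cnn_rf_predict from the
earlier sketches are illustrative assumptions (in practice the spectrogram is also rendered
to the CNN's 99*145*3 image input).

    import numpy as np
    import paramiko
    from naoqi import ALProxy

    # Assumed label-index order; the real mapping comes from training
    EMOTIONS = ["angry", "fear", "happy", "neutral", "sad", "surprise"]

    # (1) Download the sample recorded by the improved Recorder box via SFTP
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("192.168.1.10", username="nao", password="nao")
    sftp = ssh.open_sftp()
    sftp.get("/home/nao/recordings/Recorder.wav", "Recorder.wav")
    sftp.close()

    # (2) Draw the spectrogram, then (3) classify with the stored CNN-RF model
    spec = normalized_spectrogram("Recorder.wav")
    label = cnn_rf_predict(np.expand_dims(spec, 0))[0]

    # Let the robot "speak out" the recognized emotion via ALTextToSpeech
    tts = ALProxy("ALTextToSpeech", "192.168.1.10", 9559)
    tts.say("You sound " + EMOTIONS[int(label)])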

Test results on the NAO robot are shown in Table 6.1. In the test, different sentences were
used to express the emotions. According to the test, "surprise" and "fear" were the easiest
to confuse, and the recognition rates of "happy" and "sad" were below the average level. The
highest recognition rate was achieved for "neutral", reaching 90.24%.


Table 6.1. Test results on the NAO robot (rows: true emotion; columns: predicted emotion)

          Angry  Neutral  Surprise  Happy  Fear  Sad    ACC
Angry       19      3        0        1      0    0     82.60%
Neutral      0     37        4        0      0    0     90.24%
Surprise     1      0       20        0      6    0     74.07%
Happy        1      2        3       12      1    1     60.00%
Fear         0      0        3        0     17    0     85.00%
Sad          1      1        1        0      2    8     61.53%
Average                                                 75.57%


CHAPTER 7

CONCLUSION
In this study, a CNN model is used as the feature extractor and combined with an RF
classifier. On this basis, a Chinese speech emotion recognition system is designed and
applied to the NAO robot. During the application, the original Record Sound box is improved
into a new Recorder box. Speech signals collected by the improved Recorder box not only meet
the format requirements of speech emotion recognition, but also serve the needs of NAO-based
speech signal analysis, Chinese word segmentation and speech recognition. After systematic
testing, the proposed CNN-RF model gives the NAO robot the basic capability of speech
emotion recognition.


REFERENCES

[1] Juan Pablo Arias, Carlos Busso, Nestor Becerra Yoma. Shape-based modeling of the
fundamental frequency contour for emotion detection in speech. Computer Speech & Language,
2014, 28(1): 278-294.

[2] Yongming Huang, Guobao Zhang, Yue Li, et al. Improved emotion recognition with novel
task-oriented wavelet packet features. Intelligent Computing Theory, 2014, 8588: 706-714.

[3] H. M. Teager, S. M. Teager. Evidence for nonlinear production mechanisms in the vocal
tract. Speech Production and Speech Modelling, 1990, 55: 241-261.

[4] Chong Feng, Chunhui Zhao. Voice activity detection based on ensemble empirical mode
decomposition and Teager kurtosis. International Conference on Signal Processing. IEEE,
2014: 455-460.

[5] Lin Yilin, Wei Gang. Speech emotion recognition based on HMM and SVM. Proceedings of the
4th International Conference on Machine Learning and Cybernetics. Guangzhou, China, 2005,
VIII: 4898-4901.

[6] Gu Y, Postma E, Lin H X, et al. Speech emotion recognition using voiced segment selection
algorithm. 2016.

[7] Deng J, Xu X, Zhang Z, et al. Fisher kernels on phase-based features for speech emotion
recognition. Dialogues with Social Robots. Springer Singapore, 2017: 195-203.

[8] Patel P, Chaudhari A, Kale R, et al. Emotion recognition from speech with Gaussian
mixture models & via boosted GMM. International Journal of Research in Science & Engineering,
2017, 3.

[9] Lim W, Jang D, Lee T. Speech emotion recognition using convolutional and recurrent neural
networks. Signal and Information Processing Association Annual Summit and Conference
(APSIPA), 2016 Asia-Pacific. IEEE, 2016: 1-4.

[10] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion
recognition using a deep convolutional recurrent network. Acoustics, Speech and Signal
Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016: 5200-5204.

[11] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.

[12] Liaw A, Wiener M. Classification and regression by randomForest. R News, 2002, 2(3):
18-22.

[13] Institute of Automation, Chinese Academy of Sciences. The Selected Speech Emotion
Database of the Institute of Automation, Chinese Academy of Sciences (CASIA) [DB/OL].
2012-05-17.

[14] Cheng Ming. An approach of speech interaction and software design for humanoid robots.
Thesis, Changsha: Hunan University, 2016.
