Technical Seminar Report
Submitted in partial fulfillment of the requirement for the award of the degree
Bachelor of Engineering
in
ELECTRONICS AND COMMUNICATION ENGINEERING
Prescribed by
Visvesvaraya Technological University
By
Kari Tejasri
1BY16EC036
Dr. Jagannath.K.B
Assistant Professor, Dept. of ECE, BMSIT & M
Vision
Be a pioneer in providing quality education in electronics, communication and
allied engineering field to serve as valuable resources for industry and society.
Mission
1. Impart sound theoretical concepts & practical skills through innovative pedagogy
2. Promote interdisciplinary research
3. Inculcate professional ethics
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
PEO–3: Exhibit Ethics, Professional skills and Leadership qualities in their profession.
Program Specific Outcomes (PSOs)
CERTIFICATE
Certified that the technical seminar entitled “XYZ” has been presented by KARI TEJASRI
(1BY16EC036), a bonafide student of BMS Institute of Technology and Management,
in partial fulfilment for the award of the degree of Bachelor of Engineering in Electronics and
Communication Engineering under Visvesvaraya Technological University, Belagavi, during the year
2023-2024. It is certified that all corrections/suggestions indicated for internal
assessment have been incorporated in the report deposited in the department library.
The seminar report has been approved as it satisfies the academic requirements in respect
of the seminar work prescribed for the said degree.
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without the mention of the people who made it possible and
whose constant guidance and encouragement crowned our efforts with success.
We express our profound gratitude to our Principal, Dr. Sanjay H S, BMS Institute
of Technology and Management, for providing all the facilities and encouragement.
We would like to thank our HOD, Dr. Jayadeva G.S., for the inspiration, guidance
and valuable suggestions.
Our sincere gratitude to our seminar coordinators, Mr. Jagannath KB and Dr.
Saneesh CT for their valuable time, suggestions, and technical support in
conducting the seminar presentation and writing the report.
Our sincere gratitude to our guide, Mr. Jagannath KB, for his valuable time,
patience, suggestions and periodic evaluation that were conducive to the project.
We would also like to thank all the teaching and non-teaching staff of the
Department of Electronics and Communication Engineering for their cooperation
and motivation.
Finally, we express our cordial thanks to our parents and friends for their support
and guidance throughout the project preparation.
Kari Tejasri
ABSTRACT
The key to speech emotion recognition is the extraction of speech emotion features. In this
paper, a new network model (CNN-RF) based on a convolution neural network combined
with a random forest is proposed. Firstly, the convolution neural network is used as the
feature extractor to extract speech emotion features from the normalized spectrogram,
and a random forest classification algorithm is used to classify the extracted features.
The experimental results show that the CNN-RF model is superior to the traditional
CNN model. Secondly, the Record Sound command box of the NAO robot is improved and
the CNN-RF model is applied to the NAO robot. Finally, the NAO robot can "try to figure out" a human's
psychology through speech emotion recognition and can recognize people's happiness,
anger, sadness and joy, achieving more intelligent human-computer interaction.
CONTENTS
Certificate i
Acknowledgement ii
Abstract iii
Contents iv
List of Figures v
List of tables v
Chapter 1: INTRODUCTION 01
Chapter 3: METHODOLOGY 04
3.0.2 RF Classifier 05
Chapter 7: CONCLUSION 13
REFERENCES 14
LIST OF FIGURES
LIST OF TABLES
Table 4.1: Classification Results of CNN and CNN-RF 07
Table 6.1: Test results in NAO robots 12
Speech Emotion Recognition Based on Convolution Neural Network Combined with Random Forest
CHAPTER 1
INTRODUCTION
Facing the development trends of the era and the increasing material and spiritual demands of
people, the robot industry is extending from the manufacturing industry to the service industry,
and home service robots based mainly on speech interaction are set to penetrate thousands of
households. With the gradual development of the new generation of man-machine interaction
technology and the continuous increase in people's demands for the emotional intelligence of
home service robots, man-machine interaction technology based on speech emotion recognition has
attracted wide research attention. Traditional machine learning methods have achieved great
progress in speech emotion recognition. However, there are some problems: on the one hand, there
is no unified opinion on which features best reflect the differences between different emotions;
on the other hand, these artificially designed features rely heavily on particular databases,
have low generalization ability, and take a long time to extract. Because deep learning can
automatically extract features at different levels from the original data, it has been widely
used in speech recognition, image recognition and other fields. In this study, a CNN model is
used as the feature extractor to extract high-order features from the spectrogram, and RF is
used as the classifier; on this basis, the speech emotion recognition system is designed,
implemented and applied to the NAO robot.
CHAPTER 2
LITERATURE SURVEY
The literature survey is an important phase in the system development life cycle, as the
information necessary to handle or develop a project is collected and acquired during this
phase. A literature review is a description of the literature relevant to a particular field or
topic. It gives an overview of what has been said, who the key writers are, what the prevailing
theories and hypotheses are, and which methods and methodologies are appropriate and useful.
In this chapter, the research carried out prior to taking up the project is summarized, along
with the various methods that were used previously. A detailed analysis of the existing systems
was performed. This study helped to identify the benefits and drawbacks of the existing systems.
In [7], Fisher vectors were applied for code extraction using a GMM model, phase features of
speech signals were then extracted by setting Fisher vectors, and finally a linear support
vector machine was used for classification and recognition. Pavitra Patel et al.
[8] applied the PCA algorithm to extract pitch, loudness and resonance peaks in speech to
reduce data dimensions. In that study, an improved GMM algorithm was introduced and the
expectation maximization (EM) algorithm was integrated into the Boosting framework. It was
proved by experiment that the Boosted-GMM algorithm increases the speech emotion
recognition rate effectively.
Considering the shortcomings of traditional machine learning and the advantages of deep
learning, Wootaek Lim et al. [9] combined a convolution neural network with a recurrent neural
network and constructed a new deep neural network, Time Distributed CNNs, by using a
TimeDistributed layer; the CNNs and the LSTM network accomplished feature learning together.
Experiments demonstrated that the recognition rate of Time Distributed CNNs on the 7 emotions
of the EmoDB database was higher than those of CNNs and LSTM networks alone. Reference [10]
applied the same idea, combining CNNs and LSTM networks to automatically learn features that
are easy to distinguish from the original speech signals, and used these features to solve
feature extraction problems related to context. Compared with traditional speech emotion
recognition methods based on signal processing, the proposed method has good prediction
capacity on the RECOLA natural emotion database.
CHAPTER 3
METHODOLOGY
3.0 SPEECH EMOTION RECOGNITION BASED ON CNN-RF
The framework of speech emotion recognition is shown in Figure 3.0.1. The spectrogram is
calculated from the emotional speech samples through framing, windowing, the short-time Fourier
transform (STFT) and the power spectral density (PSD), and the normalized spectrogram is used
as the input of the CNN. Speech emotion features are extracted by the CNN, and the output of
the CNN Flatten layer is fed into the RF classifier as the eigenvector of each speech emotion
sample.
Speech emotion features are the basis for speech emotion recognition. The accuracy with which
speech emotion features are extracted from the original speech samples directly influences the
final recognition rate.
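As an illustration of this preprocessing stage, the following is a hedged sketch of computing a
normalized spectrogram with SciPy. The frame length, hop size and window type are assumptions
made for illustration; the report only states that framing, windowing, STFT and PSD are applied
and that the result is normalized into a 99*145-pixel, 3-channel image (e.g., by rendering with
a colormap, which is omitted here).

```python
# Sketch: normalized spectrogram for the CNN input (frame/hop/window are assumed).
import numpy as np
from scipy import signal
from scipy.io import wavfile

def normalized_spectrogram(wav_path, n_fft=512, hop=160):
    sr, audio = wavfile.read(wav_path)                    # load the speech sample
    audio = audio.astype(np.float32)
    if audio.ndim > 1:                                    # keep a single channel
        audio = audio[:, 0]
    # Framing + Hamming windowing + short-time Fourier transform.
    f, t, stft = signal.stft(audio, fs=sr, window="hamming",
                             nperseg=n_fft, noverlap=n_fft - hop)
    psd = np.abs(stft) ** 2                               # power spectral density
    log_psd = 10.0 * np.log10(psd + 1e-10)                # compress dynamic range
    # Normalize to [0, 1]; the report then renders/resizes this to 99*145*3.
    return (log_psd - log_psd.min()) / (log_psd.max() - log_psd.min() + 1e-10)

spec = normalized_spectrogram("angry_001.wav")            # hypothetical CASIA sample
```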
In order to learn more global features from different angles, we set up multi-convolution
kernels in different layers. The following are the parameter settings of CNN network model
for feature extraction:
INPUT layer: Emotional speech samples of different time lengths were normalized into color
spectrograms of 99*145 pixels with 3 channels in jpg format and used as the input.
C1 and C2 layers: Convolution layers; these two layers used 16 5*5 convolution kernels each.
The feature map size changes from 99*145 to 95*141 in C1 and to 91*137 in C2; each layer
generates 16 feature maps.
S3 layer: Pooling layer; this layer used 2*2 maximum pooling. It reduces the dimension of the
data and generates 16 45*68 feature maps.
C4 and C5 layers: Convolution layers; these layers used 16 3*3 convolution kernels each. The
16 feature maps change from 45*68 to 43*66 in C4 and from 43*66 to 41*64 in C5.
S6 layer: Pooling layer; 2*2 maximum pooling was used and 16 20*32 feature maps were obtained.
F1 layer: Fully connected layer. This paper uses two fully connected layers: F1 has 128 units
and F2 has 6 units (6 is the number of speech emotion categories), with softmax used as the
classifier.
D layer: Dropout layer; this study applied the dropout strategy in F1 with the parameter value
set to 0.6, at which the training loss and the val_loss reduction during validation were best.
The purpose is to improve the generalization ability of the model. It can be seen from
Figure 3.2 that the similarity between the applied CNN feature extraction model and VGG16 and
VGG19 [11] lies in the design of the convolution layers: the repeated occurrence of convolution
layers is conducive to fuller feature learning.
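For concreteness, a minimal Keras sketch of this feature-extraction network is given below. The
activations, padding mode and optimizer are assumptions made for illustration (the report
specifies only the layer sizes); the layer names follow the description above.

```python
# Sketch of the described CNN feature extractor (activations/padding/optimizer assumed).
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

inp = Input(shape=(99, 145, 3), name="spectrogram")          # INPUT: 99*145 RGB spectrogram
x = Conv2D(16, (5, 5), activation="relu", name="C1")(inp)    # 99*145 -> 95*141
x = Conv2D(16, (5, 5), activation="relu", name="C2")(x)      # 95*141 -> 91*137
x = MaxPooling2D((2, 2), name="S3")(x)                       # 91*137 -> 45*68
x = Conv2D(16, (3, 3), activation="relu", name="C4")(x)      # 45*68 -> 43*66
x = Conv2D(16, (3, 3), activation="relu", name="C5")(x)      # 43*66 -> 41*64
x = MaxPooling2D((2, 2), name="S6")(x)                       # 41*64 -> 20*32
feat = Flatten(name="flatten")(x)                             # eigenvector later fed to the RF
x = Dense(128, activation="relu", name="F1")(feat)            # F1: 128 units
x = Dropout(0.6, name="D")(x)                                 # dropout 0.6 applied after F1
out = Dense(6, activation="softmax", name="F2")(x)            # F2: 6 emotion classes

cnn = Model(inp, out)
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```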
3.0.2 RF Classifier
RF [12] was used as the classifier after the CNN accomplished feature extraction. The RF
designed in this study consisted of 200 decision trees, and the Gini index was used as the
splitting criterion. Except for these two parameters, all other parameters used the default
values of the RF classifier in the scikit-learn package.
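A minimal scikit-learn sketch of this classifier, with the two stated parameters and defaults
elsewhere (the variable names are illustrative), could look like the following:

```python
# Sketch: RF classifier with 200 trees and the Gini criterion (other parameters default).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, criterion="gini")
# X_train: flatten-layer features of the training spectrograms; y_train: emotion labels.
# rf.fit(X_train, y_train)
# y_pred = rf.predict(X_test)
```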
In the generation process of the RF, the following two random sampling methods were used:
● Row sampling. From the original training dataset of N samples, n (n << N) samples were
chosen randomly with replacement to generate each decision tree. The input samples of each
tree were therefore a proper subset of the original training dataset, which helps to prevent
overfitting.
● Column sampling. Suppose there are M features; each node chooses m (m << M) features to
determine the best splitting point. This ensures complete splitting, or that each leaf node of
the decision tree points to a single class.
In addition, the classification result of the RF is determined by the majority of the output
classes across all decision trees, resulting in relatively high classification accuracy.
In the actual application of a speech emotion recognition system, which is interfered with by
environmental noise, the purity of the speech samples collected by the robot is far lower than
that of the samples in the database. The CNN-RF model uses the CNN as a dedicated feature
extractor, and each convolution layer is set with multiple convolution kernels to make the
extracted features more comprehensive. The random sampling methods used during the generation
of the RF classifier prevent overfitting effectively and improve the generalization ability of
the system. Therefore, the proposed CNN-RF model can meet the actual demands of speech emotion
recognition.
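Putting the two parts together, the following sketch shows how the flatten-layer output of the
trained CNN could feed the RF; it reuses the `cnn` and `rf` objects from the earlier sketches,
and the training/test arrays are placeholders, not data from the report.

```python
# Sketch: use the trained CNN's flatten layer as the feature extractor for the RF.
from keras.models import Model

# `cnn` is the trained model from the previous sketch; `rf` is the RF defined above.
extractor = Model(inputs=cnn.input, outputs=cnn.get_layer("flatten").output)

train_feats = extractor.predict(X_train_specs)   # placeholders: normalized spectrograms
test_feats = extractor.predict(X_test_specs)

rf.fit(train_feats, y_train)                      # RF replaces the softmax head
print("CNN-RF accuracy:", rf.score(test_feats, y_test))
```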
CHAPTER 4
CONTRAST EXPERIMENT
The experimental environment was a Core i7 at 3.4 GHz with 16 GB of memory running Windows.
The graphics card was an NVIDIA GeForce GTX 1070, the underlying framework was Keras on
Theano, and GPU acceleration was enabled through the Anaconda software.
The experimental database was the CASIA Chinese emotion database of the Institute of
Automation, Chinese Academy of Sciences [13]. This database was recorded by 4 professional
speakers and covers six emotions (angry, fear, happy, neutral, sad and surprise) with 9600
samples in total; each emotion accounts for 1/6 of the samples. The normalized spectrograms of
two neutral emotion speeches chosen randomly from the database are shown in Figure 4.1.
The CASIA database was divided into a training set and a test set in the proportion 5:1. The
performance of the classification is evaluated by Accuracy (ACC). The parameter of softmax was
set to 6, indicating 6 emotions. The classification results of the CNN model and the CNN-RF
model are shown in Table 4.1. The recognition accuracy of CNN-RF is 3.25% higher than that of
the CNN model.
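As an illustration of the 5:1 split and ACC evaluation, a short scikit-learn sketch follows;
the split call and metric are standard library usage, and `X`, `y` stand in for the spectrogram
features and emotion labels rather than code from the paper.

```python
# Sketch: 5:1 train/test split and accuracy (ACC) evaluation (names are placeholders).
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 6, stratify=y, random_state=0)   # 9600 samples -> 5:1 split

rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print("ACC = {:.4f}".format(acc))
```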
CHAPTER 5
CHAPTER 6
In this study, the Record Sound box was improved: the "ALAudioRecorder" application
programming interface was used to replace the "ALAudioDevice" interface to generate a new
Recorder box. It can be seen from the interface of Choregraphe in Figure 6.1 that the
improved box adds single-track (front microphone) .wav format and four-track .ogg format
options for recording.
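To illustrate what the improved Recorder box does internally, the following is a minimal sketch
of recording a single-track front-microphone .wav file with the NAOqi "ALAudioRecorder" API;
the robot address, sample rate, recording duration and the channel-mask order are assumptions,
not values given in the report.

```python
# Sketch: single-track front-microphone .wav recording with ALAudioRecorder
# (robot address, sample rate, duration and channel-mask order are assumptions).
import time
from naoqi import ALProxy

NAO_IP, NAO_PORT = "192.168.1.10", 9559               # hypothetical robot address

recorder = ALProxy("ALAudioRecorder", NAO_IP, NAO_PORT)
recorder.startMicrophonesRecording(
    "/home/nao/recordings/Recorder.wav",              # file stored on the robot
    "wav", 16000,                                     # .wav format, 16 kHz
    [0, 0, 1, 0])                                     # assumed mask: front microphone only
time.sleep(3)                                         # record for about 3 seconds
recorder.stopMicrophonesRecording()
```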
In this study, the same text content, "Hello, Nao!", was recorded by both the original Record
Sound box and the improved Recorder box. The two documents of different formats generated by
the original Record Sound box were named RS1.wav and RS2.ogg, while the document generated by
the improved Recorder box was named Recorder.wav. The three speeches in the internal memory of
NAO are shown in Figure 6.2. All three audio documents were downloaded and imported into
Audacity, and the name, number of tracks and sampling frequency of each document were examined
(Figure 6.3). The first row is Recorder.wav, and the second to the fifth rows are RS1.wav,
indicating that RS1 is the four-track mixed sound.
The audios collected by the left, right, front and back microphones of the robot are shown from
top to bottom. The sixth row is RS2.ogg, a single-track audio; however, the .ogg document is a
lossy audio compression format and can affect tone quality to some extent.
In addition, the .wav format is preferred in order to keep the format of the audios collected on
the NAO consistent with the CASIA database and to eliminate the interference of the audio format
with the application effect. Hence, Recorder.wav meets the above requirements.
It can be seen from Figure 6.2 and Figure 6.3 that the audio collected by the Recorder box needs
no further processing and can serve NAO-based speech signal analysis, Chinese word segmentation,
speech recognition and speech emotion recognition. It simplifies the man-machine interaction
process, thus increasing time efficiency significantly.
Applying the CNN-RF speech emotion recognition model to the NAO robot requires the three steps
shown in Figure 6.4 (a code sketch of steps (1) and (3) follows the list). The detailed process
is as follows:
(1) Acquire emotional speech samples in real time with NAO. Connect the NAO robot and the local
computer, use the improved Recorder box to store the emotional speech samples in .wav format,
and download them using the SFTP functions of Python;
(3) Test samples are classified by the trained and stored CNN-RF network model. The
"ALTextToSpeech" application interface of the NAO robot is then used to "speak out" the
recognition results.
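A hedged sketch of steps (1) and (3) is given below, assuming the paramiko library for SFTP,
default NAO credentials, illustrative file paths, and the `normalized_spectrogram`, `extractor`
and `rf` objects from the Chapter 3 sketches.

```python
# Sketch: download the recorded sample over SFTP, classify it, and speak the result.
import paramiko
from naoqi import ALProxy

NAO_IP = "192.168.1.10"                               # hypothetical robot address

# Step (1): fetch the .wav file written by the improved Recorder box.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(NAO_IP, username="nao", password="nao")   # default credentials assumed
sftp = ssh.open_sftp()
sftp.get("/home/nao/recordings/Recorder.wav", "Recorder.wav")
sftp.close()
ssh.close()

# Step (3): CNN-RF classification, then NAO "speaks out" the result.
spec = normalized_spectrogram("Recorder.wav")         # resized to 99*145*3 in practice
feats = extractor.predict(spec[None, ...])            # flatten-layer eigenvector
emotion = rf.predict(feats)[0]

tts = ALProxy("ALTextToSpeech", NAO_IP, 9559)
tts.say("The detected emotion is " + str(emotion))
```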
Test results on NAO robots are shown in Table 6.1, where each row gives the actual emotion, the
columns give the recognized emotion, and the last column gives the per-class accuracy. In the
test, different sentences were used to express the emotions. According to the test, "surprise"
and "fear" were the easiest to confuse. The recognition degree of "fear" and "happy" was lower
than the average level. The highest recognition degree was achieved for "neutral", reaching 90%.
Table 6.1: Test results in NAO robots

           Angry  Neutral  Surprise  Happy  Fear  Sad  Accuracy
Angry        19      3        0        1     0     0    82.6%
Neutral       0     37        4        0     0     0    90.24%
Surprise      1      0       20        0     6     0    74.07%
Happy         1      2        3       12     1     1    60%
Fear          0      0        3        0    17     0    85%
Sad           1      1        1        0     2     8    61.53%
Average                                                 75.57%
CHAPTER 7
CONCLUSION
In this study, a CNN model is used as a feature extractor and combined with an RF classifier. On
this basis, a Chinese speech emotion recognition system is designed and applied in the NAO
robot. In the application process, the original Record Sound box is improved and a new Recorder
box is generated. The speech signals collected by the improved Recorder box not only meet the
format requirements of speech emotion recognition, but also meet the needs of studying NAO-based
speech signal analysis, Chinese word segmentation and speech recognition. After systematic
testing, the proposed CNN-RF model gives the NAO robot the basic functions of speech emotion
recognition.
REFERENCES
[1] Juan Pablo Arias, Carlos Busso, Nestor Becerra Yom. Shape-based modeling of the
[2] Yongming Huang, Guobao Zhang, Yue Li, et al. Improved Emotion Recognition with 2014, 8588:706-714.
[3] H. M. Teager, S. M. Teager. Evidence for nonlinear production mechanisms in the vocal tract [J]. Speech Production and Speech Modelling, 1990, 55:241-261.
[4] Chong Feng, Chunhui Zhao. Voice activity detection based on ensemble empirical mode
[5] Lin Yilin, Wei Gang. Speech Emotion Recognition Based on HMM and SVM // Proc
[7] Deng J, Xu X, Zhang Z, et al. Fisher kernels on phase-based features for speech
[9] Lim W, Jang D, Lee T. Speech emotion recognition using convolutional and recurrent // Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on.
Emotion [DB/OL]. 2012/5/17.
[14] Cheng Ming, An approach of speech interaction and software design for humanoid