
Proceedings of the Third International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2019)

IEEE Xplore Part Number:CFP19OSV-ART; ISBN:978-1-7281-4365-1

CNN based speaker recognition in language and text-independent small scale system

Rohan Jagiasi, Department of Information Technology, VES’s Institute of Technology, Mumbai, India, [email protected]
Shubham Ghosalkar, Department of Information Technology, VES’s Institute of Technology, Mumbai, India, [email protected]
Punit Kulal, Department of Information Technology, VES’s Institute of Technology, Mumbai, India, [email protected]
Asha Bharambe, Department of Information Technology, VES’s Institute of Technology, Mumbai, India, [email protected]

Abstract — Speaker recognition is the ability of a system to recognize a speaker from the set of speaker samples available in the system. It is of two types: one uses a keyword and is called text-dependent, while the other can recognize the voice in any language or text and is called text-independent speaker recognition. In this paper, a text-independent, language-independent speaker recognition system is implemented using dense and convolutional neural networks. Speaker recognition has found several applications in upcoming electronic products such as personal/home assistants, telephone banking and biometric identification. In this paper, we explore a system that uses MFCC along with DNN and CNN as the model for building a speaker recognition system.

Keywords — speaker recognition, neural network, voice sample, language-independent speaker recognition, text-independent speaker recognition system

I. INTRODUCTION

Speaker recognition is the identification of a person from the characteristics of their voice. Recently, with the large growth in smart devices, many studies have been conducted on how to identify the speaker so as to give a personalized experience to the user. However, most of these studies depend on the user speaking a keyword in order to activate the device that identifies them (text-dependent speaker recognition) [1]. This makes interacting with the devices somewhat monotonous. This study has been done in order to overcome those keyword and language barriers and to recognize the user whatever they speak.

II. EARLIER WORK

There are various approaches to solving the speaker recognition problem. A solution is defined by two parameters: the feature of the voice signal to be used, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) or Perceptual Linear Prediction (PLP), and the modelling technique used to learn the voice samples, such as Artificial Neural Networks (ANN), Gaussian Mixture Models (GMM) [12] or vector quantization. Studies have also been done where multiple techniques are combined, such as using different models for feature extraction and classification [8]. Of these choices, the majority of approaches use MFCC as the feature, since it is the most effective for speaker recognition [11].

Using MFCC features, [10] studied phoneme-based speaker recognition systems. However, the accuracies seem to fluctuate based on the words and syllables pronounced. Also, relatively lower accuracies have been achieved using GMM [2], vector quantisation [3] and Deep Neural Networks (DNN) [4].

III. OUR APPROACH

Fig 1 - High-level architecture


Expected input: The speaker signal is expected to be an 8-44 kHz uncompressed audio file in WAV format.
● Preprocessing block: In this block, the voice samples are preprocessed to minimize/filter the noise content and eliminate the silent parts of the signal.
● Feature extraction block: This block is responsible for extracting MFCCs. It then reduces the dimension of these vectors and passes them to the speaker modelling block.
● Speaker modelling block: This block takes MFCCs as input and builds a model.
● Computing likelihood: This block is present only in the recognition phase. It searches for a match in the trained model to identify the speaker and returns it as output.
Expected output: Name of the speaker.

As mentioned before, we use neural networks, which is a machine learning approach. Like every machine learning system, it has two phases: training and testing.

The system starts by recording the user speaking a paragraph from any article for a specific amount of time (1 minute in our case). The voice signal is then preprocessed using the SoX framework to eliminate noise and silence from the audio. MFCCs (Mel Frequency Cepstral Coefficients) are then extracted from the processed voice signal using a library. These features are then fed to a neural network. After all the speakers have been enrolled, the model is tested.

Testing is fairly simple: the user speaks a couple of words, which are recorded and pre-processed, and the extracted features are tested on the Convolutional Neural Network (CNN) model built during the training phase of the system.

A. Pre Processing
We use Sound eXchange (SoX), an audio processing framework, to remove the noise and silent parts from the audio. The silence is removed by using a threshold; anything below the threshold is removed.

The human voice has a fundamental frequency range of 85 to 180 Hz for males and 165 to 255 Hz for children and females. However, when a person speaks, their frequency isn’t fixed; it varies for different words. The energy also spreads to nearby frequencies, which gives a diminishing effect to sound that is necessary to utter certain specific words in certain languages. However, this spread needs to be limited in order to distinguish between voice and noise.

The application uses a sampling rate of 44.1 kHz (subject to availability) to obtain good quality audio files for better results. It has been noticed that the majority of the information is contained in the 0-8000 Hz band. A low pass filter is, therefore, applied to remove higher frequency sounds, which are mostly ambient noise. The obtained audio signal is saved in the application and will be referred to as processed audio from here on.

Fig 2 shows the original raw voice sample; the processed sample is shown in Fig 3.

Fig 2 - Original Voice Sample

Fig 3 - Processed using SoX script
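The paper does not list the exact SoX invocation; the following is a minimal sketch of this preprocessing step, assuming SoX is installed and called from Python, with hypothetical file names and an illustrative silence threshold.

```python
import subprocess

def preprocess(raw_wav: str, clean_wav: str) -> None:
    """Resample, low-pass filter and strip silence with SoX (illustrative settings)."""
    subprocess.run(
        [
            "sox", raw_wav,
            "-r", "44100",                  # resample to 44.1 kHz where available
            clean_wav,
            "lowpass", "8000",              # keep the 0-8000 Hz band, drop higher-frequency ambient noise
            "silence", "1", "0.1", "1%",    # trim leading silence below a 1% amplitude threshold
            "-1", "0.1", "1%",              # and silent stretches inside the recording
        ],
        check=True,
    )

preprocess("raw_sample.wav", "processed_sample.wav")
```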
B. Feature Extraction
MFCCs are computed on the Mel frequency scale [4]. This frequency warping allows for a better representation of sound, and MFCC has found application in various speech recognition systems.

We have used “python_speech_features”, an audio processing library that extracts MFCC features from a given WAV format audio file. Filter bank energies, which are an intermediate step in the extraction of MFCCs, are also used along with the MFCC feature vectors. The frame size is 32 ms with a stride of 16 ms. Only the initial 13 MFC coefficients are useful [10], since the later ones are nearly zero.
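As a rough sketch of this feature extraction step: the exact parameters used by the authors are not given, so the 52-dimensional layout of 13 MFCCs, 13 deltas, 13 accelerations and 13 log filter bank energies described later in the paper is assembled here under assumed settings (nfilt=13 and nfft=2048 are assumptions).

```python
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank, delta

def frame_features(wav_path: str) -> np.ndarray:
    """Return one 52-dimensional feature vector per 32 ms frame (16 ms stride)."""
    rate, signal = wavfile.read(wav_path)

    # 13 MFCCs per frame; nfft=2048 covers a 32 ms window at 44.1 kHz
    mfccs = mfcc(signal, samplerate=rate, winlen=0.032, winstep=0.016,
                 numcep=13, nfft=2048)
    deltas = delta(mfccs, 2)       # first-order differences of the MFCCs
    accel = delta(deltas, 2)       # second-order differences ("acceleration")
    # 13 log filter bank energies per frame (nfilt=13 is assumed to match the paper)
    fbanks = logfbank(signal, samplerate=rate, winlen=0.032, winstep=0.016,
                      nfilt=13, nfft=2048)

    return np.hstack([mfccs, deltas, accel, fbanks])   # shape: (num_frames, 52)

features = frame_features("processed_sample.wav")
```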


C. Neural Network
There are two approaches to training the neural network using the MFCC vectors: stacking several consecutive frames into one input, or feeding each frame independently.

Every speaker read a paragraph from an English newspaper for the training set. These 1-minute training samples were then pre-processed, and each frame was fed to the neural network.

Nearly every paper suggests that features of up to 40 frames be stacked and given collectively as input. This is necessary for text-dependent systems, where the sequence of words/frames matters. Since we focus only on the text-independent aspect and are not concerned with the order of frames, the size of the input vector corresponds only to the features of one frame. Every frame has 52 features, which include 13 MFCCs, 13 deltas derived from the MFCCs, 13 accelerations derived from the deltas [7] and 13 filter bank energies [1].

We have adopted an approach that tests two learning models, namely DNN and CNN. We use a DNN because, as stated in [5], a DNN provides better noise immunity than the next best performing model, i.e. GMM. A CNN is tried because CNNs are innately suited to identifying patterns in the data/features and scale well, which is of essence in this problem [6][9].

The neural networks have been implemented on the TensorFlow platform using the Keras APIs.

DNN: The first layer consists of 3000 nodes, followed by 4 layers of 100 nodes each, with a dropout of 0.3 between each of them. The dropouts help avoid overfitting.
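A minimal Keras sketch of the DNN described above. The hidden-layer activation and the exact dropout placement are not fully specified in the paper, so tanh and per-layer dropout are assumptions, and num_speakers is a placeholder.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_dnn(num_speakers: int) -> Sequential:
    """DNN: 3000-node layer, then 4 x 100-node layers with dropout 0.3, softmax output."""
    model = Sequential()
    model.add(Dense(3000, activation="tanh", input_shape=(52,)))  # 52 features per frame
    for _ in range(4):
        model.add(Dropout(0.3))
        model.add(Dense(100, activation="tanh"))
    model.add(Dropout(0.3))
    model.add(Dense(num_speakers, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```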

CNN: The CNN contains 4 layers: 3 convolution layers and a dense layer at the end. The first 2 layers use 52 convolution kernels with filter lengths (window sizes) of 13 and 7 respectively. The third convolution layer uses 13 output kernels with a window size of 3. The output is then flattened and a dense layer with 1000 nodes follows. A dropout of 0.25 is added at this stage to avoid overfitting. Then the final output layer follows. All layers use ‘tanh’ as the activation function, except the final layer, which uses ‘softmax’.

The loss was calculated using categorical cross-entropy and the optimizer used was Adam.
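A corresponding Keras sketch of the CNN. The input is assumed to be the 52 per-frame features treated as a length-52 sequence with one channel; this interpretation is not spelled out in the paper.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense, Dropout

def build_cnn(num_speakers: int) -> Sequential:
    """CNN: three Conv1D layers, a 1000-node dense layer with dropout 0.25, softmax output."""
    model = Sequential()
    model.add(Conv1D(52, 13, activation="tanh", input_shape=(52, 1)))  # 52 kernels, window 13
    model.add(Conv1D(52, 7, activation="tanh"))                        # 52 kernels, window 7
    model.add(Conv1D(13, 3, activation="tanh"))                        # 13 kernels, window 3
    model.add(Flatten())
    model.add(Dense(1000, activation="tanh"))
    model.add(Dropout(0.25))
    model.add(Dense(num_speakers, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

Under this reading, each 52-dimensional frame vector would be reshaped to (52, 1) before being fed to the model.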

Fig 4 - Diagram of CNN Model

IV. DEPLOYMENT
The application was deployed using the Flask framework for the front end, and the recordings were taken through the web app. Training samples were 1 minute long and test samples were recorded for 10 seconds. The system was deployed on a laptop with an Intel i3 dual-core processor and 8 GB of RAM, and could be accessed via a computer or mobile phone.
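The paper does not show the web code; the following is a minimal illustrative Flask endpoint, assuming the browser posts the recorded WAV file as a multipart form field named audio (the route and field names are hypothetical).

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/enroll", methods=["POST"])
def enroll():
    """Receive a recorded training sample and store it for preprocessing."""
    audio = request.files["audio"]          # WAV blob recorded in the browser
    path = f"recordings/{audio.filename}"
    audio.save(path)                        # kept for the SoX/MFCC pipeline sketched above
    return jsonify({"status": "saved", "path": path})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)      # reachable from a phone on the same network
```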

V. RESULTS AND DISCUSSION
The voice sample is recorded and processed as above. 50 frames are extracted from the processed audio and all of them are tested on the neural net. The probability outputs of all the frames are added, and the speaker with the highest total probability is returned as the output. The test set included voice samples from the speakers in multiple languages, mainly English, Hindi and Marathi.
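A sketch of the frame-level probability summation described above, assuming a trained Keras model and the frame_features helper sketched earlier (both names are placeholders).

```python
import numpy as np

def identify_speaker(model, wav_path: str, speaker_names: list[str]) -> str:
    """Sum per-frame softmax outputs over 50 frames and pick the most likely speaker."""
    frames = frame_features(wav_path)[:50]      # 50 frames of 52 features each
    frames = frames.reshape(-1, 52, 1)          # shape expected by the Conv1D sketch
    probs = model.predict(frames)               # (50, num_speakers) softmax outputs
    totals = probs.sum(axis=0)                  # add probabilities across frames
    return speaker_names[int(np.argmax(totals))]
```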

Table 1 - Observations for voice recorded under lab conditions (speaker recognition dataset from OpenSLR, Tsinghua University)

No. of speakers | DNN accuracy (%) | CNN accuracy (%)
5               | 70               | 76
10              | 67               | 68
15              | 70               | 72
20              | 55               | 58
35              | 55               | 58
50              | 61               | 71

Table 2 - Observations for real-world voice samples

No. of speakers | DNN accuracy (%) | CNN accuracy (%)
2               | 100              | 100
3               | 75               | 75
4               | 75               | 87
5               | 80               | 90
6               | 65               | 78
7               | 64               | 77
8               | 58               | 75

Fig 5 - Observations for voice recorded under lab conditions (speaker recognition dataset from OpenSLR, Tsinghua University)

Fig 6 - Observations for real-world voice samples

VI. OBSERVATION

CNN: Up to 10 speakers can be enrolled in the network for a decent accuracy of 75-80% on real-world samples (Table 2 and Fig 6). For voice recorded under lab conditions, the model gives an accuracy of around 70% for up to 50 speakers, with some inconsistencies, as seen in Table 1.

DNN: The DNN gives an accuracy of 75-80% for 5 speakers, and then its performance degrades (Table 1 and Fig 5).

This is consistent with the fact that the CNN is a learning model that excels at identifying patterns in the input and can scale much better than the DNN.

VII. CONCLUSION AND FUTURE SCOPE

The language-independent, text-independent speaker recognition system was developed with the idea that future smart devices should have the ability to recognize their user’s voice without the boundaries of languages and keywords. An accuracy of 75-80% was achieved using the CNN model (Table 2). Further studies can be conducted to improve the accuracy of the model and to scale up the number of users.

Speaker recognition can be applied to multiple domains across the industry. In the IoT world, it can change the way we interact with smart devices. Without saying “Ok Google” or “Hey Siri”, one would be able to communicate with such devices in natural language. Also, the device would be able to personalize the experience for the recognized user. Speaker recognition is also useful in biometric verification. In its current state, it can be used to build a simple lab attendance system which would not require any specialized biometric device.

REFERENCES

[1] Zhang, Chunlei, and Kazuhito Koishida, "End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances," in Interspeech, pp. 1487-1491, 2017.
[2] Li, Chao, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep Speaker: an End-to-End Neural Speaker Embedding System," arXiv preprint arXiv:1705.02304, 2017.
[3] Geeta Nijhawan and M. K. Soni, "Speaker Recognition using MFCC and Vector Quantisation," International Journal on Recent Trends in Engineering and Technology, vol. 11, no. 1, July 2014.
[4] Shahenda Sarhan, Mohamed Abu ElSoud, and Nagham Mohammed Hasan, "Text Independent Speaker Identification Based on MFCC and Deep Neural Networks," July 2015, https://www.researchgate.net/publication/291165354
[5] Ahilan Kanagasundaram, David Dean, Sridha Sridharan, and Clinton Fookes, "DNN Based Speaker Recognition on Short Utterances," arXiv:1610.03190, October 2016.
[6] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep Speaker: an End-to-End Neural Speaker Embedding System," arXiv:1705.02304v1 [cs.CL], 5 May 2017.
[7] Zhenhao Ge, Ananth N. Iyer, Srinath Cheluvaraja, Ram Sundaram, and Aravind Ganapathiraju, "Neural Network Based Speaker Classification and Verification Systems with Enhanced Features," Intelligent Systems Conference 2017, 7-8 September 2017.
[8] Fred Richardson, Douglas Reynolds, and Najim Dehak, "Deep Neural Network Approaches to Speaker and Language Recognition," IEEE Signal Processing Letters, vol. 22, no. 10, October 2015.
[9] Amirsina Torfi, Nasser Nasrabadi, and Jeremy Dawson, "Text-Independent Speaker Verification Using 3D Convolutional Neural Networks," arXiv:1705.09422v4, 6 November 2017.
[10] Zhang, Shi-Xiong, Zhuo Chen, Yong Zhao, Jinyu Li, and Yifan Gong, "End-to-End Attention Based Text-Dependent Speaker Verification," in 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 171-178, IEEE, 2016.
[11] Qiyue Liu, Mingqiu Yao, Han Xu, and Fang Wang, "Research on Different Feature Parameters in Speaker Recognition," Journal of Signal and Information Processing, vol. 4, pp. 106-110, 2013.
[12] Athira Aroon and S. B. Dhonde, "Speaker Recognition System using Gaussian Mixture Model," International Journal of Computer Applications (0975 – 8887), vol. 130, no. 14, 2015.
Computer Applications (0975 – 8887), Volume 130 – No.14
