2 - CNN Based Speaker Recognition in Language and Text Independent Small Scale System

Abstract — Speaker recognition is the ability of a system to recognize a speaker from the set of speaker samples available in the system. It is of two types: one uses a keyword, called text-dependent recognition, and the other can recognize the voice in any language/text, called text-independent speaker recognition. In this paper, a text-independent, language-independent speaker recognition system is implemented using dense and convolutional neural networks. Speaker recognition has found several applications in upcoming electronic products such as personal/home assistants, telephone banking and biometric identification. In this paper, we explore a system that uses MFCC features along with DNN and CNN models for building a speaker recognition system.

Keywords — speaker recognition, neural network, voice sample, language-independent speaker recognition, independent speaker recognition system

I. INTRODUCTION

Speaker recognition is the identification of a person from the characteristics of their voice. Recently, due to the large growth in the field of smart devices, many studies have been conducted on how to identify the speaker so as to give a personalized experience to users. However, most of these studies depend on the user speaking a keyword in order to activate the device that identifies them (text-dependent speaker recognition) [1]. This makes interacting with such devices somewhat monotonous. This study was done in order to overcome those keyword and language barriers and to be able to recognize the user whatever he or she speaks.

II. EARLIER WORK

There are various approaches to solving the speaker recognition problem. A solution is defined by two parameters: the feature of the voice signal to be used, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) and Perceptual Linear Prediction (PLP), and the modelling technique used to learn the voice samples, such as Artificial Neural Networks (ANN), Gaussian Mixture Models (GMM) [12] and vector quantization. Studies have also been done where multiple techniques are combined, such as using different models for feature extraction and classification [8]. Of these choices, a majority of approaches use MFCC as the feature since it is the most effective for speaker recognition [11].

Using MFCC features, [10] studied phoneme-based speaker recognition systems. However, the accuracies seem to fluctuate based on the words and syllables pronounced. Also, relatively lower accuracies have been achieved using GMM [2], vector quantization [3] and Deep Neural Networks (DNN) [4].

III. OUR APPROACH

Fig 1 - High-level architecture
A. Pre Processing

We use Sound eXchange (SoX), an audio processing framework, to remove the noise and silent parts from the audio. The silence is removed by using a threshold; anything below the threshold is removed.

The human voice has a fundamental frequency range of 85 to 180 Hz for males and 165 to 255 Hz for children and females. However, when a person speaks, their frequency isn't fixed; it varies for different words. The energy also spreads to nearby frequencies, which gives the diminishing effect to sound that is necessary to utter certain specific words in certain languages. However, this spread needs to be limited in order to distinguish between voice and noise.

B. Feature Extraction

This frequency warping can allow for a better representation of sound [4]. MFCC has found application in various speech recognition applications.

We have used "python_speech_features", an audio processing library that extracts MFCC features from a given WAV format audio file. Filter bank energies, an intermediate step in the extraction of MFCC, are also used along with the MFCC feature vectors. The frame size is 32 ms with a stride of 16 ms. Only the initial 13 MFC coefficients are useful [10], since the later ones are nearly zero.
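The threshold-based silence removal and 32 ms framing described above can be sketched in plain numpy. The paper itself uses SoX for trimming and python_speech_features for MFCC extraction; the sample rate, energy threshold and synthetic signal below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def remove_silence(signal, frame_len, threshold):
    """Energy-based silence removal: keep only frames whose RMS energy
    is at or above the threshold (a stand-in for the SoX-based trimming
    described in the paper)."""
    frames = [signal[s:s + frame_len]
              for s in range(0, len(signal) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.sqrt(np.mean(f ** 2)) >= threshold]
    return np.concatenate(voiced) if voiced else np.array([])

sr = 16000                                  # assumed sample rate
frame_len = int(0.032 * sr)                 # 32 ms frames, as in the paper
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 120 * t)    # 120 Hz: within the male F0 range
quiet = np.zeros(sr)                        # one second of silent padding
audio = np.concatenate([quiet, tone, quiet])

trimmed = remove_silence(audio, frame_len, threshold=0.01)
print(len(audio), len(trimmed))             # the silent padding is dropped
```

The same 32 ms windows (with a 16 ms stride) are then what the MFCC extraction operates on, so the threshold effectively decides which frames ever reach the feature extractor.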
C. Neural Network
There are two types of approaches to training the neural network using the MFCC vectors.

DNN: The first layer consists of 3000 nodes, followed by 4 layers of 100 nodes each, with a dropout of 0.3 between each of them. The dropouts help in avoiding overfitting of the data.

No. of Speakers   DNN accuracy (%)   CNN accuracy (%)
      5                  70                 76
      3                  75                 75
      4                  75                 87
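The DNN described above (a 3000-node first layer, four 100-node layers, dropout of 0.3 between them) can be sketched as a forward pass in plain numpy. The paper does not state its training framework, so the input size (13 MFCCs per frame), speaker count and weight initialization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate, training):
    """Inverted dropout: only active while training, identity at inference."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

n_mfcc, n_speakers = 13, 5    # 13 MFCCs per frame; 5 enrolled speakers (assumed)
sizes = [n_mfcc, 3000, 100, 100, 100, 100, n_speakers]
weights = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x, training=False):
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:       # hidden layers: ReLU + dropout 0.3
            x = dropout(relu(x), 0.3, training)
    return softmax(x)                  # per-speaker probabilities

probs = forward(rng.normal(size=(4, n_mfcc)))   # batch of 4 MFCC frames
print(probs.shape)                               # (4, 5)
```

The final softmax layer has one node per enrolled speaker, which is why the reported accuracies vary with the number of speakers in the table above.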
REFERENCES