
2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS)

Voice Recognition Based Security System Using Convolutional Neural Network

978-1-7281-8529-3/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICCCIS51004.2021.9397151

Pankaj H. Chandankhede
Department of Electronics & Telecommunication Engineering
G H Raisoni College of Engineering, Nagpur, India
[email protected]

Abhijit S. Titarmare
Department of Electronics & Telecommunication Engineering
G H Raisoni College of Engineering, Nagpur, India
[email protected]

Sarang Chauhvan
Department of Electronics & Telecommunication Engineering
G H Raisoni College of Engineering, Nagpur, India
[email protected]

Abstract— This paper presents a distinctive speech recognition technique based on the planned analysis of speech characteristics using a neural network together with the Google Speech-to-Text API. A multifactor security system is proposed for voice-based identification and authentication. The project follows a unique strategy of independent convolution layers and combines two distinct voice representations: spectrograms and Mel-frequency cepstral coefficients. The study covers the statistical analysis of sound using scaled-up and scaled-down spectrograms; in addition, the Google Speech-to-Text API converts speech into a passcode that can be cross-verified for extended security. The incorporated methodology and the results obtained illustrate the direction of research in this area and encouraged us to advance further in this field.

Keywords— MFCC (Mel-Frequency Cepstrum Coefficients), CNN (Convolutional Neural Network), ANN (Artificial Neural Network), ASR (Artificial Speech Recognition), STT (Speech-to-Text), GUI (Graphic User Interface).

I. INTRODUCTION

Preventing security breaches is supremely important. Standard systems rely on safeguards such as passcodes, fingerprint scanning, palm scanning, and identity verification, all of which can be broken fairly easily. Such systems can be compromised by obtaining the password of a particular system or, since every coin has two sides, by force or some other technique that defeats the palm or fingerprint scanner, causing the loss of confidential data. The proposed system is built so that several safeguards work hand in hand, leading to better protection of documents.

Voice recognition is an excellent technique for security purposes compared with the alternatives, since every person has a unique voice with distinguishable features such as frequency, pitch, and amplitude. Voice recognition is a vital task in the life sciences and is split into two groups: text-dependent and text-independent. In text-dependent systems the user must speak identical words in both the recording and the recognition sessions; in text-independent systems the recording and recognition sessions are entirely different [1, 2]. The voices of individuals are easily distinguishable; people even recognize one another over the phone. In voice recognition, gathering the features of the voice is vital.

The system considered here is designed around a set of speech rules and an encoding of the signal in the form of RGB (Red, Green, and Blue) spectrograms. The present strategy works like a substitution scheme applied to voice recognition, supported by the precise encoding used by the system and the speaker's speech characteristics. The system is designed around a Convolutional Neural Network; implementing a CNN is a somewhat demanding task, but it has shown surprising results and rejects unknown frequency components that are not authorized in the system's database.

The Convolutional Neural Network undergoes a rigorous training phase, as it has to be fed multiple voice samples, and each layer has its own designated task. The planned architecture uses two convolution layers and one fully connected layer. The whole software program is written in the Python programming environment and runs on a Raspberry Pi, with a dedicated neural network structure.

II. SPEECH RECOGNITION PROCESS

The method of speech recognition [3] is a complicated and cumbersome job. The figures below show the steps involved. The voice recognition technique consists of two chief arms: feature extraction and storage, and audio classification prediction. Feature extraction extracts a vital quantity of information from a particular voice sample, which leads to identifying the designated authorized speaker; MFCC is employed as the feature extraction technique [4] in this project. Audio classification prediction is the procedure in which machine learning is used to identify the unknown speaker: the system calculates the information loss, and the smaller the loss, the more accurate the system is. Every speaker has a speaker ID and a data set loaded in the backend of the system; the neural network compares the input with this data set using the intelligence developed during training and extracts the precise output.
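As a concrete illustration of the RGB spectrogram encoding described in the Introduction, the sketch below renders a voice sample as a colour spectrogram image of the kind a CNN can consume. This is a minimal sketch, not the authors' exact code; the file name voice_sample.wav, the 16 kHz sample rate, and the output size are assumptions for illustration.

```python
# Minimal sketch: render a voice sample as an RGB spectrogram image
# for CNN input. Assumes librosa and matplotlib are installed;
# "voice_sample.wav" and the output size are illustrative choices.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

signal, sr = librosa.load("voice_sample.wav", sr=16000)

# Short-time Fourier transform -> magnitude spectrogram in dB.
stft = np.abs(librosa.stft(signal, n_fft=512, hop_length=128))
spec_db = librosa.amplitude_to_db(stft, ref=np.max)

# Save as a colour (RGB) image with axes stripped.
fig, ax = plt.subplots(figsize=(2, 2), dpi=64)
librosa.display.specshow(spec_db, sr=sr, hop_length=128, ax=ax)
ax.set_axis_off()
fig.savefig("spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```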


Fig. 1: Training and testing phases of the system.

III. SPEAKER RECOGNITION

Fig. 2: Overall system design flow.

Speaker recognition is the method of automatically identifying the particular authorized person who is speaking, on the basis of individual details found in audio files, i.e. speaker wave files. This makes it possible to use the speaker's voice to verify their identity and to grant access to services such as voice dialing, mobile phone banking, access to information, mail and voice services, regional security management, and remote connections to machines such as computers. These automatic identification and verification techniques [7] are typically considered the most readily available and inexpensive strategies for keeping unapproved users away from a shared location or machine. Moreover, their availability on everyday devices makes speaker recognition well suited to security applications. The central difficulty of a speaker recognition system is rooted in the study of the speech signal itself: the exceedingly fascinating problem is determining, in the analysis of the speech signal, which characteristics make it distinctive among other signals and what makes one speech signal totally different from another.

A. Mel-Frequency Cepstrum Coefficients

Feature extraction is the extraction of the most effective parametric representation of voice signals in order to produce better recognition performance. The principal objective of feature extraction is to extract characteristics from the speech signal that turn out to be very distinctive to each individual and can then be used to differentiate one speaker from another. It is vital that this stage be efficient, because it affects the behavior of the subsequent stages of the system. The characteristics of the vocal tract are exclusive to every individual speaker, so the impulse response of the vocal tract can be used to differentiate speakers; it can be obtained by applying the Mel-frequency cepstrum coefficients (MFCC) algorithm [5, 6]. MFCC relies on the known variation of the human ear's critical bandwidth with frequency, i.e. it is based on the perception of human hearing, which does not resolve frequencies above 1 kHz linearly. MFCC therefore uses two types of filters: filters spaced linearly at frequencies below 1000 Hz, and filters spaced logarithmically above 1000 Hz. The overall procedure of MFCC is shown in the figure below.

Fig. 3: Block diagram of MFCC.
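The MFCC pipeline of Fig. 3 is available off the shelf in common audio libraries. The following is a minimal sketch, assuming librosa; the file name, 16 kHz sample rate, and 13 coefficients are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of MFCC feature extraction, assuming librosa.
# The file name, sample rate, and coefficient count are illustrative.
import librosa
import numpy as np

signal, sr = librosa.load("voice_sample.wav", sr=16000)

# 13 coefficients per frame is a common choice for speaker features.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)

# A simple fixed-length summary: mean and std of each coefficient
# across frames, usable as one feature vector per utterance.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```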


B. Speech-to-text (STT)

Fig. 4: Block diagram of speech-to-text.

Speech recognition is a relatively new technology that is mainly used to interpret audio wave files or spoken utterances into the corresponding text. The text can take any form: words or a sequence of words, symbols or characters; the input can be a voice command stored as an audio wave file or a direct speech signal; the output might even be sub-word units or phones, but here we simply translate the speech signal into its corresponding text. There are numerous examples of ASR (Artificial Speech Recognition) systems. One of the best examples is YouTube's closed captioning, which uses an ASR engine for effective transcription of the speech in audio and video clips. Further examples are voicemail services that need transcription; they too run an ASR engine in the backend. An earlier prototype of the ASR system was the dictation system, in which words are spoken, or simply a speech signal is given, and the corresponding transcript of the speech signal is produced. A major use of ASR can be seen in well-known systems such as Google Assistant, Cortana, Siri, and Alexa, which use ASR engines in their front end. ASR technology strictly translates spoken utterances into the corresponding text. There are many benefits to having an adequate ASR system: a lot of time is saved if an individual gives voice input that is converted to text instead of being typed. Today's technology exposes many interfaces, which is why many people struggle to cope with it; there is therefore a growing need for a well-developed and stable ASR system that can be used by both literate and illiterate users, so that they can interact better with the growing technology.

Fig. 5: Speech-to-text API flow chart.
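In this project the transcription step relies on the Google Speech-to-Text API. Below is a minimal sketch of that call, assuming the Python SpeechRecognition package (which wraps Google's free web API); the microphone capture flow and the expected passcode string are illustrative assumptions.

```python
# Minimal sketch of passcode verification via Google Speech-to-Text,
# assuming the SpeechRecognition package (pip install SpeechRecognition).
# The expected passcode is a hypothetical placeholder.
import speech_recognition as sr

EXPECTED_PASSCODE = "open sesame"  # hypothetical enrolment phrase

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio)  # sends audio to Google's API
    print("Heard:", text)
    if text.strip().lower() == EXPECTED_PASSCODE:
        print("Passcode verified")
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print("API request failed:", e)
```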
C. Convolutional Neural Network (CNN)

The structure of the visual cortex in the human brain was the pattern for the creation of the Convolutional Neural Network [7, 8]. Because the pixel arrangement in a local area determines the identity, shape, and structure of an object, a CNN analyses the image with the help of tiny local patterns and builds combinations of them, from minimal up to the most complicated shapes. The success of CNNs in object recognition from images is the concept behind solving the speech recognition problem with one, since the CNN's architecture is suited to image input. Thus, encoding the sound as an image is part of our method. Generally, a CNN consists of convolution layers, pooling layers, and fully connected layers. The convolution and pooling layers combine to form the internal structure, while the fully connected layers are responsible for generating the class probabilities.

Fig. 6: Convolutional Neural Network block diagram.
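The paper's stated architecture is two convolution layers followed by one fully connected layer. A minimal Keras sketch of such a network is given below; the 128x128 RGB input size, filter counts, and number of enrolled speakers are assumptions for illustration, not values reported by the authors.

```python
# Minimal sketch of the stated architecture (two convolution layers,
# one fully connected layer), assuming TensorFlow/Keras. Input size,
# filter counts, and the number of enrolled speakers are illustrative.
from tensorflow.keras import layers, models

NUM_SPEAKERS = 3  # hypothetical number of enrolled speakers

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),          # RGB spectrogram image
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(NUM_SPEAKERS, activation="softmax"),  # class probabilities
])
model.summary()
```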


3.1 Convolutional Layer

Convolutional layers (CLs) contain neurons that are connected to a receptive field in the previous layer, and the neurons in the same feature map share equal weights. A CL consists of a group of learnable filters; a specific filter activates when a specific shape or colour blob occurs in its area. Every CL consists of multiple filters, and each filter has a set of learnable weights that correspond to the neurons of the previous layers. A filter is small spatially but extends along the total depth of the previous layer (usual filter sizes are 3x3 and 5x5, rarely 7x7).

The number of learnable filters determines the depth of the CL, i.e. the number of feature maps computed from the image. Each neuron in a CL uses the weights of exactly one filter, so many neurons share equal weights; the filters thereby classify the neurons of a CL into feature maps. Each neuron in a CL specifies a local area of connectivity onto the previous layer, and all neurons in the same CL have receptive fields of identical size. The receptive field of a neuron is formed by all the connections of that neuron; its volume equals the product of the filter size of the specific CL and the depth of the previous layer. The activation is calculated by applying an activation function to the potential; most of the time the ramp function, also called the ReLU unit, is used. A specific feature that matters in one part of the image is likely to be essential in the rest of the image as well. The most important hyper-parameter after the number of filters is the stride. The dimension of the neuron's receptive field and the size of the image are considered when determining the stride. If the parameters are chosen incorrectly, the stride should be changed, or zero padding should be applied, in order to normalize images of various shapes or to keep a specific input size.
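To make the stride and zero-padding remarks concrete, here is a small NumPy sketch of a single 3x3 filter applied with a configurable stride and "same" zero padding. The input image and filter values are made up for the example (strictly speaking, CNN frameworks compute cross-correlation, as done here).

```python
# Small sketch of one 3x3 convolution filter with zero padding that
# preserves the input size. Input and filter values are made up.
import numpy as np

def conv2d_same(image, kernel, stride=1):
    k = kernel.shape[0]
    padded = np.pad(image, k // 2)  # zero padding around the border
    h = (image.shape[0] - 1) // stride + 1
    w = (image.shape[1] - 1) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[-1, 0, 1]] * 3, dtype=float)  # detects vertical edges
print(conv2d_same(image, edge_filter))           # stride 1: 5x5 output
print(conv2d_same(image, edge_filter, stride=2)) # stride 2: 3x3 output
```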
3.2 Pooling Layer

The pooling layer (PL) is an efficient way of performing nonlinear down-sampling. Like the CL it has a receptive field and a stride, but it does not add any learnable parameters. A PL is usually placed after a CL [13]. The receptive field of a neuron in a PL is two-dimensional. Max pooling is the most frequently used PL: each neuron outputs the maximum of its receptive field. Usually the stride equals the size of the receptive field, so the receptive fields touch but do not overlap; in most cases the stride and receptive-field size are 2x2. Max pooling amplifies the most prominent feature (pattern) in its receptive field and throws away the remainder. The intuition is that, once a feature has been found, its rough location relative to the other features is more important than its exact location. The PL effectively reduces the spatial size of the representation without adding new parameters, reducing the parameter count for later layers and making the computation more feasible. Because of its destructiveness (even a small 2x2 receptive field discards 75% of the input information), the current trend prefers stacks of CLs, possibly with stride, and uses PLs sparingly or discards them altogether.
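The 75% figure can be checked directly: a 2x2 max pool with stride 2 keeps one value out of every four. A tiny NumPy illustration follows; the input matrix is made up for the example.

```python
# Tiny illustration of 2x2, stride-2 max pooling: one value survives
# out of every four, i.e. 75% of the input is discarded.
# The 4x4 input matrix is made up for the example.
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 3, 2],
              [2, 2, 4, 1]])

# Reshape into 2x2 blocks, then take the max of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 5]
               #  [2 4]]
```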
Fig. 7: Assumed network structure.

3.3 Back-Propagation

The forward evaluation is entirely in step with a feed-forward neural network: the activations, or the input data, are passed to the succeeding layers, and the activation function is applied to the computed scalar product. Two or three fully connected layers are assembled at the end of the network. If the gradient descent learning algorithm is to be used, the gradient must be computed first. In our project we have used the typical back-propagation algorithm, which comprises two technical refinements. The traditional back-propagation algorithm calculates the various partial derivatives with respect to weights belonging to neurons inside the same filter; thus the derivatives of the loss function with respect to the corresponding weights of neurons belonging to the same feature map are summed together. With max pooling, the back-propagated error is routed only to those neurons that were not filtered out. To speed up back-propagation, it is common to record the indices of the selected neurons during forward propagation.
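Training the sketched model with gradient descent and back-propagation is a one-liner in Keras, which records the forward-pass pooling indices and routes gradients through the max pool automatically. A minimal continuation of the earlier sketch (it reuses model and NUM_SPEAKERS from that sketch; the training arrays, epochs, and batch size are placeholders):

```python
# Minimal sketch of training with back-propagation, continuing the
# Keras model above. The random arrays stand in for real enrolment
# recordings; epochs and batch size are illustrative guesses.
import numpy as np

X_train = np.random.rand(24, 128, 128, 3).astype("float32")  # placeholder spectrograms
y_train = np.random.randint(0, NUM_SPEAKERS, size=24)        # placeholder speaker IDs

model.compile(optimizer="sgd",  # plain gradient descent with back-propagation
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=30, batch_size=8)
```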

B. Hardware Design

Required hardware components:

1) TIP 41 NPN transistors (3 nos.)
2) Raspberry Pi 3 B+ model
3) HDMI monitor
4) Condenser mic
5) Resistors (3 nos. 100 ohm, 3 nos. 1.8 kohm)
6) LEDs
7) Push button
8) Lamps (3 nos.) and holders (3 nos.)
9) Power supplies (12 V, 1 A) and (5 V, 2 A)
10) Flyback diode (1N4007)
11) Power relays (3 nos.), 12 V
12) Solenoid lock
13) Acrylic sheet
14) Jumper wires
15) SMPS (Switch Mode Power Supply) (12 V, 1 A)

Fig. 8: Schematic of the intermediate circuit.

Fig. 9: PCB of the circuit.

Fig. 10: Complete hardware circuit.

Connections:

This is the driver board, which drives the relays and output devices. The transistor bases are connected to the Raspberry Pi GPIOs through the 100-ohm resistors, and the connectors control the output devices through the code. The TIP 41 emitters are grounded, and the collectors are connected through the flyback diode so as to avoid the reverse-current spike from the uninterrupted power supply; the base is provided with VCC to drive the board. An additional coupling capacitor is used to smooth the current flow, and three diodes are connected in parallel to verify that each transistor is working properly.

Additional connections:

The collector is connected to the relays and the solenoid lock through the connector. The relays are connected to the lamps, which themselves need the AC 230 V power supply. The solenoid lock is driven with a DC voltage according to the comparisons and operations performed.
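On the software side, each relay channel reduces to toggling one GPIO pin once the CNN has confirmed the speaker. A minimal sketch using the standard RPi.GPIO library follows; the BCM pin numbers and the speaker-to-relay mapping are assumptions, not the authors' actual wiring.

```python
# Minimal sketch of driving the relay board from the Raspberry Pi,
# assuming the RPi.GPIO library. The BCM pin numbers and the
# speaker-to-relay mapping are hypothetical.
import RPi.GPIO as GPIO
import time

RELAY_PINS = {0: 17, 1: 27, 2: 22}  # hypothetical speaker ID -> BCM pin

GPIO.setmode(GPIO.BCM)
for pin in RELAY_PINS.values():
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.LOW)

def switch_lamp(speaker_id, on=True):
    """Drive the transistor base for the relay mapped to this speaker."""
    GPIO.output(RELAY_PINS[speaker_id], GPIO.HIGH if on else GPIO.LOW)

switch_lamp(0, on=True)   # e.g. speaker ID 0 turns Lamp 1 on
time.sleep(5)
switch_lamp(0, on=False)
GPIO.cleanup()
```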
IV. CONCLUSION AND FUTURE SCOPE

Our studies found that the combination of Mel-frequency features and a Convolutional Neural Network provides the best accuracy and the most effective performance. They also suggest that, to obtain satisfactory results, the maximum number of epochs needs to be increased as the information in the system grows with the increasing number of speakers. Therefore, to attain higher performance, the speaker training sessions in which voice samples are given to the system need to be repeated, so as to update the speaker-specific codebooks in the database, because psychophysical studies show a likelihood that human speech features may vary over a period of 2-3 years. The current study contains detailed work on MFCC, CNN, and speech-to-text, which can also be used to improve the effectiveness and accuracy of the system in coping with background noise, laughter, and atypical sounds. Improvement can be obtained by increasing the size of the reference database. Furthermore, if voice-activity detection is united with this procedure, speech recognition can be performed on live voices and speech.

V. RESULTS

Fig. 11: Result displayed with GUI (1).

This is the graphic user interface displayed on the HDMI monitor. The GUI shows the unique speaker ID and the output peripheral devices, with the lamps and lock shown in their HIGH or LOW state. The MFCC spectrogram is also displayed along with the MFCC graph: we give live speech through the microphone, and the spectrogram and graph are displayed according to the speech.

Fig. 12: Result displayed with GUI (2).

The unique speaker ID is already declared in the training period to control the output device. In the above figure, live speech is taken, and the CNN predicts the unique ID and controls the device as per the command given by the speaker. Here Lamp 1 is turned on, displaying Speaker ID 0, while the other two remain in the LOW state; with Speaker ID 4 the solenoid lock locks and unlocks.


REFERENCES

[1] H. Jiang, "Discriminative training for automatic speech recognition: A survey," Comput. Speech Lang., vol. 24, no. 4, pp. 589–608, 2010.
[2] F. Akdeniz and Y. Becerikli, "Performance Comparison of Support Vector Machine, K-Nearest-Neighbor, Artificial Neural Networks, and Recurrent Neural Networks in Gender Recognition from Voice Signals," 2019 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 2019, pp. 1-4, doi: 10.1109/ISMSIT.2019.8932818.
[3] P. H. Chandankhede and M. M. Khanapurkar, "Design of CAN-Based Enhanced Event Data Recorder and Evidence Collecting System," in Proceedings of the International Conference on Recent Cognizance in Wireless Communication & Image Processing, pp. 115-122. Springer, New Delhi, 2016.
[4] A. S. Titarmare, M. M. Khanapurkar, and P. H. Chandankhede, "Analysis of Traffic Flow at Intersection to Avoid Accidents using Nagel-Schreckenberg (NS) Model," in 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 478-484. IEEE, 2020.
[5] L. Deng and X. Li, "Machine learning paradigms for speech recognition: An overview," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 1060–1089, May 2013.
[6] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine," Adv. Neural Inf. Process. Syst., 2010.
[7] A. Bajpai, U. Varshney and D. Dubey, "Performance Enhancement of Automatic Speech Recognition System using Euclidean Distance Comparison and Artificial Neural Network," 2018 3rd International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU), Bhimtal, 2018, pp. 1-5, doi: 10.1109/IoT-SIU.2018.8519839.
[8] R. Jagiasi, S. Ghosalkar, P. Kulal and A. Bharambe, "CNN based speaker recognition in language and text-independent small scale system," 2019 Third International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 2019, pp. 176-179, doi: 10.1109/I-SMAC47947.2019.9032667.
[9] A. Mohamed, T. Sainath, G. Dahl, B. Ramabhadran, G. Hinton, and M. Picheny, "Deep belief networks using discriminative features for phone recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2011, pp. 5060–5063.
[10] A. Azarang, J. Hansen and N. Kehtarnavaz, "Combining Data Augmentations for CNN-Based Voice Command Recognition," 2019 12th International Conference on Human System Interaction (HSI), Richmond, VA, USA, 2019, pp. 17-21, doi: 10.1109/HSI47298.2019.8942638.
[11] D. Yu, L. Deng, and G. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS Workshop Deep Learn. Unsupervised Feature Learn., 2010.
[12] A. Sokolov and A. V. Savchenko, "Voice command recognition in intelligent systems using deep neural networks," 2019 IEEE 17th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herlany, Slovakia, 2019, doi: 10.1109/SAMI.2019.8782755.
[13] S. S. Salankar and B. M. Patre, "SVM based model as an optimal classifier for the classification of sonar signals," International Journal of Computer, Information, and Systems Science, and Engineering, vol. 1, no. 1, pp. 68-76, 2007.
