
Asian Journal of Science and Applied Technology
ISSN: 2249-0698 Vol. 5 No. 2, 2016, pp.23-30
© The Research Publication, www.trp.org.in

A New Human Voice Recognition System


Pala Mahesh Kumar
SAK Informatics Pvt. Ltd., Andhra Pradesh, India
Email:[email protected]

Abstract - In an effort to provide a more efficient representation of the speech signal, the application of wavelet analysis is considered. This research presents an effective and robust method for extracting features for speech processing. We propose a new human voice recognition system using the combination of the decimated wavelet (DW) and the Relative Spectral Algorithm (RASTA) with Linear Predictive Coding (LPC). First, the proposed techniques are applied to the training speech signals to form a training feature vector containing the extracted low-level features, namely the wavelet and linear predictive coefficients. The same process is then applied to the testing speech signals to form a test feature vector. The two feature vectors are compared by calculating the Euclidean distance between them to identify the speech and the speaker: if the distance between the two vectors is near zero, the tested speech/speaker is matched with the trained speech/speaker. Simulation results have been compared with the LPC scheme and show that the proposed scheme performs better than the existing technique. Using fifty preloaded voice signals from six individuals, verification tests were carried out and an accuracy rate of approximately 90% was achieved.
Keywords: Human Voice, Decimated Wavelet, LPC, RASTA, Euclidean Distance

I. INTRODUCTION

In our everyday lives, the audio signal, and especially the voice signal, has become a major element, because it serves as one of the main tools for communicating with each other. Moreover, with modern software applications the voice signal can be modified or further processed, and it can be utilized in various applications such as security. Speech processing systems play a vital role in many applications, such as speech recognition and voice communication. Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech signal using computers or electronic circuits. Automatic speech recognition methods, investigated for many years, have been principally aimed at realizing transcription and human-computer interaction systems. The first technical paper on speech recognition intensified research in this field, and speech recognizers for communicating with machines through speech have since been constructed, although they remain of limited use.

II. SPEECH RECOGNITION

Most speech recognition systems can be classified according to the following categories:

a. Speaker Dependent vs. Speaker Independent
A speaker-dependent speech recognition system is one that is trained to recognize the speech of only one speaker. Such systems are custom built for a single person and are hence not commercially viable. Conversely, a speaker-independent system is one that recognizes the speech of any speaker. Such independence is hard to achieve, as speech recognition systems tend to become attuned to the speakers they are trained on, resulting in error rates that are higher than those of speaker-dependent systems.

b. Isolated vs. Continuous
In isolated speech, the speaker pauses momentarily between every word, while in continuous speech the speaker speaks in a continuous and possibly long stream, with little or no breaks in between. Isolated speech recognition systems are easier to build, as it is trivial to determine where one word ends and another starts, and each word tends to be more cleanly and clearly spoken. Words spoken in continuous speech, on the other hand, are subject to the co-articulation effect, in which the pronunciation of a word is modified by the words surrounding it. This makes training a speech system difficult, as there may be many inconsistent pronunciations for the same word.

III. WAVELET ANALYSIS

The basic idea of this proposal is to use wavelets as a means of extracting features from a voice signal. The wavelet technique is relatively new in the field of signal processing compared to the methods currently employed there, the Fourier Transform (FT) and the Short-Time Fourier Transform (STFT) [1][2]. Severe limitations imposed by both the FT and the STFT render them ineffective in analyzing complex and dynamic signals such as the voice signal [3][4]. To overcome the shortcomings of these common signal processing methods, the wavelet signal processing technique is used. The wavelet technique extracts the features of the voice signal by processing data at different scales, manipulating the scales to give a higher correlation in detecting the various frequency components of the signal. These features are then further processed to construct the voice recognition system. Extracting the features of the voice signal does not limit this technique to a particular application alone; it opens the door to a wide range of possibilities, as different

23 AJSAT Vol.5 No.2 July-December 2016



applications can benefit from the voice-extracted features. Applications such as speech recognition systems, speech-to-text translators, and voice-based security systems are some of the future systems that can be developed.

3.1 Fourier Transform

A signal can often be analyzed more effectively in the frequency domain than in the time domain, because more of its characteristics are exposed there. One way to transform a signal from the time domain to the frequency domain is the Fourier transform (FT), a mathematical approach that breaks the signal down into sinusoids of different frequencies. The FT has the drawback that it works only for stationary signals, whose properties do not vary with time: because the FT is applied to the entire signal rather than to segments of it, a non-stationary signal, which varies over time, cannot be adequately represented. A further drawback of the FT is that it cannot tell at what time a particular event occurred.

3.2 Short-Time Fourier Analysis

To correct this deficiency of the FT, Dennis Gabor in 1946 introduced a technique called windowing, which analyzes one small section of the signal at a time. This adaptation is called the Short-Time Fourier Transform (STFT), which maps the signal into both time and frequency information. In the STFT, however, the window is fixed: it does not change with the time behavior of the signal, so the same window must serve for both narrow and wide resolution, and the frequency content cannot be resolved well at every time interval. To overcome the drawbacks of the STFT, the wavelet technique was introduced, with a variable window size. Wavelet analysis allows the use of long time intervals where more precise low-frequency information is wanted, and shorter regions where high-frequency information is wanted. Fig. 1 compares the FT, STFT, and wavelet transform on an example input signal, showing how each transformation extracts the frequency information of the signal; the graphical representation shows that the wavelet yields more features than the FT and STFT. Wavelet analysis is also called multi-resolution analysis (MRA). Here is what this looks like in contrast with the time-based, frequency-based, and STFT views of a signal:

Fig.1 Comparison of FT, STFT and Wavelet Analysis of a Signal
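The FT's loss of time information can be checked with a small sketch (pure Python; the burst signal is a made-up example, not data from the paper). A burst placed at the start of the analysis window and the same burst placed at the end are circular shifts of one another, so their DFT magnitude spectra are identical:

```python
import cmath

def dft_magnitudes(x):
    """Naive DFT of a real sequence; returns the magnitude spectrum."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# The same short burst placed at the start and at the end of the window.
burst = [1.0, -1.0, 1.0, -1.0]
early = burst + [0.0] * 12          # burst occurs early
late = [0.0] * 12 + burst           # burst occurs late (a circular shift)

mag_early = dft_magnitudes(early)
mag_late = dft_magnitudes(late)

# A time shift changes only the phase of the DFT, never the magnitude,
# so the two spectra agree: the FT reports WHICH frequencies occur,
# but not WHEN they occur.
assert all(abs(a - b) < 1e-9 for a, b in zip(mag_early, mag_late))
```

This is exactly the drawback the text describes: the magnitude spectrum alone cannot locate the event in time.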

3.3 Decimated Wavelet (DW)

The Discrete Wavelet Transform (DWT) is a revised version of the Continuous Wavelet Transform (CWT). The DWT compensates for the huge amount of data generated by the CWT. The basic operating principles of the DWT are similar to those of the CWT, but the scales used by the wavelet and its positions are based on powers of two. These are called dyadic scales and positions, as the term dyadic stands for the factor of two [9]. As in many real-world applications, most of the important features of a signal lie in the low-frequency section. For voice signals, the low-frequency content is the part of the signal that gives the signal its identity, whereas the high-frequency content can be considered the part that gives nuance to the signal; this is similar to imparting flavor to the signal. For a voice signal, if the high-frequency content is removed, the voice will sound different but the message can still be heard. This is not true if the low-frequency content is removed: what is being spoken can then no longer be heard, except as random noise. The wavelet function is defined as follows:

W(\tau, s) = \frac{1}{\sqrt{s}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - \tau}{s}\right) dt

\int_{-\infty}^{\infty} \psi(t)\, dt = 0

\int_{-\infty}^{\infty} \left(\psi(t)\right)^2 dt < \infty


The basic operation of the DWT is that the signal is passed through a series of high-pass and low-pass filters to obtain the high-frequency and low-frequency contents of the signal. The low-frequency contents of the signal are called the approximations [10]; they are obtained using the high-scale wavelets, which correspond to low frequency. The high-frequency components of the signal, called the details, are obtained using the low-scale wavelets, which correspond to high frequency. Fig. 2 demonstrates single-level filtering using the DW. First the signal is fed into the wavelet filters, which comprise both a high-pass and a low-pass filter; these filters separate the high-frequency and low-frequency content of the signal. With the DW, the number of samples is then reduced according to the dyadic scale. This process is called sub-sampling, meaning the samples are reduced by a given factor. Because of the disadvantages of the CWT, which requires high processing power [11], the DW is chosen for its simplicity and ease of operation in handling complex signals such as the voice signal.

B. Wavelet Energy

Whenever a signal is decomposed using the wavelet decomposition method, a certain percentage of its energy is retained by the approximation and by the detail. This energy can be obtained from the wavelet bookkeeping vector and the wavelet decomposition vector. The energy calculated is a ratio, as it compares the original signal and the decomposed signal.

Fig.2 Demonstration of Single-Level Wavelet Decomposition
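As a concrete sketch of this single-level decomposition, here is a minimal pure-Python implementation using the Haar wavelet (chosen here only for brevity; the paper itself uses symlet 7). It shows both the dyadic sub-sampling and the energy split between approximation and detail; the signal values are invented:

```python
import math

def haar_dwt_level1(x):
    """Single-level Haar DWT: low-pass (approximation) and high-pass
    (detail) filtering followed by dyadic sub-sampling (factor 2)."""
    s = 1 / math.sqrt(2)  # orthonormal Haar filter coefficient
    approx = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    detail = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return approx, detail

def energy(v):
    return sum(c * c for c in v)

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # made-up samples
approx, detail = haar_dwt_level1(signal)

# Sub-sampling halves the number of samples.
assert len(approx) == len(detail) == len(signal) // 2

# The orthonormal Haar transform conserves energy, so the percentages
# retained by the approximation and the detail sum to 100%.
total = energy(signal)
pct_approx = 100 * energy(approx) / total
pct_detail = 100 * energy(detail) / total
assert abs(pct_approx + pct_detail - 100) < 1e-9
```

For a smooth signal like this one, almost all of the energy lands in the approximation, matching the observation that the low-frequency part carries the signal's identity.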

IV. EXISTING ALGORITHM

A. LPC Algorithm

The LPC (Linear Predictive Coding) method is derived from linear prediction. Linear prediction, as the term implies, is a mathematical operation which, in discrete time, estimates future values of a signal as a linear function of previous samples [8]:

\hat{x}(n) = -\sum_{l=1}^{P} a_l\, x(n - l)

where \hat{x}(n) is the predicted value and x(n - l) are the previous values. Expanding this equation:

\hat{x}(n) = -[a_1 x(n-1) + a_2 x(n-2) + a_3 x(n-3) + \dots + a_P x(n-P)]

The LPC analyzes the signal by estimating, or predicting, the formants. The formant effects are then removed from the speech signal, and the intensity and frequency of the remaining buzz are estimated. Removing the formants from the voice signal eliminates the resonance effect; this process is called inverse filtering. The signal remaining after the formants have been removed is called the residue. To estimate the formants, the LPC coefficients are needed. The coefficients are estimated from the mean square error between the predicted signal and the original signal; by minimizing this error, the coefficients are determined with higher accuracy and the formants of the voice signal are obtained.

Fig.3 Block Diagram of LPC-Based Recognition System
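The paper does not spell out how the coefficients that minimize the mean square error are computed. A standard choice, sketched here under that assumption, is the autocorrelation method solved by the Levinson-Durbin recursion (pure Python; the decaying test signal is invented):

```python
def lpc_coefficients(x, order):
    """Estimate LPC coefficients a_1..a_P with the autocorrelation
    method, solving the normal equations by Levinson-Durbin recursion.
    Sign convention: x_hat(n) = -(a_1 x(n-1) + ... + a_P x(n-P))."""
    n = len(x)
    # Autocorrelation lags r[0..order].
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [0.0] * (order + 1)          # a[0] is implicitly 1
    err = r[0]                       # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err               # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Made-up test signal obeying x(n) = 0.9 x(n-1).
x = [0.9 ** i for i in range(50)]
a, err = lpc_coefficients(x, 1)

# A first-order predictor recovers the decay factor: a_1 is close
# to -0.9 in the sign convention above.
assert abs(a[0] + 0.9) < 0.01
```

In a real system the order P would be higher (commonly 8 to 16 for speech), so that the predictor can model several formant resonances.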


V. PROPOSED ALGORITHM

A. RASTA (Relative Spectral Algorithm)

RASTA, or the Relative Spectral Algorithm, is a technique developed as the initial stage of voice recognition [13]. The method works by applying a band-pass filter to the energy in each frequency sub-band in order to smooth over short-term noise variations and to remove any constant offset. In voice signals, stationary noises are often detected. Stationary noises are noises that are present for the full duration of a signal and do not diminish [14]; their properties do not change over time. The assumption that needs to be made is that the noise varies slowly with respect to the speech. This makes RASTA a perfect tool to include in the initial stages of voice signal filtering to remove stationary noises [15]. The stationary noises identified are those in the frequency range of 1 Hz to 100 Hz.

B. Formant Estimation

The formant is one of the major components of speech. The frequencies at which the resonant peaks occur are called the formant frequencies, or simply formants [12]. The formants of a signal can be obtained by analyzing the vocal tract frequency response. Fig. 4 shows the vocal tract frequency response: the x-axis represents the frequency scale and the y-axis the magnitude of the signal. As can be seen, the formants of the signal are labeled F1, F2, F3 and F4. Typically a voice signal contains three to five formants, but in most voice signals up to four formants can be detected.

To obtain the formants of the voice signals, the LPC (Linear Predictive Coding) method is used. As described above, LPC is derived from linear prediction, the mathematical operation which, in discrete time, estimates future values of a signal as a linear function of previous samples [8].

C. RASTA-LPC and DWT Implementation

To implement the system, the voice signal is decomposed into its approximation and detail. From the extracted approximation and detail coefficients, the recognition process is carried out. The proposed methodology for the recognition phase is statistical calculation: four different statistics are computed from the coefficients, namely the mean, standard deviation, variance and mean absolute deviation. The wavelet used for the system is the symlet-7 wavelet, as this wavelet has a very close correlation with the voice signal; this was determined through numerous trials. The coefficients extracted from the wavelet decomposition are the second-level coefficients, as the level-two coefficients contain most of the correlated data of the voice signal. The data at higher levels contains very little information, making it unusable for the recognition phase. Hence, for the initial system implementation, the level-two coefficients are used.

Fig.4 Formant Estimation
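The matching step described in the abstract, comparing registered and test feature vectors by their Euclidean distance, can be sketched as follows. The four statistics are the ones named above; the coefficient values are made-up placeholders, not data from the paper:

```python
import math

def feature_vector(coeffs):
    """Feature vector from level-2 wavelet coefficients: mean, standard
    deviation, variance and mean absolute deviation, mirroring the four
    statistics named in the text."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    var = sum((c - mean) ** 2 for c in coeffs) / (n - 1)
    std = math.sqrt(var)
    mad = sum(abs(c - mean) for c in coeffs) / n
    return [mean, std, var, mad]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Invented coefficient sets: a registered speaker, a close test sample,
# and a very different speaker.
registered = feature_vector([0.42, 0.40, 0.44, 0.39, 0.41])
test_match = feature_vector([0.41, 0.40, 0.43, 0.40, 0.42])
test_other = feature_vector([1.9, -0.7, 1.2, -1.4, 0.8])

# A near-zero distance indicates a match with the trained speaker.
assert euclidean(registered, test_match) < euclidean(registered, test_other)
```

In the full system the vector would also carry the formant and wavelet-energy information before the distance comparison.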


The coefficients are further thresholded to remove low-correlation values, and statistical computation is carried out on these coefficients. The statistical computation of the coefficients is used to compare voice signals, together with the formant estimation and the wavelet energy. All of the extracted information acts as a 'fingerprint' for the voice signal. The percentage of verification is calculated by comparing the current voice signal values against the registered voice signal values, and is given by:

Fig.5 Block Diagram of RASTA Process
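The paper does not give the RASTA filter's transfer function. As a simplified, hypothetical stand-in for the offset-removal part of the idea, the sketch below applies a first-order high-pass filter along one sub-band trajectory and shows that a constant (stationary) offset is rejected; the sample values are invented:

```python
def remove_constant_offset(band, pole=0.98):
    """Simplified stand-in for the RASTA band-pass: a first-order
    high-pass IIR, y[n] = x[n] - x[n-1] + pole * y[n-1], applied along
    time to a single sub-band trajectory.  It rejects any constant
    (stationary) offset while passing the faster speech variations."""
    y = []
    prev_x = band[0] if band else 0.0
    prev_y = 0.0
    for x in band:
        out = x - prev_x + pole * prev_y
        y.append(out)
        prev_x, prev_y = x, out
    return y

# One sub-band trajectory (invented values) and the same trajectory
# shifted by a constant channel offset.
speech = [0.0, 1.0, 0.5, -0.5, 1.5, 0.2, -0.3, 0.8]
shifted = [s + 3.0 for s in speech]

out_a = remove_constant_offset(speech)
out_b = remove_constant_offset(shifted)

# The constant offset is filtered out: both outputs agree.
assert all(abs(a - b) < 1e-9 for a, b in zip(out_a, out_b))
```

The actual RASTA filter is a band-pass rather than a plain high-pass, so it also suppresses variations faster than speech; this sketch illustrates only the constant-offset rejection.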


Fig.6 Block Diagram of Proposed Text-Dependent Speaker Identification System

Verification % = (Test value / Registered value) x 100

Between the tested and registered values, whichever is higher is taken as the denominator and the lower value as the numerator, so the percentage never exceeds 100. The statistics used are:

Mean = \frac{1}{n} \sum_{i=1}^{n} x_i

Std. deviation = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}

Variance = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2

Energy = \sum_{i=1}^{n} x(i)^2

Fig. 6 shows the complete block diagram, which includes all the important system components used in the voice verification program.

VI. SIMULATION RESULTS

In this section, experimental results are shown for various voice test signals with the LPC and proposed algorithms. All experiments were carried out in MATLAB 2011a on a machine with 4 GB RAM and an i3 processor. Tables 1 and 2 show the performance comparison of the proposed and LPC algorithms in terms of recognition accuracy with the statistical parameters. The LPC achieved 66.66% accuracy, whereas the proposed algorithm achieved almost 90% accuracy.
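The statistics and the verification percentage above can be checked with a short sketch (pure Python; the sample values are invented):

```python
import math

def verification_percent(test_value, registered_value):
    """Verification %: the lower value over the higher, times 100,
    so the result always lies in [0, 100]."""
    lo, hi = sorted([abs(test_value), abs(registered_value)])
    return 100.0 * lo / hi if hi else 100.0

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # invented values
n = len(samples)
mean = sum(samples) / n
variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
std_dev = math.sqrt(variance)
energy = sum(x * x for x in samples)

assert mean == 5.0                                   # 40 / 8
assert abs(variance - 32.0 / 7.0) < 1e-12            # sum of squares = 32
assert energy == 232.0
assert abs(verification_percent(4.5, 5.0) - 90.0) < 1e-9
```

Note the sample variance uses the n - 1 divisor, matching the formulas above.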



Fig.7 (a) Original voice signal (b) De-noised signal for training

Fig.8 (a) Original voice signal (b) De-noised signal for testing



Fig.9 (a) Original voice signal (b) De-noised signal for training

Fig.10 (a) Original voice signal (b) De-noised signal for testing

Fig.11 (a) Original voice signal (b) De-noised signal for training


Fig.12 (a) Original voice signal (b) De-noised signal for testing

VII. CONCLUSION

The text-dependent speaker recognition system is used to verify the identity of an individual based on their own speech signal, using statistical computation, formant estimation and wavelet energy. Using fifty preloaded voice signals from six individuals, verification tests were carried out and an accuracy rate of approximately 90% was achieved by the proposed algorithm, whereas the LPC achieved only 66.66%. From the simulation results on various speech signals from different speakers, we can conclude that the accuracy of the proposed algorithm is improved compared to LPC.

REFERENCES

[1] Soontorn Oraintara, Ying-Jui Chen et al., "IFFT", IEEE Transactions on Signal Processing, Vol. 50, No. 3, March 2002.
[2] Kelly Wong, "The Role of the Fourier Transform in Time-Scale Modification", Journal of Undergraduate Research, University of Florida, Vol. 2, Issue 11, August 2011.
[3] Bao Liu, Sherman Riemenschneider, "An Adaptive Time Frequency Representation and Its Fast Implementation", Department of Mathematics, West Virginia University.
[4] Viswanath Ganapathy, Ranjeet K. Patro, Chandrasekhara Thejaswi, Manik Raina, Subhas K. Ghosh, "Signal Separation using Time Frequency Representation", Honeywell Technology Solutions Laboratory.
[5] Amara Graps, "An Introduction to Wavelets", Istituto di Fisica dello Spazio Interplanetario, CNR-ARTOV.
[6] Brani Vidakovic and Peter Mueller, "Wavelets for Kids – A Tutorial Introduction", Duke University.
[7] O. Farooq and S. Datta, "A Novel Wavelet Based Pre-Processing for Robust Features in ASR".
[8] Giuliano Antoniol, Vincenzo Fabio Rollo, Gabriele Venturi, "LPC & Cepstrum Coefficients for Mining Time Variant Information from Software Repositories", IEEE Transactions on Software Engineering, University of Sannio, Italy.
[9] Michael Unser, Thierry Blu, "Wavelet Theory Demystified", IEEE Transactions on Signal Processing, Vol. 51, No. 2, February 2003.
[10] C. Valens, "A Really Friendly Guide to Wavelets", IEEE, Vol. 86, No. 11, November 2012.
[11] James M. Lewis, C. S. Burrus, "Approximate CWT with an Application to Noise Reduction", Rice University, Houston.
[12] Ted Painter, Andreas Spanias, "Perceptual Coding of Digital Audio", IEEE, ASU.
[13] D. P. W. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab", 2005.
[14] Ram Singh, "Spectral Subtraction Speech Enhancement with RASTA Filtering", Proceedings of the NCC, IIT Bombay, 2012.
[15] Nitin Sawhney, "Situational Awareness from Environmental Sounds", SIG, MIT Media Lab, June 13, 2013.
