Volume 3, Issue 5, May 2013 ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com
Speech Recognition and Verification Using MFCC & VQ
Mr. Kashyap Patel, Department of Electronics, Bharati Vidyapeeth College of Engineering, Pune, India.
Dr. R.K. Prasad, Department of Electronics, B.V.D.U.C.O.E., Pune, India.

Abstract- Speaker recognition software using MFCC (Mel Frequency Cepstral Coefficients) and vector quantization
has been designed, developed and tested satisfactorily for male and female voices. In this paper the ability of the HPS
(Harmonic Product Spectrum) algorithm and MFCC to perform gender and speaker recognition is explored. The HPS algorithm
finds the pitch of the speaker, which can be used to determine the gender of the speaker. The speech signals
for male and female speakers were recorded in .wav files at an 8 kHz sampling rate and then modified.
The modified wav files were processed using MATLAB software to compute and plot the
autocorrelation of the speech signal. The software reliably computes the pitch of male and female voices. The MFCC
algorithm and the vector quantization algorithm are used for the speech recognition process. The pitch of the signal,
calculated using the autocorrelation technique and the FFT, is used to identify the gender. In this paper the
testing of the speaker recognition and gender recognition system is completed and analysed.

Keywords: Autocorrelation, Signal, Voice command, Pitch, MFCC, Vector quantization, K-Means Algorithm,
Euclidean distance.

I. INTRODUCTION
The task of speaker identification is to determine the identity of a speaker by machine. To recognize a voice, the
voice must be familiar, for human beings as well as for machines; for a machine, this familiarity comes from training.
The second component of speaker identification is testing, namely the task of comparing an unidentified utterance to
the training data and making the identification. The speaker of a test utterance is referred to as the target speaker.
Recently, there has been some interest in alternative speech parameterizations based on formant features. Formant
frequencies are essential for describing the speech spectrum, but formants are difficult to find in a given speech
signal and sometimes cannot be found clearly. That is why, instead of estimating the resonant frequencies,
formant-like features can be used. Depending upon the application, the area of speaker recognition is divided into
two parts: identification and verification. Speaker identification is of two types, text dependent and text
independent, and is divided into two components: feature extraction and feature classification. In speaker
identification the speaker is identified by his or her voice, whereas in speaker verification a claimed identity is
verified against the database. Pitch is used for speaker identification. Pitch is the fundamental frequency of a
particular person's voice. It is one of the important characteristics of a human being and differs from person to person.
The speech signal is an acoustic sound pressure wave that originates from air exiting the vocal tract and from
voluntary movement of anatomical structures. A schematic diagram of human speech production is shown in figure 1.

Fig. 1. Schematic diagram of human speech production mechanism

The components of this system are the lungs, trachea, larynx (organ of voice production), pharyngeal cavity, oral cavity
and nasal cavity. In technical discussion, the pharyngeal and oral cavities are usually called the "vocal tract". The
vocal tract therefore begins at the output of the larynx and terminates at the lips. Finer anatomical components are
also critical to speech production.

© 2013, IJARCSSE All Rights Reserved Page | 478


Patel et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(5),
May - 2013, pp. 478-483

Fig. 2 Technical model for speech production

These components can move to different positions to change the size and shape of the vocal tract and produce various speech
sounds. The technical model of speech production is shown in figure 2.

II. GENDER RECOGNITION

The block diagram of the gender recognition system is shown in figure 3.

Fig. 3 Block diagram of Gender Recognition

The speaker voice samples are recorded using Microsoft Sound Recorder (software built into the Windows operating
system) and a standard computer microphone. The pre-processing includes noise removal, silence
detection and removal, and pre-emphasis. Pitch detection is the main block for gender recognition. Pitch is
the fundamental frequency of the sound. ANSI defines pitch as an attribute of auditory sensation of sounds. To
detect pitch, the autocorrelation of the speech signals for male and female voices has been computed and plotted using
MATLAB software.
The speech signal and its autocorrelogram for a female voice are shown in figure 4. The female voice pitch
computed from the autocorrelogram is 231.5 Hz. Similarly, the recorded speech signal and its autocorrelogram for a male
voice are shown in figure 5. From this figure the pitch of the male voice is 120.3 Hz.
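
The autocorrelation-based pitch estimate described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function name `autocorrelation_pitch`, the 60 to 400 Hz search range and the synthetic 200 Hz test tone are assumptions chosen only for illustration. The pitch is taken as the sampling rate divided by the lag at which the autocorrelation is largest.

```python
import math

def autocorrelation_pitch(signal, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate pitch in Hz from the lag with the largest autocorrelation.

    f_min and f_max bound the search range; they are illustrative
    assumptions, not values taken from the paper.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation of the signal at this lag.
        value = sum(signal[n] * signal[n + lag]
                    for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic 200 Hz tone sampled at 8 kHz (the sampling rate used in the paper).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
pitch = autocorrelation_pitch(tone, sample_rate)
print(round(pitch, 1))
```

Real speech is not a pure tone, so in practice the estimate would be computed on short voiced frames and averaged.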

Fig. 4 Correlation coefficients for female voice


Fig. 5. Correlation coefficients for male voice


As the pitch of a female speaker is higher than that of a male speaker, a threshold can be selected to discriminate
between male and female speakers. If the mean pitch is less than the threshold, the given speaker is classified as male;
if the mean pitch is greater than the threshold, the given speaker is classified as female.
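
The threshold rule can be sketched in a few lines of Python. The paper does not state the threshold it used, so the 165 Hz default below is purely a hypothetical value placed between the male (120.3 Hz) and female (231.5 Hz) pitches reported above; the function name `classify_gender` is likewise an assumption.

```python
def classify_gender(pitches_hz, threshold_hz=165.0):
    """Classify a speaker from per-frame pitch estimates.

    threshold_hz is a hypothetical value; the paper does not give one.
    """
    mean_pitch = sum(pitches_hz) / len(pitches_hz)
    return "male" if mean_pitch < threshold_hz else "female"

# Pitch values reported in the text for figures 5 and 4 respectively.
print(classify_gender([120.3]))
print(classify_gender([231.5]))
```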
III. SPEAKER RECOGNITION
The main aim of this project is speaker identification, which consists of comparing a speech signal from an
unknown speaker to a database of known speakers. The system can recognize any speaker on whom it has been trained.
Figure 6 shows the fundamental structure of speaker identification and verification systems.
Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker
verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Most
applications in which voice is used as the key to confirm the identity of a speaker are classified as speaker verification.

Fig. 6 Conceptual presentation of speaker identification


A. Mel Frequency Cepstrum Coefficient
In this project we use the Mel Frequency Cepstral Coefficients (MFCC) technique to extract features from the
speech signal and compare the unknown speaker with the existing speakers in the database. Figure 7 shows the complete
pipeline of Mel Frequency Cepstral Coefficients.

Fig. 7 Pipeline of MFCC


The Mel-frequency Cepstrum Coefficient (MFCC) technique is often used to create the fingerprint of the sound files. The
MFCC are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at
low frequencies and logarithmically at high frequencies are used to capture the important characteristics of speech. Studies
have shown that human perception of the frequency content of speech signals does not follow a linear scale.
Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the Mel
scale. The Mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As
a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels.
The following formula is used to compute the Mels for a particular frequency: mel(f) = 2595*log10(1 + f/700). A
block diagram of the MFCC processes is shown in Figure 8.
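
As a quick check of the formula, the following Python sketch evaluates mel(f) and its inverse. The inverse, f = 700*(10^(mel/2595) - 1), is obtained by solving the formula above for f; the helper names and the choice of ten centre frequencies up to 4 kHz (half the 8 kHz sampling rate) are assumptions made only for this illustration.

```python
import math

def hz_to_mel(f_hz):
    """mel(f) = 2595 * log10(1 + f / 700), as given in the text."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(low_hz, high_hz, count):
    """Frequencies in Hz that are equally spaced on the Mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (count - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(count)]

# A 1 kHz tone maps to approximately 1000 Mels, the reference point above.
print(round(hz_to_mel(1000.0), 1))
centres = mel_spaced_frequencies(0.0, 4000.0, 10)
print([round(c) for c in centres])
```

The spacing between successive centre frequencies grows with frequency, which is the behaviour a Mel-spaced filter bank relies on.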

Fig. 8 Block diagram of MFCC


The speech waveform is cropped to remove silence or acoustical interference that may be present at the beginning or end
of the sound file. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of
each frame to zero. The FFT block converts each frame from the time domain to the frequency domain. In the
Mel-frequency wrapping block, the spectrum is mapped onto the Mel scale to mimic human hearing. In the final step, the
Cepstrum, the logarithm of the Mel spectrum is converted back to the time domain, typically with a discrete cosine
transform. The resulting coefficients provide a good representation of the spectral properties of the signal, which is
key for representing and recognizing characteristics of the speaker. The resulting fingerprint is also referred to as
an acoustic vector. This vector is stored as a reference in the database. When an unknown sound file is imported into
MATLAB, a fingerprint is created for it as well, and the resultant vector is compared against those in the database
using the Euclidean distance technique, and a suitable match is determined. This process is referred to as feature matching.
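
The framing and windowing steps can be sketched as follows. The paper does not name the window function or the frame sizes, so the Hann window (which tapers exactly to zero at both ends, matching the description above), the 200-sample frame (25 ms at 8 kHz) and the 80-sample hop are all assumptions made for illustration.

```python
import math

def frame_signal(signal, frame_length, hop_length):
    """Split a signal into overlapping frames of equal length."""
    return [signal[start:start + frame_length]
            for start in range(0, len(signal) - frame_length + 1, hop_length)]

def hann_window(length):
    """Hann window: tapers to zero at the first and last sample."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def apply_window(frame, window):
    """Multiply a frame sample-by-sample with the window."""
    return [s * w for s, w in zip(frame, window)]

# A constant signal makes the taper easy to see in the windowed frames.
signal = [1.0] * 800
frames = frame_signal(signal, frame_length=200, hop_length=80)
window = hann_window(200)
windowed = [apply_window(frame, window) for frame in frames]
print(len(frames), windowed[0][0], round(windowed[0][100], 3))
```

Each windowed frame would then be passed to the FFT block; overlapping frames compensate for the samples attenuated near the frame edges.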
B. Vector Quantization
A speaker recognition system must be able to estimate probability distributions of the computed feature vectors. Storing
every single vector generated in the training mode is impractical, since these distributions are defined over a
high-dimensional space. It is often easier to start by quantizing each feature vector to one of a relatively small number of
template vectors, a process called vector quantization (VQ). VQ is a process of taking a large set of feature vectors and
producing a smaller set of vectors that represent the centroids of the distribution. The technique of VQ consists
of extracting a small number of representative feature vectors as an efficient means of characterizing the speaker-specific
features. VQ thus removes the need to store every single vector generated during training. Using the
training data, features are clustered to form a codebook for each speaker. In the recognition stage, the data from the tested
speaker is compared to the codebook of each speaker and the difference is measured. These differences are then used to make
the recognition decision.
C. K-Means Algorithm
The K-means algorithm is a way to cluster the training vectors to get feature vectors. The algorithm clusters the
vectors, based on their attributes, into k partitions. It uses the k means of data assumed to be generated from Gaussian
distributions to cluster the vectors. The objective of k-means is to minimize the total intra-cluster variance, V:
$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$
where there are k clusters Si, i = 1,2,...,k and μi is the centroid or mean point of all the points, xj ∈ Si.
The k-means algorithm uses a least-squares partitioning method to divide the input vectors into k initial sets. It
then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the
closest centroid. The centroids are then recalculated for the new clusters, and the algorithm is repeated until the vectors
no longer switch clusters or, alternatively, the centroids no longer change.
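
The iteration described above can be sketched in plain Python. This is an illustrative version only: it seeds the centroids with the first k input vectors, which is an assumption that differs from the least-squares initial partition mentioned in the text, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, max_iterations=100):
    """Cluster vectors into k sets; returns (centroids, assignments)."""
    # Assumption: seed the centroids with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Associate each vector with its closest centroid.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no vector switched clusters
        assignments = new_assignments
        # Recalculate each centroid as the mean of its members.
        for i in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

def intra_cluster_variance(vectors, centroids, assignments):
    """Total intra-cluster variance V from the objective above."""
    return sum(squared_distance(v, centroids[a])
               for v, a in zip(vectors, assignments))

vectors = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]]
centroids, assignments = k_means(vectors, k=2)
print(assignments)
print(round(intra_cluster_variance(vectors, centroids, assignments), 3))
```

In a speaker recognition setting the input vectors would be the MFCC acoustic vectors of one speaker, and the resulting centroids would form that speaker's codebook.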
D. Euclidean Distance
In the speaker recognition phase, an unknown speaker's voice is represented by a sequence of feature vectors {x1, x2,
..., xi}, which is then compared with the codebooks from the database. The unknown speaker is identified
by measuring the distortion distance between the two vector sets, based on minimizing the Euclidean distance.
The Euclidean distance is the "ordinary" distance between two points that one would measure with a ruler, and it can
be derived by repeated application of the Pythagorean Theorem. The Euclidean distance
between two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) is given by:
$$d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

The unknown person is identified as the speaker whose codebook gives the lowest distortion distance.
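
A minimal Python sketch of this matching step is given below. The paper does not spell out how the distortion is aggregated over the feature vectors, so averaging the distance from each vector to its nearest codeword is an assumption; the function names, speaker labels and two-dimensional toy codebooks are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    return sum(min(euclidean_distance(x, c) for c in codebook)
               for x in features) / len(features)

def identify_speaker(features, codebooks):
    """Return the speaker whose codebook gives the lowest distortion."""
    return min(codebooks,
               key=lambda name: average_distortion(features, codebooks[name]))

# Toy codebooks; real ones would hold MFCC centroids per speaker.
codebooks = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[5.0, 5.0], [6.0, 6.0]],
}
unknown = [[0.1, 0.0], [0.9, 1.2]]
print(identify_speaker(unknown, codebooks))
```

For verification rather than identification, the same distortion could instead be compared against an acceptance threshold for the claimed speaker.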

IV. EXPERIMENTAL RESULTS


To implement the proposed speaker recognition system, a system with some voice commands, such as 'Hello', is considered.

Fig. 9 original speech signal

Fig. 10 Silence removal signal

Fig. 11 Framing

Fig. 12 Windowing


Fig. 13 Autocorrelation

The training phase is done in two forms. In the first form, the system was trained with one repetition of each command
and tested with one repetition in each testing session.

Fig. 14 Fast Fourier Transform


Table I. Pitch frequency of male and female speakers in the speech database
Sr. no.   Male speaker   Pitch frequency   Female speaker   Pitch frequency
1         Male 1         123.8 Hz          Female 1         245.6 Hz
2         Male 2         100.3 Hz          Female 2         220.8 Hz
3         Male 3         135.3 Hz          Female 3         234.3 Hz
4         Male 4         110.0 Hz          Female 4         217.2 Hz

With the first form of training (one repetition of each command), the error rate is about 12%. In the second form, the
speaker repeated the words 4 times in a single training session, and then twice in each testing session. With this
form of training, a negligible error rate in the recognition of commands is achieved.
V. CONCLUSION
The goal of this project was to create a gender and speaker recognition system and apply it to the speech of an
unknown speaker. The extracted features of the unknown speech are compared with the stored
extracted features of each speaker in order to identify the unknown speaker. A crude speaker recognition code
has been written using the MATLAB programming language. This code uses comparisons between the average pitch of a
recorded wav file as well as the vector differences between formant peaks in the PSD of each file. It was found that
comparison based on pitch produced the most accurate results, while comparison based on formant peak location did produce
results but could likely be improved. Experience was also gained in speech editing as well as basic filtering techniques.
While the methods used in the design of the code for this project are a good foundation for a speaker recognition
system, more advanced techniques would have to be used to produce a successful speaker recognition system.

