Speech Recognition and Verification Using MFCC and VQ
Abstract- Speaker recognition software using MFCC (Mel-Frequency Cepstral Coefficients) and vector quantization
has been designed, developed and tested satisfactorily for male and female voices. In this paper the ability of the HPS
(Harmonic Product Spectrum) algorithm and MFCC for gender and speaker recognition is explored. The HPS algorithm
can be used to find the pitch of the speaker, which in turn can be used to determine the speaker's gender. The speech
signals for male and female speakers were recorded as .wav files at an 8 kHz sampling rate and then modified. Each
modified .wav file was processed using MATLAB software to compute and plot the autocorrelation of the speech
signal. The software reliably computes the pitch of male and female voices. The MFCC algorithm and the vector
quantization algorithm are used for the speech recognition process. The pitch of the signal, calculated using the
autocorrelation technique and the FFT, is used to identify the true gender. In this paper the quality and testing of the
speaker recognition and gender recognition system are completed and analysed.
Keywords: Autocorrelation, Signal, Voice command, Pitch, MFCC, Vector quantization, K-means algorithm,
Euclidean distance.
I. INTRODUCTION
The task of speaker identification is to determine the identity of a speaker by machine. To recognize a voice, the
voice must be familiar, for machines just as for human beings. The second component of speaker identification is
testing, namely the task of comparing an unidentified utterance to the training data and making the identification. The
speaker of a test utterance is referred to as the target speaker. Recently, there has been some interest in alternative speech
parameterizations based on formant features. Formant frequencies are essential for describing the speech spectrum,
but formants are difficult to estimate from a given speech signal and are sometimes not clearly present. For this reason,
formant-like features can be used instead of estimating the resonant frequencies directly. Depending upon the application,
the area of speaker recognition is divided into two parts: identification and verification. Speaker identification may be
either text dependent or text independent, and it comprises two components: feature extraction and feature classification.
In speaker identification the speaker is identified by his or her voice, whereas in speaker verification the claimed
speaker is verified against a database. Pitch is used for speaker identification. Pitch is the fundamental frequency
of a particular person's voice; it is an important characteristic of a human being and differs from one person to another.
The speech signal is an acoustic sound pressure wave that originates from the expulsion of air through the vocal
tract and the voluntary movement of anatomical structures. A schematic diagram of human speech production is shown
in figure 1.
The components of this system are the lungs, trachea, larynx (the organ of voice production), pharyngeal cavity, oral
cavity and nasal cavity. In technical discussions, the pharyngeal and oral cavities are usually referred to together as the
"vocal tract". The vocal tract therefore begins at the output of the larynx and terminates at the lips. Finer anatomical
components are also critical to speech production.
These components can move to different positions to change the size and shape of the vocal tract and so produce various
speech sounds. The technical model of speech production is shown in figure 2.
Speaker voice samples were recorded using Microsoft Sound Recorder (software built into the Windows operating
system) with a standard computer microphone. Pre-processing includes noise removal, silence detection and removal,
and pre-emphasis. Pitch detection is the main block for gender recognition. Pitch is the fundamental frequency of a
sound; ANSI defines pitch as an attribute of the auditory sensation of sounds. For pitch detection, the autocorrelation
of the speech signals for male and female voices has been computed and plotted using MATLAB software.
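The pre-processing steps described above can be sketched as follows. This is an illustrative Python/NumPy version, not the original MATLAB code; the pre-emphasis coefficient (0.97) and the silence-energy threshold are assumed values chosen for demonstration:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    # Boosts high frequencies that the vocal tract attenuates.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def remove_silence(signal, frame_len=200, threshold=0.01):
    # Split the signal into non-overlapping frames and keep only those
    # whose short-time energy exceeds the (assumed) threshold.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) > threshold]
    return np.concatenate(voiced) if voiced else signal[:0]

# Example: a loud half followed by a near-silent half; only the loud
# half survives silence removal.
loud = np.full(400, 0.5)
quiet = np.full(400, 0.001)
cleaned = remove_silence(np.concatenate([loud, quiet]))
emphasised = pre_emphasis(cleaned)
```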
The speech signal and its correlogram for a female voice are shown in figure 4; the female voice pitch computed
from the auto-correlogram is 231.5 Hz. Similarly, the recorded male speech signal and its auto-correlogram are shown in
figure 5, from which the pitch of the male voice is 120.3 Hz.
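A minimal sketch of the autocorrelation-based pitch estimate and the pitch-based gender decision is given below, in Python/NumPy rather than MATLAB. The search band (60-400 Hz) and the 160 Hz decision threshold are assumptions for illustration, not values from the paper:

```python
import numpy as np

def estimate_pitch(signal, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) as the lag of the autocorrelation peak."""
    signal = signal - np.mean(signal)
    # Full autocorrelation; keep the non-negative lags only.
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]
    # Restrict the lag search to the plausible pitch range.
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    peak_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / peak_lag

def classify_gender(pitch_hz, threshold=160.0):
    # Assumed threshold: higher pitch -> female, lower -> male.
    return "female" if pitch_hz >= threshold else "male"

# Synthetic check: a 120 Hz tone at 8 kHz should classify as male.
fs = 8000
t = np.arange(2000) / fs
male_like = np.sin(2 * np.pi * 120.0 * t)
pitch = estimate_pitch(male_like, fs)
```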
The Mel-Frequency Cepstral Coefficient (MFCC) technique is often used to create a fingerprint of a sound file.
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly
at low frequencies and logarithmically at high frequencies are used to capture the important characteristics of speech.
Studies have shown that human perception of the frequency content of speech signals does not follow a linear scale.
Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel
scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As
a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.
The following formula is used to compute the mels for a particular frequency: mel(f) = 2595 * log10(1 + f / 700). A
block diagram of the MFCC process is shown in Figure 8.
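The mel-scale formula can be checked directly in a few lines; with this formula a 1 kHz tone maps to approximately 1000 mels, consistent with the reference point above:

```python
import math

def hz_to_mel(f):
    # mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

mel_1k = hz_to_mel(1000.0)  # close to 1000 mels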
Fig. 11 Framing
Fig. 12 Windowing
Fig. 13 Autocorrelation
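The framing, windowing and subsequent MFCC steps shown in the figures above can be sketched for a single frame as follows. This is an illustrative NumPy implementation with assumed parameter values (20 mel filters, 12 cepstral coefficients, 25 ms frames), not the system's actual code:

```python
import numpy as np

def mfcc_frame(frame, fs, n_filters=20, n_ceps=12):
    """Compute MFCCs for one pre-emphasised frame (illustrative sketch)."""
    nfft = len(frame)
    # Windowing: taper frame edges with a Hamming window.
    windowed = frame * np.hamming(nfft)
    # Power spectrum via the FFT.
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((nfft + 1) * hz_points / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    log_energies = np.log(fbank @ spectrum + 1e-10)
    # DCT-II decorrelates the log filterbank energies; the 0th
    # (overall energy) coefficient is skipped.
    n = np.arange(n_filters)
    ceps = np.array([
        np.sum(log_energies * np.cos(np.pi * q * (2 * n + 1) / (2 * n_filters)))
        for q in range(1, n_ceps + 1)
    ])
    return ceps

# Example: one 25 ms frame (200 samples) of a synthetic 8 kHz signal.
fs = 8000
t = np.arange(200) / fs
frame = np.sin(2 * np.pi * 300.0 * t)
ceps = mfcc_frame(frame, fs)
```

Stacking the coefficient vectors of all frames of an utterance gives the feature matrix used in the training and testing phases.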
The training phase was carried out in two forms. In the first, the system was trained with one repetition of each
command, plus one repetition in each testing session; with this type of training the error rate is about 12%. In the second
form, the speaker repeated each word 4 times in a single training session and then twice in each testing session. With
this training, a negligible error rate in the recognition of commands is achieved.
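The vector quantization stage mentioned above can be sketched as a K-means codebook per speaker, with average Euclidean distortion used to match a test utterance to the closest codebook. The codebook size (8) and the synthetic feature clusters in the example are assumptions for illustration, not the paper's actual configuration:

```python
import numpy as np

def train_codebook(features, k=8, iters=20, seed=0):
    """K-means codebook; features is (n_frames, n_dims) of MFCC vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Average distance from each frame to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identify(test_features, codebooks):
    # The claimed/identified speaker is the one whose codebook gives
    # the lowest average distortion on the test features.
    return min(codebooks, key=lambda name: vq_distortion(test_features, codebooks[name]))

# Synthetic example: two well-separated "speakers" in a 12-dim feature space.
rng = np.random.default_rng(1)
spk_a = rng.normal(0.0, 0.5, size=(100, 12))
spk_b = rng.normal(5.0, 0.5, size=(100, 12))
books = {"A": train_codebook(spk_a), "B": train_codebook(spk_b)}
test = rng.normal(0.0, 0.5, size=(30, 12))  # drawn from speaker A's cluster
winner = identify(test, books)
```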
V. CONCLUSION
The goal of this project was to create a gender and speaker recognition system and apply it to the speech of an
unknown speaker, investigating the extracted features of the unknown speech and comparing them to the stored
features of each known speaker in order to identify the unknown speaker. A crude speaker recognition code has been
written in the MATLAB programming language. This code compares the average pitch of recorded .wav files as well
as the vector differences between formant peaks in the PSD of each file. It was found that comparison based on pitch
produced the best accuracy, while comparison based on formant peak location did produce results but could likely be
improved. Experience was also gained in speech editing as well as basic filtering techniques. While the methods used
in the design of the code for this project are a good foundation for a speaker recognition system, more advanced
techniques would be needed to produce a fully successful speaker recognition system.