Speaker Recognition System - v1

This document provides a synopsis for a Master's project on a speaker recognition system using a vector quantization technique. The objective is to design a speaker recognition model using cepstrum analysis with vector quantization. The methodology involves collecting speech samples, extracting cepstrum coefficients as features, training the system via vector quantization (Lloyd's algorithm) to create a codebook for each speaker, and then performing recognition by comparing unknown speech features to the codebooks. The literature review discusses the history of and advances in speaker recognition technologies from the 1940s to the present. The tools to be used are MATLAB and Microsoft Word.

SYNOPSIS FOR MASTER OF TECHNOLOGY PROJECT

SPEAKER RECOGNITION SYSTEM USING VECTOR


QUANTIZATION TECHNIQUE



Submitted by

HARDEER KAUR
ROLL NO. 1/11/FET/COM/3003



Under the Guidance of

Ms Vibha









FACULTY OF ENGINEERING & TECHNOLOGY
MANAV RACHNA INTERNATIONAL UNIVERSITY, FARIDABAD

TITLE

Speaker Recognition System Using Vector Quantization Technique

OBJECTIVE

The objective of the project is to design a speaker recognition model using the cepstrum
analysis technique with a vector quantization model. The task of a speaker recognition system
is to identify a speaker, based on his or her voice, from among those in the database. While
there are many approaches to this problem, we will use cepstrum analysis, which has become
relatively standard, to achieve our goal of speaker recognition.

MOTIVATION

The motivation for speaker recognition is simple: speech is man's principal means of
communication and is, therefore, a convenient and desirable mode of communication with
machines. Speech communication has evolved to be efficient and robust, and it is clear that
the route to computer-based speech recognition is the modelling of the human system.

Unfortunately, from a pattern recognition point of view, humans recognize speech through a
very complex interaction between many levels of processing, using syntactic and semantic
information as well as very powerful low-level pattern classification and processing. Powerful
classification algorithms and sophisticated front ends are, in the final analysis, not enough;
many other forms of knowledge, e.g. linguistic, semantic and pragmatic, must be built into
the recognizer. Nor, even at a lower level of sophistication, is it sufficient merely to generate
a good representation of speech (i.e. a good set of features to be used in a pattern
classifier); the classifier itself must have a considerable degree of sophistication. It is the
case, however, that poor features do not effectively discriminate between classes and, further,
that the better the features, the easier the classification task.

Automatic speech recognition is therefore an engineering compromise between the ideal, i.e.
a complete model of the human, and the practical, i.e. the tools that science and technology
provide and that costs allow. At the highest level, all speaker recognition systems contain two
main modules: feature extraction and feature matching. Feature extraction is the process that
extracts a small amount of data from the voice signal that can later be used to represent each
speaker. Feature matching involves the actual procedure to identify the unknown speaker by
comparing the features extracted from his or her voice input with those from a set of known
speakers.

LITERATURE REVIEW

The concept of speech recognition originated in the 1940s [3]; the first practical speech
recognition program appeared in 1952 at Bell Labs, and recognized a spoken digit in a
noise-free environment [4], [5].

1. The 1940s and 1950s are considered the foundational period of speech recognition
technology; in this period, work was done on the foundational paradigms of speech
recognition, namely automation and information-theoretic models [6].
2. In the 1960s, small vocabularies (on the order of 10-100 words) of isolated words
could be recognized, based on simple acoustic-phonetic properties of speech sounds [3].
The key technologies developed during this decade were filter banks and
time-normalization methods [6].
3. In the 1970s, medium vocabularies (on the order of 100-1000 words) were recognized
using simple template-based pattern recognition methods.
4. In the 1980s, large vocabularies (1000 words and beyond) were used, and speech
recognition problems based on statistical methods, with language structures, were
addressed. The key inventions of this era were the hidden Markov model (HMM) and the
stochastic language model, which together enabled powerful new methods for handling
the continuous speech recognition problem efficiently and with high performance [3].
5. In the 1990s, the key technologies developed were methods for stochastic language
understanding, statistical learning of acoustic and language models, and methods for
the implementation of large-vocabulary speech understanding systems.
6. In the 2000s, the key technologies developed were methods for concatenative
synthesis, machine learning, and mixed-initiative dialog for very-large-vocabulary
multimodal dialog systems.
7. After five decades of research, speech recognition technology has finally entered
the marketplace, benefiting users in a variety of ways. The challenge of designing a
machine that truly functions like an intelligent human remains a major one going
forward.

Joseph P. Campbell, Jr. (Senior Member, IEEE) (Proceedings of the IEEE, Vol. 85,
No. 9, September 1997)
Automatic speaker recognition is the use of a machine to recognize a person from a spoken
phrase. Speaker recognition systems can be used in two modes: to identify a particular person
or to verify a person's claimed identity. The scope of this work is limited to speech collected
from cooperative users in real-world office environments, without adverse microphone or
channel impairments. A speaker-identification test yielded 98.9% correct closed-set speaker
identification, using cooperative speakers with high-quality telephone-bandwidth speech
collected in real-world office environments under a constrained grammar, across 44- and
43-speaker subsets of the YOHO corpus, with 80 seconds of speech for training and testing. The
new speaker-recognition system presented here is practical to implement in software on a
modest personal computer.
Douglas A. Reynolds (MIT Lincoln Laboratory, Lexington, MA, USA)
In this section we briefly outline some of the trends in speaker recognition research and
development. Exploitation of higher levels of information: in addition to the low-level
spectrum features used by current systems, there are many other sources of speaker
information in the speech signal that can be used. These include idiolect (word usage),
prosodic measures, and other long-term signal measures. This work will be aided by the
increasing use of reliable speech recognition systems for speaker recognition R&D.
High-level features not only offer the potential to improve accuracy; they may also help
improve robustness, since they should be less susceptible to channel effects.

TOOLS USED
1. MATLAB software.
2. Microsoft Word.

METHODOLOGY
The speaker recognition system consists of two parts: system training and speaker
recognition (testing).

1. System Training:
Before we perform speaker recognition, we must already have every speaker's codebook.
We obtain the codebooks by system training. The process is shown in the figure below.

1.1. Data Collection and Signal Preprocessing
We will use a microphone and software to record the subject's
speech. We can use a sampling frequency of 44 kHz
or 22 kHz.

1.2. Obtaining the Frames and Windowing
In order to obtain a time-varying parametric description of
the speech, we break the signal up into segments, called
frames. We use a rectangular window of fixed length to
truncate the signal at its beginning, then slide the window
to obtain the next frames. The length of the window and the
slide amount are important parameters for speaker recognition.
After we obtain the frames, and before calculating their cepstrums, we apply a smoothing
window (a Hamming window) to reduce edge effects. This is called windowing.
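The framing and windowing procedure above can be sketched as follows. The project's stated tool is MATLAB, so this Python/NumPy version is only an illustrative sketch; the 8 kHz sampling rate, 100 Hz test tone, 40 ms window, and 15 ms slide in the example are assumed values chosen for the demonstration, not taken from the recording setup above.

```python
import numpy as np

def frame_signal(signal, frame_len, slide, window="hamming"):
    """Slice a 1-D signal into overlapping frames and apply a smoothing window.

    frame_len (L_WIN) and slide (L_SLIDE) are given in samples.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // slide)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * slide                      # slide the rectangular window
        frames[i] = signal[start:start + frame_len]
    if window == "hamming":
        frames *= np.hamming(frame_len)        # taper frame edges (windowing)
    return frames

# Example: 1 s of a 100 Hz tone at 8 kHz, 40 ms frames, 15 ms slide
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t)
frames = frame_signal(x, frame_len=int(0.040 * fs), slide=int(0.015 * fs))
print(frames.shape)
```

Sliding by less than the frame length makes consecutive frames overlap, which is what gives the time-varying description of the signal.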

1.3. Calculating the Cepstrums
For each frame, we calculate its cepstrum and take the first several components (say
the first 12 cepstrum coefficients) for vector quantization. The reason we take only the
first several coefficients is the assumption that all the information useful for our
speaker recognition lies in the low coefficients, which represent the slowly varying
(in frequency) components of the spectrum.
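A minimal sketch of the cepstrum feature extraction, again in Python/NumPy for illustration. The real-cepstrum formulation and the small epsilon guarding the logarithm are implementation assumptions not spelled out in the text above:

```python
import numpy as np

def cepstrum_features(frame, n_coeffs=12):
    """Real cepstrum of one frame, truncated to the first n_coeffs coefficients."""
    spectrum = np.abs(np.fft.fft(frame))
    log_spectrum = np.log(spectrum + 1e-12)   # epsilon guards against log(0)
    cep = np.fft.ifft(log_spectrum).real      # real cepstrum of the frame
    return cep[:n_coeffs]                     # keep the low-order coefficients

# One 40 ms Hamming-windowed frame of a 100 Hz tone at 8 kHz (assumed values)
frame = np.hamming(320) * np.sin(2 * np.pi * 100 * np.arange(320) / 8000)
feat = cepstrum_features(frame)
print(feat.shape)
```

Keeping only the first 12 coefficients retains the slowly varying spectral envelope, as described above.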
1.4. Vector Quantization

We use Lloyd's algorithm to do our vector quantization. Lloyd's algorithm is
described as follows:

Given a set of N vectors, design a codebook of C prototypes.

STEP 1: Initialize the C prototypes to be equal to C randomly chosen vectors from the
pool of N.
STEP 2: For each of the N vectors in the set, label it with the number of the prototype
vector closest to it.
STEP 3: Replace each prototype vector with the average of all the vectors in the set
labeled with its number.
STEP 4: Compute the total quantization error of the set of N vectors under the current
codebook. If it has gone down, go to STEP 2; otherwise, we are done.

In the above steps, we use the terms closest and average in the Euclidean sense. In practice,
we set a threshold for the error change in STEP 4 rather than waiting for the error to
actually reach a stable minimum.
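The four steps above can be sketched directly. This Python/NumPy version is illustrative only; the squared-error criterion and the fixed random seed are assumptions made so the sketch terminates deterministically, and the two-dimensional toy data stand in for real cepstrum vectors:

```python
import numpy as np

def lloyd_codebook(vectors, n_prototypes, tol=1e-6, seed=0):
    """Design a codebook of n_prototypes prototypes from a pool of N vectors
    using Lloyd's algorithm with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # STEP 1: initialize prototypes as randomly chosen training vectors
    protos = vectors[rng.choice(len(vectors), n_prototypes, replace=False)].copy()
    prev_err = np.inf
    while True:
        # STEP 2: label each vector with its nearest prototype
        dists = np.linalg.norm(vectors[:, None, :] - protos[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 3: replace each prototype with the mean of its labeled vectors
        for j in range(n_prototypes):
            members = vectors[labels == j]
            if len(members):
                protos[j] = members.mean(axis=0)
        # STEP 4: total (squared) quantization error; stop at the threshold
        err = ((vectors - protos[labels]) ** 2).sum()
        if prev_err - err < tol:
            return protos, err
        prev_err = err

# Toy pool: two well-separated clusters of 2-D vectors
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
codebook, err = lloyd_codebook(data, n_prototypes=2)
print(codebook.shape)
```

In the full system, each speaker's training cepstra would be passed through such a routine once, yielding that speaker's codebook.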

2. Speaker Recognition (Testing):

The process of speaker recognition is similar to that of system training, except for the
vector quantization. The block diagram of speaker recognition is shown as follows.
The first four steps of speaker recognition are the same as
those of system training.

2.1. Computing Quantization Errors
After obtaining the cepstrums of the sample used to
recognize the speaker, we compute the quantization
errors between the cepstrums and every codebook in the
database. Our decision will be based on these
quantization errors. It is easy to calculate the
quantization error for a given codebook: first we
label each cepstrum vector with the number of the
codeword closest to it, then we compute the average
error between each cepstrum vector and its respective
codeword.

2.2. Making the Decision
Based on the errors obtained above, we look for the
codebook with the minimum error with respect to the
recognition sample and claim that the speaker being
recognized is the person whose codebook yields the smallest error.
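Steps 2.1 and 2.2 together amount to scoring each codebook by the average nearest-codeword distance and taking the minimum. A sketch, with hypothetical two-dimensional codebooks and speaker names standing in for real cepstrum codebooks:

```python
import numpy as np

def quantization_error(cepstra, codebook):
    """Average distance from each cepstrum vector to its nearest codeword."""
    dists = np.linalg.norm(cepstra[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def recognize(cepstra, codebooks):
    """Return the speaker whose codebook gives the minimum quantization error."""
    errors = {name: quantization_error(cepstra, cb) for name, cb in codebooks.items()}
    return min(errors, key=errors.get), errors

# Hypothetical database of trained codebooks (toy 2-D codewords)
codebooks = {
    "alice": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "bob":   np.array([[5.0, 5.0], [6.0, 6.0]]),
}
# Test sample whose vectors lie near alice's codewords
sample = np.array([[0.1, 0.0], [0.9, 1.1], [0.0, 0.1]])
speaker, errors = recognize(sample, codebooks)
print(speaker)  # prints "alice"
```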

Parameter Optimization for Vector Quantization

Vector quantization plays a key role in the speaker recognition system. From the
description of the system, we can see that once the codebooks and the parameters used
to train them are fixed, the quality of the recognition system is determined. Thus,
we hope to optimize the codebooks. It is our belief that the optimal codebook should have
the highest degree of clustering. We can present the codebook as:

C = F(Ain, N, M, L_WIN, L_SLIDE, D, W)

where: C is the codebook, i.e., the quantized vectors,
Ain is the set of samples used to train the system,
N is the dimension of the codebook,
M is the cepstrum dimension used to train the system,
L_WIN is the length of the frames (window),
L_SLIDE is the amount by which the window slides,
D is the distance measure,
W is the type of windowing.

In order to optimize the codebook in the sense of clustering, we should first know how to
evaluate the clustering. For the general problem, this is too complicated.

Since the library of particular configurations differs from person to person, the
parameter optimization for each person should have a different solution. Thus, during
parameter optimization we fix Ain. For the parameters N and M, we examine the person's
cepstrums and find the best values of N and M. Here we take the recommended values,
i.e., both N and M are 12. From previous groups' work, we know the best distance
measure and type of windowing are the Euclidean distance and the Hamming window; thus,
in our parameter optimization problem, we employ the Euclidean distance and the Hamming
window. So we only need to optimize L_WIN and L_SLIDE. For these parameters, we define
the degree of clustering as the reciprocal of the average distance between the cepstrums
and the codebook. The optimization problem can then be presented as:

    minimize  (1/L) * SUM_{k=1..L} min_{j=1..N} D(Ces_k, C_j)

where L is the number of cepstrums and N is the dimension of the codebook.

We sweep L_WIN from 20 ms to 55 ms in steps of 5 ms, and L_SLIDE from 8 ms to 20 ms in
steps of 2 ms. Figures 5-7 show the optimization results for three different people. The
results are not ideal, since we cannot obtain a global optimal solution; however, we can
obtain a local optimal solution for almost all subjects with L_WIN = 40 ms and
L_SLIDE = 15 ms. Thus, in our system we take L_WIN = 40 ms and L_SLIDE = 15 ms, and as
we can see below, using these suboptimal parameters, the system can be improved.
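The degree-of-clustering criterion itself is simple to express. The Python/NumPy sketch below shows only the criterion, with synthetic two-dimensional vectors standing in for the full re-framing, re-training sweep over L_WIN (20-55 ms) and L_SLIDE (8-20 ms) described above:

```python
import numpy as np

def degree_of_clustering(cepstra, codebook):
    """Reciprocal of the average distance from each cepstrum to its nearest codeword.

    In the sweep, this would be recomputed for every (L_WIN, L_SLIDE) pair after
    re-framing the speech and re-training the codebook, keeping the pair that
    maximizes it.
    """
    d = np.linalg.norm(cepstra[:, None, :] - codebook[None, :, :], axis=2)
    return 1.0 / d.min(axis=1).mean()

# A tightly clustered sample scores higher than a loosely clustered one
codebook = np.array([[0.0, 0.0]])
tight = np.full((10, 2), 0.01)
loose = np.full((10, 2), 1.0)
print(degree_of_clustering(tight, codebook) > degree_of_clustering(loose, codebook))
```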
REFERENCES

1. "Speech Recognition - The Next Revolution," 5th edition.
2. Ksenia Shalonova, "Automatic Speech Recognition," 07 Dec 2007. Source:
http://www.cs.bris.ac.uk/Teaching/Resources/COMS12303/lectures/Ksenia_Shalonova - Speech_Recognition.pdf
3. http://www.abilityhub.com/speech/speech-description.htm
4. L. Rabiner and B. Juang, "Fundamentals of Speech Recognition," 1993.
ISBN: 0130151572.
5. D. Jurafsky and J. Martin, "Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics and Speech Recognition," 2000.
ISBN: 0130950696.
6. B. H. Juang and Lawrence R. Rabiner, "Automatic Speech Recognition - A Brief
History of the Technology Development," 10/08/2004. Source:
http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-final-10-8.pdf
