
Voice Syn - NN

This research proposal outlines the development of an improved voice recognition system utilizing Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and K-Means algorithm for feature matching. The system aims to enhance security applications by accurately identifying speakers based on their voice characteristics. The study will also evaluate existing voice recognition systems and propose an efficient methodology for implementation using MATLAB.

Uploaded by

usha kumari
Research Proposal on

An improved Voice Recognition System using


Feature Extraction of MFCC Technique and
Feature Matching of K-Means Algorithm
Submitted in partial fulfillment of the requirement for the award of the degree

of

MASTER OF TECHNOLOGY

in
COMPUTER SCIENCE AND ENGINEERING

by

USHA KUMARI
(020215001)

under the supervision of


GAURAV AGGARWAL
(Associate Professor & HOD, Dept. of CSE)

DEPARTMENT OF COMPUTER SCIENCE AND


ENGINEERING

Jagannath University NCR, Haryana, India

2016-2017
CONSENT LETTER

I hereby give my consent to supervise USHA KUMARI, roll no. 020215001, student of
M.Tech. (2015-17), during the session 2016-17, for the thesis work to be carried out on the
topic "An improved Voice Recognition System using Feature Extraction of MFCC
Technique and Feature Matching of K-Means Algorithm", in partial fulfillment of the
requirement for the award of the degree of Master of Technology in Computer Science and
Engineering.

Signature of the Supervisor


Place:
Date:
1. Introduction

1.1 Introduction

Biometrics refers to the automatic recognition of individuals based on their physiological and
behavioral characteristics [1]. There is a growing demand for simpler access controls in
personal authentication systems, and biometrics may be the answer. Instead of carrying a
bunch of keys, access cards or passwords, a person's body can be used to identify them
uniquely. Furthermore, when biometric measures are applied in combination with other
controls, such as access cards or passwords, the reliability of authentication takes a giant
step forward. Applications of biometrics include passports, driving licenses, banking, and
preventing impostors from hacking into networks or stealing mail. Traditional security
systems are token based: impostors are prevented from accessing protected resources by
means of ID cards, smart cards and similar tokens. The main disadvantage of token-based
systems is that ID cards can be lost, forged, or misplaced.

The main advantage of a biometric system over the conventional approach is reliability: a
biometric trait cannot be stolen or misplaced. A biometric system captures a biometric
sample from the user, extracts features from it, and authenticates the individual by checking
those features against the templates previously stored in the database. How an individual is
authenticated depends on the application in which the biometric system is used. A biometric
system operates in one of two modes, verification or identification. One important biometric
trait is the human voice.

Voice or speaker recognition [2] is the process of automatically recognizing who is speaking
on the basis of individual information included in speech waves. This technique makes it
possible to use the speaker's voice to verify their identity and control access to services such
as voice dialing, banking by telephone, telephone shopping, database access services,
information services, voice mail, security control for confidential information areas, and
remote access to computers.

1.2 Classification of Voice Recognition

Voice recognition is a popular topic today. Its applications can be found everywhere and
make daily life more efficient. On a mobile phone, for example, instead of typing the name
of the person to be called, the user can simply speak the name and the phone will
automatically place the call. Likewise, a user who wants to send a text message can dictate
the message to the phone instead of typing it [3].

Voice recognition can be classified into a number of categories. Figure 1.1 below provides
the various classifications of speaker voice recognition [1].

Figure 1.1: Classification of Speaker Recognition

1.2.1 OPEN SET vs CLOSED SET

This classification is based on the set of trained speakers available in a system, as
described below.

1. Open Set: An open set system can have any number of trained speakers. We have an open
set of speakers and the number of speakers can be anything greater than one.

2. Closed Set: A closed set system has only a specified (fixed) number of users registered to
the system. In our thesis, we have used an open set of trained speakers.

1.2.2 IDENTIFICATION vs VERIFICATION

Automatic speaker identification and verification are often considered to be the most natural
and economical methods for preventing unauthorized access to physical locations or computer
systems. The two tasks are described below.

1. Speaker identification: It is the process of determining which registered speaker provides
a given utterance.

2. Speaker verification: It is the process of accepting or rejecting the identity claim of a
speaker.

Figure 1.2 and Figure 1.3 below illustrate the basic differences between speaker
identification and verification systems.

[Figure 1.2 block diagrams, recovered from the extracted labels]
(a) Speaker identification: input speech -> feature extraction -> similarity against each
reference model (Speaker #1 ... Speaker #N) -> maximum selection -> identification result
(Speaker ID)
(b) Speaker verification: input speech -> feature extraction -> similarity against the
reference model of the claimed speaker (Speaker ID #M) -> decision against a threshold ->
verification result (Accept/Reject)

Figure 1.2: Block Diagrams of Identification and Verification systems [1]

Figure 1.3: Practical examples of Identification and Verification Systems [1]

Both the figures depict the differences between ASI (Automatic Speaker Identification) and
ASV (Automatic Speaker Verification) systems. Figure 1.2 gives the theoretical block
diagrams of both the processes whereas figure 1.3 gives a practical implementation of the
systems. In our thesis we have focussed only on ASI systems.
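
The difference between the two decision rules can be sketched in a few lines. This is only an
illustration in Python, not the proposed MATLAB implementation. It assumes that a distance
(lower means more similar) has already been computed between the test utterance and each
reference model, so the "maximum similarity" selection of the block diagram becomes a
"minimum distance" selection. The speaker names, distance values and threshold are
hypothetical.

```python
def identify(distances):
    """Identification: return the enrolled speaker whose model is closest."""
    return min(distances, key=distances.get)


def verify(distances, claimed_id, threshold):
    """Verification: accept the claim only if the claimed speaker's
    distance does not exceed the decision threshold."""
    return distances[claimed_id] <= threshold


# Hypothetical distances between one test utterance and each reference model.
distances = {"speaker_1": 4.2, "speaker_2": 1.3, "speaker_3": 2.9}

print(identify(distances))                  # closest enrolled speaker
print(verify(distances, "speaker_3", 2.0))  # claim checked against a threshold
```

Identification compares the utterance against all N models and always returns one of them,
whereas verification looks at a single model and needs a threshold, which would have to be
tuned on held-out data.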

1.2.3 TEXT-DEPENDENT vs TEXT-INDEPENDENT

This is another classification of speaker recognition systems. It is based upon the text
uttered by the speaker during the identification process, as described below.

1. Text-Dependent: In this case, the test utterance is identical to the text used in the training
phase. The test speaker has prior knowledge of the system.

2. Text-Independent: In this case, the test speaker doesn’t have any prior knowledge about
the contents of the training phase and can speak anything.

In our thesis, we have used the text-independent model. Thus, we have designed an open-set
text-independent ASI (Automatic Speaker Identification) system in our thesis work.

2. Literature Survey:

Various types of voice and speaker recognition techniques are available. In this section, we
provide the literature review of work done in this field.

Anusuya M. A. et al. (2009) [5] presented a brief survey of Automatic Speech Recognition
and discussed the major themes and advances made in the past 60 years of research, so as to
provide a technological perspective and an appreciation of the fundamental progress that has
been accomplished in this important area of speech communication.

Zue V. et al. (2011) [6] described an approach that combines audio dialogues, text, icons and
graphics for speech recognition and understanding. The authors produced a word-pair
language that helps in searching and navigation.

Fook C. Y. et al. (2012) [7] presented a review of speech recognition. The main aim of their
research was to compare and summarize the well-known speech recognition methods used by
various researchers.

Singh P. P. et al. (2012) [8] described speech recognition as a newly emerging technology in
the field of computing and artificial intelligence. It has changed the way we communicate
with computers and other intelligent devices of the same calibre, such as smartphones, and it
is a major area of research interest within artificial intelligence. Their paper gave an
overview of the technology and listed its current implementations.

Choudhary A. et al. (2012) [9] described the speech recognition process using an AI
approach. The recognition method uses a language model, a trigram model and an acoustic
model. No GUI is used; the acoustic model interfaces with the telephony system to manage
spoken dialogues with the speaker.

Mathur S. et al. (2013) [10] outlined the basic concepts of speaker recognition along with its
diverse applications. Their paper also discusses the selection of a robust parameter for
identification in order to attain accurate results, the limitations faced, and recent advances
in identification, so as to provide a technological perspective on this important area of
speaker recognition.

Chandra E. et al. (2014) [11] described speaker recognition as the process of identifying a
person through his/her voice signals or speech waves. Pattern classification, the process of
grouping patterns that share the same set of properties, plays a vital role in speaker
recognition. Their paper deals with speaker recognition systems and gives an overview of the
pattern classification techniques DTW, GMM and SVM.

Nereveettil C. J. et al. (2014) [12] presented the viability of the Mel Frequency Cepstral
Coefficient algorithm for feature extraction and of a Fuzzy Inference System model for
feature selection, which reduces the dimensionality of the extracted features. They argue
that there is an increasing need for new feature selection methods that increase the
processing rate and recognition accuracy of the classifier by selecting discriminative
features. Hence a Fuzzy Inference System model is used to select the optimal features from
the speech vectors extracted using MFCC. The work was done in MATLAB13a, and the
experimental results show that the system is able to reduce the word error rate at
sufficiently high accuracy.

Nandyala S. P. et al. (2014) [13] described a new hybrid HMM/DTW approach that uses
kernel adaptive filters for speech analysis and recognition. Noise removal, or filtering, of
conversations such as those held over the telephone is very important in speech recognition.
Their approach gave better experimental results compared to traditional methods.

Xiang-Lilan et al. (2014) [14] introduced a new merge-weighted dynamic time warping
algorithm (MWDTW). The method defines a template confidence index for measuring the
similarity between training and testing data using the DTW approach. Using the merge
approach on SD speech recognition datasets, with HMM and DTW applied to the merged data
sets, the reported results were six times better than DTW overall.

Doye D. D. et al. (2015) [15] worked on a new nonlinear time alignment model as an
alternative to the DTW algorithm. They sought a suitable time alignment algorithm for the
Marathi language, using 46 confusing monosyllabic alphabets and 46 confusing names. The
main features used in this research were Mel Frequency Cepstral Coefficients (MFCC), Linear
Frequency Cepstral Coefficients (LFCC) and Linear Prediction Coefficients (LPC).

Padmanabhan J. et al. (2015) [16] reviewed automatic speech recognition along with
Gaussian mixture models, machine learning and HMMs. The scanning, preprocessing, feature
extraction and classification of the input are carried out using acoustic, bottleneck and MLP
features.

Chaudhary P. J. et al. (2015) [17] treated speaker recognition as the general process, whereas
speaker identification and speaker verification refer to specific tasks or assessment modes
associated with this process. Here, speaker recognition is the computing task of validating a
person’s claimed identity using features extracted from a database of voices. In areas where
security is a foremost concern, speaker recognition is one of the most useful and popular
biometric recognition techniques. Various feature extraction techniques, such as MFCC, RCC,
LPC, LPCC, and PLPC, are discussed in their paper.

Chadha N. et al. (2015) [18] described various applications of speech recognition systems, all
of which involve research challenges. They present a critical machine-learning-based review
that addresses the challenging tasks of speech recognition systems in NLP. In existing
systems, the recognition rate is very low and the noise ratio during the recognition process
creates a problem.

Karpagavalli S. et al. (2016) [19] described speech as the most natural communication mode
for human beings. The task of speech recognition is to convert speech into a sequence of
words by a computer program. Speech recognition applications enable people to use speech
as another input mode and to interact with applications easily and effectively. Speech
recognition interfaces in the native language will enable illiterate and semi-literate people
to use the technology to a greater extent without knowing how to operate a computer
keyboard or stylus. Their paper presents a detailed study of automatic speech recognition
that covers the architecture, speech parameterization, methodologies, characteristics,
issues, databases, tools and applications.

3. Problem Formulation & Research Motivation

Voice recognition is the process of converting a speech signal to a sequence of words by
means of algorithms implemented as a computer program. Speech, or voice, is the most natural
form of human communication. Speech recognition technology has made it possible for
computers to follow human voice commands and understand human languages. The primary
function of the speech recognition engine is to process spoken input and translate it into text
that an application understands. The application can then do one of two things [4]:

▪ The application can interpret the result of the recognition as a command. In this case, the
application is a command and control application. An example of a command and control
application is one in which the caller says “check balance”, and the application returns the
current balance of the caller’s account.
▪ If an application handles the recognized text simply as text, then it is considered a dictation
application. In a dictation application, if you said “check balance,” the application would not
interpret the result, but simply return the text “check balance”.

For reasons ranging from technological curiosity about the mechanisms for mechanical
realization of human speech capabilities to the desire to automate simple tasks that require
human-machine interaction, research in automatic speech recognition by machines has
attracted a great deal of attention for sixty years. Based on major advances in statistical
modeling of speech, automatic speech recognition systems today find widespread application
in tasks that require a human-machine interface, such as automatic call processing in
telephone networks and query-based information systems that provide updated travel
information, stock price quotations, weather reports, data entry, voice dictation and access
to information.

4. Objectives

Voice recognition is the process of automatically recognizing who is speaking on the basis of
individual information included in speech waves. This thesis describes how to build a simple,
yet complete and representative automatic voice recognition system. Such a voice recognition
system has potential in many security applications. For example, users have to speak a PIN
(Personal Identification Number) in order to gain access to the laboratory door, or users have
to speak their credit card number over the telephone line to verify their identity. By checking
the speech characteristics of the input utterance, using an automatic voice recognition system
similar to the one that we will describe, the system is able to add an extra level of security.

The overall objectives of voice recognition are summarized below:

1. The main aim of this thesis is speaker identification, which consists of comparing a
speech signal from an unknown speaker to a database of known speakers. The system,
which has been trained with a number of speakers, can then recognize the speaker.

2. Study the existing voice recognition systems.

3. Develop a new efficient technique for voice recognition by applying Mel
Frequency Cepstral Coefficients (MFCC), K-means and the Euclidean distance technique.

4. Build a system that delivers optimal performance both in terms of speed and accuracy.

5. Proposed Methodology

This thesis describes how to build a simple, yet complete and representative automatic voice
recognition system. Such a voice recognition system has potential in many security
applications. For example, users have to speak a PIN (Personal Identification Number) in
order to gain access to the laboratory door, or users have to speak their credit card number
over the telephone line to verify their identity. By checking the speech characteristics of the
input utterance, using an automatic voice recognition system similar to the one that we will
describe, the system is able to add an extra level of security.

The main aim of this thesis is speaker identification, which consists of comparing a speech
signal from an unknown speaker to a database of known speakers. The system, which has been
trained with a number of speakers, can then recognize the speaker. In this work, the Mel
Frequency Cepstral Coefficients (MFCC) technique is used to extract features from the
speech signal so that the unknown speaker can be compared with the existing speakers in the
database.

We then use the K-means algorithm to cluster the training vectors into a compact set of
representative feature vectors. The algorithm partitions the vectors, based on their
attributes, into k clusters. Finally, to identify the unknown speaker, we use the Euclidean
distance, which measures the distortion between two sets of vectors. The known speaker with
the lowest distortion distance is chosen as the identity of the unknown speaker.
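
The clustering and matching steps described above can be sketched as follows. This is a
minimal illustration in plain Python rather than the proposed MATLAB implementation. It
assumes one K-means codebook per enrolled speaker and takes the average nearest-centroid
Euclidean distance as the distortion measure. The function names, the two-dimensional toy
vectors (standing in for MFCC vectors) and the choice k = 2 are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster feature vectors into k centroids (one speaker's codebook)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: euclidean(v, centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[i] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids


def distortion(vectors, codebook):
    """Average distance from each vector to its nearest codebook centroid."""
    return sum(min(euclidean(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(test_vectors, codebooks):
    """Return the enrolled speaker whose codebook gives the lowest distortion."""
    return min(codebooks, key=lambda name: distortion(test_vectors, codebooks[name]))


# Toy training data: two speakers whose vectors occupy different regions.
train = {
    "speaker_A": [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]],
    "speaker_B": [[5.0, 5.1], [5.2, 4.9], [6.0, 6.1], [6.1, 5.9]],
}
codebooks = {name: kmeans(vecs, k=2) for name, vecs in train.items()}

unknown = [[0.1, 0.1], [1.1, 1.0]]
print(identify(unknown, codebooks))
```

Storing only k centroids per speaker, instead of every training vector, keeps matching cheap.
The value of k and the number of iterations would need to be chosen experimentally.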

Proposed Tool - MATLAB

MATLAB is a high-performance language for technical computing. It integrates
computation, visualization, and programming in an easy-to-use environment where problems
and solutions are expressed in familiar mathematical notation. Typical uses include:
▪ Mathematical computation
▪ Algorithm development
▪ Data acquisition
▪ Modeling, simulation, and prototyping
▪ Data analysis, exploration, and visualization
▪ Scientific and engineering graphics
▪ Application development, including graphical user interface building
MATLAB is an interactive system whose basic data element is an array that does not require
dimensioning. This allows you to solve many technical computing problems, especially those
with matrix and vector formulations, in a fraction of the time it would take to write a program
in a scalar non interactive language such as C or FORTRAN. The name MATLAB stands for
matrix laboratory. MATLAB was originally written to provide easy access to matrix software
developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate
the LAPACK and BLAS libraries, embedding the state of the art in software for matrix
computation. MATLAB has evolved over a period of years with input from many users. In
university environments, it is the standard instructional tool for introductory and advanced
courses in mathematics, engineering, and science.

In industry, MATLAB is the tool of choice for high-productivity research, development, and
analysis. MATLAB features a family of add-on application-specific solutions called
toolboxes. Very important to most users of MATLAB, toolboxes allow you to learn and
apply specialized technology. Toolboxes are comprehensive collections of MATLAB
functions (M-files) that extend the MATLAB environment to solve particular classes of
problems. Areas in which toolboxes are available include signal processing, control systems,
neural networks, fuzzy logic, wavelets, simulation, and many others.

6. Facilities required for proposed work

Speaker recognition [20] is the process of automatically recognizing who is speaking on the
basis of individual information included in speech waves. Speaker recognition can be
classified into identification and verification. Speaker identification is the process of
determining which registered speaker provides a given utterance. Speaker verification, on the
other hand, is the process of accepting or rejecting the identity claim of a speaker.

All speaker recognition systems contain two main modules: feature extraction and feature
matching. Feature extraction is the process that extracts a small amount of data from the
voice signal that can later be used to represent each speaker. Feature matching involves the
actual procedure to identify the unknown speaker by comparing extracted features from
his/her voice input with the ones from a set of known speakers.

6.1 Voice Feature Extraction [21, 22]

The purpose of this module is to convert the speech waveform, using digital signal processing
(DSP) tools, to a set of features for further analysis. A wide range of possibilities exists
for voice feature extraction, such as Linear Prediction Coding (LPC), Mel-Frequency
Cepstrum Coefficients (MFCC), and others. We use the MFCC technique in our thesis because
it is well known and among the most popular techniques.
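
To make the feature extraction step concrete, the following sketch computes MFCCs for a
single frame in plain Python. It is an illustration of the standard MFCC recipe (Hamming
window, power spectrum, triangular mel filterbank, logarithm, DCT), not the proposed MATLAB
code. Pre-emphasis and splitting the signal into overlapping frames are omitted. The frame
length (256 samples), sampling rate (8 kHz), 20 filters and 12 coefficients are assumed
values, and the function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 / N for bins 0..N/2."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # naive DFT, adequate for one short frame
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for b in range(left, centre):    # rising edge of the triangle
            row[b] = (b - left) / (centre - left)
        for b in range(centre, right):   # falling edge of the triangle
            row[b] = (right - b) / (right - centre)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12):
    """MFCCs of one frame: log mel filterbank energies followed by a DCT-II.
    Coefficient 0 (overall energy) is skipped, a common but optional choice."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(1, num_coeffs + 1)]


# Example: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs), [round(c, 2) for c in coeffs])
```

Each frame of speech yields one such coefficient vector, and the sequence of vectors for an
utterance forms the input to the feature matching stage.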

6.2 Voice Feature Matching

The problem of speaker recognition belongs to a much broader topic in science and
engineering known as pattern recognition. The goal of pattern recognition is to classify
objects of interest into one of a number of categories or classes. The objects of interest are
generically called patterns and in our case are sequences of acoustic vectors that are extracted
from input speech using the MFCC technique. The classes here refer to individual
speakers. Since the classification procedure in our case is applied to extracted features, it
can also be referred to as feature matching [23]. Various techniques are used for voice
feature matching, such as Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM),
K-Means and Vector Quantization (VQ). In this thesis, we use the K-Means approach [24]
due to its ease of implementation and high accuracy.

References

[1] Ashish Kumar Panda, Amit Kumar Sahoo, “Study of Speaker Recognition Systems”,
National Institute of Technology, Rourkela, 2011.

[2] Campbell, J.P., “Speaker recognition: a tutorial”, Proceedings of the IEEE Volume 85,
Issue 9, Sept. 1997 Page(s):1437 – 1462.

[3] Rakesh Tiwari, “An improved algorithm for Speaker Recognition”, School Of
Educational Technology, Jadavpur University, 2010.

[4] Kimberlee A. Kemble, An Introduction to Speech Recognition, [online] Available:
ftp://ftp.software.ibm.com/software/partners/comarketing/na/ss/we/WS_Voice_Server_White_Paper.pdf

[5] M. A. Anusuya, S. K. Katti, “Speech Recognition by Machine: A Review”, International
Journal of Computer Science and Information Security, Vol. 6, No. 3, 2009.

[6] Zue V, Glass, J., Goodine, D., Leung, H., Phillips, M, Polifroni, J., Seneff, S, “Integration
of speech recognition and natural language processing”, in the MIT voyager system, IEEE,
2011.

[7] Fook, C.Y., Hariharan, M., Yaacob, S., Adom, A., “A review: Malay speech recognition
and audio visual speech recognition”, in Biomedical Engineering (ICoBE), International
Conference, 2012.

[8] Parwinder Pal Singh, Er. Bhupinder Singh, “Speech Recognition as Emerging
Revolutionary Technology”, International Journal of Advanced Research in Computer
Science and Software Engineering, Volume 2, Issue 10, October 2012.

[9] Anupam Choudhary, Ravi Kshirsagar, “Process Speech Recognition System using
Artificial Intelligence Technique”, in International Journal of Soft Computing and
Engineering (IJSCE) ISSN: 2231-2307, Volume-2, Issue-5, 2012.

[10] Surbhi Mathur, Choudhary S. K. and Vyas J. M., “Speaker Recognition System and its
Forensic Implications”, Open Access Scientific Reports, Volume 2, Issue 4, 2013.

[11] Dr E. Chandra, K. Manikandan, M. S. Kalaivani, “A Study on Speaker Recognition
System and Pattern classification Techniques”, International Journal of Innovative Research
in Electrical, Electronics, Instrumentation and Control Engineering, Vol. 2, Issue 2, February
2014.

[12] Catherine J Nereveettil, M. Kalamani, Dr. S. Valarmathy, “Feature Selection Algorithm
for Automatic Speech Recognition Based On Fuzzy Logic”, International Journal of
Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 3, Issue
1, January 2014.

[13] Siva Prasad Nandyala and T. Kishore Kumar, “Hybrid HMM/DTW based Speech
Recognition with Kernel Adaptive Filtering Method”, in International Journal on
Computational Sciences & Applications (IJCSA) Vol.4, No.1, 2014.

[14] Xiang-Lilan Zhang, Zhi-Gang Luo, Ming Li, “Merge-Weighted Dynamic Time
Warping for Speech Recognition”, in Journal of Computer Science and Technology, Volume
29, Issue 6, 2014, pp 1072-1082.

[15] D D Doye, T R Sontakke & Smita Nagtode, “The Nonlinear Time Alignment Model for
Speech”, in IETE Journal of Research, Taylor & Francis, 2015, pp 1-6.

[16] Jayashree Padmanabhan and Melvin Jose Johnson Prem kumar, “Machine Learning in
Automatic Speech Recognition: A Survey”, IETE Technical Review, Taylor & Francis, 2015,
pp-1-13.

[17] Parvati J. Chaudhary, Kinjal M. Vagadia, “A Review Article on Speaker Recognition
with Feature Extraction”, International Journal of Emerging Technology and Advanced
Engineering, Volume 5, Issue 2, February 2015.

[18] Neha Chadha, R.C. Gangwar, Rajeev Bedi, “Current Challenges and Application of
Speech Recognition Process using Natural Language Processing: A Survey”, International
Journal of Computer Applications (0975 – 8887) Volume 131 – No.11, December 2015.

[19] Karpagavalli S and Chandra E, “A Review on Automatic Speech Recognition
Architecture and Approaches”, International Journal of Signal Processing, Image Processing
and Pattern Recognition Vol.9, No.4, (2016), pp.393-404.

[20] Douglas A. Reynolds, “An Overview of Automatic Speaker Recognition Technology”,
©2002 IEEE.

[21] Bhupinder Singh, Rupinder Kaur, Nidhi Devgun, Ramandeep Kaur, “The process of
Feature Extraction in Automatic Speech Recognition System for Computer Machine
Interaction with Humans: A Review”,IJARCSSE, Volume 2, Issue 2, February 2012.

[22] Genevieve I. Sapijaszko, Wasfy B. Mikhael, “An Overview of Recent Window Based
Feature Extraction Algorithms for Speaker Recognition”, IEEE, pp 880-883, 2012.

[23] Maider Zamalloa, Germán Bordel, Luis Javier Rodriguez, Mikel Penagarikano,
“Feature Selection Based on Genetic Algorithms for Speaker Recognition”, 2006, IEEE.

[24] Huo ChunBao, Shoa Yan, “The improved VQ algorithm for speaker recognition”,©
2009 IEEE.
