"Speech Recognition and Voice Detection System": Bachelor of Technology in Computer Science Engineering

The document is a project report on speech recognition and voice detection systems submitted by four students. It provides an overview of speech recognition technology, including its history and applications. It describes the development workflow and algorithms used in voice recognition systems. The report also discusses current research funding and the performance and flaws of voice recognition systems. It was submitted in partial fulfillment of the requirements for a Bachelor of Technology degree in computer science engineering.

Uploaded by

Sakshi Agarwal
Copyright
© All Rights Reserved

A

PROJECT REPORT

ON

“SPEECH RECOGNITION AND VOICE DETECTION SYSTEM”


Submitted in Partial fulfilment of the requirement for the award

For the degree of

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE ENGINEERING

Submitted By
1. SAKSHI AGARWAL 1647941016
2. PRAGATI DIXIT 1747910902
3. PRIYANKA 1647910008
4. ISRAR HUSAIN 1647910003

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

RAJSHREE INSTITUTE OF MANAGEMENT & TECHNOLOGY, BAREILLY

UTTAR PRADESH

APRIL, 2020

PROJECT REPORT
ON

“SPEECH RECOGNITION AND VOICE DETECTION SYSTEM”


Submitted in Partial fulfilment of the requirement for the award

For the degree of

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE ENGINEERING

Under the Supervision of

Mr. H.K PATEL

(Assistant Professor)

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

RAJSHREE INSTITUTE OF MANAGEMENT & TECHNOLOGY, BAREILLY

UTTAR PRADESH

APRIL, 2020

CERTIFICATE

This is to certify that the project entitled “SPEECH RECOGNITION AND VOICE DETECTION SYSTEM” is a bona fide work carried out in the Eighth Semester by “SAKSHI AGARWAL, PRAGATI DIXIT, PRIYANKA and ISRAR HUSAIN” in fulfillment of the requirements for the award of Bachelor of Technology in Computer Science Engineering from Rajshree Institute of Management & Technology, Bareilly during the academic year 2019-2020. They carried out the project under our guidance, and to our knowledge no part of this work has been submitted earlier for the award of any degree.

Dr. JYOTI AGARWAL Dr. SAKET AGARWAL Mr. H.K PATEL

(Assistant Professor ) (Dean Academics) (Supervisor)

Head of Department Assistant Professor

(Deptt. CS Engineering ) (Deptt. CS Engineering)


ACKNOWLEDGEMENT

It gives me a great sense of pleasure to present the report of the B.Tech project undertaken during the B.Tech final year. I owe a special debt of gratitude to my project guide, Assistant Professor Mr. H.K PATEL, “Department of Computer Science Engineering, Rajshree Institute of Management & Technology, Bareilly”, for his constant support and guidance throughout the course of my work. His sincerity, thoroughness and perseverance have been a constant source of inspiration for me. It is only through his cognizant efforts that my endeavors have seen the light of day.

I also take the opportunity to acknowledge the contribution of my project co-ordinator, Assistant Professor Mr. Rakesh Patel, Department of Computer Science Engineering, for his kind assistance and cooperation during the development of my project.

I also take the opportunity to acknowledge the contribution of Professor & Head Dr. Jyoti Agarwal, Department of Computer Science Engineering, Rajshree Institute of Management & Technology, Bareilly, for her full support and assistance during the development of the project.

Last but not least, I acknowledge all faculty members of the Department and my friends for their contribution to the completion of the project.

Signature:

Name:

Roll No:

Date:
DEDICATION

This project work is dedicated to our parents and our teachers, who taught us the different roads of
life and directed us towards our destinations, and to all those friends and companions who helped
us while doing this work and made the journey of our university career easier.
PROJECT OVERVIEW

This report presents an overview of speech recognition technology, software development and its applications.
The first section describes the speech recognition process, its applications in different sectors, its
flaws and finally the future of the technology. The later part of the report covers the speech recognition process, the
code for the software and its working. Finally, the report concludes with the different potential uses of the
application and further improvements and considerations.

PROJECT OBJECTIVE

• To understand speech recognition and its fundamentals.
• To study its working and applications in different areas.
• To implement it as a desktop application.
• To develop software that can mainly be used for:
    • Speech Recognition
    • Speech Generation
    • Text Editing
    • Operating a machine through voice

PROJECT SCOPE
This project has speech recognition and speech synthesis capabilities. Although it is not a complete
replacement for Notepad, it is still a good text editor to be used through voice. The software
can also open Windows-based software such as Notepad, MS Paint and more.

ABSTRACT

Speech recognition technology is one of the growing engineering technologies. It has a number of applications in different areas and offers potential benefits. Nearly 20% of the people in the world have some form of disability; many of them are blind or unable to use their hands effectively. In such cases a speech recognition system provides significant help, allowing them to share information with other people by operating a computer through voice input.

This project was designed and developed with that need in mind, as a small effort towards that aim. The project can recognize speech and convert the input audio into text. It also enables a user to perform file operations such as “Save”, “Open” and “Exit” by voice input, and to open system software such as MS Paint, Notepad and Calculator.

At this initial level the effort is to support the basic operations discussed above, but the application can be further updated and enhanced to cover more operations.
TABLE OF CONTENTS

Introduction
History
Classification of Voice Recognition System
Development Workflow
Speech to Data Conversion
Voice Recognition and Statistical Modeling
Application
Performance of Voice Recognition System
Algorithms Used in Voice Recognition System
Current Funding and Research
Flaws and Weakness in Voice Recognition System
References
“VOICE RECOGNITION SYSTEM”

Introduction

A Voice Recognition System is a system that can recognize voices, either to identify the words being spoken or to identify the speaker for security purposes.

Voice recognition is the process of automatically recognizing who is speaking, or what is being said, on the basis
of individual information included in the speech waves. This technique makes it possible to use the speaker's
voice to verify their identity and control access to services such as voice dialing, banking by telephone,
telephone shopping, database access services, information services, voice mail, security control for confidential
information areas, and remote access to computers.

Some voice recognition systems are designed to convert spoken words into text.

Voice recognition software can also be used as an alternative to typing on a keyboard. Put simply,
you talk to the computer and your words appear on the screen. The software has been developed to provide a
fast method of writing on a computer and can help people with a variety of disabilities. It is useful for people
with physical disabilities who often find typing difficult, painful or impossible. Voice recognition software can
also help those with spelling difficulties, including users with dyslexia, because recognized words are always
correctly spelled.

We can see the use of voice recognition systems in our daily life. For example, when we call most large
companies today, a person doesn't usually answer the phone. Instead, an automated voice recording answers and
instructs you to press buttons to move through option menus. Many companies have moved beyond
requiring you to press buttons, though. Often you can just speak certain words (again as instructed by a
recording) to get what you need. The system that makes this possible is a type of voice recognition program:
an automated phone system.

You can also use voice recognition software in homes and businesses. A range of software products allows
users to dictate to their computer and have their words converted to text in a word processing or e-mail
document. You can access function commands, such as opening files and accessing menus, with voice
instructions. Some programs are for specific business settings, such as medical or legal transcription.

People with disabilities that prevent them from typing have also adopted voice recognition systems. For a user
who has lost the use of their hands, or for visually impaired users for whom a keyboard is not possible or
convenient, these systems allow personal expression through dictation as well as control of many computer tasks.
Some programs save users' speech data after every session, allowing people with progressive speech
deterioration to continue to dictate to their computers. Such programs are part of daily life: they are
included in the Windows XP, Windows Vista and Windows 7 operating systems from Microsoft
Corporation, and additional third-party software can also be installed. Popular third-party voice recognition
products include “Dragon Naturally Speaking by Nuance”, “Speech Recognition by Icons”, “Speech Recognition by Tazti”,
“ViaVoice by IBM” and “iListen”.

History

The first speech recognizer appeared in 1952 and consisted of a device for the recognition of single spoken
digits. Another early device was the IBM Shoebox, exhibited at the 1964 New York World's Fair. More recently there
have been numerous improvements, such as high-speed mass transcription capability on a single system like
Sonic Extractor.

One of the most notable domains for the commercial application of speech recognition in the United States
has been health care and in particular the work of the medical transcriptionist (MT). According to industry
experts, at its inception, voice recognition (VR) was sold as a way to completely eliminate transcription rather
than make the transcription process more efficient, hence it was not accepted. It was also the case that VR at
that time was often technically deficient. Additionally, to be used effectively, it required changes to the ways
physicians worked and documented clinical encounters, which many if not all were reluctant to do. The biggest
limitation to voice recognition automating transcription, however, is seen as the software. The nature of
narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but
not yet by an automated system. Another limitation has been the extensive amount of time required by the
user and/or system provider to train the software.

A distinction in Automatic Voice recognition (AVR) is often made between "artificial syntax systems" which are
usually domain-specific and "natural language processing" which is usually language-specific. Each of these
types of application presents its own particular goals and challenges.
Classification of Voice Recognition System

• An Isolated Voice Recognition System requires a brief pause between spoken words; without the pauses it
cannot detect the words completely and will malfunction.

• A Continuous Voice Recognition System does not require a pause between spoken words,
so it can recognize continuous speech. It can be regarded as an advanced version
of the Isolated Voice Recognition System.

• A Speaker-Dependent Voice Recognition System can only recognize speech from one particular
speaker's voice. Systems of this type can be used for security and identification purposes.

• A Speaker-Independent Voice Recognition System can recognize speech from anybody. Systems of this type
are embedded in voice-activated routing at customer call centres, voice dialing on mobile
phones and many other everyday applications. It can be regarded as an advanced version of the Speaker-Dependent
Voice Recognition System.

The Development Workflow of Voice Recognition System

There are two major stages within voice recognition: a training stage and a testing stage. Training
involves “teaching” the system by building its dictionary, that is, an acoustic model for each word that the
system needs to recognize. In the testing stage we use the acoustic models of these words to recognize
spoken words using a classification algorithm.

The Development Workflow consists of three steps:

• Speech Acquisition.

• Speech Analysis.

• User Interface Development.


Speech Acquisition

For training, speech is acquired from the microphone and brought into the development environment for
offline analysis. For testing, the speech is continuously streamed into the environment for online processing.

During the training stage, it is necessary to record repeated utterances of each word in the dictionary. For
example, to add the word “Apple” to the dictionary, we have to record “Apple”
many times with a pause between each utterance. This is necessary for building a robust voice recognition
system. If we fail to do so, the resulting system may produce undesirable responses.

We can record the speech using a microphone and a standard PC sound card. This approach
works well for training data. In the testing stage, we need to continuously acquire and buffer speech samples
and, at the same time, process the incoming speech frame by frame, or in continuous groups of samples.
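
The buffering described above can be sketched as follows. This is a minimal illustration only, not the project's actual code: the function name `stream_frames`, the chunk sizes and the frame and hop lengths are all assumptions chosen for the example, and plain Python lists stand in for audio samples from a sound card.

```python
def stream_frames(chunks, frame_len, hop_len):
    """Yield fixed-length frames from a stream of sample chunks.

    chunks    -- iterable of lists of samples, as they arrive from the input
    frame_len -- number of samples per frame
    hop_len   -- number of samples to advance between frames
    """
    buffer = []
    for chunk in chunks:
        buffer.extend(chunk)              # buffer the newly acquired samples
        while len(buffer) >= frame_len:   # process as soon as a full frame exists
            yield buffer[:frame_len]
            del buffer[:hop_len]          # keep the overlap for the next frame


# Hypothetical input: three chunks of different sizes arriving one after another.
incoming = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
frames = list(stream_frames(incoming, frame_len=4, hop_len=2))
print(frames)
```

Because the function is a generator, each frame can be handed to the analysis stage while later chunks are still being acquired.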

Speech Analysis

Once speech has been acquired into the development environment, it has to be processed and analyzed.
Speech analysis is one of the most complicated and important steps in the development of a voice recognition
system. In this stage a word detection algorithm is developed that separates each word from the ambient noise. Then
an acoustic model is derived that gives a robust representation of each word in the training stage. Finally, an
appropriate classification algorithm is selected for the testing stage.
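
One simple way to separate words from ambient noise is to compare the short-time energy of each frame with a threshold. The report does not say which word detection algorithm was used, so the sketch below is an assumption for illustration; the names `frame_energy` and `detect_words`, the threshold value and the synthetic signal are all hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def detect_words(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of frames above the threshold."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        active = frame_energy(frame) > threshold
        if active and start is None:
            start = i * frame_len                    # a word begins here
        elif not active and start is not None:
            segments.append((start, i * frame_len))  # the word ended
            start = None
    if start is not None:                            # word runs to the end
        segments.append((start, n_frames * frame_len))
    return segments


# Synthetic signal: quiet background with two louder "words".
silence = [0.01, -0.01] * 4
word = [0.5, -0.5] * 4
signal = silence + word + silence + word + silence
segments = detect_words(signal, frame_len=4, threshold=0.01)
print(segments)
```

A practical detector would usually estimate the threshold from the background noise level rather than fixing it by hand.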

User Interface Development

These systems have a Graphical User Interface for the convenience of the users. Through this interface the
users first train the system and can then use it for testing and for their work.

How Does Speech-to-Data Conversion Take Place?

To convert speech to on-screen text or a computer command, a computer has to go through several complex
steps. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this
analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound
by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to
remove unwanted noise, and sometimes to separate it into different bands of frequency (frequency is the
number of sound-wave cycles per second, heard by humans as differences in pitch). It also normalizes the sound, or
adjusts it to a constant volume level. It may also have to be temporally aligned. People don't always speak at
the same speed, so the sound must be adjusted to match the speed of the template sound samples already
stored in the system's memory.
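
Two of these steps, volume normalization and temporal alignment, can be illustrated with a short sketch. The report does not name the alignment method; dynamic time warping (DTW) is used here as one classical technique for comparing sequences spoken at different speeds, and that choice is an assumption. The function names and the toy sequences are hypothetical.

```python
def normalize_peak(samples, target=1.0):
    """Scale samples so that the largest absolute value equals target."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)          # pure silence: nothing to scale
    return [x * target / peak for x in samples]


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


quiet = [0.1, -0.2, 0.05]
print(normalize_peak(quiet))

template = [0, 1, 2, 1, 0]                   # stored template
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]        # same shape, spoken at half speed
different = [2, 2, 0, 0, 2]                  # a different shape
print(dtw_distance(template, slow), dtw_distance(template, different))
```

The slowed-down copy aligns with the template at zero cost, while the differently shaped sequence does not, which is the property that makes this kind of alignment useful for matching against stored templates.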

Next the signal is divided into small segments as short as a few hundredths of a second, or even thousandths in
the case of plosive consonant sounds -- consonant stops produced by obstructing airflow in the vocal tract --
like "p" or "t." The program then matches these segments to known phonemes in the appropriate language. A
phoneme is the smallest element of a language -- a representation of the sounds we make and put together to
form meaningful expressions. There are roughly 40 phonemes in the English language (different linguists have
different opinions on the exact number), while other languages have more or fewer phonemes.

The next step seems simple, but it is actually the most difficult to accomplish and is the focus of most speech
recognition research. The program examines phonemes in the context of the other phonemes around them. It
runs the contextual phoneme plot through a complex statistical model and compares it to a large library of
known words, phrases and sentences. The program then determines what the user was probably saying and
either outputs it as text or issues a computer command.

Voice Recognition and Statistical Modeling

Early speech recognition systems tried to apply a set of grammatical and syntactical rules to speech. If the
words spoken fit into a certain set of rules, the program could determine what the words were. However,
human language has numerous exceptions to its own rules, even when it's spoken consistently. Accents,
dialects and mannerisms can vastly change the way certain words or phrases are spoken. Imagine someone
from Boston saying the word "barn." He wouldn't pronounce the "r" at all, and the word comes out rhyming
with "John." Or consider the sentence, "I'm going to see the ocean." Most people don't enunciate their words
very carefully. The result might come out as "I'm goin' da see tha ocean." They run several of the words
together with no noticeable break, such as "I'm goin'" and "the ocean." Rules-based systems were unsuccessful
because they couldn't handle these variations. This also explains why earlier systems could not handle
continuous speech -- you had to speak each word separately, with a brief pause in between them.

Today's speech recognition systems use powerful and complicated statistical modeling systems. These
systems use probability and mathematical functions to determine the most likely outcome. According to John
Garofolo, Speech Group Manager at the Information Technology Laboratory of the National Institute of
Standards and Technology, the two models that dominate the field today are the Hidden Markov Model and
neural networks. These methods involve complex mathematical functions, but essentially, they take the
information known to the system to figure out the information hidden from it.

The Hidden Markov Model is the most common, so we'll take a closer look at that process. In this model, each
phoneme is like a link in a chain, and the completed chain is a word. However, the chain branches off in
different directions as the program attempts to match the digital sound with the phoneme that's most likely to
come next. During this process, the program assigns a probability score to each phoneme, based on its built-in
dictionary and user training.
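
The search for the most likely chain of phonemes in a Hidden Markov Model is commonly done with the Viterbi algorithm. The sketch below is a toy illustration under stated assumptions: the three phoneme states, the observation symbols "x", "y" and "z" (standing in for quantized acoustic measurements) and every probability value are invented for the example and do not come from a trained system.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical model for the word "cat": phonemes k, ae, t.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {
    "k": {"k": 0.3, "ae": 0.6, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.3, "t": 0.6},
    "t": {"k": 0.1, "ae": 0.1, "t": 0.8},
}
emit_p = {
    "k": {"x": 0.7, "y": 0.2, "z": 0.1},
    "ae": {"x": 0.1, "y": 0.8, "z": 0.1},
    "t": {"x": 0.2, "y": 0.1, "z": 0.7},
}
path, prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Each step keeps only the best path into every state, so the work grows linearly with the length of the utterance instead of exploring every possible chain.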

This process is even more complicated for phrases and sentences -- the system has to figure out where each
word stops and starts. The classic example is the phrase "recognize speech," which sounds a lot like "wreck a
nice beach" when you say it very quickly. The program has to analyze the phonemes using the phrase that
came before it in order to get it right. Here's a breakdown of the two phrases:

"recognize speech":      r eh k ao g n ay z   s p iy ch
"wreck a nice beach":    r eh k   ay   n ay s   b iy ch

Why is this so complicated? If a program has a vocabulary of 60,000 words (common in today's programs), a
sequence of three words could be any of 216 trillion possibilities. Obviously, even the most powerful computer
can't search through all of them without some help.
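
The figure quoted above follows directly from the vocabulary size: every position in a three-word sequence can hold any of the 60,000 words. A quick check, with variable names chosen only for this example:

```python
vocabulary_size = 60_000
sequence_length = 3

# Each of the three positions can be any word in the vocabulary.
possibilities = vocabulary_size ** sequence_length
print(f"{possibilities:,}")  # 216,000,000,000,000, i.e. 216 trillion
```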
That help comes in the form of program training. According to John Garofolo:

“These statistical systems need lots of exemplary training data to reach their optimal performance -- sometimes
on the order of thousands of hours of human-transcribed speech and hundreds of megabytes of text. These
training data are used to create acoustic models of words, word lists, and [...] multi-word probability networks.
There is some art into how one selects, compiles and prepares this training data for "digestion" by the system
and how the system models are "tuned" to a particular application. These details can make the difference
between a well-performing system and a poorly-performing system -- even when using the same basic
algorithm.”

While the software developers who set up the system's initial vocabulary perform much of this training, the end
user must also spend some time training it. In a business setting, the primary users of the program must spend
some time (sometimes as little as 10 minutes) speaking into the system to train it on their particular speech
patterns. They must also train the system to recognize terms and acronyms particular to the company. Special
editions of speech recognition programs for medical or legal offices have terms commonly used in those fields
already trained into them.
Applications

Health Care

In the health care domain, even in the wake of improving speech recognition technologies, medical
transcriptionists (MTs) have not yet become obsolete. The services provided may be redistributed rather than
replaced. Speech recognition is used to enable deaf people to understand the spoken word via speech to text
conversion, which is very helpful.

Many Electronic Medical Records (EMR) applications can be more effective and may be performed more easily
when deployed in conjunction with a speech-recognition engine. Searches, queries, and form filling may all be
faster to perform by voice than by using a keyboard.

Military

High-Performance Fighter Aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in
fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter
Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France on installing speech
recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In
these programs, speech recognizers have been operated successfully in fighter aircraft with applications
including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and
weapons release parameters, and controlling flight displays. Generally, only very limited, constrained
vocabularies have been used successfully, and a major effort has been devoted to integration of the speech
recognizer with the avionics system.

Some important conclusions from the work were as follows:

1. Speech recognition has definite potential for reducing pilot workload, but this potential was not
realized consistently.
2. Achievement of very high recognition accuracy (95% or more) was the most critical factor for making
the speech recognition system useful — with lower recognition rates, pilots would not use the system.
3. More natural vocabulary and grammar, and shorter training times would be useful, but only if very high
recognition rates could be maintained.

Laboratory research in robust speech recognition for military environments has produced promising results
which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance
aircraft.

Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter
environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the
helicopter environment, not only because of the high noise levels but also because the helicopter pilot
generally does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and
evaluation programs have been carried out in the past decade in speech recognition systems applications in
helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal
Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma
helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice
applications have included: control of communication radios; setting of navigation systems; and control of an
automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness.
Encouraging results are reported for the AVRADA tests, although these represent only a feasibility
demonstration in a test environment. Much remains to be done, both in speech recognition and in overall
speech technology, in order to consistently achieve performance improvements in operational
settings.

Battle Management

Battle management command centers generally require rapid access to and control of large, rapidly changing
information databases. Commanders and system operators need to query these databases as conveniently as
possible, in an eyes-busy environment where much of the information is presented in a display format.
Human-machine interaction by voice has the potential to be very useful in these environments. A number of efforts
have been undertaken to interface commercially available isolated-word recognizers into battle management
environments. In one feasibility study, speech recognition equipment was tested in conjunction with an
integrated information display for naval battle management applications. Users were very optimistic about the
potential of the system, although capabilities were limited.

Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in
the U.S. have focused on this problem of natural speech interface. Continuous speech recognition (CSR) efforts
have focused on a database of large-vocabulary speech which is designed to be
representative of the naval resource management task. Significant advances in the state of the art in CSR have
been achieved, and current efforts are focused on integrating speech recognition and natural language
processing to allow spoken language interaction with a naval resource management system.

Training Air Traffic Controllers

Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems.
Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog
with the trainee controller, which simulates the dialog which the controller would have to conduct with pilots in
a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for
a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also
characterized by highly structured speech as the primary output of the controller, hence reducing the difficulty
of the speech recognition task.

The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers
using speech recognition. Generally, the recognition accuracy falls short of providing graceful interaction
between the trainee and the system. However, the prototype training systems have demonstrated a significant
potential for voice interaction in these systems, and in other training applications. The U.S. Navy has sponsored
a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a
complex training system including displays and scenario creation. Although the recognizer was constrained in
vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained
language, using specific vocabulary specifically designed for the ATC task. Research in France has focused on
the application of speech recognition in ATC training systems, directed at issues both in speech recognition
and in application of task-domain grammar constraints.

The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech recognition from a number
of different vendors, including UFA, Inc, and Adacel Systems Inc (ASI). This software uses speech recognition
and synthetic speech to enable the trainee to control aircraft and ground vehicles in the simulation without the
need for pseudo pilots.

Another approach to ATC simulation with speech recognition has been created by Supremis. The Supremis
system is not constrained by rigid grammars imposed by the underlying limitations of other recognition
strategies.

Telephony and Other Domains

ASR in the field of telephony is now commonplace and in the field of computer gaming and simulation is
becoming more widespread. Despite the high level of integration with word processing in general personal
computing, however, ASR in the field of document production has not seen the expected increases in use.

The improvement of mobile processor speeds made speech-enabled Symbian and Windows
Mobile smartphones feasible. Speech is used mostly as a part of the user interface, for creating pre-defined or custom
speech commands. Leading software vendors in this field are: Microsoft Corporation (Microsoft Voice
Command), Nuance Communications (Nuance Voice Control), Vito Technology (VITO Voice2Go), Speereo
Software (Speereo Voice Translator), Digital Syphon (Sonic Massager appliance) and SVOX.

People with Disabilities

People with disabilities can benefit from speech recognition programs. Speech recognition is especially useful
for people who have difficulty using their hands, ranging from mild repetitive stress injuries to involved
disabilities that preclude using conventional computer input devices. In fact, people who used the keyboard a
lot and developed RSI became an urgent early market for speech recognition. Speech recognition is used in
deaf telephony, such as voicemail to text, relay services, and captioned telephone. Individuals with learning
disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is
processed incorrectly causing it to end up differently on paper) can benefit from the software.

Further Applications

• Automatic translation
• Automotive speech recognition
• Telematics (e.g. vehicle navigation systems)
• Court reporting (real-time voice writing)
• Hands-free computing: voice command recognition computer user interface
• Home automation
• Interactive voice response
• Mobile telephony, including mobile email
• Multimodal interaction
• Pronunciation evaluation in computer-aided language learning applications
• Robotics
• Video games, with Tom Clancy's EndWar and Lifeline as working examples
• Transcription (digital speech-to-text)
• Speech-to-text (transcription of speech into mobile text messages)
• Air traffic control speech recognition

Performance of Voice Recognition System

The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is
usually rated with word error rate (WER), whereas speed is measured with the real time factor. Other measures
of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
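
The two measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-separated transcripts and the usual definition of WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words; the function names are chosen here for illustration and do not come from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """A value below 1.0 means the recognizer runs faster than real time."""
    return processing_seconds / audio_seconds


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is why it is reported as an error rate rather than as "100% minus accuracy".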

In 1982 Kurzweil Applied Intelligence and Dragon Systems released speech recognition products. By 1985,
Kurzweil’s software had a vocabulary of 1,000 words—if uttered one word at a time. Two years later, in 1987, its
lexicon reached 20,000 words, entering the realm of human vocabularies, which range from 10,000 to 150,000
words. But recognition accuracy was only 10% in 1993. Two years later, the error rate crossed below 50%.
Dragon Systems released "NaturallySpeaking" in 1997, which recognized normal human speech. Progress
mainly came from improved computer performance and larger source text databases. The Brown Corpus was
the first major database available, containing about one million words. In 2001 recognition accuracy reached its
current plateau of 80%, no longer growing with data or computing power. In 2006, Google published a trillion-
word corpus, while Carnegie Mellon University researchers found no significant increase in recognition
accuracy.

Dictation in Voice Recognition Systems

Dictation machines can achieve good performance in controlled conditions. There is some confusion, however,
over the interchangeability of the terms "speech recognition" and "dictation". Commercial speaker-dependent
dictation systems usually require only a short training period (sometimes also called 'enrollment') and may
successfully capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most
commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated
under optimal conditions. 'Optimal conditions' usually assume that users:

• have speech characteristics which match the training data,
• can achieve proper speaker adaptation, and
• work in a low-noise environment (e.g. a quiet office or laboratory space).

This explains why some users, such as those whose speech is heavily accented, experience much lower
recognition rates.

Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten
digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their
destinations in large organizations.

Algorithm Used in Voice Recognition System

Both acoustic modeling and language modeling are important parts of modern statistically-based speech
recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling
has many other applications such as smart keyboard and document classification.

Hidden Markov models


Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are
statistical models which output a sequence of symbols or quantities. HMMs are used in speech recognition
because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. Over a
short time scale (e.g., 10 milliseconds), speech can be approximated as a stationary process. Speech can be thought
of as a Markov model for many stochastic purposes.

Another reason why HMMs are popular is because they can be trained automatically and are simple and
computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of
n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10
milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier
transform of a short time window of speech and decorrelating the log spectrum using a cosine transform, then
taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a
statistical distribution that is a mixture of diagonal covariance Gaussians, which gives a likelihood for each
observed vector. Each word (or, for more general speech recognition systems, each phoneme) will have a
different output distribution; a hidden Markov model for a sequence of words or phonemes is made by
concatenating the individual trained hidden Markov models for the separate words and phonemes.
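
The feature extraction step described above can be sketched as follows. This is a simplified illustration only, assuming a plain cepstrum (Fourier transform, log magnitude, cosine transform, keep the first coefficients); practical front ends usually add windowing and a perceptual filter bank, which are left out here. The function name, the 8 kHz sampling rate and the test tone are assumptions made for the example.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=10):
    """Return the first num_coeffs cepstral coefficients of one short frame."""
    n = len(frame)
    # Magnitude spectrum via a direct DFT (O(n^2), acceptable for a short
    # frame), keeping only the non-negative frequencies of a real signal.
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        magnitudes.append(abs(acc))
    # Log spectrum; the small constant avoids log(0) on silent bins.
    log_spectrum = [math.log(m + 1e-10) for m in magnitudes]
    # DCT-II decorrelates the log spectrum; low-order terms carry the
    # smooth spectral envelope.
    size = len(log_spectrum)
    return [
        sum(v * math.cos(math.pi * q * (i + 0.5) / size) for i, v in enumerate(log_spectrum))
        for q in range(num_coeffs)
    ]


# A 10 ms frame at an assumed 8 kHz sampling rate (80 samples) of a 440 Hz tone.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(80)]
features = cepstral_coefficients(frame, num_coeffs=10)
```

Calling this function once per 10 ms frame yields the sequence of n-dimensional vectors that the HMM states are trained to model.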

Described above are the core elements of the most common, HMM-based approach to speech recognition.
Modern speech recognition systems use various combinations of a number of standard techniques in order to
improve results over the basic approach described above. A typical large-vocabulary system would need
context dependency for the phonemes (so phonemes with different left and right context have different
realizations as HMM states); it would use cepstral normalization to normalize for different speaker and
recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for
male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker
adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics
and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and
delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic
linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear
transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a
purely statistical approach to HMM parameter estimation and instead optimize some classification-related
measure of the training data. Examples are maximum mutual information (MMI), minimum classification error
(MCE) and minimum phone error (MPE).
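
Two of the simpler refinements mentioned above, cepstral normalization and delta/delta-delta coefficients, can be illustrated with a short sketch. It assumes per-utterance mean subtraction and a simple two-point difference for the deltas; real systems typically use a regression over a wider window, so this should be read as an illustration of the idea rather than as the formula used by any specific toolkit. All function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def delta(frames):
    """Half the difference between the next and previous frame (edge frames repeat)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_dynamic_features(frames):
    """Append delta and delta-delta coefficients to mean-normalized static features."""
    static = cepstral_mean_normalize(frames)
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Three frames of 2-dimensional features become three 6-dimensional vectors.
features = add_dynamic_features([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

The normalization removes a constant offset caused by the channel or microphone, while the appended differences give each frame some information about how the spectrum is changing over time.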

Decoding of the speech (the term for what happens when the system is presented with a new utterance and
must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path,
and here there is a choice between dynamically creating a combination hidden Markov model which includes
both the acoustic and language model information, or combining it statically beforehand (the finite state
transducer, or FST, approach).
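
The Viterbi search mentioned above can be sketched on a toy model. The example below assumes a discrete-output HMM with two made-up states ("sil" and "speech") and two made-up observation symbols ("low" and "high" energy); all probabilities are invented for illustration and must be non-zero, since the sketch works with logarithms. A real decoder operates on continuous feature vectors and a far larger state space.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, state path) of the most likely state sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev_state, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev_state
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy model with invented probabilities.
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3}, "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.9, "high": 0.1}, "speech": {"low": 0.2, "high": 0.8}}

log_prob, path = viterbi(["low", "high", "high", "low"], states, start_p, trans_p, emit_p)
# path == ["sil", "speech", "speech", "sil"]
```

Working in the log domain avoids numerical underflow, which matters once utterances run to hundreds of frames.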

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an approach that was historically used for speech recognition but has now largely
been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for
measuring similarity between two sequences which may vary in time or speed. For instance, similarities in
walking patterns would be detected, even if in one video the person was walking slowly and if in another they
were walking more quickly, or even if there were accelerations and decelerations during the course of one
observation. DTW has been applied to video, audio, and graphics – indeed, any data which can be turned into a
linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In
general, it is a method that allows a computer to find an optimal match between two given sequences (e.g.
time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This
sequence alignment method is often used in the context of hidden Markov models.
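
A minimal sketch of the dynamic time warping recurrence is given below. It assumes one-dimensional numeric sequences, an absolute-difference local cost and no constraint on the warping path; a speech application would compare feature vectors frame by frame and usually restrict how far the path may stray from the diagonal. The function name is chosen for this example.

```python
def dtw_distance(seq_a, seq_b):
    """Cost of the best non-linear alignment between two numeric sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b element is reused
                cost[i][j - 1],      # seq_b advances, seq_a element is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same pattern "spoken" at half speed aligns with zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

In a template-based recognizer, an unknown word would be compared against every stored template with this distance, and the template with the smallest cost would be chosen.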

Further Information

Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe,
ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of natural
language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech
processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE
Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech
Communication. Books like "Fundamentals of Speech Recognition" (1993) by Lawrence Rabiner can be useful to
acquire basic knowledge but may not be fully up to date. Other good sources are "Statistical
Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing" (2001) by Xuedong
Huang et al. More up to date is "Computer Speech" by Manfred R. Schroeder, second edition published in 2004.
The more recently updated textbook "Speech and Language Processing" (2008) by Jurafsky and Martin presents
the basics and the state of the art for ASR. A good insight into the techniques used in the best modern systems
can be gained by paying attention to government sponsored evaluations such as those organised by DARPA
(the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both
speech recognition and translation components).

In terms of freely available resources, Carnegie Mellon University's SPHINX toolkit is one place to start, both to
learn about speech recognition and to begin experimenting. Another resource (free as in free beer, not free
software) is the HTK book (and the accompanying HTK toolkit). The AT&T GRM and DCD libraries
are also general software libraries for large-vocabulary speech recognition.

A useful review of the area of robustness in ASR is provided by Junqua and Haton (1995).

Current Research and Funding

Measuring progress in speech recognition performance is difficult and controversial. Some speech recognition
tasks are much more difficult than others. Word error rates on some tasks are less than one percent. On others
they can be as high as 50%. Sometimes it even appears that performance is going backwards as researchers
undertake harder tasks that have higher error rates.

Because progress is slow and is difficult to measure, there is some perception that performance has plateaued
and that funding has dried up or shifted priorities. Such perceptions are not new. In 1969, John Pierce wrote an
open letter that did cause much funding to dry up for several years. In 1993 there was a strong feeling that
performance had plateaued and there were workshops dedicated to the issue. However, in the 1990s funding
continued more or less uninterrupted and performance continued to slowly but steadily improve.

For the past thirty years, speech recognition research has been characterized by the steady accumulation of
small incremental improvements. There has also been a trend to continually change focus to more difficult
tasks due both to progress in speech recognition performance and to the availability of faster computers. In
particular, this shifting to more difficult tasks has characterized DARPA funding of speech recognition since the
1980s. In the last decade it has continued with the EARS project, which undertook recognition of Mandarin and
Arabic in addition to English, and the GALE project, which focused solely on Mandarin and Arabic and required
translation simultaneously with speech recognition.

Commercial research and other academic research also continue to focus on increasingly difficult problems.
One key area is to improve robustness of speech recognition performance, not just robustness against noise
but robustness against any condition that causes a major degradation in performance. Another key area of
research is focused on an opportunity rather than a problem. This research attempts to take advantage of the
fact that in many applications there is a large quantity of speech data available, up to millions of hours. It is too
expensive to have humans transcribe such large quantities of speech, so the research focus is on developing
new methods of machine learning that can effectively utilize large quantities of unlabeled data. Another area of
research is gaining a better understanding of human capabilities and using this understanding to improve machine
recognition performance.

Voice Recognition Systems: Weaknesses and Flaws

No speech recognition system is perfect; several factors can reduce accuracy. Some of these
factors are issues that continue to improve as the technology improves. Others can be lessened, if not
completely corrected, by the user.

Low signal-to-noise ratio

The program needs to "hear" the words spoken distinctly, and any extra noise introduced into the sound will
interfere with this. The noise can come from a number of sources, including loud background noise in an office
environment. Users should work in a quiet room with a quality microphone positioned as close to their mouths
as possible. Low-quality sound cards, which provide the input for the microphone to send the signal to the
computer, often do not have enough shielding from the electrical signals produced by other computer
components. They can introduce hum or hiss into the signal.
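
For reference, the signal-to-noise ratio named in this heading is conventionally expressed in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below assumes that separate sample lists for the speech and for the background noise are available (for instance, a recording of the room with nobody speaking); both lists and the function name are hypothetical.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from two lists of audio samples."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)


# Speech ten times larger in amplitude than the noise gives roughly 20 dB.
ratio = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```

A higher value means cleaner input; moving the microphone closer to the mouth raises the signal power without raising the noise power.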

Overlapping speech

Current systems have difficulty separating simultaneous speech from multiple users. "If you try to employ
recognition technology in conversations or meetings where people frequently interrupt each other or talk over
one another, you're likely to get extremely poor results," says John Garofolo.

Intensive use of computer power

Running the statistical models needed for speech recognition requires the computer's processor to do a lot of
heavy work. One reason for this is the need to remember each stage of the word-recognition search in case the
system needs to backtrack to come up with the right word. The fastest personal computers in use today can
still have difficulties with complicated commands or phrases, slowing down the response time significantly. The
vocabularies needed by the programs also take up a large amount of hard drive space. Fortunately, disk
storage and processor speed are areas of rapid advancement: the computers in use 10 years from now will
benefit from an exponential increase in both factors.

Homonyms

Homonyms (more precisely, homophones) are words that are spelled differently and have different meanings but sound the same.
"There" and "their," "air" and "heir," "be" and "bee" are all examples. There is no way for a speech recognition
program to tell the difference between these words based on sound alone. However, extensive training of
systems and statistical models that take into account word context has greatly improved their performance.
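
The idea of using word context to choose between words that sound alike can be sketched with a tiny bigram model. The counts below are invented purely for illustration, and the scoring (adding the count of the pair formed with the previous word to the count of the pair formed with the next word) is a deliberately simple stand-in for the statistical language models used in real recognizers.

```python
# Invented bigram counts: how often each word pair was seen in some training text.
BIGRAM_COUNTS = {
    ("in", "their"): 20,
    ("in", "there"): 8,
    ("their", "house"): 50,
    ("there", "house"): 1,
    ("over", "there"): 40,
    ("over", "their"): 5,
    ("there", "is"): 60,
    ("their", "is"): 1,
}


def pick_homophone(candidates, previous_word, next_word, counts=BIGRAM_COUNTS):
    """Choose the candidate spelling that fits best between its neighbours."""
    def score(word):
        return counts.get((previous_word, word), 0) + counts.get((word, next_word), 0)
    return max(candidates, key=score)


# "in ___ house" favours "their"; "over ___ is" favours "there".
first = pick_homophone(["there", "their"], "in", "house")
second = pick_homophone(["there", "their"], "over", "is")
```

The acoustic model alone scores both spellings equally, so the decision rests entirely on which word sequence the language model considers more likely.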

References:
• http://en.wikipedia.org/wiki/Speech_recognition
• http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm
• https://www.microsoft.com/enable/products/windowsvista/speech.aspx
• www.nuance.com/naturallyspeaking/
• www.faqs.org/docs/Linux.../Speech-Recognition-HOWTO.html
