Speech Recognition Report
{PT-CSE-425G}
On
“VOICE RECOGNITION”
Bachelor of Technology
In
Computer Science &
Engineering (Batch: 2021-25)
I have not submitted the matter embodied in this project report for any other
degree or diploma.
This is to certify that NIDHI of CSE, VIth Semester, 2021-2025, KIIT College of
Engineering, Gurugram, has successfully completed the project work entitled "Speech
Recognition using Python" for the completion of the Bachelor of Technology (Computer
Science and Engineering) degree as prescribed by Gurugram University, Gurugram.
This project report is a record of authentic work carried out by her under the guidance of
Dr. Kanika Kaur, Dr. Atul Kumar and Dr. Seema Sharma.
She has worked under our guidance. The performance of the student is satisfactory.
H.O.D. and Professor
I take immense pleasure in thanking Prof. (Dr.) S. S. Agrawal, Director General (KIIT
Group of Colleges), Prof. (Dr.) Mahavir Singh (Principal), and Prof. (Dr.) Kanika
Kaur (H.O.D.) for permitting me to carry out this project work.
I wish to express my sense of gratitude to our project supervisors, Dr. Atul Kumar and
Dr. Seema Sharma, for their guidance, which helped me complete the project work.
Finally, yet importantly, I would like to express my heartfelt thanks to our beloved parents
for their blessings, and to our friends and classmates for their help and their wishes for
the successful completion of this project.
ABSTRACT
This project is designed and developed keeping this factor in mind, as a small
effort towards achieving this aim. Our project is capable of recognizing speech
and converting it into text.
TABLE OF CONTENTS
1.1 INTRODUCTION
Fig: 1.1 Unified Framework
• In many areas of the country there are a lot of people who cannot read or write, so
this project is very helpful for such people.
• In today's world everybody has a mobile phone and wants to search for a lot of things.
With this project, users simply speak what they want to search, and the corresponding
results open in the browser window.
• In this project, we made our machine recognize speech passed as an audio file, as
well as dissect the speech based on the requirement.
• Our aim is to make search fast, efficient, and reliable for every person by
implementing basic search commands, correcting the user's vocabulary easily, and
further implementing a speaking mode like Siri on iPhones.
2. Objectives
Speech recognition is the process of recognizing the words spoken by a human,
converting this speech into text, and analysing this text to produce the results
required by the user. The performance of such a system depends on a number of
factors, such as the speed of the words spoken by the user, the vocabulary, and the
background noise caused by the environment. The SpeechRecognition package
provided on PyPI can help reduce factors such as background noise, which makes
the speech suitable for processing and for performing the tasks given to this system,
such as word recognition and web searches.
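As a rough illustration of the pipeline described above, the sketch below recognizes speech from a recording, prints the text, and triggers a simple web search. It assumes the SpeechRecognition package is installed; the file name command.wav and the "search" keyword are placeholders, not part of this report.

# Rough sketch of the pipeline: recognize speech, convert it to text,
# then act on the result (here, a web search).
# Assumptions: pip install SpeechRecognition; "command.wav" is a placeholder file.
import webbrowser
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("command.wav") as source:
    audio = recognizer.record(source)          # read the whole recording

try:
    text = recognizer.recognize_google(audio)  # speech -> text via the Google Web Speech API
    print("You said:", text)
    if "search" in text.lower():               # simple word recognition -> web search
        query = text.lower().replace("search", "", 1).strip()
        webbrowser.open("https://www.google.com/search?q=" + query)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("API request failed: {0}".format(e))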
1.4 Methodology:
1.5 Scope
• The speech recognition system in this project has capabilities similar to the systems
used by iPhones and Google, but it cannot be as effective as the functionality
provided by those systems.
• This project is a basic implementation of speech-to-text conversion that also
performs the basic tasks given by the user to the system.
CHAPTER 2
LITERATURE REVIEW
2.1 HISTORY
The first speech recognition systems were focused on numbers, not words.
In 1952, Bell Laboratories designed the "Audrey" system, which could
recognize a single voice speaking digits aloud. Ten years later, IBM
introduced "Shoebox", which understood 16 words in English.
Across the globe, other nations developed hardware that could recognize sounds
and speech, and by the end of the '60s the technology could support words with four
vowels and nine consonants.
1970s
Speech recognition made several meaningful advancements in this decade,
mostly due to the US Department of Defense and DARPA. The
Speech Understanding Research (SUR) program they ran was one of the largest of
its kind in the history of speech recognition. Carnegie Mellon's "Harpy" speech system
came from this program and was capable of understanding over 1,000 words,
which is about the same as a three-year-old's vocabulary.
Also significant in the '70s was Bell Laboratories' introduction of a
system that could interpret multiple voices.
1980s
The '80s saw speech recognition vocabularies grow from a few hundred words to
several thousand words. One of the breakthroughs came from a statistical
method known as the Hidden Markov Model (HMM). Instead of just
using words and looking for sound patterns, the HMM estimated the
probability of unknown sounds actually being words.
1990s
Speech recognition was propelled forward in the '90s in large part because of the
personal computer. Faster processors made it possible for software like
Dragon Dictate to become more widely used. BellSouth introduced the voice
portal (VAL), a dial-in interactive voice recognition system. This
system gave birth to the myriad of phone tree systems that are still in
existence today.
2000s
• By the early 2000s, speech recognition technology had achieved close to
80 percent accuracy.
• For most of the decade there were not many advancements, until Google
arrived with the launch of Google Voice Search.
• As an application, it put speech recognition into the hands of lakhs of
people.
• It was also significant because the processing power could be
offloaded to Google's data centres.
• Not only that, the Google application was collecting data from many billions
of searches, which could help it predict what a person is actually
saying.
• At that time, Google's English voice search system included 240 billion words
from user searches.
2010s
In 2011, Apple launched Siri, which was similar to Google's Voice Search.
The early part of the decade saw an explosion of other voice recognition
applications.
And with Amazon's Alexa and Google Home, we have seen consumers becoming more
and more comfortable talking to machines.
Today, some of the largest technology companies are competing for the speech
accuracy title. In 2015, IBM achieved a word error rate of 6.8%.
In 2016, Microsoft surpassed IBM with a 5.8% claim. Shortly after that, IBM
improved their rate to 5.4%. However, it is Google that claims the lowest rate, at
4.8 percent.
The Future
The technology to support speech applications is today both relatively inexpensive and
powerful. With the advances in artificial intelligence and the increasing amounts of
speech data that can easily be mined, it is now possible that voice becomes the next
dominant interface.
At Sonix, we also applaud the many companies before us that propelled speech
recognition to where it is today. We automate the transcription workflow and make it
fast, easy and more affordable.
We could not do this without the work that has been done before us.
• Analysis:
Basically, this refers to how the words are spoken: connected or isolated. An
isolated-word speech recognition system requires the speaker to pause between
the words they speak; it handles single words. A connected-word speech
recognition system does not require the speaker to pause briefly between words;
it generally handles full-length sentences in which the words are then artificially
separated by silence.
Speaking Style:
Generally, this covers whether the speech is in continuous or spontaneous form.
Continuous speech is spoken in a natural manner, and systems are evaluated on
speech read from prepared scripts. Spontaneous or extemporaneously generated
speech is much harder to handle than speech read from a written script, as it tends
to be peppered with disfluencies like "uuh" and "uum", incomplete sentences,
spluttering, stuttering, sneezing and coughing, and the vocabulary is essentially
unlimited. So the system must be trained to be able to cope with unknown and
hidden words.
Vocabulary:
It is much simpler to discriminate among a small set of words, but the error rate
increases as the size of the vocabulary increases.
For example, the 10 digits from 0 to 9 can easily be recognized correctly, whereas
vocabularies of size 100, 4,000 and 15,000 have error rates of about 3%, 6% and
40% respectively. A vocabulary is hard to recognize if it contains easily
confused words.
Enrollment:
This is of two kinds:
1) Speaker dependent 2) Speaker independent
In a speaker-dependent system the user must provide various samples of his or her
speech before the system can be used; such a system is meant for use by
only a single speaker. A speaker-independent system, on the other hand, is
intended to be used by any speaker.
Fig: 2.3 Speech Recognition Process
2.4 APPLICATION
SYSTEM DEVELOPMENT
1. Speech Synthesis
1. Evaluation of Synthetic Speech:
GRAPH 3.1
(Line graph showing transcription accuracy by speaking rate for expert and
non-expert users of text-to-speech synthesizers)
3.2 Packages Used :
3. webbrowser: With this package we can make use of our default
browser to locate, retrieve and display data. The URL and the
query are passed to the webbrowser module, and based on the URL and the
query provided, the particular webpage opens (a short sketch follows below).
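A minimal sketch of this usage, with an assumed base URL and query (both placeholders, not taken from the report):

# Open a search results page in the default browser using the standard
# library webbrowser module. The URL and query below are illustrative.
import webbrowser
from urllib.parse import quote_plus

url = "https://www.google.com/search?q="
query = "speech recognition in python"

webbrowser.open_new(url + quote_plus(query))   # opens the result page in the default browser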
Recognizer Class: All the major processing for speech recognition occurs in the
Recognizer class. The main purpose of a Recognizer instance is to recognize speech,
and it provides various methods which help in recognizing speech from an audio
source.
The path of the audio file can be passed as an argument to the AudioFile class,
which also provides a context manager that helps in reading and working with the
file's contents.
The context manager is responsible for opening the audio file and storing the file's
data in the AudioFile instance. The record() method is then used to capture the data
from the entire audio file and store it in an AudioData instance.
recognize_google() is used to recognize the speech in the audio.
The displayed results depend on the internet connection speed, and the speech-to-text
conversion depends immensely on the accent and the speed of the speaker. Because
we used an audio file, our speech recognition system caught some words differently
owing to the vocabulary of the speaker.
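A short sketch of this file-based flow; the file name harvard.wav is a placeholder for any WAV/AIFF/FLAC recording, not a file from this report:

# File-based recognition: open the audio file, record it into an AudioData
# instance, and transcribe it with the Google Web Speech API.
import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile("harvard.wav") as source:   # context manager opens the file
    audio = r.record(source)                  # AudioData for the entire file

print(r.recognize_google(audio))              # print the transcription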
Fig: Usage of offset and duration
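As the figure suggests, the record() method also accepts offset and duration keyword arguments, which capture only a portion of the file. A sketch using the same placeholder file:

# Capture only part of the file: skip the first 4 seconds, then record 3 seconds.
import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile("harvard.wav") as source:
    audio = r.record(source, offset=4, duration=3)

print(r.recognize_google(audio))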
3.2.2 The Effect of Noise on Speech Recognition
All audio recordings contain some level of noise, and unhandled noise can greatly
reduce the accuracy of speech recognition applications.
This file has the phrase "smell during periods" spoken with a loud sound in the
background, so the speech cannot be recognized properly.
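One way to reduce the effect of such noise is the Recognizer's adjust_for_ambient_noise() method, sketched below; jackhammer.wav is a placeholder name for a noisy recording.

# Calibrate the recognizer on the first half second of the noisy file,
# then record and transcribe the rest.
import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile("jackhammer.wav") as source:
    r.adjust_for_ambient_noise(source, duration=0.5)   # calibrate on the first 0.5 s
    audio = r.record(source)

try:
    print(r.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech could not be recognized over the noise")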
PyAudio is installed to access the microphone, which allows the user to perform
real-time speech recognition. With an instance of speech_recognition's Microphone
class, the microphone can be used.
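A sketch of this real-time flow, assuming PyAudio is installed alongside SpeechRecognition:

# Real-time recognition from the default microphone (requires: pip install pyaudio).
import speech_recognition as sr

r = sr.Recognizer()
mic = sr.Microphone()

with mic as source:
    r.adjust_for_ambient_noise(source)   # brief calibration against room noise
    print("Speak now...")
    audio = r.listen(source)             # record until a pause is detected

try:
    print("You said:", r.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("API request failed: {0}".format(e))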
If the user guesses the fruit name correctly, the game announces the win;
otherwise it prints a message to try again if there are any attempts remaining.
3.2.4.1 Working :
This function first checks the correctness of both of its arguments and
raises a TypeError if either of them is invalid.
The listen() method is then used to listen to the input from the microphone.
The first for loop of the program runs for the number of guesses given to the
user. The other for loop, inside the first one, attempts to recognize the
input each time through the recognize_speech_mic() function and stores the
dictionary returned from this function in a variable.
If the system recognizes the word spoken by the user, i.e. the transcription key
is not null, then the user's speech has been transcribed and the inner loop breaks
out. If the speech is not transcribed and an API error occurred, the loop also
breaks out. If the API request was successful but the speech was not recognized,
the else branch is executed, which asks the user to speak the word again.
If the inner loop breaks out without any errors, the returned dictionary is correct;
if an error occurred, the error message is displayed, which ends the program.
If no error occurred when the inner loop broke out, the transcription is compared
to the word selected by the system, and the lower() method is used to convert the
strings to lowercase, which removes wrong answers caused only by differences in
capitalization.
If the user makes a guess that matches the system's word, the user wins the game;
otherwise the outer loop continues based on the attempts left, and if the user fails
on the last attempt, the user loses the game.
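A condensed sketch of the game logic described above. The helper name recognize_speech_mic() and the overall flow follow the description; the fruit list, the number of guesses and the number of re-prompts are illustrative assumptions.

# "Guess the fruit" sketch: the system picks a fruit, the user guesses by voice.
import random
import speech_recognition as sr


def recognize_speech_mic(recognizer, microphone):
    """Listen once and return a dict with 'success', 'error' and 'transcription' keys."""
    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError("`recognizer` must be a Recognizer instance")
    if not isinstance(microphone, sr.Microphone):
        raise TypeError("`microphone` must be a Microphone instance")

    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    response = {"success": True, "error": None, "transcription": None}
    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        response["success"] = False                        # API was unreachable
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        response["error"] = "Unable to recognize speech"   # speech was unintelligible
    return response


if __name__ == "__main__":
    fruits = ["apple", "banana", "mango", "orange", "grape"]   # assumed word list
    num_guesses, num_prompts = 3, 2                            # assumed limits

    recognizer, microphone = sr.Recognizer(), sr.Microphone()
    answer = random.choice(fruits)
    print("Guess the fruit! You have {} attempts.".format(num_guesses))

    for attempt in range(num_guesses):          # outer loop: one pass per guess
        for _ in range(num_prompts):            # inner loop: re-prompt if nothing was heard
            print("Speak your guess:")
            guess = recognize_speech_mic(recognizer, microphone)
            if guess["transcription"] or not guess["success"]:
                break                           # transcribed, or API error: stop prompting
            print("I didn't catch that. Please speak again.")

        if guess["error"]:
            print("ERROR: {}".format(guess["error"]))
            break                               # end the game on any remaining error

        print("You said: {}".format(guess["transcription"]))
        if guess["transcription"].lower() == answer.lower():
            print("You win! The fruit was '{}'.".format(answer))
            break
        if attempt < num_guesses - 1:
            print("Incorrect. Try again.")
        else:
            print("You lose. The fruit was '{}'.".format(answer))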
Output For Guess The Fruit Using Speech Recognition Through Microphone:
3.3 System Development Approach :
(PERFORMANCE ANALYSIS)
1. System Requirements:
1. Minimum Requirements:
a. 1.6 GHz processor
b. 128 MB RAM
c. Microphone for good audio
2. Recommended Requirements:
a. 2.4 GHz processor
b. More than 128 MB RAM
c. 10% memory consumption
d. Best-quality microphone
Sound cards:
A proper sound driver must be installed. Since speech requires low
bandwidth, high-quality sound cards should be used.
Microphones:
Microphones are the most important tools for real-time
speech-to-text conversion. The pre-installed ones therefore
cannot be used, as they are more prone to background
noise and of poorer quality in terms of captured speech.
Computer Processor:
A speech recognition application depends heavily on processing
speed. Handling the input from the user can take some time
if the processing speed is low, so the user spends more time
waiting than performing the task, which makes the application
less feasible to use.
The program imports the speech recognition library, which handles requests
from the user to perform a web search or to search a query on YouTube.
For performing the web search we used the Recognizer class of the speech
recognition package and created three instances of this class:
the first instance is used to recognize text for the YouTube search, the second
instance is used for the web search, and the third instance is used to listen to speech.
We take input from the user's microphone and, based on the words spoken
(e.g. "web search" and "video"), we search the web and YouTube respectively.
This system is designed to recognize speech and also has the capability
to convert speech to text. This software, named 'SPEECH RECOGNITION
SYSTEM', has the capability to write spoken words as text.
import speech_recognition as sr
import webbrowser as wb

# three Recognizer instances: r1 for the YouTube search, r2 for the web search,
# r3 for listening to the initial command
r1 = sr.Recognizer()
r2 = sr.Recognizer()
r3 = sr.Recognizer()

with sr.Microphone() as source:
    print('[search python : search YouTube]')
    print('Speak Now!! \n')
    audio = r3.listen(source)

if 'python' in r2.recognize_google(audio):
    r2 = sr.Recognizer()
    url = 'https://www.edureka.co/'
    with sr.Microphone() as source:
        print('\n search the query \n')
        audio = r2.listen(source)
        try:
            get = r2.recognize_google(audio)
            print(get)
            wb.get().open_new(url + get)
        except sr.UnknownValueError:
            print('Unable to recognize')
        except sr.RequestError as e:
            print('failed: {0}'.format(e))

if 'video' in r1.recognize_google(audio):
    r1 = sr.Recognizer()
    url = 'https://www.youtube.com/results?search_query='
    with sr.Microphone() as source:
        print('\n search the query \n')
        audio = r2.listen(source)
        try:
            get = r1.recognize_google(audio)
            print(get)
            wb.get().open_new(url + get)
        except sr.UnknownValueError:
            print('Unable to recognize')
        except sr.RequestError as e:
            print('failed: {0}'.format(e))
GRAPH 4.1
(Source: Microsoft)
CHAPTER 5
(CONCLUSION)
1. Advantages of Software:
In many areas of the country there are a lot of people who cannot read or
write, so this project is very helpful for such people. In today's world
everybody has a mobile phone and wants to search for a lot of things.
With this project, they simply speak what they want to search, and the
corresponding results open in the browser window.
2. Disadvantages:
1. Low accuracy because of its limited ability.
2. Fails in noisy environments.
3. Depends majorly on the Google API, so it is not an original standalone system.
4. Only limited operations can be performed.
5.3 Conclusion:
BOOKS:
1. G. L. Clapper, "Automatic word recognition", IEEE Spectrum, pp. 57-59, Aug. 1971.
2. M. B. Herscher, "Real-time interactive speech technology at Threshold Technology",
Workshop Voice Technol. Interactive Real Time Command Control Syst. Appl., Dec. 1977.
3. J. W. Gleen, "Template estimation for word recognition", Proc. Conf. Pattern
Recog. Image Processing, pp. 514-516, June 1978.
Internet:
1. https://pypi.org/project/SpeechRecognition/
2. https://www.researchgate.net/publication/337155654_A_Study_on_Automatic_Speech_Recognition
3. https://www.ijedr.org/papers/IJEDR1404035.pdf