Speech Recognition Final Report
A PROJECT REPORT
Submitted by
AASHISH CHAUHAN(20BCS7557)
SIDHARTH KUMAR(20BCS1997)
RAHUL THAKUR(20BCS1975)
UTKARSH SHARMA(20BCS2017)
In partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
Chandigarh University
April 2024
BONAFIDE CERTIFICATE
SUPERVISOR
Associate Professor
Computer Science & Engineering
HEAD OF DEPARTMENT
Computer Science & Engineering
Sr.No. Contents
1. Introduction to Project
2. System Requirement
3. Feasibility Study
4. Requirement Analysis
7. E-R Diagram
8. Database Design
9. User Interface
10. Reports
11. Conclusion
12. System Limitations
13. Enhancement
14. Bibliography
Acknowledgement
Chapter 1
1.1. Introduction
Speech Recognition (SR) is the ability to translate dictation or spoken words to text. Speech recognition is also known as "automatic speech recognition" (ASR) or speech to text (STT).
Speech recognition is the process of converting an acoustic signal, captured by a microphone or another peripheral, into a set of words. To achieve speech understanding we can use linguistic processing. The recognized words can be an end in themselves, as in applications such as command & control, data entry and document preparation.
In society, everyone, whether human or animal, wishes to interact with others and tries to convey their own message. The receiver of a message may get the exact and full idea of the sender, may get only a partial idea, or sometimes may not understand anything at all. In some cases there is a gap in communication (for example, when a child conveys a message, the mother may understand it easily while others cannot).
1.2. Project Overview
After five decades of research, speech recognition technology has finally entered the marketplace, benefiting users in a variety of ways. The challenge of designing a machine that truly functions like an intelligent human remains a major one going forward.
1.3. Abstract
Speech recognition technology is one of the fastest-growing engineering technologies. This project has been designed and developed with that fact in mind, and a small effort has been made to achieve this aim.
It has a number of applications in different areas and provides potential benefits. Nearly 20% of the world's population suffers from various disabilities; many of them are blind or unable to use their hands effectively. A speech recognition system in those particular cases provides significant help, so that they can share information with other people by operating a computer through voice input.
Consider the thousands of people in the world who are unable to use their hands, making typing impossible. Our project is for these people who cannot type or see, and even for those of us who simply do not feel like typing. Our project is capable of recognizing speech and converting the input audio into text; it also enables a user to perform operations such as open, close, exit, read, ... on a program, application or file by providing voice input, for example opening a word processor, Google Chrome, Notepad or the calculator.
Our project is also capable of reading aloud text written by anyone, or text entered by the user himself.
Continuous speech: when the user speaks in a more normal, fluid manner without having to pause between words, this is referred to as continuous speech.
Discrete speech: when the user speaks with a pause between each word, such speech is referred to as discrete speech.
1. System Requirements:
Figure: system components - Language Model, Display, Speech Engine.
Despite all these advantages and benefits, a hundred-percent-perfect speech recognition system has yet to be developed. There are a number of factors that can reduce the accuracy and performance of a speech recognition program.
The speech recognition process is easy for a human but a difficult task for a machine; compared with the human mind, speech recognition programs seem less intelligent. This is because the human mind's capability of thinking, understanding and reacting is natural, while for a computer program it is a complicated task: first it needs to understand the spoken words with respect to their meanings, and it has to create a sufficient balance between the words, noise and pauses. A human has a built-in capability of filtering noise from speech, while a machine requires training; the computer requires help in separating the speech sounds from the other sounds.
1.4. Factors affecting speech recognition:
1.4.1. Homonyms: words that are spelled differently and have different meanings but sound the same, for example "there" and "their", "be" and "bee", "cool" and "coal". It is a challenge for a computer to distinguish between such phrases that sound alike.
1.4.2. Overlapping speech: a second challenge in the process is understanding speech uttered by different users at the same time; current systems have difficulty separating simultaneous speech from multiple users.
1.4.3. Noise factor: the program requires that the words uttered by a human be heard distinctly and clearly. Any extra sound can create interference, so the system first needs to be placed away from noisy environments, and the user must then speak clearly; otherwise the machine will get confused and mix up the words.
1.5. The future of speech recognition:
Dictation speech recognition will gradually become accepted.
Accuracy will become better and better.
Microphone and sound systems will be designed to adapt more quickly to changing background noise levels and different environments, with better recognition of extraneous material so that it can be discarded.
Greater use will be made of “intelligent systems” which will
attempt to guess what the speaker intended to say, rather than what
was actually said, as people often misspeak and make unintentional
mistakes.
Methodology
As an emerging technology , not all developers are familiar with speech recognition
technology . While the basic functions of both speech synthesis and speech
recognition takes only
few minutes to understand (after all, most people learn to speak and listen by
age two), there are subtle and powerful capabilities provided by computerized
speech that
developers will want to understand and utilize.
An understanding of the capabilities and limitations of speech technology is also
important for developers in making decisions about whether a particular applications
will benefit from the use of speech input and output.
System Requirements:
CPU:
Our application depends on the efficiency of the CPU (central processing unit), because a large amount of digital filtering and signal processing takes place in ASR (Automatic Speech Recognition).
Chapter 3
3. Feasibility Study:
Through these studies, conclusions and proposals for the project were obtained:
RAM: 2 GB (minimum), 4 GB (recommended)
3.1.2. Human Components
Programmers, analysts, designers, etc.
Visual Studio 2015: for building the project, creating all the Windows Forms applications and designing the interfaces.
MySQL: for managing the database (creating tables, storing the data).
Word processor: for writing the project report.
Programming language:
The programming language is C Sharp (C#). It is easy to learn, is used to create Windows Forms applications, and is a well-known, high-level programming language.
Microsoft Speech SDK is one of the many tools that enable a developer to add speech capability to an application.
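As a quick illustration of how little code a basic speech-to-text loop needs, here is a minimal sketch (VB.NET, matching the listings in section 9). It assumes the managed System.Speech.Recognition API and a default microphone, and is not part of the project's own code.
Imports System
Imports System.Speech.Recognition

Module DictationSketch
    Sub Main()
        Using recognizer As New SpeechRecognitionEngine()
            ' Free-form dictation: any spoken phrase is converted to text.
            recognizer.LoadGrammar(New DictationGrammar())
            recognizer.SetInputToDefaultAudioDevice()
            AddHandler recognizer.SpeechRecognized, Sub(s, ev) Console.WriteLine(ev.Result.Text)
            recognizer.RecognizeAsync(RecognizeMode.Multiple)
            Console.ReadLine()   ' keep listening until Enter is pressed
        End Using
    End Sub
End Module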
Costs:
These are the costs that will be spent by the team to complete the project, as we will discuss.
Profits:
The profits that the team can achieve after implementing the project: in the beginning it will be a trial version that any user can use for free. After the version is improved there will be a product key, without which no one will be able to use the application; the application will be sold to users, and every version will have a different product key.
3.4. Operational Feasibility
3.4.1. The performance (Throughput):
There are two goals: increasing recognition throughput in batch processing of speech data, and reducing recognition latency in real-time usage scenarios.
Improve throughput: allow batch processing of the speech recognition task to execute as efficiently as possible, thereby increasing its utility for multimedia search and retrieval.
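As an illustration of the batch scenario, the recognizer can read pre-recorded WAV files instead of the microphone and process them as fast as the CPU allows. This is only a sketch and assumes the System.Speech.Recognition API; the file names are hypothetical.
Imports System
Imports System.Speech.Recognition

Module BatchSketch
    Sub Main()
        Dim waveFiles() As String = {"call1.wav", "call2.wav"}   ' hypothetical inputs
        Using recognizer As New SpeechRecognitionEngine()
            recognizer.LoadGrammar(New DictationGrammar())
            For Each wav As String In waveFiles
                recognizer.SetInputToWaveFile(wav)     ' read from file, not microphone
                ' Recognize() returns Nothing once the file is exhausted.
                Dim result As RecognitionResult = recognizer.Recognize()
                Do While result IsNot Nothing
                    Console.WriteLine(wav & ": " & result.Text)
                    result = recognizer.Recognize()
                Loop
            Next
        End Using
    End Sub
End Module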
3.4.2. Information
With the help of a microphone, audio is input to the system; the PC sound card produces the equivalent digital representation of the received audio.
Analyze
- Identify opportunities for speech & outline project strategy
- Review business requirements & processes
- Review of existing IVR
- Interview Subject Matter Experts
- Voice User Interface & Technical requirements (use-case scenarios)
- Define success criteria
- Map out solution
- Client review & sign-off on requirements
Prosody Analysis
When one thinks about speaking to computers, the first image is usually speech
recognition, the conversion of an acoustic signal to a stream of words. After
many years of research, speech recognition technology is beginning to pass the
threshold of practicality. The last decade has witnessed dramatic improvement
in speech recognition technology, to the extent that high performance
algorithms and systems are becoming available.
A wide variety of techniques at different levels is used to perform speech recognition. The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application can understand. The application can work in two different modes: command-and-control mode, sometimes referred to as voice navigation, and dictation mode.
In command-and-control mode the application interprets the result of the recognition as a command. This mode offers developers the easiest implementation of a speech interface in an existing application. In this mode the grammar (or list of recognized words) can be limited to the list of available commands, which provides better accuracy and performance and reduces the processing overhead required by the application. An example of a command-and-control application is one in which the caller says "open file" and the application asks for the name of the file to be opened.
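A minimal sketch of this mode is shown below (VB.NET, matching the listings in section 9); the command words are illustrative and are not the project's actual grammar.
Imports System
Imports System.Speech.Recognition

Module CommandSketch
    Sub Main()
        Using recognizer As New SpeechRecognitionEngine()
            ' The grammar is limited to a short list of commands, which keeps
            ' accuracy high and processing overhead low.
            Dim commands As New Choices("open", "close", "save", "read", "exit")
            recognizer.LoadGrammar(New Grammar(New GrammarBuilder(commands)))
            recognizer.SetInputToDefaultAudioDevice()
            AddHandler recognizer.SpeechRecognized, Sub(s, ev) Console.WriteLine("Command heard: " & ev.Result.Text)
            recognizer.RecognizeAsync(RecognizeMode.Multiple)
            Console.ReadLine()
        End Using
    End Sub
End Module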
Figure (recognition block diagram): speech input is converted to feature vectors, matched against the reference model, and produces the recognition result.
11) Use case diagram:
Here is the part where basic to intermediate experience with MS Access comes in handy, because we will not go into every detail of this process. I generally use MS Access 2007, so my instructions will be geared toward that version. To begin, select the Windows icon in the upper left and then click on New. Name the database whatever you would like; for this tutorial I will be naming mine VR.accdb. Then create a table that your project can use. I named mine CustomCommands and included the following fields:
ID
CommonField
Command
Result
Save the database so that we can use it during the next step.
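The project code in section 9 fills its grid from this table with an OleDbDataAdapter. As a standalone sketch, the snippet below assumes the Microsoft ACE OLE DB provider that ships with Access 2007 and later, and a VR.accdb file in the working directory; adjust the path to your own database.
Imports System
Imports System.Data
Imports System.Data.OleDb

Module CommandStoreSketch
    Sub Main()
        Dim connStr As String = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=VR.accdb"
        Using conn As New OleDbConnection(connStr)
            ' Read the whole CustomCommands table into memory.
            Dim adapter As New OleDbDataAdapter("SELECT ID, CommonField, [Command], [Result] FROM CustomCommands", conn)
            Dim table As New DataTable("CustomCommands")
            adapter.Fill(table)   ' Fill opens and closes the connection itself
            For Each row As DataRow In table.Rows
                Console.WriteLine(row("Command").ToString() & " -> " & row("Result").ToString())
            Next
        End Using
    End Sub
End Module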
After you have connected your program to the MS Access database you have created, we will add that database to our program forms. You will see that you now have your DataSet under your Data Sources.
This is where you can click and hold a given field, such as "Command" or "Result", and drag it onto your forms. Just make sure the field is set to TextBox, and you should have a form that looks something like this:
The "Common Field" field is what the computer will speak to you, the "Command"
field is what you would speak to the computer and "Result" field is the program that
would be executed. To help me mentally keep these straight I will rename my labels
to that effect. We just walked through connecting a database to only one of our
forms. You can follow the appropriate steps listed in this section to connect this
same database to your second form.
On our main form we will not need a data grid view; however, on our second form we will. Having the grid is not strictly necessary for the operation of the program, but it does help you keep things organized as you manage your commands.
Under the Data Sources explorer select CustomCommands and from the dropdown
menu select DataGridView. Then simply grab CustomCommands with your mouse
and drag it onto your form. You can arrange your data grid view however you would
like from there.
Now, we can create all of the buttons and text boxes we will need on our forms.
We'll just do this in steps to keep things simple.
9. User Interface
9.1. Code for the main window and movement between the windows by voice
Option Strict On
Imports System.Speech.Recognition

Public Class MainForm
    ' SkinOb (DMSoft skin object), SpeechEngine, recognizer and OutputListBox are
    ' form-level fields; the form and handler names are reconstructed, since only
    ' the body of the load routine appeared in the original listing.

    Private Sub MainForm_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        SkinOb.LoadSkinFromFile("C:\Users\fathail\Desktop\vb\project\A_67.skf")
        SkinOb.ApplySkin()
        Me.Text = "Speech recognition, version: " & My.Application.Info.Version.ToString()

        ' List box that displays the recognized text.
        Controls.Add(OutputListBox)

        ' Free-form dictation on the default microphone.
        SpeechEngine.LoadGrammar(New DictationGrammar())
        SpeechEngine.SetInputToDefaultAudioDevice()
        SpeechEngine.RecognizeAsync(RecognizeMode.Multiple)

        ' Second engine used later for the command grammars.
        recognizer = New SpeechRecognitionEngine()
        recognizer.SetInputToDefaultAudioDevice()
    End Sub
End Class
9.2. Add Commands
Imports System.Data.OleDb
Imports System.Drawing.Drawing2D
Imports System.ComponentModel
Imports DMSoft

Public Class AddCommandsForm
    ' Conn (OleDbConnection), SQLstr (SELECT statement), DataSet1, SkinOb and the
    ' form controls are declared elsewhere on this form; the form and handler
    ' names are reconstructed, since only the bodies appeared in the original listing.

    Private Sub AddCommandsForm_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        ' Apply the same skin as the main window.
        SkinOb.LoadSkinFromFile("C:\Users\fathail\Desktop\vb\project\A_67.skf")
        SkinOb.ApplySkin()

        ' Fill the grid with the CustomCommands table from the database.
        Try
            Conn.Open()
            Dim DataAdapter1 As New OleDbDataAdapter(SQLstr, Conn)
            DataAdapter1.Fill(DataSet1, "CustomCommands")
            dataGridView1.DataSource = DataSet1
            dataGridView1.DataMember = "CustomCommands"
            dataGridView1.Refresh()
            Conn.Close()
        Catch e1 As Exception
            Console.WriteLine(e1)
        End Try
    End Sub

    ' Validation fragment from the save handler: all fields must be filled
    ' before a new command is written to the database.
    '     Else
    '         MsgBox("Please enter field data", MsgBoxStyle.Critical, "Wrong data entered")
    '         Exit Sub
    '     End If
End Class
    ' The control names below are as they appear in the original listing; the
    ' enclosing method declarations and If-conditions did not survive extraction
    ' and are reconstructed here as illustrative placeholders.

    Private Sub OptWebsite_CheckedChanged(sender As Object, e As EventArgs)
        If OptWebsite.Checked Then
            btnOpen.Visible = False
            text2.Text = "http://www."
            text2.Focus()
        End If
    End Sub

    Private Sub OptProgram_CheckedChanged(sender As Object, e As EventArgs)
        If OptProgram.Checked Then
            btnOpen.Visible = True
            text2.Text = ""
        End If
    End Sub

    ' Start dictation and load every saved command from the command file as its
    ' own one-phrase grammar (ReadLines is a StreamReader opened elsewhere).
    Private Sub StartRecognition()
        Controls.Add(OutputListBox)
        SpeechEngine.LoadGrammar(New DictationGrammar())
        SpeechEngine.SetInputToDefaultAudioDevice()
        SpeechEngine.RecognizeAsync(RecognizeMode.Multiple)

        Do Until ReadLines.EndOfStream
            Dim NewGrammar As New Grammar(New GrammarBuilder(New Choices(ReadLines.ReadLine())))
            recognizer.LoadGrammarAsync(NewGrammar)
        Loop
        ReadLines.Close()
        recognizer.RecognizeAsync(RecognizeMode.Multiple)
    End Sub
    ' Fragment of the voice-command handler: "a" is a SpeechSynthesizer declared
    ' on the form, and the Select Case runs on the recognized text (the opening
    ' lines of the handler were not reproduced in the original listing).
            Case Is = "RESTART"
                a.Speak("restart")
                System.Diagnostics.Process.Start("shutdown", "-r")

            Case Is = "WEATHER"   ' case label reconstructed; only its body appeared in the original
                System.Diagnostics.Process.Start("https://www.google.com/webhp?sourceid=chrome-instant&ion=1&ie=UTF-8#output=search&sclient=psy-ab&q=weather&oq=&gs_l=&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47008514,d.eWU&fp=6c7f8a5fed4db490&biw=1366&bih=643&ion=1&pf=p&pdl=300")
                a.Speak("Searching for local weather")

            Case Is = "HELLO"
                a.Speak("Hello sir")

            Case Is = "GOODBYE"
                a.Speak("Until next time")
                Me.Close()

            Case Is = "OPEN DISK DRIVE"
                a.Speak("It's now open")
                ' Eject the tray of the second drive through the Windows Media Player COM object.
                Dim oWMP = CreateObject("WMPlayer.OCX.7")
                Dim CDROM = oWMP.cdromCollection
                If CDROM.Count = 2 Then
                    CDROM.Item(1).Eject()
                End If
        End Select
    End If
End Sub
    Private Sub recognizer_LoadGrammarCompleted(sender As Object, e As LoadGrammarCompletedEventArgs) Handles recognizer.LoadGrammarCompleted
        ' Report whether the grammar finished loading without an error.
        Label1.Text = "Grammar " & e.Grammar.Name & " " & If(e.Error Is Nothing, "Is", "Is Not") & " loaded."
    End Sub

    ' The two handlers below are reconstructed around bodies that appeared
    ' without their declarations in the original listing.
    Private Sub recognizer_SpeechRecognized(sender As Object, e As SpeechRecognizedEventArgs) Handles recognizer.SpeechRecognized
        Label2.Text = "Grammar " & e.Result.Grammar.Name & " " & e.Result.Text
    End Sub

    Private Sub recognizer_SpeechRecognitionRejected(sender As Object, e As SpeechRecognitionRejectedEventArgs) Handles recognizer.SpeechRecognitionRejected
        OutputListBox.BackColor = Color.Red
    End Sub
End Class
Imports System.Speech
Imports System.Speech.Recognition
Imports System.Speech.Recognition.SrgsGrammar

Public Class SrgsColorForm
    Private reco As New SpeechRecognitionEngine()

    Private Sub SrgsColorForm_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        Try
            ' Build an SRGS grammar whose root rule is a one-of list of colours.
            ' The declarations and colour names are reconstructed placeholders,
            ' since only part of this routine appeared in the original listing.
            Dim gram As New SrgsDocument()
            Dim colorRule As New SrgsRule("color")
            Dim colorsList As New SrgsOneOf("red", "green", "blue")
            colorRule.Add(colorsList)
            gram.Rules.Add(colorRule)
            gram.Root = colorRule
            reco.LoadGrammarAsync(New Grammar(gram))
            reco.SetInputToDefaultAudioDevice()
            reco.RecognizeAsync(RecognizeMode.Multiple)
        Catch s As Exception
            MessageBox.Show(s.Message)
        End Try
    End Sub
End Class
10. Report
Storage of speech files and their features in a traditional flat-file format
The process of data storage in a traditional flat-file format involves two or more types of files. Each prompted utterance is stored in a separate file in any valid audio file format. The stored speech file for each utterance is processed with speech processing tools (i.e. software), the corresponding features for each utterance are extracted, and the processed outcome is stored in another flat file that accompanies each utterance file. For the storage of the features we may use several different approaches, as follows:
(1) One may keep a separate file for each feature, i.e. one file for the pitch of all the utterances, one file for the frequency of all the utterances, and so on. In each feature file, each row represents a different utterance. The affiliation of each row with the corresponding utterance must be determined in advance, and for every feature file this affiliation remains the same. Suppose there are 36 utterances and 10 features; then there are 46 files.
(2) All the features (i.e. pitch, frequency and so on) for one utterance are stored in one file; for the second utterance all the features are again stored in another file, and so on. In this approach every file is named in such a way that it matches the accompanying utterance. Suppose there are 36 utterances; then there are 72 files (36 for the utterances and 36 for the features of the accompanying utterances).
(3) All the features (i.e. pitch, frequency and so on) for one utterance are stored in one line of the flat file, separated by either a comma or a space; the second line again stores the same features for the second utterance, and so on. In this file format each column represents the same feature for all the utterances and each row represents the different features of one utterance. The affiliation of each feature with a column and of each utterance with a row must be determined in advance. In this approach, if there are 36 utterances then there are 37 files.
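As a small sketch of approach (3), the code below writes one line per utterance with comma-separated feature values; the feature values and the file name are placeholders, not data from this project.
Imports System
Imports System.IO

Module FeatureFileSketch
    Sub Main()
        ' features(utterance)(featureIndex): e.g. pitch, energy, duration ...
        Dim features()() As Double = {
            New Double() {182.4, 0.71, 0.95},
            New Double() {175.9, 0.66, 1.1}
        }
        Using writer As New StreamWriter("features.txt")
            For Each utterance As Double() In features
                writer.WriteLine(String.Join(",", utterance))   ' one row per utterance
            Next
        End Using
    End Sub
End Module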
11. Conclusion
This project work on speech recognition started with a brief introduction to the application and to the technology on the computer (desktop applications). The project is able to write text through both keyboard and voice, provides speech recognition of different Notepad commands such as open, save, select, copy, clear and close, and opens different Windows software depending on the voice input.
One challenge is to develop ways in which our knowledge of the speech
signal, and of speech production and perception, can be incorporated more
effectively into recognition methods. For example, the fact that speakers have
different vocal tract lengths could be used to develop more compact models for
improved speaker-independent recognition.
12. System limitations
A speech signal is a highly redundant non-stationary signal. These attributes
make this signal very challenging to characterise. It should be possible to
recognize speech directly from the digitized waveform. However, because of the
large variability of the speech signal, it is a good idea to perform some form of
feature extraction that would reduce that variability. Applications that need voice
processing (such as coding, synthesis, recognition) require specific representations
of speech information. For instance, the main requirement for speech recognition is
the extraction of voice features, which may distinguish different phonemes of a
language. From a statistical point of view, this procedure is equivalent to finding a sufficient statistic for estimating phonemes. Other information not required for this aim, such as the dimensions of the phonatory apparatus (which are speaker dependent), the speaker's mood, sex, age, dialect inflexions, background noise and so on, should be overlooked. To decrease vocal message ambiguity, speech is therefore filtered
before it arrives at the automatic recognizer. Hence, the filtering procedure can be
considered as the first stage of speech analysis. Filtering is performed on discrete
time quantized speech signals. Hence, the first procedure consists of analog to
digital signal conversion. Then, the extraction procedure of the significant features
of speech signal is performed.
When captured by a microphone, speech signals are seriously distorted by
background noise and reverberation. Fundamentally speech is made up of
discrete units. The units can be a word, a syllable or a phoneme. Each stored
unit of speech includes details of the characteristics that differentiate it from
the others. Apart from the message content, the speech signal also carries
variability such as speaker characteristics, emotions and background noise.
Speech recognition must deal with differences in accents, dialects, age, gender, emotional state, rate of speech and environmental noise. According to Rosen
(1992), the temporal features of speech signals can be partitioned into three
categories, i.e., envelope (2-50 Hz), periodicity (50-500 Hz), and fine structure
(500-10000 Hz). A method of generating feature signals from speech signals
comprises the following steps (a small sketch of the framing step follows this list):
Receive the speech signal.
Block the speech signal into frames.
Form frequency-domain representations of the blocked speech frames.
Pass these frequency-domain representations through mel-filter banks.
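As a small illustration of the framing step only, the sketch below splits a sampled signal into overlapping frames; the 25 ms frame length and 10 ms step are typical values assumed for the example, not figures taken from this report. Each frame would then be windowed, transformed to the frequency domain with an FFT, and passed through filters spaced on the mel scale, mel(f) = 2595 * log10(1 + f / 700).
Imports System
Imports System.Collections.Generic

Module FramingSketch
    ' Split a sampled signal into overlapping frames.
    Function BlockIntoFrames(samples As Double(), frameLength As Integer, stepLength As Integer) As List(Of Double())
        Dim frames As New List(Of Double())
        Dim start As Integer = 0
        Do While start + frameLength <= samples.Length
            Dim frame(frameLength - 1) As Double
            Array.Copy(samples, start, frame, 0, frameLength)
            frames.Add(frame)
            start += stepLength
        Loop
        Return frames
    End Function

    Sub Main()
        Dim signal(15999) As Double                          ' one second of silence at 16 kHz
        Dim frames = BlockIntoFrames(signal, 400, 160)       ' 25 ms frames, 10 ms step
        Console.WriteLine(frames.Count & " frames")
    End Sub
End Module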
During speech, altering the size and shape of the vocal tract, mostly by moving the tongue, results in frequency and intensity changes that emphasize some harmonics and suppress others. The resulting waveform has a series of peaks and valleys. Each of these peaks is called a formant, and it is the manipulation of formant frequencies that facilitates the recognition of different vowel sounds. Speech has a number of features that need to be taken into account. A combination of linear predictive coding and cepstral recursion analysis is performed on the blocked speech signals to produce the various features of the signal.
13. Enhancement
Thus the goal of speech enhancement is to find an optimal estimate (i.e., preferred
by a human listener), given a noisy measurement. The relative unimportance of
phase for speech quality has given rise to a family of speech enhancement algorithms
based on spectral magnitude estimation. These are frequency-domain estimators in which an estimate of the clean-speech spectral magnitude is recombined with the noisy phase before resynthesis with a standard overlap-add procedure.
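One widely used member of this family (offered here only as an illustration; the report does not name a specific estimator) is spectral subtraction, in which a noise magnitude spectrum estimated during speech pauses is subtracted from the noisy magnitude spectrum:
    |S_est(w)| = max(|Y(w)| - |N_est(w)|, 0)
where Y(w) is the noisy spectrum and N_est(w) is the noise estimate; the resulting magnitude |S_est(w)| is then recombined with the noisy phase and resynthesized with the overlap-add procedure mentioned above.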
[5] http://www.abilityhub.com/speech/speech-description.htm
[6] Charu Joshi “Speech Recognition”
Source: http://www.scribd.com/doc/2586608/speechrecognition.pdf