Speech Recognition System: A Project Report Submitted by
to the
Bachelor of Science
of the
UNIVERSITY OF PERADENIYA
SRI LANKA
2015
CS304 – Project Report Speech Recognition System
Declaration
I hereby declare that the project work entitled “Speech Recognition System” submitted
to the University of Peradeniya, is a record of an original work done by me under the
guidance of ………………………………………………………., Staff Member, Department of
Statistics and Computer Science, Faculty of Science, University of Peradeniya, and this
project work has not formed the basis for the award of any Degree or
diploma/fellowship and similar project if any.
Certified By:
Supervisor:……………………….
Signature:…………………………
Date:………………………………..
Acknowledgement
I express my sincere gratitude to the staff members of the Department of Statistics and
Computer Science, Faculty of Science, University of Peradeniya, for their support and
guidance in successfully completing this project.
Date: 30/11/2015
Name: M.F.Ahmed Shariff
Abstract
The Speech Recognition System documented in this report uses CMUsphinx as the base API to
obtain speech recognition results and is implemented in Java. The primary goal of the
system is to let the user define how speech is recognized (by providing models for the
recognizer), how the speech result is processed, and which functions are executed as a
consequence. The user provides these details in the form of plugins, which are classes
that implement the provided interfaces and are packed in a jar file. The classes to be
loaded as modules must be listed in the configuration file. Using the provided interfaces
and the plugin system, a user can implement a broad range of functions with little effort.
Contents
1. Introduction
2. Software requirements and specifications
2.1. Product perspective
2.1.1. Use Cases
2.1.1.1. Use case diagram
2.1.1.2. USE CASE: Recognize Speech
2.1.1.3. USE CASE: Process speech
2.1.1.4. USE CASE: Execute function
2.1.1.5. USE CASE: Provide details
2.1.1.6. USE CASE: Provide recognition details
2.1.1.7. USE CASE: Provide process details
2.1.1.8. USE CASE: Provide execution details
2.1.2. Class diagram
2.2. User Characteristics
2.3. Specific Requirements
2.3.1. Functional Requirements
2.3.2. External Interfaces
2.4. Performance Requirements
2.5. Design Constraints
3. Design Strategy
4. Project plan
4.1. The Engines
4.2. The plugin modules
4.3. The Recognizer Engine
4.4. The Response Engine
4.5. The System Engine
4.6. The System as a whole
5. Future work
6. Conclusion
7. References
List of Figures
Figure 2.1.1.1.1 - Use case diagram
Figure 2.1.2.1 - Class diagram (1)
Figure 2.1.2.2 - Class diagram (2)
Figure 2.1.2.3 - Class diagram (3)
Figure 2.1.2.4 - Class diagram (4)
Figure 2.1.2.5 - Class diagram (5)
Figure 2.3.2.1 - The Output window
Figure 2.3.2.2 - System Tray icon and popup menu
Figure 2.3.2.3 - The System Console window
Figure 4.3.1 - Recognizer Engine's process cycle
Figure 4.4.1 - Response Engine's process cycle
Figure 4.4.2 - Response Engine Processor's process
Figure 4.5.1 - The System Engine's process cycle
Figure 4.6.1 - Simplified model of the speech recognition system
1. Introduction
Today many technologies allow us to communicate with machines in the natural human form,
speech. Yet the traditional means of communicating with a machine or computer, such as
switches and keyboards, still dominate because of the complexity of implementing a
successful speech recognition system with a broad range of functionality. Recognizing
speech successfully is one part of the problem; the other is being able to do something
with what is said. Few systems offer speech recognition with the flexibility that a mouse
or a keyboard has when interacting with a computer; that is, what we can do using our
voice alone tends to be somewhat limited. The system designed for this project is an
attempt to provide a system that can be easily adapted to broaden the range of
functionality of a speech recognition system.
To elaborate, suppose a system is designed simply to type what is being said, which can
be done using the existing APIs. If the user wants to use it for another purpose, such as
giving commands to the computer, adapting such a system to suit the latter need would be
a tedious task. If the user is even more ambitious and wants to automate functions around
the house, such as switching on lights or other appliances, the adaptation process
becomes even more complex.
The system designed here provides an interface for easily building a system that does
what the user wants. The systems a user develops, which are essentially simple
instructions, can be executed by simply providing them as plugins to the system. That is,
the user provides details to the system in the form of plugins: the context in which
speech must be recognized (in other words, exactly which set of words the recognizer
should look for), how the recognized words should be processed, and what should be done
with the processed results.
The speech recognition API used in this project is CMUsphinx, an open source speech
recognition API developed by Carnegie Mellon University and one of the leading open
source speech recognizers available today. The system is implemented entirely in Java.
Pre-conditions:
The details for processing speech are provided.
Success Guarantees:
A processed speech result is obtained.
Main Success Scenario:
1. The recognized words are obtained (includes the use case Recognize speech).
2. The details regarding processing the words are obtained.
3. The recognized words are processed based on the details.
4. The processed speech result is output.
Figure 2.1.2.1 - Class diagram (1) - The classes 'Response' and 'ResponseEngineProcess' are described in Figure 2.1.2.2 - Class
diagram (2) and Figure 2.1.2.3 - Class diagram (3) respectively. The classes ‘Queue’ and ‘PrintWriter’ are from the Java API.
Figure 2.1.2.2 - Class diagram (2) - The interface ‘ResponseEngineInterface’ is described in Figure 2.1.2.3 - Class diagram
(3). The classes ‘Configuration’ and ‘LiveSpeechRecognizer’ are from the sphinx API, and the interface ‘Runnable’ is from
the Java API.
Figure 2.1.2.3 - Class diagram (3) - The interfaces ‘ModuleSet’ and ‘SystemEngineInterface’ are described
in Figure 2.1.2.5 - Class diagram (5) and Figure 2.1.2.4 - Class diagram (4) respectively. ‘BlockingQueue’,
‘Runnable’ and ‘List’ are from the Java API.
Figure 2.1.2.4 - Class diagram (4) - The interface ‘ModuleSet’ is described in Figure 2.1.2.5 - Class diagram
(5). ‘BlockingQueue’, ‘Runnable’ and ‘PrintWriter’ are from the Java API.
Figure 2.1.2.5 - Class diagram (5) - ‘Map’, ‘List’ and ‘NodeList’ are from the Java API.
Figure 2.3.2.1 - The Output window
Figure 2.3.2.2 - System Tray icon and popup menu
The next option in the system tray popup menu pauses the system. When it is selected, the
system stops after the next speech result is obtained. The reason it cannot stop
instantaneously is addressed in section 2.5. When the option is deselected, the system
resumes from where it stopped.
When exit is selected from this menu, the speech recognition system exits.
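The delayed pause described above can be pictured as a flag that the recognition loop consults only between results. The following is a minimal sketch under stated assumptions: the class `PausableLoop`, its methods and the use of an `Iterator` as a stand-in for the blocking recognizer call are all hypothetical and are not taken from the system itself. It also assumes that the result already in flight when pause is selected is still delivered.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the recognition loop; the names are not from the report.
class PausableLoop {
    // Written by the tray menu thread, read by the recognition thread.
    private volatile boolean paused;

    void setPaused(boolean value) {
        paused = value;
    }

    // Pulls results until paused or exhausted. The flag is checked only between
    // results because results.next() stands for a blocking recognition call that
    // cannot be abandoned halfway.
    List<String> drain(Iterator<String> results) {
        List<String> delivered = new ArrayList<>();
        while (!paused && results.hasNext()) {
            delivered.add(results.next());
        }
        return delivered;
    }
}
```

Calling `drain` again after `setPaused(false)` continues with the next pending result, which corresponds to resuming from where the system stopped.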
3. Design Strategy
The system is implemented in Java because Java is platform independent. This makes the
system portable and lets users implement what they want without worrying about the
platform it runs on. The primary focus of the system is to provide an interface that any
user can use to implement their own system using speech recognition. For this purpose, a
plugin system is implemented. The necessary interfaces are provided; users implement the
interfaces they need to accomplish their task and place the implementations, packed as
jar files, in a predefined folder. The plugins do not depend on any component of the
primary system, which ensures that a user cannot alter the core functions by providing an
illegal instruction.
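The plugin mechanism described above can be sketched as reflective loading of classes by name. This is only an illustration: the interface `SpeechPlugin`, the example class `EchoPlugin` and the helper `PluginLoader` are hypothetical names, and the interfaces actually provided by the system are not reproduced here. For classes packed in a jar, a `java.net.URLClassLoader` built over the jar's URL could be passed as the `loader` argument.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract; the report's real interfaces are not shown here.
interface SpeechPlugin {
    String name();
}

// Example plugin of the kind a user might pack into a jar.
class EchoPlugin implements SpeechPlugin {
    @Override
    public String name() {
        return "echo";
    }
}

class PluginLoader {
    // Instantiates each named class and keeps those that implement SpeechPlugin.
    static List<SpeechPlugin> load(List<String> classNames, ClassLoader loader) {
        List<SpeechPlugin> plugins = new ArrayList<>();
        for (String className : classNames) {
            try {
                Class<?> type = Class.forName(className, true, loader);
                if (SpeechPlugin.class.isAssignableFrom(type)) {
                    plugins.add((SpeechPlugin) type.getDeclaredConstructor().newInstance());
                }
            } catch (ReflectiveOperationException e) {
                // A missing or broken plugin is skipped so it cannot stop the core system.
                System.err.println("Skipping " + className + ": " + e);
            }
        }
        return plugins;
    }
}
```

Because the loader only sees the plugin through the interface, a plugin has no handle on the core classes, which is in the spirit of the isolation described above.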
The user can provide three types of details:
1. How speech is recognized (the context in which speech is recognized)
A recognizer needs three important components other than the audio stream.
I. Acoustic model: The model that represents the relationship between an audio
signal and the corresponding linguistic features or phonemes.
II. Dictionary model: A list of the words that the speech recognizer will recognize,
together with the phonemes or linguistic features of each word.
III. Language/grammar model: A mapping of the order in which words are expected to be
spoken.
The user can provide the models they want to use for speech recognition and include the
paths of these resources in the plugins; the system then loads the models and uses them
in the recognition process.
2. How the recognition result is processed.
In this phase the user can instruct the system on what should be done with the speech
result obtained from the recognizer. For example, the user may filter the result further
so that the request is easier to process in the subsequent steps, instruct the system to
switch the models used in the recognizer, or simply pass the speech result on to the
next phase.
3. What is to be done with the result that was processed.
Here the user can provide information on what the system must do with the result. This
can be virtually anything that can be done programmatically.
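The three recognizer inputs listed under item 1 could be carried by a plugin as a small immutable value holding one path per model. The sketch below assumes this shape for illustration only; `RecognizerModels` and its fields are hypothetical and do not correspond to the Acoustic, Dictionary and Language modules of the system, whose signatures are not reproduced in this report.

```java
import java.util.Objects;

// Hypothetical value object for the three recognizer inputs; not a class from the report.
final class RecognizerModels {
    final String acousticModelPath;
    final String dictionaryPath;
    final String languageModelPath;

    RecognizerModels(String acousticModelPath, String dictionaryPath, String languageModelPath) {
        this.acousticModelPath = Objects.requireNonNull(acousticModelPath, "acoustic model");
        this.dictionaryPath = Objects.requireNonNull(dictionaryPath, "dictionary");
        this.languageModelPath = Objects.requireNonNull(languageModelPath, "language model");
    }

    // Returns a copy with only the language model replaced, as a "switch model"
    // request from the processing phase might ask for.
    RecognizerModels withLanguageModel(String newPath) {
        return new RecognizerModels(acousticModelPath, dictionaryPath, newPath);
    }
}
```

In the sphinx API, paths of this kind are what the `Configuration` class accepts through `setAcousticModelPath`, `setDictionaryPath` and `setLanguageModelPath`; that wiring is left out here so that the sketch stays self-contained.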
To handle the three types of details a user may provide, three engines are used, each
handling a different type of detail. For the processing part, the user is provided with
a set of instructions that the system can execute. For the engines to communicate with
each other, the speech result is wrapped in an object to which each engine can add
further information.
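The wrapper object mentioned above can be sketched as the recognized text plus an open set of annotations. This is an assumption about its shape: the class `SpeechResponse`, the method `annotate` and the attribute keys are hypothetical, and the system's own 'Response' class appears only in the class diagrams.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper; not the report's 'Response' class.
class SpeechResponse {
    private final String hypothesis;
    private final Map<String, Object> attributes = new LinkedHashMap<>();

    SpeechResponse(String hypothesis) {
        this.hypothesis = hypothesis;
    }

    String hypothesis() {
        return hypothesis;
    }

    // Each engine attaches what the next stage needs without altering the recognized text.
    SpeechResponse annotate(String key, Object value) {
        attributes.put(key, value);
        return this;
    }

    // Read-only view for later stages.
    Map<String, Object> attributes() {
        return Collections.unmodifiableMap(attributes);
    }
}
```

A later engine would read the annotations, for example the name of the function to execute, instead of parsing the recognized text again.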
4. Project plan
4.1. The Engines
The system primarily has three engines.
1. Recognizer Engine
2. Response Engine
3. System Engine
1. Recognizer Engine
The Recognizer Engine’s responsibility is to obtain the speech result from the recognizer
and to switch the models used when requested to do so. The Recognizer Engine obtains
the result, passes it to the Response Engine, and waits for the Response Engine to signal
it either to proceed or to switch the models it is using.
2. Response Engine
The Response Engine is responsible for deciding what has to be done with the speech
result obtained, and what is to be done after the speech result is obtained. Its
functions are as follows:
• Instruct the Recognizer Engine to proceed and obtain the next speech result.
• Instruct the Recognizer Engine to change a model it is using.
• Process the speech result.
• Wrap the speech result with the information the System Engine needs to execute the
function related to the speech result.
• Pass the speech result to the System Engine for it to proceed with its functions.
3. System Engine
The System Engine executes the function related to the speech result obtained.
Another responsibility of the System Engine is to identify whether a model used in the
Recognizer Engine needs to be rebuilt. In that case, it requests permission from the
Response Engine to proceed with the build; the Response Engine then pauses the Recognizer
Engine and signals the System Engine to proceed. When the build is complete, the Response
Engine is signaled to resume its functions.
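The hand-off between the Recognizer Engine and the Response Engine described above can be sketched with a queue for results and a semaphore for the "proceed" signal. The class `EngineHandOff` and its members are hypothetical; the class diagrams only indicate that `BlockingQueue` and `Runnable` from the Java API are used, not how the signalling is actually implemented.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the recognizer/response hand-off; not the report's classes.
class EngineHandOff {
    // Results travel from the recognizer side to the response side through this queue.
    private final BlockingQueue<String> results = new LinkedBlockingQueue<>();
    // The response side releases one permit per result to let the recognizer continue.
    private final Semaphore proceed = new Semaphore(0);
    final List<String> log = new CopyOnWriteArrayList<>();

    // Recognizer side: publish one result, then block until signalled to proceed.
    Runnable recognizer(List<String> cannedResults) {
        return () -> {
            try {
                for (String result : cannedResults) {
                    log.add("recognized " + result);
                    results.put(result);
                    proceed.acquire();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Response side: take one result, handle it, then signal the recognizer.
    String handleNext() {
        try {
            String result = results.take();
            log.add("handled " + result);
            proceed.release();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a result", e);
        }
    }
}
```

Because the recognizer side cannot start on the next result until the permit is released, each result is fully handled before the next one is recognized, which matches the wait-for-signal behaviour described above.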
The Response Engine’s function can be described as coordinating the functions of the
Recognizer Engine and the System Engine. The functions of the Response Engine are in turn
coordinated by a Response Engine Processor. The Processor has one Response Handler Module
and one Response Generator Module set as active. References to all loaded modules of the
types Response Handler Module, Response Generator Module, Acoustic Module, Dictionary
Module and Language Module are also stored in the Processor. The list of modules to be
loaded into the system is defined by a configuration file. When the Response Engine is
passed a response object, the Processor passes it to the active Response Handler Module.
The Response Handler Module returns a process queue to the Processor, containing
instructions for the Processor to execute. The instructions the Processor can execute are
as follows:
• Pass response to generator - The response is passed to the active Response
Generator Module, which returns the response object with additional information
attached to it.
• Pass response to system - The response is passed to the System Engine.
• Switch Response Handler Module - The active Response Handler Module is switched to
the specified Response Handler Module from the stored module references.
• Switch Response Generator Module - The active Response Generator Module is switched
to the specified Response Generator Module from the stored module references.
• Switch Acoustic Module - The specified Acoustic Module is passed to the Recognizer
Engine through the Response Engine, and the Recognizer Engine loads the acoustic model
that the module refers to.
• Switch Dictionary Module - The specified Dictionary Module is passed to the
Recognizer Engine through the Response Engine, and the Recognizer Engine loads the
dictionary model that the module refers to.
• Switch Language Module - The specified Language Module is passed to the Recognizer
Engine through the Response Engine, and the Recognizer Engine loads the language model
that the module refers to.
• Wait for a predefined period of time - A null response object is passed to the
Response Engine when the predefined period elapses, if the Processor was not
instructed to pass a response object to the System Engine within that period. This
instruction can be used to implement functionality such as giving the user a brief
period in which to cancel a function related to the speech result before it is
executed. (Note: if a model of the Recognizer Engine is switched during this period to
obtain a different type of recognition result, the result can be inconsistent, because
the default sphinx API does not provide a method to interrupt the recognizer once it
enters the RECOGNIZING state.)
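The wait instruction at the end of the list above can be approximated with a timed poll on a queue, where a `null` return plays the role of the null response object. This is a simplified sketch under that assumption; `WaitWindow`, `offer` and `await` are hypothetical names, and the way the system actually schedules the null response is not described in this report.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a cancel window; not the report's implementation.
class WaitWindow {
    private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();

    // Called when a follow-up response (for example "cancel") arrives.
    void offer(String response) {
        incoming.offer(response);
    }

    // Returns the follow-up response, or null when the window elapses without one.
    String await(long millis) {
        try {
            return incoming.poll(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

A handler could treat a `null` return as "no objection, go ahead" and any other value as the user's follow-up, such as a request to cancel.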
Once the Processor has finished executing all the instructions provided by the active
Response Handler Module, the Response Engine signals the Recognizer Engine to continue
and waits for the next response to be passed to it.
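The way the Processor works through a process queue can be sketched as a loop over instruction objects. The enum constants below are modelled on the instruction list above, but `Instruction`, `Step` and `ProcessorSketch` are hypothetical names, modules are reduced to plain strings, and the timed wait is left as a no-op.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical instruction names modelled on the list in the text.
enum Instruction {
    PASS_TO_GENERATOR, PASS_TO_SYSTEM, SWITCH_HANDLER, SWITCH_GENERATOR,
    SWITCH_ACOUSTIC, SWITCH_DICTIONARY, SWITCH_LANGUAGE, WAIT
}

// One queued step: an instruction plus an optional argument such as a module name.
final class Step {
    final Instruction instruction;
    final String argument;

    Step(Instruction instruction, String argument) {
        this.instruction = instruction;
        this.argument = argument;
    }
}

class ProcessorSketch {
    String activeHandler = "defaultHandler";
    String activeGenerator = "defaultGenerator";
    final List<String> sentToSystem = new ArrayList<>();
    final List<String> recognizerRequests = new ArrayList<>();

    // Drains the queue in order. Returning corresponds to the point at which the
    // Recognizer Engine would be signalled to continue.
    void execute(String response, Queue<Step> steps) {
        String current = response;
        while (!steps.isEmpty()) {
            Step step = steps.poll();
            switch (step.instruction) {
                case PASS_TO_GENERATOR:
                    // Stand-in for the generator attaching extra information.
                    current = current + " [" + activeGenerator + "]";
                    break;
                case PASS_TO_SYSTEM:
                    sentToSystem.add(current);
                    break;
                case SWITCH_HANDLER:
                    activeHandler = step.argument;
                    break;
                case SWITCH_GENERATOR:
                    activeGenerator = step.argument;
                    break;
                case SWITCH_ACOUSTIC:
                case SWITCH_DICTIONARY:
                case SWITCH_LANGUAGE:
                    // Model switches are forwarded to the recognizer side.
                    recognizerRequests.add(step.instruction + ":" + step.argument);
                    break;
                case WAIT:
                    // The timed wait is omitted in this sketch.
                    break;
            }
        }
    }
}
```

Keeping the instructions as data means a Response Handler Module only describes what should happen, while the Processor remains the single place where engine state is changed.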
Every time a new response is obtained, before passing it to the Processor, the Response
Engine checks whether the System Engine has placed a request to build. If so, it signals
the System Engine and waits to be signaled back before continuing with its functions.
Figure 4.5.1 - The System Engine's process cycle (recoverable labels: Get Response; Allow the module
selector to choose the appropriate module; When module completes build, signal Response Engine to continue.)
Figure 4.6.1 - Simplified model of the speech recognition system (recoverable labels: Response Engine,
Response Engine Processor, Response Handler Module, Response Generator Module, Acoustic Module,
Language Module)
The three engines are designed as singletons to avoid conflicts over resources. Note also
that although the Acoustic Module, Dictionary Module and Language Module relate to the
Recognizer Engine, all loaded modules are stored in the Response Engine Processor. The
plugin modules may also communicate among themselves to extend their functionality.
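One common way to write a singleton in Java is the initialization-on-demand holder idiom, sketched below. The report does not state which singleton technique the engines use, so this is only one possibility; the class name `SystemEngineSketch` is hypothetical.

```java
// Hypothetical engine class illustrating one common singleton idiom.
final class SystemEngineSketch {
    private SystemEngineSketch() {
        // Private constructor: no other class can create a second engine.
    }

    // The JVM initializes Holder lazily and thread-safely on first access.
    private static final class Holder {
        static final SystemEngineSketch INSTANCE = new SystemEngineSketch();
    }

    static SystemEngineSketch getInstance() {
        return Holder.INSTANCE;
    }
}
```

Every caller obtains the same object from `getInstance()`, so a resource owned by the engine has exactly one owner.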
5. Future work
One of the primary focuses in improving the speech recognition system is to improve how
configuration details are provided and managed, which may include functionality to load
modules at runtime. A graphical user interface that automates the process of building
plugins is also planned, which would eliminate the need for programming knowledge to
implement simple plugins.
Another aspect under consideration is support for other audio sources, such as audio
streams or audio files, for which the sphinx API provides functionality. This would allow
the system to be deployed in networked systems, servers, etc.
6. Conclusion
Giving machines the ability to communicate with humans in the natural human medium of
communication has always been a fascinating prospect, and modern technologies have
brought humans closer to realizing this dream than ever before. Yet providing the
intelligence a machine needs to communicate flawlessly with humans remains the greatest
challenge in realizing this dream at this stage. The Speech Recognition System designed
and documented in this report is an attempt to provide users with a simple interface
through which they can supply their own details to recognize speech and have computers do
their bidding.
7. References
Java™ Speech API Programmer's Guide, Sun Microsystems, Inc. Retrieved July 8, 2015, from
http://www.ling.helsinki.fi/kit/2004s/ctl310gen/L7-Speech/JSAPI/index.html