Speech To Text Conversion
Prepared by:
Alaa Hassan Mahmoud
Salma Alzaki Ali
Supervised by:
Dr. Howida Ali Abdul Gadir
2014
DEDICATION
To our mothers .. Our first Teachers
To our fathers .. Our Heroes
To Our Brothers, Sisters
&
To our Friends
We dedicate this research
Acknowledgment
First of all, thanks to Allah, who endowed us with the courage and ability to accomplish this study.
Second, we are deeply thankful to our university, which gave us the chance to conduct our study.
Third, we are really grateful to our supervisor, Dr. Howida Ali Abdul Gadir, who exerted all possible efforts to guide us from the beginning of the study until its final stage; we benefited a lot from her valuable instruction.
ABSTRACT:
Though Arabic is a widely spoken language, research in Arabic speech recognition is limited compared to that on other languages. This thesis concerns converting spoken Arabic words into text using Mel-Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ).
This has been realized by first recording teachers' voices for each word in a noisy environment. Secondly, features were extracted from these recordings using the MFCC technique, and the features were taken as input to Vector Quantization to construct a codeword for each word. Finally, in the conversion stage, each codeword was indexed with the corresponding text.
The system targets deaf students, to help them solve some of the problems that face them in the university environment.
The system Word Error Rate was 20%.
CHAPTER ONE
INTRODUCTION
1.1 Introduction:
The advancement of Information and Communication Technology has affected all aspects of our lives, so we can use it to improve our communication, work, and learning.
The world population reached 7 billion in 2012. According to the World Federation of the Deaf, the total number of deaf people worldwide is around 70 million.
Many applications have been developed to help people who have learning disabilities, improve their quality of life, and solve some of their problems.
In order to contribute to the development of applications that help deaf people, we developed a speech-to-text conversion system; we hope it will improve their communication with their environment.
1.2 The problem:
Deaf people communicate with others using sign language instead of spoken language. Arabic is an official language in about 25 countries (some sources put the number at 22-26), and in some of them it is spoken only by a minority.
Literate deaf people can use a written language to communicate, but they face many problems while trying to continue learning. This is especially true for deaf students at university, because teachers in universities communicate through spoken language. Bringing in a sign-language interpreter to translate speech into sign would be costly, and the other students' attention would be drawn to the interpreter.
In our project we tried to solve these problems, so that deaf people can communicate more easily.
1.3 Objectives of the project:
The system aims to help deaf people know what is going on around them in the classroom, because they have been deprived of the sense of hearing, and it was necessary for us to help them and provide them with assistance.
The project also aims to spread knowledge and assist the coming generations in learning, whatever their disability, according to their ability to acquire knowledge.
1.4 Importance of the project:
The importance lies in the following:
1- Developing teaching methods in universities and institutes of higher education.
2- Assisting people with special needs, especially the deaf.
3- Supporting the Arabic language by adding a new application for it.
1.5 Scope and limitations of the project:
It was by God's grace that we were given science and its application, and the quest for knowledge was recommended by the Holy Messenger, who said: (Seek knowledge from the cradle to the grave). But some people have been deprived of some of the senses that help in seeking more knowledge, and it is our duty as Muslims to help one another. This urged us to develop this project, which targets the deaf community, especially the students of the Faculty of Fine Arts at the Sudan University of Science and Technology (western branch), so the message is sent to every educational category in the university community.
In our project we focus on converting ten spoken Arabic words into text.
." " " " " " " " " " " " " " " " " " " "
1.6 Tools:
MATLAB:
MATLAB is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numeric computation. Using the MATLAB product, you can solve technical computing problems faster than with traditional programming languages such as C, C++, and Fortran.
1.7 The system overview:
We try to help deaf people and improve their lives by designing applications that make communication with the external world easy and comfortable. We use the Arabic language in our project because it is our main language.
Our project is an application that converts speech to text: if the teacher who trained the system on his voice speaks, his voice signal will be processed to find the matching word, and the result will be shown as text.
We hope our project is expanded in the future to work on the whole Arabic vocabulary used in the educational environment (not just ten words). This would help students attain a better understanding of their curriculum, allow simultaneous note-taking during the lecture, and help students complete homework after the lecture.
1.8 Content:
In this section, we present brief information about the rest of this thesis. The remainder of this thesis is organized as follows:
Chapter 2: Literature Review: This chapter intends to discuss how the speech
signal is processed, the basics of speech recognition and the methods used in this
field.
Chapter 3: System implementation: This chapter describes the project
implementation steps.
Chapter 4: Conclusion and Future work: This chapter shows a conclusion for
the results obtained and the recommendation of this research.
References: the citations used in this thesis, indexed by number.
Appendix A: This appendix contains the project user interface.
CHAPTER 2
METHODOLOGY
2.1 Introduction
In this chapter we will show how to design the speech recognition system using the Vector Quantization (VQ) algorithm. The feature extraction and matching techniques used in this method will also be shown and discussed.
2.2 Speech recognition
Speech recognition applications are becoming more and more useful nowadays. Various interactive speech-aware applications are available in the market, but they are usually meant for, and executed on, traditional general-purpose computers. With the growth in the needs of embedded computing and the demand for emerging embedded platforms, speech recognition systems (SRS) are required to run on devices such as mobile telephones and handheld devices, which are becoming more and more powerful and affordable as well. It has become possible to run multimedia on these devices, and speech recognition systems emerge as efficient alternatives for such devices, where typing is difficult because of their small-screen limitations.
We used this property in the development of this system, which helps deaf people in the educational field.
2.3 Voice recognition:
Voice recognition is the ability of a computer, computer software program, or hardware device to decode the human voice into digitized speech that can be interpreted by the computer or hardware device. Voice recognition is commonly used to operate a device, perform commands, or write without having to use a keyboard, mouse, or buttons. Today, this is done on a computer with automatic speech recognition (ASR) software programs. Many ASR programs require the user to "train" the ASR program to recognize their voice so that it can more accurately convert the speech to text.
2.3.1.1 Speaker Dependence:
Speaker dependence describes the degree to which a speech recognition system requires knowledge of a speaker's individual voice characteristics to successfully process speech. The speech recognition engine can learn how you speak words and phrases; it can be trained to your voice. Speech recognition systems that require a user to train the system to his/her voice are known as speaker-dependent systems. If you are familiar with desktop dictation systems, most are speaker-dependent. Because they operate on very large vocabularies, dictation systems perform much better when the speaker has spent the time to train the system to his/her voice.
2.3.1.2 Speaker Independence:
Speech recognition systems that do not require a user to train the system are known as
speaker-independent systems. Speech recognition in the VoiceXML world must be
speaker-independent. Think of how many users (hundreds, maybe thousands) may be
calling into your web site. You cannot require that each caller train the system to his or
her voice. The speech recognition system in a voice-enabled web application MUST
successfully process the speech of many different callers without having to understand
the individual voice characteristics of each caller.[1]
2.4 The process of speech recognition:
Figure 1 below shows the stages of the process and the interaction between them.
2.4.1 Feature Extraction:
In speech recognition, the main goal of the feature extraction step is to compute a
parsimonious sequence of feature vectors providing a compact representation of the given
input signal.
The feature extraction is usually performed in three stages. The first stage is called the
speech analysis or the acoustic front end. It performs some kind of spectrum temporal
analysis of the signal and generates raw features describing the envelope of the power
spectrum of short speech intervals.
The second stage compiles an extended feature vector composed of static and dynamic
features. Finally, the last stage( which is not always present) transforms these extended
feature vectors into more compact and robust vectors that are then supplied to the
recognition stage.[3]
2.4.2 Mel Frequency Cepstral Coefficients (MFCC):
MFCC is based on known characteristics of human hearing: human perception of the frequency contents of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the Mel scale.
The Mel-frequency scale is linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels. A block diagram of the MFCC processes is shown in figure 2.[2]
Step 5: Mel-frequency wrapping:
Human perception of the frequency contents of sounds for a speech signal does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

Mel(f) = 2595 * log10(1 + f/700)            (1)
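As a quick illustration of equation (1), here is a small Python sketch of the Hz-to-mel conversion and its inverse (the thesis implementation itself uses MATLAB; this is only an illustrative translation):

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale using equation (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing mel-spaced filter-bank edges."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The 1 kHz reference tone maps to (approximately) 1000 mels:
print(round(hz_to_mel(1000.0)))  # -> 1000
```

Note how the reference point from the text is reproduced: 1000 Hz sits at roughly 1000 mels, and the scale grows logarithmically above it.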
Our approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component. Each filter in the bank has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The mel-scale filter bank is a series of L triangular band-pass filters designed to simulate the band-pass filtering believed to occur in the auditory system; this corresponds to a series of band-pass filters with constant bandwidth and spacing on a mel frequency scale.
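The filter-bank construction described above can be sketched in Python as follows. The filter count, FFT size, and sample rate below are illustrative assumptions, not values taken from the thesis:

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(num_filters, fft_size, sample_rate, f_low=0.0, f_high=None):
    """Build `num_filters` triangular band-pass filters spaced uniformly on the
    mel scale, returned as weight lists over FFT bins 0..fft_size//2."""
    if f_high is None:
        f_high = sample_rate / 2.0
    # Filter edges: num_filters + 2 points equally spaced on the mel scale.
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    mel_points = [lo + i * (hi - lo) / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((fft_size + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (fft_size // 2 + 1)
        for k in range(left, center):        # rising edge of the triangle
            weights[k] = (k - left) / float(center - left)
        for k in range(center, right):       # falling edge of the triangle
            if right > center:
                weights[k] = (right - k) / float(right - center)
        bank.append(weights)
    return bank

# Assumed example parameters: 20 filters, 512-point FFT, 16 kHz sampling.
bank = triangular_filterbank(num_filters=20, fft_size=512, sample_rate=16000)
```

Applying each weight vector to a power spectrum and summing gives one mel filter-bank energy per filter; the constant mel spacing makes the filters narrow at low frequencies and wide at high frequencies, as the text describes.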
Step 6: Cepstrum:
Here we convert the log mel spectrum back to time; the result is called the Mel Frequency Cepstral Coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, we can convert them back to the time domain using the discrete cosine transform (DCT).[2]
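This last step can be sketched directly: take the DCT of the log mel energies and keep the first few coefficients. The coefficient count of 13 below is a common choice in the literature, assumed here rather than stated by the thesis:

```python
import math

def dct_ii(x):
    """Type-II discrete cosine transform (unnormalized), computed directly.
    For short MFCC vectors (tens of coefficients) this O(N^2) loop is fine."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * i * (k + 0.5) / n) for k in range(n))
            for i in range(n)]

def mfcc_from_log_mel(log_mel_energies, num_coeffs=13):
    """Keep the first `num_coeffs` DCT coefficients of the log mel energies."""
    return dct_ii(log_mel_energies)[:num_coeffs]

# A flat log spectrum concentrates all its energy in coefficient 0,
# which is why higher MFCCs capture the spectral *shape* of the frame.
coeffs = mfcc_from_log_mel([1.0] * 20, num_coeffs=13)
```

In practice coefficient 0 (overall energy) is often dropped or treated separately, leaving the remaining coefficients as the compact frame representation fed to the recognizer.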
2.5 Pattern Recognition:
The pattern-matching approach involves two essential steps, namely pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework and establishes consistent speech pattern representations, for reliable pattern comparison, from a set of labeled training samples via a formal training algorithm. A speech pattern representation can be in the form of a speech template or a statistical model (e.g., a hidden Markov model, or HMM) and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns.[3]
2.5.1 Template Based Approach:
Template-based approaches to speech recognition have provided a family of techniques that have advanced the field considerably during the last six decades. The underlying idea is simple: a collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then carried out by matching an unknown spoken utterance against each of these reference templates and selecting the category of the best-matching pattern. Usually templates for entire words are constructed.
This has the advantage that errors due to segmentation or classification of smaller, acoustically more variable units such as phonemes can be avoided. In turn, each word must have its own full reference template; template preparation and matching become prohibitively expensive or impractical as vocabulary size increases beyond a few hundred words. One key idea in the template method is to derive typical sequences of speech frames for a pattern (a word) via some averaging procedure, and to rely on the use of local spectral distance measures to compare patterns. Another key idea is to use some form of dynamic programming to temporally align patterns, to account for differences in speaking rates across talkers as well as across repetitions of the word by the same talker.[3]
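The dynamic-programming alignment mentioned above is classically dynamic time warping (DTW). A minimal Python sketch, using scalar "frames" and absolute difference as a stand-in for a spectral distance measure:

```python
def dtw_distance(seq_a, seq_b, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping distance between two sequences of frames.
    Aligns the sequences non-linearly, compensating for differences in
    speaking rate between utterances of the same word."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = best alignment cost of seq_a[:i] against seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# The same "shape" spoken at half the rate still aligns with zero cost:
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # -> 0.0
```

In a real template recognizer the frames would be MFCC vectors and `dist` a spectral distance; the recognizer picks the template with the lowest DTW distance to the unknown utterance.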
2.5.2 Stochastic Approach:
Stochastic modeling entails the use of probabilistic models to deal with uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from many sources: for example, confusable sounds, speaker variabilities, contextual effects, and homophone words. Thus, stochastic models are a particularly suitable approach to speech recognition. The most popular stochastic approach today is hidden Markov modeling. A hidden Markov model is characterized by a finite-state Markov model and a set of output distributions. The transition parameters in the Markov chain model temporal variabilities, while the parameters in the output distributions model spectral variabilities. These two types of variabilities are the essence of speech recognition.[3]
2.6 Vector Quantization (VQ):
Vector Quantization (VQ) is often applied to ASR. It is useful for speech coders, i.e., efficient data reduction. Since transmission rate is not a major issue for ASR, the utility of VQ here lies in the efficiency of using compact codebooks for reference models and codebook searches in place of more costly evaluation methods. For isolated word recognition (IWR), each vocabulary word gets its own VQ codebook, based on a training sequence of several repetitions of the word. The test speech is evaluated by all codebooks, and ASR chooses the word whose codebook yields the lowest distance measure. In basic VQ, codebooks have no explicit time information, since codebook entries are not ordered and can come from any part of the training words. However, some indirect durational cues are preserved, because the codebook entries are chosen to minimize the average distance across all training frames, and frames corresponding to longer acoustic segments are more frequent in the training data.
The weight that VQ puts on speech transients can be an advantage over other ASR comparison methods for vocabularies of similar words.[3]
2.6.1 Applications using VQ:
Vector quantization is used in many applications, such as image and voice compression and voice recognition (and statistical pattern recognition in general).[4]
2.6.2 Vector Quantization in detail:
A vector quantizer maps k-dimensional vectors in the vector space into a finite set of vectors Y = {yi : i = 1, 2, ..., N}. Each vector yi is called a code vector or a codeword, and the set of all the codewords is called a codebook. Associated with each codeword yi is a nearest-neighbor region called a Voronoi region.
2.6.3 Compression in VQ:
A vector quantizer is composed of two operations. The first is the encoder, and the second is the decoder. The encoder takes an input vector and outputs the index of the codeword that offers the lowest distortion. In this case the lowest distortion is found by evaluating the Euclidean distance between the input vector and each codeword in the codebook. Once the closest codeword is found, the index of that codeword is sent through a channel (the channel could be computer storage, a communications channel, and so on). When the decoder receives the index of the codeword, it replaces the index with the associated codeword. Figure 5 shows a block diagram of the operation of the encoder and decoder.[4]
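The encoder/decoder pair described above is short enough to sketch directly in Python (the toy 2-D codebook below is purely illustrative; real codebooks hold MFCC vectors):

```python
def euclidean_sq(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def vq_encode(vector, codebook):
    """Encoder: return the index of the codeword with the lowest distortion."""
    return min(range(len(codebook)), key=lambda i: euclidean_sq(vector, codebook[i]))

def vq_decode(index, codebook):
    """Decoder: replace the transmitted index with its codeword."""
    return codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # toy 2-D codebook
idx = vq_encode([0.9, 1.2], codebook)
print(idx, vq_decode(idx, codebook))  # -> 1 [1.0, 1.0]
```

Only the index crosses the channel, which is where the compression comes from: the decoder recovers an approximation of the input, with distortion bounded by the codebook's resolution.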
2.7 Related works:
One system identifies an unknown speaker by investigating the extracted features of the unknown speech and comparing them to the stored extracted features for each different speaker. The feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients), with Vector Quantization (VQ) as the classification algorithm. The error rate of that system was about 13%. [5]
Another system was developed using different techniques, namely Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ), and Hidden Markov Models (HMM). The coding of all the techniques mentioned above was done using MATLAB. It was found that the combination of MFCC and the minimum-distance algorithm gives the best performance, and also accurate results in most cases, with an overall efficiency of 95%. The study also reveals that the HMM algorithm is able to identify the most commonly used isolated words; as a result, that speech recognition system achieves 98% efficiency. [7]
CHAPTER 3
IMPLEMENTATION AND RESULTS
3.1 Introduction
This chapter shows the implementation details of the speech-to-text conversion system. It also shows the steps required to achieve the complete voice recognition project. In addition, it introduces the project testing using different testing environments and the results after testing.
3.2 The MATLAB tool:
You can use MATLAB in a wide range of applications, including signal and image processing, communications, control design, test and measurement, financial modeling and analysis, and computational biology. Add-on toolboxes (collections of special-purpose MATLAB functions, available separately) extend the MATLAB environment to solve particular classes of problems in these application areas.
MATLAB provides a number of features for documenting and sharing your work. You can integrate your MATLAB code with other languages and applications, and distribute your MATLAB algorithms and applications. Its features include:
- A high-level language for technical computing.
- A development environment for managing code, files, and data.
- Interactive tools for iterative exploration, design, and problem solving.
- Mathematical functions for linear algebra, statistics, Fourier analysis, filtering, optimization, and numerical integration.
- 2-D and 3-D graphics functions for visualizing data.
- Tools for building custom graphical user interfaces.
3.3 Implementation
3.3.1 Recording
The first step of the development is recording the teachers' voices for each of the ten words.
3.3.2 Preprocessing
To enhance the accuracy and efficiency of the extraction process, speech signals are normally pre-processed before features are extracted. There are two steps in pre-processing:
1. Pre-emphasis.
2. Voice Activation Detection (VAD).
1- Pre-emphasis
The digitized speech waveform has a high dynamic range and suffers from additive noise. In order to reduce this range and spectrally flatten the speech signal, pre-emphasis is applied. A first-order high-pass FIR filter is used to pre-emphasize the higher-frequency components; the figure below shows the pre-emphasis step.[7]
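The first-order high-pass FIR pre-emphasis filter can be sketched in Python as follows. The coefficient 0.95 is an assumed typical value (0.9-0.97 is common in the literature); the thesis does not state the exact coefficient used:

```python
def pre_emphasize(samples, alpha=0.95):
    """First-order high-pass FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    Boosts higher-frequency components and spectrally flattens the signal."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

# A constant (low-frequency) signal is almost entirely suppressed,
# which is exactly the high-pass behavior described in the text.
out = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```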
The following figure shows a sample of the codebook data for one of the words:
Conversion stage
In this stage, when the system captures a word from the user, it extracts the voice features, then compares these features with the codewords stored in the training stage to find the best-matching index. After finding the index, the system displays the corresponding text.
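The conversion stage can be sketched as follows: score the input frames against each word's codebook by average distortion, then look up the text for the winning index. The toy codebooks and labels below are illustrative assumptions, not the thesis' actual data:

```python
def euclidean_sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def avg_distortion(frames, codebook):
    """Average distance from each input frame to its nearest codeword."""
    return sum(min(euclidean_sq(f, c) for c in codebook) for f in frames) / len(frames)

def recognize(frames, codebooks, texts):
    """Pick the word whose codebook yields the lowest average distortion,
    then return the corresponding text (the index-to-text lookup of the
    conversion stage). `codebooks` and `texts` are parallel lists."""
    best = min(range(len(codebooks)),
               key=lambda i: avg_distortion(frames, codebooks[i]))
    return texts[best]

# Toy example: two one-word "codebooks" of 2-D features.
codebooks = [[[0.0, 0.0], [0.1, 0.1]],   # word 0
             [[5.0, 5.0], [5.1, 4.9]]]   # word 1
texts = ["word-0", "word-1"]
print(recognize([[4.8, 5.2], [5.0, 5.0]], codebooks, texts))  # -> word-1
```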
Testing phase
For our system we calculated the Word Error Rate (WER), which is a common metric of the performance of a speech recognition or machine translation system.
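WER is defined as (substitutions + deletions + insertions) / (words in the reference), which is a word-level edit distance. A small Python sketch of the computation (the thesis does not give its scoring code; this is only the standard definition):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words-in-reference,
    computed as a word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i           # i deletions
    for j in range(m + 1):
        d[0][j] = j           # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m] / n

# One substitution in a five-word reference gives WER = 0.2 (20%):
print(word_error_rate("a b c d e", "a b x d e"))  # -> 0.2
```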
Testing result
The system WER is 20% .
CHAPTER 4
CONCLUSION AND FUTURE WORK
4.1 CONCLUSION:
This research has discussed an isolated word recognition system developed using the MFCC (Mel Frequency Cepstral Coefficients) and VQ algorithms. The system was designed and implemented using the MATLAB tool.
The greater the size of the training data, the greater the recognition accuracy. This training data could incorporate aspects like the different accents in which a word can be spoken, the same words spoken by male and female speakers, and the word being spoken under different conditions, say when the speaker has a sore throat.
The system could be redesigned with a client-server architecture, making MATLAB work as a server for a mobile phone that captures the speech and sends it to MATLAB for processing.
REFERENCES
1. Kimberlee A. Kemble: An Introduction to Speech Recognition.
2. http://www.researchtrend.net/ijet/4_Vibha.pdf, accessed 15/6/2014, 5:22 pm.
3. M. A. Anusuya: Speech Recognition by Machine.
4. http://www.mqasem.net/vectorquantization/vq.html, accessed 15/6/2014, 5:05 pm.
5. Kashyap Patel, R. K. Prasad: Speech Recognition and Verification Using MFCC & VQ.