Final Synopsis
on
By
(Dr. Milind Nemade)
Faculty of Engineering
Telecommunication Engineering, K. J. Somaiya Institute of Engineering and
Information Technology, Sion (E), Mumbai-400022.
CONTENTS

Introduction
Importance / Rationale of Proposed Investigation
Scope of Research
Review of Literature
Research gaps identified in the proposed field of investigation
Objectives of Research
Research Methodology
Hypothesis
Sources of Information
Tools / Techniques of Research
Plan of Research work
Tentative Chapter Flow
References
Introduction:
With the ever-increasing power and falling cost of digital signal
processors, and the availability of cheap memory chips, speech processing systems are
widely used for voice communication and recognition. Voice recognition systems
include hands-free input for voice dialling, voice-activated security systems,
etc. The presence of background noise and other disturbances also makes a
speech processing system complex and difficult to design. The performance of a speech
processing system is usually measured in terms of recognition accuracy. All speech
recognizers include an initial signal-processing front end that converts the speech signal
into a more convenient and compressed form called feature vectors. The feature
extraction method plays a vital role in the speech recognition task. There are two
dominant approaches to acoustic measurement. The first is the temporal-domain
(parametric) approach, such as Linear Prediction, which is developed to closely match
the resonant structure of the human vocal tract that produces the corresponding sound.
The second is the frequency-domain (nonparametric) approach, known as Mel-Frequency
Cepstral Coefficients (MFCC) [1].
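As an illustration of the nonparametric approach, the MFCC pipeline (framing, windowing, magnitude spectrum, mel filterbank, logarithm, DCT) can be sketched as follows. The sampling rate, frame length, and filter counts below are illustrative choices, not parameters taken from [1]:

```python
import numpy as np

def hz_to_mel(f):
    # standard mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filters=20, n_ceps=13):
    """MFCCs of a 1-D signal, one row of coefficients per frame."""
    # frame the signal with 50% overlap and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1))          # magnitude spectrum
    # triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        for k in range(lo, mid):
            fbank[j - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[j - 1, k] = (hi - k) / max(hi - mid, 1)
    log_energies = np.log(spec @ fbank.T + 1e-10)       # log mel-filterbank energies
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return log_energies @ basis.T
```

Each frame thus yields a short vector of cepstral coefficients, which is the compressed feature-vector form referred to above.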
In another approach, the wavelet transform and wavelet packet tree have been used
for speech feature extraction, with the energies of wavelet-decomposed subbands
used in place of Mel-filtered subband energies. However, time information is lost
when only wavelet subband energies are used. Speech is a
nonstationary signal. The Fourier transform (FT) is not suitable for the analysis of
such nonstationary signals because it provides only the frequency content of the
signal, not the times at which each frequency is present.
The windowed short-time FT (STFT) provides temporal information about the
frequency content of the signal, but a drawback of the STFT is its fixed time
resolution due to the fixed window length. The wavelet transform, with its flexible
time-frequency window, is an appropriate tool for the analysis of nonstationary
signals like speech, which contain both short high-frequency bursts and long
quasi-stationary components [2]. In a speech signal, high frequencies are present very
briefly at the onset of a sound, while lower frequencies are present later and for
longer periods. The DWT resolves all of
these frequencies well. The DWT parameters contain information at different
frequency scales, which helps in extracting the speech information of the corresponding
frequency band. In order to parameterize the speech signal, the signal can be
decomposed into four frequency bands uniformly or in dyadic fashion [2].
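The dyadic decomposition described above can be sketched with a Haar DWT: each level splits the current approximation into a low-frequency half and a high-frequency detail band, and the subband energies are a common wavelet-domain feature. The Haar wavelet and three-level depth here are illustrative assumptions, since the text does not fix a particular wavelet:

```python
import numpy as np

def haar_dwt(x):
    # one level of the orthonormal Haar DWT: approximation and detail halves
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def dyadic_decompose(x, levels=3):
    # split the signal into dyadic subbands: details D1..Dn plus the final approximation
    subbands = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_dwt(a)
        subbands.append(d)
    subbands.append(a)
    return subbands  # [D1, D2, D3, A3] for levels=3, i.e. four frequency bands

def subband_energies(subbands):
    # energy per subband; using only these discards time information, as noted above
    return [float(np.sum(b ** 2)) for b in subbands]
```

Because the Haar transform is orthonormal, the subband energies sum to the total signal energy, so the four bands partition the signal's energy by frequency scale.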
Importance / Rationale of Proposed Investigation:
Speech is the predominant mode of human communication in everyday interaction,
and it is likely to be a preferred mode for human-computer interaction as well. Spoken
language has so far been used in human-computer interaction mostly for
information access and extraction. There is a need to treat spoken language not only
as a means of accessing information but also as a source of information in itself,
which would make it more important for accessing, sorting, editing, translation, etc.
As speech signals are nonstationary in nature, speech recognition is a complex task
owing to differences in gender, emotional state, accent, pronunciation, articulation,
nasality, pitch, volume, and speaking rate across speakers. The presence of background
noise and other disturbances also makes a speech processing system complex and
difficult to design. Effective speech feature extraction and classification would improve
the quality of speech, represent the speech signal in terms of frequency and bandwidth,
and improve speech recognition. Speech processing is useful for various applications
such as mobile applications, healthcare, automatic translation, robotics, video games,
transcription, audio and video database search, household applications, language
learning applications, etc.
Scope of Research:
Review of Literature:
Whistle sound detection was performed in [7] using the FFT.
Experiments showed that frequency-spectrum analysis of
sound signals is more adequate because it exhibits a high degree of immunity to noise.
This finding allowed the creation of a simpler and lighter detection algorithm.
There are probably other similar noises that can be mixed up with a whistle
sound under frequency-spectrum analysis; a more thorough investigation is necessary
to find those sounds, although parameter tuning is important in order to constrain
the value margins. Another important finding was that this method cannot easily
distinguish between slightly different referee whistles unless their frequencies
differ strongly [7].

The paper [8] demonstrated three advantages of its proposed
speech recognition system. First, the system is implemented on an
embedded platform with an FPGA chip and SoC architecture, making it
flexible enough to cooperate with other digital systems. Second, for the computation
of the FFT in hardware, an integer FFT is adopted in place of a floating-point FFT,
and a hardware realization algorithm for the integer FFT is also proposed. Through
the use of the integer FFT, the FFT computation time is decreased by 57%; this
substantial reduction suits real-time applications on an embedded platform. Third,
the experimental results show that the speech recognition rate is better than that
of existing work. However, in spite of the improvement in FFT computation time, the
recognition time of the proposed system is still too long for real-time applications.
Multi-core and parallel processing for the speech recognition algorithm are necessary
to further improve the recognition time and are worthwhile to examine in future research.
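The frequency-spectrum approach to whistle detection in [7] can be sketched as a band-energy test: compute the spectrum and check whether most of the energy falls in a candidate whistle band. The band limits and threshold below are hypothetical values for illustration, not parameters taken from [7]:

```python
import numpy as np

def detect_whistle(signal, sr, f_lo=3000.0, f_hi=4500.0, ratio=0.6):
    """True if most spectral energy lies in the candidate whistle band."""
    spec = np.abs(np.fft.rfft(signal * np.hamming(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    # a whistle concentrates its energy in a narrow high-frequency band,
    # which is why this test tolerates broadband background noise well
    band_energy = np.sum(spec[band] ** 2)
    total_energy = np.sum(spec ** 2) + 1e-12
    return bool(band_energy / total_energy > ratio)
```

As noted above, such a test cannot separate two whistles whose fundamental frequencies both fall inside the chosen band; only the band limits and threshold can be tuned to constrain the value margins.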
The system in [10] supports examination using voice, which is helpful for people
who are blind and people who do not use a keyboard to interact with the system.
In the proposed method, voice is taken as input through a microphone and used to
register for an online examination. At login, the user is logged in to the online
examination system through voice authentication, which uses the FFT to compare the
user's input voice with a template voice. The user then proceeds with the questions,
which are displayed and read out by the system [10].

In [11], the authors emphasised the inclusion of DTW and K-NN
techniques for recognizing Bangla speech, as no such work had been reported, and
evaluated the performance from several aspects. However, for different
speakers the performance decreases by almost 10 to 15%. Recognizing continuous
speech with an ANN classifier has an average accuracy of 73.36%; for a three-layer
back-propagation neural network the maximum accuracy is 86.67%; and spoken-letter
recognition by measuring Euclidean distance, which can recognize only the
vowels, has an 80% accuracy rate.
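The DTW matching used with K-NN in [11] (and for feature matching in [13]) can be sketched with the classic dynamic-programming recurrence, followed by a 1-nearest-neighbour decision over labelled templates; the helper names here are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    # classic dynamic-programming DTW between two feature sequences
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # local distance between frame features (scalars or vectors)
            d = np.linalg.norm(np.atleast_1d(a[i - 1]) - np.atleast_1d(b[j - 1]))
            # best warping path reaching this cell
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def knn_classify(query, templates, labels):
    # 1-NN over DTW distances to labelled reference templates
    dists = [dtw_distance(query, t) for t in templates]
    return labels[int(np.argmin(dists))]
```

DTW absorbs differences in speaking rate by allowing the two sequences to stretch against each other, which is why it suits isolated-word matching.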
Numeral recognition remains one of the most important problems in pattern
recognition. It has numerous applications, including reading postal zip codes,
passport numbers, and employee codes; form processing; bank cheque processing;
postal mail sorting; job application sorting; automatic scoring of multiple-choice
tests; video gaming; etc. To the best of the authors' knowledge, little work has
been done on Indian languages, especially Marathi, compared with non-Indian
languages. The paper [13] discusses an effective method for recognition of isolated
Marathi numerals. It presents a Marathi database and an isolated numeral recognition
system based on Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction
and Dynamic Time Warping (DTW) for feature matching, i.e., comparing the test
patterns [13]. In recent years there has been a steady movement towards the
development of speech technologies to replace or enhance text input, known as
mobile search applications; both Yahoo! and Microsoft have launched voice-based
mobile search applications. Future work can include improving the recognition
accuracy of the individual numerals by combining multiple classifiers [13].

Voice recognition is a system that converts spoken words in well-known
languages into written language, or translates them into commands for machines,
depending on the purpose. The input to such a system is voice; the system identifies
the spoken word(s), and the result of the process is written text on the screen or a
movement of the machine's mechanical parts. The research in [14] focused on analysis
of the matching process used to give commands to a multipurpose machine, such as a
robot, with Linear Predictive Coding (LPC) and a Hidden Markov Model (HMM). LPC is a
method of analysing voice signals by encoding their characteristics as LPC
coefficients, while the HMM is a form of signal modeling in which voice signals are
analysed to find the maximum probability and recognize new input words based on a
defined codebook. This process could recognize five basic movements of a robot:
"forward", "reverse", "left", "right" and "stop" in the desired language [14].
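The LPC analysis used in [14] and several of the systems below is usually computed per frame by the autocorrelation method with the Levinson-Durbin recursion; a minimal sketch, with an illustrative prediction order (the order used in [14] is not stated here):

```python
import numpy as np

def levinson_durbin(r, order):
    # solve the normal equations recursively for the prediction polynomial
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                    # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)              # remaining prediction error power
    return a, err

def lpc_coeffs(frame, order=10):
    # autocorrelation method; frames are normally pre-emphasised and windowed first
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1 : len(frame) + order]
    return levinson_durbin(r, order)
```

On a signal generated by a known all-pole model, the recovered polynomial approaches the model's coefficients, which is the sense in which LPC matches the resonant structure of the vocal tract.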
The system in [15] uses the LPC method for extracting the features of the voice
signal. For each voice signal, the LPC method produces 576 data points, which then
become the input of the ANN. The ANN was trained using 210 training samples covering
the pronunciation of the seven words used as commands, spoken by 30 different people.
Experimental results show that the highest recognition rate achieved by this system
is 91.4%, obtained using 25 samples per word, 1 hidden layer, 5 neurons per hidden
layer, and a learning rate of 0.1 [15].

The paper [16] proposes an approach to recognize the English words for the digits
zero to nine spoken in isolation by different male and female speakers. A set of
features consisting of a combination of Mel-Frequency Cepstral Coefficients (MFCC),
Linear Predictive Coding (LPC), Zero Crossing Rate (ZCR), and Short Time Energy (STE)
of the audio signal is used to generate a 63-element feature vector, which is
subsequently used for discrimination. Classification is done using artificial neural
networks (ANN) with feed-forward back-propagation architectures. An accuracy of 85%
is obtained with the combined features when the proposed approach is tested on a
dataset of 280 speech samples, which is higher than the accuracy obtained using any
of the features singly [16].
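Of the four feature types combined in [16], ZCR and STE are the simplest to compute per frame; a minimal sketch (the exact 63-element vector layout of [16] is not reproduced here):

```python
import numpy as np

def zero_crossing_rate(frame):
    # fraction of adjacent-sample sign changes in the frame;
    # high for noisy/unvoiced sounds, low for voiced sounds
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def short_time_energy(frame):
    # mean squared amplitude of the frame;
    # high for voiced sounds, near zero in silence
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(frame ** 2))
```

ZCR and STE complement the spectral features: together they separate voiced, unvoiced, and silent frames cheaply before the heavier MFCC/LPC features discriminate between words.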
The paper [18] presents a Bangla speech recognition system, divided into two major
parts: speech signal processing and speech pattern recognition. The speech processing
stage consists of speech start- and end-point detection, windowing, filtering,
calculation of the linear predictive coding (LPC) and cepstral coefficients, and
finally construction of the codebook by vector quantization. The second part is a
pattern recognition system using an artificial neural network (ANN). Speech signals
are recorded using an audio wave recorder in a normal room environment. The recorded
speech signal is passed through the start- and end-point detection algorithm to
detect the presence of speech and remove the silence and pause portions of the
signal. The resulting signal is then filtered to remove unwanted background noise,
and the filtered signal is windowed with half-frame overlap. After windowing, the LPC
and cepstral coefficients are calculated. The feature extractor uses a standard LPC
cepstrum coder, which converts the incoming speech signal into the LPC cepstrum
feature space. A self-organizing map (SOM) neural network maps each variable-length
LPC trajectory of an isolated word into a fixed-length LPC trajectory, producing the
fixed-length feature vector to be fed into the recognizer. The neural network is
designed with a multilayer perceptron approach and tested with 3, 4, and 5 hidden
layers using tanh-sigmoid transfer functions for the Bangla speech recognition
system. Different neural network structures are compared for a better understanding
of the problem and its possible solutions.
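Codebook construction by vector quantization, as in the pipeline above, is commonly done with k-means clustering of the training feature vectors. A minimal sketch; the simple deterministic initialization is an assumption of this sketch, since [18] does not specify its VQ scheme:

```python
import numpy as np

def build_codebook(vectors, k=8, iters=20):
    # plain k-means: the final centroids form the VQ codebook
    vectors = np.asarray(vectors, dtype=float)
    # spread the initial centroids across the data (a simple deterministic choice)
    centroids = vectors[:: max(len(vectors) // k, 1)][:k].copy()
    for _ in range(iters):
        # assign each vector to its nearest centroid
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids
```

At recognition time, each feature vector is replaced by the index of its nearest codebook entry, giving a compact discrete representation of the continuous feature space.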
Three noise-robust techniques based on the discrete wavelet transform (DWT) are
proposed in [20]: wavelet filter cepstral coefficients (WFCCs), sub-band power
normalization (SBPN), and lowpass filtering plus zero interpolation (LFZI).
According to the experiments, the proposed WFCC provides a more robust c0 (the
zeroth cepstral coefficient) for speech recognition, and with proper integration of
WFCCs and the conventional MFCCs, the resulting compound features can enhance
recognition accuracy. Second, the SBPN procedure reduces the power mismatch within
each modulation spectral sub-band, and thus improves recognition accuracy
significantly. Finally, the third technique, LFZI, reduces the storage space for
speech features while remaining helpful for speech recognition under noisy
conditions [20].

The framework of a voice conversion system is expected to emphasize both the
static and dynamic characteristics of the speech signal.
Conventional approaches like Mel-frequency cepstral coefficients and linear
predictive coefficients focus on spectral features limited to lower frequency bands.
The paper [21] presents a novel wavelet packet filter-bank approach to identify
non-uniformly distributed dynamic characteristics of the speaker. Its contribution
is threefold. First, in the feature extraction stage, the dyadic wavelet packet tree
structure is optimized to involve less computation while preserving the
speaker-specific features. Second, in the feature representation step, magnitude and
phase attributes are treated separately, reflecting the fact that raw time-frequency
traits are highly correlated but carry intelligible speech information. Finally, an
RBF mapping function is established to transform the speaker-specific features from
the source to the target speakers. The results obtained by the proposed
filter-bank-based voice conversion system are compared to baseline multiscale voice
morphing results using subjective and objective measures. Evaluation results reveal
that the proposed method outperforms the baseline by incorporating the
speaker-specific dynamic characteristics and phase information of the speech
signal [21].

The study [22] proposes an improved feature extraction method called Wavelet
Cepstral Coefficients (WCC). In traditional cepstral analysis, the cepstrums are
calculated using the Discrete Fourier Transform (DFT). Because the DFT calculation
assumes the signal is stationary within a frame, which in practice is not quite
true, the WCC replaces the DFT block in the traditional cepstrum calculation with
the Discrete Wavelet Transform (DWT). To evaluate the proposed WCC, a speech
recognition task of recognizing the 26 English alphabet letters was conducted, with
comparisons against the traditional Mel-Frequency Cepstral Coefficients (MFCC) to
further analyse the effectiveness of the WCCs. The WCCs showed comparable results,
especially considering their small vector dimension relative to the MFCCs. The best
recognition was found with WCCs at level 5 of the DWT decomposition, with small
differences of 1.19% and 3.21% relative to the MFCCs for speaker-independent and
speaker-dependent tasks, respectively [22].

A method of speaker recognition based on wavelet functions and neural networks
was presented in [23]. The wavelet functions are used to obtain the approximation
function and the details of the speaker's averaged spectrum, in order to extract
the speaker's voice characteristics from the frequency spectrum. The approximation
function and the details are then used as input data for the decision-making neural
networks. In this recognition process, not only is a decision made on the speaker's
identity, but the probability that the decision is correct can also be provided [23].

In [24], a modified voice identification system is described using oversampled
Haar wavelets followed by proper orthogonal decomposition. The audio signal is
decomposed using oversampled Haar wavelets, which converts it into various
non-correlating frequency bands. This allows the linear predictive cepstral
coefficients to be calculated to capture the characteristics of individual speakers.
An adaptive threshold is applied to reduce noise interference, followed by a
multi-layered vector quantization technique to eliminate the interference between
multiband coefficients. Finally, proper orthogonal decomposition is used to extract
unique characteristics capturing more details of phoneme characters.

A performance analysis of voice activity detection (VAD) algorithms based on
wavelets and on the AMR-WB (Adaptive Multi-Rate Wideband) speech codec was developed
in [25], with an HMM classifier used for pattern matching. The experimental results
showed that the wavelet approaches provided good results in clean, noisy, and
reverberant environments with respect to speech clipping, at much lower
computational complexity. The performance of the AMR VAD was improved upon by the
wavelet-based approaches [25].

Speech signal analysis using the wavelet transform was done in [26]; it
efficiently locates the spectral changes in the speech signal, as well as the
beginning and end of sounds. Experimental results show the superiority of DWT
methods over more classic ones like Mel-Frequency Cepstral Coefficients (MFCC) [27].
The analysis of the power in different frequency subbands gives an excellent
opportunity to distinguish the beginning and the end of phonemes.
In [28], new efficient feature extraction methods for speech recognition are
proposed. The features are obtained from cepstral-mean-normalized, reduced-order
Linear Predictive Coding (LPC) coefficients derived from speech frames decomposed
using the Discrete Wavelet Transform (DWT). LPC coefficients derived from
wavelet-decomposed subbands of a speech frame provide a better representation than
modelling the frame directly. It has been shown experimentally that the proposed
approach provides effective (better recognition rate) and efficient (reduced feature
vector dimension) features. A speech recognition system using the Continuous Density
Hidden Markov Model (CDHMM) was implemented, and the proposed algorithms were
evaluated on an isolated Marathi digits database in the presence of white Gaussian
noise [28].

In [29], an automatic speech recognition system is designed for isolated spoken
words in Malayalam using discrete wavelet transforms and artificial neural networks.
A good identification performance, with a high recognition accuracy of 90%, is
obtained in this study. The computational complexity and feature vector size are
successfully reduced to a great extent by using discrete wavelet transforms; thus
the wavelet transform is an elegant tool for the analysis of non-stationary signals
like speech. The experimental results show that this hybrid architecture of discrete
wavelet transforms and neural networks can effectively extract features from the
speech signal for automatic speech recognition. The experiment used a limited number
of samples, and the recognition rate can be increased by increasing the number of
samples; the neural network classifier used in the experiment provides good
accuracy.

DWT- and LPC-based techniques (UWLPC and DWLPC) for isolated word recognition
were presented in [30]. Experimental results show that the proposed WLPC (UWLPC and
DWLPC) features are effective and efficient compared to LPCC and MFCC, because they
take the combined advantages of LPC and DWT while estimating the features. The
feature vector dimension for WLPC is almost half that of LPCC and MFCC, which
reduces the memory requirement and the computational time. It is also observed that
DWLPC performs better than UWLPC, because dyadic (logarithmic) frequency
decomposition mimics the human auditory perception system better than uniform
frequency decomposition [30].

A combination of wavelet packet signal processing and an adaptive network-based
fuzzy inference system (ANFIS) to efficiently extract features from pre-processed
real speech signals, for the purpose of automatic speech recognition of a variety
of Turkish words, was presented in [31]. These features make the expert system
suitable for automatic classification and interpretation of speech signals, and the
results point to the possibility of designing new automatic speech recognition (ASR)
assistance systems [31].
Research gaps identified in the proposed field of investigation:
- Feature extraction and classification are major components that play a vital role
in speech recognition systems, so efficient representation of speech features and
their classification is required.
- To improve the accuracy of a speech recognition system, a hybrid architecture
consisting of the Wavelet Transform (WT) and an Artificial Neural Network (ANN) can
be used.
- The ANN architecture should be simplified without reducing the recognition rate.
- Isolated Marathi digits should be recognised quickly by speeding up the
recognition time.
Objectives of Research:
- To derive effective, efficient, and noise-robust features from the frequency
subbands of each frame using the discrete wavelet transform.
- To decompose each frame of the speech signal into different frequency subbands
using the discrete wavelet transform.
- To classify each subband using an artificial neural network (ANN).
- To determine the accuracy of the speech recognition system.
Research Methodology:
The following diagram indicates the method of feature extraction and classification
using discrete wavelet transform and artificial neural network.
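A minimal sketch of the ANN classification stage, as a one-hidden-layer feed-forward network trained with backpropagation on feature vectors; the layer sizes, learning rate, and tanh/softmax choices below are illustrative assumptions, not the architecture finally adopted in this work:

```python
import numpy as np

def train_mlp(X, y, n_hidden=8, epochs=500, lr=0.1, seed=0):
    """One-hidden-layer perceptron trained with plain batch backpropagation."""
    rng = np.random.default_rng(seed)
    n_classes = int(y.max()) + 1
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, n_classes)); b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot targets
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                   # hidden activations
        z = h @ W2 + b2
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # softmax class posteriors
        g2 = (p - Y) / len(X)                      # cross-entropy gradient at the output
        g1 = (g2 @ W2.T) * (1.0 - h ** 2)          # backpropagate through tanh
        W2 -= lr * (h.T @ g2); b2 -= lr * g2.sum(axis=0)
        W1 -= lr * (X.T @ g1); b1 -= lr * g1.sum(axis=0)
    return W1, b1, W2, b2

def mlp_predict(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).argmax(axis=1)
```

In the proposed methodology, the rows of X would be DWT-derived feature vectors and the labels y the isolated Marathi digit classes; keeping the hidden layer small is one way to pursue the simplified-ANN goal stated in the research gaps.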
Hypothesis:
The objective of this research is to investigate the combined performance of the
wavelet transform and an artificial neural network (ANN) for isolated Marathi
digits, so as to improve the accuracy of the speech recognition system.
Sources of Information:
Plan of Research work:
The work is planned across six phases (Phase I-VI) covering the following
activities:
- Literature Survey
- Simulation of combined strategies
- Comparing results of developed strategies with existing algorithms
- Documentation
- Thesis framework preparation and Submission
Tentative Chapter Flow:
1. Speech Feature Classification using Artificial Neural Network (ANN)

References:
Computer Applications (0975-8887), vol. 39, no. 12, pp. 7-12, February 2012.
[11] Asm Sayem, “Speech Analysis for Alphabets in Bangla Language: Automatic
Speech Recognition”, International Journal of Engineering Research, vol.3,
no.2, pp.88-93, February 2014.
[12] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-
Carrasquillo, “Support Vector Machines for Speaker and Language
Recognition”, in Elsevier Journal of Computer Speech & Language, vol. 20,
issue 2/3, pp 210 – 229, 2006.
[13] Siddheshwar S. Gangonda, Dr. Prachi Mukherji, “Speech Processing for
Marathi Numeral Recognition using MFCC and DTW Features”, International
Journal of Engineering Research and Applications (IJERA), pp.218-222,
March 2012.
[14] Wahyu Kusuma R., Prince Brave Guhyapati V., “Simulation Voice Recognition
System for controlling Robotic Applications”, Journal of Theoretical and
Applied Information Technology, vol. 39, no. 2, pp. 188-196, May 2012.
[15] Thiang and Suryo Wijoyo, “Speech Recognition Using Linear Predictive
Coding and Artificial Neural Network for Controlling Movement of Mobile
Robot”, International Conference on Information and Electronics Engineering,
vol.6, pp.179-183, 2011.
[16] Bishnu Prasad Das, Ranjan Parekh, “Recognition of Isolated Words using
Features based on LPC, MFCC, ZCR and STE, with Neural Network
Classifiers”, International Journal of Modern Engineering Research, vol.2,
pp.854-858, May-June 2012.
[17] P. Zegers, “Speech recognition using neural network”, MS Thesis, Department
of Electrical & Computer Engineering, University of Arizona, 1998.
[18] Paul A.K., Das D., Kamal M.M., “Bangla Speech Recognition System Using
LPC and ANN”, 7th IEEE International Conference on Advances in Pattern
Recognition”, pp. 171-174, 2009.
[19] Firoz Shah A., Raji Sukumar A. and Babu Anto P., “Discrete Wavelet
Transforms and Artificial Neural Networks for Speech Emotion Recognition”,
International Journal of Computer Theory and Engineering, vol. 2, no. 3,
pp.319-322, June 2010.
[20] Jeih-Weih Hung, Hao-Teng Fan , and Syu-Siang Wang, “Several New DWT-
Based Methods for Noise-Robust Speech Recognition”, International Journal
of Innovation, Management and Technology, vol. 3, no. 5, pp.547-551,
October 2012.
[21] Jagannath H Nirmal, Mukesh A Zaveri, Suprava Patnaik and Pramod H
Kachare, “A novel voice conversion approach using admissible wavelet packet
decomposition”, EURASIP Journal on Audio, Speech, and Music
Processing, pp. 1-10, 2013.
[22] T. B. Adam, M. S. Salam, T. S. Gunawan, “Wavelet Cesptral Coefficients for
Isolated Speech Recognition”, Telkomnika, vol.11, no.5, pp.2731-2738, May
2013.
[23] Sanja Grubesa, Tomislav Grubesa, Hrvoje Domitrovic, “Speaker Recognition
Method combining FFT, Wavelet Functions and Neural Networks”, Faculty of
Electrical Engineering and Computing, University of Zagreb, Croatia.
[24] Mohammed Anwer and Rezwan-Al-Islam Khan, “Voice identification Using a
Composite Haar Wavelets and Proper Orthogonal Decomposition”,
International Journal of Innovation and Applied Studies, vol. 4, no. 2, pp.353-
358, October 2013.
[25] Marco Jeub, Dorothea Kolossa, Ramon F. Astudillo, Reinhold Orglmeister,
“Performance Analysis of Wavelet-based Voice Activity Detection”,
NAG/DAGA-Rotterdam, 2009.
[26] Beng T Tan, Robert lang, Hieko Schroder, Andrew Spray, Phillip Dermody,
“Applying Wavelet Analysis to Speech Segmentation and Classification”,
Department of Computer Science.
[27] Bartosz Ziolko, Suresh Manandhar, Richard C. Wilson and Mariusz Ziolko,
“Wavelet Method of Speech Segmentation”, University of York Heslington,
YO10 5DD, York, UK.
[28] N. S. Nehe, R. S. Holambe, “New Feature Extraction Techniques for Marathi
Digit Recognition”, International Journal of Recent Trends in Engineering, Vol
2, No. 2, November 2009.
[29] Sonia Sunny, David Peter S, K Poulose Jacob, “Discrete Wavelet Transforms
and Artificial Neural Networks for Recognition of Isolated Spoken Words”, in
International Journal of Computer Applications, volume 38, No.9, pp 9 – 13,
January 2012.
[30] N. S. Nehe, R. S. Holambe, “DWT and LPC based feature extraction methods
for isolated word recognition”, EURASIP Journal on Audio, Speech, and
Music Processing, vol.2012, pp.1-7, 2012.
[31] Engin Avci, Zuhtu Hakan Akpolat, “Speech recognition using a wavelet packet
adaptive network based fuzzy inference system”, in Elsevier Expert Systems &
Applications, vol 31, pp 495 – 503, 2006.
Signature of Candidate with Date
Outlined Approved