
A Synopsis

on

“PERFORMANCE ANALYSIS OF COMBINED WAVELET TRANSFORM AND ARTIFICIAL NEURAL NETWORK FOR ISOLATED MARATHI DIGIT RECOGNITION”

By

(Atul Dattatraya Narkhede)

Under the Supervision of

(Dr. Milind Nemade)
Faculty of Engineering

PACIFIC ACADEMY OF HIGHER EDUCATION AND RESEARCH UNIVERSITY


UDAIPUR

Name of Scholar: (In English) Atul Dattatraya Narkhede

(In Hindi) अतुल दत्तात्रय नारखेडे

Title of the Research: (In English ) “Performance Analysis of Combined


Wavelet Transform and Artificial Neural Network for
Isolated Marathi Digit Recognition”

(In Hindi) “पृथक मराठी डिजिट मान्यता के लिए संयुक्त तरंगिका रूपांतरण और कृत्रिम तंत्रिका नेटवर्क के प्रदर्शन का विश्लेषण”

Location: Computer Software Laboratory, Department of Electronics and
Telecommunication Engineering, K. J. Somaiya Institute of Engineering and
Information Technology, Sion (E), Mumbai-400022.

CONTENTS

Introduction
Importance / Rationale of Proposed Investigation
Scope of Research
Review of Literature
Research gaps identified in the proposed field of investigation
Objectives of Research
Research Methodology
Hypothesis
Sources of Information
Tools / Techniques of Research
Plan of Research Work
Tentative Chapter Flow
References
Introduction:

With the ever-increasing power and falling cost of digital signal
processors, and the availability of inexpensive memory, speech processing systems are
widely used for voice communication and recognition. Voice recognition applications
include hands-free input for voice dialling, voice-activated security systems, and
similar tasks. The presence of background noise and other disturbances makes a
speech processing system complex and difficult to build. The performance of a speech
processing system is usually measured in terms of recognition accuracy. All speech
recognizers include an initial signal-processing front end that converts the speech
signal into a more convenient, compressed form called a feature vector. The feature
extraction method therefore plays a vital role in the speech recognition task. There are
two dominant approaches to acoustic measurement. The first is the temporal-domain
(parametric) approach, such as Linear Prediction, developed to closely match the
resonant structure of the human vocal tract that produces the corresponding sound.
The second is the frequency-domain (nonparametric) approach, exemplified by
Mel-Frequency Cepstral Coefficients (MFCC) [1].
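To make the two front ends concrete, a minimal MATLAB sketch is shown below; the sampling rate, frame length, and LPC order are illustrative assumptions, not the settings adopted in this work.

```matlab
% Hypothetical one-frame illustration of the two front ends (assumed values).
fs = 8000;                      % assumed sampling rate (Hz)
frame = randn(200, 1);          % placeholder for one 25 ms windowed speech frame
a = lpc(frame, 10);             % temporal-domain (parametric) front end:
                                % 10th-order linear-prediction coefficients
mag = abs(fft(frame, 256));     % frequency-domain (nonparametric) front end:
mag = mag(1:129);               % one-sided magnitude spectrum, which MFCC-style
                                % pipelines warp onto a mel filter bank
```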

In another approach, the wavelet transform and the wavelet packet tree have been used
for speech feature extraction, with the energies of the wavelet-decomposed subbands
used in place of the Mel-filtered subband energies; however, time information is lost
when only the wavelet subband energies are kept. Speech is a nonstationary signal.
The Fourier transform (FT) is not suitable for analyzing such nonstationary signals
because it provides only the frequency content of the signal and gives no information
about when each frequency is present. The windowed short-time FT (STFT) does
provide temporal information about the frequency content, but a drawback of the
STFT is its fixed time resolution, a consequence of its fixed window length. The
wavelet transform, with its flexible time-frequency window, is an appropriate tool for
the analysis of nonstationary signals like speech, which contain both short
high-frequency bursts and long quasi-stationary components [2]. In a speech signal,
high frequencies are present very briefly at the onset of a sound, while lower
frequencies appear later and persist longer. The discrete wavelet transform (DWT)
resolves all these frequencies well: its parameters carry information at different
frequency scales, which yields the speech information of the corresponding frequency
band. To parameterize the speech signal, each frame can be decomposed into four
frequency bands, either uniformly or in dyadic fashion [2].
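As a minimal sketch of this dyadic decomposition (the frame length, wavelet family, and decomposition depth here are assumptions for illustration, not finalized design choices), the subband log energies could be computed in MATLAB as follows.

```matlab
% Minimal sketch: 3-level dyadic DWT of one speech frame, keeping per-subband
% log energies as features (frame, wavelet, and levels are assumptions).
frame = randn(256, 1);                      % placeholder speech frame
[C, L] = wavedec(frame, 3, 'db4');          % 3-level dyadic decomposition
bands = {appcoef(C, L, 'db4', 3), ...       % approximation: lowest band
         detcoef(C, L, 3), ...              % detail bands, coarse to fine
         detcoef(C, L, 2), ...
         detcoef(C, L, 1)};
feats = cellfun(@(c) log(sum(c.^2) + eps), bands);  % four log subband energies
```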

The Artificial Neural Network (ANN) is an efficient pattern recognition mechanism that
simulates the neural information processing of the human brain [3]. An ANN processes
information in parallel using a large number of processing elements called neurons,
organized as large interconnected networks of simple, nonlinear units. The
computational intelligence of a neural network arises from its processing units, their
characteristics, and their ability to learn. During learning, the network's parameters
vary over time; such networks are characterized by local and parallel computation,
simplicity, and regularity. The Multi-Layer Perceptron (MLP) architecture is used for
pattern classification in this work. An MLP consists of one or more hidden layers;
signals are transmitted in one direction, from input to output, so the architecture is
called feed-forward. MLP networks are trained with the backpropagation algorithm
and are widely used in machine learning applications. An MLP uses its hidden layers
to classify patterns into different classes: the inputs are fully connected to the first
hidden layer, each hidden layer is fully connected to the next, and the last hidden
layer is fully connected to the outputs [3]. The wavelet transform is an elegant tool
for the analysis of nonstationary signals like speech, and published results have shown
that a hybrid architecture combining discrete wavelet transforms and neural networks
can effectively extract features from the speech signal for automatic speech
recognition.
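A minimal sketch of such an MLP classifier stage, using MATLAB's Neural Network Toolbox, is given below; the feature dimension, hidden-layer size, and random training data are placeholders, not results or settings from this work.

```matlab
% Minimal sketch of a feed-forward MLP trained by backpropagation
% (all sizes and data below are illustrative assumptions).
X = rand(8, 500);                     % 500 training feature vectors, 8-dimensional
T = full(ind2vec(randi(10, 1, 500))); % one-hot targets for 10 digit classes
net = patternnet(20);                 % feed-forward MLP, one hidden layer of 20
net.trainFcn = 'trainscg';            % scaled-conjugate-gradient backpropagation
net = train(net, X, T);               % supervised training
scores = net(X);                      % forward pass: class scores per column
[~, predicted] = max(scores, [], 1);  % predicted digit class per input
```

The one-hot target coding and the winner-take-all readout mirror the ten-class isolated-digit setting.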

Importance / Rationale of Proposed Investigation:
Speech is the predominant mode of human communication in everyday interaction,
and it is likely to become the preferred mode for human-computer interaction as well.
To date, spoken language has been used in human-computer interaction mostly for
information access and extraction. There is a need to treat spoken language not only
as a means of accessing information but also as a source of information in itself,
which would make it more important for accessing, sorting, editing, and translating
content. Because speech signals are nonstationary in nature, speech recognition is a
complex task: speakers differ in gender, emotional state, accent, pronunciation,
articulation, nasality, pitch, volume, and speaking rate. The presence of background
noise and other disturbances adds further complexity. Effective speech feature
extraction and classification would improve speech quality, represent the speech
signal in terms of frequency and bandwidth, and improve recognition. Speech
processing is useful for applications such as mobile services, healthcare, automatic
translation, robotics, video games, transcription, audio and video database search,
household appliances, and language learning.

Scope of Research:

The scope of our research is limited to investigating the combined performance
of the Wavelet Transform (WT) and the Artificial Neural Network (ANN) for feature
extraction and classification of isolated Marathi digits. We propose to use the
discrete wavelet transform to derive effective, efficient, and noise-robust features
from the frequency subbands of each speech frame; the subband features are then
classified using an ANN.

Review of Literature:

A study of speaker-independent recognition of 10 English vowels in isolated words
compared an ear model with an FFT as the preprocessing stage [4]. The FFT used a
mel scale and the same number of filters (40) as the ear model. Recognition was
performed with a neural network having one hidden layer of 20 units. The authors
obtained 87% recognition with FFT preprocessing versus 96% with the ear model, an
example of the successful application of knowledge about human audition to the
automatic recognition of speech by machines [4].

Paper [5] describes an FFT-based companding algorithm for preprocessing speech
before recognition. The algorithm mimics tone-to-tone suppression and masking in
the auditory system to improve automatic speech recognition performance in noise.
It is also very computationally efficient and well suited to digital implementation
because it uses the FFT. In an automotive digit recognition task on the CU-Move
database, recorded in real environmental noise, the algorithm improved the relative
word error rate by 12.5% at -5 dB signal-to-noise ratio (SNR) and by 6.2% across all
SNRs (-5 dB to +15 dB).

Signal processing algorithms have been applied to sound recognition for musical
instruments [6]. Investigations were carried out for different input data, with the best
results obtained for two-, three-, four-, and five-second samples: recognition
efficiency was 100% for each musical instrument sound. The system used the FFT and
a K-NN classifier based on cosine distance, together with a band-pass filter passing
frequencies of 0-1378 Hz [6].

Whistle sound detection using the FFT was performed in paper [7]. Experiments
showed that frequency-spectrum analysis of the sound signal is well suited to the task
because it exhibits a high degree of immunity to noise, a finding that allowed the
creation of a simpler and lighter detection algorithm. Other similar noises can
probably be confused with a whistle under frequency-spectrum analysis; a more
thorough investigation is necessary to find those sounds, and parameter tuning is
important in order to constrain the value margins. Another important finding was that
this method cannot easily distinguish between slightly different referee whistles
unless their frequencies differ strongly [7].

Paper [8] demonstrated three advantages of its proposed speech recognition system.
First, the system is implemented on an embedded platform with an FPGA chip and an
SOC architecture, making it flexible enough to cooperate with other digital systems.
Second, for the hardware computation of the FFT, an integer FFT is adopted in place
of a floating-point FFT, and a hardware realization algorithm for the integer FFT is
proposed; its use decreases the FFT calculation time by 57%, a substantial reduction
suited to real-time operation on an embedded platform. Third, experimental results
show a speech recognition rate better than that of existing work. However, despite
the improved FFT computation time, the recognition time of the proposed system is
still too long for real-time applications; multi-core and parallel processing of the
speech recognition algorithm are necessary to further improve the recognition time
and are worth examining in future research.

Moreover, simplifying the ANN architecture without reducing the recognition rate
could also speed up recognition and is another important topic for future research [8].

Paper [9] examines automatic bird identification from a signal processing and pattern
recognition perspective. A new technique based on spectral ensemble average voice
prints is proposed for automatic bird call identification, and its results are compared
with conventional approaches such as dynamic time warping (DTW) and Gaussian
mixture modelling (GMM). The Spectral Ensemble Average Voice Print (SEAV) is
found to be computationally inexpensive and to perform better than the DTW
technique. Classifier combination at the rank and measurement levels is tried, and a
two-level classifier combination is also attempted. The SEAVs of various birds differ
considerably from one another, which calls for further analysis of whether the
technique could identify bird songs based on syllable identification; the theoretical
basis of using the SEAV for various bird recognition tasks also needs to be analyzed.

Another proposed system conducts an online examination using voice, which is
helpful for people who are blind and for people who do not use a keyboard to interact
with the system [10]. Voice is taken as input through a microphone to register for the
online examination. At login, the user is authenticated by voice, with the FFT used to
compare the input voice against a template voice; the user then proceeds with the
questions, which are displayed and read aloud by the system [10].

In [11], the authors emphasized the use of DTW and K-NN techniques for
recognizing Bangla speech, since no such work had been reported, and evaluated the
performance from several aspects. For different speakers, however, performance
decreases by almost 10 to 15%. Recognizing continuous speech with an ANN
classifier gives an average accuracy of 73.36%; a three-layer backpropagation neural
network reaches a maximum accuracy of 86.67%; and spoken-letter recognition by
Euclidean distance, which can recognize only the vowels, has an 80% accuracy rate.

In comparison, the recognizer presented in that paper has an average accuracy of
90%; spoken-letter recognition using DTW alone achieves 80%, and adding K-NN
increases this by almost 10% [11].

Support vector machines (SVMs) have proven to be a powerful technique for pattern
classification [12]. SVMs map inputs into a high-dimensional space and then separate
classes with a hyperplane; a critical aspect of using SVMs successfully is the design
of the inner product, the kernel, induced by the high-dimensional mapping. The
authors consider the application of SVMs to speaker and language recognition. A key
part of the approach is a kernel that compares sequences of feature vectors and
produces a measure of similarity. The sequence kernel is based upon generalized
linear discriminants, and this strategy has several important properties. First, the
kernel uses an explicit expansion into the SVM feature space, which makes it possible
to collapse all support vectors into a single model vector and keeps computational
complexity low. Second, the SVM builds upon a simpler mean-squared-error
classifier to produce a more accurate system. Finally, the system is competitive with,
and complementary to, other approaches such as Gaussian mixture models
(GMMs) [12].

Numeral recognition remains one of the most important problems in pattern
recognition, with numerous applications including reading postal zip codes, passport
numbers, and employee codes; form processing; bank cheque processing; postal mail
sorting; job application sorting; automatic scoring of multiple-choice tests; and video
gaming. To the best of the authors' knowledge, little such work has been done in
Indian languages, especially Marathi, compared with non-Indian languages. Paper
[13] discusses an effective method for recognizing isolated Marathi numerals: it
presents a Marathi database and an isolated numeral recognition system based on
Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction and Dynamic
Time Warping (DTW) for feature matching, i.e., for comparing the test patterns. In
recent years there has been a steady movement towards speech technologies that
replace or enhance text input, known as mobile search applications; recently both
Yahoo! and Microsoft launched voice-based mobile search products. Future work
could improve the recognition accuracy of individual numerals by combining
multiple classifiers [13].

Voice recognition converts words spoken in a known language into written text, or
into commands for machines, depending on the purpose. The input to such a system
is voice: the system identifies the spoken word(s), and the result is written text on a
screen or a movement of a machine's mechanical parts. One study focused on
analyzing the matching process used to command a multipurpose machine such as a
robot, using Linear Predictive Coding (LPC) and a Hidden Markov Model
(HMM) [14]. LPC analyzes voice signals by encoding their characteristics into LPC
coefficients, while the HMM is a signal modelling approach in which voice signals
are analyzed to find the maximum-probability word for a new input, based on a
predefined codebook. The process could recognize five basic robot movements,
"forward", "reverse", "left", "right", and "stop", in the desired language [14].

Another paper describes the implementation of a speech recognition system on a
mobile robot for controlling its movement [15]. The methods used are Linear
Predictive Coding (LPC) for extracting the features of the voice signal and an
Artificial Neural Network (ANN), trained by backpropagation, for recognition. Voice
signals are sampled directly from the microphone and processed with LPC to extract
features; for each voice signal, LPC produces 576 data values, which become the
inputs of the ANN. The ANN was trained on 210 training samples covering the
pronunciation of the seven command words, collected from 30 different people.
Experimental results show a highest recognition rate of 91.4%, obtained with 25
samples per word, one hidden layer, 5 neurons per hidden layer, and a learning rate
of 0.1 [15].

Another paper proposes an approach to recognizing the English words for the digits
zero to nine, spoken in isolation by different male and female speakers [16]. A feature
set combining Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive
Coding (LPC), Zero Crossing Rate (ZCR), and Short-Time Energy (STE) of the audio
signal is used to generate a 63-element feature vector, which is subsequently used for
discrimination. Classification is done with artificial neural networks (ANN) using a
feed-forward backpropagation architecture. When the proposed approach is tested on
a dataset of 280 speech samples, the combined features achieve an accuracy of 85%,
more than is obtained using any of the features singly [16].

Although speech recognition products are already on the market, their development is
based mainly on statistical techniques that work under very specific assumptions. The
work presented in one thesis investigated the feasibility of alternative approaches for
solving the problem more efficiently [17]. A speech recognizer comprising two
distinct blocks, a feature extractor and a recognizer, was presented. The feature
extractor uses a standard LPC cepstrum coder, which translates the incoming speech
into a trajectory in the LPC cepstrum feature space, followed by a self-organizing
map, which tailors the coder's output to produce optimal trajectory representations of
words in reduced-dimension feature spaces. Recognizer designs based on three
different approaches were compared: the performance of template-based, multi-layer
perceptron, and recurrent neural network recognizers was tested on a small, isolated,
speaker-dependent word recognition problem. Experimental results indicate that
trajectories in such reduced-dimension spaces can provide reliable representations of
spoken words while reducing the training complexity and the operation of the
recognizer [17].

Paper [18] presents a Bangla speech recognition system divided into two major parts:
speech signal processing and speech pattern recognition. The speech processing stage
consists of detecting the start and end points of the speech, windowing, filtering,
calculating the linear predictive coding (LPC) and cepstral coefficients, and finally
constructing the codebook by vector quantization. The second part is a pattern
recognition system using an artificial neural network (ANN). Speech signals were
recorded with an audio recorder in a normal room environment; each recorded signal
was passed through the start/end-point detection algorithm to detect the presence of
speech and remove the silences and pauses, then filtered to remove unwanted
background noise. The filtered signal was windowed with half-frame overlap, after
which the LPC and cepstral coefficients were computed. The feature extractor uses a
standard LPC cepstrum coder, which converts the incoming speech signal into the
LPC cepstrum feature space; a self-organizing map (SOM) neural network then maps
each variable-length LPC trajectory of an isolated word onto a fixed-length
trajectory, producing the fixed-length feature vector fed to the recognizer. The neural
network is designed with a multi-layer perceptron approach and tested with 3, 4, and
5 hidden layers using tanh-sigmoid transfer functions. Different network structures
are compared for a better understanding of the problem and its possible solutions.

Automatic Emotion Recognition (AER) from speech is of growing significance for
better man-machine interfaces and robotics [19]. Studies of speech emotion are
closely tied to the databases used for the analysis. The authors created and analyzed
three emotional speech databases, using the Discrete Wavelet Transform (DWT) for
feature extraction and an Artificial Neural Network (ANN) for pattern classification,
with a Daubechies mother wavelet. Recognition accuracy was found to vary with the
database used: overall accuracies of 72.05%, 66.05%, and 71.25% were obtained for
the male, female, and combined male-and-female databases respectively [19].

Paper [20] proposes three novel noise-robustness techniques for speech recognition
based on the discrete wavelet transform (DWT): wavelet filter cepstral coefficients
(WFCCs), sub-band power normalization (SBPN), and lowpass filtering plus zero
interpolation (LFZI). According to experiments, the proposed WFCC provides a more
robust c0 (the zeroth cepstral coefficient) for speech recognition, and with proper
integration of WFCCs and conventional MFCCs, the resulting compound features can
enhance recognition accuracy. Second, the SBPN procedure reduces the power
mismatch within each modulation spectral sub-band and thus improves recognition
accuracy significantly. Finally, LFZI can reduce the storage space for speech features
while remaining helpful for speech recognition under noisy conditions [20].

A voice conversion framework is expected to emphasize both the static and dynamic
characteristics of the speech signal.

Conventional approaches such as Mel-frequency cepstrum coefficients and linear
predictive coefficients focus on spectral features limited to the lower frequency
bands. Paper [21] presents a novel wavelet packet filter bank approach to identify
non-uniformly distributed dynamic characteristics of the speaker. Its contribution is
threefold. First, in the feature extraction stage, the dyadic wavelet packet tree
structure is optimized to involve less computation while preserving the
speaker-specific features. Second, in the feature representation step, magnitude and
phase attributes are treated separately, exploiting the fact that raw time-frequency
traits are highly correlated yet carry intelligible speech information. Finally, an RBF
mapping function is established to transform the speaker-specific features from the
source to the target speakers. The results of the proposed filter-bank-based voice
conversion system are compared with baseline multiscale voice morphing results
using subjective and objective measures; the evaluation reveals that the proposed
method outperforms the baseline by incorporating the speaker-specific dynamic
characteristics and the phase information of the speech signal [21].

The study [22] proposes an improved feature extraction method called Wavelet
Cepstral Coefficients (WCC). In traditional cepstral analysis, the cepstra are
calculated using the Discrete Fourier Transform (DFT). Because the DFT calculation
assumes the signal is stationary within each frame, which in practice is not quite true,
the WCC replaces the DFT block in the traditional cepstrum calculation with the
Discrete Wavelet Transform (DWT). To evaluate the proposed WCC, a speech
recognition task of recognizing the 26 English alphabets was conducted, with
comparisons against traditional Mel-Frequency Cepstral Coefficients (MFCC) to
further analyze the WCC's effectiveness. The WCCs showed comparable results
considering their small vector dimension relative to the MFCCs; the best recognition
was found for WCCs at level 5 of the DWT decomposition, with small differences of
1.19% and 3.21% relative to the MFCCs for the speaker-independent and
speaker-dependent tasks respectively [22].

A method of speaker recognition based on wavelet functions and neural networks
was presented in [23]. The wavelet functions are used to obtain the approximation
function and the details of the speaker's averaged spectrum, extracting the speaker's
voice characteristics from the frequency spectrum; the approximation function and
the details are then used as input data for decision-making neural networks. In this
recognition process, not only is a decision made on the speaker's identity, but the
probability that the decision is correct can also be provided [23].

In [24], a modified voice identification system is described using oversampled Haar
wavelets followed by proper orthogonal decomposition. The audio signal is
decomposed using oversampled Haar wavelets, converting it into various
non-correlating frequency bands, which allows calculation of the linear predictive
cepstral coefficients to capture the characteristics of individual speakers. An adaptive
threshold is applied to reduce noise interference, followed by a multi-layered vector
quantization technique to eliminate interference between multiband coefficients.
Finally, proper orthogonal decomposition is used to extract unique characteristics
capturing finer details of phoneme characters [24].

A performance analysis of voice activity detection (VAD) algorithms based on
wavelets and the AMR-WB (Adaptive Multi-Rate Wideband) speech codec was
developed in [25], with an HMM classifier used for pattern matching. Experimental
results showed that the wavelet approaches gave good results in clean, noisy, and
reverberant environments with respect to speech clipping, at much lower
computational complexity; the wavelet-based approaches improved on the
performance of the AMR VAD [25].

Speech signal analysis using the wavelet transform was performed in [26]; it
efficiently locates spectral changes in the speech signal as well as the beginnings and
ends of sounds. Experimental results show the superiority of DWT methods over
more classical ones such as Mel-Frequency Cepstral Coefficients (MFCC) [27].
Analysis of the power in different frequency subbands gives an excellent opportunity
to distinguish the beginnings and ends of phonemes.

In paper [28], new efficient feature extraction methods for speech recognition are
proposed. The features are obtained from cepstral-mean-normalized, reduced-order
Linear Predictive Coding (LPC) coefficients derived from speech frames decomposed
using the Discrete Wavelet Transform (DWT); LPC coefficients derived from
wavelet-decomposed subbands of a speech frame provide a better representation than
modelling the frame directly. It is shown experimentally that the proposed approach
provides effective (better recognition rate) and efficient (reduced feature vector
dimension) features. A speech recognition system using the Continuous Density
Hidden Markov Model (CDHMM) was implemented, and the proposed algorithms
were evaluated on an isolated Marathi digits database in the presence of white
Gaussian noise [28].

In paper [29], an automatic speech recognition system is designed for isolated spoken
words in Malayalam using discrete wavelet transforms and artificial neural networks.
Good identification performance, with a high recognition accuracy of 90%, is
obtained, and the computational complexity and feature vector size are reduced to a
great extent by using discrete wavelet transforms; the wavelet transform is thus an
elegant tool for the analysis of nonstationary signals like speech. The experimental
results show that this hybrid architecture of discrete wavelet transforms and neural
networks can effectively extract features from the speech signal for automatic speech
recognition. The experiment used a limited number of samples, and the recognition
rate can be increased by increasing the number of samples; the neural network
classifier used in the experiment provides good accuracy.

DWT- and LPC-based techniques (UWLPC and DWLPC) for isolated word
recognition were presented in paper [30]. Experimental results show that the
proposed WLPC (UWLPC and DWLPC) features are effective and efficient
compared with LPCC and MFCC because they combine the advantages of LPC and
DWT when estimating the features. The feature vector dimension for WLPC is
almost half that of LPCC and MFCC, which reduces the memory requirement and the
computation time. It is also observed that DWLPC performs better than UWLPC,
because dyadic (logarithmic) frequency decomposition mimics the human auditory
perception system better than uniform frequency decomposition [30].

A combination of wavelet packet signal processing and an adaptive-network-based
fuzzy inference system (ANFIS) to efficiently extract features from preprocessed real
speech signals, for automatic recognition among a variety of Turkish words, was
presented in paper [31]. These features make the expert system suitable for automatic
classification and interpretation of speech signals, and the results point to the
feasibility of designing new automatic speech recognition (ASR) assistance
systems [31].

Research gaps identified in the proposed field of investigation:

Research gaps identified in the proposed field of investigation are summarized in the
following points:

 Feature extraction and classification are major components that play a vital
role in speech recognition systems, so an efficient representation of speech
features and their classification is required.
 The accuracy of a speech recognition system can be improved by using a
hybrid architecture consisting of a Wavelet Transform (WT) and an Artificial
Neural Network (ANN).
 The ANN architecture should be simplified without reducing the recognition
rate.
 Isolated Marathi digits should be recognized quickly by speeding up the
recognition time.

Objectives of Research:

The objective of our research is to investigate the combined performance of the
Wavelet Transform (WT) and the Artificial Neural Network (ANN) for isolated
Marathi digits, so as to improve the accuracy of the speech recognition system:

 To decompose each frame of the speech signal into different frequency
subbands using the discrete wavelet transform.
 To derive effective, efficient, and noise-robust features from the frequency
subbands of each frame.
 To classify the subband features of each frame using an artificial neural
network (ANN).
 To determine the accuracy of the resulting speech recognition system.

Research Methodology:

The method of feature extraction and classification using the discrete wavelet
transform and an artificial neural network proceeds in stages: the speech signal is
divided into frames, each frame is decomposed into frequency subbands with the
discrete wavelet transform, features are derived from those subbands, and an
artificial neural network classifies the resulting feature vectors into digit classes.
(Block diagram not reproduced here.)
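A hedged end-to-end sketch of this pipeline in MATLAB is given below; the file name, frame size, wavelet, decomposition depth, and pooling scheme are all illustrative assumptions rather than the finalized implementation.

```matlab
% End-to-end sketch under assumed settings: frame the utterance, extract DWT
% subband log energies per frame, pool into one fixed-length vector, classify.
[x, fs] = audioread('digit.wav');       % hypothetical isolated-digit recording
x = x(:, 1);                            % use the first channel
frames = buffer(x, 256, 128);           % 256-sample frames with 50% overlap
feats = zeros(4, size(frames, 2));
for k = 1:size(frames, 2)
    [C, L] = wavedec(frames(:, k), 3, 'db4');        % dyadic DWT of one frame
    bands = {appcoef(C, L, 'db4', 3), detcoef(C, L, 3), ...
             detcoef(C, L, 2), detcoef(C, L, 1)};
    feats(:, k) = cellfun(@(c) log(sum(c.^2) + eps), bands);
end
v = [mean(feats, 2); std(feats, 0, 2)]; % utterance-level feature vector
% Each utterance's v would form one column of the input matrix passed to the
% ANN classifier (e.g., patternnet) for training and recognition.
```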

Hypothesis:
A hybrid architecture that combines the wavelet transform for feature extraction with
an artificial neural network (ANN) for classification will improve the accuracy of
isolated Marathi digit recognition.

Sources of Information:

 Research Papers from reputed Journals / Conferences / publishers.

Tools / Techniques of Research:

 MATLAB / Simulink programming environment (including the Neural Network Toolbox)

Plan of Research Work:

The research work is planned across six phases (Phase I to Phase VI), covering the
following activities (the phase-allocation chart is not reproduced here):

1. Literature survey
2. Study of software tools such as MATLAB/Simulink, the Neural Network Toolbox, and its MATLAB link
3. Survey of existing methods and algorithms
4. Suggesting techniques for removing limitations in existing algorithms
5. Simulation of combined strategies
6. Comparing results of developed strategies with existing algorithms
7. Performance evaluation and implementation
8. Documentation
9. Review and research paper preparation & presentation/publication
10. Thesis framework preparation and submission

Tentative Chapter Flow:

1. Introduction to Speech Processing, Wavelet Transform & ANN

2. Speech Feature Extraction using Wavelet Transform (WT)

3. Speech Feature Classification using Artificial Neural Network (ANN)

4. Performance Analysis of Speech Feature Extraction and Classification Techniques

5. Results & Conclusion

References:

[1] T. F. Quatieri, “Discrete-Time Speech Signal Processing”, Pearson Education, 2002.

[2] R. M. Rao, A. S. Bopardikar, “Wavelet Transform”, Pearson Education, 2005.

[3] J. M. Zurada, “Introduction to Artificial Neural Systems”, West, 1992.

[4] Yoshua Bengio, Renato De Mori, Regis Cardin, “Speaker Independent Speech Recognition with Neural Networks and Speech Knowledge”, Department of Computer Science, McGill University, pp. 218-225, 1990.

[5] Bhiksha Raj, Lorenzo Turicchia, Bent Schmidt-Nielsen, and Rahul Sarpeshkar, “An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition”, EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007, pp. 1-13, 2007.

[6] Adam Glowacz, Witold Glowacz, Andrzej Glowacz, “Sound Recognition of Musical Instruments with Application of FFT and K-NN Classifier with Cosine Distance”, AGH University of Science and Technology, 2010.

[7] Gil Lopes, Fernando Ribeiro, Paulo Carvalho, “Whistle Sound Recognition in Noisy Environment”, Universidade do Minho, Departamento de Electrónica Industrial, Guimarães, Portugal.

[8] Shing-Tai Pan, Chih-Chin Lai and Bo-Yu Tsai, “The Implementation of Speech Recognition Systems on FPGA-Based Embedded Systems with SOC Architecture”, International Journal of Innovative Computing, Information and Control, vol. 7, no. 11, pp. 6161-6175, November 2011.

[9] Hemant Tyagi, Rajesh M. Hegde, Hema A. Murthy and Anil Prabhakar, “Automatic Identification of Bird Calls Using Spectral Ensemble Average Voice Prints”, 14th European IEEE Signal Processing Conference, pp. 1-5, 2006.

[10] Dwijen Rudrapal, Smita Das, S. Debbarma, N. Kar, N. Debbarma, “Voice Recognition and Authentication as a Proficient Biometric Tool and its Application in Online Exam for P.H People”, International Journal of Computer Applications (0975-8887), vol. 39, no. 12, pp. 7-12, February 2012.

[11] Asm Sayem, “Speech Analysis for Alphabets in Bangla Language: Automatic Speech Recognition”, International Journal of Engineering Research, vol. 3, no. 2, pp. 88-93, February 2014.

[12] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-Carrasquillo, “Support Vector Machines for Speaker and Language Recognition”, Elsevier Journal of Computer Speech & Language, vol. 20, issue 2/3, pp. 210-229, 2006.

[13] Siddheshwar S. Gangonda, Dr. Prachi Mukherji, “Speech Processing for Marathi Numeral Recognition using MFCC and DTW Features”, International Journal of Engineering Research and Applications (IJERA), pp. 218-222, March 2012.

[14] Wahyu Kusuma R., Prince Brave Guhyapati V., “Simulation Voice Recognition System for Controlling Robotic Applications”, Journal of Theoretical and Applied Information Technology, vol. 39, no. 2, pp. 188-196, May 2012.

[15] Thiang and Suryo Wijoyo, “Speech Recognition Using Linear Predictive Coding and Artificial Neural Network for Controlling Movement of Mobile Robot”, International Conference on Information and Electronics Engineering, vol. 6, pp. 179-183, 2011.

[16] Bishnu Prasad Das, Ranjan Parekh, “Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers”, International Journal of Modern Engineering Research, vol. 2, pp. 854-858, May-June 2012.

[17] P. Zegers, “Speech Recognition Using Neural Networks”, MS Thesis, Department of Electrical & Computer Engineering, University of Arizona, 1998.

[18] Paul A. K., Das D., Kamal M. M., “Bangla Speech Recognition System Using LPC and ANN”, 7th IEEE International Conference on Advances in Pattern Recognition, pp. 171-174, 2009.

[19] Firoz Shah A., Raji Sukumar A. and Babu Anto P., “Discrete Wavelet Transforms and Artificial Neural Networks for Speech Emotion Recognition”, International Journal of Computer Theory and Engineering, vol. 2, no. 3, pp. 319-322, June 2010.

[20] Jeih-Weih Hung, Hao-Teng Fan, and Syu-Siang Wang, “Several New DWT-Based Methods for Noise-Robust Speech Recognition”, International Journal of Innovation, Management and Technology, vol. 3, no. 5, pp. 547-551, October 2012.

[21] Jagannath H. Nirmal, Mukesh A. Zaveri, Suprava Patnaik and Pramod H. Kachare, “A Novel Voice Conversion Approach Using Admissible Wavelet Packet Decomposition”, EURASIP Journal on Audio, Speech, and Music Processing, pp. 1-10, 2013.

[22] T. B. Adam, M. S. Salam, T. S. Gunawan, “Wavelet Cepstral Coefficients for Isolated Speech Recognition”, Telkomnika, vol. 11, no. 5, pp. 2731-2738, May 2013.

[23] Sanja Grubesa, Tomislav Grubesa, Hrvoje Domitrovic, “Speaker Recognition Method Combining FFT, Wavelet Functions and Neural Networks”, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia.

[24] Mohammed Anwer and Rezwan-Al-Islam Khan, “Voice Identification Using a Composite Haar Wavelets and Proper Orthogonal Decomposition”, International Journal of Innovation and Applied Studies, vol. 4, no. 2, pp. 353-358, October 2013.

[25] Marco Jeub, Dorothea Kolossa, Ramon F. Astudillo, Reinhold Orglmeister, “Performance Analysis of Wavelet-based Voice Activity Detection”, NAG/DAGA, Rotterdam, 2009.

[26] Beng T. Tan, Robert Lang, Heiko Schroder, Andrew Spray, Phillip Dermody, “Applying Wavelet Analysis to Speech Segmentation and Classification”, Department of Computer Science.

[27] Bartosz Ziolko, Suresh Manandhar, Richard C. Wilson and Mariusz Ziolko, “Wavelet Method of Speech Segmentation”, University of York, Heslington, YO10 5DD, York, UK.

[28] N. S. Nehe, R. S. Holambe, “New Feature Extraction Techniques for Marathi Digit Recognition”, International Journal of Recent Trends in Engineering, vol. 2, no. 2, November 2009.

[29] Sonia Sunny, David Peter S., K. Poulose Jacob, “Discrete Wavelet Transforms and Artificial Neural Networks for Recognition of Isolated Spoken Words”, International Journal of Computer Applications, vol. 38, no. 9, pp. 9-13, January 2012.

[30] N. S. Nehe, R. S. Holambe, “DWT and LPC Based Feature Extraction Methods for Isolated Word Recognition”, EURASIP Journal on Audio, Speech, and Music Processing, vol. 2012, pp. 1-7, 2012.

[31] Engin Avci, Zuhtu Hakan Akpolat, “Speech Recognition Using a Wavelet Packet Adaptive Network Based Fuzzy Inference System”, Elsevier Expert Systems with Applications, vol. 31, pp. 495-503, 2006.

Signature of Candidate with Date

Outlined Approved

Name & Signature of supervisor with Date
