
Published by : International Journal of Engineering Research & Technology (IJERT)

http://www.ijert.org ISSN: 2278-0181


Vol. 7 Issue 10, October-2018

Speech Recognition using Neural Networks


Mr. Hardik Dudhrejia
Department of Computer Engineering
G H Patel College of Engineering & Technology
Vadodara, India

Mr. Sanket Shah
Department of Computer Engineering
G H Patel College of Engineering & Technology
Anand, India

Abstract—Speech is the most common way for humans to interact. Since it is the most effective method for communication, it can also be extended to interact with systems; as a result, it has become extremely popular in no time. Speech recognition allows a system to interact with and process the data provided verbally by the user. Since the user can interact by voice, the user is not confined to the alphanumeric keys. Speech recognition can be defined as the process of recognizing the human voice to generate commands or word strings. It is also popularly known as ASR (Automatic Speech Recognition), computer speech recognition or speech to text (STT). Speech recognition draws on knowledge of diverse fields such as linguistics and computer science; it is not an isolated activity. Various techniques available for speech recognition are HMM (Hidden Markov Model)[1], DTW (Dynamic Time Warping)-based speech recognition[2], Neural Networks[3], deep feedforward and recurrent neural networks[4] and end-to-end automatic speech recognition[5]. This paper primarily focusses on the different types of neural networks used for automatic speech recognition, together with work done on speech recognition using these neural networks.

Keywords—Speech recognition; Recurrent Neural Network; Hidden Markov Model; Long Short Term Memory network

I. INTRODUCTION
Throughout their life-span humans communicate mostly through voice, since they learn the relevant skills at an early age and continue to rely on speech communication. It is therefore more efficient to communicate with speech than by using a keyboard and mouse. Voice recognition or speech recognition provides methods by which computers can be upgraded to accept speech input instead of input given by keyboard. It is extremely advantageous for disabled people.

Speech is affected greatly by factors such as pronunciation, accent, roughness, pitch, volume, background noise, echoes and gender. The preliminary stage of speech processing is the study of speech signals and the methods of processing these signals.

The conventional method of speech recognition consists in representing each word by its feature vector and pattern matching with the statistically available vectors. In contrast to the antediluvian HMM method, neural networks do not require prior knowledge of the speech process and do not need statistics of speech data. [3]

Types of speech recognition: Based on the type of words speech recognizing systems can recognize, speech recognition systems are divided into the following categories:
➢ Isolated Word:
Isolated word recognition requires each utterance to have quiet on both sides of the sample window. Only single words and single utterances are accepted at a time, and the system has a "Listen and Non-Listen" state.
➢ Continuous Word:
Continuous speech recognisers provide users the facility to speak in a continuous and almost natural fashion while the computer determines the content of the speech. Recognisers rendering continuous speech capabilities are difficult to create because they require special and peculiar methods to determine the boundaries of the utterances.
➢ Connected Word:
Connected words are very much like isolated words, but they allow separate utterances to be executed with "minimal pauses" between them.
➢ Spontaneous speech:
At an elementary level, spontaneous speech can be considered speech that comes out naturally and is not rehearsed. An automatic speech recogniser must be able to handle a wide range of speech features, such as words being run together.

Classification of speech sounds:
Speech sounds are commonly classified in two ways, depending on how the classification process is viewed:

Based on obstruction and non-obstruction of sounds
Classifying sounds with respect to obstruction and non-obstruction relies upon the conception of bodily air. While generating human sounds, the air coming out of the body behaves in one of two ways: it is obstructed somewhere in the mouth or throat, or it is not obstructed and comes out freely. Correspondingly, the sounds produced with and without obstruction differ in all but a few trivial qualities.

IJERTV7IS100087 www.ijert.org 196


(This work is licensed under a Creative Commons Attribution 4.0 International License.)

For example, all the vowels (a, e, i, o, u) are non-obstruction speech sounds and all the consonants (b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z) are obstruction speech sounds.

Based on voiced and voiceless sounds
Voiced sound is produced when the vocal cords vibrate as the sound is produced, whereas for a voiceless sound there is no vocal cord vibration. To test this, place your finger on your throat as you say the words: a vibration will be felt when voiced sounds are uttered, and no vibration will be felt while uttering a voiceless sound. Many times it is difficult to feel the difference between them, so another test can be performed by putting a paper in front of the mouth; the paper should move only when saying the voiceless sounds. All the vowels are voiced, whereas the consonants may be either voiced or voiceless.

Voiced consonants are: b, d, g, v, z, th, sz, j, l, m, n, ng, r, w, y
Voiceless consonants are: p, t, k, f, s, th, sh, ch, h

II. SPEECH RECOGNITION PROCESS
Speech recognition is truly a ponderous and tiresome process. It consists of 5 steps:
1. Speech
2. Speech Pre-Processing
3. Feature Extraction
4. Speech Classification
5. Recognition

Figure-1 Speech recognition process

Speech
Speech is defined as the ability to express one's thoughts and feelings by articulate sounds. Initially the speech of a person is received in the form of a waveform. There are numerous tools and software packages available which record the speech delivered by humans. The phonic environment and the recording equipment used have a significant impact on the speech captured. There is a possibility of background noise or room reverberation being blended with the speech, which is completely undesirable.

Speech Pre-Processing
The solution to the problem described above is speech pre-processing. It plays an influential role in cancelling out the trivial sources of variation. Speech pre-processing typically includes reverberation cancelling, echo cancellation, windowing, noise filtering and smoothing, all of which conclusively improve the accuracy of speech recognition.

Feature Extraction
Each person has different speech and different intonation, due to the different characteristics ingrained in their utterance. It should be possible, at least theoretically, to identify speech from the raw waveform. Because of the enormous variation in speech, there is a need to reduce that variation by performing feature extraction. The ensuing paragraphs depict some of the feature extraction techniques widely used nowadays.

LPC (Linear Predictive Coding): This is an extremely useful speech analysis technique for encoding quality speech at a low bit rate, and one of the most powerful methods. The key idea behind this method is that a specific speech sample at the current time can be approximated as a linear combination of past speech samples. In this method the digital signal is compressed for efficient storage and transmission. The principle behind LPC is to minimize the sum of squared differences between the original speech and the estimated speech over a finite duration. It can further be used to provide a unique set of predictor coefficients. The gain (G) is also a crucial parameter.

MFCC (Mel Frequency Cepstral Coefficients): This is the standard method of feature extraction. It operates in the frequency domain, on the Mel scale, which is based on the scale of the human ear. MFCCs are more accurate than time domain features since they fall into the category of frequency domain features. Their most conspicuous impediment is sensitivity to noise, as they are highly dependent on the spectral form. Techniques utilizing the periodicity of speech signals could be used to overcome this drawback, although speech also encompasses aperiodic content.
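To make the LPC idea above concrete — approximating the current sample as a linear combination of past samples — the following sketch estimates predictor coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. This is a generic illustration, not code from the paper; the model order and the test frame used below are arbitrary choices.

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients a[0..order] (with a[0] = 1) for one frame.

    Minimizes the squared prediction error, so that x[t] is approximated
    by -sum_{j=1..order} a[j] * x[t-j].  Returns the coefficients and the
    residual energy (related to the gain G mentioned in the text).
    """
    n = len(frame)
    # Autocorrelation of the frame for lags 0..order
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    # Levinson-Durbin recursion solves the Toeplitz normal equations
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]   # sum_{j<i} a[j] * r[i-j]
        k = -acc / err                        # reflection coefficient
        prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        err *= 1.0 - k * k                    # updated prediction error
    return a, err
```

A quick sanity check for an implementation like this: for a frame generated by a known second-order recursion, lpc(frame, 2) recovers coefficients close to the generating ones.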


Speech Classification
These systems are used to extract the hidden information from the processed input signals and comprise convoluted mathematical functions. This section describes some commonly used speech classification techniques in brief.

HMM (Hidden Markov Model): This is the most widely used method for recognizing patterns in speech. It is safer and possesses a more secure mathematical foundation compared to the template-based and knowledge-based approaches. In this method, the system being modelled is assumed to be a Markov process having hidden states. The speech is divided into smaller resounding entities, each of which represents a state. In a simple Markov model, the states are clearly visible to the user and thus the state transition probabilities are the only parameters. In a hidden Markov model, on the other hand, the state is not directly visible, but the output, which is dependent on the state, is evident. HMMs are specifically known for their application in reinforcement learning and in pattern recognition such as speech, handwriting and bioinformatics.

DTW (Dynamic Time Warping): In time series analysis, DTW is an algorithm which measures the similarity or affinity between two temporal sequences that vary in speed or time. It correlates the spoken words with reference words. This method warps the time dimension of the unknown word until it matches the reference word. A well-known application of DTW is automatic speech recognition, in order to cope with different speaking speeds. Other applications include online signature recognition and speaker recognition.

VQ (Vector Quantization): This method is primarily based on the block coding principle. The technique allows the modelling of probability density functions by the distribution of prototype vectors. It was formerly used for data compression. It performs the mapping of vectors from a vast vector space to a finite number of regions in that space. Each region is known as a cluster and can be depicted by its centre, called a code word. Vector quantization is used in lossy data compression, lossy data correction, clustering and pattern recognition.

Recognition
After the above four phases (speech recording, speech pre-processing, feature extraction and speech classification), the final remaining step is recognition. Once all the above-mentioned steps are completed successfully, the recognition of speech can be done by three approaches:
1. Acoustic phonetic approach[6]
2. Pattern recognition approach[7]
3. Artificial intelligence approach

This paper is mainly concerned with the artificial intelligence approach for speech recognition, which is a combination of the acoustic phonetic and pattern recognition approaches. In this approach, systems created with neural networks are used to classify and recognize the sound. Neural networks are very powerful for recognition of speech, and various networks exist for this process: RNN, LSTM, deep neural networks and hybrid HMM-LSTM are used for speech recognition.

III. NEURAL NETWORKS
Traditionally, "neural network" referred to a network or circuit of neurons. At present the term refers to an Artificial Neural Network, consisting of artificial neurons or nodes. It is a network of elementary elements known as artificial neurons, which receive an input, change their state according to that input and generate an output. An interconnected group of natural or artificial neurons uses a mathematical model for information processing based on a connectionist approach to computation. Neural networks can be treated as simple mathematical models defining a function f: X → Y, or a distribution over X or over both X and Y [8], but many times models are intimately associated with a particular learning rule. The term "artificial neural network" refers to the inter-connections among the neurons in the different layers of each system. Mathematically, a neuron's network function f(x) is defined as a composition of other functions gi(x), which can further be defined as compositions of other functions. The most commonly used type of composition is the nonlinear weighted sum, f(x) = K(∑i wi gi(x)), where K is a predefined function. It is convenient to refer to the collection of functions gi simply as a vector g. The first view is the functional view: the given x is transformed into a 3-dimensional vector h, which is then finally transformed into f(x); this view is frequently encountered in the context of optimization. The second is the probabilistic view: the random variable G = g(H) depends on H = h(X), which in turn depends on the random variable X; this view is most commonly encountered in the context of graphical models. Networks such as the one described above are often called feed-forward, because their graph is a DAG (Directed Acyclic Graph). Networks which contain cycles are called recurrent neural networks.

Neural Network Models:
Language modelling and acoustic modelling are both vital aspects of modern statistically-based speech recognition systems. The ensuing section illustrates various models used for speech recognition.

Deep Neural Network – Hidden Markov Model: HMM is a generative model in which observable acoustic features are assumed to be generated from a hidden Markov process that transitions between states S = {s1, s2, ..., sk}. HMM was the most widely used technique for Large Vocabulary Speech Recognition (LVSR) for at least two decades. The decisive parameters in the HMM are the initial state probability distribution π = {p(q0 = si)}, where qt is the state at time t; the transition probabilities aij = p(qt = sj | qt−1 = si); and a model to estimate the observation probabilities p(xt | si). The conventional HMMs used in automatic speech recognition had their observation probabilities modelled by a Gaussian Mixture Model (GMM). Even though GMMs have a vast number of advantages, the issue is that they are statistically inefficient for modelling data that lie on or near a non-linear manifold in the space. For instance, modelling points residing very close to the surface of a sphere hardly requires any parameters using a suitable model class, but it requires a large number of diagonal Gaussians or a fairly huge number of full-covariance Gaussians. Because of this, other types of models may work better than GMMs if the exploitation of the information embedded in a large window of frames is done well. On the


other hand, an ANN (Artificial Neural Network) can handle data residing on or near a non-linear manifold more effectively and learn much better models of the data. Over the past few years, outstanding advances have been made both in machine learning algorithms and in computer hardware, which have led to more efficient methods of training networks having many layers and a large output layer. The output layer must hold the great number of HMM states that arise from each phone being modelled by a number of different triphones. By employing these new learning methods, a vast number of research groups have shown that deep neural networks perform better than GMMs at acoustic modelling for speech recognition on tasks with massive vocabularies and behemoth datasets. A Deep Neural Network is a feed-forward Artificial Neural Network that has more than one layer of hidden units between its inputs and outputs. Each hidden unit j in a DNN uses the logistic function to map its total input from the layer below, xj, to the scalar state, yj, that it sends to the layer above:

yj = logistic(xj) = 1 / (1 + e^(−xj)),   xj = bj + ∑i yi wij

where bj is the bias of unit j, i is an index over units in the layer below, and wij is the weight on the connection to unit j from unit i in the layer below. For multiclass classification, output unit j converts its total input, xj, into a class probability, pj, by using the softmax non-linearity:

pj = exp(xj) / ∑k exp(xk)

where k is an index over all classes.

DNNs can be trained by back-propagating derivatives of a cost function that measures the divergence between the target outputs and the actual outputs produced. With a softmax output, the natural cost function C is the cross-entropy

C = − ∑j dj log pj

where dj represents the target probability and pj the output of the softmax.
DNNs having many hidden layers are difficult to optimize. Gradient descent from an arbitrary starting point near the origin is not the optimal way to find a good set of weights, and unless the initial scales of the weights are carefully chosen, the back-propagated gradients will have very different magnitudes in different layers. In addition to these issues, DNNs may generalize poorly to test data. The layers of a DNN are quite flexible, each with a large number of parameters, which makes DNNs capable of modelling very intricate and non-linear relationships between inputs and outputs. There is a possibility of severe over-fitting; this can be reduced by early stopping or weight penalties, but only at the cost of a considerable amount of modelling power. A large dataset can reduce the over-fitting while preserving the modelling power, but it increases the computational cost. Therefore there is a prominent need to use the information in the training set to build multiple layers of non-linear feature detectors.

Recurrent Neural Network: A Recurrent Neural Network is a kind of Artificial Neural Network whose connections form a directed cycle, so that nodes feed back into other nodes. Two units become dynamic as soon as communication takes place between them. Since an RNN uses internal memory, unlike feed-forward networks, to process sequences of arbitrary inputs, this makes it an ideal choice for speech recognition. The key feature of an RNN is that activations flow round in a loop, as the network contains at least one feed-back connection, thus allowing the network to do temporal processing and learn sequences.

Figure 1 Recurrent Neural Network[9]

The elementary structure is a feed-forward DNN having an input and an output layer and a certain number of hidden layers with full recurrent connections. Even for basic architectures, learning in an RNN can be achieved. Fig. 2 represents the simplest form of a fully recurrent neural network: an MLP (Multi-Layer Perceptron) with the previous set of hidden unit activations h(t) feeding back into the network along with the next inputs. The time scale t refers to the operation of real neurons; as far as artificial systems are concerned, any time step size relevant to the given problem can be used. In order to hold the activations until they are processed at the next time step, a delay unit is purposely introduced.
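The recurrent structure just described — hidden activations delayed one step and fed back alongside the new input — can be sketched as a short forward pass. The sketch below also reuses the softmax output from the DNN equations earlier in this section; the tanh choice for the hidden non-linearity and all the layer sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Simple recurrent forward pass:
       h_t = tanh(Wxh x_t + Whh h_{t-1} + bh),  y_t = softmax(Why h_t + by)."""
    h = np.zeros(Whh.shape[0])        # h_0: hidden state starts at zero
    ys = []
    for x in xs:                      # one step per input frame
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # delayed h fed back here
        ys.append(softmax(Why @ h + by))
    return np.array(ys)

# Tiny usage example with random weights (4 inputs, 8 hidden units, 3 classes)
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(8, 4)) * 0.1
Whh = rng.normal(size=(8, 8)) * 0.1
Why = rng.normal(size=(3, 8)) * 0.1
bh, by = np.zeros(8), np.zeros(3)
probs = rnn_forward(rng.normal(size=(5, 4)), Wxh, Whh, Why, bh, by)
# probs holds one probability distribution over the 3 classes per time step
```

Because the same Whh is applied at every step, the hidden state at time t depends on the whole input history, which is the temporal processing property the text attributes to RNNs.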


LSTM (Long Short Term Memory): LSTM is an RNN architecture that, in addition to regular network units, contains LSTM blocks. LSTM blocks are generally referred to as "smart" network units that possess the capability of remembering a value for an arbitrary length of time. They contain "gates" whose function is to determine when the input is significant enough to remember, when to forget, and when to output the value. In the LSTM architecture, a set of recurrently connected subnets known as "memory blocks" resides in the recurrent hidden layer. To control the flow of information, each memory block contains one or more self-connected memory cells and three multiplicative gates. The flow of information in each cell of the LSTM is secured by the learned input and output gates; the forget gate is added for the purpose of resetting the cells. A conventional LSTM can be defined as follows:

Given an input sequence x = (x1, x2, ..., xT), a conventional RNN computes the hidden vector sequence h = (h1, h2, ..., hT) and the output vector sequence y = (y1, y2, ..., yT) from t = 1 to T as follows:

ht = H(Wxh xt + Whh ht−1 + bh)
yt = Why ht + by

where the W are weight matrices, the b are bias vectors and H(·) is the recurrent hidden layer function. The following figure illustrates the architecture of the LSTM:

Figure 2 Architecture of LSTM network having a single memory block[10]

Limitations of LSTM:
Because LSTM has the capacity to store only one of its inputs, the error will not be reduced, which in turn won't moderate the error by solving the sub-goal. The full gradient can be used as a solution to this problem.
a. However, the full gradient also has some limitations: 1) it is very complex; 2) the error flow is visible only when a truncated LSTM is used.
b. The weight factor increases to 32 as a single hidden unit is replaced by 3 units. Therefore, a single memory block requires two additional cell blocks.
c. The problems associated with feed-forward nets also persist in the LSTM, because it behaves like a feed-forward neural network trained by back-propagation to see the complete input string.
d. A practical problem in all gradient-based approaches is the "counting the time steps" problem.

Advantages of LSTM:
1. The constant error back-propagation in LSTM allows it to bridge extremely long time lags on problems like those discussed above. Even with long time lags, LSTM can handle noise, continuous values and distributed representations. Unlike a hidden Markov model, LSTM does not need an a priori choice of a finite number of states; it can deal with infinite state numbers.
2. With respect to the problems discussed in this paper, LSTM generalizes well irrespective of irrelevant and widely spread inputs in the input sequence. It has the capability of quickly learning to differentiate between two or more widely separated occurrences of a particular element in the input sequence, without depending on short-time-lag training exemplars.
3. It doesn't require parameter tuning at all; it works pretty well over a wide range of parameters such as input and output gate bias and learning rate.
4. The LSTM algorithm's complexity per weight and time step is O(1). This is considered extremely advantageous and outruns other approaches such as RTRL.

IV. LITERATURE SURVEY
In the early 1920s speech recognition came into existence. The first machine to recognize speech was named Radio Rex (manufactured in 1920). After that, research began at Bell Labs in 1936[12]. In 1939, Bell Labs demonstrated a speech synthesis machine at the World Fair in New York. In 1952, three Bell Labs researchers, S. Balashek, R. Biddulph, and K. H. Davis, built a system called "Audrey", an automatic digit recognizer for single-speaker digit recognition. Their system worked by locating the formants in the power spectrum of each utterance. [13] The 1950s-era technology was limited to single-speaker systems with vocabularies of around ten words. Michael Price, James Glass and Anantha P. Chandrakasan (2015) [14] described an IC that provides a local speech recognition capability for a variety of electronic devices: on 5,000-word recognition tasks running in real time with a 13.0% word error rate and 6.0 mW core power consumption, the chip provides a search efficiency of approximately 16 nJ per hypothesis. The vowel recognizer of Forgie and Forgie, constructed at MIT Lincoln Laboratory in 1959, recognized 10 vowels embedded in a /b/-vowel-/t/ format in a speaker-independent manner.[15]
In the late 1960s, Raj Reddy was the first to take on continuous speech recognition, at Stanford University. Early systems were based on a pause between each word; this was the first continuous speech recognition approach. In the meanwhile, the Soviet Union used the DTW algorithm to build a 200-word vocabulary speech recognition machine.
In the 1970s, Velichko and Zagoruyko studied discrete utterance (isolated word) recognition in Russia,[17] as did Itakura in the United States and Sakoe and Chiba in Japan[18]. In 1971, an ambitious speech understanding project was funded by the Defense Advanced Research Projects Agency (DARPA). The IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts in 1972.
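DTW, described earlier in this paper and used by several of the early systems surveyed above, reduces to a small dynamic program. The sketch below is a generic textbook version with an absolute-difference cost and no warping-window constraint — an illustration of the technique, not the code of any cited system.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences using |a_i - b_j| as the
    local cost.  Returns the minimal accumulated alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, and deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

Identical sequences give a cost of 0, and a sequence aligned against a time-stretched copy of itself also costs 0, which is exactly the insensitivity to speaking rate that made DTW attractive for word matching.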


In the mid-1980s, IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary, under the lead of Fred Jelinek. [16] In this era, neural networks emerged as an attractive model for automatic speech recognition. Speech research in the 1980s shifted from the template-based approach to statistical modelling, mainly known as the Hidden Markov Model approach. Applying neural networks to speech recognition was reintroduced in the late 1980s; neural networks were first introduced around 1950, but for practical problems they were not then efficient enough.
In the 1990s, Bayes classification was transformed into an optimization problem, which also reduces the empirical errors. A key issue in the design and implementation of a speech recognition system is how to choose the proper method and speech material used to train the recognition algorithm. Training can be supervised learning, in which the class is labelled in the training data and the algorithm predicts the labels of unlabelled data. Stephen V. Kosonocky [11] researched how neural networks can be used for speech recognition in 1995.

In 2005, Giuseppe Riccardi [19] developed Variational Bayesian (VB) learning to solve the problem of adaptive learning in speech recognition and proposed a learning algorithm for ASR.

In 2011, Dr. R. L. K. Venkateswarlu, Dr. R. Vasantha Kumari and G. Vani Jaya Sri [20] utilized the Recurrent Neural Network, one of the neural network techniques, to observe the difference of alphabets from the E-set to the AH-set. In their research, 6 speakers (a mixture of male and female) were trained in a quiet environment. The English language offers a number of challenges for speech recognition. They used the multilayer back-propagation algorithm for training the neural network. The six speakers were trained using a multilayer perceptron with 108 input nodes, 2 hidden layers and 4 output nodes, one for each word, with the nonlinear sigmoid activation function. The learning rate was taken as 0.1 and the momentum rate as 0.5; weights were initialized to random values between +0.1 and −0.1, and the accepted error was chosen as 0.009. They compared the performance of the recurrent neural network with the Multi-Layer Perceptron and concluded that the RNN is better. For the A-set the maximum performances of speakers 1-6 were 93%, 99%, 96%, 93%, 92% and 94%; for the E-set they were 99%, 100%, 98%, 97%, 97% and 95%; for the EH-set 100%, 95%, 98%, 95%, 98% and 98%; and lastly for the AH-set 95%, 98%, 96%, 96%, 95% and 95%, respectively. The results show that the RNN is very powerful in classifying speech signals.

Song, W., & Cai, J. (2015) [21] developed end-to-end speech recognition using a hybrid CNN and RNN. They have convolutional layers, the first two with max pooling, followed by two densely connected layers with a softmax layer as output. The activation function used was ReLU. They implemented a rectangular convolutional kernel instead of a square kernel.

V. CONCLUSION
Speech is the primary and essential way of communication between humans. This survey is about neural networks as a modern way of recognizing speech; in contrast to the traditional approach, they do not require any statistics. A speech recognition system should include the four stages: analysis, feature extraction, modelling and matching techniques, as described in the paper. In this paper, the fundamentals of speech recognition are discussed and its recent progress is investigated. Various neural network models such as deep neural networks, RNN and LSTM are discussed. Automatic speech recognition using neural networks is an emerging field nowadays. Text to speech and speech to text are two applications that are useful for disabled people. The paper mainly focuses on speech recognition of one language, English.

VI. REFERENCES
[1] Yu D., Deng L. (2015) Deep Neural Network-Hidden Markov Model Hybrid Systems. In: Automatic Speech Recognition. Signals and Communication Technology, pp 99-116. Springer, London.
[2] Zhang, X.L., Luo, Z.G. & Li, M. J., Journal of Computer Science and Technology, Springer, November 2014, Volume 29, Issue 6, pp 1072-1082. https://doi.org/10.1007/s11390-014-1491-0
[3] Zou J., Han Y., So S.S. (2008) Overview of Artificial Neural Networks. In: Livingstone D.J. (ed) Artificial Neural Networks. Methods in Molecular Biology, vol 458. Humana Press. [Available online]: https://link.springer.com/protocol/10.1007/978-1-60327-101-1_2
[4] Weng, Chao; Yu, Dong; Watanabe, Shinji; Juang, Biing-Hwang Fred. Recurrent deep neural networks for robust speech recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014. IEEE, 2014, pp 5532-5536.
[5] Miao Y., Metze F. (2017) End-to-End Architectures for Speech Recognition. In: Watanabe S., Delcroix M., Metze F., Hershey J. (eds) New Era for Robust Speech Recognition. Springer, Cham.
[6] Schwartz R.M. et al. (1988) Acoustic-Phonetic Decoding of Speech. In: Niemann H., Lang M., Sagerer G. (eds) Recent Advances in Speech Understanding and Dialog Systems. NATO ASI Series (Series F: Computer and Systems Sciences), vol 46. Springer, Berlin, Heidelberg.
[7] Rabiner L.R. (1992) Speech Recognition Based on Pattern Recognition Approaches. In: Ince A.N. (eds) Digital Speech Processing. The Kluwer International Series in Engineering
used hybrid convolutional neural networks for phoneme and Computer Science (VLSI, Computer Architecture and
Digital Signal Processing), vol 155. Springer, Boston, MA
recognition and HMM for word decoding. Their best model [8] Wikipedia contributors. (2018, September 22). Neural
achieved an accuracy of 26.3% frame error on the standard network. In Wikipedia, The Free Encyclopaedia. Retrieved
core test dataset for TIMIT. Their main motto is to replace 15:37, November 4 , 2017,
GMM-HMM based automatic speech recognition with the from https://en.wikipedia.org/w/index.php?title=Neural_netw
deep neural networks. The CNN they used consists of 4 ork&oldid=860697996

IJERTV7IS100087 www.ijert.org 201


(This work is licensed under a Creative Commons Attribution 4.0 International License.)
Published by : International Journal of Engineering Research & Technology (IJERT)
http://www.ijert.org ISSN: 2278-0181
Vol. 7 Issue 10, October-2018

[9] International Journal on Recent and Innovation Trends in Computing and Communication, Volume 4.
[10] G. Gnaneswari, S.R. Vijaya Raghava, A.K. Thushar, Dr. S. Balaji, Recent Trends in Application of Neural Networks to Speech Recognition, International Journal on Recent and Innovation Trends in Computing and Communication, Volume 4, Issue 1, pp 18-25, 2016.
[11] Kosonocky S.V. (1995) Speech Recognition Using Neural Networks. In: Ramachandran R.P., Mammone R.J. (eds) Modern Methods of Speech Processing. The Springer International Series in Engineering and Computer Science (VLSI, Computer Architecture and Digital Signal Processing), vol 327. Springer, Boston, MA.
[12] Wikipedia contributors. (2018, October 2). Bell Labs. In Wikipedia, The Free Encyclopedia. Retrieved 13:35, October 5, 2018, from https://en.wikipedia.org/w/index.php?title=Bell_Labs&oldid=862196483
[13] Juang, B.H.; Rabiner, Lawrence R. "Automatic speech recognition - a brief history of the technology development" (PDF): 6. Archived (PDF) from the original on 17 August 2014. Retrieved 17 January 2015.
[14] Michael Price, James Glass, Anantha P. Chandrakasan, "A 6 mW, 5,000-Word Real-Time Speech Recognizer Using WFST Models", IEEE Journal of Solid-State Circuits, Vol. 50, No. 1, pp 102-112, January 2015.
[15] Forgie, James W. & Forgie, Carma D. (1959). Results Obtained from a Vowel Recognition Computer Program. The Journal of the Acoustical Society of America, 31, 844-844. 10.1121/1.1936151.
[16] "Pioneer speech recognition", [Available online]: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/speechreco/
[17] V.M. Velichko and N.G. Zagoruyko, "Automatic recognition of 200 words," Int. J. Man-Machine Studies, 2:223, June 1970.
[18] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-26(1), pp. 43-49, February 1978.
[19] Giuseppe Riccardi, Dilek Hakkani-Tür, "Active learning: theory and applications to automatic speech recognition", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 4, pp. 504-511, 2005.
[20] Dr. R.L.K. Venkateswarlu, Dr. R. Vasantha Kumari, G. Vani Jaya Sri, International Journal of Scientific & Engineering Research, Volume 2, Issue 6, June 2011.
[21] Song, W., & Cai, J. (2015) End-to-End Deep Neural Network for Automatic Speech Recognition.

Sanket A. Shah was born in Anand City, Gujarat, India, in 1997. He is currently pursuing his computer engineering degree at G.H. Patel Institute of Engineering and Technology, Bakrol, Gujarat, India. He attended the national conference RACST, held at his institute in 2016.

Hardik J. Dudhrejia was born in Rajkot City, Gujarat, India, in 1997. He is currently pursuing his computer engineering degree at G.H. Patel Institute of Engineering and Technology, Bakrol, Gujarat, India. He has attended various workshops as well as conferences, including the national conference IMPRESSARIO at his institute in 2016.
