Speech Recognition Using Neural Networks IJERTV7IS100087
Abstract—Speech is the most common way for humans to interact. Since it is the most effective method of communication, it can also be extended to interaction with a system, and as a result it has become extremely popular in no time. Speech recognition allows a system to interact with the user and to process the data provided verbally. Since the user can interact by voice, the user is not confined to the alphanumeric keys. Speech recognition can be defined as the process of recognizing the human voice to generate commands or word strings. It is also popularly known as ASR (Automatic Speech Recognition), computer speech recognition or speech to text (STT). Speech recognition is not an isolated activity; it requires knowledge of diverse fields such as linguistics and computer science. Various techniques available for speech recognition are HMM (Hidden Markov Model)[1], DTW (Dynamic Time Warping)-based speech recognition[2], neural networks[3], deep feedforward and recurrent neural networks[4] and end-to-end automatic speech recognition[5]. This paper primarily focusses on the different types of neural networks used for automatic speech recognition, and also covers work done on speech recognition using these neural networks.

Keywords— Speech recognition; Recurrent Neural Network; Hidden Markov Model; Long Short-Term Memory network

I. INTRODUCTION
Throughout their life-span humans communicate mostly through voice, since they learn all the relevant skills at an early age and continue to rely on speech communication. It is therefore more efficient to communicate with speech than with a keyboard and mouse. Voice recognition or speech recognition provides the methods by which computers can be upgraded to accept speech input instead of keyboard input, which is extremely advantageous for disabled people.

Speech is affected greatly by factors such as pronunciation, accent, roughness, pitch, volume, background noise, echoes and gender. A preliminary part of speech processing is the study of speech signals and of the methods for processing these signals.

The conventional method of speech recognition consists in representing each word by its feature vector and pattern matching against statistically available vectors. In contrast to the antediluvian HMM method, neural networks require no prior knowledge of the speech process and no statistics of the speech data. [3]

Types of speech recognition: Based on the type of words speech recognizing systems can recognize, speech recognition systems are divided into the following categories:
➢ Isolated Word:
Isolated word recognition requires each utterance to have quiet on both sides of the sample window. Only single words and single utterances are accepted at a time, and the recogniser has a "Listen and Non-Listen" state.
➢ Continuous Word:
Continuous speech recognisers allow users to speak in a continuous, almost natural fashion while the computer determines the content of the speech. Recognisers providing continuous speech capabilities are difficult to create because they require special methods to determine the boundaries of the utterances.
➢ Connected Word:
Connected words are very much like isolated words, but they allow separate utterances to be spoken with minimal pauses between them.
➢ Spontaneous speech:
At an elementary level, spontaneous speech can be considered as speech that comes out naturally rather than rehearsed. An automatic speech recogniser must be able to handle a wide range of speech features, such as words being run together.

Classification of speech sounds:
The classification of speech sounds is commonly done on the basis of two processes, depending on how the classification is looked upon:

Based on the process of obstruction and non-obstruction sounds
Classifying sounds with respect to obstruction and non-obstruction relies on the conception of bodily air. While generating human sounds, the air coming out of the body behaves in one of two ways: it is obstructed somewhere in the mouth or throat, or it is not obstructed and comes out freely. Correspondingly, the sounds produced with and without obstruction differ, apart from some trivial qualities.
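As a toy illustration of this split, the letter-level sketch below treats the English vowels as non-obstruction sounds and all other letters as obstruction sounds; this orthographic shortcut is an assumption made only for illustration, since real classification works on the acoustic signal, not on spelling.

```python
# Toy letter-level sketch of the obstruction / non-obstruction split.
# Assumption: vowels (a, e, i, o, u) are non-obstruction sounds and every
# other letter is an obstruction sound; a real classifier would operate
# on the acoustic signal rather than on letters.
VOWELS = set("aeiou")

def sound_class(letter: str) -> str:
    letter = letter.lower()
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a single letter")
    return "non-obstruction" if letter in VOWELS else "obstruction"
```

For example, sound_class("a") returns "non-obstruction" and sound_class("b") returns "obstruction".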
For example, all the vowels (a,e,i,o,u) are non-obstruction speech sounds and all the consonants (b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,y,z) are obstruction speech sounds.

Based on the process of voiced and voiceless sounds
A voiced sound is produced when the vocal cords vibrate as the sound is made, whereas in a voiceless sound there is no vocal cord vibration. To test this, place your finger on your throat as you say the words: a vibration is felt when voiced sounds are uttered, and no vibration is felt while uttering a voiceless sound. Many a time it is difficult to feel the difference between them, so another test can be performed by putting a paper in front of the mouth; the paper should move only when saying the voiceless sounds. All the vowels are voiced, whereas the consonants are some voiced and some voiceless.

Voiced consonants are: b, d, g, v, z, th, sz, j, l, m, n, ng, r, w, y
Voiceless consonants are: p, t, k, f, s, th, sh, ch, h

II. SPEECH RECOGNITION PROCESS
Speech recognition is truly a ponderous and tiresome process. It consists of five steps:
1. Speech
2. Speech Pre-Processing
3. Feature Extraction
4. Speech Classification
5. Recognition

Figure-1 Speech recognition process

Speech
Speech is defined as the ability to express one's thoughts and feelings by articulate sounds. Initially the speech of a person is received in the form of a waveform. There are numerous tools and software packages available which record the speech delivered by humans. The phonic environment and the recording equipment used have a significant impact on the speech signal obtained. There is a possibility of background noise or room reverberation being blended with the speech, which is completely undesirable.

Speech Pre-Processing
The solution to the problem described above is speech pre-processing. It plays an influential role in cancelling out the trivial sources of variation. Speech pre-processing typically includes reverberation cancelling, echo cancellation, windowing, noise filtering and smoothing, all of which conclusively improve the accuracy of speech recognition.

Feature Extraction
Each person has different speech and different intonation, due to the different characteristics ingrained in their utterances. It should be possible to identify the speech from the waveform, at least theoretically. As a result of the enormous variation in speech, there is a need to reduce those variations by performing feature extraction. The ensuing section depicts some of the feature extraction techniques which are widely used nowadays.

LPC (Linear Predictive Coding): LPC is an extremely useful speech analysis technique for encoding quality speech at a low bit rate and is one of the most powerful methods. The key idea behind this method is that a specific speech sample at the current time can be approximated as a linear combination of past speech samples. In this method the digital signal is compressed for efficient storage and transmission. The principle behind LPC is to minimize the sum of squared differences between the original speech and the estimated speech over a finite duration. It can further be used to provide a unique set of predictor coefficients. The gain (G) is also a crucial parameter.

Speech Classification
These systems are used to extract the hidden information from the processed input signals and comprise convoluted mathematical functions. This section briefly describes some commonly used speech classification techniques.

HMM (Hidden Markov Model): This is the most widely used method for recognizing patterns in speech.
It is safer and possesses a more secure mathematical foundation than the template based and knowledge based approaches. In this method, the system being modelled is assumed to be a Markov process with hidden states. The speech is divided into smaller sound units, each of which represents a state. In a simple Markov model the states are clearly visible to the observer, and thus the state transition probabilities are the only parameters. In a hidden Markov model, by contrast, the state is not directly visible, but the output, which depends on the state, is. HMMs are specifically known for their applications in reinforcement learning and in pattern recognition tasks such as speech, handwriting and bioinformatics.

DTW (Dynamic Time Warping): In time series analysis, DTW is an algorithm which measures the similarity between two temporal sequences that vary in speed or time. It correlates the spoken words with reference words. The method warps the time dimension of the unknown word until it matches the reference word. A well-known application of DTW is automatic speech recognition, where it copes with different speaking speeds. Other applications include online signature recognition and speaker recognition.

VQ (Vector Quantization): This method is primarily based on the block coding principle. The technique models probability density functions by the distribution of prototype vectors, and was formerly used for data compression. It maps vectors from a vast vector space to a finite number of regions in that space. Each region is known as a cluster and can be represented by its centre, called a code word. Vector quantization is used in lossy data compression, lossy data correction, clustering and pattern recognition.

Recognition
After the above four phases (speech recording, speech pre-processing, feature extraction and speech classification), the final remaining step is the recognition itself. Once all the above-mentioned steps are completed successfully, the recognition of speech can be done by three approaches:
1. Acoustic phonetic approach[6]
2. Pattern recognition approach[7]
3. Artificial intelligence approach

This paper is mainly concerned with the artificial intelligence approach to speech recognition, which is a combination of the acoustic phonetic and pattern recognition approaches. In this approach, systems built from neural networks are used to classify and recognize the sound. Neural networks are very powerful for speech recognition, and various networks exist for this purpose: RNN, LSTM, deep neural networks and hybrid HMM-LSTM are all used for speech recognition.

III. NEURAL NETWORKS
Traditionally, "neural network" referred to a network or circuit of neurons. At present the term refers to an Artificial Neural Network, consisting of artificial neurons or nodes. It is a network of elementary elements known as artificial neurons, each of which receives an input, changes its state according to that input and generates an output. An interconnected group of natural or artificial neurons uses a mathematical model for information processing based on a connectionist approach to computation. Neural networks can be treated as simple mathematical models defining a function f: X → Y, or a distribution over X or over both X and Y [8], but many times models are also intimately connected with a particular learning rule. The term "artificial neural network" refers to the inter-connections among the neurons in the different layers of the system. Mathematically, a neuron's network function f(x) is defined as a composition of other functions gi(x), which can in turn be defined as compositions of further functions. A commonly used type of composition is the nonlinear weighted sum f(x) = K(∑i wi gi(x)), where K is a predefined function. It is convenient to refer to the collection of functions gi simply as a vector g. The first view is the functional one: the given x is transformed into a 3-dimensional vector h, which is finally transformed into f; this view is frequently encountered in the context of optimization. The second is the probabilistic view: the random variable G = g(H) depends on H = h(X), which in turn depends on the random variable X; this view is most commonly encountered in the context of graphical models. Networks such as the one described above are called feed-forward because their graph is a DAG (Directed Acyclic Graph); networks containing cycles are called recurrent neural networks.

Neural Network Models:
Language modelling and acoustic modelling are both vital aspects of modern statistically-based speech recognition systems. The ensuing section illustrates various neural-network methods for speech recognition.

Deep Neural Network (DNN-HMM): HMM is a generative model in which observable acoustic features are assumed to be generated from a hidden Markov process that transitions between states S = {s1, s2, ..., sk}.

HMM was the most widely used technique for Large Vocabulary Speech Recognition (LVSR) for at least two decades. The decisive parameters in an HMM are the initial state probability distribution π = {p(q0 = si)}, where qt is the state at time t; the transition probabilities aij = p(qt = sj | qt−1 = si); and a model to estimate the observation probabilities p(xt | si). The conventional HMMs used in automatic speech recognition had their observation probabilities modelled by a Gaussian Mixture Model (GMM). Even though GMMs have a vast number of advantages, the issue is that they are statistically inefficient for modelling data that lie on or near a non-linear manifold in the data space. For instance, modelling points residing very close to the surface of a sphere hardly requires any parameters using a suitable model class, but it requires a large number of diagonal Gaussians or a fairly huge number of full-covariance Gaussians. Because of this, other types of models may work better than GMMs if they can exploit the information embedded in a large window of frames more effectively.
On the other hand, an ANN (Artificial Neural Network) can handle data residing on or near a non-linear manifold more effectively and can learn much better models of the data. Over the past few years, outstanding advances have been made both in machine learning algorithms and in computer hardware, which has ultimately led to more efficient methods of training networks with many hidden layers and a large output layer. The output layer must accommodate the large number of HMM states that arise when each phone is modelled by a number of different triphones. By employing these new learning methods, a vast number of research groups have shown that Deep Neural Networks outperform GMMs at acoustic modelling for speech recognition on tasks with massive vocabularies and huge datasets. A Deep Neural Network is a feed-forward Artificial Neural Network that has more than one layer of hidden units between its inputs and its outputs. Each hidden unit j in a DNN uses the logistic function to map its total input from the layer below, xj, to the scalar state, yj, that it sends to the layer above:

xj = bj + ∑i yi wij

where bj is the bias of unit j, i is an index over units in the layer below, and wij is the weight on the connection to unit j from unit i in the layer below. For multiclass classification, output unit j converts its total input, xj, into a class probability, pj, by using the softmax non-linearity:

pj = exp(xj) / ∑k exp(xk)

where k is an index over all classes.

… preserving the modelling power simultaneously, but this increases the computational cost. Therefore there is a prominent need to use the information in the training set to build multiple layers of non-linear feature detectors.

Recurrent Neural Network: A Recurrent Neural Network is a kind of Artificial Neural Network whose connections form a directed cycle, with each node connected to other nodes. Two units become active as soon as communication takes place between them. Unlike feed-forward networks, an RNN uses its internal memory to process arbitrary sequences of inputs, which makes it an ideal choice for speech recognition. The key feature of an RNN is that activations flow around in a loop, as the network contains at least one feed-back connection, allowing the network to do temporal processing and learn sequences.
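Before turning to recurrent models, the DNN computations above (xj = bj + ∑i yi wij with logistic hidden units, and a softmax output layer) can be sketched in a few lines; the weight values here are arbitrary illustrative assumptions, not from any trained model.

```python
import math

def logistic(x):
    # y_j = 1 / (1 + exp(-x_j)), the logistic non-linearity of a hidden unit
    return 1.0 / (1.0 + math.exp(-x))

def hidden_layer(y_below, W, b):
    """x_j = b_j + sum_i y_i * w_ij, followed by the logistic function."""
    xs = [b[j] + sum(y_below[i] * W[i][j] for i in range(len(y_below)))
          for j in range(len(b))]
    return [logistic(x) for x in xs]

def softmax(xs):
    """p_j = exp(x_j) / sum_k exp(x_k), computed in a numerically stable way."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One hidden layer (2 inputs -> 3 hidden units) feeding a 3-way softmax output.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
b = [0.0, 0.1, -0.1]
probs = softmax(hidden_layer([1.0, -1.0], W, b))
```

The resulting probs sum to 1, with each entry acting as a class probability pj.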
LSTM (Long Short-Term Memory): To control the flow of information, each memory block contains one or more self-connected memory cells and three multiplicative gates. The flow of information in each LSTM cell is guarded by the learned input and output gates, and a forget gate is added for the purpose of resetting the cells. A conventional LSTM can be defined as follows. Given an input sequence x = (x1, x2, ..., xT), a conventional RNN computes the hidden vector sequence h = (h1, h2, ..., hT) and the output vector sequence y = (y1, y2, ..., yT) from t = 1 to T as follows:

ht = H(Wxh xt + Whh ht−1 + bh)
yt = Why ht + by

where W denotes the weight matrices, b denotes the bias vectors and H(·) is the recurrent hidden layer function. The following figure illustrates the architecture of the LSTM:

LSTM offers several advantages:
1. LSTM can bridge long time lags and can handle noise, continuous values and distributed representations. In contrast to hidden Markov models, LSTM does not need an a priori choice of a finite number of states; it can deal with unlimited state numbers.
2. With respect to the problems discussed in this paper, LSTM generalizes well even when the inputs in a sequence are irrelevant or widely separated. It can quickly learn to distinguish between two or more widely separated occurrences of a particular element in the input sequence without depending on short-time-lag training exemplars.
3. It does not require fine parameter tuning; it works well over a wide range of parameters such as the learning rate, input gate bias and output gate bias.
4. The LSTM algorithm's complexity per weight and time step is O(1). This is considered extremely advantageous and outperforms other approaches such as RTRL.
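The RNN recurrence given above, ht = H(Wxh xt + Whh ht−1 + bh) and yt = Why ht + by, can be sketched with H = tanh and scalar weights for brevity; the weight values are illustrative assumptions only.

```python
import math

def rnn_forward(xs, w_xh, w_hh, w_hy, b_h, b_y):
    """Scalar RNN: h_t = tanh(w_xh*x_t + w_hh*h_{t-1} + b_h); y_t = w_hy*h_t + b_y."""
    h = 0.0                      # h_0: zero initial hidden state
    ys = []
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h + b_h)
        ys.append(w_hy * h + b_y)
    return ys

# With w_hh != 0, the second output "remembers" the first input:
with_memory = rnn_forward([1.0, 0.0], 1.0, 1.0, 1.0, 0.0, 0.0)
no_history = rnn_forward([0.0, 0.0], 1.0, 1.0, 1.0, 0.0, 0.0)
```

Here with_memory[1] is non-zero (about 0.64) while no_history[1] is 0.0, showing the feed-back connection carrying information forward in time.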
In the mid-1980s, IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary, under the lead of Fred Jelinek. [16] In this era, neural networks emerged as an attractive model for automatic speech recognition. Speech research in the 1980s shifted from the template based approach to statistical modelling, mainly known as the hidden Markov model approach. Applying neural networks to speech recognition was reintroduced in the late 1980s; neural networks had first been introduced in the 1950s, but for practical problems they were not efficient enough at the time.

In the 1990s, Bayes classification was transformed into an optimization problem that reduces the empirical errors. A key issue in the design and implementation of a speech recognition system is how to choose the proper method and the speech material used to train the recognition algorithm. Training can be supervised learning, in which classes are labelled in the training data and the algorithm predicts the labels of unlabelled data. Stephen V. Kosonocky [11] researched how neural networks can be used for speech recognition in 1995.

… convolutional layers. The first two layers have max pooling and the next two are densely connected layers with a softmax layer as the output. The activation function used was ReLU. They implemented a rectangular convolutional kernel instead of a square kernel.

V. CONCLUSION
Speech is the primary and essential way for communication between humans, and this survey is about neural networks as a modern way of recognizing speech. In contrast to the traditional approach, it does not require any statistics. A speech recognition system should include the four stages of analysis, feature extraction, modelling and matching, as described in the paper. In this paper, the fundamentals of speech recognition are discussed and its recent progress is investigated. Various neural network models such as deep neural networks, RNN and LSTM are discussed. Automatic speech recognition using neural networks is an emerging field nowadays. Text to speech and speech to text are two applications that are useful for disabled people. The paper mainly focuses on speech recognition of one language, English.
REFERENCES
[9] International Journal on Recent and Innovation Trends in Computing and Communication, Volume: 4
[10] G. Gnaneswari, S. R. VijayaRaghava, A. K. Thushar, Dr. S. Balaji, "Recent Trends in Application of Neural Networks to Speech Recognition", International Journal on Recent and Innovation Trends in Computing and Communication, Volume: 4, Issue 1, pp 18-25, 2016.
[11] Kosonocky S.V. (1995) Speech Recognition Using Neural Networks. In: Ramachandran R.P., Mammone R.J. (eds) Modern Methods of Speech Processing. The Springer International Series in Engineering and Computer Science (VLSI, Computer Architecture and Digital Signal Processing), vol 327. Springer, Boston, MA.
[12] Wikipedia contributors. (2018, October 2). Bell Labs. In Wikipedia, The Free Encyclopedia. Retrieved 13:35, October 5, 2018, from https://en.wikipedia.org/w/index.php?title=Bell_Labs&oldid=862196483
[13] Juang, B. H.; Rabiner, Lawrence R. "Automatic speech recognition – a brief history of the technology development" (PDF): 6. Archived (PDF) from the original on 17 August 2014. Retrieved 17 January 2015.
[14] Michael Price, James Glass, Anantha P. Chandrakasan, "A 6 mW, 5,000-Word Real-Time Speech Recognizer Using WFST Models", IEEE Journal of Solid-State Circuits, Vol. 50, No. 1, pp 102-112, January 2015.
[15] W. Forgie, James & D. Forgie, Carma. (1959). Results Obtained from a Vowel Recognition Computer Program. The Journal of the Acoustical Society of America. 31. 844-844. 10.1121/1.1936151.
[16] "Pioneer speech recognition", [available online]: http://www03.ibm.com/ibm/history/ibm100/us/en/icons/speechreco/
[17] V. M. Velichko and N. G. Zagoruyko, "Automatic recognition of 200 words," Int. J. Man-Machine Studies, 2:223, June 1970.
[18] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP 26(1), pp. 43-49, February 1978.
[19] Giuseppe Riccardi, Dilek Hakkani-Tür, "Active learning: theory and applications to automatic speech recognition", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 4, pp. 504-511, 2005.
[20] Dr. R. L. K. Venkateswarlu, Dr. R. Vasantha Kumari, G. Vani Jaya Sri, International Journal of Scientific & Engineering Research, Volume 2, Issue 6, June 2011.
[21] Song, W., & Cai, J. (2015). End-to-End Deep Neural Network for Automatic Speech Recognition.

Sanket A. Shah was born in Anand City, Gujarat, India in 1997. He is currently pursuing his computer engineering degree at G.H. PATEL INSTITUTE OF ENGINEERING AND TECHNOLOGY, Bakrol, Gujarat, India. He has attended the national conference RACST, held at his institute in 2016.

Hardik J. Dudhrejia was born in Rajkot City, Gujarat, India in 1997. He is currently pursuing his computer engineering degree at G.H. PATEL INSTITUTE OF ENGINEERING AND TECHNOLOGY, Bakrol, Gujarat, India. He has attended various workshops as well as conferences, including the national conference IMPRESSARIO at his institute in 2016.