Ann LA2 Project

This document discusses a student project to implement speech recognition using neural networks. It outlines the aim to accurately convert speech to text independently of speaker or recording device. It describes the key steps of speech recognition including preprocessing the speech signal, extracting features, and using algorithms like hidden Markov models, dynamic time warping, and artificial neural networks to classify the speech and output text. Hardware and software requirements for speech recognition systems are also listed.


Nitte Meenakshi Institute of Technology

Department of Electronics and Communication Engineering


Learning Activity - 2
USING ALGORITHM AND FLOW CHART, IMPLEMENT
SPEECH RECOGNITION USING NEURAL NETWORK

Submitted By: Sayantan Das - 1NT18EC198
Soumojit Gain - 1NT18EC163
Suryansh Gupta - 1NT18EC200
Shreya Thimmaya - 1NT18EC154
5th Semester, ECE
Submitted To: Dr. Jayavrinda Vrindavanam
Associate Professor, NMIT
PROJECT DETAILS
AIM -
The aim of the system is to accurately and efficiently convert a speech
signal into a text transcription of the spoken words, independent of the
speaker, the environment, or the device used to record the speech
(i.e. the microphone).
THEORY -
The process begins when a speaker decides what to say and actually speaks
a sentence, producing a speech waveform that embodies the words of the
sentence as well as the extraneous sounds and pauses in the spoken input.
The software first converts the speech signal into a sequence of feature
vectors measured throughout the duration of the signal. Then, using a
syntactic decoder, it generates a valid sequence of word representations.
ESSENTIAL SOFTWARE AND HARDWARE USED
• Based upon stated preferences and system specifications, the following conditions have been established:
• 1. Continuous speech recognition software is preferred, rather than the slower, more unnatural and
lower-priced discrete speech recognition software also on the market.
• 2. The application must run on a Pentium-powered PC under Windows 95, and be capable of integration with
Microsoft Word 97.
• 3. The software program must be easily and successfully installed by any intermediate-level computer user in the
office. The program must be one that can be learned and customized reasonably quickly by nearly anyone in
the office.
All four programs run on Pentium-powered PCs using Windows 95, 98 or NT 4.0 and require 16-bit SoundBlaster-
compatible sound cards. Random access memory (RAM) requirements are higher for all of these programs when
run under Windows NT.
• 1. Dragon Systems' NaturallySpeaking requires a Pentium/133 MHz processor or higher, 32 MB of RAM, and
180 MB of hard disk space.
• 2. IBM ViaVoice 98 requires a Pentium/166 MHz with MMX (multimedia extensions) or higher, 32 MB of RAM,
180 MB of hard disk space, and 256 KB of L2 cache.
• 3. L&H Voice Xpress Plus requires a Pentium/166 MHz with MMX, 40 MB of RAM, and 130 MB of hard disk
space.
• 4. Philips FreeSpeech98 requires a Pentium/166 MHz processor, 32 MB of RAM, and 150 MB of hard disk
space.
• A microphone is necessary for capturing spoken words.
Past Present And Future Of Speech Recognition
IMPLEMENTATION
Algorithm:-
There are mainly three algorithms used for speech recognition. They are given
below:
• 1. Hidden Markov Model(HMM)
• 2. Dynamic Time Warping(DTW)
• 3. Artificial Neural Networks(ANN)
• HIDDEN MARKOV MODEL (HMM)
A hidden Markov model (HMM) is a statistical Markov model in which the
system being modelled is assumed to be a Markov process with unobserved
(hidden) states. It can be represented as the simplest dynamic Bayesian
network. It can be thought of as a black box, where the sequence of output
symbols generated over time is observable, but the sequence of states
visited over time is hidden from view. This is why it is called a hidden
Markov model.

When an HMM is applied to speech recognition, the states are interpreted as
acoustic models, indicating what sounds are likely to be heard during their
corresponding segments of speech, while the transitions provide temporal
constraints, indicating how the states may follow each other in sequence.
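The hidden/observable distinction above can be sketched with a toy two-state HMM decoded by the Viterbi algorithm. All numbers below are made-up illustrative values, not trained acoustic models:

```python
import numpy as np

# Toy HMM: two hidden "phone" states emitting one of three acoustic symbols.
states = ["s1", "s2"]
start = np.array([0.6, 0.4])                # P(initial state)
trans = np.array([[0.7, 0.3],               # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],           # P(symbol | state)
                 [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Most likely hidden-state path for an observed symbol sequence."""
    T = len(obs)
    v = np.zeros((T, len(states)))          # best path probability so far
    back = np.zeros((T, len(states)), dtype=int)
    v[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        for j in range(len(states)):
            scores = v[t - 1] * trans[:, j]
            back[t, j] = np.argmax(scores)
            v[t, j] = scores[back[t, j]] * emit[j, obs[t]]
    # trace back from the best final state
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 1, 2]))   # → ['s1', 's1', 's2']
```

In a real recognizer the symbols would be acoustic feature vectors and the states would be sub-word units, but the decoding recurrence is the same.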
DYNAMIC TIME WARPING
• The simplest way to recognize an isolated word sample is to compare it
against a number of stored word templates and determine which the “best
match” is. This goal is complicated by a number of factors. First, different
samples of a given word will have somewhat different durations. This
problem can be eliminated by simply normalizing the templates and the
unknown speech so that they all have an equal duration. However, another
problem is that the rate of speech may not be constant throughout the
word; in other words, the optimal alignment between a template and the
speech sample may be nonlinear.
• Dynamic Time Warping (DTW) is an efficient method for finding this optimal
nonlinear alignment.
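The DTW recurrence described above can be sketched as follows. The scalar sequences and distance function are illustrative; a real system compares frames of feature vectors:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    Each cell D[i, j] holds the cheapest cumulative cost of aligning
    a[:i] with b[:j], allowing stretches and compressions on either axis.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: advance a, advance b, or advance both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two "utterances" of the same word, spoken at different rates
slow = [1, 1, 2, 3, 3, 4]
fast = [1, 2, 3, 4]
print(dtw_distance(slow, fast))   # 0.0 — DTW absorbs the tempo difference
```

Note how simple duration normalization could not achieve this: the tempo difference here is local, exactly the nonlinear case the text describes.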
ARTIFICIAL NEURAL NETWORKS (ANN)
• A neural network can be defined as a model of reasoning based on the human
brain. The brain consists of a densely interconnected set of nerve cells, or basic
information-processing units, called neurons. By using multiple neurons
simultaneously, the brain can perform its functions much faster than the fastest
computers in existence today.

• Each neuron has a very simple structure, but an army of such elements constitutes a
tremendous processing power.
• The feedforward network is the first and simplest form of ANN. In this network,
information flows in only one direction, i.e. forward, from the input nodes via the
hidden nodes to the output nodes. Learning is the adaptation of the free
parameters of the neural network through a continuous process of stimulation by
the embedding environment. The back-propagation algorithm emerged to train a
new class of layered feedforward networks called Multi-Layer Perceptrons (MLP).
An MLP generally contains at least two layers of perceptrons: one input layer, one
or more hidden layers and an output layer. The hidden layer plays a very important
role and acts as a feature extractor.
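The feedforward pass and back-propagation update described above can be sketched with a tiny MLP learning XOR. The architecture, learning rate and iteration count are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # input -> hidden
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # hidden -> output

losses = []
for _ in range(4000):
    h = sigmoid(X @ W1 + b1)            # forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)          # forward pass, output layer
    losses.append(float(np.mean((out - y) ** 2)))
    # backward pass: chain rule through the sigmoid at each layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out; b2 -= d_out.sum(axis=0)    # gradient-descent step
    W1 -= X.T @ d_h;   b1 -= d_h.sum(axis=0)

print("loss:", round(losses[0], 3), "->", round(losses[-1], 3))
print("predictions:", np.round(out.ravel(), 2))
```

XOR is the classic example because it is not linearly separable: the hidden layer must extract intermediate features, which is exactly the "feature extractor" role the text assigns to it.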
Structure of Speech Recognition
The structure of a standard speech recognition system is illustrated in the figure. The elements
are as follows:
• Raw speech - Speech is typically sampled at a high frequency, e.g., 16 kHz over a
microphone or 8 kHz over a telephone. This yields a sequence of amplitude values over
time.
• Signal analysis - Raw speech should be initially transformed and compressed, in order to
simplify subsequent processing. Many signal analysis techniques are available which can
extract useful features and compress the data by a factor of ten, without losing any
important information. Among the most popular:
• Fourier analysis (FFT) - yields discrete frequencies over time, which can be
interpreted visually. Frequencies are often distributed using a Mel scale,
which is linear in the low range but logarithmic in the high range,
corresponding to physiological characteristics of the human ear.
• Perceptual Linear Prediction (PLP) - is also physiologically motivated, but
yields coefficients that cannot be interpreted visually.
• Linear Predictive Coding (LPC) - yields coefficients of a linear equation
that approximate the recent history of the raw speech values.
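The Mel scale mentioned above is commonly implemented with the formula 2595 · log10(1 + f/700), which is approximately linear below about 1 kHz and logarithmic above it. A small sketch (the band count is an arbitrary choice for illustration):

```python
import numpy as np

def hz_to_mel(f):
    """Common Mel-scale formula: near-linear at low frequencies,
    logarithmic at high frequencies, mirroring the ear's resolution."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Band edges of a 10-band Mel filterbank over 0-8000 Hz
# (the 16 kHz sampling case mentioned above, so Nyquist = 8 kHz):
# equally spaced in Mel, hence increasingly wide in Hz.
edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 12))
print(np.round(edges).astype(int))
```

The printed edges cluster tightly at low frequencies and spread out at high ones, which is why FFT bins are pooled this way before further processing.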
FLOWCHART OF THE SYSTEM
• The general structure of the speech recognition program is shown in the figure
below. The input of the system is the speech signal. The preprocessing block
performs denoising and end-point detection. After preprocessing, the signal is
sent to the feature extraction block; in this project three methods - LPC, MFCC
and Spectrogram - are used for feature extraction. Finally, the last block decides
whether or not there is a match.
[Figure: block diagram of the system. Technique labels from the figure include:
close-talking microphone, microphone array, auditory models (EIH, SMC, PLP),
adaptive filtering, noise subtraction, comb filtering, spectral mapping, cepstral
mean normalization, RASTA, noise addition, HMM (de)composition (PMC), model
transformation (MLLR), Bayesian adaptive learning, frequency weighting measure,
weighted cepstral distance, cepstrum projection measure, word spotting,
utterance verification, language model adaptation.]

AUTOMATIC SPEECH RECOGNITION
• Automatic speech recognition (ASR) can be defined as the independent, computer-driven
transcription of spoken language into readable text in real time. ASR is technology that
allows a computer to identify the words that a person speaks into a microphone or
telephone and convert them to written text. The ultimate goal of ASR research is to allow a
computer to recognize, in real time and with 100% accuracy, all words that are intelligibly
spoken by any person, independent of vocabulary size, noise, speaker characteristics or
accent. The goal of an ASR system is to accurately and efficiently convert a speech signal
into a text transcription of the spoken words, independent of the speaker,
environment or the device used to record the speech (i.e. the microphone).
• This process begins when a speaker decides what to say and actually speaks a sentence.
The software then produces a speech wave form, which embodies the words of the
sentence as well as the extraneous sounds and pauses in the spoken input. Next, the
software attempts to decode the speech into the best estimate of the sentence. First it
converts the speech signal into a sequence of vectors which are measured throughout the
duration of the speech signal. Then, using a syntactic decoder it generates a valid
sequence of representations.
WORKING
• The GALAXY-II conversational system at MIT: Galaxy is a client-server
architecture developed at MIT for accessing online information using spoken
dialogue [9]. It has served as the testbed for developing human language
technologies. The boxes in this figure represent the various human language
technology servers as well as information and domain servers.
RESULT
• Ibrahim Patel (2010), "Speech Recognition Using HMM with MFCC - an analysis
using Frequency Spectral Decomposition Technique". Technique: resolution
decomposition with separating-frequency spectral mapping. Result: shows an
improvement in the quality metrics of speech recognition with respect to
computational time and learning accuracy for a speech recognition system.
• Kavita Sharma (2012), "Speech Denoising using Different Types of Filters".
Technique: FIR, IIR and wavelet filters. Result: use of filters allows estimation
of clean speech and noise for speech enhancement in speech recognition.
• Bhupinder Singh (2012), "Speech Recognition with Hidden Markov Model".
Technique: hidden Markov model. Result: developed a voice-based user-machine
interface system.
• Patiyuth Pramkeaw (2012), "Improving MFCC-based speech classification with
FIR filter". Technique: FIR filter. Result: shows improvement in recognition
rates of spoken words.
• Shivanker Dev Dhingra (2013), "Isolated Speech Recognition using MFCC and
DTW". Technique: Dynamic Time Warping (DTW). Result: shows that DTW is the best
non-linear feature matching technique in speech identification, with minimal
error rates and fast computing speed.
CONCLUSION
• For speech recognition, ANN is an effective and efficient approach because of
its multi-layer network. Speech recognition is also used in smartphones:
speech/spoken words are given as input and the speech recognition software
returns the appropriate search results or information that the user wants as
output. Neural networks, with their remarkable ability to derive meaning from
complicated or imprecise data, can be used to extract patterns and detect
trends that are too complex to be noticed by either humans or other computer
techniques. A trained neural network can be thought of as an "expert" in the
category of information it has been given to analyse.
• ANN has:
1. Adaptive learning: an ability to learn how to do tasks based on the data
given for training or initial experience.
2. Self-organisation: an ANN can create its own organisation or representation
of the information it receives during learning time.
3. Real-time operation: ANN computations may be carried out in parallel, and
special hardware devices are being designed and manufactured which take
advantage of this capability.
4. Fault tolerance via redundant information coding: partial destruction of a
network leads to a corresponding degradation of performance; however, some
network capabilities may be retained even with major network damage.
Thus, for speech recognition, the artificial neural network is an efficient and
effective algorithm among all the algorithms considered.
BIBLIOGRAPHY
• http://en.wikipedia.org/wiki/Speech_recognition
• http://en.wikipedia.org/wiki/Artificial_neural_network
• http://www.researchgate.net/
• YouTube
• Pahini A. Trivedi, "Introduction to Various Algorithms of Speech Recognition:
Hidden Markov Model, Dynamic Time Warping and Artificial Neural Networks".
• Prof. Pisal Ranjeet, Thite Prakash, Satpute Amruta and Shingade Monali,
"Automatic Speech Recognition System".
• Prerana Das, Kakali Acharjee, Pranab Das and Vijay Prasad, "Voice Recognition
System: Speech-to-Text".
