Speech Recognition Using Discrete Hidden Markov Model: Department of ECE, Saveetha Engineering College, Chennai, India


Middle-East Journal of Scientific Research 23 (7): 1506-1511, 2015

ISSN 1990-9233
IDOSI Publications, 2015
DOI: 10.5829/idosi.mejsr.2015.23.07.22353

Speech Recognition Using Discrete Hidden Markov Model

S. Suganya and C. Sheeba Joice

Department of ECE, Saveetha Engineering College, Chennai, India

Abstract: In recent years, speech recognition has seen great development in the automation industry.
This paper proposes an Automatic Speech Recognition (ASR) system to facilitate interaction between humans
and electronic components. The main concern of this paper is the suppression of various noises to
achieve a robust speech recognition system. A Discrete Hidden Markov Model (DHMM) is used to increase the
speed of speech recognition. The paper explores the hardware realization of the desired speech recognition
system on a Field Programmable Gate Array (FPGA). The accuracy has to be increased to obtain clear and
robust speech recognition. The speech features are extracted as cepstral coefficients by using warped filter
banks, and these coefficients are used to increase the robustness of speech recognition. To minimize the
complexity of the desired ASR system, the number of coefficients has to be minimized. Speech-to-text
conversion is the main objective of this paper; it can be achieved by using a built-in function in MATLAB.

Key words: Feature Extraction, Cepstral Coefficients, Discrete Cosine Transform, Discrete Hidden
Markov Model

INTRODUCTION

Speech has been the most dominant and convenient means of communication. Speech communication is
not only face-to-face interaction but also takes place between individuals at any moment, via a wide
variety of modern technological media. Speech recognition plays an important role in enabling a human
to interact with electronic components. Speech recognition is also a kind of pattern recognition
technique. Its applications include voice-controlled devices, speech-to-text conversion, etc. Automatic
Speech Recognition (ASR) can be divided into two phases, viz., a training phase and a testing phase. In
the training phase, the speech feature vectors are extracted and trained into the codebook. In the
testing phase, the feature vectors are obtained as in the training phase and the testing features are
matched against the codebook. If both sets of features are similar, the given speech is recognized, and
this can be utilized for authentication purposes.
Speech recognition is an advanced technique used in the fields of automation, artificial intelligence
and so on. Improvements in recognition accuracy and robustness affect the performance of ASR. Linear
Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction
(PLP) are among the feature extraction techniques available for ASR. MFCC and PLP are the most widely
used feature extraction techniques, which are required to reconstruct the original signal [1].
Researchers have recently focused on implementing ASR on embedded platforms [2]. Speech-to-text
conversion is a useful technique that helps people with disabilities [3]. This paper focuses on
speech-to-text conversion, which can be implemented on the Altera DE1 board. The robustness of speech
recognition can be improved by suppressing the noise in the speech signal. Speech-to-text conversion
depends on the variation in frequency. As the number of cepstral coefficients increases, the desired
system becomes more complex, so the number of coefficients used here is 12.
The speech signal has to be in wave format; only then can the signal be processed and feature
extraction be done. The speech signal is sampled at 16000 Hz with 32 bits per sample. The reasonable

Corresponding Author: S. Suganya, Saveetha Engineering College, Chennai, India.


modeling of the speech signal can be done under the assumption that such a small segment of speech is
sufficiently stationary [4].
Section II describes the proposed methodology of this paper. Section III explores the steps followed in
the feature extraction. Section IV explains the theory of the Discrete Hidden Markov Model. Section V
explains the hardware architecture designed for implementation on the FPGA. Section VI discusses the
results obtained for the desired system. Section VII presents the conclusions drawn from this paper.
Section VIII lists the papers that are referred to.

Proposed Methodology: The proposed methodology consists of the feature extraction module and the
codebook generation. The proposed architecture is shown in Fig. 1, which sets out the steps followed in
this paper. The original signal has to be split up into a number of frames according to the frame
length. The low frequency signals are selected by blocking the low frequency in every frame. After
frame blocking, a Hamming window is applied to every frame to reduce the discontinuity of the signal.
The DCT coefficients are then determined, for the evaluation of the cepstral coefficients, by applying
the filter bank spacing to every windowed frame. Logarithmic values are applied to get the DCT values
as a single value [5].

Fig. 1: Proposed Methodology

Feature Extraction: Fig. 2 shows the block diagram of the feature extraction technique. The steps
followed for the desired system are explained below.

Fig. 2: Feature Extraction technique

Frame Blocking: With an appropriate time length for each frame, frame blocking is applied to divide the
signal into matrix form. The speech signal is sampled at 16000 Hz, and each frame is assumed to contain
320 samples, i.e. 20 ms of speech. The overlap between frames determines the separation, in samples,
between consecutive frames [5].

Hamming Windowing: To reduce the discontinuities of the signal at the ends of each frame, Hamming
windowing is applied to every frame after frame blocking. Equation (3.1) gives the discrete-time
representation of the signal,

(3.1)

Applying a Hamming window to each frame introduces the least distortion [5].

Fast Fourier Transform: The time-domain signal is converted into the frequency domain by applying the
Fast Fourier Transform (FFT) to every windowed frame. The output of the FFT consists of complex numbers
having both real and imaginary parts. Real data has to be processed by the speech recognition system,
so the complex components of the FFT output can be neglected [5]. Equation (3.2) describes the spectral
domain,

(3.2)

Mel Frequency Filter Bank: Mel frequency analysis is preferred because it is based on human perception.
The human ear is very sensitive, and it has been shown that humans have higher resolution at low
frequencies than at high frequencies. The speech signal is therefore not perceived on a linear
frequency scale, and a conversion to a linear scale is needed.
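The frame blocking, Hamming windowing and FFT steps described above can be sketched in software as follows. This is a minimal illustration in Python with NumPy, not the authors' MATLAB or FPGA implementation. The 16000 Hz sampling rate and the 320-sample frame come from the text; the 50 % overlap (`FRAME_STEP`), the FFT size (`NFFT`) and the function names `frame_signal` and `power_spectrum` are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16000   # Hz, as stated in the paper
FRAME_LEN = 320       # samples per frame, as stated in the paper (20 ms)
FRAME_STEP = 160      # assumption: 50 % overlap between consecutive frames
NFFT = 512            # assumption: FFT size, next power of two above 320


def frame_signal(signal, frame_len=FRAME_LEN, frame_step=FRAME_STEP):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Zero-pad very short signals so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    starts = frame_step * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    return signal[index]


def power_spectrum(frames, nfft=NFFT):
    """Apply a Hamming window to each frame and return |FFT|^2 / nfft."""
    windowed = frames * np.hamming(frames.shape[1])
    spectrum = np.fft.rfft(windowed, n=nfft, axis=1)
    # Only the squared magnitude is kept, so the output is real-valued.
    return (np.abs(spectrum) ** 2) / nfft


if __name__ == "__main__":
    t = np.arange(SAMPLE_RATE) / SAMPLE_RATE      # one second of samples
    tone = np.sin(2 * np.pi * 1000 * t)           # 1 kHz test tone
    frames = frame_signal(tone)
    spec = power_spectrum(frames)
    print(frames.shape, spec.shape)
```

Taking the squared magnitude is one common way to discard the complex part of the FFT output; whether the paper uses the magnitude or the power spectrum is not stated, so that choice is also an assumption.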


The Mel scale is used to warp the signal in the frequency domain. The conversion of the speech signal
from the frequency domain to the Mel scale can be done using equation (3.3),

(3.3)

The Mel filter bank spacing has to be applied to the FFT values to convert from the frequency domain to
the Mel scale. Triangular band-pass filters are applied as the filter bank; they are non-uniformly
spaced on the linear frequency axis and uniformly spaced on the Mel frequency axis, with a larger
number of filters in the low-frequency region and fewer filters in the high-frequency region, as shown
in Fig. 3.

Fig. 3: Mel filter Bank using Triangular Band Pass Filter

Logarithm of Energies: The log-energy, i.e., the logarithm of the sum of filtered components, is
computed for each filter. Equation (3.4) expresses the logarithm of the weighted sum of spectral values
in each filter-bank channel,

(3.4)

At this stage, the result is a matrix whose number of rows equals the number of frames and whose number
of columns equals the number of filters in the filter bank.

Discrete Cosine Transform: The cepstral analysis includes a conversion from the spatial domain to the
frequency domain by applying the DCT to the Mel-scale values. The DCT expresses a finite set of data
points as a sum of cosine functions. The DCT is similar to the DFT, but the DCT is preferred because
the values obtained provide the accuracy needed to increase the robustness of ASR. Equation (3.5)
expresses the DCT,

(3.5)

The cepstral coefficients are obtained by applying the DCT to the Mel-scale values; the resulting
coefficients are called Mel Frequency Cepstral Coefficients (MFCC). Increasing the number of
coefficients increases the accuracy but also the design complexity, which slows the design of the ASR
system. The number of coefficients taken here is 12 [5].

Discrete Hidden Markov Model: The Discrete Hidden Markov Model (DHMM) is used to increase the speed of
speech recognition. A codebook is first generated for the feature vectors, and the feature vectors are
trained using the DHMM in the codebook. From the training samples, the upper and lower bounds of each
element are calculated to generate the codebook. The range between the upper and lower bounds is
divided into sub-intervals from which the feature vectors are extracted. The initial codebook is formed
by randomizing the same number of vectors as the number of classes. The codebook can be initialized
with the values obtained from the feature vectors. The DHMM is a probability-based classifier, and this
paper utilizes it as a comparator on a probability basis. Although DHMM is a time-consuming process, it
improves the efficiency of the desired system [6].

Hardware Architecture: The desired system can be implemented on the Altera DE1 board. It can be
evaluated and implemented through the System-on-Chip (SoC) architecture of the FPGA. The SoC
architecture is shown in Fig. 4. All the algorithms of the desired methodology can be implemented on
the Nios II processor. The Altera DE1 development board, which includes the Cyclone II device, is used
for this experiment. The push button is used for noise suppression. The toggle switch can be used as an
input to the FPGA board; this can be used for authentication purposes. The microphone can be used for
checking the robustness of ASR. The Liquid Crystal Display (LCD) is used to display the words produced
by the speech-to-text conversion. An audio controller is used to receive the speech signal. The I2C
protocol is used to control the registers of the platform [6].
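As a software reference for the Mel filter bank, log-energy and DCT steps of the feature extraction described above, the following minimal Python/NumPy sketch turns a power spectrum into 12 cepstral coefficients per frame. It is an illustration under stated assumptions, not the authors' implementation: the Mel formula used here is a commonly used form and may differ from the paper's equation (3.3), and the number of filters (`N_FILTERS`), the FFT size (`NFFT`), the decision to drop the zeroth coefficient and all function names are assumptions. Only the sampling rate and the count of 12 coefficients come from the text.

```python
import numpy as np

SAMPLE_RATE = 16000   # Hz, as stated in the paper
NFFT = 512            # assumption: FFT size used to compute the power spectrum
N_FILTERS = 26        # assumption: number of triangular filters
N_CEPS = 12           # number of cepstral coefficients, as stated in the paper


def hz_to_mel(f):
    """Commonly used Hz-to-Mel mapping (assumed, not taken from the paper)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters=N_FILTERS, nfft=NFFT, sample_rate=SAMPLE_RATE):
    """Triangular filters spaced uniformly on the Mel axis."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):          # rising edge
            fbank[i, k] = (k - left) / (centre - left)
        for k in range(centre, right):         # falling edge, peak at centre
            fbank[i, k] = (right - k) / (right - centre)
    return fbank


def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping n_out coefficients."""
    m = x.shape[1]
    n = np.arange(n_out)[:, None]
    k = np.arange(m)[None, :]
    basis = np.cos(np.pi * n * (k + 0.5) / m)  # shape (n_out, m)
    return x @ basis.T


def mfcc_from_power(power_spec, n_ceps=N_CEPS):
    """Power spectrum (frames x bins) -> cepstral coefficients (frames x n_ceps)."""
    fbank = mel_filterbank(nfft=2 * (power_spec.shape[1] - 1))
    energies = power_spec @ fbank.T                      # frames x filters
    log_energies = np.log(np.maximum(energies, 1e-10))   # avoid log(0)
    ceps = dct_ii(log_energies, n_ceps + 1)
    return ceps[:, 1:]                                   # drop coefficient 0
```

The intermediate `log_energies` array has one row per frame and one column per filter, which matches the matrix layout described in the Logarithm of Energies step.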


Fig. 4: Block Diagram of SOC architecture

Fig. 5: Authorized Speech

RESULTS AND DISCUSSIONS

The codebook can be initialized with the help of the feature vectors obtained through the training
phase. The speech feature vectors can be randomized and evaluated through the suppression of
environmental noises. The speech signal is processed and compared with the various hidden states using
DHMM. When the feature vectors are similar, the pattern is recognized by the desired ASR. The
simulation results are obtained with the help of ModelSim software, and the design is synthesized with
the help of Quartus II software, which helps in the hardware realization. Fig. 5 shows the authorized
speech, where the match-out signal is high because the feature vectors are similar. Fig. 6 shows the
unauthorized speech, where the match-out signal is low because the feature vectors are not similar to
those in the codebook.
Fig. 7 shows the power dissipated by the desired system in the hardware. The power dissipated by the
desired system is 188.18 mW. This analysis also reports the number of logic units required to design a
robust speech recognition system. Fig. 8 shows the area occupied by the logic elements in the desired
speech recognition system.
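The comparison behind the match-out signal can be illustrated in software. The paper does not state how the match-out decision is computed, so the following Python/NumPy sketch is only one plausible reading: feature vectors are quantized to their nearest codeword, the resulting symbol sequence is scored against a discrete HMM with the scaled forward algorithm, and the match is declared when the average log-likelihood per frame reaches a threshold. The function names, the threshold rule and the model layout `(pi, A, B)` are all assumptions made for the example.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    pi: initial state probabilities, shape (n_states,)
    A:  state transition matrix, shape (n_states, n_states)
    B:  symbol emission matrix, shape (n_states, n_symbols)
    """
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf               # sequence impossible under this model
        log_lik += np.log(scale)
        alpha = alpha / scale            # rescale to avoid numerical underflow
    return log_lik


def match_out(obs, model, threshold):
    """Hypothetical match rule: high when the per-frame log-likelihood
    of the sequence under the model reaches the threshold."""
    pi, A, B = model
    return bool(forward_log_likelihood(obs, pi, A, B) / len(obs) >= threshold)
```

In a multi-word recognizer the same score would typically be computed for one model per word and the word with the highest likelihood selected; the threshold form shown here corresponds to the authorized/unauthorized decision of Fig. 5 and Fig. 6.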


Fig. 6: Unauthorized Speech

Fig. 7: Power Dissipation

Fig. 8: Synthesis Report

CONCLUSION

The speech signal is processed, trained and compared with the feature vectors obtained by processing
the speech. The DHMM technique is a slightly time-consuming process, but it provides accuracy for
robust speech recognition.
Future work is to implement the system on the Altera DE1 FPGA starter kit, where it can also be used to
convert speech to text. The research work can be extended to activate a voice-controlled device for
authentication purposes.

REFERENCES

1. Yuan Mang, 2004. Speech Recognition on DSP: Algorithm on optimization & performance analysis,
   The Chinese University of Hong Kong, pp: 1-18.
2. Huggins-Daines D., M. Kumar, A. Chan, A. Black, M. Ravishekar and A. Rudnicky, 2006.
   Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices,
   in Proceedings of ICASSP.
3. Rumia Sultana and Rajesh Palit, 2014. A Survey on Bengali Speech-To-Text Recognition Techniques,
   The 9th International Forum on Strategic Technology, Coxs Bazar, Bangladesh.
4. Muda Lindasalwa, Mumtaj Begam and I. Elamvazuthi, 2010. Voice recognition algorithm using MFCC &
   DTW techniques, Journal of Computing, ISSN 2151-9617, 2(3): 138-143.
5. Joshi, Siddhant, C. and Dr. A.N. Cheeran, 2014. MATLAB Based Feature Extraction Using Mel
   Frequency Cepstrum coefficients for Automatic Speech Recognition, IJSETR, 3(6).
6. Pan Shing-Tai and Xu-Yu Li, 2012. An FPGA Based Embedded Robust Speech Recognition System
   Designed By Combining Empirical Mode Decomposition and a Genetic Algorithm, IEEE Trans on
   Instrumentation and Measurement, 61(9).
