International Journal of Engineering and Techniques - Volume 5 Issue4 , August 2019

DCT APPLICATION IN SPEECH RECOGNITION: A SURVEY
Atul Narkhede1, Dr. Naveen Sen2, Dr. Milind Nemade3
1(Research Scholar, Faculty of Engineering, Pacific Academy of Higher Education and Research University, Udaipur
[email protected])
2(Associate Prof., Pacific Academy of Higher Education and Research University, Udaipur
[email protected])
3(Professor, Department of Electronics Engineering, K. J. Somaiya Institute of Engineering & Information Technology,
Mumbai, India. [email protected])

Abstract: Automatic speech recognition by machine has been an important research area for over
forty years. Because speech is an information-rich signal, digital processing of the speech signal
is an efficient route to accurate automatic voice recognition. Speech recognition has found
applications in many areas of daily life, such as telephone answering machines, text transmission,
and sending voice commands to machines. Feature extraction and classification are the major stages
of an ASR system. The choice of feature extraction method plays an important role in the accuracy
of the system. This paper gives a brief overview of speech processing methods in which the DCT is
used in different ways to extract features efficiently.

Keywords: DCT, MMSE, MFCC.

I. Introduction
Automatic speech recognition by machine has been a research goal for over four decades. In
science, the computer has long been imagined as a machine that mimics human understanding. The
idea of building speech recognition systems arose because it is convenient for humans to interact
with a computer, a robot or any other machine by voice rather than through difficult instructions.
Humans have long been inspired to create a computer that can understand and speak like humans.
Speech recognition is the process by which a computer maps an acoustic speech signal to some form
of abstract meaning of the speech. This process is very difficult, because the incoming sound must
be matched against stored sound fragments, and further analysis is needed because the incoming
fragments do not correspond exactly to the pre-existing ones. Various feature extraction and
pattern matching techniques are used to build better speech recognition systems. Feature
extraction and pattern matching techniques play an important role in a speech recognition system
in maximizing the recognition rate across different speakers. The following sections describe some
methods that illustrate the advantages and disadvantages of the DCT.

II. DCT for Noise Reduction:
The work surveyed in this section illustrates the advantages of using the discrete cosine
transform (DCT) over the standard discrete Fourier transform (DFT) to remove the noise embedded in
a speech signal. A minimum mean square error (MMSE) filter is derived from a statistical model of
the DCT coefficients. An over-attenuation factor is also derived, based on the fact that speech
energy is not present in the noisy signal at all times or in all coefficients. This
over-attenuation factor is useful for suppressing any residual musical noise that may be present.
Speech enhancement by noise removal is often needed in speech processing systems that operate in
noisy environments. The energy of
white noise is uniformly distributed across the spectrum, whereas the energy of speech,
particularly voiced speech, is concentrated at certain frequencies. Therefore, an advantage of
using a real transform such as the DCT considered here is that leaving the phase uncorrected has
less serious consequences. The DCT is widely used in image compression because of its excellent
energy compaction property, which is also useful for noise removal. The DCT provides significantly
higher energy compaction than the DFT [1].

III. Hybrid Method Genetic-Fuzzy Inference System:
In this system, the speech signal is encoded and parameterized as a two-dimensional time matrix
with four parameters of the speech signal. After encoding, the mean and variance of each pattern
are used to generate the rule base of a Mamdani fuzzy inference system. The mean and variance are
optimized with a genetic algorithm to obtain the best performance of the recognition system. The
patterns considered are the spoken Brazilian digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The discrete
cosine transform (DCT) is used to encode the speech patterns. The use of the DCT in data
compression and pattern classification has increased in recent years, mainly because its
performance is very close to that of the Karhunen-Loève transform, which is considered optimal
under a variety of criteria such as mean square truncation error and entropy. The work
demonstrates the potential of the DCT and the fuzzy inference system in speech recognition. These
two tools have shown good results in the temporal modeling of the speech signal [2].

Fig. 1. Block diagram of the proposed recognition system HMFE

IV. MMSE filter using DCT:
This method illustrates the properties of the discrete cosine transform (DCT) relative to the
standard discrete Fourier transform (DFT) for speech noise removal. The results show that the DCT
has better energy compaction and requires fewer computations than the DFT. The proposed algorithm
reduces residual noise using a speech-absence probability technique. The proposed techniques use
adaptive schemes that track the probability of speech absence in noisy speech. The spectral
estimate is obtained from a binary classification, that is, speech is either present or absent.
Different kinds of noise, such as:
• Background noise
• Channel noise
• Quantization noise
significantly degrade the performance of systems such as speech coders and speech recognizers, so
these systems need a preprocessing step that incorporates speech enhancement to remove the noise.
A filtering operation must be applied to the signal to remove the noise. Filtering can be defined
as follows: the process of extracting the information-bearing signal X(n) from the observed signal
Y(n), where Y(n) = X(n) + N(n) and N(n) is a noise process, is called filtering. Different
algorithms, in both the time and the frequency domain, are used to remove the noise embedded in
the noisy speech signal [3].

V. DCT and MFCC:
The work surveyed in this section examines and presents an approach to speech signal recognition
using spectral information on the Mel frequency scale, which is a dominant feature for speech
recognition. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively
represent the short-term power spectrum of a sound, based on a linear cosine transform of a log
power spectrum on a non-linear Mel frequency scale. The performance of MFCCs is influenced by the
number of filters, the shape of the filters, the filter spacing and the warping of the power
spectrum. In that work, the optimal values of
the above parameters are chosen to obtain an efficiency of 99.5% on very short audio files.

Figure 2. Process model for extracting MFCCs from an audio speech

a) Pre-emphasis: a first-order FIR filter with a single coefficient is normally used as the
pre-emphasis filter.
b) Framing: frames are typically 20-30 ms long with an overlap of 10-15 ms.
c) Windowing: a Hamming or Hanning window function can be used.
d) DFT: converts each frame of N time-domain samples into the frequency domain.
e) Mel filtering: the magnitude frequency response of each filter is triangular, equal to unity at
its center frequency and decreasing linearly to zero at the center frequencies of the two adjacent
filters.
f) DCT: converts the log Mel spectrum back into the time domain. The result is called the
Mel-frequency cepstral coefficients. The set of coefficients is called an acoustic vector.
Therefore, each input utterance is transformed into a sequence of acoustic vectors [4].

The next work surveyed focuses on improving the performance of a speech recognition algorithm by
integrating digital signal processing with voice recognition techniques. The approach uses a
Butterworth band-stop filter together with voice compression based on the discrete cosine
transform and the inverse wave transformation. The main objective is to integrate the filter with
the voice recognition algorithm so that results improve when noise is present in the signal. In
that work, matching is performed with inverse wave transformations, which reduces the speech
recognition time. The algorithm was designed and implemented in MATLAB, tested on the provided
samples, and evaluated with different recognizable and unrecognizable samples, obtaining a
recognition rate of approximately 98%. The authors report that the algorithm gives better results
than existing techniques and increases the accuracy of the voice recognition system. The goal of
the method is to identify the speaker from previously recorded wave samples, with the main
emphasis on accuracy and speed.

Fig. 3. Flow chart for the speech recognition algorithm
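Steps (a) to (f) of Section V can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [4]: the 16 kHz sampling rate, the 25 ms frame with a 10 ms hop (15 ms overlap), the pre-emphasis coefficient 0.97, the 26 filters, the 13 retained coefficients, the commonly used 2595·log10(1 + f/700) Mel formula, and all function names are illustrative choices that the survey does not specify.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (an assumption; the survey gives none).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters: unity at the center bin, zero at the neighbours' centers."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, equals 1 at the center
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return x @ basis.T

def mfcc(signal, fs, frame_ms=25, hop_ms=10, n_fft=512,
         n_filters=26, n_ceps=13, alpha=0.97):
    # a) Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # b) Framing into overlapping frames
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(fs * hop_ms / 1000))
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # c) Windowing with a Hamming window
    frames = frames * np.hamming(frame_len)
    # d) DFT of each frame, then power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # e) Mel filtering, then log of the filter bank energies
    energies = power @ mel_filterbank(n_filters, n_fft, fs).T
    log_energies = np.log(energies + 1e-10)
    # f) DCT of the log Mel spectrum gives the cepstral coefficients
    return dct_ii(log_energies, n_ceps)

# Toy input: one second of a 440 Hz tone, not real speech.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440.0 * t)
features = mfcc(tone, fs)
print(features.shape)  # one row (acoustic vector) per frame
```

Each row of `features` is one acoustic vector, so an utterance becomes a sequence of such vectors, as described in step (f). The DCT is written out explicitly here only to keep the sketch free of dependencies beyond NumPy.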

Voice compression based on the discrete cosine transform (DCT) is used to reduce the size of the
speech data. It speeds up the system by removing redundancy from the audio information.
Compression is the process of eliminating redundancy and duplication. The DCT is very commonly
used when encoding video and audio tracks on computers [5].

Fig. 4. Feature extraction methods
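To make the idea of DCT-based compression concrete, the following sketch transforms one frame, keeps only a few coefficients and reconstructs the frame. It is an illustration under assumptions, not the scheme of [5]: the synthetic two-tone frame, the rule of keeping the largest-magnitude coefficients, and the names `dct`, `idct` and `compress` are all hypothetical choices made for this example.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n)
                 * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                 for k in range(1, n))
        out.append(s)
    return out

def compress(frame, keep):
    """Zero all but the `keep` largest-magnitude DCT coefficients."""
    coeffs = dct(frame)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]

# Toy frame: two sinusoids, standing in for a short voiced segment.
n = 64
frame = [math.sin(2 * math.pi * 3 * i / n) + 0.5 * math.sin(2 * math.pi * 7 * i / n)
         for i in range(n)]

sparse = compress(frame, keep=8)   # only 8 of 64 coefficients survive
restored = idct(sparse)

energy = sum(v * v for v in frame)
error = sum((a - b) ** 2 for a, b in zip(frame, restored))
print(f"kept 8 of {n} coefficients, relative error = {error / energy:.4f}")
```

Because the transform is orthonormal, the reconstruction error equals the energy of the discarded coefficients, so the better the energy compaction, the fewer coefficients need to be stored for a given error.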

VI. 2D DCT:
This method uses coefficients extracted from the 2D discrete cosine transform (DCT) of log Mel
filter bank energies to improve speaker recognition over the traditional Mel-frequency cepstral
coefficients (MFCC) with appended deltas and double deltas (MFCC/deltas). The selection of
relevant coefficients proved to be crucial, which led to the proposal of a zig-zag parsing
strategy. Although the 2D-DCT coefficients gave significant gains over MFCC/deltas, the parsing
strategy remains sensitive to the number of filter bank outputs and to the size of the analysis
window. The authors analyze this sensitivity and propose two new data-driven methods of using DCT
coefficients for speaker recognition: rankDCT and pcaDCT. The first, rankDCT, is an automatic
coefficient selection strategy based on the highest average intra-frame energy rank. The
alternative method, pcaDCT, avoids the need for selection and instead projects the DCT
coefficients to the desired dimensionality through principal component analysis (PCA). All
features, including MFCC/deltas, are tuned on a subset of the PRISM database to highlight the
parameter sensitivity of each feature. Evaluated on the NIST SRE'12 corpus, pcaDCT consistently
outperforms both rankDCT and zzDCT features and offers an average relative improvement of 20% over
MFCC/deltas across conditions [6].

Another approach is a 2D DCT-based scheme for compressing acoustic features for remote speech
recognition applications. The coding scheme computes a 2D DCT on blocks of feature vectors,
followed by uniform scalar quantization, run-length coding and Huffman coding. Digit recognition
experiments were conducted in which training used unquantized cepstral features from clean speech,
and testing used the same features after encoding and decoding with 2D DCT and entropy coding, at
various levels of acoustic noise. The coding scheme yields recognition performance comparable to
that obtained with unquantized features at low bit rates. 2D DCT coding of MFCCs, together with a
variable frame rate analysis method [Zhu and Alwan, 2000] and peak isolation [Strope and Alwan,
1997], maintains the noise robustness of these algorithms at low SNRs even at 624 bps.

Figure 5. Block diagram of the DCT and entropy encoder.

At the client, the input is first segmented into frames, features are computed for each frame, and
blocks of features are then formed. A 2D DCT is applied to each block, and the components with the
lowest energy are set to zero. This is followed by scalar quantization, run-length coding and
Huffman coding. A block diagram of the encoder is shown in Figure 5. At the receiver,
decoding and the IDCT are performed, and the feature vectors corresponding to each frame are
passed to the ASR system. Only the feature vectors are encoded and sent to the recognition server;
the first and second derivatives are computed at the server from the recovered features [7].

VII. BDCT Method:
Robust speech recognition has become an important area of research in recent years. Multi-band
features can be combined in different ways to perform the speech recognition task. For multi-band
feature extraction, a block discrete cosine transform (BDCT) is proposed, with its kernel
transformation matrix derived from a decomposition of the discrete cosine transform (DCT) kernel.
The BDCT is shown to approximate the DCT in retaining the correlation information of a sequence.
When the BDCT is applied to the Mel filter bank energies (FBE) in place of the DCT to convert them
into cepstral coefficients, a new type of MFCC is produced [8].

VIII. Conclusion
This paper briefly explains different DCT-based methods used for speech recognition. It shows that
the DCT works well for noise reduction and that its energy compaction property can improve both
speed and recognition rate.

References:

[1] Ing Yann Soon, Soo Ngee Koh, Chai Kiat Yeo, “Noisy speech enhancement using discrete cosine
transform,” ELSEVIER, Speech Communication, pp. 249-257, 1998.
[2] Washington Silva and Ginalber Serra, “An Intelligent System Based on Discrete Cosine Transform
for Speech Recognition,” ResearchGate, IBERAMIA, LNAI 7637, pp. 320-329, November 2012.
[3] Muhammad Safder Shafi, Mansoor Khan, “Transform Based Speech Enhancement Using DCT Based MMSE
Filter, & Its Comparison With DFT Filter,” Journal of Space Technology, Vol. 1, No. 1,
pp. 47-52, July 2012.
[4] Garima Vyas, Barkha Kumari, “Speaker Recognition System Based On MFCC and DCT,” International
Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-2, Issue-5,
pp. 167-169, June 2013.
[5] Sukhdeep Kaur, Er. Gurwinder Kaur, “Enhancement of Speech Recognition Algorithm Using DCT and
Inverse Wave Transformation,” International Journal of Engineering Research and Applications,
ISSN: 2248-9622, Vol. 3, Issue 6, pp. 749-754, Nov-Dec 2013.
[6] Mitchell McLaren, Yun Lei, “Improved Speaker Recognition Using DCT Coefficients as Features,”
IEEE International Conference on Acoustics, Speech and Signal Processing, 978-1-4673-6997-8,
pp. 4430-4434, April 2015.
[7] Qifeng Zhu and Abeer Alwan, “An Efficient And Scalable 2D DCT-Based Feature Coding Scheme For
Remote Speech Recognition,” IEEE International Conference on Acoustics, Speech, and Signal
Processing, ISSN 1520-6149, May 2011.
[8] Suman K. Saksamudre, R. R. Deshmukh, “Comparative Study of Isolated Word Recognition System
for Hindi Language,” International Journal of Engineering Research & Technology (IJERT),
ISSN: 2278-0181, Vol. 4 Issue 07, July-2015.