Paper 3
https://doi.org/10.1007/s40009-017-0597-7
SHORT COMMUNICATION
Robust Recognition of English Speech in Noisy Environment Using Frequency Warped Signal…

N. Upadhyay · H. G. Rosales

Abstract  The performance level of a speech recognizer drops significantly when there is an acoustic mismatch between the training and operational environments. A speech recognizer is called robust if it preserves good recognition accuracy even under mismatch conditions. The present study addresses the recognition of English speech in noisy environments and presents a comparative study of various frequency scales used in parameterization, based on the average recognition rate. For robust automatic speech recognition, a front end signal enhancement component, the spectral subtraction algorithm, is used to prefilter the noisy input speech before it is fed to the recognizer. A number of frequency warped scales, namely the perceptual scales, viz. the Mel scale, the Bark scale, and the equivalent rectangular bandwidth rate scale, and a non-perceptual scale called the uniform scale, are used in the parameterization for feature extraction from the enhanced speech. A suite of experiments is carried out to evaluate the performance of the speech recognizer, with and without the use of the front end signal enhancement component, in a variety of noisy environments. Recognition accuracy is tested at the word linguistic level over a wide range of signal to noise ratios for both stationary and non-stationary noises.

Keywords  Speech enhancement · Spectral over subtraction · Parameterization · Frequency warped scales · Speech recognition

Correspondence: Navneet Upadhyay ([email protected])
Department of Signal Processing and Acoustics, Faculty of Electrical Engineering, Autonomous University of Zacatecas, 98000 Zacatecas, Mexico

Speech is the most dominant and convenient way of communication for humans. Therefore, methods to communicate with the computer, such as speech oriented information input and retrieval, are strongly required by modern society. The automatic speech recognizer (ASR) is used to facilitate this type of communication. ASR is the task of translating a sequence of words or other linguistic entities contained in an acoustic signal into a textual representation by means of algorithms implemented in a computer or machine. ASR has been used in a range of applications including voice dialling, call routing, data entry and dictation, content based spoken audio search, and robotics over the past decades [1–4].

The conventional ASR system typically uses a directional microphone to collect speech data; the acoustic information is sampled as a signal suitable for processing by computer and fed into a recognition process. The output of the ASR system is a hypothesis for a transcription of the utterance. However, the performance of ASRs turns out to be poor in real situations, especially when classifiers are trained in high signal to noise ratio (SNR) environments (typically SNR ≥ 30 dB) and operated in real environments of relatively lower SNR. A speech environment is characterized by a specific transmission line, and a mismatching environment can be produced by either a microphone change or the introduction of a noisy background. Significant research has been conducted on the problem of mismatch compensation for ASR. Some of the methods require the presence of stereo data (simultaneous recordings from both environments), which is rarely available in real scenarios. Other approaches do not mandate the availability of stereo data, but they require some information about the environments, such as a model or knowledge of environmental statistics [1–6]. Therefore, many researchers have recently focused on the task of
building more noise robust ASRs that can be operated in noisy environments with higher recognition accuracy or rate.

A number of approaches exist in the literature for robust speech recognition. These approaches can be classified into three broad categories: speech enhancement based, feature compensation based, and model adaptation based methods. In the first approach, a signal enhancement component is used as a preprocessor to the ASR, where a speech enhancement algorithm is applied to the noisy speech before it is fed to the recognizer. A conventional ASR trained on clean speech is then used in conjunction with features extracted from the enhanced signal. Typical speech enhancement methods, such as spectral subtraction [7, 8], Wiener filtering [9], statistical model based methods [10], and subspace algorithms [9, 11], have succeeded in improving the SNR of noisy speech, but not necessarily in improving the perceived quality or intelligibility of the speech and the recognition accuracy [12]. It seems plausible that, in optimizing the SNR, the spectrum of the speech signal is altered and some distortion is introduced which does not necessarily improve recognition accuracy. The second approach uses parameterization which is fundamentally immune to noise. These feature based approaches aim to find a transformation in the feature space that matches an already trained model. The most widely used speech recognition features are linear predictive cepstral coefficients (LPCC), Mel frequency cepstral coefficients (MFCC) [13], and perceptual linear prediction (PLP) [14]. The third is the model adaptation based approach, e.g. maximum likelihood linear regression [15], which adapts the clean trained acoustic model to better represent the noisy features. This approach tries to transform the acoustic models to reduce the mismatch, or to match the noisy speech features in a new testing environment.

The present study gives a synthetic and structural description of digital techniques for single microphone speech recognition in noisy environments. In this paper, a front end signal enhancement component, the spectral subtraction algorithm, is used to enhance the speech signal, degraded by various stationary and non-stationary noises, prior to recognition. A number of frequency warped scales, namely the perceptual scales, viz. the Mel scale, the Bark scale, and the equivalent rectangular bandwidth (ERB) rate scale, and a non-perceptual scale called the uniform scale, are used in the parameterization of the enhanced speech, and the performance of the speech recognition task is assessed based on the average recognition rate [13, 16, 17].

The block diagram of the proposed robust speech recognition system is shown in Fig. 1. ASR training is performed on clean speech; during testing, clean speech is first degraded by additive noise and then processed using the speech enhancement technique. The test signal is added with various background noises and estimated using the spectral subtraction algorithm. The features of the estimated speech signal are extracted and used for matching with the trained model. For feature extraction, a number of frequency scales, viz. the Mel, Bark, ERB rate, and uniform scales, are used in the parameterization approach [13, 16–19]. The spectral over subtraction algorithm is explained and a detailed description of the frequency scales is presented in the following subsections.

Spectral subtraction (SpecSub) is one of the most established methods for removing additive and uncorrelated noise from noisy speech [7]. The basic principle of SpecSub is to estimate the noise spectrum from non-speech regions and then subtract the estimated noise from the noisy speech to produce an enhanced speech. Assume that the original clean speech s[n] is converted to a noisy speech signal y[n] by adding acoustic background noise d[n], as y[n] = s[n] + d[n], where n is the discrete time index. The acoustic background noise is a broadband and non-stationary signal.

Speech is a time varying and, consequently, non-stationary signal. It is usually processed over short time frames within which the signal is assumed stationary. Therefore, the above model can be expressed in frame form as

y[m, n] = s[m, n] + d[m, n],   m = 1, 2, …, N;  n = 0, 1, …, (N − 1)   (1)

[Fig. 1 Block diagram of the proposed robust speech recognition system: noisy speech → pre processing → short term Fourier transform (STFT) → spectral subtraction (SpecSub) with noise estimation → Bark/Mel/ERB/uniform filterbank → log(·) non-linearity → discrete cosine transform (DCT) → feature vector]
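As a concrete illustration of the frame based model of Eq. (1) and the pre processing and STFT stages of Fig. 1, a minimal sketch is given below. The 25 ms frame length and 10 ms shift follow the analysis settings reported later for the SpecSub experiments; the function names and the 8 kHz sampling rate are our own assumptions.

```python
import numpy as np

def frame_signal(y, fs, frame_ms=25, shift_ms=10):
    """Split y[n] into overlapping frames y[m, n] (Eq. 1).

    Assumes len(y) is at least one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(y) - frame_len) // shift
    frames = np.stack([y[m * shift : m * shift + frame_len]
                       for m in range(n_frames)])
    return frames * np.hamming(frame_len)    # taper each frame

def magnitude_spectra(frames, n_fft=512, p=2):
    """|Y(m, k)|^p per frame (p=1: magnitude SpecSub, p=2: power SpecSub)."""
    return np.abs(np.fft.rfft(frames, n_fft)) ** p

fs = 8000                                    # Aurora-style 8 kHz rate assumed
y = np.random.randn(fs)                      # stand-in for a noisy utterance
Y = magnitude_spectra(frame_signal(y, fs))   # shape: (n_frames, n_fft//2 + 1)
```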
where m is the frame index and N is the length of the frame. By applying the fast Fourier transform (FFT), we get

|Y(m, k)|^p = |S(m, k)|^p + |D(m, k)|^p   (2)

where Y(m, k), S(m, k), and D(m, k) are the kth FFT coefficients of the mth frames of the noisy speech, clean speech, and noise, respectively. The exponent p determines the type of spectral subtraction; typically a value of 1 or 2 is chosen, with p = 1 yielding magnitude SpecSub and p = 2 yielding power SpecSub.

Estimation of Noise Statistics

Estimating the spectrum of the noise that forms part of the noisy speech is a central component of the SpecSub method, and many estimation methods have been proposed [20]. One of the most common methods, which is used in this paper, is given by [8]

|D̂(m, k)|^p = λ|D̂(m − 1, k)|^p + (1 − λ)|Y(m, k)|^p,   if |Y(m, k)|^p < β|D̂(m − 1, k)|^p
            = |D̂(m − 1, k)|^p,                         otherwise   (3)

where |D̂(m, k)| is the magnitude of the kth FFT coefficient of the noise estimate in the mth frame and λ (0 ≤ λ ≤ 1) is the noise updating factor. If a small λ is chosen, the estimated noise spectrum changes rapidly and may result in poor estimation. On the contrary, a large λ increases the robustness of the estimate when the noise spectrum is stationary or changes slowly with time, but it does not permit the system to follow rapid noise changes. In turn, β is the threshold parameter used to distinguish between noise and speech frames.
Estimation of Speech Spectrum

After estimation of the noise spectrum, the clean speech spectrum Ŝ(m, k) can be estimated as

|Ŝ(m, k)|^p = |Y(m, k)|^p − α(m)|D̂(m, k)|^p,   if |D̂(m, k)|^p / |Y(m, k)|^p < 1 / (α(m) + γ)
            = γ|D̂(m, k)|^p,                    otherwise   (4)

where α(m) (0 < α(m) < 3) is the over reduction factor, used to compensate for errors in the noise spectrum estimation. The value of α(m) is adapted from frame to frame depending on the segmental noisy signal to noise ratio (NSNR) of the frame. Therefore, in order to obtain better results, we need to set this parameter accurately and adaptively. The larger α(m) is, the smaller the residual wide band noise, but also the larger the speech distortion, and vice versa.

The parameter γ (0 < γ ≪ 1) is the spectral floor factor, a small positive number assuring that the estimated spectrum will not become negative. The initial noise spectrum is estimated by averaging the first few frames of the speech utterance. Phase information is ignored in spectral subtraction, as it is deemed to have little effect on intelligibility.

Musical noise, which sounds similar to rhythmic music, is an artifact of spectral subtraction that appears in the enhanced speech as a result of both over reduction and under reduction of the noise spectrum [7]. Over reduction spectral subtraction reduces the noise to some extent, but the musical noise is not completely eliminated, affecting the quality of the speech signal [8]. There is thus a tradeoff between the amount of noise reduction and speech distortion.
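The noise update rule (3) and the over subtraction rule (4) can be combined into a short routine such as the sketch below. It operates on precomputed |Y(m, k)|^p values; the fixed α used here is a stand-in for the NSNR driven, frame by frame adaptation described above, which the paper does not give in closed form.

```python
import numpy as np

def spectral_subtraction(Y, lam=0.9, beta=2.0, alpha=2.0,
                         gamma=0.03, n_init=6):
    """Estimate |S(m,k)|^p from |Y(m,k)|^p via Eqs. (3)-(4).

    Y      : array (n_frames, n_bins) of |Y(m,k)|^p values
    lam    : noise updating factor, 0 <= lam <= 1
    beta   : speech/noise decision threshold of Eq. (3)
    alpha  : over reduction factor (fixed here; the paper adapts
             it per frame from the segmental NSNR)
    gamma  : spectral floor factor, 0 < gamma << 1
    n_init : initial noise estimate averaged over the first frames
    """
    D = Y[:n_init].mean(axis=0)               # initial noise spectrum
    S = np.empty_like(Y)
    for m in range(len(Y)):
        # Eq. (3): update noise estimate only in noise-like bins
        noise_like = Y[m] < beta * D
        D = np.where(noise_like, lam * D + (1 - lam) * Y[m], D)
        # Eq. (4): over subtraction with spectral flooring
        keep = D / np.maximum(Y[m], 1e-12) < 1.0 / (alpha + gamma)
        S[m] = np.where(keep, Y[m] - alpha * D, gamma * D)
    return S
```

The enhanced magnitudes would then be recombined with the noisy phase and inverted by overlap add, consistent with the remark above that phase is ignored.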
Spectral feature extraction is the main component of an ASR system. The aim of speech feature extraction is to convert the speech waveform into some type of parametric representation at a lower information rate for further analysis. Since in an ASR system what is spoken is more important than who spoke it, the aim of the parameterization step is to capture the relevant information and discard the irrelevant information.

The frequency warping scale plays the central role in relating the physical frequency of the incoming signal to the representation of that frequency by the human ear. The human ear can be modeled as a bank of band pass filters, called auditory filters, of approximately constant bandwidth at low frequencies; at higher frequencies the bandwidth increases in proportion to frequency. A number of perceptual and non-perceptual scales with different frequency mappings are used to estimate the bandwidths of the auditory filters. In this paper, the Mel scale [13], the Bark scale [16], the ERB (equivalent rectangular bandwidth) rate scale [17], and the uniform scale (U. scale) [18, 19] have been used in the parameterization.

The Mel scale models the human ear with regard to the nonlinear properties of pitch perception. The name is meant to symbolize that the scale is based on pitch comparisons, Mel being an abbreviation of melody. The Mel frequency cepstral coefficients incorporate the Mel scale, represented by the following equation:

f_Mel(f [Hz]) = 1127 ln(1 + f/700) = 2595 log10(1 + f/700)   (5)

The Mel scale is approximately linear up to 1000 Hz and logarithmic above 1 kHz [13].
The Bark frequency scale is an alternative perceptual scale to the Mel scale [13]. Speech intelligibility perception in humans begins with the spectral analysis performed by the basilar membrane (BM). Each point on the BM can be considered as a bandpass filter having a bandwidth equal to one critical bandwidth, or one Bark [16]. The bandwidths of several auditory filters were empirically observed and used to formulate the Bark scale. In the perceptual linear prediction feature extraction approach [14], the mapping between the linear frequency f (in Hertz, Hz) and the Bark frequency B (in units of Bark) is:

B(f [Hz]) = 13 arctan(0.76 f / 1000) + 3.5 arctan((f / 7500)^2)   (6)

An approximate expression for the Bark scale frequency warping (6) is used by Schroeder [21]. PLP cepstral (PLPC) analysis is based on the use of an all pole model to simulate the processing in the auditory system [14]. First the speech short time power spectrum is calculated; afterwards the Bark analysis, i.e. the transformation from the linear frequency scale to the Bark frequency scale, is performed:

B(f [Hz]) = 6 ln( f/600 + √((f/600)^2 + 1) )   (7)
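The warping functions (5)-(7) translate directly into code, as in the sketch below. Because the expression for the ERB rate scale (the paper's Eq. (8)) is missing from this copy, the commonly used Glasberg and Moore formula is taken here as an assumed stand-in after [17].

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale, Eq. (5)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale, Eq. (6) (arctangent form)."""
    return 13.0 * np.arctan(0.76 * f / 1000.0) \
         + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_bark_schroeder(f):
    """Schroeder's approximation, Eq. (7)."""
    x = f / 600.0
    return 6.0 * np.log(x + np.sqrt(x ** 2 + 1.0))

def hz_to_erb_rate(f):
    """ERB rate scale; assumed Glasberg-Moore form after [17]."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

# Compare the warpings over the 0-4000 Hz band of Fig. 2:
f = np.linspace(0.0, 4000.0, 5)
print(hz_to_mel(f), hz_to_bark(f), hz_to_erb_rate(f))
```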
… on linear filterbanks. Generally, the feature vectors are derived from the FFT magnitude spectrum by applying filterbanks. The logarithm of the energy in each filter is calculated and accumulated before a discrete cosine transform (DCT) is applied to produce the feature vector. The algorithm is detailed below:

STEP 1: First, the spectral magnitude (or energy |Ŝ(k)|^2) of an enhanced speech signal frame is calculated as

Ŝ_i = |Ŝ(i)|^2,   i = 0, 1, …, N/2   (9)

where Ŝ(k) is the N point FFT of the speech signal or of a frame of the speech signal:

Ŝ(k) = Σ_{n=0}^{N−1} ŝ[n] exp(−j2πkn/N),   k = 0, 1, …, (N − 1)   (10)

STEP 2: The energy in each band is obtained by applying the corresponding filters of the various frequency warping scales to the spectral magnitude:

E_j = Σ_{i=0}^{N/2−1} Ŝ_i h_j(i),   j = 1, 2, …, J   (11)
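A sketch of STEPs 1-2, followed by the log compression and DCT described above, might look as follows. The triangular filters placed uniformly on the warped axis are our own construction, since the filter shapes h_j(i) are not reproduced in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def warped_filterbank(n_bins, fs, n_filters, warp, inv_warp):
    """Triangular filters h_j(i) spaced uniformly on a warped scale."""
    f_bins = np.linspace(0.0, fs / 2.0, n_bins)   # FFT bin frequencies
    edges = inv_warp(np.linspace(warp(0.0), warp(fs / 2.0), n_filters + 2))
    H = np.zeros((n_filters, n_bins))
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        H[j] = np.clip(np.minimum((f_bins - lo) / (mid - lo),
                                  (hi - f_bins) / (hi - mid)), 0, None)
    return H

def cepstral_features(S_hat, H, n_ceps=13):
    """Energies E_j (Eq. 11) from the squared spectra S_hat
    (n_frames, n_bins), then log and DCT -> cepstral features."""
    E = S_hat @ H.T                               # filterbank energies
    return dct(np.log(np.maximum(E, 1e-12)), norm='ortho')[:, :n_ceps]

# Example with the Mel warping of Eq. (5); 23 filters assumed:
mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
H = warped_filterbank(257, 8000, 23, mel, imel)
```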
[Fig. 2 (a) Frequency warping curves of the Bark, ERB, Mel, and uniform scales over 0–4000 Hz; (b) the corresponding Mel, Bark, ERB, and uniform (U.) scale filterbanks plotted against frequency (Hz)]
… such as a clean training data set (training on clean speech only) and a multi condition training data set (training on clean and noisy speech). No noise was added to the clean training set, which consists of 8440 utterances recorded from 55 male and 55 female adults. 4004 utterances from 52 male and 52 female speakers are split up equally into four subsets with 1001 utterances each. In the multi condition training set, four types of non-stationary noises have been added at different SNR levels. Moreover, a stationary noise, additive white Gaussian noise (AWGN), has been added at different SNR levels. In test set A, four types of noises (car, babble, exhibition, and subway) are added to the four different clean data subsets at SNRs from −5 to 20 dB with a 5 dB step size. Thus, there are 28,028 utterances in test set A. In test set B, four different types of noises (restaurant, street, airport, and train station) are added to the same sets of clean data at the same SNR levels. In test set C, subway and street noises, filtered by a modified intermediate reference system (MIRS) filter which simulates the frequency characteristics of a telecommunication terminal, are added to the subsets of clean data. The data size of test set C is thus half that of test sets A and B, as only two types of noise are added. Experiments with training on both conditions are considered to demonstrate the efficiency of the proposed approach [22].

For the SpecSub algorithm, the analysis frame length is 25 ms with a frame shift of 10 ms, and the value of the over reduction factor is adapted frame by frame. The γ is chosen as 0.03. For the feature vectors, MFCC, BFCC, ERBCC, and UFCC filterbanks are used. For our experiments, we use 13 static features (including the 0th cepstral coefficient) augmented with their delta and double delta coefficients, making 39 dimensional feature vectors.
The HMM Tool Kit (HTK) software is used to observe the effect of the SpecSub algorithm on testing data containing both stationary and non-stationary noise types at varying SNR levels [22–24]. Simple left to right, 5 mixture HMM models were used. Each model contains
[Figure: eight panels of WRA (%) versus input SNR (−5 to 20 dB) comparing UFCC, ERBCC, MFCC, and BFCC features under car, babble, exhibition, and subway noise]
Fig. 3 Word recognition accuracy (WRA) in % for all noises of test set A. (acd) Test set A, trained on clean data; (amcd) test set A, trained on multi condition data
Table 1 Comparison of avg. word recognition accuracy (WRA) in % with and without the speech enhancement component, for the AWGN and car noise cases using MFCC

Noise type                   −5 dB    0 dB    5 dB   10 dB   15 dB   20 dB
AWGN, without enhancement     2.98    4.65   14.58   22.34   33.42   48.90
AWGN, with enhancement        9.62   15.00   30.65   50.25   55.54   62.79
Car, without enhancement      1.57    1.56    2.78    3.42   15.67   25.89
Car, with enhancement         1.59    2.78    4.97    8.97   20.92   34.86
three states, except for the closures and silence models (epi, sil, vcl, cl), which were modelled with only a single state.

The performance of an ASR is usually measured in terms of average recognition accuracy and speed. Accuracy is usually rated with the word error rate (WER), while speed is measured with the real time factor. Word errors are categorized into a number of insertions, substitutions, and deletions [25, 26]. The average recognition accuracy or rate is measured in terms of the word recognition accuracy:

WER (%) = (I + S + D) / N × 100%   (13)

WRA (%) = 100% − WER (%) = (N − S − D − I) / N × 100%   (14)

where I = number of insertions, i.e. extra words inserted in the recognized sentence, D = number of deletions, i.e. correct words omitted from the recognized sentence, S = number of substitution errors, i.e. incorrect words substituted for correct words, and N = total number of words in the reference transcription or sentence.
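Equations (13) and (14) require the counts I, S, and D, which come from a minimum edit distance alignment between the reference and recognized word sequences. A minimal sketch:

```python
def wer_counts(ref, hyp):
    """Minimum edit distance alignment of word lists -> (S, D, I)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total errors, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = (i, 0, i, 0)              # delete all i reference words
    for j in range(H + 1):
        dp[0][j] = (j, 0, 0, j)              # insert all j hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # exact match, no new error
            else:
                e, s, d, ins = dp[i - 1][j - 1]
                sub = (e + 1, s + 1, d, ins)
                e, s, d, ins = dp[i - 1][j]
                dele = (e + 1, s, d + 1, ins)
                e, s, d, ins = dp[i][j - 1]
                inse = (e + 1, s, d, ins + 1)
                dp[i][j] = min(sub, dele, inse)
    _, S, D, I = dp[R][H]
    return S, D, I

def wer_wra(ref, hyp):
    S, D, I = wer_counts(ref, hyp)
    N = len(ref)
    wer = 100.0 * (I + S + D) / N            # Eq. (13)
    return wer, 100.0 - wer                  # Eq. (14)

# e.g. one substitution and one insertion against a 3 word reference:
print(wer_wra("call home now".split(), "call phone now please".split()))
```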
Though the front end processing increases noise robustness, it sometimes degrades recognition performance under clean test conditions. This may take place because speech enhancement algorithms such as SpecSub introduce unexpected distortions into clean speech. Therefore, even though the performance of the SpecSub algorithm is considerably good under noisy environments, it is not desirable if the recognition accuracy decreases for clean speech. For this reason, we evaluated the performance of the SpecSub algorithm not only in noisy conditions, but also on the clean original Aurora II corpus databases. WRA (%) results obtained under the clean training conditions are shown in Fig. 3.

Also, we evaluated the performance of the SpecSub algorithm under noisy training conditions by using noisy speech data in the training phase. Recognition results obtained under the noisy training conditions are also shown in Fig. 3.

It is apparent from Fig. 3 (acd) and (amcd) that the best recognition accuracies are obtained in matched conditions, where the recognition system is trained with speech having the same level of noise as the test speech.

In Table 1, the comparison of WRA in % with and without the speech enhancement component for AWGN and car noise using MFCC is presented. It is clear from the table that the WRA is higher if a speech enhancement algorithm is used prior to recognition; the WRA is also higher for AWGN than for car noise.

This paper described an algorithm to recognize English speech in noisy environments and a comparative study of various frequency warped scales used in the parameterization. The speech signal, degraded by stationary and non-stationary noises, is passed through the spectral subtraction speech enhancement algorithm prior to recognition. A number of perceptual frequency warped scales and a non-perceptual scale were used, and the comparative study of the scales was based on the average recognition rate. A set of experiments was conducted in a variety of noisy environments, with and without the use of a front end signal enhancement component, to calculate the average word recognition rate. It is observed that the perceptual scale features give an advantage over the non-perceptual scale in mismatched training and operating conditions.

References

1. Cutajar M, Gatt E, Grech I, Casha O, Micallef J (2013) Comparative study of automatic speech recognition techniques. IET Signal Proc 7(1):25–46
2. Gong Y (1995) Speech recognition in noisy environments: a survey. Speech Commun 16:261–291
3. O'Shaughnessy D (2008) Invited paper: automatic speech recognition: history, methods and challenges. Pattern Recogn 41(10):2965–2979
4. Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs
5. Juang BH (1991) Speech recognition in adverse environments. Comput Speech Lang 5:275–294
6. Acero A, Stern RM (1990) Environmental robustness in automatic speech recognition. In: IEEE international conference on acoustics, speech and signal processing, Albuquerque, USA, vol 2, pp 849–852
7. Boll SF (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 27(2):113–120
8. Berouti M, Schwartz R, Makhoul J (1979) Enhancement of speech corrupted by acoustic noise. In: IEEE international conference on acoustics, speech and signal processing, pp 208–211
9. Loizou PC (2013) Speech enhancement: theory and practice, 2nd edn. CRC Press, Boca Raton
10. Ephraim Y, Malah D (1984) Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 32(6):1109–1121
11. Ephraim Y, Malah D (1985) Speech enhancement using a minimum mean square error log spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33(2):443–445
12. O'Shaughnessy D (1990) Speech communications. Addison Wesley, Boston
13. Volkmann J, Stevens SS, Newman EB (1937) A scale for the measurement of the psychological magnitude pitch. J Acoust Soc Am 8(3):208
14. Hermansky H (1990) Perceptual linear predictive (PLP) analysis of speech. J Acoust Soc Am 87(4):1738–1752
15. Leggetter C, Woodland P (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9(2):171–186
16. Zwicker E (1961) Subdivision of the audible frequency range into critical bands. J Acoust Soc Am 33(2):248
17. Moore BCJ, Glasberg BR (1996) A revision of Zwicker's loudness model. Acust Acta Acust 82:335–345
18. Rabiner LR, Schafer RW (1978) Digital processing of speech signals. Prentice Hall, Englewood Cliffs
19. Deller JR, Proakis JG, Hansen JHL (2000) Discrete time processing of speech signals. IEEE Press, New York
20. Martin R (2001) Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans Speech Audio Process 9(5):504–512
21. Schroeder MR, Atal BS, Hall JL (1979) Optimizing digital speech coders by exploiting masking properties of the human ear. J Acoust Soc Am 66(6):1647–1651
22. Hirsch HG, Pearce D (2000) The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. ISCA Tutorial and Research Workshop, Paris
23. Young S, Evermann G, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2002) The HTK book. Cambridge University Engineering Department
24. Juang BH, Rabiner LR (1991) Hidden Markov models for speech recognition. Technometrics 33(3):251–272
25. Pallett DS (1985) Performance assessment of automatic speech recognizers. J Res Natl Bureau Stand 90(5):371–385
26. Yamada T, Kumakura M, Kitawaki N (2006) Performance estimation of speech recognition system under noise conditions using objective quality measures and artificial voice. IEEE Trans Audio Speech Lang Process 14(6):2006–2013