Paper 3
https://doi.org/10.1007/s40009-017-0597-7
SHORT COMMUNICATION
Robust Recognition of English Speech in Noisy Environment Using Frequency Warped Signal…

N. Upadhyay · H. G. Rosales

Abstract  The performance level of a speech recognizer drops significantly when there is an acoustic mismatch between the training and operational environments. A speech recognizer is called robust if it preserves good recognition accuracy even under mismatch conditions. The present study addresses the recognition of English speech in noisy environments and presents a comparative study of various frequency scales used in parameterization, based on the average recognition rate. For robust automatic speech recognition, a front end signal enhancement component, the spectral subtraction algorithm, is used to prefilter the noisy input speech before it is fed to the recognizer. A number of frequency warped scales, namely the perceptual scales, viz. the Mel scale, the Bark scale, and the equivalent rectangular bandwidth rate scale, and a non-perceptual scale called the uniform scale, are used in the parameterization for feature extraction from the enhanced speech. A suite of experiments is carried out to evaluate the performance of the speech recognizer, with and without the use of the front end signal enhancement component, in a variety of noisy environments. Recognition accuracy is tested at the word linguistic level over a wide range of signal to noise ratios for both stationary and non-stationary noises.

Keywords  Speech enhancement · Spectral over subtraction · Parameterization · Frequency warped scales · Speech recognition

Correspondence: Navneet Upadhyay ([email protected])
Department of Signal Processing and Acoustics, Faculty of Electrical Engineering, Autonomous University of Zacatecas, 98000 Zacatecas, Mexico

Speech is the most dominant and convenient way of communication for humans. Therefore, methods to communicate with the computer, such as speech oriented information input and retrieval, are strongly required by modern society. The automatic speech recognizer (ASR) is used to facilitate this type of communication. ASR is the task of translating a sequence of words or other linguistic entities contained in an acoustic signal into a textual representation by means of algorithms implemented in a computer or machine. ASR has been used in a range of applications including voice dialling, call routing, data entry and dictation, content based spoken audio search, and robotics over the past decades [1–4].

The conventional ASR system typically uses a directional microphone to collect speech data; the acoustic information is sampled as a signal suitable for processing by computer and fed into a recognition process. The output of the ASR system is a hypothesis for a transcription of the utterance. However, the performance of ASRs turns out to be poor in real situations, especially when classifiers are trained in high signal to noise ratio (SNR) environments (typically SNR ≥ 30 dB) and operated in real environments of relatively lower SNR. A speech environment is characterized by a specific transmission line, and a mismatching environment can be produced by either a microphone change or the introduction of a noisy background. Significant research has been conducted on the problem of mismatch compensation for ASR. Some of the methods require the presence of stereo data (simultaneous recordings from both environments), which is rarely available in real scenarios. Other approaches do not mandate the availability of stereo data, but they require some information about the environments, such as a model or knowledge of environmental statistics [1–6]. Therefore, many researchers have recently focused on the task of
building more noise robust ASRs that can be operated in noisy environments with higher recognition accuracy or rate.

A number of approaches exist in the literature for robust speech recognition. These approaches can be classified into three broad categories: speech enhancement based, feature compensation based, and model adaptation based methods. In the first approach, a signal enhancement component is used as a preprocessor to the ASR, where a speech enhancement algorithm is applied to the noisy speech before it is fed to the recognizer. A conventional ASR trained on clean speech is then used in conjunction with features extracted from the enhanced signal. Typical speech enhancement methods, such as spectral subtraction [7, 8], Wiener filtering [9], statistical model based methods [10], and subspace algorithms [9, 11], have succeeded in improving the SNR of noisy speech, but not necessarily in improving the perceived quality or intelligibility of the speech and the recognition accuracy [12]. It seems plausible that, in optimizing the SNR, the spectrum of the speech signal is altered and some distortion is introduced which does not necessarily improve recognition accuracy. The second approach uses parameterization which is fundamentally immune to noise. These feature based approaches aim to find a transformation in the feature space that matches an already trained model. The most widely used speech recognition features are linear predictive cepstral coefficients (LPCC), Mel frequency cepstral coefficients (MFCC) [13], and perceptual linear prediction (PLP) [14]. The third is the model adaptation based approach, e.g. maximum likelihood linear regression [15], which adapts the clean trained acoustic model to better represent the noisy features. This approach tries to transform the acoustic models to reduce the mismatch, or to match the noisy speech features in a new testing environment.

The present study gives a synthetic and structural description of digital techniques for single microphone speech recognition in noisy environments. In this paper, a front end signal enhancement component, the spectral subtraction algorithm, is used to enhance the speech signal, degraded by various stationary and non-stationary noises, prior to recognition. A number of frequency warped scales, namely the perceptual scales, viz. the Mel scale, the Bark scale, and the equivalent rectangular bandwidth (ERB) rate scale, and a non-perceptual scale called the uniform scale, are used in the parameterization of the enhanced speech, and the performance of the speech recognition task is assessed based on the average recognition rate [13, 16, 17].

The block diagram of the proposed robust speech recognition system is shown in Fig. 1. ASR training is performed on clean speech; during testing, clean speech is first degraded by additive noise and then processed using the speech enhancement technique. The test signal is added with various background noises and estimated using the spectral subtraction algorithm. The features of the estimated speech signal are extracted and used for matching with the trained model. For feature extraction, a number of frequency scales, viz. the Mel, Bark, ERB rate, and uniform scales, are used in the parameterization approach [13, 16–19]. The spectral over subtraction algorithm is explained and a detailed description of the frequency scales is presented in the following subsections.

Spectral subtraction (SpecSub) is one of the most established methods for removing additive and uncorrelated noise from noisy speech [7]. The basic principle of SpecSub is to estimate the noise spectrum from non-speech regions and then subtract the estimated noise from the noisy speech to produce an enhanced speech. Assume that the original clean speech s[n] is converted to a noisy speech signal y[n] by adding acoustic background noise d[n], as y[n] = s[n] + d[n], where n is the discrete time index. The acoustic background noise is a broadband and non-stationary signal.

Speech is a time varying and, consequently, non-stationary signal. It is usually processed over short time frames within which the signal is assumed stationary. Therefore, the above model can be expressed in frame form as

y[m, n] = s[m, n] + d[m, n],   m = 1, 2, …, N;  n = 0, 1, …, (N − 1)   (1)

[Fig. 1 Block diagram of the proposed robust speech recognition system: noisy speech → pre processing → short term Fourier transform (STFT) → spectral subtraction (SpecSub) with noise estimation → Bark/Mel/ERB/uniform filterbank → log(·) non-linearity → discrete cosine transform (DCT) → feature vector]
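As a concrete illustration of the frame based model of Eq. (1) and the pre processing and STFT stages of Fig. 1, a minimal sketch is given below. The 25 ms frame length and 10 ms shift follow the analysis settings reported later for the SpecSub experiments; the function names and the 8 kHz sampling rate are our own assumptions.

```python
import numpy as np

def frame_signal(y, fs, frame_ms=25, shift_ms=10):
    """Split y[n] into overlapping frames y[m, n] (Eq. 1).

    Assumes len(y) is at least one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(y) - frame_len) // shift
    frames = np.stack([y[m * shift : m * shift + frame_len]
                       for m in range(n_frames)])
    return frames * np.hamming(frame_len)    # taper each frame

def magnitude_spectra(frames, n_fft=512, p=2):
    """|Y(m, k)|^p per frame (p=1: magnitude SpecSub, p=2: power SpecSub)."""
    return np.abs(np.fft.rfft(frames, n_fft)) ** p

fs = 8000                                    # Aurora-style 8 kHz rate assumed
y = np.random.randn(fs)                      # stand-in for a noisy utterance
Y = magnitude_spectra(frame_signal(y, fs))   # shape: (n_frames, n_fft//2 + 1)
```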
where m is the frame index and N is the length of the frame. By applying the fast Fourier transform (FFT), we get

|Y(m, k)|^p = |S(m, k)|^p + |D(m, k)|^p   (2)

where Y(m, k), S(m, k), and D(m, k) are the kth FFT coefficients of the mth frames of the noisy speech, clean speech, and noise, respectively. The exponent p determines the type of spectral subtraction; typically a value of 1 or 2 is chosen, with p = 1 yielding magnitude SpecSub and p = 2 yielding power SpecSub.

Estimation of Noise Statistics

Estimating the spectrum of the noise that forms part of the noisy speech is a central component of the SpecSub method, and many estimation methods have been proposed [20]. One of the most common methods, which is used in this paper, is given by [8]

|D̂(m, k)|^p = λ|D̂(m − 1, k)|^p + (1 − λ)|Y(m, k)|^p,   if |Y(m, k)|^p < β|D̂(m − 1, k)|^p
            = |D̂(m − 1, k)|^p,                         otherwise   (3)

where |D̂(m, k)| is the magnitude of the kth FFT coefficient of the noise estimate in the mth frame and λ (0 ≤ λ ≤ 1) is the noise updating factor. If a small λ is chosen, the estimated noise spectrum changes rapidly and may result in poor estimation. On the contrary, a large λ increases the robustness of the estimate when the noise spectrum is stationary or changes slowly with time, but it does not permit the system to follow rapid noise changes. In turn, β is the threshold parameter used to distinguish between noise and speech frames.
Estimation of Speech Spectrum

After estimation of the noise spectrum, the clean speech spectrum Ŝ(m, k) can be estimated as

|Ŝ(m, k)|^p = |Y(m, k)|^p − α(m)|D̂(m, k)|^p,   if |D̂(m, k)|^p / |Y(m, k)|^p < 1 / (α(m) + γ)
            = γ|D̂(m, k)|^p,                    otherwise   (4)

where α(m) (0 < α(m) < 3) is the over reduction factor, used to compensate for errors in the noise spectrum estimation. The value of α(m) is adapted from frame to frame depending on the segmental noisy signal to noise ratio (NSNR) of the frame. Therefore, in order to obtain better results, we need to set this parameter accurately and adaptively. The larger α(m) is, the smaller the residual wide band noise, but also the larger the speech distortion, and vice versa.

The parameter γ (0 < γ ≪ 1) is the spectral floor factor, a small positive number assuring that the estimated spectrum will not become negative. The initial noise spectrum is estimated by averaging the first few frames of the speech utterance. Phase information is ignored in spectral subtraction, as it is deemed to have little effect on intelligibility.

Musical noise, which sounds similar to rhythmic music, is an artifact of spectral subtraction that appears in the enhanced speech as a result of both over reduction and under reduction of the noise spectrum [7]. Over reduction spectral subtraction reduces the noise to some extent, but the musical noise is not completely eliminated, affecting the quality of the speech signal [8]. There is thus a tradeoff between the amount of noise reduction and speech distortion.
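The noise update rule (3) and the over subtraction rule (4) can be combined into a short routine such as the sketch below. It operates on precomputed |Y(m, k)|^p values; the fixed α used here is a stand-in for the NSNR driven, frame by frame adaptation described above, which the paper does not give in closed form.

```python
import numpy as np

def spectral_subtraction(Y, lam=0.9, beta=2.0, alpha=2.0,
                         gamma=0.03, n_init=6):
    """Estimate |S(m,k)|^p from |Y(m,k)|^p via Eqs. (3)-(4).

    Y      : array (n_frames, n_bins) of |Y(m,k)|^p values
    lam    : noise updating factor, 0 <= lam <= 1
    beta   : speech/noise decision threshold of Eq. (3)
    alpha  : over reduction factor (fixed here; the paper adapts
             it per frame from the segmental NSNR)
    gamma  : spectral floor factor, 0 < gamma << 1
    n_init : initial noise estimate averaged over the first frames
    """
    D = Y[:n_init].mean(axis=0)               # initial noise spectrum
    S = np.empty_like(Y)
    for m in range(len(Y)):
        # Eq. (3): update noise estimate only in noise-like bins
        noise_like = Y[m] < beta * D
        D = np.where(noise_like, lam * D + (1 - lam) * Y[m], D)
        # Eq. (4): over subtraction with spectral flooring
        keep = D / np.maximum(Y[m], 1e-12) < 1.0 / (alpha + gamma)
        S[m] = np.where(keep, Y[m] - alpha * D, gamma * D)
    return S
```

The enhanced magnitudes would then be recombined with the noisy phase and inverted by overlap add, consistent with the remark above that phase is ignored.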
Spectral feature extraction is the main component of an ASR system. The aim of speech feature extraction is to convert the speech waveform into some type of parametric representation at a lower information rate for further analysis. Since in an ASR system what is spoken is more important than who spoke it, the aim of the parameterization step is to capture the relevant information and discard the irrelevant information.

The frequency warping scale plays the central role in relating the physical frequency of the incoming signal to the representation of that frequency by the human ear. The human ear can be modeled as a bank of band pass filters, called auditory filters, of approximately constant bandwidth at low frequencies; at higher frequencies the bandwidth increases in proportion to frequency. A number of perceptual and non-perceptual scales with different frequency mappings are used to estimate the bandwidths of the auditory filters. In this paper, the Mel scale [13], the Bark scale [16], the ERB (equivalent rectangular bandwidth) rate scale [17], and the uniform scale (U. scale) [18, 19] have been used in the parameterization.

The Mel scale models the human ear with regard to the nonlinear properties of pitch perception. The name is meant to symbolize that the scale is based on pitch comparisons, Mel being an abbreviation of melody. The Mel frequency cepstral coefficients incorporate the Mel scale, represented by the following equation:

f_Mel(f [Hz]) = 1127 ln(1 + f/700) = 2595 log10(1 + f/700)   (5)

The Mel scale is approximately linear up to 1000 Hz and logarithmic above 1 kHz [13].
The Bark frequency scale is an alternative perceptual scale to the Mel scale [13]. Speech intelligibility perception in humans begins with the spectral analysis performed by the basilar membrane (BM). Each point on the BM can be considered as a bandpass filter having a bandwidth equal to one critical bandwidth, or one Bark [16]. The bandwidths of several auditory filters were empirically observed and used to formulate the Bark scale. In the perceptual linear prediction feature extraction approach [14], the mapping between the linear frequency f (in Hertz, Hz) and the Bark frequency B (in units of Bark) is:

B(f [Hz]) = 13 arctan(0.76 f / 1000) + 3.5 arctan((f / 7500)^2)   (6)

An approximate expression for the Bark scale frequency warping (6) is used by Schroeder [21]. PLP cepstral (PLPC) analysis is based on the use of an all pole model to simulate the processing in the auditory system [14]. First the speech short time power spectrum is calculated; afterwards the Bark analysis, i.e. the transformation from the linear frequency scale to the Bark frequency scale, is performed:

B(f [Hz]) = 6 ln( f/600 + √((f/600)^2 + 1) )   (7)
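The warping functions (5)-(7) translate directly into code, as in the sketch below. Because the expression for the ERB rate scale (the paper's Eq. (8)) is missing from this copy, the commonly used Glasberg and Moore formula is taken here as an assumed stand-in after [17].

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale, Eq. (5)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale, Eq. (6) (arctangent form)."""
    return 13.0 * np.arctan(0.76 * f / 1000.0) \
         + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_bark_schroeder(f):
    """Schroeder's approximation, Eq. (7)."""
    x = f / 600.0
    return 6.0 * np.log(x + np.sqrt(x ** 2 + 1.0))

def hz_to_erb_rate(f):
    """ERB rate scale; assumed Glasberg-Moore form after [17]."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

# Compare the warpings over the 0-4000 Hz band of Fig. 2:
f = np.linspace(0.0, 4000.0, 5)
print(hz_to_mel(f), hz_to_bark(f), hz_to_erb_rate(f))
```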
… on linear filterbanks. Generally, the feature vectors are derived from the FFT magnitude spectrum by applying filterbanks. The logarithm of the energy in each filter is calculated and accumulated before a discrete cosine transform (DCT) is applied to produce the feature vector. The algorithm is detailed below:

STEP 1: First, the spectral magnitude (or energy |Ŝ(k)|^2) of an enhanced speech signal frame is calculated as

Ŝ_i = |Ŝ(i)|^2,   i = 0, 1, …, N/2   (9)

where Ŝ(k) is the N point FFT of the speech signal or of a frame of the speech signal:

Ŝ(k) = Σ_{n=0}^{N−1} ŝ[n] exp(−j2πkn/N),   k = 0, 1, …, (N − 1)   (10)

STEP 2: The energy in each band is obtained by applying the corresponding filters of the various frequency warping scales to the spectral magnitude:

E_j = Σ_{i=0}^{N/2−1} Ŝ_i h_j(i),   j = 1, 2, …, J   (11)
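A sketch of STEPs 1-2, followed by the log compression and DCT described above, might look as follows. The triangular filters placed uniformly on the warped axis are our own construction, since the filter shapes h_j(i) are not reproduced in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def warped_filterbank(n_bins, fs, n_filters, warp, inv_warp):
    """Triangular filters h_j(i) spaced uniformly on a warped scale."""
    f_bins = np.linspace(0.0, fs / 2.0, n_bins)   # FFT bin frequencies
    edges = inv_warp(np.linspace(warp(0.0), warp(fs / 2.0), n_filters + 2))
    H = np.zeros((n_filters, n_bins))
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        H[j] = np.clip(np.minimum((f_bins - lo) / (mid - lo),
                                  (hi - f_bins) / (hi - mid)), 0, None)
    return H

def cepstral_features(S_hat, H, n_ceps=13):
    """Energies E_j (Eq. 11) from the squared spectra S_hat
    (n_frames, n_bins), then log and DCT -> cepstral features."""
    E = S_hat @ H.T                               # filterbank energies
    return dct(np.log(np.maximum(E, 1e-12)), norm='ortho')[:, :n_ceps]

# Example with the Mel warping of Eq. (5); 23 filters assumed:
mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
H = warped_filterbank(257, 8000, 23, mel, imel)
```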
[Fig. 2 (a) Frequency warping curves of the Bark, ERB, Mel, and uniform scales over 0–4000 Hz; (b) the corresponding Mel, Bark, ERB, and uniform (U.) scale filterbanks plotted against frequency (Hz)]
… such as a clean training data set (training on clean speech only) and a multi condition training data set (training on clean and noisy speech). No noise was added to the clean training set, which consists of 8440 utterances recorded from 55 male and 55 female adults. 4004 utterances from 52 male and 52 female speakers are split up equally into four subsets with 1001 utterances each. In the multi condition training set, four types of non-stationary noises have been added at different SNR levels. Moreover, a stationary noise, additive white Gaussian noise (AWGN), has been added at different SNR levels. In test set A, four types of noises (car, babble, exhibition, and subway) are added to the four different clean data subsets at SNRs from −5 to 20 dB with a 5 dB step size. Thus, there are 28,028 utterances in test set A. In test set B, four different types of noises (restaurant, street, airport, and train station) are added to the same sets of clean data at the same SNR levels. In test set C, subway and street noises, filtered by a modified intermediate reference system (MIRS) filter which simulates the frequency characteristics of a telecommunication terminal, are added to the subsets of clean data. The data size of test set C is thus half that of test sets A and B, as only two types of noise are added. Experiments with training on both conditions are considered to demonstrate the efficiency of the proposed approach [22].

For the SpecSub algorithm, the analysis frame length is 25 ms with a frame shift of 10 ms, and the value of the over reduction factor is adapted frame by frame. The γ is chosen as 0.03. For the feature vectors, MFCC, BFCC, ERBCC, and UFCC filterbanks are used. For our experiments, we use 13 static features (including the 0th cepstral coefficient) augmented with their delta and double delta coefficients, making 39 dimensional feature vectors.
The HMM Tool Kit (HTK) software is used to observe the effect of the SpecSub algorithm on testing data containing both stationary and non-stationary noise types at varying SNR levels [22–24]. Simple left to right, 5 mixture HMM models were used. Each model contains
[Figure: eight panels of WRA (%) versus input SNR (−5 to 20 dB) comparing UFCC, ERBCC, MFCC, and BFCC features under car, babble, exhibition, and subway noise]
Fig. 3 Word recognition accuracy (WRA) in % for all noises of test set A. (acd) Test set A, trained on clean data; (amcd) test set A, trained on multi condition data
Table 1 Comparison of avg. word recognition accuracy (WRA) in % with and without the speech enhancement component, for the AWGN and car noise cases using MFCC

Noise type                   −5 dB    0 dB    5 dB   10 dB   15 dB   20 dB
AWGN, without enhancement     2.98    4.65   14.58   22.34   33.42   48.90
AWGN, with enhancement        9.62   15.00   30.65   50.25   55.54   62.79
Car, without enhancement      1.57    1.56    2.78    3.42   15.67   25.89
Car, with enhancement         1.59    2.78    4.97    8.97   20.92   34.86
three states, except for the closures and silence models (epi, sil, vcl, cl), which were modelled with only a single state.

The performance of an ASR is usually measured in terms of average recognition accuracy and speed. Accuracy is usually rated with the word error rate (WER), while speed is measured with the real time factor. Word errors are categorized into a number of insertions, substitutions, and deletions [25, 26]. The average recognition accuracy or rate is measured in terms of the word recognition accuracy:

WER (%) = (I + S + D) / N × 100%   (13)

WRA (%) = 100% − WER (%) = (N − S − D − I) / N × 100%   (14)

where I = number of insertions, i.e. extra words inserted in the recognized sentence, D = number of deletions, i.e. correct words omitted from the recognized sentence, S = number of substitution errors, i.e. incorrect words substituted for correct words, and N = total number of words in the reference transcription or sentence.
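Equations (13) and (14) require the counts I, S, and D, which come from a minimum edit distance alignment between the reference and recognized word sequences. A minimal sketch:

```python
def wer_counts(ref, hyp):
    """Minimum edit distance alignment of word lists -> (S, D, I)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total errors, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = (i, 0, i, 0)              # delete all i reference words
    for j in range(H + 1):
        dp[0][j] = (j, 0, 0, j)              # insert all j hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # exact match, no new error
            else:
                e, s, d, ins = dp[i - 1][j - 1]
                sub = (e + 1, s + 1, d, ins)
                e, s, d, ins = dp[i - 1][j]
                dele = (e + 1, s, d + 1, ins)
                e, s, d, ins = dp[i][j - 1]
                inse = (e + 1, s, d, ins + 1)
                dp[i][j] = min(sub, dele, inse)
    _, S, D, I = dp[R][H]
    return S, D, I

def wer_wra(ref, hyp):
    S, D, I = wer_counts(ref, hyp)
    N = len(ref)
    wer = 100.0 * (I + S + D) / N            # Eq. (13)
    return wer, 100.0 - wer                  # Eq. (14)

# e.g. one substitution and one insertion against a 3 word reference:
print(wer_wra("call home now".split(), "call phone now please".split()))
```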
Though the front end processing increases noise robustness, it sometimes degrades recognition performance under clean test conditions. This may take place because speech enhancement algorithms such as SpecSub introduce unexpected distortions into clean speech. Therefore, even though the performance of the SpecSub algorithm is considerably good under noisy environments, it is not desirable if the recognition accuracy decreases for clean speech. For this reason, we evaluated the performance of the SpecSub algorithm not only in noisy conditions, but also on the clean original Aurora II corpus databases. WRA (%) results obtained under the clean training conditions are shown in Fig. 3.

Also, we evaluated the performance of the SpecSub algorithm under noisy training conditions by using noisy speech data in the training phase. Recognition results obtained under the noisy training conditions are also shown in Fig. 3.

It is apparent from Fig. 3 (acd) and (amcd) that the best recognition accuracies are obtained in matched conditions, where the recognition system is trained with speech having the same level of noise as the test speech.

In Table 1, the comparison of WRA in % with and without the speech enhancement component for AWGN and car noise using MFCC is presented. It is clear from the table that the WRA is higher if a speech enhancement algorithm is used prior to recognition; the WRA is also higher for AWGN than for car noise.

This paper described an algorithm to recognize English speech in noisy environments and a comparative study of various frequency warped scales used in the parameterization. The speech signal, degraded by stationary and non-stationary noises, is passed through the spectral subtraction speech enhancement algorithm prior to recognition. A number of perceptual frequency warped scales and a non-perceptual scale were used, and the comparative study of the scales was based on the average recognition rate. A set of experiments was conducted in a variety of noisy environments, with and without the use of a front end signal enhancement component, to calculate the average word recognition rate. It is observed that the perceptual scale features give an advantage over the non-perceptual scale in mismatched training and operating conditions.

References

1. Cutajar M, Gatt E, Grech I, Casha O, Micallef J (2013) Comparative study of automatic speech recognition techniques. IET Signal Proc 7(1):25–46
2. Gong Y (1995) Speech recognition in noisy environments: a survey. Speech Commun 16:261–291
3. O'Shaughnessy D (2008) Invited paper: automatic speech recognition: history, methods and challenges. Pattern Recogn 41(10):2965–2979
4. Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs
5. Juang BH (1991) Speech recognition in adverse environments. Comput Speech Lang 5:275–294
6. Acero A, Stern RM (1990) Environmental robustness in automatic speech recognition. In: IEEE international conference on acoustics, speech and signal processing, Albuquerque, USA, vol 2, pp 849–852
7. Boll SF (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 27(2):113–120
8. Berouti M, Schwartz R, Makhoul J (1979) Enhancement of speech corrupted by acoustic noise. In: IEEE international conference on acoustics, speech and signal processing, pp 208–211
9. Loizou PC (2013) Speech enhancement: theory and practice, 2nd edn. CRC Press, Boca Raton
10. Ephraim Y, Malah D (1984) Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 32(6):1109–1121
11. Ephraim Y, Malah D (1985) Speech enhancement using a minimum mean square error log spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33(2):443–445
12. O'Shaughnessy D (1990) Speech communications. Addison Wesley, Boston
13. Volkmann J, Stevens SS, Newman EB (1937) A scale for the measurement of the psychological magnitude pitch. J Acoust Soc Am 8(3):208
14. Hermansky H (1990) Perceptual linear predictive (PLP) analysis of speech. J Acoust Soc Am 87(4):1738–1752
15. Leggetter C, Woodland P (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9(2):171–186
16. Zwicker E (1961) Subdivision of the audible frequency range into critical bands. J Acoust Soc Am 33(2):248
17. Moore BCJ, Glasberg BR (1996) A revision of Zwicker's loudness model. Acust Acta Acust 82:335–345
18. Rabiner LR, Schafer RW (1978) Digital processing of speech signals. Prentice Hall, Englewood Cliffs
19. Deller JR, Proakis JG, Hansen JHL (2000) Discrete time processing of speech signals. IEEE Press, New York
20. Martin R (2001) Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans Speech Audio Process 9(5):504–512
21. Schroeder MR, Atal BS, Hall JL (1979) Optimizing digital speech coders by exploiting masking properties of the human ear. J Acoust Soc Am 66(6):1647–1651
22. Hirsch HG, Pearce D (2000) The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. ISCA Tutorial and Research Workshop, Paris
23. Young S, Evermann G, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2002) The HTK book. Cambridge University Engineering Department
24. Juang BH, Rabiner LR (1991) Hidden Markov models for speech recognition. Technometrics 33(3):251–272
25. Pallett DS (1985) Performance assessment of automatic speech recognizers. J Res Natl Bureau Stand 90(5):371–385
26. Yamada T, Kumakura M, Kitawaki N (2006) Performance estimation of speech recognition system under noise conditions using objective quality measures and artificial voice. IEEE Trans Audio Speech Lang Process 14(6):2006–2013