Speech Enhancement Using a DNN-Augmented Colored-Noise Kalman Filter
Journal Pre-proof
PII: S0167-6393(20)30283-1
DOI: https://doi.org/10.1016/j.specom.2020.10.007
Reference: SPECOM 2744
Please cite this article as: Hongjiang Yu, Wei-Ping Zhu, Benoit Champagne, Speech Enhancement Using a DNN-Augmented Colored-Noise Kalman Filter, Speech Communication (2020), doi: https://doi.org/10.1016/j.specom.2020.10.007
HIGHLIGHTS
• The colored-noise Kalman filter is adopted in our system, which is more competent in dealing with complex noise and alleviates speech distortion.
• A multi-objective DNN is first employed to jointly estimate the parameters of the clean speech autoregressive (AR) model and the noise AR model. Two kinds of DNN, i.e., the fully-connected feed-forward network (FNN) and the long short-term memory (LSTM), are adopted.
• A post subtraction technique is employed to further remove the residual noise in the Kalman-filtered speech.
• The proposed system takes advantage of both the DNN-based method and Kalman filtering, and has a good generalization capability in both seen and unseen noise environments.
Abstract—In this paper, we propose a new speech enhancement system using a deep neural network (DNN)-augmented colored-noise Kalman filter. In our system, both clean speech and noise are modelled as autoregressive (AR) processes, whose parameters comprise the linear prediction coefficients (LPCs) and the driving noise variances. The LPCs are obtained through training a multi-objective DNN that learns the mapping from the noisy acoustic features to the line spectrum frequencies (LSFs), while the driving noise variances are obtained by solving an optimization problem aiming to minimize the difference between the modelled and observed AR spectra of the noisy speech. The colored-noise Kalman filter with DNN-estimated parameters is then applied to the noisy speech for denoising. Finally, a post-subtraction technique is adopted to further remove the residual noise in the Kalman-filtered speech. Extensive computer simulations show that the proposed speech enhancement system achieves significant performance gains when compared to conventional Kalman filter based algorithms as well as recent DNN-based methods under both seen and unseen noise conditions.

Index Terms—speech enhancement, deep neural network, colored-noise Kalman filter, spectral subtraction

H. Yu and W.-P. Zhu are with the Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec, Canada. E-mail: ho [email protected].
B. Champagne is with the Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada.

I. INTRODUCTION

Speech enhancement, which aims to suppress the background noise and improve the quality and intelligibility of a speech signal, has been widely adopted as a pre-processing means in a variety of speech-related applications to provide a better user experience. Numerous speech enhancement techniques have been proposed in the literature over the past decades, but due to their limited performance, the problem continues to be intensively studied.

Spectral subtraction [1], one of the earliest techniques for speech enhancement, modifies the noisy speech power spectrum by subtracting the estimated noise power spectrum. Although spectral subtraction is easy to employ, the difficulty in accurately estimating the noise spectrum hinders the enhancement performance. Extra distortion, such as musical noise, can degrade the perceptual quality of the enhanced speech if the noise spectrum is not accurately estimated. More flexible spectral subtraction algorithms with better performance were proposed in [2, 3], where two techniques, i.e., the use of an oversubtraction factor and a spectral flooring parameter, were introduced along with the standard spectral subtraction. These techniques are used to adjust the estimated noise spectrum, and thereby control the ratio of the remaining residual noise and perceived musical noise in the enhanced speech. In [4, 5], a multiband spectral subtraction was proposed based on the fact that the noise affects the speech at different levels depending on the frequency band. In the multiband approach, the speech spectrum is divided into several non-overlapping frequency bands, and spectral subtraction is then performed independently in each band.

The statistical filter based speech enhancement methods have also received considerable attention. Wiener filtering, one of the most famous algorithms in this class, aims to find the minimum mean square error (MMSE) estimate of the clean speech's discrete Fourier transform (DFT) coefficients [6–8]. Compared with spectral subtraction, Wiener filtering introduces less distortion in the enhanced speech. However, Wiener filters are derived under the assumption that the processed signals are stationary, which is rarely satisfied in real-world applications. Kalman filters, which can handle non-stationary signals, have therefore attracted the interest of speech enhancement researchers [9]. In this context, the Kalman filter can be viewed as a time-domain, sequential linear MMSE estimator of the noise-corrupted speech, in which the clean speech is characterized by a dynamical or state-space model, such as the autoregressive (AR) model. As such, the enhancement performance is largely dependent on the estimation accuracy of the AR parameters, which include the linear prediction coefficients (LPCs) and the variances of the driving and observation noises.

Ideally, the AR parameters of the clean speech can lead to excellent performance of the Kalman filter [9], but they are not accessible in practice. Therefore, various estimation algorithms have been proposed to obtain the above parameters from the noisy speech, which can be divided into two categories: online estimation [10–13] and offline estimation [14, 15]. The former algorithms usually estimate and update the denoised speech and the model parameters in an iterative manner, while the latter require a training stage on a clean speech database to predict the parameters beforehand. To further improve the speech enhancement performance, several advanced versions of Kalman filters have been proposed. For example, the subband Kalman filtering technique [16–19] divides the noisy speech into several contiguous frequency bands, and performs Kalman filtering separately as the noise level varies dynamically in each band. The improved Kalman filter in [20, 21] models both clean speech and noise as AR processes, and achieves better performance in colored noise environments. The perceptual Kalman filter [22, 23] incorporates an additional post-filter to further remove the residual noise by scaling the estimation error of the Kalman filter below the masking threshold.
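Returning to the spectral-subtraction techniques discussed above, the oversubtraction-and-flooring idea of [2, 3] can be sketched in a few lines. This is an illustrative sketch of the generic technique only; the function name and the parameter values are our assumptions, not those of the cited works.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=4.0, beta=0.01):
    """Power spectral subtraction with an oversubtraction factor (alpha)
    and a spectral flooring parameter (beta). Values are illustrative."""
    noisy_pow = noisy_mag ** 2
    noise_pow = noise_mag ** 2
    # Subtract an inflated noise estimate to suppress residual spectral peaks.
    clean_pow = noisy_pow - alpha * noise_pow
    # Floor the result at a small fraction of the noise power so that
    # over-subtracted bins do not turn into isolated musical-noise spikes.
    clean_pow = np.maximum(clean_pow, beta * noise_pow)
    return np.sqrt(clean_pow)
```

The floor keeps every bin strictly positive, trading a small residual noise level for fewer of the isolated spectral spikes that are perceived as musical noise.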
In recent years, deep learning, and especially the deep neural network (DNN), has been successfully applied in many areas. Compared with the unsupervised statistical filter based methods, the use of DNN for speech enhancement offers several advantages: (1) a powerful learning capability to model various non-linear mapping relationships; (2) no reliance on assumptions about the statistical properties of the speech and noise; and (3) no specific need for noise spectrum estimation. Early works in this area, e.g., [24, 25], employed DNN to directly estimate the clean speech magnitude spectrum, where the DNN acts as a regression model to implement a mapping function between the log-power spectra (LPS) of the noisy and clean speech signals. Subsequent works sought to estimate ratio masks via DNN-based approaches, and then remove the background noise in the spectral domain by means of the estimated masks [26–31]. For instance, in [26], the ideal ratio mask (IRM) is predicted by a DNN and then applied to the noisy magnitude spectrum to recover the desired speech signal. Nonetheless, deep learning algorithms require large training databases to improve their generalization capability [25]. Since the statistical filter based methods can reduce different kinds and levels of noise to a sensible extent in a variety of situations, researchers have recently turned their attention to the combination of DNN and statistical filter based approaches [32–36]. Indeed, the learning capability of the former makes it possible to boost speech enhancement performance under various conditions, while the latter helps better exploit the generalization capability of the enhancement system by providing an appropriate structural framework. Recently, in [35], we proposed a DNN-augmented Kalman filter for speech enhancement, where the DNN is trained to predict the AR parameters needed for Kalman filtering. Experiments have shown that the AR parameters estimated in this way are less sensitive to various types of noise, leading to a better enhancement performance than the subband iterative Kalman filter algorithm [17]. However, the enhanced speech still suffers from distortion at higher frequencies, partly due to the inaccurate estimation of the additive noise and its harmful effects on the conventional Kalman filter.

In this paper, we propose a novel speech enhancement system consisting of a colored-noise Kalman filter augmented with DNN-based parameter estimation, where both clean speech and noise are modelled as AR processes. In our system, a multi-objective DNN is first employed to estimate the line spectrum frequencies (LSFs), which are used for the representation of the LPC parameters in these models. Two kinds of DNN are used in this work, i.e., the fully-connected feed-forward DNN (denoted as FNN) [32] and the long short-term memory (LSTM) [37]. The driving noise variances for the clean speech process and the noise process are obtained by solving an optimization problem as in [8]. The multi-objective DNN training is beneficial as it can simultaneously estimate the AR parameters of the clean speech and noise with a lower computational complexity, while providing more accurate estimates under noisy conditions. Subsequently, the colored-noise Kalman filter with the DNN-estimated AR parameters is applied to the noisy speech for denoising. Finally, a post subtraction technique is employed to further remove the residual noise in the Kalman-filtered speech, which is caused by the parameter estimation error. Through exhaustive computer simulations, it is shown that the proposed system can not only significantly improve the performance of Kalman filtering in speech enhancement, but also offer a good generalization capability in both seen and unseen noise conditions.

The rest of the paper is organized as follows. Section 2 summarizes our previous work on DNN-based Kalman filtering. Section 3 presents the newly proposed speech enhancement system with the DNN-augmented colored-noise Kalman filter, including a detailed description of its main components. Section 4 presents a series of experiments to assess the system performance. Section 5 concludes the paper.

II. RELATED WORK

Herein, we briefly review our previous work on speech enhancement using DNN and Kalman filtering [35], where the DNN is employed to estimate the AR parameters in the conventional Kalman filter.

A. Conventional Kalman Filter

Consider the noisy speech y(n) as an additive mixture of the clean speech s(n) and the background noise w(n),

y(n) = s(n) + w(n)    (1)

where n ∈ N is the discrete time index. As usual, w(n) is regarded as a zero-mean white noise with variance σw², uncorrelated with s(n). The clean speech s(n) is usually represented by a linear model as a dynamic process of speech production. For the widely-adopted AR model, we have

s(n) = Σ_{i=1}^{p} as,i s(n−i) + v(n)    (2)

where as,i are the LPCs of the clean speech, p the order of the model, and v(n) the driving noise, i.e., a zero-mean white noise with variance σv².

To facilitate the Kalman filter presentation for speech enhancement, the above model equations for s(n) and y(n) can be rewritten in matrix form as,

s(n) = Fs s(n−1) + Gs v(n)
y(n) = HsT s(n) + w(n)    (3)

where s(n) = [s(n−p+1), . . . , s(n−1), s(n)]T denotes the speech state vector. Moreover, the transition matrix Fs is given by

     ⎡ 0      1      ···  0     0    ⎤
     ⎢ 0      0      ···  0     0    ⎥
Fs = ⎢ ⋮      ⋮      ⋱    ⋮     ⋮    ⎥    (4)
     ⎢ 0      0      ···  0     1    ⎥
     ⎣ as,p   as,p−1 ···  as,2  as,1 ⎦

and Hs = Gs = [0, · · · , 0, 1]T ∈ Rp.

The denoising process with a Kalman filter amounts to recursively calculating an unbiased, linear MMSE estimate of
the state vector s(n), given the corrupted speech y(n). This process can be summarized by the following equations:

e(n) = y(n) − HsT ŝ(n|n−1)
K(n) = P(n|n−1) Hs [σw² + HsT P(n|n−1) Hs]⁻¹
ŝ(n|n) = ŝ(n|n−1) + K(n) e(n)
P(n|n) = [I − K(n) HsT] P(n|n−1)
ŝ(n+1|n) = Fs ŝ(n|n)
P(n+1|n) = Fs P(n|n) FsT + σv² Gs GsT    (5)

where ŝ(n|n−1) is the a priori estimate of the current state vector s(n) given observations up to time index n−1, i.e., y(1), ..., y(n−1), P(n|n−1) the predicted state error correlation matrix of ŝ(n|n−1), e(n) the innovation, K(n) the Kalman gain matrix, ŝ(n|n) the filtered estimate of the state vector s(n), and P(n|n) the filtered state error covariance matrix of ŝ(n|n). The denoised speech ŝ(n) is finally given by

ŝ(n) = GsT ŝ(n|n).    (6)

B. Parameter Estimation

We note that several parameters appearing in the above equations should be estimated or calculated from the noisy observations in order to perform Kalman filtering. Those parameters include the driving noise variance σv², the additive noise variance σw², and the transition matrix Fs, which contains the LPCs of the clean speech model.

In our previous work [35], an FNN is adopted for the LPCs prediction. More specifically, the LPCs of the noisy speech and of the clean speech are first calculated and then converted into their representative LSFs, which are used as input features and output targets of the DNN, respectively. Using LSFs instead of LPCs offers a more stable DNN training process [14], due to the relatively well-behaved dynamic range of LSFs. The well-trained FNN can learn the non-linear relationship between the noisy LSFs and the clean ones. Finally, the estimated LSFs are transformed back to LPCs, as required in the transition matrix Fs needed to perform Kalman filtering.

The variance σw² of the additive noise w(n) is usually estimated and updated during the unvoiced frames. The calculation involves a voice activity detection (VAD) procedure [38] to detect whether a given speech frame is voiced or unvoiced. The variance of the driving noise v(n) can then be estimated as:

σv² = σy² − σw² = E[y²(n)] − ryT ay − σw²    (7)

where ay = [ay,1, · · · , ay,p]T is the LPC vector of the noisy speech, and ry = E[y(n)y(n)] the autocorrelation vector of the noisy speech y(n) with its past p samples, represented by the vector y(n) = [y(n−1), . . . , y(n−p)]T.

Although the performance of the conventional Kalman filter method for speech enhancement has been improved notably by using the FNN for parameter estimation, several limitations have been identified. Firstly, the additional VAD procedure needed for the estimation of the additive noise variance increases the computational and structural complexity of the system. In addition, accurately detecting the unvoiced frames remains a difficult task, and the detection errors lead to inaccurate variance estimation of the additive noise, which brings further distortion to the enhanced speech.

III. PROPOSED SYSTEM

To counter the difficulties posed by the VAD procedure and improve the accuracy of the variance estimation, we propose a hybrid speech enhancement system that combines DNN-based parameter estimation with a colored-noise Kalman filter. The overall block diagram of our new system is depicted in Fig. 1; it is composed of two stages, namely the training stage and the enhancement stage. In the training stage, the input feature set to the DNN consists of the combination of the noisy speech LSFs along with four acoustic features from [39]. The output targets are the LSFs of both the clean speech and the noise. Then, a multi-objective DNN is trained to learn the mapping from the noisy input feature set to the targets. In the enhancement stage, given a noisy speech signal, we first obtain the input feature set, and then process it by the trained DNN to predict the clean speech LSFs and noise LSFs. The estimated LPCs are then obtained from the LSFs, and applied to both variance estimation and Kalman filtering. Subsequently, the noisy speech is enhanced by the colored-noise Kalman filter. This operation is followed by a post subtraction to further remove the residual noise in the filtered speech. The key components and steps involved in the proposed system are described in further detail below.

A. Colored-Noise Kalman Filter

As mentioned before, in a conventional Kalman filter the clean speech is modelled as an AR process, while the additive noise is assumed to be white, which is not suitable for the complex noises encountered in real-world environments. To overcome this limitation, we herein adopt the colored-noise Kalman filter. In this method, the additive noise w(n) in (1) is now modelled as an AR process, expressed as,

w(n) = Σ_{i=1}^{q} aw,i w(n−i) + z(n)    (8)

where aw,i are the LPCs of the colored noise, q the order of the AR model, and z(n) the zero-mean white driving noise with variance σz².

The underlying AR signal model in the colored-noise Kalman filter can be conveniently incorporated into the following state-space matrix form,

x(n) = F x(n−1) + G u(n)
y(n) = HT x(n)    (9)

where x(n) = [s(n), w(n)]T is the (p+q)-dimensional concatenated state vector constituted by the clean speech vector s(n) = [s(n−p+1), . . . , s(n−1), s(n)] together with the noise vector w(n) = [w(n−q+1), . . . , w(n−1), w(n)], and u(n) = [v(n), z(n)]T is the concatenated driving noise vector.
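To make the concatenated state-space model concrete, the sketch below builds F, G and H for small illustrative AR orders and runs the associated Kalman recursion, with the true model parameters assumed known (in the proposed system they are instead supplied by the DNN and the variance estimator). The helper names are ours, not the paper's; note that the gain has no observation-noise term, since the measurement equation y(n) = HT x(n) is noise-free.

```python
import numpy as np

def companion(a):
    """Companion transition matrix for an AR model with LPCs a = [a1,...,ap]."""
    p = len(a)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)      # shift old samples up
    F[-1, :] = a[::-1]              # last row holds [ap, ..., a1]
    return F

def colored_noise_kf(y, a_s, sig_v2, a_w, sig_z2):
    """Kalman filtering with the concatenated state x(n) = [s(n); w(n)]."""
    p, q = len(a_s), len(a_w)
    F = np.zeros((p + q, p + q))
    F[:p, :p] = companion(a_s)
    F[p:, p:] = companion(a_w)
    G = np.zeros((p + q, 2)); G[p - 1, 0] = 1.0; G[-1, 1] = 1.0
    H = np.zeros((p + q, 1)); H[p - 1] = 1.0; H[-1] = 1.0   # y(n) = s(n) + w(n)
    Qu = np.diag([sig_v2, sig_z2])
    x = np.zeros((p + q, 1)); P = np.eye(p + q)             # x(0|0) = 0, P(0|0) = I
    s_hat = np.empty(len(y))
    for n, yn in enumerate(y):
        e = yn - float(H.T @ x)                             # innovation
        K = P @ H / float(H.T @ P @ H)                      # Kalman gain
        x = x + K * e                                       # filtered state
        P = (np.eye(p + q) - K @ H.T) @ P
        s_hat[n] = x[p - 1, 0]                              # filtered speech sample
        x = F @ x                                           # time update
        P = F @ P @ F.T + G @ Qu @ G.T
    return s_hat
```

On a synthetic AR(2) "speech" embedded in AR(1) "noise" with the true parameters, the filtered output has a visibly lower error than the raw observation; with estimated parameters the gap naturally narrows.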
Fig. 1: Block diagram of the proposed speech enhancement system using the DNN-augmented colored-noise Kalman filter.
Given a noisy observation y(n), the estimate of the state vector x(n) can be obtained by the following Kalman filtering recursive equations:

e(n) = y(n) − HT x̂(n|n−1)
K(n) = P(n|n−1) H [HT P(n|n−1) H]⁻¹
x̂(n|n) = x̂(n|n−1) + K(n) e(n)
P(n|n) = [I − K(n) HT] P(n|n−1)
x̂(n+1|n) = F x̂(n|n)
P(n+1|n) = F P(n|n) FT + G Qu GT    (12)

where e(n) is the innovation, K(n) the Kalman gain matrix, x̂(n|n) the filtered estimate of the state vector x(n), and x̂(n|n−1) the a priori estimate of the state vector x(n). P(n|n) is the filtered state error covariance matrix, and P(n|n−1) the predicted state error correlation matrix. Qu is the covariance matrix of the driving noise vector u(n), which is given by

Qu = E[u(n)u(n)T] = ⎡ σv²  0  ⎤
                    ⎣ 0    σz² ⎦    (13)

The denoised speech is the output of the colored-noise Kalman filter.

B. DNN-based LSFs Estimation

Recently, we have demonstrated that FNN offers a convenient means for LSFs estimation in speech processing applications [35]. Here, we propose to employ two different networks, i.e., FNN and LSTM, to predict both the clean speech LSFs and noise LSFs. The specific configuration of each network is described in Section IV-B.

For the input features, we extract 12-dimensional LSFs along with several complementary features from the noisy speech, in order to collect more information about the speech characteristics. Specifically, the following additional acoustic features are utilized: the 15-dimensional amplitude modulation spectrum (AMS); the 31-dimensional relative spectral transform and perceptual linear prediction (RASTA-PLP); the 13-dimensional Mel-frequency cepstral coefficients (MFCC) and their deltas; and the 64-dimensional Gammatone filterbank energies (GF) and their deltas [39]. The total dimension of the input feature set is thus 258, i.e., 12 + 2 × (15 + 31 + 13 + 64).

The input features are computed for each frame of the noisy speech, and represented as a row vector f(m), with m denoting the frame index. To make full use of the temporal
information of the speech, it is common to incorporate the features of adjacent frames into a single extended feature vector. Hence, the extended feature vector centered at the m-th frame is constructed as f̃(m) = [f(m−m0), · · · , f(m), · · · , f(m+m0)], where m0 is the number of adjacent frames to be included on each side. The value of m0 is set to 2 in our experiments. Note that all the different features are normalized to the range [0, 1) in order to balance the training errors.

For the training targets, we adopt a multi-objective learning architecture to estimate both the clean speech LSFs and noise LSFs. Compared to a standard DNN, the output layer in the proposed architecture is divided into two parts: one for the clean speech LSFs and the other for the noise LSFs. The advantages of multi-objective learning are twofold. On one hand, it has a lower computational complexity compared to training two separate DNNs (i.e., one for clean speech and one for noise). On the other hand, estimating the two sets of LSFs simultaneously can help better exploit the relationship between the clean speech and noise.

In the training stage, back propagation is used to adjust the weights and biases so as to minimize the cost function, which is defined as the mean square error (MSE) between the reference LSFs and the estimated ones for each training utterance. Note that the cost function is composed of two parts, one for the clean speech LSFs and the other for the noise LSFs, as given by,

MSELSF = (1/M) Σ_{m=1}^{M} { (1/p) Σ_{i=1}^{p} [L̂s,i(m) − Ls,i(m)]² + (1/q) Σ_{j=1}^{q} [L̂w,j(m) − Lw,j(m)]² }    (15)

where m is the frame index of the input noisy speech and M the total number of frames. The quantities Ls,i(m) and L̂s,i(m) are the reference clean speech LSFs and the estimated ones at frame m, where i ∈ {1, ..., p} is the order index of the clean speech AR model. Similarly, Lw,j are the reference noise LSFs and L̂w,j the estimated ones at frame m, where j ∈ {1, ..., q} is the order index of the noise AR model.

In the enhancement stage, the clean speech LSFs and noise LSFs are first obtained by the well-trained DNN, and then converted to their respective LPCs. The estimated LPCs are used along with the estimated variances in the Kalman filter equations (12) in order to estimate the desired speech signal.

C. Variance Estimation

The covariance matrix Qu in (13) is another key parameter that needs to be estimated prior to the application of the Kalman filtering equations. Proceeding as in [8], we now formulate an optimization problem to estimate σv² and σz². Our goal is to minimize the difference between the noisy spectrum and the sum of the estimated clean speech spectrum and noise spectrum.

From equations (1), (2) and (8), the spectrum of the AR-modelled noisy speech can be expressed as:

P̂y(k) = P̂s(k) + P̂w(k) = σv² / |As(k)|² + σz² / |Aw(k)|²    (16)

with

As(k) = 1 − Σ_{i=1}^{p} as,i e^{−j2πik/K}
Aw(k) = 1 − Σ_{i=1}^{q} aw,i e^{−j2πik/K}    (17)

where K is the frame length. Note that the clean speech LPCs as,i and the noise LPCs aw,i can be obtained from the LSFs at the output of the trained DNN.

The AR spectrum of the observed noisy speech Py(k) can be written as,

Py(k) = σy² / |Ay(k)|²    (18)

with

Ay(k) = 1 − Σ_{i=1}^{p} ay,i e^{−j2πik/K}    (19)

σy² = E[y²(n)] − ryT ay.    (20)

We can obtain the variance estimates by minimizing the difference between the AR spectrum of the modelled noisy speech P̂y(k) and that of the observed one Py(k), that is,

(σv*², σz*²) = arg min_{σv², σz²} d(P̂y(k), Py(k))    (21)

where the difference is measured in the log-spectral domain as given by,

d(P̂y(k), Py(k)) = (1/K) Σ_{k=1}^{K} [ln P̂y(k) − ln Py(k)]²
                ≈ (1/K) Σ_{k=1}^{K} [(σv²/|As(k)|² + σz²/|Aw(k)|² − Py(k)) / Py(k)]²    (22)

To obtain the approximate equation in (22), we have used equation (16) and the approximation ln(1 + x) ≈ x. Then, by applying partial differentiation to the difference d(P̂y(k), Py(k)) with respect to σv² and σz², we obtain the following linear system of equations:

⎡ Ess  Esw ⎤ ⎡ σv² ⎤   ⎡ Eys ⎤
⎣ Esw  Eww ⎦ ⎣ σz² ⎦ = ⎣ Eyw ⎦    (23)
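The linear system above can be formed and solved directly from the spectra. The sketch below is our illustration: the variable names are ours, and the weighted spectral sums are what we obtain by setting the partial derivatives of the linearized distance in (22) to zero, so they stand in for the entries Ess, Esw, Eww, Eys, Eyw.

```python
import numpy as np

def estimate_driving_variances(As, Aw, Py):
    """Least-squares estimate of (sigma_v^2, sigma_z^2) by minimizing
    sum_k ((sv2*a_k + sz2*b_k - Py_k) / Py_k)^2,
    where a_k = 1/|As(k)|^2 and b_k = 1/|Aw(k)|^2."""
    a = 1.0 / np.abs(As) ** 2            # speech-spectrum basis term
    b = 1.0 / np.abs(Aw) ** 2            # noise-spectrum basis term
    # Zeroing the partial derivatives yields a symmetric 2x2 linear system.
    Ess = np.sum(a * a / Py ** 2); Esw = np.sum(a * b / Py ** 2)
    Eww = np.sum(b * b / Py ** 2)
    Eys = np.sum(a / Py);          Eyw = np.sum(b / Py)
    sv2, sz2 = np.linalg.solve(np.array([[Ess, Esw], [Esw, Eww]]),
                               np.array([Eys, Eyw]))
    return sv2, sz2
```

When the observed spectrum actually follows the model (16), the two variances are recovered exactly; with DNN-estimated LPCs the solution is the least-squares fit.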
a brief conceptual summary of each one of the reference methods.

• IKF (Iterative Kalman filtering) [11]: This algorithm iteratively performs conventional Kalman filtering, in which the LPCs are updated in each iteration.
• P-IKF (Perceptual IKF) [23]: This algorithm calculates a perceptual mask according to the human hearing system and applies it to the Kalman-filtered speech in order to further remove the residual noise.
• S-IKF (Subband IKF) [17]: In this method, the noisy speech is first divided into subband signals. Iterative Kalman filtering is then applied separately to each subband noisy speech. The final enhanced speech is obtained by synthesising the subband enhanced speech signals.
• FNN-MAG [24]: An FNN is employed to directly explore the mapping from the noisy speech magnitude spectrum to the clean one. The enhanced speech is synthesised with the estimated clean magnitude and the noisy phase.
• FNN-WF [32]: An FNN is trained for the estimation of the AR parameters of the clean speech. Then, a Wiener filter is estimated by calculating the ratio of the estimated clean speech power spectrum to that of the noisy speech. The enhanced speech is then obtained by applying the estimated Wiener filter to the noisy speech.
• FNN-KF [35]: An FNN is used to predict the LPCs needed for conventional Kalman filtering. The DNN learns the mapping from the acoustic features of the noisy speech to the LSFs of the clean speech. The estimated LSFs are then converted to the desired LPCs.

Besides these benchmarks, we consider three versions of our proposed DNN-augmented colored-noise Kalman filter method, namely,

• FNN-CKF: FNN for LSFs estimation and without post subtraction.
• FNN-CKFS: FNN for LSFs estimation and with post subtraction.
• LSTM-CKFS: LSTM for LSFs estimation and with post subtraction.

In order to make fair comparisons, we use the same configuration for the FNN in the related methods, i.e., one input layer, one output layer and three hidden layers with 1024 units in each layer. The LSTM network is obtained by stacking one input layer, two LSTM layers with 512 units in each layer, one feed-forward layer with 512 units and one output layer.

For FNN-MAG, a Hamming window is selected to divide each utterance into 20 ms time frames with a 10 ms frame shift (50% overlap). A 320-point DFT is then computed for each frame. For the other reference methods and the proposed system, a rectangular window is used to divide the audio signals into 20 ms frames with no overlap.

For the conventional Kalman filter, we set s(0|0) = 0, P(0|0) = I, and the AR model order of the clean speech as p = 12. For the colored-noise Kalman filter, we set x(0|0) = 0, P(0|0) = I, and the orders of the AR models for clean speech and additive noise as p = q = 12. For the post subtraction in FNN-CKFS and LSTM-CKFS, the spectrum is evenly divided into 4 bands.

C. Evaluation of Input Feature Sets

In the training stage, we use the following feature sets as the input of our proposed system: the LPS-only set, the LSF-only set, the multi-feature set consisting of AMS+RASTA-PLP+MFCC+GF, and the joint set formed by combining the LSF-only set with the multi-feature set. In this experiment, we investigate the performance of the proposed system with these different feature sets when using FNN for LSFs estimation. The objective results of the enhanced speech are shown in Table I.

TABLE I: Objective results with different feature sets

       Feature set   -3 dB  0 dB   3 dB   6 dB
PESQ   Noisy         1.41   1.52   1.68   1.86
       LPS-only      1.67   1.90   2.10   2.29
       LSF-only      1.69   1.92   2.12   2.33
       Multi Set     1.80   2.06   2.27   2.46
       Joint Set     1.88   2.12   2.32   2.51
STOI   Noisy         0.66   0.72   0.78   0.83
       LPS-only      0.69   0.75   0.80   0.83
       LSF-only      0.68   0.74   0.80   0.84
       Multi Set     0.72   0.78   0.83   0.86
       Joint Set     0.73   0.79   0.85   0.88

The final enhanced speech for the LPS-only and LSF-only feature sets exhibits similar PESQ and STOI scores, while the objective scores are improved notably for the multi-feature and joint sets, which indicates that using more acoustic features provides useful additional information about the speech. Finally, the enhanced speech from the joint set achieves the highest PESQ and STOI scores. As a result, the joint set is considered as the optimal input feature set for the proposed system.

D. Evaluation of LPCs Estimation Accuracy

In this subsection, the LPCs estimation error is evaluated to verify the learning capability of the proposed multi-objective DNN training. We first define the LPCs estimation error of the speech as the mean square error (MSE) between the estimated LPCs and the ideal LPCs calculated from the clean speech for each utterance, as given below,

MSELPC = (1/M) Σ_{m=1}^{M} { (1/p) Σ_{i=1}^{p} [âs,i(m) − as,i(m)]² }    (29)

where M denotes the number of the speech frames in the utterance, as,i(m) the ideal LPCs of the clean speech and âs,i(m) the estimated ones. The estimated LPCs are obtained by three methods for comparison. The first one applies the Levinson-Durbin (LD) algorithm to obtain the LPCs of the noisy speech directly [44]. The second and third ones adopt the proposed DNN based LSFs estimation algorithm, where FNN and LSTM are used to estimate the LSFs, which are then converted to LPCs. Similarly, we compute the LPCs estimation error of the additive noise for each noise type by using (29), where the estimated and ideal LPCs of the speech are replaced by those of the additive noise, and the order of the speech model, p, is replaced by that of the noise, q.
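For reference, the LD baseline can be reproduced with the textbook Levinson-Durbin recursion. The sketch below is our illustration rather than the implementation of [44]; it returns the LPCs in the paper's sign convention s(n) = Σ as,i s(n−i) + v(n).

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the autocorrelation sequence
    r = (r[0], ..., r[order]); return (lpc, prediction_error_power)."""
    a = np.zeros(order + 1); a[0] = 1.0      # prediction polynomial A(z)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                   # shrink the residual power
    return -a[1:], err                       # LPCs as,i are the negated A coefficients
```

Feeding it the autocorrelation of a noisy frame gives the "LD" LPCs that the error measure in (29) is evaluated against.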
Fig. 2: LPC estimation error comparison for speech among Levinson-Durbin (LD), FNN and LSTM methods. [Bar panels show the average MSE at -3 dB, 0 dB, 3 dB and 6 dB.]

Fig. 3: LPC estimation error comparison for additive noise among LD, FNN and LSTM methods. [Bar panels show the average MSE at -3 dB, 0 dB, 3 dB and 6 dB.]
Fig. 2 shows the LPCs estimation error comparison for the speech. The average MSE is computed over all the testing utterances for both seen and unseen noise. In general, the FNN and LSTM based approaches give a slightly smaller error than the LD method does for the eight types of noises and different SNRs. In addition, the error from LSTM is smaller than that from FNN in most cases. Another important finding is that the error from the DNN methods decreases with an increase of the SNR, which means that DNN achieves a better performance at higher SNR. The LPCs estimation performance also varies for different noise types. In particular, the best estimation accuracy is achieved for street noise, and the worst for white noise. Interestingly, we note that the estimation error of the FNN and LSTM based algorithms under unseen noise does not increase considerably compared with that under seen noise, which indicates that using DNN in LPCs estimation offers robustness and a good generalization capability.

The LPCs estimation error comparison for the additive noise is shown in Fig. 3, where we notice important differences with the case of clean speech. Firstly, as the SNR increases, the strong speech component more strongly affects the noise LPCs estimation, and hence the LPCs estimation error of the additive noise gets larger. Secondly, compared with speech, noise exhibits less structure and correlation, and the mapping from the noisy speech features to the noise LSFs is thus more difficult to learn. Therefore, we find that the DNN estimation result is not always better than that of the traditional LD method, especially at low SNR.

E. Speech Enhancement Performance under Seen Noise

Here, we compare the different speech enhancement methods under seen noise. Table II gives the average objective scores of the different speech enhancement methods on seen noise. We first note that the performances of the unsupervised
TABLE II: Objective scores of different speech enhancement methods on seen noise

                        PESQ                       STOI
Method        -3 dB  0 dB  3 dB  6 dB    -3 dB  0 dB  3 dB  6 dB
Noisy          1.41  1.52  1.68  1.86     0.66  0.72  0.78  0.83
IKF            1.55  1.79  2.01  2.25     0.67  0.74  0.80  0.85
P-IKF          1.57  1.83  2.08  2.31     0.68  0.75  0.81  0.85
S-IKF          1.56  1.81  2.04  2.29     0.67  0.75  0.81  0.84
FNN-MAG        1.89  2.13  2.34  2.55     0.75  0.82  0.86  0.88
FNN-WF         1.65  1.83  2.15  2.36     0.71  0.78  0.82  0.86
FNN-KF         1.70  1.93  2.13  2.30     0.71  0.77  0.81  0.85
FNN-CKF        1.73  2.01  2.26  2.49     0.72  0.78  0.84  0.87
FNN-CKFS       1.88  2.12  2.32  2.51     0.73  0.79  0.85  0.88
LSTM-CKFS      1.93  2.16  2.38  2.58     0.74  0.80  0.85  0.88
Kalman filtering algorithms are worse than those of the DNN-based methods. The P-IKF, which incorporates a perceptual mask to further suppress the residual noise, is the best among the three unsupervised Kalman filtering algorithms. However, P-IKF still cannot match the performance of FNN-KF, let alone our FNN-CKF, FNN-CKFS and LSTM-CKFS. These results demonstrate the benefit of employing a DNN for parameter estimation: the DNN predicts more accurate LPCs from the noisy speech, thus improving the performance of the Kalman filtering algorithms.

Moreover, FNN-KF has lower objective scores than the proposed methods. This is because FNN-KF requires a VAD procedure to detect unvoiced frames for estimating and updating the additive noise variance σw². However, VAD in noisy conditions is a difficult task, which causes variance estimation errors and introduces extra distortion into the enhanced speech. In our proposed system, an AR model is adopted to represent the background noise. As such, the Kalman filtering equations in (12) no longer involve σw², and we therefore avoid the speech distortion caused by its inaccurate estimation. The performance can be further improved by employing post subtraction to remove the residual noise due to inaccurate parameters of the noise AR model: indeed, FNN-CKFS achieves a better performance than FNN-CKF, approaching that of FNN-MAG. Finally, although FNN-MAG has the best performance among all tested FNN based approaches, by employing LSTM for LSF estimation in our proposed system, LSTM-CKFS achieves the best PESQ scores, which demonstrates the LSTM's advantage in modelling long temporal dependencies.

F. Speech Enhancement Performance under Unseen Noise

Table III gives the average objective scores of the different speech enhancement methods in the case of unseen noise. In this case, the performances of the unsupervised Kalman filtering algorithms are still worse than those of the FNN-based methods. Comparing FNN-KF with our proposed system, we find again that FNN-CKF, FNN-CKFS and LSTM-CKFS outperform FNN-KF thanks to the adoption of the colored-noise Kalman filter. However, the STOI scores of FNN-CKF are slightly lower than those of FNN-KF at low SNR. This degradation is possibly caused by the inaccuracy in estimating the noise LPCs: as shown in Fig. 3, the FNN estimation error is higher than the LD estimation error under low input SNR conditions.

In the case of unseen noise, we find that LSTM-CKFS achieves the best objective scores due to its advanced network structure. More interestingly, FNN-MAG no longer holds the best performance among the FNN based methods. In fact, the objective scores of FNN-MAG decrease considerably, indicating that mapping the noisy magnitude spectrum to the clean one is prone to errors when the noise does not match that seen in the training stage. In contrast, FNN-WF, FNN-KF and our proposed system suffer less performance degradation. Indeed, the denoising in these methods is accomplished by Wiener and Kalman filtering; as long as the DNN provides sufficiently accurate parameters, their performance does not fluctuate much between seen and unseen noise. Based on these results, and considering the robustness of the DNN-based LPC estimation, we conclude that our FNN-CKF, FNN-CKFS and LSTM-CKFS have a better generalization capability than FNN-MAG.

Finally, we compare the methods in terms of each objective metric. Although the enhanced speech from LSTM-CKFS has the best speech quality according to the PESQ scores, the improvement in speech intelligibility is less pronounced, as seen from the STOI scores; in fact, LSTM-CKFS gives STOI scores similar to those of FNN-MAG. There is a trade-off between residual noise and speech distortion in speech enhancement algorithms, which limits the gain in speech intelligibility. For our LSTM-CKFS, the enhanced speech achieves speech intelligibility similar to that of FNN-MAG but far better speech quality, indicating that LSTM-CKFS preserves the information content of the clean speech well while significantly removing the additive noise.

G. Spectrograms of Enhanced Speech

To better understand the characteristics of the enhanced speech, Fig. 4 shows the spectrograms of the enhanced speech signals from several selected methods, illustrating the residual noise and the distortions of the harmonic structures in the time-frequency domain. The noisy speech is
TABLE III: Objective scores of different speech enhancement methods on unseen noise

                        PESQ                       STOI
Method        -3 dB  0 dB  3 dB  6 dB    -3 dB  0 dB  3 dB  6 dB
Noisy          1.37  1.51  1.65  1.82     0.65  0.72  0.78  0.83
IKF            1.64  1.84  2.04  2.26     0.68  0.75  0.81  0.85
P-IKF          1.67  1.88  2.09  2.32     0.69  0.76  0.81  0.85
S-IKF          1.66  1.87  2.08  2.31     0.68  0.75  0.81  0.84
FNN-MAG        1.73  1.92  2.13  2.32     0.70  0.76  0.82  0.87
FNN-WF         1.68  1.92  2.15  2.33     0.67  0.74  0.81  0.85
FNN-KF         1.73  1.95  2.21  2.38     0.71  0.77  0.82  0.85
FNN-CKF        1.76  2.02  2.26  2.48     0.70  0.76  0.82  0.86
FNN-CKFS       1.89  2.11  2.32  2.50     0.71  0.78  0.83  0.87
LSTM-CKFS      1.91  2.15  2.36  2.55     0.73  0.79  0.84  0.88
obtained by mixing a selected clean speech utterance with buccaneer noise at 3 dB SNR. For the best unsupervised Kalman filtering method in our experiment, i.e., P-IKF, musical noise structure is visible in the spectrogram between 4 kHz and 8 kHz. The spectrogram of FNN-MAG also exhibits some musical noise structures in the high-frequency components, as well as residual noise in the low-frequency components. For FNN-WF, the high-frequency components look better than in the previous two spectrograms, but undesired structures remain, likely caused by the Wiener filter's difficulty in removing non-stationary noise. Finally, for the four Kalman filtering related methods, we observe that FNN-KF, FNN-CKFS and LSTM-CKFS remove the background noise quite well. However, the high-frequency components of FNN-KF still suffer from various degradations. While this situation is improved in the cases of FNN-CKFS and LSTM-CKFS, LSTM-CKFS preserves the harmonic structures best among all the tested methods, thus achieving the best objective scores.

V. CONCLUSION

In this paper, we have proposed a hybrid speech enhancement system with DNN-aided parameter estimation and colored-noise Kalman filtering. Our system first employs a multi-objective FNN or LSTM to estimate the AR model parameters of both clean speech and noise. Then a colored-noise Kalman filter with the estimated parameters is applied to the noisy speech for denoising. In this way, the proposed system can more efficiently cope with the colored noises encountered in real-world environments. To further improve the enhancement performance, a post subtraction algorithm is adopted to better remove the residual noise.

Experiments have shown the superiority of the proposed system in two respects. First, the employment of a DNN for parameter estimation and of post subtraction for residual noise suppression largely improves the enhancement performance of colored-noise Kalman filtering. Second, our proposed system takes advantage of both unsupervised and supervised methods, and thus exhibits a better generalization capability: while it achieves performance comparable to recent DNN-based approaches on seen noise, it offers notably better results on unseen noise.

VI. ACKNOWLEDGEMENTS

The authors acknowledge the support from the China Scholarships Council (CSC No.201606270200) and NSERC of Canada under a CRD project sponsored by Microchip in Ottawa, Canada.
Fig. 4: Spectrograms of the clean, noisy and enhanced speech signals for different methods.
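As a minimal illustration of the colored-noise Kalman filtering behind these enhanced signals, the sketch below assumes scalar AR(1) models for both speech and noise (i.e., p = q = 1) with known parameters; the paper's system instead uses higher-order AR models whose LPCs come from the DNN. Because the colored noise is part of the augmented state, no separate measurement-noise variance σw² appears in the recursion.

```python
import random

def colored_kf(z, a, b, qu, qv):
    """Colored-noise Kalman filter on z[n] = s[n] + w[n], with speech
    s[n] = a*s[n-1] + u[n] (var qu) and noise w[n] = b*w[n-1] + v[n] (var qv).
    Augmented state x = [s, w]; observation matrix H = [1, 1], and there is
    no additive measurement-noise term in the update."""
    x, P = [0.0, 0.0], [[qu, 0.0], [0.0, qv]]
    s_hat = []
    for zn in z:
        # Predict: x <- F x and P <- F P F' + Q, with F = diag(a, b).
        xp = [a * x[0], b * x[1]]
        Pp = [[a * a * P[0][0] + qu, a * b * P[0][1]],
              [a * b * P[1][0], b * b * P[1][1] + qv]]
        # Update: innovation variance S = H Pp H', gain K = Pp H' / S.
        S = Pp[0][0] + Pp[0][1] + Pp[1][0] + Pp[1][1]
        K = [(Pp[0][0] + Pp[0][1]) / S, (Pp[1][0] + Pp[1][1]) / S]
        y = zn - (xp[0] + xp[1])
        x = [xp[0] + K[0] * y, xp[1] + K[1] * y]
        P = [[(1 - K[0]) * Pp[0][0] - K[0] * Pp[1][0],
              (1 - K[0]) * Pp[0][1] - K[0] * Pp[1][1]],
             [(1 - K[1]) * Pp[1][0] - K[1] * Pp[0][0],
              (1 - K[1]) * Pp[1][1] - K[1] * Pp[0][1]]]
        s_hat.append(x[0])
    return s_hat

# Toy run: the filtered estimate should track s better than the raw noisy z.
random.seed(1)
N = 5000
s, w = [0.0], [0.0]
for _ in range(N - 1):
    s.append(0.9 * s[-1] + random.gauss(0.0, 1.0))
    w.append(0.5 * w[-1] + random.gauss(0.0, 1.0))
z = [si + wi for si, wi in zip(s, w)]
s_hat = colored_kf(z, 0.9, 0.5, 1.0, 1.0)
mse_noisy = sum((zi - si) ** 2 for zi, si in zip(z, s)) / N
mse_kf = sum((sh - si) ** 2 for sh, si in zip(s_hat, s)) / N
```

Even this toy filter noticeably lowers the mean squared error relative to the noisy observation, which is the mechanism behind the cleaner spectrograms of the Kalman-filtered methods in Fig. 4.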
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

AUTHORSHIP AND CONTRIBUTIONS

Hongjiang Yu: Conceptualization, Methodology, Software, Formal analysis, Writing - Original Draft
Wei-Ping Zhu: Validation, Formal analysis, Writing - Review & Editing, Supervision, Resources
Benoit Champagne: Validation, Formal analysis, Writing - Review & Editing