0% found this document useful (0 votes)

25 views14 pages

Speech Enhancement Using A DNN-Augmented Colored-Noise

Uploaded by

hodece

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views14 pages

Speech Enhancement Using A DNN-Augmented Colored-Noise

Uploaded by

hodece

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Speech Enhancement Using a DNN-Augmented Colored-Noise Kalman Filter

Journal Pre-proof

Speech Enhancement Using a DNN-Augmented Colored-Noise

Kalman Filter

Hongjiang Yu, Wei-Ping Zhu, Benoit Champagne

PII: S0167-6393(20)30283-1
DOI: https://doi.org/10.1016/j.specom.2020.10.007
Reference: SPECOM 2744

To appear in: Speech Communication

Received date: 21 April 2020

Revised date: 16 September 2020
Accepted date: 29 October 2020

Please cite this article as: Hongjiang Yu, Wei-Ping Zhu, Benoit Champagne, Speech Enhance-
ment Using a DNN-Augmented Colored-Noise Kalman Filter, Speech Communication (2020), doi:
https://doi.org/10.1016/j.specom.2020.10.007

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier B.V.

H IGHLIGHTS
• Colored-noise Kalman filter is adopted in our system,
which is more component to deal with the complex noise
and alleviate the speech distortion.
• A multi-objective DNN is first employed to joint estimate
parameters of the clean speech autoregressive (AR) model
and the noise AR model. tTo kinds of DNN, i.e., fully-
connected feed-forward network (FNN) and long short-
term memory (LSTM), are adopted.
• A post subtraction technique is employed to further
remove the residual noise in the Kalman-filtered speech.
• The proposed system takes advantage of both the DNN
based method and Kalman filtering, and has a good
generalization capability in both seen and unseen noise
environments.
2

Speech Enhancement Using a DNN-Augmented

Colored-Noise Kalman Filter
Hongjiang Yu, Wei-Ping Zhu, and Benoit Champagne

Abstract—In this paper, we propose a new speech enhancement and perceived musical noise in the enhanced speech. In
system using a deep neural network (DNN)-augmented colored- [4, 5], a multiband spectral subtraction was proposed based
noise Kalman filter. In our system, both clean speech and noise on the fact that the noise affects the speech at different levels
are modelled as autoregressive (AR) processes, whose parameters
comprise the linear prediction coefficients (LPCs) and the driving depending on frequency bands. In the multiband approach,
noise variances. The LPCs are obtained through training a multi- the speech spectrum is divided into several non-overlapping
objective DNN that learns the mapping from the noisy acoustic frequency bands, and then spectral subtraction is performed
features to the line spectrum frequencies (LSFs), while the driving independently in each band.
noise variances are obtained by solving an optimization problem The statistical filter based speech enhancement methods
aiming to minimize the difference between the modelled and
observed AR spectra of the noisy speech. The colored-noise have also received considerable attention. Wiener filtering,
Kalman filter with DNN estimated parameters is then applied one of the most famous algorithms in this class, aims to
to the noisy speech for denoising. Finally, a post-subtraction find the minimum mean square error (MMSE) estimate of
technique is adopted to further remove the residual noise in the clean speech’s discrete Fourier transform (DFT) coef-
the Kalman-filtered speech. Extensive computer simulations show ficients [6–8]. Compared with spectral subtraction, Wiener
that the proposed speech enhancement system achieves significant
performance gains when compared to conventional Kalman filter filtering introduces less distortion in the enhanced speech.
based algorithms as well as recent DNN-based methods under However, Wiener filters are derived under the assumption that
both seen and unseen noise conditions. the processed signals are stationary, which is rarely satisfied
Index Terms—speech enhancement, deep neural network, in real-world applications. Kalman filters, which can handle
corlored-noise Kalman filter, spectral subtraction non-stationary signals, have therefore attracted the interests
of speech enhancement researchers [9]. In this context, the
Kalman filter can be viewed as a time-domain, sequential
I. I NTRODUCTION linear MMSE estimator of the noise corrupted speech, in

S PEECH enhancement, which aims to suppress the back- which the clean speech is characterized by a dynamical or
ground noise and improve the quality and intelligibility of state-space model, such as the autoregressive (AR) model. As
a speech signal, has been widely adopted as a pre-processing such, the enhancement performance is largely dependent on
means in a variety of speech-related applications to provide the estimation accuracy of the AR parameters, which include
better user experience. Numerous speech enhancement tech- the linear prediction coefficients (LPCs) and the variances of
niques have been proposed in the literature over the past the driving and observation noises.
decades, but due to their limited performance, the problem Ideally, the AR parameters of the clean speech can lead to
continues to be intensively studied. excellent performance of the Kalman filter [9], but they are not
Spectral subtraction [1], one of the earliest techniques for accessible in practice. Therefore, various estimation algorithms
speech enhancement, modifies the noisy speech power spec- have been proposed to obtain the above parameters from the
trum by subtracting the estimated noise power spectrum. Al- noisy speech, which can be divided into two categories: online
though spectral subtraction is easy to employ, the difficulty in estimation [10–13] and offline estimation [14, 15]. The former
accurately estimating the noise spectrum hinders the enhance- algorithms usually estimate and update the denoised speech
ment performance. Extra distortion, such as the musical noise, and the model parameters in an iterative manner, while the
can degrade the perceptual quality of the enhanced speech if latters require a training stage on a clean speech database
the noise spectrum is not accurately estimated. More flexible to predict the parameters beforehand. To further improve the
spectral subtraction algorithms with better performance were speech enhancement performance, several advanced versions
proposed in [2, 3], where two techniques, i.e., the use of of Kalman filters have been proposed. For example, the
oversubtraction factor and spectral flooring parameter, were subband Kalman filtering technique [16–19] divides the noisy
introduced along with the standard spectral subtraction. These speech into several contiguous frequency bands, and performs
techniques are used to adjust the estimated noise spectrum, Kalman filtering separately as the noise level dynamic varies in
and thereby control the ratio of the remaining residual noise each band. The improved Kalman filter in [20, 21] models both
clean speech and noise as AR processes, and achieves a better
H. Yu and W.-P. Zhu is with the Department of Electrical and Com- performance in color noise environments. The perceptual
puter Engineering, Concordia University, Montreal, Quebec, Canada. E-mail: Kalman filter [22, 23] incorporates an additional post-filter
ho [email protected].
B. Champagne is with the Department of Electrical and Computer Engi- to further remove the residual noise by scaling the estimation
neering, McGill University, Montreal, Quebec, Canada. error of the Kalman filter below the masking threshold.
3

In recent years, deep learning, and especially deep neural residual noise in the Kalman-filtered speech, which is caused
network (DNN), has been successfully applied in many areas. by the parameter estimation error. Through exhaustive com-
Compared with the unsupervised statistical filter based meth- puter simulations, it is shown that the proposed system can not
ods, the use of DNN for speech enhancement offers several only significantly improve the performance of Kalman filtering
advantages: (1) powerful learning capability to model various in speech enhancement, but also offer a good generalization
non-linear mapping relationships; (2) no reliance on assump- capability in both seen and unseen noise conditions.
tions about the statistical properties of the speech and noise, The rest of the paper is organized as follows. Section 2 sum-
and; (3) no specific need for the noise spectrum estimation. marizes our previous work on DNN-based Kalman filtering.
Early works in this area, e.g., [24, 25] employed DNN to Section 3 presents the newly proposed speech enhancement
directly estimate the clean speech magnitude spertrum, where system with DNN-augmented colored-noise Kalman filter,
the DNN acts as a regression model to implement a mapping including a detailed description of its main components. Sec-
function between the log-power spectra (LPS) of the noisy tion 4 presents a series of experiments to assess the system
and clean speech signals. Subsequent works seeked to estimate performance. Section 5 concludes the paper.
ratio masks via DNN-based approaches, and then remove the
background noise in the spectral domain by means of the
II. R ELATED W ORK
estimated masks [26–31]. For instance, in [26], the ideal ratio
mask (IRM) is predicted by a DNN and then applied to the Herein, we briefly review our previous work on speech
noisy magnitude spectrum to recover desired speech signal. enhancement using DNN and Kalman filtering [35], where
Nonetheless, deep learning algorithms require large training the DNN is employed to estimate the AR parameters in the
databases to improve their generalization capability [25]. Since conventional Kalman filter.
the statistical filter based methods can reduce different kinds
and levels of noises to a sensible extent in a variety of
A. Conventional Kalman Filter
situations, researchers have recently turned their attention to
the combination of DNN and statistical filter based approaches. Consider the noisy speech y (n) as an additive mixture of
Indeed, the learning capability of the former makes it possi- the clean speech s (n) and the background noise w (n),
ble to boost speech enhancement performance under various
conditions, while the latter helps better exploits the gener- y (n) = s (n) + w (n) (1)
alization capability of the enhancement system by providing
where n ∈ N is the discrete time index. As usual, w(n)
an appropriate structural framework. [32–36]. Recently in
is regarded as a zero-mean white noise with variance σw 2
,
[35], we have proposed a DNN-augmented Kalman filter for
uncorrelated with s(n). The clean speech s (n) is usually
speech enhancement, where the DNN is trained to predict
represented by a linear model as a dynamic process of speech
the AR parameters needed for Kalman filtering. Experiments
production. For the widely-adopted AR model, we have
have shown that the AR parameters estimated in this way
p
X
are less sensitive to various types of noise, leading to a
better enhancement performance than the subband iterative s (n) = as,i s (n − i) + v (n) (2)
i=1
Kalman filter algorithm [17]. However, the enhanced speech
still suffers from distortion at higher frequencies, partly due where as,i are the LPCs of the clean speech, p the order of
to the inaccurate estimation of additive noise and its harmful the model, and v (n) the driving noise, i.e., a zero-mean white
effects on the conventional Kalman filter. noise with variance σv2 .
In this paper, we propose a novel speech enhancement To facilitate the Kalman filter presentation for speech en-
system consisting of a colored-noise Kalman filter augmented hancement, the above model equations for s(n) and y(n) can
with DNN-based parameter estimation, where both clean be rewritten in matrix form as,
speech and noise are modelled as AR processes. In our
system, a multi-objective DNN is first employed to estimate s (n) = Fs s (n − 1) + Gs v (n)
(3)
the line spectrum frequencies (LSFs), which are used for the y (n) = HTs s (n) + w (n)
representation of the LPCs parameters in these models. Two T
where s (n) = [s (n − p + 1) , . . . , s (n − 1) , s (n)] denotes
kinds of DNN are used in this work, i.e. the fully-connected the speech state vector. Moreover, the transition matrix Fs is
feed-forward DNN (denoted as FNN) [32] and the long short- given by
term memory (LSTM) [37]. The driving noise variances for  
the clean speech process and the noise process are obtained 0 1 ··· 0 0
 0 0 ··· 0 0 
by solving an optimization problem as in [8]. The multi-  
 .. .
.. . .. .
.. .. 
objective DNN training is beneficial as it can simultaneously Fs =  . .  (4)
 
estimate the AR parameters of the clean speech and noise  0 0 ··· 0 1 
with a lower computational complexity, while providing more as,p as,p−1 · · · as,2 as,1
accurate estimates under noisy conditions. Subsequently, the
T
colored-noise Kalman filter with the DNN estimated AR pa- and Hs = Gs = [0, · · · , 0, 1] ∈ Rp .
rameters is applied to the noisy speech for denoising. Finally, The denoising process with a Kalman filter amounts to
a post subtraction technique is employed to further remove the recursively calculate an unbiased, linear MMSE estimate of
4

the state vector s(n), given the corrupted speech y(n). This frames remains a difficult task, and the detection errors lead
process can be summarized by the following equations: to inaccurate variance estimation of the additive noise, which
 brings further distortion to the enhanced speech.

 e (n) = y (n) − HTs ŝ (n|n − 1)

 2
−1

 K (n) = P (n|n − 1) Hs σw + HTs P (n|n − 1) Hs

ŝ (n|n) = ŝ (n|n − 1) + K(n) e (n)
III. P ROPOSED S YSTEM

 P (n|n) = I − K (n) HTs P (n|n − 1)



 ŝ (n + 1|n) = Fs ŝ (n|n)
 To counter the difficulties posed by the VAD procedure and
P (n + 1|n) = Fs P (n|n) FTs + σv2 Gs GTs improve the accuracy of the variance estimation, we propose
(5)
a hybrid speech enhancement system that combines DNN-
where ŝ (n|n − 1) is the a priori estimate of the current state
based parameter estimation with colored-noise Kalman filter.
vector s (n), given observations up to a time index n − 1,
The overall block diagram of our new system is depicted in
i.e., y(1), ..., y(n − 1), P (n|n − 1) the predicted state error
Fig.1, which is composed of two stages, namely: the training
correlation matrix of ŝ (n|n − 1), e (n) the innovation, K (n)
stage and the enhancement stage. In the training stage, the
the Kalman gain matrix, ŝ (n|n) the filtered estimate of state
input feature set to the DNN consists of the combination of the
vector s (n), and P (n|n) the filtered state error covariance
noisy speech LSFs along with four acoustic features from [39].
matrix of ŝ (n|n). The denoised speech ŝ (n) is finally given
The output targets are the LSFs of both the clean speech and
by
the noise. Then, a multi-objective DNN is trained to learn the
ŝ (n) = GTs ŝ (n|n) . (6)
mapping from the noisy input feature set to the targets. In the
enhancement stage, given a noisy speech signal, we obtain first
B. Parameter Estimation the input feature set, and then process it by the trained DNN to
We note that several parameters appearing in the above predict the clean speech LSFs and noise LSFs. The estimated
equations should be estimated or calculated from the noisy LPCs are then obtained from the LSFs, and applied to both
observations in order to perform Kalman filtering. Those variance estimation and Kalman filtering. Subsequently, the
parameters include the driving noise variance σv2 , the additive noisy speech is enhanced by the colored-noise Kalman filter.
noise variance σw2
, and the transition matrix Fs which contains This operation is followed by a post subtraction to further
the LPCs of the clean speech model. remove the residual noise in the filtered speech. The key
In our previous work [35], an FNN is adopted for the LPCs components and steps involved in the proposed system are
prediction. More specifically, the LPCs of the noisy speech and described in further details below.
of the clean speech are first calculated and then converted into
their representative LSFs, which are used as input features and
output targets of the DNN, respectively. Using LSFs instead of
A. Colored-Noise Kalman filter
LPCs offers a more stable DNN training process [14], due to
the relatively well-behaved dynamic range of LSFs. The well- As mentioned before, in a conventional Kalman filter the
trained FNN can learn the non-linear relationship between the clean speech is modelled as an AR process, while the additive
noisy LSFs and the clean ones. Finally, the estimated LSFs noise is assumed to be white, which is not suitable for the
are transformed back to LPCs, as required in the transition complex noises encountered in real-world environment. To
matrix Fs needed to perform Kalman filtering. overcome this limitation, we herein adopt the colored-noise
The variance σw 2
of the additive noise w(n) is usually Kalman filter. In this method, the additive noise w(n) in (1)
estimated and updated during the unvoiced frames. The calcu- is now modelled as an AR process, expressed as,
lation involves a voice activity detection (VAD) procedure [38] q
X
to detect whether a given speech frame is voiced or unvoiced. w (n) = aw,i w (n − i) + z (n) (8)
The variance of the driving noise v(n) can be then estimated i=1
as:
where aw,i are the LPCs of the colored noise, q the order of
σv2 = σy2 − σw
2
the AR model, and z(n) the zero-mean white driving noise
2 (7)
= E y (n) − rTy ay − σw
2 with variance σz2 .
The underlying AR signal model in the colored-noise
where ay = [ay,1 , · · · , ay,p ]T is the LPC vector of the noisy Kalman filter can be conveniently incorporated into the fol-
speech, and ry = E [y (n) y (n)] the autocorrelation vector of lowing state-space matrix form,
the noisy speech y(n) with its past p samples, represented by
the vector y(n) = [y(n − 1), . . . , y(n − p)]T . x(n) = Fx(n − 1) + Gu(n)
(9)
Although the performance of the conventional Kalman y(n) = HT x(n)
filter method for speech enhancement has been improved T
notably by using the FNN for parameter estimation, several where x(n) = s(n), w(n) is the p + q dimensional con-
limitations have been identified. Firstly, the additional VAD catenated state vector constituted by the clean speech vector
procedure needed for the estimation of the additive noise s (n) = [s (n − p + 1) , . . . , s (n − 1) , s (n)] together with the
variance increases the computational and structural complexity noise vector w (n) = [w (n − q + 1) , . . . , w (n − 1) , w (n)],
T
of the system. In addition, accurately detecting the unvoiced and u(n) = v(n), z(n) is the concatenated driving noise
5

Fig. 1: Block diagram of proposed speech enhancement system using DNN-augmented colored noise Kalman filter.

vector. Moreover, the augmented matrices G, H, and the filter, i.e.,

overall transition matrix F are given as follows: ŝ (n) = [GTs , 0T ]x̂ (n|n) (14)

Fs 0 Gs 0 Hs Note that two parameterized matrices that appear in the
F= , G= , H=
0 Fw 0 Gw Hw process equations (12) should be estimated from the noisy
(10) speech to carry out Kalman filtering, namely, the overall
with   transition matrix F and the covariance matrix Qu of the
0 1 ··· 0 0
 0 concatenated driving noise vector u(n). The first depends on
 0 · · · 0 0 
 .. . .. . ..  the clean speech and noise LPCs, which can be converted to
Fw =  . .. . .. .  (11) the LSFs and predicted through a DNN, while the second one
 
 0 0 ··· 0 1  is obtained by solving an optimization problem. The details
aw,q aw,q−1 · · · aw,2 aw,1 of this parameter estimation are provided in the following
T subsections.
and Hw = Gw = [0, · · · , 0, 1] ∈ Rq .

Given a noisy observation y(n), the estimate of the state B. DNN-based LSFs Estimation
vector x(n) can be obtained by the following Kalman filtering
Recently, we have demonstrated that FNN offers a conve-
recursive equations:
nient means for LSFs estimation in speech processing applica-
tions [35]. Here, we propose to employ two different networks,
 i.e., FNN and LSTM, to predict both the clean speech LSFs

 e (n) = y (n) − HT x̂ (n|n − 1)

 −1 and noise LSFs. The specific configuration of each network is

 K (n) = P (n|n − 1) H HT P (n|n − 1) H described in section IV-B.

x̂ (n|n) = x̂ (n|n − 1) + K (n) e (n) For the input features, we extract 12-dimensional LSFs
(12)

 P (n|n) = I − K (n) HT P (n|n − 1) along with several complementary features from the noisy



 x̂ (n + 1|n) = Fx̂ (n|n) speech, in order to collect more information about the speech

P (n + 1|n) = FP (n|n) FT + GQu GT characteristics. Specifically, the following additional acoustic
where e (n) is the innovation, K (n) the Kalman gain matrix, features are utilized: the 15-dimensional amplitude modulation
x̂ (n|n) the filtered estimate of state vector x (n), x̂ (n|n − 1) spectrum (AMS); the 31-dimensional relative spectral trans-
a priori estimate of the state vector x (n). P (n|n) is the form and perceptual linear prediction (RASTA-PLP); the 13-
filtered state error covariance matrix, and P (n|n − 1) the dimensional Mel-frequency cepstral coefficients (MFCC) and
predicted state error correlation matrix. Qu is the covariance their deltas; and the 64-dimensional Gammatone filterbank
matrix of the driving noise vector u(n), which is given by energies (GF), and their deltas [39]. The total dimension of
2 the input feature set is 258, i.e., (12+2×(15+31+13+64))
σv 0
Qu = E[u(n)u(n) ] = T
. (13) The input features are computed for each frame of the
0 σz2
noisy speech, and represented as a row vector f (m) with m
The denoised speech is the output of the colored-noise Kalman denoting the frame index. To make full use of the temporal
6

information of the speech, it is common to incorporate the From equations (1), (2) and (8), the spectrum of the AR-
features of adjacent frames into a single extended feature modelled noisy speech can be expressed as:
vector. Hence, the extended feature vector centered at the
P̂y (k) = P̂s (k) + P̂w (k)
m-th frame is constructed as f̃ (m) = [f (m − m0 ) , · · · ,
f (m) , · · · , f (m + m0 )], where m0 is the number of adjacent σv2 σz2 (16)
= 2 + 2
frames to be included on each side. The value of m0 is set |As (k)| |Aw (k)|
to 2 in our experiment. Note that all the different features are with
normalized to the range [0, 1) in order to balance the training p
X
errors. As (k) = 1 − as,i e−j2πik/K
For the training targets, we adopt a multi-objective learning i=1
(17)
architecture to estimate both the clean speech LSFs and noise q
X
−j2πik/K
LSFs. Compared to a standard DNN, the output layer in the Aw (k) = 1 − aw,i e
proposed architecture is divided into two parts: one for the i=1
clean speech LSFs and the other for the noise LSFs. The where K is the frame length. Note that the clean speech LPCs
advantages of multi-objective learning are twofold. On one as,i and the noise LPCs aw,i can be obtained from the LSFs
hand, it has lower computational complexity compared to at the output of the trained DNN.
training two separate DNNs (i.e., one for clean speech and
one for noise). On the other hand, estimating the two sets of
LSFs simultaneously can help better exploit the relationship
between the clean speech and noise.
The AR spectrum of the observed noisy speech Py (k) can
In the training stage, back propagation is used to adjust be written as,
the weights and biases so as to minimize the cost function, σy2
which is defined as the mean square error (MSE) between Py (k) = 2 (18)
|Ay (k)|
the reference LSFs and the estimated ones for each training
utterance. Note that the cost function is composed of two parts: with p
X
one for the clean speech LSFs and the other for the noise LSFs, Ay (k) = 1 − ay,i e−j2πik/K (19)
as given by, i=1
( p
1 X 1 Xh i2 σy2 = E y(n)2 − rTy ay .
M
(20)
M SELSF = L̂s,i (m) − Ls,i (m)
M m=1 p i=1 We can obtain the variance estimates by minimizing the
 (15)
i2  difference between the AR spectrum of the modelled noisy
1 Xh
q
+ L̂w,j (m) − Lw,j (m) speech P̂y (k) and that of the observed one Py (k), that is,
q 
j=1
σv∗2 , σz∗2 = arg min d P̂y (k) , Py (k) (21)
where m is the frame index of the input noisy speech and M
2
σv ,σz2

the total number of the frames. The quantities Ls,i (m) and where the difference is measured in the log-spectral domain
L̂s,i (m) are the reference clean speech LSFs and the estimated as given by,
ones at frame m, where i ∈ {1, ..., p} is the order index of
K
1 X 2
the clean speech AR model. Similarly, Lw,j are the reference d P̂y (k) , Py (k) = lnP̂y (k) − lnPy (k)
noise LSFs and L̂w,j the estimated ones at frame m, where K
k=1
i ∈ {1, ..., q} is the order index of the noise AR model. 2 (22)
K
X 2 2
1 σv2 / |As (k)| + σz2 / |Aw (k)| − Py (k)
In the enhancement stage, the clean speech LSFs and noise ≈ .
LSFs are first obtained by the well-trained DNN, and then K Py (k)
k=1
converted to their respective LPCs. The estimated LPCs are
used along with the estimated variances in the Kalman filter
equations (12) in order to estimate the desired speech signal.
To obtain the approximate equation in (22), we have used
equation (16) and the approximation of ln(x + 1) ≈ x.
C. Variance Estimation Then
by applying partial differentiation to the difference
d P̂y (k) , Py (k) with respect to σv2 and σz2 , we obtain the
The covariance matrix Qu in (13) is another key parameter following linear system of equations:
that needs to be estimated prior to the application of the
Kalman filtering equations. Proceeding as in [8], we now
formulate an optimization problem to estimate σv2 and σz2 . Our
goal is to minimize the difference between the noisy spectrum
2
and the sum of the estimated clean speech spectrum and noise Ess Esw σv Eys
spectrum. = (23)
Esw Eww σz2 Eyw
7

with The value of δl is determined as,


1 1  1 , fl < 1kHz
Ess = , Eww = 

Py2 (k) |As (k)|
4
Py2
4
(k) |Aw (k)|  Fs
δl = 2.5 , 1kHz ≤ fl ≤ 2 − 2kHz (28)
1 

Esw = 
 1.5 , f > Fs − 2kHz
2 2 l
Py2 (k) |As (k)| |Aw (k)| 2
1 1 where fl is the upper frequency of the l-th band and Fs
Eys = 2 , Eyw = 2 the sampling frequency. The above values of the factors α
Py (k) |As (k)| Py (k) |Aw (k)|
and β are taken from [5] where they have been determined
(24)
empirically based large experiments.
Finally, we synthesize the modified subband spectrum from
PK norms involved in equation (24) are defined as kf (k)k ,
The
k=1 |f (k)|.
the modified magnitude (25) and the phase of the Kalman-
When the AR spectrum of the observed noisy speech Py (k) filtered speech. The final enhanced speech is obtained by
is calculated and As (k) and Aw (k) are obtained with the computing the inverse FFT of the modified subband spectrums.
estimated LPCs from the trained DNN, we can finally obtain
the optimal variances σv2 and σz2 using equation (23). IV. E XPERIMENTAL R ESULTS
D. Post Subtraction A. Experimental Setup
To further remove the residual noise in the Kalman-filtered Databases: The clean speech is selected from the IEEE
speech, a post subtraction algorithm is applied right after sentence database [40], where we choose 670 utterances for
Kalman filtering. We adopt multiband spectral subtraction training and 50 utterances for enhancement. The noise is from
because of its good performance in reducing speech distortion the NOISEX-92 database [41], where four types of noises
[5]. The main idea of this method is described as follows. (babble, white, street, factory) are employed as seen noise, and
The fast Fourier transform (FFT) is first applied to the another four (pink, buccaneer2, destroyerengine, hfchannel)
windowed Kalman-filtered speech to obtain the magnitude as unseen noise. In the training stage, the noisy speech is
spectrum. Next, the noise spectrum is estimated and updated obtained by mixing the clean speech with seen noise at four
during the unvoiced frames. The detection of unvoiced frames levels SNRs, i.e., -3dB, 0dB, 3dB and 6dB, which results in
is accomplished by comparing the total power of the estimated 10720 utterances. In the enhancement stage, both seen noise
clean speech, say P̂s2 and that of the estimated noise, P̂w2 , and unseen noise are mixed with the clean speech at the above
which can easily be obtained from the estimated spectra in mentioned SNR levels. The number of noisy utterances used in
Section III-C. Specifically, a frame is labelled as a voiced the enhancement stage is 800 for both seen noise and unseen
frame if P̂s2 > P̂w2 , and as an unvoiced frame otherwise. noise. The sampling frequency is set to 16 kHz for both clean
Then, the magnitude spectra of the filtered speech and noise speech and noise.
are divided into L subbands. In each subband, the Kalman- Objective metrics: To evaluate the enhancement perfor-
filtered magnitude spectrum is enhanced by subtracting a noise mance, two objective metrics are selected: the perceptual
power spectrum term, evaluation of speech quality (PESQ) measure [42] and the
|Ĉl (k) |2 = |Ŝl (k) |2 − α δl |D̂l (k) |2 (25) short-time objective intelligibility (STOI) measure [43]. PESQ
and STOI evaluate the processed speech from two different
where |Ĉl (k) |2 denotes the modified subband speech power perspectives, i.e., speech quality and intelligibility, and are
spectrum, |Ŝl (k) |2 the Kalman-filtered speech power spec- widely adopted in speech-related applications. PESQ mea-
trum and |D̂l (k) |2 the estimated noise power spectrum (ob- sures the perceptual distortion by comparing the original and
tained and updated during unvoiced frames), with k being the processed signals. A score given by PESQ evaluation ranges
discrete frequency and l the subband index. Moreover, α is the from -0.5 to 4.5. Although PESQ is an objective metric for
oversubtraction factor and δl the additional subtraction factor evaluating the speech quality, it also reflects faithfully the
that can be individually set for each subband to customize the subjective score of the processed speech. STOI has been put
noise removal process. forward in recent years for objective assessment of the speech
The factors α and δl are used to control the noise subtraction intelligibility. The score of STOI ranges from 0 to 1, and shows
level within each band. The value of α is defined as a function a good correlation with the subjective score in listening test
of the segmental signal-to-noise ratio (SNR) (in dB), i.e., for the speech intelligibility. For both metrics, a higher score
 means a better speech quality or intelligibility.

 4.75 , SNR < −5

3
α = 4 − SNR , − 5 ≤ SNR ≤ 20 (26)

 20 B. Reference Methods

1 , SNR > 20
To evaluate the proposed speech enhancement system, we
with ! adopt several existing methods for performance comparison.
|Ŝl (k) |2 Our reference methods include both Kalman filter based algo-
SNR = 10 log10 (27)
|D̂l (k) |2 rithms and DNN-based approaches. The following provides
8

a brief conceptual summary of each one of the reference C. Evaluation of Input Feature Sets
methods. In the training stage, we use the following feature sets as
• IKF (Iterative Kalman filtering) [11]: This algorithm iter- the input of our proposed system: LPS-only set, LSF-only set,
atively performs conventional Kalman filtering, in which multi-feature set consisting of AMS+RASTAPLP+MFCC+GF,
the LPCs are updated in each iteration and joint set formed by combining the LSF-only set with the
• P-IKF (Perceptual IKF) [23]: This algorithm calculates a multi-feature set. In this experiment, we investigate the perfor-
perceptual mask according to human hearing system and mance of the proposed system with these different feature sets
applies it to the Kalman-filtered speech in order to further when using FNN for LSFs estimation. The objective results
remove the residual noises. of the enhanced speech are shown in Table I.
• S-IKF (Subband IKF) [17]: In this method, the noisy The final enhanced speech for the LPS-only and LSF-
speech is first divided into subband signals. Iterative only feature sets exhibit similar PESQ and STOI scores,
Kalman filtering is then applied separately for each sub- while the objective scores could be improved notably for the
band noisy speech. The final enhanced speech is obtained multi-feature and joint sets, using more indicates that more
by synthesising the subband enhanced speech signals. acoustic features provides useful additional information about
• FNN-MAG [24]: A FNN is employed to directly explore the speech. Finally, the enhanced speech from the joint set
the mapping from the noisy speech magnitude spectrum achieves the highest PESQ and STOI scores. As a result, the
to the clean one. The enhanced speech is synthesised with joint set is considered as the optimal input feature set for the
the estimated clean magnitude and noisy phase. proposed system.
• FNN-WF [32]: A FNN is trained for the estimation of
AR parameters of the clean speech. Then, a Wiener filter TABLE I: Objective results with different feature sets
is estimated by calculating the ratio of the estimated -3dB 0dB 3dB 6dB
clean speech power spectrum to that of the noisy speech. Noisy 1.41 1.52 1.68 1.86
The enhanced speech is then obtained by applying the LPS-only 1.67 1.90 2.10 2.29
estimated Wiener filter to the noisy speech. PESQ LSF-only 1.69 1.92 2.12 2.33
• FNN-KF [35]: A FNN is used to predict the LPCs needed Multi Set 1.80 2.06 2.27 2.46
for conventional Kalman filtering. The DNN learns the Joint Set 1.88 2.12 2.32 2.51
mapping from the acoustic features of the noisy speech Noisy 0.66 0.72 0.78 0.83
to the LSFs of the clean speech. The estimated LSFs are LPS-only 0.69 0.75 0.80 0.83
then converted to the desired LPCs. STOI LSF-only 0.68 0.74 0.80 0.84
Besides these benchmarks, we consider three versions of Multi Set 0.72 0.78 0.83 0.86
our proposed DNN-augmented colored-noise Kalman filter Joint Set 0.73 0.79 0.85 0.88
method, namely,
• FNN-CKF: FNN for LSFs estimation and without post
subtraction. D. Evaluation of LPCs Estimation Accuracy
• FNN-CKFS: FNN for LSFs estimation and with post In this subsection, the LPCs estimation error is evaluated to
subtraction. verify the learning capability of the proposed multi-objective
• LSTM-CKFS: LSTM for LSFs estimation and with post DNN training. We first define the LPCs estimation error of the
subtraction. speech as the mean square error (MSE) between the estimated
In order to make fair comparisons, we use the same con- LPCs and the ideal LPCs calculated from the clean speech for
figuration for the FNN in the related methods, i.e., one input each utterance as given below,
layer, one output layer and three hidden layers with 1024 units
( )
in each layer. The LSTM network is obtained by stacking by 1 X
M
1X
p
2
one input layer, two LSTM layers with 512 units in each layer, M SELP C = [âs,i (m) − as,i (m)] (29)
M m=1 p i=1
one feed-forward layer with 512 units and one output layer.
For FNN-MAG, a Hamming window is selected to divide where M denotes the number of the speech frames in the
each utterance into 20 ms time frames with an 10 ms frame utterance, as,i (m) the ideal LPCs of the clean speech and
shift (50% overlap). A 320-point DFT is then computed for âs,i (m) the estimated ones. The estimated LPCs are obtained
each frame. For the other reference methods and the proposed by three methods for comparison. The first one applies the
system, a rectangular window is used to divide the audio Levinson-Durbin (LD) algorithm to obtain the LPCs of the
signals into 20 ms frames with no overlap. noisy speech directly [44]. The second and third ones adopt
For the conventional Kalman filter, we set s(0|0) = 0, the proposed DNN based LSFs estimation algorithm, where
P(0|0) = I, and the AR model order of the clean speech FNN and LSTM are used to estimate the LSFs, which are then
as p = 12. For the colored-noise Kalman filter, we set converted to LPCs. Similarly, we compute the LPCs estimation
x(0|0) = 0, P(0|0) = I, and the orders of AR models for error of the additive noise for each noise type by using (29),
clean speech and additive noise as p = q = 12 . For the post where the estimated and ideal LPCs of the speech are replaced
subtraction in FNN-CKFS and LSTM-CKFS, the spectrum is by those of the additive noise, and the order of the speech
evenly divided into 4 bands. model, p is replaced by that of the noise, q.
9

-3 dB 0 dB -3 dB 0 dB
0.3 0.3 0.3 0.3
Average MSE

Average MSE
0.2 0.2 0.2 0.2

0.1 0.1 0.1 0.1

0 0 0 0
LD FNN LSTM LD FNN LSTM LD FNN LSTM LD FNN LSTM

3 dB 6 dB 3 dB 6 dB
0.3 0.3 0.3 0.3
Average MSE

Average MSE
0.2 0.2 0.2 0.2

0.1 0.1 Babble 0.1 0.1 Pink

White Buccaneer2
Street Destroyerengine
Factory Hfchannel
0 0 0 0
LD FNN LSTM LD FNN LSTM LD FNN LSTM LD FNN LSTM

(a) Seen noise (b) Unseen noise

Fig. 2: LPC estimation error comparison for speech among Levinson-Durbin (LD), FNN and LSTM methods.

-3 dB 0 dB
-3 dB 0 dB 0.2 0.2
0.2 0.2
0.15 0.15
Average MSE

0.15 0.15
Average MSE

0.1 0.1
0.1 0.1

0.05 0.05
0.05 0.05

0 0
0 0 LD FNN LSTM LD FNN LSTM
LD FNN LSTM LD FNN LSTM

3 dB 6 dB
3 dB 6 dB 0.2 0.2
0.2 0.2

0.15 0.15
Average MSE

0.1 0.1 0.1 0.1

Babble Pink
0.05 0.05 White 0.05 0.05 Buccaneer2
Street Destroyerengine
Factory Hfchannel
0 0 0 0
LD FNN LSTM LD FNN LSTM LD FNN LSTM LD FNN LSTM

(a) Seen noise (b) Unseen noise

Fig. 3: LPC estimation error comparison for additive noise among LD, FNN and LSTM methods.

Fig. 2 shows the LPCs estimation error comparison for the The LPCs estimation error comparison for the additive noise
speech. The average MSE is computed over all the testing is shown in Fig. 3, where we notice important differences
utterances for both seen and unseen noise. In general, the with the case of clean speech. Firstly, as SNR increases, the
FNN and LSTM based approaches give a slightly smaller error strong speech component more strongly affect the noise LPCs
than the LD method does for the eight types of noises and estimation, and hence the LPCs estimation error of additive
different SNRs. In addition, the error from LSTM is smaller noise gets larger. Secondly, compared with speech, noise
than that from FNN in most cases. Another important finding exhibits less structure and correlation, and the mapping from
is that the error from the DNN methods decrease with an the noisy speech feature to the noise LSFs is thus more difficult
increase of the SNR, which means that DNN achieves a better to learn. Therefore, we can find that the DNN estimation result
performance at higher SNR. The LPCs estimation performance is not always better than the traditional LD method, especially
also varies for different noise types. In particular, the best at low SNR.
estimation accuracy is achieved for street noise, and the worst
for white noise. Interestingly, we note that the estimation error E. Speech Enhancement Performance under Seen Noise
of FNN and LSTM based algorithm under unseen noise does
not increase considerably compared with that under seen noise, Here, we campare the different speech enhancement meth-
which indicates that using DNN in LPCs estimation offers ods under seen noise. Table II gives the average objective
robustness and has a good generalization capability. scores of different speech enhancement methods on seen
noise. We first note that the performances of the unsupervised
10

TABLE II: Objective scores of different speech enhancement methods on seen noise

PESQ STOI
Method
-3 dB 0 dB 3 dB 6 dB -3 dB 0 dB 3 dB 6 dB
Noisy 1.41 1.52 1.68 1.86 0.66 0.72 0.78 0.83
IKF 1.55 1.79 2.01 2.25 0.67 0.74 0.80 0.85
P-IKF 1.57 1.83 2.08 2.31 0.68 0.75 0.81 0.85
S-IKF 1.56 1.81 2.04 2.29 0.67 0.75 0.81 0.84
FNN-MAG 1.89 2.13 2.34 2.55 0.75 0.82 0.86 0.88
FNN-WF 1.65 1.83 2.15 2.36 0.71 0.78 0.82 0.86
FNN-KF 1.70 1.93 2.13 2.30 0.71 0.77 0.81 0.85
FNN-CKF 1.73 2.01 2.26 2.49 0.72 0.78 0.84 0.87
FNN-CKFS 1.88 2.12 2.32 2.51 0.73 0.79 0.85 0.88
LSTM-CKFS 1.93 2.16 2.38 2.58 0.74 0.80 0.85 0.88

Kalman filtering algorithms are worse than those of the DNN- degradation is possibly caused by the inaccuracy in estimating
based methods. The P-IKF, which incorporates a perceptual the noise LPCs, as shown in Fig.3 where the FNN estimation
mask to further suppress the residual noise, is the best among error is higher than the LD estimation error under low input
the three unsupervised Kalman filtering algorithms. However, SNR conditions.
P-IKF still can not achieve as good performance as FNN-KF, In the case of unseen noise, we find that LSTM-CKFS
not to mention our FNN-CKF, FNN-CKFS and LSTM-CKFS. achieves the best objective scores due to its advanced network
These results demonstrate the benefit from employing DNN structure. More interestingly, FNN-MAG no longer holds the
in parameter estimation. The DNN can predict more accurate best performance among FNN based methods. In fact, the
LPCs from the noisy speech, thus improving the performance objective scores of FNN-MAG decease largely, indicating that
of Kalman filtering algorithms. mapping the noisy magnitude spectrum to the clean one is
Moreover, FNN-KF has lower objective scores compared prone to errors when the noise is unmatched with those in
with the proposed methods. This is because FNN-KF requires the training stage. In contrast, FNN-WF, FNN-KF and our
a VAD procedure to detect the unvoiced frame for estimating proposed system suffer less performance degradation. Indeed,
and updating the additive noise variance σw 2
. However, VAD the denoising process in these methods is accomplished by
in noisy condition is a difficult task, which causes variance Wiener and Kalman filtering. Therefore, as the DNN can
estimation error and introduces extra distortion to the enhanced provide more accurate parameters, their performances would
speech. In our proposed system, an AR model is adopted to not fluctuate as much whether on seen noise or unseen
represent the background noise. As such, the Kalman filtering noise. Based on these results, and considering the robustness
equations in (12) no longer involve σw 2
, and we can therefore of the DNN-based LPCs estimation, we can conclude that
overcome the speech distortion problem due to the inaccurate our FNN-CKF, FNN-CKFS and LSTM-CKFS, have a better
estimation of σw
2
. The performance can be further improved by generalization capability than FNN-MAG.
employing post subtraction to remove the residual noise due Finally, we make comparison in terms of each objective
to the inaccurate parameters of the noise AR model. Indeed, metric. Although the enhanced speech from LSTM-CKFS has
FNN-CKFS achieves a better performance than FNN-CKF, the best speech quality according to the PESQ scores, the im-
which approaches closely that of FNN-MAG. Finally, although provement of speech intelligibility is not obvious as seen from
FNN-MAG has the best performance among all tested FNN the STOI scores. In fact, the LSTM-CKFS gives similar STOI
based approaches, by employing LSTM for LSFs estimation scores to FNN-MAG. Actually, there is a trade-off between
in our proposed system, LSTM-CKFS can achieve the best residual noise and speech distortion for speech enhancement
PESQ scores, which demonstrates the LSTM’s advantage in algorithms, leading to decreased speech intelligibility. For our
modelling long temporal dependencies. LSTM-CKFS, the enhanced speech achieves similar speech in-
telligibility as that of FNN-MAG but far better speech quality,
F. Speech Enhancement Performance under Unseen Noise indicating that LSTM-CKFS could preserve the information
content of clean speech well, while significantly removing the
Table III gives the average objective scores of the different
additive noise.
speech enhancement methods in the case of unseen noise.
In this case, the performances of the unsupervised Kalman
filtering algorithms are still worse than those of the FNN-based G. Spectrograms of Enhanced Speeches
methods. Comparing FNN-KF with our proposed system, we To better understand the characteristics of the enhanced
can find again that FNN-CKF, FNN-CKFS and LSTM-CKFS speech, Fig. 4 shows the spectrograms of the enhanced speech
outperform FNN-KF because of the adoption of colored- signals from several selected methods, demonstrating the ef-
noise Kalman filter. However, the STOI scores of FNN-CKF fects of the residual noises and the distortions in the harmonic
are slightly lower than those of FNN-KF at low SNR. This structures in the time-frequency domain. The noisy speech is
11

TABLE III: Objective scores of different speech enhancement methods on unseen noise

PESQ STOI
Method
-3 dB 0 dB 3 dB 6 dB -3 dB 0 dB 3 dB 6 dB
Noisy 1.37 1.51 1.65 1.82 0.65 0.72 0.78 0.83
IKF 1.64 1.84 2.04 2.26 0.68 0.75 0.81 0.85
P-IKF 1.67 1.88 2.09 2.32 0.69 0.76 0.81 0.85
S-IKF 1.66 1.87 2.08 2.31 0.68 0.75 0.81 0.84
FNN-MAG 1.73 1.92 2.13 2.32 0.70 0.76 0.82 0.87
FNN-WF 1.68 1.92 2.15 2.33 0.67 0.74 0.81 0.85
FNN-KF 1.73 1.95 2.21 2.38 0.71 0.77 0.82 0.85
FNN-CKF 1.76 2.02 2.26 2.48 0.70 0.76 0.82 0.86
FNN-CKFS 1.89 2.11 2.32 2.50 0.71 0.78 0.83 0.87
LSTM-CKFS 1.91 2.15 2.36 2.55 0.73 0.79 0.84 0.88

obtained by mixing a selected clean speech utterance with buc- VI. ACKNOWLEDGEMENTS
caneer noise at 3dB SNR. For the best unsupervised Kalman
The authors acknowledge the support from China Schol-
filtering in our experiment, i.e., P-IKF, we can find the musical
arships Council (CSC No.201606270200) and NSERC of
noise structure in the spectrogram in the region between 4kHz
Canada under a CRD project sponsored by Microchip in
and 8kHz. The spectrogram of FNN-MAG also exhibits some
Ottawa, Canada.
musical noise structures in the high-frequency component as
well as residual noise in the low-frequency component. For
FNN-WF, the high-frequency components look better than the R EFERENCES
previous two spectrograms, but still have undesired structures,
[1] S. Boll, “Suppression of acoustic noise in speech using
which are likely caused by the difficulty of Wiener filter in
spectral subtraction,” IEEE Trans. on Acoustics, Speech,
removing non-stationary noise. Finally, for the four Kalman
and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
filtering related methods, it is observed that FNN-KF, FNN-
[2] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement
CKFS and LSTM-CKFS can remove the background noise
of speech corrupted by acoustic noise,” in IEEE Int. Conf.
quite well. However, the high-frequency components of FNN-
on Acoustics, Speech and Signal Processing (ICASSP),
KF still suffer from various degradations. While this situation
pp. 208–211, April, 1979.
is improved in the cases of FNN-CKFS and LSTM-CKFS, the
[3] W. M. Kushner, V. Goncharoff, C. Wu, V. Nguyen, and
LSTM-CKFS can preserve the harmonic structures best among
J. N. Damoulakis, “The effects of subtractive-type speech
all the tested methods, thus achieving the best objective scores.
enhancement/noise reduction algorithms on parameter
estimation for improved recognition and coding in high
V. C ONCLUSION noise environments,” in IEEE Int. Conf. on Acoustics,
In this paper, we have proposed a hybrid speech en- Speech and Signal Processing (ICASSP), pp. 211–214,
hancement system with DNN-aided parameter estimation and May, 1989.
colored-noise Kalman filtering. Our system first employs a [4] L. Singh and S. Sridharan, “Speech enhancement using
multi-objective FNN or LSTM to estimate the AR model pa- critical band spectral subtraction,” in Int. Conf. on Spoken
rameters of both clean speech and noise. Then a colored-noise Language Processing, pp. 2827–2830, Nov., 1998.
Kalman filter with the estimated parameters is applied to the [5] S. Kamath and P. Loizou, “A multi-band spectral subtrac-
noisy speech for denoising. By doing so, the proposed system tion method for enhancing speech corrupted by colored
can more efficiently cope with color noises encountered in noise,” in IEEE Int. Conf. on Acoustics, Speech and
real-world environments. To further improve the enhancement Signal Processing (ICASSP), pp. 4160–4164, May, 2002.
performance, a post subtraction algorithm is adopted to better [6] J. S. Lim and A. V. Oppenheim, “Enhancement and band-
remove the residual noise. width compression of noisy speech,” Proceedings of the
Experiments have shown the superiority of the proposed IEEE, vol. 67, no. 12, pp. 1586–1604, December,1979.
system from two aspects. First, the employment of DNN for [7] J. Chen, J. Benesty, Y. Huang, and S. Doclo, “New in-
parameter estimation and post subtraction for residual noise sights into the noise reduction wiener filter,” IEEE Trans.
suppression largely improves the enhancement performance on audio, speech, and language processing, vol. 14,
of colored-noise Kalman filtering. Second, our proposed sys- no. 4, pp. 1218–1234, 2006.
tem takes advantages of both unsupervised and supervised [8] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, “Code-
methods, and thus exhibits a better generalization capability. book driven short-term predictor parameter estimation
Indeed, while it achieves comparable performance as recent for speech enhancement,” IEEE Trans. on Audio, Speech,
DNN-based approaches on seen noise, it offers notably better and Language Processing, vol. 14, no. 1, pp. 163–176,
results on unseen noise. 2005.
12

Fig. 4: Spectrograms of the clean, noisy and enhanced speech signals for different methods.

[9] K. Paliwal and A. Basu, “A speech enhancement method filter,” in IEEE International Symposium on Circuits and
based on Kalman filtering,” in IEEE Int. Conf. on Acous- Systems (ISCAS), pp. 762–765, May, 2016.
tics, Speech and Signal Processing (ICASSP), vol. 12, [18] H. Yu, W.-P. Zhu, and B. Champagne, “High-frequency
pp. 177–180, April, 1987. component restoration for Kalman filter based speech
[10] J. D. Gibson, B. Koo, and S. D. Gray, “Filtering of enhancement,” in International Symposium on Circuits
colored noise for speech enhancement and coding,” IEEE and Systems (ISCAS), pp. 1–5, IEEE, Oct., 2020.
Trans. on Signal Processing, vol. 39, no. 8, pp. 1732– [19] H. Yu, W.-P. Zhu, and B. Champagne, “Subband Kalman
1742, 1991. filtering with DNN estimated parameters for speech
[11] S. Gannot, D. Burshtein, and E. Weinstein, “Iterative and enhancement,” in Proc. of Interspeech, pp. 2697–2701,
sequential Kalman filter-based speech enhancement al- Oct., 2020.
gorithms,” IEEE Trans. on speech and audio processing, [20] D. C. Popescu and I. Zeljkovic, “Kalman filtering of col-
vol. 6, no. 4, pp. 373–385, 1998. ored noise for speech enhancement,” in IEEE Int. Conf.
[12] T. Mellahi and R. Hamdi, “LPC-based formant enhance- on Acoustics, Speech and Signal Processing (ICASSP),
ment method in Kalman filtering for speech enhance- vol. 2, pp. 997–1000, May, 1998.
ment,” AEU-International Journal of Electronics and [21] V. Grancharov, J. Samuelsson, and W. B. Kleijn, “Im-
Communications, vol. 69, no. 2, pp. 545–554, 2015. proved Kalman filtering for speech enhancement,” in
[13] Y. Xia and J. Wang, “Low-dimensional recurrent neural IEEE Int. Conf. on Acoustics, Speech and Signal Pro-
network-based Kalman filter for speech enhancement,” cessing (ICASSP), vol. 1, pp. 1109–1112, March, 2005.
Neural Networks, vol. 67, pp. 131–139, 2015. [22] N. Ma, M. Bouchard, and R. A. Goubran, “Perceptual
[14] N. Nower, Y. Liu, and M. Unoki, “Restoration scheme Kalman filtering for speech enhancement in colored
of instantaneous amplitude and phase using Kalman filter noise,” in IEEE Int. Conf. on Acoustics, Speech and
with efficient linear prediction for speech enhancement,” Signal Processing (ICASSP), vol. 1, pp. 717–720, May,
Speech Communication, vol. 70, pp. 13–27, June, 2015. 2004.
[15] M. S. Kavalekalam, M. G. Christensen, F. Gran, and J. B. [23] N. Ma, M. Bouchard, and R. A. Goubran, “Speech
Boldt, “Kalman filter for speech enhancement in cocktail enhancement using a masking threshold constrained
party scenarios using a codebook-based approach,” in Kalman filter and its heuristic implementations,” IEEE
IEEE Int. Conf. on Acoustics, Speech and Signal Pro- Trans. on Audio, Speech, and Language Processing,
cessing (ICASSP), pp. 191–195, March, 2016. vol. 14, no. 1, pp. 19–32, 2005.
[16] W.-R. Wu and P.-C. Chen, “Subband Kalman filtering for [24] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental
speech enhancement,” IEEE Trans. on Circuits and Sys- study on speech enhancement based on deep neural
tems II: Analog and Digital Signal Processing, vol. 45, networks,” IEEE Signal Processing Letters, vol. 21, no. 1,
no. 8, pp. 1072–1083, 1998. pp. 65–68, 2013.
[17] S. K. Roy, W.-P. Zhu, and B. Champagne, “Single chan- [25] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression
nel speech enhancement using subband iterative Kalman approach to speech enhancement based on deep neural
13

networks,” IEEE/ACM Trans. on Audio, Speech and IEEE/ACM Trans. on Audio, Speech and Language Pro-
Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, cessing (TASLP), vol. 21, no. 2, pp. 270–279, 2013.
2015. [40] IEEE Subcommittee, “IEEE recommended practice for
[26] A. Narayanan and D. Wang, “Ideal ratio mask estimation speech quality measurements,” IEEE Trans. on Audio and
using deep neural networks for robust speech recogni- Electroacoustics, vol. 17, no. 3, pp. 225–246, 1969.
tion,” in IEEE Int. Conf. on Acoustics, Speech and Signal [41] A. Varga and H. J. Steeneken, “Assessment for automatic
Processing (ICASSP), pp. 7092–7096, May, 2013. speech recognition: II. NOISEX-92: A database and an
[27] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, experiment to study the effect of additive noise on speech
“Phase-sensitive and recognition-boosted speech sepa- recognition systems,” Speech Communication, vol. 12,
ration using deep recurrent neural networks,” in IEEE no. 3, pp. 247–251, 1993.
Int. Conf. on Acoustics, Speech and Signal Processing [42] ITU-R, “Perceptual evaluation of speech quality (PESQ)
(ICASSP), pp. 708–712, April, 2015. an objective method for end-to-end speech quality as-
[28] W. Han, X. Zhang, G. Min, M. Sun, and J. Yang, sessment of narrowband telephone networks and speech
“Joint optimization of audible noise suppression and deep codecs,” Recommendation P.862, 2001.
neural networks for single-channel speech enhancement,” [43] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen,
in IEEE Int. Conf. on Multimedia and Expo (ICME), “An algorithm for intelligibility prediction of time-
pp. 1–6, July, 2016. frequency weighted noisy speech,” IEEE/ACM Trans.
[29] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio on Audio, Speech and Language Processing (TASLP),
masking for monaural speech separation,” IEEE/ACM vol. 19, no. 7, pp. 2125–2136, 2011.
Trans. on Audio, Speech and Language Processing [44] T. Shimamura, N. Kunieda, and J. Suzuki, “A robust
(TASLP), vol. 24, no. 3, pp. 483–492, 2016. linear prediction method for noisy speech,” in IEEE Int.
[30] M. Tu and X. Zhang, “Speech enhancement based on Symposium on Circuits and Systems (ISCAS), vol. 4,
deep neural networks with skip connections,” in IEEE pp. 257–260, May, 1998.
Int. Conf. on Acoustics, Speech and Signal Processing The authors declare that they have no known competing
(ICASSP), pp. 5565–5569, March, 2017. financial interests or personal relationships that could have
[31] H. Yu, W.-P. Zhu, and Y. Yang, “Constrained ratio appeared to influence the work reported in this paper.
mask for speech enhancement using DNN,” in Proc. of
Interspeech, pp. 2427–2431, Oct., 2020. AUTHORSHIP AND C ONTRIBUTIONS
[32] Y. Li and S. Kang, “Deep neural network-based linear Hongjiang Yu: Conceptualization, Methodology, Software,
predictive parameter estimations for speech enhance- Formal analysis, Writing - Original Draft
ment,” IET Signal Processing, vol. 11, no. 4, pp. 469– Wei-Ping Zhu: Validation, Formal analysis,Writing - Review
476, 2016. & Editing, Supervision, Resources
[33] S. Nie, S. Liang, B. Liu, Y. Zhang, W. Liu, and J. Tao, Benoit Champagne: Validation, Formal analysis, Writing -
“Deep noise tracking network: A hybrid signal process- Review & Editing
ing/deep learning approach to speech enhancement.,” in
Proc. of Interspeech, pp. 3219–3223, Sept., 2018.
[34] Z. Ouyang, H. Yu, W.-P. Zhu, and B. Champagne, “A
deep neural network based harmonic noise model for
speech enhancement,” in Proc. of Interspeech, pp. 3224–
3228, Sept., 2018.
[35] H. Yu, Z. Ouyang, W.-P. Zhu, and B. Champagne, “A
deep neural network based Kalman filter for time domain
speech enhancement,” in IEEE International Symposium
on Circuits and Systems (ISCAS), pp. 1–5, May, 2019.
[36] H. Yu, W.-P. Zhu, Z. Ouyang, and B. Champagne, “A
hybrid speech enhancement system with DNN based
speech reconstruction and Kalman filtering,” Multimedia
Tools and Applications, vol. 79, pp. 32643 – 32663, 2020.
[37] J. Chen and D. Wang, “Long short-term memory for
speaker generalization in supervised speech separation,”
The Journal of the Acoustical Society of America,
vol. 141, no. 6, pp. 4705–4714, 2017.
[38] M. H. Moattar and M. M. Homayounpour, “A simple but
efficient real-time voice activity detection algorithm,” in
European Signal Processing Conference, pp. 2549–2553,
August, 2009.
[39] Y. Wang, K. Han, and D. Wang, “Exploring monau-
ral features for classification-based speech segregation,”

Fundamental of Speech Enhencements
No ratings yet
Fundamental of Speech Enhencements
112 pages
2023 - Dragas - A Survey On Low-Latency DNN-Based Speech Enhancement
No ratings yet
2023 - Dragas - A Survey On Low-Latency DNN-Based Speech Enhancement
26 pages
Kalman Filter - Kordic V. (Ed.) - 2010 - Intech - 9789533070940 - Anna's Archive
No ratings yet
Kalman Filter - Kordic V. (Ed.) - 2010 - Intech - 9789533070940 - Anna's Archive
400 pages
A Speech Denoising Demonstration System Using Multi-Model Deep-Learning Neural Networks
No ratings yet
A Speech Denoising Demonstration System Using Multi-Model Deep-Learning Neural Networks
23 pages
Kalman Filtering Book PDF
No ratings yet
Kalman Filtering Book PDF
83 pages
Deep Neural Networks For Speech Enhancement
No ratings yet
Deep Neural Networks For Speech Enhancement
7 pages
The Bangla Adaptation of Mini Mental Sta
No ratings yet
The Bangla Adaptation of Mini Mental Sta
10 pages
Kalman Filter in Speech Enhancement: Thesis
No ratings yet
Kalman Filter in Speech Enhancement: Thesis
43 pages
Simulation of Digital Communication Systems Using Matlab
From Everand
Simulation of Digital Communication Systems Using Matlab
Mathuranathan Viswanathan
3.5/5 (22)
B.Tech ECE - R23
No ratings yet
B.Tech ECE - R23
96 pages
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models
No ratings yet
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models
14 pages
DTSP Mini Project
No ratings yet
DTSP Mini Project
13 pages
Digital Signal Processing: PBL Approach: Theme: Speech Processing Title
No ratings yet
Digital Signal Processing: PBL Approach: Theme: Speech Processing Title
15 pages
04.04.2021 - PPT Presentation IEEE 6th I2CT - Paper ID - 247
No ratings yet
04.04.2021 - PPT Presentation IEEE 6th I2CT - Paper ID - 247
27 pages
GSVD-Based Optimal Filtering For Single and Multimicrophone Speech Enhancement
No ratings yet
GSVD-Based Optimal Filtering For Single and Multimicrophone Speech Enhancement
15 pages
LPCKalman DNNConv
No ratings yet
LPCKalman DNNConv
12 pages
Comb - Investigation of The Effect of Speech Enhancement On The Watermarking Process
No ratings yet
Comb - Investigation of The Effect of Speech Enhancement On The Watermarking Process
12 pages
High-Fidelity Noise Reduction With Differentiable Signal
No ratings yet
High-Fidelity Noise Reduction With Differentiable Signal
10 pages
2019 Speech Enhancement For Secure Communication
No ratings yet
2019 Speech Enhancement For Secure Communication
19 pages
An Experimental Study On Speech Enhancement Based On A Combination of Wavelets and Deep Learning
No ratings yet
An Experimental Study On Speech Enhancement Based On A Combination of Wavelets and Deep Learning
17 pages
7
No ratings yet
7
8 pages
Speech Enhancement Using Kalman Filter
No ratings yet
Speech Enhancement Using Kalman Filter
14 pages
A Perceptually-Motivated Approach For Low-Complexity Real-Time Enhancement of Fullband Speech
No ratings yet
A Perceptually-Motivated Approach For Low-Complexity Real-Time Enhancement of Fullband Speech
5 pages
BTP Group-1 Report
No ratings yet
BTP Group-1 Report
21 pages
Indian Institue of Technology 1
No ratings yet
Indian Institue of Technology 1
186 pages
Noise Estimation and Noise Removal Techniques For Speech Recognition in Adverse Environment
No ratings yet
Noise Estimation and Noise Removal Techniques For Speech Recognition in Adverse Environment
8 pages
Sensors: Wearable Hearing Device Spectral Enhancement Driven by Non-Negative Sparse Coding-Based Residual Noise Reduction
No ratings yet
Sensors: Wearable Hearing Device Spectral Enhancement Driven by Non-Negative Sparse Coding-Based Residual Noise Reduction
14 pages
Multi-Band Spectral Subtraction Algorithm For Speech Enhancement
No ratings yet
Multi-Band Spectral Subtraction Algorithm For Speech Enhancement
12 pages
Paper 3
No ratings yet
Paper 3
8 pages
Noise Supression Techniques For Speech Enhancement Using Adaptive Filtering
No ratings yet
Noise Supression Techniques For Speech Enhancement Using Adaptive Filtering
18 pages
Speech Enhancement Noise Reduction Noise Reduction: Pham Van Tuan
No ratings yet
Speech Enhancement Noise Reduction Noise Reduction: Pham Van Tuan
26 pages
Speech Intelligibility Prediction Intended For State-Of-The-Art Noise Estimation Algorithms
No ratings yet
Speech Intelligibility Prediction Intended For State-Of-The-Art Noise Estimation Algorithms
1 page
A Priori SNR Estimation Based On A RNN For Robust Speech Enhancement
No ratings yet
A Priori SNR Estimation Based On A RNN For Robust Speech Enhancement
5 pages
Non-Linear Feature Extraction For Robust Speech Recognition in Stationary and Non-Stationary Noise
No ratings yet
Non-Linear Feature Extraction For Robust Speech Recognition in Stationary and Non-Stationary Noise
22 pages
Speech Enhancement: Concept and Methodology
No ratings yet
Speech Enhancement: Concept and Methodology
21 pages
Robust Speech Recognition Using Adaptive Noise Cancellation
No ratings yet
Robust Speech Recognition Using Adaptive Noise Cancellation
4 pages
Comparison of Speech Enhancement Algorithms: Sciencedirect
No ratings yet
Comparison of Speech Enhancement Algorithms: Sciencedirect
11 pages
DeepFilterNet2205 05474
No ratings yet
DeepFilterNet2205 05474
5 pages
M.E. Comm. and Net.
No ratings yet
M.E. Comm. and Net.
94 pages
A Corpus-Based Approach To Speech Enhancement From Nonstationary Noise
No ratings yet
A Corpus-Based Approach To Speech Enhancement From Nonstationary Noise
15 pages
SSP Paper PDF
No ratings yet
SSP Paper PDF
4 pages
Image Restoration: Yinghua He School of Computer Science and Technology Tianjin University
No ratings yet
Image Restoration: Yinghua He School of Computer Science and Technology Tianjin University
116 pages
Different Approaches of Spectral Subtraction Method For Enhancing The Speech Signal in Noisy Environments
No ratings yet
Different Approaches of Spectral Subtraction Method For Enhancing The Speech Signal in Noisy Environments
6 pages
Ijert Ijert: Analysis of LMS and NLMS Adaptive Beamforming Algorithms
No ratings yet
Ijert Ijert: Analysis of LMS and NLMS Adaptive Beamforming Algorithms
6 pages
International Journal of Computational Engineering Research (IJCER)
No ratings yet
International Journal of Computational Engineering Research (IJCER)
10 pages
Speech Enhancement
No ratings yet
Speech Enhancement
5 pages
Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms
No ratings yet
Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms
11 pages
ADSP Unit 3
No ratings yet
ADSP Unit 3
34 pages
CNN Basic
No ratings yet
CNN Basic
11 pages
Optimal Filtering
No ratings yet
Optimal Filtering
20 pages
PRC PPT Mtech
No ratings yet
PRC PPT Mtech
23 pages
DSP Unit5 Applications of Multirate Signal Processing
No ratings yet
DSP Unit5 Applications of Multirate Signal Processing
19 pages
Speech Enhancement Using Signal Subspace Algorithm
No ratings yet
Speech Enhancement Using Signal Subspace Algorithm
4 pages
Adaptive Filtering With Averaging in Noise Cancell
No ratings yet
Adaptive Filtering With Averaging in Noise Cancell
5 pages
Weiner Filter
No ratings yet
Weiner Filter
35 pages
An Information-Theoretic Framework For Receiver Quantization in Communication
No ratings yet
An Information-Theoretic Framework For Receiver Quantization in Communication
25 pages
Adaptive Wiener Filtering Approach For Speech Enhancement
No ratings yet
Adaptive Wiener Filtering Approach For Speech Enhancement
9 pages
Filtering For Speech: Enhancement
No ratings yet
Filtering For Speech: Enhancement
4 pages
Employing Laplacian-Gaussian Densities For Speech Enhancement
No ratings yet
Employing Laplacian-Gaussian Densities For Speech Enhancement
4 pages
Speech Enhancement Using An Adaptive Wiener Filtering Approach M. A. Abd El-Fattah, M. I. Dessouky, S. M. Diab and F. E. Abd El-Samie
No ratings yet
Speech Enhancement Using An Adaptive Wiener Filtering Approach M. A. Abd El-Fattah, M. I. Dessouky, S. M. Diab and F. E. Abd El-Samie
18 pages
Spectral Restoration Based Speech Enhancement For Robust Speaker Identification
No ratings yet
Spectral Restoration Based Speech Enhancement For Robust Speaker Identification
6 pages
Boll79 SuppressionAcousticNoiseSpectralSubtraction PDF
No ratings yet
Boll79 SuppressionAcousticNoiseSpectralSubtraction PDF
8 pages
Multi-Objective Ant Lion Optimizer A Multi-Objective Optimization Algorithm For Solving Engineering Problems
No ratings yet
Multi-Objective Ant Lion Optimizer A Multi-Objective Optimization Algorithm For Solving Engineering Problems
17 pages
A Novel Expectation-Maximization Framework For Speech Enhancement in Non-Stationary Noise Environments
No ratings yet
A Novel Expectation-Maximization Framework For Speech Enhancement in Non-Stationary Noise Environments
12 pages
My
No ratings yet
My
25 pages
Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms
No ratings yet
Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms
11 pages
Space Time Block Code
No ratings yet
Space Time Block Code
106 pages
Implementation of Adaptive Filtering Algorithm For Speech Signal On FPGA
No ratings yet
Implementation of Adaptive Filtering Algorithm For Speech Signal On FPGA
5 pages
Speech Enhancement Using LPC Analysis-A Review
No ratings yet
Speech Enhancement Using LPC Analysis-A Review
6 pages
Filtering For Speech Enhancement: H H H H
No ratings yet
Filtering For Speech Enhancement: H H H H
4 pages
Ubiquitous Computing and Communication Journal - 72
No ratings yet
Ubiquitous Computing and Communication Journal - 72
8 pages
Atmaja 2016 J. Phys. Conf. Ser. 776 012072
No ratings yet
Atmaja 2016 J. Phys. Conf. Ser. 776 012072
7 pages
An Effective Fruit Fly Optimization Algorithm With Hybrid
No ratings yet
An Effective Fruit Fly Optimization Algorithm With Hybrid
26 pages
E-PiCo Syllabus MASTER PROGRAM
No ratings yet
E-PiCo Syllabus MASTER PROGRAM
51 pages
Elements On Estimation Theory 04nvp04de13
No ratings yet
Elements On Estimation Theory 04nvp04de13
52 pages
Adaptive Blind Noise Suppression in Some Speech Processing Applications
No ratings yet
Adaptive Blind Noise Suppression in Some Speech Processing Applications
5 pages
Characterization of Clutter Heterogeneity and Estimation of Its Covariance Matrix
No ratings yet
Characterization of Clutter Heterogeneity and Estimation of Its Covariance Matrix
6 pages
Lectr 16
No ratings yet
Lectr 16
33 pages
SDET Formulae MidSem2 2018 Ver3
No ratings yet
SDET Formulae MidSem2 2018 Ver3
2 pages
FPGA Set 1 0
No ratings yet
FPGA Set 1 0
2 pages
SSP Exam WS1718
No ratings yet
SSP Exam WS1718
5 pages
Massive MIMO in The UL DL of Cellular Networks How Many Antennas Do We Need
No ratings yet
Massive MIMO in The UL DL of Cellular Networks How Many Antennas Do We Need
12 pages
Linear MMSE Estimation of Random Variables
No ratings yet
Linear MMSE Estimation of Random Variables
6 pages
A Comparison of GSM Receivers For Fading
No ratings yet
A Comparison of GSM Receivers For Fading
9 pages
Optimal Joint Detection and Estimation in Linear Models: Jianshu Chen, Yue Zhao, Andrea Goldsmith, and H. Vincent Poor
No ratings yet
Optimal Joint Detection and Estimation in Linear Models: Jianshu Chen, Yue Zhao, Andrea Goldsmith, and H. Vincent Poor
6 pages
OTSM Matlabcode Readme
No ratings yet
OTSM Matlabcode Readme
2 pages
Mimo For Dummies
No ratings yet
Mimo For Dummies
7 pages
Tutorial 5 ADSP
No ratings yet
Tutorial 5 ADSP
1 page
Maximum Versoria Criterion-Based Robust Adaptive Filtering Algorithm
No ratings yet
Maximum Versoria Criterion-Based Robust Adaptive Filtering Algorithm
5 pages
Statistical Signal: Fall 2019-2020 Short Description
No ratings yet
Statistical Signal: Fall 2019-2020 Short Description
2 pages
Reduced Complexity Decision Feedback Equalizer For Supporting High Mobility in Wimax
No ratings yet
Reduced Complexity Decision Feedback Equalizer For Supporting High Mobility in Wimax
5 pages
A Novel Channel Estimation Algorithm For 3GPP LTE Downlink System Using Joint Time-Frequency Two-Dimensional Iterative Wiener Filter
No ratings yet
A Novel Channel Estimation Algorithm For 3GPP LTE Downlink System Using Joint Time-Frequency Two-Dimensional Iterative Wiener Filter
4 pages
Mmse Proof
No ratings yet
Mmse Proof
2 pages
Noise Reduction: Enhancing Clarity, Advanced Techniques for Noise Reduction in Computer Vision
From Everand
Noise Reduction: Enhancing Clarity, Advanced Techniques for Noise Reduction in Computer Vision
Fouad Sabry
No ratings yet

Speech Enhancement Using A DNN-Augmented Colored-Noise

Uploaded by

Speech Enhancement Using A DNN-Augmented Colored-Noise

Uploaded by

Speech Enhancement Using a DNN-Augmented Colored-Noise Kalman Filter

Speech Enhancement Using a DNN-Augmented Colored-Noise

Hongjiang Yu, Wei-Ping Zhu, Benoit Champagne

To appear in: Speech Communication

Received date: 21 April 2020

© 2020 Published by Elsevier B.V.

Speech Enhancement Using a DNN-Augmented

vector. Moreover, the augmented matrices G, H, and the filter, i.e.,

with The value of δl is determined as,

0.1 0.1 0.1 0.1

0.1 0.1 Babble 0.1 0.1 Pink

(a) Seen noise (b) Unseen noise

0.1 0.1 0.1 0.1

(a) Seen noise (b) Unseen noise

You might also like