
A Priori SNR Estimation Based on a Recurrent Neural Network for Robust Speech Enhancement

Yangyang Xia¹, Richard M. Stern¹,²
¹Department of Electrical and Computer Engineering, Carnegie Mellon University
²Language Technologies Institute, Carnegie Mellon University
[email protected], [email protected]

Abstract

Speech enhancement under highly non-stationary noise conditions remains a challenging problem. Classical methods typically attempt to identify a frequency-domain optimal gain function that suppresses noise in noisy speech. These algorithms typically produce artifacts such as "musical noise" that are detrimental to machine and human understanding, largely due to inaccurate estimation of noise power spectra. The optimal gain function is commonly referred to as the ideal ratio mask (IRM) in neural-network-based systems, and the goal becomes estimation of the IRM from the short-time Fourier transform amplitude of degraded speech. While these data-driven techniques are able to enhance speech quality with reduced artifacts, they are frequently not robust to types of noise that they have not been exposed to in the training process. In this paper, we propose a novel recurrent neural network (RNN) that bridges the gap between classical and neural-network-based methods. By reformulating the classical decision-directed approach, the a priori and a posteriori SNRs become latent variables in the RNN, from which the frequency-dependent estimated likelihood of speech presence is used to update the latent variables recursively. The proposed method provides substantial enhancement of speech quality and objective accuracy in machine interpretation of speech.

Index Terms: robust speech enhancement, a priori SNR estimation, decision-directed, recurrent neural networks.

1. Introduction

Speech enhancement (SE) has been one of the enabling technologies for robust speech processing applications for decades. SE algorithms strive to improve the quality and intelligibility of speech signals degraded by additive noise [1]. Enhanced speech signals benefit subsequent human listening or the performance of machine tasks, such as automatic speech recognition and speaker verification. Classical signal processing methods for SE typically work in the frequency domain, with optimization criteria associated with the spectral components of the enhanced speech. The techniques range from heuristically estimating the power spectra [2], to finding a linear filter that optimizes the mean squared error of the complex spectra [3], to minimum mean-squared error (MMSE) estimators that optimize the (log) short-time spectral amplitude (STSA) [4, 5].

The a priori signal-to-noise ratio (SNR) and the a posteriori SNR arise as two important concepts from the derivation of the MMSE-STSA estimator [4]. The a priori SNR can be understood as the true instantaneous power ratio between each spectral component of clean speech and noise, while the a posteriori SNR can be viewed as the instantaneous power ratio between each spectral component of the observed noisy speech and noise. Within this framework, the optimal gain functions in the STSA domain for well-known methods such as spectral subtraction [2], the Wiener filter [3], the maximum likelihood (ML) estimator [6], and MMSE estimation [4] can all be expressed in terms of the a priori and a posteriori SNRs [7]. The enhancement problem thus becomes an a priori SNR and a posteriori SNR estimation problem. For estimating the a priori SNR, a closed-form maximum-likelihood method and a recursive "decision-directed" method have been proposed [4]. For estimating the a posteriori SNR, or equivalently the noise power, the minima-controlled recursive-averaging (MCRA) algorithm can be employed [7, 8, 9]. Despite the robustness of the decision-directed approach even in highly nonstationary noise environments, inexact heuristics in the estimation procedure often produce artifacts called musical noise, which are sometimes even more detrimental to machine tasks and the human listening experience than the noisy speech itself.

Independent of the classical approaches, researchers in the neural network (NN) community formulate the speech enhancement task as a supervised learning problem. Recognizing the ideal ratio mask (IRM) in the STSA domain as a better training target than clean signal power or magnitude spectra [10], various neural network architectures have been explored to learn the IRM for SE. Examples include feedforward deep neural networks [11, 12], deep denoising autoencoders [13], and recurrent neural networks (RNNs) with long short-term memory [14]. Although these NN-based SE algorithms work well under noise conditions that appear in the training set, they typically suffer from degraded performance on unseen noise types because they attempt to learn a nonlinear mapping between noisy speech and the IRM.

A fusion system that combines the robustness and interpretability of the classical approach with the learning ability of the NN approach is clearly desirable. One previous study that attempts this fusion [15] proposes a NN version of spectral subtraction, with dedicated NNs for estimating the noise alone, the noise in noisy speech, and the enhanced speech. Although their NN structure is reminiscent of spectral subtraction, our experiments show that the latent variables do not learn the intended representation. Others [16, ?] have attempted to improve a priori SNR estimation using NNs, but their systems are shallow combinations of multiple approaches at the input and output levels.

We propose a novel RNN that addresses these issues. We slightly modify the decision-directed approach to form a recurrent estimation of both the a priori and a posteriori SNRs, eliminating the need to estimate the noise explicitly. This reformulation leads to a ratio-based representation for all variables, which have already been shown to be superior training targets for neural network learning [10]. Among them, the a priori SNR, the a posteriori SNR, and the speech-presence likelihood ratio are interpreted as latent recurrent cells of a recurrent neural network. This enables us to insert feedforward NNs to learn parameters that are normally determined heuristically in classical approaches. In addition, we introduce a learning objective function that jointly optimizes the MSE of the STSA as well as the frame-level speech-presence detection accuracy.
2. The Signal-to-Noise Ratio Recurrent Neural Network (SNRNN)

Our signal-to-noise ratio recurrent neural network (SNRNN) consists of a slightly modified version of the classical decision-directed a priori SNR estimation and a neural network component. Throughout the discussion, we assume additive noise in the short-time Fourier transform (STFT) domain:

X[m, k] = S[m, k] + N[m, k]    (1)

where X[m, k], S[m, k], and N[m, k] denote the STFT at time frame m and frequency bin k of the observed noisy speech, clean speech, and noise, respectively. The end goal is to seek the optimal gain function or IRM in the STSA domain, G[m, k], such that the clean speech estimate Ŝ[m, k] can be obtained from the modified STSA and the phase of the noisy input:

Ŝ[m, k] = G[m, k] |X[m, k]| e^{j∠X[m, k]}    (2)

The a priori SNR ξ[m, k] is defined as the ratio of the expected value of the clean speech power to the expected value of the noise power:

ξ[m, k] = E[|S[m, k]|²] / E[|N[m, k]|²]    (3)

The a posteriori SNR γ[m, k] is defined as the ratio of the instantaneous noisy speech power to the expected value of the noise power:

γ[m, k] = |X[m, k]|² / E[|N[m, k]|²]    (4)

In estimating the a priori and a posteriori SNRs, we replace the expected values with the corresponding instantaneous values:

ξ̂[m, k] = |Ŝ[m, k]|² / |N̂[m, k]|²,   γ̂[m, k] = |X[m, k]|² / |N̂[m, k]|²    (5)

Assuming that S[m, k] and N[m, k] are statistically independent zero-mean complex Gaussian random variables, Eq. 1 implies an additive relationship in the spectral power domain:

E[|X[m, k]|²] = E[|S[m, k]|²] + E[|N[m, k]|²]    (6)

which leads to the definition of ξ[m, k] in terms of γ[m, k]:

ξ[m, k] = E[γ[m, k]] − 1    (7)

The decision-directed approach [4] calculates ξ̂[m, k] by linearly averaging the past and present estimates of the a priori SNR:

ξ̂[m, k] = a Ĝ²[m−1, k] γ̂[m−1, k] + (1 − a) max{γ̂[m, k] − 1, 0}    (8)

where 0 < a < 1 is a weighting coefficient, and max{·} is the element-wise maximum operator that prevents the current estimate from going below 0. The gain function Ĝ[m, k] is expressed in terms of ξ̂[m, k], depending on the method to be used [7]. We use the Wiener solution [3, 7], because the partial derivative of Ĝ[m, k] with respect to ξ̂[m, k] does not involve a potential division by zero, which would cause gradient explosion during training:

Ĝ[m, k] = ξ̂[m, k] / (ξ̂[m, k] + 1)    (9)
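To make the recursion concrete, the following is a minimal NumPy sketch of one frame of Eqs. 5, 8, and 9, assuming a per-bin noise power estimate (e.g., from MCRA) is available; the function and variable names are ours, not from a released implementation.

```python
import numpy as np

def decision_directed_step(X_pow, N_pow, gamma_prev, G_prev, a=0.98):
    """One frame of decision-directed estimation (Eqs. 5, 8, 9).

    X_pow:      |X[m, k]|^2, noisy power spectrum of frame m (K-dim)
    N_pow:      |N_hat[m, k]|^2, noise power estimate, e.g. from MCRA
    gamma_prev: a posteriori SNR of frame m-1
    G_prev:     gain applied to frame m-1
    """
    gamma_hat = X_pow / N_pow                                  # Eq. 5
    xi_hat = (a * G_prev**2 * gamma_prev
              + (1.0 - a) * np.maximum(gamma_hat - 1.0, 0.0))  # Eq. 8
    G = xi_hat / (xi_hat + 1.0)                                # Wiener gain, Eq. 9
    return xi_hat, gamma_hat, G
```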

Noise estimation is needed to calculate γ̂[m, k] by definition. Acknowledging the importance of the decision-directed approach, we adopt the MCRA algorithm [7, 8, 9] for noise power estimation. Specifically, a speech-absence hypothesis H₀ᵏ and a speech-presence hypothesis H₁ᵏ are assumed for each frequency bin k of each frame m of the noisy signal:

H₀ᵏ: |N̂[m, k]|² = b |N̂[m−1, k]|² + (1 − b) |X[m−1, k]|²
H₁ᵏ: |N̂[m, k]|² = |N̂[m−1, k]|²    (10)

where 0 < b < 1 is a weighting coefficient. In other words, the noise power in a specific frequency bin is recursively updated with a fraction of the signal power from the previous frame only if that bin is classified as speech-absent. This decision is made by thresholding the likelihood ratio of speech-presence uncertainty:

Λ[m, k] ≜ P(X[m, k] | H₁ᵏ) / P(X[m, k] | H₀ᵏ)    (11)

The previous assumption that the noise and speech DFT coefficients are independent, complex, and Gaussian leads to:

Λ[m, k] = (1 / (1 + ξ̂[m, k])) · exp(γ̂[m, k] ξ̂[m, k] / (1 + ξ̂[m, k]))    (12)

In our system, we replace the hard threshold used in Eq. 10 by a soft threshold to enable gradient backpropagation. We also rewrite γ̂[m, k] as a recursive function, eliminating the notion of noise estimation completely. Finally, we introduce the neural network component, along with the loss function.

2.1. Recurrent A Priori and A Posteriori SNR Estimation

The noise update rule in Eq. 10 can be interpreted as a recurrent nonlinear activation function. Specifically, let δ be a hard threshold on the log-likelihood ratio of speech-presence uncertainty above which the noisy frame is classified as speech-present. The update rule can then be rewritten as:

|N̂[m, k]|² / |N̂[m−1, k]|² = β(Λ[m−1, k]) + (1 − β(Λ[m−1, k])) γ̂[m−1, k]    (13)

where β(Λ[m, k]) is the scaled and shifted unit step function:

β(Λ[m, k]) = b + (1 − b) u[log(Λ[m, k]) − δ]    (14)

To enable gradient backpropagation in our RNN, we propose two nonlinearities, sigmoid and piecewise-linear, that have nonzero gradients around the decision boundary δ to replace the unit step function:

β_sig(Λ[m, k]) = b + (1 − b) / (1 + e^{−(log(Λ[m, k]) − δ)})    (15)
β_pwl(Λ[m, k]) = min{1, max{b, ((1 − b)/(2ε)) [log(Λ[m, k]) − (δ − ε)] + b}}

where ε is a small positive constant that controls the width of the linear region.
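A sketch of the two soft gates of Eq. 15 in the same NumPy style; b and δ follow the values reported in Sec. 3, while the width ε of the linear region is a placeholder, since its value is not specified here.

```python
import numpy as np

def beta_sig(Lambda, b=0.98, delta=0.15):
    """Sigmoid soft gate of Eq. 15; b and delta follow Sec. 3."""
    return b + (1.0 - b) / (1.0 + np.exp(-(np.log(Lambda) - delta)))

def beta_pwl(Lambda, b=0.98, delta=0.15, eps=0.1):
    """Piecewise-linear soft gate: ramps from b to 1 over a region of
    width 2*eps centered at delta (eps = 0.1 is a placeholder value)."""
    ramp = (1.0 - b) / (2.0 * eps) * (np.log(Lambda) - (delta - eps)) + b
    return np.minimum(1.0, np.maximum(b, ramp))
```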
[log(Λ[m, k]) − (δ − )] + b}}

ˆ
ξ[m, k] = aĜ2 [m − 1, k]γ̂[m − 1, k] + (1 − a)max{γ̂[m, k] − 1, 0} (8) where  is a small positive constant that controls the width of
the linear region. Combining the new update Eq. 13 with Eq. 5
where 0 < a < 1 is the weighting coefficient, and max{·} is we obtain:
the element-wise maximum operator that prevents the current
estimate from going below 0. The gain function Ĝ[m, k] is ex- |X[m, k]|2 γ̂[m − 1, k]
γ̂[m, k] = (16)
pressed in terms of ξ[m, k] depending on the method to be used |X[m − 1, k]|2 β + (1 − β)γ̂[m − 1, k]
[7]. We use the Wiener estimate solution [3, 7], because the
ˆ
partial derivative of Ĝ[m, k] with respect to ξ[m, k] does not where β is shorthand for β(Λ[m − 1, k]). Equations 8, 12, and
involve potential division by zero, which would result in gradi- 16 complete the recurrent estimation of both a priori and a pos-
ent explosion during training: teriori SNR, without the need to explicitly estimate noise power.
This distinction is important from the neural network learning
ˆ
ξ[m, k] perspective, as estimating a ratio mask rather than direct signal
Ĝ[m, k] = (9)
ˆ
ξ[m, k] + 1 is desirable [10]. We now present the full system.
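Putting Eqs. 8, 9, 12, 15, and 16 together gives the complete noise-free recurrence on which the SNRNN is built. Below is a sketch of one time step over all K frequency bins, again under our naming conventions; a practical implementation would guard the exponential in Eq. 12 against overflow.

```python
import numpy as np

def snr_recurrent_step(X_pow, X_pow_prev, gamma_prev, G_prev, Lambda_prev,
                       a=0.98, b=0.98, delta=0.15):
    """One recurrence of Eqs. 8, 9, 12, 15, and 16 over K frequency bins,
    with no explicit noise-power estimate."""
    # Soft speech-presence gate (sigmoid form of Eq. 15)
    beta = b + (1.0 - b) / (1.0 + np.exp(-(np.log(Lambda_prev) - delta)))
    # A posteriori SNR via the ratio-based recursion (Eq. 16)
    gamma_hat = (X_pow / X_pow_prev) * gamma_prev / (
        beta + (1.0 - beta) * gamma_prev)
    # Decision-directed a priori SNR (Eq. 8)
    xi_hat = (a * G_prev**2 * gamma_prev
              + (1.0 - a) * np.maximum(gamma_hat - 1.0, 0.0))
    # Speech-presence likelihood ratio (Eq. 12); clip the exponent in practice
    Lambda = np.exp(gamma_hat * xi_hat / (1.0 + xi_hat)) / (1.0 + xi_hat)
    # Wiener gain (Eq. 9)
    G = xi_hat / (xi_hat + 1.0)
    return xi_hat, gamma_hat, Lambda, G
```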
We now present the full system.

2.2. RNN for A Priori SNR Estimation

The recurrent structure described in the previous subsection naturally lends itself to a recurrent neural network framework. Specifically, we place a feedforward neural network immediately after each weighting factor, so that the RNN can learn the recursive averaging coefficients rather than applying heuristics:

ξ̂[m, k] = â₁[m, k] Ĝ²[m−1, k] γ̂[m−1, k] + â₂[m, k] max{γ̂[m, k] − 1, 0}
γ̂[m, k] = (|X[m, k]|² / |X[m−1, k]|²) · γ̂[m−1, k] / (b̂₁[m, k] + b̂₂[m, k] γ̂[m−1, k])    (17)
â₁[m, k] = FF(a[m, k]),  â₂[m, k] = FF(1 − a[m, k]),  b̂₁[m, k] = FF(β(Λ[m, k])),  b̂₂[m, k] = FF(1 − β(Λ[m, k]))

where FF(·) represents a feedforward neural network. Although these equations look very similar to Eqs. 8 and 16, we note two key differences. First, each coefficient is now parametrized by both time and frequency. Because of the interconnections of the neural network, these coefficients depend not only on the frequency bin they belong to, but also on all other frequency bins. This is a useful generalization that is hard to carry out systematically in the classical framework due to the lack of a closed-form solution, but is easily realized in the neural network framework. Second, the heuristic constraint that the weighting coefficients add up to 1 is removed.
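The corresponding SNRNN step of Eq. 17 is sketched below with plain NumPy feedforward networks rather than an autodiff framework. Our reading of Eq. 17, which is an assumption of this sketch, is that each FF(·) receives its heuristic coefficient broadcast across the K frequency bins (the constant a or 1 − a, and the soft gate β(Λ) or 1 − β(Λ)), so that the b̂ coefficients vary with both time and frequency through Λ.

```python
import numpy as np

def ff(x, params):
    """3-layer ReLU feedforward network (one of the four in Fig. 1)."""
    for W, b in params:
        x = np.maximum(W @ x + b, 0.0)
    return x

def snrnn_step(X_pow, X_pow_prev, gamma_prev, G_prev, Lambda_prev, nets,
               a=0.98, b=0.98, delta=0.15):
    """One SNRNN recurrence over K frequency bins (Eq. 17).
    `nets` holds the parameter lists of the four feedforward networks."""
    K = X_pow.shape[0]
    beta = b + (1.0 - b) / (1.0 + np.exp(-(np.log(Lambda_prev) - delta)))
    a1 = ff(a * np.ones(K), nets[0])            # replaces a in Eq. 8
    a2 = ff((1.0 - a) * np.ones(K), nets[1])    # replaces 1 - a
    b1 = ff(beta, nets[2])                      # replaces beta in Eq. 16
    b2 = ff(1.0 - beta, nets[3])                # replaces 1 - beta
    gamma_hat = (X_pow / X_pow_prev) * gamma_prev / (b1 + b2 * gamma_prev)
    xi_hat = a1 * G_prev**2 * gamma_prev + a2 * np.maximum(gamma_hat - 1.0, 0.0)
    Lambda = np.exp(gamma_hat * xi_hat / (1.0 + xi_hat)) / (1.0 + xi_hat)  # Eq. 12
    return xi_hat, gamma_hat, Lambda
```

A trainable version would implement the same operations in a framework that backpropagates through the recurrence.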
The loss function of the SNRNN is twofold. We adopt the mean squared error in the STSA domain of the enhanced speech, not only because it is a popular objective function in deep learning SE methods, but also because it is the principle upon which the a priori SNR estimation is derived [4]:

E_mse[m] = (1/K) Σ_{k=0}^{K−1} (|S[m, k]| − |Ŝ[m, k]|)²    (18)

In addition to the MSE-STSA loss, we introduce a frame-level voice activity detection (VAD) loss. Because the recurrent structure is derived directly from the decision-directed approach, we can obtain the frame-level speech-presence log-likelihood ratio by assuming statistical independence across frequency:

log Λ[m] = Σ_{k=0}^{K−1} log Λ[m, k]    (19)

where K is the total number of frequency bins. Assuming equal prior probabilities of speech presence and absence, the speech-presence probability given the noisy frame can be expressed as:

P(speech | X[m]) = Λ[m] / (1 + Λ[m])    (20)

We define the VAD loss for this two-class classification problem as the cross-entropy between the true and predicted speech-presence probabilities:

E_vad[m] = −vad[m] log(P(speech | X[m])) − (1 − vad[m]) log(1 − P(speech | X[m]))    (21)

where vad[m] ∈ {0, 1} is the true speech-presence probability for frame m. The overall objective function is:

E[m] = α E_mse[m] + (1 − α) E_vad[m]    (22)

where 0 ≤ α ≤ 1 is the weighting coefficient.
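For reference, a sketch of the per-frame objective of Eqs. 18-22, using the fact that Eq. 20 is a logistic sigmoid of the frame-level log-likelihood ratio; the small guard on the logarithms is our addition.

```python
import numpy as np

def snrnn_loss(S_mag, S_hat_mag, log_Lambda, vad, alpha=0.2):
    """Per-frame objective of Eqs. 18-22.

    S_mag, S_hat_mag: clean and enhanced STSA for frame m (K-dim)
    log_Lambda:       per-bin log-likelihood ratios log Lambda[m, k]
    vad:              1 if frame m contains speech, else 0
    """
    e_mse = np.mean((S_mag - S_hat_mag) ** 2)           # Eq. 18
    log_Lambda_m = np.sum(log_Lambda)                   # Eq. 19
    p_speech = 1.0 / (1.0 + np.exp(-log_Lambda_m))      # Eq. 20 as a sigmoid
    eps = 1e-12                                         # our guard for the logs
    e_vad = -(vad * np.log(p_speech + eps)
              + (1 - vad) * np.log(1.0 - p_speech + eps))  # Eq. 21
    return alpha * e_mse + (1.0 - alpha) * e_vad        # Eq. 22
```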
Figure 1: SNRNN computation at frame m in the dashed box. Octagons hold input and output. Latent variables are bounded by rings. Feedforward networks are highlighted by bold capsules. Rectangles and circles are element-wise operations.

We conclude our description of the SNRNN from the neural network perspective. In the zoomed-in view of the recurrent structure shown in Fig. 1, the a priori SNR, the a posteriori SNR, and the speech-presence likelihood ratio are interpreted as latent variables that carry information across time frames. The four feedforward neural networks interact with instantaneous power ratios rather than direct speech or noise power, which potentially makes the system robust against unseen noise types. In fact, all variables in the network are represented as ratios, motivated by the finding that ratio masks are superior learning targets [10].

3. Experimental Results and Discussions

We conducted experiments using the RATS speech activity detection dataset [17] and the TIMIT dataset [18]. We selected the RATS dataset to train our system because it contains extremely challenging noise conditions. It also contains ample examples of both speech-present and speech-absent regions that are needed for training. To demonstrate the enhancement quality of our system, we chose speakers from four of the RATS channels for a speaker verification (SV) evaluation. To demonstrate the robustness of our system against other unseen noise types, we performed global and local signal-to-distortion ratio (SDR) tests [19] on speech segments from the TIMIT dataset with digitally added noise samples taken from the NOIZEUS dataset [20]. In both experiments, we compared the SNRNN's performance (denoted NN in all tables and figures) with the classical Wiener solution using decision-directed a priori SNR estimation with MCRA noise estimation (denoted DD).

To train the SNRNN, we used a total of 56 hours of audio from 320 recordings sampled at 16 kHz from Channels A and H in the development partition of the RATS SAD dataset. For the SV task, we also used Channel D in training. During the training phase, 1000 320-ms audio segments were randomly sampled from all recordings to form one minibatch. We used oracle VAD information to maintain approximately equal numbers of speech-present and speech-absent frames within each minibatch. Next, we computed the STFT of each segment with a 32-ms Hamming window, 75% overlap between frames, and 512-point DFTs. Finally, we computed the magnitude STFT, retaining the first 257 frequency dimensions as the input to the SNRNN.
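A sketch of this feature extraction using scipy.signal.stft under the stated analysis parameters; at 16 kHz a 32-ms window is exactly 512 samples, and the one-sided 512-point DFT yields the 257 retained bins.

```python
import numpy as np
from scipy.signal import stft

def snrnn_features(x, fs=16000):
    """Magnitude STFT input features (Sec. 3): 32-ms Hamming window,
    75% overlap, 512-point DFT; the one-sided spectrum has 257 bins."""
    nperseg = int(0.032 * fs)          # 512 samples at 16 kHz
    noverlap = int(0.75 * nperseg)     # 75% overlap, i.e. an 8-ms hop
    _, _, X = stft(x, fs=fs, window='hamming',
                   nperseg=nperseg, noverlap=noverlap, nfft=512)
    return np.abs(X)                   # shape (257, num_frames)
```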

For each neural network inside the SNRNN, we used a 3-layer feedforward network with 257 neurons in each of the input, hidden, and output layers. We used rectified linear units (ReLUs) as the activation function at all layers because SNRs are non-negative. We used the sigmoid function in Eq. 15. We chose the constants a = 0.98, b = 0.98, and δ = 0.15, which are effective for the classical DD approach [1]. The network parameters were initialized so that each network is an identity function prior to learning. γ̂, ξ̂, and Λ are initialized the same way as in the decision-directed method [1]. α = 0.2 is a good weighting constant for the loss function. We used stochastic gradient descent with a learning rate of 10⁻⁴ and a momentum of 0.9 to update all network weights.
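The identity initialization can be written down directly: because the inputs to each network are non-negative SNR-like ratios, identity weights and zero biases make every ReLU layer a pass-through, so the untrained SNRNN should reproduce the decision-directed system. A sketch, under the reading that each of the three layers is a 257 × 257 ReLU layer:

```python
import numpy as np

def identity_ff_params(dim=257, n_layers=3):
    """Identity initialization for one of the SNRNN's feedforward nets."""
    return [(np.eye(dim), np.zeros(dim)) for _ in range(n_layers)]

def ff_apply(x, params):
    """Stacked ReLU layers; with identity parameters and non-negative
    input (SNRs are non-negative), each layer is a pass-through."""
    for W, b in params:
        x = np.maximum(W @ x + b, 0.0)
    return x
```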
We evaluated our enhancement system in terms of the equal error rate (EER) obtained for the SV task. The baseline SV system was trained using the ALIZE i-vector system setup described in [21]. Farsi speakers in the training partition of the RATS SAD dataset were used for evaluation. The enrollment consisted of 30, 28, 28, and 30 speakers from RATS Channels A, B, C, and H, respectively; 28, 35, 25, and 37 recordings from Channels A, B, C, and H, respectively, were tested against every enrolled speaker from their corresponding channels. Table 2 shows that the NN provides significant improvement for all channels except Channel B. The improvement for unseen Channel C is even greater than the improvement for Channel A. In addition, NN provides better performance than DD in all cases. One notable finding in our experiments is that Ĝ[m, k] in Eqs. 8 and 9 no longer needs to be identical in the SNRNN. The results in Table 2 were obtained using the power subtraction rule [7] for Eq. 8, and the Wiener filter rule for Eq. 9.

Table 2: Speaker Verification Performance on RATS Speakers

EER (%)     Noisy   DD     NN     Clean
Channel A   28.6    32.2   24.9   10.7
Channel B   36.6    37.2   36.6   11.5
Channel C   44.8    40.1   36.7    7.93
Channel H   43.2    29.7   23.9   10.8

Table 1 shows the improvement of global and 512-point segmental SDR after applying DD and NN to noisy TIMIT utterances. The four types of noise we include are perceptually very different from the noise in the RATS channels. Our results show that the SNRNN consistently improves segmental SDR under all conditions, even though the global SDR is sometimes worse than that of DD at high SNRs (which are rare in our training data). We illustrate this phenomenon in Fig. 2, where we show compensated waveforms after DD and NN processing, respectively. The impulse-like speech waveform at around 3.8 s is wrongly suppressed by NN. However, the bursts of noise during 0-3 s and 5-10 s are better contained using NN. In addition, NN produces far fewer musical artifacts during 4-5 s. The parameter α controls the tradeoff between speech smearing and noise suppression. Using the MSE-STSA loss alone yields almost an identical system to the decision-directed approach, while using the VAD loss alone results in a system that heavily suppresses noise and smears speech. Overall, the robustness of SNRNN processing was expected because, as we note in Sec. 2, the neural networks "see" only instantaneous SNRs.

Table 1: Improvement of SDR and 512-point Segmental SDR on 6300 TIMIT Utterances

              Cafeteria Babble          Train                 Flight                Car
SNR (dB)   SDR        SegSDR       SDR        SegSDR      SDR        SegSDR      SDR        SegSDR
           DD    NN   DD    NN     DD    NN   DD    NN    DD    NN   DD    NN    DD    NN   DD    NN
10         2.25  0.96 5.21  8.20   3.30  1.76 5.60  8.30  4.25  2.68 5.17  8.53  6.92  4.01 7.13  10.5
5          3.05  2.88 7.09  9.43   4.20  4.11 7.53  9.76  5.25  5.24 6.86  9.90  8.83  7.66 9.18  12.7
0          3.67  3.90 8.93  10.4   4.85  5.61 9.31  11.0  6.19  6.88 9.01  11.4  10.7  10.7 11.6  15.3
-5         3.44  3.65 9.54  10.5   5.27  6.37 10.7  11.9  7.00  7.69 11.3  12.5  12.3  13.1 13.8  17.5
-10        1.39  1.66 9.41  9.80   4.39  5.57 11.0  12.3  6.72  7.24 12.4  13.1  13.1  14.5 14.8  18.4
Mean       2.76  2.61 8.04  9.67   4.40  4.68 8.83  10.7  5.88  5.95 8.95  11.1  10.4  9.99 11.3  14.9

Figure 2: Denoised waveforms. Top left and bottom left: denoised signal using DD and NN, respectively. Top right and bottom right: residual noise at 4-5 s after 39.3 dB amplification.

4. Conclusions

In this paper, we have proposed a neural-network equivalent of decision-directed a priori SNR estimation. We strongly advocate the use of instantaneous SNRs as internal representations in neural networks, to accompany the use of IRMs as learning targets [10], for noise robustness. Our system preserves the robustness of the classical method while improving the accuracy of the recurrent approximations of the a priori and a posteriori SNRs. Our results have shown that SNRNN processing can preserve speech and greatly suppress noise, while producing very few residual artifacts. In addition, our system can handle unseen nonstationary noise conditions when trained on very few noise types. We introduced a joint STSA-MSE and VAD loss function, and highlighted the importance of the VAD loss for balancing the level of noise suppression and speech distortion. In the future, we will attempt to improve the quality of the enhanced speech in speech-present regions, and extend the additive-noise framework to linear filtering for channel compensation.

5. Acknowledgments

We thank Benjamin Martinez Elizalde for useful suggestions on many prior drafts of this paper. We also thank our colleagues at the Afeka College of Engineering for their support.
6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, Second Edition. CRC Press, 2013.

[2] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, Apr. 1979.

[3] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proceedings of the IEEE, vol. 67, no. 12, pp. 1586-1604, 1979.

[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1122, Dec. 1984.

[5] ——, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443-445, 1985.

[6] R. J. McAulay and M. L. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137-145, Apr. 1980.

[7] P. Scalart and J. Filho, "Speech enhancement based on a priori signal to noise estimation," in Proc. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1996, pp. 629-632.

[8] J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 1998, pp. 365-368.

[9] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1, 2002.

[10] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, 2014.

[11] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, Jan. 2015.

[12] A. Kumar and D. Florencio, "Speech enhancement in multiple-noise conditions using deep neural networks," 2016.

[13] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. INTERSPEECH 2013, pp. 436-440, 2013.

[14] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Lecture Notes in Computer Science, vol. 9237. Springer, Cham, 2015, pp. 91-99.

[15] K. Osako, R. Singh, and B. Raj, "Complex recurrent neural networks for denoising speech signals," in Proc. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015.

[16] S. Suhadi, C. Last, and T. Fingscheidt, "A data-driven approach to a priori SNR estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 186-195, 2011.

[17] K. Walker, X. Ma, D. Graff, S. Strassel, and K. Jones, "RATS speech activity detection LDC2015S02," Philadelphia: Linguistic Data Consortium, 2015.

[18] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.

[19] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.

[20] Y. Hu and P. C. Loizou, "Subjective comparison of speech enhancement algorithms," in Proc. 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2006, pp. I-I.

[21] A. Larcher, J.-F. Bonastre, B. Fauve, K. A. Lee, C. Lévy, H. Li, J. S. D. Mason, and J.-Y. Parfait, "ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition," in Proc. Interspeech, pp. 2768-2772, 2013.
