Speech Enhancement Based On Soft Masking Exploiting Both Output SNR and Selectivity of Spatial Filtering

Biho Kim, Yunil Hwang and Hyung-Min Park

A speech enhancement method is presented, which applies a soft mask to a target speech output of spatial filtering, such as conventional beamforming or independent component analysis (ICA). In contrast to conventional methods using either outputs or filters estimated by spatial filtering, the mask is constructed by exploiting both the local output signal-to-noise ratio (SNR) and the spatial selectivity obtained from the directivity pattern of the estimated filters. Experiments were conducted with both ICA and minimum power distortionless response beamforming as spatial filtering in order to demonstrate that the described mask estimation is not tuned to a particular preprocessing algorithm. The results, in terms of both SNR with a retained speech ratio and word accuracy in speech recognition, show that the described method can effectively suppress residual noise in the target speech output of spatial filtering.

Introduction: Speech enhancement remains a very important issue because speech is prone to contamination by noise in practical situations. Many techniques have been developed to overcome this problem, and they can be categorised into spatial filtering and masking algorithms. Spatial filtering such as conventional beamforming [1] or independent component analysis (ICA) [2] typically attempts to find a linear combination of multiple observations such that the combination contains the maximum contribution from the target speech. The prevalent ICA algorithms require more sensors than sources, whereas conventional beamforming may improve performance as more sensors become available. Thus, more sensors seem to be advantageous but result in heavy computational loads. Processing is usually performed in the frequency domain for efficiency, but performance is limited because a long frame size is required to cover a long acoustic reverberation, whereas the amount of learning data in each frequency bin decreases as the frame size increases [3]. In addition, the algorithms suffer from performance degradation due to phase ambiguity [1]. If the sensors are not far from each other, the low-frequency components have negligible phase differences, which results in insufficient spatial selectivity between the target and interference directions. On the other hand, the directivity pattern can form nulls in both the target and interference directions for some high-frequency bins because of spatial aliasing, even though the directions are spatially separated. When the distances between sensors are larger, more frequency bins suffer from this problem.

In contrast, masking algorithms apply estimated mask weights to mixtures, typically in the time–frequency (t–f) domain; thus, desired outputs with negligible residual noise can be obtained regardless of the number or configuration of sensors or sources if ideal mask weights are estimated. Many contemporary algorithms have used binary masks that specify which t–f components belong to a particular source (e.g. [4]). However, a binary mask is clearly an oversimplified description of how sound sources are combined, because each sound source contributes to the mixture to varying extents [5]. This oversimplification may be mitigated by introducing soft masks. Unfortunately, constructing a reliable mask from mixtures has been one of the primary problems [6]. Instead of estimating a mask directly from mixtures, spatial filtering can be employed as preprocessing to provide sub-optimal target and noise estimates or estimated spatial filters, which may help determine the mask weights. In this case, the mask can be applied to a target speech output enhanced by spatial filtering in order to suppress residual noise in the output, so it can provide better performance than masking the mixtures directly. Many algorithms have been described to construct masks, but most of them have considered either target and noise estimates or estimated spatial filters [6–9].

Target and noise estimates provide the local dominance of the target speech at a specific t–f segment, whereas spatial selectivity in a frequency bin can be obtained from the estimated spatial filters. Thus, they may provide complementary information to obtain a better mask. In this Letter, we develop a soft-mask estimation method by exploiting both the local output signal-to-noise ratio (SNR) of spatial filtering and the spatial selectivity obtained from the directivity pattern of the estimated filters. Target speech enhancement is accomplished by employing the estimated soft mask to suppress the residual noise of the target speech output obtained by spatial filtering. In addition, both ICA and minimum power distortionless response beamforming were employed as spatial filtering to demonstrate that the described method is not tuned to a specific preprocessing algorithm. We will show that the method provides robust and superior performance in terms of both SNR with a retained speech ratio (RSR) and word accuracy.

Algorithm description: Fig. 1 shows the overall system using the presented algorithm. Let us consider that multiple sensors observe input signals denoted as {x_i(t), i = 1, …, N}. Employing frequency-domain spatial filtering, the outputs of the target-extracting and target-rejecting filters, denoted by Y_TE(k, τ) and Y_TR(k, τ), respectively, are expressed as

    Y_TE(k, τ) = W_TE(k) [X_1(k, τ), …, X_N(k, τ)]^T    (1)

    Y_TR(k, τ) = W_TR(k) [X_1(k, τ), …, X_N(k, τ)]^T    (2)

where X_i(k, τ) is the t–f representation at frequency-bin index k and frame index τ obtained by a short-time Fourier transform (STFT) of the ith input x_i(t). Moreover, W_TE(k) and W_TR(k) denote vectors containing the coefficients of the target-extracting and target-rejecting filters estimated by spatial filtering at the kth frequency bin. Conventional beamforming or ICA can be used as the spatial filtering algorithm to estimate these filters. A soft mask is applied to suppress the residual noise of Y_TE(k, τ) according to

    Ŝ(k, τ) = M(k, τ) Y_TE(k, τ)    (3)

where M(k, τ) denotes the weight of the soft mask. The output Ŝ(k, τ) is transformed back to the time domain to obtain ŝ(t).

Fig. 1 Overall system using described algorithm

The soft mask is based primarily on the local dominance of the target speech, which is measured by the SNR at a t–f segment as

    SNR(k, τ) = |Y_TE(k, τ)| / (|Y_TR(k, τ)| + ε)    (4)

where ε is a small positive number to avoid division by zero. Although target speech can be further separated by segregating the t–f components dominated by target speech, the abrupt discontinuities [10] and oversimplification [5] of the segregated mixtures may be mitigated by introducing a sigmoid function with the SNR as its argument.

In addition to the outputs, spatial filtering estimates filters whose directivity patterns may provide spatial selectivity. The directivity pattern value of the target-extracting filter at the kth frequency bin for a unit-norm directional vector q can be obtained by

    D_TE(k, q) = Σ_{i=1}^{N} W_TE^i(k) exp[−j ω_k (p_i − p_R)^T q / c]    (5)

where W_TE^i(k) is the ith element of W_TE(k), and ω_k is the frequency corresponding to the kth bin. p_i and p_R denote vectors representing the locations of the ith and reference sensors, respectively, and c is the speed of sound. Therefore, the spatial selectivity between the target and dominant noise directions can be measured by the ratio

    R(k) = |D_TE(k, q_T)| / |D_TE(k, q_N)|    (6)

where q_T and q_N denote the unit-norm directional vectors corresponding to the target and the dominant noise, respectively. These vectors may be set by prior knowledge, which is particularly required for some spatial filtering algorithms such as conventional beamforming. In other spatial filtering algorithms such as ICA, they can be estimated by detecting the directions corresponding to the minima of the directivity patterns of the estimated filters, because nulls are formed towards the 'jammer' directions [10]. A large value of R(k) means that Y_TE(k, τ) retains the target speech components of the mixtures with the noise components sufficiently removed. Therefore, Y_TE(k, τ) is sufficiently good to be