
Speech enhancement based on soft masking exploiting both output SNR and selectivity of spatial filtering

Biho Kim, Yunil Hwang and Hyung-Min Park

A speech enhancement method is presented which applies a soft mask to the target speech output of spatial filtering, such as conventional beamforming or independent component analysis (ICA). In contrast to conventional methods that use either the outputs or the filters estimated by spatial filtering, the mask is constructed by exploiting both the local output signal-to-noise ratio (SNR) and the spatial selectivity obtained from the directivity pattern of the estimated filters. Experiments were conducted with both ICA and minimum power distortionless response (MPDR) beamforming as the spatial filtering in order to demonstrate that the described mask estimation is not tuned to a particular preprocessing algorithm. The results, in terms of both SNR with a retained speech ratio (RSR) and word accuracy in speech recognition, show that the described method can effectively suppress residual noise in the target speech output of spatial filtering.
Introduction: Speech enhancement remains a very important issue because speech is prone to contamination by noise in practical situations. Many techniques have been developed to overcome this problem, and they can be categorised into spatial filtering and masking algorithms. Spatial filtering such as conventional beamforming [1] or independent component analysis (ICA) [2] typically attempts to find a linear combination of multiple observations such that the combination contains the maximum contribution from the target speech. The prevalent ICA algorithms require more sensors than sources, whereas conventional beamforming may improve performance as more sensors become available. Thus, more sensors seem advantageous, but they result in heavy computational loads. Processing is usually performed in the frequency domain for efficiency, but performance is limited because a long frame size is required to cover long acoustic reverberation, whereas the amount of learning data in each frequency bin decreases as the frame size increases [3]. In addition, the algorithms suffer from performance degradation due to phase ambiguity [1]. If the sensors are not far from each other, the low-frequency components have negligible phase differences, which results in insufficient spatial selectivity between the target and interference directions. On the other hand, the directivity pattern can form nulls in both the target and interference directions for some high-frequency bins because of spatial aliasing, even though the directions are spatially separated. When the distances between sensors are larger, more frequency bins suffer from this problem.

In contrast, masking algorithms apply estimated mask weights to the mixtures, typically in the time–frequency (t–f) domain; thus, desired outputs with negligible residual noise can be obtained regardless of the number or configuration of sensors or sources if ideal mask weights are estimated. Many contemporary algorithms have used binary masks that specify which t–f components belong to a particular source (e.g. [4]). However, a binary mask is clearly an oversimplified description of how sound sources are combined, because each sound source contributes to the mixture to varying extents [5]. This oversimplification may be mitigated by introducing soft masks. Unfortunately, constructing a reliable mask from the mixtures has been one of the primary problems [6]. Instead of estimating a mask directly from the mixtures, spatial filtering can be employed as preprocessing to provide suboptimal target and noise estimates or estimated spatial filters, which may help determine the mask weights. In this case, the mask can be applied to a target speech output already enhanced by spatial filtering in order to suppress the residual noise in that output, so it can provide better performance than masking the mixtures directly. Many algorithms have been described to construct masks, but most of them have considered either the target and noise estimates or the estimated spatial filters [6–9].

Target and noise estimates provide the local dominance of the target speech at a specific t–f segment, whereas spatial selectivity in a frequency bin can be obtained from the estimated spatial filters. Thus, they may provide complementary information for obtaining a better mask. In this Letter, we develop a soft-mask estimation method that exploits both the local output signal-to-noise ratio (SNR) of spatial filtering and the spatial selectivity obtained from the directivity pattern of the estimated filters. Target speech enhancement is accomplished by employing the estimated soft mask to suppress the residual noise of the target speech output obtained by spatial filtering. In addition, both ICA and minimum power distortionless response (MPDR) beamforming are employed as the spatial filtering to demonstrate that the described method is not tuned to a specific preprocessing algorithm. We will show that the method provides robust and superior performance in terms of both SNR with a retained speech ratio (RSR) and word accuracy.

Algorithm description: Fig. 1 shows the overall system using the presented algorithm. Let us consider multiple sensors observing input signals denoted as {x_i(t), i = 1, …, N}. Employing frequency-domain spatial filtering, the outputs of the target-extracting and target-rejecting filters, denoted by Y_TE(k, τ) and Y_TR(k, τ), respectively, are expressed as

$Y_{\mathrm{TE}}(k, \tau) = \mathbf{W}_{\mathrm{TE}}(k)\,[X_1(k, \tau), \ldots, X_N(k, \tau)]^{\mathrm{T}}$   (1)

$Y_{\mathrm{TR}}(k, \tau) = \mathbf{W}_{\mathrm{TR}}(k)\,[X_1(k, \tau), \ldots, X_N(k, \tau)]^{\mathrm{T}}$   (2)

where X_i(k, τ) is the t–f representation at frequency-bin index k and frame index τ obtained by a short-time Fourier transform (STFT) of the ith input x_i(t). Moreover, W_TE(k) and W_TR(k) denote vectors containing the coefficients of the target-extracting and target-rejecting filters estimated by spatial filtering at the kth frequency bin. Conventional beamforming or ICA can be used as the spatial filtering algorithm to estimate these filters. A soft mask is applied to suppress the residual noise of Y_TE(k, τ) according to

$\hat{S}(k, \tau) = M(k, \tau)\, Y_{\mathrm{TE}}(k, \tau)$   (3)

where M(k, τ) denotes the weight of the soft mask. The output Ŝ(k, τ) is transformed back to the time domain to obtain ŝ(t).

[Fig. 1: block diagram. The N inputs x_i(t) pass through STFTs to give X_i(k, τ); the target-extracting and target-rejecting spatial filters W_TE(k) and W_TR(k) produce Y_TE(k, τ) and Y_TR(k, τ); soft-mask construction yields M(k, τ), which multiplies Y_TE(k, τ) to give Ŝ(k, τ); an inverse STFT yields ŝ(t).]
Fig. 1 Overall system using the described algorithm
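To make the processing chain concrete, the following is a minimal Python sketch of (1)-(3), assuming the per-bin filter vectors W_TE(k) and W_TR(k) have already been estimated by some spatial filtering algorithm; the SciPy STFT pair, the function names and the parameter values are illustrative choices, not the authors' implementation.

    # Minimal sketch of (1)-(3): per-bin spatial filtering in the STFT
    # domain followed by soft masking of the target-extracting output.
    import numpy as np
    from scipy.signal import stft, istft

    def enhance(x, w_te, w_tr, mask_fn, fs=16000, nperseg=512):
        """x: (N, T) multichannel time signals.
        w_te, w_tr: (K, N) complex filter coefficients per frequency bin.
        mask_fn: callable building M(k, tau) from Y_TE and Y_TR,
        e.g. via (4)-(8) below."""
        # STFT of every channel: X has shape (N, K, frames)
        _, _, X = stft(x, fs=fs, nperseg=nperseg)
        # (1), (2): inner product W(k) [X_1(k,tau), ..., X_N(k,tau)]^T per bin
        Y_te = np.einsum('kn,nkt->kt', w_te, X)
        Y_tr = np.einsum('kn,nkt->kt', w_tr, X)
        # (3): apply the soft mask to the target-extracting output
        S = mask_fn(Y_te, Y_tr) * Y_te
        # inverse STFT back to the time domain
        _, s_hat = istft(S, fs=fs, nperseg=nperseg)
        return s_hat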
The soft mask is fundamentally based on the local dominance of the target speech, which is measured by the SNR at a t–f segment as

$\mathrm{SNR}(k, \tau) = \dfrac{|Y_{\mathrm{TE}}(k, \tau)|}{|Y_{\mathrm{TR}}(k, \tau)| + \varepsilon}$   (4)

where ε is a small positive number to avoid division by zero. Although the target speech can be further separated by segregating the t–f components it dominates, abrupt discontinuities [10] and oversimplification [5] of the segregated mixtures may be mitigated by introducing a sigmoid function with the SNR as its argument.

In addition to the outputs, spatial filtering estimates filters whose directivity pattern may provide spatial selectivity. The directivity pattern value of the target-extracting filter at the kth frequency bin for a unit-norm directional vector q can be obtained as

$D_{\mathrm{TE}}(k, \mathbf{q}) = \sum_{i=1}^{N} W_{\mathrm{TE}}^{i}(k)\, \exp[-j\omega_k (\mathbf{p}_i - \mathbf{p}_R)^{\mathrm{T}} \mathbf{q}/c]$   (5)

where W_TE^i(k) is the ith element of W_TE(k), and ω_k is the frequency corresponding to the kth bin. p_i and p_R denote vectors representing the locations of the ith and reference sensors, respectively, and c is the speed of sound. Therefore, the spatial selectivity between the target and dominant-noise directions can be measured by the ratio

$R(k) = \dfrac{|D_{\mathrm{TE}}(k, \mathbf{q}_T)|}{|D_{\mathrm{TE}}(k, \mathbf{q}_N)|}$   (6)

where q_T and q_N denote the unit-norm directional vectors corresponding to the target and the dominant noise, respectively. These vectors may be set from prior knowledge, which is particularly required for some spatial filtering algorithms such as conventional beamforming. In other spatial filtering algorithms such as ICA, they can be estimated by detecting the directions corresponding to the minima of the directivity patterns of the estimated filters, because nulls are formed towards the 'jammer' directions [10]. A large value of R(k) means that Y_TE(k, τ) retains the target speech components of the mixtures with the noise components sufficiently removed; therefore, Y_TE(k, τ) is good enough to be preserved in the masking. If the ratio is low, the spatial selectivity is insufficient, and further suppression of the residual noise in Y_TE(k, τ) is needed for better enhancement of the target speech.
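The directivity evaluation in (5) and the selectivity ratio in (6) reduce to a few array operations. Below is a sketch assuming known sensor positions and far-field plane-wave propagation; the helper names and the numerical guard against a perfect null are illustrative additions.

    # Sketch of (5)-(6): directivity pattern of the target-extracting
    # filter and the selectivity ratio R(k) between two look directions.
    import numpy as np

    def directivity(w_te, p, p_ref, q, freqs, c=343.0):
        """w_te: (K, N) filter coefficients, p: (N, 3) sensor positions,
        p_ref: (3,) reference sensor position, q: (3,) unit-norm direction,
        freqs: (K,) bin centre frequencies in Hz. Returns D_TE(k, q), (K,)."""
        omega = 2 * np.pi * freqs                    # omega_k per bin
        delays = (p - p_ref) @ q / c                 # (p_i - p_R)^T q / c
        # (5): sum_i W_TE^i(k) exp(-j omega_k delay_i)
        return np.sum(w_te * np.exp(-1j * np.outer(omega, delays)), axis=1)

    def selectivity_ratio(w_te, p, p_ref, q_t, q_n, freqs):
        # (6): pattern magnitude towards the target over that towards noise
        d_t = np.abs(directivity(w_te, p, p_ref, q_t, freqs))
        d_n = np.abs(directivity(w_te, p, p_ref, q_n, freqs))
        return d_t / np.maximum(d_n, 1e-12)          # guard against a perfect null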



Combining the local output SNR, SNR(k, τ), and the spatial-selectivity ratio, R(k), we obtain a soft mask given by

$M(k, \tau) = \dfrac{1}{1 + F_R(\tau)\, \exp[-\alpha(\log R(k) + \beta) \times \log \mathrm{SNR}(k, \tau)]}$   (7)

where α and β are parameters determining the slope of the sigmoid function and the bias on the logarithm of R(k), respectively. F_R(τ) denotes an inverse SNR at a frame, expressed as

$F_R(\tau) = \dfrac{\sum_k |Y_{\mathrm{TR}}(k, \tau)|}{\sum_k |Y_{\mathrm{TE}}(k, \tau)|}$   (8)

Since many natural signals, including speech, exhibit dependency across frequency, the dominance of the target or the noise among the t–f components at a frame shows a similar tendency across bins. Therefore, F_R(τ), which takes the information in the other frequency bins at a frame into account, is used as a supplementary cue to boost or weaken the extent of the suppression of residual noise in Y_TE(k, τ) determined by SNR(k, τ) and R(k).
Experimental results: The developed algorithm was compared with several conventional methods [6–9] in terms of the SNR with the RSR for masking and of the word accuracy in speech recognition experiments. The SNR gain and the RSR are defined as

$\mathrm{SNR\ gain\ (dB)} = 10 \log_{10}\!\left(\dfrac{\sum_{(k,\tau)} |\hat{S}^{(T)}(k, \tau)|^2}{\sum_{(k,\tau)} |\hat{S}(k, \tau) - \hat{S}^{(T)}(k, \tau)|^2}\right) - \mathrm{Input\ SNR\ (dB)}$   (9)

$\mathrm{RSR} = \dfrac{\sum_{(k,\tau)} |\hat{S}^{(T)}(k, \tau)|}{\sum_{(k,\tau)} |Y_{\mathrm{TE}}^{(T)}(k, \tau)|}$   (10)

where Ŝ^(T)(k, τ) and Y_TE^(T)(k, τ) are the true target speech components in Ŝ(k, τ) and Y_TE(k, τ), respectively.
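The two objective measures in (9) and (10) are straightforward to compute when the true target components are available, as they are for digitally mixed test data; this sketch assumes oracle access to those components.

    # Sketch of (9)-(10): SNR gain and retained speech ratio, given the
    # oracle target components S_true and Y_te_true of the mixtures.
    import numpy as np

    def snr_gain_db(S_hat, S_true, input_snr_db):
        # (9): output SNR of the masked estimate minus the input SNR
        signal = np.sum(np.abs(S_true) ** 2)
        error = np.sum(np.abs(S_hat - S_true) ** 2)
        return 10.0 * np.log10(signal / error) - input_snr_db

    def retained_speech_ratio(S_true, Y_te_true):
        # (10): fraction of the true target magnitude surviving the mask
        return np.sum(np.abs(S_true)) / np.sum(np.abs(Y_te_true))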
For the speech recognition experiments, fully continuous hidden Markov models (HMMs) were trained with the HMM toolkit (HTK) on 3990 training sentences and evaluated on 300 test sentences from the DARPA Resource Management database. The speech features were 13th-order mel-frequency cepstral coefficients with the corresponding delta and acceleration coefficients, and the cepstral coefficients were derived from 24 mel-frequency bands with a frame size of 25 ms and a frame rate of 10 ms. Considering a car environment, we placed two microphones 4 cm apart at the front centre of the ceiling and recorded the training and clean test data at a sampling rate of 16 kHz while a loudspeaker located at the driver's seat played the sentences. In addition, another 300 sentences played by a loudspeaker at the passenger's seat were recorded in order to corrupt the test data by digital addition after scaling to the desired input SNRs. The distance between a loudspeaker and the centre of the microphone pair was ∼50 cm, and each loudspeaker was placed at an angle of 30° from the plane perpendicular to the virtual line passing through the two microphones. The corrupted test data were also used to evaluate the SNR with the RSR. For the spatial filtering, independent vector analysis (IVA) [11] and MPDR beamforming [1] were employed. For each spatial filtering algorithm and each masking method, the parameters were set to the values that provided the best performance overall across the experimental cases.
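The authors used the HTK front-end; purely for illustration, an approximately equivalent feature configuration (13 MFCCs from 24 mel bands, 25 ms frames at a 10 ms rate, with delta and acceleration coefficients) could be reproduced in Python with librosa, although the numerical details will differ from HTK's.

    # Illustrative reconstruction of the recogniser front-end with librosa
    # (an assumption; the paper's features were extracted with HTK).
    import numpy as np
    import librosa

    def features(y, sr=16000):
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr, n_mfcc=13, n_mels=24,
            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
        delta = librosa.feature.delta(mfcc)            # first-order dynamics
        accel = librosa.feature.delta(mfcc, order=2)   # second-order dynamics
        return np.vstack([mfcc, delta, accel])         # (39, frames)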
Tables 1 and 2 summarise the SNR gains with the RSRs and the word accuracies averaged over the test data. The method by Sawada et al. [7] showed good performance with IVA as the spatial filtering but poor performance with MPDR beamforming. As that method only utilises the estimated spatial filters, its performance depends directly on the accuracy of the filters, which is a disadvantage when the spatial filtering of MPDR beamforming is unsuccessful. Instead of the estimated filters, the other three compared methods consider the target and noise estimates of spatial filtering. The method by Jeong et al. [6] uses a predetermined slope in a sigmoidal mask function even though the spatial selectivity in a frequency bin depends on the spatial filtering used; therefore, it did not yield robust performance. In the method by Kolossa et al. [8], the consecutive estimates used to obtain the residual noise in the target speech output may not be reliable if any one estimate is inaccurate; thus, that method showed lower RSR and word accuracy, especially for the spatial filtering of MPDR beamforming. Although the method by Toroghi et al. [9] provided the best performance among the compared conventional methods, it requires an exhaustive search for the optimal slope of a sigmoidal mask function for each utterance, which results in high computational complexity. Above all, the performance of the presented method was generally superior to that of the other methods because it exploits both the target and noise estimates and the estimated filters of the spatial filtering.

Table 1: SNR gains (dB) averaged over the test data; numbers in parentheses indicate RSRs (%)

Method                   Input SNR 5 dB    10 dB           15 dB           20 dB
IVA (no masking)         9.05              9.50            7.09            3.36
Sawada et al. [7]        11.79 (91.06)     13.98 (89.23)   12.36 (86.77)   8.50 (84.45)
Jeong et al. [6]         11.93 (74.78)     12.64 (78.44)   9.97 (79.90)    5.94 (79.55)
Kolossa et al. [8]       12.28 (82.96)     13.46 (83.63)   10.93 (82.80)   6.78 (80.94)
Toroghi et al. [9]       12.70 (81.83)     13.87 (79.08)   11.17 (74.85)   7.05 (72.84)
Presented method         14.33 (89.69)     14.33 (90.90)   10.85 (87.98)   6.16 (82.68)
MPDR (no masking)        2.47              1.36            0.77            0.52
Sawada et al. [7]        3.28 (58.28)      0.81 (50.77)    −0.86 (46.27)   −1.73 (44.69)
Jeong et al. [6]         4.99 (97.37)      2.85 (99.26)    1.68 (99.78)    1.11 (99.91)
Kolossa et al. [8]       10.91 (42.61)     9.35 (41.71)    7.88 (40.39)    6.60 (39.14)
Toroghi et al. [9]       10.00 (78.70)     8.50 (80.51)    7.12 (81.30)    5.96 (81.52)
Presented method         10.19 (76.84)     8.62 (78.38)    7.14 (79.12)    5.92 (79.42)

Table 2: Word accuracy (%) averaged over the test data

Method                   Input SNR 5 dB    10 dB    15 dB    20 dB
No processing            −5.33             29.90    59.13    77.22
IVA (no masking)         68.06             86.88    90.16    90.08
Sawada et al. [7]        71.10             88.09    91.60    92.00
Jeong et al. [6]         72.40             88.44    91.41    91.25
Kolossa et al. [8]       70.47             88.21    91.68    91.72
Toroghi et al. [9]       72.81             88.60    92.00    91.96
Presented method         75.46             88.87    92.31    91.76
MPDR (no masking)        15.54             46.43    69.50    82.45
Sawada et al. [7]        19.82             36.47    59.65    74.14
Jeong et al. [6]         39.17             60.94    76.39    85.36
Kolossa et al. [8]       44.44             61.46    70.44    75.13
Toroghi et al. [9]       53.34             66.93    76.10    80.16
Presented method         62.36             78.41    85.20    87.74

Conclusion: In this Letter, we have described a speech enhancement method that suppresses the residual noise in the target speech output of spatial filtering by constructing a soft mask exploiting not only the local output SNR but also the spatial selectivity obtained from the directivity pattern of the estimated filters. For both of the spatial filtering algorithms, IVA and MPDR beamforming, the described method showed more robust and better performance than the conventional methods in terms of the SNR with the RSR and the word accuracy.

Acknowledgment: This work was supported by the Hyundai Motor Group.

© The Institution of Engineering and Technology 2014
16 February 2014
doi: 10.1049/el.2014.0416

Biho Kim and Hyung-Min Park (Sogang University, Seoul, Republic of Korea)
E-mail: [email protected]
Yunil Hwang (Hyundai Motor Group R&D Division, Gyeonggi-do, Republic of Korea)



speech’. Int. Conf. on Acoustics, Speech and Singal Processing 8 Kolossa, D., Astudillo, R.F., Hoffmann, E., et al.: ‘Independent com-
(ICASSP), Salt Lake City, UT, USA, May 2001, pp. 2737–2740 ponent analysis and time-frequency masking for speech recognition in
4 Roman, N., Wang, D., and Brown, G.J.: ‘Speech segregation based on multitalker conditions’, EURASIP J. Audio, Speech, Music Process.,
sound localization’, J. Acoust. Soc. Am., 2003, 114, pp. 2236–2252 2010, 2010, pp. 1–13
5 Park, H.-M., and Stern, R.M.: ‘Spatial separation of speech signals 9 Toroghi, R.M., Faubel, F., and Klakow, D.: ‘Multi-channel speech sep-
using amplitude estimation based on interaural comparisons of zero aration with soft time-frequency masking’. SAPA-SCALE Conf.,
crossings’, Speech Commun., 2009, 51, pp. 15–25 Portland, OR, USA, September 2012
6 Jeong, S.-Y., Jeong, J.-H., and Oh, K.-C.: ‘Dominant speech enhance- 10 Araki, S., Mukai, R., Makino, S., et al.: ‘The fundamental limitation of
ment based on SNR-adaptive soft mask filtering’. Int. Conf. on frequency domain blind source separation for convolutive mixtures of
Acoustics, Speech and Singal Processing (ICASSP), Taipei, Taiwan, speech’, IEEE Trans. Speech Audio Process., 2003, 11, pp. 109–116
April 2009, pp. 1317–1320 11 Kim, T., Attias, H., Lee, S.-Y., et al.: ‘Blind source separation exploit-
7 Sawada, H., Araki, S., Mukai, R., et al.: ‘Blind extraction of dominant ing higher-order frequency dependencies’, IEEE Trans. Audio, Speech,
target sources using ICA and time-frequency masking’, IEEE Trans. Lang. Process., 2007, 15, pp. 70–79
Audio, Speech, Lang. Process., 2006, 14, pp. 2165–2173
