
2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 16-19, 2011, New Paltz, NY

NOISE POWER ESTIMATION BASED ON THE PROBABILITY OF SPEECH PRESENCE

Timo Gerkmann∗
Sound and Image Processing Lab.
KTH Royal Institute of Technology
100 44 Stockholm, Sweden
[email protected]

Richard C. Hendriks†
Signal and Information Processing Lab.
Delft University of Technology
2628 CD Delft, The Netherlands
[email protected]

∗ The research leading to these results has received funding from the European Community's Seventh Framework Programme under Grant Agreement PIAP-GA-2008-214699.
† The research is supported by the Dutch Technology Foundation STW.
978-1-4577-0693-6/11/$26.00 © 2011 IEEE

ABSTRACT

In this paper, we analyze the minimum mean square error (MMSE) based spectral noise power estimator [1] and present an improvement. We will show that the MMSE based spectral noise power estimate is only updated when the a posteriori signal-to-noise ratio (SNR) is lower than one. This threshold on the a posteriori SNR can be interpreted as a voice activity detector (VAD).

We propose in this work to replace the hard decision of the VAD by a soft speech presence probability (SPP). We show that by doing so, the proposed estimator does not require a bias correction and safety-net as is required by the MMSE estimator presented in [1]. At the same time, the proposed estimator maintains the quick noise tracking capability which is characteristic for the MMSE noise tracker, results in less noise power overestimation and is computationally less expensive.

Index Terms: Noise power estimation, speech enhancement, noise reduction.

1. INTRODUCTION

Portable digital communication devices, such as hearing aids or mobile telephones, are often used in noisy environments. The noise signal that corrupts the target speech signal can be locally quite nonstationary. Nonstationary noise corruptions can be caused for example by passing cars when communicating while walking along the street, or babble noise while in the cafeteria or at a party. Speech enhancement algorithms aim at reducing the additive noise while keeping the target speech signal unaffected. One of the most important parameters of speech enhancement algorithms is the spectral noise power. The spectral noise power can be estimated whenever we know that speech is absent. However, in nonstationary noise scenarios the estimation of the noise power is particularly difficult as it may change rapidly over time. Then, the estimated noise power has to be updated as often as possible, requiring a robust voice activity detector (VAD). However, deciding whether speech is present or absent is more difficult the more nonstationary the noise source is, as a sudden rise in the noise power may be misinterpreted as a speech onset.

Several approaches have been proposed for the estimation of the noise power. Among the most established estimators are those based on minimum statistics [2], [3]. For instance, in [2] the power of the noisy signal is estimated and observed over a time-span of about 1-3 seconds. The spectral noise power is then inferred from the minimum of the estimated power of the noisy signal, assuming that speech is absent at least for a short duration within the observed time-span. However, if the noise power rises within the observed time-span, the noise power will be underestimated. While in [2] mechanisms are proposed that allow for a tracking of rising noise powers within the observed time-span, rising noise powers, as caused e.g. by passing cars, are usually tracked with a rather large delay. The local underestimation of the noise power is likely to result in annoying artifacts, so-called musical noise, when the noise power estimate is applied in a speech enhancement framework.

More recent spectral noise power estimators allow for a quicker tracking of the noise spectral power, e.g. the subspace-DFT approach [4], or minimum mean square error (MMSE) based approaches [5], [1]. While subspace based approaches are computationally rather demanding, the MMSE based algorithm [1] is computationally much less demanding and at the same time robust to increasing noise levels [6]. In the MMSE based estimator [1], first a limited maximum likelihood (ML) estimate of the a priori signal-to-noise ratio (SNR) is used to estimate the periodogram of the noise signal. However, this simple estimate results in a bias, which is then compensated based on a second estimate of the a priori SNR. In this work, we analyze the MMSE based estimator presented in [1] and present an improvement that makes the bias compensation unnecessary.

This work is organized as follows: after explaining the notations and assumptions in Section 2, we show in Section 3 that the MMSE based noise power estimator of [1] can be interpreted as a VAD based noise power estimator, where the noise power estimate is only updated if the a posteriori SNR is smaller than one. Then, in Section 4 we propose to replace the VAD of [1] by a soft speech presence probability (SPP), without the need of applying a bias compensation. In Section 5 we show that the proposed estimator results in a similar noise tracking performance as the estimator in [1], while being computationally and memory-wise more efficient.

2. SIGNAL MODEL

We assume the speech and noise signals to be additive in the short-time Fourier domain. The complex spectral noisy observation is thus given by $Y_k(l) = S_k(l) + N_k(l)$, where $k$ is the frequency index, $l$ is the segment index, $S_k$ are the complex spectral speech coefficients and $N_k$ are the complex spectral noise coefficients. For each $k$, the spectral speech and noise power are defined as $\sigma_{S,k}^2(l) = \mathrm{E}\left(|S_k(l)|^2\right)$ and $\sigma_{N,k}^2(l) = \mathrm{E}\left(|N_k(l)|^2\right)$, respectively. In the sequel, we omit the time and frequency index wherever possible. We define the a posteriori SNR as $\gamma = |Y|^2/\sigma_N^2$ and the a priori SNR as $\xi = \sigma_S^2/\sigma_N^2$. We assume that the speech and noise signals are uncorrelated and have zero mean so that $\mathrm{E}\left(|Y|^2\right) = \sigma_S^2 + \sigma_N^2$. In addition, we assume that the real and imaginary parts of the noise and speech spectral coefficients are independent and Gaussian distributed. Furthermore, estimated quantities are denoted by a hat symbol, e.g. $\hat{\xi}_k$ is the estimate of $\xi_k$.

3. REVIEW OF MMSE BASED NOISE POWER ESTIMATION

In [5], [1] it is proposed to estimate the spectral noise power from an MMSE estimate of the noise periodogram. Given an estimate $\hat{\xi}$ of the a priori SNR and an estimate $\hat{\sigma}_N^2$ of the noise power, the estimate of the noise periodogram is obtained as

$$|\hat{N}|^2 = \mathrm{E}\left(|N|^2 \mid Y\right) = \left(\frac{1}{1+\hat{\xi}}\right)^2 |Y|^2 + \frac{\hat{\xi}}{1+\hat{\xi}}\, \hat{\sigma}_N^2. \tag{1}$$

Assuming that the spectral noise power does not change abruptly from one signal segment to the other, it is reasonable to employ the noise power estimate of the previous frame, $\hat{\sigma}_N^2 = \hat{\sigma}_N^2(l-1)$, in (1). However, as the speech signal may change quickly over time, finding an appropriate estimate for the a priori SNR in (1) is rather difficult. In [1] a limited ML estimate is employed as detailed in Section 3.1.

After estimating the noise periodogram via (1), the noise power spectral density is updated by a recursive smoothing, as

$$\hat{\sigma}_N^2(l) = \alpha\, \hat{\sigma}_N^2(l-1) + (1-\alpha)\, |\hat{N}(l)|^2, \tag{2}$$

where, as in [1], we choose $\alpha = 0.8$.

Taking the expectation of (1) with respect to $Y$, we obtain

$$\mathrm{E}_Y\left(\mathrm{E}\left(|N|^2 \mid Y, \hat{\sigma}_N^2, \hat{\sigma}_S^2\right)\right) = \left(\frac{\hat{\sigma}_N^2}{\hat{\sigma}_S^2 + \hat{\sigma}_N^2}\right)^2 \left(\sigma_S^2 + \sigma_N^2\right) + \frac{\hat{\sigma}_S^2}{\hat{\sigma}_S^2 + \hat{\sigma}_N^2}\, \hat{\sigma}_N^2, \tag{3}$$

where we now explicitly state that the estimator requires knowing $\hat{\sigma}_N^2$ and $\hat{\sigma}_S^2$. From (3) it follows that if $\hat{\sigma}_S^2 = \sigma_S^2$ and $\hat{\sigma}_N^2 = \sigma_N^2$, (1) is unbiased and we have $\mathrm{E}_Y\left(\mathrm{E}\left(|N|^2 \mid Y, \sigma_N^2, \sigma_S^2\right)\right) = \sigma_N^2$. On the other hand, if $\hat{\sigma}_S^2 \neq \sigma_S^2$ and/or $\hat{\sigma}_N^2 \neq \sigma_N^2$ the estimator is biased, and we have $\mathrm{E}_Y\left(\mathrm{E}\left(|N|^2 \mid Y, \hat{\sigma}_N^2, \hat{\sigma}_S^2\right)\right) \neq \sigma_N^2$. However, to compensate for the bias, again the true noise and speech spectral power are required.

3.1. Interpretation as a voice activity detector

In [1] it is proposed to employ a limited ML estimate of the a priori SNR in (1). In this section we show that, as a consequence, the MMSE estimate of the noise periodogram is only updated when the a posteriori SNR is smaller than 1. This thresholding of the a posteriori SNR can be interpreted as a VAD; the spectral noise power is only updated when speech is absent according to the a posteriori SNR.

In the way we wrote (1), it can be seen that the MMSE solution results in a weighted sum of the noisy observation and the previous estimate of the spectral noise power $\hat{\sigma}_N^2$. The weights are a function of the a priori SNR and gradually take values between zero and one, i.e., a soft decision between $|Y|^2$ and $\hat{\sigma}_N^2$. However, in [1] a limited ML estimate of the a priori SNR is employed, which is obtained as

$$\hat{\xi} = \max\left(0, \hat{\xi}_{\mathrm{ml}}\right) = \max\left(0, \hat{\gamma} - 1\right), \tag{4}$$

where $\hat{\gamma}(l) = |Y(l)|^2/\hat{\sigma}_N^2(l-1)$. Substituting (4) and $\hat{\sigma}_N^2 = \hat{\sigma}_N^2(l-1)$ into (1) we see that the MMSE estimator can be seen as a VAD based detector, as

$$|\hat{N}(l)|^2 = \mathrm{E}\left(|N(l)|^2 \mid Y(l)\right) = \begin{cases} \hat{\sigma}_N^2(l-1), & \text{if } \hat{\gamma}(l) \geq 1 \\ |Y(l)|^2, & \text{if } \hat{\gamma}(l) < 1. \end{cases} \tag{5}$$

Indeed, for $\hat{\gamma} \geq 1$ we have $1 + \hat{\xi} = \hat{\gamma}$, so that (1) gives $|Y|^2/\hat{\gamma}^2 + (1 - 1/\hat{\gamma})\, \hat{\sigma}_N^2 = \hat{\sigma}_N^2$, while for $\hat{\gamma} < 1$ we have $\hat{\xi} = 0$ and (1) reduces to $|Y|^2$. Using the a priori SNR estimator from (4) we thus have a hard instead of a soft decision between the noisy observation and the estimate of the spectral noise power $\hat{\sigma}_N^2(l-1)$.

In [1] the bias is derived for the case that the limited ML estimate of (4) is employed. However, the resulting bias again depends on the true and unknown a priori SNR. In [1] this a priori SNR is estimated using the decision-directed approach [7].

3.2. Safety-net

In addition to the bias compensation, in [1] a so-called safety-net is employed to prevent the spectral noise power tracker from stalling when the noise level makes an abrupt step from one segment to the next. In this safety-net, the last 0.8 seconds of the noisy speech periodogram, i.e. 50 signal segments $|Y(l)|^2$, are stored. The final estimate of the spectral noise power is obtained by comparing the current noise power estimate to the minimum of the last 0.8 seconds of $|Y(l)|^2$, as

$$\hat{\sigma}_N^2 \leftarrow \max\left(\hat{\sigma}_N^2,\ \min\left(|Y(l-49)|^2, \ldots, |Y(l)|^2\right)\right). \tag{6}$$

4. PROPOSED APPROACH: SPP INSTEAD OF VAD

Instead of first using a limited ML estimate for the a priori SNR that results in the VAD behavior explained by (5), we argue in this paper that neither a bias compensation nor the safety-net of Section 3.2 is necessary if the hard decision of the VAD (5) is replaced by a soft decision by means of the probability of speech presence.

Under speech presence uncertainty, an MMSE estimator for the noise periodogram is given by

$$\mathrm{E}\left(|N|^2 \mid Y\right) = P(H_0 \mid Y)\, \mathrm{E}\left(|N|^2 \mid Y, H_0\right) + P(H_1 \mid Y)\, \mathrm{E}\left(|N|^2 \mid Y, H_1\right), \tag{7}$$

where $H_0$ indicates speech absence, while $H_1$ indicates speech presence.

4.1. Estimation of the speech presence probability

As for the derivation of (1), we assume that the real and imaginary parts of the speech and noise spectral coefficients are Gaussian distributed. With Bayes' theorem, assuming uniform priors $P(H_0) = P(H_1)$, the probability of speech presence follows as, e.g. [8],

$$P(H_1 \mid Y) = \left(1 + \left(1 + \xi_{\mathrm{opt}}\right) \exp\left(-\frac{|Y|^2}{\hat{\sigma}_N^2}\, \frac{\xi_{\mathrm{opt}}}{1 + \xi_{\mathrm{opt}}}\right)\right)^{-1}. \tag{8}$$

While in (1) $\hat{\xi}$ is the local SNR, in (8) the a priori SNR $\xi_{\mathrm{opt}}$ reflects the SNR that is typical if speech were present [9]. In the radar or communication context, one would choose $\xi_{\mathrm{opt}}$ in order to guarantee a specified performance in terms of false alarms or missed detections [10]. Similarly, we find the fixed optimal a priori SNR $10 \log_{10}(\xi_{\mathrm{opt}}) = 15$ dB by minimizing the total probability of error when the true a priori SNR lies between $-\infty$ and 20 dB, as detailed in [9].
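To make the contrast between the hard decision of (4)-(5) and the soft decision of (8) concrete, the following minimal Python sketch evaluates both for a few values of $|Y|^2$. The function names, the NumPy dependency and the example values are assumptions made for this illustration only; they are not taken from [1] or from the authors' released code.

```python
import numpy as np

# Fixed a priori SNR of Section 4.1: 10*log10(xi_opt) = 15 dB.
XI_OPT = 10.0 ** (15.0 / 10.0)


def limited_ml_periodogram(noisy_periodogram, noise_psd_prev):
    """Noise periodogram from (1) using the limited ML a priori SNR of (4).

    Hypothetical helper: by (5) it returns the previous noise power when
    gamma_hat >= 1 and |Y|^2 otherwise.
    """
    y2 = np.asarray(noisy_periodogram, dtype=float)
    sigma2 = np.asarray(noise_psd_prev, dtype=float)
    xi_hat = np.maximum(0.0, y2 / sigma2 - 1.0)  # (4)
    return y2 / (1.0 + xi_hat) ** 2 + xi_hat / (1.0 + xi_hat) * sigma2  # (1)


def speech_presence_probability(noisy_periodogram, noise_psd_prev, xi_opt=XI_OPT):
    """A posteriori SPP of (8), assuming uniform priors P(H0) = P(H1)."""
    gamma = np.asarray(noisy_periodogram, dtype=float) / np.asarray(
        noise_psd_prev, dtype=float
    )
    return 1.0 / (1.0 + (1.0 + xi_opt) * np.exp(-gamma * xi_opt / (1.0 + xi_opt)))


if __name__ == "__main__":
    noise_psd_prev = 1.0  # illustrative previous noise power estimate
    for y2 in (0.5, 1.0, 2.0, 8.0):
        hard = float(limited_ml_periodogram(y2, noise_psd_prev))
        spp = float(speech_presence_probability(y2, noise_psd_prev))
        print(f"|Y|^2 = {y2:4.1f}  hard-decision estimate = {hard:4.2f}  P(H1|Y) = {spp:.3f}")
```

The hard-decision estimate switches abruptly at $\hat{\gamma} = 1$, whereas the SPP rises smoothly with the a posteriori SNR, which is the property the proposed estimator relies on.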

4.2. Derivation of $\mathrm{E}\left(|N|^2 \mid Y, H_0\right)$ and $\mathrm{E}\left(|N|^2 \mid Y, H_1\right)$

From (8) it is possible to derive an expression for the a posteriori SNR $\gamma = |Y|^2/\hat{\sigma}_N^2$ in terms of $\xi_{\mathrm{opt}}$ and $P(H_1 \mid Y)$, that is,

$$\gamma = \log\left(\frac{1 + \xi_{\mathrm{opt}}}{P(H_1 \mid Y)^{-1} - 1}\right) \frac{1 + \xi_{\mathrm{opt}}}{\xi_{\mathrm{opt}}}. \tag{9}$$

From this expression it follows that already for $P(H_1 \mid Y) > 0.075$, the a posteriori SNR satisfies $\gamma > 1$ if $10 \log_{10}(\xi_{\mathrm{opt}}) = 15$ dB. From this it can be concluded that under speech presence, i.e., when $P(H_1 \mid Y)$ is sufficiently high, the ML estimate of the a priori SNR from (4) can be rewritten as $\hat{\xi}_{\mathrm{ml}} = \hat{\gamma} - 1$. The optimal estimator under speech presence can now be computed as

$$\mathrm{E}\left(|N|^2 \mid Y, \hat{\xi}, H_1\right) = \mathrm{E}\left(|N|^2 \mid Y, \hat{\xi}_{\mathrm{ml}} = \hat{\gamma} - 1\right) = \hat{\sigma}_N^2,$$

which follows from substitution of $\hat{\xi}_{\mathrm{ml}} = \hat{\gamma} - 1$ into (1). Under speech absence we have $Y = N$ and thus $\mathrm{E}\left(|N|^2 \mid Y, H_0\right) = \mathrm{E}\left(|N|^2 \mid N\right) = |N|^2 = |Y|^2$. Then, similar to (1), we obtain

$$|\hat{N}|^2 = \mathrm{E}\left(|N|^2 \mid Y\right) = P(H_0 \mid Y)\, |Y|^2 + P(H_1 \mid Y)\, \hat{\sigma}_N^2, \tag{10}$$

where $P(H_0 \mid Y) = 1 - P(H_1 \mid Y)$ and we employ the spectral noise power estimate of the previous frame $\hat{\sigma}_N^2$. The spectral noise power is then obtained by a recursive smoothing of $|\hat{N}|^2$ as given in (2).

4.3. Avoiding stagnation

From (8) it can be seen that if the spectral noise power is underestimated, it may occur that $P(H_1 \mid Y) = 1$ even though $|Y|^2$ is small with respect to the true, but unknown, noise power. Then, due to (10), the noise power may not be updated anymore, such that the noise power remains underestimated. To detect and overcome this, we recursively smooth $P(H_1 \mid Y)$ over time by

$$\bar{P}(l) = 0.9\, \bar{P}(l-1) + 0.1\, P(H_1 \mid Y(l)), \tag{11}$$

and force the current estimate $P(H_1 \mid Y)$ to be smaller than one if $\bar{P}(l)$ is larger than a threshold, as

$$P(H_1 \mid Y(l)) \leftarrow \begin{cases} \min\left(0.99, P(H_1 \mid Y(l))\right), & \bar{P}(l) > 0.99 \\ P(H_1 \mid Y(l)), & \text{else.} \end{cases} \tag{12}$$

This procedure fits well into the framework and is more memory efficient than the safety-net of Section 3.2 as we do not need to store 0.8 seconds of data. The proposed SPP based algorithm is summarized in Algorithm 1.

In Section 5 we show that the proposed approach yields slightly better results than the estimator proposed in [1], but does not require a bias correction and requires less memory storage.

Algorithm 1 The proposed algorithm for noise power estimation.
1: for all signal segments $l$ do
2:   Compute the a posteriori SPP
     $P(H_1 \mid Y) = \left(1 + \left(1 + \xi_{\mathrm{opt}}\right) \exp\left(-\frac{|Y|^2}{\hat{\sigma}_N^2}\, \frac{\xi_{\mathrm{opt}}}{1 + \xi_{\mathrm{opt}}}\right)\right)^{-1}$,
     where $\hat{\sigma}_N^2$ is the noise power estimate of the previous frame and $10 \log_{10}(\xi_{\mathrm{opt}}) = 15$ dB.
3:   Compute a smoothed a posteriori SPP, as
     $\bar{P}(l) = 0.9\, \bar{P}(l-1) + 0.1\, P(H_1 \mid Y_k(l))$.
4:   Avoid stagnation, as
     $P(H_1 \mid Y(l)) \leftarrow \min\left(0.99, P(H_1 \mid Y(l))\right)$ if $\bar{P}(l) > 0.99$; otherwise $P(H_1 \mid Y(l))$ is left unchanged.
5:   Update the noise periodogram estimate as
     $|\hat{N}|^2 = P(H_0 \mid Y)\, |Y|^2 + P(H_1 \mid Y)\, \hat{\sigma}_N^2$.
6:   Obtain the spectral noise power estimate by temporal smoothing
     $\hat{\sigma}_N^2(l) = 0.8\, \hat{\sigma}_N^2(l-1) + 0.2\, |\hat{N}(l)|^2$.
7: end for

5. EVALUATION

In this section, we compare the proposed spectral noise power estimator to the minimum statistics approach [2] and the MMSE approach with bias compensation proposed in [1]. For the evaluation we employ 320 sentences from the TIMIT database [11] and several synthetic and natural noise sources. In these evaluations we set the sampling rate at $f_s = 16$ kHz. Further, we use a Hann-window of length $N = 512$ for spectral analysis, where successive segments overlap by 50%.

As proposed in [12] we compare the estimated noise power $\hat{\sigma}_{N,k}^2$ to a reference $\sigma_{N,k}^2$ in terms of the log-error distortion measure. In contrast to [12], we separate the error measure into over- and underestimation, i.e.

$$\mathrm{LogErr} = \mathrm{LogErrOver} + \mathrm{LogErrUnder}, \tag{13}$$

where LogErrOver measures the contributions of an overestimation of the true noise power, as

$$\mathrm{LogErrOver} = \frac{1}{NL} \sum_{l=0}^{L-1} \sum_{k=0}^{N-1} \min\left(0,\ 10 \log_{10}\left(\frac{\sigma_{N,k}^2(l)}{\hat{\sigma}_{N,k}^2(l)}\right)\right),$$

while LogErrUnder measures the contributions of an underestimation of the true noise power, as

$$\mathrm{LogErrUnder} = \frac{1}{NL} \sum_{l=0}^{L-1} \sum_{k=0}^{N-1} \max\left(0,\ 10 \log_{10}\left(\frac{\sigma_{N,k}^2(l)}{\hat{\sigma}_{N,k}^2(l)}\right)\right).$$

Note that an overestimation of the true noise power, as indicated by LogErrOver, is likely to result in an attenuation of the speech signal in a speech enhancement framework and thus in speech distortions. On the other hand, at time-frequency points where the noise power is underestimated, the noise signal is not reduced to the same extent as for the true noise power. Furthermore, if the noise power is underestimated locally, isolated time-frequency points are not attenuated, which may result in annoying artifacts perceived as so-called musical noise.

As noise types, we consider modulated white Gaussian noise, passing-cars noise, nonstationary vacuum cleaner noise, and babble noise. The modulated noise is modulated with $f_{\mathrm{mod}} = 0.5$ Hz using the function $f(m) = 1 + 0.5 \sin(2\pi m f_{\mathrm{mod}}/f_s)$, where $m$ is the sample index.

For the synthetic modulated white noise, the true noise power is known and is thus used for the evaluation. For the remaining nonstationary and thus non-ergodic noise sources the determination of the true spectral noise power is impossible, as only one realization of the random variable is available in each time-frequency point. Therefore, we use the periodogram of the noise-only signal as an estimate of the true noise power, i.e. $\sigma_{N,k}^2 = |N|^2$.

The results of our evaluation are given in Figure 1. It can be seen that for the modulated white Gaussian noise, the minimum statistics approach is not able to follow the rapid changes of the noise signal, resulting in a large amount of noise underestimation that is likely to result in musical noise in a speech enhancement framework. For the natural noise sources we considered, passing-cars noise, babble noise and vacuum cleaner noise, this effect is not as dramatic as for the synthetic modulated Gaussian noise, but still, the minimum statistics approach results in the largest noise underestimation, and is thus likely to result in the largest amount of musical noise. Comparing the proposed SPP based estimator to the MMSE based estimator [1] it can be seen that the overall performance in terms of the LogErr is rather similar. However, the MMSE [1] approach has the tendency to overestimate the noise power most. This may be because the bias compensation factor multiplied to the estimated noise signal in [1] is always larger than or equal to one, even when the noise power estimate of the previous frame was overestimated.

[Figure 1: four bar-chart panels, (a) Modulated Gaussian noise, (b) Passing-cars noise, (c) Babble noise, (d) Vacuum cleaner. Each panel plots LogErr against input SNR (axis ticks from −10 to 15) for the proposed SPP estimator, MMSE [1], and Minimum Statistics.]

Figure 1: Comparison in terms of the LogErr for 320 TIMIT sentences and various input SNRs. The lower part of the bars represents the amount of noise power overestimation LogErrOver, while the upper part represents the noise underestimation LogErrUnder. The total height of the bars corresponds to LogErr.

While obtaining similar results in terms of the LogErr, the proposed SPP based estimator is computationally and memory-wise more efficient, as we do not need to store 0.8 seconds of data for the safety-net of Section 3.2, nor do we need to compute the incomplete gamma function necessary for the bias compensation in [1].

The code for this noise power estimator is available at www.ee.kth.se/~gerkmann/sppBasedNoisePow.

6. CONCLUSIONS

In this work, we have refined the minimum mean square error (MMSE) based noise power estimator [1]. We have shown that when a limited maximum likelihood (ML) estimator is used for the estimation of the a priori signal-to-noise ratio (SNR), the resulting noise power estimate is only updated when the a posteriori SNR is below a certain threshold. We have argued that this thresholding can be interpreted as a voice activity detector (VAD). In addition, in order to function properly, the estimator in [1] requires a bias compensation and a so-called safety-net that requires storing the last 0.8 seconds of data.

In this paper we have shown that the bias compensation and the safety-net are unnecessary if the hard decision of the VAD is replaced by a soft speech presence probability (SPP) estimator. The proposed estimator is more memory efficient and results in slightly better performance than the estimator from [1].

7. REFERENCES

[1] R. C. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," IEEE ICASSP, pp. 4266–4269, Mar. 2010.
[2] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504–512, July 2001.
[3] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, Sept. 2003.
[4] R. C. Hendriks, J. Jensen, and R. Heusdens, "Noise tracking using DFT domain subspace decompositions," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 3, pp. 541–553, March 2008.
[5] R. Yu, "A low-complexity noise estimation algorithm based on smoothing of noise power estimation and estimation bias correction," in IEEE ICASSP, 2009, pp. 4421–4424.
[6] J. Taghia, J. Taghia, N. Mohammadiha, J. Sang, V. Bouse, and R. Martin, "An evaluation of noise power spectral density estimation algorithms in adverse acoustic environments," IEEE ICASSP, May 2011.
[7] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[8] I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments," Elsevier Signal Process., vol. 81, no. 11, pp. 2403–2418, Nov. 2001.
[9] T. Gerkmann, C. Breithaupt, and R. Martin, "Improved a posteriori speech presence probability estimation based on a likelihood ratio with fixed priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 910–919, July 2008.
[10] R. J. McAulay and M. L. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 2, pp. 137–145, Apr. 1980.
[11] J. S. Garofolo, "DARPA TIMIT acoustic-phonetic speech database," National Institute of Standards and Technology (NIST), 1988.
[12] R. C. Hendriks, J. Jensen, and R. Heusdens, "DFT domain subspace based noise tracking for speech enhancement," in Interspeech, August 2007, pp. 830–833.
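As a companion to Algorithm 1 of Section 4.3, the per-frame update can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' released implementation: the function name, the (frames × bins) array layout, the caller-supplied initial noise power and the zero initialization of the smoothed SPP are all choices made here, since the paper does not specify how the recursion is initialized.

```python
import numpy as np


def spp_noise_tracker(noisy_periodograms, init_noise_psd, xi_opt_db=15.0):
    """Sketch of Algorithm 1 on an array of |Y_k(l)|^2 with shape (frames, bins).

    Hypothetical helper. `init_noise_psd` (one value per bin) is an assumed
    starting value for sigma_N^2; the paper does not prescribe one.
    Returns the noise power estimate per frame and bin.
    """
    xi_opt = 10.0 ** (xi_opt_db / 10.0)
    noisy = np.asarray(noisy_periodograms, dtype=float)
    noise_psd = np.array(init_noise_psd, dtype=float)  # sigma_N^2(l-1)
    smoothed_spp = np.zeros_like(noise_psd)  # P_bar(l-1); zero start is assumed
    estimates = np.empty_like(noisy)

    for l, y2 in enumerate(noisy):
        # Step 2: a posteriori SPP, (8), using the previous noise power estimate.
        gamma = y2 / noise_psd
        spp = 1.0 / (1.0 + (1.0 + xi_opt) * np.exp(-gamma * xi_opt / (1.0 + xi_opt)))
        # Step 3: recursive smoothing of the SPP, (11).
        smoothed_spp = 0.9 * smoothed_spp + 0.1 * spp
        # Step 4: avoid stagnation, (12).
        spp = np.where(smoothed_spp > 0.99, np.minimum(0.99, spp), spp)
        # Step 5: soft-decision noise periodogram, (10).
        noise_periodogram = (1.0 - spp) * y2 + spp * noise_psd
        # Step 6: temporal smoothing, (2) with alpha = 0.8.
        noise_psd = 0.8 * noise_psd + 0.2 * noise_periodogram
        estimates[l] = noise_psd

    return estimates


if __name__ == "__main__":
    # Illustrative input: the observed power sits far above the initial estimate.
    frames = np.full((400, 1), 100.0)
    tracked = spp_noise_tracker(frames, init_noise_psd=np.ones(1))
    print(tracked[40, 0], tracked[-1, 0])
```

In this constructed example the estimate initially stalls because $P(H_1 \mid Y)$ evaluates to one, and it only starts to rise once the smoothed SPP exceeds 0.99 and step 4 caps the SPP, which is the behavior Section 4.3 is designed to produce.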
