A Two-Step Technique For MRI Audio Enhancement Using Dictionary Learning and Wavelet Packet Analysis
Here, $\hat{\sigma}^2_{N_k}$ is the estimated variance of the noise coefficients in subband $k$ of level $D$ and $\hat{\sigma}^2_{X_k}$ is the estimated variance of the noisy signal coefficients in subband $k$ of level $D$, $k = 1, 2, \ldots, 2^D$. To compute the threshold, we need an estimate of the noise. If the MRI noise is periodic, we can estimate the noise with

$$v[n] = \sum_{k} \alpha_k \cos(2\pi f_0 k n) \qquad (4)$$

where $f_0$ is calculated using Equation 1 and $\alpha_k$ is a scalar that shapes the spectrum of $v[n]$ to match the spectral shape of the MRI noise. For non-periodic MRI noise, we can instead estimate the noise from the first second of the noise estimate produced by PLCA, which gives us the flexibility to denoise speech corrupted by non-periodic MRI noise. Since the noise in our experiments is periodic, we use $v[n]$ as the noise estimate because it performs marginally better than the estimate derived from PLCA's output. Once we calculate the threshold, we soft-threshold the wavelet coefficients in each subband and reconstruct the denoised signal from the thresholded coefficients.
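To make this step concrete, the sketch below builds the harmonic noise estimate of Equation 4 and soft-thresholds the wavelet packet coefficients subband by subband. It is a minimal illustration under stated assumptions rather than the paper's implementation: PyWavelets is an assumed toolkit (it does not ship the Beylkin wavelet, so db4 stands in), and the per-subband threshold shown is a simple stand-in for the paper's threshold rule defined earlier in the text.

```python
import numpy as np
import pywt  # PyWavelets: an assumed toolkit choice, not specified in the paper


def periodic_noise_estimate(f0, alphas, n_samples, fs):
    """Harmonic noise model v[n] = sum_k alpha_k * cos(2*pi*f0*k*n) (Equation 4)."""
    t = np.arange(n_samples) / fs
    return sum(a * np.cos(2 * np.pi * f0 * (k + 1) * t) for k, a in enumerate(alphas))


def wp_soft_denoise(x, v, wavelet="db4", depth=6):
    """Soft-threshold the wavelet packet coefficients of x using the noise estimate v."""
    wp_x = pywt.WaveletPacket(x, wavelet=wavelet, maxlevel=depth)
    wp_v = pywt.WaveletPacket(v, wavelet=wavelet, maxlevel=depth)
    for node_x, node_v in zip(wp_x.get_level(depth, "freq"),
                              wp_v.get_level(depth, "freq")):
        var_n = np.var(node_v.data)   # estimated noise variance in this subband
        var_x = np.var(node_x.data)   # noisy-signal variance in this subband
        # Stand-in threshold derived from the two variances; the paper's own
        # threshold rule (defined above) would be used here instead.
        thr = var_n / (np.sqrt(var_x) + 1e-12)
        wp_x[node_x.path] = pywt.threshold(node_x.data, thr, mode="soft")
    return wp_x.reconstruct(update=False)[:len(x)]
```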
Soon et al. reported very little difference in the SNR of the denoised signal when using different wavelets, even accounting for varying SNR of the noisy signal and male/female speakers [10]. They evaluated denoising performance using biorthogonal, Daubechies, Coiflet, and Symmlet wavelets with different wavelet orders. Our experiments corroborated their findings; we found very little difference in the quality of the denoised signal, both quantitatively and perceptually, when using different wavelets. Among the wavelets we evaluated, the Beylkin wavelet empirically gave the highest noise suppression, so we used it for the wavelet analysis and synthesis.

4. Experimental Evaluation

We tested our algorithm on a set of 6 TIMIT utterances recorded in an MRI scanner, with two different scanner settings that produce two different periodic noises we will call seq1 and GR. The drawback of using these recordings for evaluation is the lack of a clean reference signal. Consequently, we supplemented our evaluation with clean speech recordings from the Aurora 5 digits database. We added the two MRI noises to the clean speech at an SNR of −6 dB, which is similar to the SNR of the TIMIT utterances.
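As a side note on reproducing this setup, mixing a noise recording with clean speech at a prescribed SNR (here −6 dB) amounts to scaling the noise so that the speech-to-noise power ratio hits the target before adding the two. A small illustrative sketch; the helper name and the use of NumPy are assumptions, not taken from the paper.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add it."""
    noise = np.resize(noise, clean.shape)   # loop or trim the noise to the speech length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise


# e.g. noisy = mix_at_snr(aurora_clean, seq1_noise, snr_db=-6.0)
```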
We compared the performance of our proposed algorithm to the normalized LMS algorithm (denoted LMS-1) and the LMS variant proposed in [4] (denoted LMS-2). For LMS-1, we used a filter length of 3000 and a step size of 1. The LMS-2 algorithm did not need any parameter tuning; its parameters are set by the algorithm itself and vary based on the MRI pulse sequence used to acquire the recording. LMS-2 is known to perform well on seq1 noise and is currently used to remove seq1 noise from speech recordings. However, its performance degrades with GR noise, which discourages speech researchers from using GR pulse sequences even though they yield better MRI images.
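For orientation, a normalized LMS noise canceller of the kind used as the LMS-1 baseline can be sketched as below with the stated filter length and step size. The sketch assumes a separate noise reference channel drives the adaptive filter, which is the usual setup for such cancellers; the section does not restate how the baselines obtain their reference, so treat this purely as an illustration.

```python
import numpy as np


def nlms_denoise(noisy, ref, order=3000, mu=1.0, eps=1e-8):
    """Normalized LMS noise canceller: adapt an FIR filter on the noise
    reference `ref` to predict the noise in `noisy`; the prediction error
    is the denoised speech estimate."""
    w = np.zeros(order)
    out = np.zeros_like(noisy, dtype=float)
    for n in range(order, len(noisy)):
        x = ref[n - order:n][::-1]            # most recent `order` reference samples
        y = w @ x                             # current noise estimate
        e = noisy[n] - y                      # error = speech estimate
        w += (mu / (x @ x + eps)) * e * x     # normalized LMS update
        out[n] = e
    return out
```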
4.1. Quantitative Performance Metrics

To quantify the performance of our denoising algorithm, we calculated the noise suppression, which is given by:

$$\text{noise suppression} = 10 \log \frac{P_{\text{noise}}}{\hat{P}_{\text{noise}}} \qquad (5)$$

where $P_{\text{noise}}$ is the power of the noise in the noisy signal and $\hat{P}_{\text{noise}}$ is the power of the noise in the denoised signal. We use a voice activity detector (VAD) to find the noise-only regions in the denoised and noisy signals. We calculate the noise suppression measure instead of SNR because we do not have a clean reference signal for the TIMIT utterances.
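Equation 5 translates directly into code once the VAD has flagged the noise-only samples. In this sketch the VAD output is assumed to be a boolean mask over samples; the paper does not specify which VAD it uses.

```python
import numpy as np


def noise_suppression_db(noisy, denoised, noise_only_mask):
    """Equation 5: ratio of noise power before and after denoising, in dB,
    measured only over the samples the VAD marks as noise-only."""
    p_noise = np.mean(noisy[noise_only_mask] ** 2)
    p_noise_hat = np.mean(denoised[noise_only_mask] ** 2)
    return 10.0 * np.log10(p_noise / p_noise_hat)
```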
Ramachandran et al. proposed the log-likelihood ratio (LLR) and distortion variance measures in [11] for evaluating denoising algorithms. The LLR calculates the mismatch between the spectral envelopes of the clean signal and the denoised signal. It is calculated using:

$$\text{LLR} = \log \frac{a_{\hat{s}}^{T} R_{s}\, a_{\hat{s}}}{a_{s}^{T} R_{s}\, a_{s}} \qquad (6)$$

where $a_s$ and $a_{\hat{s}}$ are $p$-order LPC coefficients of the clean and denoised signals, respectively, and $R_s$ is a $(p+1) \times (p+1)$ autocorrelation matrix of the clean signal. An LLR of 0 indicates no spectral distortion between the clean and denoised signals, while a high LLR indicates the presence of noise and/or distortion in the denoised signal. The distortion variance is given by:

$$\sigma_d^2 = \frac{1}{L} \left\| s[n] - \hat{s}[n] \right\|^2 \qquad (7)$$

where $s[n]$ and $\hat{s}[n]$ are the clean and denoised signals, respectively, and $L$ is the length of the signal. A low distortion variance is more desirable than a high distortion variance.
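Both measures reduce to a few lines of NumPy/SciPy: the LPC coefficients come from the autocorrelation (Yule-Walker) method, the LLR of Equation 6 compares the two LPC models against the clean signal's autocorrelation matrix, and the distortion variance of Equation 7 is a mean squared error. Frame-based evaluation and the LPC order are assumptions here, since the paper does not state them in this section.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz


def lpc(frame, p):
    """p-order LPC via the autocorrelation method; returns [1, -a_1, ..., -a_p]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    a = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])   # Yule-Walker equations
    return np.concatenate(([1.0], -a))


def llr(clean_frame, denoised_frame, p=10):
    """Equation 6: spectral-envelope mismatch between clean and denoised LPC models."""
    a_s = lpc(clean_frame, p)
    a_sh = lpc(denoised_frame, p)
    r = np.correlate(clean_frame, clean_frame,
                     mode="full")[len(clean_frame) - 1:len(clean_frame) + p]
    R_s = toeplitz(r)                                # (p+1) x (p+1) autocorrelation matrix
    return np.log((a_sh @ R_s @ a_sh) / (a_s @ R_s @ a_s))


def distortion_variance(clean, denoised):
    """Equation 7: mean squared difference between the clean and denoised signals."""
    return np.mean((clean - denoised) ** 2)
```

In practice the LLR is computed on short analysis frames and then summarized over the utterance (for example by the mean), which is how frame-level LPC models are usually compared.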
4.2. Qualitative Performance Metrics

To supplement the quantitative results, we created a listening test to compare the denoised signals from our proposed algorithm, LMS-1, and LMS-2. We created 12 sets of audio clips in 4 different environments: TIMIT utterances with seq1 noise, TIMIT utterances with GR noise, Aurora digits with seq1 noise, and Aurora digits with GR noise. Each environment contained 3 sets of audio clips. Each set contained a noisy signal and denoised versions of that signal from the proposed algorithm, LMS-1, and LMS-2. For the sets with Aurora digits, we also included the clean signal. Thus, each set with TIMIT utterances had 4 clips and each set with Aurora digits had 5 clips. The sets and the clips within each set were randomized and presented in an online survey. Twenty-five volunteers ranked the clips within each set from 1 to 4 (TIMIT sets) or 1 to 5 (Aurora sets), with 1 denoting the best quality and intelligibility.

4.3. Results

Objective measures: Table 1 lists the noise suppression for the TIMIT utterances. Table 2 shows the noise suppression, LLR, and distortion variance results for the Aurora digits.

Table 1: Noise suppression (dB) results for the TIMIT sentences.

  Sequence   Proposed   LMS-1   LMS-2
  seq1         19.27    18.01   18.79
  GR           24.1     18.37    9.17

Table 2: Noise suppression (NS), LLR, and distortion variance (DV) results for the Aurora 5 digits.

  Metric        Sequence   Proposed   LMS-1   LMS-2
  NS (dB)       seq1         30.23    32.55   26.53
                GR           24.14    27.88   10.91
  LLR           seq1          0.17     0.4     0.42
                GR            0.11     0.41    0.33
  DV (×10⁻⁵)    seq1          7.52    34.8    21.4
                GR            9.56    35.8    37.7

For TIMIT utterances corrupted by seq1 and GR noises, our proposed algorithm suppresses noise better than LMS-1 and LMS-2. Our algorithm performs slightly worse than LMS-1 for Aurora digits corrupted by seq1 and GR noises. This is because the noise in the Aurora recordings is purely additive, while the noise in the direct MRI TIMIT recordings is more convolutive in nature. Our experiments confirmed that LMS-2 performs better on seq1 noise than on GR noise, both for the TIMIT utterances and the Aurora digits. Importantly, our proposed algorithm performs comparably to LMS-2 on seq1 noise. The LLR and distortion variance results show that our algorithm reconstructed the spectral characteristics of the clean signal more faithfully than LMS-1 and LMS-2. Preserving the spectral characteristics of the signal is a key result when considering denoising speech for subsequent speech analysis and modeling.
Subjective measures: Table 3 shows the median rankings obtained from the listening test for the audio clips in the 4 environments.

Table 3: Median rankings of the audio clips for the four environments (clean clips were not included in the TIMIT sets).

  Environment           Clean   Proposed   LMS-1   LMS-2   Noisy
  TIMIT, seq1 noise       -        2         3       1       4
  TIMIT, GR noise         -        1         2       3       4
  Aurora, seq1 noise      1        3         4       2       5
  Aurora, GR noise        1        2         3       4       5

A nonparametric Kruskal-Wallis test showed that the medians of the rankings obtained for each denoising algorithm were significantly different at the α = 99% level. We then used the post-hoc Wilcoxon rank-sum test to check for pairwise differences in the median ranks. The Wilcoxon test results show that the median ranks for each pair of clips are significantly different at the α = 99% level, except for the LMS-1/noisy pair in the TIMIT utterances with seq1 noise environment. Hence, we can say with some certainty that listeners ranked our algorithm as the best for removing GR noise and second best for removing seq1 noise.
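The statistical analysis just described maps onto standard SciPy routines: a Kruskal-Wallis test across the per-condition rankings, followed by pairwise Wilcoxon rank-sum tests. A minimal sketch, assuming the listener rankings are gathered per condition and taking α = 0.01 to correspond to the 99% level quoted in the text:

```python
from itertools import combinations

from scipy.stats import kruskal, ranksums


def ranking_tests(rankings, alpha=0.01):
    """rankings: dict mapping a condition name (e.g. 'Proposed', 'LMS-1',
    'LMS-2', 'Noisy') to the list of ranks listeners gave that condition."""
    _, p_kw = kruskal(*rankings.values())                  # overall Kruskal-Wallis test
    pairwise = {}
    for (name_a, ranks_a), (name_b, ranks_b) in combinations(rankings.items(), 2):
        _, p = ranksums(ranks_a, ranks_b)                  # Wilcoxon rank-sum per pair
        pairwise[(name_a, name_b)] = (p, p < alpha)
    return p_kw < alpha, pairwise
```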
5. Conclusions

We have proposed a denoising algorithm to remove noise from speech recorded in an MRI scanner. The two-step algorithm uses PLCA to separate the noise and speech, and wavelet packet analysis to further remove noise left by the PLCA algorithm. Objective measures show that our proposed algorithm achieves better noise suppression and less spectral distortion than LMS methods. A listening test shows that our algorithm yields higher quality and more intelligible speech than LMS methods.

To further extend our work, we will compare our proposed algorithm to other denoising methods, such as signal subspace and model-based approaches. Additionally, we need to evaluate how well our algorithm aids speech analysis, such as formant extraction. Finally, we will evaluate the performance of our algorithm in other low-SNR speech enhancement scenarios, such as those involving Gaussian, Cauchy, babble, and traffic noises.
6. References
[1] W. F. Katz, S. V. Bharadwaj, and B. Carstens, “Electromagnetic
Articulography Treatment for an Adult With Broca’s Aphasia and
Apraxia of Speech,” J. Speech, Language, and Hearing Research,
vol. 42, no. 6, pp. 1355–1366, Dec. 1999.
[2] M. Itoh, S. Sasanuma, H. Hirose, H. Yoshioka, and T. Ushi-
jima, “Abnormal articulatory dynamics in a patient with apraxia
of speech: X-ray microbeam observation,” Brain and Language,
vol. 11, no. 1, pp. 66–75, Sep. 1980.
[3] D. Byrd, S. Tobin, E. Bresch, and S. Narayanan, “Timing effects
of syllable structure and stress on nasals: A real-time MRI exam-
ination,” J. Phonetics, vol. 37, no. 1, pp. 97–110, Jan. 2009.
[4] E. Bresch, J. Nielsen, K. S. Nayak, and S. Narayanan, “Synchro-
nized and Noise-Robust Audio Recordings During Realtime Mag-
netic Resonance Imaging Scans,” J. Acoustical Society of Amer-
ica, vol. 120, no. 4, pp. 1791–1794, Oct. 2006.
[5] Z. Duan, G. J. Mysore, and P. Smaragdis, “Online PLCA for
Real-time Semi-supervised Source Separation,” in Proc. Int. Conf.
Latent Variable Analysis/Independent Component Analysis, Tel-
Aviv, Israel, 2012, pp. 34–41.
[6] Y. Ghanbari and M. R. Karami-Mollaei, “A new approach for
speech enhancement based on the adaptive thresholding of the
wavelet packets,” Speech Commun., vol. 48, no. 8, pp. 927–940,
Aug. 2006.
[7] M. McJury and F. G. Shellock, “Auditory Noise Associated with
MR Procedures,” J. Magnetic Resonance Imaging, vol. 12, no. 1,
pp. 37–45, Jul. 2001.
[8] Y. Kim, S. S. Narayanan, and K. S. Nayak, “Flexible retrospective
selection of temporal resolution in real-time speech MRI using a
golden-ratio spiral view order,” Magnetic Resonance in Medicine,
vol. 65, no. 5, pp. 1365–1371, 2011.
[9] S. Tabibian, A. Akbari, and B. Nasersharif, “A New Wavelet
Thresholding Method for Speech Enhancement Based on Sym-
metric Kullback-Leibler Divergence,” in 14th Int. Computer Soci-
ety of Iran Computer Conf., Tehran, Iran, 2009, pp. 495–500.
[10] I. Y. Soon, S. N. Koh, and C. K. Yeo, “Wavelet for Speech De-
noising,” in Proc. IEEE Region 10 Annu. Conf. Speech and Im-
age Technologies Computing and Telecommunications, Brisbane,
Australia, 1997, pp. 479–482.
[11] V. R. Ramachandran, I. M. S. Panahi, and A. A. Milani, “Objec-
tive and Subjective Evaluation of Adaptive Speech Enhancement
Methods for Functional MRI,” J. Magnetic Resonance Imaging,
vol. 31, no. 1, pp. 46–55, Dec. 2009.