
UNIFYING ROBUSTNESS AND FIDELITY: A COMPREHENSIVE STUDY OF PRETRAINED
GENERATIVE METHODS FOR SPEECH ENHANCEMENT IN ADVERSE CONDITIONS

Heming Wang1*, Meng Yu2, Hao Zhang2, Chunlei Zhang2, Zhongweiyang Xu4*, Muqiao Yang3*, Yixuan Zhang1*, Dong Yu2
1 The Ohio State University, USA   2 Tencent AI Lab, USA
3 Carnegie Mellon University, USA   4 University of Illinois Urbana-Champaign, USA

arXiv:2309.09028v1 [eess.AS] 16 Sep 2023

ABSTRACT

Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech that has improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. Especially by leveraging the codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality, reduced background noise, and reduced reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.

Index Terms— speech enhancement, speech vocoder, speech codec, robustness, fidelity

1. INTRODUCTION

In real-world scenarios, speech signals are often degraded by background noise and room reverberation, leading to diminished clarity and comprehensibility. The main aim of speech enhancement is to mitigate the impact of such environmental disturbances. The development of deep neural networks (DNNs) has greatly advanced speech enhancement research. DNNs have shown remarkable proficiency in suppressing background noise and reverberation, yielding satisfactory enhancement results [1]. DNN-based enhancement techniques primarily focus on direct speech signal representations, aiming to establish mappings from noisy inputs to their corresponding clean targets. These representations include, but are not limited to, magnitude [2, 3], complex spectrograms [4, 5], waveforms [6, 7], or a fusion of these features [8, 9], which are all intrinsically associated with the signals. Despite the effectiveness of existing powerful enhancement baselines, their performance often deteriorates notably in complicated real-world scenarios. The enhanced speech obtained by supervised learning based models in such challenging scenarios may retain strong noise or reverberation, and be accompanied by distortions and artifacts [10].

To address these challenges, recent studies aim to leverage the potential of pre-trained models. Some researchers utilized diffusion models to refine speech, employing them to regenerate clean speech based on enhanced priors acquired through pre-trained discriminative models [11, 12]. Another avenue of investigation involves employing speech vocoders for speech resynthesis. For instance, VoiceFixer was proposed to address general speech restoration [13]. It employs an enhancement model on mel-spectrograms and subsequently utilizes the HiFi-GAN [14] vocoder to resynthesize the clean speech. Similarly, [15] proposed to use masked autoencoders for speech restoration, and employs mel-to-mel mapping during pretraining to restore masked audio signals. We believe that discrete representations stored in codebooks are more robust against various interferences, and propose to employ speech codecs to perform speech enhancement. The majority of existing research related to speech codecs [16, 17] is primarily centered around text-to-speech tasks, relying heavily on text embeddings to ensure input stability. We draw inspiration from a parallel study in computer vision [18], which addresses blind face restoration through the regeneration of code tokens within a learned discrete codebook, and are motivated by its exceptional robustness against degradation in both synthetic and real-world datasets. Furthermore, a relevant contribution by Wav2code [19] has also introduced the utilization of codebooks to enhance the resilience of speech representations. Notably, Wav2code focuses more on improving robust automatic speech recognition and operates on self-supervised learning (SSL) embeddings.

This paper systematically investigates two pipelines: one based on a speech vocoder and the other on a speech codec. In both pipelines, our network processes a main input and an auxiliary input to produce intermediate representations, which are then enhanced before generating the desired speech output. We choose this design because generative methods excel in addressing complex situations with significant information loss in speech signals. Utilizing pre-trained models also benefits from leveraging existing semantic or acoustic information, aiding in faithful resynthesis of the original speech and enhancing fidelity. Experimental outcomes with real and synthetic datasets demonstrate the superiority of our proposed pipelines over traditional STFT-based models in terms of robustness and subjective ratings. Additionally, the codec-based approach effectively reduces uncertainty and ambiguity in the restoration mapping, showing notable advantages in real-world scenarios.

* Work done during an internship at Tencent AI Lab.
Fig. 1: The overview of the vocoder pipeline.

Fig. 2: The overview of the codec pipeline.


2. PROPOSED APPROACH

2.1. Vocoder Approach

Fig. 1 illustrates the vocoder approach, wherein a noisy mel-spectrogram is transformed into a clean counterpart using an acoustic enhancer. During inference, we leverage a pre-trained HiFi-GAN vocoder [14] to restore the clean speech. An auxiliary input is produced by employing an SSL conditioner on the SSL features. Specifically, we adopt the WavLM-Large variant of the WavLM model [20] and extract the learnable weighted sum of all layer outputs to produce 1024-dimensional SSL features, which are then processed by the SSL conditioner to extract an SSL embedding of 256 dimensions. This conditioner comprises a three-layer 1-dimensional convolutional network with upsampling, ReLU activation, instance normalization, and a dropout of 0.5.
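To make the conditioning path concrete, the following is a minimal PyTorch sketch of a learnable weighted sum over WavLM layer outputs followed by a small convolutional conditioner. The layer count, upsampling factor, kernel sizes, and module names are illustrative assumptions rather than the exact configuration used here.

    import torch
    import torch.nn as nn

    class SSLConditioner(nn.Module):
        """Learnable weighted sum of WavLM layers + 3-layer Conv1d conditioner (sketch)."""

        def __init__(self, num_layers=25, ssl_dim=1024, out_dim=256, upsample=2):
            super().__init__()
            # One learnable scalar per WavLM layer, normalized with a softmax.
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))
            self.net = nn.Sequential(
                nn.Upsample(scale_factor=upsample, mode="nearest"),  # assumed factor: 20 ms -> 10 ms frames
                nn.Conv1d(ssl_dim, 512, kernel_size=3, padding=1),
                nn.InstanceNorm1d(512),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Conv1d(512, 256, kernel_size=3, padding=1),
                nn.InstanceNorm1d(256),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Conv1d(256, out_dim, kernel_size=3, padding=1),
            )

        def forward(self, hidden_states):
            # hidden_states: (num_layers, batch, frames, ssl_dim), e.g. stacked WavLM outputs.
            w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
            fused = (w * hidden_states).sum(dim=0)       # (batch, frames, ssl_dim)
            return self.net(fused.transpose(1, 2))       # (batch, out_dim, upsampled frames)

The stacked hidden states could come, for example, from a Hugging Face WavLM model run with output_hidden_states=True.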
The acoustic enhancer, based on the deep complex convolutional recurrent network (DCCRN) architecture [5], employs a convolutional encoder-decoder with an LSTM bottleneck. Concretely, DCCRN consists of a six-layer convolutional encoder and decoder, and an LSTM block in the bottleneck to model time dependencies. We adjust the architecture for mel-spectrogram input by removing all complex-value related operations and setting the number of input convolutional channels to 1. The auxiliary input is fed to the bottleneck and concatenated with the input of the LSTM block. To make training more efficient, the vocoder modules are only used during inference. During training, we calculate the L1 loss between enhanced and clean mel-spectrograms. Given the degraded speech input x ∈ R^L, the target clean speech y is of the same length L. For the intermediate representation, we extract 128-band mel-spectrograms at a hop length of 10 ms with a Hann window of 64 ms, resulting in mel features Mel(x) ∈ R^{T×K}, where T denotes the number of time frames and K represents the feature dimension of 128. The training objective for the vocoder approach is then defined as

    L_Vocoder = (1 / (T·K)) Σ_{t=1}^{T} Σ_{k=1}^{K} |M̂ − Mel(y)|,    (1)

where M̂ is the estimated mel-spectrogram and Mel(y) represents the ground truth mel-spectrogram.
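As a reference point, the mel analysis settings above (128 mel bands, 10 ms hop, 64 ms Hann window at 16 kHz) and the L1 objective of Eq. (1) can be sketched as follows; the n_fft choice and the log compression are assumptions, not details specified in the paper.

    import torch
    import torchaudio

    # 16 kHz audio: 64 ms window = 1024 samples, 10 ms hop = 160 samples, 128 mel bands.
    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, win_length=1024, hop_length=160, n_mels=128
    )

    def mel(x):
        # Log compression is a common choice for vocoder features (assumed here).
        return torch.log(mel_fn(x).clamp(min=1e-5))

    def vocoder_loss(m_hat, y):
        # Eq. (1): mean absolute error over all T frames and K mel bins.
        return torch.nn.functional.l1_loss(m_hat, mel(y))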
2.2. Codec Approach

We depict the pipeline of the codec approach in Fig. 2. The implementation entails supervised enhancement learning within the code token space: we attempt to obtain the code tokens of the target speech and then use a pre-trained speech decoder to restore the clean speech. The code enhancer is designed to predict clean code tokens based on the primary input codec embedding and the auxiliary input mel-spectrograms. This undertaking is similar to a classification task over code tokens. Initial attempts to predict tokens corresponding to clean speech encountered challenges. Firstly, most feature encoders of existing codecs are not trained on degraded speech utterances. This inconsistency between the corrupted features at the codec input and the accurate derivation of code tokens by the codec led to instability in the input code tokens, thereby yielding suboptimal enhancement outcomes. Furthermore, predicting speech embeddings (either pre- or post-vector quantization) is comparatively simpler; however, the speech generated by the decoder may contain distortions, as the predicted embeddings may not align well with the patterns pre-stored in the codebooks, consequently affecting enhancement performance.

To address these issues, we propose several techniques. Firstly, a generalized codec architecture is adopted: an EnCodec [21] trained on utterances from multiple languages retrieved from GigaSpeech [22], LibriTTS [23], VP10K [24], and Common Voice [25], augmented with 20% probability with simulated noise and reverberation. Additionally, encoder embeddings are employed as primary inputs during training, alongside mel-spectrograms of the input speech as auxiliary input. Finally, we work on the training target and training objective. Two techniques are investigated to mitigate prediction errors when the correct tokens cannot be retrieved. The first is label smoothing. Specifically, principal component analysis is applied to the target tokens, and they are sorted based on the absolute values of the retrieved codebook entries. Label smoothing is then incorporated for neighboring tokens, where we set the target token with a probability of 0.9 and its neighbor tokens with 0.05. For the other technique, we retrieve the codebook entries Z from the tokens C using the Gumbel-softmax layer [26] in a fully differentiable way, which forms quantized representations Zc ∈ R^{N×D}, where N indicates the total number of code tokens and D represents the feature dimension of each codebook entry. This can be formulated as

    L_Codec = L_Token + λ L_Entry,
    L_Token = (1/N) Σ_{i=1}^{N} C_i log(Ĉ_i),
    L_Entry = (1 / (N·D)) Σ_{i=1}^{N} Σ_{j=1}^{D} (Z_d − Ẑ_c)^2,    (2)

where L_Token measures the cross-entropy loss between the predicted code tokens Ĉ and the target tokens C, and the second term L_Entry measures the mean squared loss between the codebook entries Z_d obtained from the degraded speech and the retrieved entries Z_c acquired from the predicted tokens Ĉ. The coefficient λ is empirically chosen to be 0.5.
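A minimal sketch of this combined objective is given below, assuming a single codebook of 1024 entries, a precomputed neighbor table for the PCA-sorted label smoothing, and Gumbel-softmax retrieval of codebook entries; all tensor shapes and helper names are illustrative, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def smoothed_targets(tokens, neighbors, vocab=1024, p_main=0.9, p_nb=0.05):
        """tokens: (N,) target code indices; neighbors: (vocab, 2) indices of the two
        nearest codebook entries under the PCA-based ordering (assumed precomputed)."""
        dist = torch.zeros(tokens.numel(), vocab)
        dist[torch.arange(tokens.numel()), tokens] = p_main
        for j in range(neighbors.shape[1]):
            dist[torch.arange(tokens.numel()), neighbors[tokens, j]] += p_nb
        return dist

    def codec_loss(logits, tokens, codebook, z_degraded, neighbors, lam=0.5, tau=1.0):
        """logits: (N, vocab) predicted token logits; codebook: (vocab, D) entries;
        z_degraded: (N, D) entries obtained from the degraded speech."""
        # L_Token: cross entropy against the (label-smoothed) token distribution.
        targets = smoothed_targets(tokens, neighbors)
        l_token = -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        # L_Entry: retrieve entries from the predicted logits with Gumbel-softmax
        # (differentiable) and compare them with entries from the degraded speech.
        z_hat = F.gumbel_softmax(logits, tau=tau, hard=False) @ codebook   # (N, D)
        l_entry = F.mse_loss(z_hat, z_degraded)
        return l_token + lam * l_entry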
Given the importance of the first code index and the hierarchical structure inherent in residual quantization, an architecture based on layer-wise modeling is adopted to enhance performance. The proposed model comprises a transformer decoder and a prediction layer. The transformer decoder integrates 12 transformer blocks with an embedding dimension of 512. The model input encompasses the codec embedding and the auxiliary acoustic conditioner embedding. For the prediction of the second through the last code tokens, an additional embedding generated from the preceding code tokens is incorporated. The prediction layer projects the transformer outputs to 1024 dimensions, which corresponds to the size of the codebook vocabulary.
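The following is a simplified sketch of such a layer-wise code-token predictor, using a standard PyTorch transformer encoder stack as a stand-in for the decoder blocks; the number of attention heads, the fusion by summation, and the input dimensions are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class LayerwiseCodePredictor(nn.Module):
        """Predicts the code tokens of one residual-quantization level (sketch)."""

        def __init__(self, codec_dim=128, cond_dim=256, d_model=512, vocab=1024):
            super().__init__()
            self.codec_proj = nn.Linear(codec_dim, d_model)   # primary input: codec embedding
            self.cond_proj = nn.Linear(cond_dim, d_model)     # auxiliary conditioner embedding
            self.prev_tokens = nn.Embedding(vocab, d_model)   # embedding of preceding-level tokens
            block = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(block, num_layers=12)
            self.head = nn.Linear(d_model, vocab)             # 1024-way prediction layer

        def forward(self, codec_emb, cond_emb, prev_level_tokens=None):
            # codec_emb: (batch, frames, codec_dim); cond_emb: (batch, frames, cond_dim)
            x = self.codec_proj(codec_emb) + self.cond_proj(cond_emb)
            if prev_level_tokens is not None:                 # used for the 2nd to last levels
                x = x + self.prev_tokens(prev_level_tokens)
            return self.head(self.blocks(x))                  # (batch, frames, vocab) logits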
2.3. Comparisons with Conventional DNN Approach

We compare the two proposed pipelines with an established DNN-based approach to speech enhancement, which operates directly on the speech representations and maps noisy speech signals to their denoised counterparts. Specifically, we adopt the DCCRN [5] as the STFT backbone. This ensures uniformity in our experimental setup, wherein we conduct speech enhancement trials on identical datasets for the purpose of comparative analysis.

3. EXPERIMENTAL SETUP

Our experiments are primarily conducted by performing dereverberation of LibriTTS utterances [23], where we extract high-quality clean utterances at 16 kHz as the target. The training subset comprises 147,039 utterances, while the validation subset encompasses 5,566 distinct utterances. Subsequently, evaluation is performed on 4,589 utterances that were not encountered during training. For each utterance, we simulate reverberation by convolving with a simulated room impulse response (RIR). Generating the RIRs entails a stochastic selection of the T60 parameter, denoting the reverberation time, from an interval spanning 0.2 to 1.5 seconds, executed through the image method [27]. In conjunction, the spatial dimensions of the room are determined randomly, with the width, length, and height drawn from 3 to 10 m, 4 to 20 m, and 2.5 to 4 m, respectively. In addition, for the noisy-reverberant dataset simulation, we additionally add environmental noises at signal-to-noise ratios (SNR) ranging from 0 to 40 dB. During training, an Adam optimizer is employed with a batch size of 32 utterances and an initial learning rate of 4e-4, sustained across 400 epochs. We randomly cut a 4-second segment from each training utterance and pad shorter utterances with zeros within each batch to guarantee they are of the same length.
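As an illustration of this simulation recipe, the sketch below draws a room size and T60, generates an image-method RIR with pyroomacoustics, and mixes noise at a random SNR; the library choice, the Sabine-based absorption estimate, the reflection order, and the source/microphone placement margins are assumptions, since the tooling is not specified above.

    import numpy as np
    import pyroomacoustics as pra
    from scipy.signal import fftconvolve

    def simulate(speech, noise, fs=16000, rng=np.random.default_rng()):
        # Room size (m) and reverberation time drawn as described in the setup.
        room_dim = [rng.uniform(3, 10), rng.uniform(4, 20), rng.uniform(2.5, 4)]
        rt60 = rng.uniform(0.2, 1.5)
        volume = float(np.prod(room_dim))
        surface = 2 * (room_dim[0] * room_dim[1] + room_dim[0] * room_dim[2] + room_dim[1] * room_dim[2])
        absorption = min(0.99, 0.161 * volume / (surface * rt60))   # Sabine formula, clamped
        room = pra.ShoeBox(room_dim, fs=fs, materials=pra.Material(absorption), max_order=17)
        # Random source and microphone positions with a 0.5 m margin (assumed).
        room.add_source([rng.uniform(0.5, d - 0.5) for d in room_dim])
        room.add_microphone([rng.uniform(0.5, d - 0.5) for d in room_dim])
        room.compute_rir()
        reverberant = fftconvolve(speech, room.rir[0][0])[: len(speech)]
        # Add noise at a random SNR between 0 and 40 dB for the noisy-reverberant set.
        snr_db = rng.uniform(0, 40)
        noise = noise[: len(reverberant)]
        scale = np.sqrt(np.sum(reverberant**2) / (np.sum(noise**2) * 10 ** (snr_db / 10) + 1e-8))
        return reverberant + scale * noise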
The performance of the models is assessed through a comparative analysis of the resynthesized clean speech against the reference dry clean speech. To this end, two established metrics are employed: short-time objective intelligibility (STOI) [28] and the perceptual evaluation of speech quality (PESQ) [29]. In addition, we provide an evaluation with subjective metrics, which include the non-intrusive DNS-MOS [30] metric and mean opinion scores (MOS) provided by human listeners. The MOS evaluation includes synthetic and real-world meeting data components. The synthetic samples consist of 10 reverberant instances, featuring the input, enhanced samples from the three methodologies, and ground truth samples. The second segment involves 10 real recorded utterances captured in diverse settings using different recording devices. These utterances are intentionally shuffled for each evaluation instance within this segment.
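For reference, the intrusive metrics can be computed with the commonly used pystoi and pesq packages as sketched below (DNS-MOS relies on Microsoft's separate DNSMOS tooling and is not shown); the sample rate and the wideband PESQ mode are assumptions consistent with the 16 kHz setup.

    import soundfile as sf
    from pystoi import stoi
    from pesq import pesq

    def objective_scores(ref_path, est_path, fs=16000):
        ref, _ = sf.read(ref_path)
        est, _ = sf.read(est_path)
        n = min(len(ref), len(est))            # align lengths before scoring
        ref, est = ref[:n], est[:n]
        return {
            "STOI": stoi(ref, est, fs, extended=False),
            "PESQ": pesq(fs, ref, est, "wb"),  # wideband PESQ at 16 kHz
        }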
4. RESULTS AND ANALYSIS

4.1. Comparison of All Pipelines

Table 1: Objective scores for the comparison of all pipelines

                 Reverberant Only            Noisy + Reverberant
                 STOI   PESQ   DNS-MOS       STOI   PESQ   DNS-MOS
Unprocessed      0.663  1.648  2.859         0.624  1.511  2.635
Vocoder Best     0.870  2.472  3.579         0.825  2.121  3.452
Codec Best       0.835  2.102  3.718         0.802  1.916  3.641
STFT Based       0.787  1.948  3.133         0.751  1.801  3.024

Fig. 3: Diagrams of MOS for both synthetic and realistic samples.

Table 1 summarizes the best experimental results of the two generative pipelines and the STFT-based pipeline. The vocoder approach performs best in terms of STOI and PESQ, while the codec approach attains the best DNS-MOS scores and effectively removes background noise and reverberation. A subjective auditory assessment is depicted graphically in Fig. 3. Both the codec and vocoder models surpass the STFT approach on synthetic samples. The difference between the STOI and PESQ scores primarily arises from codec and vocoder characteristics; the oracle codec has lower subjective scores (Tables 4 and 2). Moreover, the high compression rate of the codec may introduce slight resynthesis misalignment, resulting in decreased objective scores. In real-world meeting scenarios, the codec technique presents a distinct advantage over the other two pipelines due to its superior robustness to interference. This is because the codec decoder retains only clear speech patterns, effectively removing noise or reverberation, and utilizing a previously acquired discrete codebook reduces ambiguity in speech recovery.
4.2. Evaluation of the Vocoder Approach

Table 2: Comparison of different inputs to the vocoder using DCCRN

                          STOI   PESQ   DNS-MOS
Unprocessed               0.663  1.648  2.859
Last Layer (SSL Only)     0.718  1.459  3.084
WS (SSL Only)             0.731  1.438  3.163
Mel-spectrogram Only      0.865  2.372  3.525
Mel + WS                  0.869  2.450  3.573
Mel + LWS                 0.870  2.472  3.579
Vocoder Oracle            0.957  3.611  3.740

Table 3: Ablation study on the vocoder

                                         STOI   PESQ   DNS-MOS
Baseline (Mel + LWS)                     0.870  2.472  3.579
Add Adapter (i)                          0.865  2.394  3.567
No Bottleneck (ii)                       0.855  2.377  3.550
SSL Token (iii)                          0.852  2.231  3.457
Use Transformer Instead of DCCRN (iv)    0.867  2.368  3.802

As presented in Table 2, a comparative analysis of diverse inputs was conducted within the framework of the vocoder approach. The first two rows aim to map the SSL embedding directly to the enhanced mel-spectrogram. Inspection of the results reveals that utilizing only SSL embeddings produces unsatisfactory outcomes: the SSL embeddings contain sufficient semantic information, but at the same time lose speaker and timbre information. This information loss impedes the restoration of clean speech. In addition, the weighted sum (WS) of SSL embeddings from multiple layers shows noticeable improvement across all metrics. When combined with the auxiliary input, the performance is considerably improved. In particular, when we use the learnable weighted sum (LWS) of SSL embeddings as the auxiliary input and the mel-spectrogram as the primary input, we obtain the best enhancement performance. Lastly, for reference, the last row provides the upper bound of this approach, where the ground truth mel-spectrogram is employed for the synthesis of the target speech.

Table 3 presents the results of the ablation study on the vocoder approach. Multiple variations have been examined: (i) instead of bottleneck concatenation, directly concatenate SSL embeddings with the mel-spectrogram along the feature dimension; (ii) substitute continuous SSL representations with discrete SSL tokens extracted by k-means; (iii) introduce the residual adapter [31] to extract SSL representations; (iv) employ a Transformer architecture akin to the codec approach as the acoustic enhancer. From the experimental results, we observe that the current design performs best. Adding a residual adapter is not computationally efficient and does not surpass the advantages conferred by the mere employment of LWS. Furthermore, (iv) facilitates a fair comparison with the codec approach.
4.3. Evaluation of the Codec Approach

Table 4: Comparison of different inputs to the codec

                        STOI   PESQ   DNS-MOS
Unprocessed             0.663  1.648  2.857
Code tokens             0.745  1.662  3.628
SSL Embeddings          0.765  1.419  3.682
Codec Embeddings        0.807  1.918  3.685
Mel Spectrograms        0.669  1.272  3.496
+ Mel Spectrograms      0.835  2.102  3.718
+ SSL Embeddings        0.828  2.037  3.727
Mel + SSL Spectrograms  0.825  2.014  3.729
Codec Oracle            0.904  2.764  3.689

Table 5: Ablation studies on the codec

                                               STOI   PESQ   DNS-MOS
Baseline (CE + Entry + Layerwise)              0.835  2.102  3.718
- Entry loss (i)                               0.832  2.071  3.714
Replace Entry Loss with Label Smoothing (ii)   0.833  2.076  3.751
Add Label Smoothing (iii)                      0.821  1.980  3.701
Layer-wise Token Prediction (iv)               0.797  1.960  3.705

A comparative analysis was conducted to assess the impact of various input features on code token prediction, and the results are listed in Table 4. The results show that codec embeddings as the principal input exhibit superior performance compared to alternative inputs, improving all measured metrics. SSL embeddings improve the STOI and DNS-MOS scores but result in suboptimal PESQ performance due to the information loss inherent in SSL embeddings during their extraction. The application of code tokens or mel-spectrograms alone does not yield optimal outcomes, leading to unwanted artifacts in the generated speech. Adding an auxiliary input that contains semantic or acoustic information benefits the enhancement performance. Both mel-spectrograms and SSL embeddings help, but the computational overhead associated with SSL embeddings is notably higher. Consequently, the proposed framework adopts the mel-spectrogram as the preferred auxiliary input. The upper-bound performance is also provided for reference.

Table 5 reports the results of ablation studies on the codec approach. We use the best configuration, featuring layer-wise prediction and the cross entropy (CE) + Entry loss, as the baseline, and compare several variants on the reverberant dataset: (i) train solely with the CE loss for code tokens; (ii) substitute the entry loss with label smoothing; (iii) use these two techniques simultaneously; (iv) instead of predicting code tokens in a layer-wise manner, predict all tokens simultaneously. As shown in the table, both label smoothing and adding the entry loss are beneficial when the code token predictions are not accurate. However, using the two techniques simultaneously produces sub-optimal results; therefore, a solitary technique suffices. Predicting all code tokens simultaneously degrades the performance, as it does not address the importance of the first code and cannot leverage teacher forcing during training.

5. CONCLUSION

In conclusion, this study introduces an innovative approach that leverages pre-trained generative methods to address the long-standing challenge of enhancing speech signal quality in adverse acoustic environments. By employing established vocoder, codec, and self-supervised learning models, the proposed methodology effectively resynthesizes clean and anechoic speech from degraded inputs, mitigating issues like background noise and reverberation. Through empirical evaluations in both simulated and real-world scenarios, the method demonstrates superior subjective scores, showcasing its ability to improve audio fidelity, reduce artifacts, and achieve superior robustness. This research highlights the potential of leveraging generative techniques in speech processing, especially in challenging scenarios where conventional methods fall short.
6. REFERENCES

[1] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 1702–1726, 2018.
[2] K. Han, Y. Wang, and D. L. Wang, "Learning spectral mapping for speech dereverberation," in Proceedings of ICASSP, 2014, pp. 4628–4632.
[3] X. Li and R. Horaud, "Online monaural speech enhancement using delayed subband LSTM," in Proceedings of INTERSPEECH, 2020, pp. 2462–2466.
[4] H.-S. Choi, J.-H. Kim, J. H., A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Proceedings of ICLR, 2018.
[5] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," arXiv:2008.00264, 2020.
[6] Y. Luo and N. Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in Proceedings of ICASSP, 2018, pp. 696–700.
[7] A. Pandey and D. L. Wang, "Dense CNN with self-attention for time-domain speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1270–1279, 2021.
[8] A. Li, W. Liu, C. Zheng, C. Fan, and X. Li, "Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1829–1843, 2021.
[9] H. Wang and D. L. Wang, "Neural cascade architecture with triple-domain loss for speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 734–743, 2021.
[10] W. Rao, Y. Fu, Y. Hu, X. Xu, Y. Jv, J. Han, Z. Jiang, L. Xie, Y. Wang, S. Watanabe, et al., "INTERSPEECH 2021 conferencing speech challenge: Towards far-field multi-channel speech enhancement for video conferencing," arXiv:2104.00960, 2021.
[11] H. Wang and D. L. Wang, "Cross-domain diffusion based speech enhancement for very noisy speech," in Proceedings of ICASSP, 2023, pp. 1–5.
[12] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[13] H. Liu, X. Liu, Q. Kong, Q. Tian, Y. Zhao, D. L. Wang, C. Huang, and Y. Wang, "VoiceFixer: A unified framework for high-fidelity speech restoration," arXiv:2204.05841, 2022.
[14] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[15] Z. Zhong, H. Shi, M. Hirano, K. Shimada, K. Tateishi, T. Shibuya, S. Takahashi, and Y. Mitsufuji, "Extending audio masked autoencoders toward audio restoration," arXiv:2305.06701, 2023.
[16] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, et al., "AudioLM: A language modeling approach to audio generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[17] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv:2301.02111, 2023.
[18] S. Zhou, K. Chan, C. Li, and C. C. Loy, "Towards robust blind face restoration with codebook lookup transformer," Advances in Neural Information Processing Systems, vol. 35, pp. 30599–30611, 2022.
[19] Y. Hu, C. Chen, Q. Zhu, and E. S. Chng, "Wav2code: Restore clean speech representations via codebook lookup for noise-robust ASR," arXiv:2304.04974, 2023.
[20] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1505–1518, 2022.
[21] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv:2210.13438, 2022.
[22] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, et al., "GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio," arXiv:2106.06909, 2021.
[23] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv:1904.02882, 2019.
[24] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of ACL, 2021, pp. 993–1003.
[25] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv:1912.06670, 2019.
[26] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-softmax," arXiv:1611.01144, 2016.
[27] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, pp. 943–950, 1979.
[28] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2125–2136, 2011.
[29] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proceedings of ICASSP, 2001, pp. 749–752.
[30] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proceedings of ICASSP, 2022, pp. 886–890.
[31] S. Otake, R. Kawakami, and N. Inoue, "Parameter efficient transfer learning for various speech processing tasks," in Proceedings of ICASSP, 2023, pp. 1–5.
