
UNIFYING ROBUSTNESS AND FIDELITY: A COMPREHENSIVE STUDY OF PRETRAINED
GENERATIVE METHODS FOR SPEECH ENHANCEMENT IN ADVERSE CONDITIONS

Heming Wang1*, Meng Yu2, Hao Zhang2, Chunlei Zhang2, Zhongweiyang Xu4*, Muqiao Yang3*, Yixuan Zhang1*, Dong Yu2
1 The Ohio State University, USA   2 Tencent AI Lab, USA
3 Carnegie Mellon University, USA   4 University of Illinois Urbana-Champaign, USA

arXiv:2309.09028v1 [eess.AS] 16 Sep 2023

ABSTRACT

Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech that has improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. Especially by leveraging the codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality, reduced background noise, and reduced reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.

Index Terms— speech enhancement, speech vocoder, speech codec, robustness, fidelity

1. INTRODUCTION

In real-world scenarios, speech signals are often degraded by background noise and room reverberation, leading to diminished clarity and comprehensibility. The main aim of speech enhancement is to mitigate the impact of such environmental disturbances. The development of deep neural networks (DNNs) has greatly advanced speech enhancement research. DNNs have shown remarkable proficiency in suppressing background noise and reverberation, yielding satisfactory enhancement results [1]. DNN-based enhancement techniques primarily focus on direct speech signal representations, aiming to establish mappings from noisy inputs to their corresponding clean targets. These representations include, but are not limited to, magnitude [2, 3], complex spectrograms [4, 5], waveforms [6, 7], or a fusion of these features [8, 9], which are all intrinsically associated with the signals. Despite the effectiveness of existing powerful enhancement baselines, their performance often deteriorates notably in complicated real-world scenarios. The enhanced speech obtained by supervised learning based models in such challenging scenarios may retain strong noise or reverberation, and be accompanied by distortions and artifacts [10].

To address these challenges, recent studies aim to leverage the potential of pre-trained models. Some researchers utilized diffusion models to refine speech, employing them to regenerate clean speech based on enhanced priors acquired through pre-trained discriminative models [11, 12]. Another avenue of investigation involves employing speech vocoders for speech resynthesis. For instance, VoiceFixer was proposed to address general speech restoration [13]. It employs an enhancement model on mel-spectrograms and subsequently utilizes the HiFi-GAN [14] vocoder to resynthesize the clean speech. Similarly, [15] proposed to use masked autoencoders for speech restoration, and employs mel-to-mel mapping during pretraining to restore masked audio signals. We believe that discrete representations stored in codebooks are more robust against various interferences, and propose to employ speech codecs to perform speech enhancement. The majority of existing research related to speech codecs [16, 17] is primarily centered around text-to-speech tasks, relying heavily on text embeddings to ensure input stability. We draw inspiration from a parallel study in computer vision [18], which addresses blind face restoration through the regeneration of code tokens within a learned discrete codebook, and are motivated by its exceptional robustness against degradation in both synthetic and real-world datasets. Furthermore, a relevant contribution by Wav2code [19] has also introduced the utilization of codebooks to enhance the resilience of speech representations. Notably, Wav2code focuses more on improving robust automatic speech recognition and operates on self-supervised learning (SSL) embeddings.

This paper systematically investigates two pipelines: one based on a speech vocoder and the other on a speech codec. In both pipelines, our network processes a main input and an auxiliary input to produce intermediate representations, which are then enhanced before generating the desired speech output. We choose this design because generative methods excel in addressing complex situations with significant information loss in speech signals. Utilizing pre-trained models also benefits from leveraging existing semantic or acoustic information, aiding in faithful resynthesis of the original speech and enhancing fidelity. Experimental outcomes with real and synthetic datasets demonstrate the superiority of our proposed pipelines over traditional STFT-based models in terms of robustness and subjective ratings. Additionally, the codec-based approach effectively reduces uncertainty and ambiguity in the restoration mapping, showing notable advantages in real-world scenarios.

* Work done during an internship at Tencent AI Lab.
Fig. 1: The overview of the vocoder pipeline.

Fig. 2: The overview of the codec pipeline.


2. PROPOSED APPROACH

2.1. Vocoder Approach

Fig. 1 illustrates the vocoder approach, wherein a noisy mel-spectrogram is transformed into a clean counterpart using an acoustic enhancer. During inference, we leverage a pre-trained HiFi-GAN vocoder [14] to restore the clean speech. An auxiliary input is produced by employing an SSL conditioner on the SSL features. Specifically, we adopt the WavLM-Large variant of the WavLM model [20] and extract the learnable weighted sum of all layer outputs to produce 1024-dimensional SSL features, which are then processed by the SSL conditioner to extract an SSL embedding of 256 dimensions. This conditioner comprises a three-layer 1-dimensional convolutional network with upsampling, ReLU activation, instance normalization, and a dropout of 0.5.
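To make the conditioning path concrete, the following is a minimal PyTorch sketch of a learnable weighted sum over WavLM layer outputs followed by a small convolutional conditioner. The layer count, upsampling factor, kernel sizes, and module names are illustrative assumptions rather than the exact configuration used here.

    import torch
    import torch.nn as nn

    class SSLConditioner(nn.Module):
        """Learnable weighted sum of WavLM layers + 3-layer Conv1d conditioner (sketch)."""

        def __init__(self, num_layers=25, ssl_dim=1024, out_dim=256, upsample=2):
            super().__init__()
            # One learnable scalar per WavLM layer, normalized with a softmax.
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))
            self.net = nn.Sequential(
                nn.Upsample(scale_factor=upsample, mode="nearest"),  # assumed factor: 20 ms -> 10 ms frames
                nn.Conv1d(ssl_dim, 512, kernel_size=3, padding=1),
                nn.InstanceNorm1d(512),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Conv1d(512, 256, kernel_size=3, padding=1),
                nn.InstanceNorm1d(256),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Conv1d(256, out_dim, kernel_size=3, padding=1),
            )

        def forward(self, hidden_states):
            # hidden_states: (num_layers, batch, frames, ssl_dim), e.g. stacked WavLM outputs.
            w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
            fused = (w * hidden_states).sum(dim=0)       # (batch, frames, ssl_dim)
            return self.net(fused.transpose(1, 2))       # (batch, out_dim, upsampled frames)

The stacked hidden states could come, for example, from a Hugging Face WavLM model run with output_hidden_states=True.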
The acoustic enhancer, based on the deep complex convolutional recurrent network (DCCRN) architecture [5], employs a convolutional encoder-decoder with an LSTM bottleneck. Concretely, DCCRN consists of a six-layer convolutional encoder and decoder, and an LSTM block in the bottleneck to model time dependencies. We adjust the architecture for mel-spectrogram input by removing all complex-value related operations and setting the number of input convolutional channels to 1. The auxiliary input is fed to the bottleneck and concatenated with the input of the LSTM block. To make training more efficient, the vocoder modules are only used during inference. During training, we calculate the L1 loss between enhanced and clean mel-spectrograms. Given the degraded speech input x ∈ R^L, the target clean speech y is of the same length L. For the intermediate representation, we extract 128-band mel-spectrograms at a hop length of 10 ms with a Hann window of 64 ms, resulting in mel features Mel(x) ∈ R^{T×K}, where T denotes the number of time frames and K represents the feature dimension of 128. The training objective for the vocoder approach is then defined as

    L_Vocoder = (1 / (T·K)) Σ_{t=1}^{T} Σ_{k=1}^{K} |M̂ − Mel(y)|,    (1)

where M̂ is the estimated mel-spectrogram and Mel(y) represents the ground truth mel-spectrogram.
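As a reference point, the mel analysis settings above (128 mel bands, 10 ms hop, 64 ms Hann window at 16 kHz) and the L1 objective of Eq. (1) can be sketched as follows; the n_fft choice and the log compression are assumptions, not details specified in the paper.

    import torch
    import torchaudio

    # 16 kHz audio: 64 ms window = 1024 samples, 10 ms hop = 160 samples, 128 mel bands.
    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, win_length=1024, hop_length=160, n_mels=128
    )

    def mel(x):
        # Log compression is a common choice for vocoder features (assumed here).
        return torch.log(mel_fn(x).clamp(min=1e-5))

    def vocoder_loss(m_hat, y):
        # Eq. (1): mean absolute error over all T frames and K mel bins.
        return torch.nn.functional.l1_loss(m_hat, mel(y))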
2.2. Codec Approach

We depict the pipeline of the codec approach in Fig. 2. The implementation entails supervised enhancement learning within the code token space: we attempt to obtain the code tokens of the target speech and then use a pre-trained speech decoder to restore the clean speech. The code enhancer is designed to predict clean code tokens based on the primary input codec embedding and the auxiliary input mel-spectrograms. This undertaking is similar to a classification task over code tokens. Initial attempts to predict tokens corresponding to clean speech encountered challenges. Firstly, most feature encoders of existing codecs are not trained on degraded speech utterances. This inconsistency between the corrupted features at the codec input and the accurate derivation of code tokens by the codec led to instability in the input code tokens, thereby yielding suboptimal enhancement outcomes. Furthermore, predicting speech embeddings (either pre- or post-vector quantization) is comparatively simpler; however, the speech generated by the decoder may contain distortions, as the predicted embeddings may not align well with the patterns pre-stored in the codebooks, consequently affecting enhancement performance.

To address these issues, we propose several techniques. Firstly, a generalized codec architecture is adopted: an EnCodec [21] trained on utterances from multiple languages retrieved from GigaSpeech [22], LibriTTS [23], VP10K [24], and Common Voice [25], augmented with 20% probability with simulated noise and reverberation. Additionally, encoder embeddings are employed as primary inputs during training, alongside mel-spectrograms of the input speech as auxiliary input. Finally, we work on the training target and training objective. Two techniques are investigated to mitigate prediction errors when the correct tokens cannot be retrieved. The first is label smoothing. Specifically, principal component analysis is applied to the target tokens, and they are sorted based on the absolute values of the retrieved codebook entries. Label smoothing is then incorporated for neighboring tokens, where we set the target token with a probability of 0.9 and its neighbor tokens with 0.05. For the other technique, we retrieve the codebook entries Z from the tokens C using the Gumbel-softmax layer [26] in a fully differentiable way, which forms quantized representations Zc ∈ R^{N×D}, where N indicates the total number of code tokens and D represents the feature dimension of each codebook entry. This can be formulated as

    L_Codec = L_Token + λ L_Entry,
    L_Token = (1/N) Σ_{i=1}^{N} C_i log(Ĉ_i),
    L_Entry = (1 / (N·D)) Σ_{i=1}^{N} Σ_{j=1}^{D} (Z_d − Ẑ_c)^2,    (2)

where L_Token measures the cross-entropy loss between the predicted code tokens Ĉ and the target tokens C, and the second term L_Entry measures the mean squared loss between the codebook entries Z_d obtained from the degraded speech and the retrieved entries Z_c acquired from the predicted tokens Ĉ. The coefficient λ is empirically chosen to be 0.5.
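A minimal sketch of this combined objective is given below, assuming a single codebook of 1024 entries, a precomputed neighbor table for the PCA-sorted label smoothing, and Gumbel-softmax retrieval of codebook entries; all tensor shapes and helper names are illustrative, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def smoothed_targets(tokens, neighbors, vocab=1024, p_main=0.9, p_nb=0.05):
        """tokens: (N,) target code indices; neighbors: (vocab, 2) indices of the two
        nearest codebook entries under the PCA-based ordering (assumed precomputed)."""
        dist = torch.zeros(tokens.numel(), vocab)
        dist[torch.arange(tokens.numel()), tokens] = p_main
        for j in range(neighbors.shape[1]):
            dist[torch.arange(tokens.numel()), neighbors[tokens, j]] += p_nb
        return dist

    def codec_loss(logits, tokens, codebook, z_degraded, neighbors, lam=0.5, tau=1.0):
        """logits: (N, vocab) predicted token logits; codebook: (vocab, D) entries;
        z_degraded: (N, D) entries obtained from the degraded speech."""
        # L_Token: cross entropy against the (label-smoothed) token distribution.
        targets = smoothed_targets(tokens, neighbors)
        l_token = -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        # L_Entry: retrieve entries from the predicted logits with Gumbel-softmax
        # (differentiable) and compare them with entries from the degraded speech.
        z_hat = F.gumbel_softmax(logits, tau=tau, hard=False) @ codebook   # (N, D)
        l_entry = F.mse_loss(z_hat, z_degraded)
        return l_token + lam * l_entry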
Given the importance of the first code index and the hierarchical structure inherent in residual quantization, an architecture based on layer-wise modeling is adopted to enhance performance. The proposed model comprises a transformer decoder and a prediction layer. The transformer decoder integrates 12 transformer blocks with an embedding dimension of 512. The model input encompasses the codec embedding and the auxiliary acoustic conditioner embedding. For the prediction of the second through the last code tokens, an additional embedding generated from the preceding code tokens is incorporated. The prediction layer projects the transformer outputs to 1024 dimensions, which corresponds to the size of the codebook vocabulary.
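The following is a simplified sketch of such a layer-wise code-token predictor, using a standard PyTorch transformer encoder stack as a stand-in for the decoder blocks; the number of attention heads, the fusion by summation, and the input dimensions are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class LayerwiseCodePredictor(nn.Module):
        """Predicts the code tokens of one residual-quantization level (sketch)."""

        def __init__(self, codec_dim=128, cond_dim=256, d_model=512, vocab=1024):
            super().__init__()
            self.codec_proj = nn.Linear(codec_dim, d_model)   # primary input: codec embedding
            self.cond_proj = nn.Linear(cond_dim, d_model)     # auxiliary conditioner embedding
            self.prev_tokens = nn.Embedding(vocab, d_model)   # embedding of preceding-level tokens
            block = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(block, num_layers=12)
            self.head = nn.Linear(d_model, vocab)             # 1024-way prediction layer

        def forward(self, codec_emb, cond_emb, prev_level_tokens=None):
            # codec_emb: (batch, frames, codec_dim); cond_emb: (batch, frames, cond_dim)
            x = self.codec_proj(codec_emb) + self.cond_proj(cond_emb)
            if prev_level_tokens is not None:                 # used for the 2nd to last levels
                x = x + self.prev_tokens(prev_level_tokens)
            return self.head(self.blocks(x))                  # (batch, frames, vocab) logits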
2.3. Comparisons with Conventional DNN Approach

We compare the two proposed pipelines with an established DNN-based approach to speech enhancement, which operates directly on the speech representations and maps noisy speech signals to their denoised counterparts. Specifically, we adopt the DCCRN [5] as the STFT backbone. This ensures uniformity in our experimental setup, wherein we conduct speech enhancement trials on identical datasets for the purpose of comparative analysis.

3. EXPERIMENTAL SETUP

Our experiments are primarily conducted by performing dereverberation of LibriTTS utterances [23], where we extract high-quality clean utterances at 16 kHz as the target. The training subset comprises 147,039 utterances, while the validation subset encompasses 5,566 distinct utterances. Subsequently, evaluation is performed on 4,589 utterances that were not encountered during training. For each utterance, we simulate reverberation by convolving with a simulated room impulse response (RIR). Generating the RIRs entails a stochastic selection of the T60 parameter, denoting the reverberation time, from an interval spanning 0.2 to 1.5 seconds, executed through the image method [27]. In conjunction, the spatial dimensions of the room are determined randomly, with the width, length, and height drawn from 3 to 10 m, 4 to 20 m, and 2.5 to 4 m, respectively. In addition, for the noisy-reverberant dataset simulation, we additionally add environmental noises at signal-to-noise ratios (SNR) ranging from 0 to 40 dB. During training, an Adam optimizer is employed with a batch size of 32 utterances and an initial learning rate of 4e-4, sustained across 400 epochs. We randomly cut a 4-second segment from each training utterance and pad shorter utterances with zeros within each batch to guarantee they are of the same length.
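As an illustration of this simulation recipe, the sketch below draws a room size and T60, generates an image-method RIR with pyroomacoustics, and mixes noise at a random SNR; the library choice, the Sabine-based absorption estimate, the reflection order, and the source/microphone placement margins are assumptions, since the tooling is not specified above.

    import numpy as np
    import pyroomacoustics as pra
    from scipy.signal import fftconvolve

    def simulate(speech, noise, fs=16000, rng=np.random.default_rng()):
        # Room size (m) and reverberation time drawn as described in the setup.
        room_dim = [rng.uniform(3, 10), rng.uniform(4, 20), rng.uniform(2.5, 4)]
        rt60 = rng.uniform(0.2, 1.5)
        volume = float(np.prod(room_dim))
        surface = 2 * (room_dim[0] * room_dim[1] + room_dim[0] * room_dim[2] + room_dim[1] * room_dim[2])
        absorption = min(0.99, 0.161 * volume / (surface * rt60))   # Sabine formula, clamped
        room = pra.ShoeBox(room_dim, fs=fs, materials=pra.Material(absorption), max_order=17)
        # Random source and microphone positions with a 0.5 m margin (assumed).
        room.add_source([rng.uniform(0.5, d - 0.5) for d in room_dim])
        room.add_microphone([rng.uniform(0.5, d - 0.5) for d in room_dim])
        room.compute_rir()
        reverberant = fftconvolve(speech, room.rir[0][0])[: len(speech)]
        # Add noise at a random SNR between 0 and 40 dB for the noisy-reverberant set.
        snr_db = rng.uniform(0, 40)
        noise = noise[: len(reverberant)]
        scale = np.sqrt(np.sum(reverberant**2) / (np.sum(noise**2) * 10 ** (snr_db / 10) + 1e-8))
        return reverberant + scale * noise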
The performance of the models is assessed through a comparative analysis of the resynthesized clean speech against the reference dry clean speech. To this end, two established metrics are employed: short-time objective intelligibility (STOI) [28] and the perceptual evaluation of speech quality (PESQ) [29]. In addition, we provide an evaluation with subjective metrics, which include the non-intrusive DNS-MOS [30] metric and mean opinion scores (MOS) provided by human listeners. The MOS evaluation includes synthetic and real-world meeting data components. The synthetic samples consist of 10 reverberant instances, featuring the input, enhanced samples from the three methodologies, and ground truth samples. The second segment involves 10 real recorded utterances captured in diverse settings using different recording devices. These utterances are intentionally shuffled for each evaluation instance within this segment.
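For reference, the intrusive metrics can be computed with the commonly used pystoi and pesq packages as sketched below (DNS-MOS relies on Microsoft's separate DNSMOS tooling and is not shown); the sample rate and the wideband PESQ mode are assumptions consistent with the 16 kHz setup.

    import soundfile as sf
    from pystoi import stoi
    from pesq import pesq

    def objective_scores(ref_path, est_path, fs=16000):
        ref, _ = sf.read(ref_path)
        est, _ = sf.read(est_path)
        n = min(len(ref), len(est))            # align lengths before scoring
        ref, est = ref[:n], est[:n]
        return {
            "STOI": stoi(ref, est, fs, extended=False),
            "PESQ": pesq(fs, ref, est, "wb"),  # wideband PESQ at 16 kHz
        }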
4. RESULTS AND ANALYSIS

4.1. Comparison of All Pipelines

Table 1: Objective scores for the comparison of all pipelines

                 Reverberant Only            Noisy + Reverberant
                 STOI   PESQ   DNS-MOS       STOI   PESQ   DNS-MOS
Unprocessed      0.663  1.648  2.859         0.624  1.511  2.635
Vocoder Best     0.870  2.472  3.579         0.825  2.121  3.452
Codec Best       0.835  2.102  3.718         0.802  1.916  3.641
STFT Based       0.787  1.948  3.133         0.751  1.801  3.024

Fig. 3: Diagrams of MOS for both synthetic and realistic samples.

Table 1 summarizes the best experimental results of the two generative pipelines and the STFT-based pipeline. The vocoder approach performs best in terms of STOI and PESQ, while the codec approach attains the best DNS-MOS scores and effectively removes background noise and reverberation. A subjective auditory assessment is depicted graphically in Fig. 3. Both the codec and vocoder models surpass the STFT approach on synthetic samples. The difference between the STOI and PESQ scores primarily arises from codec and vocoder characteristics; the oracle codec has lower subjective scores (Tables 4 and 2). Moreover, the high compression rate of the codec may introduce slight resynthesis misalignment, resulting in decreased objective scores. In real-world meeting scenarios, the codec technique presents a distinct advantage over the other two pipelines due to its superior robustness to interference. This is because the codec decoder retains only clear speech patterns, effectively removing noise or reverberation, and utilizing a previously acquired discrete codebook reduces ambiguity in speech recovery.
4.2. Evaluation of the Vocoder Approach

Table 2: Comparison of different inputs to the vocoder using DCCRN

                          STOI   PESQ   DNS-MOS
Unprocessed               0.663  1.648  2.859
Last Layer (SSL Only)     0.718  1.459  3.084
WS (SSL Only)             0.731  1.438  3.163
Mel-spectrogram Only      0.865  2.372  3.525
Mel + WS                  0.869  2.450  3.573
Mel + LWS                 0.870  2.472  3.579
Vocoder Oracle            0.957  3.611  3.740

Table 3: Ablation study on the vocoder

                                         STOI   PESQ   DNS-MOS
Baseline (Mel + LWS)                     0.870  2.472  3.579
Add Adapter (i)                          0.865  2.394  3.567
No Bottleneck (ii)                       0.855  2.377  3.550
SSL Token (iii)                          0.852  2.231  3.457
Use Transformer Instead of DCCRN (iv)    0.867  2.368  3.802

As presented in Table 2, a comparative analysis of diverse inputs was conducted within the framework of the vocoder approach. The first two rows aim to map the SSL embedding directly to the enhanced mel-spectrogram. Inspection of the results reveals that utilizing only SSL embeddings produces unsatisfactory outcomes: the SSL embeddings contain sufficient semantic information, but at the same time lose speaker and timbre information. This information loss impedes the restoration of clean speech. In addition, the weighted sum (WS) of SSL embeddings from multiple layers shows noticeable improvement across all metrics. When combined with the auxiliary input, the performance is considerably improved. In particular, when we use the learnable weighted sum (LWS) of SSL embeddings as the auxiliary input and the mel-spectrogram as the primary input, we obtain the best enhancement performance. Lastly, for reference, the last row provides the upper bound of this approach, where the ground truth mel-spectrogram is employed for the synthesis of the target speech.

Table 3 presents the results of the ablation study on the vocoder approach. Multiple variations have been examined: (i) instead of bottleneck concatenation, directly concatenate SSL embeddings with the mel-spectrogram along the feature dimension; (ii) substitute continuous SSL representations with discrete SSL tokens extracted by k-means; (iii) introduce the residual adapter [31] to extract SSL representations; (iv) employ a Transformer architecture akin to the codec approach as the acoustic enhancer. From the experimental results, we observe that the current design performs best. Adding a residual adapter is not computationally efficient and does not surpass the advantages conferred by the mere employment of LWS. Furthermore, (iv) facilitates a fair comparison with the codec approach.
4.3. Evaluation of the Codec Approach

Table 4: Comparison of different inputs to the codec

                        STOI   PESQ   DNS-MOS
Unprocessed             0.663  1.648  2.857
Code tokens             0.745  1.662  3.628
SSL Embeddings          0.765  1.419  3.682
Codec Embeddings        0.807  1.918  3.685
Mel Spectrograms        0.669  1.272  3.496
+ Mel Spectrograms      0.835  2.102  3.718
+ SSL Embeddings        0.828  2.037  3.727
Mel + SSL Spectrograms  0.825  2.014  3.729
Codec Oracle            0.904  2.764  3.689

Table 5: Ablation studies on the codec

                                               STOI   PESQ   DNS-MOS
Baseline (CE + Entry + Layerwise)              0.835  2.102  3.718
- Entry loss (i)                               0.832  2.071  3.714
Replace Entry Loss with Label Smoothing (ii)   0.833  2.076  3.751
Add Label Smoothing (iii)                      0.821  1.980  3.701
Layer-wise Token Prediction (iv)               0.797  1.960  3.705

A comparative analysis was conducted to assess the impact of various input features on code token prediction, and the results are listed in Table 4. The results show that codec embeddings as the principal input exhibit superior performance compared to alternative inputs, improving all measured metrics. SSL embeddings improve the STOI and DNS-MOS scores but result in suboptimal PESQ performance due to the information loss inherent in SSL embeddings during their extraction. The application of code tokens or mel-spectrograms alone does not yield optimal outcomes, leading to unwanted artifacts in the generated speech. Adding an auxiliary input that contains semantic or acoustic information benefits the enhancement performance. Both mel-spectrograms and SSL embeddings help, but the computational overhead associated with SSL embeddings is notably higher. Consequently, the proposed framework adopts the mel-spectrogram as the preferred auxiliary input. The upper-bound performance is also provided for reference.

Table 5 reports the results of ablation studies on the codec approach. We use the best configuration, featuring layer-wise prediction and the cross entropy (CE) + Entry loss, as the baseline, and compare several variants on the reverberant dataset: (i) train solely with the CE loss for code tokens; (ii) substitute the entry loss with label smoothing; (iii) use these two techniques simultaneously; (iv) instead of predicting code tokens in a layer-wise manner, predict all tokens simultaneously. As shown in the table, both label smoothing and adding the entry loss are beneficial when the code token predictions are not accurate. However, using the two techniques simultaneously produces sub-optimal results; therefore, a solitary technique suffices. Predicting all code tokens simultaneously degrades the performance, as it does not address the importance of the first code and cannot leverage teacher forcing during training.

5. CONCLUSION

In conclusion, this study introduces an innovative approach that leverages pre-trained generative methods to address the long-standing challenge of enhancing speech signal quality in adverse acoustic environments. By employing established vocoder, codec, and self-supervised learning models, the proposed methodology effectively resynthesizes clean and anechoic speech from degraded inputs, mitigating issues like background noise and reverberation. Through empirical evaluations in both simulated and real-world scenarios, the method demonstrates superior subjective scores, showcasing its ability to improve audio fidelity, reduce artifacts, and achieve superior robustness. This research highlights the potential of leveraging generative techniques in speech processing, especially in challenging scenarios where conventional methods fall short.
6. REFERENCES

[1] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 1702–1726, 2018.
[2] K. Han, Y. Wang, and D. L. Wang, "Learning spectral mapping for speech dereverberation," in Proceedings of ICASSP, 2014, pp. 4628–4632.
[3] X. Li and R. Horaud, "Online monaural speech enhancement using delayed subband LSTM," in Proceedings of INTERSPEECH, 2020, pp. 2462–2466.
[4] H.-S. Choi, J.-H. Kim, J. H., A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Proceedings of ICLR, 2018.
[5] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," arXiv:2008.00264, 2020.
[6] Y. Luo and N. Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in Proceedings of ICASSP, 2018, pp. 696–700.
[7] A. Pandey and D. L. Wang, "Dense CNN with self-attention for time-domain speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1270–1279, 2021.
[8] A. Li, W. Liu, C. Zheng, C. Fan, and X. Li, "Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1829–1843, 2021.
[9] H. Wang and D. L. Wang, "Neural cascade architecture with triple-domain loss for speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 734–743, 2021.
[10] W. Rao, Y. Fu, Y. Hu, X. Xu, Y. Jv, J. Han, Z. Jiang, L. Xie, Y. Wang, S. Watanabe, et al., "INTERSPEECH 2021 conferencing speech challenge: Towards far-field multi-channel speech enhancement for video conferencing," arXiv:2104.00960, 2021.
[11] H. Wang and D. L. Wang, "Cross-domain diffusion based speech enhancement for very noisy speech," in Proceedings of ICASSP, 2023, pp. 1–5.
[12] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[13] H. Liu, X. Liu, Q. Kong, Q. Tian, Y. Zhao, D. L. Wang, C. Huang, and Y. Wang, "VoiceFixer: A unified framework for high-fidelity speech restoration," arXiv:2204.05841, 2022.
[14] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[15] Z. Zhong, H. Shi, M. Hirano, K. Shimada, K. Tateishi, T. Shibuya, S. Takahashi, and Y. Mitsufuji, "Extending audio masked autoencoders toward audio restoration," arXiv:2305.06701, 2023.
[16] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, et al., "AudioLM: A language modeling approach to audio generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[17] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv:2301.02111, 2023.
[18] S. Zhou, K. Chan, C. Li, and C. C. Loy, "Towards robust blind face restoration with codebook lookup transformer," Advances in Neural Information Processing Systems, vol. 35, pp. 30599–30611, 2022.
[19] Y. Hu, C. Chen, Q. Zhu, and E. S. Chng, "Wav2code: Restore clean speech representations via codebook lookup for noise-robust ASR," arXiv:2304.04974, 2023.
[20] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1505–1518, 2022.
[21] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv:2210.13438, 2022.
[22] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, et al., "GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio," arXiv:2106.06909, 2021.
[23] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv:1904.02882, 2019.
[24] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of ACL, 2021, pp. 993–1003.
[25] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv:1912.06670, 2019.
[26] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-softmax," arXiv:1611.01144, 2016.
[27] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, pp. 943–950, 1979.
[28] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2125–2136, 2011.
[29] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proceedings of ICASSP, 2001, pp. 749–752.
[30] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proceedings of ICASSP, 2022, pp. 886–890.
[31] S. Otake, R. Kawakami, and N. Inoue, "Parameter efficient transfer learning for various speech processing tasks," in Proceedings of ICASSP, 2023, pp. 1–5.
