Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?
Yuankun Xie1,†, Chenxu Xiong2,†, Xiaopeng Wang3,4, Zhiyong Wang3,4, Yi Lu3,4, Xin Qi3,4, Ruibo Fu3,*, Yukun Liu4, Zhengqi Wen6, Jianhua Tao5,6, Guanjun Li3, Long Ye1

1 State Key Laboratory of Media Convergence and Communication, Communication University of China
2 SDU-ANU Joint Science College, Shandong University, Weihai
3 Institute of Automation, Chinese Academy of Sciences
4 School of Artificial Intelligence, University of Chinese Academy of Sciences
5 Department of Automation, Tsinghua University
6 Beijing National Research Center for Information Science and Technology, Tsinghua University

[email protected], [email protected]

Abstract
Currently, Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs. These ALMs have significantly lowered the barrier to creating deepfake audio, generating highly realistic and diverse types of deepfake audio, which pose severe threats to society. Consequently, effective audio deepfake detection technologies to detect ALM-based audio have become increasingly critical. This paper investigates the effectiveness of current countermeasures (CMs) against ALM-based audio. Specifically, we collect 12 types of the latest ALM-based deepfake audio and utilize the latest CMs to evaluate them. Our findings reveal that the latest codec-trained CM can effectively detect ALM-based audio, achieving 0% equal error rate under most ALM test conditions, which exceeded our expectations. This indicates promising directions for future research in ALM-based deepfake audio detection.
Index Terms: audio language model, audio neural codec, audio deepfake detection, countermeasure.

Figure 1: ALM-based deepfake audio generation pipeline. (A forger selects an Audio Language Model such as VALL-E, SpeechX, UniAudio, or NaturalSpeech and a generation task such as TTS, VC, SVS, SVC, TTSO, AG, MG, or TTM to produce ALM-based audio from real audio.)

Figure 2: Does current countermeasure effectively detect ALM-based deepfake audio? (Real audio and ALM-based audio, indistinguishable to humans, are both fed to the countermeasure.)
1. Introduction

Currently, due to the rapid development of large language models and audio neural codecs, there have been significant advancements in audio generation models. We typically refer to these novel types of audio generation models as Audio Language Models (ALMs) [1-10]. These ALM-based audio generation models have lowered the barrier to creating deepfake audio, making the process significantly easier and more accessible. On the other hand, ALM-based audio is characterized by its highly diverse types and remarkable realism. As shown in Fig. 1, a forger can easily choose any ALM model and generate different types of deepfake audio, such as speech, singing voice, music, and sound, for various forgery tasks. These high-fidelity deepfake audio files are difficult for humans to discern, and the latest ALM models [11] have even achieved human parity in zero-shot text-to-speech (TTS) synthesis for the first time. This poses significant threats, including fraud, misleading public opinion, and privacy violations. Therefore, the urgent development of effective audio deepfake detection technologies is crucial.

As for countermeasures (CMs), the study of audio deepfake detection has been increasing, and many significant works have emerged [12-16] around the ASVspoof competition series [17, 18]. In recent years, research in the audio deepfake detection (ADD) field has gradually shifted from improving in-distribution (ID) performance to focusing on generalization to the wild domain. Tak et al. [12] proposed a countermeasure (CM) based on self-supervised learning for front-end fine-tuning and AASIST for back-end processing, which demonstrates remarkable generalization capabilities. Xie et al. [13] used three source domains for co-training and learned a self-supervised domain-invariant representation to improve the generalization of the CM. Wang et al. [14] proposed a CM based on stable learning to address the issue of distribution shifts across different domains. Additionally, Wang et al. [15] utilized a masked autoencoder and proposed a reconstruction learning method focused on real data to enhance the model's ability to detect unknown forgery patterns. Despite these advancements, a crucial question remains: Does the current deepfake audio detection model effectively detect ALM-based deepfake audio?

In this paper, we aim to address this question by collecting as many ALM-based audio samples as possible and using the latest CMs for deepfake detection. Specifically, we collected 12 types of ALM-based deepfake audio, denoted as A01-A12. Since most ALMs are not open-sourced, the majority of the audio samples come from demo pages. The audio from demo pages is typically difficult for humans to distinguish as real or fake, posing a significant challenge for CM detection, as shown in Fig. 2. For the CM, the latest work on detecting ALM audio is Codecfake [19], which focuses on the generation mechanism of ALMs, the neural codec, for deepfake detection. We take advantage of Codecfake by training both a traditional vocoder-trained CM and a codec-trained CM and testing them on the ALM conditions A01 to A12. The experiments demonstrate that the codec-trained CM achieves the lowest average EER and 0% EER under most ALM conditions, revealing that the current CM, specifically the codec-trained CM, can effectively detect ALM-based audio.

† denotes equal contribution to this work. * denotes corresponding author.
Figure 3: Generation and countermeasure pipeline for ALM-based deepfake audio. (Left: the audio language model tokenizes text and encodes real audio with a neural codec encoder into discrete codes, a language model such as a Transformer or Mamba predicts new codes, and a neural codec decoder such as Encodec or FunCodec reconstructs the deepfake audio. Right: the audio deepfake detection model feeds handcrafted or self-supervised front-end features into a backbone such as LCNN or AASIST to classify the input as real or fake.)

2. ALM-based deepfake audio

2.1. Generation pipeline
ALMs have rapidly developed due to advancements in both language models (LMs) and neural codec models. The left part of Fig. 3 illustrates the pipeline used by most ALM models for generation. First, the audio waveform is converted into discrete code representations through the encoder part of the neural codec. Then, the LM decoder performs contextual learning, where the discrete quantized tokens contain information about the speaker style. Finally, the non-autoregressive neural codec decoder generates the audio waveform from the discrete codes.
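To make this three-stage flow concrete, the following is a minimal sketch under stated assumptions: the encoder, LM, and decoder objects and the `generate(condition=..., prefix=...)` call are hypothetical placeholders, not the API of any specific ALM or codec implementation.

```python
import torch
import torch.nn as nn


class ALMPipeline(nn.Module):
    """Hypothetical sketch of the ALM generation pipeline in Fig. 3:
    codec encoder -> discrete codes -> language model -> codec decoder."""

    def __init__(self, codec_encoder, lm_decoder, codec_decoder):
        super().__init__()
        self.codec_encoder = codec_encoder  # waveform -> discrete code indices
        self.lm_decoder = lm_decoder        # autoregressive LM over code tokens
        self.codec_decoder = codec_decoder  # non-autoregressive codes -> waveform

    @torch.no_grad()
    def generate(self, text_tokens: torch.Tensor, prompt_wav: torch.Tensor) -> torch.Tensor:
        # 1) Quantize the acoustic prompt into discrete codes; these tokens
        #    carry the speaker-style information mentioned in Section 2.1.
        prompt_codes = self.codec_encoder(prompt_wav)            # (B, T_prompt)
        # 2) The LM continues the code sequence, conditioned on the text.
        generated_codes = self.lm_decoder.generate(
            condition=text_tokens, prefix=prompt_codes)          # (B, T_gen)
        # 3) The non-autoregressive codec decoder reconstructs the waveform.
        return self.codec_decoder(generated_codes)               # (B, samples)
```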
2.2. Data collection
We collected 12 types of ALM-based audio, denoted as A01-A12. All collected audio samples can be found on the website1. The numbering principle follows the publication date of the ALM works. Since most ALMs are not open-sourced, the majority of ALM-based audio samples come from demos.
A01-AudioLM [1]. AudioLM is a framework for high-quality audio generation with long-term consistency that maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We collected 48 fake speech samples and 28 real speech samples from the generation and continuation tasks on demo pages2. Real speech comes from the ground truth or the speaker prompt, whereas AudioLM-generated speech is categorized as fake speech.
A02-AudioLM-music [1]. AudioLM is not limited to modeling speech; it can also learn to generate coherent piano music continuations. Four pairs of music segments were collected from the piano continuation task on demo pages3. Real audio comes from the 4-second piano prompts, whereas AudioLM-generated music is categorized as fake audio.
A03-VALL-E [3]. VALL-E is a neural codec language model designed to generate discrete codes derived from EnCodec [20], utilizing either textual or acoustic inputs. We used condition A1 from the Codecfake dataset4 for testing, which includes 4,451 real audio samples and 4,436 fake audio samples synthesized by VALL-E.
A04-VALL-E X [5]. VALL-E X can generate high-quality audio in the target language with just a single utterance of the source-language audio as a prompt. Similar to VALL-E, we used condition A2 from the Codecfake dataset for testing, which includes 4,451 real audio samples and 4,436 fake audio samples.
A05-SpeechX [8]. SpeechX is a versatile speech generation model leveraging audio and text prompts, which can deal with both clean and noisy speech inputs and perform zero-shot TTS and various tasks involving transforming the input speech. We collected 16 fake speech samples and 16 real speech samples from the TTS, content editing, and target speaker extraction tasks on demo pages5. Real speech comes from the ground truth or the speaker prompt, whereas SpeechX-generated speech samples are categorized as fake speech.
A06-UniAudio [10]. UniAudio is a versatile audio generation model that conditions on multiple types of inputs and performs a variety of audio generation tasks. It treats all modalities as discrete tokens. We collected 30 fake speech samples and 18 real speech samples from the TTS tasks on demo pages6. Real speech comes from the ground truth or the speaker prompt, whereas UniAudio-generated speech is categorized as fake speech.
A07-LauraGPT [9]. LauraGPT can take both audio and text as input and output in both modalities, and it performs a broad spectrum of content-oriented and speech-signal-related tasks. We collected 8 fake speech samples and 8 real speech samples from the TTS tasks on demo pages7. Real speech samples come from the ground truth or the speaker prompt, whereas LauraGPT-generated speech samples are categorized as fake speech.
A08-ELLA-V [21]. ELLA-V is a simple but efficient LM-based zero-shot TTS framework, which enables fine-grained control over synthesized audio at the phoneme level. We collected 14 real speech samples and 32 fake speech samples from the generation and cloning tasks on demo pages8. Real speech comes from the ground truth or the speaker prompt (speaker prompt-encodec), whereas ground-truth Encodec-regenerated and ELLA-V-generated speech samples are categorized as fake speech.

1 https://github.com/xieyuankun/ALM-ADD
2 https://google-research.github.io/seanet/audiolm/examples/
3 https://google-research.github.io/seanet/audiolm/examples/
4 https://zenodo.org/records/11169781
5 https://www.microsoft.com/en-us/research/project/speechx/
6 https://uniaudio666.github.io/demo_UniAudio/
7 https://lauragpt.github.io/
8 https://ereboas.github.io/ELLAV
A09-HAM-TTS [22]. HAM-TTS is a novel TTS system that leverages a hierarchical acoustic modeling approach. We collected 35 fake speech samples and 6 real speech samples from the TTS task on demo pages9. Real speech samples come from the ground truth or the speaker prompt, whereas HAM-TTS-generated speech samples are categorized as fake speech.
A10-RALL-E [23]. RALL-E is a robust language modeling method for text-to-speech synthesis that improves performance and reduces errors by using chain-of-thought prompting to decompose the task into simpler steps. We collected 5 fake speech samples and 5 real speech samples from the TTS tasks on demo pages10. Real speech comes from the ground truth, whereas RALL-E-generated speech samples are categorized as fake speech.
A11-NaturalSpeech 3 [24]. NaturalSpeech 3 is a TTS system that enhances speech quality, similarity, and prosody by using factorized diffusion models and factorized vector quantization to disentangle and generate speech attributes in a zero-shot manner. We collected 32 fake speech samples and 24 real speech samples from the TTS tasks on demo pages11. Real speech samples come from the ground truth or the speaker prompt, whereas NaturalSpeech 3-generated speech samples are categorized as fake speech.
A12-VALL-E 2 [11]. VALL-E 2 is the latest advancement in neural codec language models and marks a milestone in zero-shot TTS, achieving human parity for the first time. We collected 91 fake speech samples and 33 real speech samples from demo pages12. Real speech samples come from the speaker prompt, whereas VALL-E 2-generated speech samples are categorized as fake speech.

3. Countermeasure

For the countermeasure, we consider two aspects: the training datasets and the audio deepfake detection (ADD) model. For the datasets, we selected models trained on the classic vocoder-based deepfake dataset ASVspoof2019LA and the latest codec-based deepfake dataset Codecfake. This approach allows us to verify whether countermeasures trained on traditional vocoder datasets can effectively detect ALM-based audio, as well as to evaluate the effectiveness of a CM trained on the Codecfake dataset in practical in-the-wild ALM tests.
As features for the ADD model, we selected the handcrafted Mel-spectrogram and the pre-trained wav2vec2-xls-r representation [25]. For the Mel-spectrogram, we used 80-dimensional Mel-spectrograms to match the features of most conventional audio generation tasks. For wav2vec2-xls-r, we froze the weights and used its fifth hidden layer as the feature, due to its proven superiority in previous research [26]. For the backbone networks, we chose LCNN [27] and AASIST [28], which are currently the most commonly used backbone networks in the field of ADD. The CM pipeline is shown in Fig. 3.
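As an illustration of the two front-ends described above, the sketch below extracts an 80-dimensional Mel-spectrogram with torchaudio and frozen Wav2Vec2-XLS-R hidden states with the HuggingFace transformers library. Treating `hidden_states[5]` as the "fifth hidden layer" is our assumption about the indexing convention; the paper does not specify it.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Model


def mel_frontend(wav_16k: torch.Tensor) -> torch.Tensor:
    """80-dimensional Mel-spectrogram front-end; wav_16k is (B, samples) at 16 kHz."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
    return mel(wav_16k)  # (B, 80, frames)


class XLSRFrontend(torch.nn.Module):
    """Frozen Wav2Vec2-XLS-R front-end returning one 1024-dim hidden layer."""

    def __init__(self, layer: int = 5):
        super().__init__()
        self.model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
        self.model.eval()
        for p in self.model.parameters():  # freeze the SSL weights
            p.requires_grad = False
        self.layer = layer

    @torch.no_grad()
    def forward(self, wav_16k: torch.Tensor) -> torch.Tensor:
        out = self.model(wav_16k, output_hidden_states=True)
        return out.hidden_states[self.layer]  # (B, frames, 1024)
```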
4. Experiments

4.1. Experimental settings
In the experiments, we trained on two different datasets. The vocoder-trained CM was trained using the ASVspoof2019 LA (19LA) training set, which includes 25,380 training samples and 24,844 validation samples. There are six spoofing methods in total, and the validation set was used only to select the best-performing model without participating in the training.
The codec-trained CM was trained using the Codecfake training set, with the validation set used for model selection. The Codecfake training set contains 740,747 samples, and the validation set contains 92,596 samples, covering a total of six codec reconstruction methods.
For the test sets, we first conducted preliminary performance tests on the CMs using the 19LA test set and the In-the-Wild (ITW) dataset. The 19LA test set includes 71,237 audio samples with attack types (A07-A19) that are not seen in the training set. The ITW dataset includes 31,779 audio samples; both its real and fake audio samples are collected from publicly available sources, such as social networks and video streaming platforms, and may contain background noise. This dataset is intended to evaluate the generalizability of detection models, including cross-dataset evaluation.

4.2. Implementation details
In the pre-processing stage for the countermeasure models (CMs), all audio samples were initially down-sampled to 16,000 Hz and adjusted to a uniform duration of 4 seconds through trimming or padding. For the mel-spectrogram feature extraction, we derived an 80-dimensional mel-spectrogram. For the self-supervised feature extraction, we utilized the Wav2Vec-XLS-R model13 with frozen parameters, extracting 1024-dimensional hidden states as feature representations.
All CMs were trained using the Adam optimizer with a learning rate of 5×10⁻⁴. The vocoder-trained CM underwent 100 epochs of training using a weighted cross-entropy loss, assigning a weight of 10 to the real class and 1 to the fake class. The learning rate was halved every 10 epochs. In contrast, the codec-trained CM was trained for 10 epochs, with the learning rate halved every 2 epochs. The model showing the best performance on the validation set was chosen for evaluation.
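A minimal sketch of these pre-processing and optimization settings is given below: 4-second segments at 16 kHz, Adam at 5×10⁻⁴, weighted cross-entropy with a 10:1 real/fake weighting, and a learning rate halved every 10 epochs (vocoder-trained) or every 2 epochs (codec-trained). The class-index convention (0 = real, 1 = fake) and the `model` argument are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TARGET_LEN = 4 * 16000  # 4 seconds at 16 kHz


def pad_or_trim(wav: torch.Tensor) -> torch.Tensor:
    """Force a (B, samples) batch to the uniform 4-second duration."""
    if wav.size(-1) >= TARGET_LEN:
        return wav[..., :TARGET_LEN]                       # trim
    return F.pad(wav, (0, TARGET_LEN - wav.size(-1)))      # zero-pad


def build_optimization(model: nn.Module, codec_trained: bool):
    # Weighted cross-entropy; assumed class indices: 0 = real (weight 10), 1 = fake (weight 1).
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([10.0, 1.0]))
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    # Halve the LR every 10 epochs (vocoder-trained) or every 2 epochs (codec-trained).
    step = 2 if codec_trained else 10
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step, gamma=0.5)
    return criterion, optimizer, scheduler
```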
For experimental evaluation, we used the official implementation for calculating the EER14, maintaining precision to three decimal places. To compute the confusion matrix, we used a threshold of 0.5 to distinguish between real and fake predictions, utilizing calculations from scikit-learn.

9 https://anonymous.4open.science/w/ham-tts/
10 https://ralle-demo.github.io/RALL-E/
11 https://speechresearch.github.io/naturalspeech3/
12 https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-2/
13 https://huggingface.co/facebook/wav2vec2-xls-r-300m
14 https://github.com/asvspoof-challenge/2021/blob/main/eval-package/eval_metrics.py
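For reference, the snippet below shows one common way to approximate the EER from detection scores via the ROC curve and to build the 0.5-threshold confusion matrix with scikit-learn. It is a simplified stand-in, not the official ASVspoof evaluation script cited in footnote 14, and the label convention (1 = fake) is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix


def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Approximate EER: the operating point where FPR equals FNR (1 - TPR).
    Assumed label convention: 1 = fake (the class targeted by the score)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)


def threshold_confusion(labels: np.ndarray, fake_probs: np.ndarray) -> np.ndarray:
    """Confusion matrix with a fixed 0.5 decision threshold on P(fake)."""
    preds = (fake_probs >= 0.5).astype(int)
    return confusion_matrix(labels, preds, labels=[0, 1])  # rows: true real, true fake
```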
5. Results and Discussion

5.1. Results on vocoder-based datasets
To verify the generalization ability of the CM, we first test its performance on the traditional vocoder-based datasets, as shown in the left part (19LA, ITW) of Table 1 and Table 2. Specifically, the vocoder-trained W2V2-AASIST achieves the best equal error rate (EER) of 0.122% on the 19LA test set and 23.713% on the ITW dataset. Especially on 19LA, to the best of our knowledge, our CM achieves the lowest EER, indicating the generalizability of the CM. Additionally, in the cross-domain training scenario, where only Codecfake is used for training and testing is conducted on 19LA and ITW, W2V2-AASIST also achieved good results, with an EER of 3.806% on 19LA and 9.606% on ITW.
Table 1: EER (%) results for CM trained by 19LA training set. AVG represents the average EER across A01-A12.

Feature Backbone 19LA ITW A01 A02 A03 A04 A05 A06 A07 A08 A09 A10 A11 A12 AVG
Mel LCNN 5.084 47.021 46.131 50.000 21.233 3.241 31.250 43.858 37.500 37.500 49.286 40.000 33.854 45.255 39.592
W2V2 LCNN 0.625 41.385 43.304 50.000 0.608 2.093 6.250 33.333 37.500 30.208 14.048 40.000 33.854 39.477 27.556
W2V2 AASIST 0.122 23.713 39.435 50.000 0.225 0.833 6.250 33.333 37.500 25.000 16.905 20.000 30.208 33.150 24.403

Table 2: EER (%) results for CM trained by Codecfake training set. AVG represents the average EER across A01-A12.

Feature Backbone 19LA ITW A01 A02 A03 A04 A05 A06 A07 A08 A09 A10 A11 A12 AVG
Mel LCNN 26.826 42.635 31.696 50.000 7.393 9.553 43.750 45.556 12.500 21.652 0.000 40.000 33.854 44.705 28.388
W2V2 LCNN 4.433 4.975 17.262 50.000 0.450 1.080 0.000 39.444 0.000 1.562 0.000 0.000 3.646 3.164 9.717
W2V2 AASIST 3.806 9.606 3.869 50.000 0.225 0.135 0.000 50.000 0.000 0.000 0.000 0.000 8.854 0.000 9.424

Figure 4: The confusion matrices under different test conditions. (a), (b) correspond to W2V2-AASIST trained on the 19LA training set
and tested on A01 and A02. (c), (d) correspond to W2V2-AASIST trained on the Codecfake training set and tested on A06 and A11.
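As a quick sanity check on the AVG column, the per-condition EERs of a row can be averaged directly; for example, the codec-trained W2V2-AASIST row of Table 2 reproduces the reported 9.424%.

```python
# A01-A12 EERs (%) for the codec-trained W2V2-AASIST row in Table 2.
eers = [3.869, 50.000, 0.225, 0.135, 0.000, 50.000,
        0.000, 0.000, 0.000, 0.000, 8.854, 0.000]
print(round(sum(eers) / len(eers), 3))  # 9.424
```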

5.2. Results on ALM-based datasets
We tested the collected and generated ALM data. For the 19LA-trained CMs, the overall average (AVG) result was not very good, with the lowest AVG being 24.403% for W2V2-AASIST. The EER performance is consistent with the CMs' performance on vocoder-based data. For instance, Mel-LCNN performed poorly on 19LA and ITW, and similarly, its performance on ALM data was also not good, with an AVG as high as 39.592%. Furthermore, we conducted a confusion matrix analysis for the two worst cases of W2V2-AASIST, A01 and A02, as shown in Fig. 4(a) and 4(b). Fig. 4(a) corresponds to A01, with an EER of 39.435%: 52.94% of real speech samples are misclassified as fake, and 37.50% of fake speech samples are misclassified as real. The CM fails to detect ALM-based audio and produces false positives on genuine audio. Fig. 4(b) shows a similar situation, with 50% of genuine audio misclassified as fake and 50% of fake audio misclassified as genuine.
For the codec-trained CMs, the results are very surprising. Most EER values are 0.000% when rounded to three decimal places, indicating a very high distinction between real and fake audio. The best-performing model is still W2V2-AASIST, achieving an average EER of 9.424%. This also demonstrates the model's stability in performance across different training datasets. Furthermore, we attempted a detailed analysis of the worst cases, as shown in Fig. 4(c) and (d). It can be seen that the poor performance is due to real speech samples being misclassified as fake, while fake samples are correctly identified. Specifically, in A06, 100% of the real speech was classified as fake, and in A11, 41.67% of the real speech was classified as fake. This indicates that the codec-trained CM can recognize ALM-based audio but lacks generalization to real data from other domains. Beyond speech, the situation for music in A02 is also very poor: for the same reason, the model classified all samples as fake. These findings suggest that when the CM encounters new audio types, new features and backbones need to be considered to adapt to these new types of audio.

5.3. Discussion
Some codec-trained CMs still exhibit shortcomings, particularly high false negative (FN) rates, which suggest several recommendations for future research on countering ALM-based audio. First, the Codecfake dataset may lack generalization to other audio types such as music and sound. Even though codec-trained CMs can detect ALM-based audio, they can also misclassify genuine audio. From the dataset perspective, this may necessitate enriching the codec training data with a variety of audio types. As for the CM, specialized features for different audio types, rather than relying solely on speech self-supervised features such as W2V2, may be needed. Additionally, the high FN rates indicate deficiencies in the classifier's ability to learn from real-world audio data. This is apparent due to the limited diversity of real-world domains covered by a single dataset during training, and the inadequate representation of real audio in pre-trained features. Therefore, strategies such as co-training with supplementary real audio datasets or enhancing pre-trained features could significantly enhance performance.

6. Conclusions

In this paper, we attempt to address a novel question: does the current deepfake audio detection model effectively detect ALM-based deepfake audio? We evaluate this by collecting and generating the latest 12 types of ALM-based audio and assessing them using SOTA CMs. The surprising results indicate that codec-trained CMs can effectively detect these ALM-based audios, with most EERs approaching 0%. This indicates that the current CM, specifically the codec-trained CM trained with the Codecfake dataset, can effectively detect ALM-based audio.

7. Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) (No.62101553, No.62306316, No.U21B20210, No.62201571).
8. References

[1] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., "Audiolm: a language modeling approach to audio generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[2] F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi, "Audiogen: Textually guided audio generation," in The Eleventh International Conference on Learning Representations, 2022.
[3] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv preprint arXiv:2301.02111, 2023.
[4] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., "Musiclm: Generating music from text," arXiv preprint arXiv:2301.11325, 2023.
[5] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Speak foreign languages with your own voice: Cross-lingual neural codec language modeling," arXiv preprint arXiv:2303.03926, 2023.
[6] T. Wang, L. Zhou, Z. Zhang, Y. Wu, S. Liu, Y. Gaur, Z. Chen, J. Li, and F. Wei, "Viola: Unified codec language models for speech recognition, synthesis, and translation," arXiv preprint arXiv:2305.16107, 2023.
[7] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, "Simple and controllable music generation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[8] X. Wang, M. Thakker, Z. Chen, N. Kanda, S. E. Eskimez, S. Chen, M. Tang, S. Liu, J. Li, and T. Yoshioka, "Speechx: Neural codec language model as a versatile speech transformer," arXiv preprint arXiv:2308.06873, 2023.
[9] Q. Chen, Y. Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Ma, W. Wang, S. Zheng et al., "Lauragpt: Listen, attend, understand, and regenerate audio with gpt," arXiv preprint arXiv:2310.04673, 2023.
[10] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu et al., "Uniaudio: An audio foundation model toward universal audio generation," arXiv preprint arXiv:2310.00704, 2023.
[11] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, "Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers," arXiv preprint arXiv:2406.05370, 2024.
[12] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation," arXiv preprint arXiv:2202.12233, 2022.
[13] Y. Xie, H. Cheng, Y. Wang, and L. Ye, "Learning a self-supervised domain-invariant feature representation for generalized audio deepfake detection," in Proc. INTERSPEECH, 2023, pp. 2808-2812.
[14] Z. Wang, R. Fu, Z. Wen, Y. Xie, Y. Liu, X. Wang, X. Liu, Y. Li, J. Tao, Y. Lu et al., "Generalized fake audio detection via deep stable learning," arXiv preprint arXiv:2406.03237, 2024.
[15] X. Wang, R. Fu, Z. Wen, Z. Wang, Y. Xie, Y. Liu, J. Tao, X. Liu, Y. Li, X. Qi et al., "Genuine-focused learning using mask autoencoder for generalized fake audio detection," arXiv preprint arXiv:2406.03247, 2024.
[16] Y. Xie, H. Cheng, Y. Wang, and L. Ye, "Domain generalization via aggregation and separation for audio deepfake detection," IEEE Transactions on Information Forensics and Security, 2023.
[17] A. Nautsch, X. Wang, N. Evans, T. H. Kinnunen, V. Vestman, M. Todisco, H. Delgado, M. Sahidullah, J. Yamagishi, and K. A. Lee, "Asvspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech," IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, no. 2, pp. 252-265, 2021.
[18] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., "Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[19] Y. Xie, Y. Lu, R. Fu, Z. Wen, Z. Wang, J. Tao, X. Qi, X. Wang, Y. Liu, H. Cheng et al., "The codecfake dataset and countermeasures for the universally detection of deepfake audio," arXiv preprint arXiv:2405.04880, 2024.
[20] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv preprint arXiv:2210.13438, 2022.
[21] Y. Song, Z. Chen, X. Wang, Z. Ma, and X. Chen, "Ella-v: Stable neural codec language modeling with alignment-guided sequence reordering," arXiv preprint arXiv:2401.07333, 2024.
[22] C. Wang, C. Zeng, B. Zhang, Z. Ma, Y. Zhu, Z. Cai, J. Zhao, Z. Jiang, and Y. Chen, "Ham-tts: Hierarchical acoustic modeling for token-based zero-shot text-to-speech with model and data scaling," arXiv preprint arXiv:2403.05989, 2024.
[23] D. Xin, X. Tan, K. Shen, Z. Ju, D. Yang, Y. Wang, S. Takamichi, H. Saruwatari, S. Liu, J. Li et al., "Rall-e: Robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis," arXiv preprint arXiv:2404.03204, 2024.
[24] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang et al., "Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models," arXiv preprint arXiv:2403.03100, 2024.
[25] D. Oneata, A. Stan, O. Pascu, E. Oneata, and H. Cucu, "Towards generalisable and calibrated synthetic speech detection with self-supervised representations," arXiv preprint arXiv:2309.05384, 2023.
[26] J. W. Lee, E. Kim, J. Koo, and K. Lee, "Representation selective self-distillation and wav2vec 2.0 feature exploration for spoof-aware speaker verification," in Proc. Interspeech 2022, 2022, pp. 2898-2902.
[27] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "Stc antispoofing systems for the asvspoof2019 challenge," arXiv preprint arXiv:1904.05576, 2019.
[28] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proceedings of the ICASSP, 2022, pp. 6367-6371.