AI-Synthesized Voice Detection Using Neural Vocoder Artifacts
[Figure 1 panels: Original, WaveNet, WaveRNN, MelGAN, Parallel WaveGAN, WaveGrad, DiffWave]
Figure 1. The artifacts introduced by the neural vocoders to a voice signal. We show the mel-spectrogram of the original voice (top left) and the self-vocoded voice signals (remaining top panels). Their differences, corresponding to the artifacts introduced by the vocoders, are shown at the bottom.
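As a rough illustration of how artifact maps like those in Figure 1 can be produced, the sketch below computes log-mel spectrograms of an original clip and its self-vocoded counterpart and visualizes their difference. The file names are placeholders and librosa/matplotlib are assumed to be available; this is not the paper's exact plotting code.

```python
# Minimal sketch: visualize vocoder artifacts as a log-mel spectrogram difference.
# "original.wav" and "self_vocoded.wav" are placeholder file names.
import librosa
import matplotlib.pyplot as plt
import numpy as np

def log_mel(path: str, sr: int = 24000, n_mels: int = 80) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

orig = log_mel("original.wav")            # real recording
voc = log_mel("self_vocoded.wav")         # same clip passed through a neural vocoder
n = min(orig.shape[1], voc.shape[1])      # align frame counts before subtracting
diff = np.abs(orig[:, :n] - voc[:, :n])   # residual highlighting vocoder artifacts

fig, axes = plt.subplots(3, 1, figsize=(8, 6), sharex=True)
for ax, data, title in zip(axes, [orig[:, :n], voc[:, :n], diff],
                           ["original", "self-vocoded", "difference (artifacts)"]):
    ax.imshow(data, origin="lower", aspect="auto")
    ax.set_title(title)
plt.tight_layout()
plt.show()
```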
“Self-vocoding” samples were sourced from the same original audio signals to highlight the artifacts introduced by the vocoders. Figure 1 shows the differences between the mel-spectrograms of one original voice and its self-vocoded voice signals. Visible artifacts introduced by the different neural vocoder models can be observed, and they serve as the basis of our detection algorithm. While these artifacts may be subtle to visualize, this work demonstrates that they can be captured by a trained classifier.

To take advantage of the vocoder artifacts in detecting synthetic human voices, we developed a multi-task learning strategy. We use a binary classifier that shares its front-end feature extractor (e.g., RawNet2 [4]) with a vocoder identification module. This compensates for the insufficient number of existing real and synthetic human voice samples by including the self-vocoding samples in LibriSeVoc as additional training data. We treat vocoder identification as a pretext task that constrains the front-end feature extraction module to focus on vocoder-level artifacts and build highly discriminative features for the final binary classifier. Our experiments show that our RawNet2-based model achieves outstanding classification performance on our LibriSeVoc dataset and two public DeepFake audio datasets. We also evaluated our method under different post-processing scenarios and demonstrated good detection robustness to resampling and background noise.

The main contributions of our work are as follows:

• We propose to focus on neural vocoder artifacts as specific and interpretable features for detecting AI-synthesized audio;

• We design a novel multi-task learning approach that combines a binary classification task with a vocoder identification module. This approach constrains the feature extractor to learn discriminative vocoder artifacts for detecting synthetic human voices;

• We provide LibriSeVoc, a dataset with self-vocoding samples created using six state-of-the-art vocoders, to highlight and exploit the vocoder artifacts;

• We experimentally evaluate our proposed method on three datasets and demonstrate its effectiveness.

Overall, our work provides a new and promising approach to detecting synthetic human voices by focusing on neural vocoder artifacts. Our multi-task learning strategy, together with the LibriSeVoc dataset, could serve as a valuable resource for future research in this area.

2. Related Works

In this section, we review the literature relevant to our research, including voice synthesis methods, state-of-the-art neural vocoder models, and existing AI-synthesized voice detection methods.

2.1. Human Voice Synthesis

The synthesis of human voice is a significant challenge in the field of artificial intelligence, with various practical applications such as voice-driven smart assistants and accessible user interfaces. Human voice synthesis can be classified into two general categories: text-to-speech (TTS) and voice conversion (VC). In this work, we focus on recent TTS and VC methods that use deep neural network models.

TTS systems transform input text into audio in the target voice and typically consist of three components: a text analysis module that converts the input text into linguistic features, an acoustic model that generates acoustic features in the form of a mel-spectrogram from the linguistic features, and a vocoder. Recent TTS models based on deep neural networks include WaveNet [7], Tacotron [8], Tacotron 2 [9], ClariNet [10], and FastSpeech 2s [11].

In contrast, VC models take a sample of one subject's voice as input and create output audio of another subject's voice speaking the same utterance. Recent VC models (e.g., [12–14]) usually work within the mel-spectrum domain and use deep neural network models to map between the mel-spectrograms of the input and output voice signals. These models use neural style transfer methods, such as variational auto-encoder (VAE) or generative adversarial network (GAN) models, to capture the utterance elements in the input voice and then combine them with the style of the output voice. The resulting mel-spectrogram is then reconstructed into an audio waveform using a neural vocoder. Both TTS and VC models employ deep neural networks trained on large-scale human voice corpora.
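To make the three-component TTS pipeline described above concrete, here is a schematic sketch with dummy implementations; the function names, shapes, and constants are illustrative placeholders rather than any specific TTS system.

```python
# Schematic sketch of the TTS pipeline: text analysis -> acoustic model -> vocoder.
# All functions are toy placeholders with plausible tensor shapes, not a real system.
from dataclasses import dataclass
import numpy as np

@dataclass
class TTSConfig:
    n_mels: int = 80          # mel bins produced by the acoustic model
    hop_length: int = 256     # waveform samples per mel frame
    sample_rate: int = 24000

def text_analysis(text: str) -> list[str]:
    """Convert raw text into linguistic features (here: a toy character sequence)."""
    return list(text.lower())

def acoustic_model(features: list[str], cfg: TTSConfig) -> np.ndarray:
    """Map linguistic features to a mel-spectrogram of shape (n_mels, frames)."""
    frames = 10 * len(features)                      # toy duration model
    return np.zeros((cfg.n_mels, frames), dtype=np.float32)

def vocoder(mel: np.ndarray, cfg: TTSConfig) -> np.ndarray:
    """Reconstruct a waveform from the mel-spectrogram (the step neural vocoders perform)."""
    num_samples = mel.shape[1] * cfg.hop_length
    return np.zeros(num_samples, dtype=np.float32)

cfg = TTSConfig()
wav = vocoder(acoustic_model(text_analysis("hello world"), cfg), cfg)
print(wav.shape)   # (28160,) -> roughly 1.2 s of audio at 24 kHz
```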
2.2. Neural Vocoders

Vocoders are crucial components in both TTS and VC models, as they synthesize output audio waveforms from mel-spectrograms. However, the transformation from audio waveforms to mel-spectrograms loses information due to binning and filtering, making it difficult to recover the audio waveform from a mel-spectrogram. In recent years, deep neural network-based vocoders have been developed that significantly improve training efficiency and synthesis quality. There are three main categories of existing neural vocoders: autoregressive models, diffusion models, and GAN-based models.

Autoregressive models are probabilistic models that predict the distribution of each audio waveform sample based on all previous samples. However, since this process involves sequential, sample-by-sample generation, autoregressive models are slower than other methods. WaveNet [7], the first autoregressive neural vocoder, can also serve as a TTS or VC model depending on its input. WaveRNN [15] is another autoregressive vocoder that uses a single-layer recurrent neural network to efficiently predict 16-bit raw audio samples from mel-spectrogram slices.

Diffusion models are probabilistic generative models that run a diffusion process and a reverse process. The diffusion process is characterized by a Markov chain that gradually adds Gaussian noise to an original signal until only noise remains. The reverse process is a de-noising stage that removes the added Gaussian noise and converts a sample back to the original signal. WaveGrad [16] and DiffWave [17] are two notable examples of diffusion-based vocoder models. While diffusion vocoders are more time-efficient than autoregressive models, their reconstruction quality is inferior, and the generated samples may contain higher levels of noise and artifacts.

GAN-based models follow the generative adversarial network (GAN) architecture [18], which employs a deep neural network generator to model the waveform signal in the time domain and a discriminator to estimate the quality of the generated speech. MelGAN [19] and Parallel WaveGAN [20] are the two most commonly used GAN-based neural vocoders. Recent works have shown that GAN-based vocoders outperform autoregressive and diffusion models in both generation speed and generation quality.

2.3. AI-Synthesized Human Voice Detection

In recent years, detecting synthetic human voices has become crucial due to their potential misuse. While extensive research has focused on audio authentication for speech synthesis and replay attack detection [21, 22], detecting highly realistic AI-generated audio produced by a variety of models is still a developing field. One of the earliest methods for detecting AI-synthesized audio is bi-spectral analysis [3]. This method captures subtle inconsistencies in the local phases of synthetic human voices. Real human voice signals have random local phases due to audio waves transmitting and bouncing around in the physical environment, while synthetic human voices do not have these characteristics. Although these local phase inconsistencies cannot be detected by the human auditory system, they can be identified through bi-spectral analysis. Another method, DeepSonar [23], uses network responses to audio signals as features to detect synthetic audio. The ASVspoof 2021 Challenge evaluates additional state-of-the-art synthetic voice detection methods; among them, the Gaussian mixture models CQCC-GMM [24] and LFCC-GMM [24], the light convolutional neural network LFCC-LCNN [24], and RawNet2 [4] have achieved the most reliable performance as primary baseline algorithms.

Recent studies have focused on improving the generalization capacity of fake audio detection. Various well-designed models have been proposed for DeepFake audio detection, such as the spectro-temporal graph attention network [4], unsupervised pretraining models [5], a biometric characteristics verification model [25], and a self-distillation framework [6]. However, Müller et al. [26] evaluated twelve architectures on their dataset of 37.9 hours of audio recordings and found that related work performs poorly on real-world data, with some models even degenerating to random guessing. Thus, there is a high demand for efficient and effective models for AI-synthesized audio detection.

3. Method

We aim to detect synthetic human voices by identifying vocoder artifacts present in the audio signals. Since real human voice signals typically do not contain vocoder artifacts (except for our self-vocoding signals, which are specifically designed to have them), the presence of vocoder artifacts is a key cue for detecting synthetic human voices.

To achieve this, let x be the waveform of a human voice signal with a label y ∈ {0, 1}, where 0 corresponds to a real human voice and 1 corresponds to a synthetic human voice. Our goal is to build a classifier ŷ = F_θ(x) that predicts the label of an input x. We utilize the recent RawNet2 model [4] as the backbone of our classifier, as it was designed to operate directly on raw waveforms. This reduces the risk of losing information related to neural vocoder artifacts compared to using pre-processed features such as mel-spectrograms or linear frequency cepstral coefficients (LFCCs).

The binary detection model can be constructed as a cascade of neural networks

F_\theta(\mathbf{x}) = B_{\theta_B}(R_{\theta_R}(\mathbf{x})) \quad (1)

where R_{θ_R}(x) is the front-end RawNet2 model for feature extraction with its own set of parameters θ_R, and B_{θ_B} is a back-end binary classifier with its specific parameters θ_B, so that θ = (θ_R, θ_B).
[Figure 2 diagram labels: Audio Clips, Fixed Sinc Filter, RES BLOCK ×2, GRU, FC, FC, Vocoder Identification (Lm), Binary Classifier (Lb), Real, Fake]
Figure 2. Framework of the proposed synthesized voice detection method. Rather than simply learning discriminative features for binary classification, we incorporate a vocoder identification module with a multi-class classification loss to direct the feature extraction network toward prioritizing vocoder artifacts.
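To make the framework in Figure 2 concrete, below is a minimal PyTorch sketch of a shared front end feeding a binary (real/fake) head and a vocoder-identification head, trained with a λ-weighted sum of the two losses (formalized as Eq. (3) below). The layer sizes, the stand-in front end, and applying both losses to the same batch (rather than to two different sample sets) are illustrative simplifications, not the paper's exact configuration.

```python
# Sketch of the multi-task objective: lambda * L_b + (1 - lambda) * L_m with a shared
# front end. Sizes and NUM_VOCODERS are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_VOCODERS = 6          # classes for the vocoder-identification pretext task
LAMBDA = 0.5              # trade-off between the two loss terms

shared_front_end = nn.Sequential(        # R_{theta_R}: waveform -> embedding (stand-in)
    nn.Conv1d(1, 32, kernel_size=251, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 128),
)
binary_head = nn.Linear(128, 2)                      # B_{theta_B}: real vs. fake
vocoder_head = nn.Linear(128, NUM_VOCODERS + 1)      # M_{theta_M} (+1 if "real" is a class)

def multitask_loss(wave, y_binary, c_vocoder):
    """Compute lambda * L_b + (1 - lambda) * L_m on one batch."""
    z = shared_front_end(wave.unsqueeze(1))          # shared features
    loss_b = F.cross_entropy(binary_head(z), y_binary)
    loss_m = F.cross_entropy(vocoder_head(z), c_vocoder)
    return LAMBDA * loss_b + (1.0 - LAMBDA) * loss_m

wave = torch.randn(4, 16000)                          # batch of raw waveforms
y = torch.randint(0, 2, (4,))                         # 0 = real, 1 = fake
c = torch.randint(0, NUM_VOCODERS + 1, (4,))          # vocoder class labels
loss = multitask_loss(wave, y, c)
loss.backward()
```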
We can train this classifier directly as in previous work [27] by solving

\min_\theta \sum_{(\mathbf{x}, y) \in T} L_\text{b}(y, F_\theta(\mathbf{x})) \quad (2)

where L_b(y, ŷ) can be any loss function for binary classification, for instance the cross-entropy loss, and T is the training dataset containing labeled real and synthetic human voice samples. However, this method assumes that a large number of synthetic human voice samples is available, which is increasingly difficult to achieve due to the rapid advancement of synthesis technology. Additionally, this approach does not take into account the unique statistical properties of neural vocoders, which can be an essential indicator of synthetic audio signals.

To address this problem, we propose a multi-task learning approach that combines binary classification with a vocoder identification task. This approach is designed to emphasize the importance of identifying vocoder-level artifacts in synthetic audio signals. Specifically, we augment our detection model with a vocoder identifier M_{θ_M}, which categorizes a voice signal into one of C possible neural vocoder classes (c ∈ {0, ..., C}, with C ≥ 2). Our goal is to ensure that the feature extractor is trained to capture the distinct statistical characteristics of vocoders, making it more sensitive to these features. This approach is similar to self-supervised representation learning [28]. To this end, we form a new classification objective:

\min_{\theta_R, \theta_B, \theta_M} \; \lambda \sum_{(\mathbf{x}, y) \in T} L_\text{b}(y, B_{\theta_B}(R_{\theta_R}(\mathbf{x}))) + (1 - \lambda) \sum_{(\mathbf{x}, c) \in T'} L_\text{m}(c, M_{\theta_M}(R_{\theta_R}(\mathbf{x}))) \quad (3)

In this equation, L_m is a multi-class loss function, for which we use the softmax (cross-entropy) loss in our experiments. T' is a dataset containing synthetic human voices created with different neural vocoders as the corresponding labels. This dataset is much easier to create by performing "self-vocoding", i.e., creating synthetic human voices by running real samples through the mel-spectrogram transform and its inverse, the latter performed with neural vocoders. We created such a dataset, LibriSeVoc, which will be described in detail in Section 4.1. λ is an adjustable hyper-parameter that controls the trade-off between the two loss terms.

The whole framework of our detection model is shown in Figure 2. Note that the two classification modules in the new learning objective serve different roles. The first term pertains to binary classification and aims to distinguish between authentic and fake audio. The second term focuses on vocoder identification, serving as a pretext task that directs the feature extractor's attention toward vocoder-related artifacts. The two tasks share the feature extraction component, so that the distinct features of the vocoders can be captured and transferred to the binary classification task.

4. Experiments

In this section, we present a series of experiments to evaluate the effectiveness of our proposed synthetic voice detection method. We start by introducing our LibriSeVoc dataset as well as two publicly available datasets that we use for evaluation. Then, we compare our method with state-of-the-art models on all three datasets, using both intra- and cross-dataset testing scenarios. Finally, we examine the robustness of our detection models to common post-processing operations.

4.1. Datasets

Three DeepFake audio datasets are considered in the experiments: our LibriSeVoc and two public datasets, WaveFake [29] and ASVspoof 2019 [24].

LibriSeVoc Dataset. We have created a new open-source dataset with self-vocoded samples generated by six state-of-the-art neural vocoders (see Table 1).
Table 1. Data details of voice samples from real and each vocoder category in the LibriSeVoc dataset.
Table 2. Details of evaluation datasets.

Dataset               #Vocoder types   Sampling rate   Training size   Dev size   Testing size
LibriSeVoc            6                24 kHz          55,440          18,480     18,487
WaveFake Dataset [29] 6                16 kHz          64,000          16,000     24,800
ASVspoof 2019 [36]    6                16 kHz          25,380          24,844     71,237
As the evaluation metric, we calculate the Equal Error Rate (EER) following previous studies [24, 29, 40].

Table 3. Detection EER (%) of intra-dataset testing on three datasets.
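For reference, a minimal sketch of how EER can be computed from detection scores is shown below, using scikit-learn's ROC utilities; the interpolation convention is one common choice and not necessarily the exact protocol of the cited studies.

```python
# Minimal sketch of Equal Error Rate (EER) computation; higher scores mean "fake".
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER = operating point where the false positive rate equals the false negative rate."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))       # threshold where the two rates cross
    return float((fpr[idx] + fnr[idx]) / 2.0)

labels = np.array([0, 0, 1, 1, 1, 0])           # 0 = real, 1 = fake
scores = np.array([0.1, 0.4, 0.8, 0.35, 0.9, 0.2])
print(f"EER = {compute_eer(labels, scores) * 100:.2f}%")
```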
Table 4. Detection EER (%) of cross-dataset testing on WaveFake dataset.
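The robustness experiments mentioned in Section 4 apply common post-processing operations such as resampling and additive background noise. A minimal sketch of two such perturbations is shown below, assuming librosa and NumPy are available; the parameter values are illustrative, not the paper's exact settings.

```python
# Sketch of two post-processing perturbations used in robustness testing:
# lossy resampling (down and back up) and additive white noise at a target SNR.
import numpy as np
import librosa

def resample_roundtrip(wave: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample to a lower rate and back, as a lossy post-processing step."""
    down = librosa.resample(wave, orig_sr=orig_sr, target_sr=target_sr)
    return librosa.resample(down, orig_sr=target_sr, target_sr=orig_sr)

def add_noise(wave: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Add white Gaussian noise so that the signal-to-noise ratio equals snr_db."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

sr = 24000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # 1 s test tone
perturbed = add_noise(resample_roundtrip(wave, sr, 16000), snr_db=20.0)
```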
References

[1] Forbes, “A Voice Deepfake Was Used To Scam A CEO Out Of $243,000,” https://www.cnn.com/2020/02/20/tech/fake-faces-deepfake/index.html, November 2019. 1

[2] “4chan users embrace AI voice clone tool to generate celebrity hatespeech,” https://www.theverge.com/2023/1/31/23579289/ai-voice-clone-deepfake-abuse-4chan-elevenlabs, January 2023. 1

[3] E. A. AlBadawy, S. Lyu, and H. Farid, “Detecting AI-synthesized speech using bispectral analysis,” in CVPR Workshops, 2019, pp. 104–109. 1, 3

[4] H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” arXiv preprint arXiv:2107.12710, 2021. 1, 2, 3, 5, 6, 7

[5] Z. Lv, S. Zhang, K. Tang, and P. Hu, “Fake audio detection based on unsupervised pretraining models,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9231–9235. 1, 3

[6] J. Xue, C. Fan, J. Yi, C. Wang, Z. Wen, D. Zhang, and Z. Lv, “Learning from yourself: A self-distillation method for fake speech detection,” arXiv preprint arXiv:2303.01211, 2023. 1, 3

[7] A. van den Oord, S. Dieleman, H. Zen et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. 2, 3

[8] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint arXiv:1703.10135, 2017. 2

[9] Z. Wang, Y. Liu, and L. Shan, “CE-Tacotron2: End-to-end emotional speech synthesis,” in 2021 60th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), 2021, pp. 48–52. 2

[10] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” arXiv preprint arXiv:1807.07281, 2018. 2

[11] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020. 2

[12] E. A. AlBadawy and S. Lyu, “Voice conversion using speech-to-speech neuro-style transfer,” in Proc. Interspeech, 2020. 2

[13] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014. 2

[14] S. H. Mohammadi and A. Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 19–23. 2

[15] N. Kalchbrenner, E. Elsen, K. Simonyan et al., “Efficient neural audio synthesis,” in ICML, 2018. 3

[16] N. Chen, Y. Zhang, H. Zen et al., “WaveGrad: Estimating gradients for waveform generation,” in ICLR, 2020. 3

[17] Z. Kong, W. Ping, J. Huang et al., “DiffWave: A versatile diffusion model for audio synthesis,” in ICLR, 2020. 3

[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” NeurIPS, vol. 27, 2014. 3

[19] K. Kumar, R. Kumar, T. de Boissiere et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint, 2019. 3

[20] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP, 2020. 3

[21] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015. 3

[22] H. A. Patil and M. R. Kamble, “A survey on replay attack detection for automatic speaker verification (ASV) system,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 1047–1053. 3

[23] R. Wang, F. Juefei-Xu, Y. Huang, Q. Guo, X. Xie, L. Ma, and Y. Liu, “DeepSonar: Towards effective and robust detection of AI-synthesized fake voices,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1207–1216. 3

[24] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019. 3, 4, 6

[25] A. Pianese, D. Cozzolino, G. Poggi, and L. Verdoliva, “Deepfake audio detection by speaker verification,” in 2022 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2022, pp. 1–6. 3

[26] N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does audio deepfake detection generalize?” arXiv preprint arXiv:2203.16263, 2022. 3

[27] G. Yang, S. Yang, K. Liu et al., “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” in SLT Workshop, 2021, pp. 492–498. 4

[28] L. Ericsson, H. Gouk, C. C. Loy, and T. M. Hospedales, “Self-supervised representation learning: Introduction, advances and challenges,” IEEE Signal Processing Magazine, vol. 32, 2022. 4

[29] J. Frank and L. Schönherr, “WaveFake: A data set to facilitate audio deepfake detection,” in Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), 2021. 4, 5, 6

[30] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019. 5

[31] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020. 5

[35] “LibriVox – free public domain audiobooks,” https://librivox.org/. 5

[36] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” arXiv preprint arXiv:1904.05576, 2019. 5, 6, 7

[37] K. Ito and L. Johnson, “The LJ Speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017. 5

[38] C. Veaux, J. Yamagishi, K. MacDonald et al., “Superseded - CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2016. 5

[39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. 5

[40] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., “ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection,” arXiv preprint arXiv:2109.00537, 2021. 6

[41] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1–14, 2022. 6, 7

[42] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” 2021. 6, 7