Table 1: Full results of evaluation on the ASVspoof 2019 LA ‘eval’ data. We compare different model architectures against different
feature types and audio input lengths (4s, fixed-sized inputs vs. variable-length inputs). Results are averaged over three independent
trials with random initialization, and the standard deviation is reported. Best-performing configurations are highlighted in boldface.
When evaluating the models on our proposed ‘in-the-wild’ dataset, we see an increase in EER by up to 1000% compared to ASVspoof
2019 (rightmost column).
Table 2: Model performance averaged by input preprocessing. Fixed-length, 4s inputs perform significantly worse on the ASVspoof data and on the ‘in-the-wild’ dataset than variable-length inputs. This suggests that related work using fixed-length inputs may (unnecessarily) sacrifice performance.

                  ASVspoof19 eval            In-the-Wild Data
  Input Length    EER %         t-DCF        EER %
  Full            9.85          0.22         60.10
  4s              18.89         0.39         67.25

We evaluate these architectures on constant-Q transform (cqtspec [31]), log spectrogram (logspec), and mel-scaled spectrogram (melspec [32]) features (all of them 513-dimensional). We use Python, librosa [33] and scipy [34]. The remaining models do not rely on pre-processed data but use raw audio waveforms as inputs.
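A minimal sketch of this feature extraction with librosa [33] is given below. The sample rate, hop length, FFT size, and CQT bin layout are assumptions chosen only to reproduce the stated 513-dimensional features; they are not specified above.

```python
import numpy as np
import librosa


def extract_features(wav_path, feature_type="cqtspec", n_dims=513, sr=16000):
    """Compute cqtspec / logspec / melspec features of shape (n_dims, n_frames)."""
    y, sr = librosa.load(wav_path, sr=sr)

    if feature_type == "cqtspec":
        # Constant-Q transform [31]; 513 bins at 96 bins per octave is an
        # assumed layout that merely matches the stated dimensionality.
        return np.abs(librosa.cqt(y, sr=sr, n_bins=n_dims, bins_per_octave=96))
    if feature_type == "logspec":
        # Log magnitude spectrogram; n_fft=1024 yields 1024 / 2 + 1 = 513 bins.
        return np.log(np.abs(librosa.stft(y, n_fft=1024)) + 1e-8)
    if feature_type == "melspec":
        # Mel-scaled spectrogram [32] with 513 mel bands (FFT size assumed).
        return librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, n_mels=n_dims)
    raise ValueError(f"unknown feature type: {feature_type}")
```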
4.3. Audio Input Length

Audio samples usually vary in length, which is also the case for the data in ASVspoof 2019 and our proposed ‘in-the-wild’ dataset. While some models can accommodate variable-length input (and thus also fixed-length input), many cannot. We extend the latter by introducing a global averaging layer, which adds this capability.
In our evaluation of fixed-length input, we chose a length of four seconds, following [23]. If an input sample is longer, a random four-second subset of the sample is used. If it is shorter, the sample is repeated. To keep the evaluation fair, these shorter samples are also repeated during the full-length evaluation. This ensures that full-length input is never shorter than truncated input, but always at least 4s.
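The input-length policy described above can be sketched as follows (assuming 16 kHz audio as numpy arrays; an illustration, not the exact implementation). For variable-length operation, models that require a fixed-size representation are additionally given a global averaging layer that mean-pools frame-level features over time.

```python
import numpy as np

SR = 16000           # assumed sample rate
FIXED_LEN = 4 * SR   # four seconds, following [23]


def repeat_to_min_length(x, min_len=FIXED_LEN):
    """Repeat a short clip until it is at least `min_len` samples long.

    Applied in both modes, so full-length input is never shorter than 4s.
    """
    if len(x) >= min_len:
        return x
    reps = int(np.ceil(min_len / len(x)))
    return np.tile(x, reps)


def to_fixed_length(x, target_len=FIXED_LEN, rng=None):
    """Fixed-length mode: random 4s crop of longer clips, repetition of shorter ones."""
    rng = rng or np.random.default_rng()
    x = repeat_to_min_length(x, target_len)
    start = rng.integers(0, len(x) - target_len + 1)
    return x[start:start + target_len]
```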
5. Results

Table 1 shows the results of our experiments, where we evaluate all models against all configurations of data preprocessing: we train twelve different models, using one of four different feature types, with two different ways of handling variable-length audio. Each experiment is performed three times, using random initialization. We report averaged EER and t-DCF, as well as standard deviation. We observe that on ASVspoof, our implementations perform comparably to related work, within a margin of approximately 2-4% EER and 0.1 t-DCF. This is likely because we do not fine-tune our models’ hyper-parameters.
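For reference, the EER values in Tables 1 and 2 correspond to the operating point at which the false-acceptance rate on spoofs equals the false-rejection rate on bona fide speech. The numpy sketch below illustrates the computation; it is a generic routine (higher scores are assumed to mean ‘bona fide’), not the official ASVspoof scoring code [30].

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Return the EER in percent, assuming higher scores mean 'bona fide'."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    is_bonafide = np.concatenate([np.ones(len(bonafide_scores)),
                                  np.zeros(len(spoof_scores))])
    order = np.argsort(scores)
    is_bonafide = is_bonafide[order]

    # Sweep the decision threshold upwards: after rejecting the k lowest-scoring
    # trials, FRR is the fraction of bona fide trials rejected and FAR is the
    # fraction of spoof trials still accepted.
    frr = np.cumsum(is_bonafide) / len(bonafide_scores)
    far = 1.0 - np.cumsum(1.0 - is_bonafide) / len(spoof_scores)
    k = np.argmin(np.abs(far - frr))
    return 100.0 * (far[k] + frr[k]) / 2.0
```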
5.1. Fixed vs. Variable Input Length

We analyze the effects of truncating the input signal to a fixed length compared to using the full, unabridged audio. For all models, performance decreases when the input is trimmed to 4s. Table 2 averages all results by input length. We see that the average EER on ASVspoof drops from 18.89% to 9.85% when the full-length input is used. These results show that a four-second clip is insufficient for a model to extract useful information, compared to using the full audio file as input. Therefore, we propose not to use fixed-length truncated inputs, but to provide the full audio file to the model. This may seem obvious, but the numerous works that use fixed-length inputs [23, 25, 26] suggest otherwise.

5.2. Effects of Feature Extraction Techniques

We discuss the effects of the different feature preprocessing techniques, cf. Table 1. The ‘raw’ models outperform the feature-based models, achieving as little as 1.2% EER on ASVspoof and 33.9% EER on the ‘in-the-wild’ dataset (RawGAT-ST and RawNet2). The spectrogram-based models perform slightly worse, achieving at best 6.3% EER on ASVspoof and 37.4% on the ‘in-the-wild’ dataset (LCNN and MesoNet). The superiority of the ‘raw’ models is assumed to be due to a finer feature-extraction resolution than that of the spectrogram-based models [26]. This has led recent research to focus largely on such raw-feature, end-to-end models [25, 26].
Concerning the spectrogram-based models, we observe that melspec features are always outperformed by either cqtspec or logspec. Simply replacing melspec with cqtspec improves the average performance by 37%, all other factors held constant.

5.3. Evaluation on ‘in-the-wild’ data

Especially interesting is the performance of the models on real-world deepfake data. Table 1 shows the performance of our models on the ‘in-the-wild’ dataset. We see that there is a large performance gap between the ASVspoof 2019 evaluation data and our proposed ‘in-the-wild’ dataset. In general, the EER values of the models deteriorate by about 200 to 1000 percent. Often, the models do not perform better than random guessing.
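For scale, using the averaged figures from Table 2: with full-length input, the EER rises from 9.85% on ASVspoof to 60.10% on the ‘in-the-wild’ data, a relative increase of (60.10 − 9.85)/9.85 ≈ 510%; with 4s input, the increase is (67.25 − 18.89)/18.89 ≈ 256%. Individual models in Table 1 degrade by up to roughly 1000%.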
To investigate this further, we train our best ‘in-the-wild’ model from Table 1, RawNet2 with 4s input length, on all data from ASVspoof 2019, i.e., the ‘train’, ‘dev’, and ‘eval’ splits. We then re-evaluate on the ‘in-the-wild’ dataset to investigate whether adding more ASVspoof training data improves out-of-domain performance. We achieve 33.1 ± 0.2% EER, i.e., no improvement over training with only the ‘train’ and ‘dev’ data. The inclusion of the ‘eval’ split does not seem to add much information that could be used for real-world generalization. This is plausible in that all splits of ASVspoof are fundamentally based on the same dataset, VCTK, although the synthesis algorithms and speakers differ between splits [5].
6. Conclusion

In this paper, we systematically evaluate audio spoof detection models from related work according to common standards. In addition, we present a new audio deepfake dataset of ‘in-the-wild’ audio spoofs, which we use to evaluate the generalization capabilities of related work in a real-world scenario.
We find that regardless of the model architecture, some preprocessing steps are more successful than others: in our comprehensive analysis, the use of cqtspec or logspec features consistently outperforms the use of melspec features. Furthermore, we find that for most models, four seconds of input audio does not saturate performance compared to longer examples. Therefore, we argue that one should consider using cqtspec features and unabridged input audio when designing audio deepfake detection architectures.
Most importantly, however, we find that the ‘in-the-wild’ generalization capabilities of many models may have been overestimated. We demonstrate this by collecting our own audio deepfake dataset and evaluating twelve different model architectures on it. Performance drops sharply, and some models degenerate to random guessing. It may be that the community has tailored its detection models too closely to the prevailing benchmark, ASVspoof, and that deepfakes are much harder to detect outside the lab than previously thought.
7. References

[1] “Audio deep fake: Demonstrator entwickelt am Fraunhofer AISEC - YouTube,” https://www.youtube.com/watch?v=MZTF0eAALmE, (Accessed on 04/01/2021).
[2] “Deepfake video of Volodymyr Zelensky surrendering surfaces on social media - YouTube,” https://www.youtube.com/watch?v=X17yrEV5sl4, (Accessed on 03/23/2022).
[3] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[4] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” CoRR, vol. abs/1703.10135, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135
[5] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019.
[6] A. Nautsch, X. Wang, N. Evans, T. H. Kinnunen, V. Vestman, M. Todisco, H. Delgado, M. Sahidullah, J. Yamagishi, and K. A. Lee, “ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech,” vol. 3, no. 2, pp. 252–265.
[7] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019.
[8] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, “A gated recurrent convolutional neural network for robust spoofing detection,” vol. 27, no. 12, pp. 1985–1999.
[9] A. Chintha, B. Thai, S. J. Sohrawardi, K. M. Bhatt, A. Hickerson, M. Wright, and R. Ptucha, “Recurrent convolutional structures for audio spoof and video deepfake detection,” pp. 1–1.
[10] L. Zhang, X. Wang, E. Cooper, J. Yamagishi, J. Patino, and N. Evans, “An initial investigation for detecting partially spoofed audio,” arXiv preprint arXiv:2104.02518, 2021.
[11] S. Tambe, A. Pawar, and S. Yadav, “Deep fake videos identification using ANN and LSTM,” Journal of Discrete Mathematical Sciences and Cryptography, vol. 24, no. 8, pp. 2353–2364, 2021.
[12] X. Wang and J. Yamagishi, “A comparative study on recent neural spoofing countermeasures for synthetic speech detection.” [Online]. Available: http://arxiv.org/abs/2103.11326
[13] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Interspeech 2017. ISCA, pp. 82–86. [Online]. Available: http://www.isca-speech.org/archive/Interspeech 2017/abstracts/0360.html
[14] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” in Interspeech 2019. ISCA, pp. 1033–1037. [Online]. Available: http://www.isca-speech.org/archive/Interspeech 2019/abstracts/1768.html
[15] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: A compact facial video forgery detection network,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[18] Y. Zhang, F. Jiang, and Z. Duan, “One-class learning towards synthetic voice spoofing detection,” IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021.
[19] J. Monteiro, J. Alam, and T. H. Falk, “Generalized end-to-end detection of spoofing attacks to automatic speaker recognizers,” Computer Speech & Language, vol. 63, p. 101096, 2020.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] Z. Zhang, X. Yi, and X. Zhao, “Fake speech detection using residual network with transformer encoder,” in Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security, 2021, pp. 13–22.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[23] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with RawNet2,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373.
[24] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028.
[25] W. Ge, J. Patino, M. Todisco, and N. Evans, “Raw differentiable architecture search for speech deepfake and spoofing detection,” arXiv preprint arXiv:2107.12212, 2021.
[26] H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” arXiv preprint arXiv:2107.12710, 2021.
[27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[28] N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, J. Williams, and K. Böttinger, “Speech is silver, silence is golden: What do ASVspoof-trained models really learn?” [Online]. Available: http://arxiv.org/abs/2106.12914
[29] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Odyssey 2018 The Speaker and Language Recognition Workshop. ISCA, pp. 312–319.
[30] “t-DCF official implementation,” https://www.asvspoof.org/asvspoof2019/tDCF python v1.zip, (Accessed on 03/03/2022).
[31] J. C. Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
[32] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the measurement of the psychological magnitude pitch,” The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.
[33] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, vol. 8. Citeseer, 2015, pp. 18–25.
[34] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright et al., “SciPy 1.0: Fundamental algorithms for scientific computing in Python,” Nature Methods, vol. 17, no. 3, pp. 261–272, 2020.