
Does Audio Deepfake Detection Generalize?

Nicolas M. Müller¹, Pavel Czempin², Franziska Dieckmann², Adam Froghyar³, Konstantin Böttinger¹

¹Fraunhofer AISEC   ²Technical University Munich   ³why do birds GmbH
[email protected]

Abstract

Current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research. While researchers have presented various deep learning models for audio spoof detection, it is often unclear exactly why these architectures are successful: Preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental?

In this work, we address this problem: We systematize audio spoofing detection by re-implementing and uniformly evaluating twelve architectures from related work. We identify overarching features for successful audio deepfake detection, such as using cqtspec or logspec features instead of melspec features, which improves performance by 37% EER on average, all other factors constant.

Additionally, we evaluate generalization capabilities: We collect and publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes. We find that related work performs poorly on such real-world data (performance degradation of up to one thousand percent). This could suggest that the community has tailored its solutions too closely to the prevailing ASVspoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.

1. Introduction

Modern text-to-speech synthesis (TTS) is capable of producing realistic fakes of human voices, also known as audio deepfakes or spoofs. While there are many ethical applications of this technology, there is also a serious risk of malicious use. For example, TTS technology enables the cloning of politicians' voices [1, 2], which poses a variety of risks to society, including the spread of misinformation.

Reliable detection of speech spoofing can help mitigate such risks and is therefore an active area of research. However, since the technology to create audio deepfakes has only been available for a few years (see Wavenet [3] and Tacotron [4], published in 2016/17), audio spoof detection is still in its infancy. While many approaches have been proposed (cf. Section 2), it is still difficult to understand why some of the models work well: Each work uses different feature extraction techniques, preprocessing steps, hyperparameter settings, and fine-tuning. Which are the main factors and drivers for models to perform well? What can be learned in principle for the development of such systems?

Furthermore, the evaluation of spoof detection models has so far been performed exclusively on the ASVspoof dataset [5, 6], which means that the reported performance of these models is based on a limited set of TTS synthesis algorithms. ASVspoof is based on the VCTK dataset [7], which exclusively features professional speakers and has been recorded in a studio environment, using a semi-anechoic chamber. What can we expect from audio spoof detection trained on this dataset? Is it capable of detecting realistic, unseen, 'in-the-wild' audio spoofs like those encountered on social media?

To answer these questions, this paper presents the following contributions:

• We reimplement twelve of the most popular architectures from related work and evaluate them according to a common standard. We systematically exchange components to attribute performance reported in related work to either model architecture, feature extraction, or data preprocessing techniques. In this way, we identify fundamental properties for well-performing audio deepfake detection.

• To investigate the applicability of related work in the real world, we introduce a new audio deepfake dataset¹. We collect 17.2 hours of high-quality audio deepfakes and 20.7 hours of authentic material from 58 politicians and celebrities.

• We show that established models generally perform poorly on such real-world data. This discrepancy between reported and actual generalization ability suggests that the detection of audio fakes is a far more difficult challenge than previously thought.

¹ https://deepfake-total.com/in_the_wild

2. Related Work

2.1. Model Architectures

There is a significant body of work on audio spoof detection, driven largely by the ASVspoof challenges and datasets [5, 6]. In this section, we briefly present the architectures and models used in our evaluation in Section 5.

LSTM-based models. Recurrent architectures are a natural choice in the area of language processing, with numerous related works utilizing such models [8, 9, 10, 11]. As a baseline for evaluating this approach, we implement a simple LSTM model: it consists of three LSTM layers followed by a single linear layer. The output is averaged over the time dimension to obtain a single embedding vector.
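For illustration, the following is a minimal sketch of such an LSTM baseline in PyTorch. It is our shorthand rather than the exact implementation; layer sizes such as hidden_dim=128 and the two-class output head are assumptions.

```python
import torch
import torch.nn as nn

class SimpleLSTM(nn.Module):
    """Sketch of an LSTM baseline: three stacked LSTM layers, mean-pooling over
    the time dimension, and a single linear layer producing two logits
    (bona-fide vs. spoof)."""

    def __init__(self, feat_dim: int = 513, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim), e.g. spectrogram frames.
        out, _ = self.lstm(x)
        emb = out.mean(dim=1)        # average over the time dimension
        return self.classifier(emb)  # logits for the downstream loss

# Example: a batch of 8 utterances with 400 frames of 513-dimensional features.
logits = SimpleLSTM()(torch.randn(8, 400, 513))
```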
LCNN. Another common architecture for audio spoof detection is the family of LCNN-based models such as LCNN, LCNN-Attention, and LCNN-LSTM [12, 13, 14]. LCNNs combine convolutional layers with Max-Feature-Map activations to create 'light' convolutional neural networks. LCNN-Attention has an added single-head-attention pooling layer, while LCNN-LSTM uses a Bi-LSTM layer and a skip connection.

MesoNet. MesoNet is based on the Meso-4 [15] architecture, which was originally used for detecting facial video deepfakes. It uses 4 convolutional layers in addition to Batch Normalization, Max Pooling, and a fully connected classifier.

MesoInception. Based on the facial deepfake detector Meso-Inception-4 [15], MesoInception extends the Meso-4 architecture with Inception blocks [16].

ResNet18. Residual Networks were first used for audio deepfake detection by [17], and continue to be employed [18, 19]. This architecture, first introduced in the computer vision domain [20], uses convolutional layers and shortcut connections, which avoid the vanishing gradient problem and allow especially deep networks to be designed (18 layers for ResNet18).

Transformer. The Transformer architecture has also found its way into the field of audio spoof detection [21]. We use four self-attention layers with 256 hidden dimensions and skip connections, and encode time with positional encodings [22].

CRNNSpoof. This end-to-end architecture combines 1D convolutions with recurrent layers to learn features directly from raw audio samples [9].

RawNet2 [23] is another end-to-end model. It employs Sinc-Layers [24], which correspond to rectangular band-pass filters, to extract information directly from raw waveforms.

RawPC is an end-to-end model which also uses Sinc-Layers to operate directly on raw waveforms. The architecture is found via differentiable architecture search [25].

RawGAT-ST is a spectro-temporal graph attention network (GAT), trained in an end-to-end fashion. It introduces spectral and temporal sub-graphs and a graph pooling strategy, and reports state-of-the-art spoof detection capabilities [26], which we can verify experimentally, cf. Table 1.
3. Datasets

To train and evaluate our models, we use the ASVspoof 2019 dataset [5], in particular its Logical Access (LA) part. It consists of audio files that are either real (i.e., authentic recordings of human speech) or fake (i.e., synthesized or faked audio). The spoofed audio files are from 19 different TTS synthesis algorithms. From a spoofing detection point of view, ASVspoof considers synthetic utterances as a threat to the authenticity of the human voice, and therefore labels them as 'attacks'. In total, there are 19 different attackers in the ASVspoof 2019 dataset, labeled A1 - A19. For each attacker, there are 4914 synthetic audio recordings and 7355 real samples. This dataset is arguably the best-known audio deepfake dataset, used by almost all related work.

In order to evaluate our models on realistic unseen data in-the-wild, we additionally create and publish a new audio deepfake dataset, cf. Figure 1. It consists of 37.9 hours of audio clips that are either fake (17.2 hours) or real (20.7 hours). We feature English-speaking celebrities and politicians, both from present and past². The fake clips are created by segmenting 219 publicly available video and audio files that explicitly advertise audio deepfakes. Since the speakers talk absurdly and out-of-character ('Donald Trump reads Star Wars'), it is easy to verify that the audio files are really spoofed. We then manually collect corresponding genuine instances from the same speakers using publicly available material such as podcasts, speeches, etc. We take care to include clips where the type of speaker, style, emotions, etc. are similar to the fake (e.g., for a fake speech by Barack Obama, we include an authentic speech and try to find similar values for background noise, emotions, duration, etc.). The clips have an average length of 4.3 seconds and are converted to 'wav' after downloading. All recordings were downsampled to 16 kHz (the highest common frequency in the original recordings). Clips were collected from publicly available sources such as social networks and popular video sharing platforms. This dataset is intended as evaluation data: it allows evaluation of a model's cross-database capabilities on a realistic use case.

² Records available at deepfake-total.com/in_the_wild

Figure 1: Schematics of our collected dataset. For n = 58 celebrities and politicians, we collected both bona-fide and spoofed audio (represented by blue and red boxes per speaker). In total, we collected 20.8 hours of bona-fide and 17.2 hours of spoofed audio. On average, there are 23 minutes of bona-fide and 18 minutes of spoofed audio per speaker.
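As an illustration of the preprocessing described above (conversion to 'wav' and downsampling to 16 kHz), the following is a minimal sketch using librosa [33]; soundfile is an assumed dependency for writing the output, and the file paths are placeholders.

```python
import librosa
import soundfile as sf

TARGET_SR = 16_000  # highest common sampling rate across the collected clips

def convert_clip(src_path: str, dst_path: str) -> None:
    """Load a downloaded clip, resample it to 16 kHz mono, and store it as 'wav'."""
    audio, _ = librosa.load(src_path, sr=TARGET_SR, mono=True)  # resamples on load
    sf.write(dst_path, audio, TARGET_SR)

# Hypothetical usage:
# convert_clip("raw/obama_speech.mp3", "wav/obama_speech.wav")
```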
4. Experimental Setup

4.1. Training and Evaluation

4.1.1. Hyper Parameters

We train all of our models using a cross-entropy loss with a log-Softmax over the output logits. We choose the Adam [27] optimizer. We initialize the learning rate at 0.0001 and use a learning rate scheduler. We train for 100 epochs with early stopping using a patience of five epochs.
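A minimal sketch of this training configuration in PyTorch follows; it is our shorthand, and the choice of learning-rate scheduler is an assumption, since the paper does not specify which scheduler is used.

```python
import torch
import torch.nn.functional as F

def loss_fn(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cross-entropy with an explicit log-softmax over the output logits,
    # as described in Section 4.1.1 (equivalent to torch.nn.CrossEntropyLoss).
    return F.nll_loss(F.log_softmax(logits, dim=-1), labels)

def build_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 0.0001
    # Assumed scheduler: reduce the learning rate when the validation loss plateaus.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
    return optimizer, scheduler

MAX_EPOCHS = 100
EARLY_STOPPING_PATIENCE = 5  # stop if validation loss has not improved for 5 epochs
```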
4.1.2. Train and Evaluation Data Splits

We train our models on the 'train' and 'dev' parts of the ASVspoof 2019 Logical Access (LA) dataset [5]. This is consistent with most related work and also with the evaluation procedure of the ASVspoof 2019 Challenge. We test against two evaluation datasets. As in-domain evaluation data, we use the 'eval' split of ASVspoof 2019. This split contains unseen attacks, i.e., attacks not seen during training. However, the evaluation audios share certain properties with the training data [28], so model generalization cannot be assessed using the 'eval' split of ASVspoof 2019 alone. This motivates the use of our proposed 'in-the-wild' dataset (see Section 3) as unknown out-of-domain evaluation data.

4.1.3. Evaluation Metrics

We report both the equal error rate (EER) and the tandem detection cost function (t-DCF) [29] on the ASVspoof 2019 'eval' data. For consistency with related work, we use the original implementation of the t-DCF as provided for the ASVspoof 2019 challenge [30]. For our proposed dataset, we report only the EER, because t-DCF scores require the false alarm and miss costs, which are available only for ASVspoof.
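For reference, the EER can be computed from model scores as sketched below. This is a generic implementation based on scikit-learn's ROC utilities, not the official challenge scoring scripts.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where the false acceptance rate equals the
    false rejection rate (approximated here on the ROC curve)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where FPR is closest to FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Hypothetical usage: labels are 1 for spoofed and 0 for bona-fide audio,
# scores are the model's spoof scores.
# eer = equal_error_rate(labels, scores)
```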
4.2. Feature Extraction

Several architectures used in this work require preprocessing the audio data with a feature extractor (LCNN, LCNN-Attention, LCNN-LSTM, LSTM, MesoNet, MesoInception, ResNet18, Transformer). We evaluate these architectures on constant-Q transform (cqtspec [31]), log spectrogram (logspec), and mel-scaled spectrogram (melspec [32]) features (all of them 513-dimensional). We use Python, librosa [33], and scipy [34]. The remaining models do not rely on pre-processed data, but use raw audio waveforms as inputs.
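A minimal sketch of this feature extraction with librosa [33] follows; the transform parameters (e.g., FFT size, number of bins) are illustrative assumptions and not taken from the paper, which uses 513-dimensional features throughout.

```python
import librosa
import numpy as np

def extract_features(audio: np.ndarray, sr: int = 16_000, kind: str = "logspec") -> np.ndarray:
    """Return a (frames x bins) feature matrix for one utterance."""
    if kind == "cqtspec":
        # Constant-Q transform magnitudes [31] on a dB scale.
        spec = librosa.amplitude_to_db(np.abs(librosa.cqt(audio, sr=sr)))
    elif kind == "logspec":
        # Log-magnitude STFT spectrogram; n_fft=1024 yields 513 frequency bins.
        spec = librosa.amplitude_to_db(np.abs(librosa.stft(audio, n_fft=1024)))
    elif kind == "melspec":
        # Mel-scaled power spectrogram [32] on a dB scale.
        spec = librosa.power_to_db(librosa.feature.melspectrogram(y=audio, sr=sr))
    else:
        raise ValueError(f"unknown feature type: {kind}")
    return spec.T  # time-major: (frames, bins)

# Hypothetical usage:
# audio, sr = librosa.load("clip.wav", sr=16_000)
# feats = extract_features(audio, sr, kind="cqtspec")
```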
4.3. Audio Input Length

Audio samples usually vary in length, which is also the case for the data in ASVspoof 2019 and our proposed 'in-the-wild' dataset. While some models can accommodate variable-length input (and thus also fixed-length input), many cannot. We extend these by introducing a global averaging layer, which adds such capability.

In our evaluation of fixed-length input, we chose a length of four seconds, following [23]. If an input sample is longer, a random four-second subset of the sample is used. If it is shorter, the sample is repeated. To keep the evaluation fair, these shorter samples are also repeated during the full-length evaluation. This ensures that full-length input is never shorter than truncated input, but always at least 4s.
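A minimal sketch of this length handling for the fixed-length (4s) case follows; the function name and randomization details are ours.

```python
import numpy as np

def fix_length(audio: np.ndarray, sr: int = 16_000, seconds: float = 4.0) -> np.ndarray:
    """Crop a random four-second window from longer samples; repeat shorter
    samples until they reach the target length (cf. Section 4.3)."""
    target = int(sr * seconds)
    if len(audio) >= target:
        start = np.random.randint(0, len(audio) - target + 1)
        return audio[start:start + target]
    # Repeat the sample; in the full-length evaluation, short samples are likewise
    # repeated so that full-length input is never shorter than 4 seconds.
    reps = int(np.ceil(target / len(audio)))
    return np.tile(audio, reps)[:target]
```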
Model Name  Feature Type  Input Length  EER% (ASVspoof19 eval)  t-DCF (ASVspoof19 eval)  EER% (In-the-Wild Data)
LCNN cqtspec Full 6.354±0.39 0.174±0.03 65.559±11.14
LCNN cqtspec 4s 25.534±0.10 0.512±0.00 70.015±4.74
LCNN logspec Full 7.537±0.42 0.141±0.02 72.515±2.15
LCNN logspec 4s 22.271±2.36 0.377±0.01 91.110±2.17
LCNN melspec Full 15.093±2.73 0.428±0.05 70.311±2.15
LCNN melspec 4s 30.258±3.38 0.503±0.04 81.942±3.50
LCNN-Attention cqtspec Full 6.762±0.27 0.178±0.01 66.684±1.08
LCNN-Attention cqtspec 4s 23.228±3.98 0.468±0.06 75.317±8.25
LCNN-Attention logspec Full 7.888±0.57 0.180±0.05 77.122±4.91
LCNN-Attention logspec 4s 14.958±2.37 0.354±0.03 80.651±6.14
LCNN-Attention melspec Full 13.487±5.59 0.374±0.14 70.986±9.73
LCNN-Attention melspec 4s 19.534±2.57 0.449±0.02 85.118±1.01
LCNN-LSTM cqtspec Full 6.228±0.50 0.113±0.01 61.500±1.37
LCNN-LSTM cqtspec 4s 20.857±0.14 0.478±0.01 72.251±2.97
LCNN-LSTM logspec Full 9.936±1.74 0.158±0.01 79.109±0.84
LCNN-LSTM logspec 4s 13.018±3.08 0.330±0.05 79.706±15.80
LCNN-LSTM melspec Full 9.260±1.33 0.240±0.04 62.304±0.17
LCNN-LSTM melspec 4s 27.948±4.64 0.483±0.03 82.857±3.49
LSTM cqtspec Full 7.162±0.27 0.127±0.00 53.711±11.68
LSTM cqtspec 4s 14.409±2.19 0.382±0.05 55.880±0.88
LSTM logspec Full 10.314±0.81 0.160±0.00 73.111±2.52
LSTM logspec 4s 23.232±0.32 0.512±0.00 78.071±0.49
LSTM melspec Full 16.216±2.92 0.358±0.00 65.957±7.70
LSTM melspec 4s 37.463±0.46 0.553±0.01 64.297±2.23
MesoInception cqtspec Full 11.353±1.00 0.326±0.03 50.007±14.69
MesoInception cqtspec 4s 21.973±4.96 0.453±0.09 68.192±12.47
MesoInception logspec Full 10.019±0.18 0.238±0.02 37.414±9.16
MesoInception logspec 4s 16.377±3.72 0.375±0.09 72.753±6.62
MesoInception melspec Full 14.058±5.67 0.331±0.11 61.996±12.65
MesoInception melspec 4s 21.484±3.51 0.408±0.03 51.980±15.32
MesoNet cqtspec Full 7.422±1.61 0.219±0.07 54.544±11.50
MesoNet cqtspec 4s 20.395±2.03 0.426±0.06 65.928±2.57
MesoNet logspec Full 8.369±1.06 0.170±0.05 46.939±5.81
MesoNet logspec 4s 11.124±0.79 0.263±0.03 80.707±12.03
MesoNet melspec Full 11.305±1.80 0.321±0.06 58.405±11.28
MesoNet melspec 4s 21.761±0.26 0.467±0.00 64.415±15.68
ResNet18 cqtspec Full 6.552±0.49 0.140±0.01 49.759±0.17
ResNet18 cqtspec 4s 18.378±1.76 0.432±0.07 61.827±7.46
ResNet18 logspec Full 7.386±0.42 0.139±0.02 80.212±0.23
ResNet18 logspec 4s 15.521±1.83 0.387±0.02 88.729±2.88
ResNet18 melspec Full 21.658±2.56 0.551±0.04 77.614±1.47
ResNet18 melspec 4s 28.178±0.33 0.489±0.01 83.006±7.17
Transformer cqtspec Full 7.498±0.34 0.129±0.01 43.775±2.85
Transformer cqtspec 4s 11.256±0.07 0.329±0.00 48.208±1.49
Transformer logspec Full 9.949±1.77 0.210±0.06 64.789±0.88
Transformer logspec 4s 13.935±1.70 0.320±0.03 44.406±2.17
Transformer melspec Full 20.813±6.44 0.394±0.10 73.307±2.81
Transformer melspec 4s 26.495±1.76 0.495±0.00 68.407±5.53

CRNNSpoof raw Full 15.658±0.35 0.312±0.01 44.500±8.13
CRNNSpoof raw 4s 19.640±1.62 0.360±0.04 41.710±4.86
RawNet2 raw Full 3.154±0.87 0.078±0.02 37.819±2.23
RawNet2 raw 4s 4.351±0.29 0.132±0.01 33.943±2.59
RawPC raw Full 3.092±0.36 0.071±0.00 45.715±12.20
RawPC raw 4s 3.067±0.91 0.097±0.03 52.884±6.08
RawGAT-ST raw Full 1.229±0.43 0.036±0.01 37.154±1.95
RawGAT-ST raw 4s 2.297±0.98 0.074±0.03 38.767±1.28

Table 1: Full results of evaluation on the ASVspoof 2019 LA ‘eval’ data. We compare different model architectures against different
feature types and audio input lengths (4s, fixed-sized inputs vs. variable-length inputs). Results are averaged over three independent
trials with random initialization, and the standard deviation is reported. Best-performing configurations are highlighted in boldface.
When evaluating the models on our proposed ‘in-the-wild’ dataset, we see an increase in EER by up to 1000% compared to ASVspoof
2019 (rightmost column).
Input Length  EER% (ASVspoof19 eval)  t-DCF (ASVspoof19 eval)  EER% (In-the-Wild Data)
Full  9.85  0.22  60.10
4s  18.89  0.39  67.25

Table 2: Model performance averaged by input preprocessing. Fixed-length, 4s inputs perform significantly worse on the ASVspoof data and on the 'in-the-wild' dataset than variable-length inputs. This suggests that related work using fixed-length inputs may (unnecessarily) sacrifice performance.
5. Results

Table 1 shows the results of our experiments, where we evaluate all models against all configurations of data preprocessing: we train twelve different models, using one of four different feature types, with two different ways of handling variable-length audio. Each experiment is performed three times, using random initialization. We report averaged EER and t-DCF, as well as the standard deviation. We observe that on ASVspoof, our implementations perform comparably to related work, within a margin of approximately 2-4% EER and 0.1 t-DCF. This is likely because we do not fine-tune our models' hyper-parameters.

5.1. Fixed vs. Variable Input Length

We analyze the effects of truncating the input signal to a fixed length compared to using the full, unabridged audio. For all models, performance decreases when the input is trimmed to 4s. Table 2 averages all results based on input length. We see that the average EER on ASVspoof drops from 18.89% to 9.85% when the full-length input is used. These results show that a four-second clip is insufficient for the model to extract useful information compared to using the full audio file as input. Therefore, we propose not to use fixed-length truncated inputs, but to provide the full audio file to the model. This may seem obvious, but the numerous works that use fixed-length inputs [23, 25, 26] suggest otherwise.

5.2. Effects of Feature Extraction Techniques

We discuss the effects of different feature preprocessing techniques, cf. Table 1: The 'raw' models outperform the feature-based models, obtaining up to 1.2% EER on ASVspoof and 33.9% EER on the 'in-the-wild' dataset (RawGAT-ST and RawNet2). The spectrogram-based models perform slightly worse, achieving up to 6.3% EER on ASVspoof and 37.4% on the 'in-the-wild' dataset (LCNN and MesoNet). The superiority of the 'raw' models is assumed to be due to a finer feature-extraction resolution than that of the spectrogram-based models [26]. This has led recent research to focus largely on such raw-feature, end-to-end models [25, 26].

Concerning the spectrogram-based models, we observe that melspec features are always outperformed by either cqtspec or logspec. Simply replacing melspec with cqtspec increases the average performance by 37%, all other factors constant.

5.3. Evaluation on 'in-the-wild' Data

Especially interesting is the performance of the models on real-world deepfake data. Table 1 shows the performance of our models on the 'in-the-wild' dataset. We see that there is a large performance gap between the ASVspoof 2019 evaluation data and our proposed 'in-the-wild' dataset. In general, the EER values of the models deteriorate by about 200 to 1000 percent. Often, the models do not perform better than random guessing.

To investigate this further, we train our best 'in-the-wild' model from Table 1, RawNet2 with 4s input length, on all data from ASVspoof 2019, i.e., the 'train', 'dev', and 'eval' splits. We then re-evaluate on the 'in-the-wild' dataset to investigate whether adding more ASVspoof training data improves out-of-domain performance. We achieve 33.1 ± 0.2% EER, i.e., no improvement over training with only the 'train' and 'dev' data. The inclusion of the 'eval' split does not seem to add much information that could be used for real-world generalization. This is plausible in that all splits of ASVspoof are fundamentally based on the same dataset, VCTK, although the synthesis algorithms and speakers differ between splits [5].

6. Conclusion

In this paper, we systematically evaluate audio spoof detection models from related work according to common standards. In addition, we present a new audio deepfake dataset of 'in-the-wild' audio spoofs that we use to evaluate the generalization capabilities of related work in a real-world scenario.

We find that regardless of the model architecture, some preprocessing steps are more successful than others. It turns out that the use of cqtspec or logspec features consistently outperforms the use of melspec features in our comprehensive analysis. Furthermore, we find that for most models, four seconds of input audio does not saturate performance compared to longer examples. Therefore, we argue that one should consider using cqtspec features and unabridged input audio when designing audio deepfake detection architectures.

Most importantly, however, we find that the 'in-the-wild' generalization capabilities of many models may have been overestimated. We demonstrate this by collecting our own audio deepfake dataset and evaluating twelve different model architectures on it. Performance drops sharply, and some models degenerate to random guessing. It may be possible that the community has tailored its detection models too closely to the prevailing benchmark, ASVspoof, and that deepfakes are much harder to detect outside the lab than previously thought.
7. References

[1] "Audio Deep Fake: Demonstrator entwickelt am Fraunhofer AISEC - YouTube," https://www.youtube.com/watch?v=MZTF0eAALmE (accessed 04/01/2021).

[2] "Deepfake video of Volodymyr Zelensky surrendering surfaces on social media - YouTube," https://www.youtube.com/watch?v=X17yrEV5sl4 (accessed 03/23/2022).

[3] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[4] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: A fully end-to-end text-to-speech synthesis model," CoRR, vol. abs/1703.10135, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135

[5] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.

[6] A. Nautsch, X. Wang, N. Evans, T. H. Kinnunen, V. Vestman, M. Todisco, H. Delgado, M. Sahidullah, J. Yamagishi, and K. A. Lee, "ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech," vol. 3, no. 2, pp. 252–265.

[7] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.

[8] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, "A gated recurrent convolutional neural network for robust spoofing detection," vol. 27, no. 12, pp. 1985–1999.

[9] A. Chintha, B. Thai, S. J. Sohrawardi, K. M. Bhatt, A. Hickerson, M. Wright, and R. Ptucha, "Recurrent convolutional structures for audio spoof and video deepfake detection," pp. 1–1.

[10] L. Zhang, X. Wang, E. Cooper, J. Yamagishi, J. Patino, and N. Evans, "An initial investigation for detecting partially spoofed audio," arXiv preprint arXiv:2104.02518, 2021.

[11] S. Tambe, A. Pawar, and S. Yadav, "Deep fake videos identification using ANN and LSTM," Journal of Discrete Mathematical Sciences and Cryptography, vol. 24, no. 8, pp. 2353–2364, 2021.

[12] X. Wang and J. Yamagishi, "A comparative study on recent neural spoofing countermeasures for synthetic speech detection." [Online]. Available: http://arxiv.org/abs/2103.11326

[13] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, "Audio replay attack detection with deep learning frameworks," in Interspeech 2017. ISCA, pp. 82–86. [Online]. Available: http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0360.html

[14] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," in Interspeech 2019. ISCA, pp. 1033–1037. [Online]. Available: http://www.isca-speech.org/archive/Interspeech_2019/abstracts/1768.html

[15] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "MesoNet: A compact facial video forgery detection network," in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7.

[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.

[17] M. Alzantot, Z. Wang, and M. B. Srivastava, "Deep residual neural networks for audio spoofing detection," in Interspeech 2019. ISCA, pp. 1078–1082. [Online]. Available: http://www.isca-speech.org/archive/Interspeech_2019/abstracts/3174.html

[18] Y. Zhang, F. Jiang, and Z. Duan, "One-class learning towards synthetic voice spoofing detection," IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021.

[19] J. Monteiro, J. Alam, and T. H. Falk, "Generalized end-to-end detection of spoofing attacks to automatic speaker recognizers," Computer Speech & Language, vol. 63, p. 101096, 2020.

[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[21] Z. Zhang, X. Yi, and X. Zhao, "Fake speech detection using residual network with transformer encoder," in Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security, 2021, pp. 13–22.

[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.

[23] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-to-end anti-spoofing with RawNet2," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373.

[24] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028.

[25] W. Ge, J. Patino, M. Todisco, and N. Evans, "Raw differentiable architecture search for speech deepfake and spoofing detection," arXiv preprint arXiv:2107.12212, 2021.

[26] H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, "End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection," arXiv preprint arXiv:2107.12710, 2021.

[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980

[28] N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, J. Williams, and K. Böttinger, "Speech is silver, silence is golden: What do ASVspoof-trained models really learn?" [Online]. Available: http://arxiv.org/abs/2106.12914

[29] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, "t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification," in Odyssey 2018: The Speaker and Language Recognition Workshop. ISCA, pp. 312–319.

[30] "t-DCF official implementation," https://www.asvspoof.org/asvspoof2019/tDCF_python_v1.zip (accessed 03/03/2022).

[31] J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.

[32] S. S. Stevens, J. Volkmann, and E. B. Newman, "A scale for the measurement of the psychological magnitude pitch," The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.

[33] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, vol. 8. Citeseer, 2015, pp. 18–25.

[34] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright et al., "SciPy 1.0: Fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, no. 3, pp. 261–272, 2020.
