Abstract— In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods of Short-Time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters of Mel, Gammatone, linear filters (LF), and the discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train directly on the spectrograms using our proposed baseline models: a CNN-based model (CNN-baseline), an RNN-based model (RNN-baseline), and a C-RNN model (C-RNN baseline). The second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In the third approach, we leverage the state-of-the-art pre-trained audio models Whisper, Seamless, SpeechBrain, and Pyannote to extract audio embeddings from the input spectrograms. The audio embeddings are then explored by a multilayer perceptron (MLP) model to detect fake or real audio samples. Finally, high-performing deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on the ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive with top-performing systems in the ASVspoof 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance the task of audio deepfake detection.

Index Terms— deepfake audio, deep learning model, spectrogram, ASVspoof dataset.

L. Pham and A. Schindler are with the Austrian Institute of Technology, Vienna, Austria.
P. Lam and T. Nguyen are with HCM University of Technology, Ho Chi Minh City, Vietnam.
H. Nguyen is with Tokyo University of Agriculture and Technology, Tokyo, Japan.
(*) Main and equal contribution to the paper.
generalization and domain adaptation, such as transfer
I. INTRODUCTION
learning and leveraging embeddings from large pre-trained
Sound-based applications represent a revolutionary audio models, have not been extensively explored. To
paradigm in the rapidly evolving landscape of Internet tackle these mentioned limitations, we therefore propose an
of Sound (IoS) technology, where audio signals serve ensemble of deep learning based models for audio deepfake
as the primary medium for data transmission, control, detection task, which is achieved via a comprehensive
and interaction among interconnected devices [1], [2]. analysis in terms of multiple spectrogram-based features
Voice-activated module in an IoS system, such as smart and deep learning approaches. Our key contributions can be
home devices, voice banking, home automation systems, highlighted as:
and virtual assistants, relies on recognizing the user’s
voice to activate critical functions and generally involve • Evaluated the efficacy different spectrograms in combi-
confidential information. However, with the advancement nation with auditory filters to model performance.
of deep learning technologies, the emergence of spoofing • Evaluated a wide range of architectures leveraging both
speech attacks, commonly referred to as ’Deepfake’, has transfer learning and end-to-end networks.
become more prevalent. These attacks involve various • Explored the performance of audio embeddings ex-
tracted from state-of-the-art pre-trained models (e.g.
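To make the feature-extraction front end concrete, the sketch below computes an STFT-based log-Mel spectrogram, a CQT spectrogram, and a DCT-compressed representation. It is a minimal illustration assuming the librosa and scipy libraries; the parameter values are placeholders rather than the settings used in our experiments, and the Gammatone- and Wavelet-based variants (which require extra packages) are omitted.

```python
import librosa
import numpy as np
import scipy.fftpack

def extract_spectrograms(path, sr=16000, n_fft=1024, hop=256, n_mels=128):
    """Illustrative STFT-Mel, CQT, and DCT features (placeholder settings)."""
    y, sr = librosa.load(path, sr=sr)

    # STFT magnitude -> Mel auditory filterbank -> log (dB) scale
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Constant-Q transform magnitude on a log (dB) scale
    log_cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr, hop_length=hop)))

    # DCT along the frequency axis (MFCC-style decorrelation), keep 40 coefficients
    dct_feat = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:40]

    return log_mel, log_cqt, dct_feat
```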
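For the transfer-learning approach, a standard recipe is to load an ImageNet-pretrained backbone [12] and replace its classification head with a two-class (fake/real) output. The sketch below shows this for ResNet-18 using PyTorch/torchvision; replicating the single-channel spectrogram to three channels is one common way to match the pretrained input format, and the code is illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-18 with a new two-class (fake/real) head
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

def classify_spectrogram(spec: torch.Tensor) -> torch.Tensor:
    """spec: (batch, freq, time) single-channel spectrograms -> class logits."""
    x = spec.unsqueeze(1).repeat(1, 3, 1, 1)  # replicate to 3 input channels
    return backbone(x)

# Example with a dummy batch of 128x128 log-Mel spectrograms
probs = classify_spectrogram(torch.randn(4, 128, 128)).softmax(dim=-1)
```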
[Figure: the end-to-end ensemble approach, in which all models are fused to produce the output scores p_fake and p_real]
[TABLE I: Baselines (CNN, RNN, C-RNN) — table body not recovered]
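For the third approach, embeddings from a large pre-trained audio model are classified by an MLP. The sketch below is one plausible realization using the Whisper encoder [13] via the Hugging Face transformers library, with mean pooling over time; the checkpoint, pooling strategy, and MLP topology here are illustrative assumptions, not necessarily our exact setup.

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

# Frozen pre-trained Whisper encoder used as an audio feature extractor
extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Small MLP head on top of mean-pooled encoder embeddings
mlp = nn.Sequential(
    nn.Linear(whisper.config.d_model, 256), nn.ReLU(),
    nn.Linear(256, 2),  # fake/real logits
)

@torch.no_grad()
def embed(waveform: np.ndarray, sr: int = 16000) -> torch.Tensor:
    """Mean-pooled Whisper encoder embedding for one utterance."""
    feats = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    hidden = whisper.encoder(feats.input_features).last_hidden_state
    return hidden.mean(dim=1)  # (1, d_model)

# Example with one second of dummy audio
logits = mlp(embed(np.random.randn(16000).astype(np.float32)))
```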
improvement. Our best-performing system (A2, A4, A6, A7) achieves an EER score and an AUC of 0.03 and 0.994, respectively, placing in the top 3 in terms of EER score in the ASVspoof 2019 challenge [6]. These results highlight the strength of the ensemble technique, leveraging multiple spectrogram analyses for feature extraction and deep learning models for pattern recognition.
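The score-level fusion behind these ensemble results can be summarized in a few lines: the per-model probabilities of the fake class are averaged, and the EER is read off the ROC curve at the operating point where the false-positive and false-negative rates coincide. A minimal sketch assuming scikit-learn, with illustrative variable names:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def fuse_and_score(model_probs, labels):
    """model_probs: list of per-model p_fake arrays; labels: 1 = fake, 0 = real."""
    fused = np.mean(model_probs, axis=0)  # average (score-level) fusion

    # EER: operating point where false-positive rate equals false-negative rate
    fpr, tpr, _ = roc_curve(labels, fused)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]

    return eer, roc_auc_score(labels, fused)
```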
IV. CONCLUSION

This paper has evaluated the efficacy of a wide range of spectrograms and deep learning approaches for deepfake audio detection. By establishing an ensemble of selective spectrograms and models, our best system achieves an EER score of 0.03 on the LA dataset of the ASVspoof 2019 challenge, which is very competitive with state-of-the-art systems. Additionally, our comprehensive evaluation also indicates the potential of certain types of spectrograms (e.g., STFT) and deep learning approaches (e.g., CNN-based models, fine-tuning pre-trained models), which can provide initial guidance for deepfake audio detection.
REFERENCES

[1] Luca Turchet et al., "The Internet of Sounds: Convergent trends, insights, and future directions," IEEE Internet of Things Journal, vol. 10, no. 13, pp. 11264–11292, 2023.
[2] Luca Turchet et al., "The Internet of Audio Things: State of the art, vision, and challenges," IEEE Internet of Things Journal, vol. 7, no. 10, pp. 10233–10249, 2020.
[3] Zhizheng Wu et al., "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[4] Jiangyan Yi et al., "SceneFake: An initial dataset and benchmarks for scene fake audio detection," Pattern Recognition, vol. 152, p. 110468, 2024.
[5] Yan Zhao et al., "EmoFake: An initial dataset for emotion fake audio detection," 2023.
[6] Massimiliano Todisco et al., "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
[7] Jiangyan Yi et al., "ADD 2022: The first Audio Deep synthesis Detection challenge," in Proc. ICASSP, 2022, pp. 9216–9220.
[8] Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.
[9] Nanxin Chen et al., "Robust deep feature for spoofing detection — the SJTU system for ASVspoof 2015 challenge," in Proc. Interspeech, 2015, pp. 2097–2101.
[10] Alejandro Gomez-Alanis et al., "A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection," in Proc. Interspeech, 2019, pp. 1068–1072.
[11] Akash Chintha et al., "Recurrent convolutional structures for audio spoof and video deepfake detection," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 1024–1037, 2020.
[12] Jia Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, 2009, pp. 248–255.
[13] Alec Radford et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
[14] Loïc Barrault et al., "Seamless: Multilingual expressive and streaming speech translation," arXiv preprint arXiv:2312.05187, 2023.
[15] Mirco Ravanelli et al., "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[16] Alexis Plaquet and Hervé Bredin, "Powerset multi-class cross entropy loss for neural speaker diarization," in Proc. INTERSPEECH, 2023.
[17] Hervé Bredin, "pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark, and recipe," in Proc. INTERSPEECH, 2023.
[18] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.