
Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Lam Pham1*, Phat Lam2*, Truong Nguyen3, Huyen Nguyen4, Alexander Schindler5

Abstract— In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and the discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach trains the spectrograms directly with our proposed baseline models: a CNN-based model (CNN baseline), an RNN-based model (RNN baseline), and a C-RNN model (C-RNN baseline). The second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In the third approach, we leverage the state-of-the-art audio pre-trained models Whisper, Seamless, SpeechBrain, and Pyannote to extract audio embeddings from the input spectrograms. The audio embeddings are then explored by a Multilayer Perceptron (MLP) model to detect fake or real audio samples. Finally, high-performance deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on the ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive with top-performing systems in the ASVspoof 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance the task of audio deepfake detection.

Index Terms— deepfake audio, deep learning model, spectrogram, ASVspoof dataset.

I. INTRODUCTION

Sound-based applications represent a revolutionary paradigm in the rapidly evolving landscape of Internet of Sound (IoS) technology, where audio signals serve as the primary medium for data transmission, control, and interaction among interconnected devices [1], [2]. Voice-activated modules in an IoS system, such as smart home devices, voice banking, home automation systems, and virtual assistants, rely on recognizing the user's voice to activate critical functions and generally involve confidential information. However, with the advancement of deep learning technologies, spoofing speech attacks, commonly referred to as 'Deepfake', have become more prevalent. These attacks involve various AI-based speech synthesis techniques (e.g., Text to Speech [3], Voice Conversion [3], Scene Fake [4], Emotion Fake [5]), posing significant threats to the integrity and authenticity of voice-activated systems. Consequently, the detection of audio deepfakes has become a crucial area of research, drawing considerable attention from the research community. Several benchmark datasets and the challenges built on them, such as ASVspoof [6] and Audio Deep synthesis Detection (ADD) [7], have been proposed, which facilitates the creation of various systems and techniques to handle this task. Existing studies can be divided into two kinds: pipeline solutions (consisting of a front-end feature extractor and a back-end classifier) and end-to-end solutions [8]. The top-performing systems using these two methods in the ASVspoof and ADD competitions are mainly score-level fusion systems [8]. However, these systems lack a comprehensive evaluation of how individual spectrograms and classifiers affect overall performance, which is crucial for motivating and directing further research. Other successful systems utilize deep features obtained through various supervised embedding methods, such as DNNs [9] and RNNs [10]. Despite their effectiveness, these embeddings are trained on specific datasets and may suffer from overfitting and susceptibility to adversarial attacks. This reduces the model's ability to generalize to new, unseen data, particularly when the dataset is not sufficiently large or diverse. Meanwhile, other approaches that can manage generalization and domain adaptation, such as transfer learning and leveraging embeddings from large pre-trained audio models, have not been extensively explored. To tackle these limitations, we therefore propose an ensemble of deep learning based models for the audio deepfake detection task, built on a comprehensive analysis of multiple spectrogram-based features and deep learning approaches. Our key contributions can be highlighted as follows:

• Evaluated the efficacy of different spectrograms in combination with auditory filters on model performance.
• Evaluated a wide range of architectures leveraging both transfer learning and end-to-end networks.
• Explored the performance of audio embeddings extracted from state-of-the-art pre-trained models (e.g., Whisper, SpeechBrain, Pyannote) on deepfake detection.
• Proposed an ensemble model built from selective spectrograms and models identified in the experiments, indicating research directions for further improving the task of deepfake audio detection.

(Affiliations: L. Pham and A. Schindler are with the Austrian Institute of Technology, Vienna, Austria. P. Lam and T. Nguyen are with HCM University of Technology, Ho Chi Minh City, Vietnam. H. Nguyen is with Tokyo University of Agriculture and Technology, Tokyo, Japan. (*) Main and equal contribution to the paper.)
Fig. 1. The high-level architecture of the proposed deep learning based system for deepfake audio detection: an input recording is split into 2-second segments, the segments are transformed into spectrograms (64x64x3), and three branches (the end-to-end approach with the proposed baselines, the finetuning approach with benchmark network architectures, and the audio-embedding approach with audio pre-trained models followed by an MLP) produce per-segment probabilities (p_fake, p_real) that are finally ensembled.

TABLE I
THE CNN, RNN, AND C-RNN BASELINE NETWORK ARCHITECTURES

Models           Configuration
CNN baseline     3 × {Conv(32/64/128)-ReLU-AP-Dropout(0.2)}
                 1 × {Dense(256)-ReLU-Dropout(0.2)}
                 1 × {Dense(2)-Softmax}
RNN baseline     2 × {BiLSTM(128/64)-ReLU-Dropout(0.2)}
                 1 × {Dense(256)-ReLU-Dropout(0.2)}
                 1 × {Dense(2)-Softmax}
C-RNN baseline   3 × {Conv(32/64/128)-ReLU-AP-Dropout(0.2)}
                 2 × {BiLSTM(128/64)-ReLU-Dropout(0.2)}
                 1 × {Dense(256)-ReLU-Dropout(0.2)}
                 1 × {Dense(2)-Softmax}

II. PROPOSED DEEP LEARNING BASED SYSTEMS
The high-level architecture of the proposed deep learning based system for audio deepfake detection, shown in Fig. 1, comprises two main parts: front-end spectrogram-based feature extraction and a back-end deep learning model for classification. In particular, the raw input audio recordings are first split into 2-second segments. This segment length generally provides sufficient context to capture important features and allows faster training and inference for applications requiring real-time detection. Next, the 2-second audio segments are transformed into spectrograms. Finally, the spectrograms are explored by back-end deep learning models to detect real or fake audio segments.
Three deep learning based approaches are proposed in this paper. The first approach, shown in the upper part of Fig. 1 and referred to as the end-to-end approach, trains the proposed models on the input spectrograms directly. In the second approach, shown in the middle part of Fig. 1 and referred to as the finetuning approach, we fine-tune benchmark network architectures that are popularly used in the computer vision domain. In the third approach, shown in the lower part of Fig. 1, we leverage state-of-the-art pre-trained models that were trained on large audio datasets in advance. We then feed the spectrogram inputs into these audio pre-trained models to obtain audio embeddings, and the audio embeddings are finally classified into either the real or the fake class by a Multilayer Perceptron (MLP). We refer to this approach as the audio-embedding approach. Finally, individual high-performance models from the three approaches are selected and fused to achieve the best performance.
A. Spectrogram-based Feature Extraction

Fig. 2 presents how six different spectrograms are generated in this paper. In particular, the six spectrograms are generated from three transformation methods: Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT). Presumably, each type of spectrogram focuses on a different perspective of the frequency content and might catch different inconsistencies in the audio signal. The combination of these spectrograms allows the model to learn a broader range of features and patterns, potentially improving its ability to generalize and detect deepfakes. Additionally, we also establish different auditory-based filters: Mel and Gammatone filters focus on subtle variations relevant to human auditory perception, while linear filters (LF) isolate specific frequency bands. Integrating these filters alongside the pre-defined spectrograms enriches the available features and further enhances the robustness of the detection system to variations.
As we use the same settings for the window length, the hop length, and the filter number (1024, 512, and 64, respectively) for all spectrograms, the generated spectrograms present the same tensor shape of 64×64. Then, DCT is applied to the spectrograms across the temporal dimension. Finally, we apply delta and delta-delta to these spectrograms, generating a three-dimensional tensor of 64×64×3 (i.e. the original spectrogram, delta, and delta-delta are concatenated across the third dimension).
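To make the front end concrete, the sketch below implements the STFT-with-Mel-filter variant using librosa and the settings stated above (window length 1024, hop length 512, 64 filters). The 16 kHz sampling rate, the log-power scaling, and the use of librosa itself are our assumptions; the CQT/Wavelet transforms, the Gammatone and linear filter banks, and the DCT step would slot into the same stacking pattern.

```python
import numpy as np
import librosa

def split_segments(y, sr=16000, seg_dur=2.0):
    """Split a recording into non-overlapping 2-second segments (zero-padding the tail)."""
    seg_len = int(seg_dur * sr)
    pad = (-len(y)) % seg_len
    y = np.pad(y, (0, pad))
    return y.reshape(-1, seg_len)

def melspec_feature(segment, sr=16000, n_fft=1024, hop=512, n_filters=64):
    """STFT + Mel-filter variant of the front end: a log-Mel spectrogram stacked
    with its delta and delta-delta, giving roughly a 64 x 64 x 3 tensor per segment."""
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_filters)
    logmel = librosa.power_to_db(mel)                 # (64, T), T is about 64 for 2 s at 16 kHz
    delta1 = librosa.feature.delta(logmel, order=1)   # delta
    delta2 = librosa.feature.delta(logmel, order=2)   # delta-delta
    return np.stack([logmel, delta1, delta2], axis=-1)

# Usage sketch (hypothetical file path):
# y, sr = librosa.load("sample_audio.flac", sr=16000)
# features = [melspec_feature(seg, sr) for seg in split_segments(y, sr)]
```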
Fig. 2. Generating spectrograms from a 2-second segment using different transformation methods (Wavelet, STFT, CQT) and auditory filter models (Mel, linear, and Gammatone filters).

B. End-to-end deep learning approach

Regarding the end-to-end deep learning approach, we propose three baseline models, a CNN-based model, an RNN-based model, and a C-RNN-based model, which are referred to as the CNN baseline, RNN baseline, and C-RNN baseline, respectively. The detailed configurations of these baselines are presented in Table I. CNNs are the most common architecture for this task and can effectively capture and learn spectral features within local frequency bands, such as harmonic structures, formants, pitch variations, and high-frequency artifacts. Meanwhile, RNNs focus on detecting natural sequential patterns that can be disrupted in synthetic audio [11] (e.g. temporal coherence, and prosodic features such as rhythm, stress, and intonation). Consequently, the C-RNN baseline is used with the expectation of combining both spectral and temporal features to distinguish the characteristics of deepfake audio.
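A minimal PyTorch sketch of the CNN baseline from Table I is given below. The 3×3 kernels and the reading of "AP" as 2×2 average pooling are our assumptions, since the paper lists only the layer types, channel widths, and dropout rates; the RNN and C-RNN baselines would add the BiLSTM layers from the same table.

```python
import torch
import torch.nn as nn

class CNNBaseline(nn.Module):
    """Sketch of the CNN baseline in Table I: three Conv-ReLU-AvgPool-Dropout blocks,
    then Dense(256)-ReLU-Dropout and a 2-way output."""
    def __init__(self, n_classes=2):
        super().__init__()
        chans = [3, 32, 64, 128]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # kernel size assumed
                       nn.ReLU(),
                       nn.AvgPool2d(2),          # "AP" read as 2x2 average pooling
                       nn.Dropout(0.2)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256),          # 64x64 input becomes 8x8 after three poolings
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, n_classes))            # softmax applied at inference / inside the loss

    def forward(self, x):                         # x: (batch, 3, 64, 64)
        return self.classifier(self.features(x))

# Usage sketch:
# probs = CNNBaseline()(torch.randn(8, 3, 64, 64)).softmax(dim=-1)
```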
C. Transfer learning approach

Additionally, we also evaluate a wide range of benchmark network architectures from the computer vision domain: ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In particular, these networks were trained on the ImageNet-1K dataset [12] in advance. Their pre-trained weights capture rich and generalized features for pattern recognition in images, which can potentially be adapted to identifying patterns in spectrograms via parameter finetuning. In this approach, the final dense layer of each of these networks is modified to match the binary classification task of deepfake audio detection before conducting the fine-tuning process.
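As an illustration, the snippet below shows this head replacement for ResNet-18 using torchvision; the choice of ResNet-18 as the example and the torchvision weights enum are ours, and the paper does not detail the fine-tuning recipe (optimizer, learning rate, or whether any layers are frozen).

```python
import torch.nn as nn
from torchvision import models

# Load ResNet-18 with ImageNet-1K weights and swap the final dense layer
# for a 2-class (real/fake) output before fine-tuning on spectrograms.
net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
net.fc = nn.Linear(net.fc.in_features, 2)

# The same pattern applies to the other backbones, although the name of the
# classification head differs (e.g. `classifier[-1]` for MobileNet-V3 and
# EfficientNet-B0, `head` for Swin-T in torchvision).
```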
D. Audio-embedding deep learning approach

In the audio-embedding deep learning approach, we leverage the state-of-the-art audio pre-trained models Whisper [13], Seamless [14], SpeechBrain [15], and Pyannote [16], [17]. These pre-trained models are utilized for their ability to capture robust and high-level feature representations of genuine speakers in practice, such as pitch, tone, accent, and intonation, learned from their diverse training data. This capability is crucial for distinguishing between real and fake audio. Therefore, the spectrogram inputs are first fed into these pre-trained models to obtain audio embeddings. Given the audio embeddings, we propose a Multilayer Perceptron (MLP), as shown in Table II, to detect real or fake audio.

TABLE II
THE AUDIO PRE-TRAINED MODELS AND THE MULTILAYER PERCEPTRON

Models               License      Embedding size / configuration
Whisper [13]         MIT          512
SpeechBrain [15]     Apache-2.0   192
SeamLess [14]        MIT          1024
Pyannote [16], [17]  MIT          512
MLP (our proposal)                1 × {Dense(128)-ReLU}; 1 × {Dense(2)-Softmax}
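For illustration, the sketch below pairs the MLP head from Table II with an embedding pooled from the openai-whisper encoder. The temporal mean pooling and the choice of the "base" checkpoint (whose 512-dimensional encoder matches the embedding size in Table II) are our assumptions, as the paper does not describe how the embeddings are aggregated; here the embedding is computed from the raw 2-second segment using Whisper's own mel front end.

```python
import torch
import torch.nn as nn
import whisper  # openai-whisper, assumed backbone; "base" has a 512-dim encoder

class EmbeddingMLP(nn.Module):
    """MLP head from Table II: Dense(128)-ReLU followed by a 2-way output."""
    def __init__(self, emb_dim=512, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, emb):
        return self.net(emb)

def whisper_embedding(audio_16k, model):
    """One plausible way to pool a fixed-size embedding from the Whisper encoder."""
    audio = whisper.pad_or_trim(torch.as_tensor(audio_16k, dtype=torch.float32))
    mel = whisper.log_mel_spectrogram(audio)          # (80, 3000)
    with torch.no_grad():
        enc = model.encoder(mel.unsqueeze(0))         # (1, 1500, 512) for the "base" model
    return enc.mean(dim=1)                            # temporal mean pooling -> (1, 512)

# Usage sketch:
# model = whisper.load_model("base")
# probs = EmbeddingMLP()(whisper_embedding(segment, model)).softmax(dim=-1)
```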
E. Ensemble of models

As an individual model works on 2-second audio segments, the predicted probability of an entire audio recording is computed by averaging the predicted probabilities over all 2-second segments. Consider $p^{(n)} = [p_1^{(n)}, p_2^{(n)}, \ldots, p_C^{(n)}]$, the predicted probability of the $n$-th out of $N$ 2-second segments in one audio recording, with $C$ being the number of categories. The probability of the entire audio recording is the average classification probability $\bar{p} = [\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_C]$, where

$$\bar{p}_c = \frac{1}{N}\sum_{n=1}^{N} p_c^{(n)} \quad \text{for } 1 \le c \le C. \quad (1)$$

To ensemble the results from individual models, we propose a MEAN fusion. In particular, we first conduct experiments on the individual models and obtain the recording-level probabilities $\bar{p}^{\,s} = (\bar{p}_1^{\,s}, \bar{p}_2^{\,s}, \ldots, \bar{p}_C^{\,s})$, where $C$ is the number of categories and $s$ indexes the $S$ individual models evaluated. Next, the predicted probability after MEAN fusion, $\hat{p}_{f\text{-}mean} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_C)$, is obtained by

$$\hat{p}_c = \frac{1}{S}\sum_{s=1}^{S} \bar{p}_c^{\,s} \quad \text{for } 1 \le c \le C. \quad (2)$$

Finally, the predicted label $\hat{y}$ for an entire audio sample is determined as

$$\hat{y} = \arg\max(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_C). \quad (3)$$
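The fusion scheme in Eqs. (1) to (3) reduces to simple array averaging; a minimal NumPy sketch (with hypothetical variable names) is:

```python
import numpy as np

def recording_probability(segment_probs):
    """Eq. (1): average per-2-second-segment probabilities (N x C) into one
    recording-level probability vector of shape (C,)."""
    return np.mean(segment_probs, axis=0)

def mean_fusion(model_probs):
    """Eq. (2): MEAN fusion of recording-level probabilities from S models (S x C)."""
    return np.mean(model_probs, axis=0)

def predict_label(fused_probs):
    """Eq. (3): the final label is the argmax over the fused class probabilities."""
    return int(np.argmax(fused_probs))

# Usage sketch:
# p_bar = [recording_probability(p) for p in per_model_segment_probs]
# y_hat = predict_label(mean_fusion(np.stack(p_bar)))
```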
III. EXPERIMENTS AND RESULTS

A. Datasets and Evaluation Metrics

We evaluate the proposed models on the Logical Access (LA) dataset of the ASVspoof 2019 challenge. The LA dataset comprises three subsets (fake samples/real samples): 'Train' (22800/2580), 'Develop' (22296/2548), and 'Evaluation' (63882/7355), in which the fake audio was generated by 19 AI-based generative systems. The models are trained on the 'Train' subset, then evaluated and selected on the 'Develop' subset. Finally, the models are tested on the 'Evaluation' subset, and the final results on this subset are reported.
Following the ASVspoof 2019 challenge protocol, we use the Equal Error Rate (EER) as the main metric for evaluating the proposed models. We also report the Accuracy, F1 score, and AUC score to compare the performance among the proposed models.

B. Results and Discussion

Evaluation of spectrogram inputs: Considering the efficacy of feature extraction among the proposed spectrogram inputs (i.e. systems A1 to A6 in Table III), STFT outperforms the other spectrograms (systems A1, A4, and A6 achieve the best EER score of 0.08, while the combination of STFT & LF obtains slightly better accuracy and F1 scores of 0.88 and 0.90, respectively). This result suggests that STFT is often better suited for identifying deepfake artifacts due to its uniform resolution in time and frequency [18], while the interpretable features extracted from linearly filtered signals are suitable for classification algorithms.
Multiple deep learning approaches: Regarding the end-to-end deep learning approach (A1 to B2), the RNN and C-RNN baselines obtain EER scores of 0.17 and 0.14, respectively, significantly worse than using only the CNN, whose best score is 0.08. This indicates that the specific patterns indicative of deepfake audio might not be primarily temporal but rather spatial in the spectrogram representation. In the finetuning and audio-embedding-based approaches (C1 to C10 and D1 to D4), Swin-T, ConvNeXt-Tiny, and Whisper stand out as the best systems within their respective approaches, with competitive EER scores of 0.09, 0.075, and 0.10, respectively. This suggests the potential of these approaches when appropriate networks are chosen.
Ensembles: The experimental results presented in Table III underscore the significant effectiveness of ensemble techniques in detecting audio deepfakes. Specifically, the combination of the STFT and CQT spectrograms (A1+A2) achieves an EER score of 0.06, marking an improvement of 0.02 compared to the best systems utilizing single spectrograms. Similarly, ensembles of models show slight enhancements; for example, the combination of the CNN and ConvNeXt-Tiny (A4+C7) reduces the EER by 0.01 and 0.005 compared to the individual models. These findings suggest that diverse feature extraction via ensembling multiple spectrograms enhances overall performance more substantially than evaluating a wide range of models on a single spectrogram. Importantly, the ensemble of both spectrograms and models demonstrates significant improvement. Our best-performing system (A2, A4, A6, and C7) achieves an EER and AUC of 0.03 and 0.994, respectively, placing it in the top 3 in terms of EER in the ASVspoof 2019 challenge [6]. These results highlight the strength of the ensemble technique, leveraging multiple spectrogram analyses for feature extraction and deep learning models for pattern recognition.

TABLE III
PERFORMANCE COMPARISON AMONG DEEP LEARNING MODELS AND ENSEMBLES OF HIGH-PERFORMANCE MODELS ON THE LOGICAL ACCESS EVALUATION SUBSET OF ASVSPOOF 2019
Systems Spectrograms Models Acc F1 AUC EER
A1 STFT CNN 0.87 0.89 0.96 0.08
A2 CQT CNN 0.89 0.90 0.92 0.14
A3 WT CNN 0.84 0.86 0.89 0.17
A4 STFT & LF CNN 0.88 0.90 0.96 0.08
A5 STFT & MEL CNN 0.86 0.88 0.95 0.11
A6 STFT & GAM CNN 0.85 0.87 0.96 0.08
B1 STFT & LF RNN 0.92 0.91 0.88 0.17
B2 STFT & LF CRNN 0.88 0.90 0.96 0.14
C1 STFT & LF ResNet-18 0.49 0.58 0.51 0.47
C2 STFT & LF MobileNet-V3 0.59 0.67 0.52 0.48
C3 STFT & LF EfficientNet-B0 0.52 0.61 0.51 0.48
C4 STFT & LF DenseNet-121 0.58 0.66 0.51 0.48
C5 STFT & LF ShuffleNet-V2 0.64 0.71 0.53 0.48
C6 STFT & LF Swin T 0.84 0.87 0.94 0.09
C7 STFT & LF ConvNeXt-Tiny 0.88 0.90 0.96 0.075
C8 STFT & LF GoogLeNet 0.53 0.62 0.51 0.47
C9 STFT & LF MNASNet 0.62 0.70 0.54 0.47
C10 STFT & LF RegNet 0.50 0.60 0.50 0.48
D1 STFT & LF Whisper+MLP 0.85 0.88 0.95 0.10
D2 STFT & LF Speechbrain+MLP 0.77 0.81 0.81 0.25
D3 STFT & LF Seamless+MLP 0.86 0.88 0.87 0.20
D4 STFT & LF Pyannote+MLP 0.64 0.71 0.78 0.27
A1 + A2 STFT, CQT CNN 0.91 0.92 0.98 0.06
A1 + A3 STFT, WT CNN 0.88 0.90 0.96 0.09
A1 + A2 + A3 STFT, CQT, WT CNN 0.90 0.92 0.98 0.07
A4 + A5 LFCC, MEL CNN 0.88 0.90 0.97 0.08
A4 + A6 LFCC, GAM CNN 0.87 0.89 0.98 0.065
A4 + A5 + A6 LFCC, MEL, GAM CNN 0.88 0.90 0.98 0.069
A4 + C6 LFCC CNN, Swin-T 0.87 0.89 0.96 0.078
A4 + C7 LFCC CNN, ConvNeXt-Tiny 0.88 0.90 0.97 0.07
A4 + C6 + C7 LFCC CNN, ConvNeXt-Tiny, Swin-T 0.88 0.89 0.97 0.072
A2 + A4 + A6 + C7 CQT, LFCC, GAM CNN, ConvNeXt-Tiny, Whisper 0.90 0.91 0.994 0.03
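The EER values reported in Table III can be approximated from per-recording scores with a short ROC-based computation, sketched below. This is a common stand-in; the official ASVspoof toolkit computes the EER slightly differently, and the paper does not state which implementation was used.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Approximate EER: the operating point where the false-acceptance rate (FPR)
    equals the false-rejection rate (FNR), found on the ROC curve."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)  # labels: 1 = bona fide, 0 = spoof
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Usage sketch: scores are the predicted probabilities of the bona fide class.
# eer = equal_error_rate(y_true, p_real)
```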

IV. CONCLUSION

This paper has evaluated the efficacy of a wide range of spectrograms and deep learning approaches for deepfake audio detection. By establishing an ensemble of selective spectrograms and models, our best system achieves an EER score of 0.03 on the LA dataset of the ASVspoof 2019 challenge, which is very competitive with state-of-the-art systems. Additionally, our comprehensive evaluation also indicates the potential of certain types of spectrogram (e.g. STFT) and deep learning approaches (e.g. CNN-based models and finetuning pre-trained models), which can provide initial guidance for deepfake audio detection.

REFERENCES

[1] Luca Turchet et al., "The internet of sounds: Convergent trends, insights, and future directions," IEEE Internet of Things Journal, vol. 10, no. 13, pp. 11264–11292, 2023.
[2] Luca Turchet et al., "The internet of audio things: State of the art, vision, and challenges," IEEE Internet of Things Journal, vol. 7, no. 10, pp. 10233–10249, 2020.
[3] Zhizheng Wu et al., "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[4] Jiangyan Yi et al., "Scenefake: An initial dataset and benchmarks for scene fake audio detection," Pattern Recognition, vol. 152, pp. 110468, 2024.
[5] Yan Zhao et al., "Emofake: An initial dataset for emotion fake audio detection," 2023.
[6] Massimiliano Todisco et al., "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
[7] Jiangyan Yi et al., "ADD 2022: The first audio deep synthesis detection challenge," in Proc. ICASSP, 2022, pp. 9216–9220.
[8] Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.
[9] Nanxin Chen et al., "Robust deep feature for spoofing detection — the SJTU system for ASVspoof 2015 challenge," in Proc. Interspeech, 2015, pp. 2097–2101.
[10] Alejandro Gomez-Alanis et al., "A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection," in Proc. Interspeech, 2019, pp. 1068–1072.
[11] Akash Chintha et al., "Recurrent convolutional structures for audio spoof and video deepfake detection," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 1024–1037, 2020.
[12] Jia Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, 2009, pp. 248–255.
[13] Alec Radford et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
[14] Loïc Barrault et al., "Seamless: Multilingual expressive and streaming speech translation," arXiv preprint arXiv:2312.05187, 2023.
[15] Mirco Ravanelli et al., "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[16] Alexis Plaquet and Hervé Bredin, "Powerset multi-class cross entropy loss for neural speaker diarization," in Proc. INTERSPEECH, 2023.
[17] Hervé Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," in Proc. INTERSPEECH, 2023.
[18] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
