https://doi.org/10.3390/app13148488
Voice Deepfake Detection Using the Self-Supervised
Pre-Training Model HuBERT
Lanting Li, Tianliang Lu *, Xingbang Ma, Mengjiao Yuan and Da Wan
College of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
* Correspondence: [email protected]
Abstract: In recent years, voice deepfake technology has developed rapidly, but current detection methods suffer from insufficient generalization and inadequate feature extraction for unknown attacks. This paper presents a forged speech detection method (HuRawNet2_modified) based on a self-supervised pre-trained model (HuBERT) to address these problems. A combination of impulsive signal-dependent additive noise and additive white Gaussian noise was adopted for data augmentation, and the HuBERT model was fine-tuned on databases in different languages. On this basis, the size of the extracted feature maps was modified independently by the α-feature map scaling (α-FMS) method within a modified end-to-end architecture that uses the RawNet2 model as the backbone structure. The results showed that the HuBERT model could extract features more comprehensively and accurately. The best evaluation indicators were an equal error rate (EER) of 2.89% and a minimum tandem detection cost function (min t-DCF) of 0.2182 on the database of the ASVspoof 2021 LA challenge, which verified the effectiveness of the detection method proposed in this paper. Compared with the baseline systems on the ASVspoof 2021 LA challenge and FMFCC-A databases, the values of EER and min t-DCF decreased. The results also showed that the self-supervised pre-trained model with fine-tuning can extract acoustic features across languages, and that detection improves slightly when the pre-training, fine-tuning, and test databases share the same language.
Keywords: voice deepfake detection; self-supervised learning; pre-training; feature map scaling;
anti-spoofing
Current research shows that self-supervised speech models can extract robust acoustic features for unknown domains. However, existing audio deepfake detection methods tend to overfit the training set and show poor robustness when the audio is recompressed, mixed with noise, or otherwise processed. Most existing methods rely on specific datasets or specific deepfake generation methods, with a single and homogeneous distribution of training data. Moreover, most are evaluated only on English datasets, so their generalizability to Chinese datasets remains untested.
To address these issues, this study proposes a detection method built on a self-supervised pre-trained model, referred to as HuRawNet2_modified for brevity. The main contributions are as follows:
(1) Regarding front-end feature extraction, self-supervised pre-training models trained
on either English or Chinese datasets were used, and fine-tuning with English and
Chinese datasets was undertaken to explore the impact of pre-training models using
different language datasets on the results.
(2) For the back-end model, an improved end-to-end RawNet2 model was used as the
backbone structure, and α-FMS was used to independently modify the size of the
feature maps to improve the model detection and compare the performance with
current state-of-the-art algorithms on Chinese and English datasets.
(3) In terms of datasets, to address the problem of voice deepfake detection being chiefly trained and tested on English datasets, cross-library tests were conducted on different language datasets to verify the proposed method's detection performance and generalizability on Chinese and English datasets.
2. Related Work
2.1. Detection Methods Based on Traditional Features and Related Events
Early audio deepfake detection mainly relied on hidden Markov models and Gaussian mixture models, and later evolved into systems composed of front-end and back-end models. A typical audio deepfake detection system is a framework composed of a front end and a back end: the front end extracts acoustic features from speech, and the back end converts the features into scores.
Traditional front-end feature extractors use digital signal processing algorithms to extract
spectrum, phase, or other acoustic features. Among them, the most widely used include
mel-frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC),
and constant-Q transform cepstral features (CQCC) [1]. However, the detection method
based on traditional features will result in the loss of some information. Moreover, it is
usually only effective for detecting specific types of voice deepfakes, and generalization
and robustness need to be improved.
In traditional detection systems, the front end adopts discriminative hand-crafted features designed by experts, and the back end directly uses Gaussian mixture models (GMM) or support vector machines (SVM) for classification and judgment. In recent years, deep-learning-based systems have gradually become mainstream: the front end extracts speech features that are input to a neural network, and the back end learns a high-level representation of the features through the neural network and then performs a classification judgment to identify the authenticity of the audio [2]. With the development
of deep learning, it is increasingly common to use a deep neural network (DNN) to
process the original waveform directly in many tasks. Tak et al. [3] applied the improved RawNet2 network to synthetic speech detection, using a set of sinc filters to operate directly on the raw waveform through time-domain convolution and then learning deep discriminative information through residual modules and a gated recurrent unit (GRU).
Based on this network, the RawGAT-ST model was proposed [4], and the spectro-temporal
graph attention network was used to model the relationship across different sub-bands
and temporal segments. Based on ResNet’s skip layer connection and Inception’s parallel
convolution structure, Hua et al. [2] designed two lightweight end-to-end time-domain
synthetic speech detection networks (TSSDNet).
In order to promote research on audio deepfake detection technology, the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof), jointly launched by the University of Edinburgh and other research institutions, has been held four times since 2015. This is the main event regarding audio deepfake detection [5]. The
ASVspoof 2019 challenge [6] was the first challenge to consider three kinds of spoofing
attacks simultaneously. The minimum tandem detection cost function (min t-DCF) was
introduced to represent the performance of the whole system. The three sub-tasks of the
ASVspoof 2021 challenge [7] focused on logical access (LA), physical access (PA), and speech deepfake (DF) tasks. In order to promote the development of audio deepfake detection in Chinese-language scenarios, Chinese researchers have also launched Chinese audio deepfake events, such as the second Fake Media Forensic Challenge of CSIG [8] and the third Chinese AI competition.
3. Methods
The flow chart of the HuRawNet2_modified method is shown in Figure 1. The whole model consists of a pre-trained HuBERT-based front end and a back-end detection model. The input of the entire model is the raw waveform, and the output is the result of binary classification. Firstly, the data were pre-processed by adding impulsive signal-dependent noise and FIR-filtered white noise to the original audio for data augmentation (see Section 3.1 for details). Next, a self-supervised pre-trained model with fine-tuning (see Section 3.2 for more information) was used to extract acoustic features. A fully connected layer was added after the self-supervised front end to train jointly with the back-end detection model and reduce the dimensionality of the self-supervised model output. The extracted acoustic features were then processed by the three residual blocks of the back-end detection model (see Section 3.3 for details), where α-FMS was used to obtain more discriminative features. Finally, a softmax activation function was used in the output layer to obtain the real or fake detection result.
Figure 1. Flow chart of the HuRawNet2_modified method: the raw waveform w is pre-processed with FIR-filtered white noise to obtain r_w′, passed through the HuBERT front end (CNN encoder and transformer encoder) to obtain the feature representation, and classified as real or fake by the back-end detection model.
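To make the data flow concrete, the following is a minimal PyTorch sketch of how these stages compose. It is an illustrative assumption rather than the authors' implementation: the SSL front end and residual blocks are passed in as placeholder modules, and the feature and GRU dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class HuRawNet2Sketch(nn.Module):
    """Illustrative composition of the pipeline described above (not the authors' code)."""

    def __init__(self, ssl_front_end: nn.Module, res_blocks: nn.Module,
                 feat_dim: int = 128, gru_dim: int = 256):
        super().__init__()
        self.front_end = ssl_front_end            # pre-trained HuBERT + FC layer
        self.res_blocks = res_blocks              # three residual blocks with alpha-FMS
        self.gru = nn.GRU(feat_dim, gru_dim, batch_first=True)
        self.classifier = nn.Linear(gru_dim, 2)   # binary output: real vs. fake

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = self.front_end(waveform)              # (batch, frames, feat_dim)
        x = self.res_blocks(x)                    # refined, more discriminative features
        _, h = self.gru(x)                        # last hidden state summarizes the utterance
        return torch.softmax(self.classifier(h[-1]), dim=-1)

# Toy placeholders standing in for the real sub-modules (100 frames of 128-dim features).
front_end = nn.Sequential(nn.Unflatten(1, (100, 646)), nn.Linear(646, 128))
model = HuRawNet2Sketch(front_end, nn.Identity())
scores = model(torch.randn(2, 64600))             # (2, 2) softmax probabilities
```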
w′[i] = w[i] + z_w[i],    (1)
where w′ represents the audio with impulsive additive noise added, SNR refers to the random signal-to-noise ratio, SNR ∈ [10, 40], z_{w′} denotes the result of the white noise after FIR filter processing, and r_{w′} denotes the result after data pre-processing.
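As an illustration, here is a minimal NumPy/SciPy sketch of this pre-processing step, with the SNR drawn uniformly from [10, 40] dB. The impulsive-noise model, FIR filter order and cutoff, and function names are assumptions for illustration, not details from the paper.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `clean`."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def augment(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    rng = np.random.default_rng()
    snr_db = rng.uniform(10, 40)                       # random SNR in [10, 40] dB

    # Impulsive signal-dependent noise: a sparse spike train (hypothetical model).
    impulses = (rng.random(waveform.shape) < 0.001) * rng.standard_normal(waveform.shape)
    w_prime = add_noise_at_snr(waveform, impulses, snr_db)   # w'[i] = w[i] + z_w[i]

    # Additive white Gaussian noise shaped by an FIR low-pass filter.
    white = rng.standard_normal(waveform.shape)
    fir = firwin(numtaps=101, cutoff=0.5)              # filter order/cutoff are assumptions
    z_w_prime = lfilter(fir, 1.0, white)
    r_w_prime = add_noise_at_snr(w_prime, z_w_prime, snr_db)
    return r_w_prime
```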
In the masked prediction task, the distribution over codewords is computed from the similarity between the projected transformer output and the codeword embeddings, where A is the projection matrix, e_c is the embedding for codeword c of the codebook, similarity(·,·) calculates the cosine similarity between two vectors, and τ is used to scale the logit, which is set to 0.1.
Figure 2. Structure of the HuBERT model: (a) pre-training, in which the CNN encoder and transformer predict the hidden cluster units z_1, …, z_6 from the masked input W_1:T; (b) fine-tuning, in which an FC layer is added after the transformer output.
By iterative training, the cross-entropy loss functions LOSS_mask and LOSS_unmask are calculated on the masked and unmasked units, and then the final loss value LOSS is obtained by weighted summation, as shown in Equation (4):
LOSS = α·LOSS_mask + (1 − α)·LOSS_unmask,    (4)
The HuBERT model is similar to the classic wav2vec 2.0 model, but the training methods differ. The latter (wav2vec 2.0) discretizes the audio features as a self-supervised target during training and calculates the loss function only in the masked region, whereas the former (HuBERT) obtains its training targets by carrying out k-means clustering on MFCC features. The results show that the performance of the HuBERT model is better than that of the wav2vec 2.0 model [9].
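As a rough illustration of how discrete training targets can be derived by k-means clustering of MFCC features, here is a small sketch using librosa and scikit-learn; the number of clusters and MFCC settings are assumptions, and the real HuBERT pipeline clusters features at much larger scale and refines the targets over iterations.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def mfcc_cluster_targets(wav_paths, sr=16000, n_mfcc=13, n_clusters=100):
    """Return per-frame pseudo-labels obtained by k-means over MFCC frames."""
    frames = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
        frames.append(mfcc.T)                                    # (T, n_mfcc)
    all_frames = np.concatenate(frames, axis=0)

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(all_frames)
    # Assign each utterance's frames to the nearest centroid -> discrete targets.
    return [km.predict(f) for f in frames]
```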
3.2.2. Fine-Tuning
Fine-tuning is one of the transfer learning methods suitable for smaller datasets
and it has low training costs, which can improve the detection performance for known
attacks. Some studies have shown that fine-tuning is beneficial and can prevent overfitting,
promoting better generalization [14]. Pre-training only extracts features of natural speech,
and fine-tuning, with both natural and deepfake audio data, enables the self-supervised
pre-training model to adapt to the downstream task of audio deepfake detection, which
helps to improve detection performance.
The process of fine-tuning is shown in Figure 2b. After pre-training on unlabeled data, fine-tuning was performed on the two labeled training sets. The back-end detection model and the pre-trained HuBERT model were jointly optimized by back-propagation, and the weighted cross-entropy loss function was used to calculate the loss. The speech W = [w_1, w_2, · · · , w_T] of T frames was passed through the CNN encoder to obtain the latent speech representation S = [s_1, s_2, · · · , s_T], which was then sent to the transformer encoder to obtain the context representation R. In order to reduce the dimension, a fully connected layer (FC layer) was added after the output of the transformer encoder, and the output was sent to the residual blocks of the back-end detection model. In this paper, the ASVspoof 2021 LA training set (the same as the ASVspoof 2019 LA training set) and the FMFCC-A training set were used for fine-tuning, and the detection performance of the model was tested on different evaluation sets.
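For illustration, the following sketch builds an SSL front end from a publicly available pre-trained HuBERT checkpoint and adds the dimensionality-reducing FC layer described above. It uses torchaudio's HUBERT_BASE bundle for convenience rather than the fairseq checkpoints fine-tuned in the paper, and the 128-dimensional projection size is an assumption.

```python
import torch
import torchaudio

class SSLFrontEnd(torch.nn.Module):
    """Pre-trained HuBERT encoder followed by an FC layer that reduces the feature dimension."""

    def __init__(self, out_dim: int = 128):
        super().__init__()
        bundle = torchaudio.pipelines.HUBERT_BASE
        self.hubert = bundle.get_model()              # fine-tuned jointly with the back end
        self.proj = torch.nn.Linear(768, out_dim)     # 768 = HuBERT-base hidden size

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        features, _ = self.hubert.extract_features(waveform)
        return self.proj(features[-1])                # (batch, frames, out_dim)

front_end = SSLFrontEnd()
dummy = torch.randn(2, 64600)                         # roughly 4 s of 16 kHz audio
print(front_end(dummy).shape)
```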
Figure 3. (a) Structure of RawNet2_modified: waveform → data augmentation → SSL front end → FC → three residual blocks → GRU → FC → softmax output (real/fake); (b) structure of the residual block: two BatchNorm + LeakyReLU + Conv1D stages, max pooling, FC, and α-FMS.
FMS (filter-wise feature map scaling) [21], used in the residual blocks, refers to filter-based feature map scaling. The purpose of FMS is to independently modify the size of a given feature map (the output of the residual block) to obtain a more discriminative representation and improve the performance and generalization of the model, with the advantages of reducing model parameters and computation. FMS obtains scaling vectors from the feature map and then adds them to or multiplies them with the features, or applies the two operations in turn, as shown in reference [21]. The multiplicative FMS is similar to the attention map of the attention mechanism, but uses the sigmoid activation function instead of the softmax function, because the softmax function may cause information to be over-removed. However, FMS has limitations: it uses the same scaling vector for addition and multiplication, it can only add values between 0 and 1 during addition, and it has difficulty optimizing the addition and multiplication operations simultaneously.
To solve these problems, Jung et al. [24] improved FMS and proposed α-FMS. A trainable parameter α is added to each filter, and the result is multiplied by the scaling vector. The parameter α is learned automatically by back-propagation and the optimization algorithm during training. Each filter has its own scaling vector, which can further improve the performance and generalizability of the model compared to FMS. The specific operation is shown in Equation (5).
As shown in the α-FMS structure diagram in Figure 4, let C = [C_1, C_2, · · · , C_F] be the feature map of the residual block, where C_f ∈ R^T, T is the length of the time series, and F is the number of filters. The scaling vector is obtained by first performing global average pooling on the time axis, then feeding the result through a fully connected layer, and finally applying sigmoid activation. Let S = [S_1, S_2, · · · , S_F] be the scaling vector and C′ = [C′_1, C′_2, · · · , C′_F] be the scaled feature map, with S_f ∈ R^1 and C′_f ∈ R^T; S_f is broadcast over C_f to perform element-by-element operations. The purpose of additive FMS is to add a slight disturbance to the feature map to increase its discriminative power [25]. α is added for each filter in Equation (5):
C′_f = (C_f + α) × S_f    (5)
Figure 4. Structure of α-FMS: the feature map C = [C_1, C_2, · · · , C_F] is globally average-pooled over time, passed through an FC layer with sigmoid activation to obtain the scaling vector S, and each filter is scaled as C′_f = (C_f + α) × S_f.
One advantage of this method is that it allows the model to autonomously learn the
most suitable feature map scaling ratio for the task, rather than being manually set. This can
enhance the expressive power and flexibility of the model, thereby improving the model’s
performance.
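To make the operation concrete, here is a minimal PyTorch sketch of an α-FMS layer following Equation (5); the initialization of α and the exact layer ordering inside the residual block are assumptions rather than details taken from reference [24].

```python
import torch
import torch.nn as nn

class AlphaFMS(nn.Module):
    """alpha-feature map scaling: C'_f = (C_f + alpha_f) * S_f  (Equation (5))."""

    def __init__(self, num_filters: int):
        super().__init__()
        # One trainable alpha per filter, learned by back-propagation (init value assumed).
        self.alpha = nn.Parameter(torch.ones(1, num_filters, 1))
        self.fc = nn.Linear(num_filters, num_filters)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (batch, F, T) feature map produced by a residual block.
        s = c.mean(dim=-1)                    # global average pooling over time -> (batch, F)
        s = torch.sigmoid(self.fc(s))         # FC layer + sigmoid -> scaling vector S
        s = s.unsqueeze(-1)                   # broadcast over the time axis
        return (c + self.alpha) * s           # add alpha per filter, then scale

fms = AlphaFMS(num_filters=64)
out = fms(torch.randn(8, 64, 200))            # e.g. 8 utterances, 64 filters, 200 frames
print(out.shape)                              # torch.Size([8, 64, 200])
```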
4. Experiment
This section describes the datasets used, the evaluation metrics, and the experimental results and analysis. A binary classifier was trained on the ASVspoof 2021 LA dataset and the FMFCC-A dataset, respectively, to distinguish natural from faked speech.
Fine-tuning requires a large amount of GPU memory, so the voice data were processed into segments of approximately four seconds: utterances longer than four seconds were cut, and those shorter than four seconds were first copied (repeated) to reach the target length before processing. In this study, the number of training epochs was 40, using the Adam optimizer with default settings. When the sinc filter was used, the learning rate was fixed at 0.0001; when the self-supervised front end was used, fine-tuning demanded substantial computation, so the learning rate was initialized at 0.000001 and adjusted by cosine annealing learning rate decay, with a batch size of 14, to avoid overfitting under the experimental conditions. The experimental environment of this study is shown in Table 1. A DCU (deep computing unit) is an accelerator card dedicated to AI (artificial intelligence) and deep learning.
Table 1. Experimental environment.

Name | Version
CPU | C86 7185 32-core Processor 2.0 GHz
Accelerator card | DCU2
Memory | 16 GB
Operating system | CentOS Linux 7.6 64-bit
Python | 3.7.2
Deep learning library | PyTorch 1.10.0, fairseq 0.10.0
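For reference, the optimization setup described above (Adam, initial learning rate of 0.000001, cosine annealing decay, 40 epochs, batch size 14) can be sketched as follows; the model and training loop are placeholders, and tying the annealing period to the number of epochs is an assumption.

```python
import torch

def build_training(model: torch.nn.Module, epochs: int = 40):
    # Adam with the 1e-6 initial learning rate used for the self-supervised front end.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
    # Cosine annealing decay; T_max = number of epochs is an assumption.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

model = torch.nn.Linear(10, 2)                   # placeholder model
optimizer, scheduler = build_training(model)
for epoch in range(40):
    # ... iterate over batches of size 14, compute weighted cross-entropy, step optimizer ...
    scheduler.step()                             # anneal the learning rate each epoch
```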
Table 2. Databases used in the experiments (number of utterances, language, and storage format).

Database | Total | Train (Real/Fake) | Development (Real/Fake) | Evaluation (Real/Fake) | Language | Storage Format
ASVspoof 2021 LA | 231,790 | 2580/22,800 | 2548/22,296 | 14,816/166,750 | English | FLAC
ASVspoof 2019 LA | 121,461 | 2580/22,800 | 2548/22,296 | 7355/63,882 | English | FLAC
FMFCC-A | 50,000 | 4000/6000 | 3000/17,000 | 3000/17,000 | Chinese | WAV
FAD | 115,800 | 12,800/25,600 | 4800/9600 | 21,000/42,000 | Chinese | WAV
To evaluate the performance of the detection system, this study used two metrics commonly adopted for audio deepfake detection: the equal error rate (EER) and the tandem detection cost function (t-DCF). The min t-DCF was proposed by the ASVspoof 2019 challenge and improved by the ASVspoof 2021 challenge. The specific formulae for EER and min t-DCF are as follows:
P_false(θ) = (# faked voices with score > θ) / (total faked voices),    (6)
min t-DCF = min_θ {C0 + C1·P_miss(θ) + C2·P_false(θ)}    (9)
where EER denotes the error rate when the false alarm rate P_false(θ) and the miss rate P_miss(θ) are equal, and θ_EER denotes the threshold value at which P_false(θ) and P_miss(θ) are equal. The smaller the EER, the better the performance of the detection system. The smaller the t-DCF, the better the generalizability of the detection system and the smaller the impact on the performance of the ASV system [27].
The log-loss function (Log-loss) is also used as an evaluation metric for the FMFCC-A dataset. Log-loss is one of the primary metrics used to evaluate the performance of a classification problem, indicating how close the predicted probability is to the corresponding actual value. It is computed as Log-loss = −(1/N)·Σ_i [y_i·log(p_i) + (1 − y_i)·log(1 − p_i)], where i denotes the index of the utterance, N is the number of utterances, y_i is the corresponding label, and p_i is the predicted probability. When the Log-loss is smaller, the predicted probability is closer to the true value, and the model performance is better.
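For reference, a minimal sketch of how EER and Log-loss can be computed from system scores with scikit-learn is shown below; it implements the generic definitions above and is not the official ASVspoof scoring toolkit (which also computes min t-DCF).

```python
import numpy as np
from sklearn.metrics import roc_curve, log_loss

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: error rate at the threshold where the false-alarm and miss rates are equal."""
    far, tpr, _ = roc_curve(labels, scores, pos_label=1)   # far = fake accepted as real
    frr = 1.0 - tpr                                        # frr = real rejected as fake
    idx = np.nanargmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy example: label 1 = real (bona fide), 0 = fake; higher score = more likely real.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.35, 0.1, 0.6, 0.4])
print("EER:", compute_eer(labels, scores))
print("Log-loss:", log_loss(labels, scores))
```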
Table 3. Performance for the ASVspoof 2021 evaluation partition in terms of EER (%) and min t-DCF.
The analysis of Table 3 shows that the detection performance of the proposed method was significantly improved compared with the other four baseline models in the ASVspoof 2021 competition. Compared with the baseline RawNet2, the EER and min t-DCF indicators were reduced by 69.5% and 48.7%, respectively. This proves that the method of using the self-supervised pre-training model to extract general features and then fine-tuning is more suitable for audio deepfake detection tasks, and that the self-supervised pre-training model is of practical value. The EER and Log-loss of the baseline models and the HuRawNet2_modified method on the development set of the FMFCC-A dataset of the second Fake Media Forensic Challenge of CSIG are shown in Table 4.
Table 4. Performance for the FMFCC-A evaluation partition in terms of EER (%) and Log-loss.
To verify the performance of this model on Chinese deepfake speech, the FMFCC-A and FAD datasets were introduced in this paper. Analysis of the results in Table 4 shows that, compared to the two baseline systems, the EER of the proposed model was reduced by 55.3% and 60.7%, and the Log-loss was reduced by 42.2% and 61.7%, on the FMFCC-A dataset and the FAD dataset, respectively. This indicates that the performance of the model was improved and that the detection of Chinese faked speech was better. The LCNN-LSTM model, among the four baseline models of the ASVspoof 2021 challenge, achieved the best result on the FAD dataset (EER = 13.91%); the other specific results are not displayed in this paper. Compared to the LCNN-LSTM model, the EER of HuRawNet2_modified was reduced by approximately 29.9% and 40.25% on the FMFCC-A dataset and the FAD dataset, respectively. Therefore, we can conclude that the performance was improved compared to the baseline models on different datasets, indicating that the HuRawNet2_modified model has better generalizability and a certain advantage in terms of detection performance.
In order to verify the above conclusions, this study used four datasets for testing, and
the results are shown in Table 5 and Figure 5. The results of pre-training using different
language datasets on the three datasets showed that the EER and min t-DCF were slightly reduced when the pre-training dataset and the fine-tuning and test datasets were in the same language.
Figure 5. Results of the HuRawNet2_modified model on different datasets (ASVspoof 2021 LA, ASVspoof 2019 LA, FMFCC-A, and FAD) using pre-trained models trained on different languages (WenetSpeech and LibriSpeech); the vertical axis shows the EER (%).
Table 6. Ablation study results in terms of EER (%) and min t-DCF.

Abbreviation | Front End | Data Augmentation | α-FMS | EER (%) | Min t-DCF
SSL | SSL pre-trained and fine-tuned | -- | -- | 5.49 | 0.3687
+DA | SSL pre-trained and fine-tuned | √ | -- | 4.89 | 0.3357
+α-FMS | SSL pre-trained and fine-tuned | -- | √ | 4.51 | 0.3268
+DA+α-FMS | SSL pre-trained and fine-tuned | √ | √ | 2.89 | 0.2182
Sinc | Sinc filter | -- | -- | 10.17 | 0.5103
+DA | Sinc filter | √ | -- | 9.19 | 0.4761
+α-FMS | Sinc filter | -- | √ | 8.25 | 0.4573
+DA+α-FMS | Sinc filter | √ | √ | 5.52 | 0.3964
Figure 6. min t-DCF versus EER (%) for the sinc filter front end and the SSL pre-trained and fine-tuned front end, each shown with the +DA, +α-FMS, and +DA+α-FMS configurations.
In terms of EER for comparison, it can be seen from rows 2 and 6 of Table 6 that
adding the data augmentation module resulted in a 10.6~12.3% gain to the model; adding
only α-FMS in rows 3 and 7, the model performance was improved by approximately
17.8~18.9%; adding both data augmentation and α-FMS, the EER was reduced by approxi-
mately 45.7~47.3% from rows 4 and 8, indicating that both data augmentation and α-FMS
contribute to the model performance improvement, with α-FMS adding more to the model.
From Figure 6, it can be seen that the self-supervised pre-training and fine-tuning front end was closer to the origin than the sinc filter front end. It can be concluded that the EER and min t-DCF of the method based on self-supervised pre-training and fine-tuning proposed in this paper decreased consistently as modules were added, and the results were generally better than those of the sinc filter front end. This further verifies that the proposed method can fully extract deepfake speech features and improve detection performance and generalizability compared with mainstream detection algorithms.
Author Contributions: Conceptualization, L.L. and T.L.; methodology, X.M.; software, M.Y.; valida-
tion, L.L., D.W. and M.Y.; formal analysis, X.M.; investigation, D.W.; resources, X.M.; data curation,
L.L.; writing—original draft preparation, L.L.; writing—review and editing, L.L.; visualization, T.L.;
supervision, L.L.; project administration, T.L.; funding acquisition, T.L. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was supported by the Double First-Class Innovation Research Project for
People’s Public Security University of China (No.2023SYL07).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The ASVspoof 2021 dataset can be found at https://www.asvspoof.
org/index2021.html (accessed on 28 May 2023). The ASVspoof 2019 dataset can be found at https:
//www.asvspoof.org/index2019.html (accessed on 28 May 2023). The FMFCC-A dataset can be
found at https://github.com/Amforever/FMFCC-A (accessed on 28 May 2023). The FAD dataset
can be found at https://zenodo.org/record/6635521#.Ysjq4nZBw2x (accessed on 28 May 2023).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wang, X.; Yamagishi, J. A Practical Guide to Logical Access Voice Presentation Attack Detection. In Frontiers in Fake Media
Generation and Detection; Springer Nature: Singapore, 2022; pp. 169–214. [CrossRef]
2. Hua, G.; Teoh, A.B.J.; Zhang, H. Towards End-to-End Synthetic Speech Detection. IEEE Signal Process. Lett. 2021, 28, 1265–1269.
[CrossRef]
3. Tak, H.; Patino, J.; Todisco, M.; Nautsch, A.; Evans, N.; Larcher, A. End-to-End Anti-Spoofing with RawNet2. In Proceedings
of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON,
Canada, 6–11 June 2021; pp. 6369–6373.
4. Tak, H.; Jung, J.; Patino, J.; Kamble, M.; Todisco, M.; Evans, N. End-to-End Spectro-Temporal Graph Attention Networks for
Speaker Verification Anti-Spoofing and Speech Deepfake Detection. arXiv 2021, arXiv:2107.12710.
5. Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilçi, C.; Sahidullah, M.; Sizov, A. ASVspoof 2015: The First Automatic Speaker
Verification Spoofing and Countermeasures Challenge. In Proceedings of the Interspeech 2015, 16th Annual Conference of the
International Speech Communication Association, Dresden, Germany, 4–5 September 2015; ISCA: Dresden, Germany, 2015;
pp. 2037–2041.
6. Nautsch, A.; Wang, X.; Evans, N.; Kinnunen, T.H.; Vestman, V.; Todisco, M.; Delgado, H.; Sahidullah, M.; Yamagishi, J.; Lee, K.A.
ASVspoof 2019: Spoofing Countermeasures for the Detection of Synthesized, Converted and Replayed Speech. IEEE Trans. Biom.
Behav. Identity Sci. 2021, 3, 252–265. [CrossRef]
7. Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al.
ASVspoof 2021: Accelerating Progress in Spoofed and Deepfake Speech Detection. In Proceedings of the 2021 Edition of the
Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; ISCA: Dresden, Germany,
2021; pp. 47–54.
8. Zhang, Z.; Gu, Y.; Yi, X.; Zhao, X. FMFCC-a: A Challenging Mandarin Dataset for Synthetic Speech Detection. In Proceedings of
the Digital Forensics and Watermarking: 20th International Workshop, Beijing, China, 20–22 November 2021; Springer: Beijing,
China, 2022; pp. 117–131.
9. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [CrossRef]
10. Kenton, J.D.M.-W.C.; Toutanova, L.K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
11. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their
Compositionality. In Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, NV,
USA, 5–10 December 2013.
12. Van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748.
13. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. Wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proceedings
of the Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria,
15–19 September 2019; pp. 3465–3469.
14. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
In Proceedings of the Advances in Neural Information Processing Systems 2020, Virtual, 6–12 December 2020; Curran Associates,
Inc.: New York, NY, USA, 2020; Volume 33, pp. 12449–12460.
15. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale
Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [CrossRef]
16. Zhang, B.; Lv, H.; Guo, P.; Shao, Q.; Yang, C.; Xie, L.; Xu, X.; Bu, H.; Chen, X.; Zeng, C.; et al. WENETSPEECH: A 10,000+
Hours Multi-Domain Mandarin Corpus for Speech Recognition. In Proceedings of the ICASSP 2022—2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 6182–6186.
17. Wang, X.; Yamagishi, J. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proceedings of the
Speaker and Language Recognition Workshop (Odyssey 2022), Beijing, China, 28 June–1 July 2022; ISCA: Beijing, China, 2022.
18. Tak, H.; Todisco, M.; Wang, X.; Jung, J.; Yamagishi, J.; Evans, N. Automatic Speaker Verification Spoofing and Deepfake Detection
Using Wav2vec 2.0 and Data Augmentation. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2022),
Beijing, China, 28 June–1 July 2022.
19. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method
for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, 20th Annual Conference of the International Speech
Communication Association, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
20. Zhang, J.; Qiu, T.; Luan, S. An Efficient Real-Valued Sparse Bayesian Learning for Non-Circular Signal’s DOA Estimation in the
Presence of Impulsive Noise. Digit. Signal Process. 2020, 106, 102838. [CrossRef]
21. Jung, J.; Kim, S.; Shim, H.; Kim, J.; Yu, H.-J. Improved Rawnet with Feature Map Scaling for Text-Independent Speaker
Verification Using Raw Waveforms. In Proceedings of the Interspeech 2020, 21st Annual Conference of the International Speech
Communication Association, Shanghai, China, 25–29 October 2020; pp. 1496–1500.
22. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in
TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, 21st Annual Conference of the International Speech
Communication Association, Shanghai, China, 25–29 October 2020; pp. 3830–3834.
23. Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.;
et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv 2021, arXiv:2106.04624.
24. Jung, J.; Shim, H.; Kim, J.; Yu, H.-J. α-feature map scaling for raw waveform speaker verification. J. Acoust. Soc. Korea 2020, 39,
441–446.
25. Zhang, J.; Inoue, N.; Shinoda, K. I-Vector Transformation Using Conditional Generative Adversarial Networks for Short Utterance
Speaker Verification. In Proceedings of the Interspeech 2018, 19th Annual Conference of the International Speech Communication
Association, Hyderabad, India, 2–6 September 2018; ISCA: Hyderabad, India, 2018.
26. Ma, H.; Yi, J.; Wang, C.; Yan, X.; Tao, J.; Wang, T.; Wang, S.; Xu, L.; Fu, R. FAD: A Chinese Dataset for Fake Audio Detection. arXiv
2022, arXiv:2207.12308.
27. Jung, J.; Heo, H.-S.; Tak, H.; Shim, H.; Chung, J.S.; Lee, B.-J.; Yu, H.-J.; Evans, N. AASIST: Audio Anti-Spoofing Using Integrated
Spectro-Temporal Graph Attention Networks. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 6367–6371.
28. Todisco, M.; Delgado, H.; Evans, N. Constant Q Cepstral Coefficients: A Spoofing Countermeasure for Automatic Speaker
Verification. Comput. Speech Lang. 2017, 45, 516–535. [CrossRef]
29. Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A Comparison of Features for Synthetic Speech Detection. In Proceedings of
the Interspeech 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany,
4–5 September 2015; ISCA: Dresden, Germany, 2015; pp. 2087–2091.
30. Wang, X.; Yamagishi, J. A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection. In
Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno,
Czechia, 30 August–3 September 2021; ISCA: Brno, Czechia, 2021; pp. 4259–4263.
31. Todisco, M.; Delgado, H.; Evans, N. A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral
Coefficients. In Proceedings of the Speaker and Language Recognition Workshop, Bilbao, Spain, 21–24 June 2016; ISCA: Bilbao,
Spain, 2016; pp. 283–290.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.