
RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Yujie Chen^1, Jiangyan Yi^2,*, Jun Xue^1, Chenglong Wang^2, Xiaohui Zhang^2, Shunbo Dong^1, Siding Zeng^2, Jianhua Tao^3, Lv Zhao^1, Cunhang Fan^1

^1 School of Computer Science and Technology, Anhui University, China
^2 Institute of Automation, Chinese Academy of Sciences, China
^3 Department of Automation, Tsinghua University, China
[email protected], [email protected]
Abstract

Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use a sinc layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset and demonstrates competitive performance on other datasets. Code will be released at https://github.com/cyjie429/RawBMamba.

Index Terms: state space model, bidirectional Mamba, audio deepfake detection

1. Introduction

Recently, advances in deep learning have significantly accelerated the development of text-to-speech (TTS) and voice conversion (VC) technologies, making synthesised speech increasingly indistinguishable from real human voices. Nevertheless, the potential misuse of such technologies by malicious actors poses a significant threat to society. Hence, in recent years a series of fake audio detection challenges [1, 2, 3, 4] have emerged with the aim of countering fake audio attacks and bolstering the ability to detect them.

Extensive research demonstrates that discriminative information (i.e., fake artefacts) of synthesized and converted speech can be found in both short-range and long-range segments of speech signals [5, 6]. Abnormal intonation changes and unnatural stresses may exist in short-range segments, while unnatural rhythmic patterns and monotonous emotional expressions may be present as deceptive cues in long-range segments. Therefore, integrating both short-range and long-range information effectively enhances a system's detection capability. In recent years, mainstream audio deepfake detection approaches have been divided into pipeline [7, 8, 9, 10, 11] and end-to-end methods. The end-to-end model, by not relying on manually extracted features and directly optimizing on the raw audio waveform within a single model, can effectively enhance generalizability [12]. Inspired by this, we believe that extracting both short- and long-range feature information from audio signals through an end-to-end approach can ensure a model's effectiveness and generalizability. Currently, extensive research focuses on end-to-end models [13]. RawNet2 [14, 15], by performing time-domain convolution operations directly on raw audio, has the potential to learn cues undetectable by knowledge-based methods. Additionally, researchers have incorporated attention mechanisms into graph neural networks to capture key information across time and frequency [16, 17, 18]. AASIST [5] captures complex short-range and long-range feature information in graph networks by modeling the non-Euclidean relationships of graph nodes. Rawformer [6] integrates convolutional layers with Transformer [19] structures, utilizing self-attention to capture short-range and long-range feature information in speech signals. Although these end-to-end models are already capable of handling audio deepfake detection, there is still room for improvement [20].

Recently, the state space model with an efficient hardware-aware design (i.e., Mamba [21]) has succeeded in the field of sequence modeling by selecting relevant information to model long-range relationships, and empirical evidence has demonstrated its enhanced capability in capturing long-range relations. We therefore believe it can also capture long-range feature information in speech signals more effectively, and we aim to design an efficient end-to-end model that integrates both short-range and long-range feature information. However, Mamba, due to its Markovian nature, suffers from the drawback of unidirectional modeling, which prevents it from fully capturing contextual information and thereby affects the acquisition of long-range information.

To address these limitations, we propose an end-to-end bidirectional Mamba model, named RawBMamba, to more effectively capture short- to long-range feature information. Specifically, we utilize a series of parametrizable sinc functions to obtain low-level feature maps, acquire high-level feature maps through multiple convolutional blocks, and then reconstruct the maps into a two-dimensional sequence suitable for the bidirectional Mamba. The Mamba models are then used to capture forward and backward long-range features, respectively. Finally, a bidirectional feature fusion module merges the two sets of embeddings to obtain a comprehensive short- to long-range feature representation. In addition, we perform a preliminary comparative analysis of the effectiveness of Mamba and Transformers in the domain of audio deepfake detection to demonstrate the viability of Mamba. We achieve competitive experimental results on multiple datasets, proving the effectiveness and generalization ability of RawBMamba.

* Corresponding author.
[Figure 1: The overall structure diagram of our proposed RawBMamba. A short-range feature extractor (sinc layer, SE-ResNet encoder producing LFM and HFM, and a flattener) feeds a long-range feature extractor consisting of N stacked forward and N stacked backward Mamba blocks (Conv1d, SSM, Linear), whose outputs pass through attention and a fusion module before A-Softmax classification into bonafide/fake.]
2. Preliminaries

Mamba is inspired by continuous state space models (SSMs) [22, 23, 24, 25] in control systems. An SSM maps a 1-D function or sequence x(t) ∈ R → y(t) ∈ R through a hidden state h(t) ∈ R^N. The model uses A ∈ R^(N×N) as the state transition parameter and B ∈ R^(N×1), C ∈ R^(1×N) as the projection parameters:

    h′(t) = A h(t) + B x(t),
    y(t)  = C h(t).                                        (1)

Mamba is the classic discrete version of this continuous system. It introduces a timescale parameter Δ and, through discretization rules, transforms the continuous parameters A, B into discrete parameters Ā, B̄. A commonly used discretization rule is the zero-order hold (ZOH), defined as follows:

    Ā = exp(ΔA),
    B̄ = (ΔA)^(−1) (exp(ΔA) − I) · ΔB.                      (2)

After discretization, Eq. (1) with step size Δ can be rewritten as:

    h_t = Ā h_(t−1) + B̄ x_t,
    y_t = C h_t.                                           (3)

Finally, the model computes the output through a global convolution:

    K̄ = (C B̄, C Ā B̄, ..., C Ā^(M−1) B̄),
    y = x ∗ K̄,                                             (4)

where M is the length of the input sequence x and K̄ ∈ R^M is a structured convolutional kernel.
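To make Eqs. (1)-(4) concrete, the following self-contained NumPy/SciPy sketch (our illustration, not code from the paper) discretizes a random SSM with ZOH and verifies that the recurrent form of Eq. (3) and the convolutional form of Eq. (4) produce the same output:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N, M = 4, 10                                        # state size, sequence length
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))  # continuous transition A
B = rng.standard_normal((N, 1))                     # input projection B
C = rng.standard_normal((1, N))                     # output projection C
dt = 0.1                                            # timescale parameter Delta

# Eq. (2): zero-order hold (ZOH) discretization
Ad = expm(dt * A)
Bd = np.linalg.inv(dt * A) @ (Ad - np.eye(N)) @ (dt * B)

x = rng.standard_normal(M)

# Eq. (3): recurrence h_t = Ad h_{t-1} + Bd x_t, y_t = C h_t
h, y_rec = np.zeros((N, 1)), []
for t in range(M):
    h = Ad @ h + Bd * x[t]
    y_rec.append((C @ h).item())

# Eq. (4): equivalent global convolution with kernel K[j] = C Ad^j Bd
K = np.array([(C @ np.linalg.matrix_power(Ad, j) @ Bd).item() for j in range(M)])
y_conv = [sum(K[j] * x[t - j] for j in range(t + 1)) for t in range(M)]

assert np.allclose(y_rec, y_conv)                   # both forms agree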
3. Proposed Method

In this section, we provide a detailed introduction to RawBMamba, as shown in Figure 1. First, we describe how RawBMamba captures short-range feature representations. Then, we present the implementation details of the bidirectional Mamba and the bidirectional feature fusion module.

3.1. Short-range feature representation

An increasing number of researchers are using trainable neural layers to learn approximations of standard filtering processes from raw waveforms. RawBMamba employs a variant of RawNet2's frontend to capture high-level short-range feature maps (HFM). Specifically, a series of parametric sinc functions is first utilized to implement band-pass filters [26], extracting spectro-temporal features from the raw waveform and forming low-level feature maps (LFM) F_LFM ∈ R^(F×T), where F and T are the numbers of frequency and temporal bins, respectively. Then, high-level feature maps F_HFM ∈ R^(C×f×t) containing short-range feature information are extracted through a series of three ResNet blocks with squeeze-and-excitation operations, where C, f, and t denote the number of channels, frequency bins, and temporal locations after dimensionality reduction. Finally, the high-level feature maps are flattened along the time and frequency axes to obtain a two-dimensional short-range feature sequence F_s ∈ R^(C×ft), suitable as input to the bidirectional Mamba.
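To make the tensor shapes above concrete, the following PyTorch sketch traces the frontend's data flow; the sinc layer and SE-ResNet blocks are replaced by plain convolutional stand-ins, and all kernel sizes and strides are illustrative assumptions rather than the authors' configuration:

import torch
import torch.nn as nn

batch, samples = 2, 64000          # batch size, 64,000 samples (about 4 s)
n_filters = 70                     # number of sinc band-pass filters

# Stand-in for the sinc layer: a 1-D conv whose kernels would normally be
# parametrized band-pass sinc filters.
sinc = nn.Conv1d(1, n_filters, kernel_size=129, stride=256, padding=64)

# Stand-in for the three SE-ResNet blocks: strided convs reducing (F, T).
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
)

wave = torch.randn(batch, 1, samples)
lfm = sinc(wave)                         # LFM: (B, F, T)
hfm = encoder(lfm.unsqueeze(1))          # HFM: (B, C, f, t)
tokens = hfm.flatten(2).transpose(1, 2)  # F_s as a (B, L = f*t, C) sequence
print(lfm.shape, hfm.shape, tokens.shape)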
3.2. Bidirectional state space model

Mamba designs a simple selection mechanism that, by parameterizing over the sequence dimension, enables the model to effectively select the most relevant information and filter out irrelevant information, thereby enhancing its ability to capture long-range features. However, due to its Markovian nature, it can only model long-range features in a unidirectional manner, which leads to ineffective capture of contextual information. Therefore, we design the bidirectional Mamba, as illustrated in Figure 1, which consists of the following two parts.

Bidirectional Mamba. We present the bidirectional Mamba in Algorithm 1. We first take the short-range features F_s obtained from the frontend and generate a backward short-range feature sequence through a sequence reversal operation. Then, we deploy two structurally identical Mamba networks to independently capture the long-range feature information of the forward and backward short-range features. Specifically, we project the forward and backward features onto x_i and z with dimension E. A 1-D convolution is then applied to x_i to obtain x′_i. Following that, x′_i is projected to obtain B_i, C_i, and Δ_i, and the discretization rules are used to generate Ā_i, B̄_i. Finally, the forward long-range features F_forward and the backward long-range features F_backward are obtained.

Algorithm 1 Bidirectional Mamba
Input: token sequence F_s : (B, L, C)
Output: token sequences F_forward, F_backward : (B, L, C)
  /* sequence reverse operation */
  F_forward : (B, L, C) ← F_s
  F_backward : (B, L, C) ← Reverse(F_s)
  /* process each direction */
  for d in {forward, backward} do
      x : (B, L, C) ← F_d
      for i in MambaLayers_d do
          x_i, z : (B, L, E) ← Linear^xz(x)
          x′_i : (B, L, E) ← SiLU(Conv1d(x_i))
          /* selection mechanism */
          B_i : (B, L, N) ← Linear^B_i(x′_i)
          C_i : (B, L, N) ← Linear^C_i(x′_i)
          /* softplus ensures positive Δ_i */
          Δ_i : (B, L, E) ← log(1 + exp(Linear^Δ_i(x′_i) + Parameter^Δ_i))
          /* shape of Parameter^A_i is (E, N) */
          Ā_i : (B, L, E, N) ← Δ_i ⊗ Parameter^A_i
          B̄_i : (B, L, E, N) ← Δ_i ⊗ B_i
          y_i : (B, L, C) ← Linear(SSM(Ā_i, B̄_i, C_i)(x′_i) ⊗ SiLU(z))
          x : (B, L, C) ← y_i
      end for
      F_d : (B, L, C) ← y_i
  end for
  return F_forward, F_backward
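The control flow of Algorithm 1 can be sketched in PyTorch as follows; the per-direction layer is assumed to be the Mamba block from the publicly released mamba_ssm package (an assumption for illustration; the inner projections B_i, C_i, and Δ_i of Algorithm 1 live inside that block, and a CUDA-enabled install is required):

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed third-party Mamba block

class BiMamba(nn.Module):
    # Two structurally identical, independently parametrized Mamba stacks.
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.fwd = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layers))
        self.bwd = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layers))

    def forward(self, f_s: torch.Tensor):
        # f_s: (B, L, C) short-range token sequence from the frontend
        x_f = f_s
        x_b = torch.flip(f_s, dims=[1])   # sequence reverse operation
        for layer in self.fwd:
            x_f = layer(x_f)              # forward long-range features
        for layer in self.bwd:
            x_b = layer(x_b)              # backward long-range features
        return x_f, x_b                   # F_forward, F_backward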
Bidirectional feature fusion block. After acquiring the forward and backward long-range features, we first apply a linear self-attention operation to each of the unidirectional long-range features to extract key information. Subsequently, we employ concatenation for bidirectional feature fusion. Finally, we obtain short- and long-range speech features F_l enriched with contextual information, which are used for authenticity discrimination. The specific formulas are:

    F_1, F_2 = Attention(F_forward), Attention(F_backward)    (5)
    F_l = MLP(Concat(F_1, F_2))                               (6)
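A minimal sketch of this fusion block follows; single-head nn.MultiheadAttention stands in for the paper's linear self-attention, and the mean pooling over tokens before the MLP is our own simplification to obtain a fixed-size utterance embedding:

import torch
import torch.nn as nn

class BiFusion(nn.Module):
    def __init__(self, d_model: int, n_classes: int = 2):
        super().__init__()
        self.att_f = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.att_b = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, f_fwd: torch.Tensor, f_bwd: torch.Tensor):
        # Eq. (5): self-attention over each unidirectional feature sequence
        f1, _ = self.att_f(f_fwd, f_fwd, f_fwd)
        f2, _ = self.att_b(f_bwd, f_bwd, f_bwd)
        # Eq. (6): concatenate along channels, pool over tokens, classify
        fused = torch.cat((f1, f2), dim=-1).mean(dim=1)  # (B, 2*C)
        return self.mlp(fused)

logits = BiFusion(d_model=64)(torch.randn(2, 288, 64), torch.randn(2, 288, 64))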
4. Experiments and Analysis

4.1. Datasets and metrics

We conduct extensive tests on ASVspoof2019 LA (19LA), ASVspoof2021 LA (21LA), and ASVspoof2021 DF (21DF) to evaluate our model's effectiveness and generalizability. 19LA features fake attacks generated by TTS and VC across 19 algorithms (A01-A19). 21LA contains real and fake speech transmitted through telephony systems such as Voice over IP and the Public Switched Telephone Network. 21DF comprises authentic and spoofed voices altered by various media codecs, which introduce distortion during the encoding, compression, and subsequent decoding of audio data. Performance is assessed using the equal error rate (EER) and the minimum tandem detection cost function (min t-DCF).
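For reference, EER is the operating point at which the false-acceptance and false-rejection rates are equal; a standard sketch of its computation from detection scores follows (min t-DCF additionally requires the ASVspoof cost model and ASV error rates, omitted here):

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    # labels: 1 = bonafide, 0 = spoof; scores: higher means more bonafide
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where FAR and FRR cross
    return float((fpr[idx] + fnr[idx]) / 2)

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, 0, 0])
print(compute_eer(scores, labels))          # 0.0 for perfectly separated scores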
4.2. Implementation details

In the training phase, we use input data with 64,000 sampling points (≈ 4 seconds) for RawBMamba. The SincNet layer is configured with 70 filters, and the RawNet2-based variant consists of four sub-blocks, with the first two having 32 filters and the latter two having 64 filters each. We utilize the Adam optimizer with a learning rate of 1 × 10^-5, and the training batch size is set to 32. The model is trained on the ASVspoof 2019 LA training and development sets for 32 epochs using the A-Softmax [27] loss function on a single RTX 3090 GPU.
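This configuration corresponds to a training loop along the following lines (a self-contained skeleton with a dummy model and synthetic data; plain cross-entropy stands in for the A-Softmax margin loss, whose implementation is beyond this sketch):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Flatten(), nn.Linear(64000, 2))   # dummy RawBMamba stand-in
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr = 1e-5
criterion = nn.CrossEntropyLoss()                          # stand-in for A-Softmax [27]

waves = torch.randn(64, 64000)                # 64,000 samples ≈ 4 s per utterance
labels = torch.randint(0, 2, (64,))           # bonafide / fake
loader = DataLoader(TensorDataset(waves, labels), batch_size=32, shuffle=True)

for epoch in range(32):                       # 32 epochs
    for wave, label in loader:
        loss = criterion(model(wave), label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()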
Table 1: The experimental results of RawBMamba with different configurations on the 19LA, 21LA, and 21DF datasets. Here, "uni" is the unidirectional Mamba, "bi" is the bidirectional Mamba, and "L" is the model's total number of layers.

Dir.  L    19LA EER(%)  19LA t-DCF  21LA EER(%)  21LA t-DCF  21DF EER(%)
uni   4    2.51         0.0707      3.74         0.2707      20.94
uni   8    2.01         0.0623      3.44         0.2652      21.22
uni   12   1.47         0.0467      2.84         0.2517      22.48
bi    4    1.07         0.0315      3.38         0.2687      20.83
bi    8    1.67         0.0484      3.45         0.2631      18.80
bi    12   1.19         0.0360      3.28         0.2709      15.85

4.3. Results of RawBMamba with different configurations

This section discusses the unidirectional and bidirectional Mamba models' capability to capture short- and long-range features at different depths L, as detailed in Table 1. We observe that the unidirectional Mamba model struggles on the 21DF dataset and is only effective on 21LA, indicating limitations in capturing generalized long-range features. When analysing the comprehensive results across all datasets, however, the bidirectional Mamba model consistently outperforms the unidirectional one. This shows that the bidirectional Mamba can overcome the insufficient capture of contextual information in the unidirectional Mamba, thereby improving the generalizability of long-range features. In addition, we observe that the bidirectional Mamba maintains good performance even with a low number of layers (i.e., L = 4) compared to unidirectional models of the same depth, which we attribute to the effectiveness of our bidirectional feature fusion module in fully exploiting information from both forward and backward long-range features. However, reducing the number of layers inevitably reduces the ability to handle difficult datasets such as 21DF.

Table 2: The experimental results of RawBMamba with different fusion methods on the 19LA, 21LA, and 21DF datasets.

Methods    19LA EER(%)  t-DCF   21LA EER(%)  t-DCF   21DF EER(%)
Sum        1.27         0.0400  4.13         0.2924  16.58
Concat     1.19         0.0360  3.28         0.2709  15.85
Attention  1.19         0.0369  3.19         0.2620  18.42

4.4. Results of RawBMamba with different fusion methods

In this section, we discuss in detail the impact of different fusion methods in the bidirectional feature fusion module of RawBMamba, as shown in Table 2. Specifically, we evaluate three common fusion methods: summation, concatenation, and the cross-attention mechanism. Both summation and cross-attention act directly on the two unidirectional long-range features. The results show that concatenation achieves the best overall effect among the three fusion methods, maintaining good performance across all datasets. This suggests that, while preserving the integrity of the time-frequency information as much as possible, it captures the key information of the two unidirectional long-range features in the time-frequency domain. The cross-attention mechanism performs poorly on the 21DF dataset, suggesting that it may over-focus on key information and lose time-frequency information, reducing generalisation to out-of-domain data. In contrast, the summation fusion method shows promising results on the 21DF dataset, as it involves a simple summation over all time-frequency information of the two unidirectional long-range features, acting as a potential regularization that balances the information across all time-frequency domains.
Table 3: The comparison results of RawBMamba, Rawformer, and AASIST on the 19LA, 21LA, and 21DF datasets. Here, "†" denotes reproduced results.

Models             19LA EER(%)  t-DCF   21LA EER(%)  t-DCF   21DF EER(%)
T23 [2]            -            -       1.32         0.2177  15.64
RawNet2 [14]       -            -       9.50         -       22.38
TO-RawNet [28]     1.58         -       3.70         -       -
AASIST [5]         0.93         0.0285  10.51        0.4884  -
AASIST-4Block [5]  1.20         0.0341  9.15         0.4370  -
ARawNet2 [29]      4.61         -       8.36         -       19.03
SE-Rawformer [6]   1.05         0.0344  4.98         0.3186  -
SE-Rawformer†      1.15         0.0314  4.31         0.2851  20.26
RawMamba (ours)    1.47         0.0467  2.84         0.2517  22.48
RawBMamba (ours)   1.19         0.0360  3.28         0.2709  15.85

4.5. Comparison with the other end-to-end models

In Table 3, we compare the performance of RawBMamba on 19LA, 21LA, and 21DF with the other end-to-end models. RawMamba refers to a 12-layer unidirectional Mamba model, whereas RawBMamba is a 12-layer bidirectional Mamba model, with 6 layers per direction.

To verify RawBMamba's generalization capability, we conduct extensive experiments on the three datasets, with RawBMamba consistently performing well across all of them. Specifically, on the 21LA dataset, our proposed models significantly outperform the other baselines: RawBMamba shows a 34.1% performance improvement over SE-Rawformer, and RawMamba shows a 43.0% improvement over SE-Rawformer, with comparable results on the 19LA dataset. This indicates that Mamba captures long-range feature information more efficiently than the Transformer. Furthermore, experimental results on the 21DF dataset show that RawBMamba, as a single system, achieves performance extremely close to that of the multi-system score fusion T23. This indicates that RawBMamba maintains commendable performance even when faced with complex, diverse, and unknown speech forgery methods, demonstrating its robustness to out-of-domain data and further proving its generalization capability.
4.6. Effectiveness analysis of RawBMamba and Rawformer

We discuss whether the Mamba architecture has more potential than the Transformer architecture for handling the audio deepfake detection task. Specifically, we directly extract the final features on the 19LA test set using the Rawformer model and the RawBMamba model. Then, we use t-SNE [30], a non-linear dimensionality reduction algorithm for visualizing high-dimensional data, to visualize the features. The results are shown in Figure 2.

[Figure 2: The 19LA test set samples' clustering is shown in 2D t-SNE plots from the models' higher layers. The visualization displays the clustering of bonafide vs. fake audio in (a) Rawformer and (b) RawBMamba (blue for bonafide, red for fake) and different attack types in (c) Rawformer and (d) RawBMamba, with each color indicating a specific attack.]

From (a) and (b), it is not difficult to see that, for the binary task of bonafide and fake audio detection, there is partial block overlap between bonafide and fake audio for the Transformer, while Mamba shows only thread-like overlap at the ends of the curve-shaped clusters, indicating that the features obtained by Mamba are more discriminable than those extracted by the Transformer. Furthermore, we also discover that Mamba's feature visualization presents multiple non-overlapping curvilinear patterns, embodying richer feature information capable of distinguishing different attack methods. This indicates the Mamba architecture's potential in handling fine-grained classification tasks, as shown in (c) and (d). When mapping the sample points of different attack methods, the distribution of Mamba is clearer, which we believe is due to the key role of Mamba's sequence selection mechanism in obtaining discriminative information between different attack methods.

Therefore, we believe the Mamba architecture is capable of handling audio deepfake detection and has significant potential to become a backbone in audio deepfake detection models.
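This analysis can be reproduced along the following lines, assuming final-layer embeddings have already been extracted into an array (the random arrays below are placeholders for the extracted features and the 19LA attack labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(500, 64)        # placeholder final-layer features
labels = np.random.randint(0, 7, size=500)   # placeholder attack-type labels

points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, s=4, cmap="tab10")
plt.title("t-SNE of final-layer features (19LA test)")
plt.savefig("tsne_19la.png", dpi=200)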
5. Conclusion

In this paper, we propose an end-to-end state space model, named RawBMamba, to capture both short- and long-range discriminative information in audio signals. Specifically, we use parametrizable sinc layers and multiple convolutional layers to capture short-range features, then design a bidirectional Mamba to address the limitations of Mamba's unidirectional modeling. We use a bidirectional feature fusion module to merge forward and backward long-range features, enhancing audio context representation. Experiments indicate that RawBMamba achieves a 34.1% improvement over Rawformer on ASVspoof 2021 LA and outperforms state-of-the-art end-to-end models on that dataset, validating RawBMamba's generalizability and capability to handle out-of-domain data. Visualization analysis with t-SNE plots further validates the effectiveness of RawBMamba and its potential in fine-grained classification tasks. In future work, we plan to explore the application of RawBMamba to provenance tasks.

6. Acknowledgements

This work is supported by the Scientific and Technological Innovation Important Plan of China (No. 2021ZD0201502) and the National Natural Science Foundation of China (NSFC) (No. 62322120, No. U21B2010, No. 62306316, No. 62206278, No. 62201002).

7. References

[1] A. Nautsch, X. Wang, N. Evans, T. H. Kinnunen, V. Vestman, M. Todisco, H. Delgado, M. Sahidullah, J. Yamagishi, and K. A. Lee, "ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech," IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, no. 2, pp. 252-265, 2021.
[2] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," in Proc. ASVspoof 2021 Workshop, 2021.
[3] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan et al., "ADD 2022: The first audio deep synthesis detection challenge," in Proc. ICASSP 2022. IEEE, 2022, pp. 9216-9220.
[4] J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren et al., "ADD 2023: The second audio deepfake detection challenge," arXiv preprint arXiv:2305.13774, 2023.
[5] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proc. ICASSP 2022. IEEE, 2022, pp. 6367-6371.
[6] X. Liu, M. Liu, L. Wang, K. A. Lee, H. Zhang, and J. Dang, "Leveraging positional-related local-global dependency for synthetic speech detection," in Proc. ICASSP 2023. IEEE, 2023, pp. 1-5.
[7] J. Xue, C. Fan, J. Yi, C. Wang, Z. Wen, D. Zhang, and Z. Lv, "Learning from yourself: A self-distillation method for fake speech detection," in Proc. ICASSP 2023. IEEE, 2023, pp. 1-5.
[8] C. Wang, J. Yi, J. Tao, C. Y. Zhang, S. Zhang, and X. Chen, "Detection of cross-dataset fake audio based on prosodic and pronunciation features," in Proc. INTERSPEECH 2023, 2023, pp. 3844-3848.
[9] X. Zhang, J. Yi, J. Tao, C. Wang, and C. Y. Zhang, "Do you remember? Overcoming catastrophic forgetting for fake audio detection," in Proc. ICML 2023, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 41819-41831.
[10] X. Zhang, J. Yi, C. Wang, C. Y. Zhang, S. Zeng, and J. Tao, "What to remember: Self-adaptive continual learning for audio deepfake detection," in Proc. AAAI 2024. AAAI Press, 2024, pp. 19569-19577.
[11] C. Fan, M. Ding, J. Tao, R. Fu, J. Yi, Z. Wen, and Z. Lv, "Dual-branch knowledge distillation for noise-robust synthetic speech detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
[12] X. Zhang, J. Yoon, M. Bansal, and H. Yao, "Multimodal representation learning by alternating unimodal adaptation," in Proc. CVPR 2024, 2024, pp. 27456-27466.
[13] X. Zhang, W. Fu, and M. Liang, "Multimodal emotion recognition from raw audio with sinc-convolution," arXiv preprint arXiv:2402.11954, 2024.
[14] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-to-end anti-spoofing with RawNet2," in Proc. ICASSP 2021. IEEE, 2021, pp. 6369-6373.
[15] J. H. Hansen and Z. Wang, "Audio anti-spoofing using simple attention module and joint optimization based on additive angular margin loss and meta-learning," in Proc. Interspeech 2022, 2022, pp. 376-380.
[16] H. Tak, J. Jung, J. Patino, M. Todisco, and N. W. D. Evans, "Graph attention networks for anti-spoofing," in Proc. Interspeech 2021. ISCA, 2021, pp. 2356-2360.
[17] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proc. ICLR 2018. OpenReview.net, 2018.
[18] H. Tak, J.-W. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, "End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection," in Proc. ASVspoof 2021 Workshop. ISCA, 2021, pp. 1-8.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[20] X. Zhang, J. Yi, J. Tao, C. Wang, L. Xu, and R. Fu, "Adaptive fake audio detection with low-rank model squeezing," in Proc. Workshop on Deepfake Audio Detection and Analysis (co-located with IJCAI 2023), ser. CEUR Workshop Proceedings, vol. 3597, 2023, pp. 95-100.
[21] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[22] A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, and C. Ré, "Combining recurrent, convolutional, and continuous-time models with linear state space layers," Advances in Neural Information Processing Systems, vol. 34, pp. 572-585, 2021.
[23] A. Gu, K. Goel, A. Gupta, and C. Ré, "On the parameterization and initialization of diagonal state space models," Advances in Neural Information Processing Systems, vol. 35, pp. 35971-35983, 2022.
[24] A. Gupta, A. Gu, and J. Berant, "Diagonal state spaces are as effective as structured state spaces," Advances in Neural Information Processing Systems, vol. 35, pp. 22982-22994, 2022.
[25] A. Gu, K. Goel, and C. Ré, "Efficiently modeling long sequences with structured state spaces," in Proc. ICLR 2022. OpenReview.net, 2022. [Online]. Available: https://openreview.net/forum?id=uYLFoz1vlAC
[26] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021-1028.
[27] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in Proc. CVPR 2017, 2017, pp. 212-220.
[28] C. Wang, J. Yi, J. Tao, C. Y. Zhang, S. Zhang, R. Fu, and X. Chen, "TO-RawNet: Improving RawNet with TCN and orthogonal regularization for fake audio detection," in Proc. INTERSPEECH 2023, 2023, pp. 3137-3141.
[29] J. Li, Y. Long, Y. Li, and D. Xu, "Advanced RawNet2 with attention-based channel masking for synthetic speech detection," in Proc. INTERSPEECH 2023, 2023, pp. 2788-2792.
[30] L. Van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. 11, 2008.
