
IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL. 33, 2025

Multi-Task Partially Spoofed Speech Detection Using a Dual-View Graph Neural Network Assisted Segment-Level Module

Zirui Ge, Graduate Student Member, IEEE, Xinzhou Xu, Senior Member, IEEE, Haiyan Guo, and Björn W. Schuller, Fellow, IEEE

Abstract—Partially Spoofed Speech Detection (PSSD), as a multi-task learning problem, typically comprises segment- and utterance-level detection tasks, benefitting from diverse feature representations for effective classification. However, existing models for multi-task PSSD usually employ a shared feature processing module for the two tasks, which may lead to suboptimal performance compared with task-specific strategies. Further, most existing works mainly capture segment-level information from a single view, which may result in poorly modeling local differences between fake and bonafide segments. In this regard, we propose a Dual-view Graph neural network Assisted segment-level Module (DGAM) for multi-task PSSD. The proposed approach contains three modules: shared representation extracting, task-specific feature processing for the utterance-level task, and a Dual-View Graph Neural Network (D-GNN) with a dual-view consistency loss for the segment-level task, where a graph attention mechanism with cosine similarity and a heat kernel function with Euclidean distance serve as two different views, capturing semantic and Euclidean spatial relationships, respectively. Experimental evaluations on multiple spoofed-speech datasets demonstrate that the proposed approach outperforms existing approaches in both segment- and utterance-level detection in terms of equal error rate, showcasing its effectiveness for the multi-task partially spoofed scenario.

Index Terms—Dual-view consistency loss, dual-view graph neural network, partially spoofed speech detection, task-specific feature processing.

Received 7 August 2024; revised 14 March 2025 and 10 June 2025; accepted 10 June 2025. Date of current version 27 June 2025. This work was supported in part by the National Natural Science Foundation of China under Grant 62071242 and Grant 62471249, in part by the Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant KYCX23-1034, in part by the China Postdoctoral Science Foundation under Grant 2022M711693, and in part by the DFG (German Research Foundation) Reinhart Koselleck-Project AUDI0NOMOUS under Grant 442218748. The associate editor coordinating the review of this article and approving it for publication was Dr. Romain Serizel. (Corresponding author: Zirui Ge.)

Zirui Ge and Haiyan Guo are with the School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China.

Xinzhou Xu is with the School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China, and also with the Signal Processing and Speech Communication Laboratory, Technische Universität Graz, Graz, Austria.

Björn W. Schuller is with CHI – The Chair of Health Informatics, Technische Universität München (TUM), 81675 Munich, Germany, also with the Munich Data Science Institute (MDSI), 85748 Munich, Germany, also with the Munich Center for Machine Learning (MCML), 80333 Munich, Germany, and also with GLAM – The Group on Language, Audio, & Music, Imperial College London, London SW7 2AZ, U.K.

Digital Object Identifier 10.1109/TASLPRO.2025.3581019

I. INTRODUCTION

The Automatic Speaker Verification (ASV) technology aims to authenticate individuals based on their unique vocal characteristics extracted from spoken signals [1], [2], [3]. On the contrary, advances in Text-To-Speech (TTS) and Voice Conversion (VC) techniques [4], [5] induce emerging security challenges in designing an ASV system. In response to this type of risk, various speech anti-spoofing challenges [6], [7] have been organized to encourage developing Audio DeepFake Detection (ADD) systems [8], [9]. As a typical case in practice, Partially Spoofed Speech Detection (PSSD) further focuses on identifying substituted segments from bonafide to synthesized speech in utterances [10], [11], which requires building fine-grained models for ADD systems considering the fusion of bonafide and synthesized segments. Initial PSSD research [11] created the Half-truth Audio Detection (HAD) dataset, which mainly alters a few words in an utterance and was later integrated into the ADD 2022 challenge database [7]. Concurrently, [10] developed the Partial Spoof Database (PSD) by adding spoofed segments or substituting parts of an original utterance with spoofed speech at the segment level, and [12] then refined the segment-level labels in [10] to a frame-level form.

As an inbuilt multi-task case, PSSD primarily involves utterance-level and segment-level tasks, where the utterance-level task aims to determine whether an utterance is fake or bonafide, whereas the segment-level task focuses on detecting fake segments at a granular level. For the former task, [13] integrates attentive statistics pooling and wav2vec 2.0 (W2V2) [14], [15], while [16] formulates PSSD as a Multiple Instance Learning (MIL) problem [17]. Towards the segment-level task, [18], [19] propose boundary detection systems to locate manipulated segments, and [20] further proposes a fine-grained PSSD approach of Temporal Deepfake Location (TDL), while [21] fuses temporal and spectrogram domains to identify fake segments. Nevertheless, it is normally inadequate to regard PSSD as two separate tasks [12]; due to this, [22] proposes the SE-LCNN network followed by two Bidirectional Long Short-Term Memory (BLSTM) layers to implement Multi-Task Learning (MTL) for PSSD. Further, [12] proposes a W2V2-based model to enhance frame-level detection performance including different back-end modules, while [23] proposes a framework incorporating a
Question-Answering (QA) strategy with self-attention mechanisms.

Despite the previously proposed multi-task PSSD approaches, there still exist two issues to be addressed. First, most of the existing multi-task PSSD works employ a shared feature processing module for the two tasks, owing to which task-specific information may be insufficiently included in learning the downstream models [24]. Second, the lack of modeling inherent inter-segment relationships from different views may give rise to difficulties in utilizing the local differences for segment-level tasks, in view of the local structural differences between fake and bonafide segments [25].

In response, we propose the Dual-view Graph neural network Assisted segment-level Module (DGAM) to address these shortcomings. For the first problem, in addition to a shared representation extracting module, we propose a two-branch structure of task-specific representation processing modules, with parallel utterance- and segment-level branches. Then, regarding the second problem, we employ a Dual-View Graph Neural Network (D-GNN) within the task-specific representation processing modules, in which we treat speech segments as nodes residing on a graph and hence transform the issue of depicting local relationships among segments into graph-structure modeling. Paralleled with an utterance-level branch with an utterance loss, the D-GNN further results in Dual-View Consistency (DVC) losses and a segment loss for the PSSD tasks.

The proposed approach contains a speech representation extracting module and task-specific feature processing modules with parallel utterance- and segment-level branches. Through pre-trained self-supervised models, the speech representation extracting module aims to acquire frame-level representations, which are fed to the two branches. Within the utterance-level branch, we process the representations using an utterance-level module with pooling, leading to the utterance loss. For the segment-level branch, the D-GNN employs a parallel structure including multi-view graph neural networks on segments and a One-Dimensional Convolutional Neural Network (1D-CNN) structure, through which the segment and DVC losses are obtained.

To clarify novelty, we compare the proposed approach with highly related existing works. Compared with models solely focusing on a single task [18], [19], [20], our approach simultaneously addresses both utterance- and segment-level tasks. Further, different from existing MTL models relying on fully shared feature extracting and processing modules [12], [22], [23], we integrate separate feature processing modules tailored to each task, to learn task-specific information. In addition, [20] considers segments' relationships using an embedding-similarity module, while our approach regards segments as graph nodes, which may yield better discriminative representations for PSSD. In contrast to the alternative reconstruction loss [26] and Dual Correlation Reduction Network (DCRN) [27], we propose the DVC loss for the D-GNN based segment-level branch.

Then, the main contributions of this paper are presented as follows.
• We propose a multi-task PSSD approach using a dual-view graph neural network assisted segment-level module.
• Within the proposed approach, we design task-specific feature processing modules to learn task-specific information for PSSD.
• Within the proposed approach, we employ a dual-view graph neural network with dual-view consistency losses, for the purpose of learning effective inter-segment relationships in segment-level processing.

The remainder of this paper is organized as follows. Section II introduces the related works, while the proposed approach is detailed in Section III. Then, Sections IV and V present the experimental setups and results, respectively. Finally, Section VI concludes the paper.

II. RELATED WORKS

A. Spoofed Speech Detection

Existing research on spoofed speech detection can be generally divided into two directions, focusing on features and algorithms, respectively. The first direction aims to investigate feasible handcrafted acoustic features for ADD, including relative phase shift [28], [29], Cochlear Filter Cepstral Coefficients (CFCCs) [30], instantaneous frequency [30], [31], Linear Frequency Cepstral Coefficients (LFCCs) [32], and Constant-Q Cepstral Coefficients (CQCCs) [33]. However, [34], [35] reveal significant performance variance among these handcrafted features, indicating their sensitivity to spoofing types. To further improve an ADD system's performance, these features can be processed with the inclusion of modification, fusion, and multiple adaptation strategies [34], [35], [36]. Nevertheless, such handcrafted features struggle to model vocal complexity across speaker, accent, and environment variations and exhibit critical vulnerabilities to PS attacks and compressed-speech artifacts, which motivates the community to explore data-driven representations [35], [37].

The second direction focuses on algorithmic advancements. Existing works mainly develop end-to-end approaches, which typically include the use of RawNet2 [38] as the backbone for raw audio signals [37]. Similarly, Self-Supervised Learning (SSL) based pre-trained models [15], [39], [40] operating on raw audio signals have been employed as the front end for ADD [41], encompassing a freezing or fine-tuning process for acquiring speech representations [9], [42]. Then, [43] proposes a Spoofing-aware Transformer Network (SpoTNet) by integrating handcrafted features with deep attention modeling for the ADD tasks. Further, Lin et al. and Zhang et al. introduce a refined ResNet-based encoder and incorporate a one-class classification loss for detecting spoofed speech [44], [45].

In relation to partially-spoofed cases, the PSSD tasks mainly focus on detecting synthesized segments and identifying the utterances including spoofed segments [10]. Following the front ends, [12] utilizes a gated Multi-Layer Perceptron (gMLP) [46] as the back-end network to address multiple PSSD tasks, while [18], [19] set a One-Dimensional Residual Network (ResNet-1D) module for detecting the boundaries of
manipulated segments. Considering the effectiveness of attention mechanisms, Zhu et al. [16] regard PSSD as a MIL problem using a local self-attention module, and Wu et al. [23] incorporate a self-attention mechanism with a Question-Answering (QA) strategy. Meanwhile, related PSSD works also contain the inclusion of the Light Convolutional Neural Network (LCNN) model [16], [22], where Zhang et al. [22] mainly investigate training strategies when using the LCNN model. In addition, Xie et al. [20] further design an embedding similarity module for describing the relationships between bonafide and fake segments. [47] proposes a spectra-temporal fusion strategy combining frame-level and utterance-level coefficients to deal with multiple spoofing attacks. Further, [48] proposes a Boundary-aware Attention Mechanism (BAM) containing Boundary Enhancement (BE) and Boundary Frame-wise Attention (BFA) modules to deal with the partially spoofed audio localization task. Besides, [49] employs the improved Boundary-Aware Temporal Forgery Detection (BA-TFD+) to localize the boundaries of fake audio-visual segments, and [50] further proposes a Universal MultiModal-Adaptive transFormer (UMMAFormer) framework for detecting manipulated multimedia content.

Fig. 1. Multi-task learning strategies (a) with shared speech feature/representation extracting and feature processing modules and (b) with a shared speech feature/representation extracting module and task-specific feature processing modules.
B. Graph Neural Networks for Speech Processing

Graph Neural Networks (GNNs) [51], [52] were first proposed to extend Convolutional Neural Networks (CNNs) from Euclidean-space data to graph-structured data. Conventional GNNs benefit from the framework of Message Passing Neural Networks (MPNNs) [53] to iteratively update node features by aggregating neighboring nodes' messages. Further, Kipf et al. [54] propose Graph Convolutional Networks (GCNs) involving a renormalization scheme to weight aggregated neighbor features. Based on GCNs, Velickovic et al. [55] propose Graph ATtention networks (GATs) with learnable attention weights for the node aggregation process. Meanwhile, Thekumparampil et al. [56] leverage an explicit cosine-similarity attention mechanism between vertex pairs to obtain attention weights for the node aggregation.

In addition to the prevailing approaches using Recurrent Neural Networks (RNNs) and CNNs [57], [58], [59] for temporal or spectral modeling in speech, the low-level relationships in an utterance can be further described by graphs [3], [9], [60]. Related typical works on GNN-based speech processing [61] include ASV [62], [63], speaker diarization [64], and emotion recognition [65], [66]. In ASV tasks, GATs can be employed to model segments' aggregation manner for separating speakers [62], [63], and further endeavors focus on improving the GAT structures using isomorphic graph attention [3]. For the field of speaker diarization, expanding upon the initial works using GNNs [64], Kwon et al. [67] propose to employ GATs for multi-scale speaker diarization, while Wei et al. [68] perform speaker diarization via graph attention-based deep embedded clustering.

In view of the specific locations for detecting spoofing attacks in ADD tasks, Tak et al. [69] initially employ GNNs to learn specific information among sub-bands and temporal segments. Building upon this, Jung et al. [70] and Tak et al. [41] further propose spectro-temporal GATs for ADD, utilizing a heterogeneous stacking graph attention layer. In addition, Chen et al. [71] incorporate spectro-temporal dependencies into the graphs within GNNs.

III. METHODOLOGY

Conventional multi-task PSSD models share speech representation extracting and feature processing modules across tasks [12], [22], [23], as shown in Fig. 1(a). In contrast, the proposed approach introduces dedicated processing modules for the segment-level and utterance-level tasks, as shown in Fig. 1(b). Specifically, the architecture of our proposed DGAM, shown in Fig. 2, consists of two principal components: 1) a shared speech representation extracting module and 2) specialized feature processing modules with integrated loss functions. The utterance-level branch incorporates a pooling layer, culminating in the utterance-level loss computation. Conversely, the segment-level branch employs two GNN views with a 1D-CNN construct, generating distinctive DVC and segment-level loss terms.

A. Speech Representation Extracting Module

Recent advances in self-supervised speech representation learning have produced diverse architectures including W2V2 [14], WavLM [39], and XLS-R [40], among others. Taking into account availability and computational complexity, we adopt W2V2-base [14] as the primary representation extractor. The implemented W2V2-base model, pre-trained on 960 hours of Librispeech and fine-tuned with contrastive loss [14], processes raw speech through a CNN encoder followed by context-aware Transformer layers.
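As a concrete illustration of this front end, the sketch below extracts the 13 hidden-state matrices (CNN-encoder projection plus 12 Transformer layers) from a W2V2-base model. It is a minimal example assuming the HuggingFace transformers implementation; the checkpoint name and input length are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch: obtaining frame-level W2V2-base representations.
# Assumes the HuggingFace `transformers` Wav2Vec2Model; the checkpoint
# name is an assumption, not necessarily the authors' setup.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000 * 4)  # placeholder: 4 s of 16 kHz audio

with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# 13 hidden states, each of shape (batch, N, 768), one step per ~20 ms.
hidden_states = torch.stack(out.hidden_states, dim=0)  # (13, batch, N, 768)
print(hidden_states.shape)
```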

Fig. 2. Overview of the proposed DGAM framework, including a representation extracting module, a task-specific module for the utterance-level task, and a dual-view graph neural network module with the DVC loss functions for the segment-level task, where λ1 and λ2 are weight parameters.

Then, for an arbitrary utterance sample x, the F-dimensional column vector $d_{i,j}$ represents the output of the j-th hidden layer (j = 1, 2, ..., 13) at the i-th time step (i = 1, 2, ..., N) in x, where N indicates the number of time steps, each corresponding to a 20 ms segment in accordance with [12], [20], [23]. Fine-tuning the 13 hidden representation layers, we further follow the implementation in [3] and combine all 13 hidden representations with their corresponding trainable linear weights $\eta = [\eta_1, \eta_2, \ldots, \eta_{13}]^T$, leading to the i-th-step output for the utterance x, represented as

$r_i = (\eta^T \mathbf{1}_{13})^{-1} [d_{i,1}, d_{i,2}, \ldots, d_{i,13}]\, \eta \in \mathbb{R}^{F \times 1}$,  (1)

where $\mathbf{1}_{13}$ indicates a 13-dimensional column vector with each of its elements equal to 1. The output representation matrix is denoted as $R = [r_1, r_2, \ldots, r_N] \in \mathbb{R}^{F \times N}$.
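Per time step, (1) reduces to a weighted average of the 13 hidden vectors, normalized by the sum of the trainable weights. A minimal PyTorch sketch of such a combination module is given below; the class and variable names are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Normalized learnable combination of the 13 hidden layers, as in (1)."""
    def __init__(self, num_layers: int = 13):
        super().__init__()
        self.eta = nn.Parameter(torch.ones(num_layers))  # trainable layer weights

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, N, F)
        weights = self.eta / self.eta.sum()            # (eta^T 1_13)^{-1} * eta
        r = (weights.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)
        return r                                       # (batch, N, F)

# Usage with the stacked hidden states from the front end:
# R = LayerWeightedSum()(hidden_states)   # (batch, N, 768)
```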
B. Task-Specific Feature Processing Modules

The Utterance-Level Branch: As shown in Fig. 2, for the utterance-level branch, we first set an utterance-level module $f_1(\cdot)$ and a temporal average-pooling operator $\mathrm{Pooling}(\cdot)$ to process the low-level representations from the speech representation extracting module, resulting in the utterance-level branch's F-dimensional output

$h_u = \mathrm{Pooling}(f_1(R))$,  (2)

with the utterance-level module's output $\tilde{R} = f_1(R)$. Specifically, the utterance-level module $f_1(\cdot)$ admits either of two mutually exclusive configurations: 1) the QA architecture $f_1^{(\mathrm{QA})}(\cdot)$ [23], which adapts the extraction-based QA mechanism through self-attention layers for the PSSD tasks, or 2) the B²LSTM architecture $f_1^{(\mathrm{B^2LSTM})}(\cdot)$ [12], [22], which captures temporal changes in PS speech via bidirectional temporal modeling.

The QA structure employs multi-head Scaled Dot-Product Attention (SDPA) to process the low-level representations, and the i′-th head's output from the total of h heads (i′ = 1, 2, ..., h) is represented as

$\mathrm{Atten}_{i'}(R) = \mathrm{softmax}\!\left(\frac{R^T W^Q_{i'} (R^T W^K_{i'})^T}{\sqrt{d_k}}\right) R^T W^V_{i'}$,  (3)

where the i′-th-head weights $W^Q_{i'}, W^K_{i'} \in \mathbb{R}^{F \times d_k}$ and $W^V_{i'} \in \mathbb{R}^{F \times d_v}$, with the dimensionalities $d_k$ and $d_v$, respectively. Hence, we obtain the output

$f_1^{(\mathrm{QA})}(R) = \mathrm{GELU}\!\left(\mathrm{Atten}(R)\, W^O\right)^T \in \mathbb{R}^{F \times N}$,  (4)

where the multi-head attention $\mathrm{Atten}(R) = [\mathrm{Atten}_1(R), \mathrm{Atten}_2(R), \ldots, \mathrm{Atten}_h(R)]$, the linear weights $W^O \in \mathbb{R}^{h d_v \times F}$, and $\mathrm{GELU}(\cdot)$ is a Gaussian Error Linear Unit (GELU) [72] activation.

Then, the B²LSTM structure consists of two cascaded BLSTM layers, each denoted as $g^{(\mathrm{BLSTM})}(\cdot)$, enhanced with a residual shortcut [12]. Note that each layer's output considers all the time steps, leading to a size of F × N. Thus, we write the B²LSTM structure's output as

$f_1^{(\mathrm{B^2LSTM})}(R) = g^{(\mathrm{BLSTM})}\!\left(g^{(\mathrm{BLSTM})}(R)\right) + R$.  (5)

The Segment-Level Branch: For the segment-level task, we design a novel D-GNN module $f_2(\cdot)$, which contains two GNNs and a 1D-CNN structure in parallel to learn feature representations from different views. Instead of simply stacking modules, these different views are constructed in accordance with the complementary and consistency principles of multi-view learning, ensuring that each view captures unique knowledge while maximizing agreement across distinct views [73]. Then, the representation matrix $R \in \mathbb{R}^{F \times N}$ is fed to the three structures, respectively.
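As a concrete illustration of the B²LSTM configuration of the utterance-level branch in (2) and (5), a minimal PyTorch sketch follows; the hidden sizes are chosen so that the residual addition is well defined and are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class B2LSTMBranch(nn.Module):
    """Two cascaded BLSTM layers with a residual shortcut (5),
    followed by temporal average pooling (2). Sizes are illustrative."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # bidirectional => output dim = 2 * hidden = feat_dim,
        # so the residual addition with the input R is shape-compatible
        self.blstm1 = nn.LSTM(feat_dim, feat_dim // 2, batch_first=True,
                              bidirectional=True)
        self.blstm2 = nn.LSTM(feat_dim, feat_dim // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, R: torch.Tensor):
        # R: (batch, N, F) frame-level representations
        h, _ = self.blstm1(R)
        h, _ = self.blstm2(h)
        R_tilde = h + R                 # residual shortcut, (batch, N, F)
        h_u = R_tilde.mean(dim=1)       # temporal average pooling, (batch, F)
        return h_u, R_tilde

# h_u feeds the utterance loss; R_tilde keeps the per-segment resolution.
```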

Let $G = (V, E, A)$ be a graph with its N-node set V and the edge set $E = \{e_{i,j}\}_{i,j \in V}$ linking the nodes, where $e_{i,j} = 1$ if a link exists from node i to node j, and $e_{i,j} = 0$ otherwise. Note that we set G to a complete graph, leading to $e_{i,j} = 1$ for all the edges. Further, the N × N weight adjacency matrix A, based on node-representation similarity, sets the weights for the edges, with its i-th-row and j-th-column element represented as $a_{i,j}$ (i, j = 1, 2, ..., N). Then, we denote the l-th-layer weight adjacency matrices for the first and second GNN views as $\hat{A}^{(l)}$ and $\bar{A}^{(l)}$, corresponding to the elements $\hat{a}_{i,j}^{(l)}$ and $\bar{a}_{i,j}^{(l)}$, respectively. These two GNN views are designed to extract distinct yet complementary node-similarity patterns, implementing the complementary principle of multi-view learning.

For the first GNN view, we implement a GAT with the cosine similarity [56] to perform learning on the N nodes corresponding to the input $R = [r_1, r_2, \ldots, r_N]$, aiming to capture semantic relationships of different nodes. Specifically, for the i-th node and the l-th layer in the L-layer network (l = 1, 2, ..., L), we first map the obtained frame features to a hidden state

$\tilde{h}_i^{(l)} = \tilde{W}^{(l)} \hat{r}_i^{(l-1)} + \tilde{b}^{(l)} \in \mathbb{R}^{F' \times 1}$,  (6)

where the trainable projection matrix $\tilde{W}^{(l)} \in \mathbb{R}^{F' \times F'}$ for l > 1 and $\tilde{W}^{(1)} \in \mathbb{R}^{F' \times F}$, with the offset $\tilde{b}^{(l)} \in \mathbb{R}^{F' \times 1}$. When l = 1, the aggregated node $\hat{r}_i^{(0)} = r_i$; otherwise, for l > 1, the node aggregation of $\hat{r}_i^{(l)}$ is written as

$\hat{r}_i^{(l)} = \sigma\!\left( \hat{a}_{i,i}^{(l)} \tilde{h}_i^{(l)} + \sum_{j \in \mathcal{N}(i)} \hat{a}_{i,j}^{(l)} \tilde{h}_j^{(l)} \right)$,  (7)

where $\sigma(\cdot)$ is a Rectified Linear Unit (ReLU) activation, and $\mathcal{N}(i)$ represents the set of the i-th node's neighbors including itself. For $\hat{a}_{i,j}^{(l)}$ in the l-th-layer weight adjacency matrix $\hat{A}^{(l)}$, we further set

$\hat{a}_{i,j}^{(l)} = \frac{e^{\beta^{(l)} \hat{S}_{i,j}^{(l)}}}{\sum_{k \in \mathcal{N}(i)} e^{\beta^{(l)} \hat{S}_{i,k}^{(l)}}}$,  (8)

where $\beta^{(l)}$ is a learnable parameter initialized as 1, and $\hat{S}_{i,j}^{(l)} = \cos(\tilde{h}_i^{(l)}, \tilde{h}_j^{(l)})$ is the symmetric node-similarity score. However, the attention weights may be diluted across excessive edges, causing nodes to aggregate irrelevant information from noisy connections (the edges linking nodes across different classes).

For the second GNN view, similar to the first view, the i-th node representation also undergoes a linear transformation to obtain the l-th-layer hidden state $\bar{h}_i^{(l)}$, and a similar node aggregation is employed. Unlike the cosine similarity in the first view, the second view employs the heat-kernel-based method [74], [75], [76] with a stricter distance to encode Euclidean spatial relationships, considering the self-similarity principle of the heat kernel (see Appendix A). Specifically, the node-similarity matrix based on the heat kernel function is $\bar{S}_{i,j}^{(l)} = e^{-\|\bar{h}_i^{(l)} - \bar{h}_j^{(l)}\|^2 / \kappa^{(l)}}$, and the weight adjacency matrix is $\bar{A}^{(l)} = (D^{(l)})^{-1} \bar{S}^{(l)}$, where $\kappa^{(l)} > 0$ and the degree matrix $D^{(l)}$ satisfies $D_{i,i}^{(l)} = \sum_j \bar{S}_{i,j}^{(l)}$. Finally, the aggregated representation of the i-th node in the second GNN view is denoted as $\bar{r}_i^{(l)}$ with l = 1, 2, ..., L and i = 1, 2, ..., N. Although this view produces sparser weight adjacency matrices (characterized by predominantly low-weight edges) and suppresses noisy edge weights, the excessive sparsity (particularly in high-dimensional representations) may inadvertently discard some intra-class edges.

Then, these two graph views can be considered to have complementary relationships arising from the trade-off between the intra-class edge preservation in the first view and the noise-edge suppression capability in the second view. We further combine the two complementary views' outputs as

$r_i^G = g^{(\mathrm{MLP})}\!\left(\hat{r}_i^{(L)}\right) + g^{(\mathrm{MLP})}\!\left(\bar{r}_i^{(L)}\right)$,  (9)

using two one-hidden-layer Multi-Layer Perceptrons (MLPs) with shared parameters, represented as $g^{(\mathrm{MLP})}(\cdot)$ with ReLU activation and noted as 'MLP1' and 'MLP2' in Fig. 2.

While the GNN module efficiently captures topological information within the speech segments, the Deep Neural Network (DNN)-based module focuses on node attribute features, enabling the generation of heterogeneous yet complementary representations through their distinct characteristic information [76], [77], [78]. To further enhance the representation of each segment, we employ a 1D-CNN structure to encode the node attribute features, leveraging the temporal one-dimensional convolutional operator $g^{(\text{1D-CNN})}(\cdot)$ with $F'$ kernels, which is expressed as

$H_R = g^{(\text{1D-CNN})}(R) \in \mathbb{R}^{F' \times N}$,  (10)

with $H_R = [r_1^R, r_2^R, \ldots, r_N^R]$. Therefore, we derive the output representation for the i-th time step by integrating these two heterogeneous representations as

$r_i^S = r_i^G + r_i^R \in \mathbb{R}^{F' \times 1}$,  (11)

through which the output matrix of the branch representation can be written as $H_S = [r_1^S, r_2^S, \ldots, r_N^S]$.

C. Dual-View Consistency Based Loss Function

Afterwards, we introduce the DVC losses for the two GNN views and the 1D-CNN structure to enforce the consistency [73]. These can be achieved via aligning the cross-view feature correlation with the ground-truth node adjacency matrix $A_g \in \mathbb{R}^{N \times N}$, where the i-th-row and j-th-column element is set to 1 if the i-th and j-th segments in the utterance belong to the same class, and to 0 otherwise.

In the DVC loss, we employ a bilinear scoring function to evaluate node-pair similarity, which originates from the Mutual Information (MI) measurement initially developed for semi-supervised cross-view discrimination [79]. Different from the semi-supervised scenario, our supervised setting provides the ground-truth node adjacency matrix $A_g \in \mathbb{R}^{N \times N}$ and directly integrates this bilinear scoring function into $\mathcal{L}_{\mathrm{DVC1}}$ to learn view-invariant graph constructs. Hence, the first DVC loss
$\mathcal{L}_{\mathrm{DVC1}}$ is set as

$\mathcal{L}_{\mathrm{DVC1}} = \frac{1}{N^2}\left\| \mathrm{Sig}\!\left(\hat{H}^T W_1 \bar{H}\right) - A_g \right\|^2$,  (12)

where $\hat{H} = [\hat{r}_1^{(L)}, \hat{r}_2^{(L)}, \ldots, \hat{r}_N^{(L)}]$ and $\bar{H} = [\bar{r}_1^{(L)}, \bar{r}_2^{(L)}, \ldots, \bar{r}_N^{(L)}]$ are the two graph-view representations, $W_1 \in \mathbb{R}^{F' \times F'}$ is a transformation matrix, and $\mathrm{Sig}(\cdot)$ denotes a Sigmoid activation.

Then, the $\mathcal{L}_{\mathrm{DVC2}}$ loss enhances the integration of the graph topology and node attributes by enforcing consistency between the GNN-derived structural representations and the 1D-CNN-processed attribute representations at the node-pair level. Using $H_G = [r_1^G, r_2^G, \ldots, r_N^G]$ and $H_R$ from (9) and (10), respectively, the second DVC loss is

$\mathcal{L}_{\mathrm{DVC2}} = \frac{1}{N^2}\left\| \mathrm{Sig}\!\left(H_G^T W_2 H_R\right) - A_g \right\|^2$,  (13)

where $W_2 \in \mathbb{R}^{F' \times F'}$ refers to a transformation matrix.
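To make the two consistency terms concrete, the sketch below computes (12) and (13) from per-segment representations and the ground-truth adjacency. It is an illustrative reconstruction under the notation above (representations stored row-wise), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVCLoss(nn.Module):
    """Dual-view consistency losses (12) and (13), sketched for clarity."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.W1 = nn.Parameter(torch.empty(dim, dim))  # for L_DVC1, eq. (12)
        self.W2 = nn.Parameter(torch.empty(dim, dim))  # for L_DVC2, eq. (13)
        nn.init.xavier_uniform_(self.W1)
        nn.init.xavier_uniform_(self.W2)

    def forward(self, H_hat, H_bar, H_G, H_R, A_g):
        # All representations are (N, F'); A_g is the (N, N) float 0/1 matrix
        # built from segment labels (1 iff two segments share the same class).
        s1 = torch.sigmoid(H_hat @ self.W1 @ H_bar.t())  # cross-GNN-view scores
        s2 = torch.sigmoid(H_G @ self.W2 @ H_R.t())      # GNN vs. 1D-CNN scores
        l_dvc1 = F.mse_loss(s1, A_g)                     # (1/N^2)||. - A_g||^2
        l_dvc2 = F.mse_loss(s2, A_g)
        return l_dvc1, l_dvc2

# These two terms are later weighted by lambda1 and lambda2 in the total loss.
```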
Finally, we employ the Mean Square Error for Probability-to-Similarity Gradient (MSE-P2SGrad) loss function [80], [81] for the utterance loss $\mathcal{L}_{\mathrm{utt}}$ and the segment loss $\mathcal{L}_{\mathrm{seg}}$, which is widely used in ADD and PSSD tasks [10], [12], [22], [81]. Using the representations $h_u$ from (2) and $H_S$ from (11) with their ground-truth one-hot labels $y_u \in \mathbb{R}^{2 \times 1}$ and $Y_S = [(y_S)_1, (y_S)_2, \ldots, (y_S)_N] \in \mathbb{R}^{2 \times N}$, respectively, the utterance loss is

$\mathcal{L}_{\mathrm{utt}} = \sum_{c \in \{0,1\}} \left( \cos\!\left(h_u, w_u^{(c)}\right) - y_u^T p^{(c)} \right)^2$,  (14)

where $p^{(0)} = [1, 0]^T$ and $p^{(1)} = [0, 1]^T$, and $w_u^{(c)} \in \mathbb{R}^{F \times 1}$ indicates the utterance-loss learnable variable for class c.

Similarly, the segment loss is

$\mathcal{L}_{\mathrm{seg}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \{0,1\}} \left( \cos\!\left(r_i^S, w_S^{(c)}\right) - (y_S)_i^T p^{(c)} \right)^2$,  (15)

where $w_S^{(c)} \in \mathbb{R}^{F' \times 1}$ indicates the segment-loss learnable variable for class c. In this way, the total loss function for the proposed DGAM can be represented as

$\mathcal{L} = \mathcal{L}_{\mathrm{utt}} + \mathcal{L}_{\mathrm{seg}} + \lambda_1 \mathcal{L}_{\mathrm{DVC1}} + \lambda_2 \mathcal{L}_{\mathrm{DVC2}}$,  (16)

where $\lambda_1$ and $\lambda_2$ refer to non-negative weight parameters.

TABLE I
DETAILS OF THE PSD, LA, AND HAD DATASETS, INCLUDING TOTAL DURATION, UTTERANCE DURATION, AND ALSO THE NUMBERS OF BONAFIDE UTTERANCES, FULLY-SYNTHETIC (NOTED AS 'FS') UTTERANCES, PARTIALLY-SPOOFED (NOTED AS 'PS') UTTERANCES, AND SEGMENTS, IN TERMS OF THE 20 MS TEMPORAL RESOLUTION FOR EACH SUBSET ARE SHOWN

IV. EXPERIMENTAL SETUPS

A. Datasets

In order to evaluate the performance of the proposed DGAM framework for partially-spoofed speech detection, we perform experiments on the ASVspoof 2019 Logical Access (LA) [6], the PSD [12], and the HAD [7], [11] datasets, with their information shown in Table I.

The LA dataset is derived from the ASVspoof 2019 challenge [6], comprising three distinct subsets for training, development, and evaluation [6]. Within this dataset, spoofed speech samples are produced using a suite of speech synthesis and voice conversion algorithms (noted as 'A01' to 'A19'). Specifically, the training and development subsets are generated using six different algorithms ('A01' to 'A06'), whereas the evaluation subset employs thirteen algorithms ('A07' to 'A19'). Notably, the algorithms 'A16' and 'A19' replicate the algorithms 'A04' and 'A06', respectively. In the experiments, we exclusively focus on the evaluation set to assess the cross-scenario efficacy of all the compared detection approaches.

The PSD dataset¹ contains more than 100 hours of audio in total and is also divided into training, development, and evaluation subsets, built upon the three corresponding subsets of the LA dataset, respectively. The three subsets include approximately 25.4 k (2.6 k bonafide), 24.8 k (2.5 k bonafide), and 71.2 k (7.4 k bonafide) samples, respectively. The PSD dataset considers six resolution levels (20 ms, 40 ms, 80 ms, 160 ms, 320 ms, and 640 ms), of which we only focus on the 20 ms resolution for the segment-level task, as in [12]. The partially-spoofed utterances in the dataset are obtained by substituting spoofed segments from fully-synthetic utterances for segments in bonafide utterances, and vice versa. Note that all the synthetic and bonafide utterances are from the LA dataset.

¹ [Online]. Available: https://zenodo.org/record/5766198

For the procedure of segment substitution, either type of the original fully bonafide or fully synthetic utterances can be regarded as a set of original utterances (with original segments), while the other type is noted as a set of candidate utterances (with candidate segments) for the purpose of replacing the original segments. First, majority voting over three types of Voice Activity Detection (VAD) strategies is employed to obtain available
regions for selecting both original and candidate segments. Then, the original and candidate segments are selected from the regions of the original and candidate utterances, respectively. Note that the selection obeys three rules: 1) for each original utterance, the candidate segments should be from one speaker but different utterances; 2) for each original utterance, any candidate segment should be used only once; and 3) the original and candidate segments in each substitution pair should be of a similar length. Hence, each partially-spoofed utterance may contain segments generated through more than one TTS or VC method, and further, the meanings of the sentences, words, and phonemes are not taken into consideration in the substitution. Afterwards, the PSD dataset also involves several post-processing steps, in order to avoid potential artifacts introduced by concatenating audio segments.

The HAD dataset consists of Mandarin Chinese audio utterances, based on the AISHELL-3 corpus [82]. The dataset is generated by replacing bonafide segments (in bonafide utterances) with spoofed segments (from fully-synthetic utterances), using an open-source multi-speaker end-to-end TTS approach² and the neural vocoder LPCNet [83]. The generation of the HAD dataset includes three steps: textual editing, synthesis, and substitution. In the first step, the textual editing is achieved by randomly replacing one critical entity (including person, location, organization, or time) with its antonym within each bonafide utterance's transcript. Then, the TTS system and vocoder are employed to generate synthetic utterances based on the edited text, and further, the selected entities' audio in the original utterances is replaced with the corresponding spoofed parts from the synthetic utterances. Note that for the evaluation subset, an improved LPCNet vocoder is utilized in generating the synthetic utterances, in order to evaluate models' generalization performance.

² [Online]. Available: https://github.com/syang1993/gst-tacotron.git

The published version of the dataset³ is officially split into training, development, and evaluation subsets, approximately containing 53.1 k (26.6 k bonafide), 17.8 k (8.9 k bonafide), and 9.1 k (0 bonafide) samples, respectively. Critically, the original evaluation subset contains no bonafide samples, rendering it incapable of supporting utterance-level metrics like the Equal Error Rate (EER) that are crucial for anti-spoofing evaluation. To mitigate this limitation, we reallocate the subsets: the new training, development, and evaluation subsets correspond to the original development, evaluation, and training subsets, respectively. While the new development subset still lacks bonafide samples, we leverage its PS samples to evaluate trained models using segment-level metrics, circumventing the need for utterance-level EER computation. To guarantee fair comparisons, all compared models are re-trained and evaluated on the reallocated HAD subsets.

³ [Online]. Available: https://zenodo.org/records/10377492

B. Implementation Details

For the evaluation metric, the reported performance is measured in terms of the EER [6], [8], [36] for both the utterance-level and segment-level tasks, which can be calculated by regulating the decision thresholds for the utterance-level outputs $\cos(h_u, w_u^{(c)})$ in (14) and the segment-level outputs $\cos(r_i^S, w_S^{(c)})$ in (15), until equal true-positive and true-negative recalls are achieved for the two classes.

Within the proposed network, the output feature dimension is F = 768 in the speech representation extracting module. Then, in the utterance-level branch, the B²LSTM structure sets 512 and 256 hidden neurons for its first and second BLSTM layers, while the QA structure sets its input to 768 dimensions, the number of heads to h = 4, and the weights' dimensionalities $d_k$ and $d_v$ both to 192. In the segment-level branch, we set the transformation dimension F′ = 512 and the number of D-GNN layers to L = 1. With one hidden layer, the MLP1 and MLP2 in the D-GNN module both contain 512, 1,024, and 512 nodes in their input, hidden, and output layers, respectively. For the 1D-CNN structure, with 768 input channels, we set 512 output channels using a convolutional stride of 1. We consider two parameter configurations for the consistency loss functions: 1) DGAM1 with weight parameters λ1 = 0 and λ2 = 0.2 in (16), and 2) DGAM2 with λ1 = 0.1 and λ2 = 0.4. This setup indicates that DGAM2 jointly optimizes both the $\mathcal{L}_{\mathrm{DVC1}}$ and $\mathcal{L}_{\mathrm{DVC2}}$ loss functions, whereas DGAM1 employs only $\mathcal{L}_{\mathrm{DVC2}}$.

During training, as in [12], the optimizer is set to Adaptive moment estimation (Adam) with the default configuration β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸. The learning rate is initialized at 10⁻⁵ and the weight decay is set to 10⁻⁶, using a batch size of 12 [12]. For each batch, a zero-padding operation is performed on the batch's utterances to make all samples share the same length (the maximum utterance length in the batch). Further, training runs for 60 epochs, with the learning rate halved every 10 epochs for the PSD dataset and every 5 epochs for the HAD dataset.

Afterwards, we use the best segment-level EER results on the development set to select the optimally trained models. Then, in the evaluation phase, we discard the zero-padded parts of the evaluation data and only use the original parts to calculate the EER results. Note that we do not include any data augmentation, voice activity detection, or feature normalization during training.

V. EXPERIMENTAL RESULTS

A. Experimental Comparisons

The Compared Approaches: In the experiments, we aim to compare the proposed and existing approaches evaluated on the PSD, LA (noted as 'PSD2LA'), and HAD datasets, with the EER results shown in Table II.

For a fair comparison against prior works, we reimplement the baselines using W2V2-base instead of their original feature extractors, and re-run training and evaluation on the PSD subsets and the reallocated HAD subsets. Specifically, for the compared approaches using a fully shared feature extracting and processing module, we employ W2V2-base with one gMLP block (noted as 'W2V2-1gMLP') and with five gMLP blocks (noted as 'W2V2-5gMLP') by replacing the original W2V2-XLSR, respectively [12], [46], W2V2-base with B²LSTM (noted as 'W2V2-B²LSTM') [12], [22], W2V2-base with QA [23] by replacing the original SENet [84], and W2V2-base with TDL [20] by replacing the original W2V2-XLSR.
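For reference, the EER reported throughout follows the definition in Section IV-B (sweeping the decision threshold until the two class-wise error rates meet). A minimal sketch of computing it from detection scores, assuming scikit-learn's roc_curve, is given below; it is an illustration of the standard procedure, not the authors' exact scoring script.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER from binary labels (1 = bonafide) and detection scores,
    e.g. the cosine outputs of (14) or (15)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR == FRR
    return 0.5 * (fpr[idx] + fnr[idx])

# Utterance-level EER: one score per utterance.
# Segment-level EER: one score per 20 ms segment, pooled over all utterances.
```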

TABLE II
THE UTTERANCE AND SEGMENT EER (%) RESULTS OF DIFFERENT APPROACHES EMPLOYING THE SEGMENT- AND UTTERANCE-LEVEL BRANCHES EVALUATED ON THE THREE DATASETS, WHERE 'PSD2LA' DENOTES THE CASE OF TRAINING ON THE TRAINING PART OF THE PSD DATASET, WHILE EVALUATED ON THE LA DATASET'S EVALUATION PART, ONLY CONSIDERING UTTERANCE EERS

Additionally, we adopt the UMMAFormer [50] as a shared branch for the utterance-level and segment-level tasks. Since UMMAFormer is originally designed for multi-modal tasks, we remove its visual processing branch to adapt it to our speech-only paradigm. Furthermore, we take several utterance-only models as comparison methods, including W2V2-base integrated with attentive statistics pooling (noted as 'W2V2-P') [13] and W2V2-base combined with SE-Res1D (noted as 'W2V2-Res1D') [25], by replacing the original W2V2-large. We also incorporate MIL-based methods, including Hybrid MIL (H-MIL) and H-MIL with Local Self-attention (LS-H-MIL), proposed in [16], as additional utterance-level models for comparison. Then, for the cases using non-shared feature processing modules, we add compared approaches considering QA and B²LSTM [12], [22], respectively, in the utterance-level branch, while deploying B²LSTM [12], [22], Temporal Deepfake Location (TDL) [20], QA [23], BA-TFD+ [49], the Multi-Domain ResNet Transformer with Time Domain (MDRT-TD) [21], and Waveform Boundary Detection (WBD) [18], [19], respectively, in the segment-level branch. Further, TDL and WBD are originally designed only for the segment-level task, and hence, we feed the second Temporal-CONVolution (TCONV) layer's and the BLSTM layer's outputs of TDL and WBD [18], [20], respectively, to the segment loss as in (15). For MDRT [21], which originally employs dual speech feature extractors, we reimplement it using W2V2-base exclusively. Similarly, BA-TFD+ [49], initially developed for multi-modal scenarios, is modified by removing its visual processing branch while retaining the audio branch for the speech-only task. Then, we also employ the same training details for all the compared approaches using non-shared (task-specific) feature processing modules as in the proposed approach, while for the case of shared feature processing modules (W2V2-1gMLP, W2V2-5gMLP, W2V2-B²LSTM, and QA), we employ the same parameters as in their corresponding original works [12], [22], [23].

Afterwards, for the proposed approach using the D-GNN in the segment-level branch, we consider the usage of QA and B²LSTM in the utterance-level branch for the cases of DGAM1 and DGAM2, respectively. Note that for the compared W2V2-1gMLP, W2V2-5gMLP, and W2V2-B²LSTM approaches, we retain a single segment length of 20 ms for fair comparison. Similarly, we also set the same 20 ms segment length in TDL [20], with a modified architecture to adapt to this length. Further, since the original QA, UMMAFormer, and BA-TFD+ are designed for predicting whether a segment is the start or end point of a spoofed clip [12], we modify these approaches to enable the models to predict whether a segment is fake or not.

Experimental Analysis: As can be seen from Table II, the proposed DGAM1 and DGAM2 models outperform the other compared MTL approaches for both the utterance- and segment-level tasks in terms of EER on the three evaluation sets. Specifically, on the PSD evaluation set, the proposed DGAM performs better in the case of using the B²LSTM-based utterance-level
branch (1.34% utterance and 13.893% segment EERs). The same holds for the PSD2LA case (1.39% utterance EER), which may be due to the common source of the PSD and LA sets. In contrast, the HAD dataset yields the best EERs when using QA in the utterance-level branch (0.47% utterance and 0.602% segment EERs). In spite of this, the proposed approach performs well in the comparisons (especially for the segment EERs) when employing the same utterance-level branches, which validates the effective inclusion of the D-GNN module in the segment-level branch. As for the evaluation on the LA dataset, we observe a larger best-utterance-EER gap between the proposed and compared approaches (0.32%), compared with the evaluation results on the PSD dataset (0.08%), which implies stable performance of the proposed approach across multiple spoofed-speech detection tasks.

In addition, Table II indicates large EER differences between the PSD and HAD datasets in the evaluation, especially in terms of segment EER. This may be due to the factors of language, spoofed-speech generation algorithms, and substitution strategies. In this regard, we present the information on language, the number of generation algorithms, the Average Number of Spoofed-clip starting points (ANS), and the Average Duration of Spoofed clips (ADS) for the evaluation parts of the two datasets in Table III. Note that the ANS and ADS refer to the average starting-point number and the average duration of spoofed clips, respectively, within each utterance in the evaluation parts. The comparison shows a more complex case for the PSD dataset in evaluation, which possibly influences the difficulty of detection.

TABLE III
COMPARISONS BETWEEN THE EVALUATION PARTS OF THE PSD AND HAD DATASETS, ON THE ASPECTS OF LANGUAGE, THE NUMBERS OF SPOOFED-SPEECH GENERATION ALGORITHMS (NOTED AS '# ALGORITHMS'), AND THE ANS AND ADS INDICATORS
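As an illustration of how the two indicators could be obtained from the 20 ms segment labels (0 = bonafide, 1 = spoofed), under our reading of their definitions above, a small sketch follows; the function name and the pooling over utterances are assumptions for clarity.

```python
import numpy as np

def ans_ads(segment_labels, seg_dur=0.02):
    """Average Number of Spoofed-clip starting points (ANS) and
    Average Duration of Spoofed clips (ADS) over a list of utterances,
    where each entry is a 0/1 array of 20 ms segment labels."""
    starts_per_utt, durations = [], []
    for y in segment_labels:
        y = np.asarray(y, dtype=int)
        # a spoofed clip starts where the label switches 0 -> 1
        starts = np.flatnonzero(np.diff(np.concatenate(([0], y))) == 1)
        ends = np.flatnonzero(np.diff(np.concatenate((y, [0]))) == -1)
        starts_per_utt.append(len(starts))
        durations.extend((ends - starts + 1) * seg_dur)
    ans = float(np.mean(starts_per_utt))
    ads = float(np.mean(durations)) if durations else 0.0
    return ans, ads
```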
Further, we perform significance tests for the EER results using a one-tailed z-test [85]. For the segment EERs, the proposed DGAM outperforms the compared approaches at a significance level of 0.001 evaluated on both the PSD and HAD datasets, while for the utterance EERs, significance is found on the LA (0.01 level) and HAD (0.05 level) datasets. Then, by comparing the best EER results between the approaches with shared and non-shared feature processing modules, we can establish significance for both utterance and segment EERs at the 0.05 and 0.001 levels, respectively. These tests show the effectiveness of the inclusion of task-specific feature processing modules and the D-GNN in the proposed approach.

To take a closer look at segment-level performance, we present Fig. 3 to show the segment-level cosine-similarity matrices of W2V2-1gMLP [12], W2V2-5gMLP [12], UMMAFormer [50],
W2V2-B²LSTM [12], [22], B²LSTM (QA) [23], B²LSTM (TDL) [20], B²LSTM (WBD) [18], [19], and DGAM1 (with the B²LSTM utterance-level branch), respectively, for three evaluation utterances from the PSD dataset.⁴ Note that the cosine similarity can be calculated by performing the cosine operation $\cos(r_i^S, r_j^S)$ (i, j = 1, 2, ..., N) on the segment-level output representations $r_i^S$ and $r_j^S$ within each utterance, as in (11) and (15). It is observed from the figure that the proposed approach tends to accurately recognize boundary segments between bonafide and spoofed speech, and further, the proposed approach yields stable performance in detecting long-duration spoofed clips.

⁴ Original audio files: 'CON_E_0049949.wav', 'CON_E_0010632.wav', and 'CON_E_0005909.wav'.

Fig. 3. The cosine-similarity matrices of three evaluation utterances from the PSD dataset, when using the approaches of W2V2-1gMLP [12], W2V2-5gMLP [12], UMMAFormer [50], W2V2-B²LSTM [12], [22], B²LSTM (QA) [23], B²LSTM (TDL) [20], B²LSTM (WBD) [18], [19], DGAM1 (with the B²LSTM utterance-level branch), and the ground-truth results, respectively, where larger similarities are represented in red, while smaller similarities are marked in blue.

We further enhance the evaluation with segment-level metrics: the F1-score [86] (noted as 'F1') for segment-wise classification accuracy and mean Average Precision (mAP) [87] computed over temporal Intersection over Union (tIoU) thresholds. To compute mAP scores, we first merge consecutive labeled segments into fake-speech intervals and then compute mAP values using tIoU thresholds of [0.5, 0.7, 0.8, 0.9]. As detailed in Table IV, our B²LSTM-based utterance-level models achieve the best performance on the PSD dataset (F1: 88.71%, mAP: 59.38%) and better results on HAD (F1: 99.67%, mAP: 96.87%) compared with the compared approaches. However, while the models exhibit satisfactory F1 scores on the PSD dataset, their performance on the mAP metric remains limited, indicating challenges in accurately predicting the start and end points of fake intervals. The low mAP results can be attributed to the nature of the segment-level prediction task in our study, which focuses on classifying individual segments as fake or bonafide. A misprediction at the segment level can lead to significant interval errors, even though such errors may have minimal impact on the overall EER and F1 results.

TABLE IV
THE SEGMENT-LEVEL F1 (%) AND MAP (%) RESULTS OF DIFFERENT APPROACHES, ALL USING B²LSTM AS THE UTTERANCE-LEVEL BRANCH, EVALUATED ON THE TWO DATASETS
results. a more critical impact on the performance, which corresponds
For the purpose of comparing the complexity of the pro- to the consistency between the GNN views and the 1D-CNN
posed and compared approaches, we present Table V containing structure, compared with the LDVC1 term corresponding to the
the numbers of models’ parameters, for the shared (1gMLP, inter-GNN-view consistency.
5gMLP, and UMMAFormer) and non-shared (B2 LSTM, QA, Following the research on the DVC losses’ weights, we further
MDRT-TD, BA-TFD+, WBD, TDL, and D-GNN) segment- analyze the ablation performance with or without these loss
level branches’ network modules in these approaches, respec- terms. In this regard, we show both the utterance and segment
tively. The comparison indicates that, the proposed D-GNN EER results for the proposed DGAM approach (with B2 LSTM
module corresponds to a small number of parameters, yet gives in the utterance-level branch) evaluated on the three datasets,
the best EER performance as shown in Table II. In addition, respectively in Table VI considering multiple setups of the loss
in spite of the frequently appeared competitive EERs for the terms. In detail, we present two groups of experiments in the
W2V2-5gMLP in Table II, the proposed approach can achieve table, where the first group (noted as ‘Group I’) focuses on the
better EERs through a more lightweight network structure, influence of the DVC losses without the 1D-CNN structure (see
compared with the W2V2-5gMLP. (11)), which leads to a lack of the LDVC2 , while the other group

Authorized licensed use limited to: VP's Kamalnayan Bajaj Institute of Eng. and Tech. Downloaded on July 17,2025 at 10:59:22 UTC from IEEE Xplore. Restrictions apply.
GE et al.: MULTI-TASK PSSD USING A DUAL-VIEW GRAPH NEURAL NETWORK ASSISTED SEGMENT-LEVEL MODULE 2657

TABLE VI
THE UTTERANCE AND SEGMENT (NOTED AS 'UTT.' AND 'SEG.', RESPECTIVELY) EER RESULTS (%) FOR THE PROPOSED DGAM (WITH B2 LSTM IN THE UTTERANCE-LEVEL BRANCH) WITH OR WITHOUT THE DVC LOSS TERMS $L_{\mathrm{DVC1}}$ AND $L_{\mathrm{DVC2}}$, EVALUATED ON THE THREE DATASETS, WHERE 'GROUP I' REFERS TO AN ABSENCE OF THE 1D-CNN RESIDUAL SHORTCUT, WHILE 'GROUP II' CONTAINS THE SHORTCUT AS IN THE ORIGINAL DGAM

TABLE VII
THE UTTERANCE AND SEGMENT (NOTED AS 'UTT.' AND 'SEG.', RESPECTIVELY) EER RESULTS (%) FOR THE ABLATION EXPERIMENT OF DGAM$_2$ (WITH B2 LSTM IN THE UTTERANCE-LEVEL BRANCH), EVALUATED ON THE THREE DATASETS

TABLE VIII
THE UTTERANCE AND SEGMENT (NOTED AS 'UTT.' AND 'SEG.', RESPECTIVELY) EER RESULTS (%) EVALUATED ON THE THREE DATASETS FOR THE PROPOSED DGAM (WITH B2 LSTM IN THE UTTERANCE-LEVEL BRANCH), WHEN CONSIDERING DIFFERENT NUMBERS OF LAYERS $L$ IN THE TWO GNNS WITHIN THE D-GNN MODULE

First, as can be drawn from the comparisons in Group I, without the 1D-CNN structure, the DVC loss between the two GNN views (i.e., $L_{\mathrm{DVC1}}$) results in better performance compared with the non-consistency-loss case, especially for the segment EERs evaluated on the PSD and HAD datasets (significant at the 0.001 level using the one-tailed z-test). Then, the comparisons for the first two lines in Group II (without the DVC losses and only with the loss $L_{\mathrm{DVC1}}$, respectively) also indicate the improvement obtained by including the consistency loss between the two GNN views. Furthermore, the performance improvements achieved by the $L_{\mathrm{DVC1}}$ loss across the two graph views in Group I and Group II of Table VI highlight the effectiveness of $L_{\mathrm{DVC1}}$ in implementing the consistency. Afterwards, when including the loss $L_{\mathrm{DVC2}}$, which considers the consistency between the D-GNN and the 1D-CNN structure, an improvement can be seen compared with the other setups in Table VI, although the 1D-CNN structure itself does not always lead to an improvement. This implies that the consistency loss can make the most of the heterogeneity-wise information from the graph topological and node-attribute representations, and also validates the effectiveness of $L_{\mathrm{DVC2}}$ in implementing the consistency.

To investigate the impact of the two GNN views, we perform ablation studies on DGAM$_2$. The experiments employ the B2 LSTM as the utterance-level branch, as shown in Table VII. In these experiments, we first individually eliminate each GNN view to demonstrate the performance of the remaining view, which results in the removal of the $L_{\mathrm{DVC1}}$ loss function. Next, we remove both GNN views, reducing the model to a standalone 1D-CNN structure without the $L_{\mathrm{DVC2}}$ loss function, which serves as the baseline model. Table VII shows that each GNN view independently enhances performance over the 1D-CNN baseline, confirming their individual effectiveness. Further, the full dual-view architecture achieves additional performance gains, possibly validating the synergistic effects between the complementary GNN views.

In order to further investigate the GNN structures within the D-GNN module for the proposed DGAM approach (with B2 LSTM in the utterance-level branch), we perform experiments with different numbers of layers ($L$) in the two GNNs, since we only set $L = 1$ in the proposed approach; the utterance and segment EERs evaluated on the three datasets are shown in Table VIII. It can be learned from the table that increasing the number of layers does not necessarily improve the utterance EERs; yet, more layers in the GNNs may indicate better performance on the segment-level tasks, according to the case of $L = 3$ in Table VIII for the PSD and HAD datasets. As shown in Table VIII, modifying the layer parameter $L$ in the segment-level branch impacts both segment- and utterance-level performance. This occurs because both branches share the same representation extractor. Adjusting $L$ alters how the extractor is optimized during training, which changes the learned representations. The altered representations in turn interact with the utterance branch's optimization process during training, creating its final performance variations.
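To make the ablation protocol above easier to follow, the sketch below shows one way a segment-level branch with two GNN views, a configurable number of GNN layers $L$, and an optional 1D-CNN residual shortcut could be wired so that individual components can be switched off. It is a schematic illustration only, not the authors' implementation: the layer types, dimensions, dense-adjacency message passing, and fusion by averaging are all assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphView(nn.Module):
    """A stack of L simple dense graph-convolution layers: H <- ReLU(A @ H @ W)."""

    def __init__(self, dim: int, num_layers: int = 1):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, dim) segment embeddings; adj: (batch, T, T) row-normalised adjacency
        for lin in self.layers:
            h = F.relu(torch.bmm(adj, lin(h)))
        return h


class DualViewSegmentBranch(nn.Module):
    """Schematic segment-level branch: two GNN views plus an optional 1D-CNN shortcut.

    The boolean flags mirror the ablation settings discussed in the text:
    dropping one view removes the cross-view consistency term, and dropping
    both views leaves a standalone 1D-CNN baseline.
    """

    def __init__(self, dim: int, num_layers: int = 1,
                 use_view1: bool = True, use_view2: bool = True, use_cnn: bool = True):
        super().__init__()
        assert use_view1 or use_view2 or use_cnn, "at least one component must be active"
        self.use_view1, self.use_view2, self.use_cnn = use_view1, use_view2, use_cnn
        self.view1 = GraphView(dim, num_layers)  # e.g., a topology-driven view (assumed)
        self.view2 = GraphView(dim, num_layers)  # e.g., an attribute-driven view (assumed)
        self.cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(dim, 2)      # bona fide vs. spoofed, per segment

    def forward(self, x, adj_topo, adj_attr):
        branches = []
        if self.use_view1:
            branches.append(self.view1(x, adj_topo))
        if self.use_view2:
            branches.append(self.view2(x, adj_attr))
        if self.use_cnn:
            # residual shortcut over the raw segment features
            branches.append(x + self.cnn(x.transpose(1, 2)).transpose(1, 2))
        fused = torch.stack(branches, dim=0).mean(dim=0)  # naive fusion for illustration
        return self.classifier(fused), branches           # per-segment logits + per-view features
```

In this sketch, setting use_view1=False or use_view2=False corresponds in spirit to the single-view rows of Table VII, disabling both views leaves the 1D-CNN baseline, and num_layers plays the role of $L$ in Table VIII.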
TABLE IX
THE UTTERANCE AND SEGMENT (NOTED AS 'UTT.' AND 'SEG.', RESPECTIVELY) EER RESULTS (%) EVALUATED ON THE THREE DATASETS FOR THE PROPOSED DGAM (WITH B2 LSTM IN THE UTTERANCE-LEVEL BRANCH), WHEN EMPLOYING DIFFERENT LOSS FUNCTIONS CONVENTIONALLY USED IN GNNS

Within the scope of GNNs, we also make a comparison between the proposed DGAM (with B2 LSTM in the utterance-level branch) and variants that include the Reconstruction Loss (RL) [26] and the loss function of the Dual Correlation Reduction Network (DCRN) [27], respectively, replacing the proposed DVC losses. The utterance and segment EERs evaluated on the three datasets are presented in Table IX. We employ the same setups as in the proposed DGAM, while only replacing the DVC terms with the compared RL and DCRN losses, respectively. Note that the RL utilizes the outputs $H_S$ (see (11)) to reconstruct the ground-truth node adjacency matrix $A_g$ (see Section III-C), while the DCRN takes the outputs of the two GNNs (see Section III-C) as its inputs to approximate an identity matrix. According to the results shown in the table, the proposed approach achieves better performance on the three datasets, which may benefit from the inclusion of the dual-view consistency. In detail, the RL loss solely focuses on the similarity between the adjacency matrix constructed from the final outputs and the ground-truth adjacency matrix, which ignores the consistency constraint between the different views. Furthermore, the DCRN focuses on maximizing the similarity between representations of the same node across different views, without considering preserving the similarity among different nodes belonging to the same class across views. Different from these two loss functions, our consistency loss focuses on the similarity of nodes from different views within the same class, which may address these problems of the two loss functions above.
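To make the distinction between these criteria concrete, the sketch below contrasts a DCRN-style same-node alignment term with a class-conditioned cross-view similarity term of the kind argued for above. It is only an illustration under simple cosine-similarity assumptions; the actual DVC losses are those defined in Section III and are not reproduced here.

```python
import torch
import torch.nn.functional as F


def same_node_alignment(h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
    """DCRN-style idea: pull each node's two view embeddings together,
    i.e., push the cross-view similarity matrix towards the identity."""
    s = F.normalize(h1, dim=-1) @ F.normalize(h2, dim=-1).T   # (N, N) cosine similarities
    return ((s - torch.eye(s.size(0))) ** 2).mean()


def class_conditioned_consistency(h1: torch.Tensor, h2: torch.Tensor,
                                  labels: torch.Tensor) -> torch.Tensor:
    """Illustrative class-conditioned cross-view term: encourage high similarity
    between view-1 and view-2 embeddings of any two nodes sharing a label
    (bona fide / spoofed), not only between copies of the same node."""
    s = F.normalize(h1, dim=-1) @ F.normalize(h2, dim=-1).T   # (N, N)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)   # (N, N) boolean mask
    return (1.0 - s)[same_class].mean()


# toy example: 4 segment nodes, two views, binary segment labels
h_view1, h_view2 = torch.randn(4, 16), torch.randn(4, 16)
seg_labels = torch.tensor([0, 0, 1, 1])
print(same_node_alignment(h_view1, h_view2).item(),
      class_conditioned_consistency(h_view1, h_view2, seg_labels).item())
```

The first term only cares about the diagonal of the cross-view similarity matrix, whereas the second also rewards similarity between different same-class nodes across views, which is the property the paragraph above attributes to the proposed consistency loss.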
VI. CONCLUSION

In this paper, we proposed the Dual-View Graph neural network Assisted Segment-Level Module (DGAM) approach towards multi-task Partially Spoofed Speech Detection (PSSD). The proposed approach incorporated a shared speech representation extracting module and task-specific feature processing modules, consisting of an utterance-level branch for identifying partially-spoofed utterances and a segment-level branch for detecting spoofed segments. Within the segment-level branch, we designed a Dual-view Graph Neural Network (D-GNN) with a dual-view consistency loss for the segment-level tasks. The experimental results indicated that the proposed DGAM could achieve better performance on both utterance- and segment-level tasks, compared with the related state-of-the-art approaches as well as with the proposed approach under different setups.

Following the work in this paper, our future work will focus on the following aspects. First, we expect to investigate cross-domain cases for multi-task PSSD, which typically involve domain adaptation to diverse datasets and scenarios. Then, it will be challenging to design well-performing knowledge distillation techniques for PSSD, aiming to achieve applicable lightweight architectures through distilling the knowledge learned from complex models. In addition, we are also interested in real-world cases that provide only extremely limited partially-spoofed training data for the detection tasks.
APPENDIX A
SELF-SIMILARITY FOR KERNEL FUNCTIONS

Theorem: The heat kernel satisfies the self-similarity principle, while the linear, polynomial, and sigmoid kernels violate the principle.

Proof: The self-similarity principle requires that, for any node $i$, its self-similarity $K(h_i, h_i)$ should not be smaller than its similarity $K(h_i, h_j)$ with another node $j$ for any $j \neq i$, where $h_i$ and $h_j$ are the two nodes' non-normalized features. We decompose $h_j = h_i^{o} + h_i^{p}$, where $h_i^{o}$ is orthogonal to $h_i$ and $h_i^{p}$ is parallel to $h_i$.

Heat Kernel: The heat kernel $K(h_i, h_j) = e^{-\kappa^{-1}\|h_i - h_j\|^2}$ inherently satisfies the principle, since $K(h_i, h_i) = 1$ while $K(h_i, h_j) \leq 1$ for $i \neq j$ (as $\|h_i - h_j\|^2 \geq 0$).

To demonstrate the violations in the linear, polynomial, and sigmoid kernels, we set $h_i^{p} = \lambda h_i$ with $\lambda > 1$.

Linear Kernel: The linear kernel $K(h_i, h_j) = h_i^{T} h_j = \lambda h_i^{T} h_i$ violates the principle, since $\lambda h_i^{T} h_i > h_i^{T} h_i$ (for $h_i \neq 0$).

Polynomial Kernel: The polynomial kernel $K(h_i, h_j) = (1 + h_i^{T} h_j)^d = (1 + \lambda h_i^{T} h_i)^d$ ($d \in \mathbb{N}^{+}$) also violates the principle, as $\big((1 + \lambda h_i^{T} h_i)/(1 + h_i^{T} h_i)\big)^d > 1$.

Sigmoid Kernel: The sigmoid kernel $K(h_i, h_j) = \tanh(h_i^{T} h_j + c) = \tanh(\lambda h_i^{T} h_i + c)$ violates the self-similarity principle under the same condition, as $\tanh(\cdot)$ is a monotonically increasing function. $\blacksquare$
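As a quick numeric sanity check of the argument above (not part of the proof), the following sketch evaluates the four kernels for a node $h_i$ and a purely parallel neighbour $h_j = \lambda h_i$ with $\lambda > 1$; the heat kernel preserves $K(h_i, h_i) \geq K(h_i, h_j)$, whereas the other three do not. The bandwidth $\kappa$, offset $c$, and degree $d$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
h_i = rng.normal(size=8)
lam = 1.5
h_j = lam * h_i                      # parallel component only, lambda > 1, as in the proof

kappa, c, d = 1.0, 0.0, 2            # arbitrary illustrative hyper-parameters

def heat(a, b):        return np.exp(-np.sum((a - b) ** 2) / kappa)
def linear(a, b):      return a @ b
def polynomial(a, b):  return (1.0 + a @ b) ** d
def sigmoid(a, b):     return np.tanh(a @ b + c)

for name, k in [("heat", heat), ("linear", linear),
                ("polynomial", polynomial), ("sigmoid", sigmoid)]:
    self_sim, cross_sim = k(h_i, h_i), k(h_i, h_j)
    print(f"{name:10s} K(hi,hi)={self_sim:12.6f}  K(hi,hj)={cross_sim:12.6f}  "
          f"self-similarity holds: {self_sim >= cross_sim}")
```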
REFERENCES

[1] J. H. Hansen and T. Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74–99, Nov. 2015.
[2] W. Lin and M.-W. Mak, "Mixture representation learning for deep speaker embedding," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 968–978, 2022.
[3] Z. Ge, X. Xu, H. Guo, T. Wang, and Z. Yang, "Speaker recognition using isomorphic graph attention network based pooling on self-supervised representation," Appl. Acoust., vol. 219, 2024, Art. no. 109929.
[4] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 132–157, 2021.
[5] A. Triantafyllopoulos et al., "An overview of affective speech synthesis and conversion in the deep learning era," Proc. IEEE, vol. 111, no. 10, pp. 1355–1381, Oct. 2023.
[6] X. Wang et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Comput. Speech Lang., vol. 64, 2020, Art. no. 101114.
[7] J. Yi et al., "ADD 2022: The first audio deep synthesis detection challenge," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 9216–9220.
[8] X. Liu et al., "ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2507–2522, 2023.
[9] Z. Ge, X. Xu, H. Guo, T. Wang, Z. Yang, and B. Schuller, "DGPN: A dual graph prototypical network for few-shot speech spoofing algorithm recognition," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Kos Island, Greece, 2024, pp. 1125–1129.
[10] L. Zhang, X. Wang, E. Cooper, J. Yamagishi, J. Patino, and N. Evans, "An initial investigation for detecting partially spoofed audio," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021, pp. 4264–4268.
[11] J. Yi et al., "Half-truth: A partially fake audio detection dataset," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021, pp. 1654–1658.
[12] L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, "The PartialSpoof database and countermeasures for the detection of short fake speech segments embedded in an utterance," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 813–825, 2023.
[13] Z. Lv, S. Zhang, K. Tang, and P. Hu, "Fake audio detection based on unsupervised pretraining models," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 9231–9235.
[14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 12449–12460.
[15] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, "Data2vec: A general framework for self-supervised learning in speech, vision and language," in Proc. Int. Conf. Mach. Learn., Baltimore, MD, USA, 2022, pp. 1298–1312.
[16] Y. Zhu, Y. Chen, Z. Zhao, X. Liu, and J. Guo, "Local self-attention-based hybrid multiple instance learning for partial spoof speech detection," ACM Trans. Intell. Syst. Technol., vol. 14, no. 5, 2023, Art. no. 93.
[17] X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, "Revisiting multiple instance neural networks," Pattern Recognit., vol. 74, pp. 15–24, 2018.
[18] Z. Cai, W. Wang, and M. Li, "Waveform boundary detection for partially spoofed audio," in Proc. Int. Conf. Acoust., Speech Signal Process., Rhodes Island, Greece, 2023, pp. 1–5.
[19] Z. Cai and M. Li, "Integrating frame-level boundary detection and deepfake detection for locating manipulated regions in partially spoofed audio forgery attacks," Comput. Speech Lang., vol. 85, 2024, Art. no. 101597.
[20] Y. Xie, H. Cheng, Y. Wang, and L. Ye, "An efficient temporary deepfake location approach based embeddings for partially spoofed audio detection," in Proc. Int. Conf. Acoust., Speech Signal Process., Seoul, Korea, 2024, pp. 966–970.
[21] A. K. S. Yadav, K. Bhagtani, S. Baireddy, P. Bestagini, S. Tubaro, and E. J. Delp, "MDRT: Multi-domain synthetic speech localization," in Proc. Int. Conf. Acoust., Speech Signal Process., 2024, pp. 11171–11175.
[22] L. Zhang, X. Wang, E. Cooper, and J. Yamagishi, "Multi-task learning in utterance-level and segmental-level spoof detection," in Proc. Autom. Speaker Verification Spoofing Countermeasures Challenge, 2021, pp. 9–15.
[23] H. Wu et al., "Partially fake audio detection by self-attention-based fake span discovery," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 9236–9240.
[24] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, "Cross-stitch networks for multi-task learning," in Proc. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, 2016, pp. 3994–4003.
[25] T. Liu, L. Zhang, R. K. Das, Y. Ma, R. Tao, and H. Li, "How do neural spoofing countermeasures detect partially spoofed audio?," in Proc. Interspeech, 2024, pp. 1105–1109.
[26] C. Wang, S. Pan, C. P. Yu, R. Hu, G. Long, and C. Zhang, "Deep neighbor-aware embedding for node clustering in attributed graphs," Pattern Recognit., vol. 122, 2022, Art. no. 108230.
[27] Y. Liu et al., "Deep graph clustering via dual correlation reduction," in Proc. Assoc. Advance. Artif. Intell., Philadelphia, PA, USA, 2022, pp. 7603–7611.
[28] I. Saratxaga, J. Sanchez, Z. Wu, I. Hernaez, and E. Navas, "Synthetic speech detection using phase information," Speech Commun., vol. 81, pp. 30–41, 2016.
[29] J. Kim and S. M. Ban, "Phase-aware spoof speech detection based on Res2Net with phase network," in Proc. Int. Conf. Acoust., Speech Signal Process., Rhodes Island, Greece, 2023, pp. 1–5.
[30] T. B. Patel and H. A. Patil, "Cochlear filter and instantaneous frequency based features for spoofed speech detection," IEEE J. Sel. Topics Signal Process., vol. 11, no. 4, pp. 618–631, Jun. 2017.
[31] M. R. Kamble and H. A. Patil, "Novel energy separation based instantaneous frequency features for spoof speech detection," in Proc. Eur. Signal Process. Conf., Kos Island, Greece, 2017, pp. 106–110.
[32] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Dresden, Germany, 2015, pp. 2087–2091.
[33] M. Todisco, H. Delgado, and N. W. D. Evans, "Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification," Comput. Speech Lang., vol. 45, pp. 516–535, 2017.
[34] J. Sanchez, I. Saratxaga, I. Hernáez, E. Navas, D. Erro, and T. Raitio, "Toward a universal synthetic speech spoofing detection using phase information," IEEE Trans. Inf. Forensics Secur., vol. 10, no. 4, pp. 810–820, Apr. 2015.
[35] A. Khan, K. M. Malik, J. Ryan, and M. Saravanan, "Battling voice spoofing: A review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures," Artif. Intell. Rev., vol. 56, pp. 513–566, 2023.
[36] S. Novoselov, A. Kozlov, G. Lavrentyeva, K. Simonchik, and V. Shchemelinin, "STC anti-spoofing systems for the ASVspoof 2015 challenge," in Proc. Int. Conf. Acoust., Speech Signal Process., Shanghai, China, 2016, pp. 5475–5479.
[37] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-to-end anti-spoofing with RawNet2," in Proc. Int. Conf. Acoust., Speech Signal Process., Toronto, Canada, 2021, pp. 6369–6373.
[38] J. Jung, S. Kim, H. Shim, J. Kim, and H. Yu, "Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Shanghai, China, 2020, pp. 1496–1500.
[39] S. Chen et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J. Sel. Topics Signal Process., vol. 16, no. 6, pp. 1505–1518, Oct. 2022.
[40] A. Babu et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," in Proc. Interspeech, 2021, pp. 2278–2282.
[41] H. Tak, J.-W. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, "End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection," in Proc. Autom. Speaker Verification Spoofing Countermeasures Challenge, 2021, pp. 1–8.
[42] J. Gui et al., "A survey on self-supervised learning: Algorithms, applications, and future trends," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 9052–9071, Dec. 2024.
[43] A. Khan and K. M. Malik, "SpoTNet: A spoofing-aware transformer network for effective synthetic speech detection," in Proc. Int. Workshop Multimedia AI Against Disinformation, Thessaloniki, Greece, 2023, pp. 10–18.
[44] G. Lin, W. Luo, D. Luo, and J. Huang, "One-class neural network with directed statistics pooling for spoofing speech detection," IEEE Trans. Inf. Forensics Secur., vol. 19, pp. 2581–2593, 2024.
[45] Y. Zhang, F. Jiang, and Z. Duan, "One-class learning towards synthetic voice spoofing detection," IEEE Signal Process. Lett., vol. 28, pp. 937–941, 2021.
[46] H. Liu, Z. Dai, D. R. So, and Q. V. Le, "Pay attention to MLPs," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 9204–9215.
[47] A. Khan, K. M. Malik, and S. Nawaz, "Frame-to-utterance convergence: A spectra-temporal approach for unified spoofing detection," in Proc. Int. Conf. Acoust., Speech Signal Process., Seoul, Republic of Korea, 2024, pp. 10761–10765.
[48] J. Zhong, B. Li, and J. Yi, "Enhancing partially spoofed audio localization with boundary-aware attention mechanism," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2024, pp. 4838–4842.
[49] Z. Cai, S. Ghosh, A. Dhall, T. Gedeon, K. Stefanov, and M. Hayat, "Glitch in the matrix: A large scale benchmark for content driven audio-visual forgery detection and localization," Comput. Vis. Image Understanding, vol. 236, 2023, Art. no. 103818.
[50] R. Zhang, H. Wang, M. Du, H. Liu, Y. Zhou, and Q. Zeng, "UMMAFormer: A universal multimodal-adaptive transformer framework for temporal forgery localization," in Proc. ACM Int. Conf. Multimedia, Ottawa, ON, Canada, 2023, pp. 8749–8759.
[51] M. Gori, G. Monfardini, and F. Scarselli, "A new model for learning in graph domains," in Proc. IEEE Int. Joint Conf. Neural Netw., 2005, vol. 2, pp. 729–734.
[52] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009.
[53] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proc. Int. Conf. Mach. Learn., Sydney, NSW, Australia, 2017, vol. 70, pp. 1263–1272.
[54] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Representations, 2017.
[55] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proc. Int. Conf. Learn. Representations, 2018.
[56] K. K. Thekumparampil, C. Wang, S. Oh, and L. Li, "Attention-based graph neural network for semi-supervised learning," CoRR, 2018, arXiv:1803.03735.
[57] A. Pandey and D. Wang, "Self-attending RNN for speech enhancement to improve cross-corpus generalization," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 1374–1385, 2022.
[58] Y. Wang et al., "PredRNN: A recurrent neural network for spatiotemporal predictive learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2208–2225, Feb. 2023.
[59] P. Jiang, X. Xu, H. Tao, L. Zhao, and C. Zou, "Convolutional-recurrent neural networks with multiple attention mechanisms for speech emotion recognition," IEEE Trans. Cogn. Devel. Syst., vol. 14, no. 4, pp. 1564–1573, Dec. 2022.
[60] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, and M. Brudno, "Speaker attribution with voice profiles by graph-based semi-supervised learning," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Shanghai, China, 2020, pp. 289–293.
[61] V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson, "Benchmarking graph neural networks," J. Mach. Learn. Res., vol. 24, no. 43, pp. 1–48, 2023.
[62] J.-W. Jung, H.-S. Heo, H.-J. Yu, and J. S. Chung, "Graph attention networks for speaker verification," in Proc. Int. Conf. Acoust., Speech Signal Process., Toronto, Canada, 2021, pp. 6149–6153.
[63] H.-J. Shim, J. Heo, J.-H. Park, G.-H. Lee, and H.-J. Yu, "Graph attentive feature aggregation for text-independent speaker verification," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 7972–7976.
[64] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, and M. Brudno, "Speaker diarization with session-level speaker embedding refinement using graph neural networks," in Proc. Int. Conf. Acoust., Speech Signal Process., Barcelona, Spain, 2020, pp. 7109–7113.
[65] A. Shirian and T. Guha, "Compact graph architecture for speech emotion recognition," in Proc. Int. Conf. Acoust., Speech Signal Process., Toronto, ON, Canada, 2021, pp. 6284–6288.
[66] A. Shirian, S. Tripathi, and T. Guha, "Dynamic emotion modeling with learnable graphs and graph inception network," IEEE Trans. Multimedia, vol. 24, pp. 780–790, 2022.
[67] Y. Kwon, H.-S. Heo, J.-W. Jung, Y. J. Kim, B.-J. Lee, and J. S. Chung, "Multi-scale speaker embedding-based graph attention networks for speaker diarisation," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 8367–8371.
[68] Y. Wei, H. Guo, Z. Ge, and Z. Yang, "Graph attention-based deep embedded clustering for speaker diarization," Speech Commun., vol. 155, 2023, Art. no. 102991.
[69] H. Tak, J. Jung, J. Patino, M. Todisco, and N. W. D. Evans, "Graph attention networks for anti-spoofing," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021, pp. 2356–2360.
[70] J. Jung et al., "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 6367–6371.
[71] F. Chen, S. Deng, T. Zheng, Y. He, and J. Han, "Graph-based spectro-temporal dependency modeling for anti-spoofing," in Proc. Int. Conf. Acoust., Speech Signal Process., Rhodes Island, Greece, 2023, pp. 1–5.
[72] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," 2023, arXiv:1606.08415.
[73] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," CoRR, 2013, arXiv:1304.5634.
[74] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 585–591.
[75] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, "Structural deep clustering network," in Proc. Web Conf., 2020, pp. 1400–1410.
[76] X. He, B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin, "Parallelly adaptive graph convolutional clustering model," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 4, pp. 4451–4464, Apr. 2024.
[77] Z. Peng, H. Liu, Y. Jia, and J. Hou, "Attention-driven graph clustering network," in Proc. ACM Multimedia Conf., 2021, pp. 935–943.
[78] Z. Peng, H. Liu, Y. Jia, and J. Hou, "Deep attention-guided graph clustering with dual self-supervision," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 7, pp. 3296–3307, Jul. 2023.
[79] Y. Liu, S. Zhao, X. Wang, L. Geng, Z. Xiao, and J. C.-W. Lin, "Self-consistent graph neural networks for semi-supervised node classification," IEEE Trans. Big Data, vol. 9, no. 4, pp. 1186–1197, Aug. 2023.
[80] X. Zhang et al., "P2SGrad: Refined gradients for optimizing deep face models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9898–9906.
[81] X. Wang and J. Yamagishi, "A comparative study on recent neural spoofing countermeasures for synthetic speech detection," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021, pp. 4259–4263.
[82] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, "AISHELL-3: A multi-speaker Mandarin TTS corpus," in Proc. Interspeech, 2021, pp. 2756–2760.
[83] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 5891–5895.
[84] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, "ASSERT: Anti-spoofing with squeeze-excitation and residual networks," in Proc. Interspeech, 2019, pp. 1013–1017.
[85] X. Xu et al., "Rethinking auditory affective descriptors through zero-shot emotion recognition in speech," IEEE Trans. Computat. Social Syst., vol. 9, no. 5, pp. 1530–1541, Oct. 2022.
[86] V. K. Sharma, R. Garg, and Q. Caudron, "A systematic literature review on deepfake detection techniques," Multimedia Tools Appl., pp. 1–43, 2024.
[87] E. Vahdani and Y. Tian, "Deep learning-based action detection in untrimmed videos: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4302–4320, Apr. 2023.