Multi-Task Partially Spoofed Speech Detection Using A Dual-View Graph Neural Network Assisted Segment-Level Module
Question-Answering (QA) strategy with self-attention mechanisms.

Despite the previously proposed multi-task PSSD approaches, two issues remain to be addressed. First, most existing multi-task PSSD works employ a shared feature processing module for the two tasks, owing to which task-specific information may be insufficiently captured by the downstream models [24]. Second, the lack of modeling inherent inter-segment relationships from different views may give rise to difficulties in exploiting local differences for segment-level tasks, in view of the local structural differences between fake and bonafide segments [25].

In response, we propose the Dual-view Graph neural network Assisted segment-level Module (DGAM) to address these shortcomings. For the first problem, in addition to a shared representation extracting module, we propose a two-branch structure of task-specific representation processing modules, with parallel utterance- and segment-level branches. Then, regarding the second problem, we employ a Dual-View Graph Neural Network (D-GNN) within the task-specific representation processing modules, in which we treat speech segments as nodes residing on a graph, and hence transform the issue of depicting local relationships among segments into graph-structure modeling. In parallel with an utterance-level branch trained with an utterance loss, the D-GNN further yields Dual-View Consistency (DVC) losses and a segment loss for the PSSD tasks.

The proposed approach contains a speech representation extracting module and task-specific feature processing modules with parallel utterance- and segment-level branches. Through pre-trained self-supervised models, the speech representation extracting module acquires frame-level representations, which are fed to the two branches. Within the utterance-level branch, we process the representations using an utterance-level module with pooling, leading to the utterance loss. For the segment-level branch, the D-GNN employs a parallel structure including multi-view graph neural networks on segments and a One-Dimensional Convolutional Neural Network (1D-CNN) structure, through which the segment and DVC losses are obtained.

To clarify novelty, we compare the proposed approach with highly related existing works. Compared with models solely focusing on a single task [18], [19], [20], our approach simultaneously addresses both utterance- and segment-level tasks. Further, different from existing MTL models relying on fully shared feature extracting and processing modules [12], [22], [23], we integrate separate feature processing modules tailored to each task, to learn task-specific information. In addition, [20] considers segments' relationships using an embedding-similarity module, while our approach regards segments as graph nodes, which may yield more discriminative representations for PSSD. In contrast to the alternative reconstruction loss [26] and the Dual Correlation Reduction Network (DCRN) [27], we propose the DVC loss for the D-GNN based segment-level branch.

The main contributions of this paper are as follows.
• We propose a multi-task PSSD approach using a dual-view graph neural network assisted segment-level module.
• Within the proposed approach, we design task-specific feature processing modules to learn task-specific information for PSSD.
• Within the proposed approach, we employ a dual-view graph neural network with dual-view consistency losses, for the purpose of learning effective inter-segment relationships in segment-level processing.

The remainder of this paper is organized as follows. Section II introduces the related works, while the proposed approach is detailed in Section III. Then, Sections IV and V present the experimental setups and results, respectively. Finally, Section VI concludes the paper.

II. RELATED WORKS

A. Spoofed Speech Detection

Existing research on spoofed speech detection can be generally divided into two directions, focusing on features and algorithms, respectively. The first direction investigates feasible handcrafted acoustic features for ADD, including relative phase shift [28], [29], Cochlear Filter Cepstral Coefficients (CFCCs) [30], instantaneous frequency [30], [31], Linear Frequency Cepstral Coefficients (LFCCs) [32], and Constant-Q Cepstral Coefficients (CQCCs) [33]. However, [34], [35] reveal significant performance variance among these handcrafted features, indicating their sensitivity to spoofing types. To further improve an ADD system's performance, these features can be processed with modification, fusion, and multiple adaptation strategies [34], [35], [36]. Nevertheless, such handcrafted features struggle to model vocal complexity across speaker, accent, and environment variations, and exhibit critical vulnerabilities to PS attacks and compressed speech artifacts, which motivates the community to explore data-driven representations [35], [37].

The second direction focuses on algorithmic advancements. Existing works mainly develop end-to-end approaches, which typically include the use of RawNet2 [38] as the backbone for raw audio signals [37]. Similarly, Self-Supervised Learning (SSL) based pre-trained models [15], [39], [40] operating on raw audio signals have been employed as the front end for ADD [41], encompassing a freezing or fine-tuning process for acquiring speech representations [9], [42]. Then, [43] proposes a Spoofing-aware Transformer Network (SpoTNet) by integrating handcrafted features with deep attention modeling for ADD tasks. Further, Lin et al. and Zhang et al. introduce a refined ResNet-based encoder and incorporate a one-class classification loss for detecting spoofed speech [44], [45].

In relation to partially-spoofed cases, the PSSD tasks mainly focus on detecting synthesized segments and identifying the utterances including spoofed segments [10]. Following the front ends, [12] utilizes a gated Multi-Layer Perceptron (gMLP) [46] as the back-end network to address multiple PSSD tasks, while [18], [19] set a One-Dimensional Residual Network (ResNet-1D) module for detecting the boundaries of
Fig. 2. Overview of the proposed DGAM framework, including a representation extracting module, a task-specific module for the utterance-level task, and a
dual-view graph neural network module with the DVC loss functions for the segment-level task, where λ1 and λ2 are weight parameters.
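To make the wiring in Fig. 2 concrete, the following is a minimal PyTorch-style sketch of the framework: a shared front end producing one representation per 20 ms step, a task-specific utterance-level head with pooling, and a task-specific segment-level head. The stand-in front end (a strided convolution in place of a pre-trained SSL encoder), the placeholder heads, and the weighted loss combination are illustrative assumptions; in particular, (16) is not reproduced in this excerpt, so the combination with λ1 and λ2 below is only a plausible reading of the caption, with the λ values shown being those reported later for DGAM2.

```python
# Minimal sketch of the DGAM wiring shown in Fig. 2 (illustrative assumptions,
# not the authors' implementation).
import torch
import torch.nn as nn


class DGAMSketch(nn.Module):
    def __init__(self, feat_dim=768, seg_dim=512, lambda1=0.1, lambda2=0.4):
        super().__init__()
        # Shared speech representation extracting module: stand-in for a
        # pre-trained SSL encoder (e.g., W2V2-base); here a single strided Conv1d.
        self.frontend = nn.Conv1d(1, feat_dim, kernel_size=320, stride=320)
        # Task-specific utterance-level module f1 (placeholder head).
        self.utt_head = nn.Linear(feat_dim, feat_dim)
        # Task-specific segment-level module f2 (placeholder for the D-GNN plus
        # 1D-CNN branch detailed in Section III-B).
        self.seg_head = nn.Linear(feat_dim, seg_dim)
        self.lambda1, self.lambda2 = lambda1, lambda2

    def forward(self, wav):
        # wav: (batch, samples) -> R: (batch, N, F), one step per 20 ms segment
        R = self.frontend(wav.unsqueeze(1)).transpose(1, 2)
        h_u = self.utt_head(R).mean(dim=1)   # utterance embedding via pooling
        H_S = self.seg_head(R)               # per-segment embeddings
        return h_u, H_S

    def total_loss(self, l_utt, l_seg, l_dvc1, l_dvc2):
        # Assumed multi-task combination using the weights lambda1/lambda2
        # mentioned in the Fig. 2 caption.
        return l_utt + l_seg + self.lambda1 * l_dvc1 + self.lambda2 * l_dvc2
```

At a 16 kHz sampling rate, the 320-sample stride of the stand-in front end corresponds to the 20 ms segment resolution used throughout the paper.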
Then, for an arbitrary utterance sample x, the F-dimensional column vector d_{i,j} represents the output of the j-th hidden layer (j = 1, 2, ..., 13) at the i-th time step (i = 1, 2, ..., N) in x, where N indicates the number of time steps, each corresponding to a 20 ms segment in accordance with [12], [20], [23]. Fine-tuning the 13 hidden representation layers, we further follow the implementation in [3] and combine all the 13 hidden representations with their corresponding linear trainable weights η = [η_1, η_2, ..., η_13]^T, leading to the i-th-step output for the utterance x, represented as

    r_i = (\eta^T I_{13})^{-1} [d_{i,1}, d_{i,2}, \ldots, d_{i,13}] \eta \in \mathbb{R}^{F \times 1},    (1)

where I_13 indicates a 13-dimensional column vector with each of its elements equal to 1. The output representation matrix is denoted as R = [r_1, r_2, ..., r_N] ∈ R^{F×N}.

B. Task-Specific Feature Processing Modules

The Utterance-Level Branch: As shown in Fig. 2, for the utterance-level branch, we first set an utterance-level module f_1(·) and a temporal average-pooling operator Pooling(·) to process the low-level representations from the speech representation extracting module, resulting in the utterance-level branch's F-dimensional output

    h_u = \mathrm{Pooling}(f_1(R)),    (2)

with the utterance-level module's output \tilde{R} = f_1(R). Specifically, the utterance-level module f_1(·) admits either of two mutually exclusive configurations: 1) the QA architecture f_1^{(QA)}(·) [23], which adapts the extraction-based QA mechanism through self-attention layers for the PSSD tasks, or 2) the B²LSTM architecture f_1^{(B²LSTM)}(·) [12], [22], which captures temporal changes in PS speech via bidirectional temporal modeling.

The QA structure employs multi-head Scaled Dot-Product Attention (SDPA) to process the low-level representations, and the i′-th head's output from the total of h heads (i′ = 1, 2, ..., h) is represented as

    \mathrm{Atten}_{i'}(R) = \mathrm{softmax}\!\left(\frac{R^T W^Q_{i'} (R^T W^K_{i'})^T}{\sqrt{d_k}}\right) R^T W^V_{i'},    (3)

where the i′-th-head weights W^Q_{i'}, W^K_{i'} ∈ R^{F×d_k} and W^V_{i'} ∈ R^{F×d_v}, with the dimensionalities d_k and d_v, respectively. Hence, we obtain the output

    f_1^{(QA)}(R) = \mathrm{GELU}\!\left(\mathrm{Atten}(R) W^O\right)^T \in \mathbb{R}^{F \times N},    (4)

where the multi-head attention Atten(R) = [Atten_1(R), Atten_2(R), ..., Atten_h(R)], the linear weights W^O ∈ R^{h d_v × F}, and GELU(·) is a Gaussian Error Linear Unit (GELU) [72] activation.

Then, the B²LSTM structure consists of two cascaded BLSTM layers, each denoted as g^{(BLSTM)}(·), enhanced with a residual shortcut [12]. Note that each layer's output considers all the time steps, leading to a size of F × N. Thus, we write the B²LSTM structure's output as

    f_1^{(B^2LSTM)}(R) = g^{(BLSTM)}\!\left(g^{(BLSTM)}(R)\right) + R.    (5)

The Segment-Level Branch: For the segment-level task, we design a novel D-GNN module f_2(·), which contains, in parallel, two GNNs and a 1D-CNN structure to learn feature representations from different views. Instead of simply stacking modules, these different views are constructed in accordance with the complementarity and consistency principles of multi-view learning, ensuring that each view captures unique knowledge while maximizing agreement across distinct views [73]. Then, the representation matrix R ∈ R^{F×N} is fed to the three structures, respectively.
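As a concrete illustration of the utterance-level branch, the following is a minimal PyTorch sketch of the trainable layer-weighted combination in (1), the B²LSTM option in (5), and the temporal average pooling in (2). The batch-first tensor layout and the final linear projection back to F before the residual shortcut (added here so that the shapes in (5) match) are assumptions; the hidden sizes 512 and 256 follow the implementation details given later.

```python
# Sketch of the utterance-level branch: Eq. (1) layer weighting, Eq. (5)
# B^2LSTM, and Eq. (2) pooling (illustrative assumptions noted in comments).
import torch
import torch.nn as nn


class WeightedLayerSum(nn.Module):
    """Combines the 13 SSL hidden layers with trainable weights eta, Eq. (1)."""
    def __init__(self, num_layers=13):
        super().__init__()
        self.eta = nn.Parameter(torch.ones(num_layers))

    def forward(self, hidden_states):
        # hidden_states: (batch, 13, N, F) stacked SSL layer outputs
        w = self.eta / self.eta.sum()                    # (eta^T I_13)^{-1} eta
        return torch.einsum("l,blnf->bnf", w, hidden_states)   # (batch, N, F)


class B2LSTM(nn.Module):
    """Two cascaded BLSTMs with a residual shortcut, Eq. (5)."""
    def __init__(self, feat_dim=768, hidden1=512, hidden2=256):
        super().__init__()
        self.blstm1 = nn.LSTM(feat_dim, hidden1, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(2 * hidden1, hidden2, bidirectional=True, batch_first=True)
        # Assumed projection back to F so the residual addition is shape-consistent.
        self.proj = nn.Linear(2 * hidden2, feat_dim)

    def forward(self, R):
        y, _ = self.blstm1(R)
        y, _ = self.blstm2(y)
        return self.proj(y) + R                          # residual connection


# Usage sketch: R_tilde = B2LSTM()(WeightedLayerSum()(states)); h_u = R_tilde.mean(dim=1)
```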
Let G = (V, E, A) be a graph with its N-node set V and the edge set E = {e_{i,j}}_{i,j∈V} linking the nodes, where e_{i,j} = 1 if a link exists from node i to node j, and e_{i,j} = 0 otherwise. Note that we set G to be a complete graph, leading to e_{i,j} = 1 for all the edges. Further, the N×N weight adjacency matrix A, based on the node representation similarity, sets the weights for the edges, with its i-th-row and j-th-column element represented as a_{i,j} (i, j = 1, 2, ..., N). Then, we denote the l-th-layer weight adjacency matrices for the first and second GNN views as A^{(l)} and \bar{A}^{(l)}, corresponding to the elements a_{i,j}^{(l)} and \bar{a}_{i,j}^{(l)}, respectively. These two GNN views are designed to extract distinct yet complementary node similarity patterns, implementing the complementarity principle of multi-view learning.

For the first GNN view, we implement a GAT with the cosine similarity [56] to perform learning on the N nodes corresponding to the input R = [r_1, r_2, ..., r_N], aiming to capture semantic relationships between different nodes. Specifically, for the i-th node and the l-th layer in the L-layer network (l = 1, 2, ..., L), we first map the obtained frame features to a hidden state

    h_i^{(l)} = W^{(l)} r_i^{(l-1)} + b^{(l)} \in \mathbb{R}^{\tilde{F} \times 1},    (6)

where the trainable projection matrix W^{(l)} \in \mathbb{R}^{\tilde{F} \times \tilde{F}} for l > 1 and W^{(1)} \in \mathbb{R}^{\tilde{F} \times F}, with the offset b^{(l)} \in \mathbb{R}^{\tilde{F} \times 1}. When l = 1, the aggregated node r_i^{(0)} = r_i; otherwise, for l > 1, the node aggregation r_i^{(l)} is written as

    r_i^{(l)} = \sigma\!\left(a_{i,i}^{(l)} h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} a_{i,j}^{(l)} h_j^{(l)}\right),    (7)

where σ(·) is a Rectified Linear Unit (ReLU) activation, and N(i) represents the set of the i-th node's neighbors including itself. For a_{i,j}^{(l)} in the l-th-layer weight adjacency matrix A^{(l)}, we further set

    a_{i,j}^{(l)} = \frac{e^{\beta^{(l)} S_{i,j}^{(l)}}}{\sum_{k \in \mathcal{N}(i)} e^{\beta^{(l)} S_{i,k}^{(l)}}},    (8)

where β^{(l)} is a learnable parameter initialized as 1, and S_{i,j}^{(l)} = cos(h_i^{(l)}, h_j^{(l)}) is the symmetric node similarity score. However, the attention weights may be diluted across excessive edges, causing nodes to aggregate irrelevant information from noisy connections (the edges linking nodes across different classes).
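As a concrete illustration of the first GNN view, the following is a minimal PyTorch sketch of one layer implementing (6)-(8) on the complete segment graph. The dense, batched matrix formulation and the default dimensions are assumptions for illustration.

```python
# One layer of the cosine-attention GNN view, Eqs. (6)-(8), on a complete graph
# over the N segment nodes (dense batched formulation assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosineGATLayer(nn.Module):
    def __init__(self, in_dim=768, hid_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)        # W^(l) and b^(l) in Eq. (6)
        self.beta = nn.Parameter(torch.tensor(1.0))   # learnable beta^(l), init 1

    def forward(self, R):
        # R: (batch, N, in_dim) node features r_i^(l-1)
        H = self.proj(R)                              # hidden states h_i^(l)
        Hn = F.normalize(H, dim=-1)
        S = Hn @ Hn.transpose(1, 2)                   # cosine similarities S_ij^(l)
        A = torch.softmax(self.beta * S, dim=-1)      # row-normalized weights, Eq. (8)
        return torch.relu(A @ H)                      # aggregation, Eq. (7)
```

Because the graph is complete and N(i) includes the node itself, the row-wise softmax over βS reproduces (8), and the matrix product A H carries out the aggregation in (7), including the a_{i,i} h_i term.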
For the second GNN view, similar to the first view, the i-th-node representation also undergoes a linear transformation to obtain the l-th-layer hidden state, and employs a similar node aggregation.

Unlike the cosine similarity in the first view, the second view employs the heat-kernel-based method [74], [75], [76] with a stricter distance to encode Euclidean spatial relationships, considering the self-similarity principle for the heat kernel (see Appendix A). Specifically, the node similarity matrix based on the heat kernel function is \bar{S}_{i,j}^{(l)} = e^{-(\kappa^{(l)})^{-1} \| \bar{h}_i^{(l)} - \bar{h}_j^{(l)} \|^2}, and the weight adjacency matrix is \bar{A}^{(l)} = (D^{(l)})^{-1} \bar{S}^{(l)}, where \kappa^{(l)} > 0 and the degree matrix D^{(l)} has D_{i,i}^{(l)} = \sum_j \bar{S}_{i,j}^{(l)}. Finally, the aggregated representation of the i-th node in the second GNN view is denoted as \bar{r}_i^{(l)} with l = 1, 2, ..., L and i = 1, 2, ..., N. Although this view produces sparser weight adjacency matrices (characterized by predominantly low-weight edges) and suppresses noisy edge weights, the excessive sparsity (particularly in high-dimensional representations) may inadvertently discard some intra-class edges.

Then, these two graph views can be considered to have complementary relationships arising from the trade-off between the intra-class edge preservation in the first view and the noisy-edge suppression capability in the second view. We further combine the two complementary views' outputs as

    r_i^G = g^{(MLP)}\!\left(r_i^{(L)}\right) + g^{(MLP)}\!\left(\bar{r}_i^{(L)}\right),    (9)

using two one-hidden-layer Multi-Layer Perceptrons (MLPs) with shared parameters, represented as g^{(MLP)}(·) with ReLU activation, and noted as 'MLP1' and 'MLP2' in Fig. 2.

While the GNN module efficiently captures topological information within the speech segments, the Deep Neural Network (DNN)-based module focuses on node attribute features, enabling the generation of heterogeneous yet complementary representations through their distinct characteristic information [76], [77], [78]. To further enhance the representation of each segment, we employ a 1D-CNN structure to encode the node attribute features, leveraging the temporal one-dimensional convolutional operator g^{(1D-CNN)}(·) with \tilde{F} kernels, which is expressed as

    H^R = g^{(1D\text{-}CNN)}(R) \in \mathbb{R}^{\tilde{F} \times N},    (10)

and H^R = [r_1^R, r_2^R, ..., r_N^R]. Therefore, we derive the output representation for the i-th time step by integrating these two heterogeneous representations as

    r_i^S = r_i^G + r_i^R \in \mathbb{R}^{\tilde{F} \times 1},    (11)

through which the output matrix of the branch representation can be represented as H^S = [r_1^S, r_2^S, ..., r_N^S].

C. Dual-View Consistency Based Loss Function

Afterwards, we introduce the DVC losses for the two GNN views and the 1D-CNN structure to enforce the consistency [73]. These can be achieved via aligning the cross-view feature correlation with the ground-truth node adjacency matrix A^g ∈ R^{N×N}, where the i-th-row and j-th-column element is set to 1 if the i-th and j-th segments in the utterance belong to the same class, and to 0 otherwise.

In the DVC loss, we employ a bilinear scoring function to evaluate node-pair similarity, which originates from the Mutual Information (MI) measurement initially developed for semi-supervised cross-view discrimination [79]. Different from the semi-supervised scenario, our supervised setting provides the ground-truth node adjacency matrix A^g ∈ R^{N×N}, and directly integrates this bilinear scoring function into L_DVC1 to learn view-invariant graph constructs. Hence, the first DVC loss
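As a concrete illustration of the second view and of the dual-view fusion in (9)-(11), the following is a minimal PyTorch-style sketch. The dense batched formulation, the learnable scalar κ^(l), the 1D-CNN kernel size, and the shared MLP sizes (512-1,024-512, as given in the implementation details) are assumptions for illustration; the DVC loss terms themselves are not implemented here.

```python
# Sketch of the heat-kernel GNN view and the dual-view fusion of Eqs. (9)-(11)
# (illustrative assumptions noted in comments).
import torch
import torch.nn as nn


class HeatKernelGNNLayer(nn.Module):
    def __init__(self, in_dim=768, hid_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.kappa = nn.Parameter(torch.tensor(1.0))   # assumed learnable kappa^(l) > 0

    def forward(self, R):
        H = self.proj(R)                               # (batch, N, hid_dim)
        dist2 = torch.cdist(H, H).pow(2)               # ||h_i - h_j||^2
        S_bar = torch.exp(-dist2 / self.kappa.clamp(min=1e-4))
        A_bar = S_bar / S_bar.sum(dim=-1, keepdim=True)   # D^{-1} S_bar
        return torch.relu(A_bar @ H)


class DualViewFusion(nn.Module):
    """Combines the two views (9) and adds 1D-CNN attribute features (10)-(11)."""
    def __init__(self, in_dim=768, hid_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hid_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, hid_dim))        # shared g^(MLP)
        # Kernel size 3 is an assumption; the paper specifies 768->512 channels, stride 1.
        self.cnn = nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1)

    def forward(self, R, r_view1, r_view2):
        r_g = self.mlp(r_view1) + self.mlp(r_view2)               # Eq. (9)
        h_r = self.cnn(R.transpose(1, 2)).transpose(1, 2)         # Eq. (10)
        return r_g + h_r                                          # Eq. (11): H^S
```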
TABLE I
DETAILS OF THE PSD, LA, AND HAD DATASETS, INCLUDING TOTAL DURATION, UTTERANCE DURATION, AND ALSO THE NUMBERS OF BONAFIDE UTTERANCES,
FULLY-SYNTHETIC (NOTED AS ‘FS’) UTTERANCES, PARTIALLY-SPOOFED (NOTED AS ‘PS’) UTTERANCES, AND SEGMENTS, IN TERMS OF THE 20 MS TEMPORAL
RESOLUTION FOR EACH SUBSET ARE SHOWN
regions for selecting both original and candidate segments. Then, the original and candidate segments are selected from the regions of the original and candidate utterances, respectively. Note that the selection obeys three rules: 1) for each original utterance, the candidate segments should be from one speaker but different utterances; 2) for each original utterance, any candidate segment should be used only once; and 3) the original and candidate segments in each substitution pair should be of similar length. Hence, each partially-spoofed utterance may contain segments generated through more than one TTS or VC method, and further, the meanings of the sentences, words, and phonemes are not taken into consideration in the substitution. Afterwards, the PSD dataset also involves several post-processing steps, in order to avoid potential artifacts introduced by concatenating audio segments.

The HAD dataset consists of Mandarin Chinese audio utterances, based on the AISHELL-3 corpus [82]. The dataset is generated by replacing bonafide segments (in bonafide utterances) with spoofed segments (from fully-synthetic utterances), using an open-source multi-speaker end-to-end TTS approach² and the neural vocoder LPCNet [83]. The generation of the HAD dataset includes three steps: textual editing, synthesis, and substitution. In the first step, the textual editing is achieved by randomly replacing one critical entity (including person, location, organization, or time) with its antonym within each bonafide utterance's transcript. Then, the TTS and vocoder are employed to generate synthetic utterances based on the edited text, and further, the selected entities' audio in the original utterances is replaced with the corresponding spoofed parts from the synthetic utterances. Note that for the evaluation subset, an improved LPCNet vocoder is utilized in generating the synthetic utterances to evaluate models' generalization performance.

² [Online]. Available: https://github.com/syang1993/gst-tacotron.git

The published version of the dataset³ is officially split into training, development, and evaluation subsets, approximately containing 53.1 k (26.6 k bonafide), 17.8 k (8.9 k bonafide), and 9.1 k (0 bonafide) samples, respectively. Critically, the original evaluation subset contains no bonafide samples, rendering it incapable of supporting utterance-level metrics like the Equal Error Rate (EER) that are crucial for anti-spoofing evaluation. To mitigate this limitation, we reallocate the subsets: the new training, development, and evaluation subsets correspond to the original development, evaluation, and training subsets, respectively. While the new development subset still lacks bonafide samples, we leverage its PS samples to evaluate trained models using segment-level metrics, circumventing the need for utterance-level EER computation. To guarantee fair comparisons, all compared models are re-trained and evaluated on the reallocated HAD subsets.

³ [Online]. Available: https://zenodo.org/records/10377492

B. Implementation Details

For the evaluation metric, the reported performance is measured in terms of EER [6], [8], [36] for both the utterance-level and segment-level tasks, which can be calculated by regulating the decision thresholds for the utterance-level outputs cos(h_u, w_u^{(c)}) (14) and the segment-level outputs cos(r_i^S, w_S^{(c)}) (15), until equal true-positive and true-negative recalls are achieved for the two classes.

Within the proposed network, the output feature dimension is F = 768 in the speech representation extracting module. Then, in the utterance-level branch, the B²LSTM structure sets 512 and 256 hidden neurons for its first and second BLSTM layers, while the QA structure sets its input to 768 dimensions, the number of heads to h = 4, and the weights' dimensionalities d_k and d_v both equal to 192. In the segment-level branch, we set the transformation dimension \tilde{F} = 512, and the number of D-GNN layers to L = 1. With one hidden layer, MLP1 and MLP2 in the D-GNN module both contain 512, 1,024, and 512 nodes in their input, hidden, and output layers, respectively. For the 1D-CNN structure, with 768 input channels, we set 512 output channels using a convolutional stride of 1. We consider two parameter configurations for the consistency loss functions: 1) DGAM1 with weight parameters λ1 = 0 and λ2 = 0.2 in (16), and 2) DGAM2 with λ1 = 0.1 and λ2 = 0.4. This setup indicates that DGAM2 jointly optimizes both the L_DVC1 and L_DVC2 loss functions, whereas DGAM1 employs only L_DVC2.

During training, as in [12], the optimizer is set to Adaptive moment estimation (Adam), with a default configuration of β1 = 0.9, β2 = 0.999, and ε = 10^{-8}. The learning rate is initialized to 10^{-5} and the weight decay is set to 10^{-6}, using a batch size of 12 [12]. For each batch, a zero-padding operation is performed on the batch's utterances, to make all samples share the same length (the maximum utterance length in the batch). Further, the training runs for 60 epochs, with the learning rate halved every 10 epochs for the PSD dataset and every 5 epochs for the HAD dataset.

Afterwards, we use the best segment-level EER results on the development set to select the optimally trained models. Then, in the evaluation phase, we discard the zero-padded parts of the evaluation data and only use the original parts to calculate the EER results. Note that we do not include any data augmentation, voice activity detection, or feature normalization during training.
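As a concrete illustration of the EER metric described above, the following is a minimal NumPy sketch that sweeps the decision threshold over cosine scores until the true-positive and true-negative recalls coincide. The unique-threshold grid and the midpoint taken at the crossing point are implementation assumptions.

```python
# Minimal EER sketch: find the threshold where the miss rate (on bonafide)
# and the false-acceptance rate (on spoof) are approximately equal.
import numpy as np


def equal_error_rate(scores, labels):
    """scores: cosine similarities toward the bonafide class; labels: 1=bonafide, 0=spoof."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.unique(scores)
    fnr, fpr = [], []
    for t in thresholds:
        pred = scores >= t                               # accept as bonafide above threshold
        fnr.append(np.mean(pred[labels == 1] == 0))      # missed bonafide
        fpr.append(np.mean(pred[labels == 0] == 1))      # accepted spoof
    fnr, fpr = np.array(fnr), np.array(fpr)
    idx = np.argmin(np.abs(fnr - fpr))                   # point where the two rates meet
    return (fnr[idx] + fpr[idx]) / 2.0


# Example with synthetic scores standing in for cos(r_i^S, w_S^(c)),
# collected after discarding zero-padded frames.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = np.concatenate([rng.normal(0.6, 0.2, 1000), rng.normal(0.1, 0.2, 1000)])
    labels = np.concatenate([np.ones(1000, int), np.zeros(1000, int)])
    print(f"EER = {100 * equal_error_rate(scores, labels):.2f}%")
```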
V. EXPERIMENTAL RESULTS

A. Experimental Comparisons

The Compared Approaches: In the experiments, we aim to compare the proposed and existing approaches evaluated on the PSD, LA (noted as 'PSD2LA'), and HAD datasets, with the EER results shown in Table II.

For a fair comparison against prior works, we reimplement the baselines using W2V2-base instead of their original feature extractors, and re-run training and evaluation on the PSD subsets and the reallocated HAD subsets. Specifically, for the compared approaches using the fully shared feature extracting and processing module, we employ W2V2-base with one gMLP block (noted as 'W2V2-1gMLP') and with five gMLP blocks (noted as 'W2V2-5gMLP') by replacing the original W2V2-XLSR, respectively [12], [46], W2V2-base with B²LSTM (noted as 'W2V2-B²LSTM') [12], [22], W2V2-base with QA [23] by replacing the original SENet [84], and W2V2-base with TDL [20] by replacing the original W2V2-XLSR.
TABLE II
THE UTTERANCE AND SEGMENT EER (%) RESULTS OF DIFFERENT APPROACHES EMPLOYING THE SEGMENT- AND UTTERANCE-LEVEL BRANCHES EVALUATED ON
THE THREE DATASETS, WHERE ‘PSD2LA’ DENOTES THE CASE OF TRAINING ON THE TRAINING PART OF THE PSD DATASET, WHILE EVALUATED ON LA
DATASET’S EVALUATION PART, ONLY CONSIDERING UTTERANCE EERS
Additionally, we adopt UMMAFormer [50] as a shared branch for the utterance-level and segment-level tasks. Since UMMAFormer is originally designed for multi-modal tasks, we remove its visual processing branch to adapt it to our speech-only paradigm. Furthermore, we take several utterance-only models as comparison methods, including W2V2-base integrated with attentive statistics pooling (noted as 'W2V2-P') [13] and W2V2-base combined with SE-Res1D (noted as 'W2V2-Res1D') [25], by replacing the original W2V2-large. We also incorporate MIL-based methods, including Hybrid MIL (H-MIL) and H-MIL with Local Self-attention (LS-H-MIL), proposed in [16], as additional utterance-level models for comparison. Then, for the cases using non-shared feature processing modules, we add compared approaches considering QA and B²LSTM [12], [22], respectively, in the utterance-level branch, while deploying B²LSTM [12], [22], Temporary Deepfake Location (TDL) [20], QA [23], BA-TFD+ [49], Multi-Domain ResNet Transformer with Time-Domain (MDRT-TD) [21], and Waveform Boundary Detection (WBD) [18], [19], respectively, for the segment-level branch. Further, TDL and WBD are originally designed only for the segment-level task, and hence, we feed the second Temporal-CONVolution (TCONV) layer's output for TDL and the BLSTM layer's output for WBD [18], [20], respectively, to the segment loss as in (15). For MDRT [21], which originally employs dual speech feature extractors, we reimplement it by exclusively using W2V2-base. Similarly, BA-TFD+ [49], initially developed for multi-modal scenarios, is modified by removing its visual processing branch while retaining the audio branch for the speech-only task. Then, we also employ the same training details for all the compared approaches using non-shared (task-specific) feature processing modules as in the proposed approach, while for the cases of shared feature processing modules (W2V2-1gMLP, W2V2-5gMLP, W2V2-B²LSTM, and QA), we employ the same parameters as in their corresponding original works [12], [22], [23].

Afterwards, for the proposed approach using the D-GNN in the segment-level branch, we consider the usage of QA and B²LSTM in the utterance-level branch, respectively, for the cases of DGAM1 and DGAM2. Note that for the compared W2V2-1gMLP, W2V2-5gMLP, and W2V2-B²LSTM approaches, we retain a single segment length of 20 ms for fair comparison. Similarly, we also set the same 20 ms segment length in TDL [20], with a modified architecture to adapt to this length. Further, since the original QA, UMMAFormer, and BA-TFD+ are designed to predict whether a segment is the start or end point of a spoofed clip [12], we modify these approaches to enable the models to predict whether each segment is fake or not.

Experimental Analysis: As can be seen from Table II, the proposed DGAM1 and DGAM2 models outperform the other compared MTL approaches, for both the utterance- and segment-level tasks in terms of EER on the three evaluation sets. Specifically, on the PSD evaluation set, the proposed DGAM performs better in the case of using the B²LSTM-based utterance-level
TABLE III
COMPARISONS BETWEEN THE EVALUATION PARTS OF THE PSD AND HAD DATASETS, ON THE ASPECTS OF LANGUAGE, THE NUMBERS OF SPOOFED-SPEECH GENERATION ALGORITHMS (NOTED AS '# ALGORITHMS'), ANS, AND ADS INDICATORS
TABLE IV
THE SEGMENT-LEVEL F1 (%) AND MAP (%) RESULTS OF DIFFERENT APPROACHES, ALL USING B²LSTM AS THE UTTERANCE-LEVEL BRANCH, EVALUATED ON THE TWO DATASETS

TABLE V
THE NUMBERS OF THE MODELS' PARAMETERS CONTAINED IN THE SEGMENT-LEVEL BRANCHES' NETWORK MODULE FOR THE COMPARED AND PROPOSED APPROACHES
TABLE IX
THE UTTERANCE AND SEGMENT (NOTED AS 'UTT.' AND 'SEG.', RESPECTIVELY) EER RESULTS (%) EVALUATED ON THE THREE DATASETS FOR THE PROPOSED DGAM (WITH B²LSTM IN THE UTTERANCE-LEVEL BRANCH), WHEN EMPLOYING DIFFERENT LOSS FUNCTIONS CONVENTIONALLY USED IN GNNS

dual-view consistency. In detail, the RL loss solely focuses on the similarity between the adjacency matrix constructed from the final outputs and the ground-truth adjacency matrix, which ignores the consistency information constraint between different views. Furthermore, the DCRN focuses on maximizing the similarity between representations of the same node across different views, without considering preserving the similarity among different nodes belonging to the same class across views. Different from these two loss functions, our consistency loss focuses on the nodes' similarity across different views within the same class, which may address the problems of the two loss functions above.

VI. CONCLUSION

In this paper, we proposed the Dual-View Graph neural network Assisted Segment-Level Module (DGAM) approach towards multi-task Partially Spoofed Speech Detection (PSSD). The proposed approach incorporated a shared speech representation extracting module and task-specific feature processing modules, consisting of an utterance-level branch for identifying partially-spoofed utterances and a segment-level branch for detecting spoofed segments. Within the segment-level branch, we designed a Dual-view Graph Neural Network (D-GNN) with a dual-view consistency loss for the segment-level tasks. The experimental results indicated that the proposed DGAM could achieve better performance for both utterance- and segment-level tasks, compared with the related state-of-the-art approaches, as well as with the proposed approach under different setups.

Following the work in this paper, our future work will focus on two aspects. First, we expect to investigate cross-domain cases for multi-task PSSD, which typically involve domain adaptation to diverse datasets and scenarios. Then, it will be challenging to design well-performing knowledge distillation techniques for PSSD, aiming to achieve applicable lightweight architectures by distilling the knowledge learned from complex models. In addition, we are also interested in real-world cases providing only extremely limited partially-spoofed training data for the detection tasks.

APPENDIX A
SELF-SIMILARITY FOR KERNEL FUNCTIONS

Theorem: The heat kernel satisfies the self-similarity principle, while the linear, polynomial, and sigmoid kernels violate the principle.

Proof: The self-similarity principle requires that, for any node i, its self-similarity K(h_i, h_i) should not be smaller than its similarity K(h_i, h_j) with another node j for any j ≠ i, where h_i and h_j are two nodes' non-normalized features. We decompose h_j = h_i^o + h_i^p, where h_i^o is orthogonal to h_i and h_i^p is parallel to h_i.

Heat Kernel: The heat kernel K(h_i, h_j) = e^{-\kappa^{-1} \| h_i - h_j \|^2} inherently satisfies the principle, since K(h_i, h_i) = 1 while K(h_i, h_j) ≤ 1 for i ≠ j (as \| h_i - h_j \|^2 ≥ 0).

To demonstrate the violations of the linear, polynomial, and sigmoid kernels, we set h_i^p = λ h_i with λ > 1.

Linear Kernel: The linear kernel K(h_i, h_j) = h_i^T h_j = λ h_i^T h_i violates the principle, since λ h_i^T h_i > h_i^T h_i.

Polynomial Kernel: The polynomial kernel K(h_i, h_j) = (1 + h_i^T h_j)^d = (1 + λ h_i^T h_i)^d (d ∈ N^+) also violates the principle, as ((1 + λ h_i^T h_i)/(1 + h_i^T h_i))^d > 1.

Sigmoid Kernel: The sigmoid kernel K(h_i, h_j) = tanh(h_i^T h_j + c) = tanh(λ h_i^T h_i + c) violates the self-similarity principle under the same condition, as tanh(·) is a monotonically increasing function.

REFERENCES

[1] J. H. Hansen and T. Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74–99, Nov. 2015.
[2] W. Lin and M.-W. Mak, "Mixture representation learning for deep speaker embedding," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 968–978, 2022.
[3] Z. Ge, X. Xu, H. Guo, T. Wang, and Z. Yang, "Speaker recognition using isomorphic graph attention network based pooling on self-supervised representation," Appl. Acoust., vol. 219, 2024, Art. no. 109929.
[4] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 132–157, 2021.
[5] A. Triantafyllopoulos et al., "An overview of affective speech synthesis and conversion in the deep learning era," Proc. IEEE, vol. 111, no. 10, pp. 1355–1381, Oct. 2023.
[6] X. Wang et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Comput. Speech Lang., vol. 64, 2020, Art. no. 101114.
[7] J. Yi et al., "ADD 2022: The first audio deep synthesis detection challenge," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 9216–9220.
[8] X. Liu et al., "ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2507–2522, 2023.
[9] Z. Ge, X. Xu, H. Guo, T. Wang, Z. Yang, and B. Schuller, "DGPN: A dual graph prototypical network for few-shot speech spoofing algorithm recognition," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Kos Island, Greece, 2024, pp. 1125–1129.
[10] L. Zhang, X. Wang, E. Cooper, J. Yamagishi, J. Patino, and N. Evans, "An initial investigation for detecting partially spoofed audio," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021, pp. 4264–4268.
[11] J. Yi et al., "Half-truth: A partially fake audio detection dataset," in Proc. Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021, pp. 1654–1658.
[12] L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, "The PartialSpoof database and countermeasures for the detection of short fake speech segments embedded in an utterance," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 813–825, 2023.
[13] Z. Lv, S. Zhang, K. Tang, and P. Hu, "Fake audio detection based on unsupervised pretraining models," in Proc. Int. Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 9231–9235.
[14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 12449–12460.
[15] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: [38] J. Jung, S. Kim, H. Shim, J. Kim, and H. Yu, “Improved RawNet with
A general framework for self-supervised learning in speech, vision and feature map scaling for text-independent speaker verification using raw
language,” in Proc. Int. Conf. Mach. Learn., Baltimore, MD, USA, 2022, waveforms,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., Shanghai,
pp. 1298–1312. China, 2020, pp. 1496–1500.
[16] Y. Zhu, Y. Chen, Z. Zhao, X. Liu, and J. Guo, “Local self-attention-based [39] S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full
hybrid multiple instance learning for partial spoof speech detection,” ACM stack speech processing,” IEEE J. Sel. Topics Signal Process., vol. 16,
Trans. Intell. Syst. Technol., vol. 14, no. 5, 2023, Art. no. 93. no. 6, pp. 1505–1518, Oct. 2022.
[17] X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, “Revisiting multiple instance [40] A. Babu et al., “XLS-R: Self-supervised cross-lingual speech representa-
neural networks,” Pattern Recognit., vol. 74, pp. 15–24, 2018. tion learning at scale,” in Proc. Interspeech, 2021, pp. 2278–2282.
[18] Z. Cai, W. Wang, and M. Li, “Waveform boundary detection for partially [41] H. Tak, J.-W. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-
spoofed audio,” in Proc. Int. Conf. Acoust., Speech Signal Process., Rhodes to-end spectro-temporal graph attention networks for speaker verification
Island, Greece, 2023, pp. 1–5. anti-spoofing and speech deepfake detection,” in Proc. Autom. Speaker
[19] Z. Cai and M. Li, “Integrating frame-level boundary detection and deep- Verification Spoofing Countermeasures Challenge, 2021, pp. 1–8.
fake detection for locating manipulated regions in partially spoofed audio [42] J. Gui et al., “A survey on self-supervised learning: Algorithms, applica-
forgery attacks,” Comput. Speech Lang., vol. 85, 2024, Art. no. 101597. tions, and future trends,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46,
[20] Y. Xie, H. Cheng, Y. Wang, and L. Ye, “An efficient temporary deepfake no. 12, pp. 9052–9071, Dec. 2024.
location approach based embeddings for partially spoofed audio detec- [43] A. Khan and K. M. Malik, “SpoTNet: A spoofing-aware transformer
tion,” in Proc. Int. Conf. Acoust., Speech Signal Process., Seoul, Korea, network for effective synthetic speech detection,” in Proc. Int. Work-
2024, pp. 966–970. shop Multimedia AI Against Disinformation, Thessaloniki, Greece, 2023,
[21] A. K. S. Yadav, K. Bhagtani, S. Baireddy, P. Bestagini, S. Tubaro, and E. pp. 10–18.
J. Delp, “Mdrt: Multi-domain synthetic speech localization,” in Proc. Int. [44] G. Lin, W. Luo, D. Luo, and J. Huang, “One-class neural network with
Conf. Acoust., Speech Signal Process., 2024, pp. 11171–11175. directed statistics pooling for spoofing speech detection,” IEEE Trans. Inf.
[22] L. Zhang, X. Wang, E. Cooper, and J. Yamagishi, “Multi-task learning Forensics Secur., vol. 19, pp. 2581–2593, 2024.
in utterance-level and segmental-level spoof detection,” in Proc. Au- [45] Y. Zhang, F. Jiang, and Z. Duan, “One-class learning towards syn-
tom. Speaker Verification Spoofing Countermeasures Challenge, 2021, thetic voice spoofing detection,” IEEE Signal Process. Lett., vol. 28,
pp. 9–15. pp. 937–941, 2021.
[23] H. Wu et al., “Partially fake audio detection by self-attention-based fake [46] H. Liu, Z. Dai, D. R. So, and Q. V. Le, “Pay attention to MLPs,” in Proc.
span discovery,” in Proc. Int. Conf. Acoust., Speech Signal Process., Int. Conf. Neural Inf. Process. Syst., 2021, pp. 9204–9215.
Singapore, 2022, pp. 9236–9240. [47] A. Khan, K. M. Malik, and S. Nawaz, “Frame-to-utterance convergence:
[24] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch networks A spectra-temporal approach for unified spoofing detection,” in Proc. Int.
for multi-task learning,” in Proc. Comput. Vis. Pattern Recognit., Las Conf. Acoust., Speech Signal Process., Seoul, Republic of Korea, 2024,
Vegas, NV, USA, 2016, pp. 3994–4003. pp. 10761–10765.
[25] T. Liu, L. Zhang, R. K. Das, Y. Ma, R. Tao, and H. Li, “How do [48] J. Zhong, B. Li, and J. Yi, “Enhancing partially spoofed audio localization
neural spoofing countermeasures detect partially spoofed audio?,” in Proc. with boundary-aware attention mechanism,” in Proc. Annu. Conf. Int.
Interspeech, 2024, pp. 1105–1109. Speech Commun. Assoc., 2024, pp. 4838–4842.
[26] C. Wang, S. Pan, C. P. Yu, R. Hu, G. Long, and C. Zhang, “Deep [49] Z. Cai, S. Ghosh, A. Dhall, T. Gedeon, K. Stefanov, and M. Hayat, “Glitch
neighbor-aware embedding for node clustering in attributed graphs,” Pat- in the matrix: A large scale benchmark for content driven audio-visual
tern Recognit., vol. 122, 2022, Art. no. 108230. forgery detection and localization,” Comput. Vis. Image Understanding,
[27] Y. Liu et al., “Deep graph clustering via dual correlation reduction,” vol. 236, 2023, Art. no. 103818.
in Proc. Assoc. Advance. Artif. Intell., Philadelphia, PA, USA, 2022, [50] R. Zhang, H. Wang, M. Du, H. Liu, Y. Zhou, and Q. Zeng, “UMMAFormer:
pp. 7603–7611. A universal multimodal-adaptive transformer framework for temporal
[28] I. Saratxaga, J. Sanchez, Z. Wu, I. Hernaez, and E. Navas, “Synthetic forgery localization,” in Proc. ACM Int. Conf. Multimedia, Ottawa, ON,
speech detection using phase information,” Speech Commun., vol. 81, Canada, 2023, pp. 8749–8759.
pp. 30–41, 2016. [51] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in
[29] J. Kim and S. M. Ban, “Phase-aware spoof speech detection based on graph domains,” in Proc. 2005 IEEE Int. Joint Conf. Neural Netw., 2005,
Res2Net with phase network,” in Proc. Int. Conf. Acoust., Speech Signal vol. 2, pp. 729–734.
Process., Rhodes Island, Greece, 2023, pp. 1–5. [52] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini,
[30] T. B. Patel and H. A. Patil, “Cochlear filter and instantaneous frequency “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20,
based features for spoofed speech detection,” IEEE J. Sel. Topics Signal no. 1, pp. 61–80, Jan. 2009.
Process., vol. 11, no. 4, pp. 618–631, Jun. 2017. [53] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural
[31] M. R. Kamble and H. A. Patil, “Novel energy separation based instanta- message passing for quantum chemistry,” in Proc. Int. Conf. Mach. Learn.,
neous frequency features for spoof speech detection,” in Proc. Eur. Signal Sydney, NSW, Australia, 2017, vol. 70, pp. 1263–1272.
Process. Conf., Kos island, Greece, 2017, pp. 106–110. [54] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
[32] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features convolutional networks,” in Proc. Int. Conf. Learn. Representations, 2017.
for synthetic speech detection,” in Proc. Annu. Conf. Int. Speech Commun. [55] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio,
Assoc., Dresden, Germany, 2015, pp. 2087–2091. “Graph attention networks,” in Proc. Int. Conf. Learn. Representations,
[33] M. Todisco, H. Delgado, and N. W. D. Evans, “Constant Q cepstral coef- 2018.
ficients: A spoofing countermeasure for automatic speaker verification,” [56] K. K. Thekumparampil, C. Wang, S. Oh, and L. Li, “Attention-
Comput. Speech Lang., vol. 45, pp. 516–535, 2017. based graph neural network for semi-supervised learning,” CoRR,
[34] J. Sanchez, I. Saratxaga, I. Hernáez, E. Navas, D. Erro, and T. Raitio, 2018, arXiv:1803.03735.
“Toward a universal synthetic speech spoofing detection using phase in- [57] A. Pandey and D. Wang, “Self-attending RNN for speech enhancement to
formation,” IEEE Trans. Inf. Forensics Secur., vol. 10, no. 4, pp. 810–820, improve cross-corpus generalization,” IEEE/ACM Trans. Audio, Speech,
Apr. 2015. Lang. Process., vol. 30, pp. 1374–1385, 2022.
[35] A. Khan, K. M. Malik, J. Ryan, and M. Saravanan, “Battling voice [58] Y. Wang et al., “PredRNN: A recurrent neural network for spatiotemporal
spoofing: A review, comparative analysis, and generalizability evaluation predictive learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2,
of state-of-the-art voice spoofing counter measures,” Artif. Intell. Rev., pp. 2208–2225, Feb. 2023.
vol. 56, pp. 513–566, 2023. [59] P. Jiang, X. Xu, H. Tao, L. Zhao, and C. Zou, “Convolutional-recurrent
[36] S. Novoselov, A. Kozlov, G. Lavrentyeva, K. Simonchik, and V. neural networks with multiple attention mechanisms for speech emo-
Shchemelinin, “STC anti-spoofing systems for the ASVspoof 2015 chal- tion recognition,” IEEE Trans. Cogn. Devel. Syst., vol. 14, no. 4,
lenge,” in Proc. Int. Conf. Acoust., Speech Signal Process., Shanghai, pp. 1564–1573, Dec. 2022.
China, 2016, pp. 5475–5479. [60] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, and M. Brudno,
[37] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “Speaker attribution with voice profiles by graph-based semi-supervised
“End-to-end anti-spoofing with RawNet2,” in Proc. Int. Conf. Acoust., learning,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., Shanghai,
Speech Signal Process., Toronto, Canada, 2021, pp. 6369–6373. China, 2020, pp. 289–293.
[61] V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, and X. [74] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques
Bresson, “Benchmarking graph neural networks,” J. Mach. Learn. Res., for embedding and clustering,” in Proc. Adv. Neural Inf. Process. Syst.,
vol. 24, no. 43, pp. 1–48, 2023. 2001, pp. 585–591.
[62] J.-W. Jung, H.-S. Heo, H.-J. Yu, and J. S. Chung, “Graph attention [75] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, “Structural deep
networks for speaker verification,” in Proc. Int. Conf. Acoust., Speech clustering network,” in Proc. Web Conf., 2020, pp. 1400–1410.
Signal Process., Toronto, Canada, 2021, pp. 6149–6153. [76] X. He, B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin, “Parallelly adaptive
[63] H.-J. Shim, J. Heo, J.-H. Park, G.-H. Lee, and H.-J. Yu, “Graph attentive graph convolutional clustering model,” IEEE Trans. Neural Netw. Learn.
feature aggregation for text-independent speaker verification,” in Proc. Int. Syst., vol. 35, no. 4, pp. 4451–4464, Apr. 2024.
Conf. Acoust., Speech Signal Process., Singapore, 2022, pp. 7972–7976. [77] Z. Peng, H. Liu, Y. Jia, and J. Hou, “Attention-driven graph clustering
[64] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, and M. Brudno, network,” in Proc. ACM Multimedia Conf., 2021, pp. 935–943.
“Speaker diarization with session-level speaker embedding refinement [78] Z. Peng, H. Liu, Y. Jia, and J. Hou, “Deep attention-guided graph clustering
using graph neural networks,” in Proc. Int. Conf. Acoust., Speech Signal with dual self-supervision,” IEEE Trans. Circuits Syst. Video Technol.,
Process., Barcelona, Spain, 2020, pp. 7109–7113. vol. 33, no. 7, pp. 3296–3307, Jul. 2023.
[65] A. Shirian and T. Guha, “Compact graph architecture for speech emotion [79] Y. Liu, S. Zhao, X. Wang, L. Geng, Z. Xiao, and J. C.-W. Lin, “Self-
recognition,” in Proc. Int. Conf. Acoust., Speech Signal Process., Toronto, consistent graph neural networks for semi-supervised node classification,”
ON, Canada, 2021, pp. 6284–6288. IEEE Trans. Big Data, vol. 9, no. 4, pp. 1186–1197, Aug. 2023.
[66] A. Shirian, S. Tripathi, and T. Guha, “Dynamic emotion modeling with [80] X. Zhang et al., “P2SGrad: Refined gradients for optimizing deep face
learnable graphs and graph inception network,” IEEE Trans. Multimedia, models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019,
vol. 24, pp. 780–790, 2022. pp. 9898–9906.
[67] Y. Kwon, H.-S. Heo, J.-W. Jung, Y. J. Kim, B.-J. Lee, and J. S. [81] X. Wang and J. Yamagishi, “A comparative study on recent neu-
Chung, “Multi-scale speaker embedding-based graph attention networks ral spoofing countermeasures for synthetic speech detection,” in Proc.
for speaker diarisation,” in Proc. Int. Conf. Acoust., Speech Signal Process., Annu. Conf. Int. Speech Commun. Assoc., Brno, Czechia, 2021,
Singapore, 2022, pp. 8367–8371. pp. 4259–4263.
[68] Y. Wei, H. Guo, Z. Ge, and Z. Yang, “Graph attention-based deep embed- [82] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A multi-speaker
ded clustering for speaker diarization,” Speech Commun., vol. 155, 2023, mandarin TTS corpus,” in Proc. INTERSPEECH, 2021, pp. 2756–2760.
Art. no. 102991. [83] J.-M. Valin and J. Skoglund, “LPCNET: Improving neural speech synthesis
[69] H. Tak, J. Jung, J. Patino, M. Todisco, and N. W. D. Evans, “Graph attention through linear prediction,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
networks for anti-spoofing,” in Proc. Annu. Conf. Int. Speech Commun. Process., 2019, pp. 5891–5895.
Assoc., Brno, Czechia, 2021, pp. 2356–2360. [84] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, “ASSERT: Anti-spoofing with
[70] J. Jung et al., “AASIST: Audio anti-spoofing using integrated spectro- squeeze-excitation and residual networks,” in Proc. Interspeech, 2019,
temporal graph attention networks,” in Proc. Int. Conf. Acoust., Speech pp. 1013–1017.
Signal Process., Singapore, 2022, pp. 6367–6371. [85] X. Xu et al., “Rethinking auditory affective descriptors through zero-shot
[71] F. Chen, S. Deng, T. Zheng, Y. He, and J. Han, “Graph-based spectro- emotion recognition in speech,” IEEE Trans. Computat. Social Syst., vol. 9,
temporal dependency modeling for anti-spoofing,” in Proc. Int. Conf. no. 5, pp. 1530–1541, Oct. 2022.
Acoust., Speech Signal Process., Rhodes Island, Greece: IEEE, 2023, [86] V. K. Sharma, R. Garg, and Q. Caudron, “A systematic literature review
pp. 1–5. on deepfake detection techniques,” Multimedia Tools Appl., pp. 1–43,
[72] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” 2024.
2023, arXiv:1606.08415. [87] E. Vahdani and Y. Tian, “Deep learning-based action detection in
[73] C. Xu, D. Tao, and C. Xu, “A survey on multi-view learning,” CoRR, untrimmed videos: A survey,” IEEE Trans. Pattern Anal. Mach. Intell.,
2013, arXiv:1304.5634. vol. 45, no. 4, pp. 4302–4320, Apr. 2023.