
Graph Anomaly Detection with Bi-level Optimization

Yuan Gao (Alibaba Group, Beijing, China)
Junfeng Fang (University of Science and Technology of China, Hefei, China)
Yongduo Sui (University of Science and Technology of China, Hefei, China)
Yangyang Li (Academy of Cyber, Beijing, China)
Xiang Wang∗† (University of Science and Technology of China, Hefei, China)
Huamin Feng (Beijing Electronic Science and Technology Institute, Beijing, China)
Yongdong Zhang (University of Science and Technology of China, Hefei, China)
ABSTRACT
Graph anomaly detection (GAD) has various applications in finance, healthcare, and security. Graph Neural Networks (GNNs) are now the primary method for GAD, treating it as a task of semi-supervised node classification (normal vs. anomalous). However, most traditional GNNs aggregate and average embeddings from all neighbors, without considering their labels, which can hinder detecting actual anomalies. To address this issue, previous methods try to selectively aggregate neighbors. However, the same selection strategy is applied regardless of normal and anomalous classes, which does not fully solve this issue. This study discovers that nodes with different classes yet similar neighbor label distributions (NLD) tend to have opposing loss curves, which we term "loss rivalry". By introducing the Contextual Stochastic Block Model (CSBM) and defining NLD distance, we explain this phenomenon theoretically and propose a Bi-level optimization Graph Neural Network (BioGNN) based on these observations. In a nutshell, the lower level of BioGNN segregates nodes based on their classes and NLD, while the upper level trains the anomaly detector using the separation outcomes. Our experiments demonstrate that BioGNN outperforms state-of-the-art methods and effectively mitigates "loss rivalry". Codes are available at https://github.com/blacksingular/Bio-GNN.

CCS CONCEPTS
• Security and privacy → Web application security; • Computing methodologies → Semi-supervised learning settings; Neural networks.

KEYWORDS
Graph Neural Networks, Anomaly Detection, Bi-level Optimization

ACM Reference Format:
Yuan Gao, Junfeng Fang, Yongduo Sui, Yangyang Li, Xiang Wang, Huamin Feng, and Yongdong Zhang. 2024. Graph Anomaly Detection with Bi-level Optimization. In Proceedings of the ACM Web Conference 2024 (WWW '24), May 13–17, 2024, Singapore, Singapore. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3589334.3645673

∗ Corresponding authors.
† Xiang Wang is also affiliated with Institute of Dataspace, Hefei Comprehensive National Science Center.

1 INTRODUCTION
Graph anomaly detection (GAD) is a learning-to-detect task. The objective is to differentiate anomalies from normal ones, assuming that the anomalies are generated from a distinct distribution that diverges from the normal nodes [26]. As demonstrated by [33], GAD has various real-world applications including detecting spam reviews in user-rating-product graphs [17], finding misinformation and fake news in social networks [9], and identifying fraud in financial transaction graphs [34, 51].

A primary method is to consider GAD as a semi-supervised node classification problem, where the edges are crucial. By examining the edges, we can divide an ego node's neighbors into two groups: (1) homophilous neighbors that have the same labels as the ego node, and (2) heterophilous neighbors whose labels are different from the ego node's label. For instance, in the case of an anomaly ego node, its interactions with anomaly neighbors display homophily, while its anomaly-normal edges demonstrate heterophily. Both homophily and heterophily are prevalent in nature. In transaction networks, fraudsters have heterophilous connections with their customers, while their connections with accomplices are homophilous.


Figure 1: The ego normal node and anomaly (marked in red circle) have comparable neighborhood label distributions (NLD). The probability of neighbor labels being 0 or 1 is denoted by p_c and q_c, where the subscript represents the class label. For instance, p_1 denotes the probability of a normal neighbor for anomalies.

From the standpoint of neighbor relationships, we can briefly describe the primary graph neural network (GNN)-based GAD solutions and their limitations as follows:
• Early studies [31, 55] aggregate over all neighbors without considering the impact of homophily and heterophily. That is, the representation of each node blindly aggregates the information from all neighbors, without discriminating the neighbor relationships. However, this approach can be disadvantageous to GAD, as anomalies are more likely to be hidden among a large number of normal neighbors. Blindly aggregating information can dilute the suspiciousness of anomalies with normal signals, making them less discernible [17, 29, 34, 43].
• To address the above-mentioned problem, recent studies [3, 4, 10, 16, 51] draw inspiration from graph signal processing (GSP). They suggest that a low-pass filter may not be optimal for all graphs. Instead, they manipulate eigenvalues of the normalized graph Laplacian to amplify some frequency information and weaken others. However, these studies optimize node representations as a whole, without addressing differences in their distribution regarding neighbor labels. For instance, as shown in Figure 1, a normal node shares the same neighbors as an anomaly. Our analysis in §3.2 reveals that nodes of different classes with the same neighbors retain rather different frequency components. While emphasizing a single frequency band can improve learning for some nodes, it can hinder the learning of others.

Thus, it is crucial to understand the impact of neighbor label distribution (NLD) on detector behavior. We introduce and reveal the phenomenon of "loss rivalry". Surprisingly, we observe opposite loss curves for anomalies and normal nodes holding similar NLDs. These are separately highlighted around the maxima and minima of the curves in Figure 2. Our analysis emphasizes the importance of using distinct aggregation mechanisms for nodes with different classes but similar NLD.

Based on this finding, we propose a bi-level optimization model in §4, named BioGNN. Specifically, it consists of two key components. The first component is a mask generator that separates nodes into mutually exclusive sets based on their classes and NLD. The second component contains two well-designed GNN encoders that adopt different mechanisms to learn node representations separately. In §3.1, we define the NLD distance based on the Contextual Stochastic Block Model (CSBM) and verify its direct proportion to representation expressiveness. Due to the proved superiority of adaptive filters in heterophilic graphs [4, 10, 51], we approach the problem in the spectral domain. Specifically, we first explain the feasibility of acquiring NLD given the ego graph of a node in the spectral domain in §3.2. Then, we distill the NLD of nodes from filter performance through the bi-level optimization process, as spectral filter performance depends on the concentration of the spectral label distribution [32, 8, 6]. In a nutshell, BioGNN distinguishes nodes with similar NLD that likely belong to different classes and feeds them into separate filters to prevent "loss rivalry".

Our contributions. (1) We reveal the "loss rivalry" phenomenon, where nodes belonging to different classes but with similar NLD tend to have opposite loss curves, which can negatively impact model convergence. (2) We provide theoretical explanations regarding the importance of NLD and the benefits of using polynomial-based spectral filtering methods to capture the NLD of nodes. (3) We propose a novel bi-level optimization framework to address the problem, and the effectiveness of the proposed method is verified through experiments.

2 PRELIMINARIES AND NOTATIONS
In GAD, anomalous and normal nodes can be modeled as an attributed graph G = (V, E, X), where V represents the set of anomalous and normal nodes, E denotes edges, and X is the attribute matrix. The objective of GAD is to identify anomalous nodes by learning from the attributes and structure of the graph. In §3.1, we will discuss the impact of NLD on GAD and demonstrate the superiority of spectral filtering in addressing this issue. Therefore, we introduce basic knowledge of graph spectral filtering in this section.

Graph-based Anomaly Detection. A primary approach for GAD is to frame it as a semi-supervised node classification task [38]. The goal is to train a predictive GNN model g that achieves minimal error in approaching the ground truth Y_test for unobserved nodes V_test given observed nodes V_train, where V_train ∪ V_test = V and V_train ∩ V_test = ∅:

g(G, Y_train) → Ŷ_test.   (1)

Note that GAD is an imbalanced classification problem, which often results in similar NLD for normal nodes and anomalies: anomalies in the graph are rare, hence both normal nodes and anomalies are surrounded by numerous normal nodes.

Graph Spectral Filtering. Let A be the adjacency matrix, and L be the graph Laplacian, which can be expressed as D − A or as I − D^{−1/2} A D^{−1/2} (symmetric normalized), where I is the identity matrix, and D is the diagonal degree matrix. L is positive semi-definite and symmetric, so it has an eigen-decomposition L = UΛU^T, where Λ = {λ_1, · · · , λ_N} are eigenvalues, and U = [u_1, · · · , u_N] are the corresponding unit eigenvectors [51]. Assuming X = [x_1, · · · , x_N] is a graph signal, we call the spectrum U^T X the graph Fourier transform of the signal X [8, 59]. In graph signal processing (GSP), the frequency is associated with Λ. Therefore, the goal of spectral methods is to identify a response function g(·) on Λ to learn the graph representation Z:

Z = g(L)X = U[g(Λ) ⊙ (U^T X)] = U g(Λ) U^T X.   (2)
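To make Eq. (2) concrete, the following minimal NumPy sketch (not from the paper's code) builds the symmetric normalized Laplacian of a toy graph, computes its eigendecomposition, and applies two illustrative response functions g(λ) to a random signal; the toy adjacency matrix and response shapes are assumptions for illustration only.

```python
import numpy as np

# Toy undirected graph: adjacency matrix A for 4 nodes (illustrative only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt       # symmetric normalized Laplacian

# Eigendecomposition L = U diag(lam) U^T (L is symmetric PSD, so eigh applies).
lam, U = np.linalg.eigh(L)

X = np.random.randn(4, 2)                          # a toy 2-dimensional graph signal

def spectral_filter(response, lam, U, X):
    """Z = U g(Lambda) U^T X, cf. Eq. (2)."""
    return U @ np.diag(response(lam)) @ U.T @ X

Z_low  = spectral_filter(lambda l: 1.0 - l / 2.0, lam, U, X)   # low-pass response
Z_high = spectral_filter(lambda l: l / 2.0,       lam, U, X)   # high-pass response
print(Z_low.shape, Z_high.shape)
```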


(a) Amazon Loss (b) Yelp Loss

Figure 2: Illustration of the ‘loss rivalry’ phenomenon in YelpChi and Amazon Datasets with BWGNN [51]. From the same-color
circles around the maxima and minima, we observe that the two loss curves in the same dataset are opposite along the epochs.
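Curves like those in Figure 2 can be obtained by tracking the training loss of two node groups separately across epochs. The sketch below illustrates this kind of diagnostic bookkeeping; the linear model, random data, and the fixed group split are placeholders and not the paper's setup, where the two groups would be anomalies and normal nodes with similar NLD.

```python
import torch
import torch.nn.functional as F

def group_losses(logits, labels, group_mask):
    """Mean cross-entropy inside and outside a boolean group of nodes."""
    per_node = F.cross_entropy(logits, labels, reduction="none")
    return per_node[group_mask].mean().item(), per_node[~group_mask].mean().item()

# Illustrative placeholders: features, labels, and a stand-in group split.
num_nodes, num_feat = 100, 16
x = torch.randn(num_nodes, num_feat)
y = torch.randint(0, 2, (num_nodes,))
group = torch.zeros(num_nodes, dtype=torch.bool)
group[:30] = True

model = torch.nn.Linear(num_feat, 2)      # stand-in for the GNN detector
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

curve_group, curve_rest = [], []
for epoch in range(50):
    opt.zero_grad()
    logits = model(x)
    F.cross_entropy(logits, y).backward()
    opt.step()
    lg, lrest = group_losses(logits.detach(), y, group)
    curve_group.append(lg)
    curve_rest.append(lrest)
# Plotting curve_group against curve_rest over epochs reveals whether the two
# groups exhibit the opposing trends described as "loss rivalry".
```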

3 THEORETICAL ANALYSIS
In this section, we introduce the Contextual Stochastic Block Model (CSBM), a widely used model for describing node feature formation. Based on CSBM, we define the NLD distance and verify its direct proportion to representation expressiveness. Furthermore, because adaptive filters have been shown to perform better in heterophilic graphs, we explore the feasibility of expressing NLD in the spectral domain to facilitate further study in later sections.

Table 1: NLD Symbol Definition.

Symbol   Definition (Probability of)
p_1      normal neighbor for anomalies
q_1      anomaly neighbor for anomalies
p_0      normal neighbor for normal nodes
q_0      anomaly neighbor for normal nodes

3.1 Impact of NLD on Node Classification
GNNs are widely used to learn node representations in networks, as they can capture graph topological and structural information effectively [21, 45, 46, 58]. However, GNNs distinguish nodes by averaging the node features of their neighborhood [63]. Therefore, it is intuitive that the neighbor label distribution has a significant impact on GNN performance. To analyze NLD from a graph generation perspective, we introduce the Contextual Stochastic Block Model (CSBM) [13]. CSBM is a random graph generative model commonly used to measure the expressiveness of GNNs [39].

CSBM. The Contextual Stochastic Block Model (CSBM) makes the following assumptions for an attributed graph G: (1) For a central node u with label c ∈ {0, 1}, the labels of its neighbors are independently sampled from a fixed distribution D_c ∼ Bern(p_c). p_c denotes the sampling probability of class c, and the sampling process continues until the number of neighbors matches the degree of node u. In this work, we refer to the distribution D_c as the neighborhood label distribution (NLD). (2) Anomalies and normal nodes have distinct node feature distributions, namely F_c. For simplicity, we define the NLD distance as follows:

Definition 3.1 (Neighborhood Label Distribution Distance) Given a graph G with label vector y, the neighborhood label distribution distance between nodes u and v is:

d(u, v) = dis(D_{u_c}(u), D_{v_c}(v)),   (3)

where dis(·, ·) measures the difference between distribution vectors, such as cosine distance or Euclidean distance; u_c and v_c denote the class of nodes u and v, respectively.

In this work, we focus on the binary GAD classification problem, hence D_c = {D_0 = [p_0, q_0], D_1 = [p_1, q_1]}, where the symbol definitions are shown in Table 1. Furthermore, following previous works [5, 13, 39], we suppose that F_c are two Gaussian distributions of n variables, i.e., x_0 ∼ N_n(μ_0, σ²I), x_1 ∼ N_n(μ_1, σ²I). This problem setting leads us to the following proposition, which indicates the expressive power of GNNs.

Proposition 3.1 Given a graph G = (V, E, {F_c}, {D_c}), the distance between the means of the class-wise hidden representations is proportional to their NLD distance.

Remark. The detailed proof can be found in Appendix A.1. Here we extend the analysis in [39] to a more generalized 2-layer vanilla GNN and also to the polynomial-based spectral case. This proposition shows that the expressive power of the representation depends on the neighborhood label distribution. Specifically, for nodes u and v in different classes, a vanilla 2-layer GCN has the following distance between their hidden representations:

||μ_u − μ_v||_2 = ([d(u, v)]² / 2) · ||μ_1 − μ_0||_2,   (4)

where μ_u and μ_v are the mean values of the learned representations of nodes u and v. Similarly, for spectral methods, whose general polynomial approximation form can be written as Σ_k α_k L̃^k X [63], we can achieve a much larger NLD distance with a second-order polynomial:

||μ_u − μ_v||_2 = [1 + d(u, v)/√2 + [d(u, v)]²/2] · ||μ_1 − μ_0||_2.   (5)

4385
WWW ’24, May 13–17, 2024, Singapore, Singapore Yuan Gao et al.

Table 2: Summary of the dataset statistics and the neighbor label distributions (NLD).

Dataset     # Nodes     # Edges      # Features   p_0     q_0     p_1     q_1     NLD Distance
YelpChi     11,944      4,398,392    25           0.8683  0.1317  0.8144  0.1856  0.0762
Amazon      45,954      3,846,979    32           0.9766  0.0234  0.9254  0.0746  0.0724
T-Finance   39,357      21,222,543   10           0.9850  0.0150  0.5280  0.4720  0.6462
T-Social    5,781,065   73,105,508   10           0.7634  0.2366  0.9161  0.0839  0.2159

The larger the distance ||𝜇𝑢 −𝜇 𝑣 || 2 , the more expressive the represen-
tation and the better capability of the downstream linear detector.
From (4) and (5), we observe two things: (1) the minimum value
of ||𝜇𝑢 − 𝜇 𝑣 || 2 is achieved when 𝑑 (𝑢, 𝑣) is 0; (2) using second-order
polynomial graph filtering can improve the ability to distinguish
between nodes, especially when the NLD of nodes from differ-
ent classes are similar. This finding aligns with previous research
[64, 69] in this area.
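As a small worked example, the sketch below plugs the YelpChi statistics from Table 2 into Definition 3.1 (with the Euclidean distance, cf. Eq. (24) in Appendix A.1) and into the factors of Eqs. (4) and (5); the code is illustrative and not part of the paper's implementation.

```python
import math

# Neighbor label distributions from Table 2 (YelpChi).
p0, q0 = 0.8683, 0.1317   # normal nodes: P(normal neighbor), P(anomaly neighbor)
p1, q1 = 0.8144, 0.1856   # anomalies:    P(normal neighbor), P(anomaly neighbor)

# Definition 3.1 with the Euclidean distance.
d = math.sqrt((p0 - p1) ** 2 + (q0 - q1) ** 2)
print(f"NLD distance d(u, v) = {d:.4f}")           # ~0.0762, matching Table 2

# Factor multiplying ||mu_1 - mu_0||_2 for a vanilla 2-layer GCN, Eq. (4).
gcn_factor = d ** 2 / 2
# Factor for a second-order polynomial spectral filter, Eq. (5).
spectral_factor = 1 + d / math.sqrt(2) + d ** 2 / 2
print(f"GCN factor: {gcn_factor:.4f}, spectral factor: {spectral_factor:.4f}")
```

For such a small NLD distance, the GCN factor is close to zero while the spectral factor stays above one, which is the quantitative gap the remark above refers to.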

3.2 NLD in the Spectral Domain


The NLD of anomalous and normal nodes in four benchmark datasets is statistically reported in Table 2. We observe that the NLD for nodes from different classes are similar, especially in the YelpChi and Amazon datasets. Our analysis justifies the need to filter out anomalies sharing similar neighborhood labels with normal nodes, so that the distribution of the remaining anomalies can be distinguished from that of normal nodes. Proposition 3.1 suggests that spectral methods are more effective. Therefore, we aim to address the problem in the spectral domain. To begin with, we express NLD in the spectral domain by bridging the gap between it and frequency. Specifically, we fragment a graph into a set of ego-graphs [61] and define the spectral label distribution as follows:

Definition 3.2 (Spectral Label Energy Distribution) Given an ego node u and its one-hop neighbor set N_u with size N, the spectral label energy distribution at λ_k is:

f_k(y, L) = α_k² / Σ_{i=1}^{N} α_i²,   (6)

where f is a probability distribution with Σ_{k=1}^{N} f_k = 1, L is the Laplacian matrix of the ego-graph, and {α} denotes the ego-graph spectrum of the one-hot label vector y. Since α_k = u_k^T y, f_k(y, L) measures the weight of u_k in y; a larger f_k indicates that the spectral label distribution concentrates more on λ_k. With Definition 3.2, we now show the relationship between f(y, L) and NLD.

Proposition 3.2 For a binary classification problem, the expectation of the spectral label energy distribution E[f(y, L)] is positively associated with the NLD of the node. Specifically:

E[f(y, L)] = |E| · (1 − p_0) / N   if y = 0,
E[f(y, L)] = |E| · p_1 / N          if y = 1.   (7)

Remark. The detailed proof can be found in Appendix A.2. Proposition 3.2 indicates that capturing the difference in spectral label distribution is equivalent to measuring the similarity between NLDs. Furthermore, the proposition elucidates that nodes in different classes with similar NLD retain rather different frequency components. Based on this finding, separating nodes whose spectral label distributions are different could bring two benefits: (1) separating nodes in the same class but with different NLDs; (2) separating nodes in different classes but with similar NLDs. Both of these benefits alleviate the "loss rivalry" phenomenon and help with the convergence of GNNs.

Figure 3: The influence of NLD on model performance.

3.3 Validation on Real-World Graphs
To verify the correctness of our theoretical findings, we report the F1-macro and AUC performance of some general methods (triangle marker) [31, 55, 64] and some polynomial spectral methods [10, 51, 69] (star marker) in Figure 3. We make two observations: (1) As shown in Table 2, the NLD distance between the two classes is 0.0762 and 0.6462 for YelpChi and T-Finance, respectively. From Figure 3, we observe that most methods achieve better results on T-Finance than on YelpChi, demonstrating the importance of NLD. Moreover, the performance gap between models on YelpChi and Amazon is much larger than that on T-Finance. This suggests that we can achieve decent performance with less powerful models on datasets with larger NLD distances. Our finding supports the notion that NLD can influence the expressive power of the GNN model, and that separating nodes with specific NLDs can improve the performance of the GNN model. (2) Spectral methods outperform spatial methods by a large margin. These tailored heterophilic filters further support our argument for the superiority of addressing the problem in the spectral domain.

4 METHODOLOGY
Guided by the analysis in §3.1, we advocate for the necessity of treating nodes with distinct spectral label distributions separately.

4386
Graph Anomaly Detection with Bi-level Optimization WWW ’24, May 13–17, 2024, Singapore, Singapore

Figure 4: BioGNN Framework. The mask generator θ(·) identifies subsets of nodes according to Equation (10). Two projection heads φ_1(·) and φ_2(·) and two spectral filters g_1(L) and g_2(L) assign labels to the corresponding subsets of nodes. The mask generator and the filters are optimized iteratively according to Equations (11) and (12).
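To illustrate the alternating optimization summarized in the caption above, a minimal PyTorch-style sketch is given below. All module definitions, the stand-in filters, tensor shapes, and the toy inputs are assumptions made for illustration; they are not the released implementation, which is available at the repository linked in the abstract.

```python
import torch
import torch.nn.functional as F

class MaskGenerator(torch.nn.Module):
    """theta(.): assigns each node to one of two branches, cf. Eq. (10)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = torch.nn.Linear(3 * feat_dim, 2)   # input: [x_u ; h_{u,0} ; h_{u,1}]

    def forward(self, feats):
        soft = F.softmax(self.mlp(feats), dim=-1)
        hard = F.one_hot(soft.argmax(dim=-1), num_classes=2).float()
        return hard + soft - soft.detach()            # straight-through estimator

def g1(L_norm, x):   # stand-in low-pass filter (a Bernstein filter in the paper)
    return x - 0.5 * (L_norm @ x)

def g2(L_norm, x):   # stand-in high-pass filter
    return 0.5 * (L_norm @ x)

# Assumed toy inputs: a normalized Laplacian, node features, labels, and the
# [x_u ; h_{u,0} ; h_{u,1}] features of Eq. (9) for the mask generator.
n, d = 200, 8
L_norm = torch.eye(n)                                  # placeholder graph
x = torch.randn(n, d)
theta_in = torch.randn(n, 3 * d)
y = torch.randint(0, 2, (n,))
y_sep = torch.randint(0, 2, (n,))                      # separation targets, cf. Eq. (14)

theta = MaskGenerator(d)
phi1, phi2, Phi = torch.nn.Linear(d, d), torch.nn.Linear(d, d), torch.nn.Linear(d, 2)
opt_enc = torch.optim.Adam(
    list(phi1.parameters()) + list(phi2.parameters()) + list(Phi.parameters()), lr=1e-2)
opt_mask = torch.optim.Adam(theta.parameters(), lr=1e-2)

for step in range(100):
    # Upper level, cf. Eq. (12): masks frozen, encoders trained on the true labels Y.
    mask = theta(theta_in).detach()
    m1, m2 = mask[:, :1], mask[:, 1:]
    loss_enc = F.cross_entropy(Phi(g1(L_norm, phi1(m1 * x))), y) \
             + F.cross_entropy(Phi(g2(L_norm, phi2(m2 * x))), y)
    opt_enc.zero_grad(); loss_enc.backward(); opt_enc.step()

    # Lower level, cf. Eq. (11): only the mask generator is updated, against Y_sep.
    mask = theta(theta_in)
    m1, m2 = mask[:, :1], mask[:, 1:]
    loss_mask = F.cross_entropy(Phi(g1(L_norm, phi1(m1 * x))), y_sep) \
              + F.cross_entropy(Phi(g2(L_norm, phi2(m2 * x))), y_sep)
    opt_mask.zero_grad(); loss_mask.backward(); opt_mask.step()
```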

In this section, we introduce our bi-level optimization graph neural network BioGNN. To begin with, we introduce the learning objectives in §4.1 and present the parameterization process in §4.2. In §4.4, we validate the effectiveness of the framework on golden-separated graphs.

4.1 The Learning Objectives
To start with, we introduce Lemma 4.1, which is widely agreed upon in the literature [6, 8, 32]:

Lemma 4.1 The prediction performance of a spectral filter is better when the spectral label energy distribution concentrates more on the pass band of the filter.

A more detailed analysis of Lemma 4.1 can be found in [8]. Building on Lemma 4.1, we could identify nodes according to the performance of different spectral filters through bi-level optimization. As shown in Figure 4, our learning objective is twofold: (1) Optimize the encoders {Φ(·), φ_1(·), φ_2(·)} to maximize the probability of correctly classifying nodes separated by θ(·); (2) Optimize the encoder θ(·), which predicts the NLD of nodes and separates nodes into two sets. All the encoders are learnable and set as MLPs. Concretely, the learning objective of BioGNN is defined as follows:

min_{φ_1, Φ, M_1} R(Φ(g_1(L) φ_1(M_1 ∘ X)), Y) + min_{φ_2, Φ, M_2} R(Φ(g_2(L) φ_2(M_2 ∘ X)), Y),   (8)
s.t. M_1 + M_2 = 1,

where M_1 and M_2 are hard masks given by the learnable encoder θ(·), 1 is an all-one vector, g_1(L) and g_2(L) are spectral filters, and ∘ denotes element-wise multiplication.

4.2 Instantiation of BioGNN
Given the two-fold objective, we propose to parameterize the encoder θ(·) and {Φ(·), φ_1(·), φ_2(·)}.

Parameterizing θ(·). The encoder θ(·) serves as a separator that predicts the NLD of nodes and feeds nodes into different branches of filters. Consequently, to obtain informative input for θ, we employ a label-wise message passing layer [11] which aggregates the labeled neighbors of the nodes label-wise. Concretely, for node u, the aggregated feature h_{u,c} for class c is:

h_{u,c} = (1 / |N_{l,c}(u)|) Σ_{v ∈ N_{l,c}(u)} x_v,   (9)

where N_{l,c}(u) is the set of neighbors labeled with c. When there are no labeled neighbors belonging to class c, we assign a zero embedding to h_{u,c}. Then we set

M_1(u) = argmax(MLP_θ([x_u ; h_{u,0} ; h_{u,1}])).   (10)

To ensure smoothed and well-defined gradients ∂y/∂θ, we apply a straight-through (ST) gradient estimator [2] to make the model differentiable. Note that BioGNN is trained in an iterative fashion; when the encoders {Φ(·), φ_1(·), φ_2(·)} are fixed as {Φ*(·), φ_1*(·), φ_2*(·)}, the objective function in this phase is:

min_{M_1} R(Φ*(g_1(L) φ_1*(M_1 ∘ X)), Y_sep) + min_{M_2} R(Φ*(g_2(L) φ_2*(M_2 ∘ X)), Y_sep),   (11)
s.t. M_1 + M_2 = 1.

Parameterizing {Φ(·), φ_1(·), φ_2(·)}. These three encoders serve as a predictor that assigns labels to input nodes. As we aim to distinguish between different spectral label distributions, which are closely related to the performance of filters with the corresponding pass bands, we adopt low-pass and high-pass filters as g_1(L) and g_2(L), respectively. Here, we choose to use two branches and leave the multi-branch framework for future work. Therefore, the functions of M_1 and M_2 become the masking of nodes with high-frequency and low-frequency ego-graphs, respectively. In this iterative training phase, we freeze the masks as M_1* and 1 − M_1*, and set the objective function as:

min_{Φ, φ_1} R(Φ(g_1(L) φ_1(M_1* ∘ X)), Y) + min_{Φ, φ_2} R(Φ(g_2(L) φ_2((1 − M_1*) ∘ X)), Y).   (12)

A similar training process has also been used in graph contrastive learning [50]. For the choice of g_1(L) and g_2(L), we adopt Bernstein polynomial-based filters [27, 51] for their convenience in decomposing low-pass and high-pass filters:

g(L) = (1/2) U β_{a,b}(Λ) U^T = (L/2)^a (I − L/2)^b / (2 ∫_0^1 t^{a−1} (1 − t)^{b−1} dt),   (13)

where β_{a,b} is the standard beta distribution parameterized by a and b. When a → 0, we acquire g(L) as a low-pass filter; similarly, g(L) acts as a high-pass filter when b → 0. For the choices of a and b on the specific benchmarks and more training details, please refer to Appendix B.1 and B.2.


(a) Golden Separated Loss (b) BioGNN Training Loss

Figure 5: Golden separated loss and real loss curves on YelpChi.

4.3 Initialization of BioGNN
To embrace a more stable bi-level optimization process, we initialize the encoders before iterative training.

Initialization of θ(·). θ(·) is initialized in a supervised fashion, where the supervision signal is obtained by counting the labeled inter-class neighbors:

Y_sep(u) = round( (1 / |N_L(u)|) Σ_{v ∈ N_L(u)} |{y_u ≠ y_v}| ),   (14)

then the cross-entropy is minimized:

min_θ −[Y_sep ∘ log(θ(X)) + (1 − Y_sep) ∘ log(1 − θ(X))].   (15)

Note that in our experiments, although the supervision signal is calculated with ego-graphs, the input data is a complete graph rather than ego-graphs extracted from a larger graph. Each node in the complete graph connects directly to all other nodes, ensuring that all interactions are considered during the learning process. As nodes with high-frequency ego-graphs are rare, to shield the separator from predicting all nodes as low-frequency ego nodes, we regularize the ratio of the two sets of nodes by enforcing the following constraint: we treat Y_sep as the optimal known mask, and one term Y_sep − θ(X) is added to the objective. The final objective is:

min_θ −[Y_sep ∘ log(θ(X)) + (1 − Y_sep) ∘ log(1 − θ(X))] + γ(Y_sep − θ(X)).   (16)

Initialization of {Φ(·), φ_1(·), φ_2(·)}. In this phase, we treat Y_sep as the optimal known mask:

min_{Φ, φ_1} R(Φ(g_1(L) φ_1(Y_sep ∘ X)), Y) + min_{Φ, φ_2} R(Φ(g_2(L) φ_2((1 − Y_sep) ∘ X)), Y).   (17)

4.4 Validation on Golden-separated Graphs
From an omniscient perspective, where we know all the labels of the nodes, we have access to the accurate NLD of all the nodes. In this case, we can separate the nodes ideally, and validate the effectiveness of BioGNN excluding the impact of false NLD prediction.

From Figure 5a, we observe that the loss decreases smoothly, demonstrating our argument that mixed nodes are the main cause of the "loss rivalry" phenomenon. Based on this finding, BioGNN can alleviate the problem and boost the performance of GAD. We discovered that the training order is significant in achieving better performance. Training nodes with high-frequency ego-graphs before those with low-frequency ones leads to better results. One possible reason for this is the shared linear classifier Φ between the two branches. Embeddings learned from the high-pass filter are noisier, and a classifier that performs well on noisy embeddings would most likely perform well on the whole dataset [28]. We consider this to be an intriguing discovery, yet leave a comprehensive theoretical examination for future research.

5 EXPERIMENT
In this section, we conduct experiments on four benchmarks and report the results of our models as well as some state-of-the-art baselines to demonstrate the effectiveness of BioGNN.

5.1 Experimental Setup
Datasets. Following previous works [17, 51], we conduct experiments on the four datasets introduced in Table 2. For more details about the datasets, please refer to Appendix B.3.

Baselines. Our baselines can be roughly categorized into three groups. The first group includes general methods, such as MLP, GCN [31], GAT [55], ChebyNet [12], GWNN [62], and JKNet [64]. As our focus is GAD, the second group considers tailored GAD methods including CAREGNN [17], PCGNN [34], GDN [24], GHRN [23], and BWGNN [51]. The third group includes methods that consider neighbor labels, such as H2GCN [69], GPRGNN [10], and MixHop [1]:
• GCN [31]: GCN is a traditional graph convolutional network.
• GAT [55]: GAT weights the neighbors with self-attention.


Table 3: Performance Results. The best results are in boldface, and the 2nd-best are underlined.

Dataset YelpChi Amazon T-Finance T-Social


Metric F1-Macro AUC F1-Macro AUC F1-Macro AUC F1-Macro AUC
MLP 0.4614 0.7366 0.9010 0.9082 0.4883 0.8609 0.4406 0.4923
GCN 0.5157 0.5413 0.5098 0.5083 0.5254 0.8203 0.6550 0.7012
GAT 0.4614 0.5459 0.5675 0.7731 0.8816 0.9388 0.4921 0.4923
ChebyNet 0.4608 0.6216 0.8070 0.9187 0.8017 0.8001 OOM
GWNN 0.4608 0.6246 0.4822 0.9319 0.4883 0.9670 OOM
JKNet 0.5805 0.7736 0.8270 0.8970 0.8971 0.9554 0.4923 0.7226
CAREGNN 0.5921 0.7617 0.8850 0.9092 0.7508 0.9161 0.4868 0.7939
PCGNN 0.6499 0.7985 0.8662 0.9571 0.5390 0.9162 0.4536 0.8917
GDN 0.7545 0.8904 0.9068 0.9709 0.8474 0.9462 0.7401 0.9287
GHRN 0.7789 0.9073 0.9282 0.9728 0.8975 0.9609 0.7328 0.9084
BWGNN 0.7583 0.9011 0.9188 0.9724 0.8793 0.9517 0.7494 0.9275
H2GCN 0.6575 0.8406 0.9213 0.9693 0.8824 0.9553 OOM OOM
MixHop 0.6534 0.8796 0.8093 0.9723 0.4880 0.9569 0.6471 0.9597
GPRGNN 0.6005 0.7928 0.7253 0.9265 0.8283 0.9500 0.5976 0.9622
BioGNN 0.7632 0.8920 0.9368 0.9748 0.9047 0.9639 0.8140 0.9325

(a) YelpChi (b) Amazon (c) T-Finance (d) T-Social

Figure 6: The NLD distance between two separated sets of nodes.

• ChebyNet [12]: ChebyNet generalizes CNNs to graph data in the context of spectral graph theory.
• GWNN [62]: GWNN leverages the graph wavelet transform to address the shortcomings of spectral graph CNN methods that depend on the graph Fourier transform.
• JKNet [64]: The jumping-knowledge network, which concatenates or max-pools the hidden representations.
• Care-GNN [17]: Care-GNN is a camouflage-resistant GNN that adaptively samples neighbors according to feature similarity.
• PC-GNN [34]: PC-GNN consists of two modules, "pick" and "choose", and maintains a balanced label frequency around fraudsters by downsampling and upsampling.
• H2GCN [69]: H2GCN is a tailored heterophily GNN which identifies three useful designs.
• BWGNN [51]: BWGNN is a spectral filter addressing the "right-shift" phenomenon in anomaly detection.
• GDN [24]: GDN deals with heterophily by leveraging constraints on original node features.
• GHRN [23]: GHRN calculates a post-aggregation score and modifies the graph to make the downstream model better at handling heterophily issues.
• MixHop [1]: MixHop repeatedly mixes feature representations of neighbors at various distances to learn relationships.
• GPRGNN [10]: GPR-GNN learns a polynomial filter by directly performing gradient descent on the polynomial coefficients.

5.2 Performance Comparison
The main results are reported in Table 3. Note that we search for the best threshold to achieve the best F1-macro on validation for all methods. In general, BioGNN achieves the best F1-macro score except on YelpChi, empirically verifying that it has a larger distance between predictions and the decision boundary, benefiting from measuring the NLD distance. For AUC, BioGNN performs poorly on T-Social. We suppose the reason is that T-Social has a complex frequency composition, since the best performance is achieved when the frequency order is high according to BWGNN [51]. We believe this issue could be alleviated if multi-branch filters are adopted, which we leave for future work. Furthermore, some methods achieve high AUC while maintaining a low F1-Macro, indicating that the instances can be distinguished but hold tightly in the space. In such cases, we cannot evaluate these methods as effective [44].

Figure 7: The ego-graph of some yellow-circled ego nodes classified as high-frequency by BioGNN in YelpChi. The anomalies
are represented in red, while normals are represented in blue.

H2GCN, MixHop, and GPRGNN are three state-of-the-art heterophilous GNNs that shed light on the relationship between the ego node and neighbor labels. We observe that they consistently outperform other groups of methods, including some tailored GAD methods. We ascribe this large performance gap to two reasons: (1) the harmfulness of heterophily, where vast normal neighborhoods attenuate the suspiciousness of the anomalies; (2) the superiority of spectral filters in distinguishing nodes with different NLD. However, BioGNN outperforms these methods, especially in F1-Macro, where the improvement ranges from 2.7% to 25.8%. This supports our analysis that different-class nodes with similar NLD should be treated separately to alleviate "loss rivalry". Furthermore, among the tailored GNN methods, BWGNN and BioGNN are polynomial-based filters that perform better than others, further suggesting that spectral filtering is more promising in GAD.

In several datasets, MLP outperforms some GNN-based methods, indicating that blindly mixing neighbors can sometimes degrade the prediction performance. Therefore, structural information should be used with care, especially when the neighborhood label distributions for nodes are complex.

5.3 Analysis of BioGNN
In this section, we take a closer look at BioGNN. We first verify the smoothness of the BioGNN loss curve to demonstrate its effectiveness in alleviating "loss rivalry". Then we plot the distribution of the separated nodes to elucidate that our model can successfully discriminate nodes with different NLD and set them apart. To make this clearer, we also visualize some high-frequency ego-graphs.

Loss Rivalry Addressing. To answer the question of whether BioGNN can alleviate the "loss rivalry", we plot the training loss of BioGNN in Figure 5b. Similar to Section 4.4, two separate sets of nodes are trained in a specific order: high-frequency nodes are trained first, followed by low-frequency nodes. Comparing Figures 2, 5a, and 5b, we find that the smoothness of BioGNN's training curve lies between golden-separated and mixed training, indicating that the new framework is effective in alleviating "loss rivalry" and improves the overall performance of GAD.

Distribution of the separated nodes. The core of BioGNN is node separation. To further validate its effectiveness, we report the empirical histogram of the NLD in the four benchmarks in Figure 6. The x-axis represents the edge homophily, which explicitly represents the NLD around the ego node. The y-axis denotes the density, and the distribution curves are shown in dashed lines. From Figure 6, we observe that the two histograms seldom overlap, and the means of the two curves maintain a separable distance, demonstrating that BioGNN successfully sets the nodes apart.

Visualization. To show the results in an intuitive way, we report the ego-graphs of some nodes in Figure 7. These nodes are assigned to the high-pass filter by θ(·). As observed from the figure, where color denotes the class of the nodes, the ego node (red-circled) has more inter-class neighbors compared to the nodes assigned to the low-pass filter. This finding provides support for Equation 7 and verifies the effectiveness of our novel framework.

Time complexity analysis. The time complexity of BioGNN is O(C|E|), where C is a constant and |E| denotes the number of edges in the graph. This is due to the fact that the BernNet-based filter is a polynomial function that can be computed recursively [51].

6 LIMITATION AND CONCLUSION
Limitation. Although we propose a novel network that treats nodes separately, it has some limitations. Our work only separates the nodes into two sets, and we hope to extend it to more fine-grained multi-branch neural networks in the future. Furthermore, our theoretical result largely relies on CSBM's assumptions; hence our model may fail in some cases where the graph generation process does not follow these assumptions.

Conclusion. This work starts with "loss rivalry", expressing the phenomenon that some nodes tend to have opposite loss curves from others. We argue that it is caused by the mixed training of different-class nodes with similar NLD. Furthermore, we discover that spectral filters are superior in addressing the problem. To this end, we propose BioGNN, which essentially discriminates nodes that share similar NLD but are likely to be in different classes and feeds them into different filters to prevent "loss rivalry". Although the datasets, experiments, and analysis of this work are based on graph anomaly detection, this two-branch method could be further deployed to more downstream tasks, such as graph adversarial learning [52–54], graph prompt learning [66–68], graph OOD [47–49], and graph explanation [19, 20, 22] in the future.

7 ACKNOWLEDGMENTS
This research is supported by the National Natural Science Foundation of China (9227010114).


REFERENCES [30] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang.
[1] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina 2020. Graph Structure Learning for Robust Graph Neural Networks. In KDD.
Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. Mixhop: ACM, 66–74.
Higher-order graph convolutional architectures via sparsified neighborhood [31] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with
mixing. In ICML. 21–29. Graph Convolutional Networks. In ICLR.
[2] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. 2013. Estimating or [32] Runlin Lei, Zhen Wang, Yaliang Li, Bolin Ding, and Zhewei Wei. 2022. EvenNet:
Propagating Gradients Through Stochastic Neurons for Conditional Computation. Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks.
CoRR abs/1308.3432 (2013). In NeuIPS.
[3] Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. 2021. Beyond low-frequency [33] Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang,
information in graph convolutional networks. In AAAI. 3950–3957. Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, et al. 2022. BOND: Benchmarking
[4] Ziwei Chai, Siqi You, Yang Yang, Shiliang Pu, Jiarong Xu, Haoyang Cai, and Unsupervised Outlier Node Detection on Static Attributed Graphs. In NeurIPS
Weihao Jiang. 2022. Can Abnormality be Detected by Graph Neural Networks?. Datasets and Benchmarks Track.
In IJCAI. 1945–1951. [34] Yang Liu, Xiang Ao, Zidi Qin, Jianfeng Chi, Jinghua Feng, Hao Yang, and Qing
[5] Sudhanshu Chanpuriya and Cameron Musco. 2022. Simplified Graph Convolution He. 2021. Pick and choose: a GNN-based imbalanced learning approach for fraud
with Heterophily. In NeurIPS. detection. In WWW. 3168–3177.
[6] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020. [35] Zhiwei Liu, Yingtong Dou, Philip S Yu, Yutong Deng, and Hao Peng. 2020. Al-
Simple and Deep Graph Convolutional Networks. In ICML. 1725–1735. leviating the inconsistency problem of applying graph neural network to fraud
[7] Yu Chen, Lingfei Wu, and Mohammed J. Zaki. 2020. Iterative Deep Graph detection. In SIGIR. 1569–1572.
Learning for Graph Neural Networks: Better and Robust Node Embeddings. In [36] Kangkang Lu, Yanhua Yu, Hao Fei, Xuan Li, Zixuan Yang, Zirui Guo, Meiyu Liang,
NeurIPS. Mengran Yin, and Tat-Seng Chua. 2024. Improving Expressive Power of Spectral
[8] Zhixian Chen, Tengfei Ma, and Yang Wang. 2022. When Does A Spectral Graph Graph Neural Networks with Eigenvalue Correction. CoRR abs/2401.15603 (2024).
Neural Network Fail in Node Classification? CoRR abs/2202.07902 (2022). [37] Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen,
[9] Lu Cheng, Ruocheng Guo, Kai Shu, and Huan Liu. 2021. Causal understanding and Xiang Zhang. 2021. Learning to Drop: Robust Graph Neural Network via
of fake news dissemination on social media. In KDD. 148–157. Topological Denoising. In WSDM. ACM, 779–787.
[10] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive Universal [38] Xiaoxiao Ma, Jia Wu, Shan Xue, Jian Yang, Chuan Zhou, Quan Z Sheng, Hui
Generalized PageRank Graph Neural Network. In ICLR. Xiong, and Leman Akoglu. 2021. A comprehensive survey on graph anomaly
[11] Enyan Dai, Zhimeng Guo, and Suhang Wang. 2021. Label-Wise Message Passing detection with deep learning. TKDE (2021).
Graph Neural Network on Heterophilic Graphs. CoRR abs/2110.08128 (2021). [39] Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. 2022. Is homophily a necessity
[12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolu- for graph neural networks?. In ICLR.
tional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS. [40] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. 2021. A
3837–3845. Unified View on Graph Neural Networks as Graph Signal Denoising. In CIKM.
[13] Yash Deshpande, Subhabrata Sen, Andrea Montanari, and Elchanan Mossel. 2018. ACM, 1202–1211.
Contextual Stochastic Block Models. In NeurIPS. 8590–8602. [41] Julian John McAuley and Jure Leskovec. 2013. From amateurs to connoisseurs:
[14] Kaize Ding, Jundong Li, and Huan Liu. 2019. Interactive anomaly detection on modeling the evolution of user expertise through online reviews. In WWW.
attributed networks. In WSDM. 357–365. 897–908.
[15] Kaize Ding, Zhe Xu, Hanghang Tong, and Huan Liu. 2022. Data Augmentation [42] Shebuti Rayana and Leman Akoglu. 2016. Collective opinion spam detection
for Deep Graph Learning: A Survey. SIGKDD Explor. 24, 2 (2022), 61–77. using active inference. In SDM. 630–638.
[16] Yushun Dong, Kaize Ding, Brian Jalaian, Shuiwang Ji, and Jundong Li. 2021. [43] Fengzhao Shi, Yanan Cao, Yanmin Shang, Yuchen Zhou, Chuan Zhou, and Jia
Adagnn: Graph neural networks with adaptive frequency response filter. In Wu. 2022. H2-FDetector: A GNN-based Fraud Detector with Homophilic and
CIKM. 392–401. Heterophilic Connections. In WWW. ACM, 1486–1494.
[17] Yingtong Dou, Zhiwei Liu, Li Sun, Yutong Deng, Hao Peng, and Philip S Yu. 2020. [44] Wentao Shi, Jiawei Chen, Fuli Feng, Jizhi Zhang, Junkang Wu, Chongming Gao,
Enhancing graph neural network-based fraud detectors against camouflaged and Xiangnan He. 2023. On the Theories Behind Hard Negative Sampling for
fraudsters. In CIKM. 315–324. Recommendation. In WWW. ACM, 812–822.
[18] Lun Du, Xiaozhou Shi, Qiang Fu, Xiaojun Ma, Hengyu Liu, Shi Han, and Dongmei [45] Wentao Shi, Junkang Wu, Xuezhi Cao, Jiawei Chen, Wenqiang Lei, Wei Wu, and
Zhang. 2022. GBK-GNN: Gated Bi-Kernel Graph Neural Networks for Modeling Xiangnan He. 2023. FFHR: Fully and Flexible Hyperbolic Representation for
Both Homophily and Heterophily. In WWW. ACM, 1550–1558. Knowledge Graph Completion. CoRR abs/2302.04088 (2023).
[19] Junfeng Fang, Xinglin Li, Yongduo Sui, Yuan Gao, Guibin Zhang, Kun Wang, Xi- [46] Yongduo Sui, Tianlong Chen, Pengfei Xia, Shuyao Wang, and Bin Li. 2022. To-
ang Wang, and Xiangnan He. 2024. EXGC: Bridging Efficiency and Explainability wards robust detection and segmentation using vertical and horizontal adversarial
in Graph Condensation. In WWW. ACM. training. In IJCNN. 1–8.
[20] Junfeng Fang, Wei Liu, Yuan Gao, Zemin Liu, An Zhang, Xiang Wang, and [47] Yongduo Sui, Xiang Wang, Tianlong Chen, Meng Wang, Xiangnan He, and Tat-
Xiangnan He. 2023. Evaluating Post-hoc Explanations for Graph Neural Networks Seng Chua. 2023. Inductive Lottery Ticket Learning for Graph Neural Networks.
via Robustness Analysis. In NeurIPS. Journal of Computer Science and Technology (2023).
[21] Junfeng Fang, Wei Liu, An Zhang, Xiang Wang, Xiangnan He, Kun Wang, and [48] Yongduo Sui, Xiang Wang, Jiancan Wu, Min Lin, Xiangnan He, and Tat-Seng Chua.
Tat-Seng Chua. 2022. On Regularization for Explaining Graph Neural Networks: 2022. Causal attention for interpretable and generalizable graph classification. In
An Information Theory Perspective. (2022). KDD. 1696–1705.
[22] Junfeng Fang, Xiang Wang, An Zhang, Zemin Liu, Xiangnan He, and Tat-Seng [49] Yongduo Sui, Qitian Wu, Jiancan Wu, Qing Cui, Longfei Li, Jun Zhou, Xiang Wang,
Chua. 2023. Cooperative Explanations of Graph Neural Networks. In WSDM. and Xiangnan He. 2023. Unleashing the Power of Graph Data Augmentation on
ACM, 616–624. Covariate Distribution Shift. In NeurIPS.
[23] Yuan Gao, Xiang Wang, Xiangnan He, Zhenguang Liu, Huamin Feng, and Yong- [50] Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. 2021. Adversarial Graph
dong Zhang. 2023. Addressing Heterophily in Graph Anomaly Detection: A Augmentation to Improve Graph Contrastive Learning. In NeurIPS. 15920–15933.
Perspective of Graph Spectrum. In WWW. ACM, 1528–1538. [51] Jianheng Tang, Jiajin Li, Ziqi Gao, and Jia Li. 2022. Rethinking Graph Neural
[24] Yuan Gao, Xiang Wang, Xiangnan He, Zhenguang Liu, Huamin Feng, and Yong- Networks for Anomaly Detection. In ICML. 21076–21089.
dong Zhang. 2023. Alleviating Structrual Distribution Shift in Graph Anomaly [52] Shuchang Tao, Qi Cao, Huawei Shen, Liang Hou, and Xueqi Cheng. 2021. Adver-
Detection. In WSDM. sarial Immunization for Certifiable Robustness on Graphs. In WSDM.
[25] Yuan Gao, Xiang Wang, Xiangnan He, Zhenguang Liu, Huamin Feng, and Yong- [53] Shuchang Tao, Qi Cao, Huawei Shen, Junjie Huang, Yunfan Wu, and Xueqi Cheng.
dong Zhang. 2023. Alleviating Structural Distribution Shift in Graph Anomaly 2021. Single Node Injection Attack against Graph Neural Networks. In CIKM.
Detection. In WSDM. 1794–1803.
[26] Douglas M Hawkins. 1980. Identification of outliers. Vol. 11. Springer. [54] Shuchang Tao, Huawei Shen, Qi Cao, Yunfan Wu, Liang Hou, and Xueqi Cheng.
[27] Mingguo He, Zhewei Wei, Zengfeng Huang, and Hongteng Xu. 2021. Bern- 2023. Graph Adversarial Immunization for Certifiable Robustness. TKDE (2023).
Net: Learning Arbitrary Graph Spectral Filters via Bernstein Approximation. In [55] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro
NeurIPS. 14239–14251. Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
[28] Weihua Hu, Kaidi Cao, Kexin Huang, Edward W. Huang, Karthik Subbian, and [56] Daixin Wang, Jianbin Lin, Peng Cui, Quanhui Jia, Zhen Wang, Yanming Fang,
Jure Leskovec. 2022. TuneUp: A Training Strategy for Improving Generalization Quan Yu, Jun Zhou, Shuang Yang, and Yuan Qi. 2019. A semi-supervised graph
of Graph Neural Networks. CoRR abs/2210.14843 (2022). attentive network for financial fraud detection. In ICDM. 598–607.
[29] Mengda Huang, Yang Liu, Xiang Ao, Kuan Li, Jianfeng Chi, Jinghua Feng, Hao [57] Jianyu Wang, Rui Wen, Chunming Wu, Yu Huang, and Jian Xion. 2019. Fdgars:
Yang, and Qing He. 2022. AUC-oriented Graph Neural Network for Fraud Detec- Fraudster detection via graph convolutional networks in online app review
tion. In WWW. 1311–1321. system. In WWW (Companion Volume). 310–316.


[58] Shuyao Wang, Yongduo Sui, Jiancan Wu, Zhi Zheng, and Hui Xiong. 2024. Dynamic Sparse Learning: A Novel Paradigm for Efficient Recommendation. In WSDM. ACM.
[59] Xiyuan Wang and Muhan Zhang. 2022. How Powerful are Spectral Graph Neural Networks. In ICML. 23341–23362.
[60] Yanling Wang, Jing Zhang, Shasha Guo, Hongzhi Yin, Cuiping Li, and Hong Chen. 2021. Decoupling representation learning and classification for gnn-based anomaly detection. In SIGIR. 1239–1248.
[61] Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022. Handling Distribution Shifts on Graphs: An Invariance Perspective. In ICLR.
[62] Bingbing Xu, Huawei Shen, Qi Cao, Yunqi Qiu, and Xueqi Cheng. 2019. Graph Wavelet Neural Network. In ICLR (Poster). OpenReview.net.
[63] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In ICLR.
[64] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In ICML. 5449–5458.
[65] Zhe Xu, Boxin Du, and Hanghang Tong. 2022. Graph Sanitation with Application to Node Classification. In WWW. ACM, 1136–1147.
[66] Xingtong Yu, Yuan Fang, Zemin Liu, and Xinming Zhang. 2023. HGPROMPT: Bridging Homogeneous and Heterogeneous Graphs for Few-shot Prompt Learning. CoRR abs/2312.01878 (2023).
[67] Xingtong Yu, Zhenghao Liu, Yuan Fang, Zemin Liu, Sihong Chen, and Xinming Zhang. 2023. Generalized Graph Prompt: Toward a Unification of Pre-Training and Downstream Tasks on Graphs. CoRR abs/2311.15317 (2023).
[68] Xingtong Yu, Chang Zhou, Yuan Fang, and Xinming Zhang. 2023. MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs. CoRR abs/2312.03731 (2023).
[69] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs. In NeurIPS.

A PROOFS
In this section, the proofs of the propositions are listed.

A.1 Proof of Proposition 1
Proof. In the spectral domain, the hidden representation of the spectral filter can be expressed as:

H = Σ_k α_k L̃^k X = Σ_k α_k (I − D^{−1/2} A D^{−1/2})^k X.   (18)

Taking the second-order spectral filter as an example,

H_2 = α_0 X + α_1 (I − D^{−1/2} A D^{−1/2}) X + α_2 (I − D^{−1/2} A D^{−1/2})² X.   (19)

The representation of node i is given as:

h_i = α_0 x_i + α_1 (x_i − (1/deg(x_i)) Σ_{j ∈ N_i} x_j) + α_2 (x_i − (2/deg(x_i)) Σ_{j ∈ N_i} x_j + (1/deg(x_i)) Σ_{j ∈ N_i} (1/deg(x_j)) Σ_{k ∈ N_j} x_k)
    = (α_0 + α_1 + α_2) x_i − ((α_1 + 2α_2)/deg(x_i)) Σ_{j ∈ N_i} x_j + (α_2/deg(x_i)) Σ_{j ∈ N_i} (1/deg(x_j)) Σ_{k ∈ N_j} x_k.   (20)

Here we only focus on the aggregation process, hence the non-linear activation is ignored. Furthermore, to simplify the calculation and the analysis, we set α_0 as 1, α_1 as −1, and α_2 as 1. In this case the coefficients or the numerators of the coefficients equal 1. Suppose u and v are nodes with different labels (i.e., anomalies and normal nodes), along with their NLD as D_u = [p_1, q_1] and D_v = [p_0, q_0], where p_1 + q_1 = p_0 + q_0 = 1. From previous analysis, we assume x_u ∼ N(μ_1, I) and x_v ∼ N(μ_0, I), hence we know h_u and h_v should obey Gaussian distributions, whose means can be acquired as:

μ_u = μ_1 − (p_1 μ_0 + q_1 μ_1) + p_1 (p_0 μ_0 + q_0 μ_1) + q_1 (p_1 μ_0 + q_1 μ_1)
    = μ_1 + p_1 (p_0 μ_0 + q_0 μ_1 − p_1 μ_0 − q_1 μ_1)
    = μ_1 + p_1 [(p_0 − p_1) μ_0 + (q_0 − q_1) μ_1],
μ_v = μ_0 − q_0 [(p_0 − p_1) μ_0 + (q_0 − q_1) μ_1].   (21)

Hence the distance between the means of these two distributions is:

||μ_u − μ_v||_2 = ||μ_1 − μ_0||_2 + (p_1 + q_0) ||(p_0 − p_1) μ_0 + (q_0 − q_1) μ_1||_2
              = ||μ_1 − μ_0||_2 + (1 + q_0 − q_1) · |q_0 − q_1| · ||μ_1 − μ_0||_2
              = [1 + |q_0 − q_1| + |(p_0 − p_1)(q_0 − q_1)|] · ||μ_1 − μ_0||_2.   (22)

Similarly, since |q_0 − q_1| = |p_0 − p_1|, we have:

||μ_u − μ_v||_2 = [1 + |p_0 − p_1| + |(p_0 − p_1)(q_0 − q_1)|] · ||μ_1 − μ_0||_2.   (23)

In our paper, we adopt the Euclidean distance as NLD:

d(u, v) = sqrt((p_0 − p_1)² + (q_0 − q_1)²).   (24)

Joining equations (23) and (24), we can rewrite the distance between distribution mean values as:

||μ_u − μ_v||_2 = [1 + d(u, v)/√2 + [d(u, v)]²/2] · ||μ_1 − μ_0||_2.   (25)

Likewise, the mean values of the hidden representation given by a 2-layer vanilla GCN are:

μ_u = p_1 (p_0 μ_0 + p_1 μ_1) + q_1 (p_1 μ_0 + p_0 μ_1)
    = μ_0 + p_1² (μ_1 − μ_0) + q_1 p_0 (μ_1 − μ_0),
μ_v = p_0 (p_0 μ_0 + p_1 μ_1) + q_0 (p_1 μ_0 + p_0 μ_1)
    = μ_0 + p_1 p_0 (μ_1 − μ_0) + q_0 p_0 (μ_1 − μ_0).   (26)

Hence we have the distance between them:

||μ_u − μ_v||_2 = p_1 · |p_1 − p_0| · ||μ_1 − μ_0||_2 − p_0 · |q_1 − q_0| · ||μ_1 − μ_0||_2.   (27)

Since |p_1 − p_0| = |1 − q_1 − (1 − q_0)| = |q_0 − q_1|,

||μ_u − μ_v||_2 = |(p_0 − p_1)(q_0 − q_1)| · ||μ_1 − μ_0||_2.   (28)

Joining Equations (24) and (28), we can rewrite the distance as:

||μ_u − μ_v||_2 = ([d(u, v)]² / 2) · ||μ_1 − μ_0||_2.   (29)

This finishes the proof.

A.2 Proof of Proposition 2
Note that G is the ego-graph of the node, hence 1 − h(G) is the ratio of inter-class edges, which is q_0 for negative nodes and p_1 for positive nodes. Hence we need to prove that E[f(y, L)] = |E| · (1 − h(G)) / N. This equation is proved in [23].

B REPRODUCIBILITY
In this section, some details for reproducibility are listed.


Table 4: Model Hyperparameters and their search ranges

Dataset YelpChi Amazon T-Finance T-Social


𝑎 for 𝑔1 (L) {0,1,2} {0,1,2} {0,1,2} {0,1,2}
𝑏 for 𝑔1 (L) {0,1,2} {0,1,2} {0,1,2} {0,1,2}
𝑎 for 𝑔2 (L) {0,1,2} {0,1,2} {0,1,2} {0,1,2}
𝑏 for 𝑔2 (L) {0,1,2} {0,1,2} {0,1,2} {0,1,2}
Learning Rate (lr) for 𝜃 {1e-3, 5e-3, 1e-2}
Learning Rate (lr) for Φ {1e-3, 5e-3, 1e-2}
Learning Rate (lr) for 𝜑 1 and 𝜑 2 {1e-3, 5e-3, 1e-2}
weight decay for Φ 1e-3
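One possible way to encode the search space of Table 4 for a simple grid search is sketched below; the dictionary keys and the itertools-based enumeration are illustrative conventions and not the released configuration format.

```python
from itertools import product

# Search ranges from Table 4 (shared across the four datasets).
SEARCH_SPACE = {
    "a_g1": [0, 1, 2],            # Bernstein exponent a for the filter g1(L)
    "b_g1": [0, 1, 2],            # Bernstein exponent b for g1(L)
    "a_g2": [0, 1, 2],
    "b_g2": [0, 1, 2],
    "lr_theta": [1e-3, 5e-3, 1e-2],
    "lr_Phi": [1e-3, 5e-3, 1e-2],
    "lr_phi": [1e-3, 5e-3, 1e-2],
}
WEIGHT_DECAY_PHI = 1e-3           # fixed, per Table 4

def grid(space):
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

print(sum(1 for _ in grid(SEARCH_SPACE)), "configurations")   # 3^4 * 3^3 = 2187
```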

(a) Amazon (b) T-Finance (c) T-Social

Figure 8: More training curves of BioGNN.

B.1 Model Hyperparameters
According to [51], a Bernstein polynomial-based filter is parameterized by a and b. The choices of a and b on each dataset are presented in Table 4. Note that in the design of the Bernstein basis, the sum of a and b is a constant (which is 2 in our paper). In addition, some basic learning hyperparameters are reported.

B.2 Datasets
The YelpChi dataset [42] focuses on detecting anomalous recommendations from Yelp.com. The Amazon dataset [41] includes product reviews under the Musical Instruments category from Amazon.com. Both of these datasets have three relations, hence we treat them as multi-relation graphs. T-Social and T-Finance [51] are two large-scale datasets released recently. The T-Finance dataset aims to detect anomalous accounts in a transaction network, where nodes are annotated as anomalies if they are likely involved in fraud, money laundering, or online gambling. The nodes are accounts with 10-dimensional features, whereas the edges connecting them denote that they have transaction records. The T-Social dataset aims to detect human-annotated anomaly accounts in a social network. The node annotations and features are the same as in T-Finance, whereas the edges connecting the nodes denote that they have maintained a friendship for more than 3 months.

C MORE EXPERIMENTAL RESULTS
In this section, we report the experimental results when the data has limited labels and anomalies. We also report the performance of some representative methods with standard deviations.

C.1 Limited Label and Anomalies
In the real-world case, the percentage of anomalies is usually quite low, often less than 5% or even 1%; also, human annotation is expensive, which leads to limited label information. Hence we have conducted additional experiments to address this concern and provide insights into the performance in such scenarios in Table 5.

Limited Label Information: With labeled data reduced to 1%, our proposed method still achieved good performance, demonstrating its ability to effectively leverage limited label information for accurate detection.

Small Percentage of Abnormal Nodes: When the dataset contains only a small percentage of abnormal nodes, our proposed method maintains a high level of F1 and AUC in detecting the abnormal instances, even amidst the imbalanced class distribution.

Small Percentage of Abnormal Nodes with Limited Label Information: In this case, the performance of the proposed method drops a little due to the inaccurate prediction of the NLD.

C.2 Performance with Standard Deviation
Please see Table 6.

D RELATED WORK
In this section, we introduce some static GAD networks and polynomial-based spectral GNNs.

D.1 Static Graph Anomaly Detection
On static attributed graphs, GNN-based semi-supervised learning methods are widely adopted. For example, GraphUCB [14] adopts contextual multi-armed bandit technology, and transforms graph anomaly detection into a decision-making problem.


Figure 9: The ego-graph of some yellow-circled ego nodes classified as high-frequency by BioGNN in Amazon.

Table 5: Performance with limited label information and/or a small percentage of abnormal nodes.

BWGNN(F1-Macro %) BWGNN(AUC %) BioGNN(F1-Macro %) BioGNN(AUC %)


Yelp (anomaly=5%, training=40%) 76.44 89.67 74.71 88.17
Yelp(anomaly=14.53%, training=1%) 67.02 79.65 67.12 80.20
Yelp (anomaly=5%, training=1%) 66.27 78.49 64.93 74.57
Amazon (anomaly=5%, training=40%) 91.20 96.55 90.98 96.60
Amazon (anomaly=6.87%, training=1%) 90.69 91.24 84.36 92.46
Amazon (anomaly=5%, training=5%) 89.69 94.20 86.90 93.60
TFinance (anomaly=4.58%, training=1%) 84.89 91.15 83.14 92.53
TSocial (anomaly=3.01%, training=1%) 75.93 88.06 83.07 93.75

Table 6: Performance with standard deviations.

Dataset    YelpChi                        Amazon                         T-Finance
Metric     F1-Macro       AUC             F1-Macro       AUC             F1-Macro       AUC
CAREGNN    0.5921±0.054   0.7617±0.018    0.8850±0.015   0.9092±0.021    0.7508±0.025   0.9161±0.006
PCGNN      0.6499±0.030   0.7985±0.011    0.8662±0.029   0.9571±0.020    0.5390±0.093   0.9162±0.039
BWGNN      0.7583±0.002   0.9011±0.004    0.9188±0.002   0.9724±0.002    0.8793±0.011   0.9517±0.008
GPRGNN     0.6005±0.030   0.7928±0.027    0.7253±0.067   0.9265±0.021    0.8283±0.022   0.9500±0.014
Ours       0.7632±0.006   0.8920±0.003    0.9368±0.009   0.9748±0.002    0.9047±0.001   0.9639±0.003

DCI [60] decouples representation learning and classification with a self-supervised learning task. Recent methods realize the necessity of leveraging multi-relation graphs in GAD. FdGars [57] and GraphConsis [35] construct a single homo-graph with multiple relations. Likewise, Semi-GNN [56], CARE-GNN [17], and PC-GNN [34] construct multiple homo-graphs based on node relations. In addition, some works discover that heterophily should be addressed properly in GAD. Semi-GNN employs hierarchical attention mechanisms for interpretable prediction, while, based on camouflage behaviors and imbalanced problems, CARE-GNN, PC-GNN, and AO-GNN [29] prune edges adaptively according to the neighbor distribution. GDN [25] and H2-FDetector [43] adopt different strategies for anomalies and normal nodes.

D.2 Graph Spectral Filtering
Spectral GNNs simulate filters with different passbands in the spectral domain, enabling GNNs to work on both homophilic and heterophilic graphs [36, 59]. GPRGNN [10] adaptively learns the Generalized PageRank weights, regardless of whether the node labels are homophilic or heterophilic. FAGCN [3] adaptively fuses different signals in the process of message passing by employing a self-gating mechanism. BernNet [27] expresses the filtering operation with Bernstein polynomials. BWGNN [51] designs a band-pass filter to aggregate different frequency signals simultaneously. AMNet [4] aims to capture both low-frequency and high-frequency signals, and adaptively combines signals of different frequencies. GHRN [23] designs an edge indicator to distinguish homophilous and heterophilous edges. GBK-GNN [18] utilizes two kernels to aggregate homophily and heterophily neighbors, respectively. These methods can alleviate the heterophily problem; however, they train all the nodes as a whole, which suffers from loss rivalry.

D.3 Graph Sanitation
Graph sanitation aims to learn a modified graph G to boost the performance of the corresponding mining model; that is, these methods assume a converged classifier and modify the graph topology to fit this model [15]. GASOLINE [65] formulates the graph sanitation problem as a bi-level optimization problem, and further instantiates it with semi-supervised node classification. Pro-GNN [30] jointly learns a structural graph and a robust graph neural network model from the perturbed graph. IDGL [7] iteratively learns a better graph structure based on better node embeddings, and learns better node embeddings based on a better graph structure. PTDNet [37] prunes task-irrelevant edges by penalizing the number of edges in the sparsified graph with parameterized networks. Ada-UGNN [40] solves a graph denoising problem with a smoothness assumption, and handles graphs with adaptive smoothness across nodes. Although there are similarities between BioGNN and graph sanitation methods, BioGNN is trained from scratch without a converged classifier.