Graph Diffusion Models for Anomaly Detection
Zekuan Liu, Huijun Yu, Yao Yan, Ziqing Hu, Pankaj Rajak, Amila Weerasinghe,
Olcay Boz, Deepayan Chakrabarti, Fei Wang
{lzekuan,huijuyu,ynyao,ziqinghu,rajakpan,weera,olcayboz,deepayc,fiwan}@amazon.com
Amazon
Seattle, Washington, USA
generator generates graph structure and node features simultaneously in a multitask fashion. It offers a unique capability to generate positive (anomaly) examples, thus effectively alleviating the problem of label imbalance. It is also able to generate heterogeneous graphs with a heterogeneous generator. By adding the generated positive examples to the training set, it provides a relatively balanced dataset for training downstream anomaly detection and alleviates the label imbalance problem in anomaly detection on graphs.
The contributions of this work can be summarized as follows:
• We propose a diffusion model based graph generator capable of generating heterogeneous graph structure and node features simultaneously, conditioned on the positive label.
• We further alleviate the label imbalance problem in a general graph anomaly detection framework.
• Experimental results on multiple datasets show that our generator outperforms baselines, and an ablation study demonstrates the effectiveness of each component of our generator.
The remainder of the paper is organized as follows: after this introduction, we review previous research and then present the methodology, featuring the architecture of our diffusion model based graph generator and its components. This is followed by empirical evaluations, including ablation studies to assess the effectiveness of each component and case studies for an in-depth understanding of the generated features. Finally, we summarize our contributions and explore avenues for future research.
2 RELATED WORKS
Anomaly Detection on Graphs has emerged as a critical area of research, distinguished by unique challenges associated with the complex nature of graph structures. Early methods focused on clustering and proximity-based techniques, which were adept at handling anomalies in simpler graphs but faltered when applied to the intricate relationships inherent in complex graph data [15]. The introduction of GNNs marked a significant shift, offering a more nuanced approach to modeling relational data. GNNs, through their capacity to encapsulate both node-level and graph-wide patterns, have demonstrated considerable success in detecting anomalies across a range of applications, from fake news detection to anti-money laundering [1, 8, 9, 14]. However, the label imbalance challenge persists, particularly in adapting these models to heterogeneous graphs [17], where the graph contains different types of nodes and edges. The problem is even more complicated if the graph is dynamic.
Graph Generators have gained attention as a potent tool for addressing the data scarcity and synthetic data generation challenges in graph-based applications. Conventional methods adopt an autoregressive paradigm to generate graphs [16]. Recently, the development of generative models, particularly diffusion models, has been a noteworthy advancement. Originating in the domain of image synthesis, these models have demonstrated a unique capability to capture complex data distributions through iterative noising and denoising processes [3]. Their application to graph data is relatively new but promising, offering a method to generate realistic graph structures and node features that mirror the intricacies of real-world graphs [13]. This capability is particularly valuable in exploring the structural variability of graphs and understanding how anomalies manifest in different graph contexts.
Label Imbalance in anomaly detection presents a significant hurdle, particularly in graph data, where anomalies are naturally rare and often overshadowed by the majority of normal instances. This imbalance leads to models that are biased towards the majority class, thereby diminishing their effectiveness in identifying true anomalies. Some works adopt interpolation-based methods to address label imbalance [18], and others use data augmentation [7]. The integration of graph generators offers a novel solution to this problem. By generating synthetic anomalies, these models can enrich datasets, creating a more balanced landscape for model training and evaluation.
been a noteworthy advancement. Originating in the domain of im- present in traditional tabular data. Therefore, the design of the
age synthesis, these models have demonstrated a unique capability graph autoencoder needs to address these unique characteristics.
in capturing complex data distributions through iterative noising
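For illustration, the following minimal Python sketch shows how a single generated anomaly node could bundle these four components. The field names and example values are hypothetical and do not reflect the actual schema of our datasets.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SyntheticAnomalyNode:
    """One generated node with the four components listed above.

    Field names are illustrative; the concrete schema depends on the dataset.
    """
    features: List[float]      # Node Feature: attribute vector of the node
    timestamp: float           # Node Timestamp: continuous temporal stamp
    label: int = 1             # Positive Label: 1 marks an anomaly (fraudulent) node
    # Heterogeneous Graph Structure: neighbors grouped by (src_type, relation, dst_type).
    edges: Dict[Tuple[str, str, str], List[int]] = field(default_factory=dict)

# Example: a synthetic fraudulent "user" node connected to two "device" nodes.
node = SyntheticAnomalyNode(
    features=[0.3, 1.7, -0.2],
    timestamp=1_700_000_000.0,
    edges={("user", "uses", "device"): [17, 42]},
)
```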
3.2 Encoding with Graph Autoencoder
In order to encode our anomaly detection graphs into a latent space for the diffusion model, we adopt a graph autoencoder. Graph data inherently possesses a complex and multifaceted nature, including node attributes and graph topology, so a specialized design is imperative. This complexity arises from the interconnectedness and the structural dependencies within the graph, which are not present in traditional tabular data. The design of the graph autoencoder therefore needs to address these unique characteristics.
Figure 1: An overview of the proposed diffusion model based graph generator. We first encode the input graph into a latent space with a GNN encoder. The diffusion process gradually adds noise to the embedding over $T$ steps, while an MLP denoises the embedding $T$ times. Finally, the decoder reconstructs the graph from the latent embedding.
3.2.1 Generating Node Attributes. We first apply a traditional autoencoder to handle node attribute generation. An autoencoder is a type of artificial neural network used to learn efficient representations (encodings) of data, typically for the purpose of dimensionality reduction. Its architecture is characterized by a two-part structure: the encoder and the decoder. The encoder is responsible for transforming the input data into a lower-dimensional latent space. This process involves a series of layers that gradually compress the input data, extracting and retaining the most salient features. The encoder's primary objective is to learn a compressed latent representation of the input that encapsulates its essential characteristics, reducing its dimensionality while preserving its significant attributes.
Following the encoding process, the decoder reconstructs the input data from the condensed representation in the latent space. The goal of the decoder is to generate an output that closely approximates the original input, using the compressed information encoded by the encoder. This reconstruction process is crucial, as it ensures that the learned representations in the latent space are meaningful and informative, capturing the intrinsic patterns and structures of the input data.
Training an autoencoder involves adjusting the weights of the network to minimize the difference between the original input and its reconstruction, typically using a loss function such as mean squared error (MSE). Through this process, the autoencoder learns to prioritize the most significant features of the input data, effectively learning a compressed but informative representation.
As an enhanced variant of the traditional autoencoder, the Variational Autoencoder (VAE) introduces a probabilistic approach that improves generalization. Specifically, in a VAE, the encoder predicts the mean (μ) and standard deviation (σ) of a Gaussian distribution over the latent variables. During the forward pass, a sample is drawn from the distribution defined by these parameters and is then decoded to reconstruct the input data. The VAE loss function is the sum of two terms: a reconstruction term, the MSE between the input and the reconstructed output, and a regularization term, the KL divergence between the learned distribution and a standard Gaussian distribution.
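As a concrete illustration of the encoder–decoder structure and the two-term loss described above, the following is a minimal PyTorch sketch of a VAE over node attribute vectors. Layer sizes and module names are illustrative, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of the latent Gaussian
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, in_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x)  # reconstruction term (MSE)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to a standard Gaussian
    return recon + kl
```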
3.2.2 Conditional Generation for Only Nodes with Positive Label. In our approach, we aim to enhance control over the generation of node types, particularly differentiating between anomalous and normal nodes. To achieve this, we incorporate conditional variables into the VAE. This enables the model to factor in the additional label information during the generation process.
During the training phase of the VAE, we concatenate the labels indicating the node type (anomalous or normal) with their respective feature vectors. This concatenated data is then fed into the encoder of the VAE. The encoder thus learns a latent representation that encapsulates not only the features of the nodes but also their corresponding labels.
In the generation phase, the decoder of the VAE is explicitly conditioned on these labels. This conditioning is pivotal as it directs the decoder to generate feature vectors that are inherently aligned with the specified labels. Consequently, when the label indicates 'fraudulent', the decoder is steered to produce feature vectors that are characteristic of fraudulent nodes.
This methodology allows for a more controlled generation of nodes, enhancing the model's ability to differentiate and generate distinct types of nodes based on their underlying characteristics. Such an approach is particularly beneficial in scenarios like fraud detection, where the distinction between normal and anomalous behavior is crucial for effective model performance.
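A minimal sketch of this conditioning, assuming the VAE structure above and a binary (normal/anomalous) label: the label is concatenated with the feature vector at the encoder input and with the latent code at the decoder input, and generation conditions on the positive label.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Label-conditioned VAE sketch; all dimensions are illustrative."""
    def __init__(self, in_dim, label_dim=1, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim + label_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + label_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, x, y):
        # Training: the label is concatenated with the feature vector before encoding.
        h = self.encoder(torch.cat([x, y], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # The decoder is explicitly conditioned on the same label.
        return self.decoder(torch.cat([z, y], dim=-1)), mu, logvar

    @torch.no_grad()
    def generate_anomalies(self, n):
        # Generation: sample latent codes and condition on the positive (anomaly) label.
        z = torch.randn(n, self.latent_dim)
        y = torch.ones(n, 1)  # label 1 = fraudulent / anomalous
        return self.decoder(torch.cat([z, y], dim=-1))
```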
3.2.3 Generating Heterogeneous Graph Structure. To generate the graph structure, we utilize the Variational Graph Autoencoder (VGAE) [6]. This approach adapts the architecture of a VAE to graph data, enabling the encoder to capture both topological and feature information of the graph in a latent representation, while the decoder focuses on reconstructing the graph structure. Our model is specifically tailored for multi-task learning, leveraging a shared GNN as the encoder. This encoder is central to our approach, as it processes the graph structure and node features simultaneously, embedding them into a latent space.
The decoder is bifurcated into two separate entities, each with a distinct role. The first decoder is dedicated to feature reconstruction, while the second focuses on the graph structure. This bifurcation is pivotal in addressing the distinct aspects of our graph data, namely the node features and the graph topology. The VGAE model is inherently aligned with the framework of a traditional VAE, where the loss function comprises two components: the reconstruction loss and the Kullback-Leibler (KL) divergence. These components collectively cater to both the feature and graph structure generation tasks within our model.
However, our graph is inherently heterogeneous, encompassing multiple types of nodes and edges. This complexity necessitates a modification of the VGAE framework. Consequently, we substitute the standard VGAE encoder with a Heterogeneous Graph Transformer (HGT) [4]. The HGT is adept at handling the diverse and complex nature of our graph, allowing for a more nuanced understanding and processing of the heterogeneous elements.
Additionally, we implement separate decoders for the different types of edges. This differentiation is essential for accurately generating adjacency matrices specific to each edge type in our heterogeneous graph. The use of distinct decoders enables us to tailor the reconstruction process for each edge type, thereby enhancing the fidelity of the reconstructed heterogeneous graph.
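The sketch below illustrates this design, assuming PyTorch Geometric's HGTConv layer (exact constructor arguments can differ across PyG versions): a shared HGT encoder embeds every node type into the latent space, and an inner-product decoder is applied separately per edge type to reconstruct its adjacency matrix.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import HGTConv  # heterogeneous graph transformer layer

class HeteroGraphEncoder(nn.Module):
    """Sketch of a shared HGT encoder producing per-node-type latent distributions."""
    def __init__(self, in_dims, metadata, hidden_dim=64, latent_dim=32, heads=2):
        super().__init__()
        # metadata = (node_types, edge_types), as used by PyTorch Geometric.
        self.conv = HGTConv(in_dims, hidden_dim, metadata, heads)
        self.lin_mu = nn.ModuleDict({nt: nn.Linear(hidden_dim, latent_dim) for nt in metadata[0]})
        self.lin_logvar = nn.ModuleDict({nt: nn.Linear(hidden_dim, latent_dim) for nt in metadata[0]})

    def forward(self, x_dict, edge_index_dict):
        h = self.conv(x_dict, edge_index_dict)
        mu = {nt: self.lin_mu[nt](h_nt) for nt, h_nt in h.items()}
        logvar = {nt: self.lin_logvar[nt](h_nt) for nt, h_nt in h.items()}
        return mu, logvar

def decode_edge_type(z_src, z_dst):
    """Separate inner-product decoder per edge type: edge probabilities between all pairs."""
    return torch.sigmoid(z_src @ z_dst.t())
```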
3.2.4 Timestamp Generation. A dedicated temporal generator is designed and deployed specifically for generating timestamps. This component parallels the architecture of the feature generator in its foundational structure but diverges in its targeted output, which is one-dimensional. The primary output of this temporal generator is a continuous variable representing the time aspect of the data, encapsulating a critical dimension in temporal data analysis.
To ensure the precision and reliability of the temporal generator, a Mean Squared Error (MSE) loss function is employed. This loss function is well suited to regression tasks, where the goal is to minimize the average squared difference between the estimated and actual values. In the context of this work, the MSE loss fine-tunes the temporal generator to produce timestamps that closely align with the true temporal characteristics of the dataset.
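A minimal sketch of such a temporal head, assuming it decodes the timestamp from the same latent code used by the feature generator; the layer sizes are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class TimestampHead(nn.Module):
    """Small MLP mapping a latent code to a single continuous timestamp."""
    def __init__(self, latent_dim=32, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one-dimensional output: the timestamp
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)

# Trained with a regression objective, e.g.
# loss = F.mse_loss(timestamp_head(z), true_timestamps)
```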
3.3 Diffusion Model in Latent Space
With the graph autoencoder described above, we are able to generate synthetic anomalies. However, due to the complexity of the graph structure, the divergence between the prior distribution and the data distribution can be large, resulting in suboptimal performance in the downstream anomaly detection. We therefore further apply a diffusion model in the latent space to improve generation quality.
In this paper, we adopt DDPM [3]. It is mathematically formulated as
\[ p_\theta(\mathbf{z}_0) = \int p_\theta(\mathbf{z}_{0:T}) \, d\mathbf{z}_{1:T}, \tag{1} \]
where $\mathbf{z}_1, \ldots, \mathbf{z}_T$ are latent variables of the same dimensionality as the data $\mathbf{z}_0 \sim q(\mathbf{z}_0)$. The joint distribution $p_\theta(\mathbf{z}_{0:T})$ is known as the reverse process. This process is a Markov chain that starts from a standard Gaussian distribution $p(\mathbf{z}_T) = \mathcal{N}(\mathbf{z}_T; \mathbf{0}, \mathbf{I})$ and evolves with learned Gaussian transitions, defined as
\[ p_\theta(\mathbf{z}_{0:T}) = p(\mathbf{z}_T) \prod_{t=1}^{T} p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t), \tag{2} \]
\[ p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) = \mathcal{N}\big(\mathbf{z}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{z}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{z}_t, t)\big). \tag{3} \]
A distinct characteristic of diffusion models is the forward diffusion process, an approximate posterior $q(\mathbf{z}_{1:T} \mid \mathbf{z}_0)$. This process is a fixed Markov chain that incrementally adds Gaussian noise to the data according to a variance schedule $\beta_1, \ldots, \beta_T$:
\[ q(\mathbf{z}_{1:T} \mid \mathbf{z}_0) = \prod_{t=1}^{T} q(\mathbf{z}_t \mid \mathbf{z}_{t-1}), \tag{4} \]
\[ q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\big(\mathbf{z}_t; \sqrt{1-\beta_t}\, \mathbf{z}_{t-1}, \beta_t \mathbf{I}\big). \tag{5} \]
Training diffusion models involves optimizing the variational bound on the negative log-likelihood:
\[ L = \mathbb{E}[-\log p_\theta(\mathbf{z}_0)] \le \mathbb{E}_q\Big[ -\log p(\mathbf{z}_T) - \sum_{t \ge 1} \log \frac{p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)}{q(\mathbf{z}_t \mid \mathbf{z}_{t-1})} \Big]. \tag{6} \]
The variance parameters $\beta_t$ of the forward process can either be learned through reparameterization or set as fixed hyperparameters. The reverse process remains expressive, in part because of the choice of Gaussian conditionals in $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$, especially when the $\beta_t$ values are small. A notable property of the forward process is that $\mathbf{z}_t$ can be sampled at any timestep $t$ in closed form. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the sampling is
\[ q(\mathbf{z}_t \mid \mathbf{z}_0) = \mathcal{N}\big(\mathbf{z}_t; \sqrt{\bar{\alpha}_t}\, \mathbf{z}_0, (1-\bar{\alpha}_t) \mathbf{I}\big). \tag{7} \]
Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction, obtained by rewriting $L$ (from Eq. 6) as
\[ \mathbb{E}_q\Big[ \underbrace{\mathrm{KL}\big(q(\mathbf{z}_T \mid \mathbf{z}_0)\,\|\,p(\mathbf{z}_T)\big)}_{L_T} + \sum_{t>1} \underbrace{\mathrm{KL}\big(q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{z}_0)\,\|\,p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)\big)}_{L_{t-1}} \underbrace{\, - \log p_\theta(\mathbf{z}_0 \mid \mathbf{z}_1)}_{L_0} \Big]. \tag{8} \]
Equation 8 uses the Kullback-Leibler (KL) divergence to compare $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ against the tractable forward-process posteriors conditioned on $\mathbf{z}_0$:
\[ q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{z}_0) = \mathcal{N}\big(\mathbf{z}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{z}_t, \mathbf{z}_0), \tilde{\beta}_t \mathbf{I}\big), \tag{9} \]
where $\tilde{\boldsymbol{\mu}}_t(\mathbf{z}_t, \mathbf{z}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t} \mathbf{z}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} \mathbf{z}_t$ and $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$.
Since all KL divergences in Equation 8 compare Gaussians, they can be computed in a Rao-Blackwellized fashion with closed-form expressions instead of high-variance Monte Carlo estimates.
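The following is a minimal sketch of this latent diffusion step: the closed-form forward sampling of Eq. (7) and a single training term, using the common noise-prediction simplification of the bound in Eqs. (6) and (8). The MLP denoiser, the schedule, and the crude scalar timestep embedding are illustrative, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # fixed variance schedule beta_1..beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(z0, t, noise):
    """Closed-form forward sampling q(z_t | z_0) from Eq. (7)."""
    ab = alpha_bars[t].unsqueeze(-1)
    return torch.sqrt(ab) * z0 + torch.sqrt(1.0 - ab) * noise

class DenoiseMLP(nn.Module):
    """MLP denoiser operating on the latent codes produced by the graph autoencoder."""
    def __init__(self, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, zt, t):
        t_emb = (t.float() / T).unsqueeze(-1)  # scalar timestep embedding
        return self.net(torch.cat([zt, t_emb], dim=-1))

def training_step(model, z0):
    """One stochastic term of L, in the noise-prediction form of the DDPM objective."""
    t = torch.randint(0, T, (z0.size(0),))
    noise = torch.randn_like(z0)
    zt = q_sample(z0, t, noise)
    return F.mse_loss(model(zt, t), noise)
```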
4 EXPERIMENTS
4.1 Experimental Setup
To comprehensively evaluate the performance of our proposed graph generator, we compared it against baseline methods, including simple reweighting and a graph autoencoder (GAE) alone, i.e., without the diffusion model. The baselines are built on top of various GNN backbones, including GCN [5], GraphSAGE [2], GAT [12], and Graph Transformer (GT) [10]. We opted for three metrics that are robust to label imbalance: Area Under the Receiver Operating Characteristic Curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), and Recall@$k$, where $k$ is the number of anomalies in the ground truth labels.
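These metrics can be computed as in the sketch below, with hypothetical scores; average precision is used here as the usual estimator of AUPRC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def recall_at_k(y_true, scores):
    """Recall@k, with k set to the number of anomalies in the ground truth."""
    y_true = np.asarray(y_true)
    k = int(y_true.sum())
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest anomaly scores
    return float(y_true[top_k].sum()) / k

# Hypothetical anomaly scores produced by a detector on six nodes (two true anomalies).
y_true = np.array([0, 0, 1, 0, 1, 0])
scores = np.array([0.10, 0.40, 0.90, 0.20, 0.35, 0.05])
auroc = roc_auc_score(y_true, scores)            # AUROC
auprc = average_precision_score(y_true, scores)  # AUPRC
rec_k = recall_at_k(y_true, scores)              # Recall@k
```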
4.2 Datasets
We follow previous work [11] and conducted experiments on five

by our method but also qualitatively underscores the necessity of the diffusion model in our graph generator.

4.4 Case Study
[Figure: histogram (y-axis: Frequency) comparing Real, Diffusion, and GAE samples.]
REFERENCES
[1] Yingtong Dou, Kai Shu, Congying Xia, Philip S Yu, and Lichao Sun. 2021. User preference-aware fake news detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2051–2055.
[2] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017).
[3] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[4] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020. 2704–2710.
[5] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[6] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[7] Fanzhen Liu, Xiaoxiao Ma, Jia Wu, Jian Yang, Shan Xue, Amin Beheshti, Chuan Zhou, Hao Peng, Quan Z Sheng, and Charu C Aggarwal. 2022. DAGAD: Data augmentation for graph anomaly detection. In 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 259–268.
[8] Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang, Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, et al. 2022. BOND: Benchmarking unsupervised outlier node detection on static attributed graphs. Advances in Neural Information Processing Systems 35 (2022), 27021–27035.
[9] Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang, Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, et al. 2022. PyGOD: A Python library for graph outlier detection. arXiv preprint arXiv:2204.12095 (2022).
[10] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. 2020. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509 (2020).
[11] Jianheng Tang, Fengrui Hua, Ziqi Gao, Peilin Zhao, and Jia Li. 2023. GADBench: Revisiting and benchmarking supervised graph anomaly detection. arXiv preprint arXiv:2306.12251 (2023).
[12] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[13] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. 2022. DiGress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734 (2022).
[14] Mark Weber, Giacomo Domeniconi, Jie Chen, Daniel Karl I Weidele, Claudio Bellei, Tom Robinson, and Charles E Leiserson. 2019. Anti-money laundering in Bitcoin: Experimenting with graph convolutional networks for financial forensics. arXiv preprint arXiv:1908.02591 (2019).
[15] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas AJ Schweiger. 2007. SCAN: A structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 824–833.
[16] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. 2018. GraphRNN: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning. PMLR, 5708–5717.
[17] Jianan Zhao, Xiao Wang, Chuan Shi, Zekuan Liu, and Yanfang Ye. 2020. Network schema preserving heterogeneous information network embedding. In International Joint Conference on Artificial Intelligence (IJCAI).
[18] Tianxiang Zhao, Xiang Zhang, and Suhang Wang. 2021. GraphSMOTE: Imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 833–841.