
CLASS IMBALANCED LEARNING ON GRAPHS

A Major Project Report Submitted


By

MOHAMMED ASIM (1604-20-737-047)


ADIL AWAIS KHAN (1604-20-737-301)
MOHAMMED ABDUL MUQTADIR (1604-20-737-303)

DEPARTMENT OF INFORMATION TECHNOLOGY

MUFFAKHAM JAH COLLEGE OF ENGINEERING &


TECHNOLOGY
(Affiliated to Osmania University)
Hyderabad
2023
CLASS IMBALANCED LEARNING ON GRAPHS

A Major Project report submitted in partial fulfillment of the requirements for the
award of the Degree of B.E.
By

MOHAMMED ASIM (1604-20-737-047)


ADIL AWAIS KHAN (1604-20-737-301)
MOHAMMED ABDUL MUQTADIR (1604-20-737-303)

DEPARTMENT OF INFORMATION TECHNOLOGY

MUFFAKHAM JAH COLLEGE OF ENGINEERING &


TECHNOLOGY
(Affiliated to Osmania University)
Hyderabad
ABSTRACT
Class imbalance is a common challenge in machine learning applications, and its impact
extends to graph-based datasets where nodes represent entities and edges denote relationships. This
project explores the realm of Class-Imbalanced Learning on Graphs and proposes a novel data
interpolation technique called GraphSMOTE to tackle the imbalance issue. The proposed method
leverages Synthetic Minority Over-sampling Technique (SMOTE) principles adapted to the graph
domain.

GraphSMOTE focuses on oversampling the minority class in graph-based datasets, enhancing the
model's ability to discern patterns and make informed predictions for under-represented classes.
The method intelligently generates synthetic instances by considering the local structure of the
graph, preserving the inherent relationships and dependencies between nodes. This approach not
only mitigates class imbalance but also enhances the model's generalization capabilities on
imbalanced graph data.

In this research, we delve into the existing landscape of class-imbalance mitigation techniques in
graph-based learning. We provide a comprehensive overview of related work, highlighting the
challenges and opportunities associated with handling class imbalance in graph-structured data.
Furthermore, we present a detailed explanation of the GraphSMOTE methodology, illustrating its
application and effectiveness in rebalancing class distributions. Experimental results on benchmark
graph datasets demonstrate the efficacy of GraphSMOTE in improving classification performance
on imbalanced graphs. The proposed technique exhibits promising results in terms of precision,
recall, and F1-score, showcasing its potential to address class imbalance issues prevalent in graph-
based machine learning scenarios.

This research contributes to the growing body of knowledge on handling class imbalance in graph
data, offering a valuable tool in the form of GraphSMOTE for researchers and practitioners seeking
robust solutions to improve the performance of machine learning models on imbalanced graph data.
TABLE OF CONTENTS

S.NO Topics PAGE.NO

1 Introduction 1

2 Problem Statement 1

3 Background 2

4 Proposed System 3

5 References 7

LIST OF FIGURES
FIGURE.NO FIGURE NAME PAGE.NO

1 An Example 1

2 Framework 3
1 INTRODUCTION
Graphs are a prevalent and powerful data structure for representing complex relational
systems, such as social networks, citation networks, and knowledge graphs.
In these systems, nodes symbolize entities, while edges denote their relationships. In recent
years, graph representation learning techniques have proven effective in discovering
meaningful vector representations of nodes, edges, or entire graphs, resulting in successful
applications across a wide range of downstream tasks. However, graph data often presents a
significant challenge in the form of class imbalance.
Class imbalance is a situation that occurs when the number of instances in each class of a
dataset is not equal. In other words, some classes have significantly more instances than
others.
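To make the definition above concrete, the following is a minimal sketch of how class imbalance can be measured on a node-classification dataset. The label array and class counts are purely illustrative assumptions, not taken from any dataset used in this project.

```python
# Toy label array for a hypothetical node-classification dataset;
# the counts illustrate what "class imbalance" means in practice.
import numpy as np

labels = np.array([0] * 90 + [1] * 8 + [2] * 2)  # class 0 dominates

classes, counts = np.unique(labels, return_counts=True)
imbalance_ratio = counts.min() / counts.max()    # 1.0 means perfectly balanced

print(dict(zip(classes.tolist(), counts.tolist())))  # {0: 90, 1: 8, 2: 2}
print(f"imbalance ratio: {imbalance_ratio:.3f}")     # 0.022
```

A ratio far below 1 signals that a classifier trained naively will be biased toward the majority class.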

2. Problem Statement
In this work, we focus on the semi-supervised node classification task on graphs, in the
transductive setting. As shown in Figure 1, we have a large network of entities, with some
labeled for training. Both training and testing are performed on this same graph. Each entity
belongs to one class, and the distribution of class sizes is imbalanced. This problem has
many practical applications. For example, homophily in social networks results in the
under-representation of minority groups; malicious behavior and fake user accounts on
social networks are outnumbered by normal ones; and in linked web pages of a knowledge
base, material for some topics is limited.

1
3. BACKGROUND
Graph-structured data is ubiquitous in various domains, ranging from social networks and
biological networks to citation networks and recommendation systems. The inherent
connectivity and dependencies between entities in such graphs provide a rich source of
information for machine learning tasks. However, when it comes to classification on graph
data, a common challenge arises—class imbalance. Class imbalance occurs when the
number of instances belonging to different classes is significantly uneven, leading to
suboptimal model performance, especially for minority classes.

Dealing with class imbalance in graph-based machine learning is crucial for ensuring that
models can effectively recognize and classify under-represented classes. Traditional
techniques designed for tabular data might not directly translate to the graph domain due to
the unique structural characteristics of graphs. As such, there is a need for specialized
methods to address class imbalance in the context of graph-structured data. The Class-
Imbalanced Learning on Graphs (CILG) paradigm has garnered attention as researchers and
practitioners strive to adapt existing imbalance mitigation techniques to the graph domain.

Synthetic Minority Over-sampling Technique (SMOTE), a well-established oversampling


technique in the context of tabular data, serves as a promising starting point for addressing
class imbalance in graphs. Motivated by the need for effective class-imbalance handling on
graphs, this project introduces GraphSMOTE—a novel data interpolation technique
designed to alleviate class imbalance by oversampling the minority class in graph-
structured datasets.
By extending the principles of SMOTE to the graph context, GraphSMOTE aims to
generate synthetic instances that not only balance class distributions but also preserve the
local connectivity patterns inherent in graph data. This project positions itself within the
broader landscape of machine learning on graphs and class-imbalance handling. It builds
upon existing research in both imbalance mitigation strategies and graph-based learning,
contributing a specialized technique that addresses the unique challenges presented by class
imbalance in graph-structured datasets.

In the subsequent sections of this project report, we will conduct a comprehensive survey
of existing literature, outlining the current state-of-the-art methods for handling class
imbalance on graphs. The report will then delve into the details of the proposed
GraphSMOTE methodology, presenting experimental results and insights gained from
applying the technique to benchmark graph datasets. Through this research, we aim to offer
a valuable contribution to the evolving field of Class-Imbalanced Learning on Graphs.

2
4. PROPOSED SYSTEM
The proposed system consists of the following:
(i) Feature Extractor
(ii) Synthetic Node Generator
(iii) Edge Generator
(iv) GNN-based Classifier

Fig. 2. Framework.

GraphSMOTE is composed of four components: (i) a GNN-based feature extractor
(encoder) which learns node representations that preserve node attributes and graph
topology to facilitate synthetic node generation; (ii) a synthetic node generator which
generates synthetic minority nodes in the latent space; (iii) an edge generator which
generates links for the synthetic nodes to form an augmented graph with balanced classes;
and (iv) a GNN-based classifier which performs node classification on the augmented
graph. Next, we give the details of each component.

3
4.1 Feature Extractor
One way to generate synthetic minority nodes is to directly apply SMOTE on the raw node
feature space. However, this causes several problems: (i) the raw feature space could be
sparse and high-dimensional, which makes it difficult to find two similar nodes of the same
class for interpolation; and (ii) it doesn't consider the graph structure, which can result in
sub-optimal synthetic nodes. Thus, instead of directly performing synthetic minority
over-sampling in the raw feature space, we introduce a feature extractor to learn node
representations that can simultaneously capture node properties and graph topology.
Generally, the node representations should reflect the inter-class and intra-class relations of
samples: similar samples should be closer to each other, and dissimilar samples should be
more distant. In this way, when interpolating a minority node with its nearest neighbor, the
obtained embedding has a higher probability of representing a new sample belonging to the
same minority class. In graphs, the similarity of nodes needs to consider node attributes and
node labels, as well as local graph structures. Hence, we implement the extractor with a
GNN and train it on two down-stream tasks, edge prediction and node classification. The
feature extractor can be implemented using any kind of GNN. In this work, we choose
GraphSage as the backbone model structure because it is effective in learning from various
types of local topology and generalizes well to new structures. It has been observed that
too-deep GNNs often lead to sub-optimal performance, as a result of over-smoothing and
over-fitting. Therefore, we adopt only one GraphSage block as the feature extractor. Inside
this block, the message passing and fusing process can be written as:

h¹_v = σ(W¹ · CONCAT(F[v, :], F · A[:, v]))

where F represents the input node attribute matrix and F[v, :] represents the attributes of
node v. A[:, v] is the v-th column of the adjacency matrix, and h¹_v is the obtained
embedding for node v. W¹ is a weight parameter, and σ refers to an activation function
such as ReLU.
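The message passing above can be sketched in a few lines of numpy. This is a toy illustration of one GraphSage-style block, not the trained extractor: all shapes, the random weights, and the toy adjacency matrix are assumptions made purely for demonstration.

```python
# Minimal numpy sketch of h1_v = sigma(W1 . CONCAT(F[v,:], F . A[:,v]))
# computed for all nodes at once. Shapes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 4, 3                              # nodes, input dim, embedding dim
F = rng.normal(size=(n, d))                    # node attribute matrix
A = (rng.random((n, n)) > 0.5).astype(float)   # toy adjacency matrix
W1 = rng.normal(size=(2 * d, h))               # weight for the concatenated input

def relu(x):
    return np.maximum(x, 0.0)

# Row v of A.T @ F is the neighbor aggregation F^T . A[:, v] for node v.
neighbor_agg = A.T @ F
H1 = relu(np.concatenate([F, neighbor_agg], axis=1) @ W1)  # one row per node
print(H1.shape)  # (5, 3)
```

Stacking the per-node computation into matrix form like this is what makes a single GraphSage block cheap to evaluate on the whole graph.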
4

4.2 Synthetic Node Generation


After obtaining the representation of each node in the embedding space constructed by the
feature extractor, we can perform over-sampling on top of it. We seek to generate the
expected representations for new samples from the minority classes. In this work, to
perform over-sampling, we adopt the widely used SMOTE algorithm, which augments
vanilla over-sampling by changing repetition to interpolation. We choose it due to its
popularity, but our framework can cope with other over-sampling approaches as well. The
basic idea of SMOTE is to perform interpolation on samples from the target minority class
with their nearest neighbors in the embedding space that belong to the same class. Let h¹_v
be a labeled minority node with label Y_v. The first step is to find the closest labeled node
of the same class as h¹_v, i.e.,

nn(v) = argmin_u ‖h¹_u − h¹_v‖,  s.t.  Y_u = Y_v    (3)

nn(v) refers to the nearest neighbor of v from the same class, measured using Euclidean
distance in the embedding space. With the nearest neighbor, we can generate a synthetic
node as

h¹_{v′} = (1 − δ) · h¹_v + δ · h¹_{nn(v)},    (4)

where δ is a random variable following a uniform distribution in the range [0, 1]. Since
h¹_v and h¹_{nn(v)} belong to the same class and are very close to each other, the
generated synthetic node h¹_{v′} should also belong to the same class. In this way, we can
obtain labeled synthetic nodes. For each minority class, we apply SMOTE to generate
synthetic nodes. We use a hyper-parameter, the over-sampling scale, to control the number
of samples to be generated for each class. Through this generation process, we can make
the distribution of class sizes more balanced, and hence make the trained classifier perform
better on those initially under-represented classes.
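Equations (3) and (4) can be sketched directly in numpy. The embeddings, labels, and the chosen minority node below are toy assumptions for illustration only; the real embeddings would come from the feature extractor.

```python
# Sketch of SMOTE-style interpolation in the embedding space (Eqs. 3-4).
import numpy as np

rng = np.random.default_rng(1)
H1 = rng.normal(size=(6, 3))            # toy node embeddings from the extractor
Y = np.array([0, 0, 0, 0, 1, 1])        # class 1 is the minority class

def smote_node(v, H, y):
    """Interpolate node v with its nearest same-class neighbor (Euclidean)."""
    same = np.where((y == y[v]) & (np.arange(len(y)) != v))[0]   # Eq. (3) candidates
    nn = same[np.argmin(np.linalg.norm(H[same] - H[v], axis=1))]
    delta = rng.uniform(0.0, 1.0)       # Eq. (4): delta ~ Uniform[0, 1]
    return (1 - delta) * H[v] + delta * H[nn]

synthetic = smote_node(4, H1, Y)        # new minority-class embedding
print(synthetic.shape)                  # (3,)
```

Because the new embedding lies on the segment between two same-class points, its label can be inherited directly, which is what makes the synthetic nodes usable as extra labeled training data.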

4.3 Edge Generator


Now we have generated synthetic nodes to balance the class distribution. However, these
nodes are isolated from the raw graph G as they don’t have links. Thus, we introduce an
edge generator to model the existence of edges among nodes. As GNNs need to learn how
to extract and propagate features simultaneously, this edge generator can provide relation

5
information for those synthesized samples, and hence facilitate the training of the
GNN-based classifier. This generator is trained on real nodes and existing edges, and is
used to predict neighbor information for the synthetic nodes. These new nodes and edges
will be added to the initial adjacency matrix A, and serve as input to the GNN-based
classifier. In order to maintain the model's simplicity and make the analysis easier, we
adopt a vanilla design, a weighted inner product, to implement this edge generator:

E[v, u] = softmax(σ(h¹_v · S · h¹_u))    (5)

where E[v, u] refers to the predicted relation information between nodes v and u, and S is a
parameter matrix capturing the interaction between nodes. The loss function for training the
edge generator is

L_edge = ‖E − A‖²    (6)

where E refers to the predicted connections between nodes in V, i.e., excluding synthetic
nodes. Since we learn an edge generator which is good at reconstructing the adjacency
matrix using the node representations, it should give good link predictions for synthetic
nodes.

With the edge generator, we attempt two strategies to put the predicted edges for synthetic
nodes into the augmented adjacency matrix. In the first strategy, the generator is optimized
using only edge reconstruction, and the edges for a synthetic node v′ are generated by
setting a threshold η:

Ã[v′, u] = 1 if E[v′, u] > η, and 0 otherwise    (7)

where Ã is the adjacency matrix after over-sampling, obtained by inserting the new nodes
and edges into A; it is then sent to the classifier. In the second strategy, for a synthetic node
v′, we use soft edges instead of binary ones:

Ã[v′, u] = E[v′, u]    (8)

In this case, the gradient on Ã can be propagated from the classifier, and hence the
generator can be optimized using both the edge prediction loss and the node classification
loss, which will be introduced later. Both strategies are implemented, and their performance
is compared in the experiment part.
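The weighted inner-product scoring and the threshold strategy can be sketched as follows. All shapes, the random interaction matrix S, and the threshold value are illustrative assumptions; for simplicity this sketch normalizes scores with a sigmoid rather than the softmax used above, which changes the normalization but not the thresholding idea.

```python
# Sketch of the weighted inner-product edge generator and the
# hard-threshold strategy. S and eta are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, h = 5, 3
H1 = rng.normal(size=(n, h))            # toy node embeddings
S = rng.normal(size=(h, h))             # learnable interaction matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

E = sigmoid(H1 @ S @ H1.T)              # E[v, u]: predicted edge score in (0, 1)

eta = 0.5                               # threshold for binary edges
A_tilde = (E > eta).astype(float)       # hard edges for the augmented graph
print(A_tilde.shape)                    # (5, 5)
```

The soft-edge strategy would simply keep `E` itself in the augmented adjacency matrix instead of thresholding, which preserves differentiability with respect to the generator's parameters.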

4.4 GNN Classifier

Let H̃¹ be the augmented node representation set, obtained by concatenating H¹ (the
embeddings of the real nodes) with the embeddings of the synthetic nodes, and let Ṽ_L be
the augmented labeled set, obtained by incorporating the synthetic nodes into V_L. Now we
have an augmented graph G̃ = {Ã, H̃¹} with labeled node set Ṽ_L. The data size of the
different classes in G̃ becomes balanced, so an unbiased GNN classifier can be trained on
it. Specifically, we adopt another GraphSage block, appended by a linear layer, for node
classification on G̃:

h²_v = σ(W² · CONCAT(h¹_v, H̃¹ · Ã[:, v]))
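The classifier block above can be sketched in the same numpy style as the feature extractor. The augmented embeddings, adjacency matrix, weights, and the number of classes are all toy assumptions for illustration.

```python
# Sketch of h2_v = sigma(W2 . CONCAT(h1_v, H~1 . A~[:,v])) plus a linear
# layer for per-node class scores. Shapes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, h, c = 6, 3, 2                                   # nodes (incl. synthetic), dim, classes
H1_tilde = rng.normal(size=(n, h))                  # augmented embeddings
A_tilde = (rng.random((n, n)) > 0.5).astype(float)  # augmented adjacency
W2 = rng.normal(size=(2 * h, h))                    # GraphSage block weight
W_lin = rng.normal(size=(h, c))                     # final linear layer

def relu(x):
    return np.maximum(x, 0.0)

neighbor_agg = A_tilde.T @ H1_tilde                 # row v is H~1^T . A~[:, v]
H2 = relu(np.concatenate([H1_tilde, neighbor_agg], axis=1) @ W2)
logits = H2 @ W_lin                                 # per-node class scores
pred = logits.argmax(axis=1)                        # predicted class per node
print(pred.shape)                                   # (6,)
```

Because the synthetic nodes and their generated edges are already baked into H̃¹ and Ã, this classifier sees a (roughly) balanced class distribution during training.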

6
5. REFERENCES
1. Tianxiang Zhao, Xiang Zhang, and Suhang Wang. 2021. GraphSMOTE:
Imbalanced Node Classification on Graphs with Graph Neural Networks. In
WSDM.
2. Yihong Ma, Yijun Tian, Nuno Moniz, and Nitesh V. Chawla. 2023. Class-
Imbalanced Learning on Graphs: A Survey. University of Notre Dame, USA.
3. Tianxiang Zhao, Dongsheng Luo, Xiang Zhang, and Suhang Wang. 2022.
TopoImb: Toward Topology-level Imbalance in Learning from Graphs. In
LoG.
4. Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W.
Bowyer. 2003. SMOTEBoost: Improving Prediction of the Minority Class in
Boosting. In PKDD.
5. Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip
Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique.
Journal of Artificial Intelligence Research (2002).

7