
IT5429E-2-24-2 (24.2B01)
Graph Analytics for Big Data

Lecture 1: Introduction, Traditional Methods

Many slides are adapted from https://web.stanford.edu/class/cs224w/


Course Information
▪ Course forum: https://join.slack.com/t/bksoictmlwith-dbu4090/shared_invite/zt-353msikdd-iL2OSW1KIKWLXjR2X4ijiw

▪Instructor: Thanh H. Nguyen ([email protected])

▪Coursework:
▪ One programming project

Course Outline
▪ Machine Learning and Representation Learning for graph data:

▪ Traditional ML methods for graphs


▪ Methods for node embeddings
▪ Graph neural networks
▪ Graph Transformers
▪ Knowledge graphs
▪ Generative models for graphs
▪ Scaling up to large graphs
▪ Applications

Course Schedule (Tentative)
Lecture Topics
1 (May 07th) Introduction, Traditional Methods
2 (May 09th) Node Embedding, Link Analysis
3 (May 12th) Graph Neural Nets: Part 1
4 (May 14th) Graph Neural Nets: Part 2
5 (May 16th) Label Propagation, Heterogeneous Graphs
6 (May 19th) Knowledge Graphs
7 (May 21st) Subgraph Matching, GNN for Recommendations
8 (May 23rd) Deep Generative Models, Advanced Topics
9 (May 26th) Graph Transformer, Scaling Up GNNs
10 (May 28th) Selective Topics
11 (June 9th) Class Projects: Presentations
12 (June 9th) Class Projects: Presentations
Prerequisites
▪ Background
▪ Machine learning
▪ Algorithms
▪ Probability and statistics

▪ Programming
▪ Write non-trivial Python programs
▪ Familiar with PyTorch

Graph Machine Learning Tools
▪ PyG
▪ Link: https://www.pyg.org
▪ A library for Graph Neural Networks

▪ GraphGym
▪ Link: https://github.com/snap-stanford/GraphGym
▪ Platform for designing Graph Neural Networks.
▪ This platform is now supported in PyG

▪ Other network analytics tools: Snap.py, NetworkX
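▪ As a hedged first taste of PyG (the toy graph and feature values below are ours, not part of the course materials), a minimal sketch looks like this:

import torch
from torch_geometric.data import Data

# A toy 3-node graph with one scalar feature per node (values are arbitrary).
# edge_index stores edges in COO format: row 0 = sources, row 1 = targets.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1.0], [0.0], [1.0]])

data = Data(x=x, edge_index=edge_index)
print(data.num_nodes, data.num_edges)  # 3 4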

Course Logistics
▪ All lecture slides will be posted on Slack, and class discussions will be held there as well.

▪ Lecture structure
▪ Two parts per class: approximately 90 minutes per part
▪ 30-minute coffee break with Q&A

Course Logistics

▪ Readings
▪ Book: Graph Representation Learning
https://www.cs.mcgill.ca/~wlh/grl_book/files/GRL_Book.pdf
▪ Research papers

Grading
▪ Class participation (10%)
▪ Attendance: each student may be absent from class at most two times
▪ Engagement (actively join discussions, Q&A)

▪ Class project (90%)

Class Projects:
Real-world Applications of GNNs
▪ Determine a specific use case (e.g., fraud detection)

▪ Demonstrate how GNNs + PyG can be used to solve the problem

▪ Groups of two or three students


▪ Message me on Slack

Class Projects: Tasks
▪ Identify an appropriate public dataset
▪ Formulate the use case as a clear graph ML problem
▪ Demonstrate how GNNs can be used to solve this problem.

▪ Important note:
▪ It is not enough to simply apply an existing GNN model to a new dataset
▪ Students are expected to provide some form of novelty, for example:
▪ Develop a new GNN method for a specific problem, improving on existing methods in a non-trivial way
▪ Or conduct comprehensive analyses (ablation studies, comparisons between multiple model architectures) for the project
Class Projects: Examples
▪ Example graphs and datasets
▪ Open Graph Benchmark: https://ogb.stanford.edu
▪ Datasets available in PyG

▪ Application examples:
▪ Recommender systems
▪ Fraud detection in transaction graphs
▪ Friend recommendation
▪ Paper citation graphs
▪ Author collaboration networks
▪ Heterogeneous academic graphs
▪ Knowledge graphs
▪ Drug-drug interaction networks
Class Projects: Components
▪ Project proposal (10%)

▪ Project report (60%)

▪ Project presentation (20%)

Class Projects: Project Proposal (10%)
▪ Application domain
▪ Which dataset are you planning to use?
▪ Describe the dataset/task/metric.
▪ Why did you choose the dataset?

▪ Graph ML techniques that you want to apply


▪ Which graph ML model are you planning to use?
▪ Describe the model(s) (try using figures and equations).
▪ Why is/are the model(s) appropriate for the dataset you have chosen?
▪ Submission:
▪ Deadline: by the end of May 28th (proposal file + student names)
▪ 2-3 pages in PDF
▪ Format: NeurIPS 2025 LaTeX style
▪ Message me the file on the class Slack channel.
Class Projects: Project report (60%)
▪ Writing (40 points): 10-15 pages
▪ Motivation & explanation of data/task (9 points)
▪ Appropriateness & explanation of model(s) (9 points)
▪ Insights + results (9 points)
▪ Figures (9 points)
▪ Code snippets (4 points)

▪ Submission:
▪ Deadline: by June 8th.
▪ Format: NeurIPS 2025 LaTeX style
▪ Message me the file on the class Slack channel.

▪ Colab (20 points)


▪ Code: correctness, design (10 points)
▪ Documentation: class/function descriptions, comments in code (10 points)
Class Projects: Project presentation (20%)

▪ Present at class with Q&A

▪ Time: ~30 minutes (depending on #students per group)

Machine Learning with Graphs:
Why Graphs?

Graphs are a general language for describing and analyzing entities with relations/interactions.
Many Types of Data are Graphs

Examples: event graphs, computer networks, disease pathways, food webs, particle networks, underground networks


Many Types of Data are Graphs

Examples: social networks, economic networks, communication networks, citation networks, the Internet, networks of neurons


Many Types of Data are Graphs

Examples: knowledge graphs, regulatory networks, scene graphs, 3D shapes, code graphs, molecules
Graphs: Machine Learning
▪ Complex domains have a rich relational structure, which can be represented as a relational graph

▪ By explicitly modeling relationships, we achieve better performance!

▪ Main question: How do we take advantage of relational structure for better prediction?

Today: Modern ML Toolbox

Images, text/speech: the modern deep learning toolbox is designed for simple sequences and grids.
This Course
How can we develop neural networks that are much more broadly applicable?

Graphs are the new frontier of deep learning.

Hot Subfield in Machine Learning
ICLR 2023 keywords

Why is Graph Deep Learning Hard?
Networks are complex:
▪ Arbitrary size and complex topological structure (i.e., no spatial locality like grids), in contrast to images and texts
▪ No fixed node ordering or reference point
▪ Often dynamic, with multimodal features
Choices of Graph Representation

Components of A Network

▪ Objects: nodes, vertices


▪ Interactions: edges, links
▪ Systems: graphs, networks

Graphs: A Common Language
(Figure: an actor network with actors Peter, Mary, Tom, and John, and a protein-protein interaction network, both described in the same graph language)
Choosing a Proper Representation
▪ If you connect individuals who work with each other, you will explore a professional network

▪ If you connect scientific papers that cite each other, you will be studying the citation network

How to Define a Graph
▪ How to build a graph
▪ What are nodes?
▪ What are edges?

▪ The choice of a proper network representation of a given domain/problem determines our ability to use networks successfully
▪ In some cases, the representation is unique and unambiguous
▪ In other cases, the representation is by no means unique
▪ The way you assign links will determine the nature of the questions you can study

Directed and Undirected Graphs
▪ Undirected
▪ Links: undirected (symmetrical, reciprocal)
▪ Examples: collaborations, friendships on Facebook

▪ Directed
▪ Links: directed
▪ Examples: phone calls, following on Twitter (X)

▪ Other considerations: weights, types, properties, attributes
Representing Graphs: Adjacency Matrix
▪ A_uv = 1 if there is a link from node u to node v; A_uv = 0 otherwise

Adjacency Matrices are Sparse

Networks are Sparse Graphs
Most real-world networks are sparse

Consequence: the adjacency matrix is mostly filled with zeros


Representing Graphs: Edge List
▪ Representing graph as a list of edges
▪ (2, 3)
▪ (2, 4)
▪ (3, 2)
▪ (3, 4)
▪ (4, 5)
▪ (5, 2)
▪ (5, 1)

Representing Graphs: Adjacency List
▪ Adjacency list
▪ Easier to work with if network is
▪ Large
▪ Sparse

▪ Allows us to quickly retrieve all neighbors of a given node (see the sketch below)
▪ 1:
▪ 2: 3, 4
▪ 3: 2, 4
▪ 4: 5
▪ 5: 1, 2
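▪ A minimal sketch (plain Python/NumPy; variable names are ours) that builds both the adjacency matrix and the adjacency list from the directed edge list above:

from collections import defaultdict
import numpy as np

# Directed edge list from the slides (nodes 1..5).
edges = [(2, 3), (2, 4), (3, 2), (3, 4), (4, 5), (5, 2), (5, 1)]
n = 5

# Adjacency matrix: A[u-1][v-1] = 1 if there is an edge u -> v.
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = 1

# Adjacency list: node -> list of out-neighbors.
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)

print(A)
print(dict(adj))  # {2: [3, 4], 3: [2, 4], 4: [5], 5: [2, 1]}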

Heterogeneous Graphs
▪ A heterogeneous graph is defined as G = (V, E, R, T)
➢ Nodes with node types: v_i ∈ V
➢ Edges with relation types: (v_i, r, v_j) ∈ E
➢ Node type: T(v_i)
➢ Relation type: r ∈ R
➢ Nodes and edges have attributes/features
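▪ As a hedged sketch of how typed nodes and (v_i, r, v_j) edges can be expressed in PyG's HeteroData (node counts, feature sizes, and type names below are made up for illustration):

import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Two node types with illustrative feature dimensions.
data['author'].x = torch.randn(4, 16)  # 4 author nodes, 16 features each
data['paper'].x = torch.randn(3, 32)   # 3 paper nodes, 32 features each

# One relation type, keyed by a (node type, relation, node type) triple.
data['author', 'writes', 'paper'].edge_index = torch.tensor(
    [[0, 1, 2], [0, 0, 2]], dtype=torch.long)

print(data.node_types, data.edge_types)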

Many Graphs are Heterogeneous

▪ Biomedical Knowledge Graphs
▪ Example node: Migraine
▪ Example edge: (fulvestrant, Treats, Breast Neoplasms)
▪ Example node type: Protein
▪ Example edge type (relation): Causes

▪ Academic Graphs
▪ Example node: ICML
▪ Example edge: (GraphSAGE, NeurIPS)
▪ Example node type: Author
▪ Example edge type (relation): pubYear
Bipartite Graph
▪ Nodes can be divided into two disjoint sets U and V
▪ Every link connects a node in U to one in V
▪ U and V are independent sets

▪ Examples:
▪ Authors-to-Papers (they authored)
▪ Actors-to-Movies (they appeared in)
▪ Users-to-Movies (they rated)
▪ Recipes-to-Ingredients (they contain)

More Types of Graphs
▪ Unweighted vs. weighted graphs

More Types of Graphs
▪ Self-edges (self-loops) vs. multi-graphs

Connectivity of Undirected Graphs
▪ Connected undirected graph
▪ Any two vertices can be joined by a path

▪ A disconnected graph consists of two or more connected components

Connectivity Example
▪ The adjacency matrix of a network with several components can
be written in a block-diagonal form, so that nonzero elements are
confined to squares, with all other elements being zero:

Connectivity of Directed Graphs
▪ Strongly connected directed graphs
▪ Have a path from each node to every other node

▪ Weakly connected directed graphs
▪ Connected if we disregard edge directions
▪ Example:

Connectivity of Directed Graphs
▪ Strongly connected components (SCCs): not every node is part of a non-trivial SCC

Applications of Graph ML

Different Types of Tasks
▪ Node level
▪ Edge level
▪ Community (subgraph) level
▪ Graph level: prediction, graph generation

Node-Level Tasks

Example task: node classification
Node-level Network Structure
▪ Goal: Characterize the structure and position of a node in the network
▪ Node degree
▪ Node importance and position
▪ E.g., Number of shortest paths passing through a node
▪ E.g., Avg. shortest path length to other nodes

▪ Substructures around the node

Example (1): Anomaly Detection
▪ Computer network
▪ Nodes: computers/machines
▪ Edges: connection/communication between computers
▪ Task: detect compromised computers

Example (1): Anomaly Detection

Zhuo, Ming, Leyuan Liu, Shijie Zhou, and Zhiwen Tian. "Survey on security issues of routing and anomaly detection for space information networks." Scientific Reports 11, no. 1 (2021): 22261.
Example (2): Research Paper Topics
▪ Citation networks
▪ Node: research papers
▪ Edges: citations
▪ Task: predict topics of papers

Valmarska, Anita, and Janez Demšar. "Analysis of citation networks." Diploma thesis, Faculty of Computer and Information Science, University of Ljubljana, 2014.
Link-level Prediction Task
▪ The task is to predict new/missing/unknown links based on the
existing links.
▪ At test time, node pairs (with no existing links) are ranked, and
top 𝐾 node pairs are predicted.
▪ Task: Make a prediction for a pair of nodes.

Link Prediction as a Task
▪ Links missing at random
▪ Remove a random set of links and then aim to predict them

▪ Links over time
▪ Given G[t0, t0'], a graph defined by edges up to time t0', output a ranked list L of edges (not in G[t0, t0']) that are predicted to appear in G[t1, t1']
▪ Evaluation:
▪ n = |E_new|: the number of edges that appear during the test period [t1, t1']
▪ Take the top n elements of L and count correct edges

Example (1): Recommender Systems
▪ Users interacts with items
▪ Watch movies, buy merchandise, listen to music
▪ Nodes: Users and items
▪ Edges: User-item interactions

▪ Goal: recommend items that users might like
(Figure: a bipartite graph of users, items, and their interactions, with "You might also like" suggestions)

Example (2): Drug Side Effects
Many patients take multiple drugs to treat complex or co-existing diseases
▪ 46% of people ages 70-79 take more than 5 drugs
▪ Many patients take more than 20 drugs to treat heart disease, depression, insomnia, etc.

▪ Task: Given a pair of drugs, predict adverse side effects

Graph-level Tasks

Graph-level Features
▪ Goal: We want to make a prediction for an entire graph or a subgraph of the graph.

▪ Example

Example (1): Traffic Prediction

Road Network as a Graph
▪ Nodes: Road segments
▪ Edges: Connectivity between road segments
▪ Prediction: Estimated Time of Arrival (ETA)

Image credit: DeepMind


Traffic Prediction via GNN
▪ Predict Time of Arrival with Graph Neural Nets

Image credit: DeepMind


Example (2): Drug Discovery
▪ Antibiotics are small molecular graphs
▪ Nodes: Atoms
▪ Edges: Chemical bonds

Image credit: CNN

Konaklieva, Monika I. "Molecular targets of β-lactam-based antimicrobials: beyond the usual suspects." Antibiotics 3.2 (2014): 128-142.
Deep Learning for Antibiotic Discovery
▪ A Graph Neural Network graph classification model
▪ Predict promising molecules from a pool of candidates

Stokes, Jonathan M., et al. "A deep learning approach to antibiotic discovery." Cell 180.4 (2020): 688-702.
Summary
▪ Node level
▪ Edge level
▪ Community (subgraph) level
▪ Graph level: prediction, graph generation

Traditional ML Methods for Graphs

Traditional ML Pipeline
▪ Design features for nodes/links/graphs
▪ Obtain features for all training data

Traditional ML Pipeline
▪ Train an ML model
▪ Logistic regression
▪ Random forest
▪ Neural network, etc.

▪ Apply the model
▪ Given a new node/link/graph, obtain its features and make a prediction

This Lecture: Feature Design
▪ Use effective features 𝑥 over graphs
▪ Traditional ML pipeline uses hand-designed features

▪ In this lecture, we will overview the traditional features for:


▪ Node-level prediction
▪ Link-level prediction
▪ Graph-level prediction

▪ For simplicity, we focus on undirected graphs

Machine Learning in Graphs
▪ Goal: Make predictions for a set of objects

▪ Design choices
▪ Features: d-dimensional vectors 𝑥
▪ Objects: Nodes, edges, sets of nodes, entire graphs
▪ Objective functions: What tasks are we aiming to solve?

Machine Learning in Graphs
▪ Example: Node-level prediction
▪ Given: G = (V, E)
▪ Learn a function f: V → ℝ

Node-level Tasks and Features

Node-Level Tasks

Example task: node classification. ML models need features.
Node-level Features: Overview
▪ Goal: Characterize the structure and position of a node in the
network
▪ Node degree
▪ Node centrality
▪ Clustering coefficient
▪ Graphlets

Node Features: Node Degree
▪ The degree 𝑘𝑣 of node 𝑣 is the number of edges (neighboring
nodes) the node has
▪ Treat all neighboring nodes equally

Node Features: Node Centrality
▪ Node degree counts the neighboring nodes without capturing their
importance

▪ Node centrality c_v takes the node importance in a graph into account
▪ Different ways to model importance:
▪ Eigenvector centrality
▪ Betweenness centrality
▪ Closeness centrality
▪ And many others

Node Centrality: Eigenvector Centrality
▪ A node v is important if it is surrounded by important neighboring nodes u ∈ N(v)
▪ We model the centrality of node v as the sum of the centralities of its neighboring nodes:

c_v = (1/λ) · Σ_{u∈N(v)} c_u

where λ is a normalizing constant (it will turn out to be the largest eigenvalue of the adjacency matrix A)

▪ Note: the above equation models centrality in a recursive manner. How do we solve it?

Node Centrality: Eigenvector Centrality
▪ Rewrite the recursive equation in matrix form:

λc = Ac

▪ A: adjacency matrix, with A_uv = 1 if u ∈ N(v)
▪ c: centrality vector
▪ λ: eigenvalue (the normalizing constant, the largest eigenvalue of A)

▪ We can see that the centrality c is an eigenvector of A
▪ The largest eigenvalue λ_max is always positive and unique (by the Perron-Frobenius Theorem)
▪ The eigenvector c_max corresponding to λ_max is used for centrality
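▪ One standard way to compute c_max is power iteration; a minimal NumPy sketch (not from the slides):

import numpy as np

def eigenvector_centrality(A: np.ndarray, iters: int = 100) -> np.ndarray:
    """Power iteration: repeatedly apply A and renormalize.

    For a connected undirected graph, this converges to the eigenvector
    of the largest eigenvalue (Perron-Frobenius)."""
    c = np.ones(A.shape[0])
    for _ in range(iters):
        c = A @ c
        c = c / np.linalg.norm(c)
    return c

# Toy undirected graph: a triangle plus one pendant node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(eigenvector_centrality(A).round(3))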
Node Centrality: Betweenness Centrality
▪ A node is important if it lies on many shortest paths between other nodes:

c_v = Σ_{s≠v≠t} #(shortest paths between s and t that contain v) / #(shortest paths between s and t)

▪ Example:

Node Centrality: Closeness Centrality
▪ A node is important if it has small shortest path lengths to all other nodes:

c_v = 1 / Σ_{u≠v} (shortest path length between u and v)

▪ Example:
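▪ For reference, a hedged NetworkX sketch of these centralities on a toy graph of our own; the comments note where library defaults differ from the slide formulas:

import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# Eigenvector centrality (computed by power iteration, as derived above).
print(nx.eigenvector_centrality(G))

# Betweenness centrality; normalized=False matches the raw sum in the
# slide formula (NetworkX divides by the number of node pairs by default).
print(nx.betweenness_centrality(G, normalized=False))

# Closeness centrality; note that NetworkX returns (n-1) / (sum of
# distances), a scaled version of the slide's 1 / (sum of distances).
print(nx.closeness_centrality(G))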

Node Features: Clustering Coefficient
▪ Measures how connected v's neighboring nodes are:

e_v = #(edges among neighboring nodes) / (k_v choose 2) ∈ [0, 1]

▪ Example:
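▪ A small sketch checking the formula against NetworkX's built-in nx.clustering (the toy graph is ours):

import networkx as nx
from itertools import combinations

G = nx.Graph([(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)])
v = 1

# Direct computation: #edges among v's neighbors / (k_v choose 2).
nbrs = list(G.neighbors(v))
k_v = len(nbrs)
links = sum(1 for u, w in combinations(nbrs, 2) if G.has_edge(u, w))
e_v = links / (k_v * (k_v - 1) / 2)

print(e_v, nx.clustering(G, v))  # both give 2/3 here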

Node Features: Graphlets
▪ Observation: the clustering coefficient counts the number of triangles in the ego-network

▪ We can generalize the above by counting #(pre-specified subgraphs, i.e., graphlets)
Node Features: Graphlets
▪ Goal: describe network structures around a node 𝑢
▪ Graphlets: small subgraphs that describe the structures of node 𝑢’s
network neighborhood

▪ Analogy:
▪ Degree counts #edges that a node touches
▪ Clustering coefficient: counts #triangles that a node touches
▪ Graphlet Degree Vector (GDV): Graph-based features for nodes
▪ Count #graphlets that a node touches

Node Features: Graphlets
▪ Induced subgraph: another graph formed from a subset of vertices and all edges connecting vertices in the subset
(Figure: original graph, an induced subgraph, and a non-induced subgraph)

▪ Graph isomorphism: two graphs containing the same number of nodes connected in the same way are said to be isomorphic
(Figure: isomorphic and non-isomorphic graph pairs)


Node Features: Graphlets
▪ Graphlets: connected induced non-isomorphic subgraphs

▪ Orbits: sets of vertices which are mapped onto each other by the graphlet's automorphisms

Node Features: Graphlets
▪ Graphlet Degree Vector (GDV): a vector with the frequency of the
node in each orbit position
Possible graphlets
on up to 3 nodes

▪ Example:

Graphlet Degree Vector (GDV)
▪ Count #graphlets that a node touches at a particular orbit

▪ Considering graphlets on 2 to 5 nodes, we get:
▪ A vector of 73 coordinates: a signature of a node that describes the topology of the node's neighborhood
▪ It captures the node's interconnectivities up to a distance of 4 hops

▪ GDV provides a measure of a node's local topology
▪ Comparing the vectors of two nodes provides a highly constrained measure of local topological similarity between them

Graphlet Degree Vector: Example

▪ GDV of a node A:
▪ i-th element of GDV(A): #graphlets that touch A at orbit i
▪ Highlighted are graphlets that touch A at orbits 15, 19, 27, 35.
Node-Level Feature: Summary
▪ Importance based features
▪ Node degree
▪ Different node centrality measures (eigenvector, betweenness, closeness)

▪ Structure-based features
▪ Node degree
▪ Clustering coefficient
▪ Graphlet count vector

Node-Level Feature: Summary
▪ Importance-based features
▪ Node degree: count #neighboring nodes
▪ Node centrality:
▪ Model importance of neighboring nodes in a graph
▪ Different modeling choices: eigenvector centrality, betweenness centrality, closeness
centrality

▪ Useful for predicting influential nodes in a graph
▪ Example: predict celebrity users in a social network

Node-Level: Summary
▪ Structure-based features: capture topological properties of local
neighborhood around a node
▪ Node degree: count #neighboring nodes
▪ Clustering coefficient: measure how connected neighboring nodes are
▪ Graphlet count vector: count the occurrences of different graphlets

▪ Useful for predicting a particular role a node plays in a graph
▪ Example: predict protein functionality in a protein-protein interaction network

Link Prediction Task and Features

Link-level Prediction Task: Recap
▪ The task is to predict new/missing/unknown links based on the
existing links.
▪ At test time, node pairs (with no existing links) are ranked, and
top 𝐾 node pairs are predicted.
▪ Task: Make a prediction for a pair of nodes.

Link Prediction as a Task
▪ Links missing at random
▪ Remove a random set of links and then aim to predict them

▪ Links over time
▪ Given G[t0, t0'], a graph defined by edges up to time t0', output a ranked list L of edges (not in G[t0, t0']) that are predicted to appear in G[t1, t1']
▪ Evaluation:
▪ n = |E_new|: the number of edges that appear during the test period [t1, t1']
▪ Take the top n elements of L and count correct edges

Link Prediction via Proximity
▪ Methodology:
▪ For each pair of nodes (x, y), compute a score c(x, y)
▪ For example: #common neighbors of x and y
▪ Sort pairs (x, y) by decreasing score c(x, y)
▪ Predict the top n pairs as new links
▪ See which of these links actually appear in G[t1, t1']
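▪ A minimal sketch of this pipeline on a toy graph of our own, using #common neighbors as the score c(x, y):

import networkx as nx
from itertools import combinations

G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])

# Score every non-adjacent node pair by its number of common neighbors.
scores = {
    (x, y): len(list(nx.common_neighbors(G, x, y)))
    for x, y in combinations(G.nodes, 2)
    if not G.has_edge(x, y)
}

# Rank by decreasing score and predict the top n pairs as new links.
n = 2
predicted = sorted(scores, key=scores.get, reverse=True)[:n]
print(predicted)  # [(1, 4), (2, 5)] for this toy graph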

Link-Level Feature: Overview
▪ Distance-based feature
▪ Local neighborhood overlap
▪ Global neighborhood overlap

Distance-based Feature
▪ Shortest path distance between two nodes
▪ Example:

▪ However, this does not capture the degree of neighborhood overlap:
▪ Node pair (B, H) has two shared neighboring nodes
▪ Node pairs (B, E) and (A, B) have only one such node

Local Neighborhood Overlap
▪ Captures #neighboring nodes shared between two nodes
▪ Common neighbors: |N(v1) ∩ N(v2)|
▪ Example: |N(A) ∩ N(B)| = 1

▪ Jaccard's coefficient: |N(v1) ∩ N(v2)| / |N(v1) ∪ N(v2)|
▪ Example: |N(A) ∩ N(B)| / |N(A) ∪ N(B)| = 1/2

▪ Adamic-Adar index: Σ_{u∈N(v1)∩N(v2)} 1/log(k_u)
▪ Example: 1/log(k_C) = 1/log 4
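▪ These indices are available in NetworkX; a hedged sketch on a toy graph of our own:

import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
pair = [(1, 4)]

# Common neighbors: |N(v1) ∩ N(v2)|.
print(len(list(nx.common_neighbors(G, 1, 4))))

# Jaccard coefficient and Adamic-Adar index via built-in generators,
# which yield (u, v, score) tuples. Note Adamic-Adar uses the natural log.
print(list(nx.jaccard_coefficient(G, pair)))
print(list(nx.adamic_adar_index(G, pair)))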

Global Neighborhood Overlap
▪ Limitation of local neighborhood overlap:
▪ The metric is always zero if the two nodes have no neighbors in common:

|N(A) ∩ N(E)| = 0

▪ However, the two nodes may still potentially be connected in the future

▪ Global neighborhood overlap resolves the limitation by considering the entire graph
Global Neighborhood Overlap
▪ Katz index: count the number of walks of all lengths between a
given pair of nodes

▪ Compute #walks:
▪ Use powers of the graph adjacency matrix

Intuition: Powers of Adj Matrices
▪ Compute #walks between two nodes
▪ Recall: A_uv = 1 if u ∈ N(v)
▪ Let P^(k)_uv = #walks of length k between u and v
▪ We will show P^(k) = A^k
▪ P^(1)_uv = A_uv = #walks of length 1 (direct neighborhood) between u and v

Intuition: Powers of Adj Matrices
▪ How to compute P^(2)_uv?
▪ Step 1: Compute #walks of length 1 between each neighbor of u and v
▪ Step 2: Sum up these #walks across u's neighbors

P^(2)_uv = Σ_i A_ui · P^(1)_iv = Σ_i A_ui · A_iv = (A^2)_uv

Global Neighborhood Overlap
▪ The Katz index between v1 and v2 counts walks of all lengths, discounted by a factor β:

S_v1v2 = Σ_{l=1}^{∞} β^l · (A^l)_v1v2, with 0 < β < 1

▪ The Katz index matrix is computed in closed form:

S = Σ_{l=1}^{∞} β^l A^l = (I - βA)^{-1} - I
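▪ A hedged NumPy sketch of the closed-form computation (the toy graph and the choice β = 0.1 are ours; β must be smaller than 1/λ_max for the geometric series to converge):

import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
beta = 0.1  # discount factor; must satisfy beta < 1 / largest eigenvalue of A

# S = (I - beta*A)^{-1} - I sums beta^l * A^l over all walk lengths l >= 1.
I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I
print(S.round(3))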

Link-Level Features: Summary
▪ Distance-based features
▪ Use the shortest path length
▪ Do not capture how neighborhoods overlap

▪ Local neighborhood overlap
▪ Captures how many neighboring nodes are shared
▪ Becomes zero when no neighbors are shared

▪ Global neighborhood overlap
▪ Uses the global graph structure to score two nodes
▪ The Katz index counts #walks of all lengths between two nodes

Graph-Level Features and Graph Kernels

Graph-Level Features
▪ Goal: characterize structure of an entire graph

▪ Example:

Graph-Level Features: Overview
▪ Graph kernels: measure similarity between two graphs
▪ Graphlet kernel
▪ Weisfeiler-Lehman kernel

▪ Other kernels (not covered in this lecture):
▪ Random-walk kernel
▪ Shortest-path graph kernel
▪ Many more…

Graph Kernel: Ideas
▪ Goal: design a graph feature vector φ(G)
▪ Key idea: Bag-of-Words (BoW) for a graph
▪ Recall: BoW uses word counts as features for documents (no ordering)
▪ Naïve extension to a graph: treat nodes as words
▪ Limitation:
▪ Since both example graphs have 4 nodes, we get the same feature vector for two different graphs
Graph Kernel: Key Ideas
▪ What if we use a bag of node degrees?

▪ Both the Graphlet kernel and the Weisfeiler-Lehman (WL) kernel use Bag-of-* representations of graphs.

Graph-Level Graphlet Features
▪ Key idea: count #different graphlets in a graph

▪ Note: the definition of graphlets here is slightly different from the node-level features

▪ Two differences:
▪ Nodes in graphlets here do not need to be connected
▪ Graphlets here are not rooted

Graph-Level Graphlet Features
▪ Let 𝒢_k = (g_1, g_2, …, g_{n_k}) be a list of graphlets of size k
▪ For k = 3, there are 4 graphlets
▪ For k = 4, there are 11 graphlets

Graph-Level Graphlet Features
▪ Given a graph G and a graphlet list 𝒢_k = (g_1, g_2, …, g_{n_k}), define the graphlet count vector f_G ∈ ℝ^{n_k} as:

(f_G)_i = #(g_i ⊆ G), for all i = 1, 2, …, n_k

Graph-Level Graphlet Features
▪ Example: k = 3

Graph-Level Graphlet Kernel
▪ Given two graphs G and G', the graphlet kernel is computed as:

K(G, G') = f_G^T f_G'

▪ Problem:
▪ If G and G' have different sizes, that will greatly skew the value.

▪ Solution: normalize each feature vector:

h_G = f_G / sum(f_G),   K(G, G') = h_G^T h_G'
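▪ A hedged sketch for k = 3, where the four graphlets are distinguished by the number of edges induced by a node triple (0, 1, 2 = path, 3 = triangle):

import networkx as nx
import numpy as np
from itertools import combinations

def graphlet3_counts(G: nx.Graph) -> np.ndarray:
    """Count size-3 graphlets, indexed by #induced edges among the triple."""
    f = np.zeros(4)
    for trio in combinations(G.nodes, 3):
        f[sum(1 for u, v in combinations(trio, 2) if G.has_edge(u, v))] += 1
    return f

def graphlet_kernel(G1: nx.Graph, G2: nx.Graph) -> float:
    # Normalize, so graphs of different sizes are comparable.
    h1 = graphlet3_counts(G1); h1 /= h1.sum()
    h2 = graphlet3_counts(G2); h2 /= h2.sum()
    return float(h1 @ h2)

print(graphlet_kernel(nx.cycle_graph(5), nx.path_graph(5)))

▪ Note that the exhaustive enumeration over node triples above is exactly the O(n^k) cost discussed below (here k = 3).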
The Graphlet Kernel
▪ Limitation: counting graphlets is expensive
▪ Counting size-k graphlets in a graph of size n by enumeration takes O(n^k) time
▪ This is unavoidable in the worst case, since the subgraph isomorphism test (judging whether a graph is a subgraph of another graph) is NP-hard
▪ If a graph's node degree is bounded by d, an O(n·d^(k-1)) algorithm exists to count all graphlets of size k

▪ Can we design a more efficient graph kernel?

Weisfeiler-Lehman Kernel
▪ Goal: Design an efficient graph feature descriptor φ(G)

▪ Key idea: Use neighborhood structure to iteratively enrich the node vocabulary.
▪ A generalized version of the bag of node degrees, since node degrees are one-hop neighborhood information

▪ Algorithm: Color refinement

Color Refinement
▪ Given: a graph G with a set of nodes V
▪ Assign an initial color c^(0)(v) to each node v
▪ Iteratively refine node colors by

c^(k+1)(v) = HASH( c^(k)(v), { c^(k)(u) : u ∈ N(v) } )

where HASH maps different inputs to different colors

▪ After K steps of color refinement, c^(K)(v) summarizes the structure of the K-hop neighborhood
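▪ A hedged sketch of color refinement and the resulting bag-of-colors kernel (a shared dictionary stands in for HASH; all names are ours):

import networkx as nx
from collections import Counter

def wl_colors(G: nx.Graph, K: int, palette: dict) -> Counter:
    """Run K rounds of color refinement; return the bag of all colors seen."""
    color = {v: 0 for v in G.nodes}  # uniform initial color
    bag = Counter(color.values())
    for _ in range(K):
        color = {
            v: palette.setdefault(
                (color[v], tuple(sorted(color[u] for u in G.neighbors(v)))),
                len(palette) + 1)
            for v in G.nodes
        }
        bag.update(color.values())  # count colors from every refinement round
    return bag

def wl_kernel(G1: nx.Graph, G2: nx.Graph, K: int = 2) -> int:
    palette = {}  # shared, so identical signatures get identical colors in both graphs
    b1, b2 = wl_colors(G1, K, palette), wl_colors(G2, K, palette)
    return sum(b1[c] * b2[c] for c in b1)  # inner product of color counts

print(wl_kernel(nx.cycle_graph(4), nx.path_graph(4)))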

Color Refinement: Example
▪ Assign initial colors

▪ Aggregate neighboring colors

Color Refinement: Example
▪ Aggregate neighboring colors

▪ Hash aggregated colors

Color Refinement: Example
▪ Aggregate neighboring colors

▪ Hash aggregated colors

Weisfeiler-Lehman Graph Features
▪ After color refinement, the WL kernel counts #nodes with a given color

Weisfeiler-Lehman Kernel
▪ The WL kernel is computed as the inner product of the color count vectors

Weisfeiler-Lehman Kernel
▪ Computationally efficient
▪ The time complexity of color refinement at each step is linear in #edges

▪ When computing a kernel value, only colors that appear in the two graphs need to be tracked
▪ Thus, #colors is at most the total number of nodes

▪ Counting colors takes time linear in #nodes

▪ In total, the time complexity is linear in #edges

Graph-Level Features: Summary
▪ Graphlet kernel
▪ Graph is represented as Bag-of-graphlets
▪ Computationally expensive

▪ Weisfeiler-Lehman kernel
▪ Apply K-step color refinement algorithm to enrich node colors
▪ Different colors capture different K-hop neighborhood structures
▪ Graph is represented as Bag-of-colors
▪ Computationally efficient
▪ Closely related to Graph Neural Nets (which we will study later)

Summary
▪ Traditional ML pipeline
▪ Hand-crafted (structural) features + ML models

▪ Hand-crafted features for graph data:
▪ Node-level: node degree, centrality, clustering coefficient, graphlets
▪ Link-level: distance-based features, local/global neighborhood overlap
▪ Graph-level: graphlet kernel, WL kernel

▪ However, we only considered featurizing the graph structure (not the attributes of nodes and their neighbors)

