06 GNN3

CS224W is a course at Stanford University focused on Machine Learning with Graphs, emphasizing the creation of resources for the graph ML community. Students will engage in projects that include real-world applications of Graph Neural Networks (GNNs), tutorials on PyG functionality, and implementations of cutting-edge research, with outputs such as blog posts and Google Colabs. The project contributes to 20% of the course grade, with proposals due on February 7 and final reports due on March 21.

CS224W: Machine Learning with Graphs

Jure Leskovec, Stanford University


http://cs224w.stanford.edu
 Goal: create long-lasting resources for your
technical profiles + broader graph ML
community
 Three types of projects
▪ 1) Real-world applications of GNNs
▪ 2) Tutorial on PyG functionality
▪ 3) Implementation of cutting-edge research
 We will publish your blog posts on our
course’s Medium page!

 Goal: identify a specific use case and
demonstrate how GNNs and PyG can be used
to solve this problem
 Output: blog post, Google colab
 Example use cases
▪ Fraud detection
▪ Predicting drug interactions
▪ Friend recommendation
 Check out the featured posts from our course
last year as examples of this type of project
 Goal: develop a tutorial that explains how to
use existing PyG functionality
 Output: blog post, Google colab
 Example topics for tutorials
▪ PyG’s explainability module
▪ Methods for graph sampling (e.g., negative
sampling, sampling on heterogeneous graphs)
▪ Tutorial on GraphGym, a platform for designing
and evaluating GNNs
 Check out example tutorials from PyG
 Goal: implement interesting methods from a
recent research paper in graph ML
 Output: PR to PyG contrib, short blog post
 Project details
▪ Implementation should include comprehensive
testing and documentation on new functionality
▪ Try to build on existing PyG and PyTorch code
wherever possible
▪ Note: this project is more manageable if you are already comfortable with PyTorch and deep learning. We also highly recommend groups of 3.
 Project is worth 20% of your course grade
▪ Project proposal (2 pages), due February 7
▪ Final reports, due March 21
 We recommend groups of 3, but groups of 2
are also allowed
 Full project description will be released
tonight! We will provide much more detail on
each project type, examples, pointers to
datasets, tips for writing blog posts and
Google Colabs, etc.
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

A general GNN framework: (1) Message, (2) Aggregation, (3) Layer connectivity, (4) Graph augmentation, (5) Learning objective.
 Putting things together:
▪ (1) Message: each node computes a message
$\mathbf{m}_u^{(l)} = \text{MSG}^{(l)}\left(\mathbf{h}_u^{(l-1)}\right), \ u \in \{N(v) \cup \{v\}\}$
▪ (2) Aggregation: aggregate messages from neighbors
$\mathbf{h}_v^{(l)} = \text{AGG}^{(l)}\left(\left\{\mathbf{m}_u^{(l)}, u \in N(v)\right\}, \mathbf{m}_v^{(l)}\right)$
▪ Nonlinearity (activation): adds expressiveness
▪ Often written as $\sigma(\cdot)$: ReLU(⋅), Sigmoid(⋅), …
▪ Can be added to message or aggregation
He et al. Deep Residual Learning for Image Recognition, CVPR 2016

 What if my problem still requires many GNN layers?


 Lesson 2: Add skip connections in GNNs
▪ Observation from over-smoothing: Node embeddings in
earlier GNN layers can sometimes better differentiate nodes
▪ Solution: We can increase the impact of earlier layers on the
final node embeddings, by adding shortcuts in GNN
Idea of skip connections:
▪ Before adding shortcuts: $F(\mathbf{x})$
▪ After adding shortcuts: $F(\mathbf{x}) + \mathbf{x}$
▪ The input is duplicated into two branches, and the two branches are summed
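A minimal sketch of wrapping a GNN layer with a skip connection, assuming a layer such as the SimpleGNNLayer sketched above (the wrapper name is illustrative, and it requires the layer's input and output dimensions to match):

```python
import torch.nn as nn

class GNNLayerWithSkip(nn.Module):
    """Wraps a GNN layer so that the output is F(x) + x instead of F(x)."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, h, neighbors):
        # Skip connection: earlier-layer embeddings flow directly to the output
        return self.layer(h, neighbors) + h
```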
 Graph Feature manipulation
▪ The input graph lacks features → feature
augmentation
 Graph Structure manipulation
▪ The graph is too sparse → Add virtual nodes / edges
▪ The graph is too dense → Sample neighbors when
doing message passing
▪ The graph is too large → Sample subgraphs to
compute embeddings
▪ Will cover later in lecture: Scaling up GNNs

 Feature augmentation: constant vs. one-hot
▪ Constant node feature: assign the same value (e.g., 1) to every node
▪ One-hot node feature: assign each node a unique ID, encoded as a one-hot vector
▪ Expressive power: constant is medium (all nodes are identical, but the GNN can still learn from the graph structure); one-hot is high (each node has a unique ID, so node-specific information can be stored)
▪ Inductive learning (generalizing to unseen nodes): constant is high (simple to generalize: we assign the constant feature to new nodes, then apply our GNN); one-hot is low (new nodes introduce new IDs, and the GNN doesn't know how to embed unseen IDs)
▪ Computational cost: constant is low (only a 1-dimensional feature); one-hot is high (high-dimensional features cannot be applied to large graphs)
▪ Use cases: constant works for any graph and inductive settings (generalize to new nodes); one-hot suits small graphs and transductive settings (no new nodes)
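A minimal sketch of the two choices as plain tensors for a small graph (variable names are illustrative; in PyG these tensors would be assigned to `data.x`):

```python
import torch

num_nodes = 6

# Constant node feature: every node gets the same 1-dimensional feature
x_constant = torch.ones(num_nodes, 1)   # shape [6, 1]

# One-hot node feature: every node gets a unique ID as a one-hot vector
x_one_hot = torch.eye(num_nodes)        # shape [6, 6]
```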
Why do we need feature augmentation?
 (2) Certain structures are hard to learn by GNN
 Example: Cycle count feature
▪ Can a GNN learn the length of the cycle that 𝑣1 resides in?
▪ Unfortunately, no
(Figure: two example graphs; 𝑣1 resides in a cycle of length 3 in one, and of length 4 in the other.)
 𝒗𝟏 cannot differentiate which graph it resides in
▪ Because all the nodes in the graph have degree of 2
▪ The computational graphs will be the same binary tree

𝑣1 resides in a cycle 𝑣1 resides in a cycle


with length 3 with length 4 The computational
graphs for node 𝒗𝟏
𝑣1 𝑣2
are always the same

𝑣1 resides in a cycle with infinite length

… 𝑣1 …

2/16/2023 Jure Les kovec, Stanford CS224W: Ma chine Learning with Graphs, http://cs224w.stanford.edu 14
J. You, J. Gomes-Selman, R. Ying, J. Leskovec. Identity-aware Graph Neural Networks, AAAI 2021

Why do we need feature augmentation?


 (2) Certain structures are hard to learn by GNN
 Solution:
▪ We can use cycle count as augmented node features

▪ Example: counting cycles starting from length 0, if 𝑣1 resides in a cycle of length 3 its augmented node feature is [0, 0, 0, 1, 0, 0]; if it resides in a cycle of length 4 the feature is [0, 0, 0, 0, 1, 0]
Why do we need feature augmentation?
 (2) Certain structures are hard to learn by GNN
 Other commonly used augmented features:
▪ Degree distribution
▪ Clustering coefficient
▪ PageRank
▪ Centrality
▪ …
 Any feature we have introduced can be used!

 Motivation: Augment sparse graphs
 (1) Add virtual edges
▪ Common approach: Connect 2-hop neighbors via
virtual edges
▪ Intuition: Instead of using the adjacency matrix $A$ for GNN computation, use $A + A^2$
▪ Use case: Bipartite graphs, e.g., an author-to-paper graph (authors connected to the papers they authored)
▪ 2-hop virtual edges then form an author-author collaboration graph
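A minimal sketch of adding 2-hop virtual edges by augmenting the adjacency matrix, using a small dense matrix for illustration (in practice one would work with sparse edge indices):

```python
import torch

# Adjacency matrix of a small bipartite graph (illustrative)
A = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])

A_aug = A + A @ A              # A + A^2 connects 2-hop neighbors via virtual edges
A_aug.fill_diagonal_(0)        # drop the self-loops introduced by A^2
A_aug = (A_aug > 0).float()    # keep the augmented adjacency binary
```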
 Motivation: Augment sparse graphs
 (2) Add virtual nodes
▪ The virtual node will connect to all the nodes in the graph
▪ Suppose in a sparse graph, two nodes have
shortest path distance of 10
▪ After adding the virtual node, all the nodes
will have a distance of 2
▪ Node A – Virtual node – Node B
▪ Benefits: Greatly improves message
passing in sparse graphs
Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017

 Previously:
▪ All the nodes are used for message passing

 New idea: (Randomly) sample a node's neighborhood for message passing
 For example, we can randomly choose 2
neighbors to pass messages
▪ Only nodes 𝐵 and 𝐷 will pass message to 𝐴

 Next time when we compute the embeddings,
we can sample different neighbors
▪ Only nodes 𝐶 and 𝐷 will pass message to 𝐴

Ying et al. Graph Convolutional Neural Networks for Web-Scale Recommender Systems, KDD 2018

 In expectation, we can get embeddings similar to the case where all the neighbors are used
▪ Benefits: Greatly reduces computational cost
▪ And in practice it works great!
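In PyG, neighborhood sampling is available through loader classes; below is a minimal sketch assuming a recent PyG version that provides `NeighborLoader` (the dataset, fan-out, and batch size are illustrative):

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

data = Planetoid(root="/tmp/Cora", name="Cora")[0]

# Sample at most 2 neighbors per node for each of 2 GNN layers
loader = NeighborLoader(data, num_neighbors=[2, 2], batch_size=128, shuffle=True)

for batch in loader:
    # batch is a subgraph induced by the sampled neighborhoods;
    # run the GNN and compute the loss on this batch
    pass
```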
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

(5) Learning objective

Next: How do we train a GNN?

So far what we have covered:
(Pipeline: Input Graph → Graph Neural Network → Node embeddings → Prediction head → Predictions; Predictions and Labels feed into the Loss function and Evaluation metrics.)

 Output of a GNN: set of node embeddings $\{\mathbf{h}_v^{(L)}, \forall v \in G\}$
(1) Different prediction heads:
- Node-level tasks
- Edge-level tasks
- Graph-level tasks
 Idea: Different task levels require different prediction heads:
▪ Node-level prediction
▪ Graph-level prediction
▪ Edge-level prediction
 Node-level prediction: We can directly make predictions using node embeddings!
 After GNN computation, we have $d$-dimensional node embeddings: $\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\}$
 Suppose we want to make $k$-way prediction
▪ Classification: classify among $k$ categories
▪ Regression: regress on $k$ targets
▪ Prediction head: $\hat{\mathbf{y}}_v = \text{Head}_{\text{node}}(\mathbf{h}_v^{(L)}) = \mathbf{W}^{(H)} \mathbf{h}_v^{(L)}$
▪ $\mathbf{W}^{(H)} \in \mathbb{R}^{k \times d}$: we map node embeddings from $\mathbf{h}_v^{(L)} \in \mathbb{R}^d$ to $\hat{\mathbf{y}}_v \in \mathbb{R}^k$ so that we can compute the loss
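A minimal sketch of such a node-level head as a single linear layer (dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

d, k = 64, 7                 # embedding dimension, number of classes
head_node = nn.Linear(d, k)  # W^(H): maps h_v in R^d to y_hat_v in R^k

h = torch.randn(100, d)      # node embeddings h_v^(L) from the GNN
y_hat = head_node(h)         # [100, k] logits, one row per node
```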
 Edge-level prediction: Make predictions using pairs of node embeddings
 Suppose we want to make $k$-way prediction: $\hat{\mathbf{y}}_{uv} = \text{Head}_{\text{edge}}(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)})$
 What are the options for $\text{Head}_{\text{edge}}$?
 Options for $\text{Head}_{\text{edge}}(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)})$:
 (1) Concatenation + Linear
▪ We have seen this in graph attention
▪ $\hat{\mathbf{y}}_{uv} = \text{Linear}(\text{Concat}(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}))$
▪ Here Linear(⋅) maps the $2d$-dimensional concatenated embedding to a $k$-dimensional output ($k$-way prediction)
 Options for $\text{Head}_{\text{edge}}(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)})$:
 (2) Dot product
▪ $\hat{\mathbf{y}}_{uv} = (\mathbf{h}_u^{(L)})^{T} \mathbf{h}_v^{(L)}$
▪ This approach only applies to 1-way prediction (e.g., link prediction: predict the existence of an edge)
▪ Applying to $k$-way prediction:
▪ Similar to multi-head attention: $\mathbf{W}^{(1)}, \dots, \mathbf{W}^{(k)}$ are trainable
$\hat{\mathbf{y}}_{uv}^{(1)} = (\mathbf{h}_u^{(L)})^{T} \mathbf{W}^{(1)} \mathbf{h}_v^{(L)}$
$\quad\vdots$
$\hat{\mathbf{y}}_{uv}^{(k)} = (\mathbf{h}_u^{(L)})^{T} \mathbf{W}^{(k)} \mathbf{h}_v^{(L)}$
$\hat{\mathbf{y}}_{uv} = \text{Concat}(\hat{\mathbf{y}}_{uv}^{(1)}, \dots, \hat{\mathbf{y}}_{uv}^{(k)}) \in \mathbb{R}^k$
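A minimal sketch of both edge-level heads, assuming node embeddings `h` of shape `[num_nodes, d]` and an edge list in the usual `[2, num_edges]` layout (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

d, k = 64, 4
h = torch.randn(10, d)                        # node embeddings from the GNN
edge_index = torch.tensor([[0, 2], [1, 3]])   # two edges: (0, 1) and (2, 3)
h_u, h_v = h[edge_index[0]], h[edge_index[1]]

# (1) Concatenation + Linear: map the 2d-dim concatenation to k logits
lin_head = nn.Linear(2 * d, k)
y_concat = lin_head(torch.cat([h_u, h_v], dim=-1))    # [num_edges, k]

# (2) Dot product (1-way), and its k-way bilinear generalization
y_dot = (h_u * h_v).sum(dim=-1)                        # [num_edges]
W = torch.randn(k, d, d)                               # W^(1), ..., W^(k)
y_kway = torch.einsum('ed,kdf,ef->ek', h_u, W, h_v)    # [num_edges, k]
```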
 Graph-level prediction: Make predictions using all the node embeddings in our graph
 Suppose we want to make $k$-way prediction: $\hat{\mathbf{y}}_G = \text{Head}_{\text{graph}}(\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\})$
 $\text{Head}_{\text{graph}}(\cdot)$ is similar to $\text{AGG}(\cdot)$ in a GNN layer!
K. Xu*, W. Hu*, J. Leskovec, S. Jegelka. How Powerful Are Graph Neural Networks, ICLR 2019

 Options for $\text{Head}_{\text{graph}}(\cdot)$:
 (1) Global mean pooling: $\hat{\mathbf{y}}_G = \text{Mean}(\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\})$
 (2) Global max pooling: $\hat{\mathbf{y}}_G = \text{Max}(\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\})$
 (3) Global sum pooling: $\hat{\mathbf{y}}_G = \text{Sum}(\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\})$
 These options work great for small graphs
 Can we do better for large graphs?
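A minimal sketch of the three pooling options over the node embeddings of a single graph (PyG also provides batched versions such as `global_mean_pool`; the plain-tensor version below is illustrative):

```python
import torch

h = torch.randn(5, 64)        # node embeddings h_v^(L) for one graph

y_mean = h.mean(dim=0)        # global mean pooling
y_max = h.max(dim=0).values   # global max pooling
y_sum = h.sum(dim=0)          # global sum pooling
```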
 Issue: Global pooling over a (large) graph will lose
information
 Toy example: we use 1-dim node embeddings
▪ Node embeddings for 𝐺1: {−1, −2, 0, 1, 2}
▪ Node embeddings for 𝐺2: {−10, −20, 0, 10, 20}
▪ Clearly 𝐺1 and 𝐺2 have very different node embeddings
→ Their structures should be different
 If we do global sum pooling:
▪ Prediction for 𝐺1: $\hat{y}_{G_1} = \text{Sum}(\{-1, -2, 0, 1, 2\}) = 0$
▪ Prediction for 𝐺2: $\hat{y}_{G_2} = \text{Sum}(\{-10, -20, 0, 10, 20\}) = 0$
▪ We cannot differentiate 𝐺1 and 𝐺2!
 A solution: Let's aggregate all the node embeddings hierarchically
▪ Toy example: We will aggregate via $\text{ReLU}(\text{Sum}(\cdot))$
▪ We first separately aggregate the first 2 nodes and the last 3 nodes
▪ Then we aggregate again to make the final prediction
▪ 𝐺1 node embeddings: {−1, −2, 0, 1, 2}
▪ Round 1: $\hat{y}_a = \text{ReLU}(\text{Sum}(\{-1, -2\})) = 0$, $\hat{y}_b = \text{ReLU}(\text{Sum}(\{0, 1, 2\})) = 3$
▪ Round 2: $\hat{y}_{G_1} = \text{ReLU}(\text{Sum}(\{\hat{y}_a, \hat{y}_b\})) = 3$
▪ 𝐺2 node embeddings: {−10, −20, 0, 10, 20}
▪ Round 1: $\hat{y}_a = \text{ReLU}(\text{Sum}(\{-10, -20\})) = 0$, $\hat{y}_b = \text{ReLU}(\text{Sum}(\{0, 10, 20\})) = 30$
▪ Round 2: $\hat{y}_{G_2} = \text{ReLU}(\text{Sum}(\{\hat{y}_a, \hat{y}_b\})) = 30$
▪ Now we can differentiate 𝐺1 and 𝐺2!
Ying et al. Hierarchical Graph Representation Learning with Differentiable Pooling, NeurIPS 2018

 DiffPool idea:
▪ Hierarchically pool node embeddings

▪ Leverage 2 independent GNNs at each level


▪ GNN A: Compute node embeddings
▪ GNN B: Compute the cluster that a node belongs to
▪ GNNs A and B at each level can be executed in parallel
 DiffPool idea:

▪ For each Pooling layer


▪ Use clustering assignments from GNN B to aggregate node
embeddings generated by GNN A
▪ Create a single new node for each cluster, maintaining edges between clusters to generate a new pooled network
▪ Jointly train GNN A and GNN B
(2) Where does ground-truth come from?
- Supervised labels
- Unsupervised signals
 Supervised learning on graphs
▪ Labels come from external sources
▪ E.g., predict drug likeness of a molecular graph
 Unsupervised learning on graphs
▪ Signals come from graphs themselves
▪ E.g., link prediction: predict if two nodes are connected
 Sometimes the differences are blurry
▪ We still have “supervision” in unsupervised learning
▪ E.g., train a GNN to predict node clustering coefficient
▪ An alternative name for “unsupervised” is “self-
supervised”

 Supervised labels come from the specific use
cases. For example:
▪ Node labels 𝒚𝒗: in a citation network, which subject
area does a node belong to
▪ Edge labels 𝒚𝒖𝒗: in a transaction network, whether an
edge is fraudulent
▪ Graph labels 𝒚𝐺 : among molecular graphs, the drug
likeness of graphs
 Advice: Reduce your task to node / edge / graph
labels, since they are easy to work with
▪ E.g., we knew some nodes form a cluster. We can treat
the cluster that a node belongs to as a node label
 The problem: sometimes we only have a graph,
without any external labels
 The solution: “self-supervised learning”, we can
find supervision signals within the graph.
▪ For example, we can let GNN predict the following:
▪ Node-level 𝒚𝑣 . Node statistics: such as clustering
coefficient, PageRank, …
▪ Edge-level 𝒚𝑢𝑣 . Link prediction: hide the edge
between two nodes, predict if there should be a link
▪ Graph-level 𝒚𝐺 . Graph statistics: for example, predict
if two graphs are isomorphic
▪ These tasks do not require any external labels!
(3) How do we compute the final loss?


- Classification loss
- Regression loss
 The setting: We have 𝑁 data points
▪ Each data point can be a node/edge/graph
▪ Node-level: prediction $\hat{\mathbf{y}}_v^{(i)}$, label $\mathbf{y}_v^{(i)}$
▪ Edge-level: prediction $\hat{\mathbf{y}}_{uv}^{(i)}$, label $\mathbf{y}_{uv}^{(i)}$
▪ Graph-level: prediction $\hat{\mathbf{y}}_G^{(i)}$, label $\mathbf{y}_G^{(i)}$
▪ We will use prediction $\hat{\mathbf{y}}^{(i)}$ and label $\mathbf{y}^{(i)}$ to refer to predictions at all levels
 Classification: labels $\mathbf{y}^{(i)}$ with discrete values
▪ E.g., node classification: which category does a node belong to
 Regression: labels $\mathbf{y}^{(i)}$ with continuous values
▪ E.g., predict the drug likeness of a molecular graph
 GNNs can be applied to both settings
 Differences: loss function & evaluation metrics
 As discussed in lecture 6, cross entropy (CE) is a very common loss function in classification
 $K$-way prediction for the $i$-th data point:
$\text{CE}(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}) = -\sum_{j=1}^{K} \mathbf{y}_j^{(i)} \log \hat{\mathbf{y}}_j^{(i)}$
where:
▪ $\mathbf{y}^{(i)} \in \mathbb{R}^K$ = one-hot label encoding, e.g., [0, 0, 1, 0, 0]
▪ $\hat{\mathbf{y}}^{(i)} \in \mathbb{R}^K$ = prediction after Softmax(⋅), e.g., [0.1, 0.3, 0.4, 0.1, 0.1]
 Total loss over all $N$ training examples: $\text{Loss} = \sum_{i=1}^{N} \text{CE}(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})$
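A minimal sketch of this loss in PyTorch; note that `nn.CrossEntropyLoss` applies Softmax internally, so it takes raw logits and integer class labels:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 5)           # predictions for 8 data points, K = 5 classes
labels = torch.randint(0, 5, (8,))   # ground-truth class index per data point

loss = nn.CrossEntropyLoss()(logits, labels)  # cross entropy averaged over the batch
```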
 For regression tasks we often use Mean Squared Error (MSE), a.k.a. L2 loss
 $K$-way regression for the $i$-th data point:
$\text{MSE}(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}) = \sum_{j=1}^{K} \left(\mathbf{y}_j^{(i)} - \hat{\mathbf{y}}_j^{(i)}\right)^2$
where:
▪ $\mathbf{y}^{(i)} \in \mathbb{R}^K$ = real-valued vector of targets, e.g., [1.4, 2.3, 1.0, 0.5, 0.6]
▪ $\hat{\mathbf{y}}^{(i)} \in \mathbb{R}^K$ = real-valued vector of predictions, e.g., [0.9, 2.8, 2.0, 0.3, 0.8]
 Total loss over all $N$ training examples: $\text{Loss} = \sum_{i=1}^{N} \text{MSE}(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})$
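A corresponding sketch for regression with PyTorch's MSE loss (which averages the squared error over all entries by default):

```python
import torch
import torch.nn as nn

preds = torch.randn(8, 5)     # predicted targets for 8 data points, K = 5 targets
targets = torch.randn(8, 5)   # ground-truth real-valued targets

loss = nn.MSELoss()(preds, targets)
```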
(4) How do we measure the success of a GNN?
- Accuracy
- ROC AUC
 We use standard evaluation metrics for GNNs
▪ (The content below can be found in any ML course)
▪ In practice we will use sklearn for implementation
▪ Suppose we make predictions for $N$ data points
 Evaluate regression tasks on graphs:
▪ Root mean square error (RMSE): $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}$
▪ Mean absolute error (MAE): $\frac{1}{N}\sum_{i=1}^{N}\left|y^{(i)} - \hat{y}^{(i)}\right|$
 Evaluate classification tasks on graphs:
 (1) Multi-class classification
▪ We simply report the accuracy

 (2) Binary classification


▪ Metrics sensitive to classification threshold
▪ Accuracy
▪ Precision / Recall
▪ If the range of prediction is [0,1], we will use 0.5 as threshold
▪ Metric agnostic to the classification threshold
▪ ROC AUC

 Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
 Precision (P): $\frac{TP}{TP + FP}$
 Recall (R): $\frac{TP}{TP + FN}$
 F1-Score: $\frac{2 P R}{P + R}$
(The slide also shows the confusion matrix and an sklearn classification report built from these metrics.)
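A minimal sketch of computing these metrics (plus ROC AUC from the following slides) with sklearn; the labels and scores below are illustrative:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1, 0]
y_score = [0.2, 0.8, 0.6, 0.4, 0.3, 0.1]          # predicted probabilities
y_pred = [1 if s > 0.5 else 0 for s in y_score]   # threshold at 0.5

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))             # threshold-agnostic
```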
 ROC Curve: Captures the tradeoff between TPR (true positive rate) and FPR (false positive rate) as the classification threshold of a binary classifier is varied.

(Figure: ROC curve with TPR on the y-axis and FPR on the x-axis; the dashed diagonal represents the performance of a random classifier. Image credit: Wikipedia.)
Content Credit: Wikipedia

 ROC AUC: Area under the ROC Curve.


 Intuition: The probability that a classifier will rank a
randomly chosen positive instance higher than a
randomly chosen negative one

(5) How do we split our dataset
into train / validation / test set?

Dataset split

 Fixed split: We will split our dataset once
▪ Training set: used for optimizing GNN parameters
▪ Validation set: develop model/hyperparameters
▪ Test set: held out until we report final performance
 A concern: sometimes we cannot guarantee
that the test set will really be held out
 Random split: we will randomly split our
dataset into training / validation / test
▪ We report average performance over different
random seeds
 Suppose we want to split an image dataset
▪ Image classification: Each data point is an image
▪ Here data points are independent
▪ Image 5 will not affect our prediction on image 1

(Figure: six image data points, assigned to training, validation, and test sets.)
 Splitting a graph dataset is different!
▪ Node classification: Each data point is a node
▪ Here data points are NOT independent
▪ Node 5 will affect our prediction on node 1, because it will
participate in message passing → affect node 1’s embedding

(Figure: the same six points, now nodes in a single graph, assigned to training, validation, and test sets.)

 What are our options?
 Solution 1 (Transductive setting): The input
graph can be observed in all the dataset splits
(training, validation and test set).
 We will only split the (node) labels
▪ At training time, we compute embeddings using the
entire graph, and train using node 1&2’s labels
▪ At validation time, we compute embeddings using
the entire graph, and evaluate on node 3&4’s labels

(Figure: one graph with six nodes; the labels of nodes 1–2 are used for training, 3–4 for validation, and 5–6 for testing.)
 Solution 2 (Inductive setting): We break the edges
between splits to get multiple graphs
▪ Now we have 3 graphs that are independent. Node 5 will
not affect our prediction on node 1 any more
▪ At training time, we compute embeddings using the
graph over node 1&2, and train using node 1&2’s labels
▪ At validation time, we compute embeddings using the
graph over node 3&4, and evaluate on node 3&4’s labels

(Figure: the graph is broken into three independent subgraphs: nodes 1–2 for training, 3–4 for validation, and 5–6 for testing.)
 Transductive setting: training / validation / test
sets are on the same graph
▪ The dataset consists of one graph
▪ The entire graph can be observed in all dataset splits,
we only split the labels
▪ Only applicable to node / edge prediction tasks
 Inductive setting: training / validation / test sets
are on different graphs
▪ The dataset consists of multiple graphs
▪ Each split can only observe the graph(s) within the split.
A successful model should generalize to unseen graphs
▪ Applicable to node / edge / graph tasks
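A minimal sketch of the transductive node-level setting: the whole graph is always visible, and only the node labels are split via boolean masks (the mask names follow a common PyG convention; the split sizes are illustrative):

```python
import torch

num_nodes = 6
perm = torch.randperm(num_nodes)

train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

train_mask[perm[:2]] = True   # train on the labels of 2 nodes
val_mask[perm[2:4]] = True    # validate on the labels of 2 nodes
test_mask[perm[4:]] = True    # test on the labels of the remaining nodes

# Embeddings are always computed on the ENTIRE graph;
# the loss is evaluated only on the training-mask labels during training.
```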
 Transductive node classification
▪ All the splits can observe the entire graph structure, but
can only observe the labels of their respective nodes
 Inductive node classification
▪ Suppose we have a dataset of 3 graphs
▪ Each split contains an independent graph
 Only the inductive setting is well defined for
graph classification
▪ Because we have to test on unseen graphs
▪ Suppose we have a dataset of 5 graphs. Each split
will contain independent graph(s).

 Goal of link prediction: predict missing edges
 Setting up link prediction is tricky:
▪ Link prediction is an unsupervised / self-supervised
task. We need to create the labels and dataset
splits on our own
▪ Concretely, we need to hide some edges from the GNN and then let the GNN predict whether the edges exist
(Figure: the original graph; the input graph to the GNN with some edges hidden; and the predictions the GNN makes for the hidden edges.)
(Figure: the original graph, with its edges divided into message edges and supervision edges.)

 For link prediction, we will split edges twice


 Step 1: Assign 2 types of edges in the original graph
▪ Message edges: Used for GNN message passing
▪ Supervision edges: Used for computing objectives
▪ After step 1:
▪ Only message edges will remain in the graph
▪ Supervision edges are used as supervision for edge predictions made by the model, and will not be fed into the GNN!
 Step 2: Split edges into train / validation / test
 Option 1: Inductive link prediction split
▪ Suppose we have a dataset of 3 graphs. Each
inductive split will contain an independent graph

(Figure: three independent graphs 𝐺1, 𝐺2, 𝐺3, used as the training, validation, and test sets respectively.)
▪ Within each split (train, validation, or test), each graph will have 2 types of edges: message edges + supervision edges
▪ Supervision edges are not the input to the GNN
(Figure: the same three graphs, each containing both message edges and supervision edges.)
 Option 2: Transductive link prediction split:
▪ This is the default setting when people talk about
link prediction
▪ Suppose we have a dataset of 1 graph

 Option 2: Transductive link prediction split:
▪ By definition of “transductive”, the entire graph can
be observed in all dataset splits
▪ But since edges are both part of graph structure and the
supervision, we need to hold out validation / test edges
▪ To train on the training set, we further need to hold out supervision edges within the training set
▪ Next: we will show the exact settings
 Option 2: Transductive link prediction split:
(Figure: the original graph, followed by the three stages below.)
▪ (1) At training time: Use training message edges to predict training supervision edges
▪ (2) At validation time: Use training message edges & training supervision edges to predict validation edges
▪ (3) At test time: Use training message edges & training supervision edges & validation edges to predict test edges
 Summary: Transductive link prediction split:

(Figure: the original graph is split into a graph with 4 types of edges: training message edges, training supervision edges, validation edges, and test edges.)

▪ Note: Link prediction settings are tricky and complex. You may find that papers do link prediction differently.
▪ Luckily, we have full support in PyG and GraphGym
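For reference, a minimal sketch of producing a transductive link prediction split with PyG's `RandomLinkSplit` transform (assuming a recent PyG version; the split fractions are illustrative):

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import RandomLinkSplit

data = Planetoid(root="/tmp/Cora", name="Cora")[0]

transform = RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=True)
train_data, val_data, test_data = transform(data)

# Each split keeps its message edges in edge_index, while the supervision
# edges (with sampled negatives) live in edge_label_index / edge_label.
```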

Implementation resources:
DeepSNAP provides core modules for this pipeline
GraphGym further implements the full pipeline to facilitate GNN design
 We introduce a general GNN framework:
▪ GNN Layer:
▪ Transformation + Aggregation
▪ Classic GNN layers: GCN, GraphSAGE, GAT
▪ Layer connectivity:
▪ The over-smoothing problem
▪ Solution: skip connections
▪ Graph Augmentation:
▪ Feature augmentation
▪ Structure augmentation
▪ Learning Objectives
▪ The full training pipeline of a GNN
