
Note to other teachers and users of these slides: We would be delighted if you found our

material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://cs224w.Stanford.edu

CS224W: Machine Learning with Graphs


Charilaos Kanatsoulis and Jure Leskovec, Stanford
University
http://cs224w.stanford.edu
¡ Project Proposal due today
§ Gradescope submissions close at 11:59 PM
¡ Colab 2 due this Thursday
¡ Homework 2: UPDATED + NEW DUE DATE
§ HW2 Problem 4 has been removed
§ Updated Due Date: Monday Nov 4th, 2024

¡ Slide pre-viewing
We upload the slides the day before the lecture.
Please check it out!

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ So far we have only handled graphs with a single edge type
¡ How do we handle graphs with multiple node or edge types (a.k.a. heterogeneous graphs)?
¡ Goal: Learning with heterogeneous graphs
§ Relational GCNs
§ Design space for heterogeneous GNNs
§ Heterogeneous Graph Transformer (Time
permitting)

2 types of nodes:
¡ Node type A: Paper nodes
¡ Node type B: Author nodes
2 types of edges:
¡ Edge type A: Like
¡ Edge type B: Cite
A graph could have multiple types of nodes and
edges! 2 types of nodes + 2 types of edges.

8 possible relation types!

(Paper, Cite, Paper) (Author, Cite, Author)

(Paper, Like, Paper) (Author, Like, Author)

(Paper, Cite, Author) (Author, Cite, Paper)

(Paper, Like, Author) (Author, Like, Paper)

Relation types: (node_start, edge, node_end)


¡ We use relation type to describe an edge (as
opposed to edge type)
¡ Relation type better captures the interaction
between nodes and edges
¡ A heterogeneous graph is defined as $G = (V, E, \tau, \phi)$
§ Nodes with node types: $v \in V$
§ Node type for node $v$: $\tau(v)$
§ Edges with edge types: $(u, v) \in E$ (an edge can be described as a pair of nodes)
§ Edge type for edge $(u, v)$: $\phi(u, v)$
§ Relation type for edge $(u, v)$ is a tuple: $r(u, v) = (\tau(u), \phi(u, v), \tau(v))$
¡ There are other definitions for heterogeneous graphs as well – they all describe graphs with node & edge types
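To make the definition concrete, here is a minimal sketch of one way to store such a graph in code. This is a hand-rolled container, not a standard library API; the type names, feature dimensions, and edge lists are all illustrative.

```python
import torch

# Node features are stored per node type; edges are stored per relation type
# (node_start, edge_type, node_end), mirroring r(u, v) = (tau(u), phi(u, v), tau(v)).
node_features = {
    "author": torch.randn(3, 4),   # 3 author nodes with 4-dim features
    "paper":  torch.randn(2, 5),   # 2 paper nodes with 5-dim features
}
edge_index = {
    # (tau(u), phi(u, v), tau(v)) -> [source ids; target ids]
    ("author", "writes", "paper"): torch.tensor([[0, 1, 2], [0, 0, 1]]),
    ("paper",  "cites",  "paper"): torch.tensor([[0], [1]]),
}
```

Here the node type map $\tau(v)$ is simply the dictionary key a node lives under, and the relation type is the 3-tuple key of `edge_index`.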
Biomedical Knowledge Graphs:
§ Example node: Migraine
§ Example relation: (fulvestrant, Treats, Breast Neoplasms)
§ Example node type: Protein
§ Example edge type: Causes

Event Graphs:
§ Example node: SFO
§ Example relation: (UA689, Origin, LAX)
§ Example node type: Flight
§ Example edge type: Destination
¡ Example: E-Commerce Graph
§ Node types: User, Item, Query, Location, ...
§ Edge types: Purchase, Visit, Guide, Search, …
§ Different node types can have different feature spaces!

¡ Example: Academic Graph
§ Node types: Author, Paper, Venue, Field, ...
§ Edge types: Publish, Cite, …
§ Benchmark dataset: Microsoft Academic Graph

¡ Observation: We can also treat the types of nodes and edges as features
§ Example: add a one-hot type indicator to nodes and edges
§ Append feature [1, 0] to each “author node”; append feature [0, 1] to each “paper node”
§ Similarly, we can assign type-indicating features to edges of different types
§ Then, a heterogeneous graph reduces to a standard graph (see the sketch below)
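A minimal sketch of this reduction, assuming both node types already share the same base feature dimension (otherwise one would pad first); all shapes are illustrative:

```python
import torch

# Append one-hot type indicators: [1, 0] for authors, [0, 1] for papers.
author_x = torch.randn(3, 4)   # 3 authors, 4-dim features
paper_x  = torch.randn(2, 4)   # 2 papers, 4-dim features

author_x = torch.cat([author_x, torch.tensor([[1., 0.]]).expand(3, 2)], dim=1)
paper_x  = torch.cat([paper_x,  torch.tensor([[0., 1.]]).expand(2, 2)], dim=1)

x = torch.cat([author_x, paper_x], dim=0)   # one homogeneous (5, 6) node set
```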
¡ When do we need a heterogeneous graph?
§ Case 1: Different node/edge types have different shapes of features
§ E.g., an “author node” has a 4-dim feature, while a “paper node” has a 5-dim feature
§ Case 2: We know that different relation types represent different types of interactions
§ E.g., (English, translate, French) and (English, translate, Chinese) require different models
¡ There are many ways to convert a heterogeneous graph to a standard graph (that is, a homogeneous graph)
¡ Ultimately, a heterogeneous graph is a more expressive graph representation
§ Captures different types of interactions between entities
¡ But it also comes with costs
§ More expensive (computation, storage)
§ More complex implementation

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017

¡ (1) Graph Convolutional Networks (GCN)

$$\mathbf{h}_v^{(l)} = \sigma\left(\mathbf{W}^{(l)} \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|}\right)$$

¡ How to write this as Message + Aggregation?

$$\mathbf{h}_v^{(l)} = \sigma\Bigg(\underbrace{\sum_{u \in N(v)}}_{\text{(2) Aggregation}} \underbrace{\mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|}}_{\text{(1) Message}}\Bigg)$$
¡ We will extend GCN to handle heterogeneous
graphs with multiple edge/relation types
¡ We start with a directed graph with one relation
§ How do we run GCN and update the representation of
the target node A on this graph?

[Figure: directed input graph with target node A and its neighbors B–F]
§ Key idea: only pass messages along the direction of the edges
[Figure: the resulting computation graph for target node A]
¡ What if the graph has multiple relation types?
[Figure: input graph whose edges carry relation types $r_1$, $r_2$, $r_3$]
¡ Use different neural network weights for different relation types!
[Figure: weights $\mathbf{W}_{r_1}$ for $r_1$, $\mathbf{W}_{r_2}$ for $r_2$, $\mathbf{W}_{r_3}$ for $r_3$ applied to the input graph]
[Figure: computation graph for target node A — relation-specific neural networks transform each neighbor's message, followed by aggregation]
Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017

¡ (1) Graph Convolutional Networks (GCN)

$$\mathbf{h}_v^{(l)} = \sigma\left(\sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|}\right)$$

¡ We add a self-loop:

$$\mathbf{h}_v^{(l)} = \sigma\left(\sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} + \mathbf{W}_0^{(l)} \mathbf{h}_v^{(l-1)}\right)$$
¡ Introduce a set of neural networks for each relation type!
[Figure: one weight matrix per relation (rel_1 … rel_N), plus a weight for the self-loop]
¡ Relational GCN (RGCN):

$$\mathbf{h}_v^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{u \in N_v^r} \frac{1}{c_{v,r}} \mathbf{W}_r^{(l)} \mathbf{h}_u^{(l)} + \mathbf{W}_0^{(l)} \mathbf{h}_v^{(l)}\right)$$

¡ How to write this as Message + Aggregation?
¡ Message:
§ Each neighbor of a given relation, normalized by the node degree of the relation $c_{v,r} = |N_v^r|$:

$$\mathbf{m}_{u,r}^{(l)} = \frac{1}{c_{v,r}} \mathbf{W}_r^{(l)} \mathbf{h}_u^{(l)}$$

§ Self-loop:

$$\mathbf{m}_v^{(l)} = \mathbf{W}_0^{(l)} \mathbf{h}_v^{(l)}$$

¡ Aggregation:
§ Sum over messages from neighbors and self-loop, then apply activation:

$$\mathbf{h}_v^{(l+1)} = \sigma\left(\mathrm{Sum}\left(\left\{\mathbf{m}_{u,r}^{(l)}, u \in N(v)\right\} \cup \left\{\mathbf{m}_v^{(l)}\right\}\right)\right)$$

(a minimal layer sketch follows)
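A minimal PyTorch sketch of one RGCN layer, assuming dense per-relation adjacency matrices and ReLU as the nonlinearity; all names are illustrative, not an official implementation:

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """Minimal RGCN layer sketch with dense per-relation adjacency matrices."""
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)])
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj_per_rel):
        out = self.self_weight(h)                              # W_0 h_v (self-loop)
        for W_r, A_r in zip(self.rel_weights, adj_per_rel):
            deg = A_r.sum(dim=1, keepdim=True).clamp(min=1.0)  # c_{v,r} = |N_v^r|
            out = out + (A_r @ W_r(h)) / deg                   # normalized sum over N_v^r
        return torch.relu(out)                                 # sigma = ReLU here

layer = RGCNLayer(in_dim=4, out_dim=8, num_relations=2)
h = torch.randn(5, 4)                                          # 5 nodes
adjs = [torch.bernoulli(torch.full((5, 5), 0.3)) for _ in range(2)]
h_next = layer(h, adjs)                                        # (5, 8)
```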
" # $
¡ Each relation has 𝐿 matrices: 𝐖! , 𝐖! ⋯ 𝐖!
%
¡ The size of each 𝐖! is 𝑑 (%'") ×𝑑 (%) 𝑑 is the hidden (")

dimension in layer 𝑙

¡ Rapid growth of the number of parameters w.r.t


number of relations!
§ Overfitting becomes an issue
(𝒍)
¡ Two methods to regularize the weights 𝐖𝒓
§ (1) Use block diagonal matrices
§ (2) Basis/Dictionary learning
10/22/24 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 28
¡ Key insight: make the weights sparse!
¡ Use block diagonal matrices for $\mathbf{W}_r$
[Figure: $\mathbf{W}_r$ drawn as a block diagonal matrix]
§ Limitation: only nearby neurons/dimensions can interact through $\mathbf{W}_r$
¡ If we use $B$ blocks, then the number of parameters reduces from $d^{(l+1)} \times d^{(l)}$ to $B \times \frac{d^{(l+1)}}{B} \times \frac{d^{(l)}}{B}$ (see the sketch below)
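A sketch of the block-diagonal parameterization for one relation's weight, with made-up dimensions:

```python
import torch
import torch.nn as nn

# With B blocks, parameters drop from d_in * d_out to B * (d_in/B) * (d_out/B).
# Only dimensions within the same block can interact.
B, d_in, d_out = 4, 64, 64
blocks = nn.Parameter(torch.randn(B, d_in // B, d_out // B))

def block_diag_matmul(h):
    h_blocks = h.view(-1, B, d_in // B)                    # split into B groups
    out = torch.einsum('nbi,bio->nbo', h_blocks, blocks)   # per-block matmul
    return out.reshape(-1, d_out)

h = torch.randn(10, d_in)
print(block_diag_matmul(h).shape)   # torch.Size([10, 64])
```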
¡ Key insight: share weights across different relations!
¡ Represent the matrix of each relation as a linear combination of basis transformations:
$$\mathbf{W}_r = \sum_{b=1}^{B} a_{rb} \cdot \mathbf{V}_b,$$
where $\mathbf{V}_b$ is shared across all relations
§ $\mathbf{V}_b$ are the basis matrices
§ $a_{rb}$ is the importance weight of matrix $\mathbf{V}_b$
¡ Now each relation only needs to learn $\{a_{rb}\}_{b=1}^{B}$, which is $B$ scalars (a sketch follows)
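A minimal sketch of basis decomposition, with illustrative sizes:

```python
import torch
import torch.nn as nn

# Each relation's weight is a learned mixture of B shared basis matrices,
# so each extra relation only adds B scalars instead of a full d x d matrix.
num_relations, B, d = 8, 3, 16
V = nn.Parameter(torch.randn(B, d, d))           # shared basis matrices V_b
a = nn.Parameter(torch.randn(num_relations, B))  # per-relation weights a_rb

def weight_for_relation(r):
    # W_r = sum_b a_rb * V_b
    return torch.einsum('b,bij->ij', a[r], V)

W_2 = weight_for_relation(2)   # (16, 16) weight matrix for relation 2
```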
¡ Goal: predict the label of a given node
¡ RGCN uses the representation of the final layer:
§ If we predict the class of node $A$ from $k$ classes
§ Take the final layer (prediction head): $\mathbf{h}_A^{(L)} \in \mathbb{R}^k$; each entry of $\mathbf{h}_A^{(L)}$ represents the probability of the corresponding class
[Figure: input graph with target node A and relation types $r_1$, $r_2$, $r_3$]
¡ Link prediction: make predictions using pairs of node embeddings

$$\hat{y}_{uv} = f\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)$$

¡ What are the options for $f\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)$?
¡ Options for $f\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)$:
¡ Dot product
§ $\hat{y}_{uv} = \left(\mathbf{h}_u^{(L)}\right)^{\top} \mathbf{h}_v^{(L)}$
§ This approach only applies to 1-way prediction (e.g., link prediction: predicting the existence of an edge)
¡ Transductive link prediction split:
[Figure: the original graph and its three copies used at (1) training, (2) validation, and (3) test time]
§ (1) At training time: use training message edges to predict training supervision edges
§ (2) At validation time: use training message edges & training supervision edges to predict validation edges
§ (3) At test time: use training message edges & training supervision edges & validation edges to predict test edges
¡ Link prediction split: every edge also has a relation type; the relation type is independent of the 4 edge categories (training message, training supervision, validation, test)
§ In a heterogeneous graph, the homogeneous graph formed by each single relation also has the 4 splits: training message edges for $r_1$, training supervision edges for $r_1$, validation edges for $r_1$, test edges for $r_1$, …, and likewise for every relation up to $r_n$
[Figure: the original graph, split into the 4 categories of edges]
¡ Assume $(E, r_3, A)$ is the training supervision edge and all the other edges are training message edges
¡ Use RGCN to score $(E, r_3, A)$!
§ Take the final-layer embeddings of $E$ and $A$: $\mathbf{h}_E^{(L)}, \mathbf{h}_A^{(L)} \in \mathbb{R}^d$
§ Relation-specific score function: $f_r: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
§ One example: $f_{r_3}(\mathbf{h}_E, \mathbf{h}_A) = \mathbf{h}_E^{\top} \mathbf{W}_{r_3} \mathbf{h}_A$, with $\mathbf{W}_{r_3} \in \mathbb{R}^{d \times d}$ (a scoring sketch follows)
[Figure: input graph with supervision edge $(E, r_3, A)$ and relation types $r_1$, $r_2$, $r_3$]
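A sketch of the relation-specific bilinear score; in a real model the embeddings come from the RGCN's final layer, here they are random stand-ins:

```python
import torch

# Relation-specific bilinear scoring: f_r3(h_E, h_A) = h_E^T W_r3 h_A.
d = 16
W_r3 = torch.randn(d, d)                     # learnable in a real model
h_E, h_A = torch.randn(d), torch.randn(d)    # final-layer node embeddings

score = h_E @ W_r3 @ h_A                     # scalar score for (E, r3, A)
```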
¡ Training:
1. Use RGCN to score the training supervision edge $(E, r_3, A)$
2. Create a negative edge by perturbing the supervision edge: corrupt the tail of $(E, r_3, A)$, e.g., $(E, r_3, B)$ or $(E, r_3, D)$
§ Note: the negative edges should NOT belong to the training message edges or the training supervision edges! E.g., $(E, r_3, C)$ is NOT a valid negative edge
[Figure: input graph — training supervision edge $(E, r_3, A)$; training message edges are all the remaining existing edges (solid lines)]
§ (1) Use training message edges to predict training supervision edges
¡ Training (continued):
3. Use the GNN model to score the negative edge
4. Optimize a standard cross-entropy loss (as discussed in Lecture 6):
§ Maximize the score of the training supervision edge
§ Minimize the score of the negative edge

$$\ell = -\log \sigma\left(f_{r_3}(\mathbf{h}_E, \mathbf{h}_A)\right) - \log\left(1 - \sigma\left(f_{r_3}(\mathbf{h}_E, \mathbf{h}_B)\right)\right)$$

where $\sigma$ is the sigmoid function (a training-step sketch follows).
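A minimal sketch of one such training step, with random stand-ins for the RGCN embeddings:

```python
import torch
import torch.nn.functional as F

# Push the positive triple (E, r3, A) up and a corrupted-tail triple
# (E, r3, B) down via the negative-sampling cross-entropy loss.
def score_fn(h_u, W_r, h_v):
    return h_u @ W_r @ h_v                     # bilinear score f_r(h_u, h_v)

d = 16
W_r3 = torch.randn(d, d, requires_grad=True)
h_E, h_A, h_B = torch.randn(d), torch.randn(d), torch.randn(d)

pos = score_fn(h_E, W_r3, h_A)                 # training supervision edge
neg = score_fn(h_E, W_r3, h_B)                 # corrupted-tail negative edge
# -log sigma(pos) - log(1 - sigma(neg)); F.logsigmoid(-neg) is the
# numerically stable form of log(1 - sigmoid(neg)).
loss = -F.logsigmoid(pos) - F.logsigmoid(-neg)
loss.backward()
```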
¡ Evaluation:
§ Validation time as an example; the same applies at test time
§ Evaluate how well the model predicts the validation edges with their relation types. Let's predict the validation edge $(E, r_3, D)$
§ Intuition: the score of $(E, r_3, D)$ should be higher than that of all $(E, r_3, v)$ where $(E, r_3, v)$ is NOT in the training message edges and training supervision edges, e.g., $(E, r_3, B)$
[Figure: input graph — validation edge $(E, r_3, D)$; training message edges & training supervision edges are all existing edges (solid lines)]
§ (2) At validation time: use training message edges & training supervision edges to predict validation edges
¡ Evaluation procedure for the validation edge $(E, r_3, D)$:
1. Calculate the score of $(E, r_3, D)$
2. Calculate the scores of all the negative edges: $\{(E, r_3, v) \mid v \in \{B, F\}\}$, since $(E, r_3, A)$ and $(E, r_3, C)$ belong to the training message edges & training supervision edges
3. Obtain the rank $RK$ of $(E, r_3, D)$
4. Calculate the metrics:
§ Hits@$k$: $\mathbf{1}[RK \le k]$. Higher is better
§ Reciprocal Rank: $\frac{1}{RK}$. Higher is better
(a minimal metric sketch follows, then three worked examples)
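A minimal sketch of these metrics for a single validation edge; the scores below are made-up numbers, not taken from the slides:

```python
import torch

# Rank the positive edge's score against its negative edges, then compute
# Hits@k and Reciprocal Rank.
def rank_of_positive(pos_score, neg_scores):
    # RK = 1 + number of negatives scoring strictly higher than the positive
    return 1 + int((neg_scores > pos_score).sum())

pos = torch.tensor(2.5)              # score of (E, r3, D)
negs = torch.tensor([3.1, 0.7])      # scores of (E, r3, B) and (E, r3, F)

rk = rank_of_positive(pos, negs)     # 2
hits_at_2 = int(rk <= 2)             # 1
reciprocal_rank = 1.0 / rk           # 0.5
```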
¡ Example 1: if $f_{r_3}(E, r_3, D) > f_{r_3}(E, r_3, B) > f_{r_3}(E, r_3, F)$, the rank $RK$ of $(E, r_3, D)$ is $1$, so Hits@2 $= 1$ and Reciprocal Rank $= 1$
¡ Example 2: if instead $f_{r_3}(E, r_3, B) > f_{r_3}(E, r_3, D) > f_{r_3}(E, r_3, F)$, then $RK = 2$, so Hits@2 $= 1$ and Reciprocal Rank $= \frac{1}{2}$
¡ Example 3: and if $f_{r_3}(E, r_3, B) > f_{r_3}(E, r_3, F) > f_{r_3}(E, r_3, D)$, then $RK = 3$, so Hits@2 $= 0$ and Reciprocal Rank $= \frac{1}{3}$
Wang et al. Microsoft academic graph: When experts are not enough. Quantitative Science Studies 2020.

¡ Benchmark dataset
§ ogbn-mag from Microsoft Academic Graph (MAG)
¡ Four (4) types of entities
§ Papers: 736k nodes
§ Authors: 1.1m nodes
§ Institutions: 9k nodes
§ Fields of study: 60k nodes

Wang et al. Microsoft academic graph: When experts are not enough. Quantitative Science Studies 2020.

¡ Benchmark dataset
§ ogbn-mag from Microsoft Academic Graph (MAG)
¡ Four (4) directed relations
§ An author is "affiliated with" an institution
§ An author "writes" a paper
§ A paper "cites" a paper
§ A paper "has a topic of" a field of study

Wang et al. Microsoft academic graph: When experts are not enough. Quantitative Science Studies 2020.

¡ Prediction task
§ Each paper has a 128-dimensional word2vec feature vector
§ Given the content, references, authors, and author affiliations
from ogbn-mag, predict the venue of each paper
§ A 349-class classification problem, since 349 venues are considered
¡ Time-based dataset splitting
§ Training set: papers published before 2018
§ Test set: papers published after 2018

Wang et al. Microsoft academic graph: When experts are not enough. Quantitative Science Studies 2020.

¡ Benchmark results:
[Figure: ogbn-mag leaderboard excerpt — the SOTA method sits well above R-GCN]
§ SOTA method: SeHGNN
§ ComplEx (next lecture) + Simplified GCN
¡ Relational GCN: a graph neural network for heterogeneous graphs
¡ Can perform entity classification as well as link prediction tasks
¡ The ideas can easily be extended to other relational GNNs (RGraphSAGE, RGIN, RGAT, etc.)
¡ Benchmark: ogbn-mag from the Microsoft Academic Graph, predicting paper venues
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

How do we extend the general GNN design space to heterogeneous graphs?
[Figure: the GNN design space — (1) message, (2) aggregation, (3) layer connectivity, (4) graph augmentation, (5) learning objective]
¡ (1) Message computation
§ Message function: $\mathbf{m}_u^{(l)} = \mathrm{MSG}^{(l)}\left(\mathbf{h}_u^{(l-1)}\right)$
§ Intuition: each node creates a message, which will be sent to other nodes later
§ Example: a linear layer $\mathbf{m}_u^{(l)} = \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}$
[Figure: computation graph for target node $v$ — (1) message, (2) aggregation]
¡ (1) Heterogeneous message computation
§ Message function: $\mathbf{m}_u^{(l)} = \mathrm{MSG}_r^{(l)}\left(\mathbf{h}_u^{(l-1)}\right)$
§ Observation: a node could receive multiple types of messages; the number of message types equals the number of relation types
§ Idea: create a different message function for each relation type $r = (u, e, v)$, where $u$ is the node that sends the message, $e$ is the edge type, and $v$ is the node that receives the message
§ Example: a linear layer $\mathbf{m}_u^{(l)} = \mathbf{W}_r^{(l)} \mathbf{h}_u^{(l-1)}$ (a sketch follows)
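A minimal sketch of per-relation message functions; relation names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

# One linear message function per relation type r = (node_start, edge, node_end).
msg_fns = nn.ModuleDict({
    "author__writes__paper": nn.Linear(4, 8),   # author features are 4-dim
    "paper__cites__paper":   nn.Linear(5, 8),   # paper features are 5-dim
})

def message(h_u, relation):
    # m_u^(l) = W_r^(l) h_u^(l-1), with W_r chosen by the relation type
    return msg_fns[relation](h_u)

m = message(torch.randn(4), "author__writes__paper")   # 8-dim message
```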
¡ (2) Aggregation
§ Intuition: each node aggregates the messages from its neighbors $N(v)$:
$$\mathbf{h}_v^{(l)} = \mathrm{AGG}^{(l)}\left(\left\{\mathbf{m}_u^{(l)}, u \in N(v)\right\}\right)$$
§ Example: $\mathrm{Sum}(\cdot)$, $\mathrm{Mean}(\cdot)$ or $\mathrm{Max}(\cdot)$ aggregator, e.g., $\mathbf{h}_v^{(l)} = \mathrm{Sum}\left(\{\mathbf{m}_u^{(l)}, u \in N(v)\}\right)$
[Figure: computation graph for target node $v$ — (1) message, (2) aggregation]
¡ (2) Heterogeneous aggregation
§ Observation: each node could receive multiple types of messages from its neighbors, and multiple neighbors may belong to each message type
§ Idea: define a 2-stage message passing:
$$\mathbf{h}_v^{(l)} = \mathrm{AGG}_{\text{all}}^{(l)}\left(\mathrm{AGG}_r^{(l)}\left(\left\{\mathbf{m}_u^{(l)}, u \in N_r(v)\right\}\right)\right)$$
§ Given all the messages sent to a node: within each message type, aggregate the messages that belong to that relation type with $\mathrm{AGG}_r^{(l)}$, then aggregate across the edge types with $\mathrm{AGG}_{\text{all}}^{(l)}$
§ Example: $\mathbf{h}_v^{(l)} = \mathrm{Concat}\left(\mathrm{Sum}\left(\left\{\mathbf{m}_u^{(l)}, u \in N_r(v)\right\}\right)\right)$ (a sketch follows)
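A minimal sketch of the 2-stage aggregation with Sum within relation types and Concat across them; relation names and sizes are illustrative:

```python
import torch

# Stage 1 (AGG_r): sum messages within each relation type.
# Stage 2 (AGG_all): concatenate the per-relation results.
def aggregate(messages_by_rel):
    # messages_by_rel: {relation type: (num_messages, dim) tensor}
    per_rel = [m.sum(dim=0) for m in messages_by_rel.values()]   # AGG_r = Sum
    return torch.cat(per_rel, dim=-1)                            # AGG_all = Concat

h_v = aggregate({
    "writes": torch.randn(3, 8),   # 3 messages arriving via "writes"
    "cites":  torch.randn(2, 8),   # 2 messages arriving via "cites"
})                                 # -> 16-dim embedding (8 per relation type)
```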
¡ (3) Layer connectivity
§ Add skip connections, pre/post-process layers

Pre-processing layers: Important when


encoding node features is necessary.
E.g., when nodes represent images/text

Post-processing layers: Important when reasoning / transformation over node embeddings is needed.
E.g., graph classification, knowledge graphs

In practice, adding these layers works great!

¡ Heterogeneous pre/post-process layers:
§ MLP layers with respect to each node type
§ Since the outputs of the GNN are node embeddings: $\mathbf{h}_v^{(l)} = \mathrm{MLP}_{T(v)}\left(\mathbf{h}_v^{(l)}\right)$, where $T(v)$ is the type of node $v$ (see the sketch below)
¡ Other successful GNN designs are also encouraged for heterogeneous GNNs: skip connections, batch/layer normalization, …
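A minimal sketch of per-node-type post-processing; here the node type $T(v)$ is simply the dictionary key the embedding is stored under:

```python
import torch
import torch.nn as nn

# Per-node-type post-processing: h_v <- MLP_{T(v)}(h_v).
post_mlps = nn.ModuleDict({
    "author": nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)),
    "paper":  nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)),
})

h = {"author": torch.randn(3, 8), "paper": torch.randn(2, 8)}
h = {node_type: post_mlps[node_type](x) for node_type, x in h.items()}
```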
¡ Graph feature manipulation
§ The input graph lacks features → feature augmentation
¡ Graph structure manipulation
§ The graph is too sparse → add virtual nodes / edges
§ The graph is too dense → sample neighbors when doing message passing
§ The graph is too large → sample subgraphs to compute embeddings
§ Will cover later in the lecture: scaling up GNNs
¡ Graph Feature manipulation
§ 2 Common options: compute graph statistics (e.g.,
node degree) within each relation type, or across the
full graph (ignoring the relation types)
¡ Graph Structure manipulation
§ Neighbor and subgraph sampling are also common
for heterogeneous graphs.
§ 2 Common options: sampling within each relation
type (ensure neighbors from each type are covered),
or sample across the full graph

Node-level prediction:
$$\hat{\mathbf{y}}_v = \mathrm{Head}_{\text{node}}\left(\mathbf{h}_v^{(L)}\right) = \mathbf{W}^{(H)} \mathbf{h}_v^{(L)}$$
Edge-level prediction:
$$\hat{\mathbf{y}}_{uv} = \mathrm{Head}_{\text{edge}}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right) = \mathrm{Linear}\left(\mathrm{Concat}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)\right)$$
Graph-level prediction:
$$\hat{\mathbf{y}}_G = \mathrm{Head}_{\text{graph}}\left(\left\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\right\}\right)$$
The heterogeneous versions condition each head on the node type or relation type:

Node-level prediction:
$$\hat{\mathbf{y}}_v = \mathrm{Head}_{\text{node},T(v)}\left(\mathbf{h}_v^{(L)}\right) = \mathbf{W}_{T(v)} \mathbf{h}_v^{(L)}$$
Edge-level prediction:
$$\hat{\mathbf{y}}_{uv} = \mathrm{Head}_{\text{edge},r}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right) = \mathrm{Linear}_r\left(\mathrm{Concat}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)\right)$$
Graph-level prediction:
$$\hat{\mathbf{y}}_G = \mathrm{AGG}\left(\mathrm{Head}_{\text{graph},i}\left(\left\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall T(v) = i\right\}\right)\right)$$
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Graph Attention Networks (GAT)
$$\mathbf{h}_v^{(l)} = \sigma\left(\sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}\right)$$
where $\alpha_{vu}$ are the attention weights.
¡ Not all of a node's neighbors are equally important:
§ Attention is inspired by cognitive attention.
§ The attention $\alpha_{vu}$ focuses on the important parts of the input data and fades out the rest.
§ Idea: the NN should devote more computing power to the small but important part of the data.
¡ Can we adapt GAT for heterogeneous graphs?
¡ HGT uses Scaled Dot-Product Attention (proposed in the Transformer)
¡ Query $Q$, Key $K$, Value $V$:
§ $Q$, $K$, $V$ have shape (batch_size, dim)
¡ How do we obtain $Q$, $K$, $V$? Apply a linear layer to the input:
§ $Q = Q\_\mathrm{Linear}(X)$
§ $K = K\_\mathrm{Linear}(X)$
§ $V = V\_\mathrm{Linear}(X)$
(a sketch follows)
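A minimal single-head sketch of scaled dot-product attention; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

# Softmax over Q K^T / sqrt(d), followed by a weighted sum of V.
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (n, n) attention logits
    return torch.softmax(scores, dim=-1) @ V      # weighted sum of values

X = torch.randn(5, 16)                            # 5 items, 16-dim features
Q_linear = nn.Linear(16, 16)
K_linear = nn.Linear(16, 16)
V_linear = nn.Linear(16, 16)
out = attention(Q_linear(X), K_linear(X), V_linear(X))   # (5, 16)
```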
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Recall: applying GAT to a homogeneous graph
§ $H^{(l)}$ is the $l$-th layer representation
[Figure: the GAT attention computation over $H^{(l)}$]
¡ How do we take the relation type (node_s, edge, node_e) into the attention computation?
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Mutual Attention:
¡ A set of neural networks computes the attention score of each edge.
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Motivation: GAT is unable to represent different node & edge types
¡ Introducing a set of neural networks for each relation type is too expensive for attention
§ Recall: a relation describes (node_s, edge, node_e)
[Figure: one weight matrix per relation (rel_1 … rel_N) — too expensive!]
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Innovation: decompose heterogeneous attention into node- and edge-type dependent attention mechanisms
§ 3 node weight matrices, 2 edge weight matrices
§ Without decomposition: $3 \times 2 \times 3 = 18$ relation types → 18 weight matrices (supposing all relation types exist)
[Figure: Paper and Author nodes feed node-type-specific Q-Linear/K-Linear layers; the Write and Cite edge types contribute edge-type attention weights]
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Heterogeneous Mutual Attention (First Attempt):

¡ Introduce a set of neural networks for the attention


scores of each relation type.
¡ Too expensive for attention!

Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Heterogeneous Mutual Attention:
¡ Each relation $(\tau(s), \varphi(e), \tau(t))$ has a distinct set of projection weights
§ $\tau(s)$: type of node $s$; $\varphi(e)$: type of edge $e$
§ $\tau(s)$ & $\tau(t)$ parameterize $K\_\mathrm{Linear}_{\tau(s)}$ & $Q\_\mathrm{Linear}_{\tau(t)}$, which in turn return the Key and Query vectors $K(s)$ & $Q(t)$
§ The edge type $\varphi(e)$ directly parameterizes $\mathbf{W}_{\varphi(e)}$
(a sketch follows)
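A minimal single-head sketch of an HGT-style decomposed attention score (the full HGT layer adds multi-head attention and further terms); type names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Key/Query linears are indexed by node type and an edge-type matrix W_phi
# sits between them, so no full per-relation network is needed.
d = 16
K_linear = nn.ModuleDict({"author": nn.Linear(d, d), "paper": nn.Linear(d, d)})
Q_linear = nn.ModuleDict({"author": nn.Linear(d, d), "paper": nn.Linear(d, d)})
W_edge   = nn.ParameterDict({"writes": nn.Parameter(torch.randn(d, d))})

def attn_score(h_s, type_s, h_t, type_t, edge_type):
    K = K_linear[type_s](h_s)                 # key from source node type tau(s)
    Q = Q_linear[type_t](h_t)                 # query from target node type tau(t)
    return (K @ W_edge[edge_type] @ Q) / d ** 0.5

s = attn_score(torch.randn(d), "author", torch.randn(d), "paper", "writes")
```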
¡ A full HGT layer: we have just computed the attention scores
¡ Similarly, HGT decomposes the weights by node & edge types in the message computation
[Figure: the full HGT layer, with separate weights for each node type and each edge type]
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

¡ Benchmark: ogbn-mag from the Microsoft Academic Graph, predicting paper venues
¡ HGT uses far fewer parameters than R-GCN, even though the attention computation is expensive, while performing better
§ Thanks to the weight decomposition over node & edge types
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

Heterogeneous GNNs extend GNNs by separately modeling node/relation types, plus an additional aggregation stage across relation types.
[Figure: the GNN design space — (1) message, (2) aggregation, (3) layer connectivity, (4) graph augmentation, (5) learning objective]
¡ Heterogeneous graphs: graphs with multiple node or edge types
§ Key concept: relation type (node_s, edge, node_e)
§ Be aware that we don’t always need
heterogeneous graphs
¡ Learning with heterogeneous graphs
§ Key idea: separately model each relation type
§ Relational GCNs
§ Design space for heterogeneous GNNs
§ Heterogeneous Graph Transformer