Graph Neural Network Node Embedding - Node, Edge, and Subgraph
Graph Representation Learning alleviates
the need to do feature engineering every
single time.
Pipeline: Input Graph → Structured Features → Learning Algorithm → Prediction
Goal: Efficient task-independent feature
learning for machine learning with graphs!
𝑓: 𝑢 → ℝ𝑑 maps each node 𝑢 to a vector in ℝ𝑑, its feature representation (embedding).
Task: Map nodes into an embedding space
▪ Similarity of embeddings between nodes indicates
their similarity in the network. For example:
▪ Both nodes are close to each other (connected by an edge)
▪ Encode network information
▪ Potentially used for many downstream predictions
Tasks:
• Node classification
• Link prediction
• Graph classification
• Anomalous node detection
• Clustering
• …
Example: 2D embedding of the nodes of Zachary's Karate Club network.
Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.
Assume we have a graph G:
▪ V is the vertex set.
▪ A is the adjacency matrix (assume binary).
▪ For simplicity: No node features or extra
information is used
V = {1, 2, 3, 4}
A =
0 1 0 1
1 0 0 1
0 0 0 1
1 1 1 0
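For concreteness, here is a minimal sketch (plain Python/NumPy, my own illustration rather than lecture code) of storing this example graph as an adjacency matrix and deriving its adjacency lists:

    import numpy as np

    # Example graph from above: V = {1, 2, 3, 4}, binary adjacency matrix A
    A = np.array([[0, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 1],
                  [1, 1, 1, 0]])

    # Equivalent adjacency lists (nodes labeled 1..4)
    neighbors = {i + 1: [j + 1 for j in np.nonzero(A[i])[0]] for i in range(4)}
    # {1: [2, 4], 2: [1, 4], 3: [4], 4: [1, 2, 3]}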
Goal is to encode nodes so that similarity in
the embedding space (e.g., dot product)
approximates similarity in the graph
Goal: similarity(𝑢, 𝑣) ≈ 𝐳𝑣ᵀ𝐳𝑢, where similarity(𝑢, 𝑣) is measured in the original network (we need to define it!) and 𝐳𝑣ᵀ𝐳𝑢 is the similarity (dot product) of the embeddings.
1. Encoder maps from nodes to embeddings
2. Define a node similarity function (i.e., a
measure of similarity in the original network)
3. Decoder 𝐃𝐄𝐂 maps from embeddings to the
similarity score
4. Optimize the parameters of the encoder so that: similarity(𝑢, 𝑣) ≈ DEC(𝐳𝑣ᵀ𝐳𝑢), i.e., the similarity of 𝑢 and 𝑣 in the original network is approximated by the decoded similarity of their embeddings.
Encoder: maps each node to a low-dimensional vector:
ENC(𝑣) = 𝐳𝑣, where 𝑣 is a node in the input graph and 𝐳𝑣 is its d-dimensional embedding.
Similarity function: specifies how the relationships in vector space map to the relationships in the original network:
similarity(𝑢, 𝑣) ≈ 𝐳𝑣ᵀ𝐳𝑢
The left side is the similarity of 𝑢 and 𝑣 in the original network; the right side (the decoder) is the dot product between node embeddings.
Simplest encoding approach: Encoder is just an embedding-lookup:
ENC(𝑣) = 𝐳𝑣 = 𝐙 ⋅ 𝑣
where 𝐙 ∈ ℝ𝑑×|𝑉| holds one column per node (𝑑 = dimension/size of the embeddings) and 𝑣 is an indicator vector with a one in the position corresponding to node 𝑣.
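A minimal sketch (illustrative only, with a randomly initialized matrix 𝐙 standing in for learned embeddings) of the embedding-lookup encoder and the dot-product decoder:

    import numpy as np

    num_nodes, d = 4, 8
    Z = np.random.randn(d, num_nodes)    # one column z_v per node; in practice these are learned

    def ENC(v):
        """Embedding lookup: Z . one_hot(v), equivalent to selecting column v of Z."""
        one_hot = np.zeros(num_nodes)
        one_hot[v] = 1.0
        return Z @ one_hot

    def DEC(z_u, z_v):
        """Decoder: dot-product similarity of two embeddings."""
        return float(z_u @ z_v)

    score = DEC(ENC(0), ENC(3))          # estimated similarity of nodes 0 and 3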
The key design choice that distinguishes these methods is how they define node similarity.
This is an unsupervised/self-supervised way of learning node embeddings.
▪ We are not utilizing node labels
▪ We are not utilizing node features
▪ The goal is to directly estimate a set of coordinates
(i.e., the embedding) of a node so that some aspect
of the network structure (captured by DEC) is
preserved.
These embeddings are task independent
▪ They are not trained for a specific task but can be
used for any task.
Vector 𝐳𝑢 :
▪ The embedding of node 𝑢 (what we aim to find).
Probability 𝑃(𝑣 | 𝐳𝑢): Our model's prediction based on 𝐳𝑢
▪ The (predicted) probability of visiting node 𝑣 on
random walks starting from node 𝑢.
Random walk: Given a graph and a starting point, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of this point at random and move to it, etc. The (random) sequence of points visited this way is a random walk on the graph.
(Figure: a random walk of 5 steps, Step 1 through Step 5, on an example graph with 12 nodes.)
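A minimal sketch (toy code, assuming the adjacency-list representation from earlier) of simulating such a uniform random walk:

    import random

    def random_walk(neighbors, start, length):
        """Repeatedly move to a uniformly random neighbor of the current node."""
        walk = [start]
        for _ in range(length):
            walk.append(random.choice(neighbors[walk[-1]]))
        return walk

    neighbors = {1: [2, 4], 2: [1, 4], 3: [4], 4: [1, 2, 3]}  # example graph from above
    print(random_walk(neighbors, start=1, length=5))          # e.g. [1, 4, 3, 4, 1, 2]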
𝐳𝑢ᵀ𝐳𝑣 ≈ probability that 𝑢 and 𝑣 co-occur on a random walk over the graph.
1. Estimate probability of visiting node 𝒗 on a
random walk starting from node 𝒖 using
some random walk strategy 𝑹
2. Optimize embeddings to encode these random walk statistics, using the log-likelihood objective: maximize Σ𝑢∈𝑉 log 𝑃(𝑁𝑅(𝑢) | 𝐳𝑢) over the embeddings, where 𝑁𝑅(𝑢) is the neighborhood of 𝑢 obtained by strategy 𝑅.*
*𝑁𝑅(𝑢) can have repeat elements since nodes can be visited multiple times on random walks
Equivalently, minimize ℒ = Σ𝑢∈𝑉 Σ𝑣∈𝑁𝑅(𝑢) −log 𝑃(𝑣 | 𝐳𝑢), where 𝑃(𝑣 | 𝐳𝑢) is parameterized with a softmax: 𝑃(𝑣 | 𝐳𝑢) = exp(𝐳𝑢ᵀ𝐳𝑣) / Σ𝑛∈𝑉 exp(𝐳𝑢ᵀ𝐳𝑛).
But doing this naively is too expensive! The nested sum over all nodes in the softmax normalization gives O(|𝑉|²) complexity.
Why is the approximation (negative sampling instead of the full softmax) valid? Technically, this is a different objective, but negative sampling is a form of noise contrastive estimation that approximately maximizes the log-probability of the softmax.
▪ For all 𝑢, make a step in the reverse direction of the derivative: 𝑧𝑢 ← 𝑧𝑢 − 𝜂 ∂ℒ/∂𝑧𝑢.
▪ Stochastic Gradient Descent: Instead of evaluating gradients over all examples, evaluate them for each individual training example.
▪ Initialize 𝑧𝑢 at some randomized value for all nodes 𝑢.
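A minimal sketch (my own illustration using PyTorch autograd, not the course's code) of optimizing the embeddings with gradient descent on the softmax objective above; walk_pairs is a hypothetical list of (u, v) co-occurrences collected from random walks:

    import torch
    import torch.nn as nn

    num_nodes, d = 4, 16
    Z = nn.Embedding(num_nodes, d)                     # one embedding z_u per node
    opt = torch.optim.SGD(Z.parameters(), lr=0.1)      # lr plays the role of eta

    walk_pairs = [(0, 1), (0, 3), (1, 3), (2, 3)]      # toy (u, v) co-occurrence pairs

    for step in range(100):
        u = torch.tensor([p[0] for p in walk_pairs])
        v = torch.tensor([p[1] for p in walk_pairs])
        scores = Z(u) @ Z.weight.t()                   # z_u^T z_n for every candidate node n
        loss = nn.functional.cross_entropy(scores, v)  # -log softmax(z_u^T z_v), averaged over pairs
        opt.zero_grad(); loss.backward(); opt.step()   # z_u <- z_u - eta * dL/dz_u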
1. Run short fixed-length random walks starting
from each node on the graph
2. For each node 𝑢 collect 𝑁𝑅 (𝑢), the multiset of
nodes visited on random walks starting from 𝑢.
3. Optimize embeddings using Stochastic
Gradient Descent:
(Figure 1 from the node2vec paper: BFS and DFS search strategies from node u (k = 3).)
BFS: Micro-view of neighbourhood.
DFS: Macro-view of neighbourhood.
Biased fixed-length random walk 𝑹 that given a
node 𝒖 generates neighborhood 𝑵𝑹 𝒖
Two parameters:
▪ Return parameter 𝒑:
▪ Return back to the previous node
▪ In-out parameter 𝒒:
▪ Moving outwards (DFS) vs. inwards (BFS)
▪ Intuitively, 𝑞 is the “ratio” of BFS vs. DFS
Biased 2nd-order random walks explore network
neighborhoods:
▪ A random walk just traversed edge (𝑠1, 𝑤) and is now at 𝑤
▪ Insight: Neighbors of 𝑤 can only be: back to 𝒔𝟏, at the same distance to 𝒔𝟏 (e.g., 𝑠2), or farther from 𝒔𝟏 (e.g., 𝑠3).
Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?
Unnormalized transition probabilities (distances measured from 𝑠1):
Target 𝒕    Prob. (unnormalized)    Dist.(𝒔𝟏, 𝒕)
𝑠1          1/𝑝                     0
𝑠2          1                       1
𝑠3          1/𝑞                     2
𝑠4          1/𝑞                     2
▪ BFS-like walk: Low value of 𝑝
▪ DFS-like walk: Low value of 𝑞
The node2vec algorithm: 1) compute the random walk transition probabilities, 2) simulate random walks starting from each node 𝑢, 3) optimize the node2vec objective using SGD. Linear-time complexity; all 3 steps are individually parallelizable.
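A minimal sketch (my own illustration, not the reference node2vec implementation) of one biased 2nd-order step: the walker moved s1 → w and picks the next node t with unnormalized weight 1/p (back to s1), 1 (same distance to s1), or 1/q (farther from s1):

    import random

    def next_node(neighbors, prev, curr, p, q):
        """neighbors: dict node -> set of neighbors; the walk just traversed (prev, curr)."""
        candidates, weights = [], []
        for t in neighbors[curr]:
            if t == prev:                    # dist(prev, t) = 0: return to the previous node
                weights.append(1.0 / p)
            elif t in neighbors[prev]:       # dist(prev, t) = 1: same distance to prev
                weights.append(1.0)
            else:                            # dist(prev, t) = 2: moving farther away
                weights.append(1.0 / q)
            candidates.append(t)
        return random.choices(candidates, weights=weights, k=1)[0]

With a low p the walk tends to backtrack (BFS-like); with a low q it tends to move outward (DFS-like).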
Different kinds of biased random walks:
▪ Based on node attributes (Dong et al., 2017).
▪ Based on learned weights (Abu-El-Haija et al., 2017)
Core idea: Embed nodes so that distances in
embedding space reflect node similarities in
the original network.
Different notions of node similarity:
▪ Naïve: similar if two nodes are connected
▪ Neighborhood overlap (covered in Lecture 2)
▪ Random walk approaches (covered today)
So what method should I use? No one method wins in all cases.
▪ E.g., node2vec performs better on node classification
while alternative methods perform better on link
prediction (Goyal and Ferrara, 2017 survey).
Random walk approaches are generally more
efficient.
In general: Must choose definition of node
similarity that matches your application.
Goal: Want to embed a subgraph or an entire graph 𝐺. Graph embedding: 𝐳𝑮.
Tasks:
▪ Classifying toxic vs. non-toxic molecules
▪ Identifying anomalous graphs
Simple (but effective) approach 1:
Run a standard graph embedding
technique on the (sub)graph 𝐺.
Then just sum (or average) the node
embeddings in the (sub)graph 𝐺.
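A minimal sketch (assuming the node embeddings have already been trained by any of the methods above) of Approach 1, forming the (sub)graph embedding by summing or averaging the node embeddings:

    import numpy as np

    node_embeddings = {1: np.array([0.1, 0.3]),   # toy 2-d embeddings, assumed already learned
                       2: np.array([0.2, 0.0]),
                       3: np.array([0.4, 0.1])}

    Z = np.stack(list(node_embeddings.values()))
    z_G_sum = Z.sum(axis=0)    # sum of the node embeddings in the (sub)graph
    z_G_avg = Z.mean(axis=0)   # or their average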
Anonymous walks: each node in a walk is replaced by the index of the first position at which it appears, so the walk records the pattern of (re)visits rather than node identities. The number of anonymous walks grows exponentially with length:
▪ There are 5 anon. walks 𝑤𝑖 of length 3:
𝑤1=111, 𝑤2=112, 𝑤3=121, 𝑤4=122, 𝑤5=123
Simulate anonymous walks 𝑤𝑖 of 𝑙 steps and
record their counts.
Represent the graph as a probability
distribution over these walks.
For example:
▪ Set 𝑙 = 3
▪ Then we can represent the graph as a 5-dim vector
▪ Since there are 5 anonymous walks 𝑤𝑖 of length 3: 111, 112,
121, 122, 123
▪ 𝒛𝑮 [𝑖] = probability of anonymous walk 𝑤𝑖 in graph 𝐺.
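A minimal sketch (toy code; the sampled walks are made up for illustration) of anonymizing walks and building the 5-dimensional distribution vector 𝒛𝑮 for 𝑙 = 3:

    from collections import Counter

    def anonymize(walk):
        """Replace each node by the index of its first appearance, e.g. (A, B, A) -> '121'."""
        first_seen = {}
        for node in walk:
            first_seen.setdefault(node, len(first_seen) + 1)
        return "".join(str(first_seen[node]) for node in walk)

    sampled_walks = [("A", "B", "C"), ("A", "B", "A"), ("B", "C", "B"), ("C", "A", "B")]
    counts = Counter(anonymize(w) for w in sampled_walks)
    anon_types = ["111", "112", "121", "122", "123"]
    z_G = [counts[t] / len(sampled_walks) for t in anon_types]   # z_G[i] = fraction of walk type w_i
    print(z_G)   # [0.0, 0.0, 0.5, 0.0, 0.5] for these toy walks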
Sampling anonymous walks: Generate
independently a set of 𝑚 random walks.
Represent the graph as a probability distribution
over these walks.
Objective: predict the next anonymous walk 𝑤𝑡 from the Δ walks preceding it in a window, together with 𝒛𝑮:
▪ 𝑃(𝑤𝑡 | {𝑤𝑡−Δ, …, 𝑤𝑡−1}, 𝒛𝑮) = exp(𝑦(𝑤𝑡)) / Σ𝑖=1..𝜂 exp(𝑦(𝑤𝑖)), where the sum runs over all possible anonymous walks (requires negative sampling)
▪ 𝑦(𝑤𝑡) = 𝑏 + 𝑈 ⋅ 𝑐𝑎𝑡((1/Δ) Σ𝑖=1..Δ 𝒛𝑖, 𝒛𝑮)
▪ 𝑐𝑎𝑡((1/Δ) Σ𝑖=1..Δ 𝒛𝑖, 𝒛𝑮) means the average of the anonymous walk embeddings 𝒛𝑖 in the window, concatenated with the graph embedding 𝒛𝑮.
▪ 𝑏 ∈ ℝ, 𝑈 ∈ ℝ𝐷 are learnable parameters. This represents a linear layer.
Anonymous Walk Embeddings, ICML 2018 https://arxiv.org/pdf/1805.11921.pdf
We obtain the graph embedding 𝒛𝑮 (a learnable parameter) after the optimization.
▪ Is 𝒛𝑮 simply the sum over the walk embeddings 𝒛𝒊? Or is 𝒛𝑮 a residual embedding alongside the 𝒛𝒊?
▪ According to the paper, 𝒛𝑮 is a separately optimized vector parameter, just like the other 𝒛𝒊's.
Use 𝒛𝑮 to make predictions (e.g., graph classification):
▪ Option 1: Inner product kernel 𝒛𝑮𝟏ᵀ𝒛𝑮𝟐 (Lecture 2)
▪ Option 2: Use a neural network that takes 𝒛𝑮 as input to classify 𝑮.
(Figure: overall architecture.)
We discussed 3 ideas for graph embeddings:
▪ Approach 1: Embed nodes and sum/avg them
▪ Approach 2: Represent the graph as a distribution over sampled anonymous walks
▪ Approach 3: Learn the graph embedding 𝒛𝑮 jointly with anonymous walk embeddings
How to use embeddings 𝒛𝒊 of nodes:
▪ Clustering/community detection: Cluster points 𝒛𝒊
▪ Node classification: Predict label of node 𝑖 based on 𝒛𝒊
▪ Link prediction: Predict edge (𝑖, 𝑗) based on (𝒛𝒊, 𝒛𝒋)
▪ We can concatenate, average, take a per-coordinate product, or take a difference between the embeddings (see the sketch after this list):
▪ Concatenate: 𝑓(𝒛𝑖, 𝒛𝑗) = 𝑔([𝒛𝑖, 𝒛𝑗])
▪ Hadamard: 𝑓(𝒛𝑖, 𝒛𝑗) = 𝑔(𝒛𝑖 ∗ 𝒛𝑗) (per-coordinate product)
▪ Sum/Avg: 𝑓(𝒛𝑖, 𝒛𝑗) = 𝑔(𝒛𝑖 + 𝒛𝑗)
▪ Distance: 𝑓(𝒛𝑖, 𝒛𝑗) = 𝑔(‖𝒛𝑖 − 𝒛𝑗‖₂)
▪ Graph classification: Graph embedding 𝒛𝑮 via aggregating
node embeddings or anonymous random walks.
Predict label based on graph embedding 𝒛𝐺 .
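A minimal sketch (illustrative only; 𝑔 stands for any downstream classifier such as logistic regression) of turning a pair of node embeddings into an edge feature for link prediction:

    import numpy as np

    def edge_features(z_i, z_j, mode="hadamard"):
        if mode == "concat":
            return np.concatenate([z_i, z_j])
        if mode == "hadamard":
            return z_i * z_j                               # per-coordinate product
        if mode == "avg":
            return (z_i + z_j) / 2.0
        if mode == "distance":
            return np.array([np.linalg.norm(z_i - z_j)])   # L2 distance
        raise ValueError(mode)

    # The resulting feature vector is then fed to a classifier g to predict whether edge (i, j) exists.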
We discussed graph representation learning, a way to
learn node and graph embeddings for downstream
tasks, without feature engineering.
Encoder-decoder framework:
▪ Encoder: embedding lookup
▪ Decoder: predict score based on embedding to match
node similarity
Node similarity measure: (biased) random walk
▪ Examples: DeepWalk, Node2Vec