03 Node Embeddings
Graph Representation Learning alleviates
the need to do feature engineering every
single time.
[Figure: graph ML pipeline: Input Graph → Structured Features → Learning Algorithm → Prediction.]
Goal: Efficient task-independent feature
learning for machine learning with graphs!
We learn a function that maps each node 𝑢 to a vector in ℝᵈ:
𝑓: 𝑢 → ℝᵈ
The resulting vector is the node's feature representation, or embedding.
Task: Map nodes into an embedding space
▪ Similarity of embeddings between nodes indicates
their similarity in the network. For example:
▪ Both nodes are close to each other (connected by an edge)
▪ Encode network information
▪ Potentially used for many downstream predictions
Tasks that can use node embeddings in ℝᵈ:
• Node classification
• Link prediction
• Graph classification
• Anomalous node detection
• Clustering
• …
Example: 2D embedding of the nodes of Zachary's Karate Club network.
Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.
Assume we have a graph G:
▪ V is the vertex set.
▪ A is the adjacency matrix (assume binary).
▪ For simplicity: No node features or extra
information is used
V = {1, 2, 3, 4}

A =
  0 1 0 1
  1 0 0 1
  0 0 0 1
  1 1 1 0
Goal is to encode nodes so that similarity in
the embedding space (e.g., dot product)
approximates similarity in the graph
Goal: similarity(𝑢, 𝑣) ≈ 𝐳ᵥᵀ𝐳ᵤ
where similarity(𝑢, 𝑣) is the similarity in the original network (which we still need to define!) and 𝐳ᵥᵀ𝐳ᵤ is the similarity of the embeddings.
1. Encoder maps from nodes to embeddings
2. Define a node similarity function (i.e., a
measure of similarity in the original network)
3. Decoder 𝐃𝐄𝐂 maps from embeddings to the
similarity score
4. Optimize the parameters of the encoder so that:
similarity(𝑢, 𝑣) ≈ 𝐃𝐄𝐂(𝐳ᵥᵀ𝐳ᵤ)
i.e., the similarity in the original network matches the similarity score decoded from the embeddings.
Encoder: maps each node 𝑣 in the input graph to a low-dimensional (d-dimensional) embedding:
ENC(𝑣) = 𝐳ᵥ
Similarity function: specifies how the relationships in vector space map to the relationships in the original network:
similarity(𝑢, 𝑣) ≈ 𝐳ᵥᵀ𝐳ᵤ
The similarity of 𝑢 and 𝑣 in the original network is approximated by the dot product between the node embeddings (the decoder).
Simplest encoding approach: Encoder is just an
embedding-lookup
ENC(𝑣) = 𝐳ᵥ = 𝐙 ⋅ 𝑣
where 𝐙 is the d × |V| embedding matrix (d is the dimension/size of the embeddings; each column is the embedding of one node) and 𝑣 is an indicator (one-hot) vector with a one in the position corresponding to node 𝑣.
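To make the lookup concrete, here is a minimal NumPy sketch of such a shallow encoder; the sizes and node indexing are illustrative assumptions, not part of the lecture:

```python
import numpy as np

num_nodes, d = 4, 8                      # |V| nodes, d-dimensional embeddings (illustrative)
rng = np.random.default_rng(0)

# Z: d x |V| matrix; each column is the embedding of one node (the parameters we learn)
Z = rng.normal(size=(d, num_nodes))

def encode(v: int) -> np.ndarray:
    """ENC(v) = Z . v with v a one-hot indicator vector; equivalent to a column lookup."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot                   # same result as Z[:, v]

z_u, z_v = encode(0), encode(1)
print(z_u @ z_v)                         # dot-product decoder: embedding similarity of u and v
```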
The key design choice that distinguishes these methods is how they define node similarity.
This is an unsupervised/self-supervised way of learning node embeddings.
▪ We are not utilizing node labels
▪ We are not utilizing node features
▪ The goal is to directly estimate a set of coordinates
(i.e., the embedding) of a node so that some aspect
of the network structure (captured by DEC) is
preserved.
These embeddings are task independent
▪ They are not trained for a specific task but can be
used for any task.
Vector 𝐳𝑢 :
▪ The embedding of node 𝑢 (what we aim to find).
Probability 𝑃(𝑣 | 𝐳ᵤ): our model's prediction based on 𝐳ᵤ
▪ The (predicted) probability of visiting node 𝑣 on
random walks starting from node 𝑢.
[Figure: a random walk on a small graph, visiting one node per step (Steps 1–5).]
Given a graph and a starting point, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of this point at random, and move to it, etc. The (random) sequence of points visited this way is a random walk on the graph.
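A small sketch of such a uniform random walk over an adjacency-list graph (illustrative code, not from the lecture):

```python
import random

def random_walk(adj: dict[int, list[int]], start: int, length: int) -> list[int]:
    """Repeatedly jump to a uniformly random neighbor of the current node."""
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:                # isolated node / dead end: stop early
            break
        walk.append(random.choice(neighbors))
    return walk

# The 4-node example graph from earlier: edges (1,2), (1,4), (2,4), (3,4)
adj = {1: [2, 4], 2: [1, 4], 3: [4], 4: [1, 2, 3]}
print(random_walk(adj, start=1, length=5))
```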
Idea: define node similarity so that 𝐳ᵤᵀ𝐳ᵥ approximates the probability that 𝑢 and 𝑣 co-occur on a random walk over the graph.
1. Estimate the probability of visiting node 𝒗 on a random walk starting from node 𝒖 using some random walk strategy 𝑹.
2. Optimize embeddings to encode these random walk statistics.
Log-likelihood objective:
max_𝑓 Σ_{𝑢∈𝑉} log 𝑃(𝑁_𝑅(𝑢) | 𝐳ᵤ)
where 𝑁_𝑅(𝑢)* is the neighborhood of 𝑢 obtained by strategy 𝑅.
*𝑁_𝑅(𝑢) can have repeat elements since nodes can be visited multiple times on random walks.
Equivalently, we minimize
ℒ = Σ_{𝑢∈𝑉} Σ_{𝑣∈𝑁_𝑅(𝑢)} −log 𝑃(𝑣 | 𝐳ᵤ)
where 𝑃(𝑣 | 𝐳ᵤ) is parameterized with a softmax over dot products:
𝑃(𝑣 | 𝐳ᵤ) = exp(𝐳ᵤᵀ𝐳ᵥ) / Σ_{𝑛∈𝑉} exp(𝐳ᵤᵀ𝐳ₙ)
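A direct NumPy evaluation of this objective, just to make the nested sums concrete (the toy data and variable names are assumptions):

```python
import numpy as np

def walk_loss(Z: np.ndarray, N_R: dict[int, list[int]]) -> float:
    """L = sum_u sum_{v in N_R(u)} -log( exp(z_u.z_v) / sum_n exp(z_u.z_n) )."""
    loss = 0.0
    for u, context in N_R.items():
        scores = Z[:, u] @ Z                         # z_u . z_n for every node n
        log_denom = np.log(np.exp(scores).sum())     # softmax normalization over all nodes
        for v in context:
            loss -= scores[v] - log_denom
    return loss

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 4))                          # 4 nodes, 3-dimensional embeddings
N_R = {0: [1, 3], 1: [0, 3], 2: [3], 3: [0, 1, 2]}   # multisets collected from short walks
print(walk_loss(Z, N_R))
```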
But doing this naively is too expensive! The softmax denominator sums over all nodes for every (𝑢, 𝑣) pair, giving O(|V|²) complexity.
The fix is negative sampling: instead of normalizing against all nodes, we normalize against 𝑘 random "negative" nodes sampled from a background distribution.
Why is the approximation valid? Technically, this is a different objective, but negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log-probability of the softmax.
▪ For all 𝑢, make a step in the reverse direction of the derivative: 𝐳ᵤ ← 𝐳ᵤ − η ∂ℒ/∂𝐳ᵤ.
▪ Stochastic Gradient Descent: Instead of evaluating gradients over all examples, evaluate them for each individual training example.
▪ Initialize 𝑧𝑢 at some randomized value for all nodes 𝑢.
1. Run short fixed-length random walks starting
from each node on the graph
2. For each node 𝑢 collect 𝑁𝑅 (𝑢), the multiset of
nodes visited on random walks starting from 𝑢.
3. Optimize embeddings using Stochastic Gradient Descent (a sketch follows below).
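A compact sketch of step 3: SGD with negative sampling (a simplified stand-in for the optimization described above; the learning rate, number of negatives, and initialization scale are assumptions):

```python
import numpy as np

def train_embeddings(N_R, num_nodes, d=16, lr=0.05, k=5, epochs=50, seed=0):
    """For each pair (u, v): pull z_u and z_v together, push z_u away from k random negatives."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=0.1, size=(d, num_nodes))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for u, context in N_R.items():
            for v in context:
                zu, zv = Z[:, u].copy(), Z[:, v].copy()
                g = sigmoid(zu @ zv) - 1.0            # gradient of -log sigmoid(z_u . z_v)
                Z[:, u] -= lr * g * zv
                Z[:, v] -= lr * g * zu
                for n in rng.integers(0, num_nodes, size=k):
                    g = sigmoid(Z[:, u] @ Z[:, n])    # gradient of -log sigmoid(-z_u . z_n)
                    Z[:, u] -= lr * g * Z[:, n]
                    Z[:, n] -= lr * g * Z[:, u]
    return Z

# usage with the neighborhoods collected in step 2, e.g.:
# Z = train_embeddings(N_R, num_nodes=4)
```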
[Figure from the node2vec paper: BFS and DFS search strategies from node u (k = 3).]
BFS gives a micro-view of the neighbourhood; DFS gives a macro-view of the neighbourhood.
A biased fixed-length random walk 𝑹 that, given a node 𝒖, generates the neighborhood 𝑵_𝑹(𝒖).
Two parameters:
▪ Return parameter 𝒑:
▪ Return back to the previous node
▪ In-out parameter 𝒒:
▪ Moving outwards (DFS) vs. inwards (BFS)
▪ Intuitively, 𝑞 is the “ratio” of BFS vs. DFS
Biased 2nd-order random walks explore network
neighborhoods:
▪ The random walk just traversed edge (𝑠₁, 𝑤) and is now at 𝑤
▪ Insight: Neighbors of 𝑤 can only be at the same distance from 𝒔₁, farther from 𝒔₁, or back at 𝒔₁ itself
Walker came over edge (𝐬𝟏 , 𝐰) and is at 𝐰.
Where to go next?
Unnormalized transition probabilities out of 𝑤:

Target 𝒕    Prob. (unnormalized)    Dist.(𝒔₁, 𝒕)
𝑠₁           1/𝑝                     0
𝑠₂           1                       1
𝑠₃           1/𝑞                     2
𝑠₄           1/𝑞                     2

▪ BFS-like walk: low value of 𝑝
▪ DFS-like walk: low value of 𝑞
The procedure has linear-time complexity, and all 3 steps (computing walk probabilities, simulating the walks, optimizing with SGD) are individually parallelizable.
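A sketch of one biased step of this second-order walk, using the unnormalized weights 1/p (back), 1 (same distance), and 1/q (farther); the adjacency-list representation and function name are assumptions:

```python
import random

def biased_step(adj: dict[int, list[int]], prev: int, curr: int, p: float, q: float) -> int:
    """Choose the next node after traversing edge (prev, curr), node2vec-style."""
    candidates, weights = adj[curr], []
    for t in candidates:
        if t == prev:                    # return back to the previous node
            weights.append(1.0 / p)
        elif t in adj[prev]:             # t is also a neighbor of prev: same distance from prev
            weights.append(1.0)
        else:                            # moving farther away from prev
            weights.append(1.0 / q)
    return random.choices(candidates, weights=weights, k=1)[0]
```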
Different kinds of biased random walks:
▪ Based on node attributes (Dong et al., 2017).
▪ Based on learned weights (Abu-El-Haija et al., 2017)
Core idea: Embed nodes so that distances in
embedding space reflect node similarities in
the original network.
Different notions of node similarity:
▪ Naïve: Similar if two nodes are connected (next)
▪ Neighborhood overlap (covered in Lecture 2)
▪ Random walk approaches (covered today)
So which method should I use?
No one method wins in all cases…
▪ E.g., node2vec performs better on node classification
while alternative methods perform better on link
prediction (Goyal and Ferrara, 2017 survey).
Random walk approaches are generally more
efficient.
In general: Must choose definition of node
similarity that matches your application.
Goal: We want to embed a subgraph or an entire graph 𝐺 into a graph embedding 𝐳_𝐺.
Tasks:
▪ Classifying toxic vs. non-toxic molecules
▪ Identifying anomalous graphs
Simple (but effective) approach 1:
Run a standard graph embedding technique on the (sub)graph 𝐺, then just sum (or average) the node embeddings in the (sub)graph 𝐺:
𝐳_𝐺 = Σ_{𝑣∈𝐺} 𝐳ᵥ
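A one-line sketch of this idea in NumPy (assuming the node embeddings have already been trained and are stored as the columns of Z):

```python
import numpy as np

def graph_embedding(Z: np.ndarray, nodes: list[int], average: bool = False) -> np.ndarray:
    """z_G = sum (or mean) of the embeddings of the nodes in the (sub)graph."""
    vecs = Z[:, nodes]
    return vecs.mean(axis=1) if average else vecs.sum(axis=1)
```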
We can hierarchically cluster nodes in graphs,
and sum/avg the node embeddings according
to these clusters.
Recall: the encoder is an embedding lookup. 𝐙 is the embedding matrix, whose size is the embedding dimension d times the number of nodes; each column of 𝐙 is the embedding vector for a specific node.
Simplest node similarity: Nodes 𝑢, 𝑣 are similar if they are connected by an edge.
This means: 𝐳ᵥᵀ𝐳ᵤ = 𝐴ᵤ,ᵥ, which is the (𝑢, 𝑣) entry of the graph adjacency matrix 𝐴.
Therefore, 𝒁ᵀ𝒁 = 𝐴.
[Figure: the example adjacency matrix 𝐴 from before, written as the product 𝒁ᵀ𝒁; the entry 𝐴ᵤ,ᵥ corresponds to the dot product 𝐳ᵤᵀ𝐳ᵥ.]
The embedding dimension 𝑑 (the number of rows in 𝒁) is much smaller than the number of nodes 𝑛.
Exact factorization 𝐴 = 𝒁𝑻 𝒁 is generally not possible
However, we can learn 𝒁 approximately
Objective: min_𝐙 ‖𝐴 − 𝒁ᵀ𝒁‖₂
▪ We optimize 𝒁 such that it minimizes the L2 norm (Frobenius norm) of 𝐴 − 𝒁ᵀ𝒁
▪ Note: today we used a softmax instead of the L2 norm, but the goal of approximating 𝐴 with 𝒁ᵀ𝒁 is the same.
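A small gradient-descent sketch of this objective on the example adjacency matrix (an illustration of the equivalence, not an optimized factorization routine; the dimension and learning rate are arbitrary choices):

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 1, 0]], dtype=float)

d, lr = 2, 0.01
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(d, A.shape[0]))

for _ in range(3000):
    residual = Z.T @ Z - A               # current reconstruction error
    Z -= lr * 4 * Z @ residual           # gradient of ||Z^T Z - A||_F^2 (A symmetric)

print(np.round(Z.T @ Z, 2))              # low-rank approximation of A (exact recovery is generally impossible)
```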
Conclusion: Inner product decoder with node
similarity defined by edge connectivity is
equivalent to matrix factorization of 𝐴.
DeepWalk and node2vec have a more
complex node similarity definition based on
random walks
DeepWalk is equivalent to matrix
factorization of the following complex matrix
expression:
Volume of the graph: vol(𝐺) = Σᵢ Σⱼ 𝐴ᵢ,ⱼ. Diagonal degree matrix 𝐷 with 𝐷ᵤ,ᵤ = deg(𝑢).
The factorized matrix is:
log( vol(𝐺) · (1/𝑇) · Σ_{𝑟=1}^{𝑇} (𝐷⁻¹𝐴)^𝑟 · 𝐷⁻¹ ) − log 𝑏
where 𝑇 is the context window size of the random walk and 𝑏 is the number of negative samples.
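A literal NumPy translation of this matrix (illustrative only; T and b are free parameters here, and zeros are clipped before the element-wise log):

```python
import numpy as np

def deepwalk_matrix(A: np.ndarray, T: int = 10, b: int = 1) -> np.ndarray:
    """log( vol(G) * (1/T) * sum_{r=1..T} (D^-1 A)^r D^-1 ) - log b, element-wise."""
    vol = A.sum()                                      # vol(G) = sum_ij A_ij
    D_inv = np.diag(1.0 / A.sum(axis=1))               # D_uu = deg(u)
    P = D_inv @ A                                      # random-walk transition matrix
    S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
    M = (vol / T) * S @ D_inv
    return np.log(np.maximum(M, 1e-12)) - np.log(b)    # DeepWalk factorizes this matrix
```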
[Figure: a small graph where each node carries a feature vector (e.g., protein properties in a protein-protein interaction graph).]
DeepWalk / node2vec embeddings do not incorporate such node features.