Unit II: Node Embeddings

The document discusses node embeddings in graph representation learning, focusing on the encoder-decoder framework for encoding node information and reconstructing graph relationships. It covers various methods, including shallow embedding approaches, matrix factorization, random walk embeddings like DeepWalk and node2vec, and multi-relational graph embeddings for knowledge graph completion. The optimization objectives and loss functions used to train these models are also outlined, emphasizing the importance of accurately capturing graph structures and relationships.


NODE EMBEDDINGS

Neighborhood Reconstruction Methods

• Node Embeddings
• Each node in a graph is encoded as a low-dimensional vector.
• The vector captures the node's graph position and local neighborhood
structure.
• Nodes that are close in the graph should have embeddings that are close in
the latent space.
• Latent Space Representation
• Instead of working with an entire adjacency matrix (which is sparse and high-
dimensional), embeddings place nodes in a continuous vector space.
• The relative distances between embeddings should reflect the relationships in
the original graph (e.g., edges).
3.1 An Encoder-Decoder Perspective
• The Encoder-Decoder framework is a central idea in graph
representation learning.
• It provides a structured way to approach the problem of encoding
node information and reconstructing graph relationships based on
these encodings.
• This perspective is particularly useful for understanding node
embeddings and is crucial for learning accurate representations of
nodes in a graph.
3.1.1 The Encoder
• Encoder:
The encoder's job is to transform each node in the graph into a low-
dimensional vector (embedding). This mapping is learned such that
similar nodes in the graph should be represented by similar vectors in
the embedding space.
• Shallow Embedding Approach:
In many methods, the encoder is a shallow function that simply looks up a
pre-learned vector for each node. This can be seen as a simple mapping
from node IDs to embeddings.

• Key Idea:
• The encoder does not explicitly compute anything from the graph structure at lookup time; it assumes that the embeddings, once learned, already capture the relevant information about each node. This setup is used in basic node embedding methods such as DeepWalk and node2vec (a minimal sketch follows below).
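A minimal NumPy sketch (not from the slides) of the shallow, lookup-based encoder described above; the table Z, the dimensions, and the helper name encode are illustrative assumptions.

import numpy as np

num_nodes, dim = 5, 2                      # toy graph with 5 nodes, 2-D embeddings
Z = np.random.randn(num_nodes, dim) * 0.1  # the embedding table is the whole "encoder"

def encode(node_id):
    # Shallow encoder: simply look up the row of Z for this node ID.
    return Z[node_id]

z_u = encode(3)  # embedding vector for node 3

Training such an encoder means adjusting the entries of Z directly, rather than learning a function of node features.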
3.1.2 The Decoder
• Purpose of the Decoder
• Given a node embedding z_u for a node u, the decoder attempts to predict certain characteristics of the graph related to that node, such as which nodes are u's neighbors or how much two nodes' neighborhoods overlap.
• Pairwise Decoders
• The most common type of decoder used in node embedding models is the
pairwise decoder. This decoder takes in a pair of node embeddings and
predicts their relationship or similarity based on the graph structure.

• This means that the decoder receives two d-dimensional embeddings as input and outputs a positive scalar value that represents the relationship between the two nodes, i.e., DEC : R^d × R^d → R^+.
• Goal of the Decoder:
• The goal is for the decoder to predict the graph-based similarity S[u,v]
between two nodes u and v. This similarity could correspond to:
• Whether or not nodes u and v are neighbors in the graph.
• Or more generally, some other form of neighborhood overlap or similarity measure.
• Optimization Objective
• The encoder-decoder model is trained to minimize a reconstruction loss. This means that the embeddings z_u and z_v produced by the encoder should be optimized so that the decoder can accurately reconstruct the relationship between the nodes. The optimization process is written out as the reconstruction loss defined in Section 3.1.3 below.
3.1.3 Optimizing an Encoder-Decoder Model
• To train an encoder-decoder model for node embeddings, the key goal is to minimize a reconstruction loss that measures how well the model can predict the relationships between pairs of nodes based on their embeddings.
• Optimizing both the encoder and decoder allows the model to accurately reconstruct the graph's structure from the learned node embeddings.
• Reconstruction Loss Function
• The empirical reconstruction loss L is defined over a set of training node pairs D. Each pair (u,v) ∈ D consists of two nodes, and the objective is to minimize the difference between the predicted relationship (from the decoder) and the true relationship (as defined by the similarity measure S[u,v]).
• The loss function is expressed as:
L = Σ_{(u,v) ∈ D} ℓ( DEC(z_u, z_v), S[u,v] )
where ℓ is a pairwise loss (e.g., squared error or cross-entropy) measuring the discrepancy between the decoded and the true similarity.
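As an illustrative sketch (not part of the original slides), the following NumPy code minimizes a squared-error reconstruction loss for a shallow encoder with an inner-product decoder; the toy similarity matrix S, learning rate, and number of steps are assumptions.

import numpy as np

# Toy similarity matrix S (here simply the adjacency matrix of a 4-node cycle).
S = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

n, d, lr = S.shape[0], 2, 0.05
Z = np.random.randn(n, d) * 0.1       # shallow encoder: one d-dimensional vector per node

for step in range(500):
    pred = Z @ Z.T                    # decoder: DEC(z_u, z_v) = z_u . z_v for every pair
    err = pred - S                    # discrepancy between decoded and true similarity
    loss = (err ** 2).sum()           # squared-error reconstruction loss over all pairs
    Z -= lr * (4 * err @ Z)           # gradient step on the embeddings

After training, the rows of Z are the node embeddings, and Z @ Z.T approximates S.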
3.1.4 Overview of the Encoder-Decoder Approach
• The encoder-decoder framework is a powerful way to define and analyze various node
embedding methods. It provides a unified way to describe how different methods generate node
embeddings by focusing on three key aspects:
1. Decoder Function:
This function determines how the node embeddings (produced by the encoder) are used to
reconstruct the graph structure or relationships between nodes. The decoder predicts similarities,
distances, or other properties that represent the original graph's topology.
2. Graph-Based Similarity Measure:
The similarity measure S[u,v] defines how the relationship between two nodes u and v is represented. This could be based on adjacency, proximity, common neighbors, or more sophisticated metrics, depending on the method.
3. Loss Function:
The loss function ℓ measures the discrepancy between the predicted relationships (from the
decoder) and the true relationships (from the graph's structure). The loss function drives the
optimization of the encoder and decoder models.
3.2 Factorization-based approaches
• Matrix Factorization Viewpoint
• The encoder-decoder approach can be seen as matrix factorization, where we try to
approximate a node similarity matrix S in a lower-dimensional space.
• The similarity matrix S is a generalization of the adjacency matrix A, meaning it captures not
just direct connections but broader node relationships.
• Step 1: Consider a Graph
• We have a small undirected graph with 4 nodes and weighted edges.
• Step 2: Factorizing the Adjacency Matrix
• Instead of using the full matrix A, we approximate it by the product of two smaller matrices:
A ≈ Z Z^T
• where Z is a low-dimensional embedding matrix of size 4×2 (for 2D embeddings), and each row of Z holds one node's embedding.
• Step 3: Interpretation
• The dot product of two row vectors in Z gives an estimate of their connection
strength.
• Instead of storing the full adjacency matrix (which grows large for big graphs),
we now store only low-dimensional embeddings.
• The embeddings capture graph structure, meaning similar nodes have close
vectors.
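A small NumPy sketch of this interpretation; the values in Z are hypothetical, chosen only to show how row dot products approximate pairwise connection strengths.

import numpy as np

# Hypothetical 4x2 embedding matrix Z (illustrative values, not those from the slide).
Z = np.array([[ 1.0,  0.2],
              [ 0.9, -0.1],
              [-0.8,  1.0],
              [-0.7,  0.9]])

A_hat = Z @ Z.T              # reconstructed adjacency/similarity estimates for all pairs
strength_01 = Z[0] @ Z[1]    # dot product of rows 0 and 1 = estimated connection strength
print(np.round(A_hat, 2))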
2. Laplacian Eigenmaps (LE)
• One of the earliest factorization-based methods, Laplacian Eigenmaps,
focuses on preserving local graph structures.
• The decoder function measures similarity using the squared L2 distance between node embeddings:
DEC(z_u, z_v) = ||z_u − z_v||^2
• The loss function ensures that similar nodes stay close in the embedding space:
L = Σ_{(u,v) ∈ D} DEC(z_u, z_v) · S[u,v]
• If S is a Laplacian matrix, solving this loss function is equivalent to spectral clustering, where the embeddings correspond to the smallest eigenvectors of the graph Laplacian.
• Step 1: Construct the Adjacency Matrix A
• Step 2: Compute the Degree Matrix D
• The degree matrix D is a diagonal matrix where each diagonal entry D[i,i] is
the sum of the corresponding row in A:

• Step 3: Compute the Laplacian Matrix L
• The graph Laplacian is defined as L = D − A.
• Step 4: Compute the Eigenvalues and Eigenvectors of L
• Laplacian Eigenmaps uses the eigenvectors of L associated with the smallest nonzero eigenvalues. These eigenvectors provide a low-dimensional representation of the nodes.

• Step 5: Interpretation
• Each row of Z represents a 2D embedding for a node.
• Similar nodes (e.g., A and C, B and D) have closer embeddings.
• The embeddings preserve local graph structure by minimizing differences between connected nodes.
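A minimal NumPy sketch of Steps 1-5 above; the adjacency-matrix values are illustrative, not the ones from the slide.

import numpy as np

# Step 1: adjacency matrix of a small undirected graph (illustrative values).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))            # Step 2: degree matrix
L = D - A                             # Step 3: unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)  # Step 4: eigenvalues returned in ascending order
Z = eigvecs[:, 1:3]                   # skip the trivial constant eigenvector; keep the next two
# Step 5: each row of Z is a 2-D embedding that preserves local graph structure.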
3. Inner-Product Based Methods
• A more recent approach replaces the L2 distance with an inner-product decoder, DEC(z_u, z_v) = z_u^T z_v.
• This assumes that node similarity (e.g., neighborhood overlap) is proportional to the dot product of the node embeddings.
• Some methods using this approach:
• Graph Factorization (GF): Uses the adjacency matrix (S = A) as the similarity measure.
• GraRep: Uses powers of the adjacency matrix to capture long-range connections.
• HOPE: Uses more general node similarity measures.
• In all cases, the loss function minimizes the difference between predicted and actual similarities:
L = Σ_{(u,v) ∈ D} || DEC(z_u, z_v) − S[u,v] ||^2
• These methods can be solved using Singular Value Decomposition (SVD), reducing the problem to matrix factorization.
• Step 1: Define the Adjacency Matrix A
• Step 2: Define the Inner-Product Decoder
• Inner-product based methods use the formula DEC(z_u, z_v) = z_u^T z_v.
• This means that the similarity between nodes u and v is computed by taking the dot product of their embedding vectors.
• Step 3: Learn Node Embeddings
• Let's assume we have 2D embeddings for each node, stored as the rows of a matrix Z.
• Each row represents the embedding of a node in 2D space.
• Step 4: Compute Pairwise Similarities Using Inner Products
• Taking the dot product of every pair of rows of Z gives the matrix of predicted similarities, Z Z^T.
• Step 5: Optimizing Embeddings
• To train these embeddings, we minimize the reconstruction loss between Z Z^T and S.
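A sketch, under the assumption S = A, of how a truncated SVD yields low-dimensional embeddings whose inner products approximate S; the matrix values and the embedding dimension are illustrative.

import numpy as np

# Similarity matrix S (here S = A; GraRep/HOPE would plug in other similarity matrices).
S = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

d = 2
U, sigma, Vt = np.linalg.svd(S)
Z_out = U[:, :d] * np.sqrt(sigma[:d])   # "outgoing" embeddings
Z_in = Vt[:d].T * np.sqrt(sigma[:d])    # "incoming" embeddings
S_hat = Z_out @ Z_in.T                  # rank-d approximation of S via inner products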
3.3 Random walk embeddings
• Random walk embeddings are a class of node representation learning
techniques that define node similarity based on co-occurrence in
short random walks rather than deterministic structural properties
(such as adjacency matrix factorization).
• These methods are widely used in graph-based machine learning
tasks like link prediction, node classification, and clustering.
Core Methods: DeepWalk & node2vec
• 2.1. DeepWalk
• Inspired by word embeddings (Word2Vec) in NLP, DeepWalk treats
nodes in a graph as words and random walks as sentences.
• Process:
• Perform short random walks starting from each node.
• Collect sequences of nodes visited during these walks (like sentences in NLP).
• Train a Skip-gram model to learn embeddings such that nodes appearing
together in walks have similar vector representations.
• Decoder Function (Softmax Probability):
DEC(z_u, z_v) = exp(z_u^T z_v) / Σ_{k ∈ V} exp(z_u^T z_k)
• This ensures that similar nodes have a higher probability of co-occurrence in random walks.
• Loss Function (Negative Log-Likelihood):
L = Σ_{(u,v) ∈ D} −log( DEC(z_u, z_v) )
• The model minimizes this loss, ensuring that nodes frequently appearing together in walks have similar embeddings.
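A minimal DeepWalk-style sketch, assuming networkx and gensim (4.x API) are available; the karate-club graph, walk length, and hyperparameters are illustrative choices, and negative sampling is used here in place of the original hierarchical softmax.

import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def random_walk(G, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]              # Word2Vec expects string tokens

# "Sentences": several short walks starting from every node.
walks = [random_walk(G, node) for node in G.nodes() for _ in range(10)]

# Skip-gram (sg=1) with negative sampling over the walk corpus.
model = Word2Vec(walks, vector_size=32, window=5, min_count=0,
                 sg=1, negative=5, epochs=5)
z_u = model.wv["0"]   # learned embedding for node 0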
2.2. node2vec
• Enhancement over DeepWalk:
• DeepWalk performs completely uniform random walks.
• node2vec biases the random walks to explore neighborhoods more efficiently.
• Key idea: Balances between local and global graph exploration using two hyperparameters:
• p (Return parameter) – Controls the probability of returning to the previous node (low p → deeper
exploration).
• q (In-out parameter) – Controls the balance between BFS-like (local neighborhood) and DFS-like (global
reach) walks.
• Loss function: Similar to DeepWalk, node2vec uses a negative sampling approach to approximate the softmax function efficiently.
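A sketch of the biased transition weights node2vec uses, assuming a networkx graph; the helper name node2vec_weight and the default p, q values are illustrative.

def node2vec_weight(G, prev, curr, nxt, p=1.0, q=2.0):
    # Unnormalized weight for stepping from curr to nxt, given the walk came from prev.
    if nxt == prev:               # return to the previous node (controlled by p)
        return 1.0 / p
    if G.has_edge(nxt, prev):     # nxt is also a neighbor of prev: BFS-like, local move
        return 1.0
    return 1.0 / q                # nxt is two hops from prev: DFS-like, outward move

# Normalizing these weights over all neighbors of curr gives the biased walk distribution.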
3. LINE (Large-scale Information Network Embeddings)
• While DeepWalk and node2vec rely on random walks, LINE takes a different approach. It defines node
similarity based on first-order (direct connections) and second-order (two-hop connections) proximity.
First-order Proximity:
1. Directly connected nodes should have similar embeddings.
2. The loss function minimizes the difference between adjacency-based similarities.
Second-order Proximity: Nodes with many common neighbors should have similar embeddings.
• Optimized using the Kullback-Leibler (KL) divergence to approximate second-order relationships.
Example
• Consider a small graph with edges Alice-Bob, Bob-Charlie, Bob-Dave, and Charlie-Eve:

Alice — Bob — Charlie
         |       |
        Dave    Eve
First-order Proximity (Direct Connections)
This ensures that directly connected nodes (edges in the graph) have similar embeddings.
Alice & Bob, Bob & Charlie, Bob & Dave, and Charlie & Eve are directly connected, so their embeddings
should be close.
Second-order Proximity (Common Neighbors)
This ensures that nodes sharing many common neighbors should have similar embeddings, even if they are
not directly connected.
Correct Interpretation in This Graph:
Alice and Dave both have Bob as a common neighbor.
Even though Alice and Dave are not directly connected, their embeddings should still be somewhat similar.
Charlie and Dave both have Bob as a common neighbor.
Again, they are not directly connected, but their embeddings should be somewhat similar.
Bob and Eve both have Charlie as a common neighbor.
This means Bob and Eve share a second-order proximity relation.
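A small networkx sketch of these first- and second-order relationships in the example graph; the helper common_neighbors is an illustrative name.

import networkx as nx

G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Charlie"),
                  ("Bob", "Dave"), ("Charlie", "Eve")])

def common_neighbors(G, u, v):
    return set(G.neighbors(u)) & set(G.neighbors(v))

print(G.has_edge("Alice", "Dave"))           # False -> no first-order proximity
print(common_neighbors(G, "Alice", "Dave"))  # {'Bob'} -> second-order proximity via Bob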
4. Variants and Extensions of Random Walks
• Several modifications of random walk embeddings have been
proposed:
1. Biased Random Walks (e.g., GraRep) – Skip nodes to capture long-range dependencies.
2. Structural Role-based Walks – Instead of neighborhood similarity, embeddings focus on structural roles (e.g., hubs vs. peripheral nodes).
3. Personalized Random Walks – Modify walk probabilities based on external attributes such as edge weights or node types.
5. Connection to Matrix Factorization
• Though random-walk-based embeddings appear different from factorization methods,
they are mathematically linked.
• Qiu et al. (2018) showed that DeepWalk’s learned embeddings implicitly factorize a transformed adjacency matrix.
• The length of random walks (T) controls the influence of different eigenvalues, linking
it to spectral clustering methods.
• Spectral clustering is a technique that uses graph theory and eigenvalues to cluster data points.
Multi-relational Data and Knowledge Graphs
1. Knowledge Graph Completion
• Knowledge Graph Completion (KGC) is the process of predicting missing edges in a multi-relational graph, typically referred to as a knowledge graph (KG). A knowledge graph consists of entities (nodes) connected by typed edges (relations).
2. Definition of Multi-Relational Graphs
• A multi-relational graph is a type of graph where edges have specific types (relations). Each edge is written as a triple (u, τ, v), where τ denotes the relation type.
3. Applications of Knowledge Graph Completion
• The primary application of KGC is relation prediction, where missing
relationships in the graph are inferred.
• Other tasks include:
• Node classification: Assigning labels to entities based on relational data (e.g.,
classifying drugs as "antibiotics" or "painkillers").
• Link prediction: Predicting new relationships between existing entities.
• A well-known reference for node classification using knowledge graphs is
Schlichtkrull et al., 2017.
4.1 Reconstructing Multi-Relational Data
• In a multi-relational graph, nodes are connected by different types of
edges (relationships).
• An important task is to embed these nodes into a low-dimensional space and reconstruct their relationships.
• Unlike simple graphs, where we only consider node pairs, multi-
relational graphs require us to consider edge types as well.
Decoder Function for Multi-Relational Graphs
• The decoder function DEC(u, τ, v) computes the likelihood that an
edge (u, v) of type τ exists.
• One early approach is RESCAL, which represents each relation τ with a learnable matrix R_τ and scores a candidate edge as DEC(u, τ, v) = z_u^T R_τ z_v (see the sketch below).
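A minimal sketch of a RESCAL-style bilinear decoder, with randomly initialized (untrained) embeddings and relation matrices used purely for illustration.

import numpy as np

d, num_relations = 4, 3
rng = np.random.default_rng(0)

z_u = rng.normal(size=d)                    # embedding of node u (untrained, illustrative)
z_v = rng.normal(size=d)                    # embedding of node v
R = rng.normal(size=(num_relations, d, d))  # one learnable d x d matrix per relation type

def rescal_score(z_u, tau, z_v):
    # Bilinear RESCAL-style decoder: DEC(u, tau, v) = z_u^T R_tau z_v
    return z_u @ R[tau] @ z_v

score = rescal_score(z_u, 0, z_v)           # likelihood score for edge (u, relation 0, v)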
Decoders, Similarity Functions, and Loss Functions
• Decoder (DEC): Computes a score between node embeddings to
estimate their relationship.
• Similarity Function (S[u, v]): Defines what type of node-node similarity
is being decoded.
• Loss Function (L): Measures the discrepancy between the decoder’s
output and the actual similarity measure.
Challenges in Multi-Relational Graphs
• Most multi-relational embedding methods focus on reconstructing
immediate neighbors because defining higher-order relationships is
difficult.
• The adjacency tensor (a multi-relational extension of an adjacency matrix)
is often used as the similarity function.
• A naive reconstruction loss (like mean-squared error) is impractical because:
• It is computationally expensive.
• It does not work well with binary-valued adjacency tensors.
Loss Functions
• 1) Cross-Entropy Loss with Negative Sampling
• An efficient alternative to the full reconstruction loss.
• Uses the logistic function σ to normalize decoder scores to [0, 1].
• Ensures correct predictions for existing edges and penalizes false positives using negative samples.
• Uses a Monte Carlo approximation for the expectation over negative samples.
Example Scenario
• Assume we have a knowledge graph with three entities:
u = "Paris"
v = "France"
v_n (negative sample) = "Germany"
Relationship τ = "capital_of"
• The model needs to correctly predict the relationship capital_of(Paris, France) while distinguishing it from incorrect relationships like capital_of(Paris, Germany).
• The decoder (DEC) gives a score for a given triplet (u, τ, v).
Step 1: Compute Decoder Scores
• For simplicity, assume we use a dot-product decoder, so each triple is scored by the dot product of the two entity embeddings.
Step 2: Apply the Logistic Function σ(x)
• The logistic function squashes each decoder score into a probability in [0, 1].
Step 3: Compute the Loss
• The cross-entropy loss with negative sampling is (for a single negative sample):
L = −log σ( DEC(u, τ, v) ) − log σ( −DEC(u, τ, v_n) )
• The positive edge (Paris, capital_of, France) should get a high score.
• The negative edge (Paris, capital_of, Germany) should get a low score.
• Negative sampling makes the computation efficient.
• The model learns embeddings that maximize the probability of correct relationships while minimizing the probability of incorrect ones.
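A small NumPy sketch of this loss for the Paris/France/Germany example; the embedding values are hypothetical, and a single negative sample is used.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2-D embeddings (illustrative values, not from the slides).
z = {"Paris":   np.array([1.0, 0.5]),
     "France":  np.array([0.9, 0.6]),
     "Germany": np.array([-0.8, 0.4])}

def dec(u, v):
    # Simple dot-product decoder standing in for DEC(u, "capital_of", v).
    return z[u] @ z[v]

pos = dec("Paris", "France")     # score of the true triple
neg = dec("Paris", "Germany")    # score of the negative sample

# Cross-entropy loss with a single negative sample:
loss = -np.log(sigmoid(pos)) - np.log(sigmoid(-neg))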
Max-Margin (Hinge) Loss
• A contrastive loss that ensures "true" edges receive higher scores than negative samples by at least a margin Δ:
L = Σ max( 0, Δ − DEC(u, τ, v) + DEC(u, τ, v_n) )
• The loss is zero if the true edge scores higher by at least Δ; otherwise, the model is penalized.
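A minimal sketch of this hinge loss for a single positive/negative score pair; the scores and the margin value are illustrative.

def hinge_loss(pos_score, neg_score, margin=1.0):
    # Zero when the true edge outscores the negative sample by at least the margin.
    return max(0.0, neg_score - pos_score + margin)

print(hinge_loss(1.2, -0.6))  # 0.0 -> margin satisfied, no penalty
print(hinge_loss(0.3, 0.1))   # 0.8 -> margin violated, the model is penalized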
