A Deep Dive Into Neo4j Link Prediction Pipeline and FastRP Embedding Algorithm
In my previous blog post, I introduced the newly available Link Prediction pipeline in the Neo4j Graph Data Science
library. Since then, I have taken more time to dig deeper into the inner workings of the pipeline, and I've learned a
couple of things along the way that I want to share with you. At first, I intended to show how the Link Prediction
pipeline combines node properties to generate input features for the Link Prediction model. However, while I was
developing the content, I noticed a couple of insights about using the FastRP embedding algorithm. Therefore, by the
end of this blog post, you will hopefully learn more about the FastRP embedding algorithm and how you can combine
multiple node features as an input to the Link Prediction model.
Graph import
I had to find a small network so I could easily visualize results as we go along. I decided to use the interaction network
from the first season of the Game of Thrones TV show made available by Andrew Beveridge. Characters of the show are
represented as nodes, and two characters are connected with an INTERACTS relationship if they interacted during the
season. We also know how many times two characters interacted, and we store that information as the weight
relationship property.
If you want to follow along with the examples in this post, I recommend using a Blank project in Neo4j Sandbox. It is a
free cloud instance of the Neo4j database that comes with both the APOC and Graph Data Science plugins pre-installed.
The dataset is available on GitHub, so we can easily import it into Neo4j with the following Cypher query:
// replace the URL with the raw link to the season 1 edges CSV in the dataset's GitHub repository
LOAD CSV WITH HEADERS FROM "<raw GitHub URL of the season 1 edges CSV>" AS row
MERGE (s:Character{name:row.Source})
MERGE (t:Character{name:row.Target})
MERGE (s)-[i:INTERACTS]-(t)
ON CREATE SET i.weight = toInteger(row.Weight)
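Like the rest of this post, I run most of the analysis from a Jupyter Notebook. If you want to do the same, you can wrap the Neo4j Python driver in a small helper like the run_query function used in the later snippets. This is a minimal sketch; the connection URI and credentials are placeholders for your own Sandbox instance.

import pandas as pd
from neo4j import GraphDatabase

# Placeholder connection details - replace them with the ones from your Neo4j Sandbox
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_query(query):
    """Execute a Cypher query and return the result as a pandas DataFrame."""
    with driver.session() as session:
        result = session.run(query)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

# Quick sanity check that the import worked
print(run_query("MATCH (c:Character) RETURN count(c) AS characters"))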
The Link Prediction pipeline consists of several steps, from defining node features and link features to splitting the
data and training the model. In this post, we will focus on the first two steps, so let's take a closer look at what is
happening there.
As a first step, you have to define node features. For example, you could use custom node properties such as age or
gender. You can also use graph algorithms such as PageRank or Betweenness centrality as initial node features. In
this blog post, we will start by using FastRP node embeddings to define initial node features. The nice thing about the
FastRP embedding algorithm is that it captures the network topology, so nodes that are close in the graph are also
similar in the embedding space. At the moment, you can't use pairwise information such
as the number of common neighbors or the length of the shortest path between a pair of nodes as input features.
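To make the first step concrete, here is a rough sketch of creating a pipeline and registering FastRP as a node property step. The procedure names follow the alpha-tier pipeline API that was current in GDS 1.x when this post was written, and the pipeline name 'pipe' is just an example, so double-check the names against your library version.

# Create an empty Link Prediction pipeline (alpha-tier API from GDS 1.x)
run_query("CALL gds.alpha.ml.pipeline.linkPrediction.create('pipe')")

# Register FastRP as a node property step that mutates an 'embedding' property
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP',
  {mutateProperty: 'embedding', embeddingDimension: 56})
""")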
In the second step, the link feature combiner creates a single feature from a pair of node properties. Currently, there
are three techniques that you can use to combine a pair of node properties into a single link feature vector:
Cosine distance
L2 or Euclidean distance
Hadamard product
In the above example, I have used the Hadamard product to combine a pair of node properties into a single link
feature vector. All the available link feature combiner techniques are order-invariant as the Link Prediction pipeline
supports predicting only undirected relationships at the moment. You can use multiple link feature combiners in a
single pipeline to define several feature vectors, which are then concatenated as an input to the Link Prediction
model. I’ll walk you through an example later in the post.
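Before that, it may help to see what each combiner actually computes. The snippet below uses plain numpy on two toy vectors. Note that this is only an illustration of the idea; the exact formulation inside the library may differ slightly (for example, the L2 feature may be applied entrywise rather than reduced to a single distance).

import numpy as np

a = np.array([0.5, 1.0, -0.25])
b = np.array([0.4, 0.8, -0.10])

# Cosine: a single number describing the angle between the two vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2 / Euclidean: how far apart the two vectors are
l2 = np.linalg.norm(a - b)

# Hadamard: entrywise multiplication, producing a vector of the same length
hadamard = a * b

print(cosine, l2, hadamard)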
Once the node features and link feature combiner are defined, you can train the model to predict new connections.
I wanted to evaluate the node embedding results, so we will store the node embeddings back to the database before
diving into the pipeline definition. We start by projecting an undirected named graph. Take a look at the documentation
for more information about the inner workings of the Graph Data Science library.
CALL gds.graph.create('gots1', 'Character',
  {INTERACTS:{orientation:'UNDIRECTED', properties:'weight'}})
We will use Louvain, a community detection algorithm, to help us better understand the results of the FastRP
embedding algorithm. You can use the following Cypher query to store the community structure information back to
the database.
CALL gds.louvain.write('gots1',
{writeProperty:'louvain', relationshipWeightProperty:'weight'})
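As a quick sanity check, we can count how many characters ended up in each community using the louvain property we just wrote.

# Count the members of each Louvain community
print(run_query("""
MATCH (c:Character)
RETURN c.louvain AS community, count(*) AS members
ORDER BY members DESC
"""))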
Throughout this blog post, I will be using Neo4j Bloom to visualize the results of algorithms and link predictions.
Take a look at this guide if you want to learn how to visualize networks with Bloom.
Community structure of the Network. Nodes are colored based on the community they belong to. Image by the author.
Now we can go ahead and execute the FastRP embedding algorithm. The algorithm will produce an embedding or a
fixed-size vector for every node in the graph. My friend CJ Sullivan wrote an excellent article explaining the inner
workings of the FastRP algorithm.
CALL gds.fastRP.write('gots1',
  {embeddingDimension: 56, writeProperty: 'embedding'})
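Before evaluating the embeddings, it is worth verifying that every character now has an embedding vector of the expected length.

# Every character should have a 56-dimensional embedding vector
print(run_query("""
MATCH (c:Character)
RETURN size(c.embedding) AS dimension, count(*) AS characters
"""))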
First, we will evaluate FastRP embeddings with a t-SNE scatter plot visualization. The stored node embeddings are
vectors with a length of 56, as defined by the embeddingDimension parameter. The t-SNE algorithm is a
dimensionality reduction algorithm, which we can use to reduce the embedding dimension to two. Having vectors
with length two allows us to visualize them with a scatter plot. The Python code I used for dimensionality reduction
and scatter plot visualization is:
(Embedded gist: lp_combiner_tsne.py)
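The embedded gist is not reproduced here, but the approach is straightforward. A minimal sketch, assuming the run_query helper from above together with scikit-learn and seaborn, could look like this:

import seaborn as sns
from sklearn.manifold import TSNE

# Fetch the stored embeddings together with the Louvain community of each character
df = run_query("""
MATCH (c:Character)
RETURN c.name AS name, c.louvain AS community, c.embedding AS embedding
""")

# Reduce the 56-dimensional embeddings to two dimensions with t-SNE
coordinates = TSNE(n_components=2, random_state=42).fit_transform(list(df['embedding']))
df['x'], df['y'] = coordinates[:, 0], coordinates[:, 1]

# Scatter plot colored by community
sns.scatterplot(data=df, x='x', y='y', hue='community')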
t-SNE scatter plot visualization of FastRP embeddings. Nodes are colored based on the community they belong to using
the Louvain algorithm. Image by the author.
FastRP embeddings and the Louvain algorithm were executed independently, and yet, we can observe that FastRP
embeddings cluster nodes in the same community close in the embedding space. This is no surprise as FastRP is a
community-based node embedding algorithm, meaning that nodes close in the graph will also be close in the
embedding space. Next, we will evaluate the cosine similarity between nodes in the graph.
MATCH (c:Character)
WITH {item: id(c), weights: c.embedding} AS characterData
WITH collect(characterData) AS data
CALL gds.alpha.similarity.cosine.stats({
data: data,
topK: 1000,
similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95, p99
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95, p99
Results
The average cosine similarity coefficient between nodes in the graph is around 0.5. Around 25% of the node pairs
have a cosine similarity greater than 0.81. Nodes are so similar in the embedding space because we have a tiny graph
of only 126 nodes. Next, we will evaluate the cosine similarity and Euclidean distance between pairs of nodes connected
by a relationship.
import seaborn as sns

# run_query executes a Cypher query and returns the result as a pandas DataFrame
df = run_query("""
MATCH (c1:Character)-[:INTERACTS]->(c2:Character)
RETURN gds.alpha.similarity.euclideanDistance(c1.embedding, c2.embedding) AS distance, 'euclidean' AS metric
UNION
MATCH (c1:Character)-[:INTERACTS]->(c2:Character)
RETURN gds.alpha.similarity.cosine(c1.embedding, c2.embedding) AS distance, 'cosine' AS metric
""")

# Plot the distribution of each metric side by side
sns.displot(data=df, x='distance', col='metric')
(Embedded gist: lp_combiner_rel_distance.py)
Results
Most node pairs that are connected with a relationship have a high cosine similarity. Again, this is expected as
FastRP is designed to translate the network topology into the embedding space. Therefore, we expect the neighboring
nodes in the graph to be very similar in the embedding space. We will examine the node pairs that are connected in the
network yet have a cosine similarity of less than 0.5, as this is a bit more unexpected. First, we have to tag them
with Cypher (the lowSimilarity property used below is just an arbitrary flag we can filter on in Bloom):
MATCH p=(c1:Character)-[i:INTERACTS]->(c2:Character)
WHERE gds.alpha.similarity.cosine(c1.embedding, c2.embedding) < 0.5
SET i.lowSimilarity = true
And now, we can go ahead and visualize them with Neo4j Bloom.
Relationships between pairs of nodes that have cosine similarity of less than 0.5. Image by the author.
It seems that pairs of connected nodes with a lower cosine similarity mainly occur when we have connections
between various clusters or communities in the network. If you remember the t-SNE visualization, the nodes in the
same community are nicely grouped in the embedding space. However, we have a couple of relationships between
nodes from different communities. When we have connections between nodes from various communities, their
similarity decreases. It also seems that these nodes have a higher degree, meaning they have many links within their
community and a couple of links to other communities. Therefore, they are more similar to neighbors within their
community and less similar to neighbors from other clusters.
Link Feature Combiner
We’ve got the embeddings ready, and we know that pairs of connected nodes are highly likely to have a high cosine
similarity in the embedding space. Now, we will evaluate how different link feature combiners affect the output of
the link prediction model.
Cosine combiner
Interestingly enough, the first combiner we will take a look at is the Cosine similarity combiner.
The Link Feature combiner takes pairs of node features and combines them into a single link feature, which is then
used as training data for the logistic regression model that will predict new links. We have already done the cosine
similarity analysis, so we know that node pairs with a high cosine similarity are likely to be connected. Therefore, you
might imagine that new predicted links will be between pairs of not yet connected nodes with high cosine similarity,
as this is precisely how our training data looks.
(Embedded gist: test_combiners.py)
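The gist itself is not reproduced here, but its overall shape looks roughly like the sketch below for the cosine variant; the L2 and Hadamard variants only swap the feature type in the addFeature call. As before, the procedure names follow the alpha-tier pipeline API from GDS 1.x, and names such as pipe_cosine, lp_cosine and PREDICTED_COSINE are just examples.

# Define a pipeline for the cosine variant (alpha-tier API from GDS 1.x)
run_query("CALL gds.alpha.ml.pipeline.linkPrediction.create('pipe_cosine')")

# Compute FastRP inside the pipeline so the training graph has the embedding property
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe_cosine', 'fastRP',
  {mutateProperty: 'embedding', embeddingDimension: 56})
""")

# Combine the pair of node embeddings into a single link feature
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('pipe_cosine', 'cosine',
  {nodeProperties: ['embedding']})
""")

# Train a logistic regression model on the projected graph
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.train('gots1',
  {pipeline: 'pipe_cosine', modelName: 'lp_cosine', randomSeed: 42})
""")

# Keep the top 20 predicted links in the in-memory graph
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate('gots1',
  {modelName: 'lp_cosine', mutateRelationshipType: 'PREDICTED_COSINE', topN: 20})
""")

# Write the predicted relationships back to the database so we can explore them in Bloom
run_query("CALL gds.graph.writeRelationship('gots1', 'PREDICTED_COSINE')")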
The script is a bit lengthy, as we need to define the entire Link Prediction pipeline for each link feature combiner option.
Finally, the predicted links are stored in Neo4j so that we can visualize them with Bloom. In my previous blog post, I
did a step-by-step explanation of the pipeline. As mentioned, I have prepared a Jupyter Notebook with all the code, so
you don’t have to copy it from the blog post directly.
Let’s examine the predicted links using the Cosine Link Feature combiner.
Top 20 predicted links using the Cosine Link Feature combiner. Image by the author.
The average Euclidean and Cosine similarity of node pairs that have predicted links is:
(Results gist: lp_combiner_cosine.csv)
As we might imagine, the cosine similarity between nodes of predicted links is on average 0.999. Thus, the results
agree with our training data. Furthermore, we can learn a bit about FastRP embeddings from the results. Peripheral
nodes with a low degree are very similar in the embedding space, even if they are not directly connected. For
example, the blue nodes in the bottom-left part of the visualization are all very similar. All four nodes have only a
single relationship, and they all share the same neighbor.
Next, we will take a look at the L2 link feature combiner, which calculates the Euclidean distance between two node
feature vectors.
Top 20 predicted links using the L2 Link Feature combiner. Image by the author.
The results are almost identical to the Cosine Link Feature combiner. It seems that the FastRP embedding algorithm
keeps neighboring nodes in the network close in terms of both cosine similarity and Euclidean distance.
The Hadamard link feature combiner uses the Hadamard product to produce a link feature. The Hadamard product is
simply entrywise multiplication.
Let’s examine the predicted links using the Hadamard link feature combiner.
Top 20 predicted links using the Hadamard Link Feature combiner. Image by the author.
The average Euclidean and Cosine similarity of node pairs that have predicted links is:
(Results gist: lp_combiner_hadamard.csv)
Some of the predicted links are similar to the Cosine and L2 link feature combiner results. For example, the
predicted links in the red community are almost identical. On the other hand, the model predicted more links in the
center instead of the periphery of the blue cluster.
Using multiple Link Feature Combiners
In the previous examples, we have only used a single link feature combiner. In the case of the Cosine or L2 link feature
combiners, we have effectively used only a single input feature for the logistic regression model. In practice, it makes
sense to have multiple input features that best describe your domain. As a demonstration, we will add the
Preferential attachment input as the second link feature. Looking at the Neo4j documentation, preferential attachment
is defined as the product of the node degrees of a pair of nodes. In practice, the preferential attachment
model assumes that nodes with a higher node degree are more likely to form new connections. Unfortunately, we
can’t automatically add the preferential attachment link feature just yet, but we can add it manually. To add the
preferential attachment input feature, we will first calculate the node degree values for all nodes. You often want to
normalize the input features for logistic regression models, so I will show you how to scale features directly in the
Link Prediction pipeline. Then we just need to add the Hadamard link feature combiner, which multiplies the input
features, in this case the node degrees. So if I understand the math correctly, the resulting link feature should
represent preferential attachment as we effectively multiply node degrees between pairs of nodes.
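As a quick check of that reasoning: with a one-dimensional degree feature, the Hadamard product of a pair of nodes reduces to a single number, the product of their degrees, which is exactly the preferential attachment score. The degree values below are made up purely for illustration.

# Hypothetical degrees of two nodes, stored as one-element feature vectors
degree_a = [5]
degree_b = [3]

# The Hadamard (entrywise) product is just the product of the two degrees
preferential_attachment = [x * y for x, y in zip(degree_a, degree_b)]
print(preferential_attachment)  # [15]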
Using multiple link feature combiners in Link Prediction pipeline. Image by the author.
In the Link Prediction pipeline, you can have as many link feature combiners as you wish. The results of all link
feature combiners are then concatenated into a single vector that is used as an input to the Link Prediction logistic
regression model. We can add the degree calculation and scaling directly into the pipeline and don't have to prepare
the degree features beforehand.
(Embedded gist: lp_combiner_pa.py)
We need to add the following three steps to our pipeline. The first step mutates the node degree onto the named graph.
The second step scales the degree value using the MinMax scaler and mutates it as the scaledDegree property. Lastly,
we use the Hadamard link feature combiner to combine the scaled node degrees. In the end, we'll quickly
evaluate how adding a secondary link feature affects the link prediction results.
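A minimal sketch of those three steps, again using the alpha-tier pipeline procedures from GDS 1.x; the pipeline name pipe_pa is an example, the degree and scaledDegree property names match the description above, and you should verify the exact step names against your GDS version.

# Step 1: mutate the degree of every node into the named graph
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe_pa', 'degree',
  {mutateProperty: 'degree'})
""")

# Step 2: rescale the degree with the MinMax scaler and store it as scaledDegree
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe_pa', 'alpha.scaleProperties',
  {nodeProperties: ['degree'], scaler: 'MinMax', mutateProperty: 'scaledDegree'})
""")

# Step 3: a Hadamard feature over the scaled degrees, i.e. the preferential attachment score
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('pipe_pa', 'hadamard',
  {nodeProperties: ['scaledDegree']})
""")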
We get the following results using the Cosine link feature combiner with FastRP embeddings and Hadamard
combiner with scaled node degrees.
Adding the secondary link feature shifted predicted links from the network periphery to the center. In some way, it
makes sense because the preferential attachment prefers links between nodes with a higher degree, and those nodes
are usually located in the center of the network.
We get the following results using the L2 (Euclidean) link feature combiner with FastRP embeddings and Hadamard
combiner with scaled node degrees.
The secondary link feature didn't really affect the results here. I've also tried not scaling the node degree values, and
in that case the predicted links shift strongly towards the center.
Using Hadamard link feature combiner without scaling node degrees. Image by the author.
If we don’t scale the node degrees before using the Hadamard combiner, the preferential attachment feature becomes
dominant. Therefore, the predicted links are between nodes with a high degree instead of between the low-degree nodes
on the periphery.
Conclusion
I have really enjoyed writing this blog post and learned a lot about the FastRP embedding algorithm and the Link Prediction
pipeline along the way. A quick summary would be:
FastRP is more likely to assign high similarity between neighboring nodes with a low degree
On the other hand, the cosine similarity between connected nodes of different communities could be lower than 0.5
Using multiple link feature combiners can help you better describe your domain
Scaling node features influences the results of the Link Prediction logistic regression model
I encourage you to start a free Neo4j Sandbox project and start predicting new connections in your graphs. Let me
know how it goes!