

Published in Towards Data Science

Tomaz Bratanic

Oct 7, 2021 · 12 min read


A Deep Dive into Neo4j Link Prediction Pipeline and FastRP Embedding Algorithm

Learn how to train and optimize Link Prediction models in the Neo4j Graph Data Science library to get the best results

In my previous blog post, I introduced the newly available Link Prediction pipeline in the Neo4j Graph Data Science library. Since then, I have taken more time to dig deeper and learn the inner workings of the pipeline, and I've picked up a couple of things along the way that I want to share with you. At first, I intended to show how the Link Prediction pipeline combines node properties to generate input features for the Link Prediction model. However, while developing the content, I noticed a couple of insights about using the FastRP embedding algorithm. Therefore, by the end of this blog post, you will hopefully have learned more about the FastRP embedding model and how you can combine multiple node features as input to the Link Prediction model.

The code used in this post is available on GitHub.

Graph import

I had to find a small network so that I could easily visualize the results as we go along. I decided to use the character interaction network from the first season of the Game of Thrones TV show, made available by Andrew Beveridge.

Graph model. Image by the author.


The graph model consists of characters and their interactions. We will treat the interaction relationship as undirected: if character A interacts with character B, this directly implies that character B also interacted with character A. We also know how many times two characters interacted, and we store that information as a relationship property.

If you want to follow along with the examples in this post, I recommend using a Blank project in Neo4j Sandbox. It is a free cloud instance of the Neo4j database that comes pre-installed with both the APOC and Graph Data Science plugins.
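
The Python snippets later in the post use a run_query helper to execute Cypher and return the results as a DataFrame. The helper itself is presumably defined in the accompanying notebook; a minimal sketch of how it might look, assuming the official neo4j Python driver and pandas, with placeholder connection details that you would replace with your Sandbox credentials:

import pandas as pd
from neo4j import GraphDatabase

# Placeholder connection details - replace with your Neo4j Sandbox URI and credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_query(query, params={}):
    # Execute a Cypher statement and return the results as a pandas DataFrame
    with driver.session() as session:
        result = session.run(query, params)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())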

The dataset is available on GitHub, so we can easily import it into Neo4j with the following Cypher query:

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/mathbeveridge/gameofthrones/master/data/got-s1-edges.csv"
AS row
MERGE (s:Character{name:row.Source})
MERGE (t:Character{name:row.Target})
MERGE (s)-[i:INTERACTS]-(t)
SET i.weight = toInteger(row.Weight)
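
To sanity-check the import, you can count the imported characters and interactions with the run_query helper sketched above; the graph should contain 126 character nodes.

# Quick sanity check of the imported graph
print(run_query("""
MATCH (c:Character) RETURN count(c) AS characters
"""))

print(run_query("""
MATCH (:Character)-[i:INTERACTS]->(:Character) RETURN count(i) AS interactions
"""))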

Link prediction pipeline


Under the hood, the link prediction model in Neo4j uses a logistic regression classifier. We are dealing with a binary classification problem, where we want to predict whether a link exists between a pair of nodes or not. At a high level, the link prediction pipeline consists of the following steps:

Image by the author.

In this post, we will focus on the first two steps, so let’s take a closer look at what is happening there.

Feature engineering for link prediction pipeline. Image by the author.

As a first step, you have to define node features. For example, you could use custom node properties such as age or gender. You can also use graph algorithms such as PageRank or Betweenness centrality to produce initial node features. In this blog post, we will start by using FastRP node embeddings to define the initial node features. The nice thing about the FastRP embedding algorithm is that it captures the network information and preserves similarity in the embedding space for nodes that are close in the graph. At the moment, you can't use pairwise information such as the number of common neighbors or the length of the shortest path between a pair of nodes as input features.

In the second step, the link feature combiner creates a single feature from a pair of node properties. Currently, there
are three techniques that you can use to combine a pair of node properties into a single link feature vector:

Cosine distance

L2 or Euclidean distance

Hadamard product

In the above example, I have used the Hadamard product to combine a pair of node properties into a single link
feature vector. All the available link feature combiner techniques are order-invariant as the Link Prediction pipeline
supports predicting only undirected relationships at the moment. You can use multiple link feature combiners in a
single pipeline to define several feature vectors, which are then concatenated as an input to the Link Prediction
model. I’ll walk you through an example later in the post.

Once the node features and link feature combiner are defined, you can train the model to predict new connections.

Node Feature engineering


You can preprocess node features before defining the Link Prediction pipeline. You can also include them directly in
the pipeline definition if you only use graph algorithms such as node embeddings or centrality measures as node
features. In the first example, we will use the FastRP embeddings as our Link Prediction model node features.
Therefore, we could potentially include them in the pipeline definition. However, we will first do a short analysis of
the node embedding results, so we need to store the node embeddings to the graph before diving into the pipeline
definition. We start by projecting an undirected named graph. Take a look at the documentation for more
information about the inner workings of the Graph Data Science library.

CALL gds.graph.create('gots1', 'Character',
  {INTERACTS:{orientation:'UNDIRECTED', properties:'weight'}})

We will use Louvain, a community detection algorithm, to help us better understand the results of the FastRP
embedding algorithm. You can use the following Cypher query to store the community structure information back to
the database.

CALL gds.louvain.write('gots1',
  {writeProperty:'louvain', relationshipWeightProperty:'weight'})

Throughout this blog post, I will be using Neo4j Bloom to visualize the results of algorithms and link predictions.
Take a look at this guide if you want to learn how to visualize networks with Bloom.

Community structure of the Network. Nodes are colored based on the community they belong to. Image by the author.

Now we can go ahead and execute the FastRP embedding algorithm. The algorithm will produce an embedding, i.e. a fixed-size vector, for every node in the graph. My friend CJ Sullivan wrote an excellent article explaining the inner workings of the FastRP algorithm.

CALL gds.fastRP.write('gots1',
  {writeProperty:'embedding', embeddingDimension:56, relationshipWeightProperty:'weight'})
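
As a quick check that the embeddings were written (again using the run_query helper from earlier), we can confirm that each stored vector has the configured length of 56:

# Verify the stored embedding dimension for a few characters
print(run_query("""
MATCH (c:Character)
RETURN c.name AS name, size(c.embedding) AS dimension
LIMIT 5
"""))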

First, we will evaluate the FastRP embeddings with a t-SNE scatter plot visualization. The stored node embeddings are vectors of length 56, as defined by the embeddingDimension parameter. The t-SNE algorithm is a dimensionality reduction algorithm, which we can use to reduce the embedding dimension to two. Having vectors of length two allows us to visualize them with a scatter plot. The Python code I used for dimensionality reduction and scatter plot visualization is:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE


def tsne(embeddings, hue=None):
    # Reduce the 56-dimensional FastRP embeddings to two dimensions
    tsne = TSNE(n_components=2, n_iter=300)
    tsne_results = tsne.fit_transform(embeddings['embedding'].to_list())

    embeddings['tsne_x'] = [x[0] for x in list(tsne_results)]
    embeddings['tsne_y'] = [x[1] for x in list(tsne_results)]

    # Scatter plot colored by the supplied hue column (here, the Louvain community)
    plt.figure(figsize=(16, 10))
    sns.scatterplot(
        x="tsne_x", y="tsne_y",
        hue=hue,
        palette="deep",
        data=embeddings,
        legend="full",
        alpha=0.9
    )
    plt.show()


# Fetch the embeddings and Louvain communities from Neo4j
tsne_input = run_query("""
MATCH (c:Character)
RETURN c.name as character, c.embedding as embedding, c.louvain as hue
""")

tsne(tsne_input, 'hue')

This code produces the following visualization:



t-SNE scatter plot visualization of FastRP embeddings. Nodes are colored based on the community they belong to using the Louvain algorithm. Image by the author.

FastRP embeddings and the Louvain algorithm were executed independently, and yet we can observe that FastRP embeddings cluster nodes from the same community close together in the embedding space. This is no surprise, as FastRP is a community-based node embedding algorithm, meaning that nodes close in the graph will also be close in the embedding space. Next, we will evaluate the cosine similarity between nodes in the graph.

MATCH (c:Character)
WITH {item:id(c), weights: c.embedding} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stats({
  data: data,
  topK: 1000,
  similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95, p99
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95, p99

Results


nodes  similarityPairs  min       max       mean      p25       p50       p75       p90       p95       p99
126    13546            0.100346  1.000007  0.537947  0.281726  0.496877  0.816959  0.964294  0.984401  0.99615

The average cosine similarity coefficient between all nodes in the graph is around 0.5, and around 25% of the node pairs have a cosine similarity greater than 0.81. Nodes are so similar in the embedding space because we have a tiny graph of only 126 nodes. Next, we will evaluate the cosine and Euclidean distance between pairs of nodes connected by a relationship.

df = run_query("""
MATCH (c1:Character)-[:INTERACTS]->(c2:Character)
RETURN gds.alpha.similarity.euclideanDistance(c1.embedding, c2.embedding) AS distance, 'euclidean' AS metric
UNION
MATCH (c1:Character)-[:INTERACTS]->(c2:Character)
RETURN gds.alpha.similarity.cosine(c1.embedding, c2.embedding) AS distance, 'cosine' AS metric
""")

# Plot the distributions of both metrics side by side
sns.displot(data=df, x='distance', col='metric')

Results

Most node pairs that are connected by a relationship have a high cosine similarity. Again, this is expected, as FastRP is designed to translate the network topology into the embedding space. Therefore, we expect neighboring nodes in the graph to be very similar in the embedding space. We will examine the node pairs connected in the network with a cosine similarity of less than 0.5, as this is a bit more unexpected. First, we have to tag them with Cypher:

MATCH p=(c1:Character)-[i:INTERACTS]->(c2:Character)
WHERE gds.alpha.similarity.cosine(c1.embedding, c2.embedding) < 0.5
SET i.show = True

And now, we can go ahead and visualize them with Neo4j Bloom.

Relationships between pairs of nodes that have a cosine similarity of less than 0.5. Image by the author.

It seems that pairs of connected nodes with a lower cosine similarity mainly occur where there are connections between various clusters or communities in the network. If you remember the t-SNE visualization, the nodes in the same community are nicely grouped in the embedding space. However, we have a couple of relationships between nodes from different communities. When we have connections between nodes from various communities, their similarity decreases. It also seems that these nodes have a higher degree, meaning they have many links within their community and a couple of links to other communities. Therefore, they are more similar to neighbors within their own community and less similar to neighbors from other clusters.
Link Feature Combiner
We’ve got the embeddings ready, and we know that pairs of connected nodes are highly likely to have a high cosine
similarity in the embedding space. Now, we will evaluate how different link feature combiners affect the output of
the link prediction model.

Cosine combiner

Interestingly enough, the first combiner we will take a look at is the Cosine similarity combiner.

Cosine link feature combiner. Image by the author.

The Link Feature combiner takes pairs of node features and combines them into a single link feature, which is then used as training data for the logistic regression model that will predict new links. We have already done the cosine similarity analysis, so we know that node pairs with a high cosine similarity are likely to be connected. Therefore, you might expect that newly predicted links will be between pairs of not-yet-connected nodes with a high cosine similarity, as this is precisely how our training data looks.

The Python script we will be using to produce link predictions is:

def generate_links(combiner, predictedRelType):
    # Delete all trained models & drop all named graphs
    run_query("""
    CALL gds.graph.list() YIELD graphName CALL gds.graph.drop(graphName) YIELD graphName as done
    RETURN distinct 'dropped named graphs' as result
    UNION
    CALL gds.beta.model.list() YIELD modelInfo CALL gds.beta.model.drop(modelInfo.modelName) YIELD modelInfo as done
    RETURN distinct 'dropped ML models' as result
    """)
    # Define a new LP pipeline
    run_query("""CALL gds.alpha.ml.pipeline.linkPrediction.create('lp-pipeline')""")
    # Define the link feature combiner on the FastRP embeddings
    run_query(f"""
    CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('lp-pipeline', '{combiner}', {{
        nodeProperties: ['embedding']}}) YIELD featureSteps;
    """)
    # Define the train-test split
    run_query("""
    CALL gds.alpha.ml.pipeline.linkPrediction.configureSplit(
        'lp-pipeline', {testFraction: 0.3, trainFraction: 0.6, validationFolds: 7})
    YIELD splitConfig;
    """)
    # Configure the candidate logistic regression model parameters
    run_query("""
    CALL gds.alpha.ml.pipeline.linkPrediction.configureParams('lp-pipeline',
        [{penalty:0.001, tolerance: 0.01, maxEpochs: 500},
         {penalty:0.01, tolerance: 0.01, maxEpochs: 500}]) YIELD parameterSpace;""")
    # Construct the named graph used for training
    run_query("""
    CALL gds.graph.create('lp-graph',
        'Character', {INTERACTS:{orientation:'UNDIRECTED'}}, {nodeProperties:'embedding'});
    """)
    # Train the model
    run_query("""
    CALL gds.alpha.ml.pipeline.linkPrediction.train('lp-graph',
        {pipeline: 'lp-pipeline', modelName: 'lp-model', randomSeed: 42}) YIELD modelInfo
    RETURN modelInfo.bestParameters AS winningModel,
           modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
           modelInfo.metrics.AUCPR.test AS testGraphScore
    """)
    # Predict new relationships and mutate them into the named graph
    run_query("""
    CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate('lp-graph',
        {modelName: 'lp-model', mutateRelationshipType: 'I', topN: 40, threshold: 0.45})
    YIELD relationshipsWritten;
    """)
    # Store the predicted relationships back to the database and return similarity stats
    predicted_links_stats = run_query(f"""
    CALL gds.graph.streamRelationshipProperty('lp-graph', 'probability', ['I'])
    YIELD sourceNodeId, targetNodeId, propertyValue as probability
    WHERE sourceNodeId < targetNodeId
    MATCH (s),(t)
    WHERE id(s)=sourceNodeId AND id(t)=targetNodeId
    MERGE (s)-[:{predictedRelType}]-(t)
    RETURN s.name as c1, t.name as c2,
           gds.alpha.similarity.euclideanDistance(s.embedding, t.embedding) AS euclidian_similarity,
           gds.alpha.similarity.cosine(s.embedding, t.embedding) AS cosine_similarity
    """)
    return predicted_links_stats


The script is a bit lengthy, as we need to define the entire Link Prediction pipeline for each link feature combiner option. Finally, the predicted links are stored to Neo4j so that we can visualize them with Bloom. In my previous blog post, I did a step-by-step explanation of the pipeline. As mentioned, I have prepared a Jupyter Notebook with all the code, so you don't have to copy it from the blog post directly.

Let’s examine the predicted links using the Cosine Link Feature combiner.
Top 20 predicted links using the Cosine Link Feature combiner. Image by the author.

The average Euclidean and cosine similarity of node pairs that have predicted links is:


euclidian_similarity  cosine_similarity
0.050814              0.999548

As we might expect, the cosine similarity between the nodes of predicted links is on average 0.999. Thus, the results agree with our training data. Furthermore, we can learn a bit about FastRP embeddings from the results. Peripheral nodes with a low degree are very similar in the embedding space, even if they are not directly connected. For example, the blue nodes in the bottom-left part of the visualizations are all very similar. All four nodes have only a single relationship, and they share their only neighbor.

L2 link feature combiner

Next, we will take a look at the L2 link feature combiner. The L2 link feature combiner calculates the Euclidean distance between two node features.

Top 20 predicted links using the L2 Link Feature combiner. Image by the author.

The results are almost identical to the Cosine Link Feature combiner. It seems that the FastRP embedding algorithm optimizes both cosine and Euclidean distance similarities between neighboring nodes in the network.

Hadamard link feature combiner

The Hadamard link feature combiner uses the Hadamard product to produce a link feature. The Hadamard product is
simply entrywise multiplication.

Hadamard link feature combiner. Image by the author.

Let’s examine the predicted links using the Hadamard link feature combiner.

Top 20 predicted links using the Hadamard Link Feature combiner. Image by the author.

The average Euclidean and cosine similarity of node pairs that have predicted links is:


euclidian_similarity  cosine_similarity
0.180325              0.995475

Some of the predicted links are similar to the Cosine and Euclidean link feature combiner results. For example, the predicted links in the red community are almost identical. On the other hand, the model predicted more links in the center of the blue cluster instead of on its periphery.
Using multiple Link Feature Combiners
In the previous examples, we have only used a single link feature combiner. In the case of the Cosine or L2 link feature combiner, we have effectively only used a single input feature for the logistic regression model. In practice, it makes sense to have multiple input features that best describe your domain. As a demonstration, we will add preferential attachment as the second link feature. Looking at the Neo4j documentation, preferential attachment is defined as the product of the node degrees of a pair of nodes. In practice, the preferential attachment model assumes that nodes with a higher degree are more likely to form new connections. Unfortunately, we can't automatically add the preferential attachment link feature just yet, but we can add it manually. To add the preferential attachment input feature, we will first calculate the node degree values for all nodes. You sometimes want to normalize the input features for logistic regression models, so I will show you how to scale features directly in the Link Prediction pipeline. Then we just need to add the Hadamard link feature combiner, which multiplies the input values, in this case the node degrees. So if I understand the math correctly, the resulting link feature should represent preferential attachment, as we effectively multiply the node degrees of pairs of nodes.
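
A tiny numeric sketch of that reasoning (toy degree values, not library code): applying an entrywise product to one-element degree vectors reproduces the preferential attachment score.

# Node degrees treated as one-element feature vectors
degree_a = [4.0]  # node a has degree 4
degree_b = [7.0]  # node b has degree 7

# Entrywise (Hadamard) product of the two degree vectors
hadamard_feature = [x * y for x, y in zip(degree_a, degree_b)]

print(hadamard_feature)  # [28.0] == deg(a) * deg(b), the preferential attachment score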

Using multiple link feature combiners in Link Prediction pipeline. Image by the author.

In the Link Prediction pipeline, you can have as many link feature combiners as you wish. The results of all link feature combiners are then concatenated into a single vector that is used as input to the Link Prediction logistic regression model. We can add the degree calculation and scaling directly into the pipeline, so we don't have to prepare the degree features beforehand.

# Get the degree value of each node
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('lp-pipeline', 'degree', {
  mutateProperty: 'degree'
})""")

# Scale the degree using the MinMax scaler
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('lp-pipeline', 'scaleProperties', {
  nodeProperties: ['degree'],
  mutateProperty: 'scaledDegree',
  scaler: 'MinMax'
})
""")

# Define the HADAMARD combiner for the scaled node degree
run_query("""
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('lp-pipeline', 'HADAMARD', {
  nodeProperties: ['scaledDegree']
}) YIELD featureSteps;
""")

These are the three steps we need to add to our pipeline. The first query mutates the degree value onto the named graph. The second step scales the degree using the MinMax scaler and mutates it as the scaledDegree property. Lastly, we use the Hadamard link feature combiner to combine the node degrees. In the end, we'll quickly evaluate how adding a secondary link feature affects the link prediction results.

We get the following results using the Cosine link feature combiner with FastRP embeddings and Hadamard
combiner with scaled node degrees.

Image by the author.

Adding the secondary link feature shifted the predicted links from the network periphery to the center. In some way, it makes sense, because preferential attachment favors links between nodes with a higher degree, and those nodes are usually located in the center of the network.

We get the following results using the Euclidean link feature combiner with FastRP embeddings and the Hadamard combiner with scaled node degrees.


The secondary link feature didn’t really affect the results. I’ve also tried not scaling the node degree values and then
the predicted links shift strongly towards the center.
Open in app Get started

Using Hadamard link feature combiner without scaling node degrees. Image by the author.

If we don’t scale the node degrees before using the Hadamard combiner, the preferential attachment feature becomes
dominant. Therefore, the predicted links are between nodes with a high degree instead of the low degree nodes on
the periphery.

Conclusion
I have really enjoyed writing this blog post and learned a lot about the FastRP embedding algorithm and the Link Prediction pipeline along the way. A quick summary would be:

FastRP is more likely to assign a high similarity between neighboring nodes with a low degree

On the other hand, the cosine similarity between connected nodes from different communities can be lower than 0.5

Using multiple link feature combiners can help you better describe your domain

Scaling node features influences the results of the Link Prediction logistic regression model

I encourage you to start a free Neo4j Sandbox project and start predicting new connections in your graphs. Let me
know how it goes!

As always, the code is available on GitHub.
