Graph Machine Learning
Claudio Stamile
Aldo Marzullo
Enrico Deusebio
BIRMINGHAM—MUMBAI
Graph Machine Learning
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the publisher,
except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without warranty,
either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors,
will be held liable for any damages caused or alleged to have been caused directly or indirectly by
this book.
Packt Publishing has endeavored to provide trademark information about all of the companies
and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing
cannot guarantee the accuracy of this information.
ISBN 978-1-80020-449-2
www.packt.com
In memory of my uncle, Franchino Avolio. To bicycle wheels that were too flat,
to the childhood he gave me.
To my family, my roots.
– Aldo Marzullo
Preface
2
Graph Machine Learning
Technical requirements
Understanding machine learning on graphs
Basic principles of machine learning
The benefit of machine learning on graphs
The generalized graph embedding problem
The taxonomy of graph embedding machine learning algorithms
The categorization of embedding algorithms
Summary
4
Supervised Graph Learning
Technical requirements
The supervised graph embedding roadmap
Feature-based methods
Shallow embedding methods
Label propagation algorithm
Label spreading algorithm
Manifold regularization and semi-supervised embedding
Neural Graph Learning
Planetoid
Graph CNNs
Graph classification using GCNs
Node classification using GraphSAGE
5
Problems with Machine Learning on Graphs
Technical requirements
Predicting missing links in a graph
Similarity-based methods
Embedding-based methods
Detecting meaningful structures such as communities
7
Text Analytics and Natural Language Processing Using Graphs
Technical requirements
Providing a quick overview of a dataset
Understanding the main concepts and tools used in NLP
Creating graphs from a corpus of documents
Knowledge graphs
Bipartite document/entity graphs
Building a document topic classifier
Shallow learning methods
Graph neural networks
Summary
8
Graph Analysis for Credit Card Transactions
Technical requirements
Overview of the dataset
Loading the dataset and graph building using networkx
Network topology and community detection
Network topology
Community detection
Embedding for supervised and unsupervised fraud detection
Supervised approach to fraudulent transaction identification
Unsupervised approach to fraudulent transaction identification
Summary
9
Building a Data-Driven Graph-Powered Application
Technical requirements
Overview of Lambda architectures
Graph processing engines
Graph querying layer
Selecting between Neo4j and GraphX
10
Novel Trends on Graphs
Technical requirements
Learning about data augmentation for graphs
Sampling strategies
Exploring data augmentation techniques
Learning about topological data analysis
Topological machine learning
Graph machine learning and neuroscience
Graph theory and chemistry and biology
Graph machine learning and computer vision
Recommendation systems
Summary
Why subscribe?
If you are using the digital version of this book, we advise you to type the code yourself
or access the code via the GitHub repository (link available in the next section). Doing
so will help you avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles.
Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as
another disk in your system."
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant
lines or items are set in bold:
Jupyter==1.0.0
networkx==2.5
matplotlib==3.2.2
node2vec==0.3.3
karateclub==1.0.19
scipy==1.6.2
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see on screen. For
example, words in menus or dialog boxes appear in the text like this. Here is an example:
"Select System info from the Administration panel."
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book
title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you have found a mistake in this book, we would be grateful if you would
report this to us. Please visit www.packtpub.com/support/errata, selecting your
book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet,
we would be grateful if you would provide us with the location address or website name.
Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in,
and you are interested in either writing or contributing to a book, please visit authors.
packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on
the site that you purchased it from? Potential readers can then see and use your unbiased
opinion to make purchase decisions, we at Packt can understand what you think about
our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Section 1 – Introduction to Graph Machine Learning
In this section, the reader will get a brief introduction to graph machine learning, showing
the potential of graphs combined with the right machine learning algorithms. Moreover,
a general overview of graph theory and Python libraries is provided in order to allow the
reader to deal with (that is, create, modify, and plot) graph data structures.
This section comprises the following chapters:
Technical requirements
We will be using Jupyter Notebooks with Python 3.8 for all of our exercises. In the
following code snippet, we show a list of Python libraries that will be installed for
this chapter using pip (for example, run pip install networkx==2.5 on the
command line, and so on):
Jupyter==1.0.0
networkx==2.5
snap-stanford==5.0.0
matplotlib==3.2.2
pandas==1.1.3
scipy==1.6.2
• import networkx as nx
• import pandas as pd
• import numpy as np
• The order of a graph is the number of its vertices |V|. The size of a graph is the
number of its edges |E|.
• The degree of a vertex is the number of edges that are incident to it. The neighbors of
a vertex v in a graph G are the subset of vertices V′ induced by all vertices adjacent to v.
• The neighborhood graph (also known as an ego graph) of a vertex v in a graph G
is a subgraph of G, composed of the vertices adjacent to v and all edges connecting
vertices adjacent to v.
An example of what a graph looks like can be seen in the following screenshot:
According to this representation, since there is no direction, an edge from Milan to Paris
is equal to an edge from Paris to Milan. Thus, it is possible to move in the two directions
without any constraint. If we analyze the properties of the graph depicted in Figure 1.1,
we can see that it has order and size equal to 4 (there are, in total, four vertices and four
edges). The Paris and Dublin vertices have degree 2, Milan has degree 3, and Rome has
degree 1. The neighbors for each node are shown in the following list:
import networkx as nx
G = nx.Graph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Milan','Paris'), ('Paris','Dublin'),
('Milan','Rome')]
G.add_nodes_from(V)
G.add_edges_from(E)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")
We can also compute the graph order, the graph size, and the degree and neighbors for
each of the nodes, using the following commands:
Graph Order: 4
Graph Size: 4
Degree for nodes: {'Rome': 1, 'Paris': 2, 'Dublin':2, 'Milan':
3}
Neighbors for nodes: {'Rome': ['Milan'], 'Paris': ['Milan',
'Dublin'], 'Dublin': ['Milan', 'Paris'], 'Milan': ['Dublin',
'Paris', 'Rome']}
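The commands that produce output of this shape are not reproduced above. A minimal sketch, assuming the undirected example graph is rebuilt from its edge list (the variable names `order`, `size`, `degrees`, and `neighbors` are illustrative and not taken from the text):

```python
import networkx as nx

# Rebuild the undirected example graph (Milan, Dublin, Paris, Rome).
G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

order = G.number_of_nodes()   # graph order, |V|
size = G.number_of_edges()    # graph size, |E|
degrees = {v: G.degree(v) for v in G.nodes}
neighbors = {v: list(G.neighbors(v)) for v in G.nodes}

print(f"Graph Order: {order}")
print(f"Graph Size: {size}")
print(f"Degree for nodes: {degrees}")
print(f"Neighbors for nodes: {neighbors}")
```

The order in which nodes are printed depends on insertion order, so it may differ from the listing above even though the values agree.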
Finally, we can also compute an ego graph of a specific node for the graph G, as follows:
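A minimal sketch of this step, assuming the `nx.ego_graph` function and choosing Dublin as the node of interest (the choice of node is an assumption made for illustration). Note that, by default, `nx.ego_graph` also keeps the center node itself in the returned subgraph:

```python
import networkx as nx

# Rebuild the undirected example graph.
G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Subgraph made of Dublin, its neighbors, and the edges among them.
ego_graph_dublin = nx.ego_graph(G, 'Dublin')
print(f"Nodes: {ego_graph_dublin.nodes}")
print(f"Edges: {ego_graph_dublin.edges}")
```

Rome is not adjacent to Dublin, so it does not appear in the result.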
The original graph can also be modified by adding new nodes and/or edges, as follows:
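A minimal sketch of such a modification, using the same `add_nodes_from` and `add_edges_from` methods shown earlier. The new cities and connections (`London`, `Madrid`) are hypothetical values chosen for illustration:

```python
import networkx as nx

# Rebuild the undirected example graph.
G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Hypothetical additions: two new nodes and two new edges.
new_nodes = {'London', 'Madrid'}
new_edges = [('London', 'Rome'), ('Madrid', 'Paris')]
G.add_nodes_from(new_nodes)
G.add_edges_from(new_edges)

print(f"V = {G.nodes}")
print(f"E = {G.edges}")
```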
As expected, all the edges that contain the removed nodes are automatically deleted from
the edge list.
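The node-removal step that this remark refers to is not reproduced above. A minimal sketch, assuming the `remove_nodes_from` method and an illustrative choice of nodes to drop (Dublin and Rome are assumptions, not values from the text):

```python
import networkx as nx

# Rebuild the undirected example graph.
G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Removing nodes also removes every edge incident to them.
node_remove = {'Dublin', 'Rome'}
G.remove_nodes_from(node_remove)

print(f"V = {G.nodes}")
print(f"E = {G.edges}")
```

Only the Milan-Paris edge survives, because every other edge touched a removed node.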
Also, edges can be removed by running the following code:
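A minimal sketch of edge removal, assuming the `remove_edges_from` method; the edges chosen for removal are illustrative:

```python
import networkx as nx

# Rebuild the undirected example graph.
G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Removing edges leaves the node set untouched.
edge_remove = [('Milan', 'Dublin'), ('Milan', 'Paris')]
G.remove_edges_from(edge_remove)

print(f"V = {G.nodes}")
print(f"E = {G.edges}")
```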
The networkx library also allows us to remove a single node or a single edge from
a graph G by using the following commands: G.remove_node('Dublin') and
G.remove_edge('Dublin', 'Paris').
Types of graphs
In the previous section, we described how to create and modify simple undirected graphs.
Here, we will show how we can extend this basic data structure in order to encapsulate
more information, thanks to the introduction of directed graphs (digraphs), weighted
graphs, and multigraphs.
Digraphs
A digraph G is defined as a couple G=(V, E), where V={v_1, ..., v_n} is a set of nodes and
E={(v_k, v_w), ..., (v_i, v_j)} is a set of ordered couples representing the connection between
two nodes belonging to V.
Since each element of E is an ordered couple, it enforces the direction of the connection.
The edge (v_k, v_w) means the node v_k goes into v_w. This is different from (v_w, v_k),
which means the node v_w goes into v_k. For an edge (v_k, v_w), the starting node v_k is
called the tail, while the ending node v_w is called the head.
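Due to the presence of edge direction, the definition of node degree needs to be extended:
the indegree $\deg^{-}(v)$ of a vertex v counts its incoming edges, and the outdegree
$\deg^{+}(v)$ counts its outgoing edges.
The direction of the edge is visible from the arrow; for example, Milan -> Dublin means
from Milan to Dublin. Dublin has $\deg^{-}(v) = 2$ and $\deg^{+}(v) = 0$, Paris has
$\deg^{-}(v) = 0$ and $\deg^{+}(v) = 2$, Milan has $\deg^{-}(v) = 1$ and $\deg^{+}(v) = 2$,
and Rome has $\deg^{-}(v) = 1$ and $\deg^{+}(v) = 0$.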
The same graph can be represented in networkx, as follows:
G = nx.DiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Paris','Milan'), ('Paris','Dublin'),
('Milan','Rome')]
G.add_nodes_from(V)
G.add_edges_from(E)
The definition is the same as that used for simple undirected graphs; the only difference
is in the networkx classes that are used to instantiate the object. For digraphs, the
nx.DiGraph() class is used.
Indegree and Outdegree can be computed using the following commands:
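A minimal sketch of these commands, assuming the `in_degree` and `out_degree` views of `nx.DiGraph` (the dictionary names are illustrative):

```python
import networkx as nx

# Rebuild the directed example graph.
G = nx.DiGraph()
G.add_edges_from([('Milan', 'Dublin'), ('Paris', 'Milan'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Incoming and outgoing edge counts for every node.
indegree = {v: G.in_degree(v) for v in G.nodes}
outdegree = {v: G.out_degree(v) for v in G.nodes}

print(f"Indegree for nodes: {indegree}")
print(f"Outdegree for nodes: {outdegree}")
```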
Multigraph
We will now introduce the multigraph object, which is a generalization of the graph
definition that allows multiple edges to have the same pair of start and end nodes.
A multigraph G is defined as G=(V, E), where V is a set of nodes and E is a multi-set (a set
allowing multiple instances for each of its elements) of edges.
directed_multi_graph = nx.MultiDiGraph()
undirected_multi_graph = nx.MultiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Milan','Dublin'), ('Paris','Milan'),
('Paris','Dublin'), ('Milan','Rome'), ('Milan','Rome')]
directed_multi_graph.add_nodes_from(V)
undirected_multi_graph.add_nodes_from(V)
directed_multi_graph.add_edges_from(E)
undirected_multi_graph.add_edges_from(E)
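To see the effect of the multi-set of edges, the same edge list can be loaded into a multigraph and into a simple graph and the edge counts compared. This is a small sketch, with `simple_graph` introduced only for the comparison:

```python
import networkx as nx

E = [('Milan', 'Dublin'), ('Milan', 'Dublin'), ('Paris', 'Milan'),
     ('Paris', 'Dublin'), ('Milan', 'Rome'), ('Milan', 'Rome')]

directed_multi_graph = nx.MultiDiGraph()
directed_multi_graph.add_edges_from(E)

# A simple graph silently collapses repeated edges into one.
simple_graph = nx.Graph()
simple_graph.add_edges_from(E)

# Number of parallel edges from Milan to Dublin in the multigraph.
parallel = directed_multi_graph.number_of_edges('Milan', 'Dublin')
print(f"Parallel Milan -> Dublin edges: {parallel}")
print(f"Multigraph size: {directed_multi_graph.number_of_edges()}")
print(f"Simple graph size: {simple_graph.number_of_edges()}")
```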
Weighted graphs
We will now introduce directed, undirected, and multi-weighted graphs.
An edge-weighted graph (or simply, a weighted graph) G is defined as G=(V, E, w), where
V is a set of nodes, E is a set of edges, and w: E → ℝ is the weight function that assigns
to each edge e ∈ E a weight expressed as a real number.
A node-weighted graph G is defined as G=(V, E, w), where V is a set of nodes, E is a set of
edges, and w: V → ℝ is the weight function that assigns to each node v ∈ V a weight
expressed as a real number.
Please keep the following points in mind:
G = nx.DiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin', 19), ('Paris','Milan', 8),
('Paris','Dublin', 11), ('Milan','Rome', 5)]
G.add_nodes_from(V)
G.add_weighted_edges_from(E)
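Once the weights are stored, they can be read back from the `weight` edge attribute, which is the attribute name `add_weighted_edges_from` uses by default. A short sketch (the variable names are illustrative):

```python
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([('Milan', 'Dublin', 19), ('Paris', 'Milan', 8),
                           ('Paris', 'Dublin', 11), ('Milan', 'Rome', 5)])

# Iterate over edges together with their weight.
for source, target, weight in G.edges(data='weight'):
    print(f"{source} -> {target}: {weight}")

# Look up a single edge weight.
milan_dublin = G['Milan']['Dublin']['weight']

# Weighted outdegree: sum of the weights of the outgoing edges.
milan_weighted_out = G.out_degree('Milan', weight='weight')
```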
Bipartite graphs
We will now introduce another type of graph that will be used in this section: multipartite
graphs. Bipartite and tripartite graphs (and, more generally, k-partite graphs) are graphs
whose vertices can be partitioned into two, three, or, more generally, k sets of nodes, respectively.
Edges are only allowed across different sets and are not allowed between nodes belonging
to the same set. In most cases, nodes belonging to different sets are also characterized by
particular node types. In Chapter 7, Text Analytics and Natural Language Processing Using
Graphs, and Chapter 8, Graph Analysis for Credit Card Transactions, we will deal with
some practical examples of graph-based applications and you will see how multipartite
graphs can indeed arise in several contexts, for example, in the following scenarios:
A bipartite graph can be easily created in networkx with the following code:
import networkx as nx
import pandas as pd
import numpy as np

n_nodes = 10
n_edges = 12
# Even-numbered nodes form the bottom set, odd-numbered nodes the top set
bottom_nodes = [ith for ith in range(n_nodes) if ith % 2 == 0]
top_nodes = [ith for ith in range(n_nodes) if ith % 2 == 1]
# Each random edge joins one bottom node to one top node
iter_edges = zip(
    np.random.choice(bottom_nodes, n_edges),
    np.random.choice(top_nodes, n_edges))
edges = pd.DataFrame([
    {"source": a, "target": b} for a, b in iter_edges])
B = nx.Graph()
# The bipartite attribute records which set each node belongs to
B.add_nodes_from(bottom_nodes, bipartite=0)
B.add_nodes_from(top_nodes, bipartite=1)
B.add_edges_from([tuple(x) for x in edges.values])
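A quick way to confirm that a graph built this way respects the bipartite constraint is sketched below. It assumes the `is_bipartite` helper from `networkx.algorithms.bipartite` and a fixed random seed (the seed is an assumption added for reproducibility, not part of the original snippet):

```python
import networkx as nx
import numpy as np
from networkx.algorithms import bipartite

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility
bottom_nodes = [i for i in range(10) if i % 2 == 0]
top_nodes = [i for i in range(10) if i % 2 == 1]
pairs = zip(rng.choice(bottom_nodes, 12), rng.choice(top_nodes, 12))

B = nx.Graph()
B.add_nodes_from(bottom_nodes, bipartite=0)
B.add_nodes_from(top_nodes, bipartite=1)
B.add_edges_from((int(a), int(b)) for a, b in pairs)

# Structural check: the graph admits a two-set partition.
is_two_colorable = bipartite.is_bipartite(B)
# Attribute check: every edge joins nodes from different sets.
crosses_sets = all(
    B.nodes[u]['bipartite'] != B.nodes[v]['bipartite'] for u, v in B.edges)
print(is_two_colorable, crosses_sets)
```

Because the edges are sampled with replacement, repeated pairs collapse into a single edge, so the graph may end up with fewer than 12 edges.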
Graph representations
As described in the previous sections, with networkx, we can actually define and
manipulate a graph by using node and edge objects. In different use cases, such a
representation would not be as easy to handle. In this section, we will show two ways to
perform a compact representation of a graph data structure, namely, an adjacency matrix
and an edge list.
Adjacency matrix
The adjacency matrix M of a graph G=(V,E) is a square (|V| × |V|) matrix such that
its element M_ij is 1 when there is an edge from node i to node j, and 0 when there is no
edge. In the following screenshot, we show a simple example where the adjacency matrix
of different types of graphs is displayed:
Figure 1.6 – Adjacency matrix for an undirected graph, a digraph, a multigraph, and a weighted graph
It is easy to see that adjacency matrices for undirected graphs are always symmetric,
since no direction is defined for the edge. Symmetry is instead not guaranteed for the
adjacency matrix of a digraph due to the presence of constraints in the direction of the
edges. For a multigraph, we can instead have values greater than 1 since multiple edges
can be used to connect the same couple of nodes. For a weighted graph, the value in a
specific cell is equal to the weight of the edge connecting the two nodes.
In networkx, the adjacency matrix for a given graph can be computed in two different
ways. If G is the networkx graph of Figure 1.6, we can compute its adjacency matrix as follows:
For the first and second line, we get the following results respectively:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[1. 1. 0. 0.]
[0. 1. 1. 0.]]
Since a numpy matrix cannot represent the names of the nodes, the order of the elements in
the adjacency matrix is the one defined in the G.nodes list.
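The commands themselves are not reproduced above. A minimal sketch of two ways to obtain the matrix, assuming the digraph from the Digraphs section: `nx.to_pandas_adjacency` returns a labeled DataFrame, while `nx.to_numpy_array` returns a plain array. The explicit `nodelist` is an assumption, chosen so that the row order matches the printed matrix:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([('Milan', 'Dublin'), ('Paris', 'Milan'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Fix the row/column order explicitly instead of relying on G.nodes.
nodelist = ['Rome', 'Dublin', 'Milan', 'Paris']

adjacency_df = nx.to_pandas_adjacency(G, nodelist=nodelist)  # keeps node names
adjacency_array = nx.to_numpy_array(G, nodelist=nodelist)    # positions only

print(adjacency_df)
print(adjacency_array)
```

Passing `nodelist` makes the mapping between matrix positions and node names explicit, which is useful when the array form is handed to code that knows nothing about the graph.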
Edge list
As well as an adjacency matrix, an edge list is another compact way to represent graphs.
The idea behind this format is to represent a graph as a list of edges.
The edge list L of a graph G=(V,E) is a list of size |E| such that its element L_i is a
couple representing the tail (start) and the head (end) node of the edge i. An example of
the edge list for each type of graph is available in the following screenshot:
Figure 1.7 – Edge list for an undirected graph, a digraph, a multigraph, and a weighted graph
In the following code snippet, we show how to compute in networkx the edge list of the
simple undirected graph G available in Figure 1.7:
print(nx.to_pandas_edgelist(G))
source target
0 Milan Dublin
1 Milan Rome
2 Paris Milan
3 Paris Dublin
Other representation methods, which we will not discuss in detail, are also available in
networkx. Some examples are nx.to_dict_of_dicts(G) and nx.to_numpy_array(G), among others.
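As a brief illustration of how these representations relate, the sketch below converts the undirected example graph to an edge list and back, and also shows the dictionary form. The use of `nx.from_pandas_edgelist` for the return trip is an addition for illustration and is not discussed in the text:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Graph -> edge list (DataFrame with 'source' and 'target' columns).
edge_list = nx.to_pandas_edgelist(G)

# Edge list -> graph, using the default column names.
G_restored = nx.from_pandas_edgelist(edge_list)

# Nested dictionary: node -> neighbor -> edge attributes.
as_dict = nx.to_dict_of_dicts(G)
print(edge_list)
print(as_dict['Rome'])
```

Isolated nodes have no row in an edge list, so they would be lost in such a round trip; the example graph has none.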
Plotting graphs
As we have seen in previous sections, graphs are intuitive data structures represented
graphically. Nodes can be plotted as simple circles, while edges are lines connecting two
nodes.
Despite their simplicity, it could be quite difficult to make a clear representation when the
number of edges and nodes increases. The source of this complexity is mainly related to
the position (space/Cartesian coordinates) to assign to each node in the final plot. Indeed,
it could be unfeasible to manually assign the specific position of each node in the final
plot for a graph with hundreds of nodes.
In this section, we will see how we can plot graphs without specifying coordinates for each
node. We will exploit two different solutions: networkx and Gephi.
networkx
networkx offers a simple interface to plot graph objects through the nx.draw function. In
the following code snippet, we show how to use it in order to plot graphs:
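The snippet itself is not reproduced here. The following is a minimal sketch of what a `draw_graph` helper with this signature could look like, based on the options described in the surrounding text. The concrete values (`font_size=15`, `node_size=400`, `edge_color='gray'`, `arrowsize=30`), the edge-label handling, and passing `arrowsize` only for digraphs are all assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import networkx as nx


def draw_graph(G, nodes_position, weight):
    """Plot G with nodes at the given positions; label edges if weight is True."""
    options = dict(with_labels=True, font_size=15,
                   node_size=400, edge_color='gray')
    if G.is_directed():
        # Arrow size is only meaningful for directed edges.
        options['arrowsize'] = 30
    nx.draw(G, nodes_position, **options)
    if weight:
        # Write each edge's 'weight' attribute next to the edge.
        edge_labels = nx.get_edge_attributes(G, 'weight')
        nx.draw_networkx_edge_labels(G, nodes_position,
                                     edge_labels=edge_labels)
```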
Here, nodes_position is a dictionary where the keys are the nodes and the value
assigned to each key is an array of length 2, with the Cartesian coordinates used for
plotting the specific node.
The nx.draw function will plot the whole graph by putting its nodes in the given
positions. The with_labels option will plot each node's name on top of it with the
specific font_size value. node_size and edge_color will respectively specify the
size of the circle representing the node and the color of the edges. Finally, arrowsize
will define the size of the arrow for directed edges. This option will be used when the
graph to be plotted is a digraph.
In the following code example, we show how to use the draw_graph function previously
defined in order to plot a graph:
G = nx.Graph()
V = {'Paris', 'Dublin','Milan', 'Rome'}
E = [('Paris','Dublin', 11), ('Paris','Milan', 8),
('Milan','Rome', 5), ('Milan','Dublin', 19)]
G.add_nodes_from(V)
G.add_weighted_edges_from(E)
node_position = {"Paris": [0,0], "Dublin": [0,1], "Milan":
[1,0], "Rome": [1,1]}
draw_graph(G, node_position, True)
The method previously described is simple but unfeasible to use in a real scenario since
the node_position value could be difficult to decide. In order to solve this issue,
networkx offers different functions to automatically compute the position of each node
according to different layouts. In Figure 1.9, we show a series of plots of an undirected
graph, obtained using the different layouts available in networkx. In order to use them
in the function we proposed, we simply need to assign node_position to the result
of the layout we want to use, for example, node_position = nx.circular_layout(G).
The plots can be seen in the following screenshot:
Figure 1.9 – Plots of the same undirected graph with different layouts
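A small sketch of how such layouts can be computed. The four layout functions used here exist in networkx, but which layouts appear in the figure is not stated, so the selection is an assumption:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([('Milan', 'Dublin'), ('Milan', 'Paris'),
                  ('Paris', 'Dublin'), ('Milan', 'Rome')])

# Each layout function returns a dict: node -> array of 2 coordinates.
layouts = {
    "circular": nx.circular_layout,
    "shell": nx.shell_layout,
    "random": nx.random_layout,
    "spring": nx.spring_layout,
}
positions = {name: layout(G) for name, layout in layouts.items()}

for name, node_position in positions.items():
    print(name, {v: [round(float(c), 2) for c in xy]
                 for v, xy in node_position.items()})
```

Any of the resulting dictionaries can be passed wherever a node position dictionary is expected, including as the second argument of `nx.draw`.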
networkx is a great tool for easily manipulating and analyzing graphs, but it does
not offer good functionalities in order to perform complex and good-looking plots of
graphs. In the next section, we will investigate another tool to perform complex graph
visualization: Gephi.