Introduction to Graph Mining Algorithms

Basic Definitions from Graph Theory

Andrea Marino

PhD Course on Graph Mining Algorithms,


Università di Pisa

Pisa, February 2018

Andrea Marino Graph Mining Algorithms


Today
1 Basic definitions of graph theory

Degree of Separation and Small World


2 Computing the diameter in huge graphs
3 Computing the distance distribution in huge graphs easily.
4 Sketches and probabilistic counting: distance distribution
and other applications.

Node Properties
5 Clustering Coefficient: counting the number of triangles.
6 Centrality: computing closeness and betweenness.

Grouping Nodes
7 Clustering: overview, algorithms and spectral clustering.
8 Finding Patterns in graphs with applications to community
detection: listing cliques.

Social Networks

Our life is full of binary relations of very different types.


Some examples are:
communications (like talking at the phone or sending a
message on WhatsApp),
collaborations (like co-authoring a paper or co-playing in a
movie),
memberships (like being the member of the same university
department),
evaluations (like “liking” a picture on Instagram),
dependencies (like citing a paper or being a child), and
transfers (like lending or borrowing money).
These relationships induce connections between the two
agents involved.
These connections are the edges of a graph whose nodes are
the agents themselves.

From Social Relations to Graphs

We can make use of the many mathematical and computer
science tools that have been developed in the field of graph
theory.
We can analyze mathematical properties of a graph.
We can design, analyze, and implement efficient algorithms
computing these properties.

The ultimate goal

Deriving conclusions that are relevant from a social science point of view,
such as the identification of important agents or
groups of agents.


The social network whose agents are 16 Florentine families of the
15th century. Two families are related if one member of one family
has been married to one member of the other family.
Can you guess which node corresponds to the Medici family?

Part I

The graph of Les Misérables

The Story

Les Misérables is one of the greatest French historical novels,
written by Victor Hugo and first published in 1862.
Set in early 19th-century France, it is mostly the story of
ex-convict Jean Valjean (and of his quest for redemption),
who is relentlessly tracked down by a police inspector named
Javert.
Along the way, Valjean and a slew of characters are swept into
a revolutionary period in France, where a group of young
idealists make their last stand at a street barricade.
Many characters appear and disappear throughout the
more than one thousand pages of the book.


Two characters are related if they both appear in at least one
chapter.
The corresponding graph has 77 nodes and 254 edges.

There is a very central node, that is, the node labeled 12. Not
surprisingly, this node corresponds to Jean Valjean.

Do not forget, do not ever forget, that you have promised
me to use the money to make yourself an honest man.

The node 1 plays a special role: it is a connection point
between a subset of nodes and the rest.
This node corresponds to a key character, Bishop Myriel,
who, instead of punishing Valjean for robbing him, sees an opportunity to
change Valjean’s life by letting him go.

Similarly, the two nodes labeled 24 and 17 connect a subset
of nodes to the rest. These two are Fantine and Tholomyes,
the parents of Cosette, who will be adopted by Jean Valjean when Fantine dies.
Fantine is a key character of the book: a “miserable” human
being who is punished all out of proportion for having a love
affair without being married.

What else can we learn from the graph representation of Victor Hugo's
book?

The figure does not help us too much, since the center of the
figure itself seems just a mess of nodes and edges.
It is now time to take advantage of the notions and the
algorithms developed in the field of graph theory.

Part II

Graph theory basic notions

Edge Direction

Undirected: Whenever two characters of Les Misérables appear in
a chapter, a symmetric relation is created between
them:
if character x appears in the same chapter as y,
then y also appears in the same chapter as x.
We denote such an edge as the set {x, y}.
Directed: In other cases, the relation
represented by the graph might not be symmetric:
for instance, if a paper x cites another paper y,
most likely the paper y does not cite the paper
x, since y was published earlier.
We denote such an edge as the pair (x, y).
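
The two notations can be mirrored directly in code. The following is only a minimal sketch, assuming (this is not in the slides) that nodes are plain strings: a frozenset models the set {x, y}, a tuple models the pair (x, y).

```python
# Undirected edge {x, y}: order does not matter, so a frozenset is a natural model.
undirected_a = frozenset({"Valjean", "Javert"})
undirected_b = frozenset({"Javert", "Valjean"})

# Directed edge (x, y): order matters, so a tuple is a natural model.
cites = ("paper_x", "paper_y")      # x cites y
reverse = ("paper_y", "paper_x")    # y cites x (a different edge)

same_undirected = undirected_a == undirected_b   # True
same_directed = cites == reverse                 # False
```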

The business social network whose agents are 16 Florentine
families of the 15th century.
The directed relations are business ties (specifically, recorded
financial ties such as loans, credits and joint partnerships).

Density

The graph corresponding to the social network of the characters of
Les Misérables has 77 nodes and 254 edges.

Is this a “dense” graph? That is, are the nodes connected (almost)
as much as they could be?

Let’s calculate the maximum number of edges a graph of n
nodes can have.
If the graph is directed, then any node can have at most n − 1
edges towards the other nodes of the graph: hence, the
maximum number of directed edges is n(n − 1).
If the graph is undirected, then this number has to be divided
by two, giving n(n − 1)/2.
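
The counting argument above can be sketched as a small function. The name `max_edges` is an illustrative choice, and the sketch assumes a simple graph (no loops, no repeated edges).

```python
def max_edges(n: int, directed: bool) -> int:
    """Maximum number of edges in a simple graph with n nodes (no loops)."""
    total = n * (n - 1)          # each node can point to the n - 1 others
    return total if directed else total // 2

print(max_edges(77, directed=True))   # 5852
print(max_edges(77, directed=False))  # 2926
```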

When a graph has all the possible edges, it is called a complete
graph or clique.

An undirected complete graph or clique with 5 nodes. The number
of edges is equal to $\frac{5 \cdot 4}{2} = 10$.


Definition
The density of a graph with n nodes is the number of edges in the
graph divided by the number of edges in the clique of n nodes.

In the case of the graph of Les Misérables, the density is

$$\frac{254}{(77 \cdot 76)/2} = \frac{254}{5852/2} = \frac{254}{2926} \approx 0.09.$$

This is close to 0, so we can conclude that this graph is
not dense, that is, it is sparse.
This is the most frequent situation arising when dealing with
real-world social networks.
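
As a quick check of the arithmetic, here is a minimal sketch of the density definition. The function name and signature are illustrative, and it assumes a simple graph with at least two nodes.

```python
def density(n_nodes: int, n_edges: int, directed: bool = False) -> float:
    """Edges present divided by edges in the clique on n_nodes nodes."""
    max_edges = n_nodes * (n_nodes - 1)
    if not directed:
        max_edges //= 2
    return n_edges / max_edges

print(round(density(77, 254), 2))  # 0.09
```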

Node labels

Labeled: In the graph corresponding to Les Misérables, the
names of the characters are associated with the
corresponding nodes. This information is called the
label of the node.
Unlabeled: If we forget about these names and no additional
information is associated with the nodes, the graph is unlabeled.


Business Social Network for 16 Florentine families of the
15th century

Each node is associated with a label, namely the name of the family.

Edge weights

Unweighted: In the graph corresponding to Les Misérables, no
additional information is associated with the edges.
Weighted: However, we could also have considered the number
of times two characters appear in the same chapter
and associated this integer with the
corresponding edge. This information is called the
weight of the edge.
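
One possible way to obtain such weights is to count co-appearances chapter by chapter. The sketch below uses made-up toy data (the `chapters` list is hypothetical, not taken from the novel) and stores each undirected edge as an alphabetically sorted pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: the characters appearing in each chapter.
chapters = [
    ["Valjean", "Myriel"],
    ["Valjean", "Javert", "Fantine"],
    ["Valjean", "Javert"],
]

weights = Counter()
for cast in chapters:
    # Every unordered pair of distinct characters in a chapter co-appears once.
    for x, y in combinations(sorted(set(cast)), 2):
        weights[(x, y)] += 1

print(weights[("Javert", "Valjean")])  # 2
```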

Degree

Definition
The degree of a node is the number of edges incident to it, that
is, the number of its neighbors.

In the case of the graph corresponding to Les Misérables, we
can assume that the most important character corresponds to
the node of highest degree.
This node is the one associated with Jean Valjean, who is the
main character of the book: the degree of this node (the one
labeled 12) is 32.
The second most popular character is Gavroche (node labeled
49), a boy who lives on the streets of Paris and plays a short
yet significant role.


In a clique, all nodes have degree n − 1.
In a star, that is, a node x connected to all the other nodes
with no other edges, x has degree n − 1 and all the other
nodes have degree 1.
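
These two cases can be checked with a short sketch. It assumes the graph is given as adjacency lists (a dictionary from each node to its neighbours) and is simple, so that the degree is just the length of the list; the helper name `degrees` is illustrative.

```python
def degrees(adj):
    """Degree of every node, given adjacency lists of a simple undirected graph."""
    return {x: len(neighbours) for x, neighbours in adj.items()}

n = 5
# Clique: every node is adjacent to all the others.
clique = {x: [y for y in range(n) if y != x] for x in range(n)}
# Star: node 0 is the centre, adjacent to all the other nodes.
star = {0: list(range(1, n)), **{x: [0] for x in range(1, n)}}

print(degrees(clique))  # {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}
print(degrees(star))    # {0: 4, 1: 1, 2: 1, 3: 1, 4: 1}
```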

In the case of directed graphs, we distinguish between
the indegree of a node (that is, the number of edges
“entering” the node) and
the outdegree of a node (that is, the number of edges exiting
the node).
Sometimes, it might be convenient to “symmetrize” the edges
by making them undirected and to consider the degree in the
resulting undirected graph.
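
A minimal sketch of these three quantities on a made-up directed edge list (the edges below are hypothetical). Symmetrizing here means forgetting directions and keeping a single undirected edge per pair of nodes.

```python
from collections import Counter

# Hypothetical directed edges (x, y), read as "x points to y".
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

outdegree = Counter(x for x, _ in edges)   # edges exiting each node
indegree = Counter(y for _, y in edges)    # edges entering each node

# Symmetrize: forget directions and keep one undirected edge per pair.
undirected = {frozenset(e) for e in edges}
degree = Counter(x for e in undirected for x in e)

print(outdegree["a"], indegree["a"], degree["a"])  # 2 1 2
```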

Power Law

In social networks, it seems that
the degree distribution f(d), i.e.
the frequency of the degree d,
follows a power law:

$$f(d) \propto \frac{1}{d^{\gamma}},$$

for some constant $\gamma$, typically
$1 \le \gamma \le 3$.
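
The empirical degree distribution f(d) can be computed as sketched below, assuming adjacency lists as input. The sketch only tabulates frequencies; deciding whether they follow a power law would require a separate fitting step that is not shown here.

```python
from collections import Counter

def degree_distribution(adj):
    """Map each degree d to the fraction f(d) of nodes having that degree."""
    counts = Counter(len(neighbours) for neighbours in adj.values())
    n = len(adj)
    return {d: c / n for d, c in sorted(counts.items())}

# Hypothetical toy graph: a star with centre 0 and four leaves.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(degree_distribution(adj))  # {1: 0.8, 4: 0.2}
```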

Simple graphs

In the graph of Les Misérables we have not included an edge
between a node and itself (even though every character trivially appears in
the same chapters as itself).
Edges of this kind (of the form {x, x} or (x, x)) are called
loops.
A graph can have several edges between the same two nodes.
We call these multiple connections multiedges.

Definition
If a graph has no loop and has no multiedge, it is said to be simple.
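
The definition translates into a simple check over an edge list. This is a sketch under the assumption that edges are given as pairs (x, y); the function name `is_simple` is illustrative.

```python
def is_simple(edges, directed=False):
    """True if the edge list contains no loop and no multiedge."""
    seen = set()
    for x, y in edges:
        if x == y:                      # loop {x, x} or (x, x)
            return False
        key = (x, y) if directed else frozenset((x, y))
        if key in seen:                 # same pair already present: multiedge
            return False
        seen.add(key)
    return True

print(is_simple([(1, 2), (2, 3)]))        # True
print(is_simple([(1, 2), (2, 1)]))        # False (undirected multiedge)
print(is_simple([(1, 2), (2, 1)], True))  # True (two distinct directed edges)
print(is_simple([(1, 1)]))                # False (loop)
```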

Paths

If a node x is connected to a node y and y is connected to
another node z, then we say that there is a path from x to z.
A path can be formed by more than two edges: for example,
in the graph of Les Misérables there is a path from node
1 to node 23 passing through nodes 12 and 24.


The number of edges in a path is called the length of the path.
In many cases, we are interested in the path with the
minimum length, which is called a shortest path. The length
of a shortest path from a node x to a node y is called the
distance from x to y and is denoted by d(x, y).
If there is no path from x to y, then we set d(x, y) = ∞.
If the graph is weighted, then the length of the path is equal
to the sum of the weights of the edges in the path itself.
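
The two notions of length can be sketched as follows. The path is given as a list of nodes, and the weights are hypothetical values chosen only for illustration; storing them under frozenset keys is an assumption of this sketch.

```python
def path_length(path, weight=None):
    """Length of a path given as a list of nodes.

    If weight is None the graph is treated as unweighted and the length is
    the number of edges; otherwise weight maps frozenset({x, y}) to a number.
    """
    steps = list(zip(path, path[1:]))     # consecutive pairs are the edges
    if weight is None:
        return len(steps)
    return sum(weight[frozenset(step)] for step in steps)

# Hypothetical weights on the path 1 - 12 - 24 - 23.
w = {frozenset((1, 12)): 5, frozenset((12, 24)): 9, frozenset((24, 23)): 2}
print(path_length([1, 12, 24, 23]))      # 3
print(path_length([1, 12, 24, 23], w))   # 16
```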

Small World

In social networks, the diameter, that is, the maximum distance
among all pairs of nodes of the graph, and the average distance are low
compared to the size of the network.

The intermediate nodes on a path are called degrees of separation.
The Milgram experiment was an attempt to upper bound the degrees of separation in
real life.
In the case of the graph of Les Misérables the diameter is 5
and the average distance is 2.4.
In the case of Facebook in 2011 (721.1M nodes and 68.7G
edges) the diameter was 41 and the average distance
was 4.74.
Inspiration for many games:
connecting a given actor to Kevin Bacon in the IMDb network;
going from one Wikipedia page to another in a few hops.


Connectivity

Definition
If, for any pair of nodes, there exists a path between them,
then the graph is said to be connected.
For directed graphs, a graph is strongly connected if, for any
pair of nodes x and y, there exists a path from x to y and
vice versa.
If a directed graph is not strongly connected, but the graph obtained by
removing the directions of the edges is connected, then
we say that the graph is weakly connected.

For example, the graph of Les Misérables is connected.
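
A hedged sketch of how these definitions could be tested on a small directed graph. It relies on a standard observation: a directed graph is strongly connected exactly when a fixed node s reaches every node and every node reaches s, and the latter is checked by a search on the reversed edges. The function names are illustrative and the node collection is assumed to be non-empty.

```python
from collections import deque

def reachable(adj, s):
    """Set of nodes reachable from s by a breadth-first search."""
    seen, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        for u in adj.get(v, []):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen

def connectivity(nodes, edges):
    """Classify a directed graph as 'strong', 'weak' or 'disconnected'."""
    fwd, bwd, und = {}, {}, {}
    for x, y in edges:
        fwd.setdefault(x, []).append(y)      # original direction
        bwd.setdefault(y, []).append(x)      # reversed direction
        und.setdefault(x, []).append(y)      # directions removed
        und.setdefault(y, []).append(x)
    s, everything = next(iter(nodes)), set(nodes)
    # s reaches every node and every node reaches s: strongly connected.
    if reachable(fwd, s) == everything and reachable(bwd, s) == everything:
        return "strong"
    return "weak" if reachable(und, s) == everything else "disconnected"

print(connectivity([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))  # strong
print(connectivity([1, 2, 3], [(1, 2), (2, 3)]))          # weak
print(connectivity([1, 2, 3], [(1, 2)]))                  # disconnected
```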

A directed graph which is not strongly connected.

The giant component

Definition
A giant component is a (strongly) connected component of a given
(directed) graph that contains a constant fraction of the entire
graph’s nodes.

It is very common for a social network to have
a giant component.
The connectivity property by itself does not really identify
interesting communities of the social network.
A single bridge is enough to merge two components into one.


Connected Components in the Web

In the case of the directed graph of the web, in which nodes are
web pages and edges are hyperlinks connecting one page to
another, the bowtie phenomenon has been observed.
A detailed analysis is given in
Andrei Z. Broder, Ravi Kumar, Farzin Maghoul,
Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata,
Andrew Tomkins, Janet L. Wiener: Graph structure in the
Web. Computer Networks 33(1-6): 309-320 (2000).


The bowtie structure of the web
High-level view of the web structure, based on its connectivity
properties and how its strongly connected components fit together.

This study has been replicated on other, larger snapshots of the
web, and it is still a useful way of thinking about giant directed
graphs in the context of the web and more generally.

Part III

Representing a graph

Adjacency matrices

The adjacency matrix A of a graph is a two-dimensional table
with as many rows as the number of nodes and as many
columns as the number of nodes.
The element of the table at row x and column y (denoted by
A[x][y]) is 1 if there exists an edge connecting x to y,
otherwise it is 0.
If the graph is undirected, the adjacency matrix is
symmetric, in the sense that A[x][y] = A[y][x].
The advantage of this representation is that it allows us to
determine very quickly whether two nodes are neighbours.
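
A minimal sketch of this representation, assuming nodes are numbered 0..n − 1 and edges are given as pairs; the toy graph is made up for illustration.

```python
def adjacency_matrix(n, edges, directed=False):
    """n x n table with A[x][y] = 1 if there is an edge from x to y, else 0."""
    A = [[0] * n for _ in range(n)]
    for x, y in edges:
        A[x][y] = 1
        if not directed:
            A[y][x] = 1      # undirected graphs give a symmetric matrix
    return A

# Hypothetical undirected graph on nodes 0..3.
A = adjacency_matrix(4, [(0, 1), (0, 2), (2, 3)])
print(A[0][2] == 1)   # True: one table lookup answers "are 0 and 2 neighbours?"
print(A[1][3] == 1)   # False
```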

An adjacency matrix requires space quadratic in the number of
nodes, independently of the number of edges in the graph.
In the case of the graph of Les Misérables, even though we have
only 254 edges, the adjacency matrix requires 77 · 77 = 5929 elements
(or (77 · 76)/2 = 2926 if we use the symmetry of the graph and
represent only the upper right part of the matrix, diagonal excluded).


The adjacency matrix of the Florentine families social network,
where two families are related if one member of one family has
been married to one member of the other family.

Adjacency lists

The adjacency list representation associates with each node x a
list of its neighbors, that is, a list of all nodes y such that
there exists an edge from x to y.
By using an adjacency list representation, the amount of used
space is proportional to the total number of edges in the
graph (plus one list header per node).
For example, in the case of the graph of Les Misérables, the used
space is proportional to 254, the number of edges: each undirected
edge appears in the lists of both its endpoints.
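
A minimal sketch of building adjacency lists from an edge list, using a dictionary from each node to a Python list (one of several possible layouts). The toy graph is hypothetical.

```python
def adjacency_lists(nodes, edges, directed=False):
    """Map every node to the list of its neighbours."""
    adj = {x: [] for x in nodes}
    for x, y in edges:
        adj[x].append(y)
        if not directed:
            adj[y].append(x)   # an undirected edge appears in both lists
    return adj

# Hypothetical undirected graph with 4 nodes and 3 edges.
adj = adjacency_lists(range(4), [(0, 1), (0, 2), (2, 3)])
print(adj)                                # {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(sum(len(l) for l in adj.values()))  # 6, twice the number of edges
```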

One disadvantage of the adjacency list representation is
that deciding whether a node x is a neighbour of y takes
time proportional to the degree of y.
In the case of social networks this cost is quite negligible,
since the average degree of a social network is usually very
small compared to the number of nodes.
For instance, in the case of the graph corresponding to Les
Misérables the average degree is equal to $\frac{2 \cdot 254}{77} \approx 6.6$.
We will always use adjacency lists.


The adjacency lists of the Florentine families social network,
where two families are related if one member of one family
has been married to one member of the other family.
The blocks on the left of the arrows contain a node and the
size of its adjacency list, while the blocks on the right of the
arrows contain the list of neighbors of the node.

Part IV

Traversing a Graph

Breadth-first search from a source node

The key idea of the breadth-first search is to mark each node which
has already been visited. Nodes that are visited for the first time
are put into a queue, which is a first-in first-out data structure.
At the beginning the source node is marked and inserted into
the queue.
While the queue is not empty, we “serve” the first node in the
queue, that is:
we remove it from the queue and examine all its neighbors and,
for each of them, if it is not marked, we set its state to marked
and we insert it into the queue.


    from collections import deque

    def bfs(adj, s):
        # adj maps each node to the list of its neighbours
        queue = deque([s])
        marked = {s}
        while queue:
            v = queue.popleft()  # serve the first node in the queue (FIFO)
            for u in adj[v]:
                if u not in marked:
                    marked.add(u)
                    queue.append(u)
        return marked


If each time a node is inserted into
the queue, we also associate to this
node its parent, which is the node
whose neighborhood exploration
led us to insert the node in the
queue, we obtain the breadth-first
search tree.
One property of this tree is that
the nodes at level i are all the
nodes which are at distance i from
the initial node.

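    from collections import deque

    def bfs_tree(adj, s):
        # adj maps each node to the list of its neighbours
        queue = deque([s])
        marked = {s}
        dist = {s: 0}   # dist[u] is the distance from s to u
        pred = {s: s}   # pred[u] is the parent of u in the BFS tree
        while queue:
            v = queue.popleft()  # serve the first node in the queue (FIFO)
            for u in adj[v]:
                if u not in marked:
                    marked.add(u)
                    pred[u] = v
                    dist[u] = dist[v] + 1
                    queue.append(u)
        return dist, pred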
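
Once the parents are stored, a shortest path from the source to any reached node can be read off the tree by climbing from the node back to the source. The sketch below assumes the same convention as above, namely that the source is its own parent; the helper name and the `pred` values are hypothetical.

```python
def path_from_source(pred, s, t):
    """Shortest path from s to t, read off the BFS parents stored in pred."""
    if t not in pred:
        return None              # t was never reached from s
    path = [t]
    while path[-1] != s:
        path.append(pred[path[-1]])   # climb one level up the BFS tree
    path.reverse()
    return path

# Hypothetical parents produced by a BFS from s = "a".
pred = {"a": "a", "b": "a", "c": "a", "d": "c"}
print(path_from_source(pred, "a", "d"))  # ['a', 'c', 'd']
print(path_from_source(pred, "a", "z"))  # None
```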

The BFS requires time proportional to the number of nodes and edges, since
each edge is examined at most twice (once from each of its endpoints).

Applications:
By executing a BFS from every node of the graph, we can
compute both the diameter of the graph, that is, the maximum
distance between two nodes of the graph, and the average
distance of the graph.
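
A minimal sketch of this application, assuming a connected undirected graph with at least two nodes, given as adjacency lists. It runs one BFS per node, so its cost grows with the number of nodes times the size of the graph, which is only practical for small graphs.

```python
from collections import deque

def distances_from(adj, s):
    """BFS distances from s to every node reachable from s."""
    dist, queue = {s: 0}, deque([s])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def diameter_and_average_distance(adj):
    """One BFS per node; assumes a connected graph with at least two nodes."""
    largest, total, pairs = 0, 0, 0
    for s in adj:
        for t, d in distances_from(adj, s).items():
            if t != s:
                largest = max(largest, d)
                total += d
                pairs += 1
    return largest, total / pairs

# Hypothetical path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
diam, avg = diameter_and_average_distance(adj)
print(diam, round(avg, 2))  # 3 1.67
```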

Applications to Connected Components in Undirected
Graphs

The BFS also allows us to determine whether the graph is
connected.
It is connected if and only if all nodes of the graph are
included in the BFS tree.
This is the case, for example, of the graph of Les Misérables.
If the graph is not connected, the breadth-first search tree
contains a connected component of the graph.
In order to find the other connected components we can
simply execute the BFS again, starting from any node
which has not been reached yet, until all nodes have been included
in a connected component.
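
The restart procedure described above can be sketched as follows, assuming an undirected graph given as adjacency lists; the toy graph is made up for illustration.

```python
from collections import deque

def connected_components(adj):
    """List of connected components (as sets) of an undirected graph."""
    seen, components = set(), []
    for s in adj:
        if s in seen:
            continue                 # s already belongs to a component
        component, queue = {s}, deque([s])
        while queue:                 # BFS restarted from an unreached node
            v = queue.popleft()
            for u in adj[v]:
                if u not in component:
                    component.add(u)
                    queue.append(u)
        seen |= component
        components.append(component)
    return components

# Hypothetical graph with two components and one isolated node.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
parts = connected_components(adj)
print(len(parts))        # 3
print(len(parts) == 1)   # False: the graph is not connected
```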

Depth-first search from a source node

The depth-first search is similar to the breadth-first search.
Whenever a node x is reached for the first time and
marked as visited, the depth-first search recursively invokes
itself with source node x.
Alternatively, the same procedure as before can be used, with a stack instead of a queue.
The time complexity is the same.
The depth-first search also generates a tree, which is called a
depth-first search tree.
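
A minimal sketch of the recursive formulation, assuming adjacency lists as input. It records the parent of each reached node, which defines the depth-first search tree; the source is stored as its own parent, by analogy with the BFS convention. On very deep graphs Python's recursion limit may be hit, so this is meant for small examples only.

```python
def dfs(adj, x, marked=None, parent=None):
    """Recursive depth-first search; returns the parent of every reached node."""
    if marked is None:
        marked, parent = set(), {x: x}   # x is the source: its own parent
    marked.add(x)
    for u in adj[x]:
        if u not in marked:
            parent[u] = x                # edge (x, u) belongs to the DFS tree
            dfs(adj, u, marked, parent)  # recurse with u as the new source
    return parent

# Hypothetical undirected graph.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(dfs(adj, 0))  # {0: 0, 1: 0, 2: 1, 3: 2}
```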

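    def dfs(adj, s):
        # adj maps each node to the list of its neighbours
        stack = [s]
        marked = set()
        order = []
        while stack:
            v = stack.pop()  # take the most recently pushed node (LIFO)
            if v in marked:
                continue
            marked.add(v)    # mark on pop, so the visit order is depth-first
            order.append(v)
            for u in adj[v]:
                if u not in marked:
                    stack.append(u)
        return order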


Thanks

Parts of these slides are based on a chapter written by Pierluigi
Crescenzi for his course “Algorithms for Graph Mining”.
