Tutorial
1. Introduction to NetworkX
Introduction: networks are everywhere…
+ Clear syntax
+ Multiple programming paradigms
+ Dynamic typing
+ Strong on-line community
+ Rich documentation
+ Numerous libraries
+ Expressive features
+ Fast prototyping
- Can be slow
- Beware when you are analysing very large networks
Introduction: Python’s Holy Trinity
SciPy: Python’s primary library for mathematical and statistical computing. Primary data type is an array. Contains toolboxes for:
• Numeric optimization
• Signal processing
• Statistics, and more…
NumPy: an extension that adds multidimensional arrays and matrices. Both SciPy and NumPy rely on the compiled linear-algebra library LAPACK for very fast implementations.
Matplotlib: the primary plotting library in Python. Supports 2-D and 3-D plotting. All plots are highly customisable and ready for professional publication.
Introduction: NetworkX
A “high-productivity software
for complex networks” analysis
• Data structures for representing various networks
(directed, undirected, multigraphs)
Introduction: a quick example
• Find the shortest path between two nodes, first ignoring edge weights and then taking them into account (Dijkstra’s algorithm is used for the weighted case).
>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_edge('a', 'b', weight=0.1)
>>> g.add_edge('b', 'c', weight=1.5)
>>> g.add_edge('a', 'c', weight=1.0)
>>> g.add_edge('c', 'd', weight=2.2)
>>> print(nx.shortest_path(g, 'b', 'd'))
['b', 'c', 'd']
>>> print(nx.shortest_path(g, 'b', 'd', weight='weight'))
['b', 'a', 'c', 'd']
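The two results differ because the first call counts hops while the second sums edge weights. A minimal sketch that makes the path costs explicit, rebuilding the same graph; the variable names `hops`, `cost` and `weighted_path` are illustrative:

```python
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ('a', 'b', 0.1),
    ('b', 'c', 1.5),
    ('a', 'c', 1.0),
    ('c', 'd', 2.2),
])

# Hop count: every edge costs 1, so b-c-d (2 hops) is shortest.
hops = nx.shortest_path_length(g, 'b', 'd')

# Weighted cost: b-a-c-d costs 0.1 + 1.0 + 2.2, cheaper than b-c-d at 1.5 + 2.2.
cost = nx.shortest_path_length(g, 'b', 'd', weight='weight')

# dijkstra_path reads the 'weight' edge attribute by default.
weighted_path = nx.dijkstra_path(g, 'b', 'd')

print(hops, round(cost, 1), weighted_path)
```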
Introduction: drawing and plotting
• It is possible to draw small graphs with NetworkX. You can export network data
and draw with other programs (GraphViz, Gephi, etc.).
Introduction: NetworkX official website
http://networkx.github.io/
2. Getting started with Python and NetworkX
Getting started: the environment
• Different classes exist for directed and undirected networks. Let’s create a basic
undirected Graph:
• The graph g can be grown in several ways. NetworkX provides many generator
functions and facilities to read and write graphs in many formats.
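A minimal sketch of this step, assuming the conventional `nx` alias and the variable name `g` used in the later examples; `dg` is an illustrative name for the directed counterpart:

```python
import networkx as nx

g = nx.Graph()      # empty undirected graph
dg = nx.DiGraph()   # empty directed graph, a separate class

print(g.is_directed(), dg.is_directed())
print(g.number_of_nodes(), g.number_of_edges())
```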
Getting started: adding nodes
# A list of nodes
>>> g.add_nodes_from([2, 3])
# A container of nodes
>>> h = nx.path_graph(5)
>>> g.add_nodes_from(h)
Getting started: adding edges
# Single edge
>>> g.add_edge(1, 2)
>>> e = (2, 3)
>>> g.add_edge(*e) # unpack tuple
# List of edges
>>> g.add_edges_from([(1, 2), (1, 3)])
# A container of edges
>>> g.add_edges_from(h.edges())
• Some algorithms work only for undirected graphs and others are not well defined
for directed graphs. If you want to treat a directed graph as undirected for some
measurement you should probably convert it first with its to_undirected() method.
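As an illustration of the conversion, here is a minimal sketch on a hypothetical three-node directed graph (`dg` and `ug` are illustrative names). It uses `nx.is_connected`, which NetworkX defines only for undirected graphs:

```python
import networkx as nx

dg = nx.DiGraph([(1, 2), (2, 1), (2, 3)])

# is_connected is not implemented for directed graphs.
try:
    nx.is_connected(dg)
    directed_supported = True
except nx.NetworkXNotImplemented:
    directed_supported = False

# The reciprocal pair 1->2, 2->1 collapses into one undirected edge.
ug = dg.to_undirected()
print(directed_supported, ug.number_of_edges(), nx.is_connected(ug))
```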
Getting started: graph operators
• subgraph(G, nbunch) - induce subgraph of G on nodes in nbunch
• union(G1, G2) - graph union, G1 and G2 must be disjoint
• cartesian_product(G1, G2) - return Cartesian product graph
• compose(G1, G2) - combine graphs identifying nodes common to both
• complement(G) - graph complement
• create_empty_copy(G) - return an empty copy of the same graph class
• to_undirected(G) - return an undirected representation of G
• to_directed(G) - return a directed representation of G
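A short sketch of several of these operators on small hypothetical graphs; the variable names are illustrative, and the printed values are node and edge counts:

```python
import networkx as nx

path = nx.path_graph(3)                                  # 0 - 1 - 2
letters = nx.relabel_nodes(path, {0: 'a', 1: 'b', 2: 'c'})

induced = nx.subgraph(path, [0, 1])                      # keeps only the edge 0 - 1
both = nx.union(path, letters)                           # node sets are disjoint
merged = nx.compose(path, nx.Graph([(2, 3)]))            # node 2 is shared
missing = nx.complement(path)                            # only 0 - 2 is absent from path
grid = nx.cartesian_product(nx.path_graph(2), path)      # 2 x 3 grid

for name, graph in [('induced', induced), ('both', both), ('merged', merged),
                    ('missing', missing), ('grid', grid)]:
    print(name, graph.number_of_nodes(), graph.number_of_edges())
```

Calling `nx.union` on graphs that share node labels raises an error, which is why the second path is relabelled first; `nx.compose` is the operator to use when shared nodes should be merged.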
Getting started: graph generators
# small famous graphs
>>> petersen = nx.petersen_graph()
>>> tutte = nx.tutte_graph()
>>> maze = nx.sedgewick_maze_graph()
>>> tet = nx.tetrahedral_graph()
# classic graphs
>>> K_5 = nx.complete_graph(5)
>>> K_3_5 = nx.complete_bipartite_graph(3, 5)
>>> barbell = nx.barbell_graph(10, 10)
>>> lollipop = nx.lollipop_graph(10, 20)
# random graphs
>>> er = nx.erdos_renyi_graph(100, 0.15)
>>> ws = nx.watts_strogatz_graph(30, 3, 0.1)
>>> ba = nx.barabasi_albert_graph(100, 5)
>>> red = nx.random_lobster(100, 0.9, 0.9)
Getting started: graph input/output
• General read/write
>>> g = nx.read_<format>('path/to/file.txt',...options...)
>>> nx.write_<format>(g,'path/to/file.txt',...options...)
• Data formats
• Node pairs with no data: 1 2
• Python dictionaries as data: 1 2 {'weight':7, 'color':'green'}
• Arbitrary data: 1 2 7 green
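As a concrete instance of the read/write pattern, the sketch below round-trips a small graph through the edge-list format with Python dictionaries as edge data. The file name and temporary directory are assumptions made for the example:

```python
import os
import tempfile
import networkx as nx

g = nx.Graph()
g.add_edge(1, 2, weight=7, color='green')
g.add_edge(2, 3)

path = os.path.join(tempfile.mkdtemp(), 'example.edgelist')

# data=True writes each edge's attribute dictionary after the node pair.
nx.write_edgelist(g, path, data=True)
with open(path) as handle:
    lines = handle.read().splitlines()

# nodetype=int converts the node labels back from text to integers.
h = nx.read_edgelist(path, nodetype=int)
print(lines)
print(h[1][2])
```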
Getting started: drawing graphs
• NetworkX is not primarily a graph drawing package, but it provides basic drawing
capabilities through Matplotlib. For more complex visualization techniques it
provides an interface to the open-source GraphViz software package.
>>> import matplotlib.pyplot as plt # import Matplotlib plotting interface
>>> g = nx.watts_strogatz_graph(100, 8, 0.1)
>>> nx.draw(g)
>>> nx.draw_random(g)
>>> nx.draw_circular(g)
>>> nx.draw_spectral(g)
>>> plt.savefig('graph.png')
3. Basic network analysis
Basic analysis: the Cambridge place network
Basic analysis: degree distribution
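• Calculate the in- and out-degrees of a directed graph
# dict() works whether in_degree() returns a dictionary or a degree view
in_degrees = dict(cam_net.in_degree()) # dictionary node:degree
in_values = sorted(set(in_degrees.values()))
in_hist = [list(in_degrees.values()).count(x) for x in in_values]
out_degrees = dict(cam_net.out_degree())
out_values = sorted(set(out_degrees.values()))
out_hist = [list(out_degrees.values()).count(x) for x in out_values]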
• Then use matplotlib (pylab) to plot the degree distribution
plt.figure() # requires 'import matplotlib.pyplot as plt' beforehand
plt.grid(True)
plt.plot(in_values, in_hist, 'ro-') # in-degree
plt.plot(out_values, out_hist, 'bv-') # out-degree
plt.legend(['In-degree', 'Out-degree'])
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('network of places in Cambridge')
plt.xlim([0, 2*10**2])
plt.savefig('./output/cam_net_degree_distribution.pdf')
plt.close()
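The counting step can also be done with `collections.Counter`, which avoids one `count` call per distinct degree. A minimal sketch on a small hypothetical directed graph; `toy` and `degree_histogram` are illustrative names, and it assumes NetworkX 2.0 or later, where iterating a degree view yields (node, degree) pairs:

```python
from collections import Counter
import networkx as nx

def degree_histogram(degree_view):
    """Return (sorted distinct degrees, number of nodes with each degree)."""
    counts = Counter(deg for _, deg in degree_view)
    values = sorted(counts)
    return values, [counts[v] for v in values]

toy = nx.DiGraph([(1, 2), (1, 3), (2, 3), (3, 1)])
in_values, in_hist = degree_histogram(toy.in_degree())
out_values, out_hist = degree_histogram(toy.out_degree())

print(in_values, in_hist)
print(out_values, out_hist)
```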
Basic analysis: clustering coefficient
• We can get the clustering coefficient of individual nodes or all the nodes (but first
we need to convert the graph to an undirected one)
cam_net_ud = cam_net.to_undirected()
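A minimal sketch of the clustering calls on a small hypothetical undirected graph (a triangle with one pendant node), using `nx.clustering` and `nx.average_clustering`; the variable names are illustrative:

```python
import networkx as nx

# Triangle 1-2-3 with a pendant node 4 attached to node 3.
g = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

one_node = nx.clustering(g, 3)        # coefficient of a single node
all_nodes = nx.clustering(g)          # dictionary node: coefficient
average = nx.average_clustering(g)    # mean over all nodes

print(one_node, all_nodes, average)
```

Node 3 has three neighbours of which only one pair (1 and 2) is connected, so its coefficient is 1/3, while nodes 1 and 2 sit in a closed triangle and score 1.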
Basic analysis: node centralities
• We will first extract the largest connected component and then compute the
node centrality measures
# Extract the largest connected component
# (connected_component_subgraphs() was removed in NetworkX 2.4)
largest_cc = max(nx.connected_components(cam_net_ud), key=len)
cam_net_mc = cam_net_ud.subgraph(largest_cc).copy()
# Betweenness centrality
bet_cen = nx.betweenness_centrality(cam_net_mc)
# Closeness centrality
clo_cen = nx.closeness_centrality(cam_net_mc)
# Eigenvector centrality
eig_cen = nx.eigenvector_centrality(cam_net_mc)
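The same pipeline can be tried end to end on a toy network. The sketch below uses a hypothetical graph with a five-node component and a detached edge; `main` is an illustrative name for the largest component:

```python
import networkx as nx

# Hypothetical toy network: a five-node component plus a detached edge.
g = nx.Graph([(1, 2), (1, 3), (2, 3), (1, 4), (4, 5), (10, 11)])

largest_cc = max(nx.connected_components(g), key=len)
main = g.subgraph(largest_cc).copy()

bet_cen = nx.betweenness_centrality(main)
clo_cen = nx.closeness_centrality(main)
eig_cen = nx.eigenvector_centrality(main)

# Print the top-scoring node for each measure.
for name, scores in [('betweenness', bet_cen), ('closeness', clo_cen),
                     ('eigenvector', eig_cen)]:
    print(name, max(scores, key=scores.get))
```

Each function returns a dictionary keyed by node, which is the form the later examples rely on.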
Basic analysis: most central nodes
• We first introduce a utility function: given a dictionary and a number K, it
returns the K keys with the largest values.
def get_top_keys(dictionary, top):
    # sort the (key, value) pairs by value, largest first
    items = sorted(dictionary.items(), reverse=True, key=lambda x: x[1])
    return [key for key, value in items[:top]]
• We can then apply the method on the various centrality metrics available. Below
we extract the top 10 most central nodes for each case.
top_bet_cen = get_top_keys(bet_cen,10)
top_clo_cen = get_top_keys(clo_cen,10)
top_eig_cen = get_top_keys(eig_cen,10)
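When only a few top entries are needed from a large dictionary, a full sort is not required. One possible alternative, sketched with the standard-library `heapq.nlargest` (`top_keys` and `example_scores` are hypothetical names, not part of the lecture code):

```python
import heapq

def top_keys(scores, top):
    """Return the `top` keys with the largest values, best first."""
    return heapq.nlargest(top, scores, key=scores.get)

example_scores = {'a': 0.10, 'b': 0.90, 'c': 0.50, 'd': 0.25}
print(top_keys(example_scores, 2))
```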
Basic analysis: interpretability
• The nodes in our network correspond to real entities. For each place in the
network, represented by its id, we have its title and geographic coordinates.
### READ META DATA ###
node_data = {}
# each line holds: node id; place title; latitude; longitude
with open('./output/cambridge_net_titles.txt') as meta_file:
    for line in meta_file:
        splits = line.split(';')
        node_id = int(splits[0])
        place_title = splits[1]
        lat = float(splits[2])
        lon = float(splits[3])
        node_data[node_id] = (place_title, lat, lon)
• Iterate through the lists of centrality nodes and use the meta data to print the
titles of the respective places.
print('Top 10 places for betweenness centrality:')
for node_id in top_bet_cen:
    print(node_data[node_id][0])
Basic analysis: most central nodes
Top 10 places for each centrality metric
Rank  Betweenness centrality           Closeness centrality             Eigenvector centrality
1     Cambridge Railway Station (CBG)  Cambridge Railway Station (CBG)  Cambridge Railway Station (CBG)
2     Grand Arcade                     Grand Arcade                     Cineworld Cambridge
3     Cineworld Cambridge              Cineworld Cambridge              Grand Arcade
4     Greens                           Apple Store                      King's College
5     King's College                   Grafton Centre                   Apple Store
6     Cambridge Market                 Cambridge Market                 Cambridge Market
7     Grafton Centre                   Greens                           Greens
8     Apple Store                      King's College                   Addenbrooke's Hospital
9     Anglia Ruskin University         Addenbrooke's Hospital           Grafton Centre
10    Addenbrooke's Hospital           Parker's Piece                   Revolution Bar (Vodka Revolutions)
• The ranking for the different centrality metrics does not change much, although
this may well depend on the type of network under consideration.
Basic analysis: drawing our network
# draw the graph using the nodes' geographic positions
pos_dict = {}
for node_id, node_info in node_data.items():
    pos_dict[node_id] = (node_info[2], node_info[1])  # (x, y) = (lon, lat)
nx.draw(cam_net, pos=pos_dict, with_labels=False, node_size=25)
plt.savefig('cam_net_graph.pdf')
plt.close()
Basic analysis: working with JSON data
• Computing network centrality metrics can be slow, especially for large networks,
so it pays to compute them once and save the results to disk.
• JSON (JavaScript Object Notation) is a lightweight data interchange format which
can be used to serialize and deserialize Python objects (dictionaries and lists).
import json
# Utility function: saves data in JSON format
def dump_json(out_file_name, result):
    with open(out_file_name, 'w') as out_file:
        out_file.write(json.dumps(result, indent=4, separators=(',', ': ')))
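Reading the saved results back is the other half of this workflow. The sketch below pairs a simplified `dump_json` with a hypothetical `load_json` helper (not taken from the lecture code) and shows one detail worth knowing: JSON object keys are always strings, so integer node ids come back as text and need converting.

```python
import json
import os
import tempfile

def dump_json(out_file_name, result):
    with open(out_file_name, 'w') as out_file:
        json.dump(result, out_file, indent=4)

def load_json(in_file_name):
    """Hypothetical counterpart to dump_json: read the data back."""
    with open(in_file_name) as in_file:
        return json.load(in_file)

bet_cen = {1: 0.5, 2: 0.0}   # stand-in for a centrality dictionary
path = os.path.join(tempfile.mkdtemp(), 'bet_cen.json')
dump_json(path, bet_cen)

loaded = load_json(path)                             # keys are now '1', '2'
restored = {int(k): v for k, v in loaded.items()}    # back to integer ids
print(loaded, restored)
```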
Writing your own code: BFS
• With Python and NetworkX it is easy to write any graph-based algorithm
from collections import deque
def get_triangles(g):
    # yield each triangle once, as (n1, n2, n3) with n1 < n2 < n3
    for n1 in g.nodes():
        neighbors1 = set(g[n1])
        # n2 must be a neighbour of n1, otherwise n1-n2-n3 is not closed
        for n2 in filter(lambda x: x>n1, neighbors1):
            neighbors2 = set(g[n2])
            common = neighbors1 & neighbors2
            for n3 in filter(lambda x: x>n2, common):
                yield n1, n2, n3
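Breadth-first search, the algorithm named in this slide's title, can be written in the same style with a `deque` as the frontier queue. A minimal sketch, not the lecture's own listing; `bfs_order` is a hypothetical name and the example graph is made up:

```python
from collections import deque
import networkx as nx

def bfs_order(g, source):
    """Return the nodes reachable from source in breadth-first order."""
    visited = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()          # oldest frontier node first
        order.append(node)
        for neighbour in g[node]:
            if neighbour not in visited:
                visited.add(neighbour)  # mark on enqueue to avoid duplicates
                queue.append(neighbour)
    return order

g = nx.Graph([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)])
print(bfs_order(g, 1))
```

Popping from the left of the deque is what makes the traversal breadth-first; popping from the right instead would turn it into a depth-first traversal.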
Writing your own code: average neighbours’ degree
• Compute the average degree of each node’s neighbours:
def avg_neigh_degree(g):
    data = {}
    for n in g.nodes():
        if g.degree(n):
            data[n] = float(sum(g.degree(i) for i in g[n]))/g.degree(n)
    return data
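A hand-written measure like this can be cross-checked against the library's built-in `nx.average_neighbor_degree`. The sketch below restates the function as a dictionary comprehension (assuming Python 3 division) and compares the two on a small star graph; `manual` and `builtin` are illustrative names:

```python
import networkx as nx

def avg_neigh_degree(g):
    # Same computation as above, skipping nodes with no neighbours.
    return {n: sum(g.degree(i) for i in g[n]) / g.degree(n)
            for n in g.nodes() if g.degree(n)}

star = nx.star_graph(3)   # hub 0 joined to leaves 1, 2, 3

manual = avg_neigh_degree(star)
builtin = nx.average_neighbor_degree(star)
print(manual)
print(builtin)
```

The hub's neighbours all have degree 1, while each leaf's only neighbour is the hub with degree 3.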
5. Ready for your own analysis!
What you have learnt today
• How to create graphs from scratch, with generators and by loading local data
• How to compute basic network measures, how they are stored in NetworkX and
how to manipulate them with list comprehensions
• How to use matplotlib to visualize and plot results (useful for final report!)
• How to use and include NetworkX features to design your own algorithms
Useful links
• Code & data used in this lecture: www.cl.cam.ac.uk/~pig20/stna-examples.zip
• NodeXL: a graphical front-end that integrates network analysis into Microsoft Office and Excel
(http://nodexl.codeplex.com/)
• Pajek: a program for network analysis for Windows (http://pajek.imfm.si/doku.php)
• Gephi: an interactive visualization and exploration platform (http://gephi.org/)
• Power-law Distributions in Empirical Data: tools for fitting heavy-tailed distributions to data
(http://www.santafe.edu/~aaronc/powerlaws/)
• GraphViz: graph visualization software (http://www.graphviz.org/)
• Matplotlib: full documentation for the plotting library (http://matplotlib.org/)
• Unfolding Maps: map visualization software in Java (http://unfoldingmaps.org/)
Questions?
E-mail: [email protected]