Practical Social Network Analysis with Python
Series Editors
Jacek Rak
Department of Computer Communications, Faculty of Electronics,
Telecommunications and Informatics, Gdansk University of Technology,
Gdansk, Poland
A. J. Sammes
Cyber Security Centre, Faculty of Technology, De Montfort University,
Leicester, UK
Ankith Mohan
Department of ISE, Ramaiah Institute of Technology, Bangalore, Karnataka,
India
K. G. Srinivasa
Department of Information Technology, C.B.P. Government Engineering
College, Jaffarpur, Delhi, India
This work is subject to copyright. All rights are reserved by the Publisher,
whether the whole or part of the material is concerned, specifically the rights of
translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or
by similar or dissimilar methodology now known or hereafter developed.
Each of these complex systems (especially the Internet) has its own unique idiosyncrasies, but all of them share one commonality: they can be described by an intricate wiring diagram, a network, which defines the interactions between the components. We can never fully understand a system unless we gain a full understanding of its network.
Network
A network is a collection of objects where some pairs of these objects are
connected by links. These objects are also sometimes referred to as nodes. By
representing a complex system through its network, we are able to better
visualize the system and observe the interconnections among the various nodes.
From close examination of networks, we can gather information about which nodes are closely linked to one another, which are sparsely linked, whether links are concentrated in a particular part of the network, and whether some nodes have a far higher number of links than others.
Figure 2 illustrates the network corresponding to the ARPANET (as the Internet was then called) in December 1970. It consisted of 13 sites, where the nodes represent computing hosts and the links represent direct communication lines between these hosts.
Fig. 2 ARPANET in December 1970. Computing hosts are represented as nodes and links denote the
communication lines. Online at https://imgur.com/gallery/Xk9MP
Graph
Several properties can be read directly from these networks, but others require a more mathematical treatment. In its informal state, a network is not amenable to such treatment; to make it so, the network is represented as a graph. In this view, a graph is a mathematical representation of a network that acts as a framework for reasoning about numerous concepts. More formally, a graph can be defined as $G = (V, E)$, where $V$ denotes the set of all vertices of $G$ and $E$ denotes the set of edges of $G$. Each edge $e = (u, v)$, where $u, v \in V$, describes an edge from $u$ to $v$. If an edge exists between $u$ and $v$ then they are considered neighbours. The number of vertices in $G$ is denoted $|V|$ and the number of edges is denoted $|E|$. These notations will be used throughout the book.
Figure 3 represents the graph corresponding to the network in Fig. 2 where
the vertices of the graph correspond to the computing hosts and the edges
correspond to the linking communication lines.
Fig. 3 Graph corresponding to Fig. 2 consisting of 13 vertices and 17 edges
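Graphs of this kind are manipulated programmatically throughout the book. As a first, minimal illustration (a sketch using the NetworkX package; the vertex names below are made up and are not the actual ARPANET sites of Fig. 3), the following Python snippet builds a small undirected graph G = (V, E) and queries |V|, |E| and the neighbours of a vertex.

```python
# A minimal sketch (using NetworkX) of representing a network as a graph G = (V, E).
# The vertices and edges below are illustrative only.
import networkx as nx

G = nx.Graph()                      # an undirected graph
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")])

print(G.number_of_nodes())          # |V|, the number of vertices
print(G.number_of_edges())          # |E|, the number of edges
print(list(G.neighbors("A")))       # the neighbours of vertex A
```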
Network Datasets
In this age of information technology, we are blessed with an ever-increasing availability of large and detailed network datasets. These datasets generally fall into one or more of the following groups.
Collaboration graphs : Collaboration graphs represent collaborators as vertices, with an edge between two vertices indicating a collaboration between the corresponding individuals. Co-authorship among scientists, co-appearances in movies among performers and Wikipedia collaboration networks are all instances of collaboration graphs.
Who-talks-to-whom graphs : These graphs have conversers as vertices and an
edge between a pair of them if they have shared a conversation. Phone call
logs, e-mail logs, SMS logs and social media message logs are examples of
who-talks-to-whom graphs. However, these graphs are subject to a certain amount of regulation because they involve access to private information. Such privacy concerns must be resolved before these graphs can be subjected to any form of analysis.
A variant of these graphs is the “who-transacts-with-whom” graph, where the nodes represent individuals and the edges denote transactions between them. These graphs are of particular interest to economists.
Information linkage graphs : The Web is an example of these graphs, where the webpages are the vertices and the hyperlinks between webpages are the edges. Such graphs contain a large amount of information in both the vertices and the edges, and manipulating this information is a substantial task in itself.
Technological graphs : In these graphs, physical devices such as routers or
substations represent the vertices and an edge represents a communication line
connecting these physical devices. Water grid, power grid, etc. are examples
of technological graphs.
There is a specific kind of technological graph called the autonomous systems (AS) graph, where the vertices are autonomous systems and the edges denote the data transfer between these systems. AS graphs are special in that they have a two-level view. The first level is the direct connection among components, but there is a second level which defines the protocol regulating the exchange of information between communicating components. Graphs of communication between network devices controlled by different ISPs, transfers between bank accounts, transfers of students between departments, etc. all exhibit this two-level structure.
Natural-world graphs : Graphs pertaining to biology and other natural sciences fall under this category. Food webs, brain networks, protein–protein interaction networks and disease spread networks are examples of natural-world graphs.
Krishna Raj P. M.
Ankith Mohan
K. G. Srinivasa
Bangalore, India, Bangalore, India, Jaffarpur, India
Contents
1 Basics of Graph Theory
1.1 Choice of Representation
1.1.1 Undirected Graphs
1.1.2 Directed Graphs
1.2 Degree of a Vertex
1.3 Degree Distribution
1.4 Average Degree of a Graph
1.5 Complete Graph
1.6 Regular Graph
1.7 Bipartite Graph
1.8 Graph Representation
1.8.1 Adjacency Matrix
1.8.2 Edge List
1.8.3 Adjacency List
1.9 Edge Attributes
1.9.1 Unweighted Graph
1.9.2 Weighted Graph
1.9.3 Self-looped Graph
1.9.4 Multigraphs
1.10 Path
1.11 Cycle
1.12 Path Length
1.13 Distance
1.14 Average Path Length
1.15 Diameter
1.16 Connectedness of Graphs
1.17 Clustering Coefficient
1.18 Average Clustering Coefficient
Problems
Reference
2 Graph Structure of the Web
2.1 Algorithms
2.1.1 Breadth First Search (BFS) Algorithm
2.1.2 Strongly Connected Components (SCC) Algorithm
2.1.3 Weakly Connected Components (WCC) Algorithm
2.2 First Set of Experiments—Degree Distributions
2.3 Second Set of Experiments—Connected Components
2.4 Third Set of Experiments—Number of Breadth First Searches
2.5 Rank Exponent
Problems
References
3 Random Graph Models
3.1 Random Graphs
3.2 Erdös–Rényi Random Graph Model
3.2.1 Properties
3.2.2 Drawbacks of
3.2.3 Advantages of
3.3 Bollobás Configuration Model
3.4 Permutation Model
3.5 Random Graphs with Prescribed Degree Sequences
3.5.1 Switching Algorithm
3.5.2 Matching Algorithm
3.5.3 “Go with the Winners” Algorithm
3.5.4 Comparison
Problems
References
4 Small World Phenomena
4.1 Small World Experiment
4.2 Columbia Small World Study
4.3 Small World in Instant Messaging
4.4 Erdös Number
4.5 Bacon Number
4.6 Decentralized Search
4.7 Searchable
4.8 Other Small World Studies
4.9 Case Studies
4.9.1 HP Labs Email Network
4.9.2 LiveJournal Network
4.9.3 Human Wayfinding
4.10 Small World Models
4.10.1 Watts–Strogatz Model
4.10.2 Kleinberg Model
4.10.3 Destination Sampling Model
Problems
References
5 Graph Structure of Facebook
5.1 HyperANF Algorithm
5.2 Iterative Fringe Upper Bound (iFUB) Algorithm
5.3 Spid
5.4 Degree Distribution
5.5 Path Length
5.6 Component Size
5.7 Clustering Coefficient and Degeneracy
5.8 Friends-of-Friends
5.9 Degree Assortativity
5.10 Login Correlation
5.11 Other Mixing Patterns
5.11.1 Age
5.11.2 Gender
5.11.3 Country of Origin
Problems
References
6 Peer-To-Peer Networks
6.1 Chord
6.2 Freenet
Problems
References
7 Signed Networks
7.1 Theory of Structural Balance
7.2 Theory of Status
7.3 Conflict Between the Theory of Balance and Status
7.4 Trust in a Network
7.4.1 Atomic Propagations
7.4.2 Propagation of Distrust
7.4.3 Iterative Propagation
7.4.4 Rounding
7.5 Perception of User Evaluation
7.6 Effect of Status and Similarity on User Evaluation
7.6.1 Effect of Similarity on Evaluation
7.6.2 Effect of Status on Evaluation
7.6.3 Aggregate User Evaluation
7.6.4 Ballot-Blind Prediction
7.7 Predicting Positive and Negative Links
7.7.1 Predicting Edge Sign
Problems
References
8 Cascading in Social Networks
8.1 Decision Based Models of Cascade
8.1.1 Collective Action
8.1.2 Cascade Capacity
8.1.3 Co-existence of Behaviours
8.1.4 Cascade Capacity with Bilinguality
8.2 Probabilistic Models of Cascade
8.2.1 Branching Process
8.2.2 Basic Reproductive Number
8.2.3 SIR Epidemic Model
8.2.4 SIS Epidemic Model
8.2.5 SIRS Epidemic Model
8.2.6 Transient Contact Network
8.3 Cascading in Twitter
Problems
References
9 Influence Maximisation
9.1 Viral Marketing
9.2 Approximation Algorithm for Influential Identification
References
10 Outbreak Detection
10.1 Battle of the Water Sensor Networks
10.1.1 Expected Time of Detection ( )
10.1.5 Evaluation
10.2 Cost-Effective Lazy Forward Selection Algorithm
10.2.1 Blogspace
10.2.2 Water Networks
Problems
References
11 Power Law
11.1 Power Law Random Graph Models
11.1.1 Model A
11.1.2 Model B
11.1.3 Model C
11.1.4 Model D
11.2 Copying Model
11.3 Preferential Attachment Model
11.4 Analysis of Rich-Get-Richer Phenomenon
11.5 Modelling Burst of Activity
11.6 Densification Power Laws and Shrinking Diameters
11.7 Microscopic Evolution of Networks
11.7.1 Edge Destination Selection Process
11.7.2 Edge Initiation Process
11.7.3 Node Arrival Process
11.7.4 Network Evolution Model
Problems
References
12 Kronecker Graphs
12.1 Stochastic Kronecker Graph (SKG) Model
12.1.1 Fast Generation of SKGs
12.1.2 Noisy Stochastic Kronecker Graph (NSKG) Model
12.2 Distance-Dependent Kronecker Graph
12.3 KRONFIT
12.4 KRONEM
12.5 Multifractal Network Generator
References
13 Link Analysis
13.1 Search Engine
13.1.1 Crawling
13.1.2 Storage
13.1.3 Indexing
13.1.4 Ranking
13.2 Google
13.2.1 Data Structures
13.2.2 Crawling
13.2.3 Searching
13.3 Web Spam Pages
Problems
References
14 Community Detection
14.1 Strength of Weak Ties
14.1.1 Triadic Closure
14.2 Detecting Communities in a Network
14.2.1 Girvan-Newman Algorithm
14.2.2 Modularity
14.2.3 Minimum Cut Trees
14.3 Tie Strengths in Mobile Communication Network
14.4 Exact Betweenness Centrality
14.5 Approximate Betweenness Centrality
References
15 Representation Learning on Graphs
15.1 Node Embedding
15.1.1 Direct Encoding
15.1.2 Factorization-Based Approaches
15.1.3 Random Walk Approaches
15.2 Neighbourhood Autoencoder Methods
15.3 Neighbourhood Aggregation and Convolutional Encoders
15.4 Binary Classification
15.5 Multi-modal Graphs
15.5.1 Different Node and Edge Types
15.5.2 Node Embeddings Across Layers
15.6 Embedding Structural Roles
15.7 Embedding Subgraphs
15.7.1 Sum-Based Approaches
15.7.2 Graph-Coarsening Approaches
15.8 Graph Neural Networks
References
Index
List of Figures
Fig. 1.1 An undirected graph comprising of vertices , , and , and
edges, ( , ), ( , ), ( , ), ( , ), ( , )
Fig. 1.2 A directed graph comprising of vertices , , and , and
edges, ( , ), ( , ), ( , ), ( , ), ( , )
Fig. 1.3 Histogram of versus
Fig. 1.4 Scree plot of versus
Fig. 1.5 A complete undirected graph having vertices and edges
Fig. 1.6 A complete directed graph having vertices and edges
Fig. 1.7 A -regular random graph on vertices
Fig. 1.8 A folded undirected graph with its corresponding bipartite undirected
graph
Fig. 1.9 A folded directed graph with its corresponding bipartite directed graph
Fig. 1.10 An undirected weighted graph with vertices and weighted edges
Fig. 1.11 A directed weighted graph with vertices and weighted edges
Fig. 1.12 An undirected self-looped graph with vertices and edges
Fig. 1.13 A directed self-looped graph with vertices and edges
Fig. 1.14 An undirected multigraph with vertices and edges
Fig. 1.15 A directed multigraph with vertices and edges
Fig. 1.16 A disconnected undirected graph with vertices, , , and ,
and edges, ( , ), ( , ) leaving as an isolated vertex
Fig. 1.17 A strongly connected directed graph with vertices, , , and
, and edges, ( , ), ( , ), ( , ), ( , ) and ( , )
Fig. 1.18 A weakly connected directed graph with vertices, , , ,
and , with edges, ( , ), ( , ), ( , ), ( , ), ( , ) and
( , ). This graph has the SCCs, , , , and
Fig. 1.19 A weakly connected graph whose vertices are the SCCs of and
whose edges exist in because there is an edge between the corresponding
SCCs in
Fig. 1.20 A strongly connected directed graph where vertex belongs to two
SCCs , , and , ,
Fig. 1.21 A graph with the same vertices and edges as shown in Fig. 1.19 with
the exception that there exists an edge between and to make the graph a
strongly connected one
Fig. 2.1 In-degree distribution when only off-site edges are considered
Fig. 2.2 In-degree distribution over May and October crawls
Fig. 2.3 Out-degree distribution when only off-site edges are considered
Fig. 2.4 Out-degree distribution over May and October crawls
Fig. 2.5 Distribution of WCCs on the Web
Fig. 2.6 Distribution of SCCs on the Web
Fig. 2.7 Cumulative distribution on the number of nodes reached when BFS is
started from a random vertex and follows in-links
Fig. 2.8 Cumulative distribution on the number of nodes reached when BFS is
started from a random vertex and follows out-links
Fig. 2.9 Cumulative distribution on the number of nodes reached when BFS is
started from a random vertex and follows both in-links and out-links
Fig. 2.10 In-degree distribution plotted as power law and Zipf distribution
Fig. 2.11 Bowtie structure of the graph of the Web
Fig. 2.12 Log-log plot of the out-degree as a function of the rank in the
sequence of decreasing out-degree for Int-11-97
Fig. 2.13 Log-log plot of the out-degree as a function of the rank in the
sequence of decreasing out-degree for Int-04-98
Fig. 2.14 Log-log plot of the out-degree as a function of the rank in the
sequence of decreasing out-degree for Int-12-98
Fig. 2.15 Log-log plot of the out-degree as a function of the rank in the
sequence of decreasing out-degree for Rout-95
Fig. 2.16 Log-log plot of frequency versus the out-degree for Int-11-97
Fig. 2.17 Log-log plot of frequency versus the out-degree for Int-04-98
Fig. 2.18 Log-log plot of frequency versus the out-degree for Int-12-98
Fig. 2.19 Log-log plot of frequency versus the out-degree for Rout-95
Fig. 2.20 Log-log plot of the number of pairs of nodes within hops
versus the number of hops for Int-11-97
Fig. 2.21 Log-log plot of the number of pairs of nodes within hops
versus the number of hops for Int-04-98
Fig. 2.22 Log-log plot of the number of pairs of nodes within hops
versus the number of hops for Int-12-98
Fig. 2.23 Log-log plot of the number of pairs of nodes within hops
versus the number of hops for Rout-95
Fig. 2.24 Log-log plot of the eigenvalues in decreasing order for Int-11-97
Fig. 2.25 Log-log plot of the eigenvalues in decreasing order for Int-04-98
Fig. 2.26 Log-log plot of the eigenvalues in decreasing order for Int-12-98
Fig. 2.27 Log-log plot of the eigenvalues in decreasing order for Rout-95
Fig. 3.1 Figure shows three undirected graph variants of
Fig. 3.2 Figure shows three directed graph variants of
Fig. 3.3 A random graph generated using Bollobás configuration model with
and
Fig. 4.1 Frequency distribution of the number of intermediaries required to reach
the target
Fig. 4.2 Frequency distribution of the intermediate chains
Fig. 4.3 Average per-step attrition rates (circles) and confidence interval
(triangles)
Fig. 4.4 Histogram representing the number of chains that are completed in
steps
Fig. 4.5 “Ideal” histogram of chain lengths in Fig. 4.4 by accounting for message
attrition in Fig. 4.3
Fig. 4.6 Distribution of shortest path lengths
Fig. 4.7 A hand-drawn picture of a collaboration graph between various
researchers. Most prominent is Erdös
Fig. 4.8 Email communication between employees of HP Lab
Fig. 4.9 Degree distribution of HP Lab network
Fig. 4.10 Probability of linking as a function of the hierarchical distance
Fig. 4.11 Probability of linking as a function of the size of the smallest
organizational unit individuals belong to
Fig. 4.12 Email communication mapped onto approximate physical location.
Each block represents a floor in a building. Blue lines represent far away
contacts while red ones represent nearby ones
Fig. 4.13 Probability of two individuals corresponding by email as a function of
the distance between their cubicles. The inset shows how many people in total sit
at a given distance from one another
Fig. 4.14 In-degree and out-degree distributions of the LiveJournal network
Fig. 4.15 In each of trials, a source and target are chosen
randomly; at each step, the message is forwarded from the current message-
holder to the friend of geographically closest to . If ,
then the chain is considered to have failed. The fraction of pairs in which
the chain reaches ’s city in exactly steps is shown ( chains
completed; median , , for completed chains). (Inset) For
completed, median , , ; if
then picks a random person in the same city as to pass the message to, and
the chain fails only if there is no such person available
Fig. 4.16 a For each distance , the proportion P( ) of friendships among all
pairs , of LiveJournal users with is shown. The number of pairs
, with is estimated by computing the distance between
randomly chosen pairs of people in the network. b The same data are plotted,
correcting for the background friendship probability: plot distance versus
Fig. 4.17 The relationship between friendship probability and rank. The
probability of a link from a randomly chosen source to the th closest
node to , i.e, the node such that , in the LiveJournal network,
averaged over independent source samples. A link from to one of the
nodes , where the people in are all tied for rank
, is counted as a ( ) fraction of a link for each of these
Fig. 4.18 Depiction of a regular network proceeding first to a small world
network and next to a random network as the randomness, increases
Fig. 4.19 Log of decentralized search time as a function of the exponent
Fig. 5.1 Degree distribution of the global and US Facebook active users,
alongside its CCDF
Fig. 5.2 Neighbourhood function showing the percentage of users that are within
hops of one another
Fig. 5.3 Distribution of component sizes on log–log scale
Fig. 5.4 Clustering coefficient and degeneracy as a function of degree on log–log
scale
Fig. 5.5 Average number of unique and non-unique friends-of-friends as a
function of degree
Fig. 5.6 Average neighbour degree as a function of an individual’s degree, and
the conditional probability that a randomly chosen neighbour of an
individual with degree has degree
Fig. 5.7 Neighbor’s logins versus user’s logins to Facebook over a period of
days, and a user’s degree versus the number of days a user logged into Facebook
in the day period
Fig. 5.8 Distribution of ages for the neighbours of users with age
Fig. 5.9 Normalized country adjacency matrix as a heatmap on a log scale.
Normalized by dividing each element of the adjacency matrix by the product of
the row country degree and column country degree
Fig. 6.1 Abstract P2P network architecture
Fig. 6.2 Degree distribution of the freenet network
Fig. 7.1 , and are all friends of each other. Therefore this triad ( ) by
satisfying structural balance property is balanced
Fig. 7.2 and are friends. However, both of them are enemies of . Similar
to Fig. 7.1, this triad ( ) is balanced
Fig. 7.3 is friends with and . However, and are enemies.
Therefore the triad ( ) by failing to satisfy the structural balance property is
unbalanced
Fig. 7.4 , and are all enemies of one another. Similar to Fig. 7.3, the
triad ( ) is unbalanced
Fig. 7.5 A balanced (left) and an unbalanced (right) graph
Fig. 7.6 A signed graph
Fig. 7.7 Supernodes of the graph in Fig. 7.6
Fig. 7.8 A simplified labelling of the supernodes for the graph in Fig. 7.6
Fig. 7.9 All contexts ( , ; ) where the red edge closes the triad
Fig. 7.10 Surprise values and predictions based on the competing theories of
structural balance and status
Fig. 7.11 Given that the first edge was of sign , gives the
probability that reciprocated edge is of sign
Fig. 7.12 Edge reciprocation in balanced and unbalanced triads. Triads : number
of balanced/unbalanced triads in the network where one of the edges was
reciprocated. P ( RSS ): probability that the reciprocated edge is of the same sign.
: probability that the positive edge is later reciprocated with a plus.
: probability that the negative edge is reciprocated with a minus
Fig. 7.13 Prediction of the algorithms. Here, ,
Fig. 7.14 Helpfulness ratio declines with the absolute value of a review’s
deviation from the computed star average. The line segments within the bars
indicate the median helpfulness ratio; the bars depict the helpfulness ratio’s
second and third quantiles. Grey bars indicate that the amount of data at that
value represents or less of the data depicted in the plot
Fig. 7.15 Helpfulness ratio as a function of a review’s signed deviation
Fig. 7.16 As the variance of the star ratings of reviews for a particular product
increases, the median helpfulness ratio curve becomes two-humped and the
helpfulness ratio at signed deviation (indicated in red) no longer represents the
unique global maximum
Fig. 7.17 Signed deviations vs. helpfulness ratio for variance = , in the
Japanese (left) and U.S. (right) data. The curve for Japan has a pronounced lean
towards the left
Fig. 7.18 Probability of a positive evaluation ( ) as a function of the
similarity (binary cosine) between the evaluator and target edit vectors ( )
in Wikipedia
Fig. 7.19 Probability of positively evaluating as a function of similarity in
Stack Overflow
Fig. 7.20 Probability of voting positively on as a function of for
different levels of similarity in English Wikipedia
Fig. 7.21 Probability of voting positively on as a function of for
different levels of similarity on Stack Overflow for a all evaluators b no low
status evaluators
Fig. 7.22 Similarity between and pairs as a function of for a English
Wikipedia and b Stack Overflow
Fig. 7.23 Probability of positively evaluating versus for various fixed
levels of in Stack Overflow
Fig. 7.24 Dip in various datasets
Fig. 7.25 The Delta-similarity half-plane. Votes in each quadrant are treated as a
group
Fig. 7.26 Ballot-blind prediction results for a English Wikipedia b German
Wikipedia
Fig. 7.27 Accuracy of predicting the sign of an edge based on the signs of all the
other edges in the network in a Epinions, b Slashdot and c Wikipedia
Fig. 7.28 Accuracy of predicting the sign of an edge based on the signs of all the
other edges in the network in a Epinions, b Slashdot and c Wikipedia
Fig. 7.29 Accuracy for handwritten heuristics as a function of minimum edge
embeddedness
Fig. 8.1 must choose between behaviours and based on its neighbours
behaviours
Fig. 8.2 Initial network where all nodes exhibit the behaviour
Fig. 8.3 Nodes and are initial adopters of behaviour while all the other
nodes still exhibit behaviour
Fig. 8.4 First time step where and adopt behaviour by threshold rule
Fig. 8.5 Second time step where and adopt behaviour also by threshold
rule
Fig. 8.6 Initial network where all nodes have behaviour
Fig. 8.7 Nodes and are early adopters of behaviour while all the other
nodes exhibit
Fig. 8.8 After three time steps there are no further cascades
Fig. 8.9 An infinite grid with early adopters of behaviour while the others
exhibit behaviour
Fig. 8.10 The payoffs to node on an infinite path with neighbours exhibiting
behaviour and
Fig. 8.11 By dividing the a-c plot based on the payoffs, we get the regions
corresponding to the different choices
Fig. 8.12 The payoffs to node on an infinite path with neighbours exhibiting
behaviour and
Fig. 8.13 The a-c plot shows the regions where chooses each of the possible
strategies
Fig. 8.14 The plot shows the four possible outcomes for how spreads or fails
to spread on the infinite path, indicated by this division of the -plane into
four regions
Fig. 8.15 Graphical method of finding the equilibrium point of a threshold
distribution
Fig. 8.16 Contact network for branching process
Fig. 8.17 Contact network for branching process where high infection
probability leads to widespread
Fig. 8.18 Contact network for branching process where low infection probability
leads to the disappearance of the disease
Fig. 8.19 Repeated application of
Fig. 8.20 A contact network where each edge has an associated period of time
denoting the time of contact between the connected vertices
Fig. 8.21 Exposure curve for top hashtags
Fig. 9.1 Results for the linear threshold model
Fig. 9.2 Results for the weight cascade model
Fig. 9.3 Results for the independent cascade model with probability
Fig. 9.4 Results for the independent cascade model with probability
Fig. 10.1 (Left) Performance of CELF algorithm and offline and on-line bounds
for PA objective function. (Right) Compares objective functions
Fig. 10.2 Heuristic blog selection methods. (Left) unit cost model, (Right)
number of posts cost model
Fig. 10.3 (Left) CELF with offline and online bounds for PA objective. (Right)
Different objective functions
Fig. 10.4 Water network sensor placements: (Left) when optimizing PA, sensors
are concentrated in high population areas. (Right) when optimizing DL, sensors
are uniformly spread out
Fig. 10.5 Solutions of CELF outperform heuristic approaches
Fig. 11.1 A log-log plot exhibiting a power law
Fig. 11.2 Distribution with a long tail
Fig. 11.3 Burstiness of communities in the time graph of the Blogspace
Fig. 11.4 Average out-degree over time. Increasing trend signifies that the graph
are densifying
Fig. 11.5 Number of edges as a function of the number of nodes in log-log scale.
They obey the densification power law
Fig. 11.6 Effective diameter over time for different datasets. There is a
consistent decrease of the diameter over time
Fig. 11.7 The fraction of nodes that are part of the giant connected component
over time. Observe that after years, of all nodes in the graph belong to
the giant component
Fig. 11.8 Probability of a new edge choosing a destination at a node of
degree
Fig. 11.9 Average number of edges created by a node of age
Fig. 11.10 Log-likelihood of an edge selecting its source and destination node.
Arrows denote at highest likelihood
Fig. 11.11 Number of edges created to nodes hops away. counts
the number of edges that connected previously disconnected components
Fig. 11.12 Probability of linking to a random node at hops from source node.
Value at hops is for edges that connect previously disconnected
components
Fig. 11.13 Exponentially distributed node lifetimes
Fig. 11.14 Edge gap distribution for a node to obtain the second edge, , and
MLE power law with exponential cutoff fits
Fig. 11.15 Evolution of the and parameters with the current node degree
. remains constant, and linearly increases
Fig. 11.16 Number of nodes over time
Fig. 12.1 Top: a “3-chain” and its Kronecker product with itself; each of the
nodes gets expanded into nodes, which are then linked. Bottom: the
corresponding adjacency matrices, along with the matrix for the fourth
Kronecker power
Fig. 12.2 CIT-HEP-TH: Patterns from the real graph (top row), the deterministic
Kronecker graph with being a star graph on nodes (center satellites)
(middle row), and the Stochastic Kronecker graph ( , -
bottom row). Static patterns: a is the PDF of degrees in the graph (log-log scale),
and b the distribution of eigenvalues (log-log scale). Temporal patterns: c gives
the effective diameter over time (linear-linear scale), and d is the number of
edges versus number of nodes over time (log-log scale)
Fig. 12.3 AS-ROUTEVIEWS: Real (top) versus Kronecker (bottom). Columns a
and b show the degree distribution and the scree plot. Columns c shows the
distribution of network values (principal eigenvector components, sorted, versus
rank) and d shows the hop-plot (the number of reachable pairs within
hops or less, as a function of the number of hops
Fig. 12.4 Comparison of degree distributions for SKG and two noisy variations
Fig. 12.5 Schematic illustration of the multifractal graph generator. a The
construction of the link probability measure. Start from a symmetric generating
measure on the unit square defined by a set of probabilities associated
to rectangles (shown on the left). Here , the length of the
intervals defining the rectangles is given by and respectively, and the
magnitude of the probabilities is indicated by both the height and the colour of
the corresponding boxes. The generating measure is iterated by recursively
multiplying each box with the generating measure itself as shown in the middle
and on the right, yielding boxes at iteration . The variance of the
Fig. 12.6 A small network generated with the multifractal network generator. a
The generating measure (on the left) and the link probability measure (on the
right). The generating measure consists of rectangles for which the
magnitude of the associated probabilities is indicated by the colour. The number
of iterations, , is set to , thus the final link probability measure consists
of boxes, as shown in the right panel. b A network with nodes
generated from the link probability measure. The colours of the nodes were
chosen as follows. Each row in the final linking probability measure was
assigned a different colour, and the nodes were coloured according to their
position in the link probability measure. (Thus, nodes falling into the same row
have the same colour)
Fig. 13.1 Typical search engine architecture
Fig. 13.2 Counting in-links to pages for the query “newspapers”
Fig. 13.3 Finding good lists for the query “newspapers”: each page’s value as a
list is written as a number inside it
Fig. 13.4 Re-weighting votes for the query “newspapers”: each of the labelled
page’s new score is equal to the sum of the values of all lists that point to it
Fig. 13.5 Re-weighting votes after normalizing for the query “newspapers”
Fig. 13.6 Limiting hub and authority values for the query “newspapers”
Fig. 13.7 A collection of eight web pages
Fig. 13.8 Equilibrium PageRank values for the network in Fig. 13.7
Fig. 13.9 The same collection of eight pages, but and have changed their
links to point to each other instead of to . Without a smoothing effect, all the
PageRank would go to and
Fig. 13.10 High level Google architecture
Fig. 14.1 Undirected graph with four vertices and four edges. Vertices and
have a mutual contacts and , while and have mutual friend and
Fig. 14.2 Figure 14.1 with an edge between and , and and due to
triadic closure property
Fig. 14.3 Graph with a local bridge between and
Fig. 14.4 Each edge of the graph in Fig 14.3 is labelled either as a strong tie(S)
or a weak tie(W). The labelling in the figure satisfies the Strong Triadic Closure
property
Fig. 14.5 a Degree distribution. b Tie strength distribution. The blue line in a
and b correspond to , where x corresponds to
either k or w . The parameter values for the fits in (A) are , ,
, and for the fits in (B) are . c
Illustration of the overlap between two nodes, and , its value being shown
for four local network configurations. d In the real network, the overlap
(blue circles) increases as a function of cumulative tie strength ,
representing the fraction of links with tie strength smaller than w . The dyadic
hypothesis is tested by randomly permuting the weights, which removes the
coupling between and w (red squares). The overlap decreases
as a function of cumulative link betweenness centrality b (black diamonds)
Fig. 14.6 Each link represents mutual calls between the two users, and all nodes
are shown that are at distance less than six from the selected user, marked by a
circle in the center. a The real tie strengths, observed in the call logs, defined as
the aggregate call duration in minutes. b The dyadic hypothesis suggests that the
tie strength depends only on the relationship between the two individuals. To
illustrate the tie strength distribution in this case, we randomly permuted tie
strengths for the sample in a . c The weight of the links assigned on the basis of
their betweenness centrality values for the sample in as suggested by the
global efficiency principle. In this case, the links connecting communities have
high values (red), whereas the links within the communities have low
values (green)
Fig. 14.7 The control parameter f denotes the fraction of removed links. a and c
These graphs correspond to the case in which the links are removed on the basis
of their strengths ( removal). b and d These graphs correspond to the case in
which the links were removed on the basis of their overlap ( removal). The
black curves correspond to removing first the high-strength (or high ) links,
moving toward the weaker ones, whereas the red curves represent the opposite,
starting with the low-strength (or low ) ties and moving toward the stronger
ones. a and b The relative size of the largest component
indicates that the removal of the low or
links leads to a breakdown of the network, whereas the removal of the high
or links leads only to the network’s gradual shrinkage. a Inset Shown is
the blowup of the high region, indicating that when the low ties are
removed first, the red curve goes to zero at a finite f value. c and d According to
percolation theory, diverges for as we approach
the critical threshold , where the network falls apart. If we start link removal
from links with low ( c ) or ( d ) values, we observe a clear signature of
divergence. In contrast, if we start with high ( c ) or ( d ) links, there the
divergence is absent
Fig. 14.8 The dynamics of spreading on the weighted mobile call graph,
assuming that the probability for a node to pass on the information to its
neighbour in one time step is given by , with .
a The fraction of infected nodes as a function of time . The blue curve (circles)
corresponds to spreading on the network with the real tie strengths, whereas the
black curve (asterisks) represents the control simulation, in which all tie
strengths are considered equal. b Number of infected nodes as a function of time
for a single realization of the spreading process. Each steep part of the curve
corresponds to invading a small community. The flatter part indicates that the
spreading becomes trapped within the community. c and d Distribution of
strengths of the links responsible for the first infection for a node in the real
network ( c ) and control simulation ( d ). e and f Spreading in a small
neighbourhood in the simulation using the real weights (E) or the control case, in
which all weights are taken to be equal ( f ). The infection in all cases was
released from the node marked in red, and the empirically observed tie strength
is shown as the thickness of the arrows (right-hand scale). The simulation was
repeated 1,000 times; the size of the arrowheads is proportional to the number of
times that information was passed in the given direction, and the colour indicates
the total number of transmissions on that link (the numbers in the colour scale
refer to percentages of ). The contours are guides to the eye, illustrating
the difference in the information direction flow in the two simulations
Fig. 15.1 a Graph of the Zachary Karate Club network where nodes represent
members and edges indicate friendship between members. b Two-dimensional
visualization of node embeddings generated from this graph using the DeepWalk
method. The distances between nodes in the embedding space reflect proximity
in the original graph, and the node embeddings are spatially clustered according
to the different colour-coded communities
Fig. 15.2 Graph of the Les Misérables novel where nodes represent characters
and edges indicate interaction at some point in the novel between corresponding
characters. (Left) Global positioning of the nodes. Same colour indicates that the
nodes belong to the same community. (Right) Colour denotes structural
equivalence between nodes, i.e, they play the same roles in their local
neighbourhoods. Blue nodes are the articulation points. This equivalence where
generated using the node2vec algorithm
Fig. 15.3 Illustration of the neighbourhood aggregation methods. To generate the
embedding for a node, these methods first collect the node’s k-hop
neighbourhood. In the next step, these methods aggregate the attributes of node’s
neighbours, using neural network aggregators. This aggregated neighbourhood
information is used to generate an embedding, which is then fed to the decoder
List of Tables
Table 1.1 Table shows the vertices of the graph in Fig. 1.1 and their
corresponding degrees
Table 1.2 Table shows the vertices of the graph in Fig. 1.2 and their
corresponding in-degrees, out-degrees and degrees
Table 1.3 Adjacency list for the graph in Fig. 1.1
Table 1.4 Adjacency list for the graph in Fig. 1.2
Table 1.5 In and Out for all the vertices in Fig. 1.1
Table 1.6 In and Out for all the vertices in Fig. 1.2
Table 1.7 In and Out for all the vertices in Fig. 1.16
Table 1.8 In and Out for all the vertices in Fig. 1.17
Table 1.9 Clustering coefficient of vertices of Figs. 1.1 and 1.2
Table 2.1 Size of the largest surviving weak component when links to pages with
in-degree at least k are removed from the graph
Table 2.2 Sizes of the components forming the bowtie
Table 5.1 Countries with their codes
Table 7.1 Dataset statistics
Table 7.2 Dataset statistics
Table 7.3 Wikipedia statistics. number of votes, baseline
fraction of positive votes, number of users
Table 8.1 A-B coordination game payoff matrix
Table 8.2 Coordination game for node specific payoffs
Table 8.3 Payoff matrix
Table 8.4 Symmetric payoff matrix
Table 8.5 Symmetric payoff matrix parametrized by critical probability
Table 13.1 PageRank values of Fig. 13.7
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_1
Krishna Raj P. M.
Email: [email protected]
A graph can shed light on several properties of a complex system, but this depends on how faithfully the network represents that system. If the network does not capture all of the relevant features of the system, then the graph will likewise fail to describe all of the properties lying therein. Therefore, the choice of a proper network representation of the complex system is of prime importance.
A network must be chosen in such a way that it weighs every component of the system on its own merit. In some cases there is a unique, unambiguous representation, while in others the representation is by no means unique.
Fig. 1.1 An undirected graph comprising of 4 vertices A, B, C and D, and 5 edges, (A,B), (A,C), (A,D),
(B,C), (B,D)
The degree of vertex i in a directed graph is therefore defined as the sum of the in-degree of this vertex and its out-degree. This is represented as $k_i = k_i^{in} + k_i^{out}$.
Tables 1.1 and 1.2 tabulate the degrees of all the vertices in Figs. 1.1 and 1.2
respectively.
Table 1.1 Table shows the vertices of the graph in Fig. 1.1 and their corresponding degrees
Vertex Degree
A 3
B 3
C 2
D 2
Total 10
Table 1.2 Table shows the vertices of the graph in Fig. 1.2 and their corresponding in-degrees, out-degrees
and degrees
A vertex with zero in-degree is called a source vertex , a vertex with zero
out-degree is called a sink vertex , and a vertex with in-degree and out-degree
both equal to zero is called an isolated vertex .
The degree distribution, P(k), of a graph gives the fraction of vertices that have degree k:

$P(k) = \frac{N_k}{|V|} \qquad (1.1)$

where $N_k$ is the number of vertices with degree k. The degree distribution is plotted either as a histogram of k vs P(k) (Fig. 1.3) or as a scree plot (Fig. 1.4).
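As an illustration, the following minimal Python sketch (using NetworkX; the book's own examples rely on the SNAP library mentioned later in this chapter) computes the degree distribution and the average degree of the graph in Fig. 1.1.

```python
# A minimal sketch (using NetworkX) of the degree distribution P(k) = N_k / |V|
# and the average degree of the undirected graph in Fig. 1.1.
from collections import Counter
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")])

degrees = [deg for _, deg in G.degree()]
counts = Counter(degrees)                        # N_k: number of vertices of degree k
P = {k: n_k / G.number_of_nodes() for k, n_k in counts.items()}
avg_degree = sum(degrees) / G.number_of_nodes()  # equals 2|E|/|V| for undirected graphs

print(P)            # {3: 0.5, 2: 0.5}
print(avg_degree)   # 2.5
```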
The average degree of a graph is the average of the degrees of all the vertices in G. More formally, the average degree is given by Eq. 1.2.

$\bar{k} = \frac{1}{|V|} \sum_{i=1}^{|V|} k_i \qquad (1.2)$

The average degree of a graph satisfies the properties given in Eqs. 1.3 and 1.4.

$\bar{k} = \frac{2|E|}{|V|} \quad \text{(undirected graphs)} \qquad (1.3)$

$\bar{k}^{in} = \bar{k}^{out} = \frac{|E|}{|V|} \quad \text{(directed graphs)} \qquad (1.4)$

The average degree of the graph in Fig. 1.1, as well as that of the graph in Fig. 1.2, is 2.5. Since there are 4 vertices and 5 edges in both of these graphs, Eq. 1.3 is satisfied. The average in-degree of the graph in Fig. 1.2 is 1.25, and its average out-degree is also 1.25, so Eq. 1.4 is satisfied as well.
(1.5)
(1.6)
The adjacency matrix for the graph in Fig. 1.1 (rows and columns ordered A, B, C, D) is as shown in the matrix below.

$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$

Similarly, the adjacency matrix for the graph in Fig. 1.2, with $A_{ij} = 1$ when there is an edge from vertex i to vertex j, is described in the following matrix.

$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$
(1.8)
(1.9)
(1.10)
(1.11)
(1.12)
Undirected graphs satisfy Eqs. 1.7, 1.9 and 1.11 while directed graphs satisfy
Eqs. 1.7, 1.10 and 1.12.
A : [B, C, D]
B : [A, C, D]
C : [A, B]
D : [A, B]
A : [ ]
B : [A]
C : [A, B]
D : [A, B]
The adjacency lists for the graphs in Figs. 1.1 and 1.2 are as shown in Tables 1.3 and 1.4 respectively.
When the graphs are small, there is no notable difference between any of these representations. However, when the graph is large and sparse (as is the case for most real-world networks), the adjacency matrix is a large matrix filled mostly with zeros, which wastes a great deal of space and computation time.
Although edge lists are concise, they are not the best data structure for most
graph algorithms. When dealing with such graphs, adjacency lists are
comparatively more effective. Adjacency lists can be easily implemented in
most programming languages as a hash table with keys for the source vertices
and a vector of destination vertices as values. Working with this implementation
can save a lot of computation time. SNAP uses this hash table and vector
representation for storing graphs [1].
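As a minimal sketch of these three representations for the graph in Fig. 1.1, the following Python snippet builds the edge list, the adjacency list (a hash table of neighbour lists, conceptually similar to SNAP's hash-table-and-vector storage) and the adjacency matrix.

```python
# A minimal sketch of the representations from Sect. 1.8 for the graph in Fig. 1.1.
edge_list = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]

# Adjacency list: a hash table keyed by vertex, values are neighbour lists.
adjacency_list = {}
for u, v in edge_list:
    adjacency_list.setdefault(u, []).append(v)
    adjacency_list.setdefault(v, []).append(u)   # undirected: store both directions

# Adjacency matrix: mostly zeros for large sparse graphs, hence wasteful there.
vertices = sorted(adjacency_list)
index = {v: i for i, v in enumerate(vertices)}
adjacency_matrix = [[0] * len(vertices) for _ in vertices]
for u, v in edge_list:
    adjacency_matrix[index[u]][index[v]] = 1
    adjacency_matrix[index[v]][index[u]] = 1

print(adjacency_list)    # {'A': ['B', 'C', 'D'], 'B': ['A', 'C', 'D'], 'C': ['A', 'B'], 'D': ['A', 'B']}
```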
Fig. 1.10 An undirected weighted graph with 4 vertices and 5 weighted edges
Fig. 1.11 A directed weighted graph with 4 vertices and 5 weighted edges
The adjacency matrix for the graph in Fig. 1.10 is given below
The adjacency matrix for the graph in Fig. 1.11 is as depicted below
Undirected weighted graphs satisfy properties in Eqs. 1.7, 1.9 and 1.13.
(1.13)
Directed weighted graphs on the other hand satisfy properties in Eqs. 1.7, 1.10
and 1.14.
(1.14)
The adjacency matrix for the graph in Fig. 1.12 is as shown below
The adjacency matrix for the graph in Fig. 1.13 is illustrated as follows
Undirected self-looped graphs satisfy properties in Eqs. 1.8, 1.9 and 1.15.
(1.15)
Directed self-looped graphs, on the other hand, satisfy the properties in Eqs. 1.8, 1.10 and 1.16.
(1.16)
1.9.4 Multigraphs
A multigraph is a graph where multiple edges may share the same source and
destination vertices. Figures 1.14 and 1.15 are instances of undirected and
directed multigraphs respectively.
The adjacency matrix for the graph in Fig. 1.14 is as shown below
The adjacency matrix for the graph in Fig. 1.15 is illustrated as follows
Undirected multigraphs satisfy properties in Eqs. 1.7, 1.9 and 1.13. Directed
multigraphs on the other hand satisfy properties in Eqs. 1.7, 1.10 and 1.14.
Communication and collaboration networks are some common instances of
multigraphs.
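A minimal sketch (using NetworkX's MultiGraph class; the vertex names and calls are illustrative) of how parallel edges are kept in a multigraph:

```python
# A minimal sketch of a multigraph, where repeated edges between the same pair of
# vertices are kept, as in a communication network where two people call each other
# several times.
import networkx as nx

M = nx.MultiGraph()
M.add_edge("A", "B")     # first call between A and B
M.add_edge("A", "B")     # second call between the same pair
M.add_edge("B", "C")

print(M.number_of_edges())           # 3: parallel edges are counted separately
print(M.number_of_edges("A", "B"))   # 2
```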
1.10 Path
A path from a vertex u to a vertex v is defined either as a sequence of vertices in which each vertex is linked to the next, $\{u, v_1, v_2, \ldots, v_k, v\}$, or as a sequence of edges $\{(u, v_1), (v_1, v_2), \ldots, (v_k, v)\}$. A path can pass through the same edge
multiple times. A path that does not contain any repetition in either the edges or
the nodes is called a simple path . Following a sequence of such edges gives us a
walk through the graph from a vertex u to a vertex v. A path from u to v does not
necessarily imply a path from v to u.
The path {A,B,C} is a simple path in both Figs. 1.1 and 1.2. {A,B,C,B,D} is a
path in Fig. 1.1 while there are no non-simple paths in Fig. 1.2.
1.11 Cycle
A cycle is defined as a path with at least three edges in which the first and the last vertices are the same, but all other vertices are distinct. The paths {A,B,C,A} and {A,B,D,A} are cycles in Fig. 1.1.
Cycles sometimes exist in graphs by design. The reason behind this is redundancy: if any edge were to fail, there would still exist a path between any pair of vertices.
1.13 Distance
The distance between a pair of vertices u and v is defined as the number of edges
along the shortest path connecting u and v. If two nodes are not connected, the
distance is usually defined as infinity. Distance is symmetric in undirected
graphs and asymmetric in directed graphs.
The distance matrix, h, of a graph is the matrix in which each element $h_{ij}$ denotes the distance from vertex i to vertex j. More formally, the distance matrix is defined as given in Eq. 1.17.

$h_{ij} = \begin{cases} \text{length of the shortest path from } i \text{ to } j, & \text{if such a path exists} \\ \infty, & \text{otherwise} \end{cases} \qquad (1.17)$

The distance matrix for the undirected graph in Fig. 1.1 (rows and columns ordered A, B, C, D) is

$h = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 2 \\ 1 & 1 & 2 & 0 \end{pmatrix}$
The average path length of a graph is the average of the distances between all pairs of vertices in the graph. Formally, the average path length is given by

$\bar{h} = \frac{1}{|V|(|V|-1)} \sum_{i \neq j} h_{ij} \qquad (1.18)$
All infinite distances are disregarded in the computation of the average path
length.
1.15 Diameter
The diameter of a graph is the maximum of the distances between all pairs of
vertices in this graph. While computing the diameter of a graph, all infinite
distances are disregarded.
The diameters of the graphs in Figs. 1.1 and 1.2 are 2 and 1, respectively.
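A minimal sketch (using NetworkX) of computing distances, the average path length and the diameter of the graph in Fig. 1.1:

```python
# Distances, average path length and diameter of the graph in Fig. 1.1.
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")])

print(nx.shortest_path_length(G, "C", "D"))   # distance between C and D: 2
print(nx.average_shortest_path_length(G))     # average path length: 7/6
print(nx.diameter(G))                         # 2, as stated above
# Note: for disconnected graphs, infinite distances must be handled separately
# (NetworkX raises an error in that case).
```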
Fig. 1.16 A disconnected undirected graph with 4 vertices, A, B, C and D, and 2 edges, (A,B), (B,C)
leaving D as an isolated vertex
Fig. 1.17 A strongly connected directed graph with 4 vertices, A, B, C and D, and 5 edges, (A,B), (B,C),
(C,A), (C,D) and (D,B)
Giant Component
The largest component of a graph is called its giant component. It is a unique, distinguishable component that contains a significant fraction of all the vertices and dwarfs all the other components. {A, B, C} is the giant component in Fig. 1.16.
Bridge Edge
A bridge edge is an edge whose removal disconnects the graph. Every edge in Fig. 1.17 is a bridge edge in the sense that removing it destroys strong connectivity. Fig. 1.1 has no single bridge edge, but removing any of the edge sets {(A,B), (A,C), (A,D)}, {(A,C), (B,C)} or {(A,D), (B,D)} disconnects the graph.
Articulation Vertex
An articulation vertex is a vertex whose deletion renders the graph disconnected. In Fig. 1.16, vertex B is an articulation vertex, since deleting it separates A from C.
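A minimal sketch (using NetworkX) that finds the giant component, the bridge edges and the articulation vertices of the disconnected graph in Fig. 1.16:

```python
# Giant component, bridges and articulation points of the graph in Fig. 1.16.
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C")])
G.add_node("D")                               # D is an isolated vertex

giant = max(nx.connected_components(G), key=len)
print(giant)                                  # {'A', 'B', 'C'} (in some order)
print(list(nx.bridges(G)))                    # both edges, (A,B) and (B,C)
print(list(nx.articulation_points(G)))        # ['B']
```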
In and Out of a Vertex
For a graph G(V, E), we define the In and Out of a vertex as given in
Eqs. 1.19 and 1.20.
$In(v) = \{\, w \in V \mid \text{there exists a path from } w \text{ to } v \,\} \qquad (1.19)$

$Out(v) = \{\, w \in V \mid \text{there exists a path from } v \text{ to } w \,\} \qquad (1.20)$
In other words if Eq. 1.21 is satisfied by a directed graph, then this graph is
said to be a strongly connected directed graph. This means that a weakly
connected directed graph must satisfy Eq. 1.22.
(1.21)
(1.22)
Tables 1.5, 1.6, 1.7 and 1.8 tabulate the In and Out for all the vertices in each
of these graphs. From these tables, we observe that Fig. 1.17 is strongly
connected and Fig. 1.2 is weakly connected because they satisfy Eqs. 1.21 and
1.22 respectively.
Table 1.5 In and Out for all the vertices in Fig. 1.1
Vertex | In(v) | Out(v)
A | A, B, C, D | A, B, C, D
B | A, B, C, D | A, B, C, D
C | A, B, C, D | A, B, C, D
D | A, B, C, D | A, B, C, D
Table 1.6 In and Out for all the vertices in Fig. 1.2
Vertex | In(v) | Out(v)
A | — | B, C, D
B | A | C, D
C | A, B | —
D | A, B | —
Table 1.7 In and Out for all the vertices in Fig. 1.16
Vertex | In(v) | Out(v)
A | A, B, C | A, B, C
B | A, B, C | A, B, C
C | A, B, C | A, B, C
D | — | —
Table 1.8 In and Out for all the vertices in Fig. 1.17
Vertex | In(v) | Out(v)
A | A, B, C, D | A, B, C, D
B | A, B, C, D | A, B, C, D
C | A, B, C, D | A, B, C, D
D | A, B, C, D | A, B, C, D
Fig. 1.18 A weakly connected directed graph G with 5 vertices, A, B, C, D and E, with 6 edges, (A,B), (B,C), (B,D), (C,A), (E,A) and (E,C). This graph has the SCCs {E}, {A,B,C} and {D}
For the proof of this theorem, we will use the graph G in Fig. 1.18 and the graph G' in Fig. 1.19.
Fig. 1.19 A weakly connected graph G' whose vertices are the SCCs of G and whose edges exist in G' because there is an edge between the corresponding SCCs in G
Proof
1.
Let us assume there exists a vertex v which is a member of two SCCs, S = {A, v, B} and S' = {C, v, D}, as shown in Fig. 1.20. By the definition of an SCC, S ∪ S' then becomes one large SCC. Therefore, the SCCs partition the vertices of a graph.
2.
Assume that G' is not a DAG, i.e., there exists a directed cycle in G' (as depicted in Fig. 1.21). However, by the definitions of an SCC and a DAG, this cycle makes {A,B,C,D,E} one large SCC, rendering G' no longer a graph of edges between SCCs. Therefore, by contradiction, G' is a DAG on the SCCs of G.
Fig. 1.20 A strongly connected directed graph where vertex v belongs to two SCCs, {A, v, B} and {C, v, D}
Fig. 1.21 A graph with the same vertices and edges as shown in Fig. 1.19 with the exception that there
exists an edge between D and E to make the graph a strongly connected one
The clustering coefficient of a vertex i measures how close the neighbours of i are to forming a complete graph among themselves. It is given by Eq. 1.24.

$C_i = \frac{2 e_i}{k_i (k_i - 1)} \qquad (1.24)$

where $e_i$ is the number of edges between the $k_i$ neighbours of vertex i. The average clustering coefficient of a graph is the average of the clustering coefficients of all its vertices, as given in Eq. 1.25.

$\bar{C} = \frac{1}{|V|} \sum_{i \in V} C_i \qquad (1.25)$
The average clustering coefficients of the graphs in Figs. 1.1 and 1.2 are both
equal to .
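A minimal sketch (using NetworkX) of computing the clustering coefficient of every vertex of Fig. 1.1 and the graph's average clustering coefficient:

```python
# Clustering coefficients of the graph in Fig. 1.1.
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")])

print(nx.clustering(G))            # A and B: 2/3, C and D: 1.0
print(nx.average_clustering(G))    # 0.8333...
```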
Problems
Download the email-Eu-core directed network from the SNAP dataset repository
available at http://snap.stanford.edu/data/email-Eu-core.html.
For this dataset, compute the following network parameters (a sketch of one possible approach in Python follows the list):
1 Number of nodes
2 Number of edges
7 In-degree distribution
8 Out-degree distribution
12 Diameter
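A sketch of one possible approach (using NetworkX; the file name below is an assumption about where the downloaded edge list is stored):

```python
# Loading the email-Eu-core directed edge list and computing basic parameters.
from collections import Counter
import networkx as nx

G = nx.read_edgelist("email-Eu-core.txt", create_using=nx.DiGraph(), nodetype=int)

print(G.number_of_nodes())                         # number of nodes
print(G.number_of_edges())                         # number of edges

in_dist = Counter(d for _, d in G.in_degree())     # in-degree distribution
out_dist = Counter(d for _, d in G.out_degree())   # out-degree distribution
print(sorted(in_dist.items())[:5])
print(sorted(out_dist.items())[:5])

# Diameter: computed on the largest strongly connected component, since the
# full directed graph need not be connected.
largest_scc = max(nx.strongly_connected_components(G), key=len)
print(nx.diameter(G.subgraph(largest_scc)))
```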
References
1. Leskovec, Jure, and Rok Sosič. 2016. Snap: A general-purpose network analysis and graph-mining
library. ACM Transactions on Intelligent Systems and Technology (TIST) 8 (1): 1.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_2
Krishna Raj P. M.
Email: [email protected]
In this chapter, we will take a close look at the Web when it is represented in the
form of a graph and attempt to understand its structure. We will begin by looking at the reasons why we should be interested in the Web's structure in the first place. There are numerous reasons why the structure of the Web is worth studying, but the most prominent ones are as follows: the Web is a large system that evolved naturally over time, and understanding such a system's organization and properties could help us better comprehend several other real-world systems; the study could yield valuable insight into Web algorithms for crawling, searching and community discovery, which could in turn help improve strategies to accomplish these tasks; we could gain insight into the sociological phenomena characterising content creation and its evolution; the study could help predict the evolution of known or new web structures and lead to the development of better algorithms for discovering and organizing them; and we could predict the emergence of new phenomena in the Web.
Bowtie Structure of the Web
Reference [1] represented the Web as a directed graph where the webpages
are treated as vertices and hyperlinks are the edges and studied the following
properties in this graph: diameter, degree distribution, connected components
and macroscopic structure. However, the dark web (the part of the Web composed of webpages that are not directly accessible, even by Web browsers) was disregarded. The study consisted of performing web crawls on a snapshot of the graph consisting of 203 million URLs connected by 1.5 billion hyperlinks. The web crawl was based on a large set of starting points accumulated over time from various sources, including voluntary submissions. A 465 MHz server with 12 GB of memory was dedicated to this purpose.
Reference [1] took a large snapshot of the Web and using Theorem 1,
attempted to understand how its SCCs fitted together as a DAG.
Power Law
The power law is a functional relationship between two quantities where a relative change in one quantity results in a proportional relative change in the other, independent of the initial size of those quantities, i.e., one quantity varies as a power of the other; hence the name power law [3]. A power law distribution defined on the positive integers assigns to the value i a probability proportional to $1/i^k$ for a small positive integer k.
2.1 Algorithms
In this section we will look at several algorithms used by [1] in their
experiments.
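A minimal sketch of breadth first search over an adjacency-list representation; BFS runs of this kind were used in [1] to measure how much of the Web is reachable from a randomly chosen vertex (this sketch is only an illustration, not the implementation used in that study):

```python
# Breadth first search over an adjacency-list graph.
from collections import deque

def bfs(adjacency, source):
    """Return the set of vertices reachable from source."""
    visited = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return visited

adjacency = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
print(bfs(adjacency, "A"))   # {'A', 'B', 'C', 'D'}
```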
Three sets of experiments on these web crawls were performed from May 1999
to October 1999.
Fig. 2.1 In-degree distribution when only off-site edges are considered
Fig. 2.2 In-degree distribution over May and October 1999 crawls
Fig. 2.3 Outdegree distribution when only off-site edges are considered
Fig. 2.4 Outdegree distribution over May and October 1999 crawls
k | 1000 | 100 | 10 | 5 | 4 | 3
Size (millions) | 177 | 167 | 105 | 59 | 41 | 15
Table 2.1 gives us the following insight: the connectivity of the Web graph is extremely resilient and does not depend on the existence of vertices of high in-degree. Such vertices, which tend to include those with a high PageRank or those considered good hubs or authorities, are embedded in a graph that remains well connected without them.
To understand what the giant component is composed of, it was subjected to
the SCC algorithm. The algorithm returned a single large SCC consisting of 56
million vertices which barely amounts to 28% of all the vertices in the crawl.
This corresponds to all of the vertices that can reach one another along the
directed edges situated at the heart of the Web graph. The diameter of this component is at least 28. The distribution of the sizes of the SCCs also obeys a power law with an exponent of 2.5, as observed in Fig. 2.6.
Fig. 2.7 Cumulative distribution on the number of nodes reached when BFS is started from a random
vertex and follows in-links
Fig. 2.8 Cumulative distribution on the number of nodes reached when BFS is started from a random
vertex and follows out-links
Fig. 2.9 Cumulative distribution on the number of nodes reached when BFS is started from a random
vertex and follows both in-links and out-links
Zipf’s Law
Zipf’s law states that the frequency of occurrence of a certain value is
inversely proportional to its rank in the frequency table [3]. The in-degree distribution fits a Zipf distribution more closely than a pure power law, as is evident from Fig. 2.10.
Fig. 2.10 In-degree distribution plotted as power law and Zipf distribution
(2.1)
If the nodes of the graph are sorted in decreasing order of outdegree, then the rank exponent, $\mathcal{R}$, is defined to be the slope of the plot of the outdegrees of the nodes versus the rank of the nodes on a log-log scale.
The outdegree, $d_v$, of a node v is a function of the rank of this node, $r_v$, and the rank exponent, $\mathcal{R}$, as in Eq. 2.2.

$d_v \propto r_v^{\mathcal{R}} \qquad (2.2)$
(2.3)
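A minimal sketch of estimating the rank exponent: sort the vertices by decreasing outdegree and fit a straight line to log(outdegree) versus log(rank). The random graph below is only a stand-in for the Internet snapshots studied in [2]; the fitted slope is only meaningful for data that actually follows a power law.

```python
# Estimating the rank exponent as the slope of log(outdegree) vs. log(rank).
import numpy as np
import networkx as nx

G = nx.gnp_random_graph(1000, 0.01, directed=True, seed=42)   # stand-in data

out_degrees = np.array(sorted((d for _, d in G.out_degree()), reverse=True))
ranks = np.arange(1, len(out_degrees) + 1)
mask = out_degrees > 0                                # log is undefined for degree 0
slope, intercept = np.polyfit(np.log(ranks[mask]), np.log(out_degrees[mask]), 1)
print(slope)                                          # estimate of the rank exponent
```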
Fig. 2.12 Log-log plot of the outdegree as a function of the rank in the sequence of decreasing outdegree for Int-11-97
Fig. 2.13 Log-log plot of the outdegree as a function of the rank in the sequence of decreasing outdegree for Int-04-98
Fig. 2.14 Log-log plot of the outdegree as a function of the rank in the sequence of decreasing outdegree for Int-12-98
Fig. 2.15 Log-log plot of the outdegree as a function of the rank in the sequence of decreasing outdegree for Rout-95
(2.4)
Fig. 2.17 Log-log plot of frequency versus the outdegree d for Int-04-98
Fig. 2.18 Log-log plot of frequency versus the outdegree d for Int-12-98
Fig. 2.19 Log-log plot of frequency versus the outdegree d for Rout-95
Fig. 2.20 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for
Int-11-97
Fig. 2.21 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for
Int-04-98
Fig. 2.22 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for
Int-12-98
The total number of pairs of nodes, P(h), within h hops, is proportional to the number of hops raised to the power of a constant, $\mathcal{H}$ (Eq. 2.5).

$P(h) \propto h^{\mathcal{H}}, \quad h \ll \delta \qquad (2.5)$

where $\delta$ is the diameter of the graph.
(2.6)
(2.7)
(2.8)
(2.9)
Fig. 2.23 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for
Rout-95
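A minimal sketch of computing the hop plot P(h) by accumulating BFS distances from every vertex (here counting ordered pairs of distinct vertices; self-pairs are ignored in this sketch):

```python
# Hop plot P(h): number of pairs of nodes within h hops.
from collections import Counter
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")])

pair_counts = Counter()
for source in G:
    for _, dist in nx.single_source_shortest_path_length(G, source).items():
        if dist > 0:
            pair_counts[dist] += 1

P, total = {}, 0
for h in sorted(pair_counts):
    total += pair_counts[h]
    P[h] = total              # ordered pairs of nodes within h hops
print(P)                      # {1: 10, 2: 12} for this graph
```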
2.8 Eigen Exponent
The plot of the eigenvalue $\lambda_i$ as a function of i on a log-log scale for the first 20 eigenvalues in decreasing order (shown in Figs. 2.24, 2.25, 2.26 and 2.27) gives the following results:
The eigenvalues, $\lambda_i$, of a graph are proportional to the order, i, raised to the power of a constant, $\mathcal{E}$, as given in Eq. 2.10.

$\lambda_i \propto i^{\mathcal{E}} \qquad (2.10)$

The eigen exponent $\mathcal{E}$ is defined as the slope of the plot of the sorted eigenvalues as a function of their order on a log-log scale.
Fig. 2.24 Log-log plot of the eigenvalues in decreasing order for Int-11-97
Fig. 2.25 Log-log plot of the eigenvalues in decreasing order for Int-04-98
Fig. 2.26 Log-log plot of the eigenvalues in decreasing order for Int-12-98
Fig. 2.27 Log-log plot of the eigenvalues in decreasing order for Rout-95
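A minimal sketch of estimating the eigen exponent: compute the eigenvalues of the adjacency matrix, sort them in decreasing order, and fit the slope of the largest ones on a log-log scale. The random graph is again only a stand-in for real topology data.

```python
# Estimating the eigen exponent from the top eigenvalues of the adjacency matrix.
import numpy as np
import networkx as nx

G = nx.gnp_random_graph(200, 0.05, seed=1)            # stand-in data
A = nx.to_numpy_array(G)

eigenvalues = np.sort(np.linalg.eigvalsh(A))[::-1]    # decreasing order
top = eigenvalues[eigenvalues > 0][:20]               # first 20 (positive) eigenvalues
order = np.arange(1, len(top) + 1)
slope, _ = np.polyfit(np.log(order), np.log(top), 1)
print(slope)                                          # estimate of the eigen exponent
```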
Problems
Download the Epinions directed network from the SNAP dataset repository
available at http://snap.stanford.edu/data/soc-Epinions1.html.
For this dataset compute the structure of this social network using the same
methods as Broder et al. employed.
22 Compute the in-degree and outdegree distributions and plot the power law
for each of these distributions.
23 Choose 100 nodes at random from the network and do one forward and one
backward BFS traversal for each node. Plot the cumulative distributions of the
nodes covered in these BFS runs as shown in Fig. 2.7. Create one figure for the
forward BFS and one for the backward BFS. How many nodes are in the OUT
and IN components? How many nodes are in the TENDRILS component?
(Hint: The forward BFS plot gives the number of nodes in SCC OUT and
similarly, the backward BFS plot gives the number of nodes in SCC IN).
24 What is the probability that a path exists between two nodes chosen
uniformly from the graph? What if the node pairs are only drawn from the WCC
of the two networks? Compute the percentage of node pairs that were connected
in each of these cases.
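A minimal sketch of the forward and backward BFS runs required above, assuming networkx and that the Epinions edge list has been downloaded and uncompressed to soc-Epinions1.txt:

import random
import networkx as nx

G = nx.read_edgelist("soc-Epinions1.txt", create_using=nx.DiGraph, comments="#", nodetype=int)

nodes = list(G.nodes())
forward_counts, backward_counts = [], []
for source in random.sample(nodes, 100):
    # Forward BFS follows out-links (roughly SCC + OUT); backward BFS follows
    # in-links (roughly SCC + IN).  The +1 counts the starting node itself.
    forward_counts.append(len(nx.descendants(G, source)) + 1)
    backward_counts.append(len(nx.ancestors(G, source)) + 1)

# The sorted counts can be plotted as cumulative distributions (cf. Figs. 2.7-2.9).
print(sorted(forward_counts))
print(sorted(backward_counts))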
References
1. Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata,
Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the web. Computer Networks 33 (1–6):
309–320.
2. Faloutsos, Michalis, Petros Faloutsos, and Christos Faloutsos. 1999. On power-law relationships of the
internet topology. In ACM SIGCOMM computer communication review, vol. 29, 251–262, ACM.
3. Newman, Mark E.J. 2005. Power laws, pareto distributions and zipf’s law. Contemporary Physics 46
(5): 323–351.
4. Pansiot, Jean-Jacques, and Dominique Grad. 1998. On routes and multicast trees in the internet. ACM
SIGCOMM Computer Communication Review 28 (1): 41–50.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_3
Krishna Raj P. M.
Email: [email protected]
of these edges having the same probability of being chosen, let this edge be
denoted by . At time we choose one of the edges different
1.
: Graph with |V| isolated vertices.
2.
Phase 1, : Giant component appears.
3.
Phase 2, : The average degree is constant but many
isolated vertices still exist.
4.
Phase 3, : Fewer isolated vertices exist.
5.
Phase 4, : The graph contains no more isolated
vertices.
6.
Phase 5, : A complete graph is obtained.
The proofs for each of these phases can be found in [5].
There are two variants of this graph model:
G(n, p) denotes a graph on n vertices where each edge (u, v) appears
independently and identically distributed with probability p.
G(n, m) denotes a graph on n vertices with m edges picked uniformly at
random.
n, p and m do not uniquely determine the graphs. Since the graph is the result of
a random process, we can have several different realizations for the same values
of n, p and m. Figure 3.1 gives instances of undirected graphs generated using
this model and Fig. 3.2 gives examples of directed graphs generated using the
model. Each of these graphs has the same |V| and |E|.
Fig. 3.1 Figure shows three undirected graph variants of
Fig. 3.2 Figure shows three directed graph variants of
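A minimal sketch of generating such realizations with networkx; the values n = 10, p = 0.3 and m = 15 are illustrative assumptions, not the parameters actually used in the figures:

import networkx as nx

n = 10          # number of vertices (assumed)
p = 0.3         # edge probability for the G(n, p) variant (assumed)
m = 15          # number of edges for the G(n, m) variant (assumed)

# Three independent realizations of each variant: the same parameters give
# different graphs because the construction is a random process.
gnp_graphs = [nx.gnp_random_graph(n, p, seed=s) for s in range(3)]
gnm_graphs = [nx.gnm_random_graph(n, m, seed=s) for s in range(3)]

# Directed counterparts, as in Fig. 3.2.
gnp_directed = [nx.gnp_random_graph(n, p, seed=s, directed=True) for s in range(3)]

for H in gnp_graphs:
    print(H.number_of_nodes(), "vertices,", H.number_of_edges(), "edges")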
Consider a random graph having |V| possible vertices and |E| edges, chosen
uniformly at random from all the possible graphs which can be formed from the
|V| vertices v_1, v_2, ..., v_|V| by selecting |E| edges from the possible edges.
The effective number of vertices of this graph may be less than |V|, since some
vertices may not be connected with any other vertex. These are the isolated
vertices, and they are still considered as belonging to the graph.
The graph is called completely connected if it effectively contains all the vertices
v_1, v_2, ..., v_|V|.
Such a graph has the statistical properties stated in Theorems 2, 3, 4 and 5.
Theorem 2
If denotes the probability of being completely connected,
then we get Eq. 3.1
(3.1)
of x.
Theorem 3
If denotes the probability of the giant component of
consisting effectively of vertices, then we get Eq. 3.2
(3.2)
Theorem 4
If denotes the probability of consisting of exactly
disjoint components ( ), then we get Eq. 3.3
(3.3)
Theorem 5
If the edges of are chosen successively so that after each step, every edge
which has not yet been chosen has the same probability to be chosen as the next,
and if we continue this process until the graph becomes completely connected,
the probability that the number of necessary steps v will be equal to a number l is
given by Eq. 3.4
(3.4)
(3.5)
3.2.1 Properties
We will look at various graph properties of the G(n, p) variant of this graph model.
3.2.1.1 Edges
Let P(|E|) denote the probability that G(n, p) generates a graph on |E| edges.
P(|E|) = C(n(n − 1)/2, |E|) · p^|E| · (1 − p)^(n(n − 1)/2 − |E|) (3.6)
From Eq. 3.6, we observe that P(|E|) follows a binomial distribution with mean
p · n(n − 1)/2 and variance p(1 − p) · n(n − 1)/2.
3.2.1.2 Degree
Let the random variable D_v measure the degree of a vertex v. The probability that v has degree k is
P(D_v = k) = C(n − 1, k) · p^k · (1 − p)^(n − 1 − k) (3.8)
Equation 3.8 shows that the degree distribution follows a binomial distribution
with mean (n − 1)p and variance (n − 1)p(1 − p).
By the law of large numbers, as the network size increases, the distribution
becomes increasingly narrow, i.e., we are increasingly confident that the degree
of a vertex is in the vicinity of its mean.
(3.9)
(3.10)
as
The clustering coefficient of G(n, p) is small. As the graph gets bigger with fixed
average degree, C decreases with |V|.
3.2.1.5 Diameter
The path length is O(log|V|) and the diameter depends on α, where α denotes the
expansion of G(n, p).
A graph G(V, E) has expansion α if every subset S ⊆ V with |S| ≤ |V|/2 has at least α|S| edges leaving it, i.e.,
(3.11)
3.2.2 Drawbacks of G(n, p)
The exercises in this chapter will require computations on the real-world graph
and on the G(n, p) graph.
These computations will show that the degree distribution of G(n, p) greatly
differs from that of real-world networks. The giant component in most real-world
networks does not emerge through a phase transition, and the clustering
coefficient of G(n, p) is too low.
3.2.3 Advantages of G(n, p)
In spite of these drawbacks, G(n, p) acts as an extremely useful null model which
assists in the computation of quantities that can be compared with real data, and
helps us understand to what degree a particular property is the result of some
random process.
1.
None of the permutations have any fixed points.
2.
None of the permutations have any 2-cycles.
3.
No two permutations agree anywhere.
The probability that this graph is simple is given in Eq. 3.13.
(3.13)
The proof for Eqs. 3.12, 3.13 and other properties of graphs generated using
Bollobás or Permutation models can be found in [1].
where c is the number of cloning steps made and m is the number of surviving
graphs. The mean of any quantity X over a set of such graphs is
3.5.4 Comparison
Switching algorithm (Sect. 3.5.1):
– Samples the configurations uniformly, generating each graph an equal
number of times within measurement error on calculation.
– Produces results essentially identical to but faster than “go with the
winners” algorithm.
Matching algorithm (Sect. 3.5.2):
– Introduces a bias and undersamples the configurations.
– Faster than the other two methods but samples in a measurable biased way.
“Go with the winners” algorithm (Sect. 3.5.3):
– Samples the ensemble uniformly within permissible error, generating each
graph an equal number of times and producing statistically correct results.
– Far less computationally efficient than the other two methods.
From these points it is evident that any of these methods is adequate for
generating suitable random graphs to act as null models. The "go with the
winners" and the switching algorithms, while slower, are clearly more
satisfactory theoretically, but the matching algorithm gives better results on
real-world problems.
Reference [2] argues in favour of using the switching method, with the “go
with the winners” method finding limited use as a check on the accuracy of
sampling.
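A hedged sketch of the switching method using networkx's degree-preserving double edge swap; the scale-free stand-in graph and the choice of Q are illustrative assumptions:

import networkx as nx

# Degree-preserving switching: repeatedly pick two edges (a, b) and (c, d) and
# rewire them to (a, d) and (c, b), rejecting swaps that would create self-loops
# or parallel edges.  networkx implements this for undirected simple graphs.
G = nx.barabasi_albert_graph(1000, 3, seed=7)   # stand-in for a real-world graph
degrees_before = sorted(d for _, d in G.degree())

Q = 10  # rule of thumb: on the order of Q*|E| swaps to approach uniform sampling
nx.double_edge_swap(G, nswap=Q * G.number_of_edges(),
                    max_tries=100 * Q * G.number_of_edges(), seed=7)

assert degrees_before == sorted(d for _, d in G.degree())  # degrees are preserved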
Reference [3] shows how, using generating functions, one can calculate
exactly many of the statistical properties of random graphs generated from
prescribed degree sequences in the limit of a large number of vertices. Additionally,
the position at which the giant component forms, the size of the giant
component, the average and the distribution of the sizes of the other components,
the average number of vertices at a certain distance from a given vertex, the clustering
coefficient and the typical vertex-vertex distances are explained in detail.
Problems
Download the Astro Physics collaboration network from the SNAP dataset
repository available at http://snap.stanford.edu/data/ca-AstroPh.html. This co-
authorship network contains 18772 nodes and 198110 edges.
Generate the graph for this dataset (we will refer to this graph as the real
world graph).
For each of the real-world graph, the Erdős–Rényi graph and the Configuration model
graph, compute the following:
27 Degree distributions
31 For each of these distributions, state whether or not the random models
have the same property as the real world graph.
32 Are the random graph generators capable of generating graphs that are
representative of real world graphs?
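A minimal sketch of the exercise set-up, assuming networkx and that the co-authorship edge list has been uncompressed to ca-AstroPh.txt:

import networkx as nx

real = nx.read_edgelist("ca-AstroPh.txt", comments="#", nodetype=int)

# Erdős–Rényi graph with the same number of vertices and edges.
n, m = real.number_of_nodes(), real.number_of_edges()
er = nx.gnm_random_graph(n, m, seed=42)

# Configuration-model graph with the same degree sequence; parallel edges and
# self-loops are removed so the result is a simple graph.
config = nx.configuration_model([d for _, d in real.degree()], seed=42)
config = nx.Graph(config)
config.remove_edges_from(list(nx.selfloop_edges(config)))

for name, H in [("real", real), ("Erdős–Rényi", er), ("configuration", config)]:
    degs = [d for _, d in H.degree()]
    print(name, "average degree:", sum(degs) / len(degs))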
References
1. Ellis, David. 2011. The expansion of random regular graphs, Lecture notes. Lent.
2. Milo, Ron, Nadav Kashtan, Shalev Itzkovitz, Mark E.J. Newman, and Uri Alon. 2003. On the uniform
generation of random graphs with prescribed degree sequences. arXiv:cond-mat/0312028.
3. Newman, Mark E.J., Steven H. Strogatz, and Duncan J. Watts. 2001. Random graphs with arbitrary
degree distributions and their applications. Physical Review E 64(2):026118.
4. Erdős, Paul, and Alfréd Rényi. 1959. On random graphs I. Publicationes Mathematicae (Debrecen) 6:
290–297.
5. Erdős, Paul, and Alfréd Rényi. 1960. On the evolution of random graphs. Publications of the
Mathematical Institute of the Hungarian Academy of Sciences 5: 17–61.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_4
Krishna Raj P. M.
Email: [email protected]
We have all had the experience of encountering someone far from home, who
turns out to share a mutual acquaintance with us. Then comes the cliché, “My
it’s a small world”. Similarly, there exists a speculative idea that the distance
between vertices in a very large graph is surprisingly small, i.e, the vertices co-
exist in a “small world”. Hence the term small world phenomena.
Is it surprising that such a phenomenon can exist? If each person is
connected to 100 people, then 100 people can be reached at the first step,
10,000 at the second, 1,000,000 at the third, 100,000,000 at the fourth and
10,000,000,000 at the fifth. Taking into account the fact that there are only 7.6 billion
people in the world as of December 2017, six degrees of separation is not at
all surprising. However, many of the 100 people a person is connected to may be
connected to one another. Due to this overlap in connections, five steps between
every inhabitant of the Earth is a long shot.
Figure 4.2 depicts the distribution of the incomplete chains. The median of
this distribution is found to be 2.6. There are two probable reasons for this: (i)
participants are not motivated enough to participate in this study; (ii) participants
do not know to whom they must send the folder in order to advance the chain
towards the target.
Fig. 4.2 Frequency distribution of the intermediate chains
The authors conclude from this study that although each and every individual
is embedded in a small-world structure, not all acquaintances are equally
important. Some are more important than others in establishing contacts with
broader social realms because while some are relatively isolated, others possess
a wide circle of acquaintances.
What is of interest is that Milgram measured the average length of the
routing path on the social network, which is an upper bound on the average
distance (as the participants were not necessarily sending postcards to an
acquaintance on the shortest path to the target). These results prove that although
the world is small, the actors in this small world are unable to exploit this
smallness.
Although the idea of the six degrees of separation gained quick and wide
acceptance, there is empirical evidence which suggests that we actually live in a
world deeply divided by social barriers such as race and class.
Income Stratification
Reference [3], in a variation of the small world study probably sent to
Milgram for review, not only showed extremely low chain completion rates
(below ) but also suggested that people are actually separated by social
class. This study consisted of 151 volunteers from Crestline, Ohio, divided into
low-income, middle-income and high-income groups. These volunteers were to
try to reach a low-income, middle-income and high-income person in Los
Angeles, California. While the chain completion rate was too low to permit
statistical comparison of subgroups, no low-income sender was able to complete
chains to targets other than the low-income target groups. The middle and high
income groups did get messages through to some people in every other income
group. These patterns suggest a world divided by social class, with low-income
people disconnected.
Acquaintance Networks
Milgram’s study of acquaintance networks [11] between racial groups also
reveals not only a low rate of chain completion but also the importance of social
barriers. Caucasian volunteers in Los Angeles, solicited through mailing lists,
were asked to reach both Caucasian and African-American targets in New York.
Of the 270 chains started and directed towards African-American targets, only
got through compared to of 270 chains directed towards Caucasian
targets.
Social Stratification
Reference [13] investigated a single urbanized area in the Northeast. The
research purpose was to examine social stratification, particularly barriers
between Caucasians and African-Americans. Of the 596 packets sent to 298
volunteers, 375 packets were forwarded and 112 eventually reached the target—
a success rate of . The authors concluded that,
Urban Myth
Reference [10] speculates that the six degrees of separation may be an
academic equivalent of an urban myth. In Psychology Today, Milgram recalls
that the target in Cambridge received the folder on the fourth day since the start
of the study, with two intermediaries between the target and a Kansas wheat
farmer. However, an undated paper found in Milgram's archives, "Results of
Communication Project", reveals that 60 people had been recruited as the
starting population from a newspaper advertisement in Wichita. Just 3 of the 60
folders (5%) reached the lady in Cambridge, passing through an average of 8
people (9 degrees of separation). The stark contrast between the anecdote and
the paper calls the phenomenon into question.
Bias in Selection of Starting Population
Milgram’s starting population has several advantages: The starting
population had social advantages, they were far from a representative group. The
“Nebraska random” group and the stock holders from Nebraska were recruited
from a mailing list apt to contain names of high income people, and the “Boston
random” group were recruited from a newspaper advertisement. All of these
selected would have had an advantage in making social connections to the
stockbroker.
The Kansas advertisement was worded in a way to particularly attract
sociable people proud of their social skills and confident of their powers to reach
someone across class barriers, instead of soliciting a representative group to
participate. This could have been done to ensure that the participants by virtue of
their inherent social capital could increase the probability of the folder reaching
the target with the least amount of intermediaries.
Milgram recruited subjects for Nebraska and Los Angeles studies by buying
mailing lists. People with names worth selling (who would commonly occur in
mailing lists) are more likely to be high-income people, who are more connected
and thus more apt to get the folders through.
Reference [10] questions whether Milgram's low success rates result from
people not bothering to send the folder, or whether they reveal that the theory is
incorrect, or whether some people live in a small, small world where they can easily
reach people across boundaries while others do not. Further, the paper suggests
that research on the small world problem may follow a familiar pattern:
Despite all this, the experiment and the resulting phenomena have formed a
crucial aspect in our understanding of social networks. The conclusion has been
accepted in the broad sense: social networks tend to have very short paths
between essentially arbitrary pairs of people. The existence of these short paths
has substantial consequences for the potential speed with which information,
diseases, and other kinds of contagion can spread through society, as well as for
the potential access that the social network provides to opportunities and to
people with very different characteristics from one’s own.
This gives us the small world property: networks of size n have diameter
O(log n), meaning that between any two nodes there exists a path of length O(log n).
Fig. 4.3 Average per-step attrition rates (circles) and confidence interval (triangles)
Fig. 4.4 Histogram representing the number of chains that are completed in L steps
(4.1)
The study finds that successful chains are due to ties of intermediate to weak
strength. Success does not require highly connected hubs but instead
relies on professional relationships. Small variations in chain lengths and
participation rates generate large differences in target reachability. Although
global social networks are searchable, actual success depends on individual
incentives. Since individuals have only limited, local information about global
social networks, finding short paths represents a non-trivial search effort.
On the one hand, all the targets may be reachable from random initial seeders
in only a few steps, with surprisingly little variation across targets in different
countries and professions. On the other hand, small differences in either the
participation rates or the underlying chain lengths can have a dramatic impact on
the apparent reachability of the different targets. Therefore, the results suggest
that if the individuals searching for remote targets do not have sufficient
incentives to proceed, the small-world hypothesis will not appear to hold, but
that even the slightest increase in incentives can render social searches
successful under broad conditions. The authors add that,
4.7 Searchable
A graph is said to be searchable or navigable if its diameter is polylogarithmic in the number of vertices, i.e., short chains to a target can be found in a polylogarithmic number of steps.
The first strategy involved sending the message to an individual who is more
likely to know the target by virtue of the fact that he or she knows a lot of
people. Reference [2] has shown that this strategy is effective in networks with
a power-law degree distribution with exponent close to 2. However, as seen in Fig. 4.9,
this network does not have a power-law degree distribution but rather an exponential tail,
more like a Poisson distribution. Reference [2] also shows that this strategy performs
poorly in the case of a Poisson distribution. The simulation confirms this with a
median of 16 and an average of 43 steps required between two randomly chosen
vertices.
The second strategy consists of passing messages to the contact closest to the
target in the organizational hierarchy. In the simulation, the individuals are given
full knowledge of organizational hierarchy but they have no information about
contacts of individuals except their own. Here, the search is tracked via the
hierarchical distance (h-distance) between vertices. The h-distance is computed
as follows: individuals have h-distance 1 to their manager and everyone they
share a manager with. Then, the distances are recursively assigned, i.e, h-
distance of 2 to their first neighbour’s neighbour, 3 to their second neighbour’s
neighbour, and so on. This strategy seems to work pretty well as shown in
Figs. 4.10 and 4.11. The median number of steps was only 4, close to the median
shortest path of 3. The mean was 5 steps, slightly higher than the median
because of the presence of four hard-to-find individuals who had only a single
link. Excluding these 4 individuals as targets resulted in a mean of 4.5 steps.
This result indicates that not only are people typically easy to find, but nearly
everybody can be found in a reasonable number of steps.
Fig. 4.11 Probability of linking as a function of the size of the smallest organizational unit individuals
belong to
The last strategy was based on the target’s physical location. Individuals’
locations are given by their building, the floor of the building, and the nearest
building post to their cubicle. Figure 4.12 shows the email correspondence
mapped onto the physical layout of the buildings. The general tendency of
individuals in close physical proximity to correspond holds: over percent
of the 4000 emails are between individuals on the same floor. From Fig. 4.13,
we observe that geography could be used to find most individuals, but was
slower, taking a median number of 6 steps, and a mean of 12.
Fig. 4.12 Email communication mapped onto approximate physical location. Each block represents a floor
in a building. Blue lines represent far away contacts while red ones represent nearby ones
Fig. 4.13 Probability of two individuals corresponding by email as a function of the distance between their
cubicles. The inset shows how many people in total sit at a given distance from one another
The study concludes that individuals are able to successfully complete chains
in small world experiments using only local information. When individuals
belong to groups based on a hierarchy and are more likely to interact with
individuals within the same small group, then one can safely adopt a greedy
strategy—pass the message onto the individual most like the target, and they will
be more likely to know the target or someone closer to them.
Fig. 4.15 In each of 500,000 trials, a source s and target t are chosen randomly; at each step, the message
is forwarded from the current message-holder u to the friend v of u geographically closest to t. If no friend
of u is closer to t than u itself, then the chain is considered to have failed (in a second variant, u instead
picks a random person in the same city to pass the message to, and the chain continues). The fraction f(k)
of pairs in which the chain reaches the target's city in exactly k steps is shown
used to estimate the proportion of friendships in the LiveJournal network that are
formed by geographic and nongeographic processes. The probability of a
nongeographic friendship between u and v is , so on average u will have
nongeographic friends. An average person in the LiveJournal network has eight
friends, so of an average person’s eight friends ( of his or her
friends) are formed by geographic processes. This statistic is aggregated across
the entire network: no particular friendship can be tagged as geographic or
nongeographic by this analysis; friendship between distant people is simply
more likely (but not guaranteed) to be generated by the nongeographic process.
However, this analysis does estimate that about two-thirds of LiveJournal
friendships are geographic in nature. Figure 4.16 shows the plot of geographic
distance versus the geographic-friendship probability . The
plot shows that decreases smoothly as increases. This shows that any
model of friendship that is based solely on the distance between people is
insufficient to explain the geographic nature of friendships in the LiveJournal
network.
Fig. 4.16 a For each distance , the proportion P( ) of friendships among all pairs u, v of LiveJournal
users with is shown. The number of pairs u,v with is estimated by computing the
distance between 10, 000 randomly chosen pairs of people in the network. b The same data are plotted,
correcting for the background friendship probability: plot distance versus
(4.3)
Under this model, the probability of a link from u to v depends only on the
number of people within distance d(u, v) of u and not on the geographic distance
itself; thus the non-uniformity of LiveJournal population density fits naturally
into this framework.
The relationship between and the probability that u is a friend of v
shows an approximately inverse linear fit for ranks up to (as shown
in Fig. 4.17). An average person in the network lives in a city of population
1, 306. Thus in Fig. 4.17 the same data is shown where the probabilities are
averaged over a range of 1, 306 ranks.
Fig. 4.17 The relationship between friendship probability and rank. The probability P(r) of a link from a
randomly chosen source u to the rth closest node to u, i.e, the node v such that , in the
LiveJournal network, averaged over 10, 000 independent source samples. A link from u to one of the nodes
, where the people in are all tied for rank , is counted as a (
) fraction of a link for each of these ranks. As before, the value of represents the background
probability of a friendship independent of geography. a Data for every 20th rank are shown. b The data are
averaged into buckets of size 1, 306: for each displayed rank r, the average probability of a friendship over
ranks is shown. c and b The same data are replotted (unaveraged and averaged,
respectively), correcting for the background friendship probability: we plot the rank r versus
Fig. 4.18 Depiction of a regular network proceeding first to a small world network and next to a random
network as the randomness, p increases
to . These are node u’s long-range contacts. When r is very small, the
long-range edges are too random to facilitate decentralized search (as observed
in Sect. 4.10.1); when r is too large, the long-range edges are not random enough
to provide the jumps necessary for the small-world phenomenon to be
exhibited.
This model can be interpreted as follows: an individual lives in a grid and
knows her neighbours in all directions for a certain number of steps, and the
number of her acquaintances progressively decreases as we go further away from
her.
The model gives the bounds for a decentralized algorithm in Theorems 6, 7
and 8.
When the long-range contacts are formed independent of the geometry of the
grid (as is the case in Sect. 4.10.1), short chains will exist but the nodes will be
unable to find them. When long-range contacts are formed by a process related
to the geometry of the grid in a specific way, however, then the short chains will
still form and the nodes operating with local knowledge will be able to construct
them.
Theorem 8
1.
Let . There is a constant , depending on p, q, r but independent
of n, so that the expected delivery time of the decentralized algorithm is
at least .
2.
Let . There is a constant , depending on p, q, r but independent of
n, so that the expected delivery time of any decentralized algorithm is at least
.
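The following hedged sketch illustrates decentralized greedy search on a one-dimensional analogue of this model: a ring of n nodes where each node keeps its two ring neighbours and one long-range contact drawn with probability proportional to d^(−r). On a ring the searchable exponent is r = 1 (rather than r = 2 for the two-dimensional grid of the theorems), and all parameter values below are illustrative:

import random

def ring_distance(u, v, n):
    # Lattice distance on a ring of n nodes.
    return min(abs(u - v), n - abs(u - v))

def build_long_range_contacts(n, r):
    # Each node u gets one long-range contact, chosen with probability
    # proportional to ring_distance(u, v) ** (-r).
    contacts = {}
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [ring_distance(u, v, n) ** (-r) for v in others]
        contacts[u] = random.choices(others, weights=weights, k=1)[0]
    return contacts

def greedy_search(n, contacts, source, target):
    # Forward the message to whichever neighbour is closest to the target.
    steps, u = 0, source
    while u != target:
        candidates = [(u - 1) % n, (u + 1) % n, contacts[u]]
        u = min(candidates, key=lambda v: ring_distance(v, target, n))
        steps += 1
    return steps

n, r = 1000, 1.0
contacts = build_long_range_contacts(n, r)
lengths = [greedy_search(n, contacts, random.randrange(n), random.randrange(n))
           for _ in range(200)]
print("average greedy path length:", sum(lengths) / len(lengths))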
choosing endpoint w each time independently and with repetitions allowed. This
results in a graph G on the set V. Here the out-degree is for constant
A decentralized algorithm has knowledge of the tree T, and knows the location
of a target leaf that it must reach; however, it only learns the structure of G as it
visits nodes. The exponent determines how the structures of G and T are
related.
Theorem 9
1.
There is a hierarchical model with exponent and polylogarithmic out-
degree in which a decentralized algorithm can achieve a search time of O(log n).
2.
For every , there is no hierarchical model with exponent and
polylogarithmic out-degree in which a decentralized algorithm can achieve
polylogarithmic search time.
1.
If R is a group of size containing a node v, then there is a group
containing v that is strictly smaller than R, but has size at least .
2.
If are groups that all have sizes at most q and all contain a
common node v, then their union has size at most .
Theorem 10
1.
For every group structure, there is a group-induced model with exponent
and polylogarithmic out-degree in which a decentralized algorithm can
achieve a search time of O(log n).
2.
For every , there is no group-induced model with exponent and
polylogarithmic out-degree in which a decentralized algorithm can achieve
polylogarithmic search time.
Theorem 11
1.
There is a hierarchical model with exponent , constant out-degree and
polylogarithmic resolution in which a decentralized algorithm can achieve
polylogarithmic search time.
2.
For every , there is no hierarchical model with exponent , constant
out-degree and polylogarithmic resolution in which a decentralized
algorithm can achieve polylogarithmic search time.
1.
Choose and uniformly from V.
2.
If , do a greedy walk in from to . Let
denote the points of this walk.
3.
For each with at least one shortcut, independently with
probability p replace a randomly chosen shortcut with one to .
vertex with its neighbours. For this purpose, the lower the value of p the
better, but very small values of p will lead to slower sampling.
Each application of this algorithm defines a transition of a Markov chain on the
set of shortcut configurations. Thus for any n, the Markov chain is defined on a
finite state space. Since this chain is irreducible and aperiodic, the chain
converges to a unique stationary distribution.
Problems
Download the General Relativity and Quantum Cosmology collaboration
network available at https://snap.stanford.edu/data/ca-GrQc.txt.gz.
For the graph corresponding to this dataset (which will be referred to as real
world graph), generate a small world graph and compute the following network
parameters:
33 Degree distribution
34 Short path length distribution
37 For each of these distributions, state whether or not the small world model
has the same property as the real world graph
38 Is the small world graph generator capable of generating graphs that are
representative of real world graphs?
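A minimal sketch of the exercise set-up, assuming networkx, that the edge list has been uncompressed to ca-GrQc.txt, and an arbitrary rewiring probability p = 0.1 for the small-world generator:

import networkx as nx
from collections import Counter

real = nx.read_edgelist("ca-GrQc.txt", comments="#", nodetype=int)

# Small-world (Watts-Strogatz) graph with a comparable size and average degree.
n = real.number_of_nodes()
avg_degree = 2 * real.number_of_edges() / n
k = max(2, int(round(avg_degree)))          # each node joined to ~k ring neighbours
small_world = nx.watts_strogatz_graph(n, k, p=0.1, seed=3)

def degree_distribution(G):
    return Counter(d for _, d in G.degree())

print("real:", sorted(degree_distribution(real).items())[:10])
print("small world:", sorted(degree_distribution(small_world).items())[:10])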
References
1. Adamic, Lada, and Eytan Adar. 2005. How to search a social network. Social Networks 27 (3): 187–
203.
2. Adamic, Lada A., Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman. 2001. Search in
power-law networks. Physical Review E 64 (4): 046135.
3. Beck, M., and P. Cadamagnani. 1968. The extent of intra-and inter-social group contact in the
American society. Unpublished manuscript, Stanley Milgram Papers, Manuscripts and Archives, Yale
University.
4. Dodds, Peter Sheridan, Roby Muhamad, and Duncan J. Watts. 2003. An experimental study of search
in global social networks. Science 301 (5634): 827–829.
5. Horvitz, Eric, and Jure Leskovec. 2007. Planetary-scale views on an instant-messaging network.
Redmond-USA: Microsoft research Technical report.
6. Killworth, Peter D., and H. Russell Bernard. 1978. The reversal small-world experiment. Social
Networks 1 (2): 159–192.
7. Killworth, Peter D., Christopher McCarty, H. Russell Bernard, and Mark House. 2006. The accuracy of
small world chains in social networks. Social Networks 28 (1): 85–96.
8. Kleinberg, Jon. 2000. The small-world phenomenon: An algorithmic perspective. In Proceedings of the
thirty-second annual ACM symposium on theory of computing, 163–170. ACM.
9. Kleinberg, Jon M. 2002. Small-world phenomena and the dynamics of information. In Advances in
neural information processing systems, 431–438.
10. Kleinfeld, Judith. 2002. Could it be a big world after all? The six degrees of separation myth. Society
12: 5–2.
11. Korte, Charles, and Stanley Milgram. 1970. Acquaintance networks between racial groups: Application
of the small world method. Journal of Personality and Social Psychology 15 (2): 101.
12. Liben-Nowell, David, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. 2005.
Geographic routing in social networks. Proceedings of the National Academy of Sciences of the United
States of America 102 (33): 11623–11628.
13. Lin, Nan, Paul Dayton, and Peter Greenwald. 1977. The urban communication network and social
stratification: A small world experiment. Annals of the International Communication Association 1 (1):
107–119.
14. Sandberg, Oskar, and Ian Clarke. 2006. The evolution of navigable small-world networks. arXiv:cs/
0607025.
15. Travers, Jeffrey, and Stanley Milgram. 1967. The small world problem. Psychology Today 1 (1): 61–
67.
16. Travers, Jeffrey, and Stanley Milgram. 1977. An experimental study of the small world problem. Social
Networks, 179–197. Elsevier.
17. West, Robert, and Jure Leskovec. 2012. Automatic versus human navigation in information networks.
ICWSM.
18. West, Robert, and Jure Leskovec. 2012. Human wayfinding in information networks. In Proceedings of
the 21st international conference on World Wide Web, 619–628. ACM.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_5
Krishna Raj P. M.
Email: [email protected]
Reference [2] explains the study of the social graph of the active users of the
world’s largest online social network, Facebook . The study mainly focused on
computing the number of users and their friendships, degree distribution, path
length, clustering and various mixing patterns. All calculations concerning this
study were performed on a Hadoop cluster with 2,250 machines using the
Hadoop/Hive data analysis framework developed at Facebook. This social
network is seen to display a broad range of unifying structural properties such as
homophily, clustering, small-world effect, heterogeneous distribution of friends
and community structure.
The graph of the entire social network of the active members of Facebook as
of May 2011 is analysed in an anonymized form and the focus is placed on the
set of active user accounts reliably corresponding to people. A user of Facebook
is deemed an active member if they logged into the site in the 28 days preceding
the time of measurement in May 2011 and had at least one Facebook friend.
The restriction to active users allows us to eliminate accounts that
have been abandoned in the early stages of creation, and to focus on accounts that
plausibly represent actual individuals. This graph precedes the existence of
"subscriptions" and does not include "pages" that people may "like". According
to this definition, the population of active Facebook users is 721 million at the
time of measurement. The world’s population at the time was 6.9 billion people
which means that this graph includes roughly 10% of the Earth's inhabitants.
There were 68.7 billion friendships in this graph, so the average Facebook user
had around 190 Facebook friends.
The study also focuses on the subgraph of the 149 million US Facebook users. The
US Census Bureau for 2011 shows roughly 260 million individuals in the US
over the age of 13 and therefore eligible to create a Facebook account. Therefore
this social network includes more than half the eligible US population. This
graph had 15.9 billion edges, so an average US user had 214 other US users as
friends. Note that this average is higher than that of the global graph.
The neighbourhood function of a graph G, denoted N(t), returns for each t
the number of pairs of vertices (x, y) such that x has a path of length at
most t to y. It provides data about how fast the "average ball" around each
vertex expands, and measures what percentage of vertex pairs are within a given
distance. Although the diameter of a graph can be wildly distorted by the
presence of a single ill-connected path in some peripheral region of the graph,
the neighbourhood function and the average path length are thought to robustly
capture the distances between pairs of vertices. From this function, it is possible
to derive the distance distribution which gives for each t, the fraction of
reachable pairs at a distance of exactly t.
neighbourhood function in t.
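A hedged sketch of approximating the neighbourhood function by sampling BFS sources with networkx; exact computation at Facebook scale requires probabilistic counters (e.g. HyperANF), which are not shown here:

import random
import networkx as nx

def approximate_neighbourhood_function(G, num_sources=100, seed=0):
    # BFS from a random sample of sources, accumulating how many (sampled)
    # pairs lie within each distance t.
    rng = random.Random(seed)
    sources = rng.sample(list(G.nodes()), num_sources)
    counts = {}
    for s in sources:
        for _, dist in nx.single_source_shortest_path_length(G, s).items():
            counts[dist] = counts.get(dist, 0) + 1
    cumulative, total = {}, 0
    for t in sorted(counts):
        total += counts[t]
        cumulative[t] = total      # pairs within distance t (including t = 0 self-pairs)
    return cumulative

G = nx.watts_strogatz_graph(2000, 10, 0.1, seed=0)   # stand-in graph
print(approximate_neighbourhood_function(G))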
5.3 Spid
Spid, an acronym for "shortest-paths index of dispersion", measures the dispersion
of the distance distribution: it is defined as the variance-to-mean ratio of that
distribution. It is sometimes referred to as the webbiness of a social
network. Networks with spid greater than one should be considered web-like,
whereas networks with spid less than one should be considered properly social.
The intuition behind this measure is that proper social networks strongly
favour short connections, whereas in the web long connections are not
uncommon. The correlation between spid and average distance is inverse, i.e.,
the larger the average distance, the smaller the spid.
The spid of the Facebook graph is 0.09 thereby confirming that it is a proper
social network.
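A minimal sketch of estimating the spid from sampled shortest-path distances, assuming networkx and using a small synthetic graph as a stand-in:

import random
import networkx as nx

def estimate_spid(G, num_sources=100, seed=0):
    # Spid = variance-to-mean ratio of the distance distribution, estimated from
    # shortest-path distances out of a sample of source vertices.
    rng = random.Random(seed)
    sources = rng.sample(list(G.nodes()), num_sources)
    distances = []
    for s in sources:
        lengths = nx.single_source_shortest_path_length(G, s)
        distances.extend(d for d in lengths.values() if d > 0)
    mean = sum(distances) / len(distances)
    variance = sum((d - mean) ** 2 for d in distances) / len(distances)
    return variance / mean

G = nx.watts_strogatz_graph(2000, 10, 0.1, seed=0)   # stand-in graph
print("spid ≈", estimate_spid(G))   # < 1 suggests a "properly social" network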
Fig. 5.1 Degree distribution of the global and US Facebook active users, alongside its CCDF
From Fig. 5.1, we observe that most individuals have a moderate number of
friends, fewer than 200, while a much smaller population have many hundreds or
even thousands of friends. There is a small population of users whose degree is
abnormally high compared to the average user. The distribution is clearly right-
skewed with high variance, and there is substantial curvature in the
distribution on the log-log scale. This curvature is somewhat surprising because
empirical measurements of networks have claimed that degree distributions
follow a power law. Thus, strict power laws are inappropriate for this degree
distribution.
Fig. 5.2 Neighbourhood function showing the percentage of users that are within h hops of one another
Fig. 5.4 Clustering coefficient and degeneracy as a function of degree on log–log scale
Fig. 5.5 Average number of unique and non-unique friends-of-friends as a function of degree
5.8 Friends-of-Friends
The friends-of-friends, as the name suggests, denotes the number of users that
are within two hops of an initial user. Figure 5.5 computes the average counts of
both the unique and non-unique friends-of-friends as a function of the degree.
The non-unique friends-of-friends count corresponds to the number of length 2
paths starting at an initial vertex and not returning to that vertex. The unique
friends-of-friends count corresponds to the number of unique vertices that can be
reached at the end of a length 2 path.
A naive approach to counting friends-of-friends would assume that a user
with k friends has roughly k² non-unique friends-of-friends, assuming that their
friends have roughly the same friend count as them. The same principle could
also apply to estimating the number of unique friends-of-friends. However, the
number of unique friends-of-friends grows very close to linearly, and the number
of non-unique friends-of-friends grows only moderately faster than linearly. While
the growth rate may be slower than expected, as Fig. 5.5 illustrates, until a user
has more than 800 friends the absolute amounts are unexpectedly large: a user
with 100 friends has 27500 unique friends-of-friends and 40300 non-unique
friends-of-friends. This is significantly more than the roughly 10,000 non-unique
friends-of-friends we would have expected if our friends had roughly the same
number of friends as us.
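A minimal sketch of counting unique and non-unique friends-of-friends for a single vertex with networkx; the scale-free stand-in graph is an illustrative assumption:

import networkx as nx

def friends_of_friends(G, v):
    # Non-unique counts length-2 paths that do not return to v; unique counts
    # the distinct endpoints of such paths.
    non_unique, unique = 0, set()
    for friend in G[v]:
        for fof in G[friend]:
            if fof != v:
                non_unique += 1
                unique.add(fof)
    return non_unique, len(unique)

G = nx.barabasi_albert_graph(5000, 5, seed=2)   # stand-in graph
v = max(G.nodes(), key=G.degree)                # a high-degree example vertex
print("degree:", G.degree(v), "friends-of-friends:", friends_of_friends(G, v))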
Fig. 5.6 Average neighbour degree as a function of an individual’s degree, and the conditional probability
that a randomly chosen neighbour of an individual with degree k has degree
5.11.1 Age
To understand the friendship patterns among individuals with different ages, we
compute the probability that a random neighbour of an individual with age t
has a given age. A random neighbour means that each edge connected to a user with
age t is given equal probability of being followed. Figure 5.8 shows that the
resulting distribution is not merely a function of the magnitude of the age difference,
as might naively be expected, but is instead asymmetric about its
maximum value. Unsurprisingly, a random neighbour is most likely to
be the same age as you. Younger individuals have most of their friends within a
small age range while older individuals have a much wider range.
Fig. 5.8 Distribution of ages for the neighbours of users with age t
5.11.2 Gender
By computing , we get the probability that a random neighbour of
individuals with gender g has gender where M denotes male and F denotes
female. The Facebook graph gives us the following probabilities,
, , and
. By these computations, a random neighbour is more likely
to be a female. There are roughly 30 million fewer active female users on
Facebook with average female degree (198) larger than the average male degree
(172) with and . Therefore, we have
and . However, the
difference between these probabilities is extremely small, indicating only a
minimal preference for same-gender friendships on Facebook.
5.11.3 Country of Origin
The obvious expectation is that an individual will have more friends from the
same country of origin than from outside that country, and data shows that
of edges are within countries. So, the network divides fairly cleanly
along country lines into network clusters or communities. This division can be
quantified using modularity, denoted by Q, which is the fraction of edges within
communities minus the expected fraction of such edges in a randomized version of
the network that preserves the degree of each individual but is otherwise random.
The computed value of Q is quite large and indicates a strongly modular network
structure at the scale of countries. Figure 5.9 visualizes this structure as
a heatmap of the number of edges between the 54 countries whose active
Facebook user population exceeds one million users and is more than of the
internet-enabled population. The results show an intuitive grouping according to
geography. However, other groupings not based on geography include the
combination of the UK, Ghana and South Africa, which reflects links based on
strong historical ties.
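A hedged sketch of computing the modularity Q of a community partition with networkx; the planted-partition graph below is only a toy stand-in for the country-level communities:

import networkx as nx
from networkx.algorithms.community import modularity

# Toy graph with four planted communities standing in for countries.
G = nx.planted_partition_graph(4, 50, 0.2, 0.01, seed=5)
communities = G.graph["partition"]     # list of node sets, one per planted group

Q = modularity(G, communities)
print("modularity Q ≈", Q)   # values well above 0 indicate strong community structure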
Fig. 5.9 Normalized country adjacency matrix as a heatmap on a log scale. Normalized by dividing each
element of the adjacency matrix by the product of the row country degree and column country degree
Country Code
Indonesia ID
Philipines PH
Sri Lanka LK
Australia AU
New Zealand NZ
Thailand TH
Malaysia MY
Singapore SG
Hong Kong HK
Taiwan TW
United States US
Dominican Republic DO
Puerto Rico PR
Mexico MX
Canada CA
Venezuela VE
Chile CL
Argentina AR
Uruguay UY
Colombia CO
Costa Rica CR
Guatemala GT
Ecuador EC
Peru PE
Bolivia BO
Spain ES
Ghana GH
United Kingdom GB
South Africa ZA
Israel IL
Jordan JO
United Arab Emirates AE
Kuwait KW
Algeria DZ
Tunisia TN
Italy IT
Macedonia MK
Albania AL
Serbia RS
Slovenia SI
Bosnia and Herzegovina BA
Croatia HR
Turkey TR
Portugal PT
Belgium BE
France FR
Hungary HU
Ireland IE
Denmark DK
Norway NO
Sweden SE
Czech Republic CZ
Bulgaria BG
Greece GR
Appendix
Problems
Download the Friendster undirected social network data available at https://
snap.stanford.edu/data/com-Friendster.html.
This network consists of roughly 65 million nodes and 1.8 billion edges. The
world's population in 2012 was 7 billion people. This means that the network
includes roughly 1% of the world's inhabitants.
For this graph, compute the following network parameters:
39
Degree distribution
40 Path length distribution
References
1. Backstrom, Lars, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna. 2012. Four degrees
of separation. In Proceedings of the 4th Annual ACM Web Science Conference, 33–42. ACM.
2. Ugander, Johan, Brian Karrer, Lars Backstrom, and Cameron Marlow. 2011. The anatomy of the
facebook social graph. arXiv:1111.4503.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_6
6. Peer-To-Peer Networks
Krishna Raj P. M.1 , Ankith Mohan1 and K. G. Srinivasa2
Krishna Raj P. M.
Email: [email protected]
6.1 Chord
Chord [3] is a distributed lookup protocol that addresses the problem of
efficiently locating the node that stores a particular data item in a structured P2P
application. Given a key, it maps the key onto a node. A key is associated with
each data item and key/data item pair is stored at the node to which the key
maps. Chord is designed to adapt efficiently as nodes join and leave the network
dynamically.
The Chord software takes the form of a library to be linked with the client
and server applications that use it. The application interacts with Chord in two
ways: First, Chord provides a lookup algorithm that yields the IP address of the node
responsible for a given key. Next, the Chord software on each node notifies the
application of changes in the set of keys that the node is responsible for. This
allows the application to move the corresponding values to their new homes when
a new node joins.
Chord uses consistent hashing. In this algorithm, each node and key has an
m-bit identifier. One has to ensure that m is large enough to make the probability
of two nodes or keys hashing to the same identifier negligible. The keys are
assigned to nodes as follows: The identifiers are ordered in an identifier circle
modulo . Key k is assigned to the first node whose identifier is equal to or
follows k. This node is called the successor node of key k, denoted by
successor(k). If the identifiers are represented as a circle of numbers from 0 to
, then successor(k) is the first node clockwise from k. This tends to balance
the load, since each node receives roughly the same number of keys, and requires
relatively little movement of keys when nodes join and leave the system. In an N-
node system, each node maintains information of O(logN) other nodes and
resolves all lookups via O(logN) messages to the other nodes. To maintain
consistent hashing mapping when a node n joins the network, certain keys
previously assigned to n’s successor now become assigned to n. When n leaves
the network, all of its assigned keys are reassigned to n’s successor.
Each node n maintains a routing table with at most m entries, called the
finger table. The ith entry in the table at node n contains the identity of the first
node s that succeeds n by at least 2^(i−1) on the identifier circle, i.e.,
s = successor(n + 2^(i−1)), where 1 ≤ i ≤ m (and all arithmetic is modulo 2^m).
Node s is the ith finger of node n. This finger table scheme is designed for two
purposes: First, each node stores information about only a small number of other
nodes and knows more about nodes closely following it than about nodes far away.
Next, a node's finger table generally does not contain enough information to
determine the successor of an arbitrary key k.
If n does not know the successor of key k, then it finds the node whose ID is
closer than its own to k. That node will know more about the identifier circle in
the region of k than n does. Thus, n searches its finger table for the node j whose ID
immediately precedes k, and asks j for the node it knows whose ID is closest to
k. By repeating this process, n learns about nodes with IDs closer to k.
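A minimal sketch of the consistent-hashing successor rule, with an identifier space, hash function and node names that are illustrative assumptions rather than Chord's actual parameters:

import hashlib
from bisect import bisect_left

M = 16                      # identifier bits (illustrative; real deployments use more)

def chord_id(name: str) -> int:
    # Map a name to a point on the identifier circle modulo 2^M.
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % (2 ** M)

node_ids = sorted(chord_id(f"node-{i}") for i in range(8))

def successor(key: str) -> int:
    # A key is assigned to the first node whose identifier equals or follows it.
    k = chord_id(key)
    i = bisect_left(node_ids, k)
    return node_ids[i % len(node_ids)]   # wrap around the identifier circle

print("key 'cat.txt' ->", chord_id("cat.txt"), "stored at node", successor("cat.txt"))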
In a dynamic network where the nodes can join and leave at any time, to
preserve the ability to locate every key in the network, each node’s successor
must be correctly maintained. For fast lookups, the finger tables must be correct.
To simplify the joining and leaving mechanisms, each node maintains a
predecessor pointer. A node's predecessor pointer contains the Chord identifier and
IP address of the immediate predecessor of this node, and can be used to walk
counter-clockwise around the identifier circle. To preserve this, Chord performs
the following tasks when a node n joins the network: First, it initializes the
predecessor and fingers of the node n. Next, it updates the fingers and
predecessors of the existing nodes to reflect the addition of n. Finally, it notifies
the application software so that it can transfer the state associated with the keys
that node n is now responsible for.
6.2 Freenet
Freenet [1] is an unstructured P2P network application that allows the
publication, replication and retrieval of data while protecting the anonymity of
both the authors and the readers. It operates as a network of identical nodes that
collectively pool their storage space to store data files and cooperate to route
requests to the most likely physical location of data. The files are referred to in a
location-independent manner, and are dynamically replicated in locations near
requestors and deleted from locations where there is no interest. It is infeasible to
discover the true origin or destination of a file passing through the network, and
difficult for a node operator to be held responsible for the actual physical
contents of his or her node.
In this adaptive peer-to-peer network, nodes query one another to store and
retrieve data files, which are named by location-independent keys. Each node
maintains a datastore which it makes available to the network for reading and
writing, as well as a dynamic routing table containing addresses of their
immediate neighbour nodes and the keys they are thought to hold. Most users
run nodes, both for security from hostile foreign nodes and to contribute to the
network’s storage capacity. Thus, the system is a cooperative distributed file
system with location independence and transparent lazy replication.
The basic model is that the request for keys are passed along from node to
node through a chain of proxy requests in which each node makes a local
decision about where to send the request next, in the style of IP routing. Since
the nodes only have knowledge of their immediate neighbours, the routing
algorithms are designed to adaptively adjust routes over time to provide efficient
performance while using only local knowledge. Each request is identified by a
pseudo-unique random number, so that nodes can reject requests they have seen
before, and a hops-to-live limit which is decremented at each node, to prevent
infinite chains. If a request is rejected by a node, then the immediately preceding
node chooses a different node to forward to. The process continues until the
request is either satisfied or exceeds its hops-to-live limit; the result is then passed
back along the chain to the sending node.
There are no privileges among the nodes, which prevents any hierarchy or a central
point of failure. Joining the network is simply a matter of first discovering the
address of one or more existing nodes and then starting to send messages.
To retrieve a file, a user must first obtain or calculate its key (calculation of
the key is explained in [1]). Then, a request message is sent to his or her own
node specifying that key and a hops-to-live value. When a node receives a
request, it first checks its own store for the file and returns it if found, together
with a note saying it was the source of the data. If not found, it looks up the
nearest key in its routing table to the key requested and forwards the request to
the corresponding node. If that request is ultimately successful and returns with
the data, the node will pass the data back to the user, cache the file in its own
datastore, and create a new entry in its routing table associating the actual data
source with the requested key. A subsequent request for the same key will be
immediately satisfied by the user’s node. To obviate the security issue which
could potentially be caused by maintaining a table of data sources, any node can
unilaterally decide to change the reply message to claim itself or another
arbitrarily chosen node as the data source.
If a node cannot forward a request to its preferred node, the node having the
second-nearest key will be tried, then the third-nearest, and so on. If a node runs
out of candidates to try, it reports failure back to its predecessor node, which will
then try its second choice, etc. In this manner, a request operates as a steepest-
ascent hill-climbing search with backtracking. If the hops-to-live limit is
exceeded, a failure result is propagated back to the original requestor without
any further nodes being tried. As nodes process requests, they create new routing
table entries for previously unknown nodes that supply files, thereby increasing
connectivity.
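A hedged sketch of this steepest-ascent routing with backtracking; the data structures are simplified stand-ins for real Freenet datastores and routing tables:

def route_request(key, node, network, hops_to_live, visited=None):
    """network[node] = (datastore dict, routing table {known_key: neighbour})."""
    visited = visited if visited is not None else set()
    if hops_to_live == 0 or node in visited:
        return None
    visited.add(node)
    datastore, routing_table = network[node]
    if key in datastore:
        return datastore[key]
    # Try neighbours in order of how close their advertised key is to the request.
    for known_key in sorted(routing_table, key=lambda k: abs(k - key)):
        result = route_request(key, routing_table[known_key], network,
                               hops_to_live - 1, visited)
        if result is not None:
            return result        # data is passed back along the chain
    return None                  # all candidates failed: backtrack

# Tiny example network: node C holds key 42.
network = {
    "A": ({}, {10: "B", 50: "C"}),
    "B": ({}, {10: "A"}),
    "C": ({42: "the data"}, {}),
}
print(route_request(42, "A", network, hops_to_live=5))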
File insertions work in the same manner as file requests. To insert a file, a
user first calculates a file key and then sends an insert message to his or her node
specifying the proposed key and a hops-to-live value (this will determine the
number of nodes to store it on). When a node receives an insert message, it first
checks its own store to see if the key is already taken. If the key exists, the node
returns the existing file as if a request has been made for it. This notifies the user
of a collision. If the key is not found, the node looks up the nearest key in its
routing table to the key proposed and forwards the insert to the corresponding
node. If that insert also causes a collision and returns with the data, the node will
pass the data back to the upstream inserter and again behave as if a request has
been made. If the hops-to-live limit is reached without a key collision being
detected, a success message will be propagated back to the original inserter. The
user then sends the data to insert, which will be propagated along the path
established by the initial query and stored in each node along the way. Each
node will also create an entry in its routing table associating the inserter with the
new key. To avoid the obvious security problem, any node along the way can
arbitrarily decide to change the insert message to claim itself or another
arbitrarily-chosen node as the data source. If a node cannot forward an insert to
its preferred node, it uses the same backtracking approach as was used while
handling requests.
Data storage is managed with a least recently used (LRU) cache approach by
each node, in which data items are kept stored in decreasing order by the time of
most recent request, or time of insert, if an item has never been requested. When
a new file arrives, which would cause the datastore to exceed the designated
size, the least recently used files are evicted in order until there is space. Once a
particular file is dropped from all the nodes, it will no longer be available in the
network. Files are encrypted to the extent that node operators cannot access its
contents.
When a new node intends to join the network, it chooses a random seed and
sends an announcement message containing its address and the hash of that seed
to some existing node. When a node receives a new node announcement, it
generates a random seed, XORs that with the hash it received and hashes the
result again to create a commitment. It then forwards this new hash to some
randomly chosen node. This forwarding continues until the hops-to-live of the
announcement runs out. The last node to receive the announcement just
generates a seed. Now all the nodes in the chain reveal their seeds and the key of
the new node is assigned as the XOR of all the seeds. Checking the
commitments enables each node to confirm that everyone revealed their seeds
truthfully. This yields a consistent random key, which each node then adds as an
entry for the new node in its routing table.
A key factor in the identification of a small-world network is the existence of
a scale-free power-law distribution of the links within the network, as the tail of
such distributions provides the highly connected nodes needed to create short
paths. Figure 6.2 shows the degree distribution of the Freenet network. Except
for one point, it seems to follow a power law. Therefore, Freenet appears to
exhibit a power-law degree distribution and, with it, small-world characteristics.
Reference [4] observed that the LRU cache replacement had a steep
reduction in the hit ratio with increasing load. Based on intuition from the small-
world models they proposed an enhanced-clustering cache replacement scheme
for use in place of LRU. This replacement scheme forced the routing tables to
resemble neighbour relationships in a small-world acquaintance graph and
improved the request hit ratio dramatically while keeping the small average hops
per successful request comparable to LRU.
Problems
In this exercise, the task is to evaluate a decentralized search algorithm on a
network where the edges are created according to a hierarchical tree structure.
The leaves of the tree will form the nodes of the network and the edge
probabilities between two nodes depends on their proximity in the underlying
tree structure.
P2P networks can be organized in a tree hierarchy, where the root is the main
software application and the second level contains the different countries. The
third level represents the different states and the fourth level is the different
cities. There could be several more levels depending on the size and structure of
the P2P network. In any case, the final level consists of the clients.
Consider a situation where client A wants a file that is located in client B. If
A cannot access B directly, A may connect to a node C which is, for instance, in
the same city and ask C to access the file instead. If A does not have access to
any node in the same city as B, it may try to access a node in the same state. In
general, A will attempt to connect to the node “closest” to B.
In this problem, there are two networks: one is the observed network, i.e., the edges between P2P clients, and the other is the hierarchical tree structure that is used to generate the edges in the observed network.
For this exercise, we will use a complete, perfectly balanced b-ary tree T (each internal node has b children and all leaves are at the same depth), and a network whose nodes are the leaves
of T. For any pair of network nodes v and w, h(v, w) denotes the distance
between the nodes and is defined as the height of the subtree L(v, w) of T rooted
at the lowest common ancestor of v and w. The distance captures the intuition
that clients in the same city are more likely to be connected than, for example, in
the same state.
To model this intuition, generate a random network on the leaf nodes where, for a node v, the probability of node v creating an edge to any other node w is given by Eq. 6.1:
$p_v(w) = \frac{1}{Z} b^{-h(v,w)}, \qquad Z = \sum_{x \neq v} b^{-h(v,x)}$ (6.1)
Next, set some parameter k and ensure that every node v has exactly k outgoing edges, using the following procedure. For each node v, sample a random node w according to p_v(w) and create the edge (v, w) in the network. Continue this until v has exactly k neighbours. Equivalently, after an edge is added from v to w, set p_v(w) to 0 and renormalize with a new Z to ensure that the remaining probabilities sum to 1. This results in a k-regular directed network.
Now experimentally investigate a more general case where the edge probability is proportional to $b^{-\alpha\, h(v,w)}$. Here $\alpha$ is a parameter in our experiments.
Consider a network with the setting b = 2, h = 10, k = 5, and a given α, i.e., the network consists of all the leaves in a binary tree of height 10 and the out-degree of each node is 5. Given α, create edges according to the distribution described above.
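A possible way to set up this exercise in Python is sketched below. The function names and the rejection-based re-drawing of duplicate targets are implementation choices of this sketch, not requirements of the exercise.

```python
import random

def tree_distance(v, w, b):
    """h(v, w): height of the subtree rooted at the lowest common ancestor
    of leaves v and w in a complete b-ary tree (leaves numbered 0..b**h - 1)."""
    d = 0
    while v != w:
        v, w, d = v // b, w // b, d + 1
    return d

def generate_network(b=2, height=10, k=5, alpha=1.0, seed=0):
    """Create k out-edges per leaf, sampling targets with probability
    proportional to b ** (-alpha * h(v, w))."""
    rng = random.Random(seed)
    n = b ** height
    edges = set()
    for v in range(n):
        others = [w for w in range(n) if w != v]
        weights = [b ** (-alpha * tree_distance(v, w, b)) for w in others]
        chosen = set()
        while len(chosen) < k:
            w = rng.choices(others, weights=weights, k=1)[0]
            chosen.add(w)    # re-drawing duplicates is equivalent to zeroing p_v(w) and renormalizing
        edges.update((v, w) for w in chosen)
    return edges

network = generate_network(alpha=1.0)
print(len(network))          # 1024 leaves with 5 out-edges each: 5120 directed edges
```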
References
1. Clarke, Ian, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. 2001. Freenet: A distributed
anonymous information storage and retrieval system. In Designing privacy enhancing technologies, 46–
66. Berlin: Springer.
2. Lua, Eng Keong, Jon Crowcroft, Marcelo Pias, Ravi Sharma, and Steven Lim. 2005. A survey and
comparison of peer-to-peer overlay network schemes. IEEE Communications Surveys & Tutorials 7 (2):
72–93.
3. Stoica, Ion, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek,
and Hari Balakrishnan. 2003. Chord: A scalable peer-to-peer lookup protocol for internet applications.
IEEE/ACM Transactions on Networking (TON) 11 (1): 17–32.
4. Zhang, Hui, Ashish Goel, and Ramesh Govindan. 2002. Using the small-world model to improve freenet
performance. In INFOCOM 2002. Twenty-first annual joint conference of the IEEE computer and
communications societies. Proceedings. IEEE, vol. 3, 1228–1237. IEEE.
7. Signed Networks
Fig. 7.1 A, B and C are all friends of each other. Therefore this triad (+, +, +), by satisfying the structural balance property, is balanced
Fig. 7.2 A and B are friends. However, both of them are enemies of C. Similar to Fig. 7.1, this triad (+, −, −) is balanced
Fig. 7.3 A is friends with B and C. However, B and C are enemies. Therefore the triad (+, +, −), by failing to satisfy the structural balance property, is unbalanced
Fig. 7.4 A, B and C are all enemies of one another. Similar to Fig. 7.3, the triad (−, −, −) is unbalanced
A graph in which all possible triads satisfy the structural balance property is
called a balanced graph and one which does not is called an unbalanced graph.
Figure 7.5 depicts a balanced and an unbalanced graph. The graph to the left
is balanced because all of the triads A,B,C; A,B,D; B,C,D and A,C,D satisfy the
structural balance property. However, the graph to the right is unbalanced
because the triads A,B,C and B,C,D do not satisfy the structural balance
property.
This leads to the balance theorem which states that if a labelled complete
graph is balanced, then either all pairs of its vertices are friends (this state is
referred to as “paradise” [3]), or else the nodes can be divided into two groups, X
and Y, such that every pair of vertices in X are friends of each other, every pair
of vertices in Y are friends of each other, and everyone in X is the enemy of
everyone in Y (this state is called “bipolar” [3]). Reference [6] reformulated this
theorem to include multiple groups X, Y, ..., Z, such that every pair of vertices within the same group are friends of each other, and vertices in different groups are enemies of one another.
Reference [6] also made an argument that a triad where all edges bear a
negative sign is inherently balanced, therefore giving a weak structural balance
property. This property says that there is no triad such that the edges among
them consist of exactly two positive edges and one negative edge.
Reference [11] found a closed-form expression for faction membership as a
function of initial conditions which implies that the initial amount of friendliness
in large social networks (started from random initial conditions) determines
whether they will end up as a paradise or a pair of bipolar sets.
Although identifying whether or not a graph is balanced merely involves checking whether all of its triads satisfy the structural balance property, this is a rather laborious approach. A simpler approach is as follows: consider the signed graph shown in Fig. 7.6. To determine whether or not this graph is balanced, we follow the procedure given below (a code sketch of the procedure is given after the list):
1.
Identify the supernodes. Supernodes are defined as the connected
components where each pair of adjacent vertices have a positive edge. If the
entire graph forms one supernode, then the graph is balanced. If there is a
vertex that cannot be part of any supernode, then this vertex is a supernode
in itself. Figure 7.7 depicts the supernodes of Fig. 7.6. Here, we see that the
vertices 4, 11, 14 and 15 are supernodes by themselves.
2.
Now, beginning at any supernode, we assign each of the supernodes to group X or Y alternately. If every pair of adjacent supernodes can be placed in different groups, then the graph is said to be balanced. If such a placement is not possible, then the graph is unbalanced. Figure 7.6 is
unbalanced because, if we consider a simplified graph of the supernodes
(Fig. 7.8), then there are two ways to bipartition these vertices starting from
the vertex A: either A, C, G and E are assigned to X with B, D and F
assigned to Y or A, F, D and B are assigned to X with E, G and C assigned to
Y. In the first case, A and E are placed in the same group while in the next
case, A and B are assigned to the same group. Since this is the case no
matter which of the vertices we start from, this graph is unbalanced.
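A minimal sketch of this procedure, assuming the signed graph is held in a networkx graph with a 'sign' attribute of +1 or −1 on every edge, could look as follows.

```python
import networkx as nx

def is_balanced(G):
    """Supernode procedure: contract positive components, then 2-colour them."""
    positive = G.edge_subgraph(
        (u, v) for u, v, s in G.edges(data="sign") if s > 0
    ).copy()
    positive.add_nodes_from(G.nodes())        # isolated vertices are supernodes by themselves

    # Map every vertex to the supernode (positive component) containing it.
    component = {}
    for i, comp in enumerate(nx.connected_components(positive)):
        for v in comp:
            component[v] = i

    # A negative edge inside a supernode already violates balance.
    reduced = nx.Graph()
    reduced.add_nodes_from(set(component.values()))
    for u, v, s in G.edges(data="sign"):
        if s < 0:
            if component[u] == component[v]:
                return False
            reduced.add_edge(component[u], component[v])

    # Balanced iff the supernode graph can be split into the two groups X and Y.
    return nx.is_bipartite(reduced)

G = nx.Graph()
G.add_edge("A", "B", sign=1)
G.add_edge("B", "C", sign=-1)
G.add_edge("A", "C", sign=-1)
print(is_balanced(G))    # True: A and B are friends, both enemies of C
```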
This procedure, however, shows that the condition for a graph to be balanced is very strict. Such a strict condition almost never applies in real-world scenarios. Alternatively, we come up with an approximate balanced graph condition. According to this condition, let ε be any number such that 0 ≤ ε < 1/8, let δ = ε^{1/3}, and suppose that at least 1 − ε of all the triads in a labelled complete graph are balanced. Then either
1. there is a set consisting of at least 1 − δ of the vertices in which at least 1 − δ of all the pairs are friends, or else
2. the vertices can be divided into two groups, X and Y, such that
a. at least 1 − δ of the pairs in X are friends of one another,
b. at least 1 − δ of the pairs in Y are friends of one another, and
c. at least 1 − δ of the pairs with one end in X and the other end in Y are enemies.
7.2 Theory of Status
The theory of status for signed link formation is based on an implicit ordering of
the vertices, in which a positive s(u, v) indicates that u considers v to have higher
status, while a negative s(u, v) indicates that u considers v to have lower status.
Table 7.2 tabulates the number of all the balanced and unbalanced triads in each of these datasets. Let p denote the fraction of positive edges in the network, T_i denote the type of a triad, N(T_i) denote the number of triads of type T_i, and p(T_i) denote the fraction of triads of type T_i, computed as N(T_i)/N, where N denotes the total number of triads. Now, we shuffle the signs of all the edges in the graph (keeping the fraction p of positive edges the same), and we let p_0(T_i) denote the expected fraction of triads that are of type T_i after this shuffling.
Table 7.2 reports these quantities for the Epinions, Slashdot and Wikipedia datasets. The surprise s(T_i) of Eq. 7.1 measures the number of standard deviations by which the observed number of triads of type T_i differs from the number expected under the shuffled baseline p_0(T_i).
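A small sketch of this computation is given below. It assumes the shuffled baseline p_0(T_i) takes the binomial form implied by independently shuffled signs, and the input counts are toy values.

```python
from math import sqrt, comb

def triad_surprise(counts, p):
    """counts[k]: number of triads with k positive edges (k = 0..3); p: fraction
    of positive edges. Returns the deviation of each observed count from the
    shuffled-sign expectation, in units of standard deviations (assumed form)."""
    total = sum(counts)
    surprises = {}
    for k, observed in enumerate(counts):
        p0 = comb(3, k) * p ** k * (1 - p) ** (3 - k)   # expected fraction of type k
        expected = total * p0
        surprises[k] = (observed - expected) / sqrt(total * p0 * (1 - p0))
    return surprises

# Toy example: 1000 triads in a network with 87% positive edges.
print(triad_surprise([10, 50, 240, 700], p=0.87))
```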
Fig. 7.10 Surprise values and predictions based on the competing theories of structural balance and status
Fig. 7.11 Given that the first edge was of sign X, P(Y | X) gives the probability that the reciprocated edge is of sign Y
Fig. 7.12 Edge reciprocation in balanced and unbalanced triads. Triads: number of balanced/unbalanced triads in the network where one of the edges was reciprocated. P(RSS): probability that the reciprocated edge is of the same sign. P(+ | +): probability that a positive edge is later reciprocated with a positive edge.
To understand the boundary between the balance and status theories and
where they apply, it is interesting to consider a particular subset of these
networks where the directed edges are used to create reciprocating relationships.
Figure 7.11 shows that users treat each other differently in the context of
reciprocating interactions than when they are using links to refer to others who
do not link back.
To consider how reciprocation between A and B is affected by the context of
A and B’s relationships to third nodes X, suppose that an A-B link is part of a
directed triad in which each of A and B has a link to or from a node X. Now, B
reciprocates the link to A. As indicated in Fig. 7.12, we find that the B-A link is
significantly more likely to have the same sign as the A-B link when the original
triad on A-B-X (viewed as an undirected triad) is structurally balanced. In other
words, when the initial A-B-X triad is unbalanced, there is more of a latent
tendency for B to “reverse the sign” when she links back to A.
Transition of an Unbalanced Network to a Balanced One
A large signed network whose signs are randomly assigned is almost surely unbalanced. Such an unbalanced network has to evolve to a more balanced state, with links changing their signs accordingly. Reference [3] studied how an unbalanced network transitions to a balanced one and focused on whether any human tendencies are exhibited in the process.
They considered local triad dynamics (LTD) wherein every update step chooses a triad at random. Denote by △_k a triad containing k negative links. If the chosen triad is balanced (△_0 or △_2), no evolution occurs. If the triad is unbalanced (△_1 or △_3), the sign of one of its links is changed as follows: △_1 → △_0 occurs with probability p, △_1 → △_2 occurs with probability 1 − p, while △_3 → △_2 occurs with probability 1. After an update step, the unbalanced triad becomes balanced, but this could cause a balanced triad that shares a link with this target to become unbalanced. These could subsequently evolve to balance, leading to new unbalanced triads.
They show that for p < 1/2, LTD drives a finite network to a bipolar state, but the time to reach this balanced state scales faster than exponentially with the network size. For p ≥ 1/2, the final state is paradise. They also considered a constrained dynamics in which update steps that would increase the total number of unbalanced triads are not allowed. On average each link is changed once in a unit of time. A crucial outcome of this is that a network is quickly driven to a balanced state in a time that scales as ln N.
What is most important with user evaluations is to determine the factors that drive one's evaluations and how a composite description that accurately reflects the aggregate opinion of the community can be created. The following are some of the studies that focus on addressing this problem.
Reference [12] designed and analysed a large-scale randomized experiment
on a social news aggregation Web site to investigate whether knowledge of such
aggregates distorts decision-making . Prior ratings were found to create
significant bias in individual rating behaviour, and positive and negative social
influences were found to create asymmetric herding effects. Whereas negative
social influence inspired users to correct manipulated ratings, positive social
influence increased the likelihood of positive ratings by and created
accumulating positive herding that increased final ratings by on average.
This positive herding was topic-dependent and affected by whether individuals
were viewing the opinions of friends or enemies. A mixture of changing opinion
and greater turnout under both manipulations together with a natural tendency to
up-vote on the site combined to create the herding effects.
Reference [13] studied the factors behind how users give ratings in different contexts, i.e., whether ratings are given anonymously or under one's own name, and whether they are displayed publicly or held confidentially. They investigated three
datasets, Amazon.com reviews, Epinions ratings, and CouchSurfing.com trust
and friendship networks, which were found to represent a variety of design
choices in how ratings are collected and shared. The findings indicate that
ratings should not be taken at face value, but rather that one should examine the
context in which they were given. Public, identified ratings tend to be
disproportionately positive, but only when the ratee is another user who can
reciprocate.
Reference [1] studied YahooAnswers (YA) which is a large and diverse
question answer community, acting not only as a medium for knowledge
sharing, but as a place to seek advice, gather opinions, and satisfy one’s curiosity
about things which may not have a single best answer. The fact about YA that
sets it apart from others is that participants believe that anything from the sordid
intricacies of celebrities’ lives to conspiracy theories is considered knowledge,
worthy of being exchanged. Taking advantage of the range of user behaviour in
YA, several aspects of question-answer dynamics were investigated. First,
content properties and social network interactions across different YA categories
(or topics) were contrasted. It was found that the categories could be clustered
according to thread length and overlap between the set of users who asked and
those who replied. While discussion topics, or topics that did not focus on factual answers, tended to have longer threads, broader distributions of activity levels, and users who tended to participate by both posing and replying to questions, YA categories favouring factual questions had shorter thread lengths on average, and users typically did not occupy both a helper and an asker role in the
same forum. It was found that the ego-networks easily revealed YA categories
where discussion threads, even in this constrained question-answer format,
tended to dominate. While many users are quite broad, answering questions in
many different categories, such breadth was a mild detriment in specialized, technical
categories. In those categories, users who focused the most (had a lower entropy
and a higher proportion of answers just in that category) tended to have their
answers selected as best more often. Finally, they attempted to predict best
answers based on attributes of the question and the replier. They found that just
the very basic metric of reply length, along with the number of competing
answers, and the track record of the user, was most predictive of whether the
answer would be selected. The number of other best answers by a user, a
potential indicator of expertise, was predictive of an answer being selected as
best, but most significantly so for the technically focused categories.
Reference [8] explored CouchSurfing, an application which enables users either to let other users sleep on their couch or to sleep on someone else's couch. Due to security and privacy concerns, this application heavily depends on reciprocity and trust among its users. By studying the surfing activities, social
networks and vouch networks, they found the following: First, CouchSurfing is a
community rife with generalized reciprocity, i.e, active participants take on the
role of both hosts and surfers, in roughly equal proportion. About a third of those
who hosted or surfed are in the giant strongly connected component, such that
one couch can be reached from any other by following previous surfs. Second,
the high degree of activity and reciprocity is enabled by a reputation system
wherein users vouch for one another. They found that connections that are
vouched, or declared trustworthy can best be predicted based on the direct
interaction between the two individuals: their friendship degree, followed by the
overall experience from surfing or hosting with the other person, and also how
the two friends met. However, global measures that aim to propagate trust, such
as PageRank, are found to be poor predictors of whether an edge is vouched.
Although such metrics may be useful in assigning overall reputation scores to
individuals, they are too diffuse to predict specifically whether one individual
will vouch for another. Finally, the analysis revealed a high rate of vouching:
about a quarter of all edges that can be vouched are, as are a majority of highly
active users. While this could be a reflection of a healthy web of trust, there are
indications that vouches may be given too freely. The main reason behind this
high rate of vouching may be its public nature. It can be awkward for friends to
not give or reciprocate a vouch, even if privately they have reservations about
the trustworthiness of the other person.
1.
Direct propagation: If B_{ij} = 1, i.e., i trusts j, and B_{jk} = 1, i.e., j trusts k, then we could conclude that B_{ik} = 1, i.e., i trusts k. This operation is referred to as direct propagation since the trust propagates directly along an edge. Such direct propagations are represented as an operator matrix M such that all conclusions expressible by a single direct propagation step can be read from the matrix B M. The appropriate matrix M to encode direct propagation is simply B itself: in this case B M = B^2, which is the matrix of all length-2 paths in the initial belief graph. Thus, B itself is the operator matrix that encodes the direct propagation basis element.
2.
Co-citation: If B_{i_1 j_1} = 1, B_{i_1 j_2} = 1 and B_{i_2 j_2} = 1, i.e., i_1 trusts j_1 and j_2, and i_2 trusts j_2, then we conclude that i_2 should also trust j_1. Here, B B^T B will capture all beliefs that are inferable through a single co-citation. The sequence B^T B can be viewed as a backward-forward step, propagating i_2's trust of j_2 backward to i_1, then forward to j_1. Therefore, the operator M for this atomic propagation is B^T B.
3.
Transpose trust: Here i's trust of j causes j to develop some level of trust towards i. In this case, M = B^T.
4.
Trust coupling: When both i and j trust k, this implies that i trusts j. This makes M = B B^T.
A weighted combination of these four operators, C_{B,α} = α_1 B + α_2 B^T B + α_3 B^T + α_4 B B^T, is a matrix whose ijth entry describes how beliefs should flow from i to j via an atomic propagation step; if the entry is 0, then nothing can be concluded in an atomic step about i's views on j.
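The four atomic propagations can be written directly as matrix operations. The sketch below uses a toy belief matrix and illustrative combination weights; it is not the paper's tuned configuration.

```python
import numpy as np

# Toy belief matrix B: B[i, j] = 1 means i trusts j (assumed example data).
B = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

direct      = B            # direct propagation
co_citation = B.T @ B      # backward-forward step
transpose   = B.T          # transpose trust
coupling    = B @ B.T      # trust coupling

# A combined single-step operator as a weighted sum of the four basis elements;
# the weights alpha are illustrative only.
alpha = (0.4, 0.4, 0.1, 0.1)
M = (alpha[0] * direct + alpha[1] * co_citation +
     alpha[2] * transpose + alpha[3] * coupling)
print(M)
```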
2.
One-step distrust: When a user distrusts someone, they discount all of that person's judgements. Therefore, distrust propagates only one step: trust is propagated through the atomic steps as before, and the propagated beliefs are then discounted once by the directly expressed distrust.
3.
Propagated distrust: When trust and distrust both propagate together, the atomic propagations operate on the combined beliefs, with distrust treated as negative trust (B = T − D).
1.
Eigenvalue propagation (EIG): the final belief matrix F is obtained by repeatedly applying the combined propagation matrix a fixed number of times K.
2.
Weighted linear combinations (WLC): if γ is a discount factor smaller than the largest eigenvalue of the propagation matrix, the final beliefs F are taken as a γ-discounted sum of the results of propagating for 1, 2, …, K steps.
7.4.4 Rounding
The next problem encountered was that of converting a continuous value in F into a discrete one (trust or distrust). There were the following three ways of accomplishing this rounding:
1.
Global rounding: This rounding tries to align the ratio of trust to distrust values in F to that in the input M. In the row vector F_i, i trusts j if and only if F_{ij} is within the top fraction of entries of the vector F_i. This threshold fraction is chosen based on the overall relative fractions of trust and distrust in the input (a code sketch of this rounding is given after the list).
2.
Local rounding: Here, we account for the trust/distrust behaviour of i. The conditions for trust and distrust are the same as in the previous definition, except that the threshold fraction is based on i's own ratio of trust to distrust.
3.
Majority rounding: This rounding intends to capture the local structure
of the original trust and distrust matrix. Consider the set J of users on
whom i has expressed either trust or distrust. Treating J as a set of labelled examples, we are to predict the label of a user j ∉ J. We order the users of J together with j according to the entries F_{ik}, where k ∈ J ∪ {j}. At the end of this, we have an ordered sequence of trust
and distrust labels with the unknown label for j embedded in the
sequence at a unique location. From this, we predict the label of j to be
that of the majority of the labels in the smallest local neighbourhood
surrounding it where the majority is well-defined.
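The sketch below illustrates the rounding idea on a toy score matrix. Applying the threshold row by row and the particular tie-breaking are choices of this sketch, not the paper's exact procedure.

```python
import numpy as np

def rounding(F, trust_fraction):
    """Within each row of continuous scores, label the top `trust_fraction`
    of entries as trust (+1) and the rest as distrust (-1)."""
    out = np.full(F.shape, -1, dtype=int)
    for i, row in enumerate(F):
        k = max(1, int(round(trust_fraction * row.size)))
        top = np.argsort(row)[-k:]        # indices of the k largest scores in row i
        out[i, top] = 1
    return out

F = np.array([[0.0, 0.9, 0.2],
              [0.4, 0.0, 0.7],
              [0.1, 0.3, 0.0]])
print(rounding(F, trust_fraction=0.5))
```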
Epinions
For this study, the Epinions web of trust was constructed as a directed graph consisting of 131,829 nodes and 841,372 edges, each labelled either trust or distrust. Of these labelled edges, roughly 85% were labelled trust; we interpret trust to be the real value +1.0 and distrust to be −1.0.
The combinations of atomic propagations, distrust propagations, iterative
propagations and rounding methods gave 81 experimental schemes. To
determine whether any particular algorithm can correctly induce the trust or
distrust that i holds for j, a single trial edge (i, j) is masked from the truth graph,
and then each of the 81 schemes is asked to guess whether i trusts j. This trial
was performed on 3,250 randomly masked edges for each of the 81 schemes, resulting in about 263,000 total trust computations, and is depicted in Fig. 7.13. In this
table, ε denotes the prediction error of an algorithm and a given rounding method, i.e., ε is the fraction of incorrect predictions made by the algorithm.
Fig. 7.13 Prediction errors of the algorithms (ε on randomly masked edges and ε_S on the balanced subset S)
The trust edges in the graph outnumber the distrust edges by a huge margin:
85% versus 15%. Hence, a naive algorithm that always predicts “trust” will incur a prediction error of only about 15%. Nevertheless, the results are first reported for
prediction on randomly masked edges in the graph, as it reflects the underlying
problem. However, to ensure that the algorithms are not benefiting unduly from
this bias, the largest balanced subset of the 3,250 randomly masked trial edges is taken such that half the edges are trust and the other half are distrust; this is done by taking all the 498 distrust edges in the trial set as well as 498 randomly chosen trust edges from the trial set. Thus, the size of this subset S is
996. The prediction error in S is called ε_S. The naive prediction error on S would be 50%.
They found that even a small amount of information about distrust (rather
than information about trust alone) can provide tangibly better judgements about
how much user i should trust user j, and that a small number of expressed
trusts/distrust per individual can allow prediction of trust between any two
people in the system with surprisingly high accuracy.
1.
The conformity hypothesis: This hypothesis holds that a review is evaluated
as more helpful when its star rating is closer to the consensus star rating for
the product.
2.
The individual-bias hypothesis: According to this hypothesis, when a user
considers a review, she will rate it more highly if it expresses an opinion that
she agrees with. However, one might expect that if a diverse range of
individuals apply this rule, then the overall helpfulness evaluation could be
hard to distinguish from one based on conformity.
3.
The brilliant-but-cruel hypothesis: This hypothesis arises from the argument
that “negative reviewers are perceived as more intelligent, competent, and
expert than positive reviewers.”
4.
The quality-only straw-man hypothesis: There is the possibility that
helpfulness is being evaluated purely based on the textual content of the
reviews, and that these non-textual factors are simply correlates of textual
quality.
The helpfulness ratio of a review is defined to be the fraction of evaluators
who found it to be helpful (in other words, it is the fraction a/b when a out of b people found the review helpful). The product average for a review of a given
product is defined to be the average star rating given by all reviews of that
product. Figure 7.14 shows that the median helpfulness ratio of reviews
decreases monotonically as a function of the absolute difference between their
star rating and the product average. This is consistent with the conformity
hypothesis: reviews in aggregate are deemed more helpful when they are close to
the product average. However, to assess the brilliant-but-cruel hypothesis, we
must look not at the absolute difference between a review’s star rating and its
product average, but at the signed difference, which is positive or negative
depending on whether the star rating is above or below the average. Figure 7.15
shows that not only does the median helpfulness as a function of signed
difference fall away on both sides of 0; it does so asymmetrically: slightly
negative reviews are punished more strongly, with respect to helpfulness
evaluation, than slightly positive reviews. This is at odds with both the brilliant-
but-cruel hypothesis as well as the conformity hypothesis.
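The helpfulness ratio and signed deviation can be computed with a few lines of Python. The data below is a toy example, and rounding the deviation into integer bins is an assumption of this sketch.

```python
from collections import defaultdict

# Each review: (star_rating, helpful_votes, total_votes); data is illustrative.
reviews = {"prod1": [(5, 40, 50), (4, 10, 20), (1, 2, 30)],
           "prod2": [(3, 12, 15), (2, 3, 10)]}

by_deviation = defaultdict(list)
for product, revs in reviews.items():
    avg_star = sum(r[0] for r in revs) / len(revs)        # product average
    for stars, helpful, total in revs:
        if total == 0:
            continue
        deviation = round(stars - avg_star)                # signed deviation, binned
        by_deviation[deviation].append(helpful / total)    # helpfulness ratio

for dev in sorted(by_deviation):
    ratios = sorted(by_deviation[dev])
    median = ratios[len(ratios) // 2]
    print(dev, median)
```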
Fig. 7.15 Helpfulness ratio as a function of a review’s signed deviation
Fig. 7.16 As the variance of the star ratings of reviews for a particular product increases, the median
helpfulness ratio curve becomes two-humped and the helpfulness ratio at signed deviation 0 (indicated in
red) no longer represents the unique global maximum
For a given review, let the computed product-average star rating be the
average star rating as computed over all reviews of that product in the dataset.
To investigate the quality-only straw-man hypothesis, we must control for the quality of the review text. To avoid the subjectivity that may be involved when human evaluators are used, a machine learning algorithm is trained to automatically determine the degree of helpfulness of each review. For absolute deviations i and j where i ≠ j, we write i ≻ j when the helpfulness ratio of reviews with absolute deviation i is significantly larger than that of reviews with absolute deviation j.
Reference [5] presents a model that can account for the observed behaviour.
We evaluate the robustness of the observed social-effects phenomena by
comparing review data from three additional national Amazon sites:
Amazon.co.uk (UK), Amazon.de (Germany) and Amazon.co.jp (Japan). The
Japanese data exhibits a left hump that is higher than the right one for reviews
with high variance, i.e., reviews with star ratings below the mean are more
favored by helpfulness evaluators than the respective reviews with positive
deviations (Fig. 7.17).
Fig. 7.17 Signed deviations vs. helpfulness ratio for variance = 3, in the Japanese (left) and U.S. (right)
data. The curve for Japan has a pronounced lean towards the left
They found that the perceived helpfulness of a review depends not just on its
content but also in subtle ways on how the expressed evaluation relates to other
evaluations of the same product.
Fig. 7.20 Probability of E voting positively on T as a function of the similarity s(e, t) between the evaluator and target edit vectors in Wikipedia
Fig. 7.21 Probability of E voting positively on T as a function of the status difference Δ for different levels of similarity on Stack Overflow for a all evaluators b no low-status evaluators
Fig. 7.22 Similarity between E and T pairs as a function of Δ for a English Wikipedia and b Stack Overflow
Fig. 7.23 Probability of E positively evaluating T versus the target's status for various fixed levels of Δ in Stack Overflow
The trend is somewhat similar on Stack Overflow, as depicted in Fig. 7.21, which plots the probability of a positive evaluation against Δ. When Δ ≥ 0, the picture is qualitatively the same as in Wikipedia: the higher the similarity, the higher the probability of a positive evaluation. But for Δ < 0, the situation is very different: the similarity curves are in the opposite order from before: evaluators with low similarity to the higher-status targets (since Δ < 0) are more positive than evaluators with high similarity. This is due to a particular property of Stack Overflow's reputation system, in which it costs a user a small amount of reputation to downvote a question or answer (issue a negative evaluation). This creates a disincentive to downvote which is most strongly felt by users with lower reputation scores (which correlate with our measure of status on Stack Overflow). When the low-status evaluators are excluded (Fig. 7.21b), this effect disappears and the overall picture looks the same as it does on Wikipedia.
In Wikipedia elections, we find that the evaluator's similarity to the target depends strongly on Δ, while this does not happen on Stack Overflow or Epinions. This is shown in Fig. 7.22.
Figure 7.23 plots the fraction of positive evaluations against the target's status within several narrow ranges of Δ on Stack Overflow. If the status difference Δ is really how users compare their status against others, then we would expect these curves to be approximately flat, because this would imply that, for pairs separated by a given Δ, evaluation positivity does not depend on what their individual statuses are, and that the level of these constant curves depends on Δ, so that different Δ values result in different evaluation behaviour. However, Δ does not control the result well for extremely low target status
status would otherwise suggest. This is because the poor quality of targets with
low status overwhelms the difference in status between evaluator and target.
Thus absolute status, and not differential status, is the main criterion evaluators
use to evaluate very low-status targets.
GS: The “gold standard” represents the best possible performance that can be achieved. It “cheats” by examining the values of the actual votes cast by the first evaluators, computes the empirical fraction of positive votes and learns the optimal threshold to predict the election outcome.
The methods developed were:
M1: M1 models the probability that an evaluator votes positively as the sum of the evaluator's overall positivity and the average deviation of the fraction of positive votes in the corresponding similarity-status bucket compared to the overall fraction of positive votes across the entire dataset. The average of this probability over the first few evaluators is computed and then used as a threshold for prediction.
M2: M2 models the same probability as a weighted combination of the evaluator's positivity and the bucket-based estimate; a weight between 0.6 and 0.9 is used for computation.
Figure 7.26 plots the classification accuracy for the models on English
Wikipedia as well as German Wikipedia (French and Spanish Wikipedia were
very similar to the German results). Here, the prior refers to the accuracy of
random guessing.
Fig. 7.26 Ballot-blind prediction results for a English Wikipedia b German Wikipedia
These results demonstrate that without even looking at the actual votes, it is
possible to derive a lot of information about the outcome of the election from a
small prefix of the evaluators. Very informative implicit feedback could be
gleaned from a small sampling of the audience consuming the content in
question, especially if previous evaluation behaviour by the audience members is
known.
7.7 Predicting Positive and Negative Links
Reference [9] investigated relationships in Epinions, Slashdot and Wikipedia.
They observed that the signs of links in the underlying social networks can be
predicted with high accuracy using generalized models. These models are found
to provide insight into some of the fundamental principles that drive the
formation of signed links in networks, shedding light on theories of balance and
status from social psychology. They also suggest social computing applications
by which the attitude of one user towards another can be estimated from
evidence provided by their relationships with other members of the surrounding
social network.
The study looked at three datasets.
The trust network of Epinions data spanning from the inception of the site in
1999 until 2003. This network contained 119,217 nodes and 841,000 edges, the majority of which were positive. 80,668 users received at least one trust or
distrust edge, while there were 49,534 users that created at least one and
received at least one signed edge. In this network, s(u, v) indicates whether u
had expressed trust or distrust of user v.
The social network of the technology-related news website Slashdot, where u can designate v as either a “friend” or “foe” to indicate u's approval or disapproval of v's comments. Slashdot was crawled in 2009 to obtain its network of 82,144 users and 549,202 edges, the majority of which are positive. 70,284 users received at least one signed edge, and there were 32,188 users with non-zero in- and out-degree.
The network of votes cast by Wikipedia users in elections for promoting individuals to the role of admin. A signed link indicated a positive or negative vote by one user on the promotion of another. Using the January 2008 complete dump of Wikipedia page edit history, all administrator election and vote history data was extracted. This gave 2,794 elections with 103,747 total votes and 7,118 users participating in the elections. Out of this total, 1,235 elections resulted in a successful promotion, while 1,559 elections did not result in the promotion of the candidate. The resulting network contained 7,118 nodes and 103,747 edges, the majority of which are positive. There were 2,794 nodes that received at least one edge and 1,376 users that both received and created signed edges. s(u, v) indicates whether u voted for or against the promotion of v to admin status.
In all of these networks the background proportion of positive edges is about the same, with roughly 80% of the edges having a positive sign.
The aim was to answer the question: “How does the sign of a given link
interact with the pattern of link signs in its local vicinity, or more broadly
throughout the network? Moreover, what are the plausible configurations of link
signs in real social networks?” The attempt is to infer the attitude of one user
towards another, using the positive and negative relations that have been
observed in the vicinity of this user. This becomes particularly important in the case of recommender systems, where users' pre-existing attitudes and opinions must be taken into account before recommending a link. This involves predicting the sign of the link to be recommended before actually suggesting it.
This gives us the edge sign prediction problem. Formally, we are given a social network with signs on all its edges, except that the sign on the edge from node u to node v, denoted s(u, v), has been “hidden”. With what probability can we infer this sign s(u, v) using the information provided by the rest of the network? In a way,
this problem is similar to the problem of link prediction.
which specifies for each triad type whether it constitutes evidence for a positive (u, v) edge (+1), evidence for a negative (u, v) edge (−1), or whether it offers no evidence (0).
For studying balance, we see that if w forms a triad with the edge (u, v), then
(u, v) should have the sign that causes the triangle on u, v, w to have an odd
number of positive signs, regardless of edge direction. In other words, the balance-based prediction is the sign for (u, v) that makes the triad balanced. In order to determine the status-based prediction, we first flip the directions of the edges between u and w and between v and w, so that they point from u to w and from w to v; we flip the signs accordingly as we do this. We then define the status-based prediction to be the sign of s(u, w) + s(w, v). The accuracy of
predicting signs considering these balance and status coefficients is depicted in
Fig. 7.28. Here, StatusLrn and BalanceLrn denote the coefficients learned via
logistic regression. The deterministic coefficients provided by balance, weak balance and status are denoted by BalanceDet, WeakBalDet and StatusDet, respectively.
Fig. 7.28 Accuracy of predicting the sign of an edge based on the signs of all the other edges in the
network in a Epinions, b Slashdot and c Wikipedia
Next the following handwritten heuristic predictors were considered:
Balance heuristic (Balance): For each choice of the sign of (u, v), some of the
triads it participates in will be consistent with balance theory, and the rest of
the triads will not. We choose the sign for (u, v) that causes it to participate in
a greater number of triads that are consistent with balance.
Status heuristic (StatDif): An estimate of a node x's status is σ(x) = d_in^+(x) + d_out^−(x) − d_in^−(x) − d_out^+(x). This gives x status benefits for each positive link it receives and each negative link it generates, and status detriments for each negative link it receives and each positive link it generates. We then predict a positive sign for (u, v) if σ(u) ≤ σ(v), and a negative sign otherwise.
Out-degree heuristic (OutSign): We predict the majority sign based on the signs given by the edge initiator u, i.e., we predict a positive sign if at least half of the signs that u gives out are positive.
In-degree heuristic (InSign): We predict the majority sign based on the signs received by the edge target v, i.e., we predict a positive sign if at least half of the signs that v receives are positive (a code sketch of these heuristics is given below).
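A sketch of the status and out-degree heuristics is shown below, assuming a directed networkx graph with a 'sign' attribute of +1 or −1 on every edge; the tie-breaking choices are assumptions of this sketch.

```python
import networkx as nx

def status(G, x):
    """sigma(x): positive in-links and negative out-links raise status,
    negative in-links and positive out-links lower it."""
    s = 0
    for _, _, sign in G.in_edges(x, data="sign"):
        s += 1 if sign > 0 else -1
    for _, _, sign in G.out_edges(x, data="sign"):
        s += -1 if sign > 0 else 1
    return s

def predict_statdif(G, u, v):
    return 1 if status(G, u) <= status(G, v) else -1

def predict_outsign(G, u, v):
    signs = [s for _, _, s in G.out_edges(u, data="sign")]
    return 1 if signs and sum(1 for s in signs if s > 0) >= len(signs) / 2 else -1

G = nx.DiGraph()
G.add_edge("a", "b", sign=1)
G.add_edge("b", "c", sign=-1)
G.add_edge("a", "c", sign=1)
print(predict_statdif(G, "b", "a"), predict_outsign(G, "a", "c"))
```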
These predictors are plotted in Fig. 7.29 as a function of embeddedness.
Fig. 7.29 Accuracy for handwritten heuristics as a function of minimum edge embeddedness
Problems
Download the signed Epinions social network dataset available at https://snap.
stanford.edu/data/soc-sign-epinions.txt.gz.
Consider the graph as undirected and compute the following:
47 Calculate the count and fraction of triads of each type in this network.
48 Calculate the fraction of positive and negative edges in the graph. Let the fraction of positive edges be p. Assuming that each edge of a triad will independently be assigned a positive sign with probability p and a negative sign with probability 1 − p, calculate the probability of each type of triad (a code sketch covering both problems is given below).
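The sketch below addresses both problems; it assumes the downloaded file has been decompressed and contains whitespace-separated "source target sign" triples after the comment lines (the usual SNAP format).

```python
import networkx as nx
from math import comb

G = nx.Graph()
with open("soc-sign-epinions.txt") as f:      # assumed decompressed SNAP file: "u v sign" per line
    for line in f:
        if line.startswith("#"):
            continue
        u, v, s = map(int, line.split())
        G.add_edge(u, v, sign=s)

def triad_counts(G):
    """Problem 47: count undirected triads by their number of positive edges (0..3)."""
    counts = [0, 0, 0, 0]
    for u, v in G.edges():
        for w in set(G[u]) & set(G[v]):       # every common neighbour closes a triad
            if u < w and v < w:               # count each triangle exactly once
                pos = sum(1 for e in ((u, v), (u, w), (v, w)) if G.edges[e]["sign"] > 0)
                counts[pos] += 1
    return counts

def expected_fractions(G):
    """Problem 48: probability of each triad type if signs were assigned independently."""
    signs = [s for _, _, s in G.edges(data="sign")]
    p = sum(1 for s in signs if s > 0) / len(signs)
    return p, [comb(3, k) * p ** k * (1 - p) ** (3 - k) for k in range(4)]

counts = triad_counts(G)
total = sum(counts)
print(counts, [c / total for c in counts])    # observed counts and fractions
print(expected_fractions(G)[1])               # expected fractions under independence
```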
References
1. Adamic, Lada A., Jun Zhang, Eytan Bakshy, and Mark S. Ackerman. 2008. Knowledge sharing and
Yahoo answers: Everyone knows something. In Proceedings of the 17th international conference on
World Wide Web, 665–674. ACM.
2. Anderson, Ashton, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2012. Effects of user
similarity in social media. In Proceedings of the fifth ACM international conference on Web search and
data mining, 703–712. ACM.
3. Antal, Tibor, Pavel L. Krapivsky, and Sidney Redner. 2005. Dynamics of social balance on networks.
Physical Review E 72(3):036121.
4. Brzozowski, Michael J., Tad Hogg, and Gabor Szabo. 2008. Friends and foes: Ideological social
networking. In Proceedings of the SIGCHI conference on human factors in computing systems, 817–
820. ACM.
5. Danescu-Niculescu-Mizil, Cristian, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009. How
opinions are received by online communities: A case study on Amazon. com helpfulness votes. In
Proceedings of the 18th international conference on World Wide Web, 141–150. ACM.
6. Davis, James A. 1963. Structural balance, mechanical solidarity, and interpersonal relations. American
Journal of Sociology 68 (4): 444–462.
7. Guha, Ramanthan, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. 2004. Propagation of trust
and distrust. In Proceedings of the 13th international conference on World Wide Web, 403–412. ACM.
8. Lauterbach, Debra, Hung Truong, Tanuj Shah, and Lada Adamic. 2009. Surfing a web of trust:
Reputation and reciprocity on couchsurfing. com. In 2009 International conference on computational
science and engineering (CSE’09), vol. 4, 346–353. IEEE.
9. Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. 2010. Predicting positive and negative links in
online social networks. In Proceedings of the 19th international conference on World Wide Web, 641–
650. ACM.
10. Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. 2010. Signed networks in social media. In
Proceedings of the SIGCHI conference on human factors in computing systems, 1361–1370. ACM.
11. Marvel, Seth A., Jon Kleinberg, Robert D. Kleinberg, and Steven H. Strogatz. 2011. Continuous-time
model of structural balance. Proceedings of the National Academy of Sciences 108 (5): 1771–1776.
12. Muchnik, Lev, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: A randomized experiment.
Science 341(6146):647–651.
13. Teng, ChunYuen, Debra Lauterbach, and Lada A. Adamic. 2010. I rate you. You rate me. Should we
do so publicly? In WOSN.
https://doi.org/10.1007/978-3-319-96746-2_8
Suppose a node v has d neighbours, of whom a fraction p have adopted behaviour A. With the payoff matrix shown below, behaviour A is the better choice for v when
$p \cdot d \cdot a \geq (1 - p) \cdot d \cdot b$ (8.1)
Rearranging the terms in Eq. 8.1, we get
$p \geq \frac{b}{a + b}$ (8.2)
If we denote the right hand side term as q ($q = b/(a + b)$), then q can be called the threshold: v should adopt behaviour A if at least a q fraction of its neighbours have adopted A.
A B
A a,a 0,0
B 0,0 b,b
Fig. 8.1 v must choose between behaviours A and B based on its neighbours' behaviours
In the long run, this behaviour adoption leads to one of the following states
of equilibria. Either everyone adopts behaviour A, or everyone adopts behaviour
B. Additionally, there exists a third possibility where nodes adopting behaviour
A coexist with nodes adopting behaviour B. In this section, we will understand
the network circumstances that will lead to one of these possibilities.
Assume a network where every node initially has behaviour B. Let a small
portion of the nodes be early adopters of the behaviour A. These early adopters
choose A for reasons other than those guiding the coordination game, while the
other nodes operate within the rules of the game. Now with every time step, each
of these nodes following B will adopt A based on the threshold rule. This
adoption will cascade until one of the two possibilities occur: Either all nodes
eventually adopt A leading to a complete cascade , or an impasse arises between
those adopting A and the ones adopting B forming a partial cascade .
Figures 8.2, 8.3, 8.4 and 8.5 all show an example of a complete cascade. In this network, let the payoffs a and b give a threshold q, and let the initial adopters be v and w; all the other nodes start with behaviour B. In the first time step, the fraction p of A-adopting neighbours exceeds q for the nodes r and t but not for the nodes s and u. Hence r and t adopt A while s and u do not. In the second time step, p exceeds q for the nodes s and u as well, causing the remaining two nodes to adopt the behaviour, thereby producing a complete cascade.
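The threshold rule can be simulated directly, as in the sketch below. The toy path graph and the payoffs a = 3, b = 2 (hence q = 0.4) are illustrative values only; because adoption is monotone, the final adopter set does not depend on the update order.

```python
import networkx as nx

def threshold_cascade(G, initial_adopters, q):
    """Repeatedly apply the threshold rule: a node playing B switches to A once
    at least a fraction q of its neighbours have adopted A."""
    adopters = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for v in G.nodes():
            if v in adopters or G.degree(v) == 0:
                continue
            if sum(1 for u in G[v] if u in adopters) / G.degree(v) >= q:
                adopters.add(v)
                changed = True
    return adopters

# Toy example: payoffs a = 3 and b = 2 would give q = b / (a + b) = 0.4.
G = nx.path_graph(6)
print(threshold_cascade(G, {2, 3}, q=0.4))   # complete cascade: all six nodes adopt A
```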
Fig. 8.2 Initial network where all nodes exhibit the behaviour B
Fig. 8.3 Nodes v and w are initial adopters of behaviour A while all the other nodes still exhibit behaviour
B
Fig. 8.4 First time step where r and t adopt behaviour A by threshold rule
Fig. 8.5 Second time step where s and u adopt behaviour A also by threshold rule
Figures 8.6, 8.7 and 8.8 illustrate a partial cascade. Similar to the complete cascade depiction, we begin with a network where all the nodes exhibit behaviour B, with payoffs a and b and the corresponding threshold q. The nodes 7 and 8 are the initial adopters. In the first step, nodes 5 and 10 are the only two nodes for which the fraction of A-adopting neighbours reaches q, so they change behaviour to A. In the second step, the same holds for nodes 4 and 9, causing them to switch to A. In the third time step, node 6 adopts A. Beyond these adoptions, there are no further switches, leaving the network in a partial cascade.
There are two ways for a partial cascade to turn into a complete cascade. The
first way is for the payoff of one of the behaviours to exceed the other. In
Fig. 8.6, if the payoff a is increased, then the whole network will eventually
switch to A. The other way is to coerce some critical nodes of one of the
behaviours to adopt the other. This would restart the cascade, ending in a
complete cascade in favour of the coerced behaviour. In Fig. 8.6, if nodes 12 and
13 were forced to adopt A, then this would lead to a complete cascade. Instead, if
11 and 14 were coerced to switch, then there would be no further cascades.
Partial cascades are caused due to the fact that the spread of a new behaviour
can stall when it tries to break into a tightly-knit community within the network,
i.e, homophily can often serve as a barrier to diffusion, by making it hard for
innovations to arrive from outside densely connected communities. More
formally, consider a set of initial adopters of behaviour A, with a threshold of q
for nodes in the remaining network to adopt behaviour A. If the remaining
network contains a cluster of density greater than 1 − q, then the set of initial
adopters will not cause a complete cascade. Whenever a set of initial adopters
does not cause a complete cascade with threshold q, the remaining network must
contain a cluster of density greater than 1 − q.
The above discussed models work on the assumption that the payoffs that
each adopter receives is the same, i.e, a for each adopter of A and b for each
adopter of B. However, we can consider the possibility of node-specific payoffs, i.e., every node v receives a payoff a_v for adopting A and a payoff b_v for adopting B. The payoff matrix in such a coordination game is as shown in Table 8.2.
Table 8.2 Coordination game for node specific payoffs
A B
A a_v, a_w 0, 0
B 0, 0 b_v, b_w
On the same lines as the previous coordination game, we arrive at Eq. 8.3:
$p \geq \frac{b_v}{a_v + b_v} = q_v$ (8.3)
This gives a personal threshold rule whereby v adopts behaviour A when at least a q_v fraction of its neighbours have done so.
A threshold of q ≤ 1/2 allows behaviour A to spread along the path, while any larger threshold fails to do so. Therefore, in the case of an infinite path, the cascade capacity is 1/2.
Next, we will look at an infinite grid as shown in Fig. 8.9. Let the nine nodes in black be the early adopters of behaviour A while the other nodes exhibit behaviour B. Here, if the threshold q ≤ 3/8, then there will be a complete cascade.
If all nodes initially play B, and a small number of nodes begin adopting
strategy A instead, the best response would be to switch to A if enough of a
user’s neighbours have already adopted A. As this unfolds, cascading will occur
where either all nodes switch to A, or there will exist a state of coexistence
where some users use A, others use B, with no interoperability between these
two groups.
However, we see that the real-world contains examples where coexistence
occurs. Windows coexisting with Mac OS, or Android coexisting with iOS. The
point of interest is that these coexisting groups are not completely isolated.
There exists a region of overlap between these groups, i.e, there are certain users
who are able to use both of the choices and thus act as the bridge between the
groups.
Consider now that our IM system becomes as follows. Users can choose among three strategies, A, B and AB (in which both A and B are used). An
adopter of AB gets to use, on an edge-by-edge basis, whichever of A or B yields
higher payoffs in each interaction. However, AB has to pay a fixed cost c for
being bilingual. The payoffs now become as shown in Table 8.3.
Table 8.3 Payoff matrix
A B AB
A 0, 0
B 0, 0 q, q
AB
Let us consider that G is infinite and each node has a degree Δ. The per-edge cost for adopting AB is defined as r = c/Δ.
Reference [9] found that for values of q close to but less than 1/2, strategy A can cascade on the infinite line if r is sufficiently small or sufficiently large, but not if r takes values in some intermediate interval. In other words, strategy B (which represents the worse technology, since q < 1/2) will survive if and only if the per-edge cost r falls in this intermediate interval.
Fig. 8.10 The payoffs to node w on an infinite path with neighbours exhibiting behaviour A and B
Fig. 8.11 By dividing the a-c plot based on the payoffs, we get the regions corresponding to the different
choices
8.1.4.2 Choice Between AB and B
The graph in Fig. 8.12 gives us the situation where w is placed between a node
with behaviour AB and another exhibiting B.
Fig. 8.12 The payoffs to node w on an infinite path with neighbours exhibiting behaviour AB and B
Figure 8.14 summarises the possible cascade outcomes based on the values
of a and c. The possibilities are: (i) B is favoured by all nodes outside the initial
adopter set, (ii) A spreads directly without help from AB, (iii) AB spreads for one
step beyond the initial adopter set, but then B is favoured by all nodes after that,
(iv) AB spreads indefinitely to the right, with nodes subsequently switching to A.
Fig. 8.14 The plot shows the four possible outcomes for how A spreads or fails to spread on the infinite
path, indicated by this division of the (a, c)-plane into four regions
Models are developed to understand systems where users are faced with
alternatives and the costs and benefits of each depend on how many others
choose which alternative. This is further complicated when the idea of a
“threshold” is introduced. The threshold is defined as the proportion of neighbours who must take a decision before adoption of the same decision by the user yields net benefits that exceed the net costs for this user.
Reference [8] aimed to present a formal model that could predict, from the
initial distribution of thresholds, the final proportion of the users in a system
making each of the two decisions, i.e, finding a state of equilibrium. Let the
threshold be x, the frequency distribution be f(x), and the cumulative distribution
be F(x). Let the proportion of adopters at time t be denoted by r(t). This process
is described by the difference equation . When the frequency
distribution has a simple form, the difference equation can be solved to give an
expression for r(t) at any value of t. Then, by setting , the
equilibrium outcome can be found. However, when the functional form is not
simple, the equilibrium must be computed by forward recursion. Graphical
observation can be used to compute the equilibrium points instead of
manipulating the difference equations. Let’s start with the fact that we know r(t).
Since , this gives us . Repeating this process we find
the point at . This is illustrated in Fig. 8.15.
Fig. 8.15 Graphical method of finding the equilibrium point of a threshold distribution
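The forward recursion can also be carried out numerically, as sketched below. The threshold distribution used here (10% of the population adopts unconditionally, the remaining thresholds uniform) is purely illustrative.

```python
def equilibrium(F, r0=0.0, tol=1e-9, max_iter=100_000):
    """Forward recursion r(t+1) = F(r(t)) until the adopter share stops changing."""
    r = r0
    for _ in range(max_iter):
        r_next = F(r)
        if abs(r_next - r) < tol:
            break
        r = r_next
    return r

# Illustrative cumulative threshold distribution: 10% of the population adopts
# unconditionally, the rest have thresholds spread uniformly over (0, 1].
def F(r):
    return min(1.0, 0.1 + 0.9 * r)

print(equilibrium(F))    # converges to the equilibrium adopter share (here 1.0)
```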
Action 1 is the best response for a player exactly if the probability she assigns to her opponent choosing action 1 is at least the critical probability q (Eq. 8.4).
This changes the payoff matrix to the form tabulated in Table 8.5.
Table 8.5 Symmetric payoff matrix parametrized by critical probability q
0 1
0 q, q 0, 0
1 0, 0 1 − q, 1 − q
Contagion is said to occur if one action can spread from a finite set of players to
the whole population. There is a contagion threshold such that the contagion
occurs if and only if the payoff parameter q is less than this contagion threshold.
A group of players is said to be p-cohesive if every member of that group has at least a proportion p of its neighbours within the group. Reference [13] shows that the contagion threshold is the smallest p such that every “large” group (consisting of all but a finite set of players) contains an infinite, (1 − p)-cohesive subgroup. Additionally, the contagion threshold is the largest p such that it is possible to label players so that, for any player with a sufficiently high label, at least a proportion p of its neighbours has a lower label.
Contagion is most likely to occur if the contagion threshold is close to its upper bound of 1/2. Reference [13] shows that the contagion threshold will be close to 1/2 if two properties hold. First, there is low neighbour growth: the number of players who can be reached in k steps grows less than exponentially in k. This will occur if there is a tendency for players' neighbours' neighbours to be their own neighbours. Second, the local interaction system must be sufficiently uniform, i.e., there is some number α such that, for all players a long way from some core group, roughly a proportion α of their neighbours are closer to the core group.
Reference [1] deals with a polling game on a directed graph. At round 0, the
vertices of a subset are coloured white and all the other vertices are coloured
black. At each round, each vertex v is coloured according to the following rule. If at round r the vertex v has more than half of its neighbours coloured c, then at round r + 1 the vertex v will be coloured c. If at round r the vertex v has exactly half of its neighbours coloured white and the other half coloured black, then this is said to be a tie. In this case, v is coloured at round r + 1 by the same colour it had at round r. If at some finite round r all the vertices are white, then the initial white set is said to be a dynamic monopoly, or dynamo. The paper proves that for every n, there exists a graph with more than n vertices and with a dynamo of 18 vertices. In general, a dynamo of size O(1) (as a function of the number of vertices) is sufficient.
Reference [4] performed a study to understand the strength of weak ties
versus the strong ties. They found that the strength of weak ties is that they tend
to be long—they connect socially distant locations. Moreover, only a few long
ties are needed to give large and highly clustered populations the “degrees of
separation” of a random network, in which simple contagions, like disease or
information, can rapidly diffuse. It is tempting to regard this principle as a lawful
regularity, in part because it justifies generalization from mathematically
tractable random graphs to the structured networks that characterize patterns of
social interaction. Nevertheless, our research cautions against uncritical
generalization. Using Watts and Strogatz’s original model of a small world
network, they found that long ties do not always facilitate the spread of complex
contagions and can even preclude diffusion entirely if nodes have too few
common neighbours to provide multiple sources of confirmation or
reinforcement. While networks with long, narrow bridges are useful for
spreading information about an innovation or social movement, too much
randomness can be inefficient for spreading the social reinforcement necessary
to act on that information, especially as thresholds increase or connectedness
declines.
An informational cascade occurs when it is optimal for an individual, having observed the actions of those ahead of her, to follow the behaviour of the preceding individual without regard to her own information. Reference [2]
modelled the dynamics of imitative decision processes as informational cascades
and showed that at some stage a decision maker will ignore his private
information and act only on the information obtained from previous decisions.
Once this stage is reached, her decision is uninformative to others. Therefore, the
next individual draws the same inference from the history of past decisions; thus
if her signal is drawn independently from the same distribution as previous
individuals’, this individual also ignores her own information and takes the same
action as the previous individual. In the absence of external disturbances, so do
all later individuals.
The origin of large but rare cascades that are triggered by small initial shocks
is a phenomenon that manifests itself as diversely as cultural fads, collective
action, the diffusion of norms and innovations, and cascading failures in
infrastructure and organizational networks. Reference [17] presented a possible
explanation of this phenomenon in terms of a sparse, random network of
interacting agents whose decisions are determined by the actions of their
neighbours according to a simple threshold rule. Two regimes are identified in
which the network is susceptible to very large cascades also called global
cascades, that occur very rarely. When cascade propagation is limited by the
connectivity of the network, a power law distribution of cascade sizes is
observed. But when the network is highly connected, cascade propagation is
limited instead by the local stability of the nodes themselves, and the size
distribution of cascades is bimodal, implying a more extreme kind of instability
that is correspondingly harder to anticipate. In the first regime, where the
distribution of network neighbours is highly skewed, it was found that the most
connected nodes were far more likely than average nodes to trigger cascades, but
not in the second regime. Finally, it was shown that heterogeneity plays an
ambiguous role in determining a system’s stability: increasingly heterogeneous
thresholds make the system more vulnerable to global cascades; but an
increasingly heterogeneous degree distribution makes it less vulnerable.
To understand how social networks affect the spread of behaviour, [3] juxtaposed several hypotheses. One popular hypothesis stated that networks with
many clustered ties and a high degree of separation will be less effective for
behavioural diffusion than networks in which locally redundant ties are rewired
to provide shortcuts across the social space. A competing hypothesis argued that
when behaviours require social reinforcement, a network with more clustering
may be more advantageous, even if the network as a whole has a larger diameter.
To really understand the phenomenon, the paper investigated the effects of
network structure on diffusion by studying the spread of health behaviour
through artificially structured online communities. Individual adoption was
found to be much more likely when participants received social reinforcement
from multiple neighbours in the social network. The behaviour spread farther
and faster across clustered-lattice networks than across corresponding random
networks.
Reference [5] called the propagation requiring simultaneous exposure to
multiple sources of activation, multiplex propagation . This paper found that the
effect of random links makes the propagation more difficult to achieve.
Fig. 8.17 Contact network for branching process where high infection probability leads to widespread
Fig. 8.18 Contact network for branching process where low infection probability leads to the
disappearance of the disease
Second wave: Each person in the first wave meets with k different people, and each person in the first wave infects each person in the second wave independently with probability p.
Subsequent waves: Further waves are formed in a similar manner where each
person in the preceding wave meets with k different people and transmits the
infection independently with probability p.
The contact network in Fig. 8.16 illustrates the transmission of an epidemic.
In this tree network, the root node represents the disease carrier. The first wave
is represented by this root node’s child nodes. Each of these child nodes denote
the parent nodes for the second wave. The set of nodes in every subsequent
depth depict each wave.
Figure 8.17 depicts a contact network where the probability of infection is
high. This leads the disease to become widespread and multiply at every level,
possibly reaching a level where all nodes are infected. In stark contrast, Fig. 8.18
signifies a contact network where the probability is low. This causes the disease
to infect fewer and fewer individuals at every level until it completely vanishes.
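The following short Python sketch (not code from the book) simulates this branching-process picture directly: the carrier meets k people, each is infected independently with probability p, and every newly infected person repeats the process in the next wave. The function name and parameters are illustrative.

import random

def branching_process(k, p, max_waves=20, seed=42):
    """Return the number of infected individuals in each wave."""
    rng = random.Random(seed)
    wave_sizes = [1]                      # wave 0: the root carrier
    current = 1
    for _ in range(max_waves):
        # each infected person meets k new people, infecting each with probability p
        nxt = sum(1 for _ in range(current * k) if rng.random() < p)
        wave_sizes.append(nxt)
        current = nxt
        if current == 0:                  # the epidemic has died out
            break
    return wave_sizes

# pk > 1 tends to produce widespread infection, pk < 1 tends to die out
print(branching_process(k=3, p=0.5))
print(branching_process(k=3, p=0.2))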
Let $X_n$ denote the number of infected individuals in wave n, written as a sum of indicator random variables $I_{nj}$, one for each individual j in wave n:
$$X_n = \sum_{j} I_{nj}$$ (8.7)
Now, $E[I_{nj}] = \Pr[I_{nj} = 1]$, i.e., the expectation of each $I_{nj}$ is the probability of that individual j getting infected. Individual j at depth n gets infected when each of the n contacts leading from the root to j successfully transmits the disease. Since each contact transmits the disease independently with probability p, individual j is infected with probability $p^n$. Therefore, $E[I_{nj}] = p^n$. This leads us to conclude that
$$E[X_n] = \sum_{j} E[I_{nj}]$$ (8.8)
$$E[X_n] = k^n \cdot p^n = (pk)^n$$ (8.9)
Since the root is infected, we can take $X_0 = 1$ by considering that the root is at level 0. From here, Eq. 8.9 gives $E[X_n] = (pk)^n$ for every wave n. However, this expectation alone does not tell us the probability that the epidemic persists as n tends to infinity.
Let $q_n$ denote the probability that the epidemic is still active in wave n, and let $f(x) = 1 - (1 - px)^k$. Then, we can write $q_n = f(q_{n-1})$ with $q_0 = 1$. Figure 8.19 shows the plot of f(x) for the values 1, f(1), f(f(1)), ... obtained by repeated application of f.
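A minimal sketch of this fixed-point computation, assuming the recursion $q_n = f(q_{n-1})$ with $f(x) = 1 - (1 - px)^k$ reconstructed above: repeatedly applying f starting from 1 approximates the limiting persistence probability.

def persistence_probability(p, k, iterations=1000):
    q = 1.0
    for _ in range(iterations):
        q = 1.0 - (1.0 - p * q) ** k
    return q

# With pk <= 1 the iteration converges to 0; with pk > 1 it converges to a
# positive fixed point.
print(persistence_probability(p=0.2, k=3))   # pk = 0.6 -> approximately 0
print(persistence_probability(p=0.5, k=3))   # pk = 1.5 -> positive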
Fig. 8.20 A contact network where each edge has an associated period of time denoting the time of
contact between the connected vertices
8.2.6.1 Concurrency
In certain situations it is not just the timing of the contacts that matters but also the pattern of timing, which can influence the severity of the epidemic. A timing
pattern of particular interest is concurrency. A person is involved in concurrent
partnerships if she has multiple active partnerships that overlap in time. These
concurrent patterns cause the disease to circulate more vigorously through the
network and enable any node with the disease to spread it to any other.
Reference [7] proposed a model which when given as input the social
network graph with the edges labelled with probabilities of influence between
users could predict the time by which a user may be expected to perform an
action.
Problems
Generate the following two graphs with a random seed of 10:
50 An Erdös-Rényi undirected random graph with 10000 nodes and 100000 edges.
51 A preferential attachment graph with 10000 nodes with out-degree 10.
Let the nodes in each of these graphs have IDs ranging from 0 to 9999.
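One possible NetworkX setup for these two graphs is sketched below. The specific generators (gnm_random_graph and barabasi_albert_graph) are a reasonable interpretation of "Erdös-Rényi" and "preferential attachment with out-degree 10", not necessarily the exact generators the problem intends.

import networkx as nx

erdos_renyi = nx.gnm_random_graph(10000, 100000, seed=10)
pref_attach = nx.barabasi_albert_graph(10000, 10, seed=10)

# Node IDs are the integers 0..9999, as the problem requires.
print(erdos_renyi.number_of_nodes(), erdos_renyi.number_of_edges())
print(pref_attach.number_of_nodes(), pref_attach.number_of_edges())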
Assume that the graphs represent the political climate of an upcoming election between yourself and a rival with a total of 10000 voters. Most of the voters have already made up their mind: 4000 will vote for you, 4000 for your rival, and the remaining 2000 are undecided. Each voter's support is determined by the last digit of their node ID. If the last digit is 0, 2, 4 or 6, the node supports you. If the last digit is 1, 3, 5 or 7, the node supports your rival. And if the last digit is 8 or 9, the node is undecided.
Assume that the loyalties of the ones that have already made up their minds
are steadfast. There are 10 days to the election and the undecided voters will
choose their candidate in the following manner:
1.
In each iteration, every undecided voter decides on a
candidate. Voters are processed in increasing order of node ID.
For every undecided voter, if the majority of their friends
support you, they now support you. If the majority of their
friends support your rival, they now support your rival.
2.
If a voter has an equal number of friends supporting you and
your rival, support is assigned in an alternating fashion, starting
from yourself. In other words, the first tie leads to support for
you, the second tie leads to support for your rival, the third for
you, the fourth for your rival, and so on.
3.
When processing the updates, the values from the current
iteration are used.
4.
There are 10 iterations of the process described above. One
happening on each day.
5.
The 11th day is the election day, and the votes are counted.
52 Perform these configurations and iterations, and compute who wins in the first graph and by how much. Similarly, compute the votes for the second graph.
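A sketch of one way to implement the decision process (building on the graphs generated in the earlier sketch; helper names are illustrative and the handling of ties among decided-versus-undecided friends follows one reasonable reading of the rules):

def initial_support(node_id):
    last = node_id % 10
    if last in (0, 2, 4, 6):
        return 1        # supports you
    if last in (1, 3, 5, 7):
        return -1       # supports your rival
    return 0            # undecided

def run_election(graph, days=10):
    support = {v: initial_support(v) for v in graph.nodes()}
    tie_breaker = 1                       # first tie goes to you, then alternates
    for _ in range(days):
        for v in sorted(graph.nodes()):   # increasing order of node ID
            if initial_support(v) != 0:   # decided voters never change
                continue
            tally = sum(support[u] for u in graph.neighbors(v))
            if tally > 0:
                support[v] = 1
            elif tally < 0:
                support[v] = -1
            else:
                support[v] = tie_breaker  # current-iteration values are used
                tie_breaker = -tie_breaker
    you = sum(1 for s in support.values() if s == 1)
    rival = sum(1 for s in support.values() if s == -1)
    return you, rival

you, rival = run_election(erdos_renyi)
print("margin on the Erdös-Rényi graph:", you - rival)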
Let us say that you have a total funding of Rs. 9000, and you have decided to spend this money by hosting a live stream. Unfortunately, only the voters with IDs 3000–3099 will watch it. However, your stream is so persuasive that any voter who sees it will immediately decide to vote for you, regardless of whether they had decided to vote for yourself, your rival, or were undecided. It costs Rs. 1000 to reach 10 voters in sequential order, i.e., the first Rs. 1000 reaches voters 3000–3009, the second Rs. 1000 reaches voters 3010–3019, and so on. In other words, a total of Rs. k reaches voters with IDs from 3000 to 3000 + k/100 − 1. The live stream happens before the 10 day period, and the persuaded voters never change their mind.
53 Simulate the effect of spending on the two graphs. First, read in the two
graphs again and assign the initial configurations as before. Now, before the
decision process, you purchase Rs. k of ads and go through the decision process
of counting votes.
For each of the two social graphs, plot Rs. k (the amount you spend) on the x-axis (for values k = 1000, 2000, ..., 9000) and the number of votes you win by on the y-axis (that is, the number of votes for yourself less the number of votes for your rival). Put these on the same plot. What is the minimum amount you can spend to win the election in each of these graphs?
Instead of general campaigning, you decide to target your campaign. Let’s say
you have a posh Rs. 1000 per plate event for the high rollers among your voters
(the people with the highest degree). You invite high rollers in decreasing order
of their degree, and your event is so spectacular that anyone who comes to your event is instantly persuaded to vote for you regardless of their previous decision.
This event happens before the decision period. When there are ties between
voters of the same degree, the high roller with lowest node ID gets chosen first.
54 Simulate the effect of the high roller dinner on the two graphs. First, read
in the graphs and assign the initial configuration as before. Now, before the
decision process, you spend Rs. k on the fancy dinner and then go through the
decision process of counting votes.
For each of the two social graphs, plot Rs. k (the amount you spend) on the x-axis (for values k = 1000, 2000, ..., 9000) and the number of votes you win by on the y-axis (that is, the number of votes for yourself less the number of votes for your rival). What is the minimum amount you can spend to win the election in each of the two social graphs?
Assume that a mob has to choose between two behaviours, riot or not. However, this behaviour depends on a threshold which varies from one individual to another, i.e., an individual i has a threshold $t_i$ that determines whether or not to participate. If there are at least $t_i$ individuals rioting, then i will also participate, otherwise i will refrain from the behaviour.
Assume that each individual has full knowledge of the behaviour of all the other nodes in the network. In order to explore the impact of thresholds on the final number of rioters, for a mob of n individuals, the histogram of thresholds N(t) is defined, where N(t) expresses the number of individuals that have threshold t. For example, N(0) is the number of people who riot no matter what, N(1) is the number of people who will riot if one person is rioting, and so on.
Let T = [1, 1, 1, 1, 1, 4, 1, 0, 4, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 4, 0, 1, 4, 0, 1, 1, 1, 4, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 4, 1, 1, 4, 1, 4, 0, 1, 0, 1, 1, 1, 0, 4, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 4, 0, 4, 0, 0, 1, 1, 1, 4, 0, 4, 0] be the vector of thresholds of 101 individuals.
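A short sketch of the cascade computation for this threshold model: starting from the people who riot unconditionally, keep adding every individual whose threshold is at most the current number of rioters until no one else joins. The function name is illustrative.

def final_rioters(thresholds):
    rioters = 0
    while True:
        new_count = sum(1 for t in thresholds if t <= rioters)
        if new_count == rioters:
            return rioters
        rioters = new_count

T = [1, 1, 1, 1, 1, 4, 1, 0, 4, 1]   # replace with the full vector T given above
print(final_rioters(T))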
References
1. Berger, Eli. 2001. Dynamic monopolies of constant size. Journal of Combinatorial Theory, Series B 83
(2): 191–200.
[MathSciNet][Crossref]
2. Bikhchandani, Sushil, David Hirshleifer, and Ivo Welch. 1992. A theory of fads, fashion, custom, and
cultural change as informational cascades. Journal of Political Economy 100 (5): 992–1026.
[Crossref]
3. Centola, Damon. 2010. The spread of behavior in an online social network experiment. Science 329
(5996): 1194–1197.
[Crossref]
4. Centola, Damon, and Michael Macy. 2007. Complex contagions and the weakness of long ties.
American Journal of Sociology 113 (3): 702–734.
[Crossref]
5. Centola, Damon, Víctor M. Eguíluz, and Michael W. Macy. 2007. Cascade dynamics of complex
propagation. Physica A: Statistical Mechanics and its Applications 374 (1): 449–456.
[Crossref]
6. Goetz, Michaela, Jure Leskovec, Mary McGlohon, and Christos Faloutsos. 2009. Modeling blog
dynamics. In ICWSM.
7. Goyal, Amit, Francesco Bonchi, and Laks V.S. Lakshmanan. 2010. Learning influence probabilities in
social networks. In Proceedings of the third ACM international conference on Web search and data
mining, 241–250. ACM.
8. Granovetter, Mark. 1978. Threshold models of collective behavior. American Journal of Sociology 83
(6): 1420–1443.
[Crossref]
9. Immorlica, Nicole, Jon Kleinberg, Mohammad Mahdian, and Tom Wexler. 2007. The role of
compatibility in the diffusion of technologies through social networks. In Proceedings of the 8th ACM
conference on Electronic commerce, 75–83. ACM.
10. Leskovec, Jure, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. 2007.
Patterns of cascading behavior in large blog graphs. In Proceedings of the 2007 SIAM international
conference on data mining, 551–556. SIAM.
11. Leskovec, Jure, Ajit Singh, and Jon Kleinberg. 2006. Patterns of influence in a recommendation
network. In Pacific-Asia conference on knowledge discovery and data mining, 380–389. Berlin:
Springer.
12. Miller, Mahalia, Conal Sathi, Daniel Wiesenthal, Jure Leskovec, and Christopher Potts. 2011.
Sentiment flow through hyperlink networks. In ICWSM.
13. Morris, Stephen. 2000. Contagion. The Review of Economic Studies 67 (1): 57–78.
[MathSciNet][Crossref]
14. Myers, Seth A., and Jure Leskovec. 2012. Clash of the contagions: Cooperation and competition in
information diffusion. In 2012 IEEE 12th international conference on data mining (ICDM), 539–548.
IEEE.
15. Myers, Seth A., Chenguang Zhu, and Jure Leskovec. 2012. Information diffusion and external
influence in networks. In Proceedings of the 18th ACM SIGKDD international conference on
Knowledge discovery and data mining, 33–41. ACM.
16. Romero, Daniel M., Brendan Meeder, and Jon Kleinberg. 2011. Differences in the mechanics of
information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In
Proceedings of the 20th international conference on World wide web, 695–704. ACM.
17. Watts, Duncan J. 2002. A simple model of global cascades on random networks. Proceedings of the
National Academy of Sciences 99 (9): 5766–5771.
[MathSciNet][Crossref]
https://doi.org/10.1007/978-3-319-96746-2_9
9. Influence Maximisation
Reference [4] models each customer's probability of purchasing a product as depending both on the intrinsic desirability of the product for that customer and on the influence of the customer's neighbours (Eqs. 9.1–9.3). For calculating the optimal marketing plan for a product that has not yet been introduced to the market, this probability must be estimated from the model alone, before any purchase data for the product is available (Eqs. 9.4–9.6).
The goal is to find the marketing plan that maximises profit. Assume that M is a boolean vector, c is the cost of marketing to a customer, $r_0$ is the revenue from selling to the customer if no marketing action is performed, and $r_1$ is the revenue if marketing is performed. $r_0$ and $r_1$ are the same unless the marketing action includes offering a discount. Let $f_i^1(M)$ be the result of setting $M_i$ to 1 and leaving the rest of M unchanged, and similarly for $f_i^0(M)$. The expected lift in profit from marketing to customer i, considering only that customer, is
$$ELP_i(Y, M) = r_1 P(X_i = 1 \mid Y, f_i^1(M)) - r_0 P(X_i = 1 \mid Y, f_i^0(M)) - c$$ (9.7)
This equation gives the intrinsic value of the customer. Let $M_0$ be the null vector. The global lift in profit that results from a particular choice M of customers to market to is then
$$ELP(Y, M) = \sum_{i} r_i P(X_i = 1 \mid Y, M) - \sum_{i} r_0 P(X_i = 1 \mid Y, M_0) - c|M|$$ (9.8)
where $r_i = r_1$ if $M_i = 1$, $r_i = r_0$ if $M_i = 0$, and |M| is the number of ones in M. The total value of a customer is computed as $ELP(Y, f_i^1(M)) - ELP(Y, f_i^0(M))$. The network value is the difference between this total value and the intrinsic value of Eq. 9.7:
$$ELP(Y, f_i^1(M)) - ELP(Y, f_i^0(M)) - ELP_i(Y, M)$$ (9.9)
The global lift in profit is
(9.10)
To find the optimal M that maximises ELP, we must try all possible combinations of assignments to its components. However, this is intractable, so [4] proposes the following approximate procedures:
Single pass : For each i, set $M_i = 1$ if $ELP_i(Y, f_i^1(M_0)) > 0$ and set $M_i = 0$ otherwise.
Greedy search : Set $M = M_0$. Loop through the $M_i$'s, setting each $M_i$ to 1 if $ELP(Y, f_i^1(M)) > ELP(Y, M)$. Continue looping until there are no changes in a complete scan of the $M_i$'s.
Hill climbing search : Set $M = M_0$. Set $M_{i_1} = 1$ where $i_1 = \arg\max_i ELP(Y, f_i^1(M))$. Now, set $M_{i_2} = 1$ where $i_2 = \arg\max_i ELP(Y, f_i^1(f_{i_1}^1(M)))$, and so on, until no single additional assignment increases the expected lift in profit.
The effect that marketing to a person has on the rest of the network is
independent of the marketing actions to other customers. From a customer’s
network effect, we can directly compute whether he is worth marketing to. Let
the be the network effect of customer i for a product with attributes Y. It
is defined as the total increase in probability of purchasing in the network
(including ) that results from a unit change in :
(9.11)
Since is the same for any M, we define it for . Reference [4]
calculates as
(9.12)
is initially set to 1 for all i, then recursively re-calculated using Eq. 9.12
until convergence. is defined to be the immediate change in customer
i’s probability of purchasing when he is marketed to with marketing action z:
(9.13)
From Eq. 9.11, and given that varies linearly with
, the change in the probability of purchasing across the entire
network is then
(9.14)
So the total lift in profit becomes
(9.15)
This approximation is exact when r(z) is constant, which is the case in any
marketing scenario that is advertising-based. When this is the case, the equation
simplifies to:
(9.16)
With Eq. 9.16, we can directly estimate customer i’s lift in profit for any
marketing action z. To find the z that maximises the lift in profit, we take the
derivative with respect to z and set it equal to zero, resulting in:
(9.17)
Assuming r(z) is differentiable, this allows us to directly calculate the z which maximises the lift in profit, which is the optimal marketing action for customer i in the M that maximizes ELP(Y, M). Hence, from the customers' network effects, we can directly calculate the optimal marketing plan.
Collaborative filtering systems were used to identify the items to recommend to
customers. In these systems, users rate a set of items, and these ratings are used
to recommend other items that the user might be interested in. The quantitative
collaborative filtering algorithm proposed in [11] was used in this study. The
algorithm predicts a user’s rating of an item as the weighted average of the
ratings given by similar users, and recommends these items with high predicted
ratings.
These methodologies were applied to the problem of marketing movies using the EachMovie collaborative filtering database. EachMovie contains 2.8 million ratings of 1628 movies by 72916 users, collected between 1996 and 1997 by the eponymous recommendation site. EachMovie consists of three databases: one contains the ratings (on a scale of zero to five), one contains the demographic information, and one contains information about the movies. The methods, with certain modifications, were also applied to the Epinions dataset.
In summary, the goal is to market to a good customer. Although in a general
sense, the definition of good is a subjective one, this paper uses the following
operating definitions. A good customer is one who satisfies the following
conditions: (i) Likely to give the product a high rating, (ii) Has a strong weight
in determining the rating prediction for many of her neighbours, (iii) Has many
neighbours who are easily influenced by the rating prediction they receive, (iv)
Will have high probability of purchasing the product, and thus will be likely to
actually submit a rating that will affect her neighbours, (v) has many neighbours
with the aforementioned four characteristics, (vi) will enjoy the marketed movie,
(vii) has many close friends, (viii) these close friends are easily swayed, (ix) the
friends will very likely see this movie, and (x) has friends whose friends have
these properties.
In the linear threshold model, a node v is influenced by each of its neighbours w according to a weight $b_{v,w}$, and the weights of v's neighbours sum to at most 1. Each node v chooses a threshold $\theta_v$ uniformly at random from the interval [0, 1], which represents the weighted fraction of v's neighbours that must become active in order for v to become active. Given a random choice of thresholds, and an initial set of active nodes $A_0$ with all other nodes inactive, the diffusion process unfolds deterministically in discrete steps. In step t, all nodes that were active in step $t-1$ remain active, and we activate any node v for which the total weight of its active neighbours is at least $\theta_v$, such that
$$\sum_{w\ \text{active neighbour of}\ v} b_{v,w} \ge \theta_v$$ (9.18)
Additionally, the independent cascade model is also considered sometimes. This model also starts with an initial active set of nodes $A_0$, and the process unfolds in discrete steps according to the following randomized rule. When node v first becomes active in step t, it is given a single chance to activate each currently inactive neighbour w; it succeeds with probability $p_{v,w}$. If v succeeds, then w will become active in step $t+1$. Whether or not v succeeds, it cannot make any further attempts to activate w in subsequent rounds. This process runs until no more activations are possible.
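A minimal simulation of the independent cascade model is sketched below, assuming a single uniform activation probability prob for every edge (the general model allows a separate $p_{v,w}$ per edge); the function and parameter names are illustrative.

import random

def independent_cascade(graph, seeds, prob, rng=None):
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for v in frontier:
            for w in graph.neighbors(v):
                # v gets exactly one chance to activate each inactive neighbour
                if w not in active and rng.random() < prob:
                    active.add(w)
                    new_frontier.append(w)
        frontier = new_frontier
    return active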
We define the influence of a set of nodes A, denoted by $\sigma(A)$, as the expected number of active nodes at the end of the process. We need to find, for a given value of k, a k-node set of maximum influence. It is NP-hard to determine the optimum for influence maximisation, and an approximation can be done to within a factor of $(1 - 1/e - \epsilon)$, where e is the base of the natural logarithm and $\epsilon$ is any positive real number.
This property can be extended to show that for any $\epsilon > 0$, there is a $\gamma > 0$ such that by using $(1+\gamma)$-approximate values for the function to be optimized, we obtain a $(1 - 1/e - \epsilon)$-approximation.
Theorem 16
For a given targeted set A, the following two distributions over sets of
nodes are the same:
1.
The distribution over active sets obtained by running the
linear threshold model to completion starting from A.
2.
The distribution over sets reachable from A via live-edge
paths, under the random selection of live edges defined
above.
The greedy algorithm was compared with heuristics based on degree
centrality, distance centrality and a crude baseline of choosing random nodes to
target. The degree centrality involves choosing nodes in order of decreasing
degrees, the distance centrality chooses nodes in order of increasing average
distance to other nodes in the network.
For computing $\sigma(A)$, the process was simulated 10000 times for each targeted set, re-choosing thresholds or edge outcomes pseudo-randomly from [0, 1] every time.
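A sketch of the greedy hill-climbing procedure of [8] follows: repeatedly add the node with the largest estimated marginal gain in $\sigma(A)$, where $\sigma(A)$ is estimated by Monte Carlo simulation of the cascade. It reuses the independent_cascade sketch above and uses far fewer simulation runs than the 10000 used in the paper, purely to keep the example cheap.

import random

def estimate_sigma(graph, seeds, prob, runs=100):
    return sum(len(independent_cascade(graph, seeds, prob,
                                        rng=random.Random(r)))
               for r in range(runs)) / runs

def greedy_influence_maximisation(graph, k, prob):
    A = []
    for _ in range(k):
        base = estimate_sigma(graph, A, prob)
        best_node, best_gain = None, -1.0
        for v in graph.nodes():
            if v in A:
                continue
            gain = estimate_sigma(graph, A + [v], prob) - base
            if gain > best_gain:
                best_node, best_gain = v, gain
        A.append(best_node)
    return A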
Figures 9.1, 9.2, 9.3 and 9.4 show the performance of the algorithms in the linear threshold model, the weighted cascade model, and the independent cascade model with probabilities 1% and 10%, respectively. It is observed that the greedy algorithm clearly surpasses the other heuristics.
Fig. 9.1 Results for the linear threshold model
Fig. 9.4 Results for the independent cascade model with probability 10%
From these results, the paper proposes a broader framework that generalizes
these models. The general threshold model is proposed. A node v’s decision to
become active can be based on an arbitrary monotone function of the set of
neighbours of v that are already active. Thus, associated with v is a monotone
threshold function $f_v$ that maps subsets of v's neighbour set to real numbers in [0, 1], subject to the condition that $f_v(\emptyset) = 0$. Each node v initially chooses $\theta_v$ uniformly at random from the interval [0, 1]. Now, however, v becomes active in step t if $f_v(S) \ge \theta_v$, where S is the set of neighbours of v that are active in step $t-1$. Thus, the linear threshold model is a special case of this general threshold model in which each threshold function has the form $f_v(S) = \sum_{w \in S} b_{v,w}$ for parameters $b_{v,w}$ satisfying $\sum_{w \in N(v)} b_{v,w} \le 1$.
(9.20)
we assumed that the order in which the active neighbours try to activate v does not affect their overall success probability. Hence, this value depends on the set S only, and we can define it as a function of S alone. Analogously, one can show that this general cascade model is equivalent to the general threshold model.
Reference [13] proposed mechanisms which take each user's true private information and use it to identify and reward the subset of initial adopters, thereby developing incentive compatible mechanisms that compare favourably against the optimal influence maximization solution. The experiments used Facebook network data that had almost a million nodes and over 72 million edges, together with a representative cost distribution that was obtained by running a simulated campaign on Amazon's Mechanical Turk platform.
References
1. Bakshy, Eytan, J.M. Hofman, W.A. Mason, and D.J. Watts. 2011. Everyone’s an influencer:
quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web
search and data mining, 65–74. ACM.
2. Chen, Wei, Yajun Wang, and Siyu Yang. 2009. Efficient influence maximization in social networks. In
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data
mining, 199–208. ACM.
3. Chen, Wei, Yifei Yuan, and Li Zhang. 2010. Scalable influence maximization in social networks under
the linear threshold model. In 2010 IEEE 10th international conference on data mining (ICDM), 88–
97. IEEE.
4. Domingos, Pedro, and Matt Richardson. 2001. Mining the network value of customers. In Proceedings
of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 57–
66. ACM.
5. Goldenberg, Jacob, Barak Libai, and Eitan Muller. 2001. Talk of the network: A complex systems look
at the underlying process of word-of-mouth. Marketing Letters 12 (3): 211–223.
[Crossref]
6. Goyal, Amit, Wei Lu, and Laks V.S. Lakshmanan. 2011. Simpath: An efficient algorithm for influence
maximization under the linear threshold model. In 2011 IEEE 11th international conference on data
mining (ICDM), 211–220. IEEE.
7. Granovetter, Mark S. 1977. The strength of weak ties. In Social networks, 347–367. Elsevier.
8. Kempe, David, Jon Kleinberg, and Éva Tardos. 2003. Maximizing the spread of influence through a
social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge
discovery and data mining, 137–146. ACM.
9. Leskovec, Jure, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral marketing.
ACM Transactions on the Web (TWEB) 1 (1): 5.
[Crossref]
10. Leskovec, Jure, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne Van Briesen, and Natalie
Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery and data mining, 420–429. ACM.
11. Resnick, Paul, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. Grouplens: An
open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on
computer supported cooperative work, 175–186. ACM.
12. Richardson, Matthew, and Pedro Domingos. 2002. Mining knowledge-sharing sites for viral marketing.
In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data
mining, 61–70. ACM.
13. Singer, Yaron. 2012. How to win friends and influence people, truthfully: Influence maximization
mechanisms for social networks. In Proceedings of the fifth ACM international conference on web
search and data mining, 733–742. ACM.
14. Watts, Duncan J., and Peter Sheridan Dodds. 2007. Influentials, networks, and public opinion
formation. Journal of Consumer Research 34 (4): 441–458.
[Crossref]
https://doi.org/10.1007/978-3-319-96746-2_10
(10.1)
The objective function to be minimized is the expected value computed over the
assumed probability distribution of the contaminations
(10.2)
where denotes the mathematical expectation of the minimum detection
time . Undetected events have no detection times.
(10.3)
where denotes the mean amount of water consumed by an individual
(L/day/person), is the evaluation time step (days), is the contaminant
concentration for node i at time step k (mg/L), is the dose rate multiplier for
node i and time step k, and N is the number of evaluation time steps prior to
detection. The series , has a mean of 1 and is intended to
model the variation in ingestion rate throughout the day. It is assumed that the
ingestion rate varies with the water demand rate at the respective node, thus
(10.4)
where is the water demand for time step k and node i, and is the average
water demand at node i.
A dose-response model is used to express the probability that any person
ingesting mass will be infected
(10.5)
where is the probability that a person who ingests contaminant mass will
become infected, is called the Probit slope parameter, W is the average body
mass (kg/person), is the dose that would result in a 0.5 probability of being
infected (mg/kg), and is the standard normal cumulative distribution function.
The population affected (PA), , for a particular contamination is calculated as
(10.6)
where is the population assigned to node i and V is the total number of nodes.
The objective function to be minimized is the expected value of computed
over the assumed probability distribution of contamination events
(10.7)
where denotes the mathematical expectation of the affected population
.
(10.8)
where denotes the total volumetric water demand that exceeds a predefined
hazard concentration threshold C; and is the mathematical expectation of
.
(10.9)
The total cost of the selected sensor placement must not exceed a total budget of B, which is the maximum amount that can be spent on the sensors. The goal is to solve the optimisation function
(10.10)
An event i from a set I of contaminants originates from a node $s' \in V$, and spreads through the network, affecting other nodes. Eventually, it reaches a monitored node $s \in A \subseteq V$ and gets detected. Depending on the time of detection $t = T(i, s)$, and the impact on the network before the detection, a penalty $\pi_i(t)$ is incurred. The goal is to minimize the expected penalty over all possible contaminants i:
$$\pi(A) = \sum_{i \in I} P(i)\, \pi_i(T(i, A))$$ (10.11)
where, for $A \subseteq V$, $T(i, A) = \min_{s \in A} T(i, s)$ is the time until event i is detected by one of the sensors in A, and P is a given probability distribution over the events.
$\pi_i(t)$ is assumed to be monotonically nondecreasing in t, i.e., late detections are never preferred if they can be avoided. Also, $\pi_i(\infty)$ is set to some maximum penalty incurred for not detecting event i.
Instead of minimising the penalty $\pi(A)$ directly, the paper considers the penalty reduction $R_i(A) = \pi_i(\infty) - \pi_i(T(i, A))$, and the expected penalty reduction
$$R(A) = \sum_{i \in I} P(i)\, R_i(A)$$ (10.12)
which describes the expected reward received from placing sensors.
For the DL (detection likelihood) function, $\pi_i(t) = 0$ for finite t and $\pi_i(\infty) = 1$, i.e., no penalty is incurred if the outbreak is detected in finite time, otherwise the incurred penalty is 1. For the DT (detection time) measure, $\pi_i(t) = \min\{t, T_{max}\}$, where $T_{max}$ is the time horizon. The PA (population affected) criterion has $\pi_i(t)$ equal to the size of cascade i at time t, and $\pi_i(\infty)$ equal to the size of the cascade at the end of the horizon.
The penalty reduction function R(A) has several properties. First, $R(\emptyset) = 0$, i.e., we do not reduce the penalty if we do not place any sensors. Second, R is nondecreasing, i.e., $R(A) \le R(B)$ for all $A \subseteq B \subseteq V$. So, adding sensors can only decrease the penalty. Finally, for all $A \subseteq B \subseteq V$ and sensors $s \in V \setminus B$, $R(A \cup \{s\}) - R(A) \ge R(B \cup \{s\}) - R(B)$, i.e., R is submodular.
To solve the outbreak detection problem, we will have to simultaneously optimize multiple objective functions. Then, each placement A is evaluated by several reward functions $R_1(A), \ldots, R_m(A)$. However, there can arise a situation where two placements $A_1$ and $A_2$ are incomparable, i.e., $R_1(A_1) > R_1(A_2)$, but $R_2(A_1) < R_2(A_2)$. So, we hope for Pareto-optimal solutions. A is said to be Pareto-optimal if there does not exist a placement $A'$ such that $R_j(A') \ge R_j(A)$ for all j, and $R_j(A') > R_j(A)$ for some j. One approach to finding such Pareto-optimal solutions is scalarization . Here, one picks positive weights $\lambda_1, \ldots, \lambda_m$, and optimizes the objective $R(A) = \sum_j \lambda_j R_j(A)$. Any solution maximising R(A) is guaranteed to be Pareto-optimal.
The greedy algorithm starts with $A_0 = \emptyset$ and, in each iteration k, adds the sensor that most improves the score:
$$A_k = A_{k-1} \cup \Big\{\arg\max_{s \notin A_{k-1}} R(A_{k-1} \cup \{s\})\Big\}$$ (10.13)
The algorithm stops once it has selected B elements. For this unit cost case, the greedy algorithm is proved to achieve at least a $(1 - 1/e)$ fraction of the optimal score (Chap. 9). We will refer to this algorithm as the unit cost algorithm .
In the case where the nodes have non-constant costs, the greedy algorithm that iteratively adds sensors until the budget is exhausted can fail badly, since a very expensive sensor providing reward r is preferred over a much cheaper sensor providing reward $r - \varepsilon$. To avoid this, we modify Eq. 10.13 to take costs into account:
$$A_k = A_{k-1} \cup \Big\{\arg\max_{s \notin A_{k-1}} \frac{R(A_{k-1} \cup \{s\}) - R(A_{k-1})}{c(s)}\Big\}$$ (10.14)
So the greedy algorithm picks the element maximising the benefit/cost ratio. The algorithm stops once no element can be added to the current set A without exceeding the budget. Unfortunately, this intuitive generalization of the greedy algorithm can perform arbitrarily worse than the optimal solution. Consider the case where we have two locations, $s_1$ and $s_2$, with $c(s_1) = \varepsilon$ and $c(s_2) = B$. Also assume we have only one contaminant i, with $R(\{s_1\}) = 2\varepsilon$ and $R(\{s_2\}) = B$. Now, $R(\{s_1\})/c(s_1) = 2$, and $R(\{s_2\})/c(s_2) = 1$. Hence the greedy algorithm would pick $s_1$. After selecting $s_1$, we cannot afford $s_2$ anymore, and our total reward would be $2\varepsilon$. However, the optimal solution would pick $s_2$, achieving total penalty reduction of B. As $\varepsilon$ goes to 0, the performance of the greedy algorithm becomes arbitrarily bad. This algorithm is called the benefit-cost greedy algorithm .
The paper proposes the Cost-Effective Forward selection (CEF) algorithm. It computes a solution $A_{BC}$ using the benefit-cost greedy algorithm and a solution $A_{UC}$ using the unit-cost greedy algorithm. For both of these, CEF only considers elements which do not exceed the budget B. CEF then returns the solution with the higher score. Even though each of the two solutions can individually be arbitrarily bad, if R is a nondecreasing submodular function with $R(\emptyset) = 0$, then we get Eq. 10.15:
$$\max\{R(A_{BC}), R(A_{UC})\} \ge \frac{1}{2}\left(1 - \frac{1}{e}\right) \max_{A:\, c(A) \le B} R(A)$$ (10.15)
The bounds in the unit-cost and non-constant-cost cases are offline, i.e., we can state them in advance before running the actual algorithm. Online bounds on the performance can also be computed for an arbitrary placement of sensor locations.
(10.16)
This computes how far away the chosen placement is from the optimal solution. This is found to give a 31% bound.
Most outbreaks are sparse, i.e., they affect only a small area of the network, and hence are only detected by a small number of nodes. Hence, most nodes s do not reduce the penalty incurred by an outbreak i, i.e., $R_i(\{s\}) = 0$. However, this
sparsity is only present when penalty reductions are considered. If for each
sensor and contaminant we store the actual penalty , the
resulting representation is not sparse. By representing the penalty function R as
an inverted index, which allows fast lookup of the penalty reductions by sensor
s, the sparsity can be exploited. Therefore, the penalty reduction can be
computed as given in Eq. 10.17.
(10.17)
Even if we can quickly evaluate the score R(A) of any A, we still need to
perform a large number of these evaluations in order to run the greedy algorithm.
If we select k sensors among |V| locations, we roughly need k|V| function
evaluations. We can exploit submodularity further to require far fewer function
evaluations in practice. Assume we have computed the marginal increments $\delta_s = R(A \cup \{s\}) - R(A)$ for all $s \in V$. As our node selection A grows, the marginal increments $\delta_s$ (i.e., the benefits for adding sensor s) can never increase: for $A \subseteq B$, it holds that $R(A \cup \{s\}) - R(A) \ge R(B \cup \{s\}) - R(B)$. So instead of recomputing $\delta_s$ for every sensor after adding a sensor (and hence requiring |V| evaluations of R), we perform lazy evaluations: Initially, we mark all $\delta_s$ as invalid. When finding the next location to place a sensor, we go through the nodes in decreasing order of their $\delta_s$. If the $\delta_s$ for the top node s is invalid, we recompute it, and insert it into the existing order of the $\delta_s$ (e.g., by using a priority queue). In many cases, the recomputation of $\delta_s$ will lead to a new value which is not much smaller, and hence often, the top element will stay the top element even after recomputation. In this case, we have found a new sensor to add without having re-evaluated $\delta_s$ for
every location s. The correctness of this lazy procedure follows directly from
submodularity, and leads to far fewer (expensive) evaluations of R. This is called
the lazy greedy algorithm CELF (Cost-Effective Lazy Forward selection) . This
is found to have a factor 700 improvement in speed compared to CEF.
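A compact sketch of the lazy evaluations, assuming `reward` is any nondecreasing submodular set function R(A) (for example, a Monte Carlo estimate of the penalty reduction); it maintains a max-heap of cached marginal gains and only recomputes the gain of the node currently at the top. This is an illustrative implementation, not the authors' code.

import heapq

def celf(reward, candidates, k):
    A, current = [], reward(set())
    # heap entries: (negative cached gain, node, iteration at which gain was cached)
    heap = [(-(reward({s}) - current), s, 0) for s in candidates]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, s, last = heapq.heappop(heap)
            if last == it:                  # cached gain is fresh: select this node
                A.append(s)
                current += -neg_gain
                break
            # stale entry: lazily recompute its marginal gain and push it back
            gain = reward(set(A) | {s}) - current
            heapq.heappush(heap, (-gain, s, it))
    return A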
10.2.1 Blogspace
A dataset having 45000 blogs with 10.5 million posts and 16.2 million links was
taken. Every cascade has a single starting post, and other posts recursively join
by linking to posts within the cascade, whereby the links obey time order. We
detect cascades by first identifying starting post and then following in-links.
346,209 non-trivial cascades having at least 2 nodes were discovered. Since the cascade size distribution is heavy-tailed, the analysis was limited to only cascades that had at least 10 nodes. The final dataset had 17,589 cascades, where each blog participated in 9.4 different cascades on average.
Figure 10.1 shows the results when PA function is optimized. The offline
and the online bounds can be computed regardless of the algorithm used. CELF
shows that the solution obtained is provably close to the optimal solution. On the right, we have
the performance using various objective functions (from top to bottom: DL, DT,
PA). DL increases the fastest, which means that one only needs to read a few
blogs to detect most of the cascades, or equivalently that most cascades hit one
of the big blogs. However, the population affected (PA) increases much slower,
which means that one needs many more blogs to know about stories before the
rest of population does.
Fig. 10.1 (Left) Performance of CELF algorithm and offline and on-line bounds for PA objective
function. (Right) Compares objective functions
In Fig. 10.2, the CELF method is compared with several intuitive heuristic
selection techniques. The heuristics are: the number of posts on the blog, the
cumulative number of out-links of blog’s posts, the number of in-links the blog
received from other blogs in the dataset, and the number of out-links to other
blogs in the dataset. CELF is observed to greatly outperform the other methods.
For the figure on the right, given a budget of B posts, we select a set of blogs to optimize the PA objective. For the heuristics, a set of blogs optimizing the chosen heuristic was selected, e.g., the total number of in-links of the selected blogs, while still fitting inside the budget of B posts. Again, CELF outperforms the next best heuristic by a large margin.
Fig. 10.2 Heuristic blog selection methods. (Left) unit cost model, (Right) number of posts cost model
Fig. 10.3 (Left) CELF with offline and online bounds for PA objective. (Right) Different objective
functions
Figure 10.5 shows the scores achieved by CELF compared with several
heuristic sensor placement techniques, where the nodes were ordered by some
“goodness” criteria, and then the top nodes were picked for the PA objective
function. The following criteria were considered: population at the node, water
flow through the node, and the diameter and the number of pipes at the node.
CELF outperforms the best heuristic by a considerable margin.
Problems
Download the DBLP collaboration network dataset at
https://snap.stanford.edu/data/bigdata/communities/com-dblp.ungraph.txt.gz.
This exercise is to explore how varying the set of initially infected nodes in a
SIR model can affect how a contagion spreads through a network. We learnt in
Sect. 8.2.3 that under the SIR model, every node can be either susceptible,
infected, or recovered and every node starts off as either susceptible or infected.
Every infected neighbour of a susceptible node infects the susceptible node with probability β, and infected nodes can recover with probability δ. Recovered
nodes are no longer susceptible and cannot be infected again. The pseudocode is
as given in Algorithm 7.
57 Implement the SIR model in Algorithm 7 and run 100 simulations with fixed values of the parameters β and δ for each of the following three graphs:
1.
The graph for the network in the dataset (will be referred to as the real
world graph).
2.
An Erdös-Rényi random graph with the same number of nodes and
edges as the real world graph. Set a random seed of 10.
3.
A preferential attachment graph with the same number of nodes and
expected degree as the real world graph. Set a random seed of 10.
For each of these graphs, initialize the infected set with a single node chosen
uniformly at random. Record the total percentage of nodes that became infected
in each simulation. Note that a simulation ends when there are no more infected
nodes; the total percentage of nodes that became infected at some point is thus
the number of recovered nodes at the end of your simulation divided by the total
number of nodes in the network.
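A sketch of one SIR simulation run is given below (Algorithm 7 itself is not reproduced here, so this follows the verbal description above); the β and δ values passed in are placeholders, since the problem's specific values did not survive extraction.

import random

def run_sir(graph, initially_infected, beta, delta, seed=0):
    rng = random.Random(seed)
    susceptible = set(graph.nodes()) - set(initially_infected)
    infected = set(initially_infected)
    recovered = set()
    while infected:
        newly_infected = set()
        for v in susceptible:
            # each infected neighbour independently infects v with probability beta
            for u in graph.neighbors(v):
                if u in infected and rng.random() < beta:
                    newly_infected.add(v)
                    break
        newly_recovered = {v for v in infected if rng.random() < delta}
        susceptible -= newly_infected
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    # fraction of nodes that were infected at some point during the simulation
    return len(recovered) / graph.number_of_nodes()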
58 Repeat the above process, but instead of selecting a random starting node,
infect the node with the highest degree. Compute the total percentage of nodes
that became infected in each simulation.
References
1. Goyal, Amit, Wei Lu, and Laks V.S. Lakshmanan. 2011. Celf++: Optimizing the greedy algorithm for
influence maximization in social networks. In Proceedings of the 20th international conference
companion on World wide web, 47–48. ACM.
2. Krause, Andreas, and Carlos Guestrin. 2005. A note on the budgeted maximization of submodular functions.
3. Krause, Andreas, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. 2008.
Efficient sensor placement optimization for securing large water distribution networks. Journal of Water
Resources Planning and Management 134 (6): 516–526.
[Crossref]
4. Leskovec, Jure, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie
Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery and data mining, 420–429. ACM.
5. Ostfeld, Avi, James G. Uber, Elad Salomons, Jonathan W. Berry, William E. Hart, Cindy A. Phillips,
Jean-Paul Watson, Gianluca Dorini, Philip Jonkergouw, Zoran Kapelan, et al. 2008. The battle of the
water sensor networks (BWSN): A design challenge for engineers and algorithms. Journal of Water
Resources Planning and Management 134 (6): 556–568.
[Crossref]
An interesting question to ask is: which are the popular websites? Although popularity may be considered an elusive term with a lot of varying definitions, here we will restrict ourselves to taking a snapshot of the Web, counting the number of in-links to websites, and using this as a measure of a site's popularity.
Another way of looking at this is as follows: As a function of k, what fraction
of sites on the Web have k in-links? Since larger values of k indicate greater
popularity, this should give us the popularity distribution of the Web.
On first thought, one would guess that the popularity would follow a normal
or Gaussian distribution, since the probability of observing a value that exceeds
the mean by more than c times the standard deviation decreases exponentially in
c. Central Limit Theorem supports this fact because it states that if we take any
sequence of small independent random quantities, then in the limit their sum (or
average) will be distributed according to the normal distribution. So, if we
assume that each website decides independently at random whether to link to
any other site, then the number of in-links to a given page is the sum of many
independent random quantities, and hence we’d expect it to be normally
distributed. Then, by this model, the number of pages with k in-links should
decrease exponentially in k, as k grows large.
However, we learnt in Chap. 2 that this is not the case. The fraction of websites that have k in-links is approximately proportional to $1/k^2$ (the exponent is slightly larger than 2). A function that decreases as k raised to some fixed power, such as $1/k^2$ in the present case, is called a power law ; when used to measure the fraction of items having value k, it says that it is possible to see very large values of k.
Mathematically, a quantity x obeys a power law if it is drawn from a probability distribution given by Eq. 11.1
$$p(x) \propto x^{-\alpha}$$ (11.1)
where $\alpha$ is a constant parameter of the distribution known as the exponent or scaling parameter.
Going back to the question of popularity, this means that extreme imbalances
with very large values are likely to arise. This is true in reality because there are
a few websites that are greatly popular while compared to others. Additionally,
we have learnt in previous chapters that the degree distributions of several social
network applications exhibit a power law.
We could say that just as the normal distribution is widespread among
natural sciences settings, so is the case of power law when the network in
question has a popularity factor to it.
One way to quickly check if a dataset exhibits a power-law distribution is to generate a “log-log” plot, i.e., if we want to see whether f(k) follows a power law, we plot log f(k) as a function of log k. If this plot is approximately a straight line, whose slope gives the exponent (as shown in Fig. 11.1), then the dataset plausibly follows a power law.
Fig. 11.1 A log-log plot exhibiting a power law
1.
Estimate the parameters $x_{min}$ and $\alpha$ of the power-law model using maximum likelihood estimation (MLE) techniques (a sketch of this step follows after this list).
2.
Calculate the goodness-of-fit between the data and the
power law. If the resulting p-value is greater than 0.1 the power
law is a plausible hypothesis for the data, otherwise it is
rejected.
3.
Compare the power law with alternative hypotheses via a
likelihood ratio test. For each alternative, if the calculated
likelihood ratio is significantly different from zero, then its sign
indicates whether the alternative is favoured over the power-law
model or not. Other established and statistically principled
approaches for model comparison, such as a fully Bayesian
approach [9], a cross-validation approach [17], or a minimum
description length approach [8] can also be used instead.
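A minimal sketch of step 1 using the standard continuous MLE for $\alpha$, given a choice of $x_{min}$; the full recipe also selects $x_{min}$, for instance by minimising the Kolmogorov–Smirnov distance, and dedicated packages such as `powerlaw` automate the whole procedure. The data shown is a toy example.

import math

def alpha_mle(data, x_min):
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    # alpha_hat = 1 + n / sum(ln(x_i / x_min)) over the tail of the data
    return 1.0 + n / sum(math.log(x / x_min) for x in tail)

degrees = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]   # toy data, not a real network
print(alpha_mle(degrees, x_min=2))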
The power law has several monikers in various fields. It goes by the term Lotka
distribution for scientific productivity [15], Bradford law for journal use [4],
Pareto law of income distribution [16] and the Zipf law for literary word
frequencies.
11.1.1 Model A
Model A is the basic model which the subsequent models rely upon. It starts with no nodes and no edges at time step 0. At time step 1, a node with in-weight 1 and out-weight 1 is added. At each time step $t > 1$, with some fixed probability a new node with in-weight 1 and out-weight 1 is added; otherwise a new directed edge (u, v) is added between existing nodes. Here the origin u and the destination v are chosen with probability proportional to the current out-weight of u and the current in-weight of v at time t, respectively.
The total in-weight (out-weight) of graph in model A increases by 1 at a time
step. At time step t, both total in-weight and total out-weight are exactly t. So the
probability that a new edge is added onto two particular nodes u and v is exactly
as given in Eq. 11.2.
(11.2)
11.1.2 Model B
Model B is a slight improvement of Model A. Two additional positive constants are introduced so that different power-law exponents can be generated for the in-degrees and the out-degrees. With some fixed probability a new directed edge (u, v) is added to the existing nodes. Here the origin u (destination v) is chosen with probability proportional to the current out-weight of u (the current in-weight of v). The numbers of nodes and edges of the graph are random variables. The probability that a new edge is added onto two particular nodes u and v is as given in Eq. 11.3.
(11.3)
11.1.3 Model C
Now we consider Model C, this is a general model with four specified types of
edges to be added.
Assume that the random process of model C starts at time step $t_0$. At $t_0$, we start with an initial directed graph with some vertices and edges. At each subsequent time step, a new vertex is added and four bounded random numbers, one for each of the four types of edges below, are drawn according to some probability distribution. We then proceed as follows:
Add the first number of edges between existing vertices randomly. The origins are chosen with probability proportional to the current out-degree and the destinations are chosen proportional to the current in-degree.
Add the second number of edges into the new vertex randomly. The origins are chosen with probability proportional to the current out-degree and the destinations are the new vertex.
Add the third number of edges from the new vertex randomly. The destinations are chosen with probability proportional to the current in-degree and the origins are the new vertex.
Add the fourth number of loops to the new vertex.
11.1.4 Model D
Model A, B and C are all power law models for directed graphs. Here we
describe a general undirected model which we denote by Model D. It is a natural
variant of Model C. We assume that the random process of model D starts at time step $t_0$. At $t_0$, we start with an initial undirected graph with some vertices and edges. At each subsequent time step, a new vertex is added and three bounded random numbers, one for each of the three types of edges below, are drawn according to some probability distribution. Then we proceed as follows:
Add the first number of edges between existing vertices randomly. The vertices are chosen with probability proportional to the current degree.
Add the second number of edges randomly such that one vertex of each edge is the new vertex. The other one is chosen with probability proportional to the current degree.
Add the third number of loops to the new vertex.
1.
With probability , add a new vertex v together with an
edge from v to an existing vertex w, where w is chosen
according to .
2.
With probability , add an edge from an existing vertex v to
an existing vertex w, where v and w are chosen independently, v
according to , and w according to .
3. With probability , add a new vertex w and an edge from
an existing vertex v to w, where v is chosen according to
.
After t time steps the model leads to a random network with $t + m_0$ vertices and mt edges. This network is found to exhibit a power law with exponent $\gamma = 3$.
Due to preferential attachment, a vertex that acquired more connections than another one will increase its connectivity at a higher rate; thus an initial difference in the connectivity between two vertices will increase further as the network grows. The rate at which a vertex acquires edges is $\partial k_i / \partial t = k_i / 2t$, which gives $k_i(t) = m (t / t_i)^{0.5}$, where $t_i$ is the time at which vertex i was added to the network.
1.
Sites are created in order, and named .
2.
When site j is created, it produces a link to an earlier website according to the following probabilistic rule, governed by a probability p.
a.
With probability p, site j chooses a site i uniformly at
random among all earlier sites, and creates a link to this site
i.
b.
With probability $1 - p$, site j instead chooses a site i uniformly at random from among all earlier sites, and
creates a link to the site that i points to.
3.
This describes the creation of a single link from site j. One
can repeat this process to create multiple, independently
generated links from site j.
Step 2(b) is the key because site j is imitating the decision of site i. The main
result about this model is that if we run it for many sites, the fraction of sites with k in-links will be distributed approximately according to a power law $1/k^c$, where the value of the exponent c depends on the choice of p. This dependence
goes in an intuitive direction: as p gets smaller, so that copying becomes more
frequent, the exponent c gets smaller as well, making one more likely to see
extremely popular pages.
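A short simulation sketch of this copying model (illustrative, not the book's code): each new site links either to a uniformly random earlier site, with probability p, or to the site that a uniformly random earlier site points to, with probability 1 − p.

import random
from collections import Counter

def copying_model(n, p, seed=0):
    rng = random.Random(seed)
    target = {1: 1}                  # site 1 has no earlier site; link to itself
    for j in range(2, n + 1):
        i = rng.randrange(1, j)      # a uniformly random earlier site
        target[j] = i if rng.random() < p else target[i]
    return Counter(target.values())  # number of in-links received by each site

in_links = copying_model(100000, p=0.3)
# smaller p (more copying) yields heavier tails, i.e. a smaller exponent c
print(in_links.most_common(5))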
The process described in line 2(b) is an implementation of the rich-get-richer
or preferential attachment phenomenon. It is called the rich-get-richer process
because when j copies the decision of i, the probability of linking to some site l
is directly proportional to the total number of sites that currently link to l. So, the
probability that site l experiences an increase in popularity is directly
proportional to l’s current popularity. The term preferential attachment is given
because the links are formed preferentially to pages that already have high
popularity. Intuitively, this makes sense because the more well-known some
website is, the more likely that name comes up in other websites, and hence it is
more likely that websites will have a link to this well-known website.
This rich-get-richer phenomenon is not confined to settings governed by
human-decision making. Instead the population of cities, the intensities of
earthquakes, the sizes of power outages, and the number of copies of a gene in a
genome are some instances of natural occurrences of this process.
Popularity distribution is found to have a long tail as shown in Fig. 11.2.
This shows some nodes which have very high popularity when compared to the
other nodes. This popularity however drops of to give a long set of nodes who
have more or less the same popularity. It is this latter set of a long list of nodes
which contribute to the long tail.
1.
Initial condition: Since node j has no links when first created at time j, $X_j(j) = 0$.
2.
Expected change to $X_j$ over time: At time step $t + 1$, node j gains an in-link if and only if a link from this newly created node points to it. From step (2a) of the model, node $t + 1$ links to j with probability 1 / t. From step (2b) of the model, node $t + 1$ links to j with probability $X_j(t) / t$, since at the moment node $t + 1$ was created, the total number of links in the network was t and $X_j(t)$ of these point to j. Therefore, the overall probability that node $t + 1$ links to node j is as given in Eq. 11.4.
$$\frac{p}{t} + \frac{(1 - p)\, X_j(t)}{t}$$ (11.4)
However, this deals with the probabilistic case. For the deterministic case, we define the time as continuously running from 0 to N instead of the discrete steps considered in the probabilistic model. The function $X_j(t)$ of time in the discrete case is approximated in the continuous case as $x_j(t)$. The two properties of $x_j(t)$ are:
1.
Initial condition: Since we had $X_j(j) = 0$ in the probabilistic case, the deterministic case gets $x_j(j) = 0$.
2.
Growth equation: In the probabilistic case, when node $t + 1$ arrives, the number of in-links to j increases with probability given by Eq. 11.4. In the deterministic approximation provided by $x_j(t)$, the rate of growth is modeled by the differential equation given in Eq. 11.5.
$$\frac{dx_j}{dt} = \frac{p}{t} + \frac{(1 - p)\, x_j}{t}$$ (11.5)
Separating variables and integrating, we get
$$\ln\big(p + (1 - p)\, x_j\big) = (1 - p) \ln t + c$$ (11.6)
and hence, using the initial condition $x_j(j) = 0$,
$$x_j(t) = \frac{p}{1 - p}\left(\left(\frac{t}{j}\right)^{1 - p} - 1\right)$$ (11.7)
Now we ask the following question: For a given value of k, and time t, what fraction of all nodes have at least k in-links at time t? This can alternately be formulated as: For a given value of k, and time t, what fraction of all functions $x_j$ satisfy $x_j(t) \ge k$?
Equation 11.7 shows that $x_j(t) \ge k$ corresponds to the inequality $j \le t\left(\frac{1 - p}{p}k + 1\right)^{-\frac{1}{1 - p}}$, so the fraction of nodes with at least k in-links is
$$\frac{1}{t} \cdot t\left(\frac{1 - p}{p}k + 1\right)^{-\frac{1}{1 - p}} = \left(\frac{1 - p}{p}k + 1\right)^{-\frac{1}{1 - p}}$$ (11.8)
From this, by differentiating with respect to k, we can infer that the deterministic model predicts that the fraction of nodes with exactly k in-links is proportional to $k^{-\left(1 + \frac{1}{1 - p}\right)}$, i.e., a power law with exponent $1 + \frac{1}{1 - p}$.
With high probability over the random formation of links, the fraction of nodes with k in-links is proportional to $k^{-\left(1 + \frac{1}{1 - p}\right)}$. When p is close to 1, the link formation
1.
A set V of vertices where each vertex has an
associated interval D(v) on the time axis called the duration of
v.
2.
A set E of edges where each edge is a triplet (u, v, t) where u and v are vertices in V and t is a point in time in the interval $D(u) \cap D(v)$.
1.
Identify dense subgraphs in the time graph of the Blogspace
which will correspond to all potential communities. However,
this will result in finding all the clusters regardless of whether
or not they are bursty.
2.
The time graph was generated for blogs from seven blog sites. The resulting
graph consisted of 22299 vertices, 70472 unique edges and 777653 edges
counting multiplicity. Applying the steps to this graph, the burstiness of the
graph is as shown in Fig. 11.3.
$$|E|(t) \propto |V|(t)^{a}$$ (11.9)
where |E|(t) and |V|(t) denote the number of edges and the number of vertices at time step t, and a is an exponent that generally lies between 1 and 2. This relation is referred to as the Densification power-law or growth power-law .
Twelve different datasets from seven different sources were considered for
the study. These included HEP-PH and HEP-TH arXiv citation graphs, a citation
graph for U.S. utility patents, a graph of routers comprising the Internet, five
bipartite affiliation graphs of authors with papers they authored for ASTRO-PH,
HEP-TH, HEP-PH, COND-MAT and GR-QC, a bipartite graph of actors-to-
movies corresponding to IMDB, a person to person recommendation graph, and
an email communication network from an European research institution.
Figure 11.4 depicts the plot of average out-degree over time. We observe an
increase indicating that graphs become dense. Figure 11.5 illustrates the log-log
plot of the number of edges as a function of the number of vertices. They all
obey the densification power law. This could mean that densification of graphs
is an intrinsic phenomenon.
Fig. 11.4 Average out-degree over time. Increasing trend signifies that the graph are densifying
Fig. 11.5 Number of edges as a function of the number of nodes in log-log scale. They obey the
densification power law
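A small sketch for estimating the densification exponent a from growth snapshots of a graph: fit a line to log |E|(t) versus log |V|(t). The snapshot data below is synthetic, purely for illustration.

import numpy as np

def densification_exponent(num_nodes, num_edges):
    """num_nodes[t], num_edges[t] are the graph sizes at snapshot t."""
    slope, _ = np.polyfit(np.log(num_nodes), np.log(num_edges), 1)
    return slope

# toy snapshots with |E| roughly proportional to |V|**1.3
nodes = np.array([1000, 2000, 4000, 8000, 16000])
edges = (nodes ** 1.3).astype(int)
print(densification_exponent(nodes, edges))   # close to 1.3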
In this paper, for every $d \ge 0$, g(d) denotes the fraction of connected node pairs whose shortest connecting path has length at most d. The effective diameter of the network is defined as the value of d at which this function g(d) achieves the value 0.9. So, if D is a value where $g(D) = 0.9$, then the graph has effective diameter D. Figure 11.6 shows the effective diameter over time. A decrease in
diameter can be observed from the plots. Since all of these plots exhibited a
decrease in the diameter, it could be that the shrinkage was an inherent property
of networks.
Fig. 11.6 Effective diameter over time for 6 different datasets. There is a consistent decrease of the
diameter over time
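A sketch of how the effective diameter can be estimated: take the smallest d for which at least 90% of connected node pairs are within distance d, here approximated by sampling source nodes instead of computing all pairwise shortest paths (the sample size and the example graph are arbitrary choices).

import random
import networkx as nx

def effective_diameter(graph, q=0.9, samples=500, seed=0):
    rng = random.Random(seed)
    nodes = list(graph.nodes())
    distances = []
    for s in rng.sample(nodes, min(samples, len(nodes))):
        lengths = nx.single_source_shortest_path_length(graph, s)
        distances.extend(d for d in lengths.values() if d > 0)
    distances.sort()
    return distances[int(q * (len(distances) - 1))]

print(effective_diameter(nx.barabasi_albert_graph(2000, 3, seed=10)))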
To verify that the shrinkage of diameters was not intrinsic to the datasets,
experiments were performed to account for:
1.
Possible sampling problems: Since computing shortest
paths among all node pairs is computationally prohibitive for
these graphs, several different approximate methods were
applied, obtaining almost identical results from all of them.
2.
Effect of disconnected components: These large graphs have
a single giant component. To see if the disconnected
components had a bearing on the diameter, the effective
diameter for the entire graph and just the giant component were
computed. The values were found to be the same.
3. Effects of missing past: With almost any dataset, one does
not have data reaching all the way back to the network’s birth.
This is referred to as the problem of missing past . This means
that there will be edges pointing to nodes prior to the beginning
of the observation period. Such nodes and edges are referred to
as phantom nodes and phantom edges respectively.
To understand how the diameters of our networks are
affected by this unavoidable problem, we perform the following
test. We pick some positive time $t_0$ and determine what the diameter would look like as a function of time if this were the beginning of our data. We then put back in the nodes and the edges from before time $t_0$ and study how much the diameters change. If this change is small, then it provides evidence that the missing past is not influencing the overall result. The cut-off times $t_0$ were suitably set for the datasets and the results of three measurements were compared:
a.
Diameter of full graph: For each time t, the
effective diameter of the graph’s giant
component was computed.
b.
Post-$t_0$ subgraph: The effective diameter of the post-$t_0$ subgraph using all nodes and edges was computed. For each time $t \ge t_0$, a graph using all nodes dated before t was created. The effective diameter of the subgraph of the nodes dated between $t_0$ and t was computed. To compute the effective diameter, we can use all edges and nodes (including those dated before $t_0$). This means that we are measuring distances only among nodes dated between $t_0$ and t while also using nodes and edges before $t_0$ as shortcuts or bypasses. The experiment measures the diameter of the graph if we knew the full (pre-$t_0$) past, including the citations of the papers which we have intentionally excluded for this test.
c.
Post-$t_0$ subgraph, no past: We set $t_0$ the same way as in the previous experiment, but then, for all nodes dated before $t_0$, we delete all their outlinks. This creates the graph we would have gotten if we had started collecting data only at time $t_0$.
In Fig. 11.6, we superimpose the effective
diameters using the three different techniques.
If the missing past does not play a large role in
the diameter, then all three curves should lie
close to one another. We observe this is the
case for the arXiv citation graphs. For the
arXiv paper-author affiliation graph and for the
patent citation graph, the curves are quite
different right at the cut-off time $t_0$, but they
quickly align with one another. As a result, it
seems clear that the continued decreasing trend
in the effective diameter as time runs to the
present is not the result of these boundary
effects.
4.
Emergence of the giant component: We have learnt in Chap. 3 that in the
Erdös-Rényi random graph model, the diameter of the giant component is
quite large when it first appears, and then it shrinks as edges continue to be
added. Therefore, are shrinking diameters simply a symptom of the emergence of
the giant component?
Figure 11.7 shows us that this is not the case. It plots the size of
the GCC for the full graph and for a graph with no past, i.e., one in which
we delete all out-links of the nodes dated before the cut-off time t0. Because
we delete the out-links of the pre-t0 nodes, the size of the GCC is smaller,
but, as the graph grows, the effect of these deleted links becomes negligible.
Within a few years, the giant component accounts for almost all the
nodes in the graph. The effective diameter, however, continues to steadily
decrease beyond this point. This indicates that the decrease is happening in a
mature graph and not because many small disconnected components are
being rapidly glued together.
Fig. 11.7 The fraction of nodes that are part of the giant connected component over time. Observe that
after 4 years, nearly all nodes in the graph belong to the giant component
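A rough way to reproduce this kind of measurement is sketched below using networkx. It is only a sketch under stated assumptions: the dataset is assumed to be available as a list of time-stamped edges, and the effective diameter (taken here as the 90th percentile of pairwise distances) is approximated by BFS from a random sample of sources, since exact all-pairs shortest paths are prohibitive, as noted above.

```python
import random
import networkx as nx

def effective_diameter(G, q=0.9, n_samples=100):
    """Approximate effective diameter: the q-th percentile of pairwise
    shortest-path distances, estimated by BFS from a random sample of
    source nodes (exact all-pairs computation is too costly)."""
    nodes = list(G.nodes())
    dists = []
    for src in random.sample(nodes, min(n_samples, len(nodes))):
        dists.extend(nx.single_source_shortest_path_length(G, src).values())
    dists = sorted(d for d in dists if d > 0)
    return dists[int(q * (len(dists) - 1))] if dists else 0

def diameter_over_time(timestamped_edges, snapshot_times):
    """timestamped_edges: iterable of (u, v, t) tuples.  Returns the
    effective diameter of the giant component of the graph grown up to
    each snapshot time."""
    results = []
    for T in snapshot_times:
        G = nx.Graph((u, v) for u, v, t in timestamped_edges if t <= T)
        if len(G) == 0:
            results.append((T, 0))
            continue
        gcc = G.subgraph(max(nx.connected_components(G), key=len))
        results.append((T, effective_diameter(gcc)))
    return results
```

Plotting the returned pairs against time reproduces the kind of curves shown in Fig. 11.6 for a given dataset.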
The models studied so far do not exhibit this densification and diameter
shrinkage. The paper proposes the following models which can achieve these
properties. The first model is Community Guided Attachment, where the idea
is that if the nodes of a graph belong to communities-within-communities [6],
and if the cost of cross-community edges is scale-free, then the densification
power law follows naturally. This model also exhibits a heavy-tailed degree
distribution. However, it cannot capture the shrinking effective
diameters. To capture all of these properties, the Forest Fire model was proposed. In this
model, nodes arrive over time. Each node has a center-of-gravity in some part
of the network, and its probability of linking to other nodes decreases rapidly
with their distance from this center-of-gravity. However, occasionally a new
node will produce a very large number of outlinks. Such nodes help cause a
more skewed out-degree distribution, and they serve as bridges that
connect formerly disparate parts of the network, bringing the diameter down.
Formally, the Forest Fire model has two parameters, a forward burning
probability p, and a backward burning ratio r. Consider a node v joining
the network at time t, and let Gt be the graph constructed thus far. Node v
forms out-links to nodes in Gt according to the following process:
1.
v first chooses an ambassador node w uniformly at random
and forms a link to w.
2.
Two random numbers, x and y, are generated, geometrically
distributed with means p/(1 − p) and rp/(1 − rp), respectively.
Node v selects x out-links and y in-links of w incident to nodes
that were not yet visited. Let w1, w2, …, w(x+y) denote the other
ends of these selected links. If not enough in- or out-links are
available, v selects as many as it can.
3.
v forms out-links to w1, w2, …, w(x+y), and then applies the
previous step recursively to each of w1, w2, …, w(x+y). As the
process continues, nodes cannot be visited a second time,
preventing the construction from cycling.
Thus, the burning of links in the Forest Fire model begins at w, spreads to
w1, …, w(x+y), and proceeds recursively until it dies out. The model can be
extended to account for isolated vertices and vertices with large degree by
having newcomers choose no ambassador in the former case (such nodes are called orphans)
and multiple ambassadors in the latter case (simply called multiple ambassadors).
Orphans and multiple ambassadors help further separate the diameter
decrease/increase boundary from the densification transition and so widen the
region of parameter space for which the model produces reasonably sparse
graphs with decreasing effective diameters.
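A minimal sketch of the burning process, assuming networkx and ignoring the orphan and multiple-ambassador extensions, is given below; p is the forward burning probability and r the backward burning ratio, and the geometric sampler is a small helper written for this illustration.

```python
import random
import networkx as nx

def _geom(mean):
    """Sample a geometric count (0, 1, 2, ...) with the given mean."""
    if mean <= 0:
        return 0
    p = mean / (1.0 + mean)          # success probability giving this mean
    k = 0
    while random.random() < p:
        k += 1
    return k

def forest_fire_graph(n, p, r):
    """Sketch of the Forest Fire model with forward burning probability p
    and backward burning ratio r (assumes p < 1 and r*p < 1)."""
    G = nx.DiGraph()
    G.add_node(0)
    for v in range(1, n):
        G.add_node(v)
        visited = {v}
        frontier = [random.randrange(v)]          # the ambassador node w
        while frontier:
            w = frontier.pop()
            if w in visited:
                continue
            visited.add(w)
            G.add_edge(v, w)                      # v links to every burned node
            x = _geom(p / (1 - p))                # number of out-links of w to burn
            y = _geom(r * p / (1 - r * p))        # number of in-links of w to burn
            outs = [u for u in G.successors(w) if u not in visited]
            ins = [u for u in G.predecessors(w) if u not in visited]
            frontier.extend(random.sample(outs, min(x, len(outs))))
            frontier.extend(random.sample(ins, min(y, len(ins))))
    return G
```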
Reference [12] proposed an affiliation network framework to generate
models that exhibit these phenomena. Their intuition is that individuals are part
of certain societies, which leads to a bipartite affiliation graph B. By folding this
bipartite graph B, the resultant folded graph G is found to exhibit the following
properties:
B has a power-law degree distribution, and G has a heavy-tailed degree distribution.
G has a superlinear number of edges.
The effective diameter of G stabilizes to a constant.
The average number of edges, e(a), created by nodes of age a is the number of
edges created by nodes of age a normalized by the number of nodes that
achieved age a, as given in Eq. 11.11,

$$e(a) = \frac{|\{(u, v) \in E : t(u, v) - t(u) = a\}|}{|\{u : T - t(u) \ge a\}|} \qquad (11.11)$$

where t(u) is the time when node u joined the network, t(u, v) is the time when the edge (u, v) was created, and T is the time when the last node in the network joined.
Figure 11.9 plots the fraction of edges initiated by nodes of a certain age.
The spike at age 0 corresponds to people who receive an invitation to
join the network, create a first edge, and then never come back.
Using the MLE principle, the combined effect of node age and degree was
studied by considering the following four parameterised models for choosing the
edge endpoints at time t.
D: The probability of selecting a node v is proportional to its current degree
raised to the power τ.
DR: With probability τ, the node v is selected preferentially (proportionally to
its degree), and with probability 1 − τ, uniformly at random.
A: The probability of selecting a node is proportional to its age raised to the power
τ.
DA: The probability of selecting a node v is proportional to the product of its
current degree and its age raised to the power τ.
Figure 11.10 plots the log-likelihood under the different models as a function
of τ. The red curve plots the log-likelihood of selecting the source node of an edge and the
green curve that of selecting the destination node.
Fig. 11.10 Log-likelihood of an edge selecting its source and destination node. Arrows denote the value of τ at the
highest likelihood
In FLICKR, the selection of the destination is purely preferential: model D
achieves the maximum likelihood at its optimal exponent, and model DA stays heavily biased towards
model D. Model A has worse likelihood, but model DA improves the
overall log-likelihood somewhat. Edge attachment in DELICIOUS seems
to be the most “random”: model D has worse likelihood than model DR.
Moreover, the likelihood of model DR achieves its maximum at an intermediate value of τ, suggesting
that a sizeable fraction of the DELICIOUS edges attach randomly. Model A has better
likelihood than the degree-based models, showing that edges are highly biased
towards young nodes. For YAHOO! ANSWERS, models D, A, and DR have
roughly equal likelihoods (at the optimal choice of τ), while model DA further
improves the log-likelihood, showing some age bias. In LINKEDIN,
age-biased models are worse than degree-biased models, and a strong degree-preferential
bias of the edges was also noted. As in FLICKR, model DA
improves the log-likelihood.
Selecting an edge’s destination node is harder than selecting its source (the
green curve is usually below the red). Also, selecting a destination appears more
random than selecting a source: the maximum likelihood of the destination
node (green curve) for models D and DR is shifted to the left when compared to
the source node (red), which means the degree bias is weaker. Similarly, there is
a stronger bias towards young nodes in selecting an edge’s source than in
selecting its destination. Based on these observations, model D performs
reasonably well compared to more sophisticated variants based on degree and
age.
Even though the analysis suggests that model D is a reasonable model for
edge destination selection, it is inherently “non-local” in that edges are no more
likely to form between nodes which already have friends in common. A detailed
study of the locality properties of edge destination selection is required.
Consider the following notion of edge locality: for each new edge (u, w), the
number of hops it spans is measured, i.e., the length of the shortest path
between nodes u and w immediately before the edge was created. Figure 11.11
shows the distribution of these shortest-path values induced by each new edge
for an Erdös-Rényi random graph, for PA, and for the four social networks. (The isolated dot
on the left counts the number of edges that connected previously disconnected
components of the network.) In the random graph, most new edges span nodes that were
originally six hops away, and the number then decays polynomially in the number of hops.
In the PA model, we see a lot of long-range edges; most of them span four hops
but none spans more than seven. The hop distributions corresponding to the four
real-world networks look similar to one another, and strikingly different from
both the random graph and PA. The number of edges decays exponentially with the hop
distance between the nodes, meaning that most edges are created locally between
nodes that are close. The exponential decay suggests that the creation of a large
fraction of edges can be attributed to locality in the network structure, namely
that most of the time people who are close in the network (e.g., have a common
friend) become friends themselves. These results involve counting the number of
edges that link nodes a certain distance away. In a sense, this overcounts edges
(u, w) for which u and w are far away, as there are many more distant candidates
to choose from: the number of long-range edges decays
exponentially while the number of long-range candidates grows exponentially.
To explore this phenomenon, the number of hops each new edge spans is
counted but then normalized by the total number of nodes at h hops, i.e., we
compute

$$p_e(h) = \frac{\text{number of edges created to nodes } h \text{ hops away}}{\text{number of nodes exactly } h \text{ hops away}} \qquad (11.12)$$

First, Fig. 11.12a, b show the results for the random graph and PA models. (Again, the
isolated dot plots the probability of a new edge connecting
disconnected components.) In the random graph, edges are created uniformly at random, and
so the probability of linking is independent of the number of hops between the
nodes. In PA, due to degree correlations short (local) edges prevail. However, a
non-trivial amount of probability goes to edges that span more than two hops.
Figure 11.12c–f show the plots for the four networks. Notice that the probability of
linking to a node h hops away decays double-exponentially, since the number of edges
created to nodes h hops away decays exponentially while the number of nodes at h hops increases
exponentially with h. This behaviour is drastically different from both the PA
and random-graph models. Also note that almost all of the probability mass is on edges
that close length-two paths. This means that edges are most likely to close
triangles, i.e., connect people with common friends.
Fig. 11.11 Number of edges created to nodes h hops away. The isolated dot counts the number of edges that connected previously disconnected components
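The hop-span measurement itself is easy to reproduce. The sketch below, assuming a list of time-stamped edges and networkx, records for every new edge the shortest-path distance between its endpoints just before the edge is inserted, bucketing previously disconnected pairs separately.

```python
import collections
import networkx as nx

def edge_hop_distribution(timestamped_edges):
    """For each new edge (u, w, t), measure the number of hops between u
    and w in the graph immediately before the edge is added; a count at
    key 0 marks endpoints that were in disconnected components."""
    G = nx.Graph()
    spans = collections.Counter()
    for u, w, t in sorted(timestamped_edges, key=lambda e: e[2]):
        if u in G and w in G:
            try:
                spans[nx.shortest_path_length(G, u, w)] += 1
            except nx.NetworkXNoPath:
                spans[0] += 1                     # previously disconnected
        G.add_edge(u, w)
    return spans
```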
Now that we have a model for the lifetime of a node u, we must model the
amount of time that elapses between edge initiations from u. Let δu(d)
be the time it takes for the node u with current degree d
to create its (d + 1)-st out-edge; we call δu(d) the edge gap. Again, we examine
several candidate distributions to model edge gaps. The best likelihood is
provided by a power law with exponential cut-off:
pg(δ | d; α, β) ∝ δ^(−α) exp(−βdδ), where d is the current degree of the node.
These results are confirmed in Fig. 11.14, in which we plot the MLE estimates of the
gap distribution δ(1), i.e., the distribution of times that it took a node of degree 1 to
add its second edge. We find that all gap distributions are best modelled
by a power law with exponential cut-off. For each degree d we fit a separate
distribution, and Fig. 11.15 shows the evolution of the parameters α and β of
the gap distribution as a function of the degree d of the node. Interestingly, the
power-law exponent α remains constant as a function of d, at almost the
same value for all four networks. On the other hand, the exponential cut-off
parameter β increases linearly with d, and varies by an order of magnitude
across networks; this variation models the extent to which the “rich get richer”
phenomenon manifests in each network. This means that the slope of the power-law
part remains constant, and only the exponential cut-off (parameter β) starts
to kick in sooner and sooner. So, nodes add their (d + 1)-st edge faster than
their dth edge, i.e., nodes create more and more edges (sleeping times get
shorter) as they get older (and have higher degree). Based on Fig. 11.15, the
overall gap distribution can thus be modelled by pg(δ | d; α, β) ∝ δ^(−α) exp(−βdδ).
Fig. 11.14 Edge gap distribution δ(1) for a node to obtain its second edge, and the MLE power law with exponential cut-off fit
Fig. 11.15 Evolution of the α and β parameters with the current node degree d. α remains constant,
and β increases linearly
1.
Nodes arrive according to the node arrival function N(t).
2.
A node u arrives and samples its lifetime a from the exponential
distribution pl(a) = λ exp(−λa).
3.
Node u adds its first edge to a node v chosen with probability proportional to its
degree.
4.
A node u with degree d samples a time gap δ from the distribution
pg(δ | d; α, β) ∝ δ^(−α) exp(−βdδ) and goes to sleep for δ time
steps.
5.
When a node wakes up, if its lifetime has not expired yet, it creates a
two-hop edge using the random-random triangle-closing model.
6.
If a node’s lifetime has expired, then it stops adding edges; otherwise it
repeats from step 4 (see the sketch that follows).
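Putting the pieces together, the following sketch simulates the model. It is a simplified illustration rather than the authors' code: node arrivals are reduced to one node per time step, the seed graph and the parameter values (λ, α, β) are arbitrary placeholders, and the gap distribution is drawn by rejection sampling.

```python
import heapq
import math
import random
import networkx as nx

def sample_gap(d, alpha, beta):
    """Edge gap delta ~ delta^(-alpha) * exp(-beta*d*delta), via rejection
    sampling with a Pareto proposal (requires alpha > 1)."""
    while True:
        delta = random.paretovariate(alpha - 1)      # proposal ~ delta^(-alpha)
        if random.random() < math.exp(-beta * d * delta):
            return delta

def evolve(n_nodes, lam=0.01, alpha=2.0, beta=0.002, seed_edges=2):
    """Simplified simulation: one node arrives per unit time, draws an
    exponential lifetime, attaches preferentially, then alternates
    sleeping (edge gaps) with random-random triangle closing."""
    G = nx.Graph([(i, i + 1) for i in range(seed_edges)])    # tiny seed path
    events = []                                              # (wake_time, node, death_time)
    for t in range(seed_edges + 1, n_nodes):
        lifetime = random.expovariate(lam)                   # steps 1-2
        targets = [v for v, d in G.degree() for _ in range(d)]
        G.add_edge(t, random.choice(targets))                # step 3: preferential first edge
        heapq.heappush(events, (t + sample_gap(1, alpha, beta), t, t + lifetime))
        while events and events[0][0] <= t:                  # wake up sleeping nodes
            wake, u, death = heapq.heappop(events)
            if wake > death:
                continue                                     # step 6: lifetime expired
            # step 5: random neighbour of a random neighbour (two-hop edge)
            w = random.choice(list(G[random.choice(list(G[u]))]))
            if w != u:
                G.add_edge(u, w)
            d = G.degree(u)                                  # step 4: sample next gap
            heapq.heappush(events, (wake + sample_gap(d, alpha, beta), u, death))
    return G
```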
Problems
Given that the probability density function (PDF) of a power-law distribution is
given by Eq. 11.13.

$$p(x) = \frac{\alpha - 1}{x_{\min}}\left(\frac{x}{x_{\min}}\right)^{-\alpha} \qquad (11.13)$$
References
1. Aiello, William, Fan Chung, and Linyuan Lu. 2002. Random evolution in massive graphs. In
Handbook of massive data sets, 97–122. Berlin: Springer.
2. Barabási, Albert-László, and Réka Albert. 1999. Emergence of scaling in random networks. Science
286 (5439): 509–512.
3. Bollobás, Béla, Christian Borgs, Jennifer Chayes, and Oliver Riordan. 2003. Directed scale-free graphs.
In Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms, 132–139.
Society for Industrial and Applied Mathematics.
4. Brookes, Bertram C. 1969. Bradford’s law and the bibliography of science. Nature 224 (5223): 953.
5. Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E.J. Newman. 2009. Power-law distributions in
empirical data. SIAM Review 51 (4): 661–703.
7. Goel, Sharad, Andrei Broder, Evgeniy Gabrilovich, and Bo Pang. 2010. Anatomy of the long tail:
Ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference
on Web search and data mining, 201–210. ACM.
8. Grünwald, Peter D. 2007. The minimum description length principle. Cambridge: MIT press.
9. Kass, Robert E., and Adrian E. Raftery. 1995. Bayes factors. Journal of the American Statistical
Association 90 (430): 773–795.
10. Kleinberg, Jon. 2003. Bursty and hierarchical structure in streams. Data Mining and Knowledge
Discovery 7 (4): 373–397.
11. Kumar, Ravi, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. 2005. On the bursty
evolution of blogspace. World Wide Web 8 (2): 159–178.
12. Lattanzi, Silvio, and D. Sivakumar. 2009. Affiliation networks. In Proceedings of the forty-first annual
ACM symposium on Theory of computing, 427–434. ACM.
13. Leskovec, Jure, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and
shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1): 2.
14. Leskovec, Jure, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. 2008. Microscopic evolution of
social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge
discovery and data mining, 462–470. ACM.
15. Lotka, Alfred J. 1926. The frequency distribution of scientific productivity. Journal of the Washington
Academy of Sciences 16 (12): 317–323.
16. Mandelbrot, Benoit. 1960. The pareto-levy law and the distribution of income. International Economic
Review 1 (2): 79–106.
17. Stone, Mervyn. 1977. An asymptotic equivalence of choice of model by cross-validation and Akaike’s
criterion. Journal of the Royal Statistical Society. Series B (Methodological) 39: 44–47.
The graph models that we have learnt heretofore each cater to specific network
properties; we need a graph generator that can reproduce the whole long list of
properties simultaneously. Such generated graphs can be used for simulations, scenarios and
extrapolation, and they draw a boundary around which properties it is
realistic to focus on. Reference [4] proposed such a realistic graph, the
Kronecker graph, which is generated using the Kronecker product.
The idea behind Kronecker graphs is to create self-similar graphs
recursively. Beginning with an initiator graph K1, with N1 vertices and E1
edges, we produce successively larger graphs K2, K3, … such that the kth graph
Kk is on Nk = N1^k vertices. To exhibit the densification power law, the number
of edges should grow as Ek = E1^k, giving a densification exponent a = log E1 / log N1.
The Kronecker product of two graphs is defined as the Kronecker product of their
adjacency matrices (the Kronecker product of an n × m matrix A and an n′ × m′ matrix B
is the nn′ × mm′ block matrix whose (i, j) block is Aij · B).
In a Kronecker graph G ⊗ H, the edge ((u1, v1), (u2, v2)) exists iff (u1, u2) is an edge
of G and (v1, v2) is an edge of H, where u1, u2 are vertices of G and v1, v2
are vertices of H.
Figure 12.1 shows the recursive construction of G ⊗ G, when G is a
3-node path.
Fig. 12.1 Top: a “3-chain” and its Kronecker product with itself; each of the nodes gets expanded into
3 nodes, which are then linked. Bottom: the corresponding adjacency matrices, along with the matrix for the
fourth Kronecker power
The kth Kronecker power of the initiator is defined as in Eq. 12.2.

$$K_1^{[k]} = K_k = \underbrace{K_1 \otimes K_1 \otimes \cdots \otimes K_1}_{k\ \text{times}} = K_{k-1} \otimes K_1 \qquad (12.2)$$

The self-similar nature of the Kronecker graph product is clear: to produce
Kk from Kk−1, we “expand” (replace) each node of Kk−1 by converting it into a
copy of K1, and we join these copies together according to the adjacencies in
Kk−1 (see Fig. 12.1). This process is intuitive: one can imagine it as positing that
communities within the graph grow recursively, with nodes in the community
recursively getting expanded into miniature copies of the community. Nodes in
the subcommunity then link among themselves and also to nodes from different
communities.
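To make the construction concrete, the following sketch builds the adjacency matrix of a deterministic Kronecker graph with NumPy's np.kron; the 3-node-path initiator is chosen only to mirror Fig. 12.1, and networkx is used merely to wrap the result as a graph object.

```python
import numpy as np
import networkx as nx

# Initiator K1: a 3-node path ("3-chain"), as in Fig. 12.1.
K1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])

def kronecker_power(K1, k):
    """Adjacency matrix of the kth Kronecker power K_k = K_{k-1} (x) K_1."""
    K = K1.copy()
    for _ in range(k - 1):
        K = np.kron(K, K1)       # expand every node into a copy of the initiator
    return K

K3 = kronecker_power(K1, 3)      # 27 x 27 adjacency matrix on N1**3 vertices
G3 = nx.from_numpy_array(K3)     # wrap as a networkx graph
```

In practice, self-loops are often added to the initiator (ones on the diagonal) so that the resulting Kronecker graph stays connected.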
Kronecker graphs satisfy a number of useful properties; the proofs of the corresponding theorems
can be found in [4] or [5]. In the Stochastic Kronecker Graph (SKG) model, each vertex is assigned a
unique bit vector of length k; given two vertices u with label (u1, u2, …, uk) and v
with label (v1, v2, …, vk), the probability of the edge (u, v) existing, denoted by P[u, v],
is the product over the k positions of the initiator-matrix entries indexed by the
corresponding label bits, independent of the presence of other edges.
The SKG model is a generalization of the R-MAT model proposed by [2].
Reference [6] proves several theorems characterising the properties of SKGs.
Fig. 12.2 CIT-HEP-TH: Patterns from the real graph (top row), the deterministic Kronecker graph with
being a star graph on 4 nodes (center + 3 satellites) (middle row), and the Stochastic Kronecker graph (
, - bottom row). Static patterns: a is the PDF of degrees in the graph (log-log scale),
and b the distribution of eigenvalues (log-log scale). Temporal patterns: c gives the effective diameter over
time (linear-linear scale), and d is the number of edges versus number of nodes over time (log-log scale)
Fig. 12.3 AS-ROUTEVIEWS: Real (top) versus Kronecker (bottom). Columns a and b show the degree
distribution and the scree plot. Column c shows the distribution of network values (principal eigenvector
components, sorted, versus rank) and column d shows the hop-plot (the number of reachable pairs g(h) within h
hops or less, as a function of the number of hops h)
A fast heuristic procedure that takes time linear in the number of edges
can be used to generate a graph. To arrive at a particular edge of Kk, one has to make
a sequence of k decisions among the entries of the initiator matrix, multiply the chosen entries,
and then place the edge with the obtained probability.
Thus, instead of flipping biased coins to determine the presence of every
possible edge, we place each of the E edges individually by such a sequence of
recursive choices. However, in practice it can happen that more than one edge lands in
the same entry of the big adjacency matrix K. If an edge lands in an already
occupied cell, we simply insert it again. Even though the entries of the initiator matrix are usually
skewed, adjacency matrices of real networks are so sparse that collisions are not
really a problem in practice, as only a tiny fraction of edges collide. It is due to
these edge collisions that the above procedure does not obtain exact samples from the
graph distribution defined by the parameter matrix P. However, in practice
graphs generated by this fast linear-time (O(E)) procedure are basically
indistinguishable from graphs generated with the exact exponential-time
procedure.
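The descent-based sampling just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the reference implementation: theta is assumed to be a stochastic initiator matrix, k the number of Kronecker levels, and colliding edges are simply re-sampled, mirroring the collision handling mentioned above.

```python
import random
import numpy as np

def fast_kronecker_edges(theta, k, num_edges):
    """Place num_edges edges by descending k levels into the recursive
    probability matrix: at each level a quadrant is chosen with probability
    proportional to the corresponding initiator entry."""
    theta = np.asarray(theta, dtype=float)
    n = theta.shape[0]
    cells = [(i, j) for i in range(n) for j in range(n)]
    weights = [theta[i, j] for i, j in cells]
    edges = set()
    while len(edges) < num_edges:
        u = v = 0
        for _ in range(k):                        # k quadrant decisions
            i, j = random.choices(cells, weights=weights)[0]
            u, v = u * n + i, v * n + j
        edges.add((u, v))                         # collisions: simply re-sample
    return edges

# e.g. a 2x2 stochastic initiator, 10 Kronecker levels, ~100,000 edges
edges = fast_kronecker_edges([[0.9, 0.5], [0.5, 0.1]], 10, 100_000)
```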
(12.3)
The NSKG model adds some noise to this matrix as follows. Let b be the
noise parameter. For each level i ≤ k, define a new
matrix Ti in such a way that the expectation of Ti is just T, i.e., for level i,
choose μi to be a uniform random number in the range [−b, b]. Set Ti to be
(12.4)
Ti is symmetric, with all its entries positive and summing to 1.
12.2 Distance-Dependent Kronecker Graph
Theorem 32 shows that Kronecker graphs are not searchable by a distributed
greedy algorithm. Reference [1] proposes an extension to SKGs that is
capable of generating searchable networks. By using a new “Kronecker-like”
operation and a family of generator matrices, both dependent upon the
distance between two nodes, this generation method yields networks that have
both a local (lattice-based) and a global (distance-dependent) structure. This dual
structure allows a greedy algorithm to search the network using only local
information.
Kronecker graphs are generated by iteratively Kronecker-multiplying one
initiator matrix with itself to produce a new adjacency or probability matrix.
Here, a distance-dependent Kronecker operator is defined instead: depending on the
distance between two nodes u and v, a different matrix from a
defined family is selected to be multiplied with that entry.
(12.5)
(12.6)
12.3 KRONFIT
Reference [5] presented KRONFIT, a fast and scalable algorithm for fitting
Kronecker graphs using the maximum likelihood principle. A Metropolis
sampling algorithm was developed for sampling node correspondences and
approximating the likelihood, yielding a linear-time algorithm for Kronecker
graph model parameter estimation that scales to large networks with millions of
nodes and edges.
Fig. 12.5 Schematic illustration of the multifractal graph generator. a The construction of the link
probability measure. Start from a symmetric generating measure on the unit square defined by a set of
probabilities associated to rectangles (shown on the left). Here , the length of the
intervals defining the rectangles is given by and respectively, and the magnitude of the probabilities
is indicated by both the height and the colour of the corresponding boxes. The generating measure is
iterated by recursively multiplying each box with the generating measure itself as shown in the middle and
on the right, yielding boxes at iteration k. The variance of the height of the boxes (corresponding
to the probabilities associated to the rectangles) becomes larger at each step, producing a surface which is
getting rougher and rougher, meanwhile the symmetry and the self similar nature of the multifractal is
preserved. b Drawing linking probabilities from the obtained measure. Assign random coordinates in the
unit interval to the nodes in the graph, and link each node pair I, J with a probability given by the
probability measure at the corresponding coordinates
Fig. 12.6 A small network generated with the multifractal network generator. a The generating measure
(on the left) and the link probability measure (on the right). The generating measure consists of
rectangles for which the magnitude of the associated probabilities is indicated by the colour. The number of
iterations, k, is set to , thus the final link probability measure consists of boxes, as shown
in the right panel. b A network with 500 nodes generated from the link probability measure. The colours of
the nodes were chosen as follows. Each row in the final linking probability measure was assigned a
different colour, and the nodes were coloured according to their position in the link probability measure.
(Thus, nodes falling into the same row have the same colour)
12.4 KRONEM
Reference [3] addressed the network completion problem by using the observed
part of the network to fit a model of network structure, and then estimating the
missing part of the network using the model, re-estimating the parameters and so
on. This is combined with the Kronecker graphs model to design a scalable
Metropolized Gibbs sampling approach that allows for the estimation of the
model parameters as well as the inference about missing nodes and edges of the
network.
The problem of network completion is cast into the Expectation
Maximisation (EM) framework and the KRONEM algorithm is developed that
alternates between the following two stages. First, the observed part of the
network is used to estimate the parameters of the network model. This estimated
model then gives us a way to infer the missing part of the network. Now, we act
as if the complete network is visible and we re-estimate the model. This in turn
gives us a better way to infer the missing part of the network. We iterate between
the model estimation step (the M-step) and the inference of the hidden part of the
network (the E-step) until the model parameters converge.
The advantages of KronEM are the following: It requires a small number of
parameters and thus does not overfit the network. It infers not only the model
parameters but also the mapping between the nodes of the true and the estimated
networks. The approach can be directly applied to cases when collected network
data is incomplete. It provides an accurate probabilistic prior over the missing
network structure and easily scales to large networks.
References
1. Bodine-Baron, Elizabeth, Babak Hassibi, and Adam Wierman. 2010. Distance-dependent kronecker
graphs for modeling social networks. IEEE Journal of Selected Topics in Signal Processing 4 (4): 718–
731.
2. Chakrabarti, Deepayan, Yiping Zhan, and Christos Faloutsos. 2004. R-mat: A recursive model for graph
mining. In Proceedings of the 2004 SIAM international conference on data mining, 442–446. SIAM.
3. Kim, Myunghwan, and Jure Leskovec. 2011. The network completion problem: Inferring missing nodes
and edges in networks. In Proceedings of the 2011 SIAM International Conference on Data Mining, 47–
58. SIAM.
4. Leskovec, Jurij, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos. 2005. Realistic,
mathematically tractable graph generation and evolution, using Kronecker multiplication. In European
conference on principles of data mining and knowledge discovery, 133–145. Berlin: Springer.
5. Leskovec, Jure, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani.
2010. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11
(Feb): 985–1042.
6. Mahdian Mohammad, and Ying Xu. 2007. Stochastic Kronecker graphs. In International workshop on
algorithms and models for the web-graph, 179–186. Berlin: Springer.
7. Palla, Gergely, László Lovász, and Tamás Vicsek. 2010. Multifractal network generator. Proceedings of
the National Academy of Sciences 107 (17): 7640–7645.
8. Seshadhri, Comandur, Ali Pinar, and Tamara G. Kolda. 2013. An in-depth analysis of stochastic
Kronecker graphs. Journal of the ACM (JACM) 60 (2): 13.
13.1.1 Crawling
The crawler module starts with an initial set of URLs which are kept
in a priority queue. From the queue, the crawler gets a URL, downloads the
page, extracts any URLs in the downloaded page, and puts the new URLs in the
queue. This process repeats until the crawl control module asks the crawler to
stop.
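A toy version of this crawler loop might look as follows. The importance function is a hypothetical, caller-supplied stand-in for whichever importance metric (discussed next) the crawl control module uses, and the link extraction is deliberately crude.

```python
import heapq
import re
import urllib.parse
import urllib.request

def crawl(seed_urls, importance, max_pages=100):
    """Minimal crawler loop: a priority queue ordered by an importance
    estimate; importance maps a URL to a score."""
    queue = [(-importance(u), u) for u in seed_urls]       # max-heap via negation
    heapq.heapify(queue)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        _, url = heapq.heappop(queue)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                                        # unreachable page; skip it
        pages[url] = html
        for href in re.findall(r'href="([^"]+)"', html):    # crude link extraction
            link = urllib.parse.urljoin(url, href)
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-importance(link), link))
    return pages
```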
13.1.1.1 Page Selection
Due to the sheer quantity of pages in the Web and limit in the available
resources, the crawl control program has to decide which pages to crawl, and
which not to. The “importance” of a page can be defined in one of the following
ways:
1.
Interest Driven: The important pages can be defined as those of
“interest to” the users. This could be defined in the following manner:
given a query Q, the importance of the page P is defined as the textual
similarity between P and Q [1]. This metric is referred to as IS(P).
2.
Popularity Driven: Page importance depends on how “popular” a page
is. Popularity can be defined in terms of a page’s in-degree. However, a
page’s popularity is the in-degree with respect to the entire Web, which
can be denoted as IB(P). The crawler may have to instead provide an
estimate computed from the in-degree among the pages seen so far.
3.
Location Driven: This importance measure is computed as a function
of a page’s location instead of its contents. Denoted as IL(P), if URL u
leads to a page P, then IL(P) is a function of u.
1.
Crawl and stop: Under this method, a crawler C starts at a
page and stops after visiting K pages. Here, K denotes the
number of pages that the crawler can download in one crawl. At
this point a perfect crawler would have visited R1, …, RK,
where R1 is the page with the highest importance value, R2 is
the next highest, and so on. These pages R1 through RK are
called the hot pages. The K pages visited by our real crawler
will contain only M pages with rank higher than or equal
to that of RK. We need to know the exact rank of all pages in
order to obtain the value M. The performance of the crawler C
is computed as P_CS(C) = (M · 100)/K. Although the
performance of the perfect crawler is 100%, a crawler that
manages to visit pages at random would have a performance of
(K · 100)/T, where T is the total number of pages in the Web.
2.
Crawl and stop with threshold: Assume that a crawler visits
K pages. If we are given an importance target G, then any page
with importance greater than G is considered hot. If we
take the total number of hot pages to be H, then the
performance of the crawler, P_ST(C), is the percentage of the H
hot pages that have been visited when the crawler stops. If
K < H, then an ideal crawler will have performance
(K · 100)/H. If K ≥ H, then the ideal crawler has 100%
performance. A purely random crawler that revisits pages is
expected to visit (H / T)K hot pages when it stops. Thus, its
performance is (K · 100)/T. Only if the random crawler visits
all T pages is its performance expected to be 100% (both
measures are sketched in code below).
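Both performance measures are straightforward to compute once the importance of every page is known. The sketch below assumes (hypothetically) that visited is the ordered list of pages the crawler fetched and importance a dictionary of importance values.

```python
def crawl_and_stop(visited, importance, K):
    """P_CS: percentage of the K hot pages (the K highest-importance
    pages) found among the first K pages the crawler visited."""
    hot = set(sorted(importance, key=importance.get, reverse=True)[:K])
    return 100.0 * len(hot & set(visited[:K])) / K

def crawl_and_stop_with_threshold(visited, importance, K, G):
    """P_ST: percentage of all hot pages (importance > G) visited after
    the crawler has fetched K pages."""
    hot = {p for p, v in importance.items() if v > G}
    return 100.0 * len(hot & set(visited[:K])) / len(hot) if hot else 100.0
```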
1.
Uniform refresh policy: All pages are revisited at the same frequency f,
regardless of how often they change.
2.
Proportional refresh policy: The crawler revisits a page with a frequency
proportional to the page's rate of change. If λi is the change frequency of a page pi, and fi is the
crawler’s revisit frequency for pi, then the ratio λi / fi is the
same for every i. Keep in mind that the crawler needs to estimate λi for
each page in order to implement this policy. This estimation can be
based on the change history of a page that the crawler can collect [9].
1.
Freshness: Let S = {p1, …, pN} be the local collection of N
pages. The freshness of a local page pi at time t is as given in
Eq. 13.1.

$$F(p_i; t) = \begin{cases} 1 & \text{if } p_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases} \qquad (13.1)$$

The freshness of the collection S at time t is then the average freshness of its pages:

$$F(S; t) = \frac{1}{N}\sum_{i=1}^{N} F(p_i; t) \qquad (13.2)$$

Since the pages get refreshed over time, we also consider the time average of
the freshness of page pi and the time average of the freshness of the collection,
given in Eqs. 13.3 and 13.4.

$$\bar{F}(p_i) = \lim_{t \to \infty} \frac{1}{t}\int_0^t F(p_i; t)\, dt \qquad (13.3)$$

$$\bar{F}(S) = \lim_{t \to \infty} \frac{1}{t}\int_0^t F(S; t)\, dt \qquad (13.4)$$

(13.5)
(13.6)
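Assuming we record, for each page, the time of its last local refresh and the (sorted) times at which the remote page changed, the freshness and age definitions above translate directly into code. The representation below is hypothetical, chosen only for illustration.

```python
import bisect

def freshness(fetch_time, change_times, t):
    """F(p_i; t): 1 if the local copy fetched at fetch_time is still
    up-to-date at time t (the page has not changed since it was fetched)."""
    i = bisect.bisect_right(change_times, t)        # remote changes that happened by t
    last_change = change_times[i - 1] if i else float("-inf")
    return 1 if last_change <= fetch_time else 0

def age(fetch_time, change_times, t):
    """Age of a page: 0 if up-to-date, else time elapsed since the first
    remote change that the local copy has missed."""
    missed = [c for c in change_times if fetch_time < c <= t]
    return 0 if not missed else t - missed[0]

def collection_freshness(fetch_times, change_lists, t):
    """F(S; t): average freshness of the N local pages at time t (Eq. 13.2)."""
    vals = [freshness(f, c, t) for f, c in zip(fetch_times, change_lists)]
    return sum(vals) / len(vals)
```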
13.1.2 Storage
The storage repository of a search engine must perform two basic functions.
First, it must provide an interface for the crawler to store pages. Second, it must
provide an efficient access API that the indexer module and the collection
analysis module can use to retrieve pages.
The issues a repository must deal with are as follows: First, it must be
scalable, i.e., it must be capable of being distributed across a cluster of systems.
Second, it must be capable of supporting both random access and streaming
access equally efficiently: random access to quickly retrieve a specific
Web page, given the page’s unique identifier, in order to serve out cached copies to the
end user; and streaming access to receive the entire collection as a stream of pages,
so as to provide pages in bulk for the indexer and analysis modules to process and
analyse. Third, since the Web changes rapidly, the repository needs to handle a
high rate of modifications since pages have to be refreshed periodically. Lastly,
the repository must have a mechanism of detecting and removing obsolete pages
that have been removed from the Web.
A distributed Web repository that is designed to function over a cluster of
interconnected storage nodes must deal with the following issues that affect the
characteristics and performance of the repository. A detailed explanation of
these issues can be found in [15].
13.1.3 Indexing
The indexer and the collection analysis modules are responsible for generating
the text, structure and the utility indexes. Here, we describe the indexes.
13.1.4 Ranking
The query engine collects search terms from the user and retrieves pages that are
likely to be relevant. These pages cannot be returned to the user in this format
and must be ranked.
However, the problem of ranking is faced with the following issues. First,
before the task of ranking pages, one must first retrieve the pages that are
relevant to a search. This information retrieval suffers from problems of
synonymy (multiple terms that more or less mean the same thing) and polysemy
(multiple meanings of the same term: by the term “jaguar”, do we mean the
animal, the automobile company, the American football team, or the operating
system?).
Second, the time of retrieval plays a pivotal role. For instance, in the case of
an event such as a calamity, government and news sites will update their pages
as and when they receive information. The search engine not only has to retrieve
the pages repeatedly, but will have to re-rank the pages depending on which
page currently has the latest report.
Third, with everyone capable of writing a Web page, the Web has an
abundance of information on any topic. For example, a search for the term “social network
analysis” returns around 56 million results. The task now is to rank
these results in a manner such that the most important ones appear first.
We observe from Fig. 13.2 that among the sites casting votes, a few of them
voted for many of the pages that received a lot of votes. Therefore, we could say
that these pages have a good sense of where the good answers are, and we should score them
highly as lists. Thus, a page’s value as a list is equal to the sum of the votes
received by all pages that it voted for. Figure 13.3 depicts the result of applying
this rule to the pages casting votes.
Fig. 13.3 Finding good lists for the query “newspapers”: each page’s value as a list is written as a number
inside it
If pages scoring well as lists are believed to actually have a better sense for
where the good results are, then we should weigh their votes more heavily. So,
in particular, we could tabulate the votes again, but this time giving each page’s
vote a weight equal to its value as a list. Figure 13.4 illustrates what happens
when weights are accounted for in the newspaper case.
Fig. 13.4 Re-weighting votes for the query “newspapers”: each of the labelled page’s new score is equal
to the sum of the values of all lists that point to it
4.
The resultant hub and authority scores may involve numbers that
are very large (as can be seen in Fig. 13.4). But we are concerned
only with their relative size and therefore they can be normalized: we
divide down each authority score by the sum of all authority scores,
and divide down each hub score by the sum of all hub scores.
Figure 13.5 shows the normalized scores of Fig. 13.4.
This algorithm is commonly known as the HITS algorithm [16].
Fig. 13.5 Re-weighting votes after normalizing for the query “newspapers”
(13.8)
The k-step hub-authority computation for large values of k is described
below. The initial authority and hub vectors are denoted a^(0) and h^(0), each
equal to a unit vector. Let a^(k) and h^(k) represent the vectors of
authority and hub scores after k applications of the Authority and Hub Update
Rules in order. With M denoting the adjacency matrix of the network, this gives us Eqs. 13.9 and 13.10.

$$a^{(k)} = M^T h^{(k-1)} \qquad (13.9)$$

and

$$h^{(k)} = M a^{(k)} \qquad (13.10)$$

In the next step, we get

$$a^{(1)} = M^T h^{(0)} \qquad (13.11)$$

and

$$h^{(1)} = M a^{(1)} = M M^T h^{(0)} \qquad (13.12)$$

For larger values of k, we have

$$a^{(k)} = (M^T M)^{k-1} M^T h^{(0)} \qquad (13.13)$$

and

$$h^{(k)} = (M M^T)^k h^{(0)} \qquad (13.14)$$
Now we will look at the convergence of this process.
Consider a constant c. If the sequence of hub vectors

$$\frac{h^{(k)}}{c^k} = \frac{(M M^T)^k h^{(0)}}{c^k} \qquad (13.15)$$

is going to converge to a limit h^(∞), we expect that at the limit the direction of
h^(∞) no longer changes when it is multiplied by M M^T, though its length
might grow by a factor of c, i.e., we expect that h^(∞) will satisfy the equation

$$(M M^T)\, h^{(\infty)} = c\, h^{(\infty)} \qquad (13.16)$$

This means that h^(∞) is an eigenvector of the matrix M M^T, with c a
corresponding eigenvalue.
Any symmetric matrix with n rows and n columns has a set of n eigenvectors
that are all unit vectors and all mutually orthogonal, i.e., they form a basis for the
space R^n.
Since M M^T is symmetric, we get the n mutually orthogonal eigenvectors
z1, …, zn, with corresponding eigenvalues c1, …, cn. Let
|c1| ≥ |c2| ≥ ⋯ ≥ |cn|. Suppose that |c1| > |c2|. If we have to compute the
matrix-vector product (M M^T)x for any vector x, we could first write x as a
linear combination of the eigenvectors,

$$x = q_1 z_1 + q_2 z_2 + \cdots + q_n z_n \qquad (13.17)$$

k multiplications by the matrix M M^T will therefore give us

$$(M M^T)^k x = c_1^k q_1 z_1 + c_2^k q_2 z_2 + \cdots + c_n^k q_n z_n \qquad (13.18)$$

Applying Eq. 13.18 to the vector of hub scores, and writing h^(0) = q1z1 + ⋯ + qnzn, we get

$$h^{(k)} = (M M^T)^k h^{(0)} = c_1^k q_1 z_1 + c_2^k q_2 z_2 + \cdots + c_n^k q_n z_n \qquad (13.19)$$

Dividing both sides by c1^k, we get

$$\frac{h^{(k)}}{c_1^k} = q_1 z_1 + \left(\frac{c_2}{c_1}\right)^k q_2 z_2 + \cdots + \left(\frac{c_n}{c_1}\right)^k q_n z_n \qquad (13.20)$$

Since |c1| > |c2|, as k goes to infinity, every term on the right-hand side except
the first goes to 0. This means that the sequence of vectors h^(k)/c1^k converges to
the limit q1z1; in fact, it would converge to a vector in the direction of z1 even with a
different starting vector, as long as that vector has a non-zero coefficient q1.
Now we consider the case when the assumption |c1| > |c2| is relaxed. Let’s
say that there are l > 1 eigenvalues that are tied for the largest absolute value,
i.e., |c1| = ⋯ = |cl|, and then |c(l+1)|, …, |cn| are all smaller in absolute value.
With all the eigenvalues of M M^T being non-negative, we have
c1 = ⋯ = cl > c(l+1) ≥ ⋯ ≥ cn ≥ 0. This gives us

$$\frac{h^{(k)}}{c_1^k} = q_1 z_1 + \cdots + q_l z_l + \left(\frac{c_{l+1}}{c_1}\right)^k q_{l+1} z_{l+1} + \cdots + \left(\frac{c_n}{c_1}\right)^k q_n z_n \qquad (13.21)$$

Terms l + 1 through n of this sum go to zero, and so the sequence converges to
q1z1 + ⋯ + qlzl. Thus, when |c1| = |c2|, we still have convergence, but the limit
to which the sequence converges might now depend on the choice of the initial
vector (and particularly its inner product with each of z1, …, zl). In
practice, with real and sufficiently large hyperlink structures, one essentially
always gets a matrix M with the property that M M^T has |c1| > |c2|.
This analysis can be adapted directly to the sequence of authority vectors. For the
authority vectors, we are looking at powers of M^T M, and so the basic result is
that the vector of authority scores will converge to an eigenvector of the matrix M^T M
associated with its largest eigenvalue.
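The whole procedure is a power iteration, and a compact sketch (for a directed networkx graph with at least one edge) is given below; networkx also ships its own implementation as nx.hits.

```python
import networkx as nx

def hits(G, k=50):
    """k-step hub-authority computation: repeatedly apply the Authority
    and Hub Update Rules, normalising after each step.  This is the power
    iteration converging to the leading eigenvectors of M^T M and M M^T."""
    auth = {v: 1.0 for v in G}
    hub = {v: 1.0 for v in G}
    for _ in range(k):
        # Authority Update Rule: a(v) = sum of hub scores of pages linking to v
        auth = {v: sum(hub[u] for u in G.predecessors(v)) for v in G}
        # Hub Update Rule: h(v) = sum of authority scores of pages v links to
        hub = {v: sum(auth[w] for w in G.successors(v)) for v in G}
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {v: a / a_sum for v, a in auth.items()}
        hub = {v: h / h_sum for v, h in hub.items()}
    return hub, auth

# networkx's built-in version: nx.hits(G)
```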
13.1.4.2 PageRank
PageRank [7] also works on the same principle as the HITS algorithm. It starts with
simple voting based on in-links, and refines it using the Principle of Repeated
Improvement.
PageRank can be thought of as a “fluid” that circulates through the network,
passing from node to node across edges, and pooling at the nodes that are most
important. It is computed as follows:
1.
In a network with n nodes, all nodes are assigned the same
initial PageRank, set to be 1 / n.
2.
Choose a number of steps k.
3.
Perform a sequence of k updates to the PageRank values,
using the following Basic PageRank Update rule for each
update: Each page divides its current PageRank equally across
its outgoing links, and passes these equal shares to the pages it
points to. If a page has no outgoing links, it passes all its current
PageRank to itself. Each page updates its new PageRank to be
the sum of the shares it receives.
Step A B C D E F G H
1 1 / 2 1 / 16 1 / 16 1 / 16 1 / 16 1 / 16 1 / 16 1 / 8
2 3 / 16 1 / 4 1 / 4 1 / 32 1 / 32 1 / 32 1 / 32 1 / 16
This scaling factor makes the PageRank measure less sensitive to the
addition or deletion of a small number of nodes or edges.
A survey of the PageRank algorithm and its developments can be found in
[4].
We will first analyse the Basic PageRank Update rule and then move on to
the scaled version. Under the basic rule, each node takes its current PageRank
and divides it equally over all the nodes it points to. This suggests that the
“flow” of PageRank specified by the update rule can be naturally represented
using a matrix N. Let Nij be the share of i’s PageRank that j should get in one
update step. Nij = 0 if i doesn’t link to j, and when i links to j, Nij = 1/li,
where li is the number of links out of i. (If i has no outgoing links, then we
define Nii = 1, in keeping with the rule that a node with no outgoing links
passes all its PageRank to itself.) We represent the PageRank of all the nodes
using a vector r, where ri is the PageRank of node i. In this manner, the Basic
PageRank Update rule is

$$r \leftarrow N^T r \qquad (13.22)$$

We can similarly represent the Scaled PageRank Update rule using a matrix Ñ
to denote the different flow of PageRank. To account for the scaling, we define
Ñij to be sNij + (1 − s)/n; this gives the scaled update rule

$$r \leftarrow \tilde{N}^T r \qquad (13.23)$$

Starting from an initial PageRank vector r^(0), a sequence of vectors
r^(1), r^(2), … is obtained from repeated improvement by multiplying by Ñ^T:

$$r^{(k)} = (\tilde{N}^T)^k r^{(0)} \qquad (13.24)$$

This means that if the Scaled PageRank Update rule converges to a limiting
vector r^(∞), this limit would satisfy Ñ^T r^(∞) = r^(∞). This is proved using Perron’s
Theorem [18].
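A direct dictionary-based sketch of the scaled update rule follows; s plays the role of the scaling factor, and nodes without out-links keep their own share, as in the basic rule. This is a didactic sketch rather than a production implementation (networkx provides nx.pagerank).

```python
def scaled_pagerank(G, s=0.85, k=100):
    """Scaled PageRank Update rule on a directed networkx graph: with
    probability s, PageRank flows along out-links (dangling nodes keep
    their share), and the remaining (1 - s) is spread over all n nodes."""
    n = G.number_of_nodes()
    r = {v: 1.0 / n for v in G}
    for _ in range(k):
        nxt = {v: (1 - s) / n for v in G}
        for u in G:
            out = list(G.successors(u))
            if out:
                share = s * r[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                nxt[u] += s * r[u]       # no out-links: pass PageRank to itself
        r = nxt
    return r

# networkx's built-in version: nx.pagerank(G, alpha=0.85)
```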
Fig. 13.8 Equilibrium PageRank values for the network in Fig. 13.7
Fig. 13.9 The same collection of eight pages, but F and G have changed their links to point to each other
instead of to A. Without a smoothing effect, all the PageRank would go to F and G
Reference [2] describes axioms that are satisfied by the PageRank algorithm and shows
that any page ranking algorithm that satisfies these axioms must coincide with the
PageRank algorithm.
Remark 1
Taking a random walk in this situation, the probability of being at a page X after
k steps of this random walk is precisely the PageRank of X after k applications of
the Basic PageRank Update rule.
Proof
If b1, b2, …, bn denote the probabilities of the walk being at nodes 1, 2, …, n
respectively in a given step, then the probability that it will be at node i in the next
step is computed as follows:
1.
For each node j that links to i, if we are given that the walk is currently
at node j, then there is a 1/lj chance that it moves from j to i in the
next step, where lj is the number of links out of j.
2.
The walk has to actually be at node j for this to happen, so node j
contributes bj/lj to the probability of being at i in the next
step.
3.
Therefore, summing bj/lj over all nodes j that link to i gives the
probability the walk is at node i in the next step.
So the overall probability that the walk is at i in the next step is the sum of
bj/lj over all nodes that link to i.
If we represent the probabilities of being at different nodes using a vector b,
where the coordinate bi is the probability of being at node i, then this update
rule can be written using matrix-vector multiplication as

$$b \leftarrow N^T b \qquad (13.25)$$

This is exactly the same as Eq. 13.22. Since both PageRank values and random-walk
probabilities start out the same (they are initially 1 / n for all nodes), and
they then evolve according to exactly the same rule, they remain the same
forever. This justifies the claim.
Remark 2 The probability of being at a page X after k steps of the scaled
random walk is precisely the PageRank of X after k applications of the Scaled
PageRank Update Rule.
Proof
We proceed along the same lines as the proof of Remark 1. If b1, …, bn denote the
probabilities of the walk being at nodes 1, …, n respectively in a given step,
then the probability that it will be at node i in the next step is the sum of sbj/lj,
over all nodes j that link to i, plus (1 − s)/n. Using the matrix Ñ, we can write this as

$$b \leftarrow \tilde{N}^T b \qquad (13.26)$$
This is the same as the update rule from Eq. 13.23 for the scaled PageRank
values. The random-walk probabilities and the scaled PageRank values start at
the same initial values, and then evolve according to the same update, so they
remain the same forever. This justifies the argument.
A problem with both PageRank and HITS is topic drift . Because they give the
same weights to all edges, the pages with the most in-links in the network being
considered tend to dominate, whether or not they are most relevant to the query.
References [8] and [5] propose heuristic methods for differently weighting links.
Reference [20] biased PageRank towards pages containing a specific word, and
[14] proposed applying an optimized version of PageRank to the subset of pages
containing the query terms.
(13.27)
If the Markov chain is irreducible, then the stationary distribution
of the Markov chain satisfies πi = |B(i)| / Σj |B(j)|, where B(i)
is the set of all backward links (in-links) of node i.
Similarly, the Markov chain for the hubs has transition probabilities
(13.28)
(13.30)
The algorithm starts from node i, and visits its neighbours in BFS order. At each
iteration it takes a Backward or a Forward step (depending on whether it is an
odd, or an even iteration), and it includes the new nodes it encounters. The
weight factors are updated accordingly. Note that each node is considered only
once, when it is first encountered by the algorithm.
(13.31)
with the probability of no link from i to j given by
(13.32)
This states that a link is more likely if ei is large (in which case hub i has a large
tendency to link to any site), or if both hi and aj are large (in which case i is an
intelligent hub, and j is a high-quality authority).
We must now assign prior distributions to the unknown parameters ei,
hi, and aj. To do this, we let μ and σ² be fixed parameters, and
let each ei have prior distribution N(μ, σ²), a normal distribution with mean μ
and variance σ². We further let each hi and aj have prior distribution exp(1),
(13.33)
for any measurable subset , and also
(13.34)
for any measurable function .
(13.36)
A Metropolis algorithm is used to compute these conditional means.
This algorithm can be further simplified by replacing Eq. 13.31 with
hiaj / (1 + hiaj) and replacing Eq. 13.32 with
1 / (1 + hiaj). This eliminates the ei parameters entirely, so that the
prior values μ and σ² are no longer needed. This leads to the simplified posterior density,
now given by Eq. 13.37.
(13.37)
Comparison of these algorithms can be found in [6].
13.2 Google
Reference [7] describes Google , a prototype of a large-scale search engine
which makes use of the structure present in hypertext. This prototype forms the
base of the Google search engine we know today.
Figure 13.10 illustrates a high level view of how the whole system works.
The paper states that most of Google was implemented in C or C++ for
efficiency and was available to be run on either Solaris or Linux.
13.2.2 Crawling
Crawling is the most fragile application since it involves interacting with
hundreds of thousands of Web servers and various name servers which are all
beyond the control of the system. In order to scale to hundreds of millions of
Webpages, Google has a fast distributed crawling system. A single URLserver
serves lists of URLs to a number of crawlers. Both the URLserver and the
crawlers were implemented in Python. Each crawler keeps roughly 300
connections open at once. This is necessary to retrieve Web pages at a fast
enough pace. At peak speeds, the system can crawl over 100 Web pages per
second using four crawlers. A major performance stress is DNS lookup so each
crawler maintains a DNS cache. Each of the hundreds of connections can be in a
number of different states: looking up DNS, connecting to host, sending the request,
and receiving the response. These factors make the crawler a complex component of
the system. It uses asynchronous IO to manage events, and a number of queues
to move page fetches from state to state.
13.2.3 Searching
Every hitlist includes position, font, and capitalization information. Additionally,
hits from anchor text and the PageRank of the document are factored in.
Combining all of this information into a rank is difficult. The ranking function is designed so
that no one factor can have too much influence. For every matching document
we compute counts of hits of different types at different proximity levels. These
counts are then run through a series of lookup tables and eventually are
transformed into a rank. This process involves many tunable parameters.
(13.38)
Since oracle evaluations over all pages are an expensive ordeal, this is avoided by
asking the human to assign oracle values for just a few of the pages.
To evaluate pages without relying on O, the likelihood that a given page p is
good is estimated using a trust function T which yields the probability that a
page is good, i.e., T(p) = Pr[p is good].
Although it would be difficult to come up with a function T, such a function
would be useful in ordering results. These functions could be defined in terms of
the following properties. We first look at the ordered trust property
(13.39)
(13.40)
A binary function I(T, O, p, q) is used to signal whether the ordered trust property has been
violated.
(13.41)
(13.42)
If pairord equals 1, there are no cases in which T misrated a pair. In contrast, if
pairord equals 0, then T misrated all the pairs.
We now proceed to define the trust functions. Given a budget L of O-invocations,
we select at random a seed set S of L pages and call the oracle on
its elements. The subsets of good and bad seed pages are denoted S+ and S−,
respectively. Since the remaining pages are not checked by the human expert, they
are assigned a trust score of 1 / 2 to signal our lack of information. Therefore,
this scheme is called the ignorant trust function, defined for any page p as
(13.43)
Now we attempt to improve the trust scores by taking advantage of the
approximate isolation of good pages. We again select at random the set of L
pages on which we invoke the oracle. Then, expecting that good pages point only to
other good pages, we assign a score of 1 to all pages that are reachable from
a page in S+ in M or fewer steps. The M-step trust function is defined as
(13.44)
where q ⇝M p denotes the existence of a path of maximum length M from
page q to page p. However, such a path must not include bad seed pages.
To reduce trust as we move further away from the good seed pages, there are
two possible schemes. The first is called trust dampening: if a page is one
link away from a good seed page, it is assigned a dampened trust score of β; a
page that is pointed to by a page with score β gets a further dampened score of β · β.
The second technique is called trust splitting. Here, the trust gets split as it
propagates to other pages: if a page p has a trust score of T(p) and it points to
ω(p) pages, each of the ω(p) pages will receive a score fraction T(p)/ω(p)
from p. In this case, the actual score of a page will be the sum of the score
fractions received through its in-links.
The next task is to identify pages that are desirable for the seed set. By
desirable, we mean pages that will be the most useful in identifying additional
good pages. However, we must ensure that the size of the seed set is reasonably
small to limit the number of oracle invocations. There are two strategies for
accomplishing this task.
The first technique is based on the idea that since trust flows out of the good
seed pages, we give preference to pages from which we can reach many other
pages. Building on this, we see that seed set can be built from those pages that
point to many pages that in turn point to many pages, and so on. This approach is
the inverse of the PageRank algorithm, because PageRank ranks pages based
on their in-degree while here the ranking is based on the out-degree. This gives the technique
the name inverted PageRank. However, this method is a double-edged sword:
while its execution time is polynomial in the number of pages, it does not
guarantee maximum coverage.
The other technique is to take pages with high PageRank as the seed set,
since high-PageRank pages are likely to point to other high-PageRank pages.
Piecing all of these elements together gives us the TrustRank algorithm. The
algorithm takes as input the graph. At the first step, it identifies the seed set. The
pages in the set are re-ordered in decreasing order of their desirability score.
Then, the oracle function is invoked on the L most desirable seed pages. The
entries of the static score distribution vector d that correspond to good seed
pages are set to 1. After normalizing the vector d so that its entries sum to 1, the
TrustRank scores are evaluated using a biased PageRank computation with d
replacing the uniform distribution.
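A compressed sketch of these steps on top of networkx is shown below. Here oracle is a hypothetical caller-supplied human-judgement function, inverted PageRank is used as the desirability score, and the biased PageRank is obtained through the personalization argument of nx.pagerank (assuming a networkx version that accepts a partial personalization vector).

```python
import networkx as nx

def trustrank(G, oracle, L=200, alpha=0.85):
    """Sketch of TrustRank: rank pages by inverted PageRank (PageRank on
    the reversed graph), call the human oracle on the top L pages, build
    the static distribution d from the good seeds, and run a biased
    PageRank with d as the random-jump vector."""
    inv_pr = nx.pagerank(G.reverse(copy=True), alpha=alpha)   # out-degree based desirability
    candidates = sorted(G, key=lambda p: inv_pr.get(p, 0.0), reverse=True)[:L]
    good_seeds = [p for p in candidates if oracle(p) == 1]    # oracle invoked on L pages only
    d = {p: 1.0 / len(good_seeds) for p in good_seeds}        # normalised so entries sum to 1
    return nx.pagerank(G, alpha=alpha, personalization=d)
```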
Reference [13] complements [12] by proposing a novel method for
identifying the largest spam farms . A spam farm is a group of interconnected
nodes involved in link spamming. It has a single target node, whose ranking the
spammer intends to boost by creating the whole structure. A farm also contains
boosting nodes, controlled by the spammer and connected so that they
influence the PageRank of the target. Boosting nodes are owned either by the
author of the target, or by some other spammer. Commonly, boosting nodes have
little value by themselves; they exist only to improve the ranking of the target.
Their PageRank tends to be small, so serious spammers employ a large number
of boosting nodes (thousands of them) to trigger high target ranking.
In addition to the links within the farm, spammers may gather some external
links from reputable nodes. While the author of a reputable node y is not
voluntarily involved in spamming, “stray” links may exist for a number of
reasons. A spammer may manage to post a comment that includes a spam link in
a reputable blog. Or a honey pot may be created, a spam page that offers
valuable information, but behind the scenes is still part of the farm. Unassuming
users might then point to the honey pot, without realizing that their link is
harvested for spamming purposes. The spammer may also purchase domain names
that recently expired but had previously been reputable and popular; this way
he/she can profit from the old links that are still out there.
A term called spam mass is introduced which is a measure of how much
PageRank a page accumulates through being linked to by spam pages. The idea
is that the target pages of spam farms, whose PageRank is boosted by many
spam pages, are expected to have a large spam mass. At the same time, popular
reputable pages, which have high PageRank because other reputable pages point
to them, have a small spam mass.
In order to compute this spam mass, we partition the Web into a set of
reputable nodes V+ and a set of spam nodes V−, with V+ ∪ V− = V and
V+ ∩ V− = ∅. Given this partitioning, the goal is to detect Web nodes x that
gain most of their PageRank through spam nodes in V− that link to them. Such
nodes x are called spam farm target nodes.
A simple approach would be that, given a node x, we look only at its
immediate in-neighbours. If we are provided information about whether or not
these in-neighbours are reputable or spam, we could infer whether or not x is
good or spam.
In a first approach, if the majority of x’s links come from spam nodes, x is
labelled a spam target node; otherwise it is labelled good. This scheme could
easily mislabel spam. An alternative is to look not only at the number of links,
but also at what amount of PageRank each link contributes. The contribution of a
link amounts to the change in PageRank induced by the removal of the link.
However, this scheme does not look beyond the immediate in-neighbours of x
and therefore could succumb to mislabelling. This paves a way for a third
scheme where a node x is labelled considering all the PageRank contributions of
other nodes that are directly or indirectly connected to x.
The PageRank contribution of x to y over the walk W is defined as
(13.45)
where is the probability of a random jump to x, and is the weight of
the walk.
(13.46)
This weight can be interpreted as the probability that a Markov chain of length k
starting in x reaches y through the sequence of nodes .
This gives the total PageRank contribution of x to y, , over all walks from
x to y as
(13.47)
For a node’s contribution to itself, a virtual cycle that has length zero and
weight 1 is considered such that
(13.48)
If a node x does not participate in cycles, x’s contribution to itself is
simply the random jump component. If there is
no walk from node x to y, then the PageRank contribution is zero.
This gives us that the PageRank score of a node y is the sum of the contributions
of all other nodes to y
(13.49)
(13.50)
defined as
(13.51)
We will now use this to compute the spam mass of a page. For a given
partitioning {V+, V−} of V and for any node x, px = px+ + px−, i.e., x’s
PageRank is the sum of the contributions of the good and of the spam nodes. The
absolute spam mass of a node x, denoted by Mx, is the PageRank contribution
that x receives from spam nodes, i.e., Mx = px−. Therefore, the spam mass is a
measure of how much direct or indirect in-neighbour spam nodes increase the
PageRank of a node. The relative spam mass of node x, denoted by mx, is the
fraction of x’s PageRank due to contributing spam nodes, i.e., mx = px− / px.
If we assume that only a subset of the good nodes is given, we can compute
(13.52)
Given the PageRank scores px and p′x, the estimated absolute spam mass of
node x is
(13.53)
and the estimated relative spam mass of x is
(13.54)
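Under the same assumption of a partial good core, the estimated spam mass can be sketched as the difference between the ordinary PageRank and a PageRank whose random jump is restricted to the known-good nodes. This is an illustrative approximation using networkx's personalization mechanism, not the exact algorithm of [13].

```python
import networkx as nx

def spam_mass(G, known_good, alpha=0.85):
    """Estimated spam mass: p is the ordinary PageRank, p_plus a PageRank
    biased towards the known-good core; whatever PageRank is not explained
    by the core is treated as (estimated) spam mass."""
    p = nx.pagerank(G, alpha=alpha)
    core = {v: 1.0 / len(known_good) for v in known_good}
    p_plus = nx.pagerank(G, alpha=alpha, personalization=core)
    abs_mass = {v: p[v] - p_plus[v] for v in G}             # estimated absolute spam mass
    rel_mass = {v: abs_mass[v] / p[v] if p[v] > 0 else 0.0 for v in G}
    return abs_mass, rel_mass
```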
Now, we consider the situation where the known core is significantly smaller than
V+, i.e., the total estimated good contribution is much smaller than the
total PageRank of the nodes. In this case the absolute mass estimates stay close to
the PageRank scores, with only a few nodes whose estimates differ from their PageRank.
To counter this problem, we construct a uniform random sample of nodes
and manually label each sample node as spam or good. This way it is possible to
roughly approximate the prevalence of spam nodes on the Web, and hence the
fraction of nodes that we estimate to be good.
Then, we scale the core-based random jump vector to w, where
(13.55)
The two random jump vectors are then of comparable magnitude.
Problems
Download the Wikipedia hyperlinks network available at
https://snap.stanford.edu/data/wiki-topcats.txt.gz.
64. Compute the PageRank, hub score and authority score for each of the nodes in the graph.
65. Report the nodes that have the top 3 PageRank, hub and authority scores respectively.
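One possible starting point for these problems, using NetworkX, is sketched below; nx.pagerank and nx.hits are the relevant built-ins, and a graph of this size may require substantial memory and run time.

```python
import networkx as nx

# NetworkX transparently decompresses .gz edge lists.
G = nx.read_edgelist("wiki-topcats.txt.gz", create_using=nx.DiGraph, nodetype=int)

pagerank = nx.pagerank(G)
hubs, authorities = nx.hits(G)      # HITS hub and authority scores

for name, scores in [("PageRank", pagerank),
                     ("hub", hubs),
                     ("authority", authorities)]:
    top3 = sorted(scores, key=scores.get, reverse=True)[:3]
    print(name, "top 3 nodes:", top3)
```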
Fig. 14.1 Undirected graph with four vertices and four edges. Vertices A and C have mutual contacts B
and D, while B and D have mutual friends A and C
Fig. 14.2 Figure 14.1 with edges between A and C, and between B and D, due to the triadic closure property
Granovetter postulated that triadic closure is one of the most crucial reasons
why acquaintances are the ones to thank for a person’s new job. Consider the
graph in Fig. 14.3: B has edges to the tightly-knit group containing A, D and C,
and also has an edge to E. The connection of B to E is qualitatively different
from the links to the tightly-knit group, because the opinions and information
that B and the group have access to are similar. The information that E can
provide to B will be things that B would not otherwise have access to.
We define an edge to be a local bridge if the end vertices of the edge do not
have mutual friends. By this definition, the edge between B and E is a local
bridge. Observe that local bridges and triadic closure are conceptually opposite
terms: while triadic closure implies an edge between vertices having mutual
friends, a local bridge connects vertices that have none. So,
Granovetter’s observation was that acquaintances connected to an individual by
local bridges can provide information such as job openings which the individual
might otherwise not have access to because the tightly-knit group, although with
greater motivation to find their buddy a job, will know roughly the same things
that the individual is exposed to.
The neighbourhood overlap of an edge (A, B) is defined as
overlap(A, B) = |N(A) ∩ N(B)| / |(N(A) ∪ N(B)) \ {A, B}|     (14.1)
where N(A) denotes the set of neighbours of A.
Consider the edge (B, F) in Fig. 14.3. The denominator of the neighbourhood
overlap is determined by the vertices A, C, D, E and G, since they are the
neighbours of either B or F. Of these only A is a neighbour of both B and F.
Therefore, the neighbourhood overlap of (B, F) is 1 / 5.
Similarly, the neighbourhood overlap of (B, E) is zero. This is because these
vertices have no neighbour in common. Therefore, we can conclude that a local
bridge has zero neighbourhood overlap.
The neighbourhood overlap can be used as a quantitative way of determining
the strength of an edge in the network.
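A small sketch of Eq. 14.1 in NetworkX terms is shown below; the toy graph and its wiring are illustrative, not the exact graph of Fig. 14.3.

```python
import networkx as nx

def neighbourhood_overlap(G, u, v):
    """Neighbourhood overlap of the edge (u, v), as in Eq. 14.1.
    Returns 0 for a local bridge (no common neighbours)."""
    nu = set(G[u]) - {v}
    nv = set(G[v]) - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

# Illustrative graph in the spirit of Fig. 14.3.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("C", "D"),
              ("B", "C"), ("B", "D"), ("B", "E"), ("B", "F"),
              ("A", "F"), ("F", "G")])
print(neighbourhood_overlap(G, "B", "F"))   # 1/5 with this wiring
print(neighbourhood_overlap(G, "B", "E"))   # 0.0: (B, E) is a local bridge
```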
14.2 Detecting Communities in a Network
To identify groups in a network we will first have to formulate an approach to
partition the graph.
One class of methods to partition graphs is to identify and remove the
“spanning” links between densely-connected regions. Once these links are
removed, the graph breaks into disparate pieces. This process can be done
recursively. These methods are referred to as divisive methods of graph
partitioning.
An alternate class of methods aims at identifying nodes that are likely to
belong to the same region and merge them together. This results in a large
number of chunks, each of which are then merged in a bottom-up approach.
Such methods are called agglomerative methods of graph partitioning.
One traditional method of agglomerative clustering is hierarchical clustering
. In this technique, we first calculate the weight for each pair i,j of vertices
in the network, which gives a sense of how similar these two vertices are when
compared to other vertices. Then, we take the n vertices in the network, with no
edge between them, and add edges between pairs one after another in order of
their weights, starting with the pair with the strongest weight and progressing to
the weakest. As edges are added, the resulting graph shows a nested set of
increasingly large components, which are taken to be the communities. Since the
components are properly nested, they can all be represented using a tree called a
“dendrogram” , in which the lowest level at which two vertices are connected
represents the strength of the edge which resulted in their first becoming
members of the same community. A “slice” through this tree at any level gives
the communities which existed just before an edge of the corresponding weight
was added.
Several different weights have been proposed for the hierarchical clustering
algorithm. One possible definition of the weight is the number of node-
independent paths between vertices. Two paths which connect the same pair of
vertices are said to be node-independent if they share none of the same vertices
other than their initial and final vertices. The number of node-independent paths
between vertices i and j in a graph is equal to the minimum number of vertices
that need be removed from the graph in order to disconnect i and j from one
another. Thus this number is in a sense a measure of the robustness of the
network to deletion of nodes.
Another possible way to define weights between vertices is to count the total
number of paths that run between them. However, since the number of paths
between any two vertices is infinite (unless it is zero), we typically weight paths
of length l by a factor α^l, with α small, so that the weighted count of the number
of paths converges. Thus long paths contribute exponentially less weight than
short ones. If A is the adjacency matrix of the network, such that A_ij is 1 if there
is an edge between vertices i and j and 0 otherwise, then the weights in this
definition are given by the elements of the matrix
W = [I - αA]^{-1}     (14.2)
In order for the sum to converge, we must choose α smaller than the reciprocal
of the largest eigenvalue of A.
However, these definitions of weights are less successful in certain situations.
For instance, if a vertex is connected to the rest of a network by only a single
edge then, to the extent that it belongs to any community, it should clearly be
considered to belong to the community at the other end of that edge.
Unfortunately, both the numbers of node-independent paths and the weighted
path counts for such vertices are small and hence single nodes often remain
isolated from the network when the communities are formed.
1.
Calculate the betweenness values for all the edges in the
network.
2.
Remove the edge with the highest betweenness value.
3.
Recalculate the betweenness for all edges affected by the
removal.
4.
Repeat from step 2 until no edges remain.
The algorithm in [5] can calculate the betweenness for all the |E| edges in a
graph of |V| vertices in time O(|V||E|). Since the calculation has to be repeated
once for the removal of each edge, the entire algorithm runs in worst-case time
O(|V||E|²).
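A minimal sketch of this divisive procedure (commonly known as the Girvan-Newman algorithm) using NetworkX is given below; for brevity it stops after the first split rather than removing every edge.

```python
import networkx as nx

def split_once_by_betweenness(G):
    """Remove the highest-betweenness edge repeatedly until the graph
    breaks into more connected components than it started with."""
    g = G.copy()
    start = nx.number_connected_components(g)
    while nx.number_connected_components(g) == start and g.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(g)   # steps 1 and 3
        worst_edge = max(betweenness, key=betweenness.get)
        g.remove_edge(*worst_edge)                        # step 2
    return [set(c) for c in nx.connected_components(g)]

print(split_once_by_betweenness(nx.karate_club_graph()))
```

Recent NetworkX releases also ship a ready-made generator, nx.community.girvan_newman, which yields the successive divisions obtained by repeating the removal until no edges remain.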
14.2.2 Modularity
Reference [2] proposed a measure called “modularity” and suggested that the
optimization of this quality function over the possible divisions of the network is
an efficient way to detect communities in a network. The modularity is, up to a
multiplicative constant, the number of edges falling within groups minus the
expected number in an equivalent network with edges placed at random. The
modularity can be either positive or negative, with positive values indicating the
possible presence of community structure. Thus, we can search for community
structure precisely by looking for the divisions of a network that have positive,
and preferably large, values of the modularity.
Let us suppose that all the vertices in our graph belong to one of two
groups, i.e., s_i = 1 if vertex i belongs to group 1 and s_i = -1 if it belongs to
group 2. Let the number of edges between vertices i and j be A_ij, which will
normally be 0 or 1, although larger values are possible for graphs with multiple
edges. The expected number of edges between vertices i and j if edges are placed
at random is k_i k_j / 2|E|, where k_i and k_j are the degrees of vertices i and j
respectively, and 2|E| = Σ_i k_i. The modularity Q is given by the sum of
A_ij - k_i k_j / 2|E| over all pairs of vertices i and j that fall in the same group.
Observing that the quantity (s_i s_j + 1)/2 is 1 if i and j are in the same group
and 0 otherwise, we can write
Q = (1 / 4|E|) Σ_ij (A_ij - k_i k_j / 2|E|) s_i s_j     (14.3)
where the second equality follows from the observation that Σ_ij A_ij = Σ_i k_i = 2|E|.
Equation 14.3 can be written in matrix form as
Q = (1 / 4|E|) s^T B s     (14.4)
where s is the column vector whose elements are the s_i and we have defined a real
symmetric matrix B with elements
B_ij = A_ij - k_i k_j / 2|E|     (14.5)
which is called the modularity matrix. The elements of each of its rows and
columns sum to zero, so that it always has the eigenvector (1, 1, ..., 1) with
eigenvalue zero.
Given Eq. 14.4, s can be written as a linear combination of the normalized
eigenvectors u_i of B, so that s = Σ_i a_i u_i with a_i = u_i^T s, and we get
Q = (1 / 4|E|) Σ_i (u_i^T s)² β_i     (14.6)
where β_i is the eigenvalue of B corresponding to eigenvector u_i.
Assume that the eigenvalues are labelled in decreasing order,
β_1 ≥ β_2 ≥ ... ≥ β_|V|. To maximize the modularity, we have to choose the value
of s that concentrates as much weight as possible in the terms of the sum in Eq. 14.6
involving the largest eigenvalues. If there were no other constraints on the
choice of s, this would be an easy task: we would simply choose s proportional to
the eigenvector u_1. This places all of the weight in the term involving the largest
eigenvalue β_1, the other terms being automatically zero, because the
eigenvectors are orthogonal.
There is another constraint on the problem imposed by the restriction of the
elements of s to the values ±1, which means s cannot normally be chosen
parallel to u_1. To make it as close to parallel as possible, we have to maximize
the dot product u_1^T s. It is straightforward to see that the maximum is achieved
by setting s_i = +1 if the corresponding element of u_1 is positive and s_i = -1
otherwise. To subdivide a group g further, we consider the additional contribution
ΔQ to the modularity from splitting g in two, which can be written as
ΔQ = (1 / 4|E|) s^T B^(g) s     (14.7)
where the generalized modularity matrix B^(g) has elements, for i, j in g,
B^(g)_ij = B_ij - δ_ij Σ_{k∈g} B_ik     (14.8)
Since Eq. 14.7 has the same form as Eq. 14.4, we can now apply the spectral
approach to this generalized modularity matrix, just as before, to maximize ΔQ.
Note that the rows and columns of B^(g) still sum to zero and that ΔQ is
correctly zero if group g is undivided. Note also that for a complete network
Eq. 14.8 reduces to the previous definition of the modularity matrix, Eq. 14.5,
because Σ_{k∈g} B_ik is zero in that case.
In repeatedly subdividing the network, an important question we need to address
is at what point to halt the subdivision process. A nice feature of this method is
that it provides a clear answer to this question: if there exists no division of a
subgraph that will increase the modularity of the network, or equivalently that
gives a positive value for ΔQ, then there is nothing to be gained by dividing the
subgraph and it should be left alone; it is indivisible in the sense of the previous
section. This happens when there are no positive eigenvalues of the matrix B^(g),
and thus the leading eigenvalue provides a simple check for the termination of
the subdivision process: if the leading eigenvalue is zero, which is the smallest
value it can take, then the subgraph is indivisible.
Note, however, that although the absence of positive eigenvalues is a
sufficient condition for indivisibility, it is not a necessary one. In particular, if
there are only small positive eigenvalues and large negative ones, the terms in
Eq. 14.6 for negative β_i may outweigh those for positive β_i. It is straightforward to
guard against this possibility, however; we simply calculate the modularity
contribution ΔQ for each proposed split directly and confirm that it is greater
than zero.
Thus the algorithm is as follows. We construct the modularity matrix,
Eq. 14.5, for the network and find its leading (most positive) eigenvalue and the
corresponding eigenvector. We divide the network into two parts according to
the signs of the elements of this vector, and then repeat the process for each of
the parts, using the generalized modularity matrix, Eq. 14.8. If at any stage we
find that a proposed split makes a zero or negative contribution to the total
modularity, we leave the corresponding subgraph undivided. When the entire
network has been decomposed into indivisible subgraphs in this way, the
algorithm ends.
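A compact sketch of one spectral bisection step is given below, with Eqs. 14.4-14.5 written out explicitly; it uses dense NumPy linear algebra and is intended only for small graphs.

```python
import numpy as np
import networkx as nx

def spectral_bisection(G):
    """Split G in two using the leading eigenvector of the modularity
    matrix B = A - k k^T / 2|E| and report the modularity of the split."""
    A = nx.to_numpy_array(G)
    k = A.sum(axis=1)
    two_m = k.sum()                      # 2|E|
    B = A - np.outer(k, k) / two_m       # Eq. 14.5
    eigvals, eigvecs = np.linalg.eigh(B)
    u1 = eigvecs[:, np.argmax(eigvals)]  # leading (most positive) eigenvector
    s = np.where(u1 >= 0, 1, -1)         # split by the signs of its elements
    Q = s @ B @ s / (2 * two_m)          # Eq. 14.4: (1/4|E|) s^T B s
    nodes = list(G)
    group1 = {nodes[i] for i in range(len(nodes)) if s[i] > 0}
    group2 = set(nodes) - group1
    return group1, group2, Q

print(spectral_bisection(nx.karate_club_graph()))
```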
An alternate method for community detection is a technique that bears a
striking resemblance to the Kernighan-Lin algorithm. Suppose we are given
some initial division of our vertices into two groups. We then find among the
vertices the one that, when moved to the other group, will give the biggest
increase in the modularity of the complete network, or the smallest decrease if
no increase is possible. We make such moves repeatedly, with the constraint that
each vertex is moved only once. When all the vertices have been moved, we
search the set of intermediate states occupied by the network during the
operation of the algorithm to find the state that has the greatest modularity.
Starting again from this state, we repeat the entire process iteratively until no
further improvement in the modularity results.
Although this method by itself only gives reasonable modularity values, the
method really comes into its own when it is used in combination with the
spectral method introduced earlier. The spectral approach based on the leading
eigenvector of the modularity matrix gives an excellent guide to the general form
that the communities should take and this general form can then be fine-tuned by
the vertex moving method to reach the best possible modularity value. The
whole procedure is repeated to subdivide the network until every remaining
subgraph is indivisible, and no further improvement in the modularity is
possible.
The most time-consuming part of the algorithm is the evaluation of the
leading eigenvector of the modularity matrix. The fastest method for finding this
eigenvector is the simple power method, the repeated multiplication of the
matrix into a trial vector. Although it appears at first glance that matrix
multiplications will be slow, taking O(|V|²) operations each because the
modularity matrix is dense, we can exploit its special structure. Writing
B = A - k k^T / 2|E|, where A is the adjacency matrix and k is the vector whose elements are the
degrees of the vertices, the product of B and an arbitrary vector x can be written
B x = A x - k (k^T x) / 2|E|     (14.9)
The first term is a standard sparse matrix multiplication taking time
O(|E| + |V|). The inner product k^T x takes time O(|V|) to evaluate and hence the second
term can be evaluated in total time O(|V|) also. Thus the complete multiplication
can be performed in O(|E| + |V|) time. Typically O(|V|) such multiplications are
needed to converge to the leading eigenvector, for a running time of
O((|E| + |V|)|V|) overall. Often we are concerned with sparse graphs with
|E| proportional to |V|, in which case the running time becomes O(|V|²). The same
trick can be applied to the generalized modularity matrix of Eq. 14.8.
We then repeat the division into two parts until the network is reduced to its
component indivisible subgraphs. The running time of the entire process
depends on the depth of the tree or “dendrogram” formed by these repeated
divisions. In the worst case the dendrogram has depth linear in |V|, but only a
small fraction of possible dendrograms realize this worst case. A more realistic
figure for running time is given by the average depth of the dendrogram, which
goes as log |V|, giving an average running time for the whole algorithm of
O(|V|² log |V|) in the sparse case. This is considerably better than the
O(|V|³) running time of the betweenness algorithm, and slightly better than the
O(|V|² log² |V|) running time of the extremal optimization algorithm.
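A sketch of the multiplication trick of Eq. 14.9 with a sparse adjacency matrix is shown below; it assumes NetworkX 2.7 or later and SciPy, and the shift it applies is only there to make power iteration converge to the most positive eigenvalue.

```python
import numpy as np
import networkx as nx

def leading_modularity_eigenvector(G, iters=500):
    """Power iteration for the leading eigenvector of B without ever
    forming B densely: B x = A x - k (k.x) / 2|E|  (Eq. 14.9)."""
    A = nx.to_scipy_sparse_array(G, format="csr", dtype=float)
    k = np.asarray(A.sum(axis=1)).ravel()
    two_m = k.sum()

    def B_times(x):
        return A @ x - k * (k @ x) / two_m

    # Shifting by 2*k_max makes every eigenvalue of (B + shift I) non-negative,
    # so power iteration converges to the most positive eigenvalue of B.
    shift = 2.0 * k.max()
    x = np.random.default_rng(0).random(A.shape[0])
    for _ in range(iters):
        x = B_times(x) + shift * x
        x /= np.linalg.norm(x)
    return x
```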
Another way to assess the quality of a clustering is through cut-based measures
such as expansion and conductance. The expansion of a cut (S, V \ S) of a graph
G(V, E) is defined as
ψ(S) = ( Σ_{u∈S, v∈V\S} w(u, v) ) / min(|S|, |V \ S|)     (14.10)
where w(u, v) is the weight of the edge (u, v). The expansion of a (sub)graph is
the minimum expansion over all the cuts of the (sub)graph. The expansion of a
clustering of G is the minimum expansion over all its clusters. The larger the
expansion of the clustering, the higher its quality.
The conductance of the cut (S, V \ S) is defined as
φ(S) = ( Σ_{u∈S, v∈V\S} w(u, v) ) / min(w(S), w(V \ S))     (14.11)
where w(S) = Σ_{u∈S} Σ_{v∈V} w(u, v). The conductance of a graph is the
minimum conductance over all the cuts of the graph. For a clustering of G, let
be a cluster and a cluster within C, where . The
conductance of S in C is
(14.12)
The conductance of a cluster is the smallest conductance of a cut within
the cluster; for a clustering, the conductance is the minimum conductance of its
clusters.
The main difference between expansion and conductance is that expansion treats
all nodes as equally important while the conductance gives greater importance to
nodes with higher degree and adjacent edge weight.
However, both expansion and conductance are insufficient by themselves as
clustering criteria because neither enforces properties pertaining to inter-cluster
weight or to the relative sizes of clusters. Moreover, optimizing either measure is
NP-hard, so approximations are required.
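Both measures are available directly in NetworkX; the snippet below evaluates them on the karate club graph, with the split by club membership serving only as an illustrative candidate cluster.

```python
import networkx as nx

G = nx.karate_club_graph()
S = {v for v in G if G.nodes[v]["club"] == "Mr. Hi"}   # one candidate cluster

print(nx.cut_size(G, S))        # total weight of edges crossing the cut
print(nx.edge_expansion(G, S))  # cut weight / min(|S|, |V \ S|), cf. Eq. 14.10
print(nx.conductance(G, S))     # cut weight / min(adjacent weight), cf. Eq. 14.11
```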
Reference [1] presents a cut clustering algorithm. Let G(V, E) be a graph and let
s, t be two nodes of G. Let (S, T) be the minimum cut between s and t,
where s ∈ S and t ∈ T. S is defined to be the community of s in G w.r.t. t. If the
minimum cut between s and t is not unique, we choose the minimum cut that
maximises the size of S.
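A sketch of computing the community of s with respect to an artificial sink via a max-flow/min-cut computation in NetworkX is shown below; the sink label and the use of edge weights as capacities are assumptions of this sketch, not details taken from the paper.

```python
import networkx as nx

def community_of(G, s, alpha):
    """Community of s w.r.t. an artificial sink t that is connected to
    every node with capacity alpha (the expanded graph of the method)."""
    H = nx.Graph()
    for u, v, data in G.edges(data=True):
        H.add_edge(u, v, capacity=data.get("weight", 1.0))
    t = "__artificial_sink__"                 # hypothetical sink label
    for v in G:
        H.add_edge(v, t, capacity=alpha)
    _, (S, _) = nx.minimum_cut(H, s, t)       # min s-t cut; S is the s side
    S.discard(t)
    return S

print(community_of(nx.karate_club_graph(), 0, alpha=0.5))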
The paper provides proofs for the following theorems.
Here, a community S is defined as a collection of nodes with the property
that every node of the community links predominantly to other community nodes.
Theorem 34
For an undirected graph G(V, E), let S be a community of s w.r.t t. Then
(14.13)
Theorem 36
Let G(V, E) be an undirected graph, a source, and connect an
artificial sink t with edges of capacity to all nodes. Let S be the community
of s w.r.t t. For any non-empty P and Q, such that and
, the following bounds always hold:
(14.14)
Theorem 38
Let be a minimum cut tree of a graph G(V, E), and let (u, w) be an edge
of . Edge (u, w) yields the cut (U, W) in G, with , . Now, take
any cut of U, so that and are non-empty, ,
, and . Then,
(14.15)
Theorem 39
Let be the expanded graph of G, and let S be the community of s w.r.t the
artificial sink t. For any non-empty P and Q, such that and
, the following bound always holds:
(14.16)
Theorem 40
Let be the expanded graph of G(V, E) and let S be the community of s w.r.t
the artificial sink t. Then, the following bound always holds:
(14.17)
Figs. 14.5 and 14.6 (caption excerpt) The overlap value is shown for four local network
configurations. In the real network, the overlap (blue circles) increases as a function of
cumulative tie strength, representing the fraction of links with tie strength smaller than w. The
dyadic hypothesis is tested by randomly permuting the weights, which removes the coupling
between overlap and w (red squares). The overlap decreases as a function of link betweenness
centrality, as suggested by the global efficiency principle; in this case, the links connecting
communities have high betweenness values (red), whereas the links within the communities have low values (green)
Figures 14.5 and 14.6 suggest that instead of tie strength being determined
by the characteristics of the individuals it connects or by the network topology, it
is determined solely by the network structure in the tie’s immediate vicinity.
To evaluate this suggestion, we explore the network’s ability to withstand the
removal of either strong or weak ties. For this, we measure the relative size of
the giant component , providing the fraction of nodes that can all reach
each other through connected paths as a function of the fraction of removed
links, f. Figure 14.7a, b shows that removing in rank order the weakest (or
smallest overlap) to strongest (or greatest overlap) ties leads to the network’s
sudden disintegration at . However, removing first the
strongest ties will shrink the network but will not rapidly break it apart. The
precise point at which the network disintegrates can be determined by
monitoring , where is the number of clusters containing
s nodes. Figure 14.7c, d shows that develops a peak if we start with the
weakest links. Finite size scaling, a well established technique for identifying the
phase transition, indicates that the values of the critical points are
and for the removal of the weak
ties, but there is no phase transition when the strong ties are removed first.
Fig. 14.7 The control parameter f denotes the fraction of removed links. a and c These graphs correspond
to the case in which the links are removed on the basis of their strengths ( removal). b and d These
graphs correspond to the case in which the links were removed on the basis of their overlap ( removal).
The black curves correspond to removing first the high-strength (or high ) links, moving toward the
weaker ones, whereas the red curves represent the opposite, starting with the low-strength (or low ) ties
and moving toward the stronger ones. a and b The relative size of the largest component
indicates that the removal of the low or links leads to a
breakdown of the network, whereas the removal of the high or links leads only to the network’s
gradual shrinkage. a Inset Shown is the blowup of the high region, indicating that when the low
ties are removed first, the red curve goes to zero at a finite f value. c and d According to percolation theory,
diverges for as we approach the critical threshold , where the network
falls apart. If we start link removal from links with low (c) or (d) values, we observe a clear
signature of divergence. In contrast, if we start with high (c) or (d) links, the divergence is
absent
This finding gives us the following conclusion: Given that the strong ties are
predominantly within the communities, their removal will only locally
disintegrate a community but not affect the network’s overall integrity. In
contrast, the removal of the weak links will delete the bridges that connect
different communities, leading to a phase transition driven network collapse.
To see whether this observation affects global information diffusion, at time
0 a randomly selected individual is infected with some novel information. It is
assumed that at each time step, each infected individual, , can pass the
information to his/her contact, , with effective probability , where
the parameter x controls the overall spreading rate. Therefore, the more time two
individuals spend on the phone, the higher the chance that they will pass on the
monitored information. The spreading mechanism is similar to the susceptible-
infected model of epidemiology in which recovery is not possible, i.e., an
infected individual will continue transmitting information indefinitely. As a
control, the authors considered spreading on the same network, but replaced all
tie strengths with their average value, resulting in a constant transmission
probability for all links.
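A small simulation sketch of this susceptible-infected process is given below; the parameter names and the assumption that the weights are scaled so the per-step probabilities stay below one are illustrative choices, not taken from the study.

```python
import random
import networkx as nx

def si_spread(G, seed, x=1.0, steps=100):
    """Susceptible-infected spreading: at each step an infected node i
    infects a susceptible neighbour j with probability x * w_ij
    (weights assumed scaled so that x * w_ij <= 1)."""
    infected = {seed}
    history = [1]
    for _ in range(steps):
        newly = set()
        for i in infected:
            for j in G[i]:
                if j not in infected:
                    w = G[i][j].get("weight", 1.0)
                    if random.random() < min(1.0, x * w):
                        newly.add(j)
        infected |= newly
        history.append(len(infected))   # number infected after each step
    return history
```

Running it once with the real weights and once with all weights set to their average reproduces the kind of comparison described next.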
Figure 14.8a shows the real diffusion simulation, where it was found that
information transfer is significantly faster on the network for which all weights
are equal, the difference being rooted in a dynamic trapping of information in
communities. Such trapping is clearly visible if the number of infected
individuals in the early stages of the diffusion process is monitored (as shown in
Fig. 14.8b). Indeed, rapid diffusion within a single community was observed,
corresponding to fast increases in the number of infected users, followed by
plateaus, corresponding to time intervals during which no new nodes are infected
before the news escapes the community. When all link weights are replaced with
an average value w (the control diffusion simulation) the bridges between
communities are strengthened, and the spreading becomes a predominantly
global process, rapidly reaching all nodes through a hierarchy of hubs.
The dramatic difference between the real and the control spreading process
begs the following question: Where do individuals get their information?
Figure 14.8c shows that the distribution of the tie strengths through which each
individual was first infected has a prominent peak at s, indicating that,
in the vast majority of cases, an individual learns about the news through ties of
intermediate strength. The distribution changes dramatically in the control case,
however, when all tie strengths are taken to be equal during the spreading
process. In this case, the majority of infections take place along the ties that are
otherwise weak (as depicted in Fig. 14.8d). Therefore, both weak and strong ties
have a relatively insignificant role as conduits for information, the former
because the small amount of on-air time offers little chance of information
transfer and the latter because they are mostly confined within communities,
with little access to new information.
To illustrate the difference between the real and the control simulation,
Fig. 14.8e, f show the spread of information in a small neighbourhood. First, the
overall direction of information flow is systematically different in the two cases,
as indicated by the large shaded arrows. In the control runs, the information
mainly follows the shortest paths. When the weights are taken into account,
however, information flows along a strong tie backbone, and large regions of the
network, connected to the rest of the network by weak ties, are only rarely
infected.
Fig. 14.8 The dynamics of spreading on the weighted mobile call graph, assuming that the probability for
a node to pass on the information to its neighbour in one time step is given by , with
. a The fraction of infected nodes as a function of time t. The blue curve (circles)
corresponds to spreading on the network with the real tie strengths, whereas the black curve (asterisks)
represents the control simulation, in which all tie strengths are considered equal. b Number of infected
nodes as a function of time for a single realization of the spreading process. Each steep part of the curve
corresponds to invading a small community. The flatter part indicates that the spreading becomes trapped
within the community. c and d Distribution of strengths of the links responsible for the first infection for a
node in the real network (c) and control simulation (d). e and f Spreading in a small neighbourhood in the
simulation using the real weights (e) or the control case, in which all weights are taken to be equal (f). The
infection in all cases was released from the node marked in red, and the empirically observed tie strength is
shown as the thickness of the arrows (right-hand scale). The simulation was repeated 1,000 times; the size
of the arrowheads is proportional to the number of times that information was passed in the given direction,
and the colour indicates the total number of transmissions on that link (the numbers in the colour scale refer
to percentages of 1, 000). The contours are guides to the eye, illustrating the difference in the information
direction flow in the two simulations
Problems
The betweenness centrality of an edge in G(V, E) is given by
Eq. 14.18.
(14.18)
(14.19)
After the BFS is finished, compute the dependency of s on each edge
using the recurrence
(14.20)
Since the values are only really defined for edges that connect two nodes, u
and v, where u is further away from s than v, assume that the value is 0 for cases that are
undefined (i.e., where an edge connects two nodes that are equidistant from s).
Do not iterate over all edges when computing the dependency values. It makes
more sense to just look at edges that are in some BFS tree starting from s, since
other edges will have zero dependency values (they are not on any shortest
paths). (Though iterating over all edges is not technically wrong if you just set
the dependency values to 0 when the nodes are equidistant from s.) The
betweenness centrality of (u, v) or B(u, v) is
(14.21)
This algorithm has a time complexity of O(|V||E|). However, the algorithm
requires the computation of the betweenness centrality of all edges even if
the values of only some edges are actually needed.
starting nodes .
References
1. Flake, Gary William, Robert E. Tarjan, and Kostas Tsioutsiouliklis. 2002. Clustering methods based on
minimum-cut trees. Technical report, technical report 2002–06, NEC, Princeton, NJ.
2. Girvan, Michelle, and Mark E.J. Newman. 2002. Community structure in social and biological networks.
Proceedings of the National Academy of Sciences 99 (12): 7821–7826.
[MathSciNet][Crossref]
3. Gomory, Ralph E., and Tien Chung Hu. 1961. Multi-terminal network flows. Journal of the Society for
Industrial and Applied Mathematics 9 (4): 551–570.
4. Granovetter, Mark S. 1977. The strength of weak ties. In Social networks, 347–367. Elsevier.
5. Newman, Mark E.J. 2001. Scientific collaboration networks. II. Shortest paths, weighted networks, and
centrality. Physical Review E 64 (1): 016132.
6. Onnela, J.-P., Jari Saramäki, Jorkki Hyvönen, György Szabó, David Lazer, Kimmo Kaski, János
Kertész, and A.-L. Barabási. 2007. Structure and tie strengths in mobile communication networks.
Proceedings of the National Academy of Sciences 104 (18): 7332–7336.
geometric relationships in this learned space reflect the structure of the original
graph. After optimizing the embedding space, the learned embeddings can be
used as feature inputs for machine learning tasks. The key distinction between
these representation learning approaches and hand-engineering tasks is how they
deal with the problem of capturing structural information about the graph. Hand-
engineering techniques deal with this problem as a pre-processing step, while
representation learning approaches treat it as a machine learning task in itself,
using a data-driven approach to learn embeddings that encode graph structure.
We assume that the representation learning algorithm takes as input an
undirected graph, its corresponding binary adjacency matrix A and a
real-valued matrix of node attributes X. The goal is to use this information to
map each node to a low-dimensional vector embedding.
There are two perspectives to consider before deciding how to optimize this
mapping. First is the purview of unsupervised learning where only the
information in A and X are considered without accounting for the downstream
machine learning task. Second is the arena of supervised learning, where the
embeddings are optimized for specific downstream regression or classification tasks.
The encoder-decoder framework is built around two mapping functions: an encoder
ENC : V → R^d     (15.1)
that maps nodes to vector embeddings z_i ∈ R^d (where z_i corresponds to the
embedding of node v_i), and a pairwise decoder
DEC : R^d × R^d → R^+     (15.2)
that maps pairs of node embeddings to a real-valued graph proximity measure,
which quantifies the proximity of the two nodes in the original graph.
When we apply the pairwise decoder to a pair of embeddings ( , ) we get
a reconstruction of the proximity between and in the original graph, and
the goal is to optimize the encoder and decoder mappings to minimize the error,
or loss, in this reconstruction so that:
(15.3)
where is a user-defined, graph-based proximity measure between nodes,
defined over G.
Most approaches realize the reconstruction objective (Eq. 15.3) by minimizing
an empirical loss, , over a set of training node pairs,
(15.4)
(15.5)
where Z is a matrix containing the embedding vectors for all nodes and
(15.6)
and the loss function weights pairs of nodes according to their proximity in the
graph
(15.7)
15.1.2.2 Inner-Product Methods
There are a large number of embedding methodologies based on a pairwise,
inner-product decoder
(15.8)
where the strength of the relationship between two nodes is proportional to the
dot product of their embeddings. This decoder is commonly paired with the
mean-squared-error (MSE) loss
(15.9)
Graph Factorization (GF) algorithm [1], GraRep [4], and HOPE [22] fall under
this category. These algorithms share the same decoder function as well as the
loss function, but differ in the proximity measure. The GF algorithm defines the
proximity measure directly from the adjacency matrix (the proximity of two nodes is
simply the corresponding entry of A). GraRep uses powers of the adjacency matrix
(e.g., A²) as higher-order proximity measures, while the HOPE algorithm supports general
proximity measures.
These methods are generally referred to as matrix-factorization approaches
because, averaging across all nodes, they optimize a loss function of the form:
(15.10)
where S is a matrix containing pairwise proximity measures (i.e.,
) and Z is the matrix of node embeddings. Intuitively, the goal
of these methods is simply to learn embeddings for each node such that the inner
product between the learned embedding vectors approximates some
deterministic measure of graph proximity.
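A toy sketch of this idea is shown below: it computes embeddings whose inner products approximate an adjacency-based proximity matrix, using an eigendecomposition rather than the stochastic optimization procedures these methods actually employ.

```python
import numpy as np
import networkx as nx

def inner_product_embeddings(G, d=16):
    """Learn Z such that Z Z^T approximates the proximity matrix S = A."""
    S = nx.to_numpy_array(G)                      # proximity = adjacency here
    vals, vecs = np.linalg.eigh(S)                # S is symmetric
    top = np.argsort(vals)[-d:]                   # d largest eigenvalues
    Z = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
    return Z                                      # row i = embedding of node i

Z = inner_product_embeddings(nx.karate_club_graph(), d=8)
print(np.round(Z @ Z.T)[:4, :4])                  # approximate adjacency block
```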
Random-walk methods such as DeepWalk and node2vec instead optimize embeddings
against a stochastic measure of proximity, using a decoder of the form
DEC(z_i, z_j) = exp(z_i^T z_j) / Σ_{k∈V} exp(z_i^T z_k) ≈ p_{G,T}(v_j | v_i)     (15.11)
where p_{G,T}(v_j | v_i) is the probability of visiting v_j on a length-T random walk
starting at v_i, with T usually defined to be in the range {2, ..., 10}. Unlike the
proximity measures discussed above, this measure is both stochastic and asymmetric.
More formally, these approaches attempt to minimize the following cross-
entropy loss:
(15.12)
where in this case the training set, , is generated by sampling random walks
starting from each node (i.e., where N pairs for each node, , are sampled from
the distribution ). However, naively evaluating this loss is
of order O(|D||V|) since evaluating the denominator of Eq. 15.11 has time
complexity O(|V|). Thus, DeepWalk and node2vec use different optimizations
and approximations to compute the loss in Eq. 15.12. DeepWalk employs a
“hierarchical softmax” technique to compute the normalizing factor, using a
binary-tree structure to accelerate the computation. In contrast, node2vec
approximates Eq. 15.12 using “negative sampling” : instead of normalizing over
the full vertex set, node2vec approximates the normalizing factor using a set of
random “negative samples”.
The key distinction between node2vec and DeepWalk is that node2vec allows
for a flexible definition of random walks, whereas DeepWalk uses simple
unbiased random walks over the graph. In particular, node2vec introduces two
random walk hyper-parameters, p and q, that bias the random walk. The hyper-
parameter p controls the likelihood of the walk immediately revisiting a node,
while q controls the likelihood of the walk revisiting a node’s one-hop
neighbourhood. By introducing these hyper-parameters, node2vec is able to
smoothly interpolate between walks that are more akin to breadth-first or depth-
first search. Reference [13] found that tuning these parameters allowed the
model to trade-off between learning embeddings that emphasize community
structures or embeddings that emphasize local structural roles.
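A compact DeepWalk-style sketch is given below: unbiased walks followed by skip-gram. It assumes the gensim library (version 4 or later) is installed and that the graph has no isolated nodes; the walk parameters are illustrative, and hs=1 selects gensim's hierarchical softmax, matching the normalization strategy discussed above.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def deepwalk_embeddings(G, num_walks=10, walk_length=40, d=2):
    """Unbiased random walks fed to a skip-gram (Word2Vec) model."""
    walks, nodes = [], list(G)
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk, cur = [str(start)], start
            for _ in range(walk_length - 1):
                cur = random.choice(list(G[cur]))   # assumes no isolated nodes
                walk.append(str(cur))
            walks.append(walk)
    model = Word2Vec(walks, vector_size=d, window=5, min_count=0, sg=1, hs=1)
    return {v: model.wv[str(v)] for v in G}

embeddings = deepwalk_embeddings(nx.karate_club_graph())
```

Biasing the walk transitions with the p and q hyper-parameters would turn this into a node2vec-style sampler.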
Figure 15.1 depicts the DeepWalk algorithm applied to the Zachary Karate
Club network to generate its two-dimensional embedding. The distance between
nodes in this embedding space reflects proximity in the original graph.
Fig. 15.1 a Graph of the Zachary Karate Club network where nodes represent members and edges indicate
friendship between members. b Two-dimensional visualization of node embeddings generated from this
graph using the DeepWalk method. The distances between nodes in the embedding space reflect proximity
in the original graph, and the node embeddings are spatially clustered according to the different colour-
coded communities
Figure 15.2 illustrates the node2vec algorithm applied to the graph of the
characters of the Les Misérables novel, where colours depict the nodes that have
structural equivalence.
Fig. 15.2 Graph of the Les Misérables novel where nodes represent characters and edges indicate
interaction at some point in the novel between corresponding characters. (Left) Global positioning of the
nodes. Same colour indicates that the nodes belong to the same community. (Right) Colour denotes
structural equivalence between nodes, i.e, they play the same roles in their local neighbourhoods. Blue
nodes are the articulation points. These equivalences were identified using the node2vec algorithm
(15.13)
and an adjacency-based proximity measure. The second-order encoder-decoder
objective is similar but considers two-hop adjacency neighbourhoods and uses
a decoder identical to Eq. 15.11. Both the first-order and second-order
objectives are optimized using loss functions derived from the KL-divergence
metric [28]. Thus, LINE is conceptually related to node2vec and DeepWalk in
that it uses a probabilistic decoder and loss, but it explicitly factorizes first-and
second-order proximities, instead of combining them in fixed-length random
walks.
1.
No parameters are shared between nodes in the encoder.
This can be statistically inefficient, since parameter sharing can
act as a powerful form of regularization, and it is also
computationally inefficient, since it means that the number of
parameters in direct encoding methods necessarily grows as
O(|V|).
2.
Direct encoding also fails to leverage node attributes during
encoding. In many large graphs, nodes have attribute
information (e.g., user profiles) that is often highly informative
with respect to the node’s position and role in the graph.
3.
Direct encoding methods can only generate embeddings for
nodes that were present during the training phase, and they
cannot generate embeddings for previously unseen nodes unless
additional rounds of optimization are performed to optimize the
embeddings for these nodes. This is highly problematic for
evolving graphs, massive graphs that cannot be fully stored in
memory, or domains that require generalizing to new graphs
after training.
contains the node’s pairwise graph proximity with all other nodes and functions as a
high-dimensional vector representation of its neighbourhood. The autoencoder
objective for DNGR and SDNE is to embed nodes using the vectors such that
the vectors can then be reconstructed from these embeddings:
(15.14)
The loss for these methods takes the following form:
(15.15)
As with the pairwise decoder, we have that the dimension of the embeddings
is much smaller than |V|, so the goal is to compress the node’s neighbourhood
information into a low-dimensional vector. For both SDNE and DNGR, the
encoder and decoder functions consist of multiple stacked neural network layers:
each layer of the encoder reduces the dimensionality of its input, and each layer
of the decoder increases the dimensionality of its input.
SDNE and DNGR differ in the similarity functions they use to construct the
neighbourhood vectors and also in how the autoencoder is optimized. DNGR
defines the neighbourhood vectors according to the pointwise mutual information of two nodes co-
occurring on random walks, similar to DeepWalk and node2vec. SDNE simply uses
the rows of the adjacency matrix and combines the autoencoder objective with the Laplacian eigenmaps
objective.
However, the autoencoder objective depends on the input vector, which
contains information about ’s local graph neighbourhood. This dependency
allows SDNE and DNGR to incorporate structural information about a node’s
local neighbourhood directly into the encoder as a form of regularization, which
is not possible for the direct encoding approaches. However, despite this
improvement, the autoencoder approaches still suffer from some serious
limitations. Most prominently, the input dimension to the autoencoder is fixed at
|V|, which can be extremely costly and even intractable for very large graphs. In
addition, the structure and size of the autoencoder is fixed, so SDNE and DNGR
cannot cope with evolving graphs, nor can they generalize across graphs.
Fig. 15.3 Illustration of the neighbourhood aggregation methods. To generate the embedding for a node,
these methods first collect the node’s k-hop neighbourhood. In the next step, these methods aggregate the
attributes of node’s neighbours, using neural network aggregators. This aggregated neighbourhood
information is used to generate an embedding, which is then fed to the decoder
neighbourhood and, unlike the direct encoding approaches, these parameters are
shared across nodes. The same aggregation function and weight matrices are
used to generate embeddings for all nodes, and only the input node attributes and
neighbourhood structure change depending on which node is being embedded.
This parameter sharing increases efficiency, provides regularization, and allows
this approach to be used to generate embeddings for nodes that were not
observed during training.
GraphSAGE, column networks, and the various GCN approaches all follow
this algorithm but differ primarily in how the aggregation and vector
combination are performed. GraphSAGE uses concatenation and permits general
aggregation functions; the authors experiment with using the element-wise
mean, a max-pooling neural network and LSTMs [17] as aggregators, and they
found that the more complex aggregators, especially the max-pooling neural
network, gave significant gains. GCNs and column networks use a weighted sum
and a (weighted) element-wise mean.
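A bare-bones sketch of one such aggregation round with an element-wise mean aggregator is shown below; the weight matrices are assumed to be given, whereas a real model would learn them by backpropagation.

```python
import numpy as np
import networkx as nx

def mean_aggregation_layer(G, X, W_self, W_neigh):
    """One neighbourhood-aggregation step: combine a node's own features
    with the element-wise mean of its neighbours' features, using weight
    matrices shared by all nodes, followed by a ReLU non-linearity."""
    H = {}
    for v in G:
        neigh = [X[u] for u in G[v]]
        agg = np.mean(neigh, axis=0) if neigh else np.zeros_like(X[v])
        H[v] = np.maximum(0.0, W_self @ X[v] + W_neigh @ agg)
    return H

G = nx.karate_club_graph()
X = {v: np.random.rand(8) for v in G}                 # toy node attributes
W_self, W_neigh = np.random.rand(4, 8), np.random.rand(4, 8)
H = mean_aggregation_layer(G, X, W_self, W_neigh)     # 4-dimensional outputs
```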
Column networks also add an additional “interpolation” term, setting
(15.16)
where is an interpolation weight computed as a non-linear function of
and . This interpolation term allows the model to retain local information
trainable parameter vector. We can then compute the cross-entropy loss between
these predicted class probabilities and the true labels:
(15.17)
The gradient computed according to Eq. 15.17 can then be backpropagated
through the encoder to optimize its parameters. This task-specific supervision
can completely replace the reconstruction loss computed using the decoder, or it
can be included along with the decoder loss.
15.5 Multi-modal Graphs
The previous sections have all focused on simple, undirected graphs. However
many real-world graphs have complex multi-modal, or multi-layer, structures
(e.g., heterogeneous node and edge types). In this section we will look at
strategies that cope with this heterogeneity.
(15.18)
where indexes a particular edge type and is a learned parameter specific to
edges of type . The matrix, , in Eq. 15.18 can be regularized in various
ways, which can be especially useful when there are a large number of edge
types, as in the case for embedding knowledge graphs. Indeed, the literature on
knowledge-graph completion, where the goal is to predict missing relations in
knowledge graphs, contains many related techniques for decoding a large
number of edge types.
Recently, [10] also proposed a strategy for sampling random walks from
heterogeneous graphs, where the random walks are restricted to only transition
between particular types of nodes. This approach allows many of the methods in
Sect. 15.1.3 to be applied on heterogeneous graphs and is complementary to the
idea of including type-specific encoders and decoders.
(15.19)
where denotes the usual embedding loss for that node, denotes the
regularization strength, and and denote ’s embeddings in the two
(15.21)
where the one-hot indicator vector corresponds to the node’s row/column in the
Laplacian. The paper shows that these vectors implicitly relate to topological
quantities, such as the node’s degree and the number of k-cycles the node is
involved in. With a proper choice of scale, s, GraphWave is able to effectively
capture structural information about a node’s role in a graph.
Node embeddings find applications in visualization, clustering, node
classification, link prediction and pattern discovery.
(15.22)
Reference [8] constructs intermediate embeddings corresponding to
edges,
(15.23)
These edge embeddings are then aggregated to form the node embeddings:
(15.24)
Once these embeddings are computed, a simple element-wise sum is used to combine
the node embeddings into an embedding for a subgraph.
iteration of the GNN algorithm, nodes accumulate inputs from their neighbours
using simple neural network layers
(15.25)
where and are trainable parameters and is a non-linearity.
References
1. Ahmed, Amr, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J. Smola.
2013. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international
conference on World Wide Web, 37–48. ACM.
2. Bronstein, Michael, M., Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017.
Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4): 18–
42.
3. Bruna, Joan, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally
connected networks on graphs. arXiv:1312.6203.
4. Cao, Shaosheng, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global
structural information. In Proceedings of the 24th ACM international on conference on information and
knowledge management, 891–900. ACM.
5. Cao, Shaosheng, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph
representations. In AAAI, 1145–1152.
6. Chang, Shiyu, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. 2015.
Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD
international conference on knowledge discovery and data mining, 119–128. ACM.
7. Chen, Haochen, Bryan Perozzi, Yifan Hu, and Steven Skiena. 2017. Harp: Hierarchical representation
learning for networks. arXiv:1706.07845.
8. Dai, Hanjun, Bo Dai, and Le Song. 2016. Discriminative embeddings of latent variable models for
structured data. In International conference on machine learning, 2702–2711.
9. Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks
on graphs with fast localized spectral filtering. In Advances in neural information processing systems,
3844–3852.
10. Dong, Yuxiao, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation
learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international
conference on knowledge discovery and data mining, 135–144. ACM.
11. Donnat, Claire, Marinka Zitnik, David Hallac, and Jure Leskovec. 2017. Spectral graph wavelets for
structural role similarity in networks. arXiv:1710.10321.
12. Duvenaud, David K., Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán
Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular
fingerprints. In Advances in neural information processing systems, 2224–2232.
13. Grover, Aditya, and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data
mining, 855–864. ACM.
14. Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large
graphs. In Advances in neural information processing systems, 1025–1035.
15. Hamilton, William L., Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs:
Methods and applications. arXiv:1709.05584.
16. Hinton, Geoffrey E., and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with
neural networks. Science 313 (5786): 504–507.
17. Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9 (8):
1735–1780.
18. Kipf, Thomas N., and Max Welling. 2016. Semi-supervised classification with graph convolutional
networks. arXiv:1609.02907.
19. Kipf, Thomas N., and Max Welling. 2016. Variational graph autoencoders. arXiv:1611.07308.
20. Li, Yujia, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural
networks. arXiv:1511.05493.
21. Nickel, Maximilian, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of
relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1): 11–33.
22. Ou, Mingdong, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric transitivity
preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on
knowledge discovery and data mining, 1105–1114. ACM.
23. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social
representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge
discovery and data mining, 701–710. ACM.
24. Pham, Trang, Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. 2017. Column networks for
collective classification. In AAAI, 2485–2491.
25. Ribeiro, Leonardo FR., Pedro HP. Saverese, and Daniel R. Figueiredo. 2017. struc2vec: Learning node
representations from structural identity. In Proceedings of the 23rd ACM SIGKDD international
conference on knowledge discovery and data mining, 385–394. ACM.
26. Scarselli, Franco, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009.
The graph neural network model. IEEE Transactions on Neural Networks 20 (1): 61–80.
27. Schlichtkrull, Michael, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max
Welling. 2017. Modeling relational data with graph convolutional networks. arXiv:1703.06103.
28. Tang, Jian, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-
scale information network embedding. In Proceedings of the 24th international conference on World
Wide Web, 1067–1077. (International World Wide Web Conferences Steering Committee).
29. van den Berg, Rianne, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix
completion. Statistics 1050: 7.
30. Wang, Daixin, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings
of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1225–
1234. ACM.
31. Zitnik, Marinka, and Jure Leskovec. 2017. Predicting multicellular function through multi-layer tissue
networks. Bioinformatics 33 (14): i190–i198.
Index
A
Absolute spam mass
Absolute status
Acquaintance networks
Action
Active node
Adjacency-based proximity measure
Adjacency list
Adjacency matrix
Affiliation network
African-American
Age
Agglomerative graph partitioning methods
Aggregated neighbourhood vector
Aggregation function
Alfréd Rényi
Ally request
Amazon.com
Ambassador node
Anchor
Approximate balanced graph
Approximate betweenness centrality
Approximation algorithm
ARPANET
Articulation vertex
Artificial sink
Asynchronous IO
Atomic propagation
Authorities
Authority score
Authority Update rule
Autoencoder objective
Autonomous system
Average clustering coefficient
Average degeneracy
Average degree
Average distance
Average neighbourhood size
Average out-degree estimate
Average path length
B
Backtracking
Backward BFS traversal
Backward burning probability
Backward direction
Backward-forward step
Bacon number
Balance heuristic
Balance theorem
Balance-based reasoning
Balanced
Balanced dataset
Balanced graph
Balanced triad
BalanceDet
BalanceLrn
Ball of radius
Ballot-blind prediction
Barrel
B-ary tree
Base Set
Basic PageRank Update rule
Basic reproductive number
Basis set
Batch-mode crawler
Battle of the Water Sensor Networks
Bayesian inference method
Behaviour
Béla Bollobás
Benefit-cost greedy algorithm
Bernoulli distribution
Betweenness
BFS algorithm
BFS traversals
BigFiles
Bilingual
Bimodal
Binary action vector
Binary classification label
Binary evaluation vectors
Binary tree structure
Binomial distribution
Bipartite graph
Bipolar
BitTorrent
Blog
Blogspace
Blue chip stock holders
Bollobás configuration model
Boosting nodes
Boston
Boston random group
Bowtie
Bradford law
Branching process
Breadth first searches
Breadth-first search
Bridge edge
Brilliant-but-cruel hypothesis
B-tree index
Burst of activity
C
California
Cambridge
Cascade
Cascade capacity
Cascading failure
Caucasian
CELF optimization
CELF++
Center-surround convolutional kernel
Chain lengths
Chord
Chord software
Classification accuracy
Cloning
Cloning step
Cluster
Clustering
Clustering coefficient
Co-citation
Collaboration graph
Collaborative filtering systems
Collection analysis module
Collective action
Columbia small world study
Columbia University
Column networks
Comments
Common knowledge
Community
Community discovery
Community Guided Attachment
Complete cascade
Complete crawl
Complete graph
Completely connected
Complex systems
Component
Component size distribution
Computed product-average star rating
Conceptual distance
Concurrency
Conductance
Configuration
Conformity hypothesis
Connected
Connected component
Connected components
Connected undirected graph
Connectivity
Consistent hashing
Constrained triad dynamics
Constructive intermediate embedding
Contact network
Contagion
Contagion probability
Contagion threshold
Content Addressable Network
Content creation
Content similarity
Contextualized link
Continuous vector representation
Convolutional
Convolutional coarsening layer
Convolutional encoder
Convolutional molecular fingerprinting
Coordination game
Copying model
Cost-Effective Forward selection algorithm
Cost-Effective Lazy Forward
Cost-Effective Lazy Forward selection algorithm
CouchSurfing.com
Crawl and stop
Crawl and stop with threshold
Crawl control
Crawler
Crawler revisit frequency
Crawling
Crestline
Critical number
Cross-entropy loss
Cross-validation approach
Cultural fad
Cut
Cut clustering algorithm
Cycle
D
Dampened trust score
Dark-web
Datastore
Decentralized algorithm
Decentralized search
Decision-making
Decoder
Decoder mapping
Decoder proximity value
Deep learning
Deep Neural Graph Representations
DeepWalk
Degeneracy
Degree
Degree assortativity
Degree-based navigation
Degree centrality
Degree discount
Degree distribution
Degree of separation
DELICIOUS
Dendrogram
Dense neural network layer
Densification power-law
Dependency
Deterministic Kronecker graph
Diameter
Differential status
Diffusion
Diffusion of norm
Diffusion-based algorithm
Dimensionality reduction
Dip
Direct encoding
Direct propagation
Directed acyclic graph
Directed graph
Directed multigraph
Directed self-looped graph
Directed unweighted graph
Directed weighted graph
Disconnected graph
Disconnected undirected graph
Dispersion
Distance
Distance centrality
Distance-Dependent Kronecker graphs
Distance-dependent Kronecker operator
Distance distribution
Distance matrix
Distributed greedy algorithm
Distrust propagations
Divisive graph partitioning methods
DNS cache
DNS lookup
DocID
DocIndex
Document stream
Dodds
Dose-response model
d-regular graph
DumpLexicon
Dynamic monopoly
Dynamic network
Dynamic programming
Dynamic routing table
Dynamo
E
EachMovie
Early adopters
Edge attributes
Edge destination selection process
Edge embedding
Edge initiation process
Edge list
Edge reciprocation
Edges
Edge sign prediction
EDonkey2000
Effective diameter
Effective number of vertices
Egalitarian
Ego
Ego-networks
Eigen exponent
Eigenvalue propagation
Element-wise max-pooling
Element-wise mean
Embeddedness
Embedding lookup
Embedding space
Encoder
Encoder-decoder
Encoder mapping
Enhanced-clustering cache replacement
EPANET simulator
Epinions
Equilibrium
Erdös number
Essembly.com
Evaluation function
Evaluation-similarity
Evolutionary model
Exact betweenness centrality
Expansion
Expectation Maximisation
Expected penalty
Expected penalty reduction
Expected profit lift
Expected value navigation
Exponent
Exponential attenuation
Exponential distribution
Exponential tail
Exposure
F
Facebook
Facebook friends
Facebook user
Faction membership
Factor
FastTrack
Fault-tolerance
Feature vector
Finger table
First-order graph proximity
Fixed deterministic distance measure
FLICKR
Foe
Folded graph
Forest Fire
Forest Fire model
Forward BFS traversal
Forward burning probability
Forward direction
Four degrees of separation
Freenet
Frequency
Frequency ratio
Frequency table
Freshness
Friend
Friend request
Friends-of-friends
Fully Bayesian approach
G
Gated recurrent unit
General threshold model
Generating functions
Generative baseline
Generative surprise
Geographic distance
Geographical proximity
Giant component
Girvan-Newman algorithm
Global cascade
Global information diffusion
Global inverted file
Global profit lift
Global rounding
Gnutella
Goodness
Goodness-of-fit
Google
"Go with the winners" algorithm
Graph
Graph coarsening layer
Graph coarsening procedure
Graph convolutional networks
Graph Factorization algorithm
Graph Fourier transform
Graph neural networks
Graph structure
GraphSAGE algorithm
GraphWave
GraRep
Greedy search
Grid
Group-induced model
Group structure
Growth power-law
H
Hadoop
Half-edges
Hand-engineered heuristic
HARP
Hash-based organization
Hash-bucket
Hash distribution policy
Hashing mapping
Hashtags
Heat kernel
Helpfulness
Helpfulness evaluation
Helpfulness ratio
Helpfulness vote
Hierarchical clustering
Hierarchical distance
Hierarchical model
Hierarchical softmax
High-speed streaming
Hill-climbing approach
Hill climbing search
Hit
Hit list
Hive
Hollywood
Homophily
Honey pot
Hop distribution
HOPE
Hop-plot exponent
Hops-to-live limit
Hops-to-live value
Hot pages
Hub
Hub-authority update
Hub score
Hub Update rule
Human wayfinding
Hybrid hashed-log organization
HyperANF algorithm
HyperLogLog counter
I
Ideal chain lengths frequency distribution
Identifier
Identifier circle
Ignorant trust function
In
Inactive node
Incentive compatible mechanism
Income stratification
Incremental function
In-degree
In-degree distribution
In-degree heuristic
Independent cascade model
Indexer
Index sequential access mode
Individual-bias hypothesis
Indivisible
Infected
Infinite grid
Infinite-inventory
Infinite line
Infinite paths
Infinite-state automaton
Influence
Influence maximization problem
Influence weights
Influential
Information linkage graph
Initiator graph
Initiator matrix
In-links
Inner-product decoder
Innovation
In-place update
Instance matrix
Instant messenger
Interaction
Inter-cluster cut
Inter-cluster weight
Interest Driven
Intermediaries
Internet
Internet Protocol
Intra-cluster cut
Intrinsic value
Inverted index
Inverted list
Inverted PageRank
Irrelevant event
Isolated vertex
Iterative propagation
J
Joint positive endorsement
K
Kademlia
Kansas
Kansas study
KaZaA
k-core
Kernighan-Lin algorithm
Kevin Bacon
Key algorithm
k-hop neighbourhood
KL-divergence metric
Kleinberg model
k-regular graph
Kronecker graph
Kronecker graph product
Kronecker-like multiplication
Kronecker product
KRONEM algorithm
KRONFIT
L
Laplacian eigenmaps objective
Lattice distance
Lattice points
Lazy evaluation
Lazy replication
LDAG algorithm
Least recently used cache
Leave-one-out cross-validation
Leaves
Like
Likelihood ratio test
LINE method
Linear threshold model
LINKEDIN
Link prediction
Link probability
Link probability measure
Links
Links database
LiveJournal
LiveJournal population density
LiveJournal social network
Local bridge
Local contacts
Local inverted file
Local rounding
Local triad dynamics
Location Driven
Login correlation
Logistic regression
Logistic regression classifier
Log-log plot
Log-structured file
Log-structured organization
Long-range contacts
Long tail
LOOK AHEAD OPTIMIZATION
Lookup table
Los Angeles
Lotka distribution
Low-dimensional embedding
Low neighbour growth
LSTM
M
Machine learning
Macroscopic structure
Mailing lists
Majority rounding
Marginal gain
Marketing action
Marketing plan
Markov chain
Massachusetts
Mass-based spam detection algorithm
Matching algorithm
Matrix-factorization
Maximal subgraph
Maximum likelihood estimation
Maximum likelihood estimation technique
Maximum number of edges
Max-likelihood attrition rate
Max-pooling neural network
Mean-squared-error loss
Message-forwarding experiment
Message funneling
Message passing
Metropolis sampling algorithm
Metropolized Gibbs sampling approach
Microsoft Messenger instant-messaging system
Minimum cut
Minimum cut clustering
Minimum cut tree
Minimum description length approach
Missing past
Mobile call graph
Model A
Model B
Model C
Model D
Modularity
Modularity matrix
Modularity penalty
Molecular graph representation
Monitored node
Monotone threshold function
Monte Carlo switching steps
M-step trust function
Muhamad
Multicore breadth first search
Multifractal network generator
Multigraph
Multi-objective problem
Multiple ambassadors
Multiple edge
Multiplex propagation
N
Naive algorithm
Natural greedy hill-climbing strategy
Natural self-diminishing property
Natural-world graph
Navigable
Navigation agents
Navigation algorithm
Nebraska
Nebraska random group
Nebraska study
Negative attitude
Negative opinion
Negative sampling
Negative spam mass
Neighbourhood aggregation algorithm
Neighbourhood aggregation method
Neighbourhood function
Neighbourhood graph
Neighbourhood information
Neighbourhood overlap
Neighbours
Neighbour set
Nemesis request
Network
Network value
Neural network layer
New York
Newman-Zipf
Node arrival process
Node classification
Node embedding
Node embedding approach
Node-independent
Node-independent path
Nodes
Node2vec
Node2vec optimization algorithm
Noisy Stochastic Kronecker graph model
Non-dominated solution
Non-navigable graph
Non-searchable
Non-searchable graph
Non-simple path
Non-unique friends-of-friends
Normalizing factor
NP-hard
Null model
O
Occupation similarity
Ohio
OhmNet
Omaha
-ball
One-hop neighbourhood
One-hot indicator vector
One-step distrust
Online contaminant monitoring system
Online social applications
Optimal marketing plan
Oracle function
Ordered degree sequence
Ordered trust property
Ordinary influencers
Organizational hierarchy
Orphan
Out
Outbreak detection
Outbreak detection problem
Out-degree
Out-degree distribution
Out-degree exponent
Out-degree heuristic
Out-links
Overloading
Overnet
P
Page addition/insertion
Page change frequency
PageRank
PageRank contribution
PageRank threshold
PageRank with random jump distribution
Page repository
Pages
Pairwise decoder
Pairwise orderedness
Pairwise proximity measure
Paradise
Parameter matrix
Parent set
Pareto law
Pareto-optimal
Pareto-optimal solution
Partial cascade
Partial crawl
Participation rates
Pastry
Path
Path length
Pattern discovery
Paul Erdös
Payoff
Peer-To-Peer overlay networks
Penalty
Penalty reduction
Penalty reduction function
Permutation model
Persistence
Personal threshold
Personal threshold rule
Phantom edges
Phantom nodes
PHITS algorithm
Physical page organization
Pluralistic ignorance
Poisson distribution
Poisson's law
Polling game
Polylogarithmic
Polysemy
Popularity Driven
Positive attitude
Positive opinion
Positivity
Power law
Power law degree distribution
Power-law distribution
Power law random graph models
Predecessor pointer
Prediction error
Preferential attachment
Preferential attachment model
Pre-processing step
Principle of Repeated Improvement
Probability of persistence
Probit slope parameter
Product average
Professional ties
Propagated distrust
Proper social network
Proportional refresh policy
Protein-protein interaction graph
Proxy requests
Pseudo-unique random number
P-value
Q
Quality-only straw-man hypothesis
Quantitative collaborative filtering algorithm
Quasi-stationary dynamic state
Query engine
R
Random access
Random-failure hypothesis
Random graph
Random initial seeders
Random jump vector
Random page access
Random walk
Rank
Rank exponent
Ranking
Rating
Realization matrix
Real-world graph
Real-world systems
Receptive baseline
Receptive surprise
Recommendation network
Recommender systems
Register
Regular directed graph
Regular graph
Reinforcement learning
Relative spam mass
Relevant event
Removed
Representation learning
Representation learning techniques
Request hit ratio
Resolution
Resource-sharing
Reviews
Rich-get-richer phenomenon
R-MAT model
Root Set
Roster
Rounding methods
Routing path
Routing table
S
SALSA algorithm
Scalability
Scalable
Scalarization
Scaled PageRank Update rule
Scale-free
Scale-free model
Scaling
SCC algorithm
Searchable
Searchable graph
Search engine
Searcher
Searching
Search query
Second-order encoder-decoder objective
Second-order graph proximity
Self-edge
Self-loop
Self-looped edge
Self-looped graph
Sensor
Seven degrees of separation
Shadowing
Sharon
Sibling page
Sigmoid function
Sign
Signed-difference analysis
Signed network
Similarity
Similarity-based navigation
Similarity function
SIMPATH
Simple path
Simple Summary Statistics
Single pass
Sink vertex
SIR epidemic model
SIR model
SIRS epidemic model
SIRS model
SIS epidemic model
Six Degrees of Kevin Bacon
Six degrees of separation
Slashdot
Small world
Small world phenomena
Small world phenomenon
Small world property
Small-world acquaintance graph
Small-world experiment
Small-world hypothesis
Small-world models
Small-world structure
SNAP
Social epidemics
Social intelligence
Social network
Social similarity
Social stratification
Sorter
Source vertex
Spam
Spam detection
Spam farms
Spam mass
Spectral analysis
Spectral graph wavelet
Spid
Stack Overflow
Staircase effects
Stanley Milgram
Star rating
Starting population bias
State transition
Static score distribution vector
Status heuristic
Status-based reasoning
StatusDet
StatusLrn
Steepest-ascent hill-climbing search
Stickiness
Stochastic averaging
Stochastic Kronecker graph model
Stochastic process
Storage
Storage nodes
Store Server
Strategic alliance
Straw-man quality-only hypothesis
Streaming access
Strongly Connected Component (SCC)
Strongly connected directed graph
Strong ties
Strong Triadic Closure property
Structural balance
Structural balance property
Structural Deep Network Embeddings
Structural role
Structured P2P networks
Structure index
Struc2vec
Subgraph classification
Submodular
Submodularity
Subscriptions
Successor
Super-spreaders
Supernodes
Superseeders
Supervised learning
Surprise
Susceptible
Susceptible-Infected-Removed cycle
Switching algorithm
Synonymy
Systolic approach
T
Tag-similarity
Tapestry
Targeted immunization
Target node
Target reachability
Technological graph
Temporal graph
Text index
Theory of balance
Theory of status
Theory of structural balance
Theory of triad types
Threshold
Threshold rule
Threshold trust property
Tightly-knit community
Time-expanded network
Time graph
Time-outs
Time-To-Live
Topic
Topic drift
Tracer cards
Traditional information retrieval
Transient contact network
Transpose trust
Triadic closure property
Triggering model
Triggering set
True value
Trust
Trust coupling
Trust dampening
Trust function
Trust only
TrustRank algorithm
Trust splitting
Twitter idiom hashtag
Two-hop adjacency neighbourhood
U
Unbalanced
Unbalanced graph
Undirected component
Undirected graph
Undirected multigraph
Undirected self-looped graph
Undirected unweighted graph
Undirected weighted graph
Uniform distribution policy
Uniform immunization programmes
Uniform random -regular graph
Uniform random jump distribution
Uniform refresh policy
Uniform sampling
Union-Find
Unique friends-of-friends
Unit cost algorithm
Unit-cost greedy algorithm
Unstructured P2P network
Unsupervised learning
Unweighted graph
Urban myth
URL Resolver
URL Server
User evaluation
Utility index
V
Variance-to-mean ratio
VERTEX COVER OPTIMIZATION
Vertices
Viceroy
Viral marketing
Visualization
W
Walk
Water distribution system
Watts
Watts–Strogatz model
WCC algorithm
WeakBalDet
Weakly connected directed graph
Weak structural balance property
Weak ties
Web
Webbiness
Web crawls
Weighted auxiliary graph
Weighted cascade
Weighted graph
Weighted linear combinations
Weighted path count
Who-talks-to-whom graph
Who-transacts-with-whom graph
Wichita
Wikipedia
Wikipedia adminship election
Wikipedia adminship voting network
Wikispeedia
Word-level parallelism
Y
YAHOO ANSWERS
Z
Zachary Karate Club network
Zero-crossing
Zipf distribution
Zipf law