CP5074 - SNA Unit III Notes
Aggregating and reasoning with social network data, Advanced Representations – Extracting
evolution of Web Community from a Series of Web Archive - Detecting Communities in Social
Networks - Evaluating Communities – Core Methods for Community Detection & Mining -
Applications of Community Mining Algorithms - Node Classification in Social Networks.
Smushing:
Smushing is a reasoning task, where we iteratively execute the rules or procedures
that determine equality until no more equivalent instances can be found. The advantage of this
approach (compared to a one-step computation of a similarity measure) is that we can take into
account the learned equalities in subsequent rounds of reasoning.
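The fixed-point character of smushing can be sketched in a few lines of Python. The rules shown (an inverse-functional-property rule keyed on a shared property value, plus owl:sameAs transitivity) are illustrative stand-ins, and the triple layout is a hypothetical simplification:

```python
def ifp_rule(prop):
    """Rule for an inverse functional property (hypothetical triple layout):
    two subjects sharing a value for `prop` denote the same resource."""
    def rule(facts, same_as):
        by_value = {}
        for s, p, o in facts:
            if p == prop:
                by_value.setdefault(o, []).append(s)
        for subjects in by_value.values():
            for i in range(len(subjects)):
                for j in range(i + 1, len(subjects)):
                    yield (subjects[i], subjects[j])
    return rule

def transitivity(facts, same_as):
    """owl:sameAs is transitive: a = b and b = c entail a = c."""
    for a, b in same_as:
        for c, d in same_as:
            if b == c and a != d:
                yield (a, d)

def smush(rules, facts):
    """Iterate the rules to a fixed point; equalities learned in one
    round feed into the next round (the advantage noted above)."""
    same_as = set()
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for a, b in list(rule(facts, same_as)):
                if (a, b) not in same_as and (b, a) not in same_as:
                    same_as.add((a, b))
                    changed = True
    return same_as
```

With an mbox rule, a homepage rule, and transitivity, the loop first learns p1 = p2 and p2 = p3 from shared values, and then p1 = p3 from the learned equalities in a subsequent pass.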
One of the main advantages of RDF over other representation formalisms such as UML is the
possibility to uniquely identify resources (instances, classes, as well as properties). The primary
mechanism for this is the assignment of URIs to resources. Every resource, except blank nodes,
is identified by a URI.
Multiple identifiers can be represented in RDF in two separate ways.
First, one can introduce a separate resource and use the identifiers as URIs for these
resources. Once separate resources are introduced for the same object, the equality of
these resources can be expressed using the owl:sameAs property.
The other alternative is to choose one of the identifiers and use it as a URI.
Many modern identifier schemes such as DOIs have been designed to conform to the URI
specification. It is also a common practice to create URIs within a web domain owned or
controlled by the creator of the metadata description. For example, if the registered domain name
of a public organization is http://www.example.org then a resource with the identifier 123 could
be represented as http://www.example.org/id/123. This satisfies the guidelines for good URIs, in
particular that good URIs should be unique and stable: a good URI is unambiguous, and there is
no mechanism for renaming resources (i.e., for reassigning URIs).
The most well-known formulation of equality was given by Gottfried Wilhelm Leibniz in his
Discourse on Metaphysics. The Leibniz-law is formalized using the logical formula given in
Formula 3.1. The converse of the Leibniz-law is called Indiscernibility of Identicals and written
as Formula 3.2. The two laws taken together (often also called Leibniz-law) are the basis of the
definition of equality in a number of systems.
∀P : (P(x) ↔ P(y)) → x = y (3.1)
∀P : (P(x) ↔ P(y)) ← x = y (3.2)
The reflexive, symmetric and transitive properties of equality follow from these definitions.
Notice that both formulas are second-order due to the quantification over properties. This
quantification is also interesting because it gives the Leibniz-law different interpretations in
open and closed worlds. Namely, in an open world the number of properties is unknown and thus
the Leibniz-law is not useful in practice. In a closed world we can possibly iterate over all
properties to check if two resources are equal.
In OWL there is a limited set of constructs that can lead to (in)equality statements. Functional
and inverse functional properties (IFPs) and, more generally, maximum cardinality restrictions
can lead to the conclusion that two symbols must denote the same resource when the cardinality
restriction could not otherwise be fulfilled.
For example, the foaf:mbox property denoting the email address of a person is inverse-
functional as a mailbox can only belong to a single person.
As another example, consider a hypothetical ex:hasParent property, which has a
maximum cardinality of two. If we state that a single person has three parents (which we
are allowed to state) an OWL reasoner should conclude that at least two of them have to
be the same. Once inferred, the equality of instances can be stated using the owl:sameAs
property. There are also ways to conclude that two symbols do not denote the same
object; this is expressed using the owl:differentFrom property. For example, instances of
disjoint classes must be different from each other.
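The disjoint-classes case can be sketched directly. The class and instance names below are hypothetical, and a real reasoner derives this from owl:disjointWith axioms rather than a hard-coded set:

```python
# Hypothetical instance data: resource -> class
instance_of = {"alice": "Person", "acme": "Organization", "bob": "Person"}

# Hypothetical disjointness axiom: Person owl:disjointWith Organization
disjoint = {("Person", "Organization"), ("Organization", "Person")}

def different_from(instances, disjoint_classes):
    """Instances of disjoint classes must denote different resources."""
    pairs = set()
    items = sorted(instances.items())
    for i, (a, ca) in enumerate(items):
        for b, cb in items[i + 1:]:
            if (ca, cb) in disjoint_classes:
                pairs.add((a, b))  # a owl:differentFrom b
    return pairs
```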
3.1.4 Reasoning with instance equality
Reasoning is the inference of new statements (facts) that necessarily follow from the set of
known statements.
Description Logic reasoners are designed to support the primary tasks of classification and
consistency checking of ontologies.
Unfortunately, this kind of reasoning is very inefficient in practice as we need to perform a
consistency check for every pair of instances in the ontology.
Example: an OWL DL reasoner.
A better alternative is to consider rule-based reasoning in which the semantics of RDF(S) can be
completely expressed using a set of inference rules.
A significant part of the OWL semantics can be captured using rules and this contains the part
that is relevant to our task.
Example: the OWLIM reasoner.
Forward chaining means that all consequences of the rules are computed to obtain what is called
a complete materialization of the knowledge base. Typically this is done by repeatedly checking
the prerequisites of the rules and adding their conclusions until no new statements can be
inferred.
The advantage of this method is that queries are fast to execute since all true statements
are readily available.
The disadvantage is that storing the complete materialization often takes exorbitant
amounts of space as even the simplest RDF(S) ontologies result in a large amount of
inferred statements (typically several times the size of the original repository), where
most of the inferred statements are not required for the task at hand. Also, the knowledge
base needs to be updated if a statement is removed, since in that case all other statements
that have been inferred from the removed statements also need to be removed (if they
cannot be inferred in some other way).
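Forward chaining reduces to a fixed-point loop over the rule set. The sketch below uses a single illustrative RDFS-style rule (rdfs:subClassOf transitivity); a real materialization applies the full RDF(S)/OWL rule set:

```python
def forward_chain(facts, rules):
    """Materialize: apply every rule to the fact set until no new
    statements can be inferred (a complete materialization)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(facts) - facts
            if new:
                facts |= new
                changed = True
    return facts

def subclass_transitivity(facts):
    """(A subClassOf B) and (B subClassOf C) entail (A subClassOf C)."""
    inferred = set()
    for s1, p1, o1 in facts:
        for s2, p2, o2 in facts:
            if p1 == p2 == "rdfs:subClassOf" and o1 == s2:
                inferred.add((s1, "rdfs:subClassOf", o2))
    return inferred
```

Even this one rule illustrates the space cost: a subclass chain of length n materializes O(n²) inferred triples.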
With backward chaining, rules are executed “backwards” and on demand, i.e. when a query
needs to be answered. This method checks whether a statement is explicitly stated or whether it
can be inferred from some rule, either because the prerequisites of the rule are explicitly stated
to be true or because they too can be inferred from some other rule(s).
The advantage of a rule-based axiomatization is that the expressivity of the reasoning can
be fine-tuned by removing rules that would only infer knowledge that is irrelevant to our
reasoning task.
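Backward chaining, by contrast, answers one query on demand by recursing over the rules. A minimal sketch with the same hypothetical subclass-transitivity rule:

```python
def provable(goal, facts, depth=0, max_depth=10):
    """Is `goal` explicitly stated, or derivable via subclass transitivity?"""
    if goal in facts:
        return True              # explicitly stated
    if depth >= max_depth:
        return False             # guard against cyclic subclass axioms
    s, p, o = goal
    if p != "rdfs:subClassOf":
        return False
    # try to prove (s subClassOf m) and (m subClassOf o) for some middle m
    for s2, p2, o2 in facts:
        if p2 == "rdfs:subClassOf" and s2 == s and o2 != o:
            if provable((o2, "rdfs:subClassOf", o), facts, depth + 1, max_depth):
                return True
    return False
```

Nothing is materialized: only the statements needed to answer the current query are ever derived.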
In the Flink system, we first used the built-in inference engine of the ontology store. However,
some of the rules required cannot be expressed declaratively and thus we implemented our own
identity reasoner in the Java language. The reasoning is performed using a combination of
forward- and backward chaining in order to balance efficiency and scalability. The rules that are
used to infer the equality of instances are executed iteratively in a forward chaining manner by
querying the repository for the premise(s) and adding the consequents of the rules. The semantics
of the resulting owl:sameAs statements is partly inferred in a backward-chaining manner.
There are three basic variations on what point the identity reasoning is performed.
In the first variation, smushing is carried out while adding data into a repository. This is
the method chosen by Ingenta when implementing a large-scale publication metadata
repository.
The second variation is when the reasoning is performed after the repository has been
filled with data containing potential duplicates. This is the choice we take in the Flink
system.
Lastly (the third variation), reasoning can be performed at query time. This is often the
only choice, such as when querying several repositories and aggregating data dynamically
in an AJAX interface, as with the openacademia application. In this case the
solution is to query each repository for instances that match the query, perform the
duplicate detection on the combined set and present only the purged list to the end-user.
Note that instance unification is “easier” than ontology mapping where one-to-one mappings
are enforced. In particular, in ontology mapping a local choice to map two instances may limit
our future choices and thus we run the risk of ending up in a local minimum. In case of
smushing, we only have to be aware that in cases where we have both positive and negative
rules (entailing owl:sameAs and owl:differentFrom) there is a possibility that we end up in an
inconsistent state (where two resources are both the same and different). This would point to
the inconsistency of our rules, e.g. that we did not consider certain kinds of input data.
---------------------------------------------------------------------------------------------------------------------
3.2 Advanced representations
Additional expressive power can be brought to bear by using logics that go beyond
predicate calculus. For example, temporal logic extends assertions with a temporal
dimension: using temporal logic we can express that a statement holds true at a certain
time or for a time interval. Temporal logic is required to formalize ontology versioning,
which is also called change management.
Extending logic with probabilities is also a natural step in representing our problem more
accurately.
---------------------------------------------------------------------------------------------------------------------
3.3 Extracting Evolution of Web Community from a Series of Web Archive
t1, t2, ..., tn: Time when each archive was crawled. Currently, a month is used as the unit time.
W(tk): The Web archive at time tk.
C(tk): The Web community chart at time tk.
c(tk), d(tk), e(tk), ...: Communities in C(tk).
Emerge: A community c(tk) emerges in C(tk), when c(tk) shares no URLs with any
community in C(tk−1).
Dissolve: A community c(tk−1) in C(tk−1) has dissolved, when c(tk−1) shares no URLs with
any community in C(tk).
Grow and shrink: When c(tk−1) in C(tk−1) shares URLs with only c(tk) in C(tk), and vice
versa, only two changes can occur to c(tk−1). The community grows when new URLs
appear in c(tk), and shrinks when URLs disappear from c(tk−1). When the number of
appearing URLs is greater than the number of disappearing URLs, it grows. In the reverse
case, it shrinks.
Split: A community c(tk−1) may split into some smaller communities. In this case, c(tk−1)
shares URLs with multiple communities in C(tk). Split is caused by disconnections of
URLs in SDG. Split communities may grow and shrink. They may also merge with other
communities.
Merge: When multiple communities (c(tk−1), d(tk−1), ...) share URLs with a single
community e(tk), these communities are merged into e(tk) by connections of their URLs
in SDG. Merged communities may grow and shrink. They may also split before merging.
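The emerge/dissolve/split/merge cases above can be detected by intersecting the URL sets of communities in consecutive charts. A simplified sketch (communities are modeled as plain sets of URLs):

```python
def classify_changes(chart_prev, chart_now):
    """Classify communities in consecutive charts C(t_{k-1}), C(t_k) as
    emerged, dissolved, split, or merged, based on shared URLs."""
    events = []
    for name, urls in chart_now.items():
        overlaps = [p for p, pu in chart_prev.items() if urls & pu]
        if not overlaps:
            events.append(("emerge", name))          # no shared URLs with C(t_{k-1})
        elif len(overlaps) > 1:
            events.append(("merge", name, sorted(overlaps)))
    for name, urls in chart_prev.items():
        overlaps = [c for c, cu in chart_now.items() if urls & cu]
        if not overlaps:
            events.append(("dissolve", name))        # no shared URLs with C(t_k)
        elif len(overlaps) > 1:
            events.append(("split", name, sorted(overlaps)))
    return events
```

Communities that overlap exactly one community in the other chart fall into the grow/shrink case and produce no event here.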
Evolution metrics measure how a particular community c(tk) has evolved. For example, we can
see how much c(tk) has grown, and how many URLs newly appeared in c(tk).
To measure changes of c(tk), we identify the community at time tk−1 that corresponds to c(tk).
This corresponding community, c(tk−1), is defined as the community that shares the most URLs
with c(tk). If multiple communities share the same number of URLs, the one with the largest
number of URLs is selected.
The community at time tk corresponding to c(tk−1) can be identified in the reverse direction.
When this corresponding community is just c(tk), the pair (c(tk−1), c(tk)) is called a main line.
Otherwise, the pair is called a branch line.
The metrics are defined by differences between c(tk) and its corresponding community c(tk−1).
To define metrics, the following attributes are used to represent how many URLs the focused
community obtains or loses.
Growth Rate
The Growth Rate, Rgrow(c(tk−1), c(tk)), represents the increase of URLs per unit time.
It allows us to find the fastest growing or shrinking communities.
The growth rate is defined as follows:
Note that when c(tk−1) does not exist, zero is used as N(c(tk−1)).
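The definition is not reproduced above, so the following sketch assumes the natural reading: with N(c) the number of URLs in community c, Rgrow(c(tk−1), c(tk)) = (N(c(tk)) − N(c(tk−1))) / (tk − tk−1):

```python
def growth_rate(n_prev, n_now, t_prev, t_now):
    """Increase of URLs per unit time (assumed reconstruction of Rgrow).
    n_prev is taken as 0 when the community c(t_{k-1}) does not exist."""
    if n_prev is None:
        n_prev = 0
    return (n_now - n_prev) / (t_now - t_prev)
```

A positive value indicates growth, a negative value shrinkage.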
Stability
Disappearance Rate
Merge Rate
The Merge rate, Rmerge(c(tk−1), c(tk)), is the number of URLs absorbed from other
communities by merging, per unit time.
A higher merge rate means that the community has obtained URLs mainly by merging.
The merge rate is defined as follows:
Split rate
The Split rate, Rsplit(c(tk−1), c(tk)), is the number of URLs split off from c(tk−1) per unit
time.
When the split rate is low, c(tk) is larger than other split communities. Otherwise, c(tk) is
smaller than other split communities.
The split rate is defined as follows:
Longer range metrics (more than one unit time) can be calculated for main lines.
For example, the novelty metrics of a main line (c(ti), c(ti+1), ..., c(tj)) is calculated as
follows:
---------------------------------------------------------------------------------------------------------------------
3.4 Detecting Communities in Social Networks
Definition of Community
“Community” means a subnetwork whose internal edges are denser than the edges
connecting it to the rest of the network.
Local definitions
Local definitions focus on the vertices of the subnetwork under investigation and on its
immediate neighborhood.
Local definitions of community can be further divided into self-referring ones and
comparative ones.
Global definitions
Characterize a subnetwork with respect to the network as a whole.
3.5 Evaluating Communities
Quality Functions
Normalized Cut
The normalized cut of a group of nodes S is the sum of weights of the edges that connect S to the
rest of the graph, normalized by the total edge weight of S and that of the rest of the graph, S̄.
where A denotes the adjacency matrix of the network or graph, with A(i, j) representing the edge
weight or affinity between nodes i and j, and V denotes the vertex or node set of the graph or
network.
Groups with low normalized cut make for good communities, as they are well connected
amongst themselves but are sparsely connected to the rest of the graph.
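Since the equation itself is not reproduced above, the sketch below assumes the standard two-term form, Ncut(S) = cut(S, S̄)/assoc(S, V) + cut(S, S̄)/assoc(S̄, V), with the adjacency stored as a dict of edge weights:

```python
def normalized_cut(A, S, V):
    """Normalized cut of node set S (standard two-term form, assumed here:
    cut(S, S_bar)/assoc(S, V) + cut(S, S_bar)/assoc(S_bar, V))."""
    S = set(S)
    S_bar = set(V) - S
    cut = sum(A.get((i, j), 0) for i in S for j in S_bar)
    assoc_S = sum(A.get((i, j), 0) for i in S for j in V)
    assoc_Sb = sum(A.get((i, j), 0) for i in S_bar for j in V)
    return cut / assoc_S + cut / assoc_Sb
```

For two triangles joined by a single bridge edge, cutting along the bridge gives the small value 2/7, reflecting a good community split.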
Conductance
The Kernighan-Lin (KL) objective looks to minimize the edge cut (or the sum of the inter-cluster
edge weights) under the constraint that all clusters be of the same size.
Modularity
Modularity has recently become quite popular as a way to measure the goodness of a
clustering of a graph.
One of the advantages of modularity is that it is independent of the number of clusters
that the graph is divided into.
The modularity Q for a division of the graph into k clusters {V1, . . . , Vk} is given by:
A subnetwork is a community if the number of edges inside it is larger than the number
expected under modularity’s null model.
Advantage of Modularity
It is independent of the number of clusters
Limitations on Modularity
Modularity values cannot be compared for different networks
Resolution limit - fail to identify communities smaller than a scale
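The modularity formula is not reproduced above; the sketch below assumes the standard definition, Q = Σc [A(Vc, Vc)/A(V, V) − (A(Vc, V)/A(V, V))²], i.e. the intra-cluster edge fraction minus the value expected under the degree-preserving null model:

```python
def modularity(A, clusters, V):
    """Modularity Q of a division into clusters (standard definition,
    assumed): intra-cluster edge weight fraction minus its expectation
    under the degree-preserving null model."""
    total = sum(A.get((i, j), 0) for i in V for j in V)  # equals 2m, undirected
    Q = 0.0
    for Vc in clusters:
        intra = sum(A.get((i, j), 0) for i in Vc for j in Vc)
        deg = sum(A.get((i, j), 0) for i in Vc for j in V)
        Q += intra / total - (deg / total) ** 2
    return Q
```

For the two-triangles-plus-bridge graph split along the bridge, Q = 5/14 ≈ 0.357, a clearly positive value as expected for a good division.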
---------------------------------------------------------------------------------------------------------------------
3.6 Core Methods for Community Detection and Mining
Taxonomy of Community Criteria
The Kernighan-Lin (KL) Algorithm
The algorithm is iterative in nature and starts with an initial bipartition of the graph.
At each iteration, the algorithm searches for a subset of vertices from each part of the
graph such that swapping them will lead to a reduction in the edge cut. The
identification of such subsets is via a greedy procedure. The gain gv of a vertex v is the
reduction in edge-cut if vertex v is moved from its current partition to the other
partition. The KL algorithm repeatedly selects from the larger partition the vertex with
the largest gain and moves it to the other partition; a vertex is not considered for
moving again if it has already been moved in the current iteration. After a vertex has
been moved, the gains for its neighboring vertices will be updated in order to reflect
the new assignment of vertices to partitions.
Each iteration in the original KL algorithm has a complexity of O(|E| log |E|).
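A simplified KL-style pass can be sketched as follows. This is an illustration of the gain-and-move idea only: it recomputes gains naively instead of maintaining the bucket structures behind the stated complexity:

```python
def gain(A, v, part_of):
    """Reduction in edge cut if v moves to the other partition:
    external edge weight minus internal edge weight."""
    internal = external = 0.0
    for (i, j), w in A.items():
        if i == v:
            if part_of[j] == part_of[v]:
                internal += w
            else:
                external += w
    return external - internal

def kl_pass(A, part_of):
    """One simplified KL-style pass: repeatedly move the unlocked vertex
    with the largest gain from the larger side; moved vertices are locked
    for the rest of the pass."""
    locked = set()
    while True:
        sides = {0: [v for v in part_of if part_of[v] == 0],
                 1: [v for v in part_of if part_of[v] == 1]}
        larger = 0 if len(sides[0]) >= len(sides[1]) else 1
        candidates = [v for v in sides[larger] if v not in locked]
        if not candidates:
            break
        best = max(candidates, key=lambda v: gain(A, v, part_of))
        if gain(A, best, part_of) <= 0:
            break                      # no improving move remains
        part_of[best] = 1 - larger
        locked.add(best)
    return part_of
```

Starting from a poor bipartition of two bridged triangles, one pass recovers the natural triangle/triangle split.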
3.6.2 Agglomerative/Divisive Algorithms
Agglomerative Algorithms
Begin with each node in the social network in its own community.
At each step merge communities that are deemed to be sufficiently similar.
Continue until either the desired number of communities is obtained or the
remaining communities are found to be too dissimilar to merge any further.
A parent-child relationship indicates that the communities represented by the child
nodes were agglomerated (or merged) to obtain the community represented by the
parent node.
Divisive Algorithms
Divisive algorithms operate in reverse.
They begin with the entire network as one community.
At each step, choose a certain community and split it into two parts.
A parent-child relationship indicates that the community represented by the parent
node was divided to obtain the communities represented by the child nodes.
Using ideas of edge betweenness.
Edge betweenness measures are defined in a way that edges with high
betweenness scores are more likely to be the edges that connect different
communities.
Shortest path betweenness is one example of an edge betweenness measure.
Other examples include random-walk betweenness and current-flow betweenness.
Shortest path betweenness - shortest paths between nodes that belong to different
communities will be constrained to pass through few inter-community edges.
Random Walk betweenness - the choice of path connecting any two nodes is the
result of a random walk instead of a geodesic.
Current-flow betweenness
- network is virtually transformed into a resistance network where each edge is
replaced by a unit resistance.
- betweenness of each edge is computed as the sum of absolute values of the
currents flowing on it with all possible selections of node pairs.
Detecting communities based on edge betweenness
Edge betweenness is the number of shortest paths between all vertex pairs that run
along the edge.
Disadvantage:
high computational cost
Modularity Optimization
Modularity measures the strength of a community partition by taking into
account the degree distribution.
Greedy agglomerative clustering algorithm for optimizing modularity.
Basic idea - At each stage, groups of vertices are successively merged to form
larger communities such that the modularity of the resulting division of the
network increases after each merge.
At the start, each node in the network is in its own community.
At each step one chooses the two communities whose merger leads to the
biggest increase in the modularity.
Spectral algorithms are among the classic methods for clustering and community
discovery. Spectral methods generally refer to algorithms that assign nodes to communities
based on the eigenvectors of matrices, such as the adjacency matrix of the network itself or other
related matrices. The top k eigenvectors define an embedding of the nodes of the network as
points in a k-dimensional space, and one can subsequently use classical data clustering
techniques such as K-means clustering to derive the final assignment of nodes to clusters.
The main idea behind spectral clustering is that the low-dimensional representation, induced by
the top eigenvectors, exposes the cluster structure in the original graph with greater clarity.
The goal is to cut the given network into pieces so that the number of edges cut is minimized.
One of the basic algorithms is spectral graph bipartitioning.
The Laplacian matrix L of the given network is used.
The Laplacian matrix L of a network is an n × n symmetric matrix, with one row and
column for each vertex.
The Laplacian matrix is defined as L = D − A,
where A is the adjacency matrix,
D is the diagonal degree matrix.
All eigenvalues of L are real and non-negative, and L has a full set of n real and
orthogonal eigenvectors.
In order to minimize the above cut, vertices are partitioned based on the signs of the
eigenvector that corresponds to the second smallest eigenvalue of L.
The main disadvantage of spectral algorithms lies in their computational complexity.
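Spectral bipartitioning as described above is a few lines with NumPy (a dense eigendecomposition, so only suitable for small graphs, which is exactly the complexity caveat just noted):

```python
import numpy as np

def spectral_bipartition(A):
    """Partition nodes by the signs of the Fiedler vector: the eigenvector
    of the Laplacian L = D - A with the second smallest eigenvalue."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # second smallest eigenvalue
    return fiedler >= 0                    # boolean side assignment
```

On two triangles joined by a bridge edge, the signs of the Fiedler vector separate the two triangles.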
3.6.5 Markov Clustering
Network Reduction
Discovering Scientific Collaboration Groups from Social Networks
Mining Communities from Distributed and Dynamic Networks
Network Reduction
Discovering Scientific Collaboration Groups from Social Networks
This section shows how community mining techniques can be applied to the analysis of scientific
collaborations among researchers. Flink is a social network that describes the scientific
collaborations among 681 semantic Web researchers (http://flink.semanticweb.org/).
The network was constructed based on semantic Web technologies and all related semantic
information was automatically extracted from “Web-accessible information sources”, such as
“Web pages, FOAF profiles, email lists, and publication archives”. The weights on the links
measure the degrees of collaboration.
From the perspective of social network analysis, one may be especially interested in such
questions as:
1. among all researchers, which ones are more likely to collaborate with each
other?
2. what are the main reasons that bind them together?
Community mining techniques can be applied to answer these questions.
Mining Communities from Distributed and Dynamic Networks
Many applications involve distributed and dynamically-evolving networks, in which resources
and controls are not only decentralized but also updated frequently. One promising solution is
based on an Autonomy-Oriented Computing (AOC) approach, in which a group of self-
organizing agents are utilized. The agents will rely only on their locally acquired information
about networks. Intelligent Portable Digital Assistants (or iPDAs for short) that people carry
around can form a distributed network, in which their users communicate with each other
through calls or messages.
One useful function of iPDAs would be to find and recommend new friends with common
interests, or potential partners in research or business, to the users. The way to implement it will
be through the following steps:
(1) Based on an iPDA user’s communication traces, selecting individuals who have
frequently contacted or been contacted by the user during a certain period of time;
(2) Taking the selected individuals as the input to an AOC-based algorithm;
(3) Ranking and recommending to the user new persons who might not be included in
the current acquaintance book.
In such a way, people can periodically receive recommendations about friends or partners from
their iPDAs.
---------------------------------------------------------------------------------------------------------------------
3.8 Node Classification in Social Networks
Node Classification - providing a high quality labeling for nodes in a given graph structure
(i.e., social network or simply any network).
A first approach to this problem is to engage experts to provide labels on nodes, based on
additional data about the corresponding individuals and their connections. Or individuals
can be incentivized to provide accurate labels, via financial or other inducements.
The second approach is based on the paradigm of machine learning and classification. In this,
we aim to train a classifier on the nodes that are already labeled, so that we can apply it to
the unlabeled nodes to predict labels for them.
Two important phenomena that can apply in online social networks:
homophily
co-citation regularity
Representing data as a graph
We consider graphs of the form G(V, E,W) from this data, where V is the set of n nodes, E is the
set of edges and W is the edge weight matrix. We also let Y be a set of m labels that can be
applied to nodes of the graph.
Inducing a graph
In some applications, the input may be a set of objects with no explicit link structure, for
instance, a set of images from Flickr. We may choose to induce a graph structure for the objects,
based on the principles of homophily or co-citation regularity: we should link entities which have
similar characteristics (homophily) or which refer to the same objects (co-citation regularity).
Types of Labels
binary: only two possible values are allowed (such as male or female,
positive or negative).
numeric: the label takes a numeric value (such as age, number of views, some range).
categorical: the label may be restricted to a set of specified categories (such as
for interests, occupation).
free-text: users may enter arbitrary text to identify the labels that apply to the node.
A transition matrix T is computed by row normalizing the weight matrix W as:
T = D−1 W
where D is a diagonal matrix D = diag(di) and di = ∑j wij.
The unnormalized graph Laplacian of the graph is defined as L = D − W, and
the normalized graph Laplacian as Lsym = D−1/2 L D−1/2.
If W is symmetric, then both these Laplacians are positive semi-definite matrices.
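These matrices can be computed directly with NumPy (notation as above; the normalized Laplacian is written L_sym here to distinguish it from L):

```python
import numpy as np

def transition_and_laplacians(W):
    """Row-normalized transition matrix T = D^{-1} W, the unnormalized
    Laplacian L = D - W, and the normalized L_sym = D^{-1/2} L D^{-1/2}."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                  # degrees d_i = sum_j w_ij
    T = W / d[:, None]                 # each row of T sums to 1
    L = np.diag(d) - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt
    return T, L, L_sym
```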
Let Yl = {y1, y2, . . . , yl} be the initial labels from the label set Y, on nodes in the set Vl.
The label yi on node vi may be a binary label, a single label or a multi-label.
The algorithm given above presents the Iterative Classification Algorithm (ICA) framework
for classifying nodes in a graph. An initial classifier is trained using Φl and the given
node labels Yl. In the first iteration, the trained classifier is applied to Φu to compute the
new labeling Y(1). For any node vi, some previously unlabeled nodes in the neighborhood
of vi now have labels from Y(1). In the t-th iteration, the procedure builds a new feature
vector Φ(t) based on Φl and Y(t−1), and then applies the classifier to produce new labels Y(t).
Optionally, we may choose to retrain the classifier at each step, over the current set of
labels and features.
If node features are not known, the inference is based only on link features. In such a
case, if a node has no labeled node in its neighborhood, it remains unlabeled in the
first iteration. As the algorithm proceeds, more nodes are labeled. Thus, the total
number of iterations τ should be sufficiently large to at least allow all nodes to receive
labels. One possibility is to run the iteration until “stability” is achieved, that is, until no
label changes in an iteration; however, for arbitrary local classifiers there is no guarantee
that stability will be reached. Instead, we may choose to iterate for a fixed number of
iterations that is considered large enough, or until some large fraction of node labels do
not change in an iteration.
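A minimal ICA sketch using only link features, with a majority-vote local classifier standing in for the trained classifier (a deliberate simplification; newly inferred labels are kept fixed rather than re-predicted each round):

```python
def iterative_classification(neighbors, labels, max_iter=10):
    """ICA sketch over link features only: unlabeled nodes take the
    majority label among their labeled neighbors; labels propagate
    outward over successive iterations."""
    labels = dict(labels)                    # seed labels Y_l
    for _ in range(max_iter):
        changed = False
        for v, nbrs in neighbors.items():
            if v in labels:
                continue                     # already labeled; kept fixed
            votes = {}
            for u in nbrs:
                if u in labels:
                    votes[labels[u]] = votes.get(labels[u], 0) + 1
            if votes:
                labels[v] = max(sorted(votes), key=votes.get)
                changed = True
        if not changed:
            break                            # stability: no label changed
    return labels
```

On a chain 1-2-3 with only node 1 labeled, the label reaches node 2 and then node 3, matching the propagation behavior described above.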
Figure given below shows two steps of local iteration on a simple graph. Here, shaded
nodes are initially labeled. In this example, the first stage labels node X with the
label ‘18’. Based on this new link feature, in the second iteration this label is propagated
to node Y. Additional iterations will propagate the labeling further.
3.8.2 Random Walk based Methods
The idea underlying the random walk methods is as follows: the probability of labeling
a node vi ∈ V with label c ∈ Y is the total probability that a random walk starting at vi
will end at a node labeled c.
The random walk is defined by a transition matrix P, so that the walk proceeds from
node vi to node vj with probability pij, the (i, j)-th entry of P. For this to be well defined,
we require 0 ≤ pij ≤ 1 and ∑j pij = 1. The matrix P also encodes the absorbing states of
the random walk. These are nodes where the state remains the same with probability 1,
so there is zero probability of leaving the node, i.e., if a random walk reaches such a
node, it ends there.
The matrix equation for node classification using random walks can be written as:
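The matrix equation is not reproduced here; a common concrete reading is that the label distribution of an unlabeled node is its probability of being absorbed at each class of labeled (absorbing) nodes. An iterative sketch:

```python
def random_walk_labels(P, seed_labels, n, classes, iters=100):
    """Probability of labeling node i with class c = probability that a
    walk from i is absorbed at a node labeled c. Labeled nodes are
    absorbing states. P is a dict (i, j) -> transition probability."""
    # absorbing (labeled) nodes permanently hold their own label mass
    prob = {i: {c: 0.0 for c in classes} for i in range(n)}
    for i, c in seed_labels.items():
        prob[i][c] = 1.0
    for _ in range(iters):
        for i in range(n):
            if i in seed_labels:
                continue                      # absorbing state: unchanged
            new = {c: 0.0 for c in classes}
            for j in range(n):
                p = P.get((i, j), 0.0)
                if p:
                    for c in classes:
                        new[c] += p * prob[j][c]
            prob[i] = new
    return prob
```

On a path 0-1-2-3 with node 0 labeled "a" and node 3 labeled "b", node 1 is absorbed at "a" with probability 2/3, the classic gambler's-ruin result.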
Various methods based on Random Walks are:
Label Propagation
Graph Regularization
Adsorption
Label Propagation
Graph Regularization
Adsorption
The “adsorption” method is also based on iteratively averaging the labels from
neighbors, in common with the previous algorithms studied.
Adsorption takes as input a directed graph G with weight matrix W. The initial labels
are represented as Y = {y1, y2, . . . , yn} such that yi is the probability distribution over
labels Y if node vi ∈ Vl, and is zero if node vi ∈ Vu.
In order to maintain and propagate the initial labeling, adsorption creates a shadow
vertex ˜vi for each labeled node vi ∈ Vl such that ˜vi has a single incoming edge from vi,
and no outgoing edges. In other words, the shadow vertex is an absorbing state when
we view the algorithm as a random walk. Then, the label distribution yi is moved
from vi to the corresponding shadow vertex ˜vi, so initially vi is treated as unlabeled.
The set of shadow vertices is ˜V = {˜vi | vi ∈ Vl}.
The weight on the edge from a vertex to its shadow is a parameter that can be
adjusted. That is, it can be set so that the random walk has a probability 1 − αi of
transitioning from vertex vi to its shadow ˜vi and terminating. This injection
probability was set to be a constant such as 1/4 for all labeled nodes.
i) Random Walk Formulation
First, A captures the injection probabilities from each node vi: A is the n × n
diagonal matrix A = diag(α1, α2, . . . , αl, 1, . . . , 1) where 1−αi is the (injection)
probability that a random walk currently at vi transitions to the shadow vertex
˜vi and terminates. Hence αi is the probability that the walk continues to a
different neighbor vertex.
A transition matrix T encodes the probability that the walk transitions from vi
to each of its non-shadow neighbors, so T = D−1W as before.
ii) Iterative Formulation
Consider the graph G and weight matrix W as above, but with the directionality
of all edges reversed. At each iteration, the algorithm computes the label
distribution at node vi as a weighted sum of the labels in the neighborhood, and
the node’s initial labeling, the weight being given by αi.
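A sketch of this iterative formulation (simplified: it runs for a fixed iteration count, and the labeled nodes in the example below are fully clamped, i.e. injection probability 1, rather than the 1/4 mentioned above):

```python
def adsorption(neighbors_w, seed_labels, alpha, classes, iters=100):
    """Adsorption, iterative form (sketch): each node's label distribution
    is alpha_i times the weighted average of its neighbors' distributions
    plus (1 - alpha_i) times its initial (injected) labeling."""
    init = {}
    for v in neighbors_w:
        init[v] = {c: 0.0 for c in classes}
        if v in seed_labels:
            init[v][seed_labels[v]] = 1.0
    dist = {v: dict(init[v]) for v in neighbors_w}
    for _ in range(iters):
        new = {}
        for v, nbrs in neighbors_w.items():
            total_w = sum(w for _, w in nbrs) or 1.0
            avg = {c: sum(w * dist[u][c] for u, w in nbrs) / total_w
                   for c in classes}
            a = alpha.get(v, 1.0)     # 1 - a is the injection probability
            new[v] = {c: a * avg[c] + (1 - a) * init[v][c] for c in classes}
        dist = new
    return dist
```

With full clamping of the seeds, the middle node of a path 0-1-2 labeled "a" and "b" at its ends converges to an even split between the two labels.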
3.8.3 Applying Node Classification to Large Social Networks