
CP5074 - SOCIAL NETWORK ANALYSIS

UNIT III – MINING COMMUNITIES

Aggregating and reasoning with social network data, Advanced Representations – Extracting
evolution of Web Community from a Series of Web Archive - Detecting Communities in Social
Networks - Evaluating Communities – Core Methods for Community Detection & Mining -
Applications of Community Mining Algorithms - Node Classification in Social Networks.

3.1 Aggregating and reasoning with social network data


Suppose that we have started out with some data sets in traditional formats (relational
databases, Excel sheets, XML files, etc.). Our first step is to convert them into an RDF-based
syntax, which allows us to store the data in an ontology store and manipulate it with ontology-based
tools. In this process we need to assign identifiers to resources and re-represent our data in terms
of a shared ontology such as FOAF. If our data sets come from external sources, it is often
more natural to preserve their original schema. We can then apply ontology mapping to unify our
data on the schema level by mapping classes (types) and properties from different schemas to a
shared ontology such as FOAF.

The task of aggregation:


We need to find identical resources across the data sets. This is a two-step process.
First, it requires capturing the domain-specific knowledge of when to consider two instances
to be the same. For example, the FOAF ontology itself prescribes ways to infer the equality of
two instances, based on their email address.
Second, we need to introduce domain-specific criteria based on domain-specific
properties. In order to do this, we need to consider the general meaning of equality in
RDF/OWL. For example, determining equality is often treated as applying a threshold value
to a similarity measure, which is a weighted combination of the similarities of certain properties.
This involves computation, so we need a procedural representation of knowledge. Once we have
determined the rules or procedures that determine equality in our domain, we need to carry out
the actual instance unification or smushing.
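As a minimal sketch of the threshold-based approach, the following compares two instances by a weighted combination of per-property string similarities. The properties, weights and threshold are illustrative assumptions, not values prescribed by FOAF:

    from difflib import SequenceMatcher

    WEIGHTS = {"name": 0.5, "mbox": 0.3, "homepage": 0.2}  # assumed weights
    THRESHOLD = 0.85                                       # assumed cut-off

    def prop_sim(a: str, b: str) -> float:
        """Similarity of two property values, in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def same_instance(x: dict, y: dict) -> bool:
        """Weighted combination of property similarities, thresholded."""
        score = sum(w * prop_sim(x.get(p, ""), y.get(p, ""))
                    for p, w in WEIGHTS.items())
        return score >= THRESHOLD

    # Two FOAF-like instances coming from different sources:
    a = {"name": "J. Smith", "mbox": "mailto:jsmith@example.org"}
    b = {"name": "John Smith", "mbox": "mailto:jsmith@example.org"}
    print(same_instance(a, b))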

Smushing:
Smushing is a reasoning task, where we iteratively execute the rules or procedures
that determine equality until no more equivalent instances can be found. The advantage of this
approach (compared to a one-step computation of a similarity measure) is that we can take into
account the learned equalities in subsequent rounds of reasoning.
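A minimal sketch of this fixed-point computation, assuming a single illustrative rule (a shared email address implies equality) and a union-find structure to accumulate the learned equalities:

    from collections import defaultdict

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False                    # equality already known
        parent[rx] = ry
        return True

    def mbox_rule(instances):
        """Yield pairs of instance ids that share an email address."""
        by_mbox = defaultdict(list)
        for iid, props in instances.items():
            if "mbox" in props:
                by_mbox[props["mbox"]].append(iid)
        for ids in by_mbox.values():
            for other in ids[1:]:
                yield ids[0], other

    def smush(instances, rules):
        """Fire the equality rules until no new equivalences are found,
        so equalities learned earlier feed subsequent rounds."""
        changed = True
        while changed:
            changed = False
            for rule in rules:
                for x, y in rule(instances):
                    changed |= union(x, y)
        return parent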

3.1.1. Representing identity

One of the main advantages of RDF over other representation formalisms such as UML is the
ability to uniquely identify resources (instances, classes, as well as properties). The primary
mechanism for this is the assignment of URIs to resources. Every resource, except blank nodes,
is identified by a URI.
Multiple identifiers can be represented in RDF in two separate ways.
 First, one can introduce a separate resource for each identifier and use the identifiers as
URIs for these resources. Once separate resources are introduced for the same object, the
equality of these resources can be expressed using the owl:sameAs property.
 The other alternative is to choose one of the identifiers and use it as a URI.

Many modern identifier schemes such as DOIs have been designed to conform to the URI
specification. It is also common practice to create URIs within a web domain owned or
controlled by the creator of the metadata description. For example, if the registered domain name
of a public organization is http://www.example.org, then a resource with the identifier 123 could
be represented as http://www.example.org/id/123. This satisfies the guidelines for good URIs, in
particular that good URIs should be unique and stable: a good URI is unambiguous, and it is
never reassigned to a different resource (resources are not renamed).

3.1.2. On the notion of equality

The most well-known formulation of equality was given by Gottfried Wilhelm Leibniz in his
Discourse on Metaphysics. The Leibniz law is formalized using the logical formula given in
Formula 3.1. Its converse, called the Indiscernibility of Identicals, is written
as Formula 3.2. The two laws taken together (often also called the Leibniz law) are the basis of the
definition of equality in a number of systems.

∀P : (P(x) ↔ P(y)) → x = y (3.1)
∀P : (P(x) ↔ P(y)) ← x = y (3.2)

The reflexive, symmetric and transitive properties of equality follow from these definitions.
Notice that both formulas are second-order due to the quantification over properties. This
quantification is also interesting because it gives the Leibniz law different interpretations in
open and closed worlds. Namely, in an open world the number of properties is unknown, and thus
the Leibniz law is not useful in practice. In a closed world we can possibly iterate over all
properties to check whether two resources are equal.

3.1.3 Determining equality

In OWL there is a limited set of constructs that can lead to (in)equality statements. Functional
and inverse functional properties (IFPs), and maximum cardinality restrictions in general, can lead
to the conclusion that two symbols must denote the same resource when otherwise the cardinality
restriction could not be fulfilled.
 For example, the foaf:mbox property denoting the email address of a person is inverse-
functional, as a mailbox can only belong to a single person.
 As another example, consider a hypothetical ex:hasParent property, which has a
maximum cardinality of two. If we state that a single person has three parents (which we
are allowed to state), an OWL reasoner should conclude that at least two of them have to be
the same.
Once inferred, the equality of instances can be stated using the owl:sameAs property. There
are more ways to conclude that two symbols do not denote the same object, which is expressed
by the owl:differentFrom property. For example, instances of disjoint classes must be different
from each other.
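A sketch of IFP-based inference with the rdflib library: subjects sharing a foaf:mbox value are linked by owl:sameAs. The input file name is an assumption:

    from collections import defaultdict
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    g.parse("people.rdf")                   # assumed input data

    # Group subjects by mailbox; foaf:mbox is inverse functional.
    by_mbox = defaultdict(list)
    for person, mbox in g.subject_objects(FOAF.mbox):
        by_mbox[mbox].append(person)

    # Two subjects with the same mailbox denote the same person.
    for persons in by_mbox.values():
        for other in persons[1:]:
            g.add((persons[0], OWL.sameAs, other))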

3.1.4 Reasoning with instance equality

Reasoning is the inference of new statements (facts) that necessarily follow from the set of
known statements.

Description Logic versus rule-based reasoners

Description Logic reasoners are designed to support the primary tasks of classification and
consistency checking of ontologies.
Unfortunately, this kind of reasoning is very inefficient in practice, as we would need to perform a
consistency check for every pair of instances in the ontology.
Example: OWL DL reasoners.

A better alternative is to consider rule-based reasoning, in which the semantics of RDF(S) can be
completely expressed using a set of inference rules.
A significant part of the OWL semantics can also be captured using rules, and this part contains
what is relevant to our task.
Example: the OWLIM reasoner.

Forward versus backward chaining

Forward chaining means that all consequences of the rules are computed to obtain what is called
a complete materialization of the knowledge base. Typically this is done by repeatedly checking
the prerequisites of the rules and adding their conclusions until no new statements can be
inferred.

 The advantage of this method is that queries are fast to execute since all true statements
are readily available.
 The disadvantage is that storing the complete materialization often takes exorbitant
amounts of space as even the simplest RDF(S) ontologies result in a large amount of
inferred statements (typically several times the size of the original repository), where
most of the inferred statements are not required for the task at hand. Also, the knowledge
base needs to be updated if a statement is removed, since in that case all other statements
that have been inferred from the removed statements also need to be removed (if they
cannot be inferred in some other way).

With backward chaining, rules are executed “backwards” and on demand, i.e. when a query
needs to be answered. For each statement required, this method checks whether it is explicitly
asserted or whether it can be inferred from some rule, either because the prerequisites of the rule
are explicitly stated to be true or because they in turn can be inferred from some other rule(s).

 The drawback of backward chaining is longer query execution times.

The advantage of a rule-based axiomatization is that the expressivity of the reasoning can
be fine-tuned by removing rules that would only infer knowledge that is irrelevant to our
reasoning task.
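A minimal sketch of forward chaining to a complete materialization, with a single illustrative rule (transitivity of owl:sameAs) standing in for a full RDF(S) rule set:

    def forward_chain(triples, rules):
        """Repeatedly fire rules and add conclusions until no new
        statements can be inferred."""
        kb = set(triples)
        changed = True
        while changed:
            changed = False
            for rule in rules:
                for new in rule(kb):
                    if new not in kb:
                        kb.add(new)
                        changed = True
        return kb

    def sameas_transitivity(kb):
        """x sameAs y and y sameAs z  =>  x sameAs z."""
        for s1, p1, o1 in list(kb):
            if p1 != "owl:sameAs":
                continue
            for s2, p2, o2 in list(kb):
                if p2 == "owl:sameAs" and o1 == s2:
                    yield (s1, "owl:sameAs", o2)

    kb = forward_chain([("a", "owl:sameAs", "b"), ("b", "owl:sameAs", "c")],
                       [sameas_transitivity])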

In the Flink system, we first used the built-in inference engine of the ontology store. However,
some of the rules required cannot be expressed declaratively and thus we implemented our own
identity reasoner in the Java language. The reasoning is performed using a combination of
forward- and backward chaining in order to balance efficiency and scalability. The rules that are
used to infer the equality of instances are executed iteratively in a forward chaining manner by
querying the repository for the premise(s) and adding the consequents of the rules. The semantics
of the resulting owl:sameAs statements is partly inferred in a backward-chaining manner.

The timing of reasoning and the method of representation

There are three basic variations on the point at which the identity reasoning is performed.
 In the first variation, smushing is carried out while adding data into a repository. This is
the method chosen by Ingenta when implementing a large-scale publication metadata
repository.

 The second variation is when the reasoning is performed after the repository has been
filled with data containing potential duplicates. This is the choice we take in the Flink
system.

 Lastly (the third variation), reasoning can be performed at query time. This is often the
only choice, such as when querying several repositories and aggregating data dynamically
in an AJAX interface, as with the openacademia application. In this case the
solution is to query each repository for instances that match the query, perform the
duplicate detection on the combined set, and present only the purged list to the end-user.

3.1.5 Evaluating smushing

Smushing can be considered as either a retrieval problem or a clustering problem.

 In terms of retrieval, we try to achieve maximum precision and recall with respect to the
theoretically correct set of mappings.
 In other words, instance unification or smushing can be considered as an optimization task
where we try to optimize an information-retrieval or clustering-based measure.

Smushing vs. Ontology Mapping

Note that instance unification is “easier” than ontology mapping, where one-to-one mappings
are enforced. In particular, in ontology mapping a local choice to map two instances may limit
our future choices, and thus we run the risk of ending up in a local minimum. In the case of
smushing, we only have to be aware that where we have both positive and negative
rules (entailing owl:sameAs and owl:differentFrom) there is a possibility that we end up in an
inconsistent state (where two resources are both the same and different). This would point to
an inconsistency in our rules, e.g. that we did not consider certain kinds of input data.

---------------------------------------------------------------------------------------------------------------------

3.2 Advanced representations
 Additional expressive power can be brought to bear by using logics that go beyond
predicate calculus. For example, temporal logic extends assertions with a temporal
dimension: using temporal logic we can express that a statement holds true at a certain
time or for a time interval. Temporal logic is required to formalize ontology versioning,
which is also called change management.

 Extending logic with probabilities is also a natural step in representing our problem more
accurately.
---------------------------------------------------------------------------------------------------------------------

3.3 Extracting Evolution of Web Community from a Series of Web Archive

Web Community

A Web community is a collection of Web pages created by individuals or associations
with a common interest in a topic, such as fan pages of a baseball team, or official pages of
computer vendors.

The extraction of Web community evolution utilizes the Web community chart, which is a graph of
communities in which related communities are connected by weighted edges. The main
advantage of the Web community chart is the existence of relevance relations between communities: we
can navigate through related communities, and locate evolution around a particular community.

Notations Used for Web Community

t1, t2, ..., tn: Times at which each archive was crawled. Currently, a month is used as the unit time.
W(tk): The Web archive at time tk.
C(tk): The Web community chart at time tk.
c(tk), d(tk), e(tk), ...: Communities in C(tk).

3.3.1 Types of Changes

 Emerge: A community c(tk) emerges in C(tk) when c(tk) shares no URLs with any
community in C(tk−1).
 Dissolve: A community c(tk−1) in C(tk−1) has dissolved when c(tk−1) shares no URLs with
any community in C(tk).
 Grow and shrink: When c(tk−1) in C(tk−1) shares URLs with only c(tk) in C(tk), and vice
versa, only two changes can occur to c(tk−1). The community grows when new URLs
appear in c(tk), and shrinks when URLs disappear from c(tk−1). When the number of
appeared URLs is greater than the number of disappeared URLs, it grows; in the reverse
case, it shrinks.
 Split: A community c(tk−1) may split into some smaller communities. In this case, c(tk−1)
shares URLs with multiple communities in C(tk). Split is caused by disconnections of
URLs in the SDG. Split communities may grow and shrink. They may also merge with other
communities.
 Merge: When multiple communities (c(tk−1), d(tk−1), ...) share URLs with a single
community e(tk), these communities are merged into e(tk) by connections of their URLs
in the SDG. Merged communities may grow and shrink. They may also split before merging.

3.3.2 Evolution Metrics

Evolution metrics measure how a particular community c(tk) has evolved. For example, we can
know how much c(tk) has grown, and how many URLs newly appeared in c(tk).

To measure changes of c(tk), we identify the community at time tk−1 corresponding to c(tk). This
corresponding community, c(tk−1), is defined as the community that shares the most URLs with
c(tk). If multiple communities share the same number of URLs, the one with the largest number
of URLs is selected.

The community at time tk corresponding to c(tk−1) can be identified in reverse. When this
corresponding community is exactly c(tk), the pair (c(tk−1), c(tk)) is called a main line.
Otherwise, the pair is called a branch line.
The metrics are defined by differences between c(tk) and its corresponding community c(tk−1).
To define the metrics, the following attributes are used to represent how many URLs the focused
community obtains or loses.

 N(c(tk)): the number of URLs in c(tk).
 Nsh(c(tk−1), c(tk)): the number of URLs shared by c(tk−1) and c(tk).
 Ndis(c(tk−1)): the number of URLs disappeared from c(tk−1), i.e. URLs that exist in c(tk−1)
but do not exist in any community in C(tk).
 Nsp(c(tk−1), c(tk)): the number of URLs split from c(tk−1) to communities at tk other than
c(tk).
 Nap(c(tk)): the number of newly appeared URLs in c(tk), i.e. URLs that exist in c(tk) but do
not exist in any community in C(tk−1).
 Nmg(c(tk−1), c(tk)): the number of URLs merged into c(tk) from communities at tk−1 other
than c(tk−1).
The Evolution Metrics are listed below:
 Growth rate
 Stability
 Disappearance rate
 Merge rate
 Split rate

Growth Rate
 The Growth Rate, Rgrow(c(tk−1), c(tk)), represents the increase of URLs per unit time.
 It allows us to find the most growing or shrinking communities.
 The growth rate is defined as:

   Rgrow(c(tk−1), c(tk)) = (N(c(tk)) − N(c(tk−1))) / (tk − tk−1)

 Note that when c(tk−1) does not exist, zero is used as N(c(tk−1)).

Stability

 The Stability, Rstability(c(tk−1), c(tk)), represents the amount of disappeared, appeared,
merged and split URLs per unit time.
 When there is no change of URLs, the stability becomes zero.
 Note that c(tk) may not be stable even if the growth rate of c(tk) is zero, because c(tk) may
lose and obtain the same number of URLs.
 A stable community on a topic is the best starting point for finding interesting changes
around the topic.
 The stability is defined as:

   Rstability(c(tk−1), c(tk)) = (Ndis(c(tk−1)) + Nsp(c(tk−1), c(tk)) + Nap(c(tk)) + Nmg(c(tk−1), c(tk))) / (tk − tk−1)

Disappearance Rate

 The Disappearance rate, Rdisappear(c(tk−1), c(tk)), is the number of URLs disappeared
from c(tk−1) per unit time.
 A higher disappearance rate means that the community has lost URLs mainly by disappearance.
 The disappearance rate is defined as:

   Rdisappear(c(tk−1), c(tk)) = Ndis(c(tk−1)) / (tk − tk−1)

Merge Rate

 The Merge rate, Rmerge(c(tk−1), c(tk)), is the number of URLs absorbed from other
communities by merging per unit time.
 A higher merge rate means that the community has obtained URLs mainly by merging.
 The merge rate is defined as:

   Rmerge(c(tk−1), c(tk)) = Nmg(c(tk−1), c(tk)) / (tk − tk−1)

Split Rate

 The Split rate, Rsplit(c(tk−1), c(tk)), is the number of URLs split from c(tk−1) per unit
time.
 When the split rate is low, c(tk) is larger than the other split communities. Otherwise, c(tk) is
smaller than the other split communities.
 The split rate is defined as:

   Rsplit(c(tk−1), c(tk)) = Nsp(c(tk−1), c(tk)) / (tk − tk−1)

Longer Range Metrics

 Longer range metrics (over more than one unit time) can be calculated for main lines.
 For example, the novelty metric of a main line (c(ti), c(ti+1), ..., c(tj)) is calculated by
accumulating the corresponding one-step attributes (such as the numbers of newly appeared
URLs) over the interval from ti to tj.
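The one-step metrics can be computed directly from URL sets, as in the following sketch; it assumes one unit of time between consecutive archives, and prev_chart/curr_chart are the sets of all URLs appearing anywhere in C(tk−1) and C(tk):

    def evolution_metrics(prev, curr, prev_chart, curr_chart, dt=1):
        """prev, curr: URL sets of c(tk-1) and c(tk)."""
        disappeared = {u for u in prev - curr if u not in curr_chart}  # Ndis
        split = (prev - curr) - disappeared                            # Nsp
        appeared = {u for u in curr - prev if u not in prev_chart}     # Nap
        merged = (curr - prev) - appeared                              # Nmg
        return {
            "growth": (len(curr) - len(prev)) / dt,
            "stability": (len(disappeared) + len(split)
                          + len(appeared) + len(merged)) / dt,
            "disappearance": len(disappeared) / dt,
            "merge": len(merged) / dt,
            "split": len(split) / dt,
        }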

3.3.3 Web Archives and Graphs

 Web archiving is the process of collecting portions of the Web to ensure the
information is preserved in an archive.
 Web crawlers are used for automated capture due to the massive size and amount of
information on the Web.
 From each archive, a Web graph is built with URLs and links by extracting anchors
from all pages in the archive.
 The graph includes not only URLs inside the archive, but also outside URLs pointed
to by inside URLs.

---------------------------------------------------------------------------------------------------------------------

3.4 Detecting Communities in Social Networks
Definition of Community
“Community” means a subnetwork whose internal edges are denser than the edges
connecting it to the rest of the network.

Three categories of community definitions:


 Local definitions
 Global definitions
 Definitions based on vertex similarity

Local definitions
 Focused on the vertices of the subnetwork under investigation and on its immediate
neighborhood.
 Local definitions of community can be further divided into self-referring ones and
comparative ones.

Global definitions
 Characterize a subnetwork with respect to the network as a whole.

Definitions based on vertex similarity


 Communities are groups of vertices which are similar to each other

Why Community Detection?


 Communities can be used for information recommendation, because members of a
community often have similar tastes and preferences. Membership of detected
communities can be the basis of collaborative filtering.
 Communities help us understand the structure of a given social network: communities
are regarded as components of the network, and they clarify its functions and properties.
 Communities play important roles when we visualize large-scale social networks:
relations between communities clarify the processes of information sharing and
information diffusion, and they may give us insights into the future growth of the network.
 Communities lend themselves to actionable pattern discovery.
 Identification of influential nodes or sub-communities.
 Communities in a citation network might represent related papers on a single topic.
 Communities on the Web might represent pages on related topics.
 A community can be considered as a summary of the whole network, and is thus easier
to visualize and understand.
 Sometimes, communities can reveal network properties without releasing individuals'
private information.
---------------------------------------------------------------------------------------------------------------------

3.5 Evaluating Communities

Quality Functions

It is necessary to establish which partitions exhibit a real community structure, since there
are many ways of partitioning a given network into communities. Therefore, a quality function
for evaluating how good a partition is, is needed. Commonly used quality functions are:
 Normalized Cut
 Conductance
 Modularity

Normalized Cut
The normalized cut of a group of nodes S is the sum of the weights of the edges that connect S to
the rest of the graph, normalized by the total edge weight of S and by that of the rest of the graph S̄.

The normalized cut of a group of vertices S ⊂ V is defined as:

   NCut(S) = cut(S, S̄) / assoc(S, V) + cut(S, S̄) / assoc(S̄, V)

where cut(S, S̄) = Σ i∈S, j∈S̄ A(i, j), assoc(S, V) = Σ i∈S, j∈V A(i, j), A denotes the adjacency
matrix of the network or graph, with A(i, j) representing the edge weight or affinity between
nodes i and j, and V denotes the vertex or node set of the graph or network.

Groups with low normalized cut make for good communities, as they are well connected
amongst themselves but are sparsely connected to the rest of the graph.

Conductance

The conductance of a group of vertices S ⊂ V is closely related and is defined as:

   Conductance(S) = cut(S, S̄) / min(assoc(S, V), assoc(S̄, V))

with cut and assoc as above.

The Kernighan-Lin (KL) objective looks to minimize the edge cut (or the sum of the inter-cluster
edge weights) under the constraint that all clusters be of the same size.
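A small numpy sketch of the conductance score defined above, given an adjacency matrix A and a node set S:

    import numpy as np

    def conductance(A: np.ndarray, S: set) -> float:
        mask = np.zeros(A.shape[0], dtype=bool)
        mask[list(S)] = True
        cut = A[mask][:, ~mask].sum()    # weight crossing the boundary
        assoc_S = A[mask].sum()          # total edge weight of S
        assoc_rest = A[~mask].sum()      # total edge weight of the rest
        return cut / min(assoc_S, assoc_rest)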

Modularity
 Modularity has recently become quite popular as a way to measure the goodness of a
clustering of a graph.
 One of the advantages of modularity is that it is independent of the number of clusters
that the graph is divided into.
 The modularity Q for a division of the graph into k clusters {V1, . . . , Vk} is given by:

   Q = Σ i=1..k [ E(Vi)/m − (degree(Vi)/(2m))² ]

where the Vi's are the clusters,
E(Vi) is the number of edges inside cluster Vi,
m is the number of edges in the graph, and
degree(Vi) is the total degree of the cluster Vi.
 Modularity can be rewritten as follows:

   Q = Σ s=1..nm [ ls/m − (ds/(2m))² ]

where nm is the number of communities,
ls is the total number of edges joining vertices of community s, and
ds is the sum of the degrees of the vertices of s.
 Meaning of modularity
A subnetwork is a community if the number of edges inside it is larger than the expected
number under modularity's null model.
 Advantage of modularity
 It is independent of the number of clusters.
 Limitations of modularity
 Modularity values cannot be compared across different networks.
 Resolution limit: it fails to identify communities smaller than a certain scale.
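As a usage sketch, networkx ships a modularity helper; the two-way split of the karate club graph below is an arbitrary illustrative partition:

    import networkx as nx
    from networkx.algorithms.community import modularity

    G = nx.karate_club_graph()
    partition = [set(range(17)), set(range(17, 34))]  # assumed 2-way split
    print(modularity(G, partition))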
---------------------------------------------------------------------------------------------------------------------

3.6 Core Methods for Community Detection and Mining
Taxonomy of Community Criteria

 Criteria vary depending on the task.
 Roughly, community detection methods can be divided into 4 categories (not mutually exclusive):
1. Node-Centric Community
 Each node in a group satisfies certain properties
2. Group-Centric Community
 Consider the connections within a group as a whole. The group has to
satisfy certain properties without zooming into node-level
3. Network-Centric Community
 Partition the whole network into several disjoint sets
4. Hierarchy-Centric Community
 Construct a hierarchical structure of communities

Methods for Community Detection


1. Graph Partitioning (e.g., the Kernighan-Lin algorithm)
2. Hierarchical Clustering (agglomerative/divisive algorithms)
3. Spectral Algorithms
4. Multi-level Graph Partitioning
5. Markov Clustering
6. Other Approaches

3.6.1 The Kernighan-Lin(KL) algorithm


 The KL algorithm is one of the classic graph partitioning algorithms; it optimizes
the KL objective function, i.e. it minimizes the edge cut while keeping the cluster sizes
balanced.

 The algorithm is iterative in nature and starts with an initial bipartition of the graph.
 At each iteration, the algorithm searches for a subset of vertices from each part of the
graph such that swapping them will lead to a reduction in the edge cut. The
identification of such subsets is via a greedy procedure. The gain gv of a vertex v is the
reduction in edge-cut if vertex v is moved from its current partition to the other
partition. The KL algorithm repeatedly selects from the larger partition the vertex with
the largest gain and moves it to the other partition; a vertex is not considered for
moving again if it has already been moved in the current iteration. After a vertex has
been moved, the gains for its neighboring vertices will be updated in order to reflect
the new assignment of vertices to partitions.
 Each iteration in the original KL algorithm has a complexity of O(|E| log |E|).
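A minimal sketch of one KL-style pass as described above: greedily move the best vertex from the larger part, freezing each moved vertex. The unweighted gain and the stopping rule are simplifying assumptions:

    import networkx as nx

    def gain(G, v, side):
        """Reduction in edge cut if v moves to the other partition."""
        internal = sum(1 for u in G[v] if side[u] == side[v])
        external = len(G[v]) - internal
        return external - internal

    def kl_pass(G, side):
        moved = set()
        while True:
            sizes = {p: sum(1 for v in G if side[v] == p) for p in (0, 1)}
            larger = max(sizes, key=sizes.get)
            candidates = [v for v in G
                          if side[v] == larger and v not in moved]
            if not candidates:
                break
            v = max(candidates, key=lambda u: gain(G, u, side))
            if gain(G, v, side) <= 0:   # no improving move left
                break
            side[v] = 1 - side[v]       # move v across the cut
            moved.add(v)                # freeze v for this pass
        return side

    G = nx.barbell_graph(5, 0)                   # two cliques, one bridge
    side = {v: (0 if v < 4 else 1) for v in G}   # deliberately bad split
    side = kl_pass(G, side)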

3.6.2 Agglomerative/Divisive Algorithms

 Both kinds of hierarchical clustering algorithms output a dendrogram.


 A binary tree, where the leaves are nodes of the network and each internal node is a
community.

 Agglomerative Algorithms
 Begin with each node in the social network in its own community.
 At each step merge communities that are deemed to be sufficiently similar.
 Continue until either the desired number of communities is obtained or the
remaining communities are found to be too dissimilar to merge any further.
 A parent-child relationship indicates that the communities represented by the child
nodes were agglomerated (or merged) to obtain the community represented by the
parent node.

 Divisive Algorithms
 Divisive algorithms operate in reverse.
 They begin with the entire network as one community.
 At each step, choose a certain community and split it into two parts.
 A parent-child relationship indicates that the community represented by the parent
node was divided to obtain the communities represented by the child nodes.
 Using ideas of edge betweenness.
 Edge betweenness measures are defined in a way that edges with high
betweenness scores are more likely to be the edges that connect different
communities.
 Shortest path betweenness is one example of an edge betweenness measure.
 Other examples include random-walk betweenness and current-flow betweenness.
 Shortest path betweenness - shortest paths between nodes that belong to different
communities will be constrained to pass through few inter-community edges.
 Random Walk betweenness - the choice of path connecting any two nodes is the
result of random walk instead of geodesic.
 Current-flow betweenness
- The network is virtually transformed into a resistance network where each edge is
replaced by a unit resistance.
- The betweenness of each edge is computed as the sum of the absolute values of the
currents flowing on it, over all possible selections of node pairs.

 The general form of the above-mentioned betweenness algorithms is as follows:

1) Calculate the betweenness score for all edges in the network using any measure.
2) Find the edge with the highest score and remove it from the network.
3) Recalculate betweenness for all remaining edges.
4) Repeat from step 2.

This procedure is continued until a sufficiently small number of communities
is obtained; a hierarchical nesting of the communities is obtained as a
natural by-product.
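A sketch of this procedure using shortest path betweenness via networkx; stopping at a target number of connected components is an assumed criterion:

    import networkx as nx

    def divisive_betweenness(G, target=2):
        H = G.copy()
        while nx.number_connected_components(H) < target:
            scores = nx.edge_betweenness_centrality(H)  # step 1 (and 3)
            top_edge = max(scores, key=scores.get)      # step 2
            H.remove_edge(*top_edge)                    # step 4: repeat
        return list(nx.connected_components(H))

    G = nx.karate_club_graph()
    print(divisive_betweenness(G, target=2))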

 Detecting communities based on edge betweenness
Edge betweenness is the number of shortest paths between all vertex pairs that run
along the edge.

Disadvantage:
 high computational cost

 Modularity Optimization
 Modularity measures the strength of a community partition by taking into
account the degree distribution.
 Greedy agglomerative clustering algorithm for optimizing modularity.
 Basic idea - At each stage, groups of vertices are successively merged to form
larger communities such that the modularity of the resulting division of the
network increases after each merge.
 At the start, each node in the network is in its own community.
 At each step one chooses the two communities whose merger leads to the
biggest increase in the modularity.
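This greedy scheme is available in networkx; a minimal usage sketch:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()
    communities = greedy_modularity_communities(G)
    print([sorted(c) for c in communities])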

3.6.3 Spectral Algorithms

Spectral algorithms are among the classic methods for clustering and community
discovery. Spectral methods generally refer to algorithms that assign nodes to communities
based on the eigenvectors of matrices, such as the adjacency matrix of the network itself or other
related matrices. The top k eigenvectors define an embedding of the nodes of the network as
points in a k-dimensional space, and one can subsequently use classical data clustering
techniques such as K-means clustering to derive the final assignment of nodes to clusters.
The main idea behind spectral clustering is that the low-dimensional representation, induced by
the top eigenvectors, exposes the cluster structure in the original graph with greater clarity.

 The goal is to cut the given network into pieces so that the number of edges cut is minimized.
 One of the basic algorithms is spectral graph bipartitioning.
 The Laplacian matrix L of the given network is used.
 The Laplacian matrix L of a network is an n × n symmetric matrix, with one row and
column for each vertex.
 Laplacian matrix is defined as L= D - A,
where A is the adjacency matrix,
D is the diagonal degree matrix.
 All eigenvalues of L are real and non-negative, and L has a full set of n real and
orthogonal eigenvectors.
 In order to minimize the above cut, vertices are partitioned based on the signs of the
eigenvector that corresponds to the second smallest eigenvalue of L.
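A numpy sketch of spectral bipartitioning on the signs of the eigenvector for the second smallest eigenvalue of L (the Fiedler vector); the adjacency matrix (two triangles joined by one edge) is illustrative:

    import numpy as np

    def spectral_bipartition(A: np.ndarray):
        D = np.diag(A.sum(axis=1))
        L = D - A                             # unnormalized Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues ascending
        fiedler = eigvecs[:, 1]               # second smallest eigenvalue
        return fiedler >= 0                   # sign-based side assignment

    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    print(spectral_bipartition(A))            # splits the two triangles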

 The main disadvantage of spectral algorithms lies in their computational complexity.

3.6.4 Multi-level Graph Partitioning


 Multi-level methods provide a powerful framework for fast and high-quality graph
partitioning.
 The main idea here is to shrink or coarsen the input graph successively so as to obtain a
small graph, partition this small graph and then successively project this partition back
up to the original graph, refining the partition at each step along the way.
 Multi-level graph partitioning methods include:
 Multi-level spectral clustering
 Metis (which optimizes the KL objective function)
 Graclus (which optimizes normalized cuts and other weighted cuts) and
 MLR-MCL
 Components of a multi-level graph partitioning:
1. Coarsening - produce a smaller graph that is similar to the original graph. This
step may be applied repeatedly to obtain a graph that is small enough to be
partitioned quickly and with high quality.
2. Initial partitioning – In this step, a partitioning of the coarsest graph is
performed.
3. Uncoarsening – In this phase, the partition on the current graph is used to
initialize a partition on the finer (bigger) graph. The finer connectivity structure
of the graph revealed by the uncoarsening is used to refine the partition, usually
by performing local search. This step is continued until we arrive at the original
input graph.

3.6.5 Markov Clustering

 Markov Clustering algorithm (MCL) clusters graphs via manipulation of the


stochastic matrix or transition probability matrix corresponding to the graph.
 The transition probability between two nodes is also referred to as stochastic flow.
 The MCL process consists of two operations on stochastic matrices, Expand and
Inflate.
 Expand(M) is simply M∗M.
 Inflate(M,r) raises each entry in the matrix M to the inflation
parameter r ( > 1, and typically set to 2) followed by re-normalizing the columns
to sum to 1.
 These two operators are applied in alternation iteratively until convergence, starting
with the initial transition probability matrix.
 The expand step spreads the stochastic flow out of a vertex to potentially new
vertices and also enhances the stochastic flow to those vertices which are reachable
by multiple paths.
 The inflation step introduces a non-linearity into the process, with the purpose of
strengthening intra-cluster stochastic flow and weakening inter-cluster stochastic
flow.
 Very effective at clustering biological interaction networks.
 MCL has two major shortcomings:
 First, MCL is slow, since the Expand step involves matrix-matrix
multiplication.
 Second, MCL tends to produce imbalanced clusterings, usually by producing a
large number of very small clusters, or by producing one very big cluster, or by
doing both at the same time.
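A compact numpy sketch of the expand/inflate loop on a column-stochastic matrix; adding self-loops and using a fixed iteration count are common simplifying choices:

    import numpy as np

    def mcl(A: np.ndarray, r: float = 2.0, iterations: int = 50):
        M = A + np.eye(len(A))        # self-loops stabilize the flow
        M = M / M.sum(axis=0)         # column-stochastic transition matrix
        for _ in range(iterations):
            M = M @ M                 # Expand: spread flow along paths
            M = M ** r                # Inflate: raise entries to power r...
            M = M / M.sum(axis=0)     # ...and renormalize the columns
        return M                      # read clusters from the nonzero rows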

3.6.6 Other Approaches


• Local Graph Clustering
• Flow-Based Post-Processing for Improving Community Detection
• Community Discovery via Shingling
---------------------------------------------------------------------------------------------------------------------

3.7 Applications of Community Mining Algorithms


Some applications of community mining, with respect to various tasks in social network analysis,
are listed below:

 Network Reduction
 Discovering Scientific Collaboration Groups from Social Networks
 Mining Communities from Distributed and Dynamic Networks

Network Reduction

Community mining can be used for network reduction: once communities are detected, a large
network can be summarized by a much smaller one in which each community is represented by a
single node, making the overall structure easier to analyze and visualize.
Discovering Scientific Collaboration Groups from Social Networks
This section shows how community mining techniques can be applied to the analysis of scientific
collaborations among researchers. Flink is a social network that describes the scientific
collaborations among 681 Semantic Web researchers (http://flink.semanticweb.org/).

The network was constructed using Semantic Web technologies, and all related semantic
information was automatically extracted from “Web-accessible information sources”, such as
“Web pages, FOAF profiles, email lists, and publication archives”. The weights on the links
measure the degrees of collaboration.

 From the perspective of social network analysis, one may be especially interested in such
questions as:
1. Among all researchers, which ones are more likely to collaborate with each
other?
2. What are the main reasons that bind them together?
 Community mining techniques can be applied to answer such questions.

Mining Communities from Distributed and Dynamic Networks
Many applications involve distributed and dynamically-evolving networks, in which resources
and controls are not only decentralized but also updated frequently. One promising solution is
based on an Autonomy-Oriented Computing (AOC) approach, in which a group of self-
organizing agents is utilized. The agents rely only on their locally acquired information
about the network. Intelligent Portable Digital Assistants (or iPDAs for short) that people carry
around can form a distributed network, in which their users communicate with each other
through calls or messages.

One useful function of iPDAs would be to find and recommend new friends with common
interests, or potential partners in research or business, to the users. This can be implemented
through the following steps:
(1) Based on an iPDA user's communication traces, select individuals who have
frequently contacted or been contacted by the user during a certain period of time;
(2) Take the selected individuals as the input to an AOC-based community mining algorithm;
(3) Rank and recommend to the user new persons who might not be included in the user's
current acquaintance book.

In this way, people can periodically receive recommendations about friends or partners from
their iPDAs.

---------------------------------------------------------------------------------------------------------------------

3.8 Node Classification in Social Networks
Node classification: providing a high-quality labeling for nodes in a given graph structure
(i.e., a social network or simply any network).
 A first approach to this problem is to engage experts to provide labels on nodes, based on
additional data about the corresponding individuals and their connections. Or individuals
can be incentivized to provide accurate labels, via financial or other inducements.
 A second approach is based on the paradigm of machine learning and classification. Here,
we aim to train a classifier on the examples of nodes that are labeled, so that we can
apply it to the unlabeled nodes to predict labels for them.
Two important phenomena that can apply in online social networks:
 homophily
 co-citation regularity
Representing data as a graph
We consider graphs of the form G(V, E,W) from this data, where V is the set of n nodes, E is the
set of edges and W is the edge weight matrix. We also let Y be a set of m labels that can be
applied to nodes of the graph.
Inducing a graph
In some applications, the input may be a set of objects with no explicit link structure, for
instance, a set of images from Flickr. We may choose to induce a graph structure for the objects,
based on the principles of homophily or co-citation regularity: we should link entities which have
similar characteristics (homophily) or which refer to the same objects (co-citation regularity).
Types of Labels
 binary: only two possible values are allowed (such as male or female,
positive or negative).
 numeric: the label takes a numeric value (such as age, number of views, some range).
 categorical: the label may be restricted to a set of specified categories (such as
for interests, occupation).
 free-text: users may enter arbitrary text to identify the labels that apply to the node.

The Node Classification Problem


We are given a graph G(V, E, W) with a subset of nodes Vl ⊂ V labeled, where V is the set
of n nodes in the graph (possibly augmented with other features), and Vu = V \ Vl is the set
of unlabeled nodes. Here W is the weight matrix, and E is the set of edges. Let Y be the set
of m possible labels, and Yl = {y1, y2, . . . , yl} be the initial labels on nodes in the set Vl .
The task is to infer the labels Ỹ on all nodes V of the graph.
Preliminaries and Definitions
Let Vl be the set of l initially labeled nodes and Vu the set of n − l unlabeled nodes such
that V = Vl ∪ Vu. We assume the nodes are ordered such that the first l nodes are initially
labeled and the remaining nodes are unlabeled so that V = {v1, . . . , vl, vl+1, . . . , vn}.
An edge (i, j) ∈ E between nodes vi and vj has weight wij .

A transition matrix T is computed by row-normalizing the weight matrix W as:

   T = D−1 W

where D is the diagonal matrix D = diag(di) with di = ∑j wij.
The unnormalized graph Laplacian of the graph is defined as L = D − W, and
the normalized graph Laplacian as 𝓛 = D−1/2 L D−1/2.
If W is symmetric, then both these Laplacians are positive semi-definite matrices.
Let Yl = {y1, y2, . . . , yl} be the initial labels, from the label set Y, on nodes in the set Vl.
The label yi on node vi may be a binary label, a single label, or a multi-label.
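These definitions in a small numpy sketch (the weight matrix is illustrative):

    import numpy as np

    W = np.array([[0, 2, 1],
                  [2, 0, 0],
                  [1, 0, 0]], dtype=float)  # symmetric example weights
    d = W.sum(axis=1)                       # di = sum_j wij
    D = np.diag(d)
    T = np.diag(1 / d) @ W                  # T = D^-1 W (row-stochastic)
    L = D - W                               # unnormalized Laplacian
    S = np.diag(1 / np.sqrt(d))
    L_norm = S @ L @ S                      # normalized Laplacian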

3.8.1 Methods using Local Classifiers


These iterative methods are based on building feature vectors for nodes from the
information known about them and their neighborhood (immediately adjacent or nearby
nodes). These feature vectors are then used, along with the known class values Yl, to build
an instance of a local classifier, such as Naïve Bayes or a decision tree, for inferring the
labels on nodes in Vu.
3.8.1.1 Iterative Classification Method
Input:
Consider the YouTube graph: attributes such as the number of times a video is
viewed, the time of upload, the rating, etc. are node features that are known for each video.
What makes classification of graph data different is the presence of links between
objects (nodes) being classified. The information about the neighborhood of each object
is captured by link features, such as the (multi)set of tags on videos. Typically, link
features are presented to the classifier as aggregate statistics derived from the labels on
nodes in the neighborhood. A popular choice for computing a link feature is the
frequency with which a particular label is present in the neighborhood. For instance, for
a video vi in the YouTube graph, the number of times the label music appears in the
nodes adjacent to vi is a link feature. If the graph is directed, the link features may be
computed separately for incoming and outgoing links. The features may also include
graph properties, such as node degrees and connectivity information.
Iterative Framework
Let Φ denote the matrix of feature vectors for all nodes in V , where the i-th row of Φ
represents the feature vector φi for node vi. The feature vector φi may be composed of
both the node and link features. Let Φl and Φu denote the feature vectors for labeled and
unlabeled nodes respectively.

The Iterative Classification Algorithm (ICA) framework for classifying nodes in a graph
proceeds as follows. An initial classifier is trained using Φl and the given
node labels Yl. In the first iteration, the trained classifier is applied to Φu to compute the
new labeling Ỹ(1). For any node vi, some previously unlabeled nodes in the neighborhood
of vi now have labels from Ỹ(1). In the t-th iteration, the procedure builds a new feature
vector Φ(t) based on Φl and Ỹ(t−1), and then applies the classifier to produce the new labels Ỹ(t).
Optionally, we may choose to retrain the classifier at each step, over the current set of
labels and features.
If node features are not known, the inference is based only on link features. In such a
case, if a node has no labeled node in its neighborhood, it remains unlabeled in the
first iteration. As the algorithm proceeds, more nodes are labeled. Thus, the total
number of iterations τ should be sufficiently large to at least allow all nodes to receive
labels. One possibility is to run the iteration until “stability” is achieved, that is, until no
label changes in an iteration; but for arbitrary local classifiers there is no guarantee
that stability will be reached. Instead, we may choose to iterate for a fixed number of
iterations that is considered large enough, or until some large fraction of node labels do
not change in an iteration.
As an example of two steps of local iteration on a simple graph where some nodes are
initially labeled: the first iteration labels a node X with the label ‘18’; based on this new
link feature, the second iteration propagates this label to a neighboring node Y, and
additional iterations propagate the labeling further.
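A minimal sketch of the ICA loop, using neighborhood label frequencies as link features and scikit-learn's naive Bayes as the local classifier (one of the instantiations listed below); nodes are assumed to be numbered 0..n−1:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def link_features(G, labels, classes):
        """Per-node frequency of each label among its neighbors."""
        feats = np.zeros((len(G), len(classes)))
        for v, nbrs in G.items():
            for u in nbrs:
                if labels.get(u) is not None:
                    feats[v][classes.index(labels[u])] += 1
        return feats

    def ica(G, labels, classes, iterations=10):
        labeled = [v for v in G if labels.get(v) is not None]
        target = [classes.index(labels[v]) for v in labeled]
        unlabeled = [v for v in G if labels.get(v) is None]
        for _ in range(iterations):
            feats = link_features(G, labels, classes)
            clf = GaussianNB().fit(feats[labeled], target)
            for v in unlabeled:          # new labels become link features
                labels[v] = classes[clf.predict(feats[[v]])[0]]
        return labels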

Instances of the Iterative Framework


 Naive Bayes classifier to infer labels in their instantiation of the ICA framework
 Logistic regression to classify linked documents.
 Simpler classification method based on taking a weighted average of the class
probabilities in the neighborhood.
 A nearest neighbor classifier to find a labeled node that is most similar to an
unlabeled node being classified.
 Using features from neighboring documents to aid the classification, which can be
viewed as an instance of ICA on a graph formed by documents.

3.8.2 Random Walk based Methods
The idea underlying the random walk methods is as follows: the probability of labeling
a node vi ∈ V with label c ∈ Y is the total probability that a random walk starting at vi
will end at a node labeled c.
The random walk is defined by a transition matrix P, so that the walk proceeds from
node vi to node vj with probability pij, the (i, j)-th entry of P. For this to be well defined,
we require 0 ≤ pij ≤ 1 and ∑j pij = 1. The matrix P also encodes the absorbing states of
the random walk. These are nodes where the state remains the same with probability 1,
so there is zero probability of leaving the node, i.e., if a random walk reaches such a
node, it ends there.

The matrix equation for node classification using random walks can be written as:

   Ỹ = P Ỹ

with the label distributions of the initially labeled (absorbing) nodes held fixed. Partitioning P
into blocks over the labeled (l) and unlabeled (u) nodes, the solution for the unlabeled nodes is:

   Ỹu = (I − Puu)−1 Pul Yl
Various methods based on Random Walks are:
 Label Propagation
 Graph Regularization
 Adsorption

Label Propagation

i) Random Walk Formulation


The random walk at node vi picks an (outgoing) edge with probability
proportional to the edge weight, if vi is unlabeled; however, if vi is labeled,
the walk always loops to vi. Therefore the nodes in Vl are absorbing states,
i.e. they are treated as if they have no outgoing edges, and thus their labels
do not change.

ii) Iterative Formulation


Consider an iterative algorithm where each node is assigned a label
distribution (or a null distribution) in each step. In step t, each unlabeled
node takes the set of distributions of its neighbors from step t−1, and takes
their mean as its label distribution for step t. This is illustrated in the
sketch below.
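A numpy sketch of this iteration; labeled rows are clamped to their initial distributions, matching the absorbing states above:

    import numpy as np

    def label_propagation(W, Y0, labeled_mask, iterations=100):
        """W: weight matrix; Y0: n x m initial label distributions
        (zero rows for unlabeled nodes); labeled_mask: boolean array."""
        T = W / W.sum(axis=1, keepdims=True)    # row-normalized transitions
        Y = Y0.copy()
        for _ in range(iterations):
            Y = T @ Y                           # average neighbor labels
            Y[labeled_mask] = Y0[labeled_mask]  # clamp labeled nodes
        return Y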

Graph Regularization

i) Random Walk Formulation


We first describe the graph regularization method in terms of a random walk
starting from a node vi. Now the random walk at every node proceeds to a
neighbor with probability α (whether or not the node is unlabeled) but, with
probability 1 − α, the walk jumps back to vi, the starting node. Here, 1 − α can
be thought of as a “reset probability”.

ii) Iterative Formulation


At each step, the label distribution of node i is computed as an α fraction of the
sum of the label distributions of its neighbors from the previous step, plus a 1−α
fraction of its initial label distribution.
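In matrix form, one step of this iteration can be sketched (assuming the row-normalized transition matrix T defined earlier; implementations often use a symmetrically normalized matrix instead) as:

   Y(t) = α T Y(t−1) + (1 − α) Y(0)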

Adsorption
The “adsorption” method is also based on iteratively averaging the labels from
neighbors, in common with the previous algorithms studied.
Adsorption takes as input a directed graph G with weight matrix W. The initial labels
are represented as Y = {y1, y2, . . . , yn} such that yi is the probability distribution over
labels Y if node vi ∈ Vl, and is zero if node vi ∈ Vu.
In order to maintain and propagate the initial labeling, adsorption creates a shadow
vertex ṽi for each labeled node vi ∈ Vl such that ṽi has a single incoming edge from vi,
and no outgoing edges. In other words, the shadow vertex is an absorbing state when
we view the algorithm as a random walk. Then, the label distribution yi is moved
from vi to the corresponding shadow vertex ṽi, so initially vi is treated as unlabeled.
The set of shadow vertices is Ṽ = {ṽi | vi ∈ Vl}.
The weight on the edge from a vertex to its shadow is a parameter that can be
adjusted. That is, it can be set so that the random walk has probability 1 − αi of
transitioning from vertex vi to its shadow ṽi and terminating. This injection
probability was set to a constant such as 1/4 for all labeled nodes.
i) Random Walk Formulation
First, A captures the injection probabilities for each node vi: A is the n × n
diagonal matrix A = diag(α1, α2, . . . , αl, 1, . . . , 1), where 1 − αi is the (injection)
probability that a random walk currently at vi transitions to the shadow vertex
ṽi and terminates. Hence αi is the probability that the walk continues to a
different neighbor vertex.
A transition matrix T encodes the probability that the walk transitions from vi
to each of its non-shadow neighbors, so T = D−1 W as before.
ii) Iterative Formulation
Consider the graph G and weight matrix W as above, but with the directionality
of all edges reversed. At each iteration, the algorithm computes the label
distribution at node vi as a weighted sum of the label distributions in its
neighborhood and the node's initial labeling, with the weight given by αi.
3.8.3 Applying Node Classification to Large Social Networks

3.8.3.1 Basic Approaches


 Iteration
 Random walk Simulation
3.8.3.2 Second Order Methods
The main idea behind these second-order methods is that the update
performed at each iteration is adjusted by the update performed at the
previous iteration. These methods have been shown to converge more rapidly
than simple iterations (referred to as first-order methods) for applications
such as load balancing and multi-commodity flow.

3.8.3.3 Implementation within Map-Reduce


The Map-Reduce framework is a popular programming model that facilitates
distributing computation over a cluster of machines for data-intensive tasks.
Applications in this framework are implemented via two operations:
(1) Map: input represented as key/value pairs is processed to generate
intermediate key/value pairs, and
(2) Reduce: all intermediate pairs associated with the same key are collected
and aggregated.
The system takes care of allocating map and reduce tasks to different
machines, which can operate in parallel.
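A sketch of one label propagation iteration in this model; the map/reduce function signatures are illustrative, not the API of a specific framework:

    from collections import defaultdict

    def map_step(node, label_dist, out_edges):
        """Emit this node's label distribution to each neighbor (the key)."""
        for neighbor, weight in out_edges:
            yield neighbor, (weight, label_dist)

    def reduce_step(node, messages):
        """Aggregate all weighted distributions sent to the same key."""
        total, norm = defaultdict(float), 0.0
        for weight, dist in messages:
            norm += weight
            for label, p in dist.items():
                total[label] += weight * p
        return node, {label: p / norm for label, p in total.items()}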

3.8.4 Related Approaches
 Inference using Graphical Models
 Metric Labeling
 Spectral Partitioning
 Graph Clustering

3.8.5 Variations on Node Classification
 Dissimilarity in Labels
 Edge Labeling
 Label Summarization
---------------------------------------------------------------------------------------------------------------------
