3.1 Extracting Evolution of Web Communities from a Series of Web Archives
Types of Changes
Emerge
◦ A community c(tk) emerges in C(tk), when c(tk) shares no URLs with any
community in C(tk−1).
Dissolve
◦ A community c(tk−1) in C(tk−1) has dissolved when c(tk−1) shares no URLs with
any community in C(tk).
Growth and Shrink
◦ The community grows when new URLs appear in c(tk), and shrinks when
URLs disappear from c(tk−1).
Split
◦ c(tk−1) shares URLs with multiple communities in C(tk)
Merge
◦ When multiple communities (c(tk−1), d(tk−1), ...) share URLs with a single
community e(tk), these communities are merged into e(tk).
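The change types above can be sketched in code. This is a minimal illustration, assuming each community is represented as a Python set of URLs and each snapshot C(t) as a list of such sets; the function names and the toy data are hypothetical and not taken from the original work.

```python
def shared(c, communities):
    """Indices of the communities that share at least one URL with c."""
    return [i for i, d in enumerate(communities) if c & d]

def classify_changes(prev, curr):
    """prev is C(t_{k-1}) and curr is C(t_k), each a list of URL sets."""
    events = {"emerge": [], "dissolve": [], "split": [], "merge": []}
    for j, c in enumerate(curr):
        sources = shared(c, prev)
        if not sources:
            events["emerge"].append(j)            # shares nothing with C(t_{k-1})
        elif len(sources) > 1:
            events["merge"].append((sources, j))  # several old communities feed c
    for i, c in enumerate(prev):
        targets = shared(c, curr)
        if not targets:
            events["dissolve"].append(i)          # shares nothing with C(t_k)
        elif len(targets) > 1:
            events["split"].append((i, targets))  # c is spread over several new ones
    return events

# Toy snapshots (illustrative data only).
prev = [{"a", "b", "c", "d"}, {"e", "f"}, {"g", "h"}, {"x"}]
curr = [{"a", "b"}, {"c", "d"}, {"e", "f", "g", "h"}, {"n1", "n2"}]
events = classify_changes(prev, curr)
print(events)
```

In the toy data, the first old community splits in two, the second and third merge, the fourth dissolves, and the last new community emerges.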
Evolution Metrics
Evolution metrics measure how a particular community c(tk) has evolved. The metrics are
defined by differences between c(tk) and its corresponding community c(tk−1).
Growth Rate
The growth rate, Rgrow(c(tk−1), c(tk)), represents the increase of URLs per unit time. It allows
us to find the fastest growing or shrinking communities.
Stability
Represents the amount of disappeared, appeared, merged and split URLs per unit time. A stable
community on a topic is the best starting point for finding interesting changes around the topic.
Disappearance rate
The number of disappeared URLs from c(tk−1) per unit time. A higher disappearance rate means that
the community has lost URLs mainly by disappearance.
Merge rate
The number of URLs absorbed from other communities by merging per unit time. A higher
merge rate means that the community has obtained URLs mainly by merging.
Split Rate
The split rate, Rsplit (c(tk−1), c(tk)), is the number of split URLs from c(tk−1) per unit time.
When the split rate is low, c(tk) is larger than other split communities. Otherwise, c(tk) is smaller
than other split communities.
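Two of these metrics can be sketched directly from their verbal definitions. The exact formulas are not given in the text, so the code below makes two assumptions: the growth rate is taken as the net change in community size divided by the elapsed time, and a URL counts as "disappeared" when it is in c(tk−1) but in no community of C(tk). The function and variable names are hypothetical.

```python
def growth_rate(c_prev, c_curr, t_prev, t_curr):
    """Net increase in the number of URLs per unit time (assumed form)."""
    return (len(c_curr) - len(c_prev)) / (t_curr - t_prev)

def disappearance_rate(c_prev, urls_at_tk, t_prev, t_curr):
    """URLs of c(t_{k-1}) found in no community at t_k, per unit time."""
    disappeared = c_prev - urls_at_tk
    return len(disappeared) / (t_curr - t_prev)

# Toy data: one community observed at t = 0 and t = 2 (arbitrary time units).
c_prev = {"a", "b", "c", "d"}
c_curr = {"a", "b", "e", "f", "g", "h"}
other_curr = {"c"}                    # "c" moved to another community at t_k
urls_at_tk = c_curr | other_curr      # all URLs present in C(t_k)

r_grow = growth_rate(c_prev, c_curr, 0, 2)
r_dis = disappearance_rate(c_prev, urls_at_tk, 0, 2)
print(r_grow, r_dis)
```

Here only "d" has disappeared; "c" has moved to another community and would count towards a split rather than a disappearance.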
Other Metrics
The novelty metric of a main line (c(ti), c(ti+1), ..., c(tj)) is calculated as follows.
The size distribution of communities also follows a power law, and its exponent did not change
much over time. Although the size distribution of communities is stable, the structure of
communities changes dynamically. The structure of the chart changes mainly through splits and merges,
in which more than half of the communities are involved.
Combining evolution metrics and relevance, evolution around a particular community can be
located.
2. Communities will help us understand the structures of given social networks. Communities are
regarded as components of given social networks, and they will clarify the functions and
properties of the networks.
3. Communities will play important roles when we visualize large-scale social networks.
Relations between the communities clarify the processes of information sharing and information
diffusion, and they may give us some insights into the growth of the networks in the future.
A null model is a network which matches the original network in some of its topological features, but
which does not display community structure.
The simplest way to design a null model is to introduce randomness in the distribution of
edges among vertices.
The most popular null model consists of a randomized version of the original network,
where edges are rewired at random under the constraint that each vertex keeps its degree.
This null model is the basic concept behind the definition of modularity.
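The degree-preserving rewiring described above can be sketched as a sequence of double edge swaps. This is a minimal illustration rather than a reference implementation: it assumes an undirected simple graph given as a list of vertex pairs, and the function names, the swap count and the example graph are hypothetical.

```python
import random

def rewire(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while keeping every vertex degree."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:        # pick one of the two possible pairings
            c, d = d, c
        # Proposed swap: (a, b), (c, d)  ->  (a, d), (c, b)
        if a == d or c == b:
            continue                  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degrees(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Example: a ring of 10 vertices with two chords.
ring = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
null_edges = rewire(ring, n_swaps=20, seed=1)
print(degrees(null_edges) == degrees(ring))
```

Each accepted swap removes one edge and adds one edge at each of the four vertices involved, so the degree sequence is unchanged.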
Definitions Based on Vertex Similarity
Definitions of the last category are based on the assumption that communities are groups of
vertices which are similar to each other. Some quantitative criterion is employed to evaluate the
similarity between each pair of vertices. Similarity measures are the basis of the method of
hierarchical clustering. Hierarchical clustering is a way to find several layers of communities that
are composed of vertices similar to each other.
Repeated merges of similar vertices based on some quantitative similarity measure
generate the structure shown in Fig. 3.a. This structure is called a dendrogram, and highly similar
vertices are connected in the lower part of the dendrogram. Subtrees obtained by cutting the
dendrogram with a horizontal line correspond to communities. Communities of different
granularity are obtained by changing the position of the horizontal line.
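The merge-and-cut procedure can be sketched as follows. The text does not fix a similarity measure or a linkage rule, so this sketch assumes Jaccard similarity of closed neighbourhoods and single linkage; the similarity threshold plays the role of the horizontal cut through the dendrogram. All names and the example graph are hypothetical.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def agglomerate(adj, threshold):
    """Merge the most similar clusters until similarity drops below threshold."""
    # Closed neighbourhood: the vertex itself plus its neighbours.
    nbh = {v: set(ns) | {v} for v, ns in adj.items()}
    clusters = [{v} for v in adj]
    merges = []                      # (similarity, merged cluster), bottom-up

    def sim(c1, c2):                 # single linkage: best pairwise similarity
        return max(jaccard(nbh[u], nbh[v]) for u in c1 for v in c2)

    while len(clusters) > 1:
        s, i, j = max((sim(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if s < threshold:
            break                    # the "horizontal line" of the dendrogram
        clusters[i] |= clusters[j]
        del clusters[j]              # j > i, so index i stays valid
        merges.append((s, set(clusters[i])))
    return clusters, merges

# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters, merges = agglomerate(adj, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

Lowering the threshold moves the cut upwards in the dendrogram and yields fewer, coarser communities; with a threshold of 0 everything ends up in a single cluster.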
where the sum runs over all pairs of vertices, A is the adjacency matrix, k_i is the degree of vertex
i and m is the total number of edges of the network.
Modularity can be rewritten as follows:
where n_m is the number of communities, l_s is the total number of edges joining vertices of
community s, and d_s is the sum of the degrees of the vertices of s.
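The two formulas referred to in this passage are not reproduced in the text. Assuming the standard Newman-Girvan definition of modularity, which matches the symbols defined here, they would read as follows. The Kronecker delta δ(C_i, C_j), equal to 1 when vertices i and j are in the same community and 0 otherwise, is notation introduced for this sketch.

```latex
% Pairwise form (sum over all pairs of vertices i, j)
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j)

% Community form (sum over the n_m communities)
Q = \sum_{s=1}^{n_m} \left[ \frac{l_s}{m} - \left( \frac{d_s}{2m} \right)^{2} \right]
```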
The first term of each summand is the fraction of edges of the network inside the community,
whereas the second term represents the expected fraction of edges that would be there if the
network were a random network with the same degree for each vertex.
Figure 3.b illustrates the meaning of modularity.
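As a small numerical check, the sketch below computes modularity in both forms for a toy network, assuming the standard Newman-Girvan definition (a sum over vertex pairs, and an equivalent sum over communities). The function names and the example graph are hypothetical.

```python
def modularity_pairwise(edges, membership):
    """Q as a sum over all ordered vertex pairs in the same community."""
    m = len(edges)
    nodes = sorted(membership)
    adjacent = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    k = {v: 0 for v in nodes}
    for u, v in edges:
        k[u] += 1
        k[v] += 1
    total = 0.0
    for i in nodes:
        for j in nodes:
            if membership[i] == membership[j]:
                a_ij = 1 if (i, j) in adjacent else 0
                total += a_ij - k[i] * k[j] / (2 * m)
    return total / (2 * m)

def modularity_by_community(edges, membership):
    """Q as a sum over communities of l_s/m - (d_s/2m)^2."""
    m = len(edges)
    l, d = {}, {}
    for u, v in edges:
        su, sv = membership[u], membership[v]
        d[su] = d.get(su, 0) + 1          # each edge end adds to d_s
        d[sv] = d.get(sv, 0) + 1
        if su == sv:
            l[su] = l.get(su, 0) + 1      # edge inside community s
    return sum(l.get(s, 0) / m - (d[s] / (2 * m)) ** 2 for s in d)

# Two triangles joined by one edge, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q1 = modularity_pairwise(edges, membership)
q2 = modularity_by_community(edges, membership)
print(q1, q2)   # both equal 5/14, about 0.357
```

For this partition each community has l_s = 3 and d_s = 7 with m = 7, so Q = 2 × (3/7 − 1/4) = 5/14.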
Divisive Algorithms
A simple way to identify communities in a network is to detect the edges that connect vertices of
different communities and remove them, so that the communities get disconnected from each
other.
The steps of the algorithm are as follows:
(1) Computation of the centrality of all edges,
(2) Removal of the edge with the largest centrality,
(3) Recalculation of centralities on the running network, and
(4) Iteration of the cycle from step (2).
A common choice of centrality is edge betweenness, which is the number of shortest paths between
all vertex pairs that run along the edge.
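The four steps can be sketched as follows, assuming an unweighted, undirected graph stored as a dictionary of neighbour sets. Edge betweenness is computed with a breadth-first accumulation in the style of Brandes; when a vertex pair has several shortest paths, each path contributes a fraction, which is a common convention but an assumption relative to the definition above. Function names and the example graph are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}          # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):            # accumulate from the farthest vertices
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2 for e, b in bet.items()}   # each pair was counted twice

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-betweenness edges until the network falls apart."""
    adj = {v: set(ns) for v, ns in adj.items()}
    start, removed = len(components(adj)), []
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)          # steps (1) and (3)
        if not bet:
            break
        u, v = max(bet, key=bet.get)         # step (2)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(frozenset((u, v)))
    return components(adj), removed

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bet = edge_betweenness(adj)
comps, removed = split_once(adj)
print(sorted(sorted(c) for c in comps), removed)
```

In the example, the bridge between the two triangles carries all nine shortest paths that cross from one side to the other, so it is removed first and the network separates into the two triangles.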
Modularity Optimization
Modularity is a quality function for evaluating partitions. Therefore, the partition corresponding
to its maximum value on a given network should be the best one. This is the main idea behind
modularity optimization. It has been proved that modularity optimization is an NP-hard problem.
However, there are currently several algorithms that are able to find fairly good approximations
of the modularity maximum in a reasonable time. One of the well-known algorithms for modularity
optimization is the CNM algorithm. Other examples are greedy algorithms and
simulated annealing.
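A greedy approach to modularity optimization can be sketched as follows: start with every vertex in its own community and repeatedly merge the pair of connected communities that gives the largest increase in modularity, stopping when no merge improves it. This is a simplified illustration and not the CNM algorithm itself, which uses more efficient data structures. It assumes the standard modularity definition, under which merging communities a and b changes Q by e_ab/m − d_a·d_b/(2m²), with e_ab the number of edges between them. All names are hypothetical.

```python
def greedy_modularity(edges):
    m = len(edges)
    nodes = sorted({v for e in edges for v in e})
    members = {v: {v} for v in nodes}     # community id -> vertices
    deg = {v: 0 for v in nodes}           # community id -> d_s
    between = {}                          # {a, b} -> number of edges e_ab
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        key = frozenset((u, v))
        between[key] = between.get(key, 0) + 1

    while True:
        best_gain, best_pair = 0.0, None
        for a, b in sorted(tuple(sorted(k)) for k in between):
            e_ab = between[frozenset((a, b))]
            gain = e_ab / m - deg[a] * deg[b] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break                         # no merge increases modularity
        a, b = best_pair
        members[a] |= members.pop(b)      # merge b into a
        deg[a] += deg.pop(b)
        for key in [k for k in between if b in k]:
            count = between.pop(key)
            (other,) = key - {b}
            if other != a:                # edges a-b become internal to a
                new_key = frozenset((a, other))
                between[new_key] = between.get(new_key, 0) + count
    return sorted(sorted(c) for c in members.values())

# Two triangles joined by one edge (no self-loops or duplicate edges assumed).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
communities = greedy_modularity(edges)
print(communities)
```

On the example the procedure stops with the two triangles as communities, because merging them would lower modularity.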
Spectral Algorithms
Spectral algorithms cut a given network into pieces so that the number of edges to be cut
is minimized. One of the basic algorithms is spectral graph bipartitioning. The Laplacian
matrix L of a network is an n × n symmetric matrix, with one row
and column for each vertex. The Laplacian matrix is defined as L = D − A, where A is the adjacency
matrix and D is the diagonal degree matrix with D_ii = k_i, the degree of vertex i.
All eigenvalues of L are real and non-negative, and L has a full set of n real and orthogonal
eigenvectors. In order to minimize the cut, vertices are partitioned based on the signs of
the components of the eigenvector that corresponds to the second smallest eigenvalue of L. In
general, community detection based on repeated bipartitioning is relatively fast.
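Spectral bipartitioning can be sketched without a linear algebra library by using shifted power iteration. The sketch assumes a connected, undirected graph. It iterates with the matrix cI − L, where c is twice the maximum degree (an upper bound on the largest eigenvalue of L), and removes the constant component at every step so that the iteration converges towards the eigenvector of the second smallest eigenvalue. Function names, the iteration count and the example graph are hypothetical choices.

```python
import math

def second_eigenvector(adj, iterations=500):
    nodes = sorted(adj)
    n = len(nodes)
    deg = {v: len(adj[v]) for v in nodes}
    shift = 2 * max(deg.values())          # upper bound on the eigenvalues of L
    x = {v: float(i) for i, v in enumerate(nodes)}   # arbitrary start vector
    for _ in range(iterations):
        mean = sum(x.values()) / n
        x = {v: x[v] - mean for v in nodes}          # drop the constant component
        # y = (shift * I - L) x, where (L x)_v = k_v x_v - sum of x over neighbours
        y = {v: shift * x[v] - (deg[v] * x[v] - sum(x[w] for w in adj[v]))
             for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: y[v] / norm for v in nodes}
    return x

def bipartition(adj):
    """Split the vertices by the sign of their eigenvector component."""
    vec = second_eigenvector(adj)
    left = {v for v in adj if vec[v] < 0}
    right = set(adj) - left
    return left, right

# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
left, right = bipartition(adj)
print(sorted(left), sorted(right))
```

For larger networks a sparse eigenvalue solver would normally replace the plain power iteration; the sign-based split itself stays the same.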
Other Algorithms
There are many other algorithms for detecting communities, such as the methods focusing on
random walk, and the ones searching for overlapping cliques.
This section shows how community mining techniques can be applied to the analysis of scientific
collaborations among researchers. Flink is a social network that describes the scientific
collaborations among 681 semantic Web researchers (http://flink.semanticweb.org/).
The network was constructed based on semantic Web technologies and all related semantic
information was automatically extracted from “Web-accessible information sources”, such as
“Web pages, FOAF profiles, email lists, and publication archives”. The weights on the links
measure the degrees of collaboration.
The command fastgreedy.community is for maximizing modularity greedily, and its results are
stored in variable gr. The variable gr is composed of gr$modularity (a list of modularity values in
the process of maximization) and gr$merge (vertex IDs that are merged at each step).
Current online social networks are being extended in two main directions: towards the capabilities of
the provided services and towards the decentralization of the supporting infrastructures, as depicted in
Fig. 3.c.
Fig. 3.c Classification and development trend of online social network services
Updates – In peer collaboration systems, updates, e.g., of a workplace, are sent to a small group
of peers via a decentralized synchronization mechanism. In P2P social networks, with distributed
storage and replication – and a potential need for scalability, the requirements change. P2P
publish/subscribe mechanisms are a possibility, but their security in terms of access control will
have to be developed further.
Topology – In pure file-sharing networks, the topology does not depend on whether the peers
know each other and nodes exchange content with any other nodes in the network. At the other
end of the spectrum, existing examples of decentralized social networks (in the widest sense)
are mostly platforms for collaboration or media sharing, and they tend to consist of collaborative
groups that are relatively closed circles, e.g., using a “ring of trust” or darknets. In contrast,
online social networking services have overlapping circles.
Search, Addressing – Over multiple sessions, peers may change their physical address. In a
typical file-sharing network, this is not an issue: one just needs to find some peer with the
content one is looking for. Traditionally, peer identity is tied to an IP address, which is clearly
not sufficient for a social network.
In social networks, tagging (folksonomies) is the basic mechanism for annotating content.
Recently, there have also been advances in enabling decentralized tagging, which is
another step towards realizing social networks on top of a P2P infrastructure.
Openness to New Applications – One of the most alluring features of current online social
networks is that they are open to third-party applications, which enables a constant change of
what a social networking service provides to the users.
In addition, third-party applications provide more and unpredictable ways of contacting users,
finding out about other users’ interests, forming groups and group identities, etc. This openness
to extensions potentially provides great benefits for the users. The price for these benefits is the
risk that comes with opening the service to untrusted third parties, extending the privacy
problem.
In a decentralized environment, if some users choose to enable a third-party application, their
choice should not affect other users, not even users connected directly to them.
Security – When data is stored on other peers that the user does not necessarily want to have
access to it, the content has to be encrypted, as is done, for example, for file backup or
anonymous peer-to-peer file sharing. To manage access to encrypted data, key distribution and
maintenance have to be handled such that the social network group can access the data, while
remaining flexible enough to handle churn (peers going offline and coming back) as well as
additions to and removals from the user’s social network.
For peer identities, one can take advantage of opportunistic networks and peer authentication by
in-person contact, when friends meet in real life and exchange keys over their phones. For
bootstrapping authentication, a central authority (trusted third party) seems hard to avoid.
Robustness – In a centralized system, one can turn to the provider in case of user misbehavior;
there is usually a process defined for dealing with such complaints. In a decentralized system,
there is no authority that can ban users for misbehavior or remove content. Robustness against
free-riding is a further concern.
Once access to content is granted, it is difficult to revoke that right. When a user allows a friend
to see a message, the friend can store the message and keep access to it even after a change of
key. Trust has to be at least equal to assigned access rights, due to this difficulty.
Limited Peers – To take advantage of the decentralized nature of social networks, a mapping of
the physical social network to the virtual one, and vice versa, makes it possible to extend access
via web browsers with phone applications and direct exchange of data in physical proximity.
Another immediate benefit of allowing such a two-tier system, however, is that users can then
participate in the social network with resource-constrained (e.g., mobile) devices, which they
may use as auxiliary devices, even when they contribute resources to the core of the system with their
primary device.
The social networking layer implements all basic functionalities and features that are provided
by contemporary centralized social networking services.
The most important of these functionalities are the capability to search the
system for relevant information (Distributed search), the management of users and shared space
(User account and share space management), the management of security and access control
issues (Trust management, Access control and security), and the coordination and management of
social applications developed by third parties (Application management).
It is expected that the social networking layer exposes and implements an application
programming interface (API) to support the development of new applications as well as to
enable the customization of the social network service to suit various preferences of the user.
The top layer of the architecture includes the user interface to the system and various
applications built on top of the development platform provided by the DOSN.
The DOSN user interface is expected to provide the user with the necessary transparency to use the
DOSN like any centralized OSN. Applications can be either implemented by the DOSN provider or
developed by third parties, and can be installed or removed from the system according to the user’s
preferences.
2. FacetNet: The community structure at a given timestep is determined both by the observed
networked data and by the prior distribution given by historic community structures. The
experimental results suggest that this technique is scalable and is able to extract meaningful
communities based on social media context.
3. MetaFac: MetaFac is the first graph-based tensor factorization framework for analyzing the
dynamics of heterogeneous social networks. In this framework, the metagraph is a novel relational
hypergraph representation for modeling multi-relational and multi-dimensional social data.
Extensive experiments on large-scale real-world social media data and on enterprise data
suggest that this technique is able to extract meaningful communities that are adaptive to social
media context.