
Chapter 11

Mining Social-Network Graphs

Social Networks as Graphs

What is a Social Network?

When we think of a social network, we think of Facebook, Twitter, Google+, or another website that is
called a “social network,” and indeed this kind of network is representative of the broader class of
networks called “social.” The essential characteristics of a social network are:

1. There is a collection of entities that participate in the network. Typically, these entities are
people, but they could be something else entirely.
2. There is at least one relationship between entities of the network. On Facebook or its ilk, this
relationship is called friends. Sometimes the relationship is all-or-nothing; two people are either
friends or they are not. However, in other examples of social networks, the relationship has a
degree. This degree could be discrete; e.g., friends, family, acquaintances, or none as in
Google+. It could be a real number; an example would be the fraction of the average day that
two people spend talking to each other.
3. There is an assumption of non-randomness or locality. This condition is the hardest to formalize,
but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and
C, then there is a higher probability than average that B and C are related.

Social Networks as Graphs

Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph. The
entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that
characterizes the network. If there is a degree associated with the relationship, this degree is
represented by labeling the edges. Often, social graphs are undirected, as for the Facebook friends
graph. But they can be directed graphs, as for example the graphs of followers on Twitter or Google+.

Example: Figure 11.1 is an example of a tiny social network. The entities are the nodes A through G. The
relationship, which we might think of as “friends,” is represented by the edges. For instance, B is friends
with A, C, and D.

Is this graph really typical of a social network, in the sense that it exhibits locality of relationships? First,
note that the graph has nine edges out of the (7 choose 2) = 21 pairs of nodes that could have had an edge
between them. Suppose X, Y , and Z are nodes of Fig. 11.1, with edges between X and Y and also
between X and Z. What would we expect the probability of an edge between Y and Z to be? If the graph
were large, that probability would be very close to the fraction of the pairs of nodes that have edges
between them, i.e., 9/21 = .429 in this case. However, because the graph is small, there is a noticeable
difference between the true probability and the ratio of the number of edges to the number of pairs of
nodes. Since we already know there are edges (X, Y ) and (X, Z), there are only seven edges remaining.
Those seven edges could run between any of the 19 remaining pairs of nodes. Thus, the probability of an
edge (Y, Z) is 7/19 = .368.

Figure 11.1: Example of a small social network

Now, we must compute the probability that the edge (Y, Z) exists in Fig. 11.1, given that edges (X, Y ) and
(X, Z) exist. What we shall actually count is pairs of nodes that could be Y and Z, without worrying about
which node is Y and which is Z. If X is A, then Y and Z must be B and C, in some order. Since the edge (B,
C) exists, A contributes one positive example (where the edge does exist) and no negative examples
(where the edge is absent). The cases where X is C, E, or G are essentially the same. In each case, X has
only two neighbors, and the edge between the neighbors exists. Thus, we have seen four positive
examples and zero negative examples so far.

Now, consider X = F. F has three neighbors, D, E, and G. There are edges between two of the three pairs
of neighbors, but no edge between G and E. Thus, we see two more positive examples and we see our
first negative example. If X = B, there are again three neighbors, but only one pair of neighbors, A and C,
has an edge. Thus, we have two more negative examples, and one positive example, for a total of seven
positive and three negative. Finally, when X = D, there are four neighbors. Of the six pairs of neighbors,
only two have edges between them.

Thus, the total number of positive examples is nine and the total number of negative examples is seven.
We see that in Fig. 11.1, the fraction of times the third edge exists is thus 9/16 = .563. This fraction is
considerably greater than the .368 expected value for that fraction. We conclude that Fig. 11.1 does
indeed exhibit the locality expected in a social network.
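The case analysis above can be checked mechanically. Below is a minimal Python sketch; the edge list for Fig. 11.1 is an assumption reconstructed from the neighbor counts given in the text.

```python
from itertools import combinations

# Edge list for Fig. 11.1, reconstructed from the text (an assumption).
edges = {frozenset(e) for e in
         [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
          ("D", "E"), ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]}
nodes = {v for e in edges for v in e}
neighbors = {v: {u for u in nodes if frozenset((u, v)) in edges} for v in nodes}

# For every X with neighbors Y and Z, check whether edge (Y, Z) exists.
pos = neg = 0
for x in nodes:
    for y, z in combinations(sorted(neighbors[x]), 2):
        if frozenset((y, z)) in edges:
            pos += 1
        else:
            neg += 1

print(pos, neg, round(pos / (pos + neg), 3))  # 9 7 0.563
```

Running it reproduces the nine positive and seven negative examples counted in the text.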
Clustering of Social-Network Graphs

Distance Measures for Social-Network Graphs

If we were to apply standard clustering techniques to a social-network graph, our first step would be to
define a distance measure. When the edges of the graph have labels, these labels might be usable as a
distance measure, depending on what they represented. But when the edges are unlabeled, as in a
“friends” graph, there is not much we can do to define a suitable distance.

Our first instinct is to assume that nodes are close if they have an edge between them and distant if not.
Thus, we could say that the distance d(x, y) is 0 if there is an edge (x, y) and 1 if there is no such edge.
We could use any other two values, such as 1 and ∞, as long as the distance is closer when there is an
edge.

Neither of these two-valued “distance measures” – 0 and 1 or 1 and ∞ – is a true distance measure. The
reason is that they violate the triangle inequality when there are three nodes, with two edges between
them. That is, if there are edges (A, B) and (B, C), but no edge (A, C), then the distance from A to C
exceeds the sum of the distances from A to B to C. We could fix this problem by using, say, distance 1 for
an edge and distance 1.5 for a missing edge. But the problem with two-valued distance functions is not
limited to the triangle inequality.
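The violation is easy to exhibit concretely. A minimal sketch, assuming a three-node path A–B–C with no edge (A, C):

```python
# Two-valued "distance": 0 if an edge exists, 1 if not.
edges = {frozenset(("A", "B")), frozenset(("B", "C"))}  # no edge (A, C)

def d(x, y):
    return 0 if frozenset((x, y)) in edges else 1

# The triangle inequality demands d(A, C) <= d(A, B) + d(B, C), but here
# d(A, C) = 1 while d(A, B) + d(B, C) = 0.
print(d("A", "C") <= d("A", "B") + d("B", "C"))  # False
```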

Applying Standard Clustering Methods

There are two general approaches to clustering: hierarchical (agglomerative) and point-assignment. Let
us consider how each of these would work on a social-network graph.

Hierarchical clustering of a social-network graph starts by combining some two nodes that are
connected by an edge. Successively, edges that are not between two nodes of the same cluster would
be chosen randomly to combine the clusters to which their two nodes belong. The choices would be
random, because all distances represented by an edge are the same.

Example : Consider again the graph of Fig. 11.1. First, let us agree on what the communities are. At
the highest level, it appears that there are two communities {A, B, C} and {D, E, F, G}. However, we could
also view {D, E, F} and {D, F, G} as two subcommunities of {D, E, F, G}; these two subcommunities overlap
in two of their members, and thus could never be identified by a pure clustering algorithm. Finally, we
could consider each pair of individuals that are connected by an edge as a community of size 2, although
such communities are uninteresting.

The problem with hierarchical clustering of a graph like that of Fig. 11.1 is that at some point we are
likely to choose to combine B and D, even though they surely belong in different clusters. The reason we
are likely to combine B and D is that D, and any cluster containing it, is as close to B and any cluster
containing it, as A and C are to B. There is even a 1/9 probability that the first thing we do is to combine
B and D into one cluster.
There are things we can do to reduce the probability of error. We can run hierarchical clustering several
times and pick the run that gives the most coherent clusters. We can use a more sophisticated method
for measuring the distance between clusters of more than one node. But no matter what we do, in a
large graph with many communities there is a significant chance that in the initial phases we shall use
some edges that connect two nodes that do not belong together in any large community.

Betweenness

Since there are problems with standard clustering methods, several specialized clustering techniques
have been developed to find communities in social networks. In this section we shall consider one of the
simplest, based on finding the edges that are least likely to be inside a community.

Define the betweenness of an edge (a, b) to be the number of pairs of nodes x and y such that the edge
(a, b) lies on the shortest path between x and y. To be more precise, since there can be several shortest
paths between x and y, edge (a, b) is credited with the fraction of those shortest paths that include the
edge (a, b). As in golf, a high score is bad. It suggests that the edge (a, b) runs between two different
communities; that is, a and b do not belong to the same community.

Example : In Fig. 11.1 the edge (B, D) has the highest betweenness, as should surprise no one. In fact,
this edge is on every shortest path between any of A, B, and C to any of D, E, F, and G. Its betweenness is
therefore 3 × 4 = 12. In contrast, the edge (D, F) is on only four shortest paths: those from A, B, C, and D
to F.

The Girvan-Newman Algorithm

In order to exploit the betweenness of edges, we need to calculate the number of shortest paths going
through each edge. We shall describe a method called the Girvan-Newman (GN) Algorithm, which visits
each node X once and computes the number of shortest paths from X to each of the other nodes that go
through each of the edges. The algorithm begins by performing a breadth-first search (BFS) of the graph,
starting at the node X. Note that the level of each node in the BFS presentation is the length of the
shortest path from X to that node. Thus, the edges that go between nodes at the same level can never
be part of a shortest path from X.

Edges between levels are called DAG edges (“DAG” stands for directed, acyclic graph). Each DAG edge
will be part of at least one shortest path from root X. If there is a DAG edge (Y, Z), where Y is at the level
above Z (i.e., closer to the root), then we shall call Y a parent of Z and Z a child of Y , although parents
are not necessarily unique in a DAG as they would be in a tree.

Example : Figure 11.2 is a breadth-first presentation of the graph of Fig. 11.1 , starting at node E. Solid
edges are DAG edges and dashed edges connect nodes at the same level.
Figure 11.2: Step 1 of the Girvan-Newman Algorithm

The second step of the GN algorithm is to label each node by the number of shortest paths that reach it
from the root. Start by labeling the root 1. Then, from the top down, label each node Y by the sum of the
labels of its parents.

Example : In Fig. 11.2 are the labels for each of the nodes. First, label the root E with 1. At level 1 are
the nodes D and F. Each has only E as a parent, so they too are labeled 1. Nodes B and G are at level 2. B
has only D as a parent, so B’s label is the same as the label of D, which is 1. However, G has parents D
and F, so its label is the sum of their labels, or 2. Finally, at level 3, A and C each have only parent B, so
their labels are the label of B, which is 1.

The third and final step is to calculate for each edge e the sum over all nodes Y of the fraction of shortest
paths from the root X to Y that go through e. This calculation involves computing this sum for both
nodes and edges, from the bottom. Each node other than the root is given a credit of 1, representing the
shortest path to that node. This credit may be divided among nodes and edges above, since there could
be several different shortest paths to the node. The rules for the calculation are as follows:

1. Each leaf in the DAG (a leaf is a node with no DAG edges to nodes at levels below) gets a credit
of 1.
2. Each node that is not a leaf gets a credit equal to 1 plus the sum of the credits of the DAG edges
from that node to the level below.
3. A DAG edge e entering node Z from the level above is given a share of the credit of Z
proportional to the fraction of shortest paths from the root to Z that go through e. Formally, let
the parents of Z be Y1, Y2, . . . , Yk. Let pi be the number of shortest paths from the root to Yi ;
this number was computed in Step 2 and is illustrated by the labels in Fig. 11.2. Then the credit
for the edge (Yi , Z) is the credit of Z times pi , divided by the sum p1 + p2 + · · · + pk.
After performing the credit calculation with each node as the root, we sum the credits for each edge.
Then, since each shortest path will have been discovered twice – once when each of its endpoints is the
root – we must divide the credit for each edge by 2.
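The three steps, repeated with every node as the root, can be sketched as follows. This is a minimal Python rendering of the procedure described above (the adjacency-dict encoding is my choice), applied to the graph of Fig. 11.1 with its edge list reconstructed from the text.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Girvan-Newman: BFS from each root, count shortest paths, then
    propagate credit bottom-up; finally halve every edge's total."""
    scores = defaultdict(float)
    for root in adj:
        # Step 1: BFS gives levels; DAG edges go from one level to the next.
        level, order, parents = {root: 0}, [root], defaultdict(list)
        q = deque([root])
        while q:
            v = q.popleft()
            for u in adj[v]:
                if u not in level:
                    level[u] = level[v] + 1
                    order.append(u)
                    q.append(u)
                if level[u] == level[v] + 1:
                    parents[u].append(v)
        # Step 2: label each node with its number of shortest paths from root.
        npaths = {root: 1}
        for v in order[1:]:
            npaths[v] = sum(npaths[p] for p in parents[v])
        # Step 3: each node starts with credit 1; pass credit up the DAG,
        # split among parent edges in proportion to the parents' labels.
        credit = {v: 1.0 for v in order}
        for v in reversed(order):
            if v == root:
                continue
            total = sum(npaths[p] for p in parents[v])
            for p in parents[v]:
                share = credit[v] * npaths[p] / total
                scores[frozenset((p, v))] += share
                credit[p] += share
    # Each shortest path was discovered twice (once per endpoint as root).
    return {tuple(sorted(e)): s / 2 for e, s in scores.items()}

# The graph of Fig. 11.1 (edge list reconstructed from the text).
adj = defaultdict(set)
for a, b in [("A","B"), ("A","C"), ("B","C"), ("B","D"),
             ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")]:
    adj[a].add(b); adj[b].add(a)

bt = edge_betweenness(adj)
print(bt[("B", "D")], bt[("D", "F")])  # 12.0 4.0
```

The scores it produces agree with the betweenness values quoted later for Fig. 11.3, e.g. 12 for (B, D) and 4.5 for (D, E).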

Using Betweenness to Find Communities

The betweenness scores for the edges of a graph behave something like a distance measure on the
nodes of the graph. It is not exactly a distance measure, because it is not defined for pairs of nodes that
are unconnected by an edge, and might not satisfy the triangle inequality even when defined. However,
we can cluster by taking the edges in order of increasing betweenness and adding them to the graph one at
a time. At each step, the connected components of the graph form some clusters. The higher the
betweenness we allow, the more edges we get, and the larger the clusters become.

More commonly, this idea is expressed as a process of edge removal. Start with the graph and all its
edges; then remove edges with the highest betweenness, until the graph has broken into a suitable
number of connected components.

Example : Let us start with our running example, the graph of Fig. 11.1. We see it with the betweenness
for each edge in Fig. 11.3. The calculation of the betweenness will be left to the reader. The only tricky
part of the count is to observe that between E and G there are two shortest paths, one going through D
and the other through F. Thus, each of the edges (D, E), (E, F), (D, G), and (G, F) are credited with half a
shortest path.

Figure 11.3: Betweenness scores for the graph of Fig. 11.1

Clearly, edge (B, D) has the highest betweenness, so it is removed first. That leaves us with exactly the
communities we observed make the most sense, namely: {A, B, C} and {D, E, F, G}. However, we can
continue to remove edges. Next to leave are (A, B) and (B, C) with a score of 5, followed by (D, E) and (D,
G) with a score of 4.5. Then, (D, F), whose score is 4, would leave the graph. We see in Fig. 11.4 the
graph that remains.
Figure 11.4: All the edges with betweenness 4 or more have been removed

The “communities” of Fig. 11.4 look strange. One implication is that A and C are more closely knit to
each other than to B. That is, in some sense B is a “traitor” to the community {A, B, C} because he has a
friend D outside that community. Likewise, D can be seen as a “traitor” to the group {D, E, F, G}, which is
why in Fig. 11.4, only E, F, and G remain connected.
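The removal process is easy to sketch. Taking the betweenness scores of Fig. 11.3 as given (the numbers below are the ones quoted in the text), we drop every edge with score 4 or more and compute the connected components of what remains:

```python
from collections import defaultdict

# Betweenness scores from Fig. 11.3, as stated in the text.
scores = {("A", "B"): 5, ("A", "C"): 1, ("B", "C"): 5, ("B", "D"): 12,
          ("D", "E"): 4.5, ("D", "F"): 4, ("D", "G"): 4.5,
          ("E", "F"): 1.5, ("F", "G"): 1.5}

# Remove every edge whose betweenness is 4 or more.
remaining = [e for e, s in scores.items() if s < 4]

# Connected components of the surviving graph, by simple traversal.
adj = defaultdict(set)
nodes = {v for e in scores for v in e}
for u, v in remaining:
    adj[u].add(v); adj[v].add(u)

seen, components = set(), []
for start in sorted(nodes):
    if start in seen:
        continue
    stack, comp = [start], set()
    while stack:
        v = stack.pop()
        if v in comp:
            continue
        comp.add(v)
        stack.extend(adj[v])
    seen |= comp
    components.append(sorted(comp))

print(components)  # [['A', 'C'], ['B'], ['D'], ['E', 'F', 'G']]
```

The output matches Fig. 11.4: the "traitors" B and D end up isolated, while {A, C} and {E, F, G} survive as components.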
Direct Discovery of Communities

Finding Cliques

Our first thought about how we could find sets of nodes with many edges between them is to start by
finding a large clique (a set of nodes with edges between any two of them). However, that task is not
easy. Not only is finding maximal cliques NP-complete, but it is among the hardest of the NP-complete
problems in the sense that even approximating the maximal clique is hard. Further, it is possible to have
a set of nodes with almost all edges between them, and yet have only relatively small cliques.

Example : Suppose our graph has nodes numbered 1, 2, . . . , n and there is an edge between two nodes
i and j unless i and j have the same remainder when divided by k. Then the fraction of possible edges
that are actually present is approximately (k − 1)/k. There are many cliques of size k, of which {1, 2, . . . ,
k} is but one example.

Yet there are no cliques larger than k. To see why, observe that any set of k + 1 nodes has two that
leave the same remainder when divided by k. This point is an application of the “pigeonhole principle.”
Since there are only k different remainders possible, we cannot have distinct remainders for each of k +
1 nodes. Thus, no set of k + 1 nodes can be a clique in this graph.
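This construction is easy to verify in code. A sketch with n = 100 and k = 5 (these particular values are my choice, not from the text):

```python
from itertools import combinations

n, k = 100, 5
nodes = range(1, n + 1)
# Edge between i and j unless they leave the same remainder mod k.
edges = {(i, j) for i, j in combinations(nodes, 2) if i % k != j % k}

# The fraction of possible edges present is close to (k - 1)/k = 0.8.
print(round(len(edges) / (n * (n - 1) // 2), 3))  # 0.808

# {1, 2, ..., k} is a clique: its nodes have pairwise-distinct remainders.
assert all((i, j) in edges for i, j in combinations(range(1, k + 1), 2))

# By the pigeonhole principle, any k + 1 nodes repeat a remainder, and
# nodes with equal remainders are not adjacent, so no (k + 1)-clique exists.
```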

Complete Bipartite Graphs

A complete bipartite graph consists of s nodes on one side and t nodes on the other side, with all st
possible edges between the nodes of one side and the other present. We denote this graph by Ks,t. You
should draw an analogy between complete bipartite graphs as subgraphs of general bipartite graphs and
cliques as subgraphs of general graphs. In fact, a clique of s nodes is often referred to as a complete
graph and denoted Ks, while a complete bipartite subgraph is sometimes called a bi-clique.

While it is not possible to guarantee that a graph with many edges necessarily has a large clique, it is
possible to guarantee that a bipartite graph with many edges has a large complete bipartite subgraph. We can
regard a complete bipartite subgraph (or a clique if we discovered a large one) as the nucleus of a
community and add to it nodes with many edges to existing members of the community. If the graph
itself is k-partite then we can take nodes of two types and the edges between them to form a bipartite
graph. In this bipartite graph, we can search for complete bipartite subgraphs as the nuclei of
communities.

However, we can also use complete bipartite subgraphs for community finding in ordinary graphs where
nodes all have the same type. Divide the nodes into two equal groups at random. If a community exists,
then we would expect about half its nodes to fall into each group, and we would expect that about half
its edges would go between groups. Thus, we still have a reasonable chance of identifying a large
complete bipartite subgraph in the community. To this nucleus we can add nodes from either of the two
groups, if they have edges to many of the nodes already identified as belonging to the community.
Finding Complete Bipartite Subgraphs

Suppose we are given a large bipartite graph G , and we want to find instances of Ks,t within it. It is
possible to view the problem of finding instances of Ks,t within G as one of finding frequent itemsets.
For this purpose, let the “items” be the nodes on one side of G, which we shall call the left side. We
assume that the instance of Ks,t we are looking for has t nodes on the left side, and we shall also assume
for efficiency that t ≤ s. The “baskets” correspond to the nodes on the other side of G (the right side).
The members of the basket for node v are the nodes of the left side to which v is connected. Finally, let
the support threshold be s, the number of nodes that the instance of Ks,t has on the right side.

We can now state the problem of finding instances of Ks,t as that of finding frequent itemsets F of size t.
That is, if a set of t nodes on the left side is frequent, then they all occur together in at least s baskets.
But the baskets are the nodes on the right side. Each basket corresponds to a node that is connected to
all t of the nodes in F. Thus, the frequent itemset of size t and s of the baskets in which all those items
appear form an instance of Ks,t.
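As a tiny illustration of this reduction, here is the itemset view on a hypothetical bipartite graph (the node names and the choice s = 3, t = 2 are mine, not from the text): we look for sets of t left-side nodes contained in at least s baskets.

```python
from itertools import combinations

# Hypothetical bipartite graph: each right-side node's basket is the set
# of left-side nodes it is connected to.
baskets = {
    "r1": {"a", "b", "c"},
    "r2": {"a", "b"},
    "r3": {"a", "b", "d"},
    "r4": {"c", "d"},
}
s, t = 3, 2  # seek K_{s,t}: t left nodes connected to at least s right nodes

left_nodes = set().union(*baskets.values())
instances = []
for itemset in combinations(sorted(left_nodes), t):
    support = [r for r, items in baskets.items() if set(itemset) <= items]
    if len(support) >= s:
        instances.append((itemset, sorted(support)))

print(instances)  # [(('a', 'b'), ['r1', 'r2', 'r3'])]
```

Here the frequent itemset {a, b} together with its three supporting baskets forms one instance of K3,2.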
SimRank

Random Walkers on a Social Graph

Recall what a “random surfer” would do walking on the Web graph. We can similarly think of a person
“walking” on a social network. The graph of a social network is generally undirected, while the Web
graph is directed. However, the difference is unimportant. A walker at a node N of an undirected graph
will move with equal probability to any of the neighbors of N (those nodes with which N shares an
edge).

Suppose, for example, that such a walker starts out at node T1 of Fig. 11.5 . At the first step, it would go
either to U1 or W1. If to W1, then it would next either come back to T1 or go to T2. If the walker first
moved to U1, it would wind up at either T1, T2, or T3 next.

Figure 11.5

We conclude that, starting at T1, there is a good chance the walker would visit T2, at least initially, and
that chance is better than the chance it would visit T3 or T4. It would be interesting if we could infer that
tags T1 and T2 are therefore related or similar in some way. The evidence is that they have both been
placed on a common Web page, W1, and they have also been used by a common tagger, U1.

However, if we allow the walker to continue traversing the graph at random, then the probability that
the walker will be at any particular node does not depend on where it starts out. This conclusion comes
from the theory of Markov processes although the independence from the starting point requires
additional conditions besides connectedness that the graph of Fig. 11.5 does satisfy.

Random Walks with Restart


Here, we take this idea to the extreme. As we are focused on one particular node N of a social network,
and want to see where the random walker winds up on short walks from that node, we modify the
matrix of transition probabilities to have a small additional probability of transitioning to N from any
node. Formally, let M be the transition matrix of the graph G. That is, the entry in row i and column j of
M is 1/k if node j of G has degree k, and one of the adjacent nodes is i. Otherwise, this entry is 0. We
shall discuss teleporting in a moment, but first, let us look at a simple example of a transition matrix.

Example : Figure 11.6 is an example of a very simple network involving three pictures, and two tags,
“Sky” and “Tree” that have been placed on some of them. Pictures 1 and 3 have both tags, while Picture
2 has only the tag “Sky.” Intuitively, we expect that Picture 3 is more similar to Picture 1 than Picture 2
is, and an analysis using a random walker with restart at Picture 1 will support that intuition.

Figure 11.6: A simple bipartite social graph

Let us order the nodes as Picture 1, Picture 2, Picture 3, Sky, Tree. Then the transition matrix for the
graph of Fig. 11.6 is

        |  0    0    0   1/3  1/2 |
        |  0    0    0   1/3   0  |
    M = |  0    0    0   1/3  1/2 |
        | 1/2   1   1/2   0    0  |
        | 1/2   0   1/2   0    0  |

For example, the fourth column corresponds to the node “Sky,” and this node connects to each of the
three picture nodes. It therefore has degree three, so the nonzero entries in its column must be 1/3. The
picture nodes correspond to the first three rows and columns, so the entry 1/3 appears in the first three
rows of column 4. Since the “Sky” node does not have an edge to either itself or the “Tree” node, the
entries in the last two rows of column 4 are 0.
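A sketch of the random walk with restart on this matrix. The restart (teleport) probability 1 − β = 0.2 is my assumption for illustration; iterating v ← βMv + (1 − β)e, where e is the unit vector at Picture 1, converges to the limiting distribution:

```python
# Node order: Picture 1, Picture 2, Picture 3, Sky, Tree.
# Column j holds the transition probabilities out of node j.
M = [[0,   0, 0,   1/3, 1/2],
     [0,   0, 0,   1/3, 0  ],
     [0,   0, 0,   1/3, 1/2],
     [1/2, 1, 1/2, 0,   0  ],
     [1/2, 0, 1/2, 0,   0  ]]

beta = 0.8           # assumed probability of taking a normal step
e = [1, 0, 0, 0, 0]  # the restart always returns to Picture 1
v = e[:]
for _ in range(200):  # iterate v <- beta*M*v + (1-beta)*e to convergence
    v = [beta * sum(M[i][j] * v[j] for j in range(5)) + (1 - beta) * e[i]
         for i in range(5)]

# Picture 3, which shares both tags with Picture 1, scores above Picture 2.
print(v[2] > v[1])  # True
```

This supports the intuition stated above: the walker restarted at Picture 1 visits Picture 3 more often than Picture 2.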
Counting triangles using MapReduce

Why Count Triangles?

If we start with n nodes and add m edges to a graph at random, there will be an expected number of
triangles in the graph. We can calculate this number without too much difficulty. There are (n choose 3),
or approximately n^3/6, sets of three nodes that might be a triangle. The probability of an edge between
any two given nodes being added is m/(n choose 2), or approximately 2m/n^2. The probability that any set
of three nodes has edges between each pair, if those edges are independently chosen to be present or
absent, is approximately (2m/n^2)^3 = 8m^3/n^6. Thus, the expected number of triangles in a graph of n
nodes and m randomly selected edges is approximately (8m^3/n^6)(n^3/6) = (4/3)(m/n)^3.

If a graph is a social network with n participants and m pairs of “friends,” we would expect the number
of triangles to be much greater than the value for a random graph. The reason is that if A and B are
friends, and A is also a friend of C, there should be a much greater chance than average that B and C are
also friends. Thus, counting the number of triangles helps us to measure the extent to which a graph
looks like a social network.

We can also look at communities within a social network. It has been demonstrated that the age of a
community is related to the density of triangles. That is, when a group has just formed, people pull in
their like-minded friends, but the number of triangles is relatively small. If A brings in friends B and C, it
may well be that B and C do not know each other. As the community matures, B and C may interact
because of their membership in the community. Thus, there is a good chance that at sometime the
triangle {A, B, C} will be completed.

An Algorithm for Finding Triangles

We shall begin our study with an algorithm that has the fastest possible running time on a single
processor. Suppose we have a graph of n nodes and m ≥ n edges. For convenience, assume the nodes
are integers 1, 2, . . . , n.

Call a node a heavy hitter if its degree is at least √m. A heavy-hitter triangle is a triangle all three of
whose nodes are heavy hitters. We use separate algorithms to count the heavy-hitter triangles and all
other triangles. Note that the number of heavy-hitter nodes is no more than 2√m, since otherwise the
sum of the degrees of the heavy-hitter nodes would be more than 2m. Since each edge contributes to
the degree of only two nodes, there would then have to be more than m edges.

Assuming the graph is represented by its edges, we preprocess the graph as follows:

1. Compute the degree of each node. This part requires only that we examine each edge and add 1
to the count of each of its two nodes. The total time required is O(m).
2. Create an index on edges, with the pair of nodes at its ends as the key. That is, the index allows
us to determine, given two nodes, whether the edge between them exists. A hash table suffices.
It can be constructed in O(m) time, and the expected time to answer a query about the
existence of an edge is a constant, at least in the expected sense.
3. Create another index of edges, this one with key equal to a single node. Given a node v, we can
retrieve the nodes adjacent to v in time proportional to the number of those nodes. Again, a
hash table, this time with single nodes as the key, suffices in the expected sense.

We shall order the nodes as follows. First, order nodes by degree. Then, if v and u have the same
degree, recall that both v and u are integers, so order them numerically. That is, we say v ≺ u if and only
if either

a) The degree of v is less than the degree of u, or

b) The degrees of u and v are the same, and v < u

Heavy-Hitter Triangles: There are only O(√m) heavy-hitter nodes, so we can consider all sets of three of
these nodes. There are O(m^(3/2)) possible heavy-hitter triangles, and using the index on edges we can
check whether all three edges exist in O(1) time. Therefore, O(m^(3/2)) time is needed to find all the
heavy-hitter triangles.

Other Triangles: We find the other triangles a different way. Consider each edge (v1, v2). If both v1 and
v2 are heavy hitters, ignore this edge. Suppose, however, that v1 is not a heavy hitter and moreover
v1 ≺ v2. Let u1, u2, . . . , uk be the nodes adjacent to v1. Note that k < √m. We can find these nodes, using
the index on nodes, in O(k) time, which is surely O(√m) time. For each ui we can use the first index to
check whether edge (ui, v2) exists in O(1) time. We can also determine the degree of ui in O(1) time,
because we have counted all the nodes’ degrees. We count the triangle {v1, v2, ui} if and only if the
edge (ui, v2) exists, and v1 ≺ ui. In that way, a triangle is counted only once – when v1 is the node of
the triangle that precedes both other nodes of the triangle according to the ≺ ordering. Thus, the time
to process all the nodes adjacent to v1 is O(√m). Since there are m edges, the total time spent counting
other triangles is O(m^(3/2)).

We now see that preprocessing takes O(m) time. The time to find heavy-hitter triangles is O(m^(3/2)), and
so is the time to find the other triangles. Thus, the total time of the algorithm is O(m^(3/2)).
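A sketch of the whole algorithm in Python. One detail is pinned down more strictly than in the prose: the code counts each triangle from the edge joining its two ≺-smallest nodes (requiring v2 ≺ ui), which makes the counted-only-once property immediate. The example graph is Fig. 11.1, reconstructed from the text.

```python
from itertools import combinations
from collections import defaultdict

def count_triangles(edge_list):
    """Count triangles via the heavy-hitter split; runs in O(m^(3/2))."""
    deg, adj, edges = defaultdict(int), defaultdict(set), set()
    for u, v in edge_list:
        e = frozenset((u, v))
        if u != v and e not in edges:
            edges.add(e)
            deg[u] += 1; deg[v] += 1
            adj[u].add(v); adj[v].add(u)
    m = len(edges)
    heavy = {v for v in deg if deg[v] * deg[v] >= m}  # degree >= sqrt(m)

    def key(v):  # the ordering "v ≺ u": by degree, ties broken by name
        return (deg[v], v)

    count = 0
    # Heavy-hitter triangles: test every triple of heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if {frozenset((a, b)), frozenset((b, c)),
            frozenset((a, c))} <= edges:
            count += 1
    # Other triangles: for each edge whose ≺-smaller end v1 is not a heavy
    # hitter, scan v1's (< sqrt(m)) neighbors for a third node above v2.
    for e in edges:
        v1, v2 = sorted(e, key=key)
        if v1 in heavy and v2 in heavy:
            continue
        for u in adj[v1]:
            if u != v2 and frozenset((u, v2)) in edges and key(u) > key(v2):
                count += 1
    return count

fig_11_1 = [("A","B"), ("A","C"), ("B","C"), ("B","D"), ("D","E"),
            ("D","F"), ("D","G"), ("E","F"), ("F","G")]
print(count_triangles(fig_11_1))  # 3: {A,B,C}, {D,E,F}, {D,F,G}
```

The hash-set membership tests play the role of the two indexes described in the preprocessing steps.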
