
Efficient Aggregation for Graph Summarization

Yuanyuan Tian (University of Michigan, Ann Arbor, MI, USA)
Richard A. Hankins (Nokia Research Center, Palo Alto, CA, USA)
Jignesh M. Patel (University of Michigan, Ann Arbor, MI, USA)

ABSTRACT

Graphs are widely used to model real world objects and their relationships, and large graph datasets are common in many application domains. To understand the underlying characteristics of large graphs, graph summarization techniques are critical. However, existing graph summarization methods are mostly statistical (studying statistics such as degree distributions, hop-plots and clustering coefficients). These statistical methods are very useful, but the resolutions of the summaries are hard to control.

In this paper, we introduce two database-style operations to summarize graphs. Like the OLAP-style aggregation methods that allow users to drill-down or roll-up to control the resolution of summarization, our methods provide an analogous functionality for large graph datasets. The first operation, called SNAP, produces a summary graph by grouping nodes based on user-selected node attributes and relationships. The second operation, called k-SNAP, further allows users to control the resolutions of summaries and provides the "drill-down" and "roll-up" abilities to navigate through summaries with different resolutions. We propose an efficient algorithm to evaluate the SNAP operation. In addition, we prove that the k-SNAP computation is NP-complete. We propose two heuristic methods to approximate the k-SNAP results. Through extensive experiments on a variety of real and synthetic datasets, we demonstrate the effectiveness and efficiency of the proposed methods.

Categories and Subject Descriptors
H.2.4 [Systems]: Query Processing; H.2.8 [Database Applications]: Data Mining

General Terms
Algorithms, Experimentation, Performance

Keywords
Graphs, Social Networks, Summarization, Aggregation

SIGMOD'08, June 9-12, 2008, Vancouver, BC, Canada. Copyright 2008 ACM 978-1-60558-102-6/08/06.

1. INTRODUCTION

Graphs provide a powerful primitive for modeling data in a variety of applications. Nodes in graphs usually represent real world objects, and edges indicate relationships between objects. Examples of data modeled as graphs include social networks, biological networks, and dynamic network traffic graphs. Often, nodes have attributes associated with them. For example, in Figure 1(a), a node representing a student may have attributes such as gender and department. In addition, a graph may contain many different types of relationships, such as the friends and classmates relationships shown in Figure 1(a).

In many applications, graphs are very large, with thousands or even millions of nodes and edges. As a result, it is almost impossible to understand the information encoded in large graphs by mere visual inspection. Therefore, effective graph summarization methods are required to help users extract and understand the underlying information.

Most existing graph summarization methods use simple statistics to describe graph characteristics [6, 7, 13]; for example, researchers plot degree distributions to investigate the scale-free property of graphs, employ hop-plots to study the small-world effect, and utilize clustering coefficients to measure the "clumpiness" of large graphs. While these methods are useful, the summaries contain limited information and can be difficult to interpret and manipulate. Methods that mine graphs for frequent patterns [11, 19, 20, 23] are also employed to understand the characteristics of large graphs. However, these algorithms often produce a large number of results that can easily overwhelm the user. Graph partitioning algorithms [14, 18, 22] have been used to detect community structures (dense subgraphs) in large networks. However, the community detection is based purely on node connectivities, and the attributes of nodes are largely ignored. Graph drawing techniques [3, 10] can help one better visualize graphs, but visualizing large graphs quickly becomes overwhelming.

What users need is a more controlled and intuitive method for summarizing graphs. The summarization method should allow users to freely choose the attributes and relationships that are of interest, and then make use of these features to produce small and informative summaries. Furthermore, users should be able to control the resolution of the resulting summaries and "drill-down" or "roll-up" the information, just like the OLAP-style aggregation methods in traditional database systems.

In this paper, we propose two operations for graph summarization that fulfill these requirements.
Figure 1: Graph Summarization by Aggregation
Figure 2: Illustration of Multi-resolution Summaries
Figure 3: Construction of Φ3 in the Proof of Theorem 2.4

The first operation, called SNAP (Summarization by Grouping Nodes on Attributes and Pairwise Relationships), produces a summary graph of the input graph by grouping nodes based on user-selected node attributes and relationships. Figure 1 illustrates the SNAP operation. Figure 1(a) is a graph about students (with attributes: gender, department and so on) and the relationships (classmates and friends) between them. Note that only a few of the edges are shown in Figure 1(a). Based on the user-selected gender and department attributes, and the classmates and friends relationships, the SNAP operation produces the summary graph shown in Figure 1(b). This summary contains four groups of students and the relationships between these groups. Students in each group have the same gender and are in the same department, and they relate to students belonging to the same set of groups with friends and classmates relationships. For example, in Figure 1(b), each student in group G1 has at least a friend and a classmate in group G2. This compact summary reveals the underlying characteristics of the nodes and their relationships in the original graph.

The second operation, called k-SNAP, further allows users to control the resolutions of summaries. This operation is pictorially depicted in Figure 2. Here, using the slider, a user can "drill-down" to a larger summary with more details or "roll-up" to a smaller summary with fewer details.

Our summarization methods have been applied to analyze real social networking applications. In one example, by summarizing the coauthorship graphs of the database and AI communities, different coauthorship patterns across the two areas are revealed. In another application, interesting linking behaviors among liberal and conservative blogs are discovered by summarizing a large political blogs network.

The main contributions of this paper are:
(1) We introduce two database-style graph aggregation operations, SNAP and k-SNAP, for summarizing large graphs. We formally define the two operations, and prove that the k-SNAP computation is NP-complete.
(2) We propose an efficient algorithm to evaluate the SNAP operation, and also propose two heuristic methods (the top-down approach and the bottom-up approach) to approximately evaluate the k-SNAP operation.
(3) We apply our graph summarization methods to a variety of real and synthetic datasets. Through extensive experimental evaluation, we demonstrate that our methods produce meaningful summaries. We also show that the top-down approach is the ideal choice for k-SNAP evaluation in practice. In addition, the evaluation algorithms are very efficient even for very large graph datasets.

The remainder of this paper is organized as follows: Section 2 defines the SNAP and the k-SNAP operations. Section 3 introduces the evaluation algorithms for these operations. Experimental results are presented in Section 4. Section 5 describes related work, and Section 6 contains our concluding remarks.

2. GRAPH AGGREGATION OPERATIONS

In a graph, objects are represented by nodes, and relationships between objects are modeled as edges. In this paper, we support a general graph model, where objects (nodes) have associated attributes and different types of relationships (edges). Formally, we denote a graph G as (V, Υ), where V is the set of nodes, and Υ = {E1, E2, ..., Er} is the set of edge types, with each Ei ⊆ V × V representing the set of edges of a particular type.

Nodes in a graph have a set of associated attributes, which is denoted as Λ = {a1, a2, ..., at}. Each node has a value for each attribute. These attributes are used to describe the features of the objects that the nodes represent. For example, in Figure 1(a), a node representing a student may have attributes that represent the student's gender and department. Different types of edges in a graph correspond to different types of relationships between nodes, such as the friends and classmates relationships shown in Figure 1(a). Note that two nodes can be connected by different types of edges. For example, in Figure 1(a), two students can be classmates and friends at the same time.

For ease of presentation, we denote the set of nodes of graph G as V(G), the set of attributes as Λ(G), the actual value of attribute ai for node v as ai(v), the set of edge types as Υ(G), and the set of edges of type Ei as Ei(G). In addition, we denote the cardinality of a set S as |S|.

Our methods are applicable to both directed and undirected graphs. For ease of presentation, we only consider undirected graphs in this paper. Adaptations of our method for directed graphs are fairly straightforward, and are omitted in the interest of space.
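As an aside, the graph model just described (attributed nodes plus one edge set per relationship type) maps directly onto basic containers. The sketch below is illustrative only and uses our own names (Graph, add_node, add_edge); it assumes undirected edges stored as unordered pairs, matching the undirected setting above.

```python
from collections import defaultdict

class Graph:
    """A graph G = (V, Upsilon): nodes with attribute values, plus one
    edge set per relationship type (e.g. 'friends', 'classmates')."""

    def __init__(self):
        self.attrs = {}                    # node id -> {attribute name: value}
        self.edges = defaultdict(set)      # edge type -> set of frozenset({u, v})

    def add_node(self, v, **attributes):
        self.attrs[v] = dict(attributes)

    def add_edge(self, u, v, edge_type):
        self.edges[edge_type].add(frozenset((u, v)))   # undirected edge

    def neighbors(self, v, edge_type):
        """All nodes adjacent to v via edges of the given type."""
        return {next(iter(e - {v})) for e in self.edges[edge_type] if v in e}

# A tiny example in the spirit of Figure 1(a)
g = Graph()
g.add_node("s1", gender="F", department="CS")
g.add_node("s2", gender="M", department="CS")
g.add_edge("s1", "s2", "classmates")
g.add_edge("s1", "s2", "friends")
```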
2.1 SNAP Operation

The SNAP operation produces a summary graph through a homogeneous grouping of the input graph's nodes, based on user-selected node attributes and relationships. We now formally define this operation.

To begin the formal definition of the SNAP operation, we first define the concept of node-grouping.

Definition 2.1. (Node-Grouping of a Graph) For a graph G, Φ = {G1, G2, ..., Gk} is called a node-grouping of G, if and only if:
(1) ∀Gi ∈ Φ, Gi ⊆ V(G) and Gi ≠ ∅;
(2) ∪_{Gi ∈ Φ} Gi = V(G);
(3) ∀Gi, Gj ∈ Φ with i ≠ j, Gi ∩ Gj = ∅.

Intuitively, a node-grouping partitions the nodes in a graph into non-overlapping subsets. Each subset Gi is called a group. When there is no ambiguity, we simply call a node-grouping a grouping. For a given grouping Φ of G, the group that node v belongs to is denoted as Φ(v). We further define the size of a grouping as the number of groups it contains.

Now, we define a partial order relation ≼ on the set of all groupings of a graph.

Definition 2.2. (Dominance Relation) For a graph G, the grouping Φ dominates the grouping Φ′, denoted as Φ′ ≼ Φ, if and only if ∀G′i ∈ Φ′, ∃Gj ∈ Φ s.t. G′i ⊆ Gj.

It is easy to see that the dominance relation ≼ is reflexive, anti-symmetric and transitive, hence it is a partial order relation. Next, we define a special kind of grouping based on a set of user-selected attributes.

Definition 2.3. (Attributes Compatible Grouping) For a set of attributes A ⊆ Λ(G), a grouping Φ is compatible with attributes A, or simply A-compatible, if it satisfies the following: ∀u, v ∈ V, if Φ(u) = Φ(v), then ∀ai ∈ A, ai(u) = ai(v).

If a grouping Φ is compatible with A, we simply denote it as ΦA. In each group of an A-compatible grouping, every node has exactly the same values for the set of attributes A. Note that there could be more than one grouping compatible with A. In fact, the trivial grouping in which each node is its own group is always compatible with any set of attributes.

Next, we prove that amongst all the A-compatible groupings of a graph, there is a global maximum grouping with respect to the dominance relation ≼.

Theorem 2.4. In the set of all the A-compatible groupings of a graph G, denoted as SA, ∃ΦA ∈ SA s.t. ∀Φ′A ∈ SA, Φ′A ≼ ΦA.

Proof. We prove by contradiction. Assume that there is no global maximum A-compatible grouping, but more than one maximal grouping. Then, for every two such maximal groupings Φ1 and Φ2, we will construct a new A-compatible grouping Φ3 such that Φ1 ≼ Φ3 and Φ2 ≼ Φ3, which contradicts the assumption that Φ1 and Φ2 are maximal A-compatible groupings.

Assume that Φ1 = {G11, G12, ..., G1s} and Φ2 = {G21, G22, ..., G2t}. We construct a bipartite graph on Φ1 ∪ Φ2 as shown in Figure 3. The nodes in the bipartite graph are the groups from Φ1 and Φ2, and there is an edge between G1i ∈ Φ1 and G2j ∈ Φ2 if and only if G1i ∩ G2j ≠ ∅. After constructing the bipartite graph, we decompose this graph into connected components C1, C2, ..., Cm. For each connected component Ck, we union the groups inside this component and get a group ∪(Ck). Now, we can construct a new grouping Φ3 = {∪(C1), ∪(C2), ..., ∪(Cm)}. It is easy to see that Φ1 ≼ Φ3 and Φ2 ≼ Φ3. Now we prove that Φ3 is compatible with A. From the definition of A-compatible groupings, if G1i ∩ G2j ≠ ∅, the nodes in G1i ∪ G2j all have the same attribute values. Therefore, every node in ∪(Ck) has the same attribute values. Thus, we have constructed a new A-compatible grouping Φ3 such that Φ1 ≼ Φ3 and Φ2 ≼ Φ3. This contradicts our assumption that Φ1 and Φ2 are two different maximal A-compatible groupings. Therefore, there is a global maximum A-compatible grouping.

We denote this global maximum A-compatible grouping as Φmax_A. Φmax_A is also the A-compatible grouping with the minimum cardinality. In fact, if we consider each node in a graph as a data record, then Φmax_A is very much like the result of a group-by operation over these data records on the attributes A in a relational database system.

The A-compatible groupings only account for the node attributes. However, nodes do not just have attributes; they also participate in pairwise relationships represented by the edges. Next, we consider relationships when grouping nodes. For a grouping Φ, we denote the neighbor-groups of node v in Ei as NeighborGroups_{Φ,Ei}(v) = {Φ(u) | (u, v) ∈ Ei}. Now we define groupings compatible with both node attributes and relationships.

Definition 2.5. (Attributes and Relationships Compatible Grouping) For a set of attributes A ⊆ Λ(G) and a set of relationship types R ⊆ Υ(G), a grouping Φ is compatible with attributes A and relationship types R, or simply (A, R)-compatible, if it satisfies the following:
(1) Φ is A-compatible;
(2) ∀u, v ∈ V(G), if Φ(u) = Φ(v), then ∀Ei ∈ R, NeighborGroups_{Φ,Ei}(u) = NeighborGroups_{Φ,Ei}(v).

If a grouping Φ is compatible with A and R, we also denote it as Φ(A,R). In each group of an (A, R)-compatible grouping, all the nodes are homogeneous in terms of both the attributes A and the relationships in R. In other words, every node inside a group has exactly the same values for attributes A, and is adjacent to nodes in the same set of groups for all the relationships in R.

As an example, assume that the summary in Figure 1(b) is a grouping compatible with the gender and department attributes, and the classmates and friends relationships. Then every student (node) in group G2 has the same gender and department attribute values, and is a friend of some student(s) in G3, a classmate of some student(s) in G4, and a friend to some student(s) as well as a classmate to some student(s) in G1.

Given a grouping Φ(A,R), we can infer relationships between groups from the relationships between nodes in R. For each edge type Ei ∈ R, we define the corresponding group relationships as Ei(G, Φ(A,R)) = {(Gi, Gj) | Gi, Gj ∈ Φ(A,R) and ∃u ∈ Gi, v ∈ Gj s.t. (u, v) ∈ Ei}. In fact, by the definition of (A, R)-compatible groupings, if there is one node in a group adjacent to some node(s) in the other group, then every node in the first group is adjacent to some node(s) in the second.

Similar to attributes compatible groupings, there could be more than one grouping compatible with the given attributes and relationships. The grouping in which each node forms its own group is always compatible with any given attributes and relationships.

Next, we prove that among all the (A, R)-compatible groupings there is a global maximum grouping with respect to the dominance relation ≼.

Theorem 2.6. In the set of all the (A, R)-compatible groupings of a graph G, denoted as S(A,R), ∃Φ(A,R) ∈ S(A,R) s.t. ∀Φ′(A,R) ∈ S(A,R), Φ′(A,R) ≼ Φ(A,R).

Proof. Again we prove by contradiction. Assume that there is no global maximum (A, R)-compatible grouping, but more than one maximal grouping. Then, for every two such maximal groupings Φ1 and Φ2, we use the same con-
struction method to construct Φ3 as in the proof of Theorem 2.4. We already know that Φ3 is A-compatible, Φ1 ≼ Φ3 and Φ2 ≼ Φ3. Using similar arguments as in Theorem 2.4, we can also prove that Φ3 is compatible with R. This contradicts our assumption that Φ1 and Φ2 are two different maximal (A, R)-compatible groupings.

From the construction of Φ3, we know that if G1i ∩ G2j ≠ ∅, then the nodes in G1i ∪ G2j belong to the same group in Φ3. Next, we prove that every node in G1i ∪ G2j is also adjacent to nodes in the same set of groups in Φ3.

Again we prove by contradiction. Assume that there are two nodes u, v ∈ G1i ∪ G2j such that u is adjacent to ∪(Ck) in Φ3 but v is not. First, if both u, v ∈ G1i or both u, v ∈ G2j, then, as both Φ1 and Φ2 are (A, R)-compatible groupings and the construction of Φ3 does not decompose any groups in Φ1 or Φ2, u and v should always be adjacent to the same set of groups in Φ3. This contradicts our assumption. Second, the two nodes can come from different groupings. For simplicity, assume u ∈ G1i and v ∈ G2j. As G1i ∩ G2j ≠ ∅, a node w ∈ G1i ∩ G2j is adjacent to the same set of groups as u in Φ1 and adjacent to the same set of groups as v in Φ2. As a result, every group that u is adjacent to in Φ1 must intersect with some group that v is adjacent to in Φ2. Since u is adjacent to ∪(Ck), u must be adjacent to at least one group in Φ1 that is later merged into ∪(Ck). This group must also intersect with a group G2l in Φ2 that v is adjacent to. Then, by the construction algorithm of Φ3, G2l belongs to the connected component Ck, and thus is later merged into ∪(Ck). As a result, v is also adjacent to ∪(Ck) in Φ3, which contradicts our assumption.

Now we know that if Gi ∩ Gj ≠ ∅, the nodes in Gi ∪ Gj are all adjacent to the same set of groups in Φ3. In each Ck, ∀Gi ∈ Ck, ∃Gj ∈ Ck such that Gi ∩ Gj ≠ ∅. As a result, every node in ∪(Ck) is adjacent to the same set of groups in Φ3.

We have thus constructed a new (A, R)-compatible grouping Φ3 such that Φ1 ≼ Φ3 and Φ2 ≼ Φ3. This contradicts the fact that Φ1 and Φ2 are two different maximal (A, R)-compatible groupings. Therefore, there is a global maximum (A, R)-compatible grouping.

We denote the global maximum (A, R)-compatible grouping as Φmax_(A,R). Φmax_(A,R) is also the (A, R)-compatible grouping with the minimum cardinality. Due to its compactness, this maximum grouping is more useful than other (A, R)-compatible groupings.

Now, we define our first operation for graph summarization, namely SNAP.

Definition 2.7. (SNAP Operation) The SNAP operation takes as input a graph G, a set of attributes A ⊆ Λ(G), and a set of edge types R ⊆ Υ(G), and produces a summary graph Gsnap, where V(Gsnap) = Φmax_(A,R) and Υ(Gsnap) = {Ei(G, Φmax_(A,R)) | Ei ∈ R}.

Intuitively, the SNAP operation produces a summary graph of the input graph based on user-selected attributes and relationships. The nodes of this summary graph correspond to the groups in the maximum (A, R)-compatible grouping, and the edges of this summary graph are the group relationships inferred from the node relationships in R.

2.2 k-SNAP Operation

The SNAP operation produces a grouping in which the nodes of each group are homogeneous with respect to the user-selected attributes and relationships. Unfortunately, homogeneity is often too restrictive in practice, as most real life graph data is subject to noise and uncertainty; for example, some edges may be missing because of failures in the detection process, and some edges may be spurious because of errors. Applying the SNAP operation on noisy data can result in a large number of small groups, and, in the worst case, each node may end up in an individual group. Such a large summary is not very useful in practice. A better alternative is to let users control the sizes of the results to get summaries with resolutions that they can manage (as shown in Figure 2). Therefore, we introduce a second operation, called k-SNAP, which relaxes the homogeneity requirement for the relationships and allows users to control the sizes of the summaries.

The relaxation of the homogeneity requirement for the relationships is based on the following observation. For each pair of groups in the result of the SNAP operation, if there is a group relationship between the two, then every node in both groups participates in this group relationship. In other words, every node in one group relates to some node(s) in the other group. On the other hand, if there is no group relationship between two groups, then absolutely no relationship connects any nodes across the two groups. However, in reality, if most (not all) nodes in the two groups participate in the group relationship, it is often a good indication of a strong relationship between the two groups. Likewise, it is intuitive to mark two groups as being weakly related if only a tiny fraction of nodes are connected between these groups.

Based on these observations, we relax the homogeneity requirement for the relationships by not requiring that every node participate in a group relationship. But we still maintain the homogeneity requirement for the attributes, i.e. all the groupings should be compatible with the given attributes. Users control how many groups are present in the summary by specifying the required number of groups, denoted as k. There are many different groupings of size k compatible with the attributes, thus we need to measure the qualities of the different groupings. We propose the ∆-measure to assess the quality of an A-compatible grouping by examining how different it is from a hypothetical (A, R)-compatible grouping.

We first define the set of nodes in group Gi that participate in a group relationship (Gi, Gj) of type Et as P_{Et,Gj}(Gi) = {u | u ∈ Gi and ∃v ∈ Gj s.t. (u, v) ∈ Et}. Then we define the participation ratio of the group relationship (Gi, Gj) of type Et as p^t_{i,j} = (|P_{Et,Gj}(Gi)| + |P_{Et,Gi}(Gj)|) / (|Gi| + |Gj|). For a group relationship, if its participation ratio is greater than 50%, we call it a strong group relationship; otherwise, we call it a weak group relationship. Note that in an (A, R)-compatible grouping, the participation ratios are either 0% or 100%.

Given a graph G, a set of attributes A and a set of relationship types R, the ∆-measure of ΦA = {G1, G2, ..., Gk} is defined as follows:

∆(ΦA) = Σ_{Gi,Gj ∈ ΦA} Σ_{Et ∈ R} ( δ_{Et,Gj}(Gi) + δ_{Et,Gi}(Gj) )    (1)

δ_{Et,Gj}(Gi) = |P_{Et,Gj}(Gi)| if p^t_{i,j} ≤ 0.5, and |Gi| − |P_{Et,Gj}(Gi)| otherwise    (2)

Intuitively, the ∆-measure counts the minimum number of differences in participations of group relationships between
the given A-compatible grouping and a hypothetical (A, R)-
compatible grouping of the same size. The measure looks at
each pairwise group relationship: If this group relationship
is weak (p^t_{i,k} ≤ 0.5), then it counts the participation differences between this weak relationship and a non-relationship (p^t_{i,k} = 0); on the other hand, if the group relationship is
strong, it counts the differences between this strong rela-
tionship and a 100% participation-ratio group relationship.
The δ function, defined in Equation 2, evaluates the part
of the ∆ value contributed by a group Gi with one of its
neighbors Gj in a group relationship of type Et .
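The participation ratio and the δ and ∆ quantities of Equations 1 and 2 transcribe directly into code. The sketch below is illustrative (quadratic in the number of groups); groups is a list of disjoint node sets, graph is assumed to expose neighbors(v, edge_type) as in the earlier model sketch, and the pair sum is taken over unordered pairs of distinct groups.

```python
from itertools import combinations

def participants(gi, gj, graph, edge_type):
    """P_{Et,Gj}(Gi): nodes of Gi with at least one Et-neighbor in Gj."""
    return {u for u in gi if graph.neighbors(u, edge_type) & gj}

def delta_measure(groups, graph, edge_types):
    """Delta(Phi_A) as in Equations 1 and 2 (pairs of distinct groups)."""
    total = 0
    for gi, gj in combinations(groups, 2):
        for et in edge_types:
            p_ij = participants(gi, gj, graph, et)
            p_ji = participants(gj, gi, graph, et)
            ratio = (len(p_ij) + len(p_ji)) / (len(gi) + len(gj))
            if ratio <= 0.5:
                # weak relationship: count the participating nodes
                total += len(p_ij) + len(p_ji)
            else:
                # strong relationship: count the non-participating nodes
                total += (len(gi) - len(p_ij)) + (len(gj) - len(p_ji))
    return total
```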
It is easy to prove that ∆(ΦA) ≥ 0. The smaller the ∆(ΦA) value is, the closer ΦA is to a hypothetical (A, R)-compatible grouping. ∆(ΦA) = 0 if and only if ΦA is (A, R)-compatible. We can also prove that ∆(ΦA) is bounded by 2|ΦA||V||R|, as each δ_{Et,Gj}(Gi) ≤ |Gi|.

Figure 4: Data Structures Used in the Evaluation Algorithms

therefore, we propose two heuristic algorithms for efficiently
Now we will formally define the k-SNAP operation. approximating the solution. Before discussing the details of
Definition 2.8. (k-SNAP Operation) The k-SNAP op- the algorithms, we first introduce the evaluation architec-
eration takes as input a graph G, a set of attributes A ⊆ ture and the data structures used for the algorithms. Note
Λ(G), a set of edge types R ⊆ Υ(G) and the desired number that, for ease of presentation, all algorithms discussed in this
of groups k, and produces a summary graph Gk-snap , where section are assumed to work on one type of relationship; ex-
V (Gk-snap ) = ΦA , s.t. |ΦA | = k and ΦA = arg minΦ0A {∆(Φ0A )}, tending these algorithms for multiple relationship types is
and Υ(Gk-snap ) = {Ei (G, ΦA ) | Ei ∈ R}. straightforward, hence is omitted in the interest of space.
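Definition 2.8 is an optimization problem rather than an algorithm: among all A-compatible groupings of size k, pick one with the smallest ∆ value. The snippet below merely restates that objective using the delta_measure sketch above; it assumes the candidate groupings are supplied (enumerating them is exponential, which is why Section 3 resorts to heuristics), and the function name is ours.

```python
def k_snap_objective(candidate_groupings, graph, edge_types, k):
    """Reference formulation of Definition 2.8: among the supplied
    A-compatible groupings, keep those of size k and return one that
    minimizes Delta.  Assumes at least one candidate has size k."""
    feasible = [g for g in candidate_groupings if len(g) == k]
    return min(feasible, key=lambda g: delta_measure(g, graph, edge_types))
```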

Given the desired number of groups k, the k-SNAP opera- 3.1 Architecture and Data Structures
tion produces an A-compatible grouping with the minimum
All the evaluation algorithms employ an architecture as
∆ value. Unfortunately, as we prove below, this optimiza-
follows. The input graphs reside in the database on disk.
tion problem is NP-complete. To prove this, we first formally
A chunk of memory is allocated as a buffer pool for the
define the decision problem associated with this optimiza-
database. It is used to buffer actively used content from
tion problem and then prove it to be NP-complete.
disk to speed up the accesses. Every access of the eval-
Theorem 2.9. Given a graph G, a set of attributes A, a uation algorithms to the nodes and edges of graphs goes
set of relationship types R, a user-specified number of groups through the buffer pool. If the content is buffered, then the
k (|Φmax | ≤ k ≤ |V (G)|), and a real number D (0 ≤ D < evaluation algorithms simply retrieve the content from the
A
2k|V ||R|), the problem of finding an A-compatible grouping buffer pool; otherwise, the content is read from disk into the
ΦA of size k with ∆(ΦA ) ≤ D is NP-complete. buffer pool first. Another chunk of memory is allocated for
the evaluation algorithms as the working memory, similar to
Proof. We use proof by restriction to prove the NP- the working memory space used by traditional database al-
completeness of this problem. gorithms such as hash join. This working memory is used to
(1) This problem is in NP, because a nondeterministic hold the data structures used in the evaluation algorithms.
algorithm only needs to guess an A-compatible grouping ΦA The evaluation algorithms share some common data struc-
of size k and check in polynomial time that ∆(ΦA ) ≤ D. tures as shown in Figure 4. The group-array data structure
And an A-compatible grouping ΦA of size k can be generated keeps track of the groups in the current grouping. Each en-
by a polynomial time algorithm. try in groups-array stores the id of a group and also points
(2) This problem contains a known NP-complete problem to a node-set, which contains the nodes in the correspond-
2-Role Assignability (2RA) [16] as a special case. By re- ing group. Each node in the node-set points to one row of
stricting A = ∅, |R| = 1, k = 2 and D = 0, this problem the neighbor-groups bitmap. This bitmap is the most mem-
becomes 2RA (which decides whether the nodes in a graph ory consuming data structure in the evaluation algorithms.
can be assigned with 2 roles, each node with one of the roles, Each row of the bitmap corresponds to a node, and the bits
such that if two nodes are assigned with the same role, then in the row store the node’s neighbor-groups. If bit position
the sets of roles assigned to their neighbors are the same.) i is 1, then we know that this node has at least one neighbor
As proved in [16], 2RA is NP-complete. belonging to group Gi with id i; otherwise, this node has
no neighbor in group Gi . For each group Gi in the current
Given the NP-completeness, it is infeasible to find the
grouping, we also keep a participation-array which stores
exact optimal answers for the k-SNAP operation. Therefore,
the participation counts |PE,Gj (Gi )| for each neighbor group.
we propose two heuristic algorithms to evaluate the k-SNAP
Note that the participation-array of a group can be inferred
operation approximately.
from the nodes’ corresponding rows in the neighbor-groups
bitmap. For example, in Figure 4, the participation-array of
3. EVALUATION ALGORITHMS group G1 can be computed by counting the number of 1s in
In this section, we introduce the evaluation algorithms each column of the bitmap rows corresponding to n12 , n4 ,
for SNAP and k-SNAP. It is computationally feasible to ex- n9 and n2 . All the data structures shown in Figure 4 change
actly evaluate the SNAP operation, hence the proposed eval- dynamically during the evaluation algorithms. An increase
uation algorithm produces exact summaries. In contrast, in the number of groups leads to the growth of the group-
k-SNAP computation was proved to be NP-complete and, array size, which also results in an increase of the width of
Algorithm 1 SNAP(G, A, R)
Input: G: a graph; A ⊆ Λ(G): a set of attributes; R = {E} ⊆ Υ(G): a set containing one relationship type E
Output: A summary graph.
1: Compute the maximum A-compatible grouping by sorting nodes in G based on values of attributes A
2: Initialize the data structures
3: while there is a group Gi with participation array containing values other than 0 or |Gi| do
4:   Divide Gi into subgroups by sorting nodes in Gi based on their corresponding rows in the bitmap
5:   Update the data structures
6: end while
7: Form the summary graph Gsnap
8: return Gsnap

Algorithm 2 k-SNAP-Top-Down(G, A, R, k)
Input: G: a graph; A ⊆ Λ(G): a set of attributes; R = {E} ⊆ Υ(G): a set containing one relationship type E; k: the required number of groups in the summary
Output: A summary graph.
1: Compute the maximum A-compatible grouping by sorting nodes in G based on values of attributes A
2: Initialize the data structures and let Φc denote the current grouping
3: SplitGroups(G, A, R, k, Φc)
4: Form the summary graph Gk-snap
5: return Gk-snap
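The following is a compact, unoptimized sketch of Algorithm 1's split loop. It replaces the paper's bitmap and participation arrays with per-node neighbor-group signatures recomputed each round, so it is far slower than the described implementation (C++ over PostgreSQL), but it shows the fixpoint: starting from the attribute group-by, keep splitting groups until every group is homogeneous in its neighbor-groups. Function and variable names are ours.

```python
def snap(graph, attributes, edge_types):
    """Compute the maximum (A, R)-compatible grouping by iterative splitting."""
    # Step 1: maximum A-compatible grouping (a group-by on attribute values).
    by_attrs = {}
    for v, attrs in graph.attrs.items():
        key = tuple(attrs[a] for a in attributes)
        by_attrs.setdefault(key, set()).add(v)
    groups = list(by_attrs.values())

    # Steps 3-6: split any group whose nodes disagree on their neighbor-groups.
    changed = True
    while changed:
        changed = False
        group_of = {v: i for i, g in enumerate(groups) for v in g}
        new_groups = []
        for g in groups:
            def signature(v):
                # Which groups v touches, per selected edge type.
                return tuple(
                    frozenset(group_of[u] for u in graph.neighbors(v, t))
                    for t in edge_types)
            buckets = {}
            for v in g:
                buckets.setdefault(signature(v), set()).add(v)
            if len(buckets) > 1:
                changed = True
            new_groups.extend(buckets.values())
        groups = new_groups
    return groups
```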
nodes are changed, the bitmap entries for them and their
neighbor nodes have to be updated. Then the algorithm
goes to the next iteration. This process continues until the
the bitmap, as well as the sizes of the participation-arrays.
condition in line 3 does not hold anymore.
The set of nodes for a group also change dynamically.
It can be easily verified that the grouping produced by Al-
For most of this paper, we will assume that all the data
gorithm 1 is the maximum (A, R)-compatible grouping. The
structures needed by the evaluation algorithms can fit in the
algorithm starts from the maximum A-compatible group-
working memory. This is often a reasonable assumption in
ing, and it only splits existing groups, so the grouping after
practice for a majority of graph datasets. However, we also
each iteration is guaranteed to be A-compatible. In addi-
consider the case when this memory assumption does not
tion, each time we split a group, we always keep nodes with
hold (see Section 4.4.3).
same neighbor-groups together. Therefore, when the algo-
rithm stops, the grouping should be the maximum (A, R)-
3.2 Evaluating SNAP Operation compatible grouping.
In this section, we introduce the evaluation algorithm for After we get the maximum (A, R)-compatible grouping,
the SNAP operation. This algorithm also serves as a foun- we can construct the summary graph. The nodes in the sum-
dation for the two k-SNAP evaluation algorithms. mary graph corresponds to the groups in the result grouping.
The SNAP operation tries to find the maximum (A, R)- The edges in the summary graph are the group relationships
compatible grouping for a graph, a set of nodes attributes, inferred from the node relationships in the original graph.
and the specified relationship type. The evaluation algo- Now we will analyze the complexity of this evaluation al-
rithm starts from the maximum A-compatible grouping, and gorithm. Sorting by the attributes values takes O(|V | log |V |)
iteratively splits groups in the current grouping, until the time, assuming the number of attributes is a small constant.
grouping is also compatible with the relationships. The initialization of the data structures takes O(|E|) time,
The algorithm for evaluating the SNAP operation is shown where E is the only edge type in R (for simplicity, we only
in Algorithm 1. In the first step, the algorithm groups the consider one edge type in our algorithms). At each itera-
nodes based only on the attributes by a sorting on the at- tion of the while loop, the radix sort takes O(ki |Gi |) time,
tributes values. Then the data structures are initialized by where ki is the number of groups in the current group-
this maximum A-compatible grouping. Note that if a group- ing and Gi is the group to be split. Updating the data
ing is compatible with the relationships, then all nodes in- structures takes |Edges(Gi )|, where Edges(Gi ) is the set of
side a group should have the same set of neighbor-groups, edges adjacent to nodes in Gi . Note that ki is monotoni-
which means that they have the same values in their rows cally increasing, and that the number of iterations is less
of the bitmap. In addition, the participation array of each than the size of the final grouping, denoted as k. There-
group should then only contain the values 0 or the size of the fore, the complexity for all the iterations is bounded by
group. This criterion has been used as the terminating con- O(k 2 |V | + k|E|). Constructing the summary takes O(k 2 )
dition to check whether the current grouping is compatible time. To sum up, the upper-bound complexity of the SNAP
with the relationships in line 3 of Algorithm 1. If there ex- algorithm is O(|V | log |V | + k 2 |V | + k|E|).
ists a group whose participation array contains values other As the evaluation algorithm takes inputs from the graph
than 0 or the size of this group, the nodes in this group database on disk, we also need to analyze the number of
are not homogeneous in terms of the relationships. We can disk accesses to the database. We assume all the accesses to
split this group into subgroups, each of which contains nodes the database are in the units of pages. For simplicity, we do
with the same set of neighbor-groups. This can be achieved not distinguish whether an access is a disk page access or a
by sorting the nodes based on their corresponding entries buffer pool page access. We assume that all the nodes infor-
in the bitmap. (The radix sort is a perfect candidate for mation of the input graph takes kV k pages in the database,
this task.) After this division, new groups are introduced. and all the edges information takes kEk pages. Then the
One of them continues to use the same group id of the split SNAP operation incurs kV k page accesses to read all the
group, and the remaining groups are added to the end of nodes with their attributes, kEk page accesses to initialize
the group-array. Accordingly, each row of the bitmap has the data structures, and at most kEk page accesses each
to be widened. The nodes of this split group are distributed time it updates the data structures. So, the total number of
among the new groups. As the group memberships of these page accesses is bounded by kV k + (k + 1)kEk. Note that in
Algorithm 3 SplitGroups(G, A, R, k, Φc)
Input: G: a graph; A ⊆ Λ(G): a set of attributes; R = {E} ⊆ Υ(G): a set containing one relationship type E; k: the required number of groups in the summary; Φc: the current grouping.
Output: Splitting groups in Φc until |Φc| = k.
1: Build a heap on the CT value of each group in Φc
2: while |Φc| < k do
3:   Pop the group Gi with the maximum CT value from the heap
4:   Split Gi into two based on the neighbor group Gt = arg max_{Gj} {δ_{E,Gj}(Gi)}
5:   Update data structures (Φc is updated)
6:   Update the heap
7: end while

Algorithm 4 k-SNAP-Bottom-Up(G, A, R, k)
Input: G: a graph; A ⊆ Λ(G): a set of attributes; R = {E} ⊆ Υ(G): a set containing one relationship type E; k: the required number of groups in the summary.
Output: A summary graph.
1: Gsnap = SNAP(G, A, R)
2: Initialize the data structures using the grouping in Gsnap and let Φc denote the current grouping
3: MergeGroups(G, A, R, k, Φc)
4: Form the summary graph Gk-snap
5: return Gk-snap

groups based on its bitmap entries, the top-down approach has to make the following decisions at each iterative step: (1)
which group to split and (2) how to split it. Such decisions
are critical as once a group is split, the next step will operate
practice, not every page access results in an actual disk IO. on the new grouping. At each step, we can only make the
Especially for the updates of the data structures discussed decision based on the current grouping. We want each step
in Section 3.1, most of the edges information will be cached to make the smallest move possible, to avoid going too far
in the buffer pool. away from the right direction. Therefore, we split one group
into only two subgroups at each iterative step. There are
3.3 Evaluating k-SNAP Operation different ways to split one group into two. One natural way
The k-SNAP operation allows a user to choose k, the num- is to divide the group based on whether nodes have relation-
ber of groups that are shown in the summary. For a given ships with nodes in a neighbor group. After the split, nodes
graph, a set of nodes attributes A and the set of relationship in the two new groups either all or never participate in the
types R, a meaningful k value should fall in the range be- group relationships with this neighbor group. This way of
tween |Φmax
A | and |Φmax
(A,R) |. However, if the user input is be- splitting groups also ensures that the resulting groups follow
yond the meaningful range, i.e. k < |Φmax A | or k > |Φmax
(A,R) |, the KEAT principle.
then the evaluation algorithms will return the summary cor- Now, we introduce the heuristic for deciding which group
responding to Φmax A or Φmax
(A,R) , respectively. For simplicity, to split and how to split at each iterative step. As defined in
we will assume that the k values input to the algorithms Section 2.2, the k-SNAP operation tries to find the grouping
are always meaningful. By varying the k values, users can with a minimum ∆ measure (see Equation 1) for a given k.
produce multi-resolution summaries. A larger k value corre- The computation of the ∆ measure can be broken down into
sponds to a higher resolution summary. The finest summary each group with each of its neighbors (see the δ function in
corresponds to the grouping Φmax (A,R) ; and the coarsest sum- Equation 2). Therefore, our heuristic chooses the group that
mary corresponds to the grouping Φmax A . makes the most contribution to the ∆ value with one of its
As proved in Section 2.2, computing the exact answers neighbor groups. More formally, for each group Gi , we define
for the k-SNAP operation is NP-complete. In this paper, CT (Gi ) as follows:
we propose two heuristic algorithms to approximate the an-
swers. The top-down approach starts from the maximum CT (Gi ) = max{δE,Gj (Gi )} (3)
Gj
grouping only based on attributes, and iteratively splits groups
until the number of groups reaches k. The other approach Then, at each iterative step, we always choose the group
employs a bottom-up scheme. This method first computes with the maximum CT value to split and then split it based
the maximum grouping compatible with both attributes and on whether nodes in this group Gi have relationships with
relationships, and then iteratively merges groups until the nodes in its neighbor group Gt , where
result satisfies the user defined k value. In both approaches,
we apply the same principle: nodes of a same group in the Gt = arg max{δE,Gj (Gi )}
Gj
maximum (A, R)-compatible grouping should always remain
in a same group even in coarser summaries. We call this As shown in Algorithm 3, to speed up the decision pro-
principle KEAT (Keep the Equivalent Always Together) cess, we build a heap on the CT values of groups. At each
principle. This principle guarantees that when k = |Φmax (A,R) |, iteration, we pop the group with the maximum CT value
the result produced by the k-SNAP evaluation algorithms to split. At the end of each iteration, we update the heap
is the same as the result of the SNAP operation with the elements corresponding to the neighbors of the split group,
same inputs. and insert elements corresponding to the two new groups.
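Setting the heap aside, a single iteration of this split step can be sketched as follows: pick the group with the largest CT value, i.e. the largest single-neighbor δ contribution, and split it on whether its nodes have any edge into that neighbor group. The sketch assumes one edge type and reuses participants() from the ∆-measure sketch; it is illustrative only and the names are ours.

```python
def split_once(groups, graph, edge_type):
    """One iteration of the top-down heuristic: split the group with max CT."""
    def delta(gi, gj):
        p = participants(gi, gj, graph, edge_type)
        q = participants(gj, gi, graph, edge_type)
        ratio = (len(p) + len(q)) / (len(gi) + len(gj))
        return len(p) if ratio <= 0.5 else len(gi) - len(p)

    best = None                      # (CT value, group index, neighbor group)
    for i, gi in enumerate(groups):
        for gj in groups:
            if gj is gi:
                continue
            d = delta(gi, gj)
            if best is None or d > best[0]:
                best = (d, i, gj)

    _, i, gt = best
    gi = groups[i]
    touching = participants(gi, gt, graph, edge_type)
    if not touching or touching == gi:
        return groups                # no useful split available
    # Replace Gi by the two halves; only whole nodes move, so nodes that are
    # equivalent in the maximum (A, R)-compatible grouping stay together (KEAT).
    return groups[:i] + [touching, gi - touching] + groups[i + 1:]
```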
The time complexity of the top-down approach is similar
3.3.1 Top-Down Approach to the SNAP algorithm, except that it takes O(k02 + k0 )
Similar to the SNAP evaluation algorithm, the top-down time to compute the CT values and build the heap, and
approach (see Algorithm 2) also starts from the maximum at most O(ki2 + ki log ki ) time to update the heap at each
grouping based only on attributes, and then iteratively splits iteration, where k0 is the number of groups in the maximum
existing groups until the number of groups reaches k. How- A-compatible grouping, and ki is the number of groups at
ever, in contrast to the SNAP evaluation algorithm, which each iteration. As k < |V |, the upper-bound complexity of
randomly chooses a splittable group and splits it into sub- the top-down approach is still O(|V | log |V | + k 2 |V | + k|E|).
Algorithm 5 MergeGroups(G, A, R, k, Φc)
Input: G: a graph; A ⊆ Λ(G): a set of attributes; R = {E} ⊆ Υ(G): a set containing one relationship type E; k: the required number of groups in the summary; Φc: the current grouping.
Output: Merging groups in Φc until |Φc| = k.
1: Build a heap on (MergeDist, Agree, MinSize) for pairs of groups
2: while |Φc| > k do
3:   Pop the pair of groups with the best (MergeDist, Agree, MinSize) value from the heap
4:   Merge the two groups into one
5:   Update data structures (Φc is updated)
6:   Update the heap
7: end while

volving either of the two merged groups, update elements involving neighbors of the merged groups, and insert elements involving this new group.

The time cost of the bottom-up approach is the cost of the SNAP algorithm plus the merging cost. The algorithm takes O(k_snap^3 + k_snap^2) to initialize the heap, then at each iteration at most O(k_i^3 + k_i^2 log k_i) time to update the heap, where k_snap is the size of the grouping resulting from the SNAP operation, and k_i is the size of the grouping at each iteration. Therefore, the time complexity of the bottom-up approach is bounded by O(|V| log |V| + k_snap^2 |V| + k_snap |E| + k_snap^4). Note that updating the in-memory data structures in the bottom-up approach does not need to access the database (i.e. no IOs). All the necessary information for the updates can be found in the current data structures. Therefore, the upper bound of the number of page accesses for the bottom-up approach is ||V|| + (k_snap + 1)||E||.
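For intuition, one merge iteration of Algorithm 5, with the heap stripped out, can be sketched as below: among pairs of groups with identical attribute values, take the pair with the best (MergeDist, Agree, MinSize) triple (smallest MergeDist, then most agreements, then smallest size) and merge it. MergeDist, Agree and MinSize are defined in Section 3.3.2; the sketch assumes one edge type, reuses participants() from the ∆-measure sketch, and takes an attr_key(group) helper returning a group's shared attribute values — all names are ours.

```python
from itertools import combinations

def merge_once(groups, graph, edge_type, attr_key):
    """One iteration of the bottom-up heuristic (Algorithm 5, lines 3-5)."""
    def ratio(gi, gj):
        p = participants(gi, gj, graph, edge_type)
        q = participants(gj, gi, graph, edge_type)
        return (len(p) + len(q)) / (len(gi) + len(gj))

    def score(gi, gj):
        others = [g for g in groups if g is not gi and g is not gj]
        merge_dist = sum(abs(ratio(gi, g) - ratio(gj, g)) for g in others)
        # Agreement: both relationships fall on the same side of the 50%
        # threshold (both strong or both weak); counted here over all other
        # groups for simplicity.
        agree = sum((ratio(gi, g) > 0.5) == (ratio(gj, g) > 0.5) for g in others)
        return (merge_dist, -agree, min(len(gi), len(gj)))

    candidates = [(gi, gj) for gi, gj in combinations(groups, 2)
                  if attr_key(gi) == attr_key(gj)]
    if not candidates:
        return groups                      # nothing can be merged
    gi, gj = min(candidates, key=lambda pair: score(*pair))
    return [g for g in groups if g is not gi and g is not gj] + [gi | gj]
```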
Following the same method of analyzing the page accesses
3.3.3 Drill-Down and Roll-Up Abilities
for the SNAP algorithm, the number of page accesses in-
curred by the top-down approach is bounded by kV k + (k + The top-down and the bottom-up approaches introduced
1)kEk. above both start from scratch to produce the summaries.
However, it is easy to build an interactive querying scheme,
3.3.2 Bottom-Up Approach where the users can drill-down and roll-up based on the cur-
The bottom-up approach first computes the maximum rent summaries. The users can first generate an initial sum-
(A, R)-compatible grouping using Algorithm 1, and then it- mary using either the top-down approach or the bottom-up
eratively merges two groups until grouping size is k (see approach. However, as we will show in Section 4.3, the top-
Algorithm 4). Choosing which two groups to merge in each down approach has significant advantage in both efficiency
iterative step is crucial for the bottom-up approach. First, and summary quality in most practical cases. We suggest us-
the two groups are required to have the same attributes val- ing the top-down approach to generate the initial summary.
ues. Second, the two groups must have similar group rela- The drill-down operation can be simply achieved by call-
tionships with other groups. Now, we formally define this ing the SplitGroups function (Algorithm 3). To roll up to a
similarity between two groups. coarser summary, the MergeGroups function (Algorithm 5)
The two groups to be merged should have similar neigh- can be called. However, when the number of groups in the
bor groups with similar participation ratios. We define a current summary is large, the MergeGroups function be-
measure called M ergeDist to assess the similarity between comes expensive, as it needs to compare every pair of groups
two groups in the merging process. to calculate the M ergeDist (see Section 3.3.2). Therefore,
using the top-down approach to generate a new summary with the decreased resolution is a better choice to roll-up when the current summary is large.

MergeDist(Gi, Gj) = Σ_{k ≠ i,j} |p_{i,k} − p_{j,k}|    (4)
M ergeDist accumulates the differences in participation ra-
tios between Gi and Gj with other groups. The smaller this 4. EXPERIMENTAL EVALUATION
value is, the more similar the two groups are. In this section, we present experimental results evaluating
If two pairs of groups have the same M ergeDist, we need the effectiveness and efficiency of the SNAP and the k-SNAP
to further distinguish which pair is “more similar”. We look operations on a variety of real and synthetic datasets. All
at each common neighbor Gk of Gi and Gj , and consider algorithms are implemented in C++ on top of PostgreSQL
the group relationships (Gi , Gk ) and (Gj , Gk ). If both group (http://www.postgresql.org) version 8.1.3. Graphs are
relationships are strong (pi,k > 0.5 and pj,k > 0.5) or weak stored in a node table and an edge table in the database,
(pi,k ≤ 0.5 and pj,k ≤ 0.5), then we call it an agreement using the following schema: NodeTable(graphID, nodeID, at-
between Gi and Gj . The total number of agreements between tributeName, attributeType, attributeValue) and EdgeTable(
Gi and Gj is denoted as Agree(Gi , Gj ). Having the same graphID, node1ID, node2ID, edgeType). Nodes with multiple
M ergeDist, the pair of groups with more agreements is a attributes have multiple entries in the node table, and edges
better candidate to merge. with multiple types have multiple entries in the edge table.
If both of the above criteria are the same for two pairs of Accesses to nodes and edges of graphs are implemented by
groups, we always prefer merging groups with smaller sizes issuing SQL queries to the PostgreSQL database. All exper-
(in the number of nodes). More formally, we choose the pair iments were run on a 2.8GHz Pentium 4 machine running
with smaller M inSize(Gi , Gj ) = min{|Gi |, |Gj |}, where Gi Fedora 2, and equipped with a 250GB SATA disk. For all ex-
and Gj are in this pair. periments (except the one in Section 4.4.3), we set the buffer
In Algorithm 5, we utilize a heap to store pairs of groups pool size to 512MB and working memory size to 256MB.
based on the values of the triple (M ergeDist, Agree, M inSize).
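The node/edge schema described above amounts to two plain tables; the snippet below simply issues the corresponding CREATE TABLE statements through psycopg2 as an illustration. The column types are our guesses (the paper only lists column names), and the connection string is a placeholder.

```python
import psycopg2

DDL = """
CREATE TABLE NodeTable (
    graphID        integer,
    nodeID         integer,
    attributeName  varchar(64),
    attributeType  varchar(16),
    attributeValue varchar(256)
);
CREATE TABLE EdgeTable (
    graphID  integer,
    node1ID  integer,
    node2ID  integer,
    edgeType varchar(64)
);
"""

with psycopg2.connect("dbname=graphdb") as conn:   # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(DDL)
```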
At each iteration, we pop the group pair with the best 4.1 Experimental Datasets
(M ergeDist, Agree, M inSize) value from the heap, and In this section, we describe the datasets used in our empir-
then merge the pair into one group. At the end of each ical evaluation. We use two real datasets and one synthetic
iteration, we remove the heap elements (pairs of groups) in- dataset to explore the effect of various graph characteristics.
Figure 5: DBLP DB Coauthorship Graph
Figure 6: The Distribution of the Number of DB Publications (Avg: 2.6, Stdev: 5.1)
Figure 7: The SNAP Result for the DBLP DB Dataset
Table 1: The DBLP Datasets for the Efficiency Experiments

    Description   #Nodes   #Edges   Avg. Degree
D1  DB             7,445   19,971   5.4
D2  D1+AL         14,533   37,386   5.1
D3  D2+OS+CC      22,435   55,007   4.9
D4  D3+AI         30,664   70,669   4.6

Synthetic Dataset. Most real world graphs show power-law degree distributions and small-world characteristics [13]. Therefore, we use the R-MAT model [8] in the GTgraph suites [2] to generate graphs with power-law degree distributions and small-world characteristics. Based on the statistics in Table 1, we set the average node degree in each synthetic graph to 5. We used the default values for the other parameters in the R-MAT based generator. We also assign an
attribute to each node in a generated graph. The domain
of this attribute has 5 values. For each node we randomly
DBLP Dataset This dataset contains the DBLP Bibli- assign one of the five values.
ography data [12] downloaded on July 30, 2007. We use this
data for both effectiveness and efficiency experiments. In 4.2 Effectiveness Evaluation
order to compare the coauthorship behaviors across differ-
We first evaluate the effectiveness of our graph summa-
ent research areas and construct datasets for the efficiency
rization methods. In this section, we use only the top-down
experiments, we partition the DBLP data into different re-
approach to evaluate the k-SNAP operation, as we compare
search areas. We choose the following five areas: Database
the top-down and the bottom-up approaches in Section 4.3.
(DB), Algorithms (AL), Operating Systems (OS), Compiler
Construction (CC) and Artificial Intelligence (AI). For each
of the five areas, we collect the publications of a number 4.2.1 DBLP Coauthorship Networks
of selected journals and conferences in this area1 . These In this experiment, we are interested in analyzing how
journals and conferences are selected to construct the four researchers in the database area coauthor with each other.
datasets with increasing sizes for the efficiency experiments As input, we use the DBLP DB subset (see D1 of Table 1 and
(see Table 1). These four datasets are constructed as follows: Figure 5). Each node in this graph has one attribute called
D1 contains the selected DB publications. We add into D1 PubNum, which is the number of publications belonging to
the selected AL publications to form D2. D3 is D2 plus the the corresponding author. By plotting the distribution of
selected OS and CC publications. And finally, D4 contains the number of publications of this dataset in Figure 6, we
the publications of all the five areas we are interested in. assigned another attribute called Prolific to each author in
We construct a coauthorship graph for each dataset. The the graph indicating whether that author is prolific: authors
nodes in this graph correspond to authors and edges indi- with ≤ 5 papers are tagged as low prolific (LP), authors with
cate coauthorships between authors. The statistics for these > 5 but ≤ 20 papers are prolific (P), and the authors with
four datasets are shown in Table 1. > 20 papers are tagged as highly prolific (HP).
Political Blogs Dataset This dataset is a network of We first issue a SNAP operation on the Prolific attribute
1490 webblogs on US politics and 19090 hyperlinks between and the coauthorships. The result is visualized in Figure 7.
these webblogs [1] (downloaded from http://www-personal. Groups with the HP attribute value are colored in yellow,
umich.edu/~mejn/netdata/). Each blog in this dataset has groups with the P value are colored in light blue, and the
an attribute describing its political leaning as either liberal remaining groups with the attribute value LP are in white.
or conservative. The SNAP operation results in a summary with 3569 groups
and 11293 group relationships. This summary is too big to
1
DB: VLDB J., TODS, KDD, PODS, VLDB, SIGMOD; AL: analyze. On the other hand, if we apply the SNAP operation
STOC, SODA, FOCS, Algorithmica, J. Algorithms, SIAM J. on only the Prolific attribute (i.e. not considering any re-
Comput., ISSAC, ESA, SWAT, WADS; OS: USENIX, OSDI,
SOSP, ACM Trans. Comput. Syst., HotOS, OSR, ACM SIGOPS
lationships in the SNAP operation), we will get a summary
European Workshop; CC: PLDI, POPL, OOPSLA, ACM Trans. with only 3 groups as visualized in the top left figure in Ta-
Program. Lang. Syst., CC, CGO, SIGPLAN Notices, ECOOP; ble 2. The bold edges between two groups indicate strong
AI: IJCAI, AAAI, AAAI/IAAI, Artif. Intell. group relationships (with more than 50% participation ra-
Table 2: The Aggregation Results for the DBLP DB and AI Subsets. (Each cell shows a summary graph; the two rows correspond to the DB and AI subsets, and the columns to the attribute-only grouping and to the k-SNAP summaries for k = 4, 5, 6 and 7. Node labels give the Prolific value (HP, P, LP) and the group size; edge labels give participation ratios.)

tio), while dashed edges are weak group relationships. This of publications: 1.26. Finally, the group of mostly single
summary shows that the HP researchers as a whole have authors has on average only 1.23 publications. Not surpris-
very strong coauthorship with the P group of researchers. ingly, these results suggest that collaborating with HP and
Researchers within both groups also tend to coauthor with P researchers is very helpful for the low prolific (often be-
people within their own groups. However, this summary ginning) researchers.
does not provide a lot of information for the LP researchers: Next, we want to compare the database community with
they tend to coauthor strongly within their group and they the AI community to see whether the coauthorship relation-
have some connection with the HP and P groups. ships are different across these two communities. We con-
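For concreteness, the analysis described here could be driven with the earlier sketches roughly as follows: bucket each author's PubNum into the Prolific attribute, run SNAP on Prolific plus the coauthorship edges, and then produce k-SNAP summaries at several resolutions. This is an illustrative sketch, not the paper's code; coauthor_graph is assumed to be a Graph (Section 2 sketch) loaded with the DBLP DB subset, and k_snap_top_down is our hypothetical wrapper around split_once.

```python
def prolific(pub_num):
    """Bucket PubNum as described above: LP (<= 5), P (6-20), HP (> 20)."""
    return "LP" if pub_num <= 5 else ("P" if pub_num <= 20 else "HP")

def k_snap_top_down(graph, attributes, edge_type, k):
    """Hypothetical wrapper: attribute-only grouping, then repeated splits."""
    groups = snap(graph, attributes, [])        # no edge types => A-compatible grouping
    while len(groups) < k:
        new_groups = split_once(groups, graph, edge_type)
        if len(new_groups) == len(groups):
            break                               # no further useful split
        groups = new_groups
    return groups

for author, attrs in coauthor_graph.attrs.items():
    attrs["Prolific"] = prolific(attrs["PubNum"])

full_summary = snap(coauthor_graph, ["Prolific"], ["coauthor"])
# The paper reports 3569 groups for the DB subset at this full resolution.

for k in (4, 5, 6, 7):                          # multi-resolution summaries
    summary = k_snap_top_down(coauthor_graph, ["Prolific"], "coauthor", k)
```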
Now we make use of the k-SNAP operation to produce structed the AI coauthorship graph with 9495 authors and
summaries with multiple resolutions. The first row of figures 16070 coauthorships from the DBLP AI subset. The distri-
in Table 2 shows the k-SNAP results for k =4, 5, 6 and 7. bution of the number of publications of AI authors is similar
As k increases, more details are shown in the summaries. to the DB authors, thus we use the same method to assign
When k = 7, the summary shows that there are 5 sub- the Prolific attribute to these authors. The SNAP opera-
groups of LP researchers. One group of 1192 LP researchers tion on the Prolific attribute and coauthorships results in
strongly collaborates with both HP and P researchers. One a summary with 3359 groups and 7091 group relationships.
group of 521 only strongly collaborates with HP researchers. The second row of figures in Table 2 shows the SNAP result
One group of 1855 only strongly collaborates with P re- based only on the Prolific attribute and the k-SNAP results
searchers. These three groups also strongly collaborate within for k =4, 5, 6 and 7. Comparing the summaries for the two
their groups. There is another group of 2497 LP researchers communities for k = 7, we can see the differences across the
that has very weak connections to other groups but strongly two communities: The HP and P groups in the AI commu-
cooperates among themselves. The last group has 761 LP nity have a weaker cooperation than the DB community;
researchers, who neither coauthor with others within their and there isn’t a large group of LP researchers who strongly
own group nor collaborate strongly with researchers in other coauthor with both HP and P researchers in the AI area.
groups. They often write single author papers. As this example shows, by changing the resolutions of
Now, in the k-SNAP result for k = 7, we are curious if summaries, users can better understand the characteristics
the average number of publications for each subgroup of the of the original graph data and also explore the differences
LP researchers is affected by the coauthorships with other and similarities across different datasets.
groups. The above question can be easily answered by ap-
plying the avg operation on the PubNum attribute for each 4.2.2 Political Blogs Network
group in the result of the k-SNAP operation. In this experiment, we evaluate the effectiveness of our
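As a concrete illustration, such a per-group average takes only a few lines to compute. The sketch below is ours, not the system's implementation; node_group and pub_num are hypothetical mappings from each author to its k-SNAP group and to its PubNum value.

    from collections import defaultdict

    def group_average(node_group, node_value):
        # node_group: node id -> group id taken from a k-SNAP summary
        # node_value: node id -> numeric attribute value (e.g., PubNum)
        total, count = defaultdict(float), defaultdict(int)
        for node, group in node_group.items():
            total[group] += node_value[node]
            count[group] += 1
        return {g: total[g] / count[g] for g in total}

    # e.g., avg_pub = group_average(node_group, pub_num)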
With this analysis, we find that the group of LP researchers who collaborate with both P and HP researchers has a high average number of publications: 2.24. The group only collaborating with HP researchers has 1.66 publications on average. The group collaborating with only the P researchers has on average 1.55 publications. The group that tends to only cooperate among themselves has a low average number of publications: 1.26. Finally, the group of mostly single authors has on average only 1.23 publications. Not surprisingly, these results suggest that collaborating with HP and P researchers is very helpful for the low prolific (often beginning) researchers.

Next, we want to compare the database community with the AI community to see whether the coauthorship relationships are different across these two communities. We constructed the AI coauthorship graph with 9495 authors and 16070 coauthorships from the DBLP AI subset. The distribution of the number of publications of AI authors is similar to that of the DB authors, thus we use the same method to assign the Prolific attribute to these authors. The SNAP operation on the Prolific attribute and coauthorships results in a summary with 3359 groups and 7091 group relationships. The second row of figures in Table 2 shows the SNAP result based only on the Prolific attribute and the k-SNAP results for k = 4, 5, 6 and 7. Comparing the summaries of the two communities for k = 7, we can see clear differences: the HP and P groups in the AI community cooperate more weakly than in the DB community, and there isn't a large group of LP researchers who strongly coauthor with both HP and P researchers in the AI area.

As this example shows, by changing the resolutions of summaries, users can better understand the characteristics of the original graph data and also explore the differences and similarities across different datasets.

4.2.2 Political Blogs Network
In this experiment, we evaluate the effectiveness of our graph summarization methods on the political blogs network (1490 nodes and 19090 edges). The SNAP operation based on the political leaning attribute and the links between blogs results in a summary with 1173 groups and 16657 group relationships. The SNAP result based only on the attribute and the k-SNAP result based on both the attribute and the links for k = 7 are shown in Table 3.
[Table 3: The Aggregation Results for the Political Blogs Dataset. The attribute-only SNAP summary and the k-SNAP summary for k = 7; each node is labeled with its political leaning (L for liberal, C for conservative) and its group size, and each group relationship is annotated with its participation ratio.]

[Figure 8: Quality of Summaries: Top-Down vs. Bottom-Up. ∆/k versus k; the x-axis is in log scale.]

[Figure 9: Efficiency: Top-Down vs. Bottom-Up. Execution time (sec, log scale) versus k (log scale).]
From the results, we see that there are a group of liberal blogs and a group of conservative blogs that interact strongly with each other (perhaps to refute each other). Other groups of blogs only connect to blogs in their own communities (liberal or conservative), if they do connect to other blogs. There is a relatively large group of 193 liberal blogs with almost no connections to any other blogs, while such isolated blogs compose a much smaller portion (96 blogs) of all the conservative blogs. Overall, conservative blogs show a slightly higher tendency to link with each other than liberal blogs, which is consistent with the conclusion from the analysis in [1]. Given that the blogs data was collected right after the US 2004 election, the authors in [1] speculated that the different linking behaviors in the two communities may be correlated with the eventual outcome of the 2004 election.

4.3 k-SNAP: Top-Down vs. Bottom-Up
In this section, we compare the top-down and the bottom-up k-SNAP algorithms, both in terms of effectiveness and efficiency. We use the DBLP DB subset (D1 in Table 1) and apply both approaches for different k values.
For the effectiveness experiment, we use the ∆ measure introduced in Section 2.2 to assess the qualities of summaries. Note that for a given k value, a smaller ∆ value means a better quality summary, but comparing ∆ across different k values does not make sense, as a higher k value tends to result in a higher ∆ value according to Equation 1. However, if we normalize ∆ by k, we get the average contribution of each group to the ∆ value, and we can then compare ∆/k across different k values.

We acknowledge that ∆/k is not a perfect measure for “quantitatively” evaluating the quality of summaries. However, quality assessment is a tricky issue in general, and ∆/k, though crude, is an intuitive measure for this study.
Figure 8 shows the comparison of the summary qualities between the top-down and the bottom-up approaches. Note that the x-axis is in log scale and the y-axis is ∆/k. First, as k increases, both methods produce higher quality summaries. For small k values, the top-down approach produces significantly higher quality summaries than the bottom-up approach. This is because the bottom-up approach starts from the grouping produced by the SNAP operation. This initial grouping is usually very large; in this case, it contains 3569 groups. The bottom-up approach has to continuously merge groups until the number of groups decreases to a small k value. Each merge decision is only made based on the current grouping, and errors can easily accumulate. In contrast, the top-down approach starts from the maximum A-compatible grouping, and only needs a small number of splits to reach the result. Therefore, only a small amount of error is accumulated. As k becomes larger, the bottom-up approach shows a slight advantage over the top-down approach.

The execution times for the two approaches are shown in Figure 9. Note that both axes are in log scale. The top-down approach significantly outperforms the bottom-up approach, except when k is equal to the size of the grouping resulting from the SNAP operation. Initializing the heap takes a lot of time for the bottom-up approach, as it has to compare every pair of groups. This situation becomes worse if the size of the initial grouping is very large.

In practice, users are more likely to choose small k values to generate summaries. The top-down approach significantly outperforms the bottom-up approach in both effectiveness and efficiency for small k values. Therefore, the top-down approach is preferred for most practical uses. For all the remaining experiments, we only consider the top-down approach.
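Schematically, the two heuristics differ only in the direction in which they move toward k groups, as the sketch below shows. This is an outline of the control flow only, not the paper's algorithms; merge_score and split_choice are hypothetical stand-ins for the actual heuristics described in Section 3.

    def bottom_up_ksnap(snap_groups, k, merge_score):
        # Start from the (typically very large) SNAP grouping and greedily
        # merge the best-scoring pair of groups until only k groups remain.
        groups = [set(g) for g in snap_groups]
        while len(groups) > k:
            pairs = [(i, j) for i in range(len(groups))
                            for j in range(i + 1, len(groups))]
            i, j = max(pairs, key=lambda p: merge_score(groups[p[0]], groups[p[1]]))
            groups[i] |= groups.pop(j)
        return groups

    def top_down_ksnap(attr_groups, k, split_choice):
        # Start from the small attribute-only (maximum A-compatible) grouping
        # and split one group at a time until there are k groups.
        groups = [set(g) for g in attr_groups]
        while len(groups) < k:
            i, parts = split_choice(groups)   # which group to split, and how
            groups[i:i + 1] = parts
        return groups

The pairwise examination in the merge loop is exactly the cost that makes the bottom-up approach expensive when the SNAP grouping is large.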
4.4 Efficiency Experiment
This section evaluates the efficiency of the SNAP and the k-SNAP operations.

4.4.1 SNAP Efficiency
In this section, we apply the SNAP operation on the four DBLP datasets with increasing sizes (see Table 1). Table 4 shows the number of groups and group relationships in the summaries produced by the SNAP operation on the attribute Prolific (defined in the same way as in Section 4.2.1) and coauthorships, as well as the execution times. Even for the largest dataset with 30664 nodes and 70669 edges, the execution is completed in 44 seconds. However, all of the SNAP results are very large. The summary sizes are comparable to the input graphs. Such large summaries are often not very useful for analyzing the input graphs. This is anecdotal evidence of why the k-SNAP operation is often more desirable than the SNAP operation in practice.

        # Groups   # Group Relationships   Time (sec)
  D1      3569            11293               6.4
  D2      7892            26031              16.1
  D3     11379            35682              27.9
  D4     15052            44318              44.0
Table 4: The SNAP Results for the DBLP Datasets
[Figure 10: Efficiency of k-SNAP on the DBLP Datasets. Execution time (sec) versus k for datasets D1-D4.]

[Figure 11: Efficiency of k-SNAP on the Synthetic Datasets. Execution time (sec) versus graph size (#nodes) for k = 10, 100 and 1000.]

[Figure 12: Bitmap in Memory vs. No Bitmap. Running time (sec) versus k for the two versions of the top-down algorithm.]
4.4.2 k-SNAP Efficiency
This section evaluates the efficiency of the top-down k-SNAP algorithm on both the DBLP and the synthetic datasets.

DBLP Data In this experiment, we apply the top-down k-SNAP evaluation algorithm on the four DBLP datasets shown in Table 1 (the k-SNAP operation is based on the Prolific attribute and coauthorships). The execution times with increasing graph sizes and increasing k values are shown in Figure 10. For these datasets, the performance behavior is close to linear, since the execution times are dominated by the database page accesses (as discussed in Section 3.3.1).

Synthetic Data We apply the k-SNAP operation on different sized synthetic graphs with three k values: 10, 100 and 1000. The execution times with increasing graph sizes are shown in Figure 11. When k = 10, even on the largest graph with 1 million nodes and 2.5 million edges, the evaluation algorithm finishes in about 5 minutes. For a given k value, the algorithm scales nicely with increasing graph sizes.

4.4.3 Evaluation with Very Large Graphs
So far, we have assumed that the amount of working memory is big enough to hold all the data structures (shown in Figure 4) used in the evaluation algorithms. This is often the case in practice, as large multi-GB memory configurations are common and many graph datasets can fit in this space (especially if a subset of a large graph is selected for analysis). However, our methods also work when the graph datasets are extremely large and this in-memory assumption is not valid. In this section, we discuss the behaviors of our methods for this case. We only consider the most practically useful top-down k-SNAP algorithm for this experiment.

In the case when the most memory consuming data structure, namely the neighbor-groups bitmap (see Figure 4), cannot fit in memory, the top-down approach drops the bitmap data structure. Without the bitmap, each time the algorithm splits a group, it has to query the edge information in the database to infer the neighbor-groups. We have implemented a version of the top-down k-SNAP algorithm without the bitmap data structure, and compared it with the normal top-down algorithm.

To keep this experiment manageable, we scaled down the experiment settings. We used the DBLP D4 dataset in Table 1, and set the buffer pool size and working memory size to 16MB and 8MB, respectively. This "scaled-down" experiment exposes the behaviors of the two versions of the top-down algorithm, while keeping the running times reasonable. As shown in Figure 12, the version of the top-down approach without the bitmap is much slower than the normal version. This is not surprising, as the former incurs more disk IOs.

Given the graph size and the k value, our current implementation can decide in advance whether the bitmap can fit in the working memory, by estimating the upper bound of the bitmap size. It can then choose the appropriate version of the algorithm to use. In the future, we plan on designing a more sophisticated version of the top-down algorithm in which part of the bitmap can be kept in memory when the available memory is small.
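A back-of-the-envelope version of this decision could look like the sketch below. The sizing assumptions are ours, made only for illustration: one bit per (node, group) pair and at most k groups alive at any point during top-down evaluation.

    def choose_topdown_variant(num_nodes, k, working_memory_bytes):
        # Assumed upper bound on the neighbor-groups bitmap: one bit per
        # (node, group) pair, with at most k groups during evaluation.
        bitmap_upper_bound_bytes = num_nodes * k / 8.0
        if bitmap_upper_bound_bytes <= working_memory_bytes:
            return "bitmap in memory"
        return "no bitmap"  # fall back to querying edges to infer neighbor-groups

    # e.g., choose_topdown_variant(30664, 1000, 8 * 1024 * 1024) -> "bitmap in memory"
    # (30664 * 1000 / 8 is roughly 3.8 MB, which fits in an 8 MB working memory)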
5. RELATED WORK
Graph summarization has attracted a lot of interest from both the sociology and the database research communities. Most existing works on graph summarization use statistical methods to study graph characteristics, such as degree distributions, hop-plots and clustering coefficients. Comprehensive surveys on these methods are provided in [6] and [13]. A-plots [7] is a novel statistical method to summarize the adjacency matrix of graphs for outlier detection. Statistical summaries are useful but hard to control and navigate. Methods for mining frequent graph patterns [11, 19, 23] are also used to understand the characteristics of large graphs. Washio and Motoda [20] provide an elegant review on this topic. However, these mining algorithms often produce an overwhelmingly large number of frequent patterns. Various graph partitioning algorithms [14, 18, 22] are used to detect community structures (dense subgraphs) in large graphs. SuperGraph [17] employs hierarchical graph partitioning to visualize large graphs. However, graph partitioning techniques largely ignore the node attributes in the summarization. Studies on graph visualization are surveyed in [3, 10]. For very large graphs, these visualization methods are still not sufficient. Unlike these existing methods, we introduce two database-style operations to summarize large graphs. Our method allows users to easily control and navigate through summaries.

Previous research [4, 5, 15] has also studied the problem of compressing large graphs, especially Web graphs. However, these graph compression methods mainly focus on compact graph representation for easy storage and manipulation, whereas graph summarization methods aim at producing small and understandable summaries.

Regular equivalence is introduced in [21] to study social roles of nodes based on graph structures in social networks. It shares resemblance with the SNAP operation. However, regular equivalence is defined only based on the relationships between nodes; node attributes are largely ignored.
In addition, the k-SNAP operation relaxes the stringent equivalence requirement of relationships between node groups, and produces user-controllable multi-resolution summaries.

The SNAP algorithm (Algorithm 1) shares similarity with the automorphism partitioning algorithm in [9]. However, the automorphism partitioning algorithm only partitions nodes based on node degrees and relationships, whereas SNAP can be evaluated based on arbitrary node attributes and relationships that a user selects.

6. CONCLUSIONS AND FUTURE WORK
This paper has introduced two aggregation operations, SNAP and k-SNAP, for controlled and intuitive database-style graph summarization. Our methods allow users to freely choose node attributes and relationships that are of interest, and produce summaries based on the selected features. Furthermore, the k-SNAP aggregation allows users to control the resolutions of summaries and provides "drill-down" and "roll-up" abilities to navigate through the summaries. We have formally defined the two operations and proved that evaluating the k-SNAP operation is NP-complete. We have also proposed an efficient algorithm to evaluate the SNAP operation and two heuristic algorithms to approximately evaluate the k-SNAP operation. Through extensive experiments on a variety of real and synthetic datasets, we show that of the two k-SNAP algorithms, the top-down approach is a better choice in practice. Our experiments also demonstrate the effectiveness and efficiency of our methods. As part of future work, we plan on designing a formal graph data model and query language that allows incorporation of k-SNAP, along with a number of other common and useful graph matching methods.

7. ACKNOWLEDGMENT
This research was supported by the National Institutes of Health under grant 1-U54-DA021519-01A1, the National Science Foundation under grant DBI-0543272, and an unrestricted research gift from Microsoft Corp. We thank Taneli Mielikainen and Yiming Ma for their valuable suggestions during the early stage of this research. We also thank the reviewers of this paper for their constructive comments on a previous version of this manuscript.

8. REPEATABILITY ASSESSMENT RESULT
All the results in this paper were verified by the SIGMOD repeatability committee.

9. REFERENCES
[1] L. A. Adamic and N. Glance. The political blogosphere and the 2004 US Election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36-43, 2005.
[2] D. A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. http://www.cc.gatech.edu/~kamesh/GTgraph.
[3] G. Battista, P. Eades, R. Tamassia, and I. Tollis. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, 1999.
[4] D. K. Blandford, G. E. Blelloch, and I. A. Kash. Compact representations of separable graphs. In Proceedings of SODA'03, pages 679-688, 2003.
[5] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proceedings of WWW'04, pages 595-602, 2004.
[6] D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1), 2006.
[7] D. Chakrabarti, C. Faloutsos, and Y. Zhan. Visualization of large networks with min-cut plots, A-plots and R-MAT. Int. J. Hum.-Comput. Stud., 65(5):434-445, 2007.
[8] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of 4th SIAM International Conference on Data Mining, 2004.
[9] D. G. Corneil and C. C. Gotlieb. An efficient algorithm for graph isomorphism. J. ACM, 17(1):51-64, 1970.
[10] I. Herman, G. Melançon, and M. S. Marshall. Graph visualization and navigation in information visualization: A survey. IEEE Trans. Vis. Comput. Graph., 6(1):24-43, 2000.
[11] J. Huan, W. Wang, J. Prins, and J. Yang. SPIN: Mining maximal frequent subgraphs from graph databases. In Proceedings of KDD'04, pages 581-586, 2004.
[12] M. Ley. DBLP Bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[13] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45:167-256, 2003.
[14] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113, 2004.
[15] S. Raghavan and H. Garcia-Molina. Representing Web graphs. In Proceedings of ICDE'03, pages 405-416, 2003.
[16] F. S. Roberts and L. Sheng. How hard is it to determine if a graph has a 2-role assignment? Networks, 37(2):67-73, 2001.
[17] J. F. Rodrigues, A. J. M. Traina, C. Faloutsos, and C. Traina Jr. SuperGraph visualization. In Proceedings of the 8th IEEE International Symposium on Multimedia, pages 227-234, 2006.
[18] J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Stat. Anal. Data Min., 1(1):6-22, 2008.
[19] W. Wang, C. Wang, Y. Zhu, B. Shi, J. Pei, X. Yan, and J. Han. GraphMiner: A structural pattern-mining system for large disk-based graph databases and its applications. In Proceedings of SIGMOD'05, pages 879-881, 2005.
[20] T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explor. Newsl., 5(1):59-68, 2003.
[21] D. R. White and K. P. Reitz. Graph and semigroup homomorphisms on semigroups of relations. Social Networks, 5(2):193-234, 1983.
[22] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks. In Proceedings of KDD'07, pages 824-833, 2007.
[23] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proceedings of ICDM'02, pages 721-724, 2002.
