Mining Frequent Subgraph Patterns From Uncertain Graph Data
Abstract—In many real applications, graph data is subject to uncertainties due to incompleteness and imprecision of data. Mining
such uncertain graph data is semantically different from and computationally more challenging than mining conventional exact graph
data. This paper investigates the problem of mining uncertain graph data and especially focuses on mining frequent subgraph patterns
on an uncertain graph database. A novel model of uncertain graphs is presented, and the frequent subgraph pattern mining problem is
formalized by introducing a new measure, called expected support. This problem is proved to be NP-hard. An approximate mining
algorithm is proposed to find a set of approximately frequent subgraph patterns by allowing an error tolerance on expected supports of
discovered subgraph patterns. The algorithm uses efficient methods to determine whether a subgraph pattern can be output or not and
a new pruning method to reduce the complexity of examining subgraph patterns. Analytical and experimental results show that the
algorithm is very efficient, accurate, and scalable for large uncertain graph databases. To the best of our knowledge, this paper is the
first one to investigate the problem of mining frequent subgraph patterns from uncertain graph data.
1 INTRODUCTION
full use of the apriori property, all subgraph patterns are organized into a search tree, and the search tree is traversed in a depth-first manner. During the search, if a subgraph pattern cannot be output as a result, then none of its descendants in the search tree need to be examined, thus reducing the complexity. Furthermore, a new pruning method is proposed to further reduce the complexity of examining subgraph patterns.

Extensive experiments were performed on a real uncertain graph database to evaluate the efficiency, the memory usage, the approximation quality, and the scalability of MUSE, the impact of the optimized pruning method on the efficiency of MUSE, and the impact of uncertainties on the efficiency of MUSE. The experimental results show that MUSE is very efficient, accurate, and scalable for large uncertain graph databases.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 introduces the model of uncertain graphs and formalizes the frequent subgraph pattern mining problem. Section 4 formally proves the computational complexity of the frequent subgraph pattern mining problem. Section 5 presents the approximate mining algorithm and analyzes its performance. Extensive experimental results are shown in Section 6. Finally, Section 7 concludes this paper.

2 RELATED WORK

A number of algorithms have been proposed to discover the complete set of frequent subgraph patterns from an exact graph database [10], [12], [13], [14], [15], [16], [17], [18], [19]. To reduce the number of redundant subgraph patterns, [20] proposed CloseGraph to discover frequent closed subgraph patterns, [21] developed SPIN to discover frequent maximal subgraph patterns, [22] presented RP-GD and RP-FP to summarize frequent subgraph patterns, and [23] performed Metropolis-Hastings sampling on the space of frequent subgraph patterns. In addition, some variants of the frequent subgraph pattern mining problem have been investigated, such as the discovery of frequent closed cliques [24], frequent closed quasi-cliques [25], cross-graph quasi-cliques [26], frequent topological substructures [27], frequent outerplanar subgraphs [28], and significant subgraph patterns [29], [30]. However, all these algorithms were designed only for mining exact graphs and cannot be extended to uncertain graphs.

Recent research on managing uncertain data has focused on data models and query processing [9], ranking query processing [31], skyline query processing [32], indexing [33], and data cleaning [34], among other topics. There is also some very recent work on mining uncertain data. In [35] and [36], the problem of finding frequent items in probabilistic data streams was investigated. In [37] and [38], frequent itemset mining on uncertain transactional databases was studied. In [39] and [40], uncertain data clustering algorithms were explored. However, all these algorithms focus on mining uncertain tabular data rather than graph data and cannot be carried over to uncertain graph mining.

To the best of our knowledge, there is no literature to date on mining frequent subgraph patterns from uncertain graph data. This paper is the first one to investigate this problem.

3 PROBLEM STATEMENT

3.1 Model of Uncertain Graphs

In this paper, the vertex set and the edge set of a graph G are denoted by V(G) and E(G), respectively.

Definition 1. An uncertain graph is a system G = ((V, E), Σ, L, P), where (V, E) is an undirected graph, Σ is a set of labels, L : V ∪ E → Σ is a function assigning labels to vertices and edges, and P : E → (0, 1] is a function assigning existence probability values to edges.

The existence probability, P((u, v)), of an edge (u, v) is the probability that the edge exists between vertices u and v. Specifically, P((u, v)) = 1 indicates that (u, v) definitely exists. Thus, an exact graph¹ is a special uncertain graph with existence probability 1 on every edge.

¹ A conventional labeled graph [13] is called an exact graph in this paper. It is a 3-tuple G = ((V, E), Σ, L), where (V, E) is an undirected graph, Σ is a set of labels, and L : V ∪ E → Σ is a labeling function on vertices and edges.

Unlike an exact graph, an uncertain graph implicates a set of exact graphs, each of which is a possible structure in which the uncertain graph may exist. More formally, an exact graph I = ((V', E'), Σ', L') is an implicated graph of an uncertain graph G = ((V, E), Σ, L, P), denoted by G ⇒ I, if and only if V' = V, E' ⊆ E, Σ' ⊆ Σ, and L' = L|_{V' ∪ E'}, where L|_{V' ∪ E'} denotes the function obtained by restricting L to V' ∪ E'.

For simplicity, we assume that the existence probabilities of all edges are mutually independent. This assumption has been shown to be reasonable in a range of practical applications [4], [6], [7]. Based on the independence assumption, the probability of an uncertain graph G implicating an exact graph I is

    P(G \Rightarrow I) = \prod_{e \in E(I)} P(e) \prod_{e' \in E(G) \setminus E(I)} (1 - P(e')),    (1)

where P(e) is the existence probability of edge e. Equation (1) holds because all edges in E(I) are contained in I, while none of the edges in E(G) \ E(I) are contained in I. Let I(G) denote the set of all implicated graphs of an uncertain graph G. Obviously, |I(G)| = 2^{|E(G)|}.

An uncertain graph database is a set of uncertain graphs. It essentially represents a set of implicated graph databases. More formally, an implicated graph database of an uncertain graph database D = {G_1, G_2, ..., G_n} is a set of exact graphs d = {I_1, I_2, ..., I_n} such that G_i ⇒ I_i for 1 ≤ i ≤ n. The set of all implicated graph databases of D is denoted by I(D). Obviously, |I(D)| = ∏_{i=1}^{n} 2^{|E(G_i)|}. Assuming that the uncertain graphs in an uncertain graph database are mutually independent, the probability of an exact graph database d = {I_1, I_2, ..., I_n} being implicated by an uncertain graph database D = {G_1, G_2, ..., G_n} is

    P(D \Rightarrow d) = \prod_{i=1}^{n} P(G_i \Rightarrow I_i),    (2)

where P(G_i ⇒ I_i) is the probability of G_i implicating I_i as given in (1).
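To make the implicated-graph semantics concrete, here is a minimal Python sketch (ours, not from the paper) that enumerates the implicated graphs of a small uncertain graph and evaluates (1) for each of them; the three-edge graph and its probabilities are made-up illustrative values.

```python
from itertools import combinations

def implication_probability(edge_probs, present_edges):
    """P(G => I) per Eq. (1): present edges contribute P(e), absent edges 1 - P(e)."""
    p = 1.0
    for e, pe in edge_probs.items():
        p *= pe if e in present_edges else (1.0 - pe)
    return p

def implicated_graphs(edge_probs):
    """Enumerate all 2^|E(G)| implicated graphs as (edge subset, probability) pairs."""
    edges = list(edge_probs)
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            yield frozenset(subset), implication_probability(edge_probs, frozenset(subset))

# Hypothetical uncertain graph: 3 edges with made-up existence probabilities.
G = {("u", "v"): 0.5, ("v", "w"): 0.6, ("u", "w"): 0.7}
dist = list(implicated_graphs(G))
assert len(dist) == 2 ** len(G)                     # |I(G)| = 2^{|E(G)|}
assert abs(sum(p for _, p in dist) - 1.0) < 1e-9    # probabilities sum to 1
```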
We have the following theorem on the semantics of uncertain graphs:

Theorem 1. Let G be an uncertain graph and I(G) be the set of all implicated graphs of G. The function P_G : 2^{I(G)} → ℝ, defined as P_G(X) = Σ_{I ∈ X} P(G ⇒ I), is a probability function, where 2^{I(G)} is the power set of I(G).

Proof. Let I(G) be the sample space. Trivially, 2^{I(G)} is a σ-algebra of the subsets of I(G). So, we prove the theorem by showing that P_G satisfies the probability axioms [41].

First, it is evident that P_G(X) ≥ 0 for any X ∈ 2^{I(G)}.

Second, in order to prove P_G(I(G)) = 1, we claim that P_G(I(G)) = P_{G-e}(I(G-e)) for any edge e of G, where G-e denotes the uncertain graph obtained by removing e from G. Suppose E(G) = {e_1, e_2, ..., e_n}, where n = |E(G)|, and let G_i be the uncertain graph obtained by removing edges e_1, e_2, ..., e_i from G for 1 ≤ i ≤ n. We have P_G(I(G)) = P_{G_1}(I(G_1)) = ··· = P_{G_n}(I(G_n)) by the claim. Since G_n contains no edges, we have I(G_n) = {G_n}, thus P_{G_n}(I(G_n)) = P(G_n ⇒ G_n) = 1. Now, we prove the above claim. For any edge e ∈ E(G),

    \sum_{I \in I(G) \,|\, e \in E(I)} P(G \Rightarrow I)
        = P(e) \sum_{I \in I(G) \,|\, e \in E(I)} \prod_{e' \in E(I) \setminus \{e\}} P(e') \left( \prod_{e'' \in E(G) \setminus E(I)} (1 - P(e'')) \right)
        = P(e) \sum_{I \in I(G-e)} P(G-e \Rightarrow I)
        = P(e) \, P_{G-e}(I(G-e)),    (3)

and similarly

    \sum_{I \in I(G) \,|\, e \notin E(I)} P(G \Rightarrow I) = (1 - P(e)) \, P_{G-e}(I(G-e)).    (4)

Summing up (3) and (4), we justify the claim.

Third, due to the definition of P_G, for any sequence of pairwise disjoint subsets X_1, X_2, ..., X_k of I(G), P_G(X_1 ∪ X_2 ∪ ··· ∪ X_k) = Σ_{i=1}^{k} P_G(X_i).

Thus, the proof is completed. □

For any singleton set X = {I} ∈ 2^{I(G)}, P_G(X) = P(G ⇒ I). Thus, the probability distribution over all implicated graphs of G can be denoted by P(G ⇒ I).

We also have the theorem below on the semantics of uncertain graph databases.

Theorem 2. Let D be an uncertain graph database and I(D) be the set of all implicated graph databases of D. The function P_D : 2^{I(D)} → ℝ, defined as P_D(X) = Σ_{d ∈ X} P(D ⇒ d), is a probability function, where 2^{I(D)} denotes the power set of I(D).

Proof. The proof of Theorem 2 is very similar to the proof of Theorem 1.

First, P_D(X) ≥ 0 for every X ∈ 2^{I(D)}.

Second, P_D(I(D)) = 1. To prove this, we show that for any uncertain graph G ∈ D,

    P_D(I(D)) = \sum_{d \in I(D)} P(D \Rightarrow d)
              = \sum_{I \in I(G)} \sum_{d \in I(D) \,|\, I \in d} P(D \Rightarrow d)
              = \sum_{I \in I(G)} P(G \Rightarrow I) \sum_{d \in I(D \setminus \{G\})} P(D \setminus \{G\} \Rightarrow d)
              = P_{D \setminus \{G\}}(I(D \setminus \{G\})),

where the last equality holds because Σ_{I ∈ I(G)} P(G ⇒ I) = 1. Thus, we have P_D(I(D)) = P_{D \ {G_1}}(I(D \ {G_1})) = P_{D \ {G_1, G_2}}(I(D \ {G_1, G_2})) = ··· = P_∅(I(∅)) = 1.

Third, for any sequence of pairwise disjoint subsets X_1, X_2, ..., X_k of I(D), we have P_D(X_1 ∪ X_2 ∪ ··· ∪ X_k) = Σ_{i=1}^{k} P_D(X_i).

Thus, the theorem holds. □

Note that for any singleton set X = {d} ∈ 2^{I(D)}, P_D(X) = P(D ⇒ d). Thus, the probability distribution over all implicated graph databases of D can be denoted by P(D ⇒ d).

Let us see the example of the uncertain graph database D = {G_1, G_2} shown in Fig. 2. G_1 represents the probability distribution over the 16 implicated graphs of G_1 as shown in Fig. 3. G_2 represents a probability distribution over the eight implicated graphs of G_2. Thus, D represents a probability distribution over all 128 implicated graph databases of D.

3.2 Frequent Subgraph Pattern Mining Problem

Definition 2. An exact graph G = (V, E, Σ, L) is subgraph isomorphic to another exact graph G' = (V', E', Σ', L'), denoted by G ⊑_ex G', if there exists an injection f : V → V' such that

1. L(v) = L'(f(v)) for every v ∈ V,
2. (f(u), f(v)) ∈ E' for every (u, v) ∈ E, and
3. L((u, v)) = L'((f(u), f(v))) for every (u, v) ∈ E.

The injection f is called a subgraph isomorphism from G to G'. The subgraph (V'', E'') of G' with V'' = {f(v) | v ∈ V} and E'' = {(f(u), f(v)) | (u, v) ∈ E} is called the embedding of G in G' under f.
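As a concrete reading of Definition 2, the following Python sketch (ours, not the paper's) enumerates the embeddings of a small labeled pattern in a labeled exact graph by naive backtracking over injections; the tiny pattern and triangle graph are made up, and the routine returns embeddings rather than isomorphisms, since distinct isomorphisms can share an embedding.

```python
def embeddings(pattern, graph):
    """Enumerate the embeddings of `pattern` in `graph` (Definition 2).
    A graph is a pair (vertex_labels: {v: label}, edge_labels: {frozenset({u, v}): label}).
    The result is the set of embedded edge sets."""
    p_vertices = list(pattern[0])
    found = set()

    def extend(i, mapping):
        if i == len(p_vertices):
            found.add(frozenset(frozenset(mapping[x] for x in e) for e in pattern[1]))
            return
        v = p_vertices[i]
        for w in graph[0]:
            if w in mapping.values() or pattern[0][v] != graph[0][w]:
                continue  # not injective, or vertex labels differ
            mapping[v] = w
            ok = True
            for e, lab in pattern[1].items():  # check pattern edges already fully mapped
                a, b = tuple(e)
                if a in mapping and b in mapping:
                    g_edge = frozenset({mapping[a], mapping[b]})
                    if g_edge not in graph[1] or graph[1][g_edge] != lab:
                        ok = False
                        break
            if ok:
                extend(i + 1, mapping)
            del mapping[v]

    extend(0, {})
    return found

# Made-up example: one labeled edge pattern inside a labeled triangle.
pattern = ({"x": "A", "y": "B"}, {frozenset({"x", "y"}): "e"})
graph = ({"1": "A", "2": "B", "3": "B"},
         {frozenset({"1", "2"}): "e", frozenset({"1", "3"}): "e", frozenset({"2", "3"}): "e"})
print(len(embeddings(pattern, graph)))  # 2 embeddings of the pattern
```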
In traditional frequent subgraph pattern mining [10], a subgraph pattern is defined as a connected subgraph that is subgraph isomorphic to at least one exact graph in the input exact graph database, and the support of a subgraph pattern S in an exact graph database D is formulated as

    sup_D(S) = |{G | S ⊑_ex G and G ∈ D}| / |D|.    (5)

However, such concepts don't make sense in uncertain graph databases, since an exact subgraph is embedded in an uncertain graph only in a probabilistic sense. Hence, these concepts should be redefined in the context of uncertain graph databases.

Definition 3. A connected exact graph S is a subgraph pattern in an uncertain graph database D if S is subgraph isomorphic to at least one implicated graph in some implicated graph database of D. Let S and S' be two subgraph patterns. S is a subpattern of S', or S' is a superpattern of S, if S ⊑_ex S'. S is a direct subpattern of S', or S' is a direct superpattern of S, if S ⊑_ex S' and |E(S)| + 1 = |E(S')|.

Definition 4. Given an uncertain graph database D, let I(D) be the set of all implicated graph databases of D. The support of a subgraph pattern S in D is a random variable over I(D) with probability distribution

    Pr(sup_D(S) = s_i) = P(s_i),
where s_i = sup_d(S) is the support value of S in some implicated graph database d ∈ I(D), m = |{sup_d(S) | d ∈ I(D)}| is the number of distinct support values of S in all implicated graph databases of D, and P(s_i) = Σ_{d ∈ I(D) and sup_d(S) = s_i} P(D ⇒ d) is the probability of S having support value s_i across all implicated graph databases of D, for 1 ≤ i ≤ m.

Definition 5. Let the support of a subgraph pattern S in an uncertain graph database D be a random variable as given in Definition 4. The expected support of S in D is defined as

    esup_D(S) = \sum_{i=1}^{m} s_i P(s_i) = \sum_{d \in I(D)} sup_d(S) P(D \Rightarrow d).    (6)

A subgraph pattern S is frequent in an uncertain graph database D if the expected support of S in D is no less than a threshold minsup ∈ [0, 1] specified by users. Then, the problem of discovering frequent subgraph patterns from an uncertain graph database can be stated as follows:

Input: an uncertain graph database D and an expected support threshold minsup.
Output: the set of all frequent subgraph patterns in D, i.e., {S | S is a subgraph pattern in D, and esup_D(S) ≥ minsup}.

4 COMPLEXITY OF FREQUENT SUBGRAPH PATTERN MINING PROBLEM

This section formally proves the computational complexity of the frequent subgraph pattern mining problem to be solved in this paper. Before that, we first reformulate the concept of expected support given in Definition 5.

Given an uncertain graph database D, a subgraph pattern S in D is said to occur in an uncertain graph G ∈ D, denoted by S ⊑_U G, if S is subgraph isomorphic to at least one implicated graph of G. The probability of S occurring in G is

    P(S ⊑_U G) = \sum_{I \in I(G)} P(G \Rightarrow I) \, \phi(I, S),    (7)

where I(G) is the set of all implicated graphs of G, P(G ⇒ I) is the probability of G implicating I given in (1), and φ(I, S) = 1 if S is subgraph isomorphic to I and φ(I, S) = 0 otherwise. Then, (6) can be rewritten as

    esup_D(S) = \sum_{d \in I(D)} sup_d(S) P(D \Rightarrow d)
              = \sum_{d = \{I_1, I_2, \ldots, I_{|D|}\} \in I(D)} \frac{P(D \Rightarrow d)}{|D|} \left( \sum_{i=1}^{|D|} \phi(I_i, S) \right).

By swapping the order of the summations and grouping the inner summands by the i-th implicated graph, we have

    esup_D(S) = \frac{1}{|D|} \sum_{i=1}^{|D|} \sum_{I \in I(G_i)} \sum_{d \in I(D) \,|\, I \in d} \phi(I, S) P(D \Rightarrow d)
              = \frac{1}{|D|} \sum_{i=1}^{|D|} \sum_{I \in I(G_i)} \phi(I, S) P(G_i \Rightarrow I) \sum_{d \in I(D \setminus \{G_i\})} P(D \setminus \{G_i\} \Rightarrow d)
              = \frac{1}{|D|} \sum_{i=1}^{|D|} \sum_{I \in I(G_i)} \phi(I, S) P(G_i \Rightarrow I)
              = \frac{1}{|D|} \sum_{i=1}^{|D|} P(S ⊑_U G_i),    (8)

where the second equality holds because Σ_{d ∈ I(D \ {G_i})} P(D \ {G_i} ⇒ d) = 1, as argued in the proof of Theorem 1. Hence, esup_D(S) can be readily computed using (8) instead of (6), since |D| in (8) is substantially smaller than |I(D)| in (6).
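As a quick sanity check on this reformulation, the following Python sketch (ours) computes the expected support of a pattern in two ways: by enumerating implicated graph databases as in (6), and by averaging occurrence probabilities as in (8). For brevity the subgraph isomorphism test is replaced by an edge-subset check, which is only adequate when the pattern's vertices are already identified with the graph's vertices; the two-graph database is made up.

```python
from itertools import combinations, product

def implicated(edge_probs):
    """All (edge subset, probability) pairs of one uncertain graph, per Eq. (1)."""
    edges = list(edge_probs)
    out = []
    for k in range(len(edges) + 1):
        for sub in combinations(edges, k):
            p = 1.0
            for e in edges:
                p *= edge_probs[e] if e in sub else 1.0 - edge_probs[e]
            out.append((frozenset(sub), p))
    return out

def occurs(pattern_edges, impl_edges):
    # Stand-in for the subgraph isomorphism test phi(I, S); pattern vertices
    # are assumed to coincide with graph vertices here.
    return pattern_edges <= impl_edges

def esup_by_eq6(db, pattern):
    """Enumerate all implicated graph databases d and apply Eq. (6)."""
    total = 0.0
    for d in product(*(implicated(g) for g in db)):
        prob_d = 1.0
        hits = 0
        for impl, p in d:
            prob_d *= p
            hits += occurs(pattern, impl)
        total += (hits / len(db)) * prob_d
    return total

def esup_by_eq8(db, pattern):
    """Average the occurrence probabilities over the database, per Eq. (8)."""
    occ = [sum(p for impl, p in implicated(g) if occurs(pattern, impl)) for g in db]
    return sum(occ) / len(db)

# Made-up database of two small uncertain graphs and a one-edge pattern.
D = [{("a", "b"): 0.9, ("b", "c"): 0.4}, {("a", "b"): 0.3}]
S = frozenset({("a", "b")})
assert abs(esup_by_eq6(D, S) - esup_by_eq8(D, S)) < 1e-9   # both give 0.6
```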
Theorem 3. It is #P-hard to compute the probability of a subgraph pattern occurring in an uncertain graph.

Proof. We prove the theorem by reducing the #P-complete DNF counting problem [11] to the problem of computing the probability, P(S ⊑_U G), of a subgraph pattern S occurring in an uncertain graph G.

The DNF counting problem can be stated as follows: Let F = C_1 ∨ C_2 ∨ ··· ∨ C_n be a boolean formula in disjunctive normal form (DNF) on m boolean variables x_1, x_2, ..., x_m. Each clause C_i is of the form C_i = l_1 ∧ l_2 ∧ ··· ∧ l_k, where each l_j is a boolean variable in {x_1, x_2, ..., x_m}. Let Pr(x_i) be the probability of x_i being assigned true. The DNF counting problem is to compute the probability of F being satisfied by a randomly and independently chosen truth assignment to the variables, denoted by Pr(F). Given an instance of the DNF counting problem, an instance of the problem of computing P(S ⊑_U G) can be constructed as follows:

First, construct an uncertain graph G. The vertex set of G is V(G) = {c_1, c_2, ..., c_n, u_1, u_2, ..., u_m, v_1, v_2, ..., v_m}. The vertices c_1, c_2, ..., c_n all carry one label, and the vertices u_1, u_2, ..., u_m, v_1, v_2, ..., v_m all carry a second, distinct label. The edge set of G is constructed as follows: For each variable x_i in the DNF formula F, add an edge (u_i, v_i), associated with existence probability Pr(x_i), to G, where 1 ≤ i ≤ m. For each variable x_j in clause C_i, add an edge (c_i, u_j), associated with existence probability 1, to G, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. All edges of G carry the same label.

Next, construct a subgraph pattern S. The vertex set of S is V(S) = {c', u'_1, u'_2, ..., u'_k, v'_1, v'_2, ..., v'_k}. The label of c' is that of the c_i vertices, and the labels of u'_1, ..., u'_k, v'_1, ..., v'_k are all that of the u_i and v_i vertices. The edge set of S is constructed as E(S) = {(c', u'_1), (c', u'_2), ..., (c', u'_k), (u'_1, v'_1), (u'_2, v'_2), ..., (u'_k, v'_k)}. All edges of S carry the same label as the edges of G.

For example, given a DNF formula (x_1 ∧ x_2 ∧ x_3) ∨ (x_2 ∧ x_3 ∧ x_4) and the probabilities Pr(x_1), Pr(x_2), ..., Pr(x_4) of the variables x_1, x_2, ..., x_4 being assigned true, the uncertain graph G and the subgraph pattern S constructed from the DNF formula are shown in Fig. 4, where the labels of the edges are omitted for clarity.

Each truth assignment to the variables in F one-to-one corresponds to an implicated graph of G, i.e., edge (u_i, v_i) exists in the implicated graph if and only if x_i = true. The probability of each truth assignment is equal to the probability of the implicated graph that the truth assignment corresponds to. A truth assignment satisfies F if and only if the implicated graph that the truth assignment corresponds to contains subgraph pattern S. Thus, Pr(F) is equal to the probability, P(S ⊑_U G), of S occurring in G. This completes the polynomial time reduction. The theorem thus holds. □
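To make the reduction in the proof concrete, here is a small Python sketch (ours, not the paper's) that builds the uncertain graph G and the pattern S from a k-DNF formula exactly as described above; the label strings "clause" and "var" and the 0.5 probabilities in the example are arbitrary placeholders.

```python
def dnf_to_instance(clauses, var_prob):
    """Build (G, S) from a k-DNF formula, as in the proof of Theorem 3.
    clauses: list of clauses, each a list of k variable names.
    var_prob: dict mapping each variable to its probability of being true."""
    k = len(clauses[0])
    # Uncertain graph G: vertex labels, and edges with existence probabilities.
    labels = {f"c{i}": "clause" for i in range(len(clauses))}
    labels.update({f"u_{x}": "var" for x in var_prob})
    labels.update({f"v_{x}": "var" for x in var_prob})
    edges = {(f"u_{x}", f"v_{x}"): p for x, p in var_prob.items()}   # prob Pr(x)
    for i, clause in enumerate(clauses):
        for x in clause:
            edges[(f"c{i}", f"u_{x}")] = 1.0                          # certain edges
    G = (labels, edges)
    # Pattern S: one clause-labeled center c' with k pendant u'-v' paths.
    s_labels = {"c'": "clause"}
    s_edges = {}
    for j in range(k):
        s_labels[f"u'{j}"] = s_labels[f"v'{j}"] = "var"
        s_edges[("c'", f"u'{j}")] = None
        s_edges[(f"u'{j}", f"v'{j}")] = None
    return G, (s_labels, s_edges)

# The example from the proof: (x1 ^ x2 ^ x3) v (x2 ^ x3 ^ x4), illustrative probabilities.
G, S = dnf_to_instance([["x1", "x2", "x3"], ["x2", "x3", "x4"]],
                       {"x1": 0.5, "x2": 0.5, "x3": 0.5, "x4": 0.5})
# Pr(F) then equals P(S occurs in G), so computing the latter is #P-hard.
```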
The rest of this section is organized as follows: Section 5.2 proposes an exact algorithm and an approximation algorithm to compute expected supports. Section 5.3 develops a new pruning technique to reduce the complexity of pruning search trees. Section 5.4 presents the complete description of the approximate mining algorithm.

5.2 Algorithms for Computing Expected Supports

Equation (8) shows that the expected support of a subgraph pattern S in an uncertain graph database D can be computed by averaging the probability of S occurring in every uncertain graph G ∈ D, i.e., P(S ⊑_U G). However, Theorem 3 shows that it is #P-hard to compute P(S ⊑_U G). To deal with this challenge, we propose an exact algorithm to compute P(S ⊑_U G) exactly for small instances of the problem, as well as an approximation algorithm to approximate P(S ⊑_U G) by an interval for large instances of the problem.

5.2.1 Fundamental Technique

To compute P(S ⊑_U G) by its rigorous definition, i.e., (7), we must compute the probability distribution over all 2^{|E(G)|} implicated graphs of G and perform 2^{|E(G)|} subgraph isomorphism tests from S to all implicated graphs of G, which is intractable even if G is of small size, e.g., 30 edges. To reduce the complexity, we propose a new approach to compute P(S ⊑_U G) based on all embeddings² of S in G.

The fundamental technique of the new approach is to transform the problem of computing P(S ⊑_U G) into the DNF counting problem. Particularly, let {S_1, S_2, ..., S_n} be the set of all embeddings of S in the exact graph ((V(G), E(G)), Σ(G), L(G)), i.e., the exact graph obtained by removing the uncertainties from G. Let the edge set of each embedding S_i be E(S_i) = {e_{i_1}, e_{i_2}, ..., e_{i_{|E(S)|}}}, where each subscript i_j is in {1, 2, ..., |E(G)|}. Note that all embeddings contain the same number of edges, |E(S)|. The DNF counting problem is constructed as follows:

Step 1. For each edge e_j in the embeddings, create a boolean variable x_j. The probability, Pr(x_j), of x_j being assigned true is equal to the existence probability, P(e_j), of edge e_j.

Step 2. For each embedding S_i, construct a conjunctive clause C_i = x_{i_1} ∧ x_{i_2} ∧ ··· ∧ x_{i_{|E(S)|}}, where x_{i_j} is the boolean variable created in step 1 for edge e_{i_j} ∈ E(S_i).

Step 3. The output DNF formula F is the disjunction of all conjunctive clauses constructed for all n embeddings in step 2, i.e., F = (x_{1_1} ∧ x_{1_2} ∧ ··· ∧ x_{1_{|E(S)|}}) ∨ ··· ∨ (x_{n_1} ∧ x_{n_2} ∧ ··· ∧ x_{n_{|E(S)|}}).

The construction of F can be done in O(n|E(S)|) time using a hash table to store the variable created for each edge, where n is the number of embeddings of S in G, and |E(S)| is the number of edges in S. It is easy to prove that P(S ⊑_U G) is equal to the probability of F being satisfied by a randomly and independently chosen truth assignment to the variables in F, denoted as Pr(F). Thus, the problem of computing P(S ⊑_U G) is transformed into the problem of computing Pr(F).

For example, consider the uncertain graph G_1 and the subgraph pattern S in Fig. 2. There are four embeddings of S in G_1, so we construct four variables x_1, x_2, x_3, and x_4 for edges (v_1, v_2), (v_1, v_3), (v_1, v_4), and (v_1, v_5) of G_1, respectively. The probabilities of x_1, x_2, x_3, and x_4 being assigned true are Pr(x_1) = 0.5, Pr(x_2) = 0.6, Pr(x_3) = 0.7, and Pr(x_4) = 0.8, respectively. The DNF formula constructed is thus F = (x_1 ∧ x_2) ∨ (x_1 ∧ x_4) ∨ (x_2 ∧ x_3) ∨ (x_3 ∧ x_4).

Note that if F can be divided into several disjoint DNF subformulas F_1, F_2, ..., F_k such that F = F_1 ∨ F_2 ∨ ··· ∨ F_k and that F_i and F_j do not contain any common variables for 1 ≤ i, j ≤ k and i ≠ j, then we can first compute Pr(F_i) for each subformula F_i and then compute Pr(F) by Pr(F) = 1 − ∏_{i=1}^{k} (1 − Pr(F_i)). Hence, without loss of generality, we assume that F cannot be divided into disjoint subformulas in the following discussion. On the basis of this technique, we develop an exact algorithm and an approximation algorithm to compute P(S ⊑_U G) in the rest of this section.

2. See Definition 2. Note that the number of all embeddings of S in G is no greater than the number of all subgraph isomorphisms from S to G, since two distinct subgraph isomorphisms may map S to the same subgraph of G.
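The sketch below (ours, not the paper's) mirrors Steps 1-3: it builds F from a list of embeddings keyed by edge, then checks Pr(F) by brute-force enumeration over truth assignments. The embedding edge lists are inferred from the formula F stated above; with the stated probabilities the result is Pr(F) = P(S ⊑_U G_1) = 0.782.

```python
from itertools import product

def build_dnf(embeddings, edge_prob):
    """Steps 1-3: one boolean variable per distinct edge, one clause per embedding.
    Returns (clauses as lists of variable ids, Pr(x) per variable id)."""
    var = {}                      # hash table: edge -> variable id
    for emb in embeddings:
        for e in emb:
            var.setdefault(e, len(var))
    clauses = [[var[e] for e in emb] for emb in embeddings]
    prob = {var[e]: edge_prob[e] for e in var}
    return clauses, prob

def pr_dnf_bruteforce(clauses, prob):
    """Pr(F) by enumerating all truth assignments (exponential; for checking only)."""
    total = 0.0
    for bits in product([False, True], repeat=len(prob)):
        p = 1.0
        for i, b in enumerate(bits):
            p *= prob[i] if b else 1.0 - prob[i]
        if any(all(bits[x] for x in c) for c in clauses):
            total += p
    return total

# Running example: four embeddings of S in G_1, one clause per embedding.
edge_prob = {"v1v2": 0.5, "v1v3": 0.6, "v1v4": 0.7, "v1v5": 0.8}
embeddings = [["v1v2", "v1v3"], ["v1v2", "v1v5"], ["v1v3", "v1v4"], ["v1v4", "v1v5"]]
clauses, prob = build_dnf(embeddings, edge_prob)
print(round(pr_dnf_bruteforce(clauses, prob), 4))  # 0.782 under these probabilities
```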
5.2.2 Exact Algorithm

To compute P(S ⊑_U G) exactly, we first construct a DNF formula F = C_1 ∨ C_2 ∨ ··· ∨ C_n using the method given above. Then, by the inclusion-exclusion principle [41], the probability of F being satisfied can be computed by

    Pr(F) = \sum_{1 \le i \le n} Pr(C_i) - \sum_{1 \le i < j \le n} Pr(C_i \wedge C_j) + \cdots + (-1)^{n-1} \sum_{1 \le i_1 < i_2 < \cdots < i_n \le n} Pr(C_{i_1} \wedge C_{i_2} \wedge \cdots \wedge C_{i_n}),    (9)

where Pr(C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j}) is the probability of C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j} being satisfied. Since C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j} is satisfied if and only if all variables in it are assigned true, we have

    Pr(C_{i_1} \wedge C_{i_2} \wedge \cdots \wedge C_{i_j}) = \prod_{x} Pr(x),    (10)

where x ranges over all variables in C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j}. In the following discussion, we call the set of all variables in a formula f the domain of f, denoted by dom(f).

To improve the efficiency of computing (9), we propose a method to reduce the time complexity of computing Pr(C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j}) based on the following proposition:

Proposition 1. For a conjunctive formula C_1 ∧ C_2 ∧ ··· ∧ C_k, where each C_i is a conjunction of n variables, if there exist α and β ∈ {1, 2, ..., k} with α ≠ β such that there are no common variables in C_α and C_β, then

    Pr(C_1 \wedge C_2 \wedge \cdots \wedge C_k) = P_1 P_2 / P_3,    (11)

where

    P_1 = Pr(C_1 \wedge \cdots \wedge C_{\alpha-1} \wedge C_{\alpha+1} \wedge \cdots \wedge C_k),
    P_2 = Pr(C_1 \wedge \cdots \wedge C_{\beta-1} \wedge C_{\beta+1} \wedge \cdots \wedge C_k),
    P_3 = Pr(C_1 \wedge \cdots \wedge C_{\alpha-1} \wedge C_{\alpha+1} \wedge \cdots \wedge C_{\beta-1} \wedge C_{\beta+1} \wedge \cdots \wedge C_k).
exactly compute P(S ⊑_U G), as shown in Fig. 8. The input of Exact-Occ-Prob is a subgraph pattern S, an uncertain graph G, and the set, {S_1, S_2, ..., S_n}, of all embeddings of S in G. The algorithm works in a "bottom-up" fashion, which consists of five steps.

Step 1. Construct the DNF formula F = C_1 ∨ C_2 ∨ ··· ∨ C_n based on {S_1, S_2, ..., S_n} using the method presented in Section 5.2.1, and let p = 0.

Step 2. For each clause C_i = x_{i_1} ∧ x_{i_2} ∧ ··· ∧ x_{i_{|E(S)|}} in F, compute Pr(C_i) by ∏_{j=1}^{|E(S)|} Pr(x_{i_j}) and add Pr(C_i) to p.

Step 3. If n = 1, then output p as P(S ⊑_U G) and terminate; otherwise let j = 2.

Step 4. For every i_1, i_2, ..., i_j such that 1 ≤ i_1 < i_2 < ··· < i_j ≤ n, if there exist α and β ∈ {1, 2, ..., j} such that C_{i_α} and C_{i_β} do not share any common variables, then compute Pr(C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j}) by (11) in O(1) time; otherwise compute it by (10). Then, add (−1)^{j−1} Pr(C_{i_1} ∧ C_{i_2} ∧ ··· ∧ C_{i_j}) to p.

complete subgraphs [43]. Thus, the number of terms in (9) that are computed in O(1) time in Exact-Occ-Prob is at least (2^n − 1) − (2^{n'} + 2^{m'} + n − n' − 2). Since there are 2^n − 1 terms in (9), the value of the performance metric is at least

    1 - \frac{2^{n'} + 2^{m'} + n - n' - 2}{2^n - 1}.

Thus, the theorem holds. □

Theorem 5 shows that a substantially large fraction of the terms in (9) can be computed in O(1) time in Exact-Occ-Prob. For example, Fig. 9 shows the value of (12) while m varies from 0 to 45 and n is fixed to 10. We can see that the value of (12) remains very high. Moreover, since (12) is a lower bound on the performance metric, the actual performance can be even higher.

The time complexity of Exact-Occ-Prob is analyzed as follows: Line 1 takes O(n|E(S)|) time to construct the DNF formula F. Line 4 needs O(|E(S)|) time to compute Pr(C_i)
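For reference, the following Python sketch (ours, a simplification rather than the paper's Fig. 8 pseudocode) evaluates (9) bottom-up over clause subsets: each term comes from (10), or is reused from smaller, already memoized terms via Proposition 1 when the subset contains two variable-disjoint clauses. On the running example it reproduces Pr(F) = 0.782.

```python
from itertools import combinations

def pr_dnf_inclusion_exclusion(clauses, prob):
    """Pr(F) via Eq. (9); `clauses` are sets of variable ids, `prob` maps id -> Pr(x)."""
    doms = [frozenset(c) for c in clauses]
    memo = {frozenset(): 1.0}                        # Pr of an empty conjunction
    for c in range(len(doms)):
        memo[frozenset([c])] = 1.0
        for x in doms[c]:
            memo[frozenset([c])] *= prob[x]
    answer = sum(memo[frozenset([c])] for c in range(len(doms)))
    for size in range(2, len(doms) + 1):
        for subset in combinations(range(len(doms)), size):
            key = frozenset(subset)
            pair = next(((a, b) for a, b in combinations(subset, 2)
                         if doms[a].isdisjoint(doms[b])), None)
            if pair is not None:                     # Proposition 1: O(1) reuse
                a, b = pair
                term = memo[key - {a}] * memo[key - {b}] / memo[key - {a, b}]
            else:                                    # Eq. (10): product over the union
                union = frozenset().union(*(doms[c] for c in subset))
                term = 1.0
                for x in union:
                    term *= prob[x]
            memo[key] = term
            answer += (-1) ** (size - 1) * term
    return answer

# Same running example as above: Pr(F) = 0.782.
clauses = [{0, 1}, {0, 3}, {1, 2}, {2, 3}]
prob = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8}
print(round(pr_dnf_inclusion_exclusion(clauses, prob), 4))   # 0.782
```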
TABLE 1: Summary of Real Uncertain Graph Database

Fig. 13. Execution time of MUSE, MUSE-Approx, and MUSE-Apriori with respect to (a) minsup, (b) ε, and (c) δ.

minsup and the parameters ε and δ. Fig. 13a shows the execution time of MUSE while minsup varies from 0.2 to 0.4, ε = 0.1, and δ = 0.1. The execution time decreases substantially as minsup increases. This is because the number of output frequent subgraph patterns decreases rapidly as minsup becomes larger. Fig. 13b shows the execution time of MUSE while ε varies from 0.01 to 0.3, minsup = 0.3, and δ = 0.1. The execution time decreases rapidly as ε increases. The reason is that the time spent by Approx-Occ-Prob decreases quadratically with ε, as analyzed in Section 5.2.3. Fig. 13c shows the execution time of MUSE while δ varies from 0.01 to 0.3, minsup = 0.3, and ε = 0.1. The execution time decreases rapidly as δ increases. This is because the time complexity of Approx-Occ-Prob is proportional to ln(2/δ), as analyzed in Section 5.2.3.

The experimental results also show that MUSE outperforms both of its competitors, MUSE-Approx and MUSE-Apriori. This verifies two statements made in the previous sections: 1) the approximation algorithm for computing expected supports is less efficient than the exact algorithm when the number of embeddings of a subgraph pattern in an uncertain graph is small; and 2) the optimized pruning method can reduce the time complexity of MUSE.

6.2 Memory Usage of MUSE

We then investigated the memory usage of MUSE with respect to minsup, ε, and δ. Fig. 14 shows the memory usage of MUSE, MUSE-Approx, and MUSE-Apriori in megabytes (MB) measured in the previous experiment. In Fig. 14a, the memory usage of MUSE decreases while minsup increases because the number of output frequent subgraph patterns decreases rapidly as minsup becomes larger, thus requiring less memory to bookkeep subgraph isomorphisms. In Figs. 14b and 14c, the memory usage of MUSE decreases as ε and δ increase, respectively. The reason is that for larger ε and δ, the approximation algorithm for computing expected supports is more likely to outperform the exact algorithm according to (14). Since the approximation algorithm is more space efficient than the exact algorithm, the memory usage of MUSE decreases.

The experimental results also show that MUSE outperforms MUSE-Apriori, and that MUSE-Approx outperforms MUSE in terms of memory usage. This is because 1) due to the optimized pruning method, MUSE need not compute the probability of a subgraph pattern occurring in some uncertain graphs, thus requiring less memory than MUSE-Apriori; and 2) MUSE-Approx only uses the approximation algorithm for computing expected supports, which is more space efficient than the exact algorithm. Although MUSE-Approx is more efficient than MUSE in memory usage, it is less efficient than MUSE in execution time.

6.3 Approximation Quality of MUSE

Since MUSE is an approximate mining algorithm, we evaluated its approximation quality with respect to ε and δ on the real uncertain graph database. The approximation quality is measured by the precision and recall metrics. Precision is the percentage of true frequent subgraph patterns among the output subgraph patterns. Recall is the percentage of the true frequent subgraph patterns that are returned. Since it is NP-hard to find all
Fig. 14. Memory usage of MUSE, MUSE-Approx, and MUSE-Apriori with respect to (a) minsup, (b) ε, and (c) δ.
Fig. 17. Number of pruned subgraph patterns for MUSE and MUSE-Apriori.

Fig. 15. Approximation quality of MUSE with respect to (a) ε and (b) δ.
true frequent subgraph patterns, we regarded the subgraph patterns discovered under ε = 0.01 and δ = 0.01 as the true frequent subgraph patterns.

Fig. 15a shows the details of the output subgraph patterns while ε varies from 0.01 to 0.3, δ = 0.1, and minsup = 0.3. Each percentage above in the figure indicates the precision, and each percentage below indicates the recall. We can see that the precision of MUSE decreases and that the recall remains stable as ε increases. This is because 1) when ε becomes larger, more false frequent subgraph patterns will be returned, reducing the precision; and 2) when δ is fixed, the probability of a frequent subgraph pattern being returned is also fixed, so the number of output true frequent subgraph patterns does not change significantly.

Fig. 15b shows the experimental results while δ varies from 0.01 to 0.3, ε = 0.1, and minsup = 0.3. The precision remains stable but the recall decreases as δ increases. The reason is that 1) the fixed ε determines the expected number of false frequent subgraph patterns to be returned, so the precision remains stable; and 2) as δ increases, the probability of a frequent subgraph pattern being output decreases, so the number of returned true frequent subgraph patterns decreases, reducing the recall. All experimental results verify that MUSE can have very high approximation quality.

6.4 Scalability of MUSE

We also examined the scalability of MUSE with respect to the number of uncertain graphs in an uncertain graph database. We controlled the number of uncertain graphs by duplicating the uncertain graphs in the real uncertain graph database. Fig. 16 shows the execution time and the memory usage of MUSE on the duplicated real uncertain graph database while the number of duplications varies from 1 to 10, minsup = 0.3, ε = 0.1, and δ = 0.1. Both the execution time and the memory usage of MUSE increase linearly with the number of duplications. The experimental results show that MUSE is very scalable for large uncertain graph databases.

6.5 Effect of Optimized Pruning Method

We have evaluated the effect of the optimized pruning method in the previous experiments. Here, we investigate it in more detail. The pruned infrequent subgraph patterns can be classified into three groups according to the method used: 1) the ones pruned by comparing their expected supports with minsup, i.e., the apriori pruning; 2) the ones pruned by the optimized pruning method; and 3) the ones pruned by comparing the proportion of containing uncertain graphs with minsup. Note that the third pruning method is actually a special case of the optimized pruning method, but here it is examined individually for clarity.

Fig. 17 shows the number of subgraph patterns pruned by each of the methods for MUSE and MUSE-Apriori on the real uncertain graph database while minsup varies from 0.2 to 0.3, ε = 0.1, and δ = 0.1. Since MUSE-Apriori does not use the optimized pruning method, the number of subgraph patterns pruned by the optimized pruning method is of course zero for MUSE-Apriori. We can see that for MUSE-Apriori, approximately 10 percent of subgraph patterns are pruned by the apriori pruning, the most expensive of the three methods, but only about 1 percent for MUSE. As a result, MUSE outperforms MUSE-Apriori, as shown in Fig. 13a. Moreover, the optimized pruning method can additionally prune 9 percent of subgraph patterns that the third method cannot prune.
[14] J. Huan, W. Wang, and J. Prins, "Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism," Proc. Int'l Conf. Data Mining, 2003.
[15] S. Nijssen and J.N. Kok, "A Quickstart in Frequent Structure Mining Can Make a Difference," Proc. ACM SIGKDD Conf., 2004.
[16] N. Vanetik, "Discovering Frequent Graph Patterns Using Disjoint Paths," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 11, pp. 1441-1456, Nov. 2006.
[17] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, "Scalable Mining of Large Disk-Based Graph Databases," Proc. ACM SIGKDD Conf., 2004.
[18] J. Wang, W. Hsu, M.L. Lee, and C. Sheng, "A Partition-Based Approach to Graph Mining," Proc. Int'l Conf. Data Eng., 2006.
[19] C. Chen, C.X. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han, "Mining Graph Patterns Efficiently Via Randomized Summaries," Proc. Very Large Databases Conf., 2009.
[20] X. Yan and J. Han, "Closegraph: Mining Closed Frequent Graph Patterns," Proc. ACM SIGKDD Conf., 2003.
[21] J. Huan, W. Wang, J. Prins, and J. Yang, "Spin: Mining Maximal Frequent Subgraphs from Graph Databases," Proc. ACM SIGKDD Conf., 2004.
[22] Y. Liu, J. Li, and H. Gao, "Summarizing Graph Patterns," Proc. Int'l Conf. Data Eng., 2008.
[23] M. Hasan and M. Zaki, "Output Space Sampling for Graph Patterns," Proc. Very Large Databases Conf., 2009.
[24] J. Wang, Z. Zeng, and L. Zhou, "Clan: An Algorithm for Mining Closed Cliques from Large Dense Graph Databases," Proc. Int'l Conf. Data Eng., 2006.
[25] Z. Zeng, J. Wang, L. Zhou, and G. Karypis, "Out-of-Core Coherent Closed Quasi-Clique Mining from Large Dense Graph Databases," ACM Trans. Database Systems, vol. 32, no. 2, p. 13, 2007.
[26] J. Pei, D. Jiang, and A. Zhang, "On Mining Cross-Graph Quasi-Cliques," Proc. ACM SIGKDD Conf., 2005.
[27] R. Jin, C. Wang, D. Polshakov, S. Parthasarathy, and G. Agrawal, "Discovering Frequent Topological Structures from Graph Datasets," Proc. ACM SIGKDD Conf., 2005.
[28] T. Horváth, J. Ramon, and S. Wrobel, "Frequent Subgraph Mining in Outerplanar Graphs," Proc. ACM SIGKDD Conf., 2006.
[29] X. Yan, H. Cheng, J. Han, and P.S. Yu, "Mining Significant Graph Patterns by Leap Search," Proc. ACM SIGMOD Conf., 2008.
[30] S. Ranu and A.K. Singh, "GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases," Proc. Int'l Conf. Data Eng., 2009.
[31] J. Li, B. Saha, and A. Deshpande, "A Unified Approach to Ranking in Probabilistic Databases," Proc. Very Large Databases Conf., 2009.
[32] J. Pei, B. Jiang, X. Lin, and Y. Yuan, "Probabilistic Skylines on Uncertain Data," Proc. Very Large Databases Conf., 2007.
[33] Y. Tao, R. Cheng, X. Xiao, W.K. Ngai, B. Kao, and S. Prabhakar, "Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions," Proc. Very Large Databases Conf., 2005.
[34] R. Cheng, J. Chen, and X. Xie, "Cleaning Uncertain Data with Quality Guarantees," Proc. Very Large Databases Conf., 2008.
[35] G. Cormode and M.N. Garofalakis, "Sketching Probabilistic Data Streams," Proc. ACM SIGMOD Conf., 2007.
[36] Q. Zhang, F. Li, and K. Yi, "Finding Frequent Items in Probabilistic Data," Proc. ACM SIGMOD Conf., 2008.
[37] T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and A. Züfle, "Probabilistic Frequent Itemset Mining in Uncertain Databases," Proc. ACM SIGKDD Conf., 2009.
[38] C.C. Aggarwal, Y. Li, J. Wang, and J. Wang, "Frequent Pattern Mining with Uncertain Data," Proc. ACM SIGKDD Conf., 2009.
[39] C.C. Aggarwal and P.S. Yu, "A Framework for Clustering Uncertain Data Streams," Proc. Int'l Conf. Data Eng., 2008.
[40] G. Cormode and A. McGregor, "Approximation Algorithms for Clustering Uncertain Data," Proc. Symp. Principles of Database Systems, 2008.
[41] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, 2005.
[42] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[43] D.R. Wood, "On the Maximum Number of Cliques in a Graph," Graphs and Combinatorics, vol. 23, no. 3, pp. 337-352, 2007.
[44] R.M. Karp and M. Luby, "Monte-Carlo Algorithms for Enumeration and Reliability Problems," Proc. Ann. Symp. Foundations of Computer Science, 1983.
[45] M. Luby and B. Velickovic, "On Deterministic Approximation of DNF," Proc. Symp. Theory of Computing, 1991.
[46] COG functions, http://www.ncbi.nlm.nih.gov/COG/, 2010.

Zhaonian Zou received the BS and MEng degrees in computer science from the Jilin University, China, in 2002 and 2005, respectively. He is currently working toward the PhD degree in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. He worked as a research assistant in the Department of System Engineering and Engineering Management at the Chinese University of Hong Kong in 2007. His research interests include data mining and query processing in graph databases.

Jianzhong Li is a professor in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. In the past, he worked as a visiting scholar at the University of California at Berkeley, as a staff scientist in the Information Research Group at the Lawrence Berkeley National Laboratory, and as a visiting professor at the University of Minnesota. His research interests include data management systems, data mining, data warehousing, sensor networks, and bioinformatics. He has published extensively and been involved in the program committees of all major database conferences, including SIGMOD, VLDB, and ICDE. He has also served on the boards for varied journals, including the IEEE Transactions on Knowledge and Data Engineering. He is a member of the IEEE.

Hong Gao received the BS degree in computer science from the Heilongjiang University, China, the MS degree in computer science from the Harbin Engineering University, China, and the PhD degree in computer science from the Harbin Institute of Technology, China. She is currently a professor in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. Her research interests include graph data management, sensor networks, and massive data management.

Shuo Zhang received the BEng and MEng degrees in computer science from the Harbin Institute of Technology, China, in 2003 and 2005, respectively. He is currently working toward the PhD degree in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. His research interests include indexing techniques, query processing, and data mining in graph databases.