
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 9, SEPTEMBER 2010

Mining Frequent Subgraph Patterns from Uncertain Graph Data

Zhaonian Zou, Jianzhong Li, Member, IEEE, Hong Gao, and Shuo Zhang

Abstract—In many real applications, graph data is subject to uncertainties due to incompleteness and imprecision of data. Mining
such uncertain graph data is semantically different from and computationally more challenging than mining conventional exact graph
data. This paper investigates the problem of mining uncertain graph data and especially focuses on mining frequent subgraph patterns
on an uncertain graph database. A novel model of uncertain graphs is presented, and the frequent subgraph pattern mining problem is
formalized by introducing a new measure, called expected support. This problem is proved to be NP-hard. An approximate mining
algorithm is proposed to find a set of approximately frequent subgraph patterns by allowing an error tolerance on expected supports of
discovered subgraph patterns. The algorithm uses efficient methods to determine whether a subgraph pattern can be output or not and
a new pruning method to reduce the complexity of examining subgraph patterns. Analytical and experimental results show that the
algorithm is very efficient, accurate, and scalable for large uncertain graph databases. To the best of our knowledge, this paper is the first to investigate the problem of mining frequent subgraph patterns from uncertain graph data.

Index Terms—Graph mining, uncertain graph, frequent subgraph pattern, algorithm.

1 INTRODUCTION

In recent years, graph mining has become an increasingly important research topic in data mining [1]. Almost all existing studies on graph mining are concerned only with exact graphs, which are precise and complete. In many practical applications, however, graph data is generally subject to uncertainties due to noise, incompleteness, and inaccuracy. Graphs carrying such uncertainties are called uncertain graphs in this paper.

For example, in bioinformatics, interactions between proteins are typically represented as a graph, called a protein-protein interaction (PPI) network [2], where vertices represent proteins and edges represent interactions between proteins. A large number of PPIs have been detected by a variety of methods, and it has been noted that all of these methods produce a significant amount of noisy interactions that do not really exist and miss a fraction of real interactions [3]. It is therefore more appropriate to represent a PPI network as an uncertain graph, with the uncertainty of each edge representing the probability that the interaction exists in reality [4]. Fig. 1 illustrates a sample of a PPI network, centered at protein NTG1, extracted from the STRING database [5]. Each vertex depicts the name (e.g., NTG1), ID (e.g., YAL015C), and function (e.g., DNA replication, recombination, and repair) of a protein. The number on each edge is the uncertainty of the interaction provided by the STRING database. Besides PPI networks, other examples of uncertain graphs include regulatory networks [6] and topologies of wireless networks [7].

Mining uncertain graphs is important in many practical applications. For example, [4] predicts the membership of a protein in a partially known protein complex by mining a PPI network as an uncertain graph, and [7] models a wireless network as an uncertain graph and extracts the most probable delivery subgraph to aid the design of routing protocols. In these studies, each piece of discovered knowledge is associated with a confidence value, computed from the uncertainties, to assess the possibility that the knowledge exists in reality. Only knowledge with high confidence can be regarded as useful, since knowledge with low confidence may arise by chance and is generally of limited use.

Furthermore, mining frequent subgraph patterns from uncertain graphs is also very important in practical applications. For example, biologists are often interested in identifying functional modules and evolutionarily conserved subnetworks in biological networks such as PPI networks [8], and frequent subgraph pattern mining has been shown to be an effective approach [2], [8]. However, as noted above, biological networks are generally subject to uncertainties. This motivates us to discover subgraph patterns that not only occur frequently in uncertain graphs but also have high confidence, in terms of the uncertainties, of existing in reality.

This paper investigates the problem of mining frequent subgraph patterns from uncertain graphs. A typical class of uncertain graphs often used in practical applications is considered: those with uncertainties associated with edges only. More general uncertain graphs, with uncertainties associated with both vertices and edges, will be studied in our other papers. An uncertain graph is a special edge-weighted graph, where the weight on each edge (u, v) is the probability of the edge existing between vertices u and v, called the existence probability of (u, v). An uncertain graph database is a set of uncertain graphs. Fig. 2 shows an example

The authors are with the School of Computer Science and Technology, Harbin Institute of Technology, PO Box 750, No. 92 West Da Zhi St., Harbin, Heilongjiang 150001, China. E-mail: {znzou, lijzh, honggao, zhangshuocn}@hit.edu.cn.

Manuscript received 31 Mar. 2009; revised 2 Aug. 2009; accepted 28 Sept. 2009; published online 3 May 2010. Recommended for acceptance by R. Cheng, M. Chau, M. Garofalakis, and J.X. Yu. For information on obtaining reprints of this article, please send e-mail to [email protected], and reference IEEECS Log Number TKDESI-2009-03-0254. Digital Object Identifier no. 10.1109/TKDE.2010.80.

1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

Fig. 1. A sample of PPI network.

of an uncertain graph database D = {G1, G2}, where the text on each vertex, such as A on v1, is the label of the vertex; the text on each edge, such as x on (v1, v2), is the label of the edge; and the real number on each edge, such as 0.5 on (v1, v2), is the existence probability of the edge. Unlike a conventional exact graph, an uncertain graph G semantically represents a probability distribution over all implicated graphs of G, where each implicated graph of G is an exact graph whose vertex set is the same as that of G and whose edge set is a subset of that of G. Naturally, no existence probabilities are associated with the edges of implicated graphs. For example, in Fig. 2, G1 represents the probability distribution over the 16 implicated graphs of G1 shown in Fig. 3, and G2 represents a probability distribution over the eight implicated graphs of G2. An uncertain graph database D essentially represents a probability distribution over all implicated graph databases of D. Supposing D = {G1, G2, ..., Gn}, an implicated graph database of D is a set of exact graphs {I1, I2, ..., In} such that Ii is an implicated graph of Gi for 1 ≤ i ≤ n. Thus, the uncertain graph database in Fig. 2 represents a probability distribution over its 128 implicated graph databases. It is worth noting that an implicated graph database is equivalent to a possible world [9] in the research on querying uncertain data.

Fig. 2. An example of uncertain graph database and subgraph pattern.

Fig. 3. Probability distribution over all implicated graphs of uncertain graph G1 in Fig. 2.

Mining frequent subgraph patterns on an uncertain graph database is semantically different from mining on an exact graph database. On an exact graph database, the significance of a subgraph pattern is measured by its support, i.e., the proportion of input graphs containing the pattern [10]. However, this definition of support does not make sense in mining uncertain graphs, because the containment relationship between uncertain graphs is nondeterministic. The support of a subgraph pattern S in an uncertain graph database D should instead be a probability distribution over the supports of S in all implicated graph databases of D. The significance of S in D can thus be measured by the expected value of the supports of S in all implicated graph databases of D, called the expected support of S in D. If the expected support of S in D is no less than a threshold specified by users, then S is called frequent in D. Therefore, the frequent subgraph pattern mining problem can be stated as follows: given an uncertain graph database D and an expected support threshold, find all frequent subgraph patterns in D.

Finding all frequent subgraph patterns in an uncertain graph database is computationally very hard. First, it is formally proved in this paper that it is #P-hard [11] to compute the expected support of a subgraph pattern in an uncertain graph database, i.e., there is no efficient algorithm to determine whether a subgraph pattern is frequent. Second, the number of subgraph patterns in an uncertain graph database is extremely large in general, and, theoretically, it is #P-hard to count all of them, so it is unrealistic to examine them all to find the frequent ones.

To deal with these challenges, an approximate mining algorithm, called MUSE (Mining Uncertain Subgraph pattErns), is proposed to find an approximate set of frequent subgraph patterns in an uncertain graph database. The algorithm approximates the exact answer, i.e., the set of all frequent subgraph patterns, in the following manner. Let minsup be the expected support threshold, ε ∈ (0, 1] be a relative error tolerance, and 1 − δ ∈ (0, 1] be a degree of confidence. With probability at least 1 − δ, any subgraph pattern with expected support no less than minsup will be output, but any subgraph pattern with expected support less than (1 − ε)minsup will not be output.

The MUSE algorithm adopts two crucial techniques. The first is an efficient method to determine whether a subgraph pattern can be output. In particular, it first approximates the expected support of the subgraph pattern by an interval that encloses the expected support, and then decides whether the subgraph pattern can be output by checking how the approximated interval overlaps [(1 − ε)minsup, minsup). In this way, it avoids the difficulty of exactly computing the expected support of the subgraph pattern. The second is an efficient method for examining subgraph patterns. It is proved that the expected support satisfies the apriori property, i.e., all supergraphs of an infrequent subgraph pattern are also infrequent.
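The possible-world semantics sketched above can be made concrete with a small script. This is a hypothetical sketch, assuming a toy encoding of an uncertain graph as a dictionary mapping edges to existence probabilities (the edges and probability values below are invented, loosely modeled on the 4-edge graph G1 of Fig. 2): enumerating all implicated graphs confirms that a 4-edge uncertain graph has 2^4 = 16 of them and that their probabilities form a distribution.

```python
from itertools import combinations

def implicated_graphs(graph):
    """Enumerate all implicated graphs of an uncertain graph given as
    {edge: existence probability}; yields (edge set, probability) pairs."""
    items = list(graph.items())
    for r in range(len(items) + 1):
        for kept in combinations(items, r):
            kept_edges = {e for e, _ in kept}
            prob = 1.0
            for e, p in items:
                # an edge contributes P(e) if present, 1 - P(e) if absent
                prob *= p if e in kept_edges else (1.0 - p)
            yield kept_edges, prob

# Hypothetical 4-edge uncertain graph.
G1 = {("v1", "v2"): 0.5, ("v1", "v3"): 0.9, ("v2", "v3"): 0.8, ("v3", "v4"): 0.7}
worlds = list(implicated_graphs(G1))
print(len(worlds))                    # 16 implicated graphs (2^|E|)
print(sum(p for _, p in worlds))      # probabilities sum to ~1.0
```

The normalization check is exactly the property established later by Theorem 1: summing the products over all edge subsets factors into a product of P(e) + (1 − P(e)) = 1 terms.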

To make full use of the apriori property, all subgraph patterns are organized into a search tree, and the search tree is traversed in a depth-first manner. During the search, if a subgraph pattern cannot be output as a result, then none of its descendants in the search tree need be examined, which reduces the complexity. Furthermore, a new pruning method is proposed to further reduce the complexity of examining subgraph patterns.

Extensive experiments were performed on a real uncertain graph database to evaluate the efficiency, memory usage, approximation quality, and scalability of MUSE, the impact of the optimized pruning method on the efficiency of MUSE, and the impact of uncertainties on the efficiency of MUSE. The experimental results show that MUSE is very efficient, accurate, and scalable for large uncertain graph databases.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 introduces the model of uncertain graphs and formalizes the frequent subgraph pattern mining problem. Section 4 formally proves the computational complexity of the frequent subgraph pattern mining problem. Section 5 presents the approximate mining algorithm and analyzes its performance. Extensive experimental results are shown in Section 6. Finally, Section 7 concludes the paper.

2 RELATED WORK

A number of algorithms have been proposed to discover the complete set of frequent subgraph patterns from an exact graph database [10], [12], [13], [14], [15], [16], [17], [18], [19]. To reduce the number of redundant subgraph patterns, [20] proposed CloseGraph to discover frequent closed subgraph patterns, [21] developed SPIN to discover frequent maximal subgraph patterns, [22] presented RP-GD and RP-FP to summarize frequent subgraph patterns, and [23] performed Metropolis-Hastings sampling on the space of frequent subgraph patterns. In addition, some variants of the frequent subgraph pattern mining problem have been investigated, such as the discovery of frequent closed cliques [24], frequent closed quasi-cliques [25], cross-graph quasi-cliques [26], frequent topological substructures [27], frequent outerplanar subgraphs [28], and significant subgraph patterns [29], [30]. However, all these algorithms were designed only for mining exact graphs and cannot be extended to uncertain graphs.

Recent research on managing uncertain data has focused on data models and query processing [9], ranking query processing [31], skyline query processing [32], indexing [33], and data cleaning [34]. There is also some very recent work on mining uncertain data. In [35] and [36], the problem of finding frequent items in probabilistic data streams was investigated. In [37] and [38], frequent itemset mining on uncertain transactional databases was studied. In [39] and [40], uncertain data clustering algorithms were explored. However, all these algorithms focus on mining uncertain tabular data rather than graph data and cannot be carried over to uncertain graph mining.

To the best of our knowledge, there is no literature to date on mining frequent subgraph patterns from uncertain graph data. This paper is the first to investigate this problem.

3 PROBLEM STATEMENT

3.1 Model of Uncertain Graphs

In this paper, the vertex set and the edge set of a graph G are denoted by V(G) and E(G), respectively.

Definition 1. An uncertain graph is a system G = ((V, E), Σ, L, P), where (V, E) is an undirected graph, Σ is a set of labels, L : V ∪ E → Σ is a function assigning labels to vertices and edges, and P : E → (0, 1] is a function assigning existence probability values to edges.

The existence probability, P((u, v)), of an edge (u, v) is the probability of the edge existing between vertices u and v. In particular, P((u, v)) = 1 indicates that (u, v) definitely exists. Thus, an exact graph¹ is a special uncertain graph with existence probability 1 on all edges.

¹ A conventional labeled graph [13] is called an exact graph in this paper. It is a 3-tuple G = ((V, E), Σ, L), where (V, E) is an undirected graph, Σ is a set of labels, and L : V ∪ E → Σ is a labeling function of vertices and edges.

Unlike an exact graph, an uncertain graph implicates a set of exact graphs, each of which is a possible structure in which the uncertain graph may exist. More formally, an exact graph I = ((V', E'), Σ', L') is an implicated graph of an uncertain graph G = ((V, E), Σ, L, P), denoted by G ⇒ I, if and only if V' = V, E' ⊆ E, Σ' ⊆ Σ, and L' = L|_{V' ∪ E'}, where L|_{V' ∪ E'} denotes the function obtained by restricting L to V' ∪ E'.

For simplicity, we assume that the existence probabilities of all edges are mutually independent. This assumption has been shown to be reasonable in a range of practical applications [4], [6], [7]. Under the independence assumption, the probability of an uncertain graph G implicating an exact graph I is

  P(G ⇒ I) = ∏_{e ∈ E(I)} P(e) · ∏_{e' ∈ E(G)\E(I)} (1 − P(e')),   (1)

where P(e) is the existence probability of edge e. Equation (1) holds because every edge in E(I) is contained in I, while no edge in E(G)\E(I) is contained in I. Let I(G) denote the set of all implicated graphs of an uncertain graph G. Obviously, |I(G)| = 2^{|E(G)|}.

An uncertain graph database is a set of uncertain graphs, and it essentially represents a set of implicated graph databases. More formally, an implicated graph database of an uncertain graph database D = {G1, G2, ..., Gn} is a set of exact graphs d = {I1, I2, ..., In} such that Gi ⇒ Ii for 1 ≤ i ≤ n. The set of all implicated graph databases of D is denoted by I(D). Obviously, |I(D)| = ∏_{i=1}^{n} 2^{|E(Gi)|}. Assuming that the uncertain graphs in an uncertain graph database are mutually independent, the probability of an exact graph database d = {I1, I2, ..., In} being implicated by an uncertain graph database D = {G1, G2, ..., Gn} is

  P(D ⇒ d) = ∏_{i=1}^{n} P(Gi ⇒ Ii),   (2)

where P(Gi ⇒ Ii) is the probability of Gi implicating Ii, as given in (1).
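The implication probabilities (1) and (2) can be computed directly from their definitions. The following is a minimal sketch under an assumed toy encoding (uncertain graphs as dictionaries mapping edges to existence probabilities; all graphs, edge names, and numbers below are hypothetical):

```python
def p_implicates(G, I_edges):
    """Eq. (1): probability that uncertain graph G implicates the exact
    graph whose edge set is I_edges (independent edge probabilities)."""
    prob = 1.0
    for e, p in G.items():
        prob *= p if e in I_edges else (1.0 - p)
    return prob

def p_db_implicates(D, d):
    """Eq. (2): probability that uncertain graph database D = [G1, ..., Gn]
    implicates the exact graph database d = [I1, ..., In], assuming the
    uncertain graphs are mutually independent."""
    prob = 1.0
    for G, I_edges in zip(D, d):
        prob *= p_implicates(G, I_edges)
    return prob

# Hypothetical two-graph database.
G1 = {("v1", "v2"): 0.5, ("v2", "v3"): 0.8}
G2 = {("u1", "u2"): 0.4}
D = [G1, G2]
d = [{("v1", "v2")}, set()]     # one of the 2^2 * 2^1 = 8 implicated databases
print(p_implicates(G1, d[0]))   # 0.5 * (1 - 0.8), i.e. ~0.1
print(p_db_implicates(D, d))    # 0.1 * (1 - 0.4), i.e. ~0.06
```

Note that |I(D)| for this toy database is 2^2 · 2^1 = 8, matching the product formula above.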

We have the following theorem on the semantics of uncertain graphs:

Theorem 1. Let G be an uncertain graph and I(G) be the set of all implicated graphs of G. The function P_G : 2^{I(G)} → ℝ, defined as P_G(X) = Σ_{I ∈ X} P(G ⇒ I), is a probability function, where 2^{I(G)} is the power set of I(G).

Proof. Let I(G) be the sample space. Trivially, 2^{I(G)} is a σ-algebra over the subsets of I(G). So we prove the theorem by showing that P_G satisfies the probability axioms [41].

First, it is evident that P_G(X) ≥ 0 for any X ∈ 2^{I(G)}.

Second, in order to prove P_G(I(G)) = 1, we claim that P_G(I(G)) = P_{G−e}(I(G − e)) for any edge e of G, where G − e denotes the uncertain graph obtained by removing e from G. Suppose E(G) = {e1, e2, ..., en}, where n = |E(G)|, and let G_i be the uncertain graph obtained by removing edges e1, e2, ..., ei from G, for 1 ≤ i ≤ n. By the claim, we have P_G(I(G)) = P_{G1}(I(G1)) = ⋯ = P_{Gn}(I(Gn)). Since Gn contains no edges, we have I(Gn) = {Gn}, and thus P_{Gn}(I(Gn)) = P(Gn ⇒ Gn) = 1. Now we prove the claim. For any edge e ∈ E(G),

  Σ_{I ∈ I(G), e ∈ E(I)} P(G ⇒ I)
    = Σ_{I ∈ I(G), e ∈ E(I)} P(e) · ∏_{e' ∈ E(I)\{e}} P(e') · ∏_{e'' ∈ E(G)\E(I)} (1 − P(e''))   (3)
    = P(e) · Σ_{I ∈ I(G−e)} P(G − e ⇒ I)
    = P(e) · P_{G−e}(I(G − e)),

and similarly,

  Σ_{I ∈ I(G), e ∉ E(I)} P(G ⇒ I) = (1 − P(e)) · P_{G−e}(I(G − e)).   (4)

Summing up (3) and (4), we justify the claim.

Third, by the definition of P_G, for any sequence of pairwise disjoint subsets X1, X2, ..., Xk of I(G), P_G(X1 ∪ X2 ∪ ⋯ ∪ Xk) = Σ_{i=1}^{k} P_G(Xi).

Thus, the proof is completed. □

For any singleton set X = {I} ∈ 2^{I(G)}, P_G(X) = P(G ⇒ I). Thus, the probability distribution over all implicated graphs of G can be denoted by P(G ⇒ I).

We also have the theorem below on the semantics of uncertain graph databases.

Theorem 2. Let D be an uncertain graph database and I(D) be the set of all implicated graph databases of D. The function P_D : 2^{I(D)} → ℝ, defined as P_D(X) = Σ_{d ∈ X} P(D ⇒ d), is a probability function, where 2^{I(D)} denotes the power set of I(D).

Proof. The proof of Theorem 2 is very similar to that of Theorem 1.

First, P_D(X) ≥ 0 for every X ∈ 2^{I(D)}.

Second, P_D(I(D)) = 1. To prove this, we show that for any uncertain graph G ∈ D,

  P_D(I(D)) = Σ_{d ∈ I(D)} P(D ⇒ d)
      = Σ_{I ∈ I(G)} Σ_{d ∈ I(D), I ∈ d} P(D ⇒ d)
      = Σ_{I ∈ I(G)} P(G ⇒ I) · Σ_{d ∈ I(D\{G})} P(D \ {G} ⇒ d)
      = P_{D\{G}}(I(D \ {G})),

where the last equality holds because Σ_{I ∈ I(G)} P(G ⇒ I) = 1. Thus, we have P_D(I(D)) = P_{D\{G1}}(I(D \ {G1})) = P_{D\{G1,G2}}(I(D \ {G1, G2})) = ⋯ = P_∅(I(∅)) = 1.

Third, for any sequence of pairwise disjoint subsets X1, X2, ..., Xk of I(D), we have P_D(X1 ∪ X2 ∪ ⋯ ∪ Xk) = Σ_{i=1}^{k} P_D(Xi).

Thus, the theorem holds. □

Note that for any singleton set X = {d} ∈ 2^{I(D)}, P_D(X) = P(D ⇒ d). Thus, the probability distribution over all implicated graph databases of D can be denoted by P(D ⇒ d).

Consider again the uncertain graph database D = {G1, G2} shown in Fig. 2. G1 represents the probability distribution over the 16 implicated graphs of G1 shown in Fig. 3, and G2 represents a probability distribution over the eight implicated graphs of G2. Thus, D represents a probability distribution over all 128 implicated graph databases of D.

3.2 Frequent Subgraph Pattern Mining Problem

Definition 2. An exact graph G = (V, E, Σ, L) is subgraph isomorphic to another exact graph G' = (V', E', Σ', L'), denoted by G ⊑ex G', if there exists an injection f : V → V' such that

1. L(v) = L'(f(v)) for every v ∈ V,
2. (f(u), f(v)) ∈ E' for every (u, v) ∈ E, and
3. L((u, v)) = L'((f(u), f(v))) for every (u, v) ∈ E.

The injection f is called a subgraph isomorphism from G to G'. The subgraph (V'', E'') of G' with V'' = {f(v) | v ∈ V} and E'' = {(f(u), f(v)) | (u, v) ∈ E} is called the embedding of G in G' under f.

In traditional frequent subgraph pattern mining [10], a subgraph pattern is defined as a connected graph that is subgraph isomorphic to at least one exact graph in the input exact graph database, and the support of a subgraph pattern S in an exact graph database D is formulated as

  sup_D(S) = |{G | S ⊑ex G and G ∈ D}| / |D|.   (5)

However, such concepts do not make sense in uncertain graph databases, since an exact subgraph is embedded in an uncertain graph only in a probabilistic sense. Hence, these concepts should be redefined in the context of uncertain graph databases.

Definition 3. A connected exact graph S is a subgraph pattern in an uncertain graph database D if S is subgraph isomorphic to at least one implicated graph in some implicated graph database of D. Let S and S' be two subgraph patterns. S is a subpattern of S', or S' is a superpattern of S, if S ⊑ex S'. S is a direct subpattern of S', or S' is a direct superpattern of S, if S ⊑ex S' and |E(S)| + 1 = |E(S')|.

Definition 4. Given an uncertain graph database D, let I(D) be the set of all implicated graph databases of D. The support of a subgraph pattern S in D is a random variable over I(D) that takes value s_i with probability P(s_i), where s_i = sup_d(S) is the support value of S in some implicated graph database d ∈ I(D), m = |{sup_d(S) | d ∈ I(D)}| is the number of distinct support values of S in all implicated graph databases of D, and P(s_i) = Σ_{d ∈ I(D), sup_d(S) = s_i} P(D ⇒ d) is the probability of S having support value s_i across all implicated graph databases of D, for 1 ≤ i ≤ m.

Definition 5. Let the support of a subgraph pattern S in an uncertain graph database D be the random variable given in Definition 4. The expected support of S in D is defined as

  esup_D(S) = Σ_{i=1}^{m} s_i P(s_i) = Σ_{d ∈ I(D)} sup_d(S) P(D ⇒ d).   (6)

A subgraph pattern S is frequent in an uncertain graph database D if the expected support of S in D is no less than a threshold minsup ∈ [0, 1] specified by users. The problem of discovering frequent subgraph patterns from an uncertain graph database can then be stated as follows:

Input: an uncertain graph database D and an expected support threshold minsup.
Output: the set of all frequent subgraph patterns in D, i.e., {S | S is a subgraph pattern in D and esup_D(S) ≥ minsup}.

4 COMPLEXITY OF THE FREQUENT SUBGRAPH PATTERN MINING PROBLEM

This section formally proves the computational complexity of the frequent subgraph pattern mining problem to be solved in this paper. Before that, we first reformulate the concept of expected support given in Definition 5.

Given an uncertain graph database D, a subgraph pattern S in D is said to occur in an uncertain graph G ∈ D, denoted by S ⊑U G, if S is subgraph isomorphic to at least one implicated graph of G. The probability of S occurring in G is

  P(S ⊑U G) = Σ_{I ∈ I(G)} P(G ⇒ I) · φ(I, S),   (7)

where I(G) is the set of all implicated graphs of G, P(G ⇒ I) is the probability of G implicating I given in (1), and φ(I, S) = 1 if S is subgraph isomorphic to I and φ(I, S) = 0 otherwise. Then, (6) can be rewritten as

  esup_D(S) = Σ_{d ∈ I(D)} sup_d(S) P(D ⇒ d)
      = Σ_{d = {I1, I2, ..., I|D|} ∈ I(D)} (P(D ⇒ d) / |D|) · Σ_{i=1}^{|D|} φ(Ii, S).

By swapping the order of the summations and grouping the inner summands by the ith implicated graph, we have

  esup_D(S) = (1/|D|) Σ_{i=1}^{|D|} Σ_{I ∈ I(Gi)} Σ_{d ∈ I(D), I ∈ d} φ(I, S) P(D ⇒ d)
      = (1/|D|) Σ_{i=1}^{|D|} Σ_{I ∈ I(Gi)} φ(I, S) P(Gi ⇒ I) · Σ_{d ∈ I(D\{Gi})} P(D \ {Gi} ⇒ d)   (8)
      = (1/|D|) Σ_{i=1}^{|D|} Σ_{I ∈ I(Gi)} φ(I, S) P(Gi ⇒ I)
      = (1/|D|) Σ_{i=1}^{|D|} P(S ⊑U Gi),

where the second equality holds because Σ_{d ∈ I(D\{Gi})} P(D \ {Gi} ⇒ d) = 1, as argued in the proof of Theorem 1. Hence, esup_D(S) can be readily computed using (8) instead of (6), since |D| in (8) is substantially smaller than |I(D)| in (6).

Theorem 3. It is #P-hard to compute the probability of a subgraph pattern occurring in an uncertain graph.

Proof. We prove the theorem by reducing the #P-complete DNF counting problem [11] to the problem of computing the probability, P(S ⊑U G), of a subgraph pattern S occurring in an uncertain graph G.

The DNF counting problem can be stated as follows: Let F = C1 ∨ C2 ∨ ⋯ ∨ Cn be a boolean formula in disjunctive normal form (DNF) on m boolean variables x1, x2, ..., xm. Each clause Ci is of the form Ci = l1 ∧ l2 ∧ ⋯ ∧ lk, where each lj is a boolean variable in {x1, x2, ..., xm}. Let Pr(xi) be the probability of xi being assigned true. The DNF counting problem is to compute the probability of F being satisfied by a randomly and independently chosen truth assignment to the variables, denoted by Pr(F). Given an instance of the DNF counting problem, an instance of the problem of computing P(S ⊑U G) can be constructed as follows:

First, construct an uncertain graph G. The vertex set of G is V(G) = {c1, c2, ..., cn, u1, u2, ..., um, v1, v2, ..., vm}. The labels of c1, c2, ..., cn are α, and the labels of u1, u2, ..., um, v1, v2, ..., vm are β. The edge set of G is constructed as follows: For each variable xi in the DNF formula F, add an edge (ui, vi) with existence probability Pr(xi) to G, for 1 ≤ i ≤ m. For each variable xj in clause Ci, add an edge (ci, uj) with existence probability 1 to G, for 1 ≤ i ≤ n and 1 ≤ j ≤ m. All edges of G are labeled γ.

Next, construct a subgraph pattern S. The vertex set of S is V(S) = {c', u'1, u'2, ..., u'k, v'1, v'2, ..., v'k}. The label of c' is α, and the labels of u'1, u'2, ..., u'k, v'1, v'2, ..., v'k are all β. The edge set of S is E(S) = {(c', u'1), (c', u'2), ..., (c', u'k), (u'1, v'1), (u'2, v'2), ..., (u'k, v'k)}. All edges of S are labeled γ.

For example, given the DNF formula (x1 ∧ x2 ∧ x3) ∨ (x2 ∧ x3 ∧ x4) and the probabilities Pr(x1), Pr(x2), ..., Pr(x4) of the variables x1, x2, ..., x4 being assigned true, the uncertain graph G and the subgraph pattern S constructed from the formula are shown in Fig. 4, where the labels of the edges are omitted for clarity.

Each truth assignment to the variables in F corresponds one-to-one to an implicated graph of G, i.e., edge (ui, vi) exists in the implicated graph if and only if xi = true. The probability of each truth assignment is equal to the probability of the implicated graph to which it corresponds. A truth assignment satisfies F if and only if the implicated graph to which it corresponds contains subgraph pattern S. Thus, Pr(F) is equal to the probability, P(S ⊑U G), of S occurring in G. This completes the polynomial-time reduction. The theorem thus holds. □
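To make Definitions 4 and 5 and the reformulation (8) concrete, the sketch below enumerates implicated graph databases by brute force. This is a hypothetical toy example: the database and probabilities are invented, and the subgraph isomorphism test φ is specialized to a single-edge pattern so that it reduces to an edge-membership check. It verifies that the expectation of the Definition 4 distribution, i.e., (6), agrees with the per-graph formula (8).

```python
from itertools import product as cartesian
from collections import defaultdict

def worlds(G):
    """All (edge set, probability) pairs for uncertain graph G, where G maps
    each edge to its existence probability -- eq. (1) by direct expansion."""
    out = [(frozenset(), 1.0)]
    for e, p in G.items():
        out = [(s | {e}, q * p) for s, q in out] + \
              [(s, q * (1 - p)) for s, q in out]
    return out

def support_distribution(D, phi):
    """Definition 4: the support of S in D as a distribution over all
    implicated graph databases; phi(edges) plays the role of phi(I, S)."""
    dist = defaultdict(float)
    for combo in cartesian(*(worlds(G) for G in D)):
        prob = 1.0
        hits = 0
        for edges, q in combo:
            prob *= q               # eq. (2): independence across graphs
            hits += phi(edges)
        dist[hits / len(D)] += prob  # support value s_i accumulates P(D => d)
    return dict(dist)

def expected_support(D, phi):
    """Eq. (8): esup_D(S) = (1/|D|) * sum_i P(S occurs in G_i), which avoids
    enumerating the |I(D)| implicated graph databases of eq. (6)."""
    total = 0.0
    for G in D:
        total += sum(q for edges, q in worlds(G) if phi(edges))  # eq. (7)
    return total / len(D)

# Hypothetical database; phi tests a single-edge pattern, so the subgraph
# isomorphism check reduces to edge membership.
D = [{("a", "b"): 0.5, ("b", "c"): 0.8}, {("a", "b"): 0.4}]
phi = lambda edges: ("a", "b") in edges

dist = support_distribution(D, phi)
esup_direct = sum(s * p for s, p in dist.items())   # eq. (6)
print(dist)                                # support values 0.0, 0.5, 1.0
print(esup_direct, expected_support(D, phi))  # both ~0.45 = (0.5 + 0.4) / 2
```

The agreement of the two numbers illustrates why (8) matters: it replaces a sum over |I(D)| possible worlds with |D| per-graph occurrence probabilities.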

Fig. 5. Uncertain graph database D constructed for ðx1 _ x2 Þ ^


ðx2 _ x3 Þ ^ x4 .
Fig. 4. Uncertain graph G and subgraph pattern S constructed for
ðx1 ^ x2 ^ x3 Þ _ ðx2 ^ x3 ^ x4 Þ.
1  i  l. Note that the expected support of a subgraph
pattern in D is identical to the traditional support of the
The following corollary naturally follows Theorem 3:
subgraph pattern in D since D is an exact graph database
Corollary 1. It is #P-hard to compute the expected support of a at this time. A truth assignment  doesn’t satisfy F if and
subgraph pattern in an uncertain graph database. only if the exact graph, g , corresponding to  is a
Proof. By (8), the problem of computing the probability of a frequent subgraph pattern in D with respect to threshold
subgraph pattern S occurring in an uncertain graph G 1=n. Thus, the number of frequent subgraph patterns in
can be trivially reduced to the problem of computing the D is 2m minus the number of satisfying truth assign-
expected support of a subgraph pattern S 0 in an ments of F . This completes the polynomial time
uncertain graph database D by specifying S 0 ¼ S and reduction. Thus, the theorem holds. u
t
D ¼ fGg. By Theorem 3, the corollary holds. u
t
From the #P-hardness of the problem of counting the
The number of frequent subgraph patterns in an number of frequent subgraph patterns, the NP-hardness of
uncertain graph database is generally exponential to the the problem of finding all frequent subgraph patterns can
size of the uncertain graph database. Naturally, the be readily derived [42].
complexity of any algorithm to mine frequent subgraph
patterns is exponential with respect to the size of the input. 5 APPROXIMATE ALGORITHM FOR MINING
Theoretically, we have the following theorem:
FREQUENT SUBGRAPH PATTERNS
Theorem 4. The problem of counting the number of frequent subgraph patterns in an uncertain graph database for an arbitrary expected support threshold is #P-hard.

Proof. We prove the theorem by reducing the #P-complete problem of counting the number of satisfying truth assignments of a monotone k-CNF formula [11] to the problem of counting the number of frequent subgraph patterns in an uncertain graph database. A monotone k-CNF formula is a boolean formula in conjunctive normal form (CNF) in which every clause has at most k literals and no literal is negated.

Let F = D_1 ∧ D_2 ∧ … ∧ D_n be a monotone k-CNF formula on m boolean variables x_1, x_2, …, x_m. Each clause D_i is of the form D_i = l_1 ∨ l_2 ∨ … ∨ l_{r_i}, where each literal l_j is an unnegated boolean variable and r_i ≤ k. An uncertain graph database D can be constructed as follows: For each clause D_i = l_1 ∨ l_2 ∨ … ∨ l_{r_i} in F, construct an uncertain graph G_i. The vertex set of G_i is V(G_i) = {v_i0, v_i1, …, v_i,m−r_i}, and the edge set of G_i is E(G_i) = {(v_i0, v_i1), (v_i0, v_i2), …, (v_i0, v_i,m−r_i)}. All vertices of G_i are labeled with the same fixed label. Each edge of G_i is associated with a distinct label in {x_1, x_2, …, x_m} \ {l_1, l_2, …, l_{r_i}} and has existence probability 1.

For example, given the monotone 2-CNF formula (x_1 ∨ x_2) ∧ (x_2 ∨ x_3) ∧ x_4, the uncertain graph database D constructed from this formula is shown in Fig. 5.

We establish the correspondence between the number of satisfying truth assignments of F and the number of frequent subgraph patterns in D as follows: Every truth assignment τ to the variables in F one-to-one corresponds to an exact graph g_τ. Particularly, supposing that the variables assigned true in τ are x_1, x_2, …, x_l, the vertex set of g_τ is V(g_τ) = {v_0, v_1, …, v_l}, and the edge set of g_τ is E(g_τ) = {(v_0, v_1), (v_0, v_2), …, (v_0, v_l)}. All vertices of g_τ are labeled with the same fixed label, and each edge (v_0, v_i) is labeled x_i for 1 ≤ i ≤ l.

5.1 Overview of Approximate Mining Algorithm

Due to the NP-hardness of the frequent subgraph mining problem, an approximate mining algorithm is proposed to find an approximate set of frequent subgraph patterns. More precisely, let minsup be the input expected support threshold and ε ∈ (0, 1] be a relative error tolerance.

• All subgraph patterns with expected support no less than minsup will be output.
• All subgraph patterns with expected support less than (1 − ε)·minsup will not be output.
• Decisions are arbitrary for subgraph patterns with expected support in [(1 − ε)·minsup, minsup).

The approximate mining algorithm has two main objectives to achieve.

• The first is to determine, as efficiently as possible, whether a subgraph pattern can be output or not.
• The second is to examine the subgraph patterns in the uncertain graph database as efficiently as possible to find all frequent subgraph patterns.

5.1.1 Method for Objective I

To accomplish the first objective, we first approximate the expected support, esup_D(S), of a subgraph pattern S in the uncertain graph database D by a closed interval, denoted [l(S), u(S)], such that esup_D(S) ∈ [l(S), u(S)], and then determine whether S can be output or not by testing the following conditions:

Condition 1. If u(S) ≥ minsup and l(S) ≥ (1 − ε)·minsup, then output S, since it is certain that esup_D(S) ≥ (1 − ε)·minsup and it is probable that esup_D(S) ≥ minsup. This condition is illustrated at the top of Fig. 6.
Fig. 6. Illustrations of conditions for deciding whether to output a subgraph pattern or not.

Condition 2. If u(S) < minsup, then do not output S, since it is certain that esup_D(S) < minsup. This condition is illustrated in the middle of Fig. 6.

Condition 3. If u(S) ≥ minsup and l(S) < (1 − ε)·minsup, then approximate esup_D(S) by a smaller interval and test the conditions again, since we cannot decide whether esup_D(S) ≥ minsup or esup_D(S) < (1 − ε)·minsup using the current interval. This condition is illustrated at the bottom of Fig. 6.

It is interesting to note that if the width of the interval [l(S), u(S)], i.e., |u(S) − l(S)|, is less than or equal to ε·minsup, then either condition 1 or condition 2 described above will be satisfied. Thus, it is sufficient to approximate esup_D(S) by an interval of width at most ε·minsup. This observation is crucial for the discussion in the rest of the paper.

5.1.2 Method for Objective II

To fulfill the second objective, we first study a property of the expected support. For any uncertain graph G ∈ D and any subgraph patterns S and S′ in D, if S is a subpattern of S′, then φ(I, S) ≥ φ(I, S′) for any implicated graph I of G, where φ(I, S) = 1 if S is subgraph isomorphic to I and φ(I, S) = 0 otherwise. Then, we have P(S ⊑ G) ≥ P(S′ ⊑ G) by (7). This is called the Apriori property of the probability of a subgraph pattern occurring in G. Following this property, we also have esup_D(S) ≥ esup_D(S′). This is called the Apriori property of the expected support. A straightforward inference from the Apriori property is that all subpatterns of a frequent subgraph pattern are also frequent, and all superpatterns of an infrequent subgraph pattern are also infrequent. This result can be utilized to reduce the complexity of the mining algorithm.

Then, we organize all subgraph patterns in the uncertain graph database D in a structure and search the structure systematically to find all frequent subgraph patterns by taking advantage of the Apriori property of the expected support. Based on the direct subpattern relationship defined in Definition 3, all subgraph patterns in the uncertain graph database D can be organized as a directed acyclic graph (DAG) with nodes representing subgraph patterns and edges representing direct subpattern relationships. Fig. 7 shows the DAG of the subgraph patterns in the uncertain graph database D in Fig. 2. In a DAG of subgraph patterns, a subgraph pattern may have more than one parent. By requiring each subgraph pattern, except those having no parents, to keep only one parent using some specific scheme, the DAG can be simplified to a tree. A number of such schemes have been proposed [10], [12], [13], [14], [15]. For example, using the DFS coding scheme proposed in [13], the DAG in Fig. 7 can be simplified to the tree highlighted by the solid directed arcs in Fig. 7. We call such a tree a search tree of subgraph patterns. The advantage of organizing subgraph patterns into a search tree is that if a subgraph pattern is known to be infrequent, then all its descendants in the search tree can be pruned due to the Apriori property of the expected support.

Fig. 7. DAG and search tree of subgraph patterns in the uncertain graph database D in Fig. 2.

Thus, the problem of mining frequent subgraph patterns in an uncertain graph database is converted into traversing the search tree to find all frequent subgraph patterns with low computational complexity. The proposed approximate mining algorithm employs a depth-first strategy to traverse the search tree. It works as follows:

Step 1. Let T be an empty stack. Scan the edges of the uncertain graphs in D to get all subgraph patterns consisting of only one edge, and push them into T.

Step 2. Pop the subgraph pattern S on the top of stack T. Find the subgraph isomorphisms from S to every uncertain graph G ∈ D and get the embeddings of S in G under the subgraph isomorphisms just found. Approximate the expected support, esup_D(S), of S in D by an interval, denoted [l(S), u(S)], such that esup_D(S) ∈ [l(S), u(S)] and the width of [l(S), u(S)] is at most ε·minsup.

Step 3. Determine whether S can be output or not by testing conditions 1 and 2 given in Section 5.1.1 with [l(S), u(S)]. If S cannot be output, then the subtree rooted at S can be pruned due to the Apriori property of the expected support; thus we can skip the rest of this step and go to step 4. If S can be output, then output S and generate all direct superpatterns of S based on the embeddings of S in the uncertain graphs in D. For each generated superpattern S′, if S′ is a child of S in the search tree, then push S′ into T; otherwise, S′ must be a child of another subgraph pattern S″ and should not be examined in the subtree rooted at S.

Step 4. If T = ∅, then terminate; otherwise go to step 2.

As can be seen from the above, the approximation of expected supports and the pruning of search trees are crucial for reducing the complexity of the mining algorithm.
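To make the decision procedure concrete, the three conditions of Section 5.1.1 can be sketched as a small decision function. This is our illustrative code, not the paper's; `lo` and `hi` stand for the endpoints of the interval enclosing esup_D(S):

```python
def decide(lo, hi, minsup, eps):
    """Three-way decision for a pattern whose expected support
    lies in [lo, hi], mirroring conditions 1-3 of Section 5.1.1."""
    if hi < minsup:                   # condition 2: certainly infrequent,
        return "prune"                # so the whole subtree is pruned (Apriori)
    if lo >= (1 - eps) * minsup:      # condition 1: safe to output
        return "output"
    return "refine"                   # condition 3: interval still too wide

# With minsup = 0.3 and eps = 0.1, the boundary (1 - eps)*minsup is 0.27:
print(decide(0.28, 0.31, 0.3, 0.1))  # -> output
print(decide(0.20, 0.29, 0.3, 0.1))  # -> prune
print(decide(0.20, 0.40, 0.3, 0.1))  # -> refine
```

Once the interval width is at most ε·minsup, the "refine" branch can no longer be taken: hi ≥ minsup together with hi − lo ≤ ε·minsup forces lo ≥ (1 − ε)·minsup, which is exactly the observation made above.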
The rest of this section is organized as follows: Section 5.2 proposes an exact algorithm and an approximation algorithm to compute expected supports. Section 5.3 develops a new pruning technique to reduce the complexity of pruning search trees. Section 5.4 presents the complete description of the approximate mining algorithm.

5.2 Algorithms for Computing Expected Supports

Equation (8) shows that the expected support of a subgraph pattern S in an uncertain graph database D can be computed by averaging the probability of S occurring in every uncertain graph G ∈ D, i.e., P(S ⊑ G). However, Theorem 3 shows that it is #P-hard to compute P(S ⊑ G). To deal with this challenge, we propose an exact algorithm to compute P(S ⊑ G) exactly for small instances of the problem, as well as an approximation algorithm to approximate P(S ⊑ G) by an interval for large instances of the problem.

5.2.1 Fundamental Technique

To compute P(S ⊑ G) by its rigorous definition, i.e., (7), we must compute the probability distribution over all 2^|E(G)| implicated graphs of G and perform 2^|E(G)| subgraph isomorphism tests from S to all implicated graphs of G, which is intractable even if G is of small size, e.g., 30 edges. To reduce the complexity, we propose a new approach to compute P(S ⊑ G) based on all embeddings of S in G. (See Definition 2. Note that the number of all embeddings of S in G is no greater than the number of all subgraph isomorphisms from S to G, since two distinct subgraph isomorphisms may map S to the same subgraph of G.)

The fundamental technique of the new approach is to transform the problem of computing P(S ⊑ G) into the DNF counting problem. Particularly, let {S_1, S_2, …, S_n} be the set of all embeddings of S in the exact graph obtained by removing the uncertainties from G, i.e., the graph having the vertex set V(G), the edge set E(G), and the same labels as G. Let the edge set of each embedding S_i be E(S_i) = {e_i1, e_i2, …, e_i|E(S)|}, where each subscript i_j is in {1, 2, …, |E(G)|}. Note that all embeddings contain the same number of edges, |E(S)|. The DNF counting problem is constructed as follows:

Step 1. For each edge e_j in the embeddings, create a boolean variable x_j. The probability, Pr(x_j), of x_j being assigned true is equal to the existence probability, P(e_j), of edge e_j.

Step 2. For each embedding S_i, construct a conjunctive clause C_i = x_i1 ∧ x_i2 ∧ … ∧ x_i|E(S)|, where x_ij is the boolean variable created in step 1 for edge e_ij ∈ E(S_i).

Step 3. The output DNF formula F is the disjunction of all conjunctive clauses constructed for all n embeddings in step 2, i.e., F = (x_11 ∧ x_12 ∧ … ∧ x_1|E(S)|) ∨ … ∨ (x_n1 ∧ x_n2 ∧ … ∧ x_n|E(S)|).

The construction of F can be done in Θ(n|E(S)|) time using a hash table to store the variable created for each edge, where n is the number of embeddings of S in G, and |E(S)| is the number of edges in S. It is easy to prove that P(S ⊑ G) is equal to the probability of F being satisfied by a randomly and independently chosen truth assignment to the variables in F, denoted Pr(F). Thus, the problem of computing P(S ⊑ G) is transformed into the problem of computing Pr(F).

For example, consider the uncertain graph G_1 and the subgraph pattern S in Fig. 2. There are four embeddings of S in G_1, so we construct four variables x_1, x_2, x_3, and x_4 for edges (v_1, v_2), (v_1, v_3), (v_1, v_4), and (v_1, v_5) of G_1, respectively. The probabilities of x_1, x_2, x_3, and x_4 being assigned true are Pr(x_1) = 0.5, Pr(x_2) = 0.6, Pr(x_3) = 0.7, and Pr(x_4) = 0.8, respectively. The DNF formula constructed is thus F = (x_1 ∧ x_2) ∨ (x_1 ∧ x_4) ∨ (x_2 ∧ x_3) ∨ (x_3 ∧ x_4).

Note that if F can be divided into several disjoint DNF subformulas F_1, F_2, …, F_k such that F = F_1 ∨ F_2 ∨ … ∨ F_k, and F_i and F_j do not contain any common variables for 1 ≤ i, j ≤ k and i ≠ j, then we can first compute Pr(F_i) for each subformula F_i and then compute Pr(F) by Pr(F) = 1 − ∏_{i=1}^{k} (1 − Pr(F_i)). Hence, without loss of generality, we assume in the following discussion that F cannot be divided into disjoint subformulas.

On the basis of this technique, we develop an exact algorithm and an approximation algorithm to compute P(S ⊑ G) in the rest of this section.

5.2.2 Exact Algorithm

To exactly compute P(S ⊑ G), we first construct a DNF formula F = C_1 ∨ C_2 ∨ … ∨ C_n using the method given above. Then, by the inclusion-exclusion principle [41], the probability of F being satisfied can be computed by

Pr(F) = Σ_{1≤i≤n} Pr(C_i) − Σ_{1≤i<j≤n} Pr(C_i ∧ C_j) + … + (−1)^{n−1} Σ_{1≤i_1<i_2<…<i_n≤n} Pr(C_i1 ∧ C_i2 ∧ … ∧ C_in),   (9)

where Pr(C_i1 ∧ C_i2 ∧ … ∧ C_ij) is the probability of C_i1 ∧ C_i2 ∧ … ∧ C_ij being satisfied. Since C_i1 ∧ C_i2 ∧ … ∧ C_ij is satisfied if and only if all variables in it are assigned true, we have

Pr(C_i1 ∧ C_i2 ∧ … ∧ C_ij) = ∏_x Pr(x),   (10)

where x ranges over all variables in C_i1 ∧ C_i2 ∧ … ∧ C_ij. In the following discussion, we call the set of all variables in a formula f the domain of f, denoted by dom(f).

To improve the efficiency of computing (9), we propose a method to reduce the time complexity of computing Pr(C_i1 ∧ C_i2 ∧ … ∧ C_ij) based on the following proposition:

Proposition 1. For a conjunctive formula C_1 ∧ C_2 ∧ … ∧ C_k, where each C_i is a conjunction of variables, if there exist α, β ∈ {1, 2, …, k} with α ≠ β such that there are no common variables in C_α and C_β, then

Pr(C_1 ∧ C_2 ∧ … ∧ C_k) = P_1 P_2 / P_3,   (11)

where

P_1 = Pr(C_1 ∧ … ∧ C_{α−1} ∧ C_{α+1} ∧ … ∧ C_k),
P_2 = Pr(C_1 ∧ … ∧ C_{β−1} ∧ C_{β+1} ∧ … ∧ C_k),
P_3 = Pr(C_1 ∧ … ∧ C_{α−1} ∧ C_{α+1} ∧ … ∧ C_{β−1} ∧ C_{β+1} ∧ … ∧ C_k).
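As a numeric sanity check of Proposition 1 on clauses drawn from the running example (our illustration; clause domains are represented as Python sets):

```python
from math import prod

prob = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}

def pr_conj(clauses, prob):
    """Pr of a conjunction of clauses via (10): the product of Pr(x)
    over the union of the clause domains."""
    return prod(prob[x] for x in set().union(*clauses))

c1, c2, c3 = {"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}
# c2 and c3 share no variables, so (11) applies with alpha = c2, beta = c3:
p1 = pr_conj([c1, c3], prob)   # drop c2
p2 = pr_conj([c1, c2], prob)   # drop c3
p3 = pr_conj([c1], prob)       # drop both
assert abs(pr_conj([c1, c2, c3], prob) - p1 * p2 / p3) < 1e-12  # both equal 0.168
```

In Exact-Occ-Prob, P_1, P_2, and P_3 are terms already computed in earlier iterations (smaller j), so the division in (11) replaces a fresh product over up to k·|E(S)| variables by a single O(1) arithmetic step.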
Fig. 8. The Exact-Occ-Prob procedure.

Proof. Without loss of generality, we prove the proposition for α = k − 1 and β = k. For simplicity, we denote C_1 ∧ C_2 ∧ … ∧ C_{k−2} by C. Then,

P_1 P_2 = ∏_{x ∈ dom(C) ∪ dom(C_k)} Pr(x) · ∏_{x ∈ dom(C) ∪ dom(C_{k−1})} Pr(x)
        = ∏_{x ∈ dom(C)} Pr(x) · ∏_{x ∈ dom(C) ∪ dom(C_{k−1}) ∪ dom(C_k)} Pr(x)
        = ∏_{x ∈ dom(C)} Pr(x) · ∏_{x ∈ dom(C ∧ C_{k−1} ∧ C_k)} Pr(x)
        = P_3 · Pr(C_1 ∧ C_2 ∧ … ∧ C_k),

where the second equality holds because dom(C_{k−1}) and dom(C_k) are disjoint. Thus, the proposition holds. □

Specifically, if k = 2, and C_1 and C_2 do not share any common variables, then let P_3 = 1, i.e., Pr(C_1 ∧ C_2) = Pr(C_1) Pr(C_2). Moreover, Proposition 1 suggests that if P_1, P_2, and P_3 in (11) have already been obtained, then Pr(C_1 ∧ C_2 ∧ … ∧ C_k) can be computed by (11) in O(1) time rather than being computed by (10) directly in O(k² |E(S)|²) time.

An algorithm, called Exact-Occ-Prob, is developed to exactly compute P(S ⊑ G), as shown in Fig. 8. The input of Exact-Occ-Prob is a subgraph pattern S, an uncertain graph G, and the set, {S_1, S_2, …, S_n}, of all embeddings of S in G. The algorithm works in a "bottom-up" fashion, which consists of five steps.

Step 1. Construct the DNF formula F = C_1 ∨ C_2 ∨ … ∨ C_n based on {S_1, S_2, …, S_n} using the method presented in Section 5.2.1, and let p = 0.

Step 2. For each clause C_i = x_i1 ∧ x_i2 ∧ … ∧ x_i|E(S)| in F, compute Pr(C_i) by ∏_{j=1}^{|E(S)|} Pr(x_ij) and add Pr(C_i) to p.

Step 3. If n = 1, then output p as P(S ⊑ G) and terminate; otherwise let j = 2.

Step 4. For every i_1, i_2, …, i_j such that 1 ≤ i_1 < i_2 < … < i_j ≤ n, if there exist α, β ∈ {1, 2, …, j} such that C_iα and C_iβ do not share any common variables, then compute Pr(C_i1 ∧ C_i2 ∧ … ∧ C_ij) by (11) in O(1) time; otherwise compute it by (10). Then, add (−1)^{j−1} Pr(C_i1 ∧ C_i2 ∧ … ∧ C_ij) to p.

Step 5. If j = n, then output p as P(S ⊑ G) and terminate; otherwise increase j by 1 and go to step 4.

We proceed to analyze the performance of Exact-Occ-Prob. The metric for evaluating the performance is the ratio of the number of terms in (9) that are computed in O(1) time by Exact-Occ-Prob to the total number of terms in (9).

Theorem 5. Given an uncertain graph G, a subgraph pattern S, and the set, {S_1, S_2, …, S_n}, of all embeddings of S in G, let F = C_1 ∨ C_2 ∨ … ∨ C_n be the DNF formula constructed in step 1 of Exact-Occ-Prob. If there are m pairs of clauses in F that share common variables, then the value of the performance metric is at least

1 − (2^n′ + 2^m′ + n − n′ − 2) / (2^n − 1),   (12)

where n′ and m′ are integers such that (n′ choose 2) ≤ m, (n′ + 1 choose 2) > m, and m′ + (n′ choose 2) = m.

Proof. First, we construct a graph G. The vertex set of G is V(G) = {C_1, C_2, …, C_n}. The edge set of G is E(G) = {(C_i, C_j) | 1 ≤ i < j ≤ n, C_i and C_j share at least one variable}. Hence, the number of vertices of G is n, and the number of edges of G is m.

According to Proposition 1, the probability Pr(C_i1 ∧ C_i2 ∧ … ∧ C_ij) can be computed in O(1) time if and only if there exist α, β ∈ {1, 2, …, j} such that C_iα and C_iβ do not share any common variables, in other words, if and only if the subgraph of G induced by the vertex set {C_i1, C_i2, …, C_ij} is not a complete graph. Thus, the number of terms in (9) that are computed in O(1) time by Exact-Occ-Prob is equal to the number of incomplete induced subgraphs of G. For a graph with n vertices and m edges, there are at most 2^n′ + 2^m′ + n − n′ − 2 complete subgraphs [43]. Thus, the number of terms in (9) that are computed in O(1) time by Exact-Occ-Prob is at least (2^n − 1) − (2^n′ + 2^m′ + n − n′ − 2). Since there are 2^n − 1 terms in (9), the value of the performance metric is at least 1 − (2^n′ + 2^m′ + n − n′ − 2)/(2^n − 1). Thus, the theorem holds. □

Fig. 9. Value of (12) with respect to m.

Theorem 5 shows that a substantially large fraction of the terms in (9) can be computed in O(1) time by Exact-Occ-Prob. For example, Fig. 9 shows the value of (12) while m varies from 0 to 45 and n is fixed at 10. We can see that the value of (12) stays very high. Moreover, since (12) is a lower bound on the performance metric, the actual performance can be even higher.

The time complexity of Exact-Occ-Prob is analyzed as follows: Line 1 takes O(n|E(S)|) time to construct the DNF formula F. Line 4 needs O(|E(S)|) time to compute Pr(C_i)
for each 1 ≤ i ≤ n. Lines 8 to 12 loop 2^n − n − 1 times in total. For each loop, line 8 validates the condition in O(j²) time. For lines 9 and 11, in the worst case, when every pair of clauses in F shares at least one common variable, only line 11 is executed, which takes O(j²|E(S)|²) time; in the best case, when every pair of clauses in F shares no common variables, only line 9 is carried out, which takes O(1) time. Therefore, the time complexity of Exact-Occ-Prob is O(2^n n² |E(S)|²) in the worst case and O(n|E(S)| + 2^n n²) in the best case.

We will discuss when Exact-Occ-Prob can be used in the next section. It will be shown that Exact-Occ-Prob can be more efficient than the approximation algorithm presented in the next section for small instances of the problem, justifying that the exact algorithm is indeed necessary.

5.2.3 Approximation Algorithm

The time complexity of Exact-Occ-Prob is still exponential, so it cannot scale to more than about 30 embeddings. When S has a large number of embeddings in G, we propose an approximation algorithm to efficiently approximate P(S ⊑ G) by an interval. The approximation algorithm consists of two steps.

Step 1. Transform the problem of computing P(S ⊑ G) into the DNF counting problem by constructing a DNF formula F using the method given in Section 5.2.1.

Step 2. Approximate the probability, Pr(F), of F being satisfied by an interval [l, u] of width at most ε·minsup such that Pr(F) ∈ [l, u].

The constraints on [l, u] imposed in step 2 are due to the following reason: For an uncertain graph database D = {G_1, G_2, …, G_n} and a subgraph pattern S, let [l_i, u_i] be an interval approximating P(S ⊑ G_i) such that P(S ⊑ G_i) ∈ [l_i, u_i] and |u_i − l_i| ≤ ε·minsup for every 1 ≤ i ≤ n. Let L = (1/n) Σ_{i=1}^{n} l_i and U = (1/n) Σ_{i=1}^{n} u_i. The expected support, esup_D(S), of S in D must be within the interval [L, U], and |U − L| ≤ ε·minsup. Hence, either condition 1 or condition 2 described in Section 5.1.1 will be satisfied, and thus whether S should be output can be decided very efficiently.

A number of algorithms have been proposed to compute the interval [l, u] in step 2 [44], [45]. Although deterministic approximation algorithms such as [45] guarantee to produce a desired interval that certainly encloses the expected support, all existing deterministic algorithms have a common drawback: they are too complex to be implemented and to be applicable in practice. For this reason, we use the fully polynomial randomized approximation scheme (FPRAS) proposed in [44] to achieve both high accuracy and high efficiency. Particularly, for a given DNF formula F, an absolute error ε′ ∈ [0, 1], and a real number δ ∈ [0, 1], the FPRAS can compute an estimate p̂ of Pr(F) such that

Pr(|p̂ − Pr(F)| ≤ ε′) ≥ 1 − δ.   (13)

Fig. 10. The Approx-Occ-Prob procedure.

Fig. 10 illustrates the approximation algorithm, called Approx-Occ-Prob, for approximating P(S ⊑ G). Line 1 constructs the DNF formula F. Line 2 sets the absolute error ε′ = ε·minsup/2. Lines 3 to 12 are the FPRAS, which returns an estimate p̂ of Pr(F) such that |p̂ − Pr(F)| ≤ ε′ with probability at least 1 − δ. Line 13 returns [p̂ − ε′, p̂ + ε′] as the answer. It is evident that the returned interval has width 2ε′ = ε·minsup and that Pr(F) is contained in [p̂ − ε′, p̂ + ε′] with probability at least 1 − δ due to (13).

The time complexity of Approx-Occ-Prob is analyzed as follows: The construction of F at line 1 can be done in Θ(n|E(S)|) time. Line 2 computes Z in Θ(n|E(S)|) time. Lines 7 to 11 loop N times. For each of the loops, line 8 spends O(n|E(S)|) time to generate a random truth assignment τ, and the condition at line 10 can be tested in O(i|E(S)|) time. Since i is uniformly randomly chosen from {1, 2, …, n}, line 10 can be tested in expected O(n|E(S)|/2) time. Thus, the expected time complexity of Approx-Occ-Prob is O(Nn|E(S)|), where N = 4n ln(2/δ)/ε′² = 16n ln(2/δ)/(ε² minsup²) is the number of samplings, i.e., loops of lines 7 to 11, carried out by the FPRAS [44].

5.2.4 Trade-Off between Exact Algorithm and Approximation Algorithm

Finally, we discuss how to adaptively choose between the algorithms given above to compute P(S ⊑ G). As analyzed previously, the worst-case time complexity of Exact-Occ-Prob is T_exact = O(n² 2^n |E(S)|²), where n is the number of embeddings of S in G, and |E(S)| is the number of edges in S. The expected time complexity of Approx-Occ-Prob is T_approx = O(16n² ln(2/δ)|E(S)|/(ε² minsup²)). Therefore, if T_exact > T_approx, that is,

2^{n−4} |E(S)| > ln(2/δ) / (ε·minsup)²,   (14)

then we choose Approx-Occ-Prob; otherwise, we select Exact-Occ-Prob. Intuitively, when the number of embeddings of S in G is not large, the exact algorithm will outperform the approximation algorithm.
5.3 Optimized Method for Pruning Search Trees

This section proposes an optimized pruning method to reduce the complexity of pruning. We first present the fundamental idea of the pruning method in the context of exact mining, and then show how to apply the method in our approximate mining algorithm.

In exact frequent subgraph pattern mining, a subgraph pattern S and all its descendants in the search tree can be pruned if esup_D(S) < minsup. To compute esup_D(S), we must compute the probability, P(S ⊑ G), of S occurring in every uncertain graph G ∈ D. However, as the previous discussion shows, P(S ⊑ G) is very hard to compute. To reduce the complexity of pruning, a promising way is to prune an infrequent subgraph pattern S by computing P(S ⊑ G) for only a fraction of the uncertain graphs in D instead of all of them. To this end, we show the following proposition, which is the key to the design of the pruning method:

Proposition 2. Let D = {G_1, G_2, …, G_n} be an uncertain graph database, let S and S′ be two subgraph patterns in D with S′ a subpattern of S, and let the function U: {0, 1, …, n} → [0, 1] be defined as

U(x) = (1/n) (Σ_{i=1}^{x} P(S′ ⊑ G_i) + Σ_{i=x+1}^{n} P(S ⊑ G_i)).   (15)

Then, esup_D(S) = U(0) ≤ U(1) ≤ … ≤ U(n).

Proof. The proposition can be easily proved by the Apriori property of P(S ⊑ G) and (8). □

Proposition 2 defines a sequence of upper bounds U(1), U(2), …, U(n) on esup_D(S). If any one of the upper bounds is less than minsup, then S must be infrequent. Thus, a new pruning method can be developed based on Proposition 2. The method maintains an array A_S of length n for each subgraph pattern S on the path from the root of the search tree to the subgraph pattern currently being examined. The ith element, A_S[i], of A_S stores P(S ⊑ G_i) for 1 ≤ i ≤ n. Given a subgraph pattern S and an uncertain graph database D = {G_1, G_2, …, G_n} as input, the pruning method works as follows:

Step 1. Let S′ be the parent of S in the search tree. Initialize A_S[i] to be A_S′[i] for 1 ≤ i ≤ n.

Step 2. Compute U(n) by (1/n) Σ_{j=1}^{n} A_S[j]. If U(n) < minsup, then prune S and all its descendants from the search tree and terminate the pruning procedure.

Step 3. For i = n − 1 down to 1, perform the following steps:

1. Compute P(S ⊑ G_{i+1}) using the Exact-Occ-Prob procedure and set A_S[i + 1] to be P(S ⊑ G_{i+1}).
2. Compute U(i) by (1/n) Σ_{j=1}^{n} A_S[j], since A_S[j] = P(S′ ⊑ G_j) for 1 ≤ j ≤ i and A_S[j] = P(S ⊑ G_j) for i + 1 ≤ j ≤ n.
3. If U(i) < minsup, then prune S and all its descendants from the search tree and terminate.

The advantage of the pruning method can be elaborated as follows: Suppose that one of the upper bounds of esup_D(S), say U(k), is less than minsup; then S must be pruned by the pruning method. When computing U(k), only P(S ⊑ G_i) for k + 1 ≤ i ≤ n need to be computed, because P(S′ ⊑ G_i) for 1 ≤ i ≤ k are obtained from A_S′, which have already been computed before examining S in the depth-first search. Thus, we need not compute P(S ⊑ G_i) for 1 ≤ i ≤ k.

Fig. 11. The Approx-Exp-Sup procedure.

Next, we point out how to apply the proposed pruning method in our approximate mining algorithm. Particularly, the following changes need to be made:

• Let X_i be the set of embeddings of S in G_i. If 2^{|X_i|−4} |E(S)| > ln(2/δ)/(ε·minsup)², then P(S ⊑ G_i) should be approximated by an interval, denoted [l_i(S), u_i(S)], using the Approx-Occ-Prob procedure.
• We use A_S[i] to store the upper bound u_i(S) for 1 ≤ i ≤ n. Thus, (1/n) Σ_{j=1}^{n} A_S[j] is an upper bound on U(x), denoted Ū(x), since either P(S ⊑ G_i) ≤ P(S′ ⊑ G_i) ≤ u_i(S′) or P(S ⊑ G_i) ≤ u_i(S) holds for 1 ≤ i ≤ n.
• We should test Ū(i) < minsup instead of U(i) < minsup for 1 ≤ i ≤ n.

In our implementation, the proposed pruning method and the method for computing expected supports are combined to form the Approx-Exp-Sup procedure given in Fig. 11. It returns [−1, −1] if S can be pruned; otherwise, it returns an interval of width ε·minsup that approximates esup_D(S).

5.4 Complete Algorithm

The complete algorithm, called MUSE (Mining Uncertain Subgraph pattErns), is described in Fig. 12. The input of MUSE is an uncertain graph database D, an expected
support threshold minsup, a relative error tolerance ε, and a real number δ. The output of MUSE is an approximate set of frequent subgraph patterns in D. MUSE works as follows: First, initialize the result set F to be empty at line 1. Then, scan the edges of the uncertain graphs in D to obtain all subgraph patterns with one edge and push them into an empty stack T at line 2. Next, perform a depth-first search on the search tree of subgraph patterns to discover an approximate set of frequent subgraph patterns and add them to F at lines 3 to 14. Finally, line 15 outputs F as the answer.

Fig. 12. The MUSE algorithm.

The depth-first search on the search tree is carried out as follows: While the stack T is not empty, run the following steps:

Step 1. Pop the subgraph pattern S on the top of T at line 4. For each uncertain graph G_i ∈ D, find the subgraph isomorphisms from S to G_i at line 6 and get the set X_i of all embeddings of S in G_i under those subgraph isomorphisms at line 7. The subgraph isomorphism testing problem has been extensively studied. Here, we take advantage of the depth-first search to find the subgraph isomorphisms from S to G_i incrementally, based on the subgraph isomorphisms from the parent of S in the search tree to G_i. Our incremental subgraph isomorphism testing method is briefly introduced in the Appendix.

Step 2. Call the Approx-Exp-Sup procedure to approximate the expected support of S in D by an interval [l, u] at line 8. If u < minsup, i.e., condition 2 given in Section 5.1.1 is satisfied, then S will not be output, and the following step 3 can be skipped. By skipping step 3, all descendants of S in the search tree will not be examined, i.e., the subtree rooted at S is pruned. Note that if the subtree rooted at S can be pruned by the optimized pruning method, then the interval returned by Approx-Exp-Sup will be [−1, −1]; thus condition 2 will certainly be satisfied.

Step 3. If l ≥ (1 − ε)·minsup and u ≥ minsup, i.e., condition 1 given in Section 5.1.1 is satisfied, then add S to F at line 10 and scan the edges incident on the vertices of the embeddings of S in all uncertain graphs to obtain all direct superpatterns of S at line 11. For each direct superpattern S′ of S, if S is the parent of S′ in the search tree, then push S′ into stack T at line 14; otherwise, S′ is a child of another subgraph pattern S″ and should not be examined in the subtree rooted at S. Note that Parent(S′) at line 13 returns the parent of S′ in the search tree. The detailed procedure of the function Parent depends on the scheme used to build the search tree. For example, if the DFS coding scheme in [13] is used, then Parent(S′) returns the subgraph pattern whose minimum DFS code [13] is the longest prefix of the minimum DFS code of S′.

6 EXPERIMENTS

The MUSE algorithm was implemented, and extensive experiments were performed to evaluate the efficiency, memory usage, approximation quality, and scalability of MUSE, the impact of the optimized pruning method on the efficiency of MUSE, and the impact of uncertainties on the efficiency of MUSE.

For comparison, two other algorithms were also implemented based on MUSE as baselines. The first one, called MUSE-Approx, only uses the approximation method to compute expected supports. The second one, called MUSE-Apriori, does not use the optimized pruning method. We did not implement the algorithm that only uses the exact method to compute expected supports, because such an algorithm is obviously very inefficient when the number of embeddings of a subgraph pattern in an uncertain graph is large. In our implementation, we use the DFS coding scheme proposed in [13] to construct search trees. All algorithms were implemented in C under Windows XP. All experiments were performed on an IBM ThinkPad T61 notebook with a 2 GHz CPU and 2 GB RAM.

TABLE 1. Summary of the Real Uncertain Graph Database

We experimented on a real uncertain graph database. The real uncertain graph database is obtained from the STRING database [5] and contains the PPI networks of six organisms, which are summarized in Table 1. In Table 1, |V| indicates the number of vertices, |E| indicates the number of edges, and avg(P) indicates the average value of the existence probabilities of edges. Moreover, all vertices are labeled with COG protein functions [46].

6.1 Time Efficiency of MUSE

We first investigated the time efficiency of MUSE on the real uncertain graph database with respect to the threshold
ZOU ET AL.: MINING FREQUENT SUBGRAPH PATTERNS FROM UNCERTAIN GRAPH DATA 1215

Fig. 13. Execution time of MUSE, MUSE-Approx, and MUSE-Apriori with respect to (a) minsup, (b) ", and (c) .

minsup and the parameters " and . Fig. 13a shows the because the number of output frequent subgraph patterns
execution time of MUSE while minsup varies from 0.2 to decreases rapidly as minsup becomes larger, thus requiring
0.4, " ¼ 0:1, and  ¼ 0:1. The execution time decreases less memory to bookkeep subgraph isomorphisms. In
substantially while minsup increases. This is because the Figs. 14b and 14c, the memory usage of MUSE decreases
number of output frequent subgraph patterns decreases with the increasing of " and , respectively. The reason is
rapidly as minsup becomes larger. Fig. 13b shows the that for larger " and , the approximation algorithm for
execution time of MUSE while " varies from 0.01 to 0.3, computing expected supports is more possible to outper-
minsup ¼ 0:3, and  ¼ 0:1. The execution time decreases form the exact algorithm according to (14). Since the
rapidly while " increases. The reason is that the time spent approximation algorithm is more space efficient than the
by Approx-Occ-Prob decreases quadratically to the increase exact algorithm, the memory usage of MUSE decreases.
of " as analyzed in Section 5.2.3. Fig. 13c shows the The experimental results also show that MUSE outper-
execution time of MUSE while  varies from 0.01 to 0.3, forms MUSE-Apriori, and that MUSE-Approx outperforms
minsup ¼ 0:3, and " ¼ 0:1. The execution time decreases MUSE in terms of memory usage. This is because 1) due to
rapidly while  increases. This is because the time complex- the optimized pruning method, MUSE need not to compute
ity of Approx-Occ-Prob is proportional to lnð2=Þ as analyzed the probability of a subgraph pattern occurring in some
in Section 5.2.3. uncertain graphs, thus requiring less memory than MUSE-
The experimental results also show that MUSE outper- Apriori; and 2) MUSE-Approx only uses the approximation
forms all its competitors, MUSE-Approx and MUSE-Apriori. algorithm for computing expected supports, which is more
It verifies two statements in the previous sections: 1) the space efficient than the exact algorithm. Although MUSE-
approximation algorithm for computing expected supports Approx is more efficient than MUSE in memory usage, it is
is less efficient than the exact algorithm when the number of less efficient than MUSE in execution time.
embeddings of a subgraph pattern in an uncertain graph is
small; and 2) the optimized pruning method can reduce the 6.3 Approximation Quality of MUSE
time complexity of MUSE. Since MUSE is an approximate mining algorithm, we
evaluated its approximation quality with respect to " and
6.2 Memory Usage of MUSE  on the real uncertain graph database. The approximation
We then investigated the memory usage of MUSE with quality is measured by the precision and recall metrics.
respect to minsup, ", and . Fig. 14 shows the memory usage Precision is the percentage of true frequent subgraph
of MUSE, MUSE-Approx, and MUSE-Apriori in mega bytes patterns in the output subgraph patterns. Recall is the
(MB) measured in the previous experiment. In Fig. 14a, the percentage of returned subgraph patterns in the true
memory usage of MUSE decreases while minsup increases frequent subgraph patterns. Since it is NP-hard to find all

Fig. 14. Memory usage of MUSE, MUSE-Approx, and MUSE-Apriori with respect to (a) minsup, (b) ", and (c) .
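The two metrics can be stated compactly in code. The following is an illustrative sketch (the pattern identifiers are hypothetical placeholders for canonical codes of subgraph patterns; this is not part of the MUSE implementation):

```python
def precision_recall(output_patterns, true_patterns):
    """Precision: fraction of output patterns that are truly frequent.
    Recall: fraction of truly frequent patterns that were output."""
    output, truth = set(output_patterns), set(true_patterns)
    hits = output & truth
    precision = len(hits) / len(output) if output else 1.0
    recall = len(hits) / len(truth) if truth else 1.0
    return precision, recall

# Hypothetical canonical codes of subgraph patterns.
p, r = precision_recall({"A-B", "B-C", "A-C"}, {"A-B", "B-C", "C-D", "D-E"})
print(p, r)  # 2 of 3 outputs are true; 2 of 4 true patterns are returned
```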

Since it is NP-hard to find all true frequent subgraph patterns, we regarded the subgraph patterns discovered under ε = 0.01 and δ = 0.01 as the true frequent subgraph patterns.
Fig. 15a shows the details of the output subgraph patterns while ε varies from 0.01 to 0.3, δ = 0.1, and minsup = 0.3. Each percentage above in the figure indicates the precision, and each percentage below indicates the recall. We can see that the precision of MUSE decreases and the recall remains stable while ε increases. This is because 1) when ε becomes larger, more false frequent subgraph patterns are returned, reducing the precision; and 2) when δ is fixed, the probability of a frequent subgraph pattern being returned is also fixed, so the number of output true frequent subgraph patterns does not change significantly.
Fig. 15b shows the experimental results while δ varies from 0.01 to 0.3, ε = 0.1, and minsup = 0.3. The precision remains stable, but the recall decreases while δ increases. The reason is that 1) the fixed ε determines the expected number of false frequent subgraph patterns to be returned, so the precision remains stable; and 2) as δ increases, the probability of a frequent subgraph pattern being output decreases, so the number of returned true frequent subgraph patterns decreases, reducing the recall. All experimental results verify that MUSE achieves very high approximation quality.

Fig. 15. Approximation quality of MUSE with respect to (a) ε and (b) δ.

6.4 Scalability of MUSE
We also examined the scalability of MUSE with respect to the number of uncertain graphs in an uncertain graph database. We controlled the number of uncertain graphs by duplicating the uncertain graphs in the real uncertain graph database. Fig. 16 shows the execution time and the memory usage of MUSE on the duplicated real uncertain graph database while the number of duplications varies from 1 to 10, minsup = 0.3, ε = 0.1, and δ = 0.1. Both the execution time and the memory usage of MUSE increase linearly with the number of duplications. The experimental results show that MUSE is very scalable for large uncertain graph databases.

6.5 Effect of Optimized Pruning Method
We have evaluated the effect of the optimized pruning method in the previous experiments. Here, we investigate it in more detail. The pruned infrequent subgraph patterns³ can be classified into three groups according to the method used: 1) the ones pruned by comparing their expected supports with minsup, i.e., the apriori pruning; 2) the ones pruned by the optimized pruning method; and 3) the ones pruned by comparing the proportion of containing uncertain graphs with minsup. Note that the third pruning method is actually a special case of the optimized pruning method, but here it is examined individually for clarity.
Fig. 17 shows the number of subgraph patterns pruned by each of the methods for MUSE and MUSE-Apriori on the real uncertain graph database while minsup varies from 0.2 to 0.3, ε = 0.1, and δ = 0.1. Since MUSE-Apriori doesn't use the optimized pruning method, the number of subgraph patterns pruned by that method is of course zero for MUSE-Apriori. We can see that for MUSE-Apriori, approximately 10 percent of subgraph patterns are pruned by the apriori pruning, the most expensive method of the three, but only about 1 percent for MUSE. As a result, MUSE outperforms MUSE-Apriori, as shown in Fig. 13a. Moreover, the optimized pruning method can additionally prune 9 percent of subgraph patterns that the third method cannot prune.

Fig. 17. Number of pruned subgraph patterns for MUSE and MUSE-Apriori.
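To make the cheapest of these pruning rules concrete, here is a hedged sketch. The apriori pruning compares the expected support of a pattern with minsup, and the third rule compares the proportion of uncertain graphs that can contain the pattern with minsup; since that proportion upper-bounds the expected support, it can prune without computing any occurrence probability. The probabilities below are illustrative; in MUSE, computing the probability that a pattern occurs in an uncertain graph is itself the expensive step.

```python
def expected_support(occ_probs):
    """occ_probs[i] = Pr(pattern S occurs in uncertain graph G_i);
    the expected support averages these probabilities over the database."""
    return sum(occ_probs) / len(occ_probs)

def cheap_prune(contains, minsup):
    """The fraction of uncertain graphs that contain S at all upper-bounds
    its expected support, so S (and, by anti-monotonicity, its supergraph
    patterns) can be pruned when that fraction falls below minsup."""
    return sum(contains) / len(contains) < minsup

# Illustrative database of four uncertain graphs.
probs = [0.9, 0.0, 0.4, 0.0]         # Pr(S occurs in G_i)
contains = [pr > 0 for pr in probs]  # whether G_i can contain S at all
print(expected_support(probs))       # about 0.325
print(cheap_prune(contains, 0.6))    # True: only 2 of 4 graphs can contain S
```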

3. Here, the pruned infrequent subgraph patterns refer to the infrequent subgraph patterns that are examined and pruned in the depth-first search, not including the descendants of these subgraph patterns in the search tree.

Fig. 16. Scalability of MUSE with respect to the number of uncertain graphs. (a) Execution time. (b) Memory usage.

6.6 Impact of Uncertainties on MUSE
This experiment investigated the impact of the distribution of uncertainties on the efficiency of MUSE. To vary the distribution of uncertainties, we imposed mathematical transformations on the uncertainties of each uncertain graph. The transformation is of the form

    f(x) = 1             if c1·x + c0 > 1,
    f(x) = 0             if c1·x + c0 < 0,
    f(x) = c1·x + c0     otherwise.
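As a hedged illustration (a generic sketch, not MUSE code), this clamped linear map, together with the mean-preserving choice of c0 described in the text, can be written as:

```python
def transform(x, c0, c1):
    """Clamp the linearly transformed existence probability into [0, 1]."""
    return max(0.0, min(1.0, c1 * x + c0))

# Mean-preserving choice used in the second experiment: c0 = (1 - c1) * mu,
# where mu is the mean edge existence probability (values are illustrative).
probs = [0.2, 0.5, 0.8]
mu = sum(probs) / len(probs)               # 0.5
c1 = 0.8
c0 = (1 - c1) * mu
stretched = [transform(pr, c0, c1) for pr in probs]
# With c1 < 1 the spread shrinks toward the mean; no clamping occurs for
# these values, so the mean itself is preserved.
print(stretched)
```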
Here, c0, c1 ∈ [0, 1], and the transformation maps the existence probability value x ∈ [0, 1] of an edge to f(x) ∈ [0, 1].
We ran MUSE with minsup = 0.3, ε = 0.1, and δ = 0.1 on the transformed real uncertain graph databases. Fig. 18a shows the execution time of MUSE while the coefficient c0 of the transformation varies from 0.1 to 0.5 and the coefficient c1 = 0.5. Each integer on the plot indicates the number of output subgraph patterns. We can see that the execution time increases as c0 gets larger. This is because a larger c0 increases the existence probabilities of edges, thus increasing the expected supports of all subgraph patterns. Since minsup is fixed, more subgraph patterns are output as frequent subgraph patterns, which increases the execution time.
Fig. 18b shows the execution time of MUSE while the coefficient c1 of the transformation varies from 0.5 to 1 and c0 = (1 − c1)·μ, where μ is the mean value of the existence probabilities of edges in each uncertain graph to be transformed. It is easy to show that the mean value of the existence probabilities after the transformation is also μ. The execution time increases as c1 gets larger. This is because as c1 increases, the variance of the existence probabilities becomes larger, and more edges have high existence probabilities. This increases the number of subgraph patterns with high expected supports, thus increasing the execution time.

Fig. 18. Impact of uncertainties on efficiency of MUSE. (a) Varying c0. (b) Varying c1.

7 CONCLUSIONS
This paper investigates the problem of mining frequent subgraph patterns from an uncertain graph database. A novel model of uncertain graph data has been proposed, and the frequent subgraph pattern mining problem has been formalized by introducing the concept of expected support. The problem has been formally proved to be NP-hard. Due to this hardness, an approximate mining algorithm, called MUSE, has been developed to discover an approximate set of frequent subgraph patterns with respect to the relative error tolerance ε and the degree of confidence 1 − δ. Analytical and experimental results show that MUSE has high efficiency, accuracy, and scalability and that the optimization techniques adopted by MUSE are very effective and efficient.

ACKNOWLEDGMENTS
This work was supported in part by the NSF of China under Grant No. 60773063, the NSFC-RGC of China under Grant No. 60831160525, the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303000, and the Key Program of the NSF of China under Grant No. 60533110.

APPENDIX
We briefly introduce our method for finding subgraph isomorphisms from a subgraph pattern S to a graph G. If S contains only one edge, then we trivially scan all edges of G to find the subgraph isomorphisms. If S consists of more than one edge, we can find the subgraph isomorphisms in an incremental manner. Let S′ be the parent subgraph pattern of S in the search tree, and let (u, v) be the only edge in E(S) \ E(S′), since S has exactly one more edge than S′. Note that a subgraph isomorphism from S to G must contain a subgraph isomorphism from S′ to G. Thus, we can find the subgraph isomorphisms from S to G incrementally, based on the subgraph isomorphisms from S′ to G.
Suppose that both u and v are contained in S′. For each subgraph isomorphism f′ from S′ to G, if edge (f′(u), f′(v)) ∈ E(G) and the label of (u, v) is identical to the label of (f′(u), f′(v)), then f′ is also a subgraph isomorphism from S to G.
Suppose that u ∈ V(S′) but v ∉ V(S′). For every subgraph isomorphism f′ from S′ to G, if there exists an edge (f′(u), w) in G such that the label of (u, v) is identical to the label of (f′(u), w) and w ≠ f′(x) for every vertex x ∈ V(S′), then f′ ∪ {v → w} is a subgraph isomorphism from S to G.
In this way, we can find all subgraph isomorphisms from S to G. This method is clearly more efficient than methods that find subgraph isomorphisms from scratch.
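The appendix's incremental extension of embeddings can be sketched as follows. The graph encoding (an undirected labeled graph stored as a nested adjacency dict) and all names here are illustrative assumptions, not the MUSE implementation:

```python
def extend_embeddings(embeddings, G_adj, u, v, label):
    """Extend each subgraph isomorphism f of the parent pattern S' by the
    new edge (u, v) of S, following the two cases in the appendix.
    embeddings: list of dicts mapping pattern vertices to graph vertices.
    G_adj: G_adj[a][b] = label of edge (a, b) in G, if the edge exists."""
    extended = []
    for f in embeddings:
        if v in f:  # case 1: both endpoints are already mapped
            if G_adj.get(f[u], {}).get(f[v]) == label:
                extended.append(f)
        else:       # case 2: v is the new vertex of S
            for w, lab in G_adj.get(f[u], {}).items():
                # w must be an unused graph vertex with a matching edge label
                if lab == label and w not in f.values():
                    extended.append({**f, v: w})
    return extended

# Toy graph G: a labeled triangle on vertices 0, 1, 2.
G = {0: {1: "a", 2: "b"}, 1: {0: "a", 2: "a"}, 2: {0: "b", 1: "a"}}
# Embeddings of a one-edge parent pattern S' = (x, y) with label "a".
parent = [{"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 2}, {"x": 2, "y": 1}]
# Grow S by the edge (y, z) labeled "a": all paths x-y-z in G.
print(extend_embeddings(parent, G, "y", "z", "a"))
```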
Zhaonian Zou received the BS and MEng degrees in computer science from Jilin University, China, in 2002 and 2005, respectively. He is currently working toward the PhD degree in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. He worked as a research assistant in the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong in 2007. His research interests include data mining and query processing in graph databases.

Jianzhong Li is a professor in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. In the past, he worked as a visiting scholar at the University of California at Berkeley, as a staff scientist in the Information Research Group at the Lawrence Berkeley National Laboratory, and as a visiting professor at the University of Minnesota. His research interests include data management systems, data mining, data warehousing, sensor networks, and bioinformatics. He has published extensively and been involved in the program committees of all major database conferences, including SIGMOD, VLDB, and ICDE. He has also served on the editorial boards of various journals, including the IEEE Transactions on Knowledge and Data Engineering. He is a member of the IEEE.

Hong Gao received the BS degree in computer science from Heilongjiang University, China, the MS degree in computer science from Harbin Engineering University, China, and the PhD degree in computer science from the Harbin Institute of Technology, China. She is currently a professor in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. Her research interests include graph data management, sensor networks, and massive data management.

Shuo Zhang received the BEng and MEng degrees in computer science from the Harbin Institute of Technology, China, in 2003 and 2005, respectively. He is currently working toward the PhD degree in the Department of Computer Science and Technology at the Harbin Institute of Technology, China. His research interests include indexing techniques, query processing, and data mining in graph databases.
