
To appear in IEEE Transactions on Knowledge and Data Engineering

An Efficient Algorithm for Discovering Frequent Subgraphs
Michihiro Kuramochi and George Karypis, Member, IEEE
Department of Computer Science
University of Minnesota
4-192 EE/CS Building, 200 Union St SE
Minneapolis, MN 55455
{kuram, karypis}@cs.umn.edu

Abstract— Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset.

Index Terms— Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.

This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274 and ACI-0133464, by Army Research Office contract DA/DAAG55-98-1-0441, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by the Minnesota Supercomputing Institute.

I. INTRODUCTION

EFFICIENT algorithms for finding frequent patterns—both sequential and non-sequential—in very large datasets have been one of the key success stories of data mining research [1], [2], [20], [36], [41], [49]. Nevertheless, as data mining techniques have been increasingly applied to non-traditional domains, there is a need to develop efficient and general-purpose frequent pattern discovery algorithms that are capable of capturing the strong spatial, topological, geometric, and/or relational nature of the datasets that characterize these domains.

In recent years, labeled topological graphs have emerged as a promising abstraction to capture the characteristics of these datasets. In this approach, each object to be analyzed is represented via a separate graph whose vertices correspond to the entities in the object and whose edges correspond to the relations between them. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs.

The power of graphs to model complex datasets has been recognized by various researchers [3], [6], [10], [14], [19], [23], [26], [30], [37], [43], [46], as it allows us to represent arbitrary relations among entities and solve problems that we could not previously solve. For instance, consider the problem of mining chemical compounds to find recurrent substructures. We can achieve that using a graph-based pattern discovery algorithm by creating a graph for each one of the compounds, whose vertices correspond to different atoms and whose edges correspond to bonds between them. We can assign to each vertex a label corresponding to the atom involved (and potentially its charge), and assign to each edge a label corresponding to the type of the bond (and potentially information about their relative 3D orientation). Once these graphs have been created, recurrent substructures across different compounds become frequently occurring subgraphs. In fact, within the context of chemical compound classification, such techniques have been used to mine chemical compounds and identify the substructures that best discriminate between the different classes [5], [11], [27], [42], and were shown to produce classifiers superior to those obtained by more traditional methods [21].

Developing algorithms that discover all frequently occurring subgraphs in a large graph dataset is particularly challenging and computationally intensive, as graph and subgraph isomorphisms play a key role throughout the computations. In this paper we present a new algorithm, called FSG, for finding all connected subgraphs that appear frequently in a large graph dataset. Our algorithm finds frequent subgraphs using the level-by-level expansion strategy adopted by Apriori [2]. The key features of FSG are the following: (i) it uses a sparse graph representation that minimizes both storage and computation; (ii) it increases the size of frequent subgraphs by adding one edge at a time, allowing it to generate the candidates efficiently; (iii) it incorporates various optimizations for candidate generation and frequency counting which enable it to scale to large graph datasets; and (iv) it uses sophisticated algorithms for canonical labeling to uniquely identify the various generated subgraphs without having to resort to computationally expensive graph- and subgraph-isomorphism computations.

We experimentally evaluated FSG on three types of datasets. The first two datasets correspond to various chemical compounds containing over 200,000 transactions and frequent patterns whose size is large, and the third type corresponds to various graph datasets that were synthetically generated using a framework similar to that used for market-basket transaction generation [2]. Our results illustrate that FSG can operate on very large graph datasets and find all frequently occurring subgraphs in a reasonable amount of time, scaling linearly with the dataset size. For example, in a dataset containing over 200,000 chemical compounds, FSG can discover all subgraphs that occur in at least 1% of the transactions in approximately one hour. Furthermore, our detailed evaluation using the synthetically generated graphs shows that for datasets that have a moderately large number of different vertex and edge labels, FSG is able to achieve good performance as the transaction size increases.

The rest of the paper is organized as follows. Section II provides some definitions and introduces the notation that is used in the paper. Section III formally defines the problem of frequent subgraph discovery and discusses the modeling strengths of the discovered patterns and the challenges associated with finding them in a computationally efficient manner. Section IV describes the algorithm in detail. Section V describes the various optimizations that we developed for efficiently computing the canonical label of the patterns. Section VI provides a detailed experimental evaluation of FSG on a large number of real and synthetic datasets. Section VII describes the related research in this area, and finally, Section VIII provides some concluding remarks.

TABLE I
NOTATION USED THROUGHOUT THE PAPER

  Notation             Description
  k-subgraph           A connected subgraph with k edges
                       (also written as a size-k subgraph)
  G^k, H^k             (Sub)graphs of size k
  E(G)                 Edges of a (sub)graph G
  V(G)                 Vertices of a (sub)graph G
  cl(G)                A canonical label of a graph G
  a, b, c, e, f        Edges
  u, v                 Vertices
  d(v)                 Degree of a vertex v
  l(v)                 The label of a vertex v
  l(e)                 The label of an edge e
  H = G - e            The graph obtained by deleting edge e ∈ E(G)
  D                    A dataset of graph transactions
  {D1, D2, ..., DN}    Disjoint N partitions of D
                       (for i ≠ j, Di ∩ Dj = ∅ and ∪i Di = D)
  T                    A graph transaction
  C                    A candidate subgraph
  C^k                  A set of candidates with k edges
  C                    A set of all candidates
  F                    A frequent subgraph
  F^k                  A set of frequent k-subgraphs
  F                    A set of all frequent subgraphs
  k*                   The size of the largest frequent subgraph in D
  L_E                  A set of all edge labels in D
  L_V                  A set of all vertex labels in D

II. DEFINITIONS AND NOTATION

A graph G = (V, E) is made of two sets, the set of vertices V and the set of edges E. Each edge itself is a pair of vertices, and throughout this paper we assume that the graph is undirected, i.e., each edge is an unordered pair of vertices. Furthermore, we will assume that the graph is labeled. That is, each vertex and edge has a label associated with it that is drawn from a predefined set of vertex labels (L_V) and edge labels (L_E). Each vertex (or edge) of the graph is not required to have a unique label, and the same label can be assigned to many vertices (or edges) in the same graph.

Given a graph G = (V, E), a graph Gs = (Vs, Es) will be a subgraph of G if and only if Vs ⊆ V and Es ⊆ E, and it will be an induced subgraph of G if Vs ⊆ V and Es contains all the edges of E that connect vertices in Vs. A graph is connected if there is a path between every pair of vertices in the graph. Two graphs G1 = (V1, E1) and G2 = (V2, E2) are isomorphic if they are topologically identical to each other, that is, there is a mapping from V1 to V2 such that each edge in E1 is mapped to a single edge in E2 and vice versa. In the case of labeled graphs, this mapping must also preserve the labels on the vertices and edges. An automorphism is an isomorphism mapping where G1 = G2. Given two graphs G1 = (V1, E1) and G2 = (V2, E2), the problem of subgraph isomorphism is to find an isomorphism between G2 and a subgraph of G1, i.e., to determine whether or not G2 is included in G1.

The canonical label of a graph G = (V, E), cl(G), is defined to be a unique code (i.e., a sequence of bits, a string, or a sequence of numbers) that is invariant under the ordering of the vertices and edges in the graph [15]. As a result, two graphs will have the same canonical label if they are isomorphic. Examples of different canonical label codes and details on how they are computed are presented in Section V. Neither canonical labeling nor graph isomorphism is known to be either in P or NP-complete [15].

The size of a graph G = (V, E) is defined to be equal to |E|. Given a size-k connected graph G = (V, E), by adding an edge we will refer to the operation in which an edge e = (u, v) is added to the graph so that the resulting size-(k + 1) graph remains connected. Similarly, by deleting an edge we refer to the operation in which e = (u, v) such that e ∈ E is deleted from the graph and the resulting size-(k − 1) graph remains connected. Note that depending on the particular choice of e, the deletion of the edge may also delete at most one of its incident vertices, if that vertex has only e as its incident edge.

Finally, the notation that we will be using throughout the paper is shown in Table I.

III. FREQUENT SUBGRAPH DISCOVERY—PROBLEM DEFINITION

The problem of finding frequently occurring connected subgraphs in a set of graphs is defined as follows:

Definition 1 (Subgraph Discovery): Given a set of graphs D, each of which is an undirected labeled graph, and a parameter σ such that 0 < σ ≤ 1, find all connected undirected graphs that are subgraphs in at least σ|D| of the input graphs.

We will refer to each of the graphs in D as a graph transaction, or simply a transaction when the context is clear, to D as the graph transaction dataset, and to σ as the support threshold.

There are two key aspects in the above problem statement. First, we are only interested in subgraphs that are connected. This is motivated by the fact that the resulting frequent subgraphs will be encapsulating relations (or edges) between some of the entities (or vertices) of various objects. Within this context, connectivity is a natural property of frequent patterns. An additional benefit of this restriction is that it reduces the complexity of the problem, as we do not need to consider disconnected combinations of frequent connected subgraphs. Second, we allow the graphs to be labeled, and as discussed in Section II, input graph transactions and discovered frequent patterns can contain multiple vertices and edges carrying the same label. This greatly increases our modeling ability, as it allows us to find patterns involving multiple occurrences of the same entities and relations, but at the same time it makes the problem of finding such frequently occurring subgraphs non-trivial. This is because in such cases, any frequent subgraph discovery algorithm needs to correctly identify how a particular subgraph maps to the vertices and edges of each graph transaction, which can only be done by solving many instances of the subgraph isomorphism problem, which has been shown to be NP-complete [16].

IV. FSG—FREQUENT SUBGRAPH DISCOVERY ALGORITHM

In developing our frequent subgraph discovery algorithm, we decided to follow the level-by-level structure of the Apriori [2] algorithm used for finding frequent itemsets. The motivation behind this choice is the fact that the level-by-level structure of Apriori requires the smallest number of subgraph isomorphism computations during frequency counting, as it allows it to take full advantage of the downward closed property of the minimum support constraint and achieves the highest amount of pruning when compared with the most recently developed depth-first-based approaches such as dEclat [49], Tree Projection [1], and FP-growth [20]. In fact, despite the extra overhead due to candidate generation that is incurred by the level-by-level approach, recent studies have shown that because of its effective pruning, it achieves performance comparable to that achieved by the various depth-first-based approaches, as long as the dataset is not dense or the support value is not extremely small [18], [22].

The overall flow of our algorithm, called FSG, is similar to that of Apriori, and works as follows. FSG starts by enumerating all frequent single- and double-edge subgraphs. Then, it enters its main computational phase, which consists of a main iteration loop. During each iteration, FSG first generates all candidate subgraphs whose size is greater than the previous frequent ones by one edge, and then counts the frequency for each of these candidates and prunes subgraphs that do not satisfy the support constraint. FSG stops when no frequent subgraphs are generated for a particular iteration. Details on how FSG generates the candidate subgraphs, and on how it computes their frequency, are provided in Section IV-A and Section IV-B, respectively. To ensure that the various graph-related operations are performed efficiently, FSG stores the various input graphs and the various candidate and frequent subgraphs that it generates using an adjacency list representation.
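The level-by-level flow just described can be summarized in a short sketch. The Python fragment below is our reading of this paragraph rather than the authors' code: the enumeration, join, and counting routines that Sections IV-A and IV-B detail are passed in as parameters, and all names here are hypothetical.

    def fsg(D, sigma, enum_initial, join_frequent, count_support):
        """Level-by-level (Apriori-style) frequent subgraph discovery.

        D             -- list of graph transactions
        sigma         -- support threshold, 0 < sigma <= 1
        enum_initial  -- yields the frequent 1- and 2-edge subgraphs
        join_frequent -- candidate generation (Section IV-A)
        count_support -- frequency counting (Section IV-B)
        """
        min_count = sigma * len(D)
        frequent = {}
        frequent[1], frequent[2] = enum_initial(D, min_count)
        k = 2
        while frequent[k]:
            candidates = join_frequent(frequent[k])
            frequent[k + 1] = [c for c in candidates
                               if count_support(c, D) >= min_count]
            k += 1
        return [g for level in frequent.values() for g in level]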
A. Candidate Generation

FSG generates candidate subgraphs of size k + 1 by joining two frequent size-k subgraphs. In order for two such frequent size-k subgraphs to be eligible for joining, they must contain the same size-(k − 1) connected subgraph. The simplest way to generate the complete set of candidate subgraphs is to join all pairs of size-k frequent subgraphs that have a common size-(k − 1) subgraph. Unfortunately, the problem with this approach is that a particular size-k subgraph can have up to k different size-(k − 1) subgraphs. As a result, if we consider all such possible subgraphs and perform the resulting join operations, we will end up generating the same candidate pattern multiple times, as well as a large number of candidate patterns that are not downward closed. The net effect of this is that the resulting algorithm spends a significant amount of time identifying unique candidates and eliminating non-downward-closed candidates (both of which operations are non-trivial, as they require determining the canonical label of the generated subgraphs). Note that candidate generation approaches in the context of frequent itemsets (e.g., Apriori [2]) do not suffer from this problem because they use a consistent way to order the items within an itemset (e.g., lexicographically). Using this ordering, they only join two size-k itemsets if they have the same (k − 1)-prefix. For example, a particular itemset {A, B, C, D} will only be generated once (by joining {A, B, C} and {A, B, D}), and if that itemset is not downward closed, it will never be generated if only its {A, B, C} and {B, C, D} subsets were frequent.

Fortunately, the situation for subgraph candidate generation is not as severe as the above discussion seems to indicate, and FSG addresses both of these problems by joining two frequent subgraphs if and only if they share a certain, properly selected, size-(k − 1) subgraph. Specifically, for each frequent size-k subgraph Fi, let P(Fi) = {Hi,1, Hi,2} be the two size-(k − 1) connected subgraphs of Fi such that Hi,1 has the smallest canonical label and Hi,2 has the second smallest canonical label among the various connected size-(k − 1) subgraphs of Fi. We will refer to these subgraphs as the primary subgraphs of Fi. Note that if every size-(k − 1) subgraph of Fi is isomorphic to each other, then Hi,1 = Hi,2 and |P(Fi)| = 1. FSG will only join two frequent subgraphs Fi and Fj if and only if P(Fi) ∩ P(Fj) ≠ ∅, and the join operation will be done with respect to the common size-(k − 1) subgraph(s). The proof that this approach correctly generates all valid candidate subgraphs is presented in the Appendix. This candidate generation approach dramatically reduces the number of redundant and non-downward-closed patterns that are generated, and leads to significant performance improvements over the naive approach (originally implemented in [29]).
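The join-eligibility rule can be phrased compactly in code. The sketch below assumes a hypothetical graph API (an edges list, a delete_edge helper that returns the connected size-(k − 1) subgraph or None, and a canonical_label routine as in Section V); it illustrates the rule and is not the paper's implementation.

    def primary_subgraphs(F, delete_edge, canonical_label):
        """P(F): the (at most two) connected (k-1)-subgraphs of F
        with the smallest canonical labels (Section IV-A)."""
        by_label = {}
        for e in F.edges:
            H = delete_edge(F, e)   # None if result is disconnected
            if H is not None:
                by_label[canonical_label(H)] = H
        smallest_two = sorted(by_label)[:2]
        return {lab: by_label[lab] for lab in smallest_two}

    def eligible_for_join(Fi, Fj, delete_edge, canonical_label):
        """Fi and Fj are joined only when P(Fi) and P(Fj) intersect."""
        Pi = primary_subgraphs(Fi, delete_edge, canonical_label)
        Pj = primary_subgraphs(Fj, delete_edge, canonical_label)
        return not Pi.keys().isdisjoint(Pj.keys())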
The actual join operation of two frequent size-k subgraphs Fi and Fj that have a common primary subgraph H is performed by generating a candidate size-(k + 1) subgraph that contains H plus the two edges that were deleted from Fi and Fj to obtain H. However, unlike the joining of itemsets,

in which two frequent size-k itemsets lead to a unique size-(k + 1) itemset, the joining of two size-k subgraphs may produce multiple distinct size-(k + 1) candidates. This happens for the following two reasons. First, the difference between the common primary subgraph and the two frequent subgraphs can be a vertex that has the same label. In this case, the joining of such size-k subgraphs will generate two distinct subgraphs of size k + 1. Fig. 1(a) shows such an example, in which the pair of graphs G^4_1 and G^4_2 generates two different candidates G^5_1 and G^5_2. Second, the primary subgraph itself may have multiple automorphisms, and each of them can lead to a different size-(k + 1) candidate. In the worst case, when the primary subgraph is an unlabeled clique, the number of automorphisms is k!. An example for this case is shown in Fig. 1(b), in which the primary subgraph—a square of four vertices labeled with a—has four automorphisms resulting in three different candidates of size six. Finally, in addition to joining two different subgraphs, FSG also needs to perform self joins. This happens, for example, when the two graphs G^k_i and G^k_j in Fig. 1 are identical. Self joining is necessary because, if we consider graph transactions without any labels, there will be only one frequent size-1 subgraph and one frequent size-2 subgraph regardless of the support threshold, since those are the only allowed structures when edges and vertices carry no labels. In general, whenever |F^k| = 1, a self join is necessary to obtain a set of valid (k + 1)-candidates.

[Fig. 1. Two cases of joining: (a) by vertex labeling, where joining G^4_1 and G^4_2 yields the two candidates G^5_1 and G^5_2; (b) by multiple automorphisms of a single core, where joining G^5_1 and G^5_2 yields the three candidates G^6_1, G^6_2, and G^6_3.]

B. Frequency Counting

The simplest way to determine the frequency of each candidate subgraph is to scan each one of the dataset transactions and determine whether it is contained or not using subgraph isomorphism. Nonetheless, having to compute these isomorphisms is particularly expensive, and this approach is not feasible for large datasets. In the context of frequent itemset discovery by Apriori, frequency counting is performed substantially faster by building a hash-tree of candidate itemsets and scanning each transaction to determine which of the itemsets in the hash-tree it supports. Developing such an algorithm for frequent subgraphs, however, is challenging, as there is no natural way to build the hash-tree for graphs.

For this reason, FSG instead uses transaction identifier (TID) lists, proposed by [13], [40], [47]. In this approach, for each frequent subgraph FSG keeps a list of the identifiers of the transactions that support it. Now when FSG needs to compute the frequency of G^{k+1}, it first computes the intersection of the TID lists of its frequent k-subgraphs. If the size of the intersection is below the support, G^{k+1} is pruned; otherwise FSG computes the frequency of G^{k+1} using subgraph isomorphism by limiting the search only to the set of transactions in the intersection of the TID lists. The advantages of this approach are two-fold. First, in the cases where the intersection of the TID lists is below the minimum support level, FSG is able to prune the candidate subgraph without performing any subgraph isomorphism computations. Second, when the intersection set is sufficiently large, FSG only needs to compute subgraph isomorphisms for those graphs that can potentially contain the candidate subgraph, and not for all the graph transactions.
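The TID-list pruning step translates almost directly into code. A minimal sketch, assuming only that a subgraph isomorphism test is supplied; the function and parameter names are ours.

    def count_with_tid_lists(candidate, parent_tids, D, min_count,
                             is_subgraph):
        """Frequency counting with TID lists (Section IV-B sketch).

        parent_tids -- TID lists (sets of transaction ids) of the
                       frequent k-subgraphs of the candidate
        is_subgraph -- subgraph isomorphism oracle (assumed given)
        """
        common = set.intersection(*parent_tids)
        if len(common) < min_count:
            return set(), 0   # pruned without any isomorphism tests
        # Only transactions in the intersection can contain it.
        tids = {t for t in common if is_subgraph(candidate, D[t])}
        return tids, len(tids)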
1) Reducing Memory Requirements of TID lists: The computational advantages of TID lists come at the expense of higher memory requirements for maintaining them. To address this limitation we implemented a database-partitioning-based scheme that was motivated by a similar scheme developed for mining frequent itemsets [39]. In this approach, the database is partitioned into N disjoint parts D = {D1, D2, ..., DN}. Each of these sub-databases Di is mined to find a set of frequent subgraphs Fi, called local frequent subgraphs. The union of the local frequent subgraphs, C̄ = ∪i Fi, called the global candidates, is determined, and their frequency in the entire database is computed by reading each graph transaction and finding the set of subgraphs that it supports. The subset of C̄ that satisfies the minimum support constraint is output as the final set of frequent patterns F. Since the memory required for storing the TID lists depends on the size of the database, their overall memory requirements can be reduced by partitioning the database into a sufficiently large number of partitions.
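In outline, the partition-based scheme looks as follows. This is a rough sketch under stated assumptions: a per-partition miner and a per-transaction containment test are supplied (both names are hypothetical), and the optimizations described next are omitted.

    def mine_with_partitions(D, sigma, N, mine_partition, supports):
        """Database-partitioning scheme of Section IV-B.1 (sketch).

        mine_partition(Di, sigma) -- FSG on one partition; returns a
                                     dict {canonical label: subgraph}
        supports(T, g)            -- True if transaction T contains g
        """
        parts = [D[i::N] for i in range(N)]  # N disjoint partitions
        global_cands = {}                    # C-bar: union of local Fi
        for Di in parts:
            global_cands.update(mine_partition(Di, sigma))
        # One pass over the whole database counts every candidate.
        counts = dict.fromkeys(global_cands, 0)
        for T in D:
            for lab, g in global_cands.items():
                if supports(T, g):
                    counts[lab] += 1
        min_count = sigma * len(D)
        return [global_cands[lab] for lab, c in counts.items()
                if c >= min_count]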
One of the problems with a naive implementation of the above algorithm is that it can dramatically increase the number of subgraph isomorphism operations that are required to determine the frequency of the global candidate set. In order to address this problem, FSG incorporates three techniques: (i) a priori pruning of the number of candidate subgraphs that need to be considered; (ii) using bitmaps to limit the frequency counting of a particular candidate subgraph to only those partitions in which this frequency has not already been determined locally; and (iii) taking advantage of the lattice structure of C̄ to check each graph transaction only against the subgraphs that are descendants of patterns already supported by that transaction. The net effect of these optimizations is that, as shown in Section VI-A.1, FSG's overall run-time increases slowly as the number of partitions increases.

The a priori pruning of the candidate subgraphs is achieved as follows. For each partition Di, FSG finds the set of local frequent subgraphs and the set of local negative border

subgraphs (a local negative border subgraph is one that was generated as a local candidate subgraph but does not satisfy the minimum threshold for the partition), and stores them into a file Si along with their associated frequencies. Then, it organizes the union of the local frequent and local negative border subgraphs across the various partitions into a lattice structure (called the pattern lattice), by incrementally incorporating the information from each file Si. Then, for each node v of the pattern lattice it computes an upper bound f*(v) of its occurrence frequency by adding the corresponding upper bounds for each one of the N partitions, f*(v) = f1*(v) + ... + fN*(v). For each partition Di, fi*(v) is determined using the following equation:

  fi*(v) = fi(v),           if v ∈ Si,
  fi*(v) = min_u (fi*(u)),  otherwise,

where fi(v) is the actual frequency of the pattern corresponding to node v in Di, and u ranges over the connected subgraphs of v that are smaller than it by one edge (i.e., its parents in the lattice). Note that the various fi*(v) values can be computed in a bottom-up fashion by a single scan of Si, and used directly to update the overall f*(v) values. Now, given this set of frequency upper bounds, FSG proceeds to prune the nodes of the pattern lattice that are either infrequent or fail the downward closure property.
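The bottom-up bound computation is only a few lines of code. The sketch below is our illustration, not the paper's implementation: the lattice is given as a parent map whose nodes are iterated smallest-first, and the smallest patterns are assumed to appear in every Si, so the minimum over parents is always defined.

    def frequency_upper_bounds(parents, nodes_by_size, local_freq):
        """Compute f*(v) = sum_i fi*(v) over the pattern lattice.

        parents       -- dict: node -> its parents (one edge smaller)
        nodes_by_size -- nodes ordered so parents precede children
        local_freq    -- one dict per partition; partition i maps the
                         patterns recorded in S_i to their frequencies
        """
        total = dict.fromkeys(nodes_by_size, 0)
        for freq_i in local_freq:
            fi_star = {}
            for v in nodes_by_size:
                if v in freq_i:               # v was stored in S_i
                    fi_star[v] = freq_i[v]
                else:                         # bounded by its parents
                    fi_star[v] = min(fi_star[u] for u in parents[v])
                total[v] += fi_star[v]
        return total

    # Hypothetical two-partition example with a two-node lattice:
    print(frequency_upper_bounds({'a': [], 'ab': ['a']}, ['a', 'ab'],
                                 [{'a': 5, 'ab': 3}, {'a': 4}]))
    # {'a': 9, 'ab': 7}; in partition 2, 'ab' is bounded by f2*('a') = 4.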
V. CANONICAL LABELING

FSG relies on canonical labeling to efficiently check whether a particular pattern satisfies the downward closure property of the support condition and to eliminate duplicate candidate subgraphs. Developing algorithms that can efficiently compute the canonical label of the various subgraphs is critical to ensure that FSG can scale to very large graph datasets.

Recall from Section II that the canonical label of a graph is nothing more than a code that uniquely identifies the graph such that if two graphs are isomorphic to each other, they will be assigned the same code. A simple way of defining the canonical label of a graph is as the string obtained by concatenating the upper-triangular entries of the graph's adjacency matrix when this matrix has been symmetrically permuted so that this string becomes the lexicographically largest (or smallest) over the strings that can be obtained from all such permutations. This is illustrated in Fig. 2, which shows a graph G3 and the permutation of its adjacency matrix that leads to its canonical label "aaazyx". In this code, "aaa" was obtained by concatenating the vertex labels in the order that they appear in the adjacency matrix, and "zyx" was obtained by concatenating the columns of the upper-triangular portion of the matrix. Note that any other permutation of G3's adjacency matrix will lead to a code that is lexicographically smaller than (or equal to) "aaazyx". If a graph has |V| vertices, the complexity of determining its canonical label using this scheme is in O(|V|!), making it impractical even for moderate-size graphs.

[Fig. 2. Simple examples of codes and canonical adjacency matrices: the graph G3 in (a), a vertex ordering in (b) giving the code "aaa zxy", and the ordering in (c) giving the canonical code "aaa zyx". The symbol vi in the figure is a vertex ID, not a vertex label, and blank elements in the adjacency matrix mean there is no edge between the corresponding pair of vertices; this notation is used in the rest of the section.]
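For concreteness, the O(|V|!) scheme can be written out directly. The sketch below is ours; '0' is used as a placeholder for "no edge" so that plain string comparison works, and run on the graph G3 of Fig. 2 it reproduces the canonical code "aaazyx".

    from itertools import permutations

    def naive_canonical_label(vlabels, adj):
        """Lexicographically largest code over all vertex orderings."""
        n = len(vlabels)
        best = None
        for perm in permutations(range(n)):
            code = ''.join(vlabels[v] for v in perm)
            # Upper-triangular entries, concatenated column by column.
            for j in range(n):
                for i in range(j):
                    code += adj[perm[i]][perm[j]]
            if best is None or code > best:
                best = code
        return best

    # G3 of Fig. 2: vertices a, a, a; edges v0-v1: z, v0-v2: x, v1-v2: y.
    print(naive_canonical_label(['a', 'a', 'a'],
                                [['0', 'z', 'x'],
                                 ['z', '0', 'y'],
                                 ['x', 'y', '0']]))   # -> aaazyx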
together. Note that two graphs that are isomorphic will lead to
In practice, the complexity of finding the canonical label
the same partitioning of the vertices and they will be assigned
of a graph can be reduced by using various heuristics to
the same canonical label.
1 A local negative border subgraph is the one generated as a local candidate If m is the number of partitions created by using ver-
subgraph but does not satisfy the minimum threshold for the partition. tex invariants, containing p1 , p2 , . . . , pm vertices, respectively,
2 The symbol v in the figure is a vertex ID, not a vertex label, and
i then the number
blank elements in the adjacency matrix means there is no edge between the Qm of different permutations that we need to
corresponding pair of vertices. This notation will be used in the rest of the consider is i=1 (pi !), which can be substantially smaller than
section. the |V |! permutations required by the earlier approach. We

We have incorporated in FSG three types of vertex invariants that utilize information about the degrees and labels of the vertices, the labels and degrees of their adjacent vertices, and information about their adjacent partitions.

a) Vertex Degrees and Labels: This invariant partitions vertices into disjoint groups such that each partition contains vertices with the same label and the same degree. Fig. 3 illustrates the partitioning induced by this set of invariants for an example graph with four vertices. Based on their degrees and their labels, the vertices are partitioned into the three groups p0 = {v1}, p1 = {v0, v3} and p2 = {v2} shown in Fig. 3(c). Fig. 3 also shows the adjacency matrix corresponding to the partition-constrained permutation that leads to the canonical label of the graph. Using the partitioning based on vertex invariants, we try only 1! × 2! × 1! = 2 permutations, although the total number of permutations for four vertices is 4! = 24.

[Fig. 3. A sample graph of size three and its adjacency matrices, showing the partitions p0, p1, p2 and the codes produced by different partition-constrained vertex orderings.]
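This first invariant is straightforward to implement. A minimal sketch (ours), again with '0' standing for "no edge"; the small example graph at the end is our own, not the one in Fig. 3.

    def invariant_partitions(vlabels, adj):
        """Partition vertices by (label, degree); Section V-A.a."""
        n = len(vlabels)
        def degree(v):
            return sum(1 for w in range(n) if adj[v][w] != '0')
        groups = {}
        for v in range(n):
            groups.setdefault((vlabels[v], degree(v)), []).append(v)
        # Ordering the partitions by the invariant value keeps the
        # partition order itself isomorphism-invariant.
        return [groups[key] for key in sorted(groups)]

    # A path v0-v1-v2-v3 with labels a, a, a, b and edge labels x, x, y:
    A = [['0', 'x', '0', '0'],
         ['x', '0', 'x', '0'],
         ['0', 'x', '0', 'y'],
         ['0', '0', 'y', '0']]
    print(invariant_partitions(['a', 'a', 'a', 'b'], A))
    # [[0], [1, 2], [3]] -- 1! * 2! * 1! = 2 orderings instead of 4! = 24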
b) Neighbor Lists: Invariants that lead to finer-grained partitioning can be created by incorporating information about the labels of the edges incident on each vertex, the degrees of the adjacent vertices, and their labels. In particular, we describe an adjacent vertex v by a tuple (l(e), d(v), l(v)), where l(e) is the label of the incident edge e, d(v) is the degree of the adjacent vertex v, and l(v) is its vertex label. Now, for each vertex u, we construct its neighbor list nl(u) that contains the tuples for each one of its adjacent vertices. Using these neighbor lists, we then partition the vertices into disjoint sets such that two vertices u and v will be in the same partition if and only if nl(u) = nl(v). Note that this partitioning is performed within the partitions already computed by the previous set of invariants.

Fig. 4 illustrates the partitioning produced by also incorporating the neighbor list invariant on the graph of Fig. 4(a). Specifically, Fig. 4(b) shows the partitioning produced by the vertex degrees and labels, and Fig. 4(c) shows the partitioning that is produced by also incorporating neighbor lists. The neighbor lists themselves are shown in Fig. 4(d). For this example we were able to reduce the number of permutations that need to be considered from 4! × 2! to 2!.

[Fig. 4. Use of neighbor lists: (a) the example graph; (b) the partitioning by vertex degrees and labels; (c) the refined partitioning after incorporating neighbor lists; (d) the neighbor lists, written as tuples (l(e), d(v), l(v)).]
be considered from 4! × 2! to 2!. v1 v0 v2 v5 v3 v4 v6 v7

c) Iterative Partitioning: Iterative partitioning general- a a a a a a a a


v1 a x x x (p1 , a), (p1 , a), (p2 , a)
izes the idea of the neighbor lists, by incorporating the v0 a x x x (p0 , a), (p3 , a), (p3 , a)
partition information [15]. This time, instead of a tuple v2 a x x x (p0 , a), (p3 , a), (p3 , a)

(l(e), d(v), l(v)), we use a pair (p(v), l(e)) for representing v5 a x (p0 , a)

the neighbor lists where p(v) is the identifier of a partition to v3 a


v4 a
x
x
(p1 , a)
(p1 , a)
which a neighbor vertex v belongs and l(e) is the label of the v6 a x (p1 , a)

incident edge to the neighbor vertex v. v7 a x (p1 , a)


p0 p1 p2 p3
The effect of iterative partitioning is illustrated in Fig. 5. code = aaaaaaaa xx0x0000x000x000x00000x00000
In this example graph, all edges have the same label x and (d)

all vertices have the same label a. Initially the vertices are
partitioned into two groups only by their degrees, and in each Fig. 5. An example of iterative partitioning

partition they are sorted by their neighbor lists (Fig. 5(b)). The ordering of those partitions is based on the degrees and the labels of each vertex and its neighbors. Then, we split the first partition p0 into two, because the neighbor list of v1 is different from those of v0 and v2. By renumbering all the partitions, updating the neighbor lists, and sorting the vertices based on their neighbor lists, we obtain the matrix shown in Fig. 5(c). Now, because the partition p2 becomes non-uniform in terms of the neighbor lists, we again divide p2 to factor out v5, renumber the partitions, update and sort the neighbor lists, and sort the vertices to obtain the matrix in Fig. 5(d).

[Fig. 5. An example of iterative partitioning: (a) a graph whose vertices all have label a and whose edges all have label x; (b) the initial partitioning by degree, with vertices sorted by their neighbor lists; (c) the partitioning after the first split and renumbering; (d) the final partitioning after factoring out v5.]
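The fixed-point structure of iterative partitioning is easy to capture. A rough sketch (ours): partitions are repeatedly split by the (partition id, edge label) pairs of their members until nothing splits further. Sorting groups by their keys stands in for the degree- and neighbor-list-based ordering the paper applies.

    def iterative_partitioning(adj, partitions):
        """Refine with pairs (p(v), l(e)) until a fixed point; V-A.c."""
        n = len(adj)
        while True:
            pid = {v: i for i, part in enumerate(partitions)
                   for v in part}
            def nl(u):
                return tuple(sorted((pid[w], adj[u][w])
                                    for w in range(n)
                                    if adj[u][w] != '0'))
            refined = []
            for part in partitions:
                groups = {}
                for v in part:
                    groups.setdefault(nl(v), []).append(v)
                refined.extend(groups[key] for key in sorted(groups))
            if len(refined) == len(partitions):
                return partitions      # no partition split further
            partitions = refined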
B. Degree-based Partition Ordering

In addition to using the vertex invariants to compute a fine-grained partitioning of the vertices, the overall run-time of canonical labeling can be further reduced by properly ordering the various partitions. This is because a proper ordering of the partitions may allow us to quickly determine whether a set of permutations can potentially lead to a code that is smaller than the current best code or not, thus allowing us to prune large parts of the search space.

Recall from Section V-A that we obtain the code of a graph by concatenating its adjacency matrix in a column-wise fashion. As a result, when we permute the rows and the columns of a particular partition, the code corresponding to the columns of the preceding partitions is not affected. Now, while we explore a particular set of within-partition permutations, if we obtain a prefix of the final code that is larger than the corresponding prefix of the currently best code, then we know that, regardless of the permutations of the subsequent partitions, this code will never be smaller than the currently best code, and the exploration of this set of permutations can be terminated. The critical property that allows us to prune such unpromising permutations is our ability to obtain a bad code prefix early. Ideally, we would like to order the partitions in such a way that the permutations of the vertices in the initial partitions lead to dramatically different code prefixes, which in turn will allow us to prune large parts of the search space. In general, the likelihood of this happening depends on the density (i.e., the number of edges) of each partition, and for this reason we sort the partitions in decreasing order of the degree of their vertices.
of the permutations of the subsequent partitions, this code
will never be smaller than the currently best code, and the VI. E XPERIMENTAL E VALUATION
exploration of this set of permutations can be terminated. The We experimentally evaluated the performance of FSG using
critical property that allows us to prune such unpromising actual graphs derived from the molecular structure of chemical
permutations is our ability to obtain a bad code prefix. Ideally, compounds, and graphs generated synthetically. The first type
we will like to order the partitions in a way such that the of datasets allows us to evaluate the effectiveness of FSG for
permutations of the vertices in the initial partitions lead to finding rather large patterns and its scalability to large real
dramatically different code prefixes, which it turn will allow datasets, whereas the second one, a set of synthetic datasets,
us to prune parts of the search space. In general, the likelihood allows us to evaluate the performance of FSG on datasets
of this happening depends on the density (i.e., the number whose characteristics (e.g., number of graph transactions,
of edges) of each partition, and for this reason we sort the average graph size, average number of vertex and edge labels,
partitions in decreasing order of the degree of their vertices. and average length of patterns) differs dramatically; thus,
providing insights on how well FSG scales with respect to
C. Vertex Stabilization these characteristics. All experiments were done on dual AMD
Athlon MP 1800+ (1.53 GHz) machines with 2 Gbytes main
Vertex stabilization is effective for finding isomorphism of memory, running the Linux operating system. All the times
graphs with regular or symmetric structures [31]. The key idea reported are in seconds.
is to break the topological symmetry of a graph by forcing
a particular vertex into its own partition, when the iterative
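To make the savings concrete (our arithmetic, not the paper's): for an unlabeled cycle with k = 10, fixing one vertex yields ⌊(10 − 1)/2⌋ + 1 = 5 distance-based partitions, so each choice of the fixed vertex leaves at most 5! = 120 partition-constrained orderings to examine. Over all k = 10 choices of the fixed vertex this gives 1,200 orderings in total, compared with 10! = 3,628,800 for the single undecomposed partition.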
Once a partition becomes small enough, straightforward permutation can be simpler and faster than vertex stabilization for obtaining a canonical label. Thus, our canonical labeling algorithm applies vertex stabilization only if the size of a vertex partition is greater than five. For further details on vertex stabilization the reader should refer to a textbook on permutation groups such as [12].

VI. EXPERIMENTAL EVALUATION

We experimentally evaluated the performance of FSG using actual graphs derived from the molecular structure of chemical compounds, and graphs generated synthetically. The first type of dataset allows us to evaluate the effectiveness of FSG for finding rather large patterns and its scalability to large real datasets, whereas the second one, a set of synthetic datasets, allows us to evaluate the performance of FSG on datasets whose characteristics (e.g., number of graph transactions, average graph size, average number of vertex and edge labels, and average length of patterns) differ dramatically, thus providing insights into how well FSG scales with respect to these characteristics. All experiments were done on dual AMD Athlon MP 1800+ (1.53 GHz) machines with 2 Gbytes of main memory, running the Linux operating system. All the times reported are in seconds.

A. Chemical Compound Datasets

We derived graph datasets from two publicly available datasets of chemical compounds. The first dataset (ftp://ftp.comlab.ox.ac.uk/pub/Packages/ILP/Datasets/carcinogenesis/progol/carcinogenesis.tar.Z) contains 340 chemical compounds and was originally provided for the Predictive Toxicology Evaluation (PTE) Challenge [43], and the second dataset (http://dtp.nci.nih.gov/docs/3d_database/structural_information/structural_data.html) contains 223,644 chemical compounds and

is available from the Developmental Therapeutics Program (DTP) at the National Cancer Institute. From the descriptions of the chemical compounds in those two datasets, we created a transaction for each compound, a vertex for each atom, and an edge for each bond. Each vertex has a vertex label assigned for its atom type and each edge has an edge label assigned for its bond type. In the PTE dataset there are 66 atom types and 4 bond types, and in the DTP dataset there are 104 atom types and 3 bond types. Each graph transaction obtained from the PTE and the DTP datasets has 27 and 22 edges on average, respectively.

d) Results: Table II shows the results obtained by FSG on four datasets derived from the PTE and DTP datasets. The first dataset was obtained by using all the compounds of the PTE dataset, whereas the remaining three datasets were obtained by randomly selecting 50,000, 100,000, and 200,000 compounds from the DTP dataset. There are three types of results shown in the table: the run-time in seconds (t), the size of the largest discovered frequent subgraph (k*), and the total number of frequent patterns (|F|) that were generated. The minimum support threshold ranged from 10% down to 1.0%. Dashes in the table correspond to experiments that were aborted due to high computational requirements. All the results in this table were obtained using a single partition of the dataset.

FSG is able to effectively operate on datasets containing 200,000 transactions and discover all frequent connected subgraphs which occur in 1% of the transactions in approximately one hour. With respect to the number of transactions, the run-time scales almost linearly. For instance, with the 2% support, the run-time for 50,000 transactions is 263 seconds, whereas the corresponding run-time for 200,000 transactions is 1,343 seconds, an increase by a factor of 5.1. As the support decreases, the run-time increases, reflecting the increase in the number of frequent subgraphs found in the input dataset. For example, with 200,000 transactions, the run-time for the 1% support is 4.2 times longer than that for the 3% support, and the number of frequent subgraphs found for the 1% support was 8.2 times more than that for the 3% support.

Comparing the performance on the PTE and DTP-derived datasets, we notice that the run-time for the PTE dataset dramatically increases as the minimum support decreases, and eventually overtakes the run-time for most of the DTP-derived datasets. This behavior is due to the maximum size and the total number of frequent subgraphs that are discovered in this dataset (both of which are shown in Table II). For lower support values the PTE dataset contains both more and longer frequent subgraphs than the DTP-derived datasets do. This is due to the inherent characteristics of the PTE dataset: it contains larger and more similar compounds. For example, the PTE dataset contains 26 compounds with over 50 edges, and the largest compound has 214 edges. Despite that, FSG requires 459 seconds for a support value of 2.0%, and is able to discover patterns containing over 22 edges.

1) Reducing Memory Requirements of TID lists: To evaluate the effectiveness of the database-partitioning-based approach (described in Section IV-B.1) for reducing the amount of memory required by TID lists (the TID list memory), we performed a set of experiments in which we used two datasets derived from the DTP dataset containing 100,000 and 200,000 chemical compounds, respectively. For each dataset we used FSG to find all frequent patterns that occur in at least 1% of the transactions, partitioning the dataset into 2, 3, 4, 5, 10, 20, 30, 40, and 50 partitions. These results are shown in Table III. For each experiment, the table shows the total run-time, the maximum amount of TID list memory, and the maximum amount of memory required to store the pattern lattice (the pattern lattice memory).

From these results we can see that the database-partitioning-based approach is quite effective in reducing the TID list memory, which decreases almost linearly with the number of partitions. Moreover, the various optimizations described in Section IV-B.1 are quite effective in limiting the degradation in the runtime of the resulting algorithm. For example, for the 200,000-compound dataset and 50 partitions, the runtime increases only by a factor of 3.4 over that for a single partition. Also, the pattern lattice memory increases slowly as the number of partitions increases, and unless the number of partitions is quite large, it is still dominated by the memory required to store the TID lists. Note that these results suggest that there is an optimal number of partitions that leads to the least amount of memory, as the pattern lattice memory will eventually exceed the TID list memory as the number of partitions increases.

B. Synthetic Datasets

To evaluate the performance of FSG on datasets with different characteristics we developed a synthetic graph generator which can control the number of transactions |D|, the average number of edges in each transaction |T|, the average number of edges |I| of the potentially frequent subgraphs, the number of potentially frequent subgraphs |S|, the number of distinct edge labels |LE|, and the number of distinct vertex labels |LV| of the generated dataset. The design of our generator was inspired by the synthetic transaction generator developed by the Quest group at IBM and used extensively to evaluate algorithms that find frequent itemsets [1], [2], [20].

The actual generator works as follows. First, it generates a set of |S| potentially frequent connected subgraphs, called seed patterns, whose size is determined by a Poisson distribution with mean |I|. For each seed pattern, the topology and the labels of the edges and the vertices are chosen randomly. Each seed pattern has a weight assigned, which becomes the probability that the seed pattern is selected for inclusion in a graph transaction. The weights are calculated by dividing a random variable that obeys an exponential distribution with unit mean by the number of edges in the seed pattern, and the sum of the weights of all the seed patterns is normalized to one. We call this set S of seed patterns the seed pool. The reason that we divide the exponential random variable by the number of edges is to reduce the chance that large weights are assigned to large seed patterns. Otherwise, once a large weight was assigned to a large seed pattern, the resulting dataset would contain an exponentially large number of frequent patterns.

TABLE II
RUN-TIME IN SECONDS FOR THE PTE AND DTP CHEMICAL COMPOUND DATASETS

  Support     PTE |D| = 340        DTP |D| = 50,000     DTP |D| = 100,000    DTP |D| = 200,000
  thr. [%]    t[sec]  k*   |F|     t[sec]  k*   |F|     t[sec]  k*   |F|     t[sec]  k*   |F|
  10.0             3  11     844       74   9     351      156   9     360      337   9     373
   9.0             3  11     977       80   9     400      169  10     420      366  10     442
   8.0             4  11    1323       87  11     473      184  11     490      401  11     512
   7.0             4  12    1770       94  11     562      200  11     591      437  11     635
   6.0             6  13    2326      109  12     782      230  12     813      503  12     860
   5.0             9  14    3608      122  12    1017      259  12    1068      570  12    1140
   4.0            16  15    5935      146  13    1523      316  13    1676      705  13    1855
   3.0            60  22   22758      186  14    2705      398  14    2810      894  14    3004
   2.0           459  25  136927      263  14    5295      571  14    5633     1343  15    6240
   1.0             —   —       —      658  16   19373     1458  16   20939     3776  17   24683

Note. t: run-time in seconds; k*: size of the largest frequent pattern; |F|: number of frequent patterns; |D|: number of transactions. Dashes indicate that the computation was aborted because its run-time was too long.
TABLE III
RUN-TIME AND TID LIST MEMORY WITH PARTITIONING

Run-time [sec]
                               Number of Partitions
  |D|          1     2     3     4     5    10    20    30     40     50
  100,000   1432  1878  2032  2189  2356  2924  3899  4842   6122   7459
  200,000   3698  4494  5095  5064  5538  6418  7856  9516  11165  12670

Maximum amount of memory for storing TID lists [Mbytes]
                               Number of Partitions
  |D|          1     2     3     4     5    10    20    30     40     50
  100,000   53.8  27.0  18.1  13.6  11.0   5.6   2.9   2.0    1.5    1.2
  200,000    118  59.1  39.5  29.6  23.9  12.1   6.2   4.2    3.2    2.6

Maximum amount of memory for storing the pattern lattice [Mbytes]
                               Number of Partitions
  |D|          1     2     3     4     5    10    20    30     40     50
  100,000      —   1.4   1.5   1.5   1.6   1.9   2.5   3.2    3.8    4.3
  200,000      —   1.7   1.8   1.8   1.8   2.0   2.4   2.8    3.2    3.6

Note. The two datasets were generated from the DTP dataset by sampling 100,000 and 200,000 chemical compounds. The minimum support is σ = 1.0%. The pattern lattice memory is left blank for a single partition because the lattice is not built. |D|: number of transactions.

Next, the generator creates |D| transactions. First, the generator determines the target size of each transaction, which is a Poisson random variable whose mean is equal to |T|. Then, the generator selects a seed pattern from the seed pool by rolling an |S|-sided die, where each face of the die corresponds to the probability assigned to a seed pattern in the seed pool. If the size of the selected seed pattern fits within the target transaction size, the generator adds it to the transaction. If the size of the current intermediate transaction has not reached its target size, we keep selecting and adding seed patterns to it. When adding the selected seed pattern would make the intermediate transaction size greater than the target transaction size, we add it anyway in half of the cases, and discard it and move on to the next transaction in the other half. The generator adds a seed pattern to a transaction by connecting a randomly selected pair of vertices, one from the transaction and the other from the seed pattern.
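The per-transaction loop reads directly as code. A rough sketch under stated assumptions: seed graphs expose an edge list, an attach helper (hypothetical) splices a seed into the growing transaction by connecting a random vertex pair, and the Poisson sampler uses Knuth's method since the Python standard library does not provide one.

    import math
    import random

    def poisson(mean):
        # Knuth's method; adequate for the small means used here.
        L, k, p = math.exp(-mean), 0, 1.0
        while p > L:
            k += 1
            p *= random.random()
        return k - 1

    def seed_weights(seeds):
        # Exponential(1) variates divided by each seed's edge count,
        # normalized so the weights sum to one.
        raw = [random.expovariate(1.0) / len(s.edges) for s in seeds]
        total = sum(raw)
        return [w / total for w in raw]

    def make_transaction(seeds, weights, avg_T, attach, empty_graph):
        target = poisson(avg_T)
        T, size = empty_graph(), 0
        while size < target:
            s = random.choices(seeds, weights=weights)[0]
            if size + len(s.edges) > target and random.random() < 0.5:
                break          # discard the overflowing seed, move on
            T = attach(T, s)   # connect a random vertex pair
            size += len(s.edges)
        return T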


[Fig. 6. Median of 10 run-times in seconds for synthetic datasets: three log-scale panels for |I| = 5, 7, and 9, plotting the median run-time against the number of distinct vertex labels |LV| (0 to 20) for |T| = 5, 10, 20, 30, and 40. |T| is the average size of transactions, |I| is the average size of seed patterns, and |LV| is the number of distinct vertex labels.]

a) Results: Using this generator, we obtained a number of different datasets by varying the number of vertex labels |LV|, the average size of the potentially frequent subgraphs |I|, and the average size of each transaction |T|, while keeping fixed the total number of transactions |D| = 10,000, the number of seed patterns |S| = 200, and the number of edge labels |LE| = 1. Despite our best efforts in designing the generator, we observed that as both |T| and |I| increase, different datasets created under the same parameter combination lead to different run-times; to reduce this variability, Fig. 6 reports, for each parameter combination, the median run-time over ten generated datasets. [...]

size |T| increases. These trends are consistent with the inherent characteristics of the datasets, for the following reasons: (i) As the number of vertex labels increases, the space of possible automorphisms and subgraph isomorphisms decreases, leading to faster candidate generation and frequency counting. (ii) As the size of the average seed pattern increases, because of the combinatorial nature of the problem, the total number of frequent patterns to be found in the dataset increases exponentially, increasing the overall run-time. (iii) As the size of the average transaction |T| increases, frequency counting by subgraph isomorphism becomes more expensive, regardless of the size of the candidate subgraphs. Moreover, the total number of frequent patterns to be found in the dataset also increases, because more seed patterns can be put into each transaction. Both of these factors contribute to increasing the overall run-time.

VII. RELATED WORK

Over the years, a number of different algorithms have been developed to find frequent patterns corresponding to frequent subgraphs in graph datasets. Developing such algorithms is particularly challenging and computationally intensive, as graph and subgraph isomorphisms play a key role throughout the computations. For this reason, a considerable amount of work has focused on approximate algorithms [23], [28], [35], [46] that use various heuristics to prune the search space. However, a number of exact algorithms have been developed [5], [10], [17], [24], [25], [45] that are guaranteed to find all subgraphs satisfying certain minimum support or other constraints.

Probably the most well-known heuristic-based approach is the SUBDUE system, originally developed in 1994 and improved over the years [8], [23]. SUBDUE finds patterns which can effectively compress the original input data based on the minimum description length principle, by substituting those patterns with a single vertex. To narrow the search space and improve its computational efficiency, SUBDUE uses a heuristic beam search approach, which quite often results in failing to find subgraphs that are frequent. Nevertheless, despite its heuristic nature, its computational performance is considerably worse than that of some of the recent frequent subgraph discovery algorithms. Experiments reported in [17] for the PTE dataset [43] show that SUBDUE spends about 80 seconds on a Pentium III 900 MHz computer to find the five most frequent substructures. In contrast, the FSG algorithm developed by our group [29] takes only 20 seconds on a Pentium III 450 MHz machine to find all 3,608 frequent subgraphs that occur in at least 5% of the compounds.

A number of approaches for finding commonly occurring subgraphs have been developed in the context of inductive logic programming (ILP) systems [19], [33], [34], [38], [44], as graphs can be easily expressed using first-order logic. Each vertex and edge is represented as a predicate, and a subgraph corresponds to a conjunction of such predicates. The goal of ILP-based approaches is to induce a set of rules capable of correctly classifying a set of positive and negative examples. In the case of graphs modeled by ILP systems, these rules usually correspond to subgraphs. Most ILP-based approaches are greedy in nature and use various heuristics to prune the space of possible hypotheses. Thus, they tend to find subgraphs that have high support and can act as good discriminators between classes. However, they are not guaranteed to discover all frequent subgraphs. A notable exception is the ILP system WARMR, developed by Dehaspe and De Raedt [9], which is capable of finding all frequently occurring subgraphs. WARMR is not specialized for handling graphs, however; it does not employ any graph-specific optimizations and, as such, has high computational requirements.

In the last three years, three different algorithms have been developed that are capable of finding all frequently occurring subgraphs with reasonable computational efficiency. These are AGM by Inokuchi et al. [24], [25], the chemical substructure discovery algorithm developed by Borgelt and Berthold [5], and the gSpan algorithm developed by Yan and Han [45]. Among them, the early version of AGM [24] was developed prior to FSG, whereas the other algorithms were developed after the initial development of FSG [29].

AGM, initially developed to find frequent induced subgraphs [24] and later extended to find arbitrary frequent subgraphs [25], discovers the frequent subgraphs using a breadth-first approach and grows the frequent subgraphs one vertex at a time. To distinguish one subgraph from another, it uses a canonical labeling scheme based on the adjacency matrix representation. Experiments reported in [24] show that AGM achieves good performance for synthetic dense datasets, but it required 40 minutes to 8 days to find all frequent induced subgraphs in the PTE dataset as the minimum support threshold varied from 20% to 10%. Their modified algorithm [25] uses previously found embeddings of a frequent pattern in a transaction to save on subgraph isomorphism computations, and it improves the performance significantly at the expense of increased memory requirements.

The chemical substructure mining algorithm developed by Borgelt and Berthold [5] finds frequent substructures (connected subgraphs) using a depth-first approach similar to that used by dEclat [49] in the context of frequent itemset discovery. In this algorithm, once a frequent subgraph has been identified, it then proceeds to explore the input dataset for frequent subgraphs, all of which contain that subgraph. To reduce the number of subgraph isomorphism operations, it keeps the embeddings of previously discovered subgraphs and tries to extend the embeddings by one edge, in a manner similar to the modified version of AGM [25]. In addition, since all the embeddings of the frequent subgraph are known, the algorithm projects the original dataset onto a smaller one by removing edges and vertices that are not used by any embeddings. Nevertheless, despite these optimizations, the reported speed of the algorithm is slower than that achieved by FSG. This is primarily due to two reasons. First, their candidate subgraph generation scheme does not ensure that the same subgraph is generated only once; as a result, the algorithm ends up generating and determining the frequency of the same subgraph multiple times. Second, in chemical datasets, the same subgraph tends to have many embeddings (in the range of 20–200), and as a result the cost of keeping track of them outweighs any benefits.
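The following sketch gives one plausible rendering of this embedding-based projection (the edge-set representation and all names are our own simplifications, not the actual code of [5]): given the embeddings of the current frequent subgraph in a transaction, it keeps only the edges that touch a vertex used by some embedding, so later extension and isomorphism tests scan a smaller graph.

    def project_transaction(graph_edges, embeddings):
        # graph_edges: set of (u, v, label) tuples for one transaction.
        # embeddings: list of embeddings of the current frequent
        # subgraph; each embedding is the set of transaction edges
        # it occupies.
        used_edges = set()
        for emb in embeddings:
            used_edges |= emb
        used_vertices = ({u for (u, v, _) in used_edges}
                         | {v for (u, v, _) in used_edges})
        # Keep an edge only if one of its endpoints is touched by an
        # embedding; only such edges can extend the current pattern
        # by a single edge.
        return {(u, v, lbl) for (u, v, lbl) in graph_edges
                if u in used_vertices or v in used_vertices}

In a depth-first miner this projection can be reapplied as the pattern grows, keeping the working graphs small.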
gSpan [45] also finds the frequently occurring subgraphs using a depth-first approach. Unlike the algorithm by Borgelt and Berthold, every time a candidate subgraph is generated, its canonical label is computed. If the computed label is the minimum one, the candidate is saved for further exploration of the depth-first search. If not, the candidate is discarded, because there must be another path to the same candidate. By doing so, gSpan avoids redundant candidate generation. To ensure that these subgraph comparisons are done efficiently, gSpan uses a canonical labeling scheme based on depth-first traversals. In addition, gSpan does not keep the information about all previous embeddings of frequent subgraphs, which saves memory. Instead, all embeddings are identified on the fly and used to project the dataset in a fashion similar to that used by [5]. According to the reported performance in [45], gSpan and FSG are comparable on the PTE dataset, whereas gSpan performs better than FSG on synthetic datasets.
than FSG on synthetic datasets. it contains only those size-(k − 1) subgraphs of H(C) that
In addition to the work on frequent subgraph discovery, regardless of the order in which the two edges are removed,
researchers has recently focused on the related but different the intermediate size-k subgraph remains connected. Let H ∗ ∈
problem of mining trees to discover frequently occurring sub- H+ (C) denote a (k − 1)-subgraph whose canonical label is
trees. In particular, two similar algorithms have been recently the smallest among all the (k − 1)-subgraphs in H+ (C). We
developed by Asai et al. [4] and Zaki [48] that operate on will refer to H ∗ as the pivotal core of C. Let a∗ and b∗ be the
rooted ordered trees and find all frequent subtrees. A rooted edges deleted from C to obtain H ∗ , and we refer to a∗ and
∗ ∗
ordered tree is a tree in which one of its vertices is designated b∗ as the pivotal edges. Let F −a and F −b denote C − a∗
∗ ∗
as its root and the order of branches from every vertex is and C − b∗ , respectively. We will refer to F −a and F −b
specified. Because rooted ordered subtrees are in a special as the primary frequent size-k subgraphs of C. Note that by
∗ ∗
class of graphs, the inherent computational complexity of the construction, we have that F −a ∈ F(C), F −b ∈ F(C), and

problem is dramatically reduced as both graph and subgraph that H ∗ is a connected size-(k − 1) subgraph of both F −a

isomorphism problems for trees can be solved in polynomial and F −b .
time. Cong et al. [7] also proposed an algorithm to find Lemma 1: Given a connected size-(k + 1) valid candidate
frequent subtrees from a set of tree transactions, which allows subgraph C, let H ∗ , a∗ , b∗ be the pivotal core and pivotal
∗ ∗
wildcards on edge- and vertex-labels. Their algorithm first edges of C, respectively, and let F −a and F −b be the
finds a set of frequent paths which may contain wildcards, primary size-k subgraphs of C. Then, in each of the two
allowing inexact match on both the structure as well as the primary size-k subgraphs of C, there exists at most one
edge and vertex labels. connected size-(k − 1) subgraph whose canonical label is
smaller than that of the pivotal core H ∗ .

VIII. CONCLUSIONS

In this paper we presented an algorithm, FSG, for finding frequently occurring subgraphs in large graph datasets, which can be used to discover recurrent patterns in scientific, spatial, and relational datasets. Such patterns can play an important role in understanding the nature of these datasets and can be used as input to other data-mining tasks [11]. Our detailed experimental evaluation shows that FSG can scale reasonably well to very large graph datasets, provided that the graphs contain a sufficiently large number of different edge and vertex labels. Key elements of FSG's computational scalability are the highly efficient canonical labeling algorithm and candidate generation scheme, and its use of a TID-list-based approach for frequency counting. Combined, these three features allow FSG to uniquely identify the various generated subgraphs, to generate candidate patterns with a limited degree of redundancy, and to quickly prune most of the infrequent subgraphs without having to resort to computationally expensive graph and subgraph isomorphism computations. Furthermore, we presented and evaluated a database-partitioning-based approach that substantially reduces FSG's memory requirement for storing TID lists with only a moderate increase in run-time.
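As an illustration of the TID-list-based frequency counting mentioned above, the sketch below (hypothetical names; the subgraph isomorphism test is left abstract because it dominates the cost) exploits the fact that any transaction supporting a candidate must also support both of the size-k subgraphs that were joined to create it, so only the intersection of their TID lists needs to be scanned.

    def count_support(candidate, tids_i, tids_j, database, is_subgraph_of):
        # tids_i, tids_j: TID lists of the candidate's two parents.
        # database: mapping from transaction id to transaction graph.
        # is_subgraph_of(pattern, graph): a subgraph isomorphism
        # test supplied by the caller.
        to_check = set(tids_i) & set(tids_j)
        tids = sorted(t for t in to_check
                      if is_subgraph_of(candidate, database[t]))
        return len(tids), tids

The candidate's own TID list is returned along with its support, so the same pruning can be applied again one level higher.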

APPENDIX
Correctness of FSG's Candidate Generation

Let C denote a connected size-(k+1) subgraph which is to be generated as a valid candidate. A size-(k+1) subgraph is a valid candidate if each of its connected size-k subgraphs is frequent. Let F(C) = {F_i} and H(C) = {H_i} denote the sets of all connected size-k and size-(k−1) subgraphs of C, respectively. For each F_i ∈ F(C), let c_i be the edge of C such that F_i = C − c_i. Likewise, for each H_i ∈ H(C), let a_i and b_i be the edges of C such that H_i = C − a_i − b_i. Let H^+(C) = {H_i^+} be the set of connected size-(k−1) subgraphs of C such that for each H_i^+ there exists a pair of edges a_i^+ and b_i^+ that belong to C so that H_i^+ = C − a_i^+ − b_i^+ and both C − a_i^+ and C − b_i^+ are connected. Note that H^+(C) ⊆ H(C), and it contains only those size-(k−1) subgraphs of H(C) for which, regardless of the order in which the two edges are removed, the intermediate size-k subgraph remains connected. Let H* ∈ H^+(C) denote a (k−1)-subgraph whose canonical label is the smallest among all the (k−1)-subgraphs in H^+(C). We will refer to H* as the pivotal core of C. Let a* and b* be the edges deleted from C to obtain H*, and we refer to a* and b* as the pivotal edges. Let F^{−a*} and F^{−b*} denote C − a* and C − b*, respectively. We will refer to F^{−a*} and F^{−b*} as the primary frequent size-k subgraphs of C. Note that by construction we have that F^{−a*} ∈ F(C), F^{−b*} ∈ F(C), and that H* is a connected size-(k−1) subgraph of both F^{−a*} and F^{−b*}.

Lemma 1: Given a connected size-(k+1) valid candidate subgraph C, let H*, a*, and b* be the pivotal core and pivotal edges of C, respectively, and let F^{−a*} and F^{−b*} be the primary size-k subgraphs of C. Then, in each of the two primary size-k subgraphs of C, there exists at most one connected size-(k−1) subgraph whose canonical label is smaller than that of the pivotal core H*.

Proof: We prove the lemma only for F^{−a*}; the same proof holds for F^{−b*}.

Let H′ be a connected size-(k−1) subgraph of F^{−a*} such that cl(H′) < cl(H*). Note that since F^{−a*} ∈ F(C), we have that H′ ∈ H(C). Let a′ and b′ be the two edges of C that were deleted to obtain H′, that is, H′ = C − a′ − b′. From the definition of H*, we have that H′ ∉ H^+(C), otherwise we would have that H* = H′. Without loss of generality, we assume that C − a′ is connected and that C − b′ is disconnected. Now, since F^{−a*} is a connected size-k subgraph of C that contains H′, we know that F^{−a*} will be either C − a′ or C − b′. However, because C − b′ is disconnected, we have that F^{−a*} = C − a′, and because F^{−a*} was initially obtained by deleting a*, we have that a* = a′. Thus, H′ can be written as

    H′ = C − a* − b′,    (1)

where a* is independent of H′. Moreover, because C − b′ is disconnected, b′ must be a cut-edge that separates a* from the rest of the graph.

Given the above, we can now show by contradiction that there exists only one connected size-(k−1) subgraph of F^{−a*}
whose canonical label is smaller than that of H*. Assume that there exist two distinct connected size-(k−1) subgraphs, H′_i and H′_j, such that cl(H′_i) < cl(H*) and cl(H′_j) < cl(H*). Let H′_i = C − a′_i − b′_i and H′_j = C − a′_j − b′_j, and without loss of generality, assume that C − a′_i and C − a′_j are connected, and C − b′_i and C − b′_j are disconnected. Then, from Equation (1) we have that

    H′_i = C − a′_i − b′_i = C − a* − b′_i,
    H′_j = C − a′_j − b′_j = C − a* − b′_j.

In order for H′_i ≠ H′_j, we must have that b′_i ≠ b′_j. However, because both b′_i and b′_j are cut-edges separating a* from the rest of the graph, and because a* can have only one such cut-edge (otherwise it cannot be separated by a single-edge deletion), we have that b′_i = b′_j. This is a contradiction, and thus H′_i = H′_j.

Using the above lemma, we can now prove the main theorem, which shows that FSG's candidate generation approach, described in Section IV-A, is correct.

Theorem 1: Given a connected size-(k+1) valid candidate subgraph C, there exists a pair of connected size-k frequent subgraphs F_i and F_j, with P(F_i) ∩ P(F_j) ≠ ∅, that can be joined with respect to their common primary subgraph to obtain C.

Proof: Let H* = C − a* − b* be the pivotal core of C, and let F^{−a*} = C − a* and F^{−b*} = C − b*. Since from Lemma 1 there exists at most one such common connected size-(k−1) subgraph shared by F^{−a*} and F^{−b*} that has a smaller canonical label than H*, it follows that H* ∈ P(F^{−a*}) and H* ∈ P(F^{−b*}); thus, H* ∈ P(F^{−a*}) ∩ P(F^{−b*}). Consequently, F_i = F^{−a*} and F_j = F^{−b*} are the desired size-k frequent subgraphs of C, and H* is their common primary subgraph that leads to C.
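To see the theorem on a toy instance, the following self-contained sketch represents graphs as sets of labeled edges over shared vertex ids, which sidesteps the isomorphism bookkeeping of FSG's actual join, and reconstructs a size-3 candidate from its two primary size-2 subgraphs.

    def join(f_i, f_j):
        # Join two size-k edge sets that overlap in a size-(k-1)
        # core; their union is the size-(k+1) candidate. Returns
        # None when the two subgraphs share no such core.
        core = f_i & f_j
        if len(core) != len(f_i) - 1:
            return None
        return f_i | f_j

    # C is a labeled triangle; deleting the pivotal edges a* and b*
    # yields the two primary size-2 subgraphs, which join back to C.
    C  = frozenset({(0, 1, 'x'), (1, 2, 'y'), (0, 2, 'z')})
    Fa = C - {(0, 1, 'x')}   # F^{-a*} = C - a*
    Fb = C - {(1, 2, 'y')}   # F^{-b*} = C - b*
    assert join(Fa, Fb) == C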
REFERENCES

[1] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350–371, 2001.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, September 1994.
[3] Y. Amit and A. Kong. Graphical templates for model registration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3):225–236, 1996.
[4] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In Proc. of the 2nd SIAM International Conference on Data Mining (SDM’02), pages 158–174, 2002.
[5] C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), 2002.
[6] C.-W. K. Chen and D. Y. Y. Yun. Unifying graph-matching problem with a practical solution. In Proc. of International Conference on Systems, Signals, Control, Computers, September 1998.
[7] G. Cong, L. Yi, B. Liu, and K. Wang. Discovering frequent substructures from hierarchical semi-structured data. In Proc. of the 2nd SIAM International Conference on Data Mining (SDM’02), 2002.
[8] D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32–41, 2000.
[9] L. Dehaspe and L. De Raedt. Mining association rules in multiple relations. In S. Džeroski and N. Lavrač, editors, Proc. of the 7th International Workshop on Inductive Logic Programming, volume 1297, pages 125–132. Springer-Verlag, 1997.
[10] L. Dehaspe, H. Toivonen, and R. D. King. Finding frequent substructures in chemical compounds. In R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro, editors, Proc. of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 30–36. AAAI Press, 1998.
[11] M. Deshpande, M. Kuramochi, and G. Karypis. Automated approaches for classifying structures. In Proc. of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD ’02), 2002.
[12] J. D. Dixon and B. Mortimer. Permutation Groups, volume 163 of Graduate Texts in Mathematics. Springer-Verlag, 1996.
[13] B. Dunkel and N. Soparkar. Data organization and access for efficient data mining. In Proc. of the 15th IEEE International Conference on Data Engineering, March 1999.
[14] D. Dupplaw and P. H. Lewis. Content-based image retrieval with scale-spaced object trees. In M. M. Yeung, B.-L. Yeo, and C. A. Bouman, editors, Proc. of SPIE: Storage and Retrieval for Media Databases, volume 3972, pages 253–261, 2000.
[15] S. Fortin. The graph isomorphism problem. Technical Report TR96-20, Department of Computing Science, University of Alberta, 1996.
[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[17] S. Ghazizadeh and S. Chawathe. SEuS: Structure extraction using summaries. In Proc. of the 5th International Conference on Discovery Science, 2002.
[18] B. Goethals. Efficient Frequent Pattern Mining. PhD thesis, University of Limburg, Diepenbeek, Belgium, December 2002.
[19] J. Gonzalez, L. B. Holder, and D. J. Cook. Application of graph-based concept learning to the predictive toxicology domain. In Proc. of the Predictive Toxicology Challenge Workshop, 2001.
[20] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, May 2000.
[21] C. Hansch, P. P. Maloney, T. Fujita, and R. M. Muir. Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature, 194:178–180, 1962.
[22] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64, July 2000.
[23] L. B. Holder, D. J. Cook, and S. Djoko. Substructure discovery in the SUBDUE system. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pages 169–180, 1994.
[24] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’00), pages 13–23, Lyon, France, September 2000.
[25] A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining frequent connected subgraphs. Technical Report RT0448, IBM Research, Tokyo Research Laboratory, 2002.
[26] H. Kälviäinen and E. Oja. Comparisons of attributed graph matching algorithms for computer vision. In Proc. of STEP-90, Finnish Artificial Intelligence Symposium, pages 354–368, Oulu, Finland, June 1990.
[27] R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg. Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. In Proc. of the National Academy of Sciences, volume 93, pages 438–442, 1996.
[28] S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pages 136–143, 2001.
[29] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), November 2001.
[30] T. K. Leung, M. C. Burl, and P. Perona. Finding faces in cluttered scenes using random labeled graph matching. In Proc. of the 5th IEEE International Conference on Computer Vision, June 1995.
[31] B. D. McKay. Nauty users guide. http://cs.anu.edu.au/~bdm/nauty/.
[32] B. D. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45–87, 1981.
[33] S. H. Muggleton. Inverse entailment and Progol. New Generation Computing, Special Issue on Inductive Logic Programming, 13(3–4):245–286, 1995.
[34] S. H. Muggleton. Scientific knowledge discovery using Inductive Logic Programming. Communications of the ACM, 42(11):42–46, 1999.
[35] C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, AB, Canada, July 2002.
[36] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proc. of 2001 International Conference on Data Engineering (ICDE’01), pages 215–226, 2001.
[37] E. G. M. Petrakis and C. Faloutsos. Similarity searching in medical image databases. Knowledge and Data Engineering, 9(3):435–447, 1997.
[38] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239–266, 1990.
[39] A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the 21st Int. Conf. on Very Large Data Bases (VLDB), pages 432–444, 1995.
[40] P. Shenoy, J. R. Haritsa, S. Sundarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In Proc. of ACM SIGMOD Int. Conf. on Management of Data, pages 22–33, May 2000.
[41] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the 5th International Conference on Extending Database Technology (EDBT), volume 1057, pages 3–17, 1996.
[42] A. Srinivasan and R. D. King. Feature construction with inductive logic programming: a study of quantitative predictions of biological activity aided by structural attributes. Data Mining and Knowledge Discovery, 3(1):37–57, 1999.
[43] A. Srinivasan, R. D. King, S. H. Muggleton, and M. Sternberg. The predictive toxicology evaluation challenge. In Proc. of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pages 1–6. Morgan Kaufmann, 1997.
[44] A. Srinivasan, R. D. King, S. H. Muggleton, and M. J. E. Sternberg. Carcinogenesis predictions using ILP. In S. Džeroski and N. Lavrač, editors, Proc. of the 7th International Workshop on Inductive Logic Programming, volume 1297, pages 273–287. Springer-Verlag, 1997.
[45] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), 2002.
[46] K. Yoshida and H. Motoda. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63–92, 1995.
[47] M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(2):372–390, 2000.
[48] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), July 2002.
[49] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. Technical Report 01-1, Department of Computer Science, Rensselaer Polytechnic Institute, 2001.