Generic Pattern Mining
Abstract
Frequent Pattern Mining (FPM) is a very powerful paradigm for mining informative and useful patterns in massive, complex datasets. In this paper we propose the Data Mining Template Library (DMTL), a collection of generic containers and algorithms for data mining, as well as persistency and database management classes. DMTL provides a systematic solution to a whole class of common FPM tasks like itemset, sequence, tree and graph mining. DMTL is extensible, scalable, and high-performance, delivering rapid response on massive datasets. A detailed set of experiments shows that DMTL is competitive with special-purpose algorithms designed for a particular pattern type, especially as database sizes increase.
1 Introduction
Frequent Pattern Mining (FPM) is a very powerful paradigm which encompasses an entire class of
data mining tasks, namely those dealing with extracting informative and useful patterns in massive
datasets representing complex interactions between diverse entities from a variety of sources. These
interactions may also span multiple scales, as well as spatial and temporal dimensions. FPM is
ideally suited for categorical datasets, which include text/hypertext data (e.g., news articles, web
pages), semistructured and XML data, event or log data (e.g., network logs, web logs), biological
sequences (e.g. DNA/RNA, proteins), transactional datasets, and so on. FPM techniques are able
to extract patterns embedded in different subspaces within very high dimensional, massive datasets.
FPM is very well suited to selecting or constructing good features in complex data and also for
building global classification models of the datasets [26].
The specific tasks encompassed by FPM include the mining of increasingly complex and infor-
mative patterns, in complex structured and unstructured relational datasets, such as: Itemsets or
co-occurrences [1] (transactional, unordered data), Sequences [2, 24] (temporal or positional data,
as in text mining, bioinformatics), Tree patterns [25, 3] (XML/semistructured data), and Graph
patterns [10, 13, 21, 22] (complex relational data, bioinformatics). Figure 1 shows examples of
these different types of patterns; in a generic sense a pattern denotes links/relationships between
several objects of interest. The objects are denoted as nodes, and the links as edges. Patterns can
have multiple labels, denoting various attributes, on both the nodes and edges.
* This work was supported by NSF Grant EIA-0103708 under the KD-D program, NSF CAREER Award IIS-0092978, and DOE Early Career PI Award DE-FG02-02ER25538.
† Paolo is currently at CNUCE, Italy.
The current practice in frequent pattern mining basically falls into the paradigm of incremen-
tal algorithm improvement and solutions to very specific problems. While there exist tools like
MLC++ [12], which provides a collection of algorithms for classification, and Weka [20], which
is a general purpose Java library of different data mining algorithms including itemset mining,
these systems do not have a unifying theme or framework, there is little database support, and
scalability to massive datasets is questionable. Moreover, these tools are not designed for handling
complex pattern types like trees and graphs.
Our work seeks to address all of the above limitations. In this paper we describe the Data Mining
Template Library (DMTL), a generic collection of algorithms and persistent data structures, which
follows a generic programming paradigm [4]. DMTL provides a systematic solution for the whole
class of pattern mining tasks in massive, relational datasets. The main contributions of DMTL are
as follows:
• The design and implementation of generic data structures and algorithms to handle various
pattern types like itemsets, sequences, trees and graphs.
• Design and implementation of generic data mining algorithms for FPM, such as depth-first
and breadth-first search.
• Persistent data structures for supporting efficient pattern frequency computations using a
tightly coupled database (DBMS) approach.
• Native support for both vertical and horizontal database formats for highly efficient mining.
• Support for pre-processing steps like data mapping, discretization of continuous attributes, and creation of taxonomies.
One of the main attractions of a generic paradigm is that the generic algorithms for mining
are guaranteed to work for any pattern type. Each pattern has a list of properties it satisfies,
and the generic algorithm can utilize these properties to speed up the mining. We conduct a
detailed set of experiments to show the scalability and efficiency of DMTL for different pattern
types like itemsets, sequences, trees and graphs. Our results indicate that DMTL is competitive
with the special purpose algorithms designed for a particular pattern type, especially with increasing
database sizes.
Primitive operations such as compute-tuple-distances (for point distances) were identified in [6] for classification and clustering tasks.
2 Preliminaries
The problem of mining frequent patterns can be stated as follows: Let N = {x_1, x_2, ..., x_{n_v}} be a set of n_v distinct nodes or vertices. A pair of nodes (x_i, x_j) is called an edge. Let L = {l_1, l_2, ..., l_{n_l}} be a set of n_l distinct labels. Let L_n : N → L be a node labeling function that maps a node to its label, L_n(x_i) = l_j, and let L_e : N × N → L be an edge labeling function that maps an edge to its label, L_e(x_i, x_j) = l_k.
A pattern P is simply a relation on N, P ⊆ N × N, that is P = {(x_i, x_j) | x_i, x_j ∈ N}, such that P satisfies some user-specified conditions C (i.e., C(P) is true). It is also intuitive to represent a pattern P as a graph (P_V, P_E), with labeled vertex set P_V ⊂ N and labeled edge set P_E = {(x_i, x_j) | x_i, x_j ∈ P_V}. The number of nodes in a pattern P is called its size. A pattern of size k is called a k-pattern. In some applications P is a symmetric relation, i.e., (x_i, x_j) = (x_j, x_i) (unordered edges), while in other applications P is anti-symmetric, i.e., (x_i, x_j) ≠ (x_j, x_i) (ordered edges). A path in P is a set of distinct nodes {x_{i_0}, x_{i_1}, ..., x_{i_n}} such that (x_{i_j}, x_{i_{j+1}}) is an edge in P_E for all j = 0, ..., n−1. The number of edges gives the length of the path. If x_i and x_j are connected by a path of length n we denote it as x_i <^n x_j. Thus an edge (x_i, x_j) can also be written as x_i <^1 x_j.
Given two patterns P and Q, we say that P is a subpattern of Q (or Q is a super-pattern of P), denoted P ⪯ Q, if and only if there exists a 1-1 mapping f from nodes in P to nodes in Q such that for all x_i, x_j ∈ P_V: i) L_n(x_i) = L_n(f(x_i)), ii) L_e(x_i, x_j) = L_e(f(x_i), f(x_j)), and iii) (x_i, x_j) ∈ P_E iff (if and only if) (f(x_i), f(x_j)) ∈ Q_E. In some cases we are interested in embedded subpatterns. P is an embedded subpattern of Q if: i) L_n(x_i) = L_n(f(x_i)), ii) L_e(x_i, x_j) = L_e(f(x_i), f(x_j)), and iii) (x_i, x_j) ∈ P_E iff f(x_i) <^l f(x_j) for some l, i.e., f(x_i) is connected to f(x_j) by some path. If P ⪯ Q we say that P is contained in Q, or that Q contains P.
A database D is just a collection (a multi-set) of patterns. A database pattern is also called an object. Let O = {o_1, o_2, ..., o_{n_o}} be a set of n_o distinct object identifiers (oids). An object has a unique identifier, given by the function O(d_i) = o_j, where d_i ∈ D and o_j ∈ O. The number of objects in D is given as |D|.
The absolute support of a pattern P in a database D is defined as the number of objects in D that contain P, given as π^a(P, D) = |{d ∈ D | P ⪯ d}|. The (relative) support of P is given as π(P, D) = π^a(P, D)/|D|. A pattern is frequent if its support is more than some user-specified minimum threshold, i.e., if π(P, D) ≥ π^min. A frequent pattern is maximal if it is not a subpattern of any other frequent pattern. A frequent pattern is closed if it has no super-pattern with the same support. The frequent pattern mining problem is to enumerate all the patterns that satisfy the user-specified π^min frequency requirement (and any other user-specified conditions).
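For example, if D contains four objects and P is contained in three of them, then π^a(P, D) = 3 and π(P, D) = 3/4 = 0.75, so P is frequent for any threshold π^min ≤ 0.75.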
The main observation in FPM is that the sub-pattern relation ⪯ defines a partial order on the set of patterns. If P ⪯ Q, we say that P is more general than Q, or Q is more specific than P. The second observation is that if Q is a frequent pattern, then all sub-patterns P ⪯ Q are also frequent. The different FPM algorithms differ in the manner in which they search the pattern space.
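For instance, if the itemset ABCD in Figure 1 is frequent, then all of its sub-itemsets, such as AB or ACD, must also be frequent; conversely, once AB is found to be infrequent, no pattern containing AB needs to be counted.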
Figure 1: Examples of pattern types: an itemset (ABCD), a sequence (A → AB → C), a tree, and a graph.
Finally, by definition a pattern can model any general graph, as well as any special constraints
that might appear in graph mining [10, 13, 21], such as connected graphs, or induced subgraphs.
It is also possible to model other patterns such as DAGs (directed acyclic graphs).
3.1 Pattern
In DMTL a pattern is a generic container, which can be instantiated as an itemset, sequence, tree
or a graph, specified as Pattern<class P> by means of a template argument called pattern-type
(P). A generic pattern is simply a pattern-type whose frequency we need to determine in a larger
collection or database of patterns of the same type.
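As a rough illustration of this idea (the member names below are hypothetical and not the actual DMTL interface), a generic pattern container can be viewed as a thin wrapper around a pattern-type that also records the pattern's frequency:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Minimal sketch of a generic pattern container, templatized on the
    // pattern-type P (e.g., an itemset, sequence, tree, or graph type).
    // Member names are illustrative only.
    template <class P>
    class Pattern {
    public:
        typedef P pattern_type;

        Pattern() : support_(0) {}
        explicit Pattern(const P& p) : elems_(p), support_(0) {}

        const P& elements() const { return elems_; }
        int support() const { return support_; }
        void set_support(int s) { support_ = s; }
        std::size_t size() const { return elems_.size(); } // number of elements (items/nodes)

    private:
        P elems_;       // the underlying pattern-type
        int support_;   // frequency of this pattern in the database
    };

    int main() {
        // An itemset pattern-type can be as simple as a sorted vector of item ids.
        typedef std::vector<int> Itemset;
        Pattern<Itemset> p(Itemset{1, 3, 7});
        p.set_support(42);
        std::cout << "size=" << p.size() << " support=" << p.support() << "\n";
        return 0;
    }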
the family, to find the maximal or closed patterns in the family, as well as a count() function that finds the support of all patterns in the database, using functions provided by the database class.
a depth-first search (DFS) [23, 24]. The generic DFS mining algorithm takes in a pattern family
and the database. The types of patterns and persistency manager are specified by the pattern
family type. The DFS algorithm in turn relies on other generic subroutines for creating equivalence
classes, for generating candidates, and for support counting. There is also a generic BFS-Mine that
performs Breadth-First Search [1, 17] over the pattern space.
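As a rough sketch of how such a generic depth-first routine might look (all class and member names below, such as db.join() and db.count(), are illustrative placeholders rather than the actual DMTL interfaces):

    // Sketch of a generic depth-first FPM loop. PatternFamily is any container
    // of patterns of one pattern-type; Database provides candidate joining and
    // support counting. All names here are illustrative, not the DMTL API.
    template <class PatternFamily, class Database>
    void dfs_mine(const PatternFamily& siblings, const Database& db,
                  int min_sup, PatternFamily& result)
    {
        typedef typename PatternFamily::value_type Pat;
        for (const Pat& p : siblings) {
            // Join p with each sibling in its equivalence class to form candidates.
            PatternFamily candidates;
            for (const Pat& q : siblings) {
                Pat cand;
                if (db.join(p, q, cand) && db.count(cand) >= min_sup)
                    candidates.push_back(cand);   // keep only frequent extensions
            }
            for (const Pat& c : candidates) result.push_back(c);
            if (!candidates.empty())
                dfs_mine(candidates, db, min_sup, result);  // recurse depth-first
        }
    }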
Depending on the pattern type being mined, the vat-type class may be different. For instance, for itemset mining it suffices to keep only the object identifiers where a given itemset appears. In this case the vat-type is simply an int (assuming that the oid is an integer). On the other hand, for sequence mining one needs not only the oid, but also the time stamp of the last attribute-value pair in the sequence. For sequences the vat-type is then pair<int, time>, i.e., a pair of an int, denoting the oid, and a time, denoting the time-stamp. Different vat-types must also provide operations like equality testing (for itemsets and sequences) and less-than testing (for sequences; an oid-time pair is less than another if they have the same oid and the first one happens before the second).
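For instance, a sequence vat-type along these lines could be sketched as follows (the struct and field names are illustrative; DMTL's actual vat-types differ in detail):

    #include <iostream>

    // Illustrative vat-type for sequence mining: an (oid, timestamp) pair.
    struct SeqVatEntry {
        int oid;    // object (sequence) identifier
        int time;   // timestamp of the last attribute-value pair

        // Equality: same object and same timestamp (used by equality joins).
        bool operator==(const SeqVatEntry& o) const {
            return oid == o.oid && time == o.time;
        }
        // Less-than: same object, and this occurrence happens strictly earlier
        // (used by temporal joins in sequence mining).
        bool operator<(const SeqVatEntry& o) const {
            return oid == o.oid && time < o.time;
        }
    };

    int main() {
        SeqVatEntry a{1, 10}, b{1, 20};
        std::cout << (a < b) << "\n";  // prints 1: same oid, a occurs before b
        return 0;
    }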
Given the generic setup of a VAT, DMTL defines a generic algorithm to join/intersect two VATs. For instance, in vertical itemset mining the support for an itemset is found by intersecting the VATs of its lexicographically first two subsets. A generic intersection operation utilizes the equality operation defined on the vat-type to find the intersection of any two VATs. On the other hand, in vertical sequence mining the support of a new candidate sequence is found by a temporal join on the VATs, which in turn uses the less-than operator defined by the vat-type. Since the itemset vat-type typically will not provide a less-than operator, if a DMTL developer tries to use a temporal intersection on an itemset vat-type, it will generate a compile-time error! This kind of concept-checking support provided by DMTL is extremely useful in catching library misuses at compile time rather than at run time.
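A minimal sketch of such an equality-based generic intersection, assuming a VAT body is modeled simply as a std::vector of vat-type entries (a simplification of the real VAT class), is:

    #include <vector>

    // Generic equality-based VAT intersection: keeps every entry of v1 that also
    // appears in v2. It only requires operator== on the vat-type V, so it works
    // for itemset VATs (plain oids) as well as any other vat-type providing ==.
    // (When VAT bodies are kept sorted, a linear merge would be used instead of
    // this quadratic scan.)
    template <class V>
    std::vector<V> intersect_vats(const std::vector<V>& v1, const std::vector<V>& v2)
    {
        std::vector<V> out;
        for (const V& a : v1)
            for (const V& b : v2)
                if (a == b) { out.push_back(a); break; }
        return out;
    }

A temporal join for sequence mining would follow the same outline but would test a < b using the vat-type's less-than operator instead of equality; instantiating it with a vat-type that lacks operator< then fails to compile, which is precisely the concept-checking behavior described above.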
DMTL provides support for creating VATs during the mining process, i.e., during algorithm execution, as well as support for updating VATs (add and delete operations). In DMTL, VATs can be either persistent or non-persistent. Finally, DMTL maintains indexes over collections of VATs for efficient retrieval based on a given attribute-value or a given pattern.
Figure 3: DMTL: High level overview of the different classes used for Persistency
The database support for VATs and for the horizontal family of patterns is provided by DMTL
in terms of the following classes, which are illustrated in Figure 3.
Vat-type: A class describing the vat-type that composes the body of a VAT, for instance int for
itemsets and pair<int,time> for sequences.
VAT<class V>: The class that represents VATs. This class is composed of a collection of records
of vat-type V.
Storage<class PM>: The generic persistency-manager class that implements the physical persis-
tency for VATs and other classes. The class PM provides the actual implementations of the
generic operations required by Storage. For example, PM_metakit and PM_gigabase are two actual implementations of the Storage class in terms of different DBMSs, namely Metakit [19], a persistent C++ library that natively supports the vertical format, and Gigabase [11], an
object-relational database. Other implementations can easily be added as long as they provide
the required functionality.
MetaTable<class V, class PM>: This class represents a collection of VATs. It stores a list of
VAT pointers and the appropriate data structures to handle efficient search for a specific VAT
in the collection. It also provides physical storage for VATs. It is templatized on the vat-type
V and on the Storage implementation PM.
DB<class V, class PM>: The database class which holds a collection of Metatables. This is the
main user interface to VATs and constitutes the database class DB referred to in previous
sections. It supports VAT operations such as intersection, as well as the operations for data
import and export. The double template follows the same format as that of the Metatable
class.
Buffer<class V>: A fixed-size main-memory buffer to which VATs are written and from which
VATs are accessed, used for buffer management to provide seamless support for main-memory
and out-of-core VATs (of type V).
A diagram of the class interaction is displayed in Figure 3. As previously stated, the DB class is the
main DMTL interface to VATs and the persistency manager for patterns. It has as data members
an object of type Buffer<V> and a collection of MetaTables<V,PM>.
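A highly simplified skeleton of how these classes might fit together (data members only; the names and layout are illustrative, and the actual DMTL classes carry many more operations):

    #include <map>
    #include <vector>

    // Illustrative skeletons only: the real DMTL classes have richer interfaces.
    template <class V>
    struct VAT {
        std::vector<V> body;                 // records of vat-type V
    };

    template <class V>
    struct Buffer {
        std::vector<V> block;                // fixed-size in-memory block of entries
        explicit Buffer(std::size_t n) : block(n) {}
    };

    template <class V, class PM>
    struct MetaTable {
        std::map<long, VAT<V>*> index;       // header (pattern id) -> VAT body
        PM storage;                          // persistency-manager backend
    };

    template <class V, class PM>
    struct DB {
        Buffer<V> buffer;                    // shared buffer for VAT bodies
        std::vector<MetaTable<V, PM> > tables;
        explicit DB(std::size_t buf_size) : buffer(buf_size) {}
    };

    // Example instantiation for itemset mining with a hypothetical backend type:
    struct PM_flatfile {};                   // stand-in for PM_metakit / PM_gigabase
    typedef DB<int, PM_flatfile> ItemsetDB;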
The Buffer<V> class is composed of a fixed-size buffer which will contain as many VAT bodies as will fit.
When a VAT body is requested from the DB class, the buffer is searched first. If the body is not
already present there, it is retrieved from disk, by accessing the Metatable containing the requested
VAT. If there is not enough space to store the new VAT in the buffer, the buffer manager will
(transparently) replace an existing VAT with the new one. A similar interface is used to provide
access to patterns in a persistent family or the horizontal database.
The MetaTable class stores all the pointers to the different VAT objects. It provides the mapping between the patterns, called the header, and their VATs, called the body, via a hash-based indexing scheme. In Figure 3, H refers to a pattern header and B to its corresponding VAT body. The Storage class provides for efficient lookup of a particular VAT object given the header.
In DMTL, a VAT can be handled in one of three ways:
• volatile: the VAT is fully loaded and available in main memory only.
• buffered: the VAT is handled as if it were in main memory, but it is actually kept on disk in
an out-of-core fashion.
• persistent: the VAT is disk resident and can be retrieved after the program execution, i.e., the VAT is inserted in the VAT database.
Volatile VATs are created and handled by directly accessing the VAT class members. Buffered
VATs are managed from the DB class through Buffer functions. Buffered VATs must be inserted
into the file associated with a Metatable, but when a buffered VAT is no longer needed, its space on disk can be freed. A method for removing a VAT from disk is provided in the DB class. If this method is not called, then the VAT will be persistent, i.e., it will remain in the metatable and in
the storage associated with it after execution.
Figure 4: Buffer Management: The buffer manager keeps track of the current position in the buffer (curr), the amount of free space (free), and an index recording where each object is stored. As new objects are inserted these are updated, and an existing object is replaced when the buffer becomes full.
The Buffer class provides methods to access and to manage a fixed size buffer where the most
recently used VATs/patterns are stored for fast retrieval. The idea behind the buffer management
implemented in the Buffer class is illustrated in Figure 4.
A fixed size buffer is available as a linear block of memory of objects of type V. Records are
inserted and retrieved from the buffer as linear chunks of memory. To start, the buffer is empty.
When a new object is inserted, some data structures are initialized in order to keep track of where
every object is placed so it can be accessed later. Objects are inserted one after the other in
a round-robin fashion. When there is no more space left in the buffer, the least recently used
(LRU) block (corresponding to one entire VAT body, or a pattern) is removed. While the current
implementation provides an LRU buffering strategy, as part of future work we will consider more
sophisticated buffer replacement strategies that closely tie with the mining.
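The following sketch illustrates the general idea with a simple least-recently-used cache of whole VAT bodies; it is a hypothetical simplification and omits the round-robin placement details of the actual implementation:

    #include <cstddef>
    #include <list>
    #include <map>
    #include <vector>

    // Illustrative buffer manager: caches whole VAT bodies up to a fixed total
    // number of entries, evicting the least recently used body when full.
    template <class V>
    class VatBuffer {
    public:
        explicit VatBuffer(std::size_t capacity) : capacity_(capacity), used_(0) {}

        // Fetch a body; `load` is called only on a buffer miss (e.g., reads from disk).
        template <class Loader>
        const std::vector<V>& get(long id, Loader load) {
            typename std::map<long, std::vector<V> >::iterator it = cache_.find(id);
            if (it != cache_.end()) { touch(id); return it->second; }
            std::vector<V> body = load(id);            // miss: fetch from storage
            while (!lru_.empty() && used_ + body.size() > capacity_) evict();
            used_ += body.size();
            lru_.push_back(id);
            return cache_[id] = body;
        }

    private:
        void touch(long id) { lru_.remove(id); lru_.push_back(id); }
        void evict() {
            long victim = lru_.front();                // least recently used body
            lru_.pop_front();
            used_ -= cache_[victim].size();
            cache_.erase(victim);
        }

        std::size_t capacity_, used_;
        std::map<long, std::vector<V> > cache_;        // id -> VAT body
        std::list<long> lru_;                          // least recently used first
    };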
4.5 Storage
Physical storage of VATs and pattern families can be implemented using different storage systems,
such as a DBMS or ad-hoc libraries. In order to abstract the details of the actual system used,
all storage-related operations are provided in a generic class, Storage. Implementations of the
Storage class for MetaKit [19] and Gigabase [11] backends are provided in DMTL. Other
implementations can easily be added as long as they provide the required functionality.
The DB class is a doubly templated class where both the vat-type and the storage implementation
need to be specified. An example instantiation of a DB class for itemset patterns would therefore be DB<int, PM_metakit> or DB<int, PM_gigabase>.
<?xml version="1.0"?>
<!DOCTYPE Datasource SYSTEM "dmtl_config.dtd">
<Data model="relational" source="ascii_file">
<Access> [...] </Access>
<Structure>
<Format> [...] </Format>
<Attributes> [...] </Attributes>
</Structure>
</Data>
5.1 Attributes
The configuration used for mapping attribute values is contained in the <Structure> section. The <Format> section contains the characters used as record separator and field separator. An <Attribute> section must be present for each attribute (or column) in the input database. Such a section might look like: <Attribute name="price" type="continuous" units="Euro" ignore="yes"> [ ... ] </Attribute>. Possible attributes for the <Attribute> tag are: name, the name of the attribute; type, one of continuous, discrete, or categorical; units, the unit of measure for values (currency, weight, etc.); and ignore, whether a VAT should be created for this attribute or not.
5.2 Mapping
The mapping information is enclosed in the <Mapping> section. Mapping can be different for categorical, continuous or discrete fields. For continuous values we can specify a fixed-step discretization within a range. For instance, with min = 1, max = 5, and step = 0.5, the field price would be mapped to (max − min)/step = (5 − 1)/0.5 = 8 values, labeled with integers starting from 0. It is also possible to specify non-uniform discretizations by omitting the step attribute and explicitly specifying all the ranges and labels. For categorical values we can also specify a mapping, which allows for taxonomies or other groupings.
6 Experiments
DMTL is implemented using the C++ Standard Template Library [4]. We present some experimental results on the time taken by DMTL to load databases and to perform different types of pattern mining on them. We used the IBM synthetic database generator [1] for itemset and sequence mining, the tree generator from [25] for tree mining, and the graph generator from [13], with sizes ranging from 10K to 1000K (1 million) objects. The experiments were run on a Pentium 4 2.8GHz processor with 6GB of memory, running Linux.
Figure 5 shows the DMTL mining time versus the specialized algorithms for itemset mining (ECLAT [23]), sequences (SPADE [24]), trees (TreeMiner [25]) and graphs (gSpan [21]). For the DMTL algorithms, we show the time with a flat-file (Flat) persistency manager/database, with the Metakit backend (Metakit) and the Gigabase backend (Gigabase). The left-hand column shows the effect of minimum support on the mining time for the various patterns. We find that for all pattern types DMTL is within a factor of 10 of the specialized algorithms, even as we decrease the minimum support on a database with 100K records. The right-hand column shows the effect of increasing database sizes on these algorithms. We find that as the number of objects increases, the gap between the DMTL algorithms and the specialized ones starts to decrease. We expect that as we increase the number of records further, the specialized algorithms will break down, while DMTL will continue to run since it explicitly manages its memory buffers. Comparing the three backend implementations, we find that the flat-file approach has a slight edge, but the object-oriented Gigabase backend is almost as fast. On the other hand, the embedded database Metakit is generally slower.
Figure 6 shows the time taken to convert the input data into VATs. The times are shown for the three different backends (flat, Metakit and Gigabase) for up to 1 million objects. We find that these three approaches are roughly the same, with the maximum difference being a factor of 2.
7 Conclusions
In this paper we describe the design and implementation of the DMTL prototype for an important subset of FPM tasks, namely mining frequent itemsets, sequences, trees, and graphs. Following the ideology of generic programming, DMTL provides a standardized, general, and efficient implementation of frequent pattern mining tasks by isolating the concept of data structures, or containers, from that of algorithms. DMTL provides container classes for representing different patterns, collections of patterns, and containers for database objects (horizontal and vertical).
Figure 5: Itemset, Sequence, Tree and Graph Mining: Effect of Minimum Support and Database
Size
Figure 6: Database Conversion and Loading Times
Generic algorithms, on the other hand, are independent of the container and can be applied to any valid pattern. These include algorithms for performing intersections of the VATs, or for mining.
The generic paradigm of DMTL is a first-of-its-kind in data mining, and we plan to use insights
gained to extend DMTL to other common mining tasks like classification, clustering, deviation
detection, and so on. Eventually, DMTL will house the tightly-integrated and optimized primitive,
generic operations, which serve as the building blocks of more complex mining algorithms. The
primitive operations will serve all steps of the mining process, i.e., pre-processing of data, mining
algorithms, and post-processing of patterns/models. Finally, we plan to release DMTL as open source, and the feedback we receive will help drive more useful enhancements. We also hope
that DMTL will provide a common platform for developing new algorithms, and that it will foster
comparison among the multitude of existing algorithms.
References
[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of
association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data
Mining, pages 307–328. AAAI Press, Menlo Park, CA, 1996.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg.,
1995.
[3] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure
discovery from large semi-structured data. In 2nd SIAM Int’l Conference on Data Mining,
April 2002.
[4] M. H. Austern. Generic Programming and the STL. Addison Wesley Longman, Inc., 1999.
[5] S. Chaudhuri, U. Fayyad, and J. Bernhardt. Scalable classification over SQL databases. In 15th
IEEE Intl. Conf. on Data Engineering, March 1999.
[6] A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer
Academic Pub., Boston, MA, 1998.
[7] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In 1st IEEE Int’l
Conf. on Data Mining, November 2001.
[8] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language
for relational databases. In 1st ACM SIGMOD Workshop on Research Issues in Data Mining
and Knowledge Discovery, June 1996.
[9] T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining
and Knowledge Discovery: An International Journal, 3:373–408, 1999.
[10] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent sub-
structures from graph data. In 4th European Conference on Principles of Knowledge Discovery
and Data Mining, September 2000.
[12] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++, a machine learning library in C++. International Journal of Artificial Intelligence Tools, 6(4):537–566, 1997.
[13] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In 1st IEEE Int’l Conf. on Data
Mining, November 2001.
[14] C. Mastroianni, D. Talia, and P. Trunfio. Managing heterogeneous resources in data mining
applications on grids using XML-based metadata. In Proceedings of the 12th Heterogeneous
Computing Workshop, 2002.
[15] R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In 22nd
Intl. Conf. Very Large Databases, 1996.
[16] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with databases:
alternatives and implications. In ACM SIGMOD Intl. Conf. Management of Data, June 1998.
[17] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance
improvements. In 5th Intl. Conf. Extending Database Technology, March 1996.
[18] D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association rule mining. In ACM SIGMOD Intl. Conf. Management of Data,
June 1998.
[20] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations. Morgan Kaufmann Publishers, 1999.
[21] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In IEEE Int’l Conf. on
Data Mining, 2002.
[22] X. Yan and J. Han. Closegraph: Mining closed frequent graph patterns. In ACM SIGKDD
Int. Conf. on Knowledge Discovery and Data Mining, August 2003.
[23] M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and
Data Engineering, 12(3):372–390, May-June 2000.
[24] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning
Journal, 42(1/2):31–60, Jan/Feb 2001.
[25] M. J. Zaki. Efficiently mining frequent trees in a forest. In 8th ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining, July 2002.
[26] M. J. Zaki and C. C. Aggarwal. XRules: An effective structural classifier for XML data. In 9th
ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, August 2003.
[27] M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd
SIAM International Conference on Data Mining, April 2002.