Advanced Methods For Knowledge Discovery From Complex Data
Series Editors
Professor Lakhmi Jain
[email protected]
Professor Xindong Wu
[email protected]
Colin Fyfe
Hebbian Learning and Negative Feedback Networks
1-85233-883-0
Advanced Methods for Knowledge Discovery from Complex Data
With 120 Figures
Sanghamitra Bandyopadhyay, PhD
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Ujjwal Maulik, PhD
Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
Lawrence B. Holder, PhD
Diane J. Cook, PhD
Department of Computer Science & Engineering, University of Texas at Arlington, USA
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the
publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued
by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be
sent to the publishers.
ISBN 1-85233-989-6
Springer Science+Business Media
springeronline.com
The publisher makes no representation, express or implied, with regard to the accuracy of the
information contained in this book and cannot accept any legal responsibility or liability for any errors
or omissions that may be made.
Contents

Contributors
Preface

Part I Foundations

7 Link-based Classification
Lise Getoor

Part II Applications

Index
Contributors
Sanghamitra Bandyopadhyay
Machine Intelligence Unit
Indian Statistical Institute
Kolkata, India
[email protected]
Bhabatosh Chanda
Electronics and Communication Sciences Unit
Indian Statistical Institute
Kolkata, India
[email protected]
Jeff Coble
Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, Texas USA
[email protected]
Diane J. Cook
Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, Texas USA
[email protected]
Melba M. Crawford
The University of Texas at Austin
Austin, Texas USA
[email protected]
Amit K. Das
Computer Science and Technology Department
Bengal Engineering College (Deemed University)
Kolkata, India
[email protected]
Mohamed M. Gaber
School of Computer Science and Software Engineering
Monash University
Australia
[email protected]
Thomas Gärtner
Fraunhofer Institut Autonome Intelligente Systeme
Germany
[email protected]
Lise Getoor
Department of Computer Science and UMIACS
University of Maryland, College Park
Maryland, USA
[email protected]
Joydeep Ghosh
The University of Texas at Austin
Austin, Texas USA
[email protected]
Jiawei Han
University of Illinois at Urbana-Champaign
Urbana-Champaign, Illinois USA
[email protected]
Lawrence B. Holder
Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, Texas USA
[email protected]
Tao Jiang
School of Computer Engineering
Nanyang Technological University
Nanyang Avenue, Singapore
[email protected]
Shonali Krishnaswamy
School of Computer Science and Software Engineering
Monash University
Australia
[email protected]
Shailesh Kumar
Fair Isaac Corporation
San Diego, California USA
[email protected]
Ujjwal Maulik
Department of Computer Science and Engineering
Jadavpur University
Kolkata, India
[email protected]
Srinivas Mukkamala
Department of Computer Science
New Mexico Tech, Socorro, USA
[email protected]
Joseph Potts
Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, Texas USA
[email protected]
Sanjoy K. Saha
Department of Computer Science and Engineering
Jadavpur University
Kolkata, India
[email protected]
Sunita Sarawagi
Department of Information Technology
Indian Institute of Technology
Mumbai, India
[email protected]
Andrew H. Sung
Department of Computer Science
Institute for Complex Additive Systems Analysis
New Mexico Tech, Socorro, USA
[email protected]
Ah-Hwee Tan
School of Computer Engineering
Nanyang Technological University
Nanyang Avenue, Singapore
[email protected]
Jason T. L. Wang
Department of Computer Science
New Jersey Institute of Technology
University Heights
Newark, New Jersey USA
[email protected]
Wei Wang
University of North Carolina at Chapel Hill
Chapel Hill, North Carolina USA
[email protected]
Xifeng Yan
University of Illinois, Urbana-Champaign
Urbana-Champaign, Illinois USA
[email protected]
Jiong Yang
Case Western Reserve University
Cleveland, Ohio USA
[email protected]
Mohammed J. Zaki
Computer Science Department
Rensselaer Polytechnic Institute
Troy, New York USA
[email protected]
Arkady Zaslavsky
School of Computer Science and Software Engineering
Monash University
Australia
[email protected]
Sen Zhang
Department of Mathematics, Computer Science and Statistics,
State University of New York, Oneonta
Oneonta, New York USA
[email protected]
Preface
The growth in the amount of data collected and generated has exploded in
recent times with the widespread automation of various day-to-day activities,
advances in high-level scientific and engineering research and the development
of efficient data collection tools. This has given rise to the need for automati-
cally analyzing the data in order to extract knowledge from it, thereby making
the data potentially more useful.
Knowledge discovery and data mining (KDD) is the process of identifying
valid, novel, potentially useful and ultimately understandable patterns from
massive data repositories. It is a multi-disciplinary topic, drawing from sev-
eral fields including expert systems, machine learning, intelligent databases,
knowledge acquisition, case-based reasoning, pattern recognition and statis-
tics.
Many data mining systems have typically evolved around well-organized
database systems (e.g., relational databases) containing relevant information.
But, more and more, one finds relevant information hidden in unstructured
text and in other complex forms. Mining in the domains of the world-wide
web, bioinformatics, geoscientific data, and spatial and temporal applications
comprise some illustrative examples in this regard. Discovery of knowledge,
or potentially useful patterns, from such complex data often requires the ap-
plication of advanced techniques that are better able to exploit the nature
and representation of the data. Such advanced methods include, among oth-
ers, graph-based and tree-based approaches to relational learning, sequence
mining, link-based classification, Bayesian networks, hidden Markov models,
neural networks, kernel-based methods, evolutionary algorithms, rough sets
and fuzzy logic, and hybrid systems. Many of these methods are developed in
the following chapters.
In this book, we bring together research articles by active practitioners
reporting recent advances in the field of knowledge discovery, where the in-
formation is mined from complex data, such as unstructured text from the
world-wide web, databases naturally represented as graphs and trees, geoscien-
tific data from satellites and visual images, multimedia data and bioinformatic
data. Characteristics of the methods and algorithms reported here include the
use of domain-specific knowledge for reducing the search space, dealing with
increase in speed over the pattern-matching approach and applies the new
technique to the problem of mining usage patterns from real logs of website
browsing behavior.
Another specialized form in which complex data might be expressed is a se-
quence. In Chapter 6, Sarawagi discusses several methods for mining sequence
data, i.e., data modeled as a sequence of discrete multi-attribute records. She
reviews state-of-the-art techniques in sequence mining and applies these to two
real applications: address cleaning and information extraction from websites.
In Chapter 7, Getoor returns to the more general graph representation of
complex data, but includes probabilistic information about the distribution
of links (or relationships) between entities. Getoor uses a structured logistic
regression model to learn patterns based on both links and entity attributes.
Results in the domains of web browsing and citation collections indicate that
the use of link distribution information improves classification performance.
The remaining chapters constitute the applications section of the book.
Significant successes have been achieved in a wide variety of domains, indi-
cating the potential benefits of mining complex data, rather than applying
simpler methods on simpler transformations of the data. Chapter 8 begins
with a contribution by Zhang and Wang describing techniques for mining
evolutionary trees, that is, trees whose parent–child relationships represent
actual evolutionary relationships in the domain of interest. A good example,
and one to which they apply their approach, is phylogenetic trees that describe
the evolutionary pathways of species at the molecular level. Their algorithm
efficiently discovers “cousin pairs,” which are two nodes sharing a common
ancestor, in a single tree or a set of trees. They present numerous experimen-
tal results showing the efficiency and effectiveness of their approach in both
synthetic and real domains, namely, phylogenetic trees.
In Chapter 9, Jiang and Tan apply a variant of the Apriori-based associ-
ation rule-mining algorithm to the relational domain of Resource Description
Framework (RDF) documents. Their approach treats RDF relations as items
in the traditional association-rule mining framework. Their approach also
takes advantage of domain ontologies to provide generalizations of the RDF
relations. They apply their technique to a synthetically-generated collection of
RDF documents pertaining to terrorism and show that the method discovers
a small set of association rules capturing the main associations known to be
present in the domain.
Saha, Das and Chanda address the task of content-based image retrieval
by mapping image data into complex data using features based on shape,
texture and color in Chapter 10. They also develop an image retrieval sim-
ilarity measure based on human perception and improve retrieval accuracy
using feedback to establish the relevance of the various features. The authors
empirically validate the superiority of their method over competing methods
of content-based image retrieval using two large image databases.
In Chapter 11, Mukkamala and Sung turn to the problem of intrusion
detection. They perform a comparative analysis of three advanced mining
Part I Foundations
1 Knowledge Discovery and Data Mining

Sanghamitra Bandyopadhyay and Ujjwal Maulik
Summary. Knowledge discovery and data mining has recently emerged as an im-
portant research direction for extracting useful information from vast repositories of
data of various types. This chapter discusses some of the basic concepts and issues
involved in this process with special emphasis on different data mining tasks. The
major challenges in data mining are mentioned. Finally, the recent trends in data
mining are described and an extensive bibliography is provided.
1.1 Introduction
The sheer volume and variety of data that is routinely being collected as a
consequence of widespread automation is mind-boggling. With the advantage
of being able to store and retain immense amounts of data in easily accessible
form comes the challenge of being able to integrate the data and make sense
out of it. Needless to say, this raw data potentially stores a huge amount of
information, which, if utilized appropriately, can be converted into knowledge,
and hence wealth for the human race. Data mining (DM) and knowledge
discovery (KD) are related research directions that have emerged in the recent
past for tackling the problem of making sense out of large, complex data sets.
Traditionally, manual methods were employed to turn data into knowl-
edge. However, sifting through huge amounts of data manually and making
sense out of it is slow, expensive, subjective and prone to errors. Hence the
need to automate the process arose; thereby leading to research in the fields
of data mining and knowledge discovery. Knowledge discovery from databases
(KDD) evolved as a research direction that appears at the intersection of re-
search in databases, machine learning, pattern recognition, statistics, artificial
intelligence, reasoning with uncertainty, expert systems, information retrieval,
signal processing, high performance computing and networking.
Data stored in massive repositories is no longer only numeric, but could be
graphical, pictorial, symbolic, textual and linked. Typical examples of some
such domains are the world-wide web, geoscientific data, VLSI chip layout
and routing, multimedia, and time series data as in financial markets. More-
over, the data may be very high-dimensional as in the case of text/document
representation. Data pertaining to the same object is often stored in different
forms. For example, biologists routinely sequence proteins and store them in
files in a symbolic form, as a string of amino acids. The same protein may also
be stored in another file in the form of individual atoms along with their three
dimensional co-ordinates. All these factors, by themselves or when taken to-
gether, increase the complexity of the data, thereby making the development
of advanced techniques for mining complex data imperative. A cross-sectional
view of some recent approaches employing advanced methods for knowledge
discovery from complex data is provided in the different chapters of this book.
For the convenience of the reader, the present chapter is devoted to the de-
scription of the basic concepts and principles of data mining and knowledge
discovery, and the research issues and challenges in this domain. Recent trends
in KDD are also mentioned.
[Figure: the basic KDD process, in which raw data from a data repository undergoes data preparation (cleaning, integration and filtering), data mining extracts patterns from the processed data, and knowledge representation presents the extracted knowledge to the users, supported by a knowledge base.]
The raw data is first cleaned to reduce noisy, erroneous and missing data as far as
possible. The different sub tasks of the data preparation step are often per-
formed iteratively by utilizing the knowledge gained in the earlier steps in the
subsequent phases. Once the data is cleaned, it may need to be integrated
since there could be multiple sources of the data. After integration, further
redundancy removal may need to be carried out. The cleaned and integrated
data is stored in databases or data warehouses.
Data warehousing [40, 66] refers to the tasks of collecting and cleaning
transactional data to make them available for online analytical processing
(OLAP). A data warehouse includes [66]:
• Cleaned and integrated data: This allows the miner to easily look across
vistas of data without bothering about matters such as data standardiza-
tion, key determination, tackling missing values and so on.
• Detailed and summarized data: Detailed data is necessary when the miner
is interested in looking at the data in its most granular form and is nec-
essary for extracting important patterns. Summary data is important for
a miner to learn about the patterns in the data that have already been
extracted by someone else. Summarized data ensures that the miner can
build on the work of others rather than building everything from scratch.
• Historical data: This helps the miner in analyzing past trends/seasonal
variations and gaining insights into the current data.
• Metadata: This is used by the miner to describe the context and the mean-
ing of the data.
It is important to note that data mining can be performed without the
presence of a data warehouse, though data warehouses greatly improve the
efficiency of data mining. Since databases often constitute the repository of
data that has to be mined, it is important to study how the current database
management system (DBMS) capabilities may be utilized and/or enhanced
for efficient mining [64].
As a first step, it is necessary to develop efficient algorithms for imple-
menting machine learning tools on top of large databases and utilizing the
existing DBMS support. The implementation of classification algorithms such
as C4.5 or neural networks on top of a large database requires tighter coupling
with the database system and intelligent use of coupling techniques [53, 64].
For example, clustering may require efficient implementation of the nearest
neighbor algorithms on top of large databases.
In addition to developing algorithms that can work on top of existing
DBMS, it is also necessary to develop new knowledge and data discovery
management systems (KDDMS) to manage KDD systems [64]. For this it is
necessary to define KDD objects that may be far more complex than database
objects (records or tuples), and queries that are more general than SQL and
that can operate on the complex objects. Here, KDD objects may be rules,
classifiers or a clustering [64]. The KDD objects may be pre-generated (e.g.,
as a set of rules) or may be generated at run time (e.g., a clustering of the
data objects). KDD queries may now involve predicates that can return a
classifier, rule or clustering as well as database objects such as records or
tuples. Moreover, KDD queries should satisfy the concept of closure of a query
language as a basic design paradigm. This means that a KDD query may take
as argument another compatible type of KDD query. Also KDD queries should
be able to operate on both KDD objects and database objects. An example
of such a KDD query may be [64]: “Generate a classifier trained on a user
defined training set generated though a database query with user defined
attributes and user specified classification categories. Then find all records in
the database that are wrongly classified using that classifier and use that set
as training data for another classifier.” Some attempts in this direction may
be found in [65, 120].
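To make the closure property concrete, here is a purely hypothetical Python sketch; neither the chapter nor [64] defines such an API, and every class, function and field name below (TrainClassifierQuery, MisclassifiedRowsQuery, the toy rows and the label accessor) is invented for illustration. The point is only that the result of one KDD query (a classifier object) can be passed as an argument to another query, which in turn returns ordinary database objects.

```python
# Hypothetical sketch of "closure" in a KDD query language: a query returns a
# KDD object (here, a classifier) that can itself be fed to another query.
class TrainClassifierQuery:
    def __init__(self, training_rows, label_of):
        self.training_rows = training_rows   # rows selected by a database query
        self.label_of = label_of             # user-specified classification categories

    def run(self):
        """Returns a KDD object: a (trivial) majority-class classifier."""
        labels = [self.label_of(r) for r in self.training_rows]
        majority = max(set(labels), key=labels.count)
        return lambda row: majority

class MisclassifiedRowsQuery:
    """Takes another query's result (a classifier) and returns database objects."""
    def __init__(self, classifier, rows, label_of):
        self.classifier, self.rows, self.label_of = classifier, rows, label_of

    def run(self):
        return [r for r in self.rows if self.classifier(r) != self.label_of(r)]

rows = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}, {"x": 3, "y": "a"}]
label = lambda r: r["y"]
clf = TrainClassifierQuery(rows, label).run()
hard_rows = MisclassifiedRowsQuery(clf, rows, label).run()  # training data for a second classifier
print(hard_rows)
```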
Since this module communicates between the users and the knowledge dis-
covery step, it goes a long way in making the entire process more useful and
effective. Important components of the knowledge presentation step are data
visualization and knowledge representation techniques. Presenting the infor-
mation in a hierarchical manner is often very useful for the user to focus
attention on only the important and interesting concepts. This also enables
the users to see the discovered patterns at multiple levels of abstraction. Some
possible ways of knowledge presentation include:
• rule and natural language generation,
• tables and cross tabulations,
• graphical representation in the form of bar chart, pie chart and curves,
• data cube view representation, and
• decision trees.
The following section describes some of the commonly used tasks in data
mining.
1.3 Tasks in Data Mining

1.3.1 Association Rule Mining

The root of the association rule mining problem lies in the market basket or
transaction data analysis. A lot of information is hidden in the thousands of
transactions taking place daily in supermarkets. A typical example is that
if a customer buys butter, bread is almost always purchased at the same
time. Association analysis is the discovery of rules showing attribute–value
associations that occur frequently.
Let I = {i_1, i_2, ..., i_n} be a set of n items and X be an itemset where X ⊂ I. A k-itemset is a set of k items. Let T = {(t_1, X_1), (t_2, X_2), ..., (t_m, X_m)} be a set of m transactions, where t_i and X_i, i = 1, 2, ..., m, are the transaction identifier and the associated itemset respectively. The cover of an itemset X in T is the set of identifiers of the transactions that contain X, i.e.,

cover(X, T) = { t_i | (t_i, X_i) ∈ T, X ⊆ X_i }.
The objective in association rule mining is to find all rules of the form X ⇒ Y, where X ∩ Y = ∅, indicating that if itemset X occurs in a transaction then itemset Y also occurs with some probability c%. X is called the antecedent of the rule and Y is called the consequent of the rule. The support of a rule denotes the percentage of transactions in T that contain both X and Y; this is taken to be the probability P(X ∪ Y). An association rule is called frequent if its support exceeds a minimum value min_sup.
The confidence of a rule X ⇒ Y in T denotes the percentage of the transactions in T containing X that also contain Y. It is taken to be the conditional probability P(Y | X). In other words,

confidence(X ⇒ Y, T) = support(X ∪ Y, T) / support(X, T).    (1.5)
A rule is called confident if its confidence value exceeds a threshold min_conf. The problem of association rule mining can therefore be formally stated as follows: find the set R of all rules X ⇒ Y such that

R = { X ⇒ Y | X, Y ⊂ I, X ∩ Y = ∅, X ∪ Y ∈ F(T, min_sup), confidence(X ⇒ Y, T) > min_conf },    (1.6)

where F(T, min_sup) denotes the set of frequent itemsets of T with respect to min_sup.
Other than support and confidence measures, there are other measures of
interestingness associated with association rules. Tan et al. [125] have pre-
sented an overview of various measures proposed in statistics, machine learn-
ing and data mining literature in this regard.
The association rule mining process, in general, consists of two steps:
1. Find all frequent itemsets,
2. Generate strong association rules from the frequent itemsets.
Although this is the general framework adopted in most of the research in
association rule mining [50, 60], there is another approach to immediately
generate a large subset of all association rules [132].
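As an illustration of the two-step procedure and of the support and confidence measures in Equations (1.5) and (1.6), the following is a minimal Python sketch; the toy transactions, the thresholds and the brute-force enumeration are invented for illustration and are not an efficient Apriori implementation.

```python
# Sketch: enumerate frequent itemsets, then report confident rules X => Y.
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]
min_sup, min_conf = 0.5, 0.7
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent itemsets (brute force over all item combinations).
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= min_sup]

# Step 2: confident rules X => Y generated from each frequent itemset Z = X u Y.
for z in frequent:
    for k in range(1, len(z)):
        for x in map(frozenset, combinations(z, k)):
            y = z - x
            conf = support(z) / support(x)
            if conf >= min_conf:
                print(f"{set(x)} => {set(y)}  (sup={support(z):.2f}, conf={conf:.2f})")
```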
1.3.2 Classification
A typical pattern recognition system consists of three phases. These are data
acquisition, feature extraction and classification. In the data acquisition phase,
depending on the environment within which the objects are to be classified,
data are gathered using a set of sensors. These are then passed on to the
feature extraction phase, where the dimensionality of the data is reduced
by measuring/retaining only some characteristic features or properties. In a
broader perspective, this stage significantly influences the entire recognition
process. Finally, in the classification phase, the extracted features are passed
on to the classifier that evaluates the incoming information and makes a fi-
nal decision. This phase basically establishes a transformation between the
features and the classes.
The problem of classification is basically one of partitioning the feature
space into regions, one region for each category of input. Thus it attempts
to assign every data point in the entire feature space to one of the possible
(say, k) classes. Classifiers are usually, but not always, designed with labeled
data, in which case these problems are sometimes referred to as supervised
classification (where the parameters of a classifier function D are learned).
Some common examples of the supervised pattern classification techniques
are the nearest neighbor (NN) rule, Bayes maximum likelihood classifier and
perceptron rule [7, 8, 31, 36, 45, 46, 47, 52, 105, 127]. Figure 1.2 provides
a block diagram showing the supervised classification process. Some of the
related classification techniques are described below.
NN Rule [36, 46, 127]
Let us consider a set of n pattern points of known classification {x1 , x2 , . . . ,
xn }, where it is assumed that each pattern belongs to one of the classes
C1 , C2 , . . . , Ck . The NN classification rule then assigns a pattern x of unknown
classification to the class of its nearest neighbor, where xi ∈ {x1 , x2 , . . . , xn }
is defined to be the nearest neighbor of x if d(x, x_i) = min_{l=1,...,n} d(x, x_l), where d denotes the distance measure adopted over the feature space.
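A minimal Python sketch of the NN rule follows; the Euclidean metric is assumed as the distance measure, and the toy training patterns and labels are invented.

```python
# Sketch of the 1-NN rule: assign an unknown pattern to the class of its
# nearest labeled neighbor under Euclidean distance.
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nn_classify(x, labeled_patterns):
    """labeled_patterns: list of (pattern, class_label) pairs."""
    _, label = min(labeled_patterns, key=lambda pc: euclidean(x, pc[0]))
    return label

training = [((0.0, 0.0), "C1"), ((0.1, 0.2), "C1"), ((1.0, 1.1), "C2")]
print(nn_classify((0.9, 1.0), training))  # -> "C2"
```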
[Figure 1.2: block diagram of the supervised classification process, with an abstraction (training) phase in which a model is learned from labeled data and a generalization phase in which the model classifies unknown patterns.]
Bayes Maximum Likelihood Classifier

Let P_l denote the a priori probability and p_l(x) the class-conditional density of class C_l, and let L_li be the loss incurred in deciding that a pattern belongs to class C_i when it actually belongs to class C_l. The expected loss (or risk) incurred in assigning a pattern x to class C_i is

r_i(x) = Σ_{l=1}^{k} L_li p(C_l | x),    (1.8)
where p(C_l | x) represents the probability that x is from C_l. Using Bayes' formula, Equation (1.8) can be written as

r_i(x) = (1 / p(x)) Σ_{l=1}^{k} L_li p_l(x) P_l,    (1.9)

where

p(x) = Σ_{l=1}^{k} p_l(x) P_l.
The pattern x is assigned to the class with the smallest expected loss. The
classifier which minimizes the total expected loss is called the Bayes classifier.
Let us assume that the loss (Lli ) is zero for correct decision and greater
than zero but the same for all erroneous decisions. In such situations, the
expected loss, Equation (1.9), becomes
r_i(x) = 1 − (P_i p_i(x)) / p(x).    (1.10)
Since p(x) does not depend on the class, the Bayes decision rule is nothing but the implementation of the decision functions

D_i(x) = P_i p_i(x),    i = 1, 2, ..., k,    (1.11)

where a pattern x is assigned to class C_i if D_i(x) ≥ D_j(x) for all j ≠ i. If the class-conditional densities are multivariate normal with mean vector µ_i and covariance matrix Σ_i, then
p_i(x) = (1 / ((2π)^{N/2} |Σ_i|^{1/2})) exp[ −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) ],    i = 1, 2, ..., k.    (1.12)
Taking logarithms, the corresponding decision functions become

D_i(x) = ln P_i − (1/2) ln |Σ_i| − (1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i),    i = 1, 2, ..., k.    (1.13)

Note that the decision functions in Equation (1.13) are hyperquadrics, since no terms higher than the second degree in the components of x appear in them.
It can thus be stated that the Bayes maximum likelihood classifier for normal
distribution of patterns provides a second-order decision surface between each
pair of pattern classes. An important point to be mentioned here is that if the
pattern classes are truly characterized by normal densities, then, on average,
no other surface can yield better results. In fact, the Bayes classifier designed
over known probability distribution functions, provides, on average, the best
performance for data sets which are drawn according to the distribution. In
such cases, no other classifier can provide better performance, on average, be-
cause the Bayes classifier gives minimum probability of misclassification over
all decision rules.
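A short Python sketch of this decision rule, under the normal-density assumption of Equations (1.12) and (1.13), is given below; the priors, mean vectors and covariance matrices are invented, and numpy is assumed to be available.

```python
# Sketch of the Bayes maximum likelihood rule with normal class densities.
import numpy as np

def quadratic_discriminant(x, prior, mean, cov):
    """D_i(x) = ln P_i - 0.5 ln|Sigma_i| - 0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i)."""
    diff = x - mean
    return (np.log(prior)
            - 0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.inv(cov) @ diff)

def bayes_classify(x, classes):
    """classes: dict label -> (prior, mean vector, covariance matrix)."""
    return max(classes, key=lambda c: quadratic_discriminant(x, *classes[c]))

classes = {
    "C1": (0.5, np.array([0.0, 0.0]), np.eye(2)),
    "C2": (0.5, np.array([2.0, 2.0]), np.eye(2) * 0.5),
}
print(bayes_classify(np.array([1.8, 1.9]), classes))  # -> "C2"
```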
Decision Trees
A decision tree is an acyclic, tree-structured graph in which each internal node represents a test on a feature, each branch an outcome of the test, and each leaf node a class or a class distribution. It is easy to convert any decision tree into
classification rules. Once the training data points are available, a decision tree
can be constructed from them from top to bottom using a recursive divide
and conquer algorithm. This process is also known as decision tree induction.
A version of ID3 [112], a well-known decision-tree induction algorithm, is
described below.
Let the training data set D contain n data points belonging to k classes C_1, C_2, ..., C_k, with n_i points in class C_i. The expected information needed to classify a given data point is

I(n_1, n_2, ..., n_k) = − Σ_{i=1}^{k} p_i log_b(p_i),    (1.14)
where p_i (= n_i / n) is the probability that a randomly selected data point belongs to class C_i. In case the information is encoded in binary, the base b of the log function is set to 2. Suppose the feature F has d distinct values {f_1, f_2, ..., f_d}, and that it is used to partition the data points D into s subsets {D_1, D_2, ..., D_s}. Moreover, let n_{ij} be the number of data points of class C_i in a subset D_j. The entropy, or expected information, based on the partition by F is given by
E(F) = Σ_{j=1}^{s} ((n_{1j} + n_{2j} + ... + n_{kj}) / n) I(n_{1j}, n_{2j}, ..., n_{kj}),    (1.15)

where

I(n_{1j}, n_{2j}, ..., n_{kj}) = − Σ_{i=1}^{k} p_{ij} log_b(p_{ij}).    (1.16)
Here, p_{ij} (= n_{ij} / |D_j|) is the probability that a data point in D_j belongs to class C_i. The corresponding information gain obtained by branching on F is given by

Gain(F) = I(n_1, n_2, ..., n_k) − E(F).    (1.17)
The ID3 algorithm finds out the feature corresponding to the highest informa-
tion gain and chooses it as the test feature. Subsequently a node labelled with
this feature is created. For each value of the attribute, branches are generated
and accordingly the data points are partitioned.
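The quantities in Equations (1.14)-(1.17) can be computed with a few lines of Python, as in the sketch below; the toy data set and the helper functions info and gain are invented for illustration, and base-2 logarithms are used.

```python
# Sketch of the ID3 information measures: I(n_1,...,n_k), E(F) and Gain(F).
import math
from collections import Counter

def info(labels):
    """I(n_1,...,n_k) = -sum p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, feature_index):
    """Information gain from branching on the feature at `feature_index`."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    expected = sum(len(sub) / n * info(sub) for sub in subsets.values())
    return info(labels) - expected

# Toy data: feature 0 = outlook, feature 1 = windy; class = play / no-play.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["no", "no", "yes", "no"]
best = max(range(2), key=lambda f: gain(rows, labels, f))
print("best feature:", best)
```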
Due to the presence of noise or outliers some of the branches of the deci-
sion tree may reflect anomalies causing the overfitting of the data. In these
circumstances tree-pruning techniques are used to remove the least reliable
branches, which allows better classification accuracy as well as convergence.
For classifying unknown data, the feature values of the data point are
tested against the constructed decision tree. Consequently a path is traced
from the root to the leaf node that holds the class prediction for the test data.
1.3.3 Regression

Regression is the task of learning a function that maps a data point to a real-valued prediction variable. A widely used form is the multiple linear regression model

Y = α + β_1 X_1 + β_2 X_2 + ... + β_n X_n,    (1.20)

where Y is the response variable, X_1, X_2, ..., X_n are the predictor variables, and α, β_1, ..., β_n are the regression coefficients. The polynomial regression model

Y = α + β_1 X + β_2 X^2 + ... + β_n X^n    (1.21)

can be converted to the linear form by defining the new variables X_1 = X, X_2 = X^2, ..., X_n = X^n, giving

Y = α + β_1 X_1 + β_2 X_2 + ... + β_n X_n.    (1.22)
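As a small illustration of fitting the polynomial model (1.21) through its linear form (1.22) by least squares, consider the following Python sketch; the data points are made up and numpy is assumed to be available.

```python
# Sketch: fit Y = alpha + beta_1 x + beta_2 x^2 by ordinary least squares.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])   # roughly 1 + 2x^2

degree = 2
# Design matrix with columns [1, x, x^2]: the linearized variables of (1.22).
X = np.vander(x, degree + 1, increasing=True)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("alpha, beta_1, beta_2 =", coeffs)
```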
1.3.4 Clustering

When the only data available are unlabelled, the classification problems are
sometimes referred to as unsupervised classification. Clustering [6, 31, 55,
67, 127] is an important unsupervised classification technique where a set of
patterns, usually vectors in a multidimensional space, are grouped into clusters
in such a way that patterns in the same cluster are similar in some sense and
patterns in different clusters are dissimilar in the same sense. For this it is
necessary to first define a measure of similarity which will establish a rule
for assigning patterns to a particular cluster. One such measure of similarity is the Euclidean distance between two patterns, with a smaller distance indicating greater similarity.
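The following Python sketch illustrates this idea with a few iterations of k-means-style clustering: patterns are assigned to the nearest centre under Euclidean distance and the centres are then re-estimated. The toy points and the choice of k are invented, and the sketch does not guard against empty clusters.

```python
# Sketch of distance-based clustering (k-means style).
import numpy as np

def kmeans(patterns, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centres = patterns[rng.choice(len(patterns), k, replace=False)]
    for _ in range(iters):
        # Assign each pattern to the nearest centre (Euclidean distance).
        dists = np.linalg.norm(patterns[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centre as the mean of its assigned patterns.
        centres = np.array([patterns[labels == j].mean(axis=0) for j in range(k)])
    return labels, centres

pts = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels, centres = kmeans(pts, k=2)
print(labels, centres)
```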
Deviation Detection
In this section major issues and challenges in data mining regarding underlying
data types, mining techniques, user interaction and performance are described
[54].
• User interaction
The knowledge discovery process is interactive and iterative in nature as
sometimes it is difficult to estimate exactly what can be discovered from
a database. User interaction helps the mining process to focus the search for
patterns by appropriately sampling and refining the data. This in turn results
in better performance of the data mining algorithm in terms of discovered
knowledge as well as convergence.
• Incorporation of a priori knowledge
Incorporation of a priori domain-specific knowledge is important in all
phases of a knowledge discovery process. This knowledge includes integrity
constraints, rules for deduction, probabilities over data and distribution,
number of classes, etc. This a priori knowledge helps with better conver-
gence of the data mining search as well as the quality of the discovered
patterns.
• Scalability
Data mining algorithms must be scalable in the size of the underlying data,
meaning both the number of patterns and the number of attributes. The
size of data sets to be mined is usually huge, and hence it is necessary either
to design faster algorithms or to partition the data into several subsets,
executing the algorithms on the smaller subsets, and possibly combining
the results [111].
• Efficiency and accuracy
Efficiency and accuracy of a data mining technique is a key issue. Data
mining algorithms must be very efficient such that the time required to
extract the knowledge from even a very large database is predictable and
acceptable. Moreover, the accuracy of the mining system needs to be as good
as, or better than, an acceptable level.
Data mining is widely used in different application domains, where the data is
not necessarily restricted to conventional structured types, e.g., those found in
relational databases, transactional databases and data warehouses. Complex
data that are nowadays widely collected and routinely analyzed include:
• Spatial data – This type of data is often stored in Geographical Informa-
tion Systems (GIS), where the spatial coordinates constitute an integral
part of the data. Some examples of spatial data are maps, preprocessed
remote sensing and medical image data, and VLSI chip layout. Clustering
of geographical points into different regions characterized by the presence
of different types of land cover, such as lakes, mountains, forests, residen-
tial and business areas, agricultural land, is an example of spatial data
mining.
• Multimedia data – This type of data may contain text, image, graphics,
video clips, music and voice. Summarizing an article, identifying the con-
tent of an image using features such as shape, size, texture and color,
summarizing the melody and style of a piece of music, are some examples of mul-
timedia data mining.
• Time series data – This consists of data that is temporally varying. Exam-
ples of such data include financial/stock market data. Typical applications
of mining time series data involve prediction of the time series at some fu-
ture time point given its past history.
• Web data – The world-wide web is a vast repository of unstructured infor-
mation distributed over wide geographical regions. Web data can typically
be categorized into those that constitute the web content (e.g., text, im-
ages, sound clips), those that define the web structure (e.g., hyperlinks,
tags) and those that monitor the web usage (e.g., http logs, application
server logs). Accordingly, web mining can also be classified into web con-
tent mining, web structure mining and web usage mining.
• Biological data – DNA, RNA and proteins are the most widely studied
molecules in biology. A large number of databases store biological data
in different forms, such as sequences (of nucleotides and amino acids),
atomic coordinates and microarray data (that measure the levels of gene
expression). Finding homologous sequences, identifying the evolutionary
relationship of proteins and clustering gene microarray data are some ex-
amples of biological data mining.
In order to deal with different types of complex problem domains, spe-
cialized algorithms have been developed that are best suited to the particular
problem that they are designed for. In the following subsections, some such
complex domains and problem solving approaches, which are currently widely
used, are discussed.
Sometimes users of a data mining system are interested in one or more pat-
terns that they want to retrieve from the underlying data. These tasks, com-
monly known as content-based retrieval, are mostly used for text and image
databases. For example, searching the web uses a page ranking technique that
is based on link patterns for estimating the relative importance of different
pages with respect to the current search. In general, the different issues in
content-based retrieval are as follows:
• Identifying an appropriate set of features used to index an object in the
database;
• Storing the objects, along with their features, in the database;
• Defining a measure of similarity between different objects;
• Given a query and the similarity measure, performing an efficient search
in the database;
• Incorporating user feedback and interaction in the retrieval process.
Text Retrieval
A document is commonly represented by a term vector V_k = (v_{k1}, v_{k2}, ..., v_{kT}), where T is the number of terms in the vocabulary and v_{kj} is the weight of the jth term in document k. The similarity between two documents V_i and V_k can then be measured by the cosine measure

s(V_i, V_k) = (Σ_{j=1}^{T} v_{ij} v_{kj}) / (sqrt(Σ_{j=1}^{T} v_{ij}^2) sqrt(Σ_{j=1}^{T} v_{kj}^2)).

This represents the inner product of the two term vectors after they are normalized to have unit length, and it reflects the similarity in the relative distribution of their term components.
The term vectors may have Boolean representation where 1 indicates that
the corresponding term is present in the document and 0 indicates that it is
not. A significant drawback of the Boolean representation is that it cannot
be used to assign a relevance ranking to the retrieved documents. Another
commonly used weighting scheme is the Term Frequency–Inverse Document
Frequency (TF–IDF) scheme [24]. Using TF, each component of the term vec-
tor is multiplied by the frequency of occurrence of the corresponding term. The
IDF weight for the ith component of the term vector is defined as log(N/ni ),
where ni is the number of documents that contain the ith term and N is the
total number of documents. The composite TF–IDF weight is the product of
the TF and IDF components for a particular term. The TF term gives more
importance to frequently occurring terms in a document. However, if a term
occurs frequently in most of the documents in the document set then, in all
probability, the term is not really that important. This is taken care of by the
IDF factor.
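A short Python sketch of the TF-IDF weighting and the cosine measure discussed above follows; the tiny document collection and the whitespace tokenization are invented for illustration.

```python
# Sketch: TF-IDF term vectors and cosine similarity for a toy document set.
import math
from collections import Counter

docs = [
    "data mining extracts patterns from data",
    "knowledge discovery in databases",
    "mining frequent patterns in transaction data",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))
N = len(docs)
# idf_i = log(N / n_i), where n_i is the number of documents containing term i.
idf = {t: math.log(N / sum(t in doc for doc in tokenized)) for t in vocab}

def tfidf(doc):
    tf = Counter(doc)
    return [tf[t] * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf(doc) for doc in tokenized]
print(cosine(vectors[0], vectors[2]))
```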
The above schemes are based strictly on the terms occurring in the docu-
ments and are referred to as vector space representation. An alternative to this
strategy is latent semantic indexing (LSI). In LSI, the dimensionality of the
term vector is reduced using principal component analysis (PCA) [31, 127].
PCA is based on the notion that it may be beneficial to combine a set of
features in order to obtain a single composite feature that can capture most
of the variance in the data. In terms of text retrieval, this could identify the
similar pattern of occurrences of terms in the documents, thereby capturing
the hidden semantics of the data. For example, the terms “data mining” and
“knowledge discovery” have nothing in common when using the vector space
representation, but could be combined into a single principal component term
since these two terms would most likely occur in a number of related docu-
ments.
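The essence of LSI can be sketched with a truncated singular value decomposition of the term-document matrix, which is one standard way of computing the principal components; the small matrix, the choice of two latent components and the helper function below are invented.

```python
# Sketch of latent semantic indexing: compare documents in a low-rank space.
import numpy as np

# Rows = terms, columns = documents (raw term counts, made up).
A = np.array([
    [2, 0, 1, 0],   # "data"
    [1, 0, 2, 0],   # "mining"
    [0, 1, 0, 2],   # "knowledge"
    [0, 2, 0, 1],   # "discovery"
], dtype=float)

k = 2                                           # number of latent components kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T    # each row: one document in LSI space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 2 share "data"/"mining" terms, so they end up close in LSI space.
print(cosine(doc_embeddings[0], doc_embeddings[2]))
```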
Image Retrieval
Image and video data are increasing day by day; as a result content-based
image retrieval is becoming very important and appealing. Developing in-
teractive mining systems for handling queries such as "Generate the N most
similar images to the query image" is a challenging task. Here image data does
not necessarily mean images generated only by cameras but also images em-
bedded in a text document as well as handwritten characters, paintings, maps,
graphs, etc.
In the initial phase of an image retrieval process, the system needs to un-
derstand and extract the necessary features of the query images. Extracting
the semantic contents of a query image is a challenging task and an active
research area in pattern recognition and computer vision. The features of an
image are generally expressed in terms of color, texture, shape. These features
of the query image are computed, stored and used during retrieval. For exam-
ple, QBIC (Query By Image Content) is an interactive image mining system
developed by scientists at IBM. QBIC allows the user to search a large
image database with content descriptors such as color (a three-dimensional
color feature vector and k-dimensional color histogram where the value of k
is dependent on the application), texture (a three-dimensional texture vector
with features that measure coarseness, contrast and directionality) as well as
the relative position and shape (twenty-dimensional features based on area,
circularity, eccentricity, axis orientation and various moments) of the query
image. Subsequent to the feature-extraction process, distance calculation and
retrieval are carried out in multidimensional feature space. Chapter 10 deals
with the task of content-based image retrieval where features based on shape,
texture and color are extracted from an image. A similarity measure based
on human perception and a relevance feedback mechanism are formulated for
improved retrieval accuracy.
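As a simple illustration of this kind of feature vector, the sketch below computes a per-channel color histogram for a randomly generated image and compares two images by the Euclidean distance between their histograms; this is not the QBIC feature set, and the bin count is an arbitrary choice.

```python
# Sketch: color-histogram features for content-based image retrieval.
import numpy as np

def color_histogram(image, bins=8):
    """Normalized per-channel histogram, concatenated into one feature vector."""
    feats = []
    for channel in range(image.shape[2]):
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)

rng = np.random.default_rng(0)
query = rng.integers(0, 256, size=(32, 32, 3))
candidate = rng.integers(0, 256, size=(32, 32, 3))
# Smaller Euclidean distance between histograms = more similar images.
print(np.linalg.norm(color_histogram(query) - color_histogram(candidate)))
```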
Translations, rotations, nonlinear transformation and changes of illumi-
nation (shadows, lighting, occlusion) are common distortions in images. Any
change in scale, viewing angle or illumination changes the features of the dis-
torted version of the scene compared to the original version. Although the
human visual system is able to handle these distortions easily, it is far more
challenging to design image retrieval techniques that are invariant under such
transformation and distortion. This requires incorporation of translation and
distortion invariance into the feature space.
Over the past few decades, major advances in the field of molecular biol-
ogy, coupled with advances in genomic technology, have led to an explosive
growth in the biological information generated by the scientific community.
Bioinformatics, viewed as the use of computational methods to make biolog-
ical discoveries, has evolved as a major research direction in response to this
deluge of information. The main purpose is to utilize computerized databases
to store, organize and index the data and to use specialized tools to view
and analyze the data. The ultimate goal of the field is to enable the discov-
ery of new biological insights as well as to create a global perspective from
which unifying principles in biology can be derived. Sequence analysis, phy-
logenetic/evolutionary trees, protein classification and analysis of microarray
data constitute some typical problems of bioinformatics where mining tech-
niques are required for extracting meaningful patterns. A broad classification
of some (not all) bioinformatic tasks is provided in Figure 1.4. The mining
tasks often used for biological data include clustering, classification, prediction
and frequent pattern identification [130]. Applications of some data mining
techniques in bioinformatics and their requirements are mentioned below.
[Figure 1.4: a broad classification of bioinformatic tasks.]
In recent times, data that are distributed among different sites that are dis-
persed over a wide geographical area are becoming more and more common. In
particular, sensor networks, consisting of a large number of small, inexpensive
sensor devices, are gradually being deployed in many situations for monitoring
the environment. The nodes of a sensor network collect time-varying streams
of data, have limited computing capabilities, small memory storage, and low
detection of outliers for event detection are only two examples that may re-
quire clustering algorithms. The distributed and resource-constrained nature
of the sensor networks demands a fundamentally distributed algorithmic so-
lution to the clustering problem. Therefore, distributed clustering algorithms
may come in handy [71] when it comes to analyzing sensor network data or
data streams.
Fuzzy Sets
Fuzzy set theory was developed in order to handle uncertainties, arising from
vague, incomplete, linguistic or overlapping patterns, in various problem-
solving systems. This approach is developed based on the realization that
an object may belong to more than one class, with varying degrees of class
membership. Uncertainty can result from the incomplete or ambiguous in-
put information, the imprecision in the problem definition, ill-defined and/or
overlapping boundaries among the classes or regions, and the indefiniteness
in defining or extracting features and relations among them.
Fuzzy sets were introduced in 1965 by Lotfi A. Zadeh [136, 137], as a
way to represent vagueness in everyday life. We almost always speak in fuzzy
terms, e.g., he is more or less tall, she is very beautiful. Hence, concepts of tall
and beautiful are fuzzy, and the gentleman and lady have membership values
to these fuzzy concepts indicating their degree of belongingness. Since this
theory is a generalization of the classical set theory, it has greater flexibility
to capture various aspects of incompleteness, imprecision or imperfection in
information about a situation. It has been applied successfully in computing
with words or the matching of linguistic terms for reasoning.
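The notion of graded membership can be illustrated with a small Python sketch; the piecewise-linear form and the height thresholds below are arbitrary choices made for illustration, not part of fuzzy set theory itself.

```python
# Sketch of a fuzzy membership function for the linguistic concept "tall":
# the degree of membership rises gradually instead of switching from 0 to 1.
def tall_membership(height_cm, low=160.0, high=190.0):
    """Piecewise-linear membership in the fuzzy set 'tall'."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

for h in (155, 170, 185, 195):
    print(h, round(tall_membership(h), 2))
```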
Fuzzy set theory has found a lot of applications in data mining [10, 107,
134]. Examples of such applications may be found in clustering [82, 106, 128],
association rules [9, 133], time series [27], and image retrieval [44, 94].
Evolutionary Computation
Neural Networks
the real world in a manner similar to biological systems. Their origin can
be traced to the work of Hebb [57], where a local learning rule is proposed.
The benefit of neural nets lies in the high computation rate provided by their
inherent massive parallelism. This allows real-time processing of huge data sets
with proper hardware backing. All information is stored distributed among
the various connection weights. The redundancy of interconnections produces
a high degree of robustness resulting in a graceful degradation of performance
in the case of noise or damage to a few nodes/links.
Neural network models have been studied for many years with the hope
of achieving human-like performance (artificially), particularly in the field of
pattern recognition, by capturing the key ingredients responsible for the re-
markable capabilities of the human nervous system. Note that these models
are extreme simplifications of the actual human nervous system. Some com-
monly used neural networks are the multi-layer perceptron, Hopfield network,
Kohonen’s self organizing maps and radial basis function network [56].
Neural networks have been widely used in searching for patterns in data
[23] because they appear to bridge the gap between the generalization capabil-
ity of human beings and the deterministic nature of computers. More impor-
tant among these applications are rule generation and classification [86], clus-
tering [5], data modeling [83], time series analysis [33, 49, 63] and visualization
[78]. Neural networks may be used as a direct substitute for autocorrelation,
multivariable regression, linear regression, trigonometric and other regression
techniques [61, 123]. Apart from data mining tasks, neural networks have
also been used for data preprocessing, such as data cleaning and handling
missing values. Various applications of supervised and unsupervised neural
networks to the analysis of the gene expression profiles produced using DNA
microarrays has been studied in [90]. A hybridization of genetic algorithms
and perceptrons has been used in [74] for supervised classification in microar-
ray data. Issues involved in the research on the use of neural networks for data
mining include model selection, determination of an appropriate architecture
and training algorithm, network pruning, convergence and training time, data
representation and tackling missing values. Hybridization of neural networks
with other soft computing tools such as fuzzy logic, genetic algorithms, rough
sets and wavelets have proved to be effective for solving complex problems.
architectural and industrial design. Details about CBR may be found in [131] and
more recently in [102].
1.5 Conclusions
This chapter presented the basic concepts and issues in KDD, and also dis-
cussed the challenges that data mining researchers are facing. Such challenges
arise due to different reasons, such as very high dimensional and extremely
large data sets, unstructured and semi-structured data, temporal and spatial
patterns and heterogeneous data. Some important application domains where
data mining techniques are heavily used have been elaborated. These include
web mining, bioinformatics, and image and text mining. The recent trends in
KDD have also been summarized, including brief descriptions of some common
mining tools. An extensive bibliography is provided.
Traditional data mining generally involved well-organized database sys-
tems such as relational databases. With the advent of sophisticated technol-
ogy, it is now possible to store and manipulate very large and complex data.
The data complexity arises due to several reasons, e.g., high dimensionality,
semi- and/or un-structured nature, and heterogeneity. Data related to the
world-wide web, the geoscientific domain, VLSI chip layout and routing, mul-
timedia, financial markets, sensor networks, and genes and proteins constitute
some typical examples of complex data. In order to extract knowledge from
such complex data, it is necessary to develop advanced methods that can
exploit the nature and representation of the data more efficiently. The fol-
lowing chapters report the research work of active practitioners in this field,
describing recent advances in the field of knowledge discovery from complex
data.
References
[1] The Berkeley Initiative in Soft Computing. URL:
www-bisc.cs.berkeley.edu/
[2] Aggarwal, C. C. and P. S. Yu, 2001: Outlier detection for high di-
mensional data. Proceedings of the SIGMOD Conference.
[3] Agrawal, R., T. Imielinski and A. N. Swami, 1993: Mining associa-
tion rules between sets of items in large databases. Proceedings of the
1993 ACM SIGMOD International Conference on Management of Data,
P. Buneman and S. Jajodia, eds., Washington, D.C., 207–16.
[4] Agrawal, R., and R. Srikant, 1994: Fast algorithms for mining associa-
tion rules. Proc. 20th Int. Conf. Very Large Data Bases, VLDB , J. B.
Bocca, M. Jarke, and C. Zaniolo, eds., Morgan Kaufmann, 487–99.
[5] Alahakoon, D., S. K. Halgamuge, and B. Srinivasan, 2000: Dynamic self
organizing maps with controlled growth for knowledge discovery. IEEE
Transactions on Neural Networks, 11, 601–14.
[24] Chakrabarti, S., 2002: Mining the Web: Discovering Knowledge from
Hypertext Data. Morgan Kaufmann.
[25] Chawla, N. V., K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, 2002:
Smote: Synthetic minority over-sampling technique. Journal of Artificial
Intelligence Research, 16, 321–57.
[26] Chen, W., J. C. Hou and L. Sha, 2004: Dynamic clustering for acous-
tic target tracking in wireless sensor networks. IEEE Transactions on
Mobile Computing, 3, 258–71.
[27] Chiang, D. A., L. R. Chow and Y. F. Wang, 2000: Mining time series
data by a fuzzy linguistic summary system. Fuzzy Sets and Systems,
112, 419–32.
[28] Chiba, S., K. Sugawara, and T. Watanabe, 2001: Classification and func-
tion estimation of protein by using data compression and genetic algo-
rithms. Proc. Congress on Evolutionary Computation, 2, 839–44.
[29] Cristianini, N. and J. Shawe-Taylor, 2000: An Introduction to Support
Vector Machines (and other kernel-based learning methods). Cambridge
University Press, UK.
[30] Dayhoff, J. E., 1990: Neural Network Architectures: An Introduction.
Van Nostrand Reinhold, New York.
[31] Devijver, P. A. and J. Kittler, 1982: Pattern Recognition: A Statistical
Approach. Prentice-Hall, London.
[32] Dopazo, H., J. Santoyo and J. Dopazo, 2004: Phylogenomics and the
number of characters required for obtaining an accurate phylogeny of
eukaryote model species. Bioinformatics, 20, Suppl 1, I116–I121.
[33] Dorffner, G., 1996: Neural networks for time series processing. Neural
Network World , 6, 447–68.
[34] Dorohonceanu, B. and C. G. Nevill-Manning, 2000: Accelerating pro-
tein classification using suffix trees. Proceedings of the 8th International
Conference on Intelligent Systems for Molecular Biology (ISMB), 128–
33.
[35] Du, W. and Z. Zhan, 2002: Building decision tree classifier on private
data. Proceedings of the IEEE International Conference on Data Mining
Workshop on Privacy, Security, and Data Mining, Australian Computer
Society, 14, 1–8.
[36] Duda, R. O. and P. E. Hart, 1973: Pattern Classification and Scene
Analysis. John Wiley, New York.
[37] Eisenhardt, M., W. Muller and A. Henrich, 2003: Classifying Docu-
ments by Distributed P2P Clustering. Proceedings of Informatik 2003,
GI Lecture Notes in Informatics, Frankfurt, Germany.
[38] Ester, M., H.-P. Kriegel, J. Sander and X. Xu, 1996: Density-based
algorithm for discovering clusters in large spatial databases. Proc. of the
Second International Conference on Data Mining KDD-96 , Portland,
Oregon, 226–31.
[39] Ester, M., H.-P. Kriegel and X. Xu, 1995: Knowledge discovery in large
spatial databases: Focusing techniques for efficient class identification.
[57] Hebb, D. O., 1949: The Organization of Behavior . John Wiley, New
York.
[58] Heinzelman, W., A. Chandrakasan and H. Balakrishnan, 2000: Energy-
efficient communication protocol for wireless microsensor networks. Pro-
ceedings of the Hawaii Conference on System Sciences.
[59] — 2002: An application-specific protocol architecture for wireless mi-
crosensor networks. IEEE Transactions on Wireless Communications,
1, 660–70.
[60] Hipp, J., U. Güntzer and G. Nakhaeizadeh, 2000: Algorithms for asso-
ciation rule mining – a general survey and comparison. SIGKDD Explo-
rations, 2, 58–64.
[61] Hoya, T. and A. Constantidines, 1998: A heuristic pattern correction
scheme for GRNNS and its application to speech recognition. Proceed-
ings of the IEEE Signal Processing Society Workshop, 351–9.
[62] Hu, Y.-J., S. Sandmeyer, C. McLaughlin and D. Kibler, 2000: Combi-
natorial motif analysis and hypothesis generation on a genomic scale.
Bioinformatics, 16, 222–32.
[63] Hüsken, M. and P. Stagge, 2003: Recurrent neural networks for time
series classification. Neurocomputing, 50(C).
[64] Imielinski, T. and H. Mannila, 1996: A database perspective on knowl-
edge discovery. Communications of the ACM , 39, 58–64.
[65] Imielinski, T., A. Virmani and A. Abdulghani, 1996: A discovery board
application programming interface and query language for database
mining. Proceedings of KDD 96 , Portland, Oregon, 20–26.
[66] Inmon, W. H., 1996: The data warehouse and data mining. Communi-
cations of the ACM , 39, 49–50.
[67] Jain, A. K. and R. C. Dubes, 1988: Algorithms for Clustering Data.
Prentice-Hall, Englewood Cliffs, NJ.
[68] Jensen, F. V., 1996: An Introduction to Bayesian Networks. Springer-
Verlag, New York, USA.
[69] Kargupta, H., S. Bandyopadhyay and B. H. Park, eds., 2005: Special
Issue on Distributed and Mobile Data Mining, IEEE Transactions on
Systems, Man, and Cybernetics Part B. IEEE.
[70] Kargupta, H. and P. Chan, eds., 2001: Advances in Distributed and
Parallel Knowledge Discovery. MIT Press.
[71] Kargupta, H., R. Bhargava, K. Liu, M. Powers, P. Blair and M. Klein,
2004: VEDAS: A mobile distributed data stream mining system for real-
time vehicle monitoring. Proceedings of the 2004 SIAM International
Conference on Data Mining.
[72] Kargupta, H., W. Huang, S. Krishnamoorthy and E. Johnson, 2000:
Distributed clustering using collective principal component analysis.
Knowledge and Information Systems Journal, 3, 422–48.
[73] Kargupta, H., A. Joshi, K. Sivakumar and Y. Yesha, eds., 2004: Data
Mining: Next Generation Challenges and Future Directions. MIT/AAAI
Press.
[134] Yager, R. R., 1996: Database discovery using fuzzy sets. International
Journal of Intelligent Systems, 11, 691–712.
[135] Younis, O. and S. Fahmy, 2004 (to appear): Heed: A hybrid, energy-
efficient, distributed clustering approach for ad-hoc sensor networks.
IEEE Transactions on Mobile Computing, 3.
[136] Zadeh, L. A., 1965: Fuzzy sets. Information and Control , 8, 338–53.
[137] — 1994: Fuzzy logic, neural networks and soft computing. Communica-
tions of the ACM , 37, 77–84.
[138] Zhang, T., R. Ramakrishnan and M. Livny, 1996: Birch: an efficient
data clustering method for very large databases. Proceedings of the 1996
ACM SIGMOD international conference on management of data, ACM
Press, 103–114.
2 Automatic Discovery of Class Hierarchies via Output Space Decomposition

Joydeep Ghosh, Shailesh Kumar and Melba M. Crawford
2.1 Introduction
A classification problem involves identifying a set of objects, each represented
in a suitable common input space, using one or more class labels taken from
a pre-determined set of possible labels. Thus it may be described as a four-
tuple: (I, Ω, PX×Ω , X ), where I is the input space, in which the raw data is
available (e.g. the image of a character), Ω is the output space, comprised of
all the class labels that can be assigned to an input pattern (e.g. the set of
26 alphabetic characters in English), PX×Ω is the unknown joint probability
density function over random variables X ∈ I and Ω ∈ Ω, and X ⊂ I × Ω is
the training set sampled from the distribution PX×Ω . The goal is to determine
the relationship between the input and output spaces, a full specification of
which is given by modeling the joint probability density function PX×Ω .
Complexity in real-world classification problems can arise from multiple
causes. First, the objects (and their representation) may themselves be com-
plex, e.g. XML trees, protein sequences with 3-D folding geometry, variable
length sequences, etc. [18]. Second, the data may be very noisy, the classes
may have significant overlap and the optimal decision boundaries may be
highly nonlinear. In this chapter we concentrate simultaneously on complex-
ity due to high-dimensional inputs and a large number of class labels that
can be potentially assigned to any input. Recognition of characters from the
English alphabet (C = 26 classes) based on a (say) 64 × 64 binary input im-
age and labeling of a piece of land into one of 10–12 land-cover types based
on 100+ dimensional hyperspectral signatures are two examples that exhibit
such complex characteristics.
There are two main approaches to simplifying such problems:
• Feature extraction: A feature extraction process transforms the input
space, I, into a lower-dimensional feature space, F, in which discrimina-
tion among the classes Ω is high. It is particularly helpful given finite
training data in a high-dimensional input space, as it can alleviate fun-
damental problems arising from the curse of dimensionality [2, 15]. Both
domain knowledge and statistical methods can be used for feature extrac-
tion [4, 9, 12, 16, 27, 33]. Feature selection is a specific case of linear
feature extraction [33].
• Modular learning: Based on the divide-and-conquer precept that “learn-
ing a large number of simple local concepts is both easier and more useful
than learning a single complex global concept” [30], a variety of modular
learning architectures have been proposed by the pattern recognition and
computational intelligence communities [28, 36, 47]. In particular, multi-
classifier systems develop a set of M classifiers instead of one, and sub-
sequently combine the individual solutions in a suitable way to address
the overall problem. In several such architectures, each individual classi-
fier addresses a simpler problem. For example, it may specialize in only
part of the feature space as in the mixture of experts framework [26, 41].
Alternatively, a simpler input space may effectively be created per clas-
sifier by sampling/re-weighting (as in bagging and boosting), using one
module for each data source [48]; different feature subsets for different
classes (input decimation) [49], etc. Advantages of modular learning in-
clude the ease and efficiency in learning, scalability, interpretability, and
transparency [1, 21, 36, 38, 42].
This chapter focuses on yet another type of modularity which is possible
for multi-class problems, namely, the decomposition of a C-class problem into
a set of binary problems. Such decompositions have attracted much interest
recently because of the popularity of certain powerful binary classifiers, most
notably the support vector machine (SVM), which was originally formulated for binary classification.
Fig. 2.1. A simple two-level hierarchy for a site with one WATER class and 12
LAND classes divided into seven UPLANDS and five WETLANDS meta-classes.
The land versus water distinction is made by the response in the mid-infrared band
while the distinction between uplands and wetlands is made using the Normalized
Difference Vegetation Index (NDVI).
Figure 2.1 shows an example of such a decomposition for a site in the Bolivar peninsula [7]. In this example, 13 original (base) classes are
first decomposed into two groups, LAND and WATER. WATER and LAND
“meta-classes” can be readily separated based on the pixel responses in the
mid-infrared frequency bands. WATER is one of the 13 base classes, while the
LAND meta-class comprises 12 classes and is thus further partitioned into
UPLANDS and WETLANDS meta-classes comprised of seven and five base
classes respectively. The distinction between the UPLANDS and WETLANDS
is made using the Normalized Difference Vegetation Index (NDVI) [45]. In-
stead of solving a 13-class problem, the hierarchy shown in Figure 2.1 can
be used to first solve a binary problem (separating WATER from LAND),
and then solve another binary problem to separate UPLANDS from WET-
LANDS. Note that both the feature space as well as the output space of the
two problems are different. The seven-class problem of discriminating among
the UPLANDS classes and the five-class problem of discriminating among the
WETLANDS classes can be further addressed in appropriate feature spaces
using appropriate classifiers. Thus, a 13-class problem is decomposed using an
existing hierarchy into simpler classification problems in terms of their output
spaces.
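The routing logic implied by such a hierarchy is simple to sketch. The following is a minimal illustration (not the chapter's implementation): each internal node holds a binary decision over its two child meta-classes, and a sample is passed down until a leaf containing a single base class is reached. The feature names, thresholds, and the collapsing of UPLANDS/WETLANDS into single labels are hypothetical stand-ins for trained node classifiers.

```python
# Minimal sketch of hierarchical output-space decomposition.  Each internal
# node holds a stub binary classifier; thresholds and labels are illustrative.

class Node:
    def __init__(self, classes, classifier=None, children=None):
        self.classes = classes          # meta-class: set of base class labels
        self.classifier = classifier    # maps a sample x to 0 (left) or 1 (right)
        self.children = children or []  # empty for leaf nodes

def classify(node, x):
    """Route x down the hierarchy until a single base class remains."""
    while node.children:
        node = node.children[node.classifier(x)]
    (label,) = node.classes
    return label

# Two-level hierarchy: WATER vs. LAND, then UPLANDS vs. WETLANDS.
water = Node({"WATER"})
uplands = Node({"UPLANDS"})        # stands in for the seven upland base classes
wetlands = Node({"WETLANDS"})      # stands in for the five wetland base classes
land = Node({"UPLANDS", "WETLANDS"},
            classifier=lambda x: 0 if x["ndvi"] > 0.3 else 1,   # illustrative NDVI threshold
            children=[uplands, wetlands])
root = Node({"WATER", "UPLANDS", "WETLANDS"},
            classifier=lambda x: 0 if x["mid_ir"] < 0.1 else 1,  # illustrative mid-IR threshold
            children=[water, land])

print(classify(root, {"mid_ir": 0.05, "ndvi": 0.6}))  # -> WATER
```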
Section 2.2 summarizes existing approaches to solving multi-class problems through output space decomposition. The BHC framework is formally introduced in Section 2.3.
2.2 Background: Solving Multi-Class Problems
2.2.1 One-versus-rest
Also known as round robin classification [17], these approaches learn one classifier for each pair of classes (employing a total of $\binom{C}{2}$ classifiers in the process)
and then combine the outputs of these classifiers in a variety of ways to de-
termine the final class label. This approach has been investigated by several
researchers [14, 23, 39, 46]. Typically the binary classifiers are developed and
examined in parallel, a notable exception being the efficient DAG-structured
ordering given in [39]. A straightforward way of finding the winning class is
through a simple voting scheme used for example in [14], which evaluates
pairwise classification for two versions of CART and for the nearest neighbor
rule. Alternatively, if the individual classifiers provide good estimates of the
two-class posterior probabilities, then these estimates can be combined using
an iterative hill-climbing approach suggested by [23].
Our first attempts at output space decomposition [7, 31] involved applying
a pairwise classifier framework for land-cover prediction problems involving
hyperspectral data. Class-pair-specific feature extraction was used to obtain
superior classification accuracies. It also provided important domain knowl-
edge with regard to what features were more useful for discriminating specific
pairs of classes. While such a modular-learning approach for decomposing a C-class problem is attractive for a number of reasons, including focused feature extraction, interpretability of results and automatic discovery of domain knowledge, the fact that it requires $O(C^2)$ pairwise classifiers might make it less attractive for problems involving a large number of classes. Further, the combiner that integrates the results of all the $\binom{C}{2}$ classifiers must resolve the couplings among these outputs, which may increase with the number of classes.
These approaches impose an ordering among the classes, and the classifiers
are developed in sequence rather than in parallel. For example, one can first
discriminate between class “1” and the rest. Then for data classified as “rest”,
a second classifier is designed to separate class “2” from the other remaining
classes, and so on. Problem decomposition in the output space can also be ac-
complished implicitly by having C classifiers, each trying to solve the complete
C-class problem, but with each classifier using input features most correlated
with only one of the classes. This idea was used in [49] for creating an en-
semble of classifiers, each using different input decimations. This method not
only reduces the correlation among individual classifiers in an ensemble, but
also reduces the dimensionality of the input space for classification problems.
Significant improvements in misclassification error together with reductions
(Figure: an example binary hierarchical classifier tree for a five-class problem. The root meta-class Ω1 = {1, 2, 3, 4, 5} is recursively partitioned into smaller meta-classes, e.g. Ω6 = {1, 4} with leaf nodes Ω12 = {1} and Ω13 = {4}; each internal node n has its own feature extractor ψn and classifier φn, and each leaf node contains a single class.)
The dimensionality of the Fisher projection space for a C-class problem with a D-dimensional input space is min{D, C−1}. At each internal node, the mean and covariance of each of the two meta-classes $\Omega_\alpha$ and $\Omega_\beta$ are estimated as¹
$$\hat\mu^{\rho} = \frac{\sum_{\omega\in\Omega_\rho}\sum_{x\in X_\omega} x}{\sum_{\omega\in\Omega_\rho} |X_\omega|} = \frac{\sum_{\omega\in\Omega_\rho}\hat P(\omega)\,\hat\mu_\omega}{\sum_{\omega\in\Omega_\rho}\hat P(\omega)}, \qquad \rho\in\{\alpha,\beta\}, \qquad (2.4)$$
$$\hat\Sigma^{\rho} = \frac{\sum_{\omega\in\Omega_\rho}\sum_{x\in X_\omega}(x-\hat\mu^\rho)(x-\hat\mu^\rho)^T}{\sum_{\omega\in\Omega_\rho}|X_\omega|} = \frac{\sum_{\omega\in\Omega_\rho}\hat P(\omega)\left[\hat\Sigma_\omega +(\hat\mu^\rho -\hat\mu_\omega)(\hat\mu^\rho -\hat\mu_\omega)^T\right]}{\sum_{\omega\in\Omega_\rho}\hat P(\omega)}. \qquad (2.5)$$
These are combined into the within-class covariance matrix $W_{\alpha,\beta}$ and the $D\times D$, rank 1, between-class covariance matrix $B_{\alpha,\beta} = \hat P(\Omega_\alpha)\hat P(\Omega_\beta)\,(\hat\mu^\alpha - \hat\mu^\beta)(\hat\mu^\alpha - \hat\mu^\beta)^T$. The Fisher projection direction is
$$v_{\alpha\beta} = \arg\max_{v\in\mathbb{R}^{D\times1}} \frac{v^T B_{\alpha,\beta}\, v}{v^T W_{\alpha,\beta}\, v} \;\propto\; W_{\alpha,\beta}^{-1}\left(\hat\mu^\alpha - \hat\mu^\beta\right). \qquad (2.8)$$
Thus, the Fisher(1) feature extractor is $\psi^{(1)}_{fisher}(X|\Omega_\alpha, \Omega_\beta) = v_{\alpha\beta}^T x$, where $x\in\mathbb{R}^{D\times1}$ and $y\in\mathbb{R}$ is a one-dimensional feature. The distance between the two meta-classes $\Omega_\alpha$ and $\Omega_\beta$ is the Fisher(1) discriminant along the Fisher projection $v_{\alpha\beta}$ of Equation (2.8).
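As a rough illustration of Equation (2.8), the sketch below computes a Fisher(1) direction from two sets of samples, using a pooled within-class covariance as a stand-in for $W_{\alpha,\beta}$ (the chapter's exact estimators follow Equations (2.4) and (2.5)); it is not the authors' code.

```python
import numpy as np

def fisher1_direction(X_alpha, X_beta):
    """Fisher(1) projection direction for two (meta-)classes,
    v proportional to W^{-1} (mu_alpha - mu_beta), cf. Equation (2.8).
    W is estimated here as the pooled within-class covariance."""
    mu_a, mu_b = X_alpha.mean(axis=0), X_beta.mean(axis=0)
    Sa = np.cov(X_alpha, rowvar=False, bias=True)
    Sb = np.cov(X_beta, rowvar=False, bias=True)
    n_a, n_b = len(X_alpha), len(X_beta)
    W = (n_a * Sa + n_b * Sb) / (n_a + n_b)       # prior-weighted within-class covariance
    v = np.linalg.solve(W, mu_a - mu_b)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
Xa = rng.normal([0, 0], 1.0, size=(200, 2))
Xb = rng.normal([3, 1], 1.0, size=(200, 2))
v = fisher1_direction(Xa, Xb)
y_a, y_b = Xa @ v, Xb @ v                          # one-dimensional Fisher(1) features
print(v, abs(y_a.mean() - y_b.mean()))             # the projected class means separate well
```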
The basic assumption in Fisher's discriminant is that the two classes are unimodal. Even if this assumption is true for individual classes, it is not true
¹ Substituting estimated parameters for expected ones (e.g. $\hat P \equiv P$, $\hat\mu \equiv \mu$, and $\hat\Sigma \equiv \Sigma$).
Let $\Omega_\alpha$ and $\Omega_\beta$ be the two closest classes (in terms of the Fisher projected distances defined in Sections 2.4.1 and 2.4.2) that are merged to form the meta-class $\Omega_{\alpha\beta} = \mathrm{merge}(\Omega_\alpha, \Omega_\beta)$. The estimated mean vector $\hat\mu_{\alpha\beta}\in\mathbb{R}^{D\times1}$ and covariance matrix of the merged meta-class are computed from the statistics of its constituent classes.
Once the mean and covariance of the new meta-class $\Omega_{\alpha\beta}$ are obtained, its distance from the remaining classes $\Omega_\gamma \in \Pi_K - \{\Omega_\alpha, \Omega_\beta\}$ is computed as follows. The within-class covariance $W_{\alpha\beta,\gamma}$ is updated accordingly, and the between-class covariance $B_{\alpha\beta,\gamma}$ for the Fisher(1) case is given by
$$B_{\alpha\beta,\gamma} = P(\Omega_{\alpha\beta})P(\Omega_\gamma)\left(\mu_{\alpha\beta} - \mu_\gamma\right)\left(\mu_{\alpha\beta} - \mu_\gamma\right)^T = B_{\alpha,\gamma} + B_{\beta,\gamma} - \frac{P(\Omega_\gamma)}{P(\Omega_\alpha)+P(\Omega_\beta)}\, B_{\alpha,\beta}. \qquad (2.15)$$
When partitioning a set of classes into two meta-classes, initially each class
is associated with both the meta-classes. The update of these associations
and meta-class parameters is performed alternately while gradually decreas-
ing the temperature, until a hard partitioning is achieved. The complete Par-
titionNode algorithm which forms the basis of the TD-BHC algorithm is
described in this section.
Let Ω = Ωn be some meta-class at internal node n with K = |Ωn | > 2
classes that needs to be partitioned into two meta-classes, Ωα = Ω2n and
Ωβ = Ω2n+1 . The “association” A = [aω,ρ ] between class ω ∈ Ω and meta-
class Ωρ , (ρ ∈ {α, β}) is interpreted as the posterior probability of ω belong-
ing to Ωρ : P (Ωρ |ω). The completeness constraint of GAMLS [30] implies that
P (Ωα |ω) + P (Ωβ |ω) = 1, ∀ω ∈ Ω.
PartitionNode(Ω)
1. Initialize the associations $\{a_{\omega,\alpha} = P(\Omega_\alpha|\omega),\ \omega\in\Omega\}$ (with $a_{\omega,\beta} = 1 - a_{\omega,\alpha}$):
$$P(\Omega_\alpha|\omega) = \begin{cases} 1 & \text{for some } \omega = \omega^{(1)} \in \Omega, \\ 0.5 & \forall\,\omega \in \Omega - \{\omega^{(1)}\}. \end{cases} \qquad (2.17)$$
where the pdf $p(\psi(x|A)\,|\,\Omega_\rho)$ can be modeled using any distribution function. A single Gaussian per class is used in this chapter.
4. Update the meta-class posteriors by optimizing the Gibbs free energy [30]:
$$a_{\omega,\alpha} = P(\Omega_\alpha|\omega) = \frac{\exp\!\left(L(\omega|\Omega_\alpha)/T\right)}{\exp\!\left(L(\omega|\Omega_\alpha)/T\right) + \exp\!\left(L(\omega|\Omega_\beta)/T\right)}. \qquad (2.19)$$
5. Repeat Steps 2 through 4 until the increase in the Gibbs free energy is insignificant.
6. If $\frac{1}{|\Omega|}\sum_{\omega\in\Omega} H(a_\omega) < \theta_H$ (a user-defined threshold), stop; otherwise:
• Cool the temperature: $T \leftarrow T\,\theta_T$ ($\theta_T < 1$ is a user-defined cooling parameter).
• Go to Step 2.
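A minimal sketch of the annealing loop in Steps 4–6 is given below. The class log-likelihood $L(\omega|\Omega_\rho)$ of Equation (2.18) is not reproduced in this excerpt, so it is passed in as a user-supplied function, and the per-iteration re-estimation of the meta-class parameters and feature extractor (Steps 2–3) is folded into that function; all names are illustrative.

```python
import numpy as np

def anneal_associations(classes, loglik, T=1.0, theta_T=0.8, theta_H=0.01, max_iter=100):
    """Sketch of Steps 4-6 of PartitionNode: soft-max update of the
    associations a_{omega,alpha} = P(Omega_alpha | omega) (Equation 2.19)
    with gradual cooling until the mean entropy falls below theta_H.
    `loglik(omega, rho, a)` stands in for L(omega | Omega_rho) of Equation
    (2.18); Steps 2-3 (feature extraction, density estimation) are assumed
    to happen inside it."""
    a = {w: 0.5 for w in classes}              # Step 1: near-uniform initialization ...
    a[classes[0]] = 1.0                        # ... except one seed class (Equation 2.17)
    for _ in range(max_iter):
        for w in classes:                      # Step 4: update posteriors
            ea = np.exp(loglik(w, "alpha", a) / T)
            eb = np.exp(loglik(w, "beta", a) / T)
            a[w] = ea / (ea + eb)
        # mean binary entropy of the current associations
        H = -np.mean([p * np.log2(p) + (1 - p) * np.log2(1 - p)
                      if 0 < p < 1 else 0.0 for p in a.values()])
        if H < theta_H:                        # Step 6: stop at a near-hard partition
            break
        T *= theta_T                           # cool the temperature
    return a
```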
As the temperature cools sufficiently and the entropy decreases to near zero ($\theta_H = 0.01$ in our implementation), the associations or posterior probabilities $\{P(\Omega_\alpha|\omega),\ \omega\in\Omega\}$ become close to 0 or 1. The meta-class $\Omega = \Omega_n$ is then split accordingly: each class $\omega$ is assigned to the meta-class for which its association is close to 1.
For any set of associations A, the estimates of the meta-class mean vectors $\{\hat\mu^\rho \in \mathbb{R}^{D\times1},\ \rho\in\{\alpha,\beta\}\}$, the covariance matrices $\{\hat\Sigma^\rho \in \mathbb{R}^{D\times D},\ \rho\in\{\alpha,\beta\}\}$, and priors $\{\hat P(\Omega_\rho),\ \rho\in\{\alpha,\beta\}\}$ are updated using the mean vectors $\{\hat\mu_\omega \in \mathbb{R}^{D\times1},\ \omega\in\Omega\}$, covariance matrices $\{\hat\Sigma_\omega \in \mathbb{R}^{D\times D},\ \omega\in\Omega\}$, and class priors $\{\hat P(\omega),\ \omega\in\Omega\}$ of the classes in $\Omega$. Let $X_\omega$ denote the training set comprising $N_\omega = |X_\omega|$ examples of class $\omega$. For any given associations or posterior probabilities $A = \{a_{\omega,\rho} = P(\Omega_\rho|\omega),\ \rho\in\{\alpha,\beta\},\ \omega\in\Omega\}$, the estimate of the mean is computed by $\hat\mu^\rho = \sum_{\omega\in\Omega} P(\omega|\Omega_\rho)\,\hat\mu_\omega$, $\rho\in\{\alpha,\beta\}$. The corresponding covariance is
$$\hat\Sigma^\rho = \sum_{\omega\in\Omega}\frac{P(\omega|\Omega_\rho)}{N_\omega}\sum_{x\in X_\omega}(x - \hat\mu^\rho)(x - \hat\mu^\rho)^T = \sum_{\omega\in\Omega} P(\omega|\Omega_\rho)\left[\hat\Sigma_\omega + (\hat\mu_\omega - \hat\mu^\rho)(\hat\mu_\omega - \hat\mu^\rho)^T\right], \quad \rho\in\{\alpha,\beta\}, \qquad (2.21)$$
and the prior is
$$\hat P(\Omega_\rho) = \frac{1}{\hat P(\Omega)}\sum_{\omega\in\Omega} P(\Omega_\rho|\omega)\,\hat P(\omega), \quad \rho\in\{\alpha,\beta\}. \qquad (2.22)$$
The between-class covariance component is large only when the associations with the respective classes are different. In the limiting case, when the associations become hard, i.e. 0 or 1, Equation (2.23) reduces to Equation (2.9). The
rank of the pairwise between-class covariance matrix is min{D, |Ω| − 1} and
hence the dimensionality of the feature space Fn at internal node n remains
min{D, |Ωn | − 1} as it was in the BU-BHC algorithm. Either Fisher(1) or
Fisher(m) can be used as the feature extractors ψ(X|A) in Step 2 of the
PartitionNode algorithm.
If the original class densities are Gaussian ($G(x|\mu, \Sigma)$), the class density functions in Step 3 of the PartitionNode algorithm in Equation (2.18) for Fisher(1) are
$$p\!\left(\psi^{(1)}_{fisher}(x|A)\,\middle|\,\Omega_\rho\right) = G\!\left(v_{\alpha\beta}^T x \,\middle|\, v_{\alpha\beta}^T\mu^\rho,\ v_{\alpha\beta}^T\Sigma^\rho v_{\alpha\beta}\right), \quad \rho\in\{\alpha,\beta\}, \qquad (2.24)$$
where $v_{\alpha\beta}$ is defined in Equation (2.8). Similarly, the class density functions for the Fisher(m) feature extractor can be defined as multivariate ($m_{\alpha\beta}$-dimensional) Gaussians,
$$p\!\left(\psi^{(m)}_{fisher}(x|A)\,\middle|\,\Omega_\rho\right) = G\!\left(V_{\alpha\beta}^T x \,\middle|\, V_{\alpha\beta}^T\mu^\rho,\ V_{\alpha\beta}^T\Sigma^\rho V_{\alpha\beta}\right), \quad \rho\in\{\alpha,\beta\}, \qquad (2.25)$$
where $V_{\alpha\beta}$ is defined in Equation (2.10).
If a soft classifier is used at each internal node, the results of these hierar-
chically arranged classifiers can be combined by first computing the overall
posteriors {P (ω|x), ω ∈ Ω} and then applying the maximum a posteriori
probability (MAP) rule: ω(x) = arg maxω∈Ω P (ω|x), to assign the class la-
bel ω(x) to x. The posteriors P (ω|x) can be computed by multiplying the
posterior probabilities of all the internal node classifiers on the path to the
corresponding leaf node.
Theorem 1. The posterior probability P(ω|x) for any input x is the product of the posterior probabilities of all the internal classifiers along the unique path from the root node to the leaf node n(ω) containing the class ω, i.e.
$$P(\omega|x) = \prod_{\ell=0}^{D(\omega)-1} P\!\left(\Omega^{(\ell+1)}_{n(\omega)} \,\middle|\, x,\ \Omega^{(\ell)}_{n(\omega)}\right), \qquad (2.27)$$
where D(ω) is the depth of n(ω) (the depth of the root node is 0) and $\Omega^{(\ell)}_{n(\omega)}$ is the meta-class at depth $\ell$ on the path from the root node to n(ω), such that $\Omega^{(D(\omega))}_{n(\omega)} = \{\omega\}$ and $\Omega^{(0)}_{n(\omega)} = \Omega_1$ is the root node. (See [32] for the proof.)
Remark 1. The posterior probabilities $P_n(\Omega_k|x, \Omega_n)$, $k \in \{2n, 2n+1\}$, are related to the overall posterior probabilities $\{P(\omega|x),\ \omega\in\Omega\}$ as follows:
$$P_n(\Omega_k|x, \Omega_n) = \frac{\sum_{\omega\in\Omega_k} P(\omega|x)}{\sum_{\omega\in\Omega_n} P(\omega|x)}, \qquad k\in\{2n, 2n+1\}. \qquad (2.28)$$
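For instance, under the two-level hierarchy of Figure 2.1, the soft combiner amounts to multiplying node posteriors along each root-to-leaf path and taking the maximum. The sketch below uses made-up node posteriors for a single hypothetical input, treating the upland and wetland meta-classes as single labels for brevity.

```python
from math import prod

# Hypothetical internal-node posteriors for one input x under the Figure 2.1
# hierarchy; the numbers are made up for illustration.
paths = {
    "WATER":    [0.90],            # P(WATER | x, root)
    "UPLANDS":  [0.10, 0.70],      # P(LAND | x, root) * P(UPLANDS | x, LAND)
    "WETLANDS": [0.10, 0.30],      # P(LAND | x, root) * P(WETLANDS | x, LAND)
}

posteriors = {c: prod(p) for c, p in paths.items()}   # product along each path (Theorem 1)
label = max(posteriors, key=posteriors.get)           # MAP rule
print(posteriors, label)                              # WATER wins with posterior 0.90
```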
2.7 Experiments
Both the BU-BHC and TD-BHC algorithms are evaluated in this section on public-domain data sets available from the UCI repository [35] and the National Institute of Standards and Technology (NIST), and on two additional hyperspectral data sets. The classification accuracies of eight different combinations of the BHC classifiers (bottom-up vs top-down, Fisher(1) vs Fisher(m) feature extractor, and soft vs hard combiners) are compared with multilayered perceptron-based and maximum likelihood classifiers. The class hierarchies automatically discovered by BU-BHC and TD-BHC are shown for some of the data sets to provide concrete examples of the domain knowledge discovered by the BHC algorithms.
Table 2.1. The twelve classes in the AVIRIS/KSC hyperspectral data set
Num Class Name
Upland Classes
1 Scrub
2 Willow Swamp
3 Cabbage palm hammock
4 Cabbage oak hammock
5 Slash pine
6 Broad leaf/oak hammock
7 Hardwood swamp
Wetland Classes
8 Graminoid marsh
9 Spartina marsh
10 Cattail marsh
11 Salt marsh
12 Mud flats
Table 2.3. Classification accuracies on public-domain data sets from the UCI repository [35] (satimage, digits, letter-i) and NIST (letter-ii), and remote-sensing data sets from the Center for Space Research, The University of Texas at Austin (hymap, aviris). The input dimensions and number of classes are also indicated for each data set.
satimage digits letter-I letter-II hymap aviris
Dimensions 36 64 16 30 126 183
Classes 6 10 26 26 9 12
MLP 79.77 82.33 79.28 76.24 78.21 74.54
MLC 77.14 74.85 82.73 79.48 82.73 72.66
BU-BHC(1,H) 83.26 88.87 71.29 78.45 95.18 94.97
BU-BHC(1,S) 84.48 89.00 72.81 79.93 95.62 95.31
BU-BHC(m,H) 85.29 91.71 76.55 80.94 95.12 95.51
BU-BHC(m,S) 85.35 91.95 78.41 81.11 95.43 95.83
TD-BHC(1,H) 83.77 90.11 70.45 74.59 95.31 96.33
TD-BHC(1,S) 84.02 90.24 72.71 75.83 95.95 97.09
TD-BHC(m,H) 84.70 91.44 77.85 81.48 96.48 97.15
TD-BHC(m,S) 84.95 91.61 79.13 81.99 96.64 97.93
The finely tuned MLP classifiers and the MLC classifiers are used as bench-
marks for evaluating the BHC algorithms. Almost all the BHC versions per-
formed significantly better than the MLP and MLC classifiers on all data sets
except LETTER-I and LETTER-II. In general the TD-BHC was slightly bet-
ter than the BU-BHC mainly because its global bias leads to less greedy trees
than the BU-BHC algorithm. Further, the Fisher(m) feature extractor con-
sistently yields slightly better results than the Fisher(1) feature extractor,
as expected. Finally, the soft combiner also performed slightly better than the hard combiner. This again is an expected result, as the hard combiner loses some information when it thresholds the posteriors at each internal node.
2.8 Domain Knowledge Discovery
• IRIS: It is well known that Iris Versicolour and Virginica are “closer” to
each other than Iris Setosa. So, not surprisingly, the first split for both BU-
BHC(m) and TD-BHC(m) algorithms invariably separates Setosa from the
other two classes.
• SATIMAGE: Figures 2.3 and 2.4 show the BU-BHC(m) and TD-BHC(m)
trees generated for the SATIMAGE data set. In the BU-BHC tree, Classes 4 (damp gray soil) and 6 (very damp gray soil) merged first. This was followed by Class 3 (gray soil) merging into the meta-class (4, 6). The
right child of the root node contains the remaining three classes out of
which the vegetation classes i.e. Class 2 (cotton crop) and Class 5 (soil
with vegetation stubble) were grouped first. The tree formed in the TD-
BHC is even more informative as it separates the four bare soil classes from
the two vegetation classes at the root node and then separates the four
soil classes into red-soil (Class 1) and gray-soil (Classes 3, 4, and 6) meta-
classes. The gray-soil meta-class is further partitioned into damp-gray-soil
(Classes 4 and 6) and regular-gray-soil (Class 3). Thus reasonable class hi-
erarchies are discovered by the BHC framework for the SATIMAGE data
set.
Fig. 2.3. BU-BHC(m) class hierarchy for the satimage data set.
Fig. 2.4. TD-BHC(m) class hierarchy for the satimage data set.
Fig. 2.5. BU-BHC(m) class hierarchy for the letter-I data set.
Fig. 2.6. TD-BHC(m) class hierarchy for the letter-I data set.
The TD-BHC classifier for the LETTER-II data set resulted in a few new groupings as well, including {O,Q}, {H,K,A,R} and {P,D}.
• Hyperspectral data: Figures 2.7, 2.8, 2.9 and 2.10 show the bottom-up and top-down trees obtained for AVIRIS and HYMAP. By considering the meaning of the class labels it is evident that this domain provided the most useful knowledge. Invariably, when water was present, it was the first class to be separated.
Fig. 2.7. BU-BHC(m) class hierarchy for the AVIRIS data set.
Fig. 2.8. TD-BHC(m) class hierarchy for the AVIRIS data set.
Fig. 2.9. BU-BHC(m) class hierarchy for the HYMAP data set.
Fig. 2.10. TD-BHC(m) class hierarchy for the HYMAP data set.
2.9 Conclusions
This chapter presented a general framework for certain difficult classification
problems in which the complexity is primarily due to having several classes
as well as high-dimensional inputs. The BHC methodology relies on progres-
sively partitioning or grouping the set of classes based on their affinities with
one another. The BHC, as originally conceived, uses a custom Fisher’s dis-
criminant feature extraction for each partition, which is quite fast as it only
involves summary class statistics. Moreover, as a result of the tree building
algorithms, a class taxonomy is automatically discovered from data, which of-
ten leads to useful domain knowledge. This property was particularly helpful
in our analysis of hyperspectral data.
The hierarchical BHC approach is helpful only if some class affinities are
actually present, i.e. it will not be appropriate if all the classes are essen-
tially “equidistant” from one another. In practice, this is not very restrictive
since many applications involving multiple class labels, such as those based
on biological or text data, do have natural class affinities, quite often reflected
in class hierarchies or taxonomies. In fact it has been shown that exploiting
a known hierarchy of text categories substantially improves text classifica-
tion [5]. In contrast, the BHC attempts to induce a hierarchy directly from
the data where no pre-existing hierarchy is available. Another recent approach
with a similar purpose is presented in [19] where Naive Bayes is first used to
quickly generate a confusion matrix for a text corpus. The classes are then
clustered based on this matrix such that classes that are more confused with
one another tend to be placed in the same group. Then SVMs are used in
a “one-versus-all” framework within each group of classes to come up with
the final result. Thus this approach produces a two-level hierarchy of classes.
On text benchmarks, this method was three to six times faster than using
“one-vs-all” SVMs directly, while producing comparable or better classifica-
tion results.
We note that one need not be restricted to our choices of a Fisher dis-
criminant and a simple Bayesian classifier at each internal node of the class-
partitioning tree. In Section 2.7.3, we summarized our related work on using
SVMs as the internal classifiers on a tree obtained via the Fisher discrimi-
nant/Bayesian classifier combination. The feature extraction step itself can
also be customized for different domains such as image or protein sequence
classification. In this context, recollect that the trees obtained for a given
problem can vary somewhat depending on the specific training set or classi-
fier design, indicative of the fact that there are often multiple reasonable
ways of grouping the classes. The use of more powerful binary classifiers pro-
vides an added advantage in that the overall results are more tolerant to the
quality of the tree that is obtained.
The design space for selecting an appropriate feature extractor–classifier
combination is truly rich and needs to be explored further. A well-known
trade-off exists between these two functions. For example, a complex feature
extraction technique can compensate for a simple classifier. With this view-
point, let us compare the top-down BHC with decision trees such as C5.0,
CART and CHAID. One can view the action at each internal node of a de-
cision tree as the selection of a specific value of exactly one variable (feature
extraction stage), followed by a simple classifier that merely performs a comparison against this value. Thus the BHC node seems more complex.
However, the demands on a single node in a decision tree are not that strong,
since samples from the same class can be routed to different branches of the
tree and still be identified correctly at later stages. In contrast, in the hard
version of BHC, all the examples of a given class have to be routed to the
same child at each internal node visited by them.
References
[1] Ballard, D., 1987: Modular learning in neural networks. Proc. AAAI-87 ,
279–84.
[2] Bellman, R. E., ed., 1961: Adaptive Control Processes. Princeton Univer-
sity Press.
[3] Breiman, L., J. H. Friedman, R. Olshen and C. J. Stone, 1984: Clas-
sification and Regression Trees. Wadsworth and Brooks, Pacific Grove,
California.
[4] Brill, F. Z., D. E. Brown and W. N. Martin, 1992: Fast genetic selection
of features for neural network classifiers. IEEE Transactions on Neural
Networks, 3, 324–28.
[5] Chakrabarti, S., B. Dom, R. Agrawal and P. Raghavan, 1998: Scalable
feature selection, classification and signature generation for organizing
large text databases into hierarchical topic taxonomies. VLDB Journal ,
7, 163–78.
[6] Chakravarthy, S., J. Ghosh, L. Deuser and S. Beck, 1991: Efficient train-
ing procedures for adaptive kernel classifiers. Neural Networks for Signal
Processing, IEEE Press, 21–9.
[7] Crawford, M. M., S. Kumar, M. R. Ricard, J. C. Gibeaut and A. Neuensh-
wander, 1999: Fusion of airborne polarimetric and interferometric SAR
for classification of coastal environments. IEEE Transactions on Geo-
science and Remote Sensing, 37, 1306–15.
[8] Dattatreya, G. R. and L. N. Kanal, 1985: Decision trees in pattern recog-
nition. Progress in Pattern Recognition 2 , L. N. Kanal and A. Rosenfeld,
eds., Elsevier Science, 189–239.
3.1 Introduction
Much of current data-mining research focuses on algorithms to discover sets
of attributes that can discriminate data entities into classes, such as shop-
ping or banking trends for a particular demographic group. In contrast, we
are developing data-mining techniques to discover patterns consisting of com-
plex relationships between entities. The field of relational data mining, of
which graph-based relational learning is a part, is a new area investigating
approaches to mining relational information by finding associations involving
multiple tables in a relational database.
Two main approaches have been developed for mining relational infor-
mation: logic-based approaches and graph-based approaches. Logic-based ap-
proaches fall under the area of inductive logic programming (ILP) [16]. ILP
embodies a number of techniques for inducing a logical theory to describe
the data, and many techniques have been adapted to relational data mining
[6]. Graph-based approaches differ from logic-based approaches to relational
mining in several ways, the most obvious of which is the underlying represen-
tation. Furthermore, logic-based approaches rely on the prior identification
of the predicate or predicates to be mined, while graph-based approaches are
more data-driven, identifying any portion of the graph that has high support.
However, logic-based approaches allow the expression of more complicated patterns.
Graph-based data mining (GDM) is the task of finding novel, useful, and
understandable graph-theoretic patterns in a graph representation of data.
Several approaches to GDM exist, based on the task of identifying frequently
occurring subgraphs in graph transactions, i.e., those subgraphs meeting a
minimum level of support. Kuramochi and Karypis [15] developed the FSG
system for finding all frequent subgraphs in large graph databases. FSG starts
by finding all frequent single and double edge subgraphs. Then, in each itera-
tion, it generates candidate subgraphs by expanding the subgraphs found in
the previous iteration by one edge. In each iteration the algorithm checks how
many times the candidate subgraph occurs within an entire graph. The candi-
dates whose frequency is below a user-defined level are pruned. The algorithm
returns all subgraphs occurring more frequently than the given level.
Yan and Han [19] introduced gSpan, which combines depth-first search
and lexicographic ordering to find frequent subgraphs. Their algorithm starts
from all frequent one-edge graphs. The labels on these edges, together with
labels on incident vertices, define a code for every such graph. Expansion of
these one-edge graphs maps them to longer codes. The codes are stored in a
tree structure such that if α = (a0 , a1 , ..., am ) and β = (a0 , a1 , ..., am , b), the
β code is a child of the α code. Since every graph can map to many codes, the
codes in the tree structure are not unique. If there are two codes in the code
tree that map to the same graph and one is smaller than the other, the branch
with the smaller code is pruned during the depth-first search traversal of the
code tree. Only the minimum code uniquely defines the graph. Code ordering
and pruning reduces the cost of matching frequent subgraphs in gSpan.
Inokuchi et al. [12] developed the Apriori-based Graph Mining (AGM)
system, which uses an approach similar to Agrawal and Srikant’s [2] Apriori
algorithm for discovering frequent itemsets. AGM searches the space of fre-
quent subgraphs in a bottom-up fashion, beginning with a single vertex, and
then continually expanding by a single vertex and one or more edges. AGM
also employs a canonical coding of graphs in order to support fast subgraph
matching. AGM returns association rules satisfying user-specified levels of
support and confidence.
We distinguish graph-based relational learning (GBRL) from graph-based
data mining in that GBRL focuses on identifying novel, but not necessarily
the most frequent, patterns in a graph representation of data [10]. Only a few
GBRL approaches have been developed to date. Subdue [4] and GBI [20] take
a greedy approach to finding subgraphs, maximizing an information theoretic
measure. Subdue searches the space of subgraphs by extending candidate sub-
graphs by one edge. Each candidate is evaluated using a minimum description
length metric [17], which measures how well the subgraph compresses the in-
put graph if each instance of the subgraph were replaced by a single vertex.
GBI continually compresses the input graph by identifying frequent triples
of vertices, some of which may represent previously-compressed portions of
the input graph. Candidate triples are evaluated using a measure similar to
information gain. Kernel-based methods have also been used for supervised
GBRL [14].
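As a rough sketch of the MDL idea behind this evaluation (the exact encoding used by Subdue is not reproduced in this excerpt), a candidate substructure can be scored by how well it compresses the input graph; following the inverse-compression convention used later in the chapter, larger values are better and the description lengths below are made up.

```python
def compression_value(dl_S, dl_G_given_S, dl_G):
    """MDL-style value of a substructure S: how well S compresses the input
    graph G.  Larger values mean better compression (inverse-compression
    convention); the description-length computation itself is not shown."""
    return dl_G / (dl_S + dl_G_given_S)

# Illustrative description lengths in bits: (DL(S), DL(G|S), DL(G)).
candidates = {"S1": (15.0, 96.6, 117.0), "S2": (22.0, 110.0, 117.0)}
best = max(candidates, key=lambda s: compression_value(*candidates[s]))
print(best)  # S1 compresses the graph more and is preferred
```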
¹ Subdue source code, sample data sets and publications are available at ailab.uta.edu/subdue.
Subdue takes as input a labeled graph (possibly with more than one edge between vertices $v_i$ and $v_j$). The input graph need not be
connected, but the learned patterns must be connected subgraphs (called sub-
structures) of the input graph. The input to Subdue can consist of one large
graph or several individual graph transactions and, in the case of supervised
learning, the individual graphs are classified as positive or negative examples.
(Figure: a small example graph in which an object of shape triangle is on an object of shape square; instances of the discovered substructure are replaced by vertices labeled S1.)
Given the ability to find a prevalent subgraph pattern in a larger graph and
then compress the graph with this pattern, iterating over this process until
the graph can no longer be compressed will produce a hierarchical, conceptual
clustering of the input data. On the ith iteration, the best subgraph Si is used
to compress the input graph, introducing new vertices labeled Si in the graph
input to the next iteration. Therefore, any subsequently-discovered subgraph
Sj can be defined in terms of one or more Si , where i < j. The result is a
lattice, where each cluster can be defined in terms of more than one parent
subgraph. For example, Figure 3.2 shows such a clustering done on a portion
of DNA. See [13] for more information on graph-based clustering.
(Figure 3.2: a portion of a DNA molecule, showing adenine–thymine and guanine–cytosine base pairs attached to the sugar–phosphate backbone, used as input for graph-based hierarchical clustering.)
Fig. 3.3. Graph-based supervised learning example with (a) four positive and four
negative examples, (b) one possible graph concept and (c) another possible graph
concept.
Several existing approaches deal with this incremental mining problem, but they restrict the problem to itemset data and assume the data arrives in complete and independent units [1, 7, 11].
Fig. 3.4. Incremental data can be viewed as a unique extension to the accumulated
graph.
Storing all accumulated data and continuing to periodically repeat the entire
structure discovery process is intractable both from a computational perspec-
tive and for data storage purposes. Instead, we wish to devise a method by
which we can discover structures from the most recent data increment and
simultaneously refine our knowledge of the globally-best substructures dis-
covered so far. However, we can often encounter a situation where sequential
applications of Subdue to individual data increments will yield a series of
locally-best substructures that are not the globally-best substructures that would be found if the data were evaluated as one aggregate block.
Figure 3.5 illustrates an example where Subdue is applied sequentially to
each data increment as it is received. At each increment, Subdue discovers
the best substructure for the respective data increment, which turns out to be
only a local best. However, if we aggregate the same data, as depicted in Fig-
ure 3.6, and then apply the baseline Subdue algorithm we get a different best
substructure, which in fact is globally best. This is illustrated in Figure 3.7.
Although our simple example could easily be aggregated at each time step,
realistically large data sets would be too unwieldy for this approach.
In general, sequential discovery and action brings with it a set of unique
challenges, which are generally driven by the underlying system that is gen-
erating the data. One problem that is almost always a concern is how to re-
evaluate the accumulated data at each time step in the light of newly-added
data. There is a tradeoff between the amount of data that can be stored and
re-evaluated, and the quality of the result. A summarization technique is of-
ten employed to capture salient metrics about the data. The richness of this
summarization is a tradeoff between the speed of the incremental evaluation
and the range of new substructures that can be considered.
Fig. 3.5. Three data increments received serially and processed individually by
Subdue. The best substructure is shown for each local increment.
length of the substructure $S_i$ under consideration. The term $\sum_{j=1}^{m} DL(G_j|S_i)$ represents the description length of the accumulated graph after it is compressed by substructure $S_i$. Finally, the term $\sum_{j=1}^{m} DL(G_j)$ represents the full description length of the accumulated graph. I-Subdue can then re-evaluate substructures using Equation (3.3) (an inverse of Equation (3.2)), choosing the one with the highest value as globally best:
$$\arg\max_i\ \frac{\sum_{j=1}^{m} DL(G_j)}{DL(S_i) + \sum_{j=1}^{m} DL(G_j|S_i)} \qquad (3.3)$$
Fig. 3.7. Result from applying Subdue to the three aggregated data increments.
Fig. 3.8. The top n=3 substructures from each local increment.
Table 3.2. Using I-Subdue to calculate the global value of each substructure.
Each substructure has a value for each iteration. The values in Table 3.1 are the result
of the compression evaluation metric from Equation (3.1). The locally-best
substructures illustrated in Figure 3.5 have the highest values overall.
Table 3.2 depicts our application of I-Subdue to the increments from Fig-
ure 3.5. After each increment is received, we apply Equation (3.3) to select
the globally-best substructure. The values in Table 3.2 are the inverse of
the compression metric from Equation (3.2). As an example, the calcula-
tion of the compression metric for substructure $S_{12}$ after iteration 3 would be $\frac{DL(S_{12}) + DL(G_1|S_{12}) + DL(G_2|S_{12}) + DL(G_3|S_{12})}{DL(G_1) + DL(G_2) + DL(G_3)}$. Consequently the value of $S_{12}$ would be (117 + 117 + 116) / (15 + 96.63 + 96.63 + 96.74) = 1.1474.
For this computation, we rely on the metrics computed by Subdue when it
evaluates substructures in a graph, namely the description length of the dis-
covered substructure, the description length of the graph compressed by the
substructure, and the description length of the graph. By storing these values
after each increment is processed, we can retrieve the globally-best substruc-
ture using Equation (3.3). In circumstances where a specific substructure is
not present in a particular data increment, such as S31 in iteration 2, then
DL(G2 |S31 ) = DL(G2 ) and the substructure’s value would be calculated as
(117 + 117 + 116) / (15 + 117 + 117 + 85.76) = 1.0455.
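The bookkeeping behind this re-evaluation can be sketched as follows, using the description lengths quoted above for S12 and S31 after three increments; the stored per-increment values and Equation (3.3) are all that is needed to recover the globally best substructure.

```python
# Sketch of I-Subdue's global re-evaluation (Equation 3.3) using the
# description lengths quoted in the text.  When a substructure is absent
# from increment j, DL(G_j | S_i) = DL(G_j).

dl_G = [117.0, 117.0, 116.0]                       # DL(G_j) for each increment
stored = {
    "S12": {"dl_S": 15.0, "dl_G_given_S": [96.63, 96.63, 96.74]},
    "S31": {"dl_S": 15.0, "dl_G_given_S": [117.0, 117.0, 85.76]},
}

def global_value(entry):
    """Inverse compression over all increments seen so far (Equation 3.3)."""
    return sum(dl_G) / (entry["dl_S"] + sum(entry["dl_G_given_S"]))

values = {s: global_value(e) for s, e in stored.items()}
best = max(values, key=values.get)
print(best, values)   # S12 (about 1.147) beats S31 (about 1.046), cf. the quoted values
```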
The experiments in this section use an artificial data generator. This data generator takes as input a library of data labels,
configuration parameters governing the size of random graph patterns and
one or more specific substructures to be embedded within the random data.
Connectivity can also be controlled.
(Plot comparing I-Subdue and Subdue over 10–50 increments; the x-axis also lists the cumulative vertex and edge counts at each benchmark, and the embedded four-vertex substructure A–B–C–D is shown alongside.)
Fig. 3.9. Comparison of I-Subdue with Subdue on 10–50 increments, each with 220 new vertices and 0 or 1 outgoing edges.
For the first experiment, illustrated in Figure 3.9, we compare the per-
formance of I-Subdue to Subdue at benchmarks ranging from 10 to 50 in-
crements. Each increment introduced 220 new vertices, within which five in-
stances of the four-vertex substructure pictured in Figure 3.9 were embedded.
The quality of the result, in terms of the number of discovered instances, was
the same.
The results from the second graph are depicted in Figure 3.10. For this
experiment, we increased the increment size to 1020 vertices. Each degree
value between 1 and 4 was chosen with 25% probability, which means that on
average there are about twice as many edges as vertices. This more densely
connected graph begins to illustrate the significance of the run-time difference
between I-Subdue and Subdue. Again, five instances of the four-vertex sub-
structure shown in Figure 3.10 were embedded within each increment. The
discovery results were the same for both I-Subdue and Subdue with the only
qualitative difference being in the run time.
(Plot comparing I-Subdue and Subdue over 10–50 increments; the x-axis also lists the cumulative vertex and edge counts at each benchmark, and the embedded four-vertex substructure A–B–C–D is shown alongside.)
Fig. 3.10. Comparison of I-Subdue with Subdue on 10–50 increments, each with 1020 new vertices and 1 to 4 outgoing edges.
For example, adding information such as credit history to our social network can enhance the input data, but such information may be
acquired at a price in terms of money, time, or other resources. To implement
the cost feature, the cost of specific vertices and edges is specified in the input
file. The cost for substructure S averaged over all of its instances, Cost(S),
is then combined with the MDL value of S using the equation E(S) =
(1 − Cost(S)) × M DL(S). The evaluation measure, E(S), determines the
overall value of the substructure and is used to order candidate substructures.
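A minimal sketch of this cost-sensitive ordering, with made-up MDL values and acquisition costs, is:

```python
def evaluate(mdl_value, avg_cost):
    """Cost-sensitive substructure value from the text:
    E(S) = (1 - Cost(S)) * MDL(S), where Cost(S) is the cost of S averaged
    over all of its instances (assumed to lie in [0, 1])."""
    return (1.0 - avg_cost) * mdl_value

# Hypothetical candidates: (MDL value, average acquisition cost).
candidates = {"cheap_pattern": (1.10, 0.05), "costly_pattern": (1.30, 0.40)}
ranked = sorted(candidates, key=lambda s: evaluate(*candidates[s]), reverse=True)
print(ranked)   # the cheap pattern wins: 1.045 > 0.78
```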
Class membership in a supervised graph can now be treated as a cost,
which varies from no cost for clearly positive members to +1 for clearly neg-
ative members. As an example, we consider the problem of learning which
regions of the ocean surface can expect a temperature increase in the next
time step. Our data set contains gridded sea surface temperatures (SST) de-
rived from NASA’s Pathfinder algorithm and a five-channel Advanced Very
High Resolution Radiometer instrument. The data contains location, time of
year, and temperature data for each region of the globe.
The portion of the data used for training is represented as a graph with
vertices for each month, discretized latitude and longitude values, hemisphere,
and change in temperature from one month to the next. Vertices labelled with
“increase” thus represent the positive examples and “decrease” or “same” la-
bels represent negative examples. A portion of the graph is shown in Fig-
ure 3.11. The primary substructure discovered by Subdue for this data set
reports the rule that when there are two regions in the Southern hemisphere,
one just north of the other, an increase in temperature can be expected for
the next month in the southernmost of the two regions. Using three-fold cross
validation experimentation, Subdue classified this data set with 71% accuracy.
(Figure 3.11: a portion of the sea surface temperature graph, with vertices for the month, temperature, hemisphere and location values and the DeltaNextMonth labels INCREASE, DECREASE and SAME.)
3.6 Conclusions
There are several future directions for our graph-based relational learning
research that will improve our ability to handle such challenging data as de-
scribed in this chapter. The incremental discovery technique described in this
chapter did not address data that is connected across increment boundaries.
However, many domains will include event correlations that transcend mul-
tiple data iterations. For example, a terrorist suspect introduced in one data
increment may be correlated to events that are introduced in later incre-
ments. As each data increment is received it may contain new edges that
extend from vertices in the new data increment to vertices received in pre-
vious increments. We are investigating techniques of growing substructures
across increment boundaries. We are also considering methods of detecting
changes in the strengths of substructures across increment boundaries, which could represent concept shift or drift.
The handling of supervised graphs is an important direction for mining
structural data. To extend our current work, we would like to handle embed-
ded instances without a single representative instance node (the “increase”
and “decrease” nodes in our NASA example) and instances that may possibly
overlap.
Finally, improved scalability of graph operations is necessary to learn pat-
terns, evaluate their accuracy on test cases and, ultimately, to use the patterns
to find matches in future intelligence data. The graph and subgraph isomor-
phism operations are a significant bottleneck to these capabilities. We need
to develop faster and approximate versions of these operations to improve the
scalability of graph-based relational learning.
References
[1] Agrawal, R. and G. Psaila, 1995: Active data mining. Proceedings of the
Conference on Knowledge Discovery in Databases and Data Mining.
[2] Agrawal, R. and R. Srikant, 1994: Fast algorithms for mining association
rules. Proceedings of the Twentieth Conference on Very Large Databases,
487–99.
Thomas Gärtner
Summary. Graphs are a major tool for modeling objects with complex data struc-
tures. Devising learning algorithms that are able to handle graph representations
is thus a core issue in knowledge discovery with complex data. While a significant
amount of recent research has been devoted to inducing functions on the vertices of
the graph, we concentrate on the task of inducing a function on the set of graphs.
Application areas of such learning algorithms range from computer vision to biology
and beyond. Here, we present a number of results on extending kernel methods to
complex data, in general, and graph representations, in particular. With the very
good performance of kernel methods on data that can easily be embedded in a Eu-
clidean space, kernel methods have the potential to overcome some of the major
weaknesses of previous approaches to learning from complex data. In order to apply
kernel methods to graph data, we propose two different kernel functions and compare
them on a relational reinforcement learning problem and a molecule classification
problem.
4.1 Introduction
Graphs are an important tool for modeling complex data in a systematic way.
Technically, different types of graphs can be used to model the objects. Con-
ceptually, different aspects of the objects can be modeled by graphs: (i) Each
object is a vertex in a graph modeling the relation between the objects, and
(ii) each object is modeled by a graph. While a significant amount of recent
research is devoted to case i, here we are concerned with case ii. An important
example for this case is the prediction of biological activity of molecules given
their chemical structure graph.
Suppose we know of a function that estimates the effectiveness of chemical
compounds against a particular illness. This function would be very helpful in
developing new drugs. One possibility for obtaining such a function is to use
in-depth chemical knowledge. A different – and for us more interesting – pos-
sibility is to try to learn from chemical compounds with known effectiveness
against that illness. We will call these compounds “training instances”. Super-
vised machine learning tries to find a function that generalizes over these
training instances, i.e., a function that is able to estimate the effectiveness of
other chemical compounds against this disease. We will call this function the
“hypothesis” and the set of all functions considered as possible hypotheses,
the “hypothesis space”.
Though chemical compounds are three-dimensional structures, the three-
dimensional shape is often determined by the chemical structure graph. That
is, the representation of a molecule by a set of atoms, a set of bonds con-
necting pairs of atoms, and a mapping from atoms to element-types (carbon,
hydrogen, ...) as well as from bonds to bond-types (single, double, aromatic,
...). Standard machine learning algorithms cannot be applied to such a rep-
resentation.
Predictive graph mining is interested in supervised machine learning prob-
lems with graph-based representations. This is an emerging research topic at
the heart of knowledge discovery from complex data. In contrast with other
graph mining approaches it is not primarily concerned with finding interesting
or frequent patterns in a graph database but only with supervised machine
learning, i.e., with inducing a function on the set of all graphs that approxi-
mates well some unknown functional or conditional dependence. In the above
mentioned application this would be effectiveness against an illness depending
on the chemical structure of a compound.
Kernel methods are a class of learning algorithms that can be applied to
any learning problem as long as a positive-definite kernel function has been
defined on the set of instances. The hypothesis space of kernel methods is
the linear hull (i.e., the set of linear combinations) of positive-definite kernel
functions “centered” at some training instances. Kernel methods have shown
good predictive performance on many learning problems, such as text classi-
fication. In order to apply kernel methods to instances represented by graphs,
we need to define meaningful and efficiently computable positive-definite ker-
nel functions on graphs.
In this article we describe two different kernels for labeled graphs together
with applications to relational reinforcement learning and molecule classifica-
tion. The first graph kernel is based on comparing the label sequences corre-
sponding to walks occurring in each graph. Although these walks may have
infinite length, for undirected graphs, such as molecules, this kernel function
can be computed in polynomial time by using properties of the direct product
graph and computing the limit of a power series. In the molecule classifica-
tion domain that we will look at, however, exact computation of this kernel
function is infeasible and we need to resort to approximations. This motivates
the search for other graph kernels that can be computed more efficiently on
this domain. We thus propose a graph kernel based on the decomposition of
each graph into a set of simple cycles and into the set of connected compo-
nents of the graph induced by the set of bridges in the graph. Each of these
cycles and trees is transformed into a pattern and the cyclic-pattern kernel
for graphs is the cardinality of the intersection of two pattern sets. Although
cyclic-pattern kernels cannot be computed in polynomial time, empirical re-
sults on a molecule classification problem show that, while walk-based graph
kernels exhibit higher predictive performance, cyclic-pattern kernels can be
computed much faster. Both kernels perform better than, or at least as good
as, previously proposed predictive graph mining approaches over different sub-
problems and parameter settings.
Section 4.2 introduces kernel methods and kernels for structured instance spaces, and discusses the relation between kernels and distances for structured
instances. Section 4.3 begins with the introduction of general set kernels and
conceptually describes kernels for other data structures afterwards. Section
4.4 describes walk-based graph kernels and cyclic-pattern kernels for graphs.
Two applications of predictive graph mining are shown in Section 4.5, before
Section 4.6 concludes.
4.2 Learning with Kernels and Distances
Kernel methods [41] are a popular class of algorithms within the machine-learning and data-mining communities. Being theoretically well founded in statistical learning theory, they have shown good empirical results in many applications. One particular aspect of kernel methods such as the support vector machine is the formation of hypotheses by linear combination of positive-definite kernel functions "centered" at individual training instances. By the restriction to positive-definite kernel functions, the regularized risk minimization problem (we will define this problem once we have defined positive-definite functions) becomes convex and every locally optimal solution is globally optimal.
Kernel Functions
Kernel methods can be applied to different kinds of (structured) data by using
any positive-definite kernel function defined on the data.
A symmetric function $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ on a set $\mathcal{X}$ is called a positive-definite kernel on that set if, for all $n \in \mathbb{Z}^+$, $x_1, \ldots, x_n \in \mathcal{X}$, and $c_1, \ldots, c_n \in \mathbb{R}$, it follows that
$$\sum_{i,j\in\{1,\ldots,n\}} c_i c_j\, k(x_i, x_j) \geq 0.$$
Kernel Machines
The usual supervised learning model [44] considers a set X of individuals and
a set Y of labels, such that the relation between individuals and labels is a
fixed but unknown probability measure on the set X × Y. The common theme
in many different kernel methods such as support vector machines, Gaussian
processes, or regularized least squares regression is to find a hypothesis func-
tion that minimizes not just the empirical risk (the training error) but also
the regularized risk. This gives rise to the optimization problem
$$\min_{f(\cdot)\in\mathcal{H}}\ \frac{C}{n}\sum_{i=1}^{n} V\!\left(y_i, f(x_i)\right) + \|f(\cdot)\|_{\mathcal{H}}^2.$$
For regularized least squares regression, the loss function $V$ is the squared loss, and the problem becomes
$$\min_{f(\cdot)\in\mathcal{H}}\ \frac{C}{n}\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \|f(\cdot)\|_{\mathcal{H}}^2.$$
Plugging in our knowledge about the form of solutions and taking the directional derivative with respect to the parameter vector $c$ of Equation (4.1), we can find the analytic solution to the optimization problem as
$$c = \left(K + \frac{n}{C}\, I\right)^{-1} y,$$
where $I$ denotes the identity matrix of appropriate size.
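A small sketch of regularized least squares with this analytic solution, using a Gaussian kernel and synthetic data reminiscent of the noisy sinusoid discussed below, might look as follows; it is an illustration, not the implementation behind the figures.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def fit(X, y, C=10.0, kernel=gaussian_kernel):
    """Regularized least squares: c = (K + (n/C) I)^{-1} y."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + (n / C) * np.eye(n), y)

def predict(X_train, c, x, kernel=gaussian_kernel):
    """f(x) = sum_i c_i k(x_i, x): a linear combination of kernels
    centered at the training instances."""
    return sum(ci * kernel(xi, x) for ci, xi in zip(c, X_train))

# Noisy sinusoid as a toy regression problem.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
c = fit(X, y)
print(predict(X, c, np.array([0.5])), np.sin(0.5))   # prediction close to the true value
```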
Gaussian Processes
Gaussian processes [35] are an incrementally learnable Bayesian regression
algorithm. Rather than parameterizing some set of possible target functions
and specifying a prior over these parameters, Gaussian processes directly put
a (Gaussian) prior over the function space. A Gaussian process is defined by
a mean function and a covariance function, implicitly specifying the prior.
The choice of covariance functions is thereby only limited to positive-definite
kernels. It can be seen that the mean prediction of a Gaussian process corre-
sponds to the prediction found by a regularized least squares algorithm. This
links the regularization parameter C with the variance of the Gaussian noise
distribution assumed in Gaussian processes.
Illustration
To illustrate the importance of choosing the “right” kernel function, we next
illustrate the hypothesis found by a Gaussian process with different kernel
functions.
In Figure 4.1 the training examples are pairs of real numbers x ∈ X = R2
illustrated by black discs and circles in the figure. The (unknown) tar-
get function is an XOR-type function, the target variable y takes values
−1 for the black discs and +1 for the black circles. The probability of
a test example being of class +1 is illustrated by the color of the corre-
sponding pixel in the figure. The different kernels used are the linear kernel $k(x, x') = \langle x, x'\rangle$, the polynomial kernel $k(x, x') = (\langle x, x'\rangle + l)^p$, the sigmoid kernel $k(x, x') = \tanh(\gamma\,\langle x, x'\rangle)$, and the Gaussian kernel function $k(x, x') = \exp\!\left(-\|x - x'\|^2/\sigma^2\right)$.
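Written out as code, these four kernel functions are simply (the hyperparameters $l$, $p$, $\gamma$ and $\sigma$ are free choices):

```python
import numpy as np

def linear(x, z):               return float(np.dot(x, z))
def polynomial(x, z, l=1, p=2): return float((np.dot(x, z) + l) ** p)
def sigmoid(x, z, gamma=0.5):   return float(np.tanh(gamma * np.dot(x, z)))
def gaussian(x, z, sigma=1.0):  return float(np.exp(-np.sum((x - z) ** 2) / sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(linear(x, z), polynomial(x, z), sigmoid(x, z), gaussian(x, z))
```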
Figure 4.2 illustrates the impact of choosing the parameter of a Gaussian
kernel function on the regularization of the solution found by a Gaussian
process. Training examples are single real numbers and the target value is
Fig. 4.1. Impact of different kernel functions on the solution found by Gaussian
processes. Kernel functions are (a) linear kernel, (b) polynomial kernel of degree 2,
(c) sigmoid kernel and (d) Gaussian kernel.
also a real number. The unknown target function is a sinusoid function shown
by a thin line in the figure. Training examples perturbed by random noise are
depicted by black circles. The color of each pixel illustrates the likelihood of a target value given a test example, with the most likely value colored white.
Fig. 4.2. Impact of the bandwidth of a Gaussian kernel function on the regulariza-
tion of the solution found by Gaussian processes. The bandwidth is decreasing from
left to right, top to bottom.
For exponential power series such as the diffusion kernel, the limit can
be computed by exponentiating the eigenvalues, while for geometrical power
series, the limit can be computed by the formula 1/(1 − γe), where e is an
eigenvalue of B B or −L, respectively. A general framework and analysis of
these kernels is given in [42].
In the literature, distances are often defined using the minima and/or max-
ima over a set of distances, e.g., all distances described in [12] between point
sets, the string edit distance [13] between sequences, or the subgraph dis-
tance [3, 36] between graphs. It is thus interesting to investigate whether in
general kernel functions can be defined as the minimum and/or maximum of
a set of kernels. In this section we investigate whether certain uses of minima
and/or maxima give rise to positive-definite kernels and discuss minima- and
maxima-based kernels on instances represented by sets.
coincides with the usual ($L_2$) inner product between the functions $\theta_x(\cdot)$ and $\theta_{x'}(\cdot)$. Thus it is positive-definite.
The function $\max\{x, x'\}$ defined on non-negative real numbers is not positive-definite. Setting $x = 0$, $x' = 1$ we obtain the indefinite matrix
$$\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}.$$
Here, $A$ has the eigenvectors $(1, 1, 0)^\top$, $(0, 0, 1)^\top$, $(1, -1, 0)^\top$ with corresponding eigenvalues $2, 1, 0 \geq 0$, showing that both matrices are positive-definite. The component-wise maximum of $A$ and $B$,
$$D = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix},$$
is, however, indefinite: $(1, 0, 0)\,D\,(1, 0, 0)^\top = 1 > 0$ and $(1, -1, 1)\,D\,(1, -1, 1)^\top = -1 < 0$.
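This counterexample is easy to verify numerically; the sketch below evaluates the two quadratic forms and the eigenvalues of $D$.

```python
import numpy as np

D = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)   # component-wise maximum of A and B

v1, v2 = np.array([1.0, 0.0, 0.0]), np.array([1.0, -1.0, 1.0])
print(v1 @ D @ v1, v2 @ D @ v2)          # 1.0 and -1.0: D is indefinite
print(np.linalg.eigvalsh(D))             # one eigenvalue is 1 - sqrt(2) < 0
```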
and
$$\max_{x\in X,\ x'\in X'} k(x, x') = \max_{ij}\ k_{ij}(X, X') \geq 0.$$
Note that in the simplest case (finite sets with µ(·) being the set cardinal-
ity) the intersection kernel coincides with the inner product of the bitvector
representations of the sets.
In the case that the sets Xi are finite or countable sets of elements on
which a kernel has been defined, it is often beneficial to use set kernels other
than the intersection kernel. For example, the crossproduct kernel is defined as
$$k_\times(X_i, X_j) = \sum_{x_i\in X_i,\ x_j\in X_j} k(x_i, x_j). \qquad (4.3)$$
The crossproduct kernel with the right kernel set to the matching kernel (de-
fined as kδ (xi , xj ) = 1 ⇔ xi = xj and 0 otherwise) coincides with the inter-
section kernel.
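For finite sets, both kernels are a few lines of code; the sketch below also checks the remark that the crossproduct kernel with the matching kernel reduces to the intersection kernel.

```python
def intersection_kernel(A, B):
    """k_cap(A, B) = |A intersect B|  (mu taken as set cardinality)."""
    return len(A & B)

def crossproduct_kernel(A, B, k):
    """k_x(A, B) = sum of k over all element pairs (Equation 4.3)."""
    return sum(k(a, b) for a in A for b in B)

matching = lambda a, b: 1.0 if a == b else 0.0   # the matching kernel k_delta

A, B = {"c", "a", "t"}, {"c", "a", "r", "t"}
print(intersection_kernel(A, B),                  # 3
      crossproduct_kernel(A, B, matching))        # also 3: the two coincide
```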
In the remainder of this section we are more interested in the case that $\mathcal{S}$ is a Borel algebra with unit $\mathcal{X}$, and $\mu$ is countably additive with $\mu(\mathcal{X}) < \infty$. We can then extend the definition of the characteristic functions to $X = \bigcup_{C\in\mathcal{S}} C$ such that $\Gamma_X(x) = 1 \Leftrightarrow x \in X$ and $\Gamma_X(x) = 0$ otherwise. We can then write the intersection kernel as
$$k_\cap(X_i, X_j) = \mu(X_i \cap X_j) = \int_{\mathcal{X}} \Gamma_{X_i}(x)\, \Gamma_{X_j}(x)\, d\mu, \qquad (4.4)$$
which shows the relation of the intersection kernel to the usual ($L_2$) inner product between the characteristic functions $\Gamma_{X_i}(\cdot)$, $\Gamma_{X_j}(\cdot)$ of the sets.
Similarly, for the crossproduct kernel in Equation (4.3) we obtain in this setting the integral equation
$$\int_{X_i\times X_j} k(x, x')\, d\mu\, d\mu = \int_{\mathcal{X}\times\mathcal{X}} \Gamma_{X_i}(x)\, k(x, x')\, \Gamma_{X_j}(x')\, d\mu\, d\mu.$$
The best known kernel for representation spaces that are not mere attribute-
value tuples is the convolution kernel proposed by Haussler [22]. The basic idea
of convolution kernels is that the semantics of composite objects can often be
captured by a relation R between the object and its parts. The kernel on the
object is then made up from kernels defined on different parts.
Let $x, x' \in \mathcal{X}$ be the objects and $\vec{x}, \vec{x}' \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_D$ be tuples of parts of these objects. Given the relation $R : (\mathcal{X}_1 \times \cdots \times \mathcal{X}_D) \times \mathcal{X}$ we can define the decomposition $R^{-1}$ as $R^{-1}(x) = \{\vec{x} : R(\vec{x}, x)\}$. Then the convolution kernel is defined as
$$k_{conv}(x, x') = \sum_{\vec{x}\in R^{-1}(x),\ \vec{x}'\in R^{-1}(x')}\ \prod_{d=1}^{D} k_d(x_d, x'_d).$$
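A generic sketch of a convolution kernel is given below; the decomposition of a string into (first character, remainder) pairs and the two part kernels are toy choices standing in for $R^{-1}$ and $k_1, \ldots, k_D$.

```python
def convolution_kernel(x, xp, decompose, part_kernels):
    """k_conv(x, x') = sum over decompositions of both objects of the product
    of the D part kernels.  `decompose` plays the role of R^{-1}."""
    total = 0.0
    for parts in decompose(x):
        for parts_p in decompose(xp):
            prod = 1.0
            for kd, xd, xdp in zip(part_kernels, parts, parts_p):
                prod *= kd(xd, xdp)
            total += prod
    return total

# Toy decomposition of a string into a single (first character, remainder) pair.
decompose = lambda s: [(s[0], s[1:])] if s else []
k_char = lambda a, b: 1.0 if a == b else 0.0
k_rest = lambda a, b: 1.0 if a == b else 0.5
print(convolution_kernel("cat", "car", decompose, [k_char, k_rest]))  # 1.0 * 0.5 = 0.5
```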
Convolution kernels are very general and can be applied to many different problems. However, because of that generality, they require a significant amount of work to adapt them to a specific problem, which makes choosing $R$ in "real-world" applications a non-trivial task.
The idea of most string kernels [34, 47] defined in the literature is to base
the similarity of two strings on the number of common subsequences. These
subsequences need not occur contiguously in the strings but the more gaps
in the occurrence of the subsequence, the less weight is given to it in the
kernel function. For example, the string "cat" would be decomposed into the subsequences "c", "a", "t", "ca", "at", "ct", and "cat". These subsequences also occur in the string "cart", albeit with different lengths of occurrence.
Usually the length of the occurrence of the substring is used as a penalty.
With an exponentially decaying penalty, the weight of each occurrence in "cat"/"cart" becomes: "c": $\lambda^1\lambda^1$, "a": $\lambda^1\lambda^1$, "t": $\lambda^1\lambda^1$, "ca": $\lambda^2\lambda^2$, "at": $\lambda^2\lambda^3$, "ct": $\lambda^3\lambda^4$, "cat": $\lambda^3\lambda^4$, and the kernel of "cat" and "cart" becomes $k(\text{"cat"}, \text{"cart"}) = 2\lambda^7 + \lambda^5 + \lambda^4 + 3\lambda^2$. Using a divide and conquer
approach, computation of this kernel can be reduced to O(n|s||t|) [34]. In [45]
and [32] other string kernels are proposed and it is shown how these can be
computed efficiently by using suffix and mismatch trees, respectively.
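To make the weighting concrete, the following brute-force sketch (ours; it enumerates every subsequence explicitly and is therefore only usable for very short strings, unlike the efficient dynamic programs of [34, 45, 32]) reproduces the value of k(“cat”, “cart”) as a polynomial in λ, represented as a map from exponents to coefficients:

from collections import defaultdict
from itertools import combinations

def subsequence_weights(s):
    # For every subsequence u of s, sum the weights lambda^(span of the
    # occurrence), stored as {exponent: coefficient}.  Exponential in |s|.
    weights = defaultdict(lambda: defaultdict(int))
    for k in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), k):
            u = "".join(s[i] for i in idx)
            span = idx[-1] - idx[0] + 1        # length of the occurrence
            weights[u][span] += 1
    return weights

def string_kernel(s, t):
    # Gap-weighted subsequence kernel of s and t, as a polynomial in lambda.
    ws, wt = subsequence_weights(s), subsequence_weights(t)
    kernel = defaultdict(int)
    for u in ws.keys() & wt.keys():            # common subsequences only
        for es, cs in ws[u].items():
            for et, ct in wt[u].items():
                kernel[es + et] += cs * ct
    return dict(sorted(kernel.items()))

print(string_kernel("cat", "cart"))
# {2: 3, 4: 1, 5: 1, 7: 2}, i.e. 3*lambda^2 + lambda^4 + lambda^5 + 2*lambda^7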
In [19], a framework has been proposed that allows for the application of
kernel methods to different kinds of structured data. This approach is based
on the idea of having a powerful representation that allows for modeling the
semantics of an object by means of the syntax of the representation. The
underlying principle is that of representing individuals as (closed) terms in a
typed higher-order logic [33]. The biggest difference to terms of a first-order
logic is the use of types and the presence of abstractions that allow explicit
modeling of sets, multisets, and so on.
The typed syntax is important for pruning search spaces and for modeling
as closely as possible the semantics of the data in a human- and machine-
readable form. The individuals-as-terms representation is a natural general-
ization of the attribute-value representation and collects all information about
an individual in a single term.
Basic terms represent the individuals that are the subject of learning and
fall into one of three categories: basic structures that represent individuals that
are lists, trees, and so on; basic abstractions that represent sets, multisets, and
so on; and basic tuples that represent tuples. Basic abstractions are almost
constant mappings β → γ that can be regarded as lookup tables, where all
basic terms of type β in the table are mapped to some basic term of type γ
and all basic terms not in the table are mapped to one particular basic term,
the default term of type γ.
Applications of this kernel are spatial clustering of demographic data,
multi-instance learning for drug-activity prediction and predicting the struc-
ture of molecules from their NMR spectra.
Multi-instance learning problems [8] occur whenever example objects (in-
dividuals) can only be described by a set of instances, any single one of which
could be responsible for the classification of the set. Here, it can be shown that with
a particular abstraction kernel, the number of iterations needed by a kernel
perceptron to converge to a consistent hypothesis is bounded by a polynomial
in the number of elements in the sets.
Then, in each step, from the set of frequent subgraphs of size l, a set of can-
didate graphs of size l + 1 is generated by joining those graphs of size l that
have a subgraph of size l − 1 in common. Of the candidate graphs only those
satisfying a frequency threshold are retained for the next step. The iteration
stops when the set of frequent subgraphs of size l is empty.
Conceptually, the graph kernels presented in [15, 18, 26, 27] are based on
a measure of the walks in two graphs that have some or all labels in common.
In [15] walks with equal initial and terminal label are counted, in [26, 27] the
probability of random walks with equal label sequences is computed, and in
[18] walks with equal label sequences, possibly containing gaps, are counted.
In [18] computation of these – possibly infinite – walks is made possible in
polynomial time by using the direct product graph and computing the limit
of matrix power series involving its adjacency matrix. The work on rational
graph kernels [5] generalizes these graph kernels by applying a general trans-
ducer between weighted automata instead of forming the direct product graph.
However, only walks up to a given length are considered in the kernel com-
putation. More recently, Horvath et al. [23] suggested that the computational
intractability of detecting all cycles in a graph can be overcome in practical
applications by observing that “difficult structures” occur only infrequently
in real-world databases. As a consequence of this assertion, Horvath et al.
[23] use a cycle-detection algorithm to decompose all graphs in a molecule
database into all simple cycles occurring.
In the remainder of this section we will describe walk- and cycle-based
graph kernels in more detail.
An edge is present in the direct product graph if and only if the corresponding
edges exist in both factor graphs and both edges have the same label. For unlabeled graphs,
the adjacency matrix of the direct product graph corresponds to the tensor
product of the adjacency matrices of its factors.
With a sequence of weights λ0, λ1, . . . (λi ∈ R; λi ≥ 0 for all i ∈ N) the
direct product kernel is defined as

    k×(G1, G2) = Σ_{i,j=1}^{|V×|} [ Σ_{n=0}^{∞} λn E×^n ]_{ij} ,

where E× denotes the adjacency matrix of the direct product graph, provided
the limit exists.
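As an illustration only (our sketch, not the implementation used later in this chapter; the weights are chosen geometrically as λn = λ^n, the series is truncated at a maximum walk length, and edge labels are ignored), the kernel can be computed from the adjacency matrix of the direct product graph as follows:

import numpy as np

def direct_product_adjacency(A1, labels1, A2, labels2):
    # Vertices of the direct product graph are pairs of identically labeled
    # vertices; an edge connects (i, j) and (k, l) iff (i, k) is an edge of the
    # first graph and (j, l) an edge of the second (edge labels ignored here).
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    E = np.zeros((len(pairs), len(pairs)))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            if A1[i, k] and A2[j, l]:
                E[a, b] = 1.0
    return E

def direct_product_kernel(A1, labels1, A2, labels2, lam=0.1, max_len=13):
    # Truncated direct product kernel: sum over all entries of
    # sum_{n=0..max_len} lam^n * E^n, i.e. a weighted count of common walks.
    E = direct_product_adjacency(A1, labels1, A2, labels2)
    if E.size == 0:
        return 0.0
    power, total = np.eye(E.shape[0]), np.zeros((E.shape[0], E.shape[0]))
    for n in range(max_len + 1):
        total += (lam ** n) * power
        power = power @ E
    return float(total.sum())

A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # a triangle
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # a path
print(direct_product_kernel(A1, ["c", "a", "t"], A2, ["c", "a", "t"]))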
A labeled undirected graph can be seen as a labeled directed graph where the
existence of an edge between two vertices implies the existence of an edge in
the other direction and both edges are mapped to the same label. Each edge
of an undirected graph is usually represented by a subset of the vertex set
with cardinality two. A path in an undirected graph is a sequence v1 , . . . vn of
distinct vertices vi ∈ V where {vi , vi+1 } ∈ E. A simple cycle in an undirected
graph is a path, where also {v1 , vn } ∈ E. A bridge is an edge not part of any
simple cycle; the graph made up by all bridges is a forest, i.e., a set of trees.
We describe now the kernel proposed in [23] for molecule classification.
The key idea is to decompose every undirected graph into the set of cyclic
and tree patterns in the graph. A cyclic pattern is a unique representation of
the label sequence corresponding to a simple cycle in the graph. A tree pattern
in the graph is a unique representation of the label sequence corresponding to
a tree in the forest made up by all bridges. The cyclic-pattern kernel between
two graphs is defined by the cardinality of the intersection of the pattern sets
associated with each graph.
Consider a graph with vertices 1, . . . 6 and labels (in the order of vertices)
“c”, “a”, “r”, “t”, “e”, and “s”. Let the edges be the set
{{1, 2}, {2, 3}, {3, 4}, {2, 4}, {1, 5}, {1, 6}}.
This graph has one simple cycle and the lexicographically smallest represen-
tation of the labels along this cycle is the string “art”. The bridges of the
graph are {1, 2}, {1, 5}, {1, 6} and the bridges form a forest consisting of a
single tree. The lexicographically smallest representation of the labels of this
tree (in pre-order notation) is the string “aces”.
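For a toy graph of this size the patterns can be computed by brute force; the sketch below (ours, not one of the algorithms discussed in this section, and the canonical tree pattern of the bridge forest is omitted) recovers the cyclic pattern “art” and the bridges {1, 2}, {1, 5}, {1, 6}:

from itertools import combinations

def simple_cycles(vertices, edges):
    # Brute force over vertex subsets: a subset induces a simple cycle iff it
    # has as many induced edges as vertices, every vertex has degree 2, and a
    # walk along induced edges visits all of them and closes up.
    edge_set = {frozenset(e) for e in edges}
    cycles = []
    for k in range(3, len(vertices) + 1):
        for subset in combinations(sorted(vertices), k):
            sub = set(subset)
            induced = [e for e in edge_set if e <= sub]
            degree = {v: sum(v in e for e in induced) for v in sub}
            if len(induced) != k or any(d != 2 for d in degree.values()):
                continue
            walk = [subset[0]]
            while True:
                nxt = [v for v in sub
                       if frozenset((walk[-1], v)) in edge_set and v not in walk]
                if not nxt:
                    break
                walk.append(nxt[0])
            if len(walk) == k and frozenset((walk[-1], walk[0])) in edge_set:
                cycles.append(walk)
    return cycles

def cyclic_pattern(cycle, labels):
    # Canonical representation: lexicographically smallest label string over
    # all rotations of the cycle in both directions.
    seq = [labels[v] for v in cycle]
    rotations = ["".join(s[i:] + s[:i]) for s in (seq, seq[::-1]) for i in range(len(s))]
    return min(rotations)

labels = {1: "c", 2: "a", 3: "r", 4: "t", 5: "e", 6: "s"}
edges = [(1, 2), (2, 3), (3, 4), (2, 4), (1, 5), (1, 6)]

cycles = simple_cycles(labels.keys(), edges)
print([cyclic_pattern(c, labels) for c in cycles])   # ['art']

# Bridges are the edges that lie on no simple cycle; here they form the tree
# whose labels give the pattern "aces".
on_cycles = {frozenset((c[i], c[(i + 1) % len(c)])) for c in cycles for i in range(len(c))}
print([e for e in edges if frozenset(e) not in on_cycles])   # [(1, 2), (1, 5), (1, 6)]

Enumerating all vertex subsets is of course exponential; the polynomial-delay enumeration discussed below is what one would use on real data.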
If the cyclic-pattern kernel between any two graphs could be computed
in polynomial time, the Hamiltonian cycle problem could also be solved in
polynomial time. Furthermore, the set of simple cycles in a graph can not be
computed in polynomial time – even worse, the number of simple cycles in
a graph can be exponential in the number of vertices of the graph. Consider
a graph consisting of two paths v0, . . . , vn and u0, . . . , un with additional edges
{{vi, ui} : 0 ≤ i ≤ n} ∪ {{vi, ui−2} : 2 ≤ i ≤ n}, where the number of paths
from v0 to un is bounded below by 2^n. It follows directly that the number of
simple cycles in the graph with the additional edge {un, v0} is also bounded
below by 2^n.
The only remaining hope for a practically feasible algorithm is that the
number of simple cycles in each graph can be bounded by a small polynomial.
Read and Tarjan [38] proposed an algorithm with polynomial delay complex-
ity, i.e., the number of steps that the algorithm needs between finding one
simple cycle and finding the next simple cycle is polynomial. This algorithm
can be used to enumerate all cyclic patterns. Note that this does not imply
that the number of steps the algorithm needs between two cyclic patterns is
polynomial.
In the next section we will compare walk- and cycle-based graph kernels in
the context of drug design and prediction of properties of molecules. It is illus-
trated there that indeed for the application considered, only a few molecules
exist that have a large number of simple cycles. Before that we describe an
application of walk-based graph kernels in a relational reinforcement learning
setting.
The only information the agent can get about the environment is its current state and
whether it received a reward. The goal of reinforcement learning is to maximize
this reward. One particular form of reinforcement learning is Q-learning [46].
It tries to learn a map from state-action-pairs to real numbers (Q-values)
reflecting the quality of that action in that state.
Relational reinforcement learning [10, 11] (RRL) is a Q-learning technique
that can be applied whenever the state-action space can not easily be repre-
sented by tuples of constants but has an inherently relational representation
instead. In this case, explicitly representing the mapping from state-action-
pairs to Q-values is usually not feasible.
The RRL-system learns through exploration of the state-space in a way
that is very similar to normal Q-learning algorithms. It starts with running
an episode² just like table-based Q-learning, but uses the encountered states,
chosen actions and the received rewards to generate a set of examples that
can then be used to build a Q-function generalization. These examples use a
structural representation of states and actions.
To build this generalized Q-function, RRL applies an incremental rela-
tional regression engine that can exploit the structural representation of the
constructed example set. The resulting Q-function is then used to decide which
actions to take in the following episodes. Every new episode can be seen as a
new experience and is thus used to update the Q-function generalization.
A rather simple example of relational reinforcement learning takes place
in the blocks world. The aim there is to learn how to put blocks that are in
an arbitrary configuration into a given configuration.
Fig. 4.3. Simple example of a blocks world state and action (left) and its representation as a graph (right).
² An “episode” is a sequence of states and actions from an initial state to a
terminal state. In each state, the current Q-function is used to decide which action
to take.
Evaluation
We evaluated RRL with Gaussian processes and walk-based graph kernels on
three different goals: stacking all blocks, unstacking all blocks and putting two
specific blocks on top of each other. The RRL-system was trained in worlds
where the number of blocks varied between three and five, and given “guided”
traces [9] in a world with 10 blocks. The Q-function and the related policy
were tested at regular intervals on 100 randomly generated starting states in
worlds where the number of blocks varied from 3 to 10 blocks.
In our empirical evaluation, RRL with Gaussian processes and walk-based
graph kernels proved competitive or better than the previous implementa-
tions of RRL. However, this is not the only advantage of using graph kernels
and Gaussian processes in RRL. The biggest advantages are the elegance and
potential of our approach. Very good results could be achieved without sophis-
ticated instance selection or averaging strategies. The generalization ability
can be tuned by a single parameter. Probabilistic predictions can be used to
guide exploration of the state–action space.
One of the most interesting application areas for predictive graph mining
algorithms is the classification of molecules.
We used the HIV data set of chemical compounds to evaluate the predictive
power of walk- and cycle-based graph kernels. The HIV database is maintained
by the US National Cancer Institute (NCI) [37] and describes information about
the compounds' capability to inhibit the HIV virus. This database has been
used frequently in the empirical evaluation of graph-mining approaches (for
example [1, 7, 30]). However, the only approaches to predictive graph mining
on this data set are described in [6, 7]. There, a support vector machine
was used with the frequent subgraph kernel mentioned at the beginning of
Section 4.4.
Figure 4.4 shows the number of molecules with a given number of simple
cycles. This illustrates that in the HIV domain the assumption made in the
development of cyclic-pattern kernels holds.
Data set
In the NCI HIV database, each compound is described by its chemical struc-
ture and classified into one of three categories: confirmed inactive (CI), mod-
erately active (CM), or active (CA). A compound is inactive if a test showed
less than 50% protection of human CEM cells. All other compounds were
re-tested. Compounds showing less than 50% protection (in the second test)
are also classified inactive. The other compounds are classified active, if they
Fig. 4.4. Log-log plot of the number of molecules (y) versus the number of simple
cycles (x).
Vertex coloring
Though the number of molecules and thus atoms in this data set is rather
large, the number of vertex labels is limited by the number of elements oc-
curring in natural compounds. For this reason, it is reasonable not to use just the
element of the atom as its label. Instead, we use the pair consisting of the
atom's element and the multiset of all neighbouring elements as the label. In
the HIV data set, this increases the number of different labels from 62 to 1391.
More sophisticated vertex coloring algorithms are used in isomorphism
tests. There, one would like two vertices to be colored differently iff they do
not lie on the same orbit of the automorphism group [14]. As no efficient algo-
rithm for the ideal case is known, one often resorts to colorings such that two
differently colored vertices can not lie on the same orbit. One possibility there
is to apply the above simple vertex coloring recursively. This is guaranteed to
converge to a “stable coloring”.
Implementation Issues
The size of this data set, in particular the size of the graphs in this data set,
hinders the computation of walk-based graph kernels by means of eigen decom-
positions on the product graphs. The largest graph contains 214 atoms (not
counting hydrogen atoms). If all had the same label, the product graph would
have 45,796 vertices. As different elements occur in this molecule, the product
graph has fewer vertices. However, it turns out that the largest product graph
(without the vertex coloring step) still has 34,645 vertices. The vertex coloring
above changes the number of vertices with the same label, thus the product
graph is reduced to 12,293 vertices. For each kernel computation, either
eigendecomposition or inversion of the adjacency matrix of a product graph
has to be performed. With cubic time complexity, such operations on matrices
of this size are not feasible.
³ http://cactus.nci.nih.gov/ncidb/download.html
The only chance to compute graph kernels in this application is to approx-
imate them. There are two choices. First we consider counting the number of
walks in the product graph up to a certain depth. In our experiments it turned
out that counting walks with 13 or fewer vertices is still feasible. An alter-
native is to explicitly construct the image of each graph in feature space. In
the original data set 62 different labels occur and after the vertex coloring
1391 different labels occur. The size of the feature space of label sequences of
length 13 is then 62^13 > 10^23 for the original data set and 1391^13 > 10^40 with
the vertex coloring. We would also have to take into account walks with fewer
than 13 vertices but at the same time not all walks will occur in at least one
graph. The size of this feature space hinders explicit computation. We thus
resorted to counting walks with 13 or fewer vertices in the product graph.
Experimental Methodology
We compare our approach to the results presented in [6] and [7]. The clas-
sification problems considered there were: (1) distinguish CA from CM, (2)
distinguish CA and CM from CI, and (3) distinguish CA from CI. For each
problem, the area under the ROC curve (AUC), averaged over a five-fold
crossvalidation, is given for different misclassification cost settings.
In order to choose the parameters of the walk-based graph kernel we
proceeded as follows. We split the smallest problem (1) into 10% for pa-
rameter tuning and 90% for evaluation. First we tried different parameters
for the exponential weight (10^-3, 10^-2, 10^-1, 1, 10) in a single nearest neigh-
bor algorithm (leading to an average AUC of 0.660, 0.660, 0.674, 0.759, 0.338)
and decided to use 1 from then on. Next we needed to choose the complexity
(regularization) parameter of the SVM. Here we tried different parameters
(10^-3, 10^-2, 10^-1, leading to an average AUC of 0.694, 0.716, 0.708) and found
the parameter 10^-2 to work best. Evaluating with an SVM and these param-
eters on the remaining 90% of the data, we achieved an average AUC of 0.820
and standard deviation of 0.024.
For cyclic-pattern kernels, only the complexity constant of the support
vector machine has to be chosen. Here, the heuristic as implemented in SVM-
light [24] is used. Also, we did not use any vertex coloring with cyclic pattern
kernels.
Table 4.1. Area under the ROC curve for different costs and problems (•: significant
loss against walk-based kernels at 10% / ••: significant loss against walk-based ker-
nels at 1% / ◦: significant loss against cyclic-pattern kernels at 10% / ◦◦: significant
loss against cyclic-pattern kernels at 1%).
problem        cost    walk-based kernels   cyclic-pattern kernels   FSG           FSG∗
CA vs CM       1.0     0.818 (±0.024)       0.813 (±0.014)           0.774 ••◦◦    0.810
CA vs CM       2.5     0.825 (±0.032)       0.827 (±0.013)           0.782 •◦◦     0.792 •◦◦
CA vs CM+CI    1.0     0.926 (±0.015)       0.908 (±0.024) •         —             —
CA vs CM+CI    100.0   0.928 (±0.013)       0.921 (±0.026)           —             —
CA+CM vs CI    1.0     0.815 (±0.015)       0.775 (±0.017) ••        0.742 ••◦◦    0.765 ••
CA+CM vs CI    35.0    0.799 (±0.011)       0.801 (±0.017)           0.778 ••◦     0.794
CA vs CI       1.0     0.942 (±0.015)       0.919 (±0.011) •         0.868 ••◦◦    0.839 ••◦◦
CA vs CI       100.0   0.944 (±0.015)       0.929 (±0.01) •          0.914 ••◦     0.908 ••◦◦
References
[1] Borgelt, C. and M. R. Berthold, 2002: Mining molecular fragments: Find-
ing relevant substructures of molecules. Proc. of the 2002 IEEE Interna-
tional Conference on Data Mining, IEEE Computer Society.
[17] Gärtner, T., K. Driessens and J. Ramon, 2003: Graph kernels and Gaus-
sian processes for relational reinforcement learning. Proceedings of the
13th International Conference on Inductive Logic Programming.
[18] Gärtner, T., P. A. Flach and S. Wrobel, 2003: On graph kernels: Hardness
results and efficient alternatives. Proceedings of the 16th Annual Confer-
ence on Computational Learning Theory and the 7th Kernel Workshop.
[19] Gärtner, T., J. W. Lloyd and P. A. Flach, 2004: Kernels for structured
data. Machine Learning.
[20] Geibel, P. and F. Wysotzki, 1996: Relational learning with decision trees.
Proceedings of the 12th European Conference on Artificial Intelligence,
W. Wahlster, ed., John Wiley, 428–32.
[21] Graepel, T., 2002: PAC-Bayesian Pattern Classification with Kernels.
Ph.D. thesis, TU Berlin.
[22] Haussler, D., 1999: Convolution kernels on discrete structures. Techni-
cal report, Department of Computer Science, University of California at
Santa Cruz.
[23] Horvath, T., T. Gärtner and S. Wrobel, 2004: Cyclic pattern kernels for
predictive graph mining. Proceedings of the International Conference on
Knowledge Discovery and Data Mining.
[24] Joachims, T., 1999: Making large-scale SVM learning practical. Advances
in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C.
Burges and A. J. Smola, eds., MIT Press.
[25] Kandola, J., J. Shawe-Taylor and N. Cristianini, 2003: Learning se-
mantic similarity. Advances in Neural Information Processing Systems,
S. Becker, S. Thrun and K. Obermayer, eds., MIT Press, 15.
[26] Kashima, H., and A. Inokuchi, 2002: Kernels for graph classification.
ICDM Workshop on Active Mining.
[27] Kashima, H., K. Tsuda and A. Inokuchi, 2003: Marginalized kernels be-
tween labeled graphs. Proceedings of the 20th International Conference
on Machine Learning.
[28] Kolmogorov, A. N., and S. V. Fomin, 1960: Elements of the Theory
of Functions and Functional Analysis: Measure, Lebesgue Integrals, and
Hilbert Space, Academic Press, NY, USA, 2.
[29] Kondor, R. I. and J. Lafferty, 2002: Diffusion kernels on graphs and other
discrete input spaces. Proceedings of the 19th International Conference
on Machine Learning, C. Sammut and A. Hoffmann, eds., Morgan Kauf-
mann, 315–22.
[30] Kramer, S., L. De Raedt and C. Helma, 2001: Molecular feature min-
ing in HIV data. Proceedings of the 7th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, F. Provost and
R. Srikant, eds., 136–43.
[31] Kuramochi, M. and G. Karypis, 2001: Frequent subgraph discovery. Pro-
ceedings of the IEEE International Conference on Data Mining.
[32] Leslie, C., E. Eskin, J. Weston and W. Noble, 2003: Mismatch string
kernels for SVM protein classification. Advances in Neural Information
Processing Systems.
Mohammed J. Zaki
5.1 Introduction
Frequent structure mining (FSM) refers to an important class of exploratory
mining tasks, namely those dealing with extracting patterns in massive
databases representing complex interactions between entities. FSM not only
encompasses mining techniques like associations [3] and sequences [4], but it
also generalizes to more complex patterns like frequent trees and graphs [17,
20]. Such patterns typically arise in applications like bioinformatics, web min-
ing, mining semi-structured documents, and so on. As one increases the com-
plexity of the structures to be discovered, one extracts more informative pat-
terns; we are specifically interested in mining tree-like patterns.
As a motivating example for tree mining, consider the web usage min-
ing [13] problem. Given a database of web access logs at a popular site, one
can perform several mining tasks. The simplest is to ignore all link informa-
tion from the logs, and to mine only the frequent sets of pages accessed by
users. The next step can be to form for each user the sequence of links they
followed and to mine the most frequent user access paths. It is also possible to
look at the entire forward accesses of a user, and to mine the most frequently
accessed subtrees at that site. In recent years, XML has become a popular way
of storing many data sets because the semi-structured nature of XML allows
A rooted tree is a tree in which one vertex is distinguished from the
others and called the root. We refer to a vertex of a rooted tree as a node of
the tree. An ordered tree is a rooted tree in which the children of each node
are ordered, i.e., if a node has k children, then we can designate them as the
first child, second child, and so on up to the kth child. A labeled tree is a tree
where each node of the tree is associated with a label. In this paper, all trees
we consider are ordered, labeled, and rooted trees. We choose to focus on
labeled rooted trees, since those are the types of data sets that are most com-
mon in a data mining setting, i.e., data sets represent relationships between
items or attributes that are named, and there is a top root element (e.g., the
main web page on a site). In fact, if we treat each node as having the same
label, we can mine all ordered, unlabeled subtrees as well!
Subtrees
We say that a tree S = (Ns , Bs ) is an embedded subtree of T = (N, B), denoted
as S T , provided Ns ⊆ N , and b = (nx , ny ) ∈ Bs if and only if ny ≤l nx ,
i.e., nx is an ancestor of ny in T . In other words, we require that a branch
appears in S if and only if the two vertices are on the same path from the
root to a leaf in T . If S T , we also say that T contains S. A (sub)tree
of size k is also called a k-(sub)tree. Note that in the traditional definition
of an induced subtree , for each branch b = (nx , ny ) ∈ Bs , nx must be a
parent of ny in T . Embedded subtrees are thus a generalization of induced
subtrees; they allow not only direct parent–child branches, but also ancestor–
descendant branches. As such embedded subtrees are able to extract patterns
“hidden” (or embedded) deep within large trees which might be missed by
the traditional definition.
Fig. 5.1. [Three trees T1, T2 and T3, and an embedded subtree that occurs in all three.]
As an example, consider Figure 5.1, which shows three trees. Let’s assume
we want to mine subtrees that are common to all three trees (i.e., 100% fre-
quency). If we mine induced trees only, then there are no frequent trees of
size more than one. On the other hand, if we mine embedded subtrees, then
the tree shown in the box is a frequent pattern appearing in all three trees; it
is obtained by skipping the “middle” node in each tree. This example shows
why embedded trees are of interest. Henceforth, a reference to subtree should
be taken to mean an embedded subtree, unless indicated otherwise. Also note
that, by definition, a subtree must be connected. A disconnected pattern is
a sub-forest of T . Our main focus is on mining subtrees, although a simple
modification of our enumeration scheme also produces sub-forests.
Scope
Let T (nl ) refer to the subtree rooted at node nl and let nr be the right-most
leaf node in T (nl ). The scope of node nl is given as the interval [l, r], i.e., the
lower bound is the position (l) of node nl , and the upper bound is the position
(r) of node nr . The concept of scope will play an important part in counting
subtree frequency.
Fig. 5.2. [An example tree T with root scope s = [0, 6], and three patterns: S1 (support = 1, weighted support = 2, match labels = {134, 135}, string = 1 1 −1 2 −1), S2 (support = 1, weighted support = 1, match label = {03456}, string = 0 1 −1 2 −1 2 −1 2 −1) and S3 (not a subtree; a sub-forest).]
Example 1. Consider Figure 5.2, which shows an example tree T with node
labels drawn from the set L = {0, 1, 2, 3}. The figure shows for each node, its
label (circled), its number according to depth-first numbering, and its scope.
For example, the root occurs at position n = 0, its label l(n0 ) = 0, and since
the right-most leaf under the root occurs at position 6, the scope of the root
is s = [0, 6]. Tree S1 is a subtree of T ; it has a support of 1, but its weighted
support is 2, since node n2 in S1 occurs at positions 4 and 5 in T , both of
which support S1, i.e., there are two match labels for S1, namely 134 and 135
(we omit set notation for convenience). S2 is also a valid subtree. S3 is not a
(sub)tree since it is disconnected; it is a sub-forest.
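To make the scope computation concrete, here is a small sketch (ours) that parses the string encoding used in the figures of this chapter — a node's label when the node is entered and −1 when the depth-first traversal backtracks — and computes the label and scope of every node:

def parse_scopes(encoding):
    # Returns, for each node in depth-first order, (label, [l, r]) where
    # [l, r] is the node's scope.
    labels, scopes, stack, pos = [], [], [], -1
    for tok in encoding.split():
        if tok == "-1":                 # backtrack: close the current node
            scopes[stack.pop()][1] = pos
        else:                           # enter a new node
            pos += 1
            labels.append(tok)
            scopes.append([pos, pos])
            stack.append(pos)
    while stack:                        # close any nodes left open (the root)
        scopes[stack.pop()][1] = pos
    return list(zip(labels, scopes))

# Tree T0 of Figure 5.5, encoded as 1 2 -1 3 4 -1 -1:
print(parse_scopes("1 2 -1 3 4 -1 -1"))
# [('1', [0, 3]), ('2', [1, 1]), ('3', [2, 3]), ('4', [3, 3])]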
Equivalence Classes
We say that two k-subtrees X, Y are in the same prefix equivalence class iff
they share a common prefix up to the (k − 1)th node. Formally, let X , Y be
the string encodings of two trees, and let function p(X , i) return the prefix up
to the ith node. X, Y are in the same class iff p(X , k − 1) = p(Y, k − 1). Thus
any two members of an equivalence class differ only in the position of the last
node.
Fig. 5.3. Prefix equivalence class. [Class prefix string: 3 4 2 −1 1. Element list, as (label, attached-to position) pairs: (x, 0), attached to n0, giving 3 4 2 −1 1 −1 −1 x −1; (x, 1), attached to n1, giving 3 4 2 −1 1 −1 x −1 −1; (x, 3), attached to n3, giving 3 4 2 −1 1 x −1 −1 −1.]
Example 3. Consider Figure 5.3, which shows a class template for subtrees of
size 5 with the same prefix subtree P of size 4, with string encoding P =
3 4 2 −1 1. Here x denotes an arbitrary label from L. The valid positions
where the last node with label x may be attached to the prefix are n0 , n1 and
n3 , since in each of these cases the subtree obtained by adding x to P has
the same prefix. Note that a node attached to position n2 cannot be a valid
member of class P, since it would yield a different prefix, given as 3 4 2 x.
The figure also shows the actual format we use to store an equivalence
class; it consists of the class prefix string, and a list of elements. Each element
is given as a (x, p) pair, where x is the label of the last node, and p specifies
the depth-first position of the node in P to which x is attached. For example
(x, 1) refers to the case where x is attached to node n1 at position 1. The figure
shows the encoding of the subtrees corresponding to each class element. Note
how each of them shares the same prefix up to the (k − 1)th node. These
subtrees are shown only for illustration purposes; we only store the element
list in a class.
Let P be a prefix subtree of size k − 1; we use the notation [P ]k−1 to refer
to its class (we omit the subscript when there is no ambiguity). If (x, i) is an
element of the class, we write it as (x, i) ∈ [P ]. Each (x, i) pair corresponds
to a subtree of size k, sharing P as the prefix, with the last node labeled x,
attached to node ni in P . We use the notation Px to refer to the new prefix
subtree formed by adding (x, i) to P .
Lemma 1. Let P be a class prefix subtree and let nr be the right-most leaf node
in P , whose scope is given as [r, r]. Let (x, i) ∈ [P ]. Then the set of valid node
positions in P to which x can be attached is given by {j : nj has scope [j, r]},
where nj is the jth node in P .
This lemma states that a valid element x may be attached to only those nodes
that lie on the path from the root to the right-most leaf nr in P . It is easy
to see that if x is attached to any other position the resulting prefix would be
different, since x would then be before nr in depth-first numbering.
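Given the scopes, the valid attachment positions of Lemma 1 can be read off directly (a tiny illustrative helper of ours; the scopes below are those of the prefix 3 4 2 −1 1 of Figure 5.3):

def rightmost_path_positions(scopes):
    # Lemma 1: x may only be attached to nodes whose scope upper bound equals
    # the position of the right-most leaf, i.e. the last node.
    r = len(scopes) - 1
    return [i for i, (_, u) in enumerate(scopes) if u == r]

# Prefix 3 4 2 -1 1: root 3 with child 4; 4 has children 2 and 1.
print(rightmost_path_positions([(0, 3), (1, 3), (2, 2), (3, 3)]))  # [0, 1, 3]
# These are exactly the positions n0, n1 and n3 shown in Figure 5.3.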
Candidate Generation
Given an equivalence class of k-subtrees, how do we obtain candidate (k + 1)-
subtrees? First, we assume (without loss of generality) that the elements (x, p)
in each class are kept sorted by node label as the primary key and position
as the secondary key. Given a sorted element list, the candidate generation
procedure we describe below outputs a new class list that respects that order,
without explicit sorting. The main idea is to consider each ordered pair of
elements in the class for extension, including self extension. There can be up
to two candidates from each pair of elements to be joined. The next theorem
formalizes this notion.
Theorem 1 (Class Extension). Let P be a prefix class with encoding P,
and let (x, i) and (y, j) denote any two elements in the class. Let Px denote
the class representing extensions of element (x, i). Define a join operator ⊗
on the two elements, denoted (x, i)⊗(y, j), as follows:
case I – (i = j):
(a) If P ≠ ∅, add (y, j) and (y, ni ) to class [Px ], where ni is the depth-first
number for node (x, i) in tree Px .
(b) If P = ∅, add (y, j + 1) to [Px ].
case II – (i > j): add (y, j) to class [Px ].
case III – (i < j): no new candidate is possible in this case.
Then all possible (k + 1)-subtrees with the prefix P of size k − 1 will be enu-
merated by applying the join operator to each ordered pair of elements (x, i)
and (y, j).
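The join operator can be transcribed almost literally (a sketch of ours; a class is modelled by its prefix size and a list of (label, position) elements, and we use the observation that the depth-first number ni of the node x in Px equals the number of nodes in P, since x is always the right-most node of Px):

def join(x_elem, y_elem, prefix_size):
    # (x, i) ⊗ (y, j): the elements to be added to the class [Px].
    (_, i), (y, j) = x_elem, y_elem
    if i == j:                                 # case I
        if prefix_size > 0:                    # (a) non-empty prefix:
            return [(y, j), (y, prefix_size)]  #     cousin of x and child of x
        return [(y, j + 1)]                    # (b) empty prefix
    if i > j:                                  # case II
        return [(y, j)]
    return []                                  # case III (i < j)

# The class with prefix 1 2 (two nodes) and elements (3, 1), (4, 0): extending
# the element (3, 1) yields the elements of the class with prefix 1 2 3.
new_elements = []
for y in [(3, 1), (4, 0)]:
    new_elements += join((3, 1), y, prefix_size=2)
print(new_elements)  # [(3, 1), (3, 2), (4, 0)]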
Fig. 5.4. [Candidate generation by class extension: the class with prefix 1 2 and element list (3, 1) (4, 0) is joined with itself to produce the classes with prefixes 1 2 3 and 1 2 −1 4.]
xi , but since they are not connected, they would be roots of two trees in a
sub-forest. If we allow such class elements then one can show that the class
extension theorem would produce all possible candidate sub-forests. However,
in this paper we will focus only on subtrees.
Fig. 5.5. [A database D of three trees. In horizontal format (tid, string encoding):
(T0, 1 2 −1 3 4 −1 −1)
(T1, 2 1 2 −1 4 −1 −1 2 −1 3 −1)
(T2, 1 3 2 −1 −1 5 1 2 −1 3 4 −1 −1 −1 −1)
In vertical format, the (tid, scope) pairs for each item are:
1: (0, [0, 3]) (1, [1, 3]) (2, [0, 7]) (2, [4, 7])
2: (0, [1, 1]) (1, [0, 5]) (1, [2, 2]) (1, [4, 4]) (2, [2, 2]) (2, [5, 5])
3: (0, [2, 3]) (1, [5, 5]) (2, [1, 2]) (2, [6, 7])
4: (0, [3, 3]) (1, [3, 3]) (2, [7, 7])
5: (2, [3, 7])]
Each element of the scope-list of a k-subtree X is a triple (t, m, s), where t is the
tid of a tree containing X, m is a match label of the (k − 1) length prefix of X, and s is the scope of the last
item xk . Recall that the prefix match label gives the positions of nodes in T
that match the prefix. Since a given prefix can occur multiple times in a tree,
X can be associated with multiple match labels as well as multiple scopes.
The initial scope-lists are created for single items (i.e., labels) i that occur
in a tree T . Since a single item has an empty prefix, we don’t have to store
the prefix match label m for single items. We will show later how to compute
pattern frequency via joins on scope-lists.
Example 5. Figure 5.5 shows a database of three trees, along with the hori-
zontal format for each tree and the vertical scope-list format for each item.
Consider item 1; since it occurs at node position 0 with scope [0, 3] in tree T0 ,
we add (0, [0, 3]) to its scope list L(1). Item 1 also occurs in T1 at position n1
with scope [1, 3], so we add (1, [1, 3]) to L(1). Finally, item 1 occurs with scope
[0, 7] and [4, 7] in tree T2 , so we add (2, [0, 7]) and (2, [4, 7]) to its scope-list.
In a similar manner, the scope-lists for other items are created.
Figure 5.6 shows the high-level structure of TreeMiner. The main steps in-
clude the computation of the frequent items and 2-subtrees, and the enumera-
tion of all other frequent subtrees via DFS search within each class [P ]1 ∈ F2 .
We will now describe each step in more detail.
Enumerate-Frequent-Subtrees([P ]):
  for each element (x, i) ∈ [P ] do
    [Px ] = ∅;
    for each element (y, j) ∈ [P ] do
      R = {(x, i) ⊗ (y, j)};
      L(R) = {L(x) ∩⊗ L(y)};
      if for any R ∈ R, R is frequent then
        [Px ] = [Px ] ∪ {R};
    Enumerate-Frequent-Subtrees([Px ]);
Fig. 5.6. TreeMiner algorithm.
with empty prefix, given as [P ]0 = [∅] = {(i, −1), i ∈ F1 }, and the position
−1 indicates that i is not attached to any node. Total time for this step is
O(n) per tree, where n = |T |.
By Theorem 1 each candidate class [P ]1 = [i] (with i ∈ F1 ) consists of
elements of the form (j, 0), where j ≥ i. For efficient F2 counting we compute
the supports of each candidate by using a two-dimensional integer array of size
F1 × F1 , where cnt[i][j] gives the count of candidate subtrees with encoding
(i j −1). Total time for this step is O(n^2) per tree. While computing F2 we
also create the vertical scope-list representation for each frequent item i ∈ F1 .
Computing Fk (k ≥ 3): Figure 5.6 shows the pseudo-code for the depth-first
search for frequent subtrees (Enumerate-Frequent-Subtrees). The input
to the procedure is a set of elements of a class [P ], along with their scope-
lists. Frequent subtrees are generated by joining the scope-lists of all pairs of
elements (including self-joins). Before joining the scope-lists, a pruning step
can be inserted to ensure that subtrees of the resulting tree are frequent.
If this is true, then we can go ahead with the scope-list join, otherwise we
can avoid the join. For convenience, we use the set R to denote the up to
two possible candidate subtrees that may result from (x, i)⊗(y, j), according
to the class extension theorem, and we use L(R) to denote their respective
scope-lists. The subtrees found to be frequent at the current level form the
elements of classes for the next level. This recursive process is repeated until
all frequent subtrees have been enumerated. If [P ] has n elements, the total
cost is O(l n^2), where l is the cost of a scope-list join (given later). In
terms of memory management it is easy to see that we need memory to store
classes along a path in the DFS search. At the very least we need to store
intermediate scope-lists for two classes, i.e., the current class [P ] and a new
candidate class [Px ]. Thus the memory footprint of TreeMiner is not large.
Scope-list join for any two subtrees in a class [P ] is based on interval algebra
on their scope lists. Let sx = [lx , ux ] be a scope for node x, and sy = [ly , uy ]
a scope for y. We say that sx is strictly less than sy , denoted sx < sy , if and
only if ux < ly , i.e., the interval sx has no overlap with sy , and it occurs
before sy . We say that sx contains sy , denoted sx ⊃ sy , if and only if lx ≤ ly
and ux ≥ uy , i.e., the interval sy is a proper subset of sx . The use of scopes
allows us to compute in constant time whether y is a descendant of x or y
is an embedded sibling of x. Recall from the candidate extension Theorem 1
that when we join elements (x, i)⊗(y, j) there can be at most two possible
outcomes, i.e., we either add (y, j + 1) or (y, j) to the class [Px ].
In-Scope Test
The first candidate (y, j + 1) is added to [Px ] only when i = j, and thus refers
to the candidate subtree with y as a child of node x. In other words, (y, j + 1)
5.4 TreeMiner Algorithm 135
represents the subtree with encoding (Px y). To check if this subtree occurs
in an input tree T with tid t, we search for triples (ty , sy , my ) ∈ L(y) and
(tx , sx , mx ) ∈ L(x), such that:
• ty = tx = t, i.e., the triples both occur in the same tree, with tid t.
• my = mx = m, i.e., x and y are both extensions of the same prefix
occurrence, with match label m.
• sy ⊂ sx , i.e., y lies within the scope of x.
If the three conditions are satisfied, we have found an instance where y is a
descendant of x in some input tree T . We next extend the match label my
of the old prefix P , to get the match label for the new prefix Px (given as
my ∪ lx ), and add the triple (ty , sy , {my ∪ lx }) to the scope-list of (y, j + 1) in
[Px ]. We refer to this case as an in-scope test.
Out-Scope Test
The second candidate (y, j) represents the case when y is an embedded sibling
of x, i.e., both x and y are descendants of some node at position j in the prefix
P , and the scope of x is strictly less than the scope of y. The element (y, j),
when added to [Px ] represents the pattern (Px −1 ... −1 y) with the number
of -1’s depending on the path length from j to x. To check if (y, j) occurs in
some tree T with tid t, we need to check for triples (ty , sy , my ) ∈ L(y) and
(tx , sx , mx ) ∈ L(x), such that:
• ty = tx = t, i.e., the triples both occur in the same tree, with tid t.
• my = mx = m, i.e., x and y are both extensions of the same prefix
occurrence, with match label m.
• sx < sy , i.e., x comes before y in depth-first ordering and their scopes do
not overlap.
If these conditions are satisfied, we add the triple (ty , sy , {my ∪ lx }) to the
scope-list of (y, j) in [Px ]. We refer to this case as an out-scope test. Note that
if we just check whether sx and sy are disjoint (with identical tids and prefix
match labels), i.e., either sx < sy or sx > sy , then the support can be counted
for unordered subtrees!
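Both tests are constant-time interval comparisons. The sketch below (ours) spells them out for scope-list entries written as (tid, match label, (l, u)) triples, using occurrences from tree T0 of the running example:

def in_scope(ex, ey):
    # y is a descendant of x: same tree, same prefix occurrence, and y's
    # scope lies strictly within x's scope.
    (tx, mx, (lx, ux)), (ty, my, (ly, uy)) = ex, ey
    return tx == ty and mx == my and lx <= ly and uy <= ux and (lx, ux) != (ly, uy)

def out_scope(ex, ey):
    # y is an embedded sibling of x: same tree, same prefix occurrence, and
    # x's scope ends strictly before y's scope begins.
    (tx, mx, (lx, ux)), (ty, my, (ly, uy)) = ex, ey
    return tx == ty and mx == my and ux < ly

# In tree T0, item 2 at scope [1, 1] lies under item 1 at scope [0, 3] (in-scope),
# and comes strictly before item 4 at scope [3, 3] (out-scope).
print(in_scope((0, (), (0, 3)), (0, (), (1, 1))))        # True
print(out_scope((0, (0,), (1, 1)), (0, (0,), (3, 3))))   # True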
Computation Time
Each application of an in-scope or out-scope test takes O(1) time. Let a and b
be the number of distinct (t, m) pairs in L(x, i) and L(y, j), respectively. Let α denote
the average number of scopes with a match label. Then the time to perform
scope-list joins is O(α^2 (a + b)), which reduces to O(a + b) if α is a
small constant.
Example 6. Figure 5.7 shows an example of how scope-list joins work, using
the database D from Figure 5.5, with minsup = 100%, i.e., we want to mine
subtrees that occur in all three trees in D. The initial class with empty pre-
fix consists of four frequent items (1, 2, 3 and 4), with their scope-lists. All
Fig. 5.7. Scope-list joins: minsup = 100%. [The initial class contains items 1–4 with the scope-lists of Figure 5.5. Extending item 1 gives class [1] with frequent elements (2, 0), scope-list (0, 0, [1,1]) (1, 1, [2,2]) (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5]), and (4, 0), scope-list (0, 0, [3,3]) (1, 1, [3,3]) (2, 0, [7,7]) (2, 4, [7,7]). Extending class [1 2] gives the element (4, 0) with scope-list (0, 01, [3,3]) (1, 12, [3,3]) (2, 02, [7,7]) (2, 05, [7,7]) (2, 45, [7,7]). Infrequent elements: (5, −1) in the root class; (1, 0) and (3, 0) in class [1]; (2, 0), (2, 1) and (4, 1) in class [1 2].]
pairs of elements are considered for extension, including self-join. Consider the
extensions from item 1, which produces the new class [1] with two frequent
subtrees: (1 2 − 1) and (1 4 − 1). The infrequent subtrees are listed at the
bottom of the class.
While computing the new scope-list for the subtree (1 2 − 1) from L(1) ∩⊗
L(2), we have to perform only in-scope tests, since we want to find those
occurrences of 2 that are within some scope of 1 (i.e., under a subtree rooted at
1). Let si denote a scope for item i. For tree T0 we find that s2 = [1, 1] ⊂ s1 =
[0, 3]. Thus we add the triple (0, 0, [1, 1]) to the new scope-list. In like manner,
we test the other occurrences of 2 under 1 in trees T1 and T2 . Note that for
T2 there are three instances of the candidate pattern: s2 = [2, 2] ⊂ s1 = [0, 7],
s2 = [5, 5] ⊂ s1 = [0, 7], and s2 = [5, 5] ⊂ s1 = [4, 7]. If a new scope-list occurs
in at least minsup tids, the pattern is considered frequent.
Consider the result of extending class [1]. The only frequent pattern is
(1 2 − 1 4 − 1), whose scope-list is obtained from L(2, 0) ∩⊗ L(4, 0), by appli-
cation of the out-scope test. We need to test for disjoint scopes, with s2 < s4 ,
which have the same match label. For example we find that s2 = [1, 1] and
s4 = [3, 3] satisfy these conditions. Thus we add the triple (0, 01, [3, 3]) to
L(4, 0) in class [1 2]. Notice that the new prefix match label (01) is obtained
by appending to the old prefix match label (0) the position where 2 occurs (1).
The final scope list for the new candidate has three distinct tids, and is thus
frequent. There are no more frequent patterns at minsup= 100%.
A match label needs to be kept for an item x if and only if x occurs more than once in a subtree with tid t. Thus, if most items
occur only once in the same tree, this optimization drastically cuts down the
match label size, since the only match labels kept refer to items with more
than one occurrence. In the special case that all items in a tree are distinct,
the match label is always empty and each element of a scope-list reduces to a
(tid, scope) pair.
Example 7. Consider the scope-list of (4, 0) in class [12] in Figure 5.7. Since 4
occurs only once in T0 and T1 we can omit the match label from the first two
entries altogether, i.e., the triple (0, 01, [3, 3]) becomes a pair (0, [3, 3]), and
the triple (1, 12, [3, 3]) becomes (1, [3, 3]).
Pattern Pruning
Before adding each candidate k-subtree to a class in Ck we make sure that all
its (k − 1)-subtrees are also frequent. To perform this step efficiently, during
creation of Fk−1 (line 8), we add each individual frequent subtree into a hash
table. Thus it takes O(1) time to check each subtree of a candidate, and since
there can be k subtrees of length k − 1, it takes O(k) time to perform the
pruning check for each candidate.
Prefix Matching
Matching the prefix P of a class in a leaf against the tree T is the main step in
support counting. Let X[i] denote the ith node of subtree X, and let X[i, . . . , j]
denote the nodes from positions i to j, with j ≥ i. We use a recursive routine
to test prefix matching. At the rth recursive call we maintain the invariant
that all nodes in P [0, 1, ..., r] have been matched by nodes in T [i0 , i1 , ..., ir ],
i.e., prefix node P [0] matches T [i0 ], P [1] matches T [i1 ], and so on, and finally
P [r] matches T [ir ]. Note that while nodes in P are traversed consecutively,
the matching nodes in T can be far apart. We thus have to maintain a stack
of node scopes, consisting of the scope of all nodes from the root i0 to the
current right-most leaf ir in T . If ir occurs at depth d, then the scope stack
has size d + 1.
Assume that we have matched all nodes up to the rth node in P . If the
next node P [r + 1] to be matched is the child of P [r], we likewise search for
P [r + 1] under the subtree rooted at T [ir ]. If a match is found at position
ir+1 in T , we push ir+1 onto the scope stack. On the other hand, if the next
node P [r + 1] is outside the scope of P [r], and is instead attached to position
l (where 0 ≤ l < r), then we pop from the scope stack all nodes ik , where
l < k ≤ r, and search for P [r + 1] under the subtree rooted at T [il ]. This
process is repeated until all nodes in P have been matched. This step takes
O(kn) time in the worst case. If each item occurs once it takes O(k + n) time.
Element Matching
If P T , we search for a match in T for each element (x, k) ∈ [P ], by
searching for x starting at the subtree T [ik−1 ]. (x, k) is either a descendant
or an embedded sibling of P [k − 1]. Either check takes O(1) time. If a match
is found the support of the element (x, k) is incremented by one. If we are
interested in support (at least one occurrence in T ), the count is incremented
only once per tree; if we are interested in weighted support (all occurrences
in T ), we continue the recursive process until all matches have been found.
Fig. 5.9. Distribution of frequent trees by length. [Number of frequent trees versus pattern length for F5 (0.05%), T1M (0.05%), D10 (0.075%) and cslogs (0.3%).]
Performance Comparison
Figure 5.10 shows the performance of PatternMatcher versus Tree-
Miner. On the real cslogs data set, we find that TreeMiner is about twice
as fast as PatternMatcher down to 0.5% support. At 0.25% support Tree-
Miner outperforms PatternMatcher by a factor of more than 20! The
reason is that cslogs had a maximum pattern length of 7 at 0.5% support.
The level-wise pattern matching used in PatternMatcher is able to easily
handle such short patterns. However, at 0.25% support the maximum pat-
tern length suddenly jumped to 19, and PatternMatcher is unable to deal
efficiently with such long patterns. Exactly the same thing happens for D10
as well. For supports lower than 0.5% TreeMiner outperforms Pattern-
Matcher by a wide margin. At the lowest support the difference is a fac-
tor of 15. Both T1M and F5 have relatively short frequent subtrees. Here
too TreeMiner outperforms PatternMatcher but, for the lowest support
shown, the difference is only a factor of four. These experiments clearly indi-
cate the superiority of the scope-list-based method over the pattern-matching
method, especially as patterns become long.
Scaleup Comparison
Figure 5.11 shows how the algorithms scale with increasing number of trees
in the database D, from 10,000 to 1 million trees. At a given level of support,
we find a linear increase in the running time with increasing number of trans-
actions for both algorithms, though TreeMiner continues to be four times
as fast as PatternMatcher.
Effect of Pruning
In Figure 5.12 we evaluated the effect of candidate pruning on the performance
of PatternMatcher and TreeMiner. We find that PatternMatcher
(denoted PM in the graph) always benefits from pruning, since the fewer
the number of candidates, the lesser the cost of support counting via pat-
tern matching. On the other hand TreeMiner (labeled TM in the graph)
Fig. 5.10. [Total time (sec) versus minimum support (%) for PatternMatcher and TreeMiner on the cslogs, T1M, D10 and F5 data sets.]
Fig. 5.11. [Total time versus number of trees in the database (in 1000's) for both algorithms.]
Fig. 5.12. [Effect of candidate pruning: total time versus minimum support (%).]
subtrees of each new k-pattern. This adds significant overhead, especially for
lower supports when there are many frequent patterns. Second, the vertical
representation is extremely efficient; it is actually faster to perform scope-list
joins than to perform a pruning test.
Table 5.1 shows the number of candidates generated on the D10 data
set with no pruning, with full pruning (in PatternMatcher), and with
opportunistic pruning (in TreeMiner). Both full pruning and opportunistic
pruning are extremely effective in reducing the number of candidate patterns,
and opportunistic pruning is almost as good as full pruning (within a factor
of 1.3). Full pruning cuts down the number of candidates by a factor of 5 to
7! Pruning is thus essential for pattern-matching methods, and may benefit
scope-list methods in some cases (for high support).
contain a complete history of the user clicks. Each user session has a session
id (IP or host name) and a list of edges (uedges) giving source and target
node pairs and the time (utime) when a link is traversed. An example user
session is shown below:
<userSession name="ppp0-69.ank2.isbank.net.tr" ...>
<uedge source="5938" target="16470" utime="7:53:46"/>
<uedge source="16470" target="24754" utime="7:56:13"/>
<uedge source="16470" target="24755" utime="7:56:36"/>
<uedge source="24755" target="47387" utime="7:57:14"/>
<uedge source="24755" target="47397" utime="7:57:28"/>
<uedge source="16470" target="24756" utime="7:58:30"/>
Itemset Mining
To discover frequent sets of pages accessed we ignore all link information and
note down the unique nodes visited in a user session. The user session above
produces a user “transaction” containing the user name, and the node set,
as follows: (ppp0-69.ank2.isbank.net.tr, 5938 16470 24754 24755 47387 47397
24756).
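Deriving such a transaction is a matter of a few lines of XML processing; the sketch below (ours, operating on a simplified, stand-alone version of the session shown above rather than on a full LOGML document) collects the session name and the set of distinct nodes:

import xml.etree.ElementTree as ET

session = """<userSession name="ppp0-69.ank2.isbank.net.tr">
  <uedge source="5938" target="16470" utime="7:53:46"/>
  <uedge source="16470" target="24754" utime="7:56:13"/>
  <uedge source="16470" target="24755" utime="7:56:36"/>
  <uedge source="24755" target="47387" utime="7:57:14"/>
  <uedge source="24755" target="47397" utime="7:57:28"/>
  <uedge source="16470" target="24756" utime="7:58:30"/>
</userSession>"""

root = ET.fromstring(session)
nodes = set()
for edge in root.findall("uedge"):
    nodes.add(edge.get("source"))
    nodes.add(edge.get("target"))

# The user "transaction": session name plus the distinct nodes visited.
print(root.get("name"), sorted(nodes, key=int))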
After creating transactions for all user sessions we obtain a database that
is ready to be used for frequent set mining. We applied an association mining
algorithm to a real LOGML document from the CS website (one day’s logs).
There were 200 user sessions with an average of 56 distinct nodes in each
session. An example frequent set found is shown below. The pattern refers to
a popular Turkish poetry site maintained by one of our department members.
The user appears to be interested in the poet Akgun Akova.
Let Path=http://www.cs.rpi.edu/~name/poetry
FREQUENCY=16, NODE IDS = 16395 38699 38700 38698 5938
Path/poems/akgun_akova/index.html
Path/poems/akgun_akova/picture.html
Path/poems/akgun_akova/biyografi.html
Path/poems/akgun_akova/contents.html
Path/sair_listesi.html
Sequence Mining
If our task is to perform sequence mining, we look for the longest forward
links [7] in a user session, and generate a new sequence each time a back edge
is traversed. We applied sequence mining to the LOGML document from
the CS website. From the 200 user sessions, we obtain 8208 maximal forward
sequences, with an average sequence size of 2.8. An example frequent sequence
(shown below) indicates in what sequence the user accessed some of the pages
related to Akgun Akova. The starting page sair_listesi contains a list of
poets.
Let Path=http://www.cs.rpi.edu/~name/poetry
FREQUENCY = 20, NODE IDS = 5938 -> 16395 -> 38698
Path/sair_listesi.html ->
Tree Mining
For frequent tree mining, we can easily extract the forward edges from the
user session (avoiding cycles or multiple parents) to obtain the subtree corre-
sponding to each user. For our example user-session we get the tree: (ppp0-
69.ank2.isbank.net.tr, 5938 16470 24754 -1 24755 47387 -1 47397 -1 -1 24756
-1 -1).
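The encoding is produced by a pre-order traversal that emits −1 whenever the traversal backtracks from a non-root node; a sketch (ours, assuming the forward edges are given in traversal order and form a single tree):

def encode_tree(root_node, children):
    # Pre-order string encoding: a node id on entering it, -1 on backtracking
    # from it; the root itself is not closed.
    tokens = []
    def visit(node, is_root):
        tokens.append(str(node))
        for child in children.get(node, []):
            visit(child, False)
        if not is_root:
            tokens.append("-1")
    visit(root_node, True)
    return " ".join(tokens)

# Forward edges of the example user session, in the order they were traversed.
edges = [(5938, 16470), (16470, 24754), (16470, 24755), (24755, 47387),
         (24755, 47397), (16470, 24756)]
children = {}
for src, dst in edges:
    children.setdefault(src, []).append(dst)

print(encode_tree(5938, children))
# 5938 16470 24754 -1 24755 47387 -1 47397 -1 -1 24756 -1 -1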
We applied the TreeMiner algorithm to the CS logs. From the 200 user
sessions, we obtain 1009 subtrees (a single user session can lead to multiple
trees if there are multiple roots in the user graph), with an average record
length of 84.3 (including the back edges, -1). An example frequent subtree
found is shown below. Notice how the subtree encompasses all the partial
information of the sequence and the unordered information of the itemset
relating to Akgun Akova. The mined subtree is clearly more informative,
highlighting the usefulness of mining complex patterns.
Let Path=http://www.cs.rpi.edu/~name/poetry
Let Akova = Path/poems/akgun_akova
FREQUENCY=59, NODES = 5938 16395 38699 -1 38698 -1 38700
Path/sair_listesi.html
|
Path/poems/akgun_akova/index.html
/ | \
Akova/picture.html Akova/contents.html Akova/biyografi.html
We also ran detailed experiments on log files collected over one month
at the CS department, which touched a total of 27,343 web pages. After
processing, the LOGML database had 34,838 user graphs. We do not have
space to show the results here (we refer the reader to [25] for details), but these
results lead to interesting observations that support the mining of complex
patterns from web logs. For example, itemset mining discovers many long
patterns. Sequence mining takes a longer time but the patterns are more
useful, since they contain path information. Tree mining, though it takes more
time than sequence mining, produces very informative patterns beyond those
obtained from item-set and sequence mining.
5.9 Conclusions
store the horizontal data set, and we use the notion of a node’s scope to develop
a novel vertical representation of a tree, called a scope-list. Our formalization
of the problem is flexible enough to handle several variations. For instance,
if we assume the label on each node to be the same, our approach mines all
unlabeled trees. A simple change in the candidate tree extension procedure
allows us to discover sub-forests (disconnected patterns). Our formulation can
find frequent trees in a forest of many trees or all the frequent subtrees in a
single large tree. Finally, it is relatively easy to extend our techniques to find
unordered trees (by modifying the out-scope test) or to use the traditional
definition of a subtree. To summarize, this paper proposes a framework for
tree mining which can easily encompass most variants of the problem that
may arise in different domains.
We introduced a novel algorithm, TreeMiner, for tree mining. TreeM-
iner uses depth-first search; it also uses the novel scope-list vertical represen-
tation of trees to quickly compute the candidate tree frequencies via scope-list
joins based on interval algebra. We compared its performance against a base
algorithm, PatternMatcher. Experiments on real and synthetic data con-
firmed that TreeMiner outperforms PatternMatcher by a factor of 4 to
20, and scales linearly in the number of trees in the forest. We studied an
application of TreeMiner in web usage mining.
For future work we plan to extend our tree mining framework to incorpo-
rate user-specified constraints. Given that tree mining, though able to extract
informative patterns, is an expensive task, performing general unconstrained
mining can be too expensive and is also likely to produce many patterns that
may not be relevant to a given user. Incorporating constraints is one way to
focus the search and to allow interactivity. We also plan to develop efficient
algorithms to mine maximal frequent subtrees from dense data sets which
may have very large subtrees. Finally, we plan to apply our tree mining tech-
niques to other compelling applications, such as finding common tree patterns
in RNA structures within bioinformatics, as well as the extraction of struc-
ture from XML documents and their use in classification, clustering, and so on.
References
[1] Abiteboul, S., H. Kaplan and T. Milo, 2001: Compact labeling schemes
for ancestor queries. ACM Symp. on Discrete Algorithms.
[2] Abiteboul, S., and V. Vianu, 1997: Regular path expressions with con-
straints. ACM Int’l Conf. on Principles of Database Systems.
[3] Agrawal, R., H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo,
1996: Fast discovery of association rules. Advances in Knowledge Discov-
ery and Data Mining, U. Fayyad et al., eds., AAAI Press, Menlo Park,
CA, 307–28.
[4] Agrawal, R., and R. Srikant, 1995: Mining sequential patterns. 11th Intl.
Conf. on Data Engineering.
[5] Asai, T., K. Abe, S. Kawasoe, H. Arimura, H. Satamoto and S. Arikawa,
2002: Efficient substructure discovery from large semi-structured data.
2nd SIAM Int’l Conference on Data Mining.
[6] Asai, T., H. Arimura, T. Uno and S. Nakano, 2003: Discovering frequent
substructures in large unordered trees. 6th Int’l Conf. on Discovery Sci-
ence.
[7] Chen, M., J. Park and P. Yu, 1996: Data mining for path traversal
patterns in a web environment. International Conference on Distributed
Computing Systems.
[8] Chen, Z., H. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng and
D. Srivastava, 2001: Counting twig matches in a tree. 17th Intl. Conf. on
Data Engineering.
[9] Chi, Y., Y. Yang and R. R. Muntz, 2003: Indexing and mining free trees.
3rd IEEE International Conference on Data Mining.
[10] — 2004: HybridTreeMiner: An efficient algorithm for mining frequent
rooted trees and free trees using canonical forms. 16th International Con-
ference on Scientific and Statistical Database Management.
[11] Cole, R., R. Hariharan and P. Indyk, 1999: Tree pattern matching and
subset matching in deterministic o(n log3 n)-time. 10th Symposium on
Discrete Algorithms.
[12] Cook, D., and L. Holder, 1994: Substructure discovery using minimal
description length and background knowledge. Journal of Artificial Intel-
ligence Research, 1, 231–55.
[13] Cooley, R., B. Mobasher and J. Srivastava, 1997: Web mining: Information
and pattern discovery on the world wide web. 8th IEEE Intl. Conf. on
Tools with AI .
[14] Dehaspe, L., H. Toivonen and R. King, 1998: Finding frequent substruc-
tures in chemical compounds. 4th Intl. Conf. Knowledge Discovery and
Data Mining.
[15] Fernandez, M., and D. Suciu, 1998: Optimizing regular path expressions
using graph schemas. IEEE Int’l Conf. on Data Engineering.
[16] Huan, J., W. Wang and J. Prins, 2003: Efficient mining of frequent sub-
graphs in the presence of isomorphism. IEEE Int’l Conf. on Data Mining.
[17] Inokuchi, A., T. Washio and H. Motoda, 2000: An Apriori-based algo-
rithm for mining frequent substructures from graph data. 4th European
Conference on Principles of Knowledge Discovery and Data Mining.
[18] — 2003: Complete mining of frequent patterns from graphs: Mining graph
data. Machine Learning, 50, 321–54.
[19] Kilpelainen, P., and H. Mannila, 1995: Ordered and unordered tree inclu-
sion. SIAM J. of Computing, 24, 340–56.
150 Mohammed J. Zaki
[20] Kuramochi, M., and G. Karypis, 2001: Frequent subgraph discovery. 1st
IEEE Int’l Conf. on Data Mining.
[21] — 2004: An efficient algorithm for discovering frequent subgraphs. IEEE
Transactions on Knowledge and Data Engineering, 16, 1038–51.
[22] Li, Q., and B. Moon, 2001: Indexing and querying XML data for regular
path expressions. 27th Int’l Conf. on Very Large Databases.
[23] Nijssen, S., and J. N. Kok, 2003: Efficient discovery of frequent unordered
trees. 1st Int’l Workshop on Mining Graphs, Trees and Sequences.
[24] — 2004: A quickstart in frequent structure mining can make a difference.
ACM SIGKDD Int’l Conf. on KDD.
[25] Punin, J., M. Krishnamoorthy and M. J. Zaki, 2001: LOGML: Log
markup language for web usage mining. ACM SIGKDD Workshop on
Mining Log Data Across All Customer TouchPoints.
[26] Ruckert, U., and S. Kramer, 2004: Frequent free tree discovery in graph
data. Special Track on Data Mining, ACM Symposium on Applied Com-
puting.
[27] Shamir, R., and D. Tsur, 1999: Faster subtree isomorphism. Journal of
Algorithms, 33, 267–80.
[28] Shapiro, B., and K. Zhang, 1990: Comparing multiple RNA secondary
structures using tree comparisons. Computer Applications in Biosciences,
6(4), 309–18.
[29] Shasha, D., J. Wang and S. Zhang, 2004: Unordered tree mining with
applications to phylogeny. International Conference on Data Engineering.
[30] Termier, A., M.-C. Rousset and M. Sebag, 2002: Treefinder: a first step
towards XML data mining. IEEE Int’l Conf. on Data Mining.
[31] Wang, C., M. Hong, J. Pei, H. Zhou, W. Wang and B. Shi, 2004: Efficient
pattern-growth methods for frequent tree pattern mining. Pacific-Asia
Conference on KDD.
[32] Wang, K., and H. Liu, 1998: Discovering typical structures of documents:
A road map approach. ACM SIGIR Conference on Information Retrieval.
[33] Xiao, Y., J.-F. Yao, Z. Li and M. H. Dunham, 2003: Efficient data mining
for maximal frequent subtrees. International Conference on Data Mining.
[34] Yan, X., and J. Han, 2002: gSpan: Graph-based substructure pattern
mining. IEEE Int’l Conf. on Data Mining.
[35] — 2003: Closegraph: Mining closed frequent graph patterns. ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
[36] Yoshida, K., and H. Motoda, 1995: CLIP: Concept learning from inference
patterns. Artificial Intelligence, 75, 63–92.
[37] Zaki, M. J., 2001: Efficiently mining trees in a forest. Technical Report
01-7, Computer Science Dept., Rensselaer Polytechnic Institute.
[38] — 2002: Efficiently mining frequent trees in a forest. 8th ACM SIGKDD
Int’l Conf. Knowledge Discovery and Data Mining.
[39] Zaki, M. J. and C. Aggarwal, 2003: Xrules: An effective structural classi-
fier for XML data. 9th ACM SIGKDD Int’l Conf. Knowledge Discovery
and Data Mining.
References 151
6
Sequence Data Mining
Sunita Sarawagi
6.1 Introduction
Sequences are fundamental to modeling the three primary media of human
communication: speech, handwriting and language. They are the primary
data types in several sensor and monitoring applications. Mining models for
network-intrusion detection view data as sequences of TCP/IP packets. Text
information-extraction systems model the input text as a sequence of words
and delimiters. Customer data-mining applications profile buying habits of
customers as a sequence of items purchased. In computational biology, DNA,
RNA and protein data are all best modeled as sequences.
A sequence is an ordered set of pairs (t1, x1) . . . (tn, xn) where ti denotes an
ordered attribute like time (ti−1 ≤ ti ) and xi is an element value. The length n
of sequences in a database is typically variable. Often the first attribute is not
explicitly specified and the order of the elements is implicit in the position of
the element. Thus, a sequence x can be written as x1 . . . xn . The elements of a
sequence are allowed to be of many different types. When xi is a real number,
we get a time series. Examples of such sequences abound – stock prices over
time, temperature measurements obtained from a monitoring instrument in a
plant, or day-to-day carbon monoxide levels in the atmosphere. When xi is of
discrete or symbolic type we have a categorical sequence. Examples of such
sequences are protein sequences where each element is an amino acid that can
take one of 20 possible values, or a gene sequence where each element can take one of four possible values.
Many popular classification methods like decision trees, neural networks, and
linear discriminants like Fisher’s fall in this class. These differ a lot in what
kind of model they produce and how they train such models but they all
require the data to have a fixed set of attributes so that each data instance can be represented as a fixed-length record of attribute values.
The class with the highest posterior probability is chosen as the winner.
This method has been extensively applied to classification tasks. We can
apply it to sequence classification provided we can design a distribution that
can adequately model the probability of generating a sequence while being
trainable with realistic amounts of training data. We discuss models for doing
so next.
Denote a sequence x of n elements as x1 , . . . , xn . Applying the chain rule
we can express the probability of generating a sequence Pr(x) as a product of
n terms as follows:

Pr(x) = Pr(x1) Pr(x2 | x1) · · · Pr(xn | x1 . . . xn−1) = ∏_{i=1}^{n} Pr(xi | x1 . . . xi−1).

This general form, where the probability of generating the i-th element
depends on all previous elements, is too complex to train and too expensive
to compute. In practice, simpler forms with limited amounts of dependency
suffice. We list them in increasing order of complexity below.
Fig. 6.2. Models of increasing complexity for a sequence data set with two cate-
gorical elements “A” and “C”.
Figure 6.2(b) shows an example trained Markov model with two possible
elements. In this example, the probability of a sequence AACA is calculated
as Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 × 0.1 × 0.9 × 0.4.
During training the maximum likelihood value of the parameter Pr(vj |vk )
is estimated as the ratio of vk vj occurrences in T over the number of vk oc-
currences. The value of πj is the fraction of sequences in T that start with
value vj .
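To make these estimates concrete, the following minimal Python sketch (the function names and the toy training data below are illustrative, not from the text) trains a first-order Markov model by counting and then scores a sequence as πx1 ∏i Pr(xi | xi−1).

```python
from collections import defaultdict

def train_markov(sequences):
    """Maximum-likelihood estimates for a first-order Markov model.
    pi[v] is the fraction of sequences starting with v; trans[(vk, vj)] is the
    ratio of vk->vj occurrences over the number of times vk occurs as a predecessor."""
    start_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    prev_counts = defaultdict(int)
    for seq in sequences:
        start_counts[seq[0]] += 1
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    pi = {v: c / len(sequences) for v, c in start_counts.items()}
    trans = {(vk, vj): c / prev_counts[vk] for (vk, vj), c in pair_counts.items()}
    return pi, trans

def sequence_probability(seq, pi, trans):
    """Pr(x) = pi[x1] * prod_i Pr(x_i | x_{i-1})."""
    p = pi.get(seq[0], 0.0)
    for prev, cur in zip(seq, seq[1:]):
        p *= trans.get((prev, cur), 0.0)
    return p

# Toy example over the alphabet {"A", "C"}.
pi, trans = train_markov(["AACA", "ACCA", "AACC"])
print(sequence_probability("AACA", pi, trans))
```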
label of the node is the largest match that can be achieved with the suffix of
the sequence immediately before vj . The probability of an example sequence
Pr(AACA) is evaluated as 0.28 × 0.3 × 0.7 × 0.1. The first 0.28 is for the first
“A” in “AACA” obtained from the root node with an empty history. The
second 0.3 denotes the probability of generating “A” from the node labeled
“A”. The third “0.7” denotes the probability of generating a “C” from the
same node. The fourth multiplicand “0.1” is the probability of generating “A”
from the node labeled “AC”. The “AC”-labeled node has the largest suffix
match with the part of the sequence before the last “A”. This example shows
that calculating the probability of generating a sequence is more expensive
with a PST than with a PSA. However, PSTs are amenable to more efficient
training. Linear time algorithms exist for constructing such PSTs from train-
ing data in one single pass [2]. Simple procedures exist to convert a PST to
the equivalent PSA after the training [41].
PSTs/PSAs have been generalized to even sparser Markov models and ap-
plied to protein classification in [17] and for classifying sequences of system
calls as intrusions or not [18].
Fig. 6.3. A hidden Markov model with four states, transition and emission probabilities as shown, and starting probability π = [1 0 0 0].
A lot of work has been done on building specialized hidden Markov models for capturing
the distribution of protein sequences within a family [16].
We can exploit the Markov property of the model to design an efficient dy-
namic programming algorithm to avoid enumerating the exponential number
of paths. Let α(i, q) be the value of Σ_{q′ ∈ q_{i:q}} Pr(x_{1..i}, q′), where q_{i:q} denotes
all state sequences from 1 to i with the i-th state q and x_{1..i} denotes the part
of the sequence from 1 to i, that is x1 . . . xi. α() can be expressed recursively as

α(i, q) = Σ_{q′∈S} α(i − 1, q′) a_{q′q} b_q(xi)   if i > 1
α(i, q) = π_q b_q(xi)                              if i = 1

The value of Pr(x) can then be written as Pr(x) = Σ_q α(|x|, q).
The running time of this algorithm is O(ns) where n is the sequence length
and s is the number of states.
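A minimal sketch of this forward recursion, assuming a discrete-emission HMM whose start vector pi, transition matrix a and emission matrix b are given as NumPy arrays (the toy parameter values below are illustrative only, not the model of Figure 6.3):

```python
import numpy as np

def forward_probability(x, pi, a, b):
    """Forward algorithm: returns Pr(x) for a discrete-emission HMM.
    pi: (s,) start probabilities; a: (s, s) transitions a[q', q];
    b: (s, m) emission probabilities b[q, symbol]; x: list of symbol indices."""
    alpha = pi * b[:, x[0]]                 # alpha(1, q) = pi_q * b_q(x_1)
    for t in range(1, len(x)):
        # alpha(i, q) = sum_{q'} alpha(i-1, q') * a[q', q] * b_q(x_i)
        alpha = (alpha @ a) * b[:, x[t]]
    return alpha.sum()                      # Pr(x) = sum_q alpha(|x|, q)

# Toy example with 2 states and a 2-symbol alphabet (0 = "A", 1 = "C").
pi = np.array([1.0, 0.0])
a  = np.array([[0.9, 0.1],
               [0.5, 0.5]])
b  = np.array([[0.9, 0.1],
               [0.6, 0.4]])
print(forward_probability([0, 0, 1, 0], pi, a, b))
```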
Training an HMM
The parameters of the HMM, comprising the number of states s, the set
of symbols in the dictionary m, the edge transition matrix A, the emission
probability matrix B, and starting probability π are learnt from training data.
The training of an HMM has two phases. In the first phase we choose the
structure of the HMM, that is, the number of states s and the edges amongst
states. This is often decided manually based on domain knowledge. A number
of algorithms have also been proposed for learning the structure automatically
from the training data [42, 45]. We will not go into a discussion of these
algorithms. In the second phase we learn the probabilities, assuming a fixed
structure of the HMM.
The goal of this phase is to choose parameters Θ that maximize the likelihood of the N training sequences x^1, . . . , x^N:

argmax_Θ L(Θ) = argmax_Θ ∏_{ℓ=1}^{N} Pr(x^ℓ | Θ).     (6.1)
Since a given sequence can take multiple paths, direct estimates of the
maximum likelihood parameters is not possible. An expectation maximization
(EM) algorithm is used to estimate these parameters. For HMMs the EM
algorithm is popularly called the Baum-Welch algorithm. It starts with initial
guesses of the parameter values and then makes multiple passes over the
training sequence to iteratively refine the estimates. In each pass, first in
the E-step the previous values of parameters are used to assign the expected
probability of each transition and each emission for each training sequence.
Then, in the M -step the maximum-likelihood values of the parameters are
recalculated by a straight aggregation of the weighted assignments of the E-
step. Exact formulas can be found elsewhere [38, 39]. The above algorithm is
guaranteed to converge to the locally optimum value of the likelihood of the
training data.
Kernel classifiers such as SVMs associate with each class c a set of weights w_ic over the training sequences x_i and a bias term b_c. These parameters are learnt during training via classifier-specific methods [7]. The predicted class of a sequence x is found by computing for each class c, f(x, c) = Σ_i w_ic K(x_i, x) + b_c, and choosing the class with the highest value of f(x, c).
We can exploit kernel classifiers like SVMs for sequence classification, pro-
vided we can design appropriate kernel functions that take as input two data
points and output a real value that roughly denotes their similarity. For near-
est neighbor classifiers it is not necessary for the function to satisfy the above
two kernel properties but the basic structure of the similarity functions is of-
ten shared. We now discuss examples of similarity/kernel functions proposed
for sequence data.
A common approach is to first embed the sequence in a fixed dimen-
sional space using methods discussed in Section 6.2.1 and then compute
similarity using well-known functions like Euclidean, or any of the other
Lp norms, or a dot-product. For time series data, [31] deploys a degree-three
polynomial kernel over a fixed number of Fourier coefficients, computed as
K(x, x′) = (FFT(x) · FFT(x′) + 1)^3. The mismatch coefficients for categorical
data described in Section 6.2.1 were used in [30] with a dot-product kernel
function to perform protein classification using SVMs.
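The sketch below illustrates a kernel of the FFT-based form used in [31]. Taking the magnitudes of the first few Fourier coefficients of equal-length series is our simplifying assumption rather than the exact construction of [31].

```python
import numpy as np

def fft_features(x, n_coeffs=8):
    """Magnitudes of the first n_coeffs Fourier coefficients of a time series.
    (Assumes all series have the same fixed length; the exact feature
    construction in [31] may differ.)"""
    return np.abs(np.fft.rfft(np.asarray(x, dtype=float)))[:n_coeffs]

def fft_poly_kernel(x, y, degree=3, n_coeffs=8):
    """K(x, x') = (FFT(x) . FFT(x') + 1)^degree over the fixed-length feature vectors."""
    fx, fy = fft_features(x, n_coeffs), fft_features(y, n_coeffs)
    return float(np.dot(fx, fy) + 1.0) ** degree

print(fft_poly_kernel([1, 2, 3, 2, 1, 0, 1, 2], [1, 2, 2, 2, 1, 0, 1, 2]))
```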
Another interesting technique is to define a fixed set of dimensions from
intermediate computations of a structural generative model and then super-
impose a suitable distance function on these dimensions. Fisher’s kernel is
an example of such a kernel [23] which has been applied to the task of pro-
tein family classification. A lot of work has been done on building specialized
hidden Markov models for capturing the distribution of protein sequences
within a family [16]. The Fisher’s kernel provides a mechanism of exploit-
ing these models for building kernels to be used in powerful discriminative
classifiers like SVMs. First we train the parameters Θp of an HMM using all
positive example sequences in a family. Now, for any given sequence x the
Fisher’s co-ordinate is derived from the HMM as the derivative of the gen-
erative probability Pr(x|Θp ) with respect to each parameter of the model.
Thus x is expressed as a vector ∇Θ Pr(x|Θ) of size equal to the number of
parameters of the HMM. This intuitively captures the influence of each of the
model parameters on the sequence x and thus captures the key characteristics
of the sequence as far as the classification problem is concerned. Now, given
any two sequences x and x the distance between them can be measured using
either a scaled Euclidean or a general scaled similarity computation based on
a co-variance matrix. [23] deployed such a distance computation within a Gaussian
kernel and obtained accuracies significantly higher than those obtained by
applying the Bayes rule to generative models as discussed in Section 6.2.2.
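The sketch below illustrates the idea behind the Fisher kernel. For clarity it approximates the Fisher co-ordinates by a finite-difference gradient of the log generative probability with respect to the model parameters; an actual implementation would compute this gradient analytically (for instance via the forward-backward procedure), and [23] additionally scales the kernel using a covariance estimate. The log_prob argument is a placeholder for any trained generative model, such as the HMM forward algorithm above.

```python
import numpy as np

def fisher_vector(log_prob, theta, x, eps=1e-5):
    """Finite-difference approximation of grad_theta log Pr(x | theta).
    log_prob(x, theta) evaluates the log generative probability of x under theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        up, down = theta.copy(), theta.copy()
        up[k] += eps
        down[k] -= eps
        grad[k] = (log_prob(x, up) - log_prob(x, down)) / (2 * eps)
    return grad

def fisher_gaussian_kernel(x1, x2, log_prob, theta, sigma=1.0):
    """Gaussian kernel over the Fisher vectors of two sequences."""
    g1 = fisher_vector(log_prob, theta, x1)
    g2 = fisher_vector(log_prob, theta, x2)
    return float(np.exp(-np.sum((g1 - g2) ** 2) / (2 * sigma ** 2)))
```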
Finally, a number of sequence-specific similarity measures have also been
proposed. For real-valued elements these include measures such as the Dy-
namic Time Warping method [39] and for categorical data these include mea-
sures such as the edit distance, the more general Levenshtein distance [3], and
sequence alignment distances such as those computed by BLAST and PSI-BLAST for protein data.
This is the most popular clustering method and includes the famous K-
means and K-medoid clustering algorithms and the various hierarchical al-
gorithms [21]. The primary requirement for these algorithms is to be able to
design a similarity measure over a pair of sequences. We have already discussed
sequence similarity measures in Section 6.2.3.
In density-based clustering [21], the goal is to define clusters such that regions
of high point density in a multidimensional space are grouped together into a
connected region. The primary requirement to be able to deploy these algo-
rithms is to be able to embed the variable-length sequence data into a fixed
dimensional space. Techniques for creating such embeddings are discussed in
Section 6.2.1.
Fig. 6.4. An example showing the tagging of a sequence of nine words (“4089 Whispering Pines Nobel Drive San Diego CA 92122”) with six labels (House number, Building, Road, City, State, Zip).
As for whole sequence classification, one set of methods for the sequence tag-
ging problems is based on reduction to existing classification methods. The
simplest approach is to independently assign for each element xi of a sequence
x a label yi using features derived from the element xi . This ignores the con-
text in which xi is placed. The context can be captured by taking a window
of w elements around xi . Thus, for getting predictions for xi we would use
as input features derived from the record (xi−w . . . xi−1 xi xi+1 . . . xi+w ). Any
existing classifier like SVM or decision trees can be applied on such fixed-
dimensional record data to get a predicted value for yi . However, in several
applications the tags assigned to adjacent elements of a sequence depend on
each other and assigning independent labels may not be a good idea. A pop-
ular method of capturing such dependency is to assign tags to the sequence
elements in a fixed left to right or right to left order. The predicted labels
of the previous h positions are added as features in addition to the usual x
context features. During training, the features corresponding to each position
consist of the x-window features and the true labels of the previous h posi-
tions. This method has been applied for named-entity recognition by [46] and
for English pronunciation prediction by [15]. In Section 6.4.3 we will consider
extensions where, instead of using a fixed prediction of the previous labels,
we exploit multiple predictions, each attached to a probability value,
to obtain a globally optimal assignment.
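A minimal sketch of this window-based reduction, using scikit-learn's logistic regression as a stand-in for whichever base classifier (SVM, decision tree, etc.) one prefers; the feature encoding below is a simplifying assumption:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def window_features(x, i, w=2):
    """Features for position i: the tokens in a window of +/- w elements around x_i."""
    feats = {}
    for d in range(-w, w + 1):
        j = i + d
        feats["tok[%+d]=%s" % (d, x[j] if 0 <= j < len(x) else "<pad>")] = 1
    return feats

def train_tagger(sequences, label_sequences, w=2):
    """Reduce element-wise tagging to ordinary classification on windowed features."""
    X, y = [], []
    for x, labels in zip(sequences, label_sequences):
        for i in range(len(x)):
            X.append(window_features(x, i, w))
            y.append(labels[i])
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
    return vec, clf

def tag(x, vec, clf, w=2):
    """Predict a label independently for each position of sequence x."""
    return list(clf.predict(vec.transform([window_features(x, i, w) for i in range(len(x))])))
```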
The value of the highest probability path corresponds to max_y δ(|x|, y).
Local Models
A common variant is to define the conditional distribution of y given x as

P(y | x) = ∏_{i=1}^{n} P(yi | yi−1, xi)

The global feature vector F(x, y) is the sum of local feature vectors f(i, x, y) computed at each position:

F(x, y) = Σ_{i=1}^{|x|} f(i, x, y).     (6.3)
For the case of NER, the components of f might include the measurement
f13(i, x, y) = [[xi is capitalized]] · [[yi = I]], where the indicator function [[c]] = 1
if c is true and zero otherwise; this implies that F13(x, y) would be the number
of capitalized words xi paired with the label I.
For sequence learning, any feature f k (i, x, y) is local in the sense that the
feature at a position i will depend only on the previous labels. With a slight
abuse of notation, we claim that a local feature f k (i, x, y) can be expressed
as f k (yi , yi−1 , x, i). Some subset of these features can be simplified further to
depend only on the current state and are independent of the previous state.
We will refer to these as state features and denote them by f k (yi , x, i) when
we want to make the distinction explicit. The term transition features refers
to the remaining features that are not independent of the previous state.
A conditional random field (CRF) [26, 43] is an estimator of the form
Pr(y | x, W) = (1 / Z(x)) e^{W·F(x, y)}     (6.4)

where Z(x) = Σ_{y′} e^{W·F(x, y′)} is a normalization constant.
The inference problem for a CRF and the Maxent classifier of Equation (6.2) is
identical and is defined as follows: given W and x, find the best label sequence,
argmax y Pr(y|x, W), where Pr(y|x, W) is defined by Equation (6.4).
Let δ(i, y) be the largest value of W · F(x, y′) for any y′ ∈ y_{i:y}, the set of label
sequences from 1 to i ending in label y. The following recursive
calculation implements the usual Viterbi algorithm:

δ(i, y) = max_{y′} δ(i − 1, y′) + W · f(y, y′, x, i)   if i > 0
δ(i, y) = 0                                            if i = 0     (6.5)
The best label sequence then corresponds to the path traced by max_y δ(|x|, y).
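A sketch of recursion (6.5), assuming the dot products W · f(y, y′, x, i) are available through a score function supplied by the caller (the base case here scores position 0 directly, with y_prev set to None, rather than keeping a separate i = 0 row):

```python
def viterbi(x, labels, score):
    """Highest-scoring label sequence under delta[i][y] = max_{y'} delta[i-1][y'] + score(y, y', x, i).
    score(y, y_prev, x, i) should return W . f(y, y_prev, x, i); y_prev is None at i = 0."""
    n = len(x)
    delta = [{y: score(y, None, x, 0) for y in labels}]
    back = [{}]
    for i in range(1, n):
        delta.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[i - 1][yp] + score(y, yp, x, i))
            delta[i][y] = delta[i - 1][best_prev] + score(y, best_prev, x, i)
            back[i][y] = best_prev
    # Trace back the highest-scoring label sequence.
    y_last = max(labels, key=lambda y: delta[n - 1][y])
    path = [y_last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```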
Training algorithm
The first set of terms is easy to compute. However, we must use the
Markov property of F and a dynamic programming step to compute the
normalizer, Z_W(x), and the expected value of the features under the current
weight vector, E_{Pr(y′|W)} F(x, y′). We thus define α(i, y) as the value of

α(i, y) = Σ_{y′ ∈ y_{i:y}} e^{W·F(x, y′)}

where again y_{i:y} denotes all label sequences from 1 to i
with the i-th position labeled y. For i > 0, this can be expressed recursively as

α(i, y) = Σ_{y′∈L} α(i − 1, y′) e^{W·f(y, y′, x, i)}

For the k-th component of F, let η^k(i, y) be the value of the sum

Σ_{y′ ∈ y_{i:y}} F^k(x, y′) e^{W·F(x, y′)},
restricted to the part of the label sequence ending at position i. The following recursion
can then be used to compute η^k(i, y):

η^k(i, y) = Σ_{y′∈L} (η^k(i − 1, y′) + α(i − 1, y′) f^k(y, y′, x, i)) e^{W·f(y, y′, x, i)}

Finally we let E_{Pr(y′|W)} F^k(x, y′) = (1 / Z_W(x)) Σ_y η^k(|x|, y).
As in the forward-backward algorithm for chain CRFs [43], space require-
ments here can be reduced from K|L| + |L|n to K + |L|n, where K is the
number of features, by pre-computing an appropriate set of β values.
After training, one takes as the final learned weight vector W the average
value of Wt over all time steps t.
This simple perceptron-like training algorithm has been shown to perform
surprisingly well for sequence learning tasks in [12].
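A sketch of this averaged-perceptron training loop in the spirit of [12]; global_features computes F(x, y), and decode is any exact decoder under the current weights, such as the Viterbi sketch above:

```python
import numpy as np

def averaged_perceptron(train, global_features, decode, n_features, epochs=5):
    """Structured perceptron with weight averaging.
    train: list of (x, y) pairs with y a list of labels;
    global_features(x, y) -> F(x, y) as a numpy vector;
    decode(x, W) -> best label sequence under weights W (e.g. via Viterbi)."""
    W = np.zeros(n_features)
    W_sum = np.zeros(n_features)
    updates = 0
    for _ in range(epochs):
        for x, y in train:
            y_hat = decode(x, W)
            if y_hat != y:
                # Standard perceptron update: move towards the true labeling.
                W += global_features(x, y) - global_features(x, y_hat)
            W_sum += W
            updates += 1
    return W_sum / updates   # the averaged weight vector used after training
```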
However, the above model does not provide a sufficiently detailed model of
the text within each tag. We therefore associate with each tag an inner
HMM, embedded within the outer HMM that captures inter-tag transitions.
We found a parallel-path HMM as shown in Figure 6.6 to provide the best
accuracy while requiring little or no tuning over different tag types. In the
figure, the start and end states are dummy nodes to mark the two end points
of a tag. They do not output any token. All records of length one will pass
through the first path, those of length two through the second path, and so on.
The last path captures all records with four or more tokens. Different elements
would have different numbers of such parallel paths depending on the element
lengths observed during training.
Experimental Evaluation
We report evaluation results on the following three real-life address data sets:
• US address: The US address data set consisted of 740 addresses down-
loaded from an Internet directory.1 The addresses were segmented into six
elements: House No., Box No., Road Name, City, State, Zip.
• Student address: This data set consisted of 2388 home addresses of stu-
dents in the author’s university. These addresses were partitioned into 16
elements based on the postal format of the country. The addresses in this
set do not have the kind of regularity found in US addresses.
• Company address: This data set consisted of 769 addresses of customers
of a major national bank in a large Asian metropolis. The address was
segmented into six elements: Care Of, House Name, Road Name, Area,
City, Zipcode.
For the experiments all the data instances were first manually segmented into
their constituent elements. In each set, one-third of the data set was used
for training and the remaining two-thirds used for testing as summarized in
Table 6.1.
All tokens were converted to lower case. Each word, digit and delimiter in
the address formed a separate token to the HMM. Each record was preprocessed
so that all numbers were represented by a special symbol “digit” and all
delimiters were represented by a special “delimit” symbol.
We obtained accuracies of 99%, 88.9% and 83.7% on the US, Student and
Company data sets respectively. The Asian addresses have a much higher
complexity compared to the US addresses. The company data set had lower
accuracy because of several errors in the segmentation of data that was handed
to us.
We compare the performance of the proposed nested HMM with the fol-
lowing three automated approaches.
1 www.superpages.com
Naive HMM
This is the HMM model with just one state per element. The purpose here is
to evaluate the benefit of the nested HMM model.
Independent HMM
In this approach, for each element we train a separate HMM to extract just its
part from a text record, independent of all other elements. Each independent
HMM has a prefix and suffix state to absorb the text before and after its
own segment. Otherwise the structure of the HMM is similar to what we
used in the inner HMMs. Unlike the nested-model there is no outer HMM
to capture the dependency amongst elements. The independent HMMs learn
the relative location in the address where their element appears through the
self-loop transition probabilities of the prefix and suffix states. This is similar
to the approach used in [19] for extracting location and timings from talk
announcements.
The main idea here is to evaluate the benefit of simultaneously tagging all
the elements of a record exploiting the sequential relationship amongst the
elements using the outer HMM.
Rule learner
We compare HMM-based approaches with a rule learner, Rapier [8], a bottom-
up inductive learning system for finding information extraction rules to mark
the beginning, content and end of an entity. Like the independent HMM ap-
proach it also extracts each tag in isolation from the rest.
Fig. 6.7. Accuracy (%) of Naive HMM, Independent HMM, Rapier and DATAMOLD (the nested HMM) on the Student, Company and US data sets.
Figure 6.7 shows a comparison of the accuracy of the four methods: naive
HMM, independent HMM, rule learner and nested HMM. We can make the
following observations:
Model Training
During training, we are given examples of several paths of labeled pages where
some of the paths end in goal pages and others end with a special “fail” label.
We can treat each path as a sequence of pages denoted by the vector x and
their corresponding labels denoted by y. Each xi is a web page represented
suitably in terms of features derived from the words in the page, its URL, and
anchor text in the link pointing to xi .
A number of design decisions about the label space and feature space need
to be made in constructing a CRF to recognize characteristics of valid paths.
One option is to assign a state to each possible label in the set L which consists
of the milestone labels and two special labels “goal” and “fail”. An example
of such a model for the publications scenario is given in Figure 6.8(a) where
each circle represents a label.
State features are defined on the words or other properties comprising
a page. For example, state features derived from words are of the form
f k (i, x, yi ) = [[xi is “computer” and yi = faculty]]. The URL of a page also
yields valuable features. For example, a “tilda” in the URL is strongly associ-
ated with a personal home page and a link name with the word “contact” is
strongly associated with an address page. We tokenize each URL on delimiters
and add a feature corresponding to each token.
Transition features capture the soft precedence order amongst labels. One
set of transition features is of the form:
f k (i, x, yi , yi−1 ) = [[yi is “faculty” and yi−1 is “department”]]. They are inde-
pendent of xi and are called edge features since they capture dependency
amongst adjacent labels. In this model, transition features are also derived
from the words in and around the anchor text surrounding the link leading to
the next state. Thus, a transition feature could be of the form f k (i, x, yi , yi−1 )
= [[xi is an anchor word “advisor”, yi is “faculty”, and yi−1 is “student”]].
A second option is to model each given label as a dual-state — one for
the characteristics of the page itself (page-states) and the other for the
information around links that lead to such a page (link-states). Hence, every
path alternates between a page-state and a link-state.
In Figure 6.8(b), we show the state space corresponding to this option
for the publications domain. There are two advantages of this labeling. First,
it reduces the sparsity of parameters by making the anchor-word features
independent of the label of the source page. In practice, it is often found
that the anchor texts pointing to the same page are highly similar, and this is
captured by allowing multiple source labels to point to the same link-state
of a label. Second, for the foraging phase, it allows one to easily reason about the
intermediate probability of a path prefix where only the link is known and the
page it points to has not yet been fetched.
In this model, the state features of the page states are the same as in the
previous model and the state features of the link states are derived from the
anchor text. Thus, the anchor-text transition features of the previous model
become state features of the link states. The only transition features in
this model are then the edge features that capture the precedence order between
labels.
Path Foraging
Given the trained sequential model M and a list of starting pages of websites,
our goal is to find all paths from the list that lead to the “goal” state in M
while fetching as few unrelated pages as possible.
The key issue is to score, from a prefix of a path already fetched, all
outgoing links with a value that is inversely proportional to the expected
work involved in reaching the goal pages. Consider
a path prefix of the form P1 L2 P3 . . . Li where Li−1 is a link to page Pi in
the path. We need to find for link Li a score value that would indicate the
desirability of fetching the page pointed to by Li . This score is computed in
two parts. First, we estimate for each state y, the proximity of the state to
the goal state. We call this the reward associated with the state. Then we
compute for the link Li the probability of its being in state y.
Reward of a state
When i = n, the reward is 1 for the goal state and 0 for every other label.
Otherwise the values are computed recursively from the proximity of the next
state and the probability of transition to the next state from the current state.
We then compute a weighted sum of these positioned reward values to
get position-independent reward values. The weights are controlled via γ, a
discount factor that captures the desirability of preferring states that are closer
to the goal state as follows:
R_x = ( Σ_{k=1}^{n} γ^k · R_x^{n−k} ) / ( Σ_{k=1}^{n} γ^k )     (6.8)
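A small sketch of Equation (6.8), assuming the positioned reward values R_x^0 . . . R_x^n have already been computed by the recursion described above:

```python
def position_independent_reward(positioned_rewards, gamma):
    """Equation (6.8): R_x = sum_{k=1..n} gamma^k * R_x^{n-k} / sum_{k=1..n} gamma^k.
    positioned_rewards[j] holds R_x^j for j = 0..n; gamma is the discount factor."""
    n = len(positioned_rewards) - 1
    numerator = sum(gamma ** k * positioned_rewards[n - k] for k in range(1, n + 1))
    denominator = sum(gamma ** k for k in range(1, n + 1))
    return numerator / denominator
```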
Score of a link
Finally, the score of a link L_i, after i steps, is calculated as the sum, over states y, of the product of the probability of reaching state y and the static reward at state y:

Score(L_i) = Σ_y ( α_i(y) / Σ_{y′∈Y_L} α_i(y′) ) R(y)     (6.11)
If a link appears in multiple paths, we sum over its score from each path.
Thus, at any given snapshot of the crawl we have a set of unfetched links
whose scores we compute and maintain in a priority queue. We pick the link
with the highest score to fetch next. The links in the newly fetched page are
added to the queue. We stop when no more unfetched links have a score above
a threshold value.
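The crawl loop can be sketched as a best-first search over a priority queue of unfetched links. The helpers fetch_page, extract_links, score_link and the is_goal flag below are hypothetical placeholders for the model-specific pieces described above:

```python
import heapq

def forage(start_links, score_link, fetch_page, extract_links, threshold=0.1):
    """Best-first foraging: always fetch the unfetched link with the highest score."""
    # heapq is a min-heap, so store negative scores to pop the best link first.
    queue = [(-score_link(link, []), link, []) for link in start_links]
    heapq.heapify(queue)
    seen = set(start_links)
    goal_pages = []
    while queue:
        neg_score, link, path = heapq.heappop(queue)
        if -neg_score < threshold:
            break                      # no remaining link is promising enough
        page = fetch_page(link)        # hypothetical: download and featurize the page
        new_path = path + [page]
        if page.is_goal:               # hypothetical flag set by the trained model
            goal_pages.append(page)
        for out_link in extract_links(page):
            if out_link not in seen:
                seen.add(out_link)
                heapq.heappush(queue, (-score_link(out_link, new_path), out_link, new_path))
    return goal_pages
```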
Experimental Results
We present a summary of experiments over two applications — a task of fetch-
ing publication pages starting from university pages and a task of reaching
company contact addresses starting from a root company web page. The re-
sults are compared with generic focused crawlers [10] that are not designed to
exploit the commonality of the structure of groups such as university websites.
More details of the experiment can be found in [47].
The data sets were built manually by navigating sample websites and en-
listing the sequence of web pages from the entry page to a goal page. Se-
quences that led to irrelevant pages were identified as negative examples. The
Publications model was trained on 44 sequences (of which 28 were positive
paths) from seven university domains and computer science departments of
US universities chosen randomly from an online list.2
We show the percentage of relevant pages as a function of pages fetched for
two different websites where we applied the above trained model for finding
publications:
• www.cs.cmu.edu/, henceforth referred to as the CMU domain.
• www.cs.utexas.edu/, henceforth referred to as the UTX domain.
Performance is measured in terms of harvest rates. The harvest rate is
defined as the ratio of relevant pages (goal pages, in our case) found to the
total number of pages visited.
Figure 6.9 shows a comparison of how our model performs against the
simplified model of the accelerated focused crawler (AFC). We observe that
the performance of our model is significantly better than the AFC model. The
relevant pages fetched by the CRF model increases rapidly at the beginning
before stabilizing at over 60%, when the Crawler model barely reaches 40%.
The Address data set was trained on 32 sequences out of which 17 sequences
were positive. There was a single milestone state “About-us” in addition to
the start, goal and fail states.
The foraging experimentation on the address data set differs slightly from
the one on the Publications data set. In the Publications data set, we have
multiple goal pages with a website. During the foraging experiment, the model
aims at reaching as many goal pages as possible quickly. In effect, the model
tries to reach a hub, i.e., a page that links directly to many desired pages, such
that the outlink probability from the page to the goal state is maximum.
In the Address data set, there is only one goal page (or at most a few).
Hence, following an approach similar to that of the Publications data set
would lead to declining harvest rates once the address page is fetched. Hence
we modify the foraging run to stop when a goal page is reached. We proceed
2 www.clas.ufl.edu/CLAS/american-universities.html
Fig. 6.9. Comparison with the simplified accelerated focused crawler; the x-axis is the number of pages fetched. The graphs labeled PathLearner show the performance of our model.
with the crawling only when we have a link with a higher score of reaching
the goal state than the current page score.
The experiment was run on 108 domains of company addresses taken ran-
domly from the list of companies available at www.hoovers.com. We calculate
the average number of pages required to reach the goal page from the company
home page.
The average length of path from home page to goal page was observed to
be 3.426, with the median and mode value being 2. This agrees with the usual
practice of having a “Contact Us” link on the company home page that leads
in one link access to the contact address.
Summary
This study showed that conditional random fields provide an elegant, unified
and high-performance method of solving the information foraging task from
large domain-specific websites. The proposed model performs significantly bet-
ter than a generic focused crawler and is easy to train and deploy.
6.7 Conclusions
In this article we reviewed various techniques for analyzing sequence data. We
first studied two conventional mining operations, classification and clustering,
that work on whole sequences. We were able to exploit the wealth of existing
formalisms and algorithms developed for fixed attribute record data by defin-
ing three primitive operations on sequence data. The first primitive was to map
variable-length sequences to a fixed-dimensional space using a wealth of techniques,
ranging from aggregation after collapsing order, to k-grams that capture
limited order, to mismatch scores on k-grams. The second primitive was
defining generative models for sequences, where we considered models ranging
from simple independent models to variable-length Markov models to the
popular hidden Markov models. The third primitive was designing kernels or
similarity functions between sequence pairs where amongst standard sequence
similarity functions we discussed the interesting Fisher’s kernels that allow a
powerful integration of generative and discriminative models such as SVMs.
We studied two sequence specific operations, tagging and segmentation,
that operate on parts of the sequence and can be thought of as the equiva-
lent of classification and clustering respectively for whole sequences. Sequence
tagging is an extremely useful operation that has seen extensive applications
in the field of information extraction. We explored generative approaches like
hidden Markov models and conditional approaches like conditional random
fields (CRFs) for sequence tagging.
The field of sequence mining is still being actively explored, spurred by
emerging applications in information extraction, bio-informatics and sensor
networks. We can hope to witness more exciting research in the techniques
and application of sequence mining in the coming years.
References
[1] Adelberg, B., 1998: NoDoSE: A tool for semi-automatically extracting
structured and semistructured data from text documents. SIGMOD.
[2] Apostolico, A., and G. Bejerano, 2000: Optimal amnesic probabilistic au-
tomata or how to learn and classify proteins in linear time and space.
Proceedings of RECOMB2000 .
[3] Bilenko, M., R. Mooney, W. Cohen, P. Ravikumar and S. Fienberg,
2003: Adaptive name-matching in information integration. IEEE Intel-
ligent Systems.
[4] Borkar, V. R., K. Deshmukh and S. Sarawagi, 2001: Automatic text seg-
mentation for extracting structured records. Proc. ACM SIGMOD Inter-
national Conf. on Management of Data, Santa Barbara, USA.
[5] Borthwick, A., J. Sterling, E. Agichtein and R. Grishman, 1998: Exploit-
ing diverse knowledge sources via maximum entropy in named entity
recognition. Sixth Workshop on Very Large Corpora, New Brunswick,
New Jersey. Association for Computational Linguistics.
[6] Bunescu, R., R. Ge, R. J. Mooney, E. Marcotte and A. K. Ramani, 2002:
Extracting gene and protein names from biomedical abstracts, unpub-
lished Technical Note. Available from
URL: www.cs.utexas.edu/users/ml/publication/ie.html.
[7] Burges, C. J. C., 1998: A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2, 121–67.
[8] Califf, M. E., and R. J. Mooney, 1999: Relational learning of pattern-
match rules for information extraction. Proceedings of the Sixteenth Na-
tional Conference on Artificial Intelligence (AAAI-99), 328–34.
[9] Chakrabarti, S., 2002: Mining the Web: Discovering Knowledge from Hy-
pertext Data. Morgan Kauffman.
URL: www.cse.iitb.ac.in/∼soumen/mining-the-web/
[10] Chakrabarti, S., K. Punera and M. Subramanyam, 2002: Accelerated fo-
cused crawling through online relevance feedback. WWW, Hawaii , ACM.
[11] Chakrabarti, S., S. Sarawagi and B. Dom, 1998: Mining surprising tem-
poral patterns. Proc. of the Twentyfourth Int’l Conf. on Very Large
Databases (VLDB), New York, USA.
[12] Collins, M., 2002: Discriminative training methods for hidden Markov
models: Theory and experiments with perceptron algorithms. Empirical
Methods in Natural Language Processing (EMNLP).
[13] Crespo, A., J. Jannink, E. Neuhold, M. Rys and R. Studer, 2002: A survey
of semi-automatic extraction and transformation.
URL: www-db.stanford.edu/∼crespo/publications/.
[14] Deng, K., A. Moore and M. Nechyba, 1997: Learning to recognize time
series: Combining ARMA models with memory-based learning. IEEE Int.
Symp. on Computational Intelligence in Robotics and Automation, 1,
246–50.
[15] Dietterich, T., 2002: Machine learning for sequential data: A review.
Structural, Syntactic, and Statistical Pattern Recognition; Lecture Notes
in Computer Science, T. Caelli, ed., Springer-Verlag, 2396, 15–30.
[16] Durbin, R., S. Eddy, A. Krogh and G. Mitchison, 1998: Biological sequence
analysis: probabilistic models of proteins and nucleic acids. Cambridge
University Press.
[17] Eskin, E., W. N. Grundy and Y. Singer, 2000: Protein family classification
using sparse Markov transducers. Proceedings of the Eighth International
Conference on Intelligent Systems for Molecular Biology (ISMB-2000).
San Diego, CA.
[18] Eskin, E., W. Lee and S. J. Stolfo, 2001: Modeling system calls for intru-
sion detection with dynamic window sizes. Proceedings of DISCEX II .
[19] Freitag, D., and A. McCallum, 1999: Information extraction using HMMs
and shrinkage. Papers from the AAAI-99 Workshop on Machine Learning
for Information Extraction, 31–6.
[20] Gionis, A., and H. Mannila, 2003: Finding recurrent sources in sequences.
In Proceedings of the 7th annual conference on Computational Molecular
Biology. Berlin, Germany.
[21] Han, J., and M. Kamber, 2000: Data Mining: Concepts and Techniques.
Morgan Kaufmann.
[22] Humphreys, K., G. Demetriou and R. Gaizauskas, 2000: Two applica-
tions of information extraction to biological science journal articles: En-
zyme interactions and protein structures. Proceedings of the 2000 Pacific
Symposium on Biocomputing (PSB-2000), 502–13.
[23] Jaakkola, T., M. Diekhans and D. Haussler, 1999: Using the Fisher kernel
method to detect remote protein homologies. ISMB , 149–58.
[24] Klein, D., and C. D. Manning, 2002: Conditional structure versus con-
ditional estimation in NLP models. Workshop on Empirical Methods in
Natural Language Processing (EMNLP).
[25] Kushmerick, N., D. Weld and R. Doorenbos, 1997: Wrapper induction for
information extraction. Proceedings of IJCAI .
[26] Lafferty, J., A. McCallum and F. Pereira, 2001: Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. Proceed-
ings of the International Conference on Machine Learning (ICML-2001),
Williams, MA.
[27] Laplace, P.-S., 1995: Philosophical Essays on Probabilities. Springer-
Verlag, New York, translated by A. I. Dale from the 5th French edition
of 1825.
[28] Lawrence, S., C. L. Giles and K. Bollacker, 1999: Digital libraries and
autonomous citation indexing. IEEE Computer , 32, 67–71.
[29] Lee, W., and S. Stolfo, 1998: Data mining approaches for intrusion detec-
tion. Proceedings of the Seventh USENIX Security Symposium (SECU-
RITY ’98), San Antonio, TX .
[30] Leslie, C., E. Eskin, J. Weston, and W. S. Noble, 2004: Mismatch string
kernels for discriminative protein classification. Bioinformatics, 20, 467–
76.
[31] Li, D., K. Wong, Y. H. Hu and A. Sayeed, 2002: Detection, classifica-
tion and tracking of targets in distributed sensor networks. IEEE Signal
Processing Magazine, 19.
[32] Liu, D. C., and J. Nocedal, 1989: On the limited memory BFGS method
for large-scale optimization. Mathematical Programming, 45, 503–28.
[33] Malouf, R., 2002: A comparison of algorithms for maximum entropy pa-
rameter estimation. Proceedings of The Sixth Conference on Natural Lan-
guage Learning (CoNLL-2002), 49–55.
[34] McCallum, A., D. Freitag and F. Pereira, 2000: Maximum entropy Markov
models for information extraction and segmentation. Proceedings of the
International Conference on Machine Learning (ICML-2000), Palo Alto,
CA, 591–8.
[35] McCallum, A. K., K. Nigam, J. Rennie, and K. Seymore, 2000: Automat-
ing the construction of Internet portals with machine learning. Informa-
tion Retrieval Journal , 3, 127–63.
[36] Muslea, I., 1999: Extraction patterns for information extraction tasks: A
survey. The AAAI-99 Workshop on Machine Learning for Information
Extraction.
[37] Muslea, I., S. Minton and C. A. Knoblock, 1999: A hierarchical approach
to wrapper induction. Proceedings of the Third International Conference
on Autonomous Agents, Seattle, WA.
[38] Rabiner, L., 1989: A tutorial on Hidden Markov Models and selected
applications in speech recognition. Proceedings of the IEEE, 77(2).
[39] Rabiner, L., and B.-H. Juang, 1993: Fundamentals of Speech Recognition,
Prentice-Hall, Chapter 6.
[40] Ratnaparkhi, A., 1999: Learning to parse natural language with maximum
entropy models. Machine Learning, 34.
[41] Ron, D., Y. Singer and N. Tishby, 1996: The power of amnesia: learning
probabilistic automata with variable memory length. Machine Learning,
25, 117–49.
[42] Seymore, K., A. McCallum and R. Rosenfeld, 1999: Learning Hidden
Markov Model structure for information extraction. Papers from the
AAAI-99 Workshop on Machine Learning for Information Extraction,
37–42.
[43] Sha, F., and F. Pereira, 2003: Shallow parsing with conditional random
fields. In Proceedings of HLT-NAACL.
[44] Soderland, S., 1999: Learning information extraction rules for semi-
structured and free text. Machine Learning, 34.
[45] Stolcke, A., 1994: Bayesian Learning of Probabilistic Language Models.
Ph.D. thesis, UC Berkeley.
[46] Takeuchi, K., and N. Collier, 2002: Use of support vector machines in
extended named entity recognition. The 6th Conference on Natural Lan-
guage Learning (CoNLL).
[47] Vydiswaran, V., and S. Sarawagi, 2005: Learning to extract information
from large websites using sequential models. COMAD.
[48] Warrender, C., S. Forrest and B. Pearlmutter, 1999: Detecting intrusions
using system calls: Alternative data models. IEEE Symposium on Security
and Privacy.
7
Link-based Classification
Lise Getoor
Summary. A key challenge for machine learning is the problem of mining richly
structured data sets, where the objects are linked in some way due to either an
explicit or implicit relationship that exists between the objects. Links among the
objects demonstrate certain patterns, which can be helpful for many machine learn-
ing tasks and are usually hard to capture with traditional statistical models. Re-
cently there has been a surge of interest in this area, fuelled largely by interest in
web and hypertext mining, but also by interest in mining social networks, biblio-
graphic citation data, epidemiological data and other domains best described using
a linked or graph structure. In this chapter we propose a framework for modeling
link distributions, a link-based model that supports discriminative models describing
both the link distributions and the attributes of linked objects. We use a structured
logistic regression model, capturing both content and links. We systematically eval-
uate several variants of our link-based model on a range of data sets including both
web and citation collections. In all cases, the use of the link distribution improves
classification performance.
7.1 Introduction
Traditional data mining tasks such as association rule mining, market basket
analysis and cluster analysis commonly attempt to find patterns in a data set
characterized by a collection of independent instances of a single relation. This
is consistent with the classical statistical inference problem of trying to identify
a model given a random sample from a common underlying distribution.
A key challenge for machine learning is to tackle the problem of mining
more richly structured data sets, for example multi-relational data sets in
which there are record linkages. In this case, the instances in the data set are
linked in some way, either by an explicit link, such as a URL, or a constructed
link, such as join between tables stored in a database. Naively applying tradi-
tional statistical inference procedures, which assume that instances are inde-
pendent, can lead to inappropriate conclusions [15]. Care must be taken that
potential correlations due to links are handled appropriately. Clearly, this is
not perform as well as a structured logistic regression model that combines one
logistic regression model built over content with a separate logistic regression
model built over links.
Having learned a model, the next challenge is classification using the
learned model. A learned link-based model specifies a distribution over link
and content attributes and, unlike traditional statistical models, these at-
tributes may be correlated. Intuitively, for linked objects, updating the cat-
egory of one object can influence our inference about the categories of its
linked neighbors. This requires a more complex classification algorithm. Iter-
ative classification and inference algorithms have been proposed for hypertext
categorization [4, 28] and for relational learning [17, 25, 31, 32]. Here, we
also use an iterative classification algorithm. One novel aspect is that un-
like approaches that make assumptions about the influence of the neighbor’s
categories (such as that linked objects have similar categories), we explicitly
learn how the link distribution affects the category. We also examine a range
of ordering strategies for the inference and evaluate their impact on overall
classification accuracy.
7.2 Background
There has been a growing interest in learning from structured data. By struc-
tured data, we simply mean data best described by a graph where the nodes
in the graph are objects and the edges/hyper-edges in the graph are links or
relations between objects. Tasks include hypertext classification, segmenta-
tion, information extraction, searching and information retrieval, discovery of
authorities and link discovery. Domains include the world-wide web, biblio-
graphic citations, criminology and bio-informatics, to name just a few. Learning
tasks range from predictive tasks, such as classification, to descriptive tasks,
such as the discovery of frequently occurring sub-patterns.
Here, we describe some of the work most closely related to ours; however,
because of the surge of interest in recent years, and the wide range of venues
where research is reported (including the International World Wide Web Con-
ference (WWW), the Conference on Neural Information Processing (NIPS),
the International Conference on Machine Learning (ICML), the International
ACM conference on Information Retrieval (SIGIR), the International Confer-
ence of Management of Data (SIGMOD) and the International Conference on
Very Large Databases (VLDB)), our list is sure to be incomplete.
Probably the most famous example of exploiting link structure is the use
of links to improve information retrieval results. Both the well-known page
rank [29] and hubs and authority scores [19] are based on the link-structure
of the web. These algorithms work using in-links and out-links of the web
pages to evaluate the importance or relevance of a web page. Other work,
such as that of Dean and Henzinger [8], proposes an algorithm based on co-citation to find
related web pages. Our work is not directly related to this class of link-based
algorithms.
One line of work more closely related to link-based classification is the
work on hypertext and web page classification. This work has its roots in the
information retrieval community. A hypertext collection has a rich structure
beyond that of a collection of text documents. In addition to words, hyper-
text has both incoming and outgoing links. Traditional bag-of-words models
discard this rich structure of hypertext and do not make full use of the link
structure of hypertext.
Beyond making use of links, another important aspect of link-based classi-
fication is the use of unlabeled data. In supervised learning, it is expensive and
labor-intensive to construct a large, labeled set of examples. However in many
domains it is relatively inexpensive to collect unlabeled examples. Recently
several algorithms have been developed to learn a model from both labeled
and unlabeled examples [1, 27, 34]. Successful applications in a number of ar-
eas, especially text classification, have been reported. Interestingly, a number
of results show that while careful use of unlabeled data is helpful, it is not
always the case that more unlabeled data improves performance [26].
Blum and Mitchell [2] propose a co-training algorithm to make use of un-
labeled data to boost the performance of a learning algorithm. They assume
that the data can be described by two separate feature sets which are not
completely correlated, and each of which is predictive enough for a weak pre-
dictor. The co-training procedure works to augment the labeled sample with
data from unlabeled data using these two weak predictors. Their experiments
show positive results on the use of unlabeled examples to improve the per-
formance of the learned model. In [24], the author states that many natural
learning problems fit the problem class where the features describing the ex-
amples are redundantly sufficient for classifying the examples. In this case, the
unlabeled data can significantly improve learning accuracy. There are many
problems falling into this category: web page classification; semantic classifi-
cation of noun phrases; learning to select word sense and object recognition
in multimedia data.
Nigam et al. [27] introduce an EM algorithm for learning a naive Bayes
classifier from labeled and unlabeled examples. The algorithm first trains a
classifier based on labeled documents and then probabilistically classifies the
unlabeled documents. Then both labeled and unlabeled documents participate
in the learning procedure. This process repeats until it converges. The ideas of
using co-training and EM algorithms for learning from labeled and unlabeled
data are fully investigated in [13].
Joachims et al. [18] propose a transductive support vector machine
(TSVM) for text classification. A TSVM takes into account a particular test
set and tries to optimize the classification accuracy for that particular test
set. This also is an important means of using labeled and unlabeled examples
for learning.
In other recent work on link mining [12, 25, 31], models are learned from
fully labeled training examples and evaluated on a disjoint test set. In some
cases, the separation occurs naturally, for example in the WebKB data set
[6]. This data set describes the web pages at four different universities, and
one can naturally split the data into a collection of training schools and a test
school, and there are no links from the test school web pages to the training
school pages. But in other cases, the data sets are either manipulated to
extract disconnected components, or the links between the training and test
sets are simply ignored. One major disadvantage of this approach is that it
discards links between labeled and unlabeled data which may be very helpful
for making predictions or may artificially create skewed training and test sets.
Chakrabarti et al. [4] proposed an iterative relaxation labeling algorithm
to classify a patent database and a small web collection. They examine us-
ing text, neighboring text and neighbor class labels for classification in a
rather realistic setting wherein some portion of the neighbor class labels are
known. In the start of their iteration, a bootstrap mechanism is introduced
to classify unlabeled documents. After that, classes from labeled and unla-
beled documents participate in the relaxation labeling iteration. They showed
that naively incorporating words from neighboring pages reduces performance,
while incorporating category information, such as hierarchical category pre-
fixes, improves performance.
Oh et al. [28] also suggest an incremental categorization method, where the
classified documents can take part in the categorization of other documents in
the neighborhood. In contrast to the approach used in Chakrabarti et al., they
do not introduce a bootstrap stage to classify all unlabeled documents. In-
stead they incrementally classify documents and take into account the classes
of unlabeled documents as they become available in the categorization process.
They report similar results on a collection of encyclopedia articles: merely in-
corporating words from neighboring documents was not helpful, while making
use of the predicted class of neighboring documents was helpful.
Popescul et al. [30] study the use of inductive logic programming (ILP) to
combine text and link features for classification. In contrast to Chakrabarti
et al. and Oh et al., where class labels are used as features, they incorporate
the unique document IDs of the neighborhood as features. Their results also
demonstrate that the combination of text and link features often improves
performance.
These results indicate that simply assuming that link documents are on the
same topic and incorporating the features of linked neighbors is not generally
effective. One approach is to identify certain types of hypertext regularities
such as encyclopedic regularity (linked objects typically have the same class)
and co-citation regularity (linked objects do not share the same class, but
objects that are cited by the same object tend to have the same class). Yang et
al. [33] compare several well-known categorization learning algorithms: naive
Bayes [22], kNN [7], and FOIL on three data sets. They find that adding words
from linked neighbors is sometimes helpful for categorization and sometimes harmful.
7.3.1 Definitions
1 Essentially this is a propositionalization [11, 20] of the aspects of the neighbor-
hood of an object in the graph. This is a technique that has been proposed in the
inductive logic programming community and is applicable here.
Fig. 7.1. Assuming there are three possible categories for objects, A, B and C, the
figure shows examples of the mode, binary and count link features constructed for
the object labeled with ?.
about the individual entity to which the object is connected, we maintain the
frequencies of the different categories.
A middle ground between these two is a simple binary feature vector; for
each category, if a link to an object of that category occurs at least once,
the corresponding feature is 1; the feature is 0 if there are no links to this
category. In this case, we use the term binary-link model. Figure 7.1 shows
examples of the three types of link features computed for an object for each
category of links (In links, Out links, Co-In links and Co-Out links).
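As an illustration, the following Python sketch builds the mode, binary and count feature vectors for one link type from the categories of an object's linked neighbors. The three-category alphabet, the one-hot encoding of the mode and all names are illustrative assumptions, not the chapter's implementation.

from collections import Counter

CATEGORIES = ["A", "B", "C"]  # assumed fixed ordering of the possible categories

def link_features(neighbor_categories):
    """Mode, binary and count link features for one link type."""
    counts = Counter(neighbor_categories)
    count_vec = [counts.get(c, 0) for c in CATEGORIES]    # frequency of each category
    binary_vec = [1 if n > 0 else 0 for n in count_vec]   # does the category occur at all?
    mode_vec = [0] * len(CATEGORIES)                      # one-hot vector for the most frequent category
    if counts:
        most_frequent = max(CATEGORIES, key=lambda c: counts.get(c, 0))
        mode_vec[CATEGORIES.index(most_frequent)] = 1
    return {"mode": mode_vec, "binary": binary_vec, "count": count_vec}

# e.g. two neighbors of category A and one of category B:
print(link_features(["A", "A", "B"]))
# {'mode': [1, 0, 0], 'binary': [1, 1, 0], 'count': [2, 1, 0]}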
The object attributes and link features are used to train regularized logistic regression models, whose parameters are estimated by minimizing

ŵ = argmin_w (1/n) Σ_{i=1}^{n} ln(1 + exp(−w^T x_i c_i)) + λ‖w‖²

where the x_i are the training feature vectors, the c_i the corresponding class labels, and λ the regularization parameter.
The simplest model is a flat model, which uses a single logistic regression
model over both the object attributes and link features. We found that this
model did not perform well, and instead we found that a structured logistic
regression model, which uses separate logistic regression models (with differ-
ent regularization parameters) for the object features and the link features,
outperformed the flat model. Now the MAP estimation for categorization be-
comes
Ĉ(X) = argmax_{c∈C} [ P(c | OA(X)) ∏_{t∈{In,Out,Co-In,Co-Out}} P(c | LD_t(X)) ] / P(c)
where OA(X) are the object features and LDt (X) are the link features for
each of the different types of links t and we make the (probably incorrect)
assumption that they are independent. P (c | OA(X)) and P (c | LDt (X)) are
defined as
P(c | OA(X)) = 1 / (exp(−w_o^T OA(X) c) + 1)

P(c | LD_t(X)) = 1 / (exp(−w_l^T LD_t(X) c) + 1)

where w_o and w_l are the parameters of the regularized logistic regression models for P(c | OA(X)) and P(c | LD_t(X)), respectively.
The posterior probability over the unlabeled data D^u given all of the data D is

P(c(X) : X ∈ D^u | D) = ∏_{X∈D^u} P(c(X) | OA(X), LD_In(X), LD_Out(X), LD_Co-In(X), LD_Co-Out(X)).

Next, this categorized D^u and the labeled data D^l are used to build a new model.
Step 1: (Initialization) Build an initial structured logistic regression classifier
using content and link features using only the labeled training data.
Step 2: (Iteration) Loop while the posterior probability over the unlabeled
test data increases:
1. Classify unlabeled data using the current model.
2. Recompute the link features of each object. Re-estimate the parame-
ters of the logistic regression models.
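The two steps above can be sketched as the following Python skeleton. This is an illustrative reconstruction rather than the chapter's implementation: scikit-learn's LogisticRegression stands in for the regularized logistic regression models, objects are integer indices 0..N−1, the class prior is assumed to be ordered consistently with the fitted classes, and a fixed iteration cap replaces the posterior-increase stopping test.

import numpy as np
from sklearn.linear_model import LogisticRegression

LINK_TYPES = ["in", "out", "co_in", "co_out"]

def count_link_features(graph, labels, classes):
    """For every object and link type, count how many neighbors currently carry
    each class label (neighbors without a label are skipped)."""
    feats = {t: np.zeros((len(graph), len(classes))) for t in LINK_TYPES}
    for x, neighbors in graph.items():
        for t in LINK_TYPES:
            for nb in neighbors.get(t, []):
                if nb in labels:
                    feats[t][x, classes.index(labels[nb])] += 1
    return feats

def fit_structured_model(content, link_feats, y, train_idx, lam=1e-4):
    """One regularized logistic regression over the content features and one per
    link type; C plays the role of the inverse regularization parameter."""
    models = {"content": LogisticRegression(C=1.0 / lam, max_iter=1000)}
    models["content"].fit(content[train_idx], y[train_idx])
    for t in LINK_TYPES:
        models[t] = LogisticRegression(C=1.0 / lam, max_iter=1000)
        models[t].fit(link_feats[t][train_idx], y[train_idx])
    return models

def classify(models, content, link_feats, prior):
    """argmax_c P(c|OA(X)) * prod_t P(c|LD_t(X)) / P(c), computed in log space."""
    score = np.log(models["content"].predict_proba(content) + 1e-12)
    for t in LINK_TYPES:
        score += np.log(models[t].predict_proba(link_feats[t]) + 1e-12)
    score -= np.log(prior)
    return models["content"].classes_[np.argmax(score, axis=1)]

def iterative_classification(graph, content, y, labeled_idx, test_idx,
                             classes, prior, iters=10):
    """Step 1: train on the labeled data. Step 2: repeatedly classify the test
    objects, recompute the link features and re-estimate the model parameters."""
    labels = {i: y[i] for i in labeled_idx}
    link_feats = count_link_features(graph, labels, classes)
    models = fit_structured_model(content, link_feats, y, labeled_idx)
    for _ in range(iters):
        preds = classify(models, content[test_idx],
                         {t: link_feats[t][test_idx] for t in LINK_TYPES}, prior)
        labels.update({i: c for i, c in zip(test_idx, preds)})
        link_feats = count_link_features(graph, labels, classes)
        models = fit_structured_model(content, link_feats, y, labeled_idx)
    return labels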
7.6 Results
We evaluated our link-based classification algorithm on two variants of the
Cora data set [23], a data set that we constructed from CiteSeer entries [14]
and WebKB [6].
The first Cora data set, CoraI, contains 4187 machine learning papers,
each categorized into one of seven possible topics. We consider only the 3181
papers that are cited or cite other papers. There are 6185 citations in the data
set. After stemming and removing stop words and rare words, the dictionary
contains 1400 words.
The second Cora data set, CoraII,2 contains 30,000 papers, each catego-
rized into one of ten possible topics: information retrieval, databases, artifi-
cial intelligence, encryption and compression, operating systems, networking,
hardware and architecture, data structure algorithms and theory, program-
ming, and human–computer interaction. We consider only the 3352 documents
that are cited or cite other papers. There are 8594 citations in the data set.
2 www.cs.umass.edu/~mccallum/code-data.html
After stemming and removing stop words and rare words, the dictionary con-
tains 3174 words.
The CiteSeer data set has 3312 papers from six categories: Agents, Artifi-
cial Intelligence, Database, Human Computer Interaction, Machine Learning
and Information Retrieval. There are 7522 citations in the data set. After
stemming and removing stop words and rare words, the dictionary for Cite-
Seer contains 3703 words.
The WebKB data set contains web pages from four computer science de-
partments, categorized into topics such as faculty, student, project, course
and a catch-all category, other. In our experiments we discard pages in the
“other” category, which generates a data set with 700 pages. After stemming
and removing stop words, the dictionary contains 2338 words. For WebKB,
we train on three schools, plus 2/3 of the fourth school, and test on the last
1/3.
On Cora and CiteSeer, for each experiment, we take one split as a test
set, and the remaining two splits are used to train our model: one for training
and the other for a validation set used to find the appropriate regularization
parameter λ. Common values of λ were 10^−4 or 10^−5. On WebKB, we learned
models for a variety of λ; here we show the best result.
In our experiments, we compared a baseline classifier (Content) with our link-based classifiers (Mode, Binary and Count):
• Content: Uses only object attributes.
• Mode: Combines a logistic regression classifier over the object attributes
with separate logistic regression classifiers over the mode of the In Links,
Out Links, Co-In Links, and Co-Out Links.
• Binary: Combines a logistic regression classifier over the object attributes
with a separate logistic regression classifier over the binary link statistics
for all of the links.
• Count: Combines a logistic regression classifier over the object attributes with a separate logistic regression classifier over the count link statistics for all of the links.
Table 7.1. Results with Content, Mode, Binary and Count models on CoraI,
CoraII, CiteSeer and WebKB. Statistically significant results (at or above 90% con-
fidence level) for each row are shown in bold.
CoraI              Content   Mode    Binary   Count
avg accuracy         68.14   82.35    77.53   83.14
avg precision        67.47   81.01    77.35   81.74
avg recall           63.08   80.08    76.34   81.20
avg F1 measure       64.17   80.00    75.69   81.14

CoraII             Content   Mode    Binary   Count
avg accuracy         67.55   83.03    81.46   83.66
avg precision        65.87   78.62    74.54   80.62
avg recall           47.51   75.27    75.69   76.15
avg F1 measure       52.11   76.52    74.62   77.77

CiteSeer           Content   Mode    Binary   Count
avg accuracy         60.59   71.01    69.83   71.52
avg precision        55.48   64.61    62.60   65.22
avg recall           55.33   60.09    60.30   61.22
avg F1 measure       53.08   60.68    60.28   61.87

WebKB              Content   Mode    Binary   Count
avg accuracy         87.45   88.52    78.91   87.93
avg precision        78.67   77.27    70.48   77.71
avg recall           72.82   73.43    71.32   73.33
avg F1 measure       71.77   73.03    66.41   72.83
Fig. 7.2. Average F1 measure for different models (Content, Mode, Binary and
Count) on four data sets (CoraI, CoraII, CiteSeer and WebKB).
In this set of experiments, all of the links (In Links, Out Links, Co-In
Links, Co-Out Links) are used and we use a fixed ordering for the iterative
classification algorithm.
For all four data sets, the link-based models outperform the content only
models. For three of the four data sets, the difference is statistically significant
at the 99% significance level. For three of the four data sets, count outper-
forms mode at the 90% significance level or higher, for both accuracy and F1
measure. Both mode and count outperform binary; the difference is most
dramatic for CoraI and WebKB.
Clearly, the mode, binary and count link-based models are using infor-
mation from the description of the link neighborhood of an object to improve
classification performance. Mode and count seem to make the best use of
the information; one explanation is that while binary contains more informa-
tion in terms of which categories of links exist, it loses the information about
which link category is most frequent. In many domains one might think that
mode should be enough information, particularly bibliographic domains. So it
is somewhat surprising that the count model is the best for our three citation
data sets.
Our results on WebKB were less reliable. Small changes to the ways that
we structured the classifiers resulted in different outcomes. Overall, we felt
there were problems because the link distributions were quite different among
the different schools. Also, after removing the other pages, the data set is
rather small.
Fig. 7.3. Average F1 measure for Count on four data sets (CoraI, CoraII, CiteSeer
and WebKB) for varying content and links (Content, Links, In Links & Content,
Out Links & Content, Co-In Links & Content, Co-Out links & Content and Links
& Content).
all the links but no content (Links), and Links & Content (which gave us the best results in the previous section). Figure 7.3 shows the average F1 measure for the four data sets using different link types.
Clearly using all of the links performs best. Individually, the Out Links
and Co-In Links seem to add the most information, although again, the
results for WebKB are less definitive.
More interesting is the difference in results when using only Links versus
Links & Content. For CoraI and CiteSeer, Links only performs reasonably
well, while for the other two cases, CoraII and WebKB, it performs horribly.
Recall that the content helps give us an initial starting point for the iterative
classification algorithm. Our theory is that, for some data sets, especially
those with fewer links, getting a good initial starting point is very important.
In others, there is enough information in the links to overcome a bad starting
point for the iterative classification algorithm. This is an area that requires
further investigation.
Table 7.2. Avg F1 results using “Test Links Only” and “Complete Links” on CoraI,
CoraII, CiteSeer and WebKB.
              Test Links Only                Complete Links
           Mode     Binary   Count        Mode     Binary   Count
CoraI      75.85    71.57    79.16        80.00    75.69    81.14
CoraII     58.70    58.19    61.50        76.52    74.62    77.77
CiteSeer   59.06    60.03    60.74        60.68    60.28    61.87
WebKB      73.02    67.29    71.79        73.03    66.41    72.83
Next, we evaluated the utility of learning with labeled and unlabeled data using the iterative algorithm proposed in Section 7.5. To better understand the effects of unlabeled data, we
compared the performance of our algorithm with varying amounts of labeled
and unlabeled data.
For two of the domains, CoraII and CiteSeer, we randomly choose 20%
of the data as test data. We compared the performance of the algorithms
when different percentages (20%, 40%, 60%, 80%) of the remaining data are labeled. We compared the accuracy when only the labeled data is used for
training (Labeled only) with the case where both labeled and the remaining
unlabeled data is used for training (Labeled and Unlabeled).
• Content: Uses only object attributes.
• Labeled Only: The link model is learned on labeled data only. The only
unlabeled data used is the test set.
• Labeled and Unlabeled: The link model is learned on both labeled and
all of the unlabeled data.
Figure 7.4 shows the results averaged over five different runs. The algo-
rithm which makes use of all of the unlabeled data gives better performance
than the model which uses only the labeled data.
For both data sets, the algorithm which uses both labeled and unlabeled
data outperforms the algorithm which uses Labeled Only data; even with 80%
of the data labeled and only 20% of the data unlabeled, the improvement in
error on the test set using unlabeled data is statistically significant at the 95% confidence level for both CoraII and CiteSeer.
Fig. 7.4. Results varying the amount of labeled and unlabeled data used for training on (a) CoraII and (b) CiteSeer. The results are averages of five runs.
Fig. 7.5. The convergence rates of different iteration methods (Random, INC-OUT, DEC-OUT and PP) on the CoraII data set, plotted as accuracy versus the number of updates.
7.7 Conclusions
Many real-world data sets have rich structures, where the objects are linked
in some way. Link mining targets data-mining tasks on this richly-structured
data. One major task of link mining is to model and exploit the link distribu-
tions among objects. Here we focus on using the link structure to help improve
classification accuracy.
In this chapter we have proposed a simple framework for modeling link
distributions, based on link statistics. We have seen that for the domains we
examined, a combined logistic classifier built over the object attributes and
link statistics outperforms a simple content-only classifier. We found the ef-
fect of different link types is significant. More surprisingly, the mode of the
link statistics is not always enough to capture the dependence. Avoiding the
assumption of homogeneity of labels and modeling the distribution of the link
categories at a finer grain is useful.
Acknowledgments: I’d like to thank Prithviraj Sen and Qing Lu for their
work on the implementation of the link-based classification system. This study
was supported by NSF Grant 0308030 and the Advanced Research and De-
velopment Activity (ARDA) under Award Number NMA401-02-1-2018. The
views, opinions, and findings contained in this report are those of the author
and should not be construed as an official Department of Defense position,
policy, or decision unless so designated by other official documentation.
References
[1] Blum, A., and S. Chawla, 2001: Learning from labeled and unlabeled
data using graph mincuts. Proc. 18th International Conf. on Machine
Learning. Morgan Kaufmann, San Francisco, CA, 19–26.
[2] Blum, A. and T. Mitchell, 1998: Combining labeled and unlabeled data
with co-training. COLT: Proceedings of the Workshop on Computational
Learning Theory. Morgan Kaufmann.
[3] Chakrabarti, S., 2002: Mining the Web. Morgan Kaufmann.
[4] Chakrabarti, S., B. Dom and P. Indyk, 1998: Enhanced hypertext cate-
gorization using hyperlinks. Proc of SIGMOD-98 .
[5] Cook, D., and L. Holder, 2000: Graph-based data mining. IEEE Intelli-
gent Systems, 15, 32–41.
[6] Craven, M., D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell,
K. Nigam and S. Slattery, 1998: Learning to extract symbolic knowledge
from the world wide web. Proc. of AAAI-98 .
[7] Dasarathy, B. V., 1991: Nearest neighbor norms: NN pattern classification
techniques. IEEE Computer Society Press, Los Alamitos, CA.
[8] Dean, J., and M. Henzinger, 1999: Finding related pages in the World
Wide Web. Computer Networks, 31, 1467–79.
[9] Dzeroski, S., and N. Lavrac, eds., 2001: Relational Data Mining. Kluwer,
Berlin.
[10] Feldman, R., 2002: Link analysis: Current state of the art. Tutorial at the
KDD-02 .
[11] Flach, P., and N. Lavrac, 2000: The role of feature construction in induc-
tive rule learning. Proc. of the ICML2000 workshop on Attribute-Value
and Relational Learning: crossing the boundaries.
[12] Getoor, L., N. Friedman, D. Koller and B. Taskar, 2002: Learning prob-
abilistic models with link uncertainty. Journal of Machine Learning Re-
search.
[13] Ghani, R., 2001: Combining labeled and unlabeled data for text clas-
sification with a large number of categories. Proceedings of the IEEE
International Conference on Data Mining, N. Cercone, T. Y. Lin and
X. Wu, eds., IEEE Computer Society, San Jose, US, 597–8.
[14] Giles, C., K. Bollacker, and S. Lawrence, 1998: CiteSeer: An automatic
citation indexing system. ACM Digital Libraries 98 .
[15] Jensen, D., 1999: Statistical challenges to inductive inference in linked
data. Seventh International Workshop on Artificial Intelligence and
Statistics.
[16] Jensen, D., and H. Goldberg, 1998: AAAI Fall Symposium on AI and
Link Analysis. AAAI Press.
[17] Jensen, D, J. Neville. and B. Gallagher, 2004: Why collective inference
improves relational classification. Proceedings of the 10th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining.
[18] Joachims, T., 1999: Transductive inference for text classification using
support vector machines. Proceedings of ICML-99, 16th International
Conference on Machine Learning, I. Bratko and S. Dzeroski, eds., Morgan
Kaufmann, San Francisco, US, 200–9.
[19] Kleinberg, J., 1999: Authoritative sources in a hyperlinked environment.
Journal of the ACM , 46, 604–32.
[20] Kramer, S., N. Lavrac and P. Flach, 2001: Propositionalization ap-
proaches to relational data mining. Relational Data Mining, S. Dzeroski
and N. Lavrac, eds., Kluwer, 262–91.
[21] Macskassy, S., and F. Provost, 2003: A simple relational classifier. KDD
Workshop on Multi-Relational Data Mining.
[22] McCallum, A., and K. Nigam, 1998: A comparison of event models for
naive Bayes text classification. AAAI-98 Workshop on Learning for Text
Categorization.
[23] McCallum, A., K. Nigam, J. Rennie and K. Seymore, 2000: Automating
the construction of Internet portals with machine learning. Information
Retrieval , 3, 127–63.
[24] Mitchell, T., 1999: The role of unlabeled data in supervised learning.
Proceedings of the Sixth International Colloquium on Cognitive Science.
[25] Neville, J., and D. Jensen, 2000: Iterative classification in relational data.
Proc. AAAI-2000 Workshop on Learning Statistical Models from Rela-
tional Data, AAAI Press.
[26] Nigam, K., 2001: Using Unlabeled Data to Improve Text Classification.
Ph.D. thesis, Carnegie Mellon University.
[27] Nigam, K., A. McCallum, S. Thrun, and T. Mitchell, 2000: Text classifica-
tion from labeled and unlabeled documents using EM. Machine Learning,
39, 103–34.
[28] Oh, H., S. Myaeng, and M. Lee, 2000: A practical hypertext categorization
method using links and incrementally available class information. Proc.
of SIGIR-00 .
[29] Page, L., S. Brin, R. Motwani and T. Winograd, 1998: The page rank
citation ranking: Bringing order to the web. Technical report, Stanford
University.
[30] Popescul, A., L. Ungar, S. Lawrence and D. Pennock, 2002: Towards
structural logistic regression: Combining relational and statistical learn-
ing. KDD Workshop on Multi-Relational Data Mining.
[31] Taskar, B., P. Abbeel and D. Koller, 2002: Discriminative probabilistic
models for relational data. Proc. of UAI-02 , Edmonton, Canada, 485–92.
[32] Taskar, B., E. Segal and D. Koller, 2001: Probabilistic classification and
clustering in relational data. Proc. of IJCAI-01 .
[33] Yang, Y., S. Slattery and R. Ghani, 2002: A study of approaches to hy-
pertext categorization. Journal of Intelligent Information Systems, 18,
219–41.
[34] Zhang, T., and F. J. Oles, 2000: A probability analysis on the value of
unlabeled data for classification problems. Proc. 17th International Conf.
on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1191–8.
[35] — 2001: Text categorization based on regularized linear classification
methods. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 5–31.
Part II
Applications
8
Knowledge Discovery from Evolutionary Trees
Sen Zhang and Jason T. L. Wang
8.1 Introduction
Data mining, or knowledge discovery from data, refers to the process of ex-
tracting interesting, non-trivial, implicit, previously unknown and potentially
useful information or patterns from data [13]. In life sciences, this process
could refer to detecting patterns in evolutionary trees, extracting clustering
rules for gene expressions, summarizing classification rules for proteins, infer-
ring associations between metabolic pathways and predicting genes in genomic
DNA sequences [25, 26, 28, 29], among others. This chapter presents knowl-
edge discovery algorithms for extracting patterns from evolutionary trees.
Scientists model the evolutionary history of a set of taxa (organisms or
species) that have a common ancestor using rooted unordered labeled trees,
also known as phylogenetic trees (phylogenies) or evolutionary trees [20]. The
internal nodes within a particular tree represent older organisms from which
their child nodes descend. The children represent divergences in the genetic
composition in the parent organism. Since these divergences cause new or-
ganisms to evolve, these organisms are shown as children of the previous or-
ganism. Evolutionary trees are usually constructed from molecular data [20].
They can provide guidance in aligning multiple molecular sequences [24] and
in analyzing genome sequences [6].
The patterns we want to find from evolutionary trees contain “cousin
pairs.” For example, consider the three hypothetical evolutionary trees in
Figure 8.1. In the figure, a and y are cousins with distance 0 in T1 ; e and f
are cousins with distance 0.5 in T2 ; b and f are cousins with distance 1 in all
the three trees.
Fig. 8.1. Three trees T1 , T2 and T3 . Each node in a tree may or may not have
a label, and is associated with a unique identification number (represented by the
integer outside the node).
The measure “distance” represents kinship of two nodes; two cousins with
distance 0 are siblings, sharing the same parent node. Cousins of distance
1 share the same grandparent. Cousins of distance 0.5 represent aunt–niece
relationships. Our algorithms can find cousin pairs of varying distances in a
single tree or multiple trees. The cousin pairs in the trees represent evolution-
ary relationships between species that share a common ancestor. Finding the
cousin pairs helps one to better understand the evolutionary history of the
species [22], and to produce better results in multiple sequence alignment [24].
The rest of the chapter is organized as follows. Section 8.2 introduces
notation and terminology. Section 8.3 presents algorithms for finding frequent
cousin pairs in trees. Section 8.4 reports experimental results on both synthetic
data and real trees, showing the scalability and effectiveness of the proposed
approach. Section 8.5 reports implementation efforts and discusses several
applications where we use cousin pairs to define new similarity measures for
trees and to evaluate the quality of consensuses of equally parsimonious trees.
Section 8.6 compares our work with existing methods. Section 8.7 concludes
the chapter and points out some future work.
8.2 Preliminaries
We model evolutionary trees by rooted unordered labeled trees. Let Σ be a
finite set of labels. A rooted unordered labeled tree of size k > 0 on Σ is a
quadruple T = (V, N, L, E), where
• V is the set of nodes of T in which a node r(T ) ∈ V is designated as the
root of T and |V | = k;
• N : V → {1 . . . , k} is a numbering function that assigns a unique identifi-
cation number N (v) to each node v ∈ V ;
• L : V′ → Σ, where V′ ⊆ V, is a labeling function that assigns a label L(v) to each node v ∈ V′; the nodes in V − V′ do not have a label;
• E ⊂ N (V ) × N (V ) contains all parent–child pairs in T .
For example, refer to the trees in Figure 8.1. The node numbered 6 in T1
does not have a label. The nodes numbered 2, 3 in T3 have the same label d
and the nodes numbered 5, 6 in T3 have the same label c. We now introduce
a series of definitions that will be used in our algorithms.
Cousin Distance
Given two labeled nodes u, v of tree T where neither node is the parent of
the other, we represent the least common ancestor, w, of u and v as lca(u, v),
and represent the height of u, v respectively, in the subtree rooted at w as
H(u, w), H(v, w) respectively. We define the cousin distance of u and v, de-
noted c dist(u, v), as shown in Equation (8.1).
c dist(u, v) = H(u, w) − 1                      if H(u, w) = H(v, w)
c dist(u, v) = max{H(u, w), H(v, w)} − 1.5      if |H(u, w) − H(v, w)| = 1        (8.1)
The cousin distance c dist(u, v) is undefined if |H(u, w) − H(v, w)| is
greater than 1, or one of the nodes u, v is unlabeled. (The cutoff of 1 is a
heuristic choice that works well for phylogeny. In general there could be no
cutoff, or the cutoff could be much greater.)
Our cousin distance definition is inspired by genealogy [12]. Node u is a
first cousin of v, or c dist(u, v) = 1, if u and v share the same grandparent. In
other words, v is a child of u’s aunts or vice versa. Node u is a second cousin
of v, or c dist(u, v) = 2, if u and v have the same great-grandparent, but not
the same grandparent. For two nodes u, v that are siblings, i.e. they share the
same parent, c dist(u, v) = 0.
We use the number “0.5” to represent the “once removed” relationship.
When the word “removed” is used to describe a relationship between two
nodes, it indicates that the two nodes are from different generations. The
words “once removed” mean that there is a difference of one generation. For
any two labeled nodes u and v, if u is v’s parent’s first cousin, then u is
v’s first cousin once removed [12], and c dist(u, v) = 1.5. “Twice removed”
means that there is a two-generation difference. Our cousin distance definition
requires |H(u, w) − H(v, w)| ≤ 1 and excludes the twice removed relationship.
As mentioned above, this is a heuristic rather than a fundamental restriction.
For example, consider again T1 in Figure 8.1. There is a one-generation
difference between the aunt–niece pair y, x and c dist(y, x) = 0.5. Node b is
node f ’s first cousin and c dist(b, f ) = 1. Node d is node g’s first cousin
once removed, and c dist(d, g) = 1.5. Node f is node g’s second cousin,
and c dist(f, g) = 2. Node f is node p’s second cousin once removed, and
c dist(f, p) = 2.5.
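Equation (8.1) translates directly into code once the least common ancestor of the two nodes and their heights below it are known. A minimal Python sketch, assuming a tree given as a child-to-parent map and omitting the label checks:

def ancestors(parent, node):
    """Chain of nodes from node up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def c_dist(parent, u, v):
    """Cousin distance of Equation (8.1); None where it is undefined."""
    anc_u, anc_v = ancestors(parent, u), ancestors(parent, v)
    w = next(a for a in anc_u if a in set(anc_v))   # least common ancestor lca(u, v)
    if w in (u, v):                                 # one node is an ancestor of the other
        return None
    hu, hv = anc_u.index(w), anc_v.index(w)         # heights H(u, w) and H(v, w)
    if hu == hv:
        return hu - 1
    return max(hu, hv) - 1.5 if abs(hu - hv) == 1 else None

# a small example tree given as child -> parent (node 1 is the root)
parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 4, 8: 6}
print(c_dist(parent, 4, 5))   # siblings                   -> 0
print(c_dist(parent, 4, 6))   # first cousins              -> 1
print(c_dist(parent, 3, 4))   # aunt-niece                 -> 0.5
print(c_dist(parent, 4, 8))   # first cousin once removed  -> 1.5
print(c_dist(parent, 7, 8))   # second cousins             -> 2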
Notice that parent–child relationships are not included in our work be-
cause the internal nodes of evolutionary trees usually have no labels. (Each
leaf in these trees has a label, which is a taxon name.) So, we do not treat
parent–child pairs at all. This heuristic works well in phylogenetic applica-
tions, but could be generalized. We proposed one such generalization using
the UpDown distance [27]. Another approach would be to use one upper limit
parameter for inter-generational (vertical) distance and another upper limit
parameter for horizontal distance.
We may also consider the total number of occurrences of the cousins u and
v regardless of their distance, for which case we use λ in place of c dist(u, v) in
the cousin pair item. For example, in Table 8.1, T3 has (b, c, 0, 1) and (b, c, 1, 1),
and hence we obtain (b, c, λ, 2). Here, the cousin pair (b, c) occurs once with
distance 0 and occurs once with distance 1. Therefore, when ignoring the
distance, the total number of occurrences of (b, c) is 2. Likewise we can ig-
nore the number of occurrences of a cousin pair (u, v) by using λ in place of
occur(u, v) in the cousin pair item. For example, in Table 8.1, T3 has (b, c, 0, λ)
and (b, c, 1, λ). We may ignore both the cousin distance and the number of oc-
currences and focus on the cousin labels only. For example, T3 has (b, c, λ, λ),
which simply indicates that b, c are cousins in T3 .
where
R = 2 × (d − d) (8.4)
Lemma 1. Algorithm Single Tree Mining correctly finds all cousin pair items
of T where the cousin pairs have a distance less than or equal to maxdist and
an occurrence number greater than or equal to minoccur.
Proof. The correctness of the algorithm follows directly from two observa-
tions: (i) every cousin pair with distance d where 0 ≤ d ≤ maxdist is found
by the algorithm; (ii) because Step 9 eliminates duplicate cousin pairs from
consideration, no cousin pair with the same identification numbers is counted
twice.
Fig. 8.2. Algorithm for finding frequent cousin pair items in a single tree.
To find all frequent cousin pairs in a set of trees {T1 , . . . , Tk } whose dis-
tance is at most maxdist and whose support is at least minsup for a user-
specified minsup value, we first find all cousin pair items in each of the trees
that satisfy the distance requirement. Then we locate all frequent cousin pairs
by counting the number of trees in which a qualified cousin pair item occurs.
This procedure will be referred to as Multiple Tree Mining and its time com-
plexity is clearly O(kn²), where n = max{|T1|, . . . , |Tk|}.
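The pseudocode of Figure 8.2 is not reproduced above, but the overall computation can be sketched as follows in Python. This is an illustrative brute-force reconstruction that simply scans all labeled node pairs (the chapter's algorithm organizes the same quadratic-time work around children sets); trees are given as a child-to-parent map plus a node-to-label map.

from collections import defaultdict
from itertools import combinations

def ancestors(parent, node):
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def c_dist(parent, u, v):
    anc_u, anc_v = ancestors(parent, u), ancestors(parent, v)
    w = next(a for a in anc_u if a in set(anc_v))      # least common ancestor
    if w in (u, v):
        return None
    hu, hv = anc_u.index(w), anc_v.index(w)
    if hu == hv:
        return hu - 1
    return max(hu, hv) - 1.5 if abs(hu - hv) == 1 else None

def single_tree_mining(parent, label, maxdist=1.5, minoccur=1):
    """Cousin pair items (label_u, label_v, dist) -> occurrence count for all
    labeled node pairs within maxdist, keeping those occurring >= minoccur times."""
    items = defaultdict(int)
    for u, v in combinations(label, 2):                # labeled nodes only
        d = c_dist(parent, u, v)
        if d is not None and d <= maxdist:
            items[tuple(sorted((label[u], label[v]))) + (d,)] += 1
    return {item: n for item, n in items.items() if n >= minoccur}

def multiple_tree_mining(trees, maxdist=1.5, minsup=4):
    """Count, for each qualifying cousin pair item, the number of trees that
    contain it, and keep the items with support >= minsup."""
    support = defaultdict(int)
    for parent, label in trees:
        for item in single_tree_mining(parent, label, maxdist):
            support[item] += 1
    return {item: s for item, s in support.items() if s >= minsup}

Summing the counts of items that share the same label pair, regardless of distance, gives the λ-aggregated items described in Section 8.2.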
Table 8.2 summarizes the parameters of our algorithms and their default
values used in the experiments. The value of 4 was used for minimum support
because the evolutionary trees in TreeBASE differ substantially and using this
support value allowed us to find interesting patterns in the trees. Table 8.3
lists the parameters and their default values related to the synthetic trees.
The fanout of a tree is the number of children of each node in the tree. The alphabet size is the total number of distinct node labels these synthetic trees have.
Table 8.2. Parameters and their default values used in the algorithms.
Name       Meaning                                                              Value
minoccur   minimum occurrence number of an interesting cousin pair in a tree    1
maxdist    maximum distance allowed for an interesting cousin pair              1.5
minsup     minimum number of trees in the database that contain an              4
           interesting cousin pair
Table 8.3. Parameters and their default values related to synthetic trees.
Name           Meaning                                      Value
tree size      number of nodes in a tree                    200
database size  number of trees in the database              1000
fanout         number of children of each node in a tree    5
alphabet size  size of the node label alphabet              200
Figure 8.3 shows how changing the fanout of synthetic trees affects the running time of the algorithm Single Tree Mining. 1000 trees were tested and the average was plotted. The other parameter values are as shown in Table 8.2 and Table 8.3. Given a fixed tree size value, a large fanout value will result in a small number of children sets, which consequently reduces the number of times the outer for-loop of the algorithm (Step 1 in Figure 8.2) is executed. Therefore, one may expect that the running time of Single Tree Mining drops as the fanout increases. To our surprise, however, Figure 8.3 shows that the running time of Single Tree Mining increases as a tree becomes bushy, i.e. as its fanout becomes large. This happens mainly because for bushy trees, each node has many siblings and hence more qualified cousin pairs could be generated (Step 8 in Figure 8.2). As a result, it takes more time in the postprocessing stage to aggregate those cousin pairs (Step 12 in Figure 8.2).
Fig. 8.3. Running times (in seconds) of Single Tree Mining for varying fanout values.

Fig. 8.4. Running times (in seconds) of Single Tree Mining for varying tree sizes, with maxdist set to 0.5, 1, 1.5 and 2.

Figure 8.4 shows the running times of Single Tree Mining with different maxdist values for varying node numbers of trees. 1000 synthetic trees were
tested and the average was plotted. The other parameter values are as shown
in Table 8.2 and Table 8.3. It can be seen from the figure that as maxdist
increases, the running time becomes larger, because more time is spent in the inner for-loop of the algorithm for generating cousin pairs (Steps 3 to 10 in Figure 8.2). We also observed that a lot of time is spent aggregating qualified cousin pairs in the postprocessing stage of the algorithm (Step 12 in Figure 8.2). This extra time, though not explicitly described by the asymptotic time complexity O(|T|²) in Lemma 2, is reflected by the graphs in Figure 8.4.
The running times of Multiple Tree Mining when applied to 1 million syn-
thetic trees and 1,500 evolutionary trees obtained from TreeBASE are shown
in Figures 8.5 and 8.6, respectively. Each evolutionary tree has between 50
and 200 nodes and each node has between two and nine children (most in-
ternal nodes have two children). The size of the node label alphabet for the
evolutionary trees is 18,870. The other parameter values are as shown in Ta-
ble 8.2 and Table 8.3. We see from Figure 8.6 that Multiple Tree Mining can
find all frequent cousin pair items in the 1,500 evolutionary trees in less than
150 seconds. The algorithm scales up well – its running time increases linearly
with increasing number of trees (Figure 8.5).
Fig. 8.5. Running times (in thousands of seconds) of Multiple Tree Mining for varying numbers of synthetic trees (in thousands).

Fig. 8.6. Running times (in seconds) of Multiple Tree Mining for varying numbers of evolutionary trees.
graphical display of the tree via a pop-up window, as shown in Figure 8.8.
In this figure, the found cousin pair (Scutellaria californica and Scutellaria
siphocampyloides) is highlighted with a pair of bullets.
distance nor the occurrence number in each tree), t_sim_cdist(T1, T2) (considering the cousin distance only in each tree), t_sim_occ(T1, T2) (considering the occurrence number only in each tree), and t_sim_occ_cdist(T1, T2) (considering both the cousin distance and the occurrence number in each tree), respectively.
For example, referring to the trees T2 and T3 in Figure 8.1, we have t_sim_null(T2, T3) = 4/12 = 0.33, t_sim_cdist(T2, T3) = 2/16 = 0.125, t_sim_occ(T2, T3) = 4/12 = 0.33, and t_sim_occ_cdist(T2, T3) = 2/16 = 0.125. The intersection and
union of two sets of cousin pair items take into account the occurrence num-
bers in them. For example, suppose cpi(T1 ) = {(a, b, m, occur1)} and cpi(T2 ) =
{(a, b, m, occur2)}. Then cpi(T1 ) ∩ cpi(T2 ) = {(a, b, m, min(occur1, occur2))}
and cpi(T1 )∪cpi(T2 ) = {(a, b, m, max(occur1, occur2))}. These similarity mea-
sures can be used to find kernel trees in a set of phylogenies [22].
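In code, these measures reduce to operations over cousin pair item sets with the min/max convention just described; the Jaccard-style ratio below is an assumed normalization used only for illustration, and cousin pair items are encoded as dictionaries mapping (u, v, dist) to occurrence counts.

def t_sim(cpi1, cpi2):
    """Similarity of two trees from their cousin pair items: the intersection
    takes the minimum and the union the maximum of the occurrence counts."""
    keys = set(cpi1) | set(cpi2)
    inter = sum(min(cpi1.get(k, 0), cpi2.get(k, 0)) for k in keys)
    union = sum(max(cpi1.get(k, 0), cpi2.get(k, 0)) for k in keys)
    return inter / union if union else 0.0

def drop_distance(cpi):
    """Ignore the cousin distance (the lambda wildcard) by summing over it."""
    out = {}
    for (u, v, d), occ in cpi.items():
        out[(u, v)] = out.get((u, v), 0) + occ
    return out

# t_sim(cpi1, cpi2)                                # considers distance and occurrence
# t_sim(drop_distance(cpi1), drop_distance(cpi2))  # ignores the cousin distance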
The average similarity score of a consensus tree C with respect to the set S of equally parsimonious trees is Δ_cus(C, S) = (1/|S|) Σ_{T∈S} t_sim(C, T), where |S| is the total number of trees in the set S and t_sim is one of the similarity measures above. The higher the average similarity score Δ_cus(C, S) is, the better the consensus tree C is.
Figure 8.9 compares average similarity scores of the consensus trees gener-
ated by the five methods mentioned above for varying number of parsimonious
trees. The parameter values used by our algorithms for finding the cousin
pairs are as shown in Table 8.2. The parsimonious trees were generated by
the PHYLIP tool [10] using the first 500 nucleotides extracted from six genes
representing paternally, maternally, and biparentally inherited regions of the
genome among 16 species of Mus [16]. There are 33 trees in total. We randomly
choose 10, 15, 20, 25, 30 or 33 trees for each test. In each test, five different
individual runs of the algorithms are performed and the average is plotted. It
can be seen from Figure 8.9 that the majority consensus method and Nelson
consensus method are better than the other three consensus methods – they
yield consensus trees with higher average similarity scores.
Fig. 8.9. Comparing the quality of consensus trees using cousin pairs.
Fig. 8.10. Comparing the quality of consensus trees using clusters.
Figure 8.10 shows the experimental results in which clusters are used to
evaluate the quality of consensus trees. The data used here are the same as the
data for cousin pairs. In comparing the graphs in Figure 8.9 and Figure 8.10,
we observe that majority consensus and Nelson consensus trees are the best
consensus trees, yielding the highest average similarity scores between the
consensus trees and the original parsimonious trees. A close look at the data
reveals why this happens. All the original parsimonious trees are fully resolved;
i.e. the resolution rate [23] of these trees is 100%. This means every internal node in an original parsimonious tree has two children, i.e. the tree is a binary tree. Furthermore, the average depth of these trees is eight. When considering
10 out of 33 parsimonious trees, the average resolution rate and the depth
of the obtained majority consensus trees are 73% and 7 respectively. The
average resolution rate and the depth of the obtained Nelson consensus trees
are 66% and 7 respectively. The average resolution rate and the depth of
Adams consensus trees are 60% and 6 respectively. The average resolution rate
and the depth of the strict consensus trees are only 33% and 4 respectively.
This shows that the majority consensus trees and Nelson consensus trees are
closest to the original parsimonious trees. On average, the majority consensus
trees differ from the Nelson consensus trees by only two clusters and five cousin
pairs. These small differences indicate that these two kinds of consensus trees
are close to each other. Similar results were observed for the other input data.
Notice that, in Figure 8.10 where clusters are used, the average similar-
ity scores for strict consensus trees decrease monotonically as the number of
equally parsimonious trees increases. This happens because when the number
of equally parsimonious trees is large, the number of common clusters shared
by all the parsimonious trees becomes small. Thus, the obtained strict con-
sensus trees become less resolved; i.e., they are shallow and bushy. As a result,
the similarity scores between the strict consensus trees and each fully resolved
parsimonious tree become small.
Notice also that, in both Figure 8.9 and Figure 8.10, the average similarity
scores of semi-strict consensus trees and strict consensus trees are almost the
same. This happens because the parsimonious trees used in the experiments
are all generated by the PHYLIP tool, which produces fully resolved binary
trees. It is well known that a semi-strict consensus tree and a strict consensus
tree are exactly the same when the original equally parsimonious trees are
binary trees [2].
The triplet metric is similar to the quartet metric except that we enumerate
triplets (three leaves) as opposed to quartets (four leaves). In other words, the
triplet metric counts the number of subtrees with three taxa that are different
in two trees. This metric is useful for rooted trees while the quartet metric is
useful for unrooted trees. The algorithm for calculating the triplet metric of
two trees runs in time O(n²).
The partition metric treats each phylogenetic tree as an unrooted tree and
analyzes the partitions of species resulting from removing one edge at a time
from the tree. By removing one edge from a tree, we are able to partition
that tree. The distance between two trees is defined as the number of edges
for which there is no equivalent (in the sense of creating the same partitions)
edge in the other tree. The algorithm implemented in COMPONENT for
computing the partition metric runs in time O(n).
An agreement subtree between two trees T1 and T2 is a substructure of T1
and T2 on which the two trees are the same. Commonly such a subtree will
have fewer leaves than either T1 or T2 . A maximum agreement subtree (MAS)
between T1 and T2 is an agreement subtree of T1 and T2 . Furthermore there
is no other agreement subtree of T1 and T2 that has more leaves (species or
taxa) than MAS. The MAS metric is defined as the number of leaves removed
from T1 and T2 to obtain an MAS of T1 and T2 . In COMPONENT, programs
have been written to find the MAS for two (rooted or unrooted) fully resolved
binary trees.
Given two unrooted, unordered trees with the same set of labeled leaves,
the NNI metric is defined to be the number of NNI operations needed to trans-
form one tree to the other. DasGupta et al. [7] showed that calculating the
NNI metric is NP-hard, for both labeled and unlabeled unrooted trees. Brown
and Day [4] developed approximation algorithms, which were implemented in
COMPONENT. The time complexities of the algorithms are O(n log n) for rooted trees and O(n² log n) for unrooted trees.
Another widely used metric for trees is the edit distance, defined through
three edit operations, change node label, insert a node and delete a node, on
trees. Finding the edit distance between two unordered trees is NP-hard, and
hence a constrained edit distance, known as the degree-2 edit distance, was
developed [30]. In contrast to the above tree metrics, the similarity measures
between two trees proposed in this chapter are defined in terms of the cousin
pairs found in the two trees. The definition of cousin pairs is different from
the definitions for quartets, triplets, partitions, maximum agreement subtrees,
NNI operations and edit operations, and consequently the proposed similarity
measures are different from the existing tree metrics. These measures provide
complementary information when applied to real-world data.
8.7 Conclusion
We presented new algorithms for finding and extracting frequent cousin pairs
with varying distances from a single evolutionary tree or multiple evolu-
tionary trees. A system built based on these algorithms can be accessed at
http://aria.njit.edu/mediadb/cousin/main.html. The proposed single tree mining method, described in Section 8.3, is a quadratic-time algorithm. We suspect that the best achievable time complexity for finding all frequent cousin pairs in a tree is also quadratic. We have also presented some applications of the
proposed techniques, including the development of new similarity measures
for evolutionary trees and new methods to evaluate the quality of consensus
trees through a quantitative measure. Future work includes (i) extending the
proposed techniques to trees whose edges have weights, and (ii) finding differ-
ent types of patterns in the trees and using them in phylogenetic data cluster-
ing as well as other applications (e.g. the analysis of metabolic pathways [14]).
References
[1] Adams, E. N., 1972: Consensus techniques and the comparison of taxo-
nomic trees. Systematic Zoology, 21, 390–97.
[2] Bremer, K., 1990: Combinable component consensus. Cladistics, 6, 369–
72.
[3] Brodal, G. S., R. Fagerberg and C. N. S. Pedersen, 2003: Computing the
quartet distance between evolutionary trees in time O(n log n). Algorith-
mica, 38(2), 377–95.
[4] Brown, E. K., and W. H. E. Day, 1984: A computationally efficient ap-
proximation to the nearest neighbor interchange metric. Journal of Clas-
sification, 1, 93–124.
[5] Bryant, D., J. Tsang, P. E. Kearney and M. Li, 2000: Computing the
quartet distance between evolutionary trees. In Proceedings of the 11th
Annual ACM-SIAM Symposium on Discrete Algorithms, 285–6.
[6] Bustamante, C. D., R. Nielsen and D. L. Hartl, 2002: Maximum likeli-
hood method for analyzing pseudogene evolution: Implications for silent
site evolution in humans and rodents. Molecular Biology and Evolu-
tion, 19(1), 110–17.
[7] DasGupta, B., X. He, T. Jiang, M. Li, J. Tromp and L. Zhang, 1997: On
distances between phylogenetic trees. In Proceedings of the 8th Annual
ACM-SIAM Symposium on Discrete Algorithms, 427–36.
[8] Day, W. H. E., 1985: Optimal algorithms for comparing trees with labeled
leaves. Journal of Classification, 1, 7–28.
[9] Douchette, C. R., 1985: An efficient algorithm to compute quartet dis-
similarity measures. Unpublished BSc (Hons) dissertation, Memorial Uni-
versity of Newfoundland.
[10] Felsenstein, J., 1989: PHYLIP: Phylogeny inference package (version
3.2). Cladistics, 5, 164–6.
[11] Fitch, W., 1971: Toward defining the course of evolution: Minimum
change for a specific tree topology. Systematic Zoology, 20, 406–16.
[12] Genealogy.com, What is a first cousin, twice removed? Available at URL:
www.genealogy.com/16 cousn.html.
[13] Han, J., and M. Kamber, 2000: Data Mining: Concepts and Techniques.
Morgan Kaufmann, San Francisco, California.
[14] Heymans, M., and A. K. Singh, 2003: Deriving phylogenetic trees from the
similarity analysis of metabolic pathways. In Proceedings of the 11th Inter-
national Conference on Intelligent Systems for Molecular Biology, 138–46.
[15] Holmes, S., and P. Diaconis, 2002: Random walks on trees and matchings.
Electronic Journal of Probability, 7.
[16] Lundrigan, B. L., S. Jansa and P. K. Tucker, 2002: Phylogenetic relation-
ships in the genus mus, based on paternally, maternally, and biparentally
inherited characters. Systematic Biology, 51, 23–53.
[17] Margush, T., and F. R. McMorris, 1981: Consensus n-trees. Bull. Math.
Biol., 43, 239–44.
[18] Nelson, G., 1979: Cladistic analysis and synthesis: Principles and defini-
tions, with a historical note on Adanson’s Famille des Plantes (1763–4).
Systematic Zoology, 28, 1–21.
[19] Page, R. D. M., 1989: COMPONENT user’s manual (release 1.5). Uni-
versity of Auckland, Auckland.
[20] Pearson, W. R., G. Robins and T. Zhang, 1999: Generalized neighbor-
joining: More reliable phylogenetic tree reconstruction. Molecular Biology
and Evolution, 16(6), 806–16.
[21] Sanderson, M. J., M. J. Donoghue, W. H. Piel and T. Erikson, 1994: Tree-
base: A prototype database of phylogenetic analyses and an interac-
tive tool for browsing the phylogeny of life. American Journal of
Botany, 81(6), 183.
[22] Shasha, D., J. T. L. Wang, and S. Zhang, 2004: Unordered tree mining
with applications to phylogeny. In Proceedings of the 20th International
Conference on Data Engineering, 708–19.
[23] Stockham, C., L. Wang and T. Warnow, 2002: Statistically based post-
processing of phylogenetic analysis by clustering. In Proceedings of the
10th International Conference on Intelligent Systems for Molecular Biol-
ogy, 285–93.
[24] Tao, J., E. L. Lawler and L. Wang, 1994: Aligning sequences via an
evolutionary tree: Complexity and approximation. In Proceedings of the
26th Annual ACM Symposium on Theory of Computing, 760–9.
9.1 Introduction
Resource description framework (RDF) [19, 20] is a data modeling language
proposed by the World Wide Web Consortium (W3C) for describing and
interchanging metadata about web resources. The basic element of RDF is
statements, each consisting of a subject, an attribute (or predicate), and an
object. A sample RDF statement based on the XML syntax is depicted in
Figure 9.1. At the semantic level, an RDF statement could be interpreted
as “the subject has an attribute whose value is given by the object” or “the
subject has a relation with the object”. For example, the statement in Fig-
ure 9.1 represents the relation: “Samudra participates in a car bombing event”.
For simplicity, we use a triplet of the form <subject, predicate, object> to
express an RDF statement. The components in the triplets are typically de-
scribed using an ontology [15], which provides the set of commonly approved
vocabularies for concepts of a specific domain. In general, the ontology also
defines the taxonomic relations between concepts in the form of a concept
hierarchy.
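In code, such a statement reduces to a plain <subject, predicate, object> triple; a minimal Python sketch with illustrative names:

from collections import namedtuple

Statement = namedtuple("Statement", ["subject", "predicate", "object"])

# "Samudra participates in a car bombing event"
stmt = Statement("Samudra", "participate", "CarBombing")
# An RDF document is then a set of such statements (relations),
# and a knowledge base a collection of such documents.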
Due to the growing popularity of the semantic web, in the foreseeable future there will be a sizeable amount of RDF-based content available on the web.
Fig. 9.1. A sample RDF statement based on the XML syntax. “Samudra” denotes
the subject, “participate” denotes the attribute (predicate), and “CarBombing” de-
notes the object.
A new challenge thus arises as to how we can efficiently manage and tap the
information represented in RDF documents.
In this chapter, we propose a method, known as Apriori-based RDF Association Rule Mining (ARARM), for discovering association rules from RDF documents. The method is based on the Apriori algorithm [2], whose simple underlying principles enable it to be adapted to a new data model.
Our work is motivated by the fact that humans could learn useful patterns
from a set of similar events or evidences. As an event is typically decomposed
into a set of relations, we treat a relation as an item to discover associations
among relations. For example, many terrorist attack events may include the
scenario that the terrorists carried out a robbery before the terrorist attacks.
Though the robberies may be carried out by different terrorist groups and
may have different types of targets, we can still derive useful rules from those
events, such as “<Terrorist, participate, TerroristAttack> → <Terrorist, rob,
CommercialEntity>”.
The flow of the proposed knowledge discovery process is summarized in
Figure 9.2. First, the raw information content of a domain is encoded using
the vocabularies defined in the domain ontology to produce a set of RDF
documents. The RDF documents, each containing a set of relations, are used
as the input of the association rule mining process. For RDF association rule
mining, RDF documents and RDF statements correspond to transactions and
items in the traditional AR mining context respectively. Using the ontology,
the ARARM algorithm is used to discover generalized associations between
relations in RDF documents. To derive compact rule sets, we further present
a generalized pruning method for removing uninteresting rules.
The rest of this chapter is organized as follows. Section 9.2 provides a
review of the related work. Section 9.3 discusses the key issues of mining
association rules from RDF documents. Section 9.4 formulates the problem
statement for RDF association rule mining. Section 9.5 presents the proposed
ARARM algorithm. An illustration of how the ARARM algorithm works is
provided in Section 9.6. Section 9.7 discusses the rule redundancy issue and
presents a new algorithm for pruning uninteresting rules. Section 9.8 reports
our experimental results by evaluating the proposed algorithms on an RDF
document set in the Terrorist domain. Section 9.9 concludes and highlights the future work.

Fig. 9.2. The flow of the proposed RDF association rule mining process.

9.2 Related Work
Association rule (AR) mining [1] is one of the most important tasks in the
field of data mining. It was originally designed for well-structured data in
transaction and relational databases. The formalism of typical AR mining
was presented by Agrawal and Srikant [2]. Many efficient algorithms, such as
Apriori [2], Close [16], and FP-growth [10], have been developed. A general
survey of AR mining algorithms was given in [12]. Among those algorithms,
Apriori is the most popular one because of its simplicity.
In addition to typical association mining, variants of the Apriori algo-
rithm for mining generalized association rules have been proposed by Srikant
and Agrawal [17] to find associations between items located in any level of a
taxonomy (is-a concept hierarchy). For example, a supermarket may want to
find not only specific associations, such as “users who buy the Brand A milk
usually tend to buy the Brand B bread”, but also generalized associations,
such as “users who buy milk tend to buy bread”. For generalized rule min-
ing, several optimization strategies have been proposed to speed up support
counting. An innovative rule pruning method based on taxonomic information
was also provided. Han and Fu [9] addressed a similar problem and presented
Fig. 9.3. A simple concept hierarchy for the Terrorist domain ontology.
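The taxonomic relations of such a hierarchy can be encoded as a child-to-parent map, which is all the later steps need in order to test whether one relation is subsumed by a more abstract one. The concept names in the Python sketch below are reconstructed from the figure and the surrounding text, so treat them as illustrative.

# child -> parent links of the concept hierarchy (illustrative reconstruction)
PARENT = {
    "Terrorist": "Thing", "TerroristActivity": "Thing",
    "Samudra": "Terrorist", "Omar": "Terrorist",
    "FinancialCrime": "TerroristActivity", "TerroristAttack": "TerroristActivity",
    "BankRobbery": "FinancialCrime", "CardCheating": "FinancialCrime",
    "Bombing": "TerroristAttack", "Kidnapping": "TerroristAttack",
}

def is_a(concept, ancestor):
    """True if concept equals ancestor or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False

def is_sub_relation(rel, abstract_rel):
    """<s, p, o> is subsumed by <s+, p, o+> if s is-a s+, o is-a o+ and p matches."""
    (s, p, o), (s2, p2, o2) = rel, abstract_rel
    return p == p2 and is_a(s, s2) and is_a(o, o2)

print(is_sub_relation(("Samudra", "participate", "Bombing"),
                      ("Terrorist", "participate", "TerroristAttack")))   # True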
Through the algorithm defined in Figure 9.4, we obtain a set of most ab-
stract relations (Rlist). Each abstract relation and its sub-relations form a
relation lattice. An example of a relation lattice is shown in Figure 9.5. In this
lattice, <Terrorist, participate, TerroristAttack> is the most abstract relation
subsuming the eight relations at the bottom levels. The middle-level nodes in
the lattice represent sub-abstract-relations. For example, <Samudra, partici-
pate, Bombing> represents a sub-abstract-relation composed of two relations,
namely <Samudra, participate, CarBombing> and <Samudra, participate,
SuicideBombing>.
The algorithm for finding all 1-frequent relationsets is given in Figure 9.6.
For each most abstract relation R in Rlist, if R is frequent, we add R into 1-
frequent relationsets L1 and we traverse the relation lattice whose top vertex
is R to find all 1-frequent sub-relations of R (Figure 9.6a).
Figures 9.6b and 9.6c define the procedures of searching the abstract rela-
tion lattice. First, we recursively search the right children of the top relation
to find 1-frequent relationsets and add them into L1 . Then, we look at each
left child of the top abstract relation. If it is frequent, we add it into L1 and re-
cursively search the sub-lattice using this left child as the new top relation. In
Figure 9.5, the dashed arrows and the order numbers of the arrows illustrate
the process of searching the lattice for 1-frequent relationsets.

Fig. 9.5. An example relation lattice. The most abstract relation <Terrorist, participate, TerroristAttack> is at the top, sub-abstract relations such as <Samudra, participate, Bombing> are in the middle, and the eight most specific relations are at the bottom.
Here, we define the notions of right/left children, right/left sibling, and
left/right parent of an abstract relation in a relation lattice. In Figure 9.5,
<Terrorist, participate, Bombing> and <Terrorist, participate, Kidnapping>
are sub-relations of <Terrorist, participate, TerroristAttack>. They are de-
rived from their parent by drilling down its object based on the domain
concept hierarchy. We call them the right children of <Terrorist, partici-
pate, TerroristAttack> and call <Terrorist, participate, TerroristAttack> the
left parent of <Terrorist,participate, Bombing> and <Terrorist, participate,
Kidnapping>. Similarly, if some sub-relations are derived from their parent
by drilling down its subject, we call them the left children of their parent
and call their parent the right parent of these sub-relations. If there exists
an abstract relation that has a left child A and a right child B, A is called a
left sibling of B and B is called a right sibling of A.
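Drilling down, and hence generating the left and right children of an abstract relation, is a one-level specialization of its subject or object against the concept hierarchy. A small Python sketch with illustrative concept names:

from collections import defaultdict

PARENT = {"Samudra": "Terrorist", "Omar": "Terrorist",
          "Bombing": "TerroristAttack", "Kidnapping": "TerroristAttack"}

CHILDREN = defaultdict(list)
for child, par in PARENT.items():
    CHILDREN[par].append(child)

def right_children(rel):
    """Right children: drill down the object by one level."""
    s, p, o = rel
    return [(s, p, o_sub) for o_sub in CHILDREN[o]]

def left_children(rel):
    """Left children: drill down the subject by one level."""
    s, p, o = rel
    return [(s_sub, p, o) for s_sub in CHILDREN[s]]

top = ("Terrorist", "participate", "TerroristAttack")
print(right_children(top))   # [('Terrorist', 'participate', 'Bombing'), ('Terrorist', 'participate', 'Kidnapping')]
print(left_children(top))    # [('Samudra', 'participate', 'TerroristAttack'), ('Omar', 'participate', 'TerroristAttack')]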
Procedure searchAbsRelationLattice
Input: abstract relation R; hash table rSiblings that stores the right siblings of R
Output: 1-frequent relationsets in the relation lattice of R (excluding R)
(1)  L1' := ∅
(2)  L1' := searchRightChildren(R, rSiblings)
(3)  for each left child Rlc of R do   // obtain a left child by drilling down the subject of R
(4)      if support(Rlc) ≥ minSup
(5)          L1' := {Rlc} ∪ L1'
(6)          Rlc.rightParent := R
(7)          R.leftChildren.insert(Rlc)
(8)          L1'' := searchAbsRelationLattice(Rlc, R.rightChildren)
(9)          L1' := L1' ∪ L1''
(10) Output L1'
(b)

Procedure searchRightChildren
Input: abstract relation R; hash table rSiblings that stores the right siblings of R
Output: 1-frequent relationsets among the right descendants of R
(1)  L1R := ∅
(2)  for each right child Rrc of R do
(3)      rParent := getRParent(Rrc, rSiblings)   // the right parent of Rrc is the right sibling of R that has the same object as Rrc
(4)      if support(rParent) < minSup
(5)          continue   // Optimization 1
(6)      if support(Rrc) ≥ minSup
(7)          if rParent != NULL
(8)              rParent.leftChildren.insert(Rrc)
(9)          Rrc.rightParent := rParent
(10)         R.rightChildren.insert(Rrc)
(11)         Rrc.leftParent := R
(12)         L1R := {Rrc} ∪ L1R
(13)         L1R' := searchRightChildren(Rrc, rParent.rightChildren)
(14)         L1R := L1R ∪ L1R'
(15) Output L1R
(c)

Fig. 9.6. The algorithm for finding all 1-frequent relationsets: (b) procedure searchAbsRelationLattice; (c) procedure searchRightChildren.
According to Lemma 1, once we find that the left parent or right parent
of an abstract relation is not frequent, we do not need to calculate the sup-
port of this abstract relation and can simply prune it away. This forms our
Optimization Strategy 1.
Lemma 2. Given a k-relationset A = {R1, R2, . . . , Rk}, if there are two abstract relations Ri and Rj (1 ≤ i, j ≤ k and i ≠ j) such that |Ri| ≥ |Rj| and Ri ∩ Rj ≠ ∅, then there exists a (k−1)-relationset B with support(B) = support(A).
To generate candidate relationsets, we consider each pair (A, B), where A, B ∈ Lk−1, A = {R1, R2, . . . , Rk−1}, B = {R′1, R′2, . . . , R′k−1}, Ri = R′i (i = 1, 2, . . . , k−2), and Rk−1 ∩ R′k−1 = ∅ (Optimization Strategy 2). For each such pair of (k−1)-frequent relationsets (A, B), we generate a k-candidate relationset A ∪ B = {R1, R2, . . . , Rk−1, R′k−1}. We use Ck to denote the entire set of k-candidate relationsets. We further generate Lk by pruning the k-candidate relationsets whose supports are below minSup. In Lk, some redundant k-frequent relationsets also need to be removed according to Optimization Strategy 3.
For each frequent relationset A, the algorithm finds each possible sub-relationset B and calculates the confidence of the association rule B → A − B, where A − B denotes the set of relations in A but not in B. If confidence(B → A − B) is larger than minConf, then B → A − B is generated as a rule.
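In Apriori style, rule generation therefore amounts to enumerating the non-empty proper subsets of each frequent relationset and comparing support ratios. A compact Python sketch, assuming relationsets are represented as frozensets of relations and their supports are available in a dictionary:

from itertools import combinations

def generate_rules(frequent, support, min_conf=0.66):
    """For each frequent relationset A and non-empty proper subset B, emit the
    rule B -> A - B when support(A) / support(B) meets the confidence threshold."""
    rules = []
    for A in frequent:
        for r in range(1, len(A)):
            for B in map(frozenset, combinations(A, r)):
                if support.get(B, 0) > 0:
                    conf = support[A] / support[B]
                    if conf >= min_conf:
                        rules.append((B, A - B, support[A], conf))
    return rules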
9.6 Illustration
In this section, we illustrate our ARARM algorithm by mining associations
from the sample knowledge base SD depicted in Table 9.1. Suppose that the
minimum support is 2 and the minimum confidence is 66%. The relations
(RDF statements) in the knowledge base are constructed using the ontology
as shown in Figure 9.3. The predicate set is defined as S = {raiseFundBy,
participate}.
First, we aggregate all relations in SD (as described in Figure 9.4) and
obtain two most-abstract relations (Table 9.2). Because the supports of those
two abstract relations are both greater than or equal to the minimum support of 2,
they will be used in the next step to generate 1-frequent relationsets.
Table 9.2. The most-abstract relations obtained from the knowledge base SD.
Most-Abstract Relation                         Support
<Terrorist, raiseFundBy, FinancialCrime>       2
<Terrorist, participate, TerroristAttack>      3
In Figure 9.8a, because all of the relations in the second level are be-
low the minimum support, the relations at the bottom of the lattice will
not be considered. In Figure 9.8b, because the relations <Omar, participate,
TerroristAttack> and <Terrorist, participate, Bombing> are frequent, their
child relation <Omar, participate, Bombing> at the bottom of the lattice will
still be considered. Other relations will be directly pruned because either the
support of their left parent or right parent is below the minimum support.
Table 9.3. The 1-frequent relationsets identified from the sample knowledge base
SD.
Table 9.4. The k-frequent relationsets (k ≥2) identified from the sample knowledge
base SD.
Table 9.5. The association rules discovered from the sample knowledge base SD.
• Concept replacement in both the left- and right-hand sides. For example,
an association rule AR1: <a, rel1, b> → <c, rel2, a> could be derived
from an association rule AR2: <a+, rel1, b> → <c, rel2, a+> by replacing
concept “a+” with its sub-concept “a”. This kind of concept replacement
only influences the support of the association rule. The expected support and confidence of AR1 are given by

supportE(AR1) = support(AR2) · P(a|a+)                      (9.1)

and

confidenceE(AR1) = confidence(AR2)                          (9.2)

where P(a|a+) is the conditional probability of a, given a+.
• Concept replacement in the left-hand side only. For example, an associ-
ation rule AR1: <a, rel1, b> → <c, rel2, d> could be generated from
an association rule AR2: <a+, rel1, b> → <c, rel2, d> by replacing the
concept “a+” with its sub concept “a”. This kind of concept replacement
influences only the support of the association rule. We can calculate the
support and confidence of AR1 by using Eqns. (9.1) and (9.2).
• Concept replacement in the right-hand side only. For example, an associ-
ation rule AR1: <c, rel1, d> → <a, rel2, b> could be generated from an
association rule AR2: <c, rel1, d> → <a+, rel2, b> by replacing concept
“a+” with its sub concept “a”. This kind of concept replacement influ-
ences both the support and the confidence of the association rule. We can
calculate the expected support and confidence of AR1 by

supportE(AR1) = support(AR2) · P(a|a+)                      (9.3)

and

confidenceE(AR1) = confidence(AR2) · P(a|a+)                (9.4)
respectively.
Note that the above three cases may be combined to calculate the overall
expected support and confidence of an association rule. The conditional prob-
ability P(a|a+) can be estimated by the ratio of the number of the leaf sub-
concepts of “a” and the number of the leaf sub-concepts of “a+” in the domain
concept hierarchy. For example, in Figure 9.3, the number of the leaf sub-
concepts of “Financial Crime” is two and the number of the leaf sub-concepts
of “Terrorist Activity” is four. The conditional probability P(Financial Crime
|Terrorist Activity) is thus estimated as 0.5.
Following the idea of Srikant and Agrawal [17], we define the interesting-
ness of a rule as follows. Given a set of rules S and a minimum interest factor
F, a rule A→B is interesting if there is no ancestor of A→B in S, or both
the support and confidence of A→B are at least F times the expected support
and confidence of its close ancestors respectively. We name the above inter-
estingness measure expectation measure with semantic relationships (EMSR).
EMSR may be used in conjunction with other pruning methods, such as those
described in [13].
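A minimal sketch of the EMSR check for a single rule is given below. It assumes the expected support and confidence have already been derived from each close ancestor via the concept-replacement equations (9.1)-(9.4); the function and parameter names are illustrative, not the authors' implementation.

def expected_from_ancestor(anc_support, anc_confidence, p_concept, rhs_replaced):
    """Expected support/confidence of a rule obtained from an ancestor rule
    by replacing a concept a+ with a sub-concept a, where p_concept = P(a|a+).
    The confidence changes only when the replacement is on the right-hand
    side (Eqns. 9.3 and 9.4); otherwise it is unchanged (Eqn. 9.2)."""
    exp_sup = anc_support * p_concept
    exp_conf = anc_confidence * p_concept if rhs_replaced else anc_confidence
    return exp_sup, exp_conf

def emsr_interesting(rule_support, rule_confidence, expected_pairs, F):
    """EMSR: a rule is interesting if it has no close ancestor, or if both its
    support and confidence are at least F times the values expected from
    every close ancestor (each given as an (expected_support,
    expected_confidence) pair)."""
    if not expected_pairs:
        return True
    return all(rule_support >= F * es and rule_confidence >= F * ec
               for es, ec in expected_pairs)

For the example in the text, expected_from_ancestor(s, c, 0.5, True) halves both the expected support and the expected confidence of the specialised rule.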
9.8 Experiments
Experiments were conducted to evaluate the performance of the proposed
association rule mining and pruning algorithms both quantitatively and qual-
itatively. Our experiments were performed on an IBM T40 (1.5GHz Pentium
Mobile CPU, 512MB RAM) running Windows XP. The RDF storage system
was Sesame (release 1.0RC1) running on MySQL database (release 4.0.17).
The ARARM algorithm was implemented using Java (JDK 1.4.2).
Fig. 9.9. The seven domain axioms for generating the terrorist events.
We then performed association rule mining with the ARARM algorithm and evaluated whether the
extracted rules captured the underlying associations specified by the domain
axioms. With a 5% minimum support and a 50% minimum confidence, the
ARARM algorithm generated 76 1-frequent and 524 k-frequent (k ≥2) rela-
tionsets, based on which 1061 association rules were extracted. With a 10%
minimum support and a 60% minimum confidence, the algorithm produced 42
1-frequent relationsets, 261 k-frequent relationsets, and 516 association rules.
We observed that although the events were generated based on only
seven domain axioms, a much larger number of rules were extracted. For
example, axiom 2 may cause the association rule “<Terrorist, participate,
Bombing> → <Terrorist, participate, Robbery>” to be generated. Axiom 2
may also result in the association rule “<Terrorist, participate, Robbery> →
<Terrorist, participate, Bombing>”, as <Terrorist, participate, Bombing>
tended to co-occur with <Terrorist, participate, Robbery>. In addition, ax-
ioms can be combined to generate new rules. For example, axioms 1, 3, and
5 can combine to generate association rules, such as “<Terrorist, partici-
pate, TerroristActivity> → <Terrorist, takeVehicle, Vehicle>, <Terrorist,
useWeapon, Weapon>”. As the association rule sets generated using the
ARARM algorithm may still be quite large, pruning methods were further
applied to derive more compact rule sets.
We experimented with a revised version of Srikant’s interestingness mea-
sure method [17] and the EMSR method for pruning the rules. The exper-
imental results are summarized in Table 9.6 and Table 9.7. We further experimented with two simple statistical interestingness measure methods [13], described below (a short code sketch of both measures follows the list):
• Statistical correlations measure (SC): Given a rule R1→R2, where R1 and R2 are relationsets, if the conjunctive probability P(R1,R2) ≠ P(R1)·P(R2), then R1 and R2 are correlated and the rule R1→R2 is considered as interesting.
• Conditional independency measure (CI): Given two rules R1→R2 and R1,
R3→R2 where R1, R2 and R3 are relationsets, if the conditional prob-
ability P(R2|R1) = P(R2|R1,R3), we say R2 and R3 are conditionally
independent and the rule R1, R3→R2 is considered as redundant and un-
interesting.
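A minimal sketch of the two checks, assuming the probabilities are estimated as relative frequencies over the knowledge base; the tolerance eps is an assumption, since exact equality rarely holds for empirical estimates.

def sc_interesting(p_r1, p_r2, p_r1_and_r2, eps=1e-6):
    """Statistical correlations measure: the rule R1 -> R2 is interesting
    only if R1 and R2 are correlated, i.e. P(R1,R2) != P(R1)*P(R2)."""
    return abs(p_r1_and_r2 - p_r1 * p_r2) > eps

def ci_redundant(p_r2_given_r1, p_r2_given_r1_r3, eps=1e-6):
    """Conditional independency measure: the rule R1,R3 -> R2 is redundant
    if R2 and R3 are conditionally independent given R1, i.e.
    P(R2|R1) == P(R2|R1,R3)."""
    return abs(p_r2_given_r1 - p_r2_given_r1_r3) <= eps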
Table 9.7. The experimental results using the EMSR interestingness measure
method.
When pruning association rules, we first applied Srikant’s and the EMSR
methods on the rule sets produced by the ARARM algorithm and derived
association rule sets considered as interesting for each strategy. Then we com-
bined Srikant’s method and the EMSR method individually with the SC and
CI interestingness measures to derive even smaller rule sets.
We observed that there was no significant difference between the numbers
of rules obtained using the EMSR method and Srikant’s method. However, by
combining with other pruning methods, the resultant rule sets of EMSR were
about 40% smaller than those produced by Srikant’s method. The reason was
that the rule sets produced by Srikant’s method contained more rules similar
to those produced using the SC and CI measures. In other words, Srikant’s
method failed to remove those uninteresting rules that could not be detected
by the SC and CI measures.
For evaluating the quality of the rule sets produced by the EMSR method,
we analyzed the association rule set obtained using a 5% minimum support
and a 50% minimum confidence. We found that the heuristics of all seven ax-
ioms were represented in the rules discovered. In addition, most of the associa-
tion rules can be traced to one or more of the domain axioms. A representative
set of the association rules is shown in Table 9.8.
9.9 Conclusions
We have presented an Apriori-based algorithm for discovering association
rules from RDF documents. We have also described how uninteresting rules
can be detected and pruned in the RDF AR mining context.
Our experiments so far have made use of a synthetic data set, created
based on a set of predefined domain axioms. The data set has allowed us to
evaluate the performance of our algorithms in a quantitative manner. We are
in the process of building a real Terrorist data set by annotating web pages.
Our ARARM algorithm assumes that all the RDF relations of interest can fit into main memory; in fact, the maximum memory usage of our algorithm is proportional to the number of relations.
References
[1] Agrawal, R., T. Imielinski and A. Swami, 1993: Mining association rules
between sets of items in large databases. Proceedings of the ACM SIG-
MOD International Conference on Management of Data, 207–16.
[2] Agrawal, R., and R. Srikant, 1994: Fast algorithms for mining associa-
tion rules. Proceedings of the 20th International Conference in Very Large
Databases, 487–99.
[3] Braga, D., A. Campi, S. Ceri, M. Klemettinen and P.L. Lanzi, 2003:
Discovering interesting information in XML data with association rules.
Proceedings of ACM Symposium on Applied Computing, 450–4.
[18] Tan, A.-H., 1999: Text mining: The state of the art and the challenges.
Proceedings of the Pacific Asia Conference on Knowledge Discovery and
Data Mining PAKDD’99 workshop on Knowledge Discovery from Ad-
vanced Databases, 65–70.
[19] W3C, RDF Specification. URL: www.w3.org/RDF/.
[20] W3C, RDF Schema Specification. URL: www.w3.org/TR/rdf-schema/.
[21] XML DOM Tutorial. URL: www.w3schools.com/dom/default.asp.
10
Image Retrieval using Visual Features and
Relevance Feedback
Summary. The present paper describes the design and implementation of a novel
CBIR system using a set of complex data that comes from completely different kinds
of low-level visual features such as shape, texture and color. In the proposed system,
a petal projection technique is used to extract the shape information of an object. To
represent the texture of an image, a co-occurrence matrix of a texture pattern over a
2 × 2 block is proposed. A fuzzy index of color is suggested to measure the closeness
of the image color to six major colors. Finally, a human-perception-based similarity
measure is employed to retrieve images and its performance is established through
rigorous experimentation. Performance of the system is enhanced through a novel
relevance feedback scheme, as is evident from the experimental results. The performance of the system is also compared with that of other systems.
10.1 Introduction
Image search and retrieval has been a field of very active research since the
1970s and this field has observed an exponential growth in recent years as a
result of unparalleled increase in the volume of digital images. This has led
to the development and flourishing of Content-based Image Retrieval (CBIR)
systems [12, 18, 34]. There are, in general, two fundamental modules in a
CBIR system, visual feature extraction and retrieval engine. An image may
be considered as the integrated representation of a large volume of complex
information. Spatial and spectral distribution of image data or pixel values to-
gether carry some complex visual information. Thus visual feature extraction
is crucial to any CBIR scheme, since it annotates the image automatically
using its contents. Secondly, these visual features may be completely different
from one another suggesting complex relations among them inherent in the
image. So the retrieval engine handles all such complex data and retrieves
the images using some sort of similarity measure. The quality of retrieval can be improved by deploying a relevance feedback scheme, and proper indexing improves the efficiency of the system considerably.
Visual features may be classified into two broad categories: high-level fea-
tures and low-level features. High-level features mostly involve semantics of
the region(s) as well as that of the entire image. On the other hand, low-
level features are more elementary and general and are computed from pixel
values. In this work, we confine ourselves to extraction of low-level features
only. Shape, texture and color are three main independent groups of low-level
features that are used in CBIR systems.
Most of the CBIR systems measure shape features either by geometric
moments or by Fourier descriptor [4, 38] methods. Hu [14] suggested seven
moment invariants by combining raw geometric moments. Teh and Chin [53]
studied various types of moments and their capabilities for characterizing vi-
sual patterns. Fourier descriptor methods use as shape features the coefficients
obtained by Fourier transformation of object boundaries [35]. Other methods
proposed for shape matching include features like area, perimeter, convex-
ity, aspect ratio, circularity and elongatedness [4, 38]. Elastic deformation of
templates [3], comparison of directional histograms of edges [17], skeletal rep-
resentation [20] and polygonal approximation [42] of shapes are also used.
Texture is another feature that has been extensively explored by various
research groups. Texture features are measured using either a signal processing
or statistical model [28] or a human perception model [52]. In [13], Haralick et
al. proposed the co-occurrence matrix representation of texture features. Many
researchers have used wavelets [2, 27] and their variants to extract appropriate
texture features. Gabor Filters [9] and fractal dimensions [19] are also used as
a measure of the texture property.
Another widely used visual feature for CBIR is color. The main advantage
of this feature is its invariance to size, position, orientation and arrangements
of the objects. On the other hand, the disadvantage is its immense varia-
tion within a single image. In CBIR systems, a color histogram is most com-
monly used for representing color features. Various color similarity measures
based on histogram intersection have been reported [50, 51]. Other than color
histogram, color layout vectors [24], color correlograms [16], color coherence
vectors [7], color sets [47] and color moments [22, 56] are also commonly used.
The retrieval engine is responsible for finding the set of similar images from
the database against a query on the basis of certain similarity measures on the
feature set. It is evident from the literature that various distance/similarity
measures have been adopted by CBIR systems. Mukherjee et al. [31] have
used template matching for shape-based retrieval. A number of systems [29,
33, 49] have used Euclidean distance (weighted or unweighted) for matching.
Other schemes include the Minkowski metric [9], self-organizing maps [22],
proportional transportation distance [55], the CSS matching algorithm [30],
etc. For matching multivalued features such as a color histogram or texture
matrix, a variety of distance measures are deployed by different systems. They
include schemes like quadratic form distance [33], Jaccard’s co-efficient [23], L1
distance [2, 7, 21], histogram intersection [11], etc. The details on combining
the distances of various types of features are not available. But it is clear that
Euclidean distance is the most widely used similarity measure.
The quality of retrieved images can be improved through a relevance feed-
back mechanism. As the importance of the features varies for different queries
and applications, to achieve better performance, different emphases have to be
given to different features and the concept of relevance feedback (RF) comes
into the picture. Relevance feedback, originally developed in [54], is a learn-
ing mechanism to improve the effectiveness of information retrieval systems.
For a given query, the CBIR system retrieves a set of images according to a
predefined similarity measure. Then, the user provides feedback by marking
the retrieved images as relevant to the query or not. Based on the feedback,
the system takes action and retrieves a new set. The classical RF schemes can
be classified into two categories: query point movement (query refinement)
and re-weighting (similarity measure refinement) [37, 41]. The query point
movement method tries to improve the estimate of the ideal query point by
moving it towards the relevant examples and away from bad ones. Rocchio’s
formula [37] is frequently used to improve the estimation iteratively. In [15],
a composite query is created based on relevant and irrelevant images. Various
systems like WebSEEk [46], Quicklook [5], iPURE [1] and Drawsearch [44]
have adopted the query refinement principle. In the re-weighting method, the
weight of the feature that helps in retrieving the relevant images is enhanced
and the importance of the feature that hinders this process is reduced. Rui
et al. [39] and Squire et al. [48] have proposed weight adjustment techniques
based on the variance of the feature values. Systems like ImageRover [45] and
RETIN [9] use a re-weighting technique.
Here in this paper we have given emphasis to the extraction of shape,
texture and color features which together form a complex data set as they
bear diverse kinds of information. A human-perception-based similarity mea-
sure and a novel relevance feedback scheme are designed and implemented to
achieve the goal. This paper is organised as follows. Section 10.2 deals with
the computation of features. Section 10.3 describes a new similarity measure
based on human perception. A relevance feedback scheme based on the Mann-
Whitney test has been elaborated in Section 10.4. Results and discussions are
given in Section 10.5 followed by the concluding remarks in Section 10.6.
10.2 Computation of Features

The image is first segmented to extract the desired object [40]. All the visual features are then computed on the segmented
region of interest.
Fourier descriptors and moment invariants are the two widely used shape
features. In the case of Fourier descriptors, data is transformed to a completely
different domain where the coefficients may not have a direct correlation with
the shape perception except whether the boundary is smooth or rough, etc.
They do not, in general, straightaway indicate properties like symmetry or
concavity. This is also true for higher-order moments. Moreover, moments of
different order vary so widely that it becomes difficult to balance their effects
on distance measures. These observations have led us to look for different
shape descriptors.
It is known that projection signatures retain the shape information, which
is confirmed by the existence of image reconstruction algorithms from pro-
jection data [38]. Horizontal and vertical projection of image gray levels are
already used in image retrieval [36]. In this work we propose petal projection
which explicitly reveals the symmetricity, circularity, concavity and aspect ra-
tio.
Petal Projection
After segmentation the object is divided into a number of petals where a
petal is an angular strip originating from the center of gravity as shown in
Figure 10.1. The area of the object lying within a petal is taken as the pro-
jection along it. Thus, Sθi , the projection on the ith petal can be represented
as:
Sθi = ∫_{θ=θi}^{θi+∆θ} ∫_{r=0}^{R} f(r, θ) dr dθ        (10.1)
Symmetry: It can be expressed as:

Symmetry = (1/n) Σ_{k=1}^{n/2} | Sθ((m+n−k) mod n) − Sθ((m+k−1) mod n) |        (10.2)
For a perfectly symmetric object, the value is zero and it gives a positive value
for an asymmetric one.
Circularity: It can be expressed as:
Circularity = (1/n) Σ_{i=0}^{n/2−1} | sθm − sθi |        (10.3)
For a perfectly circular object it gives zero and a positive value otherwise.
Aspect ratio: In order to compute the aspect ratio, sθm is obtained first.
Then pθi , the projection of sθi along the direction orthogonal to θm is com-
puted for all sθi other than sθm . Finally, the aspect ratio can be represented
as:
Asp.Ratio = sθm / max{pθi}        (10.4)
Concavity: Consider the triangle BOA as shown in Figure 10.2. Suppose OC of length r is the angular bisector of ∠BOA. The point C is said to be a concave point with respect to AB if
r < (ra · rb) / ((ra + rb) · 2cos²α)

Fig. 10.2. Triangle BOA used in the definition of concavity (|OA| = ra, |OB| = rb, ∠BOA = 2α; OC of length r is the angular bisector).

Hence, Ci, the concavity due to the ith petal zone, can be obtained as

Ci = 0,                                                        if sθi ≥ (sθi+1 · sθi−1) / ((sθi+1 + sθi−1) · 2cos²∆θ)
Ci = (sθi+1 · sθi−1) / ((sθi+1 + sθi−1) · 2cos²∆θ) − sθi,      otherwise
Thus

Concavity = Σ_{i=0}^{n/2−1} Ci        (10.5)
can act as the measure for concavity.
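A minimal sketch of the petal projection and the features derived from it, assuming the object is given as a binary mask, that the number of petals n is even, and that pθi is taken as the projection of sθi onto the direction orthogonal to θm; the helper names and the use of NumPy are assumptions, not the authors' implementation.

import numpy as np

def petal_projections(mask, n=36):
    """Area of the object falling in each of n angular strips (petals)
    originating from the centre of gravity of a binary mask."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    ang = np.arctan2(ys - cy, xs - cx) % (2 * np.pi)
    petal = (ang / (2 * np.pi / n)).astype(int) % n
    return np.bincount(petal, minlength=n).astype(float)

def petal_shape_features(s):
    """Symmetry, circularity and aspect ratio in the spirit of
    Eqns. (10.2)-(10.4); m is the petal with the largest projection."""
    n, m = len(s), int(np.argmax(s))
    sym = sum(abs(s[(m + n - k) % n] - s[(m + k - 1) % n])
              for k in range(1, n // 2 + 1)) / n
    circ = sum(abs(s[m] - s[i]) for i in range(n // 2)) / n
    # projection of each petal value onto the axis orthogonal to petal m
    angles = 2 * np.pi * np.arange(n) / n
    p = np.abs(s * np.sin(angles - angles[m]))
    p[m] = 0.0
    aspect = s[m] / p.max() if p.max() > 0 else np.inf
    return sym, circ, aspect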
Supplementary Features
Petal-projection-based measures of shape features are very effective when ∆θ
is sufficiently small. However, since the mathematical formulations for measuring shape features available in the literature, including the proposed ones, are based on intuition and heuristics, it is observed that using more features usually improves the performance of the system, particularly for a wide variety of images.
For this reason, similar types of shape features may also be computed in a
different manner as described next. These supplementary features improve the
performance by about 2 to 3% and do not call for much extra computation.
Three different measures for circularity, Ci , (see Figure 10.3a) are defined
and computed as follows:
C1 = (object area) / (π D² / 4)
C2 = (length of the object boundary) / (πD + length of the object boundary)
C3 = (2 × min{ri}) / D
where D is the diameter of the smallest circle enclosing the object and ri is the same as Sθi for a very small ∆θ. D can be determined by taking projections of the ri's along θm.
To compute the aspect ratio the principal axis (P A) and the axis orthogo-
nal to it (OA) are obtained first [38] using ri. Two different aspect ratio features, ARi (see Figure 10.3b), are computed as

AR1 = (OA length) / (PA length)
AR2 = (median of {OLi}) / (median of {PLi})
where the lengths of the line segments parallel to PA (or OA) form {PLi} ({OLi}).

Fig. 10.3. (a) Quantities used for the supplementary circularity measures (the smallest enclosing circle of diameter D and the radial distances ri); (b) line segments parallel to PA and OA used for the supplementary aspect ratio and symmetricity measures.
Symmetricity (see Figure 10.3b) about various axes is measured in the following way.
Symmetricity about PA = (1/n) Σ_{i=1}^{n} (dui − dbi) / (dui + dbi)

Symmetricity about OA = (1/m) Σ_{i=1}^{m} (dli − dri) / (dli + dri)
where m denotes the number of pixels on OA. Note that dui and dbi are the
lengths of line segments parallel to OA drawn on either side of P A from the
ith pixel on P A. dli and dri can be defined in a similar way. Here again, dui
(or dbi ) and dli (dri ) may be obtained by taking projections of Sθi along OA
and P A respectively for very small ∆θ. However, we have implemented it by
pixel counting along the lines.
The convex hull of the object is obtained first and then the concavity
features (Coni ) are computed as follows:
Con1 = (object area) / (area of the convex hull)

Con2 = (perimeter of the convex hull) / (perimeter of the object)
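A minimal sketch of the two convex-hull-based concavity features, assuming the object is a binary mask; the use of SciPy's ConvexHull and the pixel-counting perimeter approximation are assumptions, not the authors' implementation.

import numpy as np
from scipy.spatial import ConvexHull

def supplementary_concavity(mask):
    """Con1 and Con2: ratios of the object to its convex hull.  The object
    perimeter is approximated by counting boundary pixels of the mask
    (the image-border wrap-around of np.roll is ignored for simplicity)."""
    mask = mask.astype(bool)
    pts = np.column_stack(np.nonzero(mask))
    hull = ConvexHull(pts)            # for 2-D points: .volume = area, .area = perimeter
    interior = (np.roll(mask, 1, 0) & np.roll(mask, -1, 0) &
                np.roll(mask, 1, 1) & np.roll(mask, -1, 1))
    object_perimeter = float((mask & ~interior).sum())
    con1 = float(mask.sum()) / hull.volume
    con2 = hull.area / object_perimeter
    return con1, con2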
Fig. 10.4 (excerpt). Example 2 × 2 intensity blocks and the corresponding binary patterns:
Intensity blocks: [20 22; 8 7]   [9 19; 11 20]   [17 11; 18 9]   [20 7; 6 23]   [8 7; 6 8]
Binary patterns:  [1 1; 0 0]     [0 1; 0 1]      [1 0; 1 0]      [1 0; 0 1]     [1 0; 0 1]

A problem of this approach is that a smooth intensity block (see Figure 10.4e) and a coarse textured block (see Figure 10.4d) may produce the same binary pattern and, hence, the same texture value. To surmount this
problem we define a smooth block as having an intensity variance less than
a small threshold. In our experiment, the threshold is 0.0025 of the average
intensity variance computed over all the blocks. All such smooth blocks have
texture value 0. Thus we get the scaled (both in space and value) image whose
height and width are half of that of the original image and the pixel values
range from 0 to 15 except 10 (all 1 combination). This new image may be
considered as the image representing the texture of the original image (see
Figure 10.5).
Finally, considering left-to-right and top-to-bottom directions, the co-occurrence matrix of size 15 × 15 is computed from this texture image. To make this matrix translation invariant, the 2 × 2 block frames are shifted by one pixel horizontally and vertically. For each case, the co-occurrence matrix is computed. To make the measure flip invariant, co-occurrence matrices are also computed for the mirrored image. Thus, we have sixteen such matrices. Then, we take the element-wise average of all the matrices and normalize the result to
obtain the final one. In the case of a landscape image, this is computed over the whole image, while in the case of an image containing dominant object(s) the texture feature is computed over the segmented region(s) of interest only.
The texture co-occurrence matrix provides the detailed description of the
image texture, but handling of such multivalued features is always difficult,
particularly in the context of indexing and comparison cost. Hence, to obtain
more perceivable features, statistical measures like entropy, energy and tex-
ture moments [13] are computed based on this matrix. We have considered
moments up to order 4 as the higher orders are not perceivable. The use of
gray code has enabled us to measure homogeneity and variation in texture.
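A minimal sketch of the texture-image construction and one co-occurrence matrix, under assumptions inferred from the text and the example blocks: each 2 × 2 block is binarized against its own mean, the 4-bit pattern is decoded as a Gray code (so the all-ones pattern would map to 10, which cannot occur), and smooth blocks are assigned texture value 0. The bit ordering and the use of NumPy are assumptions, not the authors' code.

import numpy as np

def texture_image(img, smooth_factor=0.0025):
    """Half-size texture image: each 2x2 block is binarized against its own
    mean, the 4-bit pattern is Gray-decoded (values 0-15, value 10 cannot
    occur), and smooth blocks get value 0."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    blocks = img[:h, :w].reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // 2, w // 2, 4).astype(float)
    bits = (blocks > blocks.mean(axis=2, keepdims=True)).astype(int)
    gray = bits[..., 0] * 8 + bits[..., 1] * 4 + bits[..., 2] * 2 + bits[..., 3]
    val = gray.copy()
    val ^= val >> 1            # Gray-to-binary decoding of the 4-bit pattern
    val ^= val >> 2
    var = blocks.var(axis=2)
    val[var < smooth_factor * var.mean()] = 0   # smooth blocks -> texture value 0
    return val

def cooccurrence(tex, dx, dy):
    """Normalized 15x15 co-occurrence matrix of the texture image for a
    displacement (dx, dy); row/column for the impossible value 10 is dropped."""
    H, W = tex.shape
    a = tex[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
    b = tex[max(0, dy):H + min(0, dy), max(0, dx):W + min(0, dx)]
    m = np.zeros((16, 16))
    np.add.at(m, (a.ravel(), b.ravel()), 1)
    m = np.delete(np.delete(m, 10, axis=0), 10, axis=1)
    return m / m.sum() if m.sum() else m

cooccurrence(tex, 1, 0) and cooccurrence(tex, 0, 1) give the left-to-right and top-to-bottom matrices; the averaging over shifted frames and the mirrored image described in the text is left to the caller.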
use. Lim and Lu [25] have suggested that among various color models, the
HSV (Hue, Saturation, Value) model is most effective for CBIR applications
and is less sensitive to quantization. Hence, in our system, the color feature
is computed based on the HSV model. As H controls the luminance, it has
more impact on the perception of color and we have used a fuzzy index of
color based on hue histogram to improve the performance of the system.
Color is represented using the HSV model. A hue histogram is formed. The
hue histogram thus obtained can not be used directly to search for similar
images. As an example, a red image and an almost red image (with similar
contents) are visually similar but their hue histogram may differ. Hence, to
compute the color features the hue histogram is first smoothed with a Gaussian
kernel and normalized. Then, for each of the six major colors (red, yellow,
green, blue, cyan and magenta), an index of fuzziness is computed as follows.
It is assumed that in the ideal case for an image with one dominant color
of hue h, the hue histogram would follow the Gaussian distribution p(i) with
mean h and standard deviation, say, σ. In our experiment we have chosen σ =
20 so that 99% of the population falls within h−60 to h+60. Figure 10.6 shows
the ideal distribution for h = 120 and actual hue distribution of an image.
The Bhattacharya distance [10], dh, between the actual distribution pa(i) and this ideal one p(i) indicates the closeness of the image color to hue h, where dh = Σ_i √(p(i)·pa(i)). Therefore, dh gives a measure of similarity between two
distributions. Finally, an S-function [26] maps dh to fuzzy membership F (h)
where
F(h) = 1 / (1 + e^(−θ(dh − 0.5)))
For h = 0, 60, 120, . . . membership values corresponding to red, yellow, green
etc. are obtained. In our experiment θ is taken as 15.
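A minimal sketch of the fuzzy color index, assuming a 360-bin smoothed and normalized hue histogram and the standard HSV hue positions of the six major colors; σ = 20 and θ = 15 follow the text, while the bin count and NumPy usage are assumptions.

import numpy as np

def fuzzy_color_indices(hue_hist, sigma=20.0, theta=15.0):
    """Fuzzy membership of an image to the six major colors from its
    (smoothed, normalized) 360-bin hue histogram, following the
    Bhattacharya-coefficient construction in the text."""
    hues = np.arange(360)
    memberships = {}
    for name, h in zip(('red', 'yellow', 'green', 'cyan', 'blue', 'magenta'),
                       (0, 60, 120, 180, 240, 300)):
        diff = np.minimum(np.abs(hues - h), 360 - np.abs(hues - h))  # circular distance
        ideal = np.exp(-0.5 * (diff / sigma) ** 2)
        ideal /= ideal.sum()
        d_h = np.sum(np.sqrt(ideal * hue_hist))      # Bhattacharya coefficient
        memberships[name] = 1.0 / (1.0 + np.exp(-theta * (d_h - 0.5)))  # S-function
    return memberships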
10.3 Human-Perception-Based Similarity Measure

The collection of features (the feature vector) thus formed conveys, to some extent, the visual appearance of
the image in quantitative terms. Image retrieval engines compare the feature
vector of the query image with those of the database images and presents
to the users the images of highest similarity (i.e., least distance) in order as
the retrieved images. However, it must be noted that this collection is highly
complex as its elements carry different kinds of information, shape, texture
and color, which are mutually independent. Hence, they should be handled
differently as suited to their nature. In other words, if there are n features
altogether, one should not consider the collection as a point in n-dimensional
space and apply a single distance measure to find similarity between two such
collections. For example, in the set of shape features, circularity indicates
a particular appearance of the object. If the object in the query image is
circular, then objects present in the retrieved images must be circular. If those
objects are not circular the images are rejected; it does not matter whether
the objects of those rejected images are triangular or oblong or something
else. Simply speaking, two images are considered to be similar in terms of
circularity, if their circularity feature exceeds a predefined threshold. It may
be observed that almost every shape feature presented in this work, as well as
in the literature, usually carries some information about the appearance of the
object independently. On the other hand, texture features as mentioned in the
previous section together represent the type of texture of the object surface,
and none of them can represent the coarseness or periodicity independently.
Hence, a distance function comprising all the texture features can be used
to determine the similarity between two images. Color features like redness,
greenness etc. convey, in some sense, the amount of a particular color and its
associated color present in the image. However, they are not as independent
as the shape features (circularity, convexity etc.). Secondly, these features are
represented in terms of a fuzzy index which are compared (a logical operation)
to find similarity between two images. Thus, it is understandable that though
these features together annotate an image, they are not on the same scale of units, nor are they evenly interpretable. Moreover, it is very difficult to find
out the correlations hidden among the various features, color and texture
features especially. On the other hand, there are strong implications in the
retrieval of similar images against a query. As the similarity (distance) measure
establishes the association between the query image and the corresponding
retrieved images based on these features only, it becomes the major issue.
The early work shows that most of the schemes deal with Euclidean dis-
tance, which has a number of disadvantages. One pertinent question is how to
combine the distance of multiple features. Berman and Shapiro [2] proposed
the following operations to deal with the problem:
Addition: distance = Σ_i di        (10.6)
where di is the Euclidean distance of the ith features of the images being
compared. This operation may declare visually similar images as dissimilar due
to the mismatch of only a few features. The effect will be further pronounced
if the mismatched features are sensitive enough even for a minor dissimilarity.
The situation may be improved by using
Weighted Sum: distance = Σ_i ci di        (10.7)
where ci is the weight for the Euclidean distance of ith feature. The problem
with this measure is that selection of the proper weight is again a difficult
proposition. One plausible solution could be taking ci as some sort of recip-
rocal of the variance of the ith feature. An alternative measure could be the maximum of the individual feature distances, distance = max_i di. It indicates that similar images will have all their features lying within a range.
It suffers from similar problems as the addition method. On the other hand,
the following measure, distance = min_i di, helps in finding images which have at least one feature within a specified
threshold. The effects of all other features are thereby ignored and the measure
becomes heavily biased. Hence, it is clear that for high-dimensional data, Eu-
clidean distance-based neighbor searching can not do justice to the problem.
This observation motivates us to develop a new distance-measuring scheme.
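The four combination operators discussed above are easy to state directly; a minimal sketch follows, assuming the per-feature distances di are already computed (the weights and mode names are illustrative parameters, not part of the original system).

import numpy as np

def combine_distances(d, weights=None, mode='sum'):
    """Combine per-feature distances d (1-D array) into a single score using
    the addition, weighted-sum, max or min operators discussed in the text."""
    d = np.asarray(d, dtype=float)
    if mode == 'sum':
        return float(d.sum())
    if mode == 'weighted':
        w = np.ones_like(d) if weights is None else np.asarray(weights, float)
        return float((w * d).sum())
    if mode == 'max':
        return float(d.max())      # similar only if every feature is close
    if mode == 'min':
        return float(d.min())      # similar if at least one feature is close
    raise ValueError('unknown mode')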
A careful investigation of a large group of perceptually similar images
reveals that similarity between two images is not usually judged by all possible
attributes. This means that visually similar images may be dissimilar in terms
of some features as shown in Figures 10.7, 10.8 and 10.9.
Fig. 10.7. Figure shows similar images: (a) and (b) are symmetric but differ in
circularity; whereas (b) and (c) are similar in circularity but differ in symmetricity.
Fig. 10.8. Figures show similar textured objects with different shapes.
Fig. 10.10. Search regions for (a) 1 out of 2; (b) 2 out of 3; (c) 1 out of 3.
In the case of ties, the average of the ranks that would have been assigned had there been no ties is assigned.
Based on the ranks, a test statistic is generated to check the null hypothesis.
If the value of the test statistic falls within the critical region then the null
hypothesis is rejected. Otherwise, it is accepted.
In CBIR systems, a set of images are retrieved according to a similarity
measure. Then feedback is taken from the user to identify the relevant and
irrelevant outcomes. For the time being, let us consider only the jth feature
and Xi = dist(Qj , fij ), where Qj is the jth feature of the query image and fij
is the jth feature of the ith relevant image retrieved by the process. Similarly,
Yi = dist(Qj , fij ) where fij is the jth feature of ith irrelevant image. Thus,
Xi and Yi form the different random samples. Then, the Mann-Whitney test
is applied to judge the discriminating power of the jth feature. Let F (x) and
G(x) be the distribution functions corresponding to X and Y respectively. The
null hypothesis, H0 , and alternate hypothesis, H1 , may be stated as follows:
H0: The jth feature cannot discriminate between X and Y (X and Y come from the same population), i.e.,
F(x) = G(x) for all x.
H1: The jth feature can discriminate between X and Y (X and Y come from different populations), i.e.,
F(x) ≠ G(x) for some x.
It is a two-tailed test because H0 is rejected in either of the two cases: F(x) < G(x) or F(x) > G(x).
It can be understood that a useful feature can separate the two sets and
X may be followed by Y or Y may be followed by X in the combined ordered
list. Thus, if H0 is rejected then the jth feature is taken to be a useful feature.
The steps are as follows:
1. Combine X and Y to form a single sample of size N , where N = n + m.
2. Arrange them in ascending order
3. Assign a rank starting from 1. If required, resolve ties.
4. Compute the test statistic, T , as follows.
T = [ Σ_{i=1}^{n} R(Xi) − n(N+1)/2 ] / sqrt[ (nm/(N(N−1))) Σ_{i=1}^{N} Ri² − nm(N+1)²/(4(N−1)) ]

where R(Xi) denotes the rank assigned to Xi and Σ_{i=1}^{N} Ri² denotes the sum of the squares of the ranks of all X and Y.
5. If the value of T falls within the critical region then H0 is rejected and
the jth feature is considered useful otherwise it is not.
The critical region depends on the level of significance α which denotes the
maximum probability of rejecting a true H0 . If T is less than its α/2 quantile
or greater than its 1 − α/2 quantile then H0 is rejected. In our experiment,
the distribution of T is assumed to be normal and α is taken as 0.1. If the
concerned feature discriminates and places the relevant images at the begin-
ning of the combined ordered list, then T will fall within the lower critical
region. On the other hand, if the concerned feature discriminates and places
the relevant images at the end of the same list then T will fall within the
upper critical region.
It may be noted that, the proposed work proceeds only if the retrieved set
contains both relevant and irrelevant images. Otherwise, samples from two
different populations will not be available and no feedback mechanism can be
adopted.
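A minimal sketch of the feature-usefulness test described above, using the normal approximation and α = 0.1 as in the text; the use of SciPy for the normal quantiles is an assumption, and the function is an illustration rather than the authors' implementation.

import numpy as np
from scipy.stats import norm

def feature_usefulness(x_dists, y_dists, alpha=0.1):
    """Rank-based test of whether feature j separates relevant (X) and
    irrelevant (Y) retrieved images.  Returns 'lower' or 'upper' when the
    feature is useful (T falls in the corresponding critical region),
    otherwise None."""
    x, y = np.asarray(x_dists, float), np.asarray(y_dists, float)
    n, m = len(x), len(y)
    N = n + m
    combined = np.concatenate([x, y])
    order = combined.argsort()
    ranks = np.empty(N)
    ranks[order] = np.arange(1, N + 1)
    for v in np.unique(combined):            # average ranks for ties
        idx = combined == v
        ranks[idx] = ranks[idx].mean()
    num = ranks[:n].sum() - n * (N + 1) / 2.0
    den = np.sqrt(n * m / (N * (N - 1.0)) * np.sum(ranks ** 2)
                  - n * m * (N + 1) ** 2 / (4.0 * (N - 1)))
    T = num / den
    if T < norm.ppf(alpha / 2):
        return 'lower'      # relevant images ranked first: useful feature
    if T > norm.ppf(1 - alpha / 2):
        return 'upper'      # relevant images ranked last: useful feature
    return None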
Fig. 10.11. Variation of search space with the weights of the features.
wj = wj + σx²
where σx² is the variance of X.
3. For each jth useful feature where the test statistic falls within the upper
critical region, set wj as follows:
2. For all jth useful features with the test statistic in the lower critical region, set tolerancej = tolerancej − 1. If tolerancej < MIN then tolerancej = MIN.
3. For all jth useful features with the test statistic in the upper critical region, set tolerancej = tolerancej + 1. If tolerancej > MAX then tolerancej = MAX.
4. Repeat steps 2 and 3 for successive iterations.
MIN and MAX denote the minimum and maximum possible tolerance
values. In our experiment, we have considered t as 2, MIN as 0 and MAX as
B − 1 where B is the number of buckets in the feature space.
Fig. 10.12. Recall–precision graphs for our database (a) using shape features and
(b) using shape and texture features.
The recall–precision graphs show the performance of the proposed system for various types of features using the two databases. Some
sample results are shown in Figures 10.15 and 10.16 for our database and the
COIL-100 database respectively.
Fig. 10.13. Recall–precision graphs for the COIL-100 database (a) using shape
features and (b) using shape and texture features.
Fig. 10.14. Recall–precision graphs for (a) our database and (b) the COIL-100
database.
Since it is quite likely that similar images may spread over multiple divisions of the feature space, achieving high recall is quite difficult. Hence, performance is studied based on top-order retrievals. Moreover,
Muller et al. [32] have mentioned that, from the perspective of a user, top
order retrievals are of major interest. Table 10.3 shows that retrieval precision
is higher in the case of the human-perception-based similarity measure and it
proves the retrieval capability of the proposed similarity measure.
The proposed relevance feedback scheme is also applied to improve the
retrieval performance. It has been checked for both the databases and using
both Euclidean distance and the proposed human-perception-based measure.
Tables 10.4 and 10.5 along with the recall–precision graphs in Figures 10.17
and 10.18 reflect the improvement achieved through the proposed scheme for
the measures.
Table 10.4. Precision (in %) using relevance feedback for our database.
                    Euclidean distance                  Proposed similarity measure
No. of retrieved    No relevance   Relevance feedback   No relevance   Relevance feedback
images              feedback       Iter1  Iter2  Iter3  feedback       Iter1  Iter2  Iter3
10                  76.16          77.91  79.61  81.40  81.10          87.39  89.32  91.17
20                  70.87          74.50  76.03  78.48  76.39          82.39  84.85  86.63
30                  68.05          69.89  71.38  72.63  73.15          78.61  81.34  83.20
10.6 Conclusions
In this paper we have established the capability of petal projection and other
types of shape features for content-based retrieval. The use of the texture
co-occurrence matrix and fuzzy indexes of color based on a hue histogram
Fig. 10.15. Retrieval results (using our database): first image of each row is the
query image and the others are the top five images matched.
Fig. 10.16. Retrieval results (using the COIL-100 database): first image of each
row is the query image and the others are the top five images matched.
Fig. 10.17. Recall–precision graphs for different classes; they are (in raster order)
Airplane, Car, Fish and Overall database.
Table 10.5. Precision (in %) using relevance feedback for COIL-100 database.
                    Euclidean distance                        Proposed similarity measure
No. of retrieved    No relevance   Relevance feedback         No relevance   Relevance feedback
images              feedback       (after iteration 3)        feedback       (after iteration 3)
10                  82.46          84.74                      88.52          91.07
20                  73.59          76.47                      79.25          83.91
30                  67.31          70.40                      72.25          79.57
further improves the performance. A comparison with similar systems was also made as a benchmark. A new measure of similarity based on human percep-
tion was presented and its capability has been established. To improve the
retrieval performance, a novel feedback mechanism was described, and experiments show that the enhancement is substantial. Hence, our proposed retrieval scheme, in conjunction with the proposed relevance feedback strategy, is able to discover knowledge about the image content by assigning various emphases to the annotating features.
Fig. 10.18. Recall–precision graphs for different objects from the COIL-100
database; they are (in raster order) objects 17, 28, 43 and 52.
References
[1] Aggarwal, G., P. Dubey, S. Ghosal, A. Kulshreshtha and A. Sarkar, July
2000: IPURE: Perceptual and user-friendly retrieval of images. Proceed-
ings of IEEE Conference on Multimedia and Exposition (ICME 2000),
New York, USA, volume 2, 693–6.
[2] Berman, A. P., and L. G. Shapiro, 1999: A flexible image database system
for content-based retrieval. Computer Vision and Image Understanding,
75, 175–95.
[3] Bimbo, A. D., P. Pala and S. Santini, 1996: Image retrieval by elastic
matching of shapes and image patterns. Proceedings of Multimedia’96 ,
215–18.
[4] Chanda, B., and D. D. Majumdar, 2000: Digital Image Processing and
Analysis. Prentice Hall, New Delhi, India.
[5] Ciocca, G., I. Gagliardi and R. Schettini, 2001: Quicklook2: An integrated
multimedia system. International Journal of Visual Languages and Com-
puting, Special issue on Querying Multiple Data Sources Vol 12 (SCI
5417), 81–103.
[6] Conover, W. J., 1999: Practical nonparametric statistics, 3rd edition. John
Wiley and Sons, New York.
[7] Cox, I. J., M. L. Miller, T. P. Minka, T. Papathomas and P. N. Yiani-
los, 2000: The Bayesian image retrieval system, pichunter: Theory, imple-
mentation and psychophysical experiments. IEEE Transactions on Image
Processing, 9(1), 20–37.
[8] Delp, E. J., and O. R. Mitchell, 1979: Image compression using block
truncation coding. IEEE Trans. on Comm., 27, 1335–42.
[9] Fournier, J., M. Cord and S. Philipp-Foliguet, 2001: RETIN: A content-
based image indexing and retrieval system. Pattern Analysis and Appli-
cations, 4, 153–73.
[10] Fukunaga, K., 1972: Introduction to Statistical Pattern Recognition. Aca-
demic Press, NY, USA.
[11] Gevers, T., and A. Smeulders, 2000: Pictoseek: Combining color and shape
invariant features for shape retrieval. IEEE Transactions on Image Pro-
cessing, 9(1), 102–19.
[12] Gudivada, V. N., and V. V. Raghavan, 1995: Content-based image re-
trieval systems. IEEE Computer , 28(9), 18–22.
[13] Haralick, R. M., K. Shanmugam and I. Dinstein, 1973: Texture features
for image classification. IEEE Trans. on SMC , 3(11), 610–22.
[14] Hu, M. K., 1962: Visual pattern recognition by moment invariants. IRE
Trans. on Info. Theory, IT-8, 179–87.
[15] Huang, J., S. R. Kumar, and M. Mitra, 1997: Combining supervised learn-
ing with color correlogram for content-based retrieval. 5th ACM Intl. Mul-
timedia Conference, 325–34.
[16] Huang, J., S. R. Kumar, M. Mitra, W. J. Zhu and R. Zabih, 1997: Image
indexing using color correlogram. IEEE Conference on Computer Vision
and Pattern Recognition, 762–8.
[17] Jain, A. K., and A. Vailaya, 1998: Shape-based retrieval: A case study
with trademark image database. Pattern Recognition, 31(9), 1369–90.
[18] Jain, R., ed., 1997: Special issue on visual information management.
Comm. ACM .
[19] Kaplan, L. M., 1998: Fast texture database retrieval using extended frac-
tal features. SPIE 3312 , SRIVD VI, 162–73.
[20] Kimia, B., J. Chan, D. Bertrand, S. Coe, Z. Roadhouse and H. Tek, 1997:
A shock-based approach for indexing of image databases using shape.
SPIE 3229 , MSAS II, 288–302.
[21] Ko, B., J. Peng and H. Byun, 2001: Region-based image retrieval using
probabilistic feature relevance learning. Pattern Analysis and Applica-
tions, 4, 174–84.
[22] Laaksonen, J., M. Koskela, S. Laakso and E. Oja, 2000: Picsom: content-
based image retrieval with self-organizing maps. PRL, 21, 1199–1207.
[23] Lai, T.-S., January 2000: CHROMA: a Photographic Image Retrieval Sys-
tem. Ph.D. thesis, School of Computing, Engineering and Technology,
University of Sunderland, UK.
[24] Li, Z. N., D. R. Zaiane and Z. Tauber, 1999: Illumination invariance and
object model in content-based image and video retrieval. Journal of Visual
Communication and Image Representation, 10(3), 219–44.
[25] Lim, S. and G. Lu, 2003: Effectiveness and efficiency of six colour spaces
for content based image retrieval. CBMI 2003 , France, 215–21.
[26] Lin, C., and C. S. G. Lee, 1996: Neural Fuzzy Systems. Prentice-Hall, NJ.
[27] Ma, W. Y., and B. S. Manjunath, 1995: A comparison of wavelet trans-
form features for texture image annotation. IEEE Intl. Conf. on Image
Processing, 256–9.
[28] Manjunath, B. S., and W. Y. Ma, 1996: Texture features for browsing
and retrieval of image data. IEEE Trans. on PAMI , 18, 837–42.
[29] Mills, T. J., D. Pye, D. Sinclair and K. R. Wood, 2000: Shoebox: A digital
photo management system. technical report 2000.10.
[30] Mokhtarian, F., S. Abbasi and J. Kittler, August 1996: Efficient and
robust retrieval by shape content through curvature scale space. Image
Database and Multi-Media Search, Proceedings of the First International
Workshop IDB-MMS’96 , Amsterdam, The Netherlands. Amsterdam Uni-
versity Press, 35–42.
[31] Mukherjee, S., K. Hirata and Y. Hara, 1999: A world wide web image
retrieval engine. The WWW journal, 2(3), 115–32.
[32] Muller, H., W. Muller, S. Marchand-Mallet, T. Pun and D. M. Squire,
2001: Automated benchmarking in content-based image retrieval. ICME
2001 , Tokyo, Japan, 22–5.
[33] Niblack, W., 1993: The QBIC project: Querying images by content using
color, texture and shape. SPIE , SRIVD.
[34] Pentland, A., and R. Picard, 1996: Introduction to special section on the
digital libraries: Representation and retrieval. IEEE Trans. on PAMI , 18,
769–70.
[35] Persoon, E., and K. S. Fu, 1977: Shape discrimination using Fourier de-
scriptors. IEEE Trans. on SMC , 7, 170–9.
[36] Prasad, B. G., S. K. Gupta and K. K. Biswas, 2001: Color and shape index
for region-based image retrieval. IWVF4 , volume LNCS 2059, 716–25.
[55] Vleugels, J., and R. C. Veltkamp, 2002: Efficient image retrieval through
vantage objects. Pattern Recognition, 35(1), 69–80.
[56] Yu, H., M. Li, H. Jiang Zhang and J. Feng, 2002: Color texture moments
for content-based image retrieval. IEEE Int. Conf. on Image Proc., New
York, USA.
11
Significant Feature Selection Using
Computational Intelligent Techniques for
Intrusion Detection
Summary. Due to increasing incidence of cyber attacks and heightened concerns for
cyber terrorism, implementing effective intrusion detection and prevention systems
(IDPSs) is an essential task for protecting cyber security as well as physical security
because of the great dependence on networked computers for the operational control
of various infrastructures.
Building effective intrusion detection systems (IDSs), unfortunately, has re-
mained an elusive goal owing to the great technical challenges involved; and com-
putational techniques are increasingly being utilized in attempts to overcome the
difficulties. This chapter presents a comparative study of using support vector ma-
chines (SVMs), multivariate adaptive regression splines (MARSs) and linear genetic
programs (LGPs) for intrusion detection. We investigate and compare the perfor-
mance of IDSs based on the mentioned techniques, with respect to a well-known set
of intrusion evaluation data.
We also address the related issue of ranking the importance of input features,
which itself is a problem of great interest. Since elimination of the insignificant
and/or useless inputs leads to a simplified problem and possibly faster and more
accurate detection, feature selection is very important in intrusion detection. Ex-
periments on current real-world problems of intrusion detection have been carried
out to assess the effectiveness of this criterion. Results show that using significant
features gives the most remarkable performance and performs consistently well over
the intrusion detection data sets we used.
11.1 Introduction
Feature selection and ranking is an important issue in intrusion detection.
Of the large number of features that can be monitored for intrusion detection
purposes, which are truly useful, which are less significant, and which may be
useless? The question is relevant because the elimination of useless features
(audit trail reduction) enhances the accuracy of detection while speeding up
the computation, thus improving the overall performance of an IDS. In cases
where there are no useless features, by concentrating on the most important
ones we may well improve the time performance of an IDS without affecting
the accuracy of detection in statistically significant ways.
The feature selection and ranking problem for intrusion detection is similar
in nature to various engineering problems that are characterized by:
perform MARSs and artificial neural networks (ANNs) in three critical aspects
of intrusion detection: accuracy, training time, and testing time [9].
A brief introduction to SVMs and SVM-specific feature selection is given in
Section 11.2. Section 11.3 introduces LGPs and LGP-specific feature selection.
In Section 11.4 we introduce MARSs and MARS-specific feature selection. An
experimental data set used for evaluation is presented in Section 11.5. Sec-
tion 11.6 describes the significant feature identification problem for intrusion
detection systems, a brief overview of significant features as identified by dif-
ferent ranking algorithms and the performance of classifiers using all features
and significant features. Conclusions of our work are given in Section 11.7.
subject to

Σ_{i=1}^{l} yi αi = 0,   ∀i: 0 ≤ αi ≤ C                    (11.2)
where l is the number of training examples, α is a vector of l variables and
each component αi corresponds to a training example (xi , yi ). The solution of
Equation (11.1) is the vector α∗ for which Equation (11.1) is minimized and
Equation (11.2) is fulfilled.
In the first phase of the SVM, called the learning phase, the decision
function is inferred from a set of objects. For these objects the classification
is known a priori. The objects of the family of interest are called, for ease
of notation, the positive objects and the objects from outside the family, the
negative objects.
In the second phase, called the testing phase, the decision function is ap-
plied to arbitrary objects in order to determine, or more accurately predict,
whether they belong to the family under study, or not.
[Figure: positive examples (labelled "+") and negative examples (labelled "*") separated by a hyperplane H, with the bounding hyperplanes Hb and Hr defining the margin.]
The positive examples form a cloud of points, say the points labelled “+”
(referred to as a set Xb ), while the negative examples form another cloud of
points, say, the points labelled “*” (referred to as Xr ). The aim is to find a
hyper-plane H separating the two clouds of points in some optimal way.
Quadratic Programming:
A constrained optimization problem consists of two parts: a function to op-
timize and a set of constraints to be satisfied by the variables. Constraint
satisfaction is typically a hard combinatorial problem; while, for an appro-
priate choice of function, optimization is a comparatively easier analytical problem.
The hyper-plane H is sought such that W · Xi + b ≥ t for all Xi in Xb, and W · Xi + b ≤ −t for all Xi in Xr, and there exist points in Xb and Xr for which the inequalities are replaced by equalities. Consequently we have the margin

γ = 2t / ‖W‖₂

The two conditions can be combined as Yi(W · Xi + b) ≥ t, where Xi is a data point and Yi is the label of the data point, equal to 1 or −1 depending on whether the point is a positive or a negative example.
We now have a typical quadratic programming problem and we will change
this formulation with nonlinear separability in mind.
Nonlinear Separation
It can be shown that if you have fewer points than the dimension, then any
two sets are separable. It is therefore tempting, when the two sets are not
linearly separable, to map the problem into a higher dimension where it will
become separable. There is however a price to pay as quadratic programming
problems are quite sensitive to high dimensions. SVM handles this problem through the dual formulation described next.
Wolfe’s Dual
The preceding formulation can be transformed by duality, which has the ad-
vantage of simplifying the set of constraints, but more importantly, Wolfe’s
dual gives us a formulation where the data appears only as vector dot prod-
ucts. As a consequence we can handle nonlinear separation.
Minimizing (1/2)‖W‖² under the constraint of Equation (11.4) is equivalent
to maximizing the dual Lagrangian obtained by computing variables from the
stationary conditions and replacing them by the values so obtained in the
primal Lagrangian. Details can be found in [3].
Support Vectors
It is clear that the maximum margin is not defined by all points to be sep-
arated, but by only a subset of points called support vectors. Indeed, from Equation (11.9) we know that W = Σ_i Yi αi Xi, and the data points Xi whose coefficients αi = 0 are irrelevant to W and therefore irrelevant to the definition of the separating surface; the others are the support vectors.
It is of great interest and use to find exactly which features underlie the nature of connections of various classes. This is precisely the goal of data
visualization in data mining. The problem is that the high-dimensionality of
data makes it hard for human experts to gather any knowledge. If we knew
the key features, we could greatly reduce the dimensionality of the data and
thus help human experts become more efficient and productive in learning
about network intrusions.
The information about which features play key roles and which are more
neutral is “hidden” in the SVM decision function. Equation (11.11) is the formulation of the decision function in the case of using linear kernels:

F(X) = Σ_i Wi Xi + b                                    (11.11)

One can see that the value of F(X) depends on the contribution of each factor, Wi Xi. Since Xi can take only non-negative values, the sign of Wi indicates whether
the contribution is towards positive classification or negative classification.
The absolute size of Wi measures the strength of this contribution. In other
words if Wi is a large positive value, then the ith feature is a key factor of
“positive class” or class A. Similarly if Wi is a large negative value then the
ith feature is a key factor of the “negative class” or class B. Consequently the
Wi , that are close to zero, either positive or negative, carry little weight. The
feature, which corresponds to this Wi , is said to be a garbage feature and
removing it has very little effect on the classification.
Having retrieved this information directly from the SVM’s decision func-
tion, we rank the Wi from largest positive to largest negative. This essentially
provides the soft partitioning of the features into the key features of class A,
neutral features, and key features of class B. We say soft partitioning, as it
depends on either a threshold on the value of Wi that will define the parti-
tions or the proportions of the features that we want to allocate to each of
the partitions. Both the threshold and the value of proportions can be set by
the human expert.
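A minimal sketch of this weight-based ranking, using scikit-learn's LinearSVC as a stand-in (the chapter does not name a particular SVM package) and assuming binary class labels; the parameter values are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def svdf_ranking(X, y):
    """Rank features by the weights of a linear SVM decision function
    F(X) = sum_i W_i X_i + b: large positive W_i -> key feature of the
    positive class, large negative W_i -> key feature of the negative
    class, W_i near zero -> neutral ('garbage') feature."""
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    w = clf.coef_.ravel()                 # binary classification assumed
    order = np.argsort(w)[::-1]           # largest positive first
    return [(int(i), float(w[i])) for i in order]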
We first describe a general (i.e., independent of the modeling tools being used),
performance-based input ranking (PBR) methodology [12]: One input feature
is deleted from the data at a time; the resultant data set is then used for
the training and testing of the classifier. Then the classifier’s performance is
compared to that of the original classifier (based on all features) in terms of
relevant performance criteria. Finally, the importance of the feature is ranked
according to a set of rules based on the performance comparison.
The procedure is summarized as follows:
F(p) = (1/(n·m)) Σ_{i=1}^{n} Σ_{j=1}^{m} (Oij^des − Oij^pred)² + (w/n)·CE = MSE + w·MCE        (11.13)
Fig. 11.7. MARS data estimation using splines and knots (actual data on the right).
Given these two criteria, a successful method will essentially need to be adap-
tive to the characteristics of the data. Such a solution will probably ignore
quite a few variables (affecting variable selection) and will take into account
only a few variables at a time (also reducing the number of regions). Even if
the method selects 30 variables for the model, it will not look at all 30 simul-
taneously. Such simplification is accomplished by a decision tree at a single
node, only ancestor splits being considered; thus, at a depth of six levels in
the tree, only six variables are used to define the node.
GCV = (1/N) Σ_{i=1}^{N} [ (yi − f(xi)) / (1 − k/N) ]²        (11.14)
where N is the number of records and x and y are independent and dependent
variables respectively. k is the effective number of degrees of freedom, whereby the GCV adds a penalty for adding more input variables to the model. The con-
tribution of the input variables may be ranked using the GCV with/without
an input feature [13].
11.5.1 Probing
User to super user (U2Su) exploits are a class of attacks where an attacker
starts out with access to a normal user account on the system and is able to
exploit a vulnerability to gain root access to the system. The most common
exploits in this class of attacks are buffer overflows, which are caused by
programming mistakes and environment assumptions (see Table 11.3).
By concentrating on the most important features we may well improve the time performance of an IDS without affecting the accuracy of detection in statistically significant ways.
The feature ranking and selection problem for intrusion detection is similar
in nature to various engineering problems that are characterized by:
• Having a large number of input variables x = (x1 , x2 ,. . . , xn ) of vary-
ing degrees of importance to the output y; i.e., some elements of x are
essential, some are less important, some of them may not be mutually
independent, and some may be useless or irrelevant (in determining the
value of y);
• Lacking an analytical model that provides the basis for a mathematical
formula that precisely describes the input–output relationship, y = F (x);
• Having available a finite set of experimental data, based on which a model
(e.g. a neural network) can be built for simulation and prediction purposes.
Due to the lack of an analytical model, one can only seek to determine the
relative importance of the input variables through empirical methods. A com-
plete analysis would require examination of all possibilities, e.g., taking two
variables at a time to analyze their dependence or correlation, then taking
three at a time, etc. This, however, is both infeasible (requiring 2^n experi-
ments!) and not infallible (since the available data may be of poor quality in
sampling the whole input space). Features are ranked based on their influence
towards the final classification. Description of the most important features as
ranked by three feature-ranking algorithms (SVDF, LGP, and MARS) is given
in Tables 11.5, 11.6, and 11.7. The (training and testing) data set contains
11,982 randomly generated points from the five classes, with the amount of
data from each class proportional to its size, except that the smallest class
is completely included. The normal data belongs to class 1, probe belongs to
class 2, denial of service belongs to class 3, user to super user belongs to class 4, and remote to user (R2U) belongs to class 5.
11.7 Conclusions
Three different significant feature identification techniques along with a com-
parative study of feature selection metrics for intrusion detection systems are
presented. Another contribution of this work is a novel significant feature
selection algorithm (independent of the modeling tools being used) that con-
siders the performance of a classifier to identify significant features. One input
feature is deleted from the data at a time; the resultant data set is then used
for the training and testing of the classifier. Then the classifier’s performance
is compared to that of the original classifier (based on all features) in terms of
relevant performance criteria. Finally, the importance of the feature is ranked
according to a set of rules based on the performance comparison.
Regarding feature ranking, we observe that
• The three feature-ranking methods produce largely consistent results. Ex-
cept for the class 1 (Normal) and class 4 (U2Su) data, the features ranked
as important by the three methods heavily overlap.
• The most important features for the two classes of Normal and DoS heavily
overlap.
• U2Su and R2U are the two smallest classes representing the most serious
attacks. Each has a small number of important features and a large number
of insignificant features.
• Using the important features for each class gives the most remarkable per-
formance: the testing time decreases in each class, the accuracy increases
slightly for Normal, decreases slightly for Probe and DoS, and remains the
same for the two most serious attack classes.
• Performance-based and SVDF feature ranking methods produce largely
consistent results: except for the class 1 (Normal) and class 4 (U2Su) data,
the features ranked as important by the two methods heavily overlap.
Acknowledgments: Support for this research was received from ICASA (In-
stitute for Complex Additive Systems Analysis, a division of New Mexico
Tech), and from a DoD and NSF IASP capacity-building grant. We would also like
to acknowledge many insightful suggestions from Dr. Jean-Louis Lassez and
Dr. Ajith Abraham that helped clarify our ideas and contributed to our work.
References
[1] Banzhaf, W., P. Nordin, E. R. Keller and F. D. Francone, 1998: Genetic
programming: An introduction – on the automatic evolution of computer
programs and its applications. Morgan Kaufmann.
[2] Brameier, M., and W. Banzhaf, 2001: A comparison of linear genetic
programming and neural networks in medical data mining. IEEE Trans-
actions on Evolutionary Computation, 5 (1), 17–26.
Summary. Data streams are generated in large quantities and at rapid rates from
sensor networks that typically monitor environmental conditions, traffic conditions
and weather conditions among others. A significant challenge in sensor networks is
the analysis of the vast amounts of data that are rapidly generated and transmitted
through sensing. Given that wired communication is infeasible in the environmen-
tal situations outlined earlier, the current method for communicating this data for
analysis is through satellite channels. Satellite communication is exorbitantly ex-
pensive. In order to address this issue, we propose a strategy for on-board mining
of data streams in a resource-constrained environment. We have developed a novel
approach that dynamically adapts the data-stream mining process on the basis of
available memory resources. This adaptation is algorithm-independent and enables
data-stream mining algorithms to cope with high data rates in the light of finite
computational resources. We have also developed lightweight data-stream mining
algorithms that incorporate our adaptive mining approach for resource constrained
environments.
12.1 Introduction
In its early stages, data-mining research was focused on the development of
efficient algorithms for model building and pattern extraction from large cen-
tralized databases. The advance in distributed computing technologies had its
effect on data mining research and led to the second generation of data mining
technology – distributed data mining (DDM) [46]. There are primarily two
models proposed in the literature for distributed data mining: collect the data
to a central site to be analyzed (which is infeasible for large data sets) and
mine data locally and merge the results centrally. The latter model addresses
the issue of communication overhead associated with data transfer, however,
brings with it the new challenge of knowledge integration [38]. On yet another
strand of development, parallel data mining techniques have been proposed
and developed to overcome the problem of lengthy execution times of complex
machine learning algorithms [53].
The transfer of such vast amounts of data streams for analysis from sensor
networks is dependent on satellite communication, which is exorbitantly ex-
pensive. A potential and intuitive solution to this problem is to develop new
techniques that are capable of coping with the high data rate of streams and
deliver mining results in real-time with application-oriented acceptable accu-
racy [24]. Such predictive or analytical models of streamed data can be used
to reduce the transmission of raw data from sensor networks since they are
compact and representative. The analysis of data in such ubiquitous environ-
ments has been termed ubiquitous data mining (UDM) [20, 32]. The research
in the field has two main directions: the development of lightweight analysis
algorithms that are capable of coping with rapid and continuous data streams
and the application of such algorithms for real-time decision making [34, 35].
The applications of UDM can vary from critical astronomical and geophys-
ical applications to real-time decision support in business applications. There
are several potential scenarios for such applications:
These algorithms have proved their efficiency [20, 21, 28]. However, we realized
that one-pass algorithms don’t address the problem of resource constraints
with regard to high data rates of incoming streams.
Algorithm output granularity (AOG) introduces the first resource-aware
data analysis approach that can cope with fluctuating data rates according
to available memory and processing speed. AOG was first introduced in [20,
28]. A holistic perspective on the integration of our lightweight algorithms with the resource-aware AOG approach is discussed. Experimental validation that
demonstrates the feasibility and applicability of our proposed approach is
presented in this chapter.
This chapter is organized as follows. In Section 12.2, an overview of the field
of data-stream processing is presented. Data-stream mining is discussed in Sec-
tion 12.3. Section 12.4 presents our AOG approach in addressing the problem.
Our lightweight algorithms that use AOG are discussed in Section 12.5. The
experimental results of using the AOG approach are shown and discussed in
Section 12.5.3. Finally, open issues and challenges in the field conclude our
chapter.
The management and querying of data streams has been addressed in research recently. STREAM [5], Aurora [1] and TelegraphCQ [37] are representative of such prototypes and systems. STan-
ford stREam datA Manager (STREAM) [5] is a data-stream management sys-
tem that handles multiple continuous data streams and supports long-running
continuous queries. The intermediate results of a continuous query are stored
in a Scratch Store. The results of a query could be a data stream transferred
to the user or it could be a relation that could be stored for re-processing.
Aurora [1] is a data work-flow system under construction. It directs the input
data stream using pre-defined operators to the applications. The system can
also maintain historical storage for ad hoc queries. The Telegraph project is a
suite of novel technologies developed for continuous adaptive query process-
ing implementation. TelegraphCQ [37] is the next generation of that system,
which can deal with continuous data stream queries.
Querying over data streams faces the problem of the unbounded memory
requirement and the high data rate [39]. Thus, the processing time per data element must be less than the interval between the arrivals of successive elements. Also, it is very hard, due to un-
bounded memory requirements, to have an exact result. Approximating query
results have been addressed recently. One of the techniques used in solving
this problem is the sliding window, in which the query result is computed
over a recent time interval. Batch processing, sampling, and synopsis data
structures are other techniques for data reduction [6, 24].
12.3.1 Techniques
Different algorithms have been proposed to tackle the high-speed nature of data streams using different techniques. In this section, we review the
state of the art of mining data streams.
Guha et al. [29, 30] have studied clustering data streams using the K-
median technique. Their algorithm makes a single pass over the data and uses
little space. It requires O(nk) time and O(n^ε) space, where k is the number of centers, n is the number of points, and ε < 1. The algorithm is not implemented, but its space and time requirements are analyzed.
They proved that any k-median algorithm that achieves a constant factor
approximation can not achieve a better run time than O(nk). The algorithm
starts by clustering a sample, whose size is determined by the available memory, into 2k clusters; at the second level, it clusters the points obtained from a number of such samples into 2k clusters again. This process is repeated over a number of levels, and finally the 2k clusters are clustered into k clusters.
Babcock et al. [8] have used an exponential histogram (EH) data structure
to enhance the Guha et al. algorithm. They use the same algorithm described
above, however they try to address the problem of merging clusters when the
two sets of cluster centers to be merged are far apart by maintaining the EH
data structure. They have studied their proposed algorithm analytically.
Charikar et al. [12] have proposed a k-median algorithm that overcomes
the problem of increasing approximation factors in the Guha et al. algorithm
by increasing the number of levels used to result in the final solution of the
divide and conquer algorithm. This technique has been studied analytically.
Domingos et al. [16, 17, 33] have proposed a general method for scaling up
machine-learning algorithms. This method depends on determining an upper bound for the learner's loss as a function of the number of examples used in each
step of the algorithm. They have applied this method to K-means clustering
(VFKM) and decision tree classification (VFDT) techniques. These algorithms
have been implemented and tested on synthetic data sets as well as real web
data. VFDT is a decision-tree learning system based on Hoeffding trees. It
splits the tree using the current best attribute, taking into consideration that the number of examples used satisfies the statistical criterion known as the Hoeffding bound. The algorithm also deactivates the least promising leaves and drops unpromising attributes. VFKM uses the same concept to deter-
mine the number of examples needed in each step of the K-means algorithm.
The VFKM runs as a sequence of K-means executions with each run using
more examples than the previous one until a calculated statistical bound is
satisfied.
O’Callaghan et al. [43] have proposed STREAM and LOCALSEARCH al-
gorithms for high quality data-stream clustering. The STREAM algorithm
starts by determining the size of the sample and then applies the LO-
CALSEARCH algorithm if the sample size exceeds a threshold given by a pre-specified equation. This process is repeated for each data chunk. Finally, the LO-
CALSEARCH algorithm is applied to the cluster centers generated in the
previous iterations.
Aggarwal et al. [2] have proposed a framework for clustering data streams,
called the CluStream algorithm. The proposed technique divides the clus-
tering process into two components. The online component stores summary
statistics about the data streams and the offline one performs clustering on
the summarized data according to a number of user preferences such as the
time frame and the number of clusters. A number of experiments on real data
sets have been conducted to prove the accuracy and efficiency of the proposed
algorithm. Aggarwal et al. [3] have recently proposed HPStream, a projected
clustering for high dimensional data streams. HPStream has outperformed
CluStream in recent results. The idea of micro-clusters introduced in CluS-
tream has also been adopted in On-Demand classification in [4] and it shows
a high accuracy.
Keogh et al. [36] have shown empirically that the most cited time-series data-stream clustering algorithms proposed so far in the literature produce meaningless results for subsequence clustering. They have proposed a solution
approach using a k-motif to choose the subsequences that the algorithm can
work on.
Ganti et al. [19] have described an algorithm for model maintenance un-
der insertion and deletion of blocks of data records. This algorithm can be
applied to any incremental data mining model. They have also described a
generic framework for change detection between two data sets in terms of
the data mining results they induce. They formalize the above two techniques
into two general algorithms: GEMM and Focus. The algorithms are not imple-
mented, but are applied analytically to decision tree models and the frequent
itemset model. The GEMM algorithm accepts a class of models and an incre-
mental model maintenance algorithm for the unrestricted window option, and
outputs a model maintenance algorithm for both window-independent and
window-dependent block selection sequences. The FOCUS framework uses
the difference between data mining models as the deviation in data sets.
Papadimitriou et al. [45] have proposed AWSOM (Arbitrary Window
Stream mOdeling Method) for discovering interesting patterns from sensor data.
They developed a one-pass algorithm to incrementally update the patterns.
Their method requires only O(log N ) memory where N is the length of the
sequence. They conducted experiments on real and synthetic data sets. They
use wavelet coefficients for compact information representation and correlation
structure detection, and then apply a linear regression model in the wavelet
domain.
Giannella et al. [25] have proposed and implemented a frequent itemsets
mining algorithm over data streams. They proposed to use tilted windows to
calculate the frequent patterns for the most recent transactions based on the
fact that people are more interested in the most recent transactions. They
use an incremental algorithm to maintain the FP-stream, which is a tree data
structure, to represent the frequent itemsets. They conducted a number of
experiments to prove the algorithm’s efficiency. Manku and Motwani [40] have
proposed and implemented approximate frequency counts in data streams.
The implemented algorithm uses all the previous historical data to calculate
the frequent patterns incrementally.
Wang et al. [52] have proposed a general framework for mining concept-
drifting data streams. They observed that data-stream mining algorithms
don't take notice of the concept drift in the evolving data. They proposed a weighted classifier ensemble to address this problem.
Recently systems and applications that deal with data streams have been
developed. These systems include:
• Burl et al. [10] have developed Diamond Eye for NASA and JPL. They
aim by this project to enable remote systems as well as scientists to extract
patterns from spatial objects in real-time image streams. The success of
this project will enable “a new era of exploration using highly autonomous
spacecraft, rovers, and sensors” [3].
• Kargupta et al. [35, 46] have developed the first UDM system: MobiMine.
It is a client/server PDA-based distributed data mining application for
financial data. They develop the system prototype using a single data
source and multiple mobile clients; however the system is designed to han-
dle multiple data sources. The server functionalities in the proposed system
are data collection from different financial web sites; storage; selection of
active stocks using common statistics; and applying online data mining
techniques to the stock data. The client functionalities are portfolio man-
agement using a mobile micro database to store portfolio data and the user's preferences.
The above systems and techniques use different strategies to overcome the
three main problems discussed earlier. The following is an abstraction of these
strategies [27]:
each element that is used in the sampling technique. Figure 12.2 illustrates
the idea of data rate adaptation from the input side using sampling.
• The algorithm rate (AR) is a function of the data rate (DR), i.e., AR =
f(DR).
• The time needed to fill the available memory by the algorithm results
(TM) is a function of (AR), i.e., TM = f(AR).
• The algorithm accuracy (AC) is a function of (TM), i.e., AC = f(TM).
Algorithm Threshold
The algorithm threshold is a controlling parameter built into the algorithm
logic that encourages or discourages the creation of new outputs according to
three factors that vary over temporal scale:
• Available memory.
• Remaining time to fill the available memory.
• Data stream rate.
Output Granularity
The output granularity is the amount of generated results that are acceptable
according to a pre-specified accuracy measure. This amount should be resi-
dent in memory before doing any incremental integration.
Time Threshold
The time threshold is the required time to generate the results before any in-
cremental integration. This time might be specified by the user or calculated
adaptively based on the history of running the algorithm.
Time Frame
The time frame is the time between each two consecutive data rate measure-
ments. This time varies from one application to another and from one mining
technique to another.
v. Repeat steps 3 and 4 until the time interval threshold is reached.
vi. Perform knowledge integration of the results.
The algorithm output granularity in mining data streams has primitive
parameters, and operations that operate on these parameters. AOG algebra
is concerned with defining these parameters and operations. The develop-
ment of AOG-based mining techniques should be guided by these primitives
depending on empirical studies. That means defining the timing settings of
these parameters to get the required results. Thus the settings of these pa-
rameters depend on the application and technique used. For example, we can
use certain settings for a clustering technique when we use it in astronomi-
cal applications that require higher accuracy; however we can change these
settings in business applications that require less accuracy. Figure 12.5 and
Figure 12.6 show the conceptual framework of AOG.
AOG parameters:
• TFi: The time frame i
• Di: Input data stream during the time frame i
• I(Di): Average data rate of the input stream Di
• O(Di): Average output rate resulting from mining the stream Di
AOG operations:
• α(Di): Mining process applied to the stream Di
• β([I(D1), O(D1)], . . . , [I(Di), O(Di)]): Adaptation process of the algorithm threshold at the end of time frame i
• Ω(Oi, . . . , Ox): Knowledge integration process performed on the outputs i to x
AOG settings:
• D(TF): Time duration of each time frame
• D(Ω): Time duration between each two consecutive knowledge integration processes
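The following is a minimal sketch of how these primitives could drive a resource-aware mining loop; the multiplicative threshold update, the parameter names, and the mine_frame callback are illustrative assumptions rather than the AOG algorithm itself.

```python
# Sketch of an AOG-style control loop: after each time frame the output rate is
# measured and the algorithm threshold is adjusted so that the produced results
# fit the remaining memory before the next knowledge integration. The update
# rule and the mine_frame callback are illustrative assumptions.
def adapt_threshold(threshold, out_rate, mem_free, time_left, step=0.1):
    if out_rate * time_left > mem_free:
        return threshold * (1.0 + step)    # discourage creation of new outputs
    return threshold * (1.0 - step)        # memory is plentiful: allow finer output

def process_stream(frames, mine_frame, mem_budget, frame_duration=1.0):
    threshold, stored = 1.0, []
    for frame in frames:                                  # one time frame D_i
        outputs = mine_frame(frame, threshold)            # alpha(D_i)
        stored.extend(outputs)
        out_rate = len(outputs) / frame_duration          # O(D_i)
        threshold = adapt_threshold(threshold, out_rate,  # beta(...) adaptation
                                    mem_budget - len(stored), frame_duration)
    return stored, threshold
```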
The main idea behind our approach is to change the threshold value, which in turn changes the algorithm rate, according to the three factors listed earlier: the available memory, the remaining time to fill the available memory, and the data stream rate.
LWC is a one-pass similarity-based algorithm. The main idea behind the algo-
rithm is to incrementally add new data elements to existing clusters according
to an adaptive threshold value. If the distance between the new data point
and all existing cluster centers is greater than the current threshold value,
then create a new cluster. Figure 12.7 shows the algorithm.
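A minimal sketch of this one-pass clustering loop is given below, assuming one-dimensional data and cluster centres maintained as running means (both simplifying assumptions not specified in the chapter).

```python
# Sketch of the LWC loop: assign each point to the nearest centre unless every
# centre is farther away than the current threshold, in which case a new
# cluster is created. One-dimensional data and running-mean centres are
# simplifying assumptions.
def lwc(stream, threshold):
    centers, counts = [], []
    for x in stream:
        if centers:
            dists = [abs(x - c) for c in centers]
            j = min(range(len(centers)), key=lambda i: dists[i])
        if not centers or dists[j] > threshold:
            centers.append(x)                           # start a new cluster
            counts.append(1)
        else:
            counts[j] += 1                              # incremental mean update
            centers[j] += (x - centers[j]) / counts[j]
    return centers, counts
```

Raising the threshold absorbs more points into existing clusters and so creates fewer knowledge structures, which is the lever the adaptive AOG threshold adjusts.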
LWF starts by setting the number of frequent items that will be calculated
according to the available memory. This number changes over time to cope
with the high data rate. The algorithm receives the data elements one by one, allocates a counter for each new item, and increases the counter of each already registered item. If all the counters are occupied, any new item will be ignored
and the counters will be decreased by one till the algorithm reaches some
time threshold. A number of the least frequent items will be ignored and their
counters will be re-set to zero. If the new item is similar to one of the items
in memory, the counter will be increased by one. The main parameters that
can affect the algorithm accuracy are time threshold, number of calculated
frequent items and number of items that will be ignored. Their counters will
be re-set after some time threshold. Figure 12.9 shows the algorithm outline
for the LWF algorithm.
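A minimal sketch of this counting scheme follows, with a dictionary of counters and eviction every fixed number of elements as simplifying assumptions.

```python
# Sketch of the LWF counting scheme: a bounded table of counters, new items
# admitted only while free counters exist, and periodic eviction of the least
# frequent entries. The dictionary representation and fixed eviction period
# are simplifying assumptions.
def lwf(stream, max_counters, evict_every, n_evict):
    counts = {}
    for t, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1                           # registered item
        elif len(counts) < max_counters:
            counts[item] = 1                            # free counter available
        # otherwise the item is ignored until counters are freed
        if t % evict_every == 0 and counts:             # time threshold reached
            victims = sorted(counts, key=counts.get)[:n_evict]
            for v in victims:
                del counts[v]                           # reset least frequent counters
    return counts
```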
Fig. 12.12. Number of knowledge structures created with and without AOG.
12.7 RA-UDM
Having developed the theoretical model and experimental validation, we are
now implementing a resource-aware UDM system (RA-UDM) [22, 27]. In this
section, we describe the architecture, design and operation of each compo-
nent of this system. The system architecture of our approach is shown in
Figure 12.13 [27]. The detailed discussion about each component is given in
the following.
Resource-aware Component
Local resource information is a resource monitoring component which is
able to inform the system of the number of running processes in a mobile de-
vice, battery consumption status, available memory and scheduled resources.
Context-aware middleware is a component that can monitor the environ-
mental measurements such as the effective bandwidth. It can use reasoning
techniques to reason about the context attributes of the mobile device.
Resource measurements is a component that can receive the information
from the above two modules and formulate this information to be used by the
solution optimizer.
Solution optimizer is a component that determines the data mining task scenario
according to the available information about the local and context informa-
tion. The module can choose from different scenarios to achieve the UDM
process in a cost-efficient way. The following is a formalization of this task.
Table 12.3 shows the symbols used.
12.8 Conclusions
Mining data streams is in its infancy. The last two years have witnessed increasing attention to this area of research, owing to the growth of sensor networks that generate vast amounts of data streams and the increasing computational power of small devices. In this chapter, we have presented our
contribution to the field represented in three mining techniques and a general
strategy that adds resource-awareness which is a highly demanded feature in
pervasive and ubiquitous environments. AOG has proved its applicability and
efficiency.
The following open issues need to be addressed to realize the full potential
of this exciting field [18, 23]:
• Handling the continuous flow of data streams: Data items in data streams
are characterized by continuity. That dictates the design of non-stopping
management and analysis techniques that can cope with the continuous,
rapid data elements.
• Minimizing energy consumption of the mobile device [9]: The analysis
component in the UDM is local to the mobile device site. Mobile devices
face the problem of battery life-time.
• Unbounded memory requirements: Due to the continuous flow of data
streams, sensors or handheld devices have the problem of lack of sufficient
memory size to run traditional data-mining techniques.
• Transferring data mining results over a wireless network with limited band-
width: The wireless environment is characterized by unreliable connections
and limited bandwidth. If the number of mobile devices involved in a UDM
process is high, the process of sending the results back to a central site
becomes a challenging process.
• Data mining results visualization on the small screen of mobile device: The
user interface on a handheld device for visualizing data-mining results is
a challenging issue. The visualization of data mining results on a desk-
top is still a challenging process. Novel visualization techniques that take the small image size into account should be investigated.
• Modeling changes of mining results over time: Due to the continuity of data
streams, some researchers have pointed out that capturing the change of
mining results is more important in this area than the results themselves. The
research issue is how to model this change in the results.
• Interactive mining environment to satisfy user requirements: The user
should be able to change the process settings in real time. The problem is
how the mining technique can use the generated results to integrate with
the new results after the change in the settings.
• Integration between data-stream management systems and ubiquitous
data-stream mining approaches: There is a separation between the research
in querying and management of data streams and mining data streams.
The integration between the two is an important research issue that should
be addressed by the research community. The process of management and
analysis of data streams is highly correlated.
• The relationship between the proposed techniques and the needs of real-
world applications: The need for real-time analysis of data streams is determined by the application. Most of the proposed techniques don't
pay attention to real-world applications: they attempt to achieve the min-
ing task with low computational and space complexity regardless of the
applicability of such techniques. One of the interesting studies in this area
is by Keogh et al. [36], who have shown that the results of the most cited time-series subsequence clustering techniques are meaningless.
References
[1] Abadi, D., D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Er-
win, E. Galvez, M. Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer,
M. Stonebraker, N. Tatbul, Y. Xing, R. Yan and S. Zdonik, 2003: Aurora:
A data stream management system (demonstration). Proceedings of the
ACM SIGMOD International Conference on Management of Data.
[2] Aggarwal, C., J. Han, J. Wang and P. S. Yu, 2003: A framework for
clustering evolving data streams. Proceedings of 2003 International Con-
ference on Very Large Databases.
[3] — 2004: A framework for projected clustering of high dimensional
data streams. Proceedings of International Conference on Very Large
Databases.
[4] — 2004: On demand classification of data streams. Proceedings of Inter-
national Conference on Knowledge Discovery and Data Mining.
[5] Arasu, A., B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosen-
stein and J. Widom, 2003: STREAM: The Stanford stream data man-
ager demonstration description – short overview of system status and
plans. Proceedings of the ACM International Conference on Management
of Data.
[6] Babcock, B., S. Babu, M. Datar, R. Motwani and J. Widom, 2002: Models
and issues in data stream systems. Proceedings of the 21st Symposium on
Principles of Database Systems.
[7] Babcock, B., M. Datar and R. Motwani 2003: Load shedding techniques
for data stream systems (short paper). Proceedings of the Workshop on
Management and Processing of Data Streams.
[22] — 2004: A wireless data stream mining model. Proceedings of the Third
International Workshop on Wireless Information Systems, Held in con-
junction with the Sixth International Conference on Enterprise Informa-
tion Systems ICEIS Press.
[23] — 2004: Ubiquitous data stream mining, Current Research and Future
Directions Workshop Proceedings held in conjunction with the Eighth
Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[24] Garofalakis, M., J. Gehrke and R. Rastogi, 2002: Querying and mining
data streams: you only get one look (a tutorial). Proceedings of the ACM
SIGMOD international conference on Management of data.
[25] Giannella, C., J. Han, J. Pei, X. Yan and P. S. Yu, 2003: Mining frequent
patterns in data streams at multiple time granularities. H. Kargupta, A.
Joshi, K. Sivakumar and Y. Yesha (eds.), Next Generation Data Mining,
AAAI/MIT.
[26] Golab, L., and M. Ozsu, 2003: Issues in data stream management. SIGMOD Record, 32 (2), 5–14.
[27] Gaber, M. M., A. Zaslavsky and S. Krishnaswamy, 2004: A cost-efficient
model for ubiquitous data stream mining. Proceedings of the Tenth In-
ternational Conference on Information Processing and Management of
Uncertainty in Knowledge-Based Systems.
[28] — 2004: Resource-aware knowledge discovery in data streams. Proceed-
ings of First International Workshop on Knowledge Discovery in Data
Streams, to be held in conjunction with the 15th European Conference on Machine Learning and the 8th European Conference on the Principles and Practice of Knowledge Discovery in Databases.
[29] Guha, S., N. Mishra, R. Motwani and L. O’Callaghan, 2000: Clustering
data streams. Proceedings of the IEEE Annual Symposium on Founda-
tions of Computer Science.
[30] Guha, S., A. Meyerson, N. Mishra, R. Motwani and L. O’Callaghan, 2003:
Clustering data streams: Theory and practice. TKDE special issue on
clustering, 15.
[31] Henzinger, M., P. Raghavan and S. Rajagopalan, 1998: Computing on
data streams. Technical Note 1998-011, Digital Systems Research Center.
[32] Hsu, J., 2002: Data mining trends and developments: The key data mining
technologies and applications for the 21st century. Proceedings of the 19th
Annual Information Systems Education Conference.
[33] Hulten, G., L. Spencer and P. Domingos, 2001: Mining time-changing
data streams. Proceedings of the seventh ACM SIGKDD international
conference on Knowledge discovery and data mining, 97–106.
[34] Kargupta, H., R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra,
J. Dull, K. Sarkar, M. Klein, M. Vasa and D. Handy, 2004: VEDAS:
A mobile and distributed data stream mining system for real-time vehi-
cle monitoring. Proceedings of SIAM International Conference on Data
Mining.
[35] Kargupta, H., B. Park, S. Pittie, L. Liu, D. Kushraj and K. Sarkar, 2002:
MobiMine: Monitoring the stock market from a PDA. ACM SIGKDD
Explorations, 3, 2, 37–46.
[36] Keogh, E., J. Lin and W. Truppel, 2003: Clustering of time series sub-
sequences is meaningless: implications for past and future research. Pro-
ceedings of the 3rd IEEE International Conference on Data Mining.
[37] Krishnamurthy, S., S. Chandrasekaran, O. Cooper, A. Deshpande,
M. Franklin, J. Hellerstein, W. Hong, S. Madden, V. Raman, F. Reiss
and M. Shah, 2003: TelegraphCQ: An architectural status report. IEEE
Data Engineering Bulletin, 26(1).
[38] Krishnaswamy, S., S. W. Loke and A. Zaslavsky, 2000: Cost models for
heterogeneous distributed data mining. Proceedings of the 12th Interna-
tional Conference on Software Engineering and Knowledge Engineering,
31–8.
[39] Koudas, N., and D. Srivastava, 2003: Data stream query processing: A
tutorial. Presented at International Conference on Very Large Databases.
[40] Manku, G. S., and R. Motwani, 2002: Approximate frequency counts over
data streams. Proceedings of the 28th International Conference on Very
Large Databases.
[41] Muthukrishnan, S., 2003: Data streams: algorithms and applications. Pro-
ceedings of the fourteenth annual ACM-SIAM symposium on discrete al-
gorithms.
[42] Muthukrishnan, S., 2003: Seminar on processing massive data sets. Avail-
able at URL: athos.rutgers.edu/%7Emuthu/stream-seminar.html.
[43] O’Callaghan, L., N. Mishra, A. Meyerson, S. Guha and R. Motwani,
2002: Streaming-data algorithms for high-quality clustering. Proceedings
of IEEE International Conference on Data Engineering.
[44] Ordonez, C., 2003: Clustering binary data streams with k-means. Proceed-
ings of ACM SIGMOD Workshop on Research Issues on Data Mining and
Knowledge Discovery (DMKD), 10–17.
[45] Papadimitriou, S., C. Faloutsos and A. Brockwell, 2003: Adaptive, hands-
off stream mining. Proceedings of 29th International Conference on Very
Large Databases.
[46] Park, B., and H. Kargupta, 2002: Distributed data mining: Algorithms,
systems, and applications. Data Mining Handbook, Nong Ye (ed.).
[47] Srivastava, A., and J. Stroeve, 2003: Onboard detection of snow, ice,
clouds and other geophysical processes using kernel methods. Proceed-
ings of the International Conference on Machine Learning workshop on
Machine Learning Technologies for Autonomous Space Applications.
[48] Tanner, S., M. Alshayeb, E. Criswell, M. Iyer, A. McDowell, M. McEniry
and K. Regner, 2002: EVE: On-board process planning and execution.
Proceedings of Earth Science Technology Conference.
[49] Tatbul, N., U. Cetintemel, S. Zdonik, M. Cherniack and M. Stonebraker,
2003: Load shedding in a data stream manager. Proceedings of the 29th
International Conference on Very Large Data Bases.
13.1 Introduction
Data mining has been an active research area in the past decade. With the
emergence of sensor nets, the world-wide web, and other on-line data-intensive
applications, mining streaming data has become an urgent problem. Recently,
a lot of research has been performed on data-stream mining, including clus-
tering [12, 20], aggregate computation [5, 11], classifier construction [3, 15],
and frequent counts computation [18]. However, a lot of issues still need to
be explored to ensure that high-speed, nonstatic streams can be mined in
real-time and at a reasonable cost.
Let’s examine some application areas that pose a demand for real-time
classification of nonstatic streaming data:
There are many kinds of classifiers, such as decision trees, Bayesian classifiers, support vector machines, neural networks, and so on. In many stud-
ies, researchers have found that each classifier has advantages for certain types
of data sets. Among these classifiers, some, such as neural networks and sup-
port vector machines, are obviously not good candidates for single-scan, very
fast model reconstruction while handling the huge amount of data streams.
In previous studies on the classification of streaming data, decision trees have been the popular first choice, owing to their simplicity and ease of interpretation [3, 13, 15]. However, it is difficult to dynamically and
drastically change decision trees due to the costly reconstruction once they
have been built. In many real applications, dynamic changes in stream data
could be normal, such as in stock market analysis, traffic or weather modeling,
and so on. In addition, a large amount of raw data is needed to build a decision
tree. According to the model proposed in [15], the raw records have to be kept in memory
or on disk since they may be used later for updating the statistics when
old records leave the window and for reconstructing parts of the tree. If the
concept drifts very often, the related data needs to be scanned multiple times
so that the decision tree can be kept updated. This is usually unaffordable
for streaming data. Also, after detecting the drift in the model, it may take
a long time to accumulate sufficient data to build an accurate decision tree
[15]. Any drift taking place during that period either cannot be caught or will
make the tree unstable. In addition, the method presented in [15] works only for peer prediction, but not for future prediction.
Based on the above analysis, we do not use the decision tree model, in-
stead we choose the naı̈ve Bayesian classifier scheme because it is easy to
construct and adapt. The naı̈ve Bayesian classifier, in essence, maintains a set
of probability distributions P (ai |v) where ai and v are the attribute value and
the class label, respectively. To classify a record with several attribute values,
it is assumed that the conditional probability distributions of these values
are independent of each other. Thus, one can simply multiply the conditional
probabilities together and label the record with the class label of the greatest
probability. Despite its simplicity, the accuracy of the naı̈ve Bayesian classifier
is comparable to other classifiers such as decision trees [4, 19].
The characteristics of the stream may change at any moment. Table 13.1
illustrates an example of a credit card pre-approval database, constructed by
a target marketing department in a credit card company. Suppose it is used
to trace the customers to whom the company sent credit card pre-approval
packages and the applications received from the customers. In the first portion
of the stream, client 1578 is sent a pre-approval package. However, in the
second portion of the stream, client 7887 has similar attribute values, but is
not delivered such a package due to a change in the economic situation.
The above example shows that it is critical to detect the changes in the
classifier and construct a new classifier in a timely manner to reflect the
changes in the data. Furthermore, it is nice to know which attribute is dom-
inant for such a change. Notice that almost all the classifiers require a good
amount of data to build. If the data for constructing a classifier is insufficient,
the accuracy of the classifier may degrade significantly. On the other hand, it
is impractical to keep all the data in memory especially when the arrival rate
of the data is high, e.g., in network monitoring. As a result, we have to keep
only a small amount of summarized data. The naı̈ve Bayesian classifier can
work for this scenario nicely, where the summarized data structure is just the
occurrence frequency of each attribute value for every given class label.
Since the change of underlying processes may occur at any time, the stream
can be partitioned into disjoint windows. Each window contains a portion of
the stream. The summarized data (occurrence frequency) of each window is
computed and stored. When the stream is very long, even the summarized
data may not be able to fit in the main memory. With a larger window size,
the memory can store the summarized data for a larger portion of the stream.
However, this can make the summarized data too coarse. During the process
of constructing a new classifier, we may not be able to recover much useful
information from the coarse summarized data. To overcome this difficulty, a
tilted window [2] is employed for summarizing the data. In the tilted window
scheme, the most recent window contains the finest frequency counts. The
window size increases exponentially for older data. This design is based on the
observation that more recent data is usually more important. With this tilted
window scheme, the summarized counts for a large portion of the stream can fit
in memory. During the construction of the classifier, more recent information
can be obtained, and the classifier can be updated accordingly.
Based on the above observation, an evolutionary stream data classification
method is developed in this study, with the following contributions:
• The proposal of a model for the construction of an evolutionary classifier
(e.g., naı̈ve Bayesian) over streaming data.
• A novel two-fold algorithm, EvoClass, is developed with the following fea-
tures:
– A test-and-update technique is employed to detect the changes of con-
ditional probability distributions of the naı̈ve Bayesian.
– The naı̈ve Bayesian is adaptive to new data by continuous refinement.
– A tilted window is utilized to partition the data so that more detailed
information is maintained for more recent data.
13.2 Related Work
Querying and mining streaming data has raised great interest in the database
community. An overview of the current state of the art of stream data man-
agement systems, stream query processing, and stream data mining can be
found in [1, 9]. Here, we briefly introduce the major work on streaming data
classification.
Building classifiers on streaming data has been studied in [3, 15], with
decision trees as the classification approach. In [3], it is assumed that the
data is generated by a static Markov process. As a result, each portion of the
stream can be viewed as a sample of the same underlying process, which may
not handle well dynamically evolving data. A new decision-tree construction algorithm, VFDT, is proposed. The first portion (window) of the stream data
is used to determine the root node. The second portion (window) of the stream
data is used to build the second node of the tree, and so on. The window
size is determined by the desired accuracy. The higher the accuracy desired,
the more data in a window. According to [3], this method can achieve a higher
degree of accuracy and it outperforms some other decision-tree construction
methods, such as C4.5.
The algorithm proposed in [15], CVFDT, relaxed the assumption of static
classification modeling in VFDT. It allows concept drift, which means the
underlying classification model may change over time. CVFDT keeps its un-
derlying model consistent with the ongoing data. When the concept in the
streaming data changes, CVFDT can adaptively change the decision tree by
growing alternative subtrees in questionable portions of the old tree. When
the accuracy of the alternative subtree outperforms the old subtree, the old
one will be replaced with the new one. CVFDT achieves better performance
than VFDT because of its fitness to changing data. However, because of the
EvoClass avoids the above problems using a tilted window scenario and
naı̈ve Bayesian classifier. Since naı̈ve Bayesian needs only the summary in-
formation of records, EvoClass does not need to store data records; it is a truly one-scan algorithm. EvoClass can refine the minimum window size to
small granularity without much loss of efficiency. Thus EvoClass can catch
high frequency significant concept drifts. Furthermore, the additive property
of the naı̈ve Bayesian classifier makes the merging of two probability distribu-
tions simple and robust. Finally, the cost per record for EvoClass is O(|V||A|), which is much cheaper than that for CVFDT, O(d·Nt·|V|·Σ_{∀j}|Aj|) [15] (where d is the maximum depth of the decision tree and Nt is the number of alternate trees; the notation is introduced in the next section).
For a given record a1 , a2 , . . . , an , we compute the probability for all vi . The
class label of this record is that vj (for some j where 1 ≤ j ≤ k) which yields
the maximum probability in Equation (13.1). The number of conditional prob-
abilities that need to be stored is |A1 | × |A2 | × · · · × |Am | × |V | where |Ai | is
the number of distinct values in the ith attribute. If there are 10 attributes
and 100 distinct values for each attribute and 10 class labels, there will be 100^10 × 10 conditional probabilities to be computed, which is prohibitively
expensive. On the other hand, the naı̈ve Bayesian classifier assumes the inde-
pendence of each variable, i.e., P (ai , aj |v) = P (ai |v) × P (aj |v). In this case,
Equation (13.1) can be simplified to Equation (13.2). Then we need only track Σ_{∀i} |Ai| × |V| probabilities. In the previous example, we only need to track 10,000 probabilities, a manageable task. The set of conditional probabilities that can be learned from the data seen so far is P(ai|v) = P(ai, v)/P(v), where
P (ai , v) is the joint probability distribution of attribute value ai and class
label v, and P (v) is the probability distribution of the class label.
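A minimal sketch of such a count-based naïve Bayesian classifier is shown below; the class interface and the add-one smoothing used to avoid zero probabilities are added assumptions, not the chapter's exact design.

```python
# Sketch of a count-based naive Bayesian classifier: only c(a_i ∩ v) and c(v)
# are stored. The class interface and the add-one smoothing are assumptions.
from collections import defaultdict

class CountingNB:
    def __init__(self):
        self.joint = defaultdict(int)    # c(attribute index, value, class label)
        self.label = defaultdict(int)    # c(class label)

    def update(self, record, v):
        self.label[v] += 1
        for i, a in enumerate(record):
            self.joint[(i, a, v)] += 1

    def classify(self, record):
        total = sum(self.label.values())
        best, best_p = None, -1.0
        for v, cv in self.label.items():
            p = cv / total                                           # P(v)
            for i, a in enumerate(record):
                p *= (self.joint.get((i, a, v), 0) + 1) / (cv + 2)   # smoothed P(a_i|v)
            if p > best_p:
                best, best_p = v, p
        return best
```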
Classifier Evolution
The problem is to catch the concept drifts and identify them. For discovery of
the evolution of a classifier, one needs to keep track of the changes of the data
or conditions closely related to the classifier [8]. The naı̈ve Bayesian classifier
captures the probability distributions of attribute values and class labels, and
thus becomes a good candidate for the task. It is important to capture and
measure the difference between two probability distributions. There exist some
methods which assess the difference between two probability distributions,
among which the variational distance and the Kullback-Leibler divergence are
the most popular ones [16].
• Variational Distance: Given two probability distributions, P1 and P2, of the variable σ, the variational distance is defined as V(P1, P2) = Σ_{σ∈Ω} |P1(σ) − P2(σ)|.
• Kullback-Leibler Divergence: The Kullback-Leibler divergence is one of the well-known divergence measures rooted in information theory. There are two popular versions of the Kullback-Leibler divergence. The asymmetric measure (sometimes referred to as the I-directed divergence) is defined as
I(P1, P2) = Σ_{σ∈Ω} P1(σ) log (P1(σ) / P2(σ)).
Since the I-divergence does not satisfy the metric properties, its sym-
metrized measure, J-divergence, is often used to serve as a distance mea-
sure.
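A small sketch of both measures for discrete distributions stored as dictionaries follows; the epsilon guard against zero probabilities is an added assumption.

```python
# Sketch of the two change measures for discrete distributions held as
# dictionaries; the epsilon guard against zero probabilities is an assumption.
import math

def variational_distance(p1, p2):
    support = set(p1) | set(p2)
    return sum(abs(p1.get(s, 0.0) - p2.get(s, 0.0)) for s in support)

def j_divergence(p1, p2, eps=1e-12):
    support = set(p1) | set(p2)
    def i_div(a, b):   # asymmetric I-directed divergence
        return sum(a.get(s, eps) * math.log(a.get(s, eps) / b.get(s, eps))
                   for s in support)
    return i_div(p1, p2) + i_div(p2, p1)    # symmetrized J-divergence
```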
13.4 Approach of EvoClass
In this section, we present our approach for classifying the stream data, which may evolve over time. We first present a high level
overview of our approach and then give the detailed description of each com-
ponent in the algorithm.
13.4.1 Overview
As mentioned previously, the naı̈ve Bayesian classifier is chosen for its efficient
construction, incremental update, and high accuracy. Since data may arrive at
a high rate and the overall data stream can be very large, it is expected
that the computer system cannot store the complete set of data in the main
memory, especially for sensor nets. As a result, only part of the raw data and
some summarized data may be stored. Most of the raw data is only processed
once and discarded. Thus, one needs to know the count of the number of
records in which the value ai and the class label v occurred together.
The stream is partitioned into a set of disjoint windows, each of which
consists of a portion of the stream. The coming data is continuously used
to test the classifier to see whether the classifier is still sufficiently accurate.
Once the data in a window is full, the counts of the occurrences of all distinct
ai ∩ v are computed. After computing the counts, the raw data of the stream
can be discarded. These counts are used to train the classifier, i.e., to update
the probability distributions.
There are two cases to be considered. First, if the accuracy of the classifier
degrades significantly, one needs to discard the old classifier and build a new
one. On many occasions, the changes in the classifier are also interesting to the
users because based on the changes, they may know what occurred in the data.
Therefore, the major changes in the classifier will be reported. The procedure
is depicted in Figure 13.1. Second, when the probability distribution does not
change for a long time, there may be a significant amount of information
accumulated on the counts. In this case, some of the windows will need to be
combined to reduce the amount of information.
In the following subsections, we will present the details of each step.
The size of the window is a critical factor that may influence the classification
quality. The probability distribution is updated when the accumulated data
has filled a window. When the window size is small, the evidence in a window
may be also small, and the induced probability distribution could be inac-
curate, which may lead to a low-quality naı̈ve Bayesian classifier. However,
when the window size is too large, it will be slow in detecting the change of
probability distribution, and the classifier may not be able to reflect the true
state of the current stream.
The summary information of a window includes the number of occurrences
of each distinct pair aj ∩ v, the number of occurrences of each v, and the number of records in the window. There are in total |V| × Σ_{∀j} |Aj| counts (for
all distinct v ∩ aj), where |V| and |Aj| are, respectively, the number of class labels and the number of distinct values for attribute Aj. As a result, the number of counts for summarizing a window is |V| × Σ_{∀j} |Aj| + |V| + 1. First, let us assume that each count can be represented by an integer which consumes four bytes. Then the total number of windows (summary information), Nw, that can fit in the allocated memory is Nw = M / (4 × (|V| × Σ_{∀j} |Aj| + |V| + 1)), where M is the size of the allocated memory. Now the problem becomes how to partition the stream into Nw windows.
First, we want to know the minimum window size, wmin . Let us assume
that each record has |A| attributes. There are overall |V| × Σ_{1≤j≤|A|} |Aj| counts that need to be tracked for the purpose of computing conditional probabilities. Each record can update |A| counts. The minimum window size is set to wmin = q × (|V| × Σ_{1≤j≤|A|} |Aj|) / |A|, where q is a small number. In Section 13.5, we exper-
iment with various wmin . We found that with large wmin , the accuracy is low
and the delay of evolution detection may be large. This is because the change
of the data characteristics may take place at any time but the construction
of a new classifier is done only at the end of a window. On the other hand,
although a smaller wmin can improve the accuracy, the average per-record processing time is prolonged. After a window is full, we need to update the classifier. Since the
cost of classifier update is the same regardless of window size, the per record
cost of classifier updating can be large with a small wmin . In Section 13.5, we
will discuss how to decide wmin empirically.
To approximate the exponential window growth, we use the following al-
gorithm. When the summary information can fit in the allocated memory,
we keep the size of each window as wmin . Once the memory is full, some
windows may have to be merged, and the newly freed space can be utilized
for the summary data of a new window. We choose to merge the consecutive
windows with the smallest growth ratio, i.e., windows wi−1 and wi for which |wi| / |wi−1| is the
smallest ratio. The rationale behind this choice is that we want the growth of
the window size to be as smooth as possible. If there exists a tie, we choose the
oldest windows to merge because recent windows contain more updated infor-
mation than older ones. Figure 13.2 shows the process of window merging. At
the beginning, there are four windows, each of which contains a record. For
illustration, we assume the memory can only store the summary data for four
windows (in Figure 13.2a). When a new window of data arrives, some windows
have to be merged. Since the ratio between any two consecutive windows is
the same, the earliest two windows are merged (as shown in Figure 13.2b). As
a result, the size ratios between windows 3 and 4 and between windows 2 and 3 are 1, while the size ratio between windows 1 and 2 is 2. Thus, windows 2 and 3 are
merged when the new data is put in window 4 as illustrated in Figure 13.2c.
Fig. 13.2. The process of window merging ((a)–(c)).
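A minimal sketch of this merging rule is given below, representing each window as a (size, counts) pair; the data structures are illustrative, not the chapter's implementation.

```python
# Sketch of the merging rule: when more than max_windows summaries are held,
# merge the consecutive pair (w_{i-1}, w_i) with the smallest ratio
# |w_i| / |w_{i-1}|, breaking ties in favour of the oldest pair. Representing a
# window as a (size, counts) pair is a simplifying assumption.
from collections import Counter

def add_window(windows, new_window, max_windows):
    windows.append(new_window)
    while len(windows) > max_windows:
        ratios = [windows[i][0] / windows[i - 1][0] for i in range(1, len(windows))]
        i = min(range(len(ratios)), key=lambda k: ratios[k]) + 1   # oldest pair on ties
        size_a, counts_a = windows[i - 1]
        size_b, counts_b = windows.pop(i)
        windows[i - 1] = (size_a + size_b, Counter(counts_a) + Counter(counts_b))
    return windows
```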
After the merge of two existing windows, some space is freed to store the
new data. Once wmin new records have been obtained, the counts for the
new window of data are calculated. For instance, assuming that the window
consists of the first four records in Table 13.1, Table 13.2 shows the summary
counts after processing the window of data. This structure is similar to AVC-
Set (Attribute–Value–ClassLabel) in [7].
Table 13.2. Counts after processing first four records in Table 13.1.
                                    Pre-Approval
Attribute        Value              Yes    No
Age              25–29               0      1
                 30–34               1      0
                 35–39               1      0
                 40–44               1      0
                 45–49               0      0
Salary           20k–25k             0      0
                 25k–30k             1      1
                 30k–35k             1      1
                 35k–40k             0      0
                 40k–45k             0      0
                 45k–50k             0      0
Credit History   Good                2      0
                 Bad                 1      1
The classifier is updated once the current window is full and there is no significant increase in error (see Section 13.4.4). Let's assume that we have
a naı̈ve Bayesian classifier, i.e., a set of probability distribution P (ai |v) and
P (v) and a set of new counts c(ai ∩ v) and c(v) where c(ai ∩ v) and c(v)
are the number of records having Ai = ai with class label v and the number
of records having class label v in the new window, respectively. If there is no
prior knowledge about the probability distribution, we can assume the uniform
prior distribution which yields the largest entropy, i.e., uncertainty. Based on
the current window, we can obtain the probability distribution within the
window: Pcur(ai|v) = c(ai ∩ v) / c(v). For example, based on the data in Table 13.2, Pcur(25k–30k|yes) = 1/2 = 0.5. Next we need to merge the current and the prior
probability distributions. Let’s assume that the overall number of records in
the current window is w, and the number of records for building the classifier
before this window is s. The updated probability distribution is a weighted combination of the prior distribution and Pcur, where µ = s/w if the importance of the current and past records is equal. µ
can be used to control the weight of the windows. For example, in the fading
model, i.e., the recent data can reflect the trend much better than the past
data, µ can be less than s/w, even equal to 0.
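A minimal sketch of the merge step follows, assuming the update is the convex combination P ← (µ·Pprior + Pcur)/(1 + µ) with µ = s/w when past and current records are equally important; this exact form is an assumption reconstructed from the surrounding text, not a formula quoted from it.

```python
# Sketch of merging the prior and current-window distributions, assuming the
# update P <- (mu * P_prior + P_cur) / (1 + mu) with mu = s / w when past and
# current records are equally important; this form is an assumption.
def update_distribution(p_prior, p_cur, s, w, mu=None):
    if mu is None:
        mu = s / w if w else 0.0           # equal importance of every record
    keys = set(p_prior) | set(p_cur)
    return {k: (mu * p_prior.get(k, 0.0) + p_cur.get(k, 0.0)) / (1.0 + mu)
            for k in keys}
```

Setting mu below s/w (down to 0) reproduces the fading behaviour described above, where recent data dominates the merged distribution.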
13.4.5 Under-representing
The processing cost consists of four parts: accumulating records, testing the classifier, updating the classifier, and rebuilding a new classifier. Assume the minimum window size is wmin, the number of testing records in each window is T, and the number of windows maintained in memory is Nw. We have:
• Accumulating records: The cost to count each record (updating c(ai ∩ v)
and c(v)) is O(|A|).
• Testing the classifier: The cost is O(T |V ||A|) per window.
• Updating the classifier: The cost to merge two windows is O(|V| Σ_{∀j} |Aj|). The cost to update the current probability distribution is O(|V| Σ_{∀j} |Aj|).
• Rebuilding a new classifier: Since each time we have to scan the history windows and build a best naïve Bayesian classifier based on the testing records, the cost is O(Nw |V| Σ_{∀j} |Aj| + Nw T |V||A|).
We set wmin = q · (|V| Σ_{∀j} |Aj|) / |A| and T = σ · wmin, where usually q > 1 and
σ < 1. Based on the above analysis, we can calculate the lower bound and
the upper bound of the amortized cost for processing one single record. To
calculate the lower bound cost, one extreme case is that the concept does not
change at all over time. Then it will not rebuild any classifiers except the
initial one. Therefore, the lower bound for the total cost per record is
O(|A| + 2|V| Σ_{∀j} |Aj| / wmin) ∼ O(|A|).
For the upper bound, the worst case is that the concept changes dramatically
in each window. The cost is
O(|A| + (2|V| Σ_{∀j} |Aj| + Nw |V| Σ_{∀j} |Aj| + Nw T |V||A|) / wmin),
which can be simplified to O(|V ||A|) if q is larger than Nw , and σNw is a
small constant. Therefore, the amortized upper bound of processing cost for
each record is O(|V ||A|), which is equal to the cost of classifying one record.
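As a quick sanity check on these bounds, the small helper below simply evaluates the two expressions for a given parameter setting; the concrete values in the example call are illustrative only.

    def amortized_cost_bounds(A, V, sum_Aj, w_min, N_w, T):
        """Per-record amortized cost (in unit operations) for the lower and
           upper bounds above: no drift at all vs. drift in every window."""
        lower = A + 2 * V * sum_Aj / w_min
        upper = A + (2 * V * sum_Aj + N_w * V * sum_Aj + N_w * T * V * A) / w_min
        return lower, upper

    # Hypothetical setting: 30 attributes, 2 class labels, 180 attribute values
    # in total, w_min = 12,000, 32 windows, 1,200 test records per window.
    print(amortized_cost_bounds(30, 2, 180, 12_000, 32, 1_200))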
The synthetic data are generated from a hyperplane in a d-dimensional space,
w1 a1 + w2 a2 + · · · + wd ad = w0,   (13.5)
where ai is the coordinate of the ith dimension. We can treat the vector
a1, a2, ..., ad as a data record, where ai is the value of attribute Ai. The class
label v of a record is determined by the following rule: if Σi wi ai > w0, the
record is assigned the positive label; otherwise (i.e., if Σi wi ai ≤ w0), it is
assigned the negative label. By randomly assigning the value of ai in a record,
an infinite number of data records can be generated in this way. One can
regard wi as the weight of Ai . The larger wi is, the more dominant is the
attribute Ai. Therefore, by rotating the hyperplane to some degree through
changing the magnitude of the wi, the distribution of the class label v given
a1, a2, . . . , ad changes, which is to say the underlying concept drifts. This also
means that some records are relabelled according to the new concept. In our
experiments, we set w0 to 0.1d and restrict the value of each ai to [0.0, 1.0].
We change the value of wi gradually, by +0.01d or −0.01d at a time. After wi
reaches either 0.1d or 0.0, it changes in the opposite direction.
While generating the synthetic data, we also inject noise. With probability
pnoise, a record is assigned an arbitrary class label. pnoise is randomly selected
from [0, Pnoise,max] each time the concept drifts. We do not use a fixed
noise-injection probability, as is done in [15], because we want to test the
robustness and sensitivity of our algorithm. Concept drift arising from small
changes of wi cannot be detected, since the drift and the noise are not
distinguishable. The average noise probability is around Pnoise,max/2 for the
synthetic data set. Because there are only two class labels in the data set,
with 50% probability (assuming Ppositive = Pnegative = 0.5) the injected noise
produces a wrong class label. Therefore, the error caused by the noise is around
Pnoise,max/4 on average; for example, with Pnoise,max = 5% the background error
is about 1.25%. This is a background error that cannot be removed by any
classification algorithm. We denote pne = Pnoise,max/4. Table 13.3 collects the
parameters used in the synthetic data sets and our experiments.
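The following sketch generates such a stream (hyperplane labelling, gradual weight drift, and per-drift noise injection); the function and parameter names are illustrative, and only the first k weights are drifted, matching the later experiments.

    import random

    def synthetic_stream(d, k=1, p_noise_max=0.05, drift_every=400_000):
        """Moving-hyperplane data stream as described above (names are illustrative)."""
        w0 = 0.1 * d
        w = [random.uniform(0.0, 0.1 * d) for _ in range(d)]   # hyperplane weights
        step = [0.01 * d] * d                                   # per-weight drift direction
        p_noise = random.uniform(0.0, p_noise_max)              # noise level for this concept
        n = 0
        while True:
            if n and n % drift_every == 0:                      # concept drift point
                for i in range(k):                              # drift w_1, ..., w_k
                    w[i] += step[i]
                    if w[i] >= 0.1 * d or w[i] <= 0.0:          # bounce between 0 and 0.1d
                        w[i] = min(max(w[i], 0.0), 0.1 * d)
                        step[i] = -step[i]
                p_noise = random.uniform(0.0, p_noise_max)      # new noise level per drift
            a = [random.random() for _ in range(d)]             # attribute values in [0.0, 1.0]
            label = int(sum(wi * ai for wi, ai in zip(w, a)) > w0)
            if random.random() < p_noise:                       # noisy record: arbitrary label
                label = random.randint(0, 1)
            n += 1
            yield a, label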
13.5.2 Accuracy
The first two experiments show how quickly our algorithm can respond to the
underlying concept changes by checking the classifier error of EvoClass after
a concept drift.
We assume that the Bayesian classifier has an error rate pb, meaning that
without injected noise or concept drift, given a synthetic data set as described
above, it can achieve an accuracy of 1 − pb. Provided the noise does not change
pb significantly as long as the noise level is moderate (which is justified in
Section 13.5.3), the average error rate is perror = pb + pne. After a concept
change we rebuild the classifier, and we want to see how quickly its error rate
returns to this level. The drift level [15], pde, is the error rate obtained if we
still use the old concept Cold (before a drift) to label the new data (after that
drift). The classifier error is therefore expected to evolve from perror + pde
down to a value close to perror; the question is how fast this transition takes
place.
In this experiment, we set |A| = 30, C = 8, N = 4,800,000, Nw = 32,
wmin = 12,000, fc = 400,000, and pnoise,max = 5%. Figure 13.3a shows three
kinds of errors: the error from concept drift (the percentage of records that
change their labels at each concept drift point), the error from our EvoClass
algorithm without any concept drift, and the error from our EvoClass algo-
rithm with concept drift. It illustrates that the EvoClass algorithm can respond
to concept drift very quickly. The very start of Figure 13.3a shows that when a
huge drift happens (> 10% of the records change their labels), the error of
EvoClass spikes briefly and the classifier then quickly adapts to the new concept.
For the small concept drifts taking place in the middle of the figure, EvoClass
takes much longer to absorb the drift, because it is more difficult to separate
a small concept drift from noise in the middle of a stream. Furthermore, since
the ε-error tolerance (given by Equation (13.4)) in this experiment is 0.034,
EvoClass oscillates around its average classifier error.
Figure 13.3b depicts the result of another experiment in which fc is set to
20,000, which means the concept drifts 20 times faster than in the first
experiment. Again the curves show that the classifier error rate can follow the
concept drift.
Next we want to test the model described in Section 13.3, which identifies
the set of attributes causing the concept drift. Here we use the Kullback-Leibler
divergence to rank the top-k greatest changes (Equation (13.3)) discovered in
the distributions P(ai|v). We vary the values of w1, w2, . . . , wk simultaneously
for each concept drift and then calculate the average recall and precision.
Figure 13.4 shows the recall and precision of the top-k divergence list when
k is between 1 and 5. The overall recall and precision are around 50–60%.
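Equation (13.3) is not reproduced in this excerpt, so the exact scoring is an assumption; one simple version of the ranking sums, over the class labels, the Kullback-Leibler divergence between the old and new conditional distributions of each attribute and reports the k largest scores:

    import math

    def kl_divergence(p, q, eps=1e-9):
        """KL(p || q) for discrete distributions given as dicts value -> probability."""
        return sum(pv * math.log((pv + eps) / (q.get(v, 0.0) + eps))
                   for v, pv in p.items() if pv > 0.0)

    def top_k_drifting_attributes(p_old, p_new, k):
        """p_old, p_new: {attribute: {label: {value: probability}}} for P(a_i | v).
           Returns the k attributes whose conditional distributions changed the most."""
        score = {attr: sum(kl_divergence(p_new[attr][v], p_old[attr][v])
                           for v in p_new[attr])
                 for attr in p_new}
        return sorted(score, key=score.get, reverse=True)[:k]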
Fig. 13.3. Accuracy over time when the concept drifts over a) 400,000 records and
b) 20,000 records. Both panels plot the classifier error against the number of records
over time (×1M) and show the error due to concept drift, the classifier error without
concept drift (raw and smoothed), and the classifier error with concept drift.
13.5.3 Sensitivity
Sensitivity measures how fluctuations in the noise level may influence the
quality of the classifier. It involves the percentage of noise in the data set,
the concept drift frequency, and the minimum window size. We
use the same experimental setting mentioned previously.
Fig. 13.4. Recall and precision of the top-k divergence list versus the number of
changing attributes (k from 1 to 5).
Figure 13.5a shows the relationship between the noise rate and the classifier
error rate. The error rate of the classifier increases roughly in proportion to the
noise. The dotted line shows pne caused by the average noise rate. The formula
perror = pb + pne holds very well based on the results in Figure 13.5a. This
means that the performance of our EvoClass algorithm does not degrade beyond
the unavoidable background error pne, even when a large amount of noise is
present in the data set.
We next conduct an experiment to see the influence of the concept drift
frequency on the classifier error rate. We set the minimum window size wmin
to 10k and then vary the concept drift frequency fc from 200 to 100k. Fig-
ure 13.5b shows that the classifier cannot update its underlying structure to
fit the new concept if fc is below 10k. This is because our minimum processing
unit is 10k, and the classifier cannot keep up with changes that occur faster
than that minimum processing unit.
In Figure 13.5c, we vary the minimum window size from 100 to 40k and fix
the concept drift frequency to 20k. It shows that when the minimum window
size is below 10k, the error rate will be in the range [popt , 1.1popt ], where popt
is the best error rate achieved in this series of experiments. When the concept
does not drift very frequently, the minimum window size can be selected freely
in a very large range and EvoClass can still achieve a nearly optimal result.
When wmin is close to 100, the error rate increases steadily because of over-
fitting. Figure 13.5c also shows the processing time for a varying minimum
window size: generally it will take a longer time to complete the task if we
choose a smaller minimum window size.
We then check the performance of EvoClass when the total available
number of windows, Nw , varies. We have the following experiment settings:
|A| = 30, C = 32, N = 480,000, wmin = 400, and fc = 400. We intentionally
change w0 to 0.001 and pnoise,max to 0.10 such that, over a long period, the
concept drift cannot be distinguished from the noise.
Fig. 13.5. a) Classifier error versus the average noise rate (total error and the
estimated error caused by noise); b) classifier error versus the concept drift
frequency (×1k, window size = 12,000); c) classifier error and total runtime versus
the minimum window size.
With the increase of cardinality (or
number of attributes), one window is not enough to build an accurate clas-
sifier. Increasing the minimum window size does not help because of the small
concept drift frequency. The tilted window scheme performs well in this case.
The result is depicted in Figure 13.6. As we can see, maintaining only one
window results in a significant increase in error compared with maintaining
multiple windows.
Fig. 13.6. Classifier error and total runtime versus the number of windows.
13.5.4 Scalability
Fig. 13.7. a) Classifier error and processing time versus the number of attributes;
b) classifier error, classifier construction time, and total run time versus attribute
cardinality.
Figure 13.7a also illustrates the computational time in different parts of the
EvoClass algorithm. We roughly divide the processing time into two parts:
classifier construction time, which includes classifier initialization, change de-
tection, testing, and classifier updating; and the classification time, which
includes the time to predict the class label of each record when it arrives. The
experiments show that the first part occupies 1/6 to 1/2 of the total process-
ing time. This ratio can be further reduced if the concept changes slow down
or the minimum window size is enlarged. We also measure the number of records
that can be processed per second. For a 200-attribute data set, the processing
speed is around 20,000 records per second; for a 10-attribute data set, it reaches
about 100,000 records per second. Since our
implementation is based on C++/STL, we believe that it can be further im-
proved using a C implementation and a more compact data structure.
13.6 Discussion
We first discuss why we chose the naive Bayesian classifier as the base classi-
fier for streaming data and then consider other issues for improvements and
extensions of EvoClass.
In this subsection, we discuss a few related issues, including choosing the
window size, handling high-speed data streams, window weighting, and
alternative classifiers.
Window Size
In the previous section, we mentioned that a window has a minimum size,
wmin = q × (|V| × Σ1≤j≤|A| |Aj|) / |A|. When the cardinality of the attributes is large, the mini-
mum number of records in a window can also be quite large. Certainly, we can
arbitrarily reduce the minimum size, in the extreme to 1, but a smaller window
size means updating the classifier more frequently, which degrades performance
considerably. A user can therefore choose the window size based on the trade-off
between processing speed and the data arrival rate. Once the size of the
minimum window is fixed, EvoClass may need to wait until a window is full
before it can process the data. This may lead to a longer delay in detecting the
evolution of the classifier. To solve this problem, we test the classifier while
the data are accumulating. Let accuracy1 be the best accuracy of the classifier
over all previous windows. A change is detected if the accuracy falls below
accuracy1 − 2ε. This means that we will detect a change if more than
wmin × (1 + 2ε − accuracy1) records are misclassified in the current window.
Thus, we keep track of the number of misclassified records, and if it exceeds
this threshold, a change is detected and we immediately build a new classifier.
Under this scheme, the new classifier can be built much earlier.
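In code, this early-detection rule amounts to tracking a single counter against a fixed threshold; the sketch below assumes the ε from Equation (13.4) and the best previous accuracy accuracy1 are available, and the example values are hypothetical except for wmin = 12,000.

    def misclassification_threshold(w_min, best_accuracy, eps):
        # A change is detected once more than w_min * (1 + 2*eps - best_accuracy)
        # records of the current (still filling) window have been misclassified.
        return w_min * (1.0 + 2.0 * eps - best_accuracy)

    def change_detected(num_misclassified, w_min, best_accuracy, eps):
        return num_misclassified > misclassification_threshold(w_min, best_accuracy, eps)

    # e.g. w_min = 12,000, best previous accuracy 0.90, eps = 0.02:
    # threshold = 12,000 * (1 + 0.04 - 0.90) = 1,680 misclassified records.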
Handling High-Speed Data Streams
When the data arrival rate is extremely high, our algorithm may not be able to
process the data in time; more and more data would have to be buffered and,
over time, the system would become unstable. To solve this problem, we propose
a sampling method. Assume that while one window of data (wmin records) is
being processed, wnew new records arrive. If wnew ≤ wmin, we are able to process
all of the new data. Otherwise, we can only process a fraction of it: among the
wnew new records, we draw a random sample of wmin records, each record being
chosen with probability wmin/wnew. The records that are not chosen are simply
not processed.
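A minimal version of this thinning step, assuming the records that arrived while the previous window was processed have been buffered, is sketched below; random.sample draws wmin of the wnew records uniformly, so each record is kept with probability wmin/wnew.

    import random

    def thin_buffer(buffered_records, w_min):
        """Keep at most w_min of the buffered records; the rest are not processed."""
        if len(buffered_records) <= w_min:
            return buffered_records
        return random.sample(buffered_records, w_min)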
Window Weighting
In EvoClass, µ is the parameter that controls the weight that a new window
carries, and it can easily be adjusted to the needs of different users. µ should
be set to a smaller value if a user believes that the current data is a better
indicator for the classifier. In the extreme case, we can set µ = 0 when a user
wants a classifier built solely on the current window. On the other hand, if the
user thinks that each record contributes equally to the classifier, we should set
µ = w/s, where s and w are the numbers of records in the previous windows and
the current window, respectively.
Alternative Classifiers
13.7 Conclusions
We have investigated the major issues in classifying large-volume, high-speed
and dynamically evolving streaming data, and proposed a novel approach,
EvoClass, which integrates the naïve Bayesian classification method with the
tilted window, boosting, and several other optimization techniques, and achieves
high accuracy, high adaptivity, and low construction cost.
Compared with other classification methods, the EvoClass approach offers
several distinct features.
References
[1] Babcock, B., S. Babu, M. Datar, R. Motwani and J. Widom, 2002: Models
and issues in data stream systems. In Proceedings of ACM Symp. on
Principles of Database Systems, 1–16.
[2] Chen, Y., G. Dong, J. Han, B. W. Wah and J. Wang, 2002: Multidimen-
sional regression analysis of time-series data streams. In Proceedings of
International Conference on Very Large Databases.
[3] Domingos, P., and G. Hulten, 2000: Mining high-speed data streams. Pro-
ceedings of ACM Conference on Knowledge Discovery and Data Mining,
71–80.
[4] Domingos, P., and M. J. Pazzani, 1997: On the optimality of the simple
bayesian classifier under zero-one loss. Machine Learning, 29, no. 2–3,
103–30.
[5] Dobra, A., M. N. Garofalakis, J. Gehrke and R. Rastogi, 2002: Process-
ing complex aggregate queries over data streams. In Proceedings of ACM
Conference on Management of Data, 61–72.
[6] Duda, R., P. E. Hart and D. G. Stork, 2000: Pattern Classification. Wiley-
Interscience.
[7] Gehrke, J., R. Ramakrishnan and V. Ganti, 1998: RainForest: A framework
for fast decision tree construction of large datasets. In Proceedings of
International Conference on Very Large Databases, 416–27.
[8] Ganti, V., J. Gehrke, R. Ramakrishnan and W. Loh, 1999: A framework
for measuring changes in data characteristics. In Proceedings of ACM
Symp. Principles of Database Systems, 126–37.
[9] Garofalakis, M., J. Gehrke and R. Rastogi, 2002: Querying and mining
data streams: you only get one look. Tutorial in Proc. 2002 ACM Con-
ference on Management of Data.
[10] Gehrke, J., V. Ganti, R. Ramakrishnan and W. Loh, 1999: BOAT: opti-
mistic decision tree construction. Proceedings of Conference on Manage-
ment of Data, 169–80.
[11] Gehrke, J., F. Korn and D. Srivastava, 2001: On computing correlated
aggregates over continuous data streams. In Proceedings of ACM Confer-
ence on Management of Data, 13–24.
[12] Guha, S., N. Mishra, R. Motwani and L. O’Callaghan, 2000: Clustering
data streams. In Proc. IEEE Symposium on Foundations of Computer
Science, 359–66.
[13] Han, J., and M. Kamber, 2000: Data Mining Concepts and Techniques.
Morgan Kaufmann.
[14] Hastie, T., R. Tibshirani and J. Friedman, 2001: The Elements of Statis-
tical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
[15] Hulten, G., L. Spencer and P. Domingos, 2001: Mining time-changing
data streams. Proceedings of ACM Conference on Knowledge Discovery
in Databases, 97–106.
[16] Lin, J., 1991: Divergence measures based on the Shannon entropy. IEEE
Trans. on Information Theory, 37, no. 1, 145–51.
[17] Liu, H., F. Hussain, C.L. Tan and M. Dash, 2002: Discretization: An
enabling technique. Data Mining and Knowledge Discovery, 6, 393–423.
[18] Manku, G., and R. Motwani, 2002: Approximate frequency counts over
data streams. In Proc. 2002 Int. Conf. on Very Large Databases.
[19] Mitchell, T., 1997: Machine Learning. McGraw-Hill.
[20] O’Callaghan, L., N. Mishra, A. Meyerson, S. Guha and R. Motwani, 2002:
High-performance clustering of streams and large data sets. In Proceedings
of IEEE International Conference on Data Engineering.
[21] Witten, I., and E. Frank, 2001: Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations. Morgan Kaufmann.
Index