Pattern Discovery Using
Sequence Data Mining:
Applications and Studies
Pradeep Kumar
Indian Institute of Management Lucknow, India
P. Radha Krishna
Infosys Lab, Infosys Limited, India
S. Bapi Raju
University of Hyderabad, India
Senior Editorial Director: Kristin Klinger
Director of Book Publications: Julia Mosemann
Editorial Director: Lindsay Johnston
Acquisitions Editor: Erika Carter
Development Editor: Joel Gamon
Production Editor: Sean Woznicki
Typesetters: Jennifer Romanchak, Lisandro Gonzalez
Print Coordinator: Jamie Snavely
Cover Design: Nick Newcomer
Copyright © 2012 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
List of Reviewers
Manish Gupta, University of Illinois at Urbana, USA
Chandra Sekhar, Indian Institute of Technology Madras, India
Arnab Bhattacharya, Indian Institute of Technology Kanpur, India
Padmaja T Maruthi, University of Hyderabad, India
T. Ravindra Babu, Infosys Technologies Ltd, India
Pratibha Rani, International Institute of Information Technology Hyderabad, India
Nita Parekh, International Institute of Information Technology Hyderabad, India
Anass El-Haddadi, IRIT, France
Pinar Senkul, Middle East Technical University, Turkey
Jessica Lin, George Mason University, USA
Pradeep Kumar, Indian Institute of Management Lucknow, India
Raju S. Bapi, University of Hyderabad, India
P. Radha Krishna, Infosys Lab, Infosys Limited, India
Table of Contents
Preface...................................................................................................................................................vii
Section 1
Current State of Art
Chapter 1
Applications of Pattern Discovery Using Sequential Data Mining......................................................... 1
Manish Gupta, University of Illinois at Urbana-Champaign, USA
Jiawei Han, University of Illinois at Urbana-Champaign, USA
Chapter 2
A Review of Kernel Methods Based Approaches to Classification and Clustering of Sequential
Patterns, Part I: Sequences of Continuous Feature Vectors................................................................... 24
Dileep A. D., Indian Institute of Technology, India
Veena T., Indian Institute of Technology, India
C. Chandra Sekhar, Indian Institute of Technology, India
Chapter 3
A Review of Kernel Methods Based Approaches to Classification and Clustering of Sequential
Patterns, Part II: Sequences of Discrete Symbols.................................................................................. 51
Veena T., Indian Institute of Technology, India
Dileep A. D., Indian Institute of Technology, India
C. Chandra Sekhar, Indian Institute of Technology, India
Section 2
Techniques
Chapter 4
Mining Statistically Significant Substrings Based on the Chi-Square Measure.................................... 73
Sourav Dutta, IBM Research Lab, India
Arnab Bhattacharya, Indian Institute of Technology Kanpur, India
Chapter 5
Unbalanced Sequential Data Classification Using Extreme Outlier Elimination and Sampling
Techniques............................................................................................................................................. 83
T. Maruthi Padmaja, University of Hyderabad (UoH), India
Raju S. Bapi, University of Hyderabad (UoH), India
P. Radha Krishna, Infosys Lab, Infosys Limited, India
Chapter 6
Quantization Based Sequence Generation and Subsequence Pruning for Data Mining
Applications........................................................................................................................................... 94
T. Ravindra Babu, Infosys Limited, India
M. Narasimha Murty, Indian Institute of Science Bangalore, India
S. V. Subrahmanya, Infosys Limited, India
Chapter 7
Classification of Biological Sequences................................................................................................ 111
Pratibha Rani, International Institute of Information Technology Hyderabad, India
Vikram Pudi, International Institute of Information Technology Hyderabad, India
Section 3
Applications
Chapter 8
Approaches for Pattern Discovery Using Sequential Data Mining..................................................... 137
Manish Gupta, University of Illinois at Urbana-Champaign, USA
Jiawei Han, University of Illinois at Urbana-Champaign, USA
Chapter 9
Analysis of Kinase Inhibitors and Druggability of Kinase-Targets Using Machine Learning
Techniques........................................................................................................................................... 155
S. Prasanthi, University of Hyderabad, India
S. Durga Bhavani, University of Hyderabad, India
T. Sobha Rani, University of Hyderabad, India
Raju S. Bapi, University of Hyderabad, India
Chapter 10
Identification of Genomic Islands by Pattern Discovery..................................................................... 166
Nita Parekh, International Institute of Information Technology Hyderabad, India
Chapter 11
Video Stream Mining for On-Road Traffic Density Analytics............................................................ 182
Rudra Narayan Hota, Frankfurt Institute for Advanced Studies, Germany
Kishore Jonna, Infosys Lab, Infosys Limited, India
P. Radha Krishna, Infosys Lab, Infosys Limited, India
Chapter 12
Discovering Patterns in Order to Detect Weak Signals and Define New Strategies............................ 195
Anass El Haddadi, University of Toulouse III, France & University of Mohamed V, Morocco
Bernard Dousset, University of Toulouse, France
Ilham Berrada, University of Mohamed V, Morocco
Chapter 13
Discovering Patterns for Architecture Simulation by Using Sequence Mining.................................. 212
Pınar Senkul, Middle East Technical University, Turkey
Nilufer Onder, Michigan Technological University, USA
Soner Onder, Michigan Technological University, USA
Engin Maden, Middle East Technical University, Turkey
Hui Meen Nyew, Michigan Technological University, USA
Chapter 14
Sequence Pattern Mining for Web Logs.............................................................................................. 237
Pradeep Kumar, Indian Institute of Management Lucknow, India
Raju S. Bapi, University of Hyderabad, India
P. Radha Krishna, Infosys Lab, Infosys Limited, India
Index.................................................................................................................................................... 270
Preface
A huge amount of data is collected every day in the form of sequences. These sequential data are valu-
able sources of information not only to search for a particular value or event at a specific time, but also
to analyze the frequency of certain events or sets of events related by a particular temporal/sequential
relationship. For example, DNA sequences encode the genetic makeup of humans and all other species,
and protein sequences describe the amino acid composition of proteins and encode the structure and
function of proteins. Moreover, sequences can be used to capture how individual humans behave through
various temporal activity histories such as weblog histories and customer purchase patterns. In general
there are various methods to extract information and patterns from databases, such as time series ap-
proaches, association rule mining, and data mining techniques.
The objective of this book is to provide a concise, state-of-the-art account of the field of sequence data mining along with its applications. The book consists of 14 chapters divided into 3 sections. The first section provides a review of the state of the art in sequence data mining. Section 2 presents relatively new techniques for sequence data mining. Finally, in Section 3, various application areas of sequence data mining are explored.
Chapter 1, Approaches for Pattern Discovery Using Sequential Data Mining, by Manish Gupta and
Jiawei Han of University of Illinois at Urbana-Champaign, IL, USA, discusses different approaches for
mining of patterns from sequence data. Apriori based methods and the pattern growth methods are the
earliest and the most influential methods for sequential pattern mining. There is also a vertical format
based method which works on a dual representation of the sequence database. Work has also been done
for mining patterns with constraints, mining closed patterns, mining patterns from multi-dimensional
databases, mining closed repetitive gapped subsequences, and other forms of sequential pattern mining.
Some works also focus on mining incremental patterns and mining from stream data. In this chapter,
the authors have presented at least one method of each of these types and discussed advantages and
disadvantages.
Chapter 2, A Review of Kernel Methods Based Approaches to Classification and Clustering of
Sequential Patterns, Part I: Sequences of Continuous Feature Vectors, was authored by Dileep A. D.,
Veena T., and C. Chandra Sekhar of Department of Computer Science and Engineering, Indian Institute
of Technology Madras, India. They present a brief description of kernel methods for pattern classifica-
tion and clustering. They also describe dynamic kernels for sequences of continuous feature vectors.
The chapter also presents a review of approaches to sequential pattern classification and clustering using
dynamic kernels.
Chapter 9, Analysis of Kinase Inhibitors and Druggability of Kinase-Targets Using Machine Learn-
ing Techniques, by S. Prashanthi, S. Durga Bhavani, T. Sobha Rani, and Raju S. Bapi of Department
of Computer & Information Sciences, University of Hyderabad, Hyderabad, India, focuses on human
kinase drug target sequences since kinases are known to be potential drug targets. The authors have also
presented a preliminary analysis of kinase inhibitors in order to study the problem in the protein-ligand space in the future. The identification of druggable kinases is treated as a classification problem in which druggable kinases are taken as the positive data set and non-druggable kinases are chosen as the negative data set.
Chapter 10, Identification of Genomic Islands by Pattern Discovery, by Nita Parekh of the International Institute of Information Technology, Hyderabad, India, addresses a pattern recognition problem at the genomic level: identifying horizontally transferred regions, called genomic islands. A horizontal transfer event is defined as the movement of genetic material between phylogenetically unrelated
organisms by mechanisms other than parent to progeny inheritance. Increasing evidence suggests the
importance of horizontal transfer events in the evolution of bacteria, influencing traits such as antibiotic
resistance, symbiosis and fitness, virulence, and adaptation in general. Considerable effort is being made
in their identification and analysis, and in this chapter, a brief summary of various approaches used in
the identification and validation of horizontally acquired regions is discussed.
Chapter 11, Video Stream Mining for On-Road Traffic Density Analytics, by Rudra Narayan Hota of the Frankfurt Institute for Advanced Studies, Frankfurt, Germany, along with Kishore Jonna and P. Radha Krishna of Infosys Lab, Infosys Technologies Limited, India, addresses the problem of computer vision based traffic density estimation using video stream mining. The authors present an efficient approach for traffic density estimation using texture analysis along with a Support Vector Machine (SVM) classifier, and describe how traffic density analysis can support on-road traffic congestion control with better flow management.
Chapter 12, Discovering Patterns in Order to Detect Weak Signals and Define New Strategies, by Anass El Haddadi (Université de Toulouse, IRIT, France), Bernard Dousset (Université de Toulouse, France), and Ilham Berrada (ENSIAS, AL BIRONI team, Mohamed V University – Souissi, Rabat, Morocco), presents four methods for discovering patterns in the competitive intelligence process: “correspondence analysis,” “multiple correspondence analysis,” “evolutionary graph,” and “multi-term method.” Competitive intelligence activities rely on collecting and analyzing data in order to discover patterns from data using sequence data mining. The discovered patterns are used to help decision-makers consider innovation and define business strategy.
Chapter 13, Discovering Patterns for Architecture Simulation by Using Sequence Mining, by Pınar
Senkul (Middle East Technical University, Computer Engineering Dept., Ankara, Turkey) along with
Nilufer Onder (Michigan Technological University, Computer Science Dept., Michigan, USA), Soner
Onder (Michigan Technological University, Computer Science Dept., Michigan, USA), Engin Maden
(Middle East Technical University, Computer Engineering Dept., Ankara, Turkey) and Hui Meen Nyew
(Michigan Technological University, Computer Science Dept., Michigan, USA), discusses the problem
of designing and building high performance systems that make effective use of resources such as space
and power. The design process typically involves a detailed simulation of the proposed architecture fol-
lowed by corrections and improvements based on the simulation results. Both simulator development
and result analysis are very challenging tasks due to the inherent complexity of the underlying systems.
They present a tool called Episode Mining Tool (EMT), which includes three temporal sequence mining
algorithms, a preprocessor, and a visual analyzer.
Chapter 14 is called Sequence Pattern Mining for Web Logs by Pradeep Kumar, Indian Institute of
Management, Lucknow, India, Raju S. Bapi, University of Hyderabad, India and P. Radha Krishna,
Infosys Lab, Infosys Technologies Limited, India. In their work, the authors utilize a variation of the AprioriALL algorithm, which is commonly used for sequence pattern mining. The proposed variation incorporates the measure Interest at every step of candidate generation to reduce the number of candidates, thus reducing time and space costs.
This book can be useful to academic researchers and graduate students interested in data mining
in general and in sequence data mining in particular, and to scientists and engineers working in fields
where sequence data mining is involved, such as bioinformatics, genomics, Web services, security, and
financial data analysis.
Sequence data mining is still a fairly young research field. Much more remains to be discovered in
this exciting research domain in the aspects related to general concepts, techniques, and applications.
Our fond wish is that this collection sparks fervent activity in sequence data mining, and we hope this
is not the last word!
Pradeep Kumar
Indian Institute of Management Lucknow, India
P. Radha Krishna
Infosys Lab, Infosys Limited, India
S. Bapi Raju
University of Hyderabad, India
Section 1
Current State of Art
Chapter 1
Applications of Pattern
Discovery Using Sequential
Data Mining
Manish Gupta
University of Illinois at Urbana-Champaign, USA
Jiawei Han
University of Illinois at Urbana-Champaign, USA
ABSTRACT
Sequential pattern mining methods have been found to be applicable in a large number of domains.
Sequential data is omnipresent. Sequential pattern mining methods have been used to analyze this data
and identify patterns. Such patterns have been used to implement efficient systems that can recommend
based on previously observed patterns, help in making predictions, improve usability of systems, de-
tect events, and in general help in making strategic product decisions. In this chapter, we discuss the
applications of sequential data mining in a variety of domains like healthcare, education, Web usage
mining, text mining, bioinformatics, telecommunications, intrusion detection, et cetera. We conclude
with a summary of the work.
… sets. Each set in the sequence is a hospitalization instance. Each element in a hospitalization can be any symbolic data gathered by the PMSI (medical data source). They used the SLPMiner system (Seno & Karypis, 2002) for mining the patient path database in order to find frequent sequential patterns among the patient paths. They tested the model on the 2002 PMSI data of the Nancy University Hospital and also propose an interactive tool to perform inter-institutional patient path analysis.

Patterns in dyspepsia symptoms: Consider a domain expert, who is an epidemiologist and is interested in finding relationships between symptoms of dyspepsia within and across time points. This can be done by first mining patterns from symptom data and then using the patterns to define association rules. Rules could look like ANOREX2=0 VOMIT2=0 NAUSEA3=0 ANOREX3=0 VOMIT3=0 ⇒ DYSPH2=0, where each symptom is represented as <symptom>N=V (time=N and value=V). ANOREX (anorexia), VOMIT (vomiting), DYSPH (dysphagia) and NAUSEA (nausea) are the different symptoms. However, a better way of handling this is to define subgroups as sets of symptoms at a single time point. (Lau, Ong, Mahidadia, Hoffmann, Westbrook, & Zrimec, 2003) solve the problem of identifying symptom patterns by implementing a framework for constraint-based association rule mining across subgroups. Their framework, Apriori with Subgroup and Constraint (ASC), is built on top of the existing Apriori framework. They have identified four different types of phase-wise constraints for subgroups: constraint across subgroups, constraint on subgroup, constraint on pattern content and constraint on rule. A constraint across subgroups specifies the order of subgroups in which they are to be mined. A constraint on subgroup describes the intra-subgroup criteria of the association rules. It describes a minimum support for subgroups and a set of constraints for each subgroup. A constraint on pattern content outlines the inter-subgroup criteria on association rules. It describes the criteria on the relationships between subgroups. A constraint on rule outlines the composition of an association rule; it describes the attributes that form the antecedents and the consequents, and calculates the confidence of an association rule. It also specifies the minimum support for a rule and prunes away item-sets that do not meet this support at the end of each subgroup-merging step. A typical user constraint can look like [1,2,3][1, a=A1&n<=2][2, a=B1&n<=2][3, v=1][rule, (s1 s2) ⇒ s3]. This can be interpreted as: looking at subgroups 1, 2 and 3, from subgroup 1, extract patterns that contain the attribute A1 (a=A1) and contain no more than 2 attributes (n<=2); from subgroup 2, extract patterns that contain the attribute B1 (a=B1) and contain no more than 2 attributes (n<=2); then from subgroup 3, extract patterns with at least one attribute that has a value of 1 (v=1). Attributes from subgroups 1 and 2 form the antecedents in a rule, and attributes from subgroup 3 form the consequents ([rule, (s1 s2) ⇒ s3]). Such constraints are easily incorporated into the Apriori process by pruning away more candidates based on these constraints.
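The bracketed constraint string above packs several restrictions into one expression. The following sketch shows one way such a per-subgroup constraint could be represented and applied as a filter to the item-sets mined from a single subgroup; it is only an illustration, not the authors' implementation, and the field names and attribute encoding are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SubgroupConstraint:
    subgroup: int
    required_attrs: frozenset      # e.g. {"A1"} for "a=A1"
    max_attrs: int                 # e.g. 2 for "n<=2"
    required_value: object = None  # e.g. 1 for "v=1", or None if absent

def satisfies(itemset, c):
    """itemset: dict mapping attribute name -> value for one subgroup."""
    if len(itemset) > c.max_attrs:
        return False
    if not c.required_attrs <= set(itemset):
        return False
    if c.required_value is not None and c.required_value not in itemset.values():
        return False
    return True

# "[1, a=A1&n<=2]" from the example constraint string above
c1 = SubgroupConstraint(subgroup=1, required_attrs=frozenset({"A1"}), max_attrs=2)
print(satisfies({"A1": 0, "VOMIT2": 0}, c1))   # True
print(satisfies({"B1": 0}, c1))                # False: A1 missing
```

In ASC this kind of check runs inside the Apriori loop, so violating candidates are pruned before the next subgroup-merging step.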
They experimented on a dataset with records of 303 patients treated for dyspepsia. Each record represented a patient, the absence or presence of 10 dyspepsia symptoms at three time points (initial presentation to a general practitioner, 18 months after endoscopy screening, and 8–9 years after endoscopy), and the endoscopic diagnosis for the patient. Each of these symptoms can have one of the following three values: symptom present, symptom absent, missing (unknown). At each of the three time points, a symptom can take any of these three possible values. They show that their approach leads to interesting symptom pattern discovery.
Patterns in daily activity data: There are also works which investigate techniques for using agent-based smart home technologies to provide at-home automated assistance and health monitoring. These systems first learn patterns from at-home health and activity data. Further, for any new test cases, they identify behaviors that do not conform to normal behavior and report them as predicted anomalous health problems.

EDUCATION

In the education domain, work has been done to extract patterns from source code and student teamwork data.

Patterns in source code: A coding pattern is a frequent sequence of method calls and control statements to implement a particular behavior. Coding patterns include copy-and-pasted code, crosscutting concerns (parts of a program which rely on or must affect many other parts of the system) and implementation idioms. Duplicated code fragments and crosscutting concerns that spread across modules are problematic in software maintenance. (Ishio, Date, Miyake, & Inoue, 2008) propose a sequential pattern mining approach to capture coding patterns in Java programs. They define a set of rules to translate Java source code into a sequence database for pattern mining, and apply the PrefixSpan algorithm to the sequence database. They define constraints for mining source code patterns. A constraint for control statements could be: if a pattern includes a LOOP/IF element, the pattern must include its corresponding element generated from the same control statement. They classify sub-patterns into pattern groups. As a case study, they applied their tool to six open-source programs and manually investigated the resultant patterns.

They identify about 17 pattern groups, which they classify into 5 categories:

1. A boolean method to insert an additional action: <Boolean method>, <IF>, <action-method>, <END-IF>
2. A boolean method to change the behavior of multiple methods: <Boolean method>, <IF>, <action-method>, <END-IF>
3. A pair of set-up and clean-up: <set-up method>, <misc action>, …, <clean-up method>
4. Exception handling: every instance is included in a try-catch statement.
5. Other patterns.

They have made this technique available as a tool: Fung (http://sel.ist.osaka-u.ac.jp/~ishio/fung/).

Patterns in student team-work data: (Kay, Maisonneuve, Yacef, & Zaïane, 2006) describe data mining of student group interaction data to identify significant sequences of activity. The goal is to build tools that can flag interaction sequences indicative of problems, so that they can be used to assist student teams in early recognition of problems. They also want tools that can identify patterns that are markers of success, so that these might indicate improvements during the learning process. They obtain their data using TRAC, an open source tool designed for use in software development projects. Students collaborate by sharing tasks via the TRAC system. These tasks are managed by a “Ticket” system; source code writing tasks are managed by a version control system called “SVN”; students communicate by means of collaborative web page writing called “Wiki”. Data consist of events, where each event is represented as Event = {EventType, ResourceId, Author, Time}, where EventType is one of T (for Ticket), S (for SVN), W (for Wiki). One such sequence is generated for each group of students.

The original sequence obtained for each group was 285 to 1287 events long. These event sequences were then broken down into several “sequences” of events using a per-session approach or a per-resource approach. In the breakdown per session approach, the date and the resourceId are omitted and a sequence is of the form (iXj), which captures the number i of consecutive times a medium X was used by j different authors, e.g., <(2T1), (5W3), (2S1), (1W1)>.
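A minimal sketch of this per-session encoding, assuming the raw TRAC events have already been grouped into one time-ordered session and reduced to (medium, author) pairs (the field layout here is illustrative, not the authors' schema):

```python
def per_session_breakdown(events):
    """events: time-ordered (medium, author) pairs from one session,
    medium in {"T", "S", "W"}.  Returns elements of the form (i, X, j):
    i consecutive events on medium X involving j distinct authors."""
    elements = []
    run, authors, current = [], set(), None
    for medium, author in events:
        if medium != current and run:
            elements.append((len(run), current, len(authors)))
            run, authors = [], set()
        current = medium
        run.append(medium)
        authors.add(author)
    if run:
        elements.append((len(run), current, len(authors)))
    return elements

# e.g. two Ticket events by one author, then five Wiki events by three authors
events = [("T", "ann"), ("T", "ann"),
          ("W", "ann"), ("W", "bob"), ("W", "cho"), ("W", "bob"), ("W", "ann")]
print(per_session_breakdown(events))   # [(2, 'T', 1), (5, 'W', 3)]
```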
In the breakdown per resource approach, a sequence is of the form <iXj, t>, which captures the number i of different events of type X, the number j of authors, and the number of days t over which the resource was modified, e.g., <10W5, 2>. In a follow-up paper (Perera, Kay, Yacef, & Koprinska, 2007), they have a third approach, breakdown by task, where every sequence is of the form (i,X,A), which captures the number of consecutive events (i) occurring on a particular TRAC medium (X), and the role of the author (A).

Patterns observed in group sessions: Better groups had many alternations of SVN and Wiki events, and SVN and Ticket events, whereas weaker groups had almost none. The best group also had the highest proportion of author sessions containing many consecutive ticket events (matching their high use of ticketing) and SVN events (suggesting they committed their work to the group repository more often).

A more detailed analysis of these patterns revealed that the best group used the Ticket more than the Wiki, whereas the weakest group displayed the opposite pattern. The data suggested group leaders in good groups were much less involved in technical work, suggesting work was being delegated properly and the leader was leading rather than simply doing all the work. In contrast, the leaders of the poorer groups either seemed to use the Wiki (a less focused medium) more than the tickets, or were involved in too much technical work.

Patterns observed in task sequences: The two best groups had the greatest percentage support for the pattern (1,t,L)(1,t,b), which most likely represents tickets initiated by the leader and accepted by another team member. The fact that this occurred more often than (1,t,L)(2,t,b) suggests that the better groups were distinguished by tasks being performed on the Wiki or SVN files before the ticket was closed by the second member. Notably, the weakest group had higher support for this latter pattern than the former. The best group was one of the only two to display the patterns (1,t,b)(1,s,b) and (1,s,b)(1,t,b) – the first likely being a ticket being accepted by a team member and then SVN work relating to that task being completed, and the second likely being work being done followed by the ticket being closed. The close coupling of task-related SVN and Wiki activity and Ticket events for this group was also shown by relatively high support for the patterns (1,t,b)(1,t,b)(1,t,b), (1,t,b)(1,s,b)(1,t,b) and (1,t,b)(1,w,b)(1,t,b). The poorest group displayed the highest support for the last pattern, but no support for the former, again indicating their lack of SVN use in tasks.

Patterns observed in resource sequences: The best group had very high support for patterns where the leader interacted with group members on tickets, such as (L,1,t)(b,1,t)(L,1,t). The poorest group in contrast lacked these interaction patterns, and had more tickets which were created by the Tracker rather than the Leader, suggestive of weaker leadership. The best group displayed the highest support for patterns such as (b,3,t) and (b,4,t), suggestive of group members making at least one update on tickets before closing them. In contrast, the weaker groups showed support mainly for the pattern (b,2,t), most likely indicative of group members accepting and closing tickets with no update events in between.

Web Usage Mining

The complexity of tasks such as Web site design, Web server design, and of simply navigating through a Web site has been increasing continuously. An important input to these design tasks is the analysis of how a Web site is being used. Usage analysis includes straightforward statistics, such as page access frequency, as well as more sophisticated forms of analysis, such as finding the common traversal paths through a Web site. Web Usage Mining is the application of pattern mining techniques to usage logs of large Web data repositories in order to produce results that can be used in the design tasks mentioned above. However, there are several preprocessing tasks that
must be performed prior to applying data mining algorithms to the data collected from server logs.

Transaction identification from web usage data: (Cooley, Mobasher, & Srivastava, 1999) present several data preparation techniques in order to identify unique users and user sessions. Also, a method to divide user sessions into semantically meaningful transactions is defined. Each user session in a user session file can be thought of in two ways: either as a single transaction of many page references, or a set of many transactions each consisting of a single page reference. The goal of transaction identification is to create meaningful clusters of references for each user. Therefore, the task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. This process can be extended into multiple steps of merge or divide in order to create transactions appropriate for a given data mining task. Both types of approaches take a transaction list and possibly some parameters as input, and output a transaction list that has been operated on by the function in the approach in the same format as the input. They consider three different ways of identifying transactions, based on: Reference Length (time spent when visiting a page), Maximal Forward Reference (the set of pages in the path from the first page in a user session up to the page before a backward reference is made) and Time Window.

By analyzing this information, a Web Usage Mining system can determine temporal relationships among data items, such as the following Olympics Web site examples:

• 9.81% of the site visitors accessed the Atlanta home page followed by the Sneakpeek main page.
• 0.42% of the site visitors accessed the Sports main page followed by the Schedules main page.
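Of the three approaches just listed, the maximal forward reference one is the easiest to make concrete. The sketch below is an assumption-level reading of that idea, not the authors' code: it splits one session into transactions by emitting the current forward path whenever the user backtracks to a page already on it.

```python
def maximal_forward_transactions(session):
    """Split one user session (a list of page identifiers) into
    maximal-forward-reference transactions."""
    transactions = []
    path = []
    extending = False          # True while the path is still growing forward
    for page in session:
        if page in path:
            if extending:      # a backward reference ends a maximal forward path
                transactions.append(list(path))
            path = path[:path.index(page) + 1]
            extending = False
        else:
            path.append(page)
            extending = True
    if extending and path:
        transactions.append(list(path))
    return transactions

# e.g. the click stream A B C D C B E F yields [A B C D] and [A B E F]
print(maximal_forward_transactions(list("ABCDCBEF")))
```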
Patterns for customer acquisition: (Buchner & Mulvenna, 1998) propose an environment that allows the discovery of patterns from trading-related web sites, which can be harnessed for electronic commerce activities such as personalization, adaptation, customization, profiling, and recommendation.

The two essential parts of customer attraction are the selection of new prospective customers and the acquisition of the selected potential candidates. One marketing strategy to perform this exercise, among others, is to find common characteristics in already existing visitors’ information and behavior for the classes of profitable and non-profitable customers. The authors discover these sequences by extending GSP so it can handle duplicates in sequences, which is relevant for discovering navigational behavior. A discovered sequence looks like the following:

{ecom.infm.ulst.ac.uk/, ecom.infm.ulst.ac.uk/News_Resources.html, ecom.infm.ulst.ac.uk/Journals.html, ecom.infm.ulst.ac.uk/search.htm} Support = 3.8%; Confidence = 31.0%

The discovered sequence can then be used to display special offers dynamically to keep a customer interested in the site, after a certain page sequence with a threshold support and/or confidence value has been visited.

Patterns to Improve Web Site Design

For the analysis of visitor navigation behavior in web sites integrating multiple information systems (multiple underlying database servers or archives), (Berendt, 2000) proposed the web usage miner (WUM), which discovers navigation patterns subject to advanced statistical and structural constraints.
Experiments with a real web site that integrates data from multiple databases, the German SchulWeb (a database of German-language school magazines), demonstrate the appropriateness of WUM in discovering navigation patterns and show how those discoveries can help in assessing and improving the quality of the site design, i.e., the conformance of the web site’s structure to the intuition of each group of visitors accessing the site. The intuition of the visitors is indirectly reflected in their navigation behavior, as represented in their browsing patterns. By comparing the typical patterns with the site usage expected by the site designer, one can examine the quality of the site and give concrete suggestions for its improvement. For instance, repeated refinements of a query may indicate a search environment that is not intuitive for some users. Also, long lists of results may signal that sufficiently selective search options are lacking, or that they are not understood by everyone.

A session is a directed list of page accesses performed by a user during her/his visit to a site. Pages of a session are mapped onto elements of a sequence, whereby each element is a pair comprised of the page and a positive integer. This integer is the occurrence of the page in the session, taking into account the fact that a user may visit the same page more than once during a single session. Further, they also define generalized sequences, which are sequences with length constraints on gaps. These constraints are expressed in a mining language, MINT.

The patterns that they observe are as follows. Searches reaching a ‘school’ entry are a dominant sub-pattern. ‘State’ lists of schools are the most popular lists. Schools are rarely reached in short searches.

Pattern Discovery for Web Personalization

Pattern discovery from usage data can also be used for Web personalization. (Mobasher, Dai, Luo, & Nakagawa, 2002) find that more restrictive patterns, such as contiguous sequential patterns (e.g., frequent navigational paths), are more suitable for predictive tasks such as Web pre-fetching (which involves predicting which item is accessed next by a user), while less constrained patterns, such as frequent item-sets or general sequential patterns, are more effective alternatives in the context of Web personalization and recommender systems.

Web usage preprocessing ultimately results in a set of n page-views, P = {p1, p2, ..., pn}, and a set of m user transactions, T = {t1, t2, ..., tm}. Each transaction t is defined as an l-length sequence of ordered pairs: t = <(pt1, w(pt1)), (pt2, w(pt2)), ..., (ptl, w(ptl))>, where w(pti) is the weight associated with page-view pti. Contiguous sequential patterns (CSPs – patterns in which the items appearing in the sequence must be adjacent with respect to the underlying ordering) are used to capture frequent navigational paths among user trails. General sequential patterns are used to represent more general navigational patterns within the site.

To build a recommendation algorithm using sequential patterns, the authors focus on frequent sequences of size |w| + 1 whose prefix contains an active user session w. The candidate page-views to be recommended are the last items in all such sequences. The recommendation values are based on the confidence of the patterns. A simple trie structure is used to store both the sequential and contiguous sequential patterns discovered during the pattern discovery phase. The recommendation algorithm is extended to generate all kth order recommendations as follows. First, the recommendation engine uses the largest possible active session window as its input. If the engine cannot generate any recommendations, the size of the active session window is iteratively decreased until a recommendation is generated or the window size becomes 0.
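A minimal sketch of this kth-order lookup, with the trie replaced by a plain dictionary that maps pattern prefixes to (candidate page-view, confidence) pairs; the index contents below are invented for illustration:

```python
def recommend(active_session, pattern_index, max_window):
    """Return recommendations for the active session, trying the largest
    usable window first and shrinking it until something matches.
    pattern_index: {tuple_of_pageviews: [(candidate_page, confidence), ...]}."""
    for size in range(min(max_window, len(active_session)), 0, -1):
        window = tuple(active_session[-size:])
        candidates = pattern_index.get(window, [])
        if candidates:
            # recommendation value = confidence of the matched pattern
            return sorted(candidates, key=lambda c: c[1], reverse=True)
        # otherwise iteratively decrease the window size
    return []

index = {("home", "products"): [("checkout", 0.4), ("support", 0.2)]}
print(recommend(["login", "home", "products"], index, max_window=3))
```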
The CSP model can do better in terms of precision, but the coverage levels, in general, may be too low when the goal is to generate as many good recommendations as possible. On the other hand, when dealing with applications such as Web pre-fetching, in which the primary goal is to predict the user’s immediate next actions (rather than providing a broader set of recommendations), the CSP model provides the best choice. This is particularly true in sites with many dynamically generated pages, where often a contiguous navigational path represents a semantically meaningful sequence of user actions, each depending on the previous actions.

TEXT MINING

Pattern mining has been used for text databases to discover trends, for text categorization, for document classification and for authorship identification. We discuss these works below.

Trends in Text Databases

(Lent, Agrawal, & Srikant, 1997) describe a system for identifying trends in text documents collected over a period of time. Trends can be used, for example, to discover that a company is shifting interests from one domain to another. Their system mines these trends and also provides a method to visualize them.

The unit of text is a word, and a phrase is a list of words. Associated with each phrase is a history of the frequency of occurrence of the phrase, obtained by partitioning the documents based upon their timestamps. The frequency of occurrence in a particular time period is the number of documents that contain the phrase. A trend is a specific subsequence of the history of a phrase that satisfies the user’s query over the histories. For example, the user may specify a shape query like a spike query to find those phrases whose frequency of occurrence increased and then decreased. In this trend analysis, sequential pattern mining is used for phrase identification.

A transaction ID is assigned to each word of every document, treating the words as items in the data mining algorithms. This transformed data is then mined for dominant words and phrases, and the results saved. The user’s query is translated into a shape query, and this query is then executed over the mined data, yielding the desired trends. The results of the mining are a set of phrases that occur frequently in the underlying documents and that match a query supplied by the user. Thus, the system has three major steps: identifying frequent phrases using sequential pattern mining, generating histories of phrases, and finding phrases that satisfy a specified trend.

A 1-phrase is a list of elements where each element is a phrase. A k-phrase is an iterated list of phrases with k levels of nesting. <<(IBM)><(data mining)>> is a 1-phrase, which can mean that IBM and “data mining” should occur in the same paragraph, with “data mining” being contiguous words in the paragraph.

A word in a text field is mapped to an item in a data-sequence or sequential pattern, and a phrase to a sequential pattern that has just one item in each element. Each element of a data sequence in the sequential pattern problem has some associated timestamp relative to the other elements in the sequence, thereby defining an ordering of the elements of a sequence. Sequential pattern algorithms can now be applied to the transaction-ID-labeled words to identify simple phrases from the document collection.

The user may be interested in phrases that are contained in individual sentences only. Alternatively, the words comprising a phrase may come from sequential sentences so that a phrase spans a paragraph. This generalization can be accommodated by the use of distance constraints that specify a minimum and/or maximum gap between adjacent words of a phrase. For example, the first variation described above would be constrained by specifying a minimum gap of one word and a maximum gap of one sentence. The second variation would have a minimum gap of one sentence and a maximum gap of one paragraph.
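A rough sketch of the word-to-transaction mapping described above. The sentence and paragraph identifiers are recorded alongside each word so that minimum/maximum gap constraints can later express “adjacent words”, “within one sentence”, or “spanning a paragraph”; the splitting heuristics here are deliberately crude and only illustrative.

```python
import re

def words_to_transactions(document):
    """Map each word of a document to a (transaction_id, sentence_id,
    paragraph_id, word) record for GSP-style mining with gap constraints."""
    records, tid = [], 0
    for p_id, paragraph in enumerate(document.split("\n\n")):
        for s_id, sentence in enumerate(re.split(r"[.!?]+", paragraph)):
            for word in sentence.lower().split():
                records.append((tid, s_id, p_id, word))
                tid += 1
    return records

doc = "IBM announced a data mining product. The product mines text."
for rec in words_to_transactions(doc)[:4]:
    print(rec)
```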
For this latter example, one could further generalize the notion from a single word from each sentence to a set of words from each sentence by using a sliding transaction time window within sentences. The generalizations made in the GSP algorithm for mining sequential patterns allow a one-to-one mapping of the minimum gap, maximum gap, and transaction window to the parameters of the algorithm.

The basic mapping of phrases to sequential patterns is extended by providing a hierarchical mapping over sentences, paragraphs, or even sections of a text document. This extended mapping helps in taking advantage of the structure of a document to obtain a richer set of phrases. Where a document has completely separate sections, phrases that span multiple sections can also be mined, thereby discovering a new set of relationships. This enhancement of the GSP algorithm can be implemented by changing the Apriori-like candidate generation algorithm to consider both phrases and words as individual elements when generating candidate k-phrases. The manner in which these candidates are counted would similarly change.

… words. E.g., the sequential pattern <(data) (information) (machine)> means that some texts contain the words ‘data’, then ‘information’, then ‘machine’ in three different sentences. Once sequential patterns have been extracted for each category, the goal is to derive a categorizer from the obtained patterns. This is done by computing, for each category, the confidence of each associated sequential pattern. To solve this problem, a rule R is generated in the following way:

R: <s1 ... sp> ⇒ Ci; confidence(R) = (#texts from Ci matching <s1 ... sp>) / (#texts matching <s1 ... sp>).

Rules are sorted depending on their confidence level and the size of the associated sequence. When considering a new text to be classified, a simple categorization policy is applied: the K rules having the best confidence level and being supported are applied. The text is then assigned to the class mainly obtained within the K rules.
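The confidence formula and the K-rule policy can be sketched as follows. This is a simplified reading: texts are lists of sentences, a pattern is a list of words matched one per sentence in order, and the containment test is far cruder than a real sequential-pattern matcher.

```python
def matches(pattern, sentences):
    """True if the pattern's words occur in order, one element per sentence.
    Sentences may be lists of words (or plain strings, giving a substring test)."""
    i = 0
    for sentence in sentences:
        if i < len(pattern) and pattern[i] in sentence:
            i += 1
    return i == len(pattern)

def rule_confidence(pattern, category, texts):
    """texts: list of (sentences, label) pairs.  Implements
    confidence(R) = #texts of Ci matching <s1..sp> / #texts matching <s1..sp>."""
    matching = [label for sentences, label in texts if matches(pattern, sentences)]
    return (sum(1 for l in matching if l == category) / len(matching)) if matching else 0.0

def categorize(sentences, rules, k=5):
    """rules: (pattern, category, confidence) triples, pre-sorted by confidence
    and pattern size.  Apply the K best rules that match and take the majority class."""
    votes, applied = {}, 0
    for pattern, category, _conf in rules:
        if applied == k:
            break
        if matches(pattern, sentences):
            votes[category] = votes.get(category, 0) + 1
            applied += 1
    return max(votes, key=votes.get) if votes else None
```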
… a depth-first exploration of the tree gives the corresponding sequence. An example sequential pattern looks like <(0 movie), (1 title), (1 url), (1 CountryOfProduction), (2 item), (2 item), (1 filmography), (3 name)>. Once the whole set of sequences (corresponding to the XML documents of a collection) is obtained, a traditional sequential pattern extraction algorithm is used to extract the frequent sequences. Those sequences, once mapped back into trees, will give the frequent sub-trees embedded in the collection.

They tested several measures in order to decide which class each test document belongs to. The two best measures are based on the longest common subsequence. The first one computes the average matching between the test document and the set of sequential patterns, and the second is a modified measure which incorporates the actual length of the pattern compared to the maximum length of a sequential pattern in the cluster.

Patterns to Identify Authors of Documents

(Tsuboi, 2002) aims at identifying the authors of mailing list messages using a machine learning technique (Support Vector Machines). In addition, the classifier trained on the mailing list data is applied to identify the authors of Web documents in order to investigate performance in authorship identification for more heterogeneous documents. Experimental results show better identification performance when features of not only conventional word N-gram information but also of frequent sequential patterns extracted by a data mining technique (PrefixSpan) are used.

They applied PrefixSpan to extract sequential word patterns from each sentence and used them as the author’s style markers in documents. The sequential word patterns are sequential patterns where item and sequence correspond to word and sentence, respectively. A sequential pattern is <w1*w2*...*wl>, where wi is a word, l is the length of the pattern, and * is any sequence of words, including the empty sequence. These sequential word patterns were introduced for authorship identification based on the following assumption. Because people usually generate words from the beginning to the end of a sentence, how one orders words in a sentence can be an indicator of an author’s writing style. As word order in Japanese (they study a Japanese corpus) is relatively free, rigid word segments and non-contiguous word sequences may be a particularly important indicator of the writing style of authors. While N-grams (consecutive word sequences) fail to account for non-contiguous patterns, sequential pattern mining methods can do so quite naturally.
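Testing whether such a <w1*w2*...*wl> pattern occurs in a sentence is simply an ordered-subsequence check, as in this small sketch (whitespace tokenisation is assumed purely for illustration):

```python
def matches_word_pattern(pattern, sentence_words):
    """True if the words of `pattern` appear in `sentence_words` in the same
    order, with arbitrary (possibly empty) gaps between them -- i.e. the
    <w1 * w2 * ... * wl> style markers described above."""
    it = iter(sentence_words)
    # membership tests on the iterator consume it, enforcing left-to-right order
    return all(w in it for w in pattern)

# The pattern ("I", "think", ".") matches "I really do think so ."
print(matches_word_pattern(("I", "think", "."),
                           "I really do think so .".split()))   # True
```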
BIOINFORMATICS

Pattern mining is useful in the bioinformatics domain for predicting rules for the organization of certain elements in genes, for protein function prediction, for gene expression analysis, for protein fold recognition and for motif discovery in DNA sequences. We study these applications below.

Pattern Mining for Bio-Sequences

Bio-sequences typically have a small alphabet, a long length, and patterns containing gaps (i.e., “don’t care”) of arbitrary size. A long sequence (especially, with a small alphabet) often contains long patterns. Mining frequent patterns in such sequences faces a different type of explosion than in the transaction sequences that primarily motivated market-basket analysis. (Wang, Xu, & Yu, 2004) study how this explosion affects classic sequential pattern mining, and present a scalable two-phase algorithm to deal with this new explosion.
Biosequence patterns have the form X1 *...* Xn, spanning a long region, where each Xi is a short region of consecutive items, called a segment, and * denotes a variable-length gap corresponding to a region not conserved in the evolution. The presence of * implies that pattern matching is more permissible and involves the whole range of a sequence. The support of a pattern is the percentage of the sequences in the database that contain the pattern. Given a minimum segment length min_len and a minimum support min_sup, a pattern X1 *...* Xn is frequent if |Xi| >= min_len for 1 <= i <= n and the support of the pattern is at least min_sup. The problem of mining sequence patterns is to find all frequent patterns.

The Segment Phase first searches for short patterns containing no gaps (Xi), called segments. This phase is efficient. It finds all frequent segments and builds an auxiliary structure for answering position queries. A GST (generalized suffix tree) is used to find: (1) the frequent segments of length min_len, Bi, called base segments, and the position lists for each Bi, s: p1, p2, ..., where pj < pj+1 and each <s, pj> is a start position of Bi; (2) all frequent segments of length > min_len. Note that position lists for such frequent segments are not extracted. This information about the base segments and their positions is then stored in an index, the Segment to Position Index.

The Pattern Phase searches for long patterns (X1 *...* Xn) containing multiple segments separated by variable-length gaps. This phase grows patterns rapidly, one segment at a time, as opposed to one item at a time. This phase is time consuming. The purpose of the two phases is to exploit the information obtained from the first phase to speed up the pattern growth and matching and to prune the search space in the second phase.

Two types of pruning techniques are used. Consider a pattern P’, which is a super-pattern of P:

• Pattern Generation Pruning: If P*X fails to be a frequent pattern, so does P’*X. So, we can prune P’*X.
• Pattern Matching Pruning: If P*X fails to occur before position i in sequence s, so does P’*X. So, we only need to examine the positions after i when matching P’*X against s.

Further, to deal with the huge size of the sequences, they introduce compression-based querying. In this method, all positions in a non-coding region are compressed into a new item ε that matches no existing item except *. A non-coding region contains no part of a frequent segment. Each original sequence is scanned once, and each consecutive region not overlapping with any frequent segment is identified and collapsed into the new item ε. For a long sequence and large min_len and min_sup, a compressed sequence is typically much shorter than the original sequence. On real-life datasets such as DNA and protein sequences submitted from 2002/12 to 2003/02, they show the superiority of their method compared to PrefixSpan with respect to execution time and the space required.
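A toy sketch of the compression step, assuming the Segment Phase has already produced the start positions of the frequent base segments; the sequence and positions below are made up.

```python
def compress(sequence, segment_positions, min_len):
    """Collapse every maximal region that overlaps no frequent segment into a
    single placeholder item (epsilon), as in the compression-based querying
    step described above."""
    EPSILON = "ε"
    covered = [False] * len(sequence)
    for start in segment_positions:
        for i in range(start, min(start + min_len, len(sequence))):
            covered[i] = True
    out, in_gap = [], False
    for item, keep in zip(sequence, covered):
        if keep:
            out.append(item)
            in_gap = False
        elif not in_gap:               # emit epsilon once per collapsed region
            out.append(EPSILON)
            in_gap = True
    return out

seq = list("AACGTTTTTTACGA")            # frequent segment "ACG" occurs at 1 and 10
print("".join(compress(seq, segment_positions=[1, 10], min_len=3)))
# -> 'εACGεACGε'
```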
Patterns in Genes for Predicting Gene Organization Rules

In eukaryotes, rules regarding the organization of cis-regulatory elements are complex. They sometimes govern multiple kinds of elements and positional restrictions on elements. (Terai & Takagi, 2004) propose a method for detecting rules by which the order of elements is restricted. The order restriction is expressed as element patterns. They extract all the element patterns that occur in promoter regions of at least the specified number of genes. Then, significant patterns are found based on the expression similarity of genes with promoter regions containing each of the extracted patterns. By applying the method to Saccharomyces cerevisiae, they detected significant patterns overlooked by previous methods, thus demonstrating the utility of sequential pattern mining for the analysis of eukaryotic gene regulation. Several types of element organization exist: those in which (1) only the order of elements is important, (2) order and distance both are important, and (3) only the …
… in the rule are <= the maximum cvd threshold. Their algorithm to mine distance-based association rules from a dataset of instances extends the Apriori algorithm.

In order to obtain distance-based association rules, one could use the Apriori algorithm to mine all association rules whose supports and confidences satisfy the thresholds, and then annotate those rules with the cvd’s of all the pairs of items present in the rule. Only those rules whose cvd’s satisfy the max-cvd threshold are returned. They call this algorithm to mine distance-based association rules Naïve distance-Apriori.

The Distance-based Association Rule Mining (DARM) algorithm first generates all the frequent item-sets that satisfy the max-cvd constraint (cvd-frequent item-sets), and then generates all association rules with the required confidence from those item-sets. Note that the max-cvd constraint is a non-monotonic property. An item-set that does not satisfy this constraint may have supersets that do. However, they define the following procedure that keeps under consideration only frequent item-sets that deviate properly in an interesting manner.

Let n be the number of promoter regions (instances) in the dataset. Let I be a frequent item-set, and let S be the set of promoter regions that contain I. I is then said to deviate properly if either:

1. I is cvd-frequent. That is, the cvd over S of each pair of motifs in I is <= max-cvd, or
2. For each pair of motifs P∈I, there is a subset S’ of S with cardinality >= ⌈min_sup*n⌉ such that the cvd over S’ of P is <= max-cvd.

The k-level of item-sets kept by the DARM algorithm is the collection of frequent item-sets of cardinality k that deviate properly. Those item-sets are used to generate the (k+1)-level. Once all the frequent item-sets that deviate properly have been generated, distance-based association rules are constructed from those item-sets that satisfy the max-cvd constraint. As is the case with the Apriori algorithm, each possible split of such an item-set into two parts, one for the antecedent and one for the consequent of the rule, is considered. If the rule so formed satisfies the min_conf constraint, then the rule is added to the output. These rules are then used for building a classification/predictive model for gene expression.

Patterns for Protein Fold Recognition

Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. (Exarchos, Papaloukas, Lampros, & Fotiadis, 2008) use sequential pattern mining for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. A fold is the 3D structure of a protein. They use cSPADE (Zaki, Sequence mining in categorical domains: incorporating constraints, 2000) for the analysis of protein sequences. Sequential patterns were generated for each category (fold) separately. A patterni extracted from foldi indicates an implication (rule) of the form patterni ⇒ foldi. A maximum gap constraint is also used. When classifying an unknown protein to one of the folds, all the extracted sequential patterns from all folds are examined to find which of them are contained in the protein. For a pattern contained in a protein, the score of this protein with respect to this fold is increased by scoreai = (length of patternai − k) / (number of patterns in foldi), where ‘i’ represents a fold and ‘a’ represents a pattern of a fold. Here, the length is the size of the pattern with gaps. Patternai is the ath pattern of the ith fold, and k is a value employed to assign the minimum score to the minimal pattern. It should be mentioned that if a pattern is contained in a protein sequence more than once, it receives the same score as if it was contained only once. The scores for each fold are summed, and the new protein is assigned to the fold exhibiting the highest sum.
The score of a protein with respect to a fold is calculated based on the number of sequential patterns of this fold contained in the protein. The higher the number of patterns of a fold contained in a protein, the higher the score of the protein for this fold.
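A compact sketch of that scoring rule. The containment test here is a plain substring check standing in for the gap-aware pattern matching used in the chapter, and the fold/pattern data in the example are placeholders.

```python
def assign_fold(protein, fold_patterns, k=1):
    """fold_patterns: {fold_name: [pattern, ...]}, where a pattern's length
    already includes its gaps.  A contained pattern adds
    (len(pattern) - k) / (#patterns in fold), counted once even if it occurs
    several times; the protein goes to the fold with the highest summed score."""
    def contains(protein_seq, pattern):
        return pattern in protein_seq          # simplified containment test
    scores = {}
    for fold, patterns in fold_patterns.items():
        total = 0.0
        for pat in patterns:
            if contains(protein, pat):
                total += (len(pat) - k) / len(patterns)
        scores[fold] = total
    return max(scores, key=scores.get), scores

folds = {"alpha": ["ACDEF", "KLM"], "beta": ["QRS"]}
print(assign_fold("XXACDEFYYKLM", folds, k=1))   # ('alpha', {...})
```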
A classifier uses the extracted sequential pat- of patterns, rigid gap patterns reveal better con-
terns to classify proteins in the appropriate fold served regions of similarity. On the other hand,
category. For training and evaluating the proposed flexible gap patterns have a greater probability
method they used the protein sequences from of occur by chance, having a smaller biological
the Protein Data Bank and the annotation of the significance. Since the protein alphabet is small,
SCOP database. The method exhibited an overall many small patterns that express trivial local
accuracy of 25% (random would be 2.8%) in a similarity may arise. Therefore, longer patterns
classification problem with 36 candidate catego- are expected to express greater confidence in the
ries. The classification performance reaches up to sequences similarity.
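The difference between the two pattern types can be made concrete by translating them into regular expressions, as in the sketch below; this encoding is only an illustration and is not the representation used in the cited work.

```python
import re

def gap_pattern_to_regex(pattern):
    """pattern: alternating segments and (min_gap, max_gap) tuples.
    ["AB", (2, 2), "CD"] is a rigid-gap pattern (exactly two arbitrary residues
    between the segments); ["AB", (1, 4), "CD"] is a flexible-gap pattern."""
    parts = []
    for piece in pattern:
        if isinstance(piece, tuple):
            lo, hi = piece
            parts.append(".{%d,%d}" % (lo, hi))
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))

rigid    = gap_pattern_to_regex(["AB", (2, 2), "CD"])
flexible = gap_pattern_to_regex(["AB", (1, 4), "CD"])
seq = "XXABQQCDYYABQQQCDZZ"
print(bool(rigid.search(seq)), bool(flexible.search(seq)))   # True True
```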
Patterns in DNA Sequences

Large collections of genomic information have been accumulated in recent years, and embedded in them is potentially significant knowledge for exploitation in medicine and in the pharmaceutical industry. (Guan, Liu, & Bell, 2004) detect strings in DNA sequences which appear frequently, either within a given sequence (e.g., for a particular patient) or across sequences (e.g., from different patients sharing a particular medical diagnosis). Motifs are strings that occur very frequently. Having discovered such motifs, they show how to mine association rules by an existing rough-sets based technique.

TELECOMMUNICATIONS

Pattern mining can be used in the field of telecommunications for mining group patterns from mobile user movement data, for customer behavior prediction, for predicting the future location of a mobile user for location-based services, and for mining patterns useful for mobile commerce. We discuss these works briefly in this sub-section.
13
Applications of Pattern Discovery Using Sequential Data Mining
14
Applications of Pattern Discovery Using Sequential Data Mining
Prediction method (RLP), to guess the user's future location for LBSs. They define moving sequences and frequent patterns in trajectory data. Further, they find out all frequent spatiotemporal movement patterns using an algorithm based on the GSP algorithm. The candidate generating mechanism of the technique is based on that of the GSP algorithm, with an additional temporal join operation and a different method for pruning candidates. In addition, they employ a clustering method to control the dense regions of the patterns. With the frequent movement patterns obtained from the preceding subsection, the movement rules are generated easily.

patterns and sequential intrusion patterns from a collection of attack packets, and then converts the patterns to Snort detection rules for on-line intrusion detection. Patterns are extracted both from packet headers and the packet payload. A typical pattern is of the form "A packet with DA port as 139, DgmLen field in header set to 48 and with content as 11 11". The intrusion behavior detection engine creates an alert when a series of incoming packets match the signatures representing sequential intrusion scenarios.

Patterns for Discovering Multi-Stage Attack Strategies
is a collection of alerts that occur relatively close to each other in a given order frequently. Once such patterns are known, the rules can be produced for describing or predicting the behavior of the sequence of network attacks.

The earth science data consists of time series measurements for various Earth science and climate variables (e.g., soil moisture, temperature, and precipitation), along with additional data from existing ecosystem models (e.g., Net Primary Production). The ecological patterns of interest include associations, clusters, predictive models, and trends. (Potter, Klooster, Torregrosa, Tan, Steinbach, & Kumar) discuss some of the challenges involved in preprocessing and analyzing the data, and also consider techniques for handling some of the spatio-temporal issues. Earth science data has strong seasonal components that need to be removed prior to pattern analysis, as Earth scientists are primarily interested in patterns that represent deviations from normal seasonal variation, such as anomalous climate events (e.g., El Nino) or trends (e.g., global warming). They de-seasonalize the data and then compute a variety of spatio-temporal patterns. Rules learned from the patterns look like (WP-Hi) ⇒ (Solar-Hi) ⇒ (NINO34-Lo) ⇒ (Temp-Hi) ⇒ (NPP-Lo), where WP, Solar, etc. are different earth science parameters with values Hi (High) or Lo (Low).

Patterns for Computer Systems Management

Patterns to Detect Plan Failures

(Zaki, Lesh, & Mitsunori, 1999) present an algorithm to extract patterns of events that predict failures in databases of plan executions: PlanMine. Analyzing execution traces is appropriate for planning domains that contain uncertainty, such as incomplete knowledge of the world or actions with probabilistic effects. They extract causes of plan failures and feed the discovered patterns back into the planner. They label each plan as "good" or "bad" depending on whether it achieved its goal or it failed to do so. The goal is to find "interesting" sequences that have a high confidence of predicting plan failure. They use SPADE to mine such patterns.

TRIPS is an integrated system in which a person collaborates with a computer to develop a high quality plan to evacuate people from a small island. During the process of building the plan, the system simulates the plan repeatedly based on a probabilistic model of the domain, including
predicted weather patterns and their effect on vehicle performance.

The system returns an estimate of the plan's success. Additionally, TRIPS invokes PlanMine on the execution traces produced by simulation, in order to analyze why the plan failed when it did. The system runs PlanMine on the execution traces of the given plan to pinpoint the defects in the plan that most often lead to plan failure. It then applies qualitative reasoning and plan adaptation algorithms to modify the plan to correct the defects detected by PlanMine.

Patterns in Automotive Warranty Data

When a product fails within a certain time period, the warranty is a manufacturer's assurance to a buyer that the product will be repaired without a cost to the customer. In a service environment where dealers are more likely to replace than to repair, the cost of component failure during the warranty period can easily equal three to ten times the supplier's unit price. Consequently, companies invest significant amounts of time and resources to monitor, document, and analyze product warranty data. (Buddhakulsomsiri & Zakarian, 2009) present a sequential pattern mining algorithm that allows product and quality engineers to extract hidden knowledge from a large automotive warranty database. The algorithm uses the elementary set concept and database manipulation techniques to search for patterns or relationships among occurrences of warranty claims over time. The sequential patterns are represented in the form of IF–THEN association rules, where the IF portion of the rule includes quality/warranty problems, represented as labor codes, that occurred at an earlier time, and the THEN portion includes labor codes that occurred at a later time. Once a set of unique sequential patterns is generated, the algorithm applies a set of thresholds to evaluate the significance of the rules, and the rules that pass these thresholds are reported in the solution. Significant patterns provide knowledge of one or more product failures that lead to future product fault(s). The effectiveness of the algorithm is illustrated with the warranty data mining application from the automotive industry.

Patterns in Alarm Data

Increasingly powerful fault management systems are required to ensure robustness and quality of service in today's networks. In this context, event correlation is of prime importance to extract meaningful information from the wealth of alarm data generated by the network. Existing sequential data mining techniques address the task of identifying possible correlations in sequences of alarms. The output sequence sets, however, may contain sequences which are not plausible from the point of view of network topology constraints. (Devitt, Duffin, & Moloney, 2005) present the Topographical Proximity (TP) approach, which exploits topographical information embedded in alarm data in order to address this lack of plausibility in mined sequences. Their approach is based on an Apriori approach and introduces a novel criterion for sequence selection which evaluates sequence plausibility and coherence in the context of network topology. Connections are inferred at run-time between pairs of alarm generating nodes in the data and a Topographical Proximity (TP) measure is assigned based on the strength of the inferred connection. The TP measure is used to reject or promote candidate sequences on the basis of their plausibility, i.e. the strength of their connection, thereby reducing the candidate sequence set and optimizing the space and time constraints of the data mining process.

Patterns for Personalized Recommendation System

(Romero, Ventura, Delgado, & Bra, 2007) describe a personalized recommender system that uses web mining techniques for recommending to a student
which (next) links to visit within an adaptable educational hypermedia system. They present a specific mining tool and a recommender engine that helps the teacher to carry out the whole web mining process. The overall process of Web personalization based on Web usage mining generally consists of three phases: data preparation, pattern discovery and recommendation. The first two phases are performed off-line and the last phase on-line. To make recommendations to a student, the system first classifies the new student into one of the groups of students (clusters). Then, it only uses the sequential patterns of the corresponding group to personalize the recommendations based on other similar students and his current navigation. Grouping of students is done using k-means. They use GSP to get frequent sequences for each of the clusters. They mine rules of the form readme⇒install, welcome⇒install, which are intuitively quite common patterns for websites.

Patterns in Atmospheric Aerosol Data

EDAM (Exploratory Data Analysis and Management) is a joint project between researchers in Atmospheric Chemistry and Computer Science at Carleton College and the University of Wisconsin-Madison that aims to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. (Ramakrishnan, et al., 2005) experiment with a dataset consisting of samples from an aerosol time-of-flight mass spectrometer (ATOFMS).

A mass spectrum is a plot of signal intensity (often normalized to the largest peak in the spectrum) versus the mass-to-charge (m/z) ratio of the detected ions. Thus, the presence of a peak indicates the presence of one or more ions containing the m/z value indicated, within the ion cloud generated upon the interaction between the particle and the laser beam. In many cases, the ATOFMS generates elemental ions. Thus, the presence of certain peaks indicates that elements such as Na+ (m/z = +23) or Fe+ (m/z = +56) or O- (m/z = -16) ions are present. In other cases, cluster ions are formed, and thus the m/z observed represents that of a sum of the atomic weights of various elements.

For many kinds of analysis, what is significant in each particle's mass spectrum is the composition of the particle, i.e., the ions identified by the peak labels (and, ideally, their proportions in the particle, and our confidence in having correctly identified them). While this representation is less detailed than the labeled spectrum itself, it allows us to think of the ATOFMS data stream as a time-series of observations, one per observed particle, where each observation is a set of ions (possibly labeled with some additional details). This is precisely the market-basket abstraction used in e-commerce: a time-series of customer transactions, each recording the items purchased by a customer on a single visit to a store. This analogy opens the door to applying a wide range of association rule and sequential pattern algorithms to the analysis of mass spectrometry data. Once these patterns are mined, they can be used to extrapolate to periods where filter-based samples were not collected.

Patterns in Individuals' Time Diaries

Identifying patterns of activities within individuals' time diaries and studying similarities and deviations between individuals in a population
is of interest in time use research. So far, activity patterns in a population have mostly been studied either by visual inspection, searching for occurrences of specific activity sequences and studying their distribution in the population, or statistical methods such as time series analysis in order to analyze daily behavior. (Vrotsou, Ellegård, & Cooper) describe a new approach for extracting activity patterns from time diaries that uses sequential data mining techniques. They have implemented an algorithm that searches the time diaries and automatically extracts all activity patterns meeting user-defined criteria of what constitutes a valid pattern of interest. Amongst the many criteria which can be applied are: a time window containing the pattern, and minimum and maximum number of people that perform the pattern. The extracted activity patterns can then be interactively filtered, visualized and analyzed to reveal interesting insights using the VISUAL-TimePAcTS application. To demonstrate the value of this approach they consider and discuss sequential activity patterns at a population level, from a single day perspective, with focus on the activity "paid work" and some activities surrounding it.

An activity pattern in this paper is defined as a sequence of activities performed by an individual which by itself or together with other activities, aims at accomplishing a more general goal/project. When analyzing a single day of diary data, activity patterns identified in a single individual (referred to as an individual activity pattern) are unlikely to be significant but those found amongst a group or population (a collective activity pattern) are of greater interest. Seven categories of activities that they consider are: care for oneself, care for others, household care, recreation/reflection, travel, prepare/procure food, work/school. {"cook dinner"; "eat dinner"; "wash dishes"} is a typical pattern. They also incorporate a variety of constraints like min and max pattern duration, min and max gap between activities, min and max number of occurrences of the pattern and min and max number of people (or a percentage of the population) that should be performing the pattern. The sequential mining algorithm that they have used for the activity pattern extraction is an "AprioriAll" algorithm which is adapted to the time diary data.

Two stage classification using patterns: (Exarchos, Tsipouras, Papaloukas, & Fotiadis, 2008) present a methodology for sequence classification, which employs sequential pattern mining and optimization, in a two-stage process. In the first stage, a sequence classification model is defined, based on a set of sequential patterns and two sets of weights are introduced, one for the patterns and one for classes. In the second stage, an optimization technique is employed to estimate the weight values and achieve optimal classification accuracy. Extensive evaluation of the methodology is carried out, by varying the number of sequences, the number of patterns and the number of classes and it is compared with similar sequence classification approaches.

CONCLUSION

We presented selected applications of the sequential pattern mining methods in the fields of healthcare, education, web usage mining, text mining, bioinformatics, telecommunications, intrusion detection, etc. We envision that the power of sequential mining methods has not yet been fully exploited. We hope to see many more strong applications of these methods in a variety of domains in the years to come.

REFERENCES

Berendt, B. A. (2000). Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9(1), 56–75. doi:10.1007/s007780050083
Buchner, A. G., & Mulvenna, M. D. (1998). Discovering Internet marketing intelligence through online analytical web usage mining. SIGMOD Record, 27(4), 54–61. doi:10.1145/306101.306124

Buddhakulsomsiri, J., & Zakarian, A. (2009). Sequential pattern mining algorithm for automotive warranty data. Journal of Computers and Industrial Engineering, 57(1), 137–147. doi:10.1016/j.cie.2008.11.006

Chen, Y.-L., & Huang, T. C.-K. (2008). A novel knowledge discovering model for mining fuzzy multi-level sequential patterns in sequence databases. Data & Knowledge Engineering, 66(3), 349–367. doi:10.1016/j.datak.2008.04.005

Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1), 5–32.

Devitt, A., Duffin, J., & Moloney, R. (2005). Topographical proximity for mining network alarm data. MineNet '05: Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data (pp. 179-184). Philadelphia, PA: ACM.

Eichinger, F., Nauck, D. D., & Klawonn, F. (n.d.). Sequence mining for customer behaviour predictions in telecommunications.

Exarchos, T. P., Papaloukas, C., Lampros, C., & Fotiadis, D. I. (2008). Mining sequential patterns for protein fold recognition. Journal of Biomedical Informatics, 41(1), 165–179. doi:10.1016/j.jbi.2007.05.004

Exarchos, T. P., Tsipouras, M. G., Papaloukas, C., & Fotiadis, D. I. (2008). A two-stage methodology for sequence classification based on sequential pattern mining and optimization. Data & Knowledge Engineering, 66(3), 467–487. doi:10.1016/j.datak.2008.05.007

Ferreira, P. G., & Azevedo, P. J. (2005). Protein sequence classification through relevant sequence mining and bayes classifiers. Proc. 12th Portuguese Conference on Artificial Intelligence (EPIA) (pp. 236-247). Springer-Verlag.

Garboni, C., Masseglia, F., & Trousse, B. (2005). Sequential pattern mining for structure-based XML document classification. Workshop of the INitiative for the Evaluation of XML Retrieval.

Guan, J. W., Liu, D., & Bell, D. A. (2004). Discovering motifs in DNA sequences. Fundam. Inform., 59(2-3), 119–134.

Icev, A. (2003). Distance-enhanced association rules for gene expression. BIOKDD'03, in conjunction with ACM SIGKDD.

Ishio, T., Date, H., Miyake, T., & Inoue, K. (2008). Mining coding patterns to detect crosscutting concerns in Java programs. WCRE '08: Proceedings of the 2008 15th Working Conference on Reverse Engineering (pp. 123-132). Washington, DC: IEEE Computer Society.

Jaillet, S., Laurent, A., & Teisseire, M. (2006). Sequential patterns for text categorization. Intelligent Data Analysis, 10(3), 199–214.

Kay, J., Maisonneuve, N., Yacef, K., & Zaïane, O. (2006). Mining patterns of events in students' teamwork data. In Educational Data Mining Workshop, held in conjunction with Intelligent Tutoring Systems (ITS), (pp. 45-52).

Kum, H.-C., Chang, J. H., & Wang, W. (2006). Sequential pattern mining in multi-databases via multiple alignment. Data Mining and Knowledge Discovery, 12(2-3), 151–180. doi:10.1007/s10618-005-0017-3

Kum, H.-C., Chang, J. H., & Wang, W. (2007). Benchmarking the effectiveness of sequential pattern mining methods. Data & Knowledge Engineering, 60(1), 30–50. doi:10.1016/j.datak.2006.01.004
Kuo, R. J., Chao, C. M., & Liu, C. Y. (2009). Integration of K-means algorithm and AprioriSome algorithm for fuzzy sequential pattern mining. Applied Soft Computing, 9(1), 85–93. doi:10.1016/j.asoc.2008.03.010

Lau, A., Ong, S. S., Mahidadia, A., Hoffmann, A., Westbrook, J., & Zrimec, T. (2003). Mining patterns of dyspepsia symptoms across time points using constraint association rules. PAKDD'03: Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining (pp. 124-135). Seoul, Korea: Springer-Verlag.

Laur, P.-A., Symphor, J.-E., Nock, R., & Poncelet, P. (2007). Statistical supports for mining sequential patterns and improving the incremental update process on data streams. Intelligent Data Analysis, 11(1), 29–47.

Lent, B., Agrawal, R., & Srikant, R. (1997). Discovering trends in text databases. Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, KDD (pp. 227-230). AAAI Press.

Li, Z., Zhang, A., Li, D., & Wang, L. (2007). Discovering novel multistage attack strategies. ADMA '07: Proceedings of the 3rd international conference on Advanced Data Mining and Applications (pp. 45-56). Harbin, China: Springer-Verlag.

Lin, N. P., Chen, H.-J., Hao, W.-H., Chueh, H.-E., & Chang, C.-I. (2008). Mining strong positive and negative sequential patterns. W. Trans. on Comp., 7(3), 119–124.

Mannila, H., Toivonen, H., & Verkamo, I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 259–289. doi:10.1023/A:1009748302351

Masseglia, F., Poncelet, P., & Teisseire, M. (2003). Incremental mining of sequential patterns in large databases. Data & Knowledge Engineering, 46(1), 97–121. doi:10.1016/S0169-023X(02)00209-4

Masseglia, F., Poncelet, P., & Teisseire, M. (2009). Efficient mining of sequential patterns with time constraints: Reducing the combinations. Expert Systems with Applications, 36(2), 2677–2690. doi:10.1016/j.eswa.2008.01.021

Mendes, L. F., Ding, B., & Han, J. (2008). Stream sequential pattern mining with precise error bounds. Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Italy, Dec. 2008.

Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002). Using sequential and non-sequential patterns in predictive Web usage mining tasks. ICDM '02: Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 669-672). Washington, DC: IEEE Computer Society.

Nicolas, J. A., Herengt, G., & Albuisson, E. (2004). Sequential pattern mining and classification of patient path. MEDINFO 2004: Proceedings of the 11th World Congress on Medical Informatics.

Parthasarathy, S., Zaki, M., Ogihara, M., & Dwarkadas, S. (1999). Incremental and interactive sequence mining. In Proc. of the 8th Int. Conf. on Information and Knowledge Management (CIKM'99).

Perera, D., Kay, J., Yacef, K., & Koprinska, I. (2007). Mining learners' traces from an online collaboration tool. Proceedings of Educational Data Mining workshop (pp. 60–69). Marina del Rey, CA, USA.

Pinto, H., Han, J., Pei, J., Wang, K., Chen, Q., & Dayal, U. (2001). Multi-dimensional sequential pattern mining. CIKM '01: Proceedings of the Tenth International Conference on Information and Knowledge Management (pp. 81-88). New York, NY: ACM.

Potter, C., Klooster, S., Torregrosa, A., Tan, P.-N., Steinbach, M., & Kumar, V. (n.d.). Finding spatio-temporal patterns in earth science data.
Ramakrishnan, R., Schauer, J. J., Chen, L., Huang, Z., Shafer, M. M., & Gross, D. S. (2005). The EDAM project: Mining atmospheric aerosol datasets: Research articles. International Journal of Intelligent Systems, 20(7), 759–787. doi:10.1002/int.20094

Romero, C., Ventura, S., Delgado, J. A., & Bra, P. D. (2007). Personalized links recommendation based on data mining in adaptive educational hypermedia systems. Creating New Learning Experiences on a Global Scale. Second European Conference on Technology Enhanced Learning, EC-TEL 2007 (pp. 293-305). Crete, Greece: Springer.

Seno, M., & Karypis, G. (2002). SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), (pp. 418-425).

Srikant, R., & Agrawal, R. (1996). ... Advances in Database Technology EDBT, 96, 3–17.

Terai, G., & Takagi, T. (2004). Predicting rules on organization of cis-regulatory elements, taking the order of elements into account. Bioinformatics (Oxford, England), 20(7), 1119–1128. doi:10.1093/bioinformatics/bth049

Tsuboi, Y. (2002). Authorship identification for heterogeneous documents.

Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM Systems Journal, 41(3), 461–474. doi:10.1147/sj.413.0461

Vrotsou, K., Ellegård, K., & Cooper, M. (n.d.). Exploring time diaries using semi-automated activity pattern extraction.

Vu, T. H., Ryu, K. H., & Park, N. (2009). A method for predicting future location of mobile user for location-based services system. Computers & Industrial Engineering, 57(1), 91–105. doi:10.1016/j.cie.2008.07.009

Wang, J. L., Chirn, G., Marr, T., Shapiro, B., Shasha, D., & Zhang, K. (1994). Combinatorial pattern discovery for scientific data: Some preliminary results. Proc. ACM SIGMOD Int'l Conf. Management of Data, (pp. 115-125).

Wang, K., Xu, Y., & Yu, J. X. (2004). Scalable sequential pattern mining for biological sequences. CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (pp. 178-187). Washington, DC: ACM.

Wang, M., Shang, X.-Q., & Li, Z.-H. (2008). Sequential pattern mining for protein function prediction. ADMA '08: Proceedings of the 4th International Conference on Advanced Data Mining and Applications (pp. 652-658). Chengdu, China: Springer-Verlag.

Wang, Y., Lim, E.-P., & Hwang, S.-Y. (2006). Efficient mining of group patterns from user movement data. Data & Knowledge Engineering, 57(3), 240–282. doi:10.1016/j.datak.2005.04.006

Wong, P. C., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. Proc. IEEE Information Visualization, 2000 (pp. 105-114). Society Press.

Wuu, L.-C., Hung, C.-H., & Chen, S.-F. (2007). Building intrusion pattern miner for Snort network intrusion detection system. Journal of Systems and Software, 80(10), 1699–1715. doi:10.1016/j.jss.2006.12.546

Xing, Z., Pei, J., & Keogh, E. (2010). A brief survey on sequence classification. SIGKDD Explorations Newsletter, 12(1), 40–48. doi:10.1145/1882471.1882478

Yun, C. H., & Chen, M. S. (2007). Mining mobile sequential patterns in a mobile commerce environment. IEEE Transactions on Systems, Man, and Cybernetics, 278–295.
Yun, U. (2008). A new framework for detecting weighted sequential patterns in large sequence databases. Knowledge-Based Systems, 21(2), 110–122. doi:10.1016/j.knosys.2007.04.002

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2), 31–60. doi:10.1023/A:1007652502315

Zaki, M. J., Lesh, N., & Mitsunori, O. (1999). PlanMine: Predicting plan failures using sequence mining. Artificial Intelligence Review, 14(6), 421–446. doi:10.1023/A:1006612804250

ADDITIONAL READING

Adamo, J.-M. (2001). Data Mining for Association Rules and Sequential Patterns: Sequential and Parallel Algorithms. Secaucus, NJ, USA: Springer-Verlag New York, Inc. doi:10.1007/978-1-4613-0085-4

Alves, R., Rodriguez-Baena, D. S., & Aguilar-Ruiz, J. S. (2009). Gene association analysis: A survey of frequent pattern mining from gene expression data. Briefings in Bioinformatics, 210–224.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). Morgan Kaufmann Publishers.

Li, T.-R., Xu, Y., Ruan, D., & Pan, W.-m. Sequential pattern mining. In R. Da, G. Chen, E. E. Kerre, & G. Wets, Intelligent data mining: Techniques and applications (pp. 103-122). Springer.

Lu, J., Adjei, O., Chen, W., Hussain, F., & Enachescu, C. (n.d.). Sequential Patterns Mining.

Srinivasa, R. N. (2005). Data mining in e-commerce: A survey. Sadhana, 275–289. doi:10.1007/BF02706248

Teisseire, M., Poncelet, P., Scientifique, P., Besse, G., Masseglia, F., & Masseglia, F. (2005). Sequential pattern mining: A survey on issues and approaches. Encyclopedia of Data Warehousing and Mining, Information Science Publishing (pp. 3–29). Oxford University Press.

Yang, L. (2003). Visualizing frequent itemsets, association rules, and sequential patterns in parallel coordinates. ICCSA'03: Proceedings of the 2003 international conference on Computational science and its applications (pp. 21-30). Montreal, Canada: Springer-Verlag.

Zhao, Q., & Bhowmick, S. S. (2003). Sequential Pattern Matching: A Survey.
Chapter 2
A Review of Kernel Methods
Based Approaches to
Classification and Clustering
of Sequential Patterns, Part I:
Sequences of Continuous Feature Vectors
Dileep A. D.
Indian Institute of Technology, India
Veena T.
Indian Institute of Technology, India
C. Chandra Sekhar
Indian Institute of Technology, India
ABSTRACT
Sequential data mining involves analysis of sequential patterns of varying length. Sequential pattern analysis is important for pattern discovery from sequences of discrete symbols as in bioinformatics and text analysis, and from sequences or sets of continuous valued feature vectors as in processing of audio, speech, music, image, and video data. Pattern analysis techniques using kernel methods have been explored for static patterns as well as sequential patterns. The main issue in sequential pattern analysis using kernel methods is the design of a suitable kernel for sequential patterns of varying length. Kernel functions designed for sequential patterns are known as dynamic kernels. In this chapter, we present a brief description of kernel methods for pattern classification and clustering. Then we describe dynamic kernels for sequences of continuous feature vectors. We then present a review of approaches to sequential pattern classification and clustering using dynamic kernels.
DOI: 10.4018/978-1-61350-056-9.ch002
methods for pattern analysis. The SVM based approach to pattern classification and kernel based approaches to pattern clustering are presented in this section. Then the design of dynamic kernels for sequential patterns is presented in the third section. This section also describes the dynamic kernels for continuous feature vector sequences. Finally, we present a review of kernel methods based approaches to sequential pattern analysis.

KERNEL METHODS FOR PATTERN ANALYSIS

In this section we describe different approaches using kernel methods for pattern analysis. We first describe the support vector machines (SVMs) for pattern classification, and then present the kernel K-means clustering and support vector clustering methods for pattern clustering.

Support Vector Machines for Pattern Classification

The SVM (Burges, 1998; Cristianini & Shawe-Taylor, 2000; Sekhar et al., 2003) is a linear two-class classifier. An SVM constructs the maximum margin hyperplane (optimal hyperplane) as a decision surface to separate the data points of two classes. The margin of a hyperplane is defined as the minimum distance of training points from the hyperplane. We first discuss the construction of an optimal hyperplane for linearly separable classes. Then we discuss the construction of an optimal hyperplane for linearly nonseparable classes, i.e., when some training examples of the classes cannot be classified correctly. Finally, we discuss building an SVM for nonlinearly separable classes by constructing an optimal hyperplane in a high dimensional feature space corresponding to a nonlinear transformation induced by a kernel function.

Optimal Hyperplane for Linearly Separable Classes

Suppose the training data set consists of L examples {x_i, y_i}_{i=1}^{L}, x_i ∈ R^d and y_i ∈ {+1, −1}, where xi is the ith training example and yi is the corresponding class label. Figure 1 illustrates the construction of an optimal separating hyperplane for linearly separable classes in the two-dimensional input space of x.

A hyperplane is specified as w^t x + b = 0, where w is the parameter vector and b is the bias. A separating hyperplane that separates the data points of two linearly separable classes satisfies the following constraints:

y_i(w^t x_i + b) > 0 for i = 1, 2, …, L   (1)

The distance between the nearest example and the separating hyperplane, called the margin, is given by 1/||w||. The problem of finding the optimal separating hyperplane that maximizes the margin is the same as the problem of minimizing the Euclidean norm of the parameter vector w. For reducing the search space of w, the constraints that the optimal separating hyperplane must satisfy are specified as follows:

y_i(w^t x_i + b) ≥ 1 for i = 1, 2, …, L   (2)

Figure 1. Illustration of constructing the optimal hyperplane for linearly separable classes
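The margin definition can be checked with a small numerical example (the data values below are made up for illustration): under the canonical scaling in (2), where the nearest example has y_i(w^t x_i + b) = 1, the geometric distance of that nearest example from the hyperplane equals 1/||w||.

import numpy as np

# Toy data satisfying the canonical constraints (2) (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.25, 0.25])   # a separating hyperplane w^t x + b = 0
b = 0.0

functional = y * (X @ w + b)
assert np.all(functional >= 1.0)                       # constraints in (2)
margin = 1.0 / np.linalg.norm(w)                       # margin as defined above
nearest = np.min(np.abs(X @ w + b) / np.linalg.norm(w))
print(margin, nearest)                                 # both equal 2 * sqrt(2)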
The learning problem of finding the optimal separating hyperplane is a constrained optimization problem stated as follows: Given the training data set, find the values of w and b such that they satisfy the constraints in (2) and the parameter vector w minimizes the following cost function:

J(w) = (1/2) ||w||²   (3)

The constrained optimization problem is solved using the method of Lagrange multipliers. The primal form of the Lagrangian objective function is given by

L_p(w, b, α) = (1/2) ||w||² − Σ_{i=1}^{L} α_i [y_i(w^t x_i + b) − 1]   (4)

where the non-negative variables αi are called Lagrange multipliers. The saddle point of the Lagrangian objective function provides the solution for the optimization problem. The solution is determined by first minimizing the Lagrangian objective function with respect to w and b, and then maximizing with respect to α. The two conditions of optimality due to minimization are

∂L_p(w, b, α) / ∂w = 0   (5)

∂L_p(w, b, α) / ∂b = 0   (6)

These conditions yield

w = Σ_{i=1}^{L} α_i y_i x_i   (7)

Σ_{i=1}^{L} α_i y_i = 0   (8)

Substituting the expression for w from (7) in (4) and using the condition in (8), the dual form of the Lagrangian objective function can be derived as a function of the Lagrange multipliers α, as follows:

L_d(α) = Σ_{i=1}^{L} α_i − (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j y_i y_j x_i^t x_j   (9)

The optimum values of the Lagrange multipliers are determined by maximizing the objective function Ld(α) subject to the following constraints:

Σ_{i=1}^{L} α_i y_i = 0   (10)

α_i ≥ 0 for i = 1, 2, …, L   (11)

This optimization problem is solved using quadratic programming methods (Kaufman, 1999). The data points for which the values of the optimum Lagrange multipliers are not zero are the support vectors. For these data points the distance to the optimal hyperplane is minimum. Hence, the support vectors are the training data points that lie on the margin, as illustrated in Figure 1. For the optimum Lagrange multipliers {α_j*}_{j=1}^{Ls}, the optimum parameter vector w* is given by

w* = Σ_{j=1}^{Ls} α_j* y_j x_j   (12)
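The dual problem (9)-(11) is what standard SVM packages solve internally. As a hedged illustration using scikit-learn (a library not referenced in the chapter, with made-up data values), the optimum Lagrange multipliers of a fitted model can be read off and the parameter vector w* reconstructed exactly as in (12); the discriminant function given next can then be evaluated directly.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6)   # a large C approximates the hard-margin problem
svm.fit(X, y)

# dual_coef_ stores the products alpha_j* y_j for the support vectors x_j,
# so w* is obtained exactly as in (12).
w_star = (svm.dual_coef_ @ svm.support_vectors_).ravel()
b_star = float(svm.intercept_[0])
assert np.allclose(w_star, svm.coef_.ravel())

# The sign of w*^t x + b* gives the predicted class of a test point.
x_test = np.array([1.0, 1.0])
print(np.sign(w_star @ x_test + b_star), svm.predict([x_test])[0])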
D(x) = w*^t x + b* = Σ_{j=1}^{Ls} α_j* y_j x^t x_j + b*   (13)

where b* is the optimum bias.

However, the data for most of the real world tasks are not linearly separable. Next we present a method to construct an optimal hyperplane for linearly non-separable classes.

Optimal Hyperplane for Linearly Non-Separable Classes

The training data points of the linearly non-

The slack variable ξi is a measure of the deviation of a data point xi from the ideal condition of separability. For 0 ≤ ξi ≤ 1, the data point falls inside the region of separation, but on the correct side of the separating hyperplane. For ξi > 1, the data point falls on the wrong side of the separating hyperplane. The support vectors are those particular data points that satisfy the constraint in (14) with the equality sign. The cost function for linearly non-separable classes is given as

J(w, ξ) = (1/2) ||w||² + C Σ_{i=1}^{L} ξ_i   (15)

0 ≤ α_i ≤ C for i = 1, 2, …, L   (18)

w* = Σ_{j=1}^{Ls} α_j* y_j x_j   (19)
where Ls is the number of support vectors. The discriminant function of the optimal hyperplane for an input vector x is given by

D(x) = w*^t x + b* = Σ_{j=1}^{Ls} α_j* y_j x^t x_j + b*   (20)

where b* is the optimum bias.

Support Vector Machine for Nonlinearly Separable Classes

mapped onto three-dimensional feature vectors Φ(x_i) = [x_i1², x_i2², √2 x_i1 x_i2]^t, i = 1, 2, …, L, where they are linearly separable.

For the construction of the optimal hyperplane in the high dimensional feature space Φ(x), the dual form of the Lagrangian objective function in (16) takes the following form:

L_d(α) = Σ_{i=1}^{L} α_i − (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j y_i y_j Φ(x_i)^t Φ(x_j)   (21)

Figure 3. Illustration of nonlinear transformation used in building an SVM for nonlinearly separable classes
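The three-dimensional expansion in this example can be checked numerically: with Φ(x) = [x1², x2², √2 x1 x2]^t, the feature-space inner product Φ(x_i)^t Φ(x_j) needed in (21) equals (x_i^t x_j)², so it can be supplied by a kernel function without forming Φ(x) explicitly. A minimal sketch (the data values are made up):

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-dimensional input vector.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

xi = np.array([0.8, -1.2])
xj = np.array([-0.5, 0.3])

lhs = phi(xi) @ phi(xj)   # inner product in the feature space, as used in (21)
rhs = (xi @ xj) ** 2      # homogeneous polynomial kernel of degree 2
assert np.isclose(lhs, rhs)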
D(x) = w*^t Φ(x) + b* = Σ_{j=1}^{Ls} α_j* y_j Φ(x)^t Φ(x_j) + b*   (25)

L_d(α) = Σ_{i=1}^{L} α_i − (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j y_i y_j K(x_i, x_j)   (28)

Figure 4. Architecture of a support vector machine for two-class pattern classification. The class of the input pattern x is given by the sign of the discriminant function D(x). The number of hidden nodes corresponds to the number of support vectors Ls. Each hidden node computes the innerproduct kernel function K(x, xi) on the input pattern x and a support vector xi.
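The text that follows refers to the polynomial, sigmoidal and Gaussian kernels; their defining equations are not reproduced above, so the sketch below uses the standard forms, with p the polynomial degree and δ (delta) a width parameter. The exact parameterization used by the chapter may differ, and the discriminant routine simply mirrors the network of Figure 4, with alpha_y holding the products α_j* y_j.

import numpy as np

def polynomial_kernel(x, z, a=1.0, c=1.0, p=2):
    # K(x, z) = (a x^t z + c)^p
    return (a * np.dot(x, z) + c) ** p

def sigmoidal_kernel(x, z, a=1.0, c=0.0):
    # K(x, z) = tanh(a x^t z + c)
    return np.tanh(a * np.dot(x, z) + c)

def gaussian_kernel(x, z, delta=0.5):
    # K(x, z) = exp(-delta * ||x - z||^2)
    return np.exp(-delta * np.sum((x - z) ** 2))

def discriminant(x, support_vectors, alpha_y, b, kernel):
    # D(x) = sum_j alpha_j* y_j K(x, x_j) + b*, the quantity computed by the
    # architecture in Figure 4; the class of x is given by its sign.
    return sum(ay * kernel(x, xj) for ay, xj in zip(alpha_y, support_vectors)) + b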
degree of the polynomial and δ is a nonnegative constant used for numerical stability in the Gaussian kernel function. The dimensionality of the feature space is (p+d)!/(p! d!) for the polynomial kernel (Cristianini & Shawe-Taylor, 2000). The feature spaces for the sigmoidal and Gaussian kernels are of infinite dimension. The kernel functions involve computations in the d-dimensional input space and avoid the innerproduct operations in the high dimensional feature space.

The best choice of the kernel function for a given pattern classification problem is still a research issue (Burges, 1998). The suitable kernel function and its parameters are chosen empirically. The complexity of a two-class support vector machine is a function of the number of support vectors (Ls) determined during its training. Multiclass pattern classification problems are generally solved using a combination of two-class SVMs. Therefore, the complexity of a multiclass pattern classification system depends on the number of SVMs and the complexity of each SVM used. In the next subsection, we present the commonly used approaches to multiclass pattern classification using SVMs.

Multiclass Pattern Classification Using SVMs

Support vector machines are originally designed for two-class pattern classification. Multiclass pattern classification problems are commonly solved using a combination of two-class SVMs and a decision strategy to decide the class of the input pattern (Allwein et al., 2001). Each SVM has the architecture given in Figure 4 and is trained independently. Now we present the two approaches to decomposition of the learning problem in multiclass pattern classification into several two-class learning problems so that a combination of SVMs can be used. The training data set {(xi, ci)} consists of L examples belonging to T classes. The class label ci ∈ {1, 2, ..., T}. For the sake of simplicity, we assume that the number of examples for each class is the same, i.e., Lt = L/T.

One-Against-the-Rest Approach

In this approach, an SVM is constructed for each class by discriminating that class against the remaining (T-1) classes. The classification system based on this approach consists of T SVMs. All the L training examples are used in constructing an SVM for each class. In constructing the SVM for the class t, the desired output yi for a training example xi is specified as follows:

y_i = +1 if c_i = t, and y_i = −1 if c_i ≠ t   (30)

The examples with the desired output yi = +1 are called positive examples. The examples with the desired output yi = −1 are called negative examples. An optimal hyperplane is constructed to separate Lt positive examples from L(T-1)/T negative examples. The much larger number of negative examples leads to an imbalance, resulting in the dominance of negative examples in determining the decision boundary (Kressel & Ulrich, 1999). The extent of imbalance increases with the number of classes and is significantly high when the number of classes is large. A test pattern x is classified by using the winner-takes-all strategy that uses the following decision rule:

Class label for x = arg max_t D_t(x)   (31)

where Dt(x) is the discriminant function of the SVM constructed for the class t.

One-Against-One Approach

In this approach, an SVM is constructed for every pair of classes by training it to discriminate the two classes. The number of SVMs used in this approach is T(T-1)/2. An SVM for a pair of
classes s and t is constructed using 2Lt training examples belonging to the two classes only. The desired output yi for a training example xi is specified as follows:

y_i = +1 if c_i = s, and y_i = −1 if c_i = t   (32)

The small size of the set of training examples and the balance between the number of positive and negative examples lead to a simple optimization problem to be solved in constructing an SVM for a pair of classes. When the number of classes is large, the proliferation of SVMs leads to a complex classification system.

The maxwins strategy is commonly used to determine the class of a test pattern x in this approach. In this strategy, a majority voting scheme is used. If Dst(x), the value of the discriminant function of the SVM for the pair of classes s and t, is positive, then the class s wins a vote. Otherwise, the class t wins a vote. Outputs of SVMs are used to determine the number of votes won by each class. The class with the maximum number of votes is assigned to the test pattern. When there are multiple classes with the same maximum number of votes, the class with the maximum value of the total magnitude of discriminant functions (TMDF) is assigned. The total magnitude of discriminant functions for the class s is defined as follows:

TMDF = Σ_t D_st(x)   (33)

recognition and verification, and speech emotion recognition.

Kernel Methods for Pattern Clustering

In this subsection we describe the kernel K-means clustering and support vector clustering methods for clustering in the kernel feature space.

Kernel K-means Clustering

The commonly used K-means clustering method gives a linear separation of data, as illustrated in Figure 5, and is not suitable for separation of nonlinearly separable data. In this subsection, the criterion for partitioning the data into clusters in the input space using the K-means clustering algorithm is first presented. Clustering in the kernel feature space is then realised using the K-means clustering algorithm (Girolami, 2002; Satish, 2005).

Consider a set of L data points in the input space, {x_i}_{i=1}^{L}, x_i ∈ R^d. Let the number of clusters to be formed be Q. The criterion used by the K-means clustering method in the input space for grouping the data into Q clusters is to minimize the trace of the within-cluster scatter matrix, Sw, defined as follows (Girolami, 2002):

S_w = (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} (x_i − μ_q)(x_i − μ_q)^t   (34)
Figure 5. Illustration of K-means clustering in input space. (a) Scatter plot of the data in clusters separable by a circular shaped curve in a 2-dimensional space. Inner cluster belongs to cluster 1 and the outer cluster belongs to cluster 2. (b) Linear separation of data obtained using K-means clustering in the input space.

The center of the cluster Cq is given as μq defined by

μ_q = (1/L_q) Σ_{i=1}^{L} z_{qi} x_i   (36)

The optimal clustering of the data points involves determining the Q × L indicator matrix, Z, with the elements as zqi, that minimizes the trace of the matrix Sw. This method is used in the K-means clustering algorithm for linear separation of the clusters. For nonlinear separation of clusters of data points, the input space is transformed into a high dimensional feature space using a smooth and continuous nonlinear mapping, Φ, and the clusters are formed in the feature space. The optimal partitioning in the feature space is based on the criterion of minimizing the trace of the within-cluster scatter matrix in the feature space, S_w^Φ. The feature space scatter matrix is given by

S_w^Φ = (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} (Φ(x_i) − μ_q^Φ)(Φ(x_i) − μ_q^Φ)^t   (37)

where μ_q^Φ, the center of the qth cluster in the feature space, is given by

μ_q^Φ = (1/L_q) Σ_{i=1}^{L} z_{qi} Φ(x_i)   (38)

The trace of the scatter matrix S_w^Φ can be computed using the innerproduct operations as given below:

Tr(S_w^Φ) = (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} (Φ(x_i) − μ_q^Φ)^t (Φ(x_i) − μ_q^Φ)   (39)

When the feature space is explicitly represented, as in the case of mapping using polynomial kernels, the K-means clustering algorithm can be used to minimise the trace given in the above equation. However, for Mercer kernels such as Gaussian kernels with implicit mapping used for transformation, it is necessary to express the trace in terms of the kernel function. The Mercer kernel function in the input space corresponds to the inner-product operation in the feature space,
i.e., K_ij = K(x_i, x_j) = Φ(x_i)^t Φ(x_j). The trace of S_w^Φ can be rewritten as

Tr(S_w^Φ) = (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} K_ii − (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} (1/L_q) Σ_{j=1}^{L} z_{qj} K_ij
          = (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} [K_ii − (1/L_q) Σ_{j=1}^{L} z_{qj} K_ij]
          = (1/L) Σ_{q=1}^{Q} Σ_{i=1}^{L} z_{qi} D_qi   (40)

where

D_qi = K_ii − (1/L_q) Σ_{j=1}^{L} z_{qj} K_ij   (41)

The term Dqi is the penalty associated with assigning xi to the qth cluster in the feature space. For explicit mapping kernels such as the polynomial kernel function, the feature space representation is explicitly known. The polynomial kernel is given by K(x, xi) = (a x^t xi + c)^p, where a and c are constants, and p is the degree of the polynomial kernel. The vector Φ(x) in the feature space of the polynomial kernel corresponding to the input space vector x includes the monomials upto order p of elements in x. For a polynomial kernel, Dqi may take a negative value because the magnitude of Kij can be greater than that of Kii. To avoid Dqi taking negative values, Kij in the equation for Dqi is replaced with the normalized value K̂ij defined as

K̂_ij = K_ij / √(K_ii K_jj)   (42)

From the Cauchy-Schwarz inequality, K_ij ≤ √(K_ii K_jj). It follows that for the polynomial kernel K̂_ii = 1 and K̂_ij ≤ K̂_ii, and Dqi is defined as:

D_qi = K̂_ii − (1/L_q) Σ_{j=1}^{L} z_{qj} K̂_ij   (43)

For implicit mapping kernels such as the Gaussian kernel function, the explicit feature space representation is not known. A Gaussian kernel is defined as K(x, xi) = exp(−δ||x − xi||²), where δ is the kernel width parameter. For the Gaussian kernel, Dqi takes a nonnegative value because Kii = 1 and Kij ≤ Kii.

In the kernel K-means clustering, the optimization problem is to determine the indicator matrix Z* such that

Z* = arg min_Z Tr(S_w^Φ)   (44)

An iterative method for solving this optimization problem is given in (Girolami, 2002). The clusters obtained for the ring data using the kernel K-means clustering method are shown in Figure 6.

Figure 6. Nonlinear separation of data obtained using the kernel K-means clustering method for the ring data plotted in Figure 5(a).
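A compact sketch of an iterative kernel K-means of this kind is given below. It is a generic illustration, not the exact implementation of (Girolami, 2002): the assignment step uses the full feature-space distance of each point to each cluster center, computed only from the kernel matrix; this distance and the penalty Dqi of (41) differ by terms that cancel when summed over a cluster's members, so both lead to the same trace (40). The ring-like data at the end and the kernel width are made-up illustrative values.

import numpy as np

def kernel_kmeans(K, Q, n_iter=50, seed=0):
    """Cluster L points, given their L x L kernel (Gram) matrix K, into Q clusters."""
    L = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, Q, size=L)          # random initial assignment (indicator matrix Z)
    for _ in range(n_iter):
        D = np.zeros((L, Q))
        for q in range(Q):
            members = np.flatnonzero(labels == q)
            if members.size == 0:
                D[:, q] = np.inf
                continue
            Lq = members.size
            # ||Phi(x_i) - mu_q^Phi||^2 = K_ii - (2/Lq) sum_j z_qj K_ij
            #                             + (1/Lq^2) sum_j sum_l z_qj z_ql K_jl
            term2 = 2.0 * K[:, members].sum(axis=1) / Lq
            term3 = K[np.ix_(members, members)].sum() / (Lq ** 2)
            D[:, q] = np.diag(K) - term2 + term3
        new_labels = D.argmin(axis=1)            # reassign each point to the nearest center
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example on ring-like data with a Gaussian kernel K(x, z) = exp(-delta ||x - z||^2).
rng = np.random.default_rng(1)
angles = rng.uniform(0.0, 2.0 * np.pi, 60)
radii = np.concatenate([0.3 * np.ones(30), 1.5 * np.ones(30)]) + 0.05 * rng.standard_normal(60)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-2.0 * sq_dists)
print(kernel_kmeans(K, Q=2))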
||Φ(x_i) − a||² ≤ R², for i = 1, 2, …, L   (45)

where a is the center of the sphere. Soft constraints are incorporated by adding slack variables ζi as follows:

||Φ(x_i) − a||² ≤ R² + ζ_i, for i = 1, 2, …, L   (46)

with ζi ≥ 0. This constrained optimization problem

Σ_{i=1}^{L} α_i = 1   (52)

0 ≤ α_i ≤ C for i = 1, 2, …, L   (53)

The objective function in (51) can now be specified using the kernel function as follows:

L_d = Σ_{i=1}^{L} α_i K(x_i, x_i) − Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j K(x_i, x_j)
Let Z(x) be the distance of Φ(x) to the center of the sphere a, given by

Z²(x) = ||Φ(x) − a||²   (55)

From equations (55) and (49) we have

Z²(x) = K(x, x) − 2 Σ_{i=1}^{L} α_i K(x_i, x) + Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j K(x_i, x_j)   (56)

Then, the radius of the sphere R can be determined by computing Z(xi), where xi is an unbounded support vector.

The sphere in the feature space, when mapped back to the input space, leads to the formation of a set of contours which are interpreted as cluster boundaries. To identify the points that belong to different clusters, a geometric approach involving Z(x) and based on the following observation is used: Given a pair of data points that belong to different clusters, any path that connects them must exit from the sphere in feature space. Therefore, such a path contains a segment of points v such that Z(v) > R. This leads to the following definition of the adjacency Aij between a pair of points xi and xj with Φ(xi) and Φ(xj) being present in or on the sphere in feature space, shown in Box 1.

Box 1.

Clusters are now defined as the connected components of the graph induced by the adjacency matrix A. Bounded support vectors are unclassified by this procedure since their feature space images lie outside the enclosing sphere. One may decide either to leave them unclassified, or to assign them to the cluster that they are closest to.

In this section, we presented the kernel methods for classification and clustering of patterns. Though the methods are described for static patterns with each example represented as a vector in d-dimensional input space, these methods can also be used for patterns with each example represented as a non-vectorial type structure. However, it is necessary to design a Mercer kernel function for patterns represented using a non-vectorial type structure so that the kernel methods can be used for analysis of such patterns. Kernel functions have been proposed for different types of structured data such as strings, sets, texts, graphs, images and time series data. In the next section, we present dynamic kernels for sequential patterns represented as sequences of continuous feature vectors.

DESIGN OF DYNAMIC KERNELS FOR CONTINUOUS FEATURE VECTOR SEQUENCES

Continuous sequence data is represented in the form of a sequence of continuous feature vectors. Examples of continuous sequence data are speech data, handwritten character data, video data and time series data such as weather forecasting data, financial data, stock market data and network traffic data. Short-time spectral analysis of the speech signal of an utterance gives a sequence of continuous feature vectors. Short-time analysis of speech signal involves performing spectral analysis on each frame of about 20 milliseconds duration and representing each frame by a real valued feature vector. These feature vectors correspond to the observations. The speech signal of an
utterance with M number of frames is represented as X = x1 x2 ... xm ... xM, where xm is a vector of real valued observations for frame m. The duration of utterances belonging to a class varies from one utterance to another. Hence, the number of frames also differs from one utterance to another. This makes the number of observations vary. In tasks such as speech recognition, the duration of the data is short and there is a need to model the temporal dynamics and correlations among the features. This requires the sequence information present in the data to be preserved. In such cases, a speech utterance is represented as a sequence of feature vectors. On the other hand, in tasks such as speaker identification, spoken language identification, and speech emotion recognition, the duration of the data is long and preserving sequence information is not critical. In such cases, a speech signal is represented as a set of feature vectors. In the handwritten character data also, each character is represented as a sequence of feature vectors. In the video data, each video clip is considered as a sequence of frames and a frame may be considered as an image. Each image can be represented by a feature vector. Since the sequence information present among the adjacent frames is to be preserved, a video clip is represented as a sequence of feature vectors. An image can also be represented as a set of local feature vectors.

The main issue in designing a kernel for sequences of continuous feature vectors is to handle the varying length nature of sequences. Dynamic kernels for sequences of continuous feature vectors are designed in three ways. In the first approach, a sequence of feature vectors is mapped onto a vector in a fixed dimension feature space and a

for designing the kernel between two sequences of feature vectors (Boughorbel et al., 2005; Grauman & Darrell, 2007). In this section, we describe different dynamic kernels such as the generalized linear discriminant sequence kernel (Campbell et al., 2006a), the probabilistic sequence kernel (Lee et al., 2007), the Kullback-Leibler divergence based kernel (Moreno et al., 2004), the GMM supervector kernel (Campbell et al., 2006b), the Bhattacharyya distance based kernel (You et al., 2009a), the earth mover's distance kernel (Jing et al., 2003), the intermediate matching kernel (Boughorbel et al., 2005), and the pyramid match kernel (Grauman & Darrell, 2007) used for sequences or sets of continuous feature vectors.

Generalized Linear Discriminant Sequence Kernel

The generalized linear discriminant sequence (GLDS) kernel (Campbell et al., 2006a) uses an explicit expansion into a kernel feature space defined by the polynomials of degree p. Let X = x1 x2 ... xm ... xM, where xm ∈ R^d, be a set of M feature vectors. The GLDS kernel is derived by considering polynomials as the generalized linear discriminant functions (Campbell et al., 2002). A feature vector xm is represented in a higher dimensional space Ψ as a polynomial expansion Ψ(xm) = [ψ1(xm), ψ2(xm), ..., ψr(xm)]^t. The expansion Ψ(xm) includes all monomials of elements of xm upto and including degree p. The set of feature vectors X is represented as a fixed dimensional vector Φ(X), which is obtained as follows:

Φ(X) = (1/M) Σ_{m=1}^{M} Ψ(x_m)
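A sketch of this fixed-length representation: each frame is expanded into monomials up to degree p and the expansions are averaged over the sequence, so two sequences of different lengths map to vectors of the same dimension. The sketch uses scikit-learn's PolynomialFeatures purely for convenience (it also includes the constant monomial), and the normalization by a background correlation matrix used in the full GLDS kernel of Campbell et al. is omitted; the data values are made up.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def glds_vector(X, p=2):
    """Map a variable-length sequence X (an M x d array of feature vectors)
    to the fixed-dimensional average polynomial expansion Phi(X)."""
    expander = PolynomialFeatures(degree=p)   # all monomials up to and including degree p
    Psi = expander.fit_transform(X)           # shape (M, r)
    return Psi.mean(axis=0)                   # Phi(X) = (1/M) sum_m Psi(x_m)

# Two "utterances" of different lengths over d = 3 dimensional frame vectors.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((120, 3))
X2 = rng.standard_normal((75, 3))

phi1, phi2 = glds_vector(X1), glds_vector(X2)
print(phi1.shape == phi2.shape)   # same dimension despite different sequence lengths
print(float(phi1 @ phi2))         # a simple linear kernel value between the two sequences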