
Data Mining
Theories, Algorithms, and Examples

Ergonomics and Industrial Engineering

“… provides full spectrum coverage of the most important topics in data mining.
By reading it, one can obtain a comprehensive view on data mining, including
the basic concepts, the important problems in the area, and how to handle these
problems. The whole book is presented in a way that a reader who does not have
much background knowledge of data mining can easily understand. You can find
many figures and intuitive examples in the book. I really love these figures and
examples, since they make the most complicated concepts and algorithms much
easier to understand.”

—Zheng Zhao, SAS Institute Inc., Cary, North Carolina, USA

“… covers pretty much all the core data mining algorithms. It also covers several
useful topics that are not covered by other data mining books such as univariate
and multivariate control charts and wavelet analysis. Detailed examples are
provided to illustrate the practical use of data mining algorithms. A list of software
packages is also included for most algorithms covered in the book. These are
extremely useful for data mining practitioners. I highly recommend this book for
anyone interested in data mining.”
—Jieping Ye, Arizona State University, Tempe, USA

New technologies have enabled us to collect massive amounts of data in many


fields. However, our pace of discovering useful information and knowledge from
these data falls far behind our pace of collecting the data. Data Mining: Theories,
Algorithms, and Examples introduces and explains a comprehensive set of data
mining algorithms from various data mining fields. The book reviews theoretical
rationales and procedural details of data mining algorithms, including those
commonly found in the literature and those presenting considerable difficulty,
using small data examples to explain and walk through the algorithms.

NONG YE

ISBN: 978-1-4398-0838-2
www.crcpress.com


Data Mining
Theories, Algorithms, and Examples

Human Factors and Ergonomics Series

Published Titles
Conceptual Foundations of Human Factors Measurement
D. Meister
Content Preparation Guidelines for the Web and Information Appliances:
Cross-Cultural Comparisons
H. Liao, Y. Guo, A. Savoy, and G. Salvendy
Cross-Cultural Design for IT Products and Services
P. Rau, T. Plocher and Y. Choong
Data Mining: Theories, Algorithms, and Examples
Nong Ye
Designing for Accessibility: A Business Guide to Countering Design Exclusion
S. Keates
Handbook of Cognitive Task Design
E. Hollnagel
The Handbook of Data Mining
N. Ye
Handbook of Digital Human Modeling: Research for Applied Ergonomics
and Human Factors Engineering
V. G. Duffy
Handbook of Human Factors and Ergonomics in Health Care and Patient Safety
Second Edition
P. Carayon
Handbook of Human Factors in Web Design, Second Edition
K. Vu and R. Proctor
Handbook of Occupational Safety and Health
D. Koradecka
Handbook of Standards and Guidelines in Ergonomics and Human Factors
W. Karwowski
Handbook of Virtual Environments: Design, Implementation, and Applications
K. Stanney
Handbook of Warnings
M. Wogalter
Human–Computer Interaction: Designing for Diverse Users and Domains
A. Sears and J. A. Jacko
Human–Computer Interaction: Design Issues, Solutions, and Applications
A. Sears and J. A. Jacko
Human–Computer Interaction: Development Process
A. Sears and J. A. Jacko
Human–Computer Interaction: Fundamentals
A. Sears and J. A. Jacko
The Human–Computer Interaction Handbook: Fundamentals
Evolving Technologies, and Emerging Applications, Third Edition
A. Sears and J. A. Jacko
Human Factors in System Design, Development, and Testing
D. Meister and T. Enderwick

Published Titles (continued)

Introduction to Human Factors and Ergonomics for Engineers, Second Edition


M. R. Lehto
Macroergonomics: Theory, Methods and Applications
H. Hendrick and B. Kleiner
Practical Speech User Interface Design
James R. Lewis
The Science of Footwear
R. S. Goonetilleke
Skill Training in Multimodal Virtual Environments
M. Bergamsco, B. Bardy, and D. Gopher
Smart Clothing: Technology and Applications
Gilsoo Cho
Theories and Practice in Interaction Design
S. Bagnara and G. Crampton-Smith
The Universal Access Handbook
C. Stephanidis
Usability and Internationalization of Information Technology
N. Aykin
User Interfaces for All: Concepts, Methods, and Tools
C. Stephanidis

Forthcoming Titles
Around the Patient Bed: Human Factors and Safety in Health care
Y. Donchin and D. Gopher
Cognitive Neuroscience of Human Systems Work and Everyday Life
C. Forsythe and H. Liao
Computer-Aided Anthropometry for Research and Design
K. M. Robinette
Handbook of Human Factors in Air Transportation Systems
S. Landry
Handbook of Virtual Environments: Design, Implementation
and Applications, Second Edition
K. S. Hale and K. M. Stanney
Variability in Human Performance
T. Smith, R. Henning, and M. Wade

Data Mining
Theories, Algorithms, and Examples

NONG YE

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not
warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® soft-
ware or related products does not constitute endorsement or sponsorship by The MathWorks of a particular
pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2014 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20130624

International Standard Book Number-13: 978-1-4822-1936-4 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com

Contents

Preface.................................................................................................................... xiii
Acknowledgments.............................................................................................. xvii
Author.................................................................................................................... xix

Part I  An Overview of Data Mining

1. Introduction to Data, Data Patterns, and Data Mining...........................3


1.1 Examples of Small Data Sets................................................................ 3
1.2 Types of Data Variables......................................................................... 5
1.2.1 Attribute Variable versus Target Variable............................. 5
1.2.2 Categorical Variable versus Numeric Variable..................... 8
1.3 Data Patterns Learned through Data Mining.................................... 9
1.3.1 Classification and Prediction Patterns................................... 9
1.3.2 Cluster and Association Patterns......................................... 12
1.3.3 Data Reduction Patterns........................................................ 13
1.3.4 Outlier and Anomaly Patterns.............................................. 14
1.3.5 Sequential and Temporal Patterns....................................... 15
1.4 Training Data and Test Data............................................................... 17
Exercises........................................................................................................... 17

Part II  Algorithms for Mining Classification and Prediction Patterns

2. Linear and Nonlinear Regression Models............................................... 21
2.1 Linear Regression Models.................................................................. 21
2.2 Least-Squares Method and Maximum Likelihood Method
of Parameter Estimation.................................................................. 23
2.3 Nonlinear Regression Models and Parameter Estimation............ 28
2.4 Software and Applications................................................................. 29
Exercises........................................................................................................... 29

3. Naïve Bayes Classifier.................................................................................. 31


3.1 Bayes Theorem..................................................................................... 31
3.2 Classification Based on the Bayes Theorem and Naïve Bayes
Classifier.............................................................................................. 31
3.3 Software and Applications................................................................. 35
Exercises........................................................................................................... 36


4. Decision and Regression Trees................................................................... 37


4.1 Learning a Binary Decision Tree and Classifying Data
Using a Decision Tree........................................................................ 37
4.1.1 Elements of a Decision Tree.................................................. 37
4.1.2 Decision Tree with the Minimum Description Length..... 39
4.1.3 Split Selection Methods.......................................................... 40
4.1.4 Algorithm for the Top-Down Construction
of a Decision Tree....................................................................44
4.1.5 Classifying Data Using a Decision Tree.............................. 49
4.2 Learning a Nonbinary Decision Tree............................................... 51
4.3 Handling Numeric and Missing Values of Attribute Variables.....56
4.4 Handling a Numeric Target Variable and Constructing
a Regression Tree............................................................................... 57
4.5 Advantages and Shortcomings of the Decision Tree
Algorithm...................................................................................... 59
4.6 Software and Applications................................................................. 61
Exercises........................................................................................................... 62

5. Artificial Neural Networks for Classification and Prediction.............63


5.1 Processing Units of ANNs..................................................................63
5.2 Architectures of ANNs....................................................................... 69
5.3 Methods of Determining Connection Weights for a Perceptron...... 71
5.3.1 Perceptron................................................................................ 72
5.3.2 Properties of a Processing Unit............................................. 72
5.3.3 Graphical Method of Determining Connection
Weights and Biases.................................................................... 73
5.3.4 Learning Method of Determining Connection
Weights and Biases.................................................................... 76
5.3.5 Limitation of a Perceptron..................................................... 79
5.4 Back-Propagation Learning Method for a Multilayer
Feedforward ANN.............................................................................80
5.5 Empirical Selection of an ANN Architecture for a Good Fit
to Data.................................................................................................. 86
5.6 Software and Applications................................................................. 88
Exercises........................................................................................................... 88

6. Support Vector Machines............................................................................. 91


6.1 Theoretical Foundation for Formulating and Solving an
Optimization Problem to Learn a Classification Function........... 91
6.2 SVM Formulation for a Linear Classifier and a Linearly
Separable Problem.............................................................................. 93
6.3 Geometric Interpretation of the SVM Formulation
for the Linear Classifier..................................................................... 96
6.4 Solution of the Quadratic Programming Problem
for a Linear Classifier......................................................................... 98


6.5 SVM Formulation for a Linear Classifier and a Nonlinearly


Separable Problem............................................................................ 105
6.6 SVM Formulation for a Nonlinear Classifier
and a Nonlinearly Separable Problem........................................... 108
6.7 Methods of Using SVM for Multi-Class Classification
Problems............................................................................................. 113
6.8 Comparison of ANN and SVM........................................................ 113
6.9 Software and Applications............................................................... 114
Exercises......................................................................................................... 114

7. k-Nearest Neighbor Classifier and Supervised Clustering................ 117


7.1 k-Nearest Neighbor Classifier.......................................................... 117
7.2 Supervised Clustering....................................................................... 122
7.3 Software and Applications............................................................... 136
Exercises......................................................................................................... 136

Part III  Algorithms for Mining Cluster and Association Patterns

8. Hierarchical Clustering.............................................................................. 141
8.1 Procedure of Agglomerative Hierarchical Clustering.................. 141
8.2 Methods of Determining the Distance between Two Clusters......141
8.3 Illustration of the Hierarchical Clustering Procedure.................. 146
8.4 Nonmonotonic Tree of Hierarchical Clustering............................ 150
8.5 Software and Applications............................................................... 152
Exercises......................................................................................................... 152

9. K-Means Clustering and Density-Based Clustering............................ 153


9.1 K-Means Clustering........................................................................... 153
9.2 Density-Based Clustering................................................................. 165
9.3 Software and Applications............................................................... 165
Exercises......................................................................................................... 166

10. Self-Organizing Map.................................................................................. 167


10.1 Algorithm of Self-Organizing Map................................................. 167
10.2 Software and Applications............................................................... 175
Exercises......................................................................................................... 175

11. Probability Distributions of Univariate Data....................................... 177


11.1 Probability Distribution of Univariate Data and Probability
Distribution Characteristics of Various Data Patterns................ 177
11.2 Method of Distinguishing Four Probability Distributions.......... 182
11.3 Software and Applications............................................................... 183
Exercises......................................................................................................... 184


12. Association Rules........................................................................................ 185


12.1 Definition of Association Rules and Measures of Association......185
12.2 Association Rule Discovery.............................................................. 189
12.3 Software and Applications............................................................... 194
Exercises......................................................................................................... 194

13. Bayesian Network........................................................................................ 197


13.1 Structure of a Bayesian Network and Probability
Distributions of Variables................................................................ 197
13.2 Probabilistic Inference....................................................................... 205
13.3 Learning of a Bayesian Network..................................................... 210
13.4 Software and Applications............................................................... 213
Exercises......................................................................................................... 213

Part IV  Algorithms for Mining Data Reduction Patterns

14. Principal Component Analysis................................................................. 217


14.1 Review of Multivariate Statistics..................................................... 217
14.2 Review of Matrix Algebra................................................................. 220
14.3 Principal Component Analysis........................................................ 228
14.4 Software and Applications............................................................... 230
Exercises......................................................................................................... 231

15. Multidimensional Scaling......................................................................... 233


15.1 Algorithm of MDS............................................................................. 233
15.2 Number of Dimensions..................................................................... 246
15.3 INDSCALE for Weighted MDS........................................................ 247
15.4 Software and Applications............................................................... 248
Exercises......................................................................................................... 248

Part V  Algorithms for Mining Outlier and Anomaly Patterns

16. Univariate Control Charts......................................................................... 251
16.1 Shewhart Control Charts.................................................................. 251
16.2 CUSUM Control Charts....................................................................254
16.3 EWMA Control Charts...................................................................... 257
16.4 Cuscore Control Charts..................................................................... 261
16.5 Receiver Operating Curve (ROC) for Evaluation
and Comparison of Control Charts................................................ 265
16.6 Software and Applications............................................................... 267
Exercises......................................................................................................... 267

17. Multivariate Control Charts...................................................................... 269


17.1 Hotelling’s T² Control Charts........................................................... 269
17.2 Multivariate EWMA Control Charts............................................... 272
17.3 Chi-Square Control Charts............................................................... 272
17.4 Applications........................................................................................ 274
Exercises......................................................................................................... 274

Part VI  Algorithms for Mining Sequential and Temporal Patterns

18. Autocorrelation and Time Series Analysis............................................ 277
18.1 Autocorrelation................................................................................... 277
18.2 Stationarity and Nonstationarity..................................................... 278
18.3 ARMA Models of Stationary Series Data....................................... 279
18.4 ACF and PACF Characteristics of ARMA Models........................ 281
18.5 Transformations of Nonstationary Series Data
and ARIMA Models......................................................................... 283
18.6 Software and Applications...............................................................284
Exercises......................................................................................................... 285

19. Markov Chain Models and Hidden Markov Models.......................... 287


19.1 Markov Chain Models....................................................................... 287
19.2 Hidden Markov Models................................................................... 290
19.3 Learning Hidden Markov Models................................................... 294
19.4 Software and Applications...............................................................305
Exercises.........................................................................................................305

20. Wavelet Analysis......................................................................................... 307


20.1 Definition of Wavelet......................................................................... 307
20.2 Wavelet Transform of Time Series Data.........................................309
20.3 Reconstruction of Time Series Data from Wavelet
Coefficients...................................................................................... 316
20.4 Software and Applications............................................................... 317
Exercises......................................................................................................... 318

References............................................................................................................ 319
Index...................................................................................................................... 323
Preface

Technologies have enabled us to collect massive amounts of data in many


fields. However, our pace of discovering useful information and knowledge from
these data falls far behind our pace of collecting the data. Conversion of
massive data into useful information and knowledge involves two steps:
(1)  mining patterns present in the data and (2) interpreting those data
patterns in their problem domains to turn them into useful information
and knowledge. There exist many data mining algorithms to automate
the first step of mining various types of data patterns from massive data.
Interpretation of data patterns usually depends on specific domain
knowledge and analytical thinking. This book covers data mining algorithms that
can be used to mine various types of data patterns. Learning and applying
data mining algorithms will enable us to automate and thus speed up the
first step of uncovering data patterns from massive data. Understanding
how data patterns are uncovered by data mining algorithms is also crucial
to carrying out the second step of looking into the meaning of data patterns
in problem domains and turning data patterns into useful information and
knowledge.

Overview of the Book


The data mining algorithms in this book are organized into five parts for
mining five types of data patterns from massive data, as follows:

1. Classification and prediction patterns


2. Cluster and association patterns
3. Data reduction patterns
4. Outlier and anomaly patterns
5. Sequential and temporal patterns

Part I introduces these types of data patterns with examples. Parts II–VI
describe algorithms to mine the five types of data patterns, respectively.
Classification and prediction patterns capture relations of attribute vari-
ables with target variables and allow us to classify or predict values of target


variables from values of attribute variables. Part II describes the following


algorithms to mine classification and prediction patterns:

• Linear and nonlinear regression models (Chapter 2)


• Naïve Bayes classifier (Chapter 3)
• Decision and regression trees (Chapter 4)
• Artificial neural networks for classification and prediction (Chapter 5)
• Support vector machines (Chapter 6)
• k-Nearest neighbor classifier and supervised clustering (Chapter 7)

Part III describes data mining algorithms to uncover cluster and associa-
tion patterns. Cluster patterns reveal patterns of similarities and differ-
ences among data records. Association patterns are established based on
co-occurrences of items in data records. Part III describes the following data
mining algorithms to mine cluster and association patterns:

• Hierarchical clustering (Chapter 8)


• K-means clustering and density-based clustering (Chapter 9)
• Self-organizing map (Chapter 10)
• Probability distributions of univariate data (Chapter 11)
• Association rules (Chapter 12)
• Bayesian networks (Chapter 13)

Data reduction patterns look for a small number of variables that can be
used to represent a data set with a much larger number of variables. Since
one variable gives one dimension of data, data reduction patterns allow a
data set in a high-dimensional space to be represented in a low-dimensional
space. Part IV describes the following data mining algorithms to mine data
reduction patterns:

• Principal component analysis (Chapter 14)


• Multidimensional scaling (Chapter 15)

Outliers and anomalies are data points that differ greatly from a norm profile
of data, and there are many ways to define and establish a norm profile
of data. Part V describes the following data mining algorithms to detect and
identify outliers and anomalies:

• Univariate control charts (Chapter 16)


• Multivariate control charts (Chapter 17)

Sequential and temporal patterns reveal how data change their patterns
over time. Part VI describes the following data mining algorithms to mine
sequential and temporal patterns:

• Autocorrelation and time series analysis (Chapter 18)


• Markov chain models and hidden Markov models (Chapter 19)
• Wavelet analysis (Chapter 20)

Distinctive Features of the Book


As stated earlier, mining data patterns from massive data is only the first
step of turning massive data into useful information and knowledge in prob-
lem domains. Data patterns need to be understood and interpreted in their
problem domain in order to be useful. To apply a data mining algorithm
and acquire the ability of understanding and interpreting data patterns pro-
duced by that data mining algorithm, we need to understand two important
aspects of the algorithm:

1. Theoretical concepts that establish the rationale of why elements of


the data mining algorithm are put together in a specific way to mine
a particular type of data pattern
2. Operational steps and details of how the data mining algorithm pro-
cesses massive data to produce data patterns.

This book aims at providing both theoretical concepts and operational


details of data mining algorithms in each chapter in a self-contained, com-
plete manner with small data examples. It will enable readers to understand
theoretical and operational aspects of data mining algorithms and to manu-
ally execute the algorithms for a thorough understanding of the data pat-
terns produced by them.
This book covers data mining algorithms that are commonly found in
the data mining literature (e.g., decision trees, artificial neural networks,
and hierarchical clustering) and data mining algorithms that are usually
considered difficult to understand (e.g.,  hidden Markov models, multidi-
mensional scaling, support vector machines, and wavelet analysis). All the
data mining algorithms in this book are described in a self-contained,
example-supported, complete manner. Hence, this book will enable read-
ers to achieve the same level of thorough understanding and will provide
the same ability of manual execution regardless of the difficulty level of the
data mining algorithms.

For the data mining algorithms in each chapter, a list of software packages
that support them is provided. Some applications of the data mining algo-
rithms are also given with references.

Teaching Support
The data mining algorithms covered in this book involve different levels of
difficulty. The instructor who uses this book as the textbook for a course on
data mining may select the book materials to cover in the course based on the
level of the course and the level of difficulty of the book materials. The book
materials in Chapters 1, 2 (Sections 2.1 and 2.2 only), 3, 4, 7, 8, 9 (Section 9.1
only), 12, 16 (Sections 16.1 through 16.3 only), and 19 (Section 19.1 only), which
cover the five types of data patterns, are appropriate for an undergraduate-
level course. The remainder is appropriate for a graduate-level course.
Exercises are provided at the end of each chapter. The following additional
teaching support materials are available on the book website and can be
obtained from the publisher:

• Solutions manual
• Lecture slides, which include the outline of topics, figures, tables,
and equations

MATLAB® is a registered trademark of The MathWorks, Inc. For product


information, please contact:

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: [email protected]
Web: www.mathworks.com
Acknowledgments

I would like to thank my family, Baijun and Alice, for their love, understand-
ing, and unconditional support. I appreciate them for always being there for
me and making me happy.
I am grateful to Dr. Gavriel Salvendy, who has been my mentor and friend,
for guiding me in my academic career. I am also thankful to Dr. Gary Hogg,
who supported me in many ways as the department chair at Arizona State
University.
I would like to thank Cindy Carelli, senior editor at CRC Press. This book
would not have been possible without her responsive, helpful, understand-
ing, and supportive nature. It has been a great pleasure working with her.
Thanks also go to Kari Budyk, senior project coordinator at CRC Press, and
the staff at CRC Press who helped publish this book.

xvii
Author

Nong Ye is a professor at the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, Arizona.
She holds a PhD in industrial engineering from Purdue University, West
Lafayette, Indiana, an MS in computer science from the Chinese Academy
of Sciences, Beijing, People’s Republic of China, and a BS in computer science from Peking University, Beijing, People’s Republic of China.
Her publications include The Handbook of Data Mining and Secure Computer
and Network Systems: Modeling, Analysis and Design. She has also published
over 80 journal papers in the fields of data mining, statistical data analysis
and modeling, computer and network security, quality of service optimiza-
tion, quality control, human–computer interaction, and human factors.

Part I

An Overview of Data Mining


1
Introduction to Data, Data Patterns, and Data Mining

Data mining aims at discovering useful data patterns from massive amounts
of data. In this chapter, we give some examples of data sets and use these
data sets to illustrate various types of data variables and data patterns that
can be discovered from data. Data mining algorithms to discover each type
of data patterns are briefly introduced in this chapter. The concepts of train-
ing and testing data are also introduced.

1.1  Examples of Small Data Sets


Advanced technologies such as computers and sensors have enabled many
activities to be recorded and stored over time, producing massive amounts
of data in many fields. In this section, we introduce some examples of small
data sets that are used throughout the book to explain data mining concepts
and algorithms.
Tables 1.1 through 1.3 give three examples of small data sets from the UCI
Machine Learning Repository (Frank and Asuncion, 2010). The balloons
data set in Table 1.1 contains data records for 16 instances of balloons. Each
balloon has four attributes: Color, Size, Act, and Age. These attributes of the
balloon determine whether or not the balloon is inflated. The space shuttle
O-ring erosion data set in Table 1.2 contains data records for 23 instances of
the Challenger space shuttle flights. There are four attributes for each flight:
Number of O-rings, Launch Temperature (°F), Leak-Check Pressure (psi),
and Temporal Order of Flight, which can be used to determine Number of
O-rings with Stress. The lenses data set in Table 1.3 contains data records
for 24 instances for the fit of lenses to a patient. There are four attributes
of a patient for each instance: Age, Prescription, Astigmatic, and Tear
Production Rate, which can be used to determine the type of lenses to be
fitted to a patient.
Table 1.4 gives the data set for fault detection and diagnosis of a manufac-
turing system (Ye et al., 1993). The manufacturing system consists of nine
machines, M1, M2, …, M9, which process parts. Figure 1.1 shows the produc-
tion flows of parts to go through the nine machines. There are some parts


Table 1.1
Balloon Data Set
Instance  Color  Size  Act  Age  Inflated
(Color, Size, Act, and Age are attribute variables; Inflated is the target variable.)
1 Yellow Small Stretch Adult T
2 Yellow Small Stretch Child T
3 Yellow Small Dip Adult T
4 Yellow Small Dip Child T
5 Yellow Large Stretch Adult T
6 Yellow Large Stretch Child F
7 Yellow Large Dip Adult F
8 Yellow Large Dip Child F
9 Purple Small Stretch Adult T
10 Purple Small Stretch Child F
11 Purple Small Dip Adult F
12 Purple Small Dip Child F
13 Purple Large Stretch Adult T
14 Purple Large Stretch Child F
15 Purple Large Dip Adult F
16 Purple Large Dip Child F

that go through M1 first, M5 second, and M9 last, some parts that go through
M1 first, M5 second, and M7 last, and so on. There are nine variables, xi,
i = 1, 2, …, 9, representing the quality of parts after they go through the nine
machines. If parts after machine i pass the quality inspection, xi takes the
value of 0; otherwise, xi takes the value of 1. There is a variable, y, represent-
ing whether or not the system has a fault. The system has a fault if any of
the nine machines is faulty. If the system does not have a fault, y takes the
value of 0; otherwise, y takes the value of 1. There are nine variables, yi, i = 1,
2, …, 9, representing whether or not nine machines are faulty, respectively.
If machine i does not have a fault, yi takes the value of 0; otherwise, yi takes
the value of 1. The fault detection problem is to determine whether or not the
system has a fault based on the quality information. The fault detection prob-
lem involves the nine quality variables, xi, i = 1, 2, …, 9, and the system fault
variable, y. The fault diagnosis problem is to determine which machine has a
fault based on the quality information. The fault diagnosis problem involves
the nine quality variables, xi, i = 1, 2, …, 9, and the nine variables of machine
fault, yi, i = 1, 2, …, 9. There may be one or more machines that have a fault
at the same time, or no faulty machine. For example, in instance 1 with M1
being faulty (y1 and y taking the value of 1 and y2, y3, y4, y5, y6, y7, y8, and y9
taking the value of 0), parts after M1, M5, M7, M9 fails the quality inspection
with x1, x5, x7, and x9 taking the value of 1 and other quality variables, x2, x3,
x4, x6, and x8, taking the value of 0.

Table 1.2
Space Shuttle O-Ring Data Set
Instance  Number of O-Rings  Launch Temperature  Leak-Check Pressure  Temporal Order of Flight  Number of O-Rings with Stress
(The first four are attribute variables; Number of O-Rings with Stress is the target variable.)
1 6 66 50 1 0
2 6 70 50 2 1
3 6 69 50 3 0
4 6 68 50 4 0
5 6 67 50 5 0
6 6 72 50 6 0
7 6 73 100 7 0
8 6 70 100 8 0
9 6 57 200 9 1
10 6 63 200 10 1
11 6 70 200 11 1
12 6 78 200 12 0
13 6 67 200 13 0
14 6 53 200 14 2
15 6 67 200 15 0
16 6 75 200 16 0
17 6 70 200 17 0
18 6 81 200 18 0
19 6 76 200 19 0
20 6 79 200 20 0
21 6 75 200 21 0
22 6 76 200 22 0
23 6 58 200 23 1

1.2  Types of Data Variables


The types of data variables affect what data mining algorithms can be applied
to a given data set. This section introduces the different types of data variables.

1.2.1 Attribute Variable versus Target Variable


A data set may have attribute variables and target variable(s). The values of the
attribute variables are used to determine the values of the target variable(s).
Attribute variables and target variables may also be called independent
variables and dependent variables, respectively, to reflect that the values of

Table 1.3
Lenses Data Set
Instance  Age  Spectacle Prescription  Astigmatic  Tear Production Rate  Lenses
(The first four are attribute variables; Lenses is the target variable.)
1 Young Myope No Reduced Noncontact
2 Young Myope No Normal Soft contact
3 Young Myope Yes Reduced Noncontact
4 Young Myope Yes Normal Hard contact
5 Young Hypermetrope No Reduced Noncontact
6 Young Hypermetrope No Normal Soft contact
7 Young Hypermetrope Yes Reduced Noncontact
8 Young Hypermetrope Yes Normal Hard contact
9 Pre-presbyopic Myope No Reduced Noncontact
10 Pre-presbyopic Myope No Normal Soft contact
11 Pre-presbyopic Myope Yes Reduced Noncontact
12 Pre-presbyopic Myope Yes Normal Hard contact
13 Pre-presbyopic Hypermetrope No Reduced Noncontact
14 Pre-presbyopic Hypermetrope No Normal Soft contact
15 Pre-presbyopic Hypermetrope Yes Reduced Noncontact
16 Pre-presbyopic Hypermetrope Yes Normal Noncontact
17 Presbyopic Myope No Reduced Noncontact
18 Presbyopic Myope No Normal Noncontact
19 Presbyopic Myope Yes Reduced Noncontact
20 Presbyopic Myope Yes Normal Hard contact
21 Presbyopic Hypermetrope No Reduced Noncontact
22 Presbyopic Hypermetrope No Normal Soft contact
23 Presbyopic Hypermetrope Yes Reduced Noncontact
24 Presbyopic Hypermetrope Yes Normal Noncontact

the target variables depend on the values of the attribute variables. In the bal-
loon data set in Table 1.1, the attribute variables are Color, Size, Act, and Age,
and the target variable gives the inflation status of the balloon. In the space
shuttle data set in Table 1.2, the attribute variables are Number of O-rings,
Launch Temperature, Leak-Check Pressure, and Temporal Order of Flight,
and the target variable is the Number of O-rings with Stress.
Some data sets may have only attribute variables. For example, customer
purchase transaction data may contain the items purchased by each cus-
tomer at a store. We have attribute variables representing the items pur-
chased. The interest in the customer purchase transaction data is in finding
out what items are often purchased together by customers. Such association
patterns of items or attribute variables can be used to design the store lay-
out for sale of items and assist customer shopping. Mining such a data set
involves only attribute variables.
Table 1.4
Data Set for a Manufacturing System to Detect and Diagnose Faults
Instance (Faulty Machine)  x1 x2 x3 x4 x5 x6 x7 x8 x9  System Fault, y  y1 y2 y3 y4 y5 y6 y7 y8 y9
(The quality-of-parts variables x1–x9 are attribute variables; the system fault y and machine faults y1–y9 are target variables.)
1 (M1) 1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0
2 (M2) 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0
3 (M3) 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0
4 (M4) 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
5 (M5) 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0
6 (M6) 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0
7 (M7) 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1
10 (none) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Figure 1.1
A manufacturing system with nine machines and production flows of parts.

1.2.2  Categorical Variable versus Numeric Variable


A variable can take categorical or numeric values. All the attribute variables
and the target variable in the balloon data set take categorical values. For
example, two values of the Color attribute, yellow and purple, give two dif-
ferent categories of Color. All the attribute variables and the target variable
in the space shuttle O-ring data set take numeric values. For example, the
values of the target variable, 0, 1, and 2, give the quantity of O-rings with
Stress. The values of a numeric variable can be used to measure the quanti-
tative magnitude of differences between numeric values. For example, the
value of 2 O-rings is 1 unit larger than 1 O-ring and 2 units larger than
0 O-rings. However, the quantitative magnitude of differences cannot be
obtained from the values of a categorical variable. For example, although
yellow and purple show us a difference in the two colors, it is inappropri-
ate to assign a quantitative measure of the difference. For another example,
child and adult are two different categories of Age. Although each person
has his/her years of age, we cannot state from child and adult categories in
the balloon data set that an instance of child is 20, 30, or 40 years younger
than an instance of adult.
Categorical variables have two subtypes: nominal variables and ordinal vari-
ables (Tan et al., 2006). The values of an ordinal variable can be sorted in order,
whereas the values of nominal variables can be viewed only as same or dif-
ferent. For example, three values of Age (child, adult, and senior) make Age
an ordinal variable since we can sort child, adult, and senior in this order of
increasing age. However, we cannot state that the age difference between
child and adult is bigger or smaller than the age difference between adult
and senior since child, adult, and senior are categorical values instead
of numeric values. That is, although the values of an ordinal variable can
be sorted, those values are categorical and their quantitative differences
are not available. Color is a nominal variable since yellow and purple show
two different colors but an order of yellow and purple may be meaningless.
Numeric variables have two subtypes: interval variables and ratio variables
(Tan et al., 2006). Quantitative differences between the values of an interval
variable (e.g., Launch Temperature in °F) are meaningful, whereas both
quantitative differences and ratios between the values of a ratio variable
(e.g., Number of O-rings with Stress) are meaningful.
Formally, we denote the attribute variables as x1, …, xp, and the target vari-
ables as y1, …, yq. We let x = (x1, …, xp) and y = (y1, …, yq). Instances or data
observations of x1, …, xp, y1, …, yq give data records, (x1, …, xp, y1, …, yq).

1.3  Data Patterns Learned through Data Mining


The following are the major types of data patterns that are discovered from
data sets through data mining algorithms:

• Classification and prediction patterns


• Cluster and association patterns
• Data reduction patterns
• Outlier and anomaly patterns
• Sequential and temporal patterns

Each type of data patterns is described in the following sections.

1.3.1  Classification and Prediction Patterns


Classification and prediction patterns capture relations of attribute variables,
x1, …, xp, with target variables, y1, …, yq, which are supported by a given set of
data records, (x1, …, xp, y1, …, yq). Classification and prediction patterns allow
us to classify or predict values of target variables from values of attribute
variables.
For example, all the 16 data records of the balloon data set in Table 1.1 support the following relation of the attribute variables, Color, Size, Age, and
Act, with the target variable, Inflated (taking the value of T for true or F
for false):

IF (Color = Yellow AND Size = Small) OR (Age = Adult AND Act = Stretch),
THEN Inflated = T; OTHERWISE, Inflated = F.

This relation allows us to classify a given balloon into a categorical value of the target variable using a specific value of its Color, Size, Age, and
Act attributes. Hence, the relation gives us data patterns that allow us to


perform the classification of a balloon. Although we can extract this relation pattern by examining the 16 data records in the balloon data set, learn-
ing such a pattern manually from a much larger set of data with noise can
be a difficult task. A data mining algorithm allows us to learn from a large
data set automatically.
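As an illustration, the decision rule above can be executed directly. The following is a minimal Python sketch (not from the book), with the 16 records of Table 1.1 transcribed by hand:

```python
# A minimal sketch (not from the book): the classification rule learned from
# Table 1.1, checked against all 16 balloon records.

def classify_balloon(color, size, act, age):
    # IF (Color = Yellow AND Size = Small) OR (Age = Adult AND Act = Stretch),
    # THEN Inflated = T; OTHERWISE, Inflated = F.
    if (color == "Yellow" and size == "Small") or (age == "Adult" and act == "Stretch"):
        return "T"
    return "F"

# The 16 records of Table 1.1: (Color, Size, Act, Age, Inflated).
records = [
    ("Yellow", "Small", "Stretch", "Adult", "T"),
    ("Yellow", "Small", "Stretch", "Child", "T"),
    ("Yellow", "Small", "Dip", "Adult", "T"),
    ("Yellow", "Small", "Dip", "Child", "T"),
    ("Yellow", "Large", "Stretch", "Adult", "T"),
    ("Yellow", "Large", "Stretch", "Child", "F"),
    ("Yellow", "Large", "Dip", "Adult", "F"),
    ("Yellow", "Large", "Dip", "Child", "F"),
    ("Purple", "Small", "Stretch", "Adult", "T"),
    ("Purple", "Small", "Stretch", "Child", "F"),
    ("Purple", "Small", "Dip", "Adult", "F"),
    ("Purple", "Small", "Dip", "Child", "F"),
    ("Purple", "Large", "Stretch", "Adult", "T"),
    ("Purple", "Large", "Stretch", "Child", "F"),
    ("Purple", "Large", "Dip", "Adult", "F"),
    ("Purple", "Large", "Dip", "Child", "F"),
]

# The rule classifies every record in the training data correctly.
assert all(classify_balloon(c, s, ac, ag) == t for c, s, ac, ag, t in records)
```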
For another example, the following linear model fits the 23 data records
of the attribute variable, Launch Temperature, and the target variable,
Number of O-rings with Stress, in the space shuttle O-ring data set in
Table 1.2:

y = −0.05746 x + 4.301587 (1.1)

where
y denotes the target variable, Number of O-rings with Stress
x denotes the attribute variable, Launch Temperature

Figure 1.2 illustrates the values of Launch Temperature and Number of O-rings with Stress in the 23 data records and the fitted line given by
Equation 1.1. Table 1.5 shows the value of O-rings with Stress for each data
record that is predicted from the value of Launch Temperature using the
linear relation model of Launch Temperature with Number of O-rings with
Stress in Equation 1.1. Except two data records for instances 2 and 11, the
linear model in Equation 1.1 captures the relation of Launch Temperature
with Number of O-rings with Stress well in that a lower value of Launch
Temperature increases the value of O-rings with Stress. The highest pre-
dicted value of O-rings with Stress is produced for the data record of instance
14 with 2 O-rings experiencing thermal stress. Two predicted values in the

Figure 1.2
The fitted linear relation model of Launch Temperature with Number of O-rings with Stress in the space shuttle O-ring data set.

Table 1.5
Predicted Value of O-Rings with Stress
Instance  Launch Temperature (Attribute)  Number of O-Rings with Stress (Target)  Predicted Value of O-Rings with Stress
1 66 0 0.509227
2 70 1 0.279387
3 69 0 0.336847
4 68 0 0.394307
5 67 0 0.451767
6 72 0 0.164467
7 73 0 0.107007
8 70 0 0.279387
9 57 1 1.026367
10 63 1 0.681607
11 70 1 0.279387
12 78 0 −0.180293
13 67 0 0.451767
14 53 2 1.256207
15 67 0 0.451767
16 75 0 −0.007913
17 70 0 0.279387
18 81 0 −0.352673
19 76 0 −0.065373
20 79 0 −0.237753
21 75 0 −0.007913
22 76 0 −0.065373
23 58 1 0.968907

middle range, 1.026367 and 0.681607, are produced for two data records of
instances 9 and 10 with 1 O-ring with Stress. The predicted values in the low
range from −0.352673 to 0.509227 are produced for all the data records with 0
O-rings with Stress. The negative coefficient of x, −0.05746, in Equation 1.1,
also reveals this relation. Hence, the linear relation in Equation 1.1 gives data
patterns that allow us to predict the target variable, Number of O-rings with
Stress, from the attribute variable, Launch Temperature, in the space shuttle
O-ring data set.
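As a minimal sketch (not from the book), the predicted values in Table 1.5 can be reproduced by evaluating the linear model of Equation 1.1:

```python
# A minimal sketch (not from the book): predictions from the fitted linear
# model of Equation 1.1, y = -0.05746x + 4.301587, for Launch Temperature
# values taken from Table 1.2.

def predict_orings_with_stress(launch_temperature):
    return -0.05746 * launch_temperature + 4.301587

# Instance 14 (Launch Temperature 53, the coldest flight) gets the highest
# predicted value, matching Table 1.5.
assert abs(predict_orings_with_stress(53) - 1.256207) < 1e-6
# Instance 1 (Launch Temperature 66).
assert abs(predict_orings_with_stress(66) - 0.509227) < 1e-6
```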
Classification and prediction patterns, which capture the relation of attribute
variables, x1, …, xp, with target variables, y1, …, yq, can be represented in the
general form of y = F(x). For the balloon data set, classification patterns for F
take the form of decision rules. For the space shuttle O-ring data set, prediction
patterns for F take the form of a linear model. Generally, the term, “classification

patterns,” is used if the target variable is a categorical variable, and the term,
“prediction patterns,” is used if the target variable is a numeric variable.
Part II of the book introduces the following data mining algorithms that
are used to discover classification and prediction patterns from data:

• Regression models in Chapter 2


• Naïve Bayes classifier in Chapter 3
• Decision and regression trees in Chapter 4
• Artificial neural networks for classification and prediction in
Chapter 5
• Support vector machines in Chapter 6
• K-nearest neighbor classifier and supervised clustering in Chapter 7

Chapters 20, 21, and 23 in The Handbook of Data Mining (Ye, 2003) and Chapters 12
and 13 in Secure Computer and Network Systems: Modeling, Analysis and Design
(Ye, 2008) give applications of classification and prediction algorithms to
human performance data, text data, science and engineering data, and com-
puter and network data.

1.3.2  Cluster and Association Patterns


Cluster and association patterns usually involve only attribute variables,
x1, …, xp. Cluster patterns give groups of similar data records such that
data records in one group are similar but have larger differences from data
records in another group. In other words, cluster patterns reveal patterns of
similarities and differences among data records. Association patterns are
established based on co-occurrences of items in data records. Sometimes
target variables, y1, …, yq, are also used in clustering but are treated in the
same way as attribute variables.
For example, 10 data records in the data set of a manufacturing system
in Table 1.4 can be clustered into seven groups, as shown in Figure 1.3.
The horizontal axis of each chart in Figure 1.3 lists the nine quality vari-
ables, and the vertical axis gives the value of these nine quality variables.
There are three groups that consist of more than one data record: group 1,
group 2, and group 3. Within each of these groups, the data records are
similar with different values in only one of the nine quality variables.
Adding any other data record to each of these three groups makes the
group having at least two data records with different values in more than
one quality variable.
For the same data set of a manufacturing system, the quality variables, x4
and x8, are highly associated because they have the same value in all the data
records except that of instance 8. There are other pairs of variables, e.g., x5 and
x9, that are highly associated for the same reason. These are some association
patterns that exist in the data set of a manufacturing system in Table 1.4.

Figure 1.3
Clustering of 10 data records in the data set of a manufacturing system (group 1: instances 1 and 5; group 2: instances 2 and 4; group 3: instances 3 and 6; groups 4 through 7: instances 7, 8, 9, and 10, respectively).

Part III of the book introduces the following data mining algorithms that
are used to discover cluster and association patterns from data:

• Hierarchical clustering in Chapter 8


• K-means clustering and density-based clustering in Chapter 9
• Self-organizing map in Chapter 10
• Probability distribution of univariate data in Chapter 11
• Association rules in Chapter 12
• Bayesian networks in Chapter 13

Chapters 10, 21, 22, and 27 in The Handbook of Data Mining (Ye, 2003) give
applications of cluster algorithms to market basket data, web log data, text
data, geospatial data, and image data. Chapter 24 in The Handbook of Data
Mining (Ye, 2003) gives an application of the association rule algorithm to
protein structure data.

1.3.3  Data Reduction Patterns


Data reduction patterns look for a small number of variables that can be
used to represent a data set with a much larger number of variables. Since
one variable gives one dimension of data, data reduction patterns allow a
data set in a high-dimensional space to be represented in a low-dimensional
space. For example, Figure 1.4 gives 10 data points in a two-dimensional

Figure 1.4
Reduction of a two-dimensional data set to a one-dimensional data set.

space, (x, y), with y = 2x and x = 1, 2, …, 10. This two-dimensional data set
can be represented as the one-dimensional data set with z as the axis, and z
is related to the original variables, x and y, as follows:
z = x√(1² + (y/x)²). (1.2)

The 10 data points of z are 2.236, 4.472, 6.708, 8.944, 11.180, 13.416, 15.652,
17.889, 20.125, and 22.361.
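A minimal sketch (not from the book) reproduces this one-dimensional representation of the 10 points:

```python
# A minimal sketch (not from the book): the dimensionality reduction of
# Equation 1.2, z = x * sqrt(1 + (y/x)^2) = sqrt(x^2 + y^2), applied to the
# 10 points with y = 2x, x = 1, 2, ..., 10.
import math

points = [(x, 2 * x) for x in range(1, 11)]
z = [math.sqrt(x**2 + y**2) for x, y in points]

# The first and last values match the text (to three decimals): 2.236, 22.361.
assert round(z[0], 3) == 2.236
assert round(z[-1], 3) == 22.361
```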
Part IV of the book introduces the following data mining algorithms that
are used to discover data reduction patterns from data:

• Principal component analysis in Chapter 14


• Multidimensional scaling in Chapter 15

Chapters 23 and 8 in The Handbook of Data Mining (Ye, 2003) give applications of
principal component analysis to volcano data and science and engineering data.

1.3.4  Outlier and Anomaly Patterns


Outliers and anomalies are data points that differ largely from the norm of
data. The norm can be defined in many ways. For example, the norm can
be defined by the range of values that a majority of data points take, and a
data point with a value outside this range can be considered an outlier.
Figure 1.5 gives the frequency histogram of Launch Temperature values for
the data points in the space shuttle data set in Table 1.2. There are 3 values of
Launch Temperature in the range of [50, 59], 7 values in the range of [60, 69], 12
values in the range of [70, 79], and only 1 value in the range of [80, 89]. Hence,
the majority of values in Launch Temperature are in the range of [50, 79].
The value of 81 in instance 18 can be considered as an outlier or anomaly.
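A minimal sketch (not from the book) reproduces the binning of Figure 1.5 and flags the outlying value:

```python
# A minimal sketch (not from the book): binning the 23 Launch Temperature
# values of Table 1.2 as in Figure 1.5 and flagging the value outside the
# majority range [50, 79].

temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]

bins = {"[50, 59]": 0, "[60, 69]": 0, "[70, 79]": 0, "[80, 89]": 0}
for t in temps:
    lo = (t // 10) * 10
    bins[f"[{lo}, {lo + 9}]"] += 1

# Matches the frequencies given in the text: 3, 7, 12, and 1.
assert bins == {"[50, 59]": 3, "[60, 69]": 7, "[70, 79]": 12, "[80, 89]": 1}

# Only one value falls outside the majority range [50, 79]: the value 81 in
# instance 18.
outliers = [t for t in temps if not 50 <= t <= 79]
assert outliers == [81]
```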

Figure 1.5
Frequency histogram of Launch Temperature in the space shuttle data set.

Part V of the book introduces the following data mining algorithms that
are used to define some statistical norms of data and detect outliers and
anomalies according to these statistical norms:

• Univariate control charts in Chapter 16


• Multivariate control charts in Chapter 17

Chapters 26 and 28 in The Handbook of Data Mining (Ye, 2003) and Chapter 14 in
Secure Computer and Network Systems: Modeling, Analysis and Design (Ye, 2008)
give applications of outlier and anomaly detection algorithms to manufac-
turing data and computer and network data.

1.3.5  Sequential and Temporal Patterns


Sequential and temporal patterns reveal patterns in a sequence of data points.
If the sequence is defined by the time over which data points are observed,
we call the sequence of data points a time series. Figure 1.6 shows a time

Figure 1.6
Temperature in each quarter of a 3-year period.

Table 1.6
Test Data Set for a Manufacturing System to Detect and Diagnose Faults
Instance (Faulty Machine)  x1 x2 x3 x4 x5 x6 x7 x8 x9  System Fault, y  y1 y2 y3 y4 y5 y6 y7 y8 y9
(The quality-of-parts variables x1–x9 are attribute variables; the system fault y and machine faults y1–y9 are target variables.)
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 0 0
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 1 0 0 0
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0 0 0
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 0 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 1

series of temperature values for a city over quarters of a 3-year period. There
is a cyclic pattern of 60, 80, 100, and 60, which repeats every year. A variety of
sequential and temporal patterns can be discovered using the data mining
algorithms covered in Part VI of the book, including

• Autocorrelation and time series analysis in Chapter 18


• Markov chain models and hidden Markov models in Chapter 19
• Wavelet analysis in Chapter 20

Chapters 10, 11, and 16 in Secure Computer and Network Systems: Modeling,
Analysis and Design (Ye, 2008) give applications of sequential and temporal
pattern mining algorithms to computer and network data for cyber attack
detection.
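As an illustrative sketch (not from the book; the period-checking function below is a simplification, not one of the chapter's algorithms), the yearly cycle in the quarterly temperature series of Figure 1.6 can be verified directly:

```python
# A minimal sketch (not from the book): checking the yearly cycle in the
# quarterly temperature series of Figure 1.6, where the pattern
# 60, 80, 100, 60 repeats each year over a 3-year period.

series = [60, 80, 100, 60] * 3  # 12 quarters

def has_period(xs, p):
    # True if every value equals the value p positions earlier.
    return all(xs[i] == xs[i - p] for i in range(p, len(xs)))

assert has_period(series, 4)      # the yearly (4-quarter) cycle
assert not has_period(series, 2)  # but no half-year cycle
```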

1.4  Training Data and Test Data


The training data set is a set of data records that is used to learn and discover
data patterns. After data patterns are discovered, they should be tested to see
how well they can generalize to a wide range of data records, including those
that are different from the training data records. A test data set is used for
this purpose and includes new, different data records. For example, Table 1.6
shows a test data set for a manufacturing system and its fault detection and
diagnosis. The training data set for this manufacturing system in Table 1.4
has data records for nine single-machine faults and a case where there is no
machine fault. The test data set in Table 1.6 has data records for some two-
machine and three-machine faults.

Exercises
1.1 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering classification pat-
terns. The data set contains multiple categorical attribute variables and
one categorical target variable.
1.2 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering prediction patterns.
The data set contains multiple numeric attribute variables and one
numeric target variable.

1.3 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering cluster patterns. The
data set contains multiple numeric attribute variables.
1.4 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering association patterns.
The data set contains multiple categorical variables.
1.5 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering data reduction pat-
terns, and identify the type(s) of data variables in this data set.
1.6 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering outlier and anomaly
patterns, and identify the type(s) of data variables in this data set.
1.7 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering sequential and tem-
poral patterns, and identify the type(s) of data variables in this data set.
Part II

Algorithms for Mining Classification and Prediction Patterns

2
Linear and Nonlinear Regression Models

Regression models capture how one or more target variables vary with one
or more attribute variables. They can be used to predict the values of the
target variables using the values of the attribute variables. In this chapter,
we introduce linear and nonlinear regression models. This chapter also
describes the least-squares method and the maximum likelihood method of
estimating parameters in regression models. A list of software packages that
support building regression models is provided.

2.1  Linear Regression Models


A simple linear regression model, as shown next, has one target variable y
and one attribute variable x:

yi = β0 + β1xi + εi (2.1)

where
(xi, yi) denotes the ith observation of x and y
εi represents random noise (e.g., measurement error) contributing to the ith
observation of y

For a given value of xi, both yi and εi are random variables whose values may
follow a probability distribution as illustrated in Figure 2.1. In other words,
for the same value of x, different values of y and ε may be observed at differ-
ent times. There are three assumptions about εi:

1. E(εi) = 0, that is, the mean of εi is zero


2. var(εi) = σ², that is, the variance of εi is σ²
3. cov(εi, εj) = 0 for i ≠ j, that is, the covariance of εi and εj for any two
different data observations, the ith observation and the jth observa-
tion, is zero


Figure 2.1
Illustration of a simple regression model.

These assumptions imply

1. E(yi) = β0 + β1xi
2. var(yi) = σ²
3. cov(yi, yj) = 0 for any two different data observations of y, the ith
observation and the jth observation

The simple linear regression model in Equation 2.1 can be extended to include multiple attribute variables:

yi = β0 + β1xi,1 + ⋯ + βpxi,p + εi, (2.2)

where
p is an integer greater than 1
xi,j denotes the ith observation of jth attribute variable

The linear regression models in Equations 2.1 and 2.2 are linear in the
parameters β0, …, βp and the attribute variables xi,1, …, xi,p. In general, linear
regression models are linear in the parameters but are not necessarily linear
in the attribute variables. The following regression model with polynomial
terms of x1 is also a linear regression model:

yi = β0 + β1xi,1 + ⋯ + βk xi,1^k + εi, (2.3)

where k is an integer greater than 1. The general form of a linear regression model is

( ) (
yi = β0 + β1Φ1 xi ,1 , … , xi , p +  + β k Φ k xi ,1 , … , xi , p + ε i , ) (2.4)

where Φl, l = 1, …, k, is a linear or nonlinear function involving one or more of the variables x1, …, xp. The following is another example of a linear regression model that is linear in the parameters:

yi = β0 + β1xi,1 + β2xi,2 + β3 log(xi,1xi,2) + εi. (2.5)

2.2  Least-Squares Method and Maximum Likelihood Method of Parameter Estimation
To fit a linear regression model to a set of training data (xi, yi), xi = (xi,1, …, xi,p),
i = 1, …, n, the parameters βs need to be estimated. The least-squares method
and the maximum likelihood method are usually used to estimate the param-
eters βs. We illustrate both methods using the simple linear regression model
in Equation 2.1.
The least-squares method looks for the values of the parameters β0 and
β1 that minimize the sum of squared errors (SSE) between observed target
values (yi, i = 1, …, n) and the estimated target values (ŷi, i = 1, …, n) using the
estimated parameters β̂0 and β̂1. SSE is a function of β̂0 and β̂1:

SSE = Σi=1..n (yi − ŷi)² = Σi=1..n (yi − β̂0 − β̂1xi)². (2.6)

The partial derivatives of SSE with respect to β̂0 and β̂1 should be zero at the
point where SSE is minimized. Hence, the values of β̂0 and β̂1 that mini-
mize SSE are obtained by differentiating SSE with respect to β̂0 and β̂1
and setting these partial derivatives equal to zero:
∂SSE/∂β̂0 = −2 Σi=1..n (yi − β̂0 − β̂1xi) = 0 (2.7)

∂SSE/∂β̂1 = −2 Σi=1..n xi(yi − β̂0 − β̂1xi) = 0. (2.8)

Equations 2.7 and 2.8 are simplified to


Σi=1..n (yi − β̂0 − β̂1xi) = Σi=1..n yi − nβ̂0 − β̂1 Σi=1..n xi = 0 (2.9)

Σi=1..n xi(yi − β̂0 − β̂1xi) = Σi=1..n xiyi − β̂0 Σi=1..n xi − β̂1 Σi=1..n xi² = 0. (2.10)

Solving Equations 2.9 and 2.10 for β̂0 and β̂1, we obtain:

β̂1 = Σi=1..n (xi − x̄)(yi − ȳ) / Σi=1..n (xi − x̄)² = [n Σi=1..n xiyi − (Σi=1..n xi)(Σi=1..n yi)] / [n Σi=1..n xi² − (Σi=1..n xi)²] (2.11)

β̂0 = (1/n)(Σi=1..n yi − β̂1 Σi=1..n xi) = ȳ − β̂1x̄. (2.12)
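Equations 2.11 and 2.12 map directly to code. The following sketch (plain Python; the function name is illustrative, not from the text) computes the least-squares estimates from paired observations; on noise-free data generated from y = 1 + 2x it recovers the true parameters:

```python
def fit_simple_linear(x, y):
    """Least-squares estimates of beta0 and beta1 (Equations 2.11 and 2.12)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Noise-free data generated from y = 1 + 2x; the estimates recover
# the true parameters.
b0, b1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0, 4.0],
                           [1.0, 3.0, 5.0, 7.0, 9.0])
```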

The estimation of the parameters in the simple linear regression model


based on the least-squares method does not require that the random error
εi has a specific form of the probability distribution. If we add to the simple
linear regression model in Equation 2.1 an assumption that εi is normally
distributed with the mean of zero and the constant, unknown variance of
σ², denoted by N(0, σ²), the maximum likelihood method can also be used to estimate the parameters in the simple linear regression model. The assumption that the εi are independent N(0, σ²) gives the normal distribution of yi with

E(yi) = β0 + β1xi (2.13)

var(yi) = σ² (2.14)

and the density function of the normal probability distribution:

f(yi) = [1/(√(2π)σ)] e^(−(1/2)[(yi − E(yi))/σ]²) = [1/(√(2π)σ)] e^(−(1/2)[(yi − β0 − β1xi)/σ]²). (2.15)

Because the yi are independent, the likelihood of observing y1, …, yn, denoted L, is the product of the individual densities f(yi) and is a function of β0, β1, and σ²:

L(β0, β1, σ²) = ∏i=1..n [1/(2πσ²)^(1/2)] e^(−(1/2)[(yi − β0 − β1xi)/σ]²). (2.16)

The estimated values of the parameters, β̂0, β̂1, and σ̂², which maximize the likelihood function in Equation 2.16, are the maximum likelihood estimators and can be obtained by differentiating this likelihood function with respect to β0, β1, and σ² and setting these partial derivatives to zero. To ease the computation, we use the natural logarithm transformation (ln) of the likelihood function to obtain

∂lnL(β̂0, β̂1, σ̂²)/∂β̂0 = (1/σ̂²) Σi=1..n (yi − β̂0 − β̂1xi) = 0 (2.17)

∂lnL(β̂0, β̂1, σ̂²)/∂β̂1 = (1/σ̂²) Σi=1..n xi(yi − β̂0 − β̂1xi) = 0 (2.18)

∂lnL(β̂0, β̂1, σ̂²)/∂σ̂² = −n/(2σ̂²) + (1/(2σ̂⁴)) Σi=1..n (yi − β̂0 − β̂1xi)² = 0. (2.19)

Equations 2.17 through 2.19 are simplified to

Σi=1..n (yi − β̂0 − β̂1xi) = 0 (2.20)

Σi=1..n xi(yi − β̂0 − β̂1xi) = 0 (2.21)

σ̂² = [Σi=1..n (yi − β̂0 − β̂1xi)²]/n. (2.22)
Equations 2.20 and 2.21 are the same as Equations 2.9 and 2.10. Hence, the
maximum likelihood estimators of β0 and β1 are the same as the least-squares
estimators of β0 and β1 that are given in Equations 2.11 and 2.12.
For the linear regression model in Equation 2.2 with multiple attribute
variables, we define xi,0 = 1 and rewrite Equation 2.2 to

yi = β0xi,0 + β1xi,1 + … + βpxi,p + εi. (2.23)

Defining the following matrices

y = (y1, …, yn)′,  x = the n × (p + 1) matrix whose ith row is (1, xi,1, …, xi,p),  b = (β0, β1, …, βp)′,  e = (ε1, …, εn)′,

we rewrite Equation 2.23 in the matrix form

y = xb + e. (2.24)

The least-squares and maximum likelihood estimators of the parameters are

b̂ = (x′x)⁻¹(x′y), (2.25)

where (x′x)⁻¹ represents the inverse of the matrix x′x.
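Equation 2.25 can be checked numerically in a few lines of NumPy. The sketch below (function name hypothetical) solves the normal equations (x′x)b̂ = x′y with np.linalg.solve rather than forming the inverse explicitly, which gives the same estimate but is numerically preferable:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares estimate b_hat = (x'x)^(-1) x'y (Equation 2.25).

    A column of ones is prepended to X so that the first entry of the
    result is the intercept beta0.  Solving the normal equations with
    np.linalg.solve avoids forming the inverse explicitly.
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noise-free data from y = 1 + 2*x1 + 3*x2 is recovered exactly.
b = fit_linear([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]],
               [1, 3, 4, 6, 8])
```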

Example 2.1
Use the least-squares method to fit a linear regression model to the space
shuttle O-rings data set in Table 1.5, which is also given in Table 2.1, and
determine the predicted target value for each observation using the lin-
ear regression model.
This data has one attribute variable x representing Launch Temperature
and one target variable y representing Number of O-rings with Stress.
The linear regression model for this data set is
yi = β0 + β1xi + εi.

Table 2.2 shows the calculation for estimating β1 using Equation 2.11.
Using Equation 2.11, we obtain:

β̂1 = Σi=1..n (xi − x̄)(yi − ȳ) / Σi=1..n (xi − x̄)² = −62.98/1095.54 = −0.0575.

Table 2.1
Data Set of O-Rings with Stress along with the Predicted Target Value from the Linear Regression

Instance  Launch Temperature  Number of O-Rings with Stress
1   66  0
2   70  1
3   69  0
4   68  0
5   67  0
6   72  0
7   73  0
8   70  0
9   57  1
10  63  1
11  70  1
12  78  0
13  67  0
14  53  2
15  67  0
16  75  0
17  70  0
18  81  0
19  76  0
20  79  0
21  75  0
22  76  0
23  58  1

Table 2.2
Calculation for Estimating the Parameters of the Linear Model in Example 2.1

Instance  Launch Temperature  Number of O-Rings  xi − x̄  yi − ȳ  (xi − x̄)(yi − ȳ)  (xi − x̄)²
1   66  0  −3.57   −0.30   1.07    12.74
2   70  1  0.43    0.70    0.30    0.18
3   69  0  −0.57   −0.30   0.17    0.32
4   68  0  −1.57   −0.30   0.47    2.46
5   67  0  −2.57   −0.30   0.77    6.60
6   72  0  2.43    −0.30   −0.73   5.90
7   73  0  3.43    −0.30   −1.03   11.76
8   70  0  0.43    −0.30   −0.13   0.18
9   57  1  −12.57  0.70    −8.80   158.00
10  63  1  −6.57   0.70    −4.60   43.16
11  70  1  0.43    0.70    0.30    0.18
12  78  0  8.43    −0.30   −2.53   71.06
13  67  0  −2.57   −0.30   0.77    6.60
14  53  2  −16.57  1.70    −28.17  274.56
15  67  0  −2.57   −0.30   0.77    6.60
16  75  0  5.43    −0.30   −1.63   29.48
17  70  0  0.43    −0.30   −0.13   0.18
18  81  0  11.43   −0.30   −3.43   130.64
19  76  0  6.43    −0.30   −1.93   41.34
20  79  0  9.43    −0.30   −2.83   88.92
21  75  0  5.43    −0.30   −1.63   29.48
22  76  0  6.43    −0.30   −1.93   41.34
23  58  1  −11.57  0.70    −8.10   133.86
Sum      1600  7                   −62.98  1095.54
Average  x̄ = 69.57  ȳ = 0.30

Using Equation 2.12, we obtain:

β̂0 = ȳ − β̂1x̄ = 0.30 − (−0.0575)(69.57) = 4.30.

Hence, the linear regression model is

yi = 4.30 − 0.0575xi + εi.

The parameters in this linear regression model agree with β̂0 = 4.301587 and β̂1 = −0.05746 in Equation 1.1, which are obtained from Excel for the same data set. The small differences are caused by rounding in the calculation.
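As a check on the hand calculation, the estimates of Example 2.1 can be recomputed from the raw data in Table 2.1 (a short Python sketch); carrying full precision reproduces the Excel values β̂0 = 4.301587 and β̂1 = −0.05746 quoted above:

```python
# Launch Temperature and Number of O-rings with Stress for the
# 23 instances in Table 2.1.
temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
stress = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
          0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(stress) / n
# Equations 2.11 and 2.12 at full precision
beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, stress))
         / sum((x - x_bar) ** 2 for x in temps))
beta0 = y_bar - beta1 * x_bar
```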

2.3  Nonlinear Regression Models and Parameter Estimation


Nonlinear regression models are nonlinear in model parameters and take
the following general form:

yi = f(xi, b) + εi, (2.26)

where

xi = (1, xi,1, …, xi,p)′,  b = (β0, β1, …, βp)′,

and f is nonlinear in b. The exponential regression model given next is an example of nonlinear regression models:

yi = β0 + β1e^(β2xi) + εi. (2.27)

The logistic regression model given next is another example of nonlinear regression models:

yi = β0/(1 + β1e^(β2xi)) + εi. (2.28)

The least-squares method and the maximum likelihood method are used to estimate the parameters of a nonlinear regression model. Unlike Equations 2.9, 2.10, 2.20, and 2.21 for a linear regression model, the equations for a nonlinear regression model generally do not have analytical solutions because a nonlinear regression model is nonlinear in the parameters. Numerical search methods using an iterative search procedure, such as the Gauss–Newton method and the gradient descent search method, are used to determine the solution for the values of the estimated parameters. A detailed description of the Gauss–Newton method is given in Neter et al. (1996). Computer software programs in many statistical software packages are usually used to estimate the parameters of a nonlinear regression model because intensive computation is involved in a numerical search procedure.
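To make the iterative search concrete, here is a minimal Gauss–Newton sketch for the exponential model of Equation 2.27 (NumPy; the function name, starting values, and data are illustrative assumptions, not from the text). Each iteration linearizes the model through its Jacobian and solves the resulting linear least-squares step:

```python
import numpy as np

def gauss_newton_exp(x, y, beta, iterations=50):
    """Fit y = b0 + b1*exp(b2*x) by Gauss-Newton iteration.

    Starting from the guess beta = (b0, b1, b2), each iteration
    linearizes the model through its Jacobian and solves the
    resulting linear least-squares step.
    """
    b = np.asarray(beta, dtype=float)
    for _ in range(iterations):
        e = np.exp(b[2] * x)
        residual = y - (b[0] + b[1] * e)
        # Jacobian of the model with respect to (b0, b1, b2)
        J = np.column_stack([np.ones_like(x), e, b[1] * x * e])
        step = np.linalg.solve(J.T @ J, J.T @ residual)
        b = b + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return b

# Noise-free data generated from (b0, b1, b2) = (2, 3, 0.5);
# starting nearby, the iteration recovers the generating parameters.
x = np.linspace(0.0, 2.0, 20)
y = 2.0 + 3.0 * np.exp(0.5 * x)
b = gauss_newton_exp(x, y, beta=(1.8, 2.9, 0.45))
```

In practice a statistical package would be used instead, as the text notes; this sketch only illustrates the shape of the iteration.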

2.4  Software and Applications


Many statistical software packages, including the following, support build-
ing a linear or nonlinear regression model:

• Statistica (http://www.statsoft.com)
• SAS (http://www.sas.com)
• SPSS (http://www.ibm.com/software/analytics/spss/)

Applications of linear and nonlinear regression models are common in many


fields.

Exercises
2.1 Given the space shuttle data set in Table 2.1, use Equation 2.25 to esti-
mate the parameters of the following linear regression model:

yi = β0 + β1xi + εi,

where
xi is Launch Temperature
yi is Number of O-rings with Stress

Compute the sum of squared errors that are produced by the predicted
y values from the regression model.
2.2 Given the space shuttle data set in Table 2.1, use Equations 2.11 and 2.12
to estimate the parameters of the following linear regression model:

yi = β0 + β1xi + εi,

where
xi is Launch Temperature
yi is Number of O-rings with Stress

Compute the sum of squared errors that are produced by the predicted
y values from the regression model.
2.3 Use the data set found in Exercise 1.2 to build a linear regression model
and compute the sum of squared errors that are produced by the pre-
dicted y values from the regression model.

3
Naïve Bayes Classifier

A naïve Bayes classifier is based on the Bayes theorem. Hence, this chapter
first reviews the Bayes theorem and then describes naïve Bayes classifier. A
list of data mining software packages that support the learning of a naïve
Bayes classifier is provided. Some applications of naïve Bayes classifiers are
given with references.

3.1  Bayes Theorem


Given two events A and B, the conjunction (^) of the two events represents
the occurrence of both A and B. The probability, P(A ^ B) is computed using
the probability of A and B, P(A) and P(B), and the conditional probability of
A given B, P(A|B), or B given A, P(B|A):

P(A ^ B) = P(A|B)P(B) = P(B|A)P(A). (3.1)

The Bayes theorem is derived from Equation 3.1:

P (B|A ) P ( A )
P ( A|B) = . (3.2)
P (B)
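Equation 3.2 in code is a one-liner; the numbers below are hypothetical (P(B) would be obtained separately, e.g., by the law of total probability):

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) by the Bayes theorem (Equation 3.2)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.99, P(A) = 0.01, P(B) = 0.0594.
p_a_given_b = bayes(0.99, 0.01, 0.0594)
```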

3.2  Classification Based on the Bayes Theorem and Naïve Bayes Classifier

For a data vector x whose target class y needs to be determined, the maximum a posteriori (MAP) classification y of x is

yMAP = arg maxy∈Y P(y|x) = arg maxy∈Y P(y)P(x|y)/P(x) ≈ arg maxy∈Y P(y)P(x|y), (3.3)


where Y is the set of all target classes. The sign ≈ in Equation 3.3 is used because P(x) is the same for all y values and thus can be ignored when we compare P(y)P(x|y)/P(x) for all y values. P(x) is the prior probability that we observe x without any knowledge about what the target class of x is. P(y) is the prior probability that we expect y, reflecting our prior knowledge about the data set of x and the likelihood of the target class y in the data set without referring to any specific x. P(y|x) is the posterior probability of y given the observation of x. arg maxy∈Y P(y|x) compares the posterior probabilities of all target classes given x and chooses the target class y with the maximum posterior probability.
P(x|y) is the probability that we observe x if the target class is y. A clas-
sification y that maximizes P(x|y) among all target classes is the maximum
likelihood (ML) classification:

yML = arg maxy∈Y P(x|y). (3.4)

If P(y) = P(y′) for any y ≠ y′, y ∈ Y, y′ ∈ Y, then

yMAP ≈ arg maxy∈Y P(y)P(x|y) ≈ arg maxy∈Y P(x|y),

and thus yMAP = yML.
A naïve Bayes classifier is based on a MAP classification with the additional
assumption about the attribute variables x = (x1, …, xp) that these attribute
variables xis are independent of each other. With this assumption, we have
yMAP ≈ arg maxy∈Y P(y)P(x|y) = arg maxy∈Y P(y) ∏i=1..p P(xi|y). (3.5)

The naïve Bayes classifier estimates the probability terms in Equation 3.5 in
the following way:

P(y) = ny/n (3.6)

P(xi|y) = ny&xi/ny, (3.7)

where
n is the total number of data points in the training data set
ny is the number of data points with the target class y
ny&xi is the number of data points with the target class y and with the ith attribute variable taking the value of xi

An application of the naïve Bayes classifier is given in Example 3.1.

Example 3.1
Learn and use a naïve Bayes classifier for classifying whether or not a
manufacturing system is faulty using the values of the nine quality vari-
ables. The training data set in Table 3.1 gives a part of the data set in
Table 1.4 and includes nine single-fault cases and the nonfault case in a
manufacturing system. There are nine attribute variables for the qual-
ity of parts, (x1, …, x9), and one target variable y for the system fault.
Table 3.2 gives the test cases for some multiple-fault cases.
Using the training data set in Table 3.1, we compute the following:

n = 10
ny=1 = 9,  ny=0 = 1
ny=1&x1=1 = 1,  ny=1&x1=0 = 8,  ny=0&x1=1 = 0,  ny=0&x1=0 = 1
ny=1&x2=1 = 1,  ny=1&x2=0 = 8,  ny=0&x2=1 = 0,  ny=0&x2=0 = 1
ny=1&x3=1 = 1,  ny=1&x3=0 = 8,  ny=0&x3=1 = 0,  ny=0&x3=0 = 1
ny=1&x4=1 = 3,  ny=1&x4=0 = 6,  ny=0&x4=1 = 0,  ny=0&x4=0 = 1
ny=1&x5=1 = 2,  ny=1&x5=0 = 7,  ny=0&x5=1 = 0,  ny=0&x5=0 = 1
ny=1&x6=1 = 2,  ny=1&x6=0 = 7,  ny=0&x6=1 = 0,  ny=0&x6=0 = 1
ny=1&x7=1 = 5,  ny=1&x7=0 = 4,  ny=0&x7=1 = 0,  ny=0&x7=0 = 1
ny=1&x8=1 = 4,  ny=1&x8=0 = 5,  ny=0&x8=1 = 0,  ny=0&x8=0 = 1
ny=1&x9=1 = 3,  ny=1&x9=0 = 6,  ny=0&x9=1 = 0,  ny=0&x9=0 = 1

Table 3.1
Training Data Set for System Fault Detection
Attribute Variables Target Variable
Instance
Quality of Parts
(Faulty
Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9 System Fault y
1 (M1) 1 0 0 0 1 0 1 0 1 1
2 (M2) 0 1 0 1 0 0 0 1 0 1
3 (M3) 0 0 1 1 0 1 1 1 0 1
4 (M4) 0 0 0 1 0 0 0 1 0 1
5 (M5) 0 0 0 0 1 0 1 0 1 1
6 (M6) 0 0 0 0 0 1 1 0 0 1
7 (M7) 0 0 0 0 0 0 1 0 0 1
8 (M8) 0 0 0 0 0 0 0 1 0 1
9 (M9) 0 0 0 0 0 0 0 0 1 1
10 (none) 0 0 0 0 0 0 0 0 0 0

Table 3.2
Classification of Data Records in the Testing Data Set for System Fault Detection
Target Variable
Attribute Variables (Quality of Parts) (System Fault y)
Instance True Classified
(Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9 Value Value
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 1
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 1
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 1
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 1
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 1
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 1
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 1
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1

Instance #1 in Table 3.1 with x = (1, 0, 0, 0, 1, 0, 1, 0, 1) is classified as follows:

P(y = 1) ∏i=1..9 P(xi|y = 1)
= (ny=1/n) ∏i=1..9 (ny=1&xi/ny=1)
= (ny=1/n) × (ny=1&x1=1/ny=1) × (ny=1&x2=0/ny=1) × (ny=1&x3=0/ny=1) × (ny=1&x4=0/ny=1) × (ny=1&x5=1/ny=1) × (ny=1&x6=0/ny=1) × (ny=1&x7=1/ny=1) × (ny=1&x8=0/ny=1) × (ny=1&x9=1/ny=1)
= (9/10) × (1/9) × (8/9) × (8/9) × (6/9) × (2/9) × (7/9) × (5/9) × (5/9) × (3/9) > 0

P(y = 0) ∏i=1..9 P(xi|y = 0)
= (ny=0/n) ∏i=1..9 (ny=0&xi/ny=0)
= (ny=0/n) × (ny=0&x1=1/ny=0) × (ny=0&x2=0/ny=0) × (ny=0&x3=0/ny=0) × (ny=0&x4=0/ny=0) × (ny=0&x5=1/ny=0) × (ny=0&x6=0/ny=0) × (ny=0&x7=1/ny=0) × (ny=0&x8=0/ny=0) × (ny=0&x9=1/ny=0)
= (1/10) × (0/1) × (1/1) × (1/1) × (1/1) × (0/1) × (1/1) × (0/1) × (1/1) × (0/1) = 0

yMAP ≈ arg maxy∈Y P(y) ∏i=1..9 P(xi|y) = 1 (system is faulty).

Instances #2 through #9 in Table 3.1 and all the instances in Table 3.2 can be classified similarly to produce yMAP = 1, since for each of them there exists some xi = 1 with ny=0&xi=1/ny=0 = 0/1, which makes P(y = 0)P(x|y = 0) = 0. Instance #10 in Table 3.1 with x = (0, 0, 0, 0, 0, 0, 0, 0, 0) is classified as follows:

yMAP ≈ arg maxy∈Y P(y) ∏i=1..9 P(xi|y) = 0 (system is not faulty).

Hence, all the instances in Tables 3.1 and 3.2 are correctly classified by the naïve Bayes classifier.
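The counting estimates of Equations 3.6 and 3.7 and the MAP rule of Equation 3.5 can be reproduced with a short sketch over the training data of Table 3.1 (function name hypothetical):

```python
# Training data from Table 3.1: nine part-quality attributes x1..x9
# and the target y (system fault) for the ten instances.
training = [
    ([1, 0, 0, 0, 1, 0, 1, 0, 1], 1),
    ([0, 1, 0, 1, 0, 0, 0, 1, 0], 1),
    ([0, 0, 1, 1, 0, 1, 1, 1, 0], 1),
    ([0, 0, 0, 1, 0, 0, 0, 1, 0], 1),
    ([0, 0, 0, 0, 1, 0, 1, 0, 1], 1),
    ([0, 0, 0, 0, 0, 1, 1, 0, 0], 1),
    ([0, 0, 0, 0, 0, 0, 1, 0, 0], 1),
    ([0, 0, 0, 0, 0, 0, 0, 1, 0], 1),
    ([0, 0, 0, 0, 0, 0, 0, 0, 1], 1),
    ([0, 0, 0, 0, 0, 0, 0, 0, 0], 0),
]

def naive_bayes_classify(x, data):
    """MAP classification by Equations 3.5 through 3.7."""
    n = len(data)
    best_y, best_score = None, -1.0
    for y in sorted({t for _, t in data}):
        rows = [r for r, t in data if t == y]
        score = len(rows) / n                     # P(y), Equation 3.6
        for i, xi in enumerate(x):
            matches = sum(1 for r in rows if r[i] == xi)
            score *= matches / len(rows)          # P(xi|y), Equation 3.7
        if score > best_score:
            best_y, best_score = y, score
    return best_y

pred_faulty = naive_bayes_classify([1, 0, 0, 0, 1, 0, 1, 0, 1], training)
pred_normal = naive_bayes_classify([0, 0, 0, 0, 0, 0, 0, 0, 0], training)
```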

3.3  Software and Applications


The following software packages support the learning of a naïve Bayes classifier:

• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (http://www.mathworks.com), statistics toolbox

The naïve Bayes classifier has been successfully applied in many fields, includ-
ing text and document classification (http://www.cs.waikato.ac.nz/∼eibe/
pubs/FrankAndBouckaertPKDD06new.pdf).

Exercises
3.1 Build a naïve Bayes classifier to classify the target variable from the
attribute variable in the balloon data set in Table 1.1 and evaluate the
classification performance of the naïve Bayes classifier by computing
what percentage of the data records in the data set are classified correctly by the naïve Bayes classifier.
3.2 In the space shuttle O-ring data set in Table 1.2, consider the Leak-
Check Pressure as a categorical attribute with three categorical values
and the Number of O-rings with Stress as a categorical target variable
with three categorical values. Build a naïve Bayes classifier to classify
the Number of O-rings with Stress from the Leak-Check Pressure and
evaluate the classification performance of the naïve Bayes classifier by
computing what percentage of the data records in the data set are classified correctly by the naïve Bayes classifier.
3.3 Build a naïve Bayes classifier to classify the target variable from the
attribute variables in the lenses data set in Table 1.3 and evaluate the
classification performance of the naïve Bayes classifier by computing
what percentage of the data records in the data set are classified correctly by the naïve Bayes classifier.
3.4 Build a naïve Bayes classifier to classify the target variable from the
attribute variables in the data set found in Exercise 1.1 and evaluate the
classification performance of the naïve Bayes classifier by computing
what percentage of the data records in the data set are classified correctly by the naïve Bayes classifier.
4
Decision and Regression Trees

Decision and regression trees are used to learn classification and prediction patterns from data and express the relation of attribute variables x with a target variable y, y = F(x), in the form of a tree. A decision tree classifies the categorical target value of a data record using its attribute values. A regression tree predicts the numeric target value of a data record using its attribute values. In this chapter, we first define a binary decision tree and give the algorithm to learn a binary decision tree from a data set with categorical attribute variables and a categorical target variable. Then the method of learning a nonbinary decision tree is described. Additional concepts are introduced to handle numeric attribute variables and missing values of attribute variables, and to handle a numeric target variable for constructing a regression tree. A list of data mining software packages that support the learning of decision and regression trees is provided. Some applications of decision and regression trees are given with references.

4.1  Learning a Binary Decision Tree and Classifying Data Using a Decision Tree
In this section, we introduce the elements of a decision tree. The rationale
of seeking a decision tree with the minimum description length is provided
and followed by the split selection methods. Finally, the top-down construc-
tion of a decision tree is illustrated.

4.1.1 Elements of a Decision Tree


Table 4.1 gives a part of the data set for a manufacturing system shown in
Table 1.4. The data set in Table 4.1 includes nine attribute variables for the
quality of parts and one target variable for system fault. This data set is used
as the training data set to learn a binary decision tree for classifying whether
or not the system is faulty using the values of the nine quality variables.
Figure 4.1 shows the resulting binary decision tree to illustrate the elements
of the decision tree. How this decision tree is learned is explained later.
As shown in Figure 4.1, a binary decision tree is a graph with nodes.
The root node at the top of the tree consists of all data records in the training


Table 4.1
Data Set for System Fault Detection
Target
Attribute Variables Variable
Quality of Parts
Instance System
(Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9 Fault y
1 (M1) 1 0 0 0 1 0 1 0 1 1
2 (M2) 0 1 0 1 0 0 0 1 0 1
3 (M3) 0 0 1 1 0 1 1 1 0 1
4 (M4) 0 0 0 1 0 0 0 1 0 1
5 (M5) 0 0 0 0 1 0 1 0 1 1
6 (M6) 0 0 0 0 0 1 1 0 0 1
7 (M7) 0 0 0 0 0 0 1 0 0 1
8 (M8) 0 0 0 0 0 0 0 1 0 1
9 (M9) 0 0 0 0 0 0 0 0 1 1
10 (none) 0 0 0 0 0 0 0 0 0 0

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}: split on x7 = 0
    TRUE:  {2, 4, 8, 9, 10}: split on x8 = 0
        TRUE:  {9, 10}: split on x9 = 0
            TRUE:  {10}, y = 0
            FALSE: {9}, y = 1
        FALSE: {2, 4, 8}, y = 1
    FALSE: {1, 3, 5, 6, 7}, y = 1

Figure 4.1
Decision tree for system fault detection.

data set. For the data set of system fault detection, the root node contains a set with all the 10 data records in the training data set, {1, 2, …, 10}. Note that the numbers in the data set are the instance numbers. The root node is split into two subsets, {2, 4, 8, 9, 10} and {1, 3, 5, 6, 7}, using the attribute variable x7 and its two categorical values, x7 = 0 and x7 = 1. All the instances in the subset {2, 4, 8, 9, 10} have x7 = 0. All the instances in the subset {1, 3, 5, 6, 7} have x7 = 1. Each subset is represented as a node in the decision tree. A Boolean expression is used in the decision tree: x7 = 0 is expressed by the condition x7 = 0 being TRUE, and x7 = 1 by the condition x7 = 0 being FALSE. x7 = 0 is called a split condition or a split criterion, and its TRUE and FALSE values allow a binary split of the set at
the root node into two branches with a node at the end of each branch. Each
of the two new nodes can be further divided using one of the remaining
attribute variables in the split criterion. A node cannot be further divided
if the data records in the data set at this node have the same value of the
target variable. Such a node becomes a leaf node in the decision tree. Except
the root node and leaf nodes, all other nodes in the decision trees are called
internal nodes.
The decision tree can classify a data record by passing the data record
through the decision tree using the attribute values in the data record. For
example, the data record for instance 10 is first checked with the first split
condition at the root node. With x7 = 0, the data record is passed down to the
left branch. With x8 = 0 and then x9 = 0, the data record is passed down to
the left-most leaf node. The data record takes the target value for this leaf
node, y = 0, which classifies the data record as not faulty.
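The classification walk just described can be written as nested conditions that mirror Figure 4.1 (a sketch; the function interface is illustrative):

```python
def classify(x7, x8, x9):
    """Pass a record down the decision tree of Figure 4.1.

    Only x7, x8, and x9 are tested on any root-to-leaf path.
    Returns the system fault value y.
    """
    if x7 == 0:
        if x8 == 0:
            if x9 == 0:
                return 0        # leaf {10}: not faulty
            return 1            # leaf {9}
        return 1                # leaf {2, 4, 8}
    return 1                    # leaf {1, 3, 5, 6, 7}

y_instance_10 = classify(x7=0, x8=0, x9=0)   # classified as not faulty
y_instance_1 = classify(x7=1, x8=0, x9=1)    # classified as faulty
```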

4.1.2  Decision Tree with the Minimum Description Length


Starting with the root node with all the data records in the training data set,
there are nine possible ways to split the root node using the nine attribute
variables individually in the split condition. For each node at the end of a
branch from the split of the root node, there are eight possible ways to split
the node using each of the remaining eight attribute variables individually.
This process continues and can result in many possible decision trees. All
the possible decision trees differ in their size and complexity. A decision
tree can be large to have as many leaf nodes as data records in the training
data set with each leaf node containing each data record. Which one of all
the possible decision trees should be used to represent F, the relation of the
attribute variables with the target variable? A decision tree algorithm aims
at obtaining the smallest decision tree that can capture F, that is, the decision
tree that requires the minimum description length. Given both the small-
est decision tree and a larger decision tree that classify all the data records
in the training data set correctly, it is expected that the smallest decision
tree generalizes classification patterns better than the larger decision tree
and better-generalized classification patterns allow the better classification
of more data points including those not in the training data set. Consider a
large decision tree that has as many leaf nodes as data records in the train-
ing data set with each leaf node containing each data record. Although this
large decision tree classifies all the training data records correctly, it may
perform poorly in classifying new data records not in the training data set.
Those new data records have different sets of attribute values from those of
data records in the training data set and thus do not follow the same paths
of the data records to leaf nodes in the decision tree. We need a decision tree
that captures generalized classification patterns for the F relation. The more


generalized the F relation, the smaller description length it has because it


eliminates specific differences among individual data records. Hence, the
smaller a decision tree is, the more generalization capacity the decision tree
is expected to have.

4.1.3  Split Selection Methods


With the goal of seeking a decision tree with the minimum description
length, we need to know how to split a node so that we can achieve the goal
of obtaining the decision tree with the minimum description length. Take an
example of learning a decision tree from the data set in Table 4.1. There are
nine possible ways to split the root node using the nine attribute variables
individually, as shown in Table 4.2.
Which one of the nine split criteria should we use so we will obtain the
smallest decision tree? A common approach of split selection is to select
the split that produces the most homogeneous subsets. A homogenous data
set is a data set whose data records have the same target value. There are
various measures of data homogeneity: information entropy, gini-index, etc.
(Breiman et al., 1984; Quinlan, 1986; Ye, 2003).
Information entropy is originally introduced to measure the number of bits of
information needed to encode data. Information entropy is defined as follows:

entropy(D) = Σi=1..c −Pi log2 Pi (4.1)

−0 log2 0 = 0 (4.2)

Σi=1..c Pi = 1, (4.3)

where
D denotes a given data set
c denotes the number of different target values
Pi denotes the probability that a data record in the data set takes the ith
target value

An entropy value falls in the range [0, log2 c]. For example, given the data set in Table 4.1, we have c = 2 (for two target values, y = 0 and y = 1), P1 = 9/10 = 0.9 (9 of the 10 records with y = 1), P2 = 1/10 = 0.1 (1 of the 10 records with y = 0), and

entropy(D) = Σi=1..2 −Pi log2 Pi = −0.9 log2 0.9 − 0.1 log2 0.1 = 0.47.

Table 4.2
Binary Split of the Root Node and Calculation of Information Entropy for the Data Set of System Fault Detection

x1 = 0: TRUE or FALSE. Resulting subsets: {2, 3, 4, 5, 6, 7, 8, 9, 10}, {1}.
entropy(S) = (9/10)entropy(Dtrue) + (1/10)entropy(Dfalse) = (9/10) × [−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10) × 0 = 0.45

x2 = 0: TRUE or FALSE. Resulting subsets: {1, 3, 4, 5, 6, 7, 8, 9, 10}, {2}.
entropy(S) = (9/10) × [−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10) × 0 = 0.45

x3 = 0: TRUE or FALSE. Resulting subsets: {1, 2, 4, 5, 6, 7, 8, 9, 10}, {3}.
entropy(S) = (9/10) × [−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10) × 0 = 0.45

x4 = 0: TRUE or FALSE. Resulting subsets: {1, 5, 6, 7, 8, 9, 10}, {2, 3, 4}.
entropy(S) = (7/10) × [−(6/7)log2(6/7) − (1/7)log2(1/7)] + (3/10) × 0 = 0.41

x5 = 0: TRUE or FALSE. Resulting subsets: {2, 3, 4, 6, 7, 8, 9, 10}, {1, 5}.
entropy(S) = (8/10) × [−(7/8)log2(7/8) − (1/8)log2(1/8)] + (2/10) × 0 = 0.43

x6 = 0: TRUE or FALSE. Resulting subsets: {1, 2, 4, 5, 7, 8, 9, 10}, {3, 6}.
entropy(S) = (8/10) × [−(7/8)log2(7/8) − (1/8)log2(1/8)] + (2/10) × 0 = 0.43

x7 = 0: TRUE or FALSE. Resulting subsets: {2, 4, 8, 9, 10}, {1, 3, 5, 6, 7}.
entropy(S) = (5/10) × [−(4/5)log2(4/5) − (1/5)log2(1/5)] + (5/10) × 0 = 0.36

x8 = 0: TRUE or FALSE. Resulting subsets: {1, 5, 6, 7, 9, 10}, {2, 3, 4, 8}.
entropy(S) = (6/10) × [−(5/6)log2(5/6) − (1/6)log2(1/6)] + (4/10) × 0 = 0.39

x9 = 0: TRUE or FALSE. Resulting subsets: {2, 3, 4, 6, 7, 8, 10}, {1, 5, 9}.
entropy(S) = (7/10) × [−(6/7)log2(6/7) − (1/7)log2(1/7)] + (3/10) × 0 = 0.41

Figure 4.2 shows how the entropy value changes with P1 (P2 = 1 − P1) when
c = 2. Especially, we have

• P1 = 0.5, P2 = 0.5, entropy(D) = 1


• P1 = 0, P2 = 1, entropy(D) = 0
• P1 = 1, P2 = 0, entropy(D) = 0

If all the data records in a data set take one target value, we have P1 = 0, P2 = 1 or
P1 = 1, P2 = 0, and the value of information entropy is 0, that is, we need 0 bit of
information because we already know the target value that all the data records
take. Hence, the entropy value of 0 indicates that the data set is homogenous

Figure 4.2
Information entropy entropy(D) as a function of P1, with P2 = 1 − P1.

with regard to the target value. If one half set of the data records in a data
set takes one target value and the other half set takes another target value,
we have P1 = 0.5, P2 = 0.5, and the value of information entropy is 1, meaning
that we need 1 bit of information to convey what target value is. Hence, the
entropy value of 1 indicates that the data set is inhomogeneous. When we use
the information entropy to measure data homogeneity, the lower the entropy
value is, the more homogenous the data set is with regard to the target value.
After a split of a data set into several subsets, the following formula is used
to compute the average information entropy of subsets:


Dv
entropy (S) = entropy (Dv ) , (4.4)
v ∈Values( S )
D

where
S denotes the split
Values(S) denotes a set of values that are used in the split
v denotes a value in Values(S)
D denotes the data set being split
|D| denotes the number of data records in the data set D
Dv denotes the subset resulting from the split using the split value v
|Dv| denotes the number of data records in the data set Dv

For example, the root node of a decision tree for the data set in Table 4.1
has the data set, D = {1, 2, …, 10}, whose entropy value is 0.47 as shown previ-
ously. Using the split criterion, x1 = 0: TRUE or FALSE, the root node is split
into two subsets: Dfalse = {1}, which is homogenous, and Dtrue = {2, 3, 4, 5, 6,
7, 8, 9, 10}, which is inhomogeneous with eight data records taking the tar-
get value of 1 and one data record taking the target value of 0. The average
entropy of the two subsets after the split is

    entropy(S) = (9/10) entropy(Dtrue) + (1/10) entropy(Dfalse)
    = (9/10) × [−(8/9) log₂(8/9) − (1/9) log₂(1/9)] + (1/10) × 0 = 0.45.

Since this average entropy of subsets after the split, 0.45, is smaller than entropy(D) =
0.47, the split improves data homogeneity. Table 4.2 gives the average entropy of
subsets after each of the other eight splits of the root node. Among the nine pos-
sible splits, the split using the criterion of x7 = 0: TRUE or FALSE produces the
smallest average information entropy, which indicates the most homogeneous
subsets. Hence, the split criterion of x7 = 0: TRUE or FALSE is selected to split the
root node, resulting in two internal nodes as shown in Figure 4.1. The internal
node with the subset, {2, 4, 8, 9, 10}, is not homogenous. Hence, the decision tree
is further expanded with more splits until all leaf nodes are homogenous.
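These entropy calculations can be reproduced with a short Python sketch (a minimal illustration; the function names are our own, and the target values are those of the ten training records, where instances 1 through 9 have y = 1 and instance 10 has y = 0):

```python
import math

def entropy(labels):
    """Information entropy of a list of target values."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def average_split_entropy(subsets):
    """Average entropy of the subsets produced by a split (Formula 4.4)."""
    total = sum(len(s) for s in subsets)
    return sum((len(s) / total) * entropy(s) for s in subsets)

# Target values of the ten training records.
y = {i: 1 for i in range(1, 10)}
y[10] = 0

# Split x1 = 0: D_true = {2, ..., 10}, D_false = {1}
print(round(average_split_entropy([[y[i] for i in range(2, 11)], [y[1]]]), 2))  # prints 0.45

# Split x7 = 0: D_true = {2, 4, 8, 9, 10}, D_false = {1, 3, 5, 6, 7}
print(round(average_split_entropy([[y[i] for i in (2, 4, 8, 9, 10)],
                                   [y[i] for i in (1, 3, 5, 6, 7)]]), 2))       # prints 0.36
```

The x7 split yields the smallest average entropy among the candidate splits, which is why it is selected at the root node.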

The gini-index, another measure of data homogeneity, is defined as follows:

    gini(D) = 1 − ∑_{i=1}^{c} Pi².    (4.5)

For example, given the data set in Table 4.1, we have c = 2, P1 = 0.9, P2 = 0.1, and

    gini(D) = 1 − ∑_{i=1}^{c} Pi² = 1 − 0.9² − 0.1² = 0.18.

The gini-index values are computed for c = 2 and the following values of Pi:

• P1 = 0.5, P2 = 0.5, gini(D) = 1 − 0.5² − 0.5² = 0.5
• P1 = 0, P2 = 1, gini(D) = 1 − 0² − 1² = 0
• P1 = 1, P2 = 0, gini(D) = 1 − 1² − 0² = 0

Hence, the smaller the gini-index value is, the more homogeneous the data
set is. The average gini-index value of data subsets after a split is calculated
as follows:


    gini(S) = ∑_{v ∈ Values(S)} (|Dv|/|D|) gini(Dv).    (4.6)

Table 4.3 gives the average gini-index value of subsets after each of the nine
splits of the root node for the training data set of system fault detection.
Among the nine possible splits, the split criterion of x7 = 0: TRUE or FALSE
produces the smallest average gini-index value, which indicates the most
homogeneous subsets. The split criterion of x7 = 0: TRUE or FALSE is selected
to split the root node. Hence, using the gini-index produces the same split as
using the information entropy.
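The gini-index computations follow the same pattern; a minimal Python sketch (function names are ours) reproduces the root-node value of 0.18 and the average gini-index of 0.16 for the x7 split:

```python
def gini(labels):
    """Gini-index of a list of target values (Formula 4.5)."""
    n = len(labels)
    return 1 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def average_split_gini(subsets):
    """Average gini-index of the subsets produced by a split (Formula 4.6)."""
    total = sum(len(s) for s in subsets)
    return sum((len(s) / total) * gini(s) for s in subsets)

# Root node of the system fault data set: nine records with y = 1, one with y = 0.
print(round(gini([1] * 9 + [0]), 2))  # prints 0.18

# Split x7 = 0: D_true = {2, 4, 8, 9, 10} (y = 1, 1, 1, 1, 0),
#               D_false = {1, 3, 5, 6, 7} (all y = 1)
print(round(average_split_gini([[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]), 2))  # prints 0.16
```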

4.1.4 Algorithm for the Top-Down Construction of a Decision Tree


This section describes and illustrates the algorithm of constructing a com-
plete decision tree. The algorithm for the top-down construction of a binary
decision tree has the following steps:

1. Start with the root node that includes all the data records in the
training data set and select this node to split.
2. Apply a split selection method to the selected node to determine the
best split along with the split criterion and partition the set of the
training data records at the selected node into two nodes with two
subsets of data records, respectively.
3. Check if the stopping criterion is satisfied. If so, the tree construction
is completed; otherwise, go back to Step 2 to continue by selecting a
node to split.
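The three steps above can be sketched as a short recursive Python program; the dictionary-based tree representation and function names are our own, and the training records are reconstructed from the split subsets listed in Table 4.3 (instances 1 through 9 have y = 1, instance 10 has y = 0). Running it reproduces the tree of Figure 4.1, selecting x7, x8, and x9 in turn:

```python
import math

def entropy(rows):
    labels = [r["y"] for r in rows]
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_split(rows, attrs):
    """Step 2: choose the attribute whose split (x = 0: TRUE or FALSE)
    gives the smallest average entropy of the resulting subsets."""
    best = None
    for a in attrs:
        true_part = [r for r in rows if r[a] == 0]
        false_part = [r for r in rows if r[a] != 0]
        if not true_part or not false_part:
            continue  # this criterion does not actually split the node
        n = len(rows)
        avg = (len(true_part) / n) * entropy(true_part) \
            + (len(false_part) / n) * entropy(false_part)
        if best is None or avg < best[0]:
            best = (avg, a, true_part, false_part)
    return best

def build_tree(rows, attrs):
    """Steps 1-3: split nodes recursively until every leaf is homogeneous."""
    if entropy(rows) == 0:                  # stopping criterion: homogeneous node
        return {"leaf": rows[0]["y"]}
    split = best_split(rows, attrs)
    if split is None:                       # no remaining criterion splits the node
        labels = [r["y"] for r in rows]
        return {"leaf": max(set(labels), key=labels.count)}
    _, a, t, f = split
    rest = [b for b in attrs if b != a]
    return {"attr": a, "true": build_tree(t, rest), "false": build_tree(f, rest)}

# For each attribute, the instances below have the value 1; all others have 0
# (reconstructed from the resulting subsets in Table 4.3).
ONES = {"x1": {1}, "x2": {2}, "x3": {3}, "x4": {2, 3, 4}, "x5": {1, 5},
        "x6": {3, 6}, "x7": {1, 3, 5, 6, 7}, "x8": {2, 3, 4, 8}, "x9": {1, 5, 9}}
DATA = [{**{a: int(i in s) for a, s in ONES.items()}, "y": int(i != 10)}
        for i in range(1, 11)]

tree = build_tree(DATA, sorted(ONES))
print(tree["attr"], tree["true"]["attr"], tree["true"]["true"]["attr"])  # prints x7 x8 x9
```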

Table 4.3
Binary Split of the Root Node and Calculation of the Gini-Index
for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Gini-Index Value of Split

x1 = 0: TRUE or FALSE    {2, 3, 4, 5, 6, 7, 8, 9, 10}, {1}
    gini(S) = (9/10) gini(Dtrue) + (1/10) gini(Dfalse)
    = (9/10) × [1 − (8/9)² − (1/9)²] + (1/10) × 0 = 0.18

x2 = 0: TRUE or FALSE    {1, 3, 4, 5, 6, 7, 8, 9, 10}, {2}
    gini(S) = (9/10) gini(Dtrue) + (1/10) gini(Dfalse)
    = (9/10) × [1 − (8/9)² − (1/9)²] + (1/10) × 0 = 0.18

x3 = 0: TRUE or FALSE    {1, 2, 4, 5, 6, 7, 8, 9, 10}, {3}
    gini(S) = (9/10) gini(Dtrue) + (1/10) gini(Dfalse)
    = (9/10) × [1 − (8/9)² − (1/9)²] + (1/10) × 0 = 0.18

x4 = 0: TRUE or FALSE    {1, 5, 6, 7, 8, 9, 10}, {2, 3, 4}
    gini(S) = (7/10) gini(Dtrue) + (3/10) gini(Dfalse)
    = (7/10) × [1 − (6/7)² − (1/7)²] + (3/10) × 0 = 0.17

x5 = 0: TRUE or FALSE    {2, 3, 4, 6, 7, 8, 9, 10}, {1, 5}
    gini(S) = (8/10) gini(Dtrue) + (2/10) gini(Dfalse)
    = (8/10) × [1 − (7/8)² − (1/8)²] + (2/10) × 0 = 0.175

x6 = 0: TRUE or FALSE    {1, 2, 4, 5, 7, 8, 9, 10}, {3, 6}
    gini(S) = (8/10) gini(Dtrue) + (2/10) gini(Dfalse)
    = (8/10) × [1 − (7/8)² − (1/8)²] + (2/10) × 0 = 0.175

x7 = 0: TRUE or FALSE    {2, 4, 8, 9, 10}, {1, 3, 5, 6, 7}
    gini(S) = (5/10) gini(Dtrue) + (5/10) gini(Dfalse)
    = (5/10) × [1 − (4/5)² − (1/5)²] + (5/10) × 0 = 0.16

x8 = 0: TRUE or FALSE    {1, 5, 6, 7, 9, 10}, {2, 3, 4, 8}
    gini(S) = (6/10) gini(Dtrue) + (4/10) gini(Dfalse)
    = (6/10) × [1 − (5/6)² − (1/6)²] + (4/10) × 0 = 0.167

x9 = 0: TRUE or FALSE    {2, 3, 4, 6, 7, 8, 10}, {1, 5, 9}
    gini(S) = (7/10) gini(Dtrue) + (3/10) gini(Dfalse)
    = (7/10) × [1 − (6/7)² − (1/7)²] + (3/10) × 0 = 0.17

The stopping criterion based on data homogeneity is to stop when each


leaf node has homogeneous data, that is, a set of data records with the same
target value. Many large sets of real-world data are noisy, making it difficult
to obtain homogeneous data sets at leaf nodes. Hence, the stopping criterion
is often set so that the measure of data homogeneity at a node only needs to be smaller
than a threshold value, e.g., entropy(D) < 0.1.
We show the construction of the complete binary decision tree for the data
set of system fault detection next.

Example 4.1
Construct a binary decision tree for the data set of system fault detection
in Table 4.1.
We first use the information entropy as the measure of data homoge-
neity. As shown in Figure 4.1, the data set at the root node is partitioned
into two subsets, {2, 4, 8, 9, 10}, and {1, 3, 5, 6, 7}, which are already homo-
geneous with the target value, y = 1, and do not need a split. For the
subset, D = {2, 4, 8, 9, 10},

    entropy(D) = ∑_{i=1}^{2} −Pi log₂ Pi = −(1/5) log₂(1/5) − (4/5) log₂(4/5) = 0.72.


Except x7, which has been used to split the root node, the other eight
attribute variables, x1, x2, x3, x4, x5, x6, x8, and x9, can be used to split D.
The split criteria using x1 = 0, x3 = 0, x5 = 0, and x6 = 0 do not produce a
split of D. Table 4.4 gives the calculation of information entropy for the
splits using x2, x4, x8, and x9. Since the split criterion, x8 = 0: TRUE or
FALSE, produces the smallest average entropy of the split, this split cri-
terion is selected to split D = {2, 4, 8, 9, 10} into {9, 10} and {2, 4, 8}, which
are already homogeneous with the target value, y = 1, and do not need a
split. Figure 4.1 shows this split.
For the subset, D = {9, 10},
    entropy(D) = ∑_{i=1}^{2} −Pi log₂ Pi = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1.

Except x7 and x8, which have been used in the splits leading to this node, the other
seven attribute variables, x1, x2, x3, x4, x5, x6, and x9, can be used to split D.
The split criteria using x1 = 0, x2 = 0, x3 = 0, x4 = 0, x5 = 0, and x6 = 0 do
not produce a split of D. The split criterion of x9 = 0: TRUE or FALSE,

Table 4.4
Binary Split of an Internal Node with D = {2, 4, 8, 9, 10} and Calculation
of Information Entropy for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Information Entropy of Split

x2 = 0: TRUE or FALSE    {4, 8, 9, 10}, {2}
    entropy(S) = (4/5) entropy(Dtrue) + (1/5) entropy(Dfalse)
    = (4/5) × [−(3/4) log₂(3/4) − (1/4) log₂(1/4)] + (1/5) × 0 = 0.64

x4 = 0: TRUE or FALSE    {8, 9, 10}, {2, 4}
    entropy(S) = (3/5) entropy(Dtrue) + (2/5) entropy(Dfalse)
    = (3/5) × [−(2/3) log₂(2/3) − (1/3) log₂(1/3)] + (2/5) × 0 = 0.55

x8 = 0: TRUE or FALSE    {9, 10}, {2, 4, 8}
    entropy(S) = (2/5) entropy(Dtrue) + (3/5) entropy(Dfalse)
    = (2/5) × [−(1/2) log₂(1/2) − (1/2) log₂(1/2)] + (3/5) × 0 = 0.4

x9 = 0: TRUE or FALSE    {2, 4, 8, 10}, {9}
    entropy(S) = (4/5) entropy(Dtrue) + (1/5) entropy(Dfalse)
    = (4/5) × [−(3/4) log₂(3/4) − (1/4) log₂(1/4)] + (1/5) × 0 = 0.64

produces two subsets, {9} with the target value of y = 1, and {10} with the
target value of y = 0, which are homogeneous and do not need a split.
Figure 4.1 shows this split. Since all leaf nodes of the decision tree are
homogeneous, the construction of the decision tree is stopped with the
complete decision tree shown in Figure 4.1.
We now show the construction of the decision tree using the gini-
index as the measure of data homogeneity. As described previously, the
data set at the root node is partitioned into two subsets, {2, 4, 8, 9, 10} and
{1, 3, 5, 6, 7}, which are already homogeneous with the target value, y = 1,
and do not need a split. For the subset, D = {2, 4, 8, 9, 10},
    gini(D) = 1 − ∑_{i=1}^{c} Pi² = 1 − (4/5)² − (1/5)² = 0.32.

The split criteria using x1 = 0, x3 = 0, x5 = 0, and x6 = 0 do not produce a split
of D. Table 4.5 gives the calculation of the gini-index values for the splits

Table 4.5
Binary Split of an Internal Node with D = {2, 4, 8, 9, 10} and Calculation
of the Gini-Index Values for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Gini-Index Value of Split

x2 = 0: TRUE or FALSE    {4, 8, 9, 10}, {2}
    gini(S) = (4/5) gini(Dtrue) + (1/5) gini(Dfalse)
    = (4/5) × [1 − (3/4)² − (1/4)²] + (1/5) × 0 = 0.3

x4 = 0: TRUE or FALSE    {8, 9, 10}, {2, 4}
    gini(S) = (3/5) gini(Dtrue) + (2/5) gini(Dfalse)
    = (3/5) × [1 − (2/3)² − (1/3)²] + (2/5) × 0 = 0.27

x8 = 0: TRUE or FALSE    {9, 10}, {2, 4, 8}
    gini(S) = (2/5) gini(Dtrue) + (3/5) gini(Dfalse)
    = (2/5) × [1 − (1/2)² − (1/2)²] + (3/5) × 0 = 0.2

x9 = 0: TRUE or FALSE    {2, 4, 8, 10}, {9}
    gini(S) = (4/5) gini(Dtrue) + (1/5) gini(Dfalse)
    = (4/5) × [1 − (3/4)² − (1/4)²] + (1/5) × 0 = 0.3

using x2, x4, x8, and x9. Since the split criterion, x8 = 0: TRUE or FALSE,
produces the smallest average gini-index value of the split, this split crite-
rion is selected to split D = {2, 4, 8, 9, 10} into {9, 10} and {2, 4, 8}, which are
already homogeneous with the target value, y = 1, and do not need a split.
For the subset, D = {9, 10},
    gini(D) = 1 − ∑_{i=1}^{c} Pi² = 1 − (1/2)² − (1/2)² = 0.5.

Except x7 and x8, which have been used in the splits leading to this node, the other
seven attribute variables, x1, x2, x3, x4, x5, x6, and x9, can be used to split
D. The split criteria using x1 = 0, x2 = 0, x3 = 0, x4 = 0, x5 = 0, and x6 = 0 do
not produce a split of D. The split criterion of x9 = 0: TRUE or FALSE, pro-
duces two subsets, {9} with the target value of y = 1, and {10} with the tar-
get value of y = 0, which are homogeneous and do not need a split. Since
all leaf nodes of the decision tree are homogeneous, the construction of
the decision tree is stopped with the complete decision tree, which is
the same as the decision tree from using the information entropy as the
measure of data homogeneity.

4.1.5  Classifying Data Using a Decision Tree


A decision tree is used to classify a data record by passing the data record
into a leaf node of the decision tree using the values of the attribute vari-
ables and assigning the target value of the leaf node to the data record.
Figure 4.3 highlights in bold the path of passing the training data record
of instance 10 in Table 4.1 from the root node to a leaf node with the target
value, y = 0. Hence, the data record is classified to have no system fault. For
the data records in the testing data set of system fault detection in Table 4.6,

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
x7 = 0

TRUE FALSE

{2, 4, 8, 9, 10} {1, 3, 5, 6, 7}


x8 = 0 y=1

TRUE FALSE

{9, 10} {2, 4, 8}


x9 = 0 y=1

TRUE FALSE

{10} {9}
y=0 y=1

Figure 4.3
Classifying a data record for no system fault using the decision tree for system fault detection.


Table 4.6
Classification of Data Records in the Testing Data Set for System Fault Detection

Instance (Faulty Machine) | x1 x2 x3 x4 x5 x6 x7 x8 x9 (Attribute Variables: Quality of Parts) | True Value, Classified Value (Target Variable y: System Faults)
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 1
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 1
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 1
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 1
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 1
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 1
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 1
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
x7 = 0

TRUE FALSE

{2, 4, 8, 9, 10} {1, 3, 5, 6, 7}


x8 = 0 y=1

TRUE FALSE

{9, 10} {2, 4, 8}


x9 = 0 y=1
TRUE FALSE

{10} {9}
y=0 y=1

Figure 4.4
Classifying a data record for multiple machine faults using the decision tree for system fault
detection.

their target values are obtained using the decision tree in Figure 4.1 and are
shown in Table 4.6. Figure 4.4 highlights the path of passing a testing data
record for instance 1 in Table 4.6 from the root node to a leaf node with the
target value, y = 1. Hence, the data record is classified to have a system fault.
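This traversal is easy to sketch in Python; the nested-dictionary encoding of the tree in Figure 4.1 and the function name are our own:

```python
# Decision tree of Figure 4.1 encoded as nested dicts:
# each internal node tests whether an attribute equals 0.
TREE = {"attr": "x7",
        "true": {"attr": "x8",
                 "true": {"attr": "x9",
                          "true": {"leaf": 0},   # x7 = x8 = x9 = 0 -> no system fault
                          "false": {"leaf": 1}},
                 "false": {"leaf": 1}},
        "false": {"leaf": 1}}

def classify(record, node=TREE):
    """Pass a record down the tree and return the target value of its leaf."""
    while "leaf" not in node:
        node = node["true"] if record[node["attr"]] == 0 else node["false"]
    return node["leaf"]

# Training instance 10 (all parts pass quality inspection): classified as no fault.
print(classify({"x7": 0, "x8": 0, "x9": 0}))   # prints 0
# Testing instance 1 in Table 4.6 has x7 = 1: classified as a system fault.
print(classify({"x7": 1, "x8": 1, "x9": 1}))   # prints 1
```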

4.2  Learning a Nonbinary Decision Tree


In the lenses data set in Table 1.3, the attribute variable, Age, has three categorical
values: Young, Pre-presbyopic, and Presbyopic. If we want to construct a
binary decision tree for this data set, we need to convert the three categorical
values of the attribute variable, Age, into two categorical values if Age is used
to split the root node. We may have Young and Pre-presbyopic together as
one category, Presbyopic as another category, and Age  =  Presbyopic: TRUE
or FALSE as the split criterion. We may also have Young as one category, Pre-
presbyopic and Presbyopic together as another category, and Age = Young:
TRUE or FALSE as the split criterion. However, we can construct a nonbinary
decision tree to allow partitioning a data set at a node into more than two
subsets by using each of multiple categorical values for each branch of the
split. Example 4.2 shows the construction of a nonbinary decision tree for the
lenses data set.
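Split selection works the same way for a nonbinary split: the average entropy of Formula 4.4 is simply taken over more than two branches. A small Python sketch (function names are ours; the per-branch class counts are those listed in Table 4.7) confirms that the Tear Production Rate split is the most homogeneous choice at the root of the lenses tree:

```python
import math

def branch_entropy(counts):
    """Entropy of one branch, given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def average_entropy(branches):
    """Average entropy over all branches of a (possibly nonbinary) split."""
    total = sum(sum(b) for b in branches)
    return sum((sum(b) / total) * branch_entropy(b) for b in branches)

# Per-branch counts of (Non-Contact, Soft-Contact, Hard-Contact) records, from Table 4.7.
splits = {
    "Age": [[4, 2, 2], [5, 2, 1], [6, 1, 1]],
    "Spectacle Prescription": [[7, 2, 3], [8, 3, 1]],
    "Astigmatic": [[7, 5, 0], [8, 4, 0]],
    "Tear Production Rate": [[12, 0, 0], [3, 5, 4]],
}
for name, branches in splits.items():
    print(name, round(average_entropy(branches), 4))
# The smallest value, 0.7773, is obtained for Tear Production Rate.
```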

Example 4.2
Construct a nonbinary decision tree for the lenses data set in Table 1.3.
If the attribute variable, Age, is used to split the root node for the
lenses data set, all three categorical values of Age can be used to par-
tition the set of 24 data records at the root node using the split crite-
rion, Age = Young, Pre-presbyopic, or Presbyopic, as shown in Figure 4.5.
We use the data set of 24 data records in Table 1.3 as the training data
set, D, at the root node of the nonbinary decision tree. In the lenses
data set, the target variable has three categorical values, Non-Contact
in 15 data records, Soft-Contact in 5 data records, and Hard-Contact in
4 data records. Using the information entropy as the measure of data
homogeneity, we have

    entropy(D) = ∑_{i=1}^{3} −Pi log₂ Pi
    = −(15/24) log₂(15/24) − (5/24) log₂(5/24) − (4/24) log₂(4/24) = 1.3261.

Table 4.7 shows the calculation of information entropy to split the
root node using the split criterion, Tear Production Rate = Reduced
or Normal, which produces a homogeneous subset of {1, 3, 5, 7, 9, 11,
13, 15, 17, 19, 21, 23} and an inhomogeneous subset of {2, 4, 6, 8, 10, 12,
14, 16, 18, 20, 22, 24}. Table 4.8 shows the calculation of information

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24}
Tear production rate = ?

Reduced Normal

{1, 3, 5, 7, 9, 11, 13, 15, {2, 4, 6, 8, 10, 12, 14, 16,


17, 19, 21, 23} 18, 20, 22, 24}
Lenses = Non-contact Astigmatic = ?
No Yes

{2, 6, 10, 14, 18, 22} {4, 8, 12, 16, 20, 24}
Age = ? Spectacle prescription = ?

Young Presbyopic Myope Hypermetrope


Pre-presbyopic
{2, 6} {10, 14} {18, 22} {4, 12, 20} {8, 16, 24}
Lenses = Soft contact Lenses = Soft contact Spectacle prescription = ? Lenses = Hard contact Age = ?

Myope Hypermetrope Young Pre-presbyopic Presbyopic

{18} {22} {8} {16} {24}


Lenses = Non-contact Lenses = Soft contact Lenses = Hard contact Lenses = Non-contact Lenses = Non-contact

Figure 4.5
Decision tree for the lenses data set.

Table 4.7
Nonbinary Split of the Root Node and Calculation of Information Entropy
for the Lenses Data Set

Split Criterion    Resulting Subsets and Average Information Entropy of Split

Age = Young, Pre-presbyopic, or Presbyopic
    {1, 2, 3, 4, 5, 6, 7, 8}, {9, 10, 11, 12, 13, 14, 15, 16}, {17, 18, 19, 20, 21, 22, 23, 24}
    entropy(S) = (8/24) entropy(DYoung) + (8/24) entropy(DPre-presbyopic) + (8/24) entropy(DPresbyopic)
    = (8/24) × [−(4/8) log₂(4/8) − (2/8) log₂(2/8) − (2/8) log₂(2/8)]
    + (8/24) × [−(5/8) log₂(5/8) − (2/8) log₂(2/8) − (1/8) log₂(1/8)]
    + (8/24) × [−(6/8) log₂(6/8) − (1/8) log₂(1/8) − (1/8) log₂(1/8)] = 1.2867

Spectacle Prescription = Myope or Hypermetrope
    {1, 2, 3, 4, 9, 10, 11, 12, 17, 18, 19, 20}, {5, 6, 7, 8, 13, 14, 15, 16, 21, 22, 23, 24}
    entropy(S) = (12/24) entropy(DMyope) + (12/24) entropy(DHypermetrope)
    = (12/24) × [−(7/12) log₂(7/12) − (2/12) log₂(2/12) − (3/12) log₂(3/12)]
    + (12/24) × [−(8/12) log₂(8/12) − (3/12) log₂(3/12) − (1/12) log₂(1/12)] = 1.2866

Astigmatic = No or Yes
    {1, 2, 5, 6, 9, 10, 13, 14, 17, 18, 21, 22}, {3, 4, 7, 8, 11, 12, 15, 16, 19, 20, 23, 24}
    entropy(S) = (12/24) entropy(DNo) + (12/24) entropy(DYes)
    = (12/24) × [−(7/12) log₂(7/12) − (5/12) log₂(5/12) − (0/12) log₂(0/12)]
    + (12/24) × [−(8/12) log₂(8/12) − (4/12) log₂(4/12) − (0/12) log₂(0/12)] = 0.9491

Tear Production Rate = Reduced or Normal
    {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23}, {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24}
    entropy(S) = (12/24) entropy(DReduced) + (12/24) entropy(DNormal)
    = (12/24) × [−(12/12) log₂(12/12) − (0/12) log₂(0/12) − (0/12) log₂(0/12)]
    + (12/24) × [−(3/12) log₂(3/12) − (5/12) log₂(5/12) − (4/12) log₂(4/12)] = 0.7773

(By convention, 0 log₂ 0 is taken as 0.)

Table 4.8
Nonbinary Split of an Internal Node, {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24},
and Calculation of Information Entropy for the Lenses Data Set

Split Criterion    Resulting Subsets and Average Information Entropy of Split

Age = Young, Pre-presbyopic, or Presbyopic
    {2, 4, 6, 8}, {10, 12, 14, 16}, {18, 20, 22, 24}
    entropy(S) = (4/12) entropy(DYoung) + (4/12) entropy(DPre-presbyopic) + (4/12) entropy(DPresbyopic)
    = (4/12) × [−(0/4) log₂(0/4) − (2/4) log₂(2/4) − (2/4) log₂(2/4)]
    + (4/12) × [−(1/4) log₂(1/4) − (2/4) log₂(2/4) − (1/4) log₂(1/4)]
    + (4/12) × [−(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4)] = 1.3333

Spectacle Prescription = Myope or Hypermetrope
    {2, 4, 10, 12, 18, 20}, {6, 8, 14, 16, 22, 24}
    entropy(S) = (6/12) entropy(DMyope) + (6/12) entropy(DHypermetrope)
    = (6/12) × [−(1/6) log₂(1/6) − (2/6) log₂(2/6) − (3/6) log₂(3/6)]
    + (6/12) × [−(2/6) log₂(2/6) − (3/6) log₂(3/6) − (1/6) log₂(1/6)] = 1.4591

Astigmatic = No or Yes
    {2, 6, 10, 14, 18, 22}, {4, 8, 12, 16, 20, 24}
    entropy(S) = (6/12) entropy(DNo) + (6/12) entropy(DYes)
    = (6/12) × [−(1/6) log₂(1/6) − (5/6) log₂(5/6) − (0/6) log₂(0/6)]
    + (6/12) × [−(2/6) log₂(2/6) − (0/6) log₂(0/6) − (4/6) log₂(4/6)] = 0.7842

Table 4.9
Nonbinary Split of an Internal Node, {2, 6, 10, 14, 18, 22}, and Calculation
of Information Entropy for the Lenses Data Set

Split Criterion    Resulting Subsets and Average Information Entropy of Split

Age = Young, Pre-presbyopic, or Presbyopic
    {2, 6}, {10, 14}, {18, 22}
    entropy(S) = (2/6) entropy(DYoung) + (2/6) entropy(DPre-presbyopic) + (2/6) entropy(DPresbyopic)
    = (2/6) × [−(0/2) log₂(0/2) − (2/2) log₂(2/2) − (0/2) log₂(0/2)]
    + (2/6) × [−(0/2) log₂(0/2) − (2/2) log₂(2/2) − (0/2) log₂(0/2)]
    + (2/6) × [−(1/2) log₂(1/2) − (1/2) log₂(1/2) − (0/2) log₂(0/2)] = 0.3333

Spectacle Prescription = Myope or Hypermetrope
    {2, 10, 18}, {6, 14, 22}
    entropy(S) = (3/6) entropy(DMyope) + (3/6) entropy(DHypermetrope)
    = (3/6) × [−(1/3) log₂(1/3) − (2/3) log₂(2/3) − (0/3) log₂(0/3)]
    + (3/6) × [−(0/3) log₂(0/3) − (3/3) log₂(3/3) − (0/3) log₂(0/3)] = 0.4591

entropy to split the node with the data set of {2, 4, 6, 8, 10, 12, 14, 16, 18,
20, 22, 24} using the split criterion, Astigmatic = No or Yes, which pro-
duces two subsets of {2, 6, 10, 14, 18, 22} and {4, 8, 12, 16, 20, 24}. Table 4.9
shows the calculation of information entropy to split the node with the
data set of {2, 6, 10, 14, 18, 22} using the split criterion, Age = Young,
Pre-presbyopic, or Presbyopic, which produces three subsets of {2, 6},
{10, 14}, and {18, 22}. These subsets are further partitioned using the split
criterion, Spectacle Prescription = Myope or Hypermetrope, to produce
leaf nodes with homogeneous data sets. Table 4.10 shows the calcula-
tion of information entropy to split the node with the data set of {4, 8,
12, 16, 20, 24} using the split criterion, Spectacle Prescription = Myope
or Hypermetrope, which produces two subsets of {4, 12, 20} and {8, 16, 24}.
These subsets are further partitioned using the split criterion, Age =
Young, Pre-presbyopic, or Presbyopic, to produce leaf nodes with homo-
geneous data sets. Figure 4.5 shows the complete nonbinary decision
tree for the lenses data set.

Table 4.10
Nonbinary Split of an Internal Node, {4, 8, 12, 16, 20, 24}, and Calculation
of Information Entropy for the Lenses Data Set

Split Criterion    Resulting Subsets and Average Information Entropy of Split

Age = Young, Pre-presbyopic, or Presbyopic
    {4, 8}, {12, 16}, {20, 24}
    entropy(S) = (2/6) entropy(DYoung) + (2/6) entropy(DPre-presbyopic) + (2/6) entropy(DPresbyopic)
    = (2/6) × [−(0/2) log₂(0/2) − (0/2) log₂(0/2) − (2/2) log₂(2/2)]
    + (2/6) × [−(1/2) log₂(1/2) − (0/2) log₂(0/2) − (1/2) log₂(1/2)]
    + (2/6) × [−(1/2) log₂(1/2) − (0/2) log₂(0/2) − (1/2) log₂(1/2)] = 0.6667

Spectacle Prescription = Myope or Hypermetrope
    {4, 12, 20}, {8, 16, 24}
    entropy(S) = (3/6) entropy(DMyope) + (3/6) entropy(DHypermetrope)
    = (3/6) × [−(0/3) log₂(0/3) − (0/3) log₂(0/3) − (3/3) log₂(3/3)]
    + (3/6) × [−(2/3) log₂(2/3) − (0/3) log₂(0/3) − (1/3) log₂(1/3)] = 0.4591

4.3 Handling Numeric and Missing Values of Attribute Variables
If a data set has a numeric attribute variable, the variable needs to be trans-
formed into a categorical variable before being used to construct a decision
tree. We present a common method to perform the transformation. Suppose
that a numeric attribute variable, x, has the following numeric values in the
training data set, a1, a2, …, ak, which are sorted in increasing order. The
middle point of two adjacent numeric values, ai and ai+1, is computed as follows:

    ci = (ai + ai+1)/2.    (4.7)

Using ci for i = 1, …, k − 1, we can create the following k categorical
values of x:

    Category 1: x ≤ c1
    Category 2: c1 < x ≤ c2
    .
    .
    .
    Category k − 1: ck−2 < x ≤ ck−1
    Category k: ck−1 < x.

A numeric value of x is transformed into a categorical value according to the


aforementioned definition of the categorical values. For example, if c1 < x ≤ c2,
the categorical value of x is Category 2.
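This transformation can be sketched in Python; the function names are ours, and the temperature values below are hypothetical examples, not taken from any table in the book:

```python
def categorize(values):
    """Midpoints of adjacent sorted distinct values (Formula 4.7), used as category boundaries."""
    a = sorted(set(values))
    return [(a[i] + a[i + 1]) / 2 for i in range(len(a) - 1)]

def category_of(x, cutpoints):
    """Return the 1-based category index of x given the midpoint cutpoints."""
    for i, c in enumerate(cutpoints):
        if x <= c:
            return i + 1
    return len(cutpoints) + 1

temps = [31, 53, 57, 63, 70, 81]   # hypothetical numeric attribute values
cuts = categorize(temps)           # [42.0, 55.0, 60.0, 66.5, 75.5]
print(category_of(57, cuts))       # prints 3, since 55.0 < 57 <= 60.0
```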
In many real-world data sets, we may find an attribute variable that does
not have a value in a data record. For example, if there are attribute variables
of name, address, and email address for customers in a database for a store,
we may not have the email address for a particular customer. That is, we
may have missing email addresses for some customers. One way to treat
a data record with a missing value is to discard the data record. However,
when the training data set is small, we may need all the data records in the
training data set to construct a decision tree. To use a data record with a
missing value, we may estimate the missing value and use the estimated
value to fill in the missing value. For a categorical attribute variable, its
missing value can be estimated to be the value that is taken by the major-
ity of data records in the training data set that have the same target value
as that of the data record with a missing value of the attribute variable. For
a numeric attribute variable, its missing value can be estimated to be the
average of values that are taken by data records in the training data set that
have the same target value as that of the data record with a missing value
of the attribute variable. Other methods of estimating a missing value are
given in Ye (2003).
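The class-conditional imputation described above can be sketched as follows; the function name, record layout, and the tiny example records are our own illustrative assumptions:

```python
def impute(records, attr, target="y", numeric=False):
    """Fill missing (None) values of attr using records with the same target value:
    majority value for a categorical attribute, mean for a numeric one."""
    for r in records:
        if r[attr] is not None:
            continue
        peers = [p[attr] for p in records
                 if p[attr] is not None and p[target] == r[target]]
        if numeric:
            r[attr] = sum(peers) / len(peers)
        else:
            r[attr] = max(set(peers), key=peers.count)
    return records

# Hypothetical records: the third record is missing x1 and shares y = 1 with the first two.
data = [{"x1": 0, "y": 1}, {"x1": 0, "y": 1}, {"x1": None, "y": 1}, {"x1": 1, "y": 0}]
impute(data, "x1")
print(data[2]["x1"])   # prints 0, the majority x1 value among records with y = 1
```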

4.4 Handling a Numeric Target Variable and Constructing a Regression Tree
If we have a numeric target variable, measures of data homogeneity such
as information entropy and gini-index cannot be applied. Formula 4.8 (Breiman
et al., 1984) instead computes R, the sum of squared differences of target values
from their average, and uses it to measure data homogeneity for
constructing a regression tree when the target variable takes numeric values.
58 Data Mining

The squared differences of values in a data set from their average value indicate
how similar, or homogeneous, the values are. The smaller the R value is, the
more homogeneous the data set is. Formula 4.10 shows the computation of the
average R value after a split:

    R(D) = ∑_{y ∈ D} (y − ȳ)²    (4.8)

    ȳ = (∑_{y ∈ D} y)/n    (4.9)

    R(S) = ∑_{v ∈ Values(S)} (|Dv|/|D|) R(Dv)    (4.10)

The space shuttle data set D in Table 1.2 has one numeric target variable
and four numeric attribute variables. The R value of the data set D with the
23 data records at the root node of the regression tree is computed as

    ȳ = (∑_{y ∈ D} y)/n
    = (0 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1)/23
    = 0.3043

    R(D) = ∑_{y ∈ D} (y − ȳ)²
    = 17 × (0 − 0.3043)² + 5 × (1 − 0.3043)² + (2 − 0.3043)²
    = 6.8696

The average of target values in data records at a leaf node of a decision


tree with a numeric target variable is often taken as the target value for the
leaf node. When passing a data record along the decision tree to determine
the target value of the data record, the target value of the leaf node where

the data record arrives is assigned as the target value of the data record. The
decision tree for a numeric target variable is called a regression tree.
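The R value and the mean above are easy to verify with a short Python sketch (the function name is ours; the target values are the ones listed in the computation of ȳ):

```python
# Target values of the 23 space shuttle records (Table 1.2), as listed in the text.
y = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1]

def r_value(values):
    """Sum of squared deviations from the mean (Formula 4.8):
    the smaller the value, the more homogeneous the data set."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

print(round(sum(y) / len(y), 4))   # prints 0.3043
print(round(r_value(y), 4))        # prints 6.8696
```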

4.5 Advantages and Shortcomings of the Decision Tree Algorithm
An advantage of using the decision tree algorithm to learn classification and
prediction patterns is the explicit expression of classification and prediction
patterns in the decision and regression tree. The decision tree in Figure 4.1
uncovers the following three patterns of part quality leading to three leaf
nodes with the classification of system fault, respectively,

• x7 = 1
• x7 = 0 & x8 = 1
• x7 = 0 & x8 = 0 & x9 = 1

and the following pattern of part quality to one leaf node with the classifica-
tion of no system fault:

• x7 = 0 & x8 = 0 & x9 = 0.

The aforementioned explicit classification patterns reveal the following key


knowledge for detecting the fault of this manufacturing system:

• Among the nine quality variables, only the three quality variables, x7,
x8, and x9, matter for system fault detection. This knowledge allows
us to reduce the cost of part quality inspection by inspecting the part
quality after M7, M8, and M9 only rather than all the nine machines.
• If one of these three variables, x7, x8, and x9, shows a quality failure,
the system has a fault; otherwise, the system has no fault.

A decision tree has its shortcoming in expressing classification and predic-


tion patterns because it uses only one attribute variable in a split criterion.
This may result in a large decision tree. From a large decision tree, it is dif-
ficult to see clear patterns for classification and prediction. For example, in
Chapter 1, we presented the following classification pattern for the balloon
data set in Table 1.1:

IF (Color = Yellow AND Size = Small) OR (Age = Adult AND Act = Stretch),


THEN Inflated = T; OTHERWISE, Inflated = F.

This classification pattern for the target value of Inflated = T, (Color = Yellow
AND Size = Small) OR (Age = Adult AND Act = Stretch), involves all the four
attribute variables of Color, Size, Age, and Act. It is difficult to express this

simple pattern in a decision tree. We cannot use all the four attribute variables
to partition the root node. Instead, we have to select only one attribute variable.
The average information entropy of a split to partition the root node using each
of the four attribute variables is the same with the computation shown next:

    entropy(S) = (8/16) entropy(DYellow) + (8/16) entropy(DPurple)
    = (8/16) × [−(5/8) log₂(5/8) − (3/8) log₂(3/8)]
    + (8/16) × [−(2/8) log₂(2/8) − (6/8) log₂(6/8)]
    = 0.8829.

We arbitrarily select Color = Yellow or Purple as the split criterion to parti-


tion the root node. Figure 4.6 gives the complete decision tree for the balloon
data set. The decision tree is large with the following seven classification
patterns leading to seven leaf nodes, respectively:

• Color = Yellow AND Size = Small, with Inflated = T


• Color = Yellow AND Size = Large AND Age = Adult AND Act = Stretch,
with Inflated = T

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}

Color = ?

Yellow Purple

{1, 2, 3, 4, 5, 6, 7, 8} {9, 10, 11, 12, 13, 14, 15, 16}

Size = ? Age = ?

Small Large Adult Child

{1, 2, 3, 4} {5, 6, 7, 8} {9, 11, 13, 15} {10, 12, 14, 16}

Inflated = T Age = ? Act = ? Inflated = F

Adult Child Stretch Dip

{1, 2, 3, 4} {5, 6, 7, 8} {9, 13} {11, 15}

Act = ? Inflated = F Inflated = T Inflated = F

Stretch Dip

{1, 2, 3, 4} {5, 6, 7, 8}

Inflated = T Inflated = F

Figure 4.6
Decision tree for the balloon data set.

• Color = Yellow AND Size = Large AND Age = Adult AND Act = Dip,
with Inflated = F
• Color = Yellow AND Size = Large AND Age = Child, with Inflated = F
• Color = Purple AND Age = Adult AND Act = Stretch, with Inflated = T
• Color = Purple AND Age = Adult AND Act = Dip, with Inflated = F
• Color = Purple AND Age = Child, with Inflated = F

From these seven classification patterns, it is difficult to see the simple clas-
sification pattern:

IF (Color = Yellow AND Size = Small) OR (Age = Adult AND Act = Stretch),


THEN Inflated = T; OTHERWISE, Inflated = F.

Moreover, selecting the best split criterion with only one attribute variable,
without looking ahead at how this split criterion combines with the follow-up
split criteria down to the leaf nodes, is like making a locally optimal decision.
There is no guarantee that making locally optimal decisions at separate
times leads to the smallest decision tree or a globally optimal decision.
However, considering all the attribute variables and their combinations
of conditions for each split would correspond to an exhaustive search of all
combination values of all the attribute variables. This is computationally
costly or sometimes impossible for a large data set with a large number of
attribute variables.

4.6  Software and Applications


The website http://www.kdnuggets.com has information about various data
mining tools. The following software packages support the learning of decision and regression trees:

• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• SPSS AnswerTree (http://www.spss.com/answertree/)
• SAS Enterprise Miner (http://sas.com/products/miner/)
• IBM Intelligent Miner (http://www.ibm.com/software/data/iminer/)
• CART (http://www.salford-systems.com/)
• C4.5 (http://www.cse.unsw.edu.au/quinlan)

Some applications of decision trees can be found in (Ye, 2003, Chapter 1) and
(Li and Ye, 2001; Ye et al., 2001).
62 Data Mining

Exercises
4.1 Construct a binary decision tree for the balloon data set in Table 1.1
using the information entropy as the measure of data homogeneity.
4.2 Construct a binary decision tree for the lenses data set in Table 1.3
using the information entropy as the measure of the data homogeneity.
4.3 Construct a non-binary regression tree for the space shuttle data set
in Table 1.2 using only Launch Temperature and Leak-Check Pressure
as the attribute variables and considering two categorical values
of Launch Temperature (low for Temperature <60, normal for other
temperatures) and three categorical values of Leak-Check Pressure
(50, 100, and 200).
4.4 Construct a binary decision tree or a nonbinary decision tree for the
data set found in Exercise 1.1.
4.5 Construct a binary decision tree or a nonbinary decision tree for the
data set found in Exercise 1.2.
4.6 Construct a dataset for which using the decision tree algorithm based
on the best split for data homogeneity does not produce the smallest
decision tree.
5
Artificial Neural Networks for
Classification and Prediction

Artificial neural networks (ANNs) are designed to mimic the architecture of


the human brain in order to create artificial intelligence like human intelli-
gence. Hence, ANNs use the basic architecture of the human brain, which con-
sists of neurons and connections among neurons. ANNs have processing units
like neurons and connections among processing units. This chapter introduces
two types of ANNs for classification and prediction: perceptron and multi-
layer feedforward ANN. In this chapter, we first describe the processing units
and how these units can be used to construct various types of ANN archi-
tectures. We then present the perceptron, which is a single-layer feedforward
ANN, and the learning of classification and prediction patterns by a percep-
tron. Finally, multilayer feedforward ANNs with the back-­propagation learn-
ing algorithm are described. A list of software packages that support ANNs is
provided. Some applications of ANNs are given with references.

5.1 Processing Units of ANNs


Figure 5.1 illustrates a processing unit in an ANN, unit j. The unit takes p
inputs, x1, x2, …, xp, another special input, x0 = 1, and produces an output, o.
The inputs x1, x2, …, xp and the output o are used to represent the inputs and
the output of a given problem. Take an example of the space shuttle data set
in Table 1.2. We may have x1, x2, and x3 to represent Launch Temperature,
Leak-Check Pressure, and Temporal Order of Flight, respectively, and have o
represent Number of O-rings with Stress. The input x0 is an inherent part of
every processing unit and always takes the value of 1.
Each input, xi, is connected to the unit j with a connection weight, wj,i. The
connection weight, wj,0, is called the bias or threshold for a reason that is
explained later. The unit j processes the inputs by first obtaining the net sum,
which is the weighted sum of the inputs, as follows:

net_j = \sum_{i=0}^{p} w_{j,i} x_i.  (5.1)


Figure 5.1
Processing unit of ANN. (Unit j receives the inputs x0 = 1, x1, …, xp through the connection weights wj,0, wj,1, …, wj,p, forms the net sum, and applies the transfer function f to produce the output o.)

Let the vectors x and w be defined as follows:

x = \begin{bmatrix} x_0 \\ \vdots \\ x_p \end{bmatrix} \quad w' = \begin{bmatrix} w_{j,0} & \cdots & w_{j,p} \end{bmatrix}
Equation 5.1 can be represented as follows:

net_j = w'x.  (5.2)

The unit then applies a transfer function, f, to the net sum and obtains the
output, o, as follows:


o = f(net_j).  (5.3)

Five of the common transfer functions are given next and illustrated in
Figure 5.2.

1. Sign function:

o = sgn(net) = \begin{cases} 1 & \text{if } net > 0 \\ -1 & \text{if } net \le 0 \end{cases}  (5.4)

2. Hard limit function:

o = hardlim(net) = \begin{cases} 1 & \text{if } net > 0 \\ 0 & \text{if } net \le 0 \end{cases}  (5.5)

Figure 5.2
Examples of transfer functions. (Five panels plot f(net) against net over the range −6 to 6 for the sign function, the hard limit function, the linear function, the sigmoid function, and the hyperbolic tangent function.)

3. Linear function:

o = lin ( net ) = net (5.6)



4. Sigmoid function:

o = sig(net) = \frac{1}{1 + e^{-net}}  (5.7)

5. Hyperbolic tangent function:

o = tanh(net) = \frac{e^{net} - e^{-net}}{e^{net} + e^{-net}}.  (5.8)

Given the following input vector and connection weight vector

x = \begin{bmatrix} 1 \\ 5 \\ -6 \end{bmatrix} \quad w' = \begin{bmatrix} -1.2 & 3 & 2 \end{bmatrix},

the output of the unit with each of the five transfer functions is computed as follows:

net = w'x = (-1.2)(1) + (3)(5) + (2)(-6) = 1.8

o = sgn(net) = 1

o = hardlim(net) = 1

o = lin(net) = 1.8

o = sig(net) = 0.8581

o = tanh(net) = 0.9468.
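The five transfer functions and the worked example above can be sketched in a few lines of Python (the function names are our own shorthand; the numeric values come from the text):

```python
import math

# A sketch of the five transfer functions (Equations 5.4 through 5.8).
def sgn(net):
    return 1 if net > 0 else -1

def hardlim(net):
    return 1 if net > 0 else 0

def lin(net):
    return net

def sig(net):
    return 1.0 / (1.0 + math.exp(-net))

def tanh_fn(net):
    return math.tanh(net)

# The worked example above: x = (1, 5, -6), w' = (-1.2, 3, 2).
x = [1.0, 5.0, -6.0]
w = [-1.2, 3.0, 2.0]
net = sum(wi * xi for wi, xi in zip(w, x))  # net = w'x = 1.8
outputs = [sgn(net), hardlim(net), lin(net), sig(net), tanh_fn(net)]
```

Rounded to four decimals, `outputs` reproduces the values in the text: 1, 1, 1.8, 0.8581, and 0.9468.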

One processing unit is sufficient to implement a logical AND function.


Table 5.1 gives the inputs and the output of the AND function and four
data records of this function. The AND function has the output values of
−1 and 1. Figure 5.3 gives the implementation of the AND function using
one processing unit. Among the five transfer functions in Figure 5.2, the
sign function and the hyperbolic tangent function can produce the range of

Table 5.1
AND Function
Inputs Output
x1 x2 o
−1 −1 −1
−1 1 −1
1 −1 −1
1 1 1

Figure 5.3
Implementation of the AND function using one processing unit. (The unit uses the sign transfer function with the connection weights w1,0 = −0.3, w1,1 = 0.5, and w1,2 = 0.5.)

output values from −1 to 1. The sign function is used as the transfer function
for the processing unit to implement the AND function. The first three data
records require the output value of −1. The weighted sum of the inputs for
the first three data records, w1,1x1 + w1,2x2, is in the range of
[−1, 0]. The last data record requires the output value of 1, and its weighted
sum of the inputs is in the range of (0, 1]. The connection weight w1,0
must be a negative value to make net for the first three data records less than
zero and also make net for the last data record greater than zero. Hence, the
connection weight w1,0 acts as a threshold against the weighted sum of the
inputs to drive net greater than or less than zero. This is why the connec-
tion weight for x0 = 1 is called the threshold or bias. In Figure 5.3, w1,0 is set
to −0.3. Equation 5.1 can be represented as follows to show the role of the
threshold or bias, b:

net = w'x + b,  (5.9)

where

x = \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix} \quad w' = \begin{bmatrix} w_{j,1} & \cdots & w_{j,p} \end{bmatrix}.

The computation of the output value for each input is illustrated next.

 2

o = sgn ( net ) = sgn 

∑w
i=0
x  = sgn  − 0.3 × 1 + 0.5 × ( −1) + 0.5 × ( −1)
1, i i

= sgn ( − 0.3 − 1) = sgn ( −1.3 ) = −1


68 Data Mining

 2

o = sgn ( net ) = sgn 

∑w
i=0
x  = sgn  − 0.3 × 1 + 0.5 × ( −1) + 0.5 × (1)
1, i i

= sgn ( − 0.3 + 0 ) = sgn ( − 0.3 ) = −1

 2

o = sgn ( net ) = sgn 


i=0
w1, i xi  = sgn  − 0.3 × 1 + 0.5 × (1) + 0.5 × ( −1)

= sgn ( − 0.3 + 0 ) = sgn ( − 0.3 ) = −1

 2

o = sgn ( net ) = sgn 

∑w
i=0
x  = sgn  − 0.3 × 1 + 0.5 × (1) + 0.5 × (1)
1, i i

= sgn ( − 0.3 + 1) = sgn (0.7 ) = 1


.
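The four computations above can be checked by simulating the single AND unit directly; this is a minimal sketch using the weights from Figure 5.3:

```python
# Minimal sketch: one processing unit with the sign transfer function
# implementing the logical AND of Table 5.1, using the weights from
# Figure 5.3 (w1,0 = -0.3, w1,1 = 0.5, w1,2 = 0.5).
def sgn(net):
    return 1 if net > 0 else -1

def and_unit(x1, x2):
    w10, w11, w12 = -0.3, 0.5, 0.5
    return sgn(w10 * 1 + w11 * x1 + w12 * x2)  # x0 = 1 carries the bias

# The four data records of the AND function and their target outputs.
truth_table = {(-1, -1): -1, (-1, 1): -1, (1, -1): -1, (1, 1): 1}
results = {inputs: and_unit(*inputs) for inputs in truth_table}
```

The unit reproduces the target output for every record in Table 5.1.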

Table 5.2 gives the inputs and the output of the logical OR function.
Figure 5.4 shows the implementation of the OR function using one process-
ing unit. Only the first data record requires the output value of −1, and the

Table 5.2
OR Function
Inputs Output
x1 x2 o
−1 −1 −1
−1 1 1
1 −1 1
1 1 1

Figure 5.4
Implementation of the OR function using one processing unit. (The unit uses the sign transfer function with the connection weights w1,0 = 0.8, w1,1 = 0.5, and w1,2 = 0.5.)

other three data records require the output value of 1. Only the first data
record produces the weighted sum −1 from the inputs; the other three
data records produce weighted sums of the inputs in the range [0, 1].
Hence, any threshold value w1,0 in the range (0, 1) will make net for the
first data record less than zero and make net for the last three data records
greater than zero.

5.2 Architectures of ANNs
Processing units of ANNs can be used to construct various types of ANN
architectures. We present two ANN architectures: feedforward ANNs and
recurrent ANNs. Feedforward ANNs are widely used. Figure 5.5 shows a
one-layer, fully connected feedforward ANN in which each input is con-
nected to each processing unit. Figure 5.6 shows a two-layer, fully connected

Figure 5.5
Architecture of a one-layer feedforward ANN. (Each input x1, …, xp is connected to each of the q processing units, which produce the outputs o1, …, oq.)

Figure 5.6
Architecture of a two-layer feedforward ANN. (Each input is connected to each hidden unit, and each hidden unit is connected to each of the output units producing o1, …, oq.)

feedforward ANN. Note that the input x0 for each processing unit is not
explicitly shown in the ANN architectures in Figures 5.5 and 5.6. The two-
layer feedforward ANN in Figure 5.6 contains the output layer of processing
units to produce the outputs and a hidden layer of processing units whose
outputs are the inputs to the processing units at the output layer. Each input
is connected to each processing unit at the hidden layer, and each processing
unit at the hidden layer is connected to each processing unit at the output
layer. In a feedforward ANN, there are no backward connections between
processing units in that the output of a processing unit is not used as a part
of inputs to that processing unit directly or indirectly. An ANN is not neces-
sarily fully connected as those in Figures 5.5 and 5.6. Processing units may
use the same transfer function or different transfer functions.
The ANNs in Figures 5.3 and 5.4, respectively, are examples of one-layer
feedforward ANNs. Figure 5.7 shows a two-layer, fully connected feedfor-
ward ANN with one hidden layer of two processing units and the output
layer of one processing unit to implement the logical exclusive-OR (XOR)
function. Table 5.3 gives the inputs and output of the XOR function.
The number of inputs and the number of outputs in an ANN depend on
the function that the ANN is set to capture. For example, the XOR function

Figure 5.7
A two-layer feedforward ANN to implement the XOR function. (Consistent with Table 5.4, hidden unit 1 has the weights 0.5 and 0.5 with the bias 0.8, hidden unit 2 has the weights −0.5 and −0.5 with the bias 0.8, and the output unit 3 has the weights 0.5 and 0.5 with the bias −0.3.)

Table 5.3
XOR Function
Inputs Output
x1 x2 o
−1 −1 −1
−1 1 1
1 −1 1
1 1 −1

Figure 5.8
Architecture of a recurrent ANN. (The layered feedforward structure of Figure 5.6 is augmented with backward connections that feed the outputs back as inputs to the hidden units.)

has two inputs and one output that can be represented by two inputs and
one output of an ANN, respectively. The number of processing units at the
hidden layer, called hidden units, is often determined empirically to account
for the complexity of the function that an ANN implements. In general, the
more complex the function is, the more hidden units are needed. A two-layer
feedforward ANN with a sigmoid or hyperbolic tangent function has the
capability of implementing a given function (Witten et al., 2011).
Figure 5.8 shows the architecture of a recurrent ANN with backward con-
nections that feed the outputs back as the inputs to the first hidden unit
(shown) and other hidden units (not shown). The backward connections
allow the ANN to capture the temporal behavior in that the outputs at time
t + 1 depend on the outputs or state of the ANN at time t. Hence, recurrent
ANNs such as that in Figure 5.8 have backward connections to capture tem-
poral behaviors.

5.3 Methods of Determining Connection Weights for a Perceptron
To use an ANN for implementing a function, we first determine the architec-
ture of the ANN, including the number of inputs, the number of outputs, the
number of layers, the number of processing units in each layer, and the trans-
fer function for each processing unit. Then we need to determine connection
weights. In this section, we describe a graphical method and a learning method
to determine connection weights for a perceptron, which is a one-layer feedfor-
ward ANN with a sign or hard limit transfer function. Although concepts and

methods in this section are explained using the sign transfer function for each
processing unit in a perceptron, these concepts and methods are also applicable
to a perceptron with a hard limit transfer function for each processing unit.
In Section 5.4, we present the back-propagation learning method to determine
connection weights for multiple-layer feedforward ANNs.

5.3.1 Perceptron
The following notations are used to represent a fully connected perceptron
with p inputs, q processing units at the output layer to produce q outputs, and
the sign transfer function for each processing unit, as shown in Figure 5.5:
x = \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix} \quad o = \begin{bmatrix} o_1 \\ \vdots \\ o_q \end{bmatrix} \quad w' = \begin{bmatrix} w_{1,1} & \cdots & w_{1,p} \\ \vdots & & \vdots \\ w_{q,1} & \cdots & w_{q,p} \end{bmatrix} = \begin{bmatrix} w'_1 \\ \vdots \\ w'_q \end{bmatrix} \quad w_j = \begin{bmatrix} w_{j,1} \\ \vdots \\ w_{j,p} \end{bmatrix} \quad b = \begin{bmatrix} b_1 \\ \vdots \\ b_q \end{bmatrix}

o = sgn(w'x + b).  (5.10)


5.3.2 Properties of a Processing Unit

For a processing unit j, o = sgn(net) = sgn(w'_j x + b_j) separates input vectors, xs, into two regions: one with net > 0 and o = 1, and another with net ≤ 0 and o = −1. The equation net = w'_j x + b_j = 0 is the decision boundary in the input space that separates the two regions. For example, given x in a two-dimensional space and the following weight and bias values:

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad w'_j = \begin{bmatrix} -1 & 1 \end{bmatrix} \quad b_j = -1,

the decision boundary is

w'_j x + b_j = 0
-x_1 + x_2 - 1 = 0
x_2 = x_1 + 1.

Figure 5.9 illustrates the decision boundary and the separation of the input
space into two regions by the decision boundary. The slope and the intercept
of the line representing the decision boundary in Figure 5.9 are

slope = \frac{-w_{j,1}}{w_{j,2}} = \frac{-(-1)}{1} = 1

Figure 5.9
Example of the decision boundary and the separation of the input space into two regions by a processing unit. (The boundary x2 = x1 + 1 separates the region with net > 0 from the region with net ≤ 0, and the weight vector wj is orthogonal to the boundary and points to the positive side.)

intercept = \frac{-b_j}{w_{j,2}} = \frac{-(-1)}{1} = 1.

As illustrated in Figure 5.9, a processing unit has the following properties:


• The weight vector is orthogonal to the decision boundary.
• The weight vector points to the positive side (net > 0) of the decision
boundary.
• The position of the decision boundary can be shifted by changing
b. If b = 0, the decision boundary passes through the origin, e.g.,
(0, 0) in the two-dimensional space.
• Because the decision boundary is a linear equation, a processing
unit can implement a linearly separable function only.
Those properties of a processing unit are used in the graphical method of
determining connection weights in Section 5.3.3 and the learning method
of determining connection weights in Section 5.3.4.
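The listed properties can be verified numerically for the example w'_j = [−1 1], b_j = −1 above (a sketch; the particular test points are our own choice):

```python
# Numeric check of the processing-unit properties for w_j' = [-1, 1],
# b_j = -1, whose decision boundary is x2 = x1 + 1.
w = (-1.0, 1.0)
b = -1.0

def net(x1, x2):
    return w[0] * x1 + w[1] * x2 + b

# Two points on the boundary; the direction along the boundary is
# orthogonal to the weight vector (dot product is zero).
p, q = (0.0, 1.0), (2.0, 3.0)
along = (q[0] - p[0], q[1] - p[1])
dot = w[0] * along[0] + w[1] * along[1]

# Stepping from a boundary point in the direction of w gives net > 0,
# so the weight vector points to the positive side of the boundary.
net_toward_w = net(p[0] + 0.1 * w[0], p[1] + 0.1 * w[1])
```

Both chosen points lie on the boundary (net = 0 there), the boundary direction is orthogonal to w, and moving along w from the boundary makes net positive.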

5.3.3 Graphical Method of Determining Connection Weights and Biases


The following steps are taken in the graphical method to determine connection
weights of a perceptron with p inputs, one output, one processing unit to pro-
duce the output, and the sign transfer function for the processing unit:
1. Plot the data points for the data records in the training data set for
the function.
2. Draw the decision boundary to separate the data points with o = 1
from the data points with o = −1.
74 Data Mining

3. Draw the weight vector to make it be orthogonal to the decision


boundary and point to the positive side of the decision boundary.
The coordinates of the weight vector define the connection weights.
4. Use one of the two methods to determine the bias:
a. Use the intercept of the decision boundary and connection
weights to determine the bias.
b. Select one data point on the positive side and one data point on
the negative side that are closest to the decision boundary on
each side and use these data points and the connection weights
to determine the bias.
These steps are illustrated in Example 5.1.

Example 5.1
Use the graphical method to determine the connection weights of a
perceptron with one processing unit for the AND function in Table 5.1.
In Step 1, we plot the four circles in Figure 5.10 to represent the four
data points of the AND function. The output value of each data point is
noted inside the circle for the data point. In Step 2, we use the decision
boundary, x2 = −x1 + 1, to separate the three data points with o = −1 from
the data point with o = 1. The intercept of the line for the decision bound-
ary is 1 with x2 = 1 when x1 is set to 0. In Step 3, we draw the weight vector,
w1 = (0.5, 0.5), which is orthogonal to the decision boundary and points
to the positive side of the decision boundary. Hence, we have w1,1 = 0.5,
w1,2 = 0.5. In Step 4, we use the following equation to determine the bias:
w_{1,1}x_1 + w_{1,2}x_2 + b_1 = 0
w_{1,2}x_2 = -w_{1,1}x_1 - b_1

Figure 5.10
Illustration of the graphical method to determine connection weights. (The four data points of the AND function are plotted with their output values inside the circles; the decision boundary x2 = −x1 + 1 separates the data point with o = 1 from the three data points with o = −1, and the weight vector w1 = (0.5, 0.5) is orthogonal to the boundary and points to its positive side.)

intercept = \frac{-b_1}{w_{1,2}}

1 = \frac{-b_1}{0.5}

b_1 = -0.5.

If we move the decision boundary so that it has the intercept of 0.6, we


obtain b1 = −0.3 and exactly the same ANN for the AND function as
shown in Figure 5.3.
Using another method in Step 4, we select the data point (1, 1) on the
positive side of the decision boundary, the data points (−1, 1) on the nega-
tive side of the decision boundary, and the connection weights w1,1 = 0.5,
w1,2 = 0.5 to determine the bias b1 as follows:

net = w_{1,1}x_1 + w_{1,2}x_2 + b_1
net = 0.5 \times 1 + 0.5 \times 1 + b_1 > 0
b_1 > -1

and

net = w_{1,1}x_1 + w_{1,2}x_2 + b_1
net = 0.5 \times (-1) + 0.5 \times 1 + b_1 \le 0
b_1 \le 0.

Hence, we have

−1 < b1 ≤ 0.

By letting b1 = −0.3, we obtain the same ANN for the AND function as
shown in Figure 5.3.
The ANN with the weights, bias, and decision boundary as those in
Figure 5.10 produces the correct output for the inputs in each data record
in Table 5.1. The ANN also has the generalization capability of classify-
ing any input vector on the negative side of the decision boundary into
o = −1 and any input vector on the positive side of the decision boundary
into o = 1.
For a perceptron with multiple output units, the graphical method
is applied to determine connection weights and bias for each
output unit.

5.3.4 Learning Method of Determining Connection Weights and Biases


We use the following two of the four data records for the AND function
in the training data set to illustrate the learning method of determining
connection weights for a perceptron with one processing unit without a
bias:

1. x1 = 1, x2 = −1, t1 = −1
2. x1 = 1, x2 = 1, t1 = 1,

where t1 denotes the target output of processing unit 1 that needs to be produced for each data record. The two data records are plotted in Figure 5.11.
We initialize the connection weights using random values, w1,1(k) = −1 and
w1,2(k) = 0.8, with k denoting the iteration number when the weights are
assigned or updated. Initially, we have k = 0. We present the inputs of the
first data record to the perceptron of one processing unit:

net = w_{1,1}(0)x_1 + w_{1,2}(0)x_2 = (-1) \times 1 + 0.8 \times (-1) = -1.8.

Since net < 0, we have o1 = −1. Hence, the perceptron with the weight vector
(−1, 0.8) produces the target output for the inputs of the first data record, t1 = −1.
There is no need to change the connection weights. Next, we present the
inputs of the second data record to the perceptron:

net = w_{1,1}(0)x_1 + w_{1,2}(0)x_2 = (-1) \times 1 + 0.8 \times 1 = -0.2.

Since net < 0, we have o1 = −1, which is different from the target output for this
data record t1 = 1. Hence, the connection weights must be changed in order
Figure 5.11
Illustration of the learning method to change connection weights. (The two training data records are plotted with their target outputs, together with the initial weight vector w1(0) = (−1, 0.8) and the updated weight vector w1(1) = (0, 1.8).)

to produce the target output. The following equations are used to change the
connection weights for processing unit j:

\Delta w_j = \frac{1}{2}(t_j - o_j)\,x  (5.11)

w_j(k+1) = w_j(k) + \Delta w_j.  (5.12)

In Equation 5.11, if (t − o) is zero, that is, t = o, then there is no change of
weights. If t = 1 and o = −1,

\Delta w_j = \frac{1}{2}(t_j - o_j)x = \frac{1}{2}(1 - (-1))x = x.

By adding x to wj(k), that is, performing wj(k) + x in Equation 5.12, we move


the weight vector closer to x and make the weight vector point more to the
direction of x because we want the weight vector to point to the positive side
of the decision boundary and x lies on the positive side of the decision
boundary. If t1 = −1 and o1 = 1,

\Delta w_j = \frac{1}{2}(t_j - o_j)x = \frac{1}{2}(-1 - 1)x = -x.

By subtracting x from wj(k), that is, performing wj(k) − x in Equation 5.12, we


move the weight vector away from x and make the weight vector point more
to the opposite direction of x because x lies on the negative side of the deci-
sion boundary with t = −1 and we want the weight vector to eventually point
to the positive side of the decision boundary.
Using Equations 5.11 and 5.12, we update the connection weights based on
the inputs and the target and actual outputs for the second data record as
follows:

1 1 1 1
∆w1 =
2 2
(
(t1 − o1 ) x = 1 − ( −1) ) 1 = 1
   

 −1  1  0 
w1 (1) = w1 (0 ) + ∆w1 =   +   =   .
0.8  1 1.8 

The new weight vector, w1(1), is shown in Figure 5.11. As Figure 5.11 illus-
trates, w 1(1) is closer to the second data record x than w 1(0) and points more
to the direction of x since x has t = 1 and thus lies on the positive side of the
decision boundary.

With the new weights, we present the inputs of the data records to the per-
ceptron again in the second iteration of evaluating and updating the weights
if needed. We present the inputs of the first data record:

net = w_{1,1}(1)x_1 + w_{1,2}(1)x_2 = 0 \times 1 + 1.8 \times (-1) = -1.8.

Since net < 0, we have o1 = −1. Hence, the perceptron with the weight vector
(0, 1.8) produces the target output for the inputs of the first data record, t1 = −1.
With (t1 − o1) = 0, there is no need to change the connection weights. Next, we
present the inputs of the second data record to the perceptron:

net = w1,1 (1) x1 + w1, 2 (1) x2 = 0 × 1 + 1.8 × 1 = 1.8.

Since net > 0, we have o1 = 1. Hence, the perceptron with the weight vector
(0, 1.8) produces the target output for the inputs of the second data record,
t = 1. With (t − o) = 0, there is no need to change the connection weights. The
perceptron with the weight vector (0, 1.8) produces the target outputs for
all the data records in the training data set. The learning of the connection
weights from the data records in the training data set is finished after one
iteration of changing the connection weights with the final weight vector (0, 1.8).
The decision boundary is the line, x2 = 0.
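The learning iterations above can be sketched as a short loop (the stopping rule, cycling through the records until none changes the weights, is our own framing of the procedure; the records and initial weights follow the example, with no bias):

```python
# Sketch of the perceptron learning rule (Equations 5.11 and 5.12) on two
# AND-function records, starting from w1(0) = (-1, 0.8), without a bias.
def sgn(net):
    return 1 if net > 0 else -1

records = [((1, -1), -1), ((1, 1), 1)]   # (inputs, target output)
w = [-1.0, 0.8]

changed = True
while changed:                           # repeat until no weight changes
    changed = False
    for (x1, x2), t in records:
        o = sgn(w[0] * x1 + w[1] * x2)
        if o != t:                       # update only on a wrong output
            w[0] += 0.5 * (t - o) * x1   # Delta w = (1/2)(t - o) x
            w[1] += 0.5 * (t - o) * x2
            changed = True
```

The loop terminates with the weight vector (0, 1.8) found in the text.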
The general equations for the learning method of determining connection
weights are given as follows:


\Delta w_j = \alpha(t_j - o_j)x = \alpha e_j x  (5.13)

w_j(k+1) = w_j(k) + \Delta w_j  (5.14)

or

\Delta w_{j,i} = \alpha(t_j - o_j)x_i = \alpha e_j x_i  (5.15)

w_{j,i}(k+1) = w_{j,i}(k) + \Delta w_{j,i},  (5.16)

where
ej = tj − oj represents the output error
α is the learning rate taking a value usually in the range (0, 1)

In Equation 5.11, α is set to 1/2. Since the bias of processing unit j is the weight
of connection from the input x0 = 1 to the processing unit, Equations 5.15 and
5.16 can be extended for changing the bias of processing unit j as follows:

\Delta b_j = \alpha(t_j - o_j) \times x_0 = \alpha(t_j - o_j) \times 1 = \alpha e_j  (5.17)

b_j(k+1) = b_j(k) + \Delta b_j.  (5.18)

5.3.5 Limitation of a Perceptron
As described in Sections 5.3.2 and 5.3.3, each processing unit implements a
linear decision boundary, that is, a linearly separable function. Even with
multiple processing units in one layer, a perceptron is limited to implement-
ing a linearly separable function. For example, the XOR function in Table
5.3 is not a linearly separable function. There is only one output for the
XOR function. Using one processing unit to represent the output, we have
one decision boundary, which is a straight line representing a linear func-
tion. However, there does not exist such a straight line in the input space to
separate the two data points with o = 1 from the other two data points with
o = −1. A nonlinear decision boundary such as the one shown in Figure 5.12
is needed to separate the two data points with o = 1 from the other two data
points with o = −1. To use processing units that implement linearly separa-
ble functions for constructing an ANN to implement the XOR function, we
need two processing units in one layer (the hidden layer) to implement two
decision boundaries and one processing unit in another layer (the output
layer) to combine the outputs of the two hidden units as shown in Table 5.4
and Figure 5.7. Table 5.5 defines the logical NOT function used in Table 5.4.
Hence, we need a two-layer ANN to implement the XOR function, which is
a nonlinearly separable function.
The learning method described by Equations 5.13 through 5.18 can be used
to learn the connection weights to each output unit using a set of training
data because the target value t for each output unit is given in the training
data. For each hidden unit, Equations 5.13 through 5.18 are not applicable
because we do not know t for the hidden unit. Hence, we encounter a dif-
ficulty in learning connection weights and biases from training data for a
multilayer ANN. This learning difficulty for multilayer ANNs is overcome
by the back-propagation learning method described in the next section.

Figure 5.12
Four data points of the XOR function. (The data points (−1, 1) and (1, −1) have o = 1, while (−1, −1) and (1, 1) have o = −1; no straight line separates the two classes.)

Table 5.4
Function of Each Processing Unit in a Two-Layer ANN
to Implement the XOR Function
x1 x2 o1 = x1 OR x2 o2 = NOT (x1 AND x2) o3 = o1 AND o2
−1 −1 −1 1 −1
−1 1 1 1 1
1 −1 1 1 1
1 1 1 −1 −1

Table 5.5
NOT Function
x o
−1 1
1 −1
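Putting Tables 5.4 and 5.5 together, the two-layer XOR network can be simulated directly. The weight and bias values below are one workable reading of Figure 5.7 (an assumption, not uniquely determined by the text): hidden unit 1 implements OR, hidden unit 2 implements NOT-AND, and the output unit implements AND.

```python
# Sketch of the two-layer feedforward ANN for XOR; every unit uses the
# sign transfer function. Weights/biases are our reading of Figure 5.7.
def sgn(net):
    return 1 if net > 0 else -1

def xor_net(x1, x2):
    o1 = sgn(0.5 * x1 + 0.5 * x2 + 0.8)    # hidden unit 1: x1 OR x2
    o2 = sgn(-0.5 * x1 - 0.5 * x2 + 0.8)   # hidden unit 2: NOT (x1 AND x2)
    return sgn(0.5 * o1 + 0.5 * o2 - 0.3)  # output unit: o1 AND o2

xor_table = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}
results = {inputs: xor_net(*inputs) for inputs in xor_table}
```

The network reproduces all four rows of Table 5.3.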

5.4 Back-Propagation Learning Method for a Multilayer Feedforward ANN
The back-propagation learning method for a multilayer ANN (Rumelhart et al.,
1986) aims at searching for a set of connection weights (including biases) W
that minimizes the output error. The output error for a training data record
d is defined as follows:

E_d(W) = \frac{1}{2}\sum_j (t_{j,d} - o_{j,d})^2,  (5.19)

where
tj,d is the target output of output unit j for the training data record d
oj,d is the actual output produced by output unit j of the ANN with the
weights W for the training data record d

The output error for a set of training data records is defined as follows:

E(W) = \frac{1}{2}\sum_d \sum_j (t_{j,d} - o_{j,d})^2.  (5.20)

Because each oj,d depends on W, E is a function of W. The back-propagation


learning method searches in the space of possible weights and evaluates a
given set of weights based on their associated E values. The search is called
the gradient descent search that changes weights by moving them in the

direction of reducing the output error after passing the inputs of the data
record d through the ANN with the weights W, as follows:

\Delta w_{j,i} = -\alpha \frac{\partial E_d}{\partial w_{j,i}} = -\alpha \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{j,i}} = \alpha \delta_j \frac{\partial \left( \sum_k w_{j,k} o_k \right)}{\partial w_{j,i}} = \alpha \delta_j o_i  (5.21)

where δj is defined as

\delta_j = -\frac{\partial E_d}{\partial net_j},  (5.22)

where
α is the learning rate with a value typically in (0, 1)
oi is input i to processing unit j

If unit j directly receives the inputs of the ANN, oi is xi; otherwise, oi is
from a unit in the preceding layer feeding its output as an input to unit j.
To change a bias for a processing unit, Equation 5.21 is modified by using
oi = 1 as follows:

∆b j = αδ j . (5.23)

If unit j is an output unit,

1
∑ (t
2

∂Ed ∂Ed ∂o j
∂
2 j
j,d )
− o j , d  ∂ f net
 j j ( ( ))
δj = − =− =−
∂net j ∂o j ∂net j ∂o j ∂net j


( ) (
= t j , d − o j , d f j′ net j , )
(5.24)

where f' denotes the derivative of the function f with regard to net. To obtain a value for the term f'_j(net_j) in Equation 5.24, the transfer function f for unit j must be a semi-linear, nondecreasing, and differentiable function, e.g., linear, sigmoid, and tanh. For the sigmoid transfer function

o_j = f_j(net_j) = \frac{1}{1 + e^{-net_j}},

we have the following:

f'_j(net_j) = \frac{1}{1 + e^{-net_j}} \cdot \frac{e^{-net_j}}{1 + e^{-net_j}} = o_j(1 - o_j).  (5.25)
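The identity in Equation 5.25 can be checked numerically by comparing oj(1 − oj) against a finite-difference derivative of the sigmoid (the test points below are arbitrary):

```python
import math

def sig(net):
    return 1.0 / (1.0 + math.exp(-net))

# Compare the analytic sigmoid derivative o(1 - o) from Equation 5.25
# with a central finite difference at a few arbitrary net values.
h = 1e-6
max_gap = 0.0
for net in (-2.0, -0.5, 0.0, 1.3):
    o = sig(net)
    analytic = o * (1.0 - o)
    numeric = (sig(net + h) - sig(net - h)) / (2.0 * h)
    max_gap = max(max_gap, abs(analytic - numeric))
```

The largest gap between the two estimates stays far below the finite-difference step size.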


If unit j is a hidden unit feeding its output as an input to output units,

\delta_j = -\frac{\partial E_d}{\partial net_j} = -\frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} = -\frac{\partial E_d}{\partial o_j} f'_j(net_j) = -\left( \sum_n \frac{\partial E_d}{\partial net_n} \frac{\partial net_n}{\partial o_j} \right) f'_j(net_j),

where netn is the net sum of output unit n. Using Equation 5.22, we rewrite
δj as follows:

 
\delta_j = \left( \sum_n \delta_n \frac{\partial net_n}{\partial o_j} \right) f'_j(net_j) = \left( \sum_n \delta_n \frac{\partial \left( \sum_j w_{n,j} o_j \right)}{\partial o_j} \right) f'_j(net_j) = \left( \sum_n \delta_n w_{n,j} \right) f'_j(net_j).  (5.26)

Since we need δn in Equation 5.26, which is computed for output unit n,


changing the weights of the ANN should start with changing the weights
for output units and move on to changing the weights for hidden units in
the preceding layer so that δn for output unit n can be used in computing δj
for hidden unit j. In other words, δn for output unit n is back-propagated to
compute δj for hidden unit j, which gives the name of the back-propagation
learning.
Changes to weights and biases, as determined by Equations 5.21 and 5.23,
are used to update weights and biases of the ANN as follows:

w j , i ( k + 1) = w j , i ( k ) + ∆w j , i (5.27)

b j ( k + 1) = b j ( k ) + ∆b j . (5.28)

Example 5.2
Given the ANN for the XOR function and the first data record in Table 5.3
with x1 = −1, x2 = −1, and t = −1, use the back-propagation method to update
the weights and biases of the ANN. In the ANN, the sigmoid transfer
function is used by each of the two hidden units and the linear function
is used by the output unit. The ANN starts with the following arbitrarily
assigned values of weights and biases in (−1, 1) as shown in Figure 5.13:

w1,1 = 0.1 w2 ,1 = − 0.1 w1, 2 = 0.2 w2 , 2 = − 0.2 b1 = − 0.3

b2 = − 0.4 w3 ,1 = 0.3 w3 , 2 = 0.4 b3 = 0.5.



Figure 5.13
A set of weights with randomly assigned values in a two-layer feedforward ANN for the XOR function. (The figure shows the values listed above: w1,1 = 0.1, w1,2 = 0.2, b1 = −0.3 for hidden unit 1; w2,1 = −0.1, w2,2 = −0.2, b2 = −0.4 for hidden unit 2; w3,1 = 0.3, w3,2 = 0.4, b3 = 0.5 for the output unit.)

Use the learning rate α = 0.3.


Passing the inputs of the data record, x1 = −1 and x2 = −1, through the
ANN, we obtain the following:

o_1 = sig(w_{1,1}x_1 + w_{1,2}x_2 + b_1) = sig(0.1 \times (-1) + 0.2 \times (-1) + (-0.3)) = sig(-0.6) = \frac{1}{1 + e^{0.6}} = 0.3543

o_2 = sig(w_{2,1}x_1 + w_{2,2}x_2 + b_2) = sig((-0.1) \times (-1) + (-0.2) \times (-1) + (-0.4)) = sig(-0.2) = \frac{1}{1 + e^{0.2}} = 0.4502

o = lin(w_{3,1}o_1 + w_{3,2}o_2 + b_3) = lin(0.3 \times 0.3543 + 0.4 \times 0.4502 + 0.5) = lin(0.7864) = 0.7864

Since the difference between o = 0.7864 and t = −1 is large, we need to


change the weights and biases of the ANN. Equations 5.21 and 5.23 are
used to determine changes to the weights and bias for the output unit
as follows:

\Delta w_{3,1} = \alpha \delta_3 o_1 = 0.3 \times \delta_3 \times 0.3543

\Delta w_{3,2} = \alpha \delta_3 o_2 = 0.3 \times \delta_3 \times 0.4502

\Delta b_3 = \alpha \delta_3 = 0.3 \delta_3.

Equation 5.24 is used to determine δ3, and then δ3 is used to determine


∆w3,1, ∆w3,2, and ∆b3 as follows:

δ3 = (t − o) f3′(net3) = (t − o) lin′(net3) = (−1 − 0.7864) × 1 = −1.7864

∆w3 ,1 = 0.3 × δ 3 × 0.3543 = 0.3 × ( −1.7864 ) × 0.3543 = − 0.1899



∆w3 , 2 = 0.3 × δ 3 × 0.4502 = 0.3 × ( −1.7864 ) × 0.4502 = − 0.2413

∆ b3 = 0.3δ 3 = 0.3 × ( −1.7864 ) = − 0.5359.

Equations 5.21, 5.23, 5.25, and 5.26 are used to determine changes to the
weights and bias for each hidden unit as follows:

δ1 = (∑_{n=3} δnwn,1) f1′(net1) = δ3w3,1o1(1 − o1)

= (−1.7864) × 0.3 × 0.3543 × (1 − 0.3543) = −0.1226

δ2 = (∑_{n=3} δnwn,2) f2′(net2) = δ3w3,2o2(1 − o2)

= (−1.7864) × 0.4 × 0.4502 × (1 − 0.4502) = −0.1769

∆w1,1 = αδ 1x1 = 0.3 × δ 1 × x1 = 0.3 × ( − 0.1226 ) × ( −1) = 0.0368

∆w1, 2 = αδ 1x2 = 0.3 × δ 1 × x2 = 0.3 × ( − 0.1226 ) × ( −1) = 0.0368

∆w2 ,1 = αδ 2 x1 = 0.3 × δ 2 × x1 = 0.3 × ( − 0.1769) × ( −1) = 0.0531

∆w2 , 2 = αδ 2 x2 = 0.3 × δ 2 × x2 = 0.3 × ( − 0.1769) × ( −1) = 0.0531

∆ b1 = αδ 1 = 0.3 × ( − 0.1226 ) = − 0.0368



∆ b2 = αδ 2 = 0.3δ 2 = 0.3 × ( − 0.1769) = − 0.0531.

Using the changes to all the weights and biases of the ANN, Equations
5.27 and 5.28 are used to perform an iteration of updating the weights
and biases as follows:

w1,1 (1) = w1,1 (0 ) + ∆w1,1 = 0.1 + 0.0368 = 0.1368

w1, 2 (1) = w1, 2 (0 ) + ∆w1, 2 = 0.2 + 0.0368 = 0.2368



w2 ,1 (1) = w2 ,1 (0 ) + ∆w2 ,1 = − 0.1 + 0.0531 = − 0.0469

w2 , 2 (1) = w2 , 2 (0 ) + ∆w2 , 2 = − 0.2 + 0.0531 = − 0.1469

w3 ,1 (1) = w3 ,1 (0 ) + ∆w3 ,1 = 0.3 − 0.1899 = 0.1101



w3 , 2 (1) = w3 , 2 (0 ) + ∆w3 , 2 = 0.4 − 0.2413 = 0.1587

b1 (1) = b1 (0 ) + ∆b1 = − 0.3 − 0.1226 = − 0.4226

b2 (1) = b2 (0 ) + ∆b2 = − 0.4 − 0.0531 = − 0.4531

b3 (1) = b3 (0 ) + ∆b3 = 0.5 − 0.5359 = − 0.0359.

This new set of weights and biases, wj,i(1) and bj(1), will be used to pass
the inputs of the second data record through the ANN and then update
the weights and biases again to obtain wj,i(2) and bj(2) if necessary. This
process repeats again for the third data record, the fourth data record,
back to the first data record, and so on, until the measure of the output
error E as defined in Equation 5.20 is smaller than a preset threshold,
e.g., 0.1.

A measure of the output error, such as E, or the root-mean-squared error


over all the training data records can be used to determine when the learn-
ing of ANN weights and biases can stop. The number of iterations, e.g.,
1000 iterations, is another criterion that can be used to stop the learning.
Updating weights and biases after passing each data record in the train-
ing data set is called incremental learning. In incremental learning,
weights and biases are updated so that they will work better for one data
record. Changes based on one data record may go in a different direction
from the changes made for another data record, making the learning take
a long time to converge to the final set of weights and biases that work for
all the data records. Batch learning holds the update of weights and
biases until all the data records in the training data set have been passed
through the ANN and their associated changes of weights and biases have
been computed and averaged. The average of the weight and bias changes
for all the data records, that is, the overall effect of the changes on weights
and biases by all the data records, is used to update weights and biases.
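The contrast between the two modes can be sketched with a single linear unit trained by the delta rule (a Python sketch; the small data set and names are ours, not from the book):

```python
# one epoch of incremental vs. batch updating for a single linear unit o = w*x + b
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # records (x, t) sampled from t = 2x + 1
alpha = 0.1

# incremental learning: update the weight and bias after every record
w_inc, b_inc = 0.0, 0.0
for x, t in data:
    d = t - (w_inc * x + b_inc)        # error for this record
    w_inc += alpha * d * x
    b_inc += alpha * d

# batch learning: average the changes over all records, then update once
w_bat, b_bat = 0.0, 0.0
dw = sum((t - (w_bat * x + b_bat)) * x for x, t in data) / len(data)
db = sum((t - (w_bat * x + b_bat)) for x, t in data) / len(data)
w_bat += alpha * dw
b_bat += alpha * db
```

After one epoch the incremental weights depend on the order in which the records were visited, while batch learning applies a single update equal to the average of the per-record changes computed at the starting weights.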
The learning rate also affects how well and fast the learning proceeds. As
illustrated in Figure 5.14, a small learning rate, e.g., 0.01, produces a small
change of weights and biases and thus a small decrease in E, and makes the
learning take a long time to reach the global minimum value of E or a local
minimum value of E. However, a large learning rate produces a large change
of weights and biases, which may cause the search of W for minimizing E not
to reach a local or global minimum value of E. Hence, as a tradeoff between

Figure 5.14
Effect of the learning rate: small ΔW versus large ΔW steps toward the global minimum or a
local minimum of E.

a small learning rate and a large learning rate, a method of adaptive learning
rates can be used to start with a large learning rate for speeding up the learn-
ing process and then change to a small learning rate for taking small steps to
reach a local or global minimum value of E.
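One simple way to realize such a schedule (a sketch of one common choice, not a method prescribed by the text) is to decay the learning rate geometrically over the iterations:

```python
# gradient descent on E(w) = (w - 3)^2 with a geometrically decaying learning rate
w = 0.0
rate = 0.9                   # start with a large learning rate to speed up learning
for k in range(100):
    grad = 2.0 * (w - 3.0)   # dE/dw
    w -= rate * grad
    rate *= 0.95             # then take smaller and smaller steps
```

The early large steps move w quickly toward the minimum at w = 3, and the shrinking steps settle there without overshooting.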
Unlike the decision trees in Chapter 4, an ANN does not show an explicit
model of the classification and prediction function that the ANN has learned
from the training data. The function is implicitly represented through con-
nection weights and biases that cannot be translated into meaningful clas-
sification and prediction patterns in the problem domain. Although the
knowledge of classification and prediction patterns has been acquired by
the ANN, such knowledge is not available in an interpretable form. Hence,
ANNs help the task of performing classification and prediction but not the
task of discovering knowledge.

5.5 Empirical Selection of an ANN Architecture for a Good Fit to Data
Unlike the regression models in Chapter 2, the learning of a classifica-
tion and prediction function by a multilayer feedforward ANN does not
require defining a specific form of that function, which may be difficult
when a data set is large, and we have little prior knowledge about the
domain or the data. The complexity of the ANN and the function which
the ANN learns and represents depends much on the number of hidden

units. The more hidden units the ANN has, the more complex function the
ANN can learn and represent. However, if we use a complex ANN to learn
a simple function, we may see the function of the ANN over-fit the data as
illustrated in Figure 5.15. In Figure 5.15, data points are generated using a
linear model:
y = x + ε,

where ε denotes a random noise. However, a nonlinear model is fitted to the


training data points as illustrated by the filled circles in Figure 5.15, cover-
ing every training data point with no difference between the target y value
and the predicted y value from the nonlinear model. Although the nonlinear
model provides a perfect fit to the training data, the prediction performance
of the nonlinear model on new data points in the testing data set as illus-
trated by the unfilled circles in Figure 5.15 will be poorer than that of the
linear model, y = x, for the following reasons:

• The nonlinear model captures the random noise ε in the model


• The random noises from new data points behave independently and
differently of the random noises from the training data points
• The random noises from the training data points that are captured in
the nonlinear model do not match well with the random noises from
new data points in the testing data set, causing prediction errors
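The effect can be demonstrated without an ANN (a Python sketch with made-up noise values): an exact interpolating polynomial plays the role of the over-complex model, and a least-squares line plays the role of the linear model y = x.

```python
# training and testing data generated from y = x + noise (noise values made up)
xtr = [0.0, 1.0, 2.0, 3.0, 4.0]
ytr = [x + e for x, e in zip(xtr, [0.2, -0.3, 0.25, -0.2, 0.3])]
xte = [0.5, 1.5, 2.5, 3.5]
yte = [x + e for x, e in zip(xte, [-0.25, 0.3, -0.3, 0.2])]

def fit_line(xs, ys):
    # least-squares straight line: the simple model
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    a = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / sum((x - xm) ** 2 for x in xs)
    b = ym - a * xm
    return lambda x: a * x + b

def fit_interp(xs, ys):
    # Lagrange interpolating polynomial: the over-complex model with zero training error
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def sse(model, xs, ys):
    # sum of squared prediction errors
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))

line, interp = fit_line(xtr, ytr), fit_interp(xtr, ytr)
```

The interpolant fits every training point exactly but, because it also fits the noise, its total squared error on the testing points exceeds that of the line.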

In general, an over-fitted model does not generalize well to new data points
in the testing data set. When we do not have prior knowledge about a given
data set (e.g., the form or complexity of the classification and prediction func-
tion), we have to empirically try out ANN architectures with varying levels

Figure 5.15
An illustration of a nonlinear model overfitting to data from a linear model.

of complexity by using different numbers of hidden units. Each ANN archi-


tecture is trained to learn weights and biases of connections from a training
data set and is tested for the prediction performance on a testing data set. The
ANN architecture, which performs well on the testing data, is considered to
provide a good fit to the data and is selected.

5.6 Software and Applications


The website http://www.kdnuggets.com has information about various data
mining tools. The following software packages provide software tools for
ANNs with back-propagation learning:

• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (www.mathworks.com/)

Some applications of ANNs can be found in (Ye et al., 1993; Ye, 1996, 2003,
Chapter 3; Ye and Zhao, 1996, 1997).

Exercises
5.1 The training data set for the Boolean function y = NOT x is given
next. Use the graphical method to determine the decision boundary,
the weight, and the bias of a single-unit perceptron for this Boolean
function.
The training data set:

X Y
−1 1
1 −1

5.2 Consider the single-unit perceptron in Exercise 5.1. Assign 0.2 to ini-
tial weights and bias and use the learning rate of 0.3. Use the learning
method to perform one iteration of the weight and bias update for the
two data records of the Boolean function in Exercise 5.1.
5.3 The training data set for a classification function with three attribute vari-
ables and one target variable is given below. Use the graphical method
to determine the decision boundary, the weight, and the bias of a single-
neuron perceptron for this classification function.

The training data set:


x1 x2 x3 y
−1 −1 −1 −1
−1 −1 1 −1
−1 1 −1 −1
−1 1 1 1
1 −1 −1 −1
1 −1 1 1
1 1 −1 1
1 1 1 1

5.4 A single-unit perceptron is used to learn the classification function in


Exercise 5.3. Assign 0.4 to the initial weights and 1.5 to the initial bias
and use the learning rate of 0.2. Use the learning method to perform
one iteration of the weight and bias update for the third and fourth data
records of this function.
5.5 Consider a fully connected two-layer feedforward ANN with one input
variable, one hidden unit, and two output variables. Assign initial weights
and biases of 0.1 and use the learning rate of 0.3. The transfer function is
the sigmoid function for each unit. Show the architecture of the ANN and
perform one iteration of weight and bias update using the back-propaga-
tion learning algorithm and the following training example:
x y1 y2
1 0 1

5.6 The following ANN with the initial weights and biases is used to
learn the XOR function given below. The transfer function for units
1 and 4 is the linear function. The transfer function for units 2 and 3
is the sigmoid transfer function. The learning rate is α = 0.3. Perform
one iteration of the weight and bias update for w1,1, w1,2, w 2,1, w 3,1, w4,2,
w4,3, b2, after feeding x1 = 0 and x2 = 1 to the ANN.

The initial weights and biases shown in the figure (structure reconstructed from the figure labels): inputs x1 and x2 feed unit 1 with weights w1,1 = 0.1 and w1,2 = −0.2; unit 1 feeds units 2 and 3 with weights w2,1 = −0.3 and w3,1 = 0.4; units 2 and 3 feed output unit 4 with weights w4,2 = 0.5 and w4,3 = −0.6, which produces y. The biases are b1 = −0.25, b2 = 0.45, b3 = −0.35, and b4 = −0.55.

XOR:

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0
6
Support Vector Machines

A support vector machine (SVM) learns a classification function with two


target classes by solving a quadratic programming problem. In this chapter,
we briefly review the theoretical foundation of SVM that leads to the formu-
lation of a quadratic programming problem for learning a classifier. We then
introduce the SVM formulation for a linear classifier and a linearly separable
problem, followed by the SVM formulation for a linear classifier and a non-
linearly separable problem and the SVM formulation for a nonlinear classi-
fier and a nonlinearly separable problem based on kernel functions. We also
give methods of applying SVM for a classification function with more than
two target classes. A list of data mining software packages that support SVM
is provided. Some applications of SVM are given with references.

6.1 Theoretical Foundation for Formulating and Solving an Optimization Problem to Learn a Classification Function
Consider a set of n data points, (x 1, y1), …, (x n, yn), and a classification func-
tion to fit to data, y = fA(x), where y takes one of two categorical values {−1, 1},
x is a p-dimensional vector of variables, and A is a set of parameters in the
function f that needs to be learned or determined using the training data. For
example, if an artificial neural network (ANN) is used to learn and represent
the classification function f, connection weights and biases are the parame-
ters in f. The expected risk of classification using f measures the classification
error, and is defined as

R(A) = ∫ |fA(x) − y| P(x, y) dx dy, (6.1)

where P(x, y) denotes the probability function of x and y. The expected risk of
classification depends on A values. A smaller expected risk of classification
indicates a better generalization performance of the classification function
in that the classification function is capable of classifying more data points


correctly. Different sets of A values give different classification functions


fA(x) and thus produce different classification errors and different levels
of the expected risk. The empirical risk over a sample of n data points is
defined as

Remp(A) = (1/n) ∑_{i=1}^{n} |fA(xi) − yi|. (6.2)

Vapnik and Chervonenkis (Vapnik, 1989, 2000) provide the following
bound on the expected risk of classification, which holds with the probability
1 − η:

R(A) ≤ Remp(A) + √{[v(ln(2n/v) + 1) − ln(η/4)]/n}, (6.3)

where v denotes the VC (Vapnik and Chervonenkis) dimension of fA and


measures the complexity of fA, which is controlled by the number of param-
eters A in f for many classification functions. Hence, the expected risk of
classification is bound by both the empirical risk of classification and the
second term in Equation 6.3 with the second term increasing with the
VC-dimension. To minimize the expected risk of classification, we need to
minimize both the empirical risk and the VC-dimension of fA at the same
time. This is called the structural risk minimization principle. Minimizing
the VC-dimension of fA, that is, the complexity of fA, is like looking for a
classification function with the minimum description length for a good
generalization performance as discussed in Chapter 4. SVM searches for
a set of A values that minimize the empirical risk and the VC-dimension
at the same time by formulating and solving an optimization problem,
specifically, a quadratic programming problem. The following sections
provide the SVM formulation of the quadratic programming problem for
three types of classification problems: (1) a linear classifier and a linearly
separable problem, (2) a linear classifier and a nonlinearly separable prob-
lem, and (3) a nonlinear classifier and a nonlinearly separable problem. As
discussed in Chapter 5, the logical AND function is a linearly separable
classification problem and requires only a linear classifier in Type (1), and
the logical XOR function is a nonlinearly separable classification problem
and requires a nonlinear classifier in Type (3). Because a linear classifier
generally has a lower VC-dimension than a nonlinear classifier, using a
linear classifier for a nonlinearly separable problem in Type (2) can some-
times produce a lower bound on the expected risk of classification than
using a nonlinear classifier for the nonlinearly separable problem.
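The size of the second term of the bound in Equation 6.3 is easy to compute; for instance (a sketch with arbitrary illustrative values of v, n, and η):

```python
import math

def vc_confidence(v, n, eta):
    # second term of the bound in Equation 6.3
    return math.sqrt((v * (math.log(2.0 * n / v) + 1.0) - math.log(eta / 4.0)) / n)

# for a fixed VC-dimension, the term shrinks as the sample size n grows
term_small_n = vc_confidence(3, 100, 0.05)
term_large_n = vc_confidence(3, 10000, 0.05)
```

A lower VC-dimension v or a larger sample n tightens the bound on the expected risk, which is the motivation for the structural risk minimization principle.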

6.2 SVM Formulation for a Linear Classifier and a Linearly Separable Problem
Consider the definition of a linear classifier for a perceptron in Chapter 5:

fw,b(x) = sign(w′x + b). (6.4)

The decision boundary separating two target classes {−1, 1} is

w′x + b = 0. (6.5)

The linear classifier works in the following way:

y = sign(w′x + b) = 1 if w′x + b > 0 (6.6)

y = sign(w′x + b) = −1 if w′x + b ≤ 0.

If we impose a constraint,

‖w‖ ≤ M,

where M is a constant and ‖w‖ denotes the norm of the p-dimensional vector w
and is defined as

‖w‖ = √(w1² + ⋯ + wp²).

The set of hyperplanes defined by the following:

{fw,b = sign(w′x + b) | ‖w‖ ≤ M}

has the VC-dimension v that satisfies the bound (Vapnik, 1989, 2000):

v ≤ min{M², p} + 1. (6.7)

By minimizing ‖w‖, we will minimize M and thus the VC-dimension v.
Hence, to minimize the VC-dimension v as required by the structural risk
minimization principle, we want to minimize ‖w‖, or equivalently:

min (1/2)‖w‖². (6.8)
Rescaling w does not change the slope of the hyperplane for the decision
boundary. Rescaling b does not change the slope of the decision boundary but

moves the hyperplane of the decision boundary in parallel. For example, in the
two-dimensional vector space shown in Figure 6.1, the decision boundary is

w1x1 + w2x2 + b = 0 or x2 = −(w1/w2)x1 − b/w2, (6.9)

the slope of the line for the decision boundary is −w1/w2, and the intercept
of the line for the decision boundary is −b/w2. Rescaling w to cww, where cw
is a constant, does not change the slope of the line for the decision boundary
as −cww1/(cww2) = −w1/w2. Rescaling b to cbb, where cb is a constant, does not
change the slope of the line for the decision boundary, but changes the inter-
cept of the line to −cbb/w2 and thus moves the line in parallel.
Figure 6.1 shows examples of data points with the target value of 1
(indicated by small circles) and examples of data points with the target value
of −1 (indicated by small squares). Among the data points with the target
value of 1, we consider the data point closest to the decision boundary, x+1, as
shown by the data point with the solid circle in Figure 6.1. Among the data
points with the target value of −1, we consider the data point closest to the
decision boundary, x−1, as shown by the data point with the solid square in
Figure 6.1. Suppose that for two data points x+1 and x−1 we have
w′x+1 + b = c+1 (6.10)

w′x−1 + b = c−1.

We want to rescale w to cww and rescale b to cbb such that we have

cww′x+1 + cbb = 1 (6.11)

cww′x−1 + cbb = −1,

and still denote the rescaled values by w and b. We have

min{|w′xi + b|, i = 1, …, n} = 1,
which implies |w′ x + b| = 1 for the data point in each target class closest to
the decision boundary w′x + b = 0.

Figure 6.1
SVM for a linear classifier and a linearly separable problem. (a) A decision boundary with a
large margin. (b) A decision boundary with a small margin.

For example, in the two-dimensional vector space of x, Equations 6.10 and


6.11 become the following:

w1x+1,1 + w2 x+1, 2 + b = c+1 (6.12)


w1x −1,1 + w2 x −1, 2 + b = c−1 (6.13)

cw w1x+1,1 + cw w2 x+1, 2 + cbb = 1 (6.14)

cw w1x −1,1 + cw w2 x −1, 2 + cbb = −1. (6.15)

We solve Equations 6.12 through 6.15 to obtain cw and cb. We first use Equation
6.14 to obtain

cw = (1 − cbb)/(w1x+1,1 + w2x+1,2), (6.16)

and substitute cw in Equation 6.16 into Equation 6.15 to obtain

[(1 − cbb)/(w1x+1,1 + w2x+1,2)](w1x−1,1 + w2x−1,2) + cbb = −1. (6.17)

We then use Equations 6.12 and 6.13 to obtain

w1x+1,1 + w2 x+1, 2 = c+1 − b (6.18)

w1x −1,1 + w2 x −1, 2 = c−1 − b , (6.19)

and substitute Equations 6.18 and 6.19 into Equation 6.17 to obtain

[(1 − cbb)/(c+1 − b)](c−1 − b) + cbb = −1

(c−1 − b)/(c+1 − b) − cb(c−1 − b)b/(c+1 − b) + bcb = −1

cb = (2b − c+1 − c−1)/(c+1b − c−1b). (6.20)
We finally use Equation 6.14 to compute cw and substitute Equations 6.18 and
6.20 into the resulting equation to obtain

cw = (1 − cbb)/(w1x+1,1 + w2x+1,2) = (1 − cbb)/(c+1 − b)

= [1 − (2b − c+1 − c−1)/(c+1 − c−1)]/(c+1 − b) = 2/(c+1 − c−1). (6.21)

Equations 6.20 and 6.21 show how to rescale w and b in a two-dimensional


vector space of x.
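A quick numeric check (with illustrative values of our own): the factors cw = 2/(c+1 − c−1) and cb = (2b − c+1 − c−1)/(c+1b − c−1b), which follow from solving Equations 6.12 through 6.15, indeed map the closest data point of each class onto w′x + b = 1 and w′x + b = −1:

```python
# illustrative values (not from the book): w'x(+1) + b = c_pos and w'x(-1) + b = c_neg
c_pos, c_neg, b = 3.0, -2.0, 1.0

# rescaling factors obtained by solving Equations 6.12 through 6.15
c_w = 2.0 / (c_pos - c_neg)
c_b = (2.0 * b - c_pos - c_neg) / (c_pos * b - c_neg * b)

# after rescaling, the closest data point of each target class lands on the margin:
# c_w * w'x(+1) + c_b * b, where w'x(+1) = c_pos - b by Equation 6.18
rescaled_pos = c_w * (c_pos - b) + c_b * b
rescaled_neg = c_w * (c_neg - b) + c_b * b
```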

Let w and b denote the rescaled values. The hyperplane that bisects w′x + b = 1
and w′x + b = −1 is w′x + b = 0, as shown in Figure 6.1. Any data point x with
the target class 1 satisfies

w′x + b ≥ 1

since the data point with the target class of 1 closest to w′x + b = 0 has w′x +
b = 1. Any data point x with the target class of −1 satisfies

w′x + b ≤ −1

since the data point with the target class of −1 closest to w′x + b = 0 has w′x +
b = −1. Therefore, the linear classifier can be defined as follows:

y = sign(w′x + b) = 1 if w′x + b ≥ 1 (6.22)

y = sign(w′x + b) = −1 if w′x + b ≤ −1.



To minimize the empirical risk Remp or the empirical classification error as
required by the structural risk minimization principle defined by Equation
6.3, we require

yi(w′xi + b) ≥ 1, i = 1, …, n. (6.23)

If yi = 1, we want w′xi + b ≥ 1 so that the linear classifier in Equation 6.22
produces the target class 1. If yi = −1, we want w′xi + b ≤ −1 so that the linear
classifier in Equation 6.22 produces the target class of −1. Hence, Equation
6.23 specifies the requirement of the correct classification for the sample of
data points (xi, yi), i = 1, …, n.
Therefore, putting Equations 6.8 and 6.23 together allows us to apply the
structural risk minimization principle of minimizing both the empirical
classification error and the VC-dimension of the classification function.
Equations 6.8 and 6.23 are put together by formulating a quadratic
programming problem:

min_{w,b} (1/2)‖w‖² (6.24)

subject to

yi(w′xi + b) ≥ 1, i = 1, …, n.

6.3 Geometric Interpretation of the SVM Formulation


for the Linear Classifier
‖w‖ in the objective function of the quadratic programming problem in
Formulation 6.24 has a geometric interpretation in that 2/‖w‖ is the distance
between the two hyperplanes w′x + b = 1 and w′x + b = −1. This distance is called

the margin of the decision boundary or the margin of the linear classifier,
with w′x + b = 0 being the decision boundary. To show this in the two-
dimensional vector space of x, let us compute the distance between the two
parallel lines w′x + b = 1 and w′x + b = −1 in Figure 6.1. These two parallel
lines can be represented as follows:

w1x1 + w2 x2 + b = 1 (6.25)

w1x1 + w2 x2 + b = −1. (6.26)

The following line

w2 x1 − w1x2 = 0 (6.27)

passes through the origin and is perpendicular to the lines defined in
Equations 6.25 and 6.26, since the slope of the parallel lines in Equations 6.25
and 6.26 is −w1/w2 and the slope of the line in Equation 6.27 is w2/w1, which
is the negative reciprocal of −w1/w2. By solving Equations 6.25 and 6.27 for x1
and x2, we obtain the coordinates of the data point where these two lines are
intersected:

((1 − b)w1/(w1² + w2²), (1 − b)w2/(w1² + w2²)).

By solving Equations 6.26 and 6.27 for x1 and x2, we obtain the coordinates of
the data point where these two lines are intersected:

((−1 − b)w1/(w1² + w2²), (−1 − b)w2/(w1² + w2²)).

Then we compute the distance between the two data points:

d = √{[(1 − b)w1/(w1² + w2²) − (−1 − b)w1/(w1² + w2²)]² + [(1 − b)w2/(w1² + w2²) − (−1 − b)w2/(w1² + w2²)]²}

= √{[2w1/(w1² + w2²)]² + [2w2/(w1² + w2²)]²} = 2/√(w1² + w2²) = 2/‖w‖. (6.28)

Hence, minimizing (1/2)‖w‖² in the objective function of the quadratic pro-
gramming problem in Formulation 6.24 is to maximize the margin of the
linear classifier or the generalization performance of the linear classifier.
Figure 6.1a and b shows two different linear classifiers with two different
decision boundaries that classify the eight data points correctly but have
different margins. The linear classifier in Figure 6.1a has a larger margin
and is expected to have a better generalization performance than that in
Figure 6.1b.
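The margin can be computed from w directly (a Python sketch using w = (1, 1) and b = −1, the values that arise for the AND function in Example 6.1):

```python
import math

def margin(w):
    # margin of the linear classifier: 2 / ||w|| (Equation 6.28)
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, 1.0], -1.0
m = margin(w)

# cross-check with the intersection points derived in the text:
# the points where w'x + b = 1 and w'x + b = -1 meet the normal line through the origin
norm_sq = sum(wi * wi for wi in w)
p_plus = [(1.0 - b) * wi / norm_sq for wi in w]
p_minus = [(-1.0 - b) * wi / norm_sq for wi in w]
dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(p_plus, p_minus)))
```

Both computations give √2, the distance between the two margin hyperplanes of that classifier.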

6.4 Solution of the Quadratic Programming Problem for a Linear Classifier
The quadratic programming problem in Formulation 6.24 has a quadratic
objective function and a linear constraint with regard to w and b, is a con-
vex optimization problem, and can be solved using the Lagrange multiplier
method for the following problem:
min_{w,b} max_{α≥0} L(w, b, α) = (1/2)‖w‖² − ∑_{i=1}^{n} αi[yi(w′xi + b) − 1] (6.29)

subject to

αi[yi(w′xi + b) − 1] = 0, i = 1, …, n (6.30)

αi ≥ 0, i = 1, …, n,

where αi, i = 1, …, n, are the non-negative Lagrange multipliers, and the two
equations in the constraints are known as the Karush–Kuhn–Tucker condition
(Burges, 1998) and are the transformation of the inequality constraint in
Equation 6.23. The solution to Formulation 6.29 is at the saddle point of
L(w, b, α), where L(w, b, α) is minimized with regard to w and b and maxi-
mized with regard to α. Minimizing (1/2)‖w‖² with regard to w and b covers the
objective function in Formulation 6.24. Minimizing −∑_{i=1}^{n} αi[yi(w′xi + b) − 1]
is to maximize ∑_{i=1}^{n} αi[yi(w′xi + b) − 1] with regard to α and satisfy
yi(w′xi + b) ≥ 1 (the constraint in Formulation 6.24), since αi ≥ 0. At the point
where L(w, b, α) is minimized with regard to w and b, we have

∂L(w, b, α)/∂w = w − ∑_{i=1}^{n} αiyixi = 0 or w = ∑_{i=1}^{n} αiyixi (6.31)

∂L(w, b, α)/∂b = −∑_{i=1}^{n} αiyi = 0, that is, ∑_{i=1}^{n} αiyi = 0. (6.32)

Note that w is determined by only the training data points (xi, yi) for which
αi > 0. Those training data vectors with the corresponding αi > 0 are called
support vectors. Using the Karush–Kuhn–Tucker condition in Equation 6.30
and any support vector (xi, yi) with αi > 0, we have

yi(w′xi + b) − 1 = 0 (6.33)

in order to satisfy Equation 6.30. We also have

yi² = 1 (6.34)

since yi takes the value of 1 or −1. We solve Equations 6.33 and 6.34 for b
and get

b = yi − w′xi (6.35)

because

yi(w′xi + b) − 1 = yi(w′xi + yi − w′xi) − 1 = yi² − 1 = 0.


To compute w using Equations 6.31 and 6.32 and compute b using Equation
6.35, we need to know the values of the Lagrange multipliers α. We
substitute Equations 6.31 and 6.32 into L(w, b, α) in Formulation 6.29 to
obtain L(α):

L(α) = (1/2)∑_{i=1}^{n}∑_{j=1}^{n} αiαjyiyjxi′xj − ∑_{i=1}^{n}∑_{j=1}^{n} αiαjyiyjxi′xj − b∑_{i=1}^{n} αiyi + ∑_{i=1}^{n} αi

= ∑_{i=1}^{n} αi − (1/2)∑_{i=1}^{n}∑_{j=1}^{n} αiαjyiyjxi′xj. (6.36)

Hence, the dual problem to the quadratic programming problem in
Formulation 6.24 is

max_α L(α) = ∑_{i=1}^{n} αi − (1/2)∑_{i=1}^{n}∑_{j=1}^{n} αiαjyiyjxi′xj (6.37)

subject to

∑_{i=1}^{n} αiyi = 0

αi[yi(w′xi + b) − 1] = 0 or ∑_{j=1}^{n} αiαjyiyjxj′xi + αiyib − αi = 0, i = 1, …, n

αi ≥ 0, i = 1, …, n.


In summary, the linear classifier for SVM is solved in the following steps:

1. Solve the optimization problem in Formulation 6.37 to obtain α:

max_α L(α) = ∑_{i=1}^{n} αi − (1/2)∑_{i=1}^{n}∑_{j=1}^{n} αiαjyiyjxi′xj

subject to

∑_{i=1}^{n} αiyi = 0

∑_{j=1}^{n} αiαjyiyjxj′xi + αiyib − αi = 0, i = 1, …, n

αi ≥ 0, i = 1, …, n.

2. Use Equation 6.31 to obtain w:

w = ∑_{i=1}^{n} αiyixi.

3. Use Equation 6.35 and a support vector (xi, yi) to obtain b:

b = yi − w′xi,

and the decision function of the linear classifier is given in
Equation 6.22:

y = sign(w′x + b) = 1 if w′x + b ≥ 1

y = sign(w′x + b) = −1 if w′x + b ≤ −1,

or Equation 6.4:

fw,b(x) = sign(w′x + b) = sign(∑_{i=1}^{n} αiyixi′x + b).

Note that only the support vectors with the corresponding αi > 0
contribute to the computation of w, b and the decision function of the
linear classifier.

Example 6.1
Determine the linear classifier of SVM for the AND function in Table 5.1,
which is copied here in Table 6.1 with x = (x1, x2).
There are four training data points in this problem. We formulate and
solve the optimization problem in Formulation 6.24 as follows:
min_{w1,w2,b} (1/2)[(w1)² + (w2)²]

subject to

w1 + w2 − b ≥ 1
w1 − w2 − b ≥ 1
− w1 + w2 − b ≥ 1
w1 + w2 + b ≥ 1.

Using the optimization toolbox in MATLAB®, we obtain the following
optimal solution to the aforementioned optimization problem:

w1 = 1, w2 = 1, b = −1.

That is, we have

w = [1 1]′, b = −1.
This solution gives the decision function in Equation 6.22 or 6.4 as follows:

y = sign([1 1][x1 x2]′ − 1) = sign(x1 + x2 − 1) = 1 if x1 + x2 − 1 ≥ 1

y = sign([1 1][x1 x2]′ − 1) = sign(x1 + x2 − 1) = −1 if x1 + x2 − 1 ≤ −1

or

fw,b(x) = sign(w′x + b) = sign([1 1][x1 x2]′ − 1) = sign(x1 + x2 − 1).

Table 6.1
AND Function
Data Point # Inputs Output
i x1 x2 y
1 −1 −1 −1
2 −1 1 −1
3 1 −1 −1
4 1 1 1

We can also formulate the optimization problem in Formulation 6.37:

max_α L(α) = ∑_{i=1}^{4} αi − (1/2)∑_{i=1}^{4}∑_{j=1}^{4} αiαjyiyjxi′xj.

Substituting the four data points in Table 6.1 and their target values into
the double sum and collecting terms gives

L(α) = α1 + α2 + α3 + α4 − (1/2)(2α1² + 2α2² + 2α3² + 2α4² − 4α1α4 − 4α2α3)

= −α1² − α2² − α3² − α4² + 2α1α4 + 2α2α3 + α1 + α2 + α3 + α4

= −(α1 − α4)² − (α2 − α3)² + α1 + α2 + α3 + α4

subject to

∑_{i=1}^{4} αiyi = α1y1 + α2y2 + α3y3 + α4y4 = −α1 − α2 − α3 + α4 = 0.

The constraints ∑_{j=1}^{4} αiαjyiyjxj′xi + αiyib − αi = 0, i = 1, 2, 3, 4, become:

−α1(−2α1 − 2α4) − α1b − α1 = 0

−α2(−2α2 + 2α3) − α2b − α2 = 0

−α3(2α2 − 2α3) − α3b − α3 = 0

α4(2α1 + 2α4) + α4b − α4 = 0

αi ≥ 0, i = 1, 2, 3, 4.

Using the optimization toolbox in MATLAB to solve the aforementioned
optimization problem, we obtain the optimal solution:

α1 = 0, α2 = 0.5, α3 = 0.5, α4 = 1, b = −1,

and the value of the objective function equals 1.


The values of the Lagrange multipliers indicate that the second, third,
and fourth data points in Table 6.1 are the support vectors. We then
obtain w using Equation 6.31:

w = ∑_{i=1}^{4} αiyixi

w1 = α 1 y1x1,1 + α 2 y 2 x2 ,1 + α 3 y 3 x3 ,1 + α 4 y 4 x 4 ,1

= (0 ) ( −1) ( −1) + (0.5) ( −1) ( −1) + (0.5) ( −1)(1) + (1)(1)(1) = 1


w2 = α 1 y1x1, 2 + α 2 y 2 x2 , 2 + α 3 y 3 x3 , 2 + α 4 y 4 x 4 , 2

= (0 ) ( −1) ( −1) + (0.5) ( −1)(1) + (0.5) ( −1) ( −1) + (1)(1)(1) = 1.




The optimal solution already includes the value of b = −1. We obtain the
same value of b using Equation 6.35 and the fourth data point as the sup-
port vector:

b = y4 − w′x4 = 1 − [1 1][1 1]′ = 1 − 2 = −1.

The optimal solution of the dual problem for SVM gives the same deci-
sion function:

y = sign([1 1][x1 x2]′ − 1) = sign(x1 + x2 − 1) = 1 if x1 + x2 − 1 ≥ 1

y = sign([1 1][x1 x2]′ − 1) = sign(x1 + x2 − 1) = −1 if x1 + x2 − 1 ≤ −1

or

fw,b(x) = sign(w′x + b) = sign([1 1][x1 x2]′ − 1) = sign(x1 + x2 − 1).

Hence, the optimization problem and its dual problem of SVM for this example problem produce the same optimal solution and decision function. Figure 6.2 illustrates the decision function and the support vectors for this problem. The decision function of SVM is the same as that of ANN for the same problem illustrated in Figure 5.10 in Chapter 5.
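The optimal solution of Example 6.1 can also be checked numerically. The sketch below (plain Python, not the MATLAB toolbox used in the text) plugs the Lagrange multipliers α = (0, 0.5, 0.5, 1) into Equations 6.31 and 6.35 and confirms w = (1, 1), b = −1, and a decision function that reproduces all four training labels:

```python
# Verifying the optimal dual solution of Example 6.1.
# Training data: x_i in {-1, 1}^2, with y = 1 only for (1, 1).
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [-1, -1, -1, 1]
alpha = [0.0, 0.5, 0.5, 1.0]   # optimal Lagrange multipliers from the text

# Equation 6.31: w = sum_i alpha_i * y_i * x_i
w = [sum(a * t * x[d] for a, t, x in zip(alpha, y, X)) for d in range(2)]

# Equation 6.35 with the fourth data point (a support vector): b = y_4 - w'x_4
b = y[3] - (w[0] * X[3][0] + w[1] * X[3][1])

# Decision function f(x) = sign(w'x + b) should reproduce every label
def f(x):
    s = w[0] * x[0] + w[1] * x[1] + b
    return 1 if s >= 0 else -1

predictions = [f(x) for x in X]
```

Only the three support vectors (α > 0) contribute to w, which is why α₁ = 0 drops the first data point out of the sum.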

Figure 6.2
Decision function and support vectors for the SVM linear classifier in Example 6.1. The figure plots the four data points in the (x1, x2) plane together with the lines x1 + x2 − 1 = 1, x1 + x2 − 1 = 0, and x1 + x2 − 1 = −1.

Many books and papers in the literature introduce SVMs using the dual optimization problem in Formulation 6.37 but without the set of constraints:

$$\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_j' x_i + \alpha_i y_i b - \alpha_i = 0, \quad i = 1, \ldots, n.$$

As seen from Example 6.1, without this set of constraints, the dual problem becomes

$$\max_{\alpha} \; -(\alpha_1-\alpha_4)^2 - (\alpha_2-\alpha_3)^2 + \alpha_1+\alpha_2+\alpha_3+\alpha_4$$

subject to

$$-\alpha_1-\alpha_2-\alpha_3+\alpha_4 = 0$$

$$\alpha_i \ge 0, \quad i = 1, 2, 3, 4.$$

If we let α1 = α4 > 0 and α2 = α3 = 0, which satisfy all the constraints, then the objective function becomes max α1 + α4, which is unbounded as α1 and α4 can keep increasing without bound. Hence, Formulation 6.37 of the dual problem with the full set of constraints should be used.
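The unboundedness can be seen numerically. The sketch below evaluates the constraint-stripped dual objective of this example along the feasible ray α1 = α4 = t, α2 = α3 = 0 and shows the objective growing without bound (the helper function is illustrative, not from the text):

```python
# Objective of the dual WITHOUT the extra constraint set, for Example 6.1:
#   L(a) = -(a1 - a4)^2 - (a2 - a3)^2 + a1 + a2 + a3 + a4
def objective(a1, a2, a3, a4):
    return -(a1 - a4) ** 2 - (a2 - a3) ** 2 + a1 + a2 + a3 + a4

# Along the ray a1 = a4 = t, a2 = a3 = 0 the equality constraint
# -a1 - a2 - a3 + a4 = 0 holds, yet the objective equals 2t -> unbounded.
values = [objective(t, 0.0, 0.0, t) for t in (1.0, 10.0, 100.0)]
```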

6.5 SVM Formulation for a Linear Classifier and a Nonlinearly Separable Problem
If an SVM linear classifier is applied to a nonlinearly separable problem (e.g., the logical XOR function described in Chapter 5), it is expected that not every data point in the sample data set can be classified correctly by the SVM linear classifier. The formulation of an SVM for a linear classifier in Formulation 6.24 can be extended to use a soft margin by introducing a set of additional non-negative parameters βi, i = 1, …, n, into the SVM formulation

$$\min_{w,b,\beta} \; \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n}\beta_i\right)^k \quad (6.38)$$

subject to

$$y_i(w'x_i+b) \ge 1-\beta_i, \quad i = 1, \ldots, n$$

$$\beta_i \ge 0, \quad i = 1, \ldots, n,$$

where C > 0 and k ≥ 1 are predetermined for giving the penalty of misclassifying the data points. Introducing βi into the constraint in Formulation 6.38

allows a data point to be misclassified, with βi measuring the level of the misclassification. If a data point is correctly classified, βi is zero. Minimizing $C\left(\sum_{i=1}^{n}\beta_i\right)^k$ in the objective function is to minimize the misclassification error, while minimizing $\frac{1}{2}\|w\|^2$ in the objective function is to minimize the VC-dimension, as discussed previously.
Using the Lagrange multiplier method, we transform Formulation 6.38 to

$$\min_{w,b,\beta}\max_{\alpha\ge 0,\,\gamma\ge 0} L(w,b,\beta,\alpha,\gamma) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n}\beta_i\right)^k - \sum_{i=1}^{n}\alpha_i\left[y_i(w'x_i+b)-1+\beta_i\right] - \sum_{i=1}^{n}\gamma_i\beta_i, \quad (6.39)$$

where γi, i = 1, …, n, are the non-negative Lagrange multipliers. The solution to Formulation 6.39 is at the saddle point of L(w, b, β, α, γ), where L(w, b, β, α, γ) is minimized with regard to w, b, and β and maximized with regard to α and γ. At the point where L(w, b, β, α, γ) is minimized with regard to w, b, and β, we have

$$\frac{\partial L(w,b,\beta,\alpha,\gamma)}{\partial w} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0 \quad\text{or}\quad w = \sum_{i=1}^{n}\alpha_i y_i x_i \quad (6.40)$$

$$\frac{\partial L(w,b,\beta,\alpha,\gamma)}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0 \quad (6.41)$$

$$\frac{\partial L(w,b,\beta,\alpha,\gamma)}{\partial \beta_i} = \begin{cases} pC\left(\sum_{i=1}^{n}\beta_i\right)^{k-1} - \alpha_i - \gamma_i = 0, & i = 1, \ldots, n \;\text{ if } k > 1 \\ C - \alpha_i - \gamma_i = 0, & i = 1, \ldots, n \;\text{ if } k = 1 \end{cases}. \quad (6.42)$$

When k > 1, we denote

$$\delta = pC\left(\sum_{i=1}^{n}\beta_i\right)^{k-1} \quad\text{or}\quad \sum_{i=1}^{n}\beta_i = \left(\frac{\delta}{pC}\right)^{\frac{1}{k-1}}. \quad (6.43)$$

We can rewrite Equation 6.42 to

$$\begin{cases} \delta - \alpha_i - \gamma_i = 0 \;\text{ or }\; \gamma_i = \delta - \alpha_i, & i = 1, \ldots, n \;\text{ if } k > 1 \\ C - \alpha_i - \gamma_i = 0 \;\text{ or }\; \gamma_i = C - \alpha_i, & i = 1, \ldots, n \;\text{ if } k = 1 \end{cases}. \quad (6.44)$$

The Karush–Kuhn–Tucker condition of the optimal solution to Formulation 6.39 gives

$$\alpha_i\left[y_i(w'x_i+b)-1+\beta_i\right] = 0. \quad (6.45)$$

Using a data point (xi, yi) that is correctly classified by the SVM, we have βi = 0 and thus the following based on Equation 6.45:

$$b = y_i - w'x_i, \quad (6.46)$$

which is the same as Equation 6.35. Equations 6.40 and 6.46 are used to compute w and b, respectively, if α is known. We use the dual problem of Formulation 6.39 to determine α as follows.

When k = 1, substituting w, γ, and b from Equations 6.40, 6.44, and 6.46, respectively, into Formulation 6.39 produces

$$\max_{\alpha\ge 0} L(\alpha) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\beta_i - \sum_{i=1}^{n}\alpha_i\left[y_i(w'x_i+b)-1+\beta_i\right] - \sum_{i=1}^{n}\gamma_i\beta_i$$

$$= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i'x_j + C\sum_{i=1}^{n}\beta_i - \sum_{i=1}^{n}\alpha_i\left[y_i\left(\sum_{j=1}^{n}\alpha_j y_j x_j'x_i + b\right)-1+\beta_i\right] - \sum_{i=1}^{n}(C-\alpha_i)\beta_i = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i'x_j \quad (6.47)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0$$

$$\alpha_i \le C, \quad i = 1, \ldots, n$$

$$\alpha_i \ge 0, \quad i = 1, \ldots, n.$$

The constraint αi ≤ C comes from Equation 6.44:

$$C - \alpha_i - \gamma_i = 0 \quad\text{or}\quad C - \alpha_i = \gamma_i.$$

Since γi ≥ 0, we have C ≥ αi.
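As an alternative to solving the box-constrained dual in Equation 6.47 with a quadratic programming solver, the k = 1 soft-margin primal can be minimized directly, since ½‖w‖² + C Σᵢ max(0, 1 − yᵢ(w′xᵢ + b)) is a convex function of (w, b). The sketch below applies plain subgradient descent to the training data of Example 6.1; the learning rate, iteration count, and C = 1 are illustrative choices, not values from the text:

```python
# Subgradient descent on the k = 1 soft-margin objective:
#   (1/2)||w||^2 + C * sum_i max(0, 1 - y_i * (w'x_i + b))
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [-1, -1, -1, 1]
C, lr, steps = 1.0, 0.05, 4000
w, b = [0.0, 0.0], 0.0

for _ in range(steps):
    gw, gb = [w[0], w[1]], 0.0          # gradient of the (1/2)||w||^2 term
    for (x1, x2), t in zip(X, y):
        if t * (w[0] * x1 + w[1] * x2 + b) < 1:   # hinge term is active
            gw[0] -= C * t * x1
            gw[1] -= C * t * x2
            gb -= C * t
    w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
    b -= lr * gb

predictions = [1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1 for x1, x2 in X]
```

Because this toy data set is linearly separable, the soft margin reduces to the hard margin here; the value of C matters only when some points cannot be classified correctly.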

When k > 1, substituting w, γ, and b from Equations 6.40, 6.44, and 6.46, respectively, into Formulation 6.39 produces

$$\max_{\alpha\ge 0,\,\delta} L(\alpha) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n}\beta_i\right)^k - \sum_{i=1}^{n}\alpha_i\left[y_i(w'x_i+b)-1+\beta_i\right] - \sum_{i=1}^{n}\gamma_i\beta_i$$

$$= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i'x_j + C\left(\sum_{i=1}^{n}\beta_i\right)^k - \sum_{i=1}^{n}\alpha_i\left[y_i\left(\sum_{j=1}^{n}\alpha_j y_j x_j'x_i + b\right)-1+\beta_i\right] - \sum_{i=1}^{n}(\delta-\alpha_i)\beta_i = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i'x_j - \frac{\delta^{\frac{p}{p-1}}}{(pC)^{\frac{1}{p-1}}}\left(1-\frac{1}{p}\right) \quad (6.48)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0$$

$$\alpha_i \le \delta, \quad i = 1, \ldots, n$$

$$\alpha_i \ge 0, \quad i = 1, \ldots, n.$$

The decision function of the linear classifier is given in Equation 6.22:

$$y = \operatorname{sign}(w'x+b) = 1 \quad\text{if } w'x+b \ge 1$$

$$y = \operatorname{sign}(w'x+b) = -1 \quad\text{if } w'x+b \le -1,$$

or Equation 6.4:

$$f_{w,b}(x) = \operatorname{sign}(w'x+b) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i x_i'x + b\right).$$
Only the support vectors with the corresponding αi > 0 contribute to the
computation of w, b, and the decision function of the linear classifier.

6.6 SVM Formulation for a Nonlinear Classifier and a Nonlinearly Separable Problem

The soft margin SVM is extended to a nonlinearly separable problem by transforming the p-dimensional x into an l-dimensional feature space where x can be classified using a linear classifier. The transformation of x is represented as

x → φ(x),

where

$$\varphi(x) = \left(h_1\phi_1(x), \ldots, h_l\phi_l(x)\right). \quad (6.49)$$

The formulation of the soft margin SVM becomes the following.

When k = 1,

$$\max_{\alpha\ge 0} L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \varphi(x_i)'\varphi(x_j) \quad (6.50)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \le C, \; i = 1, \ldots, n, \quad \alpha_i \ge 0, \; i = 1, \ldots, n.$$

When k > 1,

$$\max_{\alpha\ge 0,\,\delta} L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \varphi(x_i)'\varphi(x_j) - \frac{\delta^{\frac{p}{p-1}}}{(pC)^{\frac{1}{p-1}}}\left(1-\frac{1}{p}\right) \quad (6.51)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \le \delta, \; i = 1, \ldots, n, \quad \alpha_i \ge 0, \; i = 1, \ldots, n,$$

with the decision function:

$$f_{w,b}(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i \varphi(x_i)'\varphi(x) + b\right). \quad (6.52)$$

If we define a kernel function K(x, y) as

$$K(x, y) = \varphi(x)'\varphi(y) = \sum_{i=1}^{l} h_i^2\,\phi_i(x)\phi_i(y), \quad (6.53)$$

the formulation of the soft margin SVM in Equations 6.50 through 6.52 becomes the following.

When k = 1,

$$\max_{\alpha\ge 0} L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \quad (6.54)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \le C, \; i = 1, \ldots, n, \quad \alpha_i \ge 0, \; i = 1, \ldots, n.$$

When k > 1,

$$\max_{\alpha\ge 0,\,\delta} L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \frac{\delta^{\frac{p}{p-1}}}{(pC)^{\frac{1}{p-1}}}\left(1-\frac{1}{p}\right) \quad (6.55)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \le \delta, \; i = 1, \ldots, n, \quad \alpha_i \ge 0, \; i = 1, \ldots, n,$$

with the decision function:

$$f_{w,b}(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i, x) + b\right). \quad (6.56)$$

The soft margin SVM in Equations 6.50 through 6.52 requires the transformation φ(x) and then solving the SVM in the feature space, while the soft margin SVM in Equations 6.54 through 6.56 uses a kernel function K(x, y) directly. To work in the feature space using Equations 6.50 through 6.52, some examples of the transformation function for an input vector x in a one-dimensional space are provided next:

$$\varphi(x) = \left(1, x, \ldots, x^d\right) \quad (6.57)$$

$$K(x, y) = \varphi(x)'\varphi(y) = 1 + xy + \cdots + (xy)^d.$$


 1 1 
j ( x ) =  sin x , sin ( 2x ) , …, sin (ix ) , … (6.58)
 2 i 

sin ( x + y 2)

∑ i sin (ix) sin (iy ) = 2 log sin (x − y 2)


1 1
K ( x , y ) = j ( x )¢ j ( y ) =
i =1

x , y ∈[ 0 , π ] .

An example of the transformation function for an input vector x = (x1, x2) in a two-dimensional space is given next:

$$\varphi(x) = \left(1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2\right) \quad (6.59)$$

$$K(x, y) = \varphi(x)'\varphi(y) = (1 + x'y)^2.$$


An example of the transformation function for an input vector x = (x1, x2, x3) in a three-dimensional space is given next:

$$\varphi(x) = \left(1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3\right) \quad (6.60)$$

$$K(x, y) = \varphi(x)'\varphi(y) = (1 + x'y)^2.$$
Principal component analysis, described in Chapter 14, can be used to produce the principal components for constructing φ(x). However, principal components may not necessarily give appropriate features that lead to a linear classifier in the feature space.

For the transformation functions in Equations 6.57 through 6.60, it is easier to compute the kernel functions directly than to start from computing the transformation functions and working in the feature space, since the SVM can be solved using a kernel function directly. Some examples of kernel functions are provided next:

$$K(x, y) = (1 + x'y)^d \quad (6.61)$$

$$K(x, y) = e^{-\frac{\|x-y\|^2}{2\sigma^2}} \quad (6.62)$$

$$K(x, y) = \tanh(\rho x'y - \theta). \quad (6.63)$$

The kernel functions in Equations 6.61 through 6.63 produce a polynomial decision function as shown in Figure 6.3, a Gaussian radial basis function as shown in Figure 6.4, and a multilayer perceptron for some values of ρ and θ, respectively.

Figure 6.3
A polynomial decision function in a two-dimensional space.

Figure 6.4
A Gaussian radial basis function in a two-dimensional space.

The addition and the tensor product of kernel functions are often used to construct more complex kernel functions as follows:

$$K(x, y) = \sum_i K_i(x, y) \quad (6.64)$$

$$K(x, y) = \prod_i K_i(x, y). \quad (6.65)$$

6.7 Methods of Using SVM for Multi-Class Classification Problems
SVM as described in the previous sections is a binary classifier that deals with only two target classes. For a classification problem with more than two target classes, there are several methods that can be used to first build binary classifiers and then combine them to handle multiple target classes. Suppose that the target classes are T1, T2, …, Ts. In the one-versus-one method, a binary classifier is built for every pair of target classes, Ti versus Tj, i ≠ j. Among the target classes produced by all the binary classifiers for a given input vector, the most dominant target class is taken as the final target class for the input vector. In the one-versus-all method, a binary classifier is built to distinguish each target class Ti from all the other target classes, which are considered together as another target class NOT-Ti. If all the binary classifiers produce consistent classification results for a given input vector, with one binary classifier producing Ti and all the other classifiers producing NOT-Tj, j ≠ i, the final target class for the input vector is Ti. However, if the binary classifiers produce inconsistent classification results for a given input vector, it is difficult to determine the final target class for the input vector. For example, both Ti and Tj, i ≠ j, may appear in the classification results, and it is difficult to determine whether the final target class is Ti or Tj. The error-correcting output coding method generates a unique binary code consisting of binary bits for each target class, builds a binary classifier for each binary bit, and takes as the final target class the target class whose binary code is closest to the string of binary bits produced by all the binary classifiers. However, it is not straightforward to generate a unique binary code for each target class so that the resulting set of binary codes for all the target classes leads to the minimum classification error for the training data points.
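The one-versus-one scheme can be sketched independently of the underlying binary learner. In the sketch below, a nearest-centroid rule stands in for each pairwise binary classifier (an illustrative substitute, not a pairwise SVM); with s target classes it builds s(s − 1)/2 pairwise classifiers and takes the majority vote:

```python
from itertools import combinations
from collections import Counter

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def train_one_vs_one(data):
    """data: dict mapping target class -> list of points.
    Builds one classifier per class pair (nearest centroid here,
    standing in for a pairwise binary SVM)."""
    classifiers = []
    for ti, tj in combinations(sorted(data), 2):
        ci, cj = centroid(data[ti]), centroid(data[tj])
        def clf(x, ti=ti, tj=tj, ci=ci, cj=cj):
            di = sum((a - b) ** 2 for a, b in zip(x, ci))
            dj = sum((a - b) ** 2 for a, b in zip(x, cj))
            return ti if di <= dj else tj
        classifiers.append(clf)
    return classifiers

def predict(classifiers, x):
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]   # most dominant target class wins

# Illustrative three-class data set
data = {
    "T1": [(0.0, 0.0), (0.2, 0.1)],
    "T2": [(5.0, 5.0), (5.2, 4.9)],
    "T3": [(0.0, 5.0), (0.1, 5.2)],
}
clfs = train_one_vs_one(data)
```

Note that every pairwise classifier must emit one of its two classes even for inputs belonging to neither, so a non-winning class can still collect stray votes; the majority vote absorbs these, which is the appeal of one-versus-one.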

6.8  Comparison of ANN and SVM


The learning of an ANN, as described in Chapter 5, requires the search for weights and biases of the ANN toward the minimum of the classification error for the training data points, although the search may end up with a local minimum. An SVM is solved to obtain the global optimal solution. However, for a nonlinear classifier and a nonlinearly separable problem, it is often uncertain what kernel function is right to transform the nonlinear problem into a linearly separable problem since the underlying classification function is unknown. Without an appropriate kernel function, we may end up using an inappropriate kernel function and thus obtaining a solution with a classification error greater than that from a global optimal solution with an appropriate kernel function. Hence, using an SVM for a nonlinear classifier and a nonlinearly separable problem involves the search for a good kernel function to classify the training data through trial and error, just as learning an ANN involves determining an appropriate configuration of the ANN (i.e., the number of hidden units) through trial and error. Moreover, computing $\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i'x_j$ or $\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$ in the objective function of an SVM for a large set of training data (e.g., one containing 50,000 training data points) requires computing 2.5 × 10⁹ terms and a large memory space, and thus induces a large computational cost. Osuna et al. (1997) apply an SVM to a face detection problem and show that the classification performance of the SVM is close to that of an ANN developed by Sung and Poggio (1998).

6.9  Software and Applications


MATLAB (www.mathworks.com) supports SVM. The optimization toolbox in MATLAB can be used to solve an optimization problem in SVM. Osuna et al. (1997) report an application of SVM to face detection. There are many other SVM applications in the literature (www.support-vector-machines.org).

Exercises
6.1 Determine the linear classifier of SVM for the OR function in Table 5.2
using the SVM formulation for a linear classifier in Formulations 6.24
and 6.29.
6.2 Determine the linear classifier of SVM for the NOT function using
the SVM formulation for a linear classifier in Formulations 6.24 and
6.29. The training data set for the NOT function, y = NOT x, is given
next:
The training data set:

X Y
−1 1
1 −1

6.3 Determine the linear classifier of SVM for a classification function with
the following training data, using the SVM formulation for a linear
classifier in Formulations 6.24 and 6.29.
The training data set:
x1 x2 x3 y
−1 −1 −1 0
−1 −1 1 0
−1 1 −1 0
−1 1 1 1
1 −1 −1 0
1 −1 1 1
1 1 −1 1
1 1 1 1
7
k-Nearest Neighbor Classifier and Supervised Clustering

This chapter introduces two classification methods: the k-nearest neighbor classifier and supervised clustering, which includes the k-nearest neighbor classifier as a part of the method. Some applications of supervised clustering are given with references.

7.1  k-Nearest Neighbor Classifier


For a data point xi with p attribute variables:

$$x_i = \begin{bmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix}$$

and one target variable y whose categorical value needs to be determined, a


k-nearest neighbor classifier first locates k data points that are most similar
to (i.e., closest to) the data point as the k-nearest neighbors of the data point
and then uses the target classes of these k-nearest neighbors to determine
the target class of the data point. To determine the k-nearest neighbors
of the data point, we need to use a measure of similarity or dissimilarity
between data points. Many measures of similarity or dissimilarity exist,
including the Euclidean distance, the Minkowski distance, the Hamming
distance, Pearson’s correlation coefficient, and cosine similarity, which are
described in this section.
The Euclidean distance is defined as

$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{p}\left(x_{i,l} - x_{j,l}\right)^2}, \quad i \ne j. \quad (7.1)$$

The Euclidean distance is a measure of dissimilarity between two data


points xi and xj. The larger the Euclidean distance is, the more dissimilar the


two data points are, and the farther apart the two data points are separated
in the p-dimensional data space.
The Minkowski distance is defined as
$$d(x_i, x_j) = \left(\sum_{l=1}^{p}\left|x_{i,l} - x_{j,l}\right|^r\right)^{1/r}, \quad i \ne j. \quad (7.2)$$

The Minkowski distance is also a measure of dissimilarity. If we let r = 2, the Minkowski distance gives the Euclidean distance. If we let r = 1 and each attribute variable takes a binary value, the Minkowski distance gives the Hamming distance, which counts the number of bits that differ between two binary strings.
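The Minkowski family of Equation 7.2 can be implemented once and specialized by the choice of r; a minimal sketch:

```python
def minkowski(xi, xj, r):
    # Equation 7.2: d = (sum_l |x_il - x_jl|^r)^(1/r)
    return sum(abs(a - b) ** r for a, b in zip(xi, xj)) ** (1.0 / r)

# r = 2 recovers the Euclidean distance of Equation 7.1
d_euclid = minkowski((1, 0, 0), (0, 1, 0), 2)         # sqrt(2)

# r = 1 on binary vectors recovers the Hamming distance
d_hamming = minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1)  # 2 differing bits
```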
When the Minkowski distance measure is used, different attribute vari-
ables may have different means, variances, and ranges and bring different
scales into the distance computation. For example, values of one attribute
variable, xi, may range from 0 to 10, whereas values of another attribute vari-
able, xj, may range from 0 to 1. Two values of xi, 1 and 8, produce the absolute
difference of 7, whereas two values of xj, 0.1 and 0.8, produce the absolute
difference of 0.7. When both 7 and 0.7 are used in summing up the differ-
ences of two data points over all the attribute variables in Equation 7.2, the
absolute difference on  xj becomes irrelevant when it is compared with the
absolute difference on xi. Hence, the normalization may be necessary before
the Minkowski distance measure is used. Several methods of normalization
can be used. One normalization method uses the following formula to nor-
malize a variable x and produce a normalized variable z with the mean of
zero and the variance of 1:
x−x
z= , (7.3)
s

where x̄ and s are the sample average and the sample standard deviation of x. Another normalization method uses the following formula to normalize a variable x and produce a normalized variable z with values in the range of [0, 1]:

$$z = \frac{x_{\max} - x}{x_{\max} - x_{\min}}. \quad (7.4)$$

The normalization is performed by applying the same normalization method


to all the attribute variables. The normalized attribute variables are used to
compute the Minkowski distance.
The following defines Pearson’s correlation coefficient ρ:
s xi x j
ρ xi x j = , (7.5)
s xi s x j


where $s_{x_i x_j}$, $s_{x_i}$, and $s_{x_j}$ are the estimated covariance of xi and xj, the estimated standard deviation of xi, and the estimated standard deviation of xj, respectively, and are computed using a sample of n data points as follows:

$$s_{x_i x_j} = \frac{1}{n-1}\sum_{l=1}^{n}\left(x_{i,l}-\bar{x}_i\right)\left(x_{j,l}-\bar{x}_j\right) \quad (7.6)$$

$$s_{x_i} = \sqrt{\frac{1}{n-1}\sum_{l=1}^{n}\left(x_{i,l}-\bar{x}_i\right)^2} \quad (7.7)$$

$$s_{x_j} = \sqrt{\frac{1}{n-1}\sum_{l=1}^{n}\left(x_{j,l}-\bar{x}_j\right)^2} \quad (7.8)$$

$$\bar{x}_i = \frac{1}{n}\sum_{l=1}^{n}x_{i,l} \quad (7.9)$$

$$\bar{x}_j = \frac{1}{n}\sum_{l=1}^{n}x_{j,l}. \quad (7.10)$$
l =1

Pearson's correlation coefficient falls in the range of [−1, 1] and is a measure of similarity between two data points xi and xj. The larger the value of Pearson's correlation coefficient, the more correlated or similar the two data points are. A more detailed description of Pearson's correlation coefficient is given in Chapter 14.

The cosine similarity considers two data points xi and xj as two vectors in the p-dimensional space and uses the cosine of the angle θ between the two vectors to measure the similarity of the two data points as follows:

$$\cos(\theta) = \frac{x_i'x_j}{\|x_i\|\,\|x_j\|}, \quad (7.11)$$

where ‖xi‖ and ‖xj‖ are the lengths of the two vectors and are computed as follows:

$$\|x_i\| = \sqrt{x_{i,1}^2 + \cdots + x_{i,p}^2} \quad (7.12)$$

$$\|x_j\| = \sqrt{x_{j,1}^2 + \cdots + x_{j,p}^2}. \quad (7.13)$$


When θ = 0°, that is, the two vectors point in the same direction, cos(θ) = 1. When θ = 180°, that is, the two vectors point in opposite directions, cos(θ) = −1. When θ = 90° or 270°, that is, the two vectors are orthogonal, cos(θ) = 0. Hence, like Pearson's correlation coefficient, the cosine similarity measure gives a value in the range of [−1, 1] and is a measure of similarity between two data points xi and xj. The larger the value of the cosine similarity, the more similar the two data points are. A more detailed description of the computation of the angle between two data vectors is given in Chapter 14.
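The cosine similarity of Equation 7.11 can be sketched in a few lines; the test vectors below are illustrative choices covering the three special angles just discussed:

```python
import math

def cosine_similarity(xi, xj):
    # Equation 7.11: cos(theta) = xi'xj / (||xi|| ||xj||)
    dot = sum(a * b for a, b in zip(xi, xj))
    ni = math.sqrt(sum(a * a for a in xi))
    nj = math.sqrt(sum(b * b for b in xj))
    return dot / (ni * nj)

same = cosine_similarity((1.0, 2.0), (2.0, 4.0))        # same direction -> 1
opposite = cosine_similarity((1.0, 0.0), (-3.0, 0.0))   # opposite -> -1
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 5.0))  # orthogonal -> 0
```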
To classify a data point x, the similarity of the data point x to each of n data points in the training data set is computed using a selected measure of similarity or dissimilarity. Among the n data points in the training data set, the k data points that are most similar to the data point x are considered as the k-nearest neighbors of x. The dominant target class of the k-nearest neighbors is taken as the target class of x. In other words, the k-nearest neighbors use the majority voting rule to determine the target class of x. For example, suppose that for a data point x to be classified, we have the following:

• k is set to 3
• The target variable takes one of two target classes: A and B
• Two of the 3-nearest neighbors have the target class of A

The 3-nearest neighbor classifier assigns A as the target class of x.

Example 7.1

Use a 3-nearest neighbor classifier and the Euclidean distance measure of dissimilarity to classify whether or not a manufacturing system is faulty using values of the nine quality variables. The training data set in Table 7.1 gives a part of the data set in Table 1.4 and includes nine single-fault cases and the nonfault case in a manufacturing system. For the ith data observation, there are nine attribute variables for the quality of parts, (xi,1, …, xi,9), and one target variable yi for system fault. Table 7.2 gives test cases for some multiple-fault cases.

For the first data point in the testing data set x = (1, 1, 0, 1, 1, 0, 1, 1, 1), the Euclidean distances of this data point to the ten data points in the training data set are 1.73, 2, 2.45, 2.24, 2, 2.65, 2.45, 2.45, 2.45, 2.65, respectively. For example, the Euclidean distance between x and the first data point in the training data set x1 = (1, 0, 0, 0, 1, 0, 1, 0, 1) is

$$d(x_1, x) = \sqrt{(1-1)^2+(0-1)^2+(0-0)^2+(0-1)^2+(1-1)^2+(0-0)^2+(1-1)^2+(0-1)^2+(1-1)^2} = \sqrt{3} = 1.73.$$

Table 7.1
Training Data Set for System Fault Detection
Attribute Variables Target Variable

Instance i Quality of Parts


(Faulty Machine) xi1 xi2 xi3 xi4 xi5 xi6 xi7 xi8 xi9 System Fault yi
  1 (M1) 1 0 0 0 1 0 1 0 1 1
  2 (M2) 0 1 0 1 0 0 0 1 0 1
  3 (M3) 0 0 1 1 0 1 1 1 0 1
  4 (M4) 0 0 0 1 0 0 0 1 0 1
  5 (M5) 0 0 0 0 1 0 1 0 1 1
  6 (M6) 0 0 0 0 0 1 1 0 0 1
  7 (M7) 0 0 0 0 0 0 1 0 0 1
  8 (M8) 0 0 0 0 0 0 0 1 0 1
  9 (M9) 0 0 0 0 0 0 0 0 1 1
10 (none) 0 0 0 0 0 0 0 0 0 0

Table 7.2
Testing Data Set for System Fault Detection and the Classification Results
in Examples 7.1 and 7.2
Target Variable
Attribute Variables (Quality of Parts) (System Fault yi)
Instance i True Classified
(Faulty Machine) xi1 xi2 xi3 xi4 xi5 xi6 xi7 xi8 xi9 Value Value
  1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1
  2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 1
  3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1
  4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1
  5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1
  6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 1
  7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 1
  8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 1
  9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 1
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 1
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 1
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1

The 3-nearest neighbors of x are x1, x2, and x5 in the training data set, which all take the target class of 1 for the system being faulty. Hence, the target class of 1 is assigned to the first data point in the testing data set. Since in the training data set there is only one data point with the target class of 0, the 3-nearest neighbors of each data point in the testing data set include at least two data points whose target class is 1, producing the target class of 1 for each data point in the testing data set. If we attempt to classify data point 10 with the true target class of 0 in the training data set, the 3-nearest neighbors of this data point are the data point itself and two other data points with the target class of 1, giving the target class of 1 for data point 10 in the training data set, which is different from the true target class of this data point.

However, if we let k = 1 for this example, the 1-nearest neighbor classifier assigns the correct target class to each data point in the training data set since each data point in the training data set has itself as its 1-nearest neighbor. The 1-nearest neighbor classifier also assigns the correct target class of 1 to each data point in the testing data set since data point 10 in the training data set is the only data point with the target class of 0 and its attribute variables have the values of zero, making data point 10 not the 1-nearest neighbor to any data point in the testing data set.
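The classification in Example 7.1 can be reproduced with a short k-nearest neighbor sketch; the training rows below are those of Table 7.1, and the helper names are illustrative:

```python
import math
from collections import Counter

# Table 7.1: nine single-fault cases (y = 1) and the nonfault case (y = 0)
train = [
    ((1, 0, 0, 0, 1, 0, 1, 0, 1), 1), ((0, 1, 0, 1, 0, 0, 0, 1, 0), 1),
    ((0, 0, 1, 1, 0, 1, 1, 1, 0), 1), ((0, 0, 0, 1, 0, 0, 0, 1, 0), 1),
    ((0, 0, 0, 0, 1, 0, 1, 0, 1), 1), ((0, 0, 0, 0, 0, 1, 1, 0, 0), 1),
    ((0, 0, 0, 0, 0, 0, 1, 0, 0), 1), ((0, 0, 0, 0, 0, 0, 0, 1, 0), 1),
    ((0, 0, 0, 0, 0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0, 0, 0, 0, 0), 0),
]

def euclidean(a, b):
    # Equation 7.1
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_classify(x, k):
    neighbors = sorted(train, key=lambda row: euclidean(row[0], x))[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]   # majority vote of the k neighbors

x = (1, 1, 0, 1, 1, 0, 1, 1, 1)   # first test case of Table 7.2
d1 = euclidean(train[0][0], x)     # 1.73 in the text
label = knn_classify(x, 3)         # 1: the system is faulty
```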

The classification results in Example 7.1 for k = 3 in comparison with the


classification results for k = 1 indicate that the selection of the k value plays an
important role in determining the target class of a data point. In Example 7.1,
k = 1 produces a better classification performance than k = 3. In some other
examples or applications, if k is too small, e.g., k = 1, the 1-nearest neighbor
of the data point x may happen to be an outlier or come from noise in the
training data set. Letting x take the target class of such a neighbor does not
give the outcome that reflects data patterns in the data set. If k is too large, the
group of the k-nearest neighbors may include the data points that are located
far away from and are not even similar to x. Letting such dissimilar data
points vote for the target class of x as its neighbors seems irrational.
The supervised clustering method in the next section extends the k-nearest
neighbor classifier by first determining similar data clusters and then using
these data clusters to classify a data point. Since data clusters give a more
coherent picture of the training data set than individual data points, clas-
sifying a data point based on its nearest data clusters and their target classes
is expected to give more robust classification performance than a k-nearest
neighbor classifier which depends on individual data points.

7.2  Supervised Clustering


The supervised clustering algorithm was developed and applied to cyber
attack detection for classifying the observed data of computer and network
activities into one of two target classes: attacks and normal use activities

(Li and Ye, 2002, 2005, 2006; Ye, 2008; Ye and Li, 2002). The algorithm can
also be applied to other classification problems.
For cyber attack detection, the training data contain large amounts of computer and network data for learning data patterns of attacks and normal
use activities. In addition, more training data are added over time to update
data patterns of attacks and normal activities. Hence, a scalable, incremental
learning algorithm is required so that data patterns of attacks and normal
use activities are maintained and updated incrementally with the addition
of each new data observation rather than processing all data observations in
the training data set in one batch. The supervised clustering algorithm was
developed as a scalable, incremental learning algorithm to learn and update
data patterns for classification.
During the training, the supervised clustering algorithm takes data points
in the training data set one by one to group them into clusters of similar data
points based on their attribute values and target values. We start with the first
data point in the training data set and let the first cluster to contain this data
point and to take the target class of the data point as the target class of the
data cluster. Taking the second data point in the training data set, we want to
let this data point join the closest cluster that has the same target class as the
target class of this data point. In the supervised clustering algorithm, we use
the mean vector of all the data points in a data cluster as the centroid of the data
cluster that is used to represent the location of the data cluster and compute the
distance of a data point to this cluster. The clustering of data points is based on
not only values of attribute variables to measure the distance of a data point to a
data cluster but also target classes of the data point and the data cluster to make
the data point join a data cluster with the same target class. All data points in
the same cluster have the same target class, which is also the target class of the
cluster. Because the algorithm uses the target class to guide or supervise the
clustering of data points, the algorithm is called supervised clustering.
Suppose that the distance of the first data point and the second data point
in the training data set is large but the second data point has the same target
class as the target class of the first cluster containing the first data point,
the second data point still has to join this cluster because this is the only
data cluster so far with the same target class. Hence, the clustering results
depend on the order in which data points are taken from the training data
set, causing the problem called the local bias of the input order. To address
this problem, the supervised clustering algorithm sets up an initial data cluster for each target class. For each target class, the centroid of all data points
with the target class in the training data set is first computed using the mean
vector of the data points. Then an initial cluster for the target class is set up
to have the mean vector as the centroid of the cluster and the target class,
which is different from any target class of the data points in the training
data set. For example, if there are totally two target classes of T1 and T2 in the
training data, there are two initial clusters. One initial cluster has the mean
vector of the data points for T1 as the centroid. Another initial cluster has the

mean vector of the data points for T2 as the centroid. Both initial clusters are
assigned to a target class, e.g., T3, which is different from T1 and T2. Because
these initial data clusters do not contain any individual data points, they are
called the dummy clusters. All the dummy clusters have the target class that
is different from any target class in the training data set. The supervised
clustering algorithm requires a data point to form its own cluster if its closest
data cluster is a dummy cluster. With the dummy clusters, the first data point
from the training data set forms a new cluster since there are only dummy
clusters initially and the closest cluster to this data point is a dummy cluster.
If the second data point has the same target class of the first data point but is
located far away from the first data point, a dummy cluster is more likely the
closest cluster to the second data point than the data cluster containing the
first data point. This makes the second data point form its own cluster rather
than joining the cluster with the first data point, and thus addresses the local
bias problem due to the input order of training data points.
During the testing, the supervised clustering algorithm applies a k-nearest
neighbor classifier to the data clusters obtained from the training phase by
determining the k-nearest cluster neighbors of the data point to be classified and
letting these k-nearest data clusters vote for the target class of the data point.
Table 7.3 gives the steps of the supervised clustering algorithm. The following
notations are used in the description of the algorithm:

xi = (xi,1, …, xi,p, yi  ): a data point in the training data set with a known
value of yi, for i = 1, …, n
x = (x1, …, xp, y): a testing data point with the value of y to be determined
Tj: the jth target class, j = 1, …, s
C: a data cluster
nC: the number of data points in the data cluster C
xC : the centroid of the data cluster C that is the mean vector of all data
points in C

In Step 4 of the training phase, after the data point xi joins the data cluster
C, the centroid of the data cluster C is updated incrementally to produce
xC (t + 1) (the updated centroid) using xi, xC (t ) (the current cluster centroid),
and nC(t) (the current number of data points in C):

$$\bar{x}_C(t+1) = \begin{bmatrix} \dfrac{n_C(t)\,\bar{x}_{C,1}(t) + x_{i,1}}{n_C(t)+1} \\ \vdots \\ \dfrac{n_C(t)\,\bar{x}_{C,p}(t) + x_{i,p}}{n_C(t)+1} \end{bmatrix}. \quad (7.14)$$
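The incremental update in Equation 7.14 can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `update_centroid` is ours, not from the text.

```python
# Incremental centroid update (Equation 7.14): when point xi joins cluster C,
# each centroid component is recomputed from the current centroid and the
# current cluster size, without revisiting the points already in the cluster.

def update_centroid(centroid, n_c, xi):
    """Return the updated centroid after xi joins a cluster of size n_c."""
    return [(n_c * c + x) / (n_c + 1) for c, x in zip(centroid, xi)]

# Example: a cluster currently holding one point (its centroid equals that point)
centroid = [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
xi = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
centroid = update_centroid(centroid, 1, xi)
# centroid is now the mean of the two points: [0, 0.5, 0, 1, 0, 0, 0, 1, 0]
```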

k-Nearest Neighbor Classifier and Supervised Clustering 125

Table 7.3
Supervised Clustering Algorithm
Step Description
Training
1 Set up s dummy clusters for s target classes, respectively, determine the centroid
of each dummy cluster by computing the mean vector of all the data points in
the training data set with the target class Tj, and assign Ts+1 as the target class of
each dummy cluster where Ts+1 ≠ Tj, j = 1, …, s
2 FOR i = 1 to n
3 Compute the distance of xi to each data cluster C including each dummy
cluster, d(xi, x C ), using a measure of similarity
4 If the nearest cluster to the data point xi has the same target class as that of the
data point, let the data point join this cluster, and update the centroid of this
cluster and the number of data points in this cluster
5 If the nearest cluster to the data point xi has a different target class from that of
the data point, form a new cluster containing this data point, use the attribute
values of this data point as the centroid of this new cluster, let the number of
data points in the cluster be 1, and assign the target class of the data point as
the target class of the new cluster

Testing
1 Compute the distance of the data point x to each data cluster C excluding each
dummy cluster, d(x, xC )
2 Let the k-nearest neighbor clusters of the data point vote for the target class of the
data point

During the training, the dummy cluster for a certain target class can be
removed if many data clusters have been generated for that target class.
Since the centroid of the dummy cluster for a target class is the mean vector
of all the training data points with the target class, it is likely that the dummy
cluster for the target class is the closest cluster to a data point. Removing the
dummy cluster for the target class eliminates this likelihood and stops the
creation of a new cluster for the data point because the dummy cluster for
the target class is the closest cluster to the data point.
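The training phase of Table 7.3 can be sketched as follows. This is a minimal illustration assuming the Euclidean distance; the data structures and the names `train` and `dummy_label` are ours, not from the text.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def train(points, labels, dummy_label):
    """Supervised clustering training (Steps 1-5 of Table 7.3)."""
    classes = sorted(set(labels))
    clusters = []
    # Step 1: one dummy cluster per target class, centered at the class mean,
    # labeled with a class (dummy_label) that no training point has.
    for c in classes:
        members = [p for p, y in zip(points, labels) if y == c]
        mean = [sum(col) / len(members) for col in zip(*members)]
        clusters.append({"label": dummy_label, "centroid": mean, "n": 0})
    # Steps 2-5: each point joins its nearest cluster if the classes match;
    # otherwise it starts a new single-point cluster.
    for p, y in zip(points, labels):
        nearest = min(clusters, key=lambda cl: euclidean(p, cl["centroid"]))
        if nearest["label"] == y:
            n = nearest["n"]
            nearest["centroid"] = [(n * c + v) / (n + 1)
                                   for c, v in zip(nearest["centroid"], p)]
            nearest["n"] = n + 1
        else:
            clusters.append({"label": y, "centroid": list(p), "n": 1})
    return [cl for cl in clusters if cl["label"] != dummy_label]

# The ten training records of Example 7.2 (nine single-fault records with
# y = 1 and one no-fault record with y = 0).
X = [[1,0,0,0,1,0,1,0,1], [0,1,0,1,0,0,0,1,0], [0,0,1,1,0,1,1,1,0],
     [0,0,0,1,0,0,0,1,0], [0,0,0,0,1,0,1,0,1], [0,0,0,0,0,1,1,0,0],
     [0,0,0,0,0,0,1,0,0], [0,0,0,0,0,0,0,1,0], [0,0,0,0,0,0,0,0,1],
     [0,0,0,0,0,0,0,0,0]]
y = [1] * 9 + [0]
result = train(X, y, dummy_label=2)
```

On these ten records the sketch produces eight nondummy clusters, two of size two and six singletons, matching the outcome worked out in Example 7.2 below.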

Example 7.2
Use the supervised clustering algorithm with the Euclidean distance
measure of dissimilarity and the 1-nearest neighbor classifier to clas-
sify whether or not a manufacturing system is faulty using the training
data set in Table 7.1 and the testing data set in Table 7.2. Both tables are
explained in Example 7.1.
In Step 1 of training, two dummy clusters C1 and C2 are set up for two
target classes, y = 1 and y = 0, respectively:
yC1 = 2 (indicating that C1 is a dummy cluster whose target class is
different from two target classes in the training and testing data sets)
yC2 = 2 (indicating that C2 is a dummy cluster)

1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 
 9 
 
0 + 1+ 0 + 0 + 0 + 0 + 0 + 0 + 0
 9 
 0 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + 0  0.11 
   
 9  0.11 
 0 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0  0.11 
   
 9  0.33 
1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 0   
xC1 =  = 0.22 
 9  0.22 
0 + 0 + 1+ 0 + 0 + 1+ 0 + 0 + 0   
 9  0.56 
  0.44 
1 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0   
 9  0.33 
0 + 1+ 1+ 1+ 0 + 0 + 0 + 1+ 0 
 
 9 
1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 
 
 9 

0
1
 
0
1
 0  0 
   
 1  0 
 0  0 
   
 1  0 
0
x C2 =   = 0 
1  
  0 
0  
 1  0 
 0  0 
   
 1  0 
0
 
1
0
1
 

nC1 = 9

nC2 = 1.

In Step 2 of training, the first data point x 1 in the training data set is
considered:

1 
0 
 
0 
 
0 
x1 = 1  y = 1.
 
0 
1 
 
0 
 
1 

In Step 3 of training, the Euclidean distance of x1 to each of the current clusters C1 and C2 is computed:

$$d(x_1, \bar{x}_{C_1}) = \sqrt{(1-0.11)^2 + (0-0.11)^2 + (0-0.11)^2 + (0-0.33)^2 + (1-0.22)^2 + (0-0.22)^2 + (1-0.56)^2 + (0-0.44)^2 + (1-0.33)^2} = 1.56$$

$$d(x_1, \bar{x}_{C_2}) = \sqrt{(1-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2 + (1-0)^2} = 2.$$

Since C1 is the closest cluster to x 1 and has a different target class from
that of x 1, Step 5 of training is executed to form a new data cluster C3
containing x 1:

$$y_{C_3} = 1, \qquad \bar{x}_{C_3} = (1, 0, 0, 0, 1, 0, 1, 0, 1)^T, \qquad n_{C_3} = 1.$$

Going back to Step 2 of training, the second data point x 2 in the training
data set is considered:

0 
 
1 
0 
 
1 
 
x2 = 0  y = 1.
 
0 
 
0 
 
1 
0 
 

In Step 3 of training, the Euclidean distance of x2 to each of the current clusters C1, C2, and C3 is computed:

$$d(x_2, \bar{x}_{C_1}) = \sqrt{(0-0.11)^2 + (1-0.11)^2 + (0-0.11)^2 + (1-0.33)^2 + (0-0.22)^2 + (0-0.22)^2 + (0-0.56)^2 + (1-0.44)^2 + (0-0.33)^2} = 1.44$$

$$d(x_2, \bar{x}_{C_2}) = \sqrt{(0-0)^2 + (1-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2} = 1.73$$

$$d(x_2, \bar{x}_{C_3}) = \sqrt{(0-1)^2 + (1-0)^2 + (0-0)^2 + (1-0)^2 + (0-1)^2 + (0-0)^2 + (0-1)^2 + (1-0)^2 + (0-1)^2} = 2.65.$$

Since C1 is the closest cluster to x 2 and has a different target class from
that of x 2, Step 5 of training is executed to form a new data cluster
C4 containing x 2:

$$y_{C_4} = 1, \qquad \bar{x}_{C_4} = (0, 1, 0, 1, 0, 0, 0, 1, 0)^T, \qquad n_{C_4} = 1.$$

Going back to Step 2 of training, the third data point x 3 in the training
data set is considered:

0 
0 
 
1 
 
1 
x3 = 0  y = 1.
 
1 
1 
 
1 
 
0 

In Step 3 of training, the Euclidean distance of x3 to each of the current clusters C1, C2, C3, and C4 is computed:

$$d(x_3, \bar{x}_{C_1}) = \sqrt{(0-0.11)^2 + (0-0.11)^2 + (1-0.11)^2 + (1-0.33)^2 + (0-0.22)^2 + (1-0.22)^2 + (1-0.56)^2 + (1-0.44)^2 + (0-0.33)^2} = 1.59$$

$$d(x_3, \bar{x}_{C_2}) = \sqrt{(0-0)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2 + (1-0)^2 + (0-0)^2} = 2.24$$

$$d(x_3, \bar{x}_{C_3}) = \sqrt{(0-1)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 + (1-0)^2 + (0-1)^2} = 2.65$$

$$d(x_3, \bar{x}_{C_4}) = \sqrt{(0-0)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2 + (1-1)^2 + (0-0)^2} = 2.$$


Since C1 is the closest cluster to x3 and has a different target class from that of x3, Step 5 of training is executed to form a new data cluster C5 containing x3:

$$y_{C_5} = 1, \qquad \bar{x}_{C_5} = (0, 0, 1, 1, 0, 1, 1, 1, 0)^T, \qquad n_{C_5} = 1.$$

Going back to Step 2 of training again, the fourth data point x 4 in the
training data set is considered:

0 
 
0 
 
0 
 
1 
 
x3 = 0  y = 1.
0 
 
0 
 
1 
 
0 

In Step 3 of training, the Euclidean distance of x4 to each of the current clusters C1, C2, C3, C4, and C5 is computed:

$$d(x_4, \bar{x}_{C_1}) = \sqrt{(0-0.11)^2 + (0-0.11)^2 + (0-0.11)^2 + (1-0.33)^2 + (0-0.22)^2 + (0-0.22)^2 + (0-0.56)^2 + (1-0.44)^2 + (0-0.33)^2} = 1.14$$

$$d(x_4, \bar{x}_{C_2}) = \sqrt{(0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2} = 1.41$$

$$d(x_4, \bar{x}_{C_3}) = \sqrt{(0-1)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-1)^2 + (0-0)^2 + (0-1)^2 + (1-0)^2 + (0-1)^2} = 2.45$$

$$d(x_4, \bar{x}_{C_4}) = \sqrt{(0-0)^2 + (0-1)^2 + (0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-1)^2 + (0-0)^2} = 1$$

$$d(x_4, \bar{x}_{C_5}) = \sqrt{(0-0)^2 + (0-0)^2 + (0-1)^2 + (1-1)^2 + (0-0)^2 + (0-1)^2 + (0-1)^2 + (1-1)^2 + (0-0)^2} = 1.73.$$

Since C4 is the closest cluster to x4 and has the same target class as that of x4, Step 4 of training is executed to add x4 into the cluster C4, which is updated next:

$$y_{C_4} = 1, \qquad \bar{x}_{C_4} = \left(\frac{0+0}{2}, \frac{1+0}{2}, \frac{0+0}{2}, \frac{1+1}{2}, \frac{0+0}{2}, \frac{0+0}{2}, \frac{0+0}{2}, \frac{1+1}{2}, \frac{0+0}{2}\right)^T = (0, 0.5, 0, 1, 0, 0, 0, 1, 0)^T, \qquad n_{C_4} = 2.$$

The training continues with the remaining data points x5, x6, x7, x8, x9, and x10 and produces the final clusters C1, C2, C3 = {x1, x5}, C4 = {x2, x4}, C5 = {x3}, C6 = {x6}, C7 = {x7}, C8 = {x8}, C9 = {x9}, and C10 = {x10}:

$$y_{C_1} = 2, \quad \bar{x}_{C_1} = (0.11, 0.11, 0.11, 0.33, 0.22, 0.22, 0.56, 0.44, 0.33)^T, \quad n_{C_1} = 9$$

$$y_{C_2} = 2, \quad \bar{x}_{C_2} = (0, 0, 0, 0, 0, 0, 0, 0, 0)^T, \quad n_{C_2} = 1$$

$$y_{C_3} = 1, \quad \bar{x}_{C_3} = (0.5, 0, 0, 0, 1, 0, 1, 0, 1)^T, \quad n_{C_3} = 2$$

$$y_{C_4} = 1, \quad \bar{x}_{C_4} = (0, 0.5, 0, 1, 0, 0, 0, 1, 0)^T, \quad n_{C_4} = 2$$

$$y_{C_5} = 1, \quad \bar{x}_{C_5} = (0, 0, 1, 1, 0, 1, 1, 1, 0)^T, \quad n_{C_5} = 1$$

$$y_{C_6} = 1, \quad \bar{x}_{C_6} = (0, 0, 0, 0, 0, 1, 1, 0, 0)^T, \quad n_{C_6} = 1$$

$$y_{C_7} = 1, \quad \bar{x}_{C_7} = (0, 0, 0, 0, 0, 0, 1, 0, 0)^T, \quad n_{C_7} = 1$$

$$y_{C_8} = 1, \quad \bar{x}_{C_8} = (0, 0, 0, 0, 0, 0, 0, 1, 0)^T, \quad n_{C_8} = 1$$

$$y_{C_9} = 1, \quad \bar{x}_{C_9} = (0, 0, 0, 0, 0, 0, 0, 0, 1)^T, \quad n_{C_9} = 1$$

$$y_{C_{10}} = 0, \quad \bar{x}_{C_{10}} = (0, 0, 0, 0, 0, 0, 0, 0, 0)^T, \quad n_{C_{10}} = 1.$$
In the testing, the first data point in the testing data set,

$$x = (1, 1, 0, 1, 1, 0, 1, 1, 1)^T,$$

has the Euclidean distances of 1.80, 2.06, 2.45, 2.65, 2.45, 2.45, 2.45, and 2.65 to the nondummy clusters C3, C4, C5, C6, C7, C8, C9, and C10, respectively.

Hence, the cluster C3 is the nearest neighbor to x, and the target class of x is
assigned to be 1. The closest clusters to the remaining data points 2–16 in
the testing data set are C5, C3, C3, C3, C5, C4, C3/C5, C4, C3/C6/C10, C5, C3, C5, C5,
C5, and C3. For data point 8, there is a tie between C3 and C5 for the closest
cluster. Since both C3 and C5 have the target class of 1, the target class of 1
is assigned to data point 8. For data point 10, there also a tie among C3, C6,
and C10 for the closest cluster. Since the majority (two clusters C3 and C6)
of the three clusters tied have the target class of 1, the target class of 1 is
assigned to data point 10. Hence, all the data points in the testing data set
are assigned to the target class of 1 and are correctly classified as shown in
Table 7.2.
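The testing phase can be replayed numerically. The sketch below applies a 1-nearest neighbor vote over the nondummy cluster centroids obtained from training (the C3 centroid used here is the mean of {x1, x5}); all names are illustrative, not from the text.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

# Nondummy clusters from the training phase: (target class, centroid).
clusters = {
    "C3":  (1, [0.5, 0, 0, 0, 1, 0, 1, 0, 1]),   # mean of {x1, x5}
    "C4":  (1, [0, 0.5, 0, 1, 0, 0, 0, 1, 0]),   # mean of {x2, x4}
    "C5":  (1, [0, 0, 1, 1, 0, 1, 1, 1, 0]),
    "C6":  (1, [0, 0, 0, 0, 0, 1, 1, 0, 0]),
    "C7":  (1, [0, 0, 0, 0, 0, 0, 1, 0, 0]),
    "C8":  (1, [0, 0, 0, 0, 0, 0, 0, 1, 0]),
    "C9":  (1, [0, 0, 0, 0, 0, 0, 0, 0, 1]),
    "C10": (0, [0, 0, 0, 0, 0, 0, 0, 0, 0]),
}

x = [1, 1, 0, 1, 1, 0, 1, 1, 1]   # first testing data point
# 1-nearest neighbor: the single closest cluster determines the target class.
name, (label, _) = min(clusters.items(),
                       key=lambda kv: euclidean(x, kv[1][1]))
# name == "C3", so the vote assigns target class 1
```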

7.3  Software and Applications


A k-nearest neighbor classifier and the supervised clustering algorithm can
be easily implemented using computer programs. The application of the
supervised clustering algorithm to cyber attack detection is reported in Li and Ye (2002, 2005, 2006), Ye (2008), and Ye and Li (2002).

Exercises
7.1 In the space shuttle O-ring data set in Table 1.2, the target variable, the
Number of O-rings with Stress, has three values: 0, 1, and 2. Consider
these three values as categorical values, Launch-Temperature and
Leak-Check Pressure as the attribute variables, instances # 13–23 as the
training data, instances # 1–12 as the testing data, and the Euclidean
distance as the measure of dissimilarity. Construct a 1-nearest neigh-
bor classifier and a 3-nearest neighbor classifier, and test and compare
their classification performance.
7.2 Repeat Exercise 7.1 using the normalized attribute variables from the
normalization method in Equation 7.3.
7.3 Repeat Exercise 7.1 using the normalized attribute variables from the
normalization method in Equation 7.4.
7.4 Using the same training and testing data sets in Exercise 7.1 and the cosine
similarity measure, construct a 1-nearest neighbor classifier and a 3-nearest
neighbor classifier, and test and compare their classification performance.
7.5 Using the same training and testing data sets in Exercise 7.1, the supervised
clustering algorithm, and the Euclidean distance measure of dissimilarity,
construct a 1-nearest neighbor cluster classifier and a 3-nearest neighbor
cluster classifier, and test and compare their classification performance.

7.6 Repeat Exercise 7.5 using the normalized attribute variables from the
normalization method in Equation 7.3.
7.7 Repeat Exercise 7.5 using the normalized attribute variables from
the normalization method in Equation 7.4.
7.8 Using the same training and testing data sets in Exercise 7.1, the super-
vised clustering algorithm, and the cosine similarity measure, construct
a 1-nearest neighbor cluster classifier and a 3-nearest neighbor cluster
classifier, and test and compare their classification performance.
Part III

Algorithms for Mining Cluster and Association Patterns
8
Hierarchical Clustering

Hierarchical clustering produces groups of similar data points at different levels of similarity. This chapter introduces a bottom-up procedure of hierarchical clustering, called agglomerative hierarchical clustering. A list of software packages that support hierarchical clustering is provided. Some applications of hierarchical clustering are given with references.

8.1  Procedure of Agglomerative Hierarchical Clustering


Given a number of data records in a data set, the agglomerative hierarchical clustering algorithm produces clusters of similar data records in the following steps:

1. Start with clusters, each of which has one data record.
2. Merge the two closest clusters to form a new cluster that replaces the two original clusters and contains data records from the two original clusters.
3. Repeat Step 2 until there is only one cluster left that contains all the data records.

The next section gives several methods of determining the two closest
clusters in Step 2.
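These steps can be sketched generically, with the cluster-distance method of Step 2 left pluggable. This is an illustrative sketch under our own naming (`agglomerate`, `single_linkage`); it is not code from the text.

```python
import itertools
import math

def single_linkage(A, B):
    # Cluster distance = smallest point-to-point distance (Section 8.2).
    return min(math.dist(a, b) for a in A for b in B)

def agglomerate(points, linkage=single_linkage):
    """Return (merge distance, merged cluster size) for each merge."""
    clusters = [[p] for p in points]            # Step 1: one record per cluster
    merges = []
    while len(clusters) > 1:                    # Step 3: repeat until one cluster
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        d = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]      # Step 2: merge the two closest
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((d, len(merged)))
    return merges

# Three 1-D points: the two closest merge first, then the remaining pair.
merges = agglomerate([(0.0,), (1.0,), (10.0,)])
```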

8.2  Methods of Determining the Distance between Two Clusters
In order to determine the two closest clusters in Step 2, we need a method to
compute the distance between two clusters. There are a number of methods for
determining the distance between two clusters. This section describes four meth-
ods: average linkage, single linkage, complete linkage, and centroid method.
In the average linkage method, the distance of two clusters (cluster K, CK,
and cluster L, CL ), DK,L, is the average of distances between pairs of data


records, and each pair has one data record from Cluster K and another data
record from Cluster L, as follows:

$$D_{K,L} = \sum_{x_K \in C_K} \sum_{x_L \in C_L} \frac{d(x_K, x_L)}{n_K n_L} \quad (8.1)$$

$$x_K = \begin{bmatrix} x_{K,1} \\ \vdots \\ x_{K,p} \end{bmatrix}, \qquad x_L = \begin{bmatrix} x_{L,1} \\ \vdots \\ x_{L,p} \end{bmatrix},$$

where
xK denotes a data record in CK
x L denotes a data record in CL
nK denotes the number of data records in CK
nL denotes the number of data records in CL
d(xK, x L) is the distance of two data records that can be computed using the
following Euclidean distance:
$$d(x_K, x_L) = \sqrt{\sum_{i=1}^{p} (x_{K,i} - x_{L,i})^2} \quad (8.2)$$
or some other dissimilarity measures of two data points that are described in
Chapter 7. As described in Chapter 7, the normalization of the variables,
x1, …, xp, may be necessary before using a measure of dissimilarity or simi-
larity to compute the distance of two data records.

Example 8.1
Compute the distance of the following two clusters using the average
linkage method and the squared Euclidean distance of data points:

CK = { x1 , x2 , x3 }

CL = { x 4 , x 5 }

 1 0  0 0  0 
0 0  0 0  0 
0 0  0 0  0 
         
0 0  0 0  0 
x1 =  1 x 2 =  1 x3 = 0 x 4 = 0  x5 = 0  .
0 0  0  1 0 
         
 1  1 0  1  1
0 0  0 0  0 
 1  1  1 0  0 

There are six pairs of data records between CK and CL: (x1, x4), (x1, x5), (x2, x4), (x2, x5), (x3, x4), (x3, x5), and their squared Euclidean distances are computed as

$$d(x_1, x_4) = \sum_{i=1}^{9} (x_{1,i} - x_{4,i})^2 = (1-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-1)^2 + (1-1)^2 + (0-0)^2 + (1-0)^2 = 4$$

$$d(x_1, x_5) = \sum_{i=1}^{9} (x_{1,i} - x_{5,i})^2 = (1-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2 + (1-1)^2 + (0-0)^2 + (1-0)^2 = 3$$

$$d(x_2, x_4) = \sum_{i=1}^{9} (x_{2,i} - x_{4,i})^2 = (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-1)^2 + (1-1)^2 + (0-0)^2 + (1-0)^2 = 3$$

$$d(x_2, x_5) = \sum_{i=1}^{9} (x_{2,i} - x_{5,i})^2 = (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-0)^2 + (1-1)^2 + (0-0)^2 + (1-0)^2 = 2$$

$$d(x_3, x_4) = \sum_{i=1}^{9} (x_{3,i} - x_{4,i})^2 = (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + (0-1)^2 + (0-0)^2 + (1-0)^2 = 3$$

$$d(x_3, x_5) = \sum_{i=1}^{9} (x_{3,i} - x_{5,i})^2 = (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + (0-0)^2 + (1-0)^2 = 2$$

$$D_{K,L} = \sum_{x_K \in C_K} \sum_{x_L \in C_L} \frac{d(x_K, x_L)}{n_K n_L} = \frac{4}{3 \times 2} + \frac{3}{3 \times 2} + \frac{3}{3 \times 2} + \frac{2}{3 \times 2} + \frac{3}{3 \times 2} + \frac{2}{3 \times 2} = 2.8333$$

In the single linkage method, the distance between two clusters is the min-
imum distance between a data record in one cluster and a data record in the
other cluster:


$$D_{K,L} = \min \{ d(x_K, x_L) : x_K \in C_K,\ x_L \in C_L \}. \quad (8.3)$$

Using the single linkage method, the distance of clusters CK and CL in Example
8.1 is computed as

$$D_{K,L} = \min \{ d(x_1, x_4), d(x_1, x_5), d(x_2, x_4), d(x_2, x_5), d(x_3, x_4), d(x_3, x_5) \} = \min \{4, 3, 3, 2, 3, 2\} = 2.$$

In the complete linkage method, the distance between two clusters is the
maximum distance between a data record in one cluster and a data record
in the other cluster:


$$D_{K,L} = \max \{ d(x_K, x_L) : x_K \in C_K,\ x_L \in C_L \}. \quad (8.4)$$

Using the complete linkage method, the distance of clusters CK and CL in Example 8.1 is computed as

$$D_{K,L} = \max \{ d(x_1, x_4), d(x_1, x_5), d(x_2, x_4), d(x_2, x_5), d(x_3, x_4), d(x_3, x_5) \} = \max \{4, 3, 3, 2, 3, 2\} = 4.$$
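Both the single-linkage and complete-linkage values for Example 8.1 can be checked with the same pairwise distances (an illustrative sketch):

```python
# Single linkage (Equation 8.3) takes the minimum pairwise distance; complete
# linkage (Equation 8.4) takes the maximum. Squared Euclidean distance is used,
# as in Example 8.1.
def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

CK = [[1,0,0,0,1,0,1,0,1], [0,0,0,0,1,0,1,0,1], [0,0,0,0,0,0,0,0,1]]
CL = [[0,0,0,0,0,1,1,0,0], [0,0,0,0,0,0,1,0,0]]

pairwise = [sqdist(a, b) for a in CK for b in CL]   # [4, 3, 3, 2, 3, 2]
D_single, D_complete = min(pairwise), max(pairwise)
```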

In the centroid method, the distance between two clusters is the distance
between the centroids of clusters, and the centroid of a cluster is computed
using the mean vector of all data records in the cluster, as follows:


$$D_{K,L} = d(\bar{x}_K, \bar{x}_L) \quad (8.5)$$

$$\bar{x}_K = \begin{bmatrix} \frac{1}{n_K}\sum_{k=1}^{n_K} x_{k,1} \\ \vdots \\ \frac{1}{n_K}\sum_{k=1}^{n_K} x_{k,p} \end{bmatrix}, \qquad \bar{x}_L = \begin{bmatrix} \frac{1}{n_L}\sum_{l=1}^{n_L} x_{l,1} \\ \vdots \\ \frac{1}{n_L}\sum_{l=1}^{n_L} x_{l,p} \end{bmatrix}. \quad (8.6)$$


Using the centroid linkage method and the squared Euclidean distance of data points, the distance of clusters CK and CL in Example 8.1 is computed as

$$\bar{x}_K = \left( \frac{1+0+0}{3}, \frac{0+0+0}{3}, \frac{0+0+0}{3}, \frac{0+0+0}{3}, \frac{1+1+0}{3}, \frac{0+0+0}{3}, \frac{1+1+0}{3}, \frac{0+0+0}{3}, \frac{1+1+1}{3} \right)^T = \left( \frac{1}{3}, 0, 0, 0, \frac{2}{3}, 0, \frac{2}{3}, 0, 1 \right)^T$$

$$\bar{x}_L = \left( \frac{0+0}{2}, \frac{0+0}{2}, \frac{0+0}{2}, \frac{0+0}{2}, \frac{0+0}{2}, \frac{1+0}{2}, \frac{1+1}{2}, \frac{0+0}{2}, \frac{0+0}{2} \right)^T = \left( 0, 0, 0, 0, 0, \frac{1}{2}, 1, 0, 0 \right)^T$$

$$D_{K,L} = d(\bar{x}_K, \bar{x}_L) = \left( \frac{1}{3} - 0 \right)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + \left( \frac{2}{3} - 0 \right)^2 + \left( 0 - \frac{1}{2} \right)^2 + \left( \frac{2}{3} - 1 \right)^2 + (0-0)^2 + (1-0)^2 = 1.9167.$$

Various methods of determining the distance between two clusters have differ-
ent computational costs and may produce different clustering results. For exam-
ple, the average linkage method, the single linkage method, and the complete
linkage method require the computation of the distance between every pair of
data points from two clusters. Although the centroid method does not have such
a computation requirement, the centroid method must compute the centroid of
every new cluster and the distance of the new cluster with existing clusters. The
average linkage method and the centroid method take into account and control
the dispersion of data points in each cluster, whereas the single linkage method
and the complete linkage method place no constraint on the shape of the cluster.

8.3  Illustration of the Hierarchical Clustering Procedure


The hierarchical clustering procedure is illustrated in Example 8.2.

Example 8.2
Produce a hierarchical clustering of the data for system fault detection in
Table 8.1 using the single linkage method.

Table 8.1
Data Set for System Fault Detection with Nine Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
4 (M4) 0 0 0 1 0 0 0 1 0
5 (M5) 0 0 0 0 1 0 1 0 1
6 (M6) 0 0 0 0 0 1 1 0 0
7 (M7) 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1

Table 8.1 contains the data set for system fault detection, including
nine instances of single-machine faults. Only the nine attribute variables
about the quality of parts are used in the hierarchical clustering. The
nine data records in the data set are

 1 0  0  0  0  0 
           
0   1 0  0  0  0 
0  0   1 0  0  0 
           
0   1  1  1 0  0 
x1 =  1 x2 = 0  x3 = 0  x 4 = 0  x 5 =  1 x6 =  0 
           
0  0   1 0  0   1
 1 0   1 0   1  
           1
0   1  1  1 0  0 
           
 1 0  0  0   1 0 
0 0  0 
     
0 0  0 
0 0  0 
     
0 0  0 
x7 =  0  x8 =  0  x9 =  0  .
     
0  0  0 
 1 0  0 
     
0   1 0 
     
0  0   1

The clustering results will show which single-machine faults have simi-
lar symptoms of the part quality problem.
Figure 8.1 shows the hierarchical clustering procedure that starts with
the following nine clusters with one data record in each cluster:

C1 = { x1 } C2 = { x2 } C3 = { x3 } C4 = { x 4 } C5 = { x5 }

C6 = { x6 } C7 = { x7 } C8 = { x8 } C9 = { x9 } .

Figure 8.1
Result of hierarchical clustering for the data set of system fault detection (a dendrogram with the merging distance on the vertical axis and leaves ordered C1, C5, C6, C7, C9, C2, C4, C8, C3).

Table 8.2
The Distance for Each Pair of Clusters: C1, C2, C3, C4, C5, C6, C7, C8, and C9
C1 = C2 = C3 = C4 = C5 = C6 = C7 = C8 = C9 =
{x1} {x2} {x3} {x4} {x5} {x6} {x7} {x8} {x9}
C1 = {x1} 7 7 6 1 4 3 5 3
C2 = {x2} 4 1 6 5 4 2 4
C3 = {x3} 3 6 3 4 4 6
C4 = {x4} 5 4 4 1 3
C5 = {x5} 3 2 4 2
C6 = {x6} 1 3 3
C7 = {x7} 2 2
C8 = {x8} 2
C9 = {x9}

Since each cluster has only one data record, the distance between two
clusters is the distance between two data records in two clusters, respec-
tively. Table 8.2 gives the distance for each pair of data records, which
also gives the distance for each pair of clusters.
There are four pairs of clusters that produce the smallest distance
of 1: (C1, C 5), (C2, C 4), (C 4, C 8), and (C 6, C7). We merge (C1, C 5) to form a
new cluster C1,5, and merge (C6, C7) to form a new cluster C6,7 . Since the
cluster C 4 is involved in two pairs of clusters (C2, C 4) and (C 4, C 8), we
can merge only one pair of clusters. We arbitrarily choose to merge
(C2, C4) to form a new cluster C2,4. Figure 8.1 shows these new clusters in the new set of clusters: C1,5, C2,4, C3, C6,7, C8, and C9.
Table 8.3 gives the distance for each pair of the clusters, C1,5, C2,4, C3,
C6,7, C8, and C9, using the single linkage method. For example, there are
four pairs of data records between C1,5 and C2,4: (x 1, x 2), (x 1, x 4), (x 5, x 2), and
(x5, x4 ), with their distance being 7, 6, 6, and 5, respectively, from Table 8.2.
Hence, the minimum distance is 5, which is taken as the distance of

Table 8.3
Distance for Each Pair of Clusters: C1,5, C2,4, C3, C6,7, C8, and C9
C1,5 = C2,4 = C6,7 =
{x1, x5} {x2, x4} C3 = {x3} {x6, x7} C8 = {x8} C9 = {x9}
C1,5 = {x1, x5} 5 = min 6 = min 2 = min 4 = min 2 = min
{7, 6, 6, 5} {7, 6} {4, 3, 3, 2} {5, 4} {3, 2}
C2,4 = {x2, x4} 3 = min 4 = min 1 = min 3 = min
{4, 3} {5, 4, 4, 4} {2, 1} {4, 3}
C3 = {x3} 3 = min 4 = min 6 = min
{3, 4} {4} {6}
C6,7 = {x6, x7} 2 = min 2 = min
{3, 2} {3, 2}
C8 = {x8} 2 = min
{2}
C9 = {x9}

Table 8.4
Distance for Each Pair of Clusters: C1,5, C2,4,8, C3, C6,7, and C9
C1,5 = {x1, x5} C2,4,8 = {x2, x4, x8} C3 = {x3} C6,7 = {x6, x7} C9 = {x9}
C1,5 = {x1, x5} 4 = min 6 = min 2 = min 2 = min
{7, 6, 5, 6, 5, 4} {7, 6} {4, 3, 3, 2} {3, 2}
C2,4,8 = {x2, x4, x8} 3 = min 2 = min 3 = min
{4, 3, 4} {5, 4, 4, 4, 3, 2} {4, 3, 2}
C3 = {x3} 3 = min 6 = min
{3, 4} {6}
C6,7 = {x6, x7} 2 = min
{3, 2}
C9 = {x9}

C1,5 and C2,4. The closest pair of clusters is (C2,4, C8) with the distance of 1.
Merging clusters C2,4 and C8 produces a new cluster C2,4,8. We have a new
set of clusters, C1,5, C2,4,8, C3, C6,7, and C9.
Table 8.4 gives the distance for each pair of the clusters, C1,5, C2,4,8, C3,
C6,7, and C9, using the single linkage method. Four pairs of clusters, (C1,5,
C6,7), (C1,5, C9), (C2,4,8, C6,7), and (C6,7, C9), produce the smallest distance
of 2. Since three clusters, C1,5, C6,7, and C9, have the same distance from
one another, we merge the three clusters together to form a new cluster,
C1,5,6,7,9. C6,7 is not merged with C2,4,8 since C6,7 is merged with C1,5 and C9.
We have a new set of clusters, C1,5,6,7,9, C2,4,8, and C3.
Table 8.5 gives the distance for each pair of the clusters, C1,5,6,7,9, C2,4,8, and
C3, using the single linkage method. The pair of clusters, (C1,5,6,7,9, C2,4,8),
produces the smallest distance of 2. Merging the clusters, C1,5,6,7,9 and C2,4,8,
forms a new cluster, C1,2,4,5,6,7,8,9. We have a new set of clusters, C1,2,4,5,6,7,8,9 and
C3, which have the distance of 3 and are merged into one cluster, C1,2,3,4,5,6,7,8,9.
Figure 8.1 also shows the merging distance, which is the distance of
two clusters when they are merged together. The hierarchical clustering
tree shown in Figure 8.1 is called the dendrogram.
Hierarchical clustering allows us to obtain different sets of clusters
by setting different thresholds of the merging distance threshold for
different levels of data similarity. For example, if we set the threshold
of the merging distance to 1.5 as shown by the dash line in Figure 8.1,
we obtain the clusters, C1,5, C 6,7, C9, C2,4,8, and C 3, which are considered
as the clusters of similar data because each cluster’s merging distance

Table 8.5
Distance for Each Pair of Clusters: C1,5,6,7,9, C2,4,8, and C3
C1,5,6,7,9 = {x1, x5,
x6, x7, x9} C2,4,8 = {x2, x4, x8} C3 = {x3}
C1,5,6,7,9 = {x1, x5, 2 = min{7, 6, 5, 6, 5, 4, 3 = min{7, 6, 3, 4, 6}
x6, x7, x9} 5, 4, 3, 4, 4, 2, 4, 3, 2}
C2,4,8 = {x2, x4, x8} 3 = min{4, 3, 4}
C3 = {x3}

is smaller than or equal to the threshold of 1.5. This set of clusters


indicates which machine faults produce similar symptoms of the part
quality problem. For instance, the cluster C1,5 indicates that the M1
fault and the M5 fault produce similar symptoms of the part quality
problem. The production flow of parts in Figure 1.1 shows that parts
pass through M1 and M5 consecutively and thus explains why the
M1 fault and M5 fault produce similar symptoms of the part qual-
ity problem. Hence, the clusters obtained by setting the threshold of
the merging distance to 1.5 give a meaningful clustering result that
reveals the inherent structure of the system. If we set the threshold of
the merging distance to 2.5 as shown by another dash line in Figure 8.1,
we obtain the set of clusters, C1,2,4,5,6,7,8,9 and C 3, which is not as useful
as the set of clusters, C1,5, C 6,7, C9, C2,4,8, and C 3, for revealing the system
structure.
This example shows that obtaining a data mining result is not the end
of data mining. It is crucial that we can explain the data mining result in
a meaningful way in the problem context to make the data mining result
useful in the problem domain. Many real-world data sets do not come
with prior knowledge of a system generating such data sets. Therefore,
after obtaining the hierarchical clustering result, it is important to
examine different sets of clusters at different levels of data similarity
and determine which set of clusters can be interpreted in a meaningful
manner to help reveal the system and generate useful knowledge about
the system.
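For the single linkage method, cutting the dendrogram at a merging-distance threshold t gives exactly the connected components of the graph that joins records whose distance is at most t. The sketch below checks the threshold-1.5 clusters of Example 8.2; it uses squared Euclidean distances as in the example, and the name `cut_clusters` is ours.

```python
# Single-linkage clusters at a merging-distance threshold, computed as the
# connected components of the "distance <= t" graph, using union-find.
def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

X = [[1,0,0,0,1,0,1,0,1], [0,1,0,1,0,0,0,1,0], [0,0,1,1,0,1,1,1,0],
     [0,0,0,1,0,0,0,1,0], [0,0,0,0,1,0,1,0,1], [0,0,0,0,0,1,1,0,0],
     [0,0,0,0,0,0,1,0,0], [0,0,0,0,0,0,0,1,0], [0,0,0,0,0,0,0,0,1]]

def cut_clusters(points, t):
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    # Union every pair of records within the threshold distance.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if sqdist(points[i], points[j]) <= t:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), set()).add(i + 1)   # 1-based record index
    return sorted(groups.values(), key=min)

clusters = cut_clusters(X, 1.5)
# -> [{1, 5}, {2, 4, 8}, {3}, {6, 7}, {9}], the clusters C1,5, C2,4,8, C3,
#    C6,7, and C9 obtained at the threshold of 1.5 in Example 8.2
```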

8.4  Nonmonotonic Tree of Hierarchical Clustering


In Figure 8.1, the merging distance of a new cluster is not smaller than the
merging distance of any cluster that was formed before the new cluster. Such
a hierarchical clustering tree is monotonic. For example, in Figure 8.1, the
merging distance of the cluster C2,4 is 1, which is equal to the merging dis-
tance of C2,4,8, and the merging distance of C1,2,4,5,6,7,8,9 is 2, which is smaller
than the merging distance of C2,4,8.
The centroid linkage method can produce a nonmonotonic tree in which
the merging distance for a new cluster can be smaller than the merging
distance for a cluster that is formed before the new cluster. Figure 8.2 shows
three data points, x1, x2, and x3, for which the centroid method produces a
nonmonotonic tree of hierarchical clustering. The distance between each
pair of the three data points is 2. We start with three initial clusters, C1, C2,
and C3, containing the three data points, x1, x2, and x3, respectively. Because
the three clusters have the same distance between each other, we arbitrarily
choose to merge C1 and C2 into a new cluster C1,2. As shown in Figure 8.2,
the distance between the centroid of C1,2 and x3 is $\sqrt{2^2 - 1^2} = \sqrt{3} \approx 1.73$, which is
smaller than the merging distance of 2 for C1,2. Hence, when C1,2 is merged

Figure 8.2
An example of three data points, x1, x2, and x3 (with x1 and x2 merged into the cluster C1,2), for which the centroid linkage method produces a nonmonotonic tree of hierarchical clustering.

Figure 8.3
Nonmonotonic tree of hierarchical clustering for the data points in Figure 8.2, with the merging distance on the vertical axis.

with C3 next to produce a new cluster C1,2,3, the merging distance of 1.73 for
C1,2,3 is smaller than the merging distance of 2 for C1,2. Figure 8.3 shows the
non-monotonic tree of hierarchical clustering for these three data points
using the centroid method.
The single linkage method, which is used in Example 8.2, computes the
distance between two clusters using the smallest distance between two data
points, one data point in one cluster, and another data point in the other
cluster. The smallest distance between two data points is used to form a new
cluster. The distance used to form a cluster earlier cannot be used again to
form a new cluster later, because the distance is already inside a cluster and
a distance with a data point outside a cluster is needed to form a new clus-
ter later. Hence, the distance to form a new cluster later must come from a
distance not used before, which must be greater than or equal to a distance
selected and used earlier. Hence, the hierarchical clustering tree from the
single linkage method is always monotonic.
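The nonmonotonic centroid-method merge of Figure 8.2 can be checked with three concrete points at mutual Euclidean distance 2 (an illustrative sketch; the coordinates are ours):

```python
import math

# Three points forming an equilateral triangle with side 2.
x1, x2, x3 = (0.0, 0.0), (2.0, 0.0), (1.0, math.sqrt(3.0))

d12 = math.dist(x1, x2)                            # first merge at distance 2
c12 = ((x1[0] + x2[0]) / 2, (x1[1] + x2[1]) / 2)   # centroid of C1,2
d_c12_x3 = math.dist(c12, x3)                      # next merge at sqrt(3) ~ 1.73
# The second merging distance is smaller than the first: the tree is nonmonotonic.
```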

8.5  Software and Applications


Hierarchical clustering is supported by many statistical software packages,
including:

• SAS (www.sas.com)
• SPSS (www.spss.com)
• Statistica (www.statistica.com)
• MATLAB® (www.matworks.com)

Some applications of hierarchical clustering can be found in (Ye, 1997, 2003,
Chapter 10; Ye and Salvendy, 1991, 1994; Ye and Zhao, 1996). In the work by Ye
and Salvendy (1994), hierarchical clustering is used to reveal the knowl-
edge structure of C programming from expert programmers and novice
programmers.

Exercises
8.1 Produce a hierarchical clustering of 23 data points in the space shut-
tle O-ring data set in Table 1.2. Use Launch-Temperature and Leak-
Check Pressure as the attribute variables, the normalization method
in Equation 7.4 to obtain the normalized Launch-Temperature and
Leak-Check Pressure, the Euclidean distance of data points, and the
single linkage method.
8.2 Repeat Exercise 8.1 using the complete linkage method.
8.3 Repeat Exercise 8.1 using the cosine similarity measure to compute the
distance of data points.
8.4 Repeat Exercise 8.3 using the complete linkage method.
8.5 Discuss whether or not it is possible for the complete linkage method to
produce a nonmonotonic tree of hierarchical clustering.
8.6 Discuss whether or not it is possible for the average linkage method to
produce a nonmonotonic tree of hierarchical clustering.
9
K-Means Clustering and
Density-Based Clustering

This chapter introduces K-means and density-based clustering algorithms


that produce nonhierarchical groups of similar data points based on the cen-
troid and density of a cluster, respectively. A list of software packages that
support these clustering algorithms is provided. Some applications of these
clustering algorithms are given with references.

9.1  K-Means Clustering


Table 9.1 lists the steps of the K-means clustering algorithm. The K-means
clustering algorithm starts with a given K value and the initially assigned
centroids of the K clusters. The algorithm proceeds by having each of n data
points in the data set join its closest cluster and updating the centroids of the
clusters until the centroids of the clusters do not change any more and con-
sequently each data point does not move from its current cluster to another
cluster. In Step 7 of the algorithm, if there is any change of cluster centroids
in Steps 3–6, we have to check if the change of cluster centroids causes the
further movement of any data point by going back to Step 2.
To determine the closest cluster to a data point, the distance of a data point
to a data cluster needs to be computed. The mean vector of data points in a
cluster is often used as the centroid of the cluster. Using a measure of simi-
larity or dissimilarity, we compute the distance of a data point to the centroid
of the cluster as the distance of a data point to the cluster. Measures of simi-
larity or dissimilarity are described in Chapter 7.
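As a small illustration of this computation, the sketch below (function names are illustrative, not from the book) takes the mean vector of a cluster's data points as the cluster centroid and returns the Euclidean distance of a data point to that centroid as the distance of the data point to the cluster:

```python
import math

def distance_to_cluster(x, cluster_points):
    # The centroid is the mean vector of the data points in the cluster.
    centroid = [sum(col) / len(col) for col in zip(*cluster_points)]
    # The distance of x to the cluster is its Euclidean distance
    # to the cluster centroid.
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centroid)))

d = distance_to_cluster([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])  # centroid (2, 0)
```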
One method of assigning the initial centroids of the K clusters is to ran-
domly select K data points from the data set and use these data points to set
up the centroids of the K clusters. Although this method uses specific data
points to set up the initial centroids of the K clusters, the K clusters have no
data point in each of them initially. There are also other methods of setting
up the initial centroids of the K clusters, such as using the result of a hier-
archical clustering to obtain the K clusters and using the centroids of these
clusters as the initial centroids of the K clusters for the K-means clustering
algorithm.


TABLE 9.1
K-Means Clustering Algorithm
Step Description
1 Set up the initial centroids of the K clusters
2 REPEAT
3 FOR i = 1 to n
4 Compute the distance of the data point xi to each of the K clusters using
a measure of similarity or dissimilarity
5 IF xi is not in any cluster or its closest cluster is not its current cluster
6 Move xi to its closest cluster and update the centroid of the cluster
7 UNTIL no change of cluster centroids occurs in Steps 3–6

For a large data set, the stopping criterion for the REPEAT-UNTIL loop in
Step 7 of the algorithm can be relaxed so that the REPEAT-UNTIL loop stops
when the amount of changes to the cluster centroids is less than a threshold,
e.g., less than 5% of the data points changing their clusters.
The K-means clustering algorithm minimizes the following sum of squared
errors (SSE) or distances between data points and their cluster centroids (Ye,
2003, Chapter 10):

SSE = ∑(i = 1 to K) ∑(x ∈ Ci) d(x, x̄Ci)².  (9.1)

In Equation 9.1, the mean vector of data points in the cluster Ci is used as the
cluster centroid to compute the distance between a data point in the cluster
Ci and the centroid of the cluster Ci.
Since K-means clustering depends on the parameter K, knowledge in
the application domain may help the selection of an appropriate K value
to produce a K-means clustering result that is meaningful in the appli-
cation domain. Different K-means clustering results using different K
values may be obtained so that different results can be compared and
evaluated.
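The SSE of Equation 9.1 is straightforward to compute once the clusters are fixed. The sketch below (illustrative, not from the book) sums the squared Euclidean distances between each data point and the mean vector of its cluster:

```python
def sse(clusters):
    # clusters: list of clusters, each a list of data points (lists of numbers).
    total = 0.0
    for cluster in clusters:
        # Cluster centroid = mean vector of the cluster's data points.
        centroid = [sum(col) / len(col) for col in zip(*cluster)]
        for x in cluster:
            # Squared Euclidean distance of x to its cluster centroid.
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
    return total

value = sse([[[0.0, 0.0], [2.0, 0.0]], [[5.0, 5.0]]])  # 1 + 1 + 0 = 2
```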

Example 9.1
Produce the 5-means clusters for the data set of system fault detection
in Table 9.2 using the Euclidean distance as the measure of dissimilarity.
This is the same data set used in Example 8.1. The data set includes nine
instances of single-machine faults, and the data point for each instance
has nine attribute variables about the quality of parts.
In Step 1 of the K-means clustering algorithm, we arbitrarily select
data points 1, 3, 5, 7, and 9 to set up the initial centroids of the five clus-
ters, C1, C2, C3, C4, and C5, respectively:

Table 9.2
Data Set for System Fault Detection with Nine Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
4 (M4) 0 0 0 1 0 0 0 1 0
5 (M5) 0 0 0 0 1 0 1 0 1
6 (M6) 0 0 0 0 0 1 1 0 0
7 (M7) 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1

1  0  0 0  0 
0  0  0 0  0 
         
0  1  0 0  0 
         
0  1  0 0  0 
xC1 = x1 = 1  x C2 = x3 = 0  x C3 = x5 = 1  xC 4 = x7 =  0  x C5 = x9 =  0  .
         
0  1  0 0  0 
1  1  1  1  0 
         
0  1  0  0  0 
         
1  0  1  0  1 

The five clusters have no data point in each of them initially. Hence, we
have C1 = {}, C2 = {}, C3 = {}, C4 = {}, and C5 = {}.
In Steps 2 and 3 of the algorithm, we take the first data point x 1 in the
data set. In Step 4 of the algorithm, we compute the Euclidean distance
of the data point x 1 to each of the five clusters:

d(x1, x̄C1) = √((1−1)² + (0−0)² + (0−0)² + (0−0)² + (1−1)² + (0−0)² + (1−1)² + (0−0)² + (1−1)²) = 0

d(x1, x̄C2) = √((1−0)² + (0−0)² + (0−1)² + (0−1)² + (1−0)² + (0−1)² + (1−1)² + (0−1)² + (1−0)²) = 2.65

d(x1, x̄C3) = √((1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−1)² + (0−0)² + (1−1)² + (0−0)² + (1−1)²) = 1

d(x1, x̄C4) = √((1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (1−1)² + (0−0)² + (1−0)²) = 1.73

d(x1, x̄C5) = √((1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−0)² + (1−1)²) = 1.73

In Step 5 of the algorithm, x1 is not in any cluster. Step 6 of the algorithm
is executed to move x1 to its closest cluster C1, whose centroid remains the
same since it is set up using x1. We have C1 = {x1}, C2 = {}, C3 = {},
C4 = {}, and C5 = {}.
Going back to Step 3, we take the second data point x2 in the data set.
In Step 4, we compute the Euclidean distance of the data point x2 to each
of the five clusters:

d(x2, x̄C1) = √((0−1)² + (1−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²) = 2.65

d(x2, x̄C2) = √((0−0)² + (1−0)² + (0−1)² + (1−1)² + (0−0)² + (0−1)² + (0−1)² + (1−1)² + (0−0)²) = 2

d(x2, x̄C3) = √((0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²) = 2.45

d(x2, x̄C4) = √((0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (0−0)²) = 2

d(x2, x̄C5) = √((0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)²) = 2

In Step 5, x2 is not in any cluster. Step 6 of the algorithm is executed. Among
the three clusters, C2, C4, and C5, which produce the smallest distance to
x2, we arbitrarily select C2 and move x2 to C2. C2 has only one data point x2,
and the centroid of C2 is updated by taking x2 as its centroid:

x̄C2 = x2 = (0, 1, 0, 1, 0, 0, 0, 1, 0)^T.

We have C1 = {x1}, C2 = {x2}, C3 = {}, C4 = {}, and C5 = {}.


Going back to Step 3, we take the third data point x 3 in the data set. In
Step 4, we compute the Euclidean distance of the data point x 3 to each of
the five clusters:

d(x3, x̄C1) = √((0−1)² + (0−0)² + (1−0)² + (1−0)² + (0−1)² + (1−0)² + (1−1)² + (1−0)² + (0−1)²) = 2.65

d(x3, x̄C2) = √((0−0)² + (0−1)² + (1−0)² + (1−1)² + (0−0)² + (1−0)² + (1−0)² + (1−1)² + (0−0)²) = 2

d(x3, x̄C3) = √((0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−1)² + (1−0)² + (1−1)² + (1−0)² + (0−1)²) = 2.45

d(x3, x̄C4) = √((0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−0)² + (1−0)² + (1−1)² + (1−0)² + (0−0)²) = 2

d(x3, x̄C5) = √((0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−0)² + (1−0)² + (1−0)² + (1−0)² + (0−1)²) = 2.45

In Step 5, x3 is not in any cluster. Step 6 of the algorithm is executed.
Between the two clusters, C2 and C4, which produce the smallest dis-
tance to x3, we arbitrarily select C2 and move x3 to C2. C2 has two data
points, x2 and x3, and the centroid of C2 is updated:

x̄C2 = ((0 + 0)/2, (1 + 0)/2, (0 + 1)/2, (1 + 1)/2, (0 + 0)/2, (0 + 1)/2, (0 + 1)/2, (1 + 1)/2, (0 + 0)/2)^T = (0, 0.5, 0.5, 1, 0, 0.5, 0.5, 1, 0)^T.

We have C1 = {x1}, C2 = {x2, x3}, C3 = {}, C4 = {}, and C5 = {}.



Going back to Step 3, we take the fourth data point x 4 in the data set.
In Step 4, we compute the Euclidean distance of the data point x 4 to each
of the five clusters:

d(x4, x̄C1) = √((0−1)² + (0−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²) = 2.45

d(x4, x̄C2) = √((0−0)² + (0−0.5)² + (0−0.5)² + (1−1)² + (0−0)² + (0−0.5)² + (0−0.5)² + (1−1)² + (0−0)²) = 1

d(x4, x̄C3) = √((0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²) = 2.24

d(x4, x̄C4) = √((0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (0−0)²) = 1.73

d(x4, x̄C5) = √((0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)²) = 1.73

In Step 5, x4 is not in any cluster. Step 6 of the algorithm is executed to
move x4 to its closest cluster C2, and the centroid of C2 is updated:

x̄C2 = ((0 + 0 + 0)/3, (1 + 0 + 0)/3, (0 + 1 + 0)/3, (1 + 1 + 1)/3, (0 + 0 + 0)/3, (0 + 1 + 0)/3, (0 + 1 + 0)/3, (1 + 1 + 1)/3, (0 + 0 + 0)/3)^T = (0, 0.33, 0.33, 1, 0, 0.33, 0.33, 1, 0)^T.

We have C1 = {x1}, C2 = {x2, x3, x4}, C3 = {}, C4 = {}, and C5 = {}.
Going back to Step 3, we take the fifth data point x5 in the data set. In Step 4,
we know that x5 is closest to C3 since C3 is initially set up using x5 and has
not been updated since then. In Step 5, x5 is not in any cluster. Step 6 of the
algorithm is executed to move x5 to its closest cluster C3, whose centroid remains
the same. We have C1 = {x1}, C2 = {x2, x3, x4}, C3 = {x5}, C4 = {}, and C5 = {}.
Going back to Step 3, we take the sixth data point x 6 in the data set. In
Step 4, we compute the Euclidean distance of the data point x 6 to each of
the five clusters:

d(x6, x̄C1) = √((0−1)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (1−1)² + (0−0)² + (0−1)²) = 2

d(x6, x̄C2) = √((0−0)² + (0−0.33)² + (0−0.33)² + (0−1)² + (0−0)² + (1−0.33)² + (1−0.33)² + (0−1)² + (0−0)²) = 1.77

d(x6, x̄C3) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (1−1)² + (0−0)² + (0−1)²) = 1.73

d(x6, x̄C4) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (1−1)² + (0−0)² + (0−0)²) = 1

d(x6, x̄C5) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−0)² + (0−1)²) = 1.73

In Step 5, x6 is not in any cluster. Step 6 of the algorithm is executed to
move x6 to its closest cluster C4, and the centroid of C4 is updated:

x̄C4 = x6 = (0, 0, 0, 0, 0, 1, 1, 0, 0)^T.

We have C1 = {x1}, C2 = {x2, x3, x4}, C3 = {x5}, C4 = {x6}, and C5 = {}.


Going back to Step 3, we take the seventh data point x7 in the data set. In
Step 4, we compute the Euclidean distance of the data point x7 to each of
the five clusters:

d(x7, x̄C1) = √((0−1)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (1−1)² + (0−0)² + (0−1)²) = 1.73

d(x7, x̄C2) = √((0−0)² + (0−0.33)² + (0−0.33)² + (0−1)² + (0−0)² + (0−0.33)² + (1−0.33)² + (0−1)² + (0−0)²) = 1.67

d(x7, x̄C3) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (1−1)² + (0−0)² + (0−1)²) = 1.41

d(x7, x̄C4) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (1−1)² + (0−0)² + (0−0)²) = 1

d(x7, x̄C5) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (0−1)²) = 1.41

In Step 5, x7 is not in any cluster. Step 6 of the algorithm is executed
to move x7 to its closest cluster C4, and the centroid of C4 is updated:

x̄C4 = ((0 + 0)/2, (0 + 0)/2, (0 + 0)/2, (0 + 0)/2, (0 + 0)/2, (1 + 0)/2, (1 + 1)/2, (0 + 0)/2, (0 + 0)/2)^T = (0, 0, 0, 0, 0, 0.5, 1, 0, 0)^T.

We have C1 = {x1}, C2 = {x2, x3, x4}, C3 = {x5}, C4 = {x6, x7}, and C5 = {}.



Going back to Step 3, we take the eighth data point x 8 in the data set.
In Step 4, we compute the Euclidean distance of the data point x 8 to each
of the five clusters:

d(x8, x̄C1) = √((0−1)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²) = 2.24

d(x8, x̄C2) = √((0−0)² + (0−0.33)² + (0−0.33)² + (0−1)² + (0−0)² + (0−0.33)² + (0−0.33)² + (1−1)² + (0−0)²) = 1.20

d(x8, x̄C3) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²) = 2

d(x8, x̄C4) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0.5)² + (0−1)² + (1−0)² + (0−0)²) = 1.5

d(x8, x̄C5) = √((0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)²) = 1.41

In Step 5, x8 is not in any cluster. Step 6 of the algorithm is executed to
move x8 to its closest cluster C2, and the centroid of C2 is updated:

x̄C2 = ((0 + 0 + 0 + 0)/4, (1 + 0 + 0 + 0)/4, (0 + 1 + 0 + 0)/4, (1 + 1 + 1 + 0)/4, (0 + 0 + 0 + 0)/4, (0 + 1 + 0 + 0)/4, (0 + 1 + 0 + 0)/4, (1 + 1 + 1 + 1)/4, (0 + 0 + 0 + 0)/4)^T = (0, 0.25, 0.25, 0.75, 0, 0.25, 0.25, 1, 0)^T.

We have C1 = {x1}, C2 = {x2, x3, x4, x8}, C3 = {x5}, C4 = {x6, x7}, and C5 = {}.


Going back to Step 3, we take the ninth data point x9 in the data set. In
Step 4, we know that x9 is closest to C5 since C5 is initially set up using
x9 and has not been updated since then. In Step 5, x9 is not in any cluster.
Step 6 of the algorithm is executed to move x9 to its closest cluster C5,
whose centroid remains the same. We have C1 = {x1}, C2 = {x2, x3, x4, x8},
C3 = {x5}, C4 = {x6, x7}, and C5 = {x9}.
After finishing the FOR loop in Steps 3–6, we go down to Step 7.
Since there are changes of cluster centroids in Steps 3–6, we go back
to Step 2 and then Step 3 to start another FOR loop. In this FOR loop,
the current cluster of each data point is the closest cluster of the data
point. Hence, none of the nine data points move from its current clus-
ter to another cluster, and no change of the cluster centroids occurs in
this FOR loop. The 5-means clustering for this example produces five
clusters, C1 = {x 1}, C2 = {x 2, x 3, x 4, x 8}, C 3 = {x 5}, C 4 = {x 6, x 7}, and C 5 = {x 9}.
The hierarchical clustering for the same data set shown in Figure 8.1
produces five clusters, {x 1, x 5}, {x 2, x 4, x 8}, {x 3}, {x 6, x 7}, and {x 9}, when we
set the threshold of the merging distance to 1.5. Hence, the 5-means
clustering results are similar but not exactly the same as the hierarchi-
cal clustering result.
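The trace in Example 9.1 can be reproduced programmatically. The sketch below (an illustrative implementation of the algorithm in Table 9.1, not code from the book) seeds the centroids with data points 1, 3, 5, 7, and 9, lets each data point join its closest cluster in turn, and recomputes the joined cluster's centroid as the mean vector of its members; ties are broken by the lowest cluster index, matching the arbitrary choices made above.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def mean(points):
    return [sum(col) / len(col) for col in zip(*points)]

def k_means(points, seed_indices):
    # Step 1: centroids start at the seed data points; clusters start empty.
    centroids = [list(points[i]) for i in seed_indices]
    members = [[] for _ in seed_indices]       # point indices per cluster
    member_of = [None] * len(points)           # current cluster of each point
    moved = True
    while moved:                               # Steps 2-7: repeat until stable
        moved = False
        for i, x in enumerate(points):
            dists = [euclidean(x, c) for c in centroids]
            closest = dists.index(min(dists))  # ties: lowest cluster index
            if member_of[i] == closest:
                continue
            if member_of[i] is not None:       # leave the current cluster
                old = member_of[i]
                members[old].remove(i)
                if members[old]:
                    centroids[old] = mean([points[j] for j in members[old]])
            members[closest].append(i)         # join the closest cluster
            member_of[i] = closest
            centroids[closest] = mean([points[j] for j in members[closest]])
            moved = True
    return members

data = [                                       # Table 9.2, instances 1-9
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
]
clusters = k_means(data, [0, 2, 4, 6, 8])      # seeds x1, x3, x5, x7, x9
```

With 0-based indices, the result is [[0], [1, 2, 3, 7], [4], [5, 6], [8]], i.e., C1 = {x1}, C2 = {x2, x3, x4, x8}, C3 = {x5}, C4 = {x6, x7}, and C5 = {x9}, matching the example.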

9.2  Density-Based Clustering


Density-based clustering considers data clusters as regions of data points with
high density, which is measured using the number of data points within a given
radius (Li and Ye, 2002). Clusters are separated by regions of data points with
low density. DBSCAN (Ester et al., 1996) is a density-based clustering algorithm
that starts with a set of data points and two parameters: the radius and the min-
imum number of data points required to form a cluster. The density of a data
point x is computed by counting the number of data points within the radius
of the data point x. The region of x is the area within the radius of x, and the
region is dense if the number of data points in it is greater than or
equal to the minimum number of data points. At first, all the data points in the
data set are considered unmarked. DBSCAN arbitrarily selects an unmarked
data point x from the data set. If the region of the data point x is not dense,
x is marked as a noise point. If the region of x is dense, a new cluster is formed
containing x, and x is marked as a member of this new cluster. Moreover, each
of the data points in the region of x joins the cluster and is marked as a member
of this cluster if the data point has not yet joined a cluster. This new cluster is
further expanded to include all the data points that have not yet joined a cluster
and are in the region of any data point z already in the cluster if the region of z is
dense. The expansion of the cluster continues until all the data points connected
through the dense regions of data points join the cluster if they have not yet
joined a cluster. Note that a noise point may later be found in the dense region
of a data point in another cluster and thus may be converted into a member of
that cluster. After completing a cluster, DBSCAN selects another unmarked
data point and evaluates if the data point is a noise point or a data point to start
a new cluster. This process continues until all the data points in the data set are
marked as either a noise point or a member of a cluster.
Since density-based clustering depends on two parameters of the radius
and the minimum number of data points, knowledge in the application
domain may help the selection of appropriate parameter values to produce
a clustering result that is meaningful in the application domain. Different
clustering results using different parameter values may be obtained so that
different results can be compared and evaluated.
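The procedure described above can be sketched in Python as follows (an illustrative simplification of DBSCAN, not code from the book; here a data point counts itself when the density of its region is measured):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dbscan(points, radius, min_pts):
    NOISE = -1
    labels = [None] * len(points)            # None means unmarked

    def region(i):
        # Indices of the data points within the radius of point i.
        return [j for j in range(len(points))
                if euclidean(points[i], points[j]) <= radius]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                # region of i is not dense
            continue
        labels[i] = cluster                  # start a new cluster at i
        seeds = list(neighbors)
        while seeds:                         # expand through dense regions
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster          # a noise point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:  # region of j is dense
                seeds.extend(j_neighbors)
        cluster += 1
    return labels

# One-dimensional illustration: two dense groups and one isolated point.
labels = dbscan([[0.0], [0.5], [1.0], [5.0], [5.5], [10.0]],
                radius=1.0, min_pts=2)
```

This yields labels [0, 0, 0, 1, 1, −1]: two clusters and one noise point.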

9.3  Software and Applications


K-means clustering is supported in:

• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (www.mathworks.com)
• SAS (www.sas.com)

The application of DBSCAN to spatial data can be found in Ester et al. (1996).

Exercises
9.1 Produce the 2-means clustering of the data points in Table 9.2 using the
Euclidean distance as the measure of dissimilarity and using the first
and third data points to set up the initial centroids of the two clusters.
9.2 Produce the density-based clustering of the data points in Table 9.2
using the Euclidean distance as the measure of dissimilarity, 1.5 as the
radius and 2 as the minimum number of data points required to form
a cluster.
9.3 Produce the density-based clustering of the data points in Table 9.2
using the Euclidean distance as the measure of dissimilarity, 2 as the
radius and 2 as the minimum number of data points required to form
a cluster.
9.4 Produce the 3-means clustering of 23 data points in the space shuttle
O-ring data set in Table 1.2. Use Launch-Temperature and Leak-Check
Pressure as the attribute variables and the normalization method in
Equation 7.4 to obtain the normalized Launch-Temperature and Leak-
Check Pressure, the Euclidean distance as the measure of dissimilarity.
9.5 Repeat Exercise 9.4 using the cosine similarity measure.
10
Self-Organizing Map

This chapter describes the self-organizing map (SOM), which is based on the
architecture of artificial neural networks and is used for data clustering and
visualization. A list of software packages for SOM is provided along with
references for applications.

10.1  Algorithm of Self-Organizing Map


SOM was developed by Kohonen (1982). SOM is an artificial neural network
with output nodes arranged in a q-dimensional space, called the output map
or graph. The one-, two-, or three-dimensional space or arrangement of out-
put nodes, as shown in Figure 10.1, is usually used so that clusters of data
points can be visualized, because similar data points are represented by
nodes that are close to each other in the output map.
In an SOM, each input variable xi, i = 1, …, p, is connected to each SOM
node j, j = 1, …, k, with the connection weight wji. The output vector o of the
SOM for a given input vector x is computed as follows:

o = (o1, …, oj, …, ok)^T = (w1′x, …, wj′x, …, wk′x)^T,  (10.1)

where

x = (x1, …, xi, …, xp)^T and wj = (wj1, …, wji, …, wjp)^T.

Figure 10.1
Architectures of SOM with a (a) one-, (b) two-, and (c) three-dimensional output map.

Among all the output nodes, the output node producing the largest value for
a given input vector x is called the winner node. The winner node of the input
vector has the weight vector that is most similar to the input vector. The learning
algorithm of SOM determines the connection weights so that the winner nodes
of similar input vectors are close together. Table 10.1 lists the steps of the SOM
learning algorithm, given a training data set with n data points, xi, i = 1, …, n.
In Step 5 of the algorithm, the connection weights of the winner node for
the input vector xi and the nearby nodes of the winner node are updated
to make the weights of the winner node and its nearby nodes more similar
to the input vector and thus make these nodes produce larger outputs for
the input vector. The neighborhood function f( j, c), which determines the
closeness of node j to the winner node c and thus eligibility of node j for the
weight change, can be defined in many ways. One example of f( j, c) is

f(j, c) = 1 if ||rj − rc|| ≤ Bc(t), and 0 otherwise,  (10.2)

where r j and rc are the coordinates of node j and the winner node c in the
output map, and Bc(t) gives the threshold value that bounds the neighborhood

Table 10.1
Learning Algorithm of SOM
Step Description
1 Initialize the connection weights of nodes with random positive or negative values,
  wj′(t) = [wj1(t) … wjp(t)], t = 0, j = 1, …, k
2 REPEAT
3 FOR i = 1 to n
4 Determine the winner node c for xi: c = argmaxj wj′(t)xi
5 Update the connection weights of the winner node and its nearby nodes:
  wj(t + 1) = wj(t) + αf(j, c)[xi − wj(t)], where α is the learning rate and f(j, c)
  defines whether or not node j is close enough to c to be considered for the
  weight update
6 wj(t + 1) = wj(t) for the other nodes without the weight update
7 t = t + 1
8 UNTIL the sum of weight changes for all the nodes, E(t), is not greater than a
  threshold ε

of the winner c. Bc(t) is defined as a function of t to have an adaptive
learning process that uses a large threshold value at the beginning of the
learning process and then decreases the threshold value over iterations.
Another example of f(j, c) is

f(j, c) = e^(−||rj − rc||²/(2Bc(t)²)).  (10.3)
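Both neighborhood functions are easy to state in code. The sketch below (illustrative, not from the book; node coordinates are taken as scalars for a one-dimensional output map, though in general they can be vectors) implements Equations 10.2 and 10.3:

```python
import math

def f_threshold(r_j, r_c, B):
    # Equation 10.2: node j is eligible when it lies within the
    # neighborhood threshold B = Bc(t) of the winner node c.
    return 1.0 if abs(r_j - r_c) <= B else 0.0

def f_gaussian(r_j, r_c, B):
    # Equation 10.3: eligibility decays smoothly with the distance
    # between the node coordinates.
    return math.exp(-((r_j - r_c) ** 2) / (2.0 * B ** 2))
```

With Bc(t) decreasing over iterations, both forms shrink the neighborhood of the winner node as learning proceeds.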

In Step 8 of the algorithm, the sum of weight changes for all the nodes is
computed:

E(t) = ∑j ||wj(t + 1) − wj(t)||.  (10.4)

After the SOM is learned, clusters of data points are identified by marking
each node with the data point(s) that makes the node the winner node. A
cluster of data points is located and identified in a close neighborhood in
the output map.

Example 10.1
Use the SOM with nine nodes in a one-dimensional chain and their
coordinates of 1, 2, 3, 4, 5, 6, 7, 8, and 9 in Figure 10.2 to cluster the
nine data points in the data set for system fault detection in Table 10.2,
which is the same data set in Tables 8.1 and 9.2. The data set includes
nine instances of single-machine faults, and the data point for each

Figure 10.2
Architecture of SOM for Example 10.1: nine output nodes in a one-dimensional chain with
coordinates 1, 2, …, 9, fully connected through the weights wji to the nine inputs x1, …, x9.

Table 10.2
Data Set for System Fault Detection with Nine Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
4 (M4) 0 0 0 1 0 0 0 1 0
5 (M5) 0 0 0 0 1 0 1 0 1
6 (M6) 0 0 0 0 0 1 1 0 0
7 (M7) 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1

instance has the nine attribute variables about the quality of parts.
The learning rate α is 0.3. The neighborhood function f( j, c) is

f(j, c) = 1 if j = c − 1, c, or c + 1, and 0 otherwise.

In Step 1 of the learning process, we initialize the connection weights to the
following:

 − 0.24   0.44   0.96   0.82 


       
 − 0 .41   0 . 44   − 0. 45   − 0.22
 0.46   0.93   − 0.75  0.60 
       
 0.27   − 0.15  0.35   − 0.56 
w1 (0 ) =  0.88  w2 (0 ) =  0.84  w3 (0 ) =  0.05  w4 (0 ) =  0.91 
       
 − 0.09  − 0.36   0.86   − 0.80 
 0.78   − 0.16   0.12   0.33 
       
 − 0.39  0.55   − 0.49  − 0.54 
       
 0.91   0.93   0.98   0.47 
Self-Organizing Map 171

 0.62   − 0.47   − 0.87 


     
 0.44   − 0.62   0.23 
 0.33   − 0.96   0.37 
     
 0.46   − 0.43   0.49 
w5 (0 ) =  − 0.25 w6 (0 ) =  0.32  w7 (0 ) =  0.04 
     
 − 0.26   0.96   0.33 
 − 0.71  0.70   − 0.10 
     
 − 0.61  − 0.004   0.45 
     
 0.38   − 0.84   − 0.96 

 − 0.95   0.69 
   
 − 0 .21   0.23 
 − 0.48   − 0.69
   
 0.05   0.86 
w8 (0 ) =  − 0.54  w9 (0 ) =  0.22  .
   
 0.23   − 0.91
 − 0.37   0.82 
   
 0.61   0.31 
   
 − 0.76   0.31 

Using these initial weights to compute the SOM outputs for the nine data
points makes nodes 4, 9, 7, 9, 1, 6, 9, 8, and 3 the winner nodes for x 1, x 2, x 3, x 4,
x 5, x 6, x 7, x 8, and x 9, respectively. For example, the output of each node for x 1
is computed to determine the winner node:

 o1   w1¢ (0 ) x1 
   w¢2 (0 ) x1 

 o2  
 o3   w¢3 (0 ) x1 
   
 o4   w¢4 (0 ) x1 
 
o =  o5  =  w¢5 (0 ) x1 
 
 o6   w6¢ (0 ) x1 
o   w7¢ (0 ) x1 

 7 
 o8   w8¢ (0 ) x1 
  
 o9   w9¢ (0 ) x1 

172 Data Mining

( − 0.24 ) (1) + ( − 0.41) (0 ) + (0.46 )(0 ) + (0.27 )(0 ) + (0.88 )(1) + ( − 0.09) (0 ) 
 
 + (0.78 )(1) + ( − 0.39) (0 ) + (0.91)(1) 
 
 
( 0.44 ) (1) + (0.44 )(0 ) + (0.93 )(0 ) + ( −0.15)(0 ) + (0.84 )(1) + ( − 0.36 ) (0 ) 
 
 + ( −0.16 )(1) + (0.55) (0 ) + (0.93 )(1) 
 
( 0.96 ) (1) + ( − 0.45) (0 ) + ( −0.75)(0 ) + (0.75)(0 ) + (0.05)(1) + (0.86 )(0 ) 
 
 + (0.12)(1) + ( − 0.49) (0 ) + (0.98 )(1) 
 

( 0. 82 )( ) (
1 + − 0 . 22 ) ( ) ( )( ) (
0 + 0 . 60 0 + −0 . 56 )( ) ( )( ) (
0 + 0 . 91 1 + − 0. 89 ) ( ) 
0
 
 + (0.333 )(1) + ( − 0.54 ) (0 ) + (0.47 )(1) 
 
(0.62)(1) + (0.44 )(0 ) + (0.33 )(0 ) + (0.46 )(0 ) + ( −0.25)(1) + ( − 0.26 ) (0 ) 
= 
 + −0.71 1 + − 0.61 0 + 0.338 1 
 ( )( ) ( ) ( ) ( )( ) 
 
( − 0.47 ) (1) + ( − 0.62) (0 ) + ( −0.96 )(0 ) + ( −0.43 )(0 ) + (0.32) (1) + ( 0.96 ) (0 )
 
 + (0.70 )(1) + ( − 0.04 ) (0 ) + ( −0.84 )(1) 
 
( − 0.87 ) (1) + ( 0.23 ) (0 ) + (0.37 )(0 ) + (0.49)(0 ) + (0.04 )(1) + ( 0.33 ) (0 ) 
 
 + −0.10 1 + 0.45 0 + −0.96 1 
 ( )(
( ) ( )( ) ( )( ) 
 
( − 0.95) (1) + ( − 0.21) (0 ) + ( −0.48 )(0 ) + (0.05)(0 ) + ( −0.54 )(1) + ( 0.23 ) (0 ) 
 
 + ( −0.37 )(1) + ( 0.61) (0 ) + ( −0.76 ) (1) 
 
( 0.69) (1) + ( 0.23 ) (0 ) + ( −0.69)(0 ) + (0.86 )(0 ) + (0.22)(1) + ( − 0.91) (0 ) 
 
 
 + (0.82)(1) + ( 0.31) (0 ) + (0.31)(1) 

 2.33 
 
 2.04 
 2.11 
 
 2.53 
=  0.04  .
 
 − 0.29
 −1.9 
 
 − 2.62
 
 2.04 

Figure 10.3
The winner nodes for the nine data points in Example 10.1 using initial weight values
(node 1: x5; node 3: x9; node 4: x1; node 6: x6; node 7: x3; node 8: x8; node 9: x2, x4, and x7).

Since node 4 has the largest output value o4 = 2.53, node 4 is the winner node
for x 1. Figure 10.3 illustrates the output map to indicate the winner nodes for
the nine data points and thus initial clusters of the data points based on the
initial weights.
In Steps 2 and 3, x 1 is considered. In Step 4, the output of each node for x 1
is computed to determine the winner node. As described earlier, node 4 is
the winner node for x1, and thus c = 4. In Step 5, the connection weights to the
winner node c = 4 and its neighbors c − 1 = 3 and c + 1 = 5 are updated:
w4(1) = w4(0) + (0.3)[x1 − w4(0)] = (0.7)w4(0) + (0.3)x1 = (0.87, −0.15, 0.42, −0.39, 0.94, −0.56, 0.53, −0.38, 0.63)^T

w3(1) = w3(0) + (0.3)[x1 − w3(0)] = (0.7)w3(0) + (0.3)x1 = (0.97, −0.32, −0.53, 0.25, 0.34, 0.60, 0.38, −0.34, 0.99)^T

w5(1) = w5(0) + (0.3)[x1 − w5(0)] = (0.7)w5(0) + (0.3)x1 = (0.73, 0.31, 0.23, 0.32, 0.13, −0.18, −0.20, −0.43, 0.57)^T

In Step 6, the weights for the other nodes remain the same. In Step 7, t is
increased to 1, and the weights of the nine nodes are

 − 0.24   0.44   1.96   0.87 


       
 − 0.41  0.44   − 0.32  − 0.15
 0.46   0.93   0.53   0.42 
       
 0.27   − 0.15  0.25   − 0.39
w1 (1) =  0.88  w2 (1) =  0.84  w3 (1) =  0.34  w4 (1) =  0.94 
       
 − 0.09  − 0.36   0.60   − 0.56 
 0.78   − 0.16   0.38   0.53 
       
 − 0.39  0.55   − 0.34   − 0.38 
       
 0.91   0.93   0.99   0.63 

 0.73   − 0.47   − 0.87 


     
 0.31   − 0.62   0.23 
 0.23   − 0.96   0.37 
     
 0.32   − 0.43   0.49 
w5 (1) =  0.13  w6 (1) =  0.32  w7 (1) =  0.04 
     
 − 0.18   0.96   0.33 
 0.80   0.70   − 0.10 
     
 − 0.43   − 0.04   0.45 
     
 0.57   − 0.84   − 0.96 

 − 0.95   0.69 
   
 − 0.21  0.23 
 − 0.48   − 0.69
   
 0.05   0.86 
w8 (1) =  − 0.54  w9 (1) =  0.22  .
   
 0.23   − 0.91
 − 0.37   0.82 
   
 0.61   0.31 
   
 − 0.76   0.31 

Next, we go back to Steps 2 and 3, and x2 is considered. The learning process continues until the sum of consecutive weight changes initiated by all the nine data points is small enough.
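As a quick illustration (a sketch, not the book's code), the winner-node update above can be reproduced with the learning rate 0.3 from Example 10.1:

```python
# Sketch of one SOM weight update, w(t+1) = w(t) + eta*(x - w(t)),
# reproducing the update of the winner node w4 for x1 above.

def update_weight(w, x, eta=0.3):
    # Equivalent form used in the text: (1 - eta)*w(t) + eta*x.
    return [(1 - eta) * wi + eta * xi for wi, xi in zip(w, x)]

w4_0 = [0.82, -0.22, 0.60, -0.56, 0.91, -0.80, 0.33, -0.54, 0.47]
x1 = [1, 0, 0, 0, 1, 0, 1, 0, 1]
w4_1 = update_weight(w4_0, x1)
print([round(v, 2) for v in w4_1])
# [0.87, -0.15, 0.42, -0.39, 0.94, -0.56, 0.53, -0.38, 0.63]
```

The same function applied to w3(0) and w5(0) gives the neighbor updates shown above.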

10.2  Software and Applications


SOM is supported by:

• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (www.mathworks.com)

Liu and Weisberg (2005) describe the application of SOM to analyze ocean current variability. The application of SOM to brain activity data of a monkey in relation to movement directions is reported in Ye (2003, Chapter 3).

Exercises
10.1 Continue the learning process in Example 10.1 to perform the weight updates when x2 is presented to the SOM.
10.2 Use the software Weka to produce the SOM for Example 10.1.
10.3 Define a two-dimensional SOM and the neighborhood function in Equation 10.2 for Example 10.1 and perform one iteration of the weight update when x1 is presented to the SOM.

10.4 Use the software Weka to produce a two-dimensional SOM for Example 10.1.
10.5 Produce a one-dimensional SOM with the same neighborhood function
in Example 10.1 for the space shuttle O-ring data set in Table 1.2. Use
Launch-Temperature and Leak-Check Pressure as the attribute vari-
ables and the normalization method in Equation 7.4 to obtain the
normalized Launch-Temperature and Leak-Check Pressure.
11
Probability Distributions of Univariate Data

The clustering algorithms in Chapters 8 through 10 can be applied to data
with one or more attribute variables. If there is only one attribute variable,
we have univariate data. For univariate data, the probability distribution of
data points captures not only clusters of data points but also many other
characteristics concerning the distribution of data points. Many specific data
patterns of univariate data can be identified through their corresponding
types of probability distribution. This chapter introduces the concept and
characteristics of the probability distribution and the use of the probability
distribution characteristics to identify certain univariate data patterns. A list
of software packages for identifying the probability distribution character-
istics of univariate data is provided along with references for applications.

11.1  Probability Distribution of Univariate Data and Probability Distribution Characteristics of Various Data Patterns
Given an attribute variable x and its data observations, x1, …, xn, the fre-
quency histogram of data observations is often used to show the frequencies
of all the x values. Table 11.1 gives the values of launch temperature in the
space shuttle O-ring data set, which is taken from Table 1.2. Figure 11.1 gives
a histogram of the launch temperature values in Table 11.1 using an interval
width of 5 units. Changing the interval width changes the frequency of data
observations in each interval and thus the histogram.
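As a small illustration (not part of the text), the tally behind Figure 11.1 can be reproduced from the Table 11.1 values with an interval width of 5 units:

```python
# Tally the Launch Temperature values of Table 11.1 into intervals of
# width 5 (51-55, 56-60, ..., 81-85), as in Figure 11.1.
temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]

intervals = [(lo, lo + 4) for lo in range(51, 82, 5)]
counts = [sum(lo <= t <= hi for t in temps) for lo, hi in intervals]
for (lo, hi), c in zip(intervals, counts):
    print(f"{lo}-{hi}: {c}")
```

The interval 66–70 dominates with 10 of the 23 observations, which is why the fitted density peaks near the mean.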
In the histogram in Figure 11.1, the frequency of data observations for each
interval can be replaced with the probability density, which can be estimated
using the ratio of that frequency to the total number of data observations.
Fitting a curve to the histogram of the probability density, we obtain a fit-
ted curve for the probability density function f(x) that gives the probability
density for any value of x. A common type of the probability distribution is a
normal distribution with the following probability density function:

f(x) = (1/(√(2π) σ)) exp(−(1/2)((x − μ)/σ)²),  (11.1)

177
178 Data Mining

Table 11.1
Values of Launch Temperature in
the Space Shuttle O-Ring Data Set
Instance Launch Temperature
1 66
2 70
3 69
4 68
5 67
6 72
7 73
8 70
9 57
10 63
11 70
12 78
13 67
14 53
15 67
16 75
17 70
18 81
19 76
20 79
21 75
22 76
23 58

Figure 11.1
Frequency histogram of the Launch Temperature data.
Probability Distributions of Univariate Data 179

where
μ is the mean
σ is the standard deviation

A normal distribution is symmetric with the highest probability density at the mean x = μ and the same probability density at x = μ + a and x = μ − a.
Many data patterns manifest special characteristics of their probability
distributions. For example, we study time series data of computer and net-
work activities (Ye, 2008, Chapter 9). Time series data consist of data observa-
tions over time. From computer and network data, we observe the following
data patterns that are illustrated in Figure 11.2:

• Spike
• Random fluctuation
• Step change
• Steady change

The probability distributions of time series data with the spike, random fluc-
tuation, step change, and steady change patterns have special characteristics.
Time series data with a spike pattern as shown in Figure 11.2a have the major-
ity of data points with similar values and few data points with higher values
producing upward spikes or with lower values producing downward spikes.
The high frequency of data points with similar values determines where the
mean with a high probability density is located, and few data points with lower
(higher) values than the mean for downward (upward) spikes produce a long
tail on the left (right) side of the mean and thus a left (right) skewed distribution.
Hence, time series data with spikes produce a skewed probability distribution
that is asymmetric with most data points having values near the mean and few
data points having values spreading over one side of the mean and creating a
long tail, as shown in Figure 11.2a. Time series data with a random fluctuation
pattern produce a normal distribution that is symmetric, as shown in Figure
11.2b. Time series data with one step change, as shown in Figure 11.2c, produce
two clusters of data points with two different centroids and thus a bimodal dis-
tribution. Time series data with multiple step changes create multiple clusters
of data points with their different centroids and thus a multimodal distribu-
tion. Time series data with the steady change (i.e., a steady increase of values or
a steady decrease of values) have values evenly distributed and thus produce a
uniform distribution, as shown in Figure 11.2d. Therefore, the four patterns of
time series data produce four different types of probability distribution:

• Left or right skewed distribution
• Normal distribution
• Multimodal distribution
• Uniform distribution
Figure 11.2
Time series data patterns and their probability distributions. (a) The data plot and histogram of a spike pattern, (b) the data plot and histogram of a random fluctuation pattern, (c) the data plot and histogram of a step change pattern, and (d) the data plot and histogram of a steady change pattern. (The plotted variables are Windows performance counters, e.g., \\ALPHA02-VICTIM\LogicalDisk(C:)\Avg. Disk sec/Transfer, \\ALPHA02-VICTIM\Process(services)\IO Write Operations/sec, and \\ALPHA02-VICTIM\Memory\Pool Paged Bytes.)



As described in Ye (2008, Chapter 9), the four data patterns and their cor-
responding probability distributions can be used to identify whether or not
there are attack activities underway in computer and network systems since
computer and network data under attack or normal use conditions may
demonstrate different data patterns. Cyber attack detection is an important
part of protecting computer and network systems from cyber attacks.

11.2  Method of Distinguishing Four Probability Distributions


We may distinguish these four data patterns by identifying which of the four types of probability distribution the data have. Although there are normality tests to determine whether or not data have a normal distribution (Bryc, 1995), statistical tests for identifying one of the other probability distributions are not common. Although a histogram can be plotted to let us first visualize and then determine the probability distribution, we need a test that can be programmed and run on a computer without manual visual inspection, especially when the data set is large and real-time monitoring of data is required, as in the application to cyber attack detection. A method of distinguishing the four probability distributions using a combination of skewness and mode tests is developed in Ye (2008, Chapter 9) and is described next.
The method of distinguishing four probability distributions is based on
skewness and mode tests. Skewness is defined as

skewness = E[(x − μ)³/σ³],  (11.2)

where μ and σ are the mean and standard deviation of data population for
the variable x. Given a sample of n data points, x1, …, xn, the sample skewness
is computed:


skewness = n Σ_{i=1}^{n} (xi − x̄)³ / [(n − 1)(n − 2) s³],  (11.3)

where x̄ and s are the average and standard deviation of the data sample.
Unlike the variance that squares both positive and negative deviations from
the mean to make both positive and negative deviations from the mean con-
tribute to the variance in the same way, the skewness measures how much
data deviations from the mean are symmetric on both sides of the mean. A
left-skewed distribution with a long tail on the left side of the mean has a
negative value of the skewness. A right-skewed distribution with a long tail
on the right side of the mean has a positive value of the skewness.
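Equation 11.3 can be transcribed directly into code; the sketch below (not the book's code) applies it to the Launch Temperature values of Table 11.1, where the low values 53, 57, and 58 produce a long left tail and hence a negative skewness.

```python
# Direct transcription of the sample skewness formula (11.3).
import math

def sample_skewness(data):
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation with the (n - 1) divisor.
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n * sum((x - mean) ** 3 for x in data) / ((n - 1) * (n - 2) * s ** 3)

temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
print(sample_skewness(temps))  # negative: left-skewed
```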

Table 11.2
Combinations of Skewness and Mode Test Results for Distinguishing Four Probability Distributions

Probability Distribution | Dip Test                    | Mode Test                        | Skewness Test
Multimodal distribution  | Unimodality is rejected     | Number of significant modes ≥ 2  | Any result
Uniform distribution     | Unimodality is not rejected | Number of significant modes > 2  | Symmetric
Normal distribution      | Unimodality is not rejected | Number of significant modes < 2  | Symmetric
Skewed distribution      | Unimodality is not rejected | Number of significant modes < 2  | Skewed

The mode of a probability distribution for a variable x is located at the
value of x that has the maximum probability density. When a probability
density function has multiple local maxima, the probability distribution
has multiple modes. A large probability density indicates a cluster of
similar data points. Hence, the mode is related to the clustering of data
points. A normal distribution and a skewed distribution are examples of
unimodal distributions with only one mode, in contrast to multimodal
distributions with multiple modes. A uniform distribution has no sig-
nificant mode since data are evenly distributed and are not formed into
clusters. The dip test (Hartigan and Hartigan, 1985) determines whether
or not a probability distribution is unimodal. The mode test in the R sta-
tistical software (www.r-project.org) determines the significance of each
potential mode in a probability distribution and gives the number of sig-
nificant modes.
Table 11.2 describes the special combinations of the skewness and mode
test results that are used to distinguish four probability distributions: a mul-
timodal distribution including a bimodal distribution, a uniform distribu-
tion, a normal distribution, and a skewed distribution. Therefore, if we know
that the data have one of these four probability distributions, we can check
the combination of results from the dip test, the mode test, and the skewness
test and identify which probability distribution the data have.
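The decision logic of Table 11.2 can be sketched as a function; this is an illustrative reading only, and the dip test (Hartigan and Hartigan, 1985) and the mode test themselves are assumed to be run elsewhere (e.g., in R), with their results passed in.

```python
# Sketch of the decision logic in Table 11.2; the test results are inputs.

def classify_distribution(unimodality_rejected, n_significant_modes, skewed):
    if unimodality_rejected and n_significant_modes >= 2:
        return "multimodal"          # any skewness result
    if not unimodality_rejected:
        if n_significant_modes > 2 and not skewed:
            return "uniform"
        if n_significant_modes < 2:
            return "skewed" if skewed else "normal"
    return "undetermined"            # combination not covered by Table 11.2

print(classify_distribution(False, 1, False))  # normal
```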

11.3  Software and Applications


Statistica (www.statsoft.com) supports the skewness test. R statistical soft-
ware (www.r-project.org, www.cran.r-project.org/doc/packages/diptest.pdf)
supports the dip test and the mode test. In Ye (2008, Chapter 9), computer and
network data under attack and normal use conditions can be characterized
by different probability distributions of the data under different conditions.

Cyber attack detection is performed by monitoring the observed computer and network data and determining whether or not the change of probability distribution from the normal use condition to an attack condition occurs.

Exercises
11.1 Select and use the software to perform the skewness test, the mode test,
and the dip test on the Launch Temperature Data in Table 11.1 and use
the test results to determine whether or not the probability distribution
of the Launch Temperature data falls into one of the four probability
distributions in Table 11.2.
11.2 Select a numeric variable in the data set you obtain in Problem 1.2 and select an interval width to plot a histogram of the data for the variable. Select and use the software to perform the skewness test, the mode test, and the dip test on the data of the variable, and use the test results to determine whether or not the probability distribution of the data falls into one of the four probability distributions in Table 11.2.
11.3 Select a numeric variable in the data set you obtain in Problem 1.3 and select an interval width to plot a histogram of the data for the variable. Select and use the software to perform the skewness test, the mode test, and the dip test on the data of the variable, and use the test results to determine whether or not the probability distribution of the data falls into one of the four probability distributions in Table 11.2.
12
Association Rules

Association rules uncover items that are frequently associated together. The
algorithm of association rules was initially developed in the context of market
basket analysis for studying customer purchasing behaviors that can be used
for marketing. Association rules uncover what items customers often purchase
together. Items that are frequently purchased together can be placed together
in stores or can be associated together at e-commerce websites for promoting
the sale of the items or for other marketing purposes. There are many other
applications of association rules, for example, text analysis for document
classification and retrieval. This chapter introduces the algorithm of mining
association rules. A list of software packages that support association rules
is provided. Some applications of association rules are given with references.

12.1  Definition of Association Rules and Measures of Association
An item set contains a set of items. For example, a customer’s purchase trans-
action at a grocery store is an item set or a set of grocery items such as eggs,
tomatoes, and apples. The data set for system fault detection with nine cases
of single-machine faults in Table 8.1 contains nine data records, which can
be considered as nine sets of items by taking x1, x2, x3, x4, x5, x6, x7, x8, x9 as
nine different quality problems with the value of 1 indicating the presence
of the given quality problem. Table 12.1 shows the nine item sets obtained
from the data set for system fault detection. A frequent association of items
in Table 12.1 reveals which quality problems often occur together.
An association rule takes the form of

A → C,

where
A is an item set called the antecedent
C is an item set called the consequent

A and C have no common items, that is, A ∩ C = ∅ (an empty set). The relation-
ship of A and C in the association rule means that the presence of the item set

185
186 Data Mining

Table 12.1
Data Set for System Fault Detection with Nine Cases of Single-Machine
Faults and Item Sets Obtained from This Data Set
Instance (Faulty Machine) | Attribute Variables about Quality of Parts (x1 x2 x3 x4 x5 x6 x7 x8 x9) | Items in Each Data Record
1 (M1) 1 0 0 0 1 0 1 0 1 {x1, x5, x7, x9}
2 (M2) 0 1 0 1 0 0 0 1 0 {x2, x4, x8}
3 (M3) 0 0 1 1 0 1 1 1 0 {x3, x4, x6, x7, x8}
4 (M4) 0 0 0 1 0 0 0 1 0 {x4, x8}
5 (M5) 0 0 0 0 1 0 1 0 1 {x5, x7, x9}
6 (M6) 0 0 0 0 0 1 1 0 0 {x6, x7}
7 (M7) 0 0 0 0 0 0 1 0 0 {x7}
8 (M8) 0 0 0 0 0 0 0 1 0 {x8}
9 (M9) 0 0 0 0 0 0 0 0 1 {x9}

A in a data record implies the presence of the item set C in the data record, that
is, the item set C is associated with the item set A.
The measures of support, confidence, and lift are defined and used to dis-
cover item sets A and C that are frequently associated together. Support(X)
measures the proportion of data records that contain the item set X, and is
defined as

support(X) = |{S | S ∈ D and S ⊇ X}| / N,  (12.1)

where
D denotes the data set containing data records
S is a data record in the data set D (indicated by S ∈ D) and contains the
items in X (indicated by S ⊇ X)
| | denotes the number of such data records S
N is the number of the data records in D

Based on the definition, we have

support(∅) = |{S | S ∈ D and S ⊇ ∅}| / N = N/N = 1.

For example, for the data set with the nine data records in Table 12.1,
support({x5}) = 2/9 = 0.22

support({x7}) = 5/9 = 0.56

support({x9}) = 3/9 = 0.33

support({x5, x7}) = 2/9 = 0.22

support({x5, x9}) = 2/9 = 0.22.

Support(A → C) measures the proportion of data records that contain both the antecedent A and the consequent C in the association rule A → C, and is defined as
defined as

support( A → C ) = support( A ∪ C ), (12.2)


where A ∪ C is the union of the item set A and the item set C and contains
items from both A and C. Based on the definition, we have

support(∅ → C ) = support(C )

support( A → ∅) = support( A).


For example,

support ({x5 } → {x7 }) = support ({x5 } ∪ {x7 }) = support ({x5 , x7 }) = 0.22


support ({x5 } → {x9 }) = support ({x5 } ∪ {x9 }) = support ({x5 , x9 }) = 0.22.


Confidence(A → C) measures the proportion of data records containing the antecedent A that also contain the consequent C, and is defined as

confidence(A → C) = support(A ∪ C) / support(A).  (12.3)

Based on the definition, we have

confidence(∅ → C) = support(C) / support(∅) = support(C) / 1 = support(C)

confidence(A → ∅) = support(A) / support(A) = 1.

188 Data Mining

For example,

confidence({x5} → {x7}) = support({x5} ∪ {x7}) / support({x5}) = 0.22 / 0.22 = 1

confidence({x5} → {x9}) = support({x5} ∪ {x9}) / support({x5}) = 0.22 / 0.22 = 1.

If the antecedent A and the consequent C are independent and support(C) is high (the consequent C is contained in many data records in the data set),
support(A ∪ C) has a high value because C is contained in many data records
that also contain A. As a result, we get a high value of support(A → C) and
confidence(A → C) even though A and C are independent and the association
of A → C is of little interest. For example, if the item set C is contained in
every data record in the data, we have

support(A → C) = support(A ∪ C) = support(A)

confidence(A → C) = support(A ∪ C) / support(A) = support(A) / support(A) = 1.

However, the association rule of A → C is of little interest to us, because the
item set C is in every data record and thus any item set including A is associ-
ated with C. To address this issue, lift(A → C) is defined:

lift(A → C) = confidence(A → C) / support(C) = support(A ∪ C) / [support(A) × support(C)].  (12.4)

If the antecedent A and the consequent C are independent but support(C) is high, the high value of support(C) produces a low value of lift(A → C). For example,

lift({x5} → {x7}) = confidence({x5} → {x7}) / support({x7}) = 1 / 0.56 = 1.79

lift({x5} → {x9}) = confidence({x5} → {x9}) / support({x9}) = 1 / 0.33 = 3.03.


Figure 12.1
A manufacturing system with nine machines (M1–M9) and production flows of parts.

The association rules, {x5} → {x7} and {x5} → {x9}, have the same values of support and confidence but different values of lift. Hence, x5 appears to have a greater impact on the frequency of x9 than on the frequency of x7. Figure 1.1, which is copied in Figure 12.1, gives the production flows of parts for the data set in Table 12.1. As shown in Figure 12.1, parts flowing through M5 go to M7 and M9. Hence, x5 should have the same impact on x7 and x9. However, because parts flowing through M6 also go to M7, x7 is more frequent than x9 in the data set, producing a lower lift value for {x5} → {x7} than for {x5} → {x9}. In other words, x7 is impacted not only by x5 but also by x6 and x3, as shown in Figure 12.1, and lift captures this weaker dependence of x7 on x5 through its lower value.
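As a rough sketch (not the book's code), the three measures can be computed directly from the nine item sets of Table 12.1. Note that exact fractions give lift values of 1.8 and 3.0; the text's 1.79 and 3.03 come from carrying the rounded supports 0.56 and 0.33.

```python
# Sketch computing support, confidence, and lift (Equations 12.1-12.4)
# on the nine item sets of Table 12.1.
records = [{"x1", "x5", "x7", "x9"}, {"x2", "x4", "x8"},
           {"x3", "x4", "x6", "x7", "x8"}, {"x4", "x8"},
           {"x5", "x7", "x9"}, {"x6", "x7"}, {"x7"}, {"x8"}, {"x9"}]

def support(itemset):
    return sum(itemset <= r for r in records) / len(records)  # subset test

def confidence(a, c):
    return support(a | c) / support(a)

def lift(a, c):
    return confidence(a, c) / support(c)

print(round(lift({"x5"}, {"x7"}), 2), round(lift({"x5"}, {"x9"}), 2))
# prints: 1.8 3.0
```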

12.2  Association Rule Discovery


Association rule discovery is used to find all association rules that exceed
the minimum thresholds on certain measures of association, typically sup-
port and confidence. Association rules are constructed using frequent item
sets that satisfy the minimum support. Given a data set of data records that
are made of p items at maximum, an item set can be represented as (x1, …,
xp), xi = 0 or 1, i = 1, …, p, with xi = 1 indicating the presence of the ith item
in the item set. Since there are 2p possible combinations of different values
for (x1, …, xp), there are (2p − 1) possible different item sets with 1 to p items,
excluding the empty set represented by (0, …, 0). It is impractical to exhaus-
tively examine the support value of every one of (2p − 1) possible different
item sets.

Table 12.2
Apriori Algorithm

Step | Description of the Step
1    | F1 = {frequent one-item sets}
2    | i = 1
3    | while Fi ≠ ∅
4    |   i = i + 1
5    |   Ci = {{x1, …, xi−2, xi−1, xi} | {x1, …, xi−2, xi−1} ∈ Fi−1 and {x1, …, xi−2, xi} ∈ Fi−1}
6    |   for all data records S ∈ D
7    |     for all candidate sets C ∈ Ci
8    |       if S ⊇ C
9    |         C.count = C.count + 1
10   |   Fi = {C | C ∈ Ci and C.count ≥ minimum support}
11   | return all Fj, j = 1, …, i − 1

The Apriori algorithm (Agrawal and Srikant, 1994) provides an efficient procedure of generating frequent item sets by considering that an item set
can be a frequent item set only if all of its subsets are frequent item sets.
Table 12.2 gives the steps of the Apriori algorithm for a given data set D.
In Step 5 of the Apriori algorithm, the two item sets from Fi−1 have the
same items of x1, …, xi−2, and the two item sets differ only in only one item
with xi−1 in one item set and xi in another item set. A candidate item set
for Fi is constructed by including x1, …, xi−2 (the common items of the two
item sets from Fi−1), xi−1 and xi. For example, if {x1, x2, x3} is a frequent three-
item set, any combination of two items from this frequent three-item set, {x1,
x2}, {x2, x3} or {x1, x3}, must be frequent two-item sets. That is, if support({x1,
x2, x3}) is greater than or equal to the minimum support, support({x1, x2}),
support({x2,  x3}), and support {x1, x3} must be greater than or equal to the
minimum support. Hence, the frequent three-item set, {x1, x2, x3}, can be con-
structed using two of its two-item subsets that differ in only one item, {x1,
x2} and {x1, x3}, {x1, x2} and {x2, x3}, or {x1, x3} and {x2, x3}. Similarly, a frequent
i-item set must come from frequent (i − 1)-item sets that differ in only one
item. This method of constructing a candidate item set for Fi significantly
reduces the number of candidate item sets for Fi to be evaluated in Step 7 of
the algorithm.
Example 12.1 illustrates the Apriori algorithm. When the data are sparse
with each item being relatively infrequent in the data set, the Apriori algo-
rithm is efficient in that it produces a small number of frequent item sets,
few of which contain large numbers of items. When the data are dense, the
Apriori algorithm is less efficient and produces a large number of frequent
item sets.
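The steps in Table 12.2 can be sketched in Python as follows (a simplified reading, not the book's code); applied to the data of Table 12.1 with min-support = 0.2, it recovers the 12 frequent item sets found in Example 12.1.

```python
# Sketch of the Apriori algorithm of Table 12.2.
from itertools import combinations

records = [frozenset(r) for r in (
    {"x1", "x5", "x7", "x9"}, {"x2", "x4", "x8"},
    {"x3", "x4", "x6", "x7", "x8"}, {"x4", "x8"},
    {"x5", "x7", "x9"}, {"x6", "x7"}, {"x7"}, {"x8"}, {"x9"})]

def apriori(data, min_support):
    n = len(data)
    support = lambda c: sum(c <= r for r in data) / n
    items = sorted({i for r in data for i in r})
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    result, k = list(frequent), 2
    while frequent:
        # Join frequent (k-1)-item sets that differ in one item, then prune
        # candidates with any infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        result.extend(frequent)
        k += 1
    return result

freq = apriori(records, 0.2)
print(len(freq))  # 12
```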

Example 12.1
From the data set in Table 12.1, find all frequent item sets with min-
support (minimum support) = 0.2.
Examining the support of each one-item set, we obtain

 3
F1 = {x4 }, support = = 0.33 ,
 9
2
{x5 }, support = = 0.22,
9
2
{x6 }, support = = 0.22,
9
5
{x7 }, support = = 0.56,
9
4
{x8 }, support = = 0.44,
9
3 
{x9 }, support = = 0.33 .
9 

Using the frequent one-item sets to put together the candidate two-item
sets and examine their support, we obtain

 3
F2 = {x4 , x8 }, support = = 0.33 ,
 9
2
{x5 , x7 }, support = = 0.22,
9
2
{x5 , x9 }, support = = 0.22,
9
2
{x6 , x7 }, support = = 0.22,
9
2 
{x7 , x9 }, support = = 0.22 .
9 

Since {x5, x7}, {x5, x9}, and {x7, x9} differ from each other in only one item,
they are used to construct the three-item set {x5, x7, x9}—the only three-
item set that can be constructed:

 2 
F3 = {x5 , x7 , x9 }, support = = 0.22 .
 9 

Note that constructing a three-item set from two-item sets that dif-
fer in more than one item does not produce a frequent three-item set.

For example, {x4, x8} and {x5, x7} are frequent two-item sets that differ in
two items. {x4, x5}, {x4, x7}, {x8, x5}, and {x8, x7} are not frequent two-item
sets. A three-item set constructed using {x4, x8} and {x5, x7}, e.g., {x4, x5, x8},
is not a frequent three-item set because not every pair of two items from
{x4, x5, x8} is a frequent two-item set. Specifically, {x4, x5} and {x8, x5} are
not frequent two-item sets.
Since there is only one frequent three-item set, we cannot generate a
candidate four-item set in Step 5 of the Apriori algorithm. That is, C4 = ∅.
As a result, F4 = ∅ in Step 3 of the Apriori algorithm, and we exit the
WHILE loop. In Step 11 of the algorithm, we collect all the frequent item
sets that satisfy min-support = 0.2:
{x4}, {x5}, {x6}, {x7}, {x8}, {x9}, {x4, x8}, {x5, x7}, {x5, x9}, {x6, x7}, {x7, x9}, {x5, x7, x9}.

Example 12.2
Use the frequent item sets from Example 12.1 to generate all the asso-
ciation rules that satisfy min-support = 0.2 and min-confidence (minimum
confidence) = 0.5.
Using each frequent item set F obtained from Example 12.1, we gener-
ate each of the following association rules, A → C, which satisfies

A ∪ C = F,

A ∩ C = ∅,

and the criteria of the min-support and the min-confidence:

∅ → {x4}, support = 0.33, confidence = 0.33
∅ → {x5}, support = 0.22, confidence = 0.22
∅ → {x6}, support = 0.22, confidence = 0.22
∅ → {x7}, support = 0.56, confidence = 0.56
∅ → {x8}, support = 0.44, confidence = 0.44
∅ → {x9}, support = 0.33, confidence = 0.33
∅ → {x4, x8}, support = 0.33, confidence = 0.33
∅ → {x5, x7}, support = 0.22, confidence = 0.22
∅ → {x5, x9}, support = 0.22, confidence = 0.22
∅ → {x6, x7}, support = 0.22, confidence = 0.22
∅ → {x7, x9}, support = 0.22, confidence = 0.22
∅ → {x5, x7, x9}, support = 0.22, confidence = 0.22
{x4} → ∅, support = 0.33, confidence = 1
{x5} → ∅, support = 0.22, confidence = 1
{x6} → ∅, support = 0.22, confidence = 1
{x7} → ∅, support = 0.56, confidence = 1
{x8} → ∅, support = 0.44, confidence = 1
{x9} → ∅, support = 0.33, confidence = 1
{x4, x8} → ∅, support = 0.33, confidence = 1
{x5, x7} → ∅, support = 0.22, confidence = 1
{x5, x9} → ∅, support = 0.22, confidence = 1
{x6, x7} → ∅, support = 0.22, confidence = 1
Association Rules 193

{x7, x9} → ∅, support = 0.22, confidence = 1
{x5, x7, x9} → ∅, support = 0.22, confidence = 1
{x4} → {x8}, support = 0.33, confidence = 1
{x5} → {x7}, support = 0.22, confidence = 1
{x5} → {x9}, support = 0.22, confidence = 1
{x6} → {x7}, support = 0.22, confidence = 1
{x7} → {x9}, support = 0.22, confidence = 0.39
{x8} → {x4}, support = 0.33, confidence = 0.75
{x7} → {x5}, support = 0.22, confidence = 0.39
{x9} → {x5}, support = 0.22, confidence = 0.67
{x7} → {x6}, support = 0.22, confidence = 0.39
{x9} → {x7}, support = 0.22, confidence = 0.67
{x5} → {x7, x9}, support = 0.22, confidence = 1
{x7} → {x5, x9}, support = 0.22, confidence = 0.39
{x9} → {x5, x7}, support = 0.22, confidence = 0.67
{x7, x9} → {x5}, support = 0.22, confidence = 1
{x5, x9} → {x7}, support = 0.22, confidence = 1
{x5, x7} → {x9}, support = 0.22, confidence = 1.

Removing each association rule in the form of ∅ → F, we obtain the final set of association rules:

{x4} → ∅, support = 0.33, confidence = 1
{x5} → ∅, support = 0.22, confidence = 1
{x6} → ∅, support = 0.22, confidence = 1
{x7} → ∅, support = 0.56, confidence = 1
{x8} → ∅, support = 0.44, confidence = 1
{x9} → ∅, support = 0.33, confidence = 1
{x4, x8} → ∅, support = 0.33, confidence = 1
{x5, x7} → ∅, support = 0.22, confidence = 1
{x5, x9} → ∅, support = 0.22, confidence = 1
{x6, x7} → ∅, support = 0.22, confidence = 1
{x7, x9} → ∅, support = 0.22, confidence = 1
{x5, x7, x9} → ∅, support = 0.22, confidence = 1
{x4} → {x8}, support = 0.33, confidence = 1
{x8} → {x4}, support = 0.33, confidence = 0.75
{x5} → {x7}, support = 0.22, confidence = 1
{x5} → {x9}, support = 0.22, confidence = 1
{x5} → {x7, x9}, support = 0.22, confidence = 1
{x5, x9} → {x7}, support = 0.22, confidence = 1
{x5, x7} → {x9}, support = 0.22, confidence = 1
{x9} → {x5}, support = 0.22, confidence = 0.67
{x9} → {x7}, support = 0.22, confidence = 0.67
{x9} → {x5, x7}, support = 0.22, confidence = 0.67
{x7, x9} → {x5}, support = 0.22, confidence = 1
{x6} → {x7}, support = 0.22, confidence = 1.

In this final set of association rules, each association rule in the form of
F → ∅ does not tell the association of two item sets but the presence of
the item set F in the data set and can thus be ignored. The remaining

association rules reveal the close association of x4 with x8, x5 with x7,
and x9, and x6 with x7, which are consistent with the production flows
in Figure 12.1. However, the production flows from M1, M2, and M3 are
not captured in the frequent item sets and in the final set of association
rules because of the way in which the data set is sampled by considering
all the single-machine faults. Since M1, M2, and M3 are at the beginning
of the production flows and affected by themselves only, x1, x2, and x3
appear less frequently in the data set than x4 to x9. For the same reason,
the confidence value of the association rule {x4} → {x8} is higher than that
of the association rule {x8} → {x4}.
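The rule-generation step can be sketched as follows (an illustrative reading, not the book's code): each split of a frequent item set F into a nonempty antecedent A and consequent C = F − A is scored by confidence(A → C) = support(F)/support(A).

```python
# Sketch of generating association rules from one frequent item set,
# using the nine item sets of Table 12.1 and min-confidence = 0.5.
from itertools import combinations

records = [{"x1", "x5", "x7", "x9"}, {"x2", "x4", "x8"},
           {"x3", "x4", "x6", "x7", "x8"}, {"x4", "x8"},
           {"x5", "x7", "x9"}, {"x6", "x7"}, {"x7"}, {"x8"}, {"x9"}]

def support(itemset):
    return sum(itemset <= r for r in records) / len(records)

def rules_from(freq_set, min_confidence):
    rules = []
    for k in range(1, len(freq_set)):            # nonempty antecedents
        for a in combinations(sorted(freq_set), k):
            a = set(a)
            conf = support(freq_set) / support(a)
            if conf >= min_confidence:
                rules.append((frozenset(a), frozenset(freq_set - a),
                              round(conf, 2)))
    return rules

for a, c, conf in rules_from({"x4", "x8"}, 0.5):
    print(sorted(a), "->", sorted(c), "confidence =", conf)
```

For F = {x4, x8} this yields {x4} → {x8} with confidence 1 and {x8} → {x4} with confidence 0.75, matching the final rule set above.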

Association rule discovery is not directly applicable to numeric data. To apply
association rule discovery, numeric data need to be converted into categorical
data by defining ranges of data values, as discussed in Section 4.3 of
Chapter 4, and treating values in the same range as the same item.
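The supports and confidences listed above can be checked mechanically. Below is a minimal Python sketch, not an Apriori implementation, that recomputes a few of them from the nine single-machine-fault records (each record is the set of quality variables with value 1); the helper function names are our own.

```python
# Nine single-machine-fault records, one set of present quality
# problems (items) per record.
records = [
    {"x1", "x5", "x7", "x9"},          # M1 faulty
    {"x2", "x4", "x8"},                # M2 faulty
    {"x3", "x4", "x6", "x7", "x8"},    # M3 faulty
    {"x4", "x8"},                      # M4 faulty
    {"x5", "x7", "x9"},                # M5 faulty
    {"x6", "x7"},                      # M6 faulty
    {"x7"},                            # M7 faulty
    {"x8"},                            # M8 faulty
    {"x9"},                            # M9 faulty
]

def support(itemset, data):
    """Fraction of records containing every item in the item set."""
    return sum(itemset <= r for r in data) / len(data)

def confidence(antecedent, consequent, data):
    """support(A ∪ C) / support(A)."""
    return support(antecedent | consequent, data) / support(antecedent, data)

print(round(support({"x7"}, records), 2))         # 0.56
print(round(support({"x4", "x8"}, records), 2))   # 0.33
print(confidence({"x4"}, {"x8"}, records))        # 1.0
print(round(confidence({"x8"}, {"x4"}, records), 2))  # 0.75
```

Because {x4, x8} occurs in exactly the three records that contain x4, the rule {x4} → {x8} reaches confidence 1, while {x8} → {x4} only reaches 3/4, matching the list above.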

12.3  Software and Applications


Association rule discovery is supported by Weka (http://www.cs.waikato.
ac.nz/ml/weka/) and Statistica (www.statistica.com). Some applications of
association rules can be found in Ye (2003, Chapter 2).

Exercises
12.1 Consider 16 data records in the testing data set of system fault detec-
tion in Table 3.2 as 16 sets of items by taking x1, x2, x3, x4, x5, x6, x7, x8,
x9 as nine different quality problems with the value of 1 indicating the
presence of the given quality problem. Find all frequent item sets with
min-support = 0.5.
12.2 Use the frequent item sets from Exercise 12.1 to generate all the associa-
tion rules that satisfy min-support = 0.5 and min-confidence = 0.5.
12.3 Repeat Exercise 12.1 for all 25 data records from Table 12.1 and Table 3.2
as the data set.
12.4 Repeat Exercise 12.2 for all 25 data records from Table 12.1 and Table 3.2
as the data set.
12.5 To illustrate that the Apriori algorithm is efficient for a sparse data set, find
or create a sparse data set with each item being relatively infrequent in

the data set, and apply the Apriori algorithm to the data set to produce
frequent item sets with an appropriate value of min-support.
12.6 To illustrate that the Apriori algorithm is less efficient for a dense data set,
find or create a dense data set with each item being relatively frequent
in the data records of the data set, and apply the Apriori algorithm to
the data set to produce frequent item sets with an appropriate value of
min-support.
13
Bayesian Network

The naive Bayes classifier in Chapter 3 requires that all the attribute variables
be independent of each other. The Bayesian network in this chapter allows associations
among the attribute variables themselves, as well as associations between attribute
variables and target variables. A Bayesian network uses these associations of
variables to infer information about any variable in the network. In this
chapter, we first introduce the structure of a Bayesian network and the prob-
ability information of variables in a Bayesian network. Then we describe the
probabilistic inference that is conducted within a Bayesian network. Finally,
we introduce methods of learning the structure and probability information
of a Bayesian network. A list of software packages that support Bayesian
network is provided. Some applications of Bayesian network are given with
references.

13.1 Structure of a Bayesian Network and Probability


Distributions of Variables
In Chapter 3, a naive Bayes classifier uses Equation 3.5 (shown next) to classify
the value of the target variable y based on the assumption that the attribute
variables, x1, …, xp, are independent of each other:

y_MAP ≈ arg max_{y∈Y} P(y) ∏_{i=1}^{p} P(xi|y).

However, in many applications, some attribute variables are associated in a


certain way. For example, in the data set for a system fault detection shown
in Table 3.1 and copied here in Table 13.1, x1 is associated with x5, x7, and x9.
As shown in Figure 1.1, which is copied here as Figure 13.1, M5, M7, and M9
are on the production path of parts that are processed at M1. A faulty M1
causes failed part quality after M1, giving x1 = 1, which in turn causes x5 = 1,
then x7 = 1, and finally x9 = 1. Although x1 affects x5, x7, and x9, we do not have
x5, x7, and x9 affecting x1. Hence, the cause–effect association of x1 with x5, x7,
and x9 goes in one direction only. Moreover, x1 is not associated with other
variables, x2, x3, x4, x6, and x8.


Table 13.1
Training Data Set for System Fault Detection

Instance             Attribute Variables (Quality of Parts)    Target Variable
(Faulty Machine)     x1  x2  x3  x4  x5  x6  x7  x8  x9        System Fault y
1 (M1)               1   0   0   0   1   0   1   0   1         1
2 (M2)               0   1   0   1   0   0   0   1   0         1
3 (M3)               0   0   1   1   0   1   1   1   0         1
4 (M4)               0   0   0   1   0   0   0   1   0         1
5 (M5)               0   0   0   0   1   0   1   0   1         1
6 (M6)               0   0   0   0   0   1   1   0   0         1
7 (M7)               0   0   0   0   0   0   1   0   0         1
8 (M8)               0   0   0   0   0   0   0   1   0         1
9 (M9)               0   0   0   0   0   0   0   0   1         1
10 (none)            0   0   0   0   0   0   0   0   0         0

A Bayesian network contains nodes to represent variables (including both


attribute variables and target variables) and directed links between nodes to
represent directed associations between variables. Each variable is assumed
to have a finite set of states or values. There is a directed link from a node
representing the variable xi to a node representing a variable xj if xi has
a direct impact on xj, i.e., xi causes xj, or xi influences xj in some way. In a
directed link from xi to xj, xi is a parent of xj, and xj is a child of xi. No directed
cycles, e.g., x1 → x2 → x3 → x1, are allowed in a Bayesian network. Hence, the
structure of a Bayesian network is a directed, acyclic graph.
Domain knowledge is usually used to determine how variables are linked.
For example, the production flow of parts in Figure 13.1 can be used to deter-
mine the structure of a Bayesian network shown in Figure 13.2 that includes

Figure 13.1
Manufacturing system with nine machines and production flows of parts.

Figure 13.2
Structure of a Bayesian network for the data set of system fault detection.

nine attribute variables for the quality of parts at various stages of produc-
tion, x1, x2, x3, x4, x5, x6, x7, x8, and x9, and the target variable for the presence
of a system fault, y. In Figure 13.2, x5 has one parent x1, x6 has one parent x3, x4
has two parents x2 and x3, x9 has one parent x5, x7 has two parents x5 and x6,
x8 has one parent x4, and y has three parents x7, x8, and x9. Instead of drawing
a directed link from each of the nine quality variables, x1, x2, x3, x4, x5, x6, x7,
x8, and x9, to the system fault variable y, we have a directed link from each of
three quality variables, x7, x8, and x9, to the system fault variable y, because x7,
x8, and x9 are at the last stage of the production flow and capture the effects
of x1, x2, x3, x4, x5, and x6 on y.
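The directed, acyclic structure just described can be stored as parent lists and verified programmatically. The small sketch below (our own representation, not from the book) encodes the Figure 13.2 structure and uses a topological sort to certify that it contains no directed cycles:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Parent lists for the Bayesian network in Figure 13.2.
parents = {
    "x1": [], "x2": [], "x3": [],
    "x4": ["x2", "x3"], "x5": ["x1"], "x6": ["x3"],
    "x7": ["x5", "x6"], "x8": ["x4"], "x9": ["x5"],
    "y":  ["x7", "x8", "x9"],
}

# static_order() raises CycleError if the graph has a directed cycle,
# so a completed sort certifies the structure is a DAG.
order = list(TopologicalSorter(parents).static_order())
print(order[-1])  # y is the only sink, so it is ordered after all other nodes
```

TopologicalSorter maps each node to its predecessors; a cycle such as x1 → x2 → x3 → x1 would raise CycleError instead of producing an order.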
Given that the variable x has parents z1, …, zk, a Bayesian network uses a
conditional probability distribution for P(x|z1, …, zk) to quantify the effects
of parents z1, …, zk on the child x. For example, we suppose that the device
for inspecting the quality of parts in the data set of system fault detec-
tion is not 100% reliable, producing data uncertainties and conditional
probability distributions in Tables 13.2 through 13.10 for the nodes with

Table 13.2
P(x5|x1)
x1 = 0 x1 = 1
x5 = 0 P(x5 = 0|x1 = 0) = 0.7 P(x5 = 0|x1 = 1) = 0.1
x5 = 1 P(x5 = 1|x1 = 0) = 0.3 P(x5 = 1|x1 = 1) = 0.9

Table 13.3
P(x6|x3)
x3 = 0 x3 = 1
x6 = 0 P(x6 = 0|x3 = 0) = 0.7 P(x6 = 0|x3 = 1) = 0.1
x6 = 1 P(x6 = 1|x3 = 0) = 0.3 P(x6 = 1|x3 = 1) = 0.9

Table 13.4
P(x4|x3, x2)
x2 = 0
x3 = 0 x3 = 1
x4 = 0 P(x4 = 0|x2 = 0, x3 = 0) = 0.7 P(x4 = 0|x2 = 0, x3 = 1) = 0.1
x4 = 1 P(x4 = 1|x2 = 0, x3 = 0) = 0.3 P(x4 = 1|x2 = 0, x3 = 1) = 0.9
x2 = 1
x3 = 0 x3 = 1
x4 = 0 P(x4 = 0|x2 = 1, x3 = 0) = 0.1 P(x4 = 0|x2 = 1, x3 = 1) = 0.1
x4 = 1 P(x4 = 1|x2 = 1, x3 = 0) = 0.9 P(x4 = 1|x2 = 1, x3 = 1) = 0.9

Table 13.5
P(x9|x5)
x5 = 0 x5 = 1
x9 = 0 P(x9 = 0|x5 = 0) = 0.7 P(x9 = 0|x5 = 1) = 0.1
x9 = 1 P(x9 = 1|x5 = 0) = 0.3 P(x9 = 1|x5 = 1) = 0.9

Table 13.6
P(x7|x5, x6)
x5 = 0
x6 = 0 x6 = 1
x7 = 0 P(x7 = 0|x5 = 0, x6 = 0) = 0.7 P(x7 = 0|x5 = 0, x6 = 1) = 0.1
x7 = 1 P(x7 = 1|x5 = 0, x6 = 0) = 0.3 P(x7 = 1|x5 = 0, x6 = 1) = 0.9
x5 = 1
x6 = 0 x6 = 1
x7 = 0 P(x7 = 0|x5 = 1, x6 = 0) = 0.1 P(x7 = 0|x5 = 1, x6 = 1) = 0.1
x7 = 1 P(x7 = 1|x5 = 1, x6 = 0) = 0.9 P(x7 = 1|x5 = 1, x6 = 1) = 0.9

Table 13.7
P(x8|x4 )
x4 = 0 x4 = 1
x8 = 0 P(x8 = 0|x4 = 0) = 0.7 P(x8 = 0|x4 = 1) = 0.1
x8 = 1 P(x8 = 1|x4 = 0) = 0.3 P(x8 = 1|x4 = 1) = 0.9

Table 13.8
P(y|x9)
x9 = 0 x9 = 1
y=0 P(y = 0|x9 = 0) = 0.9 P(y = 0|x9 = 1) = 0.1
y=1 P(y = 1|x9 = 0) = 0.1 P(y = 1|x9 = 1) = 0.9

Table 13.9
P(y|x7)
x7 = 0 x7 = 1
y=0 P(y = 0|x7 = 0) = 0.9 P(y = 0|x7 = 1) = 0.1
y=1 P(y = 1|x7 = 0) = 0.1 P(y = 1|x7 = 1) = 0.9

Table 13.10
P(y|x8)
x8 = 0 x8 = 1
y=0 P(y = 0|x8 = 0) = 0.9 P(y = 0|x8 = 1) = 0.1
y=1 P(y = 1|x8 = 0) = 0.1 P(y = 1|x8 = 1) = 0.9

parent(s) in Figure 13.2. For example, in Table 13.2, P(x5 = 0|x1 = 1) = 0.1 and
P(x5 = 1|x1 = 1) = 0.9 mean that if x1 = 1 then the probability of x5 = 0 is 0.1,
the probability of x5 = 1 is 0.9, and the probability of having either value
(0 or 1) of x5 is 0.1 + 0.9 = 1. The reason for not having the probability of 1
for x5 = 1 if x1 = 1 is that the inspection device for x1 has a small probability
of failure. Although the inspection device tells x1 = 1, there is a small
probability that x1 should be 0. In addition, the inspection device for x5 also has
a small probability of failure, meaning that the inspection device may tell
x5 = 0 although x5 should be 1. The probabilities of failure in the inspection
devices produce data uncertainties and thus the conditional probabilities
in Tables 13.2 through 13.10.
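A conditional probability table such as Table 13.2 is naturally stored as a nested dictionary, and the property noted above, that the child's probabilities sum to 1 for each parent value (e.g., 0.1 + 0.9 = 1), can be checked mechanically. A minimal sketch with our own encoding:

```python
# P(x5 | x1) from Table 13.2: cpt[parent_value][child_value]
p_x5_given_x1 = {
    0: {0: 0.7, 1: 0.3},   # column for x1 = 0
    1: {0: 0.1, 1: 0.9},   # column for x1 = 1
}

# For each value of the parent x1, the probabilities of x5's values
# must sum to 1.
for x1_value, column in p_x5_given_x1.items():
    total = sum(column.values())
    assert abs(total - 1.0) < 1e-9, (x1_value, total)

print("each column sums to 1")
```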
For the node of a variable x in a Bayesian network that has no parents,
the prior probability distribution of x is needed. For example, in the
Bayesian network in Figure 13.2, x1, x2, and x3 have no parents, and their
prior probability distributions are given in Tables 13.11 through 13.13,
respectively.
The prior probability distributions of nodes without parent(s) and the con-
ditional probability distributions of nodes with parent(s) allow computing
the joint probability distribution of all the variables in a Bayesian ­network.

Table 13.11
P(x1)
x1 = 0 x1 = 1
P(x1 = 0) = 0.8 P(x1 = 1) = 0.2

Table 13.12
P(x2)
x2 = 0 x2 = 1
P(x2 = 0) = 0.8 P(x2 = 1) = 0.2

Table 13.13
P(x3)
x3 = 0 x3 = 1
P(x3 = 0) = 0.8 P(x3 = 1) = 0.2

For example, the joint probability distribution of the 10 variables in the
Bayesian network in Figure 13.2 is computed next:

P(x1, x2, x3, x4, x5, x6, x7, x8, x9, y)
= P(y|x1, x2, x3, x4, x5, x6, x7, x8, x9)P(x1, x2, x3, x4, x5, x6, x7, x8, x9)
= P(y|x7, x8, x9)P(x1, x2, x3, x4, x5, x6, x7, x8, x9)
= P(y|x7, x8, x9)P(x9|x1, x2, x3, x4, x5, x6, x7, x8)P(x1, x2, x3, x4, x5, x6, x7, x8)
= P(y|x7, x8, x9)P(x9|x5)P(x1, x2, x3, x4, x5, x6, x7, x8)
= P(y|x7, x8, x9)P(x9|x5)P(x7|x1, x2, x3, x4, x5, x6, x8)P(x1, x2, x3, x4, x5, x6, x8)
= P(y|x7, x8, x9)P(x9|x5)P(x7|x5, x6)P(x1, x2, x3, x4, x5, x6, x8)
= ⋯
= P(y|x7, x8, x9)P(x9|x5)P(x7|x5, x6)P(x8|x4)P(x5|x1)P(x6|x3)P(x4|x2, x3)P(x1, x2, x3)
= P(y|x7, x8, x9)P(x9|x5)P(x7|x5, x6)P(x8|x4)P(x5|x1)P(x6|x3)P(x4|x2, x3)P(x1)P(x2)P(x3).

In the aforementioned computation, we use the following equations:

P(x1, …, xi|z1, …, zk, v1, …, vj) = P(x1, …, xi|z1, …, zk)   (13.1)

P(x1, …, xi) = ∏_{j=1}^{i} P(xj),   (13.2)

where in Equation 13.1 we have x1, …, xi conditionally independent of
v1, …, vj given z1, …, zk, and in Equation 13.2 we have x1, …, xi independent
of each other.
Therefore, the conditional independences and independences among cer-
tain variables allow us to express the joint probability distribution of all
the variables using the conditional probability distributions of nodes with
parent(s) and the prior probability distributions of nodes without parent(s).
In other words, a Bayesian network gives a decomposed, simplified repre-
sentation of the joint probability distribution.
The joint probability distribution of all the variables gives the complete
description of all the variables and allows us to answer any questions about
all the variables. For example, given the joint probability distribution of two
variables x and z, P(x, z), and x takes one of values a1, …, ai, and z takes one
of values b1, …, bj, we can compute the probabilities for any questions about
these two variables:
P(x) = ∑_{k=1}^{j} P(x, z = bk)   (13.3)

P(z) = ∑_{k=1}^{i} P(x = ak, z)   (13.4)

P(x|z) = P(x, z)/P(z)   (13.5)

P(z|x) = P(x, z)/P(x).   (13.6)

In Equation 13.3, we marginalize z out of P(x, z) to obtain P(x). In Equation 13.4,


we marginalize x out of P(x, z) to obtain P(z).

Example 13.1
Given the following joint probability distribution P(x, z):

P( x = 0, z = 0) = 0.2

P( x = 0, z = 1) = 0.4

P( x = 1, z = 0) = 0.3

P( x = 1, z = 1) = 0.1,

which sum up to 1, compute P(x), P(z), P(x|z), and P(z|x):

P( x = 0) = P( x = 0, z = 0) + P( x = 0, z = 1) = 0.2 + 0.4 = 0.6

P( x = 1) = P( x = 1, z = 0) + P( x = 1, z = 1) = 0.3 + 0.1 = 0.4

P( z = 0) = P( x = 0, z = 0) + P( x = 1, z = 0) = 0.2 + 0.3 = 0.5

P( z = 1) = P( x = 0, z = 1) + P( x = 1, z = 1) = 0.4 + 0.1 = 0.5

P(x = 0|z = 0) = P(x = 0, z = 0)/P(z = 0) = 0.2/0.5 = 0.4

P(x = 1|z = 0) = P(x = 1, z = 0)/P(z = 0) = 0.3/0.5 = 0.6

P(x = 0|z = 1) = P(x = 0, z = 1)/P(z = 1) = 0.4/0.5 = 0.8

P(x = 1|z = 1) = P(x = 1, z = 1)/P(z = 1) = 0.1/0.5 = 0.2

P(z = 0|x = 0) = P(x = 0, z = 0)/P(x = 0) = 0.2/0.6 = 0.33

P(z = 1|x = 0) = P(x = 0, z = 1)/P(x = 0) = 0.4/0.6 = 0.67

P(z = 0|x = 1) = P(x = 1, z = 0)/P(x = 1) = 0.3/0.4 = 0.75

P(z = 1|x = 1) = P(x = 1, z = 1)/P(x = 1) = 0.1/0.4 = 0.25.
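The computations in Example 13.1 follow directly from Equations 13.3 through 13.6 and can be reproduced by storing the joint distribution as a dictionary, marginalizing by summation, and conditioning by division. A minimal sketch (the helper names are ours):

```python
# Joint distribution P(x, z) from Example 13.1, keyed by (x, z).
joint = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.1}

def marginal_x(a):
    """Equation 13.3: P(x = a), marginalizing z out of P(x, z)."""
    return sum(p for (x, z), p in joint.items() if x == a)

def marginal_z(b):
    """Equation 13.4: P(z = b), marginalizing x out of P(x, z)."""
    return sum(p for (x, z), p in joint.items() if z == b)

def x_given_z(a, b):
    """Equation 13.5: P(x = a | z = b) = P(x = a, z = b) / P(z = b)."""
    return joint[(a, b)] / marginal_z(b)

print(round(marginal_x(0), 2))                   # 0.6
print(marginal_z(0))                             # 0.5
print(x_given_z(0, 0))                           # 0.4
print(round(joint[(0, 0)] / marginal_x(0), 2))   # P(z = 0 | x = 0) = 0.33
```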

13.2  Probabilistic Inference


The probability distributions captured in a Bayesian network represent our prior
knowledge about the domain of all the variables. After obtaining evidences for
specific values of some variables (evidence variables), we want to use the prob-
abilistic inference for determining posterior probability distributions of certain
variables of interest (query variables). That is, we want to see how probabilities
of values for the query variables change after knowing specific values of evi-
dence variables. For example, in the Bayesian network in Figure 13.2, we want
to know what the probability of y = 1 is and what the probability of x7 is if we
have the evidence to confirm x9 = 1. In some applications, evidence variables
are variables that can be easily observed, and query variables are variables that
are not observable. We give some examples of probability inference next.

Example 13.2
Consider the Bayesian network in Figure 13.2 and the probability distri-
butions in Tables 13.2 through 13.13. Given x6 = 1, what are the probabili-
ties of x4 = 1, x3 = 1, and x2 = 1? In other words, what are P(x4 = 1|x6 = 1),
P(x3 = 1|x6 = 1), and P(x2 = 1|x6 = 1)? Note that the given condition x6 = 1
does not imply P(x6 = 1) = 1.
To get P(x3 = 1|x6 = 1), we need to obtain P(x3, x6):

P(x6, x3) = P(x6|x3)P(x3)

P(x6 = 0, x3 = 0) = P(x6 = 0|x3 = 0)P(x3 = 0) = (0.7)(0.8) = 0.56
P(x6 = 0, x3 = 1) = P(x6 = 0|x3 = 1)P(x3 = 1) = (0.1)(0.2) = 0.02
P(x6 = 1, x3 = 0) = P(x6 = 1|x3 = 0)P(x3 = 0) = (0.3)(0.8) = 0.24
P(x6 = 1, x3 = 1) = P(x6 = 1|x3 = 1)P(x3 = 1) = (0.9)(0.2) = 0.18

By marginalizing x3 out of P(x6, x3), we obtain P(x6):

P( x6 = 0) = P( x6 = 0, x3 = 0) + P( x6 = 0, x3 = 1) = 0.56 + 0.02 = 0.58

P( x6 = 1) = P( x6 = 1, x3 = 0) + P( x6 = 1, x3 = 1) = 0.24 + 0.18 = 0.42.

P(x3 = 1|x6 = 1) = P(x6 = 1|x3 = 1)P(x3 = 1)/P(x6 = 1) = (0.9)(0.2)/0.42 = 0.429

Hence, the evidence x6 = 1 changes the probability of x3 = 1 from 0.2 to 0.429.

To obtain P(x4 = 1|x6 = 1), we need to get P(x4, x6). x4 and x6 are associated
through x3. Moreover, the association of x4 and x3 involves x2. Hence, we
want to marginalize x3 and x2 out of P(x4, x3, x2|x6 = 1), where

P(x4, x3, x2|x6 = 1) = P(x4|x3, x2)P(x3|x6 = 1)P(x2)
= P(x4|x3, x2)[P(x6 = 1|x3)P(x3)/P(x6 = 1)]P(x2).

Although P(x4|x3, x2), P(x6|x3), P(x3), and P(x2) are given in Tables
13.4, 13.3, 13.13, and 13.12, respectively, P(x6) needs to be computed. In
addition to computing P(x6), we also compute P(x4) so we can compare
P(x4 = 1|x6 = 1) with P(x4).
To obtain P(x4) and P(x6), we first compute the joint probabilities
P(x4, x3, x2) and P(x6, x3) and then marginalize x3 and x2 out of P(x4, x3, x2)
and x3 out of P(x6, x3) as follows:

P(x4, x3, x2) = P(x4|x3, x2)P(x3)P(x2)

P(x4 = 0, x3 = 0, x2 = 0) = P(x4 = 0|x3 = 0, x2 = 0)P(x3 = 0)P(x2 = 0) = (0.7)(0.8)(0.8) = 0.448
P(x4 = 0, x3 = 1, x2 = 0) = P(x4 = 0|x3 = 1, x2 = 0)P(x3 = 1)P(x2 = 0) = (0.1)(0.2)(0.8) = 0.016
P(x4 = 1, x3 = 0, x2 = 0) = P(x4 = 1|x3 = 0, x2 = 0)P(x3 = 0)P(x2 = 0) = (0.3)(0.8)(0.8) = 0.192
P(x4 = 1, x3 = 1, x2 = 0) = P(x4 = 1|x3 = 1, x2 = 0)P(x3 = 1)P(x2 = 0) = (0.9)(0.2)(0.8) = 0.144
P(x4 = 0, x3 = 0, x2 = 1) = P(x4 = 0|x3 = 0, x2 = 1)P(x3 = 0)P(x2 = 1) = (0.1)(0.8)(0.2) = 0.016
P(x4 = 0, x3 = 1, x2 = 1) = P(x4 = 0|x3 = 1, x2 = 1)P(x3 = 1)P(x2 = 1) = (0.1)(0.2)(0.2) = 0.004
P(x4 = 1, x3 = 0, x2 = 1) = P(x4 = 1|x3 = 0, x2 = 1)P(x3 = 0)P(x2 = 1) = (0.9)(0.8)(0.2) = 0.144
P(x4 = 1, x3 = 1, x2 = 1) = P(x4 = 1|x3 = 1, x2 = 1)P(x3 = 1)P(x2 = 1) = (0.9)(0.2)(0.2) = 0.036

By marginalizing x3 and x2 out of P(x4, x3, x2), we obtain P(x4):

P( x 4 = 0) = P( x 4 = 0, x3 = 0, x2 = 0) + P( x 4 = 0, x3 = 1, x2 = 0)

+ P( x 4 = 0, x3 = 0, x2 = 1) + P( x 4 = 0, x3 = 1, x2 = 1)

= 0.448 + 0.016 + 0.016 + 0.004 = 0.484



P( x 4 = 1) = P( x 4 = 1, x3 = 0, x2 = 0) + P( x 4 = 1, x3 = 1, x2 = 0)

+ P( x 4 = 1, x3 = 0, x2 = 1) + P( x 4 = 1, x3 = 1, x2 = 1)

= 0.192 + 0.144 + 0.144 + 0.036 = 0.516.

Now we use P(x6) to compute P(x4, x3, x2|x6 = 1):

P(x4, x3, x2|x6 = 1) = P(x4|x3, x2)P(x3|x6 = 1)P(x2)
= P(x4|x3, x2)[P(x6 = 1|x3)P(x3)/P(x6 = 1)]P(x2):

P(x4 = 0, x3 = 0, x2 = 0|x6 = 1) = (0.7)[(0.3)(0.8)/0.42](0.8) = 0.32
P(x4 = 0, x3 = 1, x2 = 0|x6 = 1) = (0.1)[(0.9)(0.2)/0.42](0.8) = 0.034
P(x4 = 1, x3 = 0, x2 = 0|x6 = 1) = (0.3)[(0.3)(0.8)/0.42](0.8) = 0.137
P(x4 = 1, x3 = 1, x2 = 0|x6 = 1) = (0.9)[(0.9)(0.2)/0.42](0.8) = 0.309
P(x4 = 0, x3 = 0, x2 = 1|x6 = 1) = (0.1)[(0.3)(0.8)/0.42](0.2) = 0.011
P(x4 = 0, x3 = 1, x2 = 1|x6 = 1) = (0.1)[(0.9)(0.2)/0.42](0.2) = 0.009
P(x4 = 1, x3 = 0, x2 = 1|x6 = 1) = (0.9)[(0.3)(0.8)/0.42](0.2) = 0.103
P(x4 = 1, x3 = 1, x2 = 1|x6 = 1) = (0.9)[(0.9)(0.2)/0.42](0.2) = 0.077

We obtain P(x4 = 1|x6 = 1) by marginalizing x3 and x2 out of P(x4, x3,


x2|x6 = 1):

P( x 4 = 1|x6 = 1) = P( x 4 = 1, x3 = 0, x2 = 0|x6 = 1) + P( x 4 = 1, x3 = 1, x2 = 0|x6 = 1)

+ P( x 4 = 1, x3 = 0, x2 = 1|x6 = 1) + P( x 4 = 1, x3 = 1, x2 = 1|x6 = 1)

= 0.137 + 0.309 + 0.103 + 0.077 = 0.626.

In comparison with P(x4 = 1) = 0.516 that we computed earlier on, the


evidence x6 = 1 changes the probability of x4 = 1 to 0.626.
We obtain P(x2 = 1|x6 = 1) by marginalizing x4 and x3 out of P(x4, x3,
x2|x6 = 1):

P( x2 = 1|x6 = 1) = P( x 4 = 0, x3 = 0, x2 = 1|x6 = 1) + P( x 4 = 1, x3 = 0, x2 = 1|x6 = 1)

+ P( x 4 = 0, x3 = 1, x2 = 1|x6 = 1) + P( x4 = 1, x3 = 1, x2 = 1|x6 = 1)

= 0.011 + 0.103 + 0.009 + 0.077 = 0.2.

The evidence on x6 = 1 does not change the probability of x2 = 1 from


its prior probability of 0.2 because x6 is affected by x3 only. The evidence
on x6 = 1 brings the need to update the posterior probability of x3, which
in turn brings the need to update the posterior probability of x4 since x3
affects x4.
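The manual inference in Example 13.2 can be cross-checked by brute-force enumeration over the relevant fragment of the network (x2, x3, x4, and x6, with the probabilities of Tables 13.3, 13.4, 13.12, and 13.13). This sketch (our own code, not an efficient inference algorithm) sums the joint probability over all variable assignments consistent with the evidence:

```python
from itertools import product

p_x2 = {0: 0.8, 1: 0.2}                               # Table 13.12
p_x3 = {0: 0.8, 1: 0.2}                               # Table 13.13
p_x6 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}     # P(x6|x3), Table 13.3
p_x4 = {(0, 0): {0: 0.7, 1: 0.3}, (0, 1): {0: 0.1, 1: 0.9},
        (1, 0): {0: 0.1, 1: 0.9}, (1, 1): {0: 0.1, 1: 0.9}}  # P(x4|x2,x3), Table 13.4

def joint(x2, x3, x4, x6):
    """Factorized joint over the x2, x3, x4, x6 fragment."""
    return p_x2[x2] * p_x3[x3] * p_x6[x3][x6] * p_x4[(x2, x3)][x4]

def posterior(query, evidence):
    """P(query | evidence) by enumeration; query and evidence are dicts."""
    num = den = 0.0
    for x2, x3, x4, x6 in product([0, 1], repeat=4):
        v = {"x2": x2, "x3": x3, "x4": x4, "x6": x6}
        p = joint(x2, x3, x4, x6)
        if all(v[k] == val for k, val in evidence.items()):
            den += p
            if all(v[k] == val for k, val in query.items()):
                num += p
    return num / den

print(round(posterior({"x3": 1}, {"x6": 1}), 3))  # 0.429
print(round(posterior({"x4": 1}, {"x6": 1}), 3))  # 0.626
```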

Generally, we conduct the probability inference about a query vari-


able by first obtaining the joint probability distribution that contains the
query variable and then marginalizing nonquery variables out of the
joint probability distribution to obtain the probability of the query vari-
able. Regardless of what new evidence about a specific value of a variable
is obtained, the conditional probability distribution for each node with
parent(s), P(child|parent(s)), which is given in a Bayesian network, does
not change. However, all other probabilities, including the conditional
probabilities P(parent|child) and the probabilities of other variables than
the evidence variable, may change, depending on whether or not those

probabilities are affected by the evidence variable. All the probabilities


that are affected by the evidence variable need to be updated, and the
updated probabilities should be used for the probabilistic inference when
a new evidence is obtained. For example, if we continue from Example 13.2
and obtain a new evidence of x4 = 1 after updating the probabilities for
the evidence of x6 = 1 in Example 13.2, all the updated probabilities from
Example 13.2 should be used to conduct the probabilistic inference for the
new evidence of x4 = 1, for example the probabilistic inference to deter-
mine P(x3 = 1|x4 = 1) and P(x2 = 1|x4 = 1).

Example 13.3
Continuing with all the updated posterior probabilities for the evidence
of x6 = 1 from Example 13.2, we now obtain a new evidence of x4 = 1.
What are the posterior probabilities of x2 = 1 and x3 = 1? In other words,
starting with all the updated probabilities from Example 13.2, what are
P(x3 = 1|x4 = 1) and P(x2 = 1|x4 = 1)?
The probabilistic inference is presented next:

P(x3, x2|x4 = 1) = P(x4 = 1|x3, x2)P(x3|x6 = 1)P(x2|x6 = 1)/P(x4 = 1|x6 = 1)

P(x3 = 0, x2 = 0|x4 = 1) = P(x4 = 1|x3 = 0, x2 = 0)P(x3 = 0|x6 = 1)P(x2 = 0|x6 = 1)/P(x4 = 1|x6 = 1)
= (0.3)(1 − 0.429)(1 − 0.2)/0.626 = 0.219

P(x3 = 0, x2 = 1|x4 = 1) = P(x4 = 1|x3 = 0, x2 = 1)P(x3 = 0|x6 = 1)P(x2 = 1|x6 = 1)/P(x4 = 1|x6 = 1)
= (0.9)(1 − 0.429)(0.2)/0.626 = 0.164

P(x3 = 1, x2 = 0|x4 = 1) = P(x4 = 1|x3 = 1, x2 = 0)P(x3 = 1|x6 = 1)P(x2 = 0|x6 = 1)/P(x4 = 1|x6 = 1)
= (0.9)(0.429)(1 − 0.2)/0.626 = 0.494

P(x3 = 1, x2 = 1|x4 = 1) = P(x4 = 1|x3 = 1, x2 = 1)P(x3 = 1|x6 = 1)P(x2 = 1|x6 = 1)/P(x4 = 1|x6 = 1)
= (0.9)(0.429)(0.2)/0.626 = 0.123

We obtain P(x3 = 1|x4 = 1) by marginalizing x2 out of P(x3, x2|x4 = 1):

P( x3 = 1|x 4 = 1) = P( x3 = 1, x2 = 0|x 4 = 1) + P( x3 = 1, x2 = 1|x 4 = 1)

= 0.494 + 0.123 = 0.617

Since x3 affects both x6 and x4, we raise the probability of x3 = 1 from 0.2
to 0.429 when we have the evidence of x6 = 1, and then we raise the prob-
ability of x3 = 1 again from 0.429 to 0.617 when we have the evidence of
x4 = 1.
We obtain P(x2 = 1|x4 = 1) by marginalizing x3 out of P(x3, x2|x4 = 1):

P( x2 = 1|x 4 = 1) = P( x3 = 0, x2 = 1|x 4 = 1) + P( x3 = 1, x2 = 1|x 4 = 1)

= 0.164 + 0.123 = 0.287.

Since x2 affects x4 but not x6, the probability of x2 = 1 remains the same
at 0.2 when we have the evidence on x6 = 1, and then we raise the
probability of x2 = 1 from 0.2 to 0.287 when we have the evidence on
x4 = 1. It is not a big increase since x3 = 1 may also produce the evidence
on x4 = 1.
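The sequential updating in Examples 13.2 and 13.3 is equivalent to conditioning on both pieces of evidence at once. The sketch below (our own code) reuses the brute-force enumeration idea over the x2, x3, x4, x6 fragment; the exact values 0.616 and 0.288 differ from the text's 0.617 and 0.287 in the last digit only because the text rounds intermediate results.

```python
from itertools import product

p_x2 = {0: 0.8, 1: 0.2}                               # Table 13.12
p_x3 = {0: 0.8, 1: 0.2}                               # Table 13.13
p_x6 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}     # P(x6|x3), Table 13.3
p_x4 = {(0, 0): {0: 0.7, 1: 0.3}, (0, 1): {0: 0.1, 1: 0.9},
        (1, 0): {0: 0.1, 1: 0.9}, (1, 1): {0: 0.1, 1: 0.9}}  # P(x4|x2,x3), Table 13.4

def posterior(query, evidence):
    """P(query | evidence) by summing the joint over consistent assignments."""
    num = den = 0.0
    for x2, x3, x4, x6 in product([0, 1], repeat=4):
        v = {"x2": x2, "x3": x3, "x4": x4, "x6": x6}
        p = p_x2[x2] * p_x3[x3] * p_x6[x3][x6] * p_x4[(x2, x3)][x4]
        if all(v[k] == val for k, val in evidence.items()):
            den += p
            if all(v[k] == val for k, val in query.items()):
                num += p
    return num / den

both = {"x6": 1, "x4": 1}
print(round(posterior({"x3": 1}, both), 3))  # 0.616
print(round(posterior({"x2": 1}, both), 3))  # 0.288
```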

Algorithms that are used to make the probabilistic inference need to
search for a path from the evidence variable to the query variable and
to update and infer the probabilities along the path, as we did manually
in Examples 13.2 and 13.3. The search and the probabilistic inference
require large amounts of computation, as seen from Examples 13.2
and 13.3. Hence, it is crucial to develop computationally efficient
algorithms for conducting the probabilistic inference in a Bayesian network,
for example those in HUGIN (www.hugin.com), which is a software package
for Bayesian networks.

13.3  Learning of a Bayesian Network


Learning the structure of a Bayesian network and conditional probabili-
ties and prior probabilities in a Bayesian network from training data is
a topic under extensive research. In general, we would like to construct
the structure of a Bayesian network based on our domain knowledge.
However, when we do not have adequate knowledge about the domain
but only data of some observable variables in the domain, we need to
uncover associations between variables using data mining techniques

such as association rules in Chapter 12 and statistical techniques such as


tests on the independence of variables.
When all the variables in a Bayesian network are observable so that data
records of the variables can be obtained, the conditional probability tables
of nodes with parent(s) and the prior probabilities of nodes without parent(s)
can be estimated using the same formulas as Equations 3.6 and 3.7:

P(x = a) = N_{x=a}/N   (13.7)

P(x = a|z = b) = N_{x=a&z=b}/N_{z=b},   (13.8)

where
N is the number of data points in the data set
Nx = a is the number of data points with x = a
Nz = b is the number of data points with z = b
Nx = a&z = b is the number of data points with x = a and z = b
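Equations 13.7 and 13.8 reduce to simple counting when every variable is observed. A sketch (the helper names are ours) estimating a prior and a conditional probability entry from the ten records of Table 13.1; note that these raw counts reflect the deterministic production flows, not the uncertainty-adjusted probabilities of Tables 13.2 through 13.10.

```python
# The ten training records of Table 13.1 as (x1, ..., x9, y) tuples.
records = [
    (1, 0, 0, 0, 1, 0, 1, 0, 1, 1), (0, 1, 0, 1, 0, 0, 0, 1, 0, 1),
    (0, 0, 1, 1, 0, 1, 1, 1, 0, 1), (0, 0, 0, 1, 0, 0, 0, 1, 0, 1),
    (0, 0, 0, 0, 1, 0, 1, 0, 1, 1), (0, 0, 0, 0, 0, 1, 1, 0, 0, 1),
    (0, 0, 0, 0, 0, 0, 1, 0, 0, 1), (0, 0, 0, 0, 0, 0, 0, 1, 0, 1),
    (0, 0, 0, 0, 0, 0, 0, 0, 1, 1), (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
]

N = len(records)

def prior(index, value):
    """Equation 13.7: P(x = a) = N_{x=a} / N."""
    return sum(r[index] == value for r in records) / N

def conditional(child, a, parent, b):
    """Equation 13.8: P(x = a | z = b) = N_{x=a & z=b} / N_{z=b}."""
    n_b = sum(r[parent] == b for r in records)
    n_ab = sum(r[child] == a and r[parent] == b for r in records)
    return n_ab / n_b

print(prior(0, 1))              # P(x1 = 1) = 0.1
print(prior(6, 1))              # P(x7 = 1) = 0.5
print(conditional(4, 1, 0, 1))  # P(x5 = 1 | x1 = 1) = 1.0
```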

Russell et al. (1995) developed the gradient ascent method, which is
similar to the gradient descent method for artificial neural networks, to
learn an entry in a conditional probability table in a Bayesian network
when the entry cannot be learned from the training data. Let wij = P(xi|zj)
be such an entry in a conditional probability table for the node x taking its
ith value with parent(s) z taking the jth value in a Bayesian network. Let h
denote a hypothesis about the value of wij. Given the training data set D, we
want to find the maximum likelihood hypothesis h that maximizes P(D|h):

h = arg max_h P(D|h) = arg max_h ln P(D|h).

The following gradient ascent is performed to update wij:

wij(t + 1) = wij(t) + α ∂ln P(D|h)/∂wij,   (13.9)


where α is the learning rate. Denoting P(D|h) by Ph(D) and using ∂ln f(x)/∂x =
[1/f(x)][∂f(x)/∂x], we have

∂ln P(D|h)/∂wij = ∂ln Ph(D)/∂wij = ∂ln ∏_{d∈D} Ph(d)/∂wij

= ∑_{d∈D} [1/Ph(d)] ∂Ph(d)/∂wij

= ∑_{d∈D} [1/Ph(d)] ∂[∑_{i′,j′} Ph(d|xi′, zj′)Ph(xi′, zj′)]/∂wij

= ∑_{d∈D} [1/Ph(d)] ∂[∑_{i′,j′} Ph(d|xi′, zj′)Ph(xi′|zj′)Ph(zj′)]/∂wij

= ∑_{d∈D} [1/Ph(d)] ∂[∑_{i′,j′} Ph(d|xi′, zj′)wi′j′Ph(zj′)]/∂wij

= ∑_{d∈D} [1/Ph(d)] Ph(d|xi, zj)Ph(zj)

= ∑_{d∈D} [1/Ph(d)] [Ph(xi, zj|d)Ph(d)/Ph(xi, zj)] Ph(zj)

= ∑_{d∈D} Ph(xi, zj|d)Ph(zj)/Ph(xi, zj)

= ∑_{d∈D} Ph(xi, zj|d)/Ph(xi|zj)

= ∑_{d∈D} Ph(xi, zj|d)/wij.   (13.10)

Plugging Equation 13.10 into Equation 13.9, we obtain

wij(t + 1) = wij(t) + α ∑_{d∈D} Ph(xi, zj|d)/wij(t),   (13.11)

where Ph(xi, zj|d) can be obtained using the probabilistic inference described
in Section 13.2. After using Equation 13.11 to update wij, we need to ensure

∑_i wij(t + 1) = 1   (13.12)

by performing the normalization

wij(t + 1) ← wij(t + 1)/∑_i wij(t + 1).   (13.13)
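Equations 13.11 through 13.13 amount to a gradient step on one column of a conditional probability table followed by renormalization over the child's values. The sketch below assumes the per-record posteriors Ph(xi, zj|d) have already been obtained by the probabilistic inference of Section 13.2; the numeric values are made up purely for illustration.

```python
# Current column of the CPT for one parent value j: w[i] = P(x = i | z = j).
w = [0.3, 0.7]

# Hypothetical per-record posteriors Ph(xi, zj | d), one list per value i,
# over three records d (illustrative numbers only).
posteriors = [
    [0.2, 0.1, 0.3],   # for x = 0 with z = j
    [0.4, 0.6, 0.5],   # for x = 1 with z = j
]
alpha = 0.05  # learning rate

# Equation 13.11: w_ij(t+1) = w_ij(t) + alpha * sum_d Ph(xi, zj | d) / w_ij(t)
updated = [w_i + alpha * sum(post) / w_i for w_i, post in zip(w, posteriors)]

# Equations 13.12 and 13.13: renormalize so the column sums to 1.
total = sum(updated)
normalized = [u / total for u in updated]

print(round(sum(normalized), 10))  # 1.0
```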

13.4  Software and Applications


Bayes server (www.bayesserver.com) and HUGIN (www.hugin.com) are
two software packages that support Bayesian network. Some applications
of Bayesian network in bioinformatics and some other fields can be found in
Davis (2003), Diez et al. (1997), Jiang and Cooper (2010), Pourret et al. (2008).

Exercises
13.1 Consider the Bayesian network in Figure 13.2 and the probability distributions
in Tables 13.2 through 13.13. Given x6 = 1, what is the probability
of x7 = 1? In other words, what is P(x7 = 1|x6 = 1)?
13.2 Continuing with all the updated posterior probabilities for the evidence
of x6 = 1 from Example 13.2 and Exercise 13.1, we now obtain a new evidence
of x4 = 1. What is the posterior probability of x7 = 1? In other words,
what is P(x7 = 1|x4 = 1)?
13.3 Repeat Exercise 13.1 to determine P(x1 = 1|x6 = 1).
13.4 Repeat Exercise 13.2 to determine P(x1 = 1|x4 = 1).
13.5 Repeat Exercise 13.1 to determine P(y = 1|x6 = 1).
13.6 Repeat Exercise 13.2 to determine P(y = 1|x4 = 1).
Part IV

Algorithms for Mining Data Reduction Patterns
14
Principal Component Analysis

Principal component analysis (PCA) is a statistical technique of representing


high-dimensional data in a low-dimensional space. PCA is usually used to
reduce the dimensionality of data so that the data can be further visualized
or analyzed in a low-dimensional space. For example, we may use PCA to
represent data records with 100 attribute variables by data records with only
2 or 3 variables. In this chapter, a review of multivariate statistics and matrix
algebra is first given to lay the mathematical foundation of PCA. Then, PCA
is described and illustrated. A list of software packages that support PCA is
provided. Some applications of PCA are given with references.

14.1  Review of Multivariate Statistics


If xi is a continuous random variable with probability density function fi(xi),
the mean and variance of the random variable, ui and σi², are defined as follows:

ui = E(xi) = ∫_{−∞}^{∞} xi fi(xi) dxi   (14.1)

σi² = ∫_{−∞}^{∞} (xi − ui)² fi(xi) dxi.   (14.2)

If xi is a discrete random variable with discrete values and probability
function P(xi),

ui = E(xi) = ∑_{all values of xi} xi P(xi)   (14.3)

σi² = ∑_{all values of xi} (xi − ui)² P(xi).   (14.4)

If xi and xj are continuous random variables with the joint probability density
function fij(xi, xj), the covariance of the two random variables, xi and xj, is
defined as follows:

σij = E[(xi − ui)(xj − uj)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (xi − ui)(xj − uj) fij(xi, xj) dxi dxj.   (14.5)


If xi and xj are discrete random variables with the joint probability
function P(xi, xj),

σij = E[(xi − ui)(xj − uj)] = ∑_{all values of xi} ∑_{all values of xj} (xi − ui)(xj − uj)P(xi, xj).   (14.6)

The correlation coefficient is

ρij = σij/(σiσj).   (14.7)

For a vector of random variables, x = (x1, x2, …, xp)′, the mean vector is

E(x) = (E(x1), E(x2), …, E(xp))′ = (μ1, μ2, …, μp)′ = m,   (14.8)

and the variance–covariance matrix is

Σ = E[(x − m)(x − m)′],

whose (i, j)th entry is E[(xi − μi)(xj − μj)]. Writing out the entries,

    ⎡ σ1²  σ12  ⋯  σ1p ⎤
Σ = ⎢ σ21  σ2²  ⋯  σ2p ⎥   (14.9)
    ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ σp1  σp2  ⋯  σp² ⎦

where σi² = E[(xi − μi)²] is the variance of xi and σij = E[(xi − μi)(xj − μj)] is
the covariance of xi and xj.


Example 14.1
Compute the mean vector and variance–covariance matrix of the two variables
in Table 14.1.
The data set in Table 14.1 is a part of the data set for the manufacturing
system in Table 1.4 and includes two attribute variables, x7 and x8,
for nine cases of single-machine faults. Table 14.2 shows the joint and
marginal probabilities of these two variables.
The mean and variance of x7 are

u7 = E(x7) = ∑ x7 P(x7) = 0 × 4/9 + 1 × 5/9 = 5/9

σ7² = ∑ (x7 − u7)² P(x7) = (0 − 5/9)² × 4/9 + (1 − 5/9)² × 5/9 = 0.2469.

Table 14.1
Data Set for System Fault Detection with Two Quality Variables

Instance (Faulty Machine)   x7   x8
1 (M1)                      1    0
2 (M2)                      0    1
3 (M3)                      1    1
4 (M4)                      0    1
5 (M5)                      1    0
6 (M6)                      1    0
7 (M7)                      1    0
8 (M8)                      0    1
9 (M9)                      0    0

Table 14.2
Joint and Marginal Probabilities of Two Quality Variables

P(x7, x8)    x8 = 0            x8 = 1            P(x7)
x7 = 0       1/9               3/9               1/9 + 3/9 = 4/9
x7 = 1       4/9               1/9               4/9 + 1/9 = 5/9
P(x8)        1/9 + 4/9 = 5/9   3/9 + 1/9 = 4/9   1

The mean and variance of x8 are

u8 = E(x8) = Σ(all values of x8) x8P(x8) = 0 × (5/9) + 1 × (4/9) = 4/9

σ8² = Σ(all values of x8) (x8 − u8)²P(x8) = (0 − 4/9)² × (5/9) + (1 − 4/9)² × (4/9) = 0.2469.

The covariance of x7 and x8 is

σ78 = Σ(all values of x7) Σ(all values of x8) (x7 − μ7)(x8 − μ8)P(x7, x8)
= (0 − 5/9)(0 − 4/9)(1/9) + (0 − 5/9)(1 − 4/9)(3/9) + (1 − 5/9)(0 − 4/9)(4/9) + (1 − 5/9)(1 − 4/9)(1/9)
= −0.1358.

The mean vector of x = (x7, x8) is

μ = | μ7 | = | 5/9 |
    | μ8 |   | 4/9 |

and the variance–covariance matrix is

Σ = | σ77  σ78 | = | 0.2469   −0.1358 |.
    | σ87  σ88 |   | −0.1358   0.2469 |
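As a check on Example 14.1, the population mean vector and variance–covariance matrix can be computed directly from the nine cases in Table 14.1. A minimal Python sketch (the variable names are illustrative, not from the text):

```python
# Values of (x7, x8) for the nine single-machine fault cases in Table 14.1.
data = [(1, 0), (0, 1), (1, 1), (0, 1), (1, 0), (1, 0), (1, 0), (0, 1), (0, 0)]
n = len(data)

# Means u7 and u8 over the empirical distribution (each case has probability 1/9).
m7 = sum(x7 for x7, x8 in data) / n   # 5/9
m8 = sum(x8 for x7, x8 in data) / n   # 4/9

# Population variances and covariance, matching sigma77, sigma88, and sigma78.
var7 = sum((x7 - m7) ** 2 for x7, x8 in data) / n
var8 = sum((x8 - m8) ** 2 for x7, x8 in data) / n
cov78 = sum((x7 - m7) * (x8 - m8) for x7, x8 in data) / n

print(round(var7, 4), round(var8, 4), round(cov78, 4))  # 0.2469 0.2469 -0.1358
```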

14.2  Review of Matrix Algebra


Given a vector of p variables:

x = [x1, x2, …, xp]′,  x′ = [x1  x2  ⋯  xp],	(14.10)

Principal Component Analysis 221

x1, x2, …, xp are linearly dependent if there exists a set of constants, c1, c2, …, cp,
not all zero, which makes the following equation hold:

c1x1 + c2 x2 +  + cp x p = 0. (14.11)

Similarly, x1, x2, …, xp are linearly independent if there exists only one set of
constants, c1 = c2 = ⋯ = cp = 0, which makes the following equation hold:

c1x1 + c2 x2 +  + cp x p = 0. (14.12)

The length of the vector, x, is computed as follows:

Lx = √(x1² + x2² + ⋯ + xp²) = √(x′x).	(14.13)
Figure 14.1 plots a two-dimensional vector, x′ = (x1, x2), and shows the com-
putation of the length of the vector.
Figure 14.2 shows the angle θ between two vectors, x′ = (x1, x2), y′ = (y1, y2),
which is computed as follows:

cos(θ1) = x1/Lx	(14.14)

sin(θ1) = x2/Lx	(14.15)

cos(θ2) = y1/Ly	(14.16)

sin(θ2) = y2/Ly	(14.17)

[Figure 14.1 shows the two-dimensional vector x with components x1 and x2 and length Lx = (x1² + x2²)^(1/2).]

Figure 14.1
Computation of the length of a vector.

[Figure 14.2 shows the two vectors x′ = (x1, x2) and y′ = (y1, y2), the angles θ1 and θ2 that they make with the horizontal axis, and the angle θ = θ2 − θ1 between them.]

Figure 14.2
Computation of the angle between two vectors.

cos(θ) = cos(θ2 − θ1) = cos(θ2)cos(θ1) + sin(θ2)sin(θ1)
= (y1/Ly)(x1/Lx) + (y2/Ly)(x2/Lx) = (x1y1 + x2y2)/(LxLy) = x′y/(LxLy).	(14.18)

Based on the computation of the angle between two vectors, x′ and y′, the
two vectors are orthogonal, that is, θ = 90° or 270°, or cos(θ) = 0, only if x′y = 0.
A p × p square matrix, A, is symmetric if A = A′, that is, aij = aji, for i = 1, …, p,
and j = 1, …, p. An identity matrix is the following:

    | 1  0  ⋯  0 |
I = | 0  1  ⋯  0 |,
    | ⋮  ⋮  ⋱  ⋮ |
    | 0  0  ⋯  1 |

and we have

AI = IA = A.	(14.19)

The inverse of the matrix A is denoted as A−1, and we have

AA −1 = A −1 A = I . (14.20)

The inverse of A exists if the p columns of A, a1, a2, …, ap, are linearly
independent.

Let |A| denote the determinant of a square p × p matrix A. |A| is computed as follows:

|A| = a11 if p = 1	(14.21)

|A| = Σ(j=1 to p) a1j|A1j|(−1)^(1+j) = Σ(j=1 to p) aij|Aij|(−1)^(i+j) if p > 1,	(14.22)

where
A1j is the (p − 1) × (p − 1) matrix obtained by removing the first row and the
jth column of A
A ij is the (p − 1) × (p − 1) matrix obtained by removing the ith row and the
jth column of A. For a 2 × 2 matrix:

A = | a11  a12 |,
    | a21  a22 |

the determinant of A is

|A| = | a11  a12 | = Σ(j=1 to 2) a1j|A1j|(−1)^(1+j)
      | a21  a22 |

= a11|A11|(−1)^(1+1) + a12|A12|(−1)^(1+2) = a11a22 − a12a21.	(14.23)

For the identity matrix I,

|I| = 1.	(14.24)

The calculation of the determinant of a matrix A is illustrated next using the variance–covariance matrix of x7 and x8 from Table 14.1:

|A| = | 0.2469   −0.1358 | = 0.2469 × 0.2469 − (−0.1358)(−0.1358) = 0.0425.
      | −0.1358   0.2469 |

Let A be a p × p square matrix and I be the p × p identity matrix. The values λ1, …, λp are called eigenvalues of the matrix A if they satisfy the following equation:

|A − λI| = 0.	(14.25)

Example 14.2
Compute the eigenvalues of the following matrix A, which is obtained
from Example 14.1:

 0.2469 − 0.1358 
A=
 − 0.1358 0.2469

 0.2469 − 0.1358  1 0 0.2469 − λ − 0.1358


A − λI =  − λ = =0
 − 0.1358 0.2469  0 1 − 0.1358 0.2469 − λ

(0.2469 − λ )(0.2469 − λ ) − 0.0184 = 0

λ 2 − 0.4938λ + 0.0426 = 0

λ 1 = 0.3824 λ 2 = 0.1115.
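For a 2 × 2 matrix, Equation 14.25 is a quadratic in λ that can be solved with the quadratic formula. A sketch (carrying full precision gives λ1 = 0.3827 and λ2 = 0.1111; the values 0.3824 and 0.1115 above come from the rounded coefficient 0.0426):

```python
import math

a11, a12, a21, a22 = 0.2469, -0.1358, -0.1358, 0.2469

# |A - lambda*I| = lambda^2 - (a11 + a22)*lambda + (a11*a22 - a12*a21) = 0
b = -(a11 + a22)                 # minus the trace of A
c = a11 * a22 - a12 * a21        # determinant of A
disc = math.sqrt(b * b - 4 * c)
lam1 = (-b + disc) / 2           # larger eigenvalue
lam2 = (-b - disc) / 2           # smaller eigenvalue

print(round(lam1, 4), round(lam2, 4))  # 0.3827 0.1111
```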

Let A be a p × p square matrix and λ be an eigenvalue of A. The vector x is the eigenvector of A associated with the eigenvalue λ if x is a nonzero vector that satisfies the following equation:

Ax = λx.	(14.26)

The normalized eigenvector with the unit length, e, is computed as follows:

e = x/√(x′x).	(14.27)

Example 14.3
Compute the eigenvectors associated with the eigenvalues in Example 14.2.
The eigenvectors associated with the eigenvalues λ 1 = 0.3824 and λ 2 = 0.1115
of the following square matrix A in Example 14.2 are computed next:

A = | 0.2469   −0.1358 |
    | −0.1358   0.2469 |

Ax = λ1x

| 0.2469   −0.1358 | | x1 | = 0.3824 | x1 |
| −0.1358   0.2469 | | x2 |          | x2 |

0.2469x1 − 0.1358x2 = 0.3824x1
−0.1358x1 + 0.2469x2 = 0.3824x2

0.1355x1 + 0.1358x2 = 0
0.1358x1 + 0.1355x2 = 0.

The two equations are identical (up to rounding). Hence, there are many solutions. Setting x1 = 1 and x2 = −1, we have

x = | 1  |    e = | 1/√2  |
    | −1 |        | −1/√2 |

Ax = λ2x

| 0.2469   −0.1358 | | x1 | = 0.1115 | x1 |
| −0.1358   0.2469 | | x2 |          | x2 |

0.2469x1 − 0.1358x2 = 0.1115x1
−0.1358x1 + 0.2469x2 = 0.1115x2

0.1354x1 − 0.1358x2 = 0
0.1358x1 − 0.1354x2 = 0.

The aforementioned two equations are identical (up to rounding) and thus have many solutions. Setting x1 = 1 and x2 = 1, we have

x = | 1 |    e = | 1/√2 |.
    | 1 |        | 1/√2 |

In this example, the two eigenvectors associated with the two eigenvalues are chosen such that they are orthogonal.
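The solutions can be checked by verifying Ax = λx for each normalized eigenvector and the orthogonality e1′e2 = 0. A sketch using the exact eigenvalues 0.2469 ± 0.1358:

```python
import math

A = [[0.2469, -0.1358], [-0.1358, 0.2469]]
s = 1 / math.sqrt(2)
e1, lam1 = [s, -s], 0.2469 + 0.1358   # exact eigenvalue for e1
e2, lam2 = [s, s], 0.2469 - 0.1358    # exact eigenvalue for e2

def matvec(M, v):
    # Multiply a 2x2 matrix by a 2-vector.
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

Ae1 = matvec(A, e1)
Ae2 = matvec(A, e2)

# A e = lambda e for each pair, and e1'e2 = 0 (orthogonality).
assert all(abs(Ae1[i] - lam1 * e1[i]) < 1e-9 for i in range(2))
assert all(abs(Ae2[i] - lam2 * e2[i]) < 1e-9 for i in range(2))
assert abs(e1[0] * e2[0] + e1[1] * e2[1]) < 1e-12
```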

Let A be a p × p symmetric matrix and (λi, ei), i = 1, …, p, be p pairs of eigenvalues and eigenvectors of A, with ei, i = 1, …, p, chosen to be mutually orthogonal. A spectral decomposition of A is given next:

A = Σ(i=1 to p) λieiei′.	(14.28)

Example 14.4
Compute the spectral decomposition of the matrix in Examples 14.2
and 14.3.
The spectral decomposition of the following symmetric matrix in Examples 14.2 and 14.3 is illustrated next:

A = | 0.2469   −0.1358 |
    | −0.1358   0.2469 |

λ1 = 0.3824  λ2 = 0.1115

e1 = | 1/√2  |    e2 = | 1/√2 |
     | −1/√2 |         | 1/√2 |

| 0.2469   −0.1358 | = 0.3824 | 1/√2  | [1/√2  −1/√2] + 0.1115 | 1/√2 | [1/√2  1/√2]
| −0.1358   0.2469 |          | −1/√2 |                        | 1/√2 |

= | 0.1912   −0.1912 | + | 0.0558  0.0558 |
  | −0.1912   0.1912 |   | 0.0558  0.0558 |

A = λ1e1e1′ + λ2e2e2′.
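The decomposition can be verified numerically by accumulating the outer products λieiei′ and comparing the sum with A. A sketch (because the eigenvalues here are rounded, the reconstruction matches A only to about two decimal places):

```python
import math

s = 1 / math.sqrt(2)
lam = [0.3824, 0.1115]            # rounded eigenvalues from Example 14.2
e = [[s, -s], [s, s]]             # normalized eigenvectors from Example 14.3

# Accumulate lambda_i * e_i * e_i' (outer products) per Equation 14.28.
B = [[0.0, 0.0], [0.0, 0.0]]
for lam_i, e_i in zip(lam, e):
    for r in range(2):
        for c in range(2):
            B[r][c] += lam_i * e_i[r] * e_i[c]

A = [[0.2469, -0.1358], [-0.1358, 0.2469]]
# Rounded eigenvalues make the reconstruction accurate only to ~2 decimals.
assert all(abs(B[r][c] - A[r][c]) < 0.005 for r in range(2) for c in range(2))
```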

A p × p symmetric matrix A is called a positive definite matrix if it satisfies the following for any nonzero vector x = [x1, x2, …, xp]′ ≠ [0, 0, …, 0]′:

x′Ax > 0.

A p × p symmetric matrix A is a positive definite matrix if and only if every eigenvalue of A is greater than zero (Johnson and Wichern, 1998).
For example, the following 2 × 2 matrix A is a positive definite matrix with two positive eigenvalues:

A = | 0.2469   −0.1358 |
    | −0.1358   0.2469 |

λ1 = 0.3824  λ2 = 0.1115.

Let A be a p × p positive definite matrix with the eigenvalues sorted in the order of λ1 ≥ λ2 ≥ ⋯ ≥ λp ≥ 0 and associated normalized eigenvectors e1, e2, …, ep, which are orthogonal. The quadratic form (x′Ax)/(x′x) is maximized to λ1 when x = e1, and this quadratic form is minimized to λp when x = ep (Johnson and Wichern, 1998). That is, we have the following:

max(x ≠ 0) (x′Ax)/(x′x) = λ1, attained by x = e1, or

e1′Ae1 = e1′ (Σ(i=1 to p) λieiei′) e1 = λ1 = max(x ≠ 0) (x′Ax)/(x′x)	(14.29)

min(x ≠ 0) (x′Ax)/(x′x) = λp, attained by x = ep, or

ep′Aep = ep′ (Σ(i=1 to p) λieiei′) ep = λp = min(x ≠ 0) (x′Ax)/(x′x)	(14.30)

and

max(x ⊥ e1, …, ei) (x′Ax)/(x′x) = λi+1, attained by x = ei+1, i = 1, …, p − 1.	(14.31)
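These bounds can be illustrated numerically for the 2 × 2 matrix used above: the quotient (x′Ax)/(x′x) stays within [λ2, λ1] for any nonzero x and attains the bounds at the eigenvectors. A sketch:

```python
import math
import random

A = [[0.2469, -0.1358], [-0.1358, 0.2469]]
lam1, lam2 = 0.2469 + 0.1358, 0.2469 - 0.1358  # exact eigenvalues of A

def rayleigh(x):
    # The quadratic form (x'Ax)/(x'x) for a 2-vector x.
    ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return (x[0] * ax[0] + x[1] * ax[1]) / (x[0] ** 2 + x[1] ** 2)

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    if x != [0.0, 0.0]:
        q = rayleigh(x)
        assert lam2 - 1e-9 <= q <= lam1 + 1e-9  # Equations 14.29 and 14.30

s = 1 / math.sqrt(2)
assert abs(rayleigh([s, -s]) - lam1) < 1e-9  # maximum attained at e1
assert abs(rayleigh([s, s]) - lam2) < 1e-9   # minimum attained at e2
```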

14.3  Principal Component Analysis


Principal component analysis explains the variance–covariance matrix of variables. Given a vector of variables x′ = [x1, …, xp] with the variance–covariance matrix Σ, the following is a linear combination of these variables:

yi = ai′x = ai1x1 + ai2x2 + ⋯ + aipxp.	(14.32)

The variance and covariance of yi can be computed as follows:

var(yi) = ai′Σai	(14.33)

cov(yi, yj) = ai′Σaj.	(14.34)

The principal components y′ = [y1, y2, …, yp] are chosen to be linear combinations of x′ that satisfy the following:

y1 = a1′x = a11x1 + a12x2 + ⋯ + a1pxp, a1′a1 = 1, a1 is chosen to maximize var(y1)	(14.35)

y2 = a2′x = a21x1 + a22x2 + ⋯ + a2pxp, a2′a2 = 1, cov(y2, y1) = 0, a2 is chosen to maximize var(y2)

⋮

yi = ai′x = ai1x1 + ai2x2 + ⋯ + aipxp, ai′ai = 1, cov(yi, yj) = 0 for j < i, ai is chosen to maximize var(yi).

Let (λi, ei), i = 1, …, p, be the eigenvalues and orthogonal eigenvectors of Σ, with ei′ei = 1 and λ1 ≥ λ2 ≥ ⋯ ≥ λp ≥ 0. Setting a1 = e1, …, ap = ep, we have

yi = ei′x, i = 1, …, p	(14.36)

ei′ei = 1

var(yi) = ei′Σei = λi

cov(yi, yj) = ei′Σej = 0 for j < i.




Based on Equations 14.29 through 14.31, yi, i = 1, …, p, set by Equation 14.36, satisfy the requirement of the principal components in Equation 14.35. Hence, the principal components are determined using Equation 14.36.

Let x1, …, xp have variances of σ11, …, σpp, respectively. The sum of the variances of x1, …, xp is equal to the sum of the variances of y1, …, yp (Johnson and Wichern, 1998):

Σ(i=1 to p) var(xi) = σ11 + ⋯ + σpp = Σ(i=1 to p) var(yi) = λ1 + ⋯ + λp.	(14.37)

Example 14.5
Determine the principal components of the two variables in Example 14.1.
For the two variables x′ = [x7, x8] in Table 14.1 and Example 14.1, the variance–covariance matrix Σ is

Σ = | 0.2469   −0.1358 |,
    | −0.1358   0.2469 |

with eigenvalues and eigenvectors determined in Examples 14.2 and 14.3:

λ1 = 0.3824  λ2 = 0.1115

e1 = | 1/√2  |    e2 = | 1/√2 |.
     | −1/√2 |         | 1/√2 |

The principal components are

y1 = e1′x = (1/√2)x7 − (1/√2)x8

y2 = e2′x = (1/√2)x7 + (1/√2)x8.

The variances of y1 and y2 are

var(y1) = var((1/√2)x7 − (1/√2)x8)
= (1/√2)² var(x7) + (−1/√2)² var(x8) + 2(1/√2)(−1/√2) cov(x7, x8)
= (1/2)(0.2469) + (1/2)(0.2469) − (−0.1358) = 0.3827 = λ1

var(y2) = var((1/√2)x7 + (1/√2)x8)
= (1/√2)² var(x7) + (1/√2)² var(x8) + 2(1/√2)(1/√2) cov(x7, x8)
= (1/2)(0.2469) + (1/2)(0.2469) + (−0.1358) = 0.1111 = λ2.

We also have

var(x7) + var(x8) = 0.2469 + 0.2469 = var(y1) + var(y2) = 0.3827 + 0.1111.

The proportion of the total variance accounted for by the first principal component y1 is 0.3824/0.4939 = 0.7742, or 77%. Since most of the total variance in x′ = [x7, x8] is accounted for by y1, we may use y1 to replace and represent the two original variables x7 and x8 without losing much variance. This is the basis of applying PCA to reduce the dimensionality of data: a few principal components represent a large number of variables in the original data while still accounting for much of the variance in the data. With a few principal components representing the data, the data can be visualized in a one-, two-, or three-dimensional space of the principal components to observe data patterns, or can be mined or analyzed to uncover data patterns of the principal components. Note that the mathematical meaning of each principal component, as a linear combination of the original data variables, does not necessarily have a meaningful interpretation in the problem domain. Ye (1997, 1998) shows examples of interpreting data that are not represented in their original problem domain.
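This reduction can be sketched on the data of Table 14.1 by projecting the nine cases onto y1 and y2 and computing the share of the total variance carried by y1 (full precision gives 0.3827/0.4938 ≈ 0.775; the 0.7742 above reflects the rounded eigenvalue 0.3824):

```python
import math

data = [(1, 0), (0, 1), (1, 1), (0, 1), (1, 0), (1, 0), (1, 0), (0, 1), (0, 0)]
s = 1 / math.sqrt(2)

# Project each case onto the principal components y1 = e1'x and y2 = e2'x.
y1 = [s * x7 - s * x8 for x7, x8 in data]
y2 = [s * x7 + s * x8 for x7, x8 in data]

def pop_var(v):
    # Population variance of a list of values.
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

# Share of the total variance carried by the first principal component.
ratio = pop_var(y1) / (pop_var(y1) + pop_var(y2))
print(round(ratio, 3))  # 0.775
```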

14.4  Software and Applications


PCA is supported by many statistical software packages, including SAS
(www.sas.com), SPSS (www.spss.com), and Statistica (www.statistica.com).
Some applications of PCA in the manufacturing fields are described in Ye
(2003, Chapter 8).

Exercises
14.1 Determine the nine principal components of x1, …, x9 in Table 8.1 and
identify the principal components that can be used to account for 90%
of the total variances of data.
14.2 Determine the principal components of x1 and x2 in Table 3.2.
14.3 Repeat Exercise 14.2 for x1, …, x9, and identify the principal components
that can be used to account for 90% of the total variances of data.
15
Multidimensional Scaling

Multidimensional scaling (MDS) aims at representing high-dimensional


data in a low-dimensional space so that data can be visualized, analyzed,
and interpreted in the low-dimensional space to uncover useful data pat-
terns. This chapter describes MDS, software packages supporting MDS, and
some applications of MDS with references.

15.1  Algorithm of MDS


We are given n data items in the p-dimensional space, xi = (xi1, …, xip),
i = 1, …, n, along with the dissimilarity δij of each pair of n data items, xi and
xj, and the rank order of these dissimilarities from the least similar pair to
the most similar pair:

δi1j1 ≤ δi2j2 ≤ ⋯ ≤ δiMjM,	(15.1)

where M denotes the total number of different data pairs, and M = n(n − 1)/2
for n data items. MDS (Young and Hamer, 1987) is to find coordinates of
the n data items in a q-dimensional space, zi = (zi1, …, ziq), i = 1, …, n, with
q being much smaller than p, while preserving the dissimilarities of n data
items given in Equation 15.1. MDS is nonmetric if only the rank order of
the dissimilarities in Equation 15.1 is preserved. Metric MDS goes further to
preserve the magnitudes of the dissimilarities. This chapter describes non-
metric MDS.
Table 15.1 gives the steps of the MDS algorithm to find coordinates of the n
data items in the q-dimensional space, while preserving the dissimilarities of
n data points given in Equation 15.1. In Step 1 of the MDS algorithm, the ini-
tial configuration for coordinates of n data points in the q-dimensional space
is generated using random values so that no two data points are the same.
In Step 2 of the MDS algorithm, the following is used to normalize
xi = (xi1, …, xiq), i = 1, …, n:

normalized xij = xij/√(xi1² + ⋯ + xiq²).	(15.2)
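Equation 15.2 can be sketched as a small function:

```python
import math

def normalize(x):
    # Scale the vector x to unit length (Equation 15.2).
    length = math.sqrt(sum(v ** 2 for v in x))
    return [v / length for v in x]

# The initial point (1, 1) of Example 15.1 becomes approximately (0.71, 0.71),
# and (1, 0.5) becomes approximately (0.89, 0.45).
print([round(v, 2) for v in normalize([1.0, 1.0])])  # [0.71, 0.71]
print([round(v, 2) for v in normalize([1.0, 0.5])])  # [0.89, 0.45]
```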


Table 15.1
MDS Algorithm

Step	Description
1	Generate an initial configuration for the coordinates of n data points in the q-dimensional space, (x11, …, x1q, …, xn1, …, xnq), such that no two points are the same
2	Normalize xi = (xi1, …, xiq), i = 1, …, n, such that the vector for each data point has the unit length, using Equation 15.2
3	Compute S as the stress of the configuration using Equation 15.3
4	REPEAT UNTIL a stopping criterion based on S is satisfied
5	    Update the configuration using the gradient descent method and Equations 15.14 through 15.18
6	    Normalize xi = (xi1, …, xiq), i = 1, …, n, in the configuration using Equation 15.2
7	    Compute S of the updated configuration using Equation 15.3

In Step 3 of the MDS algorithm, the following is used to compute the stress of the configuration, which measures how well the configuration preserves the dissimilarities of the n data points given in Equation 15.1 (Kruskal, 1964a,b):

S = √( Σij (dij − d̂ij)² / Σij dij² ),	(15.3)

where dij measures the dissimilarity of xi and xj using their q-dimensional coordinates, and d̂ij gives the desired dissimilarity of xi and xj that preserves the dissimilarity order of the δijs in Equation 15.1 such that

d̂ij < d̂i′j′ if δij < δi′j′.	(15.4)

Note that there are n(n − 1)/2 different pairs of i and j in Equations 15.3 and 15.4.

The Euclidean distance shown in Equation 15.5, the more general Minkowski r-metric distance shown in Equation 15.6, or some other dissimilarity measure can be used to compute dij:

dij = √( Σ(k=1 to q) (xik − xjk)² )	(15.5)

dij = ( Σ(k=1 to q) |xik − xjk|^r )^(1/r).	(15.6)
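A sketch of the two distance measures as one function (the function name is illustrative):

```python
def minkowski(a, b, r=2):
    # Minkowski r-metric distance of Equation 15.6; r = 2 gives the
    # Euclidean distance of Equation 15.5.
    return sum(abs(ai - bi) ** r for ai, bi in zip(a, b)) ** (1 / r)

# Euclidean distance between two of the normalized points in Example 15.1.
d12 = minkowski((0.71, 0.71), (0.0, 1.0))
print(round(d12, 2))  # 0.77

# r = 1 gives the city-block (Manhattan) distance.
d_city = minkowski((0.71, 0.71), (0.0, 1.0), r=1)
```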


The d̂ijs are predicted from the δijs by using a monotone regression algorithm described in Kruskal (1964a,b) to produce

d̂i1j1 ≤ d̂i2j2 ≤ ⋯ ≤ d̂iMjM,	(15.7)

given Equation 15.1:

δi1j1 ≤ δi2j2 ≤ ⋯ ≤ δiMjM.

Table 15.2 describes the steps of the monotone regression algorithm, assuming that there are no ties (equal values) among the δijs. In Step 2 of the monotone regression algorithm, d̂Bm for the block Bm is computed using the average of the dijs in Bm:

d̂Bm = Σ(dij ∈ Bm) dij / Nm,	(15.8)

where Nm is the number of dijs in Bm. If Bm has only one dij, d̂Bm = dij.
Table 15.2
Monotone Regression Algorithm

Step	Description
1	Arrange δimjm, m = 1, …, M, in the order from the smallest to the largest
2	Generate the initial M blocks in the same order as in Step 1, B1, …, BM, such that each block, Bm, has only one dissimilarity value, dimjm, and compute d̂B using Equation 15.8
3	Make the lowest block the active block, and also make it up-active; denote B as the active block, B− as the next lower block of B, and B+ as the next higher block of B
4	WHILE the active block B is not the highest block
5	    IF d̂B− < d̂B < d̂B+ /* B is both down-satisfied and up-satisfied; note that the lowest block is already down-satisfied and the highest block is already up-satisfied */
6	        Make the next higher block of B the active block, and make it up-active
7	    ELSE
8	        IF B is up-active
9	            IF d̂B < d̂B+ /* B is up-satisfied */
10	                Make B down-active
11	            ELSE
12	                Merge B and B+ to form a new larger block that replaces B and B+
13	                Make the new block the active block, and make it down-active
14	        ELSE /* B is down-active */
15	            IF d̂B− < d̂B /* B is down-satisfied */
16	                Make B up-active
17	            ELSE
18	                Merge B− and B to form a new larger block that replaces B− and B
19	                Make the new block the active block, and make it up-active
20	d̂ij = d̂B, for each dij ∈ B and for each block B in the final sequence of the blocks

In Step 1 of the monotone regression algorithm, if there are ties among δijs,
these δijs with the equal value are arranged in the increasing order of their
corresponding dijs in the q-dimensional space (Kruskal, 1964a,b). Another
method of handling ties among δijs is to let these δijs with the equal value
form one single block with their corresponding dijs in this block.
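The block merging in Table 15.2 computes what is also known as isotonic regression, and the same d̂ijs can be obtained with a compact pool-adjacent-violators sketch (this is not the book's exact control flow, only an equivalent formulation):

```python
def monotone_regression(d):
    # d: dissimilarities d_ij listed in increasing order of delta_ij.
    # Returns d-hat values that are monotone nondecreasing, obtained by
    # repeatedly merging adjacent blocks whose averages violate monotonicity.
    blocks = [[x] for x in d]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks) - 1):
            left = sum(blocks[i]) / len(blocks[i])
            right = sum(blocks[i + 1]) / len(blocks[i + 1])
            if left > right:  # violation: merge the two blocks
                blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]
                merged = True
                break
    out = []
    for b in blocks:
        out.extend([sum(b) / len(b)] * len(b))
    return out

# Example 15.1: d23, d13, d12 in the delta order gives d-hats 0.685, 0.685,
# 0.77, which round to the 0.69, 0.69, 0.77 used later in the text.
print(monotone_regression([1.05, 0.32, 0.77]))
```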
After using the monotone regression method to obtain d̂ijs, we use Equation
15.3 to compute the stress of the configuration in Step 3 of the MDS algo-
rithm. The smaller the S value is, the better the configuration preserves the
dissimilarity order in Equation 15.1. Kruskal (1964a,b) considers an S value of 20% to indicate a poor fit of the configuration to the dissimilarity order in Equation 15.1, an S value of 10% a fair fit, an S value of 5% a good fit, an S value of 2.5% an excellent fit, and an S value of 0% the best fit. Step 4 of the MDS algorithm evaluates the goodness-of-fit using the S value of the configuration. If the S value
of the configuration is not acceptable, Step 5 of the MDS algorithm changes
the configuration to improve the goodness-of-fit using the gradient descent
method. Step 6 of the MDS algorithm normalizes the vector of each data
point in the updated configuration. Step 7 of the MDS algorithm computes
the S value of the updated configuration.
In Step 4 of the MDS algorithm, a threshold of goodness-of-fit can be set
and used such that the configuration is considered acceptable if S of the con-
figuration is less than or equal to the threshold of goodness-of-fit. Hence, a
stopping criterion in Step 4 of the MDS algorithm is having S less than or
equal to the threshold of goodness-of-fit. If there is little change in S, that
is, S levels off after iterations of updating the configuration, the procedure
of updating the configuration can be stopped too. Hence, the change of S,
which is smaller than a threshold, is another stopping criterion that can be
used in Step 4 of the MDS algorithm.
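The stress of Equation 15.3 and a threshold-based stopping test can be sketched as follows (the threshold value and names are illustrative):

```python
import math

def stress(d, d_hat):
    # Kruskal's stress: S = sqrt( sum (d_ij - d-hat_ij)^2 / sum d_ij^2 ).
    s_star = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    t_star = sum(a ** 2 for a in d)
    return math.sqrt(s_star / t_star)

# A perfect fit gives S = 0; a flat d-hat against spread-out d gives larger S.
assert stress([1.0, 2.0], [1.0, 2.0]) == 0.0
print(round(stress([1.0, 2.0], [1.5, 1.5]), 3))  # 0.316

def acceptable(s, threshold=0.05):
    # Stopping criterion: accept when S is at or below a "good fit" threshold.
    return s <= threshold
```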
The gradient descent method of updating the configuration in Step 5 of the MDS algorithm is similar to the gradient descent method used for updating connection weights in the back-propagation learning of artificial neural networks in Chapter 5. The objective of updating the configuration, (x11, …, x1q, …, xn1, …, xnq), is to minimize the stress of the configuration in Equation 15.3, which is shown next:

S = √( Σij (dij − d̂ij)² / Σij dij² ) = √(S*/T*),	(15.9)

where

S* = Σij (dij − d̂ij)²	(15.10)

T* = Σij dij².	(15.11)

Using the gradient descent method, we update each xkl, k = 1, …, n, l = 1, …, q, in the configuration as follows (Kruskal, 1964a,b):

xkl(t + 1) = xkl(t) + αΔxkl = xkl(t) + α gkl ( √(Σk,l xkl²) / √(Σk,l gkl²) ),	(15.12)

where

gkl = −∂S/∂xkl,	(15.13)

and α is the learning rate. For a normalized x, Formula 15.12 becomes

xkl(t + 1) = xkl(t) + αΔxkl = xkl(t) + α gkl / √(Σk,l gkl²).	(15.14)
Kruskal (1964a,b) gives the following formula to compute gkl if dij is computed using the Minkowski r-metric distance:

gkl = −∂S/∂xkl = S Σi,j (ρki − ρkj) [ (dij − d̂ij)/S* − dij/T* ] [ |xil − xjl|^(r−1) / dij^(r−1) ] sign(xil − xjl),	(15.15)

where

ρki = 1 if k = i, and ρki = 0 if k ≠ i	(15.16)

sign(xil − xjl) = 1 if xil − xjl > 0, −1 if xil − xjl < 0, and 0 if xil − xjl = 0.	(15.17)


If r = 2 in Formula 15.15, that is, if the Euclidean distance is used to compute dij,

gkl = S Σi,j (ρki − ρkj) [ (dij − d̂ij)/S* − dij/T* ] [ (xil − xjl)/dij ].	(15.18)

Example 15.1
Table 15.3 gives three data records of nine quality variables, which is a
part of Table 8.1. Table 15.4 gives the Euclidean distance for each pair of
the three data points in the nine-dimensional space. This Euclidean dis-
tance for a pair of data points, xi and xj, is taken as δij. Perform the MDS of
this data set with only one iteration of the configuration update for q = 2,
the stopping criterion of S ≤ 5%, and α = 0.2.
This data set has three data points, n = 3, in a nine-dimensional space.
We have δ12 = 2.65, δ13 = 2.65, and δ23 = 2. In Step 1 of the MDS algorithm
described in Table 15.1, we generate an initial configuration of the three
data points in the two-dimensional space:

x1 = (1, 1) x2 = (0, 1) x3 = (1, 0.5).


In Step 2 of the MDS algorithm, we normalize each data point so that it


has the unit length, using Formula 15.2:

 x11 x12   1 1 
x1 =  ,  = 2 , = (0.771, 0.71)
2 
 x11 + x12
2 2
x11 + x12   1 + 1
2 2 2
1 +1 
2

Table 15.3
Data Set for System Fault Detection with Three Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty
Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0

Table 15.4
Euclidean Distance for Each Pair of Data Points

      x1     x2     x3
x1           2.65   2.65
x2                  2
x3

x2 = ( x21/√(x21² + x22²), x22/√(x21² + x22²) ) = ( 0/√(0² + 1²), 1/√(0² + 1²) ) = (0, 1)

x3 = ( x31/√(x31² + x32²), x32/√(x31² + x32²) ) = ( 1/√(1² + 0.5²), 0.5/√(1² + 0.5²) ) = (0.89, 0.45).
The distance between each pair of the three data points in the two-dimensional space is computed using their initial coordinates:

d12 = √( (x11 − x21)² + (x12 − x22)² ) = √( (0.71 − 0)² + (0.71 − 1)² ) = 0.77

d13 = √( (x11 − x31)² + (x12 − x32)² ) = √( (0.71 − 0.89)² + (0.71 − 0.45)² ) = 0.32

d23 = √( (x21 − x31)² + (x22 − x32)² ) = √( (0 − 0.89)² + (1 − 0.45)² ) = 1.05.

Before we compute the stress of the initial configuration using Formula 15.3, we need to use the monotone regression algorithm in Table 15.2 to compute the d̂ijs. In Step 1 of the monotone regression algorithm, we arrange δimjm, m = 1, …, M, in the order from the smallest to the largest, where M = 3:

δ23 < δ12 = δ13.

Since there is a tie between δ12 and δ13, δ12 and δ13 are arranged in the increasing order of their corresponding distances d13 = 0.32 and d12 = 0.77:

δ23 < δ13 < δ12.

In Step 2 of the monotone regression algorithm, we generate the initial M blocks in the same order as in Step 1, B1, …, BM, such that each block, Bm, has only one dissimilarity value, dimjm:

B1 = {d23}  B2 = {d13}  B3 = {d12}.

We compute d̂B using Formula 15.8:

d̂B1 = d23/1 = 1.05

d̂B2 = d13/1 = 0.32

d̂B3 = d12/1 = 0.77.


In Step 3 of the monotone regression algorithm, we make the lowest block B1 the active block:

B = B1  B− = ∅  B+ = B2,

and make B up-active. In Step 4 of the monotone regression algorithm, we check that the active block B1 is not the highest block. In Step 5 of the monotone regression algorithm, we check that d̂B > d̂B+ and thus B is not up-satisfied. We go to Step 8 of the monotone regression algorithm and check that B is up-active. In Step 9 of the monotone regression algorithm, we check that d̂B > d̂B+ and thus B is not up-satisfied. We go to Step 12 and merge B and B+ to form a new larger block B12, which replaces B1 and B2:

B12 = {d23, d13}

d̂B12 = (d23 + d13)/2 = (1.05 + 0.32)/2 = 0.69

B12 = {d23, d13}  B3 = {d12}

d̂B3 = d12/1 = 0.77.

In Step 13 of the monotone regression algorithm, we make the new block B12 the active block and also make it down-active:

B = B12  B− = ∅  B+ = B3.

Going back to Step 4, we check that the active block B12 is not the highest block. In Step 5, we check that B is both up-satisfied, with d̂B12 < d̂B3, and down-satisfied. Therefore, we execute Step 6 to make B3 the active block and make it up-active:

B12 = {d23, d13}  B3 = {d12}

d̂B12 = (d23 + d13)/2 = (1.05 + 0.32)/2 = 0.69

d̂B3 = d12/1 = 0.77

B = B3  B− = B12  B+ = ∅.

Going back to Step 4 again, we check that the active block B is the highest block, get out of the WHILE loop, execute Step 20, the last step of the monotone regression algorithm, and assign the following values of the d̂ijs:

d̂12 = d̂B3 = 0.77

d̂13 = d̂B12 = 0.69

d̂23 = d̂B12 = 0.69.

With those d̂ij values and the dij values

d12 = 0.77  d13 = 0.32  d23 = 1.05,

we now execute Step 3 of the MDS algorithm to compute the stress of the initial configuration using Equations 15.9 through 15.11:

S* = Σij (dij − d̂ij)² = (0.77 − 0.77)² + (0.32 − 0.69)² + (1.05 − 0.69)² = 0.27

T* = Σij dij² = 0.77² + 0.32² + 1.05² = 0.61

S = √(S*/T*) = √(0.27/0.61) = 0.67.

This stress level indicates a poor goodness-of-fit. In Step 4 of the MDS


algorithm, we check that S does not satisfy the stopping criterion of the
REPEAT loop. In Step 5 of the MDS algorithm, we update the configura-
tion using Equations 15.14, 15.16 and 15.18 with k = 1, 2, 3 and l = 1, 2:
g11 = S Σi,j (ρ1i − ρ1j) [ (dij − d̂ij)/S* − dij/T* ] [ (xi1 − xj1)/dij ]
= (0.67) { (ρ11 − ρ12)[(d12 − d̂12)/S* − d12/T*][(x11 − x21)/d12]
+ (ρ11 − ρ13)[(d13 − d̂13)/S* − d13/T*][(x11 − x31)/d13]
+ (ρ12 − ρ13)[(d23 − d̂23)/S* − d23/T*][(x21 − x31)/d23] }
= (0.67) { (1 − 0)[(0.77 − 0.77)/0.27 − 0.77/0.61][(0.71 − 0)/0.77]
+ (1 − 0)[(0.32 − 0.69)/0.27 − 0.32/0.61][(0.71 − 0.89)/0.32]
+ (0 − 0)[(1.05 − 0.69)/0.27 − 1.05/0.61][(0 − 0.89)/1.05] }
= −0.13

g12 = S Σi,j (ρ1i − ρ1j) [ (dij − d̂ij)/S* − dij/T* ] [ (xi2 − xj2)/dij ]
= (0.67) { (1 − 0)[(0.77 − 0.77)/0.27 − 0.77/0.61][(0.71 − 1)/0.77]
+ (1 − 0)[(0.32 − 0.69)/0.27 − 0.32/0.61][(0.71 − 0.45)/0.32]
+ (0 − 0)[(1.05 − 0.69)/0.27 − 1.05/0.61][(1 − 0.45)/1.05] }
= −0.71
g21 = S Σi,j (ρ2i − ρ2j) [ (dij − d̂ij)/S* − dij/T* ] [ (xi1 − xj1)/dij ]
= (0.67) { (ρ21 − ρ22)[(d12 − d̂12)/S* − d12/T*][(x11 − x21)/d12]
+ (ρ21 − ρ23)[(d13 − d̂13)/S* − d13/T*][(x11 − x31)/d13]
+ (ρ22 − ρ23)[(d23 − d̂23)/S* − d23/T*][(x21 − x31)/d23] }
= (0.67) { (0 − 1)[(0.77 − 0.77)/0.27 − 0.77/0.61][(0.71 − 0)/0.77]
+ (0 − 0)[(0.32 − 0.69)/0.27 − 0.32/0.61][(0.71 − 0.89)/0.32]
+ (1 − 0)[(1.05 − 0.69)/0.27 − 1.05/0.61][(0 − 0.89)/1.05] }
= 1.07

g22 = S Σi,j (ρ2i − ρ2j) [ (dij − d̂ij)/S* − dij/T* ] [ (xi2 − xj2)/dij ]
= (0.67) { (0 − 1)[(0.77 − 0.77)/0.27 − 0.77/0.61][(0.71 − 1)/0.77]
+ (0 − 0)[(0.32 − 0.69)/0.27 − 0.32/0.61][(0.71 − 0.45)/0.32]
+ (1 − 0)[(1.05 − 0.69)/0.27 − 1.05/0.61][(1 − 0.45)/1.05] }
= −0.45
g31 = S Σi,j (ρ3i − ρ3j) [ (dij − d̂ij)/S* − dij/T* ] [ (xi1 − xj1)/dij ]
= (0.67) { (ρ31 − ρ32)[(d12 − d̂12)/S* − d12/T*][(x11 − x21)/d12]
+ (ρ31 − ρ33)[(d13 − d̂13)/S* − d13/T*][(x11 − x31)/d13]
+ (ρ32 − ρ33)[(d23 − d̂23)/S* − d23/T*][(x21 − x31)/d23] }
= (0.67) { (0 − 0)[(0.77 − 0.77)/0.27 − 0.77/0.61][(0.71 − 0)/0.77]
+ (0 − 1)[(0.32 − 0.69)/0.27 − 0.32/0.61][(0.71 − 0.89)/0.32]
+ (0 − 1)[(1.05 − 0.69)/0.27 − 1.05/0.61][(0 − 0.89)/1.05] }
= 0.90

g32 = S Σi,j (ρ3i − ρ3j) [ (dij − d̂ij)/S* − dij/T* ] [ (xi2 − xj2)/dij ]
= (0.67) { (0 − 0)[(0.77 − 0.77)/0.27 − 0.77/0.61][(0.71 − 1)/0.77]
+ (0 − 1)[(0.32 − 0.69)/0.27 − 0.32/0.61][(0.71 − 0.45)/0.32]
+ (0 − 1)[(1.05 − 0.69)/0.27 − 1.05/0.61][(1 − 0.45)/1.05] }
= 0.77
Multidimensional Scaling 245

g kl
xkl (t + 1) = xkl (t) + α∆xkl = xkl (t) + α
∑ k ,l
g kl2

g11
x11 (1) = x11 (0) + 0.2
2
g11 2
+ g12 2
+ g 21 2
+ g 22 2
+ g 31 2
+ g 32
3
− 0.13
= 0.71 + 0.2 = 0.70
( − 0.13)2 + ( − 0.71)2 + 1.07 2 + ( − 0.45)2 + 0.90 2 + 0.77 2
3

g12
x12 (1) = x12 (0) + 0.2
2
11
2
12
2
g + g + g + g 222
+ g 31
21
2 2
+ g 32
3
− 0.71
= 0.71 + 0.2 = 0.63
( − 0.13) + ( − 0.71) + 1.07 2 + ( − 0.45)2 + 0.90 2 + 0.77 2
2 2

g 21
x21 (1) = x21 (0) + 0.2
2
g11 2
+ g12 2
+ g 21 2
+ g 22 2
+ g 31 2
+ g 32
3
1.07
= 0 + 0.2 = 0.12
( − 0.13)2 + ( − 0.71)2 + 1.07 2 + ( − 0.45)2 + 0.90 2 + 0.77 2
3

g 22
x22 (1) = x22 (0) + 0.2
2
11
2
12
2
g + g + g + g 222
+ g 31
21
2 2
+ g 32
3
− 0.45
= 1 + 0.2 = 0.95
( − 0.13) + ( − 0.71) + 1.07 2 + ( − 0.45)2 + 0.90 2 + 0.77 2
2 2

g 21
x31 (1) = x31 (0) + 0.2
2
g11 2
+ g12 2
+ g 21 2
+ g 22 2
+ g 31 2
+ g 32
3
0.90
= 0.89 + 0.2 = 0.99
( − 0.13) + ( − 0.71) + 1.07 2 + ( − 0.45)2 + 0.90 2 + 0.77 2
2 2

3
246 Data Mining

x32(1) = x32(0) + 0.2 g32/√(g11² + g12² + g21² + g22² + g31² + g32²)
= 0.45 + 0.2 × 0.77/√((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²) = 0.54.

Hence, after the update of the initial configuration in Step 5 of the MDS
algorithm, we obtain:

x1 = (0.70, 0.63) x2 = (0.12, 0.95) x3 = (0.99, 0.54).


In Step 6 of the MDS algorithm, we normalize each xi:

 0.70 0.63 
x1 =  , = (0.74, 0.67 )
2 
 0.70 + 0.63
2 2
0.70 + 0.63 
2

 0.12 0.95 
x2 =  , = (0.13 , 0.99)
2 
 0.12 + 0.95
2 2
0.12 + 0.95 
2

 0.99 0.54 
x3 =  , = (0.88, 0.48).
2 
 0.992 + 0.54 2 0.99 + 0.54 
2

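As an illustration (not part of the book's algorithm statement), the update and normalization steps of this example can be sketched in Python; the gradient values and coordinates are those computed above, and small differences from the text's rounded figures (e.g., 0.12 versus 0.13 in the first coordinate of x2) come from carrying full precision:

```python
import math

# Gradient values g_kl computed earlier in this example, and the current
# configuration x_kl(0); rows are data points, columns are dimensions.
g = [[-0.13, -0.71], [1.07, -0.45], [0.90, 0.77]]
x = [[0.71, 0.71], [0.00, 1.00], [0.89, 0.45]]
alpha = 0.2

# Step 5: move each coordinate along the normalized gradient.
norm = math.sqrt(sum(v ** 2 for row in g for v in row))
x_new = [[x[k][l] + alpha * g[k][l] / norm for l in range(2)]
         for k in range(3)]

# Step 6: normalize each data point to unit length.
x_unit = []
for point in x_new:
    length = math.sqrt(sum(v ** 2 for v in point))
    x_unit.append([v / length for v in point])

print([[round(v, 2) for v in point] for point in x_unit])
```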
15.2  Number of Dimensions


The MDS algorithm in Section 15.1 starts with the given q—the number of
dimensions. Before obtaining the final result of MDS for a data set, it is rec-
ommended to obtain the MDS result for each of several q values, plot the
stress of each MDS result versus q, and select the q at the elbow of this plot
along with the corresponding MDS result. Figure 15.1 shows a plot of
stress versus q, and the q value at the elbow of this plot is 2. The q value at the
elbow of the stress-q plot is chosen because the stress improves much before
the elbow point but levels off after the elbow point. For example, in the study
by Ye (1998), the MDS results for q = 1, 2, 3, 4, 5, and 6 are obtained. The stress
values of these MDS results show the elbow point at q = 3.
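A minimal sketch of this elbow selection in Python, using hypothetical stress values (not the values from the Ye (1998) study); the improvement threshold is an illustrative choice:

```python
# Hypothetical stress values from running MDS at q = 1, ..., 6
# (illustrative numbers only).
stress = {1: 0.18, 2: 0.06, 3: 0.05, 4: 0.045, 5: 0.042, 6: 0.04}

# A simple elbow heuristic: choose the smallest q after which the
# improvement in stress falls below a threshold.
threshold = 0.02
qs = sorted(stress)
elbow = qs[-1]
for prev, q in zip(qs, qs[1:]):
    if stress[prev] - stress[q] < threshold:
        elbow = prev
        break
print(elbow)
```

With these numbers the stress drops sharply from q = 1 to q = 2 and levels off afterward, so the heuristic picks q = 2.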

[Chart: stress (vertical axis, 0 to 0.2) plotted against the number of dimensions (horizontal axis, 1 to 4), with the elbow at q = 2.]

Figure 15.1
An example of plotting the stress of a MDS result versus the number of dimensions.

15.3  INDSCALE for Weighted MDS


In the study by Ye (1998), a number of human subjects (including expert pro-
grammers and novice programmers) are given a list of C programming con-
cepts and are asked to rate the dissimilarity for each pair of these concepts.
Hence, a dissimilarity matrix of the C programming concepts was obtained
from each subject. Considering each programming concept as a data point,
individual difference scaling (INDSCALE) is used in the study to take the dis-
similarity matrices of data points from the subjects as the inputs and produce
the outputs including both the configuration of each data point’s coordinates in
a q-dimensional space for the entire group of the subjects and a weight vector
for each subject. The weight vector for a subject contains a weight value for the
subject in each dimension. Applying the weight vector of a subject to the group
configuration of concept coordinates gives the configuration of concept coordi-
nates by the subject—the organization of the C programming concepts by each
subject. Since the different weight vectors of the individual subjects reflect their
differences in knowledge organization, the study applies the angular analysis of
variance (ANAVA) to the weight vectors of the individual subjects to analyze the
angular differences of the weight vectors and evaluate the significance of knowl-
edge organization differences between two skill groups of experts and novices.
In general, INDSCALE or weighted MDS takes object dissimilarity matri-
ces of n objects from m subjects and produces the group configuration of
object coordinates:

xi = (xi1, …, xiq), i = 1, …, n,


and weight vectors of individual subjects:

wj = (wj1, …, wjq), j = 1, …, m.

The weight vector of a subject reflects the relative salience of each dimension
in the configuration space to the subject.

15.4  Software and Applications


MDS is supported in many statistical software packages, including SAS
MDS and INDSCALE procedures (www.sas.com). An application of MDS
and INDSCALE to determining expert-novice differences in knowledge rep-
resentation is described in Section 15.3 with details in Ye (1998).

Exercises
15.1 Continue Example 15.1 to perform the next iteration of the configura-
tion update.
15.2 Consider the data set consisting of the three data points in instances
#4, 5, and 6 in Table 8.1. Use the Euclidean distance for each pair of the
three data points in the nine-dimensional space, xi and xj, as δij. Perform
the MDS of this data set with only one iteration of the configuration
update for q = 3, the stopping criterion of S ≤ 5%, and α = 0.2.
15.3 Consider the data set in Table 8.1 consisting of nine data points in
instances 1–9. Use the Euclidean distance for each pair of the nine data
points in the nine-dimensional space, xi and xj, as δij. Perform the MDS
of this data set with only one iteration of the configuration update for
q = 1, the stopping criterion of S ≤ 5%, and α = 0.2.
Part V

Algorithms for
Mining Outlier and
Anomaly Patterns
16
Univariate Control Charts

Outliers and anomalies are data points that deviate largely from the norm
that the majority of data points follow. Outliers and anomalies may be
caused by a fault of a manufacturing machine and thus an out-of-control
manufacturing process, by an attack whose behavior differs largely from nor-
mal use activities on computer and network systems, and so on. Detecting
outliers and anomalies is important in many fields. For example, detecting
an out of control manufacturing process quickly is important for reducing
manufacturing costs by not producing defective parts. An early detection of
a cyber attack is crucial to protect computer and network systems from being
compromised.
Control chart techniques define and detect outliers and anomalies on a
statistical basis. This chapter describes univariate control charts that monitor
one variable for anomaly detection. Chapter 17 describes multivariate control
charts that monitor multiple variables simultaneously for anomaly detection.
The univariate control charts described in this chapter include Shewhart con-
trol charts, cumulative sum (CUSUM) control charts, exponentially weighted
moving average (EWMA) control charts, and cuscore control charts. A list of
software packages that support univariate control charts is provided. Some
applications of univariate control charts are given with references.

16.1  Shewhart Control Charts


Shewhart control charts include variable control charts, each of which moni-
tors a variable with numeric values (e.g., a diameter of a hole manufactured
by a cutting machine), and attribute control charts, each of which monitors
an attribute summarizing categorical values (e.g., the fraction of defective
or nondefective parts). When samples of data points can be observed, vari-
able control charts, for example, x̄ control charts for detecting anomalies
concerning the mean of a process, and R and s control charts for detecting
anomalies concerning the variance of a process, are applicable. When only
individual data points can be observed, variable control charts, for example,
individual control charts, are applicable. For a data set with individual data
points rather than samples of data points, both CUSUM control charts in


Table 16.1
Samples of Data Observations

Sample   Data Observations in Each Sample   Sample Mean   Sample Standard Deviation
1        x11, …, x1j, …, x1n                x̄1            s1
…        …                                  …             …
i        xi1, …, xij, …, xin                x̄i            si
…        …                                  …             …
m        xm1, …, xmj, …, xmn                x̄m            sm

Section 16.2 and EWMA control charts in Section 16.3 have advantages over
individual control charts.
We describe the x̄ control chart to illustrate how Shewhart control charts
work. Consider a variable x that takes m samples of n data observations from
a process as shown in Table 16.1. The x̄ control chart assumes that x is nor-
mally distributed with the mean μ and the standard deviation σ when the
process is in control.
x̄i and si, i = 1, …, m, in Table 16.1 are computed as follows:


x̄i = (Σj=1,…,n xij)/n (16.1)

si = √[Σj=1,…,n (xij − x̄i)²/(n − 1)]. (16.2)

The mean μ and the standard deviation σ are estimated using x̿ and s̄:

x̿ = (Σi=1,…,m x̄i)/m (16.3)

s̄ = (Σi=1,…,m si)/m. (16.4)

If the sample size n is large, x̄i follows a normal distribution according to the
central limit theorem. The probability that x̄i falls within three standard
deviations of the mean is approximately 99.7%, based on the probability
density function of a normal distribution:

P(x̿ − 3s̄ ≤ x̄i ≤ x̿ + 3s̄) = 99.7%. (16.5)
Univariate Control Charts 253

Since the probability that x̄i falls beyond the three standard deviations from
the mean is only 0.3%, such an x̄i is considered an outlier or anomaly that may
be caused by the process being out of control. Hence, the estimated mean
and the 3-sigma control limits are typically used as the centerline and the
control limits (UCL for upper control limit and LCL for lower control limit),
respectively, for the in-control process mean in the x̄ control chart:

Centerline = x̿ (16.6)

UCL = x̿ + 3s̄ (16.7)

LCL = x̿ − 3s̄. (16.8)

The x̄ control chart monitors x̄i from sample i of data observations. If x̄i falls
within [LCL, UCL], the process is considered in control; otherwise, an anom-
aly is detected and the process is considered out of control.
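A minimal sketch of an x̄ control chart in Python, following Formulas 16.1 through 16.8; the calibration samples and the monitored samples below are toy numbers, not from the book's data sets:

```python
import math

# Toy data: m = 3 in-control samples of n = 5 observations each, used to
# estimate the centerline and 3-sigma control limits (illustrative only).
calibration = [
    [4.9, 5.1, 5.0, 5.2, 4.8],
    [5.0, 5.1, 4.9, 5.0, 5.0],
    [5.2, 5.0, 5.1, 4.9, 5.0],
]

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):                       # Formula 16.2
    mu = mean(v)
    return math.sqrt(sum((x - mu) ** 2 for x in v) / (len(v) - 1))

xbars = [mean(s) for s in calibration]  # Formula 16.1
xbarbar = mean(xbars)                   # Formula 16.3, centerline
sbar = mean([sample_sd(s) for s in calibration])  # Formula 16.4

ucl = xbarbar + 3 * sbar                # Formula 16.7
lcl = xbarbar - 3 * sbar                # Formula 16.8

def in_control(sample):
    # A new sample is in control if its mean falls within [LCL, UCL].
    return lcl <= mean(sample) <= ucl

print(in_control([5.0, 5.1, 5.0, 4.9, 5.1]))   # near the centerline
print(in_control([6.8, 7.0, 6.9, 7.1, 6.7]))   # shifted mean, flagged
```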
Using the 3-sigma control limits in the x̄ control chart, there is still a 0.3%
probability that the process is in control but a data observation falls outside
the control limits and an out-of-control signal is generated by the x̄ control
chart. If the process is in control but the control chart gives an out-of-control
signal, the signal is a false alarm. The rate of false alarms is the ratio of the
number of false alarms to the total number of data samples being monitored.
If the process is out of control and the control chart generates an out-of-
control signal, we have a hit. The rate of hits is the ratio of the number of hits
to the total numbers of data samples. Using the 3-sigma control limits, we
should have the hit rate of 99.7% and the false alarm rate of 0.3%.
If the sample size n is not large, the estimation of the standard deviation by
s̄ may be off to a certain extent, and the coefficient for s̄ in Formulas 16.7 and
16.8 may need to be set to a different value than 3 in order to set the appro-
priate control limits so that a vast majority of the data population falls in the
control limits statistically. Montgomery (2001) gives appropriate coefficients
to set the control limits for various values of the sample size n.
The x̄ control chart shows how statistical control charts such as Shewhart
control charts establish the centerline and control limits based on the prob-
ability distribution of the variable of interest and the estimation of distribu-
tion parameters from data samples. In general, the centerline of a control
chart is set to the expected value of the variable, and the control limits are
set so that a vast majority of the data population fall in the control limits
statistically. Hence, the norm of data and anomalies are defined statistically,
depending on the probability distribution of data and estimation of distribu-
tion parameters.
Shewhart control charts are sensitive to the assumption that the variable
of interest follows a normal distribution. A deviation from this normal-
ity assumption may cause a Shewhart control chart such as the x̄ control
chart to perform poorly, for example, giving an out-of-control signal when

the process is truly in control or giving no signal when the process is truly
out of control. Because Shewhart control charts monitor and evaluate only
one data sample or one individual data observation at a time, Shewhart con-
trol charts are not effective at detecting small shifts, e.g., small shifts of a
process mean monitored by the x̄ control chart. CUSUM control charts in
Section 16.2 and EWMA control charts in Section 16.3 are less sensitive to
the normality assumption of data and are effective at detecting small shifts.
CUSUM control charts and EWMA control charts can be used to monitor
both data samples and individual data observations. Hence, CUSUM control
charts and EWMA control charts are more practical.

16.2  CUSUM Control Charts


Given a time series of data observations for a variable x, x1, …, xn, the cumula-
tive sum up to the ith observation is (Montgomery, 2001; Ye, 2003, Chapter 3)

CSi = Σj=1,…,i (xj − μ0), (16.9)

where μ0 is the target value of the process mean. If the process is in control,
data observations are expected to randomly fluctuate around the process
mean, and thus CSi stays around zero. However, if the process is out of con-
trol with a shift of x values from the process mean, CSi keeps increasing for a
positive shift (i.e., xi − μ0 > 0) or decreasing for a negative shift. Even if there
is a small shift, the effect of the small shift keeps accumulating in CSi and
becomes large enough to be detected. Hence, a CUSUM control chart is more effec-
tive than a Shewhart control chart at detecting small shifts since a Shewhart
control chart examines only one data sample or one data observation.
Formula 16.9 is used to monitor individual data observations. If samples of
data points can be observed, xi in Formula 16.9 can be replaced by x̄i to moni-
tor the sample average.
If we are interested in detecting only a positive shift, a one-side CUSUM
chart can be constructed to monitor the CSi+ statistic:

CSi+ = max 0, xi − (µ 0 + K ) + CSi+−1  , (16.10)

where K is called the reference value specifying how much increase from the
process mean μ0 we are interested in detecting. Since we expect xi ≥ μ0 + K as a
result of the positive shift K from the process mean μ0, we expect xi − (μ0 + K) to
be positive and expect CSi+ to keep increasing with i. In case that some xi makes
xi − (μ0 + K) + CSi−1+ have a negative value, CSi+ takes the value of 0 according to

Formula 16.10 since we are interested in only the positive shift. One method
of specifying K is to use the standard deviation σ of the process. For example,
K = 0.5σ indicates that we are interested in detecting a shift of 0.5σ above the
target mean. If the process is in control, we expect CSi+ to stay around zero.
Hence, CSi+ is initially set to zero:

CS0+ = 0. (16.11)

When CSi+ exceeds the decision threshold H, the process is considered out
of control. Typically H = 5σ is used as the decision threshold so that a low
rate of false alarms can be achieved (Montgomery, 2001). Note that H = 5σ is
greater than the 3-sigma control limits used for the x̄ control chart in Section
16.1 since CSi+ accumulates the effects of multiple data observations whereas
the x̄ control chart examines only one data observation or data sample.
If we are interested in detecting only a negative shift, −K, from the pro-
cess mean, a one-side CUSUM chart can be constructed to monitor the CSi−
statistic:

CSi− = max 0, (µ 0 − K ) − xi + CSi−−1  . (16.12)

Since we expect xi ≤ μ0 − K as a result of the negative shift, −K, from the process
mean μ0, we expect (μ0 − K) − xi to be positive and expect CSi− to keep increas-
ing with i. H = 5σ is typically used as the decision threshold to achieve a low
rate of false alarms (Montgomery, 2001). CSi− is initially set to zero since we
expect CSi− to stay around zero if the process is in control:

CS0− = 0. (16.13)

A two-side CUSUM control chart can be used to monitor both CSi+ using the
one-side upper CUSUM and CSi− using the one-side lower CUSUM for the
same xi. If either CSi+ or CSi− exceeds the decision threshold H, the process is
considered out of control.

Example 16.1
Consider the launch temperature data in Table 1.5 and presented in Table
16.2 as a sequence of data observations over time. Given the following
information:

µ 0 = 69

σ=7

K = 0.5σ = (0.5)(7 ) = 3.5

H = 5σ = (5)(7 ) = 35,

use a two-side CUSUM control chart to monitor the launch temperature.



Table 16.2
Data Observations of the Launch
Temperature from the Data Set of O-Rings
with Stress along with Statistics for the
Two-Side CUSUM Control Chart
Data Observation i   Launch Temperature xi   CSi+   CSi−

1 66 0 0
2 70 0 0
3 69 0 0
4 68 0 0
5 67 0 0
6 72 0 0
7 73 0.5 0
8 70 0 0
9 57 0 8.5
10 63 0 11
11 70 0 6.5
12 78 5.5 0
13 67 0 0
14 53 0 12.5
15 67 0 11
16 75 2.5 1.5
17 70 0 0
18 81 8.5 0
19 76 12 0
20 79 18.5 0
21 75 21 0
22 76 24.5 0
23 58 10 7.5

With CSi+ and CSi− initially set to zero, that is, CS0+ = 0 and CS0− = 0, we
compute CS1+ and CS1−:

CS1+ = max 0, x1 − (µ 0 + K ) + CS0+  = max [0, 66 − (69 + 3.5) + 0 ] = max[0, − 6.5] = 0

CS1− = max 0, (µ 0 − K ) − x1 + CS0−  = max [0, (69 − 3.5) − 66 + 0 ] = max[0, − 0.5] = 0,

and then CS2+ and CS2− :

CS2+ = max 0, x2 − (µ 0 + K ) + CS1+  = max 0, 70 − (69 + 3.5) + 0  = max[0, − 2.5] = 0

CS2− = max 0, (µ 0 − K ) − x1 + CS0−  = max 0, (69 − 3.5) − 70 + 0  = max[0, − 4.5] = 0.

[Chart: CS+ and CS− plotted against observation i = 1, …, 23, with a horizontal line at the decision threshold H = 35.]

Figure 16.1
Two-side CUSUM control chart for the launch temperature in the data set of O-ring with stress.
The values of CSi+ and CSi− for i = 3, …, 23 are shown in Table 16.2. Figure
16.1 shows the two-side CUSUM control chart. The CSi+ and CSi− values
for all the 23 observations do not exceed the decision threshold H = 35.
Hence, no anomalies of the launch temperature are detected. If the deci-
sion threshold is set to H = 3σ = (3)(7) = 21, the observation i = 22 will be
signaled as an anomaly because CS22+ = 24.5 > H.
After an out-of-control signal is generated, the CUSUM control chart
will reset CSi+ and CSi− to their initial value of zero and use the initial
value of zero to compute CSi+ and CSi− for the next observation.
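The two-side CUSUM computation of this example can be sketched in Python (data and parameters as given above; the reset after an out-of-control signal follows the note in the preceding paragraph):

```python
# Two-side CUSUM for the launch temperature data of Example 16.1,
# reproducing the statistics in Table 16.2 (mu0 = 69, K = 3.5, H = 35).
temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
mu0, K, H = 69, 3.5, 35

cs_plus, cs_minus, signals = [], [], []
up, lo = 0.0, 0.0
for i, x in enumerate(temps, start=1):
    up = max(0.0, x - (mu0 + K) + up)      # Formula 16.10
    lo = max(0.0, (mu0 - K) - x + lo)      # Formula 16.12
    cs_plus.append(up)
    cs_minus.append(lo)
    if up > H or lo > H:
        signals.append(i)
        up, lo = 0.0, 0.0                  # reset after a signal

print(max(cs_plus), max(cs_minus), signals)
```

With H = 35, no observation is signaled, matching the example; lowering H to 21 would flag observation 22.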

16.3  EWMA Control Charts


An EWMA control chart for a variable x with independent data observations,
xi, monitors the following statistic (Montgomery, 2001; Ye, 2003, Chapter 4):
zi = λxi + (1 − λ)zi−1, (16.14)

where λ is a weight in (0, 1], and

z0 = μ. (16.15)

The control limits are (Montgomery, 2001; Ye, 2003, Chapter 3):

UCL = μ + Lσ√(λ/(2 − λ)) (16.16)

LCL = μ − Lσ√(λ/(2 − λ)). (16.17)

The weight λ determines the relative impacts of the current data observation
xi and previous data observations as captured through zi−1 on zi. If we express
zi using xi, xi−1, …, x1,

zi = λxi + (1 − λ)zi−1

= λxi + (1 − λ)[λxi−1 + (1 − λ)zi−2]

= λxi + (1 − λ)λxi−1 + (1 − λ)²zi−2

= λxi + (1 − λ)λxi−1 + (1 − λ)²[λxi−2 + (1 − λ)zi−3]

= λxi + (1 − λ)λxi−1 + (1 − λ)²λxi−2 + (1 − λ)³zi−3

…

= λxi + (1 − λ)λxi−1 + (1 − λ)²λxi−2 + ⋯ + (1 − λ)i−2λx2 + (1 − λ)i−1λx1, (16.18)

we can see the weights on xi, xi−1, …, x1 decreasing exponentially. For example,
for λ = 0.3, we have the weight of 0.3 for xi, (0.7)(0.3) = 0.21 for xi−1, (0.7)²(0.3) =
0.147 for xi−2, (0.7)³(0.3) = 0.1029 for xi−3, …, as illustrated in Figure 16.2. This
gives the term of EWMA. The larger the λ value is, the less impact the past
observations and the more impact the current observation have on the cur-
rent EWMA statistic, zi.
In Formulas 16.14 through 16.17, setting λ and L in the following ranges
usually works well (Montgomery, 2001; Ye, 2003, Chapter 4):

0.05 ≤ λ ≤ 0.25

2.6 ≤ L ≤ 3.
A data sample can be used to compute the sample average and the sample
standard deviation as the estimates of μ and σ, respectively.

[Chart: bar plot of the weight (vertical axis, 0 to 0.35) on data observations xi, xi−1, xi−2, … (horizontal axis).]

Figure 16.2
Exponentially decreasing weights on data observations.

Example 16.2
Consider the launch temperature data in Table 1.5 and presented in Table
16.3 as a sequence of data observations over time. Given the following:

µ = 69
σ=7
λ = 0.2
L = 3,

use an EWMA control chart to monitor the launch temperature.


We first compute the control limits:

UCL = μ + Lσ√(λ/(2 − λ)) = 69 + (3)(7)√(0.2/(2 − 0.2)) = 76

Table 16.3
Data Observations of the Launch
Temperature from the Data Set of O-Rings
with Stress along with the EWMA Statistic
for the EWMA Control Chart
Data Observation i   Launch Temperature xi   zi
1 66 68.4
2 70 68.72
3 69 68.78
4 68 68.62
5 67 68.30
6 72 69.04
7 73 69.83
8 70 69.86
9 57 67.29
10 63 66.43
11 70 67.15
12 78 69.32
13 67 68.85
14 53 65.68
15 67 65.95
16 75 67.76
17 70 68.21
18 81 70.76
19 76 71.81
20 79 73.25
21 75 73.60
22 76 74.08
23 58 70.86

[Chart: the EWMA statistic zi plotted against observation i = 1, …, 23, with horizontal lines at UCL and LCL.]

Figure 16.3
EWMA control chart to monitor the launch temperature from the data set of O-rings with
stress.

LCL = μ − Lσ√(λ/(2 − λ)) = 69 − (3)(7)√(0.2/(2 − 0.2)) = 62.

Using z0 = μ = 69, we compute the EWMA statistic:

z1 = λx1 + (1 − λ )z0 = (0.2)(66) + (1 − 0.2)(69) = 68.4

z2 = λx2 + (1 − λ )z1 = (0.2)(70) + (1 − 0.2)(68.4) = 68.72.

The values of the EWMA statistic for other data observations are given
in Table 16.3. The EWMA statistic values of all the 23 data observations
stay within the control limits [LCL, UCL] = [62, 76], and no anomalies
are detected. Figure 16.3 plots the EWMA control chart with the EWMA
statistic and the control limits.
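A sketch of Example 16.2 in Python, computed from the stated givens; note that with λ = 0.2 the control limits of Formulas 16.16 and 16.17 evaluate to [62, 76]:

```python
import math

# EWMA statistic and control limits for Example 16.2, computed from the
# givens mu = 69, sigma = 7, lambda = 0.2, L = 3.
temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
mu, sigma, lam, L = 69, 7, 0.2, 3

ucl = mu + L * sigma * math.sqrt(lam / (2 - lam))   # Formula 16.16
lcl = mu - L * sigma * math.sqrt(lam / (2 - lam))   # Formula 16.17

z = mu                                              # z0 = mu, Formula 16.15
ewma, anomalies = [], []
for i, x in enumerate(temps, start=1):
    z = lam * x + (1 - lam) * z                     # Formula 16.14
    ewma.append(z)
    if not (lcl <= z <= ucl):
        anomalies.append(i)

print(round(lcl, 2), round(ucl, 2), anomalies)
```

All 23 EWMA values stay within the limits, so no anomalies are signaled, as in the example.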

If data observations are autocorrelated (see Chapter 18 for the descrip-


tion of autocorrelation), we can first build a 1-step ahead prediction model
of autocorrelated data, compare a data observation with its 1-step predicted
value to obtain the error or residual, and use an EWMA control chart to
monitor the residual data (Montgomery and Mastrangelo, 1991). The 1-step
ahead predicted value for xi is computed as follows:

zi−1 = λxi−1 + (1 − λ)zi−2, (16.19)

where 0 < λ ≤ 1. That is, zi−1 is the EWMA of xi−1, …, x1 and is used as the pre-
diction for xi. The prediction error or residual is then computed:

ei = xi − zi−1. (16.20)

In Formula 16.19, λ can be set to minimize the sum of squared prediction
errors on the training data set:

λ = arg minλ Σi ei². (16.21)

If the 1-step ahead prediction model represents the autocorrelated data well,
eis should be independent of each other and normally distributed with the
mean of zero and the standard deviation of σe. An EWMA control chart for
monitoring ei has the centerline at zero and the following control limits:

UCLei = Lσ̂e,i−1 (16.22)

LCLei = −Lσ̂e,i−1 (16.23)

σ̂²e,i−1 = αe²i−1 + (1 − α)σ̂²e,i−2, (16.24)

where L is set to a value such that 2.6 ≤ L ≤ 3, 0 < α ≤ 1, and σ̂²e,i−1 gives
the estimate of σe² for xi using the exponentially weighted moving average
of the prediction errors. Using Equation 16.20, which gives xi = ei + zi−1, the
control limits for monitoring xi directly instead of ei are:

UCLxi = zi−1 + Lσ̂e,i−1 (16.25)

LCLxi = zi−1 − Lσ̂e,i−1. (16.26)

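A minimal sketch of this residual-monitoring scheme in Python; the simulated autocorrelated series and the choices of λ, α, and L below are illustrative, not from the book:

```python
import random

# EWMA-based monitoring of autocorrelated data (Formulas 16.19 through
# 16.24): predict each observation by the EWMA of past observations, then
# monitor the prediction residuals against +/- L * sigma_e limits.
random.seed(0)
series, x = [], 10.0
for _ in range(200):
    x = 0.8 * x + 2.0 + random.gauss(0, 1)   # autocorrelated, in-control process
    series.append(x)

lam, alpha, L = 0.3, 0.1, 3.0
z = series[0]          # EWMA predictor, started at the first observation
var_e = 1.0            # running estimate of the residual variance
alarms = 0
for x in series[1:]:
    e = x - z                                      # residual, Formula 16.20
    if abs(e) > L * var_e ** 0.5:                  # Formulas 16.22 and 16.23
        alarms += 1
    var_e = alpha * e * e + (1 - alpha) * var_e    # Formula 16.24
    z = lam * x + (1 - lam) * z                    # 1-step-ahead EWMA, 16.19

print(alarms)
```

Since the simulated process stays in control, only a small number of alarms (false alarms) should occur.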
Like CUSUM control charts, EWMA control charts are more robust to the
normality assumption of data than Shewhart control charts (Montgomery,
2001). Unlike Shewhart control charts, CUSUM control charts and EWMA
control charts are effective at detecting anomalies of not only large shifts but
also small shifts since CUSUM control charts and EWMA control charts take
into account the effects of multiple data observations.

16.4  Cuscore Control Charts


The control charts described in Sections 16.1 through 16.3 detect the out-of-
control shifts from the mean or the standard deviation. Cumulative score
(cuscore) control charts (Luceno, 1999) are designed to detect the change
from any specific form of an in-control data model to any specific form of
an out-of-control data model. For example, a cuscore control chart can be

constructed to detect a change of the slope in a linear model of in-control


data as follows:
In-control data model:

yt = θ0t + εt (16.27)

Out-of-control data model:

yt = θt + εt, θ ≠ θ0, (16.28)

where εt is a random variable with a normal distribution, the mean μ = 0,
and the standard deviation σ. For another example, we can have a cuscore
control chart to detect the presence of a sine wave in an in-control process
with random variations from the mean of T:

In-control data model:

yt = T + θ0 sin(2πt/p) + εt, θ0 = 0 (16.29)

Out-of-control data model:

yt = T + θ sin(2πt/p) + εt. (16.30)

To construct the cuscore statistic, we consider yt as a function of xt and the


parameter θ that differentiates an out-of-control process from an in-control
process:

yt = f(xt, θ) (16.31)

and when the process is in control, we have

θ = θ0 . (16.32)

In the two examples shown in Equations 16.27 through 16.30, xt includes only
t, and θ = θ0 when the process is in control.
The residual, εt, can be computed by subtracting the predicted value ŷt
from the observed value of yt:

εt = yt − ŷt = yt − f(xt, θ) = g(yt, xt, θ). (16.33)

When the process is in-control, we have θ = θ0 and expect ε1, ε2, …, εn to be


independent of each other, each of which is a random variable of white noise
with independent data observations, a normal distribution, the mean μ = 0

and the standard deviation σ. That is, the random variables, ε1, ε2, …, εn, have
a joint multivariate normal distribution with the following joint probability
density function:

P(ε1, …, εn | θ = θ0) = [1/(2π)^(n/2)] e^(−(1/(2σ²)) Σt=1,…,n εt0²). (16.34)

Taking a natural logarithm of Equation 16.34, we have

l(ε1, …, εn | θ = θ0) = −(n/2) ln(2π) − (1/(2σ²)) Σt=1,…,n εt0². (16.35)

As seen from Equation 16.33, εt is a function of θ, and P(ε1, …, εn) in Equation


16.34 reaches the maximum likelihood value if the process is in control
with θ = θ0 and we have independently, identically normally distributed εt0,
t = 1, …, n, plugged into Equation 16.34. If the process is out-of-control and
θ ≠ θ0, Equation 16.34 is not the correct joint probability density function for
ε1, ε2, …, εn and thus does not give the maximum likelihood of ε1, ε2, …, εn.
Hence, if the process is in control with θ = θ0, we have

∂l(ε1, …, εn | θ = θ0)/∂θ = 0. (16.36)

Using Equation 16.35 to substitute l(ε1, …, εn|θ = θ0) in Equation 16.36 and


dropping all the terms not related to θ when taking the derivative, we have
Σt=1,…,n εt0(−∂εt0/∂θ) = 0. (16.37)

The cuscore statistic for a cuscore control chart to monitor is Q0:

Q0 = Σt=1,…,n εt0(−∂εt0/∂θ) = Σt=1,…,n εt0 dt0, (16.38)

where

dt0 = −∂εt0/∂θ. (16.39)

Based on Equation 16.37, Q 0 is expected to stay near zero if the process is in


control with θ = θ0. If θ shifts from θ0, Q 0 departs from zero not randomly but
in a consistent manner.

For example, to detect a change of the slope from a linear model of in-
control data described in Equations 16.27 and 16.28, a cuscore control chart
monitors:

Q0 = Σt=1,…,n εt0(−∂εt0/∂θ) = Σt=1,…,n εt0[−∂(yt − θt)/∂θ] = Σt=1,…,n (yt − θ0t)t. (16.40)

If the slope θ of the in-control linear model changes from θ0, (yt − θ0t) in
Equation 16.40 contains t, which is multiplied by another t to make Q 0 keep
increasing (if yt − θ0t > 0) or decreasing (if yt − θ0t < 0) rather than randomly
varying around zero. Such a consistent departure of Q 0 from zero causes the
slope of the line connecting Q 0 values over time to increase or decrease from
zero, which can be used to signal the presence of an anomaly.
To detect a sine wave in an in-control process with the mean of T and ran-
dom variations described in Equations 16.29 and 16.30, the cuscore statistic
for a cuscore control chart is

Q0 = Σt=1,…,n εt0(−∂εt0/∂θ) = Σt=1,…,n (yt − T)[−∂(yt − T − θ sin(2πt/p))/∂θ]

= Σt=1,…,n (yt − T) sin(2πt/p). (16.41)

If the sine wave is present in yt, (yt − T) in Equation 16.41 contains sin(2πt/p), which
is multiplied by another sin(2πt/p) to make Q0 keep increasing (if yt − T > 0) or
decreasing (if yt − T < 0) rather than randomly varying around zero.
To detect a mean shift of K from μ0 as in a CUSUM control chart described
in Equations 16.9, 16.10, and 16.12, we have:
In-control data model:

yt = μ0 + θ0K + εt, θ0 = 0 (16.42)

Out-of-control data model:

yt = μ0 + θK + εt, θ ≠ θ0 (16.43)

Q0 = Σt=1,…,n εt0(−∂εt0/∂θ) = Σt=1,…,n (yt − μ0)[−∂(yt − μ0 − θK)/∂θ] = Σt=1,…,n (yt − μ0)K. (16.44)

If the mean shift of K from μ0 occurs, (yt − μ0) in Equation 16.44 contains K,


which is multiplied by another K to make Q 0 keep increasing (if yt − μ0 > 0) or
decreasing (if yt − μ0 < 0) rather than randomly varying around zero.
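As a minimal sketch, the mean-shift cuscore of Equation 16.44 can be accumulated in Python over the launch temperature data of Example 16.1, with μ0 = 69 and K = 3.5 as in that example; a persistent drift of Q0 away from zero would indicate the shift:

```python
# Cuscore statistic for detecting a mean shift of K (Equation 16.44),
# accumulated over the launch temperature data with mu0 = 69, K = 3.5.
temps = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
mu0, K = 69, 3.5

q = 0.0
cuscore = []
for y in temps:
    q += (y - mu0) * K        # accumulate (y_t - mu0) * K
    cuscore.append(q)

print(cuscore[-1])
```

The slope of the line connecting consecutive Q0 values, rather than any single value, is what signals an anomaly here.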

Since cuscore control charts detect a specific form of anomaly given a
specific form of in-control data model, they allow us to monitor and detect a
wider range of in-control vs. out-of-control situations than Shewhart control
charts, CUSUM control charts, and EWMA control charts.

16.5 Receiver Operating Curve (ROC) for Evaluation


and Comparison of Control Charts
Different values of the decision threshold parameters used in various control
charts, for example, the 3-sigma limits in an x̄ control chart, H in a CUSUM con-
trol chart, and L in an EWMA control chart, produce different rates of false
alarms and hits. Suppose that in Example 16.1 any value of xi ≥ 75 is truly an
anomaly. Hence, seven data observations, observations #12, 16, 18, 19, 20, 21,
and 22, have xi ≥ 75 and are truly anomalies. If the decision threshold is set
to a value greater than or equal to the maximum value of CSi+ and CSi− for
all the 23 data observations, for example, H = 24.5, CSi+ and CSi− for all the 23
data observations do not exceed H, and the two-side CUSUM control chart
does not signal any data observation as an anomaly. We have no false alarms
and zero hits, that is, we have the false alarm rate of 0% and the hit rate of 0%.
If the decision threshold is set to a value smaller than the minimum value of
CSi+ and CSi− for all the 23 data observations, for example, H = −1, CSi+ and CSi−
for all the 23 data observations exceed H, and the two-side CUSUM control
chart signals every data observation as an anomaly, producing 7 hits on all
the true anomalies (observations #12, 16, 18, 19, 20, 21, and 22) and 16 false
alarms, that is, the hit rate of 100% and the false alarm rate of 100%. If the
decision threshold is set to H = 0, the two-side CUSUM control chart signals
data observations #7, 9, 10, 11, 12, 14, 15, 16, 18, 19, 20, 21, 22, and 23 as anomalies,
producing 7 out-of-control signals on all the 7 true anomalies (the hit rate of
100%) and 7 out-of-control signals on observations #7, 9, 10, 11, 14, 15, and 23
out of 16 in-control data observations (the false alarm rate of 44%). Table 16.4
lists pairs of the false alarm rate and the hit rate for other values of H for the
two-side CUSUM control chart in Example 16.1.
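The (false alarm rate, hit rate) pairs of Table 16.4 can be recomputed in Python from the CSi+ and CSi− values of Table 16.2 (a sketch; the set of true anomalies is the one assumed above):

```python
# ROC points for the two-side CUSUM of Example 16.1 at several decision
# thresholds H. The CS+ and CS- values are those of Table 16.2, and
# observations 12, 16, 18, 19, 20, 21, and 22 are the true anomalies.
cs_plus = [0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 5.5, 0, 0, 0, 2.5,
           0, 8.5, 12, 18.5, 21, 24.5, 10]
cs_minus = [0, 0, 0, 0, 0, 0, 0, 0, 8.5, 11, 6.5, 0, 0, 12.5, 11,
            1.5, 0, 0, 0, 0, 0, 0, 7.5]
true_anomalies = {12, 16, 18, 19, 20, 21, 22}

def roc_point(H):
    # An observation is signaled if either statistic exceeds H.
    signaled = {i for i in range(1, 24)
                if cs_plus[i - 1] > H or cs_minus[i - 1] > H}
    hits = len(signaled & true_anomalies)
    false_alarms = len(signaled - true_anomalies)
    n_normal = 23 - len(true_anomalies)
    return (false_alarms / n_normal, hits / len(true_anomalies))

for H in (-1, 0, 10, 24.5):
    fa, hit = roc_point(H)
    print(H, round(fa, 2), round(hit, 2))
```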
A ROC plots pairs of the hit rate and the false alarm rate for various values
of a decision threshold. Figure 16.4 plots the ROC for the two-side CUSUM
control chart in Example 16.1, given seven true anomalies on observations #12,
16, 18, 19, 20, 21, and 22. Unlike a pair of the false alarm rate and the hit rate for
a specific value of a decision threshold, ROC gives a complete picture of perfor-
mance by an anomaly detection technique. The closer the ROC is to the top-left
corner (representing the false alarm rate 0% and the hit rate 100%) of the chart,
the better performance the anomaly detection technique produces. Since it is
difficult to set the decision thresholds for two different anomaly detection tech-
niques so that their performance can be compared fairly, ROC can be plotted

Table 16.4
Pairs of the False Alarm Rate and
the Hit Rate for Various Values of
the Decision Threshold H for the
Two-Side CUSUM Control Chart
in Example 16.1
H False Alarm Rate Hit Rate
−1 1 1
0 0.44 1
0.5 0.38 1
2.5 0.38 0.86
5.5 0.38 0.71
6.5 0.31 0.71
8.5 0.25 0.57
10 0.19 0.57
11 0.06 0.57
12 0.06 0.43
12.5 0 0.43
18.5 0 0.29
21 0 0.14
24.5 0 0

[Chart: hit rate (vertical axis, 0 to 1) plotted against false alarm rate (horizontal axis, 0 to 1).]

Figure 16.4
ROC for the two-side CUSUM control chart in Example 16.1.

for each technique in the same chart to compare ROCs for two techniques and
examine which ROC is closer to the top-left corner of the chart to determine
which technique produces better detection performance. Ye et al. (2002b)
show the use of ROCs for a comparison of cyber attack detection performance
by two control chart techniques.
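The hit rate and false alarm rate behind each row of Table 16.4 follow directly from comparing the set of signaled observations with the set of true anomalies. As a sketch in Python (the observation indices are those of Example 16.1), the H = 0 row works out as:

```python
# Observation indices #1..#23 of Example 16.1; 7 of them are true anomalies.
true_anomalies = {12, 16, 18, 19, 20, 21, 22}
in_control = set(range(1, 24)) - true_anomalies      # 16 in-control observations

# Observations signaled by the two-side CUSUM control chart with H = 0.
signaled = {7, 9, 10, 11, 12, 14, 15, 16, 18, 19, 20, 21, 22, 23}

hit_rate = len(signaled & true_anomalies) / len(true_anomalies)   # 7/7 = 1.0
false_alarm_rate = len(signaled & in_control) / len(in_control)   # 7/16 = 0.44
```

Sweeping the decision threshold H and collecting these pairs produces the ROC of Figure 16.4.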
Univariate Control Charts 267

16.6  Software and Applications


Minitab (www.minitab.com) supports statistical process control charts.
Applications of univariate control charts to manufacturing quality and cyber
intrusion detection can be found in Ye (2003, Chapter 4), Ye (2008), Ye et al.
(2002a, 2004), and Ye and Chen (2003).

Exercises
16.1 Consider the launch temperature data and the following information
in Example 16.1:

µ0 = 69

K = 3.5

Construct a cuscore control chart using Equation 16.44 to monitor the
launch temperature.
16.2 Plot the ROCs for the CUSUM control chart in Example 16.1, the EWMA
control chart in Example 16.2, and the cuscore control chart in Exercise
16.1, in the same chart, and compare the performance of these control
chart techniques.
16.3 Collect the data of daily temperatures in the last 12 months in your
city, consider the temperature data in each month as a data sample, and
construct an x̄ control chart to monitor the local temperature and detect
any anomaly.
16.4 Consider the same data set consisting of 12 monthly average temperatures
obtained from Exercise 16.3 and use x̄ and s obtained from
Exercise 16.3 to estimate μ0 and σ. Set K = 0.5σ and H = 5σ. Construct a
two-side CUSUM control chart to monitor the data of the monthly average
temperatures and detect any anomaly.
16.5 Consider the data set and the μ0 and K values in Exercise 16.4. Construct
a cuscore control chart to monitor the data of the monthly average tem-
peratures and detect any anomaly.
16.6 Consider the data set and the estimate of μ0 and σ in Exercise 16.4. Set
λ = 0.1 and L = 3. Construct an EWMA control chart to monitor the data
of the monthly average temperatures.
16.7 Repeat Exercise 16.6 but with λ = 0.3 and compare the EWMA control
charts in Exercises 16.6 and 16.7.
17
Multivariate Control Charts

Multivariate control charts monitor multiple variables simultaneously for


anomaly detection. This chapter describes three multivariate statistical con-
trol charts: Hotelling’s T 2 control charts, multivariate EWMA control charts,
and chi-square control charts. Some applications of multivariate control
charts are given with references.

17.1  Hotelling’s T 2 Control Charts


Let xi = (xi1, …, xip)′ denote the ith observation of the random variables xi1, …, xip,
which follow a multivariate normal distribution (see the probability density
function of a multivariate normal distribution in Chapter 16) with the mean
vector μ and the variance–covariance matrix Σ (see the definition of the
variance–covariance matrix in Chapter 14). Given a data sample with n data
observations, the sample mean vector x̄ and the sample variance–covariance
matrix S:

x̄ = (x̄1, …, x̄p)′ (17.1)

S = [1/(n − 1)] ∑_{i=1}^{n} (xi − x̄)(xi − x̄)′, (17.2)

can be used to estimate μ and Σ, respectively. Hotelling’s T 2 statistic for an


observation xi is (Chou et al., 1999; Everitt, 1979; Johnson and Wichern, 1998;
Mason et al., 1995, 1997a,b; Mason and Young, 1999; Ryan, 1989):

T 2 = (xi − x̄)′ S−1 (xi − x̄), (17.3)

where
S−1 is the inverse of S
Hotelling’s T 2 statistic measures the statistical distance of xi from x̄.


[Diagram in the (x1, x2) plane: the rectangular control limits set by two univariate control charts versus the elliptical control limits set by Hotelling’s T 2; an observation inside the rectangle but outside the ellipse is missed by the two univariate control charts.]
Figure 17.1
An illustration of statistical distance measured by Hotelling’s T 2 and control limits of
Hotelling’s T 2 control charts and univariate control charts.

Suppose that we have x̄ = 0 at the origin of the two-dimensional space of
x1 and x2 in Figure 17.1. In Figure 17.1, the data points xi with the same statistical
distance from x̄ lie on an ellipse that takes into account the variance
and covariance of x1 and x2, whereas all the data points xi with the same
Euclidean distance lie on a circle. The larger the value of Hotelling’s T 2 statistic
for an observation xi, the larger the statistical distance of xi from x̄.
A Hotelling’s T 2 control chart monitors the Hotelling’s T 2 statistic in
Equation 17.3. If xi1, …, xip follow a multivariate normal distribution, a trans-
formed value of the Hotelling’s T 2 statistic:

[n(n − p)/(p(n + 1)(n − 1))] T 2

follows an F distribution with p and n − p degrees of freedom. Hence, the


tabulated F value for a given level of significance, for example, α = 0.05, can
be used as the signal threshold. If the transformed value of the Hotelling’s
T 2 statistic for an observation x i is greater than the signal threshold, a
Hotelling’s T 2 control chart signals x i as an anomaly. A Hotelling’s T 2 con-
trol chart can detect both mean shifts and counter-relationships. Counter-
relationships are large deviations from the covariance structure of the
variables.
Figure 17.1 illustrates the control limits set by two individual x̄ control
charts for x1 and x2, respectively, and the control limits set by a Hotelling’s T 2
control chart based on the statistical distance. Because each of the individual
x̄ control charts for x1 and x2 does not include the covariance structure of x1
and x2, a data observation deviating from the covariance structure of x1 and
x2 is missed by each of the individual x̄ control charts but detected by the
Hotelling’s T 2 control chart, as illustrated in Figure 17.1. It is pointed out in
Ryan (1989) that Hotelling’s T 2 control charts are more sensitive to counter-
relationships than mean shifts. For example, if two variables have a posi-
tive correlation and a mean shift occurs with both variables but in the same
direction to maintain their correlation, Hotelling’s T 2 control charts may not

detect the mean shift (Ryan, 1989). Hotelling’s T 2 control charts are also sen-
sitive to the multivariate normality assumption.

Example 17.1
The data set of the manufacturing system in Table 14.1, which is copied
in Table 17.1, includes two attribute variables, x7 and x8, in nine cases of
single-machine faults. The sample mean vector and the sample variance–
covariance matrix are computed in Chapter 14 and given next. Construct
a Hotelling’s T 2 control chart to determine if the first data observation
x = (x7, x8) = (1, 0) is an anomaly.

x̄ = (x̄7, x̄8)′ = (5/9, 4/9)′

S = [ 0.2469  −0.1358
     −0.1358   0.2469 ].

For the first data observation x = (x7, x8) = (1, 0), we compute the value of
the Hotelling’s T 2 statistic:

T 2 = (xi − x̄)′ S−1 (xi − x̄)
    = [1 − 5/9  0 − 4/9] [ 0.2469  −0.1358 ]−1 [ 1 − 5/9 ]
                         [ −0.1358  0.2469 ]   [ 0 − 4/9 ]
    = [4/9  −4/9] [ 5.8070  3.1939 ] [ 4/9 ]
                  [ 3.1939  5.8070 ] [ −4/9 ]
    = 0.1435.

Table 17.1
Data Set for System Fault Detection
with Two Quality Variables
Instance
(Faulty Machine) x7 x8
1 (M1) 1 0
2 (M2) 0 1
3 (M3) 1 1
4 (M4) 0 1
5 (M5) 1 0
6 (M6) 1 0
7 (M7) 1 0
8 (M8) 0 1
9 (M9) 0 0

The transformed T 2 has the value

[n(n − p)/(p(n + 1)(n − 1))] T 2 = [(9)(9 − 2)/((2)(9 + 1)(9 − 1))](0.1435) = 0.0502.

The tabulated F value for α = 0.05 with 2 and 7 degrees of freedom is 4.74,
which is used as the signal threshold. Since 0.0502 < 4.74, the Hotelling’s
T 2 control chart does not signal x = (x7, x8) = (1, 0) as an anomaly.
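The computation in Example 17.1 can be sketched in Python. The sample below is hypothetical (not the book’s data), and the 2 × 2 inverse is written out by hand; Equations 17.1 through 17.3 are applied directly:

```python
# Hypothetical sample of n = 5 observations on p = 2 variables.
data = [(2.0, 1.0), (3.0, 2.0), (4.0, 2.5), (3.5, 1.5), (2.5, 2.0)]
n, p = len(data), 2

# Sample mean vector (Equation 17.1).
mean = [sum(row[j] for row in data) / n for j in range(p)]

# Sample variance-covariance matrix S (Equation 17.2).
S = [[sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in data) / (n - 1)
      for k in range(p)] for j in range(p)]

# Inverse of the 2x2 matrix S.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

def t_squared(x):
    """T^2 = (x - mean)' S^-1 (x - mean)   (Equation 17.3)."""
    d = [x[0] - mean[0], x[1] - mean[1]]
    return (d[0] * (S_inv[0][0] * d[0] + S_inv[0][1] * d[1])
            + d[1] * (S_inv[1][0] * d[0] + S_inv[1][1] * d[1]))

# Transformed statistic, to be compared with the tabulated F value for
# p and n - p degrees of freedom at the chosen significance level.
t2 = t_squared((4.0, 1.0))
f_stat = n * (n - p) / (p * (n + 1) * (n - 1)) * t2
```

The statistic is zero at the sample mean and grows with the statistical distance from it.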

17.2  Multivariate EWMA Control Charts


Hotelling’s T 2 control charts are a multivariate version of the univariate x̄ control
charts in Chapter 16. Multivariate EWMA control charts are a multivariate
version of the EWMA control charts in Chapter 16. A multivariate EWMA control
chart monitors the following statistic (Ye, 2003, Chapter 4):

T 2 = zi′ Sz−1 zi, (17.4)

where

zi = λxi + (1 − λ)zi−1, (17.5)

λ is a weight in (0, 1],

z0 = μ or x̄, (17.6)

Sz = [λ/(2 − λ)][1 − (1 − λ)^(2i)]S, (17.7)

and S is the sample variance–covariance matrix of x.
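A minimal sketch of the recursion in Equations 17.5 and 17.7, assuming hypothetical values for λ, z0, S, and the observations:

```python
lam = 0.2                      # weight lambda in (0, 1]
z = [0.0, 0.0]                 # z_0, taken here as the mean vector (Equation 17.6)
S = [[1.0, 0.3], [0.3, 1.0]]   # sample variance-covariance matrix of x (assumed)

observations = [(0.5, -0.2), (1.1, 0.4), (0.8, 0.9)]

for i, x in enumerate(observations, start=1):
    # z_i = lambda * x_i + (1 - lambda) * z_{i-1}   (Equation 17.5)
    z = [lam * x[j] + (1 - lam) * z[j] for j in range(2)]
    # S_z = lambda/(2 - lambda) * (1 - (1 - lambda)^(2i)) * S   (Equation 17.7)
    scale = lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))
    S_z = [[scale * S[j][k] for k in range(2)] for j in range(2)]
    # T^2 = z' S_z^{-1} z (Equation 17.4) would then be compared with a signal
    # threshold; the 2x2 inverse can be formed as in the Hotelling's T^2 sketch.
```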

17.3  Chi-Square Control Charts


Since Hotelling’s T 2 control charts and multivariate EWMA control charts
require computing the inverse of a variance–covariance matrix, these con-
trol charts are not scalable to a large number of variables. The presence of
linearly correlated variables creates the difficulty of obtaining the inverse of

a variance–covariance matrix. To address these problems, chi-square control


charts are developed (Ye et al., 2002b, 2006). A chi-square control chart monitors
the chi-square statistic for an observation xi = (xi1, …, xip)′ as follows:

χ2 = ∑_{j=1}^{p} (xij − x̄j)^2/x̄j. (17.8)

For example, the data set of the manufacturing system in Table 17.1 includes
two attribute variables, x7 and x8, in nine cases of single-machine faults. The
sample mean vector is computed in Chapter 14 and given next:

5
 x7   9 
x =   =  .
 x8   4 
 9 

The chi-square statistic for the first data observation in Table 17.1, x = (x7, x8) =
(1, 0), is

χ2 = ∑_{j=7}^{8} (x1j − x̄j)^2/x̄j = (x17 − x̄7)^2/x̄7 + (x18 − x̄8)^2/x̄8
   = (1 − 5/9)^2/(5/9) + (0 − 4/9)^2/(4/9) = 0.8.

If the p variables are independent and p is large, the chi-square statistic


follows a normal distribution based on the central limit theorem. Given a
sample of in-control data observations, the sample mean χ̄2 and the sample
standard deviation sχ2 of the chi-square statistic can be computed and used
to set the control limits:

UCL = χ̄2 + Lsχ2 (17.9)

LCL = χ̄2 − Lsχ2. (17.10)


If we let L = 3, we have the 3-sigma control limits. If the value of the chi-
square statistic for an observation falls beyond [LCL, UCL], the chi-square
control chart signals an anomaly.
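The chi-square statistic and its control limits can be sketched as follows. The observation and the sample means are those of Example 17.1 (giving χ2 = 0.8), while the sample of in-control chi-square values used to set the limits is hypothetical:

```python
# First observation of Table 17.1 and the sample means from Example 17.1.
x = [1.0, 0.0]                  # (x7, x8)
xbar = [5.0 / 9.0, 4.0 / 9.0]   # sample means of x7 and x8

# Chi-square statistic (Equation 17.8).
chi2 = sum((x[j] - xbar[j]) ** 2 / xbar[j] for j in range(2))   # 0.8

# Control limits (Equations 17.9 and 17.10) from hypothetical in-control values.
values = [0.8, 0.5, 0.3, 0.6, 0.4]
m = len(values)
mean = sum(values) / m
s = (sum((v - mean) ** 2 for v in values) / (m - 1)) ** 0.5
L = 3
UCL, LCL = mean + L * s, mean - L * s
signal = not (LCL <= chi2 <= UCL)   # anomaly if chi2 falls beyond [LCL, UCL]
```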
In the work by Ye et al. (2006), chi-square control charts are compared with
Hotelling’s T 2 control charts in their performance of detecting mean shifts
and counter-relationships for four types of data: (1) data with correlated and

normally distributed variables, (2) data with uncorrelated and normally dis-
tributed variables, (3) data with auto-correlated and normally distributed
variables, and (4) non-normally distributed variables without correlations
or auto-correlations. The testing results show that chi-square control charts
perform better or as well as Hotelling’s T 2 control charts for data of types 2,
3, and 4. Hotelling’s T 2 control charts perform better than chi-square control
charts for data of type 1 only. However, for data of type 1, we can use tech-
niques such as principal component analysis in Chapter 14 to obtain prin-
cipal components. Then a chi-square control chart can be used to monitor
principal components that are independent variables.

17.4  Applications
Applications of Hotelling’s T 2 control charts and chi-square control charts
to cyber attack detection for monitoring computer and network data and
detecting cyber attacks as anomalies can be found in the work by Ye and her
colleagues (Emran and Ye, 2002; Ye, 2003, Chapter 4; Ye, 2008; Ye and Chen,
2001; Ye et al., 2001, 2003, 2004, 2006). There are also applications of multivari-
ate control charts in manufacturing (Ye, 2003, Chapter 4) and other fields.

Exercises
17.1 Use the data set of x4, x5, and x6 in Table 8.1 to estimate the parameters
for a Hotelling’s T 2 control chart and construct the Hotelling’s T 2 con-
trol chart with α = 0.05 for the data set of x4, x5, and x6 in Table 4.6 to
monitor the data and detect any anomaly.
17.2 Use the data set of x4, x5, and x6 in Table 8.1 to estimate the param-
eters for a chi-square control chart and construct the chi-square control
chart with L = 3 for the data set of x4, x5, and x6 in Table 4.6 to monitor
the data and detect any anomaly.
17.3 Repeat Example 17.1 for the second observation.
Part VI

Algorithms for Mining


Sequential and
Temporal Patterns
18
Autocorrelation and Time Series Analysis

Time series data consist of data observations over time. If data observations
are correlated over time, time series data are autocorrelated. Time series
analysis was introduced by Box and Jenkins (1976) to model and analyze
time series data with autocorrelation. Time series analysis has been applied
to real-world data in many fields, including stock prices (e.g., S&P 500 index),
airline fares, labor force size, unemployment data, and natural gas price
(Yaffee and McGee, 2000). There are stationary and nonstationary time series
data that require different statistical inference procedures. In this chapter,
autocorrelation is defined. Several types of stationarity and nonstationarity
time series are explained. Autoregressive and moving average (ARMA) mod-
els of stationary series data are described. Transformations of nonstationary
series data into stationary series data are presented, along with autoregres-
sive, integrated, moving average (ARIMA) models. A list of software pack-
ages that support time series analysis is provided. Some applications of time
series analysis are given with references.

18.1  Autocorrelation
Equation 14.7 in Chapter 14 gives the correlation coefficient of two variables
xi and xj:

ρij = σij/√(σii σjj),

where Equations 14.4 and 14.6 give

σi^2 = ∑_{all values of xi} (xi − μi)^2 p(xi)

σij = ∑_{all values of xi} ∑_{all values of xj} (xi − μi)(xj − μj) p(xi, xj).

277

Given a variable x and a sample of its time series data xt, t = 1, …, n, we obtain
the lag-k autocorrelation function (ACF) coefficient by replacing the variables
xi and xj in the aforementioned equations with xt and xt−k, which are two data
observations with a time lag of k:

ACF(k) = ρk = [∑_{t=k+1}^{n} (xt − x̄)(xt−k − x̄)/(n − k)] / [∑_{t=1}^{n} (xt − x̄)^2/n], (18.1)

where x̄ is the sample average. If time series data are statistically independent
at lag-k, ρk is zero. If xt and xt−k change from x̄ in the same way (e.g., both
increasing from x̄), ρk is positive. If xt and xt−k change from x̄ in the opposite
way (e.g., one increasing and the other decreasing from x̄), ρk is negative.
The lag-k partial autocorrelation function (PACF) coefficient measures
the autocorrelation at lag-k that is not accounted for by the autocorrelations
at lags 1 to k − 1. PACF for lag-1 and lag-2 are given next (Yaffee and
McGee, 2000):

PACF(1) = ρ1 (18.2)

PACF(2) = (ρ2 − ρ1^2)/(1 − ρ1^2). (18.3)
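Equations 18.1 through 18.3 translate directly into code; the demonstration series below is arbitrary:

```python
def acf(x, k):
    """Lag-k autocorrelation function coefficient (Equation 18.1)."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / (n - k)
    den = sum((v - mean) ** 2 for v in x) / n
    return num / den

def pacf2(x):
    """PACF at lags 1 and 2 (Equations 18.2 and 18.3)."""
    r1, r2 = acf(x, 1), acf(x, 2)
    return r1, (r2 - r1 ** 2) / (1 - r1 ** 2)

x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]   # a short demonstration series
p1, p2 = pacf2(x)
```

By construction, ACF(0) is 1 and PACF(1) equals ACF(1).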

18.2  Stationarity and Nonstationarity


Stationarity usually refers to weak stationarity that requires the mean and
variance of time series data not changing over time. A time series is strictly
stationary if the autocovariance σt,t−k does not change over time t but depends
only on the number of time lags k in addition to the fixed mean and the con-
stant variance. For example, a Gaussian time series that has a multivariate
normal distribution is a strict stationary series because the mean, variance,
and autocovariance of the series do not change over time. ARMA models are
used to model stationary time series.
Nonstationarity may be caused by

• Outliers (see the description in Chapter 16)


• Random walk in which each observation randomly deviates from
the previous observation without reversion to the mean
• Deterministic trend (e.g., a linear trend that has values changing
over time at a constant rate)

• Changing variance
• Cycles with a data pattern that repeats periodically, including sea-
sonable cycles with annual periodicity
• Others that make the mean or variance of a time series changes over
time

A nonstationary series must be transformed into a stationary series in order


to build an ARMA model.

18.3  ARMA Models of Stationary Series Data


ARMA models apply to time series data with weak stationarity. An autore-
gressive (AR) model of order p, AR(p), describes a time series in which the
current observation of a variable x is a function of its previous p observation(s)
and a random error:

xt = φ1xt−1 + … + φpxt−p + et. (18.4)

For example, time series data for the approval of the president’s job performance
based on the Gallup poll is modeled as AR(1) (Yaffee and McGee, 2000):

xt = φ 1 xt − 1 + e t . (18.5)

Table 18.1 gives a time series of an AR(1) model with ϕ1 = 0.9, x0 = 3, and a
white noise process for et with the mean of 0 and the standard deviation of 1.

Table 18.1
Time Series of an AR(1) Model
with ϕ1 = 0.9, x0 = 3, and a
White Noise Process for et
t et xt
1 0.166 2.866
2 −0.422 2.157
3 −1.589 0.353
4 0.424 0.741
5 0.295 0.962
6 −0.287 0.579
7 −0.140 0.381
8 0.985 1.328
9 −0.370 0.825
10 −0.665 0.078

[Plot: xt (y-axis) versus t = 1, …, 10 (x-axis).]

Figure 18.1
Time series data generated using an AR(1) model with ϕ1 = 0.9 and a white noise process for et.

Figure 18.1 plots this AR(1) time series. As seen in Figure 18.1, the effect of the
initial x value, x0 = 3, diminishes quickly.
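As a check, the xt column of Table 18.1 can be regenerated from its et column with a few lines of Python:

```python
# AR(1): x_t = 0.9 * x_{t-1} + e_t (Equation 18.5), with x_0 = 3 and the
# white-noise values e_t taken from Table 18.1.
phi1 = 0.9
x = 3.0   # x_0
e = [0.166, -0.422, -1.589, 0.424, 0.295,
     -0.287, -0.140, 0.985, -0.370, -0.665]

series = []
for e_t in e:
    x = phi1 * x + e_t
    series.append(round(x, 3))
# series -> [2.866, 2.157, 0.353, 0.741, 0.962, 0.579, 0.381, 1.328, 0.825, 0.078]
```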
A moving average (MA) model of order q, MA(q), describes a time series in
which the current observation of a variable is an effect of a random error at
the current time and random errors at previous q time points:

xt = et − θ1et−1 − … − θqet−q. (18.6)

For example, time series data from the epidemiological tracking on the proportion
of the total population reported to have a disease (e.g., AIDS) is modeled
as MA(1) (Yaffee and McGee, 2000):

xt = et − θ1et −1. (18.7)

Table 18.2 gives a time series of an MA(1) model with θ1 = 0.9 and a white noise
process for et with the mean of 0 and the standard deviation of 1. Figure 18.2
plots this MA(1) time series. As seen in Figure 18.2, the term −0.9et−1 in Formula 18.7
tends to bring xt in the opposite direction of xt−1, making the xt values oscillate.
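A direct simulation of the MA(1) recursion xt = et − θ1et−1, using a short hypothetical white-noise sequence (not the one in Table 18.2):

```python
theta1 = 0.9
e = [0.5, -0.3, 0.8, -0.1, 0.2]   # hypothetical e_0, e_1, ..., e_4

# x_t = e_t - theta1 * e_{t-1}, t = 1, ..., 4 (Equation 18.7)
x = [e[t] - theta1 * e[t - 1] for t in range(1, len(e))]
# x -> [-0.75, 1.07, -0.82, 0.29]: the -0.9*e_{t-1} term flips the sign pattern
```

Unlike the AR(1) series, each xt here depends only on the two most recent random errors, so the series has a one-lag memory.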
An ARMA model, ARMA(p, q), describes a time series with both autoregres-
sive and moving average characteristics:

xt = φ1xt−1 + … + φpxt−p + et − θ1et−1 − … − θqet−q. (18.8)

ARMA(p, 0) denotes an AR(p) model, and ARMA(0, q) denotes an MA(q)


model. Generally, a smooth time series has high AR coefficients and low
MA coefficients, and a time series affected dominantly by random errors has
high MA coefficients and low AR coefficients.

Table 18.2
Time Series of an MA(1)
Model with θ1 = 0.9 and a
White Noise Process for et
t et xt
0 0.649
1 0.166 −0.418
2 −0.422 −0.046
3 −1.589 −1.548
4 0.424 1.817
5 0.295 −1.340
6 −0.287 0.919
7 −0.140 −0.967
8 0.985 1.856
9 −0.370 −2.040
10 −0.665 1.171

[Plot: xt (y-axis) versus t = 1, …, 10 (x-axis), oscillating between positive and negative values.]

Figure 18.2
Time series data generated using an MA(1) model with θ1 = 0.9 and a white noise process for et.

18.4  ACF and PACF Characteristics of ARMA Models


ACF and PACF described in Section 18.1 provide analytical tools to reveal
and identify the autoregressive order or the moving average order in an
ARMA model for a time series. The characteristics of ACF and PACF for time
series data generated by AR, MA and ARMA models are described next.
For an AR(1) time series:

xt = φ 1 xt − 1 + e t ,

282 Data Mining

ACF(k) is (Yaffee and McGee, 2000)

ACF(k) = φ1^k. (18.9)


If |ϕ1| < 1, AR(1) is stationary with the exponential decline in the absolute value
of ACF over time, since ACF(k) decreases with k and eventually diminishes. If
ϕ1 > 0, ACF(k) is positive. If ϕ1 < 0, ACF(k) is oscillating in that it is negative for
k = 1, positive for k = 2, negative for k = 3, positive for k = 4, and so on. If |ϕ1| ≥ 1,
AR(1) is nonstationary. For a stationary AR(2) time series:

xt = φ 1 xt − 1 + φ 2 x t − 2 + e t ,

ACF(k) is positive with the exponential decline in the absolute value of ACF
over time if ϕ1 > 0 and ϕ2 > 0, and ACF(k) is oscillating with the exponential
decline in the absolute value of ACF over time if ϕ1 < 0 and ϕ2 > 0.
PACF(k) for an autoregressive series AR(p) carries through lag p and
become zero after lag p. For AR(1), PACF(1) is positive if ϕ1 > 0 or negative if
ϕ1 < 0, and PACF(k) for k ≥ 2 is zero. For AR(2), PACF(1) and PACF(2) are posi-
tive if ϕ1 > 0 and ϕ2 > 0, PACF(1) is negative and PACF(2) is positive if ϕ1 < 0
and ϕ2 > 0, and PACF(k) for k ≥ 3 is zero. Hence, PACF identifies the order of
an autoregressive time series.
For an MA(1) time series,

xt = et − θ1et −1 ,

ACF(1) is not zero, as follows (Yaffee and McGee, 2000):

ACF(1) = −θ1/(1 + θ1^2), (18.10)

and ACF(k) is zero for k > 1. Similarly, for an MA(2) time series, ACF(1) and
ACF(2) are negative, and ACF(k) is zero for k > 2. For an MA(q), we have
(Yaffee and McGee, 2000)

ACF(k) ≠ 0 if k ≤ q,
ACF(k) = 0 if k > q.

Unlike an autoregressive time series whose ACF declines exponentially over


time, a moving average time series has a finite memory since the autocorrela-
tion of MA(q) carries only through lag q. Hence, ACF identifies the order of a
moving average time series. A moving average time series has PACF whose
magnitude exponentially declines over time. For MA(1), PACF(k) is negative

if θ1 > 0, and PACF(k) is oscillating between positive and negative values with
the exponential decline in the magnitude of PACF(k) over time if θ1 < 0. For MA(2),
PACF(k) is negative with the exponential decline in the magnitude of PACF
over time if θ1 > 0 and θ2 > 0, and PACF(k) is oscillating with the exponential
decline in the magnitude of PACF over time if θ1 < 0 and θ2 < 0.
The aforementioned characteristics of autoregressive and moving aver-
age time series are combined in mixed time series with ARMA(p, q) models
where p > 0 and q > 0. For example, for an ARMA(1,1) with ϕ1 > 0 and θ1 < 0,
ACF declines exponentially overtime and PACF is oscillating with the expo-
nential decline over time.
The parameters in an ARMA model can be estimated from a sample of
time series data using the unconditional least-squares method, the conditional
least-squares method, or the maximum likelihood method (Yaffee and
McGee, 2000), which are supported in statistical software such as SAS (www.
sas.com) and SPSS (www.ibm.com/software/analytics/spss/).

18.5 Transformations of Nonstationary
Series Data and ARIMA Models
For nonstationary series caused by outliers, random walk, deterministic
trend, changing variance, and cycles and seasonality, which are described
in Section 18.2, methods of transforming those nonstationary series into sta-
tionary series are described next.
When outliers are detected in a time series, they can be removed and
replaced using the average of the series. A random walk has each observa-
tion randomly deviating from the previous observation without reversion to
the mean. Drunken drivers and birth rates have the behavior of a random
walk (Yaffee and McGee, 2000). Differencing is applied to a random walk
series as follows:

et = xt − xt −1 (18.11)

to obtain a stationary series of residual et, which is then modeled as an ARMA


model. A deterministic trend such as the following linear trend:

xt = a + bt + et , (18.12)

can be removed by de-trending. The de-trending includes first building a


regression model to capture the trend (e.g., a linear model for a linear trend
or a polynomial model for a higher-order trend) and then obtaining a sta-
tionary series of residual et through differencing between the observed value

and the predicted value from the regression model. For a changing vari-
ance with the variance of a time series expanding, contracting, or fluctuating
over time, the natural log transformation or a power transformation (e.g.,
square and square root) can be considered to stabilize the variance (Yaffee
and McGee, 2000). The natural log and power transformations belong to the
family of Box–Cox transformations, which are defined as (Yaffee and McGee,
2000):

yt = [(xt + c)^λ − 1]/λ if 0 < λ ≤ 1 (18.13)
yt = ln(xt + c) if λ = 0,

where
xt is the original time series
yt is the transformed time series
c is a constant
λ is a shape parameter
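A sketch of Equation 18.13 in Python; the constant c and the shape parameter λ are chosen by the analyst:

```python
import math

def box_cox(series, lam, c=0.0):
    """Box-Cox transformation (Equation 18.13)."""
    if lam == 0:
        return [math.log(x + c) for x in series]
    return [((x + c) ** lam - 1.0) / lam for x in series]
```

With λ = 1 the transformation reduces to a shift (yt = xt + c − 1), λ = 0.5 corresponds to a square-root transformation, and λ = 0 gives the natural log.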

For a time series consisting of cycles, some of which are seasonable with
annual periodicity, cyclic or seasonal differencing can be performed as
follows:

e t = xt − xt − d , (18.14)

where d is the number of time lags that a cycle spans.
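Both the regular differencing of Equation 18.11 (d = 1) and the cyclic/seasonal differencing of Equation 18.14 are one-line operations; the quarterly series below is hypothetical:

```python
def difference(x, d=1):
    """e_t = x_t - x_{t-d} (Equations 18.11 and 18.14)."""
    return [x[t] - x[t - d] for t in range(d, len(x))]

quarterly = [10, 12, 15, 11, 13, 15, 18, 14]   # hypothetical, cycle spans d = 4
lag1 = difference(quarterly)          # regular differencing (Equation 18.11)
lag4 = difference(quarterly, d=4)     # seasonal differencing (Equation 18.14)
# lag4 -> [3, 3, 3, 3]: the seasonal pattern is removed, leaving a constant shift
```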


The regular differencing and the cyclic/seasonal differencing can be
added to an ARMA model to become an ARIMA model where I stands for
integrated:

xt − xt−d = φ1xt−1 + … + φpxt−p + et − θ1et−1 − … − θqet−q. (18.15)

18.6  Software and Applications


SAS (www.sas.com), SPSS (www.ibm.com/software/analytics/spss/), and
MATLAB (www.mathworks.com) support time series analysis. In the work
by Ye and her colleagues (Ye, 2008, Chapters 10 and 17), time series analysis
is applied to uncovering and identifying autocorrelation characteristics of
normal use and cyber attack activities using computer and network data.
Time series models are built based on these characteristics and used in

cuscore control charts as described in Chapter 16 to detect the presence of


cyber attacks. The applications of time series analysis for forecasting can be
found in Yaffee and McGee (2000).

Exercises
18.1 Construct time series data following an ARMA(1,1) model.
18.2 For the time series data in Table 18.1, compute ACF(1), ACF(2), ACF(3),
PACF(1), and PACF(2).
18.3 For the time series data in Table 18.2, compute ACF(1), ACF(2), ACF(3),
PACF(1), and PACF(2).
19
Markov Chain Models and
Hidden Markov Models

Markov chain models and hidden Markov models have been widely used to
build models and make inferences of sequential data patterns. In this chap-
ter, Markov chain models and hidden Markov models are described. A list
of data mining software packages that support the learning and inference of
Markov chain models and hidden Markov models is provided. Some appli-
cations of Markov chain models and hidden Markov models are given with
references.

19.1  Markov Chain Models


A Markov chain model describes the first-order discrete-time stochastic process
of a system with the Markov property that the probability of the system
state at time n depends not on the system states at earlier times but only on
the system state at time n − 1:

P(sn|sn−1, …, s1) = P(sn|sn−1) for all n, (19.1)

where sn is the system state at time n. A stationary Markov chain model has
an additional property that the probability of a state transition from time
n − 1 to n is independent of time n:

P(sn = j|sn−1 = i) = P(j|i), (19.2)

where P(j|i) is the probability that the system is in state j at one time given
that the system is in state i at the previous time. A stationary Markov model is
simply called a Markov model in the following text.
If the system has a finite number of states, 1, …, S, a Markov chain model
is defined by the state transition probabilities, P(j|i), i = 1, …, S, j = 1, …, S,

∑_{j=1}^{S} P(j|i) = 1, (19.3)

287

and the initial state probabilities, P(i), i = 1, …, S,

∑_{i=1}^{S} P(i) = 1, (19.4)

where P(i) is the probability that the system is in state i at time 1. The
joint probability of a given sequence of system states sn−K+1, …, sn in a time
window of size K including discrete times n − (K − 1), …, n is computed as
follows:

P(sn − K + 1 , … , sn ) = P(sn − K + 1 ) ∏ P (s
k = K− 1
|sn − k ) .
n − k +1 (19.5)

The state transition probabilities and the initial state probabilities can be
learned from a training data set containing one or more state sequences as
follows:

P(j|i) = Nji/N.i (19.6)

P(i) = Ni/N, (19.7)

where
Nji is the frequency that the state transition from state i to state j appears in
the training data
N.i is the frequency that the state transition from state i to any of the states,
1, …, S, appears in the training data
Ni is the frequency that state i appears in the training data
N is the total number of the states in the training data

Markov chain models can be used to learn and classify sequential data pat-
terns. For each target class, sequential data with the target class can be used
to build a Markov chain model by learning the state transition probability
matrix and the initial probability distribution from the training data accord-
ing to Equations 19.6 and 19.7. That is, we obtain a Markov chain model for
each target class. If we have target classes, 1, …, c, we build Markov chain
models, M1, …, Mc, for these target classes. Given a test sequence, the joint
probability of this sequence is computed using Equation  19.5 under each
Markov chain model. The test sequence is classified into the target class of the
Markov chain model which gives the highest value for the joint ­probability
of the test sequence.

In the applications of Markov chain models to cyber attack detection


(Ye et al., 2002c, 2004), computer audit data under the normal use condition
and under various attack conditions on computers are collected. There are
in total 284 types of audit events in the audit data. Each audit event is considered
as one of 284 system states. Each of the conditions (normal use and
various attacks) is considered as a target class. The Markov chain model for
a target class is learned from the training data under the condition of the
target class. For each test sequence of audit events in an observation window,
the joint probability of the test sequence is computed under each Markov
chain model. The test sequence is classified into one of the conditions (nor-
mal or one of the attack types) to determine if an attack is present.

Example 19.1
A system has two states: misuse (m) and regular use (r). A sequence
of system states is observed for training a Markov chain model:
mmmrrrrrrmrrmrrmrmmr. Build a Markov chain model using the
observed sequence of system states and compute the probability that
the sequence of system states mmrmrr is generated by the Markov chain
model.
Figure 19.1 shows the states and the state transitions in the observed
training sequence of systems states.
Using Equation 19.6 and the training sequence of system states
mmmrrrrrrmrrmrrmrmmr, we learn the following state transition
probabilities:

P(m|m) = Nmm/N.m = 3/8,

because state transitions 1, 2, and 18 are the state transition of m → m,


and state transitions 1, 2, 3, 10, 13, 16, 18, and 19 are the state transition of
m → any state:

P(r|m) = Nrm/N.m = 5/8,
because state transitions 3, 10, 13, 16, and 19 are the state transition of
m → r, and state transitions 1, 2, 3, 10, 13, 16, 18, and 19 are the state transition
of m → any state:

P(m|r) = Nmr/N.r = 4/11,

State (1–20):  m m m r r r r r r m r r m r r m r m m r
State transition i (i = 1, …, 19) is the transition from state i to state i + 1.
Figure 19.1
States and state transitions in Example 19.1.

because state transitions 9, 12, 15, and 17 are the state transition of r → m,
and state transitions 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, and 17 are the state transi-
tion of r → any state:

P(r|r) = Nrr/N.r = 7/11,

because state transitions 4, 5, 6, 7, 8, 11, and 14 are the state transition of
r → r, and state transitions 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, and 17 are the state
transition of r → any state.
Using Equation 19.7 and the training sequence of states
mmmrrrrrrmrrmrrmrmmr, we learn the following initial state probabilities:

P(m) = Nm/N = 8/20,

because states 1, 2, 3, 10, 13, 16, 18, and 19 are state m, and there are
20 states in the sequence of states:

P(r) = Nr/N = 12/20,

because states 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, and 20 are state r, and there are
20 states in the sequence of states.
After learning all the parameters of the Markov chain model, we com-
pute the probability that the model generates the sequence of states:
mmrmrr.

P(mmrmrr) = P(s1)P(s2|s1)P(s3|s2)P(s4|s3)P(s5|s4)P(s6|s5)
= P(m)P(m|m)P(r|m)P(m|r)P(r|m)P(r|r)
= (8/20)(3/8)(5/8)(4/11)(5/8)(7/11) = 0.014.

19.2 Hidden Markov Models


In a hidden Markov model, an observation x is made at each stage, but the
state s at each stage is not observable. Although the state at each stage is
not observable, the sequence of observations is the result of state transitions
and emissions of observation from states upon arrival at each state. In addi-
tion to the initial state probabilities and the state transition probabilities,
Markov Chain Models and Hidden Markov Models 291

the probability of emitting x from each state s, P(x|s), is also defined as the
emission probability in a hidden Markov model:

∑_x P(x|s) = 1. (19.8)

It is assumed that observations are independent of each other, and that the
emission probability of x from each state s does not depend on other states.
A hidden Markov model is used to determine the probability that a given
sequence of observations, x 1, …, x N, at stages 1, …, N, is generated by the hid-
den Markov model. Using any path method (Theodoridis and Koutroumbas,
1999), this probability is computed as follows:

SN

∑P ( x , …, x |s , …, s
i =1
1 N 1i Ni ) P (s1 , … , sN )
i i

SN N

= ∑ P ( s1i ) P ( x1|s1i ) ∏P (s |s ni n − 1i ) P ( xn|sn ) ,


i


(19.9)
i =1 n= 2

Where
i is the index for a possible state sequence, s1i , … , sNi , there are totally SN
possible state sequences
P ( s1i ) is the initial state probability, P ( sni |sn −1i ) is the state transition
probability
P ( xn|sni ) is the emission probability

Figure 19.2
Any path method and the best path method for a hidden Markov model.

Figure 19.2 shows stages 1, …, N, states 1, …, S, and the observations x_1, …, x_N
at stages 1, …, N, involved in computing Equation 19.9. To perform the computation
in Equation 19.9, we define ρ(s_n) as the probability that (1) state s_n is reached at

stage n, (2) observations x_1, …, x_{n−1} have been emitted at stages 1 to n − 1, and
(3) observation x_n is emitted from state s_n at stage n. ρ(s_n) can be computed
recursively as follows:

ρ(s_n) = ∑_{s_{n−1}=1}^{S} ρ(s_{n−1}) P(s_n|s_{n−1}) P(x_n|s_n), (19.10)

ρ(s_1) = P(s_1) P(x_1|s_1). (19.11)

That is, ρ(s_n) is the sum of the probabilities that, starting from each possible
state s_{n−1} = 1, …, S at stage n − 1 with x_1, …, x_{n−1} already emitted, we transition
to state s_n at stage n, which emits x_n, as illustrated in Figure 19.2. Using
Equations 19.10 and 19.11, Equation 19.9 can be computed as follows:

∑_{i=1}^{S^N} P(x_1, …, x_N|s_{1i}, …, s_{Ni}) P(s_{1i}, …, s_{Ni}) = ∑_{s_N=1}^{S} ρ(s_N). (19.12)

Hence, in any path method, Equations 19.10 through 19.12 are used to compute
the probability of a hidden Markov model generating a sequence of observa-
tions x 1, …, xN. Any path method starts by computing all ρ(s1) for s1 = 1, …, S
using Equation 19.11, then uses ρ(s1) to compute all ρ(s2), s2 = 1, …, S using
Equation 19.10, and continues all the way to obtain all ρ(sN) for sN = 1, …, S,
which are finally used in Equation 19.12 to complete the computation.
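As a sketch (not the book's code), the any path computation of Equations 19.10 through 19.12 can be written as a single forward pass over the observations; the model is passed in as plain dictionaries:

```python
def any_path_probability(observations, states, p_init, p_trans, p_emit):
    """Any path probability of an observation sequence under a hidden
    Markov model (Equations 19.10 through 19.12)."""
    # Stage 1 (Equation 19.11)
    rho = {s: p_init[s] * p_emit[s][observations[0]] for s in states}
    # Stages 2..N (Equation 19.10); the right-hand side uses the old rho
    for x in observations[1:]:
        rho = {s: sum(rho[t] * p_trans[t][s] for t in states) * p_emit[s][x]
               for s in states}
    # Sum over the last-stage states (Equation 19.12)
    return sum(rho.values())

# Tiny check: a two-state model that surely starts in state "a" and emits
# each state's own symbol deterministically.
states = ["a", "b"]
p_init = {"a": 1.0, "b": 0.0}
p_trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
p_emit = {"a": {"A": 1.0, "B": 0.0}, "b": {"A": 0.0, "B": 1.0}}
p_ab = any_path_probability("AB", states, p_init, p_trans, p_emit)  # 1.0 * 0.5 * 1.0 = 0.5
```

With deterministic emissions the model degenerates to a Markov chain, which makes the result easy to verify by hand.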
The computational cost of any path method is high because all SN possible
state sequences/paths from stage 1 to stage N are involved in the computa-
tion. Instead of using Equation 19.9, the best path method uses Equation 19.13
to compute the probability that a given sequence of observations, x 1, …, x N, at
stages 1, …, N, is generated by the hidden Markov model:

max_{i=1}^{S^N} P(x_1, …, x_N|s_{1i}, …, s_{Ni}) P(s_{1i}, …, s_{Ni})

  = max_{i=1}^{S^N} P(s_{1i}) P(x_1|s_{1i}) ∏_{n=2}^{N} P(s_{ni}|s_{n−1,i}) P(x_n|s_{ni}). (19.13)

That is, instead of summing over all the possible state sequences in Equation
19.9 for any path method, the best path method uses the maximum prob-
ability that the sequence of observations, x 1, …, x N, is generated by any pos-
sible state sequence from stage 1 to stage N. We define β(sn) as the probability
that (1) state sn is reached at stage n through the best path, (2) observations
x 1,  …,  x n−1 have been emitted at stages 1 to n − 1, and (3) observation x n is
emitted from state s_n at stage n. β(s_n) can be computed recursively as follows
using Bellman's principle (Theodoridis and Koutroumbas, 1999):

β(s_n) = max_{s_{n−1}=1}^{S} β(s_{n−1}) P(s_n|s_{n−1}) P(x_n|s_n), (19.14)

β(s_1) = P(s_1) P(x_1|s_1). (19.15)

Equation 19.13 is computed using Equation 19.16:

max_{i=1}^{S^N} P(x_1, …, x_N|s_{1i}, …, s_{Ni}) P(s_{1i}, …, s_{Ni}) = max_{s_N=1}^{S} β(s_N). (19.16)

The Viterbi algorithm (Viterbi, 1967) is widely used to compute the logarithm
transformation of Equations 19.13 through 19.16.
The best path method requires less computational cost of storing and com-
puting the probabilities than any path method because the computation at
each stage n involves only the S best paths. However, in comparison with any
path method, the best path method is an alternative suboptimal method for
computing the probability that a given sequence of observations, x 1, …, x N, at
stages 1, …, N, is generated by the hidden Markov model, because only the
best path instead of all possible paths is used to determine the probability
of observing x 1, …, x N, given all possible paths in the hidden Markov model
that can possibly generate the observation sequence.
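The best path recursion of Equations 19.14 through 19.16 differs from the any path sketch only in replacing the sum with a maximum (an illustrative sketch; the practical Viterbi algorithm applies a logarithm transformation of these quantities to avoid numerical underflow):

```python
def best_path_probability(observations, states, p_init, p_trans, p_emit):
    """Best path probability of an observation sequence under a hidden
    Markov model (Equations 19.14 through 19.16)."""
    beta = {s: p_init[s] * p_emit[s][observations[0]] for s in states}  # Eq. 19.15
    for x in observations[1:]:
        beta = {s: max(beta[t] * p_trans[t][s] for t in states) * p_emit[s][x]
                for s in states}                                        # Eq. 19.14
    return max(beta.values())                                           # Eq. 19.16

# Example with an arbitrary two-state model; the best of the four possible
# state paths for "AB" is a -> b with probability 0.6*0.9*0.3*0.8 = 0.1296.
states = ["a", "b"]
p_init = {"a": 0.6, "b": 0.4}
p_trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
p_emit = {"a": {"A": 0.9, "B": 0.1}, "b": {"A": 0.2, "B": 0.8}}
p_best = best_path_probability("AB", states, p_init, p_trans, p_emit)
```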
Hidden Markov models have been widely used in speech recognition,
handwritten character recognition, natural language processing, DNA
sequence recognition, and so on. In the application of hidden Markov models
to handwritten digit recognition (Bishop, 2006), for recognizing the
digits 0, 1, …, 9, a hidden Markov model is built for each digit. Each digit is
considered to have a sequence of line trajectories, x 1, …, x N, at stages 1, …, N.
Each hidden Markov model has 16 latent states, each of which can emit a line
segment of a fixed length with one of 16 possible angles. Hence, the emis-
sion distribution can be specified by a 16 × 16 matrix with the probability of
emitting each of 16 angles from each of 16 states. The hidden Markov model
for each digit is trained to establish the initial probability distribution, the
transition probability matrix, and the emission probabilities using 45 hand-
written examples of the digit. Given a handwritten digit to recognize, the
probability that the handwritten digit is generated by the hidden Markov
model for each digit is computed. The handwritten digit is classified as the
digit whose hidden Markov model produces the highest probability of gen-
erating the handwritten digit.
Hence, to apply hidden Markov models to a classification problem, a hid-
den Markov model is built for each target class. Given a sequence of observa-
tions, the probability of generating this observation sequence by each hidden
Markov model is computed using any path method or the best path method.
The given observation sequence is classified into the target class whose hid-
den Markov model produces the highest probability of generating the obser-
vation sequence.

19.3  Learning Hidden Markov Models


The set of model parameters for a hidden Markov model, A, includes the
state transition probabilities, P(j|i), the initial state probabilities, P(i), and the
emission probabilities, P(x|i):

A = {P(j|i), P(i), P(x|i)}. (19.17)

The model parameters need to be learned from a training data set containing
a sequence of N observations, X = x 1, …, x N. Since the states cannot be directly
observed, Equations 19.6 and 19.7 cannot be used to learn the model param-
eters such as the state transition probabilities and the initial state probabili-
ties. Instead, the expectation maximization (EM) method is used to estimate
the model parameters, which maximize the probability of obtaining the
observation sequence from the model with the estimated model parameters,
P(X|A). The EM method has the following steps:

1. Assign initial values of the model parameters, A, and use these val-
ues to compute P(X|A).
2. Reestimate the model parameters to obtain Â, and compute P(X|Â).
3. If P(X|Â) − P(X|A) > ε, let A = Â because Â gives a higher probability of
   obtaining the observation sequence than A does, and go to Step 2;
   otherwise, stop because P(X|Â) is worse than or similar to P(X|A), and take
   A as the final set of the model parameters.

In Step 3, ε is a preset threshold of improvement in the probability of
generating the observation sequence X from the model parameters.
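The three steps can be sketched as a generic loop; `likelihood` and `reestimate` are hypothetical callables standing in for the probability computation and the Baum–Welch reestimation, not functions of any particular library:

```python
def em_fit(A, X, likelihood, reestimate, epsilon=1e-6, max_iter=100):
    """Generic EM loop for the three steps above. `likelihood(X, A)` returns
    P(X|A); `reestimate(X, A)` returns the reestimated parameters A-hat."""
    p = likelihood(X, A)                    # Step 1: score initial parameters
    for _ in range(max_iter):
        A_hat = reestimate(X, A)            # Step 2
        p_hat = likelihood(X, A_hat)
        if p_hat - p <= epsilon:            # Step 3: improvement below threshold
            break
        A, p = A_hat, p_hat                 # accept A-hat and iterate
    return A, p

# Demo with a toy one-parameter "model": the likelihood peaks at A = 3 and
# each reestimation moves halfway toward 3.
A_final, p_final = em_fit(
    0.0, None,
    likelihood=lambda X, A: -(A - 3.0) ** 2,
    reestimate=lambda X, A: A + 0.5 * (3.0 - A),
)
```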
P(X|A) and P(X|Â) in the aforementioned EM method are computed
using Equation 19.12 for any path method and Equation 19.16 for the best
path method. If an observation is discrete and thus an observation sequence
is a member of a finite set of observation sequences, the Baum–Welch rees-
timation method is used to reestimate the model parameters in Step 2
of the aforementioned EM method. Theodoridis and Koutroumbas (1999)
describe the Baum–Welch reestimation method as follows. Let θn(i, j, X|A)
be the probability that (1) the path goes through state i at stage n, (2) the
path goes through state j at the next stage n + 1, and (3) the model generates
the observation sequence X using the model parameters A. Let φn(i, X|A) be
the probability that (1) the path goes through state i at stage n, and (2) the
model generates the observation sequence X using the model parameters A.
Let ωn(i) be the probability of having the observations x n+1, …, x N at stages
n + 1, …, N, given that the path goes through state i at stage n. For any path
method, ωn(i) can be computed recursively for n = N − 1, …, 1 as follows:

ω_n(i) = P(x_{n+1}, …, x_N|s_n = i, A) = ∑_{s_{n+1}=1}^{S} ω_{n+1}(s_{n+1}) P(s_{n+1}|s_n = i) P(x_{n+1}|s_{n+1}) (19.18)

ω_N(i) = 1, i = 1, …, S. (19.19)

For the best path method, ωn(i) can be computed recursively for n = N − 1, …, 1
as follows:

ω_n(i) = P(x_{n+1}, …, x_N|s_n = i, A) = max_{s_{n+1}=1}^{S} ω_{n+1}(s_{n+1}) P(s_{n+1}|s_n = i) P(x_{n+1}|s_{n+1}) (19.20)

ω_N(i) = 1, i = 1, …, S. (19.21)

We also have

φ_n(i, X|A) = ρ_n(i) ω_n(i), (19.22)


where ρn(i) denotes ρ(sn = i), which is computed using Equations 19.10 and
19.11. The model parameter P(i) is the expected number of times that state i
occurs at stage 1, given the observation sequence X and the model param-
eters A, that is, P(i|X, A). The model parameter P(j|i) is the expected number
of times that transitions from state i to state j occur, given the observation
sequence X and the model parameters A, that is, P(i, j|X, A)/P(i|X, A). The
model parameters are reestimated as follows:

P̂(i) = P(i|X, A) = φ_1(i, X|A)/P(X|A) = ρ_1(i)ω_1(i)/P(X|A) (19.23)

P̂(j|i) = P(i, j|X, A)/P(i|X, A)

  = [∑_{n=1}^{N−1} θ_n(i, j, X|A)/P(X|A)] / [∑_{n=1}^{N−1} φ_n(i, X|A)/P(X|A)]

  = [∑_{n=1}^{N−1} ρ_n(i) P(j|i) P(x_{n+1}|j) ω_{n+1}(j)/P(X|A)] / [∑_{n=1}^{N−1} ρ_n(i) ω_n(i)/P(X|A)]

  = ∑_{n=1}^{N−1} ρ_n(i) P(j|i) P(x_{n+1}|j) ω_{n+1}(j) / ∑_{n=1}^{N−1} ρ_n(i) ω_n(i) (19.24)

P̂(x = v|i) = [∑_{n=1}^{N} φ_{n&x_n=v}(i)/P(X|A)] / [∑_{n=1}^{N} φ_n(i)/P(X|A)]

  = ∑_{n=1}^{N} ρ_{n&x_n=v}(i) ω_{n&x_n=v}(i) / ∑_{n=1}^{N} ρ_n(i) ω_n(i) (19.25)

where

φ_{n&x_n=v}(i) = φ_n(i) if x_n = v, and 0 if x_n ≠ v, (19.26)

ρ_{n&x_n=v}(i) = ρ_n(i) if x_n = v, and 0 if x_n ≠ v, (19.27)

ω_{n&x_n=v}(i) = ω_n(i) if x_n = v, and 0 if x_n ≠ v, (19.28)

and v is one of the discrete value vectors that x may take.

Example 19.2
A system has two states: misuse (m) and regular use (r), each of which
can produce one of three events: F, G, and H. A sequence of five events is
observed: FFFHG. Using any path method, perform one iteration of the
model parameter reestimation in the EM method of learning a hidden
Markov model from the observed sequence of events.
In Step 1 of the EM method, the following arbitrary values are assigned
to the model parameters initially:

P(m) = 0.4    P(r) = 0.6

P(m|m) = 0.375    P(r|m) = 0.625    P(m|r) = 0.364    P(r|r) = 0.636

P(F|m) = 0.7    P(G|m) = 0.1    P(H|m) = 0.2

P(F|r) = 0.2    P(G|r) = 0.4    P(H|r) = 0.4.

Using these model parameters, we compute P(X = FFFHG|A) using
Equations 19.10 through 19.12 for any path method:

ρ_1(m) = ρ(s_1 = m) = P(s_1 = m) P(x_1 = F|s_1 = m) = (0.4)(0.7) = 0.28

ρ_1(r) = ρ(s_1 = r) = P(s_1 = r) P(x_1 = F|s_1 = r) = (0.6)(0.2) = 0.12

ρ_2(m) = ρ(s_2 = m) = ∑_{s_1=1}^{2} ρ(s_1) P(s_2|s_1) P(x_2|s_2)
  = ρ(s_1 = m) P(s_2 = m|s_1 = m) P(x_2 = F|s_2 = m)
  + ρ(s_1 = r) P(s_2 = m|s_1 = r) P(x_2 = F|s_2 = m)
  = (0.28)(0.375)(0.7) + (0.12)(0.364)(0.7) = 0.1060

ρ_2(r) = ρ(s_2 = r) = ∑_{s_1=1}^{2} ρ(s_1) P(s_2|s_1) P(x_2|s_2)
  = ρ(s_1 = m) P(s_2 = r|s_1 = m) P(x_2 = F|s_2 = r)
  + ρ(s_1 = r) P(s_2 = r|s_1 = r) P(x_2 = F|s_2 = r)
  = (0.28)(0.625)(0.3) + (0.12)(0.636)(0.3) = 0.0754

ρ_3(m) = ρ(s_3 = m) = ∑_{s_2=1}^{2} ρ(s_2) P(s_3|s_2) P(x_3|s_3)
  = ρ(s_2 = m) P(s_3 = m|s_2 = m) P(x_3 = F|s_3 = m)
  + ρ(s_2 = r) P(s_3 = m|s_2 = r) P(x_3 = F|s_3 = m)
  = (0.1060)(0.375)(0.7) + (0.0754)(0.364)(0.7) = 0.0470

ρ_3(r) = ρ(s_3 = r) = ∑_{s_2=1}^{2} ρ(s_2) P(s_3|s_2) P(x_3|s_3)
  = ρ(s_2 = m) P(s_3 = r|s_2 = m) P(x_3 = F|s_3 = r)
  + ρ(s_2 = r) P(s_3 = r|s_2 = r) P(x_3 = F|s_3 = r)
  = (0.1060)(0.625)(0.2) + (0.0754)(0.636)(0.2) = 0.0228

ρ_4(m) = ρ(s_4 = m) = ∑_{s_3=1}^{2} ρ(s_3) P(s_4|s_3) P(x_4|s_4)
  = ρ(s_3 = m) P(s_4 = m|s_3 = m) P(x_4 = H|s_4 = m)
  + ρ(s_3 = r) P(s_4 = m|s_3 = r) P(x_4 = H|s_4 = m)
  = (0.0470)(0.375)(0.2) + (0.0228)(0.364)(0.2) = 0.0052

ρ_4(r) = ρ(s_4 = r) = ∑_{s_3=1}^{2} ρ(s_3) P(s_4|s_3) P(x_4|s_4)
  = ρ(s_3 = m) P(s_4 = r|s_3 = m) P(x_4 = H|s_4 = r)
  + ρ(s_3 = r) P(s_4 = r|s_3 = r) P(x_4 = H|s_4 = r)
  = (0.0470)(0.625)(0.4) + (0.0228)(0.636)(0.4) = 0.0176

ρ_5(m) = ρ(s_5 = m) = ∑_{s_4=1}^{2} ρ(s_4) P(s_5|s_4) P(x_5|s_5)
  = ρ(s_4 = m) P(s_5 = m|s_4 = m) P(x_5 = G|s_5 = m)
  + ρ(s_4 = r) P(s_5 = m|s_4 = r) P(x_5 = G|s_5 = m)
  = (0.0052)(0.375)(0.1) + (0.0176)(0.364)(0.1) = 0.0008

ρ_5(r) = ρ(s_5 = r) = ∑_{s_4=1}^{2} ρ(s_4) P(s_5|s_4) P(x_5|s_5)
  = ρ(s_4 = m) P(s_5 = r|s_4 = m) P(x_5 = G|s_5 = r)
  + ρ(s_4 = r) P(s_5 = r|s_4 = r) P(x_5 = G|s_5 = r)
  = (0.0052)(0.625)(0.4) + (0.0176)(0.636)(0.4) = 0.0058

P(X = FFFHG|A) = ∑_{s_5=1}^{2} ρ(s_5) = ρ(s_5 = m) + ρ(s_5 = r) = 0.0008 + 0.0058 = 0.0066.

In Step 2 of the EM method, we use Equations 19.23 through 19.25 to rees-
timate the model parameters. We first need to use Equations 19.18 and
19.19 to compute ωn(i), n = 5, 4, 3, 2, and 1, which are used in Equations
19.23 through 19.25:
ω_5(m) = 1    ω_5(r) = 1

ω_4(m) = P(x_5 = G|s_4 = m, A) = ∑_{s_5=1}^{2} ω_5(s_5) P(s_5|s_4 = m) P(x_5 = G|s_5)
  = ω_5(m) P(s_5 = m|s_4 = m) P(x_5 = G|s_5 = m)
  + ω_5(r) P(s_5 = r|s_4 = m) P(x_5 = G|s_5 = r)
  = (1)(0.375)(0.1) + (1)(0.625)(0.4) = 0.2875

ω_4(r) = P(x_5 = G|s_4 = r, A) = ∑_{s_5=1}^{2} ω_5(s_5) P(s_5|s_4 = r) P(x_5 = G|s_5)
  = ω_5(m) P(s_5 = m|s_4 = r) P(x_5 = G|s_5 = m)
  + ω_5(r) P(s_5 = r|s_4 = r) P(x_5 = G|s_5 = r)
  = (1)(0.364)(0.1) + (1)(0.636)(0.4) = 0.2908

ω_3(m) = P(x_4 = H, x_5 = G|s_3 = m, A) = ∑_{s_4=1}^{2} ω_4(s_4) P(s_4|s_3 = m) P(x_4 = H|s_4)
  = ω_4(m) P(s_4 = m|s_3 = m) P(x_4 = H|s_4 = m)
  + ω_4(r) P(s_4 = r|s_3 = m) P(x_4 = H|s_4 = r)
  = (0.2875)(0.375)(0.2) + (0.2908)(0.625)(0.4) = 0.0943

ω_3(r) = P(x_4 = H, x_5 = G|s_3 = r, A) = ∑_{s_4=1}^{2} ω_4(s_4) P(s_4|s_3 = r) P(x_4 = H|s_4)
  = ω_4(m) P(s_4 = m|s_3 = r) P(x_4 = H|s_4 = m)
  + ω_4(r) P(s_4 = r|s_3 = r) P(x_4 = H|s_4 = r)
  = (0.2875)(0.364)(0.2) + (0.2908)(0.636)(0.4) = 0.0949
ω_2(m) = P(x_3 = F, x_4 = H, x_5 = G|s_2 = m, A) = ∑_{s_3=1}^{2} ω_3(s_3) P(s_3|s_2 = m) P(x_3 = F|s_3)
  = ω_3(m) P(s_3 = m|s_2 = m) P(x_3 = F|s_3 = m)
  + ω_3(r) P(s_3 = r|s_2 = m) P(x_3 = F|s_3 = r)
  = (0.0943)(0.375)(0.7) + (0.0949)(0.625)(0.2) = 0.0366

ω_2(r) = P(x_3 = F, x_4 = H, x_5 = G|s_2 = r, A) = ∑_{s_3=1}^{2} ω_3(s_3) P(s_3|s_2 = r) P(x_3 = F|s_3)
  = ω_3(m) P(s_3 = m|s_2 = r) P(x_3 = F|s_3 = m)
  + ω_3(r) P(s_3 = r|s_2 = r) P(x_3 = F|s_3 = r)
  = (0.0943)(0.364)(0.7) + (0.0949)(0.636)(0.2) = 0.0361

ω_1(m) = P(x_2 = F, x_3 = F, x_4 = H, x_5 = G|s_1 = m, A) = ∑_{s_2=1}^{2} ω_2(s_2) P(s_2|s_1 = m) P(x_2 = F|s_2)
  = ω_2(m) P(s_2 = m|s_1 = m) P(x_2 = F|s_2 = m)
  + ω_2(r) P(s_2 = r|s_1 = m) P(x_2 = F|s_2 = r)
  = (0.0366)(0.375)(0.7) + (0.0361)(0.625)(0.2) = 0.0141

ω_1(r) = P(x_2 = F, x_3 = F, x_4 = H, x_5 = G|s_1 = r, A) = ∑_{s_2=1}^{2} ω_2(s_2) P(s_2|s_1 = r) P(x_2 = F|s_2)
  = ω_2(m) P(s_2 = m|s_1 = r) P(x_2 = F|s_2 = m)
  + ω_2(r) P(s_2 = r|s_1 = r) P(x_2 = F|s_2 = r)
  = (0.0366)(0.364)(0.7) + (0.0361)(0.636)(0.2) = 0.0139.
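The ω values above can be reproduced with a short backward recursion (a sketch, assuming the Step 1 parameters with P(F|r) taken as 0.2 so that state r's emission probabilities sum to one):

```python
# Backward recursion of Equations 19.18 and 19.19 for Example 19.2.
states = ["m", "r"]
p_trans = {"m": {"m": 0.375, "r": 0.625}, "r": {"m": 0.364, "r": 0.636}}
p_emit = {"m": {"F": 0.7, "G": 0.1, "H": 0.2},
          "r": {"F": 0.2, "G": 0.4, "H": 0.4}}   # assumes P(F|r) = 0.2
X = "FFFHG"
N = len(X)

# omega[n][i]: probability of x_{n+1}, ..., x_N given state i at stage n;
# X[n] is x_{n+1} because Python strings are 0-indexed.
omega = {N: {i: 1.0 for i in states}}                       # Equation 19.19
for n in range(N - 1, 0, -1):                               # n = 4, 3, 2, 1
    omega[n] = {i: sum(omega[n + 1][j] * p_trans[i][j] * p_emit[j][X[n]]
                       for j in states)
                for i in states}                            # Equation 19.18
```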

Now we use Equations 19.23 through 19.25 to reestimate the model
parameters:

P̂(m) = ρ_1(m)ω_1(m)/P(X = FFFHG|A) = (0.28)(0.0141)/0.0066 = 0.5982

P̂(r) = ρ_1(r)ω_1(r)/P(X = FFFHG|A) = (0.12)(0.0139)/0.0066 = 0.2527

P̂(m|m) = ∑_{n=1}^{4} ρ_n(m) P(m|m) P(x_{n+1}|m) ω_{n+1}(m) / ∑_{n=1}^{4} ρ_n(m) ω_n(m)

  = [ρ_1(m) P(m|m) P(x_2 = F|m) ω_2(m) + ρ_2(m) P(m|m) P(x_3 = F|m) ω_3(m)
     + ρ_3(m) P(m|m) P(x_4 = H|m) ω_4(m) + ρ_4(m) P(m|m) P(x_5 = G|m) ω_5(m)]
    / [ρ_1(m) ω_1(m) + ρ_2(m) ω_2(m) + ρ_3(m) ω_3(m) + ρ_4(m) ω_4(m)]

  = [(0.28)(0.375)(0.7)(0.0366) + (0.1060)(0.375)(0.7)(0.0943)
     + (0.0470)(0.375)(0.2)(0.2875) + (0.0052)(0.375)(0.1)(1)]
    / [(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875)]

  = 0.4742

P̂(r|m) = ∑_{n=1}^{4} ρ_n(m) P(r|m) P(x_{n+1}|r) ω_{n+1}(r) / ∑_{n=1}^{4} ρ_n(m) ω_n(m)

  = [ρ_1(m) P(r|m) P(x_2 = F|r) ω_2(r) + ρ_2(m) P(r|m) P(x_3 = F|r) ω_3(r)
     + ρ_3(m) P(r|m) P(x_4 = H|r) ω_4(r) + ρ_4(m) P(r|m) P(x_5 = G|r) ω_5(r)]
    / [ρ_1(m) ω_1(m) + ρ_2(m) ω_2(m) + ρ_3(m) ω_3(m) + ρ_4(m) ω_4(m)]

  = [(0.28)(0.625)(0.2)(0.0361) + (0.1060)(0.625)(0.2)(0.0949)
     + (0.0470)(0.625)(0.4)(0.2908) + (0.0052)(0.625)(0.4)(1)]
    / [(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875)]

  = 0.5262
P̂(m|r) = ∑_{n=1}^{4} ρ_n(r) P(m|r) P(x_{n+1}|m) ω_{n+1}(m) / ∑_{n=1}^{4} ρ_n(r) ω_n(r)

  = [ρ_1(r) P(m|r) P(x_2 = F|m) ω_2(m) + ρ_2(r) P(m|r) P(x_3 = F|m) ω_3(m)
     + ρ_3(r) P(m|r) P(x_4 = H|m) ω_4(m) + ρ_4(r) P(m|r) P(x_5 = G|m) ω_5(m)]
    / [ρ_1(r) ω_1(r) + ρ_2(r) ω_2(r) + ρ_3(r) ω_3(r) + ρ_4(r) ω_4(r)]

  = [(0.12)(0.364)(0.7)(0.0366) + (0.0754)(0.364)(0.7)(0.0943)
     + (0.0228)(0.364)(0.2)(0.2875) + (0.0176)(0.364)(0.1)(1)]
    / [(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0.0176)(0.2908)]

  = 0.3469

P̂(r|r) = ∑_{n=1}^{4} ρ_n(r) P(r|r) P(x_{n+1}|r) ω_{n+1}(r) / ∑_{n=1}^{4} ρ_n(r) ω_n(r)

  = [ρ_1(r) P(r|r) P(x_2 = F|r) ω_2(r) + ρ_2(r) P(r|r) P(x_3 = F|r) ω_3(r)
     + ρ_3(r) P(r|r) P(x_4 = H|r) ω_4(r) + ρ_4(r) P(r|r) P(x_5 = G|r) ω_5(r)]
    / [ρ_1(r) ω_1(r) + ρ_2(r) ω_2(r) + ρ_3(r) ω_3(r) + ρ_4(r) ω_4(r)]

  = [(0.12)(0.636)(0.2)(0.0361) + (0.0754)(0.636)(0.2)(0.0949)
     + (0.0228)(0.636)(0.4)(0.2908) + (0.0176)(0.636)(0.4)(1)]
    / [(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0.0176)(0.2908)]

  = 0.6533
P̂(x = F|m) = ∑_{n=1}^{5} ρ_{n&x_n=F}(m) ω_{n&x_n=F}(m) / ∑_{n=1}^{5} ρ_n(m) ω_n(m)

  = [ρ_1(m) ω_1(m) + ρ_2(m) ω_2(m) + ρ_3(m) ω_3(m) + (0)(0) + (0)(0)]
    / [ρ_1(m) ω_1(m) + ρ_2(m) ω_2(m) + ρ_3(m) ω_3(m) + ρ_4(m) ω_4(m) + ρ_5(m) ω_5(m)]

  = [(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0)(0) + (0)(0)]
    / [(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875) + (0.0008)(1)]

  = 0.8423

P̂(x = G|m) = ∑_{n=1}^{5} ρ_{n&x_n=G}(m) ω_{n&x_n=G}(m) / ∑_{n=1}^{5} ρ_n(m) ω_n(m)

  = [(0)(0) + (0)(0) + (0)(0) + (0)(0) + (0.0008)(1)]
    / [(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875) + (0.0008)(1)]

  = 0.0550

P̂(x = H|m) = ∑_{n=1}^{5} ρ_{n&x_n=H}(m) ω_{n&x_n=H}(m) / ∑_{n=1}^{5} ρ_n(m) ω_n(m)

  = [(0)(0) + (0)(0) + (0)(0) + (0.0052)(0.2875) + (0)(0)]
    / [(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875) + (0.0008)(1)]

  = 0.1027
P̂(x = F|r) = ∑_{n=1}^{5} ρ_{n&x_n=F}(r) ω_{n&x_n=F}(r) / ∑_{n=1}^{5} ρ_n(r) ω_n(r)

  = [(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0)(0) + (0)(0)]
    / [(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0.0176)(0.2908) + (0.0058)(1)]

  = 0.3751

P̂(x = G|r) = ∑_{n=1}^{5} ρ_{n&x_n=G}(r) ω_{n&x_n=G}(r) / ∑_{n=1}^{5} ρ_n(r) ω_n(r)

  = [(0)(0) + (0)(0) + (0)(0) + (0)(0) + (0.0058)(1)]
    / [(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0.0176)(0.2908) + (0.0058)(1)]

  = 0.3320

P̂(x = H|r) = ∑_{n=1}^{5} ρ_{n&x_n=H}(r) ω_{n&x_n=H}(r) / ∑_{n=1}^{5} ρ_n(r) ω_n(r)

  = [(0)(0) + (0)(0) + (0)(0) + (0.0176)(0.2908) + (0)(0)]
    / [(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0.0176)(0.2908) + (0.0058)(1)]

  = 0.2929
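A useful consistency check on Equations 19.23 through 19.25 is that each reestimated distribution sums to one, as the reestimates above do up to rounding (0.4742 + 0.5262 ≈ 1 and 0.3751 + 0.3320 + 0.2929 = 1). The sketch below recomputes the forward and backward values and the reestimates for the Step 1 parameters, assuming P(F|r) = 0.2 so that state r's emissions sum to one; its numbers may therefore differ slightly from the hand computation, but the normalization property holds exactly:

```python
# One Baum-Welch reestimation (Equations 19.24 and 19.25) with a check that
# every reestimated probability distribution sums to one.
states = ["m", "r"]
symbols = ["F", "G", "H"]
p_init = {"m": 0.4, "r": 0.6}
p_trans = {"m": {"m": 0.375, "r": 0.625}, "r": {"m": 0.364, "r": 0.636}}
p_emit = {"m": {"F": 0.7, "G": 0.1, "H": 0.2},
          "r": {"F": 0.2, "G": 0.4, "H": 0.4}}   # assumes P(F|r) = 0.2
X = "FFFHG"
N = len(X)

# Forward values rho (Equations 19.10 and 19.11), indexed by stage 1..N.
rho = {1: {i: p_init[i] * p_emit[i][X[0]] for i in states}}
for n in range(2, N + 1):
    rho[n] = {i: sum(rho[n - 1][j] * p_trans[j][i] for j in states)
                 * p_emit[i][X[n - 1]]
              for i in states}

# Backward values omega (Equations 19.18 and 19.19).
omega = {N: {i: 1.0 for i in states}}
for n in range(N - 1, 0, -1):
    omega[n] = {i: sum(omega[n + 1][j] * p_trans[i][j] * p_emit[j][X[n]]
                       for j in states)
                for i in states}

# Reestimated transition probabilities (Equation 19.24).
new_trans = {}
for i in states:
    den = sum(rho[n][i] * omega[n][i] for n in range(1, N))
    new_trans[i] = {j: sum(rho[n][i] * p_trans[i][j] * p_emit[j][X[n]]
                           * omega[n + 1][j] for n in range(1, N)) / den
                    for j in states}

# Reestimated emission probabilities (Equation 19.25).
new_emit = {}
for i in states:
    den = sum(rho[n][i] * omega[n][i] for n in range(1, N + 1))
    new_emit[i] = {v: sum(rho[n][i] * omega[n][i]
                          for n in range(1, N + 1) if X[n - 1] == v) / den
                   for v in symbols}
```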

19.4  Software and Applications


The Hidden Markov Model Toolkit (HTK) (http://htk.eng.cam.ac.uk/)
supports hidden Markov models. Ye and her colleagues (Ye, 2008; Ye et al.,
2002c, 2004b) describe the application of Markov chain models to cyber
attack detection. Rabiner (1989) gives a review on applications of hidden
Markov models to speech recognition.

Exercises
19.1 Given the Markov chain model in Example 19.1, determine the prob-
ability of observing a sequence of system states: rmmrmrrmrrrrrrmmm.
19.2 A system has two states, misuse (m) and regular use (r), each of which
can produce one of three events: F, G, and H. A hidden Markov model
for the system has the initial state probabilities and state transition
probabilities given from Example 19.1, and the state emission probabili-
ties as follows:

P(F|m) = 0.1    P(G|m) = 0.3    P(H|m) = 0.6

P(F|r) = 0.5    P(G|r) = 0.2    P(H|r) = 0.3.
Use any path method to determine the probability of observing a
sequence of five events: GHFFH.
19.3 Given the hidden Markov model in Exercise 19.2, use the best path
method to determine the probability of observing a sequence of five
events: GHFFH.
20
Wavelet Analysis

Many objects have a periodic behavior and thus show a unique characteristic
in the frequency domain. For example, human sounds have a range of frequencies
that are different from those of some animals. Objects in the space, including
the earth, move at different frequencies. A new object in the space can
be discovered by observing its unique movement frequency, which is different
from those of known objects. Hence, the frequency characteristic of
an object can be useful in identifying an object. Wavelet analysis represents
time series data in the time–frequency domain using data characteristics
over time in various frequencies, and thus allows us to uncover temporal
data patterns at various frequencies. There are many forms of wavelets, e.g.,
Haar, Daubechies, and derivative of Gaussian (DoG). In this chapter, we use
the Haar wavelet to explain how wavelet analysis works to transform time
series data to data in the time–frequency domain. A list of software packages
that support wavelet analysis is provided. Some applications of wavelet
analysis are given with references.

20.1  Definition of Wavelet


A wavelet form is defined by two functions: the scaling function φ(x) and the
wavelet function ψ(x). The scaling function of the Haar wavelet is a step
function (Boggess and Narcowich, 2001; Vidakovic, 1999), as shown in Figure 20.1:

φ(x) = 1 if 0 ≤ x < 1, and 0 otherwise. (20.1)

The wavelet function of the Haar wavelet is defined using the scaling function
(Boggess and Narcowich, 2001; Vidakovic, 1999), as shown in Figure 20.1:

ψ(x) = φ(2x) − φ(2x − 1) = 1 if 0 ≤ x < 1/2, and −1 if 1/2 ≤ x < 1. (20.2)


Figure 20.1
The scaling function and the wavelet function of the Haar wavelet and the dilation and shift
effects.

Hence, the wavelet function of the Haar wavelet represents the change of
the function value from 1 to −1 in [0, 1). The function φ(2x) in Formula 20.2
is a step function with the height of 1 for the range of x values in [0, ½), as
shown in Figure 20.1. In general, the parameter a before x in φ(ax) produces a
dilation effect on the range of x values, widening or contracting the x range
by 1/a, as shown in Figure 20.1. The function φ(2x − 1) is also a step function
with the height of 1 for the range of x values in [½, 1). In general, the param-
eter b in φ(x + b) produces a shift effect on the range of x values, moving the
x range by b, as shown in Figure 20.1. Hence, φ(ax + b) defines a step function
with the height of 1 for x values in the range of [−b/a, (1 − b)/a), as shown next,
given a > 0:

0 ≤ ax + b < 1

−b/a ≤ x < (1 − b)/a.
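Equations 20.1 and 20.2 and the dilation and shift behavior can be transcribed directly (an illustrative sketch, not code from the book):

```python
def phi(x):
    """Haar scaling function (Equation 20.1): 1 on [0, 1), 0 elsewhere."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def psi(x):
    """Haar wavelet function (Equation 20.2): phi(2x) - phi(2x - 1)."""
    return phi(2.0 * x) - phi(2.0 * x - 1.0)

# Dilation and shift: phi(a*x + b) is a step of height 1 on [-b/a, (1 - b)/a)
# for a > 0, so phi(2*x - 1) is 1 exactly on [1/2, 1).
```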

20.2  Wavelet Transform of Time Series Data


Given time series data with the function as shown in Figure 20.2a and a
sample of eight data points 0, 2, 0, 2, 6, 8, 6, 8 taken from this function at the
time locations 0, 1/8, 2/8, 3/8, 4/8, 5/8, 6/8, 7/8, respectively, at the time
interval of 1/8, or at the frequency of 8, as shown in Figure 20.2b:

a_i, i = 0, 1, …, 2^k − 1, k = 3, or

a_0 = 0, a_1 = 2, a_2 = 0, a_3 = 2, a_4 = 6, a_5 = 8, a_6 = 6, a_7 = 8,

the function can be approximated using the sample of the data points and
the scaling function of the Haar wavelet as follows:
f(x) = ∑_{i=0}^{2^k−1} a_i φ(2^k x − i) (20.3)
Figure 20.2
A sample of time series data from (a) a function, (b) a sample of data points taken from a
function, and (c) an approximation of the function using the scaling function of Haar wavelet.

f(x) = a_0 φ(2³x − 0) + a_1 φ(2³x − 1) + a_2 φ(2³x − 2) + a_3 φ(2³x − 3) + a_4 φ(2³x − 4)
     + a_5 φ(2³x − 5) + a_6 φ(2³x − 6) + a_7 φ(2³x − 7)

f(x) = 0φ(2³x) + 2φ(2³x − 1) + 0φ(2³x − 2) + 2φ(2³x − 3) + 6φ(2³x − 4)
     + 8φ(2³x − 5) + 6φ(2³x − 6) + 8φ(2³x − 7).

In Formula 20.3, a_i φ(2^k x − i) defines a step function with the height of a_i for x
values in the range of [i/2^k, (i + 1)/2^k). Figure 20.2c shows the approximation
of the function using the step functions at the height of the eight data points.
Considering the first two step functions in Formula 20.3, φ(2^k x) and
φ(2^k x − 1), which have the value of 1 for the x values in [0, 1/2^k) and [1/2^k, 2/2^k),
respectively, we have the following relationships:

φ(2^{k−1} x) = φ(2^k x) + φ(2^k x − 1) (20.4)

ψ(2^{k−1} x) = φ(2^k x) − φ(2^k x − 1). (20.5)

φ(2^{k−1} x) in Equation 20.4 has the value of 1 for the x values in [0, 1/2^{k−1}),
which covers [0, 1/2^k) and [1/2^k, 2/2^k) together. ψ(2^{k−1} x) in Equation 20.5 also
covers [0, 1/2^k) and [1/2^k, 2/2^k) together but has the value of 1 for the x values
in [0, 1/2^k) and −1 for the x values in [1/2^k, 2/2^k). An equivalent form of
Equations 20.4 and 20.5 is obtained by adding Equations 20.4 and 20.5 and by
subtracting Equation 20.5 from Equation 20.4:

φ(2^k x) = (1/2)[φ(2^{k−1} x) + ψ(2^{k−1} x)] (20.6)

φ(2^k x − 1) = (1/2)[φ(2^{k−1} x) − ψ(2^{k−1} x)]. (20.7)

At the left-hand side of Equations 20.6 and 20.7, we look at the data points
at the time interval of 1/2^k or the frequency of 2^k. At the right-hand side of
Equations 20.6 and 20.7, we look at the data points at the larger time interval
of 1/2^{k−1} or a lower frequency of 2^{k−1}.
In general, considering the two step functions in Formula 20.3, φ(2kx − i)
and φ(2kx − i − 1), which have the value of 1 for the x values in [i/2k, (i + 1)/2k)
and [(i + 1)/2k, (i + 2)/2k), respectively, we have the following relationships:

φ(2^{k−1} x − i/2) = φ(2^k x − i) + φ(2^k x − i − 1) (20.8)

ψ(2^{k−1} x − i/2) = φ(2^k x − i) − φ(2^k x − i − 1). (20.9)

φ(2^{k−1} x − i/2) in Equation 20.8 has the value of 1 for the x values in [i/2^k, (i + 2)/2^k)
or [i/2^k, i/2^k + 1/2^{k−1}) with the time interval of 1/2^{k−1}. ψ(2^{k−1} x − i/2) in Equation
20.9 has the value of 1 for the x values in [i/2^k, (i + 1)/2^k) and −1 for the x values
in [(i + 1)/2^k, (i + 2)/2^k). An equivalent form of Equations 20.8 and 20.9 is

φ(2^k x − i) = (1/2)[φ(2^{k−1} x − i/2) + ψ(2^{k−1} x − i/2)] (20.10)

φ(2^k x − i − 1) = (1/2)[φ(2^{k−1} x − i/2) − ψ(2^{k−1} x − i/2)]. (20.11)

At the left-hand side of Equations 20.10 and 20.11, we look at the data points
at the time interval of 1/2k or the frequency of 2k. At the right-hand side of
Equations 20.10 and 20.11, we look at the data points at the larger time inter-
val of 1/2k−1 or a lower frequency of 2k−1.
Equations 20.10 and 20.11 allow us to perform the wavelet transform of
time series data or their function representation in Formula 20.3 into data at
various frequencies as illustrated through Example 20.1.
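The repeated application of Equations 20.10 and 20.11 amounts to recursive pairwise averaging and differencing. A sketch using the book's unnormalized convention, in which each pair (a, b) is replaced by the average (a + b)/2 and the half-difference (a − b)/2:

```python
def haar_transform(data):
    """Unnormalized Haar wavelet transform of a sequence of length 2^k.
    Returns [overall average, coarsest detail, ..., finest details]."""
    coeffs = []
    a = list(data)
    while len(a) > 1:
        averages = [(a[i] + a[i + 1]) / 2 for i in range(0, len(a), 2)]
        details = [(a[i] - a[i + 1]) / 2 for i in range(0, len(a), 2)]
        coeffs = details + coeffs   # finer-frequency details sit to the right
        a = averages
    return a + coeffs
```

For the sample 0, 2, 0, 2, 6, 8, 6, 8, this yields 4, −3, 0, 0, −1, −1, −1, −1, the coefficients derived by hand in Example 20.1.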

Example 20.1
Perform the Haar wavelet transform of time series data 0, 2, 0, 2, 6, 8, 6, 8.
First, we represent the time series data using the scaling function of the
Haar wavelet:

f(x) = ∑_{i=0}^{2^k−1} a_i φ(2^k x − i)

f(x) = 0φ(2³x) + 2φ(2³x − 1)
     + 0φ(2³x − 2) + 2φ(2³x − 3)
     + 6φ(2³x − 4) + 8φ(2³x − 5)
     + 6φ(2³x − 6) + 8φ(2³x − 7).

Then, we use Equations 20.10 and 20.11 to transform the aforementioned
function. When performing the wavelet transform of the aforementioned
function, we use i = 0 and i + 1 = 1 for the first pair of the
scaling functions at the right-hand side of the aforementioned function,

i = 2 and i + 1 = 3 for the second pair, i = 4 and i + 1 = 5 for the third pair,
and i = 6 and i + 1 = 7 for the fourth pair:

1  2 0  0  1  0  0 
f ( x) = 0 × ϕ  2 x −  + ψ  22 x −   + 2 × ϕ  2k − 1 x −  − ψ  2k − 1 x −  
2   2  2  2  2  2 

1  2 2  2  1  2  2 
+0 × ϕ  2 x −  + ψ  22 x −   + 2 × ϕ  2k − 1 x −  − ψ  2k − 1 x −  
2   2  2  2  2  2 

1  2 4  4  1  4  4 
+6 × ϕ  2 x −  + ψ  22 x −   + 8 × ϕ  2k − 1 x −  − ψ  2k − 1 x −  
2   2  2  2  2  2 

1  2 6  2 6  1   k −1 6  k −1 6 
+6 × ϕ2 x −  + ψ  2 x −   + 8 × ϕ  2 x −  − ψ  2 x − 
2   2 2  2 2 2  

f(x) = 0 × (1/2)[ϕ(2^2 x) + ψ(2^2 x)] + 2 × (1/2)[ϕ(2^2 x) − ψ(2^2 x)]
     + 0 × (1/2)[ϕ(2^2 x − 1) + ψ(2^2 x − 1)] + 2 × (1/2)[ϕ(2^2 x − 1) − ψ(2^2 x − 1)]
     + 6 × (1/2)[ϕ(2^2 x − 2) + ψ(2^2 x − 2)] + 8 × (1/2)[ϕ(2^2 x − 2) − ψ(2^2 x − 2)]
     + 6 × (1/2)[ϕ(2^2 x − 3) + ψ(2^2 x − 3)] + 8 × (1/2)[ϕ(2^2 x − 3) − ψ(2^2 x − 3)]

f(x) = (0 × 1/2 + 2 × 1/2)ϕ(2^2 x) + (0 × 1/2 − 2 × 1/2)ψ(2^2 x)
     + (0 × 1/2 + 2 × 1/2)ϕ(2^2 x − 1) + (0 × 1/2 − 2 × 1/2)ψ(2^2 x − 1)
     + (6 × 1/2 + 8 × 1/2)ϕ(2^2 x − 2) + (6 × 1/2 − 8 × 1/2)ψ(2^2 x − 2)
     + (6 × 1/2 + 8 × 1/2)ϕ(2^2 x − 3) + (6 × 1/2 − 8 × 1/2)ψ(2^2 x − 3)

f(x) = ϕ(2^2 x) − ψ(2^2 x)
     + ϕ(2^2 x − 1) − ψ(2^2 x − 1)
     + 7ϕ(2^2 x − 2) − ψ(2^2 x − 2)
     + 7ϕ(2^2 x − 3) − ψ(2^2 x − 3)

f(x) = ϕ(2^2 x) + ϕ(2^2 x − 1) + 7ϕ(2^2 x − 2) + 7ϕ(2^2 x − 3)
     − ψ(2^2 x) − ψ(2^2 x − 1) − ψ(2^2 x − 2) − ψ(2^2 x − 3).

We use Equations 20.10 and 20.11 to transform the first line of the afore-
mentioned function:

f(x) = 1 × (1/2)[ϕ(2^1 x) + ψ(2^1 x)] + 1 × (1/2)[ϕ(2^1 x) − ψ(2^1 x)]
     + 7 × (1/2)[ϕ(2^1 x − 1) + ψ(2^1 x − 1)] + 7 × (1/2)[ϕ(2^1 x − 1) − ψ(2^1 x − 1)]
     − ψ(2^2 x) − ψ(2^2 x − 1) − ψ(2^2 x − 2) − ψ(2^2 x − 3)

f(x) = (1/2 + 1/2)ϕ(2x) + (1/2 − 1/2)ψ(2x) + (7/2 + 7/2)ϕ(2x − 1) + (7/2 − 7/2)ψ(2x − 1)
     − ψ(2^2 x) − ψ(2^2 x − 1) − ψ(2^2 x − 2) − ψ(2^2 x − 3)

f(x) = ϕ(2x) + 7ϕ(2x − 1)
     + 0ψ(2x) + 0ψ(2x − 1)
     − ψ(2^2 x) − ψ(2^2 x − 1) − ψ(2^2 x − 2) − ψ(2^2 x − 3).

Again, we use Equations 20.10 and 20.11 to transform the first line of the
aforementioned function:

f(x) = 1 × (1/2)[ϕ(x) + ψ(x)] + 7 × (1/2)[ϕ(x) − ψ(x)]
     + 0ψ(2x) + 0ψ(2x − 1) − ψ(2^2 x) − ψ(2^2 x − 1)
     − ψ(2^2 x − 2) − ψ(2^2 x − 3)

f(x) = (1/2 + 7/2)ϕ(x) + (1/2 − 7/2)ψ(x)
     + 0ψ(2x) + 0ψ(2x − 1) − ψ(2^2 x) − ψ(2^2 x − 1)
     − ψ(2^2 x − 2) − ψ(2^2 x − 3)

f(x) = 4ϕ(x) − 3ψ(x) + 0ψ(2x) + 0ψ(2x − 1)
     − ψ(2^2 x) − ψ(2^2 x − 1) − ψ(2^2 x − 2) − ψ(2^2 x − 3).    (20.12)


The function in Equation 20.12 gives the final result of the Haar wave-
let transform. The function has eight terms, as the original data sample
has eight data points. The first term, 4φ(x), represents a step function
at the height of 4 for x in [0, 1) and gives the average of the original data points, 0, 2, 0, 2, 6, 8, 6, 8. The second term, −3ψ(x), has the wavelet function ψ(x), which represents a step change of the function value from 1 to −1, or a step change of −2, as the x values go from the first half of the range [0, ½) to the second half of the range [½, 1). Hence, the second term, −3ψ(x), reveals that the original time series data have a step change of (−3) × (−2) = 6 from the first half set of four data points to the second half set of four data points, as the average of the first four data points is 1 and the average of the last four data points is 7. The third term, 0ψ(2x), indicates that the original time series data have no step change from the first and second data points to the third and fourth data points, as the average of the first and second data points is 1 and the average of the third and fourth data points is 1. The fourth term, 0ψ(2x − 1), indicates that the original time series data have no step change from the fifth and sixth data points to the seventh and eighth data points, as the average of the fifth and sixth data points is 7 and the average of the seventh and eighth data points is 7. The fifth, sixth, seventh, and eighth terms of the function in Equation 20.12, −ψ(2^2 x), −ψ(2^2 x − 1), −ψ(2^2 x − 2), and −ψ(2^2 x − 3), reveal that the original time series data have a step change of (−1) × (−2) = 2 from the first data point of 0 to the second data point of 2, a step change of 2 from the third data point of 0 to the fourth data point of 2, a step change of 2 from the fifth data point of 6 to the sixth data point of 8, and a step change of 2 from the seventh data point of 6 to the eighth data point of 8. Hence, the Haar wavelet transform of eight data points in the original time series data produces eight terms, with the coefficient of the scaling function ϕ(x) revealing the average of the original data, the coefficient of the wavelet function ψ(x) revealing the step change in the original data at the lowest frequency (from the first half set of four data points to the second half set of four data points), the coefficients of the wavelet functions ψ(2x) and ψ(2x − 1) revealing the step changes in the original data at the higher frequency of every two data points, and the coefficients of the wavelet functions ψ(2^2 x), ψ(2^2 x − 1), ψ(2^2 x − 2), and ψ(2^2 x − 3) revealing the step changes in the original data at the highest frequency of every data point.
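The repeated averaging and differencing carried out symbolically above can be collected into a short routine. This is an illustrative sketch (our code, not the book's): for the sample 0, 2, 0, 2, 6, 8, 6, 8 it returns the coefficients 4, −3, 0, 0, −1, −1, −1, −1 of Equation 20.12, ordered from the overall average to the highest-frequency wavelet coefficients:

```python
def haar_transform(data):
    """Unnormalized Haar wavelet transform of a series whose length is a
    power of 2. Returns [overall average (phi coefficient),
    wavelet (psi) coefficients from lowest to highest frequency]."""
    coeffs = list(data)
    detail_levels = []
    while len(coeffs) > 1:
        averages = [(coeffs[i] + coeffs[i + 1]) / 2 for i in range(0, len(coeffs), 2)]
        details = [(coeffs[i] - coeffs[i + 1]) / 2 for i in range(0, len(coeffs), 2)]
        detail_levels.insert(0, details)  # lower frequencies come first
        coeffs = averages
    result = coeffs  # the single remaining average, the coefficient of phi(x)
    for details in detail_levels:
        result = result + details
    return result

print(haar_transform([0, 2, 0, 2, 6, 8, 6, 8]))
# [4.0, -3.0, 0.0, 0.0, -1.0, -1.0, -1.0, -1.0]
```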
Hence, the Haar wavelet transform of time series data allows us to transform time series data into data in the time–frequency domain and observe the characteristics of the wavelet data pattern (e.g., a step change for the Haar wavelet) in the time–frequency domain. For example, the wavelet transform of the time series data 0, 2, 0, 2, 6, 8, 6, 8 in Equation 20.12 reveals that the data have an average of 4, a step increase of 6 at four data points (the lowest frequency of step change), no step change at every two data points (the medium frequency of step change), and a step increase of 2 at every data point (the highest frequency of step change). In addition to the Haar wavelet, which captures the data pattern of a step change, there are many other wavelet forms, for example, the Paul wavelet, the DoG wavelet, the Daubechies wavelet, and the Morlet wavelet shown in Figure 20.3, which capture other types of data patterns. Many wavelet forms have been developed so that an
appropriate wavelet form can be selected to give a close match to the data pattern of time series data. For example, the Daubechies wavelet (Daubechies, 1990) may be used to perform the wavelet transform of time series data that show a data pattern of linear increase or linear decrease. The Paul and DoG wavelets may be used for time series data that show wave-like data patterns.

Figure 20.3
Graphic illustration of the Paul wavelet, the DoG wavelet, the Daubechies wavelet, and the Morlet wavelet. (Ye, N., Secure Computer and Network Systems: Modeling, Analysis and Design, 2008, Figure 11.2, p. 200. Copyright Wiley-VCH Verlag GmbH & Co. KGaA. Reproduced with permission.) [Plots omitted.]

20.3 Reconstruction of Time Series Data from Wavelet Coefficients
Equations 20.8 and 20.9, which are repeated next, can be used to reconstruct the time series data from the wavelet coefficients:

ϕ(2^(k−1) x − i/2) = ϕ(2^k x − i) + ϕ(2^k x − i − 1)    (20.8)

ψ(2^(k−1) x − i/2) = ϕ(2^k x − i) − ϕ(2^k x − i − 1)    (20.9)

Example 20.2
Reconstruct time series data from the wavelet coefficients in Equation
20.12, which is repeated next:

f ( x) = 4ϕ( x)
− 3ψ( x)
+ 0ψ(2x) + 0ψ(2x − 1)
− ψ(22 x) − ψ(22 x − 1) − ψ(22 x − 2) − ψ(22 x − 3)

f ( x) = 4 × ϕ(21 x) + ϕ(21 x − 1)

− 3 × ϕ(21 x) − ϕ(21 x − 1)

+ 0 × ϕ(22 x) − ϕ(22 x − 1) + 0 × ϕ(22 x − 2) − ϕ(22 x − 3)

− ϕ(23 x) − ϕ(23 x − 1) − ϕ(23 x − 2) − ϕ(23 x − 3) − ϕ(23 x − 4) − ϕ(23 x − 5)

− ϕ(23 x − 6) − ϕ(23 x − 7 )

f ( x) = ϕ(2x) + 7 ϕ(2x − 1)
− ϕ(23 x) + ϕ(23 x − 1) − ϕ(23 x − 2) + ϕ(23 x − 3) − ϕ(23 x − 4)

+ ϕ(23 x − 5) − ϕ(23 x − 6) + ϕ(23 x − 7 )

f ( x) = ϕ(22 x) + ϕ(22 x − 1) + 7 × ϕ(22 x − 2) + ϕ(22 x − 3)

− ϕ(23 x) + ϕ(23 x − 1) − ϕ(23 x − 2) + ϕ(23 x − 3) − ϕ(23 x − 4) + ϕ(23 x − 5)

− ϕ(23 x − 6) + ϕ(23 x − 7 )

f(x) = ϕ(2^2 x) + ϕ(2^2 x − 1) + 7ϕ(2^2 x − 2) + 7ϕ(2^2 x − 3)
     − ϕ(2^3 x) + ϕ(2^3 x − 1) − ϕ(2^3 x − 2) + ϕ(2^3 x − 3) − ϕ(2^3 x − 4) + ϕ(2^3 x − 5)
     − ϕ(2^3 x − 6) + ϕ(2^3 x − 7)

f(x) = [ϕ(2^3 x) + ϕ(2^3 x − 1)] + [ϕ(2^3 x − 2) + ϕ(2^3 x − 3)]
     + 7 × [ϕ(2^3 x − 4) + ϕ(2^3 x − 5)] + 7 × [ϕ(2^3 x − 6) + ϕ(2^3 x − 7)]
     − ϕ(2^3 x) + ϕ(2^3 x − 1) − ϕ(2^3 x − 2) + ϕ(2^3 x − 3) − ϕ(2^3 x − 4)
     + ϕ(2^3 x − 5) − ϕ(2^3 x − 6) + ϕ(2^3 x − 7)

f(x) = 0ϕ(2^3 x) + 2ϕ(2^3 x − 1)
     + 0ϕ(2^3 x − 2) + 2ϕ(2^3 x − 3)
     + 6ϕ(2^3 x − 4) + 8ϕ(2^3 x − 5)
     + 6ϕ(2^3 x − 6) + 8ϕ(2^3 x − 7).

Taking the coefficients of the scaling functions at the right-hand side of the last equation gives us the original sample of time series data, 0, 2, 0, 2, 6, 8, 6, 8.
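The level-by-level substitutions of Example 20.2 can likewise be sketched in code. This illustrative routine (our names, not the book's) applies Equations 20.8 and 20.9: each scaling value at a lower frequency splits into average plus difference and average minus difference at the next higher frequency:

```python
def haar_reconstruct(coeffs):
    """Invert the unnormalized Haar transform: coeffs lists the overall
    average followed by wavelet coefficients from lowest to highest frequency."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        # Equations 20.8 and 20.9: each value v with detail d expands
        # into the pair (v + d, v - d) at the next higher frequency.
        values = [x for v, d in zip(values, details) for x in (v + d, v - d)]
        pos += len(details)
    return values

print(haar_reconstruct([4, -3, 0, 0, -1, -1, -1, -1]))
# [0, 2, 0, 2, 6, 8, 6, 8]
```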

20.4  Software and Applications


Wavelet analysis is supported in software packages including Statistica
(www.statistica.com) and MATLAB® (www.mathworks.com). As discussed in
Section 20.2, the wavelet transform can be applied to uncover characteris-
tics of certain data patterns in the time–frequency domain. For example, by
examining the time location and frequency of the Haar wavelet coefficient
with the largest magnitude, the biggest rise of the New York Stock Exchange
Index for the 6-year period of 1981–1987 was detected to occur from the first
3 years to the next 3 years (Boggess and Narcowich, 2001). The application
of the Haar, Paul, DoG, Daubechies, and Morlet wavelet to computer and
network data can be found in Ye (2008, Chapter 11).
The wavelet transform is also useful for many other types of applications,
including noise reduction and filtering, data compression, and edge detection (Boggess and Narcowich, 2001). Noise reduction and filtering are usually done by setting to zero the wavelet coefficients in a certain frequency range, which is considered to characterize noise in a given environment (e.g., the highest frequency for white noise, or a given range of frequencies for
machine-generated noise in an airplane cockpit, if the pilot's voice is the signal of interest). Those wavelet coefficients, along with the other, unchanged wavelet coefficients, are then used to reconstruct the signal with the noise removed.
Data compression is usually done by retaining the wavelet coefficients with large magnitudes, or the wavelet coefficients at certain frequencies that are considered to represent the signal. Those wavelet coefficients, with all other wavelet coefficients set to zero, are used to reconstruct the signal data. If the signal data are transmitted from one place to another and both places know the given frequencies that contain the signal, only the small set of wavelet coefficients in the given frequencies needs to be transmitted to achieve data compression. Edge detection looks for the largest wavelet coefficients and uses their time locations and frequencies to detect the largest change(s) or discontinuities in data (e.g., a sharp edge between a light shade and a dark shade in an image, which may be used to detect an object such as a person in a hallway).
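As a concrete sketch of the noise-reduction recipe just described (our illustrative code and threshold, not the book's), the coefficients of Equation 20.12 can be thresholded and the series rebuilt; zeroing the small highest-frequency coefficients leaves a smoothed version of the original data:

```python
def haar_reconstruct(coeffs):
    """Invert the unnormalized Haar transform: coeffs lists the overall
    average followed by wavelet coefficients from lowest to highest frequency."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        # Each value v with detail d expands into the pair (v + d, v - d).
        values = [x for v, d in zip(values, details) for x in (v + d, v - d)]
        pos += len(details)
    return values

coeffs = [4, -3, 0, 0, -1, -1, -1, -1]  # Equation 20.12
threshold = 1.5  # assumed noise threshold, for illustration only
denoised = [coeffs[0]] + [c if abs(c) >= threshold else 0 for c in coeffs[1:]]
print(denoised)                    # [4, -3, 0, 0, 0, 0, 0, 0]
print(haar_reconstruct(denoised))  # [1, 1, 1, 1, 7, 7, 7, 7]
```

Data compression keeps the large-magnitude coefficients in the same way, and edge detection would instead report the time location and frequency of the largest-magnitude wavelet coefficient.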

Exercises
20.1 Perform the Haar wavelet transform of time series data 2.5, 0.5, 4.5, 2.5,
−1, 1, 2, 6 and explain the meaning of each coefficient in the result of the
Haar wavelet transform.
20.2 The Haar wavelet transform of given time series data produces the fol-
lowing wavelet coefficients:

f(x) = 2.25ϕ(x)
     + 0.25ψ(x)
     − 1ψ(2x) − 2ψ(2x − 1)
     + ψ(2^2 x) + ψ(2^2 x − 1) − ψ(2^2 x − 2) − 2ψ(2^2 x − 3).

Reconstruct the original time series data using these coefficients.
20.3 After setting to zero the coefficients whose absolute value is smaller than 1.5 in the Haar wavelet transform from Exercise 20.2, we have the following wavelet coefficients:

f(x) = 2.25ϕ(x)
     + 0ψ(x)
     + 0ψ(2x) − 2ψ(2x − 1)
     + 0ψ(2^2 x) + 0ψ(2^2 x − 1) + 0ψ(2^2 x − 2) − 2ψ(2^2 x − 3).

Reconstruct the time series data using these coefficients.


References

Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large
databases. In Proceedings of the 20th International Conference on Very Large Data
Bases, Santiago, Chile, pp. 487–499.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.
Boggess, A. and Narcowich, F. J. 2001. The First Course in Wavelets with Fourier Analysis.
Upper Saddle River, NJ: Prentice Hall.
Box, G.E.P. and Jenkins, G. 1976. Time Series Analysis: Forecasting and Control. Oakland,
CA: Holden-Day.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and
Regression Trees. Boca Raton, FL: CRC Press.
Bryc, W. 1995. The Normal Distribution: Characterizations with Applications. New York:
Springer-Verlag.
Burges, C. J. C. 1998. A tutorial on support vector machines for pattern recognition.
Data Mining and Knowledge Discovery, 2, 121–167.
Chou, Y.-M., Mason, R. L., and Young, J. C. 1999. Power comparisons for a Hotelling’s
T2 statistic. Communications of Statistical Simulation, 28(4), 1031–1050.
Daubechies, I. 1990. The wavelet transform, time-frequency localization and signal
analysis. IEEE Transactions on Information Theory, 36(5), 96–101.
Davis, G. A. 2003. Bayesian reconstruction of traffic accidents. Law, Probability and
Risk, 2(2), 69–89.
Díez, F. J., Mira, J., Iturralde, E., and Zubillaga, S. 1997. DIAVAL, a Bayesian expert
system for echocardiography. Artificial Intelligence in Medicine, 10, 59–73.
Emran, S. M. and Ye, N. 2002. Robustness of chi-square and Canberra techniques in
detecting intrusions into information systems. Quality and Reliability Engineering
International, 18(1), 19–28.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for
discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han,
U. M. Fayyad (eds.) Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD-96), Portland, OR, AAAI Press, pp. 226–231.
Everitt, B. S. 1979. A Monte Carlo investigation of the Robustness of Hotelling’s one-
and two-sample T2 tests. Journal of American Statistical Association, 74(365), 48–51.
Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Hartigan, J. A. and Hartigan, P. M. 1985. The DIP test of unimodality. The Annals of
Statistics, 13, 70–84.
Jiang, X. and Cooper, G. F. 2010. A Bayesian spatio-temporal method for disease
outbreak detection. Journal of American Medical Informatics Association, 17(4),
462–471.
Johnson, R. A. and Wichern, D. W. 1998. Applied Multivariate Statistical Analysis. Upper
Saddle River, NJ: Prentice Hall.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps.
Biological Cybernetics, 43, 59–69.


Kruskal, J. B. 1964a. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1–27.
Kruskal, J. B. 1964b. Non-metric multidimensional scaling: A numerical method.
Psychometrika, 29(1), 115–129.
Li, X. and Ye, N. 2001. Decision tree classifiers for computer intrusion detection.
Journal of Parallel and Distributed Computing Practices, 4(2), 179–190.
Li, X. and Ye, N. 2002. Grid- and dummy-cluster-based learning of normal and intru-
sive clusters for computer intrusion detection. Quality and Reliability Engineering
International, 18(3), 231–242.
Li, X. and Ye, N. 2005. A supervised clustering algorithm for mining normal and intru-
sive activity patterns in computer intrusion detection. Knowledge and Information
Systems, 8(4), 498–509.
Li, X. and Ye, N. 2006. A supervised clustering and classification algorithm for mining
data with mixed variables. IEEE Transactions on Systems, Man, and Cybernetics,
Part A, 36(2), 396–406.
Liu, Y. and Weisberg, R. H. 2005. Patterns of ocean current variability on the West
Florida Shelf using the self-organizing map. Journal of Geophysical Research, 110,
C06003, doi:10.1029/2004JC002786.
Luceno, A. 1999. Average run lengths and run length probability distributions for
Cuscore charts to control normal mean. Computational Statistics & Data Analysis,
32(2), 177–196.
Mason, R. L., Champ, C. W., Tracy, N. D., Wierda, S. J., and Young, J. C. 1997a.
Assessment of multivariate process control techniques. Journal of Quality
Technology, 29(2), 140–143.
Mason, R. L., Tracy, N. D., and Young, J. C. 1995. Decomposition of T2 for multivariate
control chart interpretation. Journal of Quality Technology, 27(2), 99–108.
Mason, R. L., Tracy, N. D., and Young, J. C. 1997b. A practical approach for interpreting
multivariate T2 control chart signals. Journal of Quality Technology, 29(4), 396–406.
Mason, R. L. and Young, J. C. 1999. Improving the sensitivity of the T2 statistic in mul-
tivariate process control. Journal of Quality Technology, 31(2), 155–164.
Montgomery, D. 2001. Introduction to Statistical Quality Control, 4th edn. New York:
Wiley.
Montgomery, D. C. and Mastrangelo, C. M. 1991. Some statistical process control
methods for autocorrelated data. Journal of Quality Technologies, 23(3), 179–193.
Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. 1996. Applied Linear
Statistical Models. Chicago, IL: Irwin.
Osuna, E., Freund, R., and Girosi, F. 1997. Training support vector machines: An
application to face detection. In Proceedings of the 1997 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico,
pp. 130–136.
Pourret, O., Naim, P., and Marcot, B. 2008. Bayesian Networks: A Practical Guide to
Applications. Chichester, U.K.: Wiley.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learning, 1, 81–106.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1:
Foundations. Cambridge, MA: The MIT Press.

Russell, S., Binder, J., Koller, D., and Kanazawa, K. 1995. Local learning in probabilistic
networks with hidden variables. In Proceedings of the Fourteenth International Joint
Conference on Artificial Intelligence, Montreal, Quebec, Canada, pp. 1146–1162.
Ryan, T. P. 1989. Statistical Methods for Quality Improvement. New York: John Wiley &
Sons.
Sung, K. and Poggio, T. 1998. Example-based learning for view-based human face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1),
39–51.
Tan, P.-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Boston,
MA: Pearson.
Theodoridis, S. and Koutroumbas, K. 1999. Pattern Recognition. San Diego, CA:
Academic Press.
Vapnik, V. N. 1989. Statistical Learning Theory. New York: John Wiley & Sons.
Vapnik, V. N. 2000. The Nature of Statistical Learning Theory. New York: Springer-Verlag.
Vidakovic, B. 1999. Statistical Modeling by Wavelets. New York: John Wiley & Sons.
Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically opti-
mum decoding algorithm. IEEE Transactions on Information Theory, 13, 260–269.
Witten, I. H., Frank, E., and Hall, M. A. 2011. Data Mining: Practical Machine Learning
Tools and Techniques. Burlington, MA: Morgan Kaufmann.
Yaffe, R. and McGee, M. 2000. Introduction to Time Series Analysis and Forecasting. San
Diego, CA: Academic Press.
Ye, N. 1996. Self-adapting decision support for interactive fault diagnosis of manufac-
turing systems. International Journal of Computer Integrated Manufacturing, 9(5),
392–401.
Ye, N. 1997. Objective and consistent analysis of group differences in knowledge rep-
resentation. International Journal of Cognitive Ergonomics, 1(2), 169–187.
Ye, N. 1998. The MDS-ANAVA technique for assessing knowledge representation dif-
ferences between skill groups. IEEE Transactions on Systems, Man and Cybernetics,
28(5), 586–600.
Ye, N. 2003, ed. The Handbook of Data Mining. Mahwah, NJ: Lawrence Erlbaum Associates.
Ye, N. 2008. Secure Computer and Network Systems: Modeling, Analysis and Design.
London, U.K.: John Wiley & Sons.
Ye, N., Borror, C., and Parmar, D. 2003. Scalable chi square distance versus conven-
tional statistical distance for process monitoring with uncorrelated data vari-
ables. Quality and Reliability Engineering International, 19(6), 505–515.
Ye, N., Borror, C., and Zhang, Y. 2002a. EWMA techniques for computer intrusion
detection through anomalous changes in event intensity. Quality and Reliability
Engineering International, 18(6), 443–451.
Ye, N. and Chen, Q. 2001. An anomaly detection technique based on a chi-square
statistic for detecting intrusions into information systems. Quality and Reliability
Engineering International, 17(2), 105–112.
Ye, N. and Chen, Q. 2003. Computer intrusion detection through EWMA for auto-
correlated and uncorrelated data. IEEE Transactions on Reliability, 52(1), 73–82.
Ye, N., Chen, Q., and Borror, C. 2004. EWMA forecast of normal system activity for
computer intrusion detection. IEEE Transactions on Reliability, 53(4), 557–566.
Ye, N., Ehiabor, T., and Zhang, Y. 2002c. First-order versus high-order stochastic
models for computer intrusion detection. Quality and Reliability Engineering
International, 18(3), 243–250.

Ye, N., Emran, S. M., Chen, Q., and Vilbert, S. 2002b. Multivariate statistical analysis
of audit trails for host-based intrusion detection. IEEE Transactions on Computers,
51(7), 810–820.
Ye, N. and Li, X. 2002. A scalable, incremental learning algorithm for classification
problems. Computers & Industrial Engineering Journal, 43(4), 677–692.
Ye, N., Li, X., Chen, Q., Emran, S. M., and Xu, M. 2001. Probabilistic techniques for
intrusion detection based on computer audit data. IEEE Transactions on Systems,
Man, and Cybernetics, 31(4), 266–274.
Ye, N., Parmar, D., and Borror, C. M. 2006. A hybrid SPC method with the chi-square
distance monitoring procedure for large-scale, complex process data. Quality
and Reliability Engineering International, 22(4), 393–402.
Ye, N. and Salvendy, G. 1991. Cognitive engineering based knowledge representation
in neural networks. Behaviour & Information Technology, 10(5), 403–418.
Ye, N. and Salvendy, G. 1994. Quantitative and qualitative differences between
experts and novices in chunking computer software knowledge. International
Journal of Human-Computer Interaction, 6(1), 105–118.
Ye, N., Zhang, Y., and Borror, C. M. 2004b. Robustness of the Markov-chain model for
cyber-attack detection. IEEE Transactions on Reliability, 53(1), 116–123.
Ye, N. and Zhao, B. 1996. A hybrid intelligent system for fault diagnosis of advanced
manufacturing system. International Journal of Production Research, 34(2), 555–576.
Ye, N. and Zhao, B. 1997. Automatic setting of article format through neural networks.
International Journal of Human-Computer Interaction, 9(1), 81–100.
Ye, N., Zhao, B., and Salvendy, G. 1993. Neural-networks-aided fault diagnosis in
supervisory control of advanced manufacturing systems. International Journal of
Advanced Manufacturing Technology, 8, 200–209.
Young, F. W. and Hamer, R. M. 1987. Multidimensional Scaling: History, Theory, and
Applications. Hillsdale, NJ: Lawrence Erlbaum Associates.