Data Mining: Theories, Algorithms, and Examples
“… provides full spectrum coverage of the most important topics in data mining.
By reading it, one can obtain a comprehensive view on data mining, including
the basic concepts, the important problems in the area, and how to handle these
problems. The whole book is presented in a way that a reader who does not have
much background knowledge of data mining can easily understand. You can find
many figures and intuitive examples in the book. I really love these figures and
examples, since they make the most complicated concepts and algorithms much
easier to understand.”
—Zheng Zhao, SAS Institute Inc., Cary, North Carolina, USA
“… covers pretty much all the core data mining algorithms. It also covers several
useful topics that are not covered by other data mining books such as univariate
and multivariate control charts and wavelet analysis. Detailed examples are
provided to illustrate the practical use of data mining algorithms. A list of software
packages is also included for most algorithms covered in the book. These are
extremely useful for data mining practitioners. I highly recommend this book for
anyone interested in data mining.”
—Jieping Ye, Arizona State University, Tempe, USA
Human Factors and Ergonomics Series
Published Titles
Conceptual Foundations of Human Factors Measurement
D. Meister
Content Preparation Guidelines for the Web and Information Appliances:
Cross-Cultural Comparisons
H. Liao, Y. Guo, A. Savoy, and G. Salvendy
Cross-Cultural Design for IT Products and Services
P. Rau, T. Plocher and Y. Choong
Data Mining: Theories, Algorithms, and Examples
Nong Ye
Designing for Accessibility: A Business Guide to Countering Design Exclusion
S. Keates
Handbook of Cognitive Task Design
E. Hollnagel
The Handbook of Data Mining
N. Ye
Handbook of Digital Human Modeling: Research for Applied Ergonomics
and Human Factors Engineering
V. G. Duffy
Handbook of Human Factors and Ergonomics in Health Care and Patient Safety
Second Edition
P. Carayon
Handbook of Human Factors in Web Design, Second Edition
K. Vu and R. Proctor
Handbook of Occupational Safety and Health
D. Koradecka
Handbook of Standards and Guidelines in Ergonomics and Human Factors
W. Karwowski
Handbook of Virtual Environments: Design, Implementation, and Applications
K. Stanney
Handbook of Warnings
M. Wogalter
Human–Computer Interaction: Designing for Diverse Users and Domains
A. Sears and J. A. Jacko
Human–Computer Interaction: Design Issues, Solutions, and Applications
A. Sears and J. A. Jacko
Human–Computer Interaction: Development Process
A. Sears and J. A. Jacko
Human–Computer Interaction: Fundamentals
A. Sears and J. A. Jacko
The Human–Computer Interaction Handbook: Fundamentals
Evolving Technologies, and Emerging Applications, Third Edition
A. Sears and J. A. Jacko
Human Factors in System Design, Development, and Testing
D. Meister and T. Enderwick
Forthcoming Titles
Around the Patient Bed: Human Factors and Safety in Health care
Y. Donchin and D. Gopher
Cognitive Neuroscience of Human Systems Work and Everyday Life
C. Forsythe and H. Liao
Computer-Aided Anthropometry for Research and Design
K. M. Robinette
Handbook of Human Factors in Air Transportation Systems
S. Landry
Handbook of Virtual Environments: Design, Implementation
and Applications, Second Edition,
K. S. Hale and K M. Stanney
Variability in Human Performance
T. Smith, R. Henning, and M. Wade
Data Mining
Theories, Algorithms, and Examples
NONG YE
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not
warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® soft-
ware or related products does not constitute endorsement or sponsorship by The MathWorks of a particular
pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Contents
Preface
Acknowledgments
Author
References
Index
Preface
Part I introduces these types of data patterns with examples. Parts II–VI
describe algorithms to mine the five types of data patterns, respectively.
Classification and prediction patterns capture relations of attribute vari-
ables with target variables and allow us to classify or predict values of target
variables from values of attribute variables.
Part III describes data mining algorithms to uncover cluster and associa-
tion patterns. Cluster patterns reveal patterns of similarities and differ-
ences among data records. Association patterns are established based on
co-occurrences of items in data records. Part III describes the following data
mining algorithms to mine cluster and association patterns:
Data reduction patterns look for a small number of variables that can be
used to represent a data set with a much larger number of variables. Since
one variable gives one dimension of data, data reduction patterns allow a
data set in a high-dimensional space to be represented in a low-dimensional
space. Part IV describes the following data mining algorithms to mine data
reduction patterns:
Outliers and anomalies are data points that differ largely from a normal pro-
file of data, and there are many ways to define and establish a norm profile
of data. Part V describes the following data mining algorithms to detect and
identify outliers and anomalies:
Sequential and temporal patterns reveal how data change their patterns
over time. Part VI describes the following data mining algorithms to mine
sequential and temporal patterns:
For the data mining algorithms in each chapter, a list of software packages
that support them is provided. Some applications of the data mining algo-
rithms are also given with references.
Teaching Support
The data mining algorithms covered in this book involve different levels of
difficulty. The instructor who uses this book as the textbook for a course on
data mining may select the book materials to cover in the course based on the
level of the course and the level of difficulty of the book materials. The book
materials in Chapters 1, 2 (Sections 2.1 and 2.2 only), 3, 4, 7, 8, 9 (Section 9.1
only), 12, 16 (Sections 16.1 through 16.3 only), and 19 (Section 19.1 only), which
cover the five types of data patterns, are appropriate for an undergraduate-
level course. The remainder is appropriate for a graduate-level course.
Exercises are provided at the end of each chapter. The following additional
teaching support materials are available on the book website and can be
obtained from the publisher:
• Solutions manual
• Lecture slides, which include the outline of topics, figures, tables,
and equations
Acknowledgments

I would like to thank my family, Baijun and Alice, for their love, understand-
ing, and unconditional support. I appreciate them for always being there for
me and making me happy.
I am grateful to Dr. Gavriel Salvendy, who has been my mentor and friend,
for guiding me in my academic career. I am also thankful to Dr. Gary Hogg,
who supported me in many ways as the department chair at Arizona State
University.
I would like to thank Cindy Carelli, senior editor at CRC Press. This book
would not have been possible without her responsive, helpful, understand-
ing, and supportive nature. It has been a great pleasure working with her.
Thanks also go to Kari Budyk, senior project coordinator at CRC Press, and
the staff at CRC Press who helped publish this book.
Author
Part I

1
Introduction to Data, Data Patterns, and Data Mining
Data mining aims at discovering useful data patterns from massive amounts
of data. In this chapter, we give some examples of data sets and use these
data sets to illustrate various types of data variables and data patterns that
can be discovered from data. Data mining algorithms to discover each type
of data patterns are briefly introduced in this chapter. The concepts of train-
ing and testing data are also introduced.
Table 1.1
Balloon Data Set
Instance    Attribute Variables: Color, Size, Act, Age    Target Variable: Inflated
1 Yellow Small Stretch Adult T
2 Yellow Small Stretch Child T
3 Yellow Small Dip Adult T
4 Yellow Small Dip Child T
5 Yellow Large Stretch Adult T
6 Yellow Large Stretch Child F
7 Yellow Large Dip Adult F
8 Yellow Large Dip Child F
9 Purple Small Stretch Adult T
10 Purple Small Stretch Child F
11 Purple Small Dip Adult F
12 Purple Small Dip Child F
13 Purple Large Stretch Adult T
14 Purple Large Stretch Child F
15 Purple Large Dip Adult F
16 Purple Large Dip Child F
The manufacturing system in Figure 1.1 has nine machines, M1–M9, which
process parts. There are some parts that go through M1 first, M5 second, and
M9 last, some parts that go through M1 first, M5 second, and M7 last, and
so on. There are nine variables, xi,
i = 1, 2, …, 9, representing the quality of parts after they go through the nine
machines. If parts after machine i pass the quality inspection, xi takes the
value of 0; otherwise, xi takes the value of 1. There is a variable, y, represent-
ing whether or not the system has a fault. The system has a fault if any of
the nine machines is faulty. If the system does not have a fault, y takes the
value of 0; otherwise, y takes the value of 1. There are nine variables, yi, i = 1,
2, …, 9, representing whether or not nine machines are faulty, respectively.
If machine i does not have a fault, yi takes the value of 0; otherwise, yi takes
the value of 1. The fault detection problem is to determine whether or not the
system has a fault based on the quality information. The fault detection prob-
lem involves the nine quality variables, xi, i = 1, 2, …, 9, and the system fault
variable, y. The fault diagnosis problem is to determine which machine has a
fault based on the quality information. The fault diagnosis problem involves
the nine quality variables, xi, i = 1, 2, …, 9, and the nine variables of machine
fault, yi, i = 1, 2, …, 9. There may be one or more machines that have a fault
at the same time, or no faulty machine. For example, in instance 1 with M1
being faulty (y1 and y taking the value of 1 and y2, y3, y4, y5, y6, y7, y8, and y9
taking the value of 0), parts after M1, M5, M7, and M9 fail the quality inspection
with x1, x5, x7, and x9 taking the value of 1 and other quality variables, x2, x3,
x4, x6, and x8, taking the value of 0.
Table 1.2
Space Shuttle O-Ring Data Set
Instance    Attribute Variables: Number of O-Rings, Launch Temperature, Leak-Check Pressure, Temporal Order of Flight    Target Variable: Number of O-Rings with Stress
1 6 66 50 1 0
2 6 70 50 2 1
3 6 69 50 3 0
4 6 68 50 4 0
5 6 67 50 5 0
6 6 72 50 6 0
7 6 73 100 7 0
8 6 70 100 8 0
9 6 57 200 9 1
10 6 63 200 10 1
11 6 70 200 11 1
12 6 78 200 12 0
13 6 67 200 13 0
14 6 53 200 14 2
15 6 67 200 15 0
16 6 75 200 16 0
17 6 70 200 17 0
18 6 81 200 18 0
19 6 76 200 19 0
20 6 79 200 20 0
21 6 75 200 21 0
22 6 76 200 22 0
23 6 58 200 23 1
Table 1.3
Lenses Data Set
Instance    Attributes: Age, Spectacle Prescription, Astigmatic, Tear Production Rate    Target: Lenses
1 Young Myope No Reduced Noncontact
2 Young Myope No Normal Soft contact
3 Young Myope Yes Reduced Noncontact
4 Young Myope Yes Normal Hard contact
5 Young Hypermetrope No Reduced Noncontact
6 Young Hypermetrope No Normal Soft contact
7 Young Hypermetrope Yes Reduced Noncontact
8 Young Hypermetrope Yes Normal Hard contact
9 Pre-presbyopic Myope No Reduced Noncontact
10 Pre-presbyopic Myope No Normal Soft contact
11 Pre-presbyopic Myope Yes Reduced Noncontact
12 Pre-presbyopic Myope Yes Normal Hard contact
13 Pre-presbyopic Hypermetrope No Reduced Noncontact
14 Pre-presbyopic Hypermetrope No Normal Soft contact
15 Pre-presbyopic Hypermetrope Yes Reduced Noncontact
16 Pre-presbyopic Hypermetrope Yes Normal Noncontact
17 Presbyopic Myope No Reduced Noncontact
18 Presbyopic Myope No Normal Noncontact
19 Presbyopic Myope Yes Reduced Noncontact
20 Presbyopic Myope Yes Normal Hard contact
21 Presbyopic Hypermetrope No Reduced Noncontact
22 Presbyopic Hypermetrope No Normal Soft contact
23 Presbyopic Hypermetrope Yes Reduced Noncontact
24 Presbyopic Hypermetrope Yes Normal Noncontact
The values of the target variables depend on the values of the attribute variables. In the bal-
loon data set in Table 1.1, the attribute variables are Color, Size, Act, and Age,
and the target variable gives the inflation status of the balloon. In the space
shuttle data set in Table 1.2, the attribute variables are Number of O-rings,
Launch Temperature, Leak-Check Pressure, and Temporal Order of Flight,
and the target variable is the Number of O-rings with Stress.
Some data sets may have only attribute variables. For example, customer
purchase transaction data may contain the items purchased by each cus-
tomer at a store. We have attribute variables representing the items pur-
chased. The interest in the customer purchase transaction data is in finding
out what items are often purchased together by customers. Such association
patterns of items or attribute variables can be used to design the store lay-
out for sale of items and assist customer shopping. Mining such a data set
involves only attribute variables.
Table 1.4
Data Set for a Manufacturing System to Detect and Diagnose Faults
Instance (Faulty Machine)    Attribute Variables (Quality of Parts): x1 x2 x3 x4 x5 x6 x7 x8 x9
Target Variables: System Fault y; Machine Fault y1 y2 y3 y4 y5 y6 y7 y8 y9
1 (M1) 1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0
2 (M2) 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0
3 (M3) 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0
4 (M4) 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
5 (M5) 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0
6 (M6) 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0
7 (M7) 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1
10 (none) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 1.1
A manufacturing system with nine machines (M1–M9) and production flows of parts.
are not available. Color is a nominal variable since yellow and purple show
two different colors but an order of yellow and purple may be meaningless.
Numeric variables have two subtypes: interval variables and ratio variables
(Tan et al., 2006). Quantitative differences between the values of an interval
variable (e.g., Launch Temperature in °F) are meaningful, whereas both
quantitative differences and ratios between the values of a ratio variable
(e.g., Number of O-rings with Stress) are meaningful.
Formally, we denote the attribute variables as x1, …, xp, and the target vari-
ables as y1, …, yq. We let x = (x1, …, xp) and y = (y1, …, yq). Instances or data
observations of x1, …, xp, y1, …, yq give data records, (x1, …, xp, y1, …, yq).
\hat{y} = 4.301587 - 0.05746x    (1.1)

where
y denotes the target variable, Number of O-rings with Stress
x denotes the attribute variable, Launch Temperature
Figure 1.2
The fitted linear relation model of Launch Temperature with Number of O-rings with Stress in
the space shuttle O-ring data set.
Table 1.5
Predicted Value of O-Rings with Stress
Instance    Launch Temperature (Attribute Variable)    Number of O-Rings with Stress (Target Variable)    Predicted Value of O-Rings with Stress
1 66 0 0.509227
2 70 1 0.279387
3 69 0 0.336847
4 68 0 0.394307
5 67 0 0.451767
6 72 0 0.164467
7 73 0 0.107007
8 70 0 0.279387
9 57 1 1.026367
10 63 1 0.681607
11 70 1 0.279387
12 78 0 −0.180293
13 67 0 0.451767
14 53 2 1.256207
15 67 0 0.451767
16 75 0 −0.007913
17 70 0 0.279387
18 81 0 −0.352673
19 76 0 −0.065373
20 79 0 −0.237753
21 75 0 −0.007913
22 76 0 −0.065373
23 58 1 0.968907
The predicted values in the middle range, 1.026367 and 0.681607, are produced for two data records of
instances 9 and 10 with 1 O-rings with Stress. The predicted values in the low
range from −0.352673 to 0.509227 are produced for all the data records with 0
O-rings with Stress. The negative coefficient of x, −0.05746, in Equation 1.1,
also reveals this relation. Hence, the linear relation in Equation 1.1 gives data
patterns that allow us to predict the target variable, Number of O-rings with
Stress, from the attribute variable, Launch Temperature, in the space shuttle
O-ring data set.
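As a concrete illustration (not from the book), the short sketch below applies the fitted model of Equation 1.1, with intercept 4.301587 and slope −0.05746, to the Launch Temperature values of Table 1.5; the function name is illustrative.

```python
# A minimal sketch of applying the fitted linear model in Equation 1.1
# to the Launch Temperature values in Table 1.5.

launch_temperature = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                      67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]

def predict_o_rings_with_stress(x):
    """Predicted Number of O-rings with Stress for Launch Temperature x."""
    return 4.301587 - 0.05746 * x

for x in launch_temperature:
    print(x, round(predict_o_rings_with_stress(x), 6))   # e.g., 66 -> 0.509227
```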
Classification and prediction patterns, which capture the relation of attribute
variables, x1, …, xp, with target variables, y1, …, yq, can be represented in the
general form of y = F(x). For the balloon data set, classification patterns for F
take the form of decision rules. For the space shuttle O-ring data set, prediction
patterns for F take the form of a linear model. Generally, the term, “classification
patterns,” is used if the target variable is a categorical variable, and the term,
“prediction patterns,” is used if the target variable is a numeric variable.
Part II of the book introduces the following data mining algorithms that
are used to discover classification and prediction patterns from data:
Chapters 20, 21, and 23 in The Handbook of Data Mining (Ye, 2003) and Chapters 12
and 13 in Secure Computer and Network Systems: Modeling, Analysis and Design
(Ye, 2008) give applications of classification and prediction algorithms to
human performance data, text data, science and engineering data, and com-
puter and network data.
Figure 1.3
Clustering of 10 data records in the data set of a manufacturing system (e.g., Group 1 contains
instances 1 and 5, Group 2 contains instances 2 and 4, Group 6 contains instance 9, and Group 7
contains instance 10).
Part III of the book introduces the following data mining algorithms that
are used to discover cluster and association patterns from data:
Chapters 10, 21, 22, and 27 in The Handbook of Data Mining (Ye, 2003) give
applications of cluster algorithms to market basket data, web log data, text
data, geospatial data, and image data. Chapter 24 in The Handbook of Data
Mining (Ye, 2003) gives an application of the association rule algorithm to
protein structure data.
Figure 1.4
Reduction of a two-dimensional data set to a one-dimensional data set.
For example, Figure 1.4 shows 10 data points in a two-dimensional space, (x, y), with y = 2x and x = 1, 2, …, 10. This two-dimensional data set
can be represented as the one-dimensional data set with z as the axis, and z
is related to the original variables, x and y, as follows:
z = x\sqrt{1^2 + (y/x)^2}.    (1.2)
The 10 data points of z are 2.236, 4.472, 6.708, 8.944, 11.180, 13.416, 15.652,
17.889, 20.125, and 22.361.
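The mapping in Equation 1.2 can be checked with a few lines of Python; this is a minimal sketch, and the variable names are illustrative.

```python
# A small sketch of the data reduction in Equation 1.2: the 10 points
# (x, y) with y = 2x are mapped to the single variable z.
import math

points = [(x, 2 * x) for x in range(1, 11)]
z = [math.sqrt(x**2 + y**2) for x, y in points]  # same as x*sqrt(1 + (y/x)**2)
print([round(v, 3) for v in z])
# [2.236, 4.472, 6.708, 8.944, 11.18, 13.416, 15.652, 17.889, 20.125, 22.361]
```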
Part IV of the book introduces the following data mining algorithms that
are used to discover data reduction patterns from data:
Chapters 23 and 8 in The Handbook of Data Mining (Ye, 2003) give applications of
principal component analysis to volcano data and science and engineering data.
Figure 1.5
Frequency histogram of Launch Temperature in the space shuttle data set (bins [50, 59], [60, 69],
[70, 79], and [80, 89]).
Part V of the book introduces the following data mining algorithms that
are used to define some statistical norms of data and detect outliers and
anomalies according to these statistical norms:
Chapters 26 and 28 in The Handbook of Data Mining (Ye, 2003) and Chapter 14 in
Secure Computer and Network Systems: Modeling, Analysis and Design (Ye, 2008)
give applications of outlier and anomaly detection algorithms to manufac-
turing data and computer and network data.
Figure 1.6
Temperature in each quarter of a 3-year period.
Table 1.6
Test Data Set for a Manufacturing System to Detect and Diagnose Faults
Instance (Faulty Machine)    Attribute Variables (Quality of Parts): x1 x2 x3 x4 x5 x6 x7 x8 x9
Target Variables: System Fault y; Machine Fault y1 y2 y3 y4 y5 y6 y7 y8 y9
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 0 0
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 1 0 0 0
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0 0 0
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 0 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 1
Figure 1.6 shows a time series of temperature values for a city over quarters of a 3-year period.
is a cyclic pattern of 60, 80, 100, and 60, which repeats every year. A variety of
sequential and temporal patterns can be discovered using the data mining
algorithms covered in Part VI of the book, including
Chapters 10, 11, and 16 in Secure Computer and Network Systems: Modeling,
Analysis and Design (Ye, 2008) give applications of sequential and temporal
pattern mining algorithms to computer and network data for cyber attack
detection.
Exercises
1.1 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering classification pat-
terns. The data set contains multiple categorical attribute variables and
one categorical target variable.
1.2 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering prediction patterns.
The data set contains multiple numeric attribute variables and one
numeric target variable.
1.3 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering cluster patterns. The
data set contains multiple numeric attribute variables.
1.4 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering association patterns.
The data set contains multiple categorical variables.
1.5 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering data reduction pat-
terns, and identify the type(s) of data variables in this data set.
1.6 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering outlier and anomaly
patterns, and identify the type(s) of data variables in this data set.
1.7 Find and describe a data set of at least 20 data records that has been
used in a data mining application for discovering sequential and tem-
poral patterns, and identify the type(s) of data variables in this data set.
Part II
2
Linear and Nonlinear Regression Models
Regression models capture how one or more target variables vary with one
or more attribute variables. They can be used to predict the values of the
target variables using the values of the attribute variables. In this chapter,
we introduce linear and nonlinear regression models. This chapter also
describes the least-squares method and the maximum likelihood method of
estimating parameters in regression models. A list of software packages that
support building regression models is provided.
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i    (2.1)
where
(xi, yi) denotes the ith observation of x and y
εi represents random noise (e.g., measurement error) contributing to the ith
observation of y
For a given value of xi, both yi and εi are random variables whose values may
follow a probability distribution as illustrated in Figure 2.1. In other words,
for the same value of x, different values of y and ε may be observed at differ-
ent times. There are three assumptions about εi:
Figure 2.1
Illustration of a simple linear regression model: for given values x_i and x_j, the target values y_i and
y_j are random variables with means E(y_i) = \beta_0 + \beta_1 x_i and E(y_j) = \beta_0 + \beta_1 x_j.
1. E(y_i) = \beta_0 + \beta_1 x_i
2. var(y_i) = \sigma^2
3. cov(y_i, y_j) = 0 for any two different data observations of y, the ith
observation and the jth observation
y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i,    (2.2)

where
p is an integer greater than 1
x_{i,j} denotes the ith observation of the jth attribute variable
The linear regression models in Equations 2.1 and 2.2 are linear in the
parameters β0, …, βp and the attribute variables xi,1, …, xi,p. In general, linear
regression models are linear in the parameters but are not necessarily linear
in the attribute variables. For example, a regression model with polynomial
terms of x_1, or more generally with transformation functions \Phi_1, \ldots, \Phi_k of the
attribute variables, is also a linear regression model:

y_i = \beta_0 + \beta_1 \Phi_1(x_{i,1}, \ldots, x_{i,p}) + \cdots + \beta_k \Phi_k(x_{i,1}, \ldots, x_{i,p}) + \varepsilon_i,    (2.4)
The least-squares method estimates \hat{\beta}_0 and \hat{\beta}_1 by minimizing the sum of
squared errors (SSE) between the observed and predicted values of y:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.    (2.6)
The partial derivatives of SSE with respect to β̂0 and β̂1 should be zero at the
point where SSE is minimized. Hence, the values of β̂0 and β̂1 that mini-
mize SSE are obtained by differentiating SSE with respect to β̂0 and β̂1
and setting these partial derivatives equal to zero:
\frac{\partial SSE}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (2.7)

\frac{\partial SSE}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.    (2.8)
\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = \sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0    (2.9)

\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = \sum_{i=1}^{n} x_i y_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0.    (2.10)
Solving Equations 2.9 and 2.10 for β̂0 and β̂1, we obtain:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}    (2.11)

\hat{\beta}_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i\right) = \bar{y} - \hat{\beta}_1 \bar{x}.    (2.12)
In the maximum likelihood method, each y_i is assumed to follow a normal
distribution with the mean and variance

E(y_i) = \beta_0 + \beta_1 x_i    (2.13)

var(y_i) = \sigma^2    (2.14)

and the density function of the normal probability distribution:

f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{y_i - E(y_i)}{\sigma}\right)^2} = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)^2}.    (2.15)
Because yis are independent, the likelihood of observing y1, …, yn, L, is the
product of individual densities f(yi)s and is the function of β0, β1, and σ2:
L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-\frac{1}{2}\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)^2}.    (2.16)
The maximum likelihood estimates \hat{\beta}_0, \hat{\beta}_1, and \hat{\sigma}^2 maximize L or, equivalently,
ln L. Setting the partial derivatives of ln L with respect to these parameters to
zero gives

\frac{\partial \ln L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)}{\partial \hat{\beta}_0} = \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (2.17)

\frac{\partial \ln L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)}{\partial \hat{\beta}_1} = \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (2.18)

\frac{\partial \ln L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)}{\partial \hat{\sigma}^2} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = 0.    (2.19)
These equations simplify to

\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (2.20)

\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (2.21)

\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n}.    (2.22)
Equations 2.20 and 2.21 are the same as Equations 2.9 and 2.10. Hence, the
maximum likelihood estimators of β0 and β1 are the same as the least-squares
estimators of β0 and β1 that are given in Equations 2.11 and 2.12.
For the linear regression model in Equation 2.2 with multiple attribute
variables, we define x0 = 1 and rewrite Equation 2.2 to
y_i = \beta_0 x_{i,0} + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i.    (2.23)

Defining

y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \quad
x = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \end{bmatrix} \quad
b = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix} \quad
e = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix},

the model can be written in matrix form as

y = xb + e.    (2.24)

The least-squares estimator of b is

\hat{b} = (x'x)^{-1}(x'y),    (2.25)

where x' denotes the transpose of the matrix x.
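A minimal sketch of Equation 2.25 using NumPy is given below; the small data set is made up for illustration and is not from the book.

```python
# A minimal sketch of Equation 2.25: the parameter vector is estimated as
# (x'x)^{-1}(x'y) after adding the column of 1s (x_0 = 1).
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 6.0, 5.0],
              [1.0, 8.0, 2.0]])      # first column is x0 = 1
y = np.array([10.0, 14.0, 22.0, 26.0])

b_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)   # Equation 2.25
print(b_hat)
```

In practice, numpy.linalg.solve or numpy.linalg.lstsq is numerically preferable to forming the inverse explicitly, but the line above follows the closed form of Equation 2.25.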
Example 2.1
Use the least-squares method to fit a linear regression model to the space
shuttle O-rings data set in Table 1.5, which is also given in Table 2.1, and
determine the predicted target value for each observation using the lin-
ear regression model.
This data has one attribute variable x representing Launch Temperature
and one target variable y representing Number of O-rings with Stress.
The linear regression model for this data set is
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.
Table 2.2 shows the calculation for estimating β1 using Equation 2.11.
Using Equation 2.11, we obtain:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{-65.91}{1382.82} = -0.05.
Table 2.1
Data Set of O-Rings with Stress along
with the Predicted Target Value from
the Linear Regression
Instance    Launch Temperature    Number of O-Rings with Stress
1 66 0
2 70 1
3 69 0
4 68 0
5 67 0
6 72 0
7 73 0
8 70 0
9 57 1
10 63 1
11 70 1
12 78 0
13 67 0
14 53 2
15 67 0
16 75 0
17 70 0
18 81 0
19 76 0
20 79 0
21 75 0
22 76 0
23 58 1
Table 2.2
Calculation for Estimating the Parameters of the Linear Model in Example 2.1
Instance    Launch Temperature    Number of O-Rings    xi − x̄    yi − ȳ    (xi − x̄)(yi − ȳ)    (xi − x̄)²
1 66 0 −3.57 −0.30 1.07 12.74
2 70 1 0.43 0.70 0.30 0.18
3 69 0 −0.57 −0.30 0.17 0.32
4 68 0 −1.57 −0.30 0.47 2.46
5 67 0 −2.57 −0.30 0.77 6.60
6 72 0 2.43 −0.30 −0.73 5.90
7 73 0 3.43 −0.30 −1.03 11.76
8 70 0 0.43 −0.30 −0.13 0.18
9 57 1 −12.57 0.70 −8.80 158.00
10 63 1 −6.57 0.70 −4.60 43.16
11 70 1 0.43 0.70 0.30 0.18
12 78 0 8.43 −0.30 −2.53 71.06
13 67 0 −2.57 −0.30 0.77 6.60
14 53 2 −16.53 1.70 −28.10 273.24
15 67 0 −2.57 −0.30 0.77 6.60
16 75 0 5.43 −0.30 −1.63 29.48
17 70 0 0.43 −0.30 −0.13 0.18
18 81 0 11.43 −0.30 −3.43 130.64
19 76 0 6.43 −0.30 −1.93 41.34
20 79 0 19.43 −0.30 −5.83 377.52
21 75 0 5.43 −0.30 −1.63 29.48
22 76 0 6.43 −0.30 −1.93 41.34
23 58 1 −11.57 0.70 −8.10 133.86
Sum 1600 7 −65.91 1382.82
Average    x̄ = 69.57    ȳ = 0.30

Using Equation 2.12, \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 0.30 - (-0.05)(69.57) = 3.78. The fitted linear
regression model is

y_i = 3.78 - 0.05 x_i + \varepsilon_i.
The parameters in this linear regression model are similar to the param-
eters β̂0 = 4.301587 and β̂1 = −0.05746 in Equation 1.1, which are obtained
from Excel for the same data set. The differences in the parameters are
caused by rounding in the calculation.
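The closed-form estimates of Equations 2.11 and 2.12 can also be checked in code; the sketch below (not from the book) uses the data of Table 2.1, and carrying full precision reproduces the Excel estimates cited above rather than the hand-rounded values.

```python
# A sketch of Equations 2.11 and 2.12 applied to the space shuttle O-ring
# data in Table 2.1 (Launch Temperature x, Number of O-rings with Stress y).
x = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
     67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
y = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))          # Equation 2.11
beta0_hat = y_bar - beta1_hat * x_bar                        # Equation 2.12
print(beta0_hat, beta1_hat)   # approximately 4.30 and -0.057
```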
A nonlinear regression model takes the general form

y_i = f(x_i, b) + \varepsilon_i,    (2.26)

where

x_i = \begin{bmatrix} 1 \\ x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix} \quad
b = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}

and f is nonlinear in the parameters. For example, the exponential model

y_i = \beta_0 + \beta_1 e^{\beta_2 x_i} + \varepsilon_i    (2.27)

and the logistic model

y_i = \frac{\beta_0}{1 + \beta_1 e^{\beta_2 x_i}} + \varepsilon_i    (2.28)

are nonlinear regression models.
The following software packages support building regression models:
• Statistica (http://www.statsoft.com)
• SAS (http://www.sas.com)
• SPSS (http://www.ibm/com/software/analytics/spss/)
Exercises
2.1 Given the space shuttle data set in Table 2.1, use Equation 2.25 to esti-
mate the parameters of the following linear regression model:
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
where
xi is Launch Temperature
yi is Number of O-rings with Stress
Compute the sum of squared errors that are produced by the predicted
y values from the regression model.
2.2 Given the space shuttle data set in Table 2.1, use Equations 2.11 and 2.12
to estimate the parameters of the following linear regression model:
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
where
xi is Launch Temperature
yi is Number of O-rings with Stress
Compute the sum of squared errors that are produced by the predicted
y values from the regression model.
2.3 Use the data set found in Exercise 1.2 to build a linear regression model
and compute the sum of squared errors that are produced by the pre-
dicted y values from the regression model.
3
Naïve Bayes Classifier
A naïve Bayes classifier is based on the Bayes theorem. Hence, this chapter
first reviews the Bayes theorem and then describes naïve Bayes classifier. A
list of data mining software packages that support the learning of a naïve
Bayes classifier is provided. Some applications of naïve Bayes classifiers are
given with references.
Given two events A and B, the Bayes theorem gives the conditional probability
of A given B:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}.    (3.2)
Applying the Bayes theorem, the maximum a posteriori (MAP) classification of
a data record x is

y_{MAP} = \arg\max_{y \in Y} P(y|x) = \arg\max_{y \in Y} \frac{p(y)\,P(x|y)}{P(x)} \approx \arg\max_{y \in Y} p(y)\,P(x|y),    (3.3)
where Y is the set of all target classes. The sign ≈ in Equation 3.3 is used
because P(x) is the same for all y values and thus can be ignored when we
compare p(y)P(x|y)/P(x) for all y values. P(x) is the prior probability that we
observe x without any knowledge about what the target class of x is. P(y) is
the prior probability that we expect y, reflecting our prior knowledge about
the data set of x and the likelihood of the target class y in the data set with-
out referring to any specific x. P(y|x) is the posterior probability of y given
the observation of x. \arg\max_{y \in Y} P(y|x) compares the posterior probabilities of
all target classes given x and chooses the target class y with the maximum
posterior probability.
P(x|y) is the probability that we observe x if the target class is y. A clas-
sification y that maximizes P(x|y) among all target classes is the maximum
likelihood (ML) classification:

y_{ML} = \arg\max_{y \in Y} P(x|y).    (3.4)

The naïve Bayes classifier assumes that the attribute variables x_1, …, x_p are
independent of each other given the target class, so that P(x|y) = \prod_{i=1}^{p} P(x_i|y) and

y = \arg\max_{y \in Y} p(y) \prod_{i=1}^{p} P(x_i|y).    (3.5)
The naïve Bayes classifier estimates the probability terms in Equation 3.5 in
the following way:
P(y) = \frac{n_y}{n}    (3.6)

P(x_i|y) = \frac{n_{y \& x_i}}{n_y},    (3.7)
where
n is the total number of data points in the training data set
ny is the number of data points with the target class y
n_{y & x_i} is the number of data points with the target class y and with
the ith attribute variable taking the value of x_i
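A minimal sketch of these estimates and the resulting MAP classification is given below; the data format (a list of (x, y) pairs with categorical attribute values) and the function names are assumptions for illustration, not the book's implementation.

```python
# A minimal sketch of the naive Bayes estimates in Equations 3.6 and 3.7
# and the MAP classification in Equations 3.3/3.5.
from collections import defaultdict

def train_naive_bayes(records):
    """records: list of (x, y) with x a tuple of attribute values."""
    n = len(records)
    n_y = defaultdict(int)                 # counts n_y
    n_y_xi = defaultdict(int)              # counts n_{y & x_i}
    for x, y in records:
        n_y[y] += 1
        for i, xi in enumerate(x):
            n_y_xi[(y, i, xi)] += 1
    return n, n_y, n_y_xi

def classify(x, model):
    n, n_y, n_y_xi = model
    best_y, best_score = None, -1.0
    for y in n_y:
        score = n_y[y] / n                                  # P(y), Eq. 3.6
        for i, xi in enumerate(x):
            score *= n_y_xi[(y, i, xi)] / n_y[y]            # P(x_i|y), Eq. 3.7
        if score > best_score:
            best_y, best_score = y, score
    return best_y                                           # y_MAP
```

Applied to the training data of Table 3.1, this arithmetic should reproduce the counts and classifications worked out in Example 3.1.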
Example 3.1
Learn and use a naïve Bayes classifier for classifying whether or not a
manufacturing system is faulty using the values of the nine quality vari-
ables. The training data set in Table 3.1 gives a part of the data set in
Table 1.4 and includes nine single-fault cases and the nonfault case in a
manufacturing system. There are nine attribute variables for the qual-
ity of parts, (x1, …, x9), and one target variable y for the system fault.
Table 3.2 gives the test cases for some multiple-fault cases.
Using the training data set in Table 3.1, we compute the following:
n = 10
n_{y=1} = 9    n_{y=0} = 1
n_{y=1 & x1=1} = 1    n_{y=1 & x1=0} = 8    n_{y=0 & x1=1} = 0    n_{y=0 & x1=0} = 1
n_{y=1 & x2=1} = 1    n_{y=1 & x2=0} = 8    n_{y=0 & x2=1} = 0    n_{y=0 & x2=0} = 1
n_{y=1 & x3=1} = 1    n_{y=1 & x3=0} = 8    n_{y=0 & x3=1} = 0    n_{y=0 & x3=0} = 1
n_{y=1 & x4=1} = 3    n_{y=1 & x4=0} = 6    n_{y=0 & x4=1} = 0    n_{y=0 & x4=0} = 1
n_{y=1 & x5=1} = 2    n_{y=1 & x5=0} = 7    n_{y=0 & x5=1} = 0    n_{y=0 & x5=0} = 1
n_{y=1 & x6=1} = 2    n_{y=1 & x6=0} = 7    n_{y=0 & x6=1} = 0    n_{y=0 & x6=0} = 1
n_{y=1 & x7=1} = 5    n_{y=1 & x7=0} = 4    n_{y=0 & x7=1} = 0    n_{y=0 & x7=0} = 1
n_{y=1 & x8=1} = 4    n_{y=1 & x8=0} = 5    n_{y=0 & x8=1} = 0    n_{y=0 & x8=0} = 1
n_{y=1 & x9=1} = 3    n_{y=1 & x9=0} = 6    n_{y=0 & x9=1} = 0    n_{y=0 & x9=0} = 1
Table 3.1
Training Data Set for System Fault Detection
Instance (Faulty Machine)    Attribute Variables (Quality of Parts): x1 x2 x3 x4 x5 x6 x7 x8 x9    Target Variable: System Fault y
1 (M1) 1 0 0 0 1 0 1 0 1 1
2 (M2) 0 1 0 1 0 0 0 1 0 1
3 (M3) 0 0 1 1 0 1 1 1 0 1
4 (M4) 0 0 0 1 0 0 0 1 0 1
5 (M5) 0 0 0 0 1 0 1 0 1 1
6 (M6) 0 0 0 0 0 1 1 0 0 1
7 (M7) 0 0 0 0 0 0 1 0 0 1
8 (M8) 0 0 0 0 0 0 0 1 0 1
9 (M9) 0 0 0 0 0 0 0 0 1 1
10 (none) 0 0 0 0 0 0 0 0 0 0
Table 3.2
Classification of Data Records in the Testing Data Set for System Fault Detection
Instance (Faulty Machine)    Attribute Variables (Quality of Parts): x1 x2 x3 x4 x5 x6 x7 x8 x9    Target Variable (System Fault y): True Value, Classified Value
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 1
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 1
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 1
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 1
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 1
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 1
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 1
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1
For Instance #1 in Table 3.1 with x = (1, 0, 0, 0, 1, 0, 1, 0, 1), the naïve Bayes
classifier computes

p(y=1) ∏_{i=1}^{9} P(x_i|y=1) = (n_{y=1}/n)(n_{y=1 & x1=1}/n_{y=1})(n_{y=1 & x2=0}/n_{y=1}) ⋯ (n_{y=1 & x9=1}/n_{y=1})
= (9/10)(1/9)(8/9)(8/9)(6/9)(2/9)(7/9)(5/9)(5/9)(3/9) > 0
p(y=0) ∏_{i=1}^{9} P(x_i|y=0) = (n_{y=0}/n)(n_{y=0 & x1=1}/n_{y=0})(n_{y=0 & x2=0}/n_{y=0}) ⋯ (n_{y=0 & x9=1}/n_{y=0})
= (1/10)(0/1)(1/1)(1/1)(1/1)(0/1)(1/1)(0/1)(1/1)(0/1) = 0.

Since the former is greater, Instance #1 is classified as y_MAP = 1, that is, the
system is classified as faulty.
Instances #2 to #9 in Table 3.1 and all the instances in Table 3.2 can
be classified similarly to produce y_MAP = 1, since each of them has some x_i = 1 with
n_{y=0 & x_i=1}/n_{y=0} = 0/1, which makes p(y=0)P(x|y=0) = 0. Instance #10 in
Table 3.1 with x = (0, 0, 0, 0, 0, 0, 0, 0, 0) is classified as follows:
p(y=1) ∏_{i=1}^{9} P(x_i|y=1) = (9/10)(8/9)(8/9)(8/9)(6/9)(7/9)(7/9)(4/9)(5/9)(6/9) ≈ 0.04

p(y=0) ∏_{i=1}^{9} P(x_i|y=0) = (1/10)(1/1)(1/1)(1/1)(1/1)(1/1)(1/1)(1/1)(1/1)(1/1) = 0.1

Since 0.1 > 0.04, y_MAP = 0 for Instance #10. Hence, all the instances in Tables 3.1
and 3.2 are correctly classified by the naïve Bayes classifier.
• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (http://www.mathworks.com), statistics toolbox
The naïve Bayes classifier has been successfully applied in many fields, includ-
ing text and document classification (http://www.cs.waikato.ac.nz/∼eibe/
pubs/FrankAndBouckaertPKDD06new.pdf).
Exercises
3.1 Build a naïve Bayes classifier to classify the target variable from the
attribute variable in the balloon data set in Table 1.1 and evaluate the
classification performance of the naïve Bayes classifier by computing
what percentage of the data records in the data set are classified cor-
rectly by the naïve Bayes classifier.
3.2 In the space shuttle O-ring data set in Table 1.2, consider the Leak-
Check Pressure as a categorical attribute with three categorical values
and the Number of O-rings with Stress as a categorical target variable
with three categorical values. Build a naïve Bayes classifier to classify
the Number of O-rings with Stress from the Leak-Check Pressure and
evaluate the classification performance of the naïve Bayes classifier by
computing what percentage of the data records in the data set are clas-
sified correctly by the naïve Bayes classifier.
3.3 Build a naïve Bayes classifier to classify the target variable from the
attribute variables in the lenses data set in Table 1.3 and evaluate the
classification performance of the naïve Bayes classifier by computing
what percentage of the data records in the data set are classified cor-
rectly by the naïve Bayes classifier.
3.4 Build a naïve Bayes classifier to classify the target variable from the
attribute variables in the data set found in Exercise 1.1 and evaluate the
classification performance of the naïve Bayes classifier by computing
what percentage of the data records in the data set are classified cor-
rectly by the naïve Bayes classifier.
4
Decision and Regression Trees
Decision and regression trees are used to learn classification and prediction
patterns from data and express the relation of attribute variables x with a
target variable y, y = F(x), in the form of a tree. A decision tree classifies the
categorical target value of a data record using its attribute values. A regres-
sion tree predicts the numeric target value of a data record using its attribute
values. In this chapter, we first define a binary decision tree and give the algo-
rithm to learn a binary decision tree from a data set with categorical attribute
variables and a categorical target variable. Then the method of learning a
nonbinary decision tree is described. Additional concepts are introduced to
handle numeric attribute variables and missing values of attribute variables,
and to handle a numeric target variable for constructing a regression tree.
A list of data mining software packages that support the learning of decision
and regression trees is provided. Some applications of decision and regres-
sion trees are given with references.
Table 4.1
Data Set for System Fault Detection
Instance (Faulty Machine)    Attribute Variables (Quality of Parts): x1 x2 x3 x4 x5 x6 x7 x8 x9    Target Variable: System Fault y
1 (M1) 1 0 0 0 1 0 1 0 1 1
2 (M2) 0 1 0 1 0 0 0 1 0 1
3 (M3) 0 0 1 1 0 1 1 1 0 1
4 (M4) 0 0 0 1 0 0 0 1 0 1
5 (M5) 0 0 0 0 1 0 1 0 1 1
6 (M6) 0 0 0 0 0 1 1 0 0 1
7 (M7) 0 0 0 0 0 0 1 0 0 1
8 (M8) 0 0 0 0 0 0 0 1 0 1
9 (M9) 0 0 0 0 0 0 0 0 1 1
10 (none) 0 0 0 0 0 0 0 0 0 0
Figure 4.1
Decision tree for system fault detection. The root node {1, 2, …, 10} is split by x7 = 0 into
{2, 4, 8, 9, 10} (TRUE) and {1, 3, 5, 6, 7} (FALSE, y = 1); {2, 4, 8, 9, 10} is split by x8 = 0 into
{9, 10} (TRUE) and {2, 4, 8} (FALSE, y = 1); {9, 10} is split by x9 = 0 into {10} (TRUE, y = 0)
and {9} (FALSE, y = 1).
data set. For the data set of system fault detection, the root node contains a
set with all the 10 data records in the training data set, {1, 2, …, 10}. Note that
the numbers in the data set are the instance numbers. The root node is split
into two subsets, {2, 4, 8, 9, 10} and {1, 3, 5, 6, 7}, using the attribute variable,
x7, and its two categorical values, x7 = 0 and x7 = 1. All the instances in the
subset, {2, 4, 8, 9, 10}, have x7 = 0. All the instances in the subset, {1, 3, 5, 6, 7},
have x7 = 1. Each subset is represented as a node in the decision tree. A
Boolean expression is used in the decision tree to express the split criterion,
that is, whether x7 = 0 is TRUE or FALSE.
The information entropy measures the data homogeneity of a data set D
with respect to the target variable:

entropy(D) = \sum_{i=1}^{c} -P_i \log_2 P_i    (4.1)

-0 \log_2 0 = 0    (4.2)

\sum_{i=1}^{c} P_i = 1,    (4.3)
where
D denotes a given data set
c denotes the number of different target values
Pi denotes the probability that a data record in the data set takes the ith
target value
An entropy value falls in the range, [0, log2c]. For example, given the data set
in Table 4.1, we have c = 2 (for two target values, y = 0 and y = 1), P1 = 9/10
(9 of the 10 records with y = 1) = 0.9, P2 = 1/10 (1 of the 10 records with y = 0) =
0.1, and
entropy(D) = \sum_{i=1}^{2} -P_i \log_2 P_i = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 = 0.47.
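A small sketch of this entropy measure (Equations 4.1 through 4.3) in Python follows; applied to the target values of Table 4.1 it should reproduce the value 0.47 computed above. The function name is illustrative.

```python
# A small sketch of Equations 4.1 through 4.3: the information entropy of
# a data set, computed from the list of target values of its records.
import math
from collections import Counter

def entropy(target_values):
    n = len(target_values)
    probs = [count / n for count in Counter(target_values).values()]
    # -0*log2(0) is taken as 0 (Equation 4.2), so zero-probability terms are skipped
    return sum(-p * math.log2(p) for p in probs if p > 0)

# The data set D in Table 4.1: nine records with y = 1 and one with y = 0
print(round(entropy([1] * 9 + [0]), 2))   # 0.47
```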
Table 4.2
Binary Split of the Root Node and Calculation of Information Entropy
for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Information Entropy of Split

x1 = 0: TRUE or FALSE; subsets {2, 3, 4, 5, 6, 7, 8, 9, 10}, {1}
entropy(S) = (9/10) entropy(D_true) + (1/10) entropy(D_false)
           = (9/10)[−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10)(0) = 0.45

x2 = 0: TRUE or FALSE; subsets {1, 3, 4, 5, 6, 7, 8, 9, 10}, {2}
entropy(S) = (9/10)[−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10)(0) = 0.45

x3 = 0: TRUE or FALSE; subsets {1, 2, 4, 5, 6, 7, 8, 9, 10}, {3}
entropy(S) = (9/10)[−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10)(0) = 0.45

x4 = 0: TRUE or FALSE; subsets {1, 5, 6, 7, 8, 9, 10}, {2, 3, 4}
entropy(S) = (7/10)[−(6/7)log2(6/7) − (1/7)log2(1/7)] + (3/10)(0) = 0.41

x5 = 0: TRUE or FALSE; subsets {2, 3, 4, 6, 7, 8, 9, 10}, {1, 5}
entropy(S) = (8/10)[−(7/8)log2(7/8) − (1/8)log2(1/8)] + (2/10)(0) = 0.43

x6 = 0: TRUE or FALSE; subsets {1, 2, 4, 5, 7, 8, 9, 10}, {3, 6}
entropy(S) = (8/10)[−(7/8)log2(7/8) − (1/8)log2(1/8)] + (2/10)(0) = 0.43

x7 = 0: TRUE or FALSE; subsets {2, 4, 8, 9, 10}, {1, 3, 5, 6, 7}
entropy(S) = (5/10)[−(4/5)log2(4/5) − (1/5)log2(1/5)] + (5/10)(0) = 0.36
(continued)
Figure 4.2 shows how the entropy value changes with P1 (P2 = 1 − P1) when
c = 2. Especially, we have
If all the data records in a data set take one target value, we have P1 = 0, P2 = 1 or
P1 = 1, P2 = 0, and the value of information entropy is 0, that is, we need 0 bit of
information because we already know the target value that all the data records
take. Hence, the entropy value of 0 indicates that the data set is homogenous
1
0.9
0.8
0.7
entropy (D)
0.6
0.5
0.4
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
P1
Figure 4.2
Information entropy.
with regard to the target value. If one half set of the data records in a data
set takes one target value and the other half set takes another target value,
we have P1 = 0.5, P2 = 0.5, and the value of information entropy is 1, meaning
that we need 1 bit of information to convey what target value is. Hence, the
entropy value of 1 indicates that the data set is inhomogeneous. When we use
the information entropy to measure data homogeneity, the lower the entropy
value is, the more homogenous the data set is with regard to the target value.
After a split of a data set into several subsets, the following formula is used
to compute the average information entropy of subsets:
entropy(S) = \sum_{v \in Values(S)} \frac{|D_v|}{|D|} entropy(D_v),    (4.4)
where
S denotes the split
Values(S) denotes a set of values that are used in the split
v denotes a value in Values(S)
D denotes the data set being split
|D| denotes the number of data records in the data set D
Dv denotes the subset resulting from the split using the split value v
|Dv| denotes the number of data records in the data set Dv
For example, the root node of a decision tree for the data set in Table 4.1
has the data set, D = {1, 2, …, 10}, whose entropy value is 0.47 as shown previ-
ously. Using the split criterion, x1 = 0: TRUE or FALSE, the root node is split
into two subsets: Dfalse = {1}, which is homogenous, and Dtrue = {2, 3, 4, 5, 6,
7, 8, 9, 10}, which is inhomogeneous with eight data records taking the tar-
get value of 1 and one data record taking the target value of 0. The average
entropy of the two subsets after the split is
entropy(S) = (9/10) entropy(D_true) + (1/10) entropy(D_false)
           = (9/10)[−(8/9)log2(8/9) − (1/9)log2(1/9)] + (1/10)(0) = 0.45.
Since this average entropy of subsets after the split is smaller than entropy (D) =
0.47, the split improves data homogeneity. Table 4.2 gives the average entropy of
subsets after each of the other eight splits of the root node. Among the nine pos-
sible splits, the split using the criterion of x7 = 0: TRUE or FALSE produces the
smallest average information entropy, which indicates the most homogeneous
subsets. Hence, the split criterion of x7 = 0: TRUE or FALSE is selected to split the
root node, resulting in two internal nodes as shown in Figure 4.1. The internal
node with the subset, {2, 4, 8, 9, 10}, is not homogenous. Hence, the decision tree
is further expanded with more splits until all leaf nodes are homogenous.
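A small sketch of Equation 4.4, reusing the entropy function from the earlier sketch, follows; it should reproduce the average entropies 0.45 (for the x1 = 0 split) and 0.36 (for the x7 = 0 split) of Table 4.2.

```python
# A small sketch of Equation 4.4: the average entropy of the subsets
# produced by a split (the entropy function is defined in the earlier sketch).
def entropy_of_split(subsets):
    """subsets: list of lists of target values, one list per subset D_v."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

# Split on x1 = 0: subsets {2, ..., 10} and {1}
print(round(entropy_of_split([[1] * 8 + [0], [1]]), 2))                  # 0.45
# Split on x7 = 0: subsets {2, 4, 8, 9, 10} and {1, 3, 5, 6, 7}
print(round(entropy_of_split([[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]), 2))    # 0.36
```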
The gini-index is another measure of data homogeneity for a data set D:

gini(D) = 1 - \sum_{i=1}^{c} P_i^2.    (4.5)
For example, given the data set in Table 4.1, we have c = 2, P1 = 0.9, P2 = 0.1, and
gini(D) = 1 - \sum_{i=1}^{c} P_i^2 = 1 - 0.9^2 - 0.1^2 = 0.18.
The gini-index values are computed for c = 2 and the following values of Pi:
Hence, the smaller the gini-index value is, the more homogeneous the data
set is. The average gini-index value of data subsets after a split is calculated
as follows:
gini(S) = \sum_{v \in Values(S)} \frac{|D_v|}{|D|} gini(D_v).    (4.6)
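A small sketch of Equations 4.5 and 4.6 follows; applied to the x7 = 0 split of the root node it should reproduce the value 0.16 reported in Table 4.3. The function names are illustrative.

```python
# A small sketch of Equations 4.5 and 4.6: the gini-index of a data set and
# the average gini-index of the subsets produced by a split.
from collections import Counter

def gini(target_values):
    n = len(target_values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(target_values).values())

def gini_of_split(subsets):
    """subsets: list of lists of target values, one list per subset D_v."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

# Root-node split on x7 = 0: subsets {2, 4, 8, 9, 10} and {1, 3, 5, 6, 7}
print(round(gini_of_split([[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]), 2))   # 0.16
```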
Table 4.3 gives the average gini-index value of subsets after each of the nine
splits of the root node for the training data set of system fault detection.
Among the nine possible splits, the split criterion of x7 = 0: TRUE or FALSE
produces the smallest average gini-index value, which indicates the most
homogeneous subsets. The split criterion of x7 = 0: TRUE or FALSE is selected
to split the root node. Hence, using the gini-index produces the same split as
using the information entropy.
1. Start with the root node that includes all the data records in the
training data set and select this node to split.
2. Apply a split selection method to the selected node to determine the
best split along with the split criterion and partition the set of the
training data records at the selected node into two nodes with two
subsets of data records, respectively.
3. Check if the stopping criterion is satisfied. If so, the tree construction
is completed; otherwise, go back to Step 2 to continue by selecting a
node to split (a code sketch of this procedure follows).
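A hedged code sketch of this three-step procedure for binary splits on categorical attributes is given below; the nested-dictionary tree representation, the function names, and the majority-label leaf rule are illustrative assumptions, not the book's implementation.

```python
# A sketch of recursive binary decision tree construction with the
# information entropy as the split selection measure; leaves are labeled
# with the (majority) target value, and a node stops splitting when it is
# homogeneous or no split is possible.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return sum(-c / n * math.log2(c / n) for c in Counter(values).values())

def best_split(records):
    """records: list of (x, y); returns the (attribute index, value) whose
    binary split x_i == value vs. x_i != value minimizes entropy(S)."""
    best = None
    for i in range(len(records[0][0])):
        for v in {x[i] for x, _ in records}:
            true_part = [y for x, y in records if x[i] == v]
            false_part = [y for x, y in records if x[i] != v]
            if not true_part or not false_part:
                continue
            s = (len(true_part) * entropy(true_part)
                 + len(false_part) * entropy(false_part)) / len(records)
            if best is None or s < best[0]:
                best = (s, i, v)
    return None if best is None else (best[1], best[2])

def build_tree(records):
    labels = [y for _, y in records]
    if entropy(labels) == 0 or best_split(records) is None:   # homogeneous leaf
        return Counter(labels).most_common(1)[0][0]
    i, v = best_split(records)
    true_branch = [(x, y) for x, y in records if x[i] == v]
    false_branch = [(x, y) for x, y in records if x[i] != v]
    return {"split": (i, v),
            "true": build_tree(true_branch),
            "false": build_tree(false_branch)}
```

Run on the training data of Table 4.1, this procedure should select the same sequence of splits (x7, then x8, then x9) worked out in Example 4.1, although tie-breaking among equally good splits may differ.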
Table 4.3
Binary Split of the Root Node and Calculation of the Gini-Index
for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Gini-Index Value of Split

x1 = 0: TRUE or FALSE; subsets {2, 3, 4, 5, 6, 7, 8, 9, 10}, {1}
gini(S) = (9/10) gini(D_true) + (1/10) gini(D_false)
        = (9/10)[1 − (8/9)² − (1/9)²] + (1/10)(0) = 0.18

x2 = 0: TRUE or FALSE; subsets {1, 3, 4, 5, 6, 7, 8, 9, 10}, {2}
gini(S) = (9/10)[1 − (8/9)² − (1/9)²] + (1/10)(0) = 0.18

x3 = 0: TRUE or FALSE; subsets {1, 2, 4, 5, 6, 7, 8, 9, 10}, {3}
gini(S) = (9/10)[1 − (8/9)² − (1/9)²] + (1/10)(0) = 0.18

x4 = 0: TRUE or FALSE; subsets {1, 5, 6, 7, 8, 9, 10}, {2, 3, 4}
gini(S) = (7/10)[1 − (6/7)² − (1/7)²] + (3/10)(0) = 0.17

x5 = 0: TRUE or FALSE; subsets {2, 3, 4, 6, 7, 8, 9, 10}, {1, 5}
gini(S) = (8/10)[1 − (7/8)² − (1/8)²] + (2/10)(0) = 0.175

x6 = 0: TRUE or FALSE; subsets {1, 2, 4, 5, 7, 8, 9, 10}, {3, 6}
gini(S) = (8/10)[1 − (7/8)² − (1/8)²] + (2/10)(0) = 0.175

x7 = 0: TRUE or FALSE; subsets {2, 4, 8, 9, 10}, {1, 3, 5, 6, 7}
gini(S) = (5/10)[1 − (4/5)² − (1/5)²] + (5/10)(0) = 0.16

x8 = 0: TRUE or FALSE; subsets {1, 5, 6, 7, 9, 10}, {2, 3, 4, 8}
gini(S) = (6/10)[1 − (5/6)² − (1/6)²] + (4/10)(0) = 0.167

x9 = 0: TRUE or FALSE; subsets {2, 3, 4, 6, 7, 8, 10}, {1, 5, 9}
gini(S) = (7/10)[1 − (6/7)² − (1/7)²] + (3/10)(0) = 0.17
Example 4.1
Construct a binary decision tree for the data set of system fault detection
in Table 4.1.
We first use the information entropy as the measure of data homoge-
neity. As shown in Figure 4.1, the data set at the root node is partitioned
into two subsets, {2, 4, 8, 9, 10} and {1, 3, 5, 6, 7}; the latter is already homo-
geneous with the target value, y = 1, and does not need a split. For the
subset, D = {2, 4, 8, 9, 10},
entropy(D) = \sum_{i=1}^{2} -P_i \log_2 P_i = -(1/5)log2(1/5) − (4/5)log2(4/5) = 0.72.
Except x7, which has been used to split the root node, the other eight
attribute variables, x1, x2, x3, x4, x5, x6, x8, and x9, can be used to split D.
The split criteria using x1 = 0, x3 = 0, x5 = 0, and x6 = 0 do not produce a
split of D. Table 4.4 gives the calculation of information entropy for the
splits using x2, x4, x7, x8, and x9. Since the split criterion, x8 = 0: TRUE or
FALSE, produces the smallest average entropy of the split, this split cri-
terion is selected to split D = {2, 4, 8, 9, 10} into {9, 10} and {2, 4, 8}; the latter
is already homogeneous with the target value, y = 1, and does not need a
split. Figure 4.1 shows this split.
For the subset, D = {9, 10},
entropy(D) = \sum_{i=1}^{2} -P_i \log_2 P_i = -(1/2)log2(1/2) − (1/2)log2(1/2) = 1.
Except x7 and x8, which have been used to split the root node, the other
seven attribute variables, x1, x2, x3, x4, x5, x6, and x9, can be used to split D.
The split criteria using x1 = 0, x2 = 0, x3 = 0, x4 = 0, x5 = 0, and x6 = 0 do
not produce a split of D. The split criterion of x9 = 0: TRUE or FALSE,
Table 4.4
Binary Split of an Internal Node with D = {2, 4, 8, 9, 10} and Calculation
of Information Entropy for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Information Entropy of Split

x2 = 0: TRUE or FALSE; subsets {4, 8, 9, 10}, {2}
entropy(S) = (4/5) entropy(D_true) + (1/5) entropy(D_false)
           = (4/5)[−(3/4)log2(3/4) − (1/4)log2(1/4)] + (1/5)(0) = 0.64
produces two subsets, {9} with the target value of y = 1, and {10} with the
target value of y = 0, which are homogeneous and do not need a split.
Figure 4.1 shows this split. Since all leaf nodes of the decision tree are
homogeneous, the construction of the decision tree is stopped with the
complete decision tree shown in Figure 4.1.
We now show the construction of the decision tree using the gini-
index as the measure of data homogeneity. As described previously, the
data set at the root node is partitioned into two subsets, {2, 4, 8, 9, 10} and
{1, 3, 5, 6, 7}; the latter is already homogeneous with the target value, y = 1,
and does not need a split. For the subset, D = {2, 4, 8, 9, 10},
gini(D) = 1 - \sum_{i=1}^{c} P_i^2 = 1 − (4/5)² − (1/5)² = 0.32.
The split criteria using x1 = 0, x3 = 0, x5 = 0, and x6 = 0 do not produce a split
of D. Table 4.5 gives the calculation of the gini-index values for the splits
Table 4.5
Binary Split of an Internal Node with D = {2, 4, 8, 9, 10} and Calculation
of the Gini-Index Values for the Data Set of System Fault Detection

Split Criterion    Resulting Subsets and Average Gini-Index Value of Split

x2 = 0: TRUE or FALSE; subsets {4, 8, 9, 10}, {2}
gini(S) = (4/5) gini(D_true) + (1/5) gini(D_false)
        = (4/5)[1 − (3/4)² − (1/4)²] + (1/5)(0) = 0.3

x4 = 0: TRUE or FALSE; subsets {8, 9, 10}, {2, 4}
gini(S) = (3/5)[1 − (2/3)² − (1/3)²] + (2/5)(0) = 0.27

x8 = 0: TRUE or FALSE; subsets {9, 10}, {2, 4, 8}
gini(S) = (2/5)[1 − (1/2)² − (1/2)²] + (3/5)(0) = 0.2

x9 = 0: TRUE or FALSE; subsets {2, 4, 8, 10}, {9}
gini(S) = (4/5)[1 − (3/4)² − (1/4)²] + (1/5)(0) = 0.3
using x2, x4, x7, x8, and x9. Since the split criterion, x8 = 0: TRUE or FALSE,
produces the smallest average gini-index value of the split, this split crite-
rion is selected to split D = {2, 4, 8, 9, 10} into {9, 10} and {2, 4, 8}; the latter is
already homogeneous with the target value, y = 1, and does not need a split.
For the subset, D = {9, 10},
gini(D) = 1 - \sum_{i=1}^{c} P_i^2 = 1 − (1/2)² − (1/2)² = 0.5.
Except x7 and x8, which have been used to split the root node, the other
seven attribute variables, x1, x2, x3, x4, x5, x6, and x9, can be used to split
D. The split criteria using x1 = 0, x2 = 0, x3 = 0, x4 = 0, x5 = 0, and x6 = 0 do
not produce a split of D. The split criterion of x9 = 0: TRUE or FALSE, pro-
duces two subsets, {9} with the target value of y = 1, and {10} with the tar-
get value of y = 0, which are homogeneous and do not need a split. Since
all leaf nodes of the decision tree are homogeneous, the construction of
the decision tree is stopped with the complete decision tree, which is
the same as the decision tree from using the information entropy as the
measure of data homogeneity.
Figure 4.3
Classifying a data record for no system fault using the decision tree for system fault detection.
Table 4.6
Classification of Data Records in the Testing Data Set for System Fault
Detection
Instance (Faulty Machine)    Attribute Variables (Quality of Parts): x1 x2 x3 x4 x5 x6 x7 x8 x9    Target Variable y (System Fault): True Value, Classified Value
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 1
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 1
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 1
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 1
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 1
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 1
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 1
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1
Figure 4.4
Classifying a data record for multiple machine faults using the decision tree for system fault
detection.
The target values of the data records in the testing data set are obtained using the decision tree in Figure 4.1 and are
shown in Table 4.6. Figure 4.4 highlights the path of passing a testing data
record for instance 1 in Table 4.6 from the root node to a leaf node with the
target value, y = 1. Hence, the data record is classified to have a system fault.
Example 4.2
Construct a nonbinary decision tree for the lenses data set in Table 1.3.
If the attribute variable, Age, is used to split the root node for the
lenses data set, all three categorical values of Age can be used to par-
tition the set of 24 data records at the root node using the split crite-
rion, Age = Young, Pre-presbyopic, or Presbyopic, as shown in Figure 4.5.
We use the data set of 24 data records in Table 1.3 as the training data
set, D, at the root node of the nonbinary decision tree. In the lenses
data set, the target variable has three categorical values, Non-Contact
in 15 data records, Soft-Contact in 5 data records, and Hard-Contact in
4 data records. Using the information entropy as the measure of data
homogeneity, we have
∑ − P log P = − 24 log
15 15 5 5 4 4
entropy (D) = i 2 i 2 − log 2 − log 2 = 1.3261.
i =1
24 24 24 24 24
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24}
Tear production rate = ?
Reduced Normal
{2, 6, 10, 14, 18, 22} {4, 8, 12, 16, 20, 24}
Age = ? Spectacle prescription = ?
Figure 4.5
Decision tree for the lenses data set.
Data Mining
Decision and Regression Trees 53
Table 4.7
Nonbinary Split of the Root Node and Calculation of Information Entropy
for the Lenses Data Set
Split Criterion Resulting Subsets and Average Information Entropy of Split
Age = Young, {1, 2, 3, 4, 5, 6, 7, 8}, {9, 10, 11, 12, 13, 14, 15, 16}, {17, 18, 19, 20,
Pre-presbyopic, or 21, 22, 23, 24}
Presbyopic 8 8
entropy (S) =
24
(
entropy DYoung + ) 24
( )
entropy DPre − presbyopic
8
+
24
(
entropy DPresbyopic )
8 4 4 2 2 2 2
= × − log 2 − log 2 − log 2
24 8 8 8 8 8 8
8 5 5 2 2 1 1
+ × − log 2 − log 2 − log 2
24 8 8 8 8 8 8
8 6 6 1 1 1 1
+ × − log 2 − log 2 − log 2 = 1.2867
24 8 8 8 8 8 8
Spectacle Prescription = {1, 2, 3, 4, 9, 10, 11, 12, 17, 18, 19, 20}, {5, 6, 7, 8, 13, 14, 15, 16, 21,
Myope or Hypermetrope 22, 23, 24}
12 12
entropy (S) =
24
(
entropy DMyope + ) 24
(
entropy DHypermetrope )
12 7 7 2 2 3 3
= × − log 2 − log 2 − log 2
24 12 12 12 12 12 12
12 8 8 3 3 1 1
+ × − log 2 − log 2 − log 2
24 12 12 12 12 12 12
= 1.2866
Astigmatic = No or Yes {1, 2, 5, 6, 9, 10, 13, 14, 17, 18, 21, 22}, {3, 4, 7, 8, 11, 12, 15, 16, 19,
20, 23, 24}
12 12
entropy (S) = entropy (DNo ) + entropy (DYes )
24 24
12 7 7 5 5 0 0
= × − log 2 − log 2 − log 2
24 12 12 12 12 12 12
12 8 8 4 4 0 0
+ × − log 2 − log 2 − log 2
24 12 12 12 12 12 12
= 0.9491
Tear Production Rate = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23}, {2, 4, 6, 8, 10, 12, 14, 16, 18,
Reduced or Normal 20, 22, 24}
12 12
entropy (S) = entropy (DReduced ) + entropy (DNormal )
24 24
12 12 12 0 0 0 0
= × − log 2 − log 2 − log 2
24 12 12 12 12 12 12
12 3 3 5 5 4 4
+ × − log 2 − log 2 − log 2
24 12 12 12 12 12 12
= 0.7773
54 Data Mining
Table 4.8
Nonbinary Split of an Internal Node, {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24},
and Calculation of Information Entropy for the Lenses Data Set
Resulting Subsets and Average Information
Split Criterion Entropy of Split
Age = Young, Pre-presbyopic, or {2, 4, 6, 8}, {10, 12, 14, 16}, {18, 20, 22, 24}
Presbyopic
4
entropy (S) =
12
(
entropy DYoung )
4
+
12
(
entropy DPre − presbyopic )
4
+
12
(
entropy DPresbyopic )
4 0 0 2 2 2 2
= × − log 2 − log 2 − log 2
12 4 4 4 4 4 4
4 1 1 2 2 1 1
+ × − log 2 − log 2 − log 2
12 4 4 4 4 4 4
4 2 2 1 1 1 1
+ × − log 2 − log 2 − log 2
12 4 4 4 4 4 4
= 1.3333
Spectacle Prescription = Myope or {2, 4, 10, 12, 18, 20}, {6, 7, 14, 16, 22, 24}
Hypermetrope
6
entropy (S) =
12
(
entropy DMyope )
6
+
12
(
entropy DHypermetrope )
6 1 1 2 2 3 3
= × − log 2 − log 2 − log 2
122 6 6 6 6 6 6
6 2 2 3 3 1 1
+ × − log 2 − log 2 − log 2
12 6 6 6 6 6 6
= 1.4591
Astigmatic = No or Yes {2, 6, 10, 14, 18, 22}, {4, 8, 12, 16, 20, 24}
6
entropy (S) = entropy (DNo )
12
6
+ entropy (DYes )
12
6 1 1 5 5 0 0
= × − log 2 − log 2 − log 2
12 6 6 6 6 6 6
6 2 2 0 0 4 4
+ × − log 2 − log 2 − log 2
12 6 6 6 6 6 6
= 0.7842
Decision and Regression Trees 55
Table 4.9
Nonbinary Split of an Internal Node, {2, 6, 10, 14, 18, 22}, and Calculation
of Information Entropy for the Lenses Data Set
Resulting Subsets and Average Information
Split Criterion Entropy of Split
Age = Young, Pre-presbyopic, {2, 6}, {10, 14}, {18, 22}
or Presbyopic
2 2
entropy (S) =
6
( ) (
entropy DYoung + entropy DPre − presbyopic
6
)
2
+
6
(
entropy DPresbyopic )
2 0 0 2 2 0 0
= × − log 2 − log 2 − log 2
6 2 2 2 2 2 2
2 0 0 2 2 0 0
+ × − log 2 − log 2 − log 2
6 2 2 2 2 2 2
2 1 1 1 1 0 0
+ × − log 2 − log 2 − log 2
6 2 2 2 2 2 2
= 0.3333
Spectacle Prescription = Myope {2, 10, 18}, {6, 14, 22}
or Hypermetrope
3 3
entropy (S) =
6
( ) (
entropy DMyope + entropy DHypermetrope
6
)
3 1 1 2 2 0 0
= × − log 2 − log 2 − log 2
6 3 3 3 3 3 3
3 0 0 3 3 0 0
+ × − log 2 − log 2 − log 2
6 3 3 3 3 3 3
= 0.4591
entropy to split the node with the data set of {2, 4, 6, 8, 10, 12, 14, 16, 18,
20, 22, 24} using the split criterion, Astigmatic = No or Yes, which pro-
duces two subsets of {2, 6, 10, 14, 18, 22} and {4, 8, 12, 16, 20, 24}. Table 4.9
shows the calculation of information entropy to split the node with the
data set of {2, 6, 10, 14, 18, 22} using the split criterion, Age = Young,
Pre-presbyopic, or Presbyopic, which produces three subsets of {2, 6},
{10, 14}, and (18, 22}. These subsets are further partitioned using the split
criterion, Spectacle Prescription = Myope or Hypermetrope, to produce
leaf nodes with homogeneous data sets. Table 4.10 shows the calcula-
tion of information entropy to split the node with the data set of {4, 8,
12, 16, 20, 24} using the split criterion, Spectacle Prescription = Myope
or Hypermetrope, which produces two subsets of {4, 12, 20} and {8, 16, 24}.
These subsets are further partitioned using the split criterion, Age =
Young, Pre-presbyopic, or Presbyopic, to produce leaf nodes with homo-
geneous data sets. Figure 4.5 shows the complete nonbinary decision
tree for the lenses data set.
56 Data Mining
Table 4.10
Nonbinary Split of an Internal Node, {4, 8, 12, 16, 20, 24}, and Calculation
of Information Entropy for the Lenses Data Set
Resulting Subsets and Average Information
Split Criterion Entropy of Split
Age = Young, Pre-presbyopic, or {4, 8}, {12, 16}, {20, 24}
Presbyopic
2 2
entropy (S) =
6
( ) (
entropy DYoung + entropy DPre − presbyopic
6
)
2
+
6
(
entropy DPresbyopic )
2 0 0 0 0 2 2
= × − log 2 − log 2 − log 2
6 2 2 2 2 2 2
2 1 1 0 0 1 1
+ × − log 2 − log 2 − log 2
6 2 2 2 2 2 2
2 1 1 0 0 1 1
+ × − log 2 − log 2 − log 2
6 2 2 2 2 2 2
= 0.6667
Spectacle Prescription = Myope {4, 12, 20}, {8, 16, 24}
or Hypermetrope
3 3
entropy (S) =
6
( ) (
entropy DMyope + entropy DHypermetrope
6
)
3 0 0 0 0 3 3
= × − log 2 − log 2 − log 2
6 3 3 3 3 3 3
3 2 2 0 0 1 1
+ × − log 2 − log 2 − log 2
6 3 3 3 3 3 3
= 0.4591
ai + a j
ci = . (4.7)
2
Decision and Regression Trees 57
Category 1: x ≤ c1
Category 2: c1 < x ≤ c2
.
.
.
Category k : ck −1 < x ≤ ck
Category k + 1: ck < x.
The average difference of values in a data set from their average value indi-
cates how values are similar or homogenous. The smaller the R value is, the
more homogenous the data set is. Formula 4.9 shows the computation of the
average R value after a split:
R (D ) = ∑ (y − y )
2
(4.8)
y ∈D
y=
∑ y
y ∈D
(4.9)
n
∑ ( ) D R (D )
Dv
R (S ) = v (4.10)
v ∈Values S
The space shuttle data set D in Table 1.2 has one numeric target variable
and four numeric attribute variables. The R value of the data set D with the
23 data records at the root node of the regression tree is computed as
y=
∑ y ∈D
y
n
0 + 1+ 0 + 0 + 0 + 0 + 0 + 0 + 1+ 1+ 1+ 0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1
=
23
= 0..3043
y ∈D
= 6.8696
the data record arrives is assigned as the target value of the data record. The
decision tree for a numeric target variable is called a regression tree.
• x7 = 1
• x7 = 0 & x8 = 1
• x7 = 0 & x8 = 0 & x9 = 1
and the following pattern of part quality to one leaf node with the classifica-
tion of no system fault:
• x7 = 0 & x8 = 0 & x9 = 0.
• Among the nine quality variables, only the three quality variables, x7,
x8, and x9, matter for system fault detection. This knowledge allows
us to reduce the cost of part quality inspection by inspecting the part
quality after M7, M8, and M9 only rather than all the nine machines.
• If one of these three variables, x7, x8, and x9, shows a quality failure,
the system has a fault; otherwise, the system has no fault.
This classification pattern for the target value of Inflated = T, (Color = Yellow
AND Size = Small) OR (Age = Adult AND Act = Stretch), involves all the four
attribute variables of Color, Size, Age, and Act. It is difficult to express this
60 Data Mining
simple pattern in a decision tree. We cannot use all the four attribute variables
to partition the root node. Instead, we have to select only one attribute variable.
The average information entropy of a split to partition the root node using each
of the four attribute variables is the same with the computation shown next:
8 8
entropy (S) =
16
entropy (DYellow ) + entropy DPurple
16
( )
8 5 5 3 3
= × − log 2 − log 2
12 8 8 8 8
8 2 2 6 6
+ × − log 2 − log 2
12 8 8 8 8
= 0.8829.
Color = ?
Yellow Purple
Size = ? Age = ?
{1, 2, 3, 4} {5, 6, 7, 8} {9, 11, 13, 15} {10, 12, 14, 16}
Stretch Dip
{1, 2, 3, 4} {5, 6, 7, 8}
Inflated = T Inflated = F
Figure 4.6
Decision tree for the balloon data set.
Decision and Regression Trees 61
• Color = Yellow AND Size = Large AND Age = Adult AND Act = Dip,
with Inflated = F
• Color = Yellow AND Size = Large AND Age = Child, with Inflated = F
• Color = Purple AND Age = Adult AND Act = Stretch, with Inflated = T
• Color = Purple AND Age = Adult AND Act = Dip, with Inflated = F
• Color = Purple AND Age = Child, with Inflated = F
From these seven classification patterns, it is difficult to see the simple clas-
sification pattern:
Moreover, selecting the best split criterion with only one attribute vari-
able without looking ahead the combination of this split criterion with the
following-up criteria to the leaf node is like making a locally optimal deci-
sion. There is no guarantee that making locally optimal decisions at separate
times leads to the smallest decision tree or a globally optimal decision.
However, considering all the attribute variables and their combinations
of conditions for each split would correspond to an exhaustive search of all
combination values of all the attribute variables. This is computationally
costly or sometimes impossible for a large data set with a large number of
attribute variables.
• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• SPSS AnswerTree (http://www.spss.com/answertree/)
• SAS Enterprise Miner (http://sas.com/products/miner/)
• IBM Inteligent Miner (http://www.ibm.com/software/data/iminer/)
• CART (http://www.salford-systems.com/)
• C4.5 (http://www.cse.unsw.edu.au/quinlan)
Some applications of decision trees can be found in (Ye, 2003, Chapter 1) and
(Li and Ye, 2001; Ye et al., 2001).
62 Data Mining
Exercises
4.1 Construct a binary decision tree for the balloon data set in Table 1.1
using the information entropy as the measure of data homogeneity.
4.2 Construct a binary decision tree for the lenses data set in Table 1.3
using the information entropy as the measure of the data homogeneity.
4.3 Construct a non-binary regression tree for the space shuttle data set
in Table 1.2 using only Launch Temperature and Leak-Check Pressure
as the attribute variables and considering two categorical values
of Launch Temperature (low for Temperature <60, normal for other
t emperatures) and three categorical values of Leak-Check Pressure
(50, 100, and 200).
4.4 Construct a binary decision tree or a nonbinary decision tree for the
data set found in Exercise 1.1.
4.5 Construct a binary decision tree or a nonbinary decision tree for the
data set found in Exercise 1.2.
4.6 Construct a dataset for which using the decision tree algorithm based
on the best split for data homogeneity does not produce the smallest
decision tree.
5
Artificial Neural Networks for
Classification and Prediction
net j = ∑w
i=0
x.
j, i i (5.1)
63
64 Data Mining
x0 = 1
x1 wj,0
wj,1
x2 wj,2 j
wj,i xi f o
i
wj,p
xp
Figure 5.1
Processing unit of ANN.
net j = w¢ x. (5.2)
The unit then applies a transfer function, f, to the net sum and obtains the
output, o, as follows:
(
o = f net j . )
(5.3)
Five of the common transfer functions are given next and illustrated in
Figure 5.2.
1. Sign function:
1 if net > 0
o = sgn ( net ) = (5.4)
−1 if net ≤ 0
1 if net > 0
o = hardlim ( net ) = (5.5)
0 if net ≤ 0
Artificial Neural Networks for Classification and Prediction 65
1 1
f (net)
f (net)
net
0
–6 –5 –4 –3 –2 –1 0 1 2 3 4 5 6
net
0
1 –6 –5 –4 –3 –2 –1 0 1 2 3 4 5 6
6 1
5
f (net)
4
3 f (net)
2
1 net
0
–6 –5 –4 –3 –2 –1–1 0 1 2 3 4 5 6
–2
–3
–4
–5 net
0
–6 –6 –5 –4 –3 –2 –1 0 1 2 3 4 5 6
f (net)
0
net
–6 –5 –4 –3 –2 –1 0 1 2 3 4 5 6
–1
Figure 5.2
Examples of transfer functions.
3. Linear function:
1
o = sig ( net ) = (5.7)
1 + e − net
e net − e − net
O = tanh ( net ) = . (5.8)
e net + e − net
66 Data Mining
1
x=5 w¢ = −1.2 3 2 ,
−6
the output of the unit with each of the five transfer functions is computed as
follows:
1
net = w¢x = −1.2 3 2 5 = 1.8
−6
o = sgn ( net ) = 1
o = hardlim ( net ) = 1
Table 5.1
AND Function
Inputs Output
x1 x2 o
−1 −1 −1
−1 1 −1
1 −1 −1
1 1 1
Artificial Neural Networks for Classification and Prediction 67
x0 = 1
w1,0 = –0.3
x1 w1,1 = 0.5
w1,i xi f = sgn o
i
w1,2 = 0.5
x2
Figure 5.3
Implementation of the AND function using one processing unit.
output values from −1 to 1. The sign function is used as the transfer function
for the processing unit to implement the AND function. The first three data
records require the output value of −1. The weighted sum of the inputs for
the first three data records, w1,0x0 + w1,1x1 + w1,2 x2, should be in the range of
[−1, 0]. The last data record requires the output value of 1, and the weighted
sum of the inputs should be in the range of (0, 1]. The connection weight w1,0
must be a negative value to make net for the first three data records less than
zero and also make net for the last data record greater than zero. Hence, the
connection weight w1,0 acts as a threshold against the weighted sum of the
inputs to drive net greater than or less than zero. This is why the connec-
tion weight for x0 = 1 is called the threshold or bias. In Figure 5.3, w1,0 is set
to −0.3. Equation 5.1 can be represented as follows to show the role of the
threshold or bias, b:
net = w¢ x + b , (5.9)
where
x1
x= w ¢ = w j, 1 … wj, p .
xp
The computation of the output value for each input is illustrated next.
2
o = sgn ( net ) = sgn
∑w
i=0
x = sgn − 0.3 × 1 + 0.5 × ( −1) + 0.5 × ( −1)
1, i i
2
o = sgn ( net ) = sgn
∑w
i=0
x = sgn − 0.3 × 1 + 0.5 × ( −1) + 0.5 × (1)
1, i i
2
o = sgn ( net ) = sgn
∑
i=0
w1, i xi = sgn − 0.3 × 1 + 0.5 × (1) + 0.5 × ( −1)
2
o = sgn ( net ) = sgn
∑w
i=0
x = sgn − 0.3 × 1 + 0.5 × (1) + 0.5 × (1)
1, i i
Table 5.2 gives the inputs and the output of the logical OR function.
Figure 5.4 shows the implementation of the OR function using one process-
ing unit. Only the first data record requires the output value of −1, and the
Table 5.2
OR Function
Inputs Output
x1 x2 o
−1 −1 −1
−1 1 1
1 −1 1
1 1 1
x0 = 1
w1,0 = 0.8
x1 w1,1 = 0.5
w1,i xi f = sgn o
i
w1,2 = 0.5
x2
Figure 5.4
Implementation of the OR function using one processing unit.
Artificial Neural Networks for Classification and Prediction 69
other three data records require the output value of 1. Only the first data
record produces the weighted sum −1 from the inputs, and the other three
data records produce the weighted sum of the inputs in the range [−0.5, 1].
Hence, any threshold value w1,0 in the range (0.5, 1) will make net for the
first data record less than zero and make net for the last three data records
greater than zero.
5.2 Architectures of ANNs
Processing units of ANNs can be used to construct various types of ANN
architectures. We present two ANN architectures: feedforward ANNs and
recurrent ANNs. Feedforward ANNs are widely used. Figure 5.5 shows a
one-layer, fully connected feedforward ANN in which each input is con-
nected to each processing unit. Figure 5.6 shows a two-layer, fully connected
x1
o1
x2 o2
oq
xp
Figure 5.5
Architecture of a one-layer feedforward ANN.
x1
o1
x2 o2
oq
xp
Figure 5.6
Architecture of a two-layer feedforward ANN.
70 Data Mining
feedforward ANN. Note that the input x0 for each processing unit is not
explicitly shown in the ANN architectures in Figures 5.5 and 5.6. The two-
layer feedforward ANN in Figure 5.6 contains the output layer of processing
units to produce the outputs and a hidden layer of processing units whose
outputs are the inputs to the processing units at the output layer. Each input
is connected to each processing unit at the hidden layer, and each processing
unit at the hidden layer is connected to each processing unit at the output
layer. In a feedforward ANN, there are no backward connections between
processing units in that the output of a processing unit is not used as a part
of inputs to that processing unit directly or indirectly. An ANN is not neces-
sarily fully connected as those in Figures 5.5 and 5.6. Processing units may
use the same transfer function or different transfer functions.
The ANNs in Figures 5.3 and 5.4, respectively, are examples of one-layer
feedforward ANNs. Figure 5.7 shows a two-layer, fully connected feedfor-
ward ANN with one hidden layer of two processing units and the output
layer of one processing unit to implement the logical exclusive-OR (XOR)
function. Table 5.3 gives the inputs and output of the XOR function.
The number of inputs and the number of outputs in an ANN depend on
the function that the ANN is set to capture. For example, the XOR function
x1
0.5 0.8
–0.3
1
0.5
–0.5
3 o
0.5
0.5
2
–0.5
0.8
x2
Figure 5.7
A two-layer feedforward ANN to implement the XOR function.
Table 5.3
XOR Function
Inputs Output
x1 x2 o
−1 −1 −1
−1 1 1
1 −1 1
1 1 −1
Artificial Neural Networks for Classification and Prediction 71
x1
o1
x2 o2
oq
xp
Figure 5.8
Architecture of a recurrent ANN.
has two inputs and one output that can be represented by two inputs and
one output of an ANN, respectively. The number of processing units at the
hidden layer, called hidden units, is often determined empirically to account
for the complexity of the function that an ANN implements. In general, more
complex the function is, more hidden units are needed. A two-layer feedfor-
ward ANN with a sigmoid or hyperbolic tangent function has the capability
of implementing a given function (Witten et al., 2011).
Figure 5.8 shows the architecture of a recurrent ANN with backward con-
nections that feed the outputs back as the inputs to the first hidden unit
(shown) and other hidden units (not shown). The backward connections
allow the ANN to capture the temporal behavior in that the outputs at time
t + 1 depend on the outputs or state of the ANN at time t. Hence, recurrent
ANNs such as that in Figure 5.8 have backward connections to capture tem-
poral behaviors.
methods in this section are explained using the sign transfer function for each
processing unit in a perceptron, these concepts and methods are also applicable
to a perceptron with a hard limit transfer function for each processing unit.
In Section 5.4, we present the back-propagation learning method to determine
connection weights for multiple-layer feedforward ANNs.
5.3.1 Perceptron
The following notations are used to represent a fully connected perceptron
with p inputs, q processing units at the output layer to produce q outputs, and
the sign transfer function for each processing unit, as shown in Figure 5.5:
x1 o1 w1,1 w1, p w1′ w j ,1 b1
x= o= w′ = = wj = b=
xp
oq
wq ,1
wq , p wq′ w j, p
bq
( )
For a processing unit j, o = sgn ( net ) = sgn w j¢ x + b j separates input vec-
tors, xs, into two regions: one with net > 0 and o = 1, and another with net ≤ 0
and o = −1. The equation, net = w j¢ x + b j = 0, is the decision boundary in the
input space that separates the two regions. For example, given x in a two-
dimensional space and the following weight and bias values:
x1
x= w j¢ = −1 1 b j = −1,
x2
− x1 + x2 − 1 = 0
x2 = x1 + 1.
Figure 5.9 illustrates the decision boundary and the separation of the input
space into two regions by the decision boundary. The slope and the intercept
of the line representing the decision boundary in Figure 5.9 are
− w j ,1 1
slope = = =1
w j,2 1
Artificial Neural Networks for Classification and Prediction 73
x2
x2 = x1 + 1
wj 1
x1
net > 0 0 1
net ≤ 0
Figure 5.9
Example of the decision boundary and the separation of the input space into two regions by a
processing unit.
−b j 1
intercept = = = 1.
w j, 2 1
Example 5.1
Use the graphical method to determine the connection weights of a
perceptron with one processing unit for the AND function in Table 5.1.
In Step 1, we plot the four circles in Figure 5.10 to represent the four
data points of the AND function. The output value of each data point is
noted inside the circle for the data point. In Step 2, we use the decision
boundary, x2 = −x1 + 1, to separate the three data points with o = −1 from
the data point with o = 1. The intercept of the line for the decision bound-
ary is 1 with x2 = 1 when x1 is set to 0. In Step 3, we draw the weight vector,
w1 = (0.5, 0.5), which is orthogonal to the decision boundary and points
to the positive side of the decision boundary. Hence, we have w1,1 = 0.5,
w1,2 = 0.5. In Step 4, we use the following equation to determine the bias:
w1,1x1 + w1, 2 x2 + b1 = 0
w1, 2 x2 = − w1,1x1 − b1
x2
1
–1 1
w1
x1
0 1
net > 0
–1 –1 x2 = –x1 + 1
net ≤ 0
Figure 5.10
Illustration of the graphical method to determine connection weights.
Artificial Neural Networks for Classification and Prediction 75
b1
intercept = −
w1, 2
b1
1= −
0.5
b1 = −0.5.
b1 > −1
and
b1 ≤ 0.
Hence, we have
−1 < b1 ≤ 0.
By letting b1 = −0.3, we obtain the same ANN for the AND function as
shown in Figure 5.3.
The ANN with the weights, bias, and decision boundary as those in
Figure 5.10 produces the correct output for the inputs in each data record
in Table 5.1. The ANN also has the generalization capability of classify-
ing any input vector on the negative side of the decision boundary into
o = −1 and any input vector on the positive side of the decision boundary
into o = 1.
For a perceptron with multiple output units, the graphical method
is applied to determine connection weights and bias for each
output unit.
76 Data Mining
1. x1 = −1 x2 = −1 t1 = −1
2. x1 = 1 x2 = 1 t1 = 1,
where t1 denotes the target output of processing unit 1 that needs to be pro-
duced for each data record. The two data records are plotted in Figure 5.11.
We initialize the connection weights using random values, w1,1(k) = −1 and
w1,2(k) = 0.8, with k denoting the iteration number when the weights are
assigned or updated. Initially, we have k = 0. We present the inputs of the
first data record to the perceptron of one processing unit:
Since net < 0, we have o1 = −1. Hence, the perceptron with the weight vector
(−1, 0.8) produces the target output for the inputs of the first data record, t1 = −1.
There is no need to change the connection weights. Next, we present the
inputs of the second data record to the perceptron:
Since net < 0, we have o1 = −1, which is different from the target output for this
data record t1 = 1. Hence, the connection weights must be changed in order
x2
w1(1)
1
1
w1(0)
x1
0 1
–1
Figure 5.11
Illustration of the learning method to change connection weights.
Artificial Neural Networks for Classification and Prediction 77
to produce the target output. The following equations are used to change the
connection weights for processing unit j:
1
∆w j =
2
(tj − oj x ) (5.11)
w j ( k + 1) = w j ( k ) + ∆w j . (5.12)
In Equation 5.11, if (t − o) is zero, that is, t = o, then there is no change of
weights. If t = 1 and o = −1,
1 1
∆w j =
2
( ) (
t j − o j x = 1 − ( −1) x = x.
2
)
1 1
∆w j =
2
( )
t j − o j x = ( −1 − 1) x = − x.
2
1 1 1 1
∆w1 =
2 2
(
(t1 − o1 ) x = 1 − ( −1) ) 1 = 1
−1 1 0
w1 (1) = w1 (0 ) + ∆w1 = + = .
0.8 1 1.8
The new weight vector, w1(1), is shown in Figure 5.11. As Figure 5.11 illus-
trates, w 1(1) is closer to the second data record x than w 1(0) and points more
to the direction of x since x has t = 1 and thus lies on the positive side of the
decision boundary.
78 Data Mining
With the new weights, we present the inputs of the data records to the per-
ceptron again in the second iteration of evaluating and updating the weights
if needed. We present the inputs of the first data record:
Since net < 0, we have o1 = −1. Hence, the perceptron with the weight vector
(0, 1.8) produces the target output for the inputs of the first data record, t1 = −1.
With (t1 − o1) = 0, there is no need to change the connection weights. Next, we
present the inputs of the second data record to the perceptron:
Since net > 0, we have o1 = 1. Hence, the perceptron with the weight vector
(0, 1.8) produces the target output for the inputs of the second data record,
t = 1. With (t − o) = 0, there is no need to change the connection weights. The
perceptron with the weight vector (0, 1.8) produces the target outputs for
all the data records in the training data set. The learning of the connection
weights from the data records in the training data set is finished after one
iteration of changing the connection weights with the final weight vector (0, 1.8).
The decision boundary is the line, x2 = 0.
The general equations for the learning method of determining connection
weights are given as follows:
( )
∆w j = α t j − o j x = α e j x
(5.13)
w j ( k + 1) = w j ( k ) + ∆w j (5.14)
or
( )
∆w j , i = α t j − o j xi = α e j xi
(5.15)
w j , i ( k + 1) = w j , i ( k ) + ∆w j , i , (5.16)
Where
ej = tj − oj represents the output error
α is the learning rate taking a value usually in the range (0, 1)
In Equation 5.11, α is set to 1/2. Since the bias of processing unit j is the weight
of connection from the input x0 = 1 to the processing unit, Equations 5.15 and
5.16 can be extended for changing the bias of processing unit j as follows:
( ) ( )
∆b j = α t j − o j × x0 = α t j − o j × 1 = α e j (5.17)
b j ( k + 1) = b j ( k ) + ∆b j . (5.18)
Artificial Neural Networks for Classification and Prediction 79
5.3.5 Limitation of a Perceptron
As described in Sections 5.3.2 and 5.3.3, each processing unit implements a
linear decision boundary, that is, a linearly separable function. Even with
multiple processing units in one layer, a perceptron is limited to implement-
ing a linearly separable function. For example, the XOR function in Table
5.3 is not a linearly separable function. There is only one output for the
XOR function. Using one processing unit to represent the output, we have
one decision boundary, which is a straight line representing a linear func-
tion. However, there does not exit such a straight line in the input space to
separate the two data points with o = 1 from the other two data points with
o = −1. A nonlinear decision boundary such as the one shown in Figure 5.12
is needed to separate the two data points with o = 1 from the other two data
points with o = −1. To use processing units that implement linearly separa-
ble functions for constructing an ANN to implement the XOR function, we
need two processing units in one layer (the hidden layer) to implement two
decision boundaries and one processing unit in another layer (the output
layer) to combine the outputs of the two hidden units as shown in Table 5.4
and Figure 5.7. Table 5.5 defines the logical NOT function used in Table 5.4.
Hence, we need a two-layer ANN to implement the XOR function, which is
a nonlinearly separable function.
The learning method described by Equations 5.13 through 5.18 can be used
to learn the connection weights to each output unit using a set of training
data because the target value t for each output unit is given in the training
data. For each hidden unit, Equations 5.13 through 5.18 are not applicable
because we do not know t for the hidden unit. Hence, we encounter a dif-
ficulty in learning connection weights and biases from training data for a
multilayer ANN. This learning difficulty for multilayer ANNs is overcome
by the back-propagation learning method described in the next section.
x2
1
1 –1
x1
0 1
–1 1
Figure 5.12
Four data points of the XOR function.
80 Data Mining
Table 5.4
Function of Each Processing Unit in a Two-Layer ANN
to Implement the XOR Function
x1 x2 o1 = x1 OR x2 o2 = NOT (x1 OR x2) o3 = o1 AND o2
−1 −1 −1 1 −1
−1 1 1 1 1
1 −1 1 1 1
1 1 1 −1 −1
Table 5.5
NOT Function
x o
−1 1
1 −1
∑ (t
1
Ed (W ) = )
2
j,d − oj,d , (5.19)
2 j
Where
tj,d is the target output of output unit j for the training data record d
oj,d is the actual output produced by output unit j of the ANN with the
weights W for the training data record d
The output error for a set of training data records is defined as follows:
∑∑ (t
1
E (W ) = )
2
j,d − oj,d . (5.20)
2 d j
direction of reducing the output error after passing the inputs of the data
record d through the ANN with the weights W, as follows:
∆w j , i = −α
∂Ed
= −α
∂Ed ∂net j
= αδ j
∂ (∑ w o ) = αδ o
k
j,k k
j i (5.21)
∂w j , i ∂net j ∂w j , i ∂w j , i
where δj is defined as
∂Ed
δj = − , (5.22)
∂net j
Where
α is the learning rate with a value typically in (0, 1)
oi is input i to processing unit j
If unit j directly receives the inputs of the ANN, oi is xi; otherwise, oi is
from a unit in the preceding layer feeding its output as an input to unit j.
To change a bias for a processing unit, Equation 5.21 is modified by using
oi = 1 as follows:
∆b j = αδ j . (5.23)
1
∑ (t
2
∂Ed ∂Ed ∂o j
∂
2 j
j,d )
− o j , d ∂ f net
j j ( ( ))
δj = − =− =−
∂net j ∂o j ∂net j ∂o j ∂net j
( ) (
= t j , d − o j , d f j′ net j , )
(5.24)
where f ′ denotes the derivative of the function f with regard to net. To obtain
( )
a value for the term f j′ net j in Equation 5.24, the transfer function f for unit
j must be a semi-linear, nondecreasing, and differentiable function, e.g.,
linear, sigmoid, and tanh. For the sigmoid transfer function
1
(
o j = f j net j = ) − net ,
1+ e j
we have the following:
− net
1 e j
(
f j′ net j =) 1+ e
− net j (
− net = o j 1 − o j .
1+ e j
) (5.25)
82 Data Mining
where netn is the net sum of output unit n. Using Equation 5.22, we rewrite
δj as follows:
∂netn
∂
∑w o
n, j j
∑ ( ) ∑δ ( )
j
δj = δn f j′ net j = n f j′ net j
n
∂o j n
∂o j
=
∑ n
( )
δ n wn , j f j′ net j .
(5.26)
w j , i ( k + 1) = w j , i ( k ) + ∆w j , i (5.27)
b j ( k + 1) = b j ( k ) + ∆b j . (5.28)
Example 5.2
Given the ANN for the XOR function and the first data record in Table 5.3
with x1 = −1, x2 = −1, and t = −1, use the back-propagation method to update
the weights and biases of the ANN. In the ANN, the sigmoid transfer
function is used by each of the two hidden units and the linear function
is used by the output unit. The ANN starts with the following arbitrarily
assigned values of weights and biases in (−1, 1) as shown in Figure 5.13:
x1
0.1 –0.3
0.5
1
0.3
–0.1
o1
3 o
o2
0.2
0.4
2
–0.2
–0.4
x2
Figure 5.13
A set of weights with randomly assigned values in a two-layer feedforward ANN for the XOR
function.
(
o1 = sig ( w1,1x1 + w1, 2 x2 + b1 ) = sig 0.1 × ( −1) + 0.2 × ( −1) + ( −0.3 ) )
1
= sig ( − 0.6 ) = = 0.3543
1 + e − ( −0.6)
(
o2 = sig ( w2 ,1x1 + w2 , 2 x2 + b2 ) = sig ( − 0.1) × ( −1) + ( − 0.2) × ( −1) + ( − 0.4 ) )
1
= sig ( − 0.2) = = 0.4502
1 + e − ( −0.2)
∆ b3 = αδ 3 = 0.3δ 3 .
84 Data Mining
( )
δ 3 = (t − o) f 3′ ( net3 ) = t jd − o jd lin′ ( net3 ) = ( −1 − 0.7864 ) × 1 = −1.7864
Equations 5.21, 5.23, 5.25, and 5.26 are used to determine changes to the
weights and bias for each hidden unit as follows:
n= 3
δ1 = −
∑
n n= 3
∑
δ n wn ,1 f1′ ( net1 ) = − δ n wn ,1 f1′ ( net1 )
n= 3
δ2 = −
∑δ w
n
n n, 2
∑
f 2′ ( net2 ) = − δ n wn , 2 f 2′ ( net2 ) = δ 3 w3 , 2 o2 (1 − o2 )
n= 3
Using the changes to all the weights and biases of the ANN, Equations
5.27 and 5.28 are used to perform an iteration of updating the weights
and biases as follows:
This new set of weights and biases, wj,i(1) and bj(1), will be used to pass
the inputs of the second data record through the ANN and then update
the weights and biases again to obtain wj,i(2) and bj(2) if necessary. This
process repeats again for the third data record, the fourth data record,
back to the first data record, and so on, until the measure of the output
error E as defined in Equation 5.20 is smaller than a preset threshold,
e.g., 0.1.
Global minimum of E
Local minimum of E
Small ΔW Large ΔW
Figure 5.14
Effect of the learning rate.
a small learning rate and a large learning rate, a method of adaptive learning
rates can be used to start with a large learning rate for speeding up the learn-
ing process and then change to a small learning rate for taking small steps to
reach a local or global minimum value of E.
Unlike the decision trees in Chapter 4, an ANN does not show an explicit
model of the classification and prediction function that the ANN has learned
from the training data. The function is implicitly represented through con-
nection weights and biases that cannot be translated into meaningful clas-
sification and prediction patterns in the problem domain. Although the
knowledge of classification and prediction patterns has been acquired by
the ANN, such knowledge is not available in an interpretable form. Hence,
ANNs help the task of performing classification and prediction but not the
task of discovering knowledge.
units. The more hidden units the ANN has, the more complex function the
ANN can learn and represent. However, if we use a complex ANN to learn
a simple function, we may see the function of the ANN over-fit the data as
illustrated in Figure 5.15. In Figure 5.15, data points are generated using a
linear model:
y = x + ε,
In general, an over-fitted model does not generalize well to new data points
in the testing data set. When we do not have prior knowledge about a given
data set (e.g., the form or complexity of the classification and prediction func-
tion), we have to empirically try out ANN architectures with varying levels
Figure 5.15
An illustration of a nonlinear model overfitting to data from a linear model.
88 Data Mining
• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (www.mathworks.com/)
Some applications of ANNs can be found in (Ye et al., 1993; Ye, 1996, 2003,
Chapter 3; Ye and Zhao, 1996, 1997).
Exercises
5.1 The training data set for the Boolean function y = NOT x is given
next. Use the graphical method to determine the decision boundary,
the weight, and the bias of a single-unit perceptron for this Boolean
function.
The training data set:
X Y
−1 1
1 −1
5.2 Consider the single-unit perceptron in Exercise 5.1. Assign 0.2 to ini-
tial weights and bias and use the learning rate of 0.3. Use the learning
method to perform one iteration of the weight and bias update for the
two data records of the Boolean function in Exercise 5.1.
5.3 The training data set for a classification function with three attribute vari-
ables and one target variable is given below. Use the graphical method
to determine the decision boundary, the weight, and the bias of a single-
neuron perceptron for this classification function.
Artificial Neural Networks for Classification and Prediction 89
5.6 The following ANN with the initial weights and biases is used to
learn the XOR function given below. The transfer function for units
1 and 4 is the linear function. The transfer function for units 2 and 3
is the sigmoid transfer function. The learning rate is α = 0.3. Perform
one iteration of the weight and bias update for w1,1, w1,2, w 2,1, w 3,1, w4,2,
w4,3, b2, after feeding x1 = 0 and x2 = 1 to the ANN.
b3 = –0.35
90 Data Mining
XOR:
x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0
6
Support Vector Machines
R ( A) =
∫f A ( x ) − y P ( x , y ) dxdy , (6.1)
where P(x, y) denotes the probability function of x and y. The expected risk of
classification depends on A values. A smaller expected risk of classification
indicates a better generalization performance of the classification function
in that the classification function is capable of classifying more data points
91
92 Data Mining
∑f
1
Remp ( A ) = A ( xi ) − y i . (6.2)
n i =1
2n η
v ln + 1 − ln
v 4
R ( A ) ≤ Remp ( A ) + , (6.3)
n
w¢ x + b = 0. (6.5)
y = sign (w¢ x + b ) = −1 if w¢ x + b ≤ 0.
If we impose a constraint,
w ≤ M,
w = w12 + + w p2 .
The set of hyperplanes defined by the following:
{f w ,b = sign (w¢ x + b )| w ≤ M , }
has the VC-dimension v that satisfies the bound (Vapnik, 1989, 2000):
{ }
v ≤ min M 2 , p + 1. (6.7)
1 2
min w . (6.8)
2
Rescaling w does not change the slope of the hyperplane for the decision
boundary. Rescaling b does not change the slope of the decision boundary but
94 Data Mining
moves the hyperplane of the decision boundary in parallel. For example, in the
two-dimensional vector space shown in Figure 6.1, the decision boundary is
w1 b
w1x1 + w2 x2 + b = 0 or x2 = − x1 − , (6.9)
w2 w2
the slope of the line for the decision boundary is −w1 w2, and the intercept
of the line for the decision boundary is −b w2. Rescaling w to cww, where cw
is a constant, does not change the slope of the line for the decision boundary
as −cw w1 cw w2 = − w1 w2 . Rescaling b to cbb, where cb is a constant, does not
change the slope of the line for the decision boundary, but changes the inter-
cept of the line to −cbb w2 and thus moves the line in parallel.
Figure 6.1 shows examples of data points with the target value of 1
(indicated by small circles) and examples of data points with the target value
of −1 (indicated by small squares). Among the data points with the target
value of 1, we consider the data point closest to the decision boundary, x+1, as
shown by the data point with the solid circle in Figure 6.1. Among the data
points with the target value of −1, we consider the data point closest to the
decision boundary, x−1, as shown by the data point with the solid square in
Figure 6.1. Suppose that for two data points x+1 and x−1 we have
w¢ x+1 + b = c+1
(6.10)
w¢ x −1 + b = c−1.
We want to rescale w to cww and rescale b to cbb such that we have
cw w¢ x+1 + cbb = 1
(6.11)
cw w¢ x −1 + cbb = −1,
and still denote the rescaled values by w and b. We have
{
min w¢ xi + b , i = 1, … , n = 1, }
which implies |w′ x + b| = 1 for the data point in each target class closest to
the decision boundary w′x + b = 0.
x2 x2
x+1 x+1
Figure 6.1
SVM for a linear classifier and a linearly separable problem. (a) A decision boundary with a
large margin. (b) A decision boundary with a small margin.
Support Vector Machines 95
We solve Equations 6.12 through 6.15 to obtain cw and cb. We first use Equation
6.14 to obtain
1 − cbb
cw = , (6.16)
w1x+1,1 + w2 x+1, 2
and substitute cw in Equations 6.16 into 6.15 to obtain
1 − cbb
w1x+1,1 + w2 x+1, 2
(w1x−1,1 + w2 x−1, 2 ) + cbb = −1. (6.17)
and substitute Equations 6.18 and 6.19 into Equation 6.17 to obtain
1 − cbb
(c−1 − b ) + cbb = −1
c+1 − b
c−1 − b (c−1 − b ) b
− cb + bcb = −1
c+1 − b c+1 − b
2b − c+1 − c−1
cb = . (6.20)
b 2 + b − c−1b
We finally use Equation 6.14 to compute cw and substitute Equations 6.18 and
6.20 into the resulting equations to obtain
Let w and b denote the rescaled values. The hyperplane bisects w′x + b = 1
and w′x + b = −1 is w′x + b = 0, as shown in Figure 6.1. Any data point x with
the target class +1 satisfies
w¢ x + b ≥ 1
since the data point with the target class of +1 closest to w′x + b = 0 has w′x +
b = 1. Any data point x with the target class of −1 satisfies
w¢ x + b ≤ −1
since the data point with the target class of −1 closest to w′x + b = 0 has w′x +
b = −1. Therefore, the linear classifier can be defined as follows:
the margin of the decision boundary or the margin of the linear classifier,
with the w′x + b = 0 being the decision boundary. To show this in the two-
dimensional vector space of x, let us compute the distance of two parallel
lines w′x + b = 1 and w′x + b = −1 in Figure 6.1. These two parallel lines can
be represented as follows:
w1x1 + w2 x2 + b = 1 (6.25)
w2 x1 − w1x2 = 0 (6.27)
x1 and x2, we obtain the coordinates of the data point where these two lines
−1 − b −1 − b
are intersected: 2 w1 , 2 w2 . Then we compute the distance of
w1 + w22 w1 + w22
1− b 1− b −1 − b −1 − b
the two data points, 2 w1 , 2 w2 and 2 w1 , 2 w2 :
w1 + w22 w1 + w22 w1 + w22 w1 + w22
2 2
1− b −1 − b 1− b −1 − b
d= 2 w1 − 2 w1 + 2 w2 − 2 w2
w1 + w2
2
w1 + w2 w1 + w2
2 2
w1 + w2
2
1 2 2
= 22 w12 + 22 w22 = = . (6.28)
w12 + w22 2
w +w
1
2
2
w
2
Hence, minimizing (1 2) w in the objective function of the quadratic pro-
gramming problem in Formulation 6.24 is to maximize the margin of the
linear classifier or the generalization performance of the linear classifier.
Figure 6.1a and b shows two different linear classifiers with two different
decision boundaries that classify the eight data points correctly but have
different margins. The linear classifier in Figure 6.1a has a larger margin
and is expected to have a better generalization performance than that in
Figure 6.1b.
98 Data Mining
∑α y (w¢ x + b) − 1
1
min w ,bmax a ≥ 0 L (w, b , a ) =
2
w − i i i (6.29)
2 i =1
subject to
αi ≥ 0 i = 1, … , n,
where αi, i = 1, …, n are the non-negative Lagrange multipliers, and the two
equations in the constrains are known as the Karush–Kuhn–Tucker condition
(Burges, 1998) and are the transformation of the inequality constraint in
Equation 6.23. The solution to Formulation 6.29 is at the saddle point of
L (w, b , a ), where L (w, b , a ) is minimized with regard to w and b and maxi-
2
mized with regard to α. Minimizing (1 2) w with regard to w and b covers the
∑
n
objective function in Formulation 6.24. Minimizing − α i yi (w¢ xi + b ) − 1
i =1
∑
n
is to maximize α i yi (w¢ xi + b ) − 1 with regard to α and satisfy
i =1
yi (w¢ xi + b ) ≥ 1—the constraint in Formulation 6.24, since αi, ≥ 0. At the point
where L (w, b , a ) is minimized with regard to w and b, we have
∂L ( w , b , a )
n n
∂w
= w− ∑α y x = 0
i =1
i i i or w = ∑α y x
i =1
i i i (6.31)
∂L (w, b , a )
n
∂b
= ∑α y = 0.
i =1
i i (6.32)
Note that w is determined by only the training data points (xi, yi) for which
αi > 0. Those training data vectors with the corresponding αi > 0 are called
support vectors. Using the Karush–Kuhn–Tucker condition in Equation 6.30
and any support vector (xi, yi) with αi > 0, we have
y i (w ¢ xi + b ) − 1 = 0 (6.33)
Support Vector Machines 99
yi2 = 1 (6.34)
since yi takes the value of 1 or −1. We solve Equations 6.33 and 6.34 for b
and get
b = y i − w ¢ xi (6.35)
because
To compute w using Equations 6.31 and 6.32 and compute b using Equation
6.35, we need to know the values of the Lagrange multipliers α. We
substitute Equations 6.31 and 6.32 into L (w, b , a ) in Formulation 6.29 to
obtain L(α)
n n n n n n
∑∑ ∑∑ ∑ ∑α
1
L (a ) = α iα j yi y j xi¢ x j − α iα j yi y j xi¢ x j − b α i yi + i
2 i =1 j =1 i =1 j =1 i =1 i =1
n n n
∑ ∑∑α α y y x¢x .
1
= αi − i j i j i j (6.36)
i =1
2 i =1 j =1
n n n
∑ ∑∑α α y y x¢x
1
max a L (a ) = αi − i j i j i j (6.37)
i =1
2 i =1 j =1
subject to
∑α y = 0 i i
i =1
α i yi (w¢ xi + b ) − 1 = 0 or ∑α α y y x¢ x + α y b − α = 0
j =1
i j i j j i i i i i = 1, …, n
α i ≥ 0 i = 1, …, n.
100 Data Mining
In summary, the linear classifier for SVM is solved in the following steps:
n n n
∑ ∑∑α α y y x¢x
1
max a L (a ) = αi − i j i j i j
i =1
2 i =1 j =1
subject to
∑α y = 0 i i
i =1
∑α α y y x¢ x + α y b − α = 0
j =1
i j i j j i i i i i = 1,…, n
α i ≥ 0 i = 1, … , n.
w= ∑α y x . i i i
i =1
b = y i − w ¢ xi .
and the decision function of the linear classifier is given in
Equation 6.22:
y = sign (w¢ x + b ) = 1 if w¢ x + b ≥ 1
y = sign (w¢ x + b ) = −1 if w¢ x + b ≤ −1,
or Equation 6.4:
n
f w ,b ( x ) = sign (w¢ x + b ) = sign
∑ i =1
α i yi xi¢ x + b .
Note that only the support vectors with the corresponding αi > 0
contribute to the computation of w, b and the decision function of the
linear classifier.
Support Vector Machines 101
Example 6.1
Determine the linear classifier of SVM for the AND function in Table 5.1,
which is copied here in Table 6.1 with x = (x1, x2).
There are four training data points in this problem. We formulate and
solve the optimization problem in Formulation 6.24 as follows:
1
min w1 , w2 , b
2
( w1 )2 + ( w2 )2
subject to
w1 + w2 − b ≥ 1
w1 − w2 − b ≥ 1
− w1 + w2 − b ≥ 1
w1 + w2 + b ≥ 1.
x1
y = sign 1 1 − 1 = sign ( x1 + x2 − 1) = 1 if x1 + x2 − 1 ≥ 1
x2
x1
y = sign 1 1 − 1 = sign ( x1 + x2 − 1) = −1 if x1 + x2 − 1 ≤ −1
x2
or
x1
f w ,b ( x ) = sign (w¢ x + b ) = sign 1 1 − 1 = sign ( x1 + x2 − 1) .
x2
Table 6.1
AND Function
Data Point # Inputs Output
i x1 x2 y
1 −1 −1 −1
2 −1 1 −1
3 1 −1 −1
4 1 1 1
102 Data Mining
∑ ∑ ∑ α α y y x′x
1
max α L (a ) = αi − i j i j i j
i =1
2 i =1 j =1
1
= α 1 + α 2 + α 3 + α 4 − [α 1α 1 y1 y1 x1¢ x1 + α 1α 2 y1 y 2 x 1¢ x2
2
+ α 1α 3 y1 y 3 x1¢ x3 + α 1α 4 y1 y 4 x 1¢ x 4 + α 2α 1 y 2 y1 x2¢ x1 + α 2α 2 y 2 y 2 x ¢2 x2
+ α 2α 3 y 2 y 3 x2¢ x3 + α 2α 4 y 2 y 4 x2¢ x 4 + α 3 α 1 y 3 y1 x3¢ x1 + α 3 α 2 y 3 y 2 x3¢ x2
+ α 3 α 3 y 3 y 3 x3¢ x3 + α 3 α 4 y 3 y 4 x3¢ x 4 + α 4 α 1 y 4 y1 x4¢ x1 + α 4 α 2 y 4 y 2 x4¢ x2
+ α 4 α 3 y 4 y 3 x ¢4 x3 + α 4 α 4 y 4 y 4 x4¢ x 4 ]
1 −1
= α1 + α 2 + α 3 + α 4 − α 1α 1 ( −1) ( −1)[ −1 − 1]
2 −1
−1 1
+ 2α 1α 2 ( −1) ( −1)[ −1 − 1] + 2α 1α 3 ( −1) ( −1)[ −1 − 1]
1
−1
1 −1
+ 2α 1α 4 ( −1) (1)[ −1 − 1] + α 2α 2 ( −1) ( −1)[ −11]
1 1
1 1
+ 2α 2α 3 ( −1) ( −1)[ −11] + 2α 2α 4 ( −1)(1)[ −11]
−1
1
1 1
+ α 3 α 3 ( −1) ( −1)[1 − 1] + 2α 3 α 4 ( −1)(1)[1 − 1]
−1
1
1
+ α 4 α 4 (1)(1)[11]
1
1
= α 1 + α 2 + α 3 + α 4 − (2α 12 + 2α 22 + 2α 23 + 2α 24 − 4α 1α 4 − 4α 2α 3 )
2
= −α 12 − α 22 − α 23 − α 24 + 2α 1α 4 + 2α 2α 3 + α 1 + α 2 + α 3 + α 4
= − (α1 − α 4 ) − (α 2 − α 3 ) + α1 + α 2 + α 3 + α 4
2 2
subject to
n
∑α y = α y + α y
i =1
i i 1 1 2 2 + α 3 y 3 + α i y 4 = −α 1 − α 2 − α 3 + α 4 = 0
n
∑α α y y x¢ x + α y b − α = 0
j =1
i j i j j i i i i i = 1, 2, 3, 4 become:
−1 −1 −1
α 1 ( −1) α 1 ( −1) −1 −1 + α 2 ( −1) −1 1 + α 3 ( −11) 1 −1
−1 −1 −1
−1
+ α 4 (1) 1 1 + α 1 ( −1) b − α 1 = 0 or − α 1 ( −2α 1 − 2α 4 ) − α 1b − α 1 = 0
−1
Support Vector Machines 103
−1
α 4 (1) 1 1 + α 2 ( −1) b − α 2 = 0 or − α 2 ( −2α 2 + 2α 3 ) − α 2b − α 2 = 0
1
1
+ α 4 (1) 1 1 + α 3 ( −1) b − α 3 = 0 or − α 3 ( 2α 2 − 2α 3 ) − α 3 b − α 3 = 0
−1
1
+ α 4 (1) 1 1 + α 4 (1) b − α 4 = 0 or α 4 ( 2α 1 + 2α 4 ) + α 4 b − α 4 = 0
1
α i ≥ 0, i = 1, 2, 3, 4.
w= ∑α y x
i =1
i i i
w1 = α 1 y1x1,1 + α 2 y 2 x2 ,1 + α 3 y 3 x3 ,1 + α 4 y 4 x 4 ,1
w2 = α 1 y1x1, 2 + α 2 y 2 x2 , 2 + α 3 y 3 x3 , 2 + α 4 y 4 x 4 , 2
The optimal solution already includes the value of b = −1. We obtain the
same value of b using Equation 6.35 and the fourth data point as the sup-
port vector:
1
b = y 4 − w¢ x 4 = 1 − 1 1 = −1.
1
The optimal solution of the dual problem for SVM gives the same deci-
sion function:
x1
y = sign 1 1 − 1 = sign ( x1 + x2 − 1) = 1 if x1 + x2 − 1 ≥ 1
x2
x1
y = sign 1 1 − 1 = sign ( x1 + x2 − 1) = −1 if x1 + x2 − 1 ≤ −1
x2
or
x1
f w ,b ( x ) = sign (w¢ x + b ) = sign 1 1 − 1 = sign ( x1 + x2 − 1) .
x2
Hence, the optimization problem and its dual problem of SVM for this
example problem produces the same optimal solution and the decision
function. Figure 6.2 illustrates the decision function and the support vec-
tors for this problem. The decision function of SVM is the same as that of
ANN for the same problem illustrated in Figure 5.10 in Chapter 5.
x2
1
–1
1
x1
0 1
x1 + x2 – 1 = 1
–1 –1 x1 + x2 – 1 = 0
x1 + x2 – 1 = –1
Figure 6.2
Decision function and support vectors for the SVM linear classifier in Example 6.1.
Support Vector Machines 105
Many books and papers in literature introduce SVMs using the dual opti-
mization problem in Formulation 6.37 but without the set of constraints:
n
∑α α y y x¢x + α y b − α = 0
j =1
i j i j j i i i i i = 1, … , n.
As seen from Example 6.1, without this set of constraints, the dual problem
becomes
max a − (α 1 − α 4 ) − (α 2 − α 3 ) + α 1 + α 2 + α 3 + α 4
2 2
subject to
−α 1 − α 2 − α 3 + α 4 = 0
α i ≥ 0, i = 1, 2, 3, 4.
If we let α 1 = α 4 > 0 and α 2 = α 3 = 0, which satisfy all the constraints, then the
objective function becomes max α1 + α4, which is unbounded as α1 and α4
can keep increasing their value without a bound. Hence, Formulation 6.37 of
the dual problem with the full set of constraints should be used.
k
n
∑
1 2
min w ,b ,b w + C βi (6.38)
2 i =1
subject to
yi (wxi + b ) ≥ 1 − βi , i = 1, … , n
βi ≥ 0, i = 1, … , n,
where C > 0 and k ≥ 1 are predetermined for giving the penalty of misclas-
sifying the data points. Introducing βi into the constraint in Formulation 6.38
106 Data Mining
− ∑
i =1
α i yi (wxi + b ) − 1 + βi − ∑γ β ,
i =1
i i (6.39)
∂L ( w , b , b , a , g )
n n
∂w
= w− ∑i =1
α i yi xi = 0 or w = ∑α y x
i =1
i i i (6.40)
∂L ( w, b , b , a , g )
n
∂b
= ∑α y = 0
i =1
i i (6.41)
n k −1
= ∑
∂L ( w, b , b , a , g ) pC βi − α i − γ i = 0
i = 1, … , n if k > 1
. (6.42)
∂b
i =1
C − α i − γ i = 0 i = 1, … , n if k = 1
δ − α i − γ i = 0 or γ i = δ − α i i = 1, … , n if k > 1
. (6.44)
C − α i − γ i = 0 or γ i = C − α i i = 1, … , n if k = 1
Support Vector Machines 107
α i yi (wxi + b ) − 1 + βi = 0. (6.45)
Using a data point (xi, yi) that is correctly classified by the SVM, we have
βi = 0 and thus the following based on Equation 6.45:
b = y i − w ¢ xi , (6.46)
which is the same as Equation 6.35. Equations 6.40 and 6.46 are used to
compute w and b, respectively, if α is known. We use the dual problem of
Formulation 6.39 to determine α as follows.
When k = 1, substituting w, b, and γ in Equations 6.40, 6.44, and 6.46, respec-
tively, into Formulation 6.39 produces
k
n
n n
∑ ∑ ∑γ β
1
max a ≥ 0 L (a ) = w + C α i yi (wxi + b ) − 1 + βi −
2
βi − i i
2 i =1 i =1 i =1
n n n
n n
∑∑ ∑ ∑ ∑
1
= α iα j yi y j xi¢x j + C βi − α i yi α j y j xj¢ xi + b − 1 + βi
2
i =1 j =1 i =1 i =1 j =1
n n n n
∑ ∑ ∑∑α α y y x¢x
1
− (C − α i ) β i = αi − i j i j i j (6.47)
i =1 i =1
2 i =1 j =1
subject to
∑α y = 0
i i
i =1
α i ≤ C i = 1, … , n
α i ≥ 0 i = 1, … , n.
C − α i − γ i = 0 or C − α i = γ i .
Since γi ≥ 0, we have C ≥ αi.
108 Data Mining
∑ ∑α y (wx + b) − 1 + β − ∑γ β
1
max a ≥0 ,δ L (a ) = w + C
2
βi − i i i i i i
2 i =1 i =1 i =1
n n
n
k
n n
∑∑ ∑ ∑ ∑
1
= α iα j yi y j xi¢x j + C βi − α i yi α j y j x ¢j xi + b − 1 + βi
2
i =1 j =1 i =1 i =1 j =1
p
n n n n
δ p −1
1
∑ (δ − α )β = ∑α − 2 ∑∑α α y y x¢x −
1
− i i i i j i j i j 1 1 − p
(6.48)
i =1 i =1 i =1 j =1 ( pC ) p −1
subject to
n
∑α y = 0 i i
i =1
α i ≤ δ i = 1, … , n
α i ≥ 0 i = 1, … , n.
The decision function of the linear classifier is given in Equation 6.22:
y = sign (w¢ x + b ) = 1 if w¢ x + b ≥ 1
y = sign (w¢ x + b ) = −1 if w¢ x + b ≤ −1,
or Equation 6.4:
n
f w, b ( x ) = sign (w¢ x + b ) = sign
∑ i =1
α i yi xi¢x + b .
Only the support vectors with the corresponding αi > 0 contribute to the
computation of w, b, and the decision function of the linear classifier.
x → j (x) ,
Support Vector Machines 109
where
(
j ( x ) = h1ϕ1 ( x ) , …, hlϕ l ( x ) . ) (6.49)
n n n
∑ ∑∑α α y y j ( x )¢ j ( x )
1
max a ≥ 0 L (a ) = αi − i j i j i j (6.50)
i =1
2 i =1 j =1
subject to
n
∑α y = 0 i i
i =1
α i ≤ C i = 1, …, n
α i ≥ 0 i = 1, …, n.
When k > 1,
n n n
δp p −1
1
∑ ∑∑
1
max a ≥ 0 , δ L (a ) = αi − ( )
α i α j y i y j j ( x i )¢ j x j − 1 − p (6.51)
( pC )
1 p −1
i =1
2 i =1 j =1
subject to
n
∑α y = 0 i i
i =1
α i ≤ δ i = 1, …, n
α i ≥ 0 i = 1, …, n,
n
f w ,b ( x ) = sign
∑α y j ( x )¢ j ( x) + b .
i =1
i i i (6.52)
K ( x , y ) = j ( x )¢ j ( y ) = ∑h ϕ ( x)¢ ϕ (y) ,
i =1
2
i i i (6.53)
110 Data Mining
the formulation of the soft margin SVM in Equations 6.50 through 6.52
becomes:
When k = 1,
n n n
∑ ∑∑α α y y K ( x , x )
1
max a ≥ 0 L (a ) = αi − i j i j i j (6.54)
i =1
2 i =1 j =1
subject to
n
∑α y = 0 i i
i =1
α i ≤ C i = 1, …, n
α i ≥ 0 i = 1, …, n.
When k > 1,
n n n
δp p −1
1
∑ ∑∑
1
max a ≥ 0 , δ L (a ) = αi − (
α i α j y i y j K xi , x j − ) 1 − p (6.55)
( pC )
1 p −1
i=1
2 i =1 j =1
subject to
n
∑α y = 0 i i
i =1
α i ≤ δ i = 1, …, n
α i ≥ 0 i = 1, …, n.
The soft margin SVM in Equations 6.50 through 6.52 requires the transforma-
tion φ(x) and then solve the SVM in the feature space, while the soft margin
SVM in Equations 6.54 through 6.56 uses a kernel function K(x, y) directly.
To work in the feature space using Equations 6.50 through 6.52, some
examples of the transformation function for an input vector x in a one-
dimensional space are provided next:
(
j ( x ) = 1, x , …, x d ) (6.57)
K ( x , y ) = j ( x )′ j ( y ) = 1 + xy + + ( xy ) .
d
Support Vector Machines 111
1 1
j ( x ) = sin x , sin ( 2x ) , …, sin (ix ) , … (6.58)
2 i
sin ( x + y 2)
∞
x , y ∈[ 0 , π ] .
An example of the transformation function for an input vector x = (x1, x2) in
a two-dimensional space is given next:
(
j ( x ) = 1, 2 x1 , 2 x2 , x12 , x22 , 2 x1x2 ) (6.59)
K ( x , y ) = j ( x )¢ j ( y ) = (1 + xy ) .
2
An example of the transformation function for an input vector x = (x1, x2, x3)
in a three-dimensional space is given next:
(
j ( x ) = 1, 2 x1 , 2 x2 , 2 x3 , x12 , x22 , x32 , 2 x1x2 , 2 x1x3 , 2 x2 x3 , (6.60) )
K ( x , y ) = j ( x )¢ j ( y ) = (1 + xy ) .
2
K ( x , y ) = (1 + xy )
d
(6.61)
2
x−y
−
K ( x, y) = e
2
2σ (6.62)
x2
x1
Figure 6.3
A polynomial decision function in a two-dimensional space.
x2
x1
Figure 6.4
A Gaussian radial basis function in a two-dimensional space.
The addition and the tensor product of kernel functions are often used to
construct more complex kernel functions as follows:
K ( x, y) = ∑K ( x, y)
i
i (6.64)
K ( x, y) = ∏K ( x, y).
i
i (6.65)
Support Vector Machines 113
Exercises
6.1 Determine the linear classifier of SVM for the OR function in Table 5.2
using the SVM formulation for a linear classifier in Formulations 6.24
and 6.29.
6.2 Determine the linear classifier of SVM for the NOT function using
the SVM formulation for a linear classifier in Formulations 6.24 and
6.29. The training data set for the NOT function, y = NOT x, is given
next:
The training data set:
X Y
−1 1
1 −1
Support Vector Machines 115
6.3 Determine the linear classifier of SVM for a classification function with
the following training data, using the SVM formulation for a linear
classifier in Formulations 6.24 and 6.29.
The training data set:
x1 x2 x3 y
−1 −1 −1 0
−1 −1 1 0
−1 1 −1 0
−1 1 1 1
1 −1 −1 0
1 −1 1 1
1 1 −1 1
1 1 1 1
7
k-Nearest Neighbor Classifier
and Supervised Clustering
xi , 1
xi =
xi , p
( ) ∑ (x )
2
d xi , x j = i,l − x j , l , i ≠ j. (7.1)
l =1
117
118 Data Mining
two data points are, and the farther apart the two data points are separated
in the p-dimensional data space.
The Minkowski distance is defined as
1r
p
( ) ∑x
r
d xi , x j = i,l − x j,l , i ≠ j. (7.2)
l =1
where x– and s are the sample average and the sample standard deviation of x.
Another normalization method uses the following formula to normalize a vari-
able x and produce a normalized variable z with values in the range of [0, 1]:
xmax − x
z= . (7.4)
xmax − xmin
where sxi x j, sxi, and sxj are the estimated covariance of xi and xj, the estimated
standard deviation of xi, and the estimated standard deviation of xj, respec-
tively, and are computed using a sample of n data points as follows:
∑ (x )( )
1
s xi x j = i,l − xi x j , l − x j (7.6)
n−1
l =1
∑ (x )
1 2
s xi = i,l − xi (7.7)
n−1 l =1
∑ (x )
1 2
sx j = j,l − xj (7.8)
n−1 l =1
∑x
1
xi = i,l (7.9)
n
l =1
xj =
1
n ∑x j,l . (7.10)
l =1
xi′ x j
cos (θ ) = , (7.11)
xi x j
where ∥xi∥ and ∥xj∥ are the length of the two vectors and are computed as
follows:
x j = x 2j ,1 + + x 2j , p . (7.13)
120 Data Mining
When θ = 0°, that is, the two vectors point to the same direction, cos(θ) = 1.
When θ = 180°, that is, the two vectors point to the opposite directions,
cos(θ) = −1. When θ = 90° or 270°, that is, the two vectors are orthogonal,
cos(θ) = 0. Hence, like Pearson’s correlation coefficient, the cosine similarity
measure gives a value in the range of [−1, 1] and is a measure of similarity
between two data points xi and xj. The larger the value of the cosine simi-
larity, the more similar the two data points are. A more detailed descrip-
tion of the computation of the angle between two data vectors is given in
Chapter 14.
To classify a data point x, the similarity of the data point x to each of
n data points in the training data set is computed using a selected mea-
sure of similarity or dissimilarity. Among the n data points in the train-
ing data set, k data points that are most similar to the data point x are
considered as the k-nearest neighbors of x. The dominant target class of
the k-nearest neighbors is taken as the target class of x. In other words, the
k-nearest neighbors use the majority voting rule to determine the target
class of x. For example, suppose that for a data point x to be classified, we
have the following:
• k is set to 3
• The target variable takes one of two target classes: A and B
• Two of the 3-nearest neighbors have the target class of A
Example 7.1
Use a 3-nearest neighbor classifier and the Euclidean distance measure
of dissimilarity to classify whether or not a manufacturing system is
faulty using values of the nine quality variables. The training data set
in Table 7.1 gives a part of the data set in Table 1.4 and includes nine
single-fault cases and the nonfault case in a manufacturing system. For
the ith data observation, there are nine attribute variables for the quality
of parts, (xi,1, …, xi,9), and one target variable yi for system fault. Table 7.2
gives test cases for some multiple-fault cases.
For the first data point in the testing data set x = (1, 1, 0, 1, 1, 0, 1, 1, 1),
the Euclidean distances of this data point to the ten data points in the
training data set are 1.73, 2, 2.45, 2.24, 2, 2.65, 2.45, 2.45, 2.45, 2.65, respec-
tively. For example, the Euclidean distance between x and the first data
point in the training data set x 1 = (1, 0, 0, 0, 1, 0, 1, 0, 1) is
Table 7.1
Training Data Set for System Fault Detection
Instance i (Faulty Machine)   x1 x2 x3 x4 x5 x6 x7 x8 x9   Target Variable y (System Fault)
1 (M1)    1 0 0 0 1 0 1 0 1   1
2 (M2)    0 1 0 1 0 0 0 1 0   1
3 (M3)    0 0 1 1 0 1 1 1 0   1
4 (M4)    0 0 0 1 0 0 0 1 0   1
5 (M5)    0 0 0 0 1 0 1 0 1   1
6 (M6)    0 0 0 0 0 1 1 0 0   1
7 (M7)    0 0 0 0 0 0 1 0 0   1
8 (M8)    0 0 0 0 0 0 0 1 0   1
9 (M9)    0 0 0 0 0 0 0 0 1   1
10 (none) 0 0 0 0 0 0 0 0 0   0
Table 7.2
Testing Data Set for System Fault Detection and the Classification Results
in Examples 7.1 and 7.2
Instance i (Faulty Machine)   Attribute Variables (Quality of Parts): xi1 xi2 xi3 xi4 xi5 xi6 xi7 xi8 xi9   Target Variable (System Fault yi): True Value, Classified Value
1 (M1, M2) 1 1 0 1 1 0 1 1 1 1 1
2 (M2, M3) 0 1 1 1 0 1 1 1 0 1 1
3 (M1, M3) 1 0 1 1 1 1 1 1 1 1 1
4 (M1, M4) 1 0 0 1 1 0 1 1 1 1 1
5 (M1, M6) 1 0 0 0 1 1 1 0 1 1 1
6 (M2, M6) 0 1 0 1 0 1 1 1 0 1 1
7 (M2, M5) 0 1 0 1 1 0 1 1 0 1 1
8 (M3, M5) 0 0 1 1 1 1 1 1 1 1 1
9 (M4, M7) 0 0 0 1 0 0 1 1 0 1 1
10 (M5, M8) 0 0 0 0 1 0 1 1 0 1 1
11 (M3, M9) 0 0 1 1 0 1 1 1 1 1 1
12 (M1, M8) 1 0 0 0 1 0 1 1 1 1 1
13 (M1, M2, M3) 1 1 1 1 1 1 1 1 1 1 1
14 (M2, M3, M5) 0 1 1 1 1 1 1 1 1 1 1
15 (M2, M3, M9) 0 1 1 1 0 1 1 1 1 1 1
16 (M1, M6, M8) 1 0 0 0 1 1 1 1 1 1 1
The 3-nearest neighbors of x are x1, x2, and x5 in the training data set, which all take the target class of 1 for the system being faulty. Hence, the target class of 1 is assigned to the first data point in the testing data set. Since the training data set has only one data point with the target class of 0, the 3-nearest neighbors of each data point in the testing data set include at least two data points whose target class is 1, producing the target class of 1 for each data point in the testing data set. If we attempt to classify data point 10 in the training data set, whose true target class is 0, its 3-nearest neighbors are the data point itself and two other data points with the target class of 1. The majority vote therefore assigns the target class of 1 to data point 10, which differs from its true target class.
However, if we let k = 1 for this example, the 1-nearest neighbor classi-
fier assigns the correct target class to each data point in the training data
set since each data point in the training data set has itself as its 1-nearest
neighbor. The 1-nearest neighbor classifier also assigns the correct target
class of 1 to each data point in the testing data set since data point 10 in
the training data set is the only data point with the target class of 0 and
its attribute variables have the values of zero, making data point 10 not
be the 1-nearest neighbor to any data point in the testing data set.
The supervised clustering algorithm described next was developed and applied to cyber attack detection (Li and Ye, 2002, 2005, 2006; Ye, 2008; Ye and Li, 2002). The algorithm can also be applied to other classification problems.
For cyber attack detection, the training data contain large amounts of com-
puter and network data for learning data patterns of attacks and normal
use activities. In addition, more training data are added over time to update
data patterns of attacks and normal activities. Hence, a scalable, incremental
learning algorithm is required so that data patterns of attacks and normal
use activities are maintained and updated incrementally with the addition
of each new data observation rather than processing all data observations in
the training data set in one batch. The supervised clustering algorithm was
developed as a scalable, incremental learning algorithm to learn and update
data patterns for classification.
During the training, the supervised clustering algorithm takes data points
in the training data set one by one to group them into clusters of similar data
points based on their attribute values and target values. We start with the first data point in the training data set, let the first cluster contain this data point, and take the target class of the data point as the target class of the data cluster. Taking the second data point in the training data set, we want to
let this data point join the closest cluster that has the same target class as the
target class of this data point. In the supervised clustering algorithm, we use
the mean vector of all the data points in a data cluster as the centroid of the data
cluster that is used to represent the location of the data cluster and compute the
distance of a data point to this cluster. The clustering of data points is based on
not only values of attribute variables to measure the distance of a data point to a
data cluster but also target classes of the data point and the data cluster to make
the data point join a data cluster with the same target class. All data points in
the same cluster have the same target class, which is also the target class of the
cluster. Because the algorithm uses the target class to guide or supervise the
clustering of data points, the algorithm is called supervised clustering.
Suppose that the distance between the first and second data points in the training data set is large but the second data point has the same target class as the first cluster, which contains the first data point. The second data point still has to join this cluster because it is the only data cluster so far with the same target class. Hence, the clustering results
depend on the order in which data points are taken from the training data
set, causing the problem called the local bias of the input order. To address
this problem, the supervised clustering algorithm sets up an initial data clus-
ter for each target class. For each target class, the centroid of all data points
with the target class in the training data set is first computed using the mean
vector of the data points. Then an initial cluster for the target class is set up
to have the mean vector as the centroid of the cluster and the target class,
which is different from any target class of the data points in the training
data set. For example, if there are totally two target classes of T1 and T2 in the
training data, there are two initial clusters. One initial cluster has the mean
vector of the data points for T1 as the centroid. Another initial cluster has the
mean vector of the data points for T2 as the centroid. Both initial clusters are
assigned to a target class, e.g., T3, which is different from T1 and T2. Because
these initial data clusters do not contain any individual data points, they are
called the dummy clusters. All the dummy clusters have the target class that
is different from any target class in the training data set. The supervised
clustering algorithm requires a data point to form its own cluster if its closest
data cluster is a dummy cluster. With the dummy clusters, the first data point
from the training data set forms a new cluster since there are only dummy
clusters initially and the closest cluster to this data point is a dummy cluster.
If the second data point has the same target class of the first data point but is
located far away from the first data point, a dummy cluster is more likely the
closest cluster to the second data point than the data cluster containing the
first data point. This makes the second data point form its own cluster rather
than joining the cluster with the first data point, and thus addresses the local
bias problem due to the input order of training data points.
During the testing, the supervised clustering algorithm applies a k-nearest
neighbor classifier to the data clusters obtained from the training phase by
determining the k-nearest cluster neighbors of the data point to be classified and
letting these k-nearest data clusters vote for the target class of the data point.
Table 7.3 gives the steps of the supervised clustering algorithm. The following
notations are used in the description of the algorithm:
xi = (xi,1, …, xi,p, yi ): a data point in the training data set with a known
value of yi, for i = 1, …, n
x = (x1, …, xp, y): a testing data point with the value of y to be determined
Tj: the jth target class, j = 1, …, s
C: a data cluster
nC: the number of data points in the data cluster C
xC : the centroid of the data cluster C that is the mean vector of all data
points in C
In Step 4 of the training phase, after the data point xi joins the data cluster
C, the centroid of the data cluster C is updated incrementally to produce
xC (t + 1) (the updated centroid) using xi, xC (t ) (the current cluster centroid),
and nC(t) (the current number of data points in C):
\bar{x}_C(t + 1) = \left( \frac{n_C(t)\, \bar{x}_{C,1}(t) + x_{i,1}}{n_C(t) + 1}, …, \frac{n_C(t)\, \bar{x}_{C,p}(t) + x_{i,p}}{n_C(t) + 1} \right)′.  (7.14)
Table 7.3
Supervised Clustering Algorithm
Step Description
Training
1 Set up s dummy clusters for s target classes, respectively, determine the centroid
of each dummy cluster by computing the mean vector of all the data points in
the training data set with the target class Tj, and assign Ts+1 as the target class of
each dummy cluster where Ts+1 ≠ Tj, j = 1, …, s
2 FOR i = 1 to n
3 Compute the distance of xi to each data cluster C including each dummy
cluster, d(xi, x C ), using a measure of similarity
4 If the nearest cluster to the data point xi has the same target class as that of the
data point, let the data point join this cluster, and update the centroid of this
cluster and the number of data points in this cluster
5 If the nearest cluster to the data point xi has a different target class from that of
the data point, form a new cluster containing this data point, use the attribute
values of this data point as the centroid of this new cluster, let the number of
data points in the cluster be 1, and assign the target class of the data point as
the target class of the new cluster
Testing
1 Compute the distance of the data point x to each data cluster C excluding each
dummy cluster, d(x, xC )
2 Let the k-nearest neighbor clusters of the data point vote for the target class of the
data point
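The following Python sketch condenses the training and testing phases of Table 7.3. It is an illustrative implementation, not the authors' code: the function names are assumptions, the dummy-cluster target class T_{s+1} is represented by None, and the centroid update follows Equation 7.14.

import numpy as np
from collections import Counter

def train_supervised_clustering(X, y):
    X = np.asarray(X, dtype=float)
    y = list(y)
    # Step 1: one dummy cluster per target class, centered at the class mean,
    # labeled None so that no training data point can share its target class
    clusters = [{"y": None, "centroid": X[np.array(y) == t].mean(axis=0), "n": 0}
                for t in sorted(set(y))]
    # Steps 2-5: take the training data points one by one
    for xi, yi in zip(X, y):
        d = [np.linalg.norm(xi - c["centroid"]) for c in clusters]
        nearest = clusters[int(np.argmin(d))]
        if nearest["y"] == yi:
            # Step 4: join the nearest cluster; update its centroid incrementally (Equation 7.14)
            nearest["centroid"] = (nearest["n"] * nearest["centroid"] + xi) / (nearest["n"] + 1)
            nearest["n"] += 1
        else:
            # Step 5: nearest cluster is a dummy cluster or has a different target class
            clusters.append({"y": yi, "centroid": xi.copy(), "n": 1})
    return clusters

def classify_with_clusters(x, clusters, k=1):
    # Testing: let the k nearest nondummy clusters vote for the target class
    nondummy = [c for c in clusters if c["y"] is not None]
    nondummy.sort(key=lambda c: np.linalg.norm(np.asarray(x, dtype=float) - c["centroid"]))
    votes = Counter(c["y"] for c in nondummy[:k])
    return votes.most_common(1)[0][0]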
During the training, the dummy cluster for a certain target class can be
removed if many data clusters have been generated for that target class.
Since the centroid of the dummy cluster for a target class is the mean vector
of all the training data points with the target class, it is likely that the dummy
cluster for the target class is the closest cluster to a data point. Removing the dummy cluster for that target class eliminates this possibility and prevents a new cluster from being created for a data point merely because the dummy cluster happens to be its closest cluster.
Example 7.2
Use the supervised clustering algorithm with the Euclidean distance
measure of dissimilarity and the 1-nearest neighbor classifier to clas-
sify whether or not a manufacturing system is faulty using the training
data set in Table 7.1 and the testing data set in Table 7.2. Both tables are
explained in Example 7.1.
In Step 1 of training, two dummy clusters C1 and C2 are set up for two
target classes, y = 1 and y = 0, respectively:
yC1 = 2 (indicating that C1 is a dummy cluster whose target class is
different from two target classes in the training and testing data sets)
yC2 = 2 (indicating that C2 is a dummy cluster)
\bar{x}_{C1} = ((1+0+0+0+0+0+0+0+0)/9, (0+1+0+0+0+0+0+0+0)/9, (0+0+1+0+0+0+0+0+0)/9, (0+1+1+1+0+0+0+0+0)/9, (1+0+0+0+1+0+0+0+0)/9, (0+0+1+0+0+1+0+0+0)/9, (1+0+1+0+1+1+1+0+0)/9, (0+1+1+1+0+0+0+1+0)/9, (1+0+0+0+1+0+0+0+1)/9)′
        = (0.11, 0.11, 0.11, 0.33, 0.22, 0.22, 0.56, 0.44, 0.33)′

\bar{x}_{C2} = (0, 0, 0, 0, 0, 0, 0, 0, 0)′

n_{C1} = 9,  n_{C2} = 1.
In Step 2 of training, the first data point x_1 in the training data set is considered:

x_1 = (1, 0, 0, 0, 1, 0, 1, 0, 1)′,  y = 1.

Since C1 is the closest cluster to x_1 and has a different target class from that of x_1, Step 5 of training is executed to form a new data cluster C3 containing x_1:

y_{C3} = 1,  \bar{x}_{C3} = (1, 0, 0, 0, 1, 0, 1, 0, 1)′,  n_{C3} = 1.
Going back to Step 2 of training, the second data point x_2 in the training data set is considered:

x_2 = (0, 1, 0, 1, 0, 0, 0, 1, 0)′,  y = 1.

Since C1 is the closest cluster to x_2 and has a different target class from that of x_2, Step 5 of training is executed to form a new data cluster C4 containing x_2:

y_{C4} = 1,  \bar{x}_{C4} = (0, 1, 0, 1, 0, 0, 0, 1, 0)′,  n_{C4} = 1.
Going back to Step 2 of training, the third data point x_3 in the training data set is considered:

x_3 = (0, 0, 1, 1, 0, 1, 1, 1, 0)′,  y = 1.

Since C1 is the closest cluster to x_3 and has a different target class from that of x_3, Step 5 of training is executed to form a new data cluster C5 containing x_3:

y_{C5} = 1,  \bar{x}_{C5} = (0, 0, 1, 1, 0, 1, 1, 1, 0)′,  n_{C5} = 1.
Going back to Step 2 of training again, the fourth data point x_4 in the training data set is considered:

x_4 = (0, 0, 0, 1, 0, 0, 0, 1, 0)′,  y = 1.

Since C4 is the closest cluster to x_4 and has the same target class as that of x_4, Step 4 of training is executed to add x_4 into the cluster C4, which is updated next:

y_{C4} = 1,  \bar{x}_{C4} = ((0+0)/2, (1+0)/2, (0+0)/2, (1+1)/2, (0+0)/2, (0+0)/2, (0+0)/2, (1+1)/2, (0+0)/2)′ = (0, 0.5, 0, 1, 0, 0, 0, 1, 0)′,  n_{C4} = 2.
The remaining training data points are processed in the same way. At the end of the training, we have the following clusters:

y_{C1} = 2,  \bar{x}_{C1} = (0.11, 0.11, 0.11, 0.33, 0.22, 0.22, 0.56, 0.44, 0.33)′,  n_{C1} = 9
y_{C2} = 2,  \bar{x}_{C2} = (0, 0, 0, 0, 0, 0, 0, 0, 0)′,  n_{C2} = 1
y_{C3} = 1,  \bar{x}_{C3} = (1, 0, 0, 0, 1, 0, 1, 0, 1)′,  n_{C3} = 1
y_{C4} = 1,  \bar{x}_{C4} = (0, 0.5, 0, 1, 0, 0, 0, 1, 0)′,  n_{C4} = 2
y_{C5} = 1,  \bar{x}_{C5} = (0, 0, 1, 1, 0, 1, 1, 1, 0)′,  n_{C5} = 1
y_{C6} = 1,  \bar{x}_{C6} = (0, 0, 0, 0, 0, 1, 1, 0, 0)′,  n_{C6} = 1
y_{C7} = 1,  \bar{x}_{C7} = (0, 0, 0, 0, 0, 0, 1, 0, 0)′,  n_{C7} = 1
y_{C8} = 1,  \bar{x}_{C8} = (0, 0, 0, 0, 0, 0, 0, 1, 0)′,  n_{C8} = 1
y_{C9} = 1,  \bar{x}_{C9} = (0, 0, 0, 0, 0, 0, 0, 0, 1)′,  n_{C9} = 1
y_{C10} = 0,  \bar{x}_{C10} = (0, 0, 0, 0, 0, 0, 0, 0, 0)′,  n_{C10} = 1.
In the testing, the first data point in the testing data set, x = (1, 1, 0, 1, 1, 0, 1, 1, 1)′, has the Euclidean distances of 1.73, 2.06, 2.45, 2.65, 2.45, 2.45, 2.45, and 2.65 to the nondummy clusters C3, C4, C5, C6, C7, C8, C9, and C10, respectively.
Hence, the cluster C3 is the nearest neighbor to x, and the target class of x is
assigned to be 1. The closest clusters to the remaining data points 2–16 in
the testing data set are C5, C3, C3, C3, C5, C4, C3/C5, C4, C3/C6/C10, C5, C3, C5, C5,
C5, and C3. For data point 8, there is a tie between C3 and C5 for the closest
cluster. Since both C3 and C5 have the target class of 1, the target class of 1
is assigned to data point 8. For data point 10, there is also a tie among C3, C6, and C10 for the closest cluster. Since the majority (two clusters, C3 and C6) of the three tied clusters have the target class of 1, the target class of 1 is assigned to data point 10. Hence, all the data points in the testing data set are assigned to the target class of 1 and are correctly classified, as shown in Table 7.2.
Exercises
7.1 In the space shuttle O-ring data set in Table 1.2, the target variable, the
Number of O-rings with Stress, has three values: 0, 1, and 2. Consider
these three values as categorical values, Launch-Temperature and
Leak-Check Pressure as the attribute variables, instances # 13–23 as the
training data, instances # 1–12 as the testing data, and the Euclidean
distance as the measure of dissimilarity. Construct a 1-nearest neigh-
bor classifier and a 3-nearest neighbor classifier, and test and compare
their classification performance.
7.2 Repeat Exercise 7.1 using the normalized attribute variables from the
normalization method in Equation 7.3.
7.3 Repeat Exercise 7.1 using the normalized attribute variables from the
normalization method in Equation 7.4.
7.4 Using the same training and testing data sets in Exercise 7.1 and the cosine
similarity measure, construct a 1-nearest neighbor classifier and a 3-nearest
neighbor classifier, and test and compare their classification performance.
7.5 Using the same training and testing data sets in Exercise 7.1, the supervised
clustering algorithm, and the Euclidean distance measure of dissimilarity,
construct a 1-nearest neighbor cluster classifier and a 3-nearest neighbor
cluster classifier, and test and compare their classification performance.
7.6 Repeat Exercise 7.5 using the normalized attribute variables from the
normalization method in Equation 7.3.
7.7 Repeat Exercise 7.5 using the normalized attribute variables from
the normalization method in Equation 7.4.
7.8 Using the same training and testing data sets in Exercise 7.1, the super-
vised clustering algorithm, and the cosine similarity measure, construct
a 1-nearest neighbor cluster classifier and a 3-nearest neighbor cluster
classifier, and test and compare their classification performance.
Part III
Algorithms for
Mining Cluster and
Association Patterns
8
Hierarchical Clustering
The next section gives several methods of determining the two closest
clusters in Step 2.
In the average linkage method, the distance between two clusters C_K and C_L is the average of the distances between all pairs of data records, where each pair has one data record from cluster C_K and another data record from cluster C_L, as follows:

D_{K,L} = \sum_{x_K \in C_K} \sum_{x_L \in C_L} \frac{d(x_K, x_L)}{n_K n_L}  (8.1)

x_K = (x_{K,1}, …, x_{K,p})′,  x_L = (x_{L,1}, …, x_{L,p})′,
where
xK denotes a data record in CK
x L denotes a data record in CL
nK denotes the number of data records in CK
nL denotes the number of data records in CL
d(xK, x L) is the distance of two data records that can be computed using the
following Euclidean distance:
d(x_K, x_L) = \sqrt{\sum_{i=1}^{p} (x_{K,i} - x_{L,i})^2}  (8.2)
or some other dissimilarity measures of two data points that are described in
Chapter 7. As described in Chapter 7, the normalization of the variables,
x1, …, xp, may be necessary before using a measure of dissimilarity or simi-
larity to compute the distance of two data records.
Example 8.1
Compute the distance of the following two clusters using the average
linkage method and the squared Euclidean distance of data points:
CK = { x1 , x2 , x3 }
CL = { x 4 , x 5 }
x_1 = (1, 0, 0, 0, 1, 0, 1, 0, 1)′
x_2 = (0, 0, 0, 0, 1, 0, 1, 0, 1)′
x_3 = (0, 0, 0, 0, 0, 0, 0, 0, 1)′
x_4 = (0, 0, 0, 0, 0, 1, 1, 0, 0)′
x_5 = (0, 0, 0, 0, 0, 0, 1, 0, 0)′.
There are six pairs of data records between C_K and C_L: (x_1, x_4), (x_1, x_5), (x_2, x_4), (x_2, x_5), (x_3, x_4), (x_3, x_5), and their squared Euclidean distances are computed as

d(x_1, x_4) = \sum_{i=1}^{9} (x_{1,i} - x_{4,i})^2 = (1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)² + (1−1)² + (0−0)² + (1−0)² = 4

d(x_1, x_5) = \sum_{i=1}^{9} (x_{1,i} - x_{5,i})^2 = (1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (1−1)² + (0−0)² + (1−0)² = 3

d(x_2, x_4) = \sum_{i=1}^{9} (x_{2,i} - x_{4,i})^2 = (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)² + (1−1)² + (0−0)² + (1−0)² = 3

d(x_2, x_5) = \sum_{i=1}^{9} (x_{2,i} - x_{5,i})^2 = (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (1−1)² + (0−0)² + (1−0)² = 2

d(x_3, x_4) = \sum_{i=1}^{9} (x_{3,i} - x_{4,i})^2 = (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−1)² + (0−0)² + (1−0)² = 3

d(x_3, x_5) = \sum_{i=1}^{9} (x_{3,i} - x_{5,i})^2 = (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (1−0)² = 2

D_{K,L} = \sum_{x_K \in C_K} \sum_{x_L \in C_L} \frac{d(x_K, x_L)}{n_K n_L} = \frac{4}{3×2} + \frac{3}{3×2} + \frac{3}{3×2} + \frac{2}{3×2} + \frac{3}{3×2} + \frac{2}{3×2} = 2.8333
In the single linkage method, the distance between two clusters is the minimum distance between a data record in one cluster and a data record in the other cluster:

D_{K,L} = \min\{d(x_K, x_L),\; x_K \in C_K,\; x_L \in C_L\}.  (8.3)

Using the single linkage method, the distance of clusters C_K and C_L in Example 8.1 is computed as

D_{K,L} = min{d(x_1, x_4), d(x_1, x_5), d(x_2, x_4), d(x_2, x_5), d(x_3, x_4), d(x_3, x_5)} = min{4, 3, 3, 2, 3, 2} = 2.
In the complete linkage method, the distance between two clusters is the
maximum distance between a data record in one cluster and a data record
in the other cluster:
D_{K,L} = \max\{d(x_K, x_L),\; x_K \in C_K,\; x_L \in C_L\}.  (8.4)

Using the complete linkage method, the distance of clusters C_K and C_L in Example 8.1 is computed as

D_{K,L} = max{d(x_1, x_4), d(x_1, x_5), d(x_2, x_4), d(x_2, x_5), d(x_3, x_4), d(x_3, x_5)} = max{4, 3, 3, 2, 3, 2} = 4.
In the centroid method, the distance between two clusters is the distance
between the centroids of clusters, and the centroid of a cluster is computed
using the mean vector of all data records in the cluster, as follows:
D_{K,L} = d(\bar{x}_K, \bar{x}_L)  (8.5)

\bar{x}_K = \left( \frac{\sum_{k=1}^{n_K} x_{k,1}}{n_K}, …, \frac{\sum_{k=1}^{n_K} x_{k,p}}{n_K} \right)′,  \bar{x}_L = \left( \frac{\sum_{l=1}^{n_L} x_{l,1}}{n_L}, …, \frac{\sum_{l=1}^{n_L} x_{l,p}}{n_L} \right)′.  (8.6)
Using the centroid linkage method and the squared Euclidean distance of data points, the distance of clusters C_K and C_L in Example 8.1 is computed as follows:

\bar{x}_K = ((1+0+0)/3, (0+0+0)/3, (0+0+0)/3, (0+0+0)/3, (1+1+0)/3, (0+0+0)/3, (1+1+0)/3, (0+0+0)/3, (1+1+1)/3)′ = (1/3, 0, 0, 0, 2/3, 0, 2/3, 0, 1)′

\bar{x}_L = ((0+0)/2, (0+0)/2, (0+0)/2, (0+0)/2, (0+0)/2, (1+0)/2, (1+1)/2, (0+0)/2, (0+0)/2)′ = (0, 0, 0, 0, 0, 1/2, 1, 0, 0)′

D_{K,L} = d(\bar{x}_K, \bar{x}_L) = (1/3 − 0)² + (0 − 0)² + (0 − 0)² + (0 − 0)² + (2/3 − 0)² + (0 − 1/2)² + (2/3 − 1)² + (0 − 0)² + (1 − 0)² = 1.9167.
Various methods of determining the distance between two clusters have differ-
ent computational costs and may produce different clustering results. For exam-
ple, the average linkage method, the single linkage method, and the complete
linkage method require the computation of the distance between every pair of
data points from two clusters. Although the centroid method does not have such
a computation requirement, the centroid method must compute the centroid of
every new cluster and the distance of the new cluster with existing clusters. The
average linkage method and the centroid method take into account and control
the dispersion of data points in each cluster, whereas the single linkage method
and the complete linkage method place no constraint on the shape of the cluster.
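The four ways of measuring the distance between two clusters can be written as one small Python function. This is a sketch under the conventions of Example 8.1 (squared Euclidean distance between data records); the function name and the NumPy-based implementation are assumptions made here for illustration.

import numpy as np

def cluster_distance(CK, CL, method="single"):
    CK, CL = np.asarray(CK, dtype=float), np.asarray(CL, dtype=float)
    if method == "centroid":
        diff = CK.mean(axis=0) - CL.mean(axis=0)
        return float(diff @ diff)            # squared Euclidean distance of the two centroids
    # squared Euclidean distance for every pair of data records across the two clusters
    pair = ((CK[:, None, :] - CL[None, :, :]) ** 2).sum(axis=2)
    if method == "single":
        return float(pair.min())             # Equation 8.3
    if method == "complete":
        return float(pair.max())             # Equation 8.4
    if method == "average":
        return float(pair.mean())            # Equation 8.1
    raise ValueError("unknown linkage method")

# clusters CK = {x1, x2, x3} and CL = {x4, x5} of Example 8.1
CK = [[1,0,0,0,1,0,1,0,1], [0,0,0,0,1,0,1,0,1], [0,0,0,0,0,0,0,0,1]]
CL = [[0,0,0,0,0,1,1,0,0], [0,0,0,0,0,0,1,0,0]]
print(cluster_distance(CK, CL, "average"))   # about 2.83
print(cluster_distance(CK, CL, "single"))    # 2
print(cluster_distance(CK, CL, "complete"))  # 4
print(cluster_distance(CK, CL, "centroid"))  # about 1.92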
Example 8.2
Produce a hierarchical clustering of the data for system fault detection in
Table 8.1 using the single linkage method.
Table 8.1
Data Set for System Fault Detection with Nine Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
4 (M4) 0 0 0 1 0 0 0 1 0
5 (M5) 0 0 0 0 1 0 1 0 1
6 (M6) 0 0 0 0 0 1 1 0 0
7 (M7) 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1
Table 8.1 contains the data set for system fault detection, including
nine instances of single-machine faults. Only the nine attribute variables
about the quality of parts are used in the hierarchical clustering. The
nine data records in the data set are
x_1 = (1, 0, 0, 0, 1, 0, 1, 0, 1)′,  x_2 = (0, 1, 0, 1, 0, 0, 0, 1, 0)′,  x_3 = (0, 0, 1, 1, 0, 1, 1, 1, 0)′,
x_4 = (0, 0, 0, 1, 0, 0, 0, 1, 0)′,  x_5 = (0, 0, 0, 0, 1, 0, 1, 0, 1)′,  x_6 = (0, 0, 0, 0, 0, 1, 1, 0, 0)′,
x_7 = (0, 0, 0, 0, 0, 0, 1, 0, 0)′,  x_8 = (0, 0, 0, 0, 0, 0, 0, 1, 0)′,  x_9 = (0, 0, 0, 0, 0, 0, 0, 0, 1)′.
The clustering results will show which single-machine faults have simi-
lar symptoms of the part quality problem.
Figure 8.1 shows the hierarchical clustering procedure that starts with
the following nine clusters with one data record in each cluster:
C1 = { x1 } C2 = { x2 } C3 = { x3 } C4 = { x 4 } C5 = { x5 }
C6 = { x6 } C7 = { x7 } C8 = { x8 } C9 = { x9 } .
Figure 8.1
Result of hierarchical clustering for the data set of system fault detection (a dendrogram with the merging distance on the vertical axis, the leaf order C1, C5, C6, C7, C9, C2, C4, C8, C3, and a dashed line marking the merging distance threshold of 1.5).
Table 8.2
The Distance for Each Pair of Clusters: C1, C2, C3, C4, C5, C6, C7, C8, and C9
C1 = C2 = C3 = C4 = C5 = C6 = C7 = C8 = C9 =
{x1} {x2} {x3} {x4} {x5} {x6} {x7} {x8} {x9}
C1 = {x1} 7 7 6 1 4 3 5 3
C2 = {x2} 4 1 6 5 4 2 4
C3 = {x3} 3 6 3 4 4 6
C4 = {x4} 5 4 4 1 3
C5 = {x5} 3 2 4 2
C6 = {x6} 1 3 3
C7 = {x7} 2 2
C8 = {x8} 2
C9 = {x9}
Since each cluster has only one data record, the distance between two
clusters is the distance between two data records in two clusters, respec-
tively. Table 8.2 gives the distance for each pair of data records, which
also gives the distance for each pair of clusters.
There are four pairs of clusters that produce the smallest distance
of 1: (C1, C 5), (C2, C 4), (C 4, C 8), and (C 6, C7). We merge (C1, C 5) to form a
new cluster C1,5, and merge (C6, C7) to form a new cluster C6,7 . Since the
cluster C 4 is involved in two pairs of clusters (C2, C 4) and (C 4, C 8), we
can merge only one pair of clusters. We arbitrarily choose to merge
(C2, C 4) to form a new cluster C2,4. Figure 8.1 shows these new clusters,
in a new set of clusters, C1,5, C2,4, C 3, C6,7, C 8, and C9.
Table 8.3 gives the distance for each pair of the clusters, C1,5, C2,4, C3,
C6,7, C8, and C9, using the single linkage method. For example, there are
four pairs of data records between C1,5 and C2,4: (x 1, x 2), (x 1, x 4), (x 5, x 2), and
(x5, x4 ), with their distance being 7, 6, 6, and 5, respectively, from Table 8.2.
Hence, the minimum distance is 5, which is taken as the distance of
Table 8.3
Distance for Each Pair of Clusters: C1,5, C2,4, C3, C6,7, C8, and C9
C1,5 = C2,4 = C6,7 =
{x1, x5} {x2, x4} C3 = {x3} {x6, x7} C8 = {x8} C9 = {x9}
C1,5 = {x1, x5} 5 = min 6 = min 2 = min 4 = min 2 = min
{7, 6, 6, 5} {7, 6} {4, 3, 3, 2} {5, 4} {3, 2}
C2,4 = {x2, x4} 3 = min 4 = min 1 = min 3 = min
{4, 3} {5, 4, 4, 4} {2, 1} {4, 3}
C3 = {x3} 3 = min 4 = min 6 = min
{3, 4} {4} {6}
C6,7 = {x6, x7} 2 = min 2 = min
{3, 2} {3, 2}
C8 = {x8} 2 = min
{2}
C9 = {x9}
Table 8.4
Distance for Each Pair of Clusters: C1,5, C2,4,8, C3, C6,7, and C9
C1,5 = {x1, x5} C2,4,8 = {x2, x4, x8} C3 = {x3} C6,7 = {x6, x7} C9 = {x9}
C1,5 = {x1, x5} 4 = min 6 = min 2 = min 2 = min
{7, 6, 5, 6, 5, 4} {7, 6} {4, 3, 3, 2} {3, 2}
C2,4,8 = {x2, x4, x8} 3 = min 2 = min 3 = min
{4, 3, 4} {5, 4, 4, 4, 3, 2} {4, 3, 2}
C3 = {x3} 3 = min 6 = min
{3, 4} {6}
C6,7 = {x6, x7} 2 = min
{3, 2}
C9 = {x9}
C1,5 and C2,4. The closest pair of clusters is (C2,4, C8) with the distance of 1.
Merging clusters C2,4 and C8 produces a new cluster C2,4,8. We have a new
set of clusters, C1,5, C2,4,8, C3, C6,7, and C9.
Table 8.4 gives the distance for each pair of the clusters, C1,5, C2,4,8, C3,
C6,7, and C9, using the single linkage method. Four pairs of clusters, (C1,5,
C6,7), (C1,5, C9), (C2,4,8, C6,7), and (C6,7, C9), produce the smallest distance
of 2. Since three clusters, C1,5, C6,7, and C9, have the same distance from
one another, we merge the three clusters together to form a new cluster,
C1,5,6,7,9. C6,7 is not merged with C2,4,8 since C6,7 is merged with C1,5 and C9.
We have a new set of clusters, C1,5,6,7,9, C2,4,8, and C3.
Table 8.5 gives the distance for each pair of the clusters, C1,5,6,7,9, C2,4,8, and
C3, using the single linkage method. The pair of clusters, (C1,5,6,7,9, C2,4,8),
produces the smallest distance of 2. Merging the clusters, C1,5,6,7,9 and C2,4,8,
forms a new cluster, C1,2,4,5,6,7,8,9. We have a new set of clusters, C1,2,4,5,6,7,8,9 and C3, which have the distance of 3 and are merged into one cluster, C1,2,3,4,5,6,7,8,9.
Figure 8.1 also shows the merging distance, which is the distance of
two clusters when they are merged together. The hierarchical clustering
tree shown in Figure 8.1 is called the dendrogram.
Hierarchical clustering allows us to obtain different sets of clusters by setting different thresholds of the merging distance for different levels of data similarity. For example, if we set the threshold of the merging distance to 1.5, as shown by the dashed line in Figure 8.1, we obtain the clusters C1,5, C6,7, C9, C2,4,8, and C3, which are considered as the clusters of similar data because each of these clusters is formed at a merging distance below the threshold of 1.5.
Table 8.5
Distance for Each Pair of Clusters: C1,5,6,7,9, C2,4,8, and C3
C1,5,6,7,9 = {x1, x5,
x6, x7, x9} C2,4,8 = {x2, x4, x8} C3 = {x3}
C1,5,6,7,9 = {x1, x5, 2 = min{7, 6, 5, 6, 5, 4, 3 = min{7, 6, 3, 4, 6}
x6, x7, x9} 5, 4, 3, 4, 4, 2, 4, 3, 2}
C2,4,8 = {x2, x4, x8} 3 = min{4, 3, 4}
C3 = {x3}
Figure 8.2
An example of three data points for which the centroid linkage method produces a nonmonotonic tree of hierarchical clustering (the clusters C1 = {x1} and C2 = {x2} are merged into C1,2, which is then merged with C3 = {x3}).

Figure 8.3
Nonmonotonic tree of hierarchical clustering for the data points in Figure 8.2 (merging distance on the vertical axis).

When the cluster C1,2, formed by merging C1 and C2 at the merging distance of 2, is merged with C3 next to produce a new cluster C1,2,3, the merging distance of 1.73 for C1,2,3 is smaller than the merging distance of 2 for C1,2. Figure 8.3 shows the nonmonotonic tree of hierarchical clustering for these three data points using the centroid method.
The single linkage method, which is used in Example 8.2, computes the
distance between two clusters using the smallest distance between two data
points, one data point in one cluster, and another data point in the other
cluster. The smallest distance between two data points is used to form a new
cluster. The distance used to form a cluster earlier cannot be used again to
form a new cluster later, because the distance is already inside a cluster and
a distance with a data point outside a cluster is needed to form a new clus-
ter later. Hence, the distance to form a new cluster later must come from a
distance not used before, which must be greater than or equal to a distance
selected and used earlier. Hence, the hierarchical clustering tree from the
single linkage method is always monotonic.
Hierarchical clustering is supported by several software packages, including:
• SAS (www.sas.com)
• SPSS (www.spss.com)
• Statistica (www.statistica.com)
• MATLAB® (www.mathworks.com)
Exercises
8.1 Produce a hierarchical clustering of 23 data points in the space shut-
tle O-ring data set in Table 1.2. Use Launch-Temperature and Leak-
Check Pressure as the attribute variables, the normalization method
in Equation 7.4 to obtain the normalized Launch-Temperature and
Leak-Check Pressure, the Euclidean distance of data points, and the
single linkage method.
8.2 Repeat Exercise 8.1 using the complete linkage method.
8.3 Repeat Exercise 8.1 using the cosine similarity measure to compute the
distance of data points.
8.4 Repeat Exercise 8.3 using the complete linkage method.
8.5 Discuss whether or not it is possible for the complete linkage method to
produce a nonmonotonic tree of hierarchical clustering.
8.6 Discuss whether or not it is possible for the average linkage method to
produce a nonmonotonic tree of hierarchical clustering.
9
K-Means Clustering and
Density-Based Clustering
Table 9.1
K-Means Clustering Algorithm
Step Description
1 Set up the initial centroids of the K clusters
2 REPEAT
3 FOR i = 1 to n
4 Compute the distance of the data point xi to each of the K clusters using
a measure of similarity or dissimilarity
5 IF xi is not in any cluster or its closest cluster is not its current cluster
6 Move xi to its closest cluster and update the centroid of the cluster
7 UNTIL no change of cluster centroids occurs in Steps 3–6
For a large data set, the stopping criterion for the REPEAT-UNTIL loop in
Step 7 of the algorithm can be relaxed so that the REPEAT-UNTIL loop stops
when the amount of changes to the cluster centroids is less than a threshold,
e.g., less than 5% of the data points changing their clusters.
The K-means clustering algorithm minimizes the following sum of squared
errors (SSE) or distances between data points and their cluster centroids (Ye,
2003, Chapter 10):
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} d(x, \bar{x}_{C_i})^2.  (9.1)
In Equation 9.1, the mean vector of data points in the cluster Ci is used as the
cluster centroid to compute the distance between a data point in the cluster
Ci and the centroid of the cluster Ci.
Since K-means clustering depends on the parameter K, knowledge in
the application domain may help the selection of an appropriate K value
to produce a K-means clustering result that is meaningful in the appli-
cation domain. Different K-means clustering results using different K
values may be obtained so that different results can be compared and
evaluated.
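The following Python sketch illustrates the K-means procedure. It is a batch variant (all data points are reassigned before the centroids are recomputed) rather than the one-point-at-a-time update in Table 9.1, and the function name and parameters are assumptions made here for illustration.

import numpy as np

def k_means(X, initial_centroids, max_iter=100):
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Euclidean distance of every data point to every cluster centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean vector of its cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):   # stop when no centroid changes
            break
        centroids = new_centroids
    return labels, centroids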
Example 9.1
Produce the 5-means clusters for the data set of system fault detection
in Table 9.2 using the Euclidean distance as the measure of dissimilarity.
This is the same data set for Example 8.1. The data set includes nine
instances of single-machine faults, and the data point for each instance
has the nine attribute variables about the quality of parts.
In Step 1 of the K-means clustering algorithm, we arbitrarily select
data points 1, 3, 5, 7, and 9 to set up the initial centroids of the five clus-
ters, C1, C2, C3, C4, and C5, respectively:
Table 9.2
Data Set for System Fault Detection with Nine Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
4 (M4) 0 0 0 1 0 0 0 1 0
5 (M5) 0 0 0 0 1 0 1 0 1
6 (M6) 0 0 0 0 0 1 1 0 0
7 (M7) 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1
\bar{x}_{C1} = x_1 = (1, 0, 0, 0, 1, 0, 1, 0, 1)′,  \bar{x}_{C2} = x_3 = (0, 0, 1, 1, 0, 1, 1, 1, 0)′,  \bar{x}_{C3} = x_5 = (0, 0, 0, 0, 1, 0, 1, 0, 1)′,  \bar{x}_{C4} = x_7 = (0, 0, 0, 0, 0, 0, 1, 0, 0)′,  \bar{x}_{C5} = x_9 = (0, 0, 0, 0, 0, 0, 0, 0, 1)′.
The five clusters have no data point in each of them initially. Hence, we
have C1 = {}, C2 = {}, C3 = {}, C4 = {}, and C5 = {}.
In Steps 2 and 3 of the algorithm, we take the first data point x 1 in the
data set. In Step 4 of the algorithm, we compute the Euclidean distance
of the data point x 1 to each of the five clusters:
d(x_1, \bar{x}_{C1}) = √[(1−1)² + (0−0)² + (0−0)² + (0−0)² + (1−1)² + (0−0)² + (1−1)² + (0−0)² + (1−1)²] = 0
d(x_1, \bar{x}_{C2}) = √[(1−0)² + (0−0)² + (0−1)² + (0−1)² + (1−0)² + (0−1)² + (1−1)² + (0−1)² + (1−0)²] = 2.65
d(x_1, \bar{x}_{C3}) = √[(1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−1)² + (0−0)² + (1−1)² + (0−0)² + (1−1)²] = 1
d(x_1, \bar{x}_{C4}) = √[(1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (1−1)² + (0−0)² + (1−0)²] = 1.73
d(x_1, \bar{x}_{C5}) = √[(1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−0)² + (1−1)²] = 1.73

In Step 5, x_1 is not in any cluster, and C1 is its closest cluster. Step 6 of the algorithm is executed to move x_1 into C1, whose centroid remains the same. We have C1 = {x_1}, C2 = {}, C3 = {}, C4 = {}, and C5 = {}.

Going back to Step 3, we take the second data point x_2 in the data set. In Step 4, we compute the Euclidean distance of the data point x_2 to each of the five clusters:

d(x_2, \bar{x}_{C1}) = √[(0−1)² + (1−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²] = 2.65
d(x_2, \bar{x}_{C2}) = √[(0−0)² + (1−0)² + (0−1)² + (1−1)² + (0−0)² + (0−1)² + (0−1)² + (1−1)² + (0−0)²] = 2
d(x_2, \bar{x}_{C3}) = √[(0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²] = 2.45
d(x_2, \bar{x}_{C4}) = √[(0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (0−0)²] = 2
d(x_2, \bar{x}_{C5}) = √[(0−0)² + (1−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)²] = 2

In Step 5, x_2 is not in any cluster; the clusters C2, C4, and C5 are all closest to x_2, each with the distance of 2. Suppose that x_2 is moved into C2 in Step 6. The centroid of C2 is updated to the mean of the data points in C2:

\bar{x}_{C2} = (0, 1, 0, 1, 0, 0, 0, 1, 0)′.

We have C1 = {x_1}, C2 = {x_2}, C3 = {}, C4 = {}, and C5 = {}.
Going back to Step 3, we take the third data point x_3 in the data set. In Step 4, we compute the Euclidean distance of the data point x_3 to each of the five clusters:

d(x_3, \bar{x}_{C1}) = √[(0−1)² + (0−0)² + (1−0)² + (1−0)² + (0−1)² + (1−0)² + (1−1)² + (1−0)² + (0−1)²] = 2.65
d(x_3, \bar{x}_{C2}) = √[(0−0)² + (0−1)² + (1−0)² + (1−1)² + (0−0)² + (1−0)² + (1−0)² + (1−1)² + (0−0)²] = 2
d(x_3, \bar{x}_{C3}) = √[(0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−1)² + (1−0)² + (1−1)² + (1−0)² + (0−1)²] = 2.45
d(x_3, \bar{x}_{C4}) = √[(0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−0)² + (1−0)² + (1−1)² + (1−0)² + (0−0)²] = 2
d(x_3, \bar{x}_{C5}) = √[(0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−0)² + (1−0)² + (1−0)² + (1−0)² + (0−1)²] = 2.45

In Step 5, x_3 is not in any cluster; the clusters C2 and C4 are both closest to x_3, each with the distance of 2. Suppose that x_3 is moved into C2 in Step 6. The centroid of C2 is updated to the mean of x_2 and x_3:

\bar{x}_{C2} = ((0+0)/2, (1+0)/2, (0+1)/2, (1+1)/2, (0+0)/2, (0+1)/2, (0+1)/2, (1+1)/2, (0+0)/2)′ = (0, 0.5, 0.5, 1, 0, 0.5, 0.5, 1, 0)′.

We have C1 = {x_1}, C2 = {x_2, x_3}, C3 = {}, C4 = {}, and C5 = {}.
Going back to Step 3, we take the fourth data point x 4 in the data set.
In Step 4, we compute the Euclidean distance of the data point x 4 to each
of the five clusters:
d(x_4, \bar{x}_{C1}) = √[(0−1)² + (0−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²] = 2.45
d(x_4, \bar{x}_{C2}) = √[(0−0)² + (0−0.5)² + (0−0.5)² + (1−1)² + (0−0)² + (0−0.5)² + (0−0.5)² + (1−1)² + (0−0)²] = 1
d(x_4, \bar{x}_{C3}) = √[(0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²] = 2.24
d(x_4, \bar{x}_{C4}) = √[(0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (0−0)²] = 1.73
d(x_4, \bar{x}_{C5}) = √[(0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)²] = 1.73

Since C2 is the closest cluster to x_4 and x_4 is not in any cluster, Step 6 is executed to move x_4 into C2 and update the centroid of C2 to the mean of x_2, x_3, and x_4:

\bar{x}_{C2} = ((0+0+0)/3, (1+0+0)/3, (0+1+0)/3, (1+1+1)/3, (0+0+0)/3, (0+1+0)/3, (0+1+0)/3, (1+1+1)/3, (0+0+0)/3)′ = (0, 0.33, 0.33, 1, 0, 0.33, 0.33, 1, 0)′.
We have C1 = {x 1}, C2 = {x 2, x 3, x 4}, C3 = {}, C4 = {}, and C5 = {}.
Going back to Step 3, we take the fifth data point x5 in the data set. In Step
4, we know that x5 is closest to C3 since C3 is initially set up using x5 and is
not updated since then. In Step 5, x5 is not in any cluster. Step 6 of the algo-
rithm is executed to move x5 to its closest cluster C3 whose centroid remains
the same. We have C1 = {x1}, C2 = {x2, x3, x4}, C3 = {x5}, C4 = {}, and C5 = {}.
Going back to Step 3, we take the sixth data point x 6 in the data set. In
Step 4, we compute the Euclidean distance of the data point x 6 to each of
the five clusters:
d(x_6, \bar{x}_{C1}) = √[(0−1)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (1−1)² + (0−0)² + (0−1)²] = 2
d(x_6, \bar{x}_{C2}) = √[(0−0)² + (0−0.33)² + (0−0.33)² + (0−1)² + (0−0)² + (1−0.33)² + (1−0.33)² + (0−1)² + (0−0)²] = 1.77
d(x_6, \bar{x}_{C3}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (1−0)² + (1−1)² + (0−0)² + (0−1)²] = 1.73
d(x_6, \bar{x}_{C4}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (1−1)² + (0−0)² + (0−0)²] = 1
d(x_6, \bar{x}_{C5}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (1−0)² + (0−0)² + (0−1)²] = 1.73

Since C4 is the closest cluster to x_6, Step 6 is executed to move x_6 into C4, and the centroid of C4 becomes

\bar{x}_{C4} = (0, 0, 0, 0, 0, 1, 1, 0, 0)′.

We have C1 = {x_1}, C2 = {x_2, x_3, x_4}, C3 = {x_5}, C4 = {x_6}, and C5 = {}.

Going back to Step 3, we take the seventh data point x_7 in the data set. In Step 4, we compute the Euclidean distance of the data point x_7 to each of the five clusters:
d(x_7, \bar{x}_{C1}) = √[(0−1)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (1−1)² + (0−0)² + (0−1)²] = 1.73
d(x_7, \bar{x}_{C2}) = √[(0−0)² + (0−0.33)² + (0−0.33)² + (0−1)² + (0−0)² + (0−0.33)² + (1−0.33)² + (0−1)² + (0−0)²] = 1.67
d(x_7, \bar{x}_{C3}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (1−1)² + (0−0)² + (0−1)²] = 1.41
d(x_7, \bar{x}_{C4}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (1−1)² + (0−0)² + (0−0)²] = 1
d(x_7, \bar{x}_{C5}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−0)² + (0−1)²] = 1.41

Since C4 is the closest cluster to x_7, Step 6 is executed to move x_7 into C4 and update the centroid of C4 to the mean of x_6 and x_7:

\bar{x}_{C4} = ((0+0)/2, (0+0)/2, (0+0)/2, (0+0)/2, (0+0)/2, (1+0)/2, (1+1)/2, (0+0)/2, (0+0)/2)′ = (0, 0, 0, 0, 0, 0.5, 1, 0, 0)′.

We have C1 = {x_1}, C2 = {x_2, x_3, x_4}, C3 = {x_5}, C4 = {x_6, x_7}, and C5 = {}.
Going back to Step 3, we take the eighth data point x 8 in the data set.
In Step 4, we compute the Euclidean distance of the data point x 8 to each
of the five clusters:
d(x_8, \bar{x}_{C1}) = √[(0−1)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²] = 2.24
d(x_8, \bar{x}_{C2}) = √[(0−0)² + (0−0.33)² + (0−0.33)² + (0−1)² + (0−0)² + (0−0.33)² + (0−0.33)² + (1−1)² + (0−0)²] = 1.20
d(x_8, \bar{x}_{C3}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−1)² + (0−0)² + (0−1)² + (1−0)² + (0−1)²] = 2
d(x_8, \bar{x}_{C4}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0.5)² + (0−1)² + (1−0)² + (0−0)²] = 1.5
d(x_8, \bar{x}_{C5}) = √[(0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (0−0)² + (1−0)² + (0−1)²] = 1.41

Since C2 is the closest cluster to x_8, Step 6 is executed to move x_8 into C2 and update the centroid of C2 to the mean of x_2, x_3, x_4, and x_8:

\bar{x}_{C2} = ((0+0+0+0)/4, (1+0+0+0)/4, (0+1+0+0)/4, (1+1+1+0)/4, (0+0+0+0)/4, (0+1+0+0)/4, (0+1+0+0)/4, (1+1+1+1)/4, (0+0+0+0)/4)′ = (0, 0.25, 0.25, 0.75, 0, 0.25, 0.25, 1, 0)′.

We have C1 = {x_1}, C2 = {x_2, x_3, x_4, x_8}, C3 = {x_5}, C4 = {x_6, x_7}, and C5 = {}.

Going back to Step 3, we take the ninth data point x_9, whose distance to C5 is 0 because the centroid of C5 was initialized to x_9; x_9 is moved into C5. We now have C1 = {x_1}, C2 = {x_2, x_3, x_4, x_8}, C3 = {x_5}, C4 = {x_6, x_7}, and C5 = {x_9}. The REPEAT-UNTIL loop then continues with further passes through the data points until no cluster centroid changes.
K-means clustering is supported by several software packages, including:
• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (www.mathworks.com)
• SAS (www.sas.com)
Exercises
9.1 Produce the 2-means clustering of the data points in Table 9.2 using the
Euclidean distance as the measure of dissimilarity and using the first
and third data points to set up the initial centroids of the two clusters.
9.2 Produce the density-based clustering of the data points in Table 9.2
using the Euclidean distance as the measure of dissimilarity, 1.5 as the
radius and 2 as the minimum number of data points required to form
a cluster.
9.3 Produce the density-based clustering of the data points in Table 9.2
using the Euclidean distance as the measure of dissimilarity, 2 as the
radius and 2 as the minimum number of data points required to form
a cluster.
9.4 Produce the 3-means clustering of 23 data points in the space shuttle
O-ring data set in Table 1.2. Use Launch-Temperature and Leak-Check
Pressure as the attribute variables, use the normalization method in Equation 7.4 to obtain the normalized Launch-Temperature and Leak-Check Pressure, and use the Euclidean distance as the measure of dissimilarity.
9.5 Repeat Exercise 9.4 using the cosine similarity measure.
10
Self-Organizing Map
This chapter describes the self-organizing map (SOM), which is based on the
architecture of artificial neural networks and is used for data clustering and
visualization. A list of software packages for SOM is provided along with
references for applications.
The output vector o of the SOM for an input vector x is computed as

o = (o_1, …, o_j, …, o_k)′ = (w_1′x, …, w_j′x, …, w_k′x)′,  (10.1)

where

x = (x_1, …, x_i, …, x_p)′

is the input vector, and

w_j = (w_{j1}, …, w_{ji}, …, w_{jp})′

is the vector of the connection weights w_{ji} from the inputs x_1, …, x_p to output node j.

Figure 10.1
Architectures of SOM with a (a) one-, (b) two-, and (c) three-dimensional output map.
Among all the output nodes, the output node producing the largest value for
a given input vector x is called the winner node. The winner node of the input
vector has the weight vector that is most similar to the input vector. The learning
algorithm of SOM determines the connection weights so that the winner nodes
of similar input vectors are close together. Table 10.1 lists the steps of the SOM
learning algorithm, given a training data set with n data points, xi, i = 1, …, n.
In Step 5 of the algorithm, the connection weights of the winner node for
the input vector xi and the nearby nodes of the winner node are updated
to make the weights of the winner node and its nearby nodes more similar
to the input vector and thus make these nodes produce larger outputs for
the input vector. The neighborhood function f( j, c), which determines the
closeness of node j to the winner node c and thus eligibility of node j for the
weight change, can be defined in many ways. One example of f( j, c) is
f(j, c) = 1 if ‖r_j − r_c‖ ≤ B_c(t), and f(j, c) = 0 otherwise,  (10.2)

where r_j and r_c are the coordinates of node j and the winner node c in the output map, and B_c(t) gives the threshold value that bounds the neighborhood of the winner node c.
Table 10.1
Learning Algorithm of SOM
Step Description
1 Initialize the connection weights of nodes with random positive or negative values, w_j′(t) = (w_{j1}(t), …, w_{jp}(t)), t = 0, j = 1, …, k
2 REPEAT
3 FOR i = 1 to n
4 Determine the winner node c for x_i: c = argmax_j w_j′(t) x_i
5 Update the connection weights of the winner node and its nearby nodes: w_j(t + 1) = w_j(t) + α f(j, c)[x_i − w_j(t)], where α is the learning rate and f(j, c) defines whether or not node j is close enough to c to be considered for the weight update
6 w_j(t + 1) = w_j(t) for the other nodes without the weight update
7 t = t + 1
8 UNTIL the sum of weight changes for all the nodes, E(t), is not greater than a threshold ε
Another form of f(j, c) is a Gaussian-type neighborhood function:

f(j, c) = e^{-\lVert r_j - r_c \rVert^2 / (2 B_c(t)^2)}.  (10.3)

In Step 8 of the algorithm, the sum of weight changes for all the nodes is computed:

E(t) = \sum_j \lVert w_j(t + 1) - w_j(t) \rVert.  (10.4)
After the SOM is learned, clusters of data points are identified by marking each node with the data point(s) that makes the node the winner node. A cluster of data points is located and identified in a close neighborhood in the output map.
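A compact Python sketch of the learning algorithm in Table 10.1 is given next for a one-dimensional chain of output nodes with the simple neighborhood f(j, c) = 1 for j = c − 1, c, c + 1. The function name, the random initialization, and the use of a fixed number of passes in place of the weight-change threshold in Step 8 are simplifying assumptions made here for illustration.

import numpy as np

def train_som(X, k=9, alpha=0.3, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    W = rng.uniform(-1.0, 1.0, size=(k, X.shape[1]))   # Step 1: random initial weights
    for _ in range(epochs):                            # Steps 2-8, with a fixed number of passes
        for x in X:
            c = int(np.argmax(W @ x))                  # Step 4: winner node for x
            for j in (c - 1, c, c + 1):                # Step 5: update the winner and its neighbors
                if 0 <= j < k:
                    W[j] = W[j] + alpha * (x - W[j])
    winners = [int(np.argmax(W @ x)) for x in X]       # mark each data point with its winner node
    return W, winners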
Example 10.1
Use the SOM with nine nodes in a one-dimensional chain and their
coordinates of 1, 2, 3, 4, 5, 6, 7, 8, and 9 in Figure 10.2 to cluster the
nine data points in the data set for system fault detection in Table 10.2,
which is the same data set in Tables 8.1 and 9.2. The data set includes
nine instances of single-machine faults, and the data point for each
Figure 10.2
Architecture of SOM for Example 10.1 (nine output nodes in a one-dimensional chain with coordinates 1 through 9, each fully connected to the nine inputs x1, …, x9).
Table 10.2
Data Set for System Fault Detection with Nine Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
4 (M4) 0 0 0 1 0 0 0 1 0
5 (M5) 0 0 0 0 1 0 1 0 1
6 (M6) 0 0 0 0 0 1 1 0 0
7 (M7) 0 0 0 0 0 0 1 0 0
8 (M8) 0 0 0 0 0 0 0 1 0
9 (M9) 0 0 0 0 0 0 0 0 1
instance has the nine attribute variables about the quality of parts.
The learning rate α is 0.3. The neighborhood function f( j, c) is
f(j, c) = 1 for j = c − 1, c, c + 1, and f(j, c) = 0 otherwise.
In Step 1 of the learning algorithm, the connection weights of the nine output nodes are initialized with random positive or negative values; for example, the initial weight vectors of nodes 8 and 9 are

w_8(0) = (−0.95, −0.21, −0.48, 0.05, −0.54, 0.23, −0.37, 0.61, −0.76)′
w_9(0) = (0.69, 0.23, −0.69, 0.86, 0.22, −0.91, 0.82, 0.31, 0.31)′.
Using these initial weights to compute the SOM outputs for the nine data
points makes nodes 4, 9, 7, 9, 1, 6, 9, 8, and 3 the winner nodes for x 1, x 2, x 3, x 4,
x 5, x 6, x 7, x 8, and x 9, respectively. For example, the output of each node for x 1
is computed to determine the winner node:
o1 w1¢ (0 ) x1
w¢2 (0 ) x1
o2
o3 w¢3 (0 ) x1
o4 w¢4 (0 ) x1
o = o5 = w¢5 (0 ) x1
o6 w6¢ (0 ) x1
o w7¢ (0 ) x1
7
o8 w8¢ (0 ) x1
o9 w9¢ (0 ) x1
172 Data Mining
( − 0.24 ) (1) + ( − 0.41) (0 ) + (0.46 )(0 ) + (0.27 )(0 ) + (0.88 )(1) + ( − 0.09) (0 )
+ (0.78 )(1) + ( − 0.39) (0 ) + (0.91)(1)
( 0.44 ) (1) + (0.44 )(0 ) + (0.93 )(0 ) + ( −0.15)(0 ) + (0.84 )(1) + ( − 0.36 ) (0 )
+ ( −0.16 )(1) + (0.55) (0 ) + (0.93 )(1)
( 0.96 ) (1) + ( − 0.45) (0 ) + ( −0.75)(0 ) + (0.75)(0 ) + (0.05)(1) + (0.86 )(0 )
+ (0.12)(1) + ( − 0.49) (0 ) + (0.98 )(1)
( 0. 82 )( ) (
1 + − 0 . 22 ) ( ) ( )( ) (
0 + 0 . 60 0 + −0 . 56 )( ) ( )( ) (
0 + 0 . 91 1 + − 0. 89 ) ( )
0
+ (0.333 )(1) + ( − 0.54 ) (0 ) + (0.47 )(1)
(0.62)(1) + (0.44 )(0 ) + (0.33 )(0 ) + (0.46 )(0 ) + ( −0.25)(1) + ( − 0.26 ) (0 )
=
+ −0.71 1 + − 0.61 0 + 0.338 1
( )( ) ( ) ( ) ( )( )
( − 0.47 ) (1) + ( − 0.62) (0 ) + ( −0.96 )(0 ) + ( −0.43 )(0 ) + (0.32) (1) + ( 0.96 ) (0 )
+ (0.70 )(1) + ( − 0.04 ) (0 ) + ( −0.84 )(1)
( − 0.87 ) (1) + ( 0.23 ) (0 ) + (0.37 )(0 ) + (0.49)(0 ) + (0.04 )(1) + ( 0.33 ) (0 )
+ −0.10 1 + 0.45 0 + −0.96 1
( )(
( ) ( )( ) ( )( )
( − 0.95) (1) + ( − 0.21) (0 ) + ( −0.48 )(0 ) + (0.05)(0 ) + ( −0.54 )(1) + ( 0.23 ) (0 )
+ ( −0.37 )(1) + ( 0.61) (0 ) + ( −0.76 ) (1)
( 0.69) (1) + ( 0.23 ) (0 ) + ( −0.69)(0 ) + (0.86 )(0 ) + (0.22)(1) + ( − 0.91) (0 )
+ (0.82)(1) + ( 0.31) (0 ) + (0.31)(1)
2.33
2.04
2.11
2.53
= 0.04 .
− 0.29
−1.9
− 2.62
2.04
Figure 10.3
The winner nodes for the nine data points in Example 10.1 using the initial weight values (nodes 1 through 9 in the one-dimensional output map).
Since node 4 has the largest output value o4 = 2.53, node 4 is the winner node
for x 1. Figure 10.3 illustrates the output map to indicate the winner nodes for
the nine data points and thus initial clusters of the data points based on the
initial weights.
In Steps 2 and 3, x 1 is considered. In Step 4, the output of each node for x 1
is computed to determine the winner node. As described earlier, node 4 is
the winner node for x1, and thus c = 4. In Step 5, the connection weights to the
winner node c = 4 and its neighbors c − 1 = 3 and c + 1 = 5 are updated:
w_4(1) = w_4(0) + (0.3)[x_1 − w_4(0)] = (0.7) w_4(0) + (0.3) x_1
       = (0.7)(0.82, −0.22, 0.60, −0.56, 0.91, −0.80, 0.33, −0.54, 0.47)′ + (0.3)(1, 0, 0, 0, 1, 0, 1, 0, 1)′
       = (0.87, −0.15, 0.42, −0.39, 0.94, −0.56, 0.53, −0.38, 0.63)′

w_3(1) = w_3(0) + (0.3)[x_1 − w_3(0)] = (0.7) w_3(0) + (0.3) x_1
       = (0.7)(0.96, −0.45, −0.75, 0.35, 0.05, 0.86, 0.12, −0.49, 0.98)′ + (0.3)(1, 0, 0, 0, 1, 0, 1, 0, 1)′
       = (0.97, −0.32, −0.53, 0.25, 0.34, 0.60, 0.38, −0.34, 0.99)′

w_5(1) = w_5(0) + (0.3)[x_1 − w_5(0)] = (0.7) w_5(0) + (0.3) x_1
       = (0.7)(0.62, 0.44, 0.33, 0.46, −0.25, −0.26, −0.71, −0.61, 0.38)′ + (0.3)(1, 0, 0, 0, 1, 0, 1, 0, 1)′
       = (0.73, 0.31, 0.23, 0.32, 0.13, −0.18, −0.20, −0.43, 0.57)′.
In Step 6, the weights of the other nodes remain the same. In Step 7, t is increased to 1. The weight vectors w_3(1), w_4(1), and w_5(1) take the updated values given earlier, and the remaining nodes keep their initial weights; for example,

w_8(1) = w_8(0) = (−0.95, −0.21, −0.48, 0.05, −0.54, 0.23, −0.37, 0.61, −0.76)′
w_9(1) = w_9(0) = (0.69, 0.23, −0.69, 0.86, 0.22, −0.91, 0.82, 0.31, 0.31)′.
The SOM is supported by several software packages, including:
• Weka (http://www.cs.waikato.ac.nz/ml/weka/)
• MATLAB® (www.mathworks.com)
Exercises
10.1 Continue the learning process in Example 10.1 to perform the weight
updates when x 2 is presented to the SOM.
10.2 Use the software Weka to produce the SOM for Example 10.1.
10.3 Define a two-dimensional SOM and the neighborhood function in
Equation 10.2 for Example 10.1 and perform one iteration of the weight
update when x 1 is presented to the SOM.
11
Probability Distributions of Univariate Data

The normal distribution has the probability density function

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2},  (11.1)

where μ is the mean and σ is the standard deviation.
Table 11.1
Values of Launch Temperature in
the Space Shuttle O-Ring Data Set
Instance Launch Temperature
1 66
2 70
3 69
4 68
5 67
6 72
7 73
8 70
9 57
10 63
11 70
12 78
13 67
14 53
15 67
16 75
17 70
18 81
19 76
20 79
21 75
22 76
23 58
Figure 11.1
Frequency histogram of the Launch Temperature data (frequency on the vertical axis).
Time series data often demonstrate one of the following four data patterns:
• Spike
• Random fluctuation
• Step change
• Steady change
The probability distributions of time series data with the spike, random fluc-
tuation, step change, and steady change patterns have special characteristics.
Time series data with a spike pattern as shown in Figure 11.2a have the major-
ity of data points with similar values and few data points with higher values
producing upward spikes or with lower values producing downward spikes.
The high frequency of data points with similar values determines where the
mean with a high probability density is located, and few data points with lower
(higher) values than the mean for downward (upward) spikes produce a long
tail on the left (right) side of the mean and thus a left (right) skewed distribution.
Hence, time series data with spikes produce a skewed probability distribution
that is asymmetric with most data points having values near the mean and few
data points having values spreading over one side of the mean and creating a
long tail, as shown in Figure 11.2a. Time series data with a random fluctuation
pattern produce a normal distribution that is symmetric, as shown in Figure
11.2b. Time series data with one step change, as shown in Figure 11.2c, produce
two clusters of data points with two different centroids and thus a bimodal dis-
tribution. Time series data with multiple step changes create multiple clusters
of data points with their different centroids and thus a multimodal distribu-
tion. Time series data with the steady change (i.e., a steady increase of values or
a steady decrease of values) have values evenly distributed and thus produce a
uniform distribution, as shown in Figure 11.2d. Therefore, the four patterns of time series data produce four different types of probability distribution: a skewed distribution, a normal distribution, a bimodal or multimodal distribution, and a uniform distribution, as illustrated in Figure 11.2.
Figure 11.2
Time series data patterns and their probability distributions: (a) the data plot and histogram of the spike pattern, (b) the data plot and histogram of the random fluctuation pattern, (c) the data plot and histogram of the step change pattern, and (d) the data plot and histogram of the steady change pattern.
As described in Ye (2008, Chapter 9), the four data patterns and their cor-
responding probability distributions can be used to identify whether or not
there are attack activities underway in computer and network systems since
computer and network data under attack or normal use conditions may
demonstrate different data patterns. Cyber attack detection is an important
part of protecting computer and network systems from cyber attacks.
The skewness of a random variable x measures the asymmetry of its probability distribution and is defined as

skewness = E\left[\left(\frac{x - \mu}{\sigma}\right)^3\right],  (11.2)
where μ and σ are the mean and standard deviation of data population for
the variable x. Given a sample of n data points, x1, …, xn, the sample skewness
is computed:
skewness = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)(n-2)\, s^3},  (11.3)
where x and s are the average and standard deviation of the data sample.
Unlike the variance that squares both positive and negative deviations from
the mean to make both positive and negative deviations from the mean con-
tribute to the variance in the same way, the skewness measures how much
data deviations from the mean are symmetric on both sides of the mean. A
left-skewed distribution with a long tail on the left side of the mean has a
negative value of the skewness. A right-skewed distribution with a long tail
on the right side of the mean has a positive value of the skewness.
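The sample skewness in Equation 11.3 can be computed directly; the sketch below applies it to the Launch Temperature values in Table 11.1. The function name and the use of NumPy are assumptions made here for illustration.

import numpy as np

def sample_skewness(x):
    # Equation 11.3
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.std(ddof=1)                      # sample standard deviation
    return n * np.sum((x - x.mean()) ** 3) / ((n - 1) * (n - 2) * s ** 3)

launch_temperature = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                      67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
print(sample_skewness(launch_temperature))   # negative, indicating a left-skewed distribution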
Table 11.2
Combinations of Skewness and Mode Test Results for Distinguishing Four
Probability Distributions
Probability Distribution Dip Test Mode Test Skewness Test
Multimodal distribution Unimodality is Number of significant Any result
rejected modes ≥ 2
Uniform distribution Unimodality is not Number of significant Symmetric
rejected modes > 2
Normal distribution Unimodality is not Number of significant Symmetric
rejected modes < 2
Skewed distribution Unimodality is not Number of significant Skewed
rejected modes < 2
Exercises
11.1 Select and use the software to perform the skewness test, the mode test,
and the dip test on the Launch Temperature Data in Table 11.1 and use
the test results to determine whether or not the probability distribution
of the Launch Temperature data falls into one of the four probability
distributions in Table 11.2.
11.2 Select a numeric variable in the data set you obtain in Problem 1.2
and select an interval width to plot a histogram of the data for the
variable. Select and use the software to perform the skewness test,
the mode test, and the dip test on the data of the variable, and use the
test results to determine whether or not the probability distribution
of the data of the variable falls into one of the four probability
distributions in Table 11.2.
11.3 Select a numeric variable in the data set you obtain in Problem 1.3
and select an interval width to plot a histogram of the data for the
variable. Select and use the software to perform the skewness test,
the mode test, and the dip test on the data of the variable, and use the
test results to determine whether or not the probability distribution
of the data of the variable falls into one of the four probability
distributions in Table 11.2.
12
Association Rules
Association rules uncover items that are frequently associated together. The
algorithm of association rules was initially developed in the context of market
basket analysis for studying customer purchasing behaviors that can be used
for marketing. Association rules uncover what items customers often purchase
together. Items that are frequently purchased together can be placed together
in stores or can be associated together at e-commerce websites for promoting
the sale of the items or for other marketing purposes. There are many other
applications of association rules, for example, text analysis for document
classification and retrieval. This chapter introduces the algorithm of mining
association rules. A list of software packages that support association rules
is provided. Some applications of association rules are given with references.
An association rule takes the form of

A → C,

where
A is an item set called the antecedent
C is an item set called the consequent
A and C have no common items, that is, A ∩ C = ∅ (an empty set). The relation-
ship of A and C in the association rule means that the presence of the item set
Table 12.1
Data Set for System Fault Detection with Nine Cases of Single-Machine
Faults and Item Sets Obtained from This Data Set
Instance            Attribute Variables about Quality of Parts    Items in Each
(Faulty Machine)    x1 x2 x3 x4 x5 x6 x7 x8 x9                    Data Record
1 (M1) 1 0 0 0 1 0 1 0 1 {x1, x5, x7, x9}
2 (M2) 0 1 0 1 0 0 0 1 0 {x2, x4, x8}
3 (M3) 0 0 1 1 0 1 1 1 0 {x3, x4, x6, x7, x8}
4 (M4) 0 0 0 1 0 0 0 1 0 {x4, x8}
5 (M5) 0 0 0 0 1 0 1 0 1 {x5, x7, x9}
6 (M6) 0 0 0 0 0 1 1 0 0 {x6, x7}
7 (M7) 0 0 0 0 0 0 1 0 0 {x7}
8 (M8) 0 0 0 0 0 0 0 1 0 {x8}
9 (M9) 0 0 0 0 0 0 0 0 1 {x9}
The relationship of A and C in the association rule means that the presence of the item set A in a data record implies the presence of the item set C in the data record, that is, the item set C is associated with the item set A.
The measures of support, confidence, and lift are defined and used to dis-
cover item sets A and C that are frequently associated together. Support(X)
measures the proportion of data records that contain the item set X, and is
defined as
support(X) = |{S | S ∈ D and S ⊇ X}| / N,    (12.1)

where
D denotes the data set containing data records
S is a data record in the data set D (indicated by S ∈ D) that contains the items in X (indicated by S ⊇ X)
| | denotes the number of such data records S
N is the number of the data records in D

Based on the definition,

support(∅) = |{S | S ∈ D and S ⊇ ∅}| / N = N/N = 1.
For example, for the data set with the nine data records in Table 12.1,
support({x5}) = 2/9 = 0.22

support({x7}) = 5/9 = 0.56

support({x9}) = 3/9 = 0.33

support({x5, x7}) = 2/9 = 0.22

support({x5, x9}) = 2/9 = 0.22.
The support of an association rule A → C is defined as

support(A → C) = support(A ∪ C),    (12.2)

where A ∪ C is the union of the item set A and the item set C and contains items from both A and C. Based on the definition, we have

support(∅ → C) = support(C).

Confidence(A → C) measures the proportion of the data records containing the item set A that also contain the item set C, and is defined as

confidence(A → C) = support(A ∪ C)/support(A).    (12.3)

For example,

confidence(∅ → C) = support(C)/support(∅) = support(C)/1 = support(C)

confidence(A → ∅) = support(A)/support(A) = 1.
For example, if the item set C is present in every data record, then support(A ∪ C) = support(A) and

confidence(A → C) = support(A ∪ C)/support(A) = support(A)/support(A) = 1.

However, this association rule A → C is of little interest to us, because the item set C is in every data record and thus any item set, including A, is associated with C. To address this issue, lift(A → C) is defined:

lift(A → C) = confidence(A → C)/support(C) = support(A ∪ C)/[support(A) × support(C)].    (12.4)
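The following sketch (not part of the original text) computes support, confidence, and lift directly from Equations 12.1 through 12.4 for the item sets in Table 12.1; the function names are illustrative only.

```python
# Illustrative sketch: support, confidence, and lift for Table 12.1.
records = [
    {"x1", "x5", "x7", "x9"}, {"x2", "x4", "x8"}, {"x3", "x4", "x6", "x7", "x8"},
    {"x4", "x8"}, {"x5", "x7", "x9"}, {"x6", "x7"}, {"x7"}, {"x8"}, {"x9"},
]

def support(itemset, data):
    """Proportion of data records that contain every item in itemset."""
    return sum(1 for record in data if itemset <= record) / len(data)

def confidence(antecedent, consequent, data):
    return support(antecedent | consequent, data) / support(antecedent, data)

def lift(antecedent, consequent, data):
    return confidence(antecedent, consequent, data) / support(consequent, data)

print(support({"x5", "x7"}, records))       # 2/9 = 0.22
print(confidence({"x5"}, {"x7"}, records))  # 1.0
print(lift({"x5"}, {"x7"}, records))        # 1.8
print(lift({"x5"}, {"x9"}, records))        # 3.0
```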
Figure 12.1
A manufacturing system with nine machines and production flows of parts.
The association rules {x5} → {x7} and {x5} → {x9} have the same values of support and confidence but different values of lift: lift({x5} → {x7}) = (2/9)/[(2/9) × (5/9)] = 1.8, whereas lift({x5} → {x9}) = (2/9)/[(2/9) × (3/9)] = 3. Hence, x5 appears to have a greater impact on the frequency of x9 than on the frequency of x7. Figure 1.1, which is copied in Figure 12.1, gives the production flows of parts for the data set in Table 12.1. As shown in Figure 12.1, parts flowing through M5 go to M7 and M9, so x5 should have the same impact on x7 and x9. However, parts flowing through M6 also go to M7, making x7 more frequent than x9 in the data set and producing a lower lift value for {x5} → {x7} than for {x5} → {x9}. In other words, x7 is affected not only by x5 but also by x6 and x3, as shown in Figure 12.1, which makes x7 appear less dependent on x5; lift captures this weaker dependence between the antecedent and the consequent through a lower value.
Table 12.2
Apriori Algorithm

Step  Description of the Step
1     F1 = {frequent one-item sets}
2     i = 1
3     WHILE Fi ≠ ∅
4       i = i + 1
5       Ci = {{x1, …, xi−2, xi−1, xi} | {x1, …, xi−2, xi−1} ∈ Fi−1 and {x1, …, xi−2, xi} ∈ Fi−1}
6       FOR all data records S ∈ D
7         FOR all candidate sets C ∈ Ci
8           IF S ⊇ C
9             C.count = C.count + 1
10      Fi = {C | C ∈ Ci and C.count ≥ minimum support}
11    return all Fj, j = 1, …, i − 1
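A compact Python sketch of the Apriori algorithm in Table 12.2 is given below; it is illustrative only, treats min-support as a proportion of the data records, and reproduces the frequent item sets of Example 12.1 when run on the data set of Table 12.1.

```python
# Illustrative Apriori sketch; min_support is a proportion of records.
from itertools import combinations

def apriori(records, min_support):
    n = len(records)
    def count(itemset):
        return sum(1 for r in records if itemset <= r)

    items = sorted({item for r in records for item in r})
    frequent = [{frozenset([i]) for i in items
                 if count(frozenset([i])) >= min_support * n}]
    while frequent[-1]:
        prev = frequent[-1]
        k = len(next(iter(prev))) + 1
        # candidate k-item sets from unions of frequent (k-1)-item sets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        level = set()
        for c in candidates:
            # keep a candidate only if all its (k-1)-subsets are frequent
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1)):
                if count(c) >= min_support * n:
                    level.add(c)
        frequent.append(level)
    return [s for lvl in frequent[:-1] for s in lvl]

records = [frozenset(r) for r in (
    {"x1", "x5", "x7", "x9"}, {"x2", "x4", "x8"}, {"x3", "x4", "x6", "x7", "x8"},
    {"x4", "x8"}, {"x5", "x7", "x9"}, {"x6", "x7"}, {"x7"}, {"x8"}, {"x9"})]
for itemset in apriori(records, min_support=0.2):
    print(sorted(itemset))
```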
Example 12.1
From the data set in Table 12.1, find all frequent item sets with min-
support (minimum support) = 0.2.
Examining the support of each one-item set, we obtain
F1 = {{x4}, support = 3/9 = 0.33,
     {x5}, support = 2/9 = 0.22,
     {x6}, support = 2/9 = 0.22,
     {x7}, support = 5/9 = 0.56,
     {x8}, support = 4/9 = 0.44,
     {x9}, support = 3/9 = 0.33}.
Using the frequent one-item sets to put together the candidate two-item
sets and examine their support, we obtain
F2 = {{x4, x8}, support = 3/9 = 0.33,
     {x5, x7}, support = 2/9 = 0.22,
     {x5, x9}, support = 2/9 = 0.22,
     {x6, x7}, support = 2/9 = 0.22,
     {x7, x9}, support = 2/9 = 0.22}.
Since {x5, x7}, {x5, x9}, and {x7, x9} differ from each other in only one item,
they are used to construct the three-item set {x5, x7, x9}—the only three-
item set that can be constructed:
F3 = {{x5, x7, x9}, support = 2/9 = 0.22}.
Note that constructing a three-item set from two-item sets that dif-
fer in more than one item does not produce a frequent three-item set.
For example, {x4, x8} and {x5, x7} are frequent two-item sets that differ in
two items. {x4, x5}, {x4, x7}, {x8, x5}, and {x8, x7} are not frequent two-item
sets. A three-item set constructed using {x4, x8} and {x5, x7}, e.g., {x4, x5, x8},
is not a frequent three-item set because not every pair of two items from
{x4, x5, x8} is a frequent two-item set. Specifically, {x4, x5} and {x8, x5} are
not frequent two-item sets.
Since there is only one frequent three-item set, we cannot generate a
candidate four-item set in Step 5 of the Apriori algorithm. That is, C4 = ∅.
As a result, F4 = ∅ in Step 3 of the Apriori algorithm, and we exit the
WHILE loop. In Step 11 of the algorithm, we collect all the frequent item
sets that satisfy min-support = 0.2:
{x4}, {x5}, {x6}, {x7}, {x8}, {x9}, {x4, x8}, {x5, x7}, {x5, x9}, {x6, x7}, {x7, x9}, {x5, x7, x9}.
Example 12.2
Use the frequent item sets from Example 12.1 to generate all the asso-
ciation rules that satisfy min-support = 0.2 and min-confidence (minimum
confidence) = 0.5.
Using each frequent item set F obtained from Example 12.1, we generate each association rule, A → C, that satisfies

A ∪ C = F,
A ∩ C = ∅,

and min-confidence = 0.5. Each association rule in the form of F → ∅ does not describe the association of two item sets but only the presence of the item set F in the data set and can thus be ignored. Removing each association rule in the form of F → ∅, we obtain the final set of association rules. The remaining
association rules reveal the close association of x4 with x8, of x5 with x7 and x9, and of x6 with x7, which is consistent with the production flows
in Figure 12.1. However, the production flows from M1, M2, and M3 are
not captured in the frequent item sets and in the final set of association
rules because of the way in which the data set is sampled by considering
all the single-machine faults. Since M1, M2, and M3 are at the beginning
of the production flows and affected by themselves only, x1, x2, and x3
appear less frequently in the data set than x4 to x9. For the same reason,
the confidence value of the association rule {x4} → {x8} is higher than that
of the association rule {x8} → {x4}.
Exercises
12.1 Consider 16 data records in the testing data set of system fault detec-
tion in Table 3.2 as 16 sets of items by taking x1, x2, x3, x4, x5, x6, x7, x8,
x9 as nine different quality problems with the value of 1 indicating the
presence of the given quality problem. Find all frequent item sets with
min-support = 0.5.
12.2 Use the frequent item sets from Exercise 12.1 to generate all the associa-
tion rules that satisfy min-support = 0.5 and min-confidence = 0.5.
12.3 Repeat Exercise 12.1 for all 25 data records from Table 12.1 and Table 3.2
as the data set.
12.4 Repeat Exercise 12.2 for all 25 data records from Table 12.1 and Table 3.2
as the data set.
12.5 To illustrate that the Apriori algorithm is efficient for a sparse data set, find or create a sparse data set in which each item is relatively infrequent, and apply the Apriori algorithm to the data set to produce frequent item sets with an appropriate value of min-support.
12.6 To illustrate that the Apriori algorithm is less efficient for a dense data set, find or create a dense data set in which each item is relatively frequent in the data records, and apply the Apriori algorithm to the data set to produce frequent item sets with an appropriate value of min-support.
13
Bayesian Network
The Bayes classifier in Chapter 3 requires that all the attribute variables be independent of each other. The Bayesian network in this chapter allows associations among the attribute variables themselves as well as associations between attribute variables and target variables. A Bayesian network uses associations of variables to infer information about any variable in the network. In this
chapter, we first introduce the structure of a Bayesian network and the prob-
ability information of variables in a Bayesian network. Then we describe the
probabilistic inference that is conducted within a Bayesian network. Finally,
we introduce methods of learning the structure and probability information
of a Bayesian network. A list of software packages that support Bayesian
network is provided. Some applications of Bayesian network are given with
references.
Recall that the Bayes classifier in Chapter 3 assigns the most probable target value by assuming that the attribute variables are independent given the target variable:

y_MAP ≈ arg max_{y ∈ Y} P(y) ∏_{i=1}^{p} P(xi|y),

where p is the number of attribute variables.
Table 13.1
Training Data Set for System Fault Detection

Figure 13.1
Manufacturing system with nine machines and production flows of parts.
Figure 13.2
Structure of a Bayesian network for the data set of system fault detection.
The data set of system fault detection has nine attribute variables for the quality of parts at various stages of production, x1, x2, x3, x4, x5, x6, x7, x8, and x9, and the target variable for the presence of a system fault, y. In Figure 13.2, x5 has one parent x1, x6 has one parent x3, x4
has two parents x2 and x3, x9 has one parent x5, x7 has two parents x5 and x6,
x8 has one parent x4, and y has three parents x7, x8, and x9. Instead of drawing
a directed link from each of the nine quality variables, x1, x2, x3, x4, x5, x6, x7,
x8, and x9, to the system fault variable y, we have a directed link from each of
three quality variables, x7, x8, and x9, to the system fault variable y, because x7,
x8, and x9 are at the last stage of the production flow and capture the effects
of x1, x2, x3, x4, x5, and x6 on y.
Given that the variable x has parents z1, …, zk, a Bayesian network uses a
conditional probability distribution for P(x|z1, …, zk) to quantify the effects
of parents z1, …, zk on the child x. For example, we suppose that the device
for inspecting the quality of parts in the data set of system fault detec-
tion is not 100% reliable, producing data uncertainties and conditional
probability distributions in Tables 13.2 through 13.10 for the nodes with
Table 13.2
P(x5|x1)
x1 = 0 x1 = 1
x5 = 0 P(x5 = 0|x1 = 0) = 0.7 P(x5 = 0|x1 = 1) = 0.1
x5 = 1 P(x5 = 1|x1 = 0) = 0.3 P(x5 = 1|x1 = 1) = 0.9
Table 13.3
P(x6|x3)
x3 = 0 x3 = 1
x6 = 0 P(x6 = 0|x3 = 0) = 0.7 P(x6 = 0|x3 = 1) = 0.1
x6 = 1 P(x6 = 1|x3 = 0) = 0.3 P(x6 = 1|x3 = 1) = 0.9
Table 13.4
P(x4|x3, x2)
x2 = 0
x3 = 0 x3 = 1
x4 = 0 P(x4 = 0|x2 = 0, x3 = 0) = 0.7 P(x4 = 0|x2 = 0, x3 = 1) = 0.1
x4 = 1 P(x4 = 1|x2 = 0, x3 = 0) = 0.3 P(x4 = 1|x2 = 0, x3 = 1) = 0.9
x2 = 1
x3 = 0 x3 = 1
x4 = 0 P(x4 = 0|x2 = 1, x3 = 0) = 0.1 P(x4 = 0|x2 = 1, x3 = 1) = 0.1
x4 = 1 P(x4 = 1|x2 = 1, x3 = 0) = 0.9 P(x4 = 1|x2 = 1, x3 = 1) = 0.9
Table 13.5
P(x9|x5)
x5 = 0 x5 = 1
x9 = 0 P(x9 = 0|x5 = 0) = 0.7 P(x9 = 0|x5 = 1) = 0.1
x9 = 1 P(x9 = 1|x5 = 0) = 0.3 P(x9 = 1|x5 = 1) = 0.9
Table 13.6
P(x7|x5, x6)
x5 = 0
x6 = 0 x6 = 1
x7 = 0 P(x7 = 0|x5 = 0, x6 = 0) = 0.7 P(x7 = 0|x5 = 0, x6 = 1) = 0.1
x7 = 1 P(x7 = 1|x5 = 0, x6 = 0) = 0.3 P(x7 = 1|x5 = 0, x6 = 1) = 0.9
x5 = 1
x6 = 0 x6 = 1
x7 = 0 P(x7 = 0|x5 = 1, x6 = 0) = 0.1 P(x7 = 0|x5 = 1, x6 = 1) = 0.1
x7 = 1 P(x7 = 1|x5 = 1, x6 = 0) = 0.9 P(x7 = 1|x5 = 1, x6 = 1) = 0.9
Table 13.7
P(x8|x4 )
x4 = 0 x4 = 1
x8 = 0 P(x8 = 0|x4 = 0) = 0.7 P(x8 = 0|x4 = 1) = 0.1
x8 = 1 P(x8 = 1|x4 = 0) = 0.3 P(x8 = 1|x4 = 1) = 0.9
Table 13.8
P(y|x9)
x9 = 0 x9 = 1
y=0 P(y = 0|x9 = 0) = 0.9 P(y = 0|x9 = 1) = 0.1
y=1 P(y = 1|x9 = 0) = 0.1 P(y = 1|x9 = 1) = 0.9
Table 13.9
P(y|x7)
x7 = 0 x7 = 1
y=0 P(y = 0|x7 = 0) = 0.9 P(y = 0|x7 = 1) = 0.1
y=1 P(y = 1|x7 = 0) = 0.1 P(y = 1|x7 = 1) = 0.9
Table 13.10
P(y|x8)
x8 = 0 x8 = 1
y=0 P(y = 0|x8 = 0) = 0.9 P(y = 0|x8 = 1) = 0.1
y = 1    P(y = 1|x8 = 0) = 0.1    P(y = 1|x8 = 1) = 0.9
parent(s) in Figure 13.2. For example, in Table 13.2, P(x5 = 0|x1 = 1) = 0.1 and
P(x5 = 1|x1 = 1) = 0.9 mean that if x1 = 1 then the probability of x5 = 0 is 0.1,
the probability of x5 = 1 is 0.9, and the probability of having either value
(0 or 1) of x5 is 0.1 + 0.9 = 1. The reason for not having the probability of 1
for x5 = 1 if x1 = 1 is that the inspection device for x1 has a small probability
of failure. Although the inspection device tells x1 = 1, there is a small probability that x1 should be 0. In addition, the inspection device for x5 also has
a small probability of failure, meaning that the inspection device may tell
x5 = 0 although x5 should be 1. The probabilities of failure in the inspection
devices produce data uncertainties and thus the conditional probabilities
in Tables 13.2 through 13.10.
For the node of a variable x in a Bayesian network that has no par-
ents, the prior probability distribution of x is needed. For example, in the
Bayesian network in Figure 13.2, x 1, x 2, and x 3 have no parents, and their
prior probability distributions are given in Tables 13.11 through 13.13,
respectively.
The prior probability distributions of nodes without parent(s) and the con-
ditional probability distributions of nodes with parent(s) allow computing
the joint probability distribution of all the variables in a Bayesian network.
Table 13.11
P(x1)
x1 = 0 x1 = 1
P(x1 = 0) = 0.8 P(x1 = 1) = 0.2
Table 13.12
P(x2)
x2 = 0 x2 = 1
P(x2 = 0) = 0.8 P(x2 = 1) = 0.2
Table 13.13
P(x3)
x3 = 0 x3 = 1
P(x3 = 0) = 0.8 P(x3 = 1) = 0.2
For example, for the Bayesian network in Figure 13.2,

P(x1, x2, x3, x4, x5, x6, x7, x8, x9, y)
= P(y|x1, x2, x3, x4, x5, x6, x7, x8, x9)P(x1, x2, x3, x4, x5, x6, x7, x8, x9)
= P(y|x7, x8, x9)P(x1, x2, x3, x4, x5, x6, x7, x8, x9)
= P(y|x7, x8, x9)P(x9|x5)P(x7|x5, x6)P(x8|x4)P(x5|x1)P(x6|x3)P(x4|x2, x3)P(x1, x2, x3)
= P(y|x7, x8, x9)P(x9|x5)P(x7|x5, x6)P(x8|x4)P(x5|x1)P(x6|x3)P(x4|x2, x3)P(x1)P(x2)P(x3).
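The following sketch (not part of the original text) encodes the probabilities of Tables 13.2 through 13.13 as Python dictionaries and evaluates the factored joint probability P(x1, …, x9) for one assignment; since Tables 13.8 through 13.10 give P(y|x7), P(y|x8), and P(y|x9) separately rather than a single P(y|x7, x8, x9), the factor for y is omitted here.

```python
# Illustrative sketch: joint probability of the quality variables x1..x9
# from the factorization above, with CPTs taken from Tables 13.2-13.13.
p_x1 = {0: 0.8, 1: 0.2}   # Table 13.11
p_x2 = {0: 0.8, 1: 0.2}   # Table 13.12
p_x3 = {0: 0.8, 1: 0.2}   # Table 13.13
# keys are (child value, parent value)
p_x5_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.9}  # Table 13.2
p_x6_x3 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.9}  # Table 13.3
p_x9_x5 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.9}  # Table 13.5
p_x8_x4 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.9}  # Table 13.7
# keys are (x4, x2, x3) and (x7, x5, x6)
p_x4_x2x3 = {(0, 0, 0): 0.7, (1, 0, 0): 0.3, (0, 0, 1): 0.1, (1, 0, 1): 0.9,
             (0, 1, 0): 0.1, (1, 1, 0): 0.9, (0, 1, 1): 0.1, (1, 1, 1): 0.9}  # Table 13.4
p_x7_x5x6 = {(0, 0, 0): 0.7, (1, 0, 0): 0.3, (0, 0, 1): 0.1, (1, 0, 1): 0.9,
             (0, 1, 0): 0.1, (1, 1, 0): 0.9, (0, 1, 1): 0.1, (1, 1, 1): 0.9}  # Table 13.6

def joint_x(x1, x2, x3, x4, x5, x6, x7, x8, x9):
    """P(x1, ..., x9) for one assignment, following the factorization."""
    return (p_x1[x1] * p_x2[x2] * p_x3[x3]
            * p_x4_x2x3[(x4, x2, x3)] * p_x5_x1[(x5, x1)] * p_x6_x3[(x6, x3)]
            * p_x7_x5x6[(x7, x5, x6)] * p_x8_x4[(x8, x4)] * p_x9_x5[(x9, x5)])

print(joint_x(0, 0, 0, 0, 0, 0, 0, 0, 0))  # all quality variables equal to 0
```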
Associations in a Bayesian network imply conditional independencies of the form

P(x1, …, xi | z1, …, zk, v1, …, vj) = P(x1, …, xi | z1, …, zk),    (13.1)

and for variables that are independent of each other,

P(x1, …, xi) = ∏_{j=1}^{i} P(xj).    (13.2)

A joint probability distribution can be marginalized to obtain the probability distribution of a subset of the variables, for example,

P(x) = Σ_k P(x, z = bk)    (13.3)

P(z) = Σ_k P(x = ak, z),    (13.4)

where the sums are over all values bk of z and all values ak of x, respectively, and conditional probabilities are obtained as

P(x|z) = P(x, z)/P(z)    (13.5)

P(z|x) = P(x, z)/P(x).    (13.6)
Example 13.1
Given the following joint probability distribution P(x, z):

P(x = 0, z = 0) = 0.2
P(x = 0, z = 1) = 0.4
P(x = 1, z = 0) = 0.3
P(x = 1, z = 1) = 0.1,

the marginal probabilities are obtained using Equations 13.3 and 13.4:

P(x = 0) = P(x = 0, z = 0) + P(x = 0, z = 1) = 0.2 + 0.4 = 0.6
P(x = 1) = P(x = 1, z = 0) + P(x = 1, z = 1) = 0.3 + 0.1 = 0.4
P(z = 0) = P(x = 0, z = 0) + P(x = 1, z = 0) = 0.2 + 0.3 = 0.5
P(z = 1) = P(x = 0, z = 1) + P(x = 1, z = 1) = 0.4 + 0.1 = 0.5,

and the conditional probabilities are obtained using Equations 13.5 and 13.6:

P(x = 0|z = 0) = P(x = 0, z = 0)/P(z = 0) = 0.2/0.5 = 0.4
P(x = 1|z = 0) = P(x = 1, z = 0)/P(z = 0) = 0.3/0.5 = 0.6
P(x = 0|z = 1) = P(x = 0, z = 1)/P(z = 1) = 0.4/0.5 = 0.8
P(x = 1|z = 1) = P(x = 1, z = 1)/P(z = 1) = 0.1/0.5 = 0.2
P(z = 0|x = 0) = P(x = 0, z = 0)/P(x = 0) = 0.2/0.6 = 0.33
P(z = 1|x = 0) = P(x = 0, z = 1)/P(x = 0) = 0.4/0.6 = 0.67
P(z = 0|x = 1) = P(x = 1, z = 0)/P(x = 1) = 0.3/0.4 = 0.75
P(z = 1|x = 1) = P(x = 1, z = 1)/P(x = 1) = 0.1/0.4 = 0.25.
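A small sketch (not from the book) of Equations 13.3 through 13.6 applied to the joint distribution of Example 13.1:

```python
# Illustrative sketch: marginals and conditionals from a joint distribution.
joint = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.1}  # keys are (x, z)

p_x = {a: sum(p for (x, z), p in joint.items() if x == a) for a in (0, 1)}
p_z = {b: sum(p for (x, z), p in joint.items() if z == b) for b in (0, 1)}

p_x_given_z = {(a, b): joint[(a, b)] / p_z[b] for a in (0, 1) for b in (0, 1)}
p_z_given_x = {(b, a): joint[(a, b)] / p_x[a] for a in (0, 1) for b in (0, 1)}

print(p_x)                  # approximately {0: 0.6, 1: 0.4}
print(p_z)                  # approximately {0: 0.5, 1: 0.5}
print(p_x_given_z[(0, 0)])  # 0.4
print(p_z_given_x[(1, 1)])  # 0.25
```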
Example 13.2
Consider the Bayesian network in Figure 13.2 and the probability distri-
butions in Tables 13.2 through 13.13. Given x6 = 1, what are the probabili-
ties of x4 = 1, x3 = 1, and x2 = 1? In other words, what are P(x4 = 1|x6 = 1),
P(x3 = 1|x6 = 1), and P(x2 = 1|x6 = 1)? Note that the given condition x6 = 1
does not imply P(x6 = 1) = 1.
To get P(x3 = 1|x6 = 1), we need to obtain P(x3, x6).
P(x6, x3) = P(x6|x3)P(x3):

P(x6 = 0, x3 = 0) = P(x6 = 0|x3 = 0)P(x3 = 0) = (0.7)(0.8) = 0.56
P(x6 = 0, x3 = 1) = P(x6 = 0|x3 = 1)P(x3 = 1) = (0.1)(0.2) = 0.02
P(x6 = 1, x3 = 0) = P(x6 = 1|x3 = 0)P(x3 = 0) = (0.3)(0.8) = 0.24
P(x6 = 1, x3 = 1) = P(x6 = 1|x3 = 1)P(x3 = 1) = (0.9)(0.2) = 0.18

To get P(x4 = 1|x6 = 1), we use

P(x4, x3, x2|x6 = 1) = P(x4|x3, x2)P(x3|x6 = 1)P(x2) = P(x4|x3, x2)[P(x6 = 1|x3)P(x3)/P(x6 = 1)]P(x2).
Although P(x4|x3, x2), P(x6|x3), P(x3), and P(x2) are given in Tables
13.3, 13.4, 13.12, and 13.13, respectively, P(x6) needs to be computed. In
addition to computing P(x6), we also compute P(x4) so we can compare
P(x4 = 1|x6 = 1) with P(x4).
To obtain P(x4) and P(x6), we first compute the joint probabilities
P(x4, x3, x2) and P(x6, x3) and then marginalize x3 and x2 out of P(x4, x3, x2)
and x3 out of P(x6, x3) as follows:
P(x4, x3, x2) = P(x4|x3, x2)P(x3)P(x2):

P(x4 = 0, x3 = 0, x2 = 0) = P(x4 = 0|x3 = 0, x2 = 0)P(x3 = 0)P(x2 = 0) = (0.7)(0.8)(0.8) = 0.448
P(x4 = 0, x3 = 1, x2 = 0) = P(x4 = 0|x3 = 1, x2 = 0)P(x3 = 1)P(x2 = 0) = (0.1)(0.2)(0.8) = 0.016
P(x4 = 1, x3 = 0, x2 = 0) = P(x4 = 1|x3 = 0, x2 = 0)P(x3 = 0)P(x2 = 0) = (0.3)(0.8)(0.8) = 0.192
P(x4 = 1, x3 = 1, x2 = 0) = P(x4 = 1|x3 = 1, x2 = 0)P(x3 = 1)P(x2 = 0) = (0.9)(0.2)(0.8) = 0.144
P(x4 = 0, x3 = 0, x2 = 1) = P(x4 = 0|x3 = 0, x2 = 1)P(x3 = 0)P(x2 = 1) = (0.1)(0.8)(0.2) = 0.016
P(x4 = 0, x3 = 1, x2 = 1) = P(x4 = 0|x3 = 1, x2 = 1)P(x3 = 1)P(x2 = 1) = (0.1)(0.2)(0.2) = 0.004
P(x4 = 1, x3 = 0, x2 = 1) = P(x4 = 1|x3 = 0, x2 = 1)P(x3 = 0)P(x2 = 1) = (0.9)(0.8)(0.2) = 0.144
P(x4 = 1, x3 = 1, x2 = 1) = P(x4 = 1|x3 = 1, x2 = 1)P(x3 = 1)P(x2 = 1) = (0.9)(0.2)(0.2) = 0.036
P(x4 = 0) = P(x4 = 0, x3 = 0, x2 = 0) + P(x4 = 0, x3 = 1, x2 = 0) + P(x4 = 0, x3 = 0, x2 = 1) + P(x4 = 0, x3 = 1, x2 = 1) = 0.448 + 0.016 + 0.016 + 0.004 = 0.484
P(x4 = 1) = P(x4 = 1, x3 = 0, x2 = 0) + P(x4 = 1, x3 = 1, x2 = 0) + P(x4 = 1, x3 = 0, x2 = 1) + P(x4 = 1, x3 = 1, x2 = 1) = 0.192 + 0.144 + 0.144 + 0.036 = 0.516
P(x6 = 0) = P(x6 = 0, x3 = 0) + P(x6 = 0, x3 = 1) = 0.56 + 0.02 = 0.58
P(x6 = 1) = P(x6 = 1, x3 = 0) + P(x6 = 1, x3 = 1) = 0.24 + 0.18 = 0.42.

We then compute each entry of P(x4, x3, x2|x6 = 1) = P(x4|x3, x2)[P(x6 = 1|x3)P(x3)/P(x6 = 1)]P(x2):
P(x4 = 0, x3 = 0, x2 = 0|x6 = 1) = (0.7)[(0.3)(0.8)/0.42](0.8) = 0.320
P(x4 = 0, x3 = 1, x2 = 0|x6 = 1) = (0.1)[(0.9)(0.2)/0.42](0.8) = 0.034
P(x4 = 1, x3 = 0, x2 = 0|x6 = 1) = (0.3)[(0.3)(0.8)/0.42](0.8) = 0.137
P(x4 = 1, x3 = 1, x2 = 0|x6 = 1) = (0.9)[(0.9)(0.2)/0.42](0.8) = 0.309
P(x4 = 0, x3 = 0, x2 = 1|x6 = 1) = (0.1)[(0.3)(0.8)/0.42](0.2) = 0.011
P(x4 = 0, x3 = 1, x2 = 1|x6 = 1) = (0.1)[(0.9)(0.2)/0.42](0.2) = 0.009
P(x4 = 1, x3 = 0, x2 = 1|x6 = 1) = (0.9)[(0.3)(0.8)/0.42](0.2) = 0.103
P(x4 = 1, x3 = 1, x2 = 1|x6 = 1) = (0.9)[(0.9)(0.2)/0.42](0.2) = 0.077
Marginalizing over the other variables, we obtain the posterior probabilities asked for in this example:

P(x4 = 1|x6 = 1) = P(x4 = 1, x3 = 0, x2 = 0|x6 = 1) + P(x4 = 1, x3 = 1, x2 = 0|x6 = 1) + P(x4 = 1, x3 = 0, x2 = 1|x6 = 1) + P(x4 = 1, x3 = 1, x2 = 1|x6 = 1) = 0.137 + 0.309 + 0.103 + 0.077 = 0.626

P(x3 = 1|x6 = 1) = P(x4 = 0, x3 = 1, x2 = 0|x6 = 1) + P(x4 = 1, x3 = 1, x2 = 0|x6 = 1) + P(x4 = 0, x3 = 1, x2 = 1|x6 = 1) + P(x4 = 1, x3 = 1, x2 = 1|x6 = 1) = 0.034 + 0.309 + 0.009 + 0.077 = 0.429

P(x2 = 1|x6 = 1) = P(x4 = 0, x3 = 0, x2 = 1|x6 = 1) + P(x4 = 0, x3 = 1, x2 = 1|x6 = 1) + P(x4 = 1, x3 = 0, x2 = 1|x6 = 1) + P(x4 = 1, x3 = 1, x2 = 1|x6 = 1) = 0.011 + 0.009 + 0.103 + 0.077 = 0.200.

Hence, compared with the prior probabilities P(x4 = 1) = 0.516, P(x3 = 1) = 0.2, and P(x2 = 1) = 0.2, the evidence x6 = 1 raises the probabilities of x4 = 1 and x3 = 1 but leaves the probability of x2 = 1 unchanged.
Example 13.3
Continuing with all the updated posterior probabilities for the evidence
of x6 = 1 from Example 13.2, we now obtain a new evidence of x4 = 1.
What are the posterior probabilities of x2 = 1 and x3 = 1? In other words,
starting with all the updated probabilities from Example 13.2, what are
P(x3 = 1|x4 = 1) and P(x2 = 1|x4 = 1)?
The probabilistic inference proceeds by computing P(x3, x2|x4 = 1) = P(x4 = 1|x3, x2)P(x3)P(x2)/P(x4 = 1), using the updated probabilities P(x3 = 1) = 0.429, P(x2 = 1) = 0.2, and P(x4 = 1) = 0.626 from Example 13.2:

P(x3 = 0, x2 = 0|x4 = 1) = (0.3)(1 − 0.429)(1 − 0.2)/(0.626) = 0.219
P(x3 = 0, x2 = 1|x4 = 1) = (0.9)(1 − 0.429)(0.2)/(0.626) = 0.164
P(x3 = 1, x2 = 0|x4 = 1) = (0.9)(0.429)(1 − 0.2)/(0.626) = 0.494
P(x3 = 1, x2 = 1|x4 = 1) = (0.9)(0.429)(0.2)/(0.626) = 0.123.

We obtain P(x3 = 1|x4 = 1) by marginalizing x2 out of P(x3, x2|x4 = 1):

P(x3 = 1|x4 = 1) = P(x3 = 1, x2 = 0|x4 = 1) + P(x3 = 1, x2 = 1|x4 = 1) = 0.494 + 0.123 = 0.617.
Since x3 affects both x6 and x4, we raise the probability of x3 = 1 from 0.2
to 0.429 when we have the evidence of x6 = 1, and then we raise the prob-
ability of x3 = 1 again from 0.429 to 0.617 when we have the evidence of
x4 = 1.
We obtain P(x2 = 1|x4 = 1) by marginalizing x3 out of P(x3, x2|x4 = 1):

P(x2 = 1|x4 = 1) = P(x3 = 0, x2 = 1|x4 = 1) + P(x3 = 1, x2 = 1|x4 = 1) = 0.164 + 0.123 = 0.287.
Since x2 affects x4 but not x6, the probability of x2 = 1 remains the same at 0.2 when we have the evidence on x6 = 1, and then we raise the probability of x2 = 1 from 0.2 to 0.287 when we have the evidence on x4 = 1. It is not a big increase since x3 = 1 may also produce the evidence on x4 = 1.
Given a training data set D, the prior and conditional probabilities in a Bayesian network can be estimated by counting:

P(x = a) = Nx=a / N    (13.7)

P(x = a|z = b) = Nx=a&z=b / Nz=b,    (13.8)
where
N is the number of data points in the data set
Nx = a is the number of data points with x = a
Nz = b is the number of data points with z = b
Nx = a&z = b is the number of data points with x = a and z = b
The conditional probabilities wij = P(xi|zj) in a Bayesian network can also be learned by a gradient ascent method that maximizes ln P(D|h), where h denotes the hypothesis consisting of the wij values:

wij(t + 1) = wij(t) + α ∂ln P(D|h)/∂wij,    (13.9)
where α is the learning rate. Denoting P(D|h) by Ph(D) and using ∂lnf(x)/∂x =
[1/f(x)][∂f(x)/∂x], we have
∂ln Ph(D)/∂wij = Σ_{d∈D} [1/Ph(d)] ∂Ph(d)/∂wij
= Σ_{d∈D} [1/Ph(d)] ∂[Σ_{i′,j′} Ph(d|xi′, zj′)Ph(xi′, zj′)]/∂wij
= Σ_{d∈D} [1/Ph(d)] ∂[Σ_{i′,j′} Ph(d|xi′, zj′)wi′j′Ph(zj′)]/∂wij
= Σ_{d∈D} [1/Ph(d)] Ph(d|xi, zj)Ph(zj)
= Σ_{d∈D} [1/Ph(d)] [Ph(xi, zj|d)Ph(d)/Ph(xi, zj)] Ph(zj)
= Σ_{d∈D} Ph(xi, zj|d)Ph(zj)/Ph(xi, zj)
= Σ_{d∈D} Ph(xi, zj|d)/Ph(xi|zj)
= Σ_{d∈D} Ph(xi, zj|d)/wij,    (13.10)
where Ph(xi, zj|d) can be obtained using the probabilistic inference described
in Section 13.2. After using Equation 13.11 to update wij, we need to ensure
Σ_i wij(t + 1) = 1,    (13.12)

which is achieved by normalizing the updated weights:

wij(t + 1) = wij(t + 1) / Σ_i wij(t + 1).    (13.13)
Exercises
13.1 Consider the Bayesian network in Figure 13.2 and the probability distributions in Tables 13.2 through 13.13. Given x6 = 1, what is the probability of x7 = 1? In other words, what is P(x7 = 1|x6 = 1)?
13.2 Continuing with all the updated posterior probabilities for the evidence of x6 = 1 from Example 13.2 and Exercise 13.1, we now obtain a new evidence of x4 = 1. What is the posterior probability of x7 = 1? In other words, what is P(x7 = 1|x4 = 1)?
13.3 Repeat Exercise 13.1 to determine P(x1 = 1|x6 = 1).
13.4 Repeat Exercise 13.2 to determine P(x1 = 1|x4 = 1).
13.5 Repeat Exercise 13.1 to determine P(y = 1|x6 = 1).
13.6 Repeat Exercise 13.2 to determine P(y = 1|x4 = 1).
Part IV

14
Principal Component Analysis
If xi is a continuous random variable with the probability density function fi(xi), the mean and variance of xi are defined as follows:

ui = E(xi) = ∫_{−∞}^{∞} xi fi(xi) dxi    (14.1)

σi² = ∫_{−∞}^{∞} (xi − ui)² fi(xi) dxi.    (14.2)
If xi is a discrete random variable with the probability function P(xi), the mean and variance of xi are defined as follows:

ui = E(xi) = Σ_{all values of xi} xi P(xi)    (14.3)

σi² = Σ_{all values of xi} (xi − ui)² P(xi).    (14.4)
If xi and xj are continuous random variables with the joint probability density
function fij(xi, xj), the covariance of two random variables, xi and xj, is defined
as follows:
σij = E[(xi − μi)(xj − μj)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (xi − ui)(xj − uj) fij(xi, xj) dxi dxj.    (14.5)
If xi and xj are discrete random variables with the joint probability density
function P(xi, xj),
σij = E[(xi − μi)(xj − μj)] = Σ_{all values of xi} Σ_{all values of xj} (xi − μi)(xj − μj) P(xi, xj).    (14.6)
The correlation coefficient is
ρij = σij / (σi σj).    (14.7)
For a vector of random variables, x = (x1, x2, …, xp), the mean vector is:
E(x) = (E(x1), E(x2), …, E(xp))′ = (μ1, μ2, …, μp)′ = μ,    (14.8)
and the variance–covariance matrix is

Σ = E[(x − μ)(x − μ)′],

whose (i, j)th entry is E[(xi − μi)(xj − μj)] = σij and whose ith diagonal entry is E[(xi − μi)²] = σi², that is,

Σ = [ σ1²  σ12  …  σ1p
      σ21  σ2²  …  σ2p
      …    …    …   …
      σp1  σp2  …  σp² ].    (14.9)
Example 14.1
Compute the mean vector and variance–covariance matrix of two vari-
ables in Table 14.1.
The data set in Table 14.1 is a part of the data set for the manufactur-
ing system in Table 1.4 and includes two attribute variables, x7 and x8,
for nine cases of single-machine faults. Table 14.2 shows the joint and
marginal probabilities of these two variables.
The mean and variance of x7 are
u7 = E(x7) = Σ_{all values of x7} x7 P(x7) = 0 × 4/9 + 1 × 5/9 = 5/9

σ7² = Σ_{all values of x7} (x7 − u7)² P(x7) = (0 − 5/9)² × 4/9 + (1 − 5/9)² × 5/9 = 0.2469.
Table 14.1
Data Set for System Fault Detection
with Two Quality Variables
Instance
(Faulty Machine) x7 x8
1 (M1) 1 0
2 (M2) 0 1
3 (M3) 1 1
4 (M4) 0 1
5 (M5) 1 0
6 (M6) 1 0
7 (M7) 1 0
8 (M8) 0 1
9 (M9) 0 0
Table 14.2
Joint and Marginal Probabilities of Two Quality Variables

P(x7, x8)    x8 = 0              x8 = 1              P(x7)
x7 = 0       1/9                 3/9                 1/9 + 3/9 = 4/9
x7 = 1       4/9                 1/9                 4/9 + 1/9 = 5/9
P(x8)        1/9 + 4/9 = 5/9     3/9 + 1/9 = 4/9     1
u8 = E(x8) = Σ_{all values of x8} x8 P(x8) = 0 × 5/9 + 1 × 4/9 = 4/9

σ8² = Σ_{all values of x8} (x8 − u8)² P(x8) = (0 − 4/9)² × 5/9 + (1 − 4/9)² × 4/9 = 0.2469

σ78 = Σ_{all values of x7} Σ_{all values of x8} (x7 − μ7)(x8 − μ8) P(x7, x8)
    = (0 − 5/9)(0 − 4/9)(1/9) + (0 − 5/9)(1 − 4/9)(3/9) + (1 − 5/9)(0 − 4/9)(4/9) + (1 − 5/9)(1 − 4/9)(1/9) = −0.1358.
Hence, the mean vector and the variance–covariance matrix are

μ = (μ7, μ8)′ = (5/9, 4/9)′

Σ = [ 0.2469  −0.1358
      −0.1358   0.2469 ].
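The computations of Example 14.1 can be checked with NumPy; the sketch below (not from the book) uses the population form of the covariance (dividing by n rather than n − 1), which is what the probability-based definitions above correspond to.

```python
# Illustrative check of Example 14.1 with NumPy.
import numpy as np

x7 = np.array([1, 0, 1, 0, 1, 1, 1, 0, 0])  # Table 14.1
x8 = np.array([0, 1, 1, 1, 0, 0, 0, 1, 0])

mean_vector = np.array([x7.mean(), x8.mean()])        # [5/9, 4/9]
cov_matrix = np.cov(np.vstack([x7, x8]), bias=True)   # divide by n, not n - 1

print(mean_vector)  # approximately [0.556, 0.444]
print(cov_matrix)   # approximately [[0.2469, -0.1358], [-0.1358, 0.2469]]
```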
Consider a vector of p variables:

x = (x1, x2, …, xp)′,  x′ = (x1, x2, …, xp).    (14.10)

x1, x2, …, xp are linearly dependent if there exists a set of constants, c1, c2, …, cp, not all zero, which makes the following equation hold:

c1x1 + c2x2 + ⋯ + cpxp = 0.    (14.11)

Similarly, x1, x2, …, xp are linearly independent if only one set of constants, c1 = c2 = ⋯ = cp = 0, makes the following equation hold:

c1x1 + c2x2 + ⋯ + cpxp = 0.    (14.12)
The length of a vector x = (x1, x2)′, illustrated in Figure 14.1, is Lx = (x1² + x2²)^{1/2}. Figure 14.2 illustrates the angle θ between two vectors x = (x1, x2)′ and y = (y1, y2)′, where θ1 and θ2 denote the angles that x and y form with the horizontal axis:

cos(θ1) = x1/Lx    (14.14)

sin(θ1) = x2/Lx    (14.15)

cos(θ2) = y1/Ly    (14.16)

sin(θ2) = y2/Ly    (14.17)

Figure 14.1
Computation of the length of a vector.

Figure 14.2
Computation of the angle between two vectors.

cos(θ) = cos(θ2 − θ1) = cos(θ2)cos(θ1) + sin(θ2)sin(θ1)
       = (y1/Ly)(x1/Lx) + (y2/Ly)(x2/Lx) = (x1y1 + x2y2)/(LxLy) = x′y/(LxLy).    (14.18)
Based on the computation of the angle between two vectors, x′ and y′, the
two vectors are orthogonal, that is, θ = 90° or 270°, or cos(θ) = 0, only if x′y = 0.
A p × p square matrix, A, is symmetric if A = A′, that is, aij = aji, for i = 1, …, p, and j = 1, …, p. An identity matrix is the following:

I = [ 1  0  …  0
      0  1  …  0
      …
      0  0  …  1 ],

and we have

AI = IA = A.    (14.19)

The inverse of a square matrix A, denoted by A⁻¹, satisfies

AA⁻¹ = A⁻¹A = I.    (14.20)
The determinant of a p × p square matrix A, denoted by |A|, is computed as follows:

|A| = a11 if p = 1    (14.21)

|A| = Σ_{j=1}^{p} a1j |A1j| (−1)^{1+j} = Σ_{j=1}^{p} aij |Aij| (−1)^{i+j} if p > 1,    (14.22)

where
A1j is the (p − 1) × (p − 1) matrix obtained by removing the first row and the jth column of A
Aij is the (p − 1) × (p − 1) matrix obtained by removing the ith row and the jth column of A.

For a 2 × 2 matrix

A = [ a11  a12
      a21  a22 ],

the determinant of A is

|A| = Σ_{j=1}^{2} a1j |A1j| (−1)^{1+j} = a11|A11|(−1)^{1+1} + a12|A12|(−1)^{1+2} = a11a22 − a12a21.    (14.23)

The determinant of the identity matrix is

|I| = 1.    (14.24)
For example, for the matrix obtained in Example 14.1,

|A| = | 0.2469  −0.1358; −0.1358  0.2469 | = 0.2469 × 0.2469 − (−0.1358)(−0.1358) = 0.0425.

The eigenvalues of a p × p square matrix A are the solutions λ of

|A − λI| = 0.    (14.25)
Example 14.2
Compute the eigenvalues of the following matrix A, which is obtained from Example 14.1:

A = [ 0.2469  −0.1358
      −0.1358   0.2469 ]

|A − λI| = λ² − 0.4938λ + 0.0426 = 0

λ1 = 0.3824    λ2 = 0.1115.
The eigenvector x associated with an eigenvalue λ of A satisfies

Ax = λx,    (14.26)

and the corresponding unit-length eigenvector is

e = x / √(x′x).    (14.27)
Example 14.3
Compute the eigenvectors associated with the eigenvalues in Example 14.2.
The eigenvectors associated with the eigenvalues λ 1 = 0.3824 and λ 2 = 0.1115
of the following square matrix A in Example 14.2 are computed next:
A = [ 0.2469  −0.1358
      −0.1358   0.2469 ]

For λ1, Ax = λ1x gives

[ 0.2469  −0.1358 ] [x1]  =  0.3824 [x1]
[ −0.1358   0.2469 ] [x2]           [x2],

that is,

0.1355x1 + 0.1358x2 = 0
0.1358x1 + 0.1355x2 = 0.

The two equations are identical. Hence, there are many solutions. Setting x1 = 1 and x2 = −1, we have

x = (1, −1)′    e = (1/√2, −1/√2)′.

For λ2, Ax = λ2x gives

[ 0.2469  −0.1358 ] [x1]  =  0.1115 [x1]
[ −0.1358   0.2469 ] [x2]           [x2],

that is,

0.1354x1 − 0.1358x2 = 0
0.1358x1 − 0.1354x2 = 0.

The aforementioned two equations are identical and thus have many solutions. Setting x1 = 1 and x2 = 1, we have

x = (1, 1)′    e = (1/√2, 1/√2)′.
In this example, the two eigenvectors associated with the two eigenval-
ues are chosen such that they are orthogonal.
A symmetric p × p matrix A with eigenvalues λ1, …, λp and associated unit eigenvectors e1, …, ep has the spectral decomposition

A = Σ_{i=1}^{p} λi ei ei′.    (14.28)
Example 14.4
Compute the spectral decomposition of the matrix in Examples 14.2 and 14.3.
The spectral decomposition of the following symmetric matrix in Examples 14.2 and 14.3 is illustrated next:

A = [ 0.2469  −0.1358
      −0.1358   0.2469 ]

λ1 = 0.3824    λ2 = 0.1115

e1 = (1/√2, −1/√2)′    e2 = (1/√2, 1/√2)′

[ 0.2469  −0.1358 ]  =  0.3824 [  1/√2 ] [ 1/√2  −1/√2 ]  +  0.1115 [ 1/√2 ] [ 1/√2  1/√2 ],
[ −0.1358   0.2469 ]           [ −1/√2 ]                           [ 1/√2 ]

that is, A = λ1e1e1′ + λ2e2e2′.
A p × p symmetric matrix A is positive definite if for any nonzero vector x,

x′Ax > 0.

A symmetric matrix is positive definite if and only if all of its eigenvalues are positive. For example, the matrix

A = [ 0.2469  −0.1358
      −0.1358   0.2469 ]

is positive definite since λ1 = 0.3824 > 0 and λ2 = 0.1115 > 0.

With the eigenvalues ordered λ1 ≥ λ2 ≥ ⋯ ≥ λp, the eigenvalues and eigenvectors of A give the extreme values of the quadratic form x′Ax/x′x:

max_{x≠0} x′Ax/x′x = λ1, attained by x = e1, or e1′Ae1 = e1′(Σ_{i=1}^{p} λi ei ei′)e1 = λ1 = max_{x≠0} x′Ax/x′x    (14.29)

min_{x≠0} x′Ax/x′x = λp, attained by x = ep, or ep′Aep = ep′(Σ_{i=1}^{p} λi ei ei′)ep = λp = min_{x≠0} x′Ax/x′x    (14.30)

and

max_{x⊥e1,…,ei} x′Ax/x′x = λ_{i+1}, attained by x = e_{i+1}, i = 1, …, p − 1.    (14.31)
Given the variance–covariance matrix Σ of x = (x1, …, xp)′ with eigenvalues λ1 ≥ λ2 ≥ ⋯ ≥ λp ≥ 0 and associated unit eigenvectors e1, …, ep (ei′ei = 1), the principal components of x are the linear combinations

yi = ei′x,  i = 1, …, p.    (14.36)

The principal components are uncorrelated, var(yi) = λi, and the total variance is preserved:

Σ_{i=1}^{p} var(xi) = σ1² + ⋯ + σp² = Σ_{i=1}^{p} var(yi) = λ1 + ⋯ + λp.    (14.37)
Example 14.5
Determine the principal components of the two variables in Example 14.1.
For the two variables x′ = [x7, x8] in Table 14.1 and Example 14.1, the
variance–covariance matrix Σ is
Σ = [ 0.2469  −0.1358
      −0.1358   0.2469 ]

with eigenvalues and unit eigenvectors

λ1 = 0.3824    λ2 = 0.1115

e1 = (1/√2, −1/√2)′    e2 = (1/√2, 1/√2)′.

The principal components are

y1 = e1′x = (1/√2)x7 − (1/√2)x8

y2 = e2′x = (1/√2)x7 + (1/√2)x8.
var(y1) = var[(1/√2)x7 − (1/√2)x8]
        = (1/√2)² var(x7) + (−1/√2)² var(x8) + 2(1/√2)(−1/√2) cov(x7, x8)
        = (1/2)(0.2469) + (1/2)(0.2469) − (−0.1358) = 0.3827 = λ1

var(y2) = var[(1/√2)x7 + (1/√2)x8]
        = (1/√2)² var(x7) + (1/√2)² var(x8) + 2(1/√2)(1/√2) cov(x7, x8)
        = (1/2)(0.2469) + (1/2)(0.2469) + (−0.1358) = 0.1111 = λ2.
We also have

var(y1) + var(y2) = λ1 + λ2 = 0.3824 + 0.1115 = 0.4939 ≈ var(x7) + var(x8) = 0.2469 + 0.2469 = 0.4938.
The proportion of the total variance accounted for by the first principal component y1 is 0.3824/0.4939 = 0.7742, or 77%. Since most of the total variance in x′ = [x7, x8] is accounted for by y1, we may use y1 to replace and represent the two original variables x7 and x8 without losing much of the variance. This is the basis of applying PCA to reduce the dimensions of data by using a few principal components to represent a large number of variables in the original data while still accounting for much of the variance in the data. Using a few principal components to represent the data, the data can be further visualized in a one-, two-, or three-dimensional space of the principal components to observe data patterns, or can be mined or analyzed to uncover data patterns of principal components. Note that the mathematical meaning of each principal component as a linear combination of the original variables does not necessarily have a meaningful interpretation in the problem domain. Ye (1997, 1998) shows examples of interpreting data that are not represented in their original problem domain.
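The eigenvalues, eigenvectors, and variance proportions of this example can be reproduced with NumPy; the following sketch (not from the book) also shows one way to project the data of Table 14.1 onto the principal components (the signs of the eigenvectors returned by the library may differ from those chosen above).

```python
# Illustrative PCA sketch for the variance-covariance matrix of Example 14.1.
import numpy as np

cov = np.array([[0.2469, -0.1358],
                [-0.1358, 0.2469]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: symmetric matrices
order = np.argsort(eigenvalues)[::-1]             # sort largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues)                        # approximately [0.3827, 0.1111]
print(eigenvectors)                       # columns e1, e2 (up to sign)
print(eigenvalues[0] / eigenvalues.sum())  # proportion of variance of y1, ~0.77

# Projecting the data of Table 14.1 onto the principal components:
data = np.array([[1, 0], [0, 1], [1, 1], [0, 1], [1, 0],
                 [1, 0], [1, 0], [0, 1], [0, 0]], dtype=float)
scores = (data - data.mean(axis=0)) @ eigenvectors
print(scores[:3])
```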
Exercises
14.1 Determine the nine principal components of x1, …, x9 in Table 8.1 and
identify the principal components that can be used to account for 90%
of the total variances of data.
14.2 Determine the principal components of x1 and x2 in Table 3.2.
14.3 Repeat Exercise 14.2 for x1, …, x9, and identify the principal components
that can be used to account for 90% of the total variances of data.
15
Multidimensional Scaling
An MDS problem starts with the dissimilarities δij between each pair of n data items in a p-dimensional space, arranged in order from the smallest to the largest:

δ_{i1 j1} ≤ δ_{i2 j2} ≤ ⋯ ≤ δ_{iM jM},    (15.1)

where M denotes the total number of different data pairs, and M = n(n − 1)/2 for n data items. MDS (Young and Hamer, 1987) is to find coordinates of the n data items in a q-dimensional space, zi = (zi1, …, ziq), i = 1, …, n, with q being much smaller than p, while preserving the dissimilarities of the n data items given in Equation 15.1. MDS is nonmetric if only the rank order of the dissimilarities in Equation 15.1 is preserved. Metric MDS goes further to preserve the magnitudes of the dissimilarities. This chapter describes nonmetric MDS.
Table 15.1 gives the steps of the MDS algorithm to find coordinates of the n
data items in the q-dimensional space, while preserving the dissimilarities of
n data points given in Equation 15.1. In Step 1 of the MDS algorithm, the ini-
tial configuration for coordinates of n data points in the q-dimensional space
is generated using random values so that no two data points are the same.
In Step 2 of the MDS algorithm, the following is used to normalize
xi = (xi1, …, xiq), i = 1, …, n:
normalized xij = xij / √(xi1² + ⋯ + xiq²).    (15.2)
Table 15.1
MDS Algorithm
Step Description
1 Generate an initial configuration for the coordinates of n data
points in the q-dimensional space, (x11, …, x1q, …., xn1, …, xnq),
such that no two points are the same
2 Normalize xi = (xi1, …, xiq), i = 1, …, n, such that the vector for
each data point has the unit length using Equation 15.2
3 Compute S as the stress of the configuration using Equation 15.3
4 REPEAT UNTIL a stopping criterion based on S is satisfied
5 Update the configuration using the gradient descent method
and Equations 15.14 through 15.18
6 Normalize xi = (xi1, …, xiq), i = 1, …, n, in the configuration
using Equation 15.2
7 Compute S of the updated configuration using Equation 15.3
In Step 3 of the MDS algorithm, the following is used to compute the stress
of the configuration that measures how well the configuration preserves the
dissimilarities of n data points given in Equation 15.1 (Kruskal, 1964a,b):
S = √[ Σ_{ij} (dij − d̂ij)² / Σ_{ij} dij² ],    (15.3)
Note that there are n(n − 1)/2 different pairs of i and j in Equations 15.3 and 15.4.
The Euclidean distance shown in Equation 15.5, the more general
Minkowski r-metric distance shown in Equation 15.6, or some other dissimi-
larity measure can be used to compute dij:
dij = √[ Σ_{k=1}^{q} (xik − xjk)² ]    (15.5)

dij = [ Σ_{k=1}^{q} |xik − xjk|^r ]^{1/r}.    (15.6)
d̂ijs are predicted from δijs by using a monotone regression algorithm described in Kruskal (1964a,b) to produce

d̂_{i1 j1} ≤ d̂_{i2 j2} ≤ ⋯ ≤ d̂_{iM jM},    (15.7)

given Equation 15.1

δ_{i1 j1} ≤ δ_{i2 j2} ≤ ⋯ ≤ δ_{iM jM}.

Table 15.2 describes the steps of the monotone regression algorithm, assuming that there are no ties (equal values) among δijs. In Step 2 of the monotone regression algorithm, d̂_{Bm} for the block Bm is computed using the average of the dijs in Bm:

d̂_{Bm} = Σ_{dij ∈ Bm} dij / Nm,    (15.8)

where Nm is the number of dijs in Bm. If Bm has only one dij, d̂_{Bm} = dij.
Table 15.2
Monotone Regression Algorithm

Step  Description
1     Arrange δ_{im jm}, m = 1, …, M, in the order from the smallest to the largest
2     Generate the initial M blocks in the same order as in Step 1, B1, …, BM, such that each block, Bm, has only one dissimilarity value, d_{im jm}, and compute d̂_{Bm} using Equation 15.8
3     Make the lowest block the active block, and also make it up-active; denote B as the active block, B− as the next lower block of B, and B+ as the next higher block of B
4     WHILE the active block B is not the highest block
5       IF d̂_{B−} < d̂_B < d̂_{B+} /* B is both down-satisfied and up-satisfied; note that the lowest block is already down-satisfied and the highest block is already up-satisfied */
6         Make the next higher block of B the active block, and make it up-active
7       ELSE
8         IF B is up-active
9           IF d̂_B < d̂_{B+} /* B is up-satisfied */
10            Make B down-active
11          ELSE
12            Merge B and B+ to form a new larger block that replaces B and B+
13            Make the new block the active block, and make it down-active
14        ELSE /* B is down-active */
15          IF d̂_{B−} < d̂_B /* B is down-satisfied */
16            Make B up-active
17          ELSE
18            Merge B− and B to form a new larger block that replaces B− and B
19            Make the new block the active block, and make it up-active
20    d̂ij = d̂_B, for each dij ∈ B and for each block B in the final sequence of the blocks
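A simple pool-adjacent-violators style implementation of monotone regression is sketched below (not the book's code); repeatedly merging adjacent blocks whose averages violate the required order yields the nondecreasing disparities d̂ used in Equation 15.3.

```python
# Illustrative monotone regression sketch: d must be listed in the order of
# increasing dissimilarity delta; the result is a nondecreasing sequence of
# block averages, as in Table 15.2.
def monotone_regression(d):
    blocks = [[value] for value in d]              # one block per distance
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks) - 1):
            if (sum(blocks[i]) / len(blocks[i])
                    > sum(blocks[i + 1]) / len(blocks[i + 1])):
                blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]  # merge violators
                merged = True
                break
    d_hat = []
    for block in blocks:
        d_hat.extend([sum(block) / len(block)] * len(block))
    return d_hat

# Distances of Example 15.1 in the order delta_23 < delta_13 < delta_12:
print(monotone_regression([1.05, 0.32, 0.77]))  # approximately [0.69, 0.69, 0.77]
```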
In Step 1 of the monotone regression algorithm, if there are ties among δijs,
these δijs with the equal value are arranged in the increasing order of their
corresponding dijs in the q-dimensional space (Kruskal, 1964a,b). Another
method of handling ties among δijs is to let these δijs with the equal value
form one single block with their corresponding dijs in this block.
After using the monotone regression method to obtain d̂ijs, we use Equation
15.3 to compute the stress of the configuration in Step 3 of the MDS algo-
rithm. The smaller the S value is, the better the configuration preserves the
dissimilarity order in Equation 15.1. Kruskal (1964a,b) considers the S value
of 20% indicating a poor fit of the configuration to the dissimilarity order
in Equation 15.1, the S value of 10% indicating a fair fit, the S value of 5%
indicating a good fit, the S value of 2.5% indicating an excellent fit, and the
S value of 0% indicating the best fit. Step 4 of the MDS algorithm evalu-
ates the goodness-of-fit using the S value of the configuration. If the S value
of the configuration is not acceptable, Step 5 of the MDS algorithm changes
the configuration to improve the goodness-of-fit using the gradient descent
method. Step 6 of the MDS algorithm normalizes the vector of each data
point in the updated configuration. Step 7 of the MDS algorithm computes
the S value of the updated configuration.
In Step 4 of the MDS algorithm, a threshold of goodness-of-fit can be set
and used such that the configuration is considered acceptable if S of the con-
figuration is less than or equal to the threshold of goodness-of-fit. Hence, a
stopping criterion in Step 4 of the MDS algorithm is having S less than or
equal to the threshold of goodness-of-fit. If there is little change in S, that
is, S levels off after iterations of updating the configuration, the procedure
of updating the configuration can be stopped too. Hence, the change of S,
which is smaller than a threshold, is another stopping criterion that can be
used in Step 4 of the MDS algorithm.
The gradient descent method of updating the configuration in Step 5 of the
MDS algorithm is similar to the gradient descent method used for updating
connection weights in the back-propagation learning of artificial neural net-
works in Chapter 5. The objective of updating the configuration, (x11, …, x1q, …,
xn1, …, xnq), is to minimize the stress of the configuration in Equation 15.3,
which is shown next:
S = √[ Σ_{ij} (dij − d̂ij)² / Σ_{ij} dij² ] = √(S*/T*),    (15.9)

where

S* = Σ_{ij} (dij − d̂ij)²    (15.10)
T* = Σ_{ij} dij².    (15.11)

The gradient of S with respect to each coordinate xkl of the configuration is used to update the configuration:

gkl = −∂S/∂xkl,    (15.13)

xkl(t + 1) = xkl(t) + αΔxkl = xkl(t) + α gkl / √[ (Σ_{k,l} gkl²)/n ].    (15.14)
Kruskal (1964a,b) gives the following formula to compute gkl if dij is computed using the Minkowski r-metric distance:

gkl = −∂S/∂xkl = S Σ_{i,j} (ρki − ρkj) [ (dij − d̂ij)/S* − dij/T* ] [ |xil − xjl|^{r−1} / dij^{r−1} ] sign(xil − xjl),    (15.15)

where

ρki = 1 if k = i, and ρki = 0 if k ≠ i    (15.16)

sign(xil − xjl) = 1 if xil − xjl > 0, −1 if xil − xjl < 0, and 0 if xil − xjl = 0.    (15.17)
If r = 2 in Formula 15.15, that is, the Euclidean distance is used to compute dij,

gkl = S Σ_{i,j} (ρki − ρkj) [ (dij − d̂ij)/S* − dij/T* ] (xil − xjl)/dij.    (15.18)
Example 15.1
Table 15.3 gives three data records of nine quality variables, which is a
part of Table 8.1. Table 15.4 gives the Euclidean distance for each pair of
the three data points in the nine-dimensional space. This Euclidean dis-
tance for a pair of data point, xi and xj, is taken as δij. Perform the MDS of
this data set with only one iteration of the configuration update for q = 2,
the stopping criterion of S ≤ 5%, and α = 0.2.
This data set has three data points, n = 3, in a nine-dimensional space.
We have δ12 = 2.65, δ13 = 2.65, and δ23 = 2. In Step 1 of the MDS algorithm
described in Table 15.1, we generate an initial configuration of the three
data points in the two-dimensional space:
x1 = (x11/√(x11² + x12²), x12/√(x11² + x12²)) = (1/√(1² + 1²), 1/√(1² + 1²)) = (0.71, 0.71)
Table 15.3
Data Set for System Fault Detection with Three Cases
of Single-Machine Faults
Attribute Variables about Quality of Parts
Instance (Faulty
Machine) x1 x2 x3 x4 x5 x6 x7 x8 x9
1 (M1) 1 0 0 0 1 0 1 0 1
2 (M2) 0 1 0 1 0 0 0 1 0
3 (M3) 0 0 1 1 0 1 1 1 0
Table 15.4
Euclidean Distance for Each Pair
of Data Points
C1 = {x1} C2 = {x2} C3 = {x3}
C1 = {x1} 2.65 2.65
C2 = {x2} 2
C3 = {x3}
x2 = (x21/√(x21² + x22²), x22/√(x21² + x22²)) = (0/√(0² + 1²), 1/√(0² + 1²)) = (0, 1)

x3 = (x31/√(x31² + x32²), x32/√(x31² + x32²)) = (0.89, 0.45).

The distance between each pair of the three data points in the two-dimensional space is computed using their initial coordinates:

d12 = √[(x11 − x21)² + (x12 − x22)²] = √[(0.71 − 0)² + (0.71 − 1)²] = 0.77

d13 = √[(x11 − x31)² + (x12 − x32)²] = √[(0.71 − 0.89)² + (0.71 − 0.45)²] = 0.32

d23 = √[(x21 − x31)² + (x22 − x32)²] = √[(0 − 0.89)² + (1 − 0.45)²] = 1.05.

Arranging the dissimilarities in the order from the smallest to the largest, we have

δ23 < δ12 = δ13.
Since there is a tie between δ12 and δ13, δ12 and δ13 are arranged in the
increasing order of d13 = 0.32 and d12 = 0.77:
δ 23 < δ 13 < δ 12 .
In Step 2 of the monotone regression algorithm, the initial blocks are B1 = {d23}, B2 = {d13}, and B3 = {d12}, in the order of the corresponding dissimilarities, and

d̂_B1 = Σ_{dij∈B1} dij / N1 = d23/1 = 1.05

d̂_B2 = Σ_{dij∈B2} dij / N2 = d13/1 = 0.32

d̂_B3 = Σ_{dij∈B3} dij / N3 = d12/1 = 0.77.
In Step 3, the lowest block B1 becomes the active block and is made up-active:

B = B1,  B− = ∅,  B+ = B2.

Since d̂_B1 = 1.05 is not less than d̂_B2 = 0.32, B1 is not up-satisfied, and B1 and B2 are merged to form a new block B12 with d̂_B12 = (1.05 + 0.32)/2 = 0.69. The new block becomes the active block:

B = B12,  B− = ∅,  B+ = B3.
Going back to Step 4, we check that the active block B12 is not the highest
block. In Step 5, we check that B is both up-satisfied with dˆ B12 < dˆ B3 and
down-satisfied. Therefore, we execute Step 6 to make B3 the active block
and make it up-active:
d̂_B3 = 0.77

B = B3,  B− = B12,  B+ = ∅.
Going back to Step 4 again, we check that the active block B is the highest
block, get out of the WHILE loop, execute Step 20—the last step of the
monotone regression algorithm, and assign the following values of d̂ijs:
d̂12 = d̂_B3 = 0.77

d̂13 = d̂_B12 = 0.69

d̂23 = d̂_B12 = 0.69.
With

d12 = 0.77,  d13 = 0.32,  d23 = 1.05,

we now execute Step 3 of the MDS algorithm to compute the stress of the initial configuration using Equations 15.9 through 15.11:

S* = Σ_{ij} (dij − d̂ij)² = (0.77 − 0.77)² + (0.32 − 0.69)² + (1.05 − 0.69)² = 0.27

T* = Σ_{ij} dij² = 0.77² + 0.32² + 1.05² = 0.61

S = √(S*/T*) = √(0.27/0.61) = 0.67.
In Step 5 of the MDS algorithm, the configuration is updated using the gradient descent method. Using Equation 15.18, each gradient component is computed, for example,

g12 = (0.67) Σ_{i,j} (ρ1i − ρ1j) [ (dij − d̂ij)/S* − dij/T* ] (xi2 − xj2)/dij

g22 = (0.67) Σ_{i,j} (ρ2i − ρ2j) [ (dij − d̂ij)/S* − dij/T* ] (xi2 − xj2)/dij,

which gives

g11 = −0.13,  g12 = −0.71,  g21 = 1.07,  g22 = −0.45,  g31 = 0.90,  g32 = 0.77.
Using Equation 15.14 with α = 0.2, each coordinate is updated:

x11(1) = x11(0) + 0.2 g11/√[(g11² + g12² + g21² + g22² + g31² + g32²)/3]
       = 0.71 + 0.2(−0.13)/√[((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²)/3] = 0.70

x12(1) = 0.71 + 0.2(−0.71)/√[((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²)/3] = 0.63

x21(1) = 0 + 0.2(1.07)/√[((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²)/3] = 0.12

x22(1) = 1 + 0.2(−0.45)/√[((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²)/3] = 0.95

x31(1) = 0.89 + 0.2(0.90)/√[((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²)/3] = 0.99

x32(1) = 0.45 + 0.2(0.77)/√[((−0.13)² + (−0.71)² + 1.07² + (−0.45)² + 0.90² + 0.77²)/3] = 0.54.

Hence, after the update of the initial configuration in Step 5 of the MDS algorithm and the normalization in Step 6, we obtain:

x1 = (0.70/√(0.70² + 0.63²), 0.63/√(0.70² + 0.63²)) = (0.74, 0.67)

x2 = (0.12/√(0.12² + 0.95²), 0.95/√(0.12² + 0.95²)) = (0.13, 0.99)

x3 = (0.99/√(0.99² + 0.54²), 0.54/√(0.99² + 0.54²)) = (0.88, 0.48).
Figure 15.1
An example of plotting the stress of a MDS result versus the number of dimensions.
When MDS is extended to model individual differences among m subjects, each data item i has coordinates in the q-dimensional configuration space,

xi = (xi1, …, xiq), i = 1, …, n,

and each subject j has a weight vector

wj = (wj1, …, wjq), j = 1, …, m.

The weight vector of a subject reflects the relative salience of each dimension in the configuration space to the subject.
Exercises
15.1 Continue Example 15.1 to perform the next iteration of the configura-
tion update.
15.2 Consider the data set consisting of the three data points in instances
#4, 5, and 6 in Table 8.1. Use the Euclidean distance for each pair of the
three data points in the nine-dimensional space, xi and xj, as δij. Perform
the MDS of this data set with only one iteration of the configuration
update for q = 3, the stopping criterion of S ≤ 5%, and α = 0.2.
15.3 Consider the data set in Table 8.1 consisting of nine data points in
instances 1–9. Use the Euclidean distance for each pair of the nine data
points in the nine-dimensional space, xi and xj, as δij. Perform the MDS
of this data set with only one iteration of the configuration update for
q = 1, the stopping criterion of S ≤ 5%, and α = 0.2.
Part V
Algorithms for
Mining Outlier and
Anomaly Patterns
16
Univariate Control Charts
Outliers and anomalies are data points that deviate largely from the norm
where the majority of data points follow. Outliers and anomalies may be
caused by a fault of a manufacturing machine and thus an out of control
manufacturing process, an attack whose behavior differs largely from nor-
mal use activities on computer and network systems, and so on. Detecting outliers and anomalies is important in many fields. For example, detecting
an out of control manufacturing process quickly is important for reducing
manufacturing costs by not producing defective parts. An early detection of
a cyber attack is crucial to protect computer and network systems from being
compromised.
Control chart techniques define and detect outliers and anomalies on a
statistical basis. This chapter describes univariate control charts that monitor
one variable for anomaly detection. Chapter 17 describes multivariate control
charts that monitor multiple variables simultaneously for anomaly detection.
The univariate control charts described in this chapter include Shewhart con-
trol charts, cumulative sum (CUSUM) control charts, exponentially weighted
moving average (EWMA) control charts, and cuscore control charts. A list of
software packages that support univariate control charts is provided. Some
applications of univariate control charts are given with references.
Table 16.1
Samples of Data Observations

Sample   Data Observations in Each Sample   Sample Mean   Sample Standard Deviation
1        x11, …, x1j, …, x1n                x̄1            s1
…        …                                  …             …
i        xi1, …, xij, …, xin                x̄i            si
…        …                                  …             …
m        xm1, …, xmj, …, xmn                x̄m            sm
CUSUM control charts in Section 16.2 and EWMA control charts in Section 16.3 have advantages over individual control charts.
We describe the x̄ control chart to illustrate how Shewhart control charts work. Consider a variable x that takes m samples of n data observations from a process as shown in Table 16.1. The x̄ control chart assumes that x is normally distributed with the mean μ and the standard deviation σ when the process is in control.
x̄i and si, i = 1, …, m, in Table 16.1 are computed as follows:
x̄i = (Σ_{j=1}^{n} xij)/n    (16.1)

si = √[ Σ_{j=1}^{n} (xij − x̄i)² / (n − 1) ].    (16.2)

The mean μ and the standard deviation σ are estimated using x̿ and s̄:

x̿ = (Σ_{i=1}^{m} x̄i)/m    (16.3)

s̄ = (Σ_{i=1}^{m} si)/m.    (16.4)
If the process is in control, we expect

P(x̿ − 3s̄ ≤ x̄i ≤ x̿ + 3s̄) = 99.7%.    (16.5)
Since the probability that x̄i falls beyond the three standard deviations from the mean is only 0.3%, such an x̄i is considered an outlier or anomaly that may be caused by the process being out of control. Hence, the estimated mean and the 3-sigma control limits are typically used as the centerline and the control limits (UCL for upper control limit and LCL for lower control limit), respectively, for the in-control process mean in the x̄ control chart:

Centerline = x̿    (16.6)

UCL = x̿ + 3s̄    (16.7)

LCL = x̿ − 3s̄.    (16.8)
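The centerline and control limits of the x̄ control chart can be computed as in the following sketch (not from the book); the sample values used at the bottom are made up for illustration.

```python
# Illustrative sketch: x-bar chart centerline and 3-sigma control limits
# from m samples of n observations, following Equations 16.1 through 16.8.
import numpy as np

def xbar_chart_limits(samples):
    samples = np.asarray(samples, dtype=float)      # shape (m, n)
    sample_means = samples.mean(axis=1)             # Equation 16.1
    sample_stds = samples.std(axis=1, ddof=1)       # Equation 16.2
    center = sample_means.mean()                    # Equations 16.3 and 16.6
    s_bar = sample_stds.mean()                      # Equation 16.4
    return center, center + 3 * s_bar, center - 3 * s_bar  # 16.6 - 16.8

samples = [[66, 70, 69, 68], [67, 72, 73, 70], [57, 63, 70, 78]]  # toy samples
center, ucl, lcl = xbar_chart_limits(samples)
print(center, ucl, lcl)
```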
A control chart may give a false alarm when the process is truly in control or give no signal when the process is truly out of control. Because Shewhart control charts monitor and evaluate only
one data sample or one individual data observation at a time, Shewhart con-
trol charts are not effective at detecting small shifts, e.g., small shifts of a
process mean monitored by the x– control chart. CUSUM control charts in
Section 16.2 and EWMA control charts in Section 16.3 are less sensitive to
the normality assumption of data and are effective at detecting small shifts.
CUSUM control charts and EWMA control charts can be used to monitor
both data samples and individual data observations. Hence, CUSUM control
charts and EWMA control charts are more practical.
A CUSUM control chart monitors the cumulative sum of the deviations of the data observations from the target value of the process mean:

CSi = Σ_{j=1}^{i} (xj − μ0),    (16.9)
where μ0 is the target value of the process mean. If the process is in control,
data observations are expected to randomly fluctuate around the process
mean, and thus CSi stays around zero. However, if the process is out of con-
trol with a shift of x values from the process mean, CSi keeps increasing for a
positive shift (i.e., xi − μ0 > 0) or decreasing for a negative shift. Even if there
is a small shift, the effect of the small shift keeps accumulating in CSi and
becomes large to be defected. Hence, a CUSUM control chart is more effec-
tive than a Shewhart control chart to detect small shifts since a Shewhart
control chart examines only one data sample or one data observation.
Formula 16.9 is used to monitor individual data observations. If samples of data points can be observed, xi in Formula 16.9 can be replaced by x̄i to monitor the sample average.
If we are interested in detecting only a positive shift, a one-side CUSUM chart can be constructed to monitor the CSi+ statistic:

CSi+ = max[0, xi − (μ0 + K) + CS(i−1)+],    (16.10)
where K is called the reference value specifying how much increase from the
process mean μ0 we are interested in detecting. Since we expect xi ≥ μ0 + K as a
result of the positive shift K from the process mean μ0, we expect xi − (μ0 + K) to
be positive and expect CSi+ to keep increasing with i. In case that some xi makes
xi − (µ 0 + K ) + CSi+−1 have a negative value, CSi+ takes the value of 0 according to
Formula 16.10 since we are interested in only the positive shift. One method
of specifying K is to use the standard deviation σ of the process. For example,
K = 0.5σ indicates that we are interested in detecting a shift of 0.5σ above the
target mean. If the process is in control, we expect CSi+ to stay around zero.
Hence, CSi+ is initially set to zero:
CS0+ = 0. (16.11)
When CSi+ exceeds the decision threshold H, the process is considered out
of control. Typically H = 5σ is used as the decision threshold so that a low
rate of false alarms can be achieved (Montgomery, 2001). Note that H = 5σ is
greater than the 3-sigma control limits used for the x– control chart in Section
16.1 since CSi+ accumulates the effects of multiple data observations whereas
the x– control chart examines only one data observation or data sample.
If we are interested in detecting only a negative shift, −K, from the pro-
cess mean, a one-side CUSUM chart can be constructed to monitor the CSi−
statistic:

CSi− = max[0, (μ0 − K) − xi + CS(i−1)−],    (16.12)
Since we expect xi ≤ μ0 − K as a result of the negative shift, −K, from the process
mean μ0, we expect (μ0 − K) − xi to be positive and expect CSi− to keep increas-
ing with i. H = 5σ is typically used as the decision threshold to achieve a low
rate of false alarms (Montgomery, 2001). CSi− is initially set to zero since we
expect CSi− to stay around zero if the process is in control:
CS0− = 0. (16.13)
A two-side CUSUM control chart can be used to monitor both CSi+ using the
one-side upper CUSUM and CSi− using the one-side lower CUSUM for the
same xi. If either CSi+ or CSi− exceeds the decision threshold H, the process is
considered out of control.
Example 16.1
Consider the launch temperature data in Table 1.5 and presented in Table
16.2 as a sequence of data observations over time. Given the following
information:
μ0 = 69
σ = 7
K = 0.5σ = 3.5
H = 5σ = (5)(7) = 35,

monitor the launch temperature using a two-side CUSUM control chart.
Table 16.2
Data Observations of the Launch
Temperature from the Data Set of O-Rings
with Stress along with Statistics for the
Two-Side CUSUM Control Chart
Data Launch
Observation i Temperature xi CSi+ CSi-
1 66 0 0
2 70 0 0
3 69 0 0
4 68 0 0
5 67 0 0
6 72 0 0
7 73 0.5 0
8 70 0 0
9 57 0 8.5
10 63 0 11
11 70 0 6.5
12 78 5.5 0
13 67 0 0
14 53 0 12.5
15 67 0 11
16 75 2.5 1.5
17 70 0 0
18 81 8.5 0
19 76 12 0
20 79 18.5 0
21 75 21 0
22 76 24.5 0
23 58 10 7.5
With CSi+ and CSi− initially set to zero, that is, CS0+ = 0 and CS0− = 0, we compute CS1+ and CS1−:

CS1+ = max[0, x1 − (μ0 + K) + CS0+] = max[0, 66 − (69 + 3.5) + 0] = max[0, −6.5] = 0
CS1− = max[0, (μ0 − K) − x1 + CS0−] = max[0, (69 − 3.5) − 66 + 0] = max[0, −0.5] = 0,

and then CS2+ and CS2−:

CS2+ = max[0, x2 − (μ0 + K) + CS1+] = max[0, 70 − (69 + 3.5) + 0] = max[0, −2.5] = 0
CS2− = max[0, (μ0 − K) − x2 + CS1−] = max[0, (69 − 3.5) − 70 + 0] = max[0, −4.5] = 0.
Figure 16.1
Two-side CUSUM control chart for the launch temperature in the data set of O-ring with stress.
The values of CSi+ and CSi− for i = 3, …, 23 are shown in Table 16.2. Figure 16.1 shows the two-side CUSUM control chart. The CSi+ and CSi− values for all the 23 observations do not exceed the decision threshold H = 35. Hence, no anomalies of the launch temperature are detected. If the decision threshold is set to H = 3σ = (3)(7) = 21, the observation i = 22 will be signaled as an anomaly because CS22+ = 24.5 > H.
After an out-of-control signal is generated, the CUSUM control chart will reset CSi+ and CSi− to their initial value of zero and use the initial value of zero to compute CSi+ and CSi− for the next observation.
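The two-side CUSUM statistics of Example 16.1 can be reproduced with a few lines of Python; the sketch below (not the book's code) also applies the reset after a signal described above.

```python
# Illustrative sketch: two-side CUSUM of Formulas 16.10 and 16.12.
def two_side_cusum(x, mu0, k, h):
    cs_plus, cs_minus, signals = 0.0, 0.0, []
    for i, xi in enumerate(x, start=1):
        cs_plus = max(0.0, xi - (mu0 + k) + cs_plus)
        cs_minus = max(0.0, (mu0 - k) - xi + cs_minus)
        if cs_plus > h or cs_minus > h:
            signals.append(i)
            cs_plus, cs_minus = 0.0, 0.0          # reset after a signal
    return signals

launch_temperature = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                      67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
print(two_side_cusum(launch_temperature, mu0=69, k=3.5, h=35))  # []: no signals
print(two_side_cusum(launch_temperature, mu0=69, k=3.5, h=21))  # [22]
```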
An EWMA control chart monitors the EWMA statistic

zi = λxi + (1 − λ)zi−1,    (16.14)

with the initial value

z0 = μ,    (16.15)

where λ (0 < λ ≤ 1) is a weight and μ is the in-control process mean. The control limits are (Montgomery, 2001; Ye, 2003, Chapter 3):

UCL = μ + Lσ √[λ/(2 − λ)]    (16.16)

LCL = μ − Lσ √[λ/(2 − λ)].    (16.17)
The weight λ determines the relative impacts of the current data observation xi and the previous data observations, as captured through zi−1, on zi. If we express zi using xi, xi−1, …, x1,

zi = λxi + (1 − λ)zi−1
   = λxi + (1 − λ)[λxi−1 + (1 − λ)zi−2]
   = λxi + (1 − λ)λxi−1 + (1 − λ)²zi−2
   = λxi + (1 − λ)λxi−1 + (1 − λ)²[λxi−2 + (1 − λ)zi−3]
   = λxi + (1 − λ)λxi−1 + (1 − λ)²λxi−2 + (1 − λ)³zi−3
   = ⋯
   = λxi + (1 − λ)λxi−1 + (1 − λ)²λxi−2 + ⋯ + (1 − λ)^{i−2}λx2 + (1 − λ)^{i−1}λx1 + (1 − λ)^i z0,    (16.18)

we can see the weights on xi, xi−1, …, x1 decreasing exponentially. For example, for λ = 0.3, we have the weight of 0.3 for xi, (0.7)(0.3) = 0.21 for xi−1, (0.7)²(0.3) = 0.147 for xi−2, (0.7)³(0.3) = 0.1029 for xi−3, …, as illustrated in Figure 16.2. This gives the term EWMA. The larger the λ value is, the less impact the past observations and the more impact the current observation have on the current EWMA statistic, zi.
In Formulas 16.14 through 16.17, setting λ and L in the following ranges
usually works well (Montgomery, 2001; Ye, 2003, Chapter 4):
0.05 ≤ λ ≤ 0.25
2.6 ≤ L ≤ 3.
A data sample can be used to compute the sample average and the sample
standard deviation as the estimates of μ and σ, respectively.
Figure 16.2
Exponentially decreasing weights on data observations.
Example 16.2
Consider the launch temperature data in Table 1.5 and presented in Table 16.3 as a sequence of data observations over time. Given the following:

μ = 69
σ = 7
λ = 0.2
L = 3,

the control limits are

UCL = μ + Lσ √[λ/(2 − λ)] = 69 + (3)(7)√[0.2/(2 − 0.2)] = 76.00
Table 16.3
Data Observations of the Launch
Temperature from the Data Set of O-Rings
with Stress along with the EWMA Statistic
for the EWMA Control Chart
Data Launch
Observation i Temperature xi zi
1 66 68.4
2 70 68.72
3 69 68.78
4 68 68.62
5 67 68.30
6 72 69.04
7 73 69.83
8 70 69.86
9 57 67.29
10 63 66.43
11 70 67.15
12 78 69.32
13 67 68.85
14 53 65.68
15 67 65.95
16 75 67.76
17 70 68.21
18 81 70.76
19 76 71.81
20 79 73.25
21 75 73.60
22 76 74.08
23 58 70.86
Figure 16.3
EWMA control chart to monitor the launch temperature from the data set of O-rings with stress.
LCL = μ − Lσ √[λ/(2 − λ)] = 69 − (3)(7)√[0.2/(2 − 0.2)] = 62.00.
The values of the EWMA statistic for the other data observations are given in Table 16.3. The EWMA statistic values of all the 23 data observations stay within the control limits [LCL, UCL] = [62.00, 76.00], and no anomalies are detected. Figure 16.3 plots the EWMA control chart with the EWMA statistic and the control limits.
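The EWMA statistic and control limits of Example 16.2 can be reproduced as in the following sketch (not the book's code):

```python
# Illustrative sketch: EWMA statistic and control limits (Eqs. 16.14-16.17).
import math

def ewma_chart(x, mu, sigma, lam, L):
    ucl = mu + L * sigma * math.sqrt(lam / (2 - lam))
    lcl = mu - L * sigma * math.sqrt(lam / (2 - lam))
    z, z_values, signals = mu, [], []
    for i, xi in enumerate(x, start=1):
        z = lam * xi + (1 - lam) * z          # Equation 16.14
        z_values.append(z)
        if z > ucl or z < lcl:
            signals.append(i)
    return z_values, (lcl, ucl), signals

launch_temperature = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                      67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
z_values, limits, signals = ewma_chart(launch_temperature, mu=69, sigma=7,
                                       lam=0.2, L=3)
print([round(z, 2) for z in z_values[:3]])  # [68.4, 68.72, 68.78]
print(limits)                                # (62.0, 76.0)
print(signals)                               # []: no anomalies detected
```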
For autocorrelated data, the EWMA statistic can be used as a 1-step-ahead prediction model:

zi−1 = λxi−1 + (1 − λ)zi−2,    (16.19)

where 0 < λ ≤ 1. That is, zi−1 is the EWMA of xi−1, …, x1 and is used as the prediction for xi. The prediction error or residual is then computed:

ei = xi − zi−1.    (16.20)

The value of λ can be chosen to minimize the sum of the squared prediction errors:

λ = arg min_λ Σ_i ei².    (16.21)
If the 1-step-ahead prediction model represents the autocorrelated data well, the eis should be independent of each other and normally distributed with the mean of zero and the standard deviation of σe. An EWMA control chart for monitoring ei has the centerline at zero and control limits at ±Lσ̂_{e,i−1}, where L is set to a value such that 2.6 ≤ L ≤ 3 and σ̂²_{e,i−1} gives the estimate of σe² for xi using an exponentially weighted moving average of the prediction errors with weight 0 < α ≤ 1. Using Equation 16.20, which gives xi = ei + zi−1, the control limits for monitoring xi directly instead of ei are zi−1 ± Lσ̂_{e,i−1}.
Like CUSUM control charts, EWMA control charts are more robust to the
normality assumption of data than Shewhart control charts (Montgomery,
2001). Unlike Shewhart control charts, CUSUM control charts and EWMA
control charts are effective at detecting anomalies of not only large shifts but
also small shifts since CUSUM control charts and EWMA control charts take
into account the effects of multiple data observations.
In-control data model with a linear trend:

y_t = \theta_0 t + \varepsilon_t   (16.27)

Anomaly data model with a changed slope:

y_t = \theta t + \varepsilon_t, \quad \theta \neq \theta_0   (16.28)

In-control data model with random variations around a constant mean T:

y_t = T + \theta_0 \sin\frac{2\pi t}{p} + \varepsilon_t, \quad \theta_0 = 0   (16.29)

Anomaly data model with a sine wave added to the mean:

y_t = T + \theta \sin\frac{2\pi t}{p} + \varepsilon_t.   (16.30)

In general, the in-control model is

y_t = f(x_t, \theta)   (16.31)

\theta = \theta_0.   (16.32)

In the two examples shown in Equations 16.27 through 16.30, x_t includes only
t, and θ = θ_0 when the process is in control.
The residual, εt, can be computed by subtracting the predicted value ŷt
from the observed value of yt:
\varepsilon_t = y_t - \hat{y}_t = y_t - f(x_t, \theta) = g(y_t, x_t, \theta).   (16.33)
and the standard deviation σ. That is, the random variables, ε1, ε2, …, εn, have
a joint multivariate normal distribution with the following joint probability
density function:
P(\varepsilon_1, \ldots, \varepsilon_n \mid \theta = \theta_0) = \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}\sum_{t=1}^{n}\frac{\varepsilon_{t0}^2}{\sigma^2}}.   (16.34)

The log-likelihood function is

l(\varepsilon_1, \ldots, \varepsilon_n \mid \theta = \theta_0) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{t=1}^{n}\varepsilon_{t0}^2.   (16.35)

The maximum likelihood estimate of θ is obtained by setting

\frac{\partial l(\varepsilon_1, \ldots, \varepsilon_n \mid \theta = \theta_0)}{\partial\theta} = 0.   (16.36)
where

d_{t0} = -\frac{\partial\varepsilon_{t0}}{\partial\theta}.   (16.39)
For example, to detect a change of the slope from a linear model of in-
control data described in Equations 16.27 and 16.28, a cuscore control chart
monitors:
Q_0 = \sum_{t=1}^{n}\varepsilon_{t0}\left(-\frac{\partial\varepsilon_{t0}}{\partial\theta}\right) = \sum_{t=1}^{n}\varepsilon_{t0}\left[-\frac{\partial(y_t - \theta t)}{\partial\theta}\right] = \sum_{t=1}^{n}(y_t - \theta_0 t)\,t.   (16.40)
If the slope θ of the in-control linear model changes from θ_0, (y_t − θ_0 t) in
Equation 16.40 contains t, which is multiplied by another t to make Q_0 keep
increasing (if y_t − θ_0 t > 0) or decreasing (if y_t − θ_0 t < 0) rather than randomly
varying around zero. Such a consistent departure of Q_0 from zero causes the
slope of the line connecting Q_0 values over time to increase or decrease from
zero, which can be used to signal the presence of an anomaly.
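A minimal Python sketch of the cuscore statistic in Equation 16.40 follows; the in-control slope θ_0, the noise, and the point of the slope change are hypothetical values chosen only for illustration.

```python
# Cuscore statistic Q0 = sum over t of (y_t - theta0 * t) * t  (Equation 16.40),
# accumulated over time to detect a change of the slope from theta0.
import random

theta0 = 0.5                      # in-control slope (assumed for illustration)
random.seed(1)
# First 50 points follow the in-control slope; the last 50 have a changed slope.
y = [theta0 * t + random.gauss(0, 1) for t in range(1, 51)] + \
    [0.7 * t + random.gauss(0, 1) for t in range(51, 101)]

q0, cuscore_path = 0.0, []
for t, yt in enumerate(y, start=1):
    q0 += (yt - theta0 * t) * t   # residual under the in-control model times d_t0 = t
    cuscore_path.append(q0)

print(cuscore_path[49], cuscore_path[-1])  # Q0 drifts away from zero after the change
```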
To detect a sine wave in an in-control process with the mean of T and ran-
dom variations described in Equations 16.29 and 16.30, the cuscore statistic
for a cuscore control chart is
Q_0 = \sum_{t=1}^{n}\varepsilon_{t0}\left(-\frac{\partial\varepsilon_{t0}}{\partial\theta}\right) = \sum_{t=1}^{n}(y_t - T)\left[-\frac{\partial\left(y_t - T - \theta\sin\frac{2\pi t}{p}\right)}{\partial\theta}\right] = \sum_{t=1}^{n}(y_t - T)\sin\frac{2\pi t}{p}.   (16.41)
If the sine wave is present in yt, (yt − T) in Equation 16.41 contains sin(2πt/p), which
is multiplied by another sin(2πt/p) to make Q0 keep increasing (if yt − T > 0) or
decreasing (if yt − T < 0) rather than randomly varying around zero.
To detect a mean shift of K from μ0 as in a CUSUM control chart described
in Equations 16.9, 16.10, and 16.12, we have:
In-control data model:

y_t = \mu_0 + \theta_0 K + \varepsilon_t, \quad \theta_0 = 0   (16.42)

Anomaly data model:

y_t = \mu_0 + \theta K + \varepsilon_t, \quad \theta \neq \theta_0   (16.43)

Cuscore statistic:

Q_0 = \sum_{t=1}^{n}\varepsilon_{t0}\left(-\frac{\partial\varepsilon_{t0}}{\partial\theta}\right) = \sum_{t=1}^{n}(y_t - \mu_0)\left[-\frac{\partial(y_t - \mu_0 - \theta K)}{\partial\theta}\right] = \sum_{t=1}^{n}(y_t - \mu_0)K.   (16.44)
Table 16.4
Pairs of the False Alarm Rate and the Hit Rate for Various Values of the Decision Threshold H for the Two-Sided CUSUM Control Chart in Example 16.1

H      False Alarm Rate      Hit Rate
−1 1 1
0 0.44 1
0.5 0.38 1
2.5 0.38 0.86
5.5 0.38 0.71
6.5 0.31 0.71
8.5 0.25 0.57
10 0.19 0.57
11 0.06 0.57
12 0.06 0.43
12.5 0 0.43
18.5 0 0.29
21 0 0.14
24.5 0 0
Figure 16.4
ROC for the two-sided CUSUM control chart in Example 16.1.
The pairs of the false alarm rate and the hit rate can be plotted for each technique in the same chart to compare the ROCs of two techniques and
examine which ROC is closer to the top-left corner of the chart to determine
which technique produces better detection performance. Ye et al. (2002b)
show the use of ROCs for a comparison of cyber attack detection performance
by two control chart techniques.
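The ROC in Figure 16.4 can be reproduced directly from the pairs in Table 16.4; the sketch below assumes matplotlib is available.

```python
# Plot the ROC from the (false alarm rate, hit rate) pairs in Table 16.4.
import matplotlib.pyplot as plt

false_alarm = [1, 0.44, 0.38, 0.38, 0.38, 0.31, 0.25, 0.19, 0.06, 0.06, 0, 0, 0, 0]
hit_rate    = [1, 1, 1, 0.86, 0.71, 0.71, 0.57, 0.57, 0.57, 0.43, 0.43, 0.29, 0.14, 0]

plt.plot(false_alarm, hit_rate, marker="o")
plt.xlabel("False alarm rate")
plt.ylabel("Hit rate")
plt.title("ROC for the two-sided CUSUM control chart in Example 16.1")
plt.show()
```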
Exercises
16.1 Consider the launch temperature data and the following information
in Example 16.1:
µ 0 = 69
K = 3.5
17
Multivariate Control Charts

\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix}   (17.1)
T^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}),   (17.3)

where
S^{-1} is the inverse of S

Hotelling's T^2 statistic measures the statistical distance of x_i from \bar{x}.
Figure 17.1
An illustration of statistical distance measured by Hotelling's T^2 and control limits of Hotelling's T^2 control charts and univariate control charts.
The transformed statistic

\frac{n(n-p)}{p(n+1)(n-1)}\,T^2

follows an F distribution with p and n − p degrees of freedom, which gives a signal threshold to
detect the mean shift (Ryan, 1989). Hotelling's T^2 control charts are also sen-
sitive to the multivariate normality assumption.
Example 17.1
The data set of the manufacturing system in Table 14.1, which is copied
in Table 17.1, includes two attribute variables, x7 and x8, in nine cases of
single-machine faults. The sample mean vector and the sample variance–
covariance matrix are computed in Chapter 14 and given next. Construct
a Hotelling’s T 2 control chart to determine if the first data observation
x = (x7, x8) = (1, 0) is an anomaly.
\bar{x} = \begin{bmatrix} \bar{x}_7 \\ \bar{x}_8 \end{bmatrix} = \begin{bmatrix} 5/9 \\ 4/9 \end{bmatrix}

S = \begin{bmatrix} 0.2469 & -0.1358 \\ -0.1358 & 0.2469 \end{bmatrix}.
For the first data observation x = (x7, x8) = (1, 0), we compute the value of
the Hotelling’s T 2 statistic:
T^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}) = \begin{bmatrix} 1 - \frac{5}{9} & 0 - \frac{4}{9} \end{bmatrix} \begin{bmatrix} 0.2469 & -0.1358 \\ -0.1358 & 0.2469 \end{bmatrix}^{-1} \begin{bmatrix} 1 - \frac{5}{9} \\ 0 - \frac{4}{9} \end{bmatrix}
= \begin{bmatrix} \frac{4}{9} & -\frac{4}{9} \end{bmatrix} \begin{bmatrix} 5.8070 & 3.1939 \\ 3.1939 & 5.8070 \end{bmatrix} \begin{bmatrix} \frac{4}{9} \\ -\frac{4}{9} \end{bmatrix} = 1.0323.
Table 17.1
Data Set for System Fault Detection
with Two Quality Variables
Instance
(Faulty Machine) x7 x8
1 (M1) 1 0
2 (M2) 0 1
3 (M3) 1 1
4 (M4) 0 1
5 (M5) 1 0
6 (M6) 1 0
7 (M7) 1 0
8 (M8) 0 1
9 (M9) 0 0
\frac{n(n-p)}{p(n+1)(n-1)}\,T^2 = \frac{(9)(9-2)}{(2)(9+1)(9-1)}(1.0323) = 0.4065.
The tabulated F value for α = 0.05 with 2 and 7 degrees of freedom is 4.74,
which is used as the signal threshold. Since 0.4065 < 4.74, the Hotelling's
T 2 control chart does not signal x = (x7, x8) = (1, 0) as an anomaly.
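A short sketch of the computation in Example 17.1 follows, assuming numpy and scipy are available; the mean vector, covariance matrix, and significance level are those of the example.

```python
# Hotelling's T^2 statistic and its F-based signal threshold for Example 17.1.
import numpy as np
from scipy.stats import f  # used only to look up the tabulated F value

x_bar = np.array([5 / 9, 4 / 9])
S = np.array([[0.2469, -0.1358],
              [-0.1358, 0.2469]])
x = np.array([1.0, 0.0])                      # first data observation (x7, x8)

d = x - x_bar
T2 = d @ np.linalg.inv(S) @ d                 # statistical distance from x_bar
n, p = 9, 2
F_stat = n * (n - p) / (p * (n + 1) * (n - 1)) * T2
threshold = f.ppf(0.95, p, n - p)             # tabulated F value, about 4.74
print(T2, F_stat, threshold, F_stat > threshold)
```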
where

z_0 = \mu \text{ or } \bar{x}   (17.6)

S_z = \frac{\lambda}{2-\lambda}\left[1 - (1-\lambda)^{2i}\right] S.   (17.7)

\chi^2 = \sum_{j=1}^{p}\frac{(x_{ij} - \bar{x}_j)^2}{\bar{x}_j}.   (17.8)
For example, the data set of the manufacturing system in Table 17.1 includes
two attribute variables, x7 and x8, in nine cases of single-machine faults. The
sample mean vector is computed in Chapter 14 and given next:
\bar{x} = \begin{bmatrix} \bar{x}_7 \\ \bar{x}_8 \end{bmatrix} = \begin{bmatrix} 5/9 \\ 4/9 \end{bmatrix}.

The chi-square statistic for the first data observation in Table 17.1, x = (x_7, x_8) = (1, 0), is

\chi^2 = \sum_{j=7}^{8}\frac{(x_{1j} - \bar{x}_j)^2}{\bar{x}_j} = \frac{(x_{17} - \bar{x}_7)^2}{\bar{x}_7} + \frac{(x_{18} - \bar{x}_8)^2}{\bar{x}_8} = \frac{\left(1 - \frac{5}{9}\right)^2}{\frac{5}{9}} + \frac{\left(0 - \frac{4}{9}\right)^2}{\frac{4}{9}} = 0.8.
If we let L = 3, we have the 3-sigma control limits. If the value of the chi-
square statistic for an observation falls beyond [LCL, UCL], the chi-square
control chart signals an anomaly.
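A minimal sketch of the chi-square statistic in Equation 17.8 applied to the observations in Table 17.1 (plain Python, no library assumptions):

```python
# Chi-square control chart statistic (Equation 17.8) for the observations in Table 17.1.
data = [(1, 0), (0, 1), (1, 1), (0, 1), (1, 0), (1, 0), (1, 0), (0, 1), (0, 0)]
x_bar = [sum(col) / len(data) for col in zip(*data)]   # sample means of x7 and x8

def chi_square(obs, means):
    return sum((xj - mj) ** 2 / mj for xj, mj in zip(obs, means))

for i, obs in enumerate(data, start=1):
    print(f"observation {i}: chi-square = {chi_square(obs, x_bar):.2f}")
# The first observation (1, 0) gives 0.8, as computed in the text.
```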
In the work by Ye et al. (2006), chi-square control charts are compared with
Hotelling’s T 2 control charts in their performance of detecting mean shifts
and counter-relationships for four types of data: (1) data with correlated and
normally distributed variables, (2) data with uncorrelated and normally dis-
tributed variables, (3) data with auto-correlated and normally distributed
variables, and (4) non-normally distributed variables without correlations
or auto-correlations. The testing results show that chi-square control charts
perform as well as or better than Hotelling's T^2 control charts for data of types 2,
3, and 4. Hotelling's T^2 control charts perform better than chi-square control
charts for data of type 1 only. However, for data of type 1, we can use tech-
niques such as principal component analysis in Chapter 14 to obtain prin-
cipal components. Then a chi-square control chart can be used to monitor
principal components that are independent variables.
17.4 Applications
Applications of Hotelling’s T 2 control charts and chi-square control charts
to cyber attack detection for monitoring computer and network data and
detecting cyber attacks as anomalies can be found in the work by Ye and her
colleagues (Emran and Ye, 2002; Ye, 2003, Chapter 4; Ye, 2008; Ye and Chen,
2001; Ye et al., 2001, 2003, 2004, 2006). There are also applications of multivari-
ate control charts in manufacturing (Ye, 2003, Chapter 4) and other fields.
Exercises
17.1 Use the data set of x4, x5, and x6 in Table 8.1 to estimate the parameters
for a Hotelling’s T 2 control chart and construct the Hotelling’s T 2 con-
trol chart with α = 0.05 for the data set of x4, x5, and x6 in Table 4.6 to
monitor the data and detect any anomaly.
17.2 Use the data set of x4, x5, and x6 in Table 8.1 to estimate the param-
eters for a chi-square control chart and construct the chi-square control
chart with L = 3 for the data set of x4, x5, and x6 in Table 4.6 to monitor
the data and detect any anomaly.
17.3 Repeat Example 17.1 for the second data observation.
Part VI

18
Time Series Analysis
Time series data consist of data observations over time. If data observations
are correlated over time, time series data are autocorrelated. Time series
analysis was introduced by Box and Jenkins (1976) to model and analyze
time series data with autocorrelation. Time series analysis has been applied
to real-world data in many fields, including stock prices (e.g., S&P 500 index),
airline fares, labor force size, unemployment data, and natural gas price
(Yaffee and McGee, 2000). There are stationary and nonstationary time series
data that require different statistical inference procedures. In this chapter,
autocorrelation is defined. Several types of stationary and nonstationary
time series are explained. Autoregressive and moving average (ARMA) mod-
els of stationary series data are described. Transformations of nonstationary
series data into stationary series data are presented, along with autoregres-
sive, integrated, moving average (ARIMA) models. A list of software pack-
ages that support time series analysis is provided. Some applications of time
series analysis are given with references.
18.1 Autocorrelation
Equation 14.7 in Chapter 14 gives the correlation coefficient of two variables
xi and xj:
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}},

\sigma_i^2 = \sum_{\text{all values of } x_i}(x_i - \mu_i)^2\, p(x_i)

\sigma_{ij} = \sum_{\text{all values of } x_i}\sum_{\text{all values of } x_j}(x_i - \mu_i)(x_j - \mu_j)\, p(x_i, x_j).
Given a variable x and a sample of its time series data xt, t = 1, …, n, we obtain
the lag-k autocorrelation function (ACF) coefficient by replacing the variables
x_i and x_j in the aforementioned equations with x_t and x_{t−k}, which are two data
observations separated by a time lag of k:

ACF(k) = \rho_k = \frac{\sum_{t=k+1}^{n}(x_t - \bar{x})(x_{t-k} - \bar{x})/(n-k)}{\sum_{t=1}^{n}(x_t - \bar{x})^2/n},   (18.1)
where x– is the sample average. If time series data are statistically indepen-
dent at lag-k, ρk is zero. If xt and xt−k change from x– in the same way (e.g., both
increasing from x– ), ρk is positive. If xt and xt−k change from x– in the opposite
way (e.g., one increasing and another decreasing from x– ), ρk is negative.
The lag-k partial autocorrelation function (PACF) coefficient measures
the autocorrelation of lag-k, which is not accounted for by the autocorrela-
tion of lags 1 to k−1. PACF for lag-1 and lag-2 are given next (Yaffee and
McGee, 2000):
PACF(1) = ρ1 (18.2)
PACF(2) = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2}.   (18.3)
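Equations 18.1 through 18.3 translate directly into a few lines of Python; the sketch below applies them to the AR(1) series of Table 18.1 as an illustration.

```python
# Lag-k autocorrelation (Equation 18.1) and PACF for lags 1 and 2 (Equations 18.2-18.3).
def acf(x, k):
    n = len(x)
    x_bar = sum(x) / n
    num = sum((x[t] - x_bar) * (x[t - k] - x_bar) for t in range(k, n)) / (n - k)
    den = sum((xt - x_bar) ** 2 for xt in x) / n
    return num / den

def pacf2(x):
    r1, r2 = acf(x, 1), acf(x, 2)
    return r1, (r2 - r1 ** 2) / (1 - r1 ** 2)   # PACF(1), PACF(2)

series = [2.866, 2.157, 0.353, 0.741, 0.962, 0.579, 0.381, 1.328, 0.825, 0.078]  # Table 18.1
print([round(acf(series, k), 3) for k in (1, 2, 3)])
print([round(v, 3) for v in pacf2(series)])
```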
• Changing variance
• Cycles with a data pattern that repeats periodically, including seasonal cycles with annual periodicity
• Others that make the mean or variance of a time series change over time
x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + e_t.   (18.4)
For example, time series data on the approval rating of the president's job performance
based on the Gallup poll are modeled as AR(1) (Yaffee and McGee, 2000):

x_t = \phi_1 x_{t-1} + e_t.   (18.5)
Table 18.1 gives a time series of an AR(1) model with ϕ1 = 0.9, x0 = 3, and a
white noise process for et with the mean of 0 and the standard deviation of 1.
Table 18.1
Time Series of an AR(1) Model
with ϕ1 = 0.9, x0 = 3, and a
White Noise Process for et
t et xt
1 0.166 2.866
2 −0.422 2.157
3 −1.589 0.353
4 0.424 0.741
5 0.295 0.962
6 −0.287 0.579
7 −0.140 0.381
8 0.985 1.328
9 −0.370 0.825
10 −0.665 0.078
Figure 18.1
Time series data generated using an AR(1) model with ϕ1 = 0.9 and a white noise process for et.
Figure 18.1 plots this AR(1) time series. As seen in Figure 18.1, the effect of the
initial x value, x0 = 3, diminishes quickly.
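The AR(1) series of Table 18.1 can be regenerated with a short sketch; an MA(1) series following Formula 18.6 with θ_1 = 0.9 is generated from the same white noise values for comparison (the MA(1) values are shown only as an illustration and are not claimed to reproduce Table 18.2 exactly).

```python
# Generate an AR(1) series as in Table 18.1 and an MA(1) series from the same noise.
e = [0.166, -0.422, -1.589, 0.424, 0.295, -0.287, -0.140, 0.985, -0.370, -0.665]

# AR(1): x_t = 0.9 * x_{t-1} + e_t with x_0 = 3 (Table 18.1)
x, ar1 = 3.0, []
for et in e:
    x = 0.9 * x + et
    ar1.append(round(x, 3))

# MA(1) per Formula 18.6 with theta1 = 0.9, taking e_0 = 0.649 as the t = 0 noise value
e0 = [0.649] + e
ma1 = [round(e0[t] - 0.9 * e0[t - 1], 3) for t in range(1, len(e0))]

print(ar1)  # 2.866, 2.157, 0.353, ... as in Table 18.1
print(ma1)  # oscillating values driven by the -0.9 * e_{t-1} term
```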
A moving average (MA) model of order q, MA(q), describes a time series in
which the current observation of a variable is an effect of a random error at
the current time and random errors at previous q time points:
x_t = e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}.   (18.6)
For example, time series data from the epidemiological tracking of the pro-
portion of the total population reported to have a disease (e.g., AIDS) are mod-
eled as MA(1) (Yaffee and McGee, 2000):

x_t = e_t - \theta_1 e_{t-1}.   (18.7)
Table 18.2 gives a time series of an MA(1) model with θ1 = 0.9 and a white noise
process for et with the mean of 0 and the standard deviation of 1. Figure 18.2
plots this MA(1) time series. As seen in Figure 18.2, the term −θ_1 e_{t−1} = −0.9e_{t−1} in Formula 18.7
tends to bring x_t in the opposite direction of x_{t−1}, making the values of x_t oscillate.
An ARMA model, ARMA(p, q), describes a time series with both autoregres-
sive and moving average characteristics:
x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}.   (18.8)
Table 18.2
Time Series of an MA(1)
Model with θ1 = 0.9 and a
White Noise Process for et
t et xt
0 0.649
1 0.166 −0.418
2 −0.422 −0.046
3 −1.589 −1.548
4 0.424 1.817
5 0.295 −1.340
6 −0.287 0.919
7 −0.140 −0.967
8 0.985 1.856
9 −0.370 −2.040
10 −0.665 1.171
Figure 18.2
Time series data generated using an MA(1) model with θ1 = 0.9 and a white noise process for et.
For an AR(1) time series,

x_t = \phi_1 x_{t-1} + e_t,

if |ϕ_1| < 1, AR(1) is stationary, with an exponential decline in the absolute value
of ACF over time since ACF(k) decreases with k and eventually diminishes. If
ϕ_1 > 0, ACF(k) is positive. If ϕ_1 < 0, ACF(k) oscillates in that it is negative for
k = 1, positive for k = 2, negative for k = 3, positive for k = 4, and so on. If |ϕ_1| ≥ 1,
AR(1) is nonstationary. For a stationary AR(2) time series,

x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + e_t,
ACF(k) is positive with the exponential decline in the absolute value of ACF
over time if ϕ1 > 0 and ϕ2 > 0, and ACF(k) is oscillating with the exponential
decline in the absolute value of ACF over time if ϕ1 < 0 and ϕ2 > 0.
PACF(k) for an autoregressive series AR(p) carries through lag p and
becomes zero after lag p. For AR(1), PACF(1) is positive if ϕ_1 > 0 or negative if
ϕ1 < 0, and PACF(k) for k ≥ 2 is zero. For AR(2), PACF(1) and PACF(2) are posi-
tive if ϕ1 > 0 and ϕ2 > 0, PACF(1) is negative and PACF(2) is positive if ϕ1 < 0
and ϕ2 > 0, and PACF(k) for k ≥ 3 is zero. Hence, PACF identifies the order of
an autoregressive time series.
For an MA(1) time series,
x_t = e_t - \theta_1 e_{t-1},

ACF(1) = \frac{-\theta_1}{1 + \theta_1^2},   (18.10)
and ACF(k) is zero for k > 1. Similarly, for an MA(2) time series, ACF(1) and
ACF(2) are negative, and ACF(k) is zero for k > 2. For an MA(q), we have
(Yaffee and McGee, 2000)

ACF(k) \neq 0 \text{ if } k \leq q
ACF(k) = 0 \text{ if } k > q.
if θ1 > 0, and PACF(k) is oscillating between positive and negative values with
the exponential decline in the magnitude of PACF(k) over time. For MA(2),
PACF(k) is negative with the exponential decline in the magnitude of PACF
over time if θ1 > 0 and θ2 > 0, and ACF(k) is oscillating with the exponential
decline in the absolute value of ACF over time if θ1 < 0 and θ2 < 0.
The aforementioned characteristics of autoregressive and moving aver-
age time series are combined in mixed time series with ARMA(p, q) models
where p > 0 and q > 0. For example, for an ARMA(1,1) with ϕ_1 > 0 and θ_1 < 0,
ACF declines exponentially over time and PACF oscillates with the expo-
nential decline over time.
The parameters in an ARMA model can be estimated from a sample of
time series data using the unconditional least-squares method, the condi-
tional least-squares method, or the maximum likelihood method (Yaffee and
McGee, 2000), which are supported in statistical software such as SAS (www.
sas.com) and SPSS (www.ibm.com/software/analytics/spss/).
18.5 Transformations of Nonstationary
Series Data and ARIMA Models
For nonstationary series caused by outliers, random walk, deterministic
trend, changing variance, and cycles and seasonality, which are described
in Section 18.2, methods of transforming those nonstationary series into sta-
tionary series are described next.
When outliers are detected in a time series, they can be removed and
replaced using the average of the series. A random walk has each observa-
tion randomly deviating from the previous observation without reversion to
the mean. Drunken drivers and birth rates have the behavior of a random
walk (Yaffee and McGee, 2000). Differencing is applied to a random walk
series as follows:
e_t = x_t - x_{t-1}.   (18.11)

A deterministic trend can be modeled by a regression on time, for example, a linear trend,

x_t = a + bt + e_t,   (18.12)

and the trend is removed by taking the difference between the observed value
and the predicted value from the regression model. For a changing vari-
ance with the variance of a time series expanding, contracting, or fluctuating
over time, the natural log transformation or a power transformation (e.g.,
square and square root) can be considered to stabilize the variance (Yaffee
and McGee, 2000). The natural log and power transformations belong to the
family of Box–Cox transformations, which are defined as (Yaffee and McGee,
2000):
y_t = \frac{(x_t + c)^{\lambda} - 1}{\lambda} \quad \text{if } 0 < \lambda \leq 1
y_t = \ln(x_t + c) \quad \text{if } \lambda = 0   (18.13)
where
xt is the original time series
yt is the transformed time series
c is a constant
λ is a shape parameter
For a time series consisting of cycles, some of which are seasonal with
annual periodicity, cyclic or seasonal differencing can be performed as
follows:

e_t = x_t - x_{t-d},   (18.14)

x_t - x_{t-d} = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}.   (18.15)
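A small sketch of the transformations in Equations 18.11, 18.13, and 18.14; the series, the constant c, and the shape parameter λ are arbitrary illustrative choices.

```python
import math

def difference(x, d=1):
    """First or seasonal differencing, e_t = x_t - x_{t-d} (Equations 18.11 and 18.14)."""
    return [x[t] - x[t - d] for t in range(d, len(x))]

def box_cox(x, lam, c=0.0):
    """Box-Cox transformation (Equation 18.13)."""
    if lam == 0:
        return [math.log(xt + c) for xt in x]
    return [((xt + c) ** lam - 1) / lam for xt in x]

x = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]  # a hypothetical series
print(difference(x, d=1))       # removes a random-walk-like level
print(box_cox(x, lam=0.5))      # square-root-type variance stabilization
```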
Exercises
18.1 Construct time series data following an ARMA(1,1) model.
18.2 For the time series data in Table 18.1, compute ACF(1), ACF(2), ACF(3),
PACF(1), and PACF(2).
18.3 For the time series data in Table 18.2, compute ACF(1), ACF(2), ACF(3),
PACF(1), and PACF(2).
19
Markov Chain Models and
Hidden Markov Models
Markov chain models and hidden Markov models have been widely used to
build models and make inferences of sequential data patterns. In this chap-
ter, Markov chain models and hidden Markov models are described. A list
of data mining software packages that support the learning and inference of
Markov chain models and hidden Markov models is provided. Some appli-
cations of Markov chain models and hidden Markov models are given with
references.
P(s_n \mid s_{n-1}, \ldots, s_1) = P(s_n \mid s_{n-1}) \quad \text{for all } n,   (19.1)
where sn is the system state at time n. A stationary Markov chain model has
an additional property that the probability of a state transition from time
n − 1 to n is independent of time n:
P(s_n = j \mid s_{n-1} = i) = P(j|i),   (19.2)
where p(j|i) is the probability that the system is in state j at one time given
the system is in state i at the previous time. A stationary Markov model is
simply called a Markov model in the following text.
If the system has a finite number of states, 1, …, S, a Markov chain model
is defined by the state transition probabilities, P(j|i), i = 1, …, S, j = 1, …, S,
\sum_{j=1}^{S} P(j|i) = 1,   (19.3)

and the initial state probabilities, P(i), i = 1, …, S,

\sum_{i=1}^{S} P(i) = 1,   (19.4)
where P(i) is the probability that the system is in state i at time 1. The
joint probability of a given sequence of system states sn−K+1, …, sn in a time
window of size K including discrete times n − (K − 1), …, n is computed as
follows:
P(s_{n-K+1}, \ldots, s_n) = P(s_{n-K+1}) \prod_{k=K-1}^{1} P(s_{n-k+1} \mid s_{n-k}).   (19.5)
The state transition probabilities and the initial state probabilities can be
learned from a training data set containing one or more state sequences as
follows:
P(j|i) = \frac{N_{ji}}{N_{\cdot i}}   (19.6)

P(i) = \frac{N_i}{N},   (19.7)
where
Nji is the frequency that the state transition from state i to state j appears in
the training data
N.i is the frequency that the state transition from state i to any of the states,
1, …, S, appears in the training data
Ni is the frequency that state i appears in the training data
N is the total number of the states in the training data
Markov chain models can be used to learn and classify sequential data pat-
terns. For each target class, sequential data with the target class can be used
to build a Markov chain model by learning the state transition probability
matrix and the initial probability distribution from the training data accord-
ing to Equations 19.6 and 19.7. That is, we obtain a Markov chain model for
each target class. If we have target classes, 1, …, c, we build Markov chain
models, M1, …, Mc, for these target classes. Given a test sequence, the joint
probability of this sequence is computed using Equation 19.5 under each
Markov chain model. The test sequence is classified into the target class of the
Markov chain model which gives the highest value for the joint probability
of the test sequence.
Example 19.1
A system has two states: misuse (m) and regular use (r). A sequence
of system states is observed for training a Markov chain model:
mmmrrrrrrmrrmrrmrmmr. Build a Markov chain model using the
observed sequence of system states and compute the probability that
the sequence of system states mmrmrr is generated by the Markov chain
model.
Figure 19.1 shows the states and the state transitions in the observed
training sequence of systems states.
Using Equation 19.6 and the training sequence of system states
mmmrrrrrrmrrmrrmrmmr, we learn the following state transition
probabilities:
P(m|m) = \frac{N_{mm}}{N_{\cdot m}} = \frac{3}{8},

P(r|m) = \frac{N_{rm}}{N_{\cdot m}} = \frac{5}{8},

because state transitions 1, 2, and 18 are the state transition of m → m, state
transitions 3, 10, 13, 16, and 19 are the state transition of m → r, and state
transitions 1, 2, 3, 10, 13, 16, 18, and 19 are the state transition of m → any state:
P(m|r) = \frac{N_{mr}}{N_{\cdot r}} = \frac{4}{11},
State: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
m m m r r r r r r m r r m r r m r m m r
State transition: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Figure 19.1
States and state transitions in Example 19.1.
because state transitions 9, 12, 15, and 17 are the state transition of r → m,
and state transitions 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, and 17 are the state transi-
tion of r → any state:
P(r|r) = \frac{N_{rr}}{N_{\cdot r}} = \frac{7}{11},

P(m) = \frac{N_m}{N} = \frac{8}{20},

because states 1, 2, 3, 10, 13, 16, 18, and 19 are state m, and there are
20 states in the sequence of states:

P(r) = \frac{N_r}{N} = \frac{12}{20},
because states 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 20 are state r, and there are
20 states in the sequence of states.
After learning all the parameters of the Markov chain model, we com-
pute the probability that the model generates the sequence of states:
mmrmrr.
P(mmrmrr) = P(m)P(m|m)P(r|m)P(m|r)P(r|m)P(r|r) = \left(\frac{8}{20}\right)\left(\frac{3}{8}\right)\left(\frac{5}{8}\right)\left(\frac{4}{11}\right)\left(\frac{5}{8}\right)\left(\frac{7}{11}\right) = 0.014.
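The learning and scoring in Example 19.1 can be written compactly in Python; the sketch below follows Equations 19.5 through 19.7.

```python
from collections import Counter

def learn_markov_chain(sequence):
    """Estimate initial state and state transition probabilities (Equations 19.6-19.7)."""
    states = set(sequence)
    initial = {s: sequence.count(s) / len(sequence) for s in states}
    pairs = Counter(zip(sequence, sequence[1:]))               # (from, to) transition counts
    out = Counter(sequence[:-1])                               # transitions leaving each state
    transition = {(j, i): pairs[(i, j)] / out[i] for i in states for j in states}
    return initial, transition

def sequence_probability(seq, initial, transition):
    """Joint probability of a state sequence (Equation 19.5)."""
    p = initial[seq[0]]
    for i, j in zip(seq, seq[1:]):
        p *= transition[(j, i)]                                # P(j|i)
    return p

initial, transition = learn_markov_chain("mmmrrrrrrmrrmrrmrmmr")
print(transition[("m", "m")], transition[("r", "m")])          # 3/8 and 5/8
print(round(sequence_probability("mmrmrr", initial, transition), 3))   # about 0.014
```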
\sum_{x} P(x|s) = 1.   (19.8)
It is assumed that observations are independent of each other, and that the
emission probability of x from each state s does not depend on other states.
A hidden Markov model is used to determine the probability that a given
sequence of observations, x 1, …, x N, at stages 1, …, N, is generated by the hid-
den Markov model. Using any path method (Theodoridis and Koutroumbas,
1999), this probability is computed as follows:
\sum_{i=1}^{S^N} P(x_1, \ldots, x_N \mid s_{1_i}, \ldots, s_{N_i})\, P(s_{1_i}, \ldots, s_{N_i}) = \sum_{i=1}^{S^N} P(s_{1_i})\, P(x_1 \mid s_{1_i}) \prod_{n=2}^{N} P(s_{n_i} \mid s_{n-1_i})\, P(x_n \mid s_{n_i}),   (19.9)

where
i is the index for a possible state sequence, s_{1_i}, …, s_{N_i}; there are in total S^N possible state sequences
P(s_{1_i}) is the initial state probability, P(s_{n_i} \mid s_{n-1_i}) is the state transition probability
P(x_n \mid s_{n_i}) is the emission probability
Figure 19.2
Any path method and the best path method for a hidden Markov model.
We define ρ(s_n) as the probability that (1) state s_n is reached at
stage n, (2) observations x_1, …, x_{n−1} have been emitted at stages 1 to n − 1, and
(3) observation x_n is emitted from state s_n at stage n. ρ(s_n) can be computed
recursively as follows:

\rho(s_n) = \sum_{s_{n-1}=1}^{S} \rho(s_{n-1})\, P(s_n \mid s_{n-1})\, P(x_n \mid s_n),   (19.10)

\rho(s_1) = P(s_1)\, P(x_1 \mid s_1).   (19.11)

That is, ρ(s_n) is the sum of the probabilities that, starting from all possible
states s_{n−1} = 1, …, S at stage n − 1 with x_1, …, x_{n−1} already emitted, we transi-
tion to state s_n at stage n, which emits x_n, as illustrated in Figure 19.2. Using
Equations 19.10 and 19.11, Equation 19.9 can be computed as follows:
\sum_{i=1}^{S^N} P(x_1, \ldots, x_N \mid s_{1_i}, \ldots, s_{N_i})\, P(s_{1_i}, \ldots, s_{N_i}) = \sum_{s_N=1}^{S} \rho(s_N).   (19.12)
Hence, in any path method, Equations 19.10 through 19.12 are used to compute
the probability of a hidden Markov model generating a sequence of observa-
tions x 1, …, xN. Any path method starts by computing all ρ(s1) for s1 = 1, …, S
using Equation 19.11, then uses ρ(s1) to compute all ρ(s2), s2 = 1, …, S using
Equation 19.10, and continues all the way to obtain all ρ(sN) for sN = 1, …, S,
which are finally used in Equation 19.12 to complete the computation.
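A sketch of the any path (forward) computation in Equations 19.10 through 19.12 follows. The initial state and state transition probabilities are those learned in Example 19.1; the emission probabilities are assumptions for illustration (chosen to match the values that appear in the calculations of Example 19.2).

```python
# Any path (forward) computation of P(x_1,...,x_N | model), Equations 19.10-19.12.
states = ["m", "r"]
P_init = {"m": 8 / 20, "r": 12 / 20}                      # initial state probabilities
P_trans = {("m", "m"): 0.375, ("r", "m"): 0.625,          # P(to | from)
           ("m", "r"): 0.364, ("r", "r"): 0.636}
P_emit = {("F", "m"): 0.7, ("G", "m"): 0.1, ("H", "m"): 0.2,
          ("F", "r"): 0.2, ("G", "r"): 0.4, ("H", "r"): 0.4}

def any_path_probability(observations):
    rho = {s: P_init[s] * P_emit[(observations[0], s)] for s in states}   # rho(s_1)
    for x in observations[1:]:
        rho = {s: sum(rho[prev] * P_trans[(s, prev)] for prev in states) * P_emit[(x, s)]
               for s in states}                                           # Equation 19.10
    return sum(rho.values())                                              # Equation 19.12

print(any_path_probability("FFFHG"))   # probability of generating FFFHG under this model
```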
The computational cost of any path method is high because all SN possible
state sequences/paths from stage 1 to stage N are involved in the computa-
tion. Instead of using Equation 19.9, the best path method uses Equation 19.13
to compute the probability that a given sequence of observations, x 1, …, x N, at
stages 1, …, N, is generated by the hidden Markov model:

\max_{i=1,\ldots,S^N} P(x_1, \ldots, x_N \mid s_{1_i}, \ldots, s_{N_i})\, P(s_{1_i}, \ldots, s_{N_i}).   (19.13)
That is, instead of summing over all the possible state sequences in Equation
19.9 for any path method, the best path method uses the maximum prob-
ability that the sequence of observations, x 1, …, x N, is generated by any pos-
sible state sequence from stage 1 to stage N. We define β(sn) as the probability
that (1) state sn is reached at stage n through the best path, (2) observations
x_1, …, x_{n−1} have been emitted at stages 1 to n − 1, and (3) observation x_n is
emitted from state s_n at stage n.
The Viterbi algorithm (Viterbi, 1967) is widely used to compute the logarithm
transformation of Equations 19.13 through 19.16.
The best path method requires less computational cost of storing and com-
puting the probabilities than any path method because the computation at
each stage n involves only the S best paths. However, in comparison with any
path method, the best path method is an alternative suboptimal method for
computing the probability that a given sequence of observations, x 1, …, x N, at
stages 1, …, N, is generated by the hidden Markov model, because only the
best path instead of all possible paths is used to determine the probability
of observing x 1, …, x N, given all possible paths in the hidden Markov model
that can possibly generate the observation sequence.
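Because Equations 19.14 through 19.16 are not reproduced here, the following is only a generic sketch of a best path (Viterbi-style) computation under the same model set-up as the previous sketch.

```python
# Best path method: keep, for each state, the single most probable path so far
# instead of summing over all paths (cf. Equation 19.13).
states = ["m", "r"]
P_init = {"m": 0.4, "r": 0.6}
P_trans = {("m", "m"): 0.375, ("r", "m"): 0.625, ("m", "r"): 0.364, ("r", "r"): 0.636}
P_emit = {("F", "m"): 0.7, ("G", "m"): 0.1, ("H", "m"): 0.2,
          ("F", "r"): 0.2, ("G", "r"): 0.4, ("H", "r"): 0.4}

def best_path(observations):
    beta = {s: P_init[s] * P_emit[(observations[0], s)] for s in states}  # beta(s_1)
    path = {s: [s] for s in states}
    for x in observations[1:]:
        new_beta, new_path = {}, {}
        for s in states:
            prev = max(states, key=lambda p: beta[p] * P_trans[(s, p)])
            new_beta[s] = beta[prev] * P_trans[(s, prev)] * P_emit[(x, s)]
            new_path[s] = path[prev] + [s]
        beta, path = new_beta, new_path
    best = max(states, key=lambda s: beta[s])
    return beta[best], path[best]

print(best_path("FFFHG"))   # best-path probability and its state sequence
```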
Hidden Markov models have been widely used in speech recognition,
handwritten character recognition, natural language processing, DNA
sequence recognition, and so on. In the application of hidden Markov models
to handwritten digit recognition (Bishop, 2006) for recognizing handwritten
digits, 0, 1, …, 9, a hidden Markov model is built for each digit. Each digit is
considered to have a sequence of line trajectories, x 1, …, x N, at stages 1, …, N.
Each hidden Markov model has 16 latent states, each of which can emit a line
segment of a fixed length with one of 16 possible angles. Hence, the emis-
sion distribution can be specified by a 16 × 16 matrix with the probability of
emitting each of 16 angles from each of 16 states. The hidden Markov model
for each digit is trained to establish the initial probability distribution, the
transition probability matrix, and the emission probabilities using 45 hand-
written examples of the digit. Given a handwritten digit to recognize, the
probability that the handwritten digit is generated by the hidden Markov
model for each digit is computed. The handwritten digit is classified as the
digit whose hidden Markov model produces the highest probability of gen-
erating the handwritten digit.
Hence, to apply hidden Markov models to a classification problem, a hid-
den Markov model is built for each target class. Given a sequence of observa-
tions, the probability of generating this observation sequence by each hidden
Markov model is computed using any path method or the best path method.
The given observation sequence is classified into the target class whose hid-
den Markov model produces the highest probability of generating the obser-
vation sequence.
A = \{P(j|i),\, P(i),\, P(x|i)\}.   (19.17)
The model parameters need to be learned from a training data set containing
a sequence of N observations, X = x 1, …, x N. Since the states cannot be directly
observed, Equations 19.6 and 19.7 cannot be used to learn the model param-
eters such as the state transition probabilities and the initial state probabili-
ties. Instead, the expectation maximization (EM) method is used to estimate
the model parameters, which maximize the probability of obtaining the
observation sequence from the model with the estimated model parameters,
P(X|A). The EM method has the following steps:
1. Assign initial values of the model parameters, A, and use these val-
ues to compute P(X|A).
2. Reestimate the model parameters to obtain Â, and compute P(X|Â).
3. If P(X|Â) − P(X|A) > ε, let A = Â because Â improves the probability of
obtaining the observation sequence over A, and go to Step 2;
otherwise, stop because P(X|Â) is worse than or similar to P(X|A), and take
A as the final set of the model parameters.
Let θ_n(i, j, X|A) be the probability that (1) the path goes through state i at stage n, (2) the
path goes through state j at the next stage n + 1, and (3) the model generates
the observation sequence X using the model parameters A. Let φn(i, X|A) be
the probability that (1) the path goes through state i at stage n, and (2) the
model generates the observation sequence X using the model parameters A.
Let ωn(i) be the probability of having the observations x n+1, …, x N at stages
n + 1, …, N, given that the path goes through state i at stage n. For any path
method, ωn(i) can be computed recursively for n = N − 1, …, 1 as follows:
\omega_n(i) = P(x_{n+1}, \ldots, x_N \mid s_n = i, A) = \sum_{s_{n+1}=1}^{S} \omega_{n+1}(s_{n+1})\, P(s_{n+1} \mid s_n = i)\, P(x_{n+1} \mid s_{n+1})   (19.18)

\omega_N(i) = 1, \quad i = 1, \ldots, S.   (19.19)

For the best path method, ω_n(i) can be computed recursively for n = N − 1, …, 1
as follows:

\omega_n(i) = \max_{s_{n+1}=1,\ldots,S} \omega_{n+1}(s_{n+1})\, P(s_{n+1} \mid s_n = i)\, P(x_{n+1} \mid s_{n+1})   (19.20)

\omega_N(i) = 1, \quad i = 1, \ldots, S.   (19.21)
We also have

\varphi_n(i, X|A) = \rho_n(i)\,\omega_n(i),   (19.22)

where ρ_n(i) denotes ρ(s_n = i), which is computed using Equations 19.10 and
19.11. The model parameter P(i) is the expected number of times that state i
occurs at stage 1, given the observation sequence X and the model param-
eters A, that is, P(i|X, A). The model parameter P(j|i) is the expected number
of times that transitions from state i to state j occur, given the observation
sequence X and the model parameters A, that is, P(i, j|X, A)/P(i|X, A). The
model parameters are reestimated as follows:
\hat{P}(i) = P(i \mid X, A) = \frac{\varphi_1(i, X|A)}{P(X|A)} = \frac{\rho_1(i)\,\omega_1(i)}{P(X|A)}   (19.23)
\hat{P}(j|i) = \frac{P(i, j \mid X, A)}{P(i \mid X, A)} = \frac{\sum_{n=1}^{N-1}\theta_n(i, j, X|A)/P(X|A)}{\sum_{n=1}^{N-1}\varphi_n(i, X|A)/P(X|A)}
= \frac{\sum_{n=1}^{N-1}\rho_n(i)\, P(j|i)\, P(x_{n+1}|j)\,\omega_{n+1}(j)/P(X|A)}{\sum_{n=1}^{N-1}\rho_n(i)\,\omega_n(i)/P(X|A)}
= \frac{\sum_{n=1}^{N-1}\rho_n(i)\, P(j|i)\, P(x_{n+1}|j)\,\omega_{n+1}(j)}{\sum_{n=1}^{N-1}\rho_n(i)\,\omega_n(i)}   (19.24)
\hat{P}(x = v \mid i) = \frac{\sum_{n=1}^{N}\varphi_{n \& x_n = v}(i)/P(X|A)}{\sum_{n=1}^{N}\varphi_n(i)/P(X|A)} = \frac{\sum_{n=1}^{N}\rho_{n \& x_n = v}(i)\,\omega_{n \& x_n = v}(i)}{\sum_{n=1}^{N}\rho_n(i)\,\omega_n(i)}   (19.25)

where

\varphi_{n \& x_n = v}(i) = \begin{cases}\varphi_n(i) & \text{if } x_n = v\\ 0 & \text{if } x_n \neq v\end{cases}   (19.26)

\rho_{n \& x_n = v}(i) = \begin{cases}\rho_n(i) & \text{if } x_n = v\\ 0 & \text{if } x_n \neq v\end{cases}   (19.27)

\omega_{n \& x_n = v}(i) = \begin{cases}\omega_n(i) & \text{if } x_n = v\\ 0 & \text{if } x_n \neq v\end{cases}   (19.28)
Example 19.2
A system has two states: misuse (m) and regular use (r), each of which
can produce one of three events: F, G, and H. A sequence of five events is
observed: FFFHG. Using any path method, perform one iteration of the
EM method to reestimate the model parameters.

In Step 1 of the EM method, the ρ values are computed recursively using Equations 19.10 and 19.11, for example,

\rho_2(r) = \rho(s_2 = r) = \sum_{s_1=1}^{2}\rho(s_1)\, P(s_2 = r \mid s_1)\, P(x_2 \mid s_2 = r)

\rho_3(r) = \rho(s_3 = r) = \sum_{s_2=1}^{2}\rho(s_2)\, P(s_3 = r \mid s_2)\, P(x_3 \mid s_3 = r)

\rho_4(r) = \rho(s_4 = r) = \sum_{s_3=1}^{2}\rho(s_3)\, P(s_4 = r \mid s_3)\, P(x_4 \mid s_4 = r)

\rho_5(r) = \rho(s_5 = r) = \sum_{s_4=1}^{2}\rho(s_4)\, P(s_5 = r \mid s_4)\, P(x_5 \mid s_5 = r),

and the probability of the observation sequence is

P(X = FFFHG \mid A) = \rho_5(m) + \rho_5(r) = 0.0066.
In Step 2 of the EM method, we use Equations 19.23 through 19.25 to rees-
timate the model parameters. We first need to use Equations 19.18 and
19.19 to compute ωn(i), n = 5, 4, 3, 2, and 1, which are used in Equations
19.23 through 19.25:
\omega_5(m) = 1, \quad \omega_5(r) = 1

\omega_4(m) = P(x_5 = G \mid s_4 = m, A) = \sum_{s_5=1}^{2}\omega_5(s_5)\, P(s_5 \mid s_4 = m)\, P(x_5 = G \mid s_5)
= (1)(0.375)(0.1) + (1)(0.625)(0.4) = 0.2875

\omega_4(r) = P(x_5 = G \mid s_4 = r, A) = \sum_{s_5=1}^{2}\omega_5(s_5)\, P(s_5 \mid s_4 = r)\, P(x_5 = G \mid s_5)
= (1)(0.364)(0.1) + (1)(0.636)(0.4) = 0.2908

\omega_3(m) = P(x_4 = H, x_5 = G \mid s_3 = m, A) = \sum_{s_4=1}^{2}\omega_4(s_4)\, P(s_4 \mid s_3 = m)\, P(x_4 = H \mid s_4)
= (0.2875)(0.375)(0.2) + (0.2908)(0.625)(0.4) = 0.0943

\omega_3(r) = P(x_4 = H, x_5 = G \mid s_3 = r, A) = \sum_{s_4=1}^{2}\omega_4(s_4)\, P(s_4 \mid s_3 = r)\, P(x_4 = H \mid s_4)
= (0.2875)(0.364)(0.2) + (0.2908)(0.636)(0.4) = 0.0949

\omega_2(m) = P(x_3 = F, x_4 = H, x_5 = G \mid s_2 = m, A) = \sum_{s_3=1}^{2}\omega_3(s_3)\, P(s_3 \mid s_2 = m)\, P(x_3 = F \mid s_3)
= (0.0943)(0.375)(0.7) + (0.0949)(0.625)(0.2) = 0.0366

\omega_2(r) = P(x_3 = F, x_4 = H, x_5 = G \mid s_2 = r, A) = \sum_{s_3=1}^{2}\omega_3(s_3)\, P(s_3 \mid s_2 = r)\, P(x_3 = F \mid s_3)
= (0.0943)(0.364)(0.7) + (0.0949)(0.636)(0.2) = 0.0361

\omega_1(m) = P(x_2 = F, x_3 = F, x_4 = H, x_5 = G \mid s_1 = m, A) = \sum_{s_2=1}^{2}\omega_2(s_2)\, P(s_2 \mid s_1 = m)\, P(x_2 = F \mid s_2)
= (0.0366)(0.375)(0.7) + (0.0361)(0.625)(0.2) = 0.0141

\omega_1(r) = P(x_2 = F, x_3 = F, x_4 = H, x_5 = G \mid s_1 = r, A) = \sum_{s_2=1}^{2}\omega_2(s_2)\, P(s_2 \mid s_1 = r)\, P(x_2 = F \mid s_2)
= (0.0366)(0.364)(0.7) + (0.0361)(0.636)(0.2) = 0.0139.
\hat{P}(r) = \frac{\rho_1(r)\,\omega_1(r)}{P(X = FFFHG \mid A)} = \frac{(0.12)(0.0139)}{0.0066} = 0.2527
\hat{P}(m|m) = \frac{\sum_{n=1}^{4}\rho_n(m)\, P(m|m)\, P(x_{n+1}|m)\,\omega_{n+1}(m)}{\sum_{n=1}^{4}\rho_n(m)\,\omega_n(m)}
= \frac{(0.28)(0.375)(0.7)(0.0366) + (0.1060)(0.375)(0.7)(0.0943) + (0.0470)(0.375)(0.2)(0.2875) + (0.0052)(0.375)(0.1)(1)}{(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875)}
= 0.4742
\hat{P}(r|m) = \frac{\sum_{n=1}^{4}\rho_n(m)\, P(r|m)\, P(x_{n+1}|r)\,\omega_{n+1}(r)}{\sum_{n=1}^{4}\rho_n(m)\,\omega_n(m)}
= \frac{(0.28)(0.625)(0.2)(0.0361) + (0.1060)(0.625)(0.2)(0.0949) + (0.0470)(0.625)(0.4)(0.2908) + (0.0052)(0.625)(0.4)(1)}{(0.28)(0.0141) + (0.1060)(0.0366) + (0.0470)(0.0943) + (0.0052)(0.2875)}
= 0.5262
\hat{P}(m|r) = \frac{\sum_{n=1}^{4}\rho_n(r)\, P(m|r)\, P(x_{n+1}|m)\,\omega_{n+1}(m)}{\sum_{n=1}^{4}\rho_n(r)\,\omega_n(r)} = 0.3469
\hat{P}(r|r) = \frac{\sum_{n=1}^{4}\rho_n(r)\, P(r|r)\, P(x_{n+1}|r)\,\omega_{n+1}(r)}{\sum_{n=1}^{4}\rho_n(r)\,\omega_n(r)}
= \frac{(0.12)(0.636)(0.2)(0.0361) + (0.0754)(0.636)(0.2)(0.0949) + (0.0228)(0.636)(0.4)(0.2908) + (0.0176)(0.636)(0.4)(1)}{(0.12)(0.0139) + (0.0754)(0.0361) + (0.0228)(0.0949) + (0.0176)(0.2908)}
= 0.6533
\hat{P}(x = F|m) = \frac{\sum_{n=1}^{5}\rho_{n \& x_n = F}(m)\,\omega_{n \& x_n = F}(m)}{\sum_{n=1}^{5}\rho_n(m)\,\omega_n(m)} = 0.6269
\hat{P}(x = G|m) = \frac{\sum_{n=1}^{5}\rho_{n \& x_n = G}(m)\,\omega_{n \& x_n = G}(m)}{\sum_{n=1}^{5}\rho_n(m)\,\omega_n(m)} = 0.0550
\hat{P}(x = H|m) = \frac{\sum_{n=1}^{5}\rho_{n \& x_n = H}(m)\,\omega_{n \& x_n = H}(m)}{\sum_{n=1}^{5}\rho_n(m)\,\omega_n(m)} = 0.1027
\hat{P}(x = F|r) = \frac{\sum_{n=1}^{5}\rho_{n \& x_n = F}(r)\,\omega_{n \& x_n = F}(r)}{\sum_{n=1}^{5}\rho_n(r)\,\omega_n(r)} = 0.3751

\hat{P}(x = G|r) = \frac{\sum_{n=1}^{5}\rho_{n \& x_n = G}(r)\,\omega_{n \& x_n = G}(r)}{\sum_{n=1}^{5}\rho_n(r)\,\omega_n(r)} = 0.3320

\hat{P}(x = H|r) = \frac{\sum_{n=1}^{5}\rho_{n \& x_n = H}(r)\,\omega_{n \& x_n = H}(r)}{\sum_{n=1}^{5}\rho_n(r)\,\omega_n(r)} = 0.2929
Exercises
19.1 Given the Markov chain model in Example 19.1, determine the prob-
ability of observing a sequence of system states: rmmrmrrmrrrrrrmmm.
19.2 A system has two states, misuse (m) and regular use (r), each of which
can produce one of three events: F, G, and H. A hidden Markov model
for the system has the initial state probabilities and state transition
probabilities given from Example 19.1, and the state emission probabili-
ties as follows:
20
Wavelet Analysis

Many objects have a periodic behavior and thus show a unique characteristic
in the frequency domain. For example, human sounds have a range of frequen-
cies that are different from those of some animals. Objects in space, includ-
ing the earth, move at different frequencies. A new object in space can
be discovered by observing its unique movement frequency, which is dif-
ferent from those of known objects. Hence, the frequency characteristic of
an object can be useful in identifying an object. Wavelet analysis represents
time series data in the time–frequency domain using data characteristics
over time in various frequencies, and thus allows us to uncover temporal
data patterns at various frequencies. There are many forms of wavelets, e.g.,
Haar, Daubechies, and derivative of Gaussian (DoG). In this chapter, we use
the Haar wavelet to explain how wavelet analysis works to transform time
series data to data in the time–frequency domain. A list of software pack-
ages that support wavelet analysis is provided. Some applications of wavelet
analysis are given with references.
\varphi(x) = \begin{cases} 1 & \text{if } 0 \leq x < 1 \\ 0 & \text{otherwise} \end{cases}   (20.1)
The wavelet function of the Haar wavelet is defined using the scaling func-
tion (Boggess and Narcowich, 2001; Vidakovic, 1999), as shown in Figure 20.1:
\psi(x) = \varphi(2x) - \varphi(2x - 1) = \begin{cases} 1 & \text{if } 0 \leq x < \frac{1}{2} \\ -1 & \text{if } \frac{1}{2} \leq x < 1 \end{cases}   (20.2)
Figure 20.1
The scaling function and the wavelet function of the Haar wavelet and the dilation and shift effects.
Hence, the wavelet function of the Haar wavelet represents the change of
the function value from 1 to −1 in [0, 1). The function φ(2x) in Formula 20.2
is a step function with the height of 1 for the range of x values in [0, ½), as
shown in Figure 20.1. In general, the parameter a before x in φ(ax) produces a
dilation effect on the range of x values, widening or contracting the x range
by 1/a, as shown in Figure 20.1. The function φ(2x − 1) is also a step function
with the height of 1 for the range of x values in [½, 1). In general, the param-
eter b in φ(x + b) produces a shift effect on the range of x values, moving the
x range by b, as shown in Figure 20.1. Hence, φ(ax + b) defines a step function
with the height of 1 for x values in the range of [−b/a, (1 − b)/a), as shown next,
given a > 0:
0 \leq ax + b < 1

\frac{-b}{a} \leq x < \frac{1-b}{a}.
f(x) = \sum_{i=0}^{2^k - 1} a_i\,\varphi(2^k x - i)   (20.3)
Figure 20.2
A sample of time series data from (a) a function, (b) a sample of data points taken from a function, and (c) an approximation of the function using the scaling function of the Haar wavelet.
In Formula 20.3, a_i\varphi(2^k x - i) defines a step function with the height of a_i for x
values in the range [i/2^k, (i + 1)/2^k). Figure 20.2c shows the approximation
of the function using the step functions at the height of the eight data points.
Considering the first two step functions in Formula 20.3, \varphi(2^k x) and
\varphi(2^k x - 1), which have the value of 1 for the x values in [0, 1/2^k) and [1/2^k, 2/2^k),
respectively, we have the following relationships:

\varphi(2^{k-1} x) = \varphi(2^k x) + \varphi(2^k x - 1)   (20.4)

\psi(2^{k-1} x) = \varphi(2^k x) - \varphi(2^k x - 1).   (20.5)

\varphi(2^{k-1} x) in Equation 20.4 has the value of 1 for the x values in [0, 1/2^{k-1}),
which covers [0, 1/2^k) and [1/2^k, 2/2^k) together. \psi(2^{k-1} x) in Equation 20.5 also
covers [0, 1/2^k) and [1/2^k, 2/2^k) together but has the value of 1 for the x values
in [0, 1/2^k) and −1 for the x values in [1/2^k, 2/2^k). An equivalent form of
Equations 20.4 and 20.5 is obtained by adding Equations 20.4 and 20.5 and by
subtracting Equation 20.5 from Equation 20.4:

\varphi(2^k x) = \frac{1}{2}\left[\varphi(2^{k-1} x) + \psi(2^{k-1} x)\right]   (20.6)

\varphi(2^k x - 1) = \frac{1}{2}\left[\varphi(2^{k-1} x) - \psi(2^{k-1} x)\right].   (20.7)
At the left-hand side of Equations 20.6 and 20.7, we look at the data points
at the time interval of 1/2^k or the frequency of 2^k. At the right-hand side of
Equations 20.6 and 20.7, we look at the data points at the larger time interval
of 1/2^{k-1} or a lower frequency of 2^{k-1}.
In general, considering the two step functions in Formula 20.3, \varphi(2^k x - i)
and \varphi(2^k x - i - 1), which have the value of 1 for the x values in [i/2^k, (i + 1)/2^k)
and [(i + 1)/2^k, (i + 2)/2^k), respectively, we have the following relationships:
\varphi\!\left(2^{k-1} x - \frac{i}{2}\right) = \varphi(2^k x - i) + \varphi(2^k x - i - 1)   (20.8)

\psi\!\left(2^{k-1} x - \frac{i}{2}\right) = \varphi(2^k x - i) - \varphi(2^k x - i - 1).   (20.9)

\varphi(2^{k-1} x - i/2) in Equation 20.8 has the value of 1 for the x values in [i/2^k, (i + 2)/2^k)
or [i/2^k, i/2^k + 1/2^{k-1}) with the time interval of 1/2^{k-1}. \psi(2^{k-1} x - i/2) in Equation
20.9 has the value of 1 for the x values in [i/2^k, (i + 1)/2^k) and −1 for the x values
in [(i + 1)/2^k, (i + 2)/2^k). An equivalent form of Equations 20.8 and 20.9 is

\varphi(2^k x - i) = \frac{1}{2}\left[\varphi\!\left(2^{k-1} x - \frac{i}{2}\right) + \psi\!\left(2^{k-1} x - \frac{i}{2}\right)\right]   (20.10)

\varphi(2^k x - i - 1) = \frac{1}{2}\left[\varphi\!\left(2^{k-1} x - \frac{i}{2}\right) - \psi\!\left(2^{k-1} x - \frac{i}{2}\right)\right].   (20.11)
At the left-hand side of Equations 20.10 and 20.11, we look at the data points
at the time interval of 1/2^k or the frequency of 2^k. At the right-hand side of
Equations 20.10 and 20.11, we look at the data points at the larger time inter-
val of 1/2^{k-1} or a lower frequency of 2^{k-1}.
Equations 20.10 and 20.11 allow us to perform the wavelet transform of
time series data or their function representation in Formula 20.3 into data at
various frequencies, as illustrated through Example 20.1.
Example 20.1
Perform the Haar wavelet transform of time series data 0, 2, 0, 2, 6, 8, 6, 8.
First, we represent the time series data using the scaling function of the
Haar wavelet:
f(x) = \sum_{i=0}^{2^k - 1} a_i\,\varphi(2^k x - i)

f(x) = 0\varphi(2^3 x) + 2\varphi(2^3 x - 1)
+ 0\varphi(2^3 x - 2) + 2\varphi(2^3 x - 3)
+ 6\varphi(2^3 x - 4) + 8\varphi(2^3 x - 5)
+ 6\varphi(2^3 x - 6) + 8\varphi(2^3 x - 7).
Next, we use Equations 20.10 and 20.11 to transform each pair of step functions, with i = 0 and i + 1 = 1 for the first pair,
i = 2 and i + 1 = 3 for the second pair, i = 4 and i + 1 = 5 for the third pair,
and i = 6 and i + 1 = 7 for the fourth pair:

f(x) = 0 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{0}{2}\right) + \psi\!\left(2^2 x - \tfrac{0}{2}\right)\right] + 2 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{0}{2}\right) - \psi\!\left(2^2 x - \tfrac{0}{2}\right)\right]
+ 0 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{2}{2}\right) + \psi\!\left(2^2 x - \tfrac{2}{2}\right)\right] + 2 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{2}{2}\right) - \psi\!\left(2^2 x - \tfrac{2}{2}\right)\right]
+ 6 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{4}{2}\right) + \psi\!\left(2^2 x - \tfrac{4}{2}\right)\right] + 8 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{4}{2}\right) - \psi\!\left(2^2 x - \tfrac{4}{2}\right)\right]
+ 6 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{6}{2}\right) + \psi\!\left(2^2 x - \tfrac{6}{2}\right)\right] + 8 \times \frac{1}{2}\left[\varphi\!\left(2^2 x - \tfrac{6}{2}\right) - \psi\!\left(2^2 x - \tfrac{6}{2}\right)\right]

f(x) = \left(0 \times \tfrac{1}{2} + 2 \times \tfrac{1}{2}\right)\varphi(2^2 x) + \left(0 \times \tfrac{1}{2} - 2 \times \tfrac{1}{2}\right)\psi(2^2 x)
+ \left(0 \times \tfrac{1}{2} + 2 \times \tfrac{1}{2}\right)\varphi(2^2 x - 1) + \left(0 \times \tfrac{1}{2} - 2 \times \tfrac{1}{2}\right)\psi(2^2 x - 1)
+ \left(6 \times \tfrac{1}{2} + 8 \times \tfrac{1}{2}\right)\varphi(2^2 x - 2) + \left(6 \times \tfrac{1}{2} - 8 \times \tfrac{1}{2}\right)\psi(2^2 x - 2)
+ \left(6 \times \tfrac{1}{2} + 8 \times \tfrac{1}{2}\right)\varphi(2^2 x - 3) + \left(6 \times \tfrac{1}{2} - 8 \times \tfrac{1}{2}\right)\psi(2^2 x - 3)

f(x) = \varphi(2^2 x) - \psi(2^2 x)
+ \varphi(2^2 x - 1) - \psi(2^2 x - 1)
+ 7\varphi(2^2 x - 2) - \psi(2^2 x - 2)
+ 7\varphi(2^2 x - 3) - \psi(2^2 x - 3).

We use Equations 20.10 and 20.11 again to transform the \varphi(2^2 x - i) terms of the aforementioned function:

f(x) = \frac{1}{2}\left[\varphi(2x) + \psi(2x)\right] + \frac{1}{2}\left[\varphi(2x) - \psi(2x)\right]
+ 7 \times \frac{1}{2}\left[\varphi(2x - 1) + \psi(2x - 1)\right] + 7 \times \frac{1}{2}\left[\varphi(2x - 1) - \psi(2x - 1)\right]
- \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3)

f(x) = \left(\tfrac{1}{2} + \tfrac{1}{2}\right)\varphi(2x) + \left(\tfrac{1}{2} - \tfrac{1}{2}\right)\psi(2x) + \left(\tfrac{7}{2} + \tfrac{7}{2}\right)\varphi(2x - 1) + \left(\tfrac{7}{2} - \tfrac{7}{2}\right)\psi(2x - 1)
- \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3)

f(x) = \varphi(2x) + 7\varphi(2x - 1)
+ 0\psi(2x) + 0\psi(2x - 1)
- \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3).

Again, we use Equations 20.10 and 20.11 to transform the \varphi(2x - i) terms of the aforementioned function:

f(x) = \frac{1}{2}\left[\varphi(x) + \psi(x)\right] + 7 \times \frac{1}{2}\left[\varphi(x) - \psi(x)\right]
+ 0\psi(2x) + 0\psi(2x - 1) - \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3)

f(x) = \left(\tfrac{1}{2} + \tfrac{7}{2}\right)\varphi(x) + \left(\tfrac{1}{2} - \tfrac{7}{2}\right)\psi(x)
+ 0\psi(2x) + 0\psi(2x - 1) - \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3)

f(x) = 4\varphi(x) - 3\psi(x) + 0\psi(2x) + 0\psi(2x - 1) - \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3).   (20.12)
The function in Equation 20.12 gives the final result of the Haar wave-
let transform. The function has eight terms, as the original data sample
has eight data points. The first term, 4φ(x), represents a step function
at the height of 4 for x in [0, 1) and gives the average of the original
data points, 0, 2, 0, 2, 6, 8, 6, 8. The second term, −3ψ(x), has the wavelet
function ψ(x), which represents a step change of the function value
from 1 to −1 or the step change of −2 as the x values go from the first
half of the range [0, ½) to the second half of the range [½, 1). Hence, the
second term, −3ψ(x), reveals that the original time series data have the
step change of (−3) × (−2) = 6 from the first half set of four data points
to the second half set of four data points as the average of the first four
data points is 1 and the average of the last four data points is 7. The
third term, 0ψ(2x), represents that the original time series data have no
step change from the first and second data points to the third and fourth
data points as the average of the first and second data points is 1 and
the average of the third and fourth data points is 1. The fourth term,
0ψ(2x−1), represents that the original time series data have no step
change from the fifth and sixth data points to the seventh and eighth
data points as the average of the fifth and sixth data points is 7 and
the average of the seventh and eighth data points is 7. The fifth, sixth,
seventh, and eighth terms of the function in Equation 20.12, −ψ(2^2 x),
−ψ(2^2 x − 1), −ψ(2^2 x − 2), and −ψ(2^2 x − 3), reveal that the original time series
data have the step change of (−1) × (−2) = 2 from the first data point of 0
to the second data point of 2, the step change of (−1) × (−2) = 2 from the
third data point of 0 to the fourth data point of 2, the step change of
(−1) × (−2) = 2 from the fifth data point of 6 to the sixth data point of 8,
and the step change of (−1) × (−2) = 2 from the seventh data point of 6 to
the eighth data point of 8. Hence, the Haar wavelet transform of eight
data points in the original time series data produces eight terms with
the coefficient of the scaling function φ(x) revealing the average of the
original data, the coefficient of the wavelet function ψ(x) revealing the
step change in the original data at the lowest frequency from the first
half set of four data points to the second half set of four data points,
the coefficients of the wavelet functions ψ(2x) and ψ(2x − 1) revealing
the step changes in the original data at the higher frequency of every
two data points, and the coefficients of the wavelet functions ψ(2 2 x),
ψ(22 x − 1), ψ(22 x − 2) and ψ(22 x − 3) revealing the step changes in the
original data at the highest frequency of every data point.
Hence, the Haar wavelet transform of time series data allows us to
transform time series data to data in the time–frequency domain
and observe the characteristics of the wavelet data pattern (e.g., a
step change for the Haar wavelet) in the time–frequency domain. For
example, the wavelet transform of the time series data 0, 2, 0, 2, 6, 8,
6, 8 in Equation 20.12 reveals that the data have the average of 4, a step
increase of 6 at four data points (at the lowest frequency of step change),
no step change at every two data points (at the medium frequency of
step change), and a step increase of 2 at every data point (at the highest
frequency of step change). In addition to the Haar wavelet that captures
the data pattern of a step change, there are many other wavelet forms,
for example, the Paul wavelet, the DoG wavelet, the Daubechies wave-
let, and Morlet wavelet as shown in Figure 20.3, which capture other
types of data patterns. Many wavelet forms are developed so that an
appropriate wavelet form can be selected to give a close match to the
data pattern of time series data. For example, the Daubechies wavelet
(Daubechies, 1990) may be used to perform the wavelet transform of
time series data that shows a data pattern of linear increase or linear
decrease. The Paul and DoG wavelets may be used for time series data
that show wave-like data patterns.

Figure 20.3
Graphic illustration of the Paul wavelet, the DoG wavelet, the Daubechies wavelet, and the Morlet wavelet. (Ye, N., Secure Computer and Network Systems: Modeling, Analysis and Design, 2008, Figure 11.2, p. 200. Copyright Wiley-VCH Verlag GmbH & Co. KGaA. Reproduced with permission.)
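For the Haar wavelet, the whole transform of Example 20.1 reduces to repeatedly replacing each pair of values by their average and half-difference; a minimal Python sketch (not from the book) is given below.

```python
# Haar wavelet transform by repeatedly replacing each pair (a, b) with the
# average (a + b)/2 (scaling coefficient) and half-difference (a - b)/2
# (wavelet coefficient), as in Equations 20.10 and 20.11.
def haar_transform(data):
    coeffs = []                      # wavelet coefficients, highest frequency last
    level = list(data)
    while len(level) > 1:
        averages = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
        details  = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
        coeffs = details + coeffs    # prepend so lower frequencies come first
        level = averages
    return level + coeffs            # [overall average, psi(x), psi(2x), ..., psi(2^k x - i)]

print(haar_transform([0, 2, 0, 2, 6, 8, 6, 8]))
# [4.0, -3.0, 0.0, 0.0, -1.0, -1.0, -1.0, -1.0], matching Equation 20.12
```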
Example 20.2
Reconstruct time series data from the wavelet coefficients in Equation
20.12, which is repeated next:
f(x) = 4\varphi(x)
- 3\psi(x)
+ 0\psi(2x) + 0\psi(2x - 1)
- \psi(2^2 x) - \psi(2^2 x - 1) - \psi(2^2 x - 2) - \psi(2^2 x - 3).

Using \varphi(x) = \varphi(2x) + \varphi(2x - 1), \psi(x) = \varphi(2x) - \varphi(2x - 1), and \psi(2^2 x - i) = \varphi(2^3 x - 2i) - \varphi(2^3 x - 2i - 1), we obtain

f(x) = \varphi(2x) + 7\varphi(2x - 1)
- \varphi(2^3 x) + \varphi(2^3 x - 1) - \varphi(2^3 x - 2) + \varphi(2^3 x - 3) - \varphi(2^3 x - 4) + \varphi(2^3 x - 5)
- \varphi(2^3 x - 6) + \varphi(2^3 x - 7).

Expanding \varphi(2x) = \varphi(2^3 x) + \varphi(2^3 x - 1) + \varphi(2^3 x - 2) + \varphi(2^3 x - 3) and 7\varphi(2x - 1) = 7\varphi(2^3 x - 4) + 7\varphi(2^3 x - 5) + 7\varphi(2^3 x - 6) + 7\varphi(2^3 x - 7) and collecting terms gives

f(x) = 0\varphi(2^3 x) + 2\varphi(2^3 x - 1)
+ 0\varphi(2^3 x - 2) + 2\varphi(2^3 x - 3)
+ 6\varphi(2^3 x - 4) + 8\varphi(2^3 x - 5)
+ 6\varphi(2^3 x - 6) + 8\varphi(2^3 x - 7).
Exercises
20.1 Perform the Haar wavelet transform of time series data 2.5, 0.5, 4.5, 2.5,
−1, 1, 2, 6 and explain the meaning of each coefficient in the result of the
Haar wavelet transform.
20.2 The Haar wavelet transform of given time series data produces the fol-
lowing wavelet coefficients:
f(x) = 2.25\varphi(x)
+ 0.25\psi(x)
- 1\psi(2x) - 2\psi(2x - 1)
+ \psi(2^2 x) + \psi(2^2 x - 1) - \psi(2^2 x - 2) - 2\psi(2^2 x - 3).
Reconstruct the original time series data using these coefficients.
20.3 After setting the zero value to the coefficients whose absolute value is
smaller than 1.5 in the Haar wavelet transform from Exercise 20.2, we
have the following wavelet coefficients:
f(x) = 2.25\varphi(x)
+ 0\psi(x)
+ 0\psi(2x) - 2\psi(2x - 1)
+ 0\psi(2^2 x) + 0\psi(2^2 x - 1) + 0\psi(2^2 x - 2) - 2\psi(2^2 x - 3).
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large
databases. In Proceedings of the 20th International Conference on Very Large Data
Bases, Santiago, Chile, pp. 487–499.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.
Boggess, A. and Narcowich, F. J. 2001. The First Course in Wavelets with Fourier Analysis.
Upper Saddle River, NJ: Prentice Hall.
Box, G.E.P. and Jenkins, G. 1976. Time Series Analysis: Forecasting and Control. Oakland,
CA: Holden-Day.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and
Regression Trees. Boca Raton, FL: CRC Press.
Bryc, W. 1995. The Normal Distribution: Characterizations with Applications. New York:
Springer-Verlag.
Burges, C. J. C. 1998. A tutorial on support vector machines for pattern recognition.
Data Mining and Knowledge Discovery, 2, 121–167.
Chou, Y.-M., Mason, R. L., and Young, J. C. 1999. Power comparisons for a Hotelling’s
T2 statistic. Communications of Statistical Simulation, 28(4), 1031–1050.
Daubechies, I. 1990. The wavelet transform, time-frequency localization and signal
analysis. IEEE Transactions on Information Theory, 36(5), 961–1005.
Davis, G. A. 2003. Bayesian reconstruction of traffic accidents. Law, Probability and
Risk, 2(2), 69–89.
Díez, F. J., Mira, J., Iturralde, E., and Zubillaga, S. 1997. DIAVAL, a Bayesian expert
system for echocardiography. Artificial Intelligence in Medicine, 10, 59–73.
Emran, S. M. and Ye, N. 2002. Robustness of chi-square and Canberra techniques in
detecting intrusions into information systems. Quality and Reliability Engineering
International, 18(1), 19–28.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for
discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han,
U. M. Fayyad (eds.) Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD-96), Portland, OR, AAAI Press, pp. 226–231.
Everitt, B. S. 1979. A Monte Carlo investigation of the Robustness of Hotelling’s one-
and two-sample T2 tests. Journal of American Statistical Association, 74(365), 48–51.
Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.
ics.uci.edu/ml. Irvine, CA: University of California, School of Information and
Computer Science.
Hartigan, J. A. and Hartigan, P. M. 1985. The DIP test of unimodality. The Annals of
Statistics, 13, 70–84.
Jiang, X. and Cooper, G. F. 2010. A Bayesian spatio-temporal method for disease
outbreak detection. Journal of American Medical Informatics Association, 17(4),
462–471.
Johnson, R. A. and Wichern, D. W. 1998. Applied Multivariate Statistical Analysis. Upper
Saddle River, NJ: Prentice Hall.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps.
Biological Cybernetics, 43, 59–69.
Russell, S., Binder, J., Koller, D., and Kanazawa, K. 1995. Local learning in probabilistic
networks with hidden variables. In Proceedings of the Fourteenth International Joint
Conference on Artificial Intelligence, Montreal, Quebec, Canada, pp. 1146–1162.
Ryan, T. P. 1989. Statistical Methods for Quality Improvement. New York: John Wiley &
Sons.
Sung, K. and Poggio, T. 1998. Example-based learning for view-based human face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1),
39–51.
Tan, P.-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Boston,
MA: Pearson.
Theodoridis, S. and Koutroumbas, K. 1999. Pattern Recognition. San Diego, CA:
Academic Press.
Vapnik, V. N. 1989. Statistical Learning Theory. New York: John Wiley & Sons.
Vapnik, V. N. 2000. The Nature of Statistical Learning Theory. New York: Springer-Verlag.
Vidakovic, B. 1999. Statistical Modeling by Wavelets. New York: John Wiley & Sons.
Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically opti-
mum decoding algorithm. IEEE Transactions on Information Theory, 13, 260–269.
Witten, I. H., Frank, E., and Hall, M. A. 2011. Data Mining: Practical Machine Learning
Tools and Techniques. Burlington, MA: Morgan Kaufmann.
Yaffee, R. and McGee, M. 2000. Introduction to Time Series Analysis and Forecasting. San
Diego, CA: Academic Press.
Ye, N. 1996. Self-adapting decision support for interactive fault diagnosis of manufac-
turing systems. International Journal of Computer Integrated Manufacturing, 9(5),
392–401.
Ye, N. 1997. Objective and consistent analysis of group differences in knowledge rep-
resentation. International Journal of Cognitive Ergonomics, 1(2), 169–187.
Ye, N. 1998. The MDS-ANAVA technique for assessing knowledge representation dif-
ferences between skill groups. IEEE Transactions on Systems, Man and Cybernetics,
28(5), 586–600.
Ye, N. 2003, ed. The Handbook of Data Mining. Mahwah, NJ: Lawrence Erlbaum Associates.
Ye, N. 2008. Secure Computer and Network Systems: Modeling, Analysis and Design.
London, U.K.: John Wiley & Sons.
Ye, N., Borror, C., and Parmar, D. 2003. Scalable chi square distance versus conven-
tional statistical distance for process monitoring with uncorrelated data vari-
ables. Quality and Reliability Engineering International, 19(6), 505–515.
Ye, N., Borror, C., and Zhang, Y. 2002a. EWMA techniques for computer intrusion
detection through anomalous changes in event intensity. Quality and Reliability
Engineering International, 18(6), 443–451.
Ye, N. and Chen, Q. 2001. An anomaly detection technique based on a chi-square
statistic for detecting intrusions into information systems. Quality and Reliability
Engineering International, 17(2), 105–112.
Ye, N. and Chen, Q. 2003. Computer intrusion detection through EWMA for auto-
correlated and uncorrelated data. IEEE Transactions on Reliability, 52(1), 73–82.
Ye, N., Chen, Q., and Borror, C. 2004. EWMA forecast of normal system activity for
computer intrusion detection. IEEE Transactions on Reliability, 53(4), 557–566.
Ye, N., Ehiabor, T., and Zhang, Y. 2002c. First-order versus high-order stochastic
models for computer intrusion detection. Quality and Reliability Engineering
International, 18(3), 243–250.
Ye, N., Emran, S. M., Chen, Q., and Vilbert, S. 2002b. Multivariate statistical analysis
of audit trails for host-based intrusion detection. IEEE Transactions on Computers,
51(7), 810–820.
Ye, N. and Li, X. 2002. A scalable, incremental learning algorithm for classification
problems. Computers & Industrial Engineering Journal, 43(4), 677–692.
Ye, N., Li, X., Chen, Q., Emran, S. M., and Xu, M. 2001. Probabilistic techniques for
intrusion detection based on computer audit data. IEEE Transactions on Systems,
Man, and Cybernetics, 31(4), 266–274.
Ye, N., Parmar, D., and Borror, C. M. 2006. A hybrid SPC method with the chi-square
distance monitoring procedure for large-scale, complex process data. Quality
and Reliability Engineering International, 22(4), 393–402.
Ye, N. and Salvendy, G. 1991. Cognitive engineering based knowledge representation
in neural networks. Behaviour & Information Technology, 10(5), 403–418.
Ye, N. and Salvendy, G. 1994. Quantitative and qualitative differences between
experts and novices in chunking computer software knowledge. International
Journal of Human-Computer Interaction, 6(1), 105–118.
Ye, N., Zhang, Y., and Borror, C. M. 2004b. Robustness of the Markov-chain model for
cyber-attack detection. IEEE Transactions on Reliability, 53(1), 116–123.
Ye, N. and Zhao, B. 1996. A hybrid intelligent system for fault diagnosis of advanced
manufacturing system. International Journal of Production Research, 34(2), 555–576.
Ye, N. and Zhao, B. 1997. Automatic setting of article format through neural networks.
International Journal of Human-Computer Interaction, 9(1), 81–100.
Ye, N., Zhao, B., and Salvendy, G. 1993. Neural-networks-aided fault diagnosis in
supervisory control of advanced manufacturing systems. International Journal of
Advanced Manufacturing Technology, 8, 200–209.
Young, F. W. and Hamer, R. M. 1987. Multidimensional Scaling: History, Theory, and
Applications. Hillsdale, NJ: Lawrence Erlbaum Associates.