Max Bramer
Principles of Data Mining
Third Edition
Undergraduate Topics in Computer Science
‘Undergraduate Topics in Computer Science’ (UTiCS) delivers high-quality instructional content for undergraduates studying in all areas of computing and information science. From core foundational and theoretical material to final-year topics and applications, UTiCS books take a fresh, concise, and modern approach and are ideal for self-study or for a one- or two-semester course. The texts are all authored by established experts in their fields, reviewed by an international advisory board, and contain numerous examples and problems, many of which include fully worked solutions.
No introductory book on Data Mining can take you to research level in the
subject — the days for that have long passed. This book will give you a good
grounding in the principal techniques without attempting to show you this
year’s latest fashions, which in most cases will have been superseded by the
time the book gets into your hands. Once you know the basic methods, there
are many sources you can use to find the latest developments in the field. Some
of these are listed in Appendix C. The other appendices include information
about the main datasets used in the examples in the book, many of which are of
interest in their own right and are readily available for use in your own projects
if you wish, and a glossary of the technical terms used in the book.
Self-assessment Exercises are included for each chapter to enable you to
check your understanding. Specimen solutions are given in Appendix E.
Acknowledgements
I would like to thank my daughter Bryony for drawing many of the more
complex diagrams and for general advice on design. I would also like to thank
Dr. Frederic Stahl for advice on Chapters 21 and 22 and my wife Dawn for her
very valuable comments on draft chapters and for preparing the index. The
responsibility for any errors that may have crept into the final version remains
with me.
Max Bramer
Emeritus Professor of Information Technology
University of Portsmouth, UK
November 2016
Contents
8. Continuous Attributes
8.1 Introduction
8.2 Local versus Global Discretisation
8.3 Adding Local Discretisation to TDIDT
8.3.1 Calculating the Information Gain of a Set of Pseudo-attributes
8.3.2 Computational Efficiency
8.4 Using the ChiMerge Algorithm for Global Discretisation
8.4.1 Calculating the Expected Values and χ2
8.4.2 Finding the Threshold Value
8.4.3 Setting minIntervals and maxIntervals
B. Datasets
References
Index
1 Introduction to Data Mining
– It is estimated that there are around 150 million users of Twitter, sending
350 million Tweets each day.
Alongside advances in storage technology, which increasingly make it pos-
sible to store such vast amounts of data at relatively low cost whether in com-
mercial data warehouses, scientific research laboratories or elsewhere, has come
a growing realisation that such data contains buried within it knowledge that
can be critical to a company’s growth or decline, knowledge that could lead
to important discoveries in science, knowledge that could enable us accurately
to predict the weather and natural disasters, knowledge that could enable us
to identify the causes of and possible cures for lethal illnesses, knowledge that
could literally mean the difference between life and death. Yet the huge volumes
involved mean that most of this data is merely stored — never to be examined
in more than the most superficial way, if at all. It has rightly been said that
the world is becoming ‘data rich but knowledge poor’.
Machine learning technology, some of it very long established, has the po-
tential to solve the problem of the tidal wave of data that is flooding around
organisations, governments and individuals.
Data comes in, possibly from many sources. It is integrated and placed
in some common data store. Part of it is then taken and pre-processed into a
standard format. This ‘prepared data’ is then passed to a data mining algorithm
which produces an output in the form of rules or some other kind of ‘patterns’.
These are then interpreted to give — and this is the Holy Grail for knowledge
discovery — new and potentially useful knowledge.
This brief description makes it clear that although the data mining algo-
rithms, which are the principal subject of this book, are central to knowledge
discovery they are not the whole story. The pre-processing of the data and the
interpretation (as opposed to the blind use) of the results are both of great
importance. They are skilled tasks that are far more of an art (or a skill learnt
from experience) than an exact science. Although they will both be touched on
in this book, the algorithms of the data mining stage of knowledge discovery
will be its prime concern.
– weather forecasting
and many more. Some examples of applications (potential or actual) are:
– a supermarket chain mines its customer transactions data to optimise tar-
geting of high value customers
– a credit card company can use its data warehouse of customer transactions
for fraud detection
– a major hotel chain can use survey databases to identify attributes of a
‘high-value’ prospect
– predicting the probability of default for consumer loan applications by im-
proving the ability to predict bad loans
– reducing fabrication flaws in VLSI chips
– data mining systems can sift through vast quantities of data collected during
the semiconductor fabrication process to identify conditions that are causing
yield problems
– predicting audience share for television programmes, allowing television ex-
ecutives to arrange show schedules to maximise market share and increase
advertising revenues
– predicting the probability that a cancer patient will respond to chemotherapy,
thus reducing health-care costs without affecting quality of care
– analysing motion-capture data for elderly people
– trend mining and visualisation in social networks.
Applications can be divided into four main types: classification, numerical
prediction, association and clustering. Each of these is explained briefly below.
However first we need to distinguish between two types of data.
Nearest Neighbour Matching. This method relies on identifying (say) the five
examples that are ‘closest’ in some sense to an unclassified one. If the five
‘nearest neighbours’ have grades Second, First, Second, Second and Second
we might reasonably conclude that the new instance should be classified as
‘Second’.
Classification Rules. We look for rules that we can use to predict the classi-
fication of an unseen instance, for example:
IF SoftEng = A AND Project = A THEN Class = First
IF SoftEng = A AND Project = B AND ARIN = B THEN Class = Second
IF SoftEng = B THEN Class = Second
Data for data mining comes in many forms: from computer files typed in by
human operators, business information in SQL or some other standard database
format, information recorded automatically by equipment such as fault logging
devices, to streams of binary data transmitted from satellites. For purposes of
data mining (and for the remainder of this book) we will assume that the data
takes a particular standard form which is described in the next section. We will
look at some of the practical problems of data preparation in Section 2.3.
Nominal Variables
A variable used to put objects into categories, e.g. the name or colour of an
object. A nominal variable may be numerical in form, but the numerical values
have no mathematical interpretation. For example we might label 10 people
as numbers 1, 2, 3, . . . , 10, but any arithmetic with such values, e.g. 1 + 2 = 3, would be meaningless.
Binary Variables
A binary variable is a special case of a nominal variable that takes only two
possible values: true or false, 1 or 0 etc.
Ordinal Variables
Ordinal variables are similar to nominal variables, except that an ordinal vari-
able has values that can be arranged in a meaningful order, e.g. small, medium,
large.
Integer Variables
Integer variables are ones that take values that are genuine integers, for ex-
ample ‘number of children’. Unlike nominal variables that are numerical in
form, arithmetic with integer variables is meaningful (1 child + 2 children = 3
children etc.).
Interval-scaled Variables
Interval-scaled variables are variables that take numerical values which are
measured at equal intervals from a zero point or origin. However the origin
does not imply a true absence of the measured characteristic. Two well-known
examples of interval-scaled variables are the Fahrenheit and Celsius tempera-
ture scales. To say that one temperature measured in degrees Celsius is greater
than another or greater than a constant value such as 25 is clearly meaningful,
but to say that one temperature measured in degrees Celsius is twice another
is meaningless. It is true that a temperature of 20 degrees is twice as far from
the zero value as 10 degrees, but the zero value has been selected arbitrarily
and does not imply ‘absence of temperature’. If the temperatures are converted
to an equivalent scale, say degrees Fahrenheit, the ‘twice’ relationship will no
longer apply.
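The point can be checked with a couple of lines of code; the sketch below assumes nothing beyond the standard Celsius-to-Fahrenheit conversion.

# Ratios are not preserved when an interval scale with an arbitrary zero
# point is converted to another such scale.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

c1, c2 = 20.0, 10.0
print(c1 / c2)                                                # 2.0
print(celsius_to_fahrenheit(c1) / celsius_to_fahrenheit(c2))  # 1.36: 68 F is not twice 50 F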
Ratio-scaled Variables
Ratio-scaled variables are similar to interval-scaled variables except that the zero point does reflect a true absence of the measured characteristic, so ratios of values are meaningful. Weight and height are examples: an object weighing 20 kg genuinely weighs twice as much as one weighing 10 kg.
For many applications the data can simply be extracted from a database
in the form described in Section 2.1, perhaps using a standard access method
such as ODBC. However, for some applications the hardest task may be to
get the data into a standard form in which it can be analysed. For example
data values may have to be extracted from textual output generated by a fault
logging system or (in a crime analysis application) extracted from transcripts
of interviews with witnesses. The amount of effort required to do this may be
considerable.
Even when the data is in the standard form it cannot be assumed that it
is error free. In real-world datasets erroneous values can be recorded for a
variety of reasons, including measurement errors, subjective judgements and
malfunctioning or misuse of automatic recording equipment.
Erroneous values can be divided into those which are possible values of the
attribute and those which are not. Although usage of the term noise varies, in
this book we will take a noisy value to mean one that is valid for the dataset,
but is incorrectly recorded. For example the number 69.72 may accidentally be
entered as 6.972, or a categorical attribute value such as brown may accidentally
be recorded as another of the possible values, such as blue. Noise of this kind
is a perpetual problem with real-world data.
A far smaller problem arises with noisy values that are invalid for the
dataset, such as 69.7X for 6.972 or bbrown for brown. We will consider these to
be invalid values, not noise. An invalid value can easily be detected and either
corrected or rejected.
It is hard to see even very ‘obvious’ errors in the values of a variable when
they are ‘buried’ amongst say 100,000 other values. In attempting to ‘clean
up’ data it is helpful to have a range of software tools available, especially to
give an overall visual impression of the data, when some anomalous values or
unexpected concentrations of values may stand out. However, in the absence of
special software, even some very basic analysis of the values of variables may be
helpful. Simply sorting the values into ascending order (which for fairly small
datasets can be accomplished using just a standard spreadsheet) may reveal
unexpected results. For example:
– A numerical variable may only take six different values, all widely separated.
It would probably be best to treat this as a categorical variable rather than
a continuous one.
– All the values of a variable may be identical. The variable should be treated
as an ‘ignore’ attribute.
– All the values of a variable except one may be identical. It is then necessary
to decide whether the one different value is an error or a significantly differ-
ent value. In the latter case the variable should be treated as a categorical
attribute with just two values.
– There may be some values that are outside the normal range of the variable.
For example, the values of a continuous attribute may all be in the range
200 to 5000 except for the highest three values which are 22654.8, 38597 and
44625.7. If the data values were entered by hand a reasonable guess is that
the first and third of these abnormal values resulted from pressing the initial
key twice by accident and the second one is the result of leaving out the
decimal point. If the data were recorded automatically it may be that the
equipment malfunctioned. This may not be the case but the values should
certainly be investigated.
– We may observe that some values occur an abnormally large number of times.
For example if we were analysing data about users who registered for a web-
based service by filling in an online form we might notice that the ‘country’
part of their addresses took the value ‘Albania’ in 10% of cases. It may be
that we have found a service that is particularly attractive to inhabitants of
that country. Another possibility is that users who registered either failed to
choose from the choices in the country field, causing a (not very sensible)
default value to be taken, or did not wish to supply their country details and
simply selected the first value in a list of options. In either case it seems likely
that the rest of the address data provided for those users may be suspect
too.
– If we are analysing the results of an online survey collected in 2002, we may
notice that the age recorded for a high proportion of the respondents was 72.
This seems unlikely, especially if the survey was of student satisfaction, say.
A possible interpretation for this is that the survey had a ‘date of birth’ field,
with subfields for day, month and year and that many of the respondents did
not bother to override the default values of 01 (day), 01 (month) and 1930
(year). A poorly designed program then converted the date of birth to an
age of 72 before storing it in the database.
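In the absence of special-purpose software, the kinds of basic checks described above can be sketched in a few lines of Python. The column of values below is invented purely for illustration.

from collections import Counter

values = [230.5, 4100.0, 305.2, 22654.8, 980.0, 38597.0, 44625.7, 770.3]  # invented sample

# Sorting into ascending order makes isolated extreme values easy to spot by eye.
print(sorted(values))

# Counting distinct values flags a variable that is constant (a candidate
# 'ignore' attribute) or that takes only a handful of widely separated values.
print(len(Counter(values)), "distinct values")

# A crude range check: values far outside the normal range deserve investigation;
# they may be errors or genuine outliers.
low, high = 200, 5000
print([v for v in values if not low <= v <= high])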
It is important to issue a word of caution at this point. Care is needed when
dealing with anomalous values such as 22654.8, 38597 and 44625.7 in one of
the examples above. They may simply be errors as suggested. Alternatively
they may be outliers, i.e. genuine values that are significantly different from
the others. The recognition of outliers and their significance may be the key to
major discoveries, especially in fields such as medicine and physics, so we need to examine any anomalous values carefully before deciding to discard or 'correct' them.
The simplest strategy for dealing with missing values is to delete all instances where there is at least one missing value and use the remainder.
This strategy is a very conservative one, which has the advantage of avoid-
ing introducing any data errors. Its disadvantage is that discarding data may
damage the reliability of the results derived from the data. Although it may be
worth trying when the proportion of missing values is small, it is not recom-
mended in general. It is clearly not usable when all or a high proportion of all
the instances have missing values.
A less cautious strategy is to estimate each of the missing values using the
values that are present in the dataset.
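A minimal sketch of this second strategy is given below, replacing each missing continuous value by the attribute mean and each missing categorical value by the most frequently occurring value. The columns and the use of None to mark a missing entry are illustrative assumptions, not part of any particular dataset.

from collections import Counter

ages = [27, 35, None, 41, 29, None]                        # continuous attribute
colours = ["brown", "blue", None, "brown", "brown", None]  # categorical attribute

# Continuous attribute: estimate missing values by the mean of the values present.
present = [a for a in ages if a is not None]
mean_age = sum(present) / len(present)
ages = [mean_age if a is None else a for a in ages]

# Categorical attribute: estimate missing values by the most frequent value.
most_frequent = Counter(c for c in colours if c is not None).most_common(1)[0][0]
colours = [most_frequent if c is None else c for c in colours]

print(ages)      # missing ages replaced by 33.0
print(colours)   # missing colours replaced by 'brown'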
Suppose, for example, that the aim is to predict which customers will buy a new brand of dog food.
relevance to this is probably very small. At best the many irrelevant attributes
will place an unnecessary computational overhead on any data mining algo-
rithm. At worst, they may cause the algorithm to give poor results.
Of course, supermarkets, hospitals and other data collectors will reply that
they do not necessarily know what is relevant or will come to be recognised
as relevant in the future. It is safer for them to record everything than risk
throwing away important information.
Although faster processing speeds and larger memories may make it possible
to process ever larger numbers of attributes, this is inevitably a losing struggle
in the long term. Even if it were not, when the number of attributes becomes
large, there is always a risk that the results obtained will have only superficial
accuracy and will actually be less reliable than if only a small proportion of
the attributes were used — a case of ‘more means less’.
There are several ways in which the number of attributes (or ‘features’)
can be reduced before a dataset is processed. The term feature reduction or
dimension reduction is generally used for this process. We will return to this
topic in Chapter 10.
The datasets in the UCI Repository were collected principally to enable data
mining algorithms to be compared on a standard range of datasets. There are
many new algorithms published each year and it is standard practice to state
their performance on some of the better-known datasets in the UCI Repository.
Several of these datasets will be described later in this book.
The availability of standard datasets is also very helpful for new users of data
mining packages who can gain familiarisation using datasets with published
performance results before applying the facilities to their own datasets.
In recent years a potential weakness of establishing such a widely used set
of standard datasets has become apparent. In the great majority of cases the
datasets in the UCI Repository give good results when processed by standard
algorithms of the kind described in this book. Datasets that lead to poor results
tend to be associated with unsuccessful projects and so may not be added to
the Repository. The achievement of good results with selected datasets from
the Repository is no guarantee of the success of a method with new data, but
experimentation with such datasets can be a valuable step in the development
of new methods.
A welcome relatively recent development is the creation of the UCI ‘Knowl-
edge Discovery in Databases Archive’ at http://kdd.ics.uci.edu. This con-
tains a range of large and complex datasets as a challenge to the data mining
research community to scale up its algorithms as the size of stored datasets,
especially commercial ones, inexorably rises.
Name, Date of Birth, Sex, Weight, Height, Marital Status, Number of Children
What is the type of each variable?
3. Give two ways of dealing with missing data values.
3 Introduction to Classification: Naïve Bayes and Nearest Neighbour
– houses that are likely to rise in value, fall in value or have an unchanged
value in 12 months’ time
– people who are at high, medium or low risk of a car accident in the next 12
months
– people who are likely to vote for each of a number of political parties (or
none)
– the likelihood of rain the next day for a weather forecast (very likely, likely,
unlikely, very unlikely).
We have already seen an example of a (fictitious) classification task, the
‘degree classification’ example, in the Introduction.
In this chapter we introduce two classification algorithms: one that can be
used when all the attributes are categorical, the other when all the attributes
are continuous. In the following chapters we come on to algorithms for gener-
ating classification trees and rules (also illustrated in the Introduction).
Usually we are not interested in just one event but in a set of alternative
possible events, which are mutually exclusive and exhaustive, meaning that one
and only one must always occur.
In the train example, we might define four mutually exclusive and exhaus-
tive events
E1 – train cancelled
E2 – train ten minutes or more late
E3 – train less than ten minutes late
E4 – train on time or early.
The probability of an event is usually indicated by a capital letter P , so we
might have
P (E1) = 0.05
P (E2) = 0.1
P (E3) = 0.15
P (E4) = 0.7
(Read as ‘the probability of event E1 is 0.05’ etc.)
Each of these probabilities is between 0 and 1 inclusive, as it has to be to
qualify as a probability. They also satisfy a second important condition: the
sum of the four probabilities has to be 1, because precisely one of the events
must always occur. In this case
P (E1) + P (E2) + P (E3) + P (E4) = 1
In general, the sum of the probabilities of a set of mutually exclusive and
exhaustive events must always be 1.
Generally we are not in a position to know the true probability of an event
occurring. To do so for the train example we would have to record the train’s
arrival time for all possible days on which it is scheduled to run, then count
the number of times events E1, E2, E3 and E4 occur and divide by the total
number of days, to give the probabilities of the four events. In practice this is
often prohibitively difficult or impossible to do, especially (as in this example)
if the trials may potentially go on forever. Instead we keep records for a sample
of say 100 days, count the number of times E1, E2, E3 and E4 occur, divide
by 100 (the number of days) to give the frequency of the four events and use
these as estimates of the four probabilities.
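Estimating probabilities this way is nothing more than counting and dividing; a minimal sketch, with the counts for the 100-day sample chosen to match the probabilities quoted above, is:

counts = {"E1": 5, "E2": 10, "E3": 15, "E4": 70}   # occurrences in a 100-day sample
total = sum(counts.values())

# Relative frequencies used as estimates of the true probabilities.
probabilities = {event: n / total for event, n in counts.items()}
print(probabilities)                 # {'E1': 0.05, 'E2': 0.1, 'E3': 0.15, 'E4': 0.7}
print(sum(probabilities.values()))   # 1.0, as required for mutually exclusive and exhaustive events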
For the purposes of the classification problems discussed in this book, the
‘events’ are that an instance has a particular classification. Note that classifi-
cations satisfy the ‘mutually exclusive and exhaustive’ requirement.
The outcome of each trial is recorded in one row of a table. Each row must
have one and only one classification.
How should we use probabilities to find the most likely classification for an
unseen instance such as the one below?
Making the assumption that the attributes are independent, the value of
this expression can be calculated using the product
P(ci )×P(a1 = v1 | ci )×P(a2 = v2 | ci )× . . . ×P(an = vn | ci )
We calculate this product for each value of i from 1 to k and choose the
classification that has the largest value.
The formula shown in bold in Figure 3.3 combines the prior probability of
ci with the values of the n possible conditional probabilities involving a test
on the value of a single attribute.
It is often written as $P(c_i) \times \prod_{j=1}^{n} P(a_j = v_j \mid class = c_i)$.
Note that the Greek letter Π (pronounced pi) in the above formula is not connected with the mathematical constant 3.14159. . . . It indicates the product obtained by multiplying together the n values P (a1 = v1 | ci ), P (a2 = v2 | ci ) etc.
(Π is the capital form of ‘pi’. The lower case form is π. The equivalents in the Roman alphabet are P and p. P is the first letter of ‘Product’.)
When using the Naı̈ve Bayes method to classify a series of unseen instances
the most efficient way to start is by calculating all the prior probabilities and
also all the conditional probabilities involving one attribute, though not all of
them may be required for classifying any particular instance.
Using the values in each of the columns of Figure 3.2 in turn, we obtain the
following posterior probabilities for each possible classification for the unseen
instance:
class = on time
0.70 × 0.64 × 0.14 × 0.29 × 0.07 = 0.0013
class = late
0.10 × 0.50 × 1.00 × 0.50 × 0.50 = 0.0125
class = very late
0.15 × 1.00 × 0.67 × 0.33 × 0.67 = 0.0222
class = cancelled
0.05 × 0.00 × 0.00 × 1.00 × 1.00 = 0.0000
The largest value is that for class = very late, so this is the classification selected for the unseen instance.
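The whole calculation is just one product per class. The sketch below simply re-multiplies the prior and conditional probabilities quoted above for the unseen instance; it is not a full Naïve Bayes implementation.

# One list of factors per class: the prior probability followed by one
# conditional probability per attribute, copied from the worked example above.
factors = {
    "on time":   [0.70, 0.64, 0.14, 0.29, 0.07],
    "late":      [0.10, 0.50, 1.00, 0.50, 0.50],
    "very late": [0.15, 1.00, 0.67, 0.33, 0.67],
    "cancelled": [0.05, 0.00, 0.00, 1.00, 1.00],
}

scores = {}
for klass, probs in factors.items():
    product = 1.0
    for p in probs:
        product *= p
    scores[klass] = product

print({k: round(v, 4) for k, v in scores.items()})   # 'very late' scores about 0.0222
print(max(scores, key=scores.get))                   # 'very late' is selected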
Supposing we have a training set with just two instances such as the fol-
lowing
a     b     c     d     e    f     Class
yes   no    no    6.4   8.3  low   negative
yes   yes   yes   18.2  4.7  high  positive
There are six attribute values, followed by a classification (positive or neg-
ative).
We are then given a third instance
yes no no 6.6 8.0 low ????
What should its classification be?
Even without knowing what the six attributes represent, it seems intuitively
obvious that the unseen instance is nearer to the first instance than to the
second. In the absence of any other information, we could reasonably predict
its classification using that of the first instance, i.e. as ‘negative’.
In practice there are likely to be many more instances in the training set
but the same principle applies. It is usual to base the classification on those of
the k nearest neighbours (where k is a small integer such as 3 or 5), not just the
nearest one. The method is then known as k-Nearest Neighbour or just k-NN
classification (Figure 3.4).
A circle has been added to enclose the five nearest neighbours of the unseen
instance, which is shown as a small circle close to the centre of the larger one.
The five nearest neighbours are labelled with three + signs and two − signs,
so a basic 5-NN classifier would classify the unseen instance as ‘positive’ by a
form of majority voting. There are other possibilities, for example the ‘votes’
of each of the k nearest neighbours can be weighted, so that the classifications
of closer neighbours are given greater weight than the classifications of more
distant ones. We will not pursue this here.
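A basic, unweighted k-NN classifier amounts to very little code. In the sketch below the training instances are invented two-attribute examples (in the spirit of the d and e columns above), purely to show the mechanics.

import math
from collections import Counter

# Each training instance: (tuple of attribute values, classification). Invented data.
training = [
    ((6.4, 8.3), "negative"),
    ((18.2, 4.7), "positive"),
    ((7.1, 7.9), "negative"),
    ((16.5, 5.2), "positive"),
    ((6.9, 8.8), "negative"),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(unseen, training, k=3):
    # Take the k training instances closest to the unseen one and
    # return the majority classification among them.
    nearest = sorted(training, key=lambda inst: euclidean(unseen, inst[0]))[:k]
    return Counter(klass for _, klass in nearest).most_common(1)[0][0]

print(knn_classify((6.6, 8.0), training))   # 'negative'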
We can represent two points in two dimensions (‘in two-dimensional space’
is the usual term) as (a1 , a2 ) and (b1 , b2 ) and visualise them as points in a
plane.
When there are three attributes we can represent the points by (a1 , a2 , a3 )
and (b1 , b2 , b3 ) and think of them as points in a room with three axes at right
angles. As the number of dimensions (attributes) increases it rapidly becomes
impossible to visualise them, at least for anyone who is not a physicist (and
most of those who are).
When there are n attributes, we can represent the instances by the points
(a1 , a2 , . . . , an ) and (b1 , b2 , . . . , bn ) in ‘n-dimensional space’.
There are many possible ways of measuring the distance between two instances
with n attribute values, or equivalently between two points in n-dimensional
space. We usually impose three requirements on any distance measure we use.
We will use the notation dist(X, Y ) to denote the distance between two points
X and Y .
1. The distance of any point A from itself is zero, i.e. dist(A, A) = 0.
2. The distance from A to B is the same as the distance from B to A, i.e.
dist(A, B) = dist(B, A) (the symmetry condition).
The third condition is called the triangle inequality (Figure 3.7). It cor-
responds to the intuitive idea that ‘the shortest distance between any two
points is a straight line’. The condition says that for any points A, B and Z:
dist(A, B) ≤ dist(A, Z) + dist(Z, B).
The Euclidean distance between two points $(a_1, a_2)$ and $(b_1, b_2)$ is
$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$
by Pythagoras’ Theorem.
If there are two points $(a_1, a_2, a_3)$ and $(b_1, b_2, b_3)$ in a three-dimensional space the corresponding formula is
$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2}$
The City Block distance between the points (4, 2) and (12, 9) in Figure 3.9
is (12 − 4) + (9 − 2) = 8 + 7 = 15.
A third possibility is the maximum dimension distance. This is the largest
absolute difference between any pair of corresponding attribute values. (The
absolute difference is the difference converted to a positive number if it is
negative.) For example the maximum dimension distance between instances
6.2  −7.1  −5.0  18.3  −3.1  8.9
and
8.3  12.4  −4.1  19.7  −6.2  12.4
is 12.4 − (−7.1) = 19.5, the largest of the six absolute differences.
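The three distance measures can be compared directly on the pair of instances above; a short sketch:

import math

a = [6.2, -7.1, -5.0, 18.3, -3.1, 8.9]
b = [8.3, 12.4, -4.1, 19.7, -6.2, 12.4]

# Euclidean: square root of the sum of squared differences.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# City Block (Manhattan): sum of the absolute differences.
city_block = sum(abs(x - y) for x, y in zip(a, b))

# Maximum dimension: largest absolute difference in any single attribute.
max_dimension = max(abs(x - y) for x, y in zip(a, b))

print(round(euclidean, 2), round(city_block, 1), round(max_dimension, 1))
# approximately 20.23, 30.5 and 19.5 respectively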
3.3.2 Normalisation
A major problem when using the Euclidean distance formula (and many other
distance measures) is that the large values frequently swamp the small ones.
Suppose that two instances are as follows for some classification problem
associated with cars (the classifications themselves are omitted).
When the distance of these instances from an unseen one is calculated, the
mileage attribute will almost certainly contribute a value of several thousands
squared, i.e. several millions, to the sum of squares total. The number of doors
will probably contribute a value less than 10. It is clear that in practice the
only attribute that will matter when deciding which neighbours are the nearest
using the Euclidean distance formula is the mileage. This is unreasonable as the
unit of measurement, here the mile, is entirely arbitrary. We could have chosen
an alternative measure of distance travelled such as millimetres or perhaps
light years. Similarly we might have measured age in some other unit such as
milliseconds or millennia. The units chosen should not affect the decision on
which are the nearest neighbours.
To overcome this problem we generally normalise the values of continuous
attributes. The idea is to make the values of each attribute run from 0 to 1.
Suppose that for some attribute A the smallest value found in the training data
is −8.1 and the largest is 94.3. First we adjust each value of A by adding 8.1 to
it, so the values now run from 0 to 94.3+8.1 = 102.4. The spread of values from
highest to lowest is now 102.4 units, so we divide all values by that number to
make the spread of values from 0 to 1.
In general if the lowest value of attribute A is min and the highest value is
max, we convert each value of A, say a, to (a − min)/(max − min).
Using this approach all continuous attributes are converted to small num-
bers from 0 to 1, so the effect of the choice of unit of measurement on the
outcome is greatly reduced.
Note that it is possible that an unseen instance may have a value of A that
is less than min or greater than max. If we want to keep the adjusted numbers
in the range from 0 to 1 we can just convert any values of A that are less than
min or greater than max to 0 or 1, respectively.
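A minimal sketch of this min-max normalisation, using the values −8.1 and 94.3 quoted above and clipping unseen values that fall outside the training range:

def normalise(value, lo, hi):
    # Map value into the range 0 to 1; clip anything outside [lo, hi].
    scaled = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))

lo, hi = -8.1, 94.3   # smallest and largest training values of attribute A

print(normalise(-8.1, lo, hi))    # 0.0
print(normalise(94.3, lo, hi))    # 1.0
print(normalise(43.1, lo, hi))    # 0.5
print(normalise(120.0, lo, hi))   # 1.0: an unseen value above max is clipped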
Another issue that occurs with measuring the distance between two points
is the weighting of the contributions of the different attributes. We may be-
lieve that the mileage of a car is more important than the number of doors
it has (although no doubt not a thousand times more important, as with the
unnormalised values). To achieve this we can adjust the formula for Euclidean
distance to
$\sqrt{w_1(a_1 - b_1)^2 + w_2(a_2 - b_2)^2 + \ldots + w_n(a_n - b_n)^2}$
In lazy learning systems the training data is ‘lazily’ left unchanged until an
unseen instance is presented for classification. When it is, only those calcula-
tions that are necessary to classify that single instance are performed.
The lazy learning approach has some enthusiastic advocates, but if there are
a large number of unseen instances, it can be computationally very expensive
to carry out compared with eager learning methods such as Naı̈ve Bayes and
the other methods of classification that are described in later chapters.
A more fundamental weakness of the lazy learning approach is that it does
not give any idea of the underlying causality of the task domain. This is also
true of the probability-based Naı̈ve Bayes eager learning algorithm, but to a
lesser extent. X is the classification for no reason deeper than that if you do
the calculations X turns out to be the answer. We now turn to methods that
give an explicit way of classifying any unseen instance that can be used (and
critiqued) independently from the training data used to generate it. We call
such methods model-based.
A fictitious example which has been used for illustration by many authors,
notably Quinlan [2], is that of a golfer who decides whether or not to play each
day on the basis of the weather.
Figure 4.1 shows the results of two weeks (14 days) of observations of
weather conditions and the decision on whether or not to play.
Assuming the golfer is acting consistently, what are the rules that deter-
mine the decision whether or not to play each day? If tomorrow the values of
Outlook, Temperature, Humidity and Windy were sunny, 74°F, 77% and false
respectively, what would the decision be?
One way of answering this is to construct a decision tree such as the one
shown in Figure 4.2. This is a typical example of a decision tree, which will
form the topic of several chapters of this book.
In order to determine the decision (classification) for a given set of weather
conditions from the decision tree, first look at the value of Outlook. There are
three possibilities.
1. If the value of Outlook is sunny, next consider the value of Humidity. If the
value is less than or equal to 75 the decision is play. Otherwise the decision
is don’t play.
4.1.2 Terminology
We will assume that the ‘standard formulation’ of the data given in Chapter 2
applies. There is a universe of objects (people, houses etc.), each of which can
be described by the values of a collection of its attributes. Attributes with a
finite (and generally fairly small) set of values, such as sunny, overcast and rain,
are called categorical. Attributes with numerical values, such as Temperature
and Humidity, are generally known as continuous. We will distinguish between
a specially-designated categorical attribute called the classification and the
other attribute values and will generally use the term ‘attributes’ to refer only
to the latter.
Descriptions of a number of objects are held in tabular form in a training
set. Each row of the figure comprises an instance, i.e. the (non-classifying)
attribute values and the classification corresponding to one object.
The aim is to develop classification rules from the data in the training set.
This is often done in the implicit form of a decision tree.
A decision tree is created by a process known as splitting on the value of
attributes (or just splitting on attributes), i.e. testing the value of an attribute
such as Outlook and then creating a branch for each of its possible values.
In the case of continuous attributes the test is normally whether the value is
‘less than or equal to’ or ‘greater than’ a given value known as the split value.
The splitting process continues until each branch can be labelled with just one
classification.
Decision trees have two different functions: data compression and prediction.
Figure 4.2 can be regarded simply as a more compact way of representing the
data in Figure 4.1. The two representations are equivalent in the sense that
for each of the 14 instances the given values of the four attributes will lead to
identical classifications.
However, the decision tree is more than an equivalent representation to the
training set. It can be used to predict the values of other instances not in the
training set, for example the one given previously where the values of the four
attributes are sunny, 74, 77 and false respectively. It is easy to see from the
decision tree that in this case the decision would be don’t play. It is important
to stress that this ‘decision’ is only a prediction, which may or may not turn
out to be correct. There is no infallible way to predict the future!
So the decision tree can be viewed as not merely equivalent to the original
training set but as a generalisation of it which can be used to predict the
classification of other instances. These are often called unseen instances and
a collection of them is generally known as a test set or an unseen test set, by
contrast with the original training set.
The training set shown in Figure 4.3 (taken from a fictitious university) shows
the results of students for five subjects coded as SoftEng, ARIN, HCI, CSA
and Project and their corresponding degree classifications, which in this sim-
plified example are either FIRST or SECOND. There are 26 instances. What
determines who is classified as FIRST or SECOND?
Figure 4.4 shows a possible decision tree corresponding to this training set.
It consists of a number of branches, each ending with a leaf node labelled with
one of the valid classifications, i.e. FIRST or SECOND. Each branch comprises
the route from the root node (i.e. the top of the tree) to a leaf node. A node
that is neither the root nor a leaf node is called an internal node.
We can think of the root node as corresponding to the original training set.
All other nodes correspond to a subset of the training set.
At the leaf nodes each instance in the subset has the same classification.
There are five leaf nodes and hence five branches.
Each branch corresponds to a classification rule. The five classification rules
can be written in full as:
IF SoftEng = A AND Project = A THEN Class = FIRST
IF SoftEng = A AND Project = B AND ARIN = A AND CSA = A THEN Class = FIRST
IF SoftEng = A AND Project = B AND ARIN = A AND CSA = B THEN Class = SECOND
IF SoftEng = A AND Project = B AND ARIN = B THEN Class = SECOND
IF SoftEng = B THEN Class = SECOND
The same set of rules can equivalently be written as nested IF statements:
if (SoftEng = A) {
if (Project = A) Class = FIRST
else {
if (ARIN = A) {
if (CSA = A) Class = FIRST
else Class = SECOND
}
else Class = SECOND
}
}
else Class = SECOND
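The nested structure above can be transcribed directly into an executable function; the sketch below does only that (the grades are passed in as plain strings).

def classify_degree(SoftEng, Project, ARIN, CSA):
    # Direct transcription of the nested rules above.
    if SoftEng == "A":
        if Project == "A":
            return "FIRST"
        if ARIN == "A" and CSA == "A":
            return "FIRST"
        return "SECOND"
    return "SECOND"

print(classify_degree("A", "B", "A", "A"))   # FIRST
print(classify_degree("B", "A", "A", "A"))   # SECOND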
IF all the instances in the training set belong to the same class
THEN return the value of the class
ELSE (a) Select an attribute A to split on+
(b) Sort the instances in the training set into subsets, one
for each value of attribute A
(c) Return a tree with one branch for each non-empty subset,
each branch having a descendant subtree or a class
value produced by applying the algorithm recursively
+ Never select an attribute twice in the same branch
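A compact recursive sketch of this algorithm is shown below. The training set is passed as a list of (attribute-value dictionary, class) pairs, the attribute selection step is a plug-in function because choosing it is the subject of the next two chapters, and the tree comes back as nested dictionaries. This is only an illustration of the algorithm's shape, and it assumes that no two instances with identical attribute values have different classifications.

def tdidt(instances, attributes, select_attribute):
    # instances: list of (dict of attribute values, classification) pairs.
    classes = {c for _, c in instances}
    if len(classes) == 1:                       # all instances have the same class
        return classes.pop()

    att = select_attribute(instances, attributes)
    tree = {}
    for value in {inst[att] for inst, _ in instances}:    # non-empty subsets only
        subset = [(inst, c) for inst, c in instances if inst[att] == value]
        remaining = [a for a in attributes if a != att]   # never reuse an attribute in a branch
        tree[(att, value)] = tdidt(subset, remaining, select_attribute)
    return tree

# Tiny illustration with a 'take the first attribute' selector.
data = [
    ({"SoftEng": "A", "Project": "A"}, "FIRST"),
    ({"SoftEng": "A", "Project": "B"}, "SECOND"),
    ({"SoftEng": "B", "Project": "A"}, "SECOND"),
]
print(tdidt(data, ["SoftEng", "Project"], lambda insts, atts: atts[0]))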
If the first two statements (the premises) are true, then the conclusion must
be true.
This type of reasoning is entirely reliable but in practice rules that are 100%
certain (such as ‘all men are mortal’) are often not available.
A second type of reasoning is called abduction. An example of this is
Here the conclusion is consistent with the truth of the premises, but it may
not necessarily be correct. Fido may be some other type of animal that chases
cats, or perhaps not an animal at all. Reasoning of this kind is often very
successful in practice but can sometimes lead to incorrect conclusions.
A third type of reasoning is called induction. This is a process of generali-
sation based on repeated observations.
For example, if I see 1,000 dogs with four legs I might reasonably conclude
that “if x is a dog then x has 4 legs” (or more simply “all dogs have four legs”).
This is induction. The decision trees derived from the golf and degrees datasets
are of this kind. They are generalised from repeated observations (the instances
in the training sets) and we would expect them to be good enough to use for
predicting the classification of unseen instances in most cases, but they may
not be infallible.
References
[1] Michie, D. (1990). Machine executable skills from ‘silent’ brains. In Research
and development in expert systems VII. Cambridge: Cambridge University
Press.
[2] Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo:
Morgan Kaufmann.
[3] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1,
81–106.
5 Decision Tree Induction: Using Entropy for Attribute Selection
A fictitious university requires its students to enrol in one of its sports clubs,
either the Football Club or the Netball Club. It is forbidden to join both clubs.
Any student joining no club at all will be awarded an automatic failure in their
degree (this being considered an important disciplinary offence).
Figure 5.2 gives a training set of data collected about 12 students, tabulating
four items of data about each one (eye colour, marital status, sex and hair
length) against the club joined.
It is possible to generate many different trees from this data using the
TDIDT algorithm. One possible decision tree is Figure 5.3. (The numbers in
parentheses indicate the number of instances corresponding to each of the leaf
nodes.)
This is a remarkable result. All the blue-eyed students play football. For
the brown-eyed students, the critical factor is whether or not they are married.
If they are, then the long-haired ones all play football and the short-haired
ones all play netball. If they are not married, it is the other way round: the
short-haired ones play football and the long-haired ones play netball.
Although it is tempting to say that it is, it is best to avoid using terms such
as ‘correct’ and ‘incorrect’ in this context. All we can say is that both decision
trees are compatible with the data from which they were generated. The only
way to know which one gives better results for unseen data is to use them both
and compare the results.
Despite this, it is hard to avoid the belief that Figure 5.4 is right and
Figure 5.3 is wrong. We will return to this point.
a1 a2 a3 a4 class
a11 a21 a31 a41 c1
a12 a21 a31 a42 c1
a11 a21 a31 a41 c1
a11 a22 a32 a41 c2
a11 a22 a32 a41 c2
a12 a22 a31 a41 c1
a11 a22 a32 a41 c2
a11 a22 a31 a42 c1
a11 a21 a32 a42 c2
a11 a22 a32 a41 c2
a12 a22 a31 a41 c1
a12 a22 a31 a42 c1
Here we have a training set of 12 instances. There are four attributes, a1,
a2, a3 and a4, with values a11, a12 etc., and two classes c1 and c2.
One possible decision tree we can generate from this data is Figure 5.6.
A more principled approach is to select the attribute to split on using a measure known as information gain. This method will be explained later in this chapter. Other commonly used methods will be discussed in Chapter 6.
Figure 5.8 is based on Figure 5.1, which gave the size of the tree with most
and least branches produced by the takefirst, takelast and random attribute se-
lection strategies for a number of datasets. The final column shows the number
of branches generated by the ‘entropy’ attribute selection method (which has
not yet been described). In almost all cases the number of branches is substan-
tially reduced. The smallest number of branches, i.e. rules for each dataset, is
in bold and underlined.
Figure 5.8 Most and Least Figures from Figure 5.1 Augmented by Informa-
tion about Entropy Attribute Selection
In all cases the number of rules in the decision tree generated using the
‘entropy’ method is less than or equal to the smallest number generated using
any of the other attribute selection criteria introduced so far. In some cases,
such as for the chess dataset, it is considerably fewer.
There is no guarantee that using entropy will always lead to a small de-
cision tree, but experience shows that it generally produces trees with fewer
branches than other attribute selection criteria (not just the basic ones used
in Section 5.1). Experience also shows that small trees tend to give more ac-
curate predictions than large ones, although there is certainly no guarantee of
infallibility.
The lens24 dataset comprises 24 instances linking the values of four attributes age (i.e. age group), specRx
(spectacle prescription), astig (whether astigmatic) and tears (tear production
rate) with one of three classes 1, 2 and 3 (signifying respectively that the patient
should be fitted with hard contact lenses, soft contact lenses or none at all).
The complete training set is given as Figure 5.9.
5.3.2 Entropy
With K classes, writing pi for the proportion of instances in the training set belonging to the ith class, the entropy of the training set is
$E = -\sum_{i=1}^{K} p_i \log_2 p_i$
summed over the non-empty classes only, i.e. classes for which pi ≠ 0.
An explanation of this formula will be given in Chapter 10. At present it is
simplest to accept the formula as given and concentrate on its properties.
As is shown in Appendix A the value of −pi log2 pi is positive for values of pi
greater than zero and less than 1. When pi = 1 the value of −pi log2 pi is zero.
This implies that E is positive or zero for all training sets. It takes its minimum
value (zero) if and only if all the instances have the same classification, in which
case there is only one non-empty class, for which the probability is 1.
Entropy takes its maximum value when the instances are equally distributed
amongst the K possible classes.
In this case the value of each pi is 1/K, which is independent of i, so
$E = -\sum_{i=1}^{K} (1/K)\log_2(1/K) = -K \times (1/K)\log_2(1/K) = -\log_2(1/K) = \log_2 K$
For the lens24 dataset the three classes occur 4, 5 and 15 times respectively among the 24 instances, so the initial entropy is
Estart = −(4/24) log2 (4/24) − (5/24) log2 (5/24) − (15/24) log2 (15/24)
= 1.3261 bits (these and subsequent figures in this chapter are given to four decimal places).
Splitting on attribute age divides the training set into three subsets of 8 instances each (age = 1, 2 and 3), with the following entropies.
Training set 1 (age = 1)
Entropy E1 = −(2/8) log2 (2/8) − (2/8) log2 (2/8) − (4/8) log2 (4/8)
= 0.5 + 0.5 + 0.5 = 1.5
Training set 2 (age = 2)
Entropy E2 = −(1/8) log2 (1/8) − (2/8) log2 (2/8) − (5/8) log2 (5/8)
= 0.375 + 0.5 + 0.4238 = 1.2988
Training Set 3 (age = 3)
Entropy E3 = −(1/8) log2 (1/8) − (1/8) log2 (1/8) − (6/8) log2 (6/8)
= 0.375 + 0.375 + 0.3113 = 1.0613
Although the entropy of the first of these three training sets (E1 ) is greater
than Estart , the weighted average will be less. The values E1 , E2 and E3 need
to be weighted by the proportion of the original instances in each of the three
subsets. In this case all the weights are the same, i.e. 8/24.
If the average entropy of the three training sets produced by splitting on at-
tribute age is denoted by Enew , then Enew = (8/24)E1 +(8/24)E2 +(8/24)E3 =
1.2867 bits (to 4 decimal places).
If we define Information Gain = Estart − Enew then the information gain
from splitting on attribute age is 1.3261 − 1.2867 = 0.0394 bits (see Fig-
ure 5.11).
The ‘entropy method’ of attribute selection is to choose to split on the
attribute that gives the greatest reduction in (average) entropy, i.e. the one
that maximises the value of Information Gain. This is equivalent to minimising
the value of Enew as Estart is fixed.
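The calculation for attribute age can be reproduced in a few lines; the class counts per subset (2, 2, 4; 1, 2, 5; 1, 1, 6) are read off from the three entropy calculations above.

import math

def entropy(counts):
    # Entropy of a set with the given class frequencies (non-empty classes only).
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

e_start = entropy([4, 5, 15])                  # lens24 class distribution: 1.3261 bits

subsets = [[2, 2, 4], [1, 2, 5], [1, 1, 6]]    # class counts after splitting on age
total = sum(sum(s) for s in subsets)           # 24 instances

e_new = sum(sum(s) / total * entropy(s) for s in subsets)
print(round(e_new, 4))                         # 1.2867
print(round(e_start - e_new, 4))               # information gain: 0.0394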
The values of Enew and Information Gain for splitting on each of the four
attributes age, specRx, astig and tears are as follows:
attribute age
Enew = 1.2867
Information Gain = 1.3261 − 1.2867 = 0.0394 bits
attribute specRx
Enew = 1.2866
Information Gain = 1.3261 − 1.2866 = 0.0395 bits
attribute astig
Enew = 0.9491
Information Gain = 1.3261 − 0.9491 = 0.3770 bits
attribute tears
Enew = 0.7773
Information Gain = 1.3261 − 0.7773 = 0.5488 bits
Thus, the largest value of Information Gain (and the smallest value of the
new entropy Enew ) is obtained by splitting on attribute tears (see Figure 5.12).
The process of splitting on nodes is repeated for each branch of the evolving
decision tree, terminating when the subset at every leaf node has entropy zero.
2. Suggest reasons why entropy (or information gain) is one of the most effec-
tive methods of attribute selection when using the TDIDT tree generation
algorithm.
6 Decision Tree Induction: Using Frequency Tables for Attribute Selection
For practical use a more efficient method is available which requires only a
single table to be constructed for each categorical attribute at each node. This
method, which can be shown to be equivalent to the one given previously (see
Section 6.1.1), uses a frequency table. The cells of this table show the number
of occurrences of each combination of class and attribute value in the training
set. For the lens24 dataset the frequency table corresponding to splitting on
attribute age is shown in Figure 6.2.
Figure 6.2 Frequency Table for Attribute age for lens24 Example
It remains to be proved that this method always gives the same value of Enew
as the basic method described in Chapter 5.
x     log2 x
1     0
2     1
3     1.5850
4     2
5     2.3219
6     2.5850
7     2.8074
8     3
9     3.1699
10    3.3219
11    3.4594
12    3.5850
Assume that there are N instances, each relating the value of a number
of categorical attributes to one of K possible classifications. (For the lens24
dataset used previously, N = 24 and K = 3.)
Splitting on a categorical attribute with V possible values produces V sub-
sets of the training set. The j th subset contains all the instances for which the
attribute takes its j th value. Let Nj denote the number of instances in that
subset. Then
$\sum_{j=1}^{V} N_j = N$
(For the frequency table shown in Figure 6.2, for attribute age, there are three
values of the attribute, so V = 3. The three column sums are N1 , N2 and N3 ,
which all have the same value (8). The value of N is N1 + N2 + N3 = 24.)
Let fij denote the number of instances for which the classification is the ith
one and the attribute takes its j th value (e.g. for Figure 6.2, f32 = 5). Then
$\sum_{i=1}^{K} f_{ij} = N_j$
The frequency table method of forming the sum for Enew given above
amounts to using the formula
$E_{new} = -\sum_{j=1}^{V}\sum_{i=1}^{K} (f_{ij}/N)\log_2 f_{ij} + \sum_{j=1}^{V} (N_j/N)\log_2 N_j$
The basic method of calculating Enew using the entropies of the V subsets
resulting from splitting on the specified attribute was described in Chapter 5.
The entropy of the j th subset is Ej where
$E_j = -\sum_{i=1}^{K} (f_{ij}/N_j)\log_2(f_{ij}/N_j)$
The value of Enew is the weighted sum of the entropies of these V subsets.
The weighting is the proportion of the original N instances that the subset
contains, i.e. Nj /N for the j th subset. So
$E_{new} = \sum_{j=1}^{V} N_j E_j / N$
$= -\sum_{j=1}^{V}\sum_{i=1}^{K} (N_j/N)(f_{ij}/N_j)\log_2(f_{ij}/N_j)$
$= -\sum_{j=1}^{V}\sum_{i=1}^{K} (f_{ij}/N)\log_2(f_{ij}/N_j)$
$= -\sum_{j=1}^{V}\sum_{i=1}^{K} (f_{ij}/N)\log_2 f_{ij} + \sum_{j=1}^{V}\sum_{i=1}^{K} (f_{ij}/N)\log_2 N_j$
$= -\sum_{j=1}^{V}\sum_{i=1}^{K} (f_{ij}/N)\log_2 f_{ij} + \sum_{j=1}^{V} (N_j/N)\log_2 N_j$   [as $\sum_{i=1}^{K} f_{ij} = N_j$]
The formula for entropy given in Section 5.3.2 excludes empty classes from the
summation. They correspond to zero entries in the body of the frequency table,
which are also excluded from the calculation.
If a complete column of the frequency table is zero it means that the categor-
ical attribute never takes one of its possible values at the node under consider-
ation. Any such columns are ignored. (This corresponds to ignoring empty sub-
sets whilst generating a decision tree, as described in Section 4.2, Figure 4.5.)
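The frequency-table form of the calculation is easy to express directly. The sketch below reproduces Enew for attribute age; since Figure 6.2 is not reproduced here, the cell counts are inferred from the per-subset class counts used in Chapter 5 (rows are the three classes, columns the three values of age).

import math

# freq[i][j]: number of instances with the i-th classification and the j-th value of age.
freq = [
    [2, 1, 1],    # class 1
    [2, 2, 1],    # class 2
    [4, 5, 6],    # class 3
]

N = sum(sum(row) for row in freq)                                       # 24
col_sums = [sum(row[j] for row in freq) for j in range(len(freq[0]))]   # [8, 8, 8]

# E_new = -sum(f_ij/N * log2 f_ij) + sum(N_j/N * log2 N_j), zero cells excluded.
e_new = (
    -sum(f / N * math.log2(f) for row in freq for f in row if f > 0)
    + sum(nj / N * math.log2(nj) for nj in col_sums if nj > 0)
)
print(round(e_new, 4))   # 1.2867, the same value as the basic method gives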
One measure that is commonly used is the Gini Index of Diversity. If there
are K classes, with the probability of the i th class being pi , the Gini Index is
defined as $1 - \sum_{i=1}^{K} p_i^2$.
This is a measure of the ‘impurity’ of a dataset. Its smallest value is zero,
which it takes when all the classifications are the same. It takes its largest value
1 − 1/K when the classes are evenly distributed between the instances, i.e. the
frequency of each class is 1/K.
Splitting on a chosen attribute gives a reduction in the average Gini Index
of the resulting subsets (as it does for entropy). The new average value Gininew
can be calculated using the same frequency table used to calculate the new
entropy value in Section 6.1.
Using the notation introduced in that section, the value of the Gini Index
for the j th subset resulting from splitting on a specified attribute is Gj , where
$G_j = 1 - \sum_{i=1}^{K} (f_{ij}/N_j)^2$
The weighted average value of the Gini Index for the subsets resulting from
splitting on the attribute is
$Gini_{new} = \sum_{j=1}^{V} N_j G_j / N$
$= \sum_{j=1}^{V} (N_j/N) - \sum_{j=1}^{V}\sum_{i=1}^{K} (N_j/N)(f_{ij}/N_j)^2$
$= 1 - \sum_{j=1}^{V}\sum_{i=1}^{K} f_{ij}^2/(N \cdot N_j)$
$= 1 - (1/N)\sum_{j=1}^{V} (1/N_j)\sum_{i=1}^{K} f_{ij}^2$
At each stage of the attribute selection process the attribute is selected which
maximises the reduction in the value of the Gini Index, i.e. Ginistart − Gininew .
Again taking the example of the lens24 dataset, the initial probabilities of
the three classes as given in Chapter 5 are p1 = 4/24, p2 = 5/24 and p3 = 15/24.
Hence the initial value of the Gini Index is Ginistart = 0.5382.
For splitting on attribute age the frequency table, as before, is shown in
Figure 6.4.
We can now calculate the new value of the Gini Index as follows.
1. For each non-empty column, form the sum of the squares of the values in
the body of the table and divide by the column sum.
2. Add the values obtained for all the columns, divide by N (the number of instances) and subtract the result from 1 to give Gininew.
Figure 6.4 Frequency Table for Attribute age for lens24 Example
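A sketch of the two steps just described, applied to the frequency table for attribute age (the same cell counts as in the entropy calculation); the final Gini value is computed here rather than quoted, so treat it as a check against the formula above.

columns = [
    [2, 2, 4],    # age = 1: class counts
    [1, 2, 5],    # age = 2
    [1, 1, 6],    # age = 3
]
N = sum(sum(col) for col in columns)    # 24

# Step 1: for each column, the sum of squared cell values divided by the column sum.
# Step 2: add the column contributions, divide by N and subtract from 1.
contributions = [sum(f * f for f in col) / sum(col) for col in columns]
gini_new = 1 - sum(contributions) / N

gini_start = 1 - sum(p ** 2 for p in (4 / 24, 5 / 24, 15 / 24))
print(round(gini_start, 4))              # 0.5382
print(round(gini_new, 4))                # about 0.5208
print(round(gini_start - gini_new, 4))   # reduction obtained by splitting on age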
Suppose that for some dataset with three possible classifications c1, c2 and
c3 we have an attribute A with four values a1, a2, a3 and a4, and the frequency
table given in Figure 6.5.
        a1     a2     a3     a4     Total
c1      27     64     93     124    308
c2      31     54     82     105    272
c3      42     82     125    171    420
Total   100    200    300    400    1000
We start by making the assumption that the value of A has no effect what-
soever on the classification and look for evidence that this assumption (which
statisticians call the null hypothesis) is false.
It is quite easy to imagine four-valued attributes that are certain or almost
certain to be irrelevant to a classification. For example the values in each row
might correspond to the number of patients achieving a large benefit, a little
benefit or no benefit (classifications c1, c2 and c3) from a certain medical
treatment, with attribute values a1 to a4 denoting a division of patients into
four groups depending on the number of siblings they have (say zero, one, two,
three or more). Such a division would appear (to this layman) highly unlikely to
be relevant. Other four-valued attributes far more likely to be relevant include
age and weight, each converted into four ranges in this case.
The example may be made more controversial by saying that c1, c2 and
c3 are levels achieved in some kind of intelligence test and a1, a2, a3 and a4
denote people who are married and male, married and female, unmarried and
male or unmarried and female, not necessarily in that order. Does the test
score obtained depend on which category you are in? Please note that we are
not trying to settle such sensitive questions in this book, especially not with
invented data, just (as far as this chapter is concerned) deciding which attribute
should be selected when constructing a decision tree.
From now on we will treat the data as test results but to avoid controversy
will not say anything about the kind of people who fall into the four categories
a1 to a4.
The first point to note is that from examining the Total row we can see that
the people who took the test had attribute values a1 to a4 in the ratio 1:2:3:4.
This is simply a fact about the data we happen to have obtained and in itself
implies nothing about the null hypothesis, that the division of test subjects
into four groups is irrelevant.
Next consider the c1 row. We can see that a total of 308 people obtained
classification c1. If the value of attribute A were irrelevant we would expect
the 308 values in the cells to split in the ratio 1:2:3:4.
In cell c1/a1 we would expect a value of 308 ∗ 100/1000 = 30.8.
In c1/a2 we would expect twice this, i.e. 308 ∗ 200/1000 = 61.6.
In c1/a3 we would expect 308 ∗ 300/1000 = 92.4.
In c1/a4 we would expect 308 ∗ 400/1000 = 123.2.
(Note that the total of the four values comes to 308, as it must.)
We call the four calculated values above the expected values for each
class/attribute value combination. The actual values in the c1 row: 27, 64,
93 and 124 are not far away from these. Do they and the expected values for
the c2 and c3 rows support or undermine the null hypothesis, that attribute A
is irrelevant?
Although the ‘ideal’ situation is that all the expected values are identical
to the corresponding actual values, known as the observed values, this needs a
strong caveat. If you ever read a published research paper, newspaper article
etc. where for some data the expected values all turn out to be exact integers
that are exactly the same as the observed values for all classification/attribute
value combinations, by far the most likely explanation is that the published
data is an exceptionally incompetent fraud. In the real world, such perfect
accuracy is never achieved. In this example, as with most real data it is in any
case impossible for the expected values to be entirely identical to the observed
ones, as the former are not usually integers and the latter must be.
Figure 6.6 is an updated version of the frequency table given previously,
with the observed value in each of the cells from c1/a1 to c3/a4 followed by
its expected value in parentheses.
        a1           a2           a3            a4             Total
c1      27 (30.8)    64 (61.6)    93 (92.4)     124 (123.2)    308
c2      31 (27.2)    54 (54.4)    82 (81.6)     105 (108.8)    272
c3      42 (42.0)    82 (84.0)    125 (126.0)   171 (168.0)    420
Total   100          200          300           400            1000
The notation normally used is to represent the observed value for each cell
by O and the expected value by E. The value of E for each cell is just the
product of the corresponding column sum and row sum divided by the grand
total number of instances given in the bottom right-hand corner of the table.
For example the E value for cell c3/a2 is 200 ∗ 420/1000 = 84.0.
We can use the values of O and E for each cell to calculate a measure of how
far the frequency table varies from what we would expect if the null hypothesis
(that attribute A is irrelevant) were correct. We would like the measure to
be zero in the case that the E values in every cell are always identical to the
corresponding O values.
The measure generally used is the χ2 value, which is defined as the sum of
the values of (O − E)2 /E over all the cells.
Calculating the χ2 value for the updated frequency table above, we have
χ2 = (27 − 30.8)2/30.8 + . . . + (171 − 168.0)2/168.0 = 1.35 (to two decimal places).
Is this χ2 value small enough to give support for the null hypothesis that
attribute A is irrelevant to the classification? Or is it large enough to suggest
that the null hypothesis is false?
This question will be important when the same method is used later in
connection with the discretisation of continuous attributes, but as far as this
chapter is concerned we will ignore the question of the validity of the null
hypothesis and simply record the value of χ2 . We then repeat the process with
all the attributes under consideration as the attribute to split on in our decision
tree and choose the one with the largest χ2 value as the one likely to have the
greatest power of discrimination amongst the three classifications.
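As a concrete illustration, here is a short Python sketch (not code from the book) that reproduces the expected values and the χ2 value of 1.35 for the frequency table above; the variable names are chosen freely for the example.

# Observed frequencies for classes c1, c2, c3 against attribute values a1 to a4
observed = [
    [27, 64, 93, 124],   # c1
    [31, 54, 82, 105],   # c2
    [42, 82, 125, 171],  # c3
]
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, O in enumerate(row):
        E = row_sums[i] * col_sums[j] / total   # expected value for this cell
        chi_squared += (O - E) ** 2 / E

print(round(chi_squared, 2))   # 1.35

Repeating this calculation for each candidate attribute and selecting the one with the largest χ2 value gives the attribute selection method just described.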
This is not serious of course, but it is trying to make a serious point. Mathematically it is possible to find some formula that will justify any further development of the sequence, not just the ‘obvious’ continuation with perfect squares that our intuition prefers.
Whether this is right or wrong is impossible to say absolutely — it depends
on the situation. It illustrates an inductive bias, i.e. a preference for one choice
rather than another, which is not determined by the data itself (in this case,
previous values in the sequence) but by external factors, such as our preferences
for simplicity or familiarity with perfect squares. In school we rapidly learn
that the question-setter has a strong bias in favour of sequences such as perfect
squares and we give our answers to match this bias if we can.
Turning back to the task of attribute selection, any formula we use for it,
however principled we believe it to be, introduces an inductive bias that is not
justified purely by the data. Such bias can be helpful or harmful, depending
on the dataset. We can choose a method that has a bias that we favour, but
we cannot eliminate inductive bias altogether. There is no neutral, unbiased
method.
Clearly it is important to be able to say what bias is introduced by any
particular method of selecting attributes. For many methods this is not easy to
do, but for one of the best-known methods we can. Using entropy can be shown
to have a bias towards selecting attributes with a large number of values.
For many datasets this does no harm, but for some it can be undesirable. For
example we may have a dataset about people that includes an attribute ‘place
of birth’ and classifies them as responding to some medical treatment ‘well’,
‘badly’ or ‘not at all’. Although the place of birth may have some effect on the
classification it is probably only a minor one. Unfortunately, the information
gain selection method will almost certainly choose it as the first attribute to
split on in the decision tree, generating one branch for each possible place of
birth. The decision tree will be very large, with many branches (rules) of very little value for classification.
For example, for an attribute with three values each occurring 8 times in a set of 24 instances, the Split Information is
−(8/24) log2 (8/24) − (8/24) log2 (8/24) − (8/24) log2 (8/24) = 1.5850
Split Information forms the denominator in the Gain Ratio formula. Hence the
higher the value of Split Information, the lower the Gain Ratio.
The value of Split Information depends on the number of values a categorical
attribute has and how uniformly those values are distributed (hence the name
‘Split Information’).
To illustrate this we will examine the case where there are 32 instances and
we are considering splitting on an attribute a, which has values 1, 2, 3 and 4.
The ‘Frequency’ row in the tables below is the same as the column sum row
in the frequency tables used previously in this chapter.
The following examples illustrate a number of possibilities.
1. Single Attribute Value
a=1 a=2 a=3 a=4
Frequency 32 0 0 0
Split Information = −(32/32) × log2 (32/32) = − log2 1 = 0
2. Different Distributions of a Given Total Frequency
a=1 a=2 a=3 a=4
Frequency 16 16 0 0
Split Information = −(16/32) × log2 (16/32) − (16/32) × log2 (16/32) =
− log2 (1/2) = 1
a=1 a=2 a=3 a=4
Frequency 16 8 8 0
Split Information = −(16/32) × log2 (16/32) − 2 × (8/32) × log2 (8/32) =
−(1/2) log2 (1/2) − (1/2) log2 (1/4) = 0.5 + 1 = 1.5
a=1 a=2 a=3 a=4
Frequency 16 8 4 4
Split Information = −(16/32) × log2 (16/32) − (8/32) × log2 (8/32) − 2 ×
(4/32) × log2 (4/32) = 0.5 + 0.5 + 0.75 = 1.75
3. Uniform Distribution of Attribute Frequencies
a=1 a=2 a=3 a=4
Frequency 8 8 8 8
Split Information = −4 × (8/32) × log2 (8/32) = − log2 (1/4) = log2 4 = 2
In general, if there are M attribute values, each occurring equally frequently,
the Split Information is log2 M (irrespective of the frequency value).
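The examples above can be checked with a few lines of Python; this is only an illustrative sketch, with the function name invented here.

import math

def split_information(frequencies):
    # Split Information for a list of attribute-value frequencies;
    # values with zero frequency contribute nothing.
    total = sum(frequencies)
    return -sum((f / total) * math.log2(f / total)
                for f in frequencies if f > 0)

for freqs in ([32, 0, 0, 0], [16, 16, 0, 0], [16, 8, 8, 0],
              [16, 8, 4, 4], [8, 8, 8, 8]):
    print(freqs, split_information(freqs))
# the Split Information values are 0, 1, 1.5, 1.75 and 2 respectively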
6.5.2 Summary
For many datasets Information Gain (i.e. entropy reduction) and Gain Ratio
give the same results. For others using Gain Ratio can give a significantly
smaller decision tree. However, Figure 6.7 shows that neither Information Gain
nor Gain Ratio invariably gives the smallest decision tree. This is in accord with
the general result that no method of attribute selection is best for all possible
datasets. In practice Information Gain is probably the most commonly used
method, although the popularity of C4.5 makes Gain Ratio a strong contender.
References
[1] Mingers, J. (1989). An empirical comparison of pruning methods for deci-
sion tree induction. Machine Learning, 4, 227–243.
[2] Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo:
Morgan Kaufmann.
7 Estimating the Predictive Accuracy of a Classifier
7.1 Introduction
Any algorithm which assigns a classification to unseen instances is called a
classifier. A decision tree of the kind described in earlier chapters is one very
popular type of classifier, but there are several others, some of which are de-
scribed elsewhere in this book.
This chapter is concerned with estimating the performance of a classifier of
any kind but will be illustrated using decision trees generated with attribute
selection using information gain, as described in Chapter 5.
Although the data compression referred to in Chapter 4 can sometimes
be important, in practice the principal reason for generating a classifier is to
enable unseen instances to be classified. However we have already seen that
many different classifiers can be generated from a given dataset. Each one is
likely to perform differently on a set of unseen instances.
The most obvious criterion to use for estimating the performance of a clas-
sifier is predictive accuracy, i.e. the proportion of a set of unseen instances that
it correctly classifies. This is often seen as the most important criterion but
other criteria are also important, for example algorithmic complexity, efficient
use of machine resources and comprehensibility.
For most domains of interest the number of possible unseen instances is
potentially very large (e.g. all those who might develop an illness, the weather
for every possible day in the future or all the possible objects that might appear
NOTE. For some datasets in the UCI Repository (and elsewhere) the data
is provided as two separate files, designated as the training set and the test set.
In such cases we will consider the two files together as comprising the ‘dataset’
for that application. In cases where the dataset is only a single file we need to
divide it into a training set and a test set before using Method 1. This may be
done in many ways, but a random division into two parts in proportions such
as 1:1, 2:1, 70:30 or 60:40 would be customary.
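A random division of this kind can be sketched in a few lines of Python (an illustration only, not part of any particular data mining package; the function name and split proportion are invented for the example).

import random

def train_test_split(instances, test_fraction=0.3, seed=1):
    # Randomly divide a list of instances into a training set and a test set.
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # training set, test set

training_set, test_set = train_test_split(range(100), test_fraction=0.3)
print(len(training_set), len(test_set))   # 70 30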
It is important to bear in mind that the overall aim is not (just) to classify
the instances in the test set but to estimate the predictive accuracy of the
classifier for all possible unseen instances, which will generally be many times
the number of instances contained in the test set.
If the predictive accuracy calculated for the test set is p and we go on to
use the classifier to classify the instances in a different test set, it is very likely
that a different value for predictive accuracy would be obtained. All that we
can say is that p is an estimate of the true predictive accuracy of the classifier
for all possible unseen instances.
We cannot determine the true value without collecting all the instances and
running the classifier on them, which is usually an impossible task. Instead, we
can use statistical methods to find a range of values within which the true value
of the predictive accuracy lies, with a given probability or ‘confidence level’.
To do this we use the standard error associated with the estimated value p.
If p is calculated using a test set of N instances the value of its standard error
is √(p(1 − p)/N). (The proof of this is outside the scope of this book, but can
readily be found in many statistics textbooks.)
The significance of standard error is that it enables us to say that with a
specified probability (which we can choose) the true predictive accuracy of the
classifier is within so many standard errors above or below the estimated value
p. The more certain we wish to be, the greater the number of standard errors.
The probability is called the confidence level, denoted by CL and the number
of standard errors is usually written as ZCL .
Figure 7.2 shows the relationship between commonly used values of CL and
ZCL .
If the predictive accuracy for a test set is p, with standard error S, then
using this table we can say that with probability CL (or with a confidence level
CL) the true predictive accuracy lies in the interval p ± ZCL × S.
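For example, the interval can be computed as follows. This is only a sketch; the figure of 1.96 standard errors corresponds to a 95% confidence level (one of the standard ZCL values of the kind given in Figure 7.2), and the accuracy and test set size used here are invented.

import math

def accuracy_interval(p, n, z):
    # p: estimated predictive accuracy from a test set of n instances
    # z: number of standard errors for the chosen confidence level
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error

# estimated accuracy 0.9 from 200 test instances, 95% confidence level
print(accuracy_interval(0.9, 200, 1.96))   # roughly (0.858, 0.942)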
The ‘train and test’ method can be extended so that the classifier is used to classify k test sets, not just one. If all the test sets
are of the same size, N , the predictive accuracy values obtained for the k test
sets are then averaged to produce an overall estimate p.
As the total number of instances in the test sets is kN , the standard error
of the estimate p is √(p(1 − p)/kN).
If the test sets are not all of the same size the calculations are slightly more
complicated.
If there are Ni instances in the ith test set (1 ≤ i ≤ k) and the predictive accuracy calculated for the ith test set is pi, the overall predictive accuracy is p = (p1 N1 + p2 N2 + . . . + pk Nk)/T, where T = N1 + N2 + . . . + Nk, i.e. p is the weighted average of the pi values. The standard error is √(p(1 − p)/T).
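A sketch of this weighted-average calculation (again purely illustrative, with invented figures):

import math

def overall_accuracy(accuracies, test_set_sizes):
    # Weighted average of the accuracies p1..pk over test sets of sizes N1..Nk,
    # together with its standard error.
    T = sum(test_set_sizes)
    p = sum(pi * Ni for pi, Ni in zip(accuracies, test_set_sizes)) / T
    return p, math.sqrt(p * (1 - p) / T)

p, se = overall_accuracy([0.85, 0.90, 0.80], [100, 200, 100])
print(round(p, 4), round(se, 4))   # 0.8625 0.0172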
In k-fold cross-validation the available data is divided into k parts of equal (or nearly equal) size and a series of k runs is then carried out. Each of the k parts in turn is used as a test set and the other k − 1 parts are used as a training set.
The total number of instances correctly classified (in all k runs combined) is
divided by the total number of instances N to give an overall level of predictive accuracy p, with standard error √(p(1 − p)/N).
N-fold (or ‘leave-one-out’) cross-validation, where k is set to the total number of instances N, is most likely to be of benefit with very small datasets where as much data as possible needs to be used to train the classifier.
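The division of a dataset into k folds can be sketched as follows (an illustration only; a practical implementation would normally randomise the order of the instances first). Setting k equal to the number of instances gives N-fold cross-validation.

def cross_validation_folds(instances, k):
    # Split the instances into k parts; each part is used once as the test set
    # and the remaining k - 1 parts form the corresponding training set.
    parts = [instances[i::k] for i in range(k)]
    for i in range(k):
        test_set = parts[i]
        training_set = [x for j, part in enumerate(parts) if j != i for x in part]
        yield training_set, test_set

for training_set, test_set in cross_validation_folds(list(range(10)), k=5):
    print(len(training_set), len(test_set))   # 8 2, printed five times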
The vote, pima-indians and glass datasets are all taken from the UCI Repos-
itory. The chess dataset was constructed for a well-known series of machine
learning experiments [1].
The vote dataset has separate training and test sets. The other three
datasets were first divided into two parts, with every third instance placed
in the test set and the other two placed in the training set in each case.
The result for the vote dataset illustrates the point that TDIDT (along
with some but not all other classification algorithms) is sometimes unable to
classify an unseen instance (Figure 7.5). The reason for this was discussed in
Section 6.7.
Figure 7.6 Train and Test Results for vote Dataset (Modified)
Figures 7.7 and 7.8 show the results obtained using 10-fold and N -fold
Cross-validation for the four datasets.
For the vote dataset the 300 instances in the training set are used. For the other three datasets all the available instances are used.
All the figures given in this section are estimates. The 10-fold cross-
validation and N -fold cross-validation results for all four datasets are based
on considerably more instances than those in the corresponding test sets for
the ‘train and test’ experiments and so are more likely to be reliable.
Each dataset has both a training set and a separate test set. In each case,
there are missing values in both the training set and the test set. The values
in parentheses in the ‘training set’ and ‘test set’ columns show the number of
instances that have at least one missing value.
The ‘train and test’ method was used for estimating predictive accuracy.
Two strategies for dealing with missing attribute values were described in
Section 2.4. We give results for each of these in turn.
This is the simplest strategy: delete all instances where there is at least one
missing value and use the remainder. This strategy has the advantage of avoid-
ing introducing any data errors. Its main disadvantage is that discarding data
may damage the reliability of the resulting classifier.
A second disadvantage is that the method cannot be used when a high
proportion of the instances in the training set have missing values, as is the case
for example with both the hypo and the labor-ne datasets. A final disadvantage
is that it is not possible with this strategy to classify any instances in the test
set that have missing values.
Together these weaknesses are quite substantial. Although the ‘discard in-
stances’ strategy may be worth trying when the proportion of missing values
is small, it is not recommended in general.
Of the three datasets listed in Figure 7.9, the ‘discard instances’ strategy
can only be applied to crx. Doing so gives the possibly surprising result in
Figure 7.10.
Clearly, discarding the 37 instances (5.4%) that have at least one missing value from the training set does not prevent the algorithm from constructing a decision tree capable of correctly classifying, in every case, the 188 instances in the test set that do not have missing values.
With this strategy any missing values of a categorical attribute are replaced by
its most commonly occurring value in the training set. Any missing values of a
continuous attribute are replaced by its average value in the training set.
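A simplified sketch of this strategy for a single attribute column is given below. In practice the replacement values would be calculated from the training set and then reused for the test set; the column representation and function name here are invented for the example.

from collections import Counter

def fill_missing(column, continuous, missing=None):
    # Replace missing values by the attribute's most frequent value
    # (categorical attribute) or its average value (continuous attribute).
    present = [v for v in column if v is not missing]
    if continuous:
        replacement = sum(present) / len(present)
    else:
        replacement = Counter(present).most_common(1)[0][0]
    return [replacement if v is missing else v for v in column]

print(fill_missing(['red', None, 'red', 'blue'], continuous=False))  # 'red' substituted
print(fill_missing([1.0, None, 3.0], continuous=True))               # 2.0 substituted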
Figure 7.11 shows the result of applying the ‘Most Frequent/Average Value’
strategy to the crx dataset. As for the ‘Discard Instances’ strategy all instances
in the test set are correctly classified, but this time all 200 instances in the test
set are classified, not just the 188 instances in the test set that do not have
missing values.
With this strategy we can also construct classifiers from the hypo and labor-ne datasets.
In the case of the hypo dataset, we get a decision tree with just 15 rules.
The average number of terms per rule is 4.8. When applied to the test data this
tree is able to classify correctly 1251 of the 1258 instances in the test set (99%;
Figure 7.12). This is a remarkable result with so few rules, especially as there
are missing values in every instance in the training set. It gives considerable
credence to the belief that using entropy for constructing a decision tree is an
effective approach.
In the case of the labor-ne dataset, we obtain a classifier with five rules,
which correctly classifies 14 out of the 17 instances in the test set (Figure 7.13).
It is worth noting that for each dataset given in Figure 7.9 the missing values are
those of attributes, not classifications. Missing classifications in the training set
are a far larger problem than missing attribute values. One possible approach
would be to replace them all by the most frequently occurring classification but
this is unlikely to prove successful in most cases. The best approach is probably
to discard any instances with missing classifications.
Correct Classified as
classification democrat republican
democrat 81 (97.6%) 2 (2.4%)
republican 6 (11.5%) 46 (88.5%)
The body of the table has one row and column for each possible classifi-
cation. The rows correspond to the correct classifications. The columns corre-
spond to the predicted classifications.
The value in the i th row and j th column gives the number of instances for
which the correct classification is the i th class which are classified as belonging
to the j th class. If all the instances were correctly classified, the only non-zero
entries would be on the ‘leading diagonal’ running from top left (i.e. row 1,
column 1) down to bottom right.
To demonstrate that the use of a confusion matrix is not restricted to
datasets with two classifications, Figure 7.15 shows the results obtained us-
ing 10-fold cross-validation with the TDIDT algorithm (using information gain
for attribute selection) for the glass dataset, which has six classifications: 1, 2,
3, 5, 6 and 7 (there is also a class 4 but it is not used for the training data).
Correct Classified as
classification 1 2 3 5 6 7
1 52 10 7 0 0 1
2 15 50 6 2 1 2
3 5 6 6 0 0 0
5 0 2 0 10 0 1
6 0 1 0 0 7 1
7 1 3 0 1 0 24
When a dataset has only two classes, one is often regarded as ‘positive’ (i.e. the
class of principal interest) and the other as ‘negative’. In this case the entries
in the two rows and columns of the confusion matrix are referred to as true
and false positives and true and false negatives (Figure 7.16).
When there are more than two classes, one class is sometimes important
enough to be regarded as positive, with all the other classes combined treated
as negative. For example we might consider class 1 for the glass dataset as the
‘positive’ class and classes 2, 3, 5, 6 and 7 combined as ‘negative’. The confusion
matrix given as Figure 7.15 can then be rewritten as shown in Figure 7.17.
Of the 73 instances classified as positive, 52 genuinely are positive (true
positives) and the other 21 are really negative (false positives). Of the 141
instances classified as negative, 18 are really positive (false negatives) and the
other 123 are genuinely negative (true negatives). With a perfect classifier there
would be no false positives or false negatives.
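These four counts can be read off a multi-class confusion matrix mechanically, as the following sketch (using the glass figures from Figure 7.15) shows; the data structure chosen here is just for illustration.

# rows of the confusion matrix in Figure 7.15, indexed by correct class
confusion = {
    1: [52, 10, 7, 0, 0, 1],
    2: [15, 50, 6, 2, 1, 2],
    3: [5, 6, 6, 0, 0, 0],
    5: [0, 2, 0, 10, 0, 1],
    6: [0, 1, 0, 0, 7, 1],
    7: [1, 3, 0, 1, 0, 24],
}
classes = [1, 2, 3, 5, 6, 7]
positive = 1                       # treat class 1 as 'positive'
col = classes.index(positive)      # column of instances classified as the positive class

TP = confusion[positive][col]
FN = sum(confusion[positive]) - TP
FP = sum(confusion[c][col] for c in classes if c != positive)
TN = sum(sum(confusion[c]) for c in classes if c != positive) - FP
print(TP, FP, FN, TN)   # 52 21 18 123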
False positives and false negatives may not be of equal importance, e.g.
we may be willing to accept some false positives as long as there are no false
negatives or vice versa. We will return to this topic in Chapter 12.
Reference
[1] Quinlan, J. R. (1979). Discovering rules by induction from large collections
of examples. In D. Michie (Ed.), Expert systems in the micro-electronic age
(pp. 168–201). Edinburgh: Edinburgh University Press.
8 Continuous Attributes
8.1 Introduction
Many data mining algorithms, including the TDIDT tree generation algorithm,
require all attributes to take categorical values. However, in the real world many
attributes are naturally continuous, e.g. height, weight, length, temperature and
speed. It is essential for a practical data mining system to be able to handle such
attributes. In some cases the algorithms can be adapted for use with continuous
attributes. In other cases, this is hard or impossible to do.
Although it would be possible to treat a continuous attribute as a categor-
ical one with values 6.3, 7.2, 8.3, 9.2 etc., say, this is very unlikely to prove
satisfactory in general. If the continuous attribute has a large number of dif-
ferent values in the training set, it is likely that any particular value will only
occur a small number of times, perhaps only once, and rules that include tests
for specific values such as X = 7.2 will probably be of very little value for
prediction.
The standard approach is to split the values of a continuous attribute into a
number of non-overlapping ranges. For example a continuous attribute X might
be divided into the four ranges X < 7, 7 ≤ X < 12, 12 ≤ X < 20 and X ≥ 20.
This allows it to be treated as a categorical attribute with four possible values.
The values 7, 12 and 20 are called cut values or cut points. However, any such division leads to a problem at the cut points, e.g. why is a length of 2.99999 treated differently from one of 3.00001?
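Before turning to that problem, note that once the cut points have been chosen the conversion itself is straightforward. A minimal sketch for the ranges above (cut points 7, 12 and 20), with invented label strings, is:

import bisect

cut_points = [7, 12, 20]
labels = ['X < 7', '7 <= X < 12', '12 <= X < 20', 'X >= 20']

def discretise(x):
    # Return the categorical label of the range containing x.
    return labels[bisect.bisect_right(cut_points, x)]

print(discretise(6.3), discretise(7.0), discretise(19.9), discretise(20.0))
# X < 7   7 <= X < 12   12 <= X < 20   X >= 20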
The problem with any method of discretising continuous attributes is that of
over-sensitivity. Whichever cut points are chosen there will always be a poten-
tial problem with values that fall just below a cut point being treated differently
from those that fall just above for no principled reason.
Ideally we would like to find ‘gaps’ in the range of values. If in the length
example there were many values from 0.3 to 0.4 with the next smallest value
being 2.2, a test such as length < 1.0 would avoid problems around the cut
point, as there are no instances (in the training set) with values close to 1.0.
The value 1.0 is obviously arbitrary and a different cut point, e.g. 1.5 could
just as well have been chosen. Unfortunately the same gaps may not occur in
unseen test data. If there were values such as 0.99, 1.05, 1.49 and 1.51 in the
test data, whether the arbitrary choice of cut point was 1.0 or 1.5 could be of
critical importance.
Although both the equal width intervals and the equal frequency intervals
methods are reasonably effective, they both suffer from the fundamental weak-
ness, as far as classification problems are concerned, that they take no account
of the classifications when determining where to place the cut points, and other
methods which do make use of the classifications are generally preferred. Two
of these are described in Sections 8.3 and 8.4.
At any node of the evolving decision tree the entropy values (and hence the
information gain values) of all the pseudo-attributes derived from a given
continuous attribute can be calculated with a single pass through the train-
ing data. The same applies to any other measure that can be calculated
using the frequency table method described in Chapter 6. There are three
stages.
Stage 1
Before processing any continuous attributes at a node we first need to
count the number of instances with each of the possible classifications in
the part of the training set under consideration at the node. (These are
the sums of the values in each row of a frequency table such as Figure 6.2.)
These values do not depend on which attribute is subsequently processed
and so only have to be counted once at each node of the tree.
Stage 2
We next work through the continuous attributes one by one. We will
assume that a particular continuous attribute under consideration is named
Var and that the aim is to find the largest value of a specified measure for
all possible pseudo-attributes Var < X where X is one of the values of Var
in the part of the training set under consideration at the given node. We
will call the values of attribute Var candidate cut points. We will call the
largest value of measure maxmeasure and the value of X that gives that
largest value the cut point for attribute Var.
Stage 3
Having found the value of maxmeasure (and the corresponding cut points)
for all the continuous attributes, we next need to find the largest and then
compare it with the values of the measure obtained for any categorical
attributes to determine which attribute or pseudo-attribute to split on at
the node.
To illustrate this process we will use the golf training set introduced in
Chapter 4. For simplicity we will assume that we are at the root node of the
decision tree but the same method can be applied (with a reduced training set
of course) at any node of the tree.
We start by counting the number of instances with each of the possible
classifications. Here there are 9 play and 5 don’t play, making a total of 14.
We now need to process each of the continuous attributes in turn (Stage 2).
There are two: temperature and humidity. We will illustrate the processing
involved at Stage 2 using attribute temperature.
The first step is to sort the values of the attribute into ascending numer-
ical order and create a table containing just two columns: one for the sorted
attribute values and the other for the corresponding classification. We will call
this the sorted instances table.
Figure 8.1 shows the result of this for our example. Note that temperature
values 72 and 75 both occur twice. There are 12 distinct values 64, 65, . . . , 85.
Temperature Class
64 play
65 don’t play
68 play
69 play
70 play
71 don’t play
72 play
72 don’t play
75 play
75 play
80 don’t play
81 play
83 play
85 don’t play
The algorithm for processing the sorted instances table for continuous at-
tribute Var is given in Figure 8.2. It is assumed that there are n instances and
the rows in the sorted instances table are numbered from 1 to n. The attribute
value corresponding to row i is denoted by value(i) and the corresponding class
is denoted by class(i).
Essentially, we work through the table row by row from top to bottom,
accumulating a count of the number of instances with each classification. As
each row is processed its attribute value is compared with the value for the
row below. If the latter value is larger it is treated as a candidate cut point
and the value of the measure is computed using the frequency table method
(the example that follows will show how this is done). Otherwise the attribute
values must be the same and processing continues to the next row. After the
last but one row has been processed, processing stops (the final row has nothing
below it with which to compare).
The algorithm returns two values: maxmeasure and cutvalue, which are
respectively the largest value of the measure that can be obtained for a pseudo-
attribute derived from attribute Var and the corresponding cut value.
for i = 1 to n − 1 {
increase count of class(i) by 1
if value(i) < value(i + 1){
(a) Construct a frequency table for pseudo-attribute
Var < value(i + 1)
(b) Calculate the value of measure
(c) If measure > maxmeasure {
maxmeasure = measure
cutvalue = value(i + 1)
}
}
}
In this and the other frequency tables in this section the counts of play and
don’t play in the ‘temperature < xxx’ column are marked with an asterisk.
The entries in the final column are fixed (the same for all attributes) and are
shown in bold. All the other entries are calculated from these by simple addition
and subtraction. Once the frequency table has been constructed, the values of
measures such as Information Gain and Gain Ratio can be calculated from it,
as described in Chapter 6.
Figure 8.3(b) shows the frequency table resulting after the next row of the
sorted instances table has been examined. The counts are now play = 1, don’t
play = 1.
The value of Information Gain (or the other measures) can again be cal-
culated from this table. The important point here is how easily this second
frequency table can be derived from the first. Only the don’t play row has
changed by moving just one instance from the ‘greater than or equal to’ col-
umn to the ‘less than’ column.
We proceed in this way processing rows 3, 4, 5 and 6 and generating a new
frequency table (and hence a new value of measure) for each one. When we come
to the seventh row (temperature = 72) we note that the value of temperature
for the next instance is the same as for the current one (both 72), so we do
not create a new frequency table but instead go on to row 8. As the value of
temperature for this is different from that for the next instance we construct a
frequency table for the latter value, i.e. for pseudo-attribute temperature < 75
(Figure 8.3(c)).
We go on in this way until we have processed row 13 (out of 14). This ensures
that frequency tables are constructed for all the distinct values of temperature
except the first. There are 11 of these candidate cut values, corresponding to
pseudo-attributes temperature < 65, temperature < 68, . . . , temperature < 85.
The value of this method is that the 11 frequency tables are generated from
each other one by one, by a single pass through the sorted instances table.
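The whole of Stage 2 for attribute temperature can be captured in a short Python sketch. This is not the book's implementation: it uses information gain as the measure, computes the entropies directly rather than via explicit frequency tables, and the variable names are invented.

import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# sorted instances table for attribute temperature (Figure 8.1)
sorted_instances = [
    (64, 'play'), (65, "don't play"), (68, 'play'), (69, 'play'),
    (70, 'play'), (71, "don't play"), (72, 'play'), (72, "don't play"),
    (75, 'play'), (75, 'play'), (80, "don't play"), (81, 'play'),
    (83, 'play'), (85, "don't play"),
]

n = len(sorted_instances)
total_counts = Counter(cls for _, cls in sorted_instances)   # Stage 1 counts
start_entropy = entropy(total_counts.values())

below = Counter()                  # class counts for instances with value < cut
maxmeasure, cutvalue = -1.0, None
for i in range(n - 1):
    value, cls = sorted_instances[i]
    below[cls] += 1
    if value < sorted_instances[i + 1][0]:       # candidate cut point
        above = total_counts - below             # counts for value >= cut
        n_below = sum(below.values())
        new_entropy = ((n_below / n) * entropy(below.values())
                       + ((n - n_below) / n) * entropy(above.values()))
        gain = start_entropy - new_entropy
        if gain > maxmeasure:
            maxmeasure, cutvalue = gain, sorted_instances[i + 1][0]

print(cutvalue, round(maxmeasure, 4))
# reports the cut value with the largest information gain (85 for this data)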
This section looks at three efficiency issues associated with the method de-
scribed in Section 8.3.1.
Sorting the attribute values into ascending numerical order is the principal overhead on the use of the method and thus the principal limitation on the maximum size of training set that can be handled. This
is also true of almost any other conceivable method of discretising continuous
attributes. For this algorithm it has to be carried out once for each continuous
attribute at each node of the decision tree.
It is important to use an efficient method of sorting, especially if the num-
ber of instances is large. The one most commonly used is probably Quicksort,
descriptions of which are readily available from books (and websites) about
sorting. Its most important feature is that the number of operations required
is approximately a constant multiple of n × log2 n, where n is the number of
instances. We say it varies as n × log2 n. This may not seem important but
there are other sorting algorithms that vary as n2 (or worse) and the difference
is considerable.
Figure 8.4 shows the values of n × log2 n and n2 for different values of n. It
is clear from the table that a good choice of sorting algorithm is essential.
n            n × log2 n     n2
100          664            10,000
500          4,483          250,000
1,000        9,966          1,000,000
10,000       132,877        100,000,000
100,000      1,660,964      10,000,000,000
1,000,000    19,931,569     1,000,000,000,000
The difference between the values in the second and third columns of this
table is considerable. Taking the final row for illustration, if we imagine a sorting
task for 1,000,000 items (not a huge number) that takes 19,931,569 steps and
assume that each step takes just one microsecond to perform, the time required
would be 19.9 seconds. If we used an alternative method to perform the same
task that takes 1,000,000,000,000 steps, each lasting a microsecond, the time
would increase to over 11.5 days.
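The figures in the table above are easy to reproduce (or extend to other values of n) with a couple of lines of Python:

import math

for n in (100, 500, 1000, 10000, 100000, 1000000):
    print(n, round(n * math.log2(n)), n * n)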
For any given continuous attribute, generating the frequency tables takes
just one pass through the training data. The number of such tables is the
same as the number of cut values, i.e. the number of distinct attribute values
(ignoring the first). Each table comprises just 2 × 2 = 4 entries in its main
body plus two column sums. Processing many of these small tables should be
reasonably manageable.
As the method was described in Section 8.3.1 the number of candidate cut
points is always the same as the number of distinct values of the attribute
(ignoring the first). For a large training set the number of distinct values may
also be large. One possibility is to reduce the number of candidate cut points
by making use of class information.
Figure 8.5 is the sorted instances table for the golf training set and attribute
temperature, previously shown in Section 8.3.1, with the eleven cut values in-
dicated with asterisks (where there are repeated attribute values only the last
occurrence is treated as a cut value).
We can reduce this number by applying the rule ‘only include attribute
values for which the class value is different from that for the previous attribute
value’. Thus attribute value 65 is included because the corresponding class
value (don’t play) is different from the class corresponding to temperature 64,
which is play. Attribute value 69 is excluded because the corresponding class
(play) is the same as that for attribute value 68. Figure 8.6 shows the result of
applying this rule.
The instances with temperature value 65, 68, 71, 81 and 85 are included.
Instances with value 69, 70 and 83 are excluded.
However, repeated attribute values lead to complications. Should 72, 75 and
80 be included or excluded? We cannot apply the rule ‘only include attribute
values for which the class value is different from that for the previous attribute
value’ to the two instances with attribute value 72 because one of their class
values (don’t play) is the same as for the previous attribute value and the other
(play) is not. Even though both instances with temperature 75 have class play,
Temperature Class
64 play
65 * don’t play
68 * play
69 * play
70 * play
71 * don’t play
72 play
72 * don’t play
75 play
75 * play
80 * don’t play
81 * play
83 * play
85 * don’t play
Temperature Class
64 play
65 * don’t play
68 * play
69 play
70 play
71 * don’t play
72 play
72 ? don’t play
75 play
75 ? play
80 ? don’t play
81 * play
83 play
85 * don’t play
we still cannot apply the rule. Which of the instances for the previous attribute
value, 72, would we use? It seems reasonable to include 80, as the class for both
occurrences of 75 is play, but what if they were a combination of play and don’t
play?
There are other combinations that can occur, but in practice none of this
need cause us any problems. It does no harm to examine more candidate cut
points than the bare minimum and a simple amended rule is: ‘only include
attribute values for which the class value is different from that for the previous
attribute value, together with any attribute which occurs more than once and
the attribute immediately following it’.
This gives the final version of the table shown in Figure 8.7, with eight
candidate cut values.
Temperature Class
64 play
65 * don’t play
68 * play
69 play
70 play
71 * don’t play
72 play
72 * don’t play
75 play
75 * play
80 * don’t play
81 * play
83 play
85 * don’t play
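The amended rule is easy to automate. The following sketch (not taken from the book) reproduces the eight candidate cut values of Figure 8.7 from the sorted instances table; the function name and data layout are invented for the example.

from itertools import groupby

def candidate_cut_values(sorted_instances):
    # sorted_instances: list of (value, class) pairs in ascending order of value.
    # Group the table by distinct attribute value, keeping the list of classes
    # associated with each value.
    grouped = [(value, [cls for _, cls in rows])
               for value, rows in groupby(sorted_instances, key=lambda r: r[0])]
    candidates = []
    for i in range(1, len(grouped)):    # the first distinct value is never a cut value
        value, classes = grouped[i]
        _, previous_classes = grouped[i - 1]
        if (len(classes) > 1                           # value occurs more than once
                or len(previous_classes) > 1           # follows a repeated value
                or classes[0] != previous_classes[0]): # class value has changed
            candidates.append(value)
    return candidates

golf_temperatures = [
    (64, 'play'), (65, "don't play"), (68, 'play'), (69, 'play'),
    (70, 'play'), (71, "don't play"), (72, 'play'), (72, "don't play"),
    (75, 'play'), (75, 'play'), (80, "don't play"), (81, 'play'),
    (83, 'play'), (85, "don't play"),
]
print(candidate_cut_values(golf_temperatures))
# [65, 68, 71, 72, 75, 80, 81, 85] -- the eight values asterisked in Figure 8.7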
The first step in discretising a continuous attribute is to sort its values into
ascending numerical order, with the corresponding classifications sorted into
the same order.
The next step is to construct a frequency table giving the number of oc-
currences of each distinct value of the attribute for each possible classification.
It then uses the distribution of the values of the attribute within the different
classes to generate a set of intervals that are considered statistically distinct at
a given level of significance.
As an example, suppose that A is a continuous attribute in a training set
with 60 instances and three possible classifications c1, c2 and c3. A possible
distribution of the values of A arranged in ascending numerical order is shown
in Figure 8.8. The aim is to combine the values of A into a number of ranges.
Note that some of the attribute values occur just once, whilst others occur
several times.
For the adjacent intervals labelled 8.7 and 12.1 in Figure 8.8 the contingency table is shown below as Figure 8.10(a).
Figure 8.10(a) Observed Frequencies for Two Adjacent Rows of Figure 8.8
The ‘row sum’ figures in the right-hand column and the ‘column sum’ figures
in the bottom row are called ‘marginal totals’. They correspond respectively to
the number of instances for each value of A (i.e. with their value of attribute
A in the corresponding interval) and the number of instances in each class for
both intervals combined. The grand total (19 instances in this case) is given in
the bottom right-hand corner of the table.
The contingency table is used to calculate the value of a variable called χ2
(or ‘the χ2 statistic’ or ‘the Chi-square statistic’), using a method that will
be described in Section 8.4.1. This value is then compared with a threshold
value T , which depends on the number of classes and the level of statistical
significance required. The threshold will be described further in Section 8.4.2.
For the current example we will use a significance level of 90% (explained
below). As there are three classes this gives a threshold value of 4.61.
The significance of the threshold is that if we assume that the classification
is independent of which of the two adjacent intervals an instance belongs to,
there is a 90% probability that χ2 will be less than 4.61.
If χ2 is less than 4.61 it is taken as supporting the hypothesis of indepen-
dence at the 90% significance level and the two intervals are merged. On the
other hand, if the value of χ2 is greater than 4.61 we deduce that the class and
interval are not independent, again at the 90% significance level, and the two
intervals are left unchanged.
For a given pair of adjacent rows (intervals) the value of χ2 is calculated using
the ‘observed’ and ‘expected’ frequency values for each combination of class and
row. For this example there are three classes so there are six such combinations.
In each case, the observed frequency value, denoted by O, is the frequency that
actually occurred. The expected value E is the frequency value that would be
expected to occur by chance given the assumption of independence.
If the row is i and the class is j, then let the total number of instances in
row i be rowsum i and let the total number of occurrences of class j be colsum j .
Let the grand total number of instances for the two rows combined be sum.
Assuming the hypothesis that the class is independent of which of the two
rows an instance belongs to is true, we can calculate the expected number of
instances in row i for class j as follows. There are a total of colsum j occurrences
of class j in the two intervals combined, so class j occurs a proportion of
colsum j /sum of the time. As there are rowsum i instances in row i, we would
expect rowsum i × colsum j /sum occurrences of class j in row i.
To calculate this value for any combination of row and class, we just have
to take the product of the corresponding row sum and column sum divided by
the grand total of the observed values for the two rows.
For the adjacent intervals labelled 8.7 and 12.1 in Figure 8.8 the six values
of O and E (one pair of values for each class/row combination) are given in
Figure 8.10(b).
Figure 8.10(b) Observed and Expected Values for Two Adjacent Rows of
Figure 8.8
The O values are taken from Figure 8.8 or Figure 8.10(a). The E values are
calculated from the row and column sums. Thus for row 8.7 and class c1, the
expected value E is 13 × 7/19 = 4.79.
Having calculated the value of O and E for all six combinations of class and
row, the next step is to calculate the value of (O − E)2 /E for each of the six
combinations. These are shown in the Val columns in Figure 8.11.
The value of χ2 is then the sum of the six values of (O − E)2 /E. For the
pair of rows shown in Figure 8.11 the value of χ2 is 1.89.
If the independence hypothesis is correct the observed and expected values
O and E would ideally be the same and χ2 would be zero. A small value of
χ2 would also support the hypothesis, but the larger the value of χ2 the more
reason there is to suspect that the hypothesis may be false. When χ2 exceeds
the threshold value we consider that it is so unlikely for this to have occurred
by chance that the hypothesis is rejected.
The value of χ2 is calculated for each adjacent pair of rows (intervals). When
doing this, a small but important technical detail is that an adjustment has to
be made to the calculation for any value of E less than 0.5. In this case the
denominator in the calculation of (O − E)2 /E is changed to 0.5.
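The calculation for one pair of adjacent intervals can be sketched as follows; this is only an illustration, and the observed frequencies in the example call are made up rather than taken from Figure 8.8.

def chi_squared_pair(row_a, row_b):
    # row_a, row_b: observed class frequencies for two adjacent intervals.
    col_sums = [a + b for a, b in zip(row_a, row_b)]
    grand_total = sum(col_sums)
    chi2 = 0.0
    for row in (row_a, row_b):
        row_sum = sum(row)
        for O, col_sum in zip(row, col_sums):
            E = row_sum * col_sum / grand_total
            chi2 += (O - E) ** 2 / max(E, 0.5)   # denominator never less than 0.5
    return chi2

# hypothetical frequencies for classes c1, c2, c3 in two adjacent intervals
print(round(chi_squared_pair([4, 6, 3], [3, 2, 1]), 2))
# about 0.65 -- well below the 4.61 threshold, so the intervals would be merged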
The results for the initial frequency table are summarised in Figure 8.12(a).
In each case, the χ2 value given in a row is the value for the pair of adjacent
intervals comprising that row and the one below. No χ2 value is calculated for
the final interval, because there is not one below it. As the table has 11 intervals
there are 10 χ2 values.
The χ2 values are now calculated for the revised frequency table. Note that
the only values that can have changed from those previously calculated are
those for the two pairs of adjacent intervals of which the newly merged interval
(1.4) is one. These values are shown in bold in Figure 8.12(c).
Now the smallest value of χ2 is 1.20, which again is below the threshold
value of 4.61. So intervals 87.1 and 89.0 are merged.
ChiMerge proceeds iteratively in this way, merging two intervals at each
stage until a minimum χ2 value is reached which is greater than the threshold,
indicating that an irreducible set of intervals has been reached. The final table
is shown as Figure 8.12(d).
The χ2 values for the two remaining pairs of intervals are greater than
the threshold value. Hence no further merging of intervals is possible and the
discretisation is complete. Continuous attribute A can be replaced by a categorical attribute with just three values, corresponding to the three ranges that remain at the 90% significance level.
Threshold values for the χ2 test can be found in statistical tables. The value
depends on two factors:
1. The significance level. 90% is a commonly used significance level. Other
commonly used levels are 95% and 99%. The higher the significance level,
the higher the threshold value and the more likely it is that the hypothesis
of independence will be supported and thus that the adjacent intervals will
be merged.
2. The number of degrees of freedom of the contingency table. A full expla-
nation of this is outside the scope of this book, but the general idea is as
follows. If we have a contingency table such as Figure 8.10(a) with 2 rows
and 3 columns, how many of the 2 × 3 = 6 cells in the main body of the
table can we fill independently given the marginal totals (row and column
sums)? The answer to this is just 2. If we put two numbers in the c1 and
c2 columns of the first row (A = 8.7), the value in the c3 column of that
row is determined by the row sum value. Once all three values in the first
row are fixed, those in the second row (A = 12.1) are determined by the
three column sum values.
In the general case of a contingency table with N rows and M columns the
number of independent values in the main body of the table is (N −1)×(M −1).
For the ChiMerge algorithm the number of rows is always two and the number
of columns is the same as the number of classes, so the number of degrees of
freedom is (2 − 1) × (number of classes − 1) = number of classes − 1, which in
this example is 2. The larger the number of degrees of freedom is, the higher
the threshold value.
For 2 degrees of freedom and a 90% significance level, the χ2 threshold value
is 4.61. Some other values are given in Figure 8.13 below.
Choosing a higher significance level will increase the threshold value and
thus may make the merging process continue for longer, resulting in categorical
attributes with fewer and fewer intervals.
A problem with the ChiMerge algorithm is that the result may be a large
number of intervals or, at the other extreme, just one interval. For a large
training set an attribute may have many thousands of distinct values and the
method may produce a categorical attribute with hundreds or even thousands
of values. This is likely to be of little or no practical value. On the other hand,
if the intervals are eventually merged into just one that would suggest that
the attribute value is independent of the classification and the attribute would
best be deleted. Both a large and a small number of intervals can simply reflect
setting the significance level too low or too high.
Kerber [1] proposed setting two values, minIntervals and maxIntervals. This
form of the algorithm always merges the pair of intervals with the lowest value of
χ2 as long as the number of intervals is more than maxIntervals. After that the
pair of intervals with the smallest value of χ2 is merged at each stage until either
a χ2 value is reached that is greater than the threshold value or the number of
intervals is reduced to minIntervals. In either of those cases the algorithm stops.
Although this is difficult to justify in terms of the statistical theory behind
the χ2 test it can be very useful in practice to give a manageable number
of categorical values. Reasonable settings for minIntervals and maxIntervals
might be 2 or 3 and 20, respectively.
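Putting the pieces together, the overall merging loop with minIntervals and maxIntervals might be sketched as below, reusing the chi_squared_pair function from the earlier sketch. This is an illustration of the idea rather than a faithful reimplementation of Kerber's ChiMerge.

def chimerge(intervals, threshold, min_intervals=2, max_intervals=20):
    # intervals: list of (lower_value, class_frequencies) pairs in ascending order.
    # While there are more than max_intervals, always merge the pair of adjacent
    # intervals with the smallest chi-square value; after that, keep merging the
    # smallest pair until its chi-square value exceeds the threshold or only
    # min_intervals intervals remain.
    while len(intervals) > min_intervals:
        chi2_values = [chi_squared_pair(intervals[i][1], intervals[i + 1][1])
                       for i in range(len(intervals) - 1)]
        i = min(range(len(chi2_values)), key=chi2_values.__getitem__)
        if chi2_values[i] > threshold and len(intervals) <= max_intervals:
            break
        merged = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
        intervals[i] = (intervals[i][0], merged)     # merge interval i+1 into i
        del intervals[i + 1]
    return intervals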
The ChiMerge algorithm works quite well in practice despite some theoretical
problems relating to the statistical technique used, which will not be discussed
here (Kerber’s paper [1] gives further details). A serious weakness is that the
method discretises each attribute independently of the values of the others,
even though the classifications are clearly not determined by the values of just
a single attribute.
Sorting the values of each continuous attribute into order can be a significant
processing overhead for a large dataset. However this is likely to be an overhead
for any method of discretisation, not just ChiMerge. In the case of ChiMerge
it needs to be performed only once for each continuous attribute.
For convenience information gain will be used for attribute selection through-
out.
Seven datasets are used for the experiment, all taken from the UCI Repos-
itory. Basic information about each dataset is given in Figure 8.15.
Reference
[1] Kerber, R. (1992). ChiMerge: discretization of numeric attributes. In Pro-
ceedings of the 10th national conference on artificial intelligence (pp. 123–
128). Menlo Park: AAAI Press.
9 Avoiding Overfitting of Decision Trees
A smaller tree, with fewer branches, may be able to predict the correct classification more accurately for unseen data — a case of ‘less means more’.
We will start by looking at a topic that at first sight is unrelated to the
subject of this chapter, but will turn out to be important: how to deal with
inconsistencies in a training set.
Consider how the TDIDT algorithm will perform when there is a clash in the
training set. The method will still produce a decision tree but (at least) one of
the branches will grow to its greatest possible length (i.e. one term for each of
the possible attributes), with the instances at the lowest node having more than
one classification. The algorithm would like to choose another attribute to split
on at that node but there are no ‘unused’ attributes and it is not permitted
to choose the same attribute twice in the same branch. When this happens we
will call the set of instances represented by the lowest node of the branch the
clash set.
A typical clash set might have one instance with classification true and one
with classification false. In a more extreme case there may be several possible
classifications and several instances with each classification in the clash set, e.g.
for an object recognition example there might be three instances classified as
house, two as tree and two as lorry.
Figure 9.1 shows an example of a decision tree generated from a training
set with three attributes x, y and z, each with possible values 1 and 2, and
three classifications c1, c2 and c3. The node in the bottom row labelled ‘mixed’
represents a clash set, i.e. there are instances with more than one of the three
possible classifications, but no more attributes to split on.
There are many possible ways of dealing with clashes but the two principal
ones are:
(a) The ‘delete branch’ strategy: discard the branch to the node from the
node above. This is similar to removing the instances in the clash set from
the training set (but not necessarily equivalent to it, as the order in which the
attributes were selected might then have been different).
Applying this strategy to Figure 9.1 gives Figure 9.2. Note that this tree
will be unable to classify unseen instances for which x = 1, y = 1 and z = 2,
as previously discussed in Section 6.7.
Figure 9.2 Decision Tree Generated from Figure 9.1 by ‘Delete Branch’ Strat-
egy
(b) The ‘majority voting’ strategy: label the node with the most common
classification of the instances in the clash set. This is similar to changing the
classification of some of the instances in the training set (but again not neces-
sarily equivalent, as the order in which the attributes were selected might then
have been different).
Applying this strategy to Figure 9.1 gives Figure 9.3, assuming that the
most common classification of the instances in the clash set is c3.
The decision on which of these strategies to use varies from one situation
to another. If there were, say, 99 instances classified as yes and one instance
classified as no in the training set, we would probably assume that the no
was a misclassification and use method (b). If the distribution in a weather
forecasting application were 4 rain, 5 snow and 3 fog, we might prefer to discard
the instances in the clash set altogether and accept that we are unable to make
a prediction for that combination of attribute values.
A middle approach between the ‘delete branch’ and the ‘majority voting’
strategies is to use a clash threshold. The clash threshold is a percentage from
0 to 100 inclusive.
The ‘clash threshold’ strategy is to assign all the instances in a clash set
to the most commonly occurring class for those instances provided that the
proportion of instances in the clash set with that classification is at least equal
to the clash threshold. If it is not, the instances in the clash set (and the
corresponding branch) are discarded altogether.
Figure 9.3 Decision Tree Generated from Figure 9.1 by ‘Majority Voting’
Strategy
Setting the clash threshold to zero gives the effect of always assigning to the
most common class, i.e. the ‘majority voting’ strategy. Setting the threshold
to 100 gives the effect of never assigning to the most common class, i.e. the
‘delete branch’ strategy.
Clash threshold values between 0 and 100 give a middle position between
these extremes. Reasonable percentage values to use might be 60, 70, 80 or 90.
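The decision made for a single clash set can be expressed in a few lines; this sketch is only illustrative and the function name is invented.

from collections import Counter

def resolve_clash(clash_set_classes, clash_threshold):
    # Return the class to assign to the clash set, or None to delete the branch.
    # clash_threshold is a percentage from 0 to 100.
    counts = Counter(clash_set_classes)
    most_common_class, frequency = counts.most_common(1)[0]
    if 100 * frequency / len(clash_set_classes) >= clash_threshold:
        return most_common_class       # 'majority voting' outcome
    return None                        # 'delete branch' outcome

print(resolve_clash(['yes'] * 99 + ['no'], 70))                       # yes
print(resolve_clash(['rain'] * 4 + ['snow'] * 5 + ['fog'] * 3, 70))   # None

With a threshold of 0 the most common class is always assigned; with a threshold of 100 the branch is always deleted, since a genuine clash set never consists entirely of one class.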
Figure 9.4 shows the result of using different clash thresholds for the same
dataset. The dataset used is the crx ‘credit checking’ dataset modified by delet-
ing all the continuous attributes to ensure that clashes will occur. The modified
training set does not satisfy the adequacy condition.
The results were all generated using TDIDT with attributes selected using
information gain in ‘train and test’ mode.
Figure 9.4 Results for crx (Modified) With Varying Clash Thresholds
From the results given it is clear that when there are clashes in the training
data it is no longer possible to obtain a decision tree that gives 100% predictive
accuracy on the training set from which it was generated.
The ‘delete branch’ option (threshold = 100%) avoids making any errors
but leaves many of the instances unclassified. The ‘majority voting’ strategy
(threshold = 0%) avoids leaving instances unclassified but gives many classi-
fication errors. The results for threshold values 60%, 70%, 80% and 90% lie
between these two extremes. However, the predictive accuracy for the train-
ing data is of no importance — we already know the classifications! It is the
accuracy for the test data that matters.
In this case the results for the test data are very much in line with those
for the training data: reducing the threshold value increases the number of
correctly classified instances but it also increases the number of incorrectly
classified instances and the number of unclassified instances falls accordingly.
If we use the ‘default classification strategy’ and automatically allocate each
unclassified instance to the largest class in the original training set, the picture
changes considerably.
Figure 9.5 Results for crx (Modified) With Varying Clash Thresholds (Using
Default to Largest Class)
Figure 9.5 shows the results given in Figure 9.4 modified so that for the test
data any unclassified instances are automatically assigned to the largest class.
The highest predictive accuracy is given for clash thresholds 70% and 80% in
this case.
Having established the basic method of dealing with clashes in a training
set, we now turn back to the main subject of this chapter: the problem of
avoiding the overfitting of decision trees to data.
Pre- and post-pruning are both methods to increase the generality of deci-
sion trees.
Figure 9.7 shows the results with a maximum depth cutoff of 3, 4 or unlimited (i.e. unpruned) instead. The ‘majority voting’ strategy is used throughout.
The results obtained clearly show that the choice of pre-pruning method is
important. However, it is essentially ad hoc. No choice of size or depth cutoff
consistently produces good results across all the datasets.
This result reinforces the comment by Quinlan [1] that the problem with
pre-pruning is that the ‘stopping threshold’ is “not easy to get right — too
high a threshold can terminate division before the benefits of subsequent splits
become evident, while too low a value results in little simplification”. It would
be highly desirable to find a more principled choice of cutoff criterion to use
with pre-pruning than the size and maximum depth approaches used previously,
and if possible one which can be applied completely automatically without the
need for the user to select any cutoff threshold value. A number of possible
ways of doing this have been proposed, but in practice the use of post-pruning,
to which we now turn, has proved more popular.
For any node J we can consider the proportion of the unseen instances to which the corresponding rule applies that are incorrectly classified. We call this the error rate at node J (a proportion from 0 to 1 inclusive).
If we imagine the branch from the root node A to an internal node such as
G were to terminate there, rather than being split two ways to form the two
branches A to J and A to K, this branch would correspond to an incomplete
rule of the kind discussed in Section 9.3 on pre-pruning. We will assume that
the unseen instances to which a truncated rule of this kind applies are classified
using the ‘majority voting’ strategy of Section 9.1.1, i.e. they are all allocated
to the class to which the largest number of the instances in the training set
corresponding to that node belong.
When post-pruning a decision tree such as Figure 9.8 we look for non-leaf
nodes in the tree that have a descendant subtree of depth one (i.e. all the
nodes one level down are leaf nodes). All such subtrees are candidates for post-
pruning. If a pruning condition (which will be described below) is met the
subtree hanging from the node can be replaced by the node itself. We work
from the bottom of the tree upwards and prune one subtree at a time. The
method continues until no more subtrees can be pruned.
For Figure 9.8 the only candidates for pruning are the subtrees hanging
from nodes G and D.
Working from the bottom of the tree upwards we start by considering the
replacement of the subtree ‘hanging from’ node G by G itself, as a leaf node in
a pruned tree. How does the error rate of the branch (truncated rule) ending
at G compare with the error rate of the two branches (complete rules) ending
at J and K? Is it beneficial or harmful to the predictive accuracy of the tree to
split at node G? We might consider truncating the branch earlier, say at node
F . Would that be beneficial or harmful?
To answer questions such as these we need some way of estimating the error
rate at any node of a tree. One way to do this is to use the tree to classify the
instances in some set of previously unseen data called a pruning set and count
the errors. Note that it is imperative that the pruning set is additional to the
‘unseen test set’ used elsewhere in this book. The test set must not be used
for pruning purposes. Using a pruning set is a reasonable approach but may
be unrealistic when the amount of data available is small. An alternative that
takes a lot less execution time is to use a formula to estimate the error rate.
Such a formula is likely to be probability-based and to make use of factors such
as the number of instances corresponding to the node that belong to each of
the classes and the prior probability of each class.
Figure 9.9 shows the estimated error rates at each of the nodes in Figure 9.8
using a (fictitious) formula.
Node Estimated
error rate
A 0.3
B 0.15
C 0.25
D 0.19
E 0.1
F 0.129
G 0.12
H 0.05
I 0.2
J 0.2
K 0.1
L 0.2
M 0.1
Using Figure 9.9 we see that the estimated error rates at nodes J and K are
0.2 and 0.1, respectively. These two nodes correspond to 8 and 12 instances,
respectively (of the 20 at node G).
To estimate the error rate of the subtree hanging from node G (Figure 9.10)
we take the weighted average of the estimated error rates at J and K. This
value is (8/20) × 0.2 + (12/20) × 0.1 = 0.14. We will call this the backed-up
estimate of the error rate at node G because it is computed from the estimated
error rates of the nodes below it.
We now need to compare this value with the value obtained from Figure 9.9,
i.e. 0.12, which we will call the static estimate of the error rate at that
node.2
In the case of node G the static value is less than the backed-up value. This
means that splitting at node G increases the error rate at that node, which is
obviously counter-productive. We prune the subtree descending from node G
to give Figure 9.11.
The candidates for pruning are now the subtrees descending from nodes F
and D. (Node G is now a leaf node of the partly pruned tree.)
We can now consider whether or not it is beneficial to split at node F
(Figure 9.12). The static error rates at nodes G, H and I are 0.12, 0.05 and
0.2. Hence the backed-up error rate at node F is (20/50) × 0.12 + (10/50) ×
0.05 + (20/50) × 0.2 = 0.138.
The static error rate at node F is 0.129, which is smaller than the backed-up
value, so we again prune the tree, giving Figure 9.13.
The candidates for pruning are now the subtrees hanging from nodes B and
D. We will consider whether to prune at node B (Figure 9.14).
The static error rates at nodes E and F are 0.1 and 0.129, respectively, so
the backed-up error rate at node B is (10/60) × 0.1 + (50/60) × 0.129 = 0.124.
This is less than the static error rate at node B, which is 0.15. Splitting at
node B reduces the error rate, so we do not prune the subtree.
We next need to consider pruning at node D (Figure 9.15). The static error
rates at nodes L and M are 0.2 and 0.1, respectively, so the backed-up error
rate at node D is (7/10) × 0.2 + (3/10) × 0.1 = 0.17. This is less than the static
error rate at node D, which is 0.19, so we do not prune the subtree. There are
no further subtrees to consider. The final post-pruned tree is Figure 9.13.
1 In Figure 9.10 and similar figures, the two figures in parentheses at each node
give the number of instances in the training set corresponding to that node (as in
Figure 9.8) and the estimated error rate at the node, as given in Figure 9.9.
2 From now on, for simplicity, we will generally refer to the ‘backed-up’ error rate
and the ‘static’ error rate at a node, without using the word ‘estimated’ every
time. However, it is important to bear in mind that they are only estimates, not
the accurate values, which we have no way of knowing.
In an extreme case this method could lead to a decision tree being post-
pruned right up to its root node, indicating that using the tree is likely to lead
to a higher error rate, i.e. more incorrect classifications, than simply assigning
every unseen instance to the largest class in the training data. Luckily such
poor decision trees are likely to be very rare.
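The procedure just illustrated is easy to express in code. The sketch below is a minimal illustration rather than any published implementation: each node is assumed to be a dictionary holding an instance count, a (fictitious) static error estimate and a list of children, and a subtree is pruned whenever its root's static estimate is no greater than the backed-up estimate (how exact ties are treated is this sketch's own choice).

def post_prune(node):
    """Prune the tree rooted at 'node' in place, working bottom-up."""
    if not node['children']:
        return                              # a leaf: nothing to do
    for child in node['children']:
        post_prune(child)
    if any(child['children'] for child in node['children']):
        return                              # not a depth-one subtree, so not a candidate
    backed_up = sum((child['count'] / node['count']) * child['static']
                    for child in node['children'])
    if node['static'] <= backed_up:         # splitting does not reduce the estimated error
        node['children'] = []               # replace the subtree by the node itself

def make(name, count, static, children=None):
    return {'name': name, 'count': count, 'static': static,
            'children': children or []}

def leaves(node):
    if not node['children']:
        return [node['name']]
    return [n for child in node['children'] for n in leaves(child)]

# The parts of the example tree whose counts are given above (nodes B to M).
tree_B = make('B', 60, 0.15, [
    make('E', 10, 0.1),
    make('F', 50, 0.129, [
        make('G', 20, 0.12, [make('J', 8, 0.2), make('K', 12, 0.1)]),
        make('H', 10, 0.05),
        make('I', 20, 0.2)])])
tree_D = make('D', 10, 0.19, [make('L', 7, 0.2), make('M', 3, 0.1)])

for subtree in (tree_B, tree_D):
    post_prune(subtree)
    print(subtree['name'], '-> leaves:', leaves(subtree))
# B -> leaves: ['E', 'F']   (the subtrees below G and then F are pruned away)
# D -> leaves: ['L', 'M']   (D itself is not pruned)

Run on these figures, the sketch reproduces the decisions made in the walkthrough above: the subtrees below G and F are removed, while B and D are left intact.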
Post-pruning decision trees would appear to be a more widely used and
accepted approach than pre-pruning them. No doubt the ready availability
and popularity of the C4.5 classification system [1] has had a large influence on
this. However, an important practical objection to post-pruning is that there
is a large computational overhead involved in generating a complete tree only
then to discard some or possibly most of it. This may not matter with small
experimental datasets, but ‘real-world’ datasets may contain many millions of
instances and issues of computational feasibility and scaling up of methods will
inevitably become important.
The decision tree representation of classification rules is widely used and it is
therefore desirable to find methods of pruning that work well with it. However,
the tree representation is itself a source of overfitting, as will be demonstrated
in Chapter 11.
The chapter describes two approaches to avoiding the overfitting of decision trees: pre-pruning (restricting the growth of a tree while it is being generated) and post-pruning (generating a tree in full and then removing parts of it). Results are
given for pre-pruning using either a size or a maximum depth cutoff. A method
of post-pruning a decision tree based on comparing the static and backed-up
estimated error rates at each node is also described.
References
[1] Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo:
Morgan Kaufmann.
[2] Esposito, F., Malerba, D., & Semeraro, G. (1997). A comparative analysis of
methods for pruning decision trees. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 19 (5), 476–491.
10 More About Entropy
10.1 Introduction
In this chapter we return to the subject of the entropy of a training set, which
was introduced in Chapter 5. The idea of entropy is not only used in data
mining; it is a very fundamental one, which is widely used in Information
Theory as the basis for calculating efficient ways of representing messages for
transmission by telecommunication systems.
We will start by explaining what is meant by the entropy of a set of distinct
values and then come back to look again at the entropy of a training set.
Suppose we are playing a game of the ‘twenty questions’ variety where we try
to identify one of M possible values by asking a series of yes/no questions. The
values in which we are really interested are mutually exclusive classifications of
the kind discussed in Chapter 3 and elsewhere, but the same argument can be
applied to any set of distinct values.
We will assume at present that all M values are equally likely and for
reasons that will soon become apparent we will also assume that M is an exact
power of 2, say 2^N, where N ≥ 1.
As a concrete example we will take the task of identifying an unknown
capital city from the eight possibilities: London, Paris, Berlin, Warsaw, Sofia,
Rome, Athens and Moscow (here M = 8 = 2^3).
There are many possible ways of asking questions, for example random
guessing:
Is it Warsaw? No
Is it Berlin? No
Is it Rome? Yes
This works well if the questioner makes a lucky guess early on, but (un-
surprisingly) it is inefficient in the general case. To show this, imagine that we
make our guesses in the fixed order: London, Paris, Berlin etc. until we guess
the correct answer. We never need guess further than Athens, as a ‘no’ answer
will tell us the city must be Moscow.
If the city is London, we need 1 question to find it.
If the city is Paris, we need 2 questions to find it.
If the city is Berlin, we need 3 questions to find it.
If the city is Warsaw, we need 4 questions to find it.
If the city is Sofia, we need 5 questions to find it.
If the city is Rome, we need 6 questions to find it.
If the city is Athens, we need 7 questions to find it.
If the city is Moscow, we need 7 questions to find it.
Each of these possibilities is equally likely, i.e. has probability 1/8, so on
average we need (1 + 2 + 3 + 4 + 5 + 6 + 7 + 7)/8 questions, i.e. 35/8 = 4.375
questions.
A little experiment will soon show that the best strategy is to keep dividing
the possibilities into equal halves. Thus we might ask
Is it London, Paris, Athens or Moscow? No
Is it Berlin or Warsaw? Yes
Is it Berlin?
Whether the third question is answered yes or no, the answer will tell us
the identity of the ‘unknown’ city.
The halving strategy always takes three questions to identify the unknown
city. It is considered to be the ‘best’ strategy not because it invariably gives
us the answer with the smallest number of questions (random guessing will
occasionally do better) but because if we conduct a long series of ‘trials’ (each
a game to guess a city, selected at random each time) the halving strategy will
invariably find the answer and will do so with a smaller number of questions on
average than any other strategy. With this understanding we can say that the
smallest number of yes/no questions needed to determine an unknown value
from 8 equally likely possibilities is three.
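The two strategies are easy to compare numerically. The short sketch below (an illustration only) reproduces the average of 4.375 questions for fixed-order guessing and the three questions needed by the halving strategy.

import math

cities = ['London', 'Paris', 'Berlin', 'Warsaw',
          'Sofia', 'Rome', 'Athens', 'Moscow']

# Fixed-order guessing: the i-th city needs i+1 questions, except that the
# last city is identified by the 'no' answer to the penultimate question.
questions = [min(i + 1, len(cities) - 1) for i in range(len(cities))]
print(sum(questions) / len(cities))    # 4.375 questions on average

# Halving strategy: every city is found with exactly log2(8) = 3 questions.
print(math.log2(len(cities)))          # 3.0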
It is no coincidence that 8 is 2^3 and the smallest number of yes/no questions
needed is 3. If we make the number of possible values M a higher or lower
power of 2, say 2^N, the same relationship holds: if we start with 8 possibilities,
the first question leaves 4, the second leaves 2 and the third identifies the value,
and in general N = log2 M well-chosen questions suffice. This smallest number of
questions, log2 M, is the entropy of a set of M distinct, equally likely values.
Some values are shown below.
M log2 M
2 1
4 2
8 3
16 4
32 5
64 6
128 7
256 8
512 9
1024 10
In the phrase ‘the smallest number of yes/no questions needed’ in the defi-
nition of entropy, it is implicit that each question needs to divide the remaining
possibilities into two equally probable halves. If they do not, for example with
random guessing, a larger number will be needed.
It is not sufficient that each question looked at in isolation is a ‘halving
question’. For example, consider the sequence
Is it Berlin, London, Paris or Warsaw? Yes
Is it Berlin, London, Paris or Sofia? Yes
Both questions are ‘halving questions’ in their own right, but the answers
leave us after two questions still having to discriminate amongst three possi-
bilities, which cannot be done with one more question.
It is not sufficient that each question asked is a halving question. It is
necessary to find a sequence of questions that take full advantage of the answers
already given to divide the remaining possibilities into two equally probable
halves. We will call this a ‘well-chosen’ sequence of questions.
So far we have established that the entropy of a set of M distinct values is
log2 M , provided that M is a power of 2 and all values are equally likely. We
have also established the need for questions to form a ‘well-chosen’ sequence.
This raises three questions:
– What if M is not a power of 2?
– What if the M possible values are not equally likely?
– Is there a systematic way of finding a sequence of well-chosen questions?
It will be easier to answer these questions if we first introduce the idea of
coding information using bits.
There are 7^6 = 117649 possible sequences of six days. The value of log2 117649
is 16.84413. This is between 16 and 17, so to determine any possible value of
a sequence of 6 days of the week would take 17 questions. The average number
of questions per day of the sequence is 17/6 = 2.8333. This is
reasonably close to log2 7, which is approximately 2.8074.
A better approximation to the entropy is obtained by taking a larger value
of k, say 21. Now log2 M^k is log2(7^21) = 58.95445, so 59 questions are needed
for a sequence of 21 days, making an average number of questions per day of
59/21 = 2.809524.
Finally, for a sequence of 1000 days (k = 1000), log2 M^k is log2(7^1000) =
2807.3549, so 2808 questions are needed for the sequence, making
an average per day of 2.808, which is very close to log2 7.
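These figures are easy to reproduce. The sketch below (illustrative only) computes, for several values of k, the smallest whole number of questions sufficient for a sequence of k days and the resulting average per day.

import math

M = 7                                    # seven equally likely days of the week
for k in (6, 21, 1000):
    questions = math.ceil(k * math.log2(M))   # smallest whole number >= log2(M^k)
    print(k, questions, questions / k)
# 6 17 2.833..., 21 59 2.809..., 1000 2808 2.808
print(math.log2(M))                      # 2.8073549..., the limiting value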
It is not a coincidence that these values appear to be converging to log2 7, as
is shown by the following argument for the general case of sequences of length
k drawn from M distinct equally likely values.
To identify a particular sequence of k values, each drawn from M equally likely possibilities, we need the smallest whole number of questions that is at least log2(M^k) = k log2 M, i.e. fewer than k log2 M + 1 questions. Dividing by k, the average number of questions per value lies between log2 M and log2 M + 1/k, which comes arbitrarily close to log2 M as k becomes large. We can therefore take log2 M as the entropy of a set of M distinct, equally likely values even when M is not a power of 2. Some values of log2 M are given below.
M log2 M
2 1
3 1.5850
4 2
5 2.3219
6 2.5850
7 2.8074
8 3
9 3.1699
10 3.3219
When the M values are not all equally likely, the entropy will
always have a lower value than log2 M. In the extreme case where only one
value ever occurs, there is no need to use even one bit to represent the value
and the entropy is zero.
We will write the frequency with which the ith of the M values occurs as
pi, where i varies from 1 to M. Then we have 0 ≤ pi ≤ 1 for all pi and the
frequencies sum to one:

Σ_{i=1}^{M} pi = 1.
For convenience we will give an example where all the pi values are the
reciprocal of an exact power of 2, i.e. 1/2, 1/4 or 1/8, but the result obtained
can be shown to apply for other values of pi using an argument similar to that
in Section 10.3.
Suppose we have four values A, B, C and D which occur with frequencies
1/2, 1/4, 1/8 and 1/8 respectively. Then M = 4, p1 = 1/2, p2 = 1/4, p3 = 1/8,
p4 = 1/8.
When representing A, B, C and D we could use the standard 2-bit encoding
described previously, i.e.
A 10
B 11
C 00
D 01
However, we can improve on this using a variable length encoding, i.e. one
where the values are not always represented by the same number of bits. There
are many possible ways of doing this. The best way turns out to be the one
shown in Figure 10.3.
A 1
B 01
C 001
D 000
Figure 10.3 Most Efficient Representation for Four Values with Frequencies
1/2, 1/4, 1/8 and 1/8
With this representation the average number of bits that have to be examined
to identify a value is 1/2 × 1 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 1.75. An equally
efficient alternative, which also uses 1, 2, 3 and 3 bits for the four values, is
A 0
B 11
C 100
D 101
Any other representation will require more bits to be examined on average.
For example we might choose
A 01
B 1
C 001
D 000
With this representation, in the average case we need to examine 1/2 × 2 +
1/4 × 1 + 1/8 × 3 + 1/8 × 3 = 2 bits (the same as the number for the fixed
length representation).
Some other representations, such as
A 101
B 0011
C 10011
D 100001
are much worse than the 2-bit representation. This one requires 1/2 × 3 + 1/4 ×
4 + 1/8 × 5 + 1/8 × 6 = 3.875 bits to be examined on average.
The key to finding the most efficient coding is to use a string of N bits to
represent a value that occurs with frequency 1/2^N. Writing this another way,
we represent a value that occurs with frequency pi by a string of log2(1/pi) bits
(see Figure 10.4).
pi log2 (1/pi )
1/2 1
1/4 2
1/8 3
1/16 4
This method of coding ensures that we can determine any value by asking
a sequence of ‘well-chosen’ yes/no questions (i.e. questions for which the two
possible answers are equally likely) about the value of each of the bits in turn.
Is the first bit 1?
If not, is the second bit 1?
There are two special cases to consider. When all the values of pi are the
same, i.e. pi = 1/M for all values of i from 1 to M , the above formula reduces
to
E = − Σ_{i=1}^{M} (1/M) log2 (1/M) = − log2 (1/M) = log2 M
which is the formula given in Section 10.3.
When there is only one value with a non-zero frequency, M = 1 and p1 = 1,
so E = −1 × log2 1 = 0.
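The connection between the frequencies, the code lengths and the entropy can be checked with a short calculation. The sketch below (an illustration only, with the codings written as bit strings) computes E for the frequencies 1/2, 1/4, 1/8 and 1/8 and the average number of bits examined for the codings discussed above.

import math

freqs = {'A': 1/2, 'B': 1/4, 'C': 1/8, 'D': 1/8}

# Entropy E = sum over all values of p_i * log2(1/p_i).
E = sum(p * math.log2(1 / p) for p in freqs.values())
print(E)                                           # 1.75

def average_bits(coding):
    """Average number of bits examined for a coding given as bit strings."""
    return sum(freqs[value] * len(bits) for value, bits in coding.items())

best  = {'A': '1',   'B': '01',   'C': '001',   'D': '000'}     # Figure 10.3
fixed = {'A': '10',  'B': '11',   'C': '00',    'D': '01'}      # 2-bit coding
worse = {'A': '101', 'B': '0011', 'C': '10011', 'D': '100001'}

print(average_bits(best), average_bits(fixed), average_bits(worse))
# 1.75 2.0 3.875 -- only the best variable length coding achieves the entropy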
When determining the classification of an instance in a training set we do not ask a series of well-chosen yes/no questions of this kind. Instead we ask a series of questions about the value of a set of attributes
measured for each of the instances in a training set, which collectively determine
the classification. Sometimes only one question is necessary, sometimes many
more.
Asking any question about the value of an attribute effectively divides the
training set into a number of subsets, one for each possible value of the at-
tribute (any empty subsets are discarded). The TDIDT algorithm described in
Chapter 4 generates a decision tree from the top down by repeatedly splitting
on the values of attributes. If the training set represented by the root node
has M possible classifications, each of the subsets corresponding to the end
nodes of each branch of the developing tree has an entropy value that varies
from log2 M (if the frequencies of each of the classifications in the subset are
identical) to zero (if the subset contains instances of only one classification).
When the splitting process has terminated, all the ‘uncertainty’ has been
removed from the tree. Each branch corresponds to a combination of attribute
values and for each branch there is a single classification, so the overall entropy
is zero.
Although it is possible for a subset created by splitting to have an entropy
greater than its ‘parent’, at every stage of the process splitting on an attribute
reduces the average entropy of the tree or at worst leaves it unchanged. This is
an important result, which is frequently assumed but seldom proved. We will
consider it in the next section.
The training set shown in Figure 10.5 has eight instances, described by two attributes X and Y, with four instances of each of the two classes A and B, so the initial entropy is
Estart = −(1/2) log2 (1/2) − (1/2) log2 (1/2) = − log2 (1/2) = log2 (2) = 1
X Y Class
1 1 A
1 2 B
2 1 A
2 2 B
3 2 A
3 1 B
4 2 A
4 1 B
Figure 10.5 Training Set for ‘Information Gain Can be Zero’ Example
For splitting on attribute X the frequency table is shown in Figure 10.6(a).
Attribute value
Class 1 2 3 4
A 1 1 1 1
B 1 1 1 1
Total 2 2 2 2
Each column of the frequency table is balanced and it can easily be verified
that Enew = 1.
For splitting on attribute Y the frequency table is shown in Figure 10.6(b).
Attribute value
Class 1 2
A 2 2
B 2 2
Total 4 4
Again both columns are balanced and Enew = 1. Whichever value is taken,
Enew is 1 and so the Information Gain = Estart − Enew = 0.
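The calculation can be confirmed directly from the data of Figure 10.5. The sketch below (illustrative only) computes the information gain for splitting on X and on Y.

import math
from collections import Counter

# The training set of Figure 10.5 as (X, Y, class) triples.
data = [(1, 1, 'A'), (1, 2, 'B'), (2, 1, 'A'), (2, 2, 'B'),
        (3, 2, 'A'), (3, 1, 'B'), (4, 2, 'A'), (4, 1, 'B')]

def entropy(classes):
    total = len(classes)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(classes).values())

def information_gain(data, attribute_index):
    e_start = entropy([row[-1] for row in data])
    e_new = 0.0
    for value in {row[attribute_index] for row in data}:
        subset = [row[-1] for row in data if row[attribute_index] == value]
        e_new += (len(subset) / len(data)) * entropy(subset)
    return e_start - e_new

print(information_gain(data, 0), information_gain(data, 1))   # 0.0 0.0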
The absence of information gain does not imply that there is no value in
splitting on either of the attributes. Whichever one is chosen, splitting on the
other attribute for all the resulting branches will produce a final decision tree
with each branch terminated by a leaf node and thus having an entropy of zero.
Although we have shown that Information Gain can sometimes be zero, it
can never be negative. Intuitively it would seem wrong for it to be possible to
lose information by splitting on an attribute. Surely that can only give more
information (or occasionally the same amount)?
The result that Information Gain can never be negative is stated by many
authors and implied by others. The name Information Gain gives a strong
suggestion that information loss would not be possible, but that is far from
being a formal proof.
The present author’s inability to locate a proof of this crucial result led him
to issue a challenge to several British academics to find a proof in the technical
literature or generate one themselves. An excellent response to this came from
two members of the University of Ulster in Northern Ireland who produced a
detailed proof of their own [1]. The proof is too difficult to reproduce here but
is well worth obtaining and studying in detail.
1. Calculate the value of information gain for each attribute in the original
dataset.
2. Discard all attributes that do not meet a specified criterion.
3. Pass the revised dataset to the preferred classification algorithm.
The method of calculating information gain for categorical attributes using
frequency tables was described in Chapter 6. A modification that enables the
method to be used for continuous attributes by examining alternative ways of
splitting the attribute values into two parts was described in Chapter 8. The
latter also returns a ‘split value’, i.e. the value of the attribute that gives the
largest information gain. This value is not needed when information gain is
used for feature reduction. It is sufficient to know the largest information gain
achievable for the attribute with any split value.
There are many possible criteria that can be used for determining which
attributes to retain, for example:
– Only retain the best 20 attributes
– Only retain the best 25% of the attributes
– Only retain attributes with an information gain that is at least 25% of the
highest information gain of any attribute
– Only retain attributes that reduce the initial entropy of the dataset by at
least 10%.
There is no one choice that is best in all situations, but analysing the infor-
mation gain values of all the attributes can help make an informed choice.
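A feature reduction step of this kind takes only a few lines of code. The sketch below is a minimal illustration under its own assumptions (the function names, the dictionary representation of instances and the tiny made-up dataset are not from the book): it keeps the attributes whose information gain is at least a given proportion of the largest gain found.

import math
from collections import Counter

def entropy(classes):
    total = len(classes)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(classes).values())

def info_gain(rows, attribute):
    """Information gain of splitting 'rows' (a list of dicts) on 'attribute'."""
    e_start = entropy([r['class'] for r in rows])
    e_new = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r['class'] for r in rows if r[attribute] == value]
        e_new += (len(subset) / len(rows)) * entropy(subset)
    return e_start - e_new

def reduce_features(rows, attributes, proportion=0.25):
    """Keep attributes whose gain is at least 'proportion' of the largest gain."""
    gains = {a: info_gain(rows, a) for a in attributes}
    threshold = proportion * max(gains.values())
    return [a for a in attributes if gains[a] >= threshold]

# A tiny made-up dataset, purely to exercise the functions.
rows = [{'outlook': 'sunny', 'windy': 'yes', 'class': 'play'},
        {'outlook': 'sunny', 'windy': 'no',  'class': 'play'},
        {'outlook': 'rain',  'windy': 'yes', 'class': 'stay'},
        {'outlook': 'rain',  'windy': 'no',  'class': 'stay'}]
print(reduce_features(rows, ['outlook', 'windy']))     # ['outlook']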
The genetics dataset contains 3190 instances. Each instance comprises the
values of a sequence of 60 DNA elements and is classified into one of three
possible categories: EI, IE and N . Each of the 60 attributes (named A0
to A59) is categorical and has 8 possible values: A, T , G, C, N , D, S and R.
Figure 10.8 genetics Dataset: Information Gain for Some of the Attributes
The largest information gain is for A29. A gain of 0.3896 implies that the
initial entropy would be reduced by more than a quarter if the value of A29
were known. The second largest information gain is for attribute A28.
Comparing values written as decimals to four decimal places is awkward
(for people). It is probably easier to make sense of this table if it is adjusted by
dividing all the information gain values by 0.3896 (the largest value), making
a proportion from 0 to 1, and then multiplying them all by 100. The resulting
values are given in Figure 10.9. An adjusted information gain of 1.60 for at-
tribute A0 means that the information gain for A0 is 1.60% of the size of the
largest value, which was the one obtained for A29.
From this table it is clear that not only is the information gain for A29 the
largest, it is considerably larger than most of the other values. Only a small
number of other information gain values are even 50% as large.
The next example makes use of a much larger dataset. The dataset bcst96 has
been used for experiments on automatic classification of web pages. Some basic
information about it is given in Figure 10.11.
The bcst96 dataset comprises 1186 instances (training set) and a further
509 instances (test set). Each instance corresponds to a web page, which
is classified into one of two possible categories, B or C, using the values of
13,430 attributes, all continuous.
There are 1,749 attributes that each have only a single value for the in-
stances in the training set and so can be deleted, leaving 11,681 continuous
attributes.
In this case the original number of attributes is more than 11 times as large
as the number of instances in the training set. It seems highly likely that a
large number of the attributes could safely be deleted, but which ones?
The initial value of entropy is 0.996, indicating that the two classes are
fairly equally balanced.
As can be seen in Figure 10.11, having deleted the attributes that have a
single value for all instances in the training set, there are 11,681 continuous
attributes remaining.
Next we calculate the information gain for each of these 11,681 attributes.
The largest value is 0.381.
The frequency table is shown in Figure 10.12.
The most surprising result is that as many as 11,135 of the attributes
(95.33%) have an information gain in the 5 bin, i.e. no more than 5% of the
largest information gain available. Almost 99% of the values are in the 5 and
10 bins.
Using TDIDT with the entropy attribute selection criterion for classifica-
tion, the algorithm generates 38 rules from the original training set and uses
these to predict the classification of the 509 instances in the test set. It does this
with 94.9% accuracy (483 correct and 26 incorrect predictions). If we discard
all but the best 50 attributes, the same algorithm generates a set of 62 rules,
which again give 94.9% predictive accuracy on the test set (483 correct and 26
incorrect predictions).
In this case just 50 out of 11,681 attributes (less than 0.5%) suffice to
give the same predictive accuracy as the whole set of attributes. However, the
difference in the amount of processing required to produce the two decision
trees is considerable. With all the attributes the TDIDT algorithm will need to
examine approximately 1, 186 × 11, 681 = 13, 853, 666 attribute values at each
node of the evolving decision tree. If only the best 50 attributes are used the
number drops to just 1, 186 × 50 = 59, 300.
Although feature reduction cannot always be guaranteed to produce re-
sults as good as those in these two examples, it should always be considered,
especially when the number of attributes is large.
References
[1] McSherry, D., & Stretch, C. (2003). Information gain (University of Ulster
Technical Note).
[2] Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training
knowledge-based neural networks to recognize genes in DNA sequences.
In Advances in neural information processing systems (Vol. 3). San Mateo:
Morgan Kaufmann.
11 Inducing Modular Rules for Classification
Because the rules are derived directly from a decision tree (Figure 11.1), for any unseen instance
only one rule (at most) can ever apply. The five rules corresponding to Figure
11.1 are as follows (in arbitrary order):
IF SoftEng = A AND Project = B AND
ARIN = A AND CSA = A THEN Class = FIRST
IF SoftEng = A AND Project = A THEN Class = FIRST
IF SoftEng = A AND Project = B AND ARIN = A AND
CSA = B THEN Class = SECOND
IF SoftEng = A AND Project = B AND ARIN = B THEN
Class = SECOND
IF SoftEng = B THEN Class = SECOND
We now examine each of the rules in turn to consider whether removing each
of its terms increases or reduces its predictive accuracy. Thus for the first rule
given above we consider the four terms ‘SoftEng = A’, ‘Project = B’, ‘ARIN
= A’ and ‘CSA = A’. We need some way of estimating whether removing each
of these terms singly would increase or decrease the accuracy of the resulting
rule set. Assuming we have such a method, we remove the term that gives the
largest increase in predictive accuracy, say ‘Project = B’. We then consider the
removal of each of the other three terms. The processing of a rule ends when
removing any of the terms would reduce (or leave unchanged) the predictive
accuracy. We then go on to the next rule.
This description relies on there being some means of estimating the effect
on the predictive accuracy of a ruleset of removing a single term from one of
the rules. We may be able to use a probability-based formula to do this or
we can simply use the original and revised rulesets to classify the instances
in an unseen pruning set and compare the results. (Note that it would be
methodologically unsound to improve the ruleset using a test set and then
examine its performance on the same instances. For this method there need
to be three sets: training, pruning and test.)
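A greedy term-removal loop of the kind just described can be sketched as follows. This is an outline under stated assumptions rather than a definitive implementation: a rule is represented as a list of (attribute, value) terms plus a classification, and ruleset_accuracy stands for whichever estimate is used, e.g. the proportion of a separate pruning set that the ruleset classifies correctly.

def prune_rule(rules, index, ruleset_accuracy):
    """Greedily remove terms from rules[index] while this increases the
    estimated accuracy of the whole ruleset. 'ruleset_accuracy' is assumed
    to be a function taking a list of rules and returning a number."""
    while True:
        terms, target = rules[index]
        if not terms:
            return
        current = ruleset_accuracy(rules)
        best_gain, best_ruleset = 0.0, None
        for i in range(len(terms)):
            shorter = terms[:i] + terms[i + 1:]
            candidate = rules[:index] + [(shorter, target)] + rules[index + 1:]
            gain = ruleset_accuracy(candidate) - current
            if gain > best_gain:             # strict: ignore removals that do not help
                best_gain, best_ruleset = gain, candidate
        if best_ruleset is None:
            return                           # no single removal increases accuracy
        rules[:] = best_ruleset              # accept the best removal and repeat

# Toy example with a made-up accuracy estimate that favours shorter rules.
rules = [([('SoftEng', 'A'), ('Project', 'A')], 'FIRST'),
         ([('SoftEng', 'B')], 'SECOND')]
prune_rule(rules, 0, lambda rs: -sum(len(terms) for terms, _ in rs))
print(rules)    # [([], 'FIRST'), ([('SoftEng', 'B')], 'SECOND')]

One removal is accepted per pass, mirroring the description above of removing the single term that gives the largest increase in accuracy and then reconsidering the remaining terms; the toy accuracy function is there purely to show the mechanics.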
Figure 11.3 Decision Tree for the degrees Dataset (revised – version 2)
The five rules corresponding to Figure 11.1 have now become the following (the first four rules have changed).
IF Project = B AND ARIN = A AND CSA = A THEN Class = FIRST
IF Project = A THEN Class = FIRST
IF Project = B AND ARIN = A AND CSA = B
THEN Class = SECOND
IF Project = B AND ARIN = B THEN Class = SECOND
IF SoftEng = B THEN Class = SECOND
We will say that a rule fires if its condition part is satisfied for a given
instance. If a set of rules fits into a tree structure there is only one rule that
can fire for any instance. In the general case of a set of rules that do not fit
into a tree structure, it is entirely possible for several rules to fire for a given
test instance, and for those rules to give contradictory classifications.
Suppose that for the degrees application we have an unseen instance for
which the values of SoftEng, Project, ARIN and CSA are ‘B’, ‘B’, ‘A’ and ‘A’,
respectively. Both the first and the last rules will fire. The first rule concludes
‘Class = FIRST’; the last rule concludes ‘Class = SECOND’. Which one should
we take?
The problem can be illustrated outside the context of the degrees dataset
by considering just two rules from some imaginary ruleset:
IF x = 4 THEN Class = a
IF y = 2 THEN Class = b
What should the classification be for an instance with x = 4 and y = 2?
One rule gives class a, the other class b.
We can easily extend the example with other rules such as
IF w = 9 and k = 5 THEN Class = b
IF x = 4 THEN Class = a
IF y = 2 THEN Class = b
IF z = 6 and m = 47 THEN Class = b
What should the classification be for an instance with w = 9, k = 5, x = 4,
y = 2, z = 6 and m = 47? One rule gives class a, the other three rules give
class b.
We need a method of choosing just one classification to give to the unseen
instance. This method is known as a conflict resolution strategy. There are
various strategies we can use, including:
– ‘majority voting’ (e.g. there are three rules predicting class b and only one
predicting class a, so choose class b)
– giving priority to certain types of rule or classification (e.g. rules with a
small number of terms or predicting a rare classification might have a higher
weighting than other rules in the voting)
– using a measure of the ‘interestingness’ of each rule (of the kind that will be
discussed in Chapter 16), give priority to the most interesting rule.
It is possible to construct quite elaborate conflict resolution strategies but
most of them have the same drawback: they require the condition part of all
the rules to be tested for each unseen instance, so that all the rules that fire are
known before the strategy is applied. By contrast, we need only work through
the rules generated from a decision tree until the first one fires (as we know no
others can).
A very basic but widely used conflict resolution strategy is to work through
the rules in order and to take the first one that fires. This can reduce the
amount of processing required considerably, but makes the order in which the
rules are generated very important.
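The two simplest strategies mentioned above can be sketched as follows (an illustration only, with a rule represented as a dictionary of attribute conditions plus a classification).

from collections import Counter

# A rule is a (conditions, classification) pair; conditions is a dictionary
# of attribute -> required value. These four rules are the ones in the text.
rules = [({'w': 9, 'k': 5},  'b'),
         ({'x': 4},          'a'),
         ({'y': 2},          'b'),
         ({'z': 6, 'm': 47}, 'b')]

def fires(rule, instance):
    conditions, _ = rule
    return all(instance.get(a) == v for a, v in conditions.items())

def first_fires(rules, instance):
    """Work through the rules in order and take the first one that fires."""
    for rule in rules:
        if fires(rule, instance):
            return rule[1]
    return None                       # no rule fires: leave unclassified

def majority_vote(rules, instance):
    """Let every firing rule vote and take the most frequent classification."""
    votes = Counter(rule[1] for rule in rules if fires(rule, instance))
    return votes.most_common(1)[0][0] if votes else None

instance = {'w': 9, 'k': 5, 'x': 4, 'y': 2, 'z': 6, 'm': 47}
print(first_fires(rules, instance))       # b (the first rule in the list fires)
print(majority_vote(rules, instance))     # b (three votes to one)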
Whilst it is possible using a conflict resolution strategy to post-prune a
decision tree to give a set of rules that do not fit together in a tree structure,
it seems an unnecessarily indirect way of generating a set of rules. In addition
if we wish to use the ‘take the first rule that fires’ conflict resolution strategy,
the order in which the rules are extracted from the tree is likely to be of crucial
importance, whereas it ought to be arbitrary.
In Section 11.4 we will describe an algorithm that dispenses with tree gen-
eration altogether and produces rules that are ‘free standing’, i.e. do not fit
together into a tree structure, directly. We will call these modular rules.
Classifying an unseen instance using rules derived from a decision tree may require the values of attributes that make no difference to the final classification, some of which may be costly or difficult to obtain, for example the results of medical tests that carry an unusually high cost or risk to health. For many real-world
applications a method of classifying unseen instances that avoided making un-
necessary tests would be highly desirable.
For each classification (class = i) in turn and starting with the complete
training set each time:
1. Calculate the probability that class = i for each attribute/value pair.
2. Select the pair with the largest probability and create a subset of
the training set comprising all the instances with the selected at-
tribute/value combination (for all classifications).
3. Repeat 1 and 2 for this subset until a subset is reached that contains
only instances of class i. The induced rule is then the conjunction of all
the attribute/value pairs selected.
4. Remove all instances covered by this rule from the training set.
Repeat 1–4 until all instances of class i have been removed
We will illustrate the algorithm by generating rules for the lens24 dataset
(classification 1 only). The algorithm generates two classification rules for that
class.
The initial training set for lens24 comprises 24 instances, shown in Figure
11.6.
age specRx astig tears class
1 1 1 1 3
1 1 1 2 2
1 1 2 1 3
1 1 2 2 1
1 2 1 1 3
1 2 1 2 2
1 2 2 1 3
1 2 2 2 1
2 1 1 1 3
2 1 1 2 2
2 1 2 1 3
2 1 2 2 1
2 2 1 1 3
2 2 1 2 2
2 2 2 1 3
2 2 2 2 3
3 1 1 1 3
3 1 1 2 3
3 1 2 1 3
3 1 2 2 1
3 2 1 1 3
3 2 1 2 2
3 2 2 1 3
3 2 2 2 3
First Rule
Figure 11.7 shows the probability of class = 1 occurring for each at-
tribute/value pair over the whole training set (24 instances).
The maximum probability is when astig = 2 or tears = 2.
Choose astig = 2 arbitrarily.
Incomplete rule induced so far:
IF astig = 2 THEN class = 1
The subset of the training set covered by this incomplete rule is given in
Figure 11.8.
Figure 11.8 First Rule: Subset of Training Set Covered by Incomplete Rule
(Version 1)
Figure 11.9 shows the probability of each attribute/value pair (not involving
attribute astig) occurring for this subset.
The maximum probability is when tears = 2.
Incomplete rule induced so far:
IF astig = 2 and tears = 2 THEN class = 1
The subset of the training set covered by this incomplete rule is given in Figure 11.10.
Figure 11.10 First Rule: Subset of Training Set Covered by Incomplete Rule
(Version 2)
Figure 11.11 shows the probability of each attribute/value pair (not involv-
ing attributes astig or tears) occurring for this subset.
The maximum probability is when age = 1 or specRx = 1.
Choose (arbitrarily) age = 1.
Incomplete rule induced so far:
IF astig = 2 and tears = 2 and age = 1 THEN class = 1
The subset of the training set covered by this rule is given in Figure 11.12.
This subset contains only instances of class 1.
The final induced rule is therefore
IF astig = 2 and tears = 2 and age = 1 THEN class = 1
Figure 11.12 First Rule: Subset of Training Set Covered by Incomplete Rule
(Version 3)
Second Rule
Removing the two instances covered by the first rule from the training set
gives a new training set with 22 instances. This is shown in Figure 11.13.
The table of frequencies is now as given in Figure 11.14 for attribute/value
pairs corresponding to class = 1.
The maximum probability is achieved by astig = 2 and tears = 2.
Choose astig = 2 arbitrarily.
Incomplete rule induced so far:
IF astig=2 THEN class = 1
The subset of the training set covered by this rule is shown in Figure 11.15.
This gives the frequency table shown in Figure 11.16.
The maximum probability is achieved by tears = 2.
Incomplete rule induced so far:
IF astig = 2 and tears = 2 then class = 1
The subset of the training set covered by this rule is shown in Figure 11.17.
This gives the frequency table shown in Figure 11.18.
The maximum probability is for specRx = 1.
Incomplete rule induced so far:
IF astig = 2 and tears = 2 and specRx = 1 THEN class = 1
The subset of the training set covered by this rule is shown in Figure 11.19.
This subset contains only instances of class 1. So the final induced rule is:
IF astig = 2 and tears = 2 and specRx = 1 THEN class = 1
Removing the two instances covered by this rule from the current version of
the training set (which has 22 instances) gives a training set of 20 instances from
which all instances of class 1 have now been removed. So the Prism algorithm
terminates (for classification 1).
The final pair of rules induced by Prism for class 1 are:
IF astig = 2 and tears = 2 and age = 1 THEN class = 1
IF astig = 2 and tears = 2 and specRx = 1 THEN class = 1
The algorithm will now go on to generate rules for the remaining classifica-
tions. It produces 3 rules for class 2 and 4 for class 3. Note that the training
set is restored to its original state for each new class.
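The steps just illustrated can be collected into a short program. The sketch below is a minimal illustration of the basic algorithm as described above, not Cendrowska's or any published implementation: it has no clash handling, and ties between attribute/value pairs are broken by keeping the first pair found in attribute order, which happens to reproduce the arbitrary choices made in the worked example. Applied to the lens24 data of Figure 11.6 it induces the two rules for class 1 given above.

def prism_rules_for_class(instances, attributes, target):
    """Basic Prism, no clash handling: induce the rules for one class.
    'instances' is a list of dictionaries that include a 'class' key."""
    rules = []
    remaining = list(instances)
    while any(row['class'] == target for row in remaining):
        covered = list(remaining)          # instances matching the rule so far
        terms = []
        unused = list(attributes)
        while any(row['class'] != target for row in covered):
            best_pair, best_prob = None, -1.0
            for attribute in unused:
                for value in sorted({row[attribute] for row in covered}):
                    subset = [row for row in covered if row[attribute] == value]
                    prob = (sum(row['class'] == target for row in subset)
                            / len(subset))
                    if prob > best_prob:   # ties: keep the first pair found
                        best_pair, best_prob = (attribute, value), prob
            attribute, value = best_pair
            terms.append((attribute, value))
            unused.remove(attribute)
            covered = [row for row in covered if row[attribute] == value]
        rules.append((terms, target))
        remaining = [row for row in remaining      # remove covered instances
                     if not all(row[a] == v for a, v in terms)]
    return rules

# The lens24 training set of Figure 11.6.
names = ['age', 'specRx', 'astig', 'tears', 'class']
lens24 = [dict(zip(names, row)) for row in [
    (1,1,1,1,3),(1,1,1,2,2),(1,1,2,1,3),(1,1,2,2,1),(1,2,1,1,3),(1,2,1,2,2),
    (1,2,2,1,3),(1,2,2,2,1),(2,1,1,1,3),(2,1,1,2,2),(2,1,2,1,3),(2,1,2,2,1),
    (2,2,1,1,3),(2,2,1,2,2),(2,2,2,1,3),(2,2,2,2,3),(3,1,1,1,3),(3,1,1,2,3),
    (3,1,2,1,3),(3,1,2,2,1),(3,2,1,1,3),(3,2,1,2,2),(3,2,2,1,3),(3,2,2,2,3)]]

for terms, target in prism_rules_for_class(lens24, names[:-1], 1):
    print('IF', ' and '.join(f'{a} = {v}' for a, v in terms),
          'THEN class =', target)
# IF astig = 2 and tears = 2 and age = 1 THEN class = 1
# IF astig = 2 and tears = 2 and specRx = 1 THEN class = 1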
1. Tie-breaking
The basic algorithm can be improved slightly by choosing between at-
tribute/value pairs which have equal probability not arbitrarily as above
but by taking the one with the highest total frequency.
However, the basic algorithm can easily be extended to deal with clashes as
follows.
Step 3 of the algorithm states:
Repeat 1 and 2 for this subset until a subset is reached that contains only
instances of class i.
To this needs to be added ‘or a subset is reached which contains instances
of more than one class, although values of all the attributes have already been
used in creating the subset’.
The simple approach of assigning all instances in the subset to the majority
class does not fit directly into the Prism framework. A number of approaches
to doing so have been investigated, and the most effective would appear to be
as follows.
Both the additional features described in Section 11.4.1 are included in a re-
implementation of Prism by the present author [3].
The same paper describes a series of experiments to compare the perfor-
mance of Prism with that of TDIDT on a number of datasets. The author
concludes “The experiments presented here suggest that the Prism algorithm
for generating modular rules gives classification rules which are at least as good
as those obtained from the widely used TDIDT algorithm. There are generally
fewer rules with fewer terms per rule, which is likely to aid their comprehen-
sibility to domain experts and users. This result would seem to apply even
more strongly when there is noise in the training set. As far as classification
accuracy on unseen test data is concerned, there appears to be little to choose
between the two algorithms for noise-free datasets, including ones with a sig-
nificant proportion of clash instances in the training set. The main difference
is that Prism generally has a preference for leaving a test instance as ‘unclas-
sified’ rather than giving it a wrong classification. In some domains this may
be an important feature. When it is not, a simple strategy such as assigning
unclassified instances to the majority class would seem to suffice. When noise
is present, Prism would seem to give consistently better classification accuracy
than TDIDT, even when there is a high level of noise in the training set. . . .
The reasons why Prism should be more tolerant to noise than TDIDT are not
entirely clear, but may be related to the presence of fewer terms per rule in
most cases. The computational effort involved in generating rules using Prism
. . . is greater than for TDIDT. However, Prism would seem to have considerable
potential for efficiency improvement by parallelisation.”
These very positive conclusions are of course based on only a fairly limited
number of experiments and need to be verified for a much wider range of
datasets. In practice, despite the drawbacks of a decision tree representation
and the obvious potential of Prism and other similar algorithms, TDIDT is far
more frequently used to generate classification rules. The ready availability of
C4.5 [4] and related systems is no doubt a significant factor in this.
In Chapter 16 we go on to look at the use of modular rules for predicting
associations between attribute values rather than for classification.
References
[1] Cendrowska, J. (1987). PRISM: an algorithm for inducing modular rules.
International Journal of Man-Machine Studies, 27, 349–370.
[2] Cendrowska, J. (1990). Knowledge acquisition for expert systems: inducing
modular rules from examples. PhD Thesis, The Open University.
[3] Bramer, M. A. (2000). Automatic induction of classification rules from ex-
amples using N-prism. In Research and development in intelligent systems
XVI (pp. 99–121). Berlin: Springer.
[4] Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo:
Morgan Kaufmann.
12 Measuring the Performance of a Classifier
Up to now we have generally assumed that the best (or only) way of measuring
the performance of a classifier is by its predictive accuracy, i.e. the proportion
of unseen instances it correctly classifies. However this is not necessarily the
case.
There are many other types of classification algorithm as well as those
discussed in this book. Some require considerably more computation or memory
than others. Some require a substantial number of training instances to give
reliable results. Depending on the situation the user may be willing to accept
a lower level of predictive accuracy in order to reduce the run time/memory
requirements and/or the number of training instances needed.
A more difficult trade-off occurs when the classes are severely unbalanced.
Suppose we are considering investing in one of the leading companies quoted on
a certain stock market. Can we predict which companies will become bankrupt
in the next two years, so we can avoid investing in them? The proportion of
such companies is obviously small. Let us say it is 0.02 (a fictitious value), so
on average out of every 100 companies 2 will become bankrupt and 98 will not.
Call these ‘bad’ and ‘good’ companies respectively.
If we have a very ‘trusting’ classifier that always predicts ‘good’ under all
circumstances its predictive accuracy will be 0.98, a very high value. Looked at
only in terms of predictive accuracy this is a very successful classifier. Unfor-
tunately it will give us no help at all in avoiding investing in bad companies.
On the other hand, if we want to be very safe we could use a very ‘cautious’
classifier that always predicted ‘bad’. In this way we would never lose our money
in a bankrupt company but would never invest in a good one either. This is
similar to the ultra-safe strategy for air traffic control: ground all aeroplanes, so
you can be sure that none of them will crash. In real life, we are usually willing
to accept the risk of making some mistakes in order to achieve our objectives.
It is clear from this example that neither the very trusting nor the very
cautious classifier is any use in practice. Moreover, where the classes are severely
unbalanced (98% to 2% in the company example), predictive accuracy on its
own is not a reliable indicator of a classifier’s effectiveness.
Bad Company Application. Here we would like the number of false positives
(bad companies that are classified as good) to be as small as possible, ideally
zero. We would probably be willing to accept a high proportion of false negatives
(good companies classified as bad) as there are a large number of possible
companies to invest in.
In information retrieval applications the user is seldom aware of the false negatives (relevant
pages not found by the search engine) but false positives are visible, waste time
and irritate the user.
These examples illustrate that, leaving aside the ideal of perfect classifica-
tion accuracy, there is no single combination of false positives and false neg-
atives that is ideal for every application and that even a very high level of
predictive accuracy may be unhelpful when the classes are very unbalanced. To
go further we need to define some improved measures of performance.
The classifier will of course still work exactly as well as before to predict
the correct classification of either a pass or a fail with which it is presented. For
both confusion matrices the values of TP Rate and FP Rate are the same (0.89
and 0.2 respectively). However the values of the Predictive Accuracy measure
are different.
For the original confusion matrix, Predictive Accuracy is 16,000/19,000 =
0.842. For the second one, Predictive Accuracy is 88,000/100,000 = 0.88.
An alternative possibility is that over a period of time there is a large
increase in the relative proportion of failures, perhaps because of an increase
in the number of younger people being tested. A possible confusion matrix for
a future series of trials would be as follows.
Predicted class Total
+ − instances
Actual class + 8, 000 1, 000 9, 000
− 20, 000 80, 000 100, 000
Here the Predictive Accuracy is 88,000/109,000 = 0.807.
Whichever of these test sets was used with the classifier the TP Rate and
FP Rate values would be the same. However the three Predictive Accuracy
values would vary from 81% to 88%, reflecting changes in the relative numbers
of positive and negative values in the test set, rather than any change in the
quality of the classifier.
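These measures are easily computed from the cells of a confusion matrix, as in the sketch below (an illustration only). The third matrix is the one shown above; the cell values of the first two are not reproduced in the text, so they are reconstructed here from the quoted TP Rate, FP Rate and Predictive Accuracy figures.

def rates(tp, fn, fp, tn):
    """TP Rate, FP Rate and Predictive Accuracy from a confusion matrix."""
    return (tp / (tp + fn),                    # true positive rate
            fp / (fp + tn),                    # false positive rate
            (tp + tn) / (tp + fn + fp + tn))   # predictive accuracy

matrices = {'first':  dict(tp=8_000,  fn=1_000,  fp=2_000,  tn=8_000),
            'second': dict(tp=80_000, fn=10_000, fp=2_000,  tn=8_000),
            'third':  dict(tp=8_000,  fn=1_000,  fp=20_000, tn=80_000)}

for name, m in matrices.items():
    tp_rate, fp_rate, accuracy = rates(**m)
    print(f'{name}: TP Rate {tp_rate:.2f}  FP Rate {fp_rate:.2f}  '
          f'Accuracy {accuracy:.3f}')
# TP Rate is 0.89 and FP Rate is 0.20 in every case, while Predictive
# Accuracy comes out as 0.842, 0.880 and 0.807 respectively.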
The other six points shown are (0.1, 0.6), (0.2, 0.5), (0.4, 0.2), (0.5, 0.5),
(0.7, 0.7) and (0.2, 0.7).
One classifier is better than another if its corresponding point on the ROC
Graph is to the ‘north-west’ of the other’s. Thus the classifier represented by
(0.1, 0.6) is better than the one represented by (0.2, 0.5). It has a lower FP Rate
and a higher TP Rate. If we compare points (0.1, 0.6) and (0.2, 0.7), the latter
has a higher TP Rate but also a higher FP Rate. Neither classifier is superior
to the other on both measures and the one chosen will depend on the relative
importance given by the user to the two measures.
The diagonal line joining the bottom left and top right-hand corners corre-
sponds to random guessing, whatever the probability of the positive class may
be. If a classifier guesses positive and negative at random with equal frequency,
it will classify positive instances correctly 50% of the time and negative in-
stances as positive, i.e. incorrectly, 50% of the time. Thus both the TP Rate
and the FP Rate will be 0.5 and the classifier will lie on the diagonal at point
(0.5, 0.5).
Similarly, if a classifier guesses positive and negative at random with positive
selected 70% of the time, it will classify positive instances correctly 70% of the
time and negative instances as positive, i.e. incorrectly, 70% of the time. Thus
both the TP Rate and the FP Rate will be 0.7 and the classifier will lie on the
diagonal at point (0.7, 0.7).
We can think of the points on the diagonal as corresponding to a large
number of random classifiers, with higher points on the diagonal corresponding
to higher proportions of positive classifications generated on a random basis.
The upper left-hand triangle corresponds to classifiers that are better than
random guessing. The lower right-hand triangle corresponds to classifiers that
are worse than random guessing, such as the one at (0.4, 0.2).
A classifier that is worse than random guessing can be converted to one
that is better than random guessing simply by reversing its predictions, so that
every positive prediction becomes negative and vice versa. By this method the
classifier at (0.4, 0.2) can be converted to the new one at (0.2, 0.4) in Figure
12.4. The latter point is the former reflected about the diagonal line.
Examining ROC curves can give insights into the best way of tuning a
classification algorithm. In Figure 12.5 performance clearly degrades after the
third point in the series.
The performance of different types of classifier with different parameters
can be compared by inspecting their ROC curves.
The values of true positive rate and false positive rate are often represented
diagrammatically by a ROC graph. Joining the points on a ROC graph to form
a ROC curve can often give insight into the best way of tuning a classifier. A
Euclidean distance measure of the difference between a given classifier and the
performance of a hypothetical perfect classifier is described.
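As an illustration, the sketch below computes a value of this kind for the classifiers plotted earlier, assuming that Euc is defined as the Euclidean distance from a classifier's (FP Rate, TP Rate) point to the point (0, 1) occupied by a perfect classifier.

from math import sqrt

# (FP Rate, TP Rate) points from the ROC graph discussed above.
points = [(0.1, 0.6), (0.2, 0.5), (0.4, 0.2), (0.5, 0.5), (0.7, 0.7), (0.2, 0.7)]

def euc(fp_rate, tp_rate):
    """Assumed definition: distance from the perfect classifier at (0, 1)."""
    return sqrt(fp_rate ** 2 + (1 - tp_rate) ** 2)

for fp_rate, tp_rate in points:
    print(f'({fp_rate}, {tp_rate})  Euc = {euc(fp_rate, tp_rate):.3f}')
# The smaller the value, the closer the classifier is to perfection;
# here (0.2, 0.7) and (0.1, 0.6) give the smallest values.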
Predicted class
+ −
Actual class + 55 5
− 5 35
Predicted class
+ −
Actual class + 40 20
− 1 39
Predicted class
+ −
Actual class + 60 0
− 20 20
Calculate the values of true positive rate and false positive rate for each
classifier and plot them on a ROC graph. Calculate the value of the Euclidean
distance measure Euc for each one. Which classifier would you consider the best
if you were equally concerned with avoiding false positive and false negative
classifications?
13 Dealing with Large Volumes of Data
13.1 Introduction
In the not too far distant past, datasets with a few hundred or a few thousand
records would have been considered normal and those with tens of thousands
of records would probably have been considered very large. The ‘data explo-
sion’ that is so evident all around us has changed all that. In some fields only
quite a small amount of data is available and that is unlikely to change very
much (perhaps fossil data or data about patients with rare illnesses); in other
fields (such as retailing, bioinformatics, branches of science such as chemistry,
cosmology and particle physics, and the ever-growing area of mining data held
by Internet applications such as blogs and social networking sites) the volume
has greatly increased and seems likely to go on increasing rapidly.
Some of the best–known data mining methods were developed in those
far-off days and were originally tested on datasets such as the UCI Reposi-
tory [1]. It is certainly not self-evident that they will all scale up to much
larger datasets with acceptable runtimes or memory requirements. The most
obvious answer to this problem is to take a sample from a large dataset and use
that for data mining. Taking a 1% sample chosen at random from a 100 million
record dataset would leave ‘only’ a million records to analyse but that is itself a
substantial number. Also, however random the 1% selection process itself may
be, that does not guarantee that what results will be a random sample from
the underlying (probably far larger) population of possible records for that task
area, as that will depend on how the original data was collected. All that will
be certain is that 99 million data records will have been discarded.
It is tempting to assume that a task distributed over, say, 100 processors is certain to be completed far more quickly than by one processor alone. A little experience will soon dispel this illusion. In reality
it can easily be the case that 100 processors take considerably longer to do
the job than just 10, because of communication and other overheads amongst
them. We might invent the term ‘the two many cooks principle’ to describe
this.
There are several ways in which a classification task could be distributed
over a number of processors.
(1) If all the data is together in one very large dataset, we can distribute
it on to p processors, run an identical classification algorithm on each one and
combine the results.
(2) The data may inherently ‘live’ in different datasets on different pro-
cessors, for example in different parts of a company or even in different co-
operating organisations. As for (1) we could run an identical classification al-
gorithm on each one and combine the results.
(3) An extreme case of a large data volume is streaming data arriving in
effectively a continuous infinite stream in real time, e.g. from a CCTV. If the
data is all coming to a single source, different parts of it could be processed
by different processors acting in parallel. If it is coming into several different
processors, it could be handled in a similar way to (2).
(4) An entirely different situation arises where we have a dataset that is not
particularly large, but we wish to generate several or many different classifiers
from it and then combine the results by some kind of ‘voting’ system in order
to classify unseen instances. In this case we might have the whole dataset on a
single processor, accessed by different classification programs (possibly identical
or possibly different) accessing all or part of the data. Alternatively, we could
distribute the data in whole or in part to each processor before running a set of
either identical or different classification programs on it. This topic is discussed
in Chapter 14 ‘Ensemble Classification’.
A common feature of all these approaches is that there needs to be some kind
of ‘control module’ to combine the results obtained on the p processors. De-
pending on the application, the control module may also need to distribute the
data to different processors, initiate the processing on each processor and per-
haps synchronise the p processors’ work. The control module might be running
on an additional processor or as a separate process on one of the p processors
mentioned previously.
In the next section we will focus on the first category of application, i.e. all
the data is together in one very large dataset, a part of which we can distribute
on to each of p processors, then run an identical classification algorithm on
each one and combine the results.
This leads to a very rough outline for a possible way of distributing a classi-
fication task to a network of processors. For simplicity we will assume that the
aim is to generate a set of classification rules corresponding to a given dataset,
rather than some other form of classification model.
(a) The data is divided up either vertically or horizontally (or perhaps both)
amongst the processors.
(b) The same algorithm is executed on each processor to analyse the corre-
sponding portion of the data.
(c) Finally the results obtained by each processor are passed to a ‘control
module’, which combines the results into a set of rules. It will also have been re-
sponsible for initiating steps (a) and (b) and for whatever action was necessary
to keep the processors in step during step (b).
A general model for distributed data mining of this kind is provided by the
Cooperating Data Mining (CDM) model introduced by Provost [3]. Figure 13.3
shows the basic architecture (reproduced from [4] with permission).
– Layer 1: the sample selection procedure, which partitions the data sample
S into subsamples (one for each of the processors available)
– Layer 2: For each processor there is a corresponding learning algorithm
Li which runs on the corresponding subsample Si and generates a concept
description Ci .
– Layer 3: the concept descriptions are then merged by a combining pro-
cedure to form a final concept description Cfinal (such as a set of classification
rules).
The model allows for the learning algorithms Li to communicate with each
other but does not specify how.
Scale-Up
We can see more clearly what is happening if we plot on the vertical axis not runtime but relative runtime,
i.e. (for each of the three plots) the runtime divided by the runtime for just
2 processors. This gives us Figure 13.6. Now each plot starts with a relative
runtime of one (for two processors) and we have added the ‘ideal’ situation of
a horizontal line of height one to the graph accordingly.
We can now see that the relative runtime is greatest for the smallest work-
load per processor (130K) and smallest for the largest workload (850K). So
with this algorithm, the effect of the communication overhead in increasing the
runtime above the ideal is lower as the workload per processor increases. As
we wish to be able to deal with very large datasets this is a most desirable
result.
Size-Up
the way the system handles the communication overheads. This is a very good
result.
Speed-Up
a potentially vast amount of data together and processing it all as a single job.
This is a highly desirable property.
We will only briefly summarise the description of the Naïve Bayes classification algorithm here, using an example from Chapter 3. Given a training set
such as Figure 13.9, the algorithm constructs the table of conditional and prior probabilities shown in Figure 13.10:
class
on time    late    very late    cancelled
day = weekday * 9/14 = 0.64 1/2 = 0.5 3/3 = 1 0/1 = 0
day = saturday 2/14 = 0.14 1/2 = 0.5 0/3 = 0 1/1 = 1
day = sunday 1/14 = 0.07 0/2 = 0 0/3 = 0 0/1 = 0
day = holiday 2/14 = 0.14 0/2 = 0 0/3 = 0 0/1 = 0
season = spring 4/14 = 0.29 0/2 = 0 0/3 = 0 1/1 = 1
season = summer * 6/14 = 0.43 0/2 = 0 0/3 = 0 0/1 = 0
season = autumn 2/14 = 0.14 0/2 = 0 1/3 = 0.33 0/1 = 0
season = winter 2/14 = 0.14 2/2 = 1 2/3 = 0.67 0/1 = 0
wind = none 5/14 = 0.36 0/2 = 0 0/3 = 0 0/1 = 0
wind = high * 4/14 = 0.29 1/2 = 0.5 1/3 = 0.33 1/1 = 1
wind = normal 5/14 = 0.36 1/2 = 0.5 2/3 = 0.67 0/1 = 0
rain = none 5/14 = 0.36 1/2 = 0.5 1/3 = 0.33 0/1 = 0
rain = slight 8/14 = 0.57 0/2 = 0 0/3 = 0 0/1 = 0
rain = heavy * 1/14 = 0.07 1/2 = 0.5 2/3 = 0.67 1/1 = 1
Prior Probability 14/20 = 0.70 2/20 = 0.10 3/20 = 0.15 1/20 = 0.05
Then the score for each class for an unseen instance such as one with
day = weekday, season = summer, wind = high and rain = heavy
can be calculated from the values in the rows shown above that are marked
with asterisks.
The class with the largest score is selected, in this case class = on time.
(There are complications with zero values which will be ignored here.)
First we note that there is no need to store all the values shown above.
All that needs to be stored for each of the attributes is a frequency table
showing the number of instances with each possible combination of the attribute
value and classification. For attribute day the table would be as shown in
Figure 13.11.
class
on time late very late cancelled
weekday 9 1 3 0
saturday 2 1 0 1
sunday 1 0 0 0
holiday 2 0 0 0
Together with a table for each attribute there needs to be a row show-
ing the frequencies of each of the four classes, as shown for this example in
Figure 13.12.
class
on time late very late cancelled
TOTAL 14 2 3 1
The values in the TOTAL row are used as the denominators when the values
in the frequency table for each attribute are used in calculations, e.g. for the
frequency table for attribute day, the value used for weekday/on time is 9/14.
The Prior Probability row in Figure 13.10 does not need to be stored at all as
in each case the value is the frequency of the corresponding class divided by
the total number of instances (20 in this example).
Even when the volume of data is very large the number of classes is often
small and even when there are a very large number of categorical attributes, the
number of possible attribute values for each one is likely to be quite small, so
overall it seems entirely practical to store a frequency table such as Figure 13.11
for each attribute plus a single table of class frequencies.
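The whole classification step can then be carried out from the stored tables alone, as in the sketch below (illustrative code only; the counts for attribute day and for the classes are those in Figures 13.11 and 13.12, and the counts for the other three attributes are read off from the fractions shown in Figure 13.10).

classes = ['on time', 'late', 'very late', 'cancelled']
class_totals = {'on time': 14, 'late': 2, 'very late': 3, 'cancelled': 1}

# One frequency table per attribute: counts of each attribute value/class
# combination, in the same class order as the 'classes' list above.
freq = {
    'day':    {'weekday': [9, 1, 3, 0], 'saturday': [2, 1, 0, 1],
               'sunday':  [1, 0, 0, 0], 'holiday':  [2, 0, 0, 0]},
    'season': {'spring': [4, 0, 0, 1], 'summer': [6, 0, 0, 0],
               'autumn': [2, 0, 1, 0], 'winter': [2, 2, 2, 0]},
    'wind':   {'none': [5, 0, 0, 0], 'high': [4, 1, 1, 1],
               'normal': [5, 1, 2, 0]},
    'rain':   {'none': [5, 1, 1, 0], 'slight': [8, 0, 0, 0],
               'heavy': [1, 1, 2, 1]},
}
total_instances = sum(class_totals.values())       # 20

def scores(instance):
    """Prior probability times the product of the conditional probabilities,
    all derived from the stored counts (zero counts are not smoothed here)."""
    result = {}
    for i, c in enumerate(classes):
        score = class_totals[c] / total_instances
        for attribute, value in instance.items():
            score *= freq[attribute][value][i] / class_totals[c]
        result[c] = score
    return result

unseen = {'day': 'weekday', 'season': 'summer', 'wind': 'high', 'rain': 'heavy'}
s = scores(unseen)
print(max(s, key=s.get))                           # on time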
With this tabular representation for the probability model generated by
the Naïve Bayes algorithm, incrementally updating a classifier becomes trivial.
Suppose that based on 100,000 instances we have a frequency table for attribute
A as shown in Figure 13.13.
The frequency counts for the four classes are 50120, 19953, 14301 and 15626
making a grand total of 100,000.
Suppose that we now want to process a batch of 50,000 more instances with
a frequency table for attribute A as shown in Figure 13.14.
For these new instances the frequency counts of the classes are 21308, 9972,
9316 and 9404, making a total of 50,000.
In order to obtain the same classification for any unseen instance with the
training data received in two parts as if all 150,000 instances had been used
together to generate the classifier as a single job, it is only necessary to add
the two frequency tables for each attribute together element-by-element and to
add together the frequency totals for each class. This is simple to do with no
loss of accuracy involved.
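The merge itself is a one-line dictionary operation, sketched below for the class totals quoted above (the contents of Figures 13.13 and 13.14 are not reproduced in the text, so only the class frequency row is shown; exactly the same function is applied to each attribute's table).

def merge(table1, table2):
    """Element-by-element addition of two frequency tables (dicts of lists)."""
    return {key: [a + b for a, b in zip(table1[key], table2[key])]
            for key in table1}

# Class frequency rows for the two batches described above
# (100,000 instances, then a further 50,000).
batch1 = {'TOTAL': [50120, 19953, 14301, 15626]}
batch2 = {'TOTAL': [21308, 9972, 9316, 9404]}
print(merge(batch1, batch2))        # {'TOTAL': [71428, 29925, 23617, 25030]}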
Returning to the topic of distributing data to a number of processors by ver-
tical partitioning, i.e. allocating a portion of the attributes to each processor,
that approach fits well with the Naïve Bayes algorithm. All that each processor
would have to do is to count the frequency of each attribute value/class combi-
nation for each of the attributes allocated to it and pass a small table for each
one to the ‘control module’ whenever requested.
Experiments have shown that the classification accuracy of Naïve Bayes
is generally competitive with that of other methods. Its main drawbacks are
that it only applies when the attribute values are all categorical and that the
probability model generated is not as explicit as a decision tree, say. Depending
on the application, the explicitness of the model may or may not be a significant
issue.
1. Construct a frequency table for each of the four attributes and a class
frequency table, using the data in the two train datasets combined.
2. Using these new tables find the most likely classification for the unseen
instance given below.
References
[1] Blake, C. L., & Merz, C. J. (1998). UCI repository of machine
learning databases. Irvine: University of California, Department of In-
formation and Computer Science. http://www.ics.uci.edu/~mlearn/
MLRepository.html.
[2] Catlett, J. (1991). Megainduction: machine learning on very large databases.
Sydney: University of Technology.
[3] Provost, F. (2000). Distributed data mining: scaling up and beyond. In
H. Kargupta & P. Chan (Eds.), Advances in distributed data mining. San
Mateo: Morgan Kaufmann.
[4] Stahl, F., Bramer, M., & Adda, M. (2009). PMCRI: a parallel modular clas-
sification rule induction framework. In LNAI: Vol. 5632. Machine learning
and data mining in pattern recognition (pp. 148–162). Berlin: Springer.
[5] Shafer, J. C., Agrawal, R., & Mehta, M. (1996). SPRINT: a scalable parallel
classifier for data mining. In Twenty-second international conference on very
large data bases.
[6] Stahl, F. T., Bramer, M. A., & Adda, M. (2010). J-PMCRI: a methodology
for inducing pre-pruned modular classification rules. In Artificial intelli-
gence in theory and practice III (pp. 47–56). Berlin: Springer.
[7] Stahl, F., & Bramer, M. (2013). Computationally efficient induction of
classification rules with the PMCRI and J-PMCRI frameworks. Knowledge
based systems. Amsterdam: Elsevier.
[8] Nolle, L., Wong, K. C. P., & Hopgood, A. (2002). DARBS: a distributed
blackboard system. In M. A. Bramer, F. Coenen, & A. Preece (Eds.), Re-
search and development in intelligent systems XVIII. Berlin: Springer.
[9] Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010). MOA: massive
online analysis, a framework for stream classification and clustering. Journal
of Machine Learning Research, 99, 1601–1604.
14
Ensemble Classification
14.1 Introduction
The idea of ensemble classification is to learn not just one classifier but a set of
classifiers, called an ensemble of classifiers, and then to combine their predic-
tions for the classification of unseen instances using some form of voting. This
is illustrated in Figure 14.1 below. It is hoped that the ensemble will collec-
tively have a higher level of predictive accuracy than any one of the individual
classifiers, but that is not guaranteed.
The term ensemble learning is often used to mean the same as ensemble
classification, but the former is a more general technique where a set of models
is learnt that collectively can be applied to solving a problem of potentially any
kind, not just classification.
The individual classifiers in an ensemble are known as base classifiers. If
the base classifiers are all of the same kind (e.g. decision trees) the ensemble is
known as homogeneous. Otherwise it is known as heterogeneous.
A simple form of ensemble classification algorithm is:
1. Generate N classifiers for a given dataset
2. For an unseen instance X
a) Compute the predicted classification of X for each of the N classifiers
b) Select the classification that is most frequently predicted.
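A minimal sketch of step 2 in Python, assuming each of the N classifiers is available as a callable that returns a class label for an instance (the names here are illustrative):

from collections import Counter

def ensemble_predict(classifiers, instance):
    """Majority voting: each classifier casts one vote for its own prediction."""
    votes = Counter(classifier(instance) for classifier in classifiers)
    prediction, _ = votes.most_common(1)[0]     # ties are broken arbitrarily
    return prediction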
This is a majority voting model where each time a classifier predicts a
particular classification for an unseen instance it counts as one ‘vote’ for that
very similar, as they are all likely to give a very similar ‘standard’ performance.
A better strategy is likely to be to generate trees (or other classifiers) that are
diverse, in the hope that some will give much better than ‘standard’ perfor-
mance, even if others are much worse. Those in the latter category should not
be included in the ensemble; those in the former should be retained. This leads
naturally to the idea of generating a large number of classifiers in some random
way and then retaining only the best.
Two pioneering pieces of work in this field are the Random Decision Forests
system developed by Tin Kam Ho [1] and the Random Forests system of Leo
Breiman [2]. Both use the approach of generating a large number of decision
trees in a way that has a substantial random element, measuring their per-
formance and then selecting the best trees for the ensemble. To quote Stahl
and Bramer [3]: “Ho argues that traditional trees often cannot be grown over
a certain level of complexity without risking a loss of generalisation caused
by overfitting on the training data. Ho proposes inducing multiple trees in
randomly selected subsets of the feature space. He claims that the combined
classification will improve, as the individual trees will generalise better on the
classification for their subset of the feature space”.
Ho’s work introduced the idea of making a random selection of the attributes
to use when generating each classifier. Breiman added to this by introducing
a technique known as bagging for generating multiple different but related
training sets from a single set of training data, with the aim of reducing
overfitting and improving classification accuracy [4].
Naturally this is computationally expensive to do. Ho’s and Breiman’s
papers are both important contributions to the field and are well worth studying
in detail. However as usual there are many other ways of implementing the
same general ideas once they have been set out and the description given in
this chapter is our own.
To develop the idea of basing an ensemble on random classifiers further we
need:
– A means of generating a large number of classifiers (say 100) in a random
fashion and
– A way of measuring the performance of each one.
The final step is to choose all those that meet some criterion to include
in an ensemble. There are several ways of doing this. For example we may
select, say, the 10 classifiers with the best performance or all the classifiers
with performance above some threshold of accuracy.
Since the same calculation applies to all instances and those never selected
form the validation dataset for the classifier, it follows that for a reasonably
large dataset of ‘remaining data’ we can expect that the validation dataset will
comprise (on average) 36.8% of the instances. It follows that the other 63.2%
go into the training set, some of them many times, to make a training set of N
instances.
The significance of the training set being ‘padded out’ to N instances with
duplicate values is far from negligible. Depending on the algorithm used, the
classifier generated may be substantially different from the one obtained if
duplicate values are deleted from the training set, which is a possible alternative
approach.
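Sampling with replacement of this kind is easily sketched in code; for a reasonably large dataset the out-of-bag proportion printed below will usually be close to the 36.8% figure. The function name is illustrative.

import random

def bootstrap_split(instances):
    """Draw len(instances) instances with replacement to form a training set;
    the instances never selected form the validation (out-of-bag) set."""
    n = len(instances)
    chosen = [random.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    training = [instances[i] for i in chosen]      # will usually contain duplicates
    validation = [x for i, x in enumerate(instances) if i not in chosen_set]
    return training, validation

data = list(range(10000))
training, validation = bootstrap_split(data)
print(len(validation) / len(data))                 # typically about 0.368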
Classifier   Predicted Class
1            A
2            B
3            A
4            B
5            A
6            C
7            C
8            A
9            C
10           B
Classifier   Accuracy   Predicted Class
1            0.65       A
2            0.90       B
3            0.65       A
4            0.85       B
5            0.70       A
6            0.70       C
7            0.90       C
8            0.65       A
9            0.80       C
10           0.95       B
Total        7.75
We can now adopt a weighted majority voting approach, with each vote for
a classification weighted by the proportion given in the middle column.
– Now class A gains 0.65 + 0.65 + 0.7 + 0.65 = 2.65 votes.
– Class B gains 0.9 + 0.85 + 0.95 = 2.7 votes.
– Class C gains 0.7 + 0.9 + 0.8 = 2.4 votes.
– The total number of votes available is 0.65 + 0.9 + . . . + 0.95 = 7.75.
With this approach class B is now the winner. This seems reasonable as it gained the votes of three of the best classifiers, judged by their performance on their validation datasets (which vary from one classifier to another), whereas class A gained the votes of four relatively weak classifiers. In this case choosing B as the winning classification seems justified.
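The weighted voting just described can be sketched as follows, using the accuracies and predictions tabulated above (the variable names are illustrative):

from collections import defaultdict

# (accuracy on its own validation dataset, predicted class) for classifiers 1 to 10
predictions = [(0.65, "A"), (0.90, "B"), (0.65, "A"), (0.85, "B"), (0.70, "A"),
               (0.70, "C"), (0.90, "C"), (0.65, "A"), (0.80, "C"), (0.95, "B")]

weighted_votes = defaultdict(float)
for accuracy, predicted_class in predictions:
    weighted_votes[predicted_class] += accuracy    # each vote weighted by accuracy

# A: 2.65, B: 2.7, C: 2.4 (up to floating-point rounding)
print(dict(weighted_votes))
print(max(weighted_votes, key=weighted_votes.get))   # B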
However it is possible to make the situation more complex still. An overall
predictive accuracy figure of say 0.85 can conceal considerable variation in
performance. We will focus on classifier 4 with overall predictive accuracy of
0.85 and consider a possible confusion matrix for it, assuming there were exactly
1,000 instances in its validation dataset. (Confusion matrices are discussed in
Chapter 7.)
From Figure 14.4 we can see that classification B was quite rare in the
validation dataset for classifier 4. Of the 100 instances with that classification
only 50 were correctly predicted. Even worse, if we look at the 120 times that
classification B was predicted by classifier 4, only 50 times was the prediction
correct. Now it seems as if giving classifier 4 a weighted value of 0.85 for its
Figure 14.5 is a revised version of Figure 14.3. Now each classifier again has
one vote, which it casts as three proportions. For example classifier 4 predicts
class B for the unseen instance under consideration. This produces not a single
vote for class B, but a vote split into three parts cast for all three classes A, B
and C, in this case the values 0.25, 0.42 and 0.33 respectively. These proportions
are derived from the ‘Predicted Class B ’ column of the confusion matrix for
classifier 4 (Figure 14.4).
Adding the votes for each of the three classes in Figure 14.5, the winner
now (rather surprisingly) is class C, mainly because of the three high votes of
0.9 twice and 0.8.
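One way of implementing this ‘track record’ form of voting is sketched below: each classifier’s confusion matrix (built from its own validation data) is column-normalised, and its vote is the column for the class it predicts. The matrix used here is a hypothetical one that is merely consistent with the figures quoted above for classifier 4 (1,000 validation instances, 85% accuracy, 100 instances of class B of which 50 were correctly predicted, and 120 predictions of class B of which 50 were correct); it is not Figure 14.4 itself.

def track_record_vote(confusion_matrix, predicted_class):
    """confusion_matrix[actual_class][predicted_class] holds a count from the
    classifier's own validation data. The vote returned is the predicted-class
    column, normalised so that its entries sum to one."""
    column = {actual: row[predicted_class] for actual, row in confusion_matrix.items()}
    total = sum(column.values())
    return {actual: count / total for actual, count in column.items()}

# Hypothetical confusion matrix for classifier 4 (rows = actual, columns = predicted)
classifier4 = {"A": {"A": 500, "B": 30, "C": 10},
               "B": {"A": 20,  "B": 50, "C": 30},
               "C": {"A": 20,  "B": 40, "C": 300}}
print(track_record_vote(classifier4, "B"))   # approximately A: 0.25, B: 0.42, C: 0.33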
Which of the three methods illustrated in this section is the most reliable?
The first predicted class A, the second class B and the third class C. There is
no clear-cut answer to this. The point is that there are a number of ways the
votes can be combined in an ensemble classifier rather than just one.
Looking again at Figure 14.5 there are further complications to take into
account. Classifier 5, which predicts class A has ‘votes’ of 0.4, 0.2 and 0.4. This
means that for its validation data when it predicted class A, only 40% of the
instances were actually of class A, 20% of the instances were class B and 40%
of the instances were class C. What credibility can be given to a prediction
of class A by that classifier? We can look at the three proportions for classi-
fier 5 as indications of its ‘track record’ when predicting class A. On that basis
there seems no reason at all to trust it and we might consider eliminating that
classifier from consideration any time its prediction is A, as well as eliminat-
ing classifier 4 when its prediction is class B. However, if we do so, we will
have implicitly moved from a ‘democratic’ model – one classifier, one vote – to
something closer to a ‘community of experts’ approach.
Suppose the 10 classifiers represent 10 medical consultants in a hospital
and A, B and C are three treatments to give a patient with a life-threatening
condition. The consultants are trying to predict which treatment is most likely
to prove effective. Why should anyone trust consultants 4 and 5, with their
poor track records when predicting B and A respectively?
By contrast consultant 6, whose prediction is that treatment C will prove
the most effective at saving the patient, has a track record of 90% success when
making that prediction. The only consultant to compare with consultant 6 is
number 9, who also has a track record of 90% success when predicting C. With
two such experts making the same choice, who would wish to contradict them?
Even the act of counting the votes seems not only pointless but unnecessarily
risky, just in case the other eight less successful consultants might happen to
outvote the two leading experts.
We could go on elaborating this example but will stop here. Clearly it
is possible to look at the question of how best to combine the classifications
the ensemble all predict the correct classification of each unseen instance and
their predictions are then combined using some form of voting system.
The idea of a random forest of classifiers is introduced and issues relating
to the selection of a different training set and/or a different set of attributes
from a given dataset when constructing each of the classifiers are discussed.
A number of alternative ways of combining the classifications produced by
an ensemble of classifiers are considered. The chapter concludes with a brief
discussion of a distributed processing approach to dealing with the large amount
of computation often required to generate an ensemble.
References
[1] Ho, T. K. (1995). Random decision forests. International Conference on
Document Analysis and Recognition, 1, 278.
[2] Breiman, L. (2001). Random forests. Machine Learning, 45 (1), 5–32.
[3] Stahl, F., & Bramer, M. (2011). Random prism: an alternative to random
forests. In Research and development in intelligent systems XXVIII (pp. 5–
18). Springer.
[4] Breiman, L. (1996). Bagging predictors. Machine Learning, 24 (2), 123–140.
[5] Stahl, F., May, D., & Bramer, M. (2012). Parallel random prism: a com-
putationally efficient ensemble learner for classification. In Research and
development in intelligent systems XXIX. Springer.
[6] Panda, B., Herbach, J. S., Basu, S., & Bayardo, R. J. (2009). Planet: mas-
sively parallel learning of tree ensembles with mapreduce. Proceedings of
the VLDB Endowment, 2, 1426–1437.
15
Comparing Classifiers
15.1 Introduction
In Chapter 12 we considered how to choose between different classifiers applied
to the same dataset. For those with real datasets to analyse this is obviously
the principal issue.
However there is an entirely different category of data miner: those who
develop new algorithms or what they hope are improvements to existing al-
gorithms designed to give superior performance on not just one dataset but a
wide range of possible datasets most of which are not known or do not even
exist at the time the new methods are developed. Into this category fall both
academic researchers and commercial software developers.
Whatever new methods are developed in the future, we can be certain of
this: no one is going to develop a new algorithm that out-performs all estab-
lished methods of classification (such as those described in this book) for all
possible datasets. Data mining packages intended for use in a wide variety of
possible application areas will continue to need to include a choice of classifi-
cation algorithms to use. The aim of further development is to establish new
techniques that are generally preferable to well-established ones. To do this
it is necessary to compare their performance against at least one established
algorithm on a range of datasets.
There are many published papers giving descriptions of interesting new clas-
sification algorithms accompanied by a performance table such as Figure 15.1.
Each column gives the predictive accuracy, expressed as a percentage, of one of
the classifiers on a range of datasets. (Note that for the method of comparison we describe below multiplying all the values in both columns by a constant has no effect on the outcome. Thus it makes no difference whether we represent predictive accuracy by percentages as here or by proportions between 0 and 1, such as 0.8 and 0.85.)

Dataset      Established Classifier A   New Classifier B
dataset 1            80                       85
dataset 2            73                       70
dataset 3            85                       85
dataset 4            68                       74
dataset 5            82                       71
dataset 6            75                       65
dataset 7            73                       77
dataset 8            64                       73
dataset 9            75                       75
dataset 10           69                       76
Total               744                      751
Average             74.4                     75.1
The production of tables of comparative values such as Figure 15.1 is a
considerable improvement over the position with some of the older Data Min-
ing literature where new algorithms are either not evaluated at all (leaving the
brilliance of the author’s ideas to speak for itself, one assumes) or are evalu-
ated on datasets that are only available to the author and/or are not named.
As time has gone by collections of ‘standard’ datasets have been assembled
that make it possible for developers to compare their results with those ob-
tained by other methods on the same datasets. In many cases the latter results
are only available in the published literature, since with a few honourable ex-
ceptions authors do not generally make software implementing their algorithms
accessible to other developers and researchers, except in the case of commercial
packages.
A very widely-used collection of datasets is the ‘UCI Repository’ [1] which
was introduced in Section 2.6. Being able to compare performance on the same
datasets as those used by previous authors clearly makes it far easier to evaluate
new algorithms. However the widespread use of such repositories is not an
unmixed blessing as will be explained later.
We can see that the average difference between A and B is 0.7, i.e. 0.7% in
favour of classifier B. This does not seem very much. Is it sufficient to reject
the null hypothesis that the performance of classifiers A and B is effectively the
same? We will address this question using a paired t-test. The word ‘paired’
in the name refers to the fact that the results fall into natural pairs, i.e. it is
sensible to compare the results for dataset 1 for classifiers A and B but these
are separate from those for dataset 2 etc.
To perform a paired t-test we need only three values: the total of the values
of z, the total of the z² values and the number of datasets. We will denote these by Σz, Σz² and n respectively, so Σz = 7, Σz² = 437 and n = 10.¹
From these three values we can calculate the value of a statistic which is
traditionally represented by the variable t. The t-statistic was introduced in
the early 20th century by an English statistician named William Gosset, who is
best known by his pen name of ‘Student’, and so this test is also often known
as Student’s t-test.
The calculation of the value of t can be broken down into the following
steps.
Step 1. Calculate the average value of z: Σz/n = 7/10 = 0.7.
Step 2. Calculate the value of (Σz)²/n. Here this gives 7²/10 = 4.9.
Step 3. Subtract the result of step 2 from Σz². Here this gives 437 − 4.9 = 432.1.
Step 4. Divide this value by (n − 1) to give the sample variance, which is traditionally denoted by s². Here s² is 432.1/9 = 48.01.
Step 5. Take the square root of s² to give s, known as the sample standard deviation. Here the value of s is √48.01 = 6.93.
Step 6. Divide s by √n to give the standard error. Here the value is 6.93/√10 = 2.19.
Step 7. Finally we divide the average value of z by the standard error to give the value of the t statistic. Here t = 0.7/2.19 = 0.32.
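The whole calculation can be reproduced in a few lines of Python. The z values below are the differences B − A for the ten datasets of Figure 15.1; with a statistics library the same t value should also be obtainable directly, for example from scipy.stats.ttest_rel applied to the two columns.

import math

# differences (classifier B minus classifier A) for the ten datasets in Figure 15.1
z = [5, -3, 0, 6, -11, -10, 4, 9, 0, 7]
n = len(z)                                                     # 10
sum_z = sum(z)                                                 # 7
sum_z_squared = sum(value * value for value in z)              # 437

mean_z = sum_z / n                                             # 0.7
sample_variance = (sum_z_squared - sum_z ** 2 / n) / (n - 1)   # 48.01
standard_error = math.sqrt(sample_variance) / math.sqrt(n)     # about 2.19
t = mean_z / standard_error                                    # about 0.32
print(mean_z, standard_error, t)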
The word ‘sample’ in both ‘sample variance’ and ‘sample standard devia-
tion’ refers to the fact that the 10 datasets given in the table are not all the
possible datasets that exist to which the two classifiers may be applied. They
are just a very small sample of all the possible datasets that exist or may exist in
the future. We are using them as ‘representatives’ of this much larger collection
of datasets. We will return to the question of how far this is reasonable.
The terms standard deviation and variance are commonly used in statistics.
Standard deviation measures the fluctuation of the values of z about the mean
¹ For those not familiar with this notation, which uses the Greek letter Σ (pronounced ‘sigma’) to denote summation, it is explained in Appendix A.1.1. The simplified variant used here leaves out the subscripts, as the values to be added are obvious. Σz (read as ‘sigma z’) denotes the sum of all the values of z, which here is 7. Σz² (read as ‘sigma z squared’) represents the sum of all the values of z², which here is 437. The latter is not to be confused with (Σz)², which is the square of Σz, i.e. 49.
value, which here is 0.7. In Figure 15.2 the fluctuation is considerable: the
differences between the values of z and the average value (0.7) vary from −11.7
to +8.3 and this is reflected in a sample standard deviation, s, value of 6.93,
almost 10 times larger than the average itself. The calculation of the standard
error value adjusts s to allow for the number of datasets in the sample. Because
t is the average value of z divided by the standard error, it follows that the
smaller the value of s (i.e. the fluctuation of z values about the average), the
larger will be the value of t. (Readers interested in a full explanation of and
justification for the t-test are referred to the many statistics textbooks that are
available.)
Now we have calculated t, the next step is to use it to determine whether or
not to accept the null hypothesis that the performance of classifiers A and B
is effectively the same. We ask this question in an equivalent form: is the value
of t sufficiently far away from zero to justify rejecting the null hypothesis? We
say ‘sufficiently far away from zero’ rather than ‘sufficiently large’ because t
can have either a positive or a negative value. (The average value of z can be
positive or negative; standard error is always positive.)
We can now reformulate our question as: ‘how likely is a value of t outside
the range from −0.32 to +0.32 to occur by chance’ ? The answer to this depends
on the number of datasets n, but statisticians refer instead to the number of
degrees of freedom, which for our purposes is always one less than the number
of datasets, i.e. n − 1.
Figure 15.3 shows the distribution of the t-statistic for 9 degrees of freedom
(chosen because there are 10 datasets in the tables shown so far).
The left- and right-hand ends of the curve (called its ‘tails’) go on infinitely
in both directions. The area between the entire curve and the horizontal axis,
i.e. the t-axis, gives the probability that t will take one of its possible values,
which of course is one.
The figure has the values t = −1.83 and t = +1.83 marked with vertical
lines. The area between the parts of the curve that are to the left of t = −1.83
or to the right of t = +1.83 and the horizontal axis is the probability of the t
value being ≤ −1.83 or ≥ +1.83, i.e. at least as far away from zero as 1.83. We
need to look at both tails in this way as a negative value of −1.83 is just as
much evidence that the null hypothesis (that the two classifiers are equivalent)
is false as the positive value +1.83. When we compare two classifiers there is
no reason to believe that if A and B are significantly different then B must be
better than A; it might also be that B is worse than A.
The area shaded in Figure 15.3, i.e. the probability that t is at least 1.83
either side of zero can be calculated to be 0.1005.
Looking at the probability that t ≤ −1.83 or t ≥ +1.83, or in general that
t ≤ −a or t ≥ +a, for any positive value a, gives us what is known as a
two-tailed test of significance.
The value of the area under the two tails t ≤ −a and t ≥ +a has been
calculated for different degrees of freedom and values of a corresponding to
probabilities of particular interest. Some of these are summarised in Figure 15.4.
Figure 15.4 shows some key values for the t statistic for degrees of freedom from
1 to 19, i.e. for comparisons based on anything from 2 to 20 datasets. (Note
that because we are using a two-tailed test, the probabilities 0.10, 0.05 and 0.01 in the table correspond to areas of 0.05, 0.025 and 0.005 respectively in each of the two tails.)
Looking at the values for 9 degrees of freedom (i.e. for n = 10) the value
of 1.833 in the ‘Probability 0.10’ column indicates that a value of t ≥ 1.833
(or ≤ −1.833) would only be expected to happen by chance with probability
0.10 or less, i.e. no more than 1 time out of 10. If we had a t value of 2.1, say,
we could reject the null hypothesis ‘at the 10% level’, implying that such an
extreme value of t would only be expected to occur by chance fewer than one
time in 10. This is a commonly used criterion for rejecting a null hypothesis and
on this basis we could confidently say that classifier B is significantly better
than classifier A.
A value of t ≥ 2.262 (or ≤ −2.262) would enable us to reject the null
hypothesis at the 5% level, and a value of t ≥ 3.250 (or ≤ −3.250) would enable
us to reject the null hypothesis at the 1% level, as such values would only be
expected to occur by chance one time in 20 and 1 time in 100 respectively.
Naturally we could use other threshold values and work out a value of t
that would only be exceeded by chance one time in six on average, say, but
conventionally we use one of the thresholds shown in Figure 15.4. The least
restrictive condition generally imposed is that to reject a null hypothesis we
require a value of t that would occur no more than 1 time in 10 by chance.
Returning to our example, the value of t calculated was only 0.32, which
with 9 degrees of freedom is nowhere near the 10% value of 1.833. We can safely
accept the null hypothesis. On the basis of the evidence presented it would be
unsafe to say that the performance of classifier B was significantly different
from that of classifier A.
It is important to appreciate that the reason for this disappointing result
(certainly disappointing to the creator of classifier B ) is not the relatively low
average value of z (0.7). It is the relatively high value of the standard error
(2.19) relative to the average value of z.
To illustrate this we will introduce a new classifier C, which will turn out
to be much more successful as a challenger to classifier A.
Now Σz = 28, Σz² = 216 and n = 8.
The average value of z is 3.5. The standard error is 1.45 and the value of t
is 2.41. This is large enough for classifier B to be declared significantly better than classifier A at the 5% level. (With 7 degrees of freedom the threshold value for the 5% level is 2.365.)
Now Σz = 39, Σz² = 277 and n = 10.
The average value of z is 3.9. The standard error is 1.18 and the value of t
is 3.31. This is large enough to be significant at the 1% level.
Paradoxically if the results for classifier B with datasets 11 and 12 had
been much better, say 95% and 99% respectively, the value of t would have
been lower at 2.81. Intuitively, we may say that by increasing the fluctuation
around the average value of z we make it more likely that the difference between
the classifiers has occurred by chance. To obtain a significant value of t, it is
generally far more important that the values of z have low variability than that
the average value of z is large.
It is clear that the choice of datasets to include in a performance table such
as Figure 15.1 is of critical importance. A comparison of the t values calculated
from Figures 15.2, 15.6 and 15.7 shows that leaving out (or including) datasets
on which the new algorithm B performs badly (or well) can make the difference
between a ‘no significant difference’ result and a significant improvement (or the reverse).
Having established that for the results given in Figure 15.6 classifier B is sta-
tistically significantly better than classifier A at the 5% level, and the average
improvement for the eight datasets listed is 3.5%, it would be helpful to estab-
lish a confidence interval for the average improvement to indicate within what
limits the true improvement for datasets not included in the table is likely to
lie.
For this example the average value of z is 3.5 and the standard error is 1.45.
As the t value in the ‘Probability 0.05’ column of Figure 15.4 for 7 degrees of
freedom is 2.365, we can say that the 95% confidence interval for the true
average difference is 3.5 ± (2.365 ∗ 1.45) = 3.5 ± 3.429. We can be 95% certain
that the true average improvement lies between 0.071% and 6.929%.
For the performance figures given in Figure 15.7 classifier B is significantly
better than classifier A at the 1% level. Here the average value of z is 3.9 and
the standard error is 1.18. There are 9 degrees of freedom and the value of t in
the ‘Probability 0.01’ column for that number of degrees of freedom is 3.250.
We can say that the 99% confidence interval for the true average difference is
3.9 ± (3.250 ∗ 1.18) = 3.9 ± 3.835. We can be 99% certain that the true average
improvement lies between 0.065% and 7.735%.
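The two confidence intervals can be checked in the same way; the 2.365 and 3.250 are the tabulated t values quoted above and the standard errors are the ones already calculated.

# 95% confidence interval for the comparison in Figure 15.6 (7 degrees of freedom)
mean_z, standard_error, t_value = 3.5, 1.45, 2.365
half_width = t_value * standard_error                    # about 3.429
print(mean_z - half_width, mean_z + half_width)          # about 0.071 and 6.929

# 99% confidence interval for the comparison in Figure 15.7 (9 degrees of freedom)
mean_z, standard_error, t_value = 3.9, 1.18, 3.250
half_width = t_value * standard_error                    # about 3.835
print(mean_z - half_width, mean_z + half_width)          # about 0.065 and 7.735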
15.4 Sampling
So far we have shown how to test for the significance of a difference in per-
formance between two classifiers on some specified datasets. However in most
cases we do this not because we are particularly interested in those datasets but
because we would like our new method to be considered better on all possible
datasets. This brings us to the issue of sampling.
Any collection of datasets can be considered to be a sample from the com-
plete collection of all the world’s datasets (which is not accessible to us of
course), but is it a representative sample, i.e. one that accurately reflects the
members of the entire population? If not, why should anyone imagine that a
classifier’s improved performance on datasets 1–10, say, should generalise to
imply improved performance on all other (or indeed any other) datasets?
The situation is similar to the world of advertising, where it is common to
see claims such as ‘8 out of 10 women prefer product B to product A’. (The
laws of libel prevent us using more realistic examples in this section.)
Does this claim mean that the advertiser has asked exactly 10 women,
perhaps all close friends, family members or employees? That would not be
very convincing. Why should those 10 speak for all the women of the world?
Even if we restrict ourselves to the aim of speaking for, say, all the women in
Great Britain, it is obvious that just asking 10 people is hopelessly inadequate.
Some advertisements go further and say (e.g.) ‘total number of women asked
= 94’. This is better, but how were the 94 selected? If they were all questioned
on the same Tuesday morning at the same shopping centre, or sports event
say, the bias towards selecting people living in a small geographical area with
particular interests and availability for answering surveys on Tuesday mornings
is surely obvious.
To make any meaningful statement about the views of the female popu-
lation of Great Britain we need to sub-divide the population into a number
of mutually exclusive and homogeneous sub-groups, based on features such as
geographical location, age group and socio-economic status and then ensure we
interview a reasonably large group of women that is broken down in the same
proportions for each sub-group as the overall population. This is known as strat-
ified sampling and is the approach typically adopted by companies conducting
opinion surveys.
Returning to data mining, a natural question to ask when faced with a
table showing the comparative performance of different classifiers on a number
of datasets is how were those datasets selected? It would be good to believe that
they were a carefully selected representative sample of all the world’s datasets,
but that is hardly realistic. Let us suppose that all the datasets were chosen
from a standard repository, such as the UCI one, which was established to
facilitate comparison with the work of previous software developers. Is there
any reason to suppose that they are a representative sample (rather than just
a sample) of all the datasets in the UCI Repository?
It would be possible to attempt to achieve this, although unavoidably im-
precisely, e.g. by choosing a number of datasets that are believed to include a
– B may give better performance for certain types of dataset than A, for ex-
ample where there are many missing values or where there is likely to be a
high proportion of noise present.
Given a performance table such as Figure 15.1 the question that needs to
be addressed is what distinguishes those datasets for which the B value is
greater than the A value from those the other way round. Often there may
be no discernible reason for the differences but, where there is, a valuable new
algorithm for particular types of dataset may have been found.
Dataset   Classifier A   Classifier B
1             74             86
2             69             75
3             80             86
4             67             69
5             84             83
6             87             95
7             69             65
8             74             81
9             78             74
10            72             80
11            75             73
12            72             82
13            70             68
14            75             78
15            80             78
16            84             85
17            79             79
18            79             78
19            63             76
20            75             71
References
[1] Blake, C. L., & Merz, C. J. (1998). UCI repository of machine
learning databases. Irvine: University of California, Department of In-
formation and Computer Science. http://www.ics.uci.edu/~mlearn/
MLRepository.html.
[2] Salzberg, S. L. (1997). On comparing classifiers: pitfalls to avoid and a
recommended approach. Data Mining and Knowledge Discovery, 1, 317–
327. Kluwer.
16
Association Rule Mining I
16.1 Introduction
Classification rules are concerned with predicting the value of a categorical
attribute that has been identified as being of particular importance. In this
chapter we go on to look at the more general problem of finding any rules of
interest that can be derived from a given dataset.
We will restrict our attention to IF . . . THEN . . . rules that have a conjunc-
tion of ‘attribute = value’ terms on both their left- and right-hand sides. We
will also assume that all attributes are categorical (continuous attributes can be
dealt with by discretising them ’globally’ before any of the methods discussed
here are used).
Unlike classification, the left- and right-hand sides of rules can potentially
include tests on the value of any attribute or combination of attributes, subject
only to the obvious constraints that at least one attribute must appear on both
sides of every rule and no attribute may appear more than once in any rule. In
practice data mining systems often place restrictions on the rules that can be
generated, such as the maximum number of terms on each side.
If we have a financial dataset one of the rules extracted might be as follows:
IF Has-Mortgage = yes AND Bank Account Status = In credit
THEN Job Status = Employed AND Age Group = Adult under 65
Rules of this more general kind represent an association between the values
of certain attributes and those of others and are called association rules. The
process of extracting such rules from a given dataset is called association rule
mining (ARM). The term generalised rule induction (or GRI) is also used,
by contrast with classification rule induction. (Note that if we were to apply
the constraint that the right-hand side of a rule has to have only one term
which must be an attribute/value pair for a designated categorical attribute,
association rule mining would reduce to induction of classification rules.)
For a given dataset there are likely to be few if any association rules that
are exact, so we normally associate with each rule a confidence value, i.e. the
proportion of instances matched by its left- and right-hand sides combined as
a proportion of the number of instances matched by the left-hand side on its
own. This is the same measure as the predictive accuracy of a classification
rule, but the term ‘confidence’ is more commonly used for association rules.
Association Rule Mining algorithms need to be able to generate rules with
confidence values less than one. However the number of possible Association
Rules for a given dataset is generally very large and a high proportion of the
rules are usually of little (if any) value. For example, for the (fictitious) financial
dataset mentioned previously, the rules would include the following (no doubt
with very low confidence):
IF Has-Mortgage = yes AND Bank Account Status = In credit
THEN Job Status = Unemployed
This rule will almost certainly have a very low confidence and is obviously
unlikely to be of any practical value.
The main difficulty with association rule mining is computational efficiency.
If there are say 10 attributes, each rule can have a conjunction of up to nine
‘attribute = value’ terms on the left-hand side. Each of the attributes can
appear with any of its possible values. Any attribute not used on the left-hand
side can appear on the right-hand side, also with any of its possible values.
There are a very large number of possible rules of this kind. Generating all of
these is very likely to involve a prohibitive amount of computation, especially
if there are a large number of instances in the dataset.
For a given unseen instance there are likely to be several or possibly many
rules, probably of widely varying quality, predicting different values for any
attributes of interest. A conflict resolution strategy of the kind discussed in
Chapter 11 is needed that takes account of the predictions from all the rules,
plus information about the rules and their quality. However we will concentrate
here on rule generation, not on conflict resolution.
Figure 16.1 Instances matching LEFT, RIGHT and both LEFT and RIGHT
The values NLEFT , NRIGHT , NBOTH and NTOTAL are too basic to be
considered as rule interestingness measures themselves but the values of most
(perhaps all) interestingness measures can be computed from them.
Three commonly used measures are given in Figure 16.2 below. The first
has more than one name in the technical literature.
Confidence (also called Predictive Accuracy or Reliability)
NBOTH /NLEFT
The proportion of instances matched by the left-hand side of the rule for which the right-hand side is also matched

Support
NBOTH /NTOTAL
The proportion of the training set correctly predicted by the rule

Completeness
NBOTH /NRIGHT
The proportion of the matching right-hand sides that are correctly predicted by the rule
We can illustrate this using the financial rule given in Section 16.1.
IF Has-Mortgage = yes AND Bank Account Status = In credit
THEN Job Status = Employed AND Age Group = Adult under 65
Assume that by counting we arrive at the following values:
NLEFT = 65
NRIGHT = 54
NBOTH = 50
NTOTAL = 100
From these we can calculate the values of the three interestingness measures
given in Figure 16.2.
Confidence = NBOTH /NLEFT = 50/65 = 0.77
Support = NBOTH /NTOTAL = 50/100 = 0.5
Completeness = NBOTH /NRIGHT = 50/54 = 0.93
The confidence of the rule is 77%, which may not seem very high. However
it correctly predicts for 93% of the instances in the dataset that match the
right-hand side of the rule and the correct predictions apply to as much as 50%
of the dataset. This seems like a valuable rule.
Amongst the other measures of interestingness that are sometimes used is
discriminability. This measures how well a rule discriminates between one class
and another. It is defined by:
1 − (NLEFT − NBOTH )/(NTOTAL − NRIGHT )
which is
1 − (number of misclassifications produced by the rule)/(number of instances
with other classifications)
If the rule predicts perfectly, i.e. NLEFT = NBOTH , the value of discriminabil-
ity is 1.
For the example given above, the value of discriminability is
1 − (65 − 50)/(100 − 54) = 0.67.
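All four measures can be computed directly from the four counts, as the following sketch shows (the function name is illustrative); the values printed are those of the worked example above.

def interestingness(n_left, n_right, n_both, n_total):
    return {
        "confidence":       n_both / n_left,
        "support":          n_both / n_total,
        "completeness":     n_both / n_right,
        "discriminability": 1 - (n_left - n_both) / (n_total - n_right),
    }

# NLEFT = 65, NRIGHT = 54, NBOTH = 50, NTOTAL = 100
print(interestingness(65, 54, 50, 100))
# confidence 0.77, support 0.5, completeness 0.93, discriminability 0.67 (approximately)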
Piatetsky-Shapiro [1] has proposed three principal criteria that a rule interestingness measure should satisfy.
Criterion 1
The measure should be zero if NBOTH = (NLEFT × NRIGHT )/NTOTAL
Interestingness should be zero if the antecedent and the consequent are
statistically independent (as explained below).
Criterion 2
The measure should increase monotonically with NBOTH
Criterion 3
The measure should decrease monotonically with each of NLEFT and
NRIGHT
For criteria 2 and 3, it is assumed that all other parameters are fixed.
The second and third of these are more easily explained than the first.
Criterion 2 states that if everything else is fixed the more right-hand sides
that are correctly predicted by a rule the more interesting it is. This is clearly
reasonable.
Criterion 3 states that if everything else is fixed
(a) the more instances that match the left-hand side of a rule the less
interesting it is.
(b) the more instances that match the right-hand side of a rule the less
interesting it is.
The purpose of (a) is to give preference to rules that correctly predict a given
number of right-hand sides from as few matching left-hand sides as possible (for
a fixed value of NBOTH , the smaller the value of NLEFT the better).
The purpose of (b) is to give preference to rules that predict right-hand sides
that are relatively infrequent (because predicting common right-hand sides is
easier to do).
Criterion 1 is concerned with the situation where the antecedent and the con-
sequent of a rule (i.e. its left- and right-hand sides) are independent. How many
right-hand sides would we expect to predict correctly just by chance?
We know that the number of instances in the dataset is NTOTAL and that
the number of those instances that match the right-hand side of the rule is
NRIGHT . So if we just predicted a right-hand side without any justification
whatever we would expect our prediction to be correct for NRIGHT instances
out of NTOTAL , i.e. a proportion of NRIGHT /NTOTAL times.
If we predicted the same right-hand side NLEFT times (one for each instance
that matches the left-hand side of the rule), we would expect that purely by
chance our prediction would be correct NLEFT × NRIGHT /NTOTAL times.
By definition the number of times that the prediction actually turns out
to be correct is NBOTH . So Criterion 1 states that if the number of correct
predictions made by the rule is the same as the number that would be expected
by chance the rule interestingness is zero.
Piatetsky-Shapiro proposed a further rule interestingness measure called
RI, as the simplest measure that meets his three criteria. This is defined by:
RI = NBOTH − (NLEFT × NRIGHT /NTOTAL )
RI measures the difference between the actual number of matches and the
expected number if the left- and right-hand sides of the rule were independent.
Generally the value of RI is positive. A value of zero would indicate that the
rule is no better than chance. A negative value would imply that the rule is
less successful than chance.
The RI measure satisfies all three of Piatetsky-Shapiro’s criteria.
Criterion 1 RI is zero if NBOTH = (NLEFT × NRIGHT )/NTOTAL
Figure 16.4 Rule Interestingness Values for Rules Derived from chess Dataset (NTOTAL = 647)
We can now return briefly to the subject of conflict resolution, when several
rules predict different values for one or more attributes of interest for an unseen
test instance. Rule interestingness measures give one approach to handling this.
For example we might decide to use only the rule with the highest interesting-
ness value, or the most interesting three rules, or more ambitiously we might
decide on a ‘weighted voting’ system that adjusts for the interestingness value
or values of each rule that fires.
The J-measure was introduced into the data mining literature by Smyth and
Goodman [2], as a means of quantifying the information content of a rule that
is soundly based on theory. Justifying the formula is outside the scope of this
book, but calculating its value is straightforward.
Given a rule of the form If Y = y, then X = x using Smyth and Goodman’s
notation, the information content of the rule, measured in bits of information,
is denoted by J(X; Y = y), which is called the J-measure for the rule.
The value of the J-measure is the product of two terms:
– p(y) The probability that the left-hand side (antecedent) of the rule will
occur
– j(X; Y = y) The j-measure (note the small letter ‘j’) or cross-entropy.
The cross-entropy term is defined by the equation:
j(X; Y = y) = p(x|y).log2(p(x|y)/p(x)) + (1 − p(x|y)).log2((1 − p(x|y))/(1 − p(x)))
The value of cross-entropy depends on two values:
– p(x) The probability that the right-hand side (consequent) of the rule will
be satisfied if we have no other information (called the a priori probability
of the rule consequent)
– p(x|y) The probability that the right-hand side of the rule will be satisfied if
we know that the left-hand side is satisfied (read as ‘probability of x given
y’).
A plot of the j-measure for various values of p(x) is given in Figure 16.5.
In terms of the basic measures introduced in Section 16.2:
p(y) = NLEFT /NTOTAL
p(x) = NRIGHT /NTOTAL
p(x|y) = NBOTH /NLEFT
The J-measure has two helpful properties concerning upper bounds. First, it can be shown that the value of J(X; Y = y) is less than or equal to p(y).log2(1/p(y)).
The maximum value of this expression, given when p(y) = 1/e, is log2 e/e,
which is approximately 0.5307 bits.
Second (and more important), it can be proved that the J value of any rule
obtained by specialising a given rule by adding further terms is bounded by
the value
Jmax = p(y).max{p(x|y).log2(1/p(x)), (1 − p(x|y)).log2(1/(1 − p(x)))}
Thus if a given rule is known to have a J value of, say, 0.352 bits and the
value of Jmax is also 0.352, there is no benefit to be gained (and possibly harm
to be done) by adding further terms to the left-hand side, as far as information
content is concerned.
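A sketch of the J-measure and its Jmax bound computed from the four counts, assuming 0 < p(x) < 1 (the function names are illustrative):

import math

def j_measure(n_left, n_right, n_both, n_total):
    p_y = n_left / n_total             # probability of the antecedent
    p_x = n_right / n_total            # a priori probability of the consequent
    p_x_given_y = n_both / n_left

    def term(p, q):
        # p.log2(p/q), taking the term to be zero when p is zero
        return 0.0 if p == 0 else p * math.log2(p / q)

    cross_entropy = term(p_x_given_y, p_x) + term(1 - p_x_given_y, 1 - p_x)
    j = p_y * cross_entropy
    j_max = p_y * max(p_x_given_y * math.log2(1 / p_x),
                      (1 - p_x_given_y) * math.log2(1 / (1 - p_x)))
    return j, j_max

# J and Jmax (in bits) for the financial rule used earlier in the chapter
print(j_measure(65, 54, 50, 100))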
We will come back to this topic in the next section.
There are many ways in which we can search a given search space, i.e. generate
all the rules of interest and calculate their quality measures. In this section we
will describe a method that takes advantage of the properties of the J-measure.
To simplify the description we will assume that there are ten attributes
a1, a2, . . . , a10 each with three possible values 1, 2 and 3. The search space
comprises rules with just one term on the right-hand side and up to nine terms
on the left-hand side.
In this case the beam width is 20. It is not necessary for the beam width to be
a fixed value. For example it might start at 50 when expanding rules of order
one then reduce progressively for rules of higher orders.
It is important to appreciate that using a beam search technique to reduce
the number of rules generated is a heuristic, i.e. a ‘rule of thumb’ that is not
guaranteed to work correctly in every case. It is not necessarily the case that
the best rules of order K are all specialisations of the best rules of order K − 1.
The second method of reducing the number of rules to be generated is
guaranteed always to work correctly and relies on one of the properties of the
J-measure.
Let us suppose that the last entry in the ‘best N rules table’ (i.e. the entry
with lowest J-value in the table) has a J-value of 0.35 and we have a rule with
two terms, say
IF a3 = 3 AND a6 = 2 THEN a2 = 2
which has a J-value of 0.28.
In general specialising a rule by adding a further term can either increase
or decrease its J-value. So even if the order 3 rule
IF a3 = 3 AND a6 = 2 AND a8 = 1 THEN a2 = 2
has a lower J-value, perhaps 0.24, it is perfectly possible that adding a fourth
term could give a higher J-value that will put the rule in the top N .
A great deal of unnecessary calculation can be avoided by using the Jmax
value described in Section 16.4.1. As well as calculating the J-value of the rule
IF a3 = 3 AND a6 = 2 THEN a2 = 2
which was given previously as 0.28, let us assume that we also calculate its
Jmax value as 0.32. This means that no further specialisation of the rule by
adding terms to the left-hand side can produce a rule (for the same right-hand
side) with a J-value larger than 0.32. This is less than the minimum of 0.35
needed for the expanded form of the rule to qualify for the best N rules table.
Hence the order 2 form of the rule can safely be discarded.
Combining a beam search with rule ‘pruning’ using the Jmax value can
make generating rules from even quite a large dataset computationally feasible.
In the next chapter we look at the problem of generating association rules
for market basket analysis applications, where the datasets are often huge, but
the rules take a restricted form.
2. Given a dataset with four attributes w, x, y and z, each with three values,
how many rules can be generated with one term on the right-hand side?
References
[1] Piatetsky-Shapiro, G. (1991). Discovery, analysis and presentation of strong
rules. In G. Piatetsky-Shapiro & W. J. Frawley (Eds.), Knowledge discovery
in databases (pp. 229–248). Menlo Park: AAAI Press.
[2] Smyth, P., & Goodman, R. M. (1992). Rule induction using information
theory. In G. Piatetsky-Shapiro & W. J. Frawley (Eds.), Knowledge discov-
ery in databases (pp. 159–176). Menlo Park: AAAI Press.
17
Association Rule Mining II
17.1 Introduction
This chapter is concerned with a special form of Association Rule Mining, which
is known as Market Basket Analysis. The rules generated for Market Basket
Analysis are all of a certain restricted kind.
Here we are interested in any rules that relate the purchases made by cus-
tomers in a shop, frequently a large store with many thousands of products, as
opposed to those that predict the purchase of one particular item. Although in
this chapter ARM will be described in terms of this application, the methods
described are not restricted to the retail industry. Other applications of the
same kind include analysis of items purchased by credit card, patients’ medical
records, crime data and data from satellites.
For convenience we write the items in an itemset in the order in which they
appear in set I, the set of all possible items, i.e. {a, b, c} not {b, c, a}.
All itemsets are subsets of I. We do not count the empty set as an itemset
and so an itemset can have anything from 1 up to m members.
cd → e
The arrow is read as ‘implies’, but we must be careful not to interpret this
as meaning that buying c and d somehow causes e to be bought. It is better to
think of rules in terms of prediction: if we know that c and d were bought we
can predict that e was also bought.
The rule cd → e is typical of most if not all of the rules used in Association
Rule Mining in that it is not invariably correct. The rule is satisfied for trans-
actions 2, 4 and 7 in Figure 17.1, but not for transaction 6, i.e. it is satisfied in
75% of cases. For basket analysis it might be interpreted as ‘if bread and milk
are bought, then cheese is bought too in 75% of cases’.
Note that the presence of items c, d and e in transactions 2, 4, and 7 can
also be used to justify other rules such as
c → ed
and
e → cd
For the rule cd → e we have L = {c, d}, R = {e} and L ∪ R = {c, d, e}. We
can count the number of transactions in the database that are matched by the
first two itemsets. Itemset L matches four transactions, numbers 2, 4, 6 and 7,
and itemset L ∪ R matches 3 transactions, numbers 2, 4 and 7, so count(L) = 4
and count(L ∪ R) = 3.
As there are 8 transactions in the database we can calculate
support(L) = count(L)/8 = 4/8
and
support(L ∪ R) = count(L ∪ R)/8 = 3/8
A large number of rules can be generated from even quite a small database
and we are generally only interested in those that satisfy given criteria for
interestingness. There are many ways in which the interestingness of a rule can
be measured, but the two most commonly used are support and confidence.
The justification for this is that there is little point in using rules that only
apply to a small proportion of the database or that predict only poorly.
The support for a rule L → R is the proportion of the database to which
the rule successfully applies, i.e. the proportion of transactions in which the
items in L and the items in R occur together. This value is just the support
for itemset L ∪ R, so we have
support(L → R) = support(L ∪ R).
The predictive accuracy of the rule L → R is measured by its confidence,
defined as the proportion of transactions for which the rule is satisfied. This can
be calculated as the number of transactions matched by the left-hand and right-
hand sides combined, as a proportion of the number of transactions matched
by the left-hand side on its own, i.e. count(L ∪ R)/count(L).
Ideally, every transaction matched by L would also be matched by L ∪ R,
in which case the value of confidence would be 1 and the rule would be called
exact, i.e. always correct. In practice, rules are generally not exact, in which
case count(L ∪ R) < count(L) and the confidence is less than 1.
Since the support count of an itemset is its support multiplied by the total
number of transactions in the database, which is a constant value, the confi-
dence of a rule can be calculated either by
confidence(L → R) = count(L ∪ R)/count(L)
or by
confidence(L → R) = support(L ∪ R)/support(L)
It is customary to reject any rule for which the support is below a minimum
threshold value called minsup, typically 0.01 (i.e. 1%) and also to reject all rules
with confidence below a minimum threshold value called minconf, typically 0.8
(i.e. 80%).
For the rule cd → e, the confidence is count({c, d, e})/count({c, d}), which
is 3/4 = 0.75.
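These definitions translate directly into code. The transactions below are a hypothetical set that is merely consistent with the counts quoted for the rule cd → e (itemset {c, d} matches transactions 2, 4, 6 and 7 and {c, d, e} matches transactions 2, 4 and 7); they are not Figure 17.1 itself.

transactions = [
    {"a", "b"}, {"c", "d", "e"}, {"a", "c"}, {"b", "c", "d", "e"},
    {"b", "e"}, {"c", "d"}, {"a", "c", "d", "e"}, {"a", "b", "d"},
]

def count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)   # itemset is a subset of t

def support(itemset, transactions):
    return count(itemset, transactions) / len(transactions)

def confidence(left, right, transactions):
    return count(left | right, transactions) / count(left, transactions)

L, R = {"c", "d"}, {"e"}
print(support(L | R, transactions))     # 3/8 = 0.375
print(confidence(L, R, transactions))   # 3/4 = 0.75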
17.6 Apriori
This account is based on the very influential Apriori algorithm by Agrawal
and Srikant [1], which showed how association rules could be generated in a
realistic timescale, at least for relatively small databases. Since then a great
deal of effort has gone into looking for improvements on the basic algorithm to
enable larger and larger databases to be processed.
The method relies on the following very important result.
Theorem 1
If an itemset is supported, all of its (non-empty) subsets are also supported.
Proof
Removing one or more of the items from an itemset cannot reduce and
will often increase the number of transactions that it matches. Hence the
support for a subset of an itemset must be at least as great as that for
the original itemset. It follows that any (non-empty) subset of a supported
itemset must also be supported.
Theorem 2
If Lk = ∅ (the empty set) then Lk+1 , Lk+2 etc. must also be empty.
Proof
If any supported itemsets of cardinality k + 1 or larger exist, they will have
subsets of cardinality k and it follows from Theorem 1 that all of these
must be supported. However we know that there are no supported itemsets
of cardinality k as Lk is empty. Hence there are no supported subsets of
cardinality k + 1 or larger and Lk+1 , Lk+2 etc. must all be empty.
We need a method of going from each set Lk−1 to the next Lk in turn. We
can do this in two stages.
First we use Lk−1 to form a candidate set Ck containing itemsets of cardi-
nality k. Ck must be constructed in such a way that it is certain to include all
the supported itemsets of cardinality k but may contain some other itemsets
that are not supported.
Next we need to generate Lk as a subset of Ck . We can generally discard
some of the members of Ck as possible members of Lk by inspecting the mem-
bers of Lk−1 . The remainder need to be checked against the transactions in the
database to establish their support values. Only those itemsets with support
greater than or equal to minsup are copied from Ck into Lk .
This gives us the Apriori algorithm for generating all the supported itemsets
of cardinality at least 2 (Figure 17.2).
To start the process we construct C1 , the set of all itemsets comprising just
a single item, then make a pass through the database counting the number of
transactions that match each of these itemsets. Dividing each of these counts
by the number of transactions in the database gives the value of support for
each single-element itemset. We discard all those with support < minsup to
give L1 .
The process involved can be represented diagrammatically as Figure 17.3,
continuing until Lk is empty.
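The overall process can be sketched as follows. The candidate-generation function here is a deliberately simple stand-in for Agrawal and Srikant's apriori-gen, a closer sketch of which appears below; all the names are illustrative.

from itertools import combinations

def generate_candidates(prev_level, k):
    """Form candidate itemsets of cardinality k by joining pairs of supported
    itemsets of cardinality k-1, then discard any candidate that has an
    unsupported subset of cardinality k-1 (Theorem 1)."""
    candidates = {a | b for a in prev_level for b in prev_level if len(a | b) == k}
    return {c for c in candidates
            if all(frozenset(s) in prev_level for s in combinations(c, k - 1))}

def apriori(transactions, minsup):
    """Return every supported itemset (as a frozenset) with its support value."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}          # C1
    supported, k = {}, 1
    while candidates:
        # one pass through the database per level of candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: count / n for c, count in counts.items() if count / n >= minsup}
        supported.update(level)                                 # Lk
        k += 1
        candidates = generate_candidates(set(level), k)         # Ck for the next pass
    return supported

With the eight hypothetical transactions used earlier and minsup = 0.25, apriori(transactions, 0.25) would return the supported itemsets of all cardinalities; those of cardinality one can be filtered out afterwards if only itemsets of cardinality at least two are wanted.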
Agrawal and Srikant’s paper also gives an algorithm Apriori-gen which
takes Lk−1 and generates Ck without using any of the earlier sets Lk−2 etc.
There are two stages to this. These are given in Figure 17.4.
To illustrate the method, let us assume that L4 is the list
{{p, q, r, s}, {p, q, r, t}, {p, q, r, z}, {p, q, s, z}, {p, r, s, z}, {q, r, s, z},
{r, s, w, x}, {r, s, w, z}, {r, t, v, x}, {r, t, v, z}, {r, t, x, z}, {r, v, x, y},
{r, v, x, z}, {r, v, y, z}, {r, x, y, z}, {t, v, x, z}, {v, x, y, z}}
which contains 17 itemsets of cardinality four.
The ‘join’ step generates the following six itemsets of cardinality five as the initial version of candidate set C5:
{{p, q, r, s, t}, {p, q, r, s, z}, {p, q, r, t, z}, {r, s, w, x, z}, {r, t, v, x, z}, {r, v, x, y, z}}
We can eliminate the first, third and fourth itemsets from C5 at the ‘prune’ step, as each has a subset of cardinality four that is not in L4, making the final version of candidate set C5
{{p, q, r, s, z}, {r, t, v, x, z}, {r, v, x, y, z}}
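A sketch of the two stages of apriori-gen, applied to the L4 listed above; itemsets are held as sorted tuples so that the ‘all but the last element’ comparison used at the join step is easy to express. This is an illustration of the idea rather than Agrawal and Srikant's own formulation.

from itertools import combinations

def apriori_gen(L_prev):
    """Generate the candidate itemsets of cardinality k from the supported
    itemsets of cardinality k-1, which are held as sorted tuples."""
    L_prev = sorted(L_prev)
    k = len(L_prev[0]) + 1
    # Join step: combine pairs of itemsets that agree on all but their last element
    joined = [a + (b[-1],) for a in L_prev for b in L_prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]]
    # Prune step: discard candidates with an unsupported subset of cardinality k-1
    prev_set = set(L_prev)
    return [c for c in joined
            if all(subset in prev_set for subset in combinations(c, k - 1))]

L4 = [("p", "q", "r", "s"), ("p", "q", "r", "t"), ("p", "q", "r", "z"),
      ("p", "q", "s", "z"), ("p", "r", "s", "z"), ("q", "r", "s", "z"),
      ("r", "s", "w", "x"), ("r", "s", "w", "z"), ("r", "t", "v", "x"),
      ("r", "t", "v", "z"), ("r", "t", "x", "z"), ("r", "v", "x", "y"),
      ("r", "v", "x", "z"), ("r", "v", "y", "z"), ("r", "x", "y", "z"),
      ("t", "v", "x", "z"), ("v", "x", "y", "z")]
print(apriori_gen(L4))
# [('p', 'q', 'r', 's', 'z'), ('r', 't', 'v', 'x', 'z'), ('r', 'v', 'x', 'y', 'z')]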
As an example, suppose the database contains 100 items, so that C1 comprises 100 single-item itemsets. Let us assume that L1 has just 8 members, namely {a}, {b}, {c},
{d}, {e}, {f }, {g} and {h}. We cannot generate any rules from these, as they
only have one element, but we can now form candidate itemsets of cardinality
two.
In generating C2 from L1 all pairs of (single-item) itemsets in L1 are con-
sidered to match at the ‘join’ step, since there is nothing to the left of the
rightmost element of each one that might fail to match.
In this case the candidate generation algorithm gives us as members of C2
all the itemsets with two members drawn from the eight items a, b, c, . . . ,
h. Note that it would be pointless for a candidate itemset of two elements to
include any of the other 92 items from the original set of 100, e.g. {a, z}, as
one of its subsets would be {z}, which is not supported.
There are 28 possible itemsets of cardinality 2 that can be formed from the
items a, b, c, . . . , h. They are
{a, b}, {a, c}, {a, d}, {a, e}, {a, f }, {a, g}, {a, h},
{b, c}, {b, d}, {b, e}, {b, f }, {b, g}, {b, h},
{c, d}, {c, e}, {c, f }, {c, g}, {c, h},
{d, e}, {d, f }, {d, g}, {d, h},
{e, f }, {e, g}, {e, h},
{f, g}, {f, h},
{g, h}.
As mentioned previously, it is convenient always to list the elements of an
itemset in a standard order. Thus we do not include, say, {e, d} because it is
the same set as {d, e}.
We now need to make a second pass through the database to find the
support counts of each of these itemsets, then divide each of the counts by
the number of transactions in the database and reject any itemsets that have
support less than minsup. Assume in this case that only 6 of the 28 itemsets
with two elements turn out to be supported, so L2 = {{a, c}, {a, d}, {a, h},
{c, g}, {c, h}, {g, h}}.
The algorithm for generating C3 now gives just four members, i.e. {a, c, d},
{a, c, h}, {a, d, h} and {c, g, h}.
Before going to the database, we first check whether each of the candidates
meets the condition that all its subsets are supported. Itemsets {a, c, d} and
{a, d, h} fail this test, because their subsets {c, d} and {d, h} are not members
of L2 . That leaves just {a, c, h} and {c, g, h} as possible members of L3 .
We now need a third pass through the database to find the support counts
for itemsets {a, c, h} and {c, g, h}. We will assume they both turn out to be
supported, so L3 = {{a, c, h}, {c, g, h}}.
We now need to calculate C4 . It has no members, as the two members of L3
do not have their first two elements in common. As C4 is empty, L4 must also
be empty, which implies that L5 , L6 etc. must also be empty and the process
ends.
We have found all the itemsets of cardinality at least two with just three
passes through the database. In doing so we needed to find the support counts
for just 100 + 28 + 2 = 130 itemsets, which is a huge improvement on checking
through the total number of possible itemsets for 100 items, which is approximately 10^30.
The set of all supported itemsets with at least two members is the union
of L2 and L3 , i.e. {{a, c}, {a, d}, {a, h}, {c, g}, {c, h}, {g, h}, {a, c, h}, {c, g, h}}.
It has eight itemsets as members. We next need to generate the candidate rules
from each of these and determine which of them have a confidence value greater
than or equal to minconf.
Although using the Apriori algorithm is clearly a significant step forward,
it can run into substantial efficiency problems when there are a large number
of transactions, items or both. One of the main problems is the large number
of candidate itemsets generated during the early stages of the process. If the
number of supported itemsets of cardinality one (the members of L1 ) is large,
say N , the number of candidate itemsets in C2 , which is N (N − 1)/2, can be
a very large number.
A fairly large (but not huge) database may comprise over 1,000 items and
100,000 transactions. If there are, say, 800 supported itemsets in L1 , the number
of itemsets in C2 is 800 × 799/2, which is approximately 320,000.
Since Agrawal and Srikant’s paper was published a great deal of research
effort has been devoted to finding more efficient ways of generating supported
itemsets. These generally involve reducing the number of passes through all
the transactions in the database, reducing the number of unsupported itemsets
in Ck , more efficient counting of the number of transactions matched by each
of the itemsets in Ck (perhaps using information collected in previous passes
through the database), or some combination of these.
For itemset {c, d, e} there are 6 possible rules that can be generated, as
listed below.
Rule L → R count(L ∪ R) count(L) confidence(L → R)
de → c 3 3 1.0
ce → d 3 4 0.75
cd → e 3 4 0.75
e → cd 3 4 0.75
d → ce 3 4 0.75
c → de 3 7 0.43
Only one of the rules has a confidence value greater than or equal to minconf
(i.e. 0.8).
The number of ways of selecting i items from the k in a supported itemset of cardinality k for the right-hand side of a rule is denoted by the mathematical expression kCi, which has the value k!/((k − i)! i!).
The total number of possible right-hand sides R, and thus the total number of possible rules that can be constructed from an itemset L ∪ R of cardinality k, is kC1 + kC2 + · · · + kCk−1. It can be shown that the value of this sum is 2^k − 2.
Assuming that k is reasonably small, say 10, this number is manageable.
For k = 10 there are 2^10 − 2 = 1022 possible rules. However as k becomes larger
the number of possible rules rapidly increases. For k = 20 it is 1,048,574.
Fortunately we can reduce the number of candidate rules considerably using
the following result.
Theorem 3
Transferring members of a supported itemset from the left-hand side of a
rule to the right-hand side cannot increase the value of rule confidence.
Proof
For this purpose we will write the original rule as A ∪ B → C, where sets
A, B and C all contain at least one element, have no elements in common
and the union of the three sets is the supported itemset S.
Transferring the item or items in B from the left to the right-hand side then
amounts to creating a new rule A → B ∪ C.
The union of the left- and right-hand sides is the same for both rules, namely
the supported itemset S, so we have
confidence(A → B ∪ C) = support(S) / support(A)
confidence(A ∪ B → C) = support(S) / support(A ∪ B)
It is clear that the proportion of transactions in the database matched by
an itemset A must be at least as large as the proportion matched by a larger
itemset A ∪ B, i.e. support(A) ≥ support(A ∪ B).
Hence it follows that confidence(A → B ∪ C) ≤ confidence(A ∪ B → C).
If the confidence of a rule ≥ minconf we will call the itemset on its right-
hand side confident. If not, we will call the right-hand itemset unconfident. From
the above theorem we then have two important results that apply whenever
the union of the itemsets on the two sides of a rule is fixed:
Any superset of an unconfident right-hand itemset is unconfident.
Any (non-empty) subset of a confident right-hand itemset is confident.
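These two results can be used to prune the candidate rules generated from a supported itemset, in much the same spirit as the pruning of candidate itemsets. The sketch below is one possible way of doing this in Python; it is not taken from [1]. It assumes a dictionary support_count giving the support count of every supported itemset (stored as frozensets, including those of cardinality one), which the Apriori pass will already have found.

    def rules_from_itemset(itemset, support_count, minconf):
        """Rules L -> R from one supported itemset with confidence >= minconf."""
        itemset = frozenset(itemset)
        if len(itemset) < 2:
            return []
        rules = []
        # Start with right-hand itemsets of cardinality one.
        confident_rhs = [frozenset([x]) for x in itemset
                         if support_count[itemset]
                         / support_count[itemset - frozenset([x])] >= minconf]
        rules += [(itemset - r, r) for r in confident_rhs]
        size = 1
        while confident_rhs and size < len(itemset) - 1:
            size += 1
            # Larger right-hand itemsets are built only from confident ones:
            # any superset of an unconfident right-hand itemset is unconfident.
            candidates = {a | b for a in confident_rhs for b in confident_rhs
                          if len(a | b) == size}
            confident_rhs = [r for r in candidates
                             if support_count[itemset]
                             / support_count[itemset - r] >= minconf]
            rules += [(itemset - r, r) for r in confident_rhs]
        return rules

    # With the support counts of the {c, d, e} example given earlier and
    # minconf = 0.8, only the rule {d, e} -> {c} would be returned.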
lift(L → R) = n × confidence(L → R) / count(R)
where n is the number of transactions in the database, and
lift(L → R) = confidence(R → L) / support(L)
Incidentally, lift can also be written as support(L ∪ R) / (support(L) × support(R)), a form which is symmetric in L and R, so we can also see that
lift(L → R) = lift(R → L)
A further measure, leverage, compares the frequency with which L and R occur together in the same transaction with the frequency that would be expected if L and R were independent. The former is just support(L ∪ R). The frequencies (i.e. supports) of L and R are support(L) and support(R), respectively. If L and R were independent the expected frequency of both occurring in the same transaction would be the product of support(L) and support(R).
This gives a formula for leverage:
leverage(L → R) = support(L ∪ R) − support(L) × support(R)
The value of the leverage of a rule is clearly always less than its support.
The number of rules satisfying the support ≥ minsup and confidence ≥
minconf constraints can be reduced by setting a leverage constraint, e.g.
leverage ≥ 0.0001, corresponding to an improvement in support of one oc-
currence per 10,000 transactions in the database.
If a database has 100,000 transactions and we have a rule L → R with these
support counts
count(L) count(R) count(L ∪ R)
8000 9000 7000
the values of support, confidence, lift and leverage can be calculated to be 0.070,
0.875, 9.722 and 0.063 respectively (all to three decimal places).
So the rule applies to 7% of the transactions in the database and is satisfied
for 87.5% of the transactions that include the items in L. The latter value is
9.722 times more frequent than would be expected by chance. The improvement
in support compared with chance is 0.063, corresponding to 6.3 transactions
per 100 in the database, i.e. approximately 6300 in the database of 100,000
transactions.
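These values are easy to verify directly from the counts, for example with a few throwaway lines of Python:

    n = 100000
    count_L, count_R, count_LR = 8000, 9000, 7000

    support = count_LR / n                                  # 0.07
    confidence = count_LR / count_L                         # 0.875
    lift = n * confidence / count_R                         # 9.722...
    leverage = support - (count_L / n) * (count_R / n)      # 0.0628
    print(round(support, 3), round(confidence, 3),
          round(lift, 3), round(leverage, 3))
    # prints: 0.07 0.875 9.722 0.063

The same calculation applies directly to the counts given in the exercise that follows, once the number of transactions in the database is known.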
count(L) = 3400
count(R) = 4000
count(L ∪ R) = 3000
What are the values of support, confidence, lift and leverage for this rule?
Reference
[1] Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association
rules in large databases. In J. B. Bocca, M. Jarke & C. Zaniolo (Eds.),
Proceedings of the 20th international conference on very large databases
(VLDB94) (pp. 487–499). San Mateo: Morgan Kaufmann. http://
citeseer.nj.nec.com/agrawal94fast.html.
18
Association Rule Mining III: Frequent
Pattern Trees
The aim is to find association rules linking the items in purchases together,
e.g.
eggs, milk → bread, cheese, pork
meaning that transactions that contain eggs and milk generally also include
bread, cheese and pork.
We do this in two stages:
1. Find itemsets such as {eggs, milk, bread} with a sufficiently high value of
support (defined by the user).
2. For each such itemset, extract one or more association rules, with all the
items in the itemset appearing on either the left- or the right-hand side.
This chapter is only concerned with step (1) of this process, i.e. finding the
itemsets. A method for extracting the association rules from the itemsets is
described in Section 17.8 of Chapter 17.
The term used in Chapter 17 for itemsets with a sufficiently high value of
support was supported itemsets. In view of the title of this chapter we will switch
here to using the equivalent term frequent itemsets, which is more commonly
used in the technical literature, although perhaps less meaningful. (We will use
the term frequent itemsets rather than frequent patterns.)
There is one other small change of detail from Chapter 17. In that chapter the
definition of a frequent (or supported) itemset was that the value of the support
count divided by the number of transactions in the database, i.e. the support,
was greater than or equal to a threshold value defined by the user, such as
0.01, called minsup. This is equivalent to saying that the support count must
be greater than or equal to the number of transactions multiplied by the value
of minsup. For a database with a million transactions the value of minsup
multiplied by the number of transactions would be a large number such as
10,000.
In this chapter we will define a frequent itemset as one for which the support
count is greater than or equal to a user-defined integer which we will call
minsupportcount.
These two definitions are clearly equivalent. The value of minsupportcount
will typically be a large integer, but for the example used in the remainder of
this chapter we will set it to the highly unrealistic value of three.
An important result which was established in Chapter 17 is the downward
closure property of itemsets: if an itemset is frequent, any (non-empty) subset
of it is also frequent. This is generally used in a different form: if an itemset
is infrequent then any superset of it must also be infrequent. For example if
{a, b, c, d} is infrequent then {a, b, c, d, e, f} must also be infrequent.
Since even to find the support counts of the single-item itemsets the whole database would have to be scanned at least once, reducing the number of scans to just two is a very valuable feature of this algorithm.
In [1] it is claimed that FP-growth is an order of magnitude faster than
Apriori. Naturally this depends on a number of factors, for example whether
the FP-tree can be represented in a way that is compact enough to fit into main
memory. As with virtually all the algorithms in this book, there are a number of variants of both Apriori and FP-growth that aim to make them less expensive in memory or computation, and there will no doubt be more in the future.
In the following sections the FP-growth algorithm is described and illus-
trated by a series of figures showing the FP-tree corresponding to an example
transaction database, followed by a sequence of conditional FP-trees from which
it is straightforward to extract the frequent itemsets.
To illustrate the process we will use the transaction data from [1]. There are
just five transactions held in a transaction database, with each item represented
by a single letter:
f, a, c, d, g, i, m, p
a, b, c, f, l, m, o
b, f, h, j, o
b, c, k, s, p
a, f, c, e, l, p, m, n
The first step is to make a scan through the transaction database to count
the number of occurrences of each item, which is the same as the support count
of the corresponding single-item itemset. The result is as follows.
f, c: 4
a, m, p, b: 3
l, o: 2
d, g, i, h, j, k, s, e and n: 1
The user now needs to decide on a value for minsupportcount. As the amount
of data is so small, in this example we will use the highly unrealistic value:
minsupportcount = 3.
There are only six items for which the corresponding single-item itemset
has a support count of minsupportcount or more. In descending order of support count these are f, c, a, b, m and p, which are held in an array named orderedItems, shown below.
index orderedItems
0 f
1 c
2 a
3 b
4 m
5 p
As far as extracting frequent itemsets is concerned the items that are not
in the orderedItems array may as well not exist, as they cannot occur in any
frequent itemset. For example, if item g were a member of a frequent itemset
then by the downward closure property of itemsets any non-empty subset of
that itemset would also be frequent, so {g} would have to be frequent, but we
know by counting that it is not.
It is conventional and very important from a computational point of view
that the items in an itemset are written in a fixed order. In the case of
FP-growth they are written in descending order of their position in the
orderedItems array, i.e. in descending order of the number of transactions
in which each of them occurs. Thus {c, a, m} is a valid itemset, which may
be frequent or infrequent, but {m, c, a} and {c, m, a} are invalid. We are
only interested in whether itemsets that are valid in this sense are frequent
or infrequent.
We next make the second and final scan through the transaction database.
As each transaction is read all items that are not in orderedItems are removed
and the remaining items are sorted into descending order (i.e. the order of the
items in orderedItems) before being passed to the FP-tree construction process.
This gives the same effect as if the transaction data were originally the five
transactions
f, c, a, m, p
f, c, a, b, m
f, b
c, b, p
f, c, a, m, p
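The first (counting) scan and the preprocessing applied during the second scan can be sketched in Python as follows. The variable names are illustrative; the orderedItems ordering is taken directly from the text rather than recomputed, since several items tie on a support count of three.

    from collections import Counter

    transactions = [
        ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
        ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
        ['b', 'f', 'h', 'j', 'o'],
        ['b', 'c', 'k', 's', 'p'],
        ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
    ]
    min_support_count = 3

    # First scan: count the occurrences of each item.
    counts = Counter(item for t in transactions for item in t)

    # Items with support count >= minsupportcount, in the order used in the text.
    ordered_items = ['f', 'c', 'a', 'b', 'm', 'p']
    assert all(counts[item] >= min_support_count for item in ordered_items)
    rank = {item: i for i, item in enumerate(ordered_items)}

    # Second scan: drop infrequent items, sort the rest into orderedItems order.
    reduced = [sorted((i for i in t if i in rank), key=rank.get)
               for t in transactions]
    # reduced == [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
    #             ['c','b','p'], ['f','c','a','m','p']]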
18.2.2 Initialisation
index   itemname   count   linkto   parent   child1   child2
0 root
nodes array child array
Item f
As this is the first item for the transaction we take the ‘current node’
to be the root node. In this case the current node does not have a descendant
node with item name f, so a new node for item f is added numbered 1, with
its parent node numbered 0 (indicating the root node) in Figure 18.4. Note
that an item with name f and support count 1 is indicated by f /1 in Fig-
ure 18.3.
Adding a new node numbered N, for an item with name Item with its
parent node numbered P
– A new node numbered N is added to the tree with item name Item and
support count 1 as a descendant of the node numbered P.
– A new row, numbered N, is added to the nodes array with itemname,
count and parent values Item, 1 and P respectively. The first unused
child value for node P is set to N.
– The value of the row with index Item in both array startlink and array
endlink is set to N.
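This operation can be sketched in Python using lists and dictionaries in place of the book's fixed-size arrays. The names add_node and children are illustrative; the sketch also covers the case, described in a box a little further on, where endlink already holds a value for the item and a 'dashed line' link has to be made.

    # nodes[i] holds the itemname, count and parent of node number i; children
    # maps each node number to the numbers of its descendants; startlink and
    # endlink record the first and last node created for each item.
    nodes = [{'itemname': 'root', 'count': None, 'parent': None}]   # node 0: root
    children = {0: []}
    startlink, endlink = {}, {}

    def add_node(item, parent):
        """Add a new node for item as a descendant of the node numbered parent."""
        n = len(nodes)                        # number allocated to the new node
        nodes.append({'itemname': item, 'count': 1, 'parent': parent})
        children[n] = []
        children[parent].append(n)
        if item not in endlink:
            # First node for this item: startlink and endlink both point to it.
            startlink[item] = endlink[item] = n
        else:
            # A node for this item already exists: make a 'dashed line' link
            # from the previous last node for the item to the new one.
            nodes[endlink[item]]['linkto'] = n
            endlink[item] = n
        return n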
Item c
The current node is now node 1, which does not have a descendant node with
item name c, so a new node is added numbered 2, for item c with its parent
node numbered 1.
Item a
The current node is now node 2, which does not have a descendant node with
item name a, so a new node is added numbered 3, for item a with its parent
node numbered 2.
Item m
The current node is now node 3, which does not have a descendant node with
item name m, so a new node is added numbered 4, for item m with its parent
node numbered 3.
Item p
The current node is now node 4, which does not have a descendant node with
item name p, so a new node is added numbered 5, for item p with its parent
node numbered 4.
This gives the partial tree and corresponding tables shown below.
index   itemname   count   linkto   parent   child1   child2
0 root 1
1 f 1 0 2
2 c 1 1 3
3 a 1 2 4
4 m 1 3 5
5 p 1 4
nodes array child array
Items f, c and a
There is already a chain of nodes from the root to f, c, and a nodes in turn,
so no changes are needed except to increase the counts of nodes 1, 2 and 3 and
the corresponding rows of array nodes by one, giving Figures 18.5 and 18.6.
index   itemname   count   linkto   parent   child1   child2
0 root 1
1 f 2 0 2
2 c 2 1 3
3 a 2 2 4
4 m 1 3 5
5 p 1 4
nodes array child array
Item b
There is no descendant of the current node (the last node accessed), i.e. node 3,
that has item name b, so a new node numbered 6 is added for item b with its
parent node numbered 3 (Figures 18.7 and 18.8).
index   itemname   count   linkto   parent   child1   child2
0 root 1
1 f 2 0 2
2 c 2 1 3
3 a 2 2 4 6
4 m 1 3 5
5 p 1 4
6 b 1 3
nodes array child array
Item m
A new node numbered 7 is added for item m with its parent node numbered 6.
For the first time in this example the endlink array has a non-null value for
a newly added node, as endlink[m] is 4. Because of this, a dashed line link is
made from node 4 to node 7 for item m (Figures 18.9 and 18.10).
Making a ‘dashed line’ link for item Item across the tree from node A
to node B
The linkto value in row A of the nodes array and the value of endlink[Item]
are both set to B.
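In the list-and-dictionary sketch used earlier this operation amounts to just two assignments (again illustrative only, not the book's code):

    def make_link(nodes, endlink, item, a, b):
        """Make a 'dashed line' link for item across the tree from node a to node b."""
        nodes[a]['linkto'] = b
        endlink[item] = b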
index   itemname   count   linkto   parent   child1   child2
0 root 1
1 f 2 0 2
2 c 2 1 3
3 a 2 2 4 6
4 m 1 7 3 5
5 p 1 4
6 b 1 3 7
7 m 1 6
nodes array child array
Item f
The count value for node 1 in the tree and row 1 in the nodes array are both
increased by 1.
Item b
There is no descendant of the current node, node 1, with item name b so a new
node numbered 8 is added for item b with its parent node numbered 1.
The endlink array has a non-null value for the new node, as endlink[b] is 6.
A dashed line link is made from node 6 to node 8 for item b (Figures 18.11
and 18.12).
index   itemname   count   linkto   parent   child1   child2
0 root 1
1 f 3 0 2 8
2 c 2 1 3
3 a 2 2 4 6
4 m 1 7 3 5
5 p 1 4
6 b 1 8 3 7
7 m 1 6
8 b 1 1
nodes array child array
Item c
The current node (the root node) does not have a descendant node with item
name c, so a new node is added numbered 9, for item c with its parent node
numbered 0 (indicating the root node). A dashed line link is made from node
2 to node 9.
Item b
The current node is now node 9, which does not have a descendant node with
item name b, so a new node is added numbered 10, for item b with its parent
node numbered 9. A dashed line link is made from node 8 to node 10.
Item p
The current node is now node 10, which does not have a descendant node
with item name p, so a new node is added numbered 11, for item p with its
parent node numbered 10. A dashed line link is made from node 5 to node 11
(Figures 18.13 and 18.14).
index   itemname   count   linkto   parent   child1   child2
0 root 1 9
1 f 3 0 2 8
2 c 2 9 1 3
3 a 2 2 4 6
4 m 1 7 3 5
5 p 1 11 4
6 b 1 8 3 7
7 m 1 6
8 b 1 10 1
9 c 1 0 10
10 b 1 9 11
11 p 1 10
nodes array child array
The fifth and final transaction, f, c, a, m, p, follows the existing chain of nodes 1 to 5 from the root, so the counts of those nodes (and of the corresponding rows of the nodes array) are each increased by one, giving the final version of the FP-tree and its arrays.
index   itemname   count   linkto   parent   child1   child2
0 root 1 9
1 f 4 0 2 8
2 c 3 9 1 3
3 a 3 2 4 6
4 m 2 7 3 5
5 p 2 11 4
6 b 1 8 3 7
7 m 1 6
8 b 1 10 1
9 c 1 0 10
10 b 1 9 11
11 p 1 10
nodes array child array
Once the FP-tree has been created arrays child and endlink can be dis-
carded. The contents of the tree are fully represented by arrays nodes and
startlink.
We will illustrate the process by a series of diagrams and describe how the
frequent itemset extraction process can be implemented in a recursive fashion
by constructing a number of tables that are equivalent to reduced versions of
the FP-tree.
We start by observing some general points.
– The dashed lines (links) in Figure 18.15 are not part of the tree itself (if there
were links across the tree it would no longer be a tree structure). Rather,
they are a way of keeping track of all the nodes with a particular name, e.g.
b, wherever they occur in the tree. This will be very useful in what follows.
– The items used to label the nodes in each branch of the tree from the root
downwards are always in the same order as the items in the orderedItems
array, i.e. f, c, a, b, m, p. This is descending order of the support counts of the
corresponding itemsets (e.g. {f }) in the transaction database, or equivalently
the order of the items in the orderedItems array, which is repeated as Fig-
ure 18.17. (Not every branch of the tree includes all six of the items.)
– Although the nodes in Figure 18.15 are labelled with the names c, m, p
etc. these are just the rightmost items in the itemsets to which the nodes
correspond. Thus nodes 1, 2, 3, 4 and 5 correspond to the itemsets {f }, {f, c},
{f, c, a}, {f, c, a, m} and {f, c, a, m, p} respectively.
The orderedItems array is repeated here for convenience as Figure 18.17.
index orderedItems
0 f
1 c
2 a
3 b
4 m
5 p
The process of extracting all the frequent itemsets from the FP-tree is es-
sentially a recursive one which can be represented by a call to a recursively-
defined function findFrequent that takes four arguments:
– Two arrays representing the tree. Initially these are arrays nodes and
startlink, corresponding to the original FP-tree. For future calls to the
function these will be replaced by arrays nodes2 and startlink2 corre-
sponding to a conditional FP-tree, as will be explained subsequently.
– Integer variable lastitem, which initially is set to the number of elements
in the orderedItems array (6 in this example).
– A set named originalItemset, which is initially empty, i.e. {}.
We will start with an ‘original itemset’ with no members, i.e. {} and generate
all possible one-item itemsets derived from it by adding a new item to its
leftmost position in ascending order of the elements of orderedItems, i.e. {p},
{m}, {b}, {a}, {c} and {f } in that order1 . For each of the itemsets that is
frequent2 , say {m}, we next examine itemsets with an additional item in the
leftmost position, e.g. {b, m}, {a, m} or {c, m} to find any that are frequent.
Note that the additional item must be above m in the orderedItems array to
preserve the conventional ordering of the items in an itemset. If we find a
frequent itemset, e.g. {a, m}, we next construct itemsets with a further item
in the leftmost position, e.g. {c, a, m}, check whether each one is frequent and
so on. The effect is that having found a single-item itemset that is frequent we
will go on to find all the frequent itemsets that end in the corresponding item
before examining the next single-item itemset.
Constructing new itemsets by adding one new item at a time to the left,
maintaining the same order as in the orderedItems array, is a very efficient
way of proceeding. Having established that say {c, a} is frequent, the only
other itemset that needs checking is {f, c, a} as f is the only item above c
in orderedItems. It may be true (and it is true in this case) that some other
itemset such as {c, a, m} is also frequent but that will already have been dealt
with at another stage.
Examining itemsets in this order also takes advantage of the downward
closure property of itemsets. If we find that an itemset, say {b, m}, is infrequent there is no point in examining any other itemsets with further items added. If any of them, say {f, c, b, m}, were frequent then by the downward closure property {b, m} must be too, but we already know that it is not.
1 This rather convoluted way of describing the generation of the itemsets {p}, {m}, {b}, {a}, {c} and {f} is for consistency with the description of the generation of two-item, three-item etc. itemsets that follows.
2 All the single-item itemsets must inevitably be frequent, as the items in the initial tree were selected from those in the transaction database on that basis. However this will often not be the case as we go on to use findFrequent recursively to analyse reduced versions of the FP-tree.
This strategy for generating frequent itemsets can be implemented in
function findFrequent by a loop for variable thisrow through values from
lastitem-1 down to zero.
– We set variable nextitem to orderedItems[thisrow] and then set firstlink
to startlink[nextitem].
– If firstlink is null we go on to the next value of thisrow.
– Otherwise we set variable thisItemset to be an expanded version of orig-
inalItemset with item nextitem as its leftmost item and then call func-
tion condfptree which takes four arguments: nodes, firstlink, thisrow and
thisItemset.
– Function condfptree first sets variable lastitem to the value of thisrow. It
then checks whether thisItemset is frequent. If it is, it goes on to generate
a conditional FP-tree for that itemset in the form of arrays nodes2 and
startlink2 and then calls findFrequent recursively with the two replace-
ment arrays, together with lastitem and thisItemset, as arguments.
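The recursion just outlined can be sketched in simplified form as below. This is not the author's implementation: it uses plain Python dictionaries for the tree nodes instead of the nodes/startlink arrays, and builds each conditional FP-tree by collecting the prefix path of every node for the current item, with each path inheriting that node's support count. The name find_frequent merely echoes findFrequent.

    MIN_SUPPORT_COUNT = 3                 # the book's minsupportcount

    def build_tree(transactions):
        """Build an FP-tree; return a dict mapping each item to its list of nodes."""
        root = {'item': None, 'count': 0, 'parent': None, 'children': {}}
        links = {}
        for t in transactions:
            node = root
            for item in t:
                child = node['children'].get(item)
                if child is None:
                    child = {'item': item, 'count': 0,
                             'parent': node, 'children': {}}
                    node['children'][item] = child
                    links.setdefault(item, []).append(child)
                child['count'] += 1
                node = child
        return links

    def find_frequent(links, ordered_items, last_item, original_itemset, result):
        for row in range(last_item - 1, -1, -1):   # loop from lastitem-1 down to 0
            item = ordered_items[row]
            item_nodes = links.get(item, [])
            support = sum(n['count'] for n in item_nodes)
            if support < MIN_SUPPORT_COUNT:
                continue                            # itemset is infrequent
            itemset = [item] + original_itemset     # new item in leftmost position
            result.append((tuple(itemset), support))
            # Build the conditional tree for this itemset and recurse on it.
            cond_transactions = []
            for n in item_nodes:
                path, p = [], n['parent']
                while p is not None and p['item'] is not None:
                    path.append(p['item'])
                    p = p['parent']
                cond_transactions.extend([list(reversed(path))] * n['count'])
            find_frequent(build_tree(cond_transactions), ordered_items,
                          row, itemset, result)

    ordered_items = ['f', 'c', 'a', 'b', 'm', 'p']
    reduced = [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
               ['c','b','p'], ['f','c','a','m','p']]
    frequent = []
    find_frequent(build_tree(reduced), ordered_items,
                  len(ordered_items), [], frequent)
    # frequent now contains e.g. (('p',), 3), (('c', 'p'), 3), (('m',), 3),
    # (('c', 'a', 'm'), 3) and (('f', 'c', 'a', 'm'), 3).

Running this sketch on the five reduced transactions with minsupportcount = 3 produces the same frequent itemsets as the walk-through that follows.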
Assuming, say, that the single-item itemset {p} is found to be frequent, a sequence of two-item itemsets is then generated by working through the items above p in the orderedItems array in turn. Thus the two-item itemsets {m, p}, {b, p}, {a, p},
{c, p} and {f, p} are examined in turn. If any of them is frequent its conditional
FP-tree is constructed and a sequence of three-item itemsets is generated by
extending the two-item itemset by adding an item in the leftmost position.
The process continues in this fashion until the whole tree structure has been
examined. At each stage when the current itemset is expanded by adding an
extra item in the leftmost position, only those items in the orderedItems array
(Figure 18.17) above the one previously in the leftmost position are considered.
We now need to check whether any two-item itemsets formed by adding an
additional item to itemset {p} are also frequent. To do this we first construct
a conditional FP-tree for itemset {p}. This is a reduced version of the original
FP-tree that contains only the branches that begin at the root and end at the
two nodes labelled p, but with the nodes renumbered and often with different
support counts. (It may be helpful to look ahead to Figures 18.20 and 18.21 at
this point.)
Initialisation
Diagrammatically we can represent the initial state of the FP-tree by a
single unnumbered node, representing the root.
We will represent the evolving tree by the contents of four arrays, all initially
empty:
– A two-dimensional array nodes2, with a numerical index that will corre-
spond to the numbering of the nodes in the tree. The names given to the
columns of this array are the same as those for array nodes in Section 18.2.
– A single-dimensional array oldindex, which for each node holds the num-
ber of the corresponding node in the tree from which the evolving con-
ditional FP-tree is derived (initially the FP-tree shown in Figure 18.15).
– Single-dimensional arrays startlink2 and lastlink indexed by the names of some or all of the items in the orderedItems array.
We again work through the chain of linked p nodes, this time adding
branches to an evolving conditional FP-tree for itemset {p} and values to the
four equivalent arrays as we do so.
First Branch
Add the five nodes in the leftmost branch of the FP-tree (Figure 18.15),
numbering from the bottom upwards, as a branch leading up to the root, all
with the support count of the lowest node (i.e. the one with itemname p).
Values corresponding to each node in turn are added to the four arrays, as
described in the box below (note that this is not yet a complete description).
Note that in Figure 18.18 the numbering of the nodes is different from that
in Figure 18.15. It reflects the order in which this new tree has been generated,
working from bottom (the p node) to top (the root) for each branch. The root
node has not been numbered and the other nodes are numbered from 1 onwards.
index   itemname   count   linkto   parent   oldindex
1 p 2 2 5
2 m 2 3 4
3 a 2 4 3
4 c 2 5 2
5 f 2 1
nodes2 array oldindex
index startlink2 lastlink
p 1 1
m 2 2
a 3 3
c 4 4
f 5 5
link arrays
The values in the nodes2, oldindex, startlink2 and lastlink arrays corre-
sponding to the first branch are shown in Figure 18.19.
The null value in the parent column of node 5 indicates a link to the root
node. The use of the linkto column in array nodes2 will be explained when we
go on to add the second branch. The use of the array oldindex will be explained
in Section 18.3.2.
Note that the support counts of the branch in Figure 18.18 are different
from those of the corresponding branch in the FP-tree (Figure 18.15). When
we constructed the original FP-tree we thought of a node such as node 3 as
representing an itemset {f, c, a} with support count 3. All the nodes in the
branch from node 1 down to node 5 represented itemsets beginning with f, e.g.
node 4 represented {f, c, a, m}. We need to think of a conditional FP-tree in a
different way, working from the bottom of each branch to the top. The lowest
node (now numbered 1) in Figure 18.18 now represents (part of) itemset {p},
node 2 represents itemset {m, p}, nodes 3, 4 and 5 represent itemsets {a, m, p},
{c, a, m, p} and {f, c, a, m, p} respectively. In all cases the itemset ends with
item p rather than starting with item f . Looking at Figure 18.18 this way,
the support counts for the a, c and f nodes cannot be 3, 3 and 4 respectively
as they were in the FP-tree. If there are two transactions that include item p
there cannot be more than 2 transactions that include items a and p together, or
any other such combination.
For this reason the best approach to constructing the conditional FP-tree
for {p} is to construct the tree bottom-up, branch by branch, using the counts
of the p nodes. Each new node entered in the tree ‘inherits’ the support count
of the p node at the bottom of the branch.
Second Branch
We now add the second and final branch that ends in a node with itemname
p in the FP-tree.
This gives the final version of the conditional FP-tree for itemset {p} shown
in Figure 18.20.
The important difference from adding the first branch is that now the dashed
line links have been added for nodes p and c. These are essential for determining
whether itemsets are frequent at each stage of the extraction process.
index   itemname   count   linkto   parent   oldindex
1 p 2 6 2 5
2 m 2 3 4
3 a 2 4 3
4 c 2 8 5 2
5 f 2 1
6 p 1 7 11
7 b 1 8 10
8 c 1 9
nodes2 array oldindex
index startlink2 lastlink
p 1 6
m 2 2
a 3 3
c 4 8
f 5 5
b 7 7
link arrays
The null values in the parent column of nodes 5 and 8 indicate links to the
root node. The non-null values in the linkto column of array nodes2 correspond
to ‘dashed line’ links between nodes across the tree.
Two-item Itemsets
Having constructed the conditional FP-tree for itemset {p}, there are five
two-item itemsets to examine, starting with {m, p}. In each case we do it by
extracting the part of the tree that contains only the branches that begin at the
root and end at each of the nodes labelled m (or similarly for each of the other
items b, a, c, and f in turn). Note that the nodes in the conditional FP-tree
are numbered sequentially from 1 (in the order they are generated) each time.
To implement the creation and examination of the two-item itemsets ex-
panded from {p} we make a recursive call from function condfptree to
function findFrequent with four arguments: nodes2, startlink2, lastitem and
thisItemset. The last of these has the value {p}.
A sequence of itemsets with two items is now generated from the conditional
FP-tree for itemset {p} by making a loop through the orderedItems array from
row lastitem-1 to row zero. As lastitem is now 5, this means that the items
used as a new leftmost item for the expanded itemsets are m, b, a, c and f in
that order (but not p).
Itemsets {m, p}, {b, p}, {a, p} and {c, p} – expanded from original
itemset {p}
{m, p}: There is only one m node, which has a count of 2. So {m, p} is
infrequent (Figure 18.22).
{b, p}: There is only one b node, which has a count of 1. So {b, p} is infre-
quent (Figure 18.23).
{a, p}: There is only one a node, which has a count of 2. So {a, p} is
infrequent (Figure 18.24).
{c, p}: There are two c nodes, with a total count of 3. So {c, p} is frequent
(Figure 18.25).
There is no need to examine any expansion of an itemset that we already know is infrequent. In total there are 32 possible itemsets with p as the right-
most item and in descending order of the items in the orderedItems array. We
have only needed to examine seven of them (two frequent and five infrequent).
For space reasons we will not examine all the other single-item itemsets
and those constructed by expanding them by adding additional items in the
leftmost position. However we will examine itemset {m} and its derivatives as
this will illustrate some important additional points.
index   itemname   count   linkto   parent   oldindex
1 m 2 5 2 4
2 a 2 3 3
3 c 2 4 2
4 f 2 1
5 m 1 7
nodes2 array oldindex
index startlink2 lastlink
m 1 5
a 2 2
c 3 3
f 4 4
link arrays
It is at the final stage, when the b node of this branch (node 6 in the original FP-tree) is processed, that the algorithm differs from the one used up to now. We check whether the value of thisparent (i.e. 3) is
already in the oldindex array. Unlike for all the examples shown previously, it
is there in position 2, implying that the b node has a parent, node 2, which
is already present in the evolving tree structure. This in turn implies that the
new node 6 needs to be linked to the part of the tree structure that has already
been created. There are three stages to this.
– The value of parent in row 6 of nodes2 is set to 2.
– The adding of additional nodes for the current branch is aborted.
– The chain of parent nodes in the nodes2 array is followed from row 2, up to
immediately before the root, i.e. from 2 to 3 to 4, with the support count
being increased by the support count of the node at the bottom of the branch
(i.e. by 1) at each stage.
This concludes the construction of the arrays corresponding to the condi-
tional FP-tree for itemset {m}, giving Figure 18.30.
index   itemname   count   linkto   parent   oldindex
1 m 2 5 2 4
2 a 3 3 3
3 c 3 4 2
4 f 3 1
5 m 1 6 7
6 b 1 2 6
nodes2 array oldindex
index startlink2 lastlink
m 1 5
a 2 2
c 3 3
f 4 4
b 6 6
link arrays
This leads to a revised and final version of the algorithm for adding a branch.
{a, m}: There is only one a node, which has a count of 3. So {a, m} is
frequent (Figure 18.32).
There is only one c node, which has a count of 3 (Figure 18.36). So {c, m}
is frequent.
We now examine all the three-item itemsets constructed by expanding
{c, m} by adding an item in the leftmost position. Only items above c in the
orderedItems array need to be considered, i.e. f.
There is only one f node, which has a count of 3 (Figure 18.37). So {f, c, m}
is frequent.
As there is no item above f in orderedItems the examination of itemsets
with three items that are expanded versions of {c, m} is concluded.
There is only one f node, which has a count of 3 (Figure 18.38). So {f, m}
is frequent.
As there is no item above f in orderedItems and there are no more two-
item itemsets to be considered, the examination of itemsets with final item m
is concluded.
This time we have found 8 frequent itemsets ending with item m (there
cannot be any others) and have examined only one infrequent itemset. There
are 16 possible itemsets with m as the rightmost item that are in descending
order of the items in the orderedItems array. We have only needed to examine
a total of nine of them.
Reference
[1] Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without can-
didate generation. SIGMOD Record, 29 (2), 1–12. Proceedings of the 2000
ACM SIGMOD international conference on management of data, ACM
Press.
19
Clustering
19.1 Introduction
In this chapter we continue with the theme of extracting information from
unlabelled data and turn to the important topic of clustering. Clustering is
concerned with grouping together objects that are similar to each other and
dissimilar to the objects belonging to other clusters.
In many fields there are obvious benefits to be had from grouping together
similar objects. For example
– In an economics application we might be interested in finding countries whose
economies are similar.
– In a financial application we might wish to find clusters of companies that
have similar financial performance.
– In a marketing application we might wish to find clusters of customers with
similar buying behaviour.
– In a medical application we might wish to find clusters of patients with
similar symptoms.
– In a document retrieval application we might wish to find clusters of docu-
ments with related content.
– In a crime analysis application we might look for clusters of high volume
crimes such as burglaries or try to cluster together much rarer (but possibly
related) crimes such as murders.
There are many algorithms for clustering. We will describe two methods for
which the similarity between objects is based on a measure of the distance
between them.
In the restricted case where each object is described by the values of just
two attributes, we can represent them as points in a two-dimensional space
(a plane) such as Figure 19.1.
In the case of three attributes we can think of the objects as being points in
a three-dimensional space (such as a room) and visualising clusters is generally
straightforward too. For larger dimensions (i.e. larger numbers of attributes) it
soon becomes impossible to visualise the points, far less the clusters.
The diagrams in this chapter will use only two dimensions, although in
practice the number of attributes will usually be more than two and can often
be large.
Before using a distance-based clustering algorithm to cluster objects, it is
first necessary to decide on a way of measuring the distance between two points.
As for nearest neighbour classification, discussed in Chapter 3, a measure com-
monly used when clustering is the Euclidean distance. To avoid complications
we will assume that all attribute values are continuous. (Attributes that are
categorical can be dealt with as described in Chapter 3.)
We next need to introduce the notion of the ‘centre’ of a cluster, generally
called its centroid.
Assuming that we are using Euclidean distance or something similar as a
measure we can define the centroid of a cluster to be the point for which each
attribute value is the average of the values of the corresponding attribute for
all the points in the cluster.
So the centroid of the four points (with 6 attributes)
8.0 7.2 0.3 23.1 11.1 −6.1
2.0 −3.4 0.8 24.2 18.3 −5.2
−3.5 8.1 0.9 20.6 10.2 −7.3
−6.0 6.7 0.5 12.5 9.2 −8.4
would be
0.125 4.65 0.625 20.1 12.2 −6.75
The centroid of a cluster will sometimes be one of the points in the cluster,
but frequently, as in the above example, it will be an ‘imaginary’ point, not
part of the cluster itself, which we can take as marking its centre. The value of
the idea of the centroid of a cluster will be illustrated in what follows.
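As a quick check, the centroid calculation above amounts to nothing more than taking the mean of each column of attribute values, for example in Python:

    points = [
        [ 8.0,  7.2, 0.3, 23.1, 11.1, -6.1],
        [ 2.0, -3.4, 0.8, 24.2, 18.3, -5.2],
        [-3.5,  8.1, 0.9, 20.6, 10.2, -7.3],
        [-6.0,  6.7, 0.5, 12.5,  9.2, -8.4],
    ]
    centroid = [sum(col) / len(points) for col in zip(*points)]
    print(centroid)   # [0.125, 4.65, 0.625, 20.1, 12.2, -6.75]
                      # (up to floating-point rounding)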
There are many methods of clustering. In this book we will look at two of
the most commonly used: k-means clustering and hierarchical clustering.
1. Choose a value of k.
2. Select k objects in an arbitrary fashion. Use these as the initial set of k
centroids.
3. Assign each of the objects to the cluster for which it is nearest to the
centroid.
4. Recalculate the centroids of the k clusters.
5. Repeat steps 3 and 4 until the centroids no longer move.
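A minimal sketch of these five steps in Python is given below. It uses unnormalised, unweighted Euclidean distance, as in the example that follows, and simply takes the first k objects as the initial centroids (any arbitrary choice of k objects would do). The function names are illustrative.

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def k_means(points, k):
        centroids = [list(p) for p in points[:k]]        # step 2: initial centroids
        while True:
            # Step 3: assign each object to the cluster with the nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
                clusters[nearest].append(p)
            # Step 4: recalculate the centroid of each cluster.
            new_centroids = [[sum(col) / len(c) for col in zip(*c)]
                             if c else centroids[i]
                             for i, c in enumerate(clusters)]
            # Step 5: stop when the centroids no longer move.
            if new_centroids == centroids:
                return clusters, centroids
            centroids = new_centroids

Applied with k = 3 to the 16 points used in the example below, k_means iterates in the same way as the hand trace that follows, although the final clusters depend on which objects are chosen as the initial centroids.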
19.2.1 Example
x y
6.8 12.6
0.8 9.8
1.2 11.6
2.8 9.6
3.8 9.9
4.4 6.5
4.8 1.1
6.0 19.9
6.2 18.5
7.6 17.4
7.8 12.2
6.6 7.7
8.2 4.5
8.4 6.9
9.0 3.4
9.6 11.1
Three of the points shown in Figure 19.6 have been surrounded by small
circles. We will assume that we have chosen k = 3 and that these three points
have been selected to be the locations of the initial three centroids. This initial
(fairly arbitrary) choice is shown in Figure 19.7.
Initial
x y
Centroid 1 3.8 9.9
Centroid 2 7.8 12.2
Centroid 3 6.2 18.5
The columns headed d1, d2 and d3 in Figure 19.8 show the Euclidean dis-
tance of each of the 16 points from the three centroids. For the purposes of
this example, we will not normalise or weight either of the attributes, so the
distance of the first point (6.8, 12.6) from the first centroid (3.8, 9.9) is simply
√((6.8 − 3.8)² + (12.6 − 9.9)²) = 4.0 (to one decimal place)
The column headed ‘cluster’ indicates the centroid closest to each point and
thus the cluster to which it should be assigned.
The resulting clusters are shown in Figure 19.9 below.
The centroids are indicated by small circles. For this first iteration they are
also actual points within the clusters. The centroids are those that were used
to construct the three clusters but are not the true centroids of the clusters
once they have been created.
x y d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
We next calculate the centroids of the three clusters using the x and y values
of the objects currently assigned to each one. The results are shown in Figure
19.10.
The three centroids have all been moved by the assignment process, but the
movement of the third one is appreciably less than for the other two.
The centroids are again indicated by small circles. However from now on the
centroids are ‘imaginary points’ corresponding to the ‘centre’ of each cluster,
not actual points within the clusters.
These clusters are very similar to the previous three, shown in Figure 19.9.
In fact only one point has moved. The object at (8.4, 6.9) has moved from
cluster 2 to cluster 1.
We next recalculate the positions of the three centroids, giving Figure 19.12.
The first two centroids have moved a little, but the third has not moved at
all.
We assign the 16 objects to clusters once again, giving Figure 19.13.
These are the same clusters as before. Their centroids will be the same as
those from which the clusters were generated. Hence the termination condition
of the k-means algorithm ‘repeat . . . until the centroids no longer move’ has
been met and these are the final clusters produced by the algorithm for the
initial choice of centroids made.
It can be proved that the k-means algorithm will always terminate, but it does not necessarily find the best set of clusters, i.e. the one that minimises the value of the objective function (the sum of the squares of the distances of each point from the centroid of the cluster to which it is assigned). The initial selection of centroids can significantly
affect the result. To overcome this, the algorithm can be run several times for
a given value of k, each time with a different choice of the initial k centroids,
the set of clusters with the smallest value of the objective function then being
taken.
The most obvious drawback of this method of clustering is that there is no
principled way to know what the value of k ought to be. Looking at the final set
of clusters in the above example (Figure 19.13), it is not clear that k = 3 is the
most appropriate choice. Cluster 1 might well be broken into several separate
clusters. We can choose a value of k pragmatically as follows.
If we imagine choosing k = 1, i.e. all the objects are in a single cluster, with
the initial centroid selected in a random way (a very poor idea), the value of
the objective function is likely to be large. We can then try k = 2, k = 3 and
k = 4, each time experimenting with a different choice of the initial centroids
and choosing the set of clusters with the smallest value. Figure 19.14 shows the
(imaginary) results of such a series of experiments.
Value of k      Value of objective function
1 62.8
2 12.3
3 9.4
4 9.3
5 9.2
6 9.1
7 9.05
These results suggest that the best value of k is probably 3. The value of
the function for k = 3 is much less than for k = 2, but only a little better than
for k = 4. It is possible that the value of the objective function drops sharply
after k = 7, but even if it does k = 3 is probably still the best choice. We
normally prefer to find a fairly small number of clusters as far as possible.
Note that we are not trying to find the value of k with the smallest value of
the objective function. That will occur when the value of k is the same as the
number of objects, i.e. each object forms its own cluster of one. The objective
function will then be zero, but the clusters will be worthless. This is another
example of the overfitting of data discussed in Chapter 9. We usually want a
fairly small number of clusters and accept that the objects in a cluster will be
spread around the centroid (but ideally not too far away).
1. Assign each object to its own single-object cluster. Calculate the dis-
tance between each pair of clusters.
2. Choose the closest pair of clusters and merge them into a single cluster
(so reducing the total number of clusters by one).
3. Calculate the distance between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all the objects are in a single cluster.
It would be very inefficient to calculate the distance between each pair of clus-
ters for each pass through the algorithm, especially as the distance between
those clusters not involved in the most recent merger cannot have changed.
The usual approach is to generate and maintain a distance matrix giving
the distance between each pair of clusters.
a b c d e f
a 0 12 6 3 25 4
b 12 0 19 8 14 15
c 6 19 0 12 5 18
d 3 8 12 0 11 9
e 25 14 5 11 0 7
f 4 15 18 9 7 0
Note that the table is symmetric, so not all values have to be calculated
(the distance from c to f is the same as the distance from f to c etc.). The
values on the diagonal from the top-left corner to the bottom-right corner must
always be zero (the distance from a to a is zero etc.).
From the distance matrix of Figure 19.19 we can see that the closest pair
of clusters (single objects) are a and d, with a distance value of 3. We combine
these into a single cluster of two objects which we will call ad. We can now
rewrite the distance matrix with rows a and d replaced by a single row ad and
similarly for the columns (Figure 19.20).
The entries in the matrix for the various distances between b, c, e and f
obviously remain the same, but how should we calculate the entries in row and
column ad?
ad b c e f
ad 0 ? ? ? ?
b ? 0 19 14 15
c ? 19 0 5 18
e ? 14 5 0 7
f ? 15 18 7 0
We could calculate the position of the centroid of cluster ad and use that
to measure the distance of cluster ad from clusters b, c, e and f . However for
hierarchical clustering a different approach, which involves less calculation, is
generally used.
With the single-link method of calculation, the distance between two clusters is taken to be the shortest distance from any member of one cluster to any member of the other. This gives the distance matrix shown below.
ad b c e f
ad 0 8 6 11 4
b 8 0 19 14 15
c 6 19 0 5 18
e 11 14 5 0 7
f 4 15 18 7 0
The smallest (non-zero) value in the table is now 4, which is the distance
between cluster ad and cluster f , so we next merge these clusters to form a
three-object cluster adf. The distance matrix, using the single-link method of
calculation, now becomes Figure 19.22.
adf b c e
adf 0 8 6 7
b 8 0 19 14
c 6 19 0 5
e 7 14 5 0
The smallest non-zero value in this matrix is now 5, the distance between clusters c and e, so these two clusters are merged into a two-object cluster ce. Applying the single-link method again gives the next distance matrix.
adf b ce
adf 0 8 6
b 8 0 14
ce 6 14 0
The smallest non-zero value is now 6, the distance between clusters adf and ce, so these are merged into a five-object cluster adfce, leaving just two clusters.
adfce b
adfce 0 8
b 8 0
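The whole single-link process for this example can be reproduced by a short sketch (illustrative Python, not taken from the book), starting from the original distance matrix of the six objects:

    dist = {
        ('a','b'): 12, ('a','c'): 6, ('a','d'): 3, ('a','e'): 25, ('a','f'): 4,
        ('b','c'): 19, ('b','d'): 8, ('b','e'): 14, ('b','f'): 15,
        ('c','d'): 12, ('c','e'): 5, ('c','f'): 18,
        ('d','e'): 11, ('d','f'): 9, ('e','f'): 7,
    }
    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    clusters = [('a',), ('b',), ('c',), ('d',), ('e',), ('f',)]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the single-link distance.
        pairs = [(ci, cj) for i, ci in enumerate(clusters) for cj in clusters[i+1:]]
        ci, cj = min(pairs, key=lambda p: min(d(x, y) for x in p[0] for y in p[1]))
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci + cj]
        print(clusters)
    # Successive mergers produce ad, adf, ce and adfce, before the final merger
    # puts all six objects into a single cluster.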
The final merger would combine clusters adfce and b into a single cluster containing all six objects. In practice we do not usually want to continue until that point is reached; we need some way of deciding when to stop merging. We can do this in several ways. For example we can merge clusters until only some pre-defined number remain. Alternatively we can stop merging when a newly created cluster fails to meet some criterion for its compactness, e.g. the average distance between the objects in the cluster is too high.
x y
10.9 12.6
2.3 8.4
8.4 12.6
12.1 16.2
7.3 8.9
23.4 11.3
19.7 18.5
17.1 17.2
3.2 3.4
1.3 22.8
2.4 6.9
2.4 7.1
3.1 8.3
2.9 6.9
11.2 4.4
8.3 8.7
2. For the example given in Section 19.3.1, what would be the distance matrix
after each of the first three mergers if complete-link clustering were used
instead of single-link clustering?
20
Text Mining
In the bag-of-words representation each document is represented by the words that it contains, together with a count of how many times each one occurs, or some other measure of the importance of each word.
Assuming that we wish to store an ‘importance value’ for each word in a
document as one instance in a training set, how should we do it? If a given doc-
ument has say 106 different words, we cannot just use a representation with
106 attributes (ignoring classifications). Other documents in the dataset may
use other words, probably overlapping with the 106 in the current instance,
but not necessarily so. The unseen documents that we wish to classify may
have words that are not used in any of the training documents. An obvious —
but extremely bad — approach would be to allocate as many attributes as are
needed to allow for all possible words that might be used in any possible un-
seen document. Unfortunately if the language of the documents is English, the
number of possible words is approximately one million, which is a hopelessly
impractical number of attributes to use.
A much better approach is to restrict the representation to the words that
actually occur in the training documents. This can still be many thousands
(or more) and we will look at ways of reducing this number in Sections 20.3
and 20.4 below. We place all the words used at least once in a ‘dictionary’ and
allocate one attribute position in each row of our training set for each one. The
order in which we do this is arbitrary, so we can think of it as alphabetical.
The bag-of-words representation is inherently a highly redundant one. It
is likely that for any particular document most of the attributes/features (i.e.
words) will not appear. For example the dictionary used may have 10,000 words,
but a specific document may have just 200 different words. If so, its represen-
tation as an instance in the training set will have 9,800 out of 10,000 attributes
with value zero, indicating no occurrences, i.e. unused.
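A toy sketch of this representation is given below, using three very short 'documents' loosely based on the text excerpt that appears later in this chapter. In practice the dictionary would contain thousands of words and the mostly-zero vectors would usually be stored in some sparse form.

    documents = [
        "marley was dead to begin with",
        "old marley was as dead as a door nail",
        "scrooge signed it",
    ]
    # The dictionary: every word used at least once in the training documents.
    dictionary = sorted({word for doc in documents for word in doc.split()})

    def bag_of_words(doc):
        """Return one vector of word counts, one position per dictionary word."""
        counts = [0] * len(dictionary)
        for word in doc.split():
            if word in dictionary:
                counts[dictionary.index(word)] += 1
        return counts

    print(dictionary)
    print(bag_of_words("scrooge was as dead as a door nail"))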
If there are multiple classifications there are two possibilities for construct-
ing the dictionary of words for a collection of training documents. Whichever
one is used the dictionary is likely to be large.
The first is the local dictionary approach. We form a different dictionary
for each category, using only those words that appear in documents classified
as being in that category. This enables each dictionary to be relatively small
at the cost of needing to construct N of them, where there are N categories.
The second approach is to construct a global dictionary, which includes all
the words that occur at least once in any of the documents. This is then used
for classification into each of the N categories. Constructing a global dictionary
will clearly be a lot faster than constructing N local dictionaries, but at the
cost of making an even more redundant representation to use for classifying
into each of the categories. There is some evidence to suggest that using a
local dictionary approach tends to give better performance than using a global
dictionary.
                      Predicted class
                      Ck        not Ck
Actual class  Ck       a          c
              not Ck   b          d
In Figure 20.1 the values a, b, c and d are the number of true positive,
false positive, false negative and true negative classifications, respectively. For
a perfect classifier b and c would both be zero.
The value (a + d)/(a + b + c + d) gives the predictive accuracy. However, as
mentioned in Chapter 12, for information retrieval applications, which include
text classification, it is more usual to use some other measures of classifier
performance.
Recall is defined as a/(a + c), i.e. the proportion of documents in category
Ck that are correctly predicted.
Precision is defined as a/(a + b), i.e. the proportion of documents that are
predicted as being in category Ck that are actually in that category.
It is common practice to combine Recall and Precision into a single measure
of performance called the F1 Score, which is defined by the formula F1 = 2 × Precision × Recall/(Precision + Recall). This is just the product of Precision
and Recall divided by their average.
Having generated confusion matrices for each of the N binary classifica-
tion tasks we can combine them in several ways. One method is called micro-
averaging. The N confusion matrices are added together element by element to
form a single matrix from which Recall, Precision, F1 and any other preferred
measures can be computed.
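A sketch of micro-averaging in Python is given below; the confusion-matrix counts used are purely illustrative, made up for the example.

    # Each matrix is given as (a, b, c, d) =
    # (true positives, false positives, false negatives, true negatives).
    matrices = [(40, 5, 10, 945), (25, 8, 5, 962), (60, 12, 20, 908)]

    # Add the N matrices together element by element.
    a = sum(m[0] for m in matrices)
    b = sum(m[1] for m in matrices)
    c = sum(m[2] for m in matrices)

    recall = a / (a + c)
    precision = a / (a + b)
    f1 = 2 * precision * recall / (precision + recall)
    print(round(recall, 3), round(precision, 3), round(f1, 3))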
When attempting to classify web pages we immediately run into the problem
of finding any classified pages to use as training data. Web pages are uploaded
by a very large number of individuals, operating in an environment where no
widely agreed standard classification scheme exists. Fortunately there are ways
of overcoming this problem, at least partially.
The search engine company, Yahoo, uses hundreds of professional classifiers
to categorise new web pages into a (nearly) hierarchical structure, comprising
14 main categories, each with many sub-categories, sub-sub-categories etc. The
complete structure can be found on the web at http://dir.yahoo.com. Users
can search through the documents in the directory structure either using a
search engine approach or by following links through the structure. For example
we might follow the path from ‘Science’ to ‘Computer Science’ to ‘Artificial
Intelligence’ to ‘Machine Learning’ to find a set of links to documents that
human classifiers have placed in that category. The first of these (at the time
of writing) is to the UCI Machine Learning Repository, which was discussed in
Chapter 2.
The Yahoo system demonstrates the potential value of classifying web pages.
However, only a very small proportion of the entire web could possibly be
classified this way ‘manually’. With 1.5 million new pages being added each
day the volume of new material will defeat any conceivable team of human
classifiers. An interesting area of investigation (which the present author and
his research group are currently pursuing) is whether web pages can be classified
automatically using the Yahoo classification scheme (or some other similar
scheme) by supervised learning methods of the kind described in this book.
Unlike many other task areas for data mining there are few ‘standard’
datasets available on which experimenters can compare their results. One ex-
ception is the BankSearch dataset created by the University of Reading, which
includes 11,000 web pages pre-classified (by people) into four main categories
(Banking and Finance, Programming, Science, Sport) and 11 sub-categories,
some quite distinct and some quite similar.
Marley was dead: to begin with. There is no doubt whatever about that.
The register of his burial was signed by the clergyman, the clerk, the
undertaker, and the chief mourner. Scrooge signed it: and Scrooge’s name
was good upon ’Change, for anything he chose to put his hand to. Old
Marley was as dead as a door-nail.
Mind! I don’t mean to say that I know, of my own knowledge, what there
is particularly dead about a door-nail. I might have been inclined, myself,
to regard a coffin-nail as the deadest piece of ironmongery in the trade.
But the wisdom of our ancestors is in the simile; and my unhallowed hands
shall not disturb it, or the Country’s done for. You will therefore permit
me to repeat, emphatically, that Marley was as dead as a door-nail.
<html><head><meta http-equiv="content-type"
content="text/html; charset=UTF-8">
<title>Google</title><style>
<!--
body,td,a,p,.h{font-family:arial,sans-serif;}
.h{font-size: 20px;}
.q{color:#0000cc;}
//-->
</style>
<script>
<!--
function sf(){document.f.q.focus();}
function clk(el,ct,cd) {if(document.images){(new Image()).src=
"/url?sa=T&ct="+es
cape(ct)+"&cd="+escape(cd)+"&url="
+escape(el.href)+"&ei=gpZNQpzEHaSgQYCUwKoM";}return true;}
// -->
</script>
</head><body bgcolor=#ffffff text=#000000 link=#0000cc vlink=
#551a8b alink=#ff00
00 onLoad=sf()><center><img src="/intl/en_uk/images/logo.gif"
width=276 height=1
10 alt="Google"><br><br>
We can deal with the second problem to some extent by removing HTML
markup and JavaScript when we create a representation of a document such
as a ‘bag-of-words’, but the scarcity of relevant information on most web pages
remains a problem. We must be careful not to assume that HTML markup
is always irrelevant noise — the only two useful words in Figure 20.3 (both
‘Google’) appear in the HTML markup.
Even compared with articles in newspapers, papers in scientific journals etc.,
web pages suffer from extremely diverse authorship, with little consistency
in style or vocabulary, and extremely diverse content. Ignoring HTML markup,
JavaScript, irrelevant advertisements and the like, the content of a web page
is often quite small. It is not surprising that classification systems that work
well on standard text documents often struggle with hypertext. It is reported
that in one experiment, classifiers that were 90% accurate on the widely used
Reuters dataset (of standard text documents) scored only 32% on a sample of
Yahoo classified pages.
2. Normalise the vectors (20, 10, 8, 12, 56) and (0, 15, 12, 8, 0).
Calculate the distance between the two normalised vectors using the dot
product formula.
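The following sketch illustrates the method the exercise asks for, using different vectors so as not to pre-empt the specimen solution in Appendix E; it assumes that the ‘dot product formula’ referred to is the standard identity dist = √(2(1 − x·y)) for two unit-length vectors x and y.

```python
import math

def normalise(vec):
    # divide each component by the Euclidean length of the vector,
    # giving a vector of unit length
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

def dot_product(x, y):
    # sum of the products of corresponding components
    return sum(a * b for a, b in zip(x, y))

def distance_between_unit_vectors(x, y):
    # for unit vectors the Euclidean distance can be written purely in
    # terms of the dot product: dist = sqrt(2 * (1 - x.y))
    return math.sqrt(2 * (1 - dot_product(x, y)))

# illustrative vectors (not the ones in the exercise)
a = normalise([3, 4, 0])
b = normalise([0, 4, 3])
print(distance_between_unit_vectors(a, b))   # about 0.849
```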
21
Classifying Streaming Data
21.1 Introduction
One of the most significant developments in Data Mining in recent years has
been the huge increase in availability of streaming data, i.e. data which arrives
(generally in large quantities) from some automatic process over a period of
days, months, years or potentially forever.
Some examples of this are:
• Sales transactions in supermarkets
• Data from GPS systems
• Records of changes in share prices
• Logs of telephone calls
• Logs of accesses to webpages
• Records of credit card purchases
• Records of postings to social media
• Data from networks of sensors
For some applications the volume of data received can be as high as tens of
millions of records per day, which we can regard as effectively infinite.
As elsewhere in this book we will restrict ourselves to data that is symbolic
in nature, as opposed to say images sent from CCTV. We will concentrate
on data records that are labelled with a classification and assume that the
task is to learn an underlying model in the form of a decision tree. We will
further impose the restriction that all attributes are categorical; continuous
attributes would need to be discretised (for example by the methods of Chapter 8)
before being presented to the algorithm.
1. We distinguish between nodes which have or have not previously been split on
an attribute. The former are called internal nodes; the latter are called leaf nodes.
We will consider the root node not as a third type of node but as an internal node
after it has been split on an attribute and a leaf node before that.
The tree in Figure 21.1 has been created by splitting on attribute att4 at node 0
(creating nodes 1 and 2), att1 at node 2 (creating nodes 3, 4 and 5) and att5 at node
4 (creating nodes 6 and 7). Each attribute can potentially have any number
of values but to simplify the figures we will generally assume they have either
two or three. We will adopt the convention that if, say, attribute att5 has two
values we will call them val51 and val52.
Although we are not concerned with the fine detail of implementation in
this book, it will make the description of the H-Tree algorithm much easier to
follow if we think in terms of maintaining six arrays at each node. These will
be discussed in the following sections.
2. A note on notation. In this chapter array elements are generally shown enclosed
in square brackets, e.g. currentAtts[2]. However an array containing a number of
constant values will generally be denoted by those values separated by commas and
enclosed in braces. So currentAtts[2] is {att1, att2, att3, att5, att6, att7}.
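Purely as an illustration (and not as the book's own implementation), the six arrays could be held in a program as dictionaries keyed by node number, along the following lines. The attribute names, the class names and the number of values assumed for att3 and att7 are assumptions made only for this sketch.

```python
# Illustrative representation of the six per-node arrays.  The names follow
# the chapter; holding them as Python dictionaries keyed by node number is
# an implementation detail, not part of the H-Tree algorithm itself.

ATTRIBUTE_VALUES = {                     # assumed attribute domains
    "att1": ["val11", "val12", "val13"],
    "att2": ["val21", "val22"],
    "att3": ["val31", "val32"],
    "att4": ["val41", "val42"],
    "att5": ["val51", "val52"],
    "att6": ["val61", "val62", "val63", "val64"],
    "att7": ["val71", "val72"],
}
CLASSES = ["c1", "c2", "c3"]

currentAtts = {}   # node -> attributes still available for splitting there
splitAtt    = {}   # node -> splitting attribute, or 'none' for a leaf
hitcount    = {}   # node -> number of records sorted to it while a leaf
classtotals = {}   # node -> {class: count}
acvCounts   = {}   # node -> {attribute: {class: {attribute value: count}}}
branch      = {}   # node -> {attribute: {attribute value: descendant node}}

nextnode = 0       # number that will be given to the next node created

def new_leaf(available_attributes):
    """Create a new leaf node with zero counts and return its number."""
    global nextnode
    n = nextnode
    nextnode += 1
    currentAtts[n] = list(available_attributes)
    splitAtt[n] = "none"
    hitcount[n] = 0
    classtotals[n] = {c: 0 for c in CLASSES}
    acvCounts[n] = {a: {c: {v: 0 for v in ATTRIBUTE_VALUES[a]}
                        for c in CLASSES}
                    for a in available_attributes}
    branch[n] = {}
    return n

root = new_leaf(ATTRIBUTE_VALUES)        # node 0, with all seven attributes
```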
For an internal node N , i.e. one that has previously been split on an attribute,
the array element splitAtt[N ] is the name of the splitting attribute, so
• splitAtt[0] is att4
• splitAtt[2] is att1
• splitAtt[4] is att5.
All the other nodes are leaf nodes at which there are (by definition) no
splitting attributes. The value of splitAtt[N ] for a leaf node is 'none'.
As the data records are received they are processed and discarded, but do-
ing this does not immediately alter the structure (nodes and branches) of the
evolving incomplete tree. As each new record comes in and is processed it is
sorted through the tree to the appropriate leaf node. It is only when a number
of conditions are met at a leaf node that a change to the tree by splitting at
that node is considered.
To illustrate the sorting process let us assume we have the following record
(Figure 21.2):
(Values that are not relevant to the example are denoted by xxx.)
The record passes (notionally) through the tree starting at the root (node
0). From there because the value of att4 is val42 it passes on to node 2. Then
because att1 has the value val12 it goes to node 4. Finally, because att5 has
value val51 it arrives at node 6, a leaf. The route taken from root to leaf is 0
to 2 to 4 to 6.
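In terms of the illustrative structures sketched earlier in the chapter, the sorting process just described might be written as follows (a sketch, not the book's pseudocode):

```python
def sort_to_leaf(record):
    # follow the tree from the root, at each internal node taking the branch
    # that matches the record's value for that node's splitting attribute
    node = 0
    path = [node]
    while splitAtt[node] != "none":           # still at an internal node
        att = splitAtt[node]                  # e.g. att4 at the root
        node = branch[node][att][record[att]] # e.g. val42 leads to node 2
        path.append(node)
    return node, path                         # e.g. leaf 6, path [0, 2, 4, 6]
```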
For each leaf node N , hitcount[N ] is the number of ‘hits’ on the node since it
was created, i.e. the number of records sorted to that leaf node by the process
illustrated above. As each new record is sorted to a leaf node L, the value of
hitcount[L] is increased by 1. The internal nodes ‘passed through’ in the above
example (0, 2 and 4) have their own hitcount values, obtained before they were
split and became internal nodes. These values are not increased for internal
nodes as they play no further part in the tree generation process.
When a new node is created its hitcount value is set to zero.
This is the most complex of the six arrays. The name stands for ‘attribute-
class-value counts’. It has four dimensions.
If N is a leaf node then for each attribute A in its current attributes ar-
ray, acvCounts[N ][A] is a two-dimensional array which records the number of
occurrences of each possible combination of class value and the value of attribute A.
In Figure 21.1, the attributes in the current attributes array for node 6
are att2, att3, att6 and att7. Assume that hitcount[6] is 200 and the values
in the node’s frequency tables are as shown in Figure 21.3.
The sum of the values in the c1 row is the same as classtotals[6][c1] and
similarly for the other classes. The overall total of the numbers in the whole
table is the same as hitcount[6].
These two-dimensional arrays are precisely what are needed to calculate
measures such as Information Gain (discussed in Chapter 5) which are often
used to determine which attribute to split on at a node and, as we shall see
in Section 21.4, this is how they will be used in developing the H-Tree. As
each new record is sorted to a leaf node N , one of the entries in the frequency
table corresponding to every attribute in the node’s current attributes array is
increased by 1.
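Continuing the same illustrative sketch, the updates made when a record with a known classification reaches a leaf could look like this (the decision about whether to split is ignored here):

```python
def update_leaf(record, classification):
    # sort the record to a leaf and update that leaf's counts: hitcount,
    # classtotals, and one cell of the frequency table for every attribute
    # in the leaf's current attributes array
    leaf, _ = sort_to_leaf(record)
    hitcount[leaf] += 1
    classtotals[leaf][classification] += 1
    for att in currentAtts[leaf]:
        acvCounts[leaf][att][classification][record[att]] += 1
    return leaf
```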
As for arrays hitcount and classtotals, internal nodes have their own
acvCounts values, obtained before they were split and became internal nodes.
3. The row and column headings are provided to assist the reader only. The table
itself has 3 rows and 3 columns.
These values are not increased for internal nodes as they play no part in the
H-Tree algorithm.
When a new leaf node is created its acvCounts value for each combination
of class and attribute value for every attribute in its current attributes array
is set to zero.
This final array, branch, together with the splitAtt array, provides the ‘glue’ that
keeps the tree structure together.
When leaf node N is split on attribute A, for each value of that attribute a
branch leading to a new node is created.
For each value V of attribute A:
• branch[N ][A][V ] is set to nextnode
• nextnode is increased by 1.
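A matching sketch of splitting a leaf node, again using the illustrative structures: the splitting attribute is recorded and each of its values is given a branch to a brand-new leaf whose current attributes array is the parent's with that attribute removed.

```python
def split_leaf(n, att):
    # record the splitting attribute and create one branch per value of att,
    # each leading to a new leaf with all-zero counts
    splitAtt[n] = att
    branch[n][att] = {}
    inherited = [a for a in currentAtts[n] if a != att]
    for value in ATTRIBUTE_VALUES[att]:
        branch[n][att][value] = new_leaf(inherited)
```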
We start with a tree with just a single node, numbered zero, and associate five
arrays (all except branch) with it as described in Section 21.2.
The pseudocode for this is given below4 .
4. Pseudocode fragments are provided for the benefit of readers who may be inter-
ested in developing their own implementations of the H-Tree algorithm. Other readers
can safely ignore them.
We now begin reading the incoming records one-by-one, in each case processing
the record and then discarding it. (Each record is ‘sorted’ to node zero as there
is only one at present.) We increase one of the numbers in the classtotals[0] array
by one and the hitcount value by one for each record read. We also adjust the
contents of the frequency table for each attribute by adding one to one of the
values in the table, depending on the combination of the value of the attribute
and the specified classification for that record.
Let us assume that by the time the 100th record has been read the
classtotals[0] array contains {63, 17, 20}. The sum of the three values in the
array will of course be 100. At this stage the frequency table for attribute att6
might contain the following (Figure 21.4). There will be frequency tables simi-
lar to this for each of the other attributes. In each case the right-most column
(row sums) will be the same.
It will help to interpret this if we add an extra row containing the sum of
the numbers in each of the existing columns and a further column containing
the sum of the numbers in each row. Note that these additional values do not
need to be stored. They are calculated from the values in the stored 3 ∗ 4 table
Class          val61   val62   val63   val64   Row sums
c1               32      18       4       9        63
c2                0       5       5       7        17
c3                0      10       7       3        20
Column sums      32      33      16      19       100
The number in the bottom right-hand corner is the sum of the numbers in
the bottom (column sums) row, which is the same as the sum of the numbers in
the right-most (row sums) column. This overall total is the same as the number
of records sorted to node 0 – in this case 100.
The other values in the right-most column show how many instances have
been classified with each of the three possible classes. They are the same as the
values in the classtotals array, i.e. {63, 17, 20}.
The first four numbers in the bottom row are the column sums. They show
that attribute att6 had the value val61 32 times, val62 33 times, val63 16 times
and val64 19 times.
To develop the tree we need to split on an attribute at the root node, but
we clearly cannot do this after just one record has been read as the resulting
tree would be essentially arbitrary and likely to have extremely low predictive
power. Instead we wait until a specified number of records have been sorted to
node zero5 and then make a decision on whether or not to split on an attribute
and if so which one.
The specified number is denoted by G and is sometimes referred to as the
grace period. In this chapter the same value will be used at each leaf node as
the tree evolves, but it would be possible to use a larger value at some points
in the processing (e.g. at or near the root of the tree) than at others. To make
the numbers reasonably small in our examples we will use a value of 100 for G.
5. As initially there are no other nodes, all incoming records will be sorted there.
Once G (i.e. 100) records have been sorted to node 0, we next consider
splitting at that node, provided that the records sorted to it have more than
one classification. If all the classifications were the same we would continue
receiving and processing records until the next 100 were received at which
time splitting would be considered again. In this example the classifications
are not all identical so we go on to determine which attribute to split on,
but with ‘no split’ as one of the options. At present we will assume that we
will definitely split and will choose the attribute to split on using a method
such as the maximising Information Gain method described in Chapter 5,
or one of the other similar methods that use a frequency table for each at-
tribute.
We will say that it is decided to split at node 0 using attribute att6. This gives
us four branches (one per value of att6) leading to four new nodes, numbered
from 1 to 4, as shown in Figure 21.66 . To achieve this we start by setting array
element branch[0][att6][val61] to nextnode, i.e. 1, and increasing nextnode by 1.
We then create the other three branches in a similar way. The value of nextnode
is now 5.
Figure 21.6 H-Tree After Splitting on Node 0 (with current attributes array
for each node shown)
6. In Figures 21.6, 21.8 and 21.9 we depart from our usual notation for trees and
show the values that are in the classtotals array for each node.
For each of the new nodes 1 to 4 the classtotals array is initialised to {0, 0, 0},
the value of hitcount is set to zero and the value of splitAtt is set to 'none'.
We create a current attributes array for each of the new nodes, by taking the
current attributes array from the parent node (node 0) and removing attribute
att6 from each. This gives all of them the array {att1, att2, att3, att4, att5, att7}.
These are the attributes available for any future splitting at those nodes.
We also create a frequency table for each attribute at each of the four new
nodes.
For attribute att2 which has two values, val21 and val22, the values of
frequency tables acvCounts[1][att2], acvCounts[2][att2], acvCounts[3][att2] and
acvCounts[4][att2] are all initially the same, i.e. (Figure 21.7):
Creating the new frequency tables this way may seem innocuous but it is
in fact a major departure from the Information Gain method and the other
methods described in Chapter 5 for situations where all the data is available.
Ideally we would like the new frequency tables to begin with counts of all the
class / attribute value combinations for all the relevant records so far received.
However there is no way of doing this. We would need to re-examine the original
data, but it has all already been discarded. The best we can do is to start
with a table with all zero values for each attribute at each of the nodes, but
this will inevitably mean that the tree eventually generated will be different
from the one that would have been produced if all the data had been stored and
available for re-examination.
We now go on to read and process the next set of records. As each one is read,
it is sorted to the correct leaf node. We can think of the instance starting at
node 0, and then falling down to node 1, 2, 3 or 4 depending on the value of
attribute att6. In a larger tree it might fall further down to a lower level, but
in all cases the instance will be sorted to one of the leaf nodes.
As each record is sorted to a leaf node the values in the hitcount and classto-
tals arrays and the frequency table for each attribute in the current attributes
array are updated at that node.
Pseudocode for processing a record is given below.
We now consider splitting at node 2. The records that have been sorted to that
node have more than one classification, so we go on to calculate the Information
Gain (or other measure) for each attribute in the node’s current attributes
array.
This time we will say that attribute att2, which has two values, is chosen
for splitting, giving the new but still incomplete tree structure shown in Figure
21.8. The new nodes (5 and 6) both have classtotals arrays with the value {0, 0, 0},
hitcount values of zero and frequency tables containing all zero values. The
classtotals and hitcount arrays and frequency tables at the other nodes are left
unchanged.
The new nodes 5 and 6 will both have a currentAtts array containing {att1,
att3, att4, att5, att7}. The value of splitAtt at each of these nodes will be 'none'.
We now continue to read records, sorting each one to the appropriate leaf node,
adjusting the values of classtotals and the contents of the frequency tables for
each attribute each time.
We will assume that at some stage the total number of records sorted to
node 4 in Figure 21.8 increases to 100, the value of G, and that at that stage the
classtotals arrays for the records sorted to the seven nodes are the following:
Node 0: {63, 17, 20}
Node 1: {22, 9, 1}*
Node 2: {87, 10, 3}
Node 3: {8, 7, 15}*
Node 4: {45, 20, 35}*
Node 5: {25, 12, 31}*
Node 6: {0, 0, 0}*
Leaf nodes are again indicated by an asterisk. The classtotals arrays for
nodes 0 and 2 have not changed as they are no longer leaf nodes.
We next consider splitting at node 4. If the decision is no, we simply go on
to read further records. However we will assume that this time attribute att2 is
chosen for splitting (as it was at node 2), giving the new tree structure shown
in Figure 21.9.
We carry on in this way, expanding at most one leaf node at each stage.
Depending on the number of current attributes we have and the number of
values each of them has, we may eventually reach a tree where every leaf node
is non-expandable. If that happens the tree is now fixed and cannot later be
altered. If there are quite a large number of attributes this is highly unlikely
to happen. It is far more likely that the tree will initially develop fairly rapidly
(or as rapidly as the size of the grace period allows) and then stabilise, i.e. stop
evolving or change only slightly as further records are processed.
If all the classifications are identical, say c1, the entropy at the leaf
node is zero and the entropy resulting from splitting on any attribute will
also inevitably be zero. For example if the classtotals array is {100, 0, 0}
the frequency table for attribute att6, which has four values, might look
like this (Figure 21.10):
Class          val61   val62   val63   val64   Row sums
c1               28       0      30      42       100
c2                0       0       0       0         0
c3                0       0       0       0         0
Column sums      28       0      30      42       100
This table has an entropy value of zero. The contribution from each
of the non-zero values in the main body of the table will be cancelled out
by the contribution from the corresponding column sum. This is a general
feature of any frequency table with either zero or one positive entries in
each column of the table. The result is that the entropy will be identical
(i.e. zero) for each attribute so there will be no basis for making a split on
one attribute rather than another and no benefit at all from splitting.
(iii) The node must be expandable, i.e. its current attributes array must not
be empty.
This leads to another revised version of pseudocode fragment 3.
Issue (b), i.e. how to determine which attribute to split on (if any) at node L
forms the topic of the next two sections. (At step 5 there is a forward reference
to pseudocode fragment 4, which will be given in Section 21.5.)
Note that the row sums for classes c1, c2 and c3 are 100, 150, 250,
respectively, and the overall total of all the values in the main body of the
table (i.e. not including those in the column sum row) is 500.
We now form a sum as follows:
(a) For every non-zero value V in the main body of the table subtract V ∗log2 V .
(b) For each non-zero value S in the column sum row add S ∗ log2 S.
(c) Finally divide the total sum by the overall total of all the values in the
main body of the table.
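A sketch of this calculation in code, applied for illustration to the att6 frequency table shown earlier in the chapter; it assumes the table is stored with classes as rows and attribute values as columns, as in the earlier acvCounts sketch.

```python
import math

def entropy_after_split(freq_table):
    # freq_table[class][value] -> count.  Subtract V*log2(V) for every
    # non-zero cell, add S*log2(S) for every non-zero column sum, then
    # divide by the total number of records counted in the table.
    classes = list(freq_table.keys())
    values = list(next(iter(freq_table.values())).keys())
    total = 0.0
    grand_total = 0
    for v in values:
        col_sum = sum(freq_table[c][v] for c in classes)
        grand_total += col_sum
        if col_sum > 0:
            total += col_sum * math.log2(col_sum)
        for c in classes:
            cell = freq_table[c][v]
            if cell > 0:
                total -= cell * math.log2(cell)
    return total / grand_total

att6 = {"c1": {"val61": 32, "val62": 18, "val63": 4, "val64": 9},
        "c2": {"val61": 0,  "val62": 5,  "val63": 5, "val64": 7},
        "c3": {"val61": 0,  "val62": 10, "val63": 7, "val64": 3}}
print(round(entropy_after_split(att6), 4))    # about 0.99
```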
The Hoeffding Bound, ε, is calculated using the formula

ε = R ∗ √( ln(1/δ) / (2 ∗ nrec) )
In this formula nrec is the number of records sorted to the given node, i.e.
hitcount[L] in our array notation. This is usually the same as G, the ‘grace
period’ but may be a multiple of G. The Greek letter δ (pronounced ‘delta’) is
used to represent the value of 1-Prob.
Figure 21.12 shows the value of ln(1/δ) for various common values of the
probability Prob. The ln function is called the natural logarithm function and
is often written as loge .
Probability Prob    δ = 1-Prob    ln(1/δ)
0.9                 0.1           2.3026
0.95                0.05          2.9957
0.99                0.01          4.6052
0.999               0.001         6.9078
The value R corresponds to the range of values of the measure we are using
to decide which attribute to split on, which we will assume is Information Gain.
The smallest value of Information Gain that can be obtained by splitting at a
node is zero and the largest value is the ‘initial entropy’ Estart at the node. We
will use the value of Estart for R.
Number of classes c    Maximum value of R = log2 c
2                      1
3                      1.5850
4                      2
5                      2.3219
6                      2.5850
7                      2.8074
8                      3
The largest value that R can ever take occurs when all the classes are equally
frequent at a node, in which case, assuming there are c classes, the value of the
entropy and hence the value of R is log2 c. Even with a very large number of
streaming records the number of classifications is likely to be a small number.
The maximum values of R corresponding to some small values of c are given
in Figure 21.13.
Putting all this together, if we have three classes distributed evenly (so
R = 1.5850) and want to be 95% certain that attribute X is the best choice,
Figure 21.14 shows the value of the bound for each of several possible values
of nrec.
Number of records nrec    100      200      1,000    2,000    10,000   20,000
Bound                     0.1940   0.1372   0.0613   0.0434   0.0194   0.0137
Figure 21.14 Values of Hoeffding Bound for R = 1.5850 and Prob = 0.95
For each value of nrec, only if the difference between the Information Gains
of the best attribute X and the second best attribute Y is greater than the
bound ε will a split on X be made. As the number of records, nrec, becomes
larger the Hoeffding Bound requirement becomes progressively easier to meet.
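A minimal sketch of the bound calculation, assuming the formula given above; it reproduces the values shown in Figure 21.14.

```python
import math

def hoeffding_bound(R, prob, nrec):
    # epsilon = R * sqrt(ln(1/delta) / (2 * nrec)), with delta = 1 - prob
    delta = 1.0 - prob
    return R * math.sqrt(math.log(1.0 / delta) / (2.0 * nrec))

R = math.log2(3)                     # 1.5850: three equally frequent classes
for nrec in (100, 200, 1000, 2000, 10000, 20000):
    print(nrec, round(hoeffding_bound(R, 0.95, nrec), 4))
# 100 0.194, 200 0.1372, 1000 0.0613, 2000 0.0434, 10000 0.0194, 20000 0.0137
```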
If we want to adopt a more cautious approach, i.e. require a higher probabil-
ity of certainty before splitting, the value of the bound will be correspondingly
larger (making it more difficult to achieve). Figure 21.15 shows the values of
the Hoeffding Bound for R = 1.5850 and Prob = 0.999 for different values of
nrec.
Number of records nrec    100      200      1,000    2,000    10,000   20,000
Bound                     0.2946   0.2083   0.0931   0.0659   0.0295   0.0208
Figure 21.15 Values of Hoeffding Bound for R = 1.5850 and Prob = 0.999
If there is only one attribute in node L’s current attributes list, attribute Y
will be taken to be the ‘null attribute’, equivalent to not splitting at all, with
an Information Gain value of zero.
The main algorithm uses pseudocode fragments 1 and 3, the latter of which
uses numbers 2 and 4.
The final versions of all four pseudocode fragments are repeated here for
ease of reference.
The above method can be adapted to give us a way of evaluating the perfor-
mance of an evolving H-Tree. One possibility is to keep back a file of records
that are not used for building the tree and then use them as a test set, i.e.
treat them as if we did not know the classifications and record the predicted
and actual classes in a confusion matrix7 . (In this case we do not change the
values of the arrays at the nodes to which they are sorted.)
After the last test record has been examined we might have a confusion
matrix such as the one shown in Figure 21.16 (assuming that the test file has
1,000 records).
                 Predicted Class
Actual Class     c1     c2     c3
c1              263      2     21
c2                2    187      8
c3                4      9    504
From this confusion matrix we can calculate measures such as predictive
accuracy and track how the values vary if we repeat the test every hour, every
day etc.
A second possibility is to evaluate the performance of the tree each time a
node is expanded, using the same records that were used to develop the tree,
rather than a separate test set, but recording in the confusion matrix only the
actual versus predicted classifications of those records that have arrived since
the previous split. (With this approach, the values in arrays hitcount, classtotals
and acvCounts are updated for each new record.)
We create a confusion matrix with all zero values immediately after each
split on a node. Using the example in Figure 21.9 (Section 21.3), if the next
record that arrives is sorted to node 5 with actual classification c2 but predicted
classification c1, we increase the count in row c2, column c1 of the confusion
matrix by one.
This method gives a straightforward way of tracking the performance of
the classifier from one split to another in terms of the records that have ar-
rived in that period, rather than on a fixed set of test data. This is probably
preferable in view of the phenomenon of concept drift, which will be described
in Chapter 22.
If all 24 records of the lens24 dataset are repeatedly input to H-Tree in
the same order each time, so the total number of records input is say 2,400,
24,000 or 24,000,000, we can examine the trees produced and compare them
with those produced by TDIDT. For any exact number of replications of the
original data records TDIDT will give the same result as if they had only been
processed once.
We will compare the algorithms by extracting rules from the trees generated,
each rule corresponding to the path from the root node to a leaf node, working
from left to right.
There are nine rules generated by the TDIDT algorithm:
1. IF tears = 1 THEN Class = 3
2. IF tears = 2 AND astig = 1 AND age = 1 THEN Class = 2
3. IF tears = 2 AND astig = 1 AND age = 2 THEN Class = 2
4. IF tears = 2 AND astig = 1 AND age = 3 AND specRx = 1 THEN Class = 3
5. IF tears = 2 AND astig = 1 AND age = 3 AND specRx = 2 THEN Class = 2
6. IF tears = 2 AND astig = 2 AND specRx = 1 THEN Class = 1
7. IF tears = 2 AND astig = 2 AND specRx = 2 AND age = 1 THEN Class = 1
8. IF tears = 2 AND astig = 2 AND specRx = 2 AND age = 2 THEN Class = 3
9. IF tears = 2 AND astig = 2 AND specRx = 2 AND age = 3 THEN Class = 3
The results below show the rules generated by H-Tree with G and Prob set
to 500 and 0.999 respectively for varying numbers of records (all multiples of
24).
2400 records
At this stage there are just three rules, listed below.
1. IF tears = 1 THEN Class = 3
2. IF tears = 2 AND astig = 1 THEN Class = {0, 187, 38}
3. IF tears = 2 AND astig = 2 THEN Class = {149, 0, 76}
The arrays shown for rules 2 and 3 give the class totals for the three classes
(1, 2 and 3) in order.
Rules 2 and 3 look as if they may be ‘compressed’ versions of TDIDT rules
2–5 and 6–9 respectively. How will they develop as more records are processed?
4800 records
At this stage H-Tree has generated six rules, as follows.
1. IF tears = 1 THEN Class = 3
2. IF tears = 2 AND astig = 1 AND age = 1 THEN Class = 2
3. IF tears = 2 AND astig = 1 AND age = 2 THEN Class = 2
4. IF tears = 2 AND astig = 1 AND age = 3 THEN Class = {0, 55, 54}
5. IF tears = 2 AND astig = 2 AND specRx = 1 THEN Class = 1
6. IF tears = 2 AND astig = 2 AND specRx = 2 THEN Class = {54, 0, 109}
The vote (‘US congressional voting’) dataset has 300 records, with 16 attributes
and 2 classes. TDIDT generates 34 rules from this dataset.
Records Rules
12,000 9
24,000 14
36,000 17
72,000 24
120,000 27
360,000 28
480,000 28
720,000 28
Figure 21.17 shows the number of rules generated by the H-Tree algorithm
for different numbers of records, all multiple repetitions of the original 300
records.
In this case inspection shows that the H-Tree seems to be converging towards
the 34 rules generated by TDIDT. Out of the H-Tree’s 28 rules, 24 are the same
as those generated by TDIDT, but even after 720,000 records have been pro-
cessed (2,400 repetitions of the original 300), there are still four of H-Tree’s rules
with mixed classifications that have not (yet?) been expanded into rules gener-
ated by TDIDT. In each case there would seem to be an obvious way in which
the rule could be expanded, and it is entirely possible that the four H-Tree rules
with mixed classifications might evolve into the ten single-classification rules
generated by TDIDT, given yet more repetitions of the original 300 records.
It would seem that the tree produced by H-Tree is converging towards the
one generated by TDIDT, albeit very slowly8 .
8. For some practical applications, to have a tree with a smaller number of leaf nodes
which predicts the same or almost the same classifications as the complete TDIDT
decision tree might be considered preferable, but we will not pursue that issue here.
References
[1] Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In
Proceedings of the sixth ACM SIGKDD international conference on knowl-
edge discovery and data mining (pp. 71–80). New York: ACM.
[2] Hoeffding, W. (1963). Probability inequalities for sums of bounded random
variables. Journal of the American Statistical Association, 58 (301), 13–30.
22
Classifying Streaming Data II:
Time-Dependent Data
When concept drift occurs it may become necessary to replace
the subtree hanging from a node by another subtree that is more appropriate
for the changed concept.
Four crucial features that distinguish algorithms for classifying streaming
data from the TDIDT algorithm described in Chapters 4–6, which we will call
a batch model, are:
• We cannot collect all the data before generating a classification tree as the
volume is potentially infinite.
• We cannot store all the data and revisit it repeatedly as we can do with
TDIDT, once again as the volume is potentially infinite.
• We cannot wait until we have a fixed classification tree before we use the
tree to predict the classification for previously unseen data. We must be
able to use it for prediction with a high level of accuracy at any time,
except for possibly a fairly short start-up phase.
• The algorithm must be able to operate in real-time and thus the amount
of processing needed as each record comes in must be quite small. This
is particularly important if we want to allow for data that arrives in high
volume day-by-day such as data recording supermarket transactions or
withdrawals from bank ATMs.
The H-Tree algorithm (based on Domingos and Hulten’s VFDT algorithm [1])
meets these four criteria. In this chapter we will develop a revised version of the
algorithm that meets the same criteria and also deals with data that is time-
dependent. It is based closely on an algorithm introduced by Hulten, Spencer
and Domingos [2] called CVFDT, standing for Concept-adapting Very Fast
Decision Tree learner. As always with influential algorithms there are many
published variants that purport to be improvements. Our own version, which
will be described in this chapter, is based closely on CVFDT but incorporates
some detailed changes and simplifications. To avoid confusion we will call it
the CDH-Tree algorithm, standing for ‘Concept Drift Hoeffding Tree’.
We will start by reviewing the H-Tree algorithm and then change it piece
by piece to become the final version of CDH-Tree. The next section summarises
the key points of the H-Tree algorithm without explanation. It is assumed that
there is a constant stream of records arriving and that each one is processed as
it arrives and is then thrown away. If you have not yet read Chapter 21
you are strongly encouraged to do so before going on.
The nodes in the tree are numbered in the order in which they were created.
The system variable nextnode holds the number of the next node in sequence
for possible future use. In this case nextnode is 8.
The tree has been created by splitting on attribute att4 at node 0 (creating
nodes 1 and 2), att1 at node 2 (creating nodes 3, 4 and 5) and att5 at node 4
(creating nodes 6 and 7).
There are six arrays maintained at every node. They are listed below.
For each node N, the array element currentAtts[N] contains the attributes
available for splitting at that node, listed in a standard order. This is called the
current attributes array for that node.
If our data records have the values of seven attributes att1, att2, att3, att4,
att5, att6 and att7 then at the root node currentAtts[0] is initialised to the array
{att1, att2, att3, att4, att5, att6, att7}.
When a leaf node is ‘expanded’ by being split on at an attribute, its
immediate descendant nodes inherit the current attributes array of the parent
with the splitting attribute removed. Thus
• currentAtts[1] and currentAtts[2] are both the array {att1, att2, att3, att5,
att6, att7}
• currentAtts[3], currentAtts[4] and currentAtts[5] are all the array {att2, att3,
att5, att6, att7}
• currentAtts[6] and currentAtts[7] are both the array {att2, att3, att6, att7}.
For an internal node N , i.e. one that has previously been split on an attribute,
the array element splitAtt[N ] is the name of the splitting attribute, so splitAtt[0]
is att4, splitAtt[2] is att1 and splitAtt[4] is att5. All the other nodes are leaf
nodes at which there are (by definition) no splitting attributes. The value of
splitAtt[N ] for a leaf node is 'none'.
For each leaf node N , hitcount[N ] is the number of ‘hits’ on the node since it
was created, i.e. the number of records sorted to that leaf node. As each new
record is sorted to a leaf node L, the value of hitcount[L] is increased by 1.
These values are not increased for internal nodes as they play no further part
in the tree generation process. When a new node is created its hitcount value
is set to zero.
We will consider all the pseudocode and other information in this section
to be the ‘initial draft’ (version 1) of a specification for CDH-Tree and will
progressively refine it in what follows.
Here and subsequently the parts altered are shown in bold. Note that
array classtotals is only updated at leaf nodes. It is not used when making
decisions about splitting or unsplitting but has an important role when using
the classification tree for prediction, so its value needs to be increased only at
the leaf node to which each record is sorted.
Pseudocode fragments 1, 2 and 4 remain unaltered.
This change means that we now need to store the most recent W records1 .
When initially the first W records are processed they are stored in a table of
W rows (or some equivalent form), working from row 1 down to row W . When
the next record is read its values go into row 1 and the record that previously
occupied that row (i.e. the oldest record) is discarded. When the next record is
read, it goes into row 2 (which is now the oldest record), the previous occupant
of that row is discarded and so on. After 2 ∗ W records have been processed
none of the original W records will remain in the window.
When a new record is added to the window the acvCounts and hitcount
arrays are incremented for all the nodes it passes through on its path from
the root node to the appropriate leaf (including the root node and leaf node)
and the classtotals array is incremented for the leaf node itself. When an old
record is removed from the window it is not merely discarded, it is forgotten.
Forgetting a record means decreasing by one the values in the acvCounts and
hitcount arrays for all the nodes it passed through on its path from the root
node to the appropriate leaf node when it was first added to the window and
decreasing by one the values in the classtotals array for the leaf node itself.
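The window bookkeeping just described might be sketched as follows, reusing the illustrative structures and the sort_to_leaf function from the Chapter 21 sketches. Storing the path along which each windowed record's counts were incremented is an assumption made only for this sketch (it makes 'forgetting' undo exactly the original increments); it is not necessarily how the pseudocode fragments deal with the problem discussed below.

```python
from collections import deque

W = 9600                 # window size (the value used later in this chapter)
window = deque()         # each entry: (record, classification, path)

def grow(record, classification):
    # if the window is full, forget the oldest record before adding the new
    # one; increment hitcount and acvCounts at every node on the path (root
    # and leaf included) and classtotals at the leaf only
    if len(window) == W:
        forget(*window.popleft())
    leaf, path = sort_to_leaf(record)
    for node in path:
        hitcount[node] += 1
        for att in currentAtts[node]:
            acvCounts[node][att][classification][record[att]] += 1
    classtotals[leaf][classification] += 1
    window.append((record, classification, path))

def forget(record, classification, path):
    # undo the increments that were made when the record entered the window
    for node in path:
        hitcount[node] -= 1
        for att in currentAtts[node]:
            acvCounts[node][att][classification][record[att]] -= 1
    classtotals[path[-1]][classification] -= 1
```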
Since at present we are keeping to the principle that once we have split on
an attribute at a given node that decision is never changed, it is still only the
values of the acvCounts and hitcount arrays at the leaf nodes that affect the
evolving tree. In [2] it is argued that if the data is stationary, adding records
to or removing records from the sliding window should have little effect on the
evolving tree.
Our aim in this chapter is to create an algorithm that will deal with data
that is time-dependent and the value of introducing a sliding window will
become apparent soon.
The revised version of the pseudocode for the main CDH-Tree algorithm
now looks like this.
1. We will assume that each record comprises a set of attribute values together with
a classification.
CDH-Tree Algorithm
(version 2)
1. Set values of G, Prob and W
2. Initialise root node (see pseudocode 1)
3. For each record R with classification C that arrives to be processed
a) If the number of records in the window < W add R to the
window
else
i. take a copy of the oldest record in the window Rold with
classification Cold
ii. replace Rold by R
iii. ‘forget’ record Rold with classification Cold (see
pseudocode 5)
b) Process record R with classification C (see pseudocode 3)
Steps (a) and (b) of the algorithm are often described as the grow/forget
stage.
Pseudocode fragment 5 is based on fragment 3 but of course it does not
need to include any possibility of splitting on the leaf node. Warning – this
version contains an error.
When we come to forget the record the tree may have evolved to look like
this (Figure 22.3).
During the forgetting process (as so far described) the record might now be
sorted to leaf node 10 following the path 0 → 2 → 4 → 9 → 10.
The splitting attribute at node zero (the root) is att4, which has two
values: val41 and val42. These lead to nodes 1 and 2 using the branch array:
branch[0][att4][val41] is 1 and branch[0][att4][val42] is 2. We look at each of
these nodes in turn. Node 1 is a leaf node (we know this from examining
splitAtt[1]), so we take no action at that node. Node 2 is an internal node. It
is split on attribute att1, which has three values: val11, val12 and val13. The
first and third of these lead to leaf nodes 3 and 5 respectively, but the second
branch leads to node 4 which is an internal node, and so the process
continues, until every path eventually leads to a leaf node and ends.
At each internal node we calculate the Information Gain corresponding to
each attribute available for splitting. If the one that was chosen for splitting,
attribute A, does not have the largest value, we find the two attributes with
the largest Information Gain values and calculate the Hoeffding Bound. If the
difference between the values of Information Gain for the best attribute X
and the second best attribute Y is greater than the value of the Hoeffding
Bound, the node is considered suspect and attribute X is considered to be the
alternative splitting attribute for that node.
Whether the node is deemed ‘suspect’ or not it remains split on attribute
A at present. We go on to examine its direct descendants using the contents of
array branch.
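A hedged sketch of this check for a single internal node, building on the entropy_after_split and hoeffding_bound functions from the Chapter 21 sketches; how the initial entropy is obtained and what the function returns are assumptions of the sketch.

```python
import math

def class_counts(node):
    # class distribution at the node, recovered from the row sums of any one
    # of its frequency tables (every attribute's table has the same row sums)
    table = acvCounts[node][currentAtts[node][0]]
    return {c: sum(table[c].values()) for c in table}

def start_entropy(node):
    # initial entropy Estart of the class distribution recorded at the node
    counts = [n for n in class_counts(node).values() if n > 0]
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts)

def alternative_split_attribute(node, prob):
    # return an alternative splitting attribute if the node is 'suspect',
    # otherwise None
    e_start = start_entropy(node)
    gains = {att: e_start - entropy_after_split(acvCounts[node][att])
             for att in currentAtts[node]}
    ranked = sorted(gains, key=gains.get, reverse=True)
    best = ranked[0]
    if best == splitAtt[node]:
        return None                  # the original choice is still the best
    second = ranked[1]               # second-largest Information Gain
    bound = hoeffding_bound(e_start, prob, hitcount[node])
    if gains[best] - gains[second] > bound:
        return best                  # alternative splitting attribute
    return None
```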
All this gives two new fragments of pseudocode: 6 and 7.
2. Figure 22.4 is identical to Figure 22.1. It is repeated for ease of reference.
Instead we create an ‘alternate node’ for any suspect node N with initially
a one-level subtree hanging from it formed by splitting on the ‘alternative
splitting attribute’ X. We will assume that this alternate node is numbered
N1. Node N1 is given the same currentAtts, hitcount, classtotals and
acvCounts arrays as node N but its splitting attribute (the value of array
element splitAtt[N1]) is different. It has the value X.
Next we create a branch from node N1 to a new node for each value of the
new splitting attribute.
Node N1 and its substructure are not part of our classification tree but we
can consider nodes N and N1 being linked by a dotted line as in Figure 22.5,
where N and N1 are 2 and 10 respectively. Node 10 is an alternate node for
node 2 in the main tree.
As time goes by more than one alternate node and its substructure can
be associated with a suspect node. As the grow/forget process continues the
substructure below each alternate node may also evolve.
At some later time (see Section 22.10) a decision will be made whether or
not to replace each of the suspect internal nodes by one of its alternate nodes,
and if there is more than one of them which one. At that time Figure 22.5 may
have evolved to look like this (Figure 22.6).
To record the link between nodes in the main tree and their alternate nodes
in array form we use a seventh array: altTreeList. This is a two-dimensional
array. For Figure 22.6 altTreeList[2] is the array {10, 19}. The altTreeList array
is only applicable to nodes in the main tree, not alternate nodes or any other
nodes in the substructure hanging from them3 .
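In terms of the illustrative structures, creating an alternate node and recording it in altTreeList might look like the following sketch, which reuses new_leaf and split_leaf from the Chapter 21 sketches; it is an assumption about bookkeeping, not a transcription of the pseudocode.

```python
import copy

altTreeList = {}   # node in the main tree -> list of its alternate nodes

def create_alternate(n, alt_att):
    # the alternate node inherits copies of the suspect node's arrays but is
    # split immediately on the alternative splitting attribute, giving a
    # one-level subtree of brand-new leaf nodes
    n1 = new_leaf(currentAtts[n])
    hitcount[n1] = hitcount[n]
    classtotals[n1] = dict(classtotals[n])
    acvCounts[n1] = copy.deepcopy(acvCounts[n])
    split_leaf(n1, alt_att)
    altTreeList.setdefault(n, []).append(n1)
    return n1
```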
We can now replace the words in italics in pseudocode fragment 6:
3. This is a restriction imposed by the CDH-Tree algorithm. It would be possible to
allow nodes in an alternate tree to have their own alternate nodes but at the risk of
creating and needing to maintain an increasingly unwieldy structure, most of which
will never form part of the main tree. It is only the current main tree that is used for
prediction.
Once a decision is made to replace, say node 2 by its alternate node 19,
implementing the change is straightforward. Node 2 is a direct descendant
of node 0. Let us say that the branch in question is for attribute att7 with
value val72. Then the element of the branch array that links nodes 0 and 2
is branch[0][att7][val72]. All we have to do is to change the value of this array
element from 2 to 19.
The aim of this cautious approach to ‘resplitting’ a node on a different
attribute from before is to ensure a smooth changeover to the new tree, which
will enable it to maintain a high level of predictive accuracy throughout.
Using the ‘alternate node’ approach would be pointless for stationary data
(as the data used with the algorithm developed in Chapter 21 is assumed to be),
as it is unlikely that any would ever be created and, if they were, it is unlikely
that they would ever replace the original nodes. In the case of time-dependent
data it is hoped that using alternate nodes will give a smooth and appropriately
cautious way of re-splitting the decision tree at one or more nodes as concept
drift occurs.
The notation (*) indicates a recursive call to the same pseudocode fragment.
Pseudocode 5 also needs to be changed to enable forgetting to take place
in alternate trees (if applicable), not only in the main tree.
Suppose that a test record is sorted to leaf node 22 following the path 0
→ 2 → 4 → 17 → 19 → 22. As the record ‘passes through’ internal node
4 (on its way to the leaf node) it is automatically copied and passed to the
alternate trees hanging from the alternate nodes for node 4, i.e. 8 and 14. This
may lead to it also being sorted to leaf nodes 9 and 29. As the record goes on
and passes through node 19 it is also automatically copied and passed to that
node’s alternate, node 23, and from there may be sorted to leaf node 24. Thus
a single record is sorted to four different leaf nodes.
We can now make a prediction of the classification at each of the four nodes
and compare it with the true classification for the record. We do this using the
contents of the classtotals array for each of the nodes on the path from the root
to the leaf node (including the root and the leaf itself).
To predict the classification at node 22 in the main tree we use the contents
of the classtotals arrays at nodes 0, 2, 4, 17, 19 and 22. If there are three
possible classifications c1, c2 and c3, the array classtotals[2], say, will contain
three values such as {50, 23, 42}. These correspond to the number of records
with each classification that were sorted to that node when it was still a leaf
node, i.e. before splitting on an attribute occurred there. We add together the
contents of the classtotals arrays at nodes 0, 2, 4, 17, 19 and 22, element by
element, into a combined array testTotals which might then contain values such
as {312, 246, 385}. The class with the largest value in the testTotals array in
our example is c3 and so this is taken to be the class predicted for leaf node
22.
We compare the prediction with the true classification, which we will assume
is c3, so in this case the predicted and the true classification are the same.
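A sketch of this prediction step, using the structures from the earlier sketches; the same summation would be applied along the path to a leaf in an alternate tree.

```python
def predict(record):
    # sort the record to a leaf of the main tree, then add together, element
    # by element, the classtotals arrays of every node on the path (root and
    # leaf included); the class with the largest combined total is predicted
    leaf, path = sort_to_leaf(record)
    testTotals = {c: 0 for c in CLASSES}
    for node in path:
        for c in CLASSES:
            testTotals[c] += classtotals[node][c]
    return max(testTotals, key=testTotals.get)
```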
The same method is used for prediction if a record is sorted to a leaf node
in an alternate tree, e.g. node 29. In this case we would combine the contents
of the classtotals arrays at nodes 0, 2, 14, 15 and 29.
As more records are processed during the testing phase we accumulate a
count for each leaf node (in both the main tree and the alternate trees) of the
total number of records sorted to it and the number of those records for
which the classification is correctly predicted. We store these values for each
node in the two-dimensional array testcounts. At node 22, say, array elements
testcounts[22][0] and testcounts[22][1] give the number of records so far sorted
to that node during the testing stage and the number correctly classified
respectively.
At the end of the testing phase we will have a testcounts array with contents
such as the following for each of the 19 leaf nodes in our tree (Figure 22.9).
We can now fill in the table with testcounts array values for each of the
internal nodes too, calculated by adding together the values of each of its
immediate successors, working upwards from the leaf nodes.
Ignoring at present the possible existence of alternate nodes, the
procedure is straightforward. Taking the part of the tree shown in Figure
22.10, the testcounts array for node 19 is the sum of those for nodes 21 and
22, i.e. {25, 18} and {5, 3} making a total of {30, 21}. Adding the testcounts
values for leaf node 20 to this gives the array for internal node 17 as {40, 29}.
Proceeding in this way the values at the leaf nodes are propagated up the tree
right up to the root.
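A sketch of this bottom-up propagation, assuming testcounts is held as a dictionary mapping each node to a two-element list [records, correct] whose leaf entries were filled in during testing, and ignoring alternate nodes for the moment (as the text does at this point). With the values in Figure 22.9 it reproduces the totals {30, 21} at node 19 and {40, 29} at node 17 described above.

```python
testcounts = {}          # node -> [records sorted to it, correctly classified]

def propagate_testcounts(node):
    # a leaf's values were recorded during the testing phase; an internal
    # node's values are the element-by-element sums of its immediate
    # descendants' values
    if splitAtt[node] == "none":
        return testcounts[node]
    records = correct = 0
    for child in branch[node][splitAtt[node]].values():
        child_records, child_correct = propagate_testcounts(child)
        records += child_records
        correct += child_correct
    testcounts[node] = [records, correct]
    return testcounts[node]
```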
Leaf Node N    Number of Records     Number of Correct Classifications
               testcounts[N][0]      testcounts[N][1]
 1             25                    16
 5              6                     3
 6              5                     3
 9             20                    14
10             24                    16
11             10                     7
12              5                     3
13              5                     3
16             14                    12
18              4                     3
20             10                     8
21             25                    18
22              5                     3
24             15                    13
25             15                     6
27             10                     6
28             10                     4
29             20                    13
30             10                    10
Figure 22.10 Extract from Figure 22.8 – Tree Structure at Start of Testing
Phase
Figure 22.11 Figure 22.10 Augmented with Alternate Node 23 and its
Subtree
Before passing the values from node 19 up to its parent node 17, we first
consider the alternate tree rooted at node 23. The testcounts array at node 23
is the combination of those at its successor nodes 24 and 25 (both leaf nodes)
giving {30, 19}.
We now compare the scores at nodes 19 and 23: {30, 21} versus {30, 19}.
The first elements are identical as they must always be, since the same records
pass through each of them in the testing phase. However the descendants of
node 19 correctly predicted the classifications of 21 records, compared with
only 19 for the descendants of alternate node 23. Node 19 is not replaced by its
alternate node and remains unchanged in the tree, as also does alternate node
23.
When we get to node 4 the performance at that node needs to be compared
with that of the alternate trees rooted at its two alternate nodes, 8 and 14. The
testcounts values are as follows.
Node 4: {44, 32}
Node 8: {44, 30}
Node 14: {44, 35}
This time the alternate tree rooted at node 14 has the best performance,
so node 4 is replaced by alternate node 14.
The tree now looks like this (Figure 22.12)4 .
4. Although nodes 14, 15, 16, 29 and 30 were previously parts of an alternate tree
they are now in the main tree and so potentially can have alternate tree structures
attached to them.
Note that the value of the testcounts array for node 2 is now the combination
of those for nodes 14 and 5, i.e. {50, 38}.
Finally we come to the decision whether to replace node 7 by its alternate
node 26. The testcounts arrays are {20, 13} for node 7 and {20, 10} for node
26. Node 7 has the better performance so both nodes remain unchanged in the
tree.
This completes the investigation of the tree at the end of the testing phase.
Part of the structure has changed but there is still one suspect node, 7, with an
alternate, 26. The possibility of replacing the suspect node will be considered
again at the end of the next testing phase, by which time new alternate nodes
and their subtrees may have been created5 .
The above method requires a final change to the pseudocode for the main
CDH-Tree algorithm plus two additional pseudocode fragments 8 and 9.
5. Nodes 4 and 8 in Figure 22.8 and the subtrees hanging from them are not part of
the revised structure and are no longer accessible. It may be possible for a practical
implementation to reuse the memory they occupy but we will not pursue this here.
(The array tcounts returned by step 3(c)(ii) is not used here, but is
important to the recursive definition of pseudocode 9 given below.)
If all 24 records are input to CDH-Tree (in the same order) a large number
of times, so the total number of records input is say 2,400, 24,000 or 24,000,000,
we can examine the trees produced and compare them with those produced by
the tree-generation algorithm TDIDT described in Chapters 4–6. For any exact
multiple of the original data records TDIDT will give the same result as if the
data had only been processed once. In Section 21.8.1 it was shown that at some
point between processing 7200 records and processing 9600 records the H-Tree
constructed from this data had already produced the same tree as TDIDT and
hence the same rules.
After 9600 records (i.e. 400 repetitions of the lens24 data records in the
same order each time) had been processed the decision tree looked like this
(Figure 22.13)6 .
Figure 22.13 Tree Generated by TDIDT and H-Tree for 9,600 Records of
lens24 Data
We can extract rules from such a tree branch by branch, each rule
corresponding to the path from the root node to a leaf node, working from
left to right.
6. We will adopt the convention that the branches at each internal node correspond
to attribute values 1 and 2 (or 1, 2 and 3 in the case of age) in that order, working
from left to right. So, for example, node 6 corresponds to a rule with left-hand side
IF tears = 2 AND astig = 1 AND age = 2. (The corresponding classifications on the
right-hand side are not shown.)
In this case there are nine rules generated by both TDIDT and H-Tree
(Figure 22.14). All the leaf nodes have a single classification.
Figure 22.14 Rules Extracted from Decision Tree for lens24 Data (9,600
Records)
Using the same data with the CDH-Tree algorithm, with W set to 9600,
will produce exactly the same result7 . We now need to find a way to introduce
concept drift into the stream of data coming into CDH-Tree.
7. Up to the point where the sliding window is full, and provided D is greater than
W, CDH-Tree is effectively the same algorithm as H-Tree.
8. We have left attribute age unchanged to avoid irrelevant complications. It has
three attribute values whereas the other attributes all have only two.
The decision tree will now look like this (Figure 22.15).
Figure 22.15 Tree Generated by TDIDT and H-Tree for 9,600 Records of
lens24 Alternative Mode
The rules extracted from the tree are now these (Figure 22.16).
Figure 22.16 Rules Extracted from Decision Tree for lens24 Data –
Alternative Mode
We are now in a position to introduce concept drift into our stream of data.
We will interpret the data as standard mode for the first 19,200 records, then
switch to alternative mode for the next 19,200 records and so on indefinitely,
alternating between the two modes. Figure 22.17 shows the mode applicable to
the part of the infinite stream of records that we will use for our experiment.
19,200 Records
The first testing phase ends (T = 18000, M = 1200). With no alternate
nodes, no changes to the tree are possible. The tree predicts perfectly the
classification of all 1,200 records.
Next the alternative data mode begins.
28,000 Records
The second check for concept drift is made (D = 14000). The split on
attribute tears at node 0 is found no longer to be the best choice. An alternate
node 15 is created with a one-level subtree split on attribute astig. No other
suspect nodes are found.
The tree now looks like this (Figure 22.18). It is in transition from Figure
22.13 (standard mode of lens24 data) towards Figure 22.15 (alternative mode
of the data).
The arrays shown for several of the rules give the classcounts for the three
classes (1, 2 and 3) in order.
The concept drift between one mode of the lens24 data and another is clearly
having an effect on the predictions that would be made about the classification
of records sorted to the different leaf nodes.
36,000 Records
The tree now looks like this (Figure 22.20):
The alternate tree hanging from alternate node 15 is the same as Figure 22.15
(with the node numbers from 15 onwards rather than 0 onwards but in the
same order) except that node 22 has not yet been split on attribute tears.
The corresponding rules are shown in Figure 22.21.
Figure 22.22 Tree After Substitution of Alternate Node 15 for Root Node
38,182 Records
Node 22 is split on attribute tears, giving new nodes 28 and 29. The
tree is now identical to Figure 22.15.
38,400 Records
The data returns to standard mode.
At this point the rules corresponding to the nodes in the tree are identical
to those shown in Figure 22.16.
40,198 Records
Node 23 is split on attribute age, giving new nodes 30, 31 and 32. The
tree is starting its journey back towards Figure 22.13, i.e. the shape it had after
9,600 records.
42,000 Records
The third check for concept drift is made (D = 14000). Now nodes 18
and 19 are found to be ‘suspect’. Node 18 is given an alternate node 33, from
which hangs a one-level tree split on attribute tears (nodes 34 and 35). Node
19 is given an alternate node 36 from which hangs a one-level subtree split on
attribute age (nodes 37, 38 and 39).
The tree now looks like Figure 22.23.
47,984 Records
Node 37 is split on attribute tears (giving new nodes 40 and 41).
48,000 Records
The rules corresponding to the tree are now as shown in Figure 22.24:
54,000 Records
The third testing phase begins (T = 18000).
55,200 Records
The third testing phase ends (T = 18000, M = 1200). Node 18 is replaced
by its alternate node 33.
56,000 Records
The fourth check for concept drift is made (D = 14000). Node 17 is
considered suspect and is given an alternate node 42, which has a one-level
descendant subtree split on attribute tears (new nodes 43 and 44). The state
of the tree is now as shown in Figure 22.25.
It seems to be much harder for the tree to get back to the shape it had in
Figure 22.13 than it was to get from there to Figure 22.15. The prevalence of
splits on attribute tears appears to be a partial substitute for splitting on that
attribute at the root node.
57,600 Records
The rules corresponding to the classification tree are now as shown in Figure
22.26.
The alternative data mode begins.
57,996 Records
Node 43 is split on attribute specRx (giving new nodes 45 and 46).
58,000 Records
Node 44 is split on attribute specRx (giving new nodes 47 and 48).
60,000 Records
We will leave the continuing story here. The tree now has nine corresponding
rules, shown in Figure 22.27.
Following all the changes caused by the concept drift introduced into the
stream of input records, the tree now misclassifies five of each batch of 24
records.
Starting from Figure 22.13 with the data in standard mode after 9,600
records, the tree evolved very satisfactorily into Figure 22.159 with the data
then in alternative mode. However this depended crucially on a suitable
choice of variables W , D, T and M .
9. Strictly, the nodes were numbered differently from Figure 22.15, but in the same
order.
After a total of 60,000 records the tree still showed little sign of evolving
back to Figure 22.13. The third check for concept drift, after 42,000 records,
led to alternate nodes being associated with nodes 18 and 19 (Figure 22.23)
but unfortunately not with the new root node 15. This in turn led to a drop in
the predictive accuracy of the tree.
The algorithm appears to be very sensitive to the choice of variables,
especially D and T . Their sizes relative to each other and to W may well be
critical to the success of the algorithm on real-world data.
References
[1] Domingos, P., & Hulten, G. (2000). Mining high-speed data streams.
In Proceedings of the sixth ACM SIGKDD international conference on
knowledge discovery and data mining (pp. 71–80). New York: ACM.
[2] Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing
data streams. In Proceedings of the seventh ACM SIGKDD international
conference on knowledge discovery and data mining (pp. 97–106).
New York: ACM.
A
Essential Mathematics
This appendix gives a basic description of the main mathematical notation and
techniques used in this book. It has four sections, which deal with, in order:
– the subscript notation for variables and the Σ (or ‘sigma’) notation for sum-
mation (these are used throughout the book, particularly in Chapters 4, 5
and 6)
– tree structures used to represent data items and the processes applied to
them (these are used particularly in Chapters 4, 5 and 9)
– the mathematical function log2 X (used particularly in Chapters 5, 6 and 10)
– set theory (which is used in Chapter 17).
If you are already familiar with this material, or can follow it fairly easily,
you should have no trouble reading this book. Everything else will be explained
as we come to it. If you have difficulty following the notation in some parts of
the book, you can usually safely ignore it, just concentrating on the results and
the detailed examples given.
In some situations a single subscript is not enough and we find it helpful to use
two (or occasionally even more). This is analogous to saying ‘the fifth house on
the third street’ or similar.
We can think of a variable with two subscripts, e.g. a11 , a46 , or in general
aij as representing the cells of a table. The figure below shows the standard
way of referring to the cells of a table with 5 rows and 6 columns. For example,
in a45 the first subscript refers to the fourth row and the second subscript refers
to the fifth column. (By convention tables are labelled with the row numbers
increasing from 1 as we move downwards and column numbers increasing from
1 as we move from left to right.) The subscripts can be separated by a comma
if it is necessary to avoid ambiguity.
a11 a12 a13 a14 a15 a16
a21 a22 a23 a24 a25 a26
a31 a32 a33 a34 a35 a36
a41 a42 a43 a44 a45 a46
a51 a52 a53 a54 a55 a56
We can write a typical value as aij , using two dummy variables i and j.
If we have a table with m rows and n columns, the second row of the
table is a21, a22, . . . , a2n and the sum of the values in the second row is
a21 + a22 + · · · + a2n, i.e. the sum of a2j as j runs from 1 to n.
Finally, we need to point out that subscripts are not always used in the way
shown previously in this appendix. In Chapters 5, 6 and 10 we illustrate the
calculation of two values of a variable E, essentially the ‘before’ and ‘after’
values. We call the original value Estart and the second value Enew . This is
just a convenient way of labelling two values of the same variable. There is no
meaningful way of using an index of summation.
A.2 Trees
Computer Scientists and Mathematicians often use a structure called a tree to
represent data items and the processes applied to them.
Trees are used extensively in the first half of this book, especially in Chap-
ters 4, 5 and 9.
Figure A.1 is an example of a tree. The letters A to M are labels added for
ease of reference and are not part of the tree itself.
A.2.1 Terminology
A.2.2 Interpretation
A tree structure is one with which many people are familiar from family trees,
flowcharts etc. We might say that the root node A of Figure A.1 represents the
most senior person in a family tree, say John. His children are represented by
nodes B, C and D, their children are E, F, L and M and so on. Finally John’s
great-great-grandchildren are represented by nodes J and K.
For the trees used in this book a different kind of interpretation is more
helpful.
Figure A.2 is Figure A.1 augmented by numbers placed in parentheses at
each of the nodes. We can think of 100 units placed at the root and flowing
down to the leaves like water flowing down a mountainside from a single source
(the root) to a number of pools (the leaves). There are 100 units at A. They
flow down to form 60 at B, 30 at C and 10 at D. The 60 at B flow down to E
(10 units) and F (50 units), and so on. We can think of the tree as a means
of distributing the original 100 units from the root step-by-step to a number
of leaves. The relevance of this to using decision trees for classification will
become clear in Chapter 4.
A.2.3 Subtrees
If we consider the part of Figure A.1 that hangs below node F, there are
six nodes (including F itself) and five links which form a tree in their own
right (see Figure A.3). We call this a subtree of the original tree. It is the
subtree ‘descending from’ (or ‘hanging from’) node F. A subtree has all the
characteristics of a tree in its own right, including its own root (node F).
Sometimes we wish to ‘prune’ a tree by removing the subtree which descends
from a node such as F (leaving the node F itself intact), to give a simpler tree,
such as Figure A.4. Pruning trees in this way is dealt with in Chapter 9.
log2 (1/8) = −3
log2 (1/4) = −2
log2 (1/2) = −1
log2 1 = 0
log2 2 = 1
log2 4 = 2
log2 8 = 3
log2 16 = 4
log2 32 = 5
The log2 function has some unusual (and very helpful) properties that
greatly assist calculations using it. These are given in Figure A.7.
The logarithm function can have other bases as well as 2. In fact any positive
number can be a base. All the properties given in Figures A.6 and A.7 apply
for any base.
Another commonly used base is base 10. log10 X = Y means 10^Y = X, so
log10 100 = 2, log10 1000 = 3 etc.
Perhaps the most widely used base of all is the ‘mathematical constant’
with the very innocuous name of e. The value of e is approximately 2.71828.
Logarithms to base e are of such importance that instead of loge X we often
write ln X and speak of the ‘natural logarithm’, but explaining the importance
of this constant is considerably outside the scope of this book.
Few calculators have a log2 function, but many have a log10 , loge or ln
function. To calculate log2 X from the other bases use log2 X = loge X/0.6931
or log10 X/0.3010 or ln X/0.6931.
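For readers who like to check such values by machine, the change of base can be verified in a few lines of Python (an illustrative sketch, not part of the book's own material):

import math

x = 100
print(math.log2(x))                   # direct calculation of log2
print(math.log(x) / math.log(2))      # log2 X = loge X / loge 2, with loge 2 = 0.6931...
print(math.log10(x) / math.log10(2))  # log2 X = log10 X / log10 2, with log10 2 = 0.3010...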
The only base of logarithms used in this book is base 2. However the log2
function also appears in the formula −X log2 X in the discussion of entropy in
Chapters 5 and 10. The value of this function is also only defined for values
of X greater than zero. However the function is only of importance when X is
between 0 and 1. The graph of the important part of this function is given in
Figure A.8.
The initial minus sign is included to make the value of the function positive
(or zero) for all X between 0 and 1.
It can be proved that the function −X log2 X has its maximum value when
X = 1/e = 0.3679 (e is the ‘mathematical constant’ mentioned above). When
X takes the value 1/e, the value of the function is approximately 0.5307.
Values of X from 0 to 1 can sometimes usefully be thought of as proba-
bilities (from 0 = impossible to 1 = certain), so we may write the function as
−p log2 (p). The variable used is of course irrelevant as long as we are consis-
tent. Using the fourth property in Figure A.7, the function can equivalently be
written as p log2 (1/p). This is the form in which it mainly appears in Chapters
5 and 10.
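The statements about the maximum of this function can be confirmed numerically. The short Python sketch below (illustrative only) evaluates −p log2 p on a fine grid of values of p between 0 and 1 and reports where the largest value occurs.

import math

def f(p):
    # the function -p * log2(p), defined for 0 < p <= 1
    return -p * math.log2(p)

grid = [i / 100000 for i in range(1, 100001)]
best = max(grid, key=f)
print(best, round(f(best), 4))   # approximately 0.36788 (i.e. 1/e) and 0.5307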
No element may appear in a set more than once, so {a, b, c, b} is not a valid
set. The order in which the elements of a set are listed is not significant, so
{a, b, c} and {c, b, a} are the same set.
The cardinality of a set is the number of elements it contains, so {dog, cat,
mouse} has cardinality three and {a, b, {a, b, c}, d, e} has cardinality five. The
set with no elements {} is called the empty set and is written as ∅.
We usually think of the members of a set being drawn from some ‘universe
of discourse’, such as all the people who belong to a certain club. Let us assume
that set A contains all those who are aged under 25 and set B contains all those
who are married.
We call the set containing all the elements that occur in either A or B or
both the union of the two sets A and B. It is written as A ∪ B. If A is the set
{John, Mary, Henry} and B is the set {Paul, John, Mary, Sarah} then A ∪ B is
the set {John, Mary, Henry, Paul, Sarah}, the set of people who are either under
25 or married or both. Figure A.9 shows two overlapping sets. The shaded area
is their union.
We call the set containing all the elements (if there are any) that occur in
both A and B the intersection of the two sets A and B. It is written A ∩ B.
If A is the set {John, Mary, Henry} and B is the set {Paul, John, Mary, Sarah}
as before, then A ∩ B is the set {John, Mary}, the set of people who are both
under 25 and married. Figure A.10 shows two overlapping sets. In this case,
the shaded area is their intersection.
Two sets are called disjoint if they have no elements in common, for ex-
ample A = {Max, Dawn} and B = {Frances, Bryony, Gavin}. In this case their
intersection A ∩ B is the set with no elements, which we call the empty set and
represent by {} or (more often) by ∅. Figure A.11 shows this case.
If two sets are disjoint their union is the set comprising all the elements in
the first set and all those in the second set.
There is no reason to be restricted to two sets. It is meaningful to refer to
the union of any number of sets (the set comprising those elements that appear
in any one or more of the sets) and the intersection of any number of sets (the
set comprising those elements that appear in all the sets). Figure A.12 shows
three sets, say A, B and C. The shaded area is their intersection A ∩ B ∩ C.
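These operations correspond directly to Python's built-in set type, as the following illustrative fragment (using the same example names as above) shows.

A = {"John", "Mary", "Henry"}
B = {"Paul", "John", "Mary", "Sarah"}
print(A | B)      # union: John, Mary, Henry, Paul and Sarah
print(A & B)      # intersection: John and Mary
print(len(A))     # cardinality of A: 3

C = {"Max", "Dawn"}
D = {"Frances", "Bryony", "Gavin"}
print(C & D)      # disjoint sets: the empty set, printed as set()

X, Y, Z = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
print(X & Y & Z)  # intersection of three sets: {3}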
A.4.1 Subsets
Dataset chess
Description
This dataset was used for one of a well-known series of experiments by
the Australian researcher Ross Quinlan, taking as an experimental testbed
the Chess endgame with White king and rook versus Black king and
knight. This endgame served as the basis for several studies of Machine
Learning and other Artificial Intelligence techniques in the 1970s and 1980s.
The task is to classify positions (all with Black to move) as either ‘safe’
or ‘lost’, using attributes that correspond to configurations of the pieces.
The classification ‘lost’ implies that whatever move Black makes, White
will immediately be able to give checkmate or alternatively will be able to
capture the Knight without giving stalemate or leaving his Rook vulnerable
to immediate capture. Generally this is not possible, in which case the
position is ‘safe’. This task is trivial for human experts but has proved
remarkably difficult to automate in a satisfactory way. In this experiment
(Quinlan’s ‘third problem’), the simplifying assumption is made that the
board is of infinite size. Despite this, the classification task remains a hard
one. Further information is given in [1].
Source: Reconstructed by the author from description given in [1].
Classes
safe, lost
Attributes and Attribute Values
The first four attributes represent the distance between pairs of pieces (wk
and wr: White King and Rook, bk and bn: Black King and Knight). They
all have values 1, 2 and 3 (3 denoting any value greater than 2).
dist bk bn
dist bk wr
dist wk bn
dist wk wr
The other three attributes all have values 1 (denoting true) and 2 (denoting
false).
inline (Black King and Knight and White Rook in line)
wr bears bk (White Rook bears on the Black King)
wr bears bn (White Rook bears on the Black Knight)
Instances
Training set: 647 instances
No separate test set
Dataset contact_lenses
Description
Data from ophthalmic optics relating clinical data about a patient to a
decision as to whether he/she should be fitted with hard contact lenses,
soft contact lenses or none at all.
Classes
hard lenses: The patient should be fitted with hard contact lenses
soft lenses: The patient should be fitted with soft contact lenses
no lenses: The patient should not be fitted with contact lenses
Instances
Training set: 108 instances
No separate test set
Dataset crx
Description
This dataset concerns credit card applications. The data is genuine but the
attribute names and values have been changed to meaningless symbols to
protect confidentiality of the data.
Classes
+ and − denoting a successful application and an unsuccessful application,
respectively (largest class for the training data is −)
Instances
Training set: 690 instances (including 37 with missing values)
Test set: 200 instances (including 12 with missing values)
Dataset genetics
Description
Each instance comprises the values of a sequence of 60 DNA elements
classified into one of three possible categories. For further information see
[3].
Classes
EI, IE and N
Instances
Training set: 3190 instances
No separate test set
Dataset glass
Description
This dataset is concerned with the classification of glass left at the scene
of a crime into one of six types (such as ‘tableware’, ‘headlamp’ or ‘build-
ing windows float processed’), for purposes of criminological investigation.
The classification is made on the basis of nine continuous attributes (plus
an identification number, which is ignored).
Classes
1, 2, 3, 5, 6, 7
Type of glass:
1 building windows float processed
2 building windows non float processed
3 vehicle windows float processed
4 vehicle windows non float processed (none in this dataset)
5 container
6 tableware
7 headlamp
Instances
Training set: 214 instances
No separate test set
Dataset golf
Description
A synthetic dataset relating a decision on whether or not to play golf to
weather observations.
Classes
Play, Don’t Play
Instances
Training set: 14 instances
No separate test set
Dataset hepatitis
Description
The aim is to classify patients into one of two classes, representing ‘will
live’ or ‘will die’, on the basis of 13 categorical and 9 continuous attributes.
Classes
1 and 2 representing ‘will die’ and ‘will live’ respectively
Instances
Training set: 155 instances (including 75 with missing values)
No separate test set
Dataset hypo
Description
This is a dataset on Hypothyroid disorders collected by the Garvan
Institute in Australia. Subjects are divided into five classes based on the
values of 29 attributes (22 categorical and 7 continuous).
Classes
hyperthyroid, primary hypothyroid, compensated hypothyroid, secondary
hypothyroid, negative
Instances
Training set: 2514 instances (all with missing values)
Test set: 1258 instances (371 with missing values)
Dataset iris
Description
Iris Plant Classification. This is one of the best known classification
datasets, which is widely referenced in the technical literature. The aim is
to classify iris plants into one of three classes on the basis of the values of
four continuous attributes.
Classes
Iris-setosa, Iris-versicolor, Iris-virginica (there are 50 instances in the
dataset for each classification)
Instances
Training set: 150 instances
No separate test set
Datasets 455
Dataset labor-ne
Description
This is a small dataset, created by Collective Bargaining Review (a monthly
publication). It gives details of the final settlements in labor negotiations in
Canadian industry in 1987 and the first quarter of 1988. The data includes
all collective agreements reached in the business and personal services
sector for local organisations with at least 500 members (teachers, nurses,
university staff, police, etc).
Classes
good, bad
Instances
Training set: 40 instances (39 with missing values)
Test set: 17 instances (all with missing values)
Dataset lens24
Description
A reduced and simplified version of the contact_lenses dataset, with only 24 instances.
Classes
1, 2, 3
Instances
Training set: 24 instances
No separate test set
Dataset monk1
Description
Monk’s Problem 1. The ‘Monk’s Problems’ are a set of three artificial prob-
lems with the same set of six categorical attributes. They have been used to
test a wide range of classification algorithms, originally at the second Euro-
pean Summer School on Machine Learning, held in Belgium during summer
1991. There are 3 × 3 × 2 × 3 × 4 × 2 = 432 possible instances. All of them
are included in the test set for each problem, which therefore includes the
training set in each case.
The ‘true’ concept underlying Monk’s Problem 1 is: if (attribute#1 =
attribute#2) or (attribute#5 = 1) then class = 1 else class = 0
Classes
0, 1 (62 instances for each classification)
Instances
Training set: 124 instances
Test set: 432 instances
Dataset monk2
Description
Monk’s Problem 2. See monk1 for general information about the Monk’s
Problems. The ‘true’ concept underlying Monk’s problem 2 is: if (at-
tribute#n = 1) for exactly two choices of n (from 1 to 6) then class = 1
else class = 0
Classes
0, 1
Instances
Training set: 169 instances
Test set: 432 instances
Dataset monk3
Description
Monk’s Problem 3. See monk1 for general information about the Monk’s
Problems. The ‘true’ concept underlying Monk’s Problem 3 is:
if (attribute#5 = 3 and attribute#4 = 1) or (attribute#5 = 4 and at-
tribute#2 = 3) then class = 1 else class = 0
This dataset has 5% noise (misclassifications) in the training set.
Classes
0, 1
Instances
Training set: 122 instances
Test set: 432 instances
Dataset pima-indians
Description
The dataset concerns the prevalence of diabetes in Pima Indian women. It
is considered to be a difficult dataset to classify.
The dataset was created by the (United States) National Institute of
Diabetes and Digestive and Kidney Diseases and is the result of a study on
768 adult female Pima Indians living near Phoenix. The goal is to predict
the presence of diabetes using seven health-related attributes, such as
‘Number of times pregnant’ and ‘Diastolic blood pressure’, together with
age.
Classes
0 (‘tested negative for diabetes’) and 1 (‘tested positive for diabetes’)
Instances
Training set: 768 instances
No separate test set
Dataset sick-euthyroid
Classes
sick-euthyroid and negative
Instances
Training set: 3163 instances
No separate test set
Dataset vote
Description
Voting records drawn from the Congressional Quarterly Almanac, 98th
Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Wash-
ington, DC, 1985.
This dataset includes votes for each of the US House of Representatives
Congressmen on the 16 key votes identified by the CQA. The CQA lists nine
different types of vote: voted for, paired for, and announced for (these three
simplified to yea), voted against, paired against, and announced against
(these three simplified to nay), voted present, voted present to avoid conflict
of interest, and did not vote or otherwise make a position known (these three
simplified to an unknown disposition).
The instances are classified according to the party to which the voter
belonged, either Democrat or Republican. The aim is to predict the voter’s
party on the basis of 16 categorical attributes recording the votes on topics
such as handicapped infants, aid to the Nicaraguan Contras, immigration,
a physician fee freeze and aid to El Salvador.
Classes
democrat, republican
Instances
Training set: 300 instances
Test set: 135 instances
References
[1] Quinlan, J. R. (1979). Discovering rules by induction from large collections
of examples. In D. Michie (Ed.), Expert systems in the micro-electronic age
(pp. 168–201). Edinburgh: Edinburgh University Press.
[2] Cendrowska, J. (1990). Knowledge acquisition for expert systems: inducing
modular rules from examples. PhD Thesis, The Open University.
[3] Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training knowl-
edge-based neural networks to recognize genes in DNA sequences. In Ad-
vances in neural information processing systems (Vol. 3). San Mateo: Mor-
gan Kaufmann.
C
Sources of Further Information
Websites
There is a great deal of information about all aspects of data mining available on
the World Wide Web. A good place to start looking is the ‘Knowledge Discovery
Nuggets’ site at http://www.kdnuggets.com, which has links to information
on software, products, companies, datasets, other websites, courses, conferences
etc.
Another very useful source of information is The Data Mine at http://www.the-data-mine.com.
The Natural Computing Applications Forum (NCAF) is an active British-
based group specialising in Predictive Analytics, Data Mining and related tech-
nologies. Their website is at http://www.ncaf.org.uk.
Books
There are many books on Data Mining. Some popular ones are listed below.
1. Data Mining: Concepts and Techniques (second edition) by J. Han
and M. Kamber. Morgan Kaufmann, 2006. ISBN: 1-55860-901-6.
Conferences
There are many conferences and workshops on Data Mining every year. Two
of the most important regular series are:
The annual KDD-20xx series of conferences organised by SIGKDD (the
ACM Special Interest Group on Knowledge Discovery and Data Mining)
in the United States and Canada. For details see the SIGKDD website at
www.kdd.org.
The annual IEEE ICDM (International Conferences on Data Mining) series.
These move around the world, with every second year in the United States or
Canada. For details see the ICDM website at http://www.cs.uvm.edu/~icdm.
Intersection (of two sets) The intersection of two sets A and B, written as
A ∩ B, is the set that includes all the elements (if there are any) that occur
in both of the sets
Interval-scaled Variable A type of variable. See Section 2.2
Invalid Value An attribute value that is invalid for a given dataset. See
Noise
Item For Market Basket Analysis, each item corresponds to one of the
purchases made by a customer, e.g. bread or milk. We are not usually
concerned with items that were not purchased
Itemset For Market Basket Analysis, a set of items purchased by a
customer, effectively the same as a transaction. Itemsets are generally
written in list notation, e.g. {fish, cheese, milk}
J -Measure A rule interestingness measure that quantifies the informa-
tion content of a rule
j-Measure A value used in calculating the J -measure of a rule
Jack-knifing Another name for N -fold cross-validation
k-fold Cross-validation A strategy for estimating the performance of a
classifier
k-Means Clustering A widely used method of clustering
k-Nearest Neighbour Classification A method of classifying an unseen
instance using the classification of the instance or instances closest to
it (see Chapter 3)
Knowledge Discovery The non-trivial extraction of implicit, previously un-
known and potentially useful information from data. See Introduction
Labelled Data Data where each instance has a specially designated at-
tribute which can be either categorical or continuous. The aim is gen-
erally to predict its value. See Unlabelled Data
Landscape-style Dataset A dataset for which there are far more at-
tributes than instances
Large Itemset Another name for Supported Itemset
Lazy Learning For classification tasks, a form of learning where the train-
ing data is left unchanged until an unseen instance is presented for
classification. See Eager Learning
Leaf Node A node of a tree which has no other nodes descending from it
Rule Fires The antecedent of the rule is satisfied for a given instance
Rule Induction The automatic generation of rules from examples
Rule Interestingness Measure A measure of the importance of a rule
Ruleset A collection of rules
Sample Standard Deviation A statistical measure of the ‘spread’ of the
numbers in a sample. The square root of the sample variance
Sample Variance A statistical measure of the ‘spread’ of the numbers in a
sample. The square of the sample standard deviation
Sampling The selection of a subset of the members of a dataset (or other
collection of objects, people etc.) that it is hoped will accurately represent
the characteristics of the whole population
Sampling with Replacement A form of sampling where the whole popu-
lation of objects is available for selection at each stage (implying that the
sample may include an object two or more times)
Scale-up of a Distributed Data Mining System A measure of the per-
formance of a distributed data mining system
Search Space In Chapter 16, the set of possible rules of interest
Search Strategy A method of examining the contents of a search space
(usually in an efficient order)
Sensitivity Another name for true positive rate
Set An unordered collection of items, known as elements. See Appendix A.
The elements of a set are often written between ‘brace’ characters and
separated by commas, e.g. {apples, oranges, bananas}
Significance Test A test applied to estimate the probability that an apparent
relationship between two variables is (or is not) a chance occurrence
Simple Majority Voting See Majority Voting
Single-link Clustering For hierarchical clustering, a method of calcu-
lating the distance between two clusters using the shortest distance from
any member of one cluster to any member of the other
Size Cutoff A possible criterion for pre-pruning a decision tree
Size-up of a Distributed Data Mining System A measure of the perfor-
mance of a distributed data mining system
Self-assessment Exercise 2
Question 1
Labelled data has a specially designated attribute. The aim is to use the data
given to predict the value of that attribute for instances that have not yet
been seen. Data that does not have any specially designated attribute is called
unlabelled.
Question 2
Name: Nominal
Date of Birth: Ordinal
Sex: Binary
Weight: Ratio-scaled
Height: Ratio-scaled
Marital Status: Nominal (assuming that there are more than two values, e.g.
single, married, widowed, divorced)
Number of Children: Integer
Question 3
– Discard all instances where there is at least one missing value and use the
remainder.
– Estimate missing values of each categorical attribute by its most frequently
occurring value in the training set and estimate missing values of each con-
tinuous attribute by the average of its values for the training set.
Self-assessment Exercise 3
Question 1
Using the values in Figure 3.2, the probability of each class for the first unseen
instance is as follows.
class = on time
0.70 × 0.64 × 0.43 × 0.29 × 0.07 = 0.0039
class = late
0.10 × 0.5 × 0 × 0.5 × 0.5 = 0
class = very late
0.15 × 1 × 0 × 0.33 × 0.67 = 0
class = cancelled
0.05 × 0 × 0 × 1 × 1 = 0
The largest value is for class = on time
The probability of each class for the second unseen instance is as follows.
class = on time
0.70 × 0.07 × 0.43 × 0.36 × 0.57 = 0.0043
class = late
0.10 × 0 × 0 × 0.5 × 0 = 0
class = very late
0.15 × 0 × 0 × 0.67 × 0 = 0
class = cancelled
0.05 × 0 × 0 × 0 × 0 = 0
The largest value is for class = on time
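The arithmetic for both unseen instances can be reproduced with a few lines of Python (a sketch only; the probabilities are simply the values quoted above for the first unseen instance).

scores = {
    "on time":   0.70 * 0.64 * 0.43 * 0.29 * 0.07,
    "late":      0.10 * 0.5 * 0 * 0.5 * 0.5,
    "very late": 0.15 * 1 * 0 * 0.33 * 0.67,
    "cancelled": 0.05 * 0 * 0 * 1 * 1,
}
for cls, score in scores.items():
    print(cls, round(score, 4))
print("largest value is for class =", max(scores, key=scores.get))   # on time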
Question 2
The distance of the first instance in Figure 3.5 from the unseen instance is the
square root of (0.8 − 9.1)² + (6.3 − 11.0)², i.e. 9.538.
The distances for the 20 instances are given in the table below.
The five nearest neighbours are marked with asterisks in the rightmost
column.
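The distance calculation can be checked with a short script (illustrative only; just the first instance of Figure 3.5 is shown, since the full table is not reproduced here).

import math

def euclidean(p, q):
    # straight-line distance between two points in two dimensions
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

unseen = (9.1, 11.0)
print(round(euclidean((0.8, 6.3), unseen), 3))   # 9.538

# given a list 'points' of (x, y) pairs, the five nearest neighbours would be
# sorted(points, key=lambda p: euclidean(p, unseen))[:5]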
Self-assessment Exercise 4
Question 1
No two instances with the same values of all the attributes may belong to
different classes.
Question 2
The most likely cause is probably noise or missing values in the training set.
Question 3
Provided the adequacy condition is satisfied the TDIDT algorithm is guaran-
teed to terminate and give a decision tree corresponding to the training set.
Question 4
A situation will be reached where a branch has been generated to the maximum
length possible, i.e. with a term for each of the attributes, but the corresponding
subset of the training set still has more than one classification.
Self-assessment Exercise 5
Question 1
(a) The proportions of instances with each of the two classifications are 6/26
and 20/26. So Estart = −(6/26) log2 (6/26) − (20/26) log2 (20/26) = 0.7793.
(b) The following shows the calculations.
Splitting on SoftEng
SoftEng = A
Proportions of each class: FIRST 6/14, SECOND 8/14
Entropy = −(6/14) log2 (6/14) − (8/14) log2 (8/14) = 0.9852
SoftEng = B
Proportions of each class: FIRST 0/12, SECOND 12/12
Entropy = 0 [all the instances have the same classification]
Weighted average entropy Enew = (14/26) × 0.9852 + (12/26) × 0 = 0.5305
Information Gain = 0.7793 − 0.5305 = 0.2488
Splitting on ARIN
ARIN = A
Proportions of each class: FIRST 4/12, SECOND 8/12
Entropy = 0.9183
ARIN = B
Proportions of each class: FIRST 2/14, SECOND 12/14
Entropy = 0.5917
Weighted average entropy Enew = (12/26) × 0.9183 + (14/26) × 0.5917 = 0.7424
Information Gain = 0.7793 − 0.7424 = 0.0369
Splitting on HCI
HCI = A
Proportions of each class: FIRST 1/9, SECOND 8/9
Entropy = 0.5033
HCI = B
Proportions of each class: FIRST 5/17, SECOND 12/17
Entropy = 0.8740
Weighted average entropy Enew = (9/26) × 0.5033 + (17/26) × 0.8740 = 0.7457
Information Gain = 0.7793 − 0.7457 = 0.0337
Splitting on CSA
CSA = A
Proportions of each class: FIRST 3/7, SECOND 4/7
Entropy = 0.9852
CSA = B
Proportions of each class: FIRST 3/19, SECOND 16/19
Entropy = 0.6293
Weighted average entropy Enew = (7/26) × 0.9852 + (19/26) × 0.6293 = 0.7251
Information Gain = 0.7793 − 0.7251 = 0.0542
Splitting on Project
Project = A
Proportions of each class: FIRST 5/9, SECOND 4/9
Entropy = 0.9911
Project = B
Proportions of each class: FIRST 1/17, SECOND 16/17
Entropy = 0.3228
Weighted average entropy Enew = (9/26) × 0.9911 + (17/26) × 0.3228 = 0.5541
Information Gain = 0.7793 − 0.5541 = 0.2253
The maximum value of information gain is for attribute SoftEng.
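All the calculations above follow a single pattern, sketched below in Python (an illustration only; the counts shown are those for attribute SoftEng).

import math

def entropy(counts):
    # entropy of a set of class frequency counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_start = entropy([6, 20])                      # 0.7793
branches = [[6, 8], [0, 12]]                    # SoftEng = A and SoftEng = B
e_new = sum(sum(b) / 26 * entropy(b) for b in branches)
print(round(e_start, 4), round(e_new, 4), round(e_start - e_new, 4))
# 0.7793 0.5305 0.2488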
Question 2
The TDIDT algorithm inevitably leads to a decision tree where all nodes have
entropy zero. Reducing the average entropy as much as possible at each step
would seem like an efficient way of achieving this in a relatively small num-
ber of steps. The use of entropy minimisation (or information gain maximisa-
tion) appears generally to lead to a small decision tree compared with other
attribute selection criteria. The Occam’s Razor principle suggests that small
trees are most likely to be the best, i.e. to have the greatest predictive power.
Self-assessment Exercise 6
Question 1
The frequency table for splitting on attribute SoftEng is as follows.
Attribute value
Class A B
FIRST 6 0
SECOND 8 12
Total 14 12
Using the method of calculating entropy given in Chapter 6, the value is:
−(6/26) log2 (6/26) − (8/26) log2 (8/26) − (12/26) log2 (12/26)
+ (14/26) log2 (14/26) + (12/26) log2 (12/26)
= 0.5305
This is the same value as was obtained using the original method for Self-
assessment Exercise 1 for Chapter 5. Similar results apply for the other at-
tributes.
Question 2
It was shown previously that the entropy of the degrees dataset is 0.7793.
The value of the Gini Index is 1 − (6/26)² − (20/26)² = 0.3550.
Attribute value
Class A B
FIRST 6 0
SECOND 8 12
Total 14 12
Attribute value
Class A B
FIRST 4 2
SECOND 8 12
Total 12 14
The value of entropy is 0.7424
The value of split information is 0.9957
So the information gain is 0.7793 − 0.7424 = 0.0369
and the gain ratio is 0.0369/0.9957 = 0.0371
New value of Gini Index = 0.3370
Attribute value
Class A B
FIRST 1 5
SECOND 8 12
Total 9 17
The value of entropy is 0.7457
The value of split information is 0.9306
So the information gain is 0.7793 − 0.7457 = 0.0337
and the gain ratio is 0.0337/0.9306 = 0.0362
New value of Gini Index = 0.3399
Attribute value
Class A B
FIRST 3 3
SECOND 4 16
Total 7 19
Attribute value
Class A B
FIRST 5 1
SECOND 4 16
Total 9 17
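The figures quoted above for ARIN and HCI can be reproduced from their frequency tables with a sketch such as the following (illustrative only; small differences in the last decimal place can arise because the text works with already-rounded intermediate values).

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_measures(columns, e_start=0.7793):
    # columns: one list of class counts per attribute value
    n = sum(sum(col) for col in columns)
    e_new = sum(sum(col) / n * entropy(col) for col in columns)
    split_info = entropy([sum(col) for col in columns])
    gini_new = sum(sum(col) / n * (1 - sum((c / sum(col)) ** 2 for c in col))
                   for col in columns)
    gain = e_start - e_new
    return round(gain, 4), round(gain / split_info, 4), round(gini_new, 4)

# ARIN: value A has 4 FIRST and 8 SECOND, value B has 2 FIRST and 12 SECOND
print(split_measures([[4, 8], [2, 12]]))   # gain 0.0369, gain ratio approx. 0.037, Gini 0.337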
Self-assessment Exercise 7
Question 1
vote Dataset, Figure 7.14
The number of correct predictions is 127 and the total number of instances
is 135.
We have p = 127/135 = 0.9407, N = 135, so the standard error is
√(p × (1 − p)/N) = √(0.9407 × 0.0593/135) = 0.0203.
The value of the predictive accuracy can be expected to lie in the following
ranges:
probability 0.90: from 0.9407 − 1.64 × 0.0203 to 0.9407 + 1.64 × 0.0203, i.e.
from 0.9074 to 0.9741
probability 0.95: from 0.9407 − 1.96 × 0.0203 to 0.9407 + 1.96 × 0.0203, i.e.
from 0.9009 to 0.9806
probability 0.99: from 0.9407 − 2.58 × 0.0203 to 0.9407 + 2.58 × 0.0203, i.e.
from 0.8883 to 0.9932
The number of correct predictions is 149 and the total number of instances
is 214.
We have p = 149/214 = 0.6963, N = 214, so the standard error is
√(p × (1 − p)/N) = √(0.6963 × 0.3037/214) = 0.0314.
The value of the predictive accuracy can be expected to lie in the following
ranges:
probability 0.90: from 0.6963 − 1.64 × 0.0314 to 0.6963 + 1.64 × 0.0314, i.e.
from 0.6447 to 0.7478
probability 0.95: from 0.6963 − 1.96 × 0.0314 to 0.6963 + 1.96 × 0.0314, i.e.
from 0.6346 to 0.7579
probability 0.99: from 0.6963 − 2.58 × 0.0314 to 0.6963 + 2.58 × 0.0314, i.e.
from 0.6152 to 0.7774
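These intervals can be reproduced with a short calculation (a sketch only, using the same multipliers 1.64, 1.96 and 2.58).

import math

def accuracy_interval(correct, total, z):
    # predictive accuracy plus or minus z standard errors
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return round(p - z * se, 4), round(p + z * se, 4)

for z in (1.64, 1.96, 2.58):
    print(z, accuracy_interval(127, 135, z))   # the vote dataset figures above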
Question 2
False positive classifications would be undesirable in applications such as the
prediction of equipment that will fail in the near future, which may lead to
expensive and unnecessary preventative maintenance. False classifications of
individuals as likely criminals or terrorists can have very serious repercussions
for the wrongly accused.
False negative classifications would be undesirable in applications such as
medical screening, e.g. for patients who may have a major illness requiring
treatment, or prediction of catastrophic events such as hurricanes or earth-
quakes.
Decisions about the proportion of false negative (positive) classifications that would be acceptable in order to reduce the proportion of false positives (negatives) to zero are a matter of personal taste. There is no general answer.
Self-assessment Exercise 8
Question 1
Sorting the values of humidity into ascending numerical order gives the follow-
ing table.
Humidity Class
(%)
65 play
70 play
70 play
70 don’t play
75 play
78 play
80 don’t play
80 play
80 play
85 don’t play
90 don’t play
90 play
95 don’t play
96 play
The amended rule for selecting cut points given in Section 8.3.2 is: ‘only
include attribute values for which the class value is different from that for the
previous attribute value, together with any attribute value which occurs more
than once and the attribute value immediately following it’.
This rule gives the cut points for the humidity attribute as all the values in
the above table except 65 and 78.
Question 2
Figure 8.12(c) is reproduced below.
Value of A Frequency for class Total Value of χ2
c1 c2 c3
1.3 1 0 4 5 3.74
1.4 1 2 1 4 5.14
2.4 6 0 2 8 3.62
6.5 3 2 4 9 4.62
8.7 6 0 1 7 1.89
12.1 7 2 3 12 1.73
29.4 0 0 1 1 3.20
56.2 2 4 0 6 6.67
87.1 0 1 3 4 1.20
89.0 1 1 2 4
Total 27 12 21 60
After the 87.1 and 89.0 rows are merged, the figure looks like this.
Value of A Frequency for class Total Value of χ2
c1 c2 c3
1.3 1 0 4 5 3.74
1.4 1 2 1 4 5.14
2.4 6 0 2 8 3.62
6.5 3 2 4 9 4.62
8.7 6 0 1 7 1.89
12.1 7 2 3 12 1.73
29.4 0 0 1 1 3.20
56.2 2 4 0 6 6.67
87.1 1 2 5 8
Total 27 12 21 60
The previous values of χ2 are shown in the rightmost column. Only the
one given in bold can have been changed by the merging process, so this value
needs to be recalculated.
For the adjacent intervals labelled 56.2 and 87.1 the values of O and E are
as follows.
Self-assessment Exercise 9
The decision tree shown in Figure 9.8 is reproduced below for ease of refer-
ence.
Node Estimated
error rate
A 0.2
B 0.35
C 0.1
D 0.2
E 0.01
F 0.25
G 0.05
H 0.1
I 0.2
J 0.15
K 0.2
L 0.1
M 0.1
Self-assessment Exercise 10
Question 1
The entropy of a training set depends only on the relative proportions of the
classifications, not on the number of instances it contains. Thus for both train-
ing sets the answer is the same.
Entropy = −0.2 × log2 0.2 − 0.3 × log2 0.3 − 0.25 × log2 0.25 − 0.25 × log2 0.25
= 1.985
Question 2
It is best to ask any question that divides the people into two approximately
equal halves. An obvious question would be ‘Is the person male?’. This might
well be appropriate in a restaurant, a theatre etc. but would not be suitable for
a group where there is a large predominance of one sex, e.g. a football match.
In such a case a question such as ‘Does he or she have brown eyes?’ might be
better, or even ‘Does he or she live in a house or flat with an odd number?’
Self-assessment Exercise 11
The degrees dataset given in Figure 4.3 is reproduced below for ease of
reference.
SoftEng ARIN HCI CSA Project Class
A B A B B SECOND
A B B B A FIRST
A A A B B SECOND
B A A B B SECOND
A A B B A FIRST
B A A B B SECOND
A B B B B SECOND
A B B B B SECOND
A A A A A FIRST
B A A B B SECOND
B A A B B SECOND
A B B A B SECOND
B B B B A SECOND
A A B A B FIRST
B B B B A SECOND
A A B B B SECOND
B B B B B SECOND
A A B A A FIRST
B B B A A SECOND
B B A A B SECOND
B B B B A SECOND
B A B A B SECOND
A B B B A FIRST
A B A B B SECOND
B A B B B SECOND
A B B B B SECOND
ARIN = B 2 6 0.333
HCI = A 1 1 1.0
HCI = B 4 8 0.5
CSA = A 2 3 0.667
CSA = B 3 6 0.5
Self-assessment Exercise 12
The true positive rate is the number of instances that are correctly predicted
as positive divided by the number of instances that are actually positive.
The false positive rate is the number of instances that are wrongly predicted
as positive divided by the number of instances that are actually negative.
Predicted class
+ −
Actual class + 50 10
− 10 30
For the table above the values are:
True positive rate: 50/60 = 0.833
False positive rate: 10/40 = 0.25
The Euclidean distance is defined as: Euc = √(fprate² + (1 − tprate)²)
For this table Euc = √((0.25)² + (1 − 0.833)²) = 0.300.
For the other three tables specified in the Exercise the values are as follows.
Second table
True positive rate: 55/60 = 0.917
False positive rate: 5/40 = 0.125
Euc = 0.150
Third table
True positive rate: 40/60 = 0.667
False positive rate: 1/40 = 0.025
Euc = 0.334
Fourth table
True positive rate: 60/60 = 1.0
False positive rate: 20/40 = 0.5
Euc = 0.500
The following ROC graph shows the four classifiers as well as the four
hypothetical ones at (0, 0), (1, 0), (0, 1) and (1, 1).
If we were equally concerned about avoiding false positive and false nega-
tive classifications we should choose the one given in the second table in the
Exercise, which has true positive rate 0.917 and false positive rate 0.125. This
is the one closest to (0, 1) the perfect classifier in the ROC graph.
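The four Euclidean distances can be confirmed with a few lines of Python (illustrative only; the confusion matrix counts below are those implied by the rates quoted above, with 60 actual positives and 40 actual negatives).

import math

def roc_distance(tp, fn, fp, tn):
    # distance of a classifier from the perfect classifier at (0, 1) in ROC space
    tprate = tp / (tp + fn)
    fprate = fp / (fp + tn)
    return round(math.sqrt(fprate ** 2 + (1 - tprate) ** 2), 3)

print(roc_distance(50, 10, 10, 30))   # 0.3
print(roc_distance(55, 5, 5, 35))     # 0.15
print(roc_distance(40, 20, 1, 39))    # 0.334
print(roc_distance(60, 0, 20, 20))    # 0.5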
Self-assessment Exercise 13
Question 1
The frequency tables for the four attributes are given below, followed by the
class frequency table. The attribute values needed for part (2) are shown in
bold.
Attribute class
day on time late very late cancelled
weekday 12 2 5 1
saturday 3 1 0 1
sunday 2 0 0 0
holiday 3 0 0 0
Attribute class
season on time late very late cancelled
spring 4 0 0 1
summer 10 1 1 1
autumn 2 0 1 0
winter 4 2 3 0
Attribute class
wind on time late very late cancelled
none 8 0 0 1
high 5 2 2 1
normal 7 1 3 0
Attribute class
rain on time late very late cancelled
none 9 1 1 1
slight 10 0 1 0
heavy 1 2 3 1
class
on time late very late cancelled
TOTAL 20 3 5 2
Question 2
For convenience we can put the rows shown in bold in the four attribute fre-
quency tables together in a single table, augmented by the corresponding class
frequencies and probabilities.
class
on time late very late cancelled
weekday 12/20 = 0.60 2/3 = 0.67 5/5 = 1.0 1/2 = 0.50
summer 10/20 = 0.50 1/3 = 0.33 1/5 = 0.20 1/2 = 0.50
high 5/20 = 0.25 2/3 = 0.67 2/5 = 0.40 1/2 = 0.50
heavy 1/20 = 0.05 2/3 = 0.67 3/5 = 0.60 1/2 = 0.50
We can also construct a table of prior probabilities from the class frequency
table, using the total frequency (30) as the denominator.
class
on time late very late cancelled
Prior Probability 20/30 = 0.67 3/30 = 0.10 5/30 = 0.17 2/30 = 0.07
Multiplying the prior probability of each class by the four conditional probabilities in its column gives the following scores:
class = on time: 0.67 × 0.60 × 0.50 × 0.25 × 0.05 = 0.0025
class = late: 0.10 × 0.67 × 0.33 × 0.67 × 0.67 = 0.0099
class = very late: 0.17 × 1.0 × 0.20 × 0.40 × 0.60 = 0.0082
class = cancelled: 0.07 × 0.50 × 0.50 × 0.50 × 0.50 = 0.0044
The class with the largest score is selected, in this case class = late.
Self-assessment Exercise 14
Question 1
Setting a threshold of 0.5 has the effect of eliminating classifiers 4 and 5, leaving
a reduced table as follows.
Self-assessment Exercise 15
Question 1
The average value of B − A is 2.8.
Question 2
The standard error is 1.237 and the t value is 2.264.
Question 3
The t value is larger than the value in the 0.05 column of Figure 4 for 19
degrees of freedom, i.e. 2.093, so we can say that the performance of classifier
B is significantly different from that of classifier A at the 5% level. As the answer
to Question 1 is a positive value we can say that classifier B is significantly
better than classifier A at the 5% level.
Question 4
The 95% confidence interval for the improvement offered by classifier B over
classifier A is 2.8 ± (2.093 ∗ 1.237) = 2.8 ± 2.589, i.e. we can be 95% certain
that the true average improvement in predictive accuracy lies between 0.211%
and 5.389%.
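A minimal sketch of the calculation, assuming the twenty paired differences B − A are held in a list called diffs (the critical value 2.093 is the one taken from the t table above):

import math

def paired_t_summary(diffs, t_crit=2.093):
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))  # sample standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean difference
    t = mean / se
    return mean, se, t, (mean - t_crit * se, mean + t_crit * se)

# with the values in the Exercise this returns a mean of 2.8, a standard error of
# 1.237, t = 2.264 and the confidence interval 2.8 plus or minus 2.589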
Self-assessment Exercise 16
Question 1
Using the formulae for Confidence, Completeness, Support, Discriminability
and RI given in Chapter 16, the values for the five rules are as follows.
Rule Confid. Complete Support Discrim. RI
1 0.972 0.875 0.7 0.9 124.0
2 0.933 0.215 0.157 0.958 30.4
3 1.0 0.5 0.415 1.0 170.8
4 0.5 0.8 0.289 0.548 55.5
5 0.983 0.421 0.361 0.957 38.0
Question 2
Let us assume that the attribute w has the three values w1 , w2 and w3 and
similarly for attributes x, y and z.
If we arbitrarily choose attribute w to be on the right-hand side of each
rule, there are three possible types of rule:
IF . . . THEN w = w1
IF . . . THEN w = w2
IF . . . THEN w = w3
Let us choose one of these, say the first, and calculate how many possible
left-hand sides there are for such rules.
The number of ‘attribute = value’ terms on the left-hand side can be one,
two or three. We consider each case separately.
Self-assessment Exercise 17
Question 1
At the join step of the Apriori-gen algorithm, each member (set) is compared
with every other member. If all the elements of the two members are identical
except the right-most ones (i.e. if the first two elements are identical in the
case of the sets of three elements specified in the Exercise), the union of the
two sets is placed into C4 .
For the members of L3 given the following sets of four elements are placed
into C4 : {a, b, c, d}, {b, c, d, w}, {b, c, d, x}, {b, c, w, x}, {p, q, r, s}, {p, q, r, t} and
{p, q, s, t}.
At the prune step of the algorithm, each member of C4 is checked to see
whether all its subsets of 3 elements are members of L3 .
So {b, c, d, w}, {b, c, d, x}, {b, c, w, x}, {p, q, r, t} and {p, q, s, t} are removed
by the prune step, leaving C4 as {{a, b, c, d}, {p, q, r, s}}.
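The join and prune steps can be sketched in Python as follows (an illustration of the logic just described, not the book’s own implementation; itemsets are represented as sorted tuples).

from itertools import combinations

def apriori_gen(L_k):
    # L_k: list of sorted tuples, each of length k
    k = len(L_k[0])
    members = set(L_k)
    # join step: combine pairs that are identical except for their right-most elements
    candidates = {a + (b[-1],) for a in L_k for b in L_k
                  if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # prune step: keep a candidate only if all of its k-element subsets are in L_k
    return {c for c in candidates
            if all(s in members for s in combinations(c, k))}

# applied to the L3 given in the Exercise this returns
# {('a', 'b', 'c', 'd'), ('p', 'q', 'r', 's')}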
Question 2
The relevant formulae for support, confidence, lift and leverage for a database
of 5000 transactions are:
support(L → R) = support(L ∪ R) = count(L ∪ R)/5000 = 3000/5000 =
0.6
confidence(L → R) = count(L ∪ R)/count(L) = 3000/3400 = 0.882
lift(L → R) = 5000 × confidence(L → R)/count(R) = 5000 × 0.882/4000 =
1.103
leverage(L → R) = support(L ∪ R) − support(L) × support(R)
= count(L ∪ R)/5000 − (count(L)/5000) × (count(R)/5000) = 0.056
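The same figures can be obtained with a few lines of Python (a sketch; count_L, count_R and count_LR simply stand for the counts quoted above).

n = 5000                                    # total number of transactions
count_L, count_R, count_LR = 3400, 4000, 3000

support = count_LR / n                               # 0.6
confidence = count_LR / count_L                      # 0.882
lift = n * confidence / count_R                      # 1.103
leverage = support - (count_L / n) * (count_R / n)   # 0.056
print(round(support, 3), round(confidence, 3), round(lift, 3), round(leverage, 3))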
Self-assessment Exercise 18
Question 1
The conditional FP-tree for itemset {c} is shown below.
Question 2
The support count can be determined by following the link joining the two c
nodes and adding the support counts associated with each of the nodes together.
The total support count is 3 + 1 = 4.
Question 3
As the support count is greater than or equal to 3, itemset {c} is frequent.
Question 4
The contents of the four arrays corresponding to the conditional FP-tree for
itemset c are given below.
index item name count linkto parent oldindex
1 c 3 3 2 1
2 f 3 2
3 c 1 9
nodes2 / link arrays (startlink2, lastlink), indexed by item name (p, m, a, c, f, b):
c 1 3
f 2 2
Self-assessment Exercise 19
Question 1
We begin by choosing three of the instances to form the initial centroids. We
can do this in many possible ways, but it seems reasonable to select three
instances that are fairly far apart. One possible choice is as follows.
Initial
x y
Centroid 1 2.3 8.4
Centroid 2 8.4 12.6
Centroid 3 17.1 17.2
In the following table the columns headed d1, d2 and d3 show the Euclidean
distance of each of the 16 points from the three centroids. The column headed
‘cluster’ indicates the centroid closest to each point and thus the cluster to
which it should be assigned.
x y d1 d2 d3 cluster
1 10.9 12.6 9.6 2.5 7.7 2
2 2.3 8.4 0.0 7.4 17.2 1
3 8.4 12.6 7.4 0.0 9.8 2
4 12.1 16.2 12.5 5.2 5.1 3
5 7.3 8.9 5.0 3.9 12.8 2
6 23.4 11.3 21.3 15.1 8.6 3
7 19.7 18.5 20.1 12.7 2.9 3
8 17.1 17.2 17.2 9.8 0.0 3
9 3.2 3.4 5.1 10.6 19.6 1
10 1.3 22.8 14.4 12.4 16.8 2
11 2.4 6.9 1.5 8.3 17.9 1
12 2.4 7.1 1.3 8.1 17.8 1
13 3.1 8.3 0.8 6.8 16.6 1
14 2.9 6.9 1.6 7.9 17.5 1
15 11.2 4.4 9.8 8.7 14.1 2
16 8.3 8.7 6.0 3.9 12.2 2
We now reassign all the objects to the cluster to which they are closest and
recalculate the centroid of each cluster. The new centroids are shown below.
After first iteration
x y
Centroid 1 2.717 6.833
Centroid 2 7.9 11.667
Centroid 3 18.075 15.8
We now calculate the distance of each object from the three new centroids.
As before the column headed ‘cluster’ indicates the centroid closest to each
point and thus the cluster to which it should be assigned.
x y d1 d2 d3 cluster
10.9 12.6 10.0 3.1 7.9 2
2.3 8.4 1.6 6.5 17.4 1
8.4 12.6 8.1 1.1 10.2 2
12.1 16.2 13.3 6.2 6.0 3
7.3 8.9 5.0 2.8 12.8 2
We now again reassign all the objects to the cluster to which they are closest
and recalculate the centroid of each cluster. The new centroids are shown below.
After second iteration
x y
Centroid 1 2.717 6.833
Centroid 2 7.9 11.667
Centroid 3 18.075 15.8
These are unchanged from the first iteration, so the process terminates. The
objects in the final three clusters are as follows.
Cluster 1: 2, 9, 11, 12, 13, 14
Cluster 2: 1, 3, 5, 10, 15, 16
Cluster 3: 4, 6, 7, 8
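The iterations described above can be reproduced with a compact k-means sketch such as the following (illustrative only; points would be the sixteen (x, y) pairs in the Exercise and centroids the three initial centroids chosen above; it assumes no cluster ever becomes empty).

import math

def assign(points, centroids):
    # index of the nearest centroid for each point
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def recompute(points, labels, k):
    # new centroid = mean of the points currently assigned to each cluster
    new = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        new.append((sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members)))
    return new

def k_means(points, centroids):
    while True:
        labels = assign(points, centroids)
        new_centroids = recompute(points, labels, len(centroids))
        if new_centroids == centroids:   # centroids unchanged, so the process terminates
            return labels, centroids
        centroids = new_centroids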
Question 2
In Section 19.3.1 the initial distance matrix between the six objects a, b, c, d,
e and f is the following.
a b c d e f
a 0 12 6 3 25 4
b 12 0 19 8 14 15
c 6 19 0 12 5 18
d 3 8 12 0 11 9
e 25 14 5 11 0 7
f 4 15 18 9 7 0
The closest objects are those with the smallest non-zero distance value in
the table. These are objects a and d which have a distance value of 3. We
combine these into a single cluster of two objects which we call ad. We can now
rewrite the distance matrix with rows a and d replaced by a single row ad and
similarly for the columns.
As in Section 19.3.1, the entries in the matrix for the various distances be-
tween b, c, e and f obviously remain the same, but how should we calculate
the entries in row and column ad?
ad b c e f
ad 0 ? ? ? ?
b ? 0 19 14 15
c ? 19 0 5 18
e ? 14 5 0 7
f ? 15 18 7 0
The question specifies that complete link clustering should be used. For this
method the distance between two clusters is taken to be the longest distance
from any member of one cluster to any member of the other cluster. On this
basis the distance from ad to b is 12, the longer of the distance from a to b (12)
and the distance from d to b (8) in the original distance matrix. The distance
from ad to c is also 12, the longer of the distance from a to c (6) and the distance
from d to c (12) in the original distance matrix. The complete distance matrix
after the first merger is now as follows.
ad b c e f
ad 0 12 12 25 9
b 12 0 19 14 15
c 12 19 0 5 18
e 25 14 5 0 7
f 9 15 18 7 0
ad b ce f
ad 0 12 25 9
b 12 0 19 15
ce 25 19 0 18
f 9 15 18 0
adf b ce
adf 0 15 25
b 15 0 19
ce 25 19 0
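The row and column merging used above can be expressed as a small function on a dictionary of distances keyed by unordered pairs of cluster names (an illustrative sketch only).

def complete_link_merge(dist, a, b):
    # merge clusters a and b; the distance from the merged cluster to any other
    # cluster is the longer of the two original distances (complete link)
    merged = a + b
    others = {x for pair in dist for x in pair} - {a, b}
    new_dist = {p: d for p, d in dist.items() if a not in p and b not in p}
    for c in others:
        new_dist[frozenset((merged, c))] = max(dist[frozenset((a, c))],
                                               dist[frozenset((b, c))])
    return new_dist

d = {frozenset(p): v for p, v in {
    ("a", "b"): 12, ("a", "c"): 6, ("a", "d"): 3, ("a", "e"): 25, ("a", "f"): 4,
    ("b", "c"): 19, ("b", "d"): 8, ("b", "e"): 14, ("b", "f"): 15,
    ("c", "d"): 12, ("c", "e"): 5, ("c", "f"): 18,
    ("d", "e"): 11, ("d", "f"): 9, ("e", "f"): 7}.items()}
d = complete_link_merge(d, "a", "d")   # the distance from 'ad' to 'b' is now 12, and so on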
Self-assessment Exercise 20
Question 1
The value of TFIDF is the product of two values, tj and log2 (n/nj ), where
tj is the frequency of the term in the current document, nj is the number of
documents containing the term and n is the total number of documents.
For term ‘dog’ the value of TFIDF is 2 × log2 (1000/800) = 0.64
For term ‘cat’ the value of TFIDF is 10 × log2 (1000/700) = 5.15
For term ‘man’ the value of TFIDF is 50 × log2 (1000/2) = 448.29
For term ‘woman’ the value of TFIDF is 6 × log2 (1000/30) = 30.35
The small number of documents containing the term ‘man’ accounts for the
high TFIDF value.
Question 2
To normalise a vector, each element needs to be divided by its length, which
is the square root of the sum of the squares of all the elements. For vector
(20, 10, 8, 12, 56) the length is the square root of 20² + 10² + 8² + 12² + 56²
= √3844 = 62. So the normalised vector is (20/62, 10/62, 8/62, 12/62, 56/62),
i.e. (0.323, 0.161, 0.129, 0.194, 0.903).
For vector (0, 15, 12, 8, 0) the length is √433 = 20.809. The normalised form
is (0, 0.721, 0.577, 0.384, 0).
The distance between the two normalised vectors can be calculated using
the dot product formula as the sum of the products of the corresponding pairs
of values, i.e. 0.323 × 0 + 0.161 × 0.721 + 0.129 × 0.577 + 0.194 × 0.384 + 0.903 × 0
= 0.265.
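Both parts of this solution can be checked with a short script (a sketch only; n is the total number of documents and the other values are those quoted above).

import math

def tfidf(term_freq, docs_containing, n=1000):
    return term_freq * math.log2(n / docs_containing)

print(round(tfidf(2, 800), 2), round(tfidf(10, 700), 2),
      round(tfidf(50, 2), 2), round(tfidf(6, 30), 2))    # 0.64 5.15 448.29 30.35

def normalise(vec):
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

v1 = normalise([20, 10, 8, 12, 56])
v2 = normalise([0, 15, 12, 8, 0])
print(round(sum(a * b for a, b in zip(v1, v2)), 3))      # dot product: 0.265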
Self-assessment Exercise 21
Question 1
The TDIDT algorithm relies on having all the data available for repeated use
as the decision tree is built. As each node is split on an attribute it is necessary
to re-scan the data in order to construct the frequency tables for each of the
descendant nodes.
Question 2
The use of a Hoeffding Bound is intended to make the algorithm make more
cautious decisions about splitting on an attribute. Once a node has been split on
an attribute it cannot be unsplit or resplit, so it is important to avoid making
bad decisions about splitting, even at the risk of occasionally not making a
good one.
Question 3
After splitting on an attribute at a node the algorithm creates an empty fre-
quency table for each attribute in the current attributes array for each of the
descendant nodes as it has no means of re-scanning the data to construct ta-
bles with the correct values (see solution to Question 1). If there is a large
amount of data – and assuming that the underlying model does not change –
newly-arriving records should accumulate values in the frequency tables that
are in approximately the same proportions as those of the frequency tables that
would have been produced if all the data had been stored.
Question 4
The candidate attribute for splitting is att3 as it has the largest value of Infor-
mation Gain. The difference between this value and the second largest (which
corresponds to attribute att4) is 1.3286 − 1.0213 = 0.3073.
The formula for the Hoeffding Bound is given in Section 21.5 as
R × √(ln(1/δ)/(2 × nrec))
In this formula nrec is the number of records sorted to the given node, which
is the sum of the values in the classtotals[Z ] array, i.e. 100.
The Greek letter δ is used to represent the value of 1-Prob. From Fig-
ure 21.12 we can see that the value of ln(1/δ) is 6.9078.
The value R corresponds to the range of values that Information Gain can
take at node Z, which we are assuming is the same as the ‘initial entropy’ at
the node. We can calculate this using the values in the classtotals array. These
are in the same proportions as the values in the example in Section 21.4 and
so give the same result, i.e. 1.4855 (to 4 decimal places).
Putting these values into the formula for the Hoeffding Bound we obtain
the value 1.4855 × √(6.9078/200) = 0.2761. The difference between IG(att3) and
IG(att4 ) is 0.3073, which is larger than the value of the Hoeffding Bound so
we will decide to split on attribute att3 at node Z.
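The arithmetic can be reproduced directly (a sketch; Prob = 0.999 and hence δ = 0.001 is the value consistent with ln(1/δ) = 6.9078 quoted above).

import math

def hoeffding_bound(R, delta, nrec):
    # R is the range of the quantity being estimated, here the initial entropy at the node
    return R * math.sqrt(math.log(1 / delta) / (2 * nrec))

bound = hoeffding_bound(R=1.4855, delta=0.001, nrec=100)
print(round(bound, 4))              # 0.2761
print(1.3286 - 1.0213 > bound)      # True, so we split on att3 at node Z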
Self-assessment Exercise 22
Question 1
The aim of the testing phase is to determine whether any of the internal nodes
in the main tree can be replaced by one of its alternate nodes, so if none of the
internal nodes has an alternate a testing phase is certain to have no effect. This
does no harm apart from testing records unnecessarily. The system can avoid
it by maintaining a count of the number of alternate nodes assigned to internal
nodes in the main tree and only entering a testing phase if the count is positive.
When an internal node is substituted by one of its alternates, the count needs
to be reduced by the total number of alternates for that node which may be
greater than one.
Question 2
The hitcount and acvCounts arrays are incremented at each of the nodes
through which each incoming record passes on its path from the root to a
leaf node, so there is multiple counting of records. By contrast the classtotals
array has precisely one entry for each record in the current sliding window, at
the leaf node to which it was sorted when it was processed (which may since
have been split on an attribute and become an internal node).
Index
Itemset 254, 255, 256, 258, 259–262, 264–266, 272, 274–276
Jack-knifing 83
J-Measure 246, 247–250
j-Measure 247
Keywords 342
k-fold Cross-validation 82–83
k-Means Clustering 314–319
k-Nearest Neighbour Classification 30, 31
Knowledge Discovery 2–3
Labelled Data 4–5, 10
labor-ne Dataset 444, 455
Landscape-style Dataset 192
Large Itemset See Supported Itemset
Lazy Learning 36–37
Leaf Node 42, 130, 322, 347, 432
Learning 5–8, 36–37, 194
Leave-one-out Cross-validation 83
Length of a Vector 335
lens24 Dataset 55–56, 374–376, 410, 412, 444, 456
Leverage 266–268
Lift 266–267
Link 431
Linked Neighbourhood 343
Local Dictionary 331
Local Discretisation 95, 96–97, 116–118
Local Information Partition 195–196
Logarithm Function 139, 434–437
Majority Voting 209, 215
Manhattan Distance 34
Market Basket Analysis 8, 245, 253–268
Markup Information 342
Matches 255
Mathematics 427–441
Maximum Dimension Distance 34
maxIntervals 114–116
Members of a Set 437–438
Metadata 342
Microaveraging 337
Minimum Error Pruning 130
minIntervals 114–116
Missing Branches 76–77
Missing Value
– attribute, 15–16, 86–89
– classification, 89, 234
Model-based Classification Algorithm 37
Moderator Program 191, 192
monk1 Dataset 444, 457
monk2 Dataset 444, 458
monk3 Dataset 444, 459
Morphological Variants 332
Multiple Classification 329–330, 331
Mutually Exclusive and Exhaustive Categories (or Classifications) 21, 28, 329
Mutually Exclusive and Exhaustive Events 23
Naïve Bayes Algorithm 28
Naïve Bayes Classification 22–29, 36–37, 202–205
n-dimensional Space 32, 33
N-dimensional Vector 334–335
Nearest Neighbour Classification 6, 29–37
Network of Computers 219
Network of Processors 190
Neural Network 7
N-fold Cross-validation 83–84
Node (of a Decision Tree) 42, 431, 432
Node (of a FP-tree) 276–308
Noise 13, 16, 122, 127, 172–173, 235, 341
Nominal Variable 10–11
Non-expandable Leaf Node 357
Normalisation (of an Attribute) 35–36
Normalised Vector Space Model 335–336, 337
Null Hypothesis 69, 71, 223, 225, 226, 227
Numerical Prediction 4, 7
Object 9, 41, 45
Objective Function 314, 320–321
Observed Value 70–71
Opportunity Sampling 233 See Chapter 15
Order of a Rule 249, 250
Ordinal Variable 11
Outlier 14–15
Overfitting 121–122, 127–135, 162–163, 321
Overheads 191, 200
Paired t-test 223–229
Parallel Ensemble Classifier 219
Parallelisation 173, 190, 219
Path 432
Pessimistic Error Pruning 130
Piatetsky-Shapiro Criteria 241–243
pima-indians Dataset 444, 460
PMCRI 194–201
Portrait-style Dataset 192
Positive Predictive Value See Precision