DataMining_Chapter3
In this chapter we present decision trees, a model widely used in Data Mining.
3.1 INTRODUCTION
For certain fields of application, it is essential to produce classification procedures that the
user can understand. This is particularly the case for medical diagnosis, where the doctor
needs to be able to interpret the reasons for the diagnosis. Decision trees meet this
requirement because they graphically represent a set of rules and are easy to interpret.
3.2.1 Aim
A decision tree models a hierarchy of tests on the values of a set of variables called attributes.
At the end of these tests, the model (the decision tree) produces a numerical value or selects
an element from a discrete set of conclusions. The former is known as regression and the
latter as classification. For example, the following decision tree (figure 3.1) models a problem
where we wish to classify individuals into two classes {sick, healthy} according to the values
taken by two descriptors: "temperature" and "sore throat".
Fig 3.2. Example of a regression decision tree.
The internal nodes of a decision tree are called decision nodes. These nodes are labelled with
a test that can be applied to any description of an individual in the population. In general, each
test examines the value of a single attribute in the description space. The possible answers to
the test correspond to the labels of the arcs originating from this node. In the case of binary
decision nodes, the labels of the arcs are omitted and, by convention, the left arc corresponds
to a positive response to the test. Leaves are labelled by a class.
A decision tree is the graphical representation of a classification procedure. Each complete
description is associated with a single leaf of the decision tree. This association is defined by
starting at the root of the tree and moving down the tree according to the responses to the tests
that label the internal nodes. The associated class is then the default class associated with the
leaf that corresponds to the description. The classification procedure obtained is immediately
translated into decision rules. The rule systems obtained are special in that the order in which
the attributes are examined is fixed and the decision rules are mutually exclusive.
If (Temperature <= 37)
  then if (Sore throat)
    then Class = "sick"
    else Class = "healthy"
  else Class = "sick"
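Read as executable code, these rules translate directly into a small classification function. The following Python sketch is one way to write them; the function name and the representation of a description by a numeric temperature and a boolean sore_throat flag are choices made for this illustration.

def classify(temperature, sore_throat):
    # Decision rules of figure 3.1: temperature is tested first, then sore throat.
    if temperature <= 37:
        if sore_throat:
            return "sick"
        return "healthy"
    return "sick"

print(classify(36.5, True))    # "sick"
print(classify(36.5, False))   # "healthy"
print(classify(39.0, False))   # "sick"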
Position of a node: The nodes of a tree are identified by positions, numbers obtained by concatenating the level of the node with its rank from the left at that level (figure 3.4). The root is denoted Ø. For example, position 11 refers to the node at level 1 that is first from the left.
Given a sample S, a set of classes {1, ..., c} and a decision tree T, each position pos of T corresponds to a subset of the sample: the set of examples that satisfy the tests from the root down to that position. Consequently, for any position pos of T, we can define the following quantities:
N(pos) is the number of examples associated with pos,
N(k/pos) is the number of examples associated with pos that belong to class k,
P(k/pos) = N(k/pos)/N(pos) is the proportion of examples of class k at position pos.
Example: Consider the decision tree from the previous example, together with a sample of 200 patients: 100 are sick and 100 are healthy. The distribution between the two classes S (Sick) and H (Healthy) is given by:

                      Sore throat              No sore throat
Temperature <= 37     (0 Healthy, 38 Sick)     (100 Healthy, 0 Sick)
We return to the tree, adding the associated examples to each node (figure 3.5).
Fig 3.5 Decision tree with examples associated with each node
Here is the calculation of these quantities at the root.
We then have: N(Ø)=200; N(H/Ø)=100; N(S/Ø)=100; P(H/Ø)=100/200 and
P(S/Ø)=100/200.
The disorder at a position pos is measured by its entropy:

Entropy(pos) = − Σk P(k/pos) log2 P(k/pos), where the sum runs over the classes k = 1, ..., c        (equation 3.1)

At the root this gives:

Entropy(Ø) = − (100/200) log2(100/200) − (100/200) log2(100/200) = 1.00
In general, entropy decreases as we go down the tree, until it reaches zero at the leaf level.
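As an illustration, here is a small Python sketch of the entropy of equation 3.1, computed from the class counts at a position; the function name and the choice of raw counts as input are assumptions made for this example, not notation from the chapter.

import math

def entropy(class_counts):
    # Entropy of a position, given the number N(k/pos) of examples in each class k.
    total = sum(class_counts)
    result = 0.0
    for n in class_counts:
        if n > 0:                  # the term for an empty class is taken to be 0
            p = n / total          # P(k/pos) = N(k/pos) / N(pos)
            result -= p * math.log2(p)
    return result

print(entropy([100, 100]))   # 1.0 at the root: maximum disorder for two classes
print(entropy([0, 38]))      # 0.0 at a pure leaf, e.g. (0 Healthy, 38 Sick)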
All decision tree construction methods rely on the following three operators:
1. Decide whether a node is terminal, i.e. whether it should be labelled as a leaf. For example: all the examples at the node belong to the same class, or fewer than a given number of them are misclassified, ...
2. Select a test to associate with a node. For example: at random, using statistical criteria, etc.
3. Assign a class to a leaf. The majority class is assigned, except where cost or risk functions are used.
The methods differ in the choices made for these operators, i.e. the choice of the test (for example, the use of entropy and the gain function) and the stopping criterion (when to stop growing the tree, i.e. when to decide that a node is terminal). The general outline of the algorithms is as follows:
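One possible rendering of this general outline is the illustrative Python sketch below, built only from the three operators above; the representation of an example as a (description, label) pair and the helper parameters is_terminal, select_test and assign_class are assumptions made for this sketch.

def build_tree(examples, is_terminal, select_test, assign_class):
    # Generic greedy construction of a decision tree from the three operators:
    #   is_terminal(examples)  -> True if the node must become a leaf (operator 1)
    #   select_test(examples)  -> a test, i.e. a function description -> answer (operator 2)
    #   assign_class(examples) -> the class given to a leaf, usually the majority class (operator 3)
    if is_terminal(examples):
        return {"leaf": assign_class(examples)}
    test = select_test(examples)
    # Partition the examples according to their answers to the test
    partition = {}
    for description, label in examples:
        partition.setdefault(test(description), []).append((description, label))
    children = {answer: build_tree(subset, is_terminal, select_test, assign_class)
                for answer, subset in partition.items()}
    return {"test": test, "children": children}

The recursion stops when is_terminal declares a node to be a leaf, which is exactly the role of the stopping criterion discussed above.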
With such an algorithm, it is possible to calculate a decision tree with little or no apparent
error. A perfect decision tree is one in which all the examples in the training set are correctly
classified. Such a tree does not always exist (for example, when two examples have identical descriptions but belong to different classes). The aim is to build a tree with the smallest
possible classification error.
The following table represents the PlayTennis dataset presented by Quinlan himself to introduce the ID3 algorithm. Note that all the variables (corresponding to the columns) take discrete values.
The ID3 algorithm starts with a table whose data has already been classified (labelled). From
this table, the algorithm constructs a decision tree which can predict the class of each of the
data items in the table, and even the class of new data (which does not appear in the dataset).
Sky, Temperature, Humidity and Wind are the four attributes that describe the data. We can
see that the dataset contains just 14 rows corresponding to situations in which tennis players
accept or refuse to play depending on the values taken by the attributes describing the weather
conditions. In reality, however, there are 36 possible distinct examples if we let each attribute range over all the values it can take:
|{Sunny, Overcast, Rainy}| × |{Warm, Medium, Cool}| × |{High, Normal}| × |{Weak, Strong}| = 3 × 3 × 2 × 2 = 36
The ID3 algorithm is based on the machine-learning notions of attributes and classes (discrete classification). At each node, it looks for the most relevant attribute to test, so that the resulting tree is as short as possible.
To find the attribute to test, we use the entropy defined in the previous section.
Initially, the algorithm considers the whole dataset S = {J1, J2, J3, ..., J14}. Since 9 of the 14 examples have the decision (class) Yes and 5 of the 14 have the decision No, we can calculate the following proportions and the initial entropy:
P(Yes/S) = 9/14 and P(No/S) = 5/14

Entropy(S) = − (9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94        (equation 3.2)
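Using the entropy function sketched after equation 3.1, this value can be checked directly from the class counts (9 Yes, 5 No):

print(round(entropy([9, 5]), 2))   # 0.94, the initial entropy of the PlayTennis dataset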
Now that we know that the initial entropy of the dataset is 0.94, we need to know which
attribute to test first, then second, and so on.
To find out which attribute to test, we use the notion of entropy gain. The gain is defined for a set of examples and an attribute: it measures how much testing this attribute reduces the disorder of the set. The higher the gain of an attribute, the more useful it is to test it, because the test splits the set into smaller subsets with lower entropy.
Here is the formula that calculates the entropy gain for a set S and an attribute A.
Gain(S, A) = Entropy(S) − Σ v ∈ Values(A) (|Sv| / |S|) × Entropy(Sv)        (equation 3.3)

where Values(A) is the set of values that the attribute A can take and Sv is the subset of the examples of S for which A takes the value v.
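As an illustration, equation 3.3 can be written in Python by reusing the entropy function sketched earlier; the representation of an example as a (description, label) pair, where description is a dictionary of attribute values, is an assumption made for this sketch.

def gain(examples, attribute):
    # Entropy gain of splitting the set `examples` on `attribute` (equation 3.3).
    def class_counts(subset):
        counts = {}
        for _, label in subset:
            counts[label] = counts.get(label, 0) + 1
        return list(counts.values())

    total = len(examples)
    # Partition S into the subsets S_v, one per value v taken by the attribute
    subsets = {}
    for description, label in examples:
        subsets.setdefault(description[attribute], []).append((description, label))
    remainder = sum(len(s) / total * entropy(class_counts(s)) for s in subsets.values())
    return entropy(class_counts(examples)) - remainder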
The attribute tested at this node of the tree is the one that reduces entropy the most, i.e. the one with the highest gain.
Taking the example again, and considering S as the initial set, to determine which attribute to
test, we need to calculate the gain of all the attributes.
Here is a summary of the calculations made:
The calculations show that: Gain(S, Temperature) < Gain(S, Wind) < Gain(S, Humidity) <
Gain(S, Sky). The greatest gain is for Sky. Sky is therefore the first attribute tested in the tree.
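As a numerical check, Gain(S, Sky) can be computed from the class counts of each branch. The per-branch counts used below (Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rainy: 3 Yes / 2 No) come from Quinlan's usual version of the dataset and are assumed here to match the chapter's table.

branches = {"Sunny": [2, 3], "Overcast": [4, 0], "Rainy": [3, 2]}   # (Yes, No) counts
total = sum(sum(counts) for counts in branches.values())            # 14 examples
remainder = sum(sum(counts) / total * entropy(counts) for counts in branches.values())
print(round(entropy([9, 5]) - remainder, 3))   # ≈ 0.247, the gain of Sky, the largest of the four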
If we look at each child node, we see that for the Overcast node all the examples are positive. There is therefore no attribute left to test there: we can label this node Yes directly. The following figure shows the decision tree obtained after this first iteration.
Fig 3.6 Tree after the first iteration of its creation with ID3
We now need to add test nodes below Sunny and Rainy, because the examples there still contain a mixture of classes. Let us first determine, for Sunny, which attribute is best to test, again using the entropy gain. It is no longer useful to compute the gain of Sky, as it has just been used. The results of the calculation are given directly:
We can see that Gain(Ssunny, Wind) < Gain(Ssunny, Temperature) < Gain(Ssunny, Humidity). The largest gain is for Humidity. Note that this gain is equal to the entropy of Ssunny, which means that every child of the Humidity node will be pure and will directly give a class (label). Here is the tree after the second iteration of ID3.
Fig 3.7 Tree after the second iteration of its creation with ID3
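The claim that the gain of Humidity equals the entropy of Ssunny can also be checked numerically, assuming, as in Quinlan's standard data, that the Sunny branch contains 2 Yes / 3 No examples and that Humidity splits it into a pure Normal subset (2 Yes) and a pure High subset (3 No):

s_sunny = [2, 3]                        # (Yes, No) counts on the Sunny branch
pure_subsets = [[2, 0], [0, 3]]         # Humidity = Normal, Humidity = High
remainder = sum(sum(c) / sum(s_sunny) * entropy(c) for c in pure_subsets)
print(round(entropy(s_sunny) - remainder, 3))   # 0.971: the gain equals the entropy of Ssunny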
We still have to continue the tree on the Rainy edge. Here are the gains for the different
attributes:
We can see that Gain(Srainy, Temperature) ≤ Gain(Srainy, Humidity) < Gain(Srainy, Wind). The largest gain is 0.971 and is obtained for Wind. We therefore test Wind and, since this gain is equal to the entropy of Srainy, each of Wind's child nodes will be pure and can be labelled directly. Here is the final tree.
We can check that this tree gives the correct prediction for each of the 14 cases in the dataset
used to construct it. For example, for case number 1 (Sky="Sunny", Temperature="Warm",
Humidity="High", Wind="Weak"), the tree gives the class "No", which is consistent with
what exists in the training dataset.
But the tree also allows predictions to be made about new cases that do not exist in the
dataset. For example, for a new case (Sky="Sunny", Temperature="Cool", Humidity="High", Wind="Weak"), the tree gives the class "No".
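To summarise, the final tree can be written as a small prediction function. The sketch below follows the structure described above (Sky at the root, Humidity below Sunny, Wind below Rainy); the labels of the Sunny/Normal, Rainy/Weak and Rainy/Strong leaves follow the usual PlayTennis result and are assumed here to match the chapter's final figure.

def play_tennis(sky, temperature, humidity, wind):
    # Prediction with the final ID3 tree; note that Temperature is never tested.
    if sky == "Overcast":
        return "Yes"
    if sky == "Sunny":
        return "No" if humidity == "High" else "Yes"     # assumed: Normal -> Yes
    return "Yes" if wind == "Weak" else "No"             # sky == "Rainy"; assumed leaf labels

print(play_tennis("Sunny", "Warm", "High", "Weak"))   # "No", case number 1 of the dataset
print(play_tennis("Sunny", "Cool", "High", "Weak"))   # "No", the new case above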
EXERCISES
Exercise 3.1: Recall the general objective of the Decision Tree (DT) model.
Exercise 3.2: What is the difference between a classification DT and a regression DT?
Exercise 3.3: Consider the following classification Decision Tree. The classes are {class1, class2}.
1/ Translate the tree into a set of rules.
2/ Transform the tree into a binary Decision Tree.
Exercise 3.4: Consider the DT presented in this chapter (medical diagnosis). We have a
sample of 200 patients. In this sample, 100 are healthy and 100 are sick. The distribution
between the two classes H (Healthy) and S (Sick) is given in the following table: