DECISION TREE
TUTORIAL
Revoledu.com
Copyright © 2008-2012 by Kardi Teknomo
Published by Revoledu.com
Online edition is available at Revoledu.com
Last Update: October 2012
Notice of rights
All rights reserved. No part of this text and its companion files may be reproduced or
transmitted in any form by any means, electronic, mechanical, photocopying, recording,
or otherwise, without the prior written permission of the publisher. For information on
getting permission for reprint and excerpts, contact [email protected]
Notice of liability
The information in this text and its companion files is distributed on an “As is” basis,
without warranty. While every precaution has been taken in the preparation of the text,
neither the author(s) nor Revoledu shall have any liability to any person or entity with
respect to any loss or damage caused or alleged to be caused directly or indirectly by
the instructions contained in this text or by the computer software and hardware
products described in it.
All product names and services identified throughout this text are used in editorial
fashion only with no intention of infringement of any trademark. No such use, or the
use of any trade name, is intended to convey endorsement or other affiliation with this
text.
Table of Contents
Decision Tree Tutorial
    What is Decision Tree?
    How to use a decision tree?
    How to generate a decision tree?
        How to measure impurity?
            Entropy
            Gini Index
            Classification error
        Decision Tree Algorithm
        Information gain
        Second Iteration
        Third iteration
A decision tree is a popular classifier that does not require any domain knowledge or
parameter setting. The approach is supervised learning: given training data, we can induce
a decision tree, and from a decision tree we can easily create rules about the data. Using
a decision tree, we can easily predict the classification of unseen records. In this decision
tree tutorial, you will learn how to use and how to build a decision tree, with a very simple
explanation.
Let us start with an example. Throughout this tutorial, we will use the following 10 training
records. The training data are part of a transportation study regarding mode choice, that is,
selecting Bus, Car or Train among commuters along a major route in a city, gathered
through a questionnaire study. The data have 4 attributes, which I selected for the sake of
clarity. The attribute gender is of binary type; car ownership is a quantitative integer (and
thus behaves like a nominal attribute). Travel cost/km is quantitative of ratio type, but here
I treat it as ordinal (in a later section of this tutorial, I will discuss how to split quantitative
data into qualitative categories), and income level is also of ordinal type.
Attributes: Gender, Car ownership, Travel Cost ($)/km, Income Level. Class: Transportation mode.

Gender   Car ownership   Travel Cost ($)/km   Income Level   Transportation mode
Male     0               Cheap                Low            Bus
Male     1               Cheap                Medium         Bus
Female   1               Cheap                Medium         Train
Female   0               Cheap                Low            Bus
Male     1               Cheap                Medium         Bus
Male     0               Standard             Medium         Train
Female   1               Standard             Medium         Train
Female   1               Expensive            High           Car
Male     2               Expensive            Medium         Car
Female   2               Expensive            High           Car
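For readers who want to follow the computations in code, the table can be held in a simple
structure like the following (a minimal Python sketch; the field names are my own choice and
are not part of the original tutorial):

```python
# The 10 training records from the table above, one dictionary per record.
TRAINING_DATA = [
    {"gender": "Male",   "cars": 0, "cost": "Cheap",     "income": "Low",    "mode": "Bus"},
    {"gender": "Male",   "cars": 1, "cost": "Cheap",     "income": "Medium", "mode": "Bus"},
    {"gender": "Female", "cars": 1, "cost": "Cheap",     "income": "Medium", "mode": "Train"},
    {"gender": "Female", "cars": 0, "cost": "Cheap",     "income": "Low",    "mode": "Bus"},
    {"gender": "Male",   "cars": 1, "cost": "Cheap",     "income": "Medium", "mode": "Bus"},
    {"gender": "Male",   "cars": 0, "cost": "Standard",  "income": "Medium", "mode": "Train"},
    {"gender": "Female", "cars": 1, "cost": "Standard",  "income": "Medium", "mode": "Train"},
    {"gender": "Female", "cars": 1, "cost": "Expensive", "income": "High",   "mode": "Car"},
    {"gender": "Male",   "cars": 2, "cost": "Expensive", "income": "Medium", "mode": "Car"},
    {"gender": "Female", "cars": 2, "cost": "Expensive", "income": "High",   "mode": "Car"},
]
```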
Based on the above training data, we can induce a decision tree, as shown in the figure below.
How to use a decision tree?
The question is: what transportation mode would Alex, Buddy and Cherry use? Using the
decision tree that we have generated in the previous section, we will use a deductive
approach to classify whether a person will use car, train or bus as his or her mode along a
major route in that city, based on the given attributes.
We can start from the root node, which contains the attribute Travel cost per km. If the
travel cost per km is expensive, the person uses car. If the travel cost per km is standard
price, the person uses train. If the travel cost is cheap, the decision tree needs to ask the
next question about the gender of the person. If the person is a male, then he uses bus. If
the gender is female, the decision tree needs to ask again how many cars she owns in her
household. If she has no car, she uses bus; otherwise she uses train.
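These rules can also be written as a short function. The sketch below is in Python and uses
the hypothetical field names of the data sketch shown earlier; it is only an illustration of the
induced rules, not the author's original code.

```python
def classify(record):
    """Apply the rules of the decision tree induced above to one record."""
    if record["cost"] == "Expensive":
        return "Car"
    if record["cost"] == "Standard":
        return "Train"
    # Cheap travel cost: the tree next asks about gender.
    if record["gender"] == "Male":
        return "Bus"
    # Female with cheap travel cost: the tree asks about car ownership.
    return "Bus" if record["cars"] == 0 else "Train"

# Alex from the example below: standard travel cost, male, 1 car.
print(classify({"cost": "Standard", "gender": "Male", "cars": 1}))  # Train
```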
Based on the rules or decision tree above, the classification is very straightforward. Alex
is willing to pay a standard travel cost per km, thus regardless of his other attributes, his
transportation mode must be train. Buddy is only willing to pay a cheap travel cost per km,
and his gender is male, thus his selection of transportation mode should be bus. Cherry is
also willing to pay a cheap travel cost per km, her gender is female and she actually
owns a car, thus her transportation mode choice to work is train (probably she uses the car
only during weekends to shop). The variable income level is never utilized to classify the
transportation mode in this case.
Person name   Travel Cost ($)/km   Gender   Car ownership   Transportation Mode
Alex          Standard             Male     1               Train
Buddy         Cheap                Male     0               Bus
Cherry        Cheap                Female   1               Train
Though the decision tree is a very powerful method, at this point I shall give several notes to
the readers on decision tree utilization. First, it must be noted that with the limited
number of training data (only 10) that induced the decision tree, we cannot generalize the
rules of the decision tree above to be applicable to other cases in your city.
The sequence of rules generated by the decision tree is based on the priority of the attributes.
For example, there is no rule for people who own more than 1 car because, based on the
data, this case is already covered by the attribute travel cost/km: for those who own 2 cars,
the travel cost/km is always expensive, thus the mode is car.
Due to the limitations of the decision tree algorithm (most decision tree algorithms employ a
greedy strategy with no backtracking, so the search is not exhaustive), this sequence of
priorities is in general not optimal. We cannot say that the rules generated by the decision
tree are the best rules.
In the next section, you will learn more detail on how to generate a decision tree.
How to generate a decision tree?
Before I discuss the decision tree algorithm, it would be better if you familiarize yourself
with several measures of impurity. Therefore, the topics in this section are how to measure
impurity (entropy, Gini index and classification error), the decision tree algorithm, and
information gain.

How to measure impurity?
Given a data table that contains attributes and the class of the attributes, we can measure the
homogeneity (or heterogeneity) of the table based on the classes. We say a table is pure or
homogeneous if it contains only a single class. If a data table contains several classes, then we say
that the table is impure or heterogeneous. There are several indices to measure the degree of
impurity quantitatively. The most well known indices to measure the degree of impurity are entropy,
Gini index, and classification error. The formulas are given in the subsections below.
In our example of 10 records with 4 buses, 3 cars and 3 trains, the class probabilities are:
Prob(Bus) = 4/10 = 0.4
Prob(Car) = 3/10 = 0.3
Prob(Train) = 3/10 = 0.3
Entropy
One way to measure impurity degree is using entropy.
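For a table whose classes occur with probabilities p1, ..., pn, entropy is commonly defined
(with the base-2 logarithm, which the worked example below uses) as

$$\text{Entropy} = -\sum_{j=1}^{n} p_j \log_2 p_j$$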
Example: Given that Prob (Bus) = 0.4, Prob (Car) = 0.3 and Prob (Train) = 0.3, we can
now compute entropy as
Entropy = – 0.4 log2(0.4) – 0.3 log2(0.3) – 0.3 log2(0.3) = 1.571
The entropy of a pure table (consisting of a single class) is zero, because the probability is 1
and log2(1) = 0. Entropy reaches its maximum value when all classes in the table have equal
probability. The figure below plots the values of maximum entropy for different numbers of
classes n, where each probability is equal to p = 1/n. In this case, the maximum entropy is
equal to −n · p · log2(p) = log2(n). Notice that the value of entropy is larger than 1 if the
number of classes is more than 2.
Gini Index
Another way to measure impurity degree is using Gini index.
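For class probabilities p1, ..., pn, the Gini index is defined as

$$\text{Gini index} = 1 - \sum_{j=1}^{n} p_j^2$$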
Example: Given that Prob (Bus) = 0.4, Prob (Car) = 0.3 and Prob (Train) = 0.3, we can
now compute Gini index as
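$$\text{Gini index} = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 1 - 0.34 = 0.66$$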
The Gini index of a pure table (consisting of a single class) is zero because the probability is 1
and 1 − (1)^2 = 0. Similar to entropy, the Gini index also reaches its maximum value when all
classes in the table have equal probability. The figure below plots the values of the maximum
Gini index for different numbers of classes n, where each probability is equal to p = 1/n. Notice
that the value of the Gini index is always between 0 and 1 regardless of the number of classes.
Classification error
Still another way to measure impurity degree is using the classification error index.
Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3 and Prob(Train) = 0.3, the
classification error index is computed as
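$$\text{Classification error} = 1 - \max\{p_j\} = 1 - \max(0.4,\ 0.3,\ 0.3) = 1 - 0.4 = 0.6$$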
Similar to entropy and the Gini index, the classification error index of a pure table (consisting
of a single class) is zero, because the probability is 1 and 1 − max(1) = 0. The value of the
classification error index is always between 0 and 1. In fact, the maximum Gini index for a
given number of classes is always equal to the maximum classification error index, because
for n classes we set each probability equal to p = 1/n, and the maximum Gini index happens
at 1 − n·(1/n)^2 = 1 − 1/n, while the maximum classification error index also happens at
1 − max{1/n} = 1 − 1/n.
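To make the three measures concrete, here is a minimal sketch in Python (the function names
are my own) that reproduces the example values above and the maximum-value behaviour just
described:

```python
from math import log2

def entropy(probs):
    """Entropy: -sum p * log2(p), skipping zero probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index: 1 - sum p^2."""
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    """Classification error: 1 - max p."""
    return 1 - max(probs)

probs = [0.4, 0.3, 0.3]   # Prob(Bus), Prob(Car), Prob(Train)
print(round(entropy(probs), 3))               # 1.571
print(round(gini(probs), 2))                  # 0.66
print(round(classification_error(probs), 1))  # 0.6

# With n equally probable classes (p = 1/n), the Gini index and the
# classification error both reach their maximum of 1 - 1/n:
n = 3
print(round(gini([1 / n] * n), 3), round(classification_error([1 / n] * n), 3))  # 0.667 0.667
```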
Knowing how to compute the degree of impurity, we are now ready to proceed with the decision
tree algorithm, which I will explain in the next section.
Decision Tree Algorithm
Here is an explanation of how a decision tree algorithm works. We have a data record which
contains attributes and the associated classes. Let us call this data table D. From table D,
we take out each attribute and its associated classes. If we have p attributes, then we will
take out p subsets of D. Let us call these subsets Si. Table D is the parent of each table Si.
From table D and for each associated subset Si, we compute the degree of impurity. We have
discussed how to compute these indices in the previous section. To compute the degree of
impurity, we must distinguish whether it comes from the parent table D or from a subset
table Si with attribute i.
If the table is the parent table D, we simply count the number of records of each class.
For example, in the parent table below, we can compute the degree of impurity based on the
transportation mode. In this case we have 4 buses, 3 cars and 3 trains (in short 4B, 3C,
3T):
If the table is a subset table Si, we compute the degree of impurity separately for each value
of attribute i. For example, the attribute Travel cost per km has three values: Cheap, Standard
and Expensive. We sort the table Si = [Travel cost/km, Transportation mode] based on the
values of Travel cost per km. Then we separate each value of the travel cost and compute
the degree of impurity (using either entropy, Gini index or classification error).
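As an illustration, the per-value impurity of the subset table Si = [Travel cost/km,
Transportation mode] can be computed from the class counts read off the training table
(a minimal Python sketch; the variable names are my own):

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy computed from raw class counts, e.g. {'Bus': 4, 'Train': 1}."""
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values() if c > 0)

# Class counts for each value of Travel cost/km, read off the training table:
subsets = {
    "Cheap":     {"Bus": 4, "Train": 1},
    "Standard":  {"Train": 2},
    "Expensive": {"Car": 3},
}
for value, counts in subsets.items():
    print(value, round(entropy_from_counts(counts), 3))
# Cheap 0.722, Standard 0.0, Expensive 0.0
```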
Information gain
Information gain is computed as the impurity degree of the parent table minus the weighted
summation of the impurity degrees of the subset tables. The weight is based on the number of
records for each attribute value. Suppose we use entropy as the measurement of impurity
degree; then we have

$$\text{Gain}(i) = \text{Entropy}(D) - \sum_{k} \frac{n_k}{n}\,\text{Entropy}(S_{i,k})$$

where $n_k$ is the number of records with the $k$-th value of attribute $i$ and $n$ is the
total number of records in table D.
For example, our data table D has classes 4B, 3C, 3T, which produce an entropy of 1.571.
Now we try the attribute Travel cost per km, which we split into three values: Cheap, which
has classes 4B, 1T (thus an entropy of 0.722); Standard, which has classes 2T (thus an
entropy of 0); and Expensive, which has classes 3C (thus an entropy of 0).
The information gain of attribute Travel cost per km is computed as 1.571 – (5/10 *
0.722 + 2/10 * 0 + 3/10 * 0) = 1.210.
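This computation can be written as a small function (a Python sketch with illustrative names;
the class counts are read off the training table). The same function, applied to the other
attributes, gives the information gains compared in the next paragraphs.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy from raw class counts (same helper as in the previous sketch)."""
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values() if c > 0)

def information_gain(parent_counts, counts_by_value):
    """Parent entropy minus the weighted entropy of each attribute value."""
    n = sum(parent_counts.values())
    weighted = sum(sum(counts.values()) / n * entropy_from_counts(counts)
                   for counts in counts_by_value.values())
    return entropy_from_counts(parent_counts) - weighted

parent = {"Bus": 4, "Car": 3, "Train": 3}   # 4B, 3C, 3T
travel_cost = {
    "Cheap":     {"Bus": 4, "Train": 1},
    "Standard":  {"Train": 2},
    "Expensive": {"Car": 3},
}
print(round(information_gain(parent, travel_cost), 3))  # 1.21, i.e. the 1.210 in the text
```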
You can also compute the information gain based on the Gini index or the classification error
using the same method. The results are given below.
For each attribute in our data, we compute the information gain. The illustration below shows
the computation of information gain for the first iteration (based on the data table) for the
other three attributes: Gender, Car ownership and Income level.
Once you get the information gain for all attributes, we find the optimum attribute, the one
that produces the maximum information gain (i* = argmax {information gain of attribute i}). In
our case, Travel cost per km produces the maximum information gain. We put this optimum
attribute into a node of our decision tree. As it is the first node, it is the root node of
the decision tree. Our decision tree now consists of a single root node.
After the split of the data, we can see clearly that the value Expensive travel cost/km is
associated only with the pure class Car, while Standard travel cost/km is related only to the
pure class Train. A pure class is always assigned to a leaf node of the decision tree. We can
use this information to update our decision tree in our first iteration into the following.
For Cheap travel cost/km, the classes are not pure, thus we need to split further in the next
iteration.
Second Iteration
In the second iteration, we need to update our data table. Since Expensive and Standard
travel cost/km have been associated with pure classes, we do not need these data any
longer. For the second iteration, our data table D comes only from the records with Cheap
travel cost/km. We remove the attribute travel cost/km from the data because its values are
now all equal and thus redundant. Then, we repeat the procedure of computing the degree of
impurity and the information gain for the three remaining attributes. The results of the
computation are exhibited below.
Using this information, we can now update our decision tree. We can add the node Gender,
which has two values, male and female. A pure class becomes a leaf node, thus the Male
branch has a leaf node of Bus. For the Female branch, we need to split the attributes further
in the next iteration.
Third iteration
If you observe the data table of the third iteration, it consists of only two rows, and each
row has distinct values. If we use the attribute car ownership, we will get a pure class for
each of its values. Similarly, the attribute income level will also give a pure class for each
value. Therefore, we can use either one of the two attributes. Suppose we select the attribute
car ownership; we can then update our decision tree into its final version.
Now we have grown the full decision tree based on the data.
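For readers who want to see the whole greedy procedure in one place, here is a compact
illustrative sketch in Python (not the author's original code). It follows the steps described
above: compute the information gain of each attribute, pick the argmax attribute, split the
table, and recurse until each node is pure. On the 10 training records it reproduces the same
tree that we derived by hand; the field names follow the earlier hypothetical data sketch.

```python
from math import log2
from collections import Counter

# The 10 training records (same layout as the earlier TRAINING_DATA sketch).
TRAINING_DATA = [
    {"gender": "Male",   "cars": 0, "cost": "Cheap",     "income": "Low",    "mode": "Bus"},
    {"gender": "Male",   "cars": 1, "cost": "Cheap",     "income": "Medium", "mode": "Bus"},
    {"gender": "Female", "cars": 1, "cost": "Cheap",     "income": "Medium", "mode": "Train"},
    {"gender": "Female", "cars": 0, "cost": "Cheap",     "income": "Low",    "mode": "Bus"},
    {"gender": "Male",   "cars": 1, "cost": "Cheap",     "income": "Medium", "mode": "Bus"},
    {"gender": "Male",   "cars": 0, "cost": "Standard",  "income": "Medium", "mode": "Train"},
    {"gender": "Female", "cars": 1, "cost": "Standard",  "income": "Medium", "mode": "Train"},
    {"gender": "Female", "cars": 1, "cost": "Expensive", "income": "High",   "mode": "Car"},
    {"gender": "Male",   "cars": 2, "cost": "Expensive", "income": "Medium", "mode": "Car"},
    {"gender": "Female", "cars": 2, "cost": "Expensive", "income": "High",   "mode": "Car"},
]
ATTRIBUTES = ["cost", "gender", "cars", "income"]  # candidate splitting attributes
CLASS = "mode"                                     # the class to predict

def entropy(rows):
    """Entropy of the class distribution of a set of records."""
    counts = Counter(row[CLASS] for row in rows)
    total = len(rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute):
    """Parent entropy minus the weighted entropy of each value of the attribute."""
    total = len(rows)
    weighted = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        weighted += len(subset) / total * entropy(subset)
    return entropy(rows) - weighted

def build_tree(rows, attributes):
    """Greedy induction: pick the attribute with maximum gain, split, and recurse."""
    classes = {row[CLASS] for row in rows}
    if len(classes) == 1:                      # pure table -> leaf node
        return classes.pop()
    if not attributes:                         # nothing left to split on -> majority class
        return Counter(row[CLASS] for row in rows).most_common(1)[0][0]
    # Ties (e.g. cars vs. income in the third iteration) are broken by list order,
    # mirroring the arbitrary choice made in the text.
    best = max(attributes, key=lambda a: information_gain(rows, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)

print(build_tree(TRAINING_DATA, ATTRIBUTES))
# The printed tree matches the one derived by hand (dictionary order may vary):
#   cost: Expensive -> Car, Standard -> Train,
#         Cheap -> gender: Male -> Bus,
#                          Female -> cars: 0 -> Bus, 1 -> Train
```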