
Kardi Teknomo

DECISION TREE
TUTORIAL

Revoledu.com
Decision Tree Tutorial by Kardi Teknomo
Copyright © 2008-2012 by Kardi Teknomo
Published by Revoledu.com
Online edition is available at Revoledu.com
Last Update: October 2012

Notice of rights
All rights reserved. No part of this text and its companion files may be reproduced or
transmitted in any form by any means, electronic, mechanical, photocopying, recording,
or otherwise, without the prior written permission of the publisher. For information on
getting permission for reprint and excerpts, contact [email protected]

Notice of liability
The information in this text and its companion files is distributed on an “As is” basis,
without warranty. While every precaution has been taken in the preparation of the text,
neither the author(s) nor Revoledu shall have any liability to any person or entity with
respect to any loss or damage caused or alleged to be caused directly or indirectly by
the instructions contained in this text or by the computer software and hardware
products described in it.

All product names and services identified throughout this text are used in editorial
fashion only with no intention of infringement of any trademark. No such use, or the
use of any trade name, is intended to convey endorsement or other affiliation with this
text.
Table of Contents
Decision Tree Tutorial
  What is Decision Tree?
  How to use a decision tree?
  How to generate a decision tree?
    How to measure impurity?
      Entropy
      Gini Index
      Classification error
    Decision Tree Algorithm
    Information gain
    Second Iteration
    Third iteration
A decision tree is a popular classifier that does not require any prior knowledge or
parameter setting. The approach is supervised learning: given training data, we can induce a
decision tree, and from the decision tree we can easily create rules about the data. Using the
decision tree, we can easily predict the classification of unseen records. In this decision
tree tutorial, you will learn how to use and how to build a decision tree through a very simple
explanation.

What is Decision Tree?


A decision tree is a hierarchical tree structure that is used to classify records into classes
based on a series of questions (or rules) about the attributes of those records. The attributes
can be variables of any type: binary, nominal, ordinal, or quantitative, while the classes must
be of a qualitative type (categorical, binary, or ordinal). In short, given data consisting of
attributes together with their classes, a decision tree produces a sequence of rules (or a series
of questions) that can be used to recognize the class.

Let us start with an example. Throughout this tutorial, we will use the following 10 training
records. The training data are part of a transportation study regarding mode choice, in which
commuters along a major route in a city select Bus, Car, or Train; the data were gathered
through a questionnaire study. The data have 4 attributes, which I selected for the sake of
clarity. Attribute gender is of binary type; car ownership is a quantitative integer (and thus
behaves like a nominal variable). Travel cost/km is quantitative of ratio type, but here I treat
it as ordinal (in a later section of this tutorial, I will discuss how to split quantitative data
into qualitative categories), and income level is also of ordinal type.

Gender | Car ownership | Travel Cost ($)/km | Income Level | Transportation mode
Male   | 0 | Cheap     | Low    | Bus
Male   | 1 | Cheap     | Medium | Bus
Female | 1 | Cheap     | Medium | Train
Female | 0 | Cheap     | Low    | Bus
Male   | 1 | Cheap     | Medium | Bus
Male   | 0 | Standard  | Medium | Train
Female | 1 | Standard  | Medium | Train
Female | 1 | Expensive | High   | Car
Male   | 2 | Expensive | Medium | Car
Female | 2 | Expensive | High   | Car

The first four columns are the attributes; the last column (Transportation mode) is the class.
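For readers who want to verify the calculations later in this tutorial programmatically, here is a minimal Python transcription of the same 10 training records (the variable names DATA and ATTRIBUTES are my own choice, not part of the original study):

```python
# Each record: (Gender, Car ownership, Travel cost/km, Income level, Transportation mode)
ATTRIBUTES = ["Gender", "Car ownership", "Travel cost/km", "Income level"]

DATA = [
    ("Male",   0, "Cheap",     "Low",    "Bus"),
    ("Male",   1, "Cheap",     "Medium", "Bus"),
    ("Female", 1, "Cheap",     "Medium", "Train"),
    ("Female", 0, "Cheap",     "Low",    "Bus"),
    ("Male",   1, "Cheap",     "Medium", "Bus"),
    ("Male",   0, "Standard",  "Medium", "Train"),
    ("Female", 1, "Standard",  "Medium", "Train"),
    ("Female", 1, "Expensive", "High",   "Car"),
    ("Male",   2, "Expensive", "Medium", "Car"),
    ("Female", 2, "Expensive", "High",   "Car"),
]
```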

Based on the above training data, we can induce a decision tree as follows:

Notice that attribute “income level” is not included in the decision tree because, based on
the given data, attribute “travel cost per km” produces a better classification than
“income level”. We will see later how the tree is generated. In the next section, I will
discuss how to use a decision tree to predict unseen records.

How to use a decision tree?


A decision tree can be used to predict a pattern or to classify the class of a record. Suppose
we have new, unseen records of persons from the same location where the data sample
was taken. The following data are called test data (in contrast to the training data) because we
would like to determine the classes of these records.

Person name | Gender | Car ownership | Travel Cost ($)/km | Income Level | Transportation Mode
Alex   | Male   | 1 | Standard | High   | ?
Buddy  | Male   | 0 | Cheap    | Medium | ?
Cherry | Female | 1 | Cheap    | High   | ?

The question is: what transportation mode would Alex, Buddy and Cherry use? Using the
decision tree that we generated in the previous section, we will use a deductive
approach to classify whether a person will use car, train or bus as his or her mode along the
major route in that city, based on the given attributes.

We can start from the root node, which contains the attribute Travel cost per km. If the
travel cost per km is expensive, the person uses car. If the travel cost per km is standard
price, the person uses train. If the travel cost is cheap, the decision tree needs to ask the next
question about the gender of the person. If the person is a male, then he uses bus. If the
gender is female, the decision tree needs to ask again how many cars she owns in her
household. If she has no car, she uses bus; otherwise she uses train.

The rules generated from the decision tree above are mutually exclusive and exhaustive
for each class label on the leaf node of the tree:
Rule 1: If Travel cost/km is expensive then mode = car
Rule 2: If Travel cost/km is standard then mode = train
Rule 3: If Travel cost/km is cheap and gender is male then mode = bus
Rule 4: If Travel cost/km is cheap and gender is female and she owns no car then mode =
bus
Rule 5: If Travel cost/km is cheap and gender is female and she owns 1 car then mode =
train

Based on the rules or decision tree above, the classification is very straightforward. Alex
is willing to pay a standard travel cost per km; thus, regardless of his other attributes, his
transportation mode must be train. Buddy is only willing to pay a cheap travel cost per km,
and his gender is male, thus his transportation mode should be bus. Cherry is also only
willing to pay a cheap travel cost per km, her gender is female, and she owns a car; thus her
transportation mode choice to work is train (probably she uses the car only on weekends for
shopping). The variable income level is never used to classify the transportation mode in this
case.
Person name | Travel Cost ($)/km | Gender | Car ownership | Transportation Mode
Alex   | Standard | Male   | 1 | Train
Buddy  | Cheap    | Male   | 0 | Bus
Cherry | Cheap    | Female | 1 | Train
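As a quick check, the five rules above can be encoded directly. The sketch below (a small helper of my own, not code from the original tutorial) applies them to the three test records and reproduces the table:

```python
def classify(travel_cost, gender, cars):
    """Apply the five rules read off the decision tree above."""
    if travel_cost == "Expensive":
        return "Car"                           # Rule 1
    if travel_cost == "Standard":
        return "Train"                         # Rule 2
    # From here on the travel cost is Cheap.
    if gender == "Male":
        return "Bus"                           # Rule 3
    return "Bus" if cars == 0 else "Train"     # Rules 4 and 5

tests = {"Alex":   ("Standard", "Male",   1),
         "Buddy":  ("Cheap",    "Male",   0),
         "Cherry": ("Cheap",    "Female", 1)}
for name, record in tests.items():
    print(name, classify(*record))   # Alex Train, Buddy Bus, Cherry Train
```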

Though a decision tree is a very powerful method, at this point I shall give several notes to the
reader on decision tree utilization. First, it must be noted that with the limited number of
training records (only 10) used to induce the decision tree, we cannot generalize the rules of
the decision tree above to other cases in your city. The decision tree above is only valid for
the cases in the given data, that is, for the particular major route in that city where the data
were gathered.

The sequence of rules generated by the decision tree is based on the priority of the attributes.
For example, there is no rule for people who own more than 1 car because, based on the
data, that case is already covered by attribute travel cost/km: for those who own 2 cars, the
travel cost/km is always expensive, thus the mode is car.

Due to the limitations of the decision tree algorithm (most decision tree algorithms employ a
greedy strategy with no backtracking, so the search is not exhaustive), this sequence of
priorities is in general not optimal. We cannot say that the rules generated by the decision
tree are the best rules.

In the next section, you will learn more detail on how to generate a decision tree.

How to generate a decision tree?


In this section, you will learn how to generate a decision tree. This approach is sometimes
called decision tree induction because the decision tree is built from the data. I will
show the manual computation step by step so that you can check it using a calculator or a
spreadsheet.

Before I discuss the decision tree algorithm, it is better that you familiarize yourself
with several measures of impurity. Therefore, the topics in this section are:

How to measure impurity?


Entropy
Gini Index
Classification error
How does a decision tree algorithm work?
Improvement through gain ratio

How to measure impurity?

Given a data table that contains attributes and the class of each record, we can measure the
homogeneity (or heterogeneity) of the table based on the classes. We say a table is pure or
homogeneous if it contains only a single class. If a data table contains several classes, then we
say that the table is impure or heterogeneous. There are several indices to measure the degree
of impurity quantitatively. The most well-known indices are entropy, Gini index, and
classification error. The formulas are given below:

Entropy = – Σ pj log2 (pj)
Gini index = 1 – Σ pj^2
Classification error = 1 – max { pj }

All of the above formulas contain the probability pj of a class j, and the sums are taken over
all classes.

In our example, the classes of Transportation mode consist of three groups: Bus, Car and
Train. In this case, we have 4 buses, 3 cars and 3 trains (in short, we write 4B, 3C, 3T). The
total data is 10 rows.

Based on these data, we can compute the probability of each class. Since probability is equal
to relative frequency, we have

Prob (Bus) = 4 / 10 = 0.4
Prob (Car) = 3 / 10 = 0.3
Prob (Train) = 3 / 10 = 0.3

Observe that when we compute probability, we focus only on the classes, not on the
attributes. Having the probability of each class, we are now ready to compute the quantitative
indices of impurity degree.

Entropy
One way to measure impurity degree is using entropy.

Example: Given that Prob (Bus) = 0.4, Prob (Car) = 0.3 and Prob (Train) = 0.3, we can
now compute entropy as

Entropy = – 0.4 log (0.4) – 0.3 log (0.3) – 0.3 log (0.3) = 1.571

The logarithm is base 2.

Entropy of a pure table (consisting of a single class) is zero because the probability is 1 and
log (1) = 0. Entropy reaches its maximum value when all classes in the table have equal
probability. The figure below plots the maximum entropy for different numbers of classes n,
where each probability is equal to p = 1/n. In this case, the maximum entropy is equal to
-n*p*log2(p) = log2(n). Notice that the maximum entropy is larger than 1 if the number of
classes is more than 2.
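To make the computation concrete, here is a small Python sketch (the function name and data layout are my own, not from the tutorial) that reproduces the entropy of 1.571 from the class counts 4B, 3C, 3T:

```python
import math

def entropy(counts):
    """Base-2 entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([4, 3, 3]))   # 1.5709505944546684, i.e. about 1.571
```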

Gini Index
Another way to measure impurity degree is using Gini index.

Example: Given that Prob (Bus) = 0.4, Prob (Car) = 0.3 and Prob (Train) = 0.3, we can
now compute Gini index as

Gini Index = 1 – (0.4^2 + 0.3^2 + 0.3^2) = 0.660

Gini index of a pure table (consisting of a single class) is zero because the probability is 1 and
1-(1)^2 = 0. Similar to entropy, the Gini index also reaches its maximum value when all
classes in the table have equal probability. The figure below plots the maximum Gini index
for different numbers of classes n, where each probability is equal to p = 1/n. Notice that the
value of the Gini index is always between 0 and 1, regardless of the number of classes.
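A corresponding one-line helper (again my own sketch) reproduces the Gini index of 0.660 computed above:

```python
def gini(probs):
    """Gini index of a class distribution given as a list of probabilities."""
    return 1 - sum(p ** 2 for p in probs)

print(gini([0.4, 0.3, 0.3]))   # about 0.66
```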

Classification error
Still another way to measure impurity degree is using the classification error index.

Example: Given that Prob (Bus) = 0.4, Prob (Car) = 0.3 and Prob (Train) = 0.3, index of
classification error is given as

Classification Error Index = 1 – Max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60

Similar to entropy and Gini index, the classification error index of a pure table (consisting of a
single class) is zero because the probability is 1 and 1-max{1} = 0. The value of the
classification error index is always between 0 and 1. In fact, the maximum Gini index for a
given number of classes is always equal to the maximum classification error index, because
for n classes with equal probability p = 1/n, the maximum Gini index is 1-n*(1/n)^2 = 1-1/n,
while the maximum classification error index is also 1-max{1/n} = 1-1/n.
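Completing the three measures, a minimal sketch for the classification error index (my own helper) reproduces the 0.60 above:

```python
def classification_error(probs):
    """Classification error index of a class distribution."""
    return 1 - max(probs)

print(classification_error([0.4, 0.3, 0.3]))   # 0.6
```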

Knowing how to compute the degree of impurity, we are now ready to proceed with the
decision tree algorithm, which I will explain in the next section.

Decision Tree Algorithm


There are several popular decision tree algorithms, such as ID3, C4.5 and CART
(classification and regression trees). In general, the actual decision tree algorithms are
recursive (for example, many are based on a greedy recursive scheme, Hunt's algorithm,
which takes only the local optimum at each call without backtracking; the result is not
optimal, but the algorithm is very fast). For clarity, however, in this tutorial I will describe the
algorithm as if it were iterative.

Here is an explanation of how a decision tree algorithm works. We have a set of data records
containing attributes and their associated classes. Let us call this data table D.
From table D, we take out each attribute and its associated classes. If we have p attributes,
then we will take out p subsets of D. Let us call these subsets Si. Table D is the parent of
each table Si.

From table D and for each associated subset Si, we compute the degree of impurity. We
discussed how to compute these indices in the previous section.

To compute the degree of impurity, we must distinguish whether it comes from the parent
table D or from a subset table Si of attribute i.

If the table is a parent table D, we simply count the number of records of each class.
For example, in the parent table below, we can compute the degree of impurity based on
transportation mode. In this case, we have 4 Buses, 3 Cars and 3 Trains (in short 4B, 3C,
3T):

If the table is a subset table Si of attribute i, we need to compute the impurity degree
separately for each value of attribute i.

For example, attribute Travel cost per km has three values: Cheap, Standard and
Expensive. We sort the table Si = [Travel cost/km, Transportation mode] based on the
values of Travel cost per km. Then we take each value of the travel cost separately and
compute its degree of impurity (using entropy, Gini index or classification error).

Information gain
The reason the impurity degree is computed differently for data table D and for a subset table
Si is that we would like to compare the impurity before we split the table (i.e. data table D)
and after we split it according to the values of an attribute i (i.e. subset table Si). The measure
used to compare this difference of impurity degrees is called information gain. We would like
to know how much we gain if we split the data table based on the values of some attribute.

Information gain is computed as the impurity degree of the parent table minus the weighted
sum of the impurity degrees of the subset table. The weights are based on the number of
records for each attribute value. Suppose we use entropy as the measure of impurity
degree; then we have:

Information gain (i) = Entropy of parent table D – Sum over values k of (nk/n * Entropy of
subset table Si for value k)

For example, our data table D has classes 4B, 3C, 3T, which produce an entropy of 1.571.
Now we try the attribute Travel cost per km, which splits the table into three: Cheap with
classes 4B, 1T (thus entropy of 0.722), Standard with classes 2T (thus entropy = 0 because it
is a pure single class) and Expensive with the single class 3C (thus entropy also zero).

The information gain of attribute Travel cost per km is computed as 1.571 – (5/10 *
0.722 + 2/10 * 0 + 3/10 * 0) = 1.210
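A sketch of this computation in Python (helper names are my own; the class lists for each travel-cost value are read directly off the training table):

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

parent = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3         # table D: 4B, 3C, 3T
subsets = {                                                 # split by Travel cost/km
    "Cheap":     ["Bus", "Bus", "Bus", "Bus", "Train"],     # 4B, 1T -> entropy 0.722
    "Standard":  ["Train", "Train"],                        # pure   -> entropy 0
    "Expensive": ["Car", "Car", "Car"],                     # pure   -> entropy 0
}

n = len(parent)
weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
print(round(entropy(parent) - weighted, 3))                 # 1.21
```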

You can also compute information gain based on the Gini index or classification error in the
same way. The results are given below.

For each attribute in our data, we compute the information gain. The illustration
below shows the computation of information gain in the first iteration (based on the data
table) for the other three attributes: Gender, Car ownership and Income level.

The table below summarizes the information gain for all four attributes. In practice, you do
not need to compute the impurity degree with all three methods; you can use any one of
entropy, Gini index or classification error index.

Once we have the information gain for all attributes, we find the optimum attribute, the one
that produces the maximum information gain (i* = argmax {information gain of attribute i}).
In our case, travel cost per km produces the maximum information gain. We put this optimum
attribute into a node of our decision tree. As it is the first node, it is the root node of
the decision tree. Our decision tree now consists of a single root node.

Once we obtain the optimum attribute, we can split the data table according to that
optimum attribute. In our example, we split the data table based on the value of travel cost
per km.

After the split of the data, we can see clearly that the value Expensive of travel cost/km is
associated only with the pure class Car, while Standard travel cost/km is related only to the
pure class Train. A pure class is always assigned to a leaf node of the decision tree. We can
use this information to update our decision tree in the first iteration as follows.

For Cheap travel cost/km, the classes are not pure, thus we need to split further in the next
iteration.

Second Iteration
In the second iteration, we need to update our data table. Since Expensive and Standard
travel cost/km have been associated with pure classes, we do not need those records any
longer. For the second iteration, our data table D comes only from the records with Cheap
travel cost/km. We remove attribute travel cost/km from the data because its values are now
all equal and therefore redundant.

Now we have only three attributes: Gender, Car ownership and Income level. The degree
of impurity of the data table D is shown in the picture below.

Then, we repeat the procedure of computing the degree of impurity and information gain for
the three attributes. The results of the computation are shown below.

The maximum gain is obtained for the optimum attribute Gender. Once we obtain the
optimum attribute, the data table is split according to that optimum attribute. In our case,
Male gender is associated only with the pure class Bus, while Female still needs a further
split on another attribute.

Using this information, we can now update our decision tree. We add the node Gender,
which has two values: Male and Female. A pure class becomes a leaf node, thus Male
gender leads to a leaf node Bus. For Female gender, we need to split on the remaining
attributes in the next iteration.

Third iteration
The data table of the third iteration comes only from the part of the second iteration's data
table with Male gender removed (thus only the Female part). Since attribute Gender has
been used in the decision tree, we can remove it and focus only on the
remaining two attributes: Car ownership and Income level.

If you observe the data table of the third iteration, it consists of only two rows, and each row
has distinct values. If we use attribute Car ownership, we will get a pure class for each of its
values. Similarly, attribute Income level will also give a pure class for each value. Therefore,
we can use either of the two attributes. Suppose we select attribute Car ownership; we
can then update our decision tree into its final version.

Now we have grown the full decision tree based on the data.
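To tie the whole procedure together, the three iterations above can be folded into one short recursive routine. The sketch below is an ID3-style induction using entropy and information gain; it reuses the DATA and ATTRIBUTES lists from the sketch after the training table, and the function names are my own. Ties in information gain (as in the third iteration) are broken by attribute order, which here happens to match the tutorial's choice of Car ownership.

```python
import math
from collections import Counter

def entropy(rows):
    """Base-2 entropy of the class column (last field) of a list of records."""
    n = len(rows)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(row[-1] for row in rows).values())

def information_gain(rows, attr):
    """Entropy of the parent rows minus the weighted entropy of each value's subset."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row)
    n = len(rows)
    return entropy(rows) - sum(len(s) / n * entropy(s) for s in subsets.values())

def build_tree(rows, attrs):
    classes = {row[-1] for row in rows}
    if len(classes) == 1 or not attrs:                 # pure table (or no attribute left)
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a))   # greedy choice
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}, key=str):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, rest)
    return {ATTRIBUTES[best]: branches}

tree = build_tree(DATA, attrs=list(range(len(ATTRIBUTES))))
print(tree)
# {'Travel cost/km': {'Cheap': {'Gender': {'Female': {'Car ownership': {0: 'Bus', 1: 'Train'}},
#                                          'Male': 'Bus'}},
#                     'Expensive': 'Car',
#                     'Standard': 'Train'}}
```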
