Lesson 02
2.1 Introduction
Lesson 01 provided a comprehensive introduction to the field of machine learning in
general. From this week onwards, you will study various machine learning techniques and their
applications. Recall that machine learning is about learning from examples, without
relying on any model, theory or algorithm that can describe the example data. As such,
machine learning techniques can be used to model real world systems that do not fit into
any mathematical or scientific model. This module covers three major machine learning
techniques, namely, Decision Trees, Artificial Neural Networks and Genetic Algorithms. This
week discusses the fundamentals of decision trees and decision tree applications.
Decision trees are the simplest kind of machine learning technique. As the name implies, the
decision tree technique draws a tree to represent a given example (data) set. It is quite
natural to use a tree as a structure to represent many things in the world. For instance,
an organizational structure can be represented as a tree. The sample space of a probabilistic
experiment can also be represented by a tree.
For all machine learning techniques, we start with a data set called examples. Each
example is characterized by a set of attributes. One of the attributes is considered the
goal attribute. For example, Figure 2.1 shows four examples X1, X2, X3 and X4 from the
domain of weather forecasting.
In Figure 2.1, pressure, temperature, humidity, wind speed and Rain are the attributes of the
examples. Rain, one of the attributes, is the goal attribute.
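To make this concrete, the examples and their attributes could be represented as simple records. The following is only a minimal illustrative sketch in Python: the lesson gives just the attribute names, so the values shown below are hypothetical, except that X1 and X4 both have high pressure yet opposite Rain outcomes, as noted later in the text.

```python
# Hypothetical representation of the Figure 2.1 examples as dictionaries.
# Only the attribute names come from the lesson; the values are illustrative.
examples = [
    {"id": "X1", "pressure": "high",   "temperature": "low",  "humidity": "high", "wind_speed": "low",  "Rain": "Yes"},
    {"id": "X2", "pressure": "low",    "temperature": "high", "humidity": "low",  "wind_speed": "high", "Rain": "Yes"},
    {"id": "X3", "pressure": "medium", "temperature": "low",  "humidity": "high", "wind_speed": "low",  "Rain": "No"},
    {"id": "X4", "pressure": "high",   "temperature": "high", "humidity": "low",  "wind_speed": "high", "Rain": "No"},
]

goal_attribute = "Rain"  # the attribute the decision tree should predict
```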
2.2.1 Goal attribute
In decision trees, the goal attribute is usually identified with only two options such as
positive/negative, yes/no, good/bad, etc. However, we can also consider more than two
options for the value of the goal attribute. For instance, according to Figure 2.1, the attribute
pressure has three options: high, low and medium, whereas the attribute humidity has
only two values: high and low.
For instance, if we select the attribute pressure, there will be three branches labelled high,
low and medium. For the branch high, we see that the goal attribute is Yes (by X1)
and No (by X4). Thus the branch high returns both positive and negative answers. This
means that if the pressure is high, we cannot reach an exact conclusion as to whether Rain is Yes or No. On
the other hand, if we consider the branch low, the goal attribute is classified exactly as Yes.
That means, if we know that the pressure is low, we can conclude that Rain will definitely be Yes.
Just imagine a branch returns both positive and negative examples; what can we do? We
can consider another attribute and classify the examples according to its branches. This can
be done until all branches return all positive or all negative classifications. This is how we
construct decision trees.
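The process just described can be sketched as a recursive procedure: split the examples on a chosen attribute, then recurse on any branch that still mixes positive and negative examples. The following is a minimal sketch, assuming the examples are dictionaries like those shown earlier and that choose_attribute is some strategy for picking the next attribute (one such strategy is sketched in the next paragraph).

```python
def build_tree(examples, attributes, goal, choose_attribute):
    """Recursively grow a decision tree until every branch is pure."""
    labels = {e[goal] for e in examples}
    if len(labels) == 1:          # all positive or all negative: stop with a leaf
        return labels.pop()
    if not attributes:            # no attributes left to split on
        return labels
    attr = choose_attribute(examples, attributes, goal)
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, goal, choose_attribute)
    return tree
```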
The fundamental challenge in constructing a decision tree is the choice of the most
appropriate attribute for classifying the positive and negative examples at a given moment. A theory
has been developed for this purpose.
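The lesson does not develop that theory here, but one widely used criterion is information gain, which prefers the attribute whose split leaves the least uncertainty (entropy) about the goal attribute. The following is only an illustrative sketch of such a choose_attribute strategy, not part of the lesson itself.

```python
import math
from collections import Counter

def entropy(examples, goal):
    """Entropy of the goal attribute over a set of examples."""
    counts = Counter(e[goal] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def choose_attribute(examples, attributes, goal):
    """Pick the attribute whose split gives the largest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in {e[attr] for e in examples}:
            subset = [e for e in examples if e[attr] == value]
            remainder += len(subset) / len(examples) * entropy(subset, goal)
        return entropy(examples, goal) - remainder
    return max(attributes, key=gain)

# Example usage with the hypothetical weather examples sketched earlier:
# tree = build_tree(examples, ["pressure", "temperature", "humidity", "wind_speed"],
#                   "Rain", choose_attribute)
```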
When we consider the attribute weight, it has three options or values: high, low and
medium. Therefore, we have three branches from the weight attribute. Now the example X1
goes along the branch high and classifies the goal attribute as good (+). Similarly, X2 goes
along the branch high and classifies the goal attribute as +. We denote this by +[X1, X2].
Similarly, the other two values, medium and low, can be considered to classify the data. The
analysis of the data so far yields the partial decision tree shown in Figure 2.3.
Figure 2.3 – Partial decision tree after classifying the examples on the attribute Weight (branches high, medium and low)
Now we notice that along the branch high we got the examples X1 and X2. Both are
positive. So we can in fact conclude that if weight is high then health is good. Similarly, for the
branch low, we can conclude that “if weight is low then health is bad”. In this manner, whenever a
branch returns all positive or all negative examples, we can arrive at a conclusion.
Let us consider the branch labelled medium. Here there are three examples, out of which X4
is positive, while X6 and X7 are negative. No conclusion can be made. Therefore, we
should use another attribute, say, exercise, and classify the examples X4, X6 and X7. Since
the attribute exercise has two values, Yes and No, we will obtain the extended decision
tree shown in Figure 2.4.
Figure 2.4 – After expanding on the attribute Exercise
Now we notice that in Figure 2.4, along the branch yes, we have both positive and negative
examples. Therefore, we need to consider another attribute (now, meal) to further classify
the examples X6 and X7. The final decision tree is shown in Figure 2.5. Note that all leaf nodes
of the tree have all positive or all negative examples. Thus the construction of the decision
tree is now complete.
Figure 2.5 – Complete decision tree
For example, by considering the leftmost branch, we can say that “if weight is high then
health is good”.
Similarly, by going along the rightmost branch, we can say that “if weight is medium and
exercise is no then health is bad”.
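Reading such a conclusion simply means following one root-to-leaf path and collecting the attribute tests along the way. The following is a minimal sketch of that idea, assuming the nested-dictionary tree format used in the earlier build_tree sketch.

```python
def extract_rules(tree, conditions=()):
    """Collect one if-then rule for every root-to-leaf path of the tree."""
    if not isinstance(tree, dict):              # reached a leaf: the conclusion
        premise = " and ".join(f"{attr} is {value}" for attr, value in conditions)
        return [f"if {premise} then {tree}"]
    rules = []
    for attr, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules
```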
Exercise 2.1
Write TWO more conclusions that use all three attributes.
More conclusions
Let us consider other possible conclusions. It appears that the value (option) normal of the
attribute meal does not collect any examples along its branch. What does this mean? It means
that the value normal makes no contribution to classifying the data. As such,
the value normal is unnecessary when collecting data.
Note also that there are instances where the conclusions made along two different branches
may contradict each other. In such a situation, the collected data may include some noise or
errors.
In this case, we used all three attributes, weight, exercise and meal, to construct the
decision tree. However, it is possible to end up with all positive or all negative
classifications without using all the given attributes. In that case we may conclude that some
attributes are not relevant for the data set.
Another very important point is that we can draw more than one decision tree to
represent a given set of training data. For example, if we begin with meal as the
attribute at the root, and then use weight and exercise, the decision tree will be different
from the one we obtained in Figure 2.5. As such, the conclusions may also change. Therefore,
an interesting question to ask is which decision tree is the most appropriate.
For example, our conclusion “if weight is medium and exercise is no then health
is bad” is in fact a rule. Such conclusions appear in the standard rule format of
IF <conditions> THEN <conclusion>.
Recall that machine learning techniques are used to model non-algorithmic real world
systems that cannot be fitted into any formal model. Even in such situations, decision trees
are able to create rules that describe these non-algorithmic systems. This is a surprising feature of
decision trees as compared with other machine learning techniques such as Genetic Algorithms (GA) and
Artificial Neural Networks (ANN).
It should also be noted that for a given training data set we can produce more than one
distinct decision tree. Among those decision trees, some may be very large, while others
may be relatively small. Moreover, some decision trees may not even include
all the attributes stated in the training data. This may be due to the presence of
irrelevant attributes for a particular set of examples.
Since we can produce several decision trees for a given training data set, it is necessary to
identify the most appropriate decision tree for that data set.
All machine learning techniques are expected to produce a model that is as generic
as possible. That means a decision tree should not just memorize the training data and
reproduce its conclusions. If that were the case, the decision tree would be able to recognize only the
inputs that were used to construct it. This is why decision trees should go
beyond memorization and achieve generalization over the training data provided.
Discovery of rules
With the help of a decision tree, we can identify rules that describe a given data set.
This is a key feature of decision trees, since machine learning techniques are
generally used to model data that do not fit into any rules, algorithms, etc. Note that
other machine learning techniques such as Artificial Neural Networks and Genetic
Algorithms cannot devise a rule set to describe a collection of data. Therefore, if we are
interested in some form of rule support for understanding a data set, decision trees are unique among
machine learning techniques.
2.5 Summary
This lesson discussed the fundamental concepts of decision trees. We also learned how to
create decision trees through our intuitive understanding. It was realized that we can
construct more than one decision tree for a given real world problem. Among those
decision trees, the one that provides the highest generalization would be the most suitable
decision tree for modelling a particular problem. We also pointed out the importance of
decision trees relative to other machine learning techniques: decision trees are
able to provide information for the construction of rules. The most generalized decision tree for a
real world system would be capable of showing improved performance over expert
systems.