Unit 3 Part 2
UNIT-III
5.1 INTRODUCTION
▪ Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
▪ Learned trees can also be re-represented as sets of if-then rules to improve human
readability. These learning methods are among the most popular of inductive inference
algorithms and have been successfully applied to a broad range of tasks from learning to
diagnose medical cases to learning to assess credit risk of loan applicants.
▪ A decision tree can be said to be a map of the reasoning process. It uses a structure resembling
that of a tree to describe a dataset, and solutions can be visualized by following different
pathways through the tree. It is a hierarchical set of rules explaining the way in which a
large set of data can be divided into smaller data partitions.
▪ Each time a split takes place, the components of the resulting partitions become
increasingly similar to one another with regard to the target. If we had to select a
classification method capable of performing well across a wide range of situations, without
the analyst needing to put in much effort, and easy for the customer to understand, the tree
methodology would be the preferred choice. Several types of decision-tree learning
techniques are available, with varying needs and abilities. Decision-tree learning is usually
best suited to problems with the following features:
• Patterns are described by a fixed set of attributes xj; j = 1, 2, …, n, and each attribute xj takes on a small number of disjoint possible values (categorical or numeric) vl; l = 1, 2, …, dj.
• The output variable y is a Boolean-valued function (binary classification problems) defined over the set S of patterns {s(i)} = {x(i)}; i = 1, 2, …, N. That is, y takes on values yq; q = 1, 2. If we assume y1 = 0 and y2 = 1, then y : S -> {0, 1}.
▪ The training data is described by the dataset D of N patterns with corresponding observed
outputs: D = {s(i), y(i)} = {x(i), y(i)}; i = 1, 2, …, N.
The probability that a pattern in D belongs to class yq is estimated as
$$P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
where freq(yq, D) stands for the number of patterns in D that belong to class yq and |D| denotes the
total number of patterns in D (|D| = N). The expected information needed to classify a pattern in
D is given by
$$\text{Info}(D) = -\sum_{q=1}^{2} P_q \log_2 P_q$$
A log function to the base 2 is used because information is encoded in bits. Info (D) is just the
average amount of information needed to identify the class label of a pattern in D. Note that at
this point, the information we have is solely based on the proportions of patterns in each class.
Info (D) can also be expressed as entropy of D, denoted as Entropy (D).
Associated with the root node of the decision tree, Info(D) represents the expected amount of
information that would be needed to specify whether a new instance should be classified as y1 or
y2, given that the example reached the node. Info(D) is 0 if all patterns in D belong to the same
class (P1 = 0, P2 = 1): −P1 log2 P1 − P2 log2 P2 = 0 (note that 0 log2 0 = 0). Info(D) is 1 when the
collection D contains an equal number of Class 1 and Class 2 patterns (P1 = 1/2, P2 = 1/2),
representing maximum heterogeneity (randomness) in the dataset: −P1 log2 P1 − P2 log2 P2 = 1. If
the collection D contains unequal numbers of Class 1 and Class 2 patterns, Info(D) is between 0
and 1. It is, thus, a measure of the impurity of the collection of examples.
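To make the impurity measure concrete, here is a minimal Python sketch (not from the text; the function name and example counts are illustrative) that computes Info(D) from the class counts of a collection:

```python
import math

def info(class_counts):
    """Info(D): expected information (in bits) needed to classify a pattern,
    computed from the number of patterns in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # by convention, 0 * log2(0) is taken as 0
        p = count / total
        result -= p * math.log2(p)
    return result

print(info([7, 7]))   # equal mix of the two classes -> 1.0 (maximum impurity)
print(info([14, 0]))  # pure collection              -> 0.0
print(info([9, 5]))   # 9 Yes / 5 No (weather data)  -> about 0.940
```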
To illustrate, we consider the training set of Table 8.1 (Weather Data). It has nine examples of class
Yes and five examples of class No. Therefore,
$$\text{Info}(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}$$
The root node with dataset D will therefore be a highly impure node; the training set D contains
instances that belong to a mixture of classes (high entropy). In this situation, the idea of the 'divide-
and-conquer' strategy is to divide D into subsets of instances that are, or seem to be, heading
towards single-class collections of instances.
Suppose we select attribute xj for the root node. xj has distinct values vl; l = 1, …, dj, as observed
from the training data D. Attribute xj can be used to split the data D into dj partitions or
subsets {D1, D2, …, Ddj}, where Dl contains those patterns in D that have value vl of xj. These
partitions would correspond to branches grown from the node. Ideally, we would like this
partitioning to produce an exact classification, i.e., we would like each partition to be pure.
This amount is measured by
$$\text{Info}_{x_j}(D) = \sum_{l=1}^{d_j} \frac{|D_l|}{|D|} \times \text{Info}(D_l), \qquad \text{Info}(D_l) = -\sum_{q=1}^{2} P_{ql}\log_2 P_{ql}$$
where Pql is the probability that an arbitrary sample in subset Dl belongs to class yq, and is
estimated as
$$P_{ql} = \frac{\text{freq}(y_q, D_l)}{|D_l|}$$
Information gain is defined as the difference between the original information requirement (i.e.,
based on the proportion of classes in the entire dataset D) and the new requirement (i.e., obtained
after partitioning on xj). That is,
$$\text{Gain}(D, x_j) = \text{Info}(D) - \text{Info}_{x_j}(D)$$
In other words, Gain(D, xj) tells us how much would be gained by branching on xj. It is the
expected reduction in the information requirement (expected reduction in entropy) achieved by
partitioning on xj. The attribute xj with the highest information gain, Gain(D, xj), is chosen as the
splitting attribute at the root node. This is equivalent to saying that we want to partition on the
attribute xj that would do the best classification, so that the amount of information still required
to finish the classification task (i.e., $\text{Info}_{x_j}(D)$) is minimal.
The information gain, Gain(D, xj), measures the expected reduction in entropy caused by
partitioning the patterns in dataset D according to the attribute xj.
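As another illustrative sketch (not the textbook's code), Gain(D, xj) for a categorical attribute can be computed by partitioning the class labels on the attribute values and subtracting the weighted entropy of the partitions; the weather-data example below reproduces the Gain(S, Outlook) value worked out later in this section:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Info(D) / Entropy(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Gain(D, x_j): entropy of D minus the weighted entropy of the partitions
    induced by the attribute values."""
    partitions = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        partitions[value].append(label)
    weighted = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

# Weather data: Outlook against PlayTennis (9 Yes / 5 No overall).
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(round(information_gain(outlook, play), 3))  # about 0.246
```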
The split information for attribute xj is defined as
$$\text{SplitInfo}_{x_j}(D) = -\sum_{l=1}^{d_j} \frac{|D_l|}{|D|}\log_2\frac{|D_l|}{|D|}$$
This value represents the potential information generated by dividing the dataset D into dj
partitions corresponding to the dj values of the attribute xj. For each value of xj, the number of
tuples possessing that value is considered with respect to the total number of tuples in D. This
is different from information gain, which measures the information with respect to classification
obtained on the basis of the same partitioning. The gain ratio is defined as
$$\text{GainRatio}(D, x_j) = \frac{\text{Gain}(D, x_j)}{\text{SplitInfo}_{x_j}(D)}$$
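As a small illustrative sketch (again not from the text), the split information and gain ratio for Outlook on the weather data can be computed as follows; the hard-coded gain value is the rounded result from the earlier sketch:

```python
import math
from collections import Counter

def split_info(attribute_values):
    """SplitInfo_{x_j}(D): entropy of the partition sizes induced by x_j."""
    total = len(attribute_values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(attribute_values).values())

# Outlook splits the 14 weather examples into partitions of size 5, 4 and 5.
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
gain = 0.246  # Gain(D, Outlook), as computed earlier (rounded)
print(round(split_info(outlook), 3))         # about 1.577
print(round(gain / split_info(outlook), 3))  # gain ratio of about 0.156
```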
Gini index considers a binary split for each attribute. Let us first consider the case where xj is a
continuous-valued attribute having dj distinct values vl; l = 1, 2, …, dj. It is common to take the
mid-point between each pair of (sorted) adjacent values as a possible split-point (this is a simple
policy, although something might be gained by adopting a more sophisticated one; one such
policy will be discussed in the next section). The point giving the minimum Gini index for the
attribute xj is taken as its split-point.
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini
index) is selected as the splitting attribute. Then one of the two resulting parts (D1, D2) is divided in a
similar manner by again choosing a variable and a split value for that variable. This process is
continued till we get pure leaf nodes.
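The sketch below (with purely illustrative attribute values and labels) scans the mid-points between sorted adjacent values of a numeric attribute and picks the split-point with the minimum weighted Gini index, where the Gini impurity of a collection is 1 − Σq Pq²:

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_gini_split(values, labels):
    """Return (split_point, weighted_gini) for the binary split with minimum impurity."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical values cannot be separated
        split = (pairs[i][0] + pairs[i + 1][0]) / 2.0  # mid-point of adjacent values
        left = [label for value, label in pairs if value <= split]
        right = [label for value, label in pairs if value > split]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (split, weighted)
    return best

# Hypothetical continuous attribute with binary class labels.
humidity = [65, 70, 70, 75, 80, 85, 90, 95]
play = ["yes", "yes", "yes", "yes", "no", "no", "yes", "no"]
print(best_gini_split(humidity, play))  # -> (77.5, 0.1875)
```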
▪ Notice that the entropy is 0 if all members of S belong to the same class. Also note that
the entropy is 1 when the collection contains an equal number of positive and negative
examples.
▪ If the collection contains unequal numbers of positive and negative examples, the entropy
is between 0 and 1. The following figure shows this.
▪ If the target classification has c classes, then the entropy of S relative to this c-wise
classification is defined as
$$\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
where $p_i$ is the proportion of S belonging to class i.
▪ The information gain of an attribute A relative to a collection S is then defined as
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$
where Values(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S
for which attribute A has value v.
▪ As an example, let us construct the decision tree for the following data, which shows
training examples for the target concept PlayTennis.
First, compute Entropy(S) [S is the given data set; there are two classes: Yes = 9, No = 5]:
$$\text{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
Next, compute the information gain of each attribute in the data set.
$\text{Entropy}_{\text{Outlook}}(S)$, the weighted entropy of the subsets produced by splitting on Outlook, is given as
$$\text{Entropy}_{\text{Outlook}}(S) = \sum_{v \in \text{Values}(\text{Outlook})} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v) = \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.694$$
Therefore
$$\text{Gain}(S, \text{Outlook}) = \text{Entropy}(S) - \text{Entropy}_{\text{Outlook}}(S) = 0.940 - 0.694 = 0.246$$
Similarly,
$\text{Entropy}_{\text{Temperature}}(S)$ is given as
$$\text{Entropy}_{\text{Temperature}}(S) = \frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{6}{14}\left(-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\right) + \frac{4}{14}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) = 0.911$$
Therefore
$$\text{Gain}(S, \text{Temperature}) = \text{Entropy}(S) - \text{Entropy}_{\text{Temperature}}(S) = 0.940 - 0.911 = 0.029$$
Similarly,
$\text{Entropy}_{\text{Humidity}}(S)$ is given as
$$\text{Entropy}_{\text{Humidity}}(S) = \frac{7}{14}\left(-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7}\right) + \frac{7}{14}\left(-\frac{6}{7}\log_2\frac{6}{7} - \frac{1}{7}\log_2\frac{1}{7}\right) = 0.789$$
Therefore
$$\text{Gain}(S, \text{Humidity}) = \text{Entropy}(S) - \text{Entropy}_{\text{Humidity}}(S) = 0.940 - 0.789 = 0.151$$
Similarly,
$\text{Entropy}_{\text{Wind}}(S)$ is given as
$$\text{Entropy}_{\text{Wind}}(S) = \frac{8}{14}\left(-\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8}\right) + \frac{6}{14}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) = 0.892$$
Therefore
$$\text{Gain}(S, \text{Wind}) = \text{Entropy}(S) - \text{Entropy}_{\text{Wind}}(S) = 0.940 - 0.892 = 0.048$$
Since Gain(S, Outlook) is the highest, the attribute Outlook is tested first and becomes the root
of the decision tree. It then partitions S into three subsets [let us say S1: leftmost subtree,
S2: middle subtree, S3: rightmost subtree], because it has three different values: Sunny, Overcast,
Rainy, as shown below.
Now, we need to apply the same procedure to decide the root node for the leftmost subtree
and for the rightmost subtree. However, observe that all the samples of the middle subtree belong to
the same class (the Yes class). So, we can create a leaf node here and label it Yes. The final
decision tree for the given data set, S, is given below.
After understanding this procedure, we are now ready to put together the basic decision tree
learning algorithm (ID3). This is given below.
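Since the algorithm listing itself is not reproduced here, the following Python sketch outlines the recursive ID3 procedure just described; the function names and the nested-dictionary tree representation are illustrative choices, not the book's pseudocode.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_class}}.
    `examples` is a list of dicts mapping attribute names to categorical values."""
    if len(set(labels)) == 1:
        return labels[0]                             # pure node -> leaf with that class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left -> majority class

    def gain(attribute):
        weighted = 0.0
        for value in set(example[attribute] for example in examples):
            subset = [y for e, y in zip(examples, labels) if e[attribute] == value]
            weighted += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - weighted

    best = max(attributes, key=gain)                 # attribute with highest information gain
    tree = {best: {}}
    for value in set(example[best] for example in examples):
        sub_examples = [e for e in examples if e[best] == value]
        sub_labels = [y for e, y in zip(examples, labels) if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_examples, sub_labels, remaining)
    return tree
```

A tree for the PlayTennis data would then be obtained by calling id3(examples, labels, ["Outlook", "Temperature", "Humidity", "Wind"]), with each example encoded as a dictionary of attribute values.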
To illustrate the process of implementation of a C4.5 decision tree, we consider the (toy) training set
in which there are four attributes: x1 (Outlook), x2 (Temperature), x3 (Humidity), x4 (Wind); and
two classes for the output variable PlayTennis: Yes, No. The attributes x1 and x4 have categorical
values, and the attributes x2 and x3 have continuous numeric values. Attribute x1 (Outlook) has
three categorical values: sunny, overcast and rain, and therefore the node labeled Outlook will
have three branches. The node for the categorical variable x4 (Wind) will have two branches. The other
two variables are continuous-valued, and as per the strategy followed in C4.5, the corresponding
nodes will have binary splits. This will require discretization of these variables.
There are four choices of attribute at the root node: Outlook, Temperature, Humidity, and Wind.
Tree stumps for the attribute Outlook (x1) are shown above; a three-way split corresponding to the
three categorical values of x1.
Temperature (x2) has continuous numeric values. For this attribute, we will select the best cut-
point Tx2 from its range of values by evaluating every candidate cut-point. Examples are first sorted
by increasing value of the attribute, and the interval between each successive pair of values in the
sorted sequence gives a potential cut-point. The cut-point that minimizes the entropy will never
occur between two patterns of the same class; therefore, it is only necessary to consider potential
divisions that separate patterns of different classes. For the weather data, this gives us eight
potential cut-points: {64, 65, 70, 71, 72, 75, 80, 83}. Note that boundary points of the intervals
between classes have been taken as the potential cut-points. The entropy for each of these cut-points
is evaluated, and the one that results in the maximum information gain/gain ratio is selected as the
split-value for the attribute x2. It follows that x2 = 83 is the selected split-value.
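A simplified sketch of this cut-point search is given below; it is illustrative rather than C4.5's actual implementation, and it takes the mid-point between adjacent examples of different classes as the candidate threshold, whereas the text quotes attribute values themselves as the candidate set.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Evaluate candidate thresholds for a numeric attribute and return the one
    giving the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_cut, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        # Only boundaries between examples of different classes need be considered.
        if pairs[i][1] == pairs[i + 1][1] or pairs[i][0] == pairs[i + 1][0]:
            continue
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left = [label for value, label in pairs if value <= cut]
        right = [label for value, label in pairs if value > cut]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```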
The tree stump for x2 (Temperature) is shown below. Evaluating all four candidate root-node
variables, Outlook turns out to be the best choice with respect to the entropy reduction/gain ratio
measure.
The table shows the dataset for the branch sunny, obtained from the data D. Repeating the process
described above on this dataset, we select Humidity as the daughter node, with split-value = 70.
● Pruning is a technique in machine learning and search algorithms that reduces the size of
decision trees by removing sections of the tree that are non-critical or redundant for
classifying instances.
● The basic decision tree algorithm continues to grow the tree by splitting nodes as long as
new splits produce daughter nodes that increase purity. Such a tree has been optimized for
the training set, so eliminating any leaf node will simply increase the error rate of the tree
on the training set. But this certainly does not mean that the full tree with pure leaf nodes
also performs best on new data!
● Pruning is the procedure that decreases the size of decision trees. It can decrease the risk
of overfitting by limiting the size of the tree or eliminating parts of the tree that provide
little predictive power. Pruning helps by trimming branches that reflect anomalies in the
training data caused by noise or outliers, and it modifies the original tree in a way that
enhances the generalization performance of the tree.
● Various methods generally use statistical measures to remove the least reliable branches,
frequently resulting in quicker classification and an improvement in the ability of the
tree to correctly classify independent test data.
● There are two approaches to tree pruning, which are as follows:
5.5.1 Pre-pruning Approach
● In the pre-pruning approach, a tree is “pruned” by halting its construction early (e.g., by
deciding not to further split or partition the subset of training samples at a given
node).
● Upon halting, the node becomes a leaf. The leaf may hold the most common class
among the subset of samples, or the probability distribution of those samples.
● When building a tree, measures such as statistical significance, information gain, etc., can
be used to assess the goodness of a split. If partitioning the samples at a node would result in
a split that falls below a pre-specified threshold, then partitioning of the given subset
is halted. There are problems in selecting an appropriate threshold: high thresholds can
result in oversimplified trees, while low thresholds can result in very little simplification.
A brief sketch of such pre-pruning controls in practice is given below.
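As a concrete illustration (assuming scikit-learn, which the text does not mention), typical pre-pruning controls appear as hyperparameters of a tree learner; the dataset and the threshold values below are arbitrary choices for demonstration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early when nodes get too small or a split gains too little.
tree = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=4,                 # cap the depth of the tree
    min_samples_split=20,        # do not split nodes with fewer than 20 samples
    min_impurity_decrease=0.01,  # require a minimum impurity reduction per split
    random_state=0,
)
tree.fit(X_train, y_train)
print("leaves:", tree.get_n_leaves(), "test accuracy:", tree.score(X_test, y_test))
```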
5.5.2 Post-pruning Approach
● A tree node is pruned by eliminating its branches. The cost-complexity pruning algorithm
is an instance of the post-pruning approach. The pruned node becomes a leaf and is labeled
with the most common class among its former branches.
● For each non-leaf node in the tree, the algorithm computes the expected error rate that would
result if the subtree at that node were pruned. Next, the expected error rate if the node
were not pruned is computed using the error rates for each branch, combined
by weighting according to the proportion of observations along each branch. If pruning the
node leads to a higher expected error rate, then the subtree is kept; otherwise, it is
pruned.
● After generating a set of increasingly pruned trees, an independent test set can be used to estimate the
accuracy of each tree. The decision tree that minimizes the expected error rate is
preferred. A sketch of this procedure using cost-complexity pruning is given below.
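The following sketch uses scikit-learn's cost-complexity pruning, which is one concrete instance of the post-pruning idea described above; the library and the dataset are assumptions, not part of the text. A full tree is grown first, a sequence of increasingly pruned trees is generated, and the one with the lowest error on held-out data is kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree, then obtain the pruning sequence (larger ccp_alpha = more pruning).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit a pruned tree for each alpha and keep the one with the best held-out accuracy.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best ccp_alpha:", best_alpha, "held-out accuracy:", best_score)
```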
i. Accuracy
Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as
the ratio of the number of correct predictions to the total number of predictions.
Accuracy=(TP+TN)/(TP+TN+FP+FN)
Where TP=True Positive, TN=True Negative, FP=False Positive, FN=False Negative
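As a quick illustration (the labels below are hypothetical, and scikit-learn is an assumption rather than something the text prescribes), accuracy can be read directly off the confusion-matrix counts:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)             # 3, 5, 1, 1
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))  # 8/10 = 0.8
print("accuracy:", accuracy_score(y_true, y_pred))   # same value, via scikit-learn
```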
ii. Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a
classification model on a set of test data for which the true values are known.
Fig: Confusion matrix
iii. Precision
Precision explains how many of the cases predicted as positive actually turned out to be positive.
Precision is useful in cases where a False Positive is a higher concern than a False Negative.
“Precision for a label is defined as the number of true positives divided by the number of
predicted positives”:
Precision = TP/(TP+FP)
iv. Recall (Sensitivity)
Recall explains how many of the actual positive cases we were able to predict correctly with our
model. It is a useful metric in cases where a False Negative is of higher concern than a False
Positive.
Recall for a label is defined as the number of true positives divided by the total number of
actual positives:
Recall = TP/(TP+FN)
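A short sketch of these two measures on the same hypothetical labels as in the accuracy example (again assuming scikit-learn):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))  # 3/4 = 0.75
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))     # 3/4 = 0.75
```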
v. F1 Score
The F1 score gives a combined idea of the Precision and Recall metrics. It is maximized when
Precision is equal to Recall.
The F1 Score is the harmonic mean of precision and recall:
F1 Score = 2 × (Precision × Recall)/(Precision + Recall)
The F1 score punishes extreme values more. The F1 Score could be an effective evaluation metric in
the following cases: