
Decision Tree

- by Ms. Ashwini D. Khairkar


Decision Tree
• The decision tree algorithm falls under the category of supervised
learning. Decision trees can be used to solve both regression and
classification problems.
• A decision tree uses a tree representation to solve the problem, in
which each leaf node corresponds to a class label and attributes are
represented at the internal nodes of the tree.
• Any Boolean function on discrete attributes can be represented using
a decision tree.
Contd.
• There are two main types of Decision Trees:
1. Classification trees (Yes/No types)
• What we have seen above is an example of a classification tree, where
the outcome was a variable like 'fit' or 'unfit'. Here the decision
variable is categorical.
2. Regression trees (continuous data types)
• Here the decision or outcome variable is continuous, e.g. a
number like 123.
Working
Now that we know what a decision tree is, we will see how it works
internally. There are many algorithms that construct decision trees,
but one of the best known is the ID3 algorithm. ID3 stands for
Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we will
go through a few definitions.
Below are some assumptions we make while using a decision tree:
• At the beginning, we consider the whole training set as the root.
• Feature values are preferred to be categorical. If the values are
continuous, they are discretized prior to building the model.
• Records are distributed recursively on the basis of attribute values.
• We use statistical methods for ordering attributes as the root or as
internal nodes.
Contd.
Important terminology

1. Root Node: This attribute is used for dividing the data into two or more
sets. The feature attribute at this node is selected based on attribute
selection techniques.
2. Branch or Sub-Tree: A part of the entire decision tree is called a branch
or sub-tree.
3. Splitting: Dividing a node into two or more sub-nodes based on if-else
conditions.
4. Decision Node: A sub-node that splits into further sub-nodes is called a
decision node.
5. Leaf or Terminal Node: This is the end of the decision tree, where a node
cannot be split into further sub-nodes.
6. Pruning: Removing a sub-node from the tree is called pruning.
Contd.
• In a decision tree, the major challenge is the identification of the
attribute for the root node at each level. This process is known as
attribute selection. We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
• 1. Information Gain
When we use a node in a decision tree to partition the training instances
into smaller subsets, the entropy changes. Information gain is a measure of
this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the
subset of S with A = v, and Values(A) is the set of all possible values of A.
Then
Information Gain: Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv),
summed over v in Values(A), i.e. the entropy of the parent node minus the
weighted average entropy of the child nodes.
• For example, in a binary classification problem (two classes),
we can calculate the entropy of the data sample as follows:
Contd.
Entropy(S) = -P(yes) * log2 P(yes) - P(no) * log2 P(no)
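As a minimal sketch (not part of the original slides), the binary entropy above can be computed in Python as follows; the function name binary_entropy is purely illustrative.

import math

def binary_entropy(p_yes, p_no):
    """Entropy of a two-class sample: -P(yes)*log2 P(yes) - P(no)*log2 P(no).
    A probability of 0 contributes 0 by convention (0 * log2 0 = 0)."""
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:
            total -= p * math.log2(p)
    return total

# A perfectly mixed node (half yes, half no) has the maximum entropy of 1 bit.
print(binary_entropy(0.5, 0.5))   # 1.0
# A pure node (all yes) has zero entropy.
print(binary_entropy(1.0, 0.0))   # 0.0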
Gini Index
• The Gini Index is a metric that measures how often a randomly chosen
element would be incorrectly classified.
• An attribute with a lower Gini index should be preferred.
• Scikit-learn supports the "gini" criterion for the Gini Index, and it
uses "gini" by default.
• The formula for the calculation of the Gini Index is given below.
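The formula referenced here did not survive the export of the slide; the standard definition is Gini(S) = 1 - Σ p_i^2, where p_i is the proportion of class i in S. A small illustrative sketch (the function name gini_index is an assumption, not from the slides):

def gini_index(class_counts):
    """Standard Gini impurity: 1 - sum(p_i^2) over the class proportions.
    class_counts is a list of counts per class at a node."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

# A perfectly mixed binary node has the maximum Gini impurity of 0.5.
print(gini_index([7, 7]))   # 0.5
# A pure node has a Gini impurity of 0.
print(gini_index([14, 0]))  # 0.0

In scikit-learn, DecisionTreeClassifier's criterion parameter defaults to "gini"; passing criterion="entropy" switches the tree to information gain instead.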
What are the steps in the ID3 algorithm?

• The steps in the ID3 algorithm are as follows:
1. Calculate the entropy of the dataset.
2. For each attribute/feature:
   2.1. Calculate the entropy for all of its categorical values.
   2.2. Calculate the information gain for the feature.
3. Find the feature with the maximum information gain.
4. Repeat until we get the desired tree.
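A minimal Python sketch of these steps (illustrative only; the helper names entropy, info_gain, and id3 are assumptions, not from the slides). It assumes the dataset is a list of dicts, each with a 'label' key plus categorical attributes.

import math
from collections import Counter

def entropy(rows):
    """Step 1: entropy of a set of rows, each a dict with a 'label' key."""
    counts = Counter(r['label'] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    """Step 2: entropy of the parent minus the weighted entropy of each
    categorical value's subset (the 'average entropy information')."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attrs):
    """Steps 3-4: split on the attribute with maximum information gain
    and recurse until the node is pure or no attributes remain."""
    labels = set(r['label'] for r in rows)
    if len(labels) == 1 or not attrs:
        return Counter(r['label'] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best])
    return tree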
Classification using the ID3 algorithm
Consider a dataset based on which we will determine whether to play
tennis or not.
Here, the dataset (with attributes Outlook, Temperature, Humidity, and
Wind) has binary classes (yes and no), where 9 out of 14 examples are
"yes" and 5 out of 14 are "no".
Contd.
• The complete entropy of the dataset is:
H(S) = -p(yes) * log2(p(yes)) - p(no) * log2(p(no))
     = -(9/14) * log2(9/14) - (5/14) * log2(5/14)
     = 0.41 + 0.53 = 0.94
For each attribute of the dataset, let's follow step 2 of the pseudocode:
First Attribute - Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971
H(Outlook=rain) = -(3/5)*log(3/5) - (2/5)*log(2/5) = 0.971
H(Outlook=overcast) = -(4/4)*log(4/4) - 0 = 0
Average Entropy Information for Outlook -
I(Outlook) = p(sunny)*H(Outlook=sunny) + p(rain)*H(Outlook=rain) + p(overcast)*H(Outlook=overcast)
           = (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
           = 0.693
Information Gain = H(S) - I(Outlook) = 0.94 - 0.693 = 0.247
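As a quick check of the arithmetic above, here is a sketch that uses only the counts stated in the slides (9 yes / 5 no overall; sunny 2 yes / 3 no, rain 3 yes / 2 no, overcast 4 yes / 0 no). The helper name H is illustrative.

import math

def H(yes, no):
    """Binary entropy from class counts; empty classes contribute 0."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

h_s = H(9, 5)                                                       # ≈ 0.940
i_outlook = (5/14) * H(2, 3) + (5/14) * H(3, 2) + (4/14) * H(4, 0)  # ≈ 0.693
print(round(h_s - i_outlook, 3))                                    # gain ≈ 0.247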
Contd.
Second Attribute - Temperature
Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179
Average Entropy Information for Temperature –
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) + p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811 = 0.9108
Information Gain = H(S) - I(Temperature) = 0.94 - 0.9108 = 0.0292

Third Attribute - Humidity
Categorical values - high, normal
H(Humidity=high) = -(3/7)*log(3/7) - (4/7)*log(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log(6/7) - (1/7)*log(1/7) = 0.591
Average Entropy Information for Humidity -
I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
            = (7/14)*0.983 + (7/14)*0.591 = 0.787
Information Gain = H(S) - I(Humidity) = 0.94 - 0.787 = 0.153
Contd.
• Fourth Attribute - Wind
Categorical values - weak, strong
H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1
Average Entropy Information for Wind –
I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong) = (8/14)*0.811 + (6/14)*1 = 0.892
Information Gain = H(S) - I(Wind) = 0.94 - 0.892 = 0.048

Here, the attribute with the maximum information gain is Outlook, so it becomes the root node of the decision tree built so far.
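For comparison, this root choice can be reproduced with scikit-learn. The sketch below assumes the classic 14-row Play Tennis dataset (Quinlan / Mitchell), which matches the counts used in these slides; the categorical attributes are one-hot encoded because scikit-learn trees need numeric inputs.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Classic Play Tennis data (assumed here; not reproduced in the slides).
data = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                    "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                    "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "Play":        ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the attributes
y = data["Play"]

# criterion="entropy" makes the tree use information gain, as in ID3.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))

Note that scikit-learn builds binary splits on the one-hot indicators rather than ID3's multi-way splits, so the printed tree is structured differently even though it should still test an Outlook indicator at the root.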
Contd.
• Here, when Outlook = Overcast, the subset is a pure class (Yes).
Now we have to repeat the same procedure for the rows with Outlook =
Sunny and then for the rows with Outlook = Rain.
• Now, find the best attribute for splitting the data with
Outlook = Sunny (dataset rows = [1, 2, 8, 9, 11]).
Complete entropy of the Sunny subset is:
H(Sunny) = -p(yes) * log2(p(yes)) - p(no) * log2(p(no))
         = -(2/5) * log2(2/5) - (3/5) * log2(3/5) = 0.971
Contd.
• First Attribute - Temperature
Categorical values - hot, mild, cool
H(Sunny, Temperature=hot) = -0 - (2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1) - 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2) - (1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny, mild)*H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny, Temperature=cool)
                      = (2/5)*0 + (2/5)*1 + (1/5)*0 = 0.4
Information Gain = H(Sunny) - I(Sunny, Temperature) = 0.971 - 0.4 = 0.571
Contd.
• Second Attribute - Humidity

Categorical values - high, normal


H(Sunny, Humidity=high) = - 0 - (3/3)*log(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0
Average Entropy Information for Humidity –
I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny, normal)*H(Sunny, Humidity=normal)
= (3/5)*0 + (2/5)*0 = 0
Information Gain = H(Sunny) - I(Sunny, Humidity) = 0.971 - 0 = 0.971
Contd.
• Third Attribute - Wind

Categorical values - weak, strong


H(Sunny, Wind=weak) = -(1/3)*log(1/3)-(2/3)*log(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Wind –
I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny, Wind=strong)
= (3/5)*0.918 + (2/5)*1 = 0.9508
Information Gain = H(Sunny) - I(Sunny, Wind) = 0.971 - 0.9508 = 0.0202
Contd.
• Here, the attribute with the maximum information gain is Humidity,
so Humidity becomes the next decision node under the Sunny branch of
the decision tree built so far.
Contd.
• Here, when Outlook = Sunny and Humidity = High, it is a pure
class of category "no". And when Outlook = Sunny and
Humidity = Normal, it is again a pure class of category "yes".
Therefore, we don't need to do further calculations.
• Now, find the best attribute for splitting the data with
Outlook = Rain (dataset rows = [4, 5, 6, 10, 14]).
Contd.
Complete entropy of the Rain subset is:
H(Rain) = -p(yes) * log2(p(yes)) - p(no) * log2(p(no))
        = -(3/5) * log2(3/5) - (2/5) * log2(2/5) = 0.971
First Attribute - Temperature
Categorical values - mild, cool
H(Rain, Temperature=cool) = -(1/2)*log(1/2) - (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3) - (1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain, cool)*H(Rain, Temperature=cool)
                     = (3/5)*0.918 + (2/5)*1 = 0.9508
Information Gain = H(Rain) - I(Rain, Temperature) = 0.971 - 0.9508 = 0.0202
Contd.
• Second Attribute - Wind
Categorical values - weak, strong
H(Rain, Wind=weak) = -(3/3)*log(3/3) - 0 = 0
H(Rain, Wind=strong) = -0 - (2/2)*log(2/2) = 0
Average Entropy Information for Wind -
I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
              = (3/5)*0 + (2/5)*0 = 0
Information Gain = H(Rain) - I(Rain, Wind) = 0.971 - 0 = 0.971
Final desired output
• Here, the attribute with the maximum information gain under the Rain
branch is Wind. When Outlook = Rain and Wind = Strong, it is a pure class
of category "no", and when Outlook = Rain and Wind = Weak, it is again a
pure class of category "yes".
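The final tree derived above (Outlook at the root; Humidity under Sunny; Wind under Rain; Overcast always Yes) can be written as a small prediction function. This is a sketch; the function name predict_play_tennis is illustrative.

def predict_play_tennis(outlook, humidity, wind):
    """Final ID3 tree from the worked example in these slides."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook value: {outlook}")

print(predict_play_tennis("Sunny", "High", "Weak"))   # No
print(predict_play_tennis("Rain", "High", "Weak"))    # Yes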
Characteristics of the ID3 algorithm

1. ID3 uses a greedy approach, so it does not guarantee an optimal
solution; it can get stuck in local optima.
2. ID3 can overfit the training data (to avoid overfitting, smaller
decision trees should be preferred over larger ones).
3. This algorithm usually produces small trees, but it does not always
produce the smallest possible tree.
4. ID3 is harder to use on continuous data (if the values of an attribute
are continuous, then there are many more places to split the data on that
attribute, and searching for the best split value can be time-consuming);
a threshold-search sketch follows below.
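Point 4 above is usually handled by sorting a continuous attribute and evaluating candidate thresholds. A minimal sketch under that assumption (the names entropy_from_labels and best_threshold, and the example numbers, are illustrative, not from the slides):

import math

def entropy_from_labels(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try a split 'value <= t' at each midpoint between consecutive sorted
    values and return the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy_from_labels(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy_from_labels(left) \
                    - (len(right) / len(pairs)) * entropy_from_labels(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Example: humidity measured as a number instead of high/normal.
# Picks the threshold 82.5 with an information gain of about 0.97.
print(best_threshold([85, 90, 70, 95, 80], ["No", "No", "Yes", "No", "Yes"]))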
