
DECISION TREES AND RANDOM FOREST
Bhabani Shankear Basak


What is Decision Tree?
Decision Tree is a supervised machine learning algorithm that is mostly used for
classification problems.
It works for both categorical and continuous dependent variables.
KEY TERMS IN DECISION TREES
Root Node: It represents the entire population.
Splitting: It refers to the process of dividing a node into two or more sub-nodes.
Decision node: It refers to a node which gets split further based on a decision.
Leaf node: It refers to an end node after which no further splitting takes place. It is the last
node of the decision tree.
DECISION TREE
EXAMPLE 1
Let's say we have a sample of 30 students with 3 features:
Gender (Male / Female)
Course (Statistics / Economics)
Height (< 5.5 ft / > 5.5 ft)
Given that 15 out of these 30 students play football.
Qs: In order to create a model to predict who will play football during leisure period, which is the most
important feature to start with?

Split of the 30 students on each feature:

Feature   Sub-node    N    Playing football
Gender    Male        20   65% (13)
          Female      10   20% (2)
Course    Stat        12   58% (7)
          Eco         18   44% (8)
Height    < 5.5 ft    16   50% (8)
          > 5.5 ft    14   50% (7)
The Decision Tree takes as the most important feature the one which gives the
maximum separation in the target variable.
Here that feature is Gender.
EXAMPLE 2
Objective: To predict whether a dot is Red or Grey based on two independent variables X and Y.
Solution
DECISION TREE SPLIT FOR NUMERIC FEATURES

X1     Y
2      Yes
2.5    Yes
3      No
3.7    No
5.2    No
8.4    Yes
10.5   Yes

Steps:
1. Arrange all the values in ascending order.
2. Consider a threshold value to split, divide the observations into 2 nodes and move ahead.
3. The split (threshold) for which the Gini Impurity is lowest is chosen.
4. This type of splitting takes place for each and every numerical feature.

Disadvantage: Time complexity increases for a large data set.

Example split on X1 (root node: 4 Yes / 3 No):
  X1 <= 2  ->  1 Yes / 0 No
  X1 >  2  ->  3 Yes / 3 No

A minimal sketch of this threshold search is given below.
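The following is a minimal sketch (not from the slides; the function names and midpoint-threshold choice are my own) of how candidate thresholds on a numeric feature can be scored by weighted Gini impurity, using the X1/Y values from the table above.

# Minimal sketch: try every candidate threshold on a numeric feature and keep
# the one with the lowest weighted Gini impurity.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(x, y):
    """Return (threshold, weighted_gini) of the best binary split x <= t."""
    pairs = sorted(zip(x, y))                      # step 1: ascending order
    xs = [p[0] for p in pairs]
    best = (None, float("inf"))
    # candidate thresholds: midpoints between consecutive distinct values
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue
        t = (xs[i] + xs[i + 1]) / 2
        left = [lab for val, lab in pairs if val <= t]
        right = [lab for val, lab in pairs if val > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if w < best[1]:
            best = (t, w)
    return best

# Data from the slide: X1 vs Y
x1 = [2, 2.5, 3, 3.7, 5.2, 8.4, 10.5]
y = ["Yes", "Yes", "No", "No", "No", "Yes", "Yes"]
print(best_threshold(x1, y))   # e.g. threshold 2.75 with weighted Gini ~0.34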
HOW DECISION TREE FUNCTIONS
[Same split summary as in Example 1: Gender (Male 65% / Female 20%), Course (Stat 58% / Eco 44%), Height (< 5.5 ft 50% / > 5.5 ft 50%).]

Gender   Play Football   Total   P(Y)    P(N)
Male     Yes=13, No=7    20      13/20   7/20
Female   Yes=2,  No=8    10      2/10    8/10


Gini Impurity for feature ‘GENDER’
G(Gender=M) = 1 – P(Yes)^2 – P(No)^2
= 1 – (13/20)^2 – (7/20)^2
= 0.45

G(Gender=F) = 1 – P(Yes)^2 – P(No)^2


= 1 – (2/10)^2 – (8/10)^2
= 0.32
Weighted sum of Gini Impurity for feature ‘GENDER’
G(Gender) = [No. of instances (Male) / Total no. of instances] * G(Gender=M)
          + [No. of instances (Female) / Total no. of instances] * G(Gender=F)
          = (20/30) * 0.45 + (10/30) * 0.32
          ≈ 0.41
Course   Play Football    Total   P(Y)    P(N)
Stat     Yes=7,  No=5     12      7/12    5/12
Eco      Yes=8,  No=10    18      8/18    10/18

Gini Impurity for feature ‘COURSE’


G(Course=Stat) = 1 – P(Yes)^2 – P(No)^2
= 1 – (7/12)^2 – (5/12)^2
= 0.49

G(Course=Eco) = 1 – P(Yes)^2 – P(No)^2


= 1 – (8/18)^2 – (10/18)^2
= 0.49
Weighted sum of Gini Impurity for feature ‘COURSE’
G(Course) = [No. of instances (Stat) / Total no. of instances] * G(Course=Stat)
          + [No. of instances (Eco) / Total no. of instances] * G(Course=Eco)
          = (12/30) * 0.49 + (18/30) * 0.49
          ≈ 0.49
Height      Play Football   Total   P(Y)    P(N)
<= 5.5 ft   Yes=8, No=8     16      8/16    8/16
> 5.5 ft    Yes=7, No=7     14      7/14    7/14

Gini Impurity for feature ‘HEIGHT’


G(Height <= 5.5 ft) = 1 – P(Yes)^2 – P(No)^2
= 1 – (1/2)^2 – (1/2)^2
= 0.5

G(Height > 5.5 ft) = 1 – P(Yes)^2 – P(No)^2


= 1 – (1/2)^2 – (1/2)^2
= 0.5
Weighted sum of Gini Impurity for feature ‘HEIGHT’
G(Height) = [No. of instances (Ht <= 5.5 ft) / Total no. of instances] * G(Ht <= 5.5 ft)
          + [No. of instances (Ht > 5.5 ft) / Total no. of instances] * G(Ht > 5.5 ft)
          = (16/30) * 0.5 + (14/30) * 0.5
          = 0.5

So, the weighted Gini Impurity for feature ‘GENDER’ (≈ 0.41) is the lowest compared to the other two
features ‘COURSE’ (≈ 0.49) and ‘HEIGHT’ (0.5).
So, we will start splitting with ‘GENDER’. A small sketch reproducing these calculations is given below.
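The following is a minimal sketch (my own code, not from the slides) that reproduces the weighted Gini Impurity calculation for each categorical feature, using the yes/no counts given above.

# Weighted Gini Impurity for the three categorical splits in the football example.
def gini(yes, no):
    """Gini impurity of a node with `yes` and `no` counts: 1 - P(Y)^2 - P(N)^2."""
    total = yes + no
    p_yes, p_no = yes / total, no / total
    return 1 - p_yes**2 - p_no**2

def weighted_gini(groups):
    """groups: list of (yes, no) counts, one per sub-node of the split."""
    total = sum(y + n for y, n in groups)
    return sum((y + n) / total * gini(y, n) for y, n in groups)

# Counts taken from the slides (yes = plays football, no = does not)
splits = {
    "Gender": [(13, 7), (2, 8)],     # Male, Female
    "Course": [(7, 5), (8, 10)],     # Stat, Eco
    "Height": [(8, 8), (7, 7)],      # <= 5.5 ft, > 5.5 ft
}

for feature, groups in splits.items():
    print(feature, round(weighted_gini(groups), 3))
# Gender ~0.41, Course ~0.49, Height 0.5 -> split on Gender first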
RANGE OF GINI IMPURITY
Let us consider m outcomes. Then
Gini Impurity = 1 − Σ_{i=1}^{m} p_i^2 = 1 − p_1^2 − p_2^2 − … − p_m^2

For m = 2,
Gini Impurity = 1 − Σ_{i=1}^{2} p_i^2 = 1 − p_1^2 − p_2^2
where p_1 = p and p_2 = 1 − p
Case 1 (Worst Case): Consider 2 outcomes, Yes and No, with Number of Yes = Number of No,
so P(Yes) = P(No) = 0.5.
Gini Impurity = 1 − P(Yes)^2 − P(No)^2
              = 1 − (0.5)^2 − (0.5)^2
              = 0.5

Case 2 (Best Case): Consider 2 outcomes, Yes and No, with all Yes and zero No,
so P(Yes) = 1 and P(No) = 0.
Gini Impurity = 1 − P(Yes)^2 − P(No)^2
              = 1 − (1)^2 − (0)^2
              = 0

So, Gini Impurity lies between 0 and 0.5.

[Plot: Gini Impurity vs P(Yes); the impurity is 0 at P(Yes) = 0 and P(Yes) = 1, and peaks at 0.5 when P(Yes) = 0.5.]


ENTROPY

[Figure: three sets of dots, Set A, Set B and Set C, shown in increasing order of homogeneity.]
Here Set C is a pure set, i.e. it is completely homogeneous.
Sets A and B are impure sets, i.e. they are not homogeneous.
Impurity of Set A > Impurity of Set B.
So, we can say that a less impure node requires less information to describe it, and a
more impure node requires more information to describe it.
Entropy is the degree of randomness or disorganization in the system.
More entropy means more randomness or chaos
RANGE OF ENTROPY
Entropy is given by
E(S) = − p * log2(p) − q * log2(q)
where p = P(Yes) and q = 1 − p = P(No)
Case 1: Consider 2 outcomes, Yes and No, with Number of Yes = Number of No,
so P(Yes) = P(No) = 0.5.
E(S) = − P(Yes) * log2 P(Yes) − P(No) * log2 P(No)
     = − 0.5 * log2(0.5) − 0.5 * log2(0.5)
     = − 0.5 * (−1) − 0.5 * (−1)
     = 1

Case 2: Consider 2 outcomes, Yes and No, with all Yes and zero No (or all No and zero Yes),
so P(Yes) = 1 and P(No) = 0.
E(S) = − P(Yes) * log2 P(Yes) − P(No) * log2 P(No)
     = − 1 * log2(1) − 0 * log2(0)
     = − log2(1)
     = 0

So, Entropy lies between 0 and 1.

[Plot: Entropy E(S) vs P(Y); E(S) = 0 at P(Y) = 0 and P(Y) = 1, and E(S) = 1 at P(Y) = 0.5.]


Information

Information = Σ_v (S_v / S) * E(S_v)

where S_v = No. of instances in the subset
      S = Total no. of instances in the dataset
      E(S_v) = Entropy of the subset

Information Gain
Information Gain = Entropy (of the parent node) – Information
The higher the Information Gain, the more important the feature.
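As a worked example (computed here from the football data in Example 1; these numbers are not shown on the slides):
Parent node: 15 Yes / 15 No, so Entropy = 1.
E(Gender=M) = − (13/20) * log2(13/20) − (7/20) * log2(7/20) ≈ 0.93
E(Gender=F) = − (2/10) * log2(2/10) − (8/10) * log2(8/10) ≈ 0.72
Information = (20/30) * 0.93 + (10/30) * 0.72 ≈ 0.86
Information Gain (Gender) = 1 − 0.86 ≈ 0.14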
EXAMPLE

Match   Opponent (X1)   Championship (X2)          Time (X3)   Location (X4)   Win (Y)
1       The Rock        Tag Team Match             Summer      Texas           No
2       The Rock        Tag Team Match             Summer      New York        No
3       Triple H        Tag Team Match             Summer      Texas           Yes
4       Goldberg        Inter Continental Match    Summer      Texas           Yes
5       Goldberg        WWE Title                  Winter      Texas           Yes
6       Goldberg        WWE Title                  Winter      New York        No
7       Triple H        WWE Title                  Winter      New York        Yes
8       The Rock        Inter Continental Match    Summer      Texas           No
9       The Rock        WWE Title                  Winter      Texas           Yes
10      Goldberg        Inter Continental Match    Winter      Texas           Yes
11      The Rock        Inter Continental Match    Winter      New York        Yes
12      Triple H        Inter Continental Match    Summer      New York        Yes
13      Triple H        Tag Team Match             Winter      Texas           Yes
14      Goldberg        Inter Continental Match    Summer      New York        No

A sketch that computes the Information Gain of each feature on this table is given below.
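The following is a minimal sketch (my own code; the slides do not work this example out) that computes Entropy, Information and Information Gain for each feature of the match table above.

# Information Gain per feature on the match table.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target="Win"):
    parent = entropy([r[target] for r in rows])
    info = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        info += len(subset) / len(rows) * entropy(subset)   # weighted entropy of sub-nodes
    return parent - info

rows = [
    {"Opponent": "The Rock", "Championship": "Tag Team Match", "Time": "Summer", "Location": "Texas", "Win": "No"},
    {"Opponent": "The Rock", "Championship": "Tag Team Match", "Time": "Summer", "Location": "New York", "Win": "No"},
    {"Opponent": "Triple H", "Championship": "Tag Team Match", "Time": "Summer", "Location": "Texas", "Win": "Yes"},
    {"Opponent": "Goldberg", "Championship": "Inter Continental Match", "Time": "Summer", "Location": "Texas", "Win": "Yes"},
    {"Opponent": "Goldberg", "Championship": "WWE Title", "Time": "Winter", "Location": "Texas", "Win": "Yes"},
    {"Opponent": "Goldberg", "Championship": "WWE Title", "Time": "Winter", "Location": "New York", "Win": "No"},
    {"Opponent": "Triple H", "Championship": "WWE Title", "Time": "Winter", "Location": "New York", "Win": "Yes"},
    {"Opponent": "The Rock", "Championship": "Inter Continental Match", "Time": "Summer", "Location": "Texas", "Win": "No"},
    {"Opponent": "The Rock", "Championship": "WWE Title", "Time": "Winter", "Location": "Texas", "Win": "Yes"},
    {"Opponent": "Goldberg", "Championship": "Inter Continental Match", "Time": "Winter", "Location": "Texas", "Win": "Yes"},
    {"Opponent": "The Rock", "Championship": "Inter Continental Match", "Time": "Winter", "Location": "New York", "Win": "Yes"},
    {"Opponent": "Triple H", "Championship": "Inter Continental Match", "Time": "Summer", "Location": "New York", "Win": "Yes"},
    {"Opponent": "Triple H", "Championship": "Tag Team Match", "Time": "Winter", "Location": "Texas", "Win": "Yes"},
    {"Opponent": "Goldberg", "Championship": "Inter Continental Match", "Time": "Summer", "Location": "New York", "Win": "No"},
]

for feature in ["Opponent", "Championship", "Time", "Location"]:
    print(feature, round(information_gain(rows, feature), 3))
# The feature with the highest Information Gain becomes the root node.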
GINI IMPURITY VS ENTROPY
Gini Impurity is computationally more efficient than Entropy, which requires evaluating
log2(p) and therefore takes more time. After computing Entropy we additionally have to calculate Information Gain.
So, Gini Impurity is often preferred over Entropy as the splitting metric in a Decision Tree.
TYPES OF DECISION TREES

Type of Decision Tree                      Metric used for separation       Goodness of fit
Classification Trees                       Gini Impurity,                   Misclassification rate
(i.e. categorical dependent variable)      Entropy and Information Gain
Regression Trees                           Reduction of Variance            Sum of squared errors
(i.e. continuous dependent variable)
DIFFERENT ALGORITHMS FOR DECISION TREE
1. ID3 (Iterative Dichotomizer): It is one of the algorithms used to construct decision
trees for classification. It uses Information Gain as the criterion for finding the root
node and the subsequent splits. It only accepts categorical features.
2. C4.5: It is an extension of the ID3 algorithm and improves on it. It deals with both
numerical and categorical features.
3. Classification and Regression Trees (CART): It is the most popular algorithm used
for constructing decision trees. By default it uses Gini Impurity as the metric for
selecting splits. This algorithm works on both regression and classification
problems. A short sketch using scikit-learn's CART-style implementation is shown below.
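As an illustration (my own sketch, not from the slides; the dataset and settings are arbitrary examples), scikit-learn's DecisionTreeClassifier builds a CART-style tree and lets you switch the splitting metric between Gini Impurity and Entropy.

# Train the same CART-style tree with Gini Impurity and with Entropy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)

print("Gini tree accuracy   :", tree_gini.score(X_test, y_test))
print("Entropy tree accuracy:", tree_entropy.score(X_test, y_test))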
ADVANTAGES OF DECISION TREE
1. A Decision Tree is very easy to build and explain.
2. Complex decision tree models can be understood simply by visualizing them.
3. It can be used for both regression and classification problems.
4. Normalization and scaling are not required for a decision tree.
DISADVANTAGES OF DECISION TREES
• Decision Trees are not well suited for continuous target variables. They lose information
when they divide a continuous variable into different categories.
• They suffer from the problem of overfitting, i.e. the error rate is very low on the
training data but very high on the testing data set; in other words, the model has
Low Bias and High Variance. This happens because the decision tree has
grown to its complete depth, i.e. the tree has grown as big as possible.
HOW TO PREVENT THE PROBLEM OF OVERFITTING
1. Tree Pruning
2. Bagging Technique
OVERFITTING VS UNDERFITTING
TREE PRUNING
Pruning: It means cutting the branches of a tree or trimming the tree so that it does not
grow beyond a certain size.

How do we prune the tree?


Hyperparameters: They are parameters that are set before training (they are not learned
from the data) and whose values are used to control the learning process. By controlling
these hyperparameters of a decision tree, we can prevent the tree from overfitting.
HYPERPARAMETERS IN DECISION TREES
Some of the important hyperparameters in Decision Trees are listed here; a short sketch
using them follows the list.
▪ max leaf nodes: It refers to the maximum number of terminal nodes in a tree. It can be
used in place of max depth.
▪ min samples leaf: It refers to the minimum number of records or observations required
in a leaf node. Generally, 5% of the overall population or 100, whichever is lower, is
chosen as min samples leaf.
▪ max depth: It denotes the maximum number of layers in a tree, i.e. the vertical depth
of the tree. It is used in place of max leaf nodes.
▪ min samples for a node to split: It gives the minimum number of samples or
observations required in a node for it to be split.
▪ max features: It refers to the maximum number of features to consider for a split. The
thumb rule is to take the square root of the total number of features or 30–40% of
the total number of features, whichever is lower.
So, by pruning, i.e. by reducing the number of leaf nodes, restricting the size of the
samples in a leaf node and tuning the other hyperparameters, we can prevent the problem of
overfitting in decision trees.
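The following is an illustrative sketch (my own, not from the slides; the dataset and parameter values are arbitrary examples, not recommendations) of restricting these hyperparameters in scikit-learn to prune a tree.

# Fully grown tree vs a tree pruned via hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: tends to overfit (high train accuracy, lower test accuracy)
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: growth restricted by the hyperparameters discussed above
pruned_tree = DecisionTreeClassifier(
    max_depth=4,            # maximum vertical depth of the tree
    max_leaf_nodes=10,      # maximum number of terminal nodes
    min_samples_leaf=20,    # minimum observations required in a leaf
    min_samples_split=40,   # minimum observations required to split a node
    max_features="sqrt",    # consider sqrt(n_features) at each split
    random_state=0,
).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))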
ENSEMBLE TECHNIQUE
Ensemble means combining multiple models.
In an ensemble technique, instead of relying on a single model, we train multiple models on
the dataset and then combine their outputs to get the final result. It is based on the concept of
the ‘Wisdom of the Crowd’.

Ensemble technique is classified into


• Bagging technique
• Boosting technique
BAGGING
Let's say we have a training data set S containing 1000 rows. We create r subsets
S1, S2, ..., Sr by taking a sample of rows (say 100) from S.
Here the selection of each row is done using simple random sampling with replacement.
In other words, if subset S1 contains 5 rows (say), it can have row
numbers like {20, 30, 20, 41, 50}, i.e. there can be a repetition of rows.
In this way, we create r subsets S1, S2, ..., Sr, each containing 100 rows, following
the method of simple random sampling with replacement.
Why simple random sampling with replacement?
This is an important step. It ensures that the subsets S1, S2, ..., Sr are
(approximately) independent of each other. This forms the basis of Bagging.
Now, based on each subset Si, we build a machine learning model Mi, i = 1, 2, ..., r. So, we
have model M1 for subset S1, model M2 for subset S2, and so on.
Let us first consider a binary classification problem. Here our objective will be to predict
the final outcome as 1 or 0.
We then pass the test data to each model. The output of each model Mi will be either 1 or 0.
How do we reach a conclusion regarding the final output from all the different models?
We take the majority vote over all the predictions combined to get the final output. This
is for the classification problem.
Similarly, for a regression problem, i.e. when the dependent variable is continuous in nature,
each model predicts some numeric value. So, we use a model averaging technique,
such as a simple average, to get the final outcome.
This method is also known as Bootstrap Aggregation: we create samples
(known as bootstrap samples) from the original training data S and finally
combine (i.e. aggregate) their predictions to come to a final output. Hence the name
Bootstrap Aggregation. A minimal sketch of this procedure is given below.
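The following is a minimal sketch (my own, not from the slides; the dataset, number of models and bootstrap-sample size are arbitrary choices) of Bootstrap Aggregation by hand, using decision trees as the base models and majority voting for classification.

# Bagging by hand: bootstrap samples -> one tree per sample -> majority vote.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
r = 25                       # number of bootstrap subsets / models
models = []
for _ in range(r):
    # simple random sampling with replacement -> bootstrap sample Si
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# each model Mi predicts 1 or 0; the final output is the majority vote
votes = np.array([m.predict(X_test) for m in models])          # shape (r, n_test)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (majority == y_test).mean())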
RANDOM FOREST
In Random Forest, we apply this technique of Bootstrap Aggregation.
▪ Here each model is a Decision Tree, hence we get multiple decision trees in a Random
Forest.
▪ Apart from randomly selecting the rows (as we do in the Bagging technique), we also
perform feature sampling, i.e. only a random subset of the features is considered at each split.
HOW RANDOM FOREST SOLVES OVERFITTING PROBLEM
Here, we are not building a single decision tree on the entire dataset. We are
building multiple decision trees independently on different subsets of the data. Each
individual decision tree may have High Variance, but when we combine (average) all the
decision trees, this High Variance is converted into Low Variance.
So, instead of relying on a single decision tree, we take the decision based on
multiple decision trees. A short sketch comparing the two is given below.
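The following is an illustrative sketch (my own, not from the slides; dataset and settings are arbitrary examples) comparing a single fully grown tree with a Random Forest on the same data.

# Single tree vs Random Forest: compare train/test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

for name, model in [("single tree", single_tree), ("random forest", forest)]:
    print(name, "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
# Typically the single tree fits the training data almost perfectly but scores
# lower on the test data, while the forest generalizes better.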
THANK YOU
