Decision Trees
A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences. It is also called an instance-based algorithm because a decision is taken at each instance; equivalently, it can be viewed as a set of nested if-else conditions.
A decision tree is a non-linear model built out of many linear, axis-parallel planes. A single such plane can be thought of as a logistic-regression-style decision boundary, and combining multiple of them gives a decision tree.
Geometrically, these axis-parallel planes tessellate the feature space into hypercuboids (or hypercubes).
Decision trees can be of two types: regression trees and classification trees. Regression trees are used when the dependent variable is continuous; classification trees are used when the dependent variable is categorical.
Some terms related to decision trees are:
Decision Node: Any node that gets split is known as the decision node.
Root Node: This is the top-most decision node and represents the entire sample
which further gets divided.
Leaf Node: Also known as the Terminal node, this is a node that does not split further.
Splitting: The process of dividing a node into sub-nodes. The most commonly used approach is top-down splitting, where each node is split on the feature value that gives the best split.
Pruning: It is the opposite of splitting. Here we remove sub-nodes of a decision node.
1. Entropy: For a binary target it is defined as
H = -p log2(p) - q log2(q)
where p is the proportion of one class and q = 1 - p is the proportion of the other.
Suppose we tossed a coin 4 times, and the output of the events came as {Head, Tail, Tail,
Head}. Based solely on this observation, if you have to guess what will be the output of the
coin toss, what would be your guess?
- Two heads and two tails: a fifty percent probability of getting a head and a fifty percent probability of getting a tail. You cannot be sure; the outcome is a random event between head and tail.
But what if we have a biased coin, which when tossed four times, gives following output:
{Tail, Tail, Tail, Head}. Here, if you have to guess the output of the coin toss, what would be
your guess? Chances are you will go with Tail. Why? Because there is a seventy-five percent chance that the output is a tail, based on the sample set that we have. In other words, the result is less random for the biased coin than it was for the fair coin.
Hence, we would say that an unbiased coin has high entropy, because the uncertainty about the outcome is the highest possible, whereas the biased coin has lower entropy.
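As a quick check of this formula, here is a small Python sketch (standard library only) that computes the entropy of the fair and the biased coin:

from math import log2

def entropy(p):
    # Binary entropy H = -p*log2(p) - q*log2(q), with q = 1 - p.
    q = 1 - p
    # Treat 0*log2(0) as 0 to avoid a math domain error.
    return -sum(x * log2(x) for x in (p, q) if x > 0)

print(entropy(0.5))    # fair coin {H, T, T, H}: 1.0, maximum uncertainty
print(entropy(0.75))   # biased coin {T, T, T, H}: about 0.811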
Suppose we have data on 1000 employees, a sample of which is shown below:
Defaulter  Employed  Area  Gender
1          1         N     M
0          0         S     F
0          0         S     M
1          0         S     F
0          1         N     F
1          0         N     M
0          0         N     M
1          1         S     M
1          0         S     F
0          1         S     M
0          0         N     F
We want to classify whether an employee is a defaulter or not on the basis of Employed, Area, and Gender.
First split: Employed == YES

    Employed = YES          Employed = NO
    N = 700                 N = 300
    1: 550                  1: 50
    0: 150                  0: 250

The Employed = YES node (N = 700; 1: 550, 0: 150) is then split on Gender == M:

    Gender = M              Gender = F
    N = 300                 N = 400
    1: 230 (77%)            1: 10 (3%)
    0: 70 (23%)             0: 390 (98%)

The Employed = NO node (N = 300; 1: 50, 0: 250) is split on Area == N:

    Area = N                Area != N
    N = 100                 N = 200
    1: 25                   1: 150
    0: 75                   0: 50
This tells us that when the two categories are equally probable the entropy is 1, and as they become more unequal the entropy decreases towards 0. Entropy measures the amount of randomness (impurity) in the data. When choosing a split, we prefer the feature whose split reduces the entropy the most, i.e. yields the highest information gain.
2. Information Gain: Information gain is the reduction in entropy achieved by a split:
IG = H(parent) - weighted average of H(children)
It tells us how informative a feature is; the higher the IG, the better the split.
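As an illustration (not part of the original text), the information gain of the Employed split from the example above can be computed in Python from the counts already given: 600 ones and 400 zeros in the parent node, 550/150 in the Employed = YES node and 50/250 in the Employed = NO node.

from math import log2

def entropy(counts):
    # Entropy of a node from its class counts, e.g. [ones, zeros].
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent   = [600, 400]   # full data set (N = 1000)
employed = [550, 150]   # Employed = YES node (N = 700)
not_emp  = [50, 250]    # Employed = NO node  (N = 300)

weighted = (700 / 1000) * entropy(employed) + (300 / 1000) * entropy(not_emp)
info_gain = entropy(parent) - weighted
print(round(info_gain, 3))   # roughly 0.25 for this split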
3. Gini Impurity: This is an impurity-based criterion used in CART, which only performs binary splits. It is defined as
Gini = 1 - Σ pi²
where we sum over the probability pi of an item having the label i. It reaches its minimum (zero) when all cases in the node fall into a single target category.
We find that out of the total of 50 students, half were selected for the coaching. To calculate the Gini index, we first have to find the proportions of these values.
Now we can find the weighted Gini for split X1 (Grade) by calculating the Gini for the sub-nodes 'A or B' and 'C or D'. We can similarly do the same for X2, which here is 'Class', and find the Gini for splitting 'Class'.
For X3, Height, which is a continuous variable, the software in the backend splits the variable into height > x and height < x and finds the Gini value; whichever value of x provides the minimum Gini is used to calculate the Gini for splitting this continuous variable. In our example, we find that x = 1.75 provides the minimum value in comparison to other values of height, say 1.80 or 1.60, and the Gini for the split on Height comes out to be 0.23.
Now we compare the Gini values and find that the minimum Gini is provided by Height (0.23); thus this variable will be the root node that splits the data into two more homogeneous, 'purer' sets. The second-lowest Gini is found for the variable 'Grade', and the highest Gini is found for 'Class'. Thus the data will first be split by Height, as it plays the major role in deciding whether a student will be selected for the basketball coaching or not.
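The student-level table behind these calculations is not reproduced here, so the counts in the sketch below are hypothetical; it only illustrates how the weighted Gini of a split is computed from the class counts of its sub-nodes.

def gini(counts):
    # Gini impurity 1 - sum(p_i^2) for a node with the given class counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    # Weighted Gini of a split; each group holds [selected, not selected] counts.
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * gini(g) for g in groups)

# Hypothetical sub-node counts, not the actual data from the example.
grade_split  = [[15, 10], [10, 15]]   # 'A or B' vs 'C or D'
height_split = [[22, 3], [3, 22]]     # height > 1.75 vs height <= 1.75
print(round(weighted_gini(grade_split), 2))    # 0.48 for these made-up counts
print(round(weighted_gini(height_split), 2))   # 0.21 for these made-up counts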
Gini impurity and entropy serve the same purpose. The main difference is that the Gini formula does not use a logarithm, which makes it faster to compute than entropy (lower computational complexity).
For example, let us calculate the mean and variance for the root node. Here we have 25 1s (y = 1) and 25 0s (y = 0). We calculate the mean as ((25 × 1) + (25 × 0)) ÷ 50, which comes out to be 0.5. We can now use the formula for variance, Σ(x - x̄)² ÷ n, with mean x̄ = 0.5, individual values x equal to 1 or 0, and n = 50 entries.
Writing out the 25 terms of (1 - 0.5)² and the 25 terms of (0 - 0.5)², the sum collapses to
((25 × (1 - 0.5)²) + (25 × (0 - 0.5)²)) ÷ 50 = 0.25
We use the above formula to find the mean and variance for each variable (a sketch of this calculation follows below). The same calculation gives the variance for the X1 (Grade) split, and likewise for the split on Class. We find that the results are similar to those from entropy, with 'Grade' being the less important variable, as its variance (0.24) is higher than that of the other categorical variable, 'Class', which has a variance of 0.21. A numerical variable is handled with the same steps, and the variable which provides the minimum variance is chosen for the split.
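A short sketch of these variance calculations; the root-node counts (25 ones, 25 zeros) come from the example, while the sub-node compositions used for the Grade split are hypothetical.

def variance(values):
    # Mean squared deviation from the mean.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_variance(groups):
    # Weighted variance of a split; each group is a list of y values.
    n = sum(len(g) for g in groups)
    return sum((len(g) / n) * variance(g) for g in groups)

root = [1] * 25 + [0] * 25            # 25 selected, 25 not selected
print(variance(root))                 # 0.25, as in the calculation above

# Hypothetical sub-node compositions for the split on 'Grade'.
grade_split = [[1] * 15 + [0] * 10, [1] * 10 + [0] * 15]
print(round(weighted_variance(grade_split), 2))   # 0.24 for these made-up counts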
Now we know how to build a decision tree. But we must take care while building one: as the depth increases, the chance of overfitting increases (the tree starts fitting noisy points in the data), and if the depth is too small, say 1, the decision tree underfits. So an appropriate depth should be found using cross-validation.
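A common way to pick that depth is a grid search with cross-validation; a minimal scikit-learn sketch (with synthetic data standing in for a real training set) might look like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real training set.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Try several candidate depths with 5-fold cross-validation.
param_grid = {"max_depth": [1, 2, 3, 5, 7, 10, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))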
Parameters Related to the Size of the Tree:
Maximum Depth: Among the easiest parameters to understand, maximum depth is an important parameter to tune for trees and tree-based models. It is the length of the longest path from the root node to a leaf node, i.e. the number of layers of splits above the leaves. We often provide a maximum value (depth limit) to the decision tree model to limit further splitting of nodes, so that when the specified depth is reached the tree stops splitting.
In the above example, we have two features, one on the x-axis and the other on the y-axis. The tree finds that the first best split can be made on the x-axis, i.e. it splits on that variable at the value which provides the best result (the value giving the minimum entropy or Gini, depending on the algorithm the decision tree is using). With max depth = 1, it splits the data once and classifies every point on the left as 0 (blue background colour), while the actual 0s are the blue dots. This depth makes the model too generalized and underfits.
However, if it keeps going and recursively partitions the data, it finds further ways to split, creates more and more branches, and keeps splitting the space. By the time it reaches the maximum depth, all the leaves are pure; this leads to overfitting, because the decision boundary is very complex and will yield poor results in testing or on unseen new data. Thus we can set the maximum depth at 5 to get a moderately complex and reasonably accurate decision tree model.
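The figure this paragraph refers to is not included here, but the same effect can be sketched on synthetic two-feature data by comparing trees of depth 1, 5, and unlimited:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two-feature synthetic data stands in for the plotted example.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):   # underfit, moderate, fully grown
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, round(tree.score(X_train, y_train), 2),
          round(tree.score(X_test, y_test), 2))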
Minimum Split Size: This is another parameter, through which we define the minimum number of observations required in a node to allow it to be split further. This can control overfitting and underfitting. A very low value of minimum split size leads to overfitting, with splits being made on very few observations; an extremely large value leads to underfitting, because nodes that could give useful insights when split will not be split.
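In scikit-learn this parameter is exposed as min_samples_split; a brief sketch:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# A node is only considered for further splitting if it holds at least
# 20 observations; smaller nodes become leaves.
tree = DecisionTreeClassifier(min_samples_split=20, random_state=0).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())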
Maximum Impurity
In the Maximum Depth example, we saw how a low depth caused the red points to be classified as blue. We can limit how many wrongly classified points are allowed to remain in a region before splitting is stopped.
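scikit-learn does not expose a parameter called "maximum impurity"; the closest built-in option is min_impurity_decrease, which stops a split when it would not reduce the weighted impurity enough. A sketch under that assumption:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# A split is only made if it reduces the weighted impurity by at least 0.01,
# so regions containing only a few 'wrong' points are left unsplit.
tree = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0).fit(X, y)
print(tree.get_n_leaves())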
PYTHON CODE:
In case of classification
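The original listing is not reproduced here; a minimal classification sketch with scikit-learn (using the iris data set as a stand-in) might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Any tabular data with a categorical target works; iris is a stand-in.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out data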
In case of regression
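Similarly, a minimal regression sketch (synthetic data standing in for a real data set with a continuous target):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regression trees split on variance reduction (squared error) by default.
reg = DecisionTreeRegressor(max_depth=5, random_state=0)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))   # R^2 on held-out data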
The data set contains the following variables:
# We have train (8523) and test (5681) data set, train data set has both input and output
variable(s). We need to predict the sales for test data set.
# Item_Visibility: The % of total display area of all products in a store allocated to the
particular product
# Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket
# Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.
After all the preprocessing steps, we train a decision tree model without any hyperparameter tuning and then check the model score.
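The original notebook code is not included; the sketch below is a hedged reconstruction of what the untuned model and its scores might have looked like. The file name and the preprocessing details are assumptions; only the target column Item_Outlet_Sales comes from the description above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumed file name; the real preprocessing (missing values, encoding) is
# summarized here by dropna() and one-hot encoding.
data = pd.get_dummies(pd.read_csv("train.csv").dropna())
X = data.drop(columns=["Item_Outlet_Sales"])
y = data["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No hyperparameters supplied: the tree keeps splitting until the leaves are pure.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))   # close to 1.0 (100%) on the training data
print(model.score(X_test, y_test))     # much lower R^2 on the held-out data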
We got 100% score on training data.
On the test data we got only a 5.7% score, because we did not provide any tuning parameters while initializing the tree. As a result, the algorithm kept splitting the training data until every leaf node was pure, the depth of the tree increased, and the model overfit.
That is why we get a high score on the training data and a low score on the test data.
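Supplying size-related parameters when initializing the tree limits its depth and reduces the overfitting; the values below are illustrative, not the ones used in the original exercise.

from sklearn.tree import DecisionTreeRegressor

# Illustrative values; in practice they would be chosen by cross-validation.
# X_train, y_train, X_test, y_test are the splits from the sketch above.
tuned = DecisionTreeRegressor(max_depth=8, min_samples_split=50,
                              min_samples_leaf=20, random_state=0)
tuned.fit(X_train, y_train)
print(tuned.score(X_train, y_train), tuned.score(X_test, y_test))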
Advantages of Decision Trees:
Decision Trees provide high accuracy and require very little preprocessing of data, such as outlier capping, missing value treatment, or variable transformation.
They also work when the relationship between the dependent and independent variables is not linear, i.e. they can capture non-linear relationships.
Decision Trees require less data than a regression model and are easy to understand, as well as being very intuitive.
Tree-based models can be visualized very easily, with clear-cut demarcation, allowing people with no background in statistics to understand the process easily.
Decision Trees can also be used for data cleaning, data exploration, and variable selection and creation.
Decision Trees can work with high-dimensional data having both continuous and categorical variables.
Feature interactions are built into decision trees.
Decision Trees are highly interpretable.