Decision Trees

Decision trees are a non-linear predictive modelling technique that uses a tree-like graph to predict outcomes. An algorithm recursively partitions the dataset into smaller and smaller subsets based on the values of the input variables. Key terms related to decision trees include decision nodes, leaf nodes, splitting, and pruning. To build a decision tree, metrics such as entropy, information gain, Gini impurity, and mean squared error are used to determine the best variable to split on at each step. Regression trees are used for continuous target variables, while classification trees are used for categorical targets.


Decision trees

A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences. It is also called an instance-based algorithm, because a decision is taken at each instance; equivalently, it can be viewed as a set of nested if-else conditions.
A decision tree is a non-linear model built from several linear, axis-parallel planes. A single plane can be thought of as a logistic regression, and combining multiple such planes gives something like a decision tree.

Geometrically, a decision tree can be represented as axis-parallel planes that tessellate the feature space into hypercuboids or hypercubes.

Decision trees can be of two types: regression trees and classification trees. Regression trees are used when the dependent variable is continuous, while classification trees are used when the dependent variable is categorical.
Some terms related to decision trees are:
 Decision Node: Any node that gets split is known as the decision node.
 Root Node: This is the top-most decision node and represents the entire sample
which further gets divided.
 Leaf Node: Also known as the Terminal node, this is a node that does not split further.
 Splitting: The process of dividing a node into sub-nodes. Most commonly, a top-down approach is taken, splitting each node on the feature that gives the best split.
 Pruning: It is the opposite of splitting. Here we remove sub-nodes of a decision node.

How to build a decision tree?

For classification problems we use measures such as entropy, information gain, and the Gini index to decide which feature to split the data on, whereas for regression trees we use the mean squared error (MSE) to decide the split.

The most common splitting criteria/metrics are:

 Gini Index (Classification Trees)
 Entropy (Classification Trees)
 Chi-Square (Classification Trees)
 Reduction in Variance (Regression Trees)
 ANOVA

1. Entropy: It is defined as

H = -p log2(p) - q log2(q)

where p is the probability of the outcome variable being in class 1 and q is the probability of it being in class 0.

Entropy measures uncertainty, or randomness; in other words, it is a measure of unpredictability.

"Does that mean a feature with higher entropy has more random values than one with lower entropy?" Yes. Let's take the example of a coin toss.

Suppose we tossed a coin 4 times and the outcomes came out as {Head, Tail, Tail, Head}. Based solely on this observation, if you had to guess the output of the next coin toss, what would be your guess?

Two heads and two tails: a fifty percent probability of a head and a fifty percent probability of a tail. You cannot be sure; the outcome is effectively random between head and tail.

But what if we have a biased coin which, when tossed four times, gives the following output: {Tail, Tail, Tail, Head}? Here, if you have to guess the output of the coin toss, what would be your guess? Chances are you will go with Tail. Why? Because, based on the sample we have, there is a seventy-five percent chance that the output is a tail. In other words, the result is less random for the biased coin than for the fair coin.

Hence, we would say that an unbiased coin has high entropy, because the uncertainty about the outcome is the highest possible.
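As a minimal sketch (my own illustration, not from these notes), the binary entropy formula above can be computed in Python; the function name binary_entropy and the coin-toss probabilities are purely illustrative:

import math

def binary_entropy(p):
    # H = -p*log2(p) - q*log2(q) for a two-class outcome, where q = 1 - p.
    q = 1 - p
    # By convention, 0 * log2(0) is treated as 0.
    terms = [x * math.log2(x) for x in (p, q) if x > 0]
    return -sum(terms)

# Fair coin {Head, Tail, Tail, Head}: p = 0.5, entropy is maximal (1 bit).
print(binary_entropy(0.5))   # 1.0
# Biased coin {Tail, Tail, Tail, Head}: p = 0.25, entropy is lower.
print(binary_entropy(0.25))  # ~0.811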
Suppose we have data on 1,000 employees, a sample of which looks like this:

Defaulter  Employed  Area  Gender
1          1         N     M
0          0         S     F
0          0         S     M
1          0         S     F
0          1         N     F
1          0         N     M
0          0         N     M
1          1         S     M
1          0         S     F
0          1         S     M
0          0         N     F

We want to classify an employee as defaulter or not on the basis of Employed, Area, and Gender.

The steps in computing the entropy are as follows.

Suppose, using the formula above, Entropy(Employed) is 0.97, Entropy(Area) is 0.43, and Entropy(Gender) is 0.58. Since Employed has the highest entropy, it is chosen as the root node:

Split on Employed:
  Employed = Yes: N = 700 (1: 550, 0: 150)
  Employed = No:  N = 300 (1: 50,  0: 250)

Now we compute Entropy(Gender | Employed = Yes) = 0.63 and Entropy(Gender | Employed = No) = 0.38, and similarly Entropy(Area | Employed = Yes) = 0.23 and Entropy(Area | Employed = No) = 0.46. Since Entropy(Gender | Employed = Yes) = 0.63 is greater than Entropy(Area | Employed = Yes) = 0.23, Gender is chosen as the next node under Employed = Yes, and by the same rule Area (location) is chosen under Employed = No (0.46 > 0.38):

Employed = Yes (N = 700; 1: 550, 0: 150), split on Gender:
  Gender = M: N = 300 (1: 230, i.e. 77%; 0: 70, i.e. 23%)
  Gender = F: N = 400 (1: 10, i.e. 3%; 0: 390, i.e. 98%)

Employed = No (N = 300; 1: 50, 0: 250), split on Area (Location):
  Location = N:  N = 100 (1: 25, 0: 75)
  Location != N: N = 200 (1: 150, 0: 50)

This tells us that if both categories are equally probable the entropy is 1, and if they are unequally probable the entropy decreases. Generally, the feature with the higher entropy is selected for splitting; entropy measures the amount of randomness in the data.

2. Information Gain: It is defined as

IG = Entropy(parent) – weighted average of the entropies of the child nodes

Information gain is, in a sense, the opposite of entropy: it tells us how much splitting on a feature reduces the uncertainty in the target. The higher the IG, the better the split.
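As an illustrative sketch (the helper names entropy and information_gain are mine, and the toy data only echoes the employee example above), information gain can be computed in Python as:

import math
from collections import Counter

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class proportions in `labels`.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # IG = Entropy(parent) - weighted average entropy of the child nodes.
    n = len(labels)
    weighted_children = 0.0
    for value in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == value]
        weighted_children += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted_children

# Toy data in the spirit of the employee table: target = Defaulter, feature = Employed.
defaulter = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0]
employed  = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]
print(information_gain(defaulter, employed))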

3. Gini Impurity: This is an impurity-based criterion used in CART, which performs only binary splits.
It is defined as

Gini = 1 - Σ pi²

where pi is the proportion of items in the node that belong to class i. It reaches its minimum (zero) when all cases in the node fall into a single target category.
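As a quick sketch (the function name is mine, not from these notes), the Gini impurity of a node can be computed as:

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions in the node.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity([1, 1, 0, 0]))  # 0.5, a maximally mixed two-class node
print(gini_impurity([1, 1, 1, 1]))  # 0.0, a pure node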
For example, suppose that out of a total of 50 students, half were selected for basketball coaching.

To calculate the Gini index, we first find the proportions of these values.

We can then find the weighted Gini for split X1 (Grade) by calculating the Gini for the sub-nodes 'A or B' and 'C or D'. We can similarly find the Gini for splitting X2, which here is 'Class'.

For X3, Height, which is a continuous variable, the software in the backend splits the variable into height > x and height ≤ x, computes the Gini value for each candidate x, and uses the value of x that gives the minimum Gini as the split point for this continuous variable.

In our example, we find that x = 1.75 provides the minimum value compared with other heights such as 1.80 or 1.60, and the Gini for splitting on Height comes out to be 0.23. Comparing the Gini values, the minimum Gini is provided by Height (0.23), so this variable becomes the root node and splits the data into two more homogeneous, 'purer' sets. The second-lowest Gini is found for 'Grade' and the highest for 'Class'. Thus the data is first split by Height, as it plays the biggest role in deciding whether a student will be selected for the basketball coaching or not.
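A rough sketch of this threshold search (my own illustration with hypothetical heights and outcomes, not the actual 50-student data):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    # Try a split at every observed value and keep the one with the lowest weighted Gini.
    n = len(labels)
    best_x, best_gini = None, float("inf")
    for x in sorted(set(values)):
        left  = [y for v, y in zip(values, labels) if v <= x]
        right = [y for v, y in zip(values, labels) if v > x]
        if not left or not right:
            continue
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if weighted < best_gini:
            best_x, best_gini = x, weighted
    return best_x, best_gini

# Hypothetical heights and selection outcomes (1 = selected for coaching).
heights  = [1.60, 1.62, 1.70, 1.75, 1.78, 1.80, 1.85, 1.90]
selected = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_threshold(heights, selected))  # (1.75, 0.0) for this toy data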

Gini impurity and entropy serve the same purpose. The main difference is that the Gini impurity formula does not use a logarithm, which makes it faster to compute than entropy (lower computational complexity).

4. Mean squared error (MSE): It is defined as

MSE = Σ(x − x̄)² ÷ n

where x̄ is the mean of the values in the node. However, when the variable is categorical the calculation can get a bit more involved.
We build the tree top-down, from the root to the leaves, and we want similar data points to end up in the same leaf, so that the standard deviation (our measure of homogeneity) within each leaf is small.

For example, let's calculate the mean and variance for the root node. Here we have 25 ones (y = 1) and 25 zeros (y = 0). The mean is ((25 × 1) + (25 × 0)) ÷ 50 = 0.5. We can now apply the variance formula with x̄ = 0.5, where x is 1 for 25 entries and 0 for the other 25. Written out in full, the variance is

((1 − 0.5)² + … + (1 − 0.5)² + (0 − 0.5)² + … + (0 − 0.5)²) ÷ 50

which can be written compactly as

(25 × (1 − 0.5)² + 25 × (0 − 0.5)²) ÷ 50 = 0.25
We use the above formula to find the mean and variance for each variable, for example the variance for the split on X1 (Grade), and in the same way the variance for the split on Class.

We find that the results are similar to those obtained with entropy, with 'Grade' being the less important categorical variable because its variance (0.24) is higher than that of 'Class' (0.21). The variance for a numerical variable follows similar steps, and the variable that provides the minimum variance is chosen for the split.
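A brief sketch (my own, with hypothetical data) of choosing a regression split by minimising the weighted variance of the child nodes:

def variance(values):
    # Variance = sum((x - mean)^2) / n, as in the formula above.
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def weighted_split_variance(values, groups):
    # Weighted variance of the child nodes created by a categorical split.
    n = len(values)
    total = 0.0
    for g in set(groups):
        child = [v for v, grp in zip(values, groups) if grp == g]
        total += (len(child) / n) * variance(child)
    return total

# Hypothetical target (selected = 1/0) and a categorical feature to split on.
y = [1, 1, 1, 0, 0, 0, 1, 0]
grade = ["A", "A", "B", "C", "C", "D", "B", "D"]
print(variance(y))                        # variance at the parent node (0.25)
print(weighted_split_variance(y, grade))  # lower is better for the split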

Now we know how to build a decision tree. However, we must take care while building one: as the depth increases, the chance of overfitting increases (the tree starts fitting noisy points in the data), while if the depth is too small, say 1, the decision tree underfits. The appropriate depth should therefore be found using cross-validation.

Parameters Related to the Size of the Tree:

Maximum Depth
Among the easiest parameters to understand, maximum depth is an important parameter for tuning trees and tree-based models. It is the length of the longest path from the root node to a leaf node, i.e. the number of layers of splits. We often provide a depth limit to the decision tree model to restrict further splitting of nodes; when the specified depth is reached, the tree stops splitting.

Consider an example with two features, one on the x-axis and the other on the y-axis. The tree finds that the first best split can be made on the x-axis variable, choosing the value that gives the best result (the minimum entropy or Gini, depending on which criterion the decision tree is using). For max depth = 1, it splits the data once and classifies all points on the left as class 0 (the blue background), while the actual class-0 points are the blue dots. This depth makes the model too general and underfits. However, if the tree keeps going and recursively partitions the data, it finds further splits, creates more and more branches, and keeps dividing the space. By the time it reaches the maximum depth, all the leaves are pure; this leads to overfitting, as the decision boundary becomes very complex and yields poor results on test data or unseen new data. Thus we can set the maximum depth at, say, 5 to obtain a moderately complex yet accurate decision tree model.
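As a hedged sketch using scikit-learn with synthetic two-feature data (standing in for the example above, not reproducing it), one way to compare training accuracy and cross-validated accuracy across values of max_depth is:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for the example above.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

for depth in (1, 3, 5, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    train_acc = tree.fit(X, y).score(X, y)
    # A large gap between training and cross-validated accuracy suggests overfitting.
    print(f"max_depth={depth}: train={train_acc:.2f}, cv={cv_acc:.2f}")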
Minimum Split Size
This parameter defines the minimum number of observations a node must have before it is allowed to split further, which helps control overfitting and underfitting. A very low value of the minimum split size leads to overfitting, because nodes with only a few observations are still considered for splitting; an extremely large value leads to underfitting, because nodes whose splits could provide useful insight are never split.

Minimum Leaf Size


Here we provide a value that sets the minimum number of observations a terminal node must have. It is useful when splits would otherwise be very imbalanced: for example, if we set the minimum split size to 99 and have a node with 100 observations, it can be split, but the split may leave 99 observations in one child node and a single observation in the other. Just as in ANOVA and other basic statistical concepts each class of a variable should have a minimum number of observations to be considered in a statistical test, here we define a minimum leaf size, which is generally around 10% of the minimum split size.

Maximum Impurity
In the maximum-depth example we saw how a small depth caused red points to be classified as blue. With this parameter we can limit how many misclassified points are allowed to remain in a region before splitting stops.
PYTHON CODE:

In case of classification
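A minimal, generic sketch of a classification tree in scikit-learn (the dataset and parameter values are illustrative, not from these notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Any feature matrix X and class labels y would do; iris is used only as a stand-in.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(criterion="gini",      # or "entropy"
                             max_depth=5,           # size-related parameters discussed above
                             min_samples_split=20,
                             min_samples_leaf=2,
                             random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split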

In case of regression
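Similarly, a minimal sketch of a regression tree (synthetic data and parameter values are illustrative, not from these notes):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data as a stand-in for a real data set.
X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

reg = DecisionTreeRegressor(criterion="squared_error",  # MSE-based splitting ("mse" in older sklearn versions)
                            max_depth=6,
                            min_samples_leaf=10,
                            random_state=42)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 on the held-out split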
The data set contains the following variables:

# We have train (8523 rows) and test (5681 rows) data sets; the train data set has both the input and output variable(s). We need to predict the sales for the test data set.

# Item_Identifier: Unique product ID

# Item_Weight: Weight of product

# Item_Fat_Content: Whether the product is low fat or not

# Item_Visibility: The % of total display area of all products in a store allocated to the
particular product

# Item_Type: The category to which the product belongs

# Item_MRP: Maximum Retail Price (list price) of the product

# Outlet_Identifier: Unique store ID

# Outlet_Establishment_Year: The year in which store was established

# Outlet_Size: The size of the store in terms of ground area covered

# Outlet_Location_Type: The type of city in which the store is located

# Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket

# Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.

Item_Outlet_Sales is the dependent variable; the rest are independent variables.

After all the preprocessing steps, we train a decision tree model without hyperparameter tuning, as sketched below.
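As an illustrative sketch, with synthetic data standing in for the preprocessed sales data (the real columns and preprocessing are not reproduced), an untuned tree behaves as described below:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed train data.
X, y = make_regression(n_samples=8523, n_features=10, noise=25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No tuning parameters supplied: the tree keeps splitting until the leaves are pure.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train))  # ~1.0 on the training data (overfitting)
print(tree.score(X_test, y_test))    # a much lower R^2 on held-out data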

Model score
We got a 100% score on the training data.

On the test data we got only a 5.7% score, because we did not provide any tuning parameters while initialising the tree. As a result, the algorithm split the training data all the way down to the leaf nodes, the depth of the tree increased, and the model overfitted.

That is why we get a high score on the training data and a low score on the test data.

To solve this problem we use hyperparameter tuning.

We can use GridSearch or RandomizedSearch for hyperparameter tuning.

A simple model without any hyperparameter tuning, but with K-fold cross-validation:
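As an illustrative sketch (synthetic stand-in data again), the cross-validated error of an untuned tree can be estimated like this:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=25, random_state=42)

scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print(-scores.mean())  # average MSE across the 5 folds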

Error after cross validation

After hyperparameter tuning.
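A hedged sketch of hyperparameter tuning with GridSearchCV (the parameter grid and synthetic data are illustrative only):

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=25, random_state=42)

param_grid = {
    "max_depth": [3, 5, 8, 12],
    "min_samples_split": [10, 50, 100],
    "min_samples_leaf": [5, 20, 50],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                      param_grid,
                      scoring="neg_mean_squared_error",
                      cv=5)
search.fit(X, y)
print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE for the best parameter combination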


Error after hyperparameter tuning

It was observed that after tuning, the MSE decreased.

When to stop growing?

 When a node becomes pure, stop growing.
 When too few points are left in a node, stop growing.
 Stop if the tree becomes too deep.

Advantages of decision trees:

 Decision trees provide high accuracy and require very little preprocessing of the data, such as outlier capping, missing-value treatment, or variable transformation.
 They also work when the relationship between the dependent and independent variables is not linear, i.e. they can capture non-linear relationships.
 Decision trees require less data than a regression model and are easy to understand and quite intuitive.
 Tree-based models can be visualised very easily, with clear-cut demarcation, allowing people with no background in statistics to understand the process.
 Decision trees can also be used for data cleaning, data exploration, and variable selection and creation.
 Decision trees can work with high-dimensional data having both continuous and categorical variables.
 Feature interactions are built into decision trees.
 Decision trees are highly interpretable.
