
List of Figures
Fig 1 Decision tree
Fig 2 Partitioning data at different levels of depth
Contents
6. Decision trees
6.1 How to build a decision tree?
6.1.1 Entropy
6.1.2 Information Gain
6.1.3 Gini impurity
6.1.4 MSE
6.2 Parameters Related to the Size of the Tree
6.2.1 Minimum Split Size
6.2.2 Minimum Leaf Size
6.3 Decision tree implementation in Python
6.3.1 In case of classification
6.3.2 In case of regression
6.4 When to stop growing?
6.5 Advantages of decision trees
6. Decision trees

A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences. It is also called an instance-based algorithm because a decision is taken at each instance; equivalently, it can be viewed as a set of nested if-else conditions.
A decision tree is a non-linear model built from several linear, axis-parallel planes: where logistic regression uses a single separating plane, a decision tree uses multiple planes.
Suppose we want to classify whether a person is a defaulter in paying tax or not on the basis of three features: employed, gender and location. The decision tree is built as follows:

Depth 0 (root node): N = 1000, class 1 (non-defaulter) = 600, class 0 (defaulter) = 400
Split on Employed == Y
    Depth 1, Employed = Yes: N = 700, class 1 = 550, class 0 = 150
    Split on Gender == M
        Depth 2, Gender = M: N = 300, class 1 = 230 (77%), class 0 = 70 (23%)  ->  predicted non-defaulter
        Depth 2, Gender = F: N = 400, class 1 = 10, class 0 = 390  ->  predicted defaulter
    Depth 1, Employed = No: N = 300, class 1 = 50, class 0 = 250
    Split on Location == N
        Depth 2, Location = N: N = 100, class 1 = 25 (25%), class 0 = 75 (75%)  ->  predicted defaulter
        Depth 2, Location != N: N = 200, class 1 = 150, class 0 = 50  ->  predicted non-defaulter
The decision tree can be represented as:

Fig 1 Decision tree


Geometrically, a decision tree is a set of axis-parallel planes that partition the feature space into hypercuboids (hypercubes when the sides are equal).
Like other machine learning models, decision trees are of two types: Regression Trees and Classification Trees. Some terms related to decision trees are:
 Decision Node: a node that is split into further sub-nodes.
 Root Node: the top-most node of the tree.
 Leaf Node: a terminal node of the tree, which is not split further.
 Splitting: the process of dividing a node into sub-nodes. The most common approach is top-down, where at each node we search for the best split.
 Pruning: the reverse of splitting, in which sub-nodes of a decision node are removed.

6.1 How to build a decision tree?

For classification problems we use measures such as Entropy, Information Gain and Gini impurity to decide which feature should split the data at each node, whereas for regression trees we use the mean squared error (MSE).
The most common criteria/metrics are:
 Entropy
 Gini index
 Chi-Square (for Classification Trees)
 Reduction in Variance (for Regression Trees)
 ANOVA

6.1.1 Entropy: It is defined as:

H = -p log2(p) - q log2(q)

where p is the probability that the outcome variable (defaulter) is classified in class 1 and q is the probability that it is classified in class 0.
Suppose we have data on 1000 employees; a sample of the rows looks like this:

Defaulter  Employed  Area  Gender
1          1         N     M
0          0         S     F
0          0         S     M
1          0         S     F
0          1         N     F
1          0         N     M
0          0         N     M
1          1         S     M
1          0         S     F
0          1         S     M
0          0         N     F

We want to classify whether a person is a defaulter or not on the basis of Employed, Area and Gender.

Steps in computing entropy:

From the tree above, at the root node the number of people in class 1 = 600 and the number of people in class 0 = 400, so
P(class 1) = 600/1000 = 0.6
P(class 0) = 400/1000 = 0.4
H(root) = -0.6*log2(0.6) - 0.4*log2(0.4) ≈ 0.97
Similarly, we evaluate each candidate split feature (Employed, Area and Gender): for each one we compute the entropies of the child nodes it would produce and the resulting weighted-average entropy. The feature whose split reduces the entropy of the root node the most (equivalently, the one with the highest information gain, defined below) is selected. In this example that feature is Employed, so it is chosen as the root node.

From the example above we see that entropy approaches 1 when the two classes are equally probable and decreases towards 0 as one class dominates (a pure node has entropy 0). Entropy therefore measures the amount of randomness (impurity) in the data at a node, and the split we choose is the one that reduces this randomness the most.
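A minimal sketch (not part of the original text) of the entropy calculation in Python, assuming the class labels of a node are available as a NumPy array:

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Root node of the running example: 600 non-defaulters (1) and 400 defaulters (0)
root = np.array([1] * 600 + [0] * 400)
print(entropy(root))   # ~0.97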

6.1.2 Information Gain: It is defined as

IG = Entropy(parent) - Weighted average of entropy of child nodes

Information gain is, in a sense, the opposite of entropy: as the name suggests, it measures how much homogeneity is gained in the data if a given feature is used to split the node. The higher the information gain, the better the feature is for splitting the decision tree. It is computed as follows:

Root node: N = 1000, class 1 = 600, class 0 = 400
Split on Employed == Y
    Employed = Yes: N = 700, class 1 = 550, class 0 = 150
    Employed = No:  N = 300, class 1 = 50,  class 0 = 250

H(parent) = H(root) ≈ 0.97
H(Employed = Y) = -550/700*log2(550/700) - 150/700*log2(150/700) ≈ 0.75
H(Employed = N) = -50/300*log2(50/300) - 250/300*log2(250/300) ≈ 0.65
Information gain(Employed) = 0.97 - (700/1000*0.75 + 300/1000*0.65) = 0.97 - 0.72 = 0.25

Similarly, we compute the information gain for the other features and split on the feature with the highest gain.
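A short sketch (again assumed, building on the entropy helper and the root array defined above) of the information-gain calculation, using the split on Employed from the running example:

def information_gain(parent_labels, child_label_groups):
    # IG = entropy(parent) - weighted average entropy of the child nodes
    n = len(parent_labels)
    weighted_child = sum(len(c) / n * entropy(c) for c in child_label_groups)
    return entropy(parent_labels) - weighted_child

# Children produced by splitting on Employed == Y
yes_child = np.array([1] * 550 + [0] * 150)
no_child = np.array([1] * 50 + [0] * 250)
print(information_gain(root, [yes_child, no_child]))   # ~0.25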

6.1.3 Gini impurity: It is defined as

Gini = 1 - Σ pi^2

where pi is the probability of allocating a unit to the i-th class.

For example, the Gini impurity at the root node of the example above is:
1 - ((0.6)^2 + (0.4)^2) = 1 - 0.52 = 0.48
Similarly, we compute the Gini impurity for the other candidate splits.
Gini impurity and entropy serve the same purpose; the only difference is that the Gini formula does not use a logarithm, which makes it faster to compute than entropy (lower computational complexity).
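A minimal assumed sketch of the Gini impurity calculation in Python (the array contents mirror the running example):

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

# Root node: 600 non-defaulters (1) and 400 defaulters (0)
print(gini_impurity(np.array([1] * 600 + [0] * 400)))   # 0.48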

Now suppose we have data on 1000 employees in which the Defaulter column is a numeric amount; a sample of the rows looks like this:

Defaulter  Employed  Area  Gender
15         1         N     M
25         0         S     F
65         0         S     M
85         0         S     F
45         1         N     F
54         0         N     M
12         0         N     M
94         1         S     M
100        0         S     F
25         1         S     M
20         0         N     F

6.1.4 MSE
If the target is the defaulter amount (the amount, in Rs, that a person fails to pay) and we want to predict it, we use MSE as the splitting criterion.
First we compute the mean squared error of the defaulter amount using the formula

MSE = Σ (yi - ȳ)^2 / n

where ȳ is the mean of the target. Suppose ȳ (the mean over the 1000 defaulter amounts) = 50. Then the MSE at the root node is:

((15 - ȳ)^2 + (25 - ȳ)^2 + (65 - ȳ)^2 + ... + (25 - ȳ)^2 + (20 - ȳ)^2) / 1000 = 196.258

Now, for the feature Employed we have 700 Yes and 300 No, so:
For Employed = Yes:
MSE = ((15 - ȳ_yes)^2 + (45 - ȳ_yes)^2 + (94 - ȳ_yes)^2 + ... + (25 - ȳ_yes)^2) / 700 = 102.68
For Employed = No:
MSE = ((25 - ȳ_no)^2 + (65 - ȳ_no)^2 + ... + (20 - ȳ_no)^2) / 300 = 78.54
The reduction in MSE from splitting on Employed is therefore:
196.258 - (700/1000 * 102.68 + 300/1000 * 78.54) = 196.258 - 95.44 ≈ 100.82
This value of about 100.82 represents the reduction in MSE obtained by using the feature Employed.
Similarly, we evaluate the other features (Area and Gender) and split on the feature that gives the maximum reduction in MSE of the defaulter amount at the root node.

Root node (target = Defaulter amount): N = 1000, MSE = 196.258
Split on Employed == Y
    Employed = Yes: N = 700, prediction = mean of Defaulter for Employed = Yes, MSE = 102.68
    Employed = No:  N = 300, prediction = mean of Defaulter for Employed = No,  MSE = 78.54
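A small assumed sketch of the MSE-reduction calculation for a candidate split, where y_parent, y_left and y_right are hypothetical NumPy arrays holding the defaulter amounts in the parent node and in the two child nodes:

import numpy as np

def mse(y):
    # mean squared error of a node when we predict its own mean
    y = np.asarray(y, dtype=float)
    return float(((y - y.mean()) ** 2).mean())

def mse_reduction(y_parent, y_left, y_right):
    # reduction in MSE obtained by splitting the parent into two children
    n = len(y_parent)
    weighted = len(y_left) / n * mse(y_left) + len(y_right) / n * mse(y_right)
    return mse(y_parent) - weighted

The feature (and threshold) chosen for the split is the one that maximizes this reduction.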

Now we know how to build a decision tree. However, we must take care while building it: as the depth increases, the chance of overfitting increases (the tree starts fitting noisy points in the data), while a very shallow tree (for example, depth 1) underfits. So an appropriate depth should be found using cross-validation.

           
K-fold cross-validation (K = 5) on the complete data of 1000 rows: the data is divided into 5 folds, and in each round one fold is held out for validation while the remaining folds are used for training.
In cross-validation, we evaluate the model on a held-out part (1/5 or 1/10) of the original data for the same hyperparameters, repeat this over all folds, and compute the overall accuracy simply by taking the average.
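As an illustration (assumed, not from the original code), an appropriate maximum depth can be chosen by comparing cross-validated scores for several candidate depths, where X and y are hypothetical placeholders for the features and the target:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for depth in [2, 4, 6, 8, 10]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
    # keep the depth with the best average validation score
    print(depth, scores.mean())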
6.2 Parameters Related to the Size of the Tree:

Maximum Depth: the length of the longest path from the root node to a leaf node. If the maximum depth is specified externally, the tree stops growing after that depth. Too small a value (like 2 or 3) leads to underfitting, while a high value (like 10 or 20) leads to overfitting. A tree with a large maximum depth is also hard to interpret.

Fig 2 Partitioning data at different levels of depth

6.2.1 Minimum Split Size

The minimum number of samples a node must contain before it can be split. A high value leads to underfitting and a low value to overfitting.

6.2.2 Minimum Leaf Size
The minimum number of samples allowed at a terminal (leaf) node. This is useful for imbalanced data. Suppose we have 100 data points of which 90 belong to one class and 10 to another; a case may then arise where one leaf node ends up with the 90 points and another leaf with the 10 points, and predictions made by such a model can be unreliable. A suitable minimum leaf size prevents such tiny leaves; a configuration sketch is shown below.
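A brief assumed sketch showing how these size-related parameters are passed to scikit-learn's decision tree (the values are only illustrative):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,           # maximum depth of the tree
    min_samples_split=10,  # minimum number of samples required to split a node
    min_samples_leaf=5,    # minimum number of samples required at a leaf node
)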

6.3 Decision tree implementation in Python


6.3.1 In case of classification

The dataset has the following columns:

Pregnancies = Number of pregnancies the person has had
Glucose = Glucose level in the body
BloodPressure = Blood pressure of the person
SkinThickness = Thickness of the skin
Insulin = Insulin level in the body
BMI = Body mass index of the person
DiabetesPedigreeFunction = A function that scores the likelihood of diabetes based on family history
Age = Age of the person
Outcome = Whether the person is diabetic or not

Outcome is the dependent variable (whether a person is diabetic or not) and the rest are independent variables.

The outcome variable is imbalanced.

Input: sns.heatmap(data_num.corr())
Output:
From the correlation heatmap it is clear that there is no or low multicollinearity.

First, we balance the dataset:

from imblearn.over_sampling import SMOTE

# oversample the minority class so that both classes are equally represented
over_sampler = SMOTE(k_neighbors=4)
train_X, train_Y = over_sampler.fit_resample(train_x, train_y)

Output:
1    343
0    343
Name: Outcome, dtype: int64

Since outliers have been removed, there are no missing values and no multicollinearity, we can now train the model:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(train_X, train_Y)

Model performance (AUC) before cross-validation and hyperparameter tuning

For the training data:

import pandas as pd
from sklearn.metrics import roc_auc_score

train_predicted_probabilities = pd.DataFrame(clf.predict_proba(train_X))[1]
roc_auc_score(train_Y, train_predicted_probabilities)

Output : 1.0

For test data

test_predicted_probabilities = pd.DataFrame(clf.predict_proba(test_X))[1]
roc_auc_score(test_Y, test_predicted_probabilities)

Output: The AUC for the model built on the Test Data is : 0.663152005508693

On the training data the AUC is 1.0, while on the test data it is only about 0.66, which indicates overfitting.

After K-fold cross-validation

from sklearn import model_selection
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

from sklearn import tree


clf = tree.DecisionTreeClassifier()
kfold = KFold(n_splits=4)
scoring = 'roc_auc'
results_k10 = cross_val_score(clf, train_X, train_Y, cv=kfold,
                              scoring=scoring)

# cross_val_score fits clones of the estimator, so fit clf itself before scoring
clf = clf.fit(train_X, train_Y)

train_auc = metrics.roc_auc_score(train_Y, clf.predict(train_X))
test_auc = metrics.roc_auc_score(test_Y, clf.predict(test_X))

print("The AUC for the model built on the Train Data is : ", train_auc)
print("The AUC for the model built on the Test Data is : ", test_auc)

Output:
The AUC for the model built on the Train Data is : 1.0
The AUC for the model built on the Test Data is : 0.70

The train AUC is still 1.0 and the test AUC is about 0.70, so the model continues to overfit.

After hyperparameter tuning:

from sklearn.model_selection import GridSearchCV


params = {'max_features': ['auto', 'sqrt'],
'criterion':['gini','entropy'],
'min_samples_split': [2,5,7],
'min_samples_leaf':[4,5,6,7],
'max_depth':[5,6,7]}

from sklearn import tree


clf = tree.DecisionTreeClassifier()
DT = GridSearchCV(clf, param_grid=params, cv = 5, scoring='roc_auc')
DT.fit(train_X,train_Y)
# Fitting a decision tree with the best parameters found by the grid search
DTC_F = tree.DecisionTreeClassifier(criterion='gini', max_depth=7,
                                    min_samples_split=7, min_samples_leaf=7,
                                    max_features='auto')
DTC_F.fit(train_X, train_Y)

train_predicted_probabilities = pd.DataFrame(DTC_F.predict_proba(train_X))[1]
roc_auc_score(train_Y, train_predicted_probabilities)

test_predicted_probabilities = pd.DataFrame(DTC_F.predict_proba(test_X))[1]
roc_auc_score(test_Y, test_predicted_probabilities)

Output:

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 6, 7],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [4, 5, 6, 7],
                         'min_samples_split': [2, 5, 7]},
             scoring='roc_auc')

Train output:
0.9377810266130608
Test output:
0.7364565529144444

After hyperparameter tuning, the test AUC has increased from about 0.66 to about 0.74, while the train AUC has dropped to about 0.94. Since there is still a sizeable gap between the train and test scores, the model still has high variance.
6.3.2 In case of regression
The training data has 8523 rows and the following columns:

Item_Identifier: product ID

Item_Weight: Weight of the product in Kg

Item_Fat_Content: product is low fat or no fat

Item_Visibility: The % of total display area of the products in a store

Item_Type: The category to which the product belongs

Item_MRP: Maximum Retail Price (list price) of the product

Outlet_Identifier: Unique store ID

Outlet_Establishment_Year: The year in which store was established

Outlet_Size: The size of the store in terms of ground area covered

Outlet_Location_Type: The type of city in which the store is located

Outlet_Type: Whether the outlet is just a grocery store or supermarket

Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.

Item_Outlet_Sales is the dependent variable; the rest are independent variables.

After all the preprocessing steps, we train a decision tree model without hyperparameter tuning:

from sklearn.tree import DecisionTreeRegressor


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

reg_decision_model = DecisionTreeRegressor()
reg_decision_model.fit(X_train, y_train)
reg_decision_model.score(X_train, y_train)

Output: 1.0

Model score on the test data:

reg_decision_model.score(X_test, y_test)

Output: 0.562696

We got a 100% R² score on the training data.

On the test data we got only about a 56% score because we have not tuned any parameters and simply initialized the tree with default values. The tree has therefore expanded to its full depth and memorized the training data, which is why it achieves a 100% score on the training data (a highly overfitted model).
Now we tune the parameters to get rid of this overfitting.

A simple model without any hyperparameter tuning but with K-fold cross-validation:

from sklearn import tree
from sklearn.model_selection import cross_val_score

clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
clf.predict(X_train)[1]

resultsDTC = cross_val_score(clf, X_train, y_train, cv=5,
                             scoring='neg_mean_squared_error')
resultsDTC.mean()

Output: -3.176499710061905

MSE after cross-validation:

import numpy as np
from sklearn import metrics

# predictions of the fitted tree on the test data
prediction = clf.predict(X_test)

print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

MAE: 0.8283144309737974
MSE: 3.4923738756355105
RMSE: 1.868789414470103

After hyperparameter tuning.

# Hyperparameter range initialization for tuning

parameters={"splitter":["best","random"],
"max_depth" : [1,3,5,7,9],
"min_samples_leaf":[4,5,6,7,8],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None,40,50,60] }

tuning_model = GridSearchCV(reg_decision_model, param_grid=parameters,
                            scoring='neg_mean_squared_error', cv=3, verbose=3)
# fit the grid search to find the best parameters
tuning_model.fit(X_train, y_train)

Fitting a tree with the best parameters found:

tuned_hyper_model = DecisionTreeRegressor(max_depth=5, max_features='sqrt',
                                          max_leaf_nodes=60, min_samples_leaf=4,
                                          min_weight_fraction_leaf=0.1,
                                          splitter='best')
tuned_hyper_model.fit(X_train, y_train)
tuned_pred = tuned_hyper_model.predict(X_test)

Error after hyperparameter tuning:

# With hyperparameter tuned

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test,tuned_pred))
print('MSE:', metrics.mean_squared_error(y_test, tuned_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, tuned_pred)))

MAE: 0.9994744906564477
MSE: 2.344128598627275
RMSE: 1.53105473404032

After tuning, the MSE on the test data decreased from about 3.49 to about 2.34 (and the RMSE from about 1.87 to 1.53).

6.4 When to stop growing?

 Stop growing a node once it is pure (all its samples belong to one class).
 Stop growing when too few points are left at a node.
 Stop growing when the tree becomes too deep.

6.5 Advantages of decision trees:

 Decision Trees provide high accuracy and require very little preprocessing of the data, such as outlier capping, missing-value treatment or variable transformation.
 They work well for non-linear relationships.
 Tree-based models can be visualized very easily with clear-cut demarcations, allowing people with no background in statistics to understand the process.
 Decision Trees are helpful in data cleaning, data exploration, and variable selection and creation.
 Decision Trees can work with high-dimensional data having both continuous and categorical variables.
 Feature interactions are built into decision trees.
 Decision Trees are highly interpretable.

In this module we have learnt about decision trees, but our model still shows a large gap between train and test performance. In the upcoming modules we will look at more powerful algorithms that aggregate decision trees, i.e. Bagging and Boosting.
