Decision Trees
Contents
6. Decision trees
6.1 How to build a decision tree?
6.1.1 Entropy
6.1.2 Information Gain
6.1.3 Gini impurity
6.1.4 MSE
6.2 Parameters Related to the Size of the Tree
6.2.1 Minimum Split Size
6.3 Decision tree implementation in Python
6.3.1 In case of classification
6.3.2 In case of regression
6.4 When to stop growing?
6.5 Advantages of decision trees
6. Decision trees
A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences. At each node a decision is taken on one feature, so a decision tree can also be seen as a set of nested if-else conditions.
A decision tree is a non-linear model built from several linear, axis-parallel splitting planes. In logistic regression we have a single separating plane, but in a decision tree we have multiple planes.
Suppose we want to classify whether a person defaults on paying tax or not on the basis of three features: employed, gender, and location. Then the decision tree is built as follows:
Depth 0 (root node): N = 1000, class 1 (no defaulter) = 600, class 0 (defaulter) = 400
  Split: Employed == Y
    Depth 1, Employed = Yes: N = 700, 1: 550, 0: 150
      Split: Gender == M
        Depth 2, Gender = M:  N = 300, 1: 230 (77%), 0: 70 (23%)   -> no defaulter
        Depth 2, Gender = F:  N = 400, 1: 10,        0: 390        -> defaulter
    Depth 1, Employed = No:  N = 300, 1: 50,  0: 250
      Split: Location == N
        Depth 2, Location = N:  N = 100, 1: 25 (25%), 0: 75 (75%)  -> defaulter
        Depth 2, Location != N: N = 200, 1: 150,      0: 50        -> no defaulter
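Viewed as the nested if-else conditions described above, the same tree can be sketched directly in Python. This is purely an illustration of the structure of the figure; the function name and argument encoding are our own.

def predict_defaulter(employed, gender, location):
    """Nested if-else version of the example tree (1 = no defaulter, 0 = defaulter)."""
    if employed == "Y":            # depth 1 split: Employed == Y
        if gender == "M":          # depth 2 split: Gender == M (77% class 1)
            return 1               # predict no defaulter
        return 0                   # Gender == F is mostly class 0: defaulter
    if location == "N":            # depth 2 split on the Employed == N branch
        return 0                   # 75% class 0: defaulter
    return 1                       # Location != N is mostly class 1: no defaulter

# An employed male is predicted to be a non-defaulter:
print(predict_defaulter("Y", "M", "S"))   # -> 1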
A sample of the data used to build the tree can be represented as:

Defaulter  Employed  Location  Gender
1          1         N         M
0          0         S         F
0          0         S         M
1          0         S         F
0          1         N         F
1          0         N         M
0          0         N         M
1          1         S         M
1          0         S         F
0          1         S         M
0          0         N         F
For a node with class probabilities p_i, entropy is defined as H = -Σ p_i · log2(p_i). Entropy measures the amount of randomness (impurity) in the data: it approaches 1 when both categories are equally probable and decreases as the class distribution becomes more unbalanced. The split is chosen so that the entropy of the resulting child nodes is as low as possible, i.e. so that the reduction in entropy is as large as possible.

6.1.2 Information Gain
Information gain (IG) is the reduction in entropy achieved by a split. As the name suggests, it tells us how much homogeneity is gained in the data if a feature is used as the splitting node. The higher the IG, the better the feature is for splitting the decision tree. For a feature A it is computed as

IG(A) = H(parent) − Σ_v (N_v / N) · H(child_v),

i.e. the entropy of the parent node minus the weighted average entropy of its child nodes. For the example split on Employed:
Root node: N = 1000, 1: 600, 0: 400
  Split: Employed == Y
    Employed = Yes: N = 700, 1: 550, 0: 150
    Employed = No:  N = 300, 1: 50,  0: 250
H(root) = -(600/1000)·log2(600/1000) - (400/1000)·log2(400/1000) ≈ 0.97
H(Employed = Y) = -(550/700)·log2(550/700) - (150/700)·log2(150/700) ≈ 0.75
H(Employed = N) = -(50/300)·log2(50/300) - (250/300)·log2(250/300) ≈ 0.65
Information gain(Employed) = 0.97 − (700/1000 · 0.75 + 300/1000 · 0.65) = 0.97 − 0.72 ≈ 0.25
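This calculation can be verified with a few lines of Python. The sketch below uses only the class counts from the figure; the helper functions entropy and information_gain are our own, not from any library.

import math

def entropy(counts):
    # Entropy (base 2) of a node from its class counts, e.g. [600, 400].
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    # Entropy of the parent minus the weighted entropy of the child nodes.
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

print(entropy([600, 400]))                                    # ~0.97 (root)
print(entropy([550, 150]), entropy([50, 250]))                # ~0.75, ~0.65
print(information_gain([600, 400], [[550, 150], [50, 250]]))  # ~0.25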
6.1.4 MSE
If instead the defaulter amount (the amount in Rs that a person has failed to pay) is given and we want to predict it, the target is continuous and we use the mean squared error (MSE) as the splitting criterion. The sample data now looks like:

Defaulter (Rs)  Employed  Location  Gender
15              1         N         M
25              0         S         F
65              0         S         M
85              0         S         F
45              1         N         F
54              0         N         M
12              0         N         M
94              1         S         M
100             0         S         F
25              1         S         M
20              0         N         F
First we compute the mean squared error of the defaulter amount at the root node using the formula

MSE = Σ (y_i − ȳ)² / n,

where ȳ is the mean of the target values at the node. Suppose the mean of the 1000 defaulter amounts is ȳ = 50. Then

MSE(root) = [ (15 − 50)² + (25 − 50)² + (65 − 50)² + … + (25 − 50)² + (20 − 50)² ] / 1000 = 196.258
Now, for the feature Employed we have 700 Yes and 300 No, so:

For Employed = Yes:
MSE(Yes) = [ (15 − ȳ_Yes)² + (45 − ȳ_Yes)² + (94 − ȳ_Yes)² + … + (25 − ȳ_Yes)² ] / 700 = 102.68

For Employed = No:
MSE(No) = [ (25 − ȳ_No)² + (65 − ȳ_No)² + (85 − ȳ_No)² + … + (20 − ȳ_No)² ] / 300 = 78.54

where ȳ_Yes and ȳ_No are the mean defaulter amounts within each branch. The reduction in MSE for the feature Employed is then

196.258 − (700/1000 · 102.68 + 300/1000 · 78.54) = 196.258 − 95.44 ≈ 100.82

A reduction of about 100.82 means that splitting on Employed lowers the MSE of the root node by that amount. Similarly we compute the reduction for the other features, and the feature that reduces the MSE of the root node the most is chosen for splitting.
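The same MSE-reduction rule can be written as a short Python sketch. The helper names are ours, and the lists below are only the 11-row sample shown earlier rather than the full 1000-row data, so the resulting numbers will not match the worked example.

import numpy as np

def mse(values):
    # Mean squared deviation of the node's target values from the node mean.
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

def mse_reduction(parent, children):
    # MSE of the parent minus the weighted MSE of the child nodes.
    n = len(parent)
    return mse(parent) - sum(len(ch) / n * mse(ch) for ch in children)

# 11-row sample from the table above: defaulter amount and employed flag.
amounts  = [15, 25, 65, 85, 45, 54, 12, 94, 100, 25, 20]
employed = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

yes = [a for a, e in zip(amounts, employed) if e == 1]
no  = [a for a, e in zip(amounts, employed) if e == 0]
print(mse_reduction(amounts, [yes, no]))   # reduction in MSE for the 'employed' split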
Now we know how to build a decision tree. But we must take care while building it: as the depth increases, the chance of overfitting increases (the tree starts fitting noisy points in the data), while a very shallow tree (depth 1, say) underfits. So the appropriate depth should be found using cross-validation.
Figure: K-fold cross-validation — the complete data (1000 rows) is split into folds 1 to 5.
In cross-validation, for each hyperparameter setting we hold out one part (1/5 or 1/10) of the original data for testing, train on the remaining folds, repeat this for every fold, and compute the overall accuracy simply by taking the average across the folds.
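In scikit-learn this depth search amounts to something like the following sketch. The data here is a stand-in generated with make_classification; in practice X and y would come from the actual problem (for example the defaulter data).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with the real feature matrix X and target y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Score each candidate depth with 5-fold cross-validation and pick the best.
for depth in [1, 3, 5, 10, 20]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                             X, y, cv=5)
    print(depth, round(np.mean(scores), 3))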
6.2 Parameters Related to the Size of the Tree:
Maximum Depth: the length of the longest path from the root node to a leaf node of the tree. If a maximum depth is set externally, the tree stops growing once it reaches that depth. Too small a value (like 2 or 3) leads to underfitting, while a large value (like 10 or 20) leads to overfitting; a sketch illustrating this trade-off follows below. A tree with a very large maximum depth is also hard to interpret.
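As a rough illustration of this trade-off (again on stand-in data, not the data sets used later in this module), the sketch below compares the train and test scores of a shallow and a deep tree; a large gap between the two scores is the usual symptom of overfitting.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with the real features and target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (2, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # The deep tree scores close to 1.0 on train but noticeably lower on test.
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))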
6.3 Decision tree implementation in Python
6.3.1 In case of classification
Outcome is the dependent variable: it indicates whether a person is diabetic or not. The remaining columns are independent variables.
Input: sns.heatmap(data_num.corr())
Output: (correlation heatmap)
From the heatmap it is clear that there is no or only low multicollinearity among the features.
Output:
1    343
0    343
Name: Outcome, dtype: int64

Output: 1.0
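The code that prepares the train/test split and fits clf is not shown above (the isolated "Output: 1.0" is presumably its perfect training score). A minimal sketch of what it might look like, reusing the variable names clf, train_X, test_X, train_Y, test_Y and data_num that appear in the surrounding snippets, is:

import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: data_num is the preprocessed (and class-balanced) diabetes dataframe.
X = data_num.drop(columns=["Outcome"])
Y = data_num["Outcome"]
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.3, random_state=42)

# Default decision tree, no tuning yet: it fits the training data perfectly.
clf = DecisionTreeClassifier()
clf.fit(train_X, train_Y)

train_auc = roc_auc_score(train_Y, pd.DataFrame(clf.predict_proba(train_X))[1])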
test_predicted_probabilities = pd.DataFrame(clf.predict_proba(test_X))[1]
roc_auc_score(test_Y, test_predicted_probabilities)
Output: The AUC for the model built on the Test Data is : 0.663152005508693
print("The AUC for the model built on the Train Data is : ", train_auc)
print("The AUC for the model built on the Test Data is : ", test_auc)
Output:
The AUC for the model built on the Train Data is : 1.0
The AUC for the model built on the Test Data is : 0.70
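The grid-search call that produces the output printed below is also not shown. A plausible sketch, assuming the parameter grid visible in that output (the scoring metric is not visible, so it is left at scikit-learn's default here), is:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Parameter grid copied from the printed GridSearchCV output below.
# Note: 'auto' for max_features has been removed in newer scikit-learn releases.
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [5, 6, 7],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [4, 5, 6, 7],
              'min_samples_split': [2, 5, 7]}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, cv=5)
grid_search.fit(train_X, train_Y)   # train_X, train_Y from the split above
print(grid_search)                  # produces the summary shown below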
Output:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 6, 7],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [4, 5, 6, 7],
                         'min_samples_split': [2, 5, 7]})
Train output: 0.9377810266130608
Test output: 0.7364565529144444

After parameter tuning, the test accuracy has increased from about 60% to about 73%, while the train accuracy is still about 0.93. Since there is still a large gap between the train and test scores, the model has high variance.
6.3.2 In case of regression
The data set contains the following variables. The train data has 8,523 rows with the following columns:
Item_Identifier: product ID
Item_Outlet_Sales: sales of the product in the particular store. This is the outcome variable to be predicted.
After all preprocessing steps, we train a decision tree model without any hyperparameter tuning:

reg_decision_model = DecisionTreeRegressor()
reg_decision_model.fit(X_train, y_train)
reg_decision_model.score(X_train, y_train)
Output: 1.0

Model score on the test data:
reg_decision_model.score(X_test, y_test)
Output: 0.562696
On the test data we get only about a 56% score because we have not tuned any parameters; we just initialized the tree with its default values. The tree has therefore expanded to its full depth, using splits that do not generalize, which is why it gets a perfect score on the train data (i.e., a highly overfitted model). Now we tune the parameters to get rid of this overfitting.
A simple model without any hyperparameter tuning, but evaluated with K-fold cross-validation:

resultsDTC.mean()
Output: -3.176499710061905
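The code that produces resultsDTC (and the prediction variable used in the next snippet) is not shown; it is presumably a cross_val_score call with negative-MSE scoring, which is why the mean printed above is negative. A sketch under that assumption (the fold count is also an assumption):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Untuned regressor evaluated with K-fold cross-validation; scikit-learn
# reports the negated mean squared error, hence the negative mean.
resultsDTC = cross_val_score(DecisionTreeRegressor(), X_train, y_train,
                             cv=5, scoring='neg_mean_squared_error')
print(resultsDTC.mean())

# Test-set predictions of the untuned model, evaluated in the next snippet.
prediction = reg_decision_model.predict(X_test)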
print('MAE:', metrics.mean_absolute_error(y_test,prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
MAE: 0.8283144309737974
MSE: 3.4923738756355105
RMSE: 1.868789414470103
parameters={"splitter":["best","random"],
"max_depth" : [1,3,5,7,9],
"min_samples_leaf":[4,5,6,7,8],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None,40,50,60] }
tuning_model=GridSearchCV(reg_decision_model,param_grid=parameters,scoring=
'neg_mean_squared_error',cv=3,verbose=3)
tuned_hyper_model=
DecisionTreeRegressor(max_depth=5,max_features='sqrt',max_leaf_nodes=60,min
_samples_leaf=4,min_weight_fraction_leaf=0.1,splitter='best')
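Between these two snippets the grid search still has to be fitted and the tuned model trained to produce tuned_pred. A sketch of those missing steps, under the assumption that the hard-coded values in tuned_hyper_model came from best_params_, is:

# Fit the grid search on the training data and inspect the best combination found.
tuning_model.fit(X_train, y_train)
print(tuning_model.best_params_)     # presumably the values hard-coded above

# Refit the tuned regressor and generate the predictions evaluated below.
tuned_hyper_model.fit(X_train, y_train)
tuned_pred = tuned_hyper_model.predict(X_test)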
Error after hyperparameter tuning:
print('MAE:', metrics.mean_absolute_error(y_test,tuned_pred))
print('MSE:', metrics.mean_squared_error(y_test, tuned_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, tuned_pred)))
MAE: 0.9994744906564477
MSE: 2.344128598627275
RMSE: 1.53105473404032
It was observed that after tuning, the MSE on the test data decreased from about 3.49 to about 2.34 (and the RMSE from about 1.87 to about 1.53).
6.5 Advantages of decision trees
Decision trees provide high accuracy and require very little preprocessing of the data, such as outlier capping, missing-value treatment or variable transformation.
They work well for non-linear relationships.
Tree-based models can be visualized very easily, with clear-cut demarcations, allowing people with no background in statistics to understand the process easily.
Decision trees work well for data cleaning, data exploration, and variable selection and creation.
Decision trees can work with high-dimensional data having both continuous and categorical variables.
Feature interactions are built into decision trees.
Decision trees are highly interpretable.
In this module we have learnt about decision trees, but our train and test scores still show high variation. In the upcoming modules we will look at some more powerful algorithms that aggregate decision trees, i.e. bagging and boosting.