Decision Trees
Agenda
3. Regularization
4. Case studies
Decision Trees -
1. Decision Tree is one of the most commonly used models in the data science world
3. It can be used for both regression and classification, though it is more often used for
classification
4. It can be used for binary classification, for example whether a loan applicant is likely to
default or whether a customer is likely to churn
5. It can also be used for multi-class classification, for example identifying a character of
the English alphabet
6. The Decision Tree algorithm finds the relation between the target column and the
independent variables and expresses it as a tree structure
Decision Trees: Training/Building Tree

Training data (parent node):

HorsePower   Weight   Car Type
270          2500     L
200          2900     L
130          3500     L
70           2530     S
90           2000     S
80           2200     S
90           1500     S
150          3000     L
215          2000     L
100          1700     S

[Figure: the parent node is split on Weight > 2000 into two smaller nodes.]

2. The smaller node on the bottom has "S" in the majority in the target column and hence gets the label "S"
3. The homogeneity of the target column in both smaller nodes has increased compared to the parent node
Decision Trees: Training/Building Tree
[Figure: the two nodes from the first split are split again on HorsePower (e.g. HorsePower > 100 for the node with majority class "L"), producing four smaller nodes.]

1. The smaller node on the top is now perfectly homogeneous in the target column and belongs to class "S"
2. The second node similarly belongs to class "L"
3. The third node belongs to class "S"
4. The fourth node belongs to class "L"
Decision Trees: Training/Building Tree
If wt > 2000
    if hp > 100
        class = "L"
    else
        class = "S"
If wt <= 2000
    if hp > 200
        class = "L"
    else
        class = "S"
Note: The CART algorithm employed by scikit-learn creates only binary trees, i.e., each node is split into
two sub-nodes
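As an illustration, the sketch below fits scikit-learn's DecisionTreeClassifier (which implements CART) on the toy car data shown above and prints the learned rules. The feature names hp and wt and the random_state value are choices made for this sketch, and the learned thresholds may differ from the hand-built tree, since CART picks its own cut points.

    # Minimal sketch: fit a CART tree (scikit-learn) on the toy car data above
    from sklearn.tree import DecisionTreeClassifier, export_text

    # [horsepower, weight] -> car type ("L" = large, "S" = small)
    X = [[270, 2500], [200, 2900], [130, 3500], [70, 2530], [90, 2000],
         [80, 2200], [90, 1500], [150, 3000], [215, 2000], [100, 1700]]
    y = ["L", "L", "L", "S", "S", "S", "S", "L", "L", "S"]

    tree = DecisionTreeClassifier(criterion="gini", random_state=0)
    tree.fit(X, y)

    # Print the learned binary tree as nested if/else rules
    print(export_text(tree, feature_names=["hp", "wt"]))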
Decision Trees: Predicting
To predict the class of a new data point, we apply the decision functions starting from the root node,
for example first the function on the weight column, and follow the branches until the data point
reaches a leaf. The label of that leaf (here "Large Car") is the predicted class of the new data point.
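Continuing the sketch above (reusing the fitted tree), prediction routes a new data point through the decision functions until it reaches a leaf; the horsepower and weight values below are hypothetical.

    # Predict the class of new cars by routing them through the fitted tree
    new_points = [[180, 2800],   # hypothetical heavy, high-horsepower car
                  [90, 1800]]    # hypothetical light, low-horsepower car
    print(tree.predict(new_points))   # expected: ['L' 'S']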
Decision Trees – Training Errors
2. Classification errors can occur during both training and testing; they are called training
errors and testing errors respectively. This is true for any algorithm
3. By default, the decision tree algorithm will try to build a tree where the smallest child
nodes are perfectly homogeneous in the target column
4. To achieve perfect homogeneity in the target column, the algorithm may build a large
tree where each leaf has only 1 record! Such models are overfit models. They give
zero errors on training but perform poorly on test data
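The sketch below illustrates this on a synthetic, noisy dataset (the dataset, its parameters, and the split sizes are illustrative, not from the slides): an unconstrained tree reaches near-perfect training accuracy but noticeably lower test accuracy.

    # Illustrative sketch: an unconstrained tree memorises noisy training data
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # grown until leaves are pure
    print(full_tree.score(X_tr, y_tr))   # training accuracy: typically ~1.0 (zero training error)
    print(full_tree.score(X_te, y_te))   # test accuracy: noticeably lower on unseen data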
Decision Trees – Posterior Probability
5. Sometimes, when the algorithm runs out of independent attributes to use to break a
node into smaller nodes, or when it is forced to stop, we may find nodes where the target
column is not homogeneous, for example:

HorsePower   Weight   Car Type
90           2000     S
90           1500     S
215          2000     L
100          1700     S

6. In such a case, the label assigned to the node is based on the majority class, and the ratio of
classes indicates the posterior probability of the two classes at that node: P(S) = 3/4 and
P(L) = 1/4
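As a sketch of how this shows up in scikit-learn, the example below forces the node above to remain a single leaf (using min_samples_leaf purely as a device for illustration), so the class ratio is returned as the posterior probability.

    # Sketch: a node that cannot be split assigns the majority label, and its
    # class ratio becomes the posterior probability
    from sklearn.tree import DecisionTreeClassifier

    X = [[90, 2000], [90, 1500], [215, 2000], [100, 1700]]   # hp, weight (the node above)
    y = ["S", "S", "L", "S"]

    stub = DecisionTreeClassifier(min_samples_leaf=4).fit(X, y)  # no split possible -> one leaf
    print(stub.classes_)                      # ['L' 'S']
    print(stub.predict([[120, 1800]]))        # majority class -> ['S']
    print(stub.predict_proba([[120, 1800]]))  # [[0.25 0.75]] i.e. P(L) = 1/4, P(S) = 3/4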
Decision Trees – Structure & Node types
1. Decision tree classifiers utilize a tree structure to model the relationships among the features
and the potential outcomes
2. Decision trees consist of nodes and branches. A node represents a decision function,
while a branch represents the result of the function. Thus, the tree is a flow chart for deciding
how to classify a new observation
3. The nodes are of three types: the Root Node (representing the original data), Branch
Nodes (each representing a function), and Leaf Nodes (each holding the result of all the previous
functions that connect to it)
Decision Trees - Structure & Node types
4. For classification problems, the posterior probability of all the classes is reflected in the
leaf node, and the Leaf Node is labelled with the majority class
5. After executing all the functions from the Root Node to a Leaf Node, the class of a data
point is decided by the leaf node it reaches
6. For regression, the average/median value of the target attribute in the leaf node is assigned
to the query point
7. Tree creation splits data into subsets and subsets into further smaller subsets. The
algorithm stops splitting data when the data within the subsets are sufficiently
homogeneous or some other stopping criterion is met
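For the regression case in point 6, here is a minimal sketch with scikit-learn's DecisionTreeRegressor (the one-feature data and the max_depth value are purely illustrative): each leaf predicts the mean target value of the training records that fall into it.

    # Sketch: a regression tree predicts the mean target value of the leaf
    # that a query point falls into
    from sklearn.tree import DecisionTreeRegressor

    X = [[1], [2], [3], [10], [11], [12]]     # one illustrative feature
    y = [1.0, 1.2, 0.9, 8.0, 8.5, 7.9]        # target values

    reg = DecisionTreeRegressor(max_depth=1).fit(X, y)
    print(reg.predict([[2.5]]))    # ~1.03, the mean of the left leaf
    print(reg.predict([[11.0]]))   # ~8.13, the mean of the right leaf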
Decision Trees - Structure & Node types
1. The decision tree algorithm learns (i.e. creates the decision tree from the data set)
through optimization of a loss function
2. The loss function represents the impurity in the target column. The
requirement here is to minimize the impurity as much as possible at the leaf nodes
Suppose we wish to find whether shipping mode and order priority have any influence on
customer location. Customer location is the target column and is like the bag of coloured balls
in the Shannon's Entropy example below.

[Figure: the Sales Data is split first on Shipping Mode (Regular Air vs. Express Air) and then on Order Priority (Low vs. High).]
When sub-branches are created, the total entropy of the sub-branches should be
less than the entropy of the parent node. The greater the drop in entropy, the more
information is gained
Decision Trees – Shannon's Entropy
a. Imagine a bag contains 6 red and 4 black balls
b. Let the two classes be Red -> class 0 and Black -> class 1
c. The entropy of the bag is H(X) = -p(red)*log2(p(red)) - p(black)*log2(p(black))
   = -0.6*log2(0.6) - 0.4*log2(0.4) ≈ 0.971
d. Suppose we remove all red balls from the bag; the entropy then becomes
   H(X) = -1.0*log2(1.0) - 0.0*log2(0) = 0  ## Entropy is 0! i.e. we have complete information (no uncertainty)
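A small sketch of the same calculation in Python (the entropy helper below is written for this example, not taken from any library):

    # Shannon entropy from class proportions (log base 2)
    from math import log2

    def entropy(proportions):
        # H(X) = -sum(p * log2(p)) over the non-zero class proportions
        return -sum(p * log2(p) for p in proportions if p > 0)

    print(entropy([0.6, 0.4]))   # mixed bag, 6 red and 4 black balls -> ~0.971
    print(entropy([1.0, 0.0]))   # only black balls left -> 0.0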
Decision Trees – Entropy and Information Gain

[Figure: the Shipping Mode / Order Priority tree annotated with the entropy and information gain at each level.]

Level 0: Shipping Mode node (1000 records), entropy E0 = maximum entropy, say 1. Information gain = 0.
Level 1: Regular Air (700 records, entropy E1a) and Express Air (300 records, entropy E1b).
    E1 = E1a * 700/1000 + E1b * 300/1000
    Information gain = E0 - E1
Level 2: Low Priority (500 records, E2a) and High Priority (200 records, E2b) under Regular Air;
         Low Priority (100 records, E2c) and High Priority (200 records, E2d) under Express Air.
    E2 = E2a * 500/700 + E2b * 200/700 + E2c * 100/300 + E2d * 200/300
    Information gain = E1 - E2
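As a sketch, the information gain of a split can be computed as the parent entropy minus the size-weighted average of the child entropies; the helper functions and the class counts below are made up for illustration (the slide does not give the actual E1a/E1b values).

    # Information gain = parent entropy - weighted average of child entropies
    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    def information_gain(parent_counts, children_counts):
        total = sum(parent_counts)
        weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
        return entropy(parent_counts) - weighted

    # In the spirit of the Shipping Mode split: 1000 records, children of 700 and 300
    parent = [500, 500]                    # evenly distributed target -> entropy 1.0
    children = [[450, 250], [50, 250]]     # illustrative class counts per child
    print(information_gain(parent, children))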
The tree will stop growing when a stopping criterion for splitting is reached, which could be:
a. The tree has reached a certain pre-fixed depth (the longest path from root node to leaf node)
b. The tree has reached the maximum number of nodes (tree size)
c. All attributes to split on have been exhausted
d. A split would create a leaf node with fewer than a predefined number of data points
Decision Trees - Information Gain using Entropy
1. Gini index – calculated by subtracting the sum of the squared probabilities of each
class from one
a. Uses the squared proportion of classes
b. For a perfectly classified node, the Gini Index is zero
c. For an evenly distributed node, it is 1 – (1/#classes)
d. You want a variable split that has a low Gini Index
e. Used in the CART algorithm
2. Entropy –
a. Favors splits with small counts but many unique values
b. Weights the probability of each class by log (base 2) of the class probability
c. A smaller entropy at the child nodes is better; it makes the difference from the parent node's
entropy larger
d. Information Gain is the entropy of the parent node minus the weighted entropy of the child nodes
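A quick sketch of the Gini calculation described in point 1 (the helper below is written for this example):

    # Gini index = 1 - sum of squared class proportions
    def gini(proportions):
        return 1.0 - sum(p ** 2 for p in proportions)

    print(gini([1.0, 0.0]))   # perfectly classified node -> 0.0
    print(gini([0.5, 0.5]))   # evenly distributed, 2 classes -> 0.5 = 1 - 1/2
    print(gini([0.6, 0.4]))   # the 6 red / 4 black bag -> 0.48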
Decision Trees – Gini , Entropy , Misclassification Error
3. C5.0 is Quinlan’s latest version and it uses less memory and builds smaller
rulesets than C4.5 while being more accurate
Advantages -
1. Simple, fast in processing, and effective
2. Does well with noisy data and missing data
3. Handles numeric and categorical variables
4. Interpretation of results does not require mathematical or statistical knowledge
Disadvantages -
1. Often biased towards splits on features that have a large number of levels
2. May not be optimal, as some relations are hard to model with axis-parallel splits
3. Small changes in the training data can result in large changes to the decision logic
4. Large trees can be difficult to interpret
Decision Trees - Preventing overfitting through regularization
3. If left unconstrained, decision trees can build tree structures that adapt to the training
data, leading to overfitting
4. To avoid overfitting, we need to restrict the DT's freedom during tree
creation. This is called regularization
1. max_depth – the maximum length of a path from root to leaf (in terms of the
number of decision points). A leaf node at the maximum depth is not split further. This could lead to
a tree where a leaf node contains many observations on one side of the tree,
whereas on the other side, nodes containing far fewer observations get
split further
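A minimal sketch of how such constraints look in scikit-learn (the specific values are illustrative starting points, not recommendations, and X_train/y_train stand for your own training data):

    # Restricting the tree's freedom (regularization) with scikit-learn constraints
    from sklearn.tree import DecisionTreeClassifier

    regularized_tree = DecisionTreeClassifier(
        max_depth=4,            # longest root-to-leaf path (number of decision points)
        min_samples_split=20,   # a node needs at least this many records to be split
        min_samples_leaf=10,    # every leaf must keep at least this many records
        max_leaf_nodes=16,      # cap on the total number of leaves (tree size)
        random_state=0,
    )
    # regularized_tree.fit(X_train, y_train)   # X_train, y_train: your training data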