
Decision Trees

Agenda

1. Introduction to Decision Trees

2. Loss function for Decision Trees

3. Regularization

4. Case studies
Decision Trees -

1. Decision Tree is one of the most commonly used models in the data science world

2. It is also a proven management tool used to take decisions in complex situations

3. It can be used for both regression and classification, though it is more often used for classification

4. It can be used for binary classification, such as whether a loan applicant is likely to turn into a defaulter or not, or whether a customer is likely to churn or not

5. It can also be used for multi-class classification, for example identifying a character of the English alphabet

6. The Decision Tree algorithm finds the relation between the target column and the independent variables and expresses it as a tree structure

7. It does so by binary splitting the data using functions based on comparison operators on the independent columns
Decision Trees: Training/Building Tree

1. Suppose we are given the data about cars as shown

2. Our objective is to find if any patterns exist that connect "Horse-Power" and "Weight" to the car type (Large or Small)

3. The independent variables are "Horse-Power" and "Weight" while the target column is "Car Type"

4. The target column has binary values (L and S) in equal numbers

Horse-Power   Weight   Car Type
130           3500     L
90            2000     S
90            1500     S
150           3000     L
270           2500     L
200           2900     L
70            2530     S
215           2000     L
80            2200     S
100           1700     S
Decision Trees: Training/Building Tree

Let us apply a function on the Weight column. The parent node (all 10 records) is split on Weight > 2000 into two smaller nodes:

Node with Weight > 2000:
Horse-Power   Weight   Car Type
130           3500     L
150           3000     L
270           2500     L
200           2900     L
70            2530     S
80            2200     S

Node with Weight <= 2000:
Horse-Power   Weight   Car Type
90            2000     S
90            1500     S
215           2000     L
100           1700     S

1. The smaller node on top has "L" in majority in the target column, hence it gets the label "L"

2. The smaller node on the bottom has "S" in majority in the target column, hence it gets the label "S"

3. The homogeneity of the target column in both the smaller nodes has increased compared to the parent

4. But both the smaller nodes still have a mix of values in the target column

5. Let us split the data further using Horse-Power
Decision Trees: Training/Building Tree

Let us apply a function on the Horse-Power column.

The node with Weight > 2000 (label "L", majority class) is split on HorsePower > 100:

HorsePower <= 100:
Horse-Power   Weight   Car Type
70            2530     S
80            2200     S

HorsePower > 100:
Horse-Power   Weight   Car Type
130           3500     L
150           3000     L
270           2500     L
200           2900     L

The node with Weight <= 2000 (label "S", majority class) is split on HorsePower > 200:

HorsePower <= 200:
Horse-Power   Weight   Car Type
90            2000     S
90            1500     S
100           1700     S

HorsePower > 200:
Horse-Power   Weight   Car Type
215           2000     L

1. The smaller node on the top is now perfectly homogeneous in the target column and belongs to class "S"

2. The second node similarly belongs to "L"

3. The third node belongs to "S"

4. The fourth node belongs to "L"

5. There is no further need to split the data as it is perfectly homogeneous
Decision Trees: Training/Building Tree

The tree thus has given us the following relation between "Weight", "Horse-Power" and "Car Type":

If wt > 2000
    if hp > 100
        class = "L"
    else
        class = "S"

If wt <= 2000
    if hp > 200
        class = "L"
    else
        class = "S"

Note: The CART algorithm employed by scikit-learn creates only binary trees, i.e., each node is split into two sub-nodes
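These rules can be transcribed directly into a small Python function. This is only a sketch of the hand-built tree above (predict_car_type is an illustrative name, not a library function); a library such as scikit-learn would learn its own thresholds automatically from the data.

def predict_car_type(horse_power, weight):
    # Direct transcription of the two-level tree built above
    if weight > 2000:
        return "L" if horse_power > 100 else "S"
    else:
        return "L" if horse_power > 200 else "S"

# All ten training records are classified correctly by these splits
print(predict_car_type(130, 3500))   # 'L'
print(predict_car_type(90, 1500))    # 'S'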
Decision Trees: Predicting

Suppose a new data point arrives:

Horse-Power   Weight   Car Type
150           2500     ?

Passing it down the tree (Weight > 2000, then Horse-Power > 100), the predicted class of the new data point is "L", i.e. a Large Car.
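The same training and prediction can be reproduced with scikit-learn's DecisionTreeClassifier (the CART implementation noted above). A minimal sketch, assuming pandas and scikit-learn are installed; the tree scikit-learn learns may be simpler than the hand-built one (a single Horse-Power split already separates these ten records perfectly), but the prediction for the new point is still "L".

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training data from the slides
cars = pd.DataFrame({
    "horse_power": [130, 90, 90, 150, 270, 200, 70, 215, 80, 100],
    "weight":      [3500, 2000, 1500, 3000, 2500, 2900, 2530, 2000, 2200, 1700],
    "car_type":    ["L", "S", "S", "L", "L", "L", "S", "L", "S", "S"],
})

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(cars[["horse_power", "weight"]], cars["car_type"])

# New data point: Horse-Power = 150, Weight = 2500
new_point = pd.DataFrame({"horse_power": [150], "weight": [2500]})
print(clf.predict(new_point))   # ['L'], i.e. a Large car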
Decision Trees – Training Errors

1. Suppose we come across a combination of "Weight" and "Horse-Power" for one of the classes which was not available in the training data on which the decision tree was built. For example, a Small Car with Horse-Power of 250 and Weight of 2000. The Decision Tree will classify it as a Large Car

2. Such classification errors can occur both during training and testing. They are called training errors and testing errors respectively. This is true for any algorithm

3. The decision tree algorithm by default will try to build a tree where the smallest child nodes are perfectly homogeneous in the target column

4. To achieve perfect homogeneity in the target column, the algorithm may build a large tree where each leaf has only 1 record! Such models are overfit models. They give zero errors on training but perform poorly on test data
Decision Trees – Posterior Probability

5. Sometimes, when the algorithm runs out of independent attributes to use to break a node into smaller nodes, or it is forced to stop, we may find nodes where the target column is not homogeneous

Horse-Power   Weight   Car Type
90            2000     S
90            1500     S
215           2000     L
100           1700     S

6. In such a case, the label assigned to the node is based on the majority class, and the ratio of classes indicates the posterior probability of the two classes at that node: P(S) = 3/4 and P(L) = 1/4
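In scikit-learn, predict_proba returns exactly these class proportions of the leaf a query point lands in (when no class weights are used). A short sketch computing the posterior probabilities for the node above:

from collections import Counter

# Target values of the records that ended up in this leaf node
leaf_targets = ["S", "S", "L", "S"]

counts = Counter(leaf_targets)
total = sum(counts.values())
posterior = {label: n / total for label, n in counts.items()}

print(posterior)                      # {'S': 0.75, 'L': 0.25}
print(max(counts, key=counts.get))    # 'S', the label assigned to the node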
Decision Trees – Structure & Node types

1. Classifiers utilize a tree structure to model relationships among the features and the
potential outcomes

2. Decision trees consist of nodes and branches. Nodes represent a decision function while a branch represents the result of the function. Thus a decision tree is a flow chart for deciding how to classify a new observation

3. The nodes are of three types: the Root Node (representing the original data), Branch Nodes (each representing a function), and Leaf Nodes (each holding the result of all the previous functions that connect to it)
Decision Trees - Structure & Node types

4. For a classification problem, the posterior probability of all the classes is reflected in the leaf node, and the leaf node belongs to the majority class

5. After executing all the functions from the Root Node to a Leaf Node, the class of a data point is decided by the leaf node it reaches

6. For regression, the average/median value of the target attribute in the leaf node is assigned to the query variable (see the sketch below)

7. Tree creation splits data into subsets and subsets into further smaller subsets. The algorithm stops splitting data when the data within the subsets are sufficiently homogeneous or some other stopping criterion is met
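For regression, scikit-learn provides DecisionTreeRegressor, which by default predicts the mean of the target values in the leaf that a query point reaches. A minimal sketch with purely illustrative numbers:

from sklearn.tree import DecisionTreeRegressor

# Toy data: predict price from weight (illustrative values only)
X = [[1500], [1700], [2000], [2500], [2900], [3500]]
y = [12000, 13000, 15000, 24000, 27000, 30000]

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

# The prediction is the average target value of the leaf that is reached
print(reg.predict([[2600]]))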
Decision Trees - Structure & Node types

1. The decision tree algorithm learns (i.e. creates the decision tree from the data set)
through optimization of a loss function

2. The loss function represents the impurity in the target column. The requirement here is to minimize the impurity as much as possible at the leaf nodes

3. Purity of a node is a measure of the homogeneity of the target column at that node


Decision Trees -
1. There is a bag of 50 balls of red, green, blue, white and yellow colour, in equal numbers
2. You have to pull out one ball from the bag with your eyes closed. If the ball is -
   a. Red, you lose the prize money accumulated
   b. Green, you can quit
   c. Blue, you lose half the prize money but continue
   d. White, you lose a quarter of the prize money & continue
   e. Yellow, you can skip the question
3. This state, where you have to decide and your decision can result in various outcomes with equal probability, is said to be the state of maximum uncertainty
4. If you have a bag full of balls of only one colour, then there is no uncertainty. You know what is going to happen. Uncertainty is zero.
5. Thus, the more the homogeneity, the lesser the uncertainty, and vice versa
6. Uncertainty is expressed as entropy or Gini index
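The two extremes of the game can be quantified with entropy (defined formally on the next slide). A small sketch, assuming the five colours are equally likely:

from math import log2

# Five colours, equally likely: maximum uncertainty
p_equal = [0.2] * 5
print(-sum(p * log2(p) for p in p_equal))   # log2(5), about 2.32 bits

# One colour only: no uncertainty (entropy is zero)
print(-(1.0 * log2(1.0)))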
Decision Trees -

Suppose we wish to find if there is any influence of shipping mode and order priority on customer location. Customer location is the target column and is analogous to the bag of coloured balls.

[Tree diagram: the Sales Data is first split on Shipping Mode (Regular Air vs. Express Air), and each of those branches is further split on Order Priority (Low Priority vs. High Priority).]

When sub-branches are created, the total entropy of the sub-branches should be less than the entropy of the parent node. The greater the drop in entropy, the more the information gained.
Decision Trees – Shannon's Entropy
a. Imagine a bag contains 6 red and 4 black balls

b. Let the two classes be Red -> class 0 and Black -> class 1

c. The entropy of the bag (X) is calculated as per the formula
   H(X) = - (0.6 * log2(0.6)) - (0.4 * log2(0.4)) = 0.9709506

d. Suppose we remove all the red balls from the bag; the entropy then becomes
   H(X) = - (1.0 * log2(1.0)) - (0.0 * log2(0)) = 0, taking 0 * log2(0) = 0   ## Entropy is 0! i.e. Information is 100%
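A short sketch reproducing both calculations:

from math import log2

def entropy(probabilities):
    # Shannon entropy in bits; terms with p = 0 are skipped (0 * log2(0) -> 0)
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([0.6, 0.4]))   # ~0.9709506
print(entropy([1.0, 0.0]))   # 0.0 (no uncertainty)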
Decision Trees -

Entropy at each level of the tree and the corresponding information gain:

Level 0: the root node "Shipping Mode" holds all 1000 records. Its entropy E0 is the maximum entropy, say 1. Information gain so far: 0

Level 1: Regular Air (700 records, entropy E1a) and Express Air (300 records, entropy E1b).
Weighted entropy E1 = (E1a * 700/1000) + (E1b * 300/1000). Information gain: E0 – E1

Level 2: Low Priority (500 records, E2a) and High Priority (200 records, E2b) under Regular Air; Low Priority (100 records, E2c) and High Priority (200 records, E2d) under Express Air.
Weighted entropy E2 = (E2a * 500/1000) + (E2b * 200/1000) + (E2c * 100/1000) + (E2d * 200/1000). Information gain: E1 – E2

The tree will stop growing when a stopping criterion for splitting is reached, which could be -
a. The tree has reached a certain pre-fixed depth (the longest path from the root node to a leaf node)
b. The tree has reached the maximum number of nodes (tree size)
c. All attributes to split on have been exhausted
d. A split would produce a leaf node with fewer than a predefined number of data points
Decision Trees - Information Gain using Entropy

Information Gain = reduction in entropy = Entropy(parent) – Σ (Ni / N) * Entropy(child i)

where N is the number of records in the parent node and Ni is the number of records in child node i.


Decision Trees - Information Gain using Gini index

Information Gain = reduction in Gini index = Gini(parent) – Σ (Ni / N) * Gini(child i)

where N is the number of records in the parent node and Ni is the number of records in child node i.
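A sketch of both computations, applied to the Weight > 2000 split from the car example; entropy, gini and information_gain below are illustrative helper functions, not library calls.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children, impurity):
    # Impurity of the parent minus the weighted impurity of the child nodes
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = ["L", "S", "S", "L", "L", "L", "S", "L", "S", "S"]   # all 10 cars
left   = ["L", "L", "L", "L", "S", "S"]                       # Weight > 2000
right  = ["S", "S", "L", "S"]                                 # Weight <= 2000

print(information_gain(parent, [left, right], entropy))   # ~0.125 bits
print(information_gain(parent, [left, right], gini))      # ~0.083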


Decision Trees -

Common measures of purity

1. Gini index – is calculated by subtracting the sum of the squared probabilities of each
class from one
a. Uses squared proportion of classes
b. Perfectly classified, Gini Index would be zero
c. Evenly distributed would be 1 – (1/# Classes)
d. You want a variable split that has a low Gini Index
e. Used in CART algorithm

2. Entropy –
   a. Favors splits with small counts but many unique values
   b. Weights the probability of each class by log (base 2) of the class probability
   c. A smaller entropy in the child nodes is better, since it makes the difference from the parent node's
      entropy (the information gain) larger
   d. Information Gain is the entropy of the parent node minus the weighted entropy of the child nodes
Decision Trees – Gini, Entropy, Misclassification Error

For a node with class proportions p1, ..., pK:
   Gini index              = 1 – Σ pi^2
   Entropy                 = – Σ pi * log2(pi)
   Misclassification Error = 1 – max(pi)

Note: Misclassification Error is not used in Decision Trees
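For comparison, a short sketch evaluating all three measures on the 3-S / 1-L node from the posterior-probability slide:

from math import log2

p = [0.75, 0.25]   # class proportions S and L at the node

gini_index = 1 - sum(q ** 2 for q in p)    # 0.375
node_entropy = sum(-q * log2(q) for q in p)  # ~0.811
misclass = 1 - max(p)                      # 0.25

print(gini_index, node_entropy, misclass)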


Decision Trees - Algorithms

1. ID3 (Iterative Dichotomiser 3) – developed by Ross Quinlan. Creates a multi-branch tree at each node using a greedy algorithm. Trees grow to maximum size before pruning

2. C4.5 succeeded ID3 by overcoming the limitation that features are required to be categorical. It dynamically defines discrete attributes for numerical attributes. It converts the trained trees into a set of if-then rules. The accuracy of each rule is evaluated to determine the order in which they should be applied

3. C5.0 is Quinlan's latest version, and it uses less memory and builds smaller rulesets than C4.5 while being more accurate

4. CART (Classification & Regression Trees) is similar to C4.5, but it supports numerical target variables and does not compute rule sets. It creates binary trees. scikit-learn uses CART
Decision Trees -

Advantages -
1. Simple, fast in processing and effective
2. Does well with noisy data and missing data
3. Handles numeric and categorical variables
4. Interpretation of results does not require mathematical or statistical knowledge

Disadvantages -
1. Often biased towards splits on features having a large number of levels
2. May not be optimal, as modelling some relationships with axis-parallel splits is not ideal
3. Small changes in the training data can result in large changes to the logic
4. Large trees can be difficult to interpret
Decision Trees - Preventing overfitting through regularization

1. Decision trees do not assume a particular form of relationship between the independent and dependent variables, unlike, for example, linear models

2. A DT is a non-parametric algorithm, unlike linear models where we supply the input parameters

3. If left unconstrained, they can build tree structures that adapt to the training data, leading to overfitting

4. To avoid overfitting, we need to restrict the DT's freedom during tree creation. This is called regularization

5. The regularization hyperparameters depend on the algorithm used
Decision Trees - Regularization parameters

1. max_depth – the maximum length of a path from root to leaf (in terms of the number of decision points). Leaf nodes at this depth are not split further. It can lead to a tree where a leaf node contains many observations on one side of the tree, whereas on the other side, nodes containing far fewer observations get split further

2. min_samples_split – a limit to stop further splitting of a node when the number of observations in the node is lower than this value

3. min_samples_leaf – the minimum number of samples a leaf node must have. When a leaf contains too few observations, further splitting will result in overfitting (modelling of noise in the data)
Decision Trees - Regularization parameters (Contd…)

4. min_weight_fraction_leaf – same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances

5. max_leaf_nodes – the maximum number of leaf nodes in a tree

6. max_features – the maximum number of features that are evaluated for splitting at each node
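A minimal sketch showing how these hyperparameters are passed to scikit-learn's DecisionTreeClassifier; the values are illustrative only and should be tuned (for example by cross-validation) for a real dataset.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,                   # limit the longest root-to-leaf path
    min_samples_split=20,          # do not split nodes with fewer observations than this
    min_samples_leaf=10,           # every leaf must keep at least 10 samples
    min_weight_fraction_leaf=0.0,  # same constraint as a fraction of weighted samples
    max_leaf_nodes=16,             # cap the total number of leaves
    max_features=None,             # evaluate all features at each split
    random_state=42,
)
# clf.fit(X_train, y_train) would then build a constrained (regularized) tree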
Decision Tree -

Lab 1: Model to predict potential credit defaulters

The dataset has 16 attributes, described at
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide
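A possible starting point for the lab, assuming the whitespace-separated german.data file from the UCI repository has been downloaded into the working directory; the file name, layout and target coding (1 = good, 2 = bad) should be verified against the dataset description linked above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed layout: no header row, last column is the credit-risk target
credit = pd.read_csv("german.data", sep=r"\s+", header=None)

X = pd.get_dummies(credit.iloc[:, :-1])   # one-hot encode the categorical columns
y = credit.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out test set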
