Decision Trees
Machine Intelligence
Lecture #3
Spring 2024
Tentative Course Topics
Recommending App
ML question: between Gender and Occupation, which one seems more decisive for predicting which app the users will download?
Recommending App
Occupation?
  School → Pokemon Go
  Work → Gender?
    F → WhatsApp
    M → Snapchat
Between a horizontal and a vertical line, which one would cut the data better?
Non-parametric Estimation
• A non-parametric model is not fixed in advance: its complexity depends on the size of the training set or, rather, on the complexity of the problem inherent in the data.
• "Non-parametric" does not mean that the model has no parameters; it means that the number of parameters is not fixed and can grow with the size of the data or, more precisely, with the complexity of the regularity that underlies the data.
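A minimal sketch of this idea, assuming scikit-learn and a synthetic dataset (neither appears on the slides): an unconstrained decision tree grows more leaves, and hence more parameters, as it is given more training data.

```python
# Sketch: an unpruned decision tree's size grows with the training set,
# illustrating the non-parametric idea (the parameter count is not fixed).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

for n in (100, 500, 2000):
    tree = DecisionTreeClassifier(random_state=0).fit(X[:n], y[:n])
    print(f"n={n:5d}  leaves={tree.get_n_leaves()}  depth={tree.get_depth()}")
```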
Decision tree
• A decision tree is a hierarchical data structure implementing
the divide-and-conquer strategy.
• It is an efficient non-parametric method that can be used for
both classification and regression.
• A decision tree is also a non-parametric model in the sense that we do not assume any parametric form for the class densities, and the tree structure is not fixed a priori: the tree grows during learning, with branches and leaves added depending on the complexity of the problem inherent in the data.
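A short sketch (assuming scikit-learn, not shown on the slides) of the point that the same divide-and-conquer learner handles both classification and regression:

```python
# Sketch: decision trees for classification and for regression.
from sklearn.datasets import load_iris, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a discrete class label.
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_c, y_c)
print("classification accuracy (train):", clf.score(X_c, y_c))

# Regression: predict a continuous value; each leaf stores the mean target.
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
reg = DecisionTreeRegressor(max_depth=3).fit(X_r, y_r)
print("regression R^2 (train):", reg.score(X_r, y_r))
```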
Function Approximation
Ref: https://www.seas.upenn.edu, Eric Eaton.
Sample Dataset (Will Nadal Play Tennis?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played
Decision Tree
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
  – or a probability distribution over labels
Ref: https://www.seas.upenn.edu, Eric Eaton.
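A small sketch of this point, assuming scikit-learn's tree_ attribute arrays (an assumption about tooling, not something on the slide): every internal node tests a single feature against a single threshold, so each split is an axis-parallel cut.

```python
# Sketch: read the learned splits out of a fitted tree; each one is a
# threshold on one feature, i.e. an axis-parallel cut of the feature space.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X2 = X[:, :2]                      # keep 2 features so the cuts are easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X2, y)

t = tree.tree_
for node in range(t.node_count):
    if t.children_left[node] != -1:            # internal node
        print(f"node {node}: split on feature {t.feature[node]} "
              f"at threshold {t.threshold[node]:.2f}  (axis-parallel cut)")
    else:
        print(f"node {node}: leaf, class distribution {t.value[node][0]}")
```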
Stages of Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}, i = 1…n
• Assume each x_i ~ D(X) with y_i = f_target(x_i)
Train the model: model ← classifier.train(X, Y)
Apply the model to new data: for x ~ D(X), y_prediction ← model(x)
• The ID3 algorithm uses the Max-Gain method to select the best attribute at each split
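The slides do not show the selection step in code; below is a minimal, self-contained sketch of the Max-Gain idea. The rows and labels are a hypothetical toy set modelled on the earlier app-recommendation example (Gender/Occupation), not the lecture's actual data.

```python
# Sketch of ID3's Max-Gain attribute selection: pick the attribute whose
# split maximizes information gain = H(parent) - weighted sum of H(children).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    parent = entropy(labels)
    weighted = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return parent - weighted

# Hypothetical examples: user attributes -> app downloaded.
rows = [
    {"Gender": "F", "Occupation": "Work"},
    {"Gender": "M", "Occupation": "Work"},
    {"Gender": "F", "Occupation": "School"},
    {"Gender": "M", "Occupation": "School"},
]
labels = ["WhatsApp", "Snapchat", "PokemonGo", "PokemonGo"]

best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print("attribute chosen by Max-Gain:", best)   # Occupation, on this toy data
```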
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples
Entropy can be roughly thought of as how much variance the data has.
Entropy: a common way to measure impurity
Entropy
• Entropy = 0: all examples belong to the same class (minimum impurity).
  entropy = −1 · log2(1) = 0
• Entropy = 1: examples are evenly split between two classes (maximum impurity).
  entropy = −0.5 · log2(0.5) − 0.5 · log2(0.5) = 1
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
2-Class Cases:
Entropy: H(X) = − Σ_{i=1..n} P(x = i) · log2 P(x = i)
For two classes, H(X) ranges from 0 (minimum impurity) to 1 (maximum impurity).
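A tiny sketch implementing this formula and reproducing the two boundary cases quoted above:

```python
# Sketch: H(X) = -sum_i P(x=i) * log2 P(x=i), checked on the boundary cases.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # all examples in one class    -> 0.0 (min impurity)
print(entropy([0.5, 0.5]))   # 50/50 split over two classes -> 1.0 (max impurity)
```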
Information Gain
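These slides present information gain through worked figures; written out explicitly (the standard definition, as used by ID3 and consistent with the worked examples that follow), the gain of splitting a set S on attribute A is:

  Gain(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

i.e., the entropy of the parent node minus the weighted average entropy of the child nodes.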
Information Gain of feature WIND
Information Gain of feature HUMIDITY
Information Gain of feature TEMP
Information Gain of feature OUTLOOK
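The per-feature slides above are figures; a runnable version of the same computation is sketched below. The rows are a small hypothetical stand-in, not the actual table from the slides, so the printed numbers are only illustrative.

```python
# Sketch: information gain of each candidate feature on a play-tennis-style
# table (hypothetical rows, not the lecture's dataset).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, feature):
    total = entropy(labels)
    for value in set(r[feature] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == value]
        total -= len(subset) / len(labels) * entropy(subset)
    return total

rows = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak"},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Strong"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak"},
    {"Outlook": "Rain",     "Temp": "Mild", "Humidity": "Normal", "Wind": "Weak"},
    {"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Strong"},
]
labels = ["No", "No", "Yes", "Yes", "No"]   # was a tennis game played?

for feature in ("Wind", "Humidity", "Temp", "Outlook"):
    print(f"Gain({feature}) = {gain(rows, labels, feature):.3f}")
```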
After several iterations
Information Gain
Example: a family of 10 members, where 5 members are pursuing their studies and the other 5 have either completed their studies or never pursued them.
If a node contains only one class, i.e. the node is pure, the entropy of the data in that node is zero; by the information gain formula, the gain for such a node is high, because purity is high. Conversely, the higher the entropy, the lower the information gain, and the less pure the node.
Entropy for Parent Node
Now, according to the performance of the students:
Parent node: Students = 20; Curricular activity = 10; No curricular activity = 10.
  Parent entropy = −(0.5) · log2(0.5) − (0.5) · log2(0.5) = 1
Child 1: Students = 14; Curricular activity = 8/14 = 57%; No curricular activity = 6/14 = 43%.
  Entropy = −(0.43) · log2(0.43) − (0.57) · log2(0.57) = 0.98
Child 2: Students = 6; Curricular activity = 2/6 = 33%; No curricular activity = 4/6 = 67%.
  Entropy = −(0.33) · log2(0.33) − (0.67) · log2(0.67) = 0.91
Having calculated the entropy for the parent and child nodes, the weighted sum of the child entropies gives the weighted entropy of the split:
  Weighted Entropy = (14/20) · 0.98 + (6/20) · 0.91 = 0.959
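A quick check of this arithmetic; the small differences from the slide's 0.98 / 0.91 / 0.959 come only from the slide rounding the proportions to two decimals before taking logs.

```python
# Check of the performance-based split, with exact fractions.
from math import log2

def H(p):                       # entropy of a binary split with proportions (p, 1-p)
    return -(p * log2(p) + (1 - p) * log2(1 - p))

left = H(8 / 14)                # 14 students, 8 with curricular activity -> ~0.985
right = H(2 / 6)                #  6 students, 2 with curricular activity -> ~0.918
weighted = (14 / 20) * left + (6 / 20) * right
print(left, right, weighted)    # weighted entropy ~0.965
```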
Splits based on the class
Splitting by class instead gives two child nodes of 10 students each (one of them Class 12th), each with an 80/20 split of curricular activity:
  Entropy of each child = −(0.2) · log2(0.2) − (0.8) · log2(0.8) = 0.72
  Weighted Entropy = (10/20) · 0.722 + (10/20) · 0.722 = 0.722
Calculation of Information Gain
• Entropy measures the amount of uncertainty due to a process or a given random variable.
• Information gain measures how much that uncertainty is reduced in the nodes by splitting them for making further decisions.
• Here the parent entropy is 1, so Information Gain = 1 − Weighted Entropy of the split; a higher information gain means more entropy removed, which is what we want.
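Using the slide's (rounded) figures, the two candidate splits compare as follows:

```python
# Parent entropy is 1 (a 10/10 split), so the gain of each candidate split
# is 1 minus its weighted entropy; the class-based split removes more entropy.
parent_entropy = 1.0
gain_performance_split = parent_entropy - 0.959   # ~0.041
gain_class_split = parent_entropy - 0.722         # ~0.278
print(gain_performance_split, gain_class_split)
```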
Which Tree Should We Output?
Overfitting in DTs
Avoiding overfitting
How can we avoid overfitting?
• Stop growing when data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow the full tree, then post-prune (see the pruning sketches below)
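A hedged sketch, assuming scikit-learn, of how these bullet points map onto common library knobs. Note that scikit-learn's built-in post-pruning is cost-complexity pruning (ccp_alpha), which is a different rule from the reduced-error pruning on the next slide.

```python
# Early stopping -> max_depth / min_samples_split; "grow full tree, then
# post-prune" -> cost-complexity pruning via ccp_alpha.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stopped = DecisionTreeClassifier(max_depth=4, min_samples_split=20).fit(X_tr, y_tr)
full    = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned  = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

for name, m in [("early stopping", stopped), ("full tree", full), ("ccp-pruned", pruned)]:
    print(f"{name:14s} leaves={m.get_n_leaves():3d} test acc={m.score(X_te, y_te):.3f}")
```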
Reduced-error pruning
Split training data further into training and validation sets
Grow tree based on training set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each
possible node (plus those below it)
2. Greedily remove the node that most improves
validation set accuracy
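A minimal sketch of the procedure above, assuming scikit-learn: the fitted tree is left intact, and instead a set of nodes is maintained that are treated as leaves (predicting the majority class of the training samples that reached them). The set grows greedily while pruning still improves validation accuracy.

```python
# Sketch of reduced-error pruning on top of a fitted sklearn tree.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def predict(clf, X, pruned):
    """Route each example down the tree, stopping early at pruned nodes."""
    t = clf.tree_
    out = []
    for x in X:
        node = 0
        while t.children_left[node] != -1 and node not in pruned:
            go_left = x[t.feature[node]] <= t.threshold[node]
            node = t.children_left[node] if go_left else t.children_right[node]
        out.append(clf.classes_[np.argmax(t.value[node])])   # majority class at node
    return np.array(out)

def reduced_error_prune(clf, X_val, y_val):
    pruned = set()
    best_acc = np.mean(predict(clf, X_val, pruned) == y_val)
    internal = [n for n in range(clf.tree_.node_count)
                if clf.tree_.children_left[n] != -1]
    while True:                                   # "do until further pruning is harmful"
        candidates = [(np.mean(predict(clf, X_val, pruned | {n}) == y_val), n)
                      for n in internal if n not in pruned]
        if not candidates:
            return pruned
        acc, node = max(candidates)               # node whose removal helps most
        if acc <= best_acc:                       # no single pruning improves accuracy
            return pruned
        best_acc = acc
        pruned.add(node)

# Split data into train / validation / test, grow the full tree, then prune.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = reduced_error_prune(clf, X_val, y_val)
print("nodes treated as leaves:", len(pruned))
print("test accuracy before:", np.mean(predict(clf, X_te, set()) == y_te))
print("test accuracy after :", np.mean(predict(clf, X_te, pruned) == y_te))
```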
Effect of Reduced-Error Pruning
Decision Tree PROS & CONS