Decision Tree Basics
Dan Lo
Department of Computer Science
Kennesaw State University
Overview
• Widely used in practice
• Strengths include
– Fast and simple to implement
– Can convert to rules
– Handles noisy data
• Weaknesses include
– Univariate splits/partitioning using only one attribute at a time, which limits the types of possible trees
– Large decision trees may be hard to understand
– Requires fixed-length feature vectors
– Non-incremental (i.e., batch method)
Tennis Played?
• Columns denote features X_i
• Rows denote labeled instances (x_i, y_i)
• Class label denotes whether a tennis game was played
Decision Tree
• A possible decision tree for the data:
Entropy
• S is a training sample.
• 𝑝⊕ is the proportion of positive examples in S.
• 𝑝⊖ is the proportion of negative examples in S.
• Entropy measures the impurity of S
• Entropy(S) = −p⊕ lg p⊕ − p⊖ lg p⊖
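As a concrete illustration, here is a minimal Python sketch of this formula (the function name is mine; the 9-positive/5-negative counts in the example are the usual PlayTennis numbers and are an assumption here, since the table itself is not reproduced in these slides):

import math

def entropy(labels):
    """Binary entropy of a list of labels (1 = positive, 0 = negative)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(1 for y in labels if y == 1) / n
    p_neg = 1.0 - p_pos
    h = 0.0
    for p in (p_pos, p_neg):
        if p > 0:                      # by convention 0 * lg(0) = 0
            h -= p * math.log2(p)
    return h

print(entropy([1] * 9 + [0] * 5))      # ~0.940 for 9 positives / 5 negatives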
Information Gain
• We want to determine which attribute in a given set of training
feature vectors is most useful for discriminating between the classes
to be learned.
• Information gain tells us how important a given attribute of the
feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a
decision tree.
• IG = Entropy(parent) − weighted sum of Entropy(children), where each child's entropy is weighted by the fraction of examples that reach it
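A minimal Python sketch of this computation (the function and variable names are illustrative, not from the slides); the small 5-example split below is chosen to match the worked example that follows:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(values, labels):
    """IG of splitting `labels` on one attribute, given one attribute value per example."""
    n = len(labels)
    weighted = 0.0
    for v in set(values):
        child = [y for x, y in zip(values, labels) if x == v]
        weighted += (len(child) / n) * entropy(child)
    return entropy(labels) - weighted

# 3 positives and 2 negatives; one attribute value isolates a pure group of 2.
values = ["yes", "yes", "no", "no", "no"]
labels = [1, 1, 1, 0, 0]
print(information_gain(values, labels))   # 0.971 - (3/5)*0.9183 ≈ 0.420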
Basic Algorithm for Top-Down Learning of
Decision Trees
ID3 (Iterative Dichotomiser 3, Ross Quinlan, 1986)
node = root of decision tree
Main loop:
1. A <- the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else,
recurse over new leaf nodes.
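A compact Python sketch of these five steps (a simplified illustration under an assumed data format of attribute-to-value dictionaries, not Quinlan's original implementation; the entropy helper is repeated to keep the sketch self-contained):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_attribute(examples, labels, attributes):
    """Step 1: pick the attribute with the highest information gain."""
    def gain(a):
        n = len(labels)
        weighted = 0.0
        for v in set(ex[a] for ex in examples):
            child = [y for ex, y in zip(examples, labels) if ex[a] == v]
            weighted += (len(child) / n) * entropy(child)
        return entropy(labels) - weighted
    return max(attributes, key=gain)

def id3(examples, labels, attributes):
    """examples: list of dicts mapping attribute -> value; labels: parallel list of classes."""
    if len(set(labels)) == 1 or not attributes:          # step 5: stop when perfectly classified
        return Counter(labels).most_common(1)[0][0]      # leaf holding the (majority) class
    a = best_attribute(examples, labels, attributes)     # steps 1-2
    node = {"attribute": a, "children": {}}
    for v in set(ex[a] for ex in examples):              # step 3: one branch per value of a
        branch = [(ex, y) for ex, y in zip(examples, labels) if ex[a] == v]   # step 4
        sub_ex, sub_y = zip(*branch)
        node["children"][v] = id3(list(sub_ex), list(sub_y),
                                  [b for b in attributes if b != a])          # recurse
    return node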
[Worked example (figure): a 5-example sample with 3 positives and 2 negatives has entropy 0.971. The first split yields one pure branch (2/2 positive, H = 0) and one mixed branch (1/3 positive, 2/3 negative, H = 0.9183), so IG = 0.971 − (3/5)·0.9183 = 0.4200. Splitting the mixed branch on Fever yields two pure leaves (1/1 and 2/2, each H = 0), so that split's IG = 0.9183.]
How to Use Decision Tree
[Figure: a single-split tree on "Wearing Masks"; the Yes branch and the No branch each lead to a pure leaf (100% of one class).]
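To classify a new instance we start at the root and follow the branch matching the instance's attribute value until we reach a leaf, whose class is the prediction. A minimal Python sketch, assuming the nested-dict tree format from the ID3 sketch above; the leaf labels here are invented purely for illustration:

def predict(node, instance):
    """Walk from the root, following the branch that matches the instance's value, to a leaf."""
    while isinstance(node, dict):
        node = node["children"][instance[node["attribute"]]]
    return node

# Hypothetical one-split tree for the "Wearing Masks" slide; leaf labels are invented.
tree = {"attribute": "Wearing Masks",
        "children": {"Yes": "negative", "No": "positive"}}
print(predict(tree, {"Wearing Masks": "Yes"}))   # -> "negative"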
What if IG is negative?
• If IG is negative, that means the children's (weighted) entropy is larger than the parent's.
• I.e., adding child nodes does not improve classification.
• So we stop growing nodes at that branch.
• This is one way of tree pruning (pre-pruning).
Pruning Tree
• Decision trees may grow fast, which we don't like!
• Growth may cause overfitting to noise, including incorrect attribute values or class membership.
• Large decision trees require lots of memory and may not be deployable on resource-limited devices.
• A pruned tree, however, may fail to capture some features present in the training set.
• It is hard to tell whether a single extra node will increase accuracy, the so-called horizon effect.
• One way to prune trees is to set an IG threshold for keeping subtrees;
• i.e., IG has to be greater than the threshold for the tree to grow.
• Another way is simply to cap the tree depth or the maximum bin count.
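As one hedged illustration, scikit-learn's DecisionTreeClassifier offers this kind of pre-pruning control (min_impurity_decrease acts as an impurity-gain threshold, max_depth caps the depth, max_leaf_nodes caps the number of leaves); the random data below is purely illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))                 # 200 examples, 5 binary features
base = X[:, 0] & X[:, 1]                              # true underlying rule
y = np.where(rng.random(200) < 0.1, 1 - base, base)   # 10% label noise

clf = DecisionTreeClassifier(
    criterion="entropy",             # split quality measured by entropy / information gain
    min_impurity_decrease=0.01,      # pre-pruning: require a minimum impurity decrease to split
    max_depth=4,                     # pre-pruning: cap the tree depth
    max_leaf_nodes=16,               # pre-pruning: cap the number of leaves
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())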
How About Numeric Attributes
• In the COVID-19 example we only have Yes/No attributes; what if we also have a person's weight?
• We could sort the weights, take the average of each pair of adjacent values as a candidate threshold, calculate the entropy of each split W < w_i, and pick the threshold with the lowest (weighted) entropy (see the sketch after this list).
• For ranked data, such as a 1-4 rating on a question, or ordered categorical data, such as low, medium, and high, we may simply encode the values as ordinals, calculate the entropy of each split R < r_i, and pick the one with the lowest entropy.
• For unordered categorical data, such as red, green, and blue, we may enumerate all possible value subsets and calculate their entropies, e.g., {C=red}, {C=green}, {C=blue}, {C=red, green}, {C=red, blue}, {C=green, blue}.
• Remember that our goal is to split the data, so we do not consider split criteria that do not separate the data, such as {C=red, green, blue}.
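A minimal Python sketch of the numeric-threshold idea (candidate thresholds at the midpoints of adjacent sorted values; the helper names and the toy weights are my own):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_threshold(values, labels):
    """Try the midpoint of every pair of adjacent sorted values as a split W < t."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_h = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                       # identical adjacent values give no usable midpoint
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left  = [y for w, y in pairs if w < t]
        right = [y for w, y in pairs if w >= t]
        h = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if h < best_h:                     # keep the threshold with the lowest weighted entropy
            best_t, best_h = t, h
    return best_t, best_h

# Toy example: weights with a rough class boundary around 70.
weights = [55, 60, 65, 72, 80, 90]
labels  = [0, 0, 0, 1, 1, 1]
print(best_threshold(weights, labels))     # -> (68.5, 0.0)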