M2 Decision Trees
02/03/2025
Representing Data
• Think about a large table, N attributes, and assume you want to know something about the people
represented as entries in this table.
• E.g. reads a lot of books or not;
• Simplest way: Histogram on the first attribute – reads
• Then, histogram on 1st and 2nd: (reads (0/1) & gender (0/1): 00, 01, 10, 11)
• But, what if the # of attributes is larger: N=16
• How large are the 1-d histograms (contingency tables)? 16 numbers
• How large are the 2-d histograms? 16-choose-2 (all pairs) = 120 numbers
• How many 3-d tables? 560 numbers
• With 100 attributes, the 3-d tables need 161,700 numbers
• We need a better way to represent the data;
– In part, this depends on identifying the important attributes, since we want to look at these first.
– Information theory has something to say about it – we will use it to better represent the data.
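The counts above are just binomial coefficients: with N attributes there are C(N, d) distinct d-dimensional tables (counting tables, not the cells inside each table). A minimal Python check of the numbers quoted above:

    from math import comb

    # Number of d-dimensional contingency tables over N attributes:
    # one table per subset of d attributes, i.e. C(N, d).
    for n, d in [(16, 1), (16, 2), (16, 3), (100, 3)]:
        print(f"N={n}, d={d}: {comb(n, d)} tables")
    # N=16, d=1: 16
    # N=16, d=2: 120
    # N=16, d=3: 560
    # N=100, d=3: 161700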
Decision Trees
• A hierarchical data structure that represents data by
implementing a divide and conquer strategy
– Can be used as a non-parametric classification or regression method (in regression, a real
number is associated with each example rather than a categorical label)
• Process:
– Given a collection of examples, learn a decision tree that represents it.
– Use this representation to classify new examples
[Figure: a collection of shapes; given this collection, which shapes are type A, B, and C?]
The Representation
• Decision Trees are classifiers for instances represented as feature vectors
  – color = {red, blue, green}; shape = {circle, triangle, rectangle}; label = {A, B, C}
  – An example: <(color = green; shape = rectangle), label = B>
• Nodes are tests for feature values
• There is one branch for each value of the feature
• Leaves specify the categories (labels)
• Can categorize instances into multiple disjoint categories
• Evaluation of a Decision Tree: check the color feature; if it is blue, then check the shape feature; if it is …, then …
• (Learning a Decision Tree from data is discussed below.)
[Figure: a tree with a Color test at the root, Shape tests below it, and leaves labeled A, B, and C]
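A minimal sketch of this representation in Python: internal nodes test a feature, branches are keyed by feature values, and leaves hold labels. The exact tree drawn on the slide is not fully recoverable, so the tree below is only illustrative (chosen so it classifies the slide's example <color=green, shape=rectangle> as B); the names tree and classify are mine.

    # Illustrative decision tree over the color/shape features, as nested dicts.
    tree = {"feature": "color",
            "branches": {"blue": {"feature": "shape",
                                  "branches": {"circle": "B", "triangle": "A", "rectangle": "C"}},
                         "red": "B",
                         "green": {"feature": "shape",
                                   "branches": {"circle": "A", "triangle": "C", "rectangle": "B"}}}}

    def classify(node, example):
        """Evaluate the tree: follow the branch matching the example's feature value."""
        while isinstance(node, dict):
            node = node["branches"][example[node["feature"]]]
        return node

    print(classify(tree, {"color": "green", "shape": "rectangle"}))  # -> B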
Expressivity of Decision Trees
• As Boolean functions, they can represent any Boolean function.
• Can be rewritten as rules in Disjunctive Normal Form (DNF)
  – Green ∧ Square → positive
  – Blue ∧ Circle → positive
  – Blue ∧ Square → positive
• The disjunction of these rules is equivalent to the Decision Tree
• What did we show? What is the hypothesis space here?
  – 2 dimensions: color and shape
  – 3 values each: color ∈ {red, blue, green}, shape ∈ {triangle, square, circle}
  – |X| = 9: (red, triangle), (red, circle), (blue, square), …
  – |Y| = 2: + and -
  – |H| = 2^9 = 512
• And all these functions can be represented as decision trees.
[Figure: a tree with a Color test at the root, Shape tests below it, and leaves labeled + and -]
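A small sketch of how the DNF is read off a tree: collect every root-to-leaf path that ends in a positive leaf. The tree below is chosen (as an assumption) so that its positive paths are exactly the three rules above; the encoding is the same nested-dict form used in the earlier sketch.

    # Read a DNF for the positive class off a tree by collecting +-labeled paths.
    tree = {"feature": "color",
            "branches": {"green": {"feature": "shape",
                                   "branches": {"square": "+", "circle": "-", "triangle": "-"}},
                         "blue": {"feature": "shape",
                                  "branches": {"square": "+", "circle": "+", "triangle": "-"}},
                         "red": "-"}}

    def positive_paths(node, path=()):
        """Return the (feature, value) conjunctions leading to a + leaf."""
        if not isinstance(node, dict):                 # leaf
            return [path] if node == "+" else []
        rules = []
        for value, child in node["branches"].items():
            rules += positive_paths(child, path + ((node["feature"], value),))
        return rules

    for rule in positive_paths(tree):
        print(" ∧ ".join(f"{feat}={val}" for feat, val in rule), "→ positive")
    # color=green ∧ shape=square → positive
    # color=blue ∧ shape=square → positive
    # color=blue ∧ shape=circle → positive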
Decision Boundaries
• Usually, instances are represented as attribute-value pairs (color=blue,
shape = square, +)
• Numerical values can be used either by discretizing or by using thresholds
for splitting nodes
• In this case, the tree divides the feature space into axis-parallel
rectangles, each labeled with one of the labels
[Figure: a decision tree with threshold tests (X < 3, Y > 7, Y < 5, X < 1) and the corresponding partition of the (X, Y) plane into axis-parallel rectangles labeled + and -]
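A hedged sketch of the same idea in code: numeric features handled with threshold tests. The tests (X<3, Y>7, Y<5, X<1) come from the figure, but the exact wiring of the slide's tree is not recoverable, so the structure below is only illustrative.

    # Each root-to-leaf path corresponds to an axis-parallel rectangle in the plane.
    def classify(x, y):
        if x < 3:
            if y > 7:
                return "+"
            return "+" if x < 1 else "-"
        else:
            return "+" if y < 5 else "-"

    for point in [(0.5, 2.0), (2.0, 9.0), (5.0, 2.0), (5.0, 8.0)]:
        print(point, "->", classify(*point))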
Decision Trees
• Can represent any Boolean Function
• Can be viewed as a way to compactly represent a lot of data.
• Natural representation (think of the game of 20 questions)
• The evaluation of a Decision Tree classifier is easy; learning a good decision tree representation from data is the challenge.
[Figure: a decision tree with Outlook at the root; Sunny → Humidity, Overcast → Yes, Rain → Wind]

Will I play tennis today?
• Labels
  – Binary classification task: Y = {+, -}
      O  T  H  W  Play?
  1   S  H  H  W   -
  2   S  H  H  S   -
  3   O  H  H  W   +
  4   R  M  H  W   +
  5   R  C  N  W   +
  6   R  C  N  S   -
  7   O  C  N  S   +
  8   S  M  H  W   -
  9   S  C  N  W   +
 10   R  M  N  W   +
 11   S  M  N  S   +
 12   O  M  H  S   +
 13   O  H  N  W   +
 14   R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
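For the sketches further below, the 14 examples can be encoded directly as (Outlook, Temperature, Humidity, Wind, Label) tuples using the one-letter codes above; the name DATA is mine.

    from collections import Counter

    DATA = [("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
            ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
            ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
            ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
            ("O","H","N","W","+"), ("R","M","H","S","-")]

    print(Counter(label for *_, label in DATA))   # Counter({'+': 9, '-': 5})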
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top down.
[Figure: the resulting tree: Outlook at the root; Sunny → Humidity (High: No, Normal: Yes), Overcast → Yes, Rain → Wind (Strong: No, Weak: Yes)]
Basic Decision Tree Algorithm
• Let S be the set of examples
  – Label is the target attribute (the prediction)
  – Attributes is the set of measured attributes
• ID3(S, Attributes, Label)
    If all examples have the same label, return a single-node tree with that Label
    Otherwise Begin
        A = attribute in Attributes that best classifies S (create a Root node for the tree that tests A)
        For each possible value v of A:
            Add a new tree branch corresponding to A = v
            Let Sv be the subset of examples in S with A = v
            If Sv is empty: add a leaf node with the most common value of Label in S
                (why? so that, at evaluation time, an example reaching this branch still gets a prediction)
            Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label)
    End
    Return Root
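A minimal Python sketch of this pseudocode. "Best classifies" is implemented here with information gain, which is how the notes define it later on; the names (id3, entropy, domains, ...) are mine, not from the slides.

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy, in bits, of a multiset of labels."""
        counts = Counter(labels)
        return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

    def id3(examples, attributes, domains, label):
        """examples: list of dicts; attributes: list of attribute names;
        domains: dict mapping each attribute to its possible values;
        label: name of the target attribute. Returns a nested-dict tree or a label."""
        labels = [ex[label] for ex in examples]
        majority = Counter(labels).most_common(1)[0][0]
        if len(set(labels)) == 1 or not attributes:
            return majority                           # pure node (or no tests left)

        def gain(a):                                  # information gain of attribute a
            remainder = sum(
                sum(1 for ex in examples if ex[a] == v) / len(examples)
                * entropy([ex[label] for ex in examples if ex[a] == v])
                for v in domains[a])
            return entropy(labels) - remainder

        best = max(attributes, key=gain)              # attribute that best classifies S
        tree = {"feature": best, "branches": {}}
        rest = [a for a in attributes if a != best]
        for v in domains[best]:
            subset = [ex for ex in examples if ex[best] == v]
            if not subset:
                # Empty branch: predict the most common label of the parent set,
                # so evaluation still works for value combinations unseen in S.
                tree["branches"][v] = majority
            else:
                tree["branches"][v] = id3(subset, rest, domains, label)
        return tree

    # Tiny usage example (hypothetical toy data; the target is A AND B):
    toy = [{"A": 1, "B": 1, "y": "+"}, {"A": 1, "B": 0, "y": "-"},
           {"A": 0, "B": 1, "y": "-"}, {"A": 0, "B": 0, "y": "-"}]
    print(id3(toy, ["A", "B"], {"A": [0, 1], "B": [0, 1]}, "y"))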
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– But, finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a
simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 0 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
  – Splitting on A: we get purely labeled nodes (A=1 is all +, A=0 is all -).
  – Splitting on B: we don't get purely labeled nodes (B=1 still mixes + and -, so we must test A below it).
  – What if we have < (A=1, B=0), - >: 3 examples instead?
• (One way to think about it: the number of queries required to label a random data point.)
[Figure: splitting on A gives leaves + (A=1) and - (A=0); splitting on B gives a leaf - (B=0) and a further test on A under B=1.]
Picking the Root Attribute
• Consider the same data, but now with < (A=1, B=0), - >: 3 examples:
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 3 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• The two trees look structurally similar; which attribute should we choose?
  – Advantage to A, but we need a way to quantify it.
• One way to think about it: the number of queries required to label a random data point.
• If we choose A, we have less uncertainty about the labels.
[Figure: splitting on A: A=1 has 100 + and 3 -, A=0 has 100 -. Splitting on B: B=1 has 100 + and 50 -, B=0 has 53 -.]
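One hedged way to make "number of queries" concrete for this modified example: test one attribute first, and only test the second when the first does not already determine the label. A small sketch (names and framing are mine):

    # Each data point is (A, B, label, count).
    data = [(0, 0, "-", 50), (0, 1, "-", 50), (1, 0, "-", 3), (1, 1, "+", 100)]
    total = sum(count for *_, count in data)

    def expected_queries(first):
        """Expected number of attribute queries when attribute `first` (0 for A,
        1 for B) is tested first, and the other attribute is queried only if the
        first one does not already determine the label."""
        cost = 0
        for v in (0, 1):
            group = [row for row in data if row[first] == v]
            size = sum(count for *_, count in group)
            labels = {label for _, _, label, _ in group}
            cost += size * (1 if len(labels) == 1 else 2)
        return cost / total

    print("query A first:", round(expected_queries(0), 3))  # ~1.507
    print("query B first:", round(expected_queries(1), 3))  # ~1.739

Querying A first is cheaper on average, matching the intuition that A leaves less uncertainty.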
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– The main decision in the algorithm is the selection of the next
attribute to condition on.
• We want attributes that split the examples to sets that are
relatively pure in one label; this way we are closer to a leaf
node.
– The most popular heuristic is based on information gain, originating
with the ID3 system of Quinlan.
Entropy
• Entropy (impurity, disorder) of a set of examples S, relative to a binary
classification, is:
    Entropy(S) = - p+ log2(p+) - p- log2(p-)
  where p+ and p- are the proportions of positive and negative examples in S.
• Entropy can be viewed as the number of bits required, on average, to encode the
class labels. If the probability of + is 0.5, a single bit is required for each
example; if it is 0.8, we can use less than 1 bit.
[Figure: example sets with different mixes of + and - labels and their corresponding entropies]
• Convince yourself that the maximum value of the entropy is 1 (for a binary classification, using log base 2).
• Also note that the base of the log only introduces a constant factor; therefore, we will think in terms of base 2.
• High Entropy – high level of uncertainty. Low Entropy – no uncertainty.
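A quick numeric check of the statements above (binary entropy in bits):

    import math

    def entropy(p_pos):
        """Binary entropy, in bits, of a set with positive-class proportion p_pos."""
        if p_pos in (0.0, 1.0):
            return 0.0
        p_neg = 1.0 - p_pos
        return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

    print(entropy(0.5))      # 1.0    -> one full bit per example
    print(entropy(0.8))      # ~0.722 -> less than one bit
    print(entropy(9 / 14))   # ~0.940 -> the full "play tennis" sample (9+, 5-)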
Information Gain
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on that attribute:
    Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) × Entropy(Sv)
• Where:
  – Sv is the subset of S for which attribute a has value v, and
  – the entropy of partitioning the data is calculated by weighing the entropy of each partition by its size relative to the original set.
• Example, splitting the 14 examples on Humidity (7 High, 7 Normal):
    Expected entropy = (7/14) × 0.985 + (7/14) × 0.592 = 0.789
    Information gain = 0.940 - 0.789 = 0.151
• Splitting on Outlook instead: Information gain = 0.940 - 0.694 = 0.246
[Figure: the Outlook split, with branches Sunny, Overcast, Rain]
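A sketch that recomputes these gains from the 14-example table (the tiny discrepancies with the slide's values come from the slide rounding the intermediate entropies to three digits):

    import math
    from collections import Counter

    # (Outlook, Temperature, Humidity, Wind, Label), one-letter codes as in the table.
    DATA = [("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
            ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
            ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
            ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
            ("O","H","N","W","+"), ("R","M","H","S","-")]
    ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

    def entropy(labels):
        counts = Counter(labels)
        return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

    def gain(data, i):
        labels = [row[-1] for row in data]
        expected = sum(
            len(subset) / len(data) * entropy([row[-1] for row in subset])
            for subset in ([row for row in data if row[i] == v] for v in {row[i] for row in data}))
        return entropy(labels) - expected

    for name, i in ATTRS.items():
        print(f"Gain(S, {name}) = {gain(DATA, i):.3f}")
    # Gain(S, Outlook) = 0.247      (0.246 on the slide)
    # Gain(S, Temperature) = 0.029
    # Gain(S, Humidity) = 0.152     (0.151 on the slide)
    # Gain(S, Wind) = 0.048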
Which feature to split on?
Information gain for each attribute:
  Outlook: 0.246
  Humidity: 0.151
  Wind: 0.048
  Temperature: 0.029
→ Split on Outlook
An Illustrative Example (III)
Splitting on Outlook:
  Sunny: examples 1, 2, 8, 9, 11 (2+, 3-) → ?
  Overcast: examples 3, 7, 12, 13 (4+, 0-) → Yes
  Rain: examples 4, 5, 6, 10, 14 (3+, 2-) → ?
Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label.
An Illustrative Example (IV)
Continuing under Outlook = Sunny (examples 1, 2, 8, 9, 11; 2+, 3-):
  Gain(S_sunny, Humidity) = 0.97 - (3/5)×0 - (2/5)×0 = 0.97
  Gain(S_sunny, Temp) = 0.97 - 0 - (2/5)×1 = 0.57
  Gain(S_sunny, Wind) = 0.97 - (2/5)×1 - (3/5)×0.92 = 0.02
→ Split on Humidity
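A quick check of these numbers on the five Sunny rows (1, 2, 8, 9, 11), using the same gain computation as in the earlier sketch:

    import math
    from collections import Counter

    # (Temperature, Humidity, Wind, Label) for the Outlook = Sunny examples.
    SUNNY = [("H","H","W","-"), ("H","H","S","-"), ("M","H","W","-"),
             ("C","N","W","+"), ("M","N","S","+")]

    def entropy(labels):
        counts = Counter(labels)
        return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

    def gain(data, i):
        labels = [row[-1] for row in data]
        expected = sum(
            len(sub) / len(data) * entropy([r[-1] for r in sub])
            for sub in ([r for r in data if r[i] == v] for v in {r[i] for r in data}))
        return entropy(labels) - expected

    for name, i in [("Temp", 0), ("Humidity", 1), ("Wind", 2)]:
        print(f"Gain(S_sunny, {name}) = {gain(SUNNY, i):.2f}")
    # Gain(S_sunny, Temp) = 0.57
    # Gain(S_sunny, Humidity) = 0.97
    # Gain(S_sunny, Wind) = 0.02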
An Illustrative Example (V)
[Figure: the tree so far: Outlook at the root, with Sunny → ?, Overcast → Yes, Rain → ?]
After splitting the Sunny branch on Humidity:
[Figure: Outlook at the root, with Sunny → Humidity (High → No, Normal → Yes), Overcast → Yes, Rain → ?]
induceDecisionTree(S)
• 1. Does S uniquely define a class?
       if all s ∈ S have the same label y: return S;
• 2. Find the feature with the most information gain:
       i = argmax_i Gain(S, X_i)
• 3. Add children to S:
       for k in Values(X_i):
           S_k = {s ∈ S | x_i = k}
           addChild(S, S_k)
           induceDecisionTree(S_k)
       return S;
An Illustrative Example (VI)
[Figure: the tree under construction, with Outlook at the root]