
Introduction to Decision Trees

Representing Data
• Think about a large table, N attributes, and assume you want to know something about the people
represented as entries in this table.
• E.g. reads a lot of books or not;
• Simplest way: Histogram on the first attribute – reads
• Then, histogram on 1st and 2nd: (reads (0/1) & gender (0/1): 00, 01, 10, 11)
• But what if the number of attributes is larger, say N = 16?
• How large are the 1-d histograms (contingency tables)? 16 numbers
• How large are the 2-d histograms? 16-choose-2 (all pairs) = 120 numbers
• How many 3-d tables? 16-choose-3 = 560 numbers
• With 100 attributes, the 3-d tables alone need 100-choose-3 = 161,700 numbers (see the sketch below)
• We need to figure out a way to represent the data better:
– In part, this depends on identifying the important attributes, since we want to look at these first.
– Information theory has something to say about it – we will use it to better represent the data.
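A quick way to verify these counts, assuming binary attributes and counting one contingency table per subset of attributes (a minimal Python check, not part of the original slides):

```python
from math import comb

def num_tables(n_attributes: int, dims: int) -> int:
    """Number of distinct dims-dimensional contingency tables,
    one per subset of dims attributes out of n_attributes."""
    return comb(n_attributes, dims)

print(num_tables(16, 1))   # 16
print(num_tables(16, 2))   # 120
print(num_tables(16, 3))   # 560
print(num_tables(100, 3))  # 161700
```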
Decision Trees
• A hierarchical data structure that represents data by
implementing a divide and conquer strategy
– Can be used as a non-parametric classification and regression method
(real numbers associated with each example, rather than a categorical
label)
• Process:
– Given a collection of examples, learn a decision tree that represents it.
– Use this representation to classify new examples.
(Example: given a collection of shapes, which shapes are of type A, B, and C?)
The Representation
• Decision Trees are classifiers for instances represented as feature vectors
– e.g., color = {red, blue, green}; shape = {circle, triangle, rectangle}; label = {A, B, C}
– An example: <(color = green; shape = rectangle), label = B>
• Nodes are tests on feature values
• There is one branch for each value of the feature
• Leaves specify the categories (labels)
• Can categorize instances into multiple disjoint categories
• Evaluation of a decision tree: check the color feature; if it is blue, then check the shape feature; if it is ..., then ...
[Tree diagram: a Color node at the root with one branch per color; some branches lead directly to leaves, others to Shape nodes whose leaves are labeled A, B, and C.]
Expressivity of Decision Trees
• As Boolean functions, decision trees can represent any Boolean function.
• They can be rewritten as rules in Disjunctive Normal Form (DNF):
– Green ∧ Square → positive
– Blue ∧ Circle → positive
– Blue ∧ Square → positive
• The disjunction of these rules is equivalent to the decision tree.
• What did we show? What is the hypothesis space here?
– 2 dimensions: color and shape
– 3 values each: color ∈ {red, blue, green}, shape ∈ {triangle, square, circle}
– |X| = 9: (red, triangle), (red, circle), (blue, square), ...
– |Y| = 2: + and -
– |H| = 2^9 = 512
• And all of these functions can be represented as decision trees.
[Tree diagram: Color at the root, Shape nodes below, leaves labeled + and -.]
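Spelling out the hypothesis-space count (a trivial check; the variable names are my own):

```python
n_colors, n_shapes = 3, 3
n_instances = n_colors * n_shapes        # |X| = 9
n_boolean_functions = 2 ** n_instances   # |H| = 2^9 = 512
print(n_instances, n_boolean_functions)  # 9 512
```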
Decision Trees
• Output is a discrete category. Real-valued outputs are possible (regression trees).
• There are efficient algorithms for processing large amounts of data (but not too many features).
• There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.
[Tree diagram: Color at the root, Shape nodes below, leaves labeled + and -.]
Decision Boundaries
• Usually, instances are represented as attribute-value pairs (color = blue, shape = square, +).
• Numerical values can be used either by discretizing or by using thresholds for splitting nodes.
• In this case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels (a minimal sketch of such a tree appears below).

[Figure: the (X, Y) feature space partitioned by thresholds X = 1, X = 3, Y = 5, and Y = 7 into axis-parallel rectangles labeled + and -, alongside the equivalent decision tree that tests X < 3 at the root and Y > 7, Y < 5, and X < 1 at internal nodes.]
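A minimal sketch of how such a tree classifies a point with nested threshold tests; the thresholds echo the figure, but the exact tree structure and leaf labels here are illustrative assumptions, not a transcription of the slide:

```python
def classify(x: float, y: float) -> str:
    """A hand-built decision tree over two numeric features X and Y.
    Every internal node is a threshold test on a single feature, so each
    leaf corresponds to an axis-parallel rectangle in the (X, Y) plane."""
    if x < 3:
        if y > 7:
            return "+"
        return "-"
    if y < 5:
        return "-"
    return "+"

# The point (x = 2, y = 8) falls in the rectangle {x < 3, y > 7} and is labeled "+".
print(classify(2, 8))
```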
Decision Trees
• Can represent any Boolean function.
• Can be viewed as a way to compactly represent a lot of data.
• Natural representation (think of the game of 20 questions).
• Evaluating a decision tree classifier is easy.
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.
[Tree diagram: Outlook at the root with branches Sunny, Overcast, Rain; the Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).]
Will I play tennis today?
• Features
– Outlook: {Sun, Overcast, Rain}
– Temperature: {Hot, Mild, Cool}
– Humidity: {High, Normal, Low}
– Wind: {Strong, Weak}

• Labels
– Binary classification task: Y = {+, -}
Will I play tennis today?

 #   O  T  H  W   Play?
 1   S  H  H  W    -
 2   S  H  H  S    -
 3   O  H  H  W    +
 4   R  M  H  W    +
 5   R  C  N  W    +
 6   R  C  N  S    -
 7   O  C  N  S    +
 8   S  M  H  W    -
 9   S  C  N  W    +
10   R  M  N  W    +
11   S  M  N  S    +
12   O  M  H  S    +
13   O  H  N  W    +
14   R  M  H  S    -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
Basic Decision Tree Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top down.
[The 14-example table from above, alongside the resulting tree: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
Basic Decision Tree Algorithm
• Let S be the set of examples
– Label is the target attribute (the prediction)
– Attributes is the set of measured attributes
• ID3(S, Attributes, Label):
  If all examples in S have the same label, return a single-node tree with that Label
  Otherwise:
    Create a Root node for the tree
    A = the attribute in Attributes that best classifies S
    For each possible value v of A:
      Add a new tree branch corresponding to A = v
      Let Sv be the subset of examples in S with A = v
      If Sv is empty: add a leaf node with the most common value of Label in S
        (why? so the tree can still classify such values at evaluation time)
      Else: below this branch add the subtree ID3(Sv, Attributes − {A}, Label)
  Return Root
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– But, finding the minimal decision tree consistent with the data is NP-
hard
• The recursive algorithm is a greedy heuristic search for a
simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.
Picking the Root Attribute
• Consider data with two Boolean attributes (A,B).
< (A=0,B=0), - >: 50 examples
< (A=0,B=1), - >: 50 examples
< (A=1,B=0), - >: 0 examples
< (A=1,B=1), + >: 100 examples
• What should be the first attribute we select?
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B).
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 0 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
– Splitting on A: we get purely labeled nodes.
– Splitting on B: we do not get purely labeled nodes.
– What if we have < (A=1, B=0), - >: 3 examples?
• (One way to think about it: the number of queries required to label a random data point.)
[Diagrams: splitting on A gives a + leaf (A=1) and a - leaf (A=0); splitting on B gives a - leaf (B=0) and, under B=1, a further split on A into + and -.]
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B).
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 3 examples (previously 0)
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?
• One way to think about it: the number of queries required to label a random data point.
• If we choose A we have less uncertainty about the labels.
• Advantage: A. But we need a way to quantify things.
[Diagrams: splitting on A, the A=1 branch has 100 positives and 3 negatives and the A=0 branch has 100 negatives; splitting on B, the B=1 branch has 100 positives and 50 negatives and the B=0 branch has 53 negatives.]
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– The main decision in the algorithm is the selection of the next
attribute to condition on.
• We want attributes that split the examples to sets that are
relatively pure in one label; this way we are closer to a leaf
node.
– The most popular heuristic is based on information gain, which originated with Quinlan's ID3 system.
Entropy
• Entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is:
  Entropy(S) = -p+ log2(p+) - p- log2(p-)
  where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
– If all the examples belong to the same category [(1, 0) or (0, 1)]: Entropy = 0
– If the examples are equally mixed (0.5, 0.5): Entropy = 1
– Entropy = level of uncertainty.
• In general, when pi is the fraction of examples labeled i:
  Entropy(S) = - Σi pi log2(pi)
• Entropy can be viewed as the number of bits required, on average, to encode the class of labels. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit (see the sketch below).
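A minimal sketch of the entropy computation, assuming labels are given as a list of "+"/"-" strings (the function name and example data are illustrative):

```python
from math import log2

def entropy(labels):
    """Entropy of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(l) / n) * log2(labels.count(l) / n)
                for l in set(labels))

print(entropy(["+"] * 7 + ["-"] * 7))   # 1.0   (evenly mixed)
print(entropy(["+"] * 8 + ["-"] * 2))   # ~0.72 (p+ = 0.8: less than 1 bit)
print(entropy(["+"] * 10))              # -0.0  (numerically zero: all one category)
```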
Entropy
(Definition repeated from the previous slide.)
• Test yourself: assign high, medium, or low entropy to each of these distributions.
[Figure: three +/- label distributions, ranging from strongly imbalanced to evenly mixed.]
Entropy
(Convince yourself that the maximum value is log2(k) when the k labels are equally likely.)
(Also note that the base of the log only introduces a constant factor; therefore, we will think in base 2.)
• Test yourself again: assign high, medium, or low entropy to each of these distributions. For the middle distribution, try to guess the value of the entropy.
[Figure: three label distributions over several values.]
Information Gain
High entropy – high level of uncertainty. Low entropy – no uncertainty.
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:
  Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)
• Where:
– Sv is the subset of S for which attribute a has value v, and
– the entropy of partitioning the data is calculated by weighting the entropy of each partition by its size relative to the original set.
• Partitions of low entropy (imbalanced splits) lead to high gain.
• Go back and check which of the A, B splits is better (see the sketch below).
[Diagram: an Outlook node splitting into Sunny, Overcast, and Rain.]
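A minimal sketch of the gain computation, applied to the A/B example from the earlier slides (100 positives with A=1, B=1; 3 negatives with A=1, B=0; 50 negatives each with A=0, B=0 and A=0, B=1); representing examples as dicts is an assumption of this sketch:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(l) / n) * log2(labels.count(l) / n)
                for l in set(labels))

def gain(examples, labels, attribute):
    """Information gain of splitting the labeled examples on one attribute.
    Each example is a dict mapping attribute names to values."""
    expected = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        expected += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - expected

examples = ([{"A": 0, "B": 0}] * 50 + [{"A": 0, "B": 1}] * 50 +
            [{"A": 1, "B": 0}] * 3 + [{"A": 1, "B": 1}] * 100)
labels = ["-"] * 103 + ["+"] * 100
print(round(gain(examples, labels, "A"), 3))  # ~0.90: splitting on A is better
print(round(gain(examples, labels, "B"), 3))  # ~0.32
```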
Will I play tennis today?
[The 14-example table and attribute legend, repeated from above.]
Will I play tennis today?
[The 14-example table from above: 9 positive and 5 negative examples.]
Calculate the current entropy:
  p+ = 9/14, p- = 5/14
  Entropy(Play) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94
Information Gain: Outlook
  Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

Outlook = Sunny:    Entropy(O = S) = 0.971
Outlook = Overcast: Entropy(O = O) = 0
Outlook = Rainy:    Entropy(O = R) = 0.971

Expected entropy = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694
Information gain = 0.940 - 0.694 = 0.246
Information Gain: Humidity
  Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

Humidity = High:   Entropy(H = H) = 0.985
Humidity = Normal: Entropy(H = N) = 0.592

Expected entropy = (7/14) × 0.985 + (7/14) × 0.592 = 0.789
Information gain = 0.940 - 0.789 = 0.151
Which feature to split on?
[14-example table as above.]

Information gain:
  Outlook:     0.246
  Humidity:    0.151
  Wind:        0.048
  Temperature: 0.029

→ Split on Outlook (verified in the sketch below)
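A self-contained sketch that recomputes these four gains from the 14-example table (attribute codes as in the legend above; values in the comments are rounded):

```python
from math import log2

# The 14 examples: (Outlook, Temperature, Humidity, Wind, Play?)
DATA = [
    ("S", "H", "H", "W", "-"), ("S", "H", "H", "S", "-"), ("O", "H", "H", "W", "+"),
    ("R", "M", "H", "W", "+"), ("R", "C", "N", "W", "+"), ("R", "C", "N", "S", "-"),
    ("O", "C", "N", "S", "+"), ("S", "M", "H", "W", "-"), ("S", "C", "N", "W", "+"),
    ("R", "M", "N", "W", "+"), ("S", "M", "N", "S", "+"), ("O", "M", "H", "S", "+"),
    ("O", "H", "N", "W", "+"), ("R", "M", "H", "S", "-"),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(l) / n) * log2(labels.count(l) / n) for l in set(labels))

def gain(rows, attr_index):
    g = entropy([r[-1] for r in rows])
    for v in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == v]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g

for name, idx in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(f"{name}: {gain(DATA, idx):.3f}")
# Outlook ~0.247, Temperature ~0.029, Humidity ~0.152, Wind ~0.048
# (matching the slide's 0.246 / 0.029 / 0.151 / 0.048 up to rounding)
```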
An Illustrative Example (III)
Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
[Tree: Outlook at the root.]
An Illustrative Example (III)
[14-example table as above.]
Outlook
  Sunny:    examples 1, 2, 8, 9, 11   (2+, 3-)  → ?
  Overcast: examples 3, 7, 12, 13     (4+, 0-)  → Yes
  Rain:     examples 4, 5, 6, 10, 14  (3+, 2-)  → ?
An Illustrative Example (III)
[Same partial tree as above: Sunny (2+, 3-) → ?, Overcast (4+, 0-) → Yes, Rain (3+, 2-) → ?]
Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label.
An Illustrative Example (IV)
[Partial tree as above: Outlook at the root; Sunny (examples 1, 2, 8, 9, 11; 2+, 3-) → ?, Overcast (3, 7, 12, 13; 4+, 0-) → Yes, Rain (4, 5, 6, 10, 14; 3+, 2-) → ?]

Gain(S_sunny, Humidity) = 0.97 - (3/5)·0 - (2/5)·0 = 0.97
Gain(S_sunny, Temp)     = 0.97 - (2/5)·0 - (2/5)·1 - (1/5)·0 = 0.57
Gain(S_sunny, Wind)     = 0.97 - (2/5)·1 - (3/5)·0.92 = 0.02

→ Split on Humidity (verified in the sketch below)
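A quick check of the three gains on the Sunny subset (examples 1, 2, 8, 9, 11), using the same helper style as the previous sketch:

```python
from math import log2

# Sunny subset: (Temperature, Humidity, Wind, Play?)
SUNNY = [
    ("H", "H", "W", "-"), ("H", "H", "S", "-"), ("M", "H", "W", "-"),
    ("C", "N", "W", "+"), ("M", "N", "S", "+"),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(l) / n) * log2(labels.count(l) / n) for l in set(labels))

def gain(rows, attr_index):
    g = entropy([r[-1] for r in rows])
    for v in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == v]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g

for name, idx in [("Temperature", 0), ("Humidity", 1), ("Wind", 2)]:
    print(f"Gain(S_sunny, {name}) = {gain(SUNNY, idx):.2f}")
# Prints approximately 0.57, 0.97, and 0.02 -> Humidity is chosen under Sunny.
```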
An Illustrative Example (V)
[Partial tree: Outlook at the root; Sunny (1, 2, 8, 9, 11; 2+, 3-) → ?, Overcast (3, 7, 12, 13; 4+, 0-) → Yes, Rain (4, 5, 6, 10, 14; 3+, 2-) → ?]
An Illustrative Example (V)
[Tree: Outlook at the root; under Sunny, a Humidity node (High → No, Normal → Yes); Overcast → Yes; Rain (3+, 2-) → ?]
induceDecisionTree(S)
1. Does S uniquely define a class?
   If all s ∈ S have the same label y: return S.
2. Find the feature with the most information gain:
   i = argmax_i Gain(S, Xi)
3. Add children to S:
   for k in Values(Xi):
     Sk = {s ∈ S | xi = k}
     addChild(S, Sk)
     induceDecisionTree(Sk)
   return S
(A runnable sketch of this procedure appears below.)
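A minimal runnable version of this recursive procedure, assuming examples are dicts of attribute values with labels kept in a parallel list; this is a sketch of ID3 following the pseudocode above, not the course's reference code:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    g = entropy(labels)
    for v in {ex[attr] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        g -= (len(subset) / len(examples)) * entropy(subset)
    return g

def induce_decision_tree(examples, labels, attributes):
    """Returns a label (leaf) or a pair (attribute, {value: subtree})."""
    if len(set(labels)) == 1:                        # S uniquely defines a class
        return labels[0]
    if not attributes:                               # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, labels, a))
    children = {}
    for v in sorted({ex[best] for ex in examples}):  # one child per observed value
        keep = [i for i, ex in enumerate(examples) if ex[best] == v]
        children[v] = induce_decision_tree(
            [examples[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attributes if a != best])
    return (best, children)

# Usage on the (A, B) example from earlier in the slides:
examples = [{"A": 0, "B": 0}] * 50 + [{"A": 0, "B": 1}] * 50 + [{"A": 1, "B": 1}] * 100
labels = ["-"] * 100 + ["+"] * 100
print(induce_decision_tree(examples, labels, ["A", "B"]))
# ('A', {0: '-', 1: '+'}) -- splitting on A gives purely labeled leaves
```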
An Illustrative Example (VI)
[Final tree: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
Decision Trees - Summary
• Hypothesis Space:
– Variable size (contains all functions)
– Deterministic; Discrete and Continuous attributes
• Search Algorithm
– ID3 - batch
– Extensions: missing values
• Issues:
– What is the goal?
– When to stop? How to guarantee good generalization?
• Did not address:
– How are we doing? (Correctness-wise, Complexity-wise)
