Lec4 - Decision Trees

Decision Trees

Function Approximation
Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }
Sample Dataset
• Columns denote features Xi
• Rows denote labeled instances
• Class label denotes whether a tennis game was played
Decision Tree
• A possible decision tree for the data:

• Each internal node: tests one attribute Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predicts Y (or p(Y | x ∈ leaf))
Decision Tree
• A possible decision tree for the data:

• What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>?
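A small sketch of answering this question, assuming the tree on the slide is the usual PlayTennis tree (Outlook at the root, then Humidity and Wind); the dictionary encoding below is mine, not from the slides:

    # Assumed tree structure (illustrative, mirrors the standard PlayTennis tree)
    tree = {
        "attribute": "outlook",
        "children": {
            "sunny":    {"attribute": "humidity", "children": {"high": "No", "normal": "Yes"}},
            "overcast": "Yes",
            "rain":     {"attribute": "wind", "children": {"strong": "No", "weak": "Yes"}},
        },
    }

    def predict(node, x):
        # Walk from the root, following the branch matching the attribute value,
        # until a leaf (a plain label string) is reached.
        while isinstance(node, dict):
            node = node["children"][x[node["attribute"]]]
        return node

    x = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "wind": "weak"}
    print(predict(tree, x))   # -> "No"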
Decision Tree
• If features are continuous, internal nodes can
test the value of a feature against a threshold
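A minimal sketch of such a threshold test (the feature name and threshold below are made up for illustration):

    def route(node, x):
        # Follow one internal node's test: go left if the feature value is at or
        # below the node's threshold, otherwise go right.
        if x[node["feature"]] <= node["threshold"]:
            return node["left"]
        return node["right"]

    # e.g. a node that splits on a continuous humidity reading (illustrative values)
    node = {"feature": "humidity", "threshold": 75.0,
            "left": "low-humidity subtree", "right": "high-humidity subtree"}
    print(route(node, {"humidity": 82.0}))   # -> "high-humidity subtree"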
Decision Tree Induced Partition
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-
parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
– or a probability distribution over labels

(Figure: decision boundary.)
Expressiveness
• Decision trees can represent any boolean function of
the input attributes

Truth table row → path to leaf

• In the worst case, the tree will require exponentially many nodes
Expressiveness
Decision trees have a variable-sized hypothesis space
• As the #nodes (or depth) increases, the hypothesis
space grows
– Depth 1 (“decision stump”): can represent any boolean function of one feature
– Depth 2: any boolean function of two features; some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3); see the sketch after this slide)
– etc.

Based on slide by Pedro Domingos
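A quick sketch (not from the slides) checking that a depth-2 tree really does represent the three-feature formula above:

    from itertools import product

    # Depth-2 tree for (x1 AND x2) OR (NOT x1 AND NOT x3):
    # the root tests x1; its children test x2 and x3 respectively.
    def tree(x1, x2, x3):
        if x1:              # root: test x1
            return x2       # depth-2 node: test x2
        return not x3       # depth-2 node: test x3

    def formula(x1, x2, x3):
        return (x1 and x2) or ((not x1) and (not x3))

    # The tree and the formula agree on every truth assignment.
    assert all(tree(*bits) == formula(*bits) for bits in product([False, True], repeat=3))
    print("depth-2 tree matches the three-feature formula")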


Another Example:
Restaurant Domain (Russell & Norvig)
Model a patron’s decision of whether to wait for a table at a restaurant

~7,000 possible cases


Decision Tree Techniques
Basic Algorithm for Top-Down
Induction of Decision Trees
[ID3, C4.5 by Quinlan]

node = root of decision tree


Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop.
Else, recurse over new leaf nodes.

How do we choose which attribute is best?
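A compact recursive sketch of this loop in Python (an illustration for these notes, not Quinlan's code; choose_best stands in for whatever attribute-selection heuristic is used):

    from collections import Counter

    def id3(examples, attributes, choose_best):
        # `examples` is a list of (features_dict, label) pairs;
        # `choose_best(examples, attributes)` picks the "best" attribute.
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:                  # perfectly classified: stop
            return labels[0]
        if not attributes:                         # no attributes left: majority label
            return Counter(labels).most_common(1)[0][0]
        a = choose_best(examples, attributes)      # step 1: pick the "best" attribute
        node = {"attribute": a, "children": {}}
        for value in {x[a] for x, _ in examples}:  # step 3: one descendant per value of A
            subset = [(x, y) for x, y in examples if x[a] == value]   # step 4: sort examples
            rest = [b for b in attributes if b != a]
            node["children"][value] = id3(subset, rest, choose_best)  # step 5: recurse
        return node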


Choosing the Best Attribute
Key problem: choosing which attribute to split a
given set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest
number of possible values
– Most-Values: Choose the attribute with the largest
number of possible values
– Max-Gain: Choose the attribute that has the largest
expected information gain
• i.e., attribute that results in smallest expected size of subtrees
rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
Choosing an Attribute
Idea: a good attribute splits the examples into subsets
that are (ideally) “all positive” or “all negative”

Which split is more informative: Patrons? or Type?


Information Gain
Which test is more informative?
• Split over whether Balance exceeds 50K (branches: Less or equal 50K / Over 50K)
• Split over whether the applicant is employed (branches: Unemployed / Employed)


Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of
examples
Information Gain
Impurity
(Figure: three groups of examples, from a very impure group, to a less impure group, to minimum impurity.)
Information Gain
• We want to determine which attribute in a given set
of training feature vectors is most useful for
discriminating between the classes to be learned.

• Information gain tells us how important a given attribute of the feature vectors is.

• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Information Gain
Information Gain is the expected reduction in entropy of target
variable Y for data sample S, due to sorting on variable A
Day  Outlook   Temp  Humidity  Wind    PlayTennis
 1   Sunny     Hot   High      Weak    No
 2   Sunny     Hot   High      Strong  No
 3   Overcast  Hot   High      Weak    Yes
 4   Rain      Mild  High      Weak    Yes
 5   Rain      Cool  Normal    Weak    Yes
 6   Rain      Cool  Normal    Strong  No
 7   Overcast  Cool  Normal    Weak    Yes
 8   Sunny     Mild  High      Weak    No
 9   Sunny     Cool  Normal    Weak    Yes
10   Rain      Mild  Normal    Strong  Yes
11   Sunny     Mild  Normal    Strong  Yes
12   Overcast  Mild  High      Strong  Yes
13   Overcast  Hot   Normal    Weak    Yes
14   Rain      Mild  High      Strong  No
Example

A decision tree for the PlayTennis data above:

Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
Example
(Figure: two candidate splits, Question 1 and Question 2, each branching on Yes/No.)
Example
(Figure: two more candidate splits, Question 3 and Question 4, each branching on Yes/No.)
Example
Example
(Figure: both candidate splits start from a parent node with entropy E = 1. Question 1's Yes/No children have entropies E = 0.97 and E = 0.92; Question 2's Yes/No children have entropies E = 0.72 and E = 0.)

Information Gain

Entropy(S) = − Σ_k p_k log2 p_k,  where p_k is the fraction of examples in S belonging to class k

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
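These two formulas translate directly into Python (an illustrative sketch; the function and variable names are mine, not from the slides):

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum_k p_k log2 p_k over the class proportions in S
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, attribute):
        # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v);
        # `examples` is a list of (features_dict, label) pairs.
        labels = [y for _, y in examples]
        gain = entropy(labels)
        n = len(examples)
        for value in {x[attribute] for x, _ in examples}:
            subset_labels = [y for x, y in examples if x[attribute] == value]
            gain -= (len(subset_labels) / n) * entropy(subset_labels)
        return gain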
Example
(The figure from the previous Example, now annotated with the information gain of each split computed with the formula above; Question 2, whose children are purer, has the larger gain.)


Example

(The PlayTennis table above is repeated as the training sample S; the next slides compute the information gain of each attribute on S.)

Example

Splitting the full sample S (E = 0.954) on Wind:
  Weak branch:   E = 0.811
  Strong branch: E = 1
Example

G(S, Wind) = 0.048

Splitting S (E = 0.954) on Humidity:
  High branch:   E = 0.985
  Normal branch: E = 0.592
Example

G(S, Wind) = 0.048
G(S, Humidity) = 0.151

Splitting S (E = 0.954) on Temp:
  Hot branch:  E = 1
  Mild branch: E = 0.92
  Cool branch: E = 0.81
Example

G(S, Wind) = 0.048
G(S, Humidity) = 0.151
G(S, Temp) = 0.042

Splitting S (E = 0.954) on Outlook:
  Sunny branch:    E = 0.971
  Overcast branch: E = 0
  Rain branch:     E = 0.971
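These gains can be re-derived from the table directly. The sketch below (my own, reusing the entropy/gain definitions from earlier) prints a gain for each attribute, and Outlook comes out largest; exact decimals depend on rounding conventions and may differ slightly from the numbers printed on the slides.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, attr):
        g = entropy([r["Play"] for r in rows])
        for v in {r[attr] for r in rows}:
            sub = [r["Play"] for r in rows if r[attr] == v]
            g -= (len(sub) / len(rows)) * entropy(sub)
        return g

    # Days 1-14 from the PlayTennis table: (Outlook, Temp, Humidity, Wind, Play)
    data = [
        ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Weak", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Strong", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
    ]
    rows = [dict(zip(["Outlook", "Temp", "Humidity", "Wind", "Play"], r)) for r in data]

    for attr in ("Outlook", "Temp", "Humidity", "Wind"):
        print(attr, round(gain(rows, attr), 3))
    # Outlook yields the largest gain, so ID3 places it at the root.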
Example
Outlook gives the largest information gain, so it becomes the root; recursing within each branch gives:

Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
Which Tree Should We Output?
• ID3 performs a heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
