These slides are from the CS 4700 course "Foundations of Artificial Intelligence" taught by Prof. Carla P. Gomes at Cornell University. They cover the module on decision trees and decision tree learning: decision trees as a representation for hypotheses, their expressiveness in representing Boolean functions, and the top-down induction algorithm that builds trees by recursively splitting on the attribute with the highest information gain at each node.

CS 4700:

Foundations of Artificial Intelligence


Prof. Carla P. Gomes
[email protected]
Module:
Decision Trees
(Reading: Chapter 18)


Big Picture of Learning


Learning can be seen as fitting a function to the data. We can consider
different target functions and therefore different hypothesis spaces.

Examples:
Propositional if-then rules
Decision trees
First-order if-then rules
First-order logic theories
Linear functions
Polynomials of degree at most k
Neural networks
Java programs
Turing machines
Etc.

A learning problem is realizable if its hypothesis space contains the true function.

Tradeoff between the expressiveness of a hypothesis space and the complexity of
finding simple, consistent hypotheses within the space.

Decision Tree Learning


Task:
Given: a collection of examples (x, f(x))
Return: a function h (hypothesis) that approximates f
h is a decision tree

Input: an object or situation described by a set of attributes (or features)
Output: a decision that predicts the output value for the input.
The input attributes and the outputs can be discrete or continuous.

We will focus on decision trees for Boolean classification:
each example is classified as positive or negative.

Can we learn
how counties vote?

Decision Trees: a sequence of tests.
Representation very natural for humans: the style of many "How to" manuals.

(Example source: New York Times, April 16, 2008.)

Decision Tree

What is a decision tree?


A tree with two types of nodes: decision nodes and leaf nodes.

Decision node: specifies a choice or test of some attribute, with 2 or more
alternatives; every decision node is part of a path to a leaf node.

Leaf node: indicates the classification of an example.

Decision Tree Example: BigTip

[Figure: a decision tree for predicting BigTip. The root tests Food
(great / mediocre / yuck). Food = great leads to a test on Speedy
(yes -> yes, no -> no); Food = mediocre leads to a test on Price
(adequate -> yes, high -> no); Food = yuck leads directly to no.]

Is the decision tree we learned consistent?
Yes, it agrees with all the examples!
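To make the tree concrete, here is a minimal sketch (not from the original slides) of the BigTip tree written as nested conditionals; the function name, attribute names, and value strings are illustrative assumptions.

```python
def big_tip(food: str, speedy: str, price: str) -> bool:
    """Predict BigTip by walking the decision tree above.

    Assumed encodings: food in {'great', 'mediocre', 'yuck'},
    speedy in {'yes', 'no'}, price in {'adequate', 'high'}.
    """
    if food == "great":             # root test: Food
        return speedy == "yes"      # Speedy: yes -> BigTip, no -> no BigTip
    elif food == "mediocre":
        return price == "adequate"  # Price: adequate -> BigTip, high -> no BigTip
    else:                           # Food = yuck
        return False


# Example: great food but slow service -> no big tip
print(big_tip("great", "no", "adequate"))  # False
```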

Learning decision trees:


An example
Problem: decide whether to wait for a table at a restaurant. What attributes would you use?

Attributes used by SR:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

What about the restaurant name? It could be great for generating a small tree,
but it doesn't generalize!

Goal predicate: WillWait?

Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:

[Table: 12 training examples, 6 positive and 6 negative.]

Classification of examples is positive (T) or negative (F).

Decision trees
One possible representation for hypotheses.
E.g., here is a tree for deciding whether to wait:

[Figure: a decision tree for the WillWait problem.]

Expressiveness of Decision Trees

Any particular decision tree hypothesis for the WillWait goal predicate can be
seen as a disjunction of conjunctions of tests, i.e., an assertion of the form:

    ∀s  WillWait(s)  ⇔  (P1(s) ∨ P2(s) ∨ … ∨ Pn(s))

where each condition Pi(s) is a conjunction of tests corresponding to a path
from the root of the tree to a leaf with a positive outcome.

(Note: this is only propositional; it contains only one variable and all
predicates are unary. To consider interactions with more than one object
(say, another restaurant), we would require an exponential number of attributes.)

Expressiveness
Decision trees can express any Boolean function of the input attributes.
E.g., for Boolean functions: truth table row → path to leaf.

Number of Distinct Decision Trees


How many distinct decision trees are there with 10 Boolean attributes?
= the number of Boolean functions with 10 propositional symbols.

Input features    Output
0000000000        0/1
0000000001        0/1
0000000010        0/1
...
1111111111        0/1

How many entries does this table have?  2^10

So how many Boolean functions with 10 Boolean attributes are there,
given that each entry can be 0 or 1?  2^(2^10)

Hypothesis spaces
How many distinct decision trees are there with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)

E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees.

Google's calculator could not handle 10 attributes!
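As a quick sanity check of these counts (not part of the original slides), 2^(2^n) can be computed directly with Python's arbitrary-precision integers:

```python
def num_boolean_functions(n: int) -> int:
    """Number of distinct Boolean functions (truth tables) over n Boolean attributes."""
    return 2 ** (2 ** n)

print(num_boolean_functions(6))    # 18446744073709551616  (2^64)
print(len(str(num_boolean_functions(10))))  # 309: 2^1024 has 309 decimal digits
```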

Decision tree learning Algorithm


Decision trees can express any Boolean function.

Goal: find a decision tree that agrees with the training set.

We could construct a decision tree that has one path to a leaf for each example,
where the path tests each attribute and follows the branch matching the example's value.
Problem: this approach would just memorize the examples.
How does it deal with new examples? It doesn't generalize!

(E.g., the parity function, which is 1 if an even number of inputs are 1, or the
majority function, which is 1 if more than half of the inputs are 1.)

Overall goal: get a good classification with a small number of tests.

But of course, finding the smallest tree consistent with the examples is NP-hard!

Expressiveness:
Boolean functions with 2 attributes → 2^(2^2) = 16 DTs

[Figure: decision trees over attributes A and B for AND, NAND, OR, NOR, XOR,
XNOR, and NOT A.]
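A small illustrative script (not from the slides) that enumerates all 2^(2^2) = 16 Boolean functions of two attributes by listing their truth tables; each table corresponds to one of the little decision trees in the figures:

```python
from itertools import product

rows = list(product([False, True], repeat=2))   # the 4 (A, B) truth-table rows

# A Boolean function of (A, B) assigns one output bit to each row,
# so there are 2 ** (2 ** 2) = 16 of them -- one decision tree per function.
functions = [dict(zip(rows, outputs))
             for outputs in product([False, True], repeat=len(rows))]
print(len(functions))                           # 16

AND = {(a, b): a and b for a, b in rows}
print(AND in functions)                         # True: AND is one of the 16
```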

Expressiveness:
2^(2^2) two-attribute DTs (continued)

[Figure: decision trees for the same functions (AND, NAND, OR, NOR, XOR, XNOR,
NOT A), shown in reduced form.]

Expressiveness:
2^(2^2) two-attribute DTs (continued)

[Figure: decision trees over attributes A and B for A AND NOT B, A OR NOT B,
NOT A OR B, NOT A AND B, NOT B, TRUE, and FALSE.]

Expressiveness:
2^(2^2) two-attribute DTs (continued)

[Figure: decision trees for A AND NOT B, A OR NOT B, NOT A OR B, NOT A AND B,
NOT B, TRUE, and FALSE, shown in reduced form.]

Basic DT Learning Algorithm


Goal: find a small tree consistent with the training examples.

Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree;
use a top-down greedy search through the space of possible decision trees.
Greedy because there is no backtracking: it commits to the highest-value attribute first.

Variations of known algorithms: ID3, C4.5 (Quinlan '86, '93).
(ID3 = Iterative Dichotomiser 3)

Top-down greedy construction:
Which attribute should be tested?
Use heuristics and statistical testing with the current data.
Repeat for the descendants.

Big Tip Example


10 examples: 6 positive (1, 3, 4, 7, 8, 10) and 4 negative.

Attributes:
Food, with values g, m, y
Speedy, with values y, n
Price, with values a, h

Let's build our decision tree starting with the attribute Food
(3 possible values: g, m, y).

Top-Down Induction of Decision Tree:


Big Tip Example
10 examples: 6 positive (1, 3, 4, 7, 8, 10) and 4 negative.

[Figure: top-down construction of the BigTip tree on the 10 examples. The root
splits on Food; the branches are then refined with tests on Speedy (y -> Yes)
and Price (a -> Yes, h -> No), tracking which examples reach each node.]

How many + and - examples are there per subclass, starting with y?
Let's consider next the attribute Speedy.

Top-Down Induction
of DT (simplified)

Training data: {(x1, y1), ..., (xn, yn)}

TDIDT(D, c_def)
  IF all examples in D have the same class c
    RETURN leaf with class c (or class c_def, if D is empty)
  ELSE IF no attributes are left to test
    RETURN leaf with the class c of the majority in D
  ELSE
    Pick A as the best decision attribute for the next node
    FOR each value v_i of A, create a new descendant of the node:
      D_i = {(x, y) in D : attribute A of x has value v_i}
      Subtree t_i for v_i is TDIDT(D_i, c_def)
    RETURN tree with A as root and the t_i as subtrees
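A minimal runnable sketch of this TDIDT pseudocode in Python (not from the slides); the dictionary-based tree representation and the `pick_attribute` hook are illustrative assumptions.

```python
from collections import Counter

def tdidt(examples, attributes, default_class, pick_attribute):
    """examples: list of (x, y), where x is a dict of attribute -> value and y a class label.

    pick_attribute(examples, attributes) chooses the attribute to split on
    (e.g., by information gain).
    """
    if not examples:                               # empty D: fall back to the default class
        return default_class
    classes = [y for _, y in examples]
    if len(set(classes)) == 1:                     # all examples have the same class
        return classes[0]
    if not attributes:                             # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]

    a = pick_attribute(examples, attributes)       # "most significant" attribute
    majority = Counter(classes).most_common(1)[0][0]
    tree = {"attribute": a, "branches": {}}
    for v in {x[a] for x, _ in examples}:          # one subtree per observed value of a
        d_v = [(x, y) for x, y in examples if x[a] == v]
        rest = [b for b in attributes if b != a]
        tree["branches"][v] = tdidt(d_v, rest, majority, pick_attribute)
    return tree
```

With `pick_attribute` set to an information-gain heuristic (defined later in these slides), this becomes essentially an ID3-style learner.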

Picking the Best Attribute to Split


Ockham's Razor:
All other things being equal, choose the simplest explanation.

Decision tree induction:
Find the smallest tree that classifies the training data correctly.

Problem:
Finding the smallest tree is computationally hard!

Approach:
Use heuristic search (greedy search).

Heuristics:
Pick the attribute that maximizes information (Information Gain);
other statistical tests.

Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:

[Table: the same 12 training examples, 6 positive and 6 negative.]

Classification of examples is positive (T) or negative (F).

Choosing an attribute:
Information Gain
Goal: trees with short paths to the leaf nodes.

[Figure: two candidate attribute splits of the 12 examples.]

Is this a good attribute to split on? Which one should we pick?

A perfect attribute would ideally divide the examples into subsets
that are all positive or all negative.

Information Gain
Most useful in classification:
how to measure the worth of an attribute: information gain,
i.e., how well the attribute separates examples according to their classification.

Next: a precise definition of gain,
a measure from Information Theory (Shannon and Weaver, 1949).

Information
Information answers questions.
The more clueless I am about a question, the more information the answer contains.

Example: fair coin, prior <0.5, 0.5>.
By definition, the information of the prior (or entropy of the prior) is:
I(P1, P2) = -P1 log2(P1) - P2 log2(P2)
I(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit

We need 1 bit to convey the outcome of the flip of a fair coin.

Scale: 1 bit = answer to a Boolean question with prior <0.5, 0.5>.

Information
(or Entropy)
Information in an answer, given possible answers v1, v2, ..., vn with prior
probabilities P(v1), ..., P(vn):

I(P(v1), ..., P(vn)) = -Σi P(vi) log2 P(vi)

(Also called the entropy of the prior.)

Example: biased coin, prior <1/100, 99/100>.
I(1/100, 99/100) = -1/100 log2(1/100) - 99/100 log2(99/100) ≈ 0.08 bits

Example: biased coin, prior <1, 0>.
I(1, 0) = -1 log2(1) - 0 log2(0) = 0 bits   (taking 0 log2(0) = 0)

i.e., no uncertainty left in the source!

Shape of Entropy Function


Roll of an unbiased die.

[Figure: the entropy of a Boolean variable as a function of the probability p,
rising from 0 at p = 0 to a maximum of 1 bit at p = 1/2 and back to 0 at p = 1.]

The more uniform the probability distribution, the greater its entropy.

Information or
Entropy
Information or entropy measures the randomness of an arbitrary collection of
examples.

We don't have exact probabilities, but our training data provides an estimate of
the probabilities of positive vs. negative examples given a set of values for the
attributes.

For a collection S having p positive and n negative examples, the entropy is:

I(p/(p+n), n/(p+n)) = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
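A small Python helper (an illustrative sketch, not from the slides) that computes this entropy from the positive/negative counts:

```python
from math import log2

def entropy_pn(p: int, n: int) -> float:
    """Entropy I(p/(p+n), n/(p+n)) of a collection with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                     # convention: 0 * log2(0) = 0
            q = count / total
            result -= q * log2(q)
    return result

print(entropy_pn(6, 6))   # 1.0 bit   (the 12 restaurant examples)
print(entropy_pn(1, 99))  # ~0.08 bits (the biased coin above)
```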

Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:

[Table: the 12 examples again, 6 positive and 6 negative.]

What's the entropy of this collection of examples?

Classification of examples is positive (T) or negative (F).

p = n = 6;  I(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit

So we need 1 bit of information to classify a randomly picked example.

Choosing an attribute:
Information Gain
Intuition: pick the attribute that reduces the entropy (uncertainty) the most.

So we measure the information gain after testing a given attribute A:

Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)

Remainder(A) gives us the remaining uncertainty after getting info on attribute A.

Choosing an attribute:
Information Gain
Remainder(A) gives us the amount of information we still need after testing on A.

Assume A divides the training set E into subsets E1, E2, ..., Ev, corresponding
to the v distinct values of A. Each subset Ei has pi positive and ni negative
examples.

For the total information content, we weigh the contributions of the different
subclasses induced by A:

Remainder(A) = Σi=1..v  (pi + ni)/(p + n) · I(pi/(pi+ni), ni/(pi+ni))

The factor (pi + ni)/(p + n) is the weight of each subclass.

Choosing an attribute:
Information Gain
Information gain measures the expected reduction in entropy: the higher the
Information Gain (IG), or just Gain, with respect to an attribute A, the greater
the expected reduction in entropy.

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)}  (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, Sv is the
subset of S for which attribute A has value v, and |Sv|/|S| is the weight of
each subclass.
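A minimal Python sketch of this Gain(S, A) formula (illustrative, not from the slides), for examples represented as (attribute-dict, label) pairs:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder
```

Plugging `information_gain` into the `pick_attribute` hook of the earlier TDIDT sketch gives an ID3-style learner.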

Interpretations of gain
Gain(S, A):
expected reduction in entropy caused by knowing A;
information provided about the target function's value, given the value of A;
number of bits saved in coding a member of S when the value of A is known.

Used in ID3 (Iterative Dichotomiser 3), Ross Quinlan.

Information gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
Consider the attributes Type and Patrons:

[Figure: the information gain computation for Patrons and Type.]

Patrons has the highest IG of all attributes and so is chosen by the DTL
algorithm as the root.
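As a worked check of these two gains (a sketch that assumes the standard class counts of the AIMA restaurant data: Patrons splits the 12 examples into None = 0+/2-, Some = 4+/0-, Full = 2+/4-, while every Type value splits its examples evenly):

```python
from math import log2

def I(*probs):
    """Entropy of a discrete distribution, in bits (0 * log2(0) taken as 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A), with the assumed counts above
gain_patrons = I(0.5, 0.5) - (2/12 * I(0, 1) + 4/12 * I(1, 0) + 6/12 * I(2/6, 4/6))
gain_type = I(0.5, 0.5) - (2/12 * I(1/2, 1/2) + 2/12 * I(1/2, 1/2)
                           + 4/12 * I(2/4, 2/4) + 4/12 * I(2/4, 2/4))

print(round(gain_patrons, 3))  # ~0.541 bits
print(round(gain_type, 3))     # 0.0 bits: Type tells us nothing about WillWait
```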

Example contd.
Decision tree learned from the 12 examples:

[Figure: the learned tree, next to SR's tree.]

Substantially simpler than the true tree: a more complex hypothesis isn't justified.

Inductive Bias
Roughly: prefer
shorter trees over longer ones,
trees with high-gain attributes at the root.

Difficult to characterize precisely:
the attribute selection heuristics
interact closely with the given data.

Evaluation Methodology


Evaluation Methodology

How do we evaluate the quality of a learning algorithm, i.e.:

How good are the hypotheses produced by the learning algorithm?
How good are they at classifying unseen examples?

Standard methodology:
1. Collect a large set of examples.
2. Randomly divide the collection into two disjoint sets: a training set and a test set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the performance of h w.r.t. the test set (a form of cross-validation);
   this measures generalization to unseen data.

Important: keep the training and test sets disjoint! No peeking!

Peeking
Example of peeking:
We generate four different hypotheses, for example by using different
criteria to pick the next attribute to branch on.
We test the performance of the four different hypotheses on the test set
and we select the best one.
Voilà: peeking occurred!
The hypothesis was selected on the basis of its performance on the test set,
so information about the test set has leaked into the learning algorithm.

So a new test set is required!

Evaluation Methodology

Standard methodology:
1. Collect a large set of examples.
2. Randomly divide the collection into two disjoint sets: a training set and a test set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the performance of h w.r.t. the test set (a form of cross-validation).
   Important: keep the training and test sets disjoint! No peeking!
5. To study the efficiency and robustness of an algorithm, repeat steps 2-4 for
   different sizes of training sets and different randomly selected training sets
   of each size.

Test/Training Split

[Figure: data D = {(x1, y1), ..., (xn, yn)} is drawn randomly from the real-world
process and split randomly into training data Dtrain (fed to the learner) and
test data Dtest.]
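A minimal sketch of such a random split (illustrative only; the 80/20 ratio and the function name are assumptions):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly split a list of (x, y) examples into disjoint train and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]                            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)
```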

Measuring Prediction Performance

Performance measures:

Error rate: fraction (or percentage) of false predictions.

Accuracy: fraction (or percentage) of correct predictions.

Precision/Recall: apply only to binary classification problems (classes pos/neg).
Precision: fraction (or percentage) of correct predictions among all examples
predicted to be positive.
Recall: fraction (or percentage) of correct predictions among all real positive
examples.
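The four measures above, as a small sketch over (predicted, actual) label pairs (illustrative, not from the slides):

```python
def performance(pairs):
    """pairs: list of (predicted, actual) Boolean labels for the test set."""
    correct = sum(p == a for p, a in pairs)
    predicted_pos = [a for p, a in pairs if p]    # actual labels of predicted positives
    actual_pos = [p for p, a in pairs if a]       # predictions on real positives
    return {
        "error_rate": 1 - correct / len(pairs),
        "accuracy": correct / len(pairs),
        # among examples predicted positive, how many really are positive?
        "precision": sum(predicted_pos) / len(predicted_pos) if predicted_pos else 0.0,
        # among real positive examples, how many did we predict as positive?
        "recall": sum(actual_pos) / len(actual_pos) if actual_pos else 0.0,
    }

print(performance([(True, True), (True, False), (False, False), (False, True)]))
# {'error_rate': 0.5, 'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5}
```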

Learning Curve Graph

Learning curve graph:
average prediction quality (proportion correct on the test set)
as a function of the size of the training set.

Restaurant Example: Learning Curve

[Figure: learning curve; the y-axis is prediction quality (average proportion
correct on the test set), the x-axis is the training set size.]

As the training set increases, so does the quality of prediction.
A "happy curve": the learning algorithm is able to capture the pattern in the data.
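A sketch of how such a curve could be produced from the pieces above (the averaging over several random splits per training-set size is an illustrative choice, not from the slides):

```python
import random

def learning_curve(data, learner, evaluate, sizes, trials=20, seed=0):
    """For each training-set size, average test accuracy over several random splits.

    learner(train) -> hypothesis h; evaluate(h, test) -> proportion correct.
    """
    rng = random.Random(seed)
    curve = []
    for m in sizes:                        # m = number of training examples
        scores = []
        for _ in range(trials):
            shuffled = data[:]
            rng.shuffle(shuffled)
            train, test = shuffled[:m], shuffled[m:]
            scores.append(evaluate(learner(train), test))
        curve.append((m, sum(scores) / len(scores)))
    return curve
```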

How well does it work?

Many case studies have shown that decision trees are at least as accurate as
human experts.

A study for diagnosing breast cancer had humans correctly classifying the
examples 65% of the time; the decision tree classified 72% correctly.

British Petroleum designed a decision tree for gas-oil separation for offshore
oil platforms that replaced an earlier rule-based expert system.

Cessna designed an airplane flight controller using 90,000 examples and
20 attributes per example.

Summary

Decision tree learning is a particular case of supervised learning.

For supervised learning, the aim is to find a simple hypothesis that is
approximately consistent with the training examples.

Decision tree learning uses information gain.

Learning performance = prediction accuracy measured on the test set.
