CS 4700: Foundations of Artificial Intelligence
Carla P. Gomes
CS4700
Can we learn
how counties vote?
Decision Trees:
a sequence of tests;
a representation that is very natural for humans:
the style of many "How to" manuals.
Decision Tree
[Figure: a decision tree over the attributes Food, Speedy, and Price.
Food = great -> Speedy? (yes -> Yes, no -> No);
Food = mediocre -> Price? (adequate -> Yes, high -> No);
Food = yuck -> No.]
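To make the "sequence of tests" idea concrete, here is a minimal Python sketch; the nested-dict representation and the attribute/value names are my own reading of the figure above, not code from the slides.

# The tree above encoded as nested dicts; leaves are class labels.
tree = {
    "attribute": "Food",
    "branches": {
        "great":    {"attribute": "Speedy",
                     "branches": {"yes": "Yes", "no": "No"}},
        "mediocre": {"attribute": "Price",
                     "branches": {"adequate": "Yes", "high": "No"}},
        "yuck":     "No",
    },
}

def classify(node, example):
    """Follow the sequence of tests from the root down to a leaf."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(tree, {"Food": "great", "Speedy": "no"}))         # -> No
print(classify(tree, {"Food": "mediocre", "Price": "adequate"})) # -> Yes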
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:
[Table: 12 examples, 6 positive and 6 negative.]
Decision trees
One possible representation for hypotheses
E.g., here is a tree for deciding whether to wait:
Any particular decision tree hypothesis for the WillWait goal predicate can be seen as a disjunction of conjunctions of tests, i.e., an assertion of the form:

∀s WillWait(s) ⇔ (P1(s) ∨ P2(s) ∨ ... ∨ Pn(s)),

where each Pi(s) is the conjunction of attribute tests along one path from the root to a leaf labeled Yes.
Expressiveness
Decision trees can express any Boolean function of the input attributes.
E.g., for Boolean functions: each row of the truth table corresponds to a path from the root to a leaf:
[Figure: a truth table over 10 Boolean attributes. Each of its 2^10 = 1024 rows (0000000000, 0000000001, ..., 1111111111) has an output entry that can be 0 or 1.]
So how many Boolean functions with 10 Boolean attributes are there, given that each entry can be 0/1?
2^(2^10) = 2^1024 ≈ 1.8 × 10^308
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions of n attributes
= 2^(2^n)
Expressiveness: Boolean functions with 2 attributes → 2^(2^2) = 16 decision trees
[Figure: decision trees over attributes A and B for AND, NAND, OR, NOR, XOR, XNOR, and NOT A.]
Expressiveness: Boolean functions with 2 attributes → 2^(2^2) = 16 decision trees (cont'd)
[Figure: decision trees over attributes A and B for A AND NOT B, A OR NOT B, NOT A OR B, NOT A AND B, NOT B, TRUE, and FALSE.]
Attributes:
Food, with values g (great), m (mediocre), y (yuck)
Speedy?, with values y, n
Price, with values a (adequate), h (high)
Let's build our decision tree starting with the attribute Food (3 possible values: g, m, y).
10 examples: 6 positive, 4 negative.
[Figure: the tree grown on these 10 examples by splitting first on Food; the Food = y branch is immediately labeled No, while the remaining examples are further separated by Speedy? and Price (a -> Yes, h -> No).]
Top-Down Induction of Decision Trees (simplified)

Training data: D = {(x1, y1), ..., (xn, yn)}

TDIDT(D, c_def)
IF all examples in D have the same class c
  return a leaf with class c (or class c_def, if D is empty)
ELSE IF no attributes are left to test
  return a leaf with the class c of the majority in D
ELSE
  pick A as the best decision attribute for the next node
  FOR each value v_i of A, create a new descendant of the node with the examples
    D_i = {(x, y) ∈ D : attribute A of x has value v_i}
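Below is a compact, runnable Python sketch of the TDIDT procedure above. The function and variable names are my own, the attribute-selection heuristic is passed in as a parameter (e.g., the information-gain criterion defined later), and branches are created only for attribute values that actually occur in D.

from collections import Counter

def tdidt(examples, attributes, default_class, choose_attribute):
    """examples: list of (x, y) pairs, where x is a dict of attribute values and y a class label."""
    if not examples:
        return default_class                          # empty D -> leaf with c_def
    classes = [y for _, y in examples]
    if len(set(classes)) == 1:
        return classes[0]                             # all examples have the same class c
    if not attributes:
        return Counter(classes).most_common(1)[0][0]  # no attributes left -> majority class
    A = choose_attribute(examples, attributes)        # pick the best decision attribute
    majority = Counter(classes).most_common(1)[0][0]
    node = {"attribute": A, "branches": {}}
    for v in {x[A] for x, _ in examples}:             # one descendant per observed value v_i of A
        D_i = [(x, y) for x, y in examples if x[A] == v]
        node["branches"][v] = tdidt(D_i, [a for a in attributes if a != A],
                                    majority, choose_attribute)
    return node

The returned tree uses the same nested-dict form as the classification sketch earlier, so the two can be used together.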
Choosing an attribute:
Information Gain
Goal: trees with short paths to the leaf nodes.
Information gain is the most useful measure in classification:
it measures the worth of an attribute, i.e., how well the attribute separates the examples according to their classification.
Next: a precise definition of gain.
Information
Information answers questions.
The more clueless I am about a question, the more information the answer contains.
Example: a fair coin, prior <0.5, 0.5>.
By definition, the information of the prior (or entropy of the prior) is:
I(P1, P2) = -P1 log2(P1) - P2 log2(P2)
I(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1
We need 1 bit to convey the outcome of the flip of a fair coin.
Information
(or Entropy)
Information in an answer, given n possible answers v1, v2, ..., vn with probabilities P(v1), ..., P(vn):
I(P(v1), ..., P(vn)) = - Σi P(vi) log2 P(vi)
(with the convention 0 log2(0) = 0)
[Figure: the binary entropy I(p, 1-p) as a function of p: it is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 1/2.]
The more uniform the probability distribution, the greater its entropy.
Information or
Entropy
Information (or entropy) measures the randomness of an arbitrary collection of examples.
We don't have exact probabilities, but our training data provides an estimate of the probabilities of positive vs. negative examples given a set of values for the attributes.
For a collection S with p positive and n negative examples, the entropy is given as:
I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
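As a quick sanity check, here is a small Python sketch of this entropy computation (the function name and usage are illustrative).

from math import log2

def entropy(p, n):
    """I(p/(p+n), n/(p+n)), using the convention 0 * log2(0) = 0."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            q = count / total
            result -= q * log2(q)
    return result

print(entropy(6, 6))   # 1.0 bit  (e.g., the 12-example training set with 6+ / 6-)
print(entropy(4, 0))   # 0.0 bits (a pure subset)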
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:
12 examples: 6 positive, 6 negative.
What's the entropy of this collection of examples?
Classification of examples is positive (T) or negative (F).
Choosing an attribute:
Information Gain
Intuition: pick the attribute that reduces the entropy (uncertainty) the most.
So we measure the information gain obtained by testing a given attribute A (defined precisely below).
Choosing an attribute:
Information Gain
Remainder(A) gives us the amount of information we still need after testing on A.
Assume A divides the training set E into subsets E1, E2, ..., Ev, corresponding to the v distinct values of A.
Each subset Ei has pi positive examples and ni negative examples.
So, for the total information content, we weigh the contribution of each subset induced by A by the fraction of examples it contains, (pi + ni)/(p + n):
Remainder(A) = Σ(i=1..v) (pi + ni)/(p + n) * I(pi/(pi+ni), ni/(pi+ni))
Choosing an attribute:
Information Gain
Gain measures the expected reduction in entropy: the higher the information gain (IG), or just Gain, with respect to an attribute A, the greater the expected reduction in entropy.
Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)
Interpretations of gain
Gain(S, A):
the expected reduction in entropy caused by knowing the value of A
the information provided about the target function's value, given the value of A
the number of bits saved when encoding the class of a member of S, knowing the value of A
Information gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
Consider the attributes Type and Patrons: Gain(Patrons) ≈ 0.541 bits, whereas Gain(Type) = 0 bits.
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
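The following Python sketch reproduces these two gains. The per-value (positive, negative) counts for Patrons and Type are taken from the textbook restaurant data, which is not reproduced on these slides, so treat them as an assumption.

from math import log2

def entropy(p, n):
    # I(p/(p+n), n/(p+n)), with 0 * log2(0) = 0
    return sum(-c / (p + n) * log2(c / (p + n)) for c in (p, n) if c > 0)

def remainder(splits):
    # splits: list of (p_i, n_i) counts, one pair per attribute value
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy(p, n) for p, n in splits)

def gain(splits):
    p = sum(p_i for p_i, _ in splits)
    n = sum(n_i for _, n_i in splits)
    return entropy(p, n) - remainder(splits)

patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full  (assumed textbook counts)
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger
print(round(gain(patrons), 3))  # ~0.541 bits
print(round(gain(type_), 3))    # 0.0 bits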
Example (cont'd.)
Decision tree learned from the 12 examples:
[Figure: the learned tree ("SR's tree"), rooted at Patrons.]
Substantially simpler than the true tree: a more complex hypothesis isn't justified by so little data.
Inductive Bias
Roughly: prefer
shorter trees over longer ones
trees that place high-gain attributes near the root
Difficult to characterize precisely:
the attribute selection heuristic
interacts closely with the given training data
Evaluation Methodology
Peeking
Example of peeking:
We generate four different hypotheses, for example by using different criteria to pick the next attribute to branch on.
We test the performance of the four different hypotheses on the test set and select the best one.
Voilà: peeking occurred!
The hypothesis was selected on the basis of its performance on the test set, so information about the test set has leaked into the learning algorithm.
Evaluation Methodology
Standard methodology:
1. Collect a large set of examples.
2. Randomly divide the collection into two disjoint sets: a training set and a test set.
3. Apply the learning algorithm to the training set, generating hypothesis h.
4. Measure the performance of h w.r.t. the test set (a form of cross-validation).
Important: keep the training and test sets disjoint! No peeking!
5. To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different sizes of training sets and for different randomly selected training sets of each size.
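A minimal Python sketch of steps 2-5 above; the learn and accuracy functions are placeholders to be supplied by the caller.

import random

def evaluate(examples, learn, accuracy, train_fraction=0.8, trials=20):
    """Repeat steps 2-4: random disjoint train/test split, learn h on the training
    set only, measure h on the held-out test set; average over several splits."""
    scores = []
    for _ in range(trials):
        data = examples[:]
        random.shuffle(data)                  # step 2: random split into disjoint sets
        cut = int(train_fraction * len(data))
        train, test = data[:cut], data[cut:]
        h = learn(train)                      # step 3: hypothesis from the training set
        scores.append(accuracy(h, test))      # step 4: performance on the test set
    return sum(scores) / len(scores)

# Step 5: call evaluate() with different train_fraction values to trace a learning curve.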
Test/Training Split
[Diagram: data D = {(x1, y1), ..., (xn, yn)} is drawn randomly from the real-world process, then split randomly into training data Dtrain, which is given to the learner, and held-out test data used to measure performance.]
Performance Measures
Error Rate
Fraction (or percentage) of false predictions
Accuracy
Fraction (or percentage) of correct predictions
Precision/Recall
Applies only to binary classification problems (classes pos/neg)
Precision: Fraction (or percentage) of correct predictions among
all examples predicted to be positive
Recall: Fraction (or percentage) of correct predictions among all
real positive examples
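A short Python sketch computing these four measures from true vs. predicted labels for the binary case (the function name and the "pos"/"neg" label encoding are illustrative):

def performance(y_true, y_pred, positive="pos"):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    error_rate = 1 - accuracy
    # Precision: correct predictions among all examples predicted to be positive
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    precision = (sum(t == positive for t in predicted_pos) / len(predicted_pos)
                 if predicted_pos else 0.0)
    # Recall: correct predictions among all real positive examples
    real_pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    recall = (sum(p == positive for p in real_pos) / len(real_pos)
              if real_pos else 0.0)
    return {"error_rate": error_rate, "accuracy": accuracy,
            "precision": precision, "recall": recall}

print(performance(["pos", "pos", "neg", "neg"], ["pos", "neg", "pos", "neg"]))
# accuracy 0.5; precision 0.5 (1 of 2 predicted positives is correct); recall 0.5 (1 of 2 real positives found)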
Prediction quality:
Average proportion correct on the test set, as a function of training-set size.
Restaurant example: learning curve
[Figure: learning curve showing test-set accuracy improving as the number of training examples grows.]
Many case studies have shown that decision trees are at least as accurate as human experts.
Summary