Chapter 18: Learning from Examples
• For example, with just the ten Boolean attributes of our restaurant problem there are 2^1024, or about 10^308, different functions to choose from (a Boolean function of n attributes is a truth table with 2^n rows, and each row can be filled in 2 ways, giving 2^(2^n) functions); for 20 attributes there are over 10^300000.
• We need some ingenious algorithms to find good hypotheses in such a large space.
Example Training Data
Inducing decision trees from examples
• We want a tree that is consistent with the examples and is as small as possible.
• Finding the smallest consistent tree is an intractable problem.
• With heuristics, we can find a good approximate solution.
• The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first (see the sketch below).
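A minimal Python sketch of this greedy strategy, under assumptions not in the slides: examples are dicts with a 'label' key for the output, and `importance` is a scoring function, typically the information gain defined later in this chapter.

```python
from collections import Counter

def plurality_value(examples):
    """Most common label among the examples (ties broken arbitrarily)."""
    return Counter(e['label'] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, importance, parent_examples=()):
    """Greedy divide-and-conquer induction: split on the most important
    attribute first, then recurse on each subset of the examples."""
    if not examples:
        return plurality_value(parent_examples)
    if len({e['label'] for e in examples}) == 1:
        return examples[0]['label']            # all examples agree: leaf
    if not attributes:
        return plurality_value(examples)       # attributes exhausted: leaf
    a = max(attributes, key=lambda attr: importance(examples, attr))
    tree = {a: {}}
    for v in {e[a] for e in examples}:         # simplification: only values seen in the data
        subset = [e for e in examples if e[a] == v]
        remaining = [x for x in attributes if x != a]
        tree[a][v] = decision_tree_learning(subset, remaining, importance, examples)
    return tree
```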
Inducing decision trees from examples
• We shall use a capital letter, say X, to denote a random variable, and its corresponding small letter, x in this case, for one of its values.
Random Variable Example
• Two coins are tossed simultaneously; the possible outcomes and the values x of the random variable X, where X is the number of heads, are:
Sample Space x
HH 2
HT 1
TH 1
TT 0
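• Since the four outcomes are equally likely, this gives P(X = 0) = 1/4, P(X = 1) = 1/2, and P(X = 2) = 1/4.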
Entropy
• We will use the notion of information gain, which is defined in terms of entropy.
• Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to a reduction in entropy.
• A random variable with only one value (a coin that always comes up heads) has no uncertainty, so its entropy is defined as zero; there is thus no information to gain.
Entropy
• A flip of a fair coin is equally likely to come up heads or tails, 0 or 1; this counts as “1 bit” of entropy.
• The roll of a fair four-sided die has 2 bits of entropy, because it takes two bits to describe one of four equally probable choices.
Entropy
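The formula on this slide did not survive extraction; the standard definition (as in AIMA) for a random variable V with values v_k is:

H(V) = \sum_k P(v_k) \log_2 \frac{1}{P(v_k)} = -\sum_k P(v_k) \log_2 P(v_k)

For a Boolean variable that is true with probability q this specializes to B(q) = -(q \log_2 q + (1 - q) \log_2 (1 - q)).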
Information Gain
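This slide's formula is likewise missing; AIMA defines the gain from testing an attribute as the entropy of the output before the test minus the expected entropy after it (the "remainder"). A minimal Python sketch, reusing the example-as-dict representation assumed earlier:

```python
import math
from collections import Counter

def entropy(probabilities):
    """H = -sum_k p_k * log2(p_k), ignoring zero-probability values."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(examples, attr):
    """Entropy of the labels minus the expected label entropy
    after splitting the examples on attr (the 'remainder')."""
    def label_entropy(exs):
        counts = Counter(e['label'] for e in exs)
        return entropy(c / len(exs) for c in counts.values())

    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / len(examples) * label_entropy(subset)
    return label_entropy(examples) - remainder
```

As a check against the previous slide, entropy([0.5, 0.5]) is 1.0 (the fair coin) and entropy([0.25] * 4) is 2.0 (the four-sided die).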
Generalization and overfitting
• Generalization describes a model’s ability to react to new/unseen data.
• Overfitting refers to a model that models the training data too well. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
• Overfitting grows as the hypothesis space and the number of input attributes grow.
• It is less likely as we increase the number of training examples.
Generalization and Overfitting
• Decision tree pruning combats overfitting.
• Pruning works by eliminating nodes that are not clearly relevant, i.e. whose information gain is equal or close to zero.
• How large a gain should we require in order to split on a particular attribute?
• A significance test begins by assuming that there is no underlying pattern (the so-called null hypothesis).
Generalization and Overfitting
• Then the actual data are analyzed to calculate the extent to which they deviate from a perfect absence of pattern.
• If the degree of deviation is statistically unlikely (5% or less), then that is considered to be good evidence for the presence of a significant pattern in the data.
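A sketch of this test for one candidate split, in the spirit of AIMA's χ² pruning (the counts-based interface is an assumption for illustration): with p positive and n negative examples at a node, the null hypothesis predicts that each child of the split inherits the same positive/negative proportions, and the statistic below totals the deviation from that prediction.

```python
def chi_squared_statistic(p, n, children):
    """Total deviation of a split from the null hypothesis of no pattern.

    p, n      -- positive/negative example counts at the node
    children  -- list of (p_k, n_k) counts for each branch of the split
    """
    total = p + n
    delta = 0.0
    for p_k, n_k in children:
        size = p_k + n_k
        p_hat = p * size / total   # expected positives under the null hypothesis
        n_hat = n * size / total   # expected negatives under the null hypothesis
        if p_hat > 0:
            delta += (p_k - p_hat) ** 2 / p_hat
        if n_hat > 0:
            delta += (n_k - n_hat) ** 2 / n_hat
    return delta
```

The statistic is compared against a χ² distribution with (number of children − 1) degrees of freedom; a deviation that would arise less than 5% of the time under the null hypothesis counts as evidence of a real pattern.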
Broadening the applicability of decision trees
• In practice, decision tree learning also has to answer the following questions:
– Missing attribute values: both while learning and when classifying instances
– Multivalued discrete attributes: value subsetting, or penalizing against too many values
– Numerical attributes: split point selection for interval division (see the sketch after this slide)
– Continuous-valued output attributes
• Decision trees are used widely and many good implementations are available.
• Contrary to neural networks, decision trees fulfill understandability, which is a legal requirement for financial decisions.
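A sketch of one standard split-point selection scheme for a numerical attribute, reusing the information_gain helper and example-as-dict representation assumed earlier: sort the examples by the attribute, and consider as candidate thresholds the midpoints between consecutive values whose labels differ.

```python
def best_split_point(examples, attr):
    """Return the threshold on a numeric attribute with the highest
    information gain, considering only midpoints between consecutive
    attribute values whose labels differ."""
    ordered = sorted(examples, key=lambda e: e[attr])
    best_gain, best_threshold = -1.0, None
    for a, b in zip(ordered, ordered[1:]):
        if a['label'] != b['label'] and a[attr] != b[attr]:
            threshold = (a[attr] + b[attr]) / 2
            # Binarize the attribute at this threshold, then reuse information_gain.
            split = [{**e, attr: e[attr] <= threshold} for e in examples]
            gain = information_gain(split, attr)
            if gain > best_gain:
                best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain
```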
Evaluating and Choosing
the Best Hypothesis
• We assume that there is a probability distribution over examples that remains stationary over time.
– Each observed value is sampled from that distribution and is independent of previous examples, and
– Each example has an identical prior probability distribution.
• Examples that satisfy these assumptions are called independent and identically distributed (i.i.d.).
• The error rate of a hypothesis h is the proportion of mistakes it makes:
– The proportion of times that h(x) ≠ y for an (x, y) example (in symbols below).
• Just because a hypothesis h has a low error rate on the training set does not mean that it will generalize well.
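In symbols (left implicit on the slide), over a set of N examples:

\text{ErrorRate}(h) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\,h(x_i) \neq y_i\,]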
Model selection: Complexity vs.
goodness of fit
• We can think of finding the best hypothesis as two tasks:
– Model selection defines the hypothesis space, and
– Optimization finds the best hypothesis within that space.
• How to select among models that are parameterized by size?
– With polynomials we have size = 1 for linear functions, size = 2 for quadratics, and so on.
– For decision trees, the size could be the number of nodes in the tree.
• We want to find the value of the size parameter that best balances underfitting and overfitting to give the best test set accuracy.
Model selection: Complexity vs.
goodness of fit
• A wrapper takes a learning algorithm as an argument (DT learning, for example).
• The wrapper enumerates models according to the size parameter.
• For each size, it uses cross-validation (say) on the learner to compute the average error rate on the training and test sets.
• We start with the smallest, simplest models (which probably underfit the data), and iterate, considering more complex models at each step, until the models start to overfit (see the sketch below).
• The cross-validation picks the value of size with the lowest validation set error.
• We then generate a hypothesis of that size using all the data (without holding out any of it; eventually we should evaluate the returned hypothesis on a separate test set).
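A sketch of this wrapper under stated assumptions: `cross_validation_error(learner, size, examples, k)` is a hypothetical helper that trains k times on k−1 folds and averages the held-out error, and the stopping rule below is a crude stand-in for "until the models start to overfit".

```python
def model_selection(learner, examples, k=10, max_size=100):
    """Enumerate model sizes, score each by k-fold cross-validation,
    then refit the best size on all of the data."""
    best_size, best_err = None, float('inf')
    for size in range(1, max_size + 1):
        err = cross_validation_error(learner, size, examples, k)  # assumed helper
        if err < best_err:
            best_size, best_err = size, err
        elif err > 2 * best_err:   # errors have clearly started climbing: overfitting
            break
    return learner(best_size, examples)  # hypothesis of the chosen size, all the data
```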
From error rates to loss
• Consider the problem of classifying emails as spam or non-spam.
• It is worse to classify non-spam as spam than to classify spam as non-spam.
• So a classifier with a 1% error rate, where almost all the errors were classifying spam as non-spam, would be better than a classifier with only a 0.5% error rate, if most of those errors were classifying non-spam as spam.
• Utility is what learners, like decision makers, should maximize.
• In machine learning it is traditional to express utilities by means of loss functions.
• The loss function L(x, y, ŷ) is defined as the amount of utility lost by predicting h(x) = ŷ when the correct answer is f(x) = y:
L(x, y, ŷ) = U(result of using y given an input x) − U(result of using ŷ given an input x)
From error rates to loss
• Often a simplified version of the loss function is used. For example, it is 10 times worse to classify non-spam as spam than vice versa:
L(spam, nonspam) = 1, L(nonspam, spam) = 10
• Note that L(y, y) is always zero.
• In general, for real-valued data small errors are better than large ones.
• Two functions that implement that idea are the absolute value of the difference (called the L1 loss), and the square of the difference (called the L2 loss).
• Minimizing the error rate is formulated in the L0/1 loss function:
Absolute-value loss: L1(y, ŷ) = |y − ŷ|
Squared-error loss: L2(y, ŷ) = (y − ŷ)²
0/1 loss: L0/1(y, ŷ) = 0 if y = ŷ, else 1
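These three losses transcribe directly into Python (only the function names are mine):

```python
def l1_loss(y, y_hat):
    """Absolute-value loss: |y - y_hat|."""
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    """Squared-error loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def l01_loss(y, y_hat):
    """0/1 loss: 0 for a correct prediction, 1 otherwise."""
    return 0 if y == y_hat else 1
```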
From error rates to loss
• Let P(X, Y) be a prior probability distribution over examples.
• Let E be the set of all possible input-output examples.
• Then the expected generalization loss for a hypothesis h (w.r.t. a loss function L) is:
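The equation itself did not survive extraction; in AIMA's notation it is

GenLoss_L(h) = \sum_{(x,y) \in E} L(x, y, h(x)) \, P(x, y)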
• The estimated best hypothesis is then the one with minimum empirical
loss:
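Again reconstructing from AIMA: for a training set E of N examples,

EmpLoss_{L,E}(h) = \frac{1}{N} \sum_{(x,y) \in E} L(x, y, h(x)), \qquad \hat{h}^* = \operatorname{argmin}_{h \in \mathcal{H}} EmpLoss_{L,E}(h)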
Regularization
• Earlier on we did model selection with cross-validation on
model size
• An alternative approach is to search for a hypothesis that
directly minimizes the weighted sum of empirical loss and the
complexity of the hypothesis, which we call the total cost
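In AIMA's notation (the slide ends before the formula):

Cost(h) = EmpLoss(h) + \lambda \, Complexity(h), \qquad \hat{h}^* = \operatorname{argmin}_{h \in \mathcal{H}} Cost(h)

This process of explicitly penalizing complex hypotheses is called regularization.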