
ARTIFICIAL INTELLIGENCE
Chapter 18: Learning from Examples

Instructor: Dr Ghulam Mustafa


Department of Information Technology
PUGC
In which we describe agents that can improve their behavior through diligent study of their own experiences.
Outline
• Learning agents
• Supervised/Inductive learning
• Decision tree learning
Learning
• Learning: a process for improving the performance of an agent through experience
Why would we want an agent to learn?
• Learning is essential for unknown environments,
  – i.e., when designer lacks omniscience
• Learning is useful as a system construction method,
  – i.e., expose the agent to reality rather than trying to write it down
• Learning modifies the agent's decision mechanisms to improve performance
Learning agents
Forms of Learning
Any component of an agent can be improved by
learning from data. The improvements, and the
techniques used to make them, depend on four
major factors:

– Which component is to be improved.


– What prior knowledge the agent already has.
– What representation is used for the data and the component.
– What feedback is available to learn from.
Components to be learned
• A direct mapping from conditions on the current state
to actions.
• A means to infer relevant properties of the world from
the percept sequence.
• Information about the way the world evolves and about
the results of possible actions the agent can take.
• Utility information indicating the desirability of world
states.
• Action-value information indicating the desirability of actions.
• Goals that describe classes of states whose
achievement maximizes the agent’s utility.
Representation and prior knowledge
• Inductive learning: learning a general function or rule from specific input–output pairs.
• Analytical or deductive learning: going from a known general rule to a new rule that is logically entailed.
• Propositional and first-order logical sentences for the components in a logical agent.
• Bayesian networks for the inferential components of a decision-theoretic agent.
• Factored representation: a vector of attribute values, with outputs that can be either a continuous numerical value or a discrete value.
Feedback to learn from
• Unsupervised learning/Clustering: the agent learns patterns in the input even though no explicit feedback is supplied. For example, a taxi agent might gradually develop a concept of "good traffic days" and "bad traffic days".
Feedback to learn from
• Reinforcement learning: the agent learns from a series of reinforcements: rewards or punishments. For example, the lack of a tip at the end of the journey gives the taxi agent an indication that it did something wrong. The two points for a win at the end of a chess game tell the agent it did something right.
Feedback to learn from
• Supervised learning: the agent observes some example input–output pairs and learns a function that maps from input to output. In component 1 above, the inputs are percepts and the outputs are provided by a teacher who says "Brake!" or "Turn left."
• Semi-supervised learning: we are given a few labeled examples and must make what we can of a large collection of unlabeled examples.
Supervised Learning
• Given a training set of N example input–output pairs
  (x1, y1), (x2, y2), . . . , (xN, yN)
• Discover a function h that approximates the true function f.
• The function h is a hypothesis.
• Learning is a search through the space of possible hypotheses for one that will perform well.
• To measure the accuracy of a hypothesis, we give it a test set of examples that are distinct from the training set.
• Problem: find a hypothesis h
– such that h ≈ f
– given a training set of examples
• We say a hypothesis generalizes well if it correctly
predicts the value of y for novel examples.
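A minimal sketch of this search view of learning (the hypothesis space, the data values, and names such as error_rate are invented for illustration):

```python
training_set = [(1, 2), (2, 4), (3, 6), (4, 8)]   # (x, y) pairs with y = f(x)
test_set     = [(5, 10), (6, 12)]                 # novel examples

# A tiny hypothesis space H: each h maps an input x to a predicted y.
H = {
    "h1: y = x + 1": lambda x: x + 1,
    "h2: y = 2x":    lambda x: 2 * x,
    "h3: y = x**2":  lambda x: x ** 2,
}

def error_rate(h, examples):
    """Proportion of examples on which h(x) != y."""
    return sum(h(x) != y for x, y in examples) / len(examples)

# Learning = search H for the hypothesis with the lowest training error.
best_name, best_h = min(H.items(), key=lambda kv: error_rate(kv[1], training_set))

print(best_name)                        # h2: y = 2x
print(error_rate(best_h, test_set))     # 0.0 -> this h also generalizes well
```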
• CLASSIFICATION
  – When the output y is one of a finite set of values (such as sunny, cloudy, or rainy), the learning problem is called classification; it is called Boolean or binary classification if there are only two values.
• REGRESSION
  – When y is a number (such as tomorrow's temperature), the learning problem is called regression.
Supervised Learning
• The line is called a consistent hypothesis because it agrees with all the data.
• The hypothesis space, H, is the set of polynomials from which we select the best one.
• How do we choose from among multiple consistent hypotheses?
• Ockham’s razor: Prefer the simplest
hypothesis consistent with the data.
• Tradeoff between complex hypotheses that fit
the training data well and simpler hypotheses
that may generalize better.
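The polynomial case can be made concrete with numpy (the data values here are invented for illustration): a high-degree polynomial can be consistent with the training data while the simpler linear hypothesis generalizes better.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2 * x + 1 + rng.normal(scale=0.1, size=x.size)   # roughly linear data plus noise

for degree in (1, 6):
    coeffs = np.polyfit(x, y, degree)                # fit a degree-d polynomial
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # The degree-6 fit tracks the training points almost exactly but can swing
    # wildly at a novel input; the degree-1 fit is the simpler hypothesis that
    # Ockham's razor prefers.
    print(degree, round(train_mse, 5), round(np.polyval(coeffs, 1.2), 2))
```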
Learning Decision Trees
• Decision tree induction is one of the simplest
and yet most successful forms of machine
learning.
• A decision tree represents a function that
takes as input a vector of attribute values and
returns a “decision”—a single output value.
• A decision tree reaches its decision by performing a sequence of tests: an internal node corresponds to a test, and a leaf node corresponds to a value to be returned by the function.
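A minimal sketch of that representation (the Node/Leaf classes are hypothetical, and the tree below is an illustrative fragment, not the full tree from the book's figure):

```python
class Leaf:
    def __init__(self, value):
        self.value = value
    def decide(self, example):
        return self.value               # a leaf returns its output value

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute      # name of the attribute to test
        self.branches = branches        # dict: attribute value -> subtree
    def decide(self, example):
        # Test the attribute, then follow the matching branch.
        return self.branches[example[self.attribute]].decide(example)

# An illustrative fragment of a restaurant tree: test Patrons first.
tree = Node("Patrons", {
    "None": Leaf("No"),
    "Some": Leaf("Yes"),
    "Full": Node("Hungry", {"Yes": Leaf("Yes"), "No": Leaf("No")}),
})

print(tree.decide({"Patrons": "Full", "Hungry": "No"}))   # -> No
```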
Example
Problem: build a decision tree to decide whether to wait
for a table at a restaurant based on the following
attributes. The aim here is to learn a definition for the
goal predicate WillWait. List the attributes that we will
consider as part of the input:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0–10, 10–30, 30–60, >60)
Expressiveness
• A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value true:
  Goal ⇔ (Path1 ∨ Path2 ∨ · · ·)
• Any function in propositional logic can be expressed as a decision tree.
• As an example, the rightmost path in Figure 18.2 is
  Path = (Patrons = Full ∧ WaitEstimate = 0–10).
• Decision trees are good for some kinds of functions and
bad for others.
Expressiveness
• Consider the set of all Boolean functions on n attributes. How many different functions are in this set? Each function is a truth table with 2^n rows, and each row can be labeled true or false, so there are 2^(2^n) distinct functions.
• For example, with just the ten Boolean attributes of our restaurant problem there are 2^1024, or about 10^308, different functions to choose from, and for 20 attributes there are over 10^300000.
• We need some ingenious algorithms to find good hypotheses in such a large space.
Example Training Data
Inducing decision trees from examples
• We want a tree that is consistent with the
examples and is as small as possible
• It is an intractable problem to find the smallest consistent tree.
• With heuristics, we can find a good approximate solution.
• The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first.
Inducing decision trees from examples

• "Most important attribute" means the one that makes the most difference.
• Type is a poor attribute, because it leaves us with four possible outcomes, each with equal numbers of positive and negative examples.
• Patrons is an important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes, respectively). If the value is Full, we are left with a mixed set of examples.
Inducing decision trees from examples

Decision tree learning is a recursive problem. There are four cases to consider in the recursion (a sketch of the algorithm follows this list):
• If the remaining examples are all positive or all negative, we are done.
• If there are some positive and some negative examples, choose the best attribute to split on.
• If there are no examples left, select the parent's default (plurality) value.
• If no attributes are left but there are both positive and negative examples, this is due to noise in the data; return the plurality classification of the remaining examples.
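A sketch of the recursion in Python, in the spirit of the DECISION-TREE-LEARNING pseudocode. It reuses the Leaf/Node classes sketched earlier and assumes (for illustration) that examples are dicts with a "WillWait" goal attribute, that `attributes` is a set, and that the attribute-selection measure (e.g., information gain) is passed in as `importance`:

```python
from collections import Counter

def plurality_value(examples):
    """Most common output value among the examples (ties broken arbitrarily)."""
    return Counter(e["WillWait"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples, importance):
    # Case: no examples left -> fall back to the parent's plurality value.
    if not examples:
        return Leaf(plurality_value(parent_examples))
    classes = {e["WillWait"] for e in examples}
    # Case: remaining examples are all positive or all negative -> done.
    if len(classes) == 1:
        return Leaf(classes.pop())
    # Case: no attributes left but mixed classes -> noise; take the plurality.
    if not attributes:
        return Leaf(plurality_value(examples))
    # Case: mixed examples -> greedily split on the most important attribute.
    A = max(attributes, key=lambda a: importance(a, examples))
    branches = {}
    for value in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == value]
        branches[value] = decision_tree_learning(
            exs, attributes - {A}, examples, importance)
    return Node(A, branches)
```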
Performance measurement
How do we know that h ≈ f ?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use same distribution over example space as training set)
Learning curve = % correct on test set as a function of training set size
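A hedged sketch of computing such a curve (the learner and the example format are placeholders; accuracy is averaged over random training subsets at each size):

```python
import random

def learning_curve(learner, examples, test_set, sizes, trials=20):
    """% correct on the test set, averaged over random training subsets."""
    curve = []
    for m in sizes:
        accs = []
        for _ in range(trials):
            h = learner(random.sample(examples, m))   # train on a size-m subset
            accs.append(sum(h(x) == y for x, y in test_set) / len(test_set))
        curve.append((m, sum(accs) / trials))
    return curve
```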
Performance measurement contd.
• Learning curve depends on
  – realizable (can express target function) vs. non-realizable; non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., thresholded linear function)
  – redundant expressiveness (e.g., loads of irrelevant attributes)
Choosing attribute tests
• The greedy search used in decision tree learning is designed to approximately minimize the depth of the final tree.
• The idea is to pick the attribute that goes as far as possible toward providing an exact classification of the examples.
• A perfect attribute divides the examples into sets, each of which is all positive or all negative and thus will be a leaf of the tree.
Random Variable
• A random variable is a function that associates a real number with each element in the sample space.
• We shall use a capital letter, say X, to denote a random variable and its corresponding small letter, x in this case, for one of its values.
Random Variable Example
• If two coins are tossed simultaneously, the possible outcomes and the corresponding values x of the random variable X, where X is the number of heads, are:

  Sample Space   x
  HH             2
  HT             1
  TH             1
  TT             0
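The same sample space can be enumerated directly:

```python
from itertools import product

# Enumerate the two-coin sample space and the value of X = number of heads.
for outcome in product("HT", repeat=2):
    print("".join(outcome), outcome.count("H"))
# HH 2, HT 1, TH 1, TT 0
```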
Entropy
• We will use the notion of information gain, which is defined in terms of entropy.
• Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to a reduction in entropy.
• A random variable with only one value—a coin that always comes up heads—has no uncertainty, and thus its entropy is defined as zero; hence there is no information gain.
Entropy
• A flip of a fair coin is equally likely to come up heads or tails, 0 or 1; this counts as "1 bit" of entropy.
• The roll of a fair four-sided die has 2 bits of entropy, because it takes two bits to describe one of four equally probable choices.
Entropy
Information Gain
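In standard form, the entropy of a random variable V with values v_k is H(V) = −Σ_k P(v_k) log2 P(v_k); for a Boolean variable that is true with probability q this reduces to B(q) = −(q log2 q + (1 − q) log2(1 − q)). For an attribute A that splits p positive and n negative examples into subsets with p_k positive and n_k negative examples, the information gain is Gain(A) = B(p/(p + n)) − Σ_k ((p_k + n_k)/(p + n)) B(p_k/(p_k + n_k)). A small sketch using the restaurant counts:

```python
import math

def B(q):
    """Entropy in bits of a Boolean variable that is true with probability q."""
    if q in (0, 1):
        return 0.0                      # no uncertainty -> zero entropy
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(splits, p, n):
    """Information gain of an attribute that divides p positive and n negative
    examples into subsets with counts (p_k, n_k)."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder

# Restaurant data: 6 positive and 6 negative examples.
# Patrons -> None (0+,2-), Some (4+,0-), Full (2+,4-):
print(round(gain([(0, 2), (4, 0), (2, 4)], 6, 6), 3))          # ~0.541
# Type -> four groups, each with equal positives and negatives:
print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6), 3))  # 0.0
```

Patrons gains about 0.541 bits while Type gains none, matching the earlier observation that Patrons is the better first test.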
Generalization and overfitting
• Generalization describes a model's ability to react to new/unseen data.
• Overfitting refers to a model that models the training data too well. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
• Overfitting grows as the hypothesis space and the number of input attributes grow.
• Overfitting becomes less likely as we increase the number of training examples.
Generalization and Overfitting
• Decision tree pruning combats overfitting.
• Pruning works by eliminating nodes that are not clearly relevant, i.e., whose information gain is equal or close to zero.
• How large a gain should we require in order to split on a particular attribute?
• A significance test begins by assuming that there is no underlying pattern (the so-called null hypothesis).
Generalization and Overfitting
• Then the actual data are analyzed to calculate the extent to which they deviate from a perfect absence of pattern.
• If the degree of deviation is statistically unlikely (5% or less), then that is considered to be good evidence for the presence of a significant pattern in the data.
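A hedged sketch of such a test for an attribute split, assuming scipy is available (the expected counts under the null hypothesis follow the chi-squared pruning scheme; names are illustrative):

```python
from scipy.stats import chi2

def chi_squared_deviation(splits, p, n):
    """Total deviation of a split from the 'no underlying pattern' null
    hypothesis. splits lists (p_k, n_k) counts, one pair per attribute value."""
    delta = 0.0
    for pk, nk in splits:
        # Expected positive/negative counts if the attribute were irrelevant:
        p_hat = p * (pk + nk) / (p + n)
        n_hat = n * (pk + nk) / (p + n)
        delta += (pk - p_hat) ** 2 / p_hat + (nk - n_hat) ** 2 / n_hat
    return delta

def split_is_significant(splits, p, n, alpha=0.05):
    # Degrees of freedom: number of attribute values minus 1.
    return chi_squared_deviation(splits, p, n) > chi2.ppf(1 - alpha, len(splits) - 1)

# Patrons on the 12 restaurant examples: deviation ~6.67 > 5.99, keep the split.
print(split_is_significant([(0, 2), (4, 0), (2, 4)], 6, 6))   # True
```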
Broadening the applicability of decision trees
• In practice decision tree learning also has to answer the following questions:
  – Missing attribute values: while learning and in classifying instances
  – Multivalued discrete attributes: value subsetting or penalizing against too many values
  – Numerical attributes: split point selection for interval division
  – Continuous-valued output attributes
• Decision trees are used widely and many good implementations are available.
• Decision trees fulfill understandability, contrary to neural networks; understandability is a legal requirement for financial decisions.
Evaluating and Choosing the Best Hypothesis
• We assume that there is a probability distribution over examples that remains stationary over time.
  – Each observed value is sampled from that distribution and is independent of previous examples, and
  – each example has an identical prior probability distribution.
• Examples that satisfy these assumptions are called independent and identically distributed (i.i.d.).
• The error rate of a hypothesis h is the proportion of mistakes it makes:
  – the proportion of times that h(x) ≠ y for an (x, y) example.
• Just because a hypothesis h has a low error rate on the training set does not mean that it will generalize well.
Model selection: Complexity vs. goodness of fit
• We can think of finding the best hypothesis as two tasks:
  – Model selection defines the hypothesis space, and
  – optimization finds the best hypothesis within that space.
• How to select among models that are parameterized by size?
  – With polynomials we have size = 1 for linear functions, size = 2 for quadratics, and so on.
  – For decision trees, the size could be the number of nodes in the tree.
• We want to find the value of the size parameter that best balances underfitting and overfitting to give the best test set accuracy.
Model selection: Complexity vs. goodness of fit
• A wrapper takes a learning algorithm as an argument (DT learning, for example).
• The wrapper enumerates models according to the size parameter.
• For each size, it uses cross-validation (say) on the learner to compute the average error rate on training and test sets.
• We start with the smallest, simplest models (which probably underfit the data), and iterate, considering more complex models at each step, until the models start to overfit.
• The cross-validation picks the value of size with the lowest validation set error.
• We then generate a hypothesis of that size using all the data (without holding out any of it; eventually we should evaluate the returned hypothesis on a separate test set).
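A minimal sketch of such a wrapper, assuming a `learner(size, examples)` callable that returns a hypothesis h usable as h(x) (all names are illustrative):

```python
import random

def cross_validation_error(learner, size, examples, k=10):
    """Average validation-set error over k folds (assumes len(examples) >= k)."""
    examples = examples[:]              # shuffle a copy so folds are random
    random.shuffle(examples)
    fold = len(examples) // k
    total = 0.0
    for i in range(k):
        val = examples[i * fold:(i + 1) * fold]
        train = examples[:i * fold] + examples[(i + 1) * fold:]
        h = learner(size, train)
        total += sum(h(x) != y for x, y in val) / len(val)
    return total / k

def model_selection(learner, examples, max_size=20):
    """Pick the size with the lowest cross-validation error, then retrain on all data."""
    best_size = min(range(1, max_size + 1),
                    key=lambda s: cross_validation_error(learner, s, examples))
    return learner(best_size, examples)
```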
From error rates to loss
• Consider the problem of classifying emails as spam or non-spam.
• It is worse to classify non-spam as spam than to classify spam as non-spam.
• So a classifier with a 1% error rate, where almost all errors were classifying spam as non-spam, would be better than a classifier with only a 0.5% error rate, if most of those errors were classifying non-spam as spam.
• Utility is what learners, like decision makers, should maximize.
• In machine learning it is traditional to express utilities by means of loss functions.
• The loss function L(x, y, ŷ) is defined as the amount of utility lost by predicting h(x) = ŷ when the correct answer is f(x) = y:
  L(x, y, ŷ) = U(result of using y given an input x) − U(result of using ŷ given an input x)
From error rates to loss
• Often a simplified version of the loss function is used, e.g., it is 10 times worse to classify non-spam as spam than vice versa:
  L(spam, nonspam) = 1, L(nonspam, spam) = 10
• Note that L(y, y) is always zero.
• In general, for real-valued data small errors are better than large ones.
• Two functions that implement that idea are the absolute value of the difference (called the L1 loss) and the square of the difference (called the L2 loss).
• Minimizing error rate is formulated in the L0/1 loss function.
  Absolute-value loss: L1(y, ŷ) = |y − ŷ|
  Squared-error loss:  L2(y, ŷ) = (y − ŷ)²
  0/1 loss:            L0/1(y, ŷ) = 0 if y = ŷ, else 1
From error rates to loss
• Let P(X, Y) be a prior probability distribution over examples.
• Let E be the set of all possible input–output examples.
• Then the expected generalization loss for a hypothesis h (with respect to a loss function L) is
  GenLoss_L(h) = Σ_{(x,y)∈E} L(y, h(x)) P(x, y)
• The best hypothesis h* is the one with the minimum expected generalization loss:
  h* = argmin_{h∈H} GenLoss_L(h)
• Because P(x, y) is not known, the learning agent can only estimate the generalization loss with the empirical loss on a set of examples E of size N:
  EmpLoss_{L,E}(h) = (1/N) Σ_{(x,y)∈E} L(y, h(x))
• The estimated best hypothesis is then the one with minimum empirical loss:
  ĥ* = argmin_{h∈H} EmpLoss_{L,E}(h)
Regularization
• Earlier on we did model selection with cross-validation on model size.
• An alternative approach is to search for a hypothesis that directly minimizes the weighted sum of empirical loss and the complexity of the hypothesis, which we call the total cost:
  Cost(h) = EmpLoss(h) + λ Complexity(h)
  ĥ* = argmin_{h∈H} Cost(h)
• Here λ is a parameter, a positive number that serves as a conversion rate between loss and hypothesis complexity.
• We still need to do a cross-validation search to find the hypothesis that generalizes best, but this time with different values of λ.
• This process of explicitly penalizing complex hypotheses is called regularization.
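A minimal sketch of selection by total cost, assuming candidate hypotheses, a loss function, and a complexity measure are supplied (all names illustrative):

```python
def total_cost(h, examples, loss, complexity, lam):
    """Empirical loss plus lambda times hypothesis complexity."""
    emp = sum(loss(y, h(x)) for x, y in examples) / len(examples)
    return emp + lam * complexity(h)

def regularized_best(hypotheses, examples, loss, complexity, lam=0.1):
    # lam trades off fit against simplicity; in practice it is itself
    # chosen by cross-validation, as noted above.
    return min(hypotheses,
               key=lambda h: total_cost(h, examples, loss, complexity, lam))
```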
