
Computational Learning Theory

1
A theory of the learnable
(Valiant ‘84)
 […] The problem is to discover good models that are
interesting to study for their own sake and that promise to
be relevant both to explaining human experience and to
building devices that can learn […] Learning machines must
have all 3 of the following properties:

 the machines can provably learn whole classes of concepts,
and these classes can be characterized

 the classes of concepts are appropriate and nontrivial for
general-purpose knowledge

 the computational process by which the machine builds the
desired programs requires a “feasible” (i.e. polynomial) number
of steps

2
A theory of the learnable
 We seek general laws that constrain inductive
learning, relating:

Probability of successful learning

Number of training examples

Complexity of hypothesis space

Accuracy to which target concept is approximated

Manner in which training examples are presented

3
Overview
 Are there general laws that govern learning?

Sample Complexity: How many training examples are needed for a
learner to converge (with high probability) to a successful hypothesis?

Computational Complexity: How much computational effort is
needed for a learner to converge (with high probability) to a
successful hypothesis?

Mistake Bound: How many training examples will the learner
misclassify before converging to a successful hypothesis?
 These questions will be answered within two analytical
frameworks:

The Probably Approximately Correct (PAC) framework

The Mistake Bound framework

4
Overview (Cont’d)
 Rather than answering these questions for
individual learners, we will answer them for
broad classes of learners. In particular we will
consider:
 The size or complexity of the hypothesis space
considered by the learner.
 The accuracy to which the target concept must be
approximated.
 The probability that the learner will output a
successful hypothesis.
 The manner in which training examples are
presented to the learner.

5
Introduction
 Problem setting
 Inductively learning an unknown target
function, given training examples and a
hypothesis space
 Focus on:
 How many training examples are sufficient?
 How many mistakes will the learner make
before it succeeds?

6
Introduction (2)
 Desirable: quantitative bounds depending on

Complexity of hypo space,

Accuracy of approximation to the target

Probability of outputting a successful hypo

How the training examples are presented

Learner proposes instances

Teacher presents instances

Some random process produces instances
 Specifically, study sample complexity,
computational complexity, and mistake bound.

7
Problem Setting
 Space of possible instances X (e.g. set of all people) over
which target functions may be defined.
 Assume that different instances in X may be encountered with
different frequencies.
 Modeling above assumption as: unknown (stationary)
probability distribution D that defines the probability of
encountering each instance in X
 Training examples are provided by drawing instances
independently from X, according to D, and they are noise-free.
 Each element c in target function set C corresponds to certain
subset of X, i.e. c is a Boolean function. (Just for the sake of
simplicity)

8
Error of a Hypothesis
 Training error of hypo h w.r.t. target function c and
training data set S of n samples is
errorS(h) = (1/n) · Σx∈S δ(c(x) ≠ h(x))
 True error of hypo h w.r.t. target function c and
distribution D is
errorD(h) = Prx~D[c(x) ≠ h(x)]
 errorD(h) is not observable, so how probable is it that
errorS(h) gives a misleading estimate of errorD(h)?
 Different from the problem setting in Ch. 5, where samples are
drawn independently of h, here h depends on the training
samples.
9
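The two error definitions above can be made concrete with a small Python sketch. The concept c(x) = [x < 50], the hypothesis h(x) = [x < 40], and the uniform distribution D over {0, …, 99} are illustrative choices, not from the slides:

```python
import random

def c(x):   # target concept: "x < 50"
    return 1 if x < 50 else 0

def h(x):   # imperfect hypothesis: "x < 40"
    return 1 if x < 40 else 0

def training_error(h, c, sample):
    # errorS(h) = (1/n) * sum over x in S of delta(c(x) != h(x))
    return sum(c(x) != h(x) for x in sample) / len(sample)

# True error under the uniform D over {0,...,99}; since D is finite
# and known here, Pr_{x~D}[c(x) != h(x)] can be computed exactly.
true_error = sum(c(x) != h(x) for x in range(100)) / 100
print(true_error)   # 0.1 -- c and h disagree exactly on 40..49

random.seed(0)
sample = [random.randrange(100) for _ in range(20)]
print(training_error(h, c, sample))   # an estimate of 0.1; varies with S
```

Running this with different seeds illustrates exactly the question raised above: errorS(h) fluctuates around errorD(h), and the issue is how probable a large deviation is.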
An Illustration of True
Error

10
Theoretical Questions of
Interest
 Is it possible to identify classes of learning
problems that are inherently difficult or easy,
independent of the learning algorithm?
 Can one characterize the number of training
examples necessary or sufficient to assure
successful learning?
 How is the number of examples affected

If observing a random sample of training data?

if the learner is allowed to pose queries to the trainer?
 Can one characterize the number of mistakes
that a learner will make before learning the
target function?
 Can one characterize the inherent computational
complexity of a class of learning algorithms?
Computational Learning
Theory
 Relatively recent field
 Area of intense research
 The partial answer to some of the questions on the
previous page is yes.
 Will generally focus on certain types of
learning problems.
Inductive Learning of Target
Function
 What we are given
 Hypothesis space
 Training examples
 What we want to know
 How many training examples are sufficient
to successfully learn the target function?
 How many mistakes will the learner make
before succeeding?
Computational Learning
Theory
Provides a theoretical analysis of learning:
 Is it possible to identify classes of learning problems
that are inherently difficult/easy?
 Can we characterize the computational complexity of
classes of learning problems?
• When a learning algorithm can be expected to succeed
• When learning may be impossible
 Can we characterize the number of training samples
necessary/sufficient for successful learning?
 How is this number affected if we allow the learner to ask
questions (active learning)?
 How many mistakes will the learner make before learning
the target function?
 Computational Learning Theory
Quantitative bounds can be set depending on the
following attributes:

 Accuracy to which the target must be approximated


 The probability that the learner will output a
successful hypothesis
 Size or complexity of the hypothesis space
considered by the learner
 The manner in which training examples are
presented to the learner
Theory

Three general areas:

1. Sample Complexity. How many examples do we need
to find a good hypothesis?

2. Computational Complexity. How much computational
power do we need to find a good hypothesis?

3. Mistake Bound. How many mistakes will we make
before finding a good hypothesis?
 Sample Complexity
How Many Training Examples Sufficient To Learn Target Concept?
Scenario 1: Active Learning
Learner proposes instances, as queries to teacher
Query (learner): instance x
Answer (teacher): c(x)

Scenario 2: Passive Learning from Teacher-Selected Examples
Teacher (who knows c) provides training examples
Sequence of examples (teacher): {<xi, c(xi)>}
Teacher may or may not be helpful/optimal

Scenario 3: Passive Learning from Teacher-Annotated Examples
Random process (e.g., nature) proposes instances
Instance x generated randomly, teacher provides c(x)
Models of Learning
 Learner: who is doing the learning? (e.g. A computer with
limited resources (finite memory, polynomial time,...)
 Domain: What is being learnt? (e.g. Concept of a chair)
 Information source:
- Examples
- positive/negative
- according to a certain distribution
- selected how?
- features?
- Queries
- “is this a chair?”
- Experimentation
- play with a new gadget to learn how it works

Noisy or noise-free?
 Prior knowledge: e.g. “The concept to learn is a conjunction
of features”
 Performance criteria:
- Measure of how well learned? Done?
- Accuracy (error rate)
- Efficiency
Computational Learning Theory
• The PAC Learning
Framework
• Finite Hypothesis Spaces
• Examples of PAC Learnable
Concepts
• VC dimension & Infinite Hyp.
Spaces
 The Mistake Bound Model
Two Frameworks
 PAC (Probably Approximately Correct)
Learning Framework: Identify classes of
hypotheses that can and cannot be
learned from a polynomial number of
training examples
 Define a natural measure of complexity for
hypothesis spaces that allows bounding the
number of training examples needed
 Mistake Bound Framework
PAC Learning
 Probably Approximately Correct
Learning Model
 Will restrict discussion to learning
boolean-valued concepts in noise-free
data.
Problem Setting:
Instances and Concepts
 X is set of all possible instances over which
target function may be defined
 C is set of target concepts learner is to
learn
 Each target concept c in C is a subset of X
 Each target concept c in C is a boolean function
c: X{0,1}

c(x) = 1 if x is positive example of concept


c(x) = 0 otherwise
Problem Setting: Distribution
 Instances generated at random using
some probability distribution D

D may be any distribution

D is generally not known to the learner

D is required to be stationary (does not
change over time)
 Training examples x are drawn at random
from X according to D and presented with
target value c(x) to the learner.
Problem Setting: Hypotheses
 Learner L considers set of hypotheses H
 After observing a sequence of training
examples of the target concept c, L
must output some hypothesis h from H
which is its estimate of c
Example Problem
(Classifying Executables)
 Three Classes (Malicious, Boring, Funny)
 Features
 a1 GUI present (yes/no)
 a2 Deletes files (yes/no)
 a3 Allocates memory (yes/no)
 a4 Creates new thread (yes/no)
 Distribution?
 Hypotheses?
Instance a1 a2 a3 a4 Class
1 Yes No No Yes B
2 Yes No No No B
3 No Yes Yes No F
4 No No Yes Yes M
5 Yes No No Yes B
6 Yes No No No F
7 Yes Yes Yes No M
8 Yes Yes No Yes M
9 No No No Yes B
10 No No Yes No M
True Error
 Definition: The true error (denoted
errorD(h)) of hypothesis h with respect to
target concept c and distribution D , is the
probability that h will misclassify an
instance drawn at random according to D.
errorD(h) = Prx~D[c(x) ≠ h(x)]
Computer Science Department
CS 9633 Machine Learning
Error of h with respect to c
[Figure: instance space X, with the regions covered by c and h
overlapping; the instances on which c and h disagree (marked
+ / −) constitute the error region.]
Key Points
 True error defined over entire instance
space, not just training data
 Error depends strongly on the unknown
probability distribution D
 The error of h with respect to c is not
directly observable to the learner L—
can only observe performance with
respect to training data (training error)
 Question: How probable is it that the
observed training error for h gives a
misleading estimate of the true error?
PAC Learnability
 Goal: characterize classes of target concepts
that can be reliably learned
 from a reasonable number of randomly drawn
training examples and
 using a reasonable amount of computation
 Unreasonable to expect perfect learning where
errorD(h) = 0
 Would need to provide training examples
corresponding to every possible instance
 With random sample of training examples, there is
always a non-zero probability that the training
examples will be misleading
Weaken Demand on Learner
 Hypothesis error (Approximately)

Will not require a zero error hypothesis

Require that the error is bounded by some
constant ε, that can be made arbitrarily small

ε is the error parameter
 Error on training data (Probably)

Will not require that the learner succeed on
every sequence of randomly drawn training
examples

Require that its probability of failure is bounded
by a constant, δ, that can be made arbitrarily
small

δ is the confidence parameter
Probably Approximately
Correct Learning (PAC
Learning)

34
Cannot Learn Exact Concepts
from Limited Data, Only
Approximations

[Figure: positive and negative examples are fed to the learner,
which outputs a classifier that labels new instances positive
or negative.]
37
Cannot Learn Even Approximate
Concepts
from Pathological Training Sets
[Figure: a pathological set of positive and negative training
examples leads the learner to a classifier that mislabels new
instances.]

38
Probably approximately correct learning

A formal computational model that aims to
shed light on the limits of what can be
learned by a machine, by analysing the
computational cost of learning algorithms

39
What we want to learn
 CONCEPT = recognizing algorithm
 LEARNING = computational description of
recognizing algorithms starting from:
- examples
- incomplete specifications
That is:
to determine uniformly good approximations of an
unknown function from its values at some sample points
 interpolation
 pattern matching
 concept learning

40
What’s new in p.a.c.
learning?
Accuracy of results and running time for learning
algorithms are explicitly quantified and related

A general problem:
use of resources (time, space…) by computations  COMPLEXITY THEORY

Example:
Sorting: n·log n time (polynomial, feasible)
Boolean satisfiability: 2ⁿ time (exponential, intractable)

41
PAC Learnability
 PAC refers to Probably Approximately Correct
 It is desirable that errorD(h) be zero;
however, to be realistic, we weaken our
demand in two ways:
 errorD(h) is to be bounded by a small number ε

The learner is not required to succeed on every training
sample; rather, its probability of failure is to be
bounded by a constant δ.
 Hence we come up with the idea of “Probably
Approximately Correct”
42
PAC Learning
 The only reasonable expectation of a
learner is that with high probability it
learns a close approximation to the
target concept.
 In the PAC model, we specify two small
parameters, ε and δ, and require that
with probability at least (1 − δ) a system
learn a concept with error at most ε.

43
The PAC Learning Framework
Definition: A class of concepts C is
PAC learnable using a hypothesis class
H if there exists a learning algorithm L
such that for arbitrarily small δ and ε,
for all concepts c in C, and for all
distributions D over the input space,
there is a 1−δ probability that the
hypothesis h selected from space H by
L is approximately correct (has less
than ε true error).
Definition of PAC-
Learnability
 Definition: Consider a concept class C
defined over a set of instances X of length n
and a learner L using hypothesis space H.
 C is PAC-learnable by L using H if for all c ∈ C,
distributions D over X, ε such that 0 < ε < 1/2,
and δ such that 0 < δ < 1/2, learner L will
with probability at least (1 − δ) output a
hypothesis h ∈ H such that errorD(h) ≤ ε, in
time that is polynomial in 1/ε, 1/δ, n, and
size(c).
Requirements of Definition
 L must, with arbitrarily high probability (1 − δ),
output a hypothesis having arbitrarily
low error (ε).
 L’s learning must be efficient—grows
polynomially in terms of
 Strength of output hypothesis (1/ε, 1/δ)
 Inherent complexity of instance space (n) and
concept class C (size(c)).
Block Diagram of PAC
Learning Model
[Diagram: control parameters ε, δ and a training sample
{<xi, c(xi)>}, i = 1..n, are input to learning algorithm L,
which outputs hypothesis h.]
Examples of second
requirement
 Consider the executables problem where
instances are conjunctions of boolean
features:
a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no
 Concepts are conjunctions of a subset of
the features:
a1=yes ∧ a3=yes ∧ a4=yes
Using the Concept of PAC
Learning in Practice
 We often want to know how many
training instances we need in order to
achieve a certain level of accuracy with
a specified probability.
 If L requires some minimum processing
time per training example, then for C to
be PAC-learnable by L, L must learn
from a polynomial number of training
examples.
Sample Complexity for
Finite Hypothesis Spaces

50
Sample Complexity for Finite Hypothesis
Spaces

 Start from a good class of learner—the
consistent learner, defined as one that
outputs a hypothesis which perfectly fits the
training data set, whenever possible.
 Recall: Version space VSH,D is defined to
be the set of all hypotheses h∈H that correctly
classify all training examples in D.
 Property: Every consistent learner outputs
a hypothesis belonging to the version space.

51
Sample Complexity for
Finite Hypothesis Spaces
 Given any consistent learner, the number of examples
sufficient to assure that any hypothesis will be probably
(with probability (1 − δ)) approximately (within error ε)
correct is m ≥ (1/ε)(ln|H| + ln(1/δ))
 If the learner is not consistent, m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
 Conjunctions of Boolean literals are also PAC-learnable,
and m ≥ (1/ε)(n·ln3 + ln(1/δ))
 k-term DNF expressions are not PAC learnable because
even though they have polynomial sample complexity,
their computational complexity is not polynomial.
 Surprisingly, however, k-term CNF is PAC learnable.

52
Formal Definition of PAC-
Learnable
 Consider a concept class C defined over an instance
space X containing instances of length n, and a
learner, L, using a hypothesis space, H. C is said to be
PAC-learnable by L using H iff for all c ∈ C,
distributions D over X, 0 < ε < 0.5, 0 < δ < 0.5, learner L,
by sampling random examples from distribution D, will
with probability at least 1 − δ output a hypothesis h ∈ H
such that errorD(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n,
and size(c).
 Example:

X: instances described by n binary features

C: conjunctive descriptions over these features

H: conjunctive descriptions over these features

L: most-specific conjunctive generalization algorithm (Find-S)

size(c): the number of literals in c (i.e. length of the
conjunction).
53
ε-exhausted
 Def. VSH,D is said to be ε-exhausted w.r.t.
c and D if for every h in VSH,D, errorD(h) < ε.

54
A PAC-Learnable Example
 Consider class C of conjunction of boolean
literals.
 A boolean literal is any boolean variable or its negation
 Q: Is such C PAC-learnable?
 A: Yes, by going through the following two steps:
1. Show that any consistent learner will require only a
polynomial number of training examples to learn any
element of C
2. Exhibit a specific algorithm that uses polynomial time
per training example.

55
Cont’d
Step 1:
 Let H consist of conjunctions of literals based on n
boolean variables.
 Now take a look at m ≥ (1/ε)(ln|H| + ln(1/δ));
observe that |H| = 3^n, so the inequality becomes
m ≥ (1/ε)(n·ln3 + ln(1/δ)).
Step 2:
 The FIND-S algorithm satisfies the requirement
 For each new positive training example, the algorithm
computes the intersection of the literals shared by the current
hypothesis and the example, using time linear in n
56
Sample Complexity of Conjunction
Learning
 Consider conjunctions over n boolean features. There
are 3^n of these, since each feature can appear positively,
appear negatively, or not appear in a given conjunction.
Therefore |H| = 3^n, so a sufficient number of examples to
learn a PAC concept is:
m ≥ (ln|H| + ln(1/δ)) / ε = (n·ln3 + ln(1/δ)) / ε

 Concrete examples:
 δ=ε=0.05, n=10 gives 280 examples
 δ=0.01, ε=0.05, n=10 gives 312 examples
 δ=ε=0.01, n=10 gives 1,560 examples
 δ=ε=0.01, n=50 gives 5,954 examples
 Result holds for any consistent learner, including Find-S.

57
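The bound m ≥ (n·ln3 + ln(1/δ))/ε is easy to evaluate numerically; the following sketch (the function name is ours) reproduces the concrete numbers on this slide:

```python
from math import ceil, log

def conj_sample_bound(n, eps, delta):
    """Sufficient sample size for conjunctions over n boolean features:
    m >= (1/eps) * (n*ln(3) + ln(1/delta)), since |H| = 3^n."""
    return ceil((1 / eps) * (n * log(3) + log(1 / delta)))

print(conj_sample_bound(10, 0.05, 0.05))   # 280
print(conj_sample_bound(10, 0.05, 0.01))   # 312
print(conj_sample_bound(10, 0.01, 0.01))   # 1560
print(conj_sample_bound(50, 0.01, 0.01))   # 5954
```

Note that the bound is linear in n and 1/ε but only logarithmic in 1/δ, which is why tightening δ is cheap compared to tightening ε.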
Sample Complexity of Learning
Arbitrary Boolean Functions
 Consider any boolean function over n boolean features,
such as the hypothesis space of DNF or decision trees.
There are 2^(2^n) of these, so a sufficient number of
examples to learn a PAC concept is:
m ≥ (ln|H| + ln(1/δ)) / ε = (2^n·ln2 + ln(1/δ)) / ε

 Concrete examples:
 δ=ε=0.05, n=10 gives 14,256 examples
 δ=ε=0.05, n=20 gives 14,536,410 examples
 δ=ε=0.05, n=50 gives 1.561×10^16 examples

58
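Substituting |H| = 2^(2^n) into the same bound shows how quickly an unbiased hypothesis space becomes hopeless; a sketch (function name ours):

```python
from math import ceil, log

def boolean_fn_sample_bound(n, eps, delta):
    """Sufficient sample size when H is all boolean functions over n
    features: |H| = 2^(2^n), so m >= (1/eps)*(2^n * ln(2) + ln(1/delta))."""
    return ceil((1 / eps) * (2 ** n * log(2) + log(1 / delta)))

print(boolean_fn_sample_bound(10, 0.05, 0.05))   # 14256
print(boolean_fn_sample_bound(20, 0.05, 0.05))   # 14536410
```

The ln|H| term is now 2^n·ln2, so the sample complexity is exponential in n: each additional feature roughly doubles the number of required examples.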
Agnostic Learning & Inconsistent
Hypo
 So far we have assumed that VSH,D is not empty;
a simple way to guarantee that this
condition holds is to assume that c
belongs to H.
 Agnostic learning setting: don’t assume
c∈H; the learner simply finds the hypothesis
with minimum training error instead.

59
Sample Complexity for
Infinite Hypothesis Spaces

60
Infinite Hypothesis Spaces
 The preceding analysis was restricted to finite
hypothesis spaces.
 Some infinite hypothesis spaces (such as those
including real-valued thresholds or
parameters) are more expressive than others.
 Compare a rule allowing one threshold on a
continuous feature (length<3cm) vs one allowing
two thresholds (1cm<length<3cm).
 Need some measure of the expressiveness of
infinite hypothesis spaces.
 The Vapnik-Chervonenkis (VC) dimension
provides just such a measure, denoted VC(H).
 Analogous to ln|H|, there are bounds for
sample complexity using VC(H).
61
 VC Dimension
An unbiased hypothesis space shatters the entire
instance space.
 The larger the subset of X that can be shattered, the
more expressive the hypothesis space is, i.e. the less
biased.
 The Vapnik-Chervonenkis dimension, VC(H), of hypothesis
space H defined over instance space X is the size of the
largest finite subset of X shattered by H. If arbitrarily
large finite subsets of X can be shattered then VC(H) = ∞
 If there exists at least one subset of X of size d that can
be shattered then VC(H) ≥ d. If no subset of size d can be
shattered, then VC(H) < d.
 For single intervals on the real line, all sets of 2
instances can be shattered, but no set of 3 instances can,
so VC(H) = 2.
 Since shattering m instances requires |H| ≥ 2^m,
VC(H) ≤ log2|H|
62
Shattering a Set of
Instances
 Def. A dichotomy of a set S is a partition of S
into two disjoint subsets
 Def. A set of instances S is shattered by hypo
space H iff for every dichotomy of S, there exists
some hypo in H consistent with this dichotomy.

[Figure: a set of 3 instances shattered by H]
63
 VC Dimension
We say that a set S of examples is shattered by a set of functions H if
for every partition of the examples in S into positive and negative examples,
there is a function in H that gives exactly these labels to the examples.

The VC dimension of hypothesis space H over instance space X
is the size of the largest finite subset of X that is shattered by H.

If there exists a subset of size d that can be shattered, then VC(H) ≥ d
If no subset of size d can be shattered, then VC(H) < d

VC(Half intervals) = 1 (no subset of size 2 can be shattered)
VC(Intervals) = 2 (no subset of size 3 can be shattered)
VC(Half-spaces in the plane) = 3 (no subset of size 4 can be shattered)

Computational
Learning Theory CS446-Spring 06 64
VC Dimension
 Motivation: What if H can’t shatter X? Try finite
subsets of X.
 Def. The VC dimension of hypothesis space H defined over
instance space X is the size of the largest finite subset
of X shattered by H. If arbitrarily large finite
subsets of X can be shattered by H, then VC(H) ≡ ∞
 Roughly speaking, the VC dimension measures how
many (training) points can be separated under all
possible labelings using functions of a given class.
 Notice that for any finite H, VC(H)≤log2|H|

65
Sample Complexity for
Infinite Hypothesis Spaces
II
 Upper bound on sample complexity, using the VC
dimension: m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε))
 Lower bound on sample complexity, using the VC
dimension:
Consider any concept class C such that VC(C) ≥ 2, any
learner L, and any 0 < ε < 1/8, and 0 < δ < 1/100.
Then there exists a distribution D and target concept
in C such that if L observes fewer examples than
max[(1/ε)·log(1/δ), (VC(C)−1)/(32ε)]
then with probability at least δ, L outputs a
hypothesis h having errorD(h) > ε.

66
An Example: Linear Decision
Surface
 Line case: X=real number set, and
H=set of all open intervals, then
VC(H)=2.
 Plane case: X=xy-plane, and H=set of
all linear decision surface of the plane,
then VC(H)=3.
 General case: For n-dim real-number
space, let H be its linear decision
surface, then VC(H)=n+1.
67
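The interval case on the line can be verified by brute force. A sketch, assuming the hypothesis class of open intervals (a, b) on the real line (including the empty interval, e.g. a short interval containing no points):

```python
from itertools import product

def interval_labels(points, a, b):
    """Label each point 1 if it lies in the open interval (a, b)."""
    return tuple(1 if a < x < b else 0 for x in points)

def shattered_by_intervals(points):
    """True iff single open intervals realize all 2^|points| labelings."""
    pts = sorted(points)
    # Midpoints between (and just outside) the points suffice as endpoints.
    cuts = ([pts[0] - 1]
            + [(pts[i] + pts[i + 1]) / 2 for i in range(len(pts) - 1)]
            + [pts[-1] + 1])
    realizable = {interval_labels(points, a, b)
                  for a in cuts for b in cuts if a < b}
    realizable.add(tuple(0 for _ in points))  # an empty interval labels all 0
    return all(lbl in realizable
               for lbl in product((0, 1), repeat=len(points)))

print(shattered_by_intervals([1.0, 2.0]))        # True:  VC >= 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))   # False: (1, 0, 1) impossible
```

The three-point check fails because an interval can only label a contiguous run of points positive, which is exactly why VC(intervals) = 2.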
Sample Complexity from VC
Dimension
 How many randomly drawn examples
suffice to ε-exhaust VSH,D with probability
at least 1−δ?
m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε))
(Blumer et al. 1989)
 Furthermore, it is possible to obtain a
lower bound on sample complexity (i.e. the
minimum number of required training
samples)

68
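The upper bound above is straightforward to evaluate; a sketch (function name ours, with a ceiling applied to get an integer sample size):

```python
from math import ceil, log2

def vc_sample_bound(vc, eps, delta):
    """Blumer et al. (1989) upper bound:
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return ceil((1 / eps) * (4 * log2(2 / delta) + 8 * vc * log2(13 / eps)))

# The bound grows linearly in VC(H); e.g. with eps = delta = 0.1:
print(vc_sample_bound(2, 0.1, 0.1))   # 1297
print(vc_sample_bound(3, 0.1, 0.1))   # 1859
```

As with the finite-|H| bound, the dependence on 1/δ is logarithmic, while the dependence on VC(H) and 1/ε dominates.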
Lower Bound on Sample
Complexity
 Theorem 7.2 (Ehrenfeucht et al. 1989)
Consider any concept class C s.t. VC(C)≥2,
any learner L, and any 0<ε<1/8, and
0<δ<1/100. Then there exists a distribution
D and target concept in C s.t. if L observes
fewer examples than
max[(1/ε)log(1/δ), (VC(C)-1)/(32ε)], then with
probability at least δ, L outputs a hypo h
having errorD(h)>ε.
69
VC-Dimension for Neural
Networks
 Let G be a layered directed acyclic graph with n
input nodes and s ≥ 2 internal nodes, each having
at most r inputs. Let C be a concept class over R^r
of VC dimension d, corresponding to the set of
functions that can be described by each of the s
internal nodes. Let CG be the G-composition of C,
corresponding to the set of functions that can be
represented by G. Then VC(CG) ≤ 2ds·log(es),
where e is the base of the natural logarithm.
 This theorem can help us bound the VC-Dimension
of a neural network and thus, its sample
complexity

70
Mistake Bound Model

71
Mistake Bound Model
 The learner receives a sequence of training
examples
• Instance-based learning
 Upon receiving each example x, the learner
must predict the target value c(x)
• Online learning
 How many mistakes will the learner make
before it learns the target concept?
• e.g. Learning fraudulent credit card purchases
Mistake Bound Model
 When the majority of the hypotheses incorrectly
classifies the new example, the VS will be reduced
to at most half its current size

 Given that the VS initially contains |H| hypotheses,
the maximum number of mistakes possible before the
VS contains just one member is log2|H|

 The algorithm can learn without making any mistakes at all:
when the majority is correct, it still removes the incorrect,
minority hypotheses
Skip
 We may also ask: what is the optimal
mistake bound, Opt(C)?
 the lowest worst-case mistake bound over all
possible learning algorithms

 VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2|C|


Introduction to “Mistake
Bound”
 Mistake bound: the total number of mistakes
a learner makes before it converges to the
correct hypothesis
 Assume the learner receives a sequence of
training examples; however, for each instance
x, the learner must first predict c(x) before it
receives the correct answer from the teacher.
 Application scenario: when the learning must
be done on-the-fly, rather than during off-line
training stage.

77
Learning
 The Mistake Bound framework is different from
the PAC framework as it considers learners that
receive a sequence of training examples and that
predict, upon receiving each example, what its
target value is.
 The question asked in this setting is: “How many
mistakes will the learner make in its
predictions before it learns the target
concept?”
 This question is significant in practical settings
where learning must be done while the system is
in actual use.

78
 Theorem 1. Online learning of conjunctive concepts can be done with
at most n+1 prediction mistakes.
Find-S Algorithm
Find-S: find a maximally specific
hypothesis
1. Initialize h to the most specific hypothesis
in H
2. For each positive training example x:
 For each attribute constraint ai in h, if it is
satisfied by x, then do nothing; otherwise
replace ai by the next more general constraint
that is satisfied by x.
3. Output hypothesis h
81
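A minimal Python sketch of Find-S for boolean conjunctions, under the assumptions used here (noise-free data, conjunctive target). The encoding ('T'/'F' values, '0' for "no value yet", '?' for "any value") and the example data are our illustrative choices:

```python
def find_s(examples, n):
    """examples: list of ((a1, ..., an), label) pairs with values 'T'/'F'."""
    h = ['0'] * n                      # most specific: accepts nothing yet
    for x, label in examples:
        if label != 1:
            continue                   # Find-S ignores negative examples
        for i, v in enumerate(x):
            if h[i] == '0':
                h[i] = v               # adopt values of the first positive
            elif h[i] != '?' and h[i] != v:
                h[i] = '?'             # minimally generalize: drop constraint
    return h

# Hypothetical target: a1=T AND a3=F (a2, a4 unconstrained)
examples = [
    (('T', 'T', 'F', 'T'), 1),
    (('T', 'F', 'F', 'T'), 1),
    (('F', 'T', 'F', 'T'), 0),
    (('T', 'T', 'F', 'F'), 1),
]
print(find_s(examples, 4))   # ['T', '?', 'F', '?']
```

Each generalization step runs in time linear in n, which is the property used in Step 2 of the PAC-learnability argument above.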
Mistake Bound for FIND-S
 Assume training data is noise-free and target
concept c is in the hypo space H, which consists
of conjunction of up to n boolean literals
 Then in the worst case the learner needs to
make n+1 mistakes before it learns c

Note that a misclassification occurs only when
the latest learned hypothesis misclassifies a positive
example as negative, and one such mistake removes
at least one constraint from the hypothesis; and

in the above worst case, c is the function that
assigns every instance the value “true”
82
Mistake Bound for Halving
Algorithm
 Halving algorithm = incrementally learn the
version space as each new instance arrives +
predict each new instance by majority vote (of the
hypotheses in the VS)
 Q: What is the maximum number of mistakes
that can be made by the halving algorithm, for an
arbitrary finite H, before it exactly learns the
target concept c (assuming c is in H)?
 Answer: the largest integer no more than log2|H|
 How about the minimum number of mistakes?
 Answer: zero mistakes!
84
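A sketch of the Halving algorithm on a tiny explicit hypothesis space. The threshold concepts over instances {0, …, 3} are our illustrative choice, and ties are broken toward 0 here (the slides leave tie-breaking open):

```python
from math import log2

X = [0, 1, 2, 3]
H = [lambda x, t=t: int(x >= t) for t in range(5)]   # thresholds, |H| = 5
target = H[2]                                         # c(x) = 1 iff x >= 2

VS = list(H)              # version space, initially all of H
mistakes = 0
for x in X:               # online sequence of instances
    votes = sum(h(x) for h in VS)
    prediction = int(2 * votes > len(VS))        # majority vote, ties -> 0
    if prediction != target(x):
        mistakes += 1
    VS = [h for h in VS if h(x) == target(x)]    # drop inconsistent hypos

print(mistakes)   # 1 here; the bound is floor(log2(5)) = 2
```

Whenever the majority is wrong, the consistent minority is at most half of the current VS, which is exactly why the mistake count is bounded by log2|H|.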
Optimal Mistake Bounds
 For an arbitrary concept class C,
assuming H=C, interested in the lowest
worst-case mistake bound over all
possible learning algorithms
 Let MA(c) denote the maximum number
of mistakes, over all possible training
example sequences, that a learner A makes to
exactly learn c.
 Def. MA(C) ≡ maxc∈C MA(c)
 Ex: MFind-S(C) = n+1, MHalving(C) ≤ log2|C|
85
Optimal Mistake Bounds
(2)
 The optimal mistake bound for C, denoted
by Opt(C), defined as minA∈learning algMA(C)
 Notice that Opt(C)≤MHalving(C)≤log2|C|
 Furthermore, Littlestone (1987) shows that
VC(C)≤Opt(C) !
 When C equals the power set Cp of any
finite instance space X, all four of the above
quantities become equal to each other, namely
to |X|
86
Optimal Mistake Bounds
 Definition: Let C be an arbitrary
nonempty concept class. The optimal
mistake bound for C, denoted Opt(C), is
the minimum over all possible learning
algorithms A of MA(C).
Opt(C)=minALearning_Algorithm MA(C)
 For any concept class C, the optimal
mistake bound is bounded as follows:
VC(C) ≤ Opt(C) ≤ log2(|C|)

87
Weighted-Majority
Algorithm
 It is a generalization of Halving algorithm:
makes a prediction by taking a weighted
vote among a pool of prediction
algorithms (or hypotheses) and learns by
altering the weights
 It starts by assigning equal weight (=1)
to every prediction algorithm. Whenever
an algorithm misclassifies a training
example, its weight is reduced
 The Halving algorithm reduces the weight to zero
88
Procedure for Adjusting
Weights
ai denotes the ith prediction algorithm in the pool; wi
denotes the weight of ai, and is initialized to 1
 For each training example <x, c(x)>
 Initialize q0 and q1 to be 0
 For each ai: if ai(x)=0 then q0←q0+wi, else q1←q1+wi
 If q1>q0, predict c(x) to be 1; else
 if q1<q0, predict c(x) to be 0; else
predict c(x) at random to be 1 or 0.
 For each ai, do
 If ai(x)≠c(x) (given by the teacher), wi←βwi
89
Weighted-Majority
Algorithm
ai denotes the ith prediction algorithm in the pool A of
algorithm. wi denotes the weight associated with ai.

For all i initialize wi <-- 1
 For each training example <x,c(x)>

Initialize q0 and q1 to 0
 For each prediction algorithm ai

If ai(x)=0 then q0 <-- q0+wi

If ai(x)=1 then q1 <-- q1+wi

If q1 > q0 then predict c(x)=1

If q0 > q1 then predict c(x) =0

If q0=q1 then predict 0 or 1 at random for c(x)

For each prediction algorithm ai in A do

If ai(x) ≠ c(x) then wi <-- β·wi

90
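A runnable sketch of the procedure above. The pool of predictors, the instance sequence, and deterministic tie-breaking toward 0 (instead of a random choice) are our illustrative simplifications:

```python
from math import log2

pool = [
    lambda x: int(x >= 2),   # happens to match the target below (k = 0)
    lambda x: 1,             # always predicts 1
    lambda x: 0,             # always predicts 0
    lambda x: int(x < 2),    # always wrong
]
target = lambda x: int(x >= 2)
beta = 0.5

w = [1.0] * len(pool)        # every algorithm starts with weight 1
mistakes = 0
for x in [0, 1, 2, 3, 0, 1, 2, 3]:
    q0 = sum(wi for ai, wi in zip(pool, w) if ai(x) == 0)
    q1 = sum(wi for ai, wi in zip(pool, w) if ai(x) == 1)
    prediction = 1 if q1 > q0 else 0             # weighted majority vote
    if prediction != target(x):
        mistakes += 1
    # Demote every algorithm that misclassified this example.
    w = [wi * beta if ai(x) != target(x) else wi
         for ai, wi in zip(pool, w)]

# The best pool member makes k = 0 mistakes, so Theorem 7.3 bounds the
# total by 2.4 * (k + log2(n)) = 2.4 * log2(4) = 4.8.
print(mistakes)   # 1
```

Note that weights are demoted whenever an individual algorithm errs, even on rounds where the overall weighted vote was correct.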
Comments on “Adjusting Weights”
Idea
 The idea can be found in various problems
such as pattern matching, where we
might reduce weights of less frequently
used patterns in the learned library
 The textbook claims that one benefit of
the algorithm is that it is able to
accommodate inconsistent training data,
but in case of learning by query, we
presume that answer given by the teacher
is always correct.
91
Relative Mistake Bound for the
Algorithm
 Theorem 7.3 Let D be the training sequence, A be any
set of n prediction algorithms, and k be the minimum
number of mistakes made by any algorithm in A for
the training sequence D. Then the number of mistakes
over D made by Weighted-Majority algorithm using
β=0.5 is at most 2.4(k+log2n)
 Proof idea: compare the final weight of the best
prediction algorithm to the sum of weights over all
predictions. Let aj be such an algorithm with k
mistakes; then its final weight is wj = 0.5^k. Now consider the
sum W of weights over all predictions, and observe that for every
mistake made, W is reduced to at most 0.75·W.
92
Relative Mistake Bound for
the Weighted-Majority
Algorithm
 Let D be any sequence of training examples,
let A be any set of n prediction algorithms, and
let k be the minimum number of mistakes
made by any algorithm in A for the training
sequence D. Then the number of mistakes over
D made by the Weighted-Majority algorithm
using =1/2 is at most 2.4(k + log2n).
 This theorem can be generalized for any 0 ≤ β < 1,
where the bound becomes
(k·log2(1/β) + log2 n) / log2(2/(1+β))

93
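The generalized bound can be checked numerically (function name ours); setting β = 1/2 recovers a leading constant of 1/log2(4/3) ≈ 2.41, which the preceding slides round to 2.4:

```python
from math import log2

def wm_mistake_bound(k, n, beta):
    """Generalized Weighted-Majority mistake bound:
    (k*log2(1/beta) + log2(n)) / log2(2/(1 + beta))."""
    return (k * log2(1 / beta) + log2(n)) / log2(2 / (1 + beta))

# beta = 1/2: (k + log2 n) / log2(4/3), i.e. roughly 2.41 * (k + log2 n)
print(wm_mistake_bound(3, 16, 0.5))   # ~16.87
```

Smaller β punishes erring algorithms more aggressively (approaching the Halving algorithm as β → 0), while β closer to 1 is more tolerant of inconsistent data.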