Computational Learning Theory
A theory of the learnable
(Valiant ‘84)
[…] The problem is to discover good models that are
interesting to study for their own sake and that promise to
be relevant both to explaining human experience and to
building devices that can learn […] Learning machines must
have all 3 of the following properties:
the machines can provably learn whole classes of concepts,
these classes can be characterized
the classes of concepts are appropriate and nontrivial for
general-purpose knowledge
the computational process by which the machine builds the
desired programs requires a “feasible” (i.e. polynomial) number
of steps
A theory of the learnable
We seek general laws that constrain inductive
learning, relating:
Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented
Overview
Are there general laws that govern learning?
Sample Complexity: How many training examples are needed for a
learner to converge (with high probability) to a successful hypothesis?
Computational Complexity: How much computational effort is
needed for a learner to converge (with high probability) to a
successful hypothesis?
Mistake Bound: How many training examples will the learner
misclassify before converging to a successful hypothesis?
These questions will be answered within two analytical
frameworks:
The Probably Approximately Correct (PAC) framework
The Mistake Bound framework
Overview (Cont’d)
Rather than answering these questions for
individual learners, we will answer them for
broad classes of learners. In particular we will
consider:
The size or complexity of the hypothesis space
considered by the learner.
The accuracy to which the target concept must be
approximated.
The probability that the learner will output a
successful hypothesis.
The manner in which training examples are
presented to the learner.
Introduction
Problem setting
Inductively learning an unknown target
function, given training examples and a
hypothesis space
Focus on:
How many training examples are sufficient?
How many mistakes will the learner make
before it succeeds?
Introduction (2)
Desirable: quantitative bounds depending on
Complexity of hypo space,
Accuracy of approximation to the target
Probability of outputting a successful hypo
How the training examples are presented
Learner proposes instances
Teacher presents instances
Some random process produces instances
Specifically, study sample complexity,
computational complexity, and mistake bound.
Problem Setting
Space of possible instances X (e.g. set of all people) over
which target functions may be defined.
Assume that different instances in X may be encountered with
different frequencies.
We model this assumption with an unknown (stationary) probability
distribution D that defines the probability of encountering each
instance in X.
Training examples are provided by drawing instances
independently from X, according to D, and they are noise-free.
Each element c of the target function set C corresponds to a certain
subset of X, i.e. c is a Boolean function. (Just for the sake of
simplicity.)
Error of a Hypothesis
Training error of hypothesis h w.r.t. target function c and a
training set S of n samples:
errorS(h) ≡ (1/n) Σ_{x∈S} δ(c(x) ≠ h(x))
(where δ(·) is 1 if its argument is true and 0 otherwise)
True error of hypothesis h w.r.t. target function c and
distribution D:
errorD(h) ≡ Pr_{x~D}[c(x) ≠ h(x)]
errorD(h) is not observable, so how probable is it that
errorS(h) gives a misleading estimate of errorD(h)?
Different from the problem setting in Ch. 5, where samples are
drawn independently of h, here h depends on the training
samples.
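The distinction can be sketched numerically. In this minimal example (the target concept, hypothesis, and distribution are all hypothetical, with D taken to be uniform on [0, 1)), errorS(h) is computed on a small sample while errorD(h) is estimated by Monte Carlo:

```python
import random

def c(x):                                # hypothetical target concept
    return 0.2 <= x <= 0.7

def h(x):                                # hypothetical learned hypothesis
    return 0.25 <= x <= 0.7

def training_error(h, c, sample):
    """errorS(h): fraction of the training sample where h disagrees with c."""
    return sum(h(x) != c(x) for x in sample) / len(sample)

def true_error(h, c, draw, trials=100_000):
    """Monte-Carlo estimate of errorD(h) = Pr_{x~D}[c(x) != h(x)]."""
    mistakes = 0
    for _ in range(trials):
        x = draw()
        mistakes += h(x) != c(x)
    return mistakes / trials

random.seed(0)
draw = random.random                     # D: uniform distribution on [0, 1)
S = [draw() for _ in range(20)]          # small noise-free sample from D
print(training_error(h, c, S))           # observable to the learner
print(true_error(h, c, draw))            # ~0.05, the mass of [0.2, 0.25)
```

With a small sample, errorS(h) can easily differ from errorD(h) ≈ 0.05; quantifying how probable that gap is constitutes exactly the question above.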
An Illustration of True Error
Theoretical Questions of
Interest
Is it possible to identify classes of learning
problems that are inherently difficult or easy,
independent of the learning algorithm?
Can one characterize the number of training
examples necessary or sufficient to assure
successful learning?
How is the number of examples affected
If observing a random sample of training data?
if the learner is allowed to pose queries to the trainer?
Can one characterize the number of mistakes
that a learner will make before learning the
target function?
Can one characterize the inherent computational
complexity of a class of learning algorithms?
Computational Learning
Theory
Relatively recent field
Area of intense research
Partial answers to some of the questions on the
previous page are yes.
Will generally focus on certain types of
learning problems.
Inductive Learning of Target
Function
What we are given
Hypothesis space
Training examples
What we want to know
How many training examples are sufficient
to successfully learn the target function?
How many mistakes will the learner make
before succeeding?
Computational Learning
Theory
Provides a theoretical analysis of learning:
Is it possible to identify classes of learning problems
that are inherently difficult/easy?
(Figure: instance space with regions labeled + and − by target
concept c and hypothesis h.)
Computer Science Department
CS 9633 Machine Learning
Key Points
True error defined over entire instance
space, not just training data
Error depends strongly on the unknown
probability distribution D
The error of h with respect to c is not
directly observable to the learner L;
L can only observe performance with
respect to training data (training error)
Question: How probable is it that the
observed training error for h gives a
misleading estimate of the true error?
PAC Learnability
Goal: characterize classes of target concepts
that can be reliably learned
from a reasonable number of randomly drawn
training examples and
using a reasonable amount of computation
Unreasonable to expect perfect learning where
errorD(h) = 0
Would need to provide training examples
corresponding to every possible instance
With random sample of training examples, there is
always a non-zero probability that the training
examples will be misleading
(Figure: a learner receives positive and negative training examples
and outputs a classifier, which labels new instances positive or
negative.)
Cannot Learn Even Approximate Concepts
from Pathological Training Sets
(Figure: the same learner/classifier diagram with positive and
negative examples.)
Probably approximately correct learning
What we want to learn
CONCEPT = recognizing algorithm
What’s new in p.a.c.
learning?
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem:
use of resources (time, space, …) by computations → COMPLEXITY
THEORY
Example
Sorting: n·logn time (polynomial, feasible)
Bool. satisfiability: 2ⁿ time (exponential, intractable)
PAC Learnability
PAC refers to Probably Approximately Correct
It is desirable for errorD(h) to be zero;
however, to be realistic, we weaken our
demand in two ways:
errorD(h) is only required to be bounded by a small number ε
the learner is not required to succeed on every training
sample; rather, its probability of failure is
bounded by a constant δ.
Hence we come up with the idea of “Probably
Approximately Correct”
PAC Learning
The only reasonable expectation of a
learner is that with high probability it
learns a close approximation to the
target concept.
In the PAC model, we specify two small
parameters, ε and δ, and require that
with probability at least (1 − δ) the system
learns a concept with error at most ε.
The PAC Learning Framework
Definition: A class of concepts C is
PAC learnable using a hypothesis class
H if there exists a learning algorithm L
such that for arbitrarily small δ and ε,
for all concepts c in C, and for all
distributions D over the input space,
there is a probability of at least 1 − δ that
the hypothesis h selected from space H by
L is approximately correct (has less
than ε true error).
Definition of PAC-
Learnability
Definition: Consider a concept class C
defined over a set of instances X of length n
and a learner L using hypothesis space H.
C is PAC-learnable by L using H if for all c ∈ C,
all distributions D over X, all ε such that 0 < ε <
1/2, and all δ such that 0 < δ < 1/2, learner L will
with probability at least (1 − δ) output a
hypothesis h ∈ H such that errorD(h) ≤ ε, in
time that is polynomial in 1/ε, 1/δ, n, and
size(c).
Requirements of Definition
L must, with arbitrarily high probability (1 − δ),
output a hypothesis having arbitrarily
low error (ε).
L's learning must be efficient: it grows
polynomially in terms of
the strength of the output hypothesis (1/ε, 1/δ)
the inherent complexity of the instance space (n) and
concept class C (size(c)).
(Diagram: parameters ε and δ, plus the training sample
{⟨xi, c(xi)⟩}_{i=1}^{n} drawn from D, are input to learning
algorithm L, which outputs hypothesis h.)
Sample Complexity for Finite Hypothesis
Spaces
Sample Complexity for
Finite Hypothesis Spaces
Given any consistent learner, the number of examples
sufficient to assure that any hypothesis will be probably
(with probability (1 − δ)) approximately (within error ε)
correct is m ≥ (1/ε)(ln|H| + ln(1/δ)).
If the learner is not consistent, m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
Conjunctions of Boolean literals are also PAC-learnable,
with m ≥ (1/ε)(n·ln 3 + ln(1/δ)).
k-term DNF expressions are not PAC learnable: even
though they have polynomial sample complexity,
their computational complexity is not polynomial.
Surprisingly, however, k-term CNF is PAC learnable.
Formal Definition of PAC-
Learnable
Consider a concept class C defined over an instance
space X containing instances of length n, and a
learner, L, using a hypothesis space, H. C is said to be
PAC-learnable by L using H iff for all c ∈ C, all
distributions D over X, and all 0 < ε < 0.5, 0 < δ < 0.5, learner L,
by sampling random examples from distribution D, will
with probability at least 1 − δ output a hypothesis h ∈ H
such that errorD(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n,
and size(c).
Example:
X: instances described by n binary features
C: conjunctive descriptions over these features
H: conjunctive descriptions over these features
L: most-specific conjunctive generalization algorithm (Find-S)
size(c): the number of literals in c (i.e. length of the
conjunction).
ε-exhausted
Def. VSH,D is said to be ε-exhausted w.r.t.
c and D if for any h in VSH,D, errorD(h)<ε.
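The finite-|H| sample-complexity bound quoted on the surrounding slides follows from bounding the probability that the version space fails to be ε-exhausted (Haussler 1988); a sketch of the derivation:

```latex
% Any h in H with error_D(h) >= \varepsilon is consistent with m i.i.d.
% examples with probability at most (1-\varepsilon)^m; a union bound
% over the at most |H| such hypotheses gives
\Pr\big[\mathrm{VS}_{H,D}\ \text{not } \varepsilon\text{-exhausted}\big]
  \;\le\; |H|\,(1-\varepsilon)^m \;\le\; |H|\,e^{-\varepsilon m}.
% Requiring this failure probability to be at most \delta and solving:
|H|\,e^{-\varepsilon m} \le \delta
  \quad\Longleftrightarrow\quad
  m \;\ge\; \frac{1}{\varepsilon}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big).
```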
A PAC-Learnable Example
Consider the class C of conjunctions of Boolean
literals.
A Boolean literal is any Boolean variable or its negation
Q: Is such a C PAC-learnable?
A: Yes, by going through the following two steps:
1. Show that any consistent learner will require only a
polynomial number of training examples to learn any
element of C
2. Exhibit a specific algorithm that uses polynomial time
per training example.
Cont'd
Step 1:
Let H consist of conjunctions of literals based on n
Boolean variables, so |H| = 3^n.
Now take a look at m ≥ (1/ε)(ln|H| + ln(1/δ)) = (1/ε)(n·ln 3 + ln(1/δ)).
Concrete examples:
δ=ε=0.05, n=10 gives 280 examples
δ=0.01, ε=0.05, n=10 gives 312 examples
δ=ε=0.01, n=10 gives 1,560 examples
δ=ε=0.01, n=50 gives 5,954 examples
Result holds for any consistent learner, including FindS.
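The table above can be reproduced directly from the bound (a minimal sketch; |H| = 3^n for conjunctions, since each variable appears positive, negated, or not at all):

```python
import math

def sample_complexity(size_H, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), rounded up to an integer."""
    return math.ceil((math.log(size_H) + math.log(1 / delta)) / eps)

# Conjunctions of literals over n Boolean variables: |H| = 3^n.
for eps, delta, n in [(0.05, 0.05, 10), (0.05, 0.01, 10),
                      (0.01, 0.01, 10), (0.01, 0.01, 50)]:
    print(n, eps, delta, sample_complexity(3 ** n, eps, delta))
# → 280, 312, 1560, and 5954 examples, matching the table above.
```

The same function works for any finite hypothesis space once |H| is known.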
Sample Complexity of Learning
Arbitrary Boolean Functions
Consider any Boolean function over n Boolean features,
e.g. the hypothesis space of DNF formulas or decision trees.
There are 2^(2^n) such functions, so a sufficient number of
examples to PAC-learn such a concept is:
m ≥ (1/ε)(ln 2^(2^n) + ln(1/δ)) = (1/ε)(2^n·ln 2 + ln(1/δ))
Concrete examples:
δ=ε=0.05, n=10 gives 14,256 examples
δ=ε=0.05, n=20 gives 14,536,410 examples
δ=ε=0.05, n=50 gives 1.561×10^16 examples
Agnostic Learning & Inconsistent
Hypo
So far we have assumed that VSH,D is not
empty; a simple way to guarantee this
is to assume that c belongs to H.
Agnostic learning setting: don't assume
c ∈ H; the learner simply finds the hypothesis
with minimum training error instead.
Sample Complexity for
Infinite Hypothesis Spaces
Infinite Hypothesis Spaces
The preceding analysis was restricted to finite
hypothesis spaces.
Some infinite hypothesis spaces (such as those
including real-valued thresholds or
parameters) are more expressive than others.
Compare a rule allowing one threshold on a
continuous feature (length<3cm) vs one allowing
two thresholds (1cm<length<3cm).
Need some measure of the expressiveness of
infinite hypothesis spaces.
The Vapnik-Chervonenkis (VC) dimension
provides just such a measure, denoted VC(H).
Analogous to the ln|H| bounds, there are bounds for
sample complexity using VC(H).
VC Dimension
An unbiased hypothesis space shatters the entire
instance space.
The larger the subset of X that can be shattered, the
more expressive the hypothesis space is, i.e. the less
biased.
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis
space H defined over instance space X is the size of the
largest finite subset of X shattered by H. If arbitrarily
large finite subsets of X can be shattered, then VC(H) = ∞.
If there exists at least one subset of X of size d that can
be shattered, then VC(H) ≥ d. If no subset of size d can be
shattered, then VC(H) < d.
For single intervals on the real line, all sets of 2
instances can be shattered, but no set of 3 instances can,
so VC(H) = 2.
Since shattering m instances requires |H| ≥ 2^m, for finite H
we have VC(H) ≤ log2|H|.
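The interval example can be checked by brute force (a sketch; restricting candidate endpoints to the sample points loses no labelings, since any labeling achievable by a closed interval is achievable by the interval spanning its smallest and largest positive points):

```python
def interval_labelings(points):
    """All distinct labelings of `points` realizable by a single closed
    interval [a, b] (including the always-negative empty concept)."""
    labelings = {tuple(False for _ in points)}       # empty interval
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(a <= x <= b for x in points))
    return labelings

def shattered(points):
    """True iff single intervals realize all 2^|points| dichotomies."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(shattered((1.0, 2.0)))        # → True: 2 points can be shattered
print(shattered((1.0, 2.0, 3.0)))   # → False: no interval labels the outer
                                    #   two positive and the middle negative
```

Two points shattered, three never, hence VC(H) = 2 as claimed.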
Shattering a Set of
Instances
Def. A dichotomy of a set S is a partition of S
into two disjoint subsets
Def. A set of instances S is shattered by hypo
space H iff for every dichotomy of S, there exists
some hypo in H consistent with this dichotomy.
(Figure: 3 instances shattered by H.)
VC Dimension
We say that a set S of examples is shattered by a set of functions H if
for every partition of the examples in S into positive and negative examples,
there is a function in H that gives exactly these labels to the examples.
(Computational Learning Theory, CS446 Spring 06)
VC Dimension
Motivation: What if H can’t shatter X? Try finite
subsets of X.
Def. The VC dimension of hypothesis space H defined over
instance space X is the size of the largest finite subset
of X shattered by H. If arbitrarily large finite
subsets of X can be shattered by H, then VC(H) ≡ ∞.
Roughly speaking, the VC dimension measures how
many (training) points can be separated, for all
possible labelings, using functions of a given class.
Notice that for any finite H, VC(H) ≤ log2|H|.
Sample Complexity for
Infinite Hypothesis Spaces
II
Upper bound on sample complexity, using the VC
dimension: m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
Lower bound on sample complexity, using the VC
dimension:
Consider any concept class C such that VC(C) ≥ 2, any
learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100.
Then there exists a distribution D and target concept
in C such that if L observes fewer examples than
max[(1/ε) log(1/δ), (VC(C) − 1)/(32ε)]
then with probability at least δ, L outputs a
hypothesis h having errorD(h) > ε.
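Plugged into code, the two bounds look as follows (a sketch; the instance VC(H) = 3 corresponds to linear decision surfaces in the plane, and the lower bound's log is taken base 2 here, matching the convention of the other bounds):

```python
import math

def m_sufficient(vc, eps, delta):
    """Upper bound (Blumer et al. 1989):
    m >= (1/eps)(4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

def m_necessary(vc, eps, delta):
    """Lower bound (Ehrenfeucht et al. 1989):
    max[(1/eps)*log2(1/delta), (VC(C) - 1)/(32*eps)]."""
    return max(math.log2(1 / delta) / eps, (vc - 1) / (32 * eps))

# Hypothetical instance: linear decision surfaces in the plane, VC = 3.
print(m_sufficient(3, 0.05, 0.05))   # sufficient for any consistent learner
print(m_necessary(3, 0.05, 0.05))    # some learner must see at least this many
```

The large gap between the two values shows how loose the sufficient bound can be for small VC dimension.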
An Example: Linear Decision
Surface
Line case: X = the real line, and
H = the set of all open intervals; then
VC(H) = 2.
Plane case: X = the xy-plane, and H = the set of
all linear decision surfaces of the plane;
then VC(H) = 3.
General case: for n-dimensional real space,
let H be the set of its linear decision
surfaces; then VC(H) = n + 1.
Sample Complexity from VC
Dimension
How many randomly drawn examples
suffice to ε-exhaust VSH,D with probability
at least 1 − δ?
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
(Blumer et al. 1989)
Furthermore, it is possible to obtain a
lower bound on sample complexity (next slide).
Lower Bound on Sample
Complexity
Theorem 7.2 (Ehrenfeucht et al. 1989)
Consider any concept class C s.t. VC(C)≥2,
any learner L, and any 0<ε<1/8, and
0<δ<1/100. Then there exists a distribution
D and target concept in C s.t. if L observes
fewer examples than
max[(1/ε)log(1/δ), (VC(C)-1)/(32ε)], then with
probability at least δ, L outputs a hypo h
having errorD(h)>ε.
VC-Dimension for Neural
Networks
Let G be a layered directed acyclic graph with n
input nodes and s ≥ 2 internal nodes, each having
at most r inputs. Let C be a concept class over R^r
of VC dimension d, corresponding to the set of
functions that can be described by each of the s
internal nodes. Let CG be the G-composition of C,
corresponding to the set of functions that can be
represented by G. Then VC(CG) ≤ 2ds log(es),
where e is the base of the natural logarithm.
This theorem can help us bound the VC dimension
of a neural network and thus its sample
complexity.
Mistake Bound Model
Mistake Bound Model
The learner receives a sequence of training
examples and, for each instance x, must predict
c(x) before receiving the correct answer.
Introduction to “Mistake
Bound”
Mistake bound: the total number of mistakes
a learner makes before it converges to the
correct hypothesis
Assume the learner receives a sequence of
training examples, however, for each instance
x, the learner must first predict c(x) before it
receives correct answer from the teacher.
Application scenario: when the learning must
be done on-the-fly, rather than during off-line
training stage.
Learning
The Mistake Bound framework is different from
the PAC framework as it considers learners that
receive a sequence of training examples and that
predict, upon receiving each example, what its
target value is.
The question asked in this setting is: “How many
mistakes will the learner make in its
predictions before it learns the target
concept?”
This question is significant in practical settings
where learning must be done while the system is
in actual use.
Theorem 1. Online learning of conjunctive concepts can be done with
at most n+1 prediction mistakes.
Find-S Algorithm
Find-S: find a maximally specific
hypothesis
1. Initialize h to the most specific hypothesis
in H
2. For each positive training example x
For each attribute constraint ai in h, if it is
satisfied by x, then do nothing; otherwise
replace ai by the next more general constraint
that is satisfied by x.
3. Output hypo h
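A minimal runnable sketch of Find-S for conjunctions of Boolean literals (the hypothesis representation and the example target here are illustrative assumptions, not from the slides):

```python
def find_s(examples, n):
    """Find-S over conjunctions of Boolean literals on n variables.
    The hypothesis h holds one constraint per variable:
    None  = no positive example seen yet (maximally specific),
    0 / 1 = the variable must take that value,
    '?'   = any value is accepted."""
    h = [None] * n                      # most specific hypothesis in H
    for x, positive in examples:
        if not positive:
            continue                    # Find-S ignores negative examples
        for i in range(n):
            if h[i] is None:
                h[i] = x[i]             # adopt value of first positive example
            elif h[i] != '?' and h[i] != x[i]:
                h[i] = '?'              # conflict: generalize to "any value"
    return h

# Hypothetical target: x1 AND (NOT x3), i.e. [1, '?', 0].
data = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False)]
print(find_s(data, 3))                  # → [1, '?', 0]
```

Each generalization step drops at least one constraint, which is the fact behind the n + 1 mistake bound discussed next.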
Mistake Bound for FIND-S
Assume training data is noise-free and target
concept c is in the hypo space H, which consists
of conjunction of up to n boolean literals
Then in the worst case the learner needs to
make n+1 mistakes before it learns c
Note that a misclassification occurs only when the
latest learned hypothesis misclassifies a positive
example as negative, and each such mistake removes
at least one constraint from the hypothesis.
In the worst case above, c is the function that
assigns every instance the value “true”.
Mistake Bound for Halving
Algorithm
Halving algorithm = incrementally learn the
version space as each new instance arrives +
predict each new instance by a majority vote (of the
hypotheses in VS)
Q: What is the maximum number of mistakes
that can be made by the halving algorithm, for an
arbitrary finite H, before it exactly learns the
target concept c (assuming c is in H)?
Answer: the largest integer no more than log2|H|,
since every mistake at least halves the version space.
How about the minimum number of mistakes?
Answer: zero mistakes! The majority can vote correctly
on every instance even while minority hypotheses are
being eliminated.
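A runnable sketch of the halving algorithm (the threshold hypothesis class and the example stream are hypothetical; ties are broken toward 1 here for determinism):

```python
def halving(H, stream):
    """Halving algorithm: keep the version space VS, predict by weighted-free
    majority vote, and eliminate inconsistent hypotheses after each answer.
    Returns the number of prediction mistakes made."""
    VS = list(H)
    mistakes = 0
    for x, cx in stream:
        votes = sum(h(x) for h in VS)
        prediction = votes * 2 >= len(VS)       # majority vote (ties -> True)
        if prediction != cx:
            mistakes += 1
        VS = [h for h in VS if h(x) == cx]      # drop inconsistent hypotheses
    return mistakes

# Hypothetical finite H: threshold concepts h_t(x) = (x >= t) on 0..7.
H = [lambda x, t=t: x >= t for t in range(8)]   # |H| = 8, log2|H| = 3
target = H[5]
stream = [(x, target(x)) for x in [3, 6, 4, 5, 0, 7]]
print(halving(H, stream))                       # → 1
```

On this stream the algorithm errs once, well within the guarantee of at most ⌊log2 8⌋ = 3 mistakes.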
Optimal Mistake Bounds
For an arbitrary concept class C,
assuming H = C, we are interested in the lowest
worst-case mistake bound over all
possible learning algorithms.
Let MA(c) denote the maximum number
of mistakes, over all possible training
sequences, that learner A makes in order
to exactly learn c.
Def. MA(C) ≡ max_{c∈C} MA(c)
Ex: MFind-S(C) = n + 1, MHalving(C) ≤ log2|C|
Optimal Mistake Bounds
(2)
The optimal mistake bound for C, denoted
Opt(C), is defined as min over all learning
algorithms A of MA(C).
Notice that Opt(C) ≤ MHalving(C) ≤ log2|C|.
Furthermore, Littlestone (1987) shows that
VC(C) ≤ Opt(C)!
When C equals the power set CP of any
finite instance space X, the four quantities
VC(C), Opt(C), MHalving(C), and log2|C| all
become equal, namely to |X|.
Optimal Mistake Bounds
Definition: Let C be an arbitrary
nonempty concept class. The optimal
mistake bound for C, denoted Opt(C), is
the minimum over all possible learning
algorithms A of MA(C).
Opt(C) = min_{A ∈ Learning_Algorithms} MA(C)
For any concept class C, the optimal
mistake bound is bounded as follows:
VC(C) ≤ Opt(C) ≤ log2(|C|)
Weighted-Majority
Algorithm
It is a generalization of Halving algorithm:
makes a prediction by taking a weighted
vote among a pool of prediction
algorithms (or hypotheses) and learns by
altering the weights
It starts by assigning equal weight (=1)
to every prediction algorithm. Whenever
an algorithm misclassifies a training
example, reduces its weight
Halving algorithm reduces the weight to zero
Procedure for Adjusting
Weights
ai denotes the ith prediction algorithm in the pool; wi
denotes the weight of ai, and is initialized to 1
For each training example <x, c(x)>
Initialize q0 & q1 to be 0
For each ai, if ai(x)=0 then q0←q0+wi, else q1←q1+wi
If q1>q0, predicts c(x) to be 1, else
if q1<q0, predicts c(x) to be 0, else
predicts c(x) at random to be 1 or 0.
For each ai, do
If ai(x)≠c(x) (given by the teacher), wi←βwi
Weighted-Majority
Algorithm
ai denotes the ith prediction algorithm in the pool A of
algorithm. wi denotes the weight associated with ai.
For all i initialize wi <-- 1
For each training example <x,c(x)>
Initialize q0 and q1 to 0
For each prediction algorithm ai
If ai(x)=0 then q0 <-- q0+wi
If ai(x)=1 then q1 <-- q1+wi
If q1 > q0 then predict c(x)=1
If q0 > q1 then predict c(x) =0
If q0=q1 then predict 0 or 1 at random for c(x)
For each prediction algorithm ai in A do
If ai(x) ≠ c(x) then wi <-- βwi
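The procedure above can be sketched in a few lines (the pool of predictors and the stream are hypothetical; ties are broken toward 0 here for determinism, whereas the slides predict at random):

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Weighted-Majority: predict by weighted vote, then multiply the
    weight of every predictor that erred by beta.  Returns the number
    of mistakes the combined learner makes."""
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, cx in stream:
        q0 = sum(wi for a, wi in zip(predictors, w) if a(x) == 0)
        q1 = sum(wi for a, wi in zip(predictors, w) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0        # ties broken toward 0
        if prediction != cx:
            mistakes += 1
        for i, a in enumerate(predictors):
            if a(x) != cx:
                w[i] *= beta                    # penalize wrong predictors
    return mistakes

# Hypothetical pool: parity-of-x, always-1, always-0 over integer inputs.
pool = [lambda x: x % 2, lambda x: 1, lambda x: 0]
stream = [(x, x % 2) for x in range(10)]        # target = parity
print(weighted_majority(pool, stream))          # → 0
```

Here the best predictor (parity) makes k = 0 mistakes, so the relative bound of the next slide, 2.4(k + log2 3) ≈ 3.8, is comfortably respected: the combined learner makes none.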
Comments on “Adjusting Weights”
The idea can be found in various problems
such as pattern matching, where we
might reduce weights of less frequently
used patterns in the learned library
The textbook claims that one benefit of
the algorithm is that it can accommodate
inconsistent training data; in the case of
learning by query, however, we presume
that the answer given by the teacher is
always correct.
Relative Mistake Bound for the
Algorithm
Theorem 7.3 Let D be the training sequence, A be any
set of n prediction algorithms, and k be the minimum
number of mistakes made by any algorithm in A for
the training sequence D. Then the number of mistakes
over D made by Weighted-Majority algorithm using
β=0.5 is at most 2.4(k+log2n)
Proof: The basic idea is to compare the final weight
of the best prediction algorithm to the sum of the
weights of all algorithms. Let aj be such an algorithm, with k
mistakes; its final weight is wj = 0.5^k. Now consider the sum W
of the weights of all algorithms, and observe that for every mistake
the learner makes, W is reduced to at most (3/4)W, since the erring
weighted majority holds at least half of W and is halved.
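The sketch can be finished in two lines: W starts at n, each of the learner's M mistakes multiplies W by at most 3/4, and W never drops below the best algorithm's weight (the exact constant 1/log2(4/3) ≈ 2.41 is what the theorem's 2.4 rounds):

```latex
\left(\tfrac{1}{2}\right)^{k} = w_j \;\le\; W \;\le\; n\left(\tfrac{3}{4}\right)^{M}
\quad\Longrightarrow\quad
M \;\le\; \frac{k + \log_2 n}{\log_2 \tfrac{4}{3}}
  \;\approx\; 2.41\,(k + \log_2 n).
```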
Relative Mistake Bound for
the Weighted-Majority
Algorithm
Let D be any sequence of training examples,
let A be any set of n prediction algorithms, and
let k be the minimum number of mistakes
made by any algorithm in A for the training
sequence D. Then the number of mistakes over
D made by the Weighted-Majority algorithm
using β = 1/2 is at most 2.4(k + log2 n).
This theorem can be generalized for any 0 ≤ β
< 1, where the bound becomes
(k log2(1/β) + log2 n) / log2(2/(1 + β))