Computational Learning Theory
A theory of the learnable
(Valiant ‘84)
[…] The problem is to discover good models that are
interesting to study for their own sake and that promise to
be relevant both to explaining human experience and to
building devices that can learn […] Learning machines must
have all 3 of the following properties:
the machines can provably learn whole classes of concepts,
these classes can be characterized
the classes of concepts are appropriate and nontrivial for
general-purpose knowledge
the computational process by which the machine builds the
desired programs requires a “feasible” (i.e. polynomial) number
of steps
A theory of the learnable
We seek general laws that constrain inductive
learning, relating:
Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented
Overview
Are there general laws that govern learning?
Sample Complexity: How many training examples are needed for a
learner to converge (with high probability) to a successful hypothesis?
Computational Complexity: How much computational effort is
needed for a learner to converge (with high probability) to a
successful hypothesis?
Mistake Bound: How many training examples will the learner
misclassify before converging to a successful hypothesis?
These questions will be answered within two analytical
frameworks:
The Probably Approximately Correct (PAC) framework
The Mistake Bound framework
Overview (Cont’d)
Rather than answering these questions for
individual learners, we will answer them for
broad classes of learners. In particular we will
consider:
The size or complexity of the hypothesis space
considered by the learner.
The accuracy to which the target concept must be
approximated.
The probability that the learner will output a
successful hypothesis.
The manner in which training examples are
presented to the learner.
Introduction
Problem setting
Inductively learning an unknown target
function, given training examples and a
hypothesis space
Focus on:
How many training examples are sufficient?
How many mistakes will the learner make
before it succeeds?
Introduction (2)
Desirable: quantitative bounds depending on
Complexity of hypo space,
Accuracy of approximation to the target
Probability of outputting a successful hypo
How the training examples are presented
Learner proposes instances
Teacher presents instances
Some random process produces instances
Specifically, study sample complexity,
computational complexity, and mistake bound.
Problem Setting
Space of possible instances X (e.g. set of all people) over
which target functions may be defined.
Assume that different instances in X may be encountered with
different frequencies.
We model this assumption with an unknown (stationary) probability
distribution D that defines the probability of encountering each
instance in X.
Training examples are provided by drawing instances
independently from X, according to D, and they are noise-free.
Each element c of the target function set C corresponds to a certain
subset of X, i.e. c is a Boolean function. (Just for the sake of
simplicity.)
Error of a Hypothesis
Training error of hypothesis h w.r.t. target function c and a
training set S of n samples:
errorS(h) ≡ (1/n) Σ_{x∈S} δ(c(x) ≠ h(x))
(where δ(·) is 1 if its argument is true and 0 otherwise)
True error of hypothesis h w.r.t. target function c and
distribution D:
errorD(h) ≡ Pr_{x~D}[c(x) ≠ h(x)]
errorD(h) is not observable, so how probable is it that
errorS(h) gives a misleading estimate of errorD(h)?
Different from the problem setting in Ch. 5, where samples are
drawn independently of h, here h depends on the training
samples.
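The distinction can be sketched numerically. In this minimal example (the target concept, hypothesis, and distribution are all hypothetical, with D taken to be uniform on [0, 1)), errorS(h) is computed on a small sample while errorD(h) is estimated by Monte Carlo:

```python
import random

def c(x):                                # hypothetical target concept
    return 0.2 <= x <= 0.7

def h(x):                                # hypothetical learned hypothesis
    return 0.25 <= x <= 0.7

def training_error(h, c, sample):
    """errorS(h): fraction of the training sample where h disagrees with c."""
    return sum(h(x) != c(x) for x in sample) / len(sample)

def true_error(h, c, draw, trials=100_000):
    """Monte-Carlo estimate of errorD(h) = Pr_{x~D}[c(x) != h(x)]."""
    mistakes = 0
    for _ in range(trials):
        x = draw()
        mistakes += h(x) != c(x)
    return mistakes / trials

random.seed(0)
draw = random.random                     # D: uniform distribution on [0, 1)
S = [draw() for _ in range(20)]          # small noise-free sample from D
print(training_error(h, c, S))           # observable to the learner
print(true_error(h, c, draw))            # ~0.05, the mass of [0.2, 0.25)
```

With a small sample, errorS(h) can easily differ from errorD(h) ≈ 0.05; quantifying how probable that gap is constitutes exactly the question above.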
An Illustration of True Error
Theoretical Questions of
Interest
Is it possible to identify classes of learning
problems that are inherently difficult or easy,
independent of the learning algorithm?
Can one characterize the number of training
examples necessary or sufficient to assure
successful learning?
How is the number of examples affected
If observing a random sample of training data?
if the learner is allowed to pose queries to the trainer?
Can one characterize the number of mistakes
that a learner will make before learning the
target function?
Can one characterize the inherent computational
complexity of a class of learning algorithms?
Computational Learning
Theory
Relatively recent field
Area of intense research
Partial answers to some of the questions on the
previous page are yes.
Will generally focus on certain types of
learning problems.
Inductive Learning of Target
Function
What we are given
Hypothesis space
Training examples
What we want to know
How many training examples are sufficient
to successfully learn the target function?
How many mistakes will the learner make
before succeeding?
Computational Learning
Theory
Provides a theoretical analysis of learning:
Is it possible to identify classes of learning problems
that are inherently difficult/easy?
(Figure: instance space with regions labeled + and − by target
concept c and hypothesis h.)
Computer Science Department
CS 9633 Machine Learning
Key Points
True error defined over entire instance
space, not just training data
Error depends strongly on the unknown
probability distribution D
The error of h with respect to c is not
directly observable to the learner L;
L can only observe performance with
respect to training data (training error)
Question: How probable is it that the
observed training error for h gives a
misleading estimate of the true error?
PAC Learnability
Goal: characterize classes of target concepts
that can be reliably learned
from a reasonable number of randomly drawn
training examples and
using a reasonable amount of computation
Unreasonable to expect perfect learning where
errorD(h) = 0
Would need to provide training examples
corresponding to every possible instance
With random sample of training examples, there is
always a non-zero probability that the training
examples will be misleading
(Figure: a learner receives positive and negative training examples
and outputs a classifier, which labels new instances positive or
negative.)
Cannot Learn Even Approximate Concepts
from Pathological Training Sets
(Figure: the same learner/classifier diagram with positive and
negative examples.)
Probably approximately correct learning
What we want to learn
CONCEPT = recognizing algorithm
What’s new in p.a.c.
learning?
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem:
use of resources (time, space, …) by computations → COMPLEXITY
THEORY
Example
Sorting: n·logn time (polynomial, feasible)
Bool. satisfiability: 2ⁿ time (exponential, intractable)
PAC Learnability
PAC refers to Probably Approximately Correct
It is desirable for errorD(h) to be zero;
however, to be realistic, we weaken our
demand in two ways:
errorD(h) is only required to be bounded by a small number ε
the learner is not required to succeed on every training
sample; rather, its probability of failure is
bounded by a constant δ.
Hence we come up with the idea of “Probably
Approximately Correct”
PAC Learning
The only reasonable expectation of a
learner is that with high probability it
learns a close approximation to the
target concept.
In the PAC model, we specify two small
parameters, ε and δ, and require that
with probability at least (1 − δ) the system
learns a concept with error at most ε.
The PAC Learning Framework
Definition: A class of concepts C is
PAC learnable using a hypothesis class
H if there exists a learning algorithm L
such that for arbitrarily small δ and ε,
for all concepts c in C, and for all
distributions D over the input space,
there is a probability of at least 1 − δ that
the hypothesis h selected from space H by
L is approximately correct (has less
than ε true error).
Definition of PAC-
Learnability
Definition: Consider a concept class C
defined over a set of instances X of length n
and a learner L using hypothesis space H.
C is PAC-learnable by L using H if for all c ∈ C,
all distributions D over X, all ε such that 0 < ε <
1/2, and all δ such that 0 < δ < 1/2, learner L will
with probability at least (1 − δ) output a
hypothesis h ∈ H such that errorD(h) ≤ ε, in
time that is polynomial in 1/ε, 1/δ, n, and
size(c).
Requirements of Definition
L must, with arbitrarily high probability (1 − δ),
output a hypothesis having arbitrarily
low error (ε).
L's learning must be efficient: it grows
polynomially in terms of
the strength of the output hypothesis (1/ε, 1/δ)
the inherent complexity of the instance space (n) and
concept class C (size(c)).
(Diagram: parameters ε and δ, plus the training sample
{⟨xi, c(xi)⟩}_{i=1}^{n} drawn from D, are input to learning
algorithm L, which outputs hypothesis h.)
Sample Complexity for Finite Hypothesis
Spaces
Sample Complexity for
Finite Hypothesis Spaces
Given any consistent learner, the number of examples
sufficient to assure that any hypothesis will be probably
(with probability (1 − δ)) approximately (within error ε)
correct is m ≥ (1/ε)(ln|H| + ln(1/δ)).
If the learner is not consistent, m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
Conjunctions of Boolean literals are also PAC-learnable,
with m ≥ (1/ε)(n·ln 3 + ln(1/δ)).
k-term DNF expressions are not PAC learnable: even
though they have polynomial sample complexity,
their computational complexity is not polynomial.
Surprisingly, however, k-term CNF is PAC learnable.
Formal Definition of PAC-
Learnable
Consider a concept class C defined over an instance
space X containing instances of length n, and a
learner, L, using a hypothesis space, H. C is said to be
PAC-learnable by L using H iff for all c ∈ C, all
distributions D over X, and all 0 < ε < 0.5, 0 < δ < 0.5, learner L,
by sampling random examples from distribution D, will
with probability at least 1 − δ output a hypothesis h ∈ H
such that errorD(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n,
and size(c).
Example:
X: instances described by n binary features
C: conjunctive descriptions over these features
H: conjunctive descriptions over these features
L: most-specific conjunctive generalization algorithm (Find-S)
size(c): the number of literals in c (i.e. length of the
conjunction).
ε-exhausted
Def. VSH,D is said to be ε-exhausted w.r.t.
c and D if for any h in VSH,D, errorD(h)<ε.
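The finite-|H| sample-complexity bound quoted on the surrounding slides follows from bounding the probability that the version space fails to be ε-exhausted (Haussler 1988); a sketch of the derivation:

```latex
% Any h in H with error_D(h) >= \varepsilon is consistent with m i.i.d.
% examples with probability at most (1-\varepsilon)^m; a union bound
% over the at most |H| such hypotheses gives
\Pr\big[\mathrm{VS}_{H,D}\ \text{not } \varepsilon\text{-exhausted}\big]
  \;\le\; |H|\,(1-\varepsilon)^m \;\le\; |H|\,e^{-\varepsilon m}.
% Requiring this failure probability to be at most \delta and solving:
|H|\,e^{-\varepsilon m} \le \delta
  \quad\Longleftrightarrow\quad
  m \;\ge\; \frac{1}{\varepsilon}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big).
```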
A PAC-Learnable Example
Consider the class C of conjunctions of Boolean
literals.
A Boolean literal is any Boolean variable or its negation
Q: Is such a C PAC-learnable?
A: Yes, by going through the following two steps:
1. Show that any consistent learner will require only a
polynomial number of training examples to learn any
element of C
2. Exhibit a specific algorithm that uses polynomial time
per training example.
Cont'd
Step 1:
Let H consist of conjunctions of literals based on n
Boolean variables, so |H| = 3^n.
Now take a look at m ≥ (1/ε)(ln|H| + ln(1/δ)) = (1/ε)(n·ln 3 + ln(1/δ)).
Concrete examples:
δ=ε=0.05, n=10 gives 280 examples
δ=0.01, ε=0.05, n=10 gives 312 examples
δ=ε=0.01, n=10 gives 1,560 examples
δ=ε=0.01, n=50 gives 5,954 examples
Result holds for any consistent learner, including FindS.
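The table above can be reproduced directly from the bound (a minimal sketch; |H| = 3^n for conjunctions, since each variable appears positive, negated, or not at all):

```python
import math

def sample_complexity(size_H, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), rounded up to an integer."""
    return math.ceil((math.log(size_H) + math.log(1 / delta)) / eps)

# Conjunctions of literals over n Boolean variables: |H| = 3^n.
for eps, delta, n in [(0.05, 0.05, 10), (0.05, 0.01, 10),
                      (0.01, 0.01, 10), (0.01, 0.01, 50)]:
    print(n, eps, delta, sample_complexity(3 ** n, eps, delta))
# → 280, 312, 1560, and 5954 examples, matching the table above.
```

The same function works for any finite hypothesis space once |H| is known.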
Sample Complexity of Learning
Arbitrary Boolean Functions
Consider any Boolean function over n Boolean features,
e.g. the hypothesis space of DNF formulas or decision trees.
There are 2^(2^n) such functions, so a sufficient number of
examples to PAC-learn such a concept is:
m ≥ (1/ε)(ln 2^(2^n) + ln(1/δ)) = (1/ε)(2^n·ln 2 + ln(1/δ))
Concrete examples:
δ=ε=0.05, n=10 gives 14,256 examples
δ=ε=0.05, n=20 gives 14,536,410 examples
δ=ε=0.05, n=50 gives 1.561×10^16 examples
Agnostic Learning & Inconsistent
Hypo
So far we have assumed that VSH,D is not
empty; a simple way to guarantee this
is to assume that c belongs to H.
Agnostic learning setting: don't assume
c ∈ H; the learner simply finds the hypothesis
with minimum training error instead.
Sample Complexity for
Infinite Hypothesis Spaces
Infinite Hypothesis Spaces
The preceding analysis was restricted to finite
hypothesis spaces.
Some infinite hypothesis spaces (such as those
including real-valued thresholds or
parameters) are more expressive than others.
Compare a rule allowing one threshold on a
continuous feature (length<3cm) vs one allowing
two thresholds (1cm<length<3cm).
Need some measure of the expressiveness of
infinite hypothesis spaces.
The Vapnik-Chervonenkis (VC) dimension
provides just such a measure, denoted VC(H).
Analogous to the ln|H| bounds, there are bounds for
sample complexity using VC(H).
VC Dimension
An unbiased hypothesis space shatters the entire
instance space.
The larger the subset of X that can be shattered, the
more expressive the hypothesis space is, i.e. the less
biased.
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis
space H defined over instance space X is the size of the
largest finite subset of X shattered by H. If arbitrarily
large finite subsets of X can be shattered, then VC(H) = ∞.
If there exists at least one subset of X of size d that can
be shattered, then VC(H) ≥ d. If no subset of size d can be
shattered, then VC(H) < d.
For single intervals on the real line, all sets of 2
instances can be shattered, but no set of 3 instances can,
so VC(H) = 2.
Since shattering m instances requires |H| ≥ 2^m, for finite H
we have VC(H) ≤ log2|H|.
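The interval example can be checked by brute force (a sketch; restricting candidate endpoints to the sample points loses no labelings, since any labeling achievable by a closed interval is achievable by the interval spanning its smallest and largest positive points):

```python
def interval_labelings(points):
    """All distinct labelings of `points` realizable by a single closed
    interval [a, b] (including the always-negative empty concept)."""
    labelings = {tuple(False for _ in points)}       # empty interval
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(a <= x <= b for x in points))
    return labelings

def shattered(points):
    """True iff single intervals realize all 2^|points| dichotomies."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(shattered((1.0, 2.0)))        # → True: 2 points can be shattered
print(shattered((1.0, 2.0, 3.0)))   # → False: no interval labels the outer
                                    #   two positive and the middle negative
```

Two points shattered, three never, hence VC(H) = 2 as claimed.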
Shattering a Set of
Instances
Def. A dichotomy of a set S is a partition of S
into two disjoint subsets
Def. A set of instances S is shattered by hypo
space H iff for every dichotomy of S, there exists
some hypo in H consistent with this dichotomy.
(Figure: 3 instances shattered by H.)
VC Dimension
We say that a set S of examples is shattered by a set of functions H if
for every partition of the examples in S into positive and negative examples,
there is a function in H that gives exactly these labels to the examples.
(Computational Learning Theory, CS446 Spring 06)
VC Dimension
Motivation: What if H can’t shatter X? Try finite
subsets of X.
Def. The VC dimension of hypothesis space H defined over
instance space X is the size of the largest finite subset
of X shattered by H. If arbitrarily large finite
subsets of X can be shattered by H, then VC(H) ≡ ∞.
Roughly speaking, the VC dimension measures how
many (training) points can be separated, for all
possible labelings, using functions of a given class.
Notice that for any finite H, VC(H) ≤ log2|H|.
Sample Complexity for
Infinite Hypothesis Spaces
II
Upper bound on sample complexity, using the VC
dimension: m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
Lower bound on sample complexity, using the VC
dimension:
Consider any concept class C such that VC(C) ≥ 2, any
learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100.
Then there exists a distribution D and target concept
in C such that if L observes fewer examples than
max[(1/ε) log(1/δ), (VC(C) − 1)/(32ε)]
then with probability at least δ, L outputs a
hypothesis h having errorD(h) > ε.
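Plugged into code, the two bounds look as follows (a sketch; the instance VC(H) = 3 corresponds to linear decision surfaces in the plane, and the lower bound's log is taken base 2 here, matching the convention of the other bounds):

```python
import math

def m_sufficient(vc, eps, delta):
    """Upper bound (Blumer et al. 1989):
    m >= (1/eps)(4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

def m_necessary(vc, eps, delta):
    """Lower bound (Ehrenfeucht et al. 1989):
    max[(1/eps)*log2(1/delta), (VC(C) - 1)/(32*eps)]."""
    return max(math.log2(1 / delta) / eps, (vc - 1) / (32 * eps))

# Hypothetical instance: linear decision surfaces in the plane, VC = 3.
print(m_sufficient(3, 0.05, 0.05))   # sufficient for any consistent learner
print(m_necessary(3, 0.05, 0.05))    # some learner must see at least this many
```

The large gap between the two values shows how loose the sufficient bound can be for small VC dimension.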
An Example: Linear Decision
Surface
Line case: X = the real line, and
H = the set of all open intervals; then
VC(H) = 2.
Plane case: X = the xy-plane, and H = the set of
all linear decision surfaces of the plane;
then VC(H) = 3.
General case: for n-dimensional real space,
let H be the set of its linear decision
surfaces; then VC(H) = n + 1.
Sample Complexity from VC
Dimension
How many randomly drawn examples
suffice to ε-exhaust VSH,D with probability
at least 1 − δ?
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
(Blumer et al. 1989)
Furthermore, it is possible to obtain a
lower bound on sample complexity (next slide).
Lower Bound on Sample
Complexity
Theorem 7.2 (Ehrenfeucht et al. 1989)
Consider any concept class C s.t. VC(C)≥2,
any learner L, and any 0<ε<1/8, and
0<δ<1/100. Then there exists a distribution
D and target concept in C s.t. if L observes
fewer examples than
max[(1/ε)log(1/δ), (VC(C)-1)/(32ε)], then with
probability at least δ, L outputs a hypo h
having errorD(h)>ε.
VC-Dimension for Neural
Networks
Let G be a layered directed acyclic graph with n
input nodes and s ≥ 2 internal nodes, each having
at most r inputs. Let C be a concept class over R^r
of VC dimension d, corresponding to the set of
functions that can be described by each of the s
internal nodes. Let CG be the G-composition of C,
corresponding to the set of functions that can be
represented by G. Then VC(CG) ≤ 2ds log(es),
where e is the base of the natural logarithm.
This theorem can help us bound the VC dimension
of a neural network and thus its sample
complexity.
Mistake Bound Model
Mistake Bound Model
The learner receives a sequence of training
examples and, for each instance x, must predict
c(x) before receiving the correct answer.
Introduction to “Mistake
Bound”
Mistake bound: the total number of mistakes
a learner makes before it converges to the
correct hypothesis
Assume the learner receives a sequence of
training examples, however, for each instance
x, the learner must first predict c(x) before it
receives correct answer from the teacher.
Application scenario: when the learning must
be done on-the-fly, rather than during off-line
training stage.
Learning
The Mistake Bound framework is different from
the PAC framework as it considers learners that
receive a sequence of training examples and that
predict, upon receiving each example, what its
target value is.
The question asked in this setting is: “How many
mistakes will the learner make in its
predictions before it learns the target
concept?”
This question is significant in practical settings
where learning must be done while the system is
in actual use.
Theorem 1. Online learning of conjunctive concepts can be done with
at most n+1 prediction mistakes.
Find-S Algorithm
Find-S: find a maximally specific
hypothesis
1. Initialize h to the most specific hypothesis
in H
2. For each positive training example x
For each attribute constraint ai in h, if it is
satisfied by x, then do nothing; otherwise
replace ai by the next more general constraint
that is satisfied by x.
3. Output hypo h
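A minimal runnable sketch of Find-S for conjunctions of Boolean literals (the hypothesis representation and the example target here are illustrative assumptions, not from the slides):

```python
def find_s(examples, n):
    """Find-S over conjunctions of Boolean literals on n variables.
    The hypothesis h holds one constraint per variable:
    None  = no positive example seen yet (maximally specific),
    0 / 1 = the variable must take that value,
    '?'   = any value is accepted."""
    h = [None] * n                      # most specific hypothesis in H
    for x, positive in examples:
        if not positive:
            continue                    # Find-S ignores negative examples
        for i in range(n):
            if h[i] is None:
                h[i] = x[i]             # adopt value of first positive example
            elif h[i] != '?' and h[i] != x[i]:
                h[i] = '?'              # conflict: generalize to "any value"
    return h

# Hypothetical target: x1 AND (NOT x3), i.e. [1, '?', 0].
data = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False)]
print(find_s(data, 3))                  # → [1, '?', 0]
```

Each generalization step drops at least one constraint, which is the fact behind the n + 1 mistake bound discussed next.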
Mistake Bound for FIND-S
Assume training data is noise-free and target
concept c is in the hypo space H, which consists
of conjunction of up to n boolean literals
Then in the worst case the learner needs to
make n+1 mistakes before it learns c
Note that a misclassification occurs only when the
latest learned hypothesis misclassifies a positive
example as negative, and each such mistake removes
at least one constraint from the hypothesis.
In the worst case above, c is the function that
assigns every instance the value “true”.
Mistake Bound for Halving
Algorithm
Halving algorithm = incrementally learn the
version space as each new instance arrives +
predict each new instance by a majority vote (of the
hypotheses in VS)
Q: What is the maximum number of mistakes
that can be made by the halving algorithm, for an
arbitrary finite H, before it exactly learns the
target concept c (assuming c is in H)?
Answer: the largest integer no more than log2|H|,
since every mistake at least halves the version space.
How about the minimum number of mistakes?
Answer: zero mistakes! The majority can vote correctly
on every instance even while minority hypotheses are
being eliminated.
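A runnable sketch of the halving algorithm (the threshold hypothesis class and the example stream are hypothetical; ties are broken toward 1 here for determinism):

```python
def halving(H, stream):
    """Halving algorithm: keep the version space VS, predict by weighted-free
    majority vote, and eliminate inconsistent hypotheses after each answer.
    Returns the number of prediction mistakes made."""
    VS = list(H)
    mistakes = 0
    for x, cx in stream:
        votes = sum(h(x) for h in VS)
        prediction = votes * 2 >= len(VS)       # majority vote (ties -> True)
        if prediction != cx:
            mistakes += 1
        VS = [h for h in VS if h(x) == cx]      # drop inconsistent hypotheses
    return mistakes

# Hypothetical finite H: threshold concepts h_t(x) = (x >= t) on 0..7.
H = [lambda x, t=t: x >= t for t in range(8)]   # |H| = 8, log2|H| = 3
target = H[5]
stream = [(x, target(x)) for x in [3, 6, 4, 5, 0, 7]]
print(halving(H, stream))                       # → 1
```

On this stream the algorithm errs once, well within the guarantee of at most ⌊log2 8⌋ = 3 mistakes.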
Optimal Mistake Bounds
For an arbitrary concept class C,
assuming H = C, we are interested in the lowest
worst-case mistake bound over all
possible learning algorithms.
Let MA(c) denote the maximum number
of mistakes, over all possible training
sequences, that learner A makes in order
to exactly learn c.
Def. MA(C) ≡ max_{c∈C} MA(c)
Ex: MFind-S(C) = n + 1, MHalving(C) ≤ log2|C|
Optimal Mistake Bounds
(2)
The optimal mistake bound for C, denoted
Opt(C), is defined as min over all learning
algorithms A of MA(C).
Notice that Opt(C) ≤ MHalving(C) ≤ log2|C|.
Furthermore, Littlestone (1987) shows that
VC(C) ≤ Opt(C)!
When C equals the power set CP of any
finite instance space X, the four quantities
VC(C), Opt(C), MHalving(C), and log2|C| all
become equal, namely to |X|.
Optimal Mistake Bounds
Definition: Let C be an arbitrary
nonempty concept class. The optimal
mistake bound for C, denoted Opt(C), is
the minimum over all possible learning
algorithms A of MA(C).
Opt(C) = min_{A ∈ Learning_Algorithms} MA(C)
For any concept class C, the optimal
mistake bound is bounded as follows:
VC(C) ≤ Opt(C) ≤ log2(|C|)
Weighted-Majority
Algorithm
It is a generalization of Halving algorithm:
makes a prediction by taking a weighted
vote among a pool of prediction
algorithms (or hypotheses) and learns by
altering the weights
It starts by assigning equal weight (=1)
to every prediction algorithm. Whenever
an algorithm misclassifies a training
example, reduces its weight
Halving algorithm reduces the weight to zero
Procedure for Adjusting
Weights
ai denotes the ith prediction algorithm in the pool; wi
denotes the weight of ai, and is initialized to 1
For each training example <x, c(x)>
Initialize q0 & q1 to be 0
For each ai, if ai(x)=0 then q0←q0+wi, else q1←q1+wi
If q1>q0, predicts c(x) to be 1, else
if q1<q0, predicts c(x) to be 0, else
predicts c(x) at random to be 1 or 0.
For each ai, do
If ai(x)≠c(x) (given by the teacher), wi←βwi
Weighted-Majority
Algorithm
ai denotes the ith prediction algorithm in the pool A of
algorithm. wi denotes the weight associated with ai.
For all i initialize wi <-- 1
For each training example <x,c(x)>
Initialize q0 and q1 to 0
For each prediction algorithm ai
If ai(x)=0 then q0 <-- q0+wi
If ai(x)=1 then q1 <-- q1+wi
If q1 > q0 then predict c(x)=1
If q0 > q1 then predict c(x) =0
If q0=q1 then predict 0 or 1 at random for c(x)
For each prediction algorithm ai in A do
If ai(x) ≠ c(x) then wi <-- βwi
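The procedure above can be sketched in a few lines (the pool of predictors and the stream are hypothetical; ties are broken toward 0 here for determinism, whereas the slides predict at random):

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Weighted-Majority: predict by weighted vote, then multiply the
    weight of every predictor that erred by beta.  Returns the number
    of mistakes the combined learner makes."""
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, cx in stream:
        q0 = sum(wi for a, wi in zip(predictors, w) if a(x) == 0)
        q1 = sum(wi for a, wi in zip(predictors, w) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0        # ties broken toward 0
        if prediction != cx:
            mistakes += 1
        for i, a in enumerate(predictors):
            if a(x) != cx:
                w[i] *= beta                    # penalize wrong predictors
    return mistakes

# Hypothetical pool: parity-of-x, always-1, always-0 over integer inputs.
pool = [lambda x: x % 2, lambda x: 1, lambda x: 0]
stream = [(x, x % 2) for x in range(10)]        # target = parity
print(weighted_majority(pool, stream))          # → 0
```

Here the best predictor (parity) makes k = 0 mistakes, so the relative bound of the next slide, 2.4(k + log2 3) ≈ 3.8, is comfortably respected: the combined learner makes none.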
Comments on “Adjusting Weights”
The idea can be found in various problems
such as pattern matching, where we
might reduce weights of less frequently
used patterns in the learned library
The textbook claims that one benefit of
the algorithm is that it can accommodate
inconsistent training data; in the case of
learning by query, however, we presume
that the answer given by the teacher is
always correct.
Relative Mistake Bound for the
Algorithm
Theorem 7.3 Let D be the training sequence, A be any
set of n prediction algorithms, and k be the minimum
number of mistakes made by any algorithm in A for
the training sequence D. Then the number of mistakes
over D made by Weighted-Majority algorithm using
β=0.5 is at most 2.4(k+log2n)
Proof: The basic idea is to compare the final weight
of the best prediction algorithm to the sum of the
weights of all algorithms. Let aj be such an algorithm, with k
mistakes; its final weight is wj = 0.5^k. Now consider the sum W
of the weights of all algorithms, and observe that for every mistake
the learner makes, W is reduced to at most (3/4)W, since the erring
weighted majority holds at least half of W and is halved.
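The sketch can be finished in two lines: W starts at n, each of the learner's M mistakes multiplies W by at most 3/4, and W never drops below the best algorithm's weight (the exact constant 1/log2(4/3) ≈ 2.41 is what the theorem's 2.4 rounds):

```latex
\left(\tfrac{1}{2}\right)^{k} = w_j \;\le\; W \;\le\; n\left(\tfrac{3}{4}\right)^{M}
\quad\Longrightarrow\quad
M \;\le\; \frac{k + \log_2 n}{\log_2 \tfrac{4}{3}}
  \;\approx\; 2.41\,(k + \log_2 n).
```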
Relative Mistake Bound for
the Weighted-Majority
Algorithm
Let D be any sequence of training examples,
let A be any set of n prediction algorithms, and
let k be the minimum number of mistakes
made by any algorithm in A for the training
sequence D. Then the number of mistakes over
D made by the Weighted-Majority algorithm
using β = 1/2 is at most 2.4(k + log2 n).
This theorem can be generalized for any 0 ≤ β
< 1, where the bound becomes
(k log2(1/β) + log2 n) / log2(2/(1 + β))