lecture01
Figure 1: (a) Data received for the rectangle learning game. The rectangle R used to generate
labels is not known to the learning algorithm. (b) The tightest fit algorithm produces a rectangle
R0. (c) & (d) The regions T1, T2, T3 and T4 contain ε/4 mass each under D.
Let R denote the true rectangle that actually defines the labelling function cR. Since we've
chosen the tightest possible fit, the rectangle R0 must be entirely contained inside R. Consider
the shaded region shown in Fig. 1(b). For any point x in this shaded region, it must
be that hR0(x) = −, while cR(x) = +. In other words, our prediction function hR0 would make
errors on all of these points. If we had to make predictions on points that mostly lie in this
region, our hypothesis would be quite bad. This brings us to an important point: the data
used to learn a hypothesis should be similar to the data on which we will be tested. Let
us formalise this notion.
Let D be a distribution over R². Our training data consists of m points that are drawn
independently according to D and then labelled according to the function cR. We will define the
error of a hypothesis hR0 with respect to the target function cR and distribution D as follows:
    err(hR0; cR, D) = P_{x∼D}[hR0(x) ≠ cR(x)]
Whenever the target cR and distribution D are clear from context, we will simply refer to this
as err(hR0 ).
Let us show that in fact our algorithm outputs an hR0 that is quite good, in the sense
that given any ε > 0 as the target error, with high probability it will output hR0 such that
err(hR0; cR, D) ≤ ε. Consider four rectangular strips T1, T2, T3, T4 that are chosen along the
sides of the rectangle R (and lying inside it) such that they each have a probability mass of
exactly ε/4 under D.² Note that some of these strips overlap, e.g., T1 and T2 (see Fig. 1(c)).
The total probability mass of the set T1 ∪ T2 ∪ T3 ∪ T4 is at most ε. Now, if we can guarantee that
the training data of m points contains at least one point from each of T1, T2, T3 and T4, then
the tightest fit rectangle R0 will be such that R \ R0 ⊆ T1 ∪ T2 ∪ T3 ∪ T4 and as a consequence,
²Assuming the distribution D is smooth over R², i.e., there is no point mass under D, this is always possible. Otherwise, the algorithm is still correct; however, the analysis is slightly more tedious.
err(hR0; cR, D) ≤ ε. This is shown in Fig. 1(d); note that if even one of the Ti does not contain any
data point, this may cause a problem, in the sense that the region of disagreement between
R and R0 may have probability mass greater than ε (see Fig. 1(c)).
Let A1 be the event that when m points are drawn independently according to D, none of
them lies in T1 . Similarly define the events A2 , A3 , A4 for T2 , T3 , T4 . Let E = A1 ∪ A2 ∪ A3 ∪ A4 .
If E does not occur, then err(hR0; cR, D) ≤ ε. We will use the union bound to bound P[E]. The
union bound states that for any two events A and B,
P [A ∪ B] ≤ P [A] + P [B] .
Let us compute P[A1]. The probability that a single point drawn from D does not land in T1
is 1 − ε/4, so the probability that after m independent draws from D none of the points are in
T1 is (1 − ε/4)^m. By a similar argument, P[Ai] = (1 − ε/4)^m for i = 1, . . . , 4. Thus, we have

    P[E] ≤ Σ_{i=1}^{4} P[Ai]        (by the union bound)
         = 4 (1 − ε/4)^m
         ≤ 4 exp(−εm/4)             (as 1 − x ≤ e^(−x))
Now let δ > 0 be fixed. If m ≥ (4/ε) log(4/δ), then we have that P[E] ≤ δ. In other words,
with probability at least 1 − δ, err(hR0; cR, D) ≤ ε.
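The analysis can be checked empirically. Below is a minimal sketch (assuming, purely for illustration, a uniform distribution D on the unit square and one particular target rectangle R; all function names are our own) of the tightest-fit learner with m chosen by the bound above:

```python
import math
import random

def tightest_fit(sample):
    """Smallest axis-aligned rectangle containing all positively labelled points."""
    pos = [p for (p, label) in sample if label == 1]
    if not pos:
        return None  # no positive examples: predict negative everywhere
    xs = [x for (x, _) in pos]
    ys = [y for (_, y) in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def contains(rect, point):
    """h_rect(point): 1 if the point lies inside the rectangle, else 0."""
    if rect is None:
        return 0
    x, y = point
    xlo, xhi, ylo, yhi = rect
    return 1 if (xlo <= x <= xhi and ylo <= y <= yhi) else 0

# Illustrative target rectangle and parameters (our own choices).
R = (0.2, 0.8, 0.3, 0.9)
eps, delta = 0.1, 0.05
m = math.ceil(4 / eps * math.log(4 / delta))  # m >= (4/eps) log(4/delta)

rng = random.Random(0)
def draw():
    return (rng.random(), rng.random())  # x ~ D, uniform on [0,1]^2

train = [(p, contains(R, p)) for p in (draw() for _ in range(m))]
R_prime = tightest_fit(train)

# Estimate err(h) = P[h(x) != c(x)] on fresh draws from D.
test = [draw() for _ in range(20000)]
err = sum(contains(R_prime, p) != contains(R, p) for p in test) / len(test)
print(m, round(err, 3))  # with high probability, err is at most eps
```

Note that the tightest fit is always contained in R, so all errors are false negatives, exactly as the argument above describes.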
1. The learning algorithm does not know the target concept to be learnt (obviously, otherwise
there is nothing to learn!). However, the learning algorithm does know the set of possible
target concepts. In this instance, the unknown target is always an axis-aligned rectangle.
2. The learning algorithm has access to data drawn from some distribution D. We do assume
that the observations are drawn independently according to D. However, no assumption
is made on the distribution D itself.
3. The output hypothesis is evaluated with respect to the same distribution D as was used
to obtain the training data.
³We are using the word "required" a bit loosely here. All we can say is that our present analysis of this particular
algorithm suggests that the amount of data required scales linearly in 1/ε. We will see lower bounds of this nature
that hold for any algorithm later in the course.
4. We would like our algorithms to be statistically efficient (requiring a relatively small number
of examples to guarantee high accuracy and confidence) as well as computationally efficient
(requiring a reasonable amount of time for processing examples and producing an output
hypothesis).
Instance Space
Let X denote the set of possible instances or examples that may be seen by the learning
algorithm. For instance, in the rectangle learning game the examples were points in R². When
classifying images, the examples may be 3-dimensional arrays containing the RGB values of
each pixel.
Concept Class
A concept c over X is a boolean function c : X → {0, 1}.⁴ A concept class C over X is
a collection of concepts c over X. In the rectangle learning game, the concept class under
consideration is all axis-aligned rectangles. In order to view the concepts as boolean functions,
we may use the convention that + corresponds to 1 and − corresponds to 0. The learning
algorithm has knowledge of C, but not of the specific concept c ∈ C that is used to label the
observations.
Data Generation
Let D be a fixed probability distribution over X. The training data is obtained as follows. An
example x ∈ X is drawn according to the distribution D. If c is the target concept, the example
x is labelled accordingly as c(x). The learning algorithm observes (x, c(x)). We will refer to
this process as an example oracle, EX(c, D). We assume that a learning algorithm can query
the oracle EX(c, D) at unit cost.
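As a mental model, an example oracle can be sketched as a function the learner calls at unit cost (the names and the toy concept below are our own, for illustration only):

```python
import random

def make_oracle(c, sampler, seed=0):
    """Return EX(c, D): each call draws x ~ D via `sampler` and returns (x, c(x))."""
    rng = random.Random(seed)
    def ex():
        x = sampler(rng)
        return x, c(x)
    return ex

# Illustrative instance: X = {0,1}^3, D uniform, target concept c(x) = x1 (first bit).
ex = make_oracle(lambda x: x[0],
                 lambda rng: tuple(rng.randint(0, 1) for _ in range(3)))
x, y = ex()
print(x, y)  # a labelled example (x, c(x))
```

A learning algorithm is then simply a procedure that calls `ex()` some number of times and outputs a hypothesis.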
The error of a hypothesis h with respect to the target concept c and distribution D is defined
as before: err(h; c, D) = P_{x∼D}[h(x) ≠ c(x)]. When c and D are clear from context, we will
simply refer to this as err(h).
Definition 1 (PAC Learning : Take I). Let C be a concept class over X. We say that C
is PAC learnable if there exists a learning algorithm L that satisfies the following: for every
concept c ∈ C, for every distribution D over X, for every 0 < ε < 1/2 and 0 < δ < 1/2, if L is
given access to EX(c, D) and inputs ε and δ, L outputs a hypothesis h ∈ C that with probability
at least 1 − δ satisfies err(h) ≤ ε. The probability is over the random examples drawn from
EX(c, D) as well as any internal randomisation of L. We further say that C is efficiently PAC
learnable if the running time of L is polynomial in 1/ε and 1/δ.
The term PAC stands for probably approximately correct. The approximately correct part
captures the notion that the output hypothesis does have some error; demanding higher accuracy
(lower ε) is possible, but comes at the cost of increased running time and sample complexity.
The probably part captures the notion that there is some chance that the algorithm may fail
completely. This may happen because the observations are not representative of the distribution,
⁴We will consider concepts that are not boolean functions later in the course.
a low probability event, though very much a possible one. Our confidence (lower δ) in the
correctness of our algorithm increases as we allow more sample complexity and running time.
In Section 2, we essentially proved the following theorem.
Theorem 2. The concept class C of axis-aligned rectangles in R² is efficiently PAC (Take I)
learnable.
Representation Scheme
Abstractly, a representation scheme for a concept class C is a function R : Σ∗ → C, where Σ
is a finite alphabet.⁶ Any σ satisfying R(σ) = c is called a representation of c. We assume
that there is a function, size : Σ∗ → N, that measures the size of a representation. A concept
⁵We assume that our computers can store and manipulate real numbers at unit cost.
⁶If the concept requires using real numbers, such as in the case of rectangles, we may use R : (Σ ∪ R)∗ → C.
c ∈ C may in general have multiple representations under R. For example, there are several
boolean circuits that compute exactly the same boolean function. We extend size to the set C
by defining size(c) = min_{σ : R(σ)=c} size(σ). When we refer to a concept class, we will assume by
default that it is associated with a representation scheme and a size function, so that size(c) is
well defined for c ∈ C.
Instance Size
Typically in machine learning problems, instances have a size; roughly, we may think of the
size of an instance as the memory required to store it. For example, 10 × 10 black and
white images can be defined using 100 bits, whereas 1024 × 1024 colour images require over
3 million real numbers. Thus, when faced with larger instances it is natural to allow learning
algorithms more time. In this course, we will only consider settings where the instance space is
either Xn = {0, 1}ⁿ or Xn = Rⁿ. We denote by Cn a concept class over Xn. We consider the
instance space X = ∪_{n≥1} Xn and the concept class C = ∪_{n≥1} Cn.
Definition 3 (PAC Learning : Take II). Let Cn be a concept class over Xn and let C = ∪_{n≥1} Cn
and X = ∪_{n≥1} Xn. We say that C is PAC learnable (take II) if there exists an algorithm L
that satisfies the following: for all n ∈ N, for every c ∈ Cn, for every D over Xn, for every
0 < ε < 1/2 and 0 < δ < 1/2, if L is given access to EX(c, D) and inputs ε, δ and size(c),
L outputs h ∈ Cn that with probability at least 1 − δ satisfies err(h) ≤ ε. We say that C is
efficiently PAC learnable if the running time of L is polynomial in n, size(c), 1/ε and 1/δ, when
learning c ∈ Cn.
4 Learning Conjunctions
Let us now consider a second learning problem. Let Xn = {0, 1}ⁿ and let Cn denote the set of
conjunctions over Xn. Let X = ∪_{n≥1} Xn and CONJUNCTIONS = ∪_{n≥1} Cn. A conjunction
can be represented by a set of literals, where each literal corresponds to either a variable or its
negation. An example of a conjunction is x1 ∧ x3 ∧ x4. Sometimes a conjunction is referred to
as a term. We can represent a conjunction using a bit string of length at most 2n, as there
are only 2n possible literals for n boolean variables. Thus, our goal is to design an algorithm
that runs in time polynomial in n, 1/ε and 1/δ.
The example oracle EX(c, D) returns examples of the form (a, y) where y ∈ {0, 1}. The ith
bit ai is the assignment to the variable xi. The label y is 1 if the target conjunction c evaluates
to true under the assignment a, and 0 otherwise.
We consider the following simple algorithm:
(i) Start with a candidate hypothesis that includes all 2n literals, i.e.,
    h ≡ x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xn ∧ ¬xn
(ii) For each positive example (a, 1) and for i = 1, . . . , n, do the following: if ai = 1, delete
the literal ¬xi from h (if present); if ai = 0, delete the literal xi from h (if present).
1. The algorithm only uses positive examples; the negative examples are completely ignored.
2. If c is the target conjunction, any literal ℓ that is present in c is also present in the
hypothesis h. This is because the algorithm only throws out literals that cannot possibly
be present in c: the literal evaluated to 0 in the example (a, 1), but the label of
the example was positive. This also ensures that if c(a) = 0 then h(a) = 0. Thus, if h
errs at all, it only does so on examples a such that c(a) = 1.
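The elimination algorithm above can be sketched in a few lines (the encoding of a literal as a pair (i, b), with b = 1 for xi and b = 0 for ¬xi, is a convention of our own):

```python
def learn_conjunction(n, examples):
    """Elimination algorithm for conjunctions over x1..xn.

    Start from all 2n literals and drop every literal falsified by a
    positive example; negative examples are ignored.
    """
    h = {(i, b) for i in range(1, n + 1) for b in (0, 1)}
    for a, y in examples:
        if y != 1:
            continue  # negative examples are completely ignored
        for i in range(1, n + 1):
            # bit a[i-1] falsifies the literal (i, 1 - a[i-1])
            h.discard((i, 1 - a[i - 1]))
    return h

def evaluate(h, a):
    """h(a) = 1 iff every remaining literal is satisfied by a."""
    return int(all(a[i - 1] == b for (i, b) in h))

# Illustrative target c = x1 AND not-x2; labels below come from c.
data = [((1, 0, 0), 1), ((1, 0, 1), 1), ((0, 0, 1), 0), ((1, 1, 1), 0)]
h = learn_conjunction(3, data)
print(sorted(h))  # → [(1, 1), (2, 0)], i.e. x1 AND not-x2
```

Note that the two positive examples alone suffice to eliminate every literal not in the target, which matches remark 1 above.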
Theorem 4. The class of boolean conjunctions, CONJUNCTIONS, is efficiently PAC (take II)
learnable.
Proof. Let c be the target conjunction and D the distribution over {0, 1}ⁿ. For a literal ℓ, let
p(ℓ) = P_{a∼D}[c(a) = 1 ∧ ℓ is 0 in a]; here "ℓ is 0 in a" means that the assignment of the variables
according to a causes the literal ℓ to evaluate to 0. Notice that if p(ℓ) > 0, then the literal ℓ
cannot be present in c; if it were, then there could be no a such that c(a) = 1 and ℓ is 0 in a.
We define a literal ℓ to be bad if p(ℓ) ≥ ε/(2n). We wish to ensure that all bad literals are
eliminated from the hypothesis h. For a bad literal ℓ, let Aℓ denote the event that after m
independent draws from EX(c, D), ℓ is not eliminated from h. Note that this can only happen
if no a such that c(a) = 1 but ℓ is 0 in a is drawn. This can happen with probability at most
(1 − ε/(2n))^m. Let B denote the set of bad literals and let E = ∪_{ℓ∈B} Aℓ be the event that at least
one bad literal survives in h. We shall choose m large enough so that P [E] ≤ δ. Consider the
following:
    P[E] ≤ Σ_{ℓ∈B} P[Aℓ]            (by the union bound)
         ≤ 2n (1 − ε/(2n))^m        (|B| ≤ 2n and P[Aℓ] ≤ (1 − ε/(2n))^m for each ℓ ∈ B)
         ≤ 2n exp(−εm/(2n))         (as 1 − x ≤ e^(−x))
Thus, whenever m ≥ (2n/ε) log(2n/δ), we know that P[E] ≤ δ. Now, suppose that E does not occur,
i.e., all bad literals are eliminated from h. Let G be the set of good literals.
i.e., all bad literals are eliminated from h. Let G be the set of good literals.
    err(h) = P_{a∼D}[c(a) = 1 ∧ h(a) = 0]
           ≤ Σ_{ℓ∈G} P_{a∼D}[c(a) = 1 ∧ ℓ is 0 in a]
           ≤ 2n · ε/(2n) ≤ ε
This finishes the proof.
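To see the sample bound of the proof in action, one can run the elimination algorithm with m set as in the proof. The sketch below assumes, purely for illustration, a uniform distribution D and one particular target conjunction (all names are our own):

```python
import math
import random

def learn(n, examples):
    """Elimination algorithm: literal (i, b) survives iff no positive example falsifies it."""
    h = {(i, b) for i in range(1, n + 1) for b in (0, 1)}
    for a, y in examples:
        if y == 1:
            for i in range(1, n + 1):
                h.discard((i, 1 - a[i - 1]))
    return h

def holds(lits, a):
    """Evaluate a conjunction of literals (i, b) on assignment a."""
    return int(all(a[i - 1] == b for (i, b) in lits))

n, eps, delta = 10, 0.1, 0.05
target = {(1, 1), (3, 0), (7, 1)}  # c = x1 AND not-x3 AND x7 (illustrative)
m = math.ceil(2 * n / eps * math.log(2 * n / delta))  # m >= (2n/eps) log(2n/delta)

rng = random.Random(1)
def draw():
    a = tuple(rng.randint(0, 1) for _ in range(n))
    return a, holds(target, a)

h = learn(n, [draw() for _ in range(m)])

# Empirical error on fresh draws; at most eps with probability >= 1 - delta.
test = [draw() for _ in range(20000)]
err = sum(holds(h, a) != y for a, y in test) / len(test)
print(m, err <= eps)
```

Since literals of the target are never eliminated, h always contains the target, and the empirical error counts only false negatives, exactly as in the proof.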
Note that any function that can be expressed as a 3-term DNF formula has representation
size at most 6n: there are three terms, each of which is a boolean conjunction expressible by a
boolean string of length 2n. Thus, an efficient algorithm for learning 3-TERM-DNF would have
to run in time polynomial in n, 1/ε and 1/δ. Our next result shows that such an algorithm is in
fact unlikely to exist. Formally, we'll prove the following theorem.
Theorem 5. 3-TERM-DNF is not efficiently PAC learnable (take II) unless RP = NP.
Let us first discuss the condition “unless RP = NP”. Let us briefly describe what the class
RP is. For further details, the reader is referred to a book on complexity theory such as that
by Arora and Barak (2009).
The class RP consists of languages for which membership can be determined by a randomised
polynomial time algorithm that errs on only one side. More precisely, a language L ∈ RP if
there exists a randomised polynomial time algorithm A that satisfies the following:
• For π ∉ L, A(π) = 0.
• For π ∈ L, A(π) = 1 with probability at least 1/2.
5.1 Reducing Graph 3-Colouring to PAC (Take II) Learning 3-term DNF
Let us now actually choose a particular NP-complete language and show that it can be reduced
to the learning problem for 3-TERM-DNF. The language 3-COLOURABLE consists of represen-
tations of graphs that can be 3-coloured. We say a graph is 3-colourable if there is an assignment
from the vertices to the set {r, g, b} such that no two adjacent vertices are assigned the same
colour. As already discussed, given a graph G, we only need to produce a sample of positively
and negatively labelled points such that the graph G is 3-colourable if and only if there exists
a 3-term DNF consistent with all the labelled points.
Let S = S+ ∪ S− where S+ consists of points that have label 1 and S− of points that have
label 0. Suppose G has n vertices. For a vertex i of G, we let v(i) ∈ {0, 1}ⁿ be the vector that has
a 1 in every position except i. For an edge (i, j) in G, we let e(i, j) ∈ {0, 1}ⁿ be the vector that
has a 1 in all positions except i and j. Let S+ = {v(i) | i a node of G} and S− = {e(i, j) | (i, j) an edge of G}. Figure 3
shows an example of a graph that is 3-colourable along with the sample S.
(a) Graph: vertices 1–6 with edges (1,2), (1,6), (2,3), (2,4), (3,6), (4,5), (4,6), (5,6).
(b) Positive examples:
    v(1) = (0, 1, 1, 1, 1, 1)    v(2) = (1, 0, 1, 1, 1, 1)    v(3) = (1, 1, 0, 1, 1, 1)
    v(4) = (1, 1, 1, 0, 1, 1)    v(5) = (1, 1, 1, 1, 0, 1)    v(6) = (1, 1, 1, 1, 1, 0)
(c) Negative examples:
    e(1, 2) = (0, 0, 1, 1, 1, 1)    e(1, 6) = (0, 1, 1, 1, 1, 0)    e(2, 3) = (1, 0, 0, 1, 1, 1)
    e(2, 4) = (1, 0, 1, 0, 1, 1)    e(3, 6) = (1, 1, 0, 1, 1, 0)    e(4, 5) = (1, 1, 1, 0, 0, 1)
    e(4, 6) = (1, 1, 1, 0, 1, 0)    e(5, 6) = (1, 1, 1, 1, 0, 0)
Figure 3: (a) A graph G along with a valid three colouring. (b) Positive examples of the sample
generated using G. (c) Negative examples of the sample generated using G.
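The construction of S+ and S− from G is mechanical. A short sketch (function names are our own) that reproduces the vectors of Figure 3:

```python
def graph_to_sample(n, edges):
    """Build (S+, S-) for the 3-COLOURABLE reduction.

    v(i): all-ones vector with a 0 at position i; e(i, j): 0s at positions
    i and j. Vertices are numbered 1..n, matching the figure.
    """
    def vec(zeros):
        return tuple(0 if k in zeros else 1 for k in range(1, n + 1))
    s_pos = [vec({i}) for i in range(1, n + 1)]
    s_neg = [vec({i, j}) for (i, j) in edges]
    return s_pos, s_neg

# The graph of Figure 3: 6 vertices, 8 edges.
edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 6), (4, 5), (4, 6), (5, 6)]
s_pos, s_neg = graph_to_sample(6, edges)
print(s_pos[0])  # v(1) = (0, 1, 1, 1, 1, 1)
print(s_neg[0])  # e(1, 2) = (0, 0, 1, 1, 1, 1)
```

The reduction runs in time polynomial in the size of G, as required for an NP-hardness argument.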
Suppose G is 3-colourable, and let Vr, Vg and Vb denote the sets of vertices coloured red, green
and blue respectively. For each colour c ∈ {r, g, b}, let Tc be the conjunction of the positive
literals xi over all vertices i ∉ Vc; then ϕ = Tr ∨ Tg ∨ Tb is a 3-term DNF formula. We will
show that ϕ is consistent with S. To show this we need to show that all examples in S+ satisfy
ϕ and none of the examples in S− satisfy ϕ. First consider v(i) ∈ S+. Without loss of generality,
suppose i is coloured red, i.e., i ∈ Vr. Then, we claim that v(i) is a satisfying assignment of Tr
and hence also of ϕ. Clearly, the literal xi is not contained in Tr and there are no negative
literals in Tr. Since all the bits of v(i) other than the ith position are 1, v(i) is a satisfying
assignment of Tr.
Now, consider e(i, j). We will show that e(i, j) is not a satisfying assignment of any of Tr,
Tg or Tb and hence it also does not satisfy ϕ. For a colour c ∈ {r, g, b}, either i is not coloured
c or j isn't. Suppose i is the one that is not coloured c; then Tc contains the literal xi, but the
ith bit of e(i, j) is 0, and so e(i, j) is not a satisfying assignment of Tc. This argument applies to
all colours and hence e(i, j) is not a satisfying assignment of ϕ. This shows that ϕ is consistent
with S.
6 Learning 3-CNF
Let us revisit the problem of PAC-learning 3-term DNF formulae. Let us recall the distributive
law of boolean operations:
(a ∧ b) ∨ (c ∧ d) ≡ (a ∨ c) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (b ∨ d)
Using the distributive rule, we can re-write a 3-term DNF formula as follows:
    T1 ∨ T2 ∨ T3 ≡ ∧_{ℓ1∈T1, ℓ2∈T2, ℓ3∈T3} (ℓ1 ∨ ℓ2 ∨ ℓ3)        (1)
The form on the right hand side is called a 3-CNF formula. CNF stands for conjunctive normal
form, i.e., expression as a conjunction of clauses (disjunctions). The 3 indicates that each clause
has length 3 (i.e., it contains 3 literals). Note that any 3-term DNF formula can be expressed
as a 3-CNF formula with at most (2n)3 clauses, as each of the terms in a 3-term DNF formula
can have at most 2n literals. The converse, however, is not true: there are 3-CNF formulae
that cannot be represented as a 3-term DNF formula. Thus, 3-CNF formulae form a more general
class of functions.
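The expansion in equation (1) can be carried out mechanically. The sketch below (with a string encoding of literals chosen purely for illustration) expands a 3-term DNF into its clause set:

```python
from itertools import product

def dnf_to_cnf(t1, t2, t3):
    """Expand T1 v T2 v T3 into the equivalent set of clauses of equation (1).

    Each term is a collection of literals; the result has |T1|*|T2|*|T3|
    clauses, hence at most (2n)^3 when each term has at most 2n literals.
    """
    # A clause containing a literal and its negation is trivially true,
    # but we keep it here to match the count in equation (1).
    return [frozenset({l1, l2, l3}) for l1, l2, l3 in product(t1, t2, t3)]

# Literals as strings, e.g. 'x1' and '~x1' (an illustrative encoding).
clauses = dnf_to_cnf(['x1', 'x2'], ['~x3'], ['x4', '~x1'])
print(len(clauses))  # 2 * 1 * 2 = 4 clauses
```

This makes the polynomial blow-up concrete: the clause count is the product of the three term lengths.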
Let us now consider the question of learning 3-CNF formulae. In doing so, we’ll study an
important notion in computational learning theory, as in all of computer science, that of a
reduction. Note that a 3-CNF formula can be viewed simply as a conjunction over a set of
clauses. Thus, if we can create a new instance space where the variables are actually encodings
of all the (2n)3 possible clauses, the 3-CNF formula simply becomes a conjunction. For any
three literals, `1 , `2 , `3 , we create a variable z`1 ,`2 ,`3 that takes the value `1 ∨ `2 ∨ `3 .
More precisely, consider the following maps: (i) f : {0, 1}ⁿ → {0, 1}^((2n)³), which is a one-one
map that takes an assignment of n boolean variables and maps it to an assignment of all possible
clauses of length 3 over the n boolean variables; (ii) g : 3-CNF[n] → CONJUNCTIONS[(2n)³],
which maps a 3-CNF formula over n variables to a conjunction over (2n)³ variables. We have
already seen that we can PAC learn the class of conjunctions. We’d like to use this algorithm
to learn the class 3-CNF.
Let c be the target 3-CNF formula and D a distribution over {0, 1}ⁿ. Let c0 = g(c) and
let D0 be a distribution over {0, 1}^((2n)³), where D0(S) = D(f⁻¹(S)). We need to be able to
simulate the example oracle EX(c0, D0). Doing so is simple: we receive (x, c(x)) from EX(c, D) and
we output (f(x), c(x)); this is a valid simulation of EX(c0, D0). The learning algorithm returns a
conjunction of (positive or negative) literals over (2n)3 variables. We need to transform this back
to a 3-CNF formula over n variables. The first thing to note is that we only need to learn the
class of conjunctions with positive literals, so we can modify the conjunction learning algorithm
to start with a conjunction that includes all the positive literals (instead of all the positive and
negative literals); the rest of the algorithm remains the same. Then, if a literal corresponding
to z`1 ,`2 ,`3 is included in the output conjunction, we include the clause (x`1 ∨ x`2 ∨ x`3 ) in the
3-CNF formula.
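The map f can be made concrete as follows; the (i, b) encoding of literals and the ordering of the (2n)³ triples are choices of our own — any fixed indexing of the clauses works:

```python
from itertools import product

def clause_variables(n):
    """All (2n)^3 ordered triples of literals over n variables.

    A literal is (i, b): b = 1 for x_i, b = 0 for its negation.
    """
    lits = [(i, b) for i in range(1, n + 1) for b in (0, 1)]
    return list(product(lits, repeat=3))

def f(a):
    """Map an assignment a of n variables to an assignment of the z-variables:
    z_{l1,l2,l3} = (l1 v l2 v l3) evaluated under a."""
    n = len(a)
    def val(lit):
        i, b = lit
        return int(a[i - 1] == b)
    return tuple(max(val(l1), val(l2), val(l3))
                 for (l1, l2, l3) in clause_variables(n))

# Simulating EX(c', D'): draw (x, c(x)) from EX(c, D) and emit (f(x), c(x)).
a = (1, 0)
z = f(a)
print(len(z))  # (2n)^3 = 64 for n = 2
```

Each coordinate of f(x) is exactly the truth value of one candidate clause, which is why a 3-CNF over x becomes a monotone conjunction over z.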
7 PAC Learning
Let us consider the results in Sections 5 and 6 and reconsider the goal of learning. On the one
hand, we saw that it is impossible to learn the class of 3-term DNF formulae unless NP = RP.
The hardness result crucially relies on the fact that the output of the algorithm was itself
required to be a 3-term DNF formula. On the other hand, we know that any 3-term DNF
formula can be expressed as a 3-CNF formula (with a modest, but still polynomial, blow-up
in size). The class of 3-CNF formulae turns out to be PAC-learnable. This means that when
receiving examples labelled according to a 3-term DNF formula, if we had in fact applied the
3-CNF learning algorithm, we would have produced a hypothesis with low error. The output
hypothesis however would be a 3-CNF formula not a 3-term DNF formula. If the aim of machine
learning is prediction, it should not matter what form we represent the hypothesis in. In fact,
we may not even care that the output is a 3-CNF formula, it could potentially be something
even more complex, as long as it makes correct predictions. Thus, in the final definition of PAC
learning, we’ll remove the condition that the output hypothesis actually belongs to the concept
class being learnt. In general, we’ll allow learning algorithms to output hypotheses that lie in
some class H, called the hypothesis class. While we allow the hypothesis class H to be different
from the concept class C and possibly a lot more powerful, for the notion of efficiency we do
require that any hypothesis h ∈ H is polynomial time evaluatable. As with the case of concept
classes, we assume that there is a representation scheme and a suitable size function for the
hypothesis class H. Formally,
Remark 9. It is worth noting that the requirement that H is polynomially evaluatable is quite
necessary and natural. Essentially what we are asking is that the designer of a learning algo-
rithm, when returning the “code” to make predictions, actually returns code that can be evaluated
by the user.
Bibliographic Notes
Material in this lecture is almost entirely adapted from (Kearns and Vazirani, 1994, Chap. 1).
The original PAC learning framework was introduced in a seminal paper by Valiant (1984).
References
Sanjeev Arora and Boaz Barak. Computational Complexity: A Modern Approach. Cambridge
University Press, 2009.
Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory.
MIT Press, 1994.
Leslie Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.