lecture01
Figure 1: (a) Data received for the rectangle learning game. The rectangle R used to generate
labels is not known to the learning algorithm. (b) The tightest fit algorithm produces a rectangle
R0. (c) & (d) The regions T1, T2, T3 and T4 contain ε/4 mass each under D.
Let R denote the true rectangle that actually defines the labelling function cR. Since we've
chosen the tightest possible fit, the rectangle R0 must be entirely contained inside R. Consider
the shaded region shown in Fig. 1(b). For any point x in this shaded region, it must
be that hR0(x) = −, while cR(x) = +. In other words, our prediction function hR0 would make
errors on all of these points. If we had to make predictions on points that mostly lie in this
region, our hypothesis would be quite bad. This brings us to an important point: the data
used to learn a hypothesis should be similar to the data on which we will be tested. Let
us formalise this notion.
Let D be a distribution over R². Our training data consists of m points that are drawn
independently according to D and then labelled according to the function cR. We will define the
error of a hypothesis hR0 with respect to the target function cR and distribution D as follows:
    err(hR0; cR, D) = P_{x∼D}[hR0(x) ≠ cR(x)]
Whenever the target cR and distribution D are clear from context, we will simply refer to this
as err(hR0 ).
Let us show that in fact our algorithm outputs an hR0 that is quite good, in the sense
that given any ε > 0 as the target error, with high probability it will output hR0 such that
err(hR0; cR, D) ≤ ε. Consider four rectangular strips T1, T2, T3, T4 that are chosen along the
sides of the rectangle R (and lying inside it) such that they each have a probability mass of
exactly ε/4 under D.² Note that some of these strips overlap, e.g., T1 and T2 (see Fig. 1(c)).
The total probability mass of the set T1 ∪ T2 ∪ T3 ∪ T4 is at most ε. Now, if we can guarantee that
the training data of m points contains at least one point from each of T1, T2, T3 and T4, then
the tightest fit rectangle R0 will be such that R \ R0 ⊆ T1 ∪ T2 ∪ T3 ∪ T4 and as a consequence,
²Assuming the distribution D is smooth over R², i.e., there is no point mass under D, this is always possible. Otherwise, the algorithm is still correct; however, the analysis is slightly more tedious.
err(hR0; cR, D) ≤ ε. This is shown in Fig. 1(d); note that if even one of the Ti does not contain any
data point, this may cause a problem, in the sense that the region of disagreement between
R and R0 may have probability mass greater than ε (see Fig. 1(c)).
Let A1 be the event that when m points are drawn independently according to D, none of
them lies in T1 . Similarly define the events A2 , A3 , A4 for T2 , T3 , T4 . Let E = A1 ∪ A2 ∪ A3 ∪ A4 .
If E does not occur, then err(hR0; cR, D) ≤ ε. We will use the union bound to bound P[E]. The
union bound states that for any two events A and B,
P [A ∪ B] ≤ P [A] + P [B] .
Let us compute P[A1]. The probability that a single point drawn from D does not land in T1
is 1 − ε/4, so the probability that after m independent draws from D none of the points are in
T1 is (1 − ε/4)^m. By a similar argument, P[Ai] = (1 − ε/4)^m for i = 1, . . . , 4. Thus, we have

    P[E] ≤ Σ_{i=1}^{4} P[Ai]        (by the union bound)
         = 4 (1 − ε/4)^m
         ≤ 4 exp(−εm/4)             (as 1 − x ≤ e^(−x))
Now let δ > 0 be fixed. If m ≥ (4/ε) log(4/δ), then we have that P[E] ≤ δ. In other words,
with probability at least 1 − δ, err(hR0; cR, D) ≤ ε.
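The analysis can be checked empirically. Below is a minimal sketch (assuming, purely for illustration, a uniform distribution D on the unit square and one particular target rectangle R; all function names are our own) of the tightest-fit learner with m chosen by the bound above:

```python
import math
import random

def tightest_fit(sample):
    """Smallest axis-aligned rectangle containing all positively labelled points."""
    pos = [p for (p, label) in sample if label == 1]
    if not pos:
        return None  # no positive examples: predict negative everywhere
    xs = [x for (x, _) in pos]
    ys = [y for (_, y) in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def contains(rect, point):
    """h_rect(point): 1 if the point lies inside the rectangle, else 0."""
    if rect is None:
        return 0
    x, y = point
    xlo, xhi, ylo, yhi = rect
    return 1 if (xlo <= x <= xhi and ylo <= y <= yhi) else 0

# Illustrative target rectangle and parameters (our own choices).
R = (0.2, 0.8, 0.3, 0.9)
eps, delta = 0.1, 0.05
m = math.ceil(4 / eps * math.log(4 / delta))  # m >= (4/eps) log(4/delta)

rng = random.Random(0)
def draw():
    return (rng.random(), rng.random())  # x ~ D, uniform on [0,1]^2

train = [(p, contains(R, p)) for p in (draw() for _ in range(m))]
R_prime = tightest_fit(train)

# Estimate err(h) = P[h(x) != c(x)] on fresh draws from D.
test = [draw() for _ in range(20000)]
err = sum(contains(R_prime, p) != contains(R, p) for p in test) / len(test)
print(m, round(err, 3))  # with high probability, err is at most eps
```

Note that the tightest fit is always contained in R, so all errors are false negatives, exactly as the argument above describes.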
1. The learning algorithm does not know the target concept to be learnt (obviously, otherwise
there is nothing to learn!). However, the learning algorithm does know the set of possible
target concepts. In this instance, the unknown target is always an axis-aligned rectangle.
2. The learning algorithm has access to data drawn from some distribution D. We do assume
that the observations are drawn independently according to D. However, no assumption
is made on the distribution D itself.
3. The output hypothesis is evaluated with respect to the same distribution D as was used
to obtain the training data.
³We are using the word "required" a bit loosely here. All we can say is that our present analysis of this particular
algorithm suggests that the amount of data required scales linearly in 1/ε. We will see lower bounds of this nature
that hold for any algorithm later in the course.
4. We would like our algorithms to be statistically efficient (requiring a relatively small number
of examples to guarantee high accuracy and confidence) as well as computationally efficient
(requiring a reasonable amount of time for processing examples and producing an output
hypothesis).
Instance Space
Let X denote the set of possible instances or examples that may be seen by the learning
algorithm. For instance, in the rectangle learning game the examples were points in R². When
classifying images, the examples may be 3-dimensional arrays containing the RGB values of
each pixel.
Concept Class
A concept c over X is a boolean function c : X → {0, 1}.⁴ A concept class C over X is
a collection of concepts c over X. In the rectangle learning game, the concept class under
consideration is all axis-aligned rectangles. In order to view the concepts as boolean functions,
we may use the convention that + corresponds to 1 and − corresponds to 0. The learning
algorithm has knowledge of C, but not of the specific concept c ∈ C that is used to label the
observations.
Data Generation
Let D be a fixed probability distribution over X. The training data is obtained as follows. An
example x ∈ X is drawn according to the distribution D. If c is the target concept, the example
x is labelled accordingly as c(x). The learning algorithm observes (x, c(x)). We will refer to
this process as an example oracle, EX(c, D). We assume that a learning algorithm can query
the oracle EX(c, D) at unit cost.
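As a mental model, an example oracle can be sketched as a function the learner calls at unit cost (the names and the toy concept below are our own, for illustration only):

```python
import random

def make_oracle(c, sampler, seed=0):
    """Return EX(c, D): each call draws x ~ D via `sampler` and returns (x, c(x))."""
    rng = random.Random(seed)
    def ex():
        x = sampler(rng)
        return x, c(x)
    return ex

# Illustrative instance: X = {0,1}^3, D uniform, target concept c(x) = x1 (first bit).
ex = make_oracle(lambda x: x[0],
                 lambda rng: tuple(rng.randint(0, 1) for _ in range(3)))
x, y = ex()
print(x, y)  # a labelled example (x, c(x))
```

A learning algorithm is then simply a procedure that calls `ex()` some number of times and outputs a hypothesis.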
The error of a hypothesis h with respect to the target concept c and distribution D is defined
as before: err(h; c, D) = P_{x∼D}[h(x) ≠ c(x)]. When c and D are clear from context, we will
simply refer to this as err(h).
Definition 1 (PAC Learning : Take I). Let C be a concept class over X. We say that C
is PAC learnable if there exists a learning algorithm L that satisfies the following: for every
concept c ∈ C, for every distribution D over X, for every 0 < ε < 1/2 and 0 < δ < 1/2, if L is
given access to EX(c, D) and inputs ε and δ, L outputs a hypothesis h ∈ C that with probability
at least 1 − δ satisfies err(h) ≤ ε. The probability is over the random examples drawn from
EX(c, D) as well as any internal randomisation of L. We further say that C is efficiently PAC
learnable if the running time of L is polynomial in 1/ε and 1/δ.
The term PAC stands for probably approximately correct. The approximately correct part
captures the notion that the output hypothesis does have some error; demanding higher accuracy
(lower ε) is possible, but comes at the cost of increased running time and sample complexity.
The probably part captures the notion that there is some chance that the algorithm may fail
completely. This may happen because the observations are not representative of the distribution,
⁴We will consider concepts that are not boolean functions later in the course.
a low probability event, though very much a possible one. Our confidence (lower δ) in the
correctness of our algorithm increases as we allow more sample complexity and running time.
In Section 2, we essentially proved the following theorem.
Theorem 2. The concept class C of axis-aligned rectangles in R² is efficiently PAC (Take I)
learnable.
Representation Scheme
Abstractly, a representation scheme for a concept class C is a function R : Σ∗ → C, where Σ
is a finite alphabet.⁶ Any σ satisfying R(σ) = c is called a representation of c. We assume
that there is a function, size : Σ∗ → N, that measures the size of a representation. A concept
⁵We assume that our computers can store and manipulate real numbers at unit cost.
⁶If the concept requires using real numbers, such as in the case of rectangles, we may use R : (Σ ∪ R)∗ → C.
c ∈ C may in general have multiple representations under R. For example, there are several
boolean circuits that compute exactly the same boolean function. We extend size to the set C
by defining size(c) = min_{σ : R(σ)=c} size(σ). When we refer to a concept class, we will assume by
default that it is associated with a representation scheme and a size function, so that size(c) is
well defined for c ∈ C.
Instance Size
Typically in machine learning problems, instances have a size; roughly, we may think of the
size of an instance as the memory required to store it. For example, 10 × 10 black and
white images can be defined using 100 bits, whereas 1024 × 1024 colour images require over
3 million real numbers. Thus, when faced with larger instances it is natural to allow learning
algorithms more time. In this course, we will only consider settings where the instance space is
either Xn = {0, 1}ⁿ or Xn = Rⁿ. We denote by Cn a concept class over Xn. We consider the
instance space X = ∪_{n≥1} Xn and the concept class C = ∪_{n≥1} Cn.
Definition 3 (PAC Learning : Take II). Let Cn be a concept class over Xn and let C = ∪_{n≥1} Cn
and X = ∪_{n≥1} Xn. We say that C is PAC learnable (take II) if there exists an algorithm L
that satisfies the following: for all n ∈ N, for every c ∈ Cn, for every D over Xn, for every
0 < ε < 1/2 and 0 < δ < 1/2, if L is given access to EX(c, D) and inputs ε, δ and size(c),
L outputs h ∈ Cn that with probability at least 1 − δ satisfies err(h) ≤ ε. We say that C is
efficiently PAC learnable if the running time of L is polynomial in n, size(c), 1/ε and 1/δ, when
learning c ∈ Cn.
4 Learning Conjunctions
Let us now consider a second learning problem. Let Xn = {0, 1}ⁿ and let Cn denote the set of
conjunctions over Xn. Let X = ∪_{n≥1} Xn and CONJUNCTIONS = ∪_{n≥1} Cn. A conjunction
can be represented by a set of literals, where each literal corresponds to either a variable or its
negation. An example of a conjunction is x1 ∧ x3 ∧ x4. Sometimes a conjunction is referred to
as a term. We can represent a conjunction using a bit string of length at most 2n, as there
are only 2n possible literals for n boolean variables. Thus, our goal is to design an algorithm
that runs in time polynomial in n, 1/ε and 1/δ.
The example oracle EX(c, D) returns examples of the form (a, y) where y ∈ {0, 1}. The ith
bit ai is the assignment to the variable xi. The label y is 1 if the target conjunction c evaluates
to true under the assignment a, and 0 otherwise.
We consider the following simple algorithm:
(i) Start with a candidate hypothesis that includes all 2n literals, i.e.,
    h ≡ x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xn ∧ ¬xn
(ii) For each positive example (a, 1) and for i = 1, . . . , n, do the following: if ai = 1, delete
the literal ¬xi from h (if present); if ai = 0, delete the literal xi from h (if present).
1. The algorithm only uses positive examples; the negative examples are completely ignored.
2. If c is the target conjunction, any literal ℓ that is present in c is also present in the
hypothesis h. This is because the algorithm only throws out literals that cannot possibly
be present in c: the literal evaluated to 0 in the example (a, 1), but the label of
the example was positive. This also ensures that if c(a) = 0 then h(a) = 0. Thus, if h
errs at all, it only does so on examples a such that c(a) = 1.
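The elimination algorithm above can be sketched in a few lines (the encoding of a literal as a pair (i, b), with b = 1 for xi and b = 0 for ¬xi, is a convention of our own):

```python
def learn_conjunction(n, examples):
    """Elimination algorithm for conjunctions over x1..xn.

    Start from all 2n literals and drop every literal falsified by a
    positive example; negative examples are ignored.
    """
    h = {(i, b) for i in range(1, n + 1) for b in (0, 1)}
    for a, y in examples:
        if y != 1:
            continue  # negative examples are completely ignored
        for i in range(1, n + 1):
            # bit a[i-1] falsifies the literal (i, 1 - a[i-1])
            h.discard((i, 1 - a[i - 1]))
    return h

def evaluate(h, a):
    """h(a) = 1 iff every remaining literal is satisfied by a."""
    return int(all(a[i - 1] == b for (i, b) in h))

# Illustrative target c = x1 AND not-x2; labels below come from c.
data = [((1, 0, 0), 1), ((1, 0, 1), 1), ((0, 0, 1), 0), ((1, 1, 1), 0)]
h = learn_conjunction(3, data)
print(sorted(h))  # → [(1, 1), (2, 0)], i.e. x1 AND not-x2
```

Note that the two positive examples alone suffice to eliminate every literal not in the target, which matches remark 1 above.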
Theorem 4. The class of boolean conjunctions, CONJUNCTIONS, is efficiently PAC (take II)
learnable.
Proof. Let c be the target conjunction and D the distribution over {0, 1}ⁿ. For a literal ℓ, let
p(ℓ) = P_{a∼D}[c(a) = 1 ∧ ℓ is 0 in a]; here "ℓ is 0 in a" means that the assignment of the variables
according to a causes the literal ℓ to evaluate to 0. Notice that if p(ℓ) > 0, then the literal ℓ
cannot be present in c; if it were, then there could be no a such that c(a) = 1 and ℓ is 0 in a.
We define a literal ℓ to be bad if p(ℓ) ≥ ε/(2n). We wish to ensure that all bad literals are
eliminated from the hypothesis h. For a bad literal ℓ, let Aℓ denote the event that after m
independent draws from EX(c, D), ℓ is not eliminated from h. Note that this can only happen
if no a such that c(a) = 1 but ℓ is 0 in a is drawn. This can happen with probability at most
(1 − ε/(2n))^m. Let B denote the set of bad literals and let E = ∪_{ℓ∈B} Aℓ be the event that at least
one bad literal survives in h. We shall choose m large enough so that P [E] ≤ δ. Consider the
following:
    P[E] ≤ Σ_{ℓ∈B} P[Aℓ]            (by the union bound)
         ≤ 2n (1 − ε/(2n))^m        (|B| ≤ 2n and P[Aℓ] ≤ (1 − ε/(2n))^m for each ℓ ∈ B)
         ≤ 2n exp(−εm/(2n))         (as 1 − x ≤ e^(−x))
Thus, whenever m ≥ (2n/ε) log(2n/δ), we know that P[E] ≤ δ. Now, suppose that E does not occur,
i.e., all bad literals are eliminated from h. Let G be the set of good literals.
i.e., all bad literals are eliminated from h. Let G be the set of good literals.
    err(h) = P_{a∼D}[c(a) = 1 ∧ h(a) = 0]
           ≤ Σ_{ℓ∈G} P_{a∼D}[c(a) = 1 ∧ ℓ is 0 in a]
           ≤ 2n · ε/(2n) ≤ ε
This finishes the proof.
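To see the sample bound of the proof in action, one can run the elimination algorithm with m set as in the proof. The sketch below assumes, purely for illustration, a uniform distribution D and one particular target conjunction (all names are our own):

```python
import math
import random

def learn(n, examples):
    """Elimination algorithm: literal (i, b) survives iff no positive example falsifies it."""
    h = {(i, b) for i in range(1, n + 1) for b in (0, 1)}
    for a, y in examples:
        if y == 1:
            for i in range(1, n + 1):
                h.discard((i, 1 - a[i - 1]))
    return h

def holds(lits, a):
    """Evaluate a conjunction of literals (i, b) on assignment a."""
    return int(all(a[i - 1] == b for (i, b) in lits))

n, eps, delta = 10, 0.1, 0.05
target = {(1, 1), (3, 0), (7, 1)}  # c = x1 AND not-x3 AND x7 (illustrative)
m = math.ceil(2 * n / eps * math.log(2 * n / delta))  # m >= (2n/eps) log(2n/delta)

rng = random.Random(1)
def draw():
    a = tuple(rng.randint(0, 1) for _ in range(n))
    return a, holds(target, a)

h = learn(n, [draw() for _ in range(m)])

# Empirical error on fresh draws; at most eps with probability >= 1 - delta.
test = [draw() for _ in range(20000)]
err = sum(holds(h, a) != y for a, y in test) / len(test)
print(m, err <= eps)
```

Since literals of the target are never eliminated, h always contains the target, and the empirical error counts only false negatives, exactly as in the proof.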
Note that any function that can be expressed as a 3-term DNF formula has representation
size at most 6n: there are three terms, each of which is a boolean conjunction expressible by a
boolean string of length 2n. Thus, an efficient algorithm for learning 3-TERM-DNF would have
to run in time polynomial in n, 1/ε and 1/δ. Our next result shows that such an algorithm is in
fact unlikely to exist. Formally, we'll prove the following theorem.
Theorem 5. 3-TERM-DNF is not efficiently PAC learnable (take II) unless RP = NP.
Let us first discuss the condition “unless RP = NP”. Let us briefly describe what the class
RP is. For further details, the reader is referred to a book on complexity theory such as that
by Arora and Barak (2009).
The class RP consists of languages for which membership can be determined by a randomised
polynomial time algorithm that errs on only one side. More precisely, a language L ∈ RP if
there exists a randomised polynomial time algorithm A that satisfies the following:
• For π ∉ L, A(π) = 0.
• For π ∈ L, A(π) = 1 with probability at least 1/2.
5.1 Reducing Graph 3-Colouring to PAC (Take II) Learning 3-term DNF
Let us now actually choose a particular NP-complete language and show that it can be reduced
to the learning problem for 3-TERM-DNF. The language 3-COLOURABLE consists of represen-
tations of graphs that can be 3-coloured. We say a graph is 3-colourable if there is an assignment
from the vertices to the set {r, g, b} such that no two adjacent vertices are assigned the same
colour. As already discussed, given a graph G, we only need to produce a sample of positively
and negatively labelled points such that the graph G is 3-colourable if and only if there exists
a 3-term DNF consistent with all the labelled points.
Let S = S+ ∪ S− where S+ consists of points that have label 1 and S− of points that have
label 0. Suppose G has n vertices. For a vertex i of G, we let v(i) ∈ {0, 1}ⁿ be the vector that has
a 1 in every position except i. For an edge (i, j) in G, we let e(i, j) ∈ {0, 1}ⁿ be the vector that
has a 1 in all positions except i and j. Let S+ = {v(i) | i a node of G} and S− = {e(i, j) | (i, j) an edge of G}. Figure 3
shows an example of a graph that is 3-colourable along with the sample S.
(a) Graph: vertices 1–6 with edges (1,2), (1,6), (2,3), (2,4), (3,6), (4,5), (4,6), (5,6).
(b) Positive examples:
    v(1) = (0, 1, 1, 1, 1, 1)    v(2) = (1, 0, 1, 1, 1, 1)    v(3) = (1, 1, 0, 1, 1, 1)
    v(4) = (1, 1, 1, 0, 1, 1)    v(5) = (1, 1, 1, 1, 0, 1)    v(6) = (1, 1, 1, 1, 1, 0)
(c) Negative examples:
    e(1, 2) = (0, 0, 1, 1, 1, 1)    e(1, 6) = (0, 1, 1, 1, 1, 0)    e(2, 3) = (1, 0, 0, 1, 1, 1)
    e(2, 4) = (1, 0, 1, 0, 1, 1)    e(3, 6) = (1, 1, 0, 1, 1, 0)    e(4, 5) = (1, 1, 1, 0, 0, 1)
    e(4, 6) = (1, 1, 1, 0, 1, 0)    e(5, 6) = (1, 1, 1, 1, 0, 0)
Figure 3: (a) A graph G along with a valid three colouring. (b) Positive examples of the sample
generated using G. (c) Negative examples of the sample generated using G.
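The construction of S+ and S− from G is mechanical. A short sketch (function names are our own) that reproduces the vectors of Figure 3:

```python
def graph_to_sample(n, edges):
    """Build (S+, S-) for the 3-COLOURABLE reduction.

    v(i): all-ones vector with a 0 at position i; e(i, j): 0s at positions
    i and j. Vertices are numbered 1..n, matching the figure.
    """
    def vec(zeros):
        return tuple(0 if k in zeros else 1 for k in range(1, n + 1))
    s_pos = [vec({i}) for i in range(1, n + 1)]
    s_neg = [vec({i, j}) for (i, j) in edges]
    return s_pos, s_neg

# The graph of Figure 3: 6 vertices, 8 edges.
edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 6), (4, 5), (4, 6), (5, 6)]
s_pos, s_neg = graph_to_sample(6, edges)
print(s_pos[0])  # v(1) = (0, 1, 1, 1, 1, 1)
print(s_neg[0])  # e(1, 2) = (0, 0, 1, 1, 1, 1)
```

The reduction runs in time polynomial in the size of G, as required for an NP-hardness argument.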
Suppose G is 3-colourable, and let Vr, Vg and Vb denote the sets of vertices coloured red, green
and blue respectively. For each colour c ∈ {r, g, b}, let Tc be the conjunction of the positive
literals xi over all vertices i ∉ Vc; then ϕ = Tr ∨ Tg ∨ Tb is a 3-term DNF formula. We will
show that ϕ is consistent with S. To show this we need to show that all examples in S+ satisfy
ϕ and none of the examples in S− satisfy ϕ. First consider v(i) ∈ S+. Without loss of generality,
suppose i is coloured red, i.e., i ∈ Vr. Then, we claim that v(i) is a satisfying assignment of Tr
and hence also of ϕ. Clearly, the literal xi is not contained in Tr and there are no negative
literals in Tr. Since all the bits of v(i) other than the ith position are 1, v(i) is a satisfying
assignment of Tr.
Now, consider e(i, j). We will show that e(i, j) is not a satisfying assignment of any of Tr,
Tg or Tb and hence it also does not satisfy ϕ. For a colour c ∈ {r, g, b}, either i is not coloured
c or j isn't. Suppose i is the one that is not coloured c; then Tc contains the literal xi, but the
ith bit of e(i, j) is 0, and so e(i, j) is not a satisfying assignment of Tc. This argument applies to
all colours and hence e(i, j) is not a satisfying assignment of ϕ. This shows that ϕ is consistent
with S.
6 Learning 3-CNF
Let us revisit the problem of PAC-learning 3-term DNF formulae. Let us recall the distributive
law of boolean operations:
(a ∧ b) ∨ (c ∧ d) ≡ (a ∨ c) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (b ∨ d)
Using the distributive rule, we can re-write a 3-term DNF formula as follows:
    T1 ∨ T2 ∨ T3 ≡ ∧_{ℓ1∈T1, ℓ2∈T2, ℓ3∈T3} (ℓ1 ∨ ℓ2 ∨ ℓ3)        (1)
The form on the right hand side is called a 3-CNF formula. CNF stands for conjunctive normal
form, i.e., expression as a conjunction of clauses (disjunctions). The 3 indicates that each clause
has length 3 (i.e., it contains 3 literals). Note that any 3-term DNF formula can be expressed
as a 3-CNF formula with at most (2n)3 clauses, as each of the terms in a 3-term DNF formula
can have at most 2n literals. The converse, however, is not true: there are 3-CNF formulae
that cannot be represented as a 3-term DNF formula. Thus, 3-CNF formulae form a more general
class of functions.
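The expansion in equation (1) can be carried out mechanically. The sketch below (with a string encoding of literals chosen purely for illustration) expands a 3-term DNF into its clause set:

```python
from itertools import product

def dnf_to_cnf(t1, t2, t3):
    """Expand T1 v T2 v T3 into the equivalent set of clauses of equation (1).

    Each term is a collection of literals; the result has |T1|*|T2|*|T3|
    clauses, hence at most (2n)^3 when each term has at most 2n literals.
    """
    # A clause containing a literal and its negation is trivially true,
    # but we keep it here to match the count in equation (1).
    return [frozenset({l1, l2, l3}) for l1, l2, l3 in product(t1, t2, t3)]

# Literals as strings, e.g. 'x1' and '~x1' (an illustrative encoding).
clauses = dnf_to_cnf(['x1', 'x2'], ['~x3'], ['x4', '~x1'])
print(len(clauses))  # 2 * 1 * 2 = 4 clauses
```

This makes the polynomial blow-up concrete: the clause count is the product of the three term lengths.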
Let us now consider the question of learning 3-CNF formulae. In doing so, we’ll study an
important notion in computational learning theory, as in all of computer science, that of a
reduction. Note that a 3-CNF formula can be viewed simply as a conjunction over a set of
clauses. Thus, if we can create a new instance space where the variables are actually encodings
of all the (2n)3 possible clauses, the 3-CNF formula simply becomes a conjunction. For any
three literals, `1 , `2 , `3 , we create a variable z`1 ,`2 ,`3 that takes the value `1 ∨ `2 ∨ `3 .
More precisely, consider the following maps: (i) f : {0, 1}ⁿ → {0, 1}^((2n)³), which is a one-one
map that takes an assignment of n boolean variables and maps it to an assignment of all possible
clauses of length 3 over the n boolean variables; (ii) g : 3-CNF[n] → CONJUNCTIONS[(2n)³],
which maps a 3-CNF formula over n variables to a conjunction over (2n)³ variables. We have
already seen that we can PAC learn the class of conjunctions. We’d like to use this algorithm
to learn the class 3-CNF.
Let c be the target 3-CNF formula and D a distribution over {0, 1}ⁿ. Let c0 = g(c) and
let D0 be a distribution over {0, 1}^((2n)³), where D0(S) = D(f⁻¹(S)). We need to be able to
simulate the example oracle EX(c0, D0). Doing so is simple: we receive (x, c(x)) from EX(c, D) and
we output (f(x), c(x)); this is a valid simulation of EX(c0, D0). The learning algorithm returns a
conjunction of (positive or negative) literals over (2n)3 variables. We need to transform this back
to a 3-CNF formula over n variables. The first thing to note is that we only need to learn the
class of conjunctions with positive literals, so we can modify the conjunction learning algorithm
to start with a conjunction that includes all the positive literals (instead of all the positive and
negative literals); the rest of the algorithm remains the same. Then, if a literal corresponding
to z`1 ,`2 ,`3 is included in the output conjunction, we include the clause (x`1 ∨ x`2 ∨ x`3 ) in the
3-CNF formula.
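The map f can be made concrete as follows; the (i, b) encoding of literals and the ordering of the (2n)³ triples are choices of our own — any fixed indexing of the clauses works:

```python
from itertools import product

def clause_variables(n):
    """All (2n)^3 ordered triples of literals over n variables.

    A literal is (i, b): b = 1 for x_i, b = 0 for its negation.
    """
    lits = [(i, b) for i in range(1, n + 1) for b in (0, 1)]
    return list(product(lits, repeat=3))

def f(a):
    """Map an assignment a of n variables to an assignment of the z-variables:
    z_{l1,l2,l3} = (l1 v l2 v l3) evaluated under a."""
    n = len(a)
    def val(lit):
        i, b = lit
        return int(a[i - 1] == b)
    return tuple(max(val(l1), val(l2), val(l3))
                 for (l1, l2, l3) in clause_variables(n))

# Simulating EX(c', D'): draw (x, c(x)) from EX(c, D) and emit (f(x), c(x)).
a = (1, 0)
z = f(a)
print(len(z))  # (2n)^3 = 64 for n = 2
```

Each coordinate of f(x) is exactly the truth value of one candidate clause, which is why a 3-CNF over x becomes a monotone conjunction over z.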
7 PAC Learning
Let us consider the results in Sections 5 and 6 and reconsider the goal of learning. On the one
hand, we saw that it is impossible to learn the class of 3-term DNF formulae unless NP = RP.
The hardness result crucially relies on the fact that the output of the algorithm was itself
required to be a 3-term DNF formula. On the other hand, we know that any 3-term DNF
formula can be expressed as a 3-CNF formula (with a modest, but still polynomial, blow-up
in size). The class of 3-CNF formulae turns out to be PAC-learnable. This means that when
receiving examples labelled according to a 3-term DNF formula, if we had in fact applied the
3-CNF learning algorithm, we would have produced a hypothesis with low error. The output
hypothesis however would be a 3-CNF formula not a 3-term DNF formula. If the aim of machine
learning is prediction, it should not matter what form we represent the hypothesis in. In fact,
we may not even care that the output is a 3-CNF formula, it could potentially be something
even more complex, as long as it makes correct predictions. Thus, in the final definition of PAC
learning, we’ll remove the condition that the output hypothesis actually belongs to the concept
class being learnt. In general, we’ll allow learning algorithms to output hypotheses that lie in
some class H, called the hypothesis class. While we allow the hypothesis class H to be different
from the concept class C and possibly a lot more powerful, for the notion of efficiency we do
require that any hypothesis h ∈ H is polynomial time evaluatable. As with the case of concept
classes, we assume that there is a representation scheme and a suitable size function for the
hypothesis class H. Formally,
Remark 9. It is worth noting that the requirement that H is polynomially evaluatable is quite
necessary and natural. Essentially what we are asking is that the designer of a learning algo-
rithm, when returning the “code” to make predictions, actually returns code that can be evaluated
by the user.
Bibliographic Notes
Material in this lecture is almost entirely adapted from (Kearns and Vazirani, 1994, Chap. 1).
The original PAC learning framework was introduced in a seminal paper by Valiant (1984).
References
Sanjeev Arora and Boaz Barak. Computational Complexity: A Modern Approach. Cambridge
University Press, 2009.
Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory.
MIT Press, 1994.
Leslie Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.