Lect 0329
The PAC model assumes access to a noise-free oracle for examples of the target concept. In reality,
we need learning algorithms with at least some tolerance for mislabeled examples. In this lecture
we study a classic model for learning in the presence of noise, the Random Classification Noise
model. This is a simple noise model in which one can get positive algorithmic results.
In this model, a learning algorithm will have access to a modified and noisy oracle for examples,
denoted by EX^η_CN(c∗, D). Here c∗ and D are the target concept and distribution, and 0 ≤ η < 1/2
is a new parameter called the classification noise rate.
    EX^η_CN(c∗, D) =  { ⟨x, c∗(x)⟩ from EX(c∗, D),     with probability 1 − η
                      { ⟨x, ¬c∗(x)⟩ from EX(c∗, D),    with probability η
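To make the oracle concrete, here is a minimal Python sketch of EX^η_CN, assuming a hypothetical
helper ex() that returns a correctly labeled pair ⟨x, c∗(x)⟩ drawn from EX(c∗, D); the names are
illustrative, not part of the original notes.

    import random

    def ex_cn(ex, eta):
        """One call to the noisy oracle EX^eta_CN(c*, D) (illustrative sketch).

        `ex` is a hypothetical noise-free oracle returning a pair (x, label)
        with the correct label c*(x); `eta` is the noise rate, 0 <= eta < 1/2.
        """
        x, label = ex()
        if random.random() < eta:
            return x, not label   # with probability eta, flip the label
        return x, label           # otherwise, return the correct label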
Note:
• Despite the classification noise in the examples received, the goal of the learner remains that
of finding a good approximation to the target concept with respect to the distribution D.
The error rate is measured with respect to the target concept and distribution.
    error(h) = Σ_{x : c∗(x) ≠ h(x)} Pr_D(x)
The typical noise-free PAC algorithm draws a large number of examples and outputs a consistent
hypothesis. With classification noise, however, there may not be a consistent hypothesis. Angluin
and Laird show that ERM (empirical risk minimization) can be used to learn in the random
classification noise model, though not necessarily efficiently. To remind you, the ERM algorithm is
specified as follows: draw a sample of m labeled examples from the example oracle and output a
hypothesis in C that minimizes the number of disagreements with the sample.
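As a sketch (the function and variable names here are mine, not from the notes), ERM over a finite
class can be written as follows in Python:

    def erm(concepts, sample):
        """Return a hypothesis in `concepts` minimizing disagreements with `sample`.

        `concepts` is a finite iterable of hypotheses h: x -> bool, and `sample`
        is a list of (x, label) pairs drawn from the (possibly noisy) oracle.
        """
        def disagreements(h):
            return sum(1 for x, label in sample if h(x) != label)
        return min(list(concepts), key=disagreements)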
To analyze this, suppose concept c_i has true error rate d_i. What is the probability p_i that c_i
disagrees with a labeled example drawn from EX^η_CN(c∗, D)? We have two cases.

1. EX^η_CN(c∗, D) reports correctly, but c_i is incorrect: d_i(1 − η)

2. EX^η_CN(c∗, D) reports incorrectly, but c_i is correct: (1 − d_i)η

    p_i = d_i(1 − η) + (1 − d_i)η
        = η + d_i(1 − 2η)
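A quick Monte-Carlo check of this identity (purely illustrative; the sampling procedure below is
my own sketch, not part of the notes):

    import random

    def simulated_disagreement_rate(d_i, eta, trials=100_000):
        """Estimate p_i for a hypothesis with true error d_i under noise rate eta.

        The hypothesis is wrong on the drawn point with probability d_i, and the
        oracle flips the label with probability eta; the hypothesis disagrees
        with the reported label exactly when one, but not both, of these occur.
        """
        disagreements = sum(
            (random.random() < d_i) != (random.random() < eta)
            for _ in range(trials)
        )
        return disagreements / trials

    # e.g. d_i = 0.1, eta = 0.2 gives roughly 0.2 + 0.1 * (1 - 2 * 0.2) = 0.26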
We need m large enough that an ε-bad hypothesis will not minimize disagreements with the sample.
Consider the threshold η + ε(1 − 2η)/2, which lies halfway between η, the true disagreement rate of
the target, and η + ε(1 − 2η), the smallest possible true disagreement rate of an ε-bad hypothesis.
If an ε-bad hypothesis minimizes disagreements, then the target function must have at least as large
an empirical disagreement rate, so at least one of the following events must hold:

1. Some ε-bad hypothesis c_i has empirical disagreement rate ≤ η + ε(1 − 2η)/2.

2. The target concept c∗ has empirical disagreement rate ≥ η + ε(1 − 2η)/2.
By a Hoeffding bound and a union bound over the hypotheses in C, if m is sufficiently large
(m = O((1/(ε²(1 − 2η)²)) ln(|C|/δ)) suffices), then the probability that either of the events occurs
is at most δ. That is, the probability that an ε-bad hypothesis minimizes disagreements is at most δ.
Thus, if we draw a sample of this size and then find a hypothesis which minimizes disagreements with
the sample, we have an algorithm which PAC learns in the presence of classification noise. This gives
the following theorem.
Theorem 1 For every finite concept class C, we can PAC learn C in the presence of random
classification noise.
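For completeness, here is a sketch of the standard Hoeffding-plus-union-bound calculation behind
the sample size; the exact constants are an illustration rather than a quote from the original
analysis.

    % For a fixed hypothesis c_i with true disagreement rate p_i, Hoeffding's
    % inequality bounds the deviation of the empirical rate \hat{p}_i over m draws:
    %   \Pr[\, |\hat{p}_i - p_i| \ge t \,] \le 2 e^{-2 m t^2}.
    % With t = \varepsilon(1-2\eta)/2 and a union bound over the |C| hypotheses
    % (the target among them), both bad events above have total probability at
    % most \delta whenever
    \[
      m \;\ge\; \frac{2}{\varepsilon^{2}(1-2\eta)^{2}}\,\ln\!\frac{2\,|C|}{\delta}.
    \]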
Minimizing disagreements (ERM) can be NP-hard
Theorem 2 Finding monotone conjunctions which minimize disagreements with a given sample is
NP-hard.
Proof: The proof is a polynomial-time reduction from the decision version of the vertex cover
problem to the decision version of our problem. The decision version of the vertex cover problem
is specified by an undirected graph G = (V, E) on n vertices and a positive integer c < n, and
the question is whether there exists a set C of at most c vertices of G such that every edge of G
is incident to at least one vertex in C. (Such a set C is called a vertex cover.) This problem is
NP-complete.
Our reduction is as follows. If there are n vertices in the graph, then our instance space is
{0, 1}^n. For each vertex v_i in the graph, we introduce a positive example (a_i, +), where a_i =
(11 · · · 101 · · · 11), with a 0 in the i-th position, and for each edge e = (v_i, v_j) in the graph we
introduce n + 1 copies of the negative example (b_e, −), where b_e = (11 · · · 101 · · · 101 · · · 11), with
0's in the i-th and j-th positions. Thus we get a set S of labeled examples of size n + |E|(n + 1).
The computation of S from (G, c) can clearly be carried out in polynomial time.
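As a concrete illustration, here is a short Python sketch of this construction (the function and
variable names are mine):

    def reduction_sample(n, edges):
        """Build the labeled sample S from a vertex-cover instance (sketch).

        Vertices are 0, ..., n-1 and `edges` is a list of pairs (i, j).
        a_i is all 1's with a 0 in position i (one positive example per vertex);
        b_e has 0's in positions i and j and is repeated n + 1 times (negatives).
        """
        S = []
        for i in range(n):
            a = [1] * n
            a[i] = 0
            S.append((tuple(a), True))                # (a_i, +)
        for (i, j) in edges:
            b = [1] * n
            b[i] = 0
            b[j] = 0
            S.extend([(tuple(b), False)] * (n + 1))   # n + 1 copies of (b_e, -)
        return S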
It is easy to show the following.
Claim 1 G has a vertex cover of size at most c if and only if there is a monotone conjunction with
at most c disagreements with S.
Suppose G has a vertex cover C of at most c vertices. Let f denote the conjunction of those x_i such
that v_i is in C. How many examples from S disagree with f? By construction, for each vertex v_i,
f(a_i) = − iff v_i ∈ C. Thus, f disagrees with at most c positive examples from S. For each edge
e = (v_i, v_j), the set C contains at least one of v_i or v_j, so f contains at least one of x_i or x_j. So
f(b_e) = −. Thus, f agrees with all the negative examples in S. Hence the number of disagreements
is at most c, as claimed.
Now suppose that there exists some monotone conjunction f such that f disagrees with at most c
examples in S. Since c < n, this means that f must agree with all the negative examples in S, since
each one is repeated n + 1 times. Hence f can only disagree with positive examples in S, and with at
most c of them. Since f disagrees with the positive example (a_i, +) exactly when x_i appears in f,
it follows that f contains at most c literals x_i. Define the set C to be all those vertices v_i such
that x_i appears in the conjunction f. Then C contains at most c vertices; it remains to see that it
is a vertex cover. If e = (v_i, v_j) is any edge in G, then f(b_e) = −, since f agrees with all the
negative examples. But f(b_e) = − if and only if f contains at least one of x_i or x_j. Thus C
contains at least one of v_i or v_j, so C is a vertex cover of G.
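To see the forward direction of Claim 1 in action, here is a small check using the reduction_sample
sketch above on a triangle graph (again, the helper names are mine):

    def cover_conjunction(cover):
        """The monotone conjunction of x_i over the vertices i in `cover`."""
        return lambda x: all(x[i] == 1 for i in cover)

    def num_disagreements(f, S):
        return sum(1 for x, label in S if f(x) != label)

    # A triangle has minimum vertex cover size 2, e.g. {0, 1}.
    S = reduction_sample(3, [(0, 1), (1, 2), (0, 2)])
    f = cover_conjunction({0, 1})
    assert num_disagreements(f, S) <= 2   # at most c = 2 disagreements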
In the following lecture, we will show how to learn conjunctions efficiently in the random
classification noise model by using statistical queries.