Solutions to Exercises
Chapter 2
2.1 Two-oracle variant of the PAC model
• Assume that C is efficiently PAC-learnable using H in the standard PAC model with an algorithm A. Consider the distribution D = (1/2)(D− + D+). Let h ∈ H be the hypothesis output by A. Choose δ such that
P[R_D(h) ≤ ε/2] ≥ 1 − δ.
From
R_D(h) = P_{x∼D}[h(x) ≠ c(x)] = (1/2)(P_{x∼D−}[h(x) ≠ c(x)] + P_{x∼D+}[h(x) ≠ c(x)]) = (1/2)(R_{D−}(h) + R_{D+}(h)),
it follows that
P[R_{D−}(h) ≤ ε] ≥ 1 − δ and P[R_{D+}(h) ≤ ε] ≥ 1 − δ.
This implies two-oracle PAC-learning with the same computational complexity.
• Assume now that C is efficiently PAC-learnable in the two-oracle PAC model. Thus, there exists a learning algorithm A such that for c ∈ C, ε > 0, and δ > 0, there exist m− and m+ polynomial in 1/ε, 1/δ, and size(c), such that if we draw m− negative examples or more and m+ positive examples or more, then with confidence 1 − δ the hypothesis h output by A verifies
R_{D−}(h) ≤ ε and R_{D+}(h) ≤ ε.
Now, let D be a probability distribution over negative and positive examples. If we could draw m examples according to D such that m ≥ max{m−, m+}, with m polynomial in 1/ε, 1/δ, and size(c), then two-oracle PAC-learning would imply standard PAC-learning, since
R_D(h) = R_{D−}(h) P[c(x) = 0] + R_{D+}(h) P[c(x) = 1] ≤ ε (P[c(x) = 0] + P[c(x) = 1]) = ε.
If D is not too biased, that is, if the probability of drawing a positive example and that of drawing a negative example are both more than ε, it is not hard to show, using Chernoff bounds or just Chebyshev's inequality, that drawing a number of examples polynomial in 1/ε and 1/δ suffices to guarantee that m ≥ max{m−, m+} with high confidence.
Otherwise, D is biased toward negative (or positive) examples, in which case returning h = h0 (respectively h = h1) guarantees that R_D(h) ≤ ε.
To show the claim about the not-too-biased case, let S_m denote the number of positive examples obtained when drawing m examples when the probability of a positive example is ε. By Chernoff bounds,
P[S_m ≤ (1 − α)εm] ≤ e^{−εmα²/2}.
We want to ensure that at least m+ positive examples are found. With α = 1/2 and m ≥ 2m+/ε, we have (1 − α)εm ≥ m+, and therefore
P[S_m ≤ m+] ≤ e^{−εm/8}.
Setting this bound to be less than or equal to δ/2 leads to the following condition on m:
m ≥ max{2m+/ε, (8/ε) log(2/δ)}.
A similar analysis can be done in the case of negative examples. Thus, when D is not too biased, with confidence 1 − δ, we will find at least m− negative and m+ positive examples if we draw m examples, with
m ≥ max{2m+/ε, 2m−/ε, (8/ε) log(2/δ)}.
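As an illustration of this sample-size condition, here is a minimal Python sketch (the function names and the particular values of m+, m−, ε, and δ are illustrative assumptions, not part of the original solution). It computes a sufficient number of draws m from D and empirically checks that, in the worst allowed case where positives appear with probability exactly ε, at least m+ positive and m− negative examples are found with frequency at least 1 − δ.

```python
import math
import random

def draws_needed(m_plus, m_minus, eps, delta):
    """Number of draws from D sufficient to see at least m_plus positive and
    m_minus negative examples with probability >= 1 - delta, when each class
    has probability at least eps (the 'not too biased' case)."""
    return math.ceil(max(2 * m_plus / eps,
                         2 * m_minus / eps,
                         (8 / eps) * math.log(2 / delta)))

def empirical_check(m_plus=50, m_minus=50, eps=0.1, delta=0.05,
                    trials=2000, seed=0):
    """Estimate how often m draws contain enough examples of each class when
    positives appear with probability exactly eps (the worst allowed case)."""
    rng = random.Random(seed)
    m = draws_needed(m_plus, m_minus, eps, delta)
    successes = 0
    for _ in range(trials):
        positives = sum(rng.random() < eps for _ in range(m))
        if positives >= m_plus and m - positives >= m_minus:
            successes += 1
    return m, successes / trials

if __name__ == "__main__":
    m, rate = empirical_check()
    print(f"m = {m}, empirical success rate = {rate:.3f} (target >= 0.95)")
```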
2.2 PAC learning of hyper-rectangles
The proof in the case of hyper-rectangles is similar to the one presented in the chapter. The algorithm selects the tightest axis-aligned hyper-rectangle R′ containing all the positively labeled sample points. For i ∈ [2n], select along each of the 2n faces of the target hyper-rectangle R a region r_i such that P_D[r_i] = ε/(2n). Assuming that P_D[R − R′] > ε, R′ cannot meet all the r_i's (otherwise R − R′ would be contained in the union of the r_i's and would have probability at most ε), so it must miss at least one. The probability that none of the m sample points falls into a given region r_i is (1 − ε/(2n))^m. By the union bound, this shows that
P[R_D(R′) > ε] ≤ 2n (1 − ε/(2n))^m ≤ 2n exp(−εm/(2n)). (E.35)
Setting δ to the right-hand side shows that for
m ≥ (2n/ε) log(2n/δ), (E.36)
with probability at least 1 − δ, R_D(R′) ≤ ε.
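The tightest-hyper-rectangle algorithm itself is simple to state in code. Below is a minimal sketch (the uniform distribution, the particular target rectangle, and all function names are illustrative assumptions made here): it learns the tightest axis-aligned hyper-rectangle containing the positive points and estimates its generalization error on fresh data.

```python
import numpy as np

def tightest_rectangle(X, y):
    """Tightest axis-aligned hyper-rectangle containing all positive points.

    X: (m, n) array of sample points; y: boolean labels (True = inside target).
    Returns (lo, hi) arrays of per-coordinate bounds, or None if no positives.
    """
    pos = X[y]
    if len(pos) == 0:
        return None
    return pos.min(axis=0), pos.max(axis=0)

def inside(rect, X):
    """Boolean membership of the points X in the rectangle rect = (lo, hi)."""
    if rect is None:
        return np.zeros(len(X), dtype=bool)
    lo, hi = rect
    return np.all((X >= lo) & (X <= hi), axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 3, 2000                                # dimension and sample size (arbitrary)
    target = (np.full(n, 0.2), np.full(n, 0.7))   # target hyper-rectangle R
    X = rng.uniform(0, 1, size=(m, n))            # D = uniform on [0, 1]^n
    y = inside(target, X)
    R_prime = tightest_rectangle(X, y)
    # Estimate the generalization error P[R - R'] on held-out data.
    X_test = rng.uniform(0, 1, size=(100000, n))
    err = np.mean(inside(target, X_test) != inside(R_prime, X_test))
    print(f"estimated generalization error: {err:.4f}")
```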
2.3 Concentric circles
In this solution and the next, our training data is the set T and our learned concept L(T) is the tightest circle (with minimal radius) which is consistent with the data.
Suppose our target concept c is the circle around the origin with radius r. We will choose a slightly smaller radius s by
s := inf{s′ : P(s′ ≤ ||x|| ≤ r) < ε}.
Let A denote the annulus between radii s and r; that is, A := {x : s ≤ ||x|| ≤ r}. By definition of s,
P(A) ≥ ε. (E.37)
In addition, our generalization error, P(c ∆ L(T)), must be small if T intersects A. We can state this in contrapositive form as
P(c ∆ L(T)) > ε =⇒ T ∩ A = ∅. (E.38)
Using (E.37), we know that any point in T chosen according to P will “miss” region A with probability at most 1 − ε. Defining error := P(c ∆ L(T)), we can combine this with (E.38) to see that
P(error > ε) ≤ P(T ∩ A = ∅) ≤ (1 − ε)^m ≤ e^{−εm}.
Setting δ to be greater than or equal to the right-hand side leads to m ≥ (1/ε) log(1/δ).
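A minimal sketch of the corresponding learner follows (assuming, for illustration only, that P is uniform on a square and that labels come from a target circle of radius r): L(T) returns the origin-centered circle whose radius is the largest norm among the positive points, and the error region is then the annulus between that radius and r.

```python
import numpy as np

def tightest_concentric_circle(X, y):
    """Radius of the tightest origin-centered circle containing all positive points."""
    pos_norms = np.linalg.norm(X[y], axis=1)
    return pos_norms.max() if len(pos_norms) else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r, m = 1.0, 500                       # target radius and sample size (arbitrary)
    X = rng.uniform(-2, 2, size=(m, 2))   # P = uniform on [-2, 2]^2
    y = np.linalg.norm(X, axis=1) <= r    # labels under the target concept c
    s_hat = tightest_concentric_circle(X, y)
    # The error region c ∆ L(T) is the annulus s_hat < ||x|| <= r.
    X_test = rng.uniform(-2, 2, size=(200000, 2))
    norms = np.linalg.norm(X_test, axis=1)
    err = np.mean((norms > s_hat) & (norms <= r))
    print(f"learned radius {s_hat:.4f}, estimated error {err:.5f}")
```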
2.4 Non-concentric circles
As in the previous example, it is natural to assume the learning algorithm operates by returning the smallest circle which is consistent with the data. Gertrude is relying on the logical implication
error > ε =⇒ T ∩ r_i = ∅ for some i, (E.39)
Figure E.5
Counterexample showing the error of the tightest circle in gray.
which is not necessarily true here. Figure E.5 illustrates a counterexample. In the figure, we have one training point in each region r_i. The points in r_1 and r_2 are very close together, and the point in r_3 is very close to region r_1. On this training data (some other points may be included outside the three regions r_i), our learned circle is the “tightest” circle including these points, and hence one diameter approximately traverses the corners of r_1. In the figure, the gray regions are the error of this learned hypothesis versus the target circle, which has a thick border. Clearly, the error may be greater than ε even though T ∩ r_i ≠ ∅ for every i; this contradicts (E.39) and invalidates poor Gertrude’s proof.
2.5 Triangles
As in the case of axis-aligned rectangles, consider three regions r_1, r_2, r_3, each of probability ε/3, along the sides of the target concept as indicated in figure E.6. Note that the triangle formed by the points A″, B″, C″ is similar to ABC (same angles) since A″B″ must be parallel to AB, and similarly for the other sides.
Assume that P[ABC] > ε, otherwise the statement would be trivial. Consider a triangle A′B′C′ similar to ABC, consistent with the training sample, and such that it meets all three regions r_1, r_2, r_3.
Since it meets r_1, the line A′B′ must be below A″B″. Since it meets r_2 and r_3, A′ must be in r_2 and B′ in r_3 (see figure E.6). Now, since the angle A′B′C′ is equal to the angle A″B″C″, C′ must necessarily be above C″. This implies that triangle A′B′C′ contains A″B″C″, and thus error(A′B′C′) ≤ ε. By contraposition,
error(A′B′C′) > ε =⇒ ∃i ∈ {1, 2, 3} : A′B′C′ ∩ r_i = ∅.
Thus, by the union bound,
P[error(A′B′C′) > ε] ≤ P[A′B′C′ ∩ r_1 = ∅] + P[A′B′C′ ∩ r_2 = ∅] + P[A′B′C′ ∩ r_3 = ∅] ≤ 3(1 − ε/3)^m ≤ 3e^{−εm/3}.
Setting δ to match the right-hand side gives the sample complexity m ≥ (3/ε) log(3/δ).
Figure E.6
Right triangles ABC, A″B″C″, and A′B′C′.

2.8 Learning intervals
Given a sample S, one algorithm consists of returning the tightest closed interval I_S containing the positive points. Let I = [a, b] be the target concept. If P[I] < ε, then clearly R(I_S) < ε. Assume that P[I] ≥ ε. Consider two intervals I_L and I_R defined as follows:
I_L = [a, x] with x = inf{x′ : P([a, x′]) ≥ ε/2}
I_R = [x″, b] with x″ = sup{x′ : P([x′, b]) ≥ ε/2}.
By the definition of x, the probability of [a, x) is less than or equal to ε/2; similarly, the probability of (x″, b] is less than or equal to ε/2. Thus, if I_S overlaps both with I_L and I_R, then its error region has probability at most ε. Thus, R(I_S) > ε implies that I_S does not overlap with I_L or does not overlap with I_R, that is, either none of the training points falls in I_L or none falls in I_R. Thus, by the union bound,
P[R(I_S) > ε] ≤ P[S ∩ I_L = ∅] + P[S ∩ I_R = ∅] ≤ 2(1 − ε/2)^m ≤ 2e^{−εm/2}.
Setting δ to match the right-hand side gives the sample complexity m ≥ (2/ε) log(2/δ) and proves the PAC-learning of closed intervals.
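As a small illustration of this solution (the helper names, the target interval [0.3, 0.8], and the choice of ε and δ are assumptions made here for the example), the following sketch returns the tightest closed interval containing the positive points and computes a sample size from the bound 2e^{−εm/2} ≤ δ.

```python
import math
import random

def tightest_interval(sample):
    """Tightest closed interval [lo, hi] containing the positively labeled
    points of `sample` (a list of (x, label) pairs); None if no positives."""
    pos = [x for x, label in sample if label]
    return (min(pos), max(pos)) if pos else None

def sample_size(eps, delta):
    """Smallest m with 2 * exp(-eps * m / 2) <= delta."""
    return math.ceil((2 / eps) * math.log(2 / delta))

if __name__ == "__main__":
    rng = random.Random(0)
    a, b = 0.3, 0.8                     # target concept I = [a, b]
    m = sample_size(eps=0.05, delta=0.01)
    S = [(x, a <= x <= b) for x in (rng.random() for _ in range(m))]
    print(f"m = {m}, I_S = {tightest_interval(S)}")
```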
2.9 Learning union of intervals
Given a sample S, our algorithm consists of the following steps:
(a) Sort S in ascending order.
(b) Loop through sorted S, marking where intervals of consecutive positively labeled points
begin and end.
(c) Return the union of the intervals found in the previous step. This union is represented by a list of tuples that indicate the start and end points of the intervals (a short illustrative sketch of these steps is given below).
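The following minimal Python sketch implements steps (a)–(c); the function name and the example target concept [0.1, 0.3] ∪ [0.6, 0.9] are illustrative assumptions, not part of the exercise.

```python
def union_of_intervals(sample):
    """Steps (a)-(c): sort the labeled sample and return the list of
    (start, end) pairs delimiting maximal runs of positively labeled points.

    sample: iterable of (x, label) pairs with boolean labels.
    """
    sorted_sample = sorted(sample)                 # step (a): O(m log m)
    intervals, start = [], None
    for x, label in sorted_sample:                 # step (b): single pass
        if label and start is None:
            start = x                              # a run of positives begins
        elif not label and start is not None:
            intervals.append((start, prev_pos))    # the run ended at prev_pos
            start = None
        if label:
            prev_pos = x
    if start is not None:
        intervals.append((start, prev_pos))
    return intervals                               # step (c)

# Example: points labeled by the target concept [0.1, 0.3] ∪ [0.6, 0.9].
points = [0.05, 0.12, 0.2, 0.29, 0.4, 0.55, 0.61, 0.75, 0.88, 0.95]
sample = [(x, 0.1 <= x <= 0.3 or 0.6 <= x <= 0.9) for x in points]
print(union_of_intervals(sample))   # [(0.12, 0.29), (0.61, 0.88)]
```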
This algorithm works both for p = 2 and for general p. We will now consider the problem for C_2. To show that this is a PAC-learning algorithm, we need to distinguish between two cases.
The first case is when our target concept is a disjoint union of two closed intervals: I = [a, b] ∪ [c, d]. Note that there are two sources of error: false negatives in [a, b] and [c, d], and false positives in (b, c). False positives can occur only if no sample point is drawn from (b, c). By linearity of expectation and since these error regions are disjoint, we have that R(h_S) = R_{FP}(h_S) + R_{FN,1}(h_S) + R_{FN,2}(h_S), where
R_{FP}(h_S) = P_{x∼D}[x ∈ h_S, x ∉ I],
R_{FN,1}(h_S) = P_{x∼D}[x ∉ h_S, x ∈ [a, b]],
R_{FN,2}(h_S) = P_{x∼D}[x ∉ h_S, x ∈ [c, d]].
Since at least one of R_{FP}(h_S), R_{FN,1}(h_S), R_{FN,2}(h_S) must exceed ε/3 in order for R(h_S) > ε, by the union bound,
P(R(h_S) > ε) ≤ P(R_{FP}(h_S) > ε/3 or R_{FN,1}(h_S) > ε/3 or R_{FN,2}(h_S) > ε/3)
≤ P(R_{FP}(h_S) > ε/3) + P(R_{FN,1}(h_S) > ε/3) + P(R_{FN,2}(h_S) > ε/3). (E.40)
We first bound P(R_{FP}(h_S) > ε/3). Note that if R_{FP}(h_S) > ε/3, then P((b, c)) > ε/3, and false positives occur only when no sample point falls in (b, c); hence
P(R_{FP}(h_S) > ε/3) ≤ (1 − ε/3)^m ≤ e^{−εm/3}.
Now we can bound P(R_{FN,i}(h_S) > ε/3) by 2e^{−εm/6} using the same argument as in the previous question. Therefore,
P(R(h_S) > ε) ≤ e^{−εm/3} + 4e^{−εm/6} ≤ 5e^{−εm/6}.
Setting the right-hand side to δ and solving for m yields m ≥ (6/ε) log(5/δ).
The second case that we need to consider is when I = [a, d], that is, [a, b] ∩ [c, d] ≠ ∅. In that case, our algorithm reduces to the one from exercise 2.8, and it was already shown that m ≥ (2/ε) log(2/δ) samples suffice to learn this concept. Therefore, we conclude that our algorithm is indeed a PAC-learning algorithm.
The extension of this result to the case of C_p is straightforward. The only difference is that in (E.40), one has two summations, for the p − 1 regions of false positives and the 2p regions of false negatives. In that case the sample complexity is
m ≥ (2(2p − 1)/ε) log((3p − 1)/δ).
The sorting step of our algorithm takes O(m log m) time, and steps (b) and (c) are linear in m, which leads to an overall time complexity of O(m log m).
2.10 Consistent hypotheses
Since PAC-learning with L is possible for any distribution, let D be the uniform distribution over Z. Note that, in that case, the cost of an error of a hypothesis h on any point z ∈ Z is P_D[z] = 1/m. Thus, if R_D(h) < 1/m, we must have R_D(h) = 0, and h is consistent. Thus, choose ε = 1/(m + 1). Then, for any δ > 0, with probability at least 1 − δ over samples S with |S| ≥ P(m + 1, 1/δ) points (where P is some fixed polynomial), the hypothesis h_S returned by L is consistent with Z, since R_D(h_S) ≤ 1/(m + 1) < 1/m.
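To make the reduction concrete, here is a purely illustrative sketch (the threshold concept class, the stand-in learner, and the sample-size polynomial are all assumptions made for this example, not part of the exercise): it draws i.i.d. points from the uniform distribution over Z with ε = 1/(m + 1) and feeds them to a simple PAC learner for thresholds, whose output is then consistent with Z with probability at least 1 − δ.

```python
import math
import random

def consistent_via_pac_learner(Z, pac_learner, sample_size, delta=0.05, seed=0):
    """Exercise 2.10 reduction: run an (assumed) PAC learner on i.i.d. draws
    from the uniform distribution over the labeled set Z, with eps = 1/(m+1),
    so that any hypothesis with error below eps is consistent with Z."""
    rng = random.Random(seed)
    m = len(Z)
    eps = 1.0 / (m + 1)
    n = sample_size(eps, delta)                  # polynomial in 1/eps and 1/delta
    sample = [rng.choice(Z) for _ in range(n)]   # i.i.d. from uniform over Z
    return pac_learner(sample)

# --- Toy instance: threshold concepts x >= t on [0, 1] ----------------------
def threshold_learner(sample):
    """Stand-in PAC learner: place the threshold at the smallest positive point."""
    pos = [x for x, label in sample if label]
    t = min(pos) if pos else float("inf")
    return lambda x: x >= t

def threshold_sample_size(eps, delta):
    return math.ceil((1 / eps) * math.log(1 / delta))

if __name__ == "__main__":
    Z = [(i / 19, i / 19 >= 0.37) for i in range(20)]    # m = 20 labeled points
    h = consistent_via_pac_learner(Z, threshold_learner, threshold_sample_size)
    # Consistent with Z with probability at least 1 - delta.
    print(all(h(x) == label for x, label in Z))
```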
2.11 Senate laws
(a) The true error in the consistent case is bounded as follows:
R_D(h) ≤ (1/m)(log |H| + log(1/δ)). (E.41)
For δ = 0.05, m = 200, and |H| = 2800, this gives R_D(h) ≤ 5.5%.
(b) The true error in the inconsistent case is bounded as:
R_D(h) ≤ R̂_S(h) + √((1/(2m))(log 2|H| + log(1/δ))). (E.42)
For δ = 0.05, R̂_S(h) = m′/m = 0.1, m = 200, and |H| = 2800, this gives R_D(h) ≤ 27.05%.
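As a quick numerical check of both bounds (a minimal sketch; natural logarithms are used, which matches the figures quoted above):

```python
import math

def consistent_bound(m, h_size, delta):
    """Bound (E.41): R_D(h) <= (1/m) * (log|H| + log(1/delta))."""
    return (math.log(h_size) + math.log(1 / delta)) / m

def inconsistent_bound(emp_err, m, h_size, delta):
    """Bound (E.42): R_D(h) <= emp_err + sqrt((log(2|H|) + log(1/delta)) / (2m))."""
    return emp_err + math.sqrt((math.log(2 * h_size) + math.log(1 / delta)) / (2 * m))

print(f"(a) {consistent_bound(200, 2800, 0.05):.4f}")         # ~0.0547, about 5.5%
print(f"(b) {inconsistent_bound(0.1, 200, 2800, 0.05):.4f}")  # ~0.2705, about 27.05%
```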
2.12 Bayesian bound
For any fixed h ∈ H, by Hoeffding’s inequality, for any δ > 0,
P[ R(h) − R̂_S(h) ≥ √( log(1/(p(h)δ)) / (2m) ) ] ≤ p(h)δ. (E.43)