CS 229 Autumn 2016 Problem Set #3 Solutions: Theory & Unsupervised Learning
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at https://piazza.com/stanford/autumn2016/cs229. (3)
If you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4) For
problems that require programming, please include in your submission a printout of your code
(with comments) and any figures that you are asked to plot.
If you are skipping a question, please include it on your PDF/photo, but leave the question
blank and tag it appropriately on Gradescope. This includes extra credit problems. If you are
scanning your document by cellphone, please check the Piazza forum for recommended cellphone
scanning apps and best practices.
1. Uniform convergence. (Setup, reconstructed from context: a poll surveys m voters in each of n states; X_{ij} ∈ {0, 1} indicates whether voter j in state i votes democrat, and φ_i is the true fraction of democrat votes in state i.) We assume that the voters correctly disclose their vote during the survey. Thus, for each value of i, we have that the X_{ij} are drawn IID from a Bernoulli(φ_i) distribution. Moreover, the X_{ij}'s (for all i, j) are all mutually independent.
After the survey, the fraction of democrat votes in state i is estimated as:
\hat{\phi}_i = \frac{1}{m}\sum_{j=1}^{m} X_{ij}
Also, let Zi = 1{|φ̂i − φi | > γ} be a binary random variable that indicates whether the
prediction in state i was highly inaccurate.
(a) Let ψi be the probability that Zi = 1. Using the Hoeffding inequality, find an upper
bound on ψi .
Answer: A direct application of the Hoeffding inequality yields
\psi_i \le 2e^{-2\gamma^2 m}
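As a sanity check, the bound from part (a) can be compared against a quick Monte Carlo estimate. The sketch below is illustrative (the parameter values φ, m, γ are made up, not taken from the problem): it draws m Bernoulli(φ) votes per trial and counts how often |φ̂ − φ| > γ.

```python
import math
import random

def hoeffding_check(phi=0.3, m=500, gamma=0.05, trials=2000, seed=0):
    """Estimate psi = P(|phi_hat - phi| > gamma) by simulation and
    return it alongside the Hoeffding bound 2*exp(-2*gamma^2*m)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # phi_hat is the fraction of simulated democrat votes
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        if abs(phi_hat - phi) > gamma:
            bad += 1
    return bad / trials, 2 * math.exp(-2 * gamma ** 2 * m)

emp, bound = hoeffding_check()
```

For these values the bound is about 0.16 while the empirical frequency is far smaller; Hoeffding is loose here, but valid.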
(b) In this part, we prove a general result which will be useful for this problem. Let V_i and W_i (1 ≤ i ≤ k) be Bernoulli random variables, and suppose P(V_i = 1) ≤ P(W_i = 1) for all i. Let the V_i's be mutually independent, and similarly let the W_i's also be mutually independent. Prove that, for any value of t, the following holds:
P\Big(\sum_{i=1}^{k} V_i > t\Big) \le P\Big(\sum_{i=1}^{k} W_i > t\Big)
[Hint: One way to do this is via induction on k. If you use a proof by induction, for
the base case (k = 1), you must show that the inequality holds for t < 0, 0 ≤ t < 1,
and t ≥ 1.]
Answer: We prove the claim by induction on k.
Base case (k = 1): we must show P(V_1 > t) ≤ P(W_1 > t) for all t. For t < 0 both sides equal 1, and for t ≥ 1 both sides equal 0, so the inequality is trivial there. For 0 ≤ t < 1, the claim reduces to

P(V_1 = 1) \le P(W_1 = 1),

which holds by our original assumptions.
Inductive step: Assume
P\Big(\sum_{i=1}^{l} V_i > t\Big) \le P\Big(\sum_{i=1}^{l} W_i > t\Big), \quad \forall t
Then,

\begin{align*}
P\Big(\sum_{i=1}^{l+1} V_i > t\Big)
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l+1} V_i > t \,\Big|\, V_{l+1}=1\Big) + P(V_{l+1}=0)\,P\Big(\sum_{i=1}^{l+1} V_i > t \,\Big|\, V_{l+1}=0\Big) \\
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1 \,\Big|\, V_{l+1}=1\Big) + P(V_{l+1}=0)\,P\Big(\sum_{i=1}^{l} V_i > t \,\Big|\, V_{l+1}=0\Big) \\
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1\Big) + P(V_{l+1}=0)\,P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1\Big) + \big(1 - P(V_{l+1}=1)\big)\,P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&= P(V_{l+1}=1)\Big(P\Big(\sum_{i=1}^{l} V_i > t-1\Big) - P\Big(\sum_{i=1}^{l} V_i > t\Big)\Big) + P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&\le P(W_{l+1}=1)\Big(P\Big(\sum_{i=1}^{l} V_i > t-1\Big) - P\Big(\sum_{i=1}^{l} V_i > t\Big)\Big) + P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&= P(W_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1\Big) + P(W_{l+1}=0)\,P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&\le P(W_{l+1}=1)\,P\Big(\sum_{i=1}^{l} W_i > t-1\Big) + P(W_{l+1}=0)\,P\Big(\sum_{i=1}^{l} W_i > t\Big) \\
&= P\Big(\sum_{i=1}^{l+1} W_i > t\Big).
\end{align*}

Here the third equality uses the mutual independence of the V_i's; the first inequality holds because P(\sum_{i=1}^{l} V_i > t-1) - P(\sum_{i=1}^{l} V_i > t) \ge 0 and P(V_{l+1}=1) \le P(W_{l+1}=1); and the second inequality applies the inductive hypothesis at both t-1 and t. This completes the induction.
Alternative answer: The result can also be proved with a coupling argument, without induction. Let ξ_1, …, ξ_k be i.i.d. Uniform[0, 1] random variables, and define V_i^u = 1{ξ_i ≤ P(V_i = 1)} and W_i^u = 1{ξ_i ≤ P(W_i = 1)}. Then E[V_i^u] = P(ξ_i ≤ P(V_i = 1)) = P(V_i = 1) = E[V_i], and similarly E[W_i^u] = E[W_i]. But W_i^u ≥ V_i^u always (since P(V_i = 1) ≤ P(W_i = 1)), and the V_i^u are mutually independent with V_i^u distributed as V_i (and similarly for the W_i^u). Since W_i^u ≥ V_i^u, the event \sum_{i=1}^{k} V_i^u > t implies \sum_{i=1}^{k} W_i^u > t. Consequently, we have

P\Big(\sum_{i=1}^{k} V_i > t\Big) = P\Big(\sum_{i=1}^{k} V_i^u > t\Big) \le P\Big(\sum_{i=1}^{k} W_i^u > t\Big) = P\Big(\sum_{i=1}^{k} W_i > t\Big).
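The coupling construction is easy to check by simulation. The sketch below uses arbitrary made-up probabilities (the function name is ours): with a shared ξ_i per coordinate, the realized sums satisfy ΣV_i^u ≤ ΣW_i^u on every single draw, not just in distribution.

```python
import random

def coupled_sums_dominate(p_v, p_w, trials=5000, seed=1):
    """Draw xi_i ~ Uniform[0,1] once per coordinate and set
    V_i = 1{xi_i <= p_v[i]}, W_i = 1{xi_i <= p_w[i]}.  When
    p_v[i] <= p_w[i] for all i, V_i <= W_i pointwise, so the
    sum of the V's never exceeds the sum of the W's."""
    rng = random.Random(seed)
    for _ in range(trials):
        xi = [rng.random() for _ in range(len(p_v))]
        sum_v = sum(x <= p for x, p in zip(xi, p_v))
        sum_w = sum(x <= p for x, p in zip(xi, p_w))
        if sum_v > sum_w:
            return False
    return True

# example probabilities with p_v[i] <= p_w[i] in each coordinate
print(coupled_sums_dominate([0.1, 0.3, 0.5], [0.2, 0.4, 0.9]))  # prints True
```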
(c) The fraction of states on which our predictions are highly inaccurate is given by Z = \frac{1}{n}\sum_{i=1}^{n} Z_i. Prove a reasonable closed form upper bound on the probability P(Z > τ) of being highly inaccurate on more than a fraction τ of the states.
[Note: There are many possible answers, but to be considered reasonable, your bound
must decrease to zero as m → ∞ (for fixed n and τ > 0). Also, your bound should
either remain constant or decrease as n → ∞ (for fixed m and τ > 0). It is also fine
if, for some values of τ , m and n, your bound just tells us that P (Z > τ ) ≤ 1 (the
trivial bound).]
Answer: There are multiple ways to do this problem. We list a couple of them below. In what follows, let µ = 2e^{−2γ²m}, and let Y_1, …, Y_n be i.i.d. Bernoulli(µ) random variables, so that by part (a), P(Z_i = 1) = ψ_i ≤ µ = P(Y_i = 1).

First, applying the result of part (b) and then the Hoeffding inequality to the Y_i's,

P(Z > \tau) \le P\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i > \tau\Big) \le 2e^{-2(\tau-\mu)^2 n},

where the last step follows provided that 0 < τ − µ = τ − 2e^{−2γ²m}, or equivalently, m > \frac{1}{2\gamma^2}\log\frac{2}{\tau}. For fixed τ and m, this bound goes to zero as n → ∞.

Alternatively, we can bound the binomial tail directly:

P(Z > \tau) \le P\Big(\sum_{i=1}^{n} Y_i \ge k\Big) \le \sum_{j=k}^{n} \binom{n}{j}\mu^j,

where k is the smallest integer such that k > nτ. For fixed τ and n, observe that as m → ∞, µ → 0, so this bound goes to zero. Therefore,

P(Z > \tau) \le \min\Big\{1,\; 2e^{-2(\tau-\mu)^2 n},\; \sum_{j=k}^{n}\binom{n}{j}\mu^j\Big\}.
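To get a feel for the combined bound, it can be evaluated numerically. The sketch below (function name and parameter values are ours, chosen only for illustration) computes min{1, Hoeffding term, binomial-tail term}:

```python
import math

def part_c_bound(m, n, tau, gamma):
    """Evaluate min(1, 2*exp(-2*(tau-mu)^2*n), sum_{j=k}^n C(n,j)*mu^j)
    with mu = 2*exp(-2*gamma^2*m) and k the smallest integer with
    k > n*tau.  The Hoeffding term is only valid when tau > mu, so it
    is replaced by the trivial bound 1 otherwise."""
    mu = 2 * math.exp(-2 * gamma ** 2 * m)
    hoeffding = 2 * math.exp(-2 * (tau - mu) ** 2 * n) if tau > mu else 1.0
    k = math.floor(n * tau) + 1
    binom_tail = sum(math.comb(n, j) * mu ** j for j in range(k, n + 1))
    return min(1.0, hoeffding, binom_tail)

b_small_m = part_c_bound(500, 50, 0.1, 0.05)   # mu > tau: only the trivial bound
b_large_m = part_c_bound(2000, 50, 0.1, 0.05)  # mu tiny: binomial tail is minuscule
```

Increasing m drives µ, and with it the bound, toward zero, exactly as the note in the problem requires.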
Yet another approach: since P(Z_i = 1) ≤ µ = P(Y_i = 1), using the result from the previous part followed by Chebyshev's inequality:

\begin{align*}
P(Z > \tau) &\le P\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i > \tau\Big) \\
&= P\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i - \mu > \tau - \mu\Big) \\
&\le P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} Y_i - \mu\Big| > \tau - \mu\Big) \\
&\le \frac{\operatorname{Var}\big[\frac{1}{n}\sum_{i=1}^{n} Y_i\big]}{(\tau - \mu)^2} \\
&= \frac{\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}[Y_i]}{(\tau - \mu)^2} \\
&= \frac{2e^{-2\gamma^2 m}\big(1 - 2e^{-2\gamma^2 m}\big)}{n(\tau - \mu)^2} \\
&\le \frac{2e^{-2\gamma^2 m}}{n(\tau - \mu)^2},
\end{align*}

where we again require that m > \frac{1}{2\gamma^2}\log\frac{2}{\tau} so that τ − µ > 0. This version of the bound goes to zero as n → ∞ (for fixed m and τ), and also as m → ∞, since the numerator 2e^{−2γ²m} vanishes.
2. VC dimension. Let h_θ(x) = 1{θ_0 + θ_1 x + · · · + θ_d x^d ≥ 0} (with x ∈ R and θ ∈ R^{d+1}), and let H = {h_θ : θ ∈ R^{d+1}} be the corresponding hypothesis class. What is the VC dimension of H? Justify your answer.
[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When
doing this problem, you should not assume any other non-trivial result (such as that the
VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved
in class.]
Answer: The key insight is that if the polynomial does not cross the x-axis (i.e. have a
root) between two points, then it must give the two points the same label.
First, we need to show that there is a set of size d + 1 which H can shatter. We consider
polynomials with d real roots. A subset of the polynomials in H can be written as
\pm \prod_{i=1}^{d} (x - r_i)
where ri is the ith real root. Consider any set of size d+1 which does not contain any duplicate
points. For any labelling of these points, construct a function as follows: If two consecutive
points are labelled differently, set one of the ri to the average of those points. If two consecutive
points are labelled the same, don’t put a root between them. If we haven’t used up all of our
d roots, place them beyond the last point. Finally, choose ± to get the desired labelling.
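The root-placement argument above translates directly into code. The sketch below (helper name and test points are ours, not from the solution) builds such a polynomial for d + 1 points and verifies that every labelling is realized:

```python
from itertools import product

def shatter_poly(xs, ys):
    """Given sorted distinct points xs and labels ys in {-1, +1},
    return a polynomial of degree <= len(xs) - 1 whose sign matches
    ys on xs: place a root at the midpoint of each consecutive pair
    with different labels, then pick the overall sign."""
    roots = [(a + b) / 2
             for a, b, la, lb in zip(xs, xs[1:], ys, ys[1:]) if la != lb]
    def p(x):
        val = 1.0
        for r in roots:
            val *= (x - r)
        return val
    sign = 1.0 if (p(xs[0]) > 0) == (ys[0] > 0) else -1.0
    return lambda x: sign * p(x)

# check that all 2^(d+1) labellings of d + 1 = 4 points are realized
xs = [0.0, 1.0, 2.0, 3.0]
ok = all(
    all((shatter_poly(xs, list(ys))(x) > 0) == (y > 0)
        for x, y in zip(xs, ys))
    for ys in product([-1, 1], repeat=len(xs))
)
print(ok)  # prints True
```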
A more constructive proof of the above is the following: consider any set of distinct points
x(1) , . . . , x(d+1) , and let y (1) , . . . , y (d+1) ∈ {−1, 1} be any labeling of these points (where we
have used −1 for points which would normally be labeled zero). Then, consider the following
polynomial:
p(x) = \sum_{k=1}^{d+1} y^{(k)} \prod_{j \ne k} \frac{x^{(j)} - x}{x^{(j)} - x^{(k)}}.
Here, observe that in the above expression, each term of the summation is a polynomial (in
x) of degree d, and hence the overall expression is a polynomial of degree d. Furthermore,
observe that when x = x(i) , then the ith term of the summation evaluates to y (i) , and all
other terms of the summation evaluate to 0 (since all other terms have a factor (x(i) − x)).
Therefore, p(x(i) ) = y (i) for i = 1, . . . , d + 1. This construction is known as a “Lagrange
interpolating polynomial.” Therefore, any labeling of d + 1 points can be realized using a
degree d polynomial.
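The Lagrange construction can likewise be checked directly. The sketch below (function name and points are ours) evaluates p(x) = Σ_k y^{(k)} Π_{j≠k} (x^{(j)} − x)/(x^{(j)} − x^{(k)}) and verifies p(x^{(i)}) = y^{(i)} for every labelling of d + 1 points:

```python
from itertools import product

def lagrange_poly(xs, ys):
    """Lagrange interpolating polynomial from the solution:
    p(x) = sum_k y_k * prod_{j != k} (x_j - x) / (x_j - x_k)."""
    def p(x):
        total = 0.0
        for k in range(len(xs)):
            term = float(ys[k])
            for j in range(len(xs)):
                if j != k:
                    term *= (xs[j] - x) / (xs[j] - xs[k])
            total += term
        return total
    return p

xs = [0.0, 1.0, 2.0, 3.0]
ok = all(
    all(abs(lagrange_poly(xs, ys)(xi) - yi) < 1e-9 for xi, yi in zip(xs, ys))
    for ys in product([-1, 1], repeat=len(xs))
)
print(ok)  # prints True
```

At x = x^{(k)} the kth term's factors all cancel to 1, so the interpolation property holds exactly even in floating point.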
Second, we need to prove that H can't shatter any set of size d + 2. If two points are identical, we can't realize any labelling that labels them differently. If all the points are distinct, we can't achieve an alternating labelling (sorted by x, the labels +, −, +, −, …), because realizing it would require the polynomial to cross the x-axis at least d + 1 times, contradicting the fact that a polynomial of degree d has at most d real roots. Therefore, the VC dimension of H is d + 1.
3. [12 points] MAP estimates and weight decay
Consider using a logistic regression model hθ (x) = g(θT x) where g is the sigmoid function,
and let a training set {(x(i) , y (i) ); i = 1, . . . , m} be given as usual. The maximum likelihood
estimate of the parameters θ is given by
\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta).
If we wanted to regularize logistic regression, then we might put a Bayesian prior on the
parameters. Suppose we chose the prior θ ∼ N (0, τ 2 I) (here, τ > 0, and I is the n + 1-by-
n + 1 identity matrix), and then found the MAP estimate of θ as:
\theta_{MAP} = \arg\max_{\theta}\; p(\theta) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)
Prove that
||θMAP ||2 ≤ ||θML ||2
[Hint: Consider using a proof by contradiction.]
Remark. For this reason, this form of regularization is sometimes also called weight
decay, since it encourages the weights (meaning parameters) to take on generally smaller
values.
Answer: We use a proof by contradiction. Assume that \|\theta_{MAP}\|_2 > \|\theta_{ML}\|_2. Then

\begin{align*}
p(\theta_{MAP}) &= \frac{1}{(2\pi)^{\frac{n+1}{2}}|\tau^2 I|^{\frac{1}{2}}}\, e^{-\frac{1}{2\tau^2}\|\theta_{MAP}\|_2^2} \\
&< \frac{1}{(2\pi)^{\frac{n+1}{2}}|\tau^2 I|^{\frac{1}{2}}}\, e^{-\frac{1}{2\tau^2}\|\theta_{ML}\|_2^2} \\
&= p(\theta_{ML}).
\end{align*}
This yields

\begin{align*}
p(\theta_{MAP}) \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta_{MAP}) &< p(\theta_{ML}) \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta_{MAP}) \\
&\le p(\theta_{ML}) \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta_{ML}),
\end{align*}

where the last inequality holds since \theta_{ML} was chosen to maximize \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}; \theta). However, this result gives us a contradiction, since \theta_{MAP} was chosen to maximize \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta)\, p(\theta). Hence \|\theta_{MAP}\|_2 \le \|\theta_{ML}\|_2.
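The norm-shrinking effect is easy to observe numerically. The sketch below is illustrative (the data, step size, and prior strength are all made up): it fits logistic regression by gradient ascent with and without the Gaussian-prior term, whose contribution to the gradient of the log-posterior is −θ/τ².

```python
import math

def fit_logistic(X, y, inv_tau2=0.0, lr=0.1, iters=3000):
    """Gradient ascent on the log-likelihood, plus the log-prior
    gradient -inv_tau2 * theta when inv_tau2 > 0 (i.e. the
    N(0, tau^2 I) prior with inv_tau2 = 1/tau^2).  Minimal sketch."""
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        grad = [-inv_tau2 * t for t in theta]
        for xi, yi in zip(X, y):
            z = sum(t * xj for t, xj in zip(theta, xi))
            h = 1.0 / (1.0 + math.exp(-z))
            for j in range(n):
                grad[j] += (yi - h) * xi[j]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def l2(v):
    return math.sqrt(sum(t * t for t in v))

# toy non-separable data; first feature is the intercept
X = [[1.0, 0.5], [1.0, 1.0], [1.0, -0.5], [1.0, -1.0], [1.0, 0.2], [1.0, -0.3]]
y = [1, 1, 0, 0, 0, 1]
theta_ml = fit_logistic(X, y)
theta_map = fit_logistic(X, y, inv_tau2=1.0)
print(l2(theta_map) <= l2(theta_ml))  # prints True
```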
4. KL divergence. The Kullback–Leibler (KL) divergence between two discrete-valued distributions P(X) and Q(X) is defined as follows:

KL(P\|Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}
For notational convenience, we assume P (x) > 0, ∀x. (Otherwise, one standard thing to do
is to adopt the convention that “0 log 0 = 0.”) Sometimes, we also write the KL divergence
as K L(P ||Q) = K L(P (X)||Q(X)).
The KL divergence is an asymmetric measure of the distance between two probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that we're familiar with.
(a) Nonnegativity. Prove the following: ∀P, Q, KL(P‖Q) ≥ 0, with KL(P‖Q) = 0 if and only if P = Q.
(If P and Q are densities for continuous-valued random variables, the sum is replaced by an integral, and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we'll just work with this form of KL divergence for probability mass functions/discrete-valued distributions.)
Answer: We have

\begin{align*}
-KL(P\|Q) &= -\sum_x P(x)\log\frac{P(x)}{Q(x)} \\
&= \sum_x P(x)\log\frac{Q(x)}{P(x)} \\
&\le \log\sum_x P(x)\,\frac{Q(x)}{P(x)} \\
&= \log\sum_x Q(x) = \log 1 = 0,
\end{align*}

so KL(P‖Q) ≥ 0. All equalities follow from straightforward algebraic manipulation; the inequality follows from Jensen's inequality applied to the concave function log.
To show the second part of the claim, note that log t is a strictly concave function of t. Using the form of Jensen's inequality given in the lecture notes, we have equality if and only if \frac{Q(x)}{P(x)} = E\big[\frac{Q(X)}{P(X)}\big] for all x. But since E\big[\frac{Q(X)}{P(X)}\big] = \sum_x P(x)\frac{Q(x)}{P(x)} = \sum_x Q(x) = 1, it follows that P(x) = Q(x) for all x. Hence we have KL(P‖Q) = 0 if and only if P(x) = Q(x) for all x.
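Both properties can be checked numerically for small discrete distributions. The sketch below (function name and distributions are ours) assumes both distributions put positive mass on every outcome, as in the problem:

```python
import math

def kl(p, q):
    """KL(P || Q) for two distributions given as probability lists
    over the same outcomes (all entries assumed positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(kl([0.5, 0.5], [0.9, 0.1]) > 0)            # prints True: P != Q
print(abs(kl([0.3, 0.7], [0.3, 0.7])) < 1e-12)   # prints True: P == Q
```

Note the asymmetry as well: in general kl(p, q) and kl(q, p) differ.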
(b) Chain rule for KL divergence. The KL divergence between two conditional distributions P(X|Y), Q(X|Y) is defined as follows:

KL(P(X|Y)\|Q(X|Y)) = \sum_{y} P(y)\left(\sum_{x} P(x|y)\log\frac{P(x|y)}{Q(x|y)}\right)

Prove the following chain rule for KL divergence:

KL(P(X,Y)\|Q(X,Y)) = KL(P(X)\|Q(X)) + KL(P(Y|X)\|Q(Y|X)).
Answer:

\begin{align*}
KL(P(X,Y)\|Q(X,Y)) &= \sum_{x,y} P(x,y)\log\frac{P(x,y)}{Q(x,y)} \\
&= \sum_{x,y} P(x,y)\log\frac{P(x)P(y|x)}{Q(x)Q(y|x)} \\
&= \sum_{x,y} \left(P(x,y)\log\frac{P(x)}{Q(x)} + P(x,y)\log\frac{P(y|x)}{Q(y|x)}\right) \\
&= \sum_{x,y} P(x,y)\log\frac{P(x)}{Q(x)} + \sum_{x,y} P(x)P(y|x)\log\frac{P(y|x)}{Q(y|x)} \\
&= \sum_{x} P(x)\log\frac{P(x)}{Q(x)} + \sum_{x} P(x)\sum_{y} P(y|x)\log\frac{P(y|x)}{Q(y|x)} \\
&= KL(P(X)\|Q(X)) + KL(P(Y|X)\|Q(Y|X)).
\end{align*}
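The chain rule can be verified numerically on a small joint table; the two distributions below are made-up examples over (x, y) ∈ {0, 1}²:

```python
import math

def kl_discrete(pairs):
    """Sum of p * log(p / q) over (p, q) pairs, all p, q > 0."""
    return sum(p * math.log(p / q) for p, q in pairs)

# made-up joint distributions P and Q over (x, y)
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
Q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def marginal(D, x):
    return D[(x, 0)] + D[(x, 1)]

kl_joint = kl_discrete([(P[k], Q[k]) for k in P])
kl_x = kl_discrete([(marginal(P, x), marginal(Q, x)) for x in (0, 1)])
kl_y_given_x = sum(
    marginal(P, x) * kl_discrete(
        [(P[(x, y)] / marginal(P, x), Q[(x, y)] / marginal(Q, x))
         for y in (0, 1)])
    for x in (0, 1))

print(abs(kl_joint - (kl_x + kl_y_given_x)) < 1e-12)  # prints True
```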
Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed P_θ is of the following form: P_\theta(x, y) = p(y)\prod_{i=1}^{n} p(x_i|y). By the chain rule for KL divergence, we therefore have:

KL(\hat{P}\|P_\theta) = KL(\hat{P}(y)\|p(y)) + \sum_{i=1}^{n} KL(\hat{P}(x_i|y)\|p(x_i|y)).
(c) KL and maximum likelihood. Let \hat{P}(x) = \frac{1}{m}\sum_{i=1}^{m} 1\{x^{(i)} = x\} be the empirical distribution of a training set \{x^{(i)}; i = 1, \ldots, m\}. We show that minimizing the KL divergence from \hat{P} to a parameterized model P_\theta is equivalent to maximum likelihood estimation. Answer:

\begin{align*}
\arg\min_\theta KL(\hat{P}\|P_\theta) &= \arg\min_\theta \sum_x \left(\hat{P}(x)\log\hat{P}(x) - \hat{P}(x)\log P_\theta(x)\right) \\
&= \arg\min_\theta \sum_x -\hat{P}(x)\log P_\theta(x) \\
&= \arg\max_\theta \sum_x \hat{P}(x)\log P_\theta(x) \\
&= \arg\max_\theta \sum_x \frac{1}{m}\sum_{i=1}^{m} 1\{x^{(i)} = x\}\log P_\theta(x) \\
&= \arg\max_\theta \frac{1}{m}\sum_{i=1}^{m}\sum_x 1\{x^{(i)} = x\}\log P_\theta(x) \\
&= \arg\max_\theta \frac{1}{m}\sum_{i=1}^{m} \log P_\theta(x^{(i)}) \\
&= \arg\max_\theta \sum_{i=1}^{m} \log P_\theta(x^{(i)}),
\end{align*}

where we used, in order: the definition of KL, leaving out terms independent of θ, flipping the sign and correspondingly flipping min to max, the definition of \hat{P}, switching the order of summation, the definition of the indicator, and simplification.
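The equivalence can be illustrated for a Bernoulli model, where the MLE is simply the sample mean; the grid search below (function name, sample data, and grid resolution are ours, for illustration only) minimizes KL(P̂‖P_θ) over θ and recovers that mean:

```python
import math

def kl_min_vs_mle(samples):
    """For a Bernoulli(theta) model, minimize KL(P_hat || P_theta)
    over a grid of theta values and compare with the MLE, which is
    the sample mean."""
    p_hat = sum(samples) / len(samples)
    def kl_to(theta):
        total = 0.0
        for p, q in ((p_hat, theta), (1.0 - p_hat, 1.0 - theta)):
            if p > 0:
                total += p * math.log(p / q)
        return total
    grid = [i / 1000.0 for i in range(1, 1000)]
    return min(grid, key=kl_to), p_hat

best, mle = kl_min_vs_mle([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(best == mle)  # prints True: the KL minimizer is the sample mean 0.3
```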
5. K-means for compression. In this problem, we apply the K-means algorithm to lossy image compression by reducing the number of colors used in an image. Download the following files:
• http://cs229.stanford.edu/ps/ps3/mandrill-small.tiff
• http://cs229.stanford.edu/ps/ps3/mandrill-large.tiff
The mandrill-large.tiff file contains a 512x512 image of a mandrill represented in 24-
bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit
numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity
values for that pixel. The straightforward representation of this image therefore takes
about 262144 × 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use
K-means to reduce the image to k = 16 colors. More specifically, each pixel in the image is
considered a point in the three-dimensional (r, g, b)-space. To compress the image, we will
cluster these points in color-space into 16 clusters, and replace each pixel with the closest
cluster centroid.
Follow the instructions below. Be warned that some of these operations can take a while
(several minutes even on a fast computer)!2
Answer: Figure ?? shows the original image of the mandrill. Figure ?? shows the image
compressed into 16 colors using K-means run to convergence, and shows the 16 colors used in
the compressed image. (These solutions are given in a color PostScript file. To see the colors
without a color printer, view them with a program that can display color PostScript, such as
ghostview.) The original image used 24 bits per pixel. To represent one of 16 colors requires
log2 16 = 4 bits per pixel. We have therefore achieved a compression factor of about 24/4 = 6
of the image. MATLAB code for this problem is given below.
2 In order to use the imread and imshow commands in octave, you have to install the Image package from
octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net
3 Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or octave.
CS229 Problem Set #3 Solutions 13
A = double(imread('mandrill-small.tiff'));
imshow(uint8(round(A)));
% K-means initialization: pick k random pixels as initial centroids
k = 16;
initmu = zeros(k,3);
for l=1:k,
  i = random('unid', size(A, 1), 1, 1);
  j = random('unid', size(A, 2), 1, 1);
  initmu(l,:) = double(permute(A(i,j,:), [3 2 1])');
end;
% Run K-means
mu = initmu;
for iter = 1:200, % usually converges long before 200 iterations
  newmu = zeros(k,3);
  nassign = zeros(k,1);
  for i=1:size(A,1),
    for j=1:size(A,2),
      dist = zeros(k,1);
      for l=1:k,
        d = mu(l,:)' - permute(A(i,j,:), [3 2 1]);
        dist(l) = d'*d;
      end;
      [value, assignment] = min(dist);
      nassign(assignment) = nassign(assignment) + 1;
      newmu(assignment,:) = newmu(assignment,:) + ...
          permute(A(i,j,:), [3 2 1])';
    end; end;
  for l=1:k,
    if (nassign(l) > 0)
      newmu(l,:) = newmu(l,:) / nassign(l);
    else
      newmu(l,:) = mu(l,:); % keep empty clusters where they were
    end;
  end;
  mu = newmu;
end;
% Quantize the large image with the learned centroids.
% (This step was missing from the printout; reconstructed here.)
B = double(imread('mandrill-large.tiff'));
qimage = zeros(size(B));
for i=1:size(B,1),
  for j=1:size(B,2),
    dist = zeros(k,1);
    for l=1:k,
      d = mu(l,:)' - permute(B(i,j,:), [3 2 1]);
      dist(l) = d'*d;
    end;
    [value, assignment] = min(dist);
    qimage(i,j,:) = mu(assignment,:);
  end; end;
imshow(uint8(round(qimage)));
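For readers working in Python rather than MATLAB/Octave, the same assign-then-update loop can be sketched in a few lines; the toy pixel list below stands in for the image data, which is not reproduced here (function name and data are ours):

```python
import random

def kmeans_rgb(points, k, iters=30, seed=0):
    """Plain K-means on a list of (r, g, b) tuples: assign each point
    to its nearest centroid, then move each centroid to the mean of
    its assigned points (left unchanged if it received no points)."""
    rng = random.Random(seed)
    mu = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        sums = [[0.0] * 3 for _ in range(k)]
        counts = [0] * k
        for p in points:
            a = min(range(k),
                    key=lambda l: sum((c - v) ** 2 for c, v in zip(mu[l], p)))
            counts[a] += 1
            for d in range(3):
                sums[a][d] += p[d]
        for l in range(k):
            if counts[l] > 0:
                mu[l] = [s / counts[l] for s in sums[l]]
    return mu

# two obvious color clusters (dark and light)
pixels = [(0, 0, 0), (1, 2, 3), (2, 1, 0),
          (250, 250, 250), (251, 249, 250), (248, 252, 251)]
centroids = kmeans_rgb(pixels, k=2)
print(sorted(round(sum(c)) for c in centroids))  # one dark, one light centroid
```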