CS 229 Autumn 2016 Problem Set #3 Solutions: Theory & Unsupervised Learning
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at https://piazza.com/stanford/autumn2016/cs229. (3)
If you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4) For
problems that require programming, please include in your submission a printout of your code
(with comments) and any figures that you are asked to plot.
If you are skipping a question, please include it on your PDF/photo, but leave the question
blank and tag it appropriately on Gradescope. This includes extra credit problems. If you are
scanning your document by cellphone, please check the Piazza forum for recommended cellphone
scanning apps and best practices.
1. Uniform convergence. (Setup, reconstructed from context: a poll surveys m voters in each of n states; X_{ij} ∈ {0, 1} indicates whether voter j in state i votes democrat, and φ_i is the true fraction of democrat votes in state i.) We assume that the voters correctly disclose their vote during the survey. Thus, for each value of i, we have that the X_{ij} are drawn IID from a Bernoulli(φ_i) distribution. Moreover, the X_{ij}'s (for all i, j) are all mutually independent.
After the survey, the fraction of democrat votes in state i is estimated as:
\hat{\phi}_i = \frac{1}{m}\sum_{j=1}^{m} X_{ij}
Also, let Zi = 1{|φ̂i − φi | > γ} be a binary random variable that indicates whether the
prediction in state i was highly inaccurate.
(a) Let ψi be the probability that Zi = 1. Using the Hoeffding inequality, find an upper
bound on ψi .
Answer: A direct application of the Hoeffding inequality yields
\psi_i \le 2e^{-2\gamma^2 m}
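As a sanity check, the bound from part (a) can be compared against a quick Monte Carlo estimate. The sketch below is illustrative (the parameter values φ, m, γ are made up, not taken from the problem): it draws m Bernoulli(φ) votes per trial and counts how often |φ̂ − φ| > γ.

```python
import math
import random

def hoeffding_check(phi=0.3, m=500, gamma=0.05, trials=2000, seed=0):
    """Estimate psi = P(|phi_hat - phi| > gamma) by simulation and
    return it alongside the Hoeffding bound 2*exp(-2*gamma^2*m)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # phi_hat is the fraction of simulated democrat votes
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        if abs(phi_hat - phi) > gamma:
            bad += 1
    return bad / trials, 2 * math.exp(-2 * gamma ** 2 * m)

emp, bound = hoeffding_check()
```

For these values the bound is about 0.16 while the empirical frequency is far smaller; Hoeffding is loose here, but valid.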
(b) In this part, we prove a general result which will be useful for this problem. Let V_i and W_i (1 ≤ i ≤ k) be Bernoulli random variables, and suppose P(V_i = 1) ≤ P(W_i = 1) for all i. Let the V_i's be mutually independent, and similarly let the W_i's also be mutually independent. Prove that, for any value of t, the following holds:
P\Big(\sum_{i=1}^{k} V_i > t\Big) \le P\Big(\sum_{i=1}^{k} W_i > t\Big)
[Hint: One way to do this is via induction on k. If you use a proof by induction, for
the base case (k = 1), you must show that the inequality holds for t < 0, 0 ≤ t < 1,
and t ≥ 1.]
Answer: We prove the claim by induction on k.
Base case (k = 1): we must show P(V_1 > t) ≤ P(W_1 > t) for all t. For t < 0 both sides equal 1, and for t ≥ 1 both sides equal 0, so the inequality is trivial there. For 0 ≤ t < 1, the claim reduces to

P(V_1 = 1) \le P(W_1 = 1),

which holds by our original assumptions.
Inductive step: Assume
P\Big(\sum_{i=1}^{l} V_i > t\Big) \le P\Big(\sum_{i=1}^{l} W_i > t\Big), \quad \forall t
Then,

\begin{align*}
P\Big(\sum_{i=1}^{l+1} V_i > t\Big)
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l+1} V_i > t \,\Big|\, V_{l+1}=1\Big) + P(V_{l+1}=0)\,P\Big(\sum_{i=1}^{l+1} V_i > t \,\Big|\, V_{l+1}=0\Big) \\
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1 \,\Big|\, V_{l+1}=1\Big) + P(V_{l+1}=0)\,P\Big(\sum_{i=1}^{l} V_i > t \,\Big|\, V_{l+1}=0\Big) \\
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1\Big) + P(V_{l+1}=0)\,P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&= P(V_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1\Big) + \big(1 - P(V_{l+1}=1)\big)\,P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&= P(V_{l+1}=1)\Big(P\Big(\sum_{i=1}^{l} V_i > t-1\Big) - P\Big(\sum_{i=1}^{l} V_i > t\Big)\Big) + P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&\le P(W_{l+1}=1)\Big(P\Big(\sum_{i=1}^{l} V_i > t-1\Big) - P\Big(\sum_{i=1}^{l} V_i > t\Big)\Big) + P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&= P(W_{l+1}=1)\,P\Big(\sum_{i=1}^{l} V_i > t-1\Big) + P(W_{l+1}=0)\,P\Big(\sum_{i=1}^{l} V_i > t\Big) \\
&\le P(W_{l+1}=1)\,P\Big(\sum_{i=1}^{l} W_i > t-1\Big) + P(W_{l+1}=0)\,P\Big(\sum_{i=1}^{l} W_i > t\Big) \\
&= P\Big(\sum_{i=1}^{l+1} W_i > t\Big).
\end{align*}

Here the third equality uses the mutual independence of the V_i's; the first inequality holds because P(\sum_{i=1}^{l} V_i > t-1) - P(\sum_{i=1}^{l} V_i > t) \ge 0 and P(V_{l+1}=1) \le P(W_{l+1}=1); and the second inequality applies the inductive hypothesis at both t-1 and t. This completes the induction.
Alternative answer: The result can also be proved with a coupling argument, without induction. Let ξ_1, …, ξ_k be i.i.d. Uniform[0, 1] random variables, and define V_i^u = 1{ξ_i ≤ P(V_i = 1)} and W_i^u = 1{ξ_i ≤ P(W_i = 1)}. Then E[V_i^u] = P(ξ_i ≤ P(V_i = 1)) = P(V_i = 1) = E[V_i], and similarly E[W_i^u] = E[W_i]. But W_i^u ≥ V_i^u always (since P(V_i = 1) ≤ P(W_i = 1)), and the V_i^u are mutually independent with V_i^u distributed as V_i (and similarly for the W_i^u). Since W_i^u ≥ V_i^u, the event \sum_{i=1}^{k} V_i^u > t implies \sum_{i=1}^{k} W_i^u > t. Consequently, we have

P\Big(\sum_{i=1}^{k} V_i > t\Big) = P\Big(\sum_{i=1}^{k} V_i^u > t\Big) \le P\Big(\sum_{i=1}^{k} W_i^u > t\Big) = P\Big(\sum_{i=1}^{k} W_i > t\Big).
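The coupling construction is easy to check by simulation. The sketch below uses arbitrary made-up probabilities (the function name is ours): with a shared ξ_i per coordinate, the realized sums satisfy ΣV_i^u ≤ ΣW_i^u on every single draw, not just in distribution.

```python
import random

def coupled_sums_dominate(p_v, p_w, trials=5000, seed=1):
    """Draw xi_i ~ Uniform[0,1] once per coordinate and set
    V_i = 1{xi_i <= p_v[i]}, W_i = 1{xi_i <= p_w[i]}.  When
    p_v[i] <= p_w[i] for all i, V_i <= W_i pointwise, so the
    sum of the V's never exceeds the sum of the W's."""
    rng = random.Random(seed)
    for _ in range(trials):
        xi = [rng.random() for _ in range(len(p_v))]
        sum_v = sum(x <= p for x, p in zip(xi, p_v))
        sum_w = sum(x <= p for x, p in zip(xi, p_w))
        if sum_v > sum_w:
            return False
    return True

# example probabilities with p_v[i] <= p_w[i] in each coordinate
print(coupled_sums_dominate([0.1, 0.3, 0.5], [0.2, 0.4, 0.9]))  # prints True
```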
(c) The fraction of states on which our predictions are highly inaccurate is given by Z = \frac{1}{n}\sum_{i=1}^{n} Z_i. Prove a reasonable closed form upper bound on the probability P(Z > τ) of being highly inaccurate on more than a fraction τ of the states.
[Note: There are many possible answers, but to be considered reasonable, your bound
must decrease to zero as m → ∞ (for fixed n and τ > 0). Also, your bound should
either remain constant or decrease as n → ∞ (for fixed m and τ > 0). It is also fine
if, for some values of τ , m and n, your bound just tells us that P (Z > τ ) ≤ 1 (the
trivial bound).]
Answer: There are multiple ways to do this problem. We list a couple of them below. In what follows, let µ = 2e^{−2γ²m}, and let Y_1, …, Y_n be i.i.d. Bernoulli(µ) random variables, so that by part (a), P(Z_i = 1) = ψ_i ≤ µ = P(Y_i = 1).

First, applying the result of part (b) and then the Hoeffding inequality to the Y_i's,

P(Z > \tau) \le P\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i > \tau\Big) \le 2e^{-2(\tau-\mu)^2 n},

where the last step follows provided that 0 < τ − µ = τ − 2e^{−2γ²m}, or equivalently, m > \frac{1}{2\gamma^2}\log\frac{2}{\tau}. For fixed τ and m, this bound goes to zero as n → ∞.

Alternatively, we can bound the binomial tail directly:

P(Z > \tau) \le P\Big(\sum_{i=1}^{n} Y_i \ge k\Big) \le \sum_{j=k}^{n} \binom{n}{j}\mu^j,

where k is the smallest integer such that k > nτ. For fixed τ and n, observe that as m → ∞, µ → 0, so this bound goes to zero. Therefore,

P(Z > \tau) \le \min\Big\{1,\; 2e^{-2(\tau-\mu)^2 n},\; \sum_{j=k}^{n}\binom{n}{j}\mu^j\Big\}.
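To get a feel for the combined bound, it can be evaluated numerically. The sketch below (function name and parameter values are ours, chosen only for illustration) computes min{1, Hoeffding term, binomial-tail term}:

```python
import math

def part_c_bound(m, n, tau, gamma):
    """Evaluate min(1, 2*exp(-2*(tau-mu)^2*n), sum_{j=k}^n C(n,j)*mu^j)
    with mu = 2*exp(-2*gamma^2*m) and k the smallest integer with
    k > n*tau.  The Hoeffding term is only valid when tau > mu, so it
    is replaced by the trivial bound 1 otherwise."""
    mu = 2 * math.exp(-2 * gamma ** 2 * m)
    hoeffding = 2 * math.exp(-2 * (tau - mu) ** 2 * n) if tau > mu else 1.0
    k = math.floor(n * tau) + 1
    binom_tail = sum(math.comb(n, j) * mu ** j for j in range(k, n + 1))
    return min(1.0, hoeffding, binom_tail)

b_small_m = part_c_bound(500, 50, 0.1, 0.05)   # mu > tau: only the trivial bound
b_large_m = part_c_bound(2000, 50, 0.1, 0.05)  # mu tiny: binomial tail is minuscule
```

Increasing m drives µ, and with it the bound, toward zero, exactly as the note in the problem requires.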
Yet another approach: since P(Z_i = 1) ≤ µ = P(Y_i = 1), using the result from the previous part followed by Chebyshev's inequality:

\begin{align*}
P(Z > \tau) &\le P\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i > \tau\Big) \\
&= P\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i - \mu > \tau - \mu\Big) \\
&\le P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} Y_i - \mu\Big| > \tau - \mu\Big) \\
&\le \frac{\operatorname{Var}\big[\frac{1}{n}\sum_{i=1}^{n} Y_i\big]}{(\tau - \mu)^2} \\
&= \frac{\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}[Y_i]}{(\tau - \mu)^2} \\
&= \frac{2e^{-2\gamma^2 m}\big(1 - 2e^{-2\gamma^2 m}\big)}{n(\tau - \mu)^2} \\
&\le \frac{2e^{-2\gamma^2 m}}{n(\tau - \mu)^2},
\end{align*}

where we again require that m > \frac{1}{2\gamma^2}\log\frac{2}{\tau} so that τ − µ > 0. This version of the bound goes to zero as n → ∞ (for fixed m and τ), and also as m → ∞, since the numerator 2e^{−2γ²m} vanishes.
2. VC dimension. Let h_θ(x) = 1{θ_0 + θ_1 x + · · · + θ_d x^d ≥ 0} (with x ∈ R and θ ∈ R^{d+1}), and let H = {h_θ : θ ∈ R^{d+1}} be the corresponding hypothesis class. What is the VC dimension of H? Justify your answer.
[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When
doing this problem, you should not assume any other non-trivial result (such as that the
VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved
in class.]
Answer: The key insight is that if the polynomial does not cross the x-axis (i.e. have a
root) between two points, then it must give the two points the same label.
First, we need to show that there is a set of size d + 1 which H can shatter. We consider
polynomials with d real roots. A subset of the polynomials in H can be written as
\pm \prod_{i=1}^{d} (x - r_i)
where ri is the ith real root. Consider any set of size d+1 which does not contain any duplicate
points. For any labelling of these points, construct a function as follows: If two consecutive
points are labelled differently, set one of the ri to the average of those points. If two consecutive
points are labelled the same, don’t put a root between them. If we haven’t used up all of our
d roots, place them beyond the last point. Finally, choose ± to get the desired labelling.
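The root-placement argument above translates directly into code. The sketch below (helper name and test points are ours, not from the solution) builds such a polynomial for d + 1 points and verifies that every labelling is realized:

```python
from itertools import product

def shatter_poly(xs, ys):
    """Given sorted distinct points xs and labels ys in {-1, +1},
    return a polynomial of degree <= len(xs) - 1 whose sign matches
    ys on xs: place a root at the midpoint of each consecutive pair
    with different labels, then pick the overall sign."""
    roots = [(a + b) / 2
             for a, b, la, lb in zip(xs, xs[1:], ys, ys[1:]) if la != lb]
    def p(x):
        val = 1.0
        for r in roots:
            val *= (x - r)
        return val
    sign = 1.0 if (p(xs[0]) > 0) == (ys[0] > 0) else -1.0
    return lambda x: sign * p(x)

# check that all 2^(d+1) labellings of d + 1 = 4 points are realized
xs = [0.0, 1.0, 2.0, 3.0]
ok = all(
    all((shatter_poly(xs, list(ys))(x) > 0) == (y > 0)
        for x, y in zip(xs, ys))
    for ys in product([-1, 1], repeat=len(xs))
)
print(ok)  # prints True
```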
A more constructive proof of the above is the following: consider any set of distinct points
x(1) , . . . , x(d+1) , and let y (1) , . . . , y (d+1) ∈ {−1, 1} be any labeling of these points (where we
have used −1 for points which would normally be labeled zero). Then, consider the following
polynomial:
p(x) = \sum_{k=1}^{d+1} y^{(k)} \prod_{j \ne k} \frac{x^{(j)} - x}{x^{(j)} - x^{(k)}}.
Here, observe that in the above expression, each term of the summation is a polynomial (in
x) of degree d, and hence the overall expression is a polynomial of degree d. Furthermore,
observe that when x = x(i) , then the ith term of the summation evaluates to y (i) , and all
other terms of the summation evaluate to 0 (since all other terms have a factor (x(i) − x)).
Therefore, p(x(i) ) = y (i) for i = 1, . . . , d + 1. This construction is known as a “Lagrange
interpolating polynomial.” Therefore, any labeling of d + 1 points can be realized using a
degree d polynomial.
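The Lagrange construction can likewise be checked directly. The sketch below (function name and points are ours) evaluates p(x) = Σ_k y^{(k)} Π_{j≠k} (x^{(j)} − x)/(x^{(j)} − x^{(k)}) and verifies p(x^{(i)}) = y^{(i)} for every labelling of d + 1 points:

```python
from itertools import product

def lagrange_poly(xs, ys):
    """Lagrange interpolating polynomial from the solution:
    p(x) = sum_k y_k * prod_{j != k} (x_j - x) / (x_j - x_k)."""
    def p(x):
        total = 0.0
        for k in range(len(xs)):
            term = float(ys[k])
            for j in range(len(xs)):
                if j != k:
                    term *= (xs[j] - x) / (xs[j] - xs[k])
            total += term
        return total
    return p

xs = [0.0, 1.0, 2.0, 3.0]
ok = all(
    all(abs(lagrange_poly(xs, ys)(xi) - yi) < 1e-9 for xi, yi in zip(xs, ys))
    for ys in product([-1, 1], repeat=len(xs))
)
print(ok)  # prints True
```

At x = x^{(k)} the kth term's factors all cancel to 1, so the interpolation property holds exactly even in floating point.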
Second, we need to prove that H can't shatter any set of size d + 2. If two points are identical, we can't realize any labelling that labels them differently. If all the points are distinct, we can't achieve an alternating labelling (sorted by x, the labels +, −, +, −, …), because realizing it would require the polynomial to cross the x-axis at least d + 1 times, contradicting the fact that a polynomial of degree d has at most d real roots. Therefore, the VC dimension of H is d + 1.
3. [12 points] MAP estimates and weight decay
Consider using a logistic regression model hθ (x) = g(θT x) where g is the sigmoid function,
and let a training set {(x(i) , y (i) ); i = 1, . . . , m} be given as usual. The maximum likelihood
estimate of the parameters θ is given by
\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta).
If we wanted to regularize logistic regression, then we might put a Bayesian prior on the
parameters. Suppose we chose the prior θ ∼ N (0, τ 2 I) (here, τ > 0, and I is the n + 1-by-
n + 1 identity matrix), and then found the MAP estimate of θ as:
\theta_{MAP} = \arg\max_{\theta}\; p(\theta) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)
Prove that
||θMAP ||2 ≤ ||θML ||2
[Hint: Consider using a proof by contradiction.]
Remark. For this reason, this form of regularization is sometimes also called weight
decay, since it encourages the weights (meaning parameters) to take on generally smaller
values.
Answer: We use a proof by contradiction. Assume that \|\theta_{MAP}\|_2 > \|\theta_{ML}\|_2. Then

\begin{align*}
p(\theta_{MAP}) &= \frac{1}{(2\pi)^{\frac{n+1}{2}}|\tau^2 I|^{\frac{1}{2}}}\, e^{-\frac{1}{2\tau^2}\|\theta_{MAP}\|_2^2} \\
&< \frac{1}{(2\pi)^{\frac{n+1}{2}}|\tau^2 I|^{\frac{1}{2}}}\, e^{-\frac{1}{2\tau^2}\|\theta_{ML}\|_2^2} \\
&= p(\theta_{ML}).
\end{align*}
This yields

\begin{align*}
p(\theta_{MAP}) \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta_{MAP}) &< p(\theta_{ML}) \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta_{MAP}) \\
&\le p(\theta_{ML}) \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta_{ML}),
\end{align*}

where the last inequality holds since \theta_{ML} was chosen to maximize \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}; \theta). However, this result gives us a contradiction, since \theta_{MAP} was chosen to maximize \prod_{i=1}^{m} p(y^{(i)}|x^{(i)}, \theta)\, p(\theta). Hence \|\theta_{MAP}\|_2 \le \|\theta_{ML}\|_2.
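The norm-shrinking effect is easy to observe numerically. The sketch below is illustrative (the data, step size, and prior strength are all made up): it fits logistic regression by gradient ascent with and without the Gaussian-prior term, whose contribution to the gradient of the log-posterior is −θ/τ².

```python
import math

def fit_logistic(X, y, inv_tau2=0.0, lr=0.1, iters=3000):
    """Gradient ascent on the log-likelihood, plus the log-prior
    gradient -inv_tau2 * theta when inv_tau2 > 0 (i.e. the
    N(0, tau^2 I) prior with inv_tau2 = 1/tau^2).  Minimal sketch."""
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        grad = [-inv_tau2 * t for t in theta]
        for xi, yi in zip(X, y):
            z = sum(t * xj for t, xj in zip(theta, xi))
            h = 1.0 / (1.0 + math.exp(-z))
            for j in range(n):
                grad[j] += (yi - h) * xi[j]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def l2(v):
    return math.sqrt(sum(t * t for t in v))

# toy non-separable data; first feature is the intercept
X = [[1.0, 0.5], [1.0, 1.0], [1.0, -0.5], [1.0, -1.0], [1.0, 0.2], [1.0, -0.3]]
y = [1, 1, 0, 0, 0, 1]
theta_ml = fit_logistic(X, y)
theta_map = fit_logistic(X, y, inv_tau2=1.0)
print(l2(theta_map) <= l2(theta_ml))  # prints True
```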
4. KL divergence. The Kullback–Leibler (KL) divergence between two discrete-valued distributions P(X) and Q(X) is defined as follows:

KL(P\|Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}
For notational convenience, we assume P (x) > 0, ∀x. (Otherwise, one standard thing to do
is to adopt the convention that “0 log 0 = 0.”) Sometimes, we also write the KL divergence
as K L(P ||Q) = K L(P (X)||Q(X)).
The KL divergence is an asymmetric measure of the distance between two probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that we're familiar with.
(a) Nonnegativity. Prove the following: ∀P, Q, KL(P‖Q) ≥ 0, with KL(P‖Q) = 0 if and only if P = Q.
(If P and Q are densities for continuous-valued random variables, the sum is replaced by an integral, and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we'll just work with this form of KL divergence for probability mass functions/discrete-valued distributions.)
Answer: We have

\begin{align*}
-KL(P\|Q) &= -\sum_x P(x)\log\frac{P(x)}{Q(x)} \\
&= \sum_x P(x)\log\frac{Q(x)}{P(x)} \\
&\le \log\sum_x P(x)\,\frac{Q(x)}{P(x)} \\
&= \log\sum_x Q(x) = \log 1 = 0,
\end{align*}

so KL(P‖Q) ≥ 0. All equalities follow from straightforward algebraic manipulation; the inequality follows from Jensen's inequality applied to the concave function log.
To show the second part of the claim, note that log t is a strictly concave function of t. Using the form of Jensen's inequality given in the lecture notes, we have equality if and only if \frac{Q(x)}{P(x)} = E\big[\frac{Q(X)}{P(X)}\big] for all x. But since E\big[\frac{Q(X)}{P(X)}\big] = \sum_x P(x)\frac{Q(x)}{P(x)} = \sum_x Q(x) = 1, it follows that P(x) = Q(x) for all x. Hence we have KL(P‖Q) = 0 if and only if P(x) = Q(x) for all x.
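Both properties can be checked numerically for small discrete distributions. The sketch below (function name and distributions are ours) assumes both distributions put positive mass on every outcome, as in the problem:

```python
import math

def kl(p, q):
    """KL(P || Q) for two distributions given as probability lists
    over the same outcomes (all entries assumed positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(kl([0.5, 0.5], [0.9, 0.1]) > 0)            # prints True: P != Q
print(abs(kl([0.3, 0.7], [0.3, 0.7])) < 1e-12)   # prints True: P == Q
```

Note the asymmetry as well: in general kl(p, q) and kl(q, p) differ.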
(b) Chain rule for KL divergence. The KL divergence between two conditional distributions P(X|Y), Q(X|Y) is defined as follows:

KL(P(X|Y)\|Q(X|Y)) = \sum_{y} P(y)\left(\sum_{x} P(x|y)\log\frac{P(x|y)}{Q(x|y)}\right)

Prove the following chain rule for KL divergence:

KL(P(X,Y)\|Q(X,Y)) = KL(P(X)\|Q(X)) + KL(P(Y|X)\|Q(Y|X)).
Answer:

\begin{align*}
KL(P(X,Y)\|Q(X,Y)) &= \sum_{x,y} P(x,y)\log\frac{P(x,y)}{Q(x,y)} \\
&= \sum_{x,y} P(x,y)\log\frac{P(x)P(y|x)}{Q(x)Q(y|x)} \\
&= \sum_{x,y} \left(P(x,y)\log\frac{P(x)}{Q(x)} + P(x,y)\log\frac{P(y|x)}{Q(y|x)}\right) \\
&= \sum_{x,y} P(x,y)\log\frac{P(x)}{Q(x)} + \sum_{x,y} P(x)P(y|x)\log\frac{P(y|x)}{Q(y|x)} \\
&= \sum_{x} P(x)\log\frac{P(x)}{Q(x)} + \sum_{x} P(x)\sum_{y} P(y|x)\log\frac{P(y|x)}{Q(y|x)} \\
&= KL(P(X)\|Q(X)) + KL(P(Y|X)\|Q(Y|X)).
\end{align*}
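The chain rule can be verified numerically on a small joint table; the two distributions below are made-up examples over (x, y) ∈ {0, 1}²:

```python
import math

def kl_discrete(pairs):
    """Sum of p * log(p / q) over (p, q) pairs, all p, q > 0."""
    return sum(p * math.log(p / q) for p, q in pairs)

# made-up joint distributions P and Q over (x, y)
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
Q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def marginal(D, x):
    return D[(x, 0)] + D[(x, 1)]

kl_joint = kl_discrete([(P[k], Q[k]) for k in P])
kl_x = kl_discrete([(marginal(P, x), marginal(Q, x)) for x in (0, 1)])
kl_y_given_x = sum(
    marginal(P, x) * kl_discrete(
        [(P[(x, y)] / marginal(P, x), Q[(x, y)] / marginal(Q, x))
         for y in (0, 1)])
    for x in (0, 1))

print(abs(kl_joint - (kl_x + kl_y_given_x)) < 1e-12)  # prints True
```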
Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed P_θ is of the following form: P_\theta(x, y) = p(y)\prod_{i=1}^{n} p(x_i|y). By the chain rule for KL divergence, we therefore have:

KL(\hat{P}\|P_\theta) = KL(\hat{P}(y)\|p(y)) + \sum_{i=1}^{n} KL(\hat{P}(x_i|y)\|p(x_i|y)).
(c) KL and maximum likelihood. Let \hat{P}(x) = \frac{1}{m}\sum_{i=1}^{m} 1\{x^{(i)} = x\} be the empirical distribution of a training set \{x^{(i)}; i = 1, \ldots, m\}. We show that minimizing the KL divergence from \hat{P} to a parameterized model P_\theta is equivalent to maximum likelihood estimation. Answer:

\begin{align*}
\arg\min_\theta KL(\hat{P}\|P_\theta) &= \arg\min_\theta \sum_x \left(\hat{P}(x)\log\hat{P}(x) - \hat{P}(x)\log P_\theta(x)\right) \\
&= \arg\min_\theta \sum_x -\hat{P}(x)\log P_\theta(x) \\
&= \arg\max_\theta \sum_x \hat{P}(x)\log P_\theta(x) \\
&= \arg\max_\theta \sum_x \frac{1}{m}\sum_{i=1}^{m} 1\{x^{(i)} = x\}\log P_\theta(x) \\
&= \arg\max_\theta \frac{1}{m}\sum_{i=1}^{m}\sum_x 1\{x^{(i)} = x\}\log P_\theta(x) \\
&= \arg\max_\theta \frac{1}{m}\sum_{i=1}^{m} \log P_\theta(x^{(i)}) \\
&= \arg\max_\theta \sum_{i=1}^{m} \log P_\theta(x^{(i)}),
\end{align*}

where we used, in order: the definition of KL, leaving out terms independent of θ, flipping the sign and correspondingly flipping min to max, the definition of \hat{P}, switching the order of summation, the definition of the indicator, and simplification.
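The equivalence can be illustrated for a Bernoulli model, where the MLE is simply the sample mean; the grid search below (function name, sample data, and grid resolution are ours, for illustration only) minimizes KL(P̂‖P_θ) over θ and recovers that mean:

```python
import math

def kl_min_vs_mle(samples):
    """For a Bernoulli(theta) model, minimize KL(P_hat || P_theta)
    over a grid of theta values and compare with the MLE, which is
    the sample mean."""
    p_hat = sum(samples) / len(samples)
    def kl_to(theta):
        total = 0.0
        for p, q in ((p_hat, theta), (1.0 - p_hat, 1.0 - theta)):
            if p > 0:
                total += p * math.log(p / q)
        return total
    grid = [i / 1000.0 for i in range(1, 1000)]
    return min(grid, key=kl_to), p_hat

best, mle = kl_min_vs_mle([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(best == mle)  # prints True: the KL minimizer is the sample mean 0.3
```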
5. K-means for compression. In this problem, we apply the K-means algorithm to lossy image compression by reducing the number of colors used in an image. Download the following files:
• http://cs229.stanford.edu/ps/ps3/mandrill-small.tiff
• http://cs229.stanford.edu/ps/ps3/mandrill-large.tiff
The mandrill-large.tiff file contains a 512x512 image of a mandrill represented in 24-
bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit
numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity
values for that pixel. The straightforward representation of this image therefore takes
about 262144 × 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use
K-means to reduce the image to k = 16 colors. More specifically, each pixel in the image is
considered a point in the three-dimensional (r, g, b)-space. To compress the image, we will
cluster these points in color-space into 16 clusters, and replace each pixel with the closest
cluster centroid.
Follow the instructions below. Be warned that some of these operations can take a while
(several minutes even on a fast computer)!2
Answer: Figure ?? shows the original image of the mandrill. Figure ?? shows the image
compressed into 16 colors using K-means run to convergence, and shows the 16 colors used in
the compressed image. (These solutions are given in a color PostScript file. To see the colors
without a color printer, view them with a program that can display color PostScript, such as
ghostview.) The original image used 24 bits per pixel. To represent one of 16 colors requires
log2 16 = 4 bits per pixel. We have therefore achieved a compression factor of about 24/4 = 6
of the image. MATLAB code for this problem is given below.
2 In order to use the imread and imshow commands in octave, you have to install the Image package from
octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net
3 Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or octave.
CS229 Problem Set #3 Solutions 13
A = double(imread('mandrill-small.tiff'));
imshow(uint8(round(A)));
% K-means initialization: pick k random pixels as initial centroids
k = 16;
initmu = zeros(k,3);
for l=1:k,
  i = random('unid', size(A, 1), 1, 1);
  j = random('unid', size(A, 2), 1, 1);
  initmu(l,:) = double(permute(A(i,j,:), [3 2 1])');
end;
% Run K-means
mu = initmu;
for iter = 1:200, % usually converges long before 200 iterations
  newmu = zeros(k,3);
  nassign = zeros(k,1);
  for i=1:size(A,1),
    for j=1:size(A,2),
      dist = zeros(k,1);
      for l=1:k,
        d = mu(l,:)' - permute(A(i,j,:), [3 2 1]);
        dist(l) = d'*d;
      end;
      [value, assignment] = min(dist);
      nassign(assignment) = nassign(assignment) + 1;
      newmu(assignment,:) = newmu(assignment,:) + ...
          permute(A(i,j,:), [3 2 1])';
    end; end;
  for l=1:k,
    if (nassign(l) > 0)
      newmu(l,:) = newmu(l,:) / nassign(l);
    else
      newmu(l,:) = mu(l,:); % keep empty clusters where they were
    end;
  end;
  mu = newmu;
end;
% Quantize the large image with the learned centroids.
% (This step was missing from the printout; reconstructed here.)
B = double(imread('mandrill-large.tiff'));
qimage = zeros(size(B));
for i=1:size(B,1),
  for j=1:size(B,2),
    dist = zeros(k,1);
    for l=1:k,
      d = mu(l,:)' - permute(B(i,j,:), [3 2 1]);
      dist(l) = d'*d;
    end;
    [value, assignment] = min(dist);
    qimage(i,j,:) = mu(assignment,:);
  end; end;
imshow(uint8(round(qimage)));
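For readers working in Python rather than MATLAB/Octave, the same assign-then-update loop can be sketched in a few lines; the toy pixel list below stands in for the image data, which is not reproduced here (function name and data are ours):

```python
import random

def kmeans_rgb(points, k, iters=30, seed=0):
    """Plain K-means on a list of (r, g, b) tuples: assign each point
    to its nearest centroid, then move each centroid to the mean of
    its assigned points (left unchanged if it received no points)."""
    rng = random.Random(seed)
    mu = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        sums = [[0.0] * 3 for _ in range(k)]
        counts = [0] * k
        for p in points:
            a = min(range(k),
                    key=lambda l: sum((c - v) ** 2 for c, v in zip(mu[l], p)))
            counts[a] += 1
            for d in range(3):
                sums[a][d] += p[d]
        for l in range(k):
            if counts[l] > 0:
                mu[l] = [s / counts[l] for s in sums[l]]
    return mu

# two obvious color clusters (dark and light)
pixels = [(0, 0, 0), (1, 2, 3), (2, 1, 0),
          (250, 250, 250), (251, 249, 250), (248, 252, 251)]
centroids = kmeans_rgb(pixels, k=2)
print(sorted(round(sum(c)) for c in centroids))  # one dark, one light centroid
```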