

CS 229 Autumn 2016

Problem Set #3 Solutions: Theory & Unsupervised learning

Due Wednesday, May 18 at 11:00 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at https://piazza.com/stanford/autumn2016/cs229. (3)
If you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4) For
problems that require programming, please include in your submission a printout of your code
(with comments) and any figures that you are asked to plot.
If you are skipping a question, please include it on your PDF/photo, but leave the question
blank and tag it appropriately on Gradescope. This includes extra credit problems. If you are
scanning your document by cellphone, please check the Piazza forum for recommended cellphone
scanning apps and best practices.

1. [23 points] Uniform convergence


You are hired by CNN to help design the sampling procedure for making their electoral
predictions for the next presidential election in the (fictitious) country of Elbania.
The country of Elbania is organized into states, and there are only two candidates running
in this election: One from the Elbanian Democratic party, and another from the Labor
Party of Elbania. The plan for making our electoral predictions is as follows: We’ll sample
m voters from each state, and ask whether they’re voting democrat. We’ll then publish,
for each state, the estimated fraction of democrat voters. In this problem, we’ll work out
how many voters we need to sample in order to ensure that we get good predictions with
high probability.
One reasonable goal might be to set m large enough that, with high probability, we obtain
uniformly accurate estimates of the fraction of democrat voters in every state. But this
might require surveying very many people, which would be prohibitively expensive. So,
we’re instead going to demand only a slightly lower degree of accuracy.
Specifically, we’ll say that our prediction for a state is “highly inaccurate” if the estimated
fraction of democrat voters differs from the actual fraction of democrat voters within that
state by more than a tolerance factor γ. CNN knows that their viewers will tolerate some
small number of states’ estimates being highly inaccurate; however, their credibility would
be damaged if they reported highly inaccurate estimates for too many states. So, rather
than trying to ensure that all states’ estimates are within γ of the true values (which
would correspond to no state’s estimate being highly inaccurate), we will instead try only
to ensure that the number of states with highly inaccurate estimates is small.
To formalize the problem, let there be n states, and let m voters be drawn IID from each
state. Let the actual fraction of voters in state i that voted democrat be φi . Also let Xij
(1 ≤ i ≤ n, 1 ≤ j ≤ m) be a binary random variable indicating whether the j-th randomly

chosen voter from state i voted democrat:


$$X_{ij} = \begin{cases} 1 & \text{if the $j$th example from the $i$th state voted democrat} \\ 0 & \text{otherwise} \end{cases}$$

We assume that the voters correctly disclose their vote during the survey. Thus, for each
value of i, we have that Xij are drawn IID from a Bernoulli(φi ) distribution. Moreover,
the Xij ’s (for all i, j) are all mutually independent.
After the survey, the fraction of democrat votes in state i is estimated as:
$$\hat{\phi}_i = \frac{1}{m} \sum_{j=1}^{m} X_{ij}$$

Also, let Zi = 1{|φ̂i − φi | > γ} be a binary random variable that indicates whether the
prediction in state i was highly inaccurate.

(a) Let ψi be the probability that Zi = 1. Using the Hoeffding inequality, find an upper
bound on ψi .
Answer: A direct application of the Hoeffding inequality yields
$$\psi_i \le 2e^{-2\gamma^2 m}$$
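As a rough guide to scale, the bound above also tells us how many voters per state suffice to make ψi small: setting 2e^{−2γ²m} ≤ δ and solving gives m ≥ log(2/δ)/(2γ²). A minimal Octave/MATLAB sketch (the values of γ and δ are made up for illustration, not part of the problem):

% Smallest m for which the part (a) bound 2*exp(-2*gamma^2*m) drops below delta.
% gamma (tolerance) and delta (target failure probability) are illustrative choices.
gamma = 0.05; delta = 0.01;
m_needed = ceil(log(2 / delta) / (2 * gamma^2))   % about 1060 voters per state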

(b) In this part, we prove a general result which will be useful for this problem. Let Vi
and Wi (1 ≤ i ≤ k) be Bernoulli random variables, and suppose

$$E[V_i] = P(V_i = 1) \le P(W_i = 1) = E[W_i] \quad \forall i \in \{1, 2, \ldots, k\}$$

Let the Vi ’s be mutually independent, and similarly let the Wi ’s also be mutually
independent. Prove that, for any value of t, the following holds:
$$P\left(\sum_{i=1}^{k} V_i > t\right) \le P\left(\sum_{i=1}^{k} W_i > t\right)$$

[Hint: One way to do this is via induction on k. If you use a proof by induction, for
the base case (k = 1), you must show that the inequality holds for t < 0, 0 ≤ t < 1,
and t ≥ 1.]
Answer: Prove it by induction.
Base case: Show

P (V1 > t) ≤ P (W1 > t)


If t < 0, then both probabilities are 1. If t ≥ 1, then both probabilities are 0. Otherwise,
the inequality reduces to

P (V1 = 1) ≤ P (W1 = 1)
which holds by our original assumptions.
Inductive step: Assume

$$P\left(\sum_{i=1}^{l} V_i > t\right) \le P\left(\sum_{i=1}^{l} W_i > t\right), \quad \forall t.$$

Then,

\begin{align*}
P\left(\sum_{i=1}^{l+1} V_i > t\right)
&= P(V_{l+1} = 1)\, P\left(\sum_{i=1}^{l+1} V_i > t \,\Big|\, V_{l+1} = 1\right) + P(V_{l+1} = 0)\, P\left(\sum_{i=1}^{l+1} V_i > t \,\Big|\, V_{l+1} = 0\right) \\
&= P(V_{l+1} = 1)\, P\left(\sum_{i=1}^{l} V_i > t - 1 \,\Big|\, V_{l+1} = 1\right) + P(V_{l+1} = 0)\, P\left(\sum_{i=1}^{l} V_i > t \,\Big|\, V_{l+1} = 0\right) \\
&= P(V_{l+1} = 1)\, P\left(\sum_{i=1}^{l} V_i > t - 1\right) + P(V_{l+1} = 0)\, P\left(\sum_{i=1}^{l} V_i > t\right) \\
&= P(V_{l+1} = 1)\, P\left(\sum_{i=1}^{l} V_i > t - 1\right) + (1 - P(V_{l+1} = 1))\, P\left(\sum_{i=1}^{l} V_i > t\right) \\
&= P(V_{l+1} = 1)\left(P\left(\sum_{i=1}^{l} V_i > t - 1\right) - P\left(\sum_{i=1}^{l} V_i > t\right)\right) + P\left(\sum_{i=1}^{l} V_i > t\right) \\
&\le P(W_{l+1} = 1)\left(P\left(\sum_{i=1}^{l} V_i > t - 1\right) - P\left(\sum_{i=1}^{l} V_i > t\right)\right) + P\left(\sum_{i=1}^{l} V_i > t\right) \\
&= P(W_{l+1} = 1)\, P\left(\sum_{i=1}^{l} V_i > t - 1\right) + (1 - P(W_{l+1} = 1))\, P\left(\sum_{i=1}^{l} V_i > t\right) \\
&= P(W_{l+1} = 1)\, P\left(\sum_{i=1}^{l} V_i > t - 1\right) + P(W_{l+1} = 0)\, P\left(\sum_{i=1}^{l} V_i > t\right) \\
&\le P(W_{l+1} = 1)\, P\left(\sum_{i=1}^{l} W_i > t - 1\right) + P(W_{l+1} = 0)\, P\left(\sum_{i=1}^{l} W_i > t\right) \\
&= P\left(\sum_{i=1}^{l+1} W_i > t\right).
\end{align*}

The first inequality uses P(V_{l+1} = 1) ≤ P(W_{l+1} = 1) together with the fact that the bracketed difference is nonnegative (the event {∑_{i=1}^{l} Vi > t} is contained in {∑_{i=1}^{l} Vi > t − 1}); the second inequality applies the inductive hypothesis at both t − 1 and t. This completes the inductive step, and the result is proved.


A second, and completely different way to prove this result, is by what is known as
coupling. In particular, let ξ1 , ξ2 , . . . , ξk be i.i.d. random variables, each uniform on [0, 1].
Define the variables
$$V_i^u = \begin{cases} 0 & \text{if } \xi_i > P(V_i = 1) \\ 1 & \text{if } \xi_i \le P(V_i = 1) \end{cases} \qquad \text{and} \qquad W_i^u = \begin{cases} 0 & \text{if } \xi_i > P(W_i = 1) \\ 1 & \text{if } \xi_i \le P(W_i = 1). \end{cases}$$

Then E[Vi^u] = P(ξi ≤ P(Vi = 1)) = P(Vi = 1) = E[Vi], and similarly E[Wi^u] = E[Wi].
Moreover, Wi^u ≥ Vi^u always, and the Vi^u are mutually independent with each Vi^u distributed as Vi (and similarly for the Wi^u).

Consequently, we have

$$P\left(\sum_{i=1}^{k} V_i > t\right) = P\left(\sum_{i=1}^{k} V_i^u > t\right) \le P\left(\sum_{i=1}^{k} W_i^u > t\right) = P\left(\sum_{i=1}^{k} W_i > t\right),$$

where the inequality follows because

$$\sum_{i=1}^{k} W_i^u \ge \sum_{i=1}^{k} V_i^u,$$

so that ∑_{i=1}^{k} Vi^u > t implies ∑_{i=1}^{k} Wi^u > t.
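To make the coupling construction concrete, here is a small Monte Carlo sketch (not part of the original solution; the Bernoulli means p and q and the threshold t are made up, chosen only so that p(i) ≤ q(i)):

% Shared uniforms xi couple V_i^u and W_i^u so that W_i^u >= V_i^u pointwise.
p = [0.2 0.3 0.4];        % means of the V_i
q = [0.5 0.6 0.7];        % means of the W_i, with p(i) <= q(i)
t = 1; N = 100000;        % threshold and number of Monte Carlo trials
xi = rand(N, length(p));  % one uniform per coordinate, shared by V and W
V = double(xi <= repmat(p, N, 1));   % V_i^u = 1{xi_i <= P(V_i = 1)}
W = double(xi <= repmat(q, N, 1));   % W_i^u = 1{xi_i <= P(W_i = 1)}
fprintf('P(sum V > t) ~ %.3f <= P(sum W > t) ~ %.3f\n', ...
        mean(sum(V, 2) > t), mean(sum(W, 2) > t));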
(c) The fraction of states on which our predictions are highly inaccurate is given by
Z = (1/n) ∑_{i=1}^{n} Zi. Prove a reasonable closed form upper bound on the probability
P (Z > τ ) of being highly inaccurate on more than a fraction τ of the states.
[Note: There are many possible answers, but to be considered reasonable, your bound
must decrease to zero as m → ∞ (for fixed n and τ > 0). Also, your bound should
either remain constant or decrease as n → ∞ (for fixed m and τ > 0). It is also fine
if, for some values of τ , m and n, your bound just tells us that P (Z > τ ) ≤ 1 (the
trivial bound).]
Answer: There are multiple ways to do this problem. We list a couple of them below:

Using Chernoff’s inequality


Let Yi be new Bernoulli random variables with mean µ = 2e^{−2γ²m}. Then we know from
part (a) that P (Zi = 1) ≤ µ = P (Yi = 1). Using the result from the previous part:
\begin{align*}
P(Z > \tau) &\le P\left(\frac{1}{n} \sum_{i=1}^{n} Y_i > \tau\right) \\
&= P\left(\frac{1}{n} \sum_{i=1}^{n} Y_i - \mu > \tau - \mu\right) \\
&\le P\left(\left|\frac{1}{n} \sum_{i=1}^{n} Y_i - \mu\right| > \tau - \mu\right) \\
&\le 2 \exp\left(-2(\tau - \mu)^2 n\right),
\end{align*}

where the last step follows provided that 0 < τ − µ = τ − 2e^{−2γ²m}, or equivalently,
m > (1/(2γ²)) log(2/τ). For fixed τ and m, this bound goes to zero as n → ∞. Alternatively,

we can also just compute the right side directly, as in


\begin{align*}
P(Z > \tau) &\le P\left(\frac{1}{n} \sum_{i=1}^{n} Y_i > \tau\right) \\
&= P\left(\sum_{i=1}^{n} Y_i > n\tau\right) \\
&= \sum_{j=k}^{n} P\left(\sum_{i=1}^{n} Y_i = j\right) \\
&= \sum_{j=k}^{n} \binom{n}{j} \mu^j (1 - \mu)^{n-j} \\
&\le \sum_{j=k}^{n} \binom{n}{j} \mu^j
\end{align*}

where k is the smallest integer such that k > nτ . For fixed τ and n, observe that as
m → ∞, µ → 0, so this bound goes to zero. Therefore,
 
$$P(Z > \tau) \le \min\left\{1,\; 2e^{-2(\tau - \mu)^2 n},\; \sum_{j=k}^{n} \binom{n}{j} \mu^j\right\}$$

has the properties we want.

Using Markov’s inequality


Markov’s inequality states that for any nonnegative random variable X and any τ > 0,
P(X > τ) ≤ E[X]/τ. From part (a), we have E[Zi] = P(Zi = 1) ≤ 2e^{−2γ²m}, implying
that

\begin{align*}
P(Z > \tau) &= P\left(\frac{1}{n} \sum_{i=1}^{n} Z_i > \tau\right) \\
&\le \frac{E\left[\frac{1}{n} \sum_{i=1}^{n} Z_i\right]}{\tau} \\
&\le \frac{2}{\tau} e^{-2\gamma^2 m}.
\end{align*}
This bound satisfies the given requirements: as m → ∞, the bound goes to zero; if
n → ∞, the bound stays constant.

Using Chebyshev’s inequality


Chebyshev’s inequality states that for any random variable X with expected value µ and
finite variance σ², and for any constant τ > 0, P(|X − µ| > τ) ≤ σ²/τ². Let Yi be new
Bernoulli random variables with mean µ = 2e^{−2γ²m}. Then we know from part (a) that

P (Zi = 1) ≤ µ = P (Yi = 1). Using the result from the previous part:
\begin{align*}
P(Z > \tau) &\le P\left(\frac{1}{n} \sum_{i=1}^{n} Y_i > \tau\right) \\
&= P\left(\frac{1}{n} \sum_{i=1}^{n} Y_i - \mu > \tau - \mu\right) \\
&\le P\left(\left|\frac{1}{n} \sum_{i=1}^{n} Y_i - \mu\right| > \tau - \mu\right) \\
&\le \frac{\mathrm{Var}\left[\frac{1}{n} \sum_{i=1}^{n} Y_i\right]}{(\tau - \mu)^2} \\
&= \frac{\frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[Y_i]}{(\tau - \mu)^2} \\
&= \frac{2e^{-2\gamma^2 m}(1 - 2e^{-2\gamma^2 m})}{n(\tau - \mu)^2} \\
&\le \frac{2e^{-2\gamma^2 m}}{n(\tau - \mu)^2},
\end{align*}

where we again require that m > (1/(2γ²)) log(2/τ). This version of the bound goes to zero
both when m → ∞ and when n → ∞.
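For intuition, a small Octave/MATLAB sketch comparing the bounds derived above (the values of γ, m, n and τ are made up for illustration, not part of the problem):

% Evaluate the Chernoff/Hoeffding, direct binomial, Markov, and Chebyshev bounds
% on P(Z > tau) for one illustrative parameter setting.
gamma = 0.1; m = 500; n = 50; tau = 0.1;
mu = 2 * exp(-2 * gamma^2 * m);                            % bound on P(Z_i = 1) from part (a)
k  = floor(n * tau) + 1;                                   % smallest integer k with k > n*tau
chernoff  = 2 * exp(-2 * (tau - mu)^2 * n);
binomtail = sum(arrayfun(@(j) nchoosek(n, j) * mu^j, k:n));
markov    = (2 / tau) * exp(-2 * gamma^2 * m);
chebyshev = 2 * exp(-2 * gamma^2 * m) / (n * (tau - mu)^2);
bounds = min(1, [chernoff, binomtail, markov, chebyshev])  % each entry is a valid upper bound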

2. [15 points] More VC dimension


Let the domain of the inputs for a learning problem be X = R. Consider using hypotheses
of the following form:

$$h_\theta(x) = 1\{\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d \ge 0\},$$

and let H = {hθ : θ ∈ R^{d+1}} be the corresponding hypothesis class. What is the VC
dimension of H? Justify your answer.
[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When
doing this problem, you should not assume any other non-trivial result (such as that the
VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved
in class.]
Answer: The key insight is that if the polynomial does not cross the x-axis (i.e. have a
root) between two points, then it must give the two points the same label.
First, we need to show that there is a set of size d + 1 which H can shatter. We consider
polynomials with d real roots. A subset of the polynomials in H can be written as

$$\pm \prod_{i=1}^{d} (x - r_i)$$

where ri is the ith real root. Consider any set of size d+1 which does not contain any duplicate
points. For any labelling of these points, construct a function as follows: If two consecutive
points are labelled differently, set one of the ri to the average of those points. If two consecutive

points are labelled the same, don’t put a root between them. If we haven’t used up all of our
d roots, place them beyond the last point. Finally, choose ± to get the desired labelling.
A more constructive proof of the above is the following: consider any set of distinct points
x(1) , . . . , x(d+1) , and let y (1) , . . . , y (d+1) ∈ {−1, 1} be any labeling of these points (where we
have used −1 for points which would normally be labeled zero). Then, consider the following
polynomial:
$$p(x) = \sum_{k=1}^{d+1} y^{(k)} \prod_{j \ne k} \frac{x^{(j)} - x}{x^{(j)} - x^{(k)}}.$$

Here, observe that in the above expression, each term of the summation is a polynomial (in
x) of degree d, and hence the overall expression is a polynomial of degree d. Furthermore,
observe that when x = x(i) , then the ith term of the summation evaluates to y (i) , and all
other terms of the summation evaluate to 0 (since all other terms have a factor (x(i) − x)).
Therefore, p(x(i) ) = y (i) for i = 1, . . . , d + 1. This construction is known as a “Lagrange
interpolating polynomial.” Therefore, any labeling of d + 1 points can be realized using a
degree d polynomial.
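As an illustration of this construction (a sketch only; the d = 3 points and labels below are made up), the interpolating polynomial can be evaluated in Octave/MATLAB and checked to reproduce the chosen labeling:

% Lagrange interpolation: p(x^(i)) = y^(i), so h(x) = 1{p(x) >= 0} realizes the labeling.
x = [-1 0 2 3];             % d + 1 = 4 distinct points (so d = 3)
y = [1 -1 -1 1];            % an arbitrary labeling in {-1, +1}
p = @(t) sum(arrayfun(@(k) y(k) * prod((x(setdiff(1:numel(x), k)) - t) ./ ...
                                       (x(setdiff(1:numel(x), k)) - x(k))), 1:numel(x)));
labels = arrayfun(@(t) sign(p(t)), x)   % recovers [1 -1 -1 1]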
Second, we need to prove that H can’t shatter any set of size d + 2. If two of the points are identical,
we can’t realize any labelling that labels them differently. If all d + 2 points are distinct, sort them in
increasing order; an alternating labelling would force the polynomial to change sign between every pair of
consecutive points, which requires at least d + 1 real roots, contradicting the fact that a nonzero polynomial
of degree d has at most d real roots. Therefore the VC dimension of H is d + 1.
3. [12 points] MAP estimates and weight decay
Consider using a logistic regression model hθ (x) = g(θT x) where g is the sigmoid function,
and let a training set {(x(i) , y (i) ); i = 1, . . . , m} be given as usual. The maximum likelihood
estimate of the parameters θ is given by
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta).$$

If we wanted to regularize logistic regression, then we might put a Bayesian prior on the
parameters. Suppose we chose the prior θ ∼ N(0, τ²I) (here, τ > 0, and I is the (n + 1)-by-
(n + 1) identity matrix), and then found the MAP estimate of θ as:
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta}\; p(\theta) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)$$

Prove that
||θMAP ||2 ≤ ||θML ||2
[Hint: Consider using a proof by contradiction.]
Remark. For this reason, this form of regularization is sometimes also called weight
decay, since it encourages the weights (meaning parameters) to take on generally smaller
values.
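As an aside (a sketch, not part of the required proof): taking logs of the MAP objective with this Gaussian prior makes the weight-decay interpretation explicit, since log p(θ) = −(1/(2τ²))‖θ‖₂² + const for θ ∼ N(0, τ²I):

$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}, \theta) - \frac{1}{2\tau^2} \|\theta\|_2^2 \right],$$

i.e., the log-likelihood minus an ℓ2 penalty whose strength grows as τ shrinks.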
Answer: Assume that

||θMAP ||2 > ||θML ||2

Then, we have that



\begin{align*}
p(\theta_{\mathrm{MAP}}) &= \frac{1}{(2\pi)^{\frac{n+1}{2}} |\tau^2 I|^{\frac{1}{2}}} e^{-\frac{1}{2\tau^2} \|\theta_{\mathrm{MAP}}\|_2^2} \\
&< \frac{1}{(2\pi)^{\frac{n+1}{2}} |\tau^2 I|^{\frac{1}{2}}} e^{-\frac{1}{2\tau^2} \|\theta_{\mathrm{ML}}\|_2^2} \\
&= p(\theta_{\mathrm{ML}})
\end{align*}

This yields

\begin{align*}
p(\theta_{\mathrm{MAP}}) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta_{\mathrm{MAP}}) &< p(\theta_{\mathrm{ML}}) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta_{\mathrm{MAP}}) \\
&\le p(\theta_{\mathrm{ML}}) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta_{\mathrm{ML}}),
\end{align*}

where the last inequality holds since θML was chosen to maximize ∏_{i=1}^{m} p(y^(i)|x^(i); θ). However,
this result gives us a contradiction, since θMAP was chosen to maximize p(θ) ∏_{i=1}^{m} p(y^(i)|x^(i), θ).

4. [15 points] KL divergence and Maximum Likelihood


The Kullback-Leibler (KL) divergence between two discrete-valued distributions P (X), Q(X)
is defined as follows:¹

$$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

For notational convenience, we assume P (x) > 0, ∀x. (Otherwise, one standard thing to do
is to adopt the convention that “0 log 0 = 0.”) Sometimes, we also write the KL divergence
as KL(P‖Q) = KL(P(X)‖Q(X)).
The KL divergence is an asymmetric measure of the distance between two probability
distributions. In this problem we will prove some basic properties of KL divergence, and
work out a relationship between minimizing KL divergence and the maximum likelihood
estimation that we’re familiar with.
(a) Nonnegativity. Prove the following:

$$\forall P, Q \quad KL(P \| Q) \ge 0$$

and

$$KL(P \| Q) = 0 \text{ if and only if } P = Q.$$


[Hint: You may use the following result, called Jensen’s inequality. If f is a convex
function, and X is a random variable, then E[f (X)] ≥ f (E[X]). Moreover, if f is
¹ If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral,
and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we’ll just
work with this form of KL divergence for probability mass functions/discrete-valued distributions.

strictly convex (f is convex if its Hessian satisfies H ≥ 0; it is strictly convex if H > 0;


for instance f (x) = − log x is strictly convex), then E[f (X)] = f (E[X]) implies that
X = E[X] with probability 1; i.e., X is actually a constant.]
Answer:
\begin{align*}
-KL(P \| Q) &= -\sum_{x} P(x) \log \frac{P(x)}{Q(x)} \\
&= \sum_{x} P(x) \log \frac{Q(x)}{P(x)} \\
&\le \log \sum_{x} P(x) \frac{Q(x)}{P(x)} \\
&= \log \sum_{x} Q(x) \\
&= \log 1 \\
&= 0
\end{align*}

Where all equalities follow from straightforward algebraic manipulation. The inequality
follows from Jensen’s inequality.
To show the second part of the claim, note that log t is a strictly concave function of t.
Using the form of Jensen’s inequality given in the lecture notes, we have equality if and
only if Q(x)/P(x) = E[Q(x)/P(x)] for all x. But since E[Q(x)/P(x)] = ∑_x P(x) · Q(x)/P(x) = ∑_x Q(x) = 1, it
follows that P(x) = Q(x). Hence we have KL(P‖Q) = 0 if and only if P(x) = Q(x)
for all x.
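A quick numerical sanity check of nonnegativity in Octave/MATLAB (the distributions P and Q below are made up for illustration):

% KL(P||Q) >= 0, with equality exactly when the two distributions coincide.
P = [0.5 0.3 0.2];
Q = [0.4 0.4 0.2];
kl = @(P, Q) sum(P .* log(P ./ Q));
kl(P, Q)       % strictly positive since P ~= Q
kl(P, P)       % exactly 0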
(b) Chain rule for KL divergence. The KL divergence between 2 conditional distri-
butions P (X|Y ), Q(X|Y ) is defined as follows:
$$KL(P(X|Y) \| Q(X|Y)) = \sum_{y} P(y) \left( \sum_{x} P(x|y) \log \frac{P(x|y)}{Q(x|y)} \right)$$

This can be thought of as the expected KL divergence between the corresponding


conditional distributions on x (that is, between P (X|Y = y) and Q(X|Y = y)),
where the expectation is taken over the random y.
Prove the following chain rule for KL divergence:

$$KL(P(X,Y) \| Q(X,Y)) = KL(P(X) \| Q(X)) + KL(P(Y|X) \| Q(Y|X)).$$



Answer:
\begin{align*}
KL(P(X,Y) \| Q(X,Y)) &= \sum_{x,y} P(x,y) \log \frac{P(x,y)}{Q(x,y)} \\
&= \sum_{x,y} P(x,y) \log \frac{P(x)P(y|x)}{Q(x)Q(y|x)} \\
&= \sum_{x,y} \left( P(x,y) \log \frac{P(x)}{Q(x)} + P(x,y) \log \frac{P(y|x)}{Q(y|x)} \right) \\
&= \sum_{x,y} P(x,y) \log \frac{P(x)}{Q(x)} + \sum_{x,y} P(x)P(y|x) \log \frac{P(y|x)}{Q(y|x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{x} P(x) \sum_{y} P(y|x) \log \frac{P(y|x)}{Q(y|x)} \\
&= KL(P(X) \| Q(X)) + KL(P(Y|X) \| Q(Y|X)).
\end{align*}

Where we applied (in order): definition of KL, definition of conditional probability, log of
product is sum of logs, splitting the summation, ∑_y P(x, y) = P(x), definition of KL.
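For a quick numerical sanity check of the chain rule (a sketch with made-up 2×2 joint distributions, not part of the original solution):

% Verify KL(P(X,Y)||Q(X,Y)) = KL(P(X)||Q(X)) + KL(P(Y|X)||Q(Y|X)) numerically.
Pxy = [0.3 0.2; 0.1 0.4];  Qxy = [0.25 0.25; 0.2 0.3];           % rows index x, columns index y
kl  = @(p, q) sum(p(:) .* log(p(:) ./ q(:)));
Px  = sum(Pxy, 2);  Qx = sum(Qxy, 2);                            % marginals over x
Pyx = Pxy ./ repmat(Px, 1, 2);  Qyx = Qxy ./ repmat(Qx, 1, 2);    % conditionals P(y|x), Q(y|x)
lhs = kl(Pxy, Qxy)
rhs = kl(Px, Qx) + sum(Px .* sum(Pyx .* log(Pyx ./ Qyx), 2))      % matches lhs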
(c) KL and maximum likelihood.
Consider a density estimation problem, and suppose we are given a training set
{x^(i); i = 1, . . . , m}. Let the empirical distribution be P̂(x) = (1/m) ∑_{i=1}^{m} 1{x^(i) = x}.
(P̂ is just the uniform distribution over the training set; i.e., sampling from the em-
pirical distribution is the same as picking a random example from the training set.)
Suppose we have some family of distributions Pθ parameterized by θ. (If you like,
think of Pθ (x) as an alternative notation for P (x; θ).) Prove that finding the maximum
likelihood estimate for the parameter θ is equivalent to finding Pθ with minimal KL
divergence from P̂ . I.e. prove:
$$\arg\min_{\theta} KL(\hat{P} \| P_{\theta}) = \arg\max_{\theta} \sum_{i=1}^{m} \log P_{\theta}(x^{(i)})$$

Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli
Naive Bayes parameter estimation. In the Naive Bayes model we assumed Pθ is of the
following form: Pθ(x, y) = p(y) ∏_{i=1}^{n} p(xi|y). By the chain rule for KL divergence, we
therefore have:

$$KL(\hat{P} \| P_{\theta}) = KL(\hat{P}(y) \| p(y)) + \sum_{i=1}^{n} KL(\hat{P}(x_i|y) \| p(x_i|y)).$$
i=1

This shows that finding the maximum likelihood/minimum KL-divergence estimate


of the parameters decomposes into 2n + 1 independent optimization problems: One
for the class priors p(y), and one for each of the conditional distributions p(xi |y)
for each feature xi given each of the two possible labels for y. Specifically, finding
the maximum likelihood estimates for each of these problems individually results in
also maximizing the likelihood of the joint distribution. (If you know what Bayesian
networks are, a similar remark applies to parameter estimation for them.)

Answer:
\begin{align*}
\arg\min_{\theta} KL(\hat{P} \| P_{\theta}) &= \arg\min_{\theta} \sum_{x} \left( \hat{P}(x) \log \hat{P}(x) - \hat{P}(x) \log P_{\theta}(x) \right) \\
&= \arg\min_{\theta} \sum_{x} -\hat{P}(x) \log P_{\theta}(x) \\
&= \arg\max_{\theta} \sum_{x} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg\max_{\theta} \sum_{x} \frac{1}{m} \sum_{i=1}^{m} 1\{x^{(i)} = x\} \log P_{\theta}(x) \\
&= \arg\max_{\theta} \frac{1}{m} \sum_{i=1}^{m} \sum_{x} 1\{x^{(i)} = x\} \log P_{\theta}(x) \\
&= \arg\max_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta}(x^{(i)}) \\
&= \arg\max_{\theta} \sum_{i=1}^{m} \log P_{\theta}(x^{(i)})
\end{align*}

where we used in order: definition of KL, leaving out terms independent of θ, flip sign and
correspondingly flip min-max, definition of P̂ , switching order of summation, definition of
the indicator and simplification.
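As a small demonstration of this equivalence (a sketch; the Bernoulli model and the binary sample are made up, not part of the original solution), the grid minimizer of KL(P̂‖Pθ) coincides with the maximum likelihood estimate:

% For P_theta(x) = theta^x (1-theta)^(1-x), the MLE is the sample mean, and it
% also minimizes KL(P_hat || P_theta) over a grid of candidate theta values.
x = [1 0 1 1 0 1 1 0 1 1];              % made-up binary training set
phat = [mean(x == 0), mean(x == 1)];    % empirical distribution over {0, 1}
thetas = 0.01:0.01:0.99;
kldiv = arrayfun(@(th) sum(phat .* log(phat ./ [1 - th, th])), thetas);
[~, idx] = min(kldiv);
fprintf('argmin KL = %.2f, MLE = %.2f\n', thetas(idx), mean(x));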

5. [20 points] K-means for compression


In this problem, we will apply the K-means algorithm to lossy image compression, by
reducing the number of colors used in an image.
We will be using the following files:

• http://cs229.stanford.edu/ps/ps3/mandrill-small.tiff
• http://cs229.stanford.edu/ps/ps3/mandrill-large.tiff
The mandrill-large.tiff file contains a 512x512 image of a mandrill represented in 24-
bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit
numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity
values for that pixel. The straightforward representation of this image therefore takes
about 262144 × 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use
K-means to reduce the image to k = 16 colors. More specifically, each pixel in the image is
considered a point in the three-dimensional (r, g, b)-space. To compress the image, we will
cluster these points in color-space into 16 clusters, and replace each pixel with the closest
cluster centroid.
Follow the instructions below. Be warned that some of these operations can take a while
(several minutes even on a fast computer)!²

(a) Start up MATLAB, and type A = double(imread('mandrill-large.tiff')); to


read in the image. Now, A is a “three dimensional matrix,” and A(:,:,1), A(:,:,2)
and A(:,:,3) are 512x512 arrays that respectively contain the red, green, and blue
values for each pixel. Enter imshow(uint8(round(A))); to display the image.
(b) Since the large image has 262144 pixels and would take a while to cluster, we will in-
stead run vector quantization on a smaller image. Repeat (a) with mandrill-small.tiff.
Treating each pixel’s (r, g, b) values as an element of R³, run K-means³ with 16 clus-
ters on the pixel data from this smaller image, iterating (preferably) to convergence,
but in no case for less than 30 iterations. For initialization, set each cluster centroid
to the (r, g, b)-values of a randomly chosen pixel in the image.
(c) Take the matrix A from mandrill-large.tiff, and replace each pixel’s (r, g, b) values
with the value of the closest cluster centroid. Display the new image, and compare it
visually to the original image. Hand in all your code and a printout of your compressed
image (printing on a black-and-white printer is fine).
(d) If we represent the image with these reduced (16) colors, by (approximately) what
factor have we compressed the image?

Answer: Figure 1 shows the original image of the mandrill. Figure 2 shows the image
compressed into 16 colors using K-means run to convergence, and shows the 16 colors used in
the compressed image. (These solutions are given in a color PostScript file. To see the colors
without a color printer, view them with a program that can display color PostScript, such as
ghostview.) The original image used 24 bits per pixel. To represent one of 16 colors requires
log2 16 = 4 bits per pixel. We have therefore achieved a compression factor of about 24/4 = 6
of the image. MATLAB code for this problem is given below.
² In order to use the imread and imshow commands in octave, you have to install the Image package from
octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net
³ Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or octave.

A = double(imread('mandrill-small.tiff'));
imshow(uint8(round(A)));

% K-means initialization: pick k random pixels as the initial centroids
k = 16;
initmu = zeros(k,3);
for l=1:k,
  i = random('unid', size(A, 1), 1, 1);
  j = random('unid', size(A, 2), 1, 1);
  initmu(l,:) = double(permute(A(i,j,:), [3 2 1])');
end;

% Run K-means
mu = initmu;
for iter = 1:200, % usually converges long before 200 iterations
  newmu = zeros(k,3);
  nassign = zeros(k,1);
  % Assignment step: find each pixel's nearest centroid in (r,g,b)-space and
  % accumulate per-cluster sums and counts
  for i=1:size(A,1),
    for j=1:size(A,2),
      dist = zeros(k,1);
      for l=1:k,
        d = mu(l,:)'-permute(A(i,j,:), [3 2 1]);
        dist(l) = d'*d;
      end;
      [value, assignment] = min(dist);
      nassign(assignment) = nassign(assignment) + 1;
      newmu(assignment,:) = newmu(assignment,:) + ...
          permute(A(i,j,:), [3 2 1])';
  end; end;
  % Update step: move each non-empty centroid to the mean of its assigned pixels
  for l=1:k,
    if (nassign(l) > 0)
      newmu(l,:) = newmu(l,:) / nassign(l);
    end;
  end;
  mu = newmu;
end;

% Assign new colors to large image
bigimage = double(imread('mandrill-large.tiff'));
imshow(uint8(round(bigimage)));
qimage = bigimage;
for i=1:size(bigimage,1), for j=1:size(bigimage,2),
  dist = zeros(k,1);
  for l=1:k,
    d = mu(l,:)'-permute(bigimage(i,j,:), [3 2 1]);
    dist(l) = d'*d;
  end;
  [value, assignment] = min(dist);
  qimage(i,j,:) = ipermute(mu(assignment,:), [3 2 1]);
end; end;
imshow(uint8(round(qimage)));

Figure 1: The original image of the mandrill.



Figure 2: The compressed image of the mandrill.
